Documents
How bigRAG ingests, parses, chunks, and embeds documents for vector search.
Documents are files uploaded to a collection. When uploaded, they are automatically parsed, chunked, embedded, and stored for vector search.
Supported Formats
bigRAG uses Docling for document parsing:
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | With OCR for scanned documents |
| Microsoft Word | .docx | Full layout support |
| Microsoft PowerPoint | .pptx | Slide content extraction |
| Microsoft Excel | .xlsx | Table data extraction |
| HTML | .html, .htm | Web page content |
| Markdown | .md | Native support |
| Plain Text | .txt | Direct ingestion |
| CSV / TSV | .csv, .tsv | Tabular data |
| XML | .xml | Structured data |
| JSON | .json | Structured data |
| Images | .png, .jpg, .jpeg, .tiff, .bmp, .gif | OCR text extraction |
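A client can pre-check file extensions against the table above before uploading. A minimal sketch; the extension set mirrors the table, but the server-side parser remains authoritative:

```python
from pathlib import Path

# Extensions from the supported-formats table above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm", ".md", ".txt",
    ".csv", ".tsv", ".xml", ".json", ".png", ".jpg", ".jpeg", ".tiff",
    ".bmp", ".gif",
}

def is_supported(filename: str) -> bool:
    """True if the file extension appears in the supported-formats table."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("report.PDF"))  # True (case-insensitive)
print(is_supported("video.mp4"))   # False
```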
Ingestion Pipeline
When you upload a document, it goes through this pipeline:
- Upload — the file is stored on the configured storage backend (local disk or S3)
- Queue — the document is added to the Redis ingestion queue with status pending
- Parse — a background worker picks up the document and parses it with Docling
- Chunk — extracted text is split into chunks based on the collection's chunk_size and chunk_overlap
- Embed — each chunk is embedded using the collection's configured embedding model
- Store — embeddings are batch-inserted into the Milvus collection
- Ready — document status is updated to ready with the chunk count
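The steps above can be sketched as a single worker pass. This is an illustrative Python sketch with the parser, embedder, and vector store stubbed out; the stubs are not bigRAG internals:

```python
def run_pipeline(doc: dict, parse, chunk, embed, store) -> dict:
    """Walk one document through parse -> chunk -> embed -> store,
    mirroring the status transitions described on this page."""
    doc["status"] = "processing"
    try:
        text = parse(doc["path"])             # Parse (Docling in bigRAG)
        chunks = chunk(text)                  # Chunk
        store([embed(c) for c in chunks])     # Embed + Store (Milvus in bigRAG)
        doc.update(status="ready", chunk_count=len(chunks))
    except Exception as exc:
        doc.update(status="failed", error_message=str(exc))
    return doc

# Usage with stand-in callables:
result = run_pipeline(
    {"path": "paper.pdf", "status": "pending"},
    parse=lambda p: "some extracted text",
    chunk=lambda t: [t[i:i + 8] for i in range(0, len(t), 8)],
    embed=lambda c: [0.0],
    store=lambda vectors: None,
)
print(result["status"], result["chunk_count"])  # ready 3
```

Any exception in parsing, chunking, or embedding moves the document to the failed state with an error_message, matching the status table below.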
Processing Status
Documents transition through these states:
| Status | Description |
|---|---|
| pending | Queued, waiting for a worker |
| processing | Being parsed, chunked, and embedded |
| ready | Successfully processed, searchable |
| failed | Processing failed (see error_message) |
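When polling, only ready and failed are terminal states. A small client-side helper, sketched under that assumption (not part of any official bigRAG client):

```python
TERMINAL_STATES = {"ready", "failed"}

def pending_ids(docs: list[dict]) -> list[str]:
    """Ids of documents a poller should still wait on."""
    return [d["id"] for d in docs if d["status"] not in TERMINAL_STATES]

docs = [
    {"id": "a", "status": "ready"},
    {"id": "b", "status": "processing"},
    {"id": "c", "status": "failed"},
    {"id": "d", "status": "pending"},
]
print(pending_ids(docs))  # ['b', 'd']
```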
Filter documents by status:
curl "http://localhost:6100/v1/collections/research/documents?status=failed" \
  -H "Authorization: Bearer $BIGRAG_API_SECRET"

Chunking Strategy
Chunking splits document text into overlapping segments for embedding and retrieval.
| Setting | Default | Range | Description |
|---|---|---|---|
| chunk_size | 512 | 64–10,000 | Maximum characters per chunk |
| chunk_overlap | 50 | 0–5,000 | Overlap characters between adjacent chunks |
- Smaller chunks (256–512) are better for precise answers and factual retrieval
- Larger chunks (1,000–2,000) provide more context per result
- Overlap ensures important content at chunk boundaries is not lost
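These two settings describe a sliding window over the document text. A minimal character-based sketch; bigRAG's actual splitter may differ (for example, by respecting sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into chunk_size-character windows overlapping by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts `step` chars later
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text, chunk_size=512, chunk_overlap=50)
print(len(chunks))                        # 3
print(chunks[0][-50:] == chunks[1][:50])  # True: boundary content is shared
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a fact that straddles a boundary is still retrievable from at least one chunk.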
Real-Time Progress (SSE)
Monitor document processing in real time via Server-Sent Events:
const eventSource = new EventSource(
"http://localhost:6100/v1/collections/research/documents/DOC_ID/progress"
);
eventSource.onmessage = (event) => {
const data = JSON.parse(event.data);
console.log(`${data.step}: ${data.progress}% — ${data.message}`);
};

Progress events include:
| Field | Type | Description |
|---|---|---|
| step | string | Current step (parsing, chunking, embedding, complete) |
| status | string | Step status |
| message | string | Human-readable progress message |
| progress | float | Overall progress percentage (0–100) |
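Outside the browser, the same stream can be consumed by parsing SSE frames directly. A minimal sketch that assumes each event carries a single JSON "data:" line with the fields above:

```python
import json

def parse_sse_event(raw: str) -> dict:
    """Extract the JSON payload from one SSE event (data: ... terminated by a blank line)."""
    for line in raw.splitlines():
        if line.startswith("data:"):
            return json.loads(line[len("data:"):].strip())
    raise ValueError("no data line in event")

event = 'data: {"step": "embedding", "status": "in_progress", "message": "Embedding chunks", "progress": 62.5}\n\n'
parsed = parse_sse_event(event)
print(f"{parsed['step']}: {parsed['progress']}% — {parsed['message']}")
```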
Batch Operations
Upload, check status, or delete multiple documents in a single request:
# Batch upload (up to 100 files)
curl -X POST http://localhost:6100/v1/collections/docs/documents/batch/upload \
-H "Authorization: Bearer $BIGRAG_API_SECRET" \
-F "files=@paper1.pdf" \
-F "files=@paper2.pdf" \
-F 'metadata={"source": "batch-import"}'
# Batch status check
curl -X POST http://localhost:6100/v1/collections/docs/documents/batch/status \
-H "Authorization: Bearer $BIGRAG_API_SECRET" \
-H "Content-Type: application/json" \
-d '{"document_ids": ["doc-id-1", "doc-id-2"]}'
# Batch get full document metadata
curl -X POST http://localhost:6100/v1/collections/docs/documents/batch/get \
-H "Authorization: Bearer $BIGRAG_API_SECRET" \
-H "Content-Type: application/json" \
-d '{"document_ids": ["doc-id-1", "doc-id-2"]}'
# Batch delete
curl -X POST http://localhost:6100/v1/collections/docs/documents/batch/delete \
-H "Authorization: Bearer $BIGRAG_API_SECRET" \
-H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'

Batch operations support up to 100 items per request. Partial success is supported for batch delete — failed items are reported in the errors array.
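Because batch delete can partially succeed, clients should inspect the errors array rather than assume all-or-nothing. A hedged sketch; the per-error field names (document_id, error) are assumed for illustration and should be checked against the actual response:

```python
def failed_items(response: dict) -> dict[str, str]:
    """Map each failed document id to its error message from a batch response."""
    return {e["document_id"]: e["error"] for e in response.get("errors", [])}

# Example partial-success response (shape assumed for illustration):
response = {
    "deleted": ["doc-id-1"],
    "errors": [{"document_id": "doc-id-2", "error": "not found"}],
}
print(failed_items(response))  # {'doc-id-2': 'not found'}
```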
Reprocessing
Reprocess a document to re-parse, re-chunk, and re-embed it. Useful after changing collection settings or if processing previously failed:
curl -X POST http://localhost:6100/v1/collections/docs/documents/DOC_ID/reprocess \
-H "Authorization: Bearer $BIGRAG_API_SECRET"