
Documents

How bigRAG ingests, parses, chunks, and embeds documents for vector search.

Documents are files uploaded to a collection. When uploaded, they are automatically parsed, chunked, embedded, and stored for vector search.

Supported Formats

bigRAG uses Docling for document parsing:

| Format | Extensions | Notes |
| --- | --- | --- |
| PDF | .pdf | With OCR for scanned documents |
| Microsoft Word | .docx | Full layout support |
| Microsoft PowerPoint | .pptx | Slide content extraction |
| Microsoft Excel | .xlsx | Table data extraction |
| HTML | .html, .htm | Web page content |
| Markdown | .md | Native support |
| Plain Text | .txt | Direct ingestion |
| CSV / TSV | .csv, .tsv | Tabular data |
| XML | .xml | Structured data |
| JSON | .json | Structured data |
| Images | .png, .jpg, .jpeg, .tiff, .bmp, .gif | OCR text extraction |
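A client can validate a filename against this list before uploading. A minimal sketch (the extension set mirrors the table above; the helper name is illustrative, not part of the bigRAG API):

```javascript
// Extensions bigRAG accepts, per the supported-formats table.
const SUPPORTED_EXTENSIONS = new Set([
  ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm", ".md", ".txt",
  ".csv", ".tsv", ".xml", ".json",
  ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif",
]);

// Returns true if the filename's extension is in the supported set.
function isSupported(filename) {
  const dot = filename.lastIndexOf(".");
  if (dot === -1) return false;
  return SUPPORTED_EXTENSIONS.has(filename.slice(dot).toLowerCase());
}
```

Checking up front avoids a round trip for files the server would reject anyway.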

Ingestion Pipeline

When you upload a document, it goes through this pipeline:

  1. Upload — file is stored on the configured storage backend (local disk or S3)
  2. Queue — document is added to the Redis ingestion queue with status pending
  3. Parse — a background worker picks up the document and parses it with Docling
  4. Chunk — extracted text is split into chunks based on the collection's chunk_size and chunk_overlap
  5. Embed — each chunk is embedded using the collection's configured embedding model
  6. Store — embeddings are batch-inserted into the Milvus collection
  7. Ready — document status is updated to ready with the chunk count

Processing Status

Documents transition through these states:

| Status | Description |
| --- | --- |
| pending | Queued, waiting for a worker |
| processing | Being parsed, chunked, and embedded |
| ready | Successfully processed, searchable |
| failed | Processing failed (see error_message) |
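When listing documents, it can be handy to tally how many are in each state. A sketch, assuming the list endpoint returns documents with a status field as in the table above (the exact response shape is an assumption; check the API reference):

```javascript
// Illustrative helper: count documents per processing status.
// Assumes each document object carries a `status` field (pending,
// processing, ready, or failed), as described above.
function summarizeStatuses(documents) {
  const summary = { pending: 0, processing: 0, ready: 0, failed: 0 };
  for (const doc of documents) {
    if (doc.status in summary) summary[doc.status] += 1;
  }
  return summary;
}
```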

Filter documents by status:

curl "http://localhost:6100/v1/collections/research/documents?status=failed" \
  -H "Authorization: Bearer $BIGRAG_API_SECRET"

Chunking Strategy

Chunking splits document text into overlapping segments for embedding and retrieval.

| Setting | Default | Range | Description |
| --- | --- | --- | --- |
| chunk_size | 512 | 64–10,000 | Maximum characters per chunk |
| chunk_overlap | 50 | 0–5,000 | Overlap characters between adjacent chunks |

  • Smaller chunks (256–512) are better for precise answers and factual retrieval
  • Larger chunks (1,000–2,000) provide more context per result
  • Overlap ensures important content at chunk boundaries is not lost
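The exact splitter is internal to bigRAG, but the mechanics of chunk_size and chunk_overlap can be sketched with simple character-based chunking (the real implementation may also respect sentence or token boundaries):

```javascript
// Sketch of fixed-size chunking with overlap. Each chunk is at most
// `chunkSize` characters, and adjacent chunks share `chunkOverlap` characters.
function chunkText(text, chunkSize = 512, chunkOverlap = 50) {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunk_overlap must be smaller than chunk_size");
  }
  const step = chunkSize - chunkOverlap; // how far each new chunk advances
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

With the defaults (512 / 50), each chunk advances 462 characters, so the last 50 characters of one chunk reappear at the start of the next.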

Real-Time Progress (SSE)

Monitor document processing in real time via Server-Sent Events:

const eventSource = new EventSource(
  "http://localhost:6100/v1/collections/research/documents/DOC_ID/progress"
);

eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(`${data.step}: ${data.progress}% — ${data.message}`);
};

Progress events include:

| Field | Type | Description |
| --- | --- | --- |
| step | string | Current step (parsing, chunking, embedding, complete) |
| status | string | Step status |
| message | string | Human-readable progress message |
| progress | float | Overall progress percentage (0–100) |
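One practical detail: the client should close the stream once processing finishes. A small predicate (pure function for illustration; the complete step value comes from the table above, while treating a failed status as terminal is an assumption for this sketch):

```javascript
// Illustrative: decide whether a progress event ends the stream.
// `step === "complete"` is documented above; checking `status === "failed"`
// as a terminal condition is an assumption of this sketch.
function isTerminalEvent(data) {
  return data.step === "complete" || data.status === "failed";
}

// Usage with the EventSource above (sketch):
// eventSource.onmessage = (event) => {
//   const data = JSON.parse(event.data);
//   if (isTerminalEvent(data)) eventSource.close();
// };
```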

Batch Operations

Upload, check status, or delete multiple documents in a single request:

# Batch upload (up to 100 files)
curl -X POST http://localhost:6100/v1/collections/docs/documents/batch/upload \
  -H "Authorization: Bearer $BIGRAG_API_SECRET" \
  -F "files=@paper1.pdf" \
  -F "files=@paper2.pdf" \
  -F 'metadata={"source": "batch-import"}'

# Batch status check
curl -X POST http://localhost:6100/v1/collections/docs/documents/batch/status \
  -H "Authorization: Bearer $BIGRAG_API_SECRET" \
  -H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'

# Batch get full document metadata
curl -X POST http://localhost:6100/v1/collections/docs/documents/batch/get \
  -H "Authorization: Bearer $BIGRAG_API_SECRET" \
  -H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'

# Batch delete
curl -X POST http://localhost:6100/v1/collections/docs/documents/batch/delete \
  -H "Authorization: Bearer $BIGRAG_API_SECRET" \
  -H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'

Batch operations support up to 100 items per request. Partial success is supported for batch delete — failed items are reported in the errors array.
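Because each request caps at 100 items, larger jobs must be split client-side. A minimal sketch that turns a long ID list into per-request bodies (the helper name is illustrative):

```javascript
// Split a long list of document IDs into request bodies of at most
// `limit` items each, matching the batch endpoints' per-request cap.
function toBatchBodies(documentIds, limit = 100) {
  const bodies = [];
  for (let i = 0; i < documentIds.length; i += limit) {
    bodies.push({ document_ids: documentIds.slice(i, i + limit) });
  }
  return bodies;
}
```

Each body can then be POSTed to the relevant batch endpoint in turn.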

Reprocessing

Reprocess a document to re-parse, re-chunk, and re-embed it. Useful after changing collection settings or if processing previously failed:

curl -X POST http://localhost:6100/v1/collections/docs/documents/DOC_ID/reprocess \
  -H "Authorization: Bearer $BIGRAG_API_SECRET"
