Extraction Stage
The extraction stage is responsible for reading and extracting content from various document formats. It's the first stage in most pipeline configurations.
Supported Formats
- PDF: Portable Document Format files
- HTML: Hypertext Markup Language files
- XML: Extensible Markup Language files
- Markdown: Markdown text files
- JSONL: JSON Lines format (one JSON object per line)
Configuration
Basic Configuration
- name: extraction
  config:
    format: "pdf" # or "html", "xml", "markdown", "jsonl"
Stage Behavior
Input Processing
The extraction stage processes documents from the configured input directory:
pipeline:
  inputs:
    path: "input_documents"
- Recursively scans the input directory
- Supports nested folder structures
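The recursive scan described above can be pictured with a short Python sketch. This is an illustration only, not the stage's actual code: the function name `find_documents` and the suffix table are hypothetical and were chosen for this example.

```python
from pathlib import Path

# Hypothetical suffix table for this sketch; the real stage may map
# file types to the `format` setting differently.
SUFFIXES = {
    "pdf": {".pdf"},
    "html": {".html", ".htm"},
    "xml": {".xml"},
    "markdown": {".md", ".markdown"},
    "jsonl": {".jsonl"},
}


def find_documents(root, fmt):
    """Return every file under `root`, at any depth, whose suffix matches `fmt`."""
    wanted = SUFFIXES[fmt]
    return sorted(
        path
        for path in Path(root).rglob("*")  # rglob descends into nested folders
        if path.is_file() and path.suffix.lower() in wanted
    )
```

Calling `find_documents("input_documents", "pdf")` would return the PDF files in the input directory and in all of its subfolders.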
Format-Specific Features
PDF Extraction
For PDF documents, the extractor:
- Extracts text content using Nougat OCR.
- Preserves document structure (headings, paragraphs).
- Preserves tables and formulas.
- name: extraction
  config:
    format: "pdf"
Nougat Server
You need to set up the Nougat server found in the /server directory:
cd server
python3 nougat_server.py
HTML Extraction
For HTML documents, the extractor uses Trafilatura to extract the content.
- name: extraction
  config:
    format: "html"
XML Extraction
For XML documents, the extractor:
- Extracts text content from XML tags
- Preserves document structure
- Handles namespaces appropriately
- Maintains attribute information when relevant
- name: extraction
  config:
    format: "xml"
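To make the XML behavior listed above concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the extractor's implementation; `local_name` and `extract_xml` are hypothetical helpers, and the decision to drop namespace prefixes and skip tail text is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree puts on namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def extract_xml(xml_text):
    """Return (tag, attributes, text) for each element with non-blank text, in document order."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iter():  # depth-first, so document order is preserved
        text = (elem.text or "").strip()
        if text:  # tail text after child elements is ignored in this sketch
            records.append((local_name(elem.tag), dict(elem.attrib), text))
    return records


sample = """<doc xmlns="http://example.com/ns">
  <title lang="en">Extraction Stage</title>
  <section id="s1"><p>First paragraph.</p></section>
</doc>"""

for tag, attrs, text in extract_xml(sample):
    print(tag, attrs, text)
```

Each record keeps the element's attributes next to its text, which is one simple way to retain attribute information alongside the extracted content.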
JSONL Extraction
JSONL (JSON Lines) format allows you to input pre-structured documents with custom metadata. Each line in the file must be a valid JSON object.
Format Requirements:
Required Fields:
- content (string): The document text content
Optional Fields:
- metadata (object): Custom metadata that will be preserved throughout the pipeline
- embedding (array): Pre-computed embedding vector (useful when using use_existing_embeddings: true)
- pipeline_metadata (object): Internal metadata from previous pipeline runs
Example JSONL file:
{"content": "This is the first document.", "metadata": {"title": "Document 1", "author": "John Doe", "year": 2024}}
{"content": "Second document with tags.", "metadata": {"title": "Doc 2", "source": "paper.pdf", "tags": ["AI", "ML"]}}
{"content": "Document with pre-computed embedding.", "metadata": {"title": "Doc 3"}, "embedding": [0.123, 0.456, ...]}
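Before handing a JSONL file to the pipeline, it can help to check it against the field requirements above. The following is a rough validation sketch, not part of the pipeline; `read_jsonl` is a hypothetical helper, and skipping blank lines is an assumption (the stage itself may reject them).

```python
import json


def read_jsonl(lines):
    """Parse JSONL lines into dicts and check the documented field types."""
    docs = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        record = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
        if not isinstance(record, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        if not isinstance(record.get("content"), str):
            raise ValueError(f"line {number}: 'content' must be a string")
        if not isinstance(record.get("metadata", {}), dict):
            raise ValueError(f"line {number}: 'metadata' must be an object")
        if "embedding" in record and not isinstance(record["embedding"], list):
            raise ValueError(f"line {number}: 'embedding' must be an array")
        docs.append(record)
    return docs
```

Because `read_jsonl` accepts any iterable of strings, an open file object can be passed directly, for example `read_jsonl(open("data/documents.jsonl", encoding="utf-8"))`.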
Configuration:
pipeline:
  inputs:
    path: "data/documents.jsonl"
  stages:
    - name: extraction
      config:
        format: "jsonl"
Key Features:
- Flexible Metadata: Add any custom fields you need (title, author, tags, year, etc.)
- Metadata Preservation: All metadata fields are preserved throughout the entire pipeline
- Metadata Inheritance: When documents are chunked, each chunk inherits the original document's metadata
- Pre-computed Embeddings: Include embeddings to skip re-computation in later stages
- Pipeline Chaining: Output from one pipeline can be input to another via JSONL export
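Metadata inheritance can be illustrated with a deliberately naive chunker. This is a sketch under stated assumptions, not the pipeline's chunker: `chunk_document` is a hypothetical function, it splits on a fixed character count (the unit of `max_chunk_size` in the real chunker is not specified here), and it does not add chunk-specific fields.

```python
import copy


def chunk_document(doc, max_chunk_size=512):
    """Split doc["content"] into pieces of at most max_chunk_size characters.

    Every chunk receives its own deep copy of the parent document's metadata,
    so editing one chunk's metadata never affects the parent or its siblings.
    """
    content = doc["content"]
    metadata = doc.get("metadata", {})
    return [
        {
            "content": content[start:start + max_chunk_size],
            "metadata": copy.deepcopy(metadata),
        }
        for start in range(0, len(content), max_chunk_size)
    ]


doc = {
    "content": "Second document with tags.",
    "metadata": {"title": "Doc 2", "source": "paper.pdf", "tags": ["AI", "ML"]},
}
for chunk in chunk_document(doc, max_chunk_size=10):
    print(chunk["content"], chunk["metadata"]["title"])
```

Copying the metadata rather than sharing one dictionary is a design choice of this sketch; it keeps later per-chunk additions from leaking between chunks.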
Practical Example:
pipeline:
  inputs:
    path: "research_papers.jsonl"
  stages:
    - name: extraction
      config: { format: "jsonl" }
    - name: chunker
      config: { max_chunk_size: 512 }
    - name: qdrant_upload
      config:
        mode: "qdrant"
        # ... other config
After chunking, each chunk will have metadata like:
{
  "title": "Document 1",
  "author": "John Doe",
  "year": 2024,
  "headers": ["#Introduction", "##Background"]
}
This metadata is then uploaded to Qdrant, making it easy to filter and search by author, year, or other custom fields.
Next Steps
- Learn about deduplication
- Explore cleaning options
- Configure PII removal