Configuration Guide
This guide covers the main configuration options for the EVE Pipeline. You can find the detailed configurations under each Pipeline Stage.
Configuration File Structure
The pipeline is configured using a YAML file (typically config.yaml) with the following structure:
pipeline:
batch_size: integer
inputs:
path: string
# ... other input options
stages:
- name: string
config: object
# ... more stages
Global Configuration
batch_size
- Type: Integer
- Default:
10 - Description: Number of documents to process in each batch
- Note: Not applicable to deduplication stage
pipeline:
batch_size: 20
inputs
path
- Type: String
- Required: Yes
- Description: Path to input directory or file containing documents
- Supported Formats: Directories with PDF/HTML/XML/Markdown files, or JSONL files
pipeline:
inputs:
path: "input_documents" # Directory with various document formats
# OR
path: "data/documents.jsonl" # JSONL file with structured data
Using JSONL Input Files
JSONL (JSON Lines) format is a powerful way to provide pre-structured documents with metadata. Each line in the file must be a valid JSON object.
Required Fields:
- content (string): The document text content
Optional Fields:
- metadata (object): Custom metadata preserved throughout the pipeline
- embedding (array): Pre-computed embedding vector
- pipeline_metadata (object): Internal metadata from previous pipeline runs
Example JSONL file (documents.jsonl):
{"content": "First document text here.", "metadata": {"title": "Document 1", "author": "John Doe", "year": 2024}}
{"content": "Second document text.", "metadata": {"title": "Document 2", "source": "research.pdf", "tags": ["AI", "ML"]}}
{"content": "Document with embedding.", "metadata": {"title": "Doc 3"}, "embedding": [0.123, 0.456, 0.789, ...]}
Key Benefits: - Metadata Preservation: All metadata fields are preserved throughout the pipeline - Metadata Inheritance: When chunking, each chunk inherits the original document's metadata - Pre-computed Embeddings: Can include embeddings to skip re-computation - Flexible Schema: Add any custom metadata fields you need
Usage Example:
pipeline:
inputs:
path: "data/papers.jsonl"
stages:
- name: extraction
config: { format: "jsonl" }
- name: chunker
config: { max_chunk_size: 512 }
# Chunks will have metadata: {"title": "...", "author": "...", "year": ..., "headers": [...]}
Pipeline Stages
Extraction Stage
Extracts content from various document formats.
- name: extraction
config:
format: "" # or "pdf", "html", "xml", "markdown", "jsonl"
url: "http://127.0.0.1:8001" # for server-based extraction
Options
-
format: Document format specification
""(default): Automatically detect format"pdf": PDF documents"html": HTML documents"xml": XML documents"markdown": Markdown documents"jsonl": JSON Lines format (one JSON object per line)
-
url: Server URL for nougat extraction (required for PDF format)
Note on JSONL Format: When using JSONL input, each line must be a valid JSON object with a required content field. See the JSONL Input Files section above for detailed format specifications and examples.
Deduplication Stage
Removes duplicate and near-duplicate documents.
- name: duplication
config:
method: "exact" # or "lsh"
# LSH options (when method: "lsh")
shingle_size: 3
num_perm: 128
threshold: 0.8
Options
-
method: Deduplication method
"exact"(default): Exact hash-based deduplication"lsh": Locality Sensitive Hashing for near-duplicates
LSH Options
- shingle_size: Size of text shingles (default:
3) - num_perm: Number of permutations (default:
128) - threshold: Similarity threshold (default:
0.8)
Cleaning Stage
Removes noise and improves document quality.
- name: cleaning
config:
ocr_threshold: 0.9
min_words: 2
enable_latex_correction: True
Options
- ocr_threshold: OCR duplicate threshold (default:
0.99) - min_words: Minimum words for processing (default:
2) - enable_latex_correction: Use LLM to fix latex formulas and tables (default:
false)
PII Removal Stage
Redacts personally identifiable information.
- name: pii
config:
url: "http://127.0.0.1:8000"
Options
- url: Presidio server URL
Export Stage
Saves processed documents to output.
- name: export
config:
format: "md" # or "txt", "json"
destination: "output"
Options
-
format: Output format
"md"(default): Markdown"txt": Plain text"json": JSON with metadata
-
destination: Output directory path