# Basic Usage Examples
This section provides practical examples of using the EVE Pipeline for common document processing tasks.
## Simple Document Processing
Process all documents from an input directory and export to markdown:
```yaml
# config.yaml
pipeline:
  batch_size: 10
  inputs:
    path: "input_documents"
  stages:
    - name: extraction
    - name: export
      config:
        format: "md"
        destination: "output"
```
```bash
# Run the pipeline
eve run
```
## PDF Processing Pipeline
Process PDF documents with cleaning and deduplication:
```yaml
pipeline:
  batch_size: 5
  inputs:
    path: "research_papers"
  stages:
    - name: extraction
      config:
        format: "pdf"
    - name: duplication
    - name: cleaning
    - name: export
      config:
        format: "md"
        destination: "processed_papers"
```
## Web Content Processing
Process HTML documents with PII removal:
```yaml
pipeline:
  batch_size: 20
  inputs:
    path: "web_pages"
  stages:
    - name: extraction
      config:
        format: "html"
    - name: pii
      config:
        url: "http://127.0.0.1:8000"
    - name: export
      config:
        format: "txt"
        destination: "clean_content"
```
## Advanced Pipeline with All Features
Complete pipeline for scientific document processing:
```yaml
pipeline:
  batch_size: 10
  inputs:
    path: "scientific_documents"
  stages:
    - name: extraction
      config:
        url: "http://127.0.0.1:8001"
    - name: duplication
      config:
        method: "lsh"
        shingle_size: 3
        num_perm: 128
        threshold: 0.85
    - name: cleaning
      config:
        ocr_threshold: 0.99
        enable_latex_correction: true
        debug: true
    - name: pii
      config:
        url: "http://127.0.0.1:8000"
    - name: metadata
    - name: export
      config:
        export_metadata: true
        metadata_destination: "./output"
```
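The `duplication` stage's LSH settings correspond to the usual MinHash-LSH knobs: documents are shingled into 3-word pieces, hashed with 128 permutations, and pairs whose estimated Jaccard similarity exceeds 0.85 are treated as duplicates. As a rough standalone illustration of those parameters (a sketch using the `datasketch` package, not the pipeline's internal implementation):

```python
from datasketch import MinHash, MinHashLSH

def signature(text: str, shingle_size: int = 3, num_perm: int = 128) -> MinHash:
    """MinHash a document over word shingles of the given size."""
    words = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(len(words) - shingle_size + 1):
        m.update(" ".join(words[i:i + shingle_size]).encode("utf-8"))
    return m

doc = ("machine learning pipelines often ingest the same document twice "
       "under different file names which wastes compute and skews statistics")
near_dup = doc + " downstream"

# threshold and num_perm mirror the duplication stage config above.
lsh = MinHashLSH(threshold=0.85, num_perm=128)
lsh.insert("doc1", signature(doc))

# The near-duplicate shares almost all shingles with doc1, so the
# LSH query should report it as a match: ['doc1']
print(lsh.query(signature(near_dup)))
```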
## Mixed Format Processing
Process different document types in the same pipeline:
```yaml
pipeline:
  batch_size: 10
  inputs:
    path: "mixed_documents"
  stages:
    - name: extraction
      # Auto-detect format based on file extension
    - name: duplication
      config:
        method: "lsh"
        threshold: 0.8
    - name: cleaning
    - name: export
      config:
        format: "md"
        destination: "unified_output"
```
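When no `format` is given, extraction falls back to the file extension. The actual mapping is internal to the pipeline, but conceptually it amounts to something like the following sketch (the table here is illustrative only and may not match the extractor's real logic):

```python
from pathlib import Path

# Illustrative extension-to-format table covering the formats used in
# these examples; the extractor's real mapping may differ.
EXTENSION_FORMATS = {
    ".pdf": "pdf",
    ".html": "html",
    ".md": "md",
    ".txt": "txt",
    ".jsonl": "jsonl",
}

def detect_format(path: str) -> str | None:
    """Guess a document format from its file extension."""
    return EXTENSION_FORMATS.get(Path(path).suffix.lower())

print(detect_format("mixed_documents/paper.PDF"))  # -> "pdf"
```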
## Process and Upload to Qdrant
This example demonstrates a complete pipeline that extracts, chunks, filters, embeds, and uploads documents to Qdrant in one workflow.
**Prerequisites:**

- VLLM server running for embeddings: `python server/vllm.py`
- Qdrant instance running: `docker run -p 6333:6333 qdrant/qdrant`
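Before running the pipeline it is worth confirming both services respond. Here is a minimal reachability check; the Qdrant `/collections` endpoint is part of its REST API, while the `/health` route is an assumption based on the stock vLLM OpenAI-compatible server, so adjust it if `server/vllm.py` exposes something else:

```python
import urllib.request

# Service URLs taken from the example config below; adjust as needed.
CHECKS = {
    "Qdrant": "http://localhost:6333/collections",
    "VLLM embeddings": "http://0.0.0.0:8000/health",  # assumed route
}

for name, url in CHECKS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")
```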
### Understanding JSONL Input Format
The pipeline accepts JSONL (JSON Lines) files where each line is a JSON document. The extractor recognizes the following fields:
**Required Fields:**

- `content` (string): the document text content

**Optional Fields:**

- `metadata` (object): custom metadata that will be preserved throughout the pipeline
- `embedding` (array): a pre-computed embedding vector (used with `use_existing_embeddings: true`)
- `pipeline_metadata` (object): internal metadata from previous pipeline runs
Example JSONL file:
{"content": "This is the first document.", "metadata": {"title": "Doc 1", "author": "John Doe", "year": 2024}}
{"content": "This is the second document.", "metadata": {"title": "Doc 2", "source": "research_paper.pdf"}}
{"content": "Third document with embedding.", "metadata": {"title": "Doc 3"}, "embedding": [0.123, 0.456, ...]}
**Important Notes:**

- Each line must be valid JSON
- The `content` field is required; documents without it will be skipped
- Metadata fields are completely flexible; you can include any custom fields
- When chunks are created, they inherit all metadata from the original document
- Chunking adds a `headers` field to metadata containing the markdown header hierarchy
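To produce an input file in this shape programmatically, something like the following is enough (a minimal sketch; the output path matches the example config below, and the documents are placeholders):

```python
import json

# Placeholder documents in the shape the extractor expects:
# "content" is required, "metadata" is free-form.
docs = [
    {"content": "This is the first document.",
     "metadata": {"title": "Doc 1", "author": "John Doe", "year": 2024}},
    {"content": "This is the second document.",
     "metadata": {"title": "Doc 2", "source": "research_paper.pdf"}},
]

# One JSON object per line; no pretty-printing, since a record
# must not span multiple lines in JSONL.
with open("data/doc_w_metadata.jsonl", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")
```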
```yaml
# examples/process_and_upload.yaml
pipeline:
  batch_size: 10
  inputs:
    path: "data/doc_w_metadata.jsonl"
  stages:
    # Step 1: Extract content from documents
    - name: extraction
      config:
        format: "jsonl"
    # Step 2: Chunk documents into semantic pieces
    - name: chunker
      config:
        chunk_overlap: 0
        max_chunk_size: 512
        word_overlap: 0
        add_headers: false
        merge_small_chunks: true
        headers_to_split_on: [1, 2, 3, 4, 5, 6]
    # Step 3: Remove short chunks (< 40 words)
    - name: length_filter
      config:
        length: 40
        comparison: "greater"
        action: "keep"
    # Step 4: Remove long chunks (>= 1024 words)
    - name: length_filter
      config:
        length: 1024
        comparison: "less"
        action: "keep"
    # Step 5: Remove references and acknowledgements
    - name: reference_filter
      config:
        action: "discard"
    # Step 6: PII filter with threshold
    - name: pii_filter
      config:
        threshold: 0.03
        action: "discard"
        apply_filter: true
    # Step 7: Remove chunks with excessive newlines
    - name: newline_filter
      config:
        chunks: 60
        comparison: "less"
        action: "keep"
    # Step 8: Generate embeddings and upload to Qdrant
    - name: qdrant_upload
      config:
        mode: "qdrant"
        use_existing_embeddings: false
        upload_pipeline_metadata: true
        embedder:
          model_name: "Qwen/Qwen3-Embedding-4B"
          url: "http://0.0.0.0:8000"
          timeout: 300
          api_key: "EMPTY"
        vector_store:
          batch_size: 1000
          collection_name: "your-collection-name"
          vector_size: 2560
          url: "http://localhost:6333"
          api_key: "your-api-key"
```
To run:
```bash
cp examples/process_and_upload.yaml config.yaml
# Edit config.yaml to set your Qdrant collection name, URL, and API key
eve run
```
**What this pipeline does:**

1. Extracts content from the JSONL documents, preserving all metadata from the input
2. Splits documents into chunks of up to 512 words
   - Each chunk inherits all metadata from the original document
   - Chunking adds a `headers` field to metadata with the markdown header hierarchy
3. Filters chunks by length (40-1024 words)
4. Removes references and acknowledgements sections
5. Filters out chunks with PII above the 3% threshold
6. Removes chunks with excessive newlines
7. Generates embeddings using the VLLM server
8. Uploads the filtered chunks with embeddings to Qdrant
   - Includes original metadata from the JSONL input
   - Includes chunk headers
   - Includes filter statistics (if `upload_pipeline_metadata: true`)
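Once the run finishes, you can spot-check the collection with the `qdrant-client` package (a minimal sketch: the URL and collection name must match your `vector_store` config, and it assumes the metadata fields are stored at the top level of each point's payload):

```python
from qdrant_client import QdrantClient

# Connect to the Qdrant instance the pipeline uploaded to.
client = QdrantClient(url="http://localhost:6333")

# Scroll a few points to inspect stored payloads (no query vector needed).
points, _next_offset = client.scroll(
    collection_name="your-collection-name",
    limit=3,
    with_payload=True,
    with_vectors=False,
)

for point in points:
    payload = point.payload or {}
    # Assumes the original JSONL metadata and chunk headers are kept
    # as top-level payload fields.
    print(payload.get("title"), payload.get("headers"))
```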
**Metadata Flow Example:**

Input JSONL:

```jsonl
{"content": "# Introduction\n\nThis is my paper...", "metadata": {"title": "My Paper", "author": "Jane Doe"}}
```

After Chunking (first chunk):

```jsonl
{"content": "This is my paper...", "metadata": {"title": "My Paper", "author": "Jane Doe", "headers": ["#Introduction"]}}
```

After Upload to Qdrant, all chunks retain `title="My Paper"`, `author="Jane Doe"`, and `headers=["#Introduction"]`, plus any filter metadata.
## Selective Stage Processing
Skip certain stages based on your needs:
### Extraction Only
```yaml
pipeline:
  batch_size: 20
  inputs:
    path: "raw_documents"
  stages:
    - name: extraction
    - name: export
      config:
        format: "md"
        destination: "extracted_content"
```
### Deduplication Only
```yaml
pipeline:
  inputs:
    path: "markdown_documents"
  stages:
    - name: duplication
      config:
        method: "lsh"
        threshold: 0.9
    - name: export
      config:
        format: "md"
        destination: "unique_documents"
```