Quick Start

This tutorial will walk you through running your first data processing pipeline with EVE.

Step 1: Prepare Your Data

Create an input directory with your documents:

mkdir -p input_data
# Copy your PDF, HTML, XML, or Markdown files here
cp /path/to/your/documents/* input_data/

Step 2: Basic Configuration

Create a config.yaml file:

pipeline:
  batch_size: 10
  inputs:
    path: "input_data"
  stages:
    - name: extraction
      # Automatically detects file format
    - name: duplication
    - name: export
      config:
        format: "md"
        destination: "output"

Step 3: Run the Pipeline

Execute the pipeline:

eve run

Step 4: Check Results

Your processed documents will be in the output directory:

ls output/

Example Pipeline Configurations

PDF Processing Only

pipeline:
  batch_size: 5
  inputs:
    path: "pdfs"
  stages:
    - name: extraction
      config: { format: "pdf" }
    - name: cleaning
    - name: export
      config: { format: "md", destination: "processed_pdfs" }

HTML Processing with PII Removal

pipeline:
  batch_size: 10
  inputs:
    path: "html_docs"
  stages:
    - name: extraction
      config: { format: "html", url: "http://127.0.0.1:8001" }
    - name: pii
      config: { url: "http://127.0.0.1:8000" }
    - name: export
      config: { format: "md"}

Advanced Pipeline with All Stages

pipeline:
  batch_size: 10
  inputs:
    path: "mixed_docs"
  stages:
    - name: extraction
      config: { url: "http://127.0.0.1:8001" }
    - name: duplication
      config: {
        method: "lsh",
        shingle_size: 3,
        num_perm: 128,
        threshold: 0.8
      }
    - name: cleaning
    - name: pii
      config: { url: "http://127.0.0.1:8000" }
    - name: metadata

Monitoring Progress

The pipeline provides progress updates:

$ eve run

[2024-01-15 10:30:00] INFO: Starting pipeline with 100 documents
[2024-01-15 10:30:01] INFO: Stage 1/5: Extraction
[2024-01-15 10:30:15] INFO: Processing batch 1/10 (10 documents)
[2024-01-15 10:30:30] INFO: Processing batch 2/10 (20 documents)
...
[2024-01-15 10:35:00] INFO: Pipeline completed successfully
[2024-01-15 10:35:00] INFO: Processed 95 documents, 5 duplicates removed

Next Steps

Learn about configuration options