Data Processing Pipeline

A high-performance, modular library for extracting, deduplicating, cleaning, anonymizing, and exporting large-scale Earth science and Earth observation datasets.

Features

Extraction

Supports PDF, HTML, XML, and Markdown files, including nested folder structures
Automatically detects file formats unless explicitly specified
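
As a rough illustration of format detection with an explicit override, the sketch below guesses a format from the file extension while walking a nested folder. The names `EXTENSION_FORMATS`, `detect_format`, and `discover_inputs` are hypothetical and are not part of this library's API; the library's own detection may well inspect file contents rather than extensions.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple

# Hypothetical extension table covering the formats listed above.
EXTENSION_FORMATS = {
    ".pdf": "pdf",
    ".html": "html",
    ".htm": "html",
    ".xml": "xml",
    ".md": "markdown",
    ".markdown": "markdown",
}


def detect_format(path: Path, explicit: Optional[str] = None) -> Optional[str]:
    """Return the explicit format if one is given, else guess from the extension."""
    if explicit:
        return explicit.lower()
    return EXTENSION_FORMATS.get(path.suffix.lower())


def discover_inputs(root: Path, explicit: Optional[str] = None) -> Iterator[Tuple[Path, str]]:
    """Walk a nested folder and yield (path, format) for every recognised file."""
    for path in sorted(root.rglob("*")):
        if path.is_file():
            fmt = detect_format(path, explicit)
            if fmt is not None:
                yield path, fmt
```

Files with an unrecognised extension are skipped in this sketch unless a format is passed explicitly.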

Deduplication

Performs exact-duplicate matching using SHA-256 checksums
Supports LSH-based near-duplicate detection with configurable shingle size, number of permutations, and similarity threshold
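
The two deduplication modes can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the library's implementation: all function names are hypothetical, shingles are assumed to be word-level, and the parameter names `shingle_size`, `num_perm`, and `threshold` simply mirror the configuration keys shown in the Quick Start. For brevity the sketch compares MinHash signatures pairwise; a real LSH index would instead bucket bands of each signature so that only likely matches are compared.

```python
import hashlib
import re
from typing import Iterable, List, Set


def sha256_digest(text: str) -> str:
    """Checksum used for exact-duplicate detection."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = 3) -> Set[str]:
    """Overlapping word n-grams (an assumption; character shingles also work)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(items: Set[str], num_perm: int = 128) -> List[int]:
    """One minimum per salted hash; each salt stands in for a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            (int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
             for s in items),
            default=0,  # empty documents get an all-zero signature
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: Iterable[str], shingle_size: int = 3,
                num_perm: int = 128, threshold: float = 0.8) -> List[str]:
    """Drop exact duplicates first, then near-duplicates above the threshold."""
    seen_digests = set()
    kept = []  # (document, signature) pairs retained so far
    for doc in docs:
        digest = sha256_digest(doc)
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        sig = minhash_signature(shingles(doc, shingle_size), num_perm)
        if any(estimated_jaccard(sig, other) >= threshold for _, other in kept):
            continue
        kept.append((doc, sig))
    return [doc for doc, _ in kept]
```

Raising `num_perm` tightens the similarity estimate at the cost of more hashing, while a lower `threshold` removes more aggressively.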

Cleaning

Removes irregularities and noise artifacts
Corrects LaTeX equations and tables using LLM assistance

PII Removal

Automatically masks names and email addresses using the Presidio framework
Configurable detection patterns
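
To give a feel for what configurable detection patterns can look like, here is a deliberately simplified, regex-only stand-in. It does not use Presidio and does not attempt name detection, which normally requires a named-entity recogniser. `DEFAULT_PATTERNS`, `mask_pii`, and the ORCID pattern are hypothetical examples, not part of this library.

```python
import re
from typing import Dict, Optional, Pattern

# Hypothetical default pattern set: label -> compiled regex.
DEFAULT_PATTERNS: Dict[str, Pattern[str]] = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def mask_pii(text: str, patterns: Optional[Dict[str, Pattern[str]]] = None) -> str:
    """Replace every match of each pattern with a <LABEL> placeholder."""
    for label, pattern in (patterns or DEFAULT_PATTERNS).items():
        text = pattern.sub(f"<{label}>", text)
    return text


# Extending the defaults with an extra, user-supplied pattern.
custom = dict(DEFAULT_PATTERNS)
custom["ORCID"] = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")
masked = mask_pii("Contact jane.doe@example.org (0000-0002-1825-0097).", custom)
```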

Metadata Extraction

Extracts Title, Authors, DOI, URL, Year, Journal, and Citation Count
PDF-based extraction using MonkeyOCR integration
Support for HTML and other formats
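
For text that has already been extracted, one of the listed fields, the DOI, can often be recovered with a regular expression. The sketch below is a generic heuristic and is not how this library or MonkeyOCR performs extraction; `DOI_PATTERN` and `find_doi` are hypothetical names.

```python
import re
from typing import Optional

# Common heuristic for modern DOIs: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in the text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    # This is a heuristic and can truncate unusual DOIs ending in ")" or ".".
    return match.group(0).rstrip(".,;)")
```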

Export

Saves processed content in multiple formats (default: Markdown)

Quick Start

Install the packages

```bash
uv sync
```

Configure the pipeline (config.yaml)

```yaml
pipeline:
  batch_size: 10
  inputs:
    path: "input_dir"
  stages:
    - name: extraction
      config: { format: "xml" }
    - name: duplication
      config: { method: "lsh", shingle_size: 3, num_perm: 128, threshold: 0.8 }
    - name: pii
      config: { url: "http://127.0.0.1:8000" }
    - name: export
      config: { format: "md", destination: "output/files" }
```

Run the pipeline

```bash
eve run
```
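
The configuration describes an ordered list of stages together with a `batch_size`. As a conceptual sketch only, the code below shows one way such a list could be applied to batches of documents. `batched`, `run_pipeline`, and the toy handlers are hypothetical and do not reflect how `eve run` is implemented; in particular, a real deduplication stage would need to see across batches.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# A handler takes a batch of documents plus its stage config and returns a new batch.
Handler = Callable[[List[str], dict], List[str]]


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group items into lists of at most `size` elements."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_pipeline(documents: Iterable[str], stages: List[dict],
                 handlers: Dict[str, Handler], batch_size: int = 10) -> List[str]:
    """Pass each batch through every configured stage, in order."""
    results: List[str] = []
    for batch in batched(documents, batch_size):
        for stage in stages:
            handler = handlers[stage["name"]]
            batch = handler(batch, stage.get("config", {}))
        results.extend(batch)
    return results


# Toy handlers and stages, for illustration only.
toy_handlers: Dict[str, Handler] = {
    "lowercase": lambda batch, cfg: [doc.lower() for doc in batch],
    "truncate": lambda batch, cfg: [doc[:cfg["max_chars"]] for doc in batch],
}
toy_stages = [
    {"name": "lowercase"},
    {"name": "truncate", "config": {"max_chars": 3}},
]
output = run_pipeline(["ABCDE", "Fg", "HIJK"], toy_stages, toy_handlers, batch_size=2)
```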

Funding

This project is supported by the European Space Agency (ESA) Φ-lab through the Large Language Model for Earth Observation and Earth Science project, as part of the Foresight Element within the FutureEO Block 4 programme.

Citation

If you use this project in academic or research settings, please cite:

License

This project is released under the Apache 2.0 License - see the LICENSE file for more details.