Data Scraping Pipeline
Welcome to the Data Scraping pipeline documentation. This pipeline collects Earth Observation and Earth Science data by scraping academic publishers, journals, and other data sources.
Features
- 32+ Specialized Scrapers: Pre-configured scrapers for major publishers and data sources including IEEE, Springer, Elsevier, NASA, ESA, and more
- Flexible Architecture: Extensible base classes for creating new scrapers
- Cloud Storage Integration: S3-compatible storage (AWS S3, MinIO)
- Database Tracking: MySQL database for tracking scraping progress and analytics
- Docker Support: Containerized deployment for easy setup
- Proxy Support: Built-in proxy support for restricted content
- Resume Capability: Resume failed scraping operations
- Analytics: Comprehensive statistics on scraping operations
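The database tracking and resume features above can be sketched as a simple progress table: each job records its status, and a resumed run re-selects only the jobs that never completed. This example uses `sqlite3` purely as a stand-in for the pipeline's MySQL database, and the `scrape_jobs` table and its columns are illustrative, not the project's actual schema.

```python
import sqlite3

# In-memory stand-in for the MySQL tracking database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scrape_jobs (url TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')"
)

urls = [
    "https://example.org/paper/1",
    "https://example.org/paper/2",
    "https://example.org/paper/3",
]
conn.executemany("INSERT INTO scrape_jobs (url) VALUES (?)", [(u,) for u in urls])

# First run: the first URL succeeds, the second fails mid-scrape.
conn.execute("UPDATE scrape_jobs SET status = 'done' WHERE url = ?", (urls[0],))
conn.execute("UPDATE scrape_jobs SET status = 'failed' WHERE url = ?", (urls[1],))
conn.commit()

# Resuming re-selects the failed and never-started jobs only.
pending = [
    row[0]
    for row in conn.execute(
        "SELECT url FROM scrape_jobs WHERE status != 'done' ORDER BY url"
    )
]
print(pending)  # the failed URL plus the one never attempted
```

Because completed jobs are skipped on restart, a crashed or interrupted run can be re-launched without re-downloading anything already stored.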
Quick Start
```bash
# Clone the repository
git clone <repository-url>
cd data-scraping

# Start the Docker containers
make up

# Run the pipeline
make run
```
For detailed installation instructions, see the Getting Started page.
Documentation Structure
- Getting Started: Installation, prerequisites, and setup guide
- Scrapers: Complete documentation of all available scrapers
- Model: Data models and configuration schemas
- Examples: Usage examples and common workflows
Architecture Overview
The scraping system is built on a hierarchical architecture:
- Base Scrapers: Abstract classes providing core functionality (Selenium, storage, database)
- Specialized Scrapers: Publisher-specific implementations
- Configuration: JSON-based configuration for each scraper
- Storage: S3-compatible storage for collected data
- Database: MySQL for tracking progress and analytics
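The hierarchy above can be sketched as an abstract base class that specialized scrapers extend, each driven by its own JSON configuration. The class names, config fields, and `scrape` signature here are hypothetical; the real base scrapers also wire up Selenium, S3 storage, and the MySQL tracking database.

```python
import json
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Core functionality shared by all scrapers (illustrative)."""

    def __init__(self, config: dict):
        self.name = config["name"]
        self.base_url = config["base_url"]
        # The real base class would also initialize Selenium,
        # the S3 client, and the database connection here.

    @abstractmethod
    def scrape(self, query: str) -> list[str]:
        """Return the URLs (or records) collected for a query."""


class SpringerScraper(BaseScraper):
    """Publisher-specific implementation (hypothetical)."""

    def scrape(self, query: str) -> list[str]:
        # A real implementation would drive Selenium against
        # self.base_url; here we only build the search URL.
        return [f"{self.base_url}/search?q={query}"]


# Each scraper is configured from a JSON document.
config = json.loads('{"name": "springer", "base_url": "https://link.springer.com"}')
scraper = SpringerScraper(config)
print(scraper.scrape("earth observation"))
```

Adding a new publisher then amounts to subclassing the base scraper and supplying a matching JSON configuration, which is what keeps the architecture extensible.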
Funding
This project is supported by the European Space Agency (ESA) Φ-lab through the Large Language Model for Earth Observation and Earth Science project, part of the Foresight Element within the FutureEO Block 4 programme.
License
This project is released under the Apache 2.0 License.
Contributing
We welcome contributions! Please open an issue or submit a pull request on GitHub to help improve the pipeline.