Data Scraping Pipeline
Welcome to the Data Scraping pipeline documentation. This pipeline collects Earth Observation and Earth Science data by scraping academic publishers, journals, and other data sources.
Features
- 32+ Specialized Scrapers: Pre-configured scrapers for major publishers and data sources including IEEE, Springer, Elsevier, NASA, ESA, and more
- Flexible Architecture: Extensible base classes for creating new scrapers
- Cloud Storage Integration: S3-compatible storage (AWS S3, MinIO)
- Database Tracking: MySQL database for tracking scraping progress and analytics
- Docker Support: Containerized deployment for easy setup
- Proxy Support: Built-in proxy support for restricted content
- Resume Capability: Resume failed scraping operations
- Analytics: Comprehensive statistics on scraping operations
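The database tracking and resume features above can be sketched as a simple progress table: each job records its status, and a resumed run re-selects only the jobs that never completed. This example uses `sqlite3` purely as a stand-in for the pipeline's MySQL database, and the `scrape_jobs` table and its columns are illustrative, not the project's actual schema.

```python
import sqlite3

# In-memory stand-in for the MySQL tracking database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scrape_jobs (url TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')"
)

urls = [
    "https://example.org/paper/1",
    "https://example.org/paper/2",
    "https://example.org/paper/3",
]
conn.executemany("INSERT INTO scrape_jobs (url) VALUES (?)", [(u,) for u in urls])

# First run: the first URL succeeds, the second fails mid-scrape.
conn.execute("UPDATE scrape_jobs SET status = 'done' WHERE url = ?", (urls[0],))
conn.execute("UPDATE scrape_jobs SET status = 'failed' WHERE url = ?", (urls[1],))
conn.commit()

# Resuming re-selects the failed and never-started jobs only.
pending = [
    row[0]
    for row in conn.execute(
        "SELECT url FROM scrape_jobs WHERE status != 'done' ORDER BY url"
    )
]
print(pending)  # the failed URL plus the one never attempted
```

Because completed jobs are skipped on restart, a crashed or interrupted run can be re-launched without re-downloading anything already stored.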
Quick Start
```bash
# Clone the repository
git clone <repository-url>
cd data-scraping

# Start the Docker containers
make up

# Run the pipeline
make run
```
For detailed installation instructions, see the Getting Started page.
Documentation Structure
- Getting Started: Installation, prerequisites, and setup guide
- Scrapers: Complete documentation of all available scrapers
- Model: Data models and configuration schemas
- Examples: Usage examples and common workflows
Architecture Overview
The scraping system is built on a hierarchical architecture:
- Base Scrapers: Abstract classes providing core functionality (Selenium, storage, database)
- Specialized Scrapers: Publisher-specific implementations
- Configuration: JSON-based configuration for each scraper
- Storage: S3-compatible storage for collected data
- Database: MySQL for tracking progress and analytics
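The hierarchy above can be sketched as an abstract base class that specialized scrapers extend, each driven by its own JSON configuration. The class names, config fields, and `scrape` signature here are hypothetical; the real base scrapers also wire up Selenium, S3 storage, and the MySQL tracking database.

```python
import json
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Core functionality shared by all scrapers (illustrative)."""

    def __init__(self, config: dict):
        self.name = config["name"]
        self.base_url = config["base_url"]
        # The real base class would also initialize Selenium,
        # the S3 client, and the database connection here.

    @abstractmethod
    def scrape(self, query: str) -> list[str]:
        """Return the URLs (or records) collected for a query."""


class SpringerScraper(BaseScraper):
    """Publisher-specific implementation (hypothetical)."""

    def scrape(self, query: str) -> list[str]:
        # A real implementation would drive Selenium against
        # self.base_url; here we only build the search URL.
        return [f"{self.base_url}/search?q={query}"]


# Each scraper is configured from a JSON document.
config = json.loads('{"name": "springer", "base_url": "https://link.springer.com"}')
scraper = SpringerScraper(config)
print(scraper.scrape("earth observation"))
```

Adding a new publisher then amounts to subclassing the base scraper and supplying a matching JSON configuration, which is what keeps the architecture extensible.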
Funding
This project is supported by the European Space Agency (ESA) Φ-lab through the Large Language Model for Earth Observation and Earth Science project, part of the Foresight Element within the FutureEO Block 4 programme.
License
This project is released under the Apache 2.0 License.
Contributing
We welcome contributions! Please open an issue or submit a pull request on GitHub to help improve the pipeline.