Getting Started
This guide will help you set up and run the Data Scraping pipeline on your local machine or in production.
Prerequisites
Before you begin, ensure you have the following installed:
- Python 3.8+
- Docker and Docker Compose
- Make
Local Development Setup
1. Create Virtual Environment
Create a virtual environment in the venv folder:
python3 -m venv venv
Activate the virtual environment:
source venv/bin/activate
2. Environment Configuration
Create a .env file in the root of the project with the following content:
# AWS/MinIO Configuration
AWS_URL=http://minio:9100
AWS_REGION=us-east-1
AWS_ACCESS_KEY=minio
AWS_SECRET_KEY=minio1234
AWS_BUCKET_NAME=esa-eve
AWS_MAIN_FOLDER=raw_data
MINIO_URL=http://minio:9100
# Browser Configuration
HEADLESS_BROWSER=true
XVFB_MODE=false
# Database Configuration
DB_HOST=mysql
DB_PORT=3306
DB_NAME=esa_eve
DB_USER=root
DB_PASSWORD=root
# Proxy Configuration (optional)
INTERACTING_PROXY_HOST=brd.superproxy.io
INTERACTING_PROXY_PORT=33335
INTERACTING_PROXY_USER=<username>
INTERACTING_PROXY_PASSWORD=<password>
Configuration Notes
MinIO for Local Development:
- The MinIO server emulates a remote S3 bucket for local testing
- The
AWS_URLkey must be set to the URL of the MinIO server - For production, remove the
AWS_URLfrom the configuration to use real AWS S3
Browser Settings:
HEADLESS_BROWSER=true: Run browser without GUI (recommended for servers)XVFB_MODE=true: Use virtual frame buffer (required for headless Linux servers)
Proxy Settings (optional):
- Required only for accessing restricted content
- Contact the project maintainer for proxy credentials
3. Installation
Start the Docker containers:
make up
Install required Python packages:
make sync-requirements
4. Verify Installation
Check that all containers are running:
docker ps
You should see containers for:
- MinIO (S3-compatible storage)
- MySQL (database)
- The scraping application
Running the Pipeline
Basic Usage
Run all configured scrapers:
make run
The pipeline will:
- Connect to MinIO and MySQL
- Execute all scrapers defined in
config/config.json - Store scraped data in MinIO
- Track progress in MySQL database
Running Specific Scrapers
Execute one or more specific scrapers:
# Run a single scraper
make run args="--scrapers IOPScraper"
# Run multiple scrapers
make run args="--scrapers IOPScraper SpringerScraper"
Force Re-execution
Force complete re-execution of a scraper (even if already completed):
make run args="--scrapers IOPScraper --force"
Resume Failed Operations
Resume only the failed URLs from the last execution:
make run args="--scrapers IOPScraper --resume"
Resume only the failed uploads from the last execution:
make run args="--scrapers IOPScraper --resume-upload"
Note: The --resume and --resume-upload parameters cannot be used together.
View Analytics
Retrieve statistics from the last execution:
make run args="--analytics-only"
# For specific scrapers
make run args="--analytics-only --scrapers IOPScraper"
Production Deployment
1. Production Environment Setup
Create a production .env file with real credentials:
# AWS Configuration (remove AWS_URL for production S3)
AWS_REGION=<region>
AWS_ACCESS_KEY=<access-key>
AWS_SECRET_KEY=<secret-key>
AWS_BUCKET_NAME=<bucket-name>
AWS_MAIN_FOLDER=<folder-path>
# Database Configuration
DB_HOST=<database-host>
DB_PORT=3306
DB_NAME=<database-name>
DB_USER=<database-user>
DB_PASSWORD=<database-password>
# Browser Configuration
HEADLESS_BROWSER=true
XVFB_MODE=true
# Proxy Configuration (if needed)
INTERACTING_PROXY_HOST=<proxy-host>
INTERACTING_PROXY_PORT=<proxy-port>
INTERACTING_PROXY_USER=<proxy-username>
INTERACTING_PROXY_PASSWORD=<proxy-password>
Contact the project maintainer to obtain production credentials.
2. Deploy to Production POD
Run the deployment script on your Linux-based POD:
sh pod.sh
This script will:
- Install required packages
- Set up the environment
- Configure the infrastructure
3. Run in Production
Execute the pipeline in production:
make runpod
# With arguments
make runpod args="--scrapers IOPScraper"
The make runpod command supports all the same parameters as make run.
Configuration
Scraper Configuration
Scrapers are configured in config/config.json. Each scraper has its own configuration entry:
{
"ScraperName": {
"bucket_key": "{main_folder}/subfolder",
"base_url": "https://example.com",
"cookie_selector": "button#accept-cookies",
"files_by_request": true,
"sources": [
{
"url": "https://example.com/articles",
"type": "journal"
}
]
}
}
Key configuration parameters:
bucket_key: S3 path where data will be storedbase_url: Base URL of the website (optional)cookie_selector: CSS selector for cookie banner (optional)files_by_request: Whether to retrieve files via HTTP request vs scraping (default:true)sources: List of URLs or configurations to scrape
For detailed configuration examples, see the Scrapers documentation.
Common Commands
| Command | Description |
|---|---|
make up |
Start Docker containers |
make down |
Stop Docker containers |
make run |
Run pipeline locally |
make runpod |
Run pipeline in production |
make sync-requirements |
Install Python packages |
Troubleshooting
Docker Issues
If containers fail to start:
# Stop all containers
make down
# Remove volumes
docker-compose down -v
# Restart
make up
Database Connection Issues
Verify MySQL is running:
docker ps | grep mysql
Check database logs:
docker logs <mysql-container-id>
MinIO Access Issues
Access MinIO console at http://localhost:9100 with credentials from .env:
- Username:
minio - Password:
minio1234
Browser/Selenium Issues
If you encounter browser-related errors:
- Ensure
HEADLESS_BROWSER=truein.env - For Linux servers, set
XVFB_MODE=true - Check that SeleniumBase is properly installed
Analytics
At the end of each scraper execution, the pipeline stores statistics in the scraper_analytics table. The statistics include:
- scraped: The analyzed URLs
- content_retrieved: Resources successfully collected, grouped by whether their contents were retrieved
- uploaded: Resources successfully collected and uploaded to remote storage
View the latest statistics:
make run args="--analytics-only"
# For specific scrapers
make run args="--analytics-only --scrapers IOPScraper"