Examples
This page provides practical examples and common workflows for using the Data Scraping pipeline.
Basic Usage Examples
Running a Single Scraper
Run a specific scraper to collect data from one source:
make run args="--scrapers IOPScraper"
What happens:
- The pipeline loads configuration from config/config.json
- IOPScraper is initialized with its configuration
- The scraper navigates to configured URLs
- PDF links are extracted and downloaded
- Files are uploaded to MinIO/S3
- Results are logged to the database
Running Multiple Scrapers
Execute several scrapers sequentially:
make run args="--scrapers IOPScraper SpringerScraper MDPIScraper"
This runs each scraper one after another, which is useful for collecting data from multiple sources with one command.
Running All Scrapers
Execute all configured scrapers:
make run
This processes all scrapers defined in config/config.json.
Advanced Usage
Force Re-execution
Force a scraper to run again even if it completed successfully before:
make run args="--scrapers IOPScraper --force"
Use cases:
- Testing after code changes
- Collecting new content from the same source
- Recovering from incomplete runs
Resume Failed URLs
Resume only failed URLs from the previous execution:
make run args="--scrapers IOPScraper --resume"
When to use:
- Network errors occurred during scraping
- Some pages were temporarily unavailable
- Timeout errors on specific URLs
What it does:
- Queries the scraper_failure table for failed URLs
- Re-attempts only those URLs
- Updates success/failure status
Resume Failed Uploads
Resume only failed file uploads:
make run args="--scrapers IOPScraper --resume-upload"
When to use:
- S3/MinIO connection issues occurred
- Upload timeouts for large files
- Storage quota was exceeded
Important: --resume and --resume-upload cannot be combined in one command.
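A command-line parser can enforce this kind of restriction directly. The following is a minimal sketch, assuming the pipeline parses its flags with Python's argparse; the source does not show the actual parsing code, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used on this page."""
    parser = argparse.ArgumentParser(prog="scraper-pipeline")
    parser.add_argument("--scrapers", nargs="*", default=[])
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--analytics-only", action="store_true")

    # argparse rejects the command line if both flags of this group are given.
    resume = parser.add_mutually_exclusive_group()
    resume.add_argument("--resume", action="store_true")
    resume.add_argument("--resume-upload", action="store_true")
    return parser


args = build_parser().parse_args(["--scrapers", "IOPScraper", "--resume"])
print(args.scrapers, args.resume, args.resume_upload)
```

With a parser like this, passing both --resume and --resume-upload ends the program with a usage error before any scraper starts.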
Analytics Examples
View All Scraper Statistics
Get analytics for all scrapers:
make run args="--analytics-only"
Output includes:
- URLs scraped
- Content successfully retrieved
- Files uploaded to storage
- Failure counts
View Specific Scraper Statistics
Get analytics for specific scrapers:
make run args="--analytics-only --scrapers IOPScraper SpringerScraper"
Understanding Analytics Output
The analytics JSON contains:
{
"scraped": {
"total": 100,
"successful": 95,
"failed": 5
},
"content_retrieved": {
"retrieved": 90,
"not_retrieved": 5
},
"uploaded": {
"successful": 88,
"failed": 2
}
}
Metrics explained:
- scraped: URLs processed by the scraper
- content_retrieved: Resources whose content was successfully downloaded
- uploaded: Resources successfully uploaded to S3
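The three metrics can be turned into per-stage success rates. This is a sketch that assumes the analytics output has exactly the shape shown in the example above; stage_rates is a hypothetical helper, not part of the pipeline.

```python
import json

# Sample payload copied from the analytics example on this page.
ANALYTICS_JSON = """
{
  "scraped": {"total": 100, "successful": 95, "failed": 5},
  "content_retrieved": {"retrieved": 90, "not_retrieved": 5},
  "uploaded": {"successful": 88, "failed": 2}
}
"""


def stage_rates(analytics: dict) -> dict:
    """Return the fraction of items that passed each stage."""
    scraped = analytics["scraped"]
    content = analytics["content_retrieved"]
    uploaded = analytics["uploaded"]

    def ratio(ok: int, bad: int) -> float:
        total = ok + bad
        return ok / total if total else 0.0

    return {
        "scraped": ratio(scraped["successful"], scraped["failed"]),
        "content_retrieved": ratio(content["retrieved"], content["not_retrieved"]),
        "uploaded": ratio(uploaded["successful"], uploaded["failed"]),
    }


rates = stage_rates(json.loads(ANALYTICS_JSON))
for stage, rate in rates.items():
    print(f"{stage}: {rate:.1%}")
```

Each stage is measured against the items that reached it. In the sample, the 90 retrieved plus 5 not retrieved add up to the 95 successfully scraped URLs.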
Configuration Examples
Simple Publisher Configuration
For a straightforward journal website:
{
"SimplePublisherScraper": {
"bucket_key": "{main_folder}/simple_publisher",
"base_url": "https://journal.example.com",
"cookie_selector": "button.accept-cookies",
"sources": [
{
"url": "https://journal.example.com/volume/1/issue/1",
"type": "issue_or_collection"
}
]
}
}
Iterative Journal Configuration
For journals with volume/issue structure:
{
"JournalScraper": {
"bucket_key": "{main_folder}/journal",
"journals": [
{
"url": "https://journal.com",
"name": "Journal of Earth Observation",
"start_volume": 1,
"end_volume": 20,
"start_issue": 1,
"end_issue": 12,
"consecutive_missing_volumes_threshold": 3,
"consecutive_missing_issues_threshold": 3
}
]
}
}
Parameters:
- start_volume / end_volume: Volume range to scrape
- start_issue / end_issue: Issue range per volume
- consecutive_missing_*_threshold: Stop after N consecutive missing volumes/issues
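The sketch below shows one way these parameters could interact. It rests on two assumptions that the source does not confirm: a counter of consecutive misses resets whenever something is found, and a volume counts as missing when none of its issues exist. iter_issues and issue_exists are hypothetical names.

```python
from typing import Callable, Iterator, Tuple


def iter_issues(
    start_volume: int,
    end_volume: int,
    start_issue: int,
    end_issue: int,
    missing_volumes_threshold: int,
    missing_issues_threshold: int,
    issue_exists: Callable[[int, int], bool],
) -> Iterator[Tuple[int, int]]:
    """Yield (volume, issue) pairs, stopping early after runs of misses."""
    missing_volumes = 0
    for volume in range(start_volume, end_volume + 1):
        found_any = False
        missing_issues = 0
        for issue in range(start_issue, end_issue + 1):
            if issue_exists(volume, issue):
                found_any = True
                missing_issues = 0
                yield volume, issue
            else:
                missing_issues += 1
                if missing_issues >= missing_issues_threshold:
                    break  # give up on this volume
        # Assumption: a volume with no issues at all counts as missing.
        missing_volumes = 0 if found_any else missing_volumes + 1
        if missing_volumes >= missing_volumes_threshold:
            break  # give up on the journal


# Toy site: only volumes 1 and 2 exist, each with issues 1 to 4.
published = {(v, i) for v in (1, 2) for i in range(1, 5)}
found = list(iter_issues(1, 20, 1, 12, 3, 3, lambda v, i: (v, i) in published))
print(len(found), found[-1])
```

With both thresholds set to 3, the loop stops probing after volume 5 and does not request all 20 configured volumes.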
Pagination Configuration
For search results with pagination:
{
"SearchScraper": {
"bucket_key": "{main_folder}/search",
"base_url": "https://publisher.com",
"sources": [
{
"landing_page_url": "https://publisher.com/search?q=remote+sensing&page={page_number}",
"page_size": 50,
"max_allowed_papers": 1000
}
]
}
}
Parameters:
- {page_number}: Placeholder for page number (auto-incremented)
- page_size: Results per page
- max_allowed_papers: Maximum papers to collect
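The sketch below shows how the placeholder, page size, and paper limit could determine which pages are requested. It assumes numbering starts at page 1 and that the limit is enforced by page count; the actual scraper may also stop earlier, for example on an empty page. page_urls is a hypothetical helper.

```python
import math
from typing import List


def page_urls(landing_page_url: str, page_size: int, max_allowed_papers: int,
              first_page: int = 1) -> List[str]:
    """Build the page URLs needed to cover at most max_allowed_papers results."""
    pages_needed = math.ceil(max_allowed_papers / page_size)
    return [
        landing_page_url.format(page_number=first_page + offset)
        for offset in range(pages_needed)
    ]


urls = page_urls(
    "https://publisher.com/search?q=remote+sensing&page={page_number}",
    page_size=50,
    max_allowed_papers=1000,
)
print(len(urls))
print(urls[0])
print(urls[-1])
```

With the values from the configuration above, 1000 papers at 50 per page means 20 page requests.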
Multi-Source Configuration
For scrapers with multiple sub-sources:
{
"MultiSourceScraper": {
"bucket_key": "{main_folder}/multi",
"sources": [
{
"name": "Source A",
"scraper": "SubScraperA",
"config": {
"base_url": "https://source-a.com",
"cookie_selector": "button.accept"
}
},
{
"name": "Source B",
"scraper": "SubScraperB",
"config": {
"base_url": "https://source-b.com",
"urls": ["https://source-b.com/papers"]
}
}
]
}
}
Direct Download Configuration
For known PDF URLs:
{
"DirectLinksScraper": {
"bucket_key": "{main_folder}/direct",
"sources": [
{
"name": "Example PDFs",
"config": {
"bucket_key": "custom_folder",
"urls": [
"https://example.com/paper1.pdf",
"https://example.com/paper2.pdf",
"https://example.com/paper3.pdf"
]
}
}
]
}
}
Debugging Examples
Check Scraper Status
View database status for a scraper:
# Connect to MySQL (shell command)
docker exec -it <mysql-container> mysql -u root -p
-- Check completed scrapers
SELECT scraper, created_at FROM scraper_output;
-- Check failed URLs
SELECT scraper, url, error FROM scraper_failure ORDER BY created_at DESC LIMIT 10;
-- Check uploaded resources
SELECT COUNT(*) as total FROM uploaded_resource;
Inspect MinIO Storage
Access MinIO console:
# Local: http://localhost:9100
# Login: minio / minio1234
Navigate to your bucket and verify files were uploaded correctly.
View Scraper Logs
Check Docker logs for errors:
# View recent logs
docker logs <container-name> --tail 100
# Follow logs in real-time
docker logs -f <container-name>
# Search for errors
docker logs <container-name> 2>&1 | grep ERROR
Test Configuration
Validate configuration before running:
# In a Python console
import json

from model.iop_models import IOPConfig

with open('config/config.json') as f:
    config = json.load(f)

# Validate the configuration
iop_config = IOPConfig(**config['IOPScraper'])
print(iop_config)
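A check that uses only the standard library can catch obvious omissions before the model classes are involved. This sketch assumes every scraper entry needs a bucket_key, which is true of the examples on this page but not confirmed as a rule. check_config and the sample entries are hypothetical.

```python
import json
from typing import Dict, List


def check_config(config: Dict[str, dict]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for scraper_name, settings in config.items():
        if not isinstance(settings, dict):
            problems.append(f"{scraper_name}: expected an object")
            continue
        # Assumption: every scraper needs a bucket_key, as in the examples here.
        if not settings.get("bucket_key"):
            problems.append(f"{scraper_name}: missing bucket_key")
    return problems


# Illustrative data only; real entries live in config/config.json.
sample = json.loads("""
{
  "IOPScraper": {"bucket_key": "{main_folder}/iop", "sources": []},
  "BrokenScraper": {"sources": []}
}
""")
print(check_config(sample))
```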
Troubleshooting Examples
Handle Cookie Banners
If a scraper fails because of a cookie banner:
- Inspect the page in a browser
- Find the CSS selector of the "Accept" button
- Add it to the configuration:
{
"ScraperName": {
"cookie_selector": "button#onetrust-accept-btn-handler",
...
}
}
Handle Dynamic Loading
For JavaScript-loaded content:
# In your scraper's scrape() method
# Wait for element to load
self._driver.wait_for_element("div.article-list", timeout=10)
# Or use a loading tag, configured in config.json:
{
"loading_tag": "div.loading-spinner"
}
Handle Pagination Edge Cases
For pagination that doesn't follow standard patterns:
# In your scraper
page_number = 1
while True:
    url = f"https://example.com/search?page={page_number}"
    self._driver.open(url)
    # Check if results exist
    results = self._driver.find_elements("div.result")
    if not results:
        break  # No more pages
    # Process results
    # ...
    page_number += 1
Handle Proxy Requirements
For sites that require a proxy:
{
"ScraperName": {
"request_with_proxy": true,
...
}
}
Ensure proxy credentials are in .env:
INTERACTING_PROXY_HOST=proxy.example.com
INTERACTING_PROXY_PORT=8080
INTERACTING_PROXY_USER=username
INTERACTING_PROXY_PASSWORD=password
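The sketch below shows how these four variables could be combined into a single proxy URL. The source does not show how the pipeline reads them, so the http scheme and the proxy_url_from_env helper are assumptions.

```python
import os
from urllib.parse import quote


def proxy_url_from_env(env=os.environ) -> str:
    """Assemble an http:// proxy URL from the INTERACTING_PROXY_* variables."""
    host = env["INTERACTING_PROXY_HOST"]
    port = env["INTERACTING_PROXY_PORT"]
    # Percent-encode credentials so characters like '@' or ':' do not break the URL.
    user = quote(env["INTERACTING_PROXY_USER"], safe="")
    password = quote(env["INTERACTING_PROXY_PASSWORD"], safe="")
    return f"http://{user}:{password}@{host}:{port}"


# Example values only; a real run would rely on the variables loaded from .env.
example_env = {
    "INTERACTING_PROXY_HOST": "proxy.example.com",
    "INTERACTING_PROXY_PORT": "8080",
    "INTERACTING_PROXY_USER": "username",
    "INTERACTING_PROXY_PASSWORD": "p@ss:word",
}
print(proxy_url_from_env(example_env))
```

A missing variable raises KeyError, which surfaces an incomplete .env file early instead of producing a malformed URL.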
Performance Optimization
Parallel Scraping
Run multiple scrapers in parallel using separate processes:
# Terminal 1
make run args="--scrapers IOPScraper MDPIScraper" &
# Terminal 2
make run args="--scrapers SpringerScraper ElsevierScraper" &
Note: Ensure each scraper targets different domains to avoid rate limiting.
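The same fan-out can be scripted instead of opening several terminals. This sketch assumes make run accepts the args variable exactly as in the commands above; GROUPS, make_command, and run_groups are hypothetical names.

```python
import subprocess
from typing import List

# Hypothetical grouping: each group runs in its own process.
GROUPS = [
    ["IOPScraper", "MDPIScraper"],
    ["SpringerScraper", "ElsevierScraper"],
]


def make_command(scrapers: List[str]) -> List[str]:
    """Build the argv equivalent of: make run args="--scrapers A B"."""
    return ["make", "run", "args=--scrapers " + " ".join(scrapers)]


def run_groups(groups: List[List[str]]) -> List[int]:
    """Start one process per group, wait for all of them, return exit codes."""
    processes = [subprocess.Popen(make_command(group)) for group in groups]
    return [process.wait() for process in processes]


# Only print the commands here; run_groups(GROUPS) would actually start them.
for group in GROUPS:
    print(make_command(group))
```

Calling run_groups(GROUPS) from the project root would start both groups at once and return their exit codes once all have finished.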
Optimize Page Size
For pagination scrapers, adjust page size:
{
"sources": [
{
"landing_page_url": "...",
"page_size": 100
}
]
}
Balance: Larger pages mean fewer requests and faster runs, but very large pages may time out.
Limit Paper Count
To avoid excessive scraping:
{
"sources": [
{
"landing_page_url": "...",
"max_allowed_papers": 5000
}
]
}
Integration Examples
Query Uploaded Resources
After scraping, query the database:
from repository.uploaded_resource_repository import UploadedResourceRepository
repo = UploadedResourceRepository()
# Get all resources from a specific scraper
resources = repo.get_all_by({"source": "IOPScraper"})
for resource in resources:
    print(f"URL: {resource.url}")
    print(f"S3 Key: {resource.s3_key}")
    print(f"SHA256: {resource.sha256}")
Download from S3
Retrieve scraped files from S3:
from service.storage import S3Storage
s3 = S3Storage()
# Download a file
s3_key = "raw_data/iopscience/paper.pdf"
local_path = "/tmp/paper.pdf"
s3.download_file(s3_key, local_path)
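The SHA256 value stored with each uploaded resource can be used to check a downloaded file. This is a minimal sketch, assuming the stored value is the lowercase hex SHA-256 digest of the file's bytes; the source does not state the encoding. sha256_of_file is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with a throwaway file instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "paper.pdf"
    sample_file.write_bytes(b"%PDF-1.7 example")
    expected = hashlib.sha256(b"%PDF-1.7 example").hexdigest()
    matches = sha256_of_file(str(sample_file)) == expected
    print(matches)
```

If that assumption holds, comparing sha256_of_file(local_path) with resource.sha256 after the download would flag a truncated or corrupted file.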
Export Analytics
Export analytics to CSV:
# Get analytics as JSON
make run args="--analytics-only --scrapers IOPScraper" > analytics.json
# Process with Python
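python -c "
import csv
import json

with open('analytics.json') as f:
    data = json.load(f)

with open('analytics.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Metric', 'Value'])
    for key, value in data.items():
        if isinstance(value, dict):
            # Nested counters become one row each, e.g. scraped.total
            for name, count in value.items():
                writer.writerow([key + '.' + name, count])
        else:
            writer.writerow([key, value])
"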