Scrapers
This section documents all available scrapers in the project for collecting Earth Observation and Remote Sensing data from various academic publishers, journals, and data sources.
Overview
The scraping system is designed around a hierarchical class structure where specialized scrapers inherit from base classes that provide common functionality. Each scraper is configured via config/config.json and targets specific publishers or data sources.
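Conceptually, the entry point reads config/config.json and instantiates one scraper per configured entry. A minimal sketch of that dispatch, assuming a simple name-to-class registry (the registry and helper names here are illustrative, not the project's actual API):

```python
import json

# Hypothetical registry mapping config.json keys to scraper classes;
# the real project discovers its scrapers differently.
SCRAPER_REGISTRY = {}


def register(cls):
    """Register a scraper class under its own class name."""
    SCRAPER_REGISTRY[cls.__name__] = cls
    return cls


def build_scrapers(config_path):
    """Instantiate every registered scraper that has a config entry."""
    with open(config_path) as fh:
        config = json.load(fh)
    return [
        SCRAPER_REGISTRY[name](settings)
        for name, settings in config.items()
        if name in SCRAPER_REGISTRY
    ]
```

Because scrapers are keyed by class name, adding a new scraper then amounts to defining the class and adding a matching entry to config/config.json.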
Configured Scrapers
The following table lists all scrapers currently configured in the system:
| Scraper | Base URL | Storage Folder | Description |
|---|---|---|---|
| IOPScraper | https://iopscience.iop.org | {main_folder}/iopscience | IOP Science journal articles and issues |
| MDPIScraper | https://www.mdpi.com | {main_folder}/mdpi | MDPI journals including Remote Sensing, Geosciences, Atmosphere |
| SpringerScraper | https://link.springer.com | {main_folder}/springer | Springer journals, books, and search results |
| AMSScraper | https://journals.ametsoc.org | {main_folder}/ams | American Meteorological Society publications |
| CopernicusScraper | Multiple Copernicus journals | {main_folder}/copernicus | 16+ Copernicus open-access journals |
| CopernicusCatalogueScraper | https://www.copernicus.eu/ | {main_folder}/copernicus | Copernicus services catalogue |
| SeosScraper | https://seos-project.eu | {main_folder}/seos | SEOS project educational materials |
| NCBIScraper | https://www.ncbi.nlm.nih.gov | {main_folder}/ncbi | NCBI PubMed Central articles |
| CambridgeUniversityPressScraper | https://www.cambridge.org | {main_folder}/cambridge_university_press | Cambridge University Press journals |
| OxfordAcademicScraper | https://academic.oup.com | {main_folder}/oxford_academic | Oxford Academic journals |
| IEEEScraper | https://ieeexplore.ieee.org | {main_folder}/ieee | IEEE Xplore open access articles |
| TaylorAndFrancisScraper | https://www.tandfonline.com | {main_folder}/taylor_and_francis | Taylor & Francis journals |
| FrontiersScraper | https://www.frontiersin.org/ | {main_folder}/frontiers | Frontiers in Remote Sensing |
| SageScraper | https://journals.sagepub.com | {main_folder}/sage | SAGE Publications journals |
| EOGEScraper | https://eoge.ut.ac.ir | {main_folder}/eoge | Earth Observations and Geomatics Engineering journal |
| ArxivScraper | https://arxiv.org | {main_folder}/arxiv | arXiv preprints |
| WileyScraper | Multiple Wiley domains | {main_folder}/wiley | Wiley journals (AGU, EOS, etc.) |
| EOSScraper | https://eos.org/ | {main_folder}/eos | EOS Science News archives |
| ESAScraper | Multiple ESA domains | {main_folder}/esa | ESA Earth Online, EO Portal, Sentiwiki |
| ElsevierScraper | https://www.sciencedirect.com | {main_folder}/elsevier | ScienceDirect open access journals |
| NASAScraper | Multiple NASA domains | {main_folder}/nasa | NASA EarthData, NTRS, EOS Portal |
| OpenNightLightsScraper | https://worldbank.github.io/OpenNightLights/ | {main_folder}/open_night_lights_scraper | World Bank Open Night Lights documentation |
| WikipediaScraper | https://en.wikipedia.org/ | {main_folder}/wikipedia | Wikipedia EO-related categories |
| MITScraper | https://ocw.mit.edu/ | {main_folder}/mit | MIT OpenCourseWare |
| JAXAScraper | https://earth.jaxa.jp/en/eo-knowledge | {main_folder}/jaxa | JAXA Earth Observation knowledge base |
| UKMetOfficeScraper | https://library.metoffice.gov.uk | {main_folder}/uk_met_office | UK Met Office library |
| EOAScraper | https://www.eoa.org.au/ | {main_folder}/eoa | Earth Observation Australia textbooks |
| ISPRSScraper | https://www.isprs.org/ | {main_folder}/isprs | ISPRS publication archives |
| EUMETSATScraper | Multiple EUMETSAT domains | {main_folder}/eumetsat | EUMETSAT documentation and case studies |
| EarthDataScienceScraper | https://www.earthdatascience.org/ | {main_folder}/earth_data_science | Earth Data Science tutorials |
| DirectLinksScraper | Various | {main_folder}/miscellaneous | Direct PDF links from multiple sources |
| IntechOpenScraper | https://www.intechopen.com/ | {main_folder}/intech_open | IntechOpen books and chapters |
Base Scraper Architecture
The scraping system is built on a set of abstract base classes that provide common functionality. Understanding these base classes is essential for extending or modifying the scraping behavior.
Base Class Descriptions
BaseScraper: The root abstract class that all scrapers inherit from. Provides core functionality including Selenium WebDriver management, cookie handling, S3 storage integration, database repository access, and analytics tracking. Every scraper implements the abstract scrape() method defined here.
BaseIterativePublisherScraper: Designed for publishers that organize content in a journal → volume → issue hierarchy. Iterates through volumes and issues systematically, with support for handling missing volumes/issues using consecutive threshold logic.
BasePaginationPublisherScraper: Handles publishers with paginated search results or article listings. Automatically navigates through pages until no more results are found, with configurable page sizes and maximum paper limits.
BaseUrlPublisherScraper: Used for publishers where content URLs follow predictable patterns. Processes lists of URLs and extracts content directly without complex navigation.
BaseMappedPublisherScraper: For publishers that require mapping between different URL structures or identifiers before scraping content. Provides a two-stage process: first mapping, then scraping.
BaseCrawlingScraper: Implements recursive web crawling from a starting URL. Follows links within the same domain and extracts content from all discovered pages. Useful for documentation sites and knowledge bases.
BaseSourceDownloadScraper: Specialized for direct file downloads where download URLs are known in advance. Handles PDF and other document formats directly without HTML parsing.
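The consecutive-threshold logic mentioned for BaseIterativePublisherScraper can be pictured as follows. This is an illustrative sketch, not the class's actual implementation; fetch_issue is a hypothetical callable that returns a list of PDF links, or None when the issue does not exist:

```python
def iterate_issues(fetch_issue, max_volume, missing_threshold=3):
    """Walk volume/issue numbers, giving up on a volume only after
    `missing_threshold` consecutive missing issues, so a single gap
    in the numbering does not end the volume prematurely."""
    links = []
    for volume in range(1, max_volume + 1):
        consecutive_missing = 0
        issue = 1
        while consecutive_missing < missing_threshold:
            found = fetch_issue(volume, issue)
            if found is None:
                consecutive_missing += 1
            else:
                consecutive_missing = 0
                links.extend(found)
            issue += 1
    return links
```

The threshold matters because publishers occasionally skip issue numbers; stopping at the first miss would silently truncate a volume.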
Adding a New Scraper
To extend the pipeline with a new scraper, follow these steps:
1. Create Scraper File
Create a new file in the scraper folder with the name of your scraper:
```python
# scraper/new_publisher_scraper.py
from scraper.base_scraper import BaseScraper
from model.new_publisher_models import NewPublisherConfig


class NewPublisherScraper(BaseScraper):
    """Scraper for New Publisher website."""

    @property
    def config_model_type(self):
        """Return the Pydantic model for configuration."""
        return NewPublisherConfig

    def scrape(self):
        """
        Scrape the website and return scraped data.

        Returns:
            dict: Dictionary containing scraped data
        """
        # Implement your scraping logic here
        # Use self._driver for Selenium operations
        # Use self._config_model to access configuration
        scraped_data = {}
        for source in self._config_model.sources:
            # Navigate to URL
            self._driver.open(source.url)
            # Handle cookies if needed
            if not self._cookie_handled and self._config_model.cookie_selector:
                self._driver.click(self._config_model.cookie_selector)
                self._cookie_handled = True
            # Extract data
            # ... your scraping logic ...
        return scraped_data

    def post_process(self, scraped_data):
        """
        Process scraped data and return URLs to download.

        Args:
            scraped_data: Data returned from scrape()

        Returns:
            List[str]: List of URLs to download/upload
        """
        urls = []
        # Process scraped_data and extract download URLs
        # ... your post-processing logic ...
        return urls
```
2. Create Model File (Optional)
If you need custom Pydantic models, create a file in the model folder:
```python
# model/new_publisher_models.py
from typing import List

from pydantic import Field

from model.base_models import BaseConfig, BaseSource


class NewPublisherSource(BaseSource):
    """Configuration for a single source."""

    url: str
    type: str = "journal"
    # Add custom fields as needed


class NewPublisherConfig(BaseConfig):
    """Configuration model for NewPublisherScraper."""

    base_url: str
    sources: List[NewPublisherSource]
    # Add any custom configuration fields
    custom_field: str = Field(default="default_value")
```
Note: If you need enumerators, extend Enum from the base_enum module.
3. Choose the Right Base Class
Select the appropriate base class for your scraper:
- BaseScraper: For custom scraping logic
- BaseIterativePublisherScraper: For journal → volume → issue hierarchies
- BasePaginationPublisherScraper: For paginated search results
- BaseUrlPublisherScraper: For simple URL lists
- BaseMappedPublisherScraper: For two-stage mapping and scraping
- BaseCrawlingScraper: For recursive web crawling
- BaseSourceDownloadScraper: For direct file downloads
Example using BaseIterativePublisherScraper:
```python
from scraper.base_iterative_publisher_scraper import BaseIterativePublisherScraper


class NewJournalScraper(BaseIterativePublisherScraper):
    @property
    def config_model_type(self):
        return NewJournalConfig

    def _scrape_journal(self, journal):
        # Implement journal-specific scraping
        pass
```
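For comparison, a scraper built on BasePaginationPublisherScraper revolves around a page-advance loop that stops when a page comes back empty. The sketch below is illustrative; fetch_page is a hypothetical stand-in for the real Selenium navigation:

```python
def paginate(fetch_page, page_size=50, max_papers=None):
    """Collect article links page by page until an empty page is
    returned, optionally capping the total at `max_papers`."""
    links = []
    page = 1
    while True:
        batch = fetch_page(page, page_size)
        if not batch:
            # No more results: the listing is exhausted.
            break
        links.extend(batch)
        if max_papers is not None and len(links) >= max_papers:
            return links[:max_papers]
        page += 1
    return links
```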
4. Add Configuration to config.json
Add your scraper's configuration to config/config.json, where the existing entries serve as further examples:
```json
{
  "NewPublisherScraper": {
    "bucket_key": "{main_folder}/new_publisher",
    "base_url": "https://newpublisher.com",
    "cookie_selector": "button#accept-cookies",
    "files_by_request": true,
    "sources": [
      {
        "url": "https://newpublisher.com/articles",
        "type": "journal"
      }
    ]
  }
}
```
Configuration Keys
Required:
- bucket_key: S3 storage path (use the {main_folder} placeholder)

Optional:
- base_url: Base URL of the website
- cookie_selector: CSS selector for the cookie banner acceptance button
- files_by_request: Whether to download files via HTTP (default: true) or scrape them
- request_with_proxy: Use a proxy for requests (default: false)
- sources: List of sources to scrape (structure depends on your model)
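These keys typically land on the scraper's configuration model. The sketch below uses a plain dataclass for brevity; the project itself defines its configuration as Pydantic models, and the resolved_bucket_key helper is hypothetical:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ScraperConfig:
    """Illustrative mirror of the common configuration keys."""

    bucket_key: str                       # required: S3 path with {main_folder}
    base_url: Optional[str] = None
    cookie_selector: Optional[str] = None
    files_by_request: bool = True
    request_with_proxy: bool = False
    sources: List[Any] = field(default_factory=list)

    def resolved_bucket_key(self, main_folder: str) -> str:
        """Substitute the {main_folder} placeholder in bucket_key."""
        return self.bucket_key.format(main_folder=main_folder)
```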
5. Test Your Scraper
Run your new scraper:
```shell
# Run with --force to test from scratch
make run args="--scrapers NewPublisherScraper --force"

# Check logs for errors
docker logs <container-id>
```
6. Verify Results
Check that data was scraped and uploaded correctly:
- MinIO Console: Visit http://localhost:9100 and check your bucket
- Database: Query the scraper_output and uploaded_resource tables
- Analytics: Run make run args="--analytics-only --scrapers NewPublisherScraper"
Example: Complete Simple Scraper
Here's a complete example of a simple scraper:
```python
# scraper/example_scraper.py
from typing import List

from scraper.base_scraper import BaseScraper
from model.base_models import BaseConfig, BaseSource


class ExampleConfig(BaseConfig):
    base_url: str
    sources: List[BaseSource]


class ExampleScraper(BaseScraper):
    @property
    def config_model_type(self):
        return ExampleConfig

    def scrape(self):
        pdf_links = []
        for source in self._config_model.sources:
            self._driver.open(source.url)
            # Find all PDF links
            links = self._driver.find_elements("a[href$='.pdf']")
            for link in links:
                href = link.get_attribute('href')
                if href:
                    pdf_links.append(href)
        return {"pdf_links": pdf_links}

    def post_process(self, scraped_data):
        return scraped_data.get("pdf_links", [])
```
With configuration:
```json
{
  "ExampleScraper": {
    "bucket_key": "{main_folder}/example",
    "base_url": "https://example.com",
    "sources": [
      {"url": "https://example.com/papers"}
    ]
  }
}
```
Common Selenium Operations
```python
# Open URL
self._driver.open(url)

# Click element
self._driver.click(selector)

# Find elements
elements = self._driver.find_elements(selector)

# Get attribute
href = element.get_attribute('href')

# Get text
text = element.text

# Wait for element
self._driver.wait_for_element(selector)

# Execute JavaScript
self._driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
```
Code Reference
Below is the detailed API documentation for all scraper classes:
Base Classes
scraper.base_scraper
BaseScraper
Bases: ABC
config_model_type
abstractmethod
property
Return the configuration model type. This property must be implemented in the derived class.
Returns:
| Type | Description |
|---|---|
| Type[BaseConfig] | Type[BaseConfig]: The configuration model type |
post_process(scrape_output)
abstractmethod
Post-process the scraped output. This method is called after the sources have been scraped. It is used to retrieve the final list of processed URLs. This method must be implemented in the derived class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scrape_output | Any | The scraped output | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, List[str]] \| List[str] | Dict[str, List[str]] \| List[str]: A dictionary or a list containing the processed links |
resume_scraping()
Resume the scraping of the resources that failed to scrape.
resume_uploads()
Resume the uploads of the resources that failed to upload.
scrape()
abstractmethod
Scrape the resources links. This method must be implemented in the derived class.
Returns:
| Name | Type | Description |
|---|---|---|
| Any | Any \| None | The output of the scraping, or None if something went wrong. |
scrape_failure(failure)
abstractmethod
Scrape the failed resource. This method must be implemented in the derived class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| failure | ScraperFailure | The failure model. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List[str]: The list of the successfully scraped links |
upload_to_s3(sources_links)
Upload the source files to S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sources_links | Dict[str, List[str]] \| List[str] | The list of links of the various sources. | required |
scraper.base_crawling_scraper
BaseCrawlingScraper
Bases: BaseScraper
crawling_folder_path
abstractmethod
property
The folder path where the crawling files are stored. This property must be implemented in the derived class.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The folder path. |
scrape()
Scrape, or more precisely crawl, the website.
Returns:
| Name | Type | Description |
|---|---|---|
| BaseCrawledPublisherScraperOutput | BaseCrawlingScraperOutput \| None | The output of the scraper, or None if the scraping failed. |
scraper.base_iterative_publisher_scraper
BaseIterativePublisherScraper
Bases: BaseScraper
journal_identifier(model)
abstractmethod
Return the journal identifier. This method must be implemented in the derived class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | BaseIterativePublisherJournal | The configuration model. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The journal identifier |
post_process(scrape_output)
Extract the PDF links from the dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scrape_output | IterativePublisherScrapeOutput | A dictionary containing the PDF links. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List[str]: A list of strings containing the PDF links |
scrape()
Scrape the journals for PDF links.
Returns:
| Type | Description |
|---|---|
| IterativePublisherScrapeOutput \| None | IterativePublisherScrapeOutput \| None: A dictionary containing the PDF links, or None if no link was found. |
scraper.base_mapped_publisher_scraper
BaseMappedPublisherScraper
Bases: BaseScraper
config_model_type
property
Return the configuration model type.
Returns:
| Type | Description |
|---|---|
| Type[BaseMappedConfig] | Type[BaseMappedConfig]: The configuration model type |
mapping
abstractmethod
property
Return the mapping of the scraper to the source. This method must be implemented in the derived class.
Returns:
| Type | Description |
|---|---|
| Dict[str, Type[BaseMappedSubScraper]] | Dict[str, Type[BaseMappedSubScraper]]: The mapping of the scraper to the source |
post_process(scrape_output)
Post-process the scraped output. This method is called after the sources have been scraped. It is used to retrieve the final list of processed URLs. This method must be implemented in the derived class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scrape_output | Dict[str, List[str] \| Dict[str, List[str]]] | The scraped output | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, List[str]] | Dict[str, List[str]]: The results of the scraping |
scrape()
Scrape the resources links.
Returns:
| Type | Description |
|---|---|
| Dict[str, List[str] \| Dict[str, List[str]]] \| None | Dict[str, List \| Dict]: The output of the scraping. |
scraper.base_pagination_publisher_scraper
BasePaginationPublisherScraper
Bases: BaseScraper
post_process(scrape_output)
Extract the href attribute from the links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scrape_output | BasePaginationPublisherScrapeOutput | A dictionary containing the PDF links. Each key is the name of a source for which PDF links have been found, and the value is the list of PDF links itself. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List[str]: A list of strings containing the PDF links |
scrape()
abstractmethod
Scrape the resources links. This method must be implemented in the derived class.
Returns:
| Type | Description |
|---|---|
BasePaginationPublisherScrapeOutput | None
|
BasePaginationPublisherScrapeOutput | None: The output of the scraping, i.e., a dictionary containing the PDF links. Each key is the name of the source which PDF links have been found for, and the value is the list of PDF links itself. |
scraper.base_url_publisher_scraper
BaseUrlPublisherScraper
Bases: BaseScraper
post_process(scrape_output)
Extract the href attribute from the links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scrape_output | ResultSet \| List[Tag] | A ResultSet (i.e., a list) or a list of Tag objects containing the tags to the PDF links. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List[str]: A list of strings containing the PDF links |
scrape()
Scrape the source URLs for PDF links.
Returns:
| Type | Description |
|---|---|
| ResultSet \| List[Tag] \| None | ResultSet \| List[Tag]: A ResultSet (i.e., a list) or a list of Tag objects containing the tags to the PDF links. If no tag was found, return None. |
scraper.base_source_download_scraper
BaseSourceDownloadScraper
Bases: BaseScraper, ABC
Publisher Scrapers
scraper.ams_scraper
scraper.arxiv_scraper
scraper.cambridge_university_press_scraper
scraper.elsevier_scraper
ElsevierScraper
Bases: BaseSourceDownloadScraper
__scrape_issue(source)
Scrape the issue for the PDFs. The logic is as follows:
- Find the next issue URL, i.e., the URL of the previous issue, if it exists.
- Check if there are any PDFs to download. If not, try with the next issue.
- Download the PDFs in a zip file and wait for the download to complete.
- Unpack the zip files in a temporary folder.
- Return the result of the scraping. If the issue was scraped successfully, return the next issue URL, i.e., the URL of the previous issue to scrape next.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | ElsevierSource | The source model. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| ElsevierScrapeIssueOutput | ElsevierScrapeIssueOutput | The result of the scraping. |
__scrape_journal(source)
Scrape the journal for the issues. The logic is as follows:
- Get the first issue link from the journal page, i.e., the newest issue.
- Scrape the issue and get the next issue URL. If the issue was scraped successfully, add the issue URL to the list of journal links.
- Repeat the process until there are no more issues to scrape.
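The newest-to-oldest walk described above can be sketched as a simple loop. This is illustrative rather than the class's actual code; scrape_issue is a hypothetical callable returning a (success, previous_issue_url) pair:

```python
def scrape_journal(first_issue_url, scrape_issue):
    """Walk issues from the newest backwards, collecting the URLs of
    issues that scraped successfully, until no previous issue remains."""
    links = []
    url = first_issue_url
    while url is not None:
        success, previous_url = scrape_issue(url)
        if success:
            links.append(url)
        # The next issue to visit is always the chronologically previous one.
        url = previous_url
    return links
```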
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | ElsevierSource | The source model. | required |
Returns:
| Type | Description |
|---|---|
| List[str] \| None | List[str] \| None: The list of journal links if the journal was scraped successfully, None otherwise |
scraper.frontiers_scraper
scraper.ieee_scraper
scraper.intechopen_scraper
scraper.iop_scraper
scraper.isprs_scraper
ISPRSScraper
Bases: BaseScraper
__scrape_archive_article(article_link)
Scrape a single article from the archives. The article contains the PDF link. If the article does not contain a PDF link, it will be saved as a failure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| article_link | str | The article link to scrape. | required |
Returns:
| Type | Description |
|---|---|
| str \| None | str \| None: The PDF link found in the article. |
__scrape_archives(archive_links)
Scrape the archives for PDF links. The archives contain links to articles, which in turn contain the PDF links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| archive_links | List[str] | A list of archive links. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List[str]: A list of PDF links found in the archives. |
__scrape_proceedings(proceedings_urls)
Scrape the proceedings for PDF links. The proceedings contain links to articles, which in turn contain the PDF links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| proceedings_urls | List[str] | A list of proceedings links. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List[str]: A list of PDF links found in the proceedings |
scraper.mdpi_scraper
MDPIJournalsScraper
Bases: BaseIterativePublisherScraper, BaseMappedSubScraper
__scrape_url(url)
Scrape the issue URL for PDF links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The issue URL. | required |
Returns:
| Type | Description |
|---|---|
| IterativePublisherScrapeIssueOutput \| None | BaseIterativePublisherScrapeIssueOutput \| None: A list of PDF links found in the issue, or None if something went wrong. |
scraper.ncbi_scraper
scraper.oxford_academic_scraper
OxfordAcademicScraper
Bases: BaseIterativePublisherScraper
__scrape_issue(issue_url)
Scrape the issue URL for PDF links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| issue_url | str | The issue URL to scrape. | required |
Returns:
| Type | Description |
|---|---|
| IterativePublisherScrapeIssueOutput \| None | IterativePublisherScrapeIssueOutput \| None: A list of PDF links found in the issue, or None if something went wrong |
scraper.sage_scraper
scraper.springer_scraper
scraper.taylor_and_francis_scraper
scraper.wiley_scraper
Data Source Scrapers
scraper.copernicus_catalogue_scraper
scraper.copernicus_scraper
CopernicusScraper
Bases: BaseIterativeWithConstraintPublisherScraper
__scrape_article(article_url)
Scrape a single article.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| article_url | str | The article URL to scrape. | required |
Returns:
| Type | Description |
|---|---|
| str \| None | str \| None: The string containing the PDF link. |
__scrape_issue(issue_url)
Scrape the issue URL for PDF links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| issue_url | str | The issue URL to scrape. | required |
Returns:
| Type | Description |
|---|---|
| IterativePublisherScrapeIssueOutput \| None | IterativePublisherScrapeIssueOutput \| None: A list of PDF links found in the issue, or None if something went wrong |
scraper.direct_links_scraper
scraper.earth_data_science_scraper
scraper.eoa_scraper
scraper.eoge_scraper
scraper.eos_scraper
scraper.esa_scraper
scraper.eumetsat_scraper
scraper.jaxa_scraper
scraper.mit_scraper
scraper.nasa_scraper
scraper.open_night_lights_scraper
scraper.seos_scraper
SeosScraper
Bases: BaseScraper
config_model_type
property
Return the configuration model type.
Returns:
| Type | Description |
|---|---|
| Type[SeosConfig] | Type[SeosConfig]: The configuration model type |
__scrape_source(source)
Scrape the source URL for HTML links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | SeosSource | The source to scrape. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List[str]: A list of HTML links. |
post_process(scrape_output)
Extract the href attribute from the links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scrape_output | Dict[str, List[Tag]] | A dictionary collecting, for each source, the corresponding list of Tag objects containing the tags to the HTML links. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List[str]: A list of strings containing the HTML links |
scrape()
Scrape the Seos sources for HTML links.
Returns:
| Type | Description |
|---|---|
| Dict[str, List[str]] \| None | Dict[str, List[str]]: A dictionary collecting, for each source, the corresponding list of HTML links. If no link was found, return None. |