
Scrapers

This section documents all available scrapers in the project for collecting Earth Observation and Remote Sensing data from various academic publishers, journals, and data sources.

Overview

The scraping system is designed around a hierarchical class structure where specialized scrapers inherit from base classes that provide common functionality. Each scraper is configured via config/config.json and targets specific publishers or data sources.

Configured Scrapers

The following table lists all scrapers currently configured in the system:

| Scraper | Base URL | Storage Folder | Description |
| --- | --- | --- | --- |
| IOPScraper | https://iopscience.iop.org | {main_folder}/iopscience | IOP Science journal articles and issues |
| MDPIScraper | https://www.mdpi.com | {main_folder}/mdpi | MDPI journals including Remote Sensing, Geosciences, Atmosphere |
| SpringerScraper | https://link.springer.com | {main_folder}/springer | Springer journals, books, and search results |
| AMSScraper | https://journals.ametsoc.org | {main_folder}/ams | American Meteorological Society publications |
| CopernicusScraper | Multiple Copernicus journals | {main_folder}/copernicus | 16+ Copernicus open-access journals |
| CopernicusCatalogueScraper | https://www.copernicus.eu/ | {main_folder}/copernicus | Copernicus services catalogue |
| SeosScraper | https://seos-project.eu | {main_folder}/seos | SEOS project educational materials |
| NCBIScraper | https://www.ncbi.nlm.nih.gov | {main_folder}/ncbi | NCBI PubMed Central articles |
| CambridgeUniversityPressScraper | https://www.cambridge.org | {main_folder}/cambridge_university_press | Cambridge University Press journals |
| OxfordAcademicScraper | https://academic.oup.com | {main_folder}/oxford_academic | Oxford Academic journals |
| IEEEScraper | https://ieeexplore.ieee.org | {main_folder}/ieee | IEEE Xplore open access articles |
| TaylorAndFrancisScraper | https://www.tandfonline.com | {main_folder}/taylor_and_francis | Taylor & Francis journals |
| FrontiersScraper | https://www.frontiersin.org/ | {main_folder}/frontiers | Frontiers in Remote Sensing |
| SageScraper | https://journals.sagepub.com | {main_folder}/sage | SAGE Publications journals |
| EOGEScraper | https://eoge.ut.ac.ir | {main_folder}/eoge | Earth Observations and Geomatics Engineering journal |
| ArxivScraper | https://arxiv.org | {main_folder}/arxiv | arXiv preprints |
| WileyScraper | Multiple Wiley domains | {main_folder}/wiley | Wiley journals (AGU, EOS, etc.) |
| EOSScraper | https://eos.org/ | {main_folder}/eos | EOS Science News archives |
| ESAScraper | Multiple ESA domains | {main_folder}/esa | ESA Earth Online, EO Portal, Sentiwiki |
| ElsevierScraper | https://www.sciencedirect.com | {main_folder}/elsevier | ScienceDirect open access journals |
| NASAScraper | Multiple NASA domains | {main_folder}/nasa | NASA EarthData, NTRS, EOS Portal |
| OpenNightLightsScraper | https://worldbank.github.io/OpenNightLights/ | {main_folder}/open_night_lights_scraper | World Bank Open Night Lights documentation |
| WikipediaScraper | https://en.wikipedia.org/ | {main_folder}/wikipedia | Wikipedia EO-related categories |
| MITScraper | https://ocw.mit.edu/ | {main_folder}/mit | MIT OpenCourseWare |
| JAXAScraper | https://earth.jaxa.jp/en/eo-knowledge | {main_folder}/jaxa | JAXA Earth Observation knowledge base |
| UKMetOfficeScraper | https://library.metoffice.gov.uk | {main_folder}/uk_met_office | UK Met Office library |
| EOAScraper | https://www.eoa.org.au/ | {main_folder}/eoa | Earth Observation Australia textbooks |
| ISPRSScraper | https://www.isprs.org/ | {main_folder}/isprs | ISPRS publication archives |
| EUMETSATScraper | Multiple EUMETSAT domains | {main_folder}/eumetsat | EUMETSAT documentation and case studies |
| EarthDataScienceScraper | https://www.earthdatascience.org/ | {main_folder}/earth_data_science | Earth Data Science tutorials |
| DirectLinksScraper | Various | {main_folder}/miscellaneous | Direct PDF links from multiple sources |
| IntechOpenScraper | https://www.intechopen.com/ | {main_folder}/intech_open | IntechOpen books and chapters |

Base Scraper Architecture

The scraping system is built on a set of abstract base classes that provide common functionality. Understanding these base classes is essential for extending or modifying the scraping behavior.

Base Class Descriptions

BaseScraper: The root abstract class that all scrapers inherit from. Provides core functionality including Selenium WebDriver management, cookie handling, S3 storage integration, database repository access, and analytics tracking. Every scraper implements the abstract scrape() method defined here.

BaseIterativePublisherScraper: Designed for publishers that organize content in a journal → volume → issue hierarchy. Iterates through volumes and issues systematically, with support for handling missing volumes/issues using consecutive threshold logic.
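
The consecutive-threshold walk can be sketched as follows; the function and parameter names are illustrative stand-ins, not the project's actual API:

```python
def iterate_volumes(fetch_volume, max_volume=9999, miss_threshold=3):
    """fetch_volume(number) returns a list of PDF links, or None if the
    volume does not exist. Stop after `miss_threshold` consecutive misses."""
    links, consecutive_misses = [], 0
    for volume in range(1, max_volume + 1):
        result = fetch_volume(volume)
        if result is None:
            consecutive_misses += 1
            if consecutive_misses >= miss_threshold:
                break  # assume no further volumes exist
            continue
        consecutive_misses = 0  # a hit resets the miss counter
        links.extend(result)
    return links
```

A single missing volume does not abort the walk; only an unbroken run of misses does, which tolerates gaps in a publisher's numbering.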

BasePaginationPublisherScraper: Handles publishers with paginated search results or article listings. Automatically navigates through pages until no more results are found, with configurable page sizes and maximum paper limits.
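
A minimal sketch of that pagination loop, with the page fetch injected as a callable (names are illustrative, not the project's actual API):

```python
def scrape_paginated(fetch_page, max_papers=None):
    """fetch_page(page_number) returns a list of links;
    an empty list signals that there are no more pages."""
    links, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break  # no more results
        links.extend(batch)
        if max_papers is not None and len(links) >= max_papers:
            return links[:max_papers]  # respect the configured paper limit
        page += 1
    return links
```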

BaseUrlPublisherScraper: Used for publishers where content URLs follow predictable patterns. Processes lists of URLs and extracts content directly without complex navigation.

BaseMappedPublisherScraper: For publishers that require mapping between different URL structures or identifiers before scraping content. Provides a two-stage process: first mapping, then scraping.
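
The two-stage process can be sketched as a dictionary mapping source identifiers to sub-scraper classes; every name below is illustrative, not the project's actual API:

```python
class PdfSubScraper:
    def scrape(self, url):
        return [url + "/article.pdf"]

class HtmlSubScraper:
    def scrape(self, url):
        return [url + "/index.html"]

# Stage 1 data: which sub-scraper handles which kind of source.
MAPPING = {"pdf_site": PdfSubScraper, "html_site": HtmlSubScraper}

def scrape_mapped(sources):
    results = {}
    for source_id, url in sources:
        sub_scraper = MAPPING[source_id]()            # stage 1: map identifier to scraper
        results[source_id] = sub_scraper.scrape(url)  # stage 2: scrape the source
    return results
```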

BaseCrawlingScraper: Implements recursive web crawling from a starting URL. Follows links within the same domain and extracts content from all discovered pages. Useful for documentation sites and knowledge bases.
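
The same-domain crawl can be sketched as follows, with the page fetch stubbed out as a callable (illustrative names, not the project's actual API):

```python
from urllib.parse import urlparse

def crawl(start_url, get_links):
    """Visit every page reachable from start_url within the same domain.
    get_links(url) stands in for fetching a page and extracting its links."""
    domain = urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        if url in seen or urlparse(url).netloc != domain:
            continue  # skip already-visited pages and external domains
        seen.add(url)
        queue.extend(get_links(url))
    return seen
```

The `seen` set prevents infinite loops on cyclic link structures, which are common in documentation sites.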

BaseSourceDownloadScraper: Specialized for direct file downloads where download URLs are known in advance. Handles PDF and other document formats directly without HTML parsing.
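
That direct-download path can be sketched like this, with the HTTP transfer injected as a callable so the sketch stays self-contained (illustrative names only):

```python
import os

def download_sources(urls, folder, fetch):
    """Save each known file URL directly; no HTML parsing involved.
    fetch(url) stands in for the HTTP download and returns the file bytes."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for url in urls:
        filename = url.rsplit("/", 1)[-1]   # derive the local name from the URL
        path = os.path.join(folder, filename)
        with open(path, "wb") as fh:
            fh.write(fetch(url))
        paths.append(path)
    return paths
```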

Adding a New Scraper

To extend the pipeline with a new scraper, follow these steps:

1. Create Scraper File

Create a new file in the scraper folder with the name of your scraper:

# scraper/new_publisher_scraper.py
from scraper.base_scraper import BaseScraper
from model.new_publisher_models import NewPublisherConfig

class NewPublisherScraper(BaseScraper):
    """
    Scraper for New Publisher website.
    """

    @property
    def config_model_type(self):
        """Return the Pydantic model for configuration."""
        return NewPublisherConfig

    def scrape(self):
        """
        Scrape the website and return scraped data.

        Returns:
            dict: Dictionary containing scraped data
        """
        # Implement your scraping logic here
        # Use self._driver for Selenium operations
        # Use self._config_model to access configuration

        scraped_data = {}

        for source in self._config_model.sources:
            # Navigate to URL
            self._driver.open(source.url)

            # Handle cookies if needed
            if not self._cookie_handled and self._config_model.cookie_selector:
                self._driver.click(self._config_model.cookie_selector)
                self._cookie_handled = True

            # Extract data
            # ... your scraping logic ...

        return scraped_data

    def post_process(self, scraped_data):
        """
        Process scraped data and return URLs to download.

        Args:
            scraped_data: Data returned from scrape()

        Returns:
            List[str]: List of URLs to download/upload
        """
        urls = []

        # Process scraped_data and extract download URLs
        # ... your post-processing logic ...

        return urls

2. Create Model File (Optional)

If you need custom Pydantic models, create a file in the model folder:

# model/new_publisher_models.py
from typing import List
from pydantic import Field
from model.base_models import BaseConfig, BaseSource

class NewPublisherSource(BaseSource):
    """Configuration for a single source."""
    url: str
    type: str = "journal"
    # Add custom fields as needed

class NewPublisherConfig(BaseConfig):
    """Configuration model for NewPublisherScraper."""
    base_url: str
    sources: List[NewPublisherSource]
    # Add any custom configuration fields
    custom_field: str = Field(default="default_value")

Note: If you need enumerators, extend Enum from the base_enum module.
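
As a sketch of such an enumerator (the concrete base_enum API is not shown here, so the standard-library Enum stands in, and the member names are illustrative):

```python
from enum import Enum  # stand-in for the project's base_enum module

class NewPublisherSourceType(Enum):
    """Illustrative enumerator for the `type` field of a source."""
    JOURNAL = "journal"
    BOOK = "book"
```

Using string values lets Pydantic coerce the plain strings found in config.json into enum members during validation.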

3. Choose the Right Base Class

Select the appropriate base class for your scraper:

  • BaseScraper: For custom scraping logic
  • BaseIterativePublisherScraper: For journal → volume → issue hierarchies
  • BasePaginationPublisherScraper: For paginated search results
  • BaseUrlPublisherScraper: For simple URL lists
  • BaseMappedPublisherScraper: For two-stage mapping and scraping
  • BaseCrawlingScraper: For recursive web crawling
  • BaseSourceDownloadScraper: For direct file downloads

Example using BaseIterativePublisherScraper:

from scraper.base_iterative_publisher_scraper import BaseIterativePublisherScraper

class NewJournalScraper(BaseIterativePublisherScraper):
    @property
    def config_model_type(self):
        return NewJournalConfig

    def _scrape_journal(self, journal):
        # Implement journal-specific scraping
        pass

4. Add Configuration to config.json

Add your scraper's configuration to config/config.json (the existing entries in that file provide further examples):

{
  "NewPublisherScraper": {
    "bucket_key": "{main_folder}/new_publisher",
    "base_url": "https://newpublisher.com",
    "cookie_selector": "button#accept-cookies",
    "files_by_request": true,
    "sources": [
      {
        "url": "https://newpublisher.com/articles",
        "type": "journal"
      }
    ]
  }
}

Configuration Keys

Required:

  • bucket_key: S3 storage path (use {main_folder} placeholder)

Optional:

  • base_url: Base URL of the website
  • cookie_selector: CSS selector for cookie banner acceptance button
  • files_by_request: Whether to download files via HTTP (default: true) or scrape them
  • request_with_proxy: Use proxy for requests (default: false)
  • sources: List of sources to scrape (structure depends on your model)

5. Test Your Scraper

Run your new scraper:

# Run with force to test from scratch
make run args="--scrapers NewPublisherScraper --force"

# Check logs for errors
docker logs <container-id>

6. Verify Results

Check that data was scraped and uploaded correctly:

  1. MinIO Console: Visit http://localhost:9100 and check your bucket
  2. Database: Query the scraper_output and uploaded_resource tables
  3. Analytics: Run make run args="--analytics-only --scrapers NewPublisherScraper"

Example: Complete Simple Scraper

Here's a complete example of a simple scraper:

# scraper/example_scraper.py
from typing import List
from scraper.base_scraper import BaseScraper
from model.base_models import BaseConfig, BaseSource

class ExampleConfig(BaseConfig):
    base_url: str
    sources: List[BaseSource]

class ExampleScraper(BaseScraper):
    @property
    def config_model_type(self):
        return ExampleConfig

    def scrape(self):
        pdf_links = []

        for source in self._config_model.sources:
            self._driver.open(source.url)

            # Find all PDF links
            links = self._driver.find_elements("a[href$='.pdf']")
            for link in links:
                href = link.get_attribute('href')
                if href:
                    pdf_links.append(href)

        return {"pdf_links": pdf_links}

    def post_process(self, scraped_data):
        return scraped_data.get("pdf_links", [])

With configuration:

{
  "ExampleScraper": {
    "bucket_key": "{main_folder}/example",
    "base_url": "https://example.com",
    "sources": [
      {"url": "https://example.com/papers"}
    ]
  }
}

Common Selenium Operations

# Open URL
self._driver.open(url)

# Click element
self._driver.click(selector)

# Find elements
elements = self._driver.find_elements(selector)

# Get attribute
href = element.get_attribute('href')

# Get text
text = element.text

# Wait for element
self._driver.wait_for_element(selector)

# Execute JavaScript
self._driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
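
These operations typically combine into a small helper like the one below; the FakeDriver class is a stand-in for the project's driver wrapper so the sketch runs without a browser, and all names are illustrative:

```python
def collect_pdf_links(driver, url, selector="a[href$='.pdf']"):
    """Open a page, wait for the links to render, and return their hrefs."""
    driver.open(url)
    driver.wait_for_element(selector)
    return [el.get_attribute("href") for el in driver.find_elements(selector)]

class FakeDriver:
    """Minimal stand-in for the project's Selenium wrapper, for illustration."""
    def open(self, url):
        self.url = url
    def wait_for_element(self, selector):
        pass  # a real driver would block until the element appears
    def find_elements(self, selector):
        class Link:
            def get_attribute(self, name):
                return "https://example.com/paper.pdf"
        return [Link()]
```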

Code Reference

Below is the detailed API documentation for all scraper classes:

Base Classes

scraper.base_scraper

BaseScraper

Bases: ABC

config_model_type abstractmethod property

Return the configuration model type. This property must be implemented in the derived class.

Returns:

Type Description
Type[BaseConfig]

Type[BaseConfig]: The configuration model type

post_process(scrape_output) abstractmethod

Post-process the scraped output. This method is called after the sources have been scraped. It is used to retrieve the final list of processed URLs. This method must be implemented in the derived class.

Parameters:

Name Type Description Default
scrape_output Any

The scraped output

required

Returns:

Type Description
Dict[str, List[str]] | List[str]

Dict[str, List[str]] | List[str]: A dictionary or a list containing the processed links

resume_scraping()

Resume the scraping of the resources that failed to scrape.

resume_uploads()

Resume the uploads of the resources that failed to upload.

scrape() abstractmethod

Scrape the resources links. This method must be implemented in the derived class.

Returns:

Name Type Description
Any Any | None

The output of the scraping, or None if something went wrong.

scrape_failure(failure) abstractmethod

Scrape the failed resource. This method must be implemented in the derived class.

Parameters:

Name Type Description Default
failure ScraperFailure

The failure model.

required

Returns:

Type Description
List[str]

List[str]: The list of the successfully scraped links

upload_to_s3(sources_links)

Upload the source files to S3.

Parameters:

Name Type Description Default
sources_links Dict[str, List[str]] | List[str]

The list of links of the various sources.

required

scraper.base_crawling_scraper

BaseCrawlingScraper

Bases: BaseScraper

crawling_folder_path abstractmethod property

The folder path where the crawling files are stored. This property must be implemented in the derived class.

Returns:

Name Type Description
str str

The folder path.

scrape()

Crawl the website recursively, scraping each discovered page.

Returns:

Name Type Description
BaseCrawlingScraperOutput BaseCrawlingScraperOutput | None

The output of the scraper, or None if the scraping failed.

scraper.base_iterative_publisher_scraper

BaseIterativePublisherScraper

Bases: BaseScraper

journal_identifier(model) abstractmethod

Return the journal identifier. This method must be implemented in the derived class.

Parameters:

Name Type Description Default
model BaseIterativePublisherJournal

The configuration model.

required

Returns:

Name Type Description
str str

The journal identifier

post_process(scrape_output)

Extract the PDF links from the dictionary.

Parameters:

Name Type Description Default
scrape_output IterativePublisherScrapeOutput

A dictionary containing the PDF links.

required

Returns:

Type Description
List[str]

List[str]: A list of strings containing the PDF links

scrape()

Scrape the journals for PDF links.

Returns:

Type Description
IterativePublisherScrapeOutput | None

IterativePublisherScrapeOutput | None: A dictionary containing the PDF links, or None if no link was found.

scraper.base_mapped_publisher_scraper

BaseMappedPublisherScraper

Bases: BaseScraper

config_model_type property

Return the configuration model type.

Returns:

Type Description
Type[BaseMappedConfig]

Type[BaseMappedConfig]: The configuration model type

mapping abstractmethod property

Return the mapping of the scraper to the source. This method must be implemented in the derived class.

Returns:

Type Description
Dict[str, Type[BaseMappedSubScraper]]

Dict[str, Type[BaseMappedSubScraper]]: The mapping of the scraper to the source

post_process(scrape_output)

Post-process the scraped output. This method is called after the sources have been scraped. It is used to retrieve the final list of processed URLs.

Parameters:

Name Type Description Default
scrape_output Dict[str, List[str] | Dict[str, List[str]]]

The scraped output

required

Returns:

Type Description
Dict[str, List[str]]

Dict[str, List[str]]: The results of the scraping

scrape()

Scrape the resources links.

Returns:

Type Description
Dict[str, List[str] | Dict[str, List[str]]] | None

Dict[str, List[str] | Dict[str, List[str]]] | None: The output of the scraping, or None if the scraping failed.

scraper.base_pagination_publisher_scraper

BasePaginationPublisherScraper

Bases: BaseScraper

post_process(scrape_output)

Extract the href attribute from the links.

Parameters:

Name Type Description Default
scrape_output BasePaginationPublisherScrapeOutput

A dictionary containing the PDF links. Each key is the name of a source for which PDF links have been found, and the value is the list of those links.

required

Returns:

Type Description
List[str]

List[str]: A list of strings containing the PDF links

scrape() abstractmethod

Scrape the resources links. This method must be implemented in the derived class.

Returns:

Type Description
BasePaginationPublisherScrapeOutput | None

BasePaginationPublisherScrapeOutput | None: The output of the scraping, i.e., a dictionary containing the PDF links. Each key is the name of a source for which PDF links have been found, and the value is the list of those links.

scraper.base_url_publisher_scraper

BaseUrlPublisherScraper

Bases: BaseScraper

post_process(scrape_output)

Extract the href attribute from the links.

Parameters:

Name Type Description Default
scrape_output ResultSet | List[Tag]

A ResultSet (i.e., a list) or a list of Tag objects containing the tags to the PDF links.

required

Returns:

Type Description
List[str]

List[str]: A list of strings containing the PDF links

scrape()

Scrape the source URLs for PDF links.

Returns:

Type Description
ResultSet | List[Tag] | None

ResultSet | List[Tag]: A ResultSet (i.e., a list) or a list of Tag objects containing the tags to the PDF links. If no tag was found, return None.

scraper.base_source_download_scraper

BaseSourceDownloadScraper

Bases: BaseScraper, ABC

Publisher Scrapers

scraper.ams_scraper

scraper.arxiv_scraper

scraper.cambridge_university_press_scraper

scraper.elsevier_scraper

ElsevierScraper

Bases: BaseSourceDownloadScraper

__scrape_issue(source)

Scrape the issue for the PDFs. The logic is as follows:

  • Find the next issue URL, i.e., the URL of the previous issue, if it exists.
  • Check if there are any PDFs to download. If not, try with the next issue.
  • Download the PDFs in a zip file and wait for the download to complete.
  • Unpack the zip files in a temporary folder.
  • Return the result of the scraping. If the issue was scraped successfully, return the next issue URL, i.e., the URL of the previous issue to scrape next.
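
The steps above can be sketched as follows, with the browser interactions injected as callables; all names here are illustrative, not the project's actual API:

```python
def scrape_issue(issue_url, find_previous_issue, list_pdfs, download_zip, unpack):
    """Return whether the issue yielded PDFs, plus the previous issue to visit next."""
    next_url = find_previous_issue(issue_url)  # the "next" issue is the older one
    if not list_pdfs(issue_url):
        return False, next_url                 # nothing to download; try the next issue
    archive = download_zip(issue_url)          # zip containing the issue's PDFs
    unpack(archive)                            # extract into a temporary folder
    return True, next_url
```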

Parameters:

Name Type Description Default
source ElsevierSource

The source model.

required

Returns:

Name Type Description
ElsevierScrapeIssueOutput ElsevierScrapeIssueOutput

The result of the scraping.

__scrape_journal(source)

Scrape the journal for the issues. The logic is as follows:

  • Get the first issue link from the journal page, i.e., the newest issue.
  • Scrape the issue and get the next issue URL. If the issue was scraped successfully, add the issue URL to the list of journal links.
  • Repeat the process until there are no more issues to scrape.
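
The journal walk then chains the per-issue scrapes from newest to oldest; again, a sketch with illustrative names:

```python
def scrape_journal(first_issue_url, scrape_issue):
    """scrape_issue(url) -> (success, previous_issue_url_or_None)."""
    journal_links, url = [], first_issue_url
    while url:
        success, next_url = scrape_issue(url)
        if success:
            journal_links.append(url)  # keep the URL of each scraped issue
        url = next_url                 # repeat until no older issue exists
    return journal_links
```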

Parameters:

Name Type Description Default
source ElsevierSource

The source model.

required

Returns:

Type Description
List[str] | None

List[str] | None: The list of journal links if the journal was scraped successfully, None otherwise

scraper.frontiers_scraper

scraper.ieee_scraper

scraper.intechopen_scraper

scraper.iop_scraper

scraper.isprs_scraper

ISPRSScraper

Bases: BaseScraper

__scrape_archive_article(article_link)

Scrape a single article from the archives. The article contains the PDF link. If the article does not contain a PDF link, it will be saved as a failure.

Parameters:

Name Type Description Default
article_link str

The article link to scrape.

required

Returns:

Type Description
str | None

str | None: The PDF link found in the article.

__scrape_archives(archive_links)

Scrape the archives for PDF links. The archives contain links to articles, which in turn contain the PDF links.

Parameters:

Name Type Description Default
archive_links List[str]

A list of archive links.

required

Returns:

Type Description
List[str]

List[str]: A list of PDF links found in the archives.

__scrape_proceedings(proceedings_urls)

Scrape the proceedings for PDF links. The proceedings contain links to articles, which in turn contain the PDF links.

Parameters:

Name Type Description Default
proceedings_urls List[str]

A list of proceedings links.

required

Returns:

Type Description
List[str]

List[str]: A list of PDF links found in the proceedings

scraper.mdpi_scraper

MDPIJournalsScraper

Bases: BaseIterativePublisherScraper, BaseMappedSubScraper

__scrape_url(url)

Scrape the issue URL for PDF links.

Parameters:

Name Type Description Default
url str

The issue URL.

required

Returns:

Type Description
IterativePublisherScrapeIssueOutput | None

IterativePublisherScrapeIssueOutput | None: A list of PDF links found in the issue, or None if something went wrong.

scraper.ncbi_scraper

scraper.oxford_academic_scraper

OxfordAcademicScraper

Bases: BaseIterativePublisherScraper

__scrape_issue(issue_url)

Scrape the issue URL for PDF links.

Parameters:

Name Type Description Default
issue_url str

The issue URL to scrape.

required

Returns:

Type Description
IterativePublisherScrapeIssueOutput | None

IterativePublisherScrapeIssueOutput | None: A list of PDF links found in the issue, or None if something went wrong

scraper.sage_scraper

scraper.springer_scraper

scraper.taylor_and_francis_scraper

scraper.wiley_scraper

Data Source Scrapers

scraper.copernicus_catalogue_scraper

scraper.copernicus_scraper

CopernicusScraper

Bases: BaseIterativeWithConstraintPublisherScraper

__scrape_article(article_url)

Scrape a single article.

Parameters:

Name Type Description Default
article_url str

The article URL to scrape.

required

Returns:

Type Description
str | None

str | None: The string containing the PDF link.

__scrape_issue(issue_url)

Scrape the issue URL for PDF links.

Parameters:

Name Type Description Default
issue_url str

The issue URL to scrape.

required

Returns:

Type Description
IterativePublisherScrapeIssueOutput | None

IterativePublisherScrapeIssueOutput | None: A list of PDF links found in the issue, or None if something went wrong

scraper.earth_data_science_scraper

scraper.eoa_scraper

scraper.eoge_scraper

scraper.eos_scraper

scraper.esa_scraper

scraper.eumetsat_scraper

scraper.jaxa_scraper

scraper.mit_scraper

scraper.nasa_scraper

scraper.open_night_lights_scraper

scraper.seos_scraper

SeosScraper

Bases: BaseScraper

config_model_type property

Return the configuration model type.

Returns:

Type Description
Type[SeosConfig]

Type[SeosConfig]: The configuration model type

__scrape_source(source)

Scrape the source URL for HTML links.

Parameters:

Name Type Description Default
source SeosSource

The source to scrape.

required

Returns:

Type Description
List[str]

List[str]: A list of HTML links.

post_process(scrape_output)

Extract the href attribute from the links.

Parameters:

Name Type Description Default
scrape_output Dict[str, List[Tag]]

A dictionary collecting, for each source, the corresponding list of Tag objects containing the tags to the HTML links.

required

Returns:

Type Description
List[str]

List[str]: A list of strings containing the HTML links

scrape()

Scrape the Seos sources for HTML links.

Returns:

Type Description
Dict[str, List[str]] | None

Dict[str, List[str]]: a dictionary collecting, for each source, the corresponding list of the HTML links. If no link was found, return None.

scraper.uk_met_office_scraper

scraper.wikipedia_scraper