
Scrapers

This section documents all available scrapers in the project for collecting Earth Observation and Remote Sensing data from various academic publishers, journals, and data sources.

Overview

The scraping system is designed around a hierarchical class structure where specialized scrapers inherit from base classes that provide common functionality. Each scraper is configured via config/config.json and targets specific publishers or data sources.

Configured Scrapers

The following table lists all scrapers currently configured in the system:

| Scraper | Base URL | Storage Folder | Description |
| --- | --- | --- | --- |
| IOPScraper | https://iopscience.iop.org | {main_folder}/iopscience | IOP Science journal articles and issues |
| MDPIScraper | https://www.mdpi.com | {main_folder}/mdpi | MDPI journals including Remote Sensing, Geosciences, Atmosphere |
| SpringerScraper | https://link.springer.com | {main_folder}/springer | Springer journals, books, and search results |
| AMSScraper | https://journals.ametsoc.org | {main_folder}/ams | American Meteorological Society publications |
| CopernicusScraper | Multiple Copernicus journals | {main_folder}/copernicus | 16+ Copernicus open-access journals |
| CopernicusCatalogueScraper | https://www.copernicus.eu/ | {main_folder}/copernicus | Copernicus services catalogue |
| SeosScraper | https://seos-project.eu | {main_folder}/seos | SEOS project educational materials |
| NCBIScraper | https://www.ncbi.nlm.nih.gov | {main_folder}/ncbi | NCBI PubMed Central articles |
| CambridgeUniversityPressScraper | https://www.cambridge.org | {main_folder}/cambridge_university_press | Cambridge University Press journals |
| OxfordAcademicScraper | https://academic.oup.com | {main_folder}/oxford_academic | Oxford Academic journals |
| IEEEScraper | https://ieeexplore.ieee.org | {main_folder}/ieee | IEEE Xplore open access articles |
| TaylorAndFrancisScraper | https://www.tandfonline.com | {main_folder}/taylor_and_francis | Taylor & Francis journals |
| FrontiersScraper | https://www.frontiersin.org/ | {main_folder}/frontiers | Frontiers in Remote Sensing |
| SageScraper | https://journals.sagepub.com | {main_folder}/sage | SAGE Publications journals |
| EOGEScraper | https://eoge.ut.ac.ir | {main_folder}/eoge | Earth Observations and Geomatics Engineering journal |
| ArxivScraper | https://arxiv.org | {main_folder}/arxiv | arXiv preprints |
| WileyScraper | Multiple Wiley domains | {main_folder}/wiley | Wiley journals (AGU, EOS, etc.) |
| EOSScraper | https://eos.org/ | {main_folder}/eos | EOS Science News archives |
| ESAScraper | Multiple ESA domains | {main_folder}/esa | ESA Earth Online, EO Portal, Sentiwiki |
| ElsevierScraper | https://www.sciencedirect.com | {main_folder}/elsevier | ScienceDirect open access journals |
| NASAScraper | Multiple NASA domains | {main_folder}/nasa | NASA EarthData, NTRS, EOS Portal |
| OpenNightLightsScraper | https://worldbank.github.io/OpenNightLights/ | {main_folder}/open_night_lights_scraper | World Bank Open Night Lights documentation |
| WikipediaScraper | https://en.wikipedia.org/ | {main_folder}/wikipedia | Wikipedia EO-related categories |
| MITScraper | https://ocw.mit.edu/ | {main_folder}/mit | MIT OpenCourseWare |
| JAXAScraper | https://earth.jaxa.jp/en/eo-knowledge | {main_folder}/jaxa | JAXA Earth Observation knowledge base |
| UKMetOfficeScraper | https://library.metoffice.gov.uk | {main_folder}/uk_met_office | UK Met Office library |
| EOAScraper | https://www.eoa.org.au/ | {main_folder}/eoa | Earth Observation Australia textbooks |
| ISPRSScraper | https://www.isprs.org/ | {main_folder}/isprs | ISPRS publication archives |
| EUMETSATScraper | Multiple EUMETSAT domains | {main_folder}/eumetsat | EUMETSAT documentation and case studies |
| EarthDataScienceScraper | https://www.earthdatascience.org/ | {main_folder}/earth_data_science | Earth Data Science tutorials |
| DirectLinksScraper | Various | {main_folder}/miscellaneous | Direct PDF links from multiple sources |
| IntechOpenScraper | https://www.intechopen.com/ | {main_folder}/intech_open | IntechOpen books and chapters |

Base Scraper Architecture

The scraping system is built on a set of abstract base classes that provide common functionality. Understanding these base classes is essential for extending or modifying the scraping behavior.

Base Class Descriptions

BaseScraper: The root abstract class that all scrapers inherit from. Provides core functionality including Selenium WebDriver management, cookie handling, S3 storage integration, database repository access, and analytics tracking. Every scraper implements the abstract scrape() method defined here.

BaseIterativePublisherScraper: Designed for publishers that organize content in a journal → volume → issue hierarchy. Iterates through volumes and issues systematically, with support for handling missing volumes/issues using consecutive threshold logic.
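
The consecutive-threshold walk can be sketched as follows; the function and parameter names are illustrative stand-ins, not the project's actual API:

```python
def iterate_volumes(fetch_volume, max_volume=9999, miss_threshold=3):
    """fetch_volume(number) returns a list of PDF links, or None if the
    volume does not exist. Stop after `miss_threshold` consecutive misses."""
    links, consecutive_misses = [], 0
    for volume in range(1, max_volume + 1):
        result = fetch_volume(volume)
        if result is None:
            consecutive_misses += 1
            if consecutive_misses >= miss_threshold:
                break  # assume no further volumes exist
            continue
        consecutive_misses = 0  # a hit resets the miss counter
        links.extend(result)
    return links
```

A single missing volume does not abort the walk; only an unbroken run of misses does, which tolerates gaps in a publisher's numbering.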

BasePaginationPublisherScraper: Handles publishers with paginated search results or article listings. Automatically navigates through pages until no more results are found, with configurable page sizes and maximum paper limits.
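
A minimal sketch of that pagination loop, with the page fetch injected as a callable (names are illustrative, not the project's actual API):

```python
def scrape_paginated(fetch_page, max_papers=None):
    """fetch_page(page_number) returns a list of links;
    an empty list signals that there are no more pages."""
    links, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break  # no more results
        links.extend(batch)
        if max_papers is not None and len(links) >= max_papers:
            return links[:max_papers]  # respect the configured paper limit
        page += 1
    return links
```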

BaseUrlPublisherScraper: Used for publishers where content URLs follow predictable patterns. Processes lists of URLs and extracts content directly without complex navigation.

BaseMappedPublisherScraper: For publishers that require mapping between different URL structures or identifiers before scraping content. Provides a two-stage process: first mapping, then scraping.
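
The two-stage process can be sketched as a dictionary mapping source identifiers to sub-scraper classes; every name below is illustrative, not the project's actual API:

```python
class PdfSubScraper:
    def scrape(self, url):
        return [url + "/article.pdf"]

class HtmlSubScraper:
    def scrape(self, url):
        return [url + "/index.html"]

# Stage 1 data: which sub-scraper handles which kind of source.
MAPPING = {"pdf_site": PdfSubScraper, "html_site": HtmlSubScraper}

def scrape_mapped(sources):
    results = {}
    for source_id, url in sources:
        sub_scraper = MAPPING[source_id]()            # stage 1: map identifier to scraper
        results[source_id] = sub_scraper.scrape(url)  # stage 2: scrape the source
    return results
```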

BaseCrawlingScraper: Implements recursive web crawling from a starting URL. Follows links within the same domain and extracts content from all discovered pages. Useful for documentation sites and knowledge bases.
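
The same-domain crawl can be sketched as follows, with the page fetch stubbed out as a callable (illustrative names, not the project's actual API):

```python
from urllib.parse import urlparse

def crawl(start_url, get_links):
    """Visit every page reachable from start_url within the same domain.
    get_links(url) stands in for fetching a page and extracting its links."""
    domain = urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        if url in seen or urlparse(url).netloc != domain:
            continue  # skip already-visited pages and external domains
        seen.add(url)
        queue.extend(get_links(url))
    return seen
```

The `seen` set prevents infinite loops on cyclic link structures, which are common in documentation sites.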

BaseSourceDownloadScraper: Specialized for direct file downloads where download URLs are known in advance. Handles PDF and other document formats directly without HTML parsing.
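
That direct-download path can be sketched like this, with the HTTP transfer injected as a callable so the sketch stays self-contained (illustrative names only):

```python
import os

def download_sources(urls, folder, fetch):
    """Save each known file URL directly; no HTML parsing involved.
    fetch(url) stands in for the HTTP download and returns the file bytes."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for url in urls:
        filename = url.rsplit("/", 1)[-1]   # derive the local name from the URL
        path = os.path.join(folder, filename)
        with open(path, "wb") as fh:
            fh.write(fetch(url))
        paths.append(path)
    return paths
```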

Adding a New Scraper

To extend the pipeline with a new scraper, follow these steps:

1. Create Scraper File

Create a new file in the scraper folder with the name of your scraper:

# scraper/new_publisher_scraper.py
from scraper.base_scraper import BaseScraper
from model.new_publisher_models import NewPublisherConfig

class NewPublisherScraper(BaseScraper):
    """
    Scraper for New Publisher website.
    """

    @property
    def config_model_type(self):
        """Return the Pydantic model for configuration."""
        return NewPublisherConfig

    def scrape(self):
        """
        Scrape the website and return scraped data.

        Returns:
            dict: Dictionary containing scraped data
        """
        # Implement your scraping logic here
        # Use self._driver for Selenium operations
        # Use self._config_model to access configuration

        scraped_data = {}

        for source in self._config_model.sources:
            # Navigate to URL
            self._driver.open(source.url)

            # Handle cookies if needed
            if not self._cookie_handled and self._config_model.cookie_selector:
                self._driver.click(self._config_model.cookie_selector)
                self._cookie_handled = True

            # Extract data
            # ... your scraping logic ...

        return scraped_data

    def post_process(self, scraped_data):
        """
        Process scraped data and return URLs to download.

        Args:
            scraped_data: Data returned from scrape()

        Returns:
            List[str]: List of URLs to download/upload
        """
        urls = []

        # Process scraped_data and extract download URLs
        # ... your post-processing logic ...

        return urls

2. Create Model File (Optional)

If you need custom Pydantic models, create a file in the model folder:

# model/new_publisher_models.py
from typing import List
from pydantic import Field
from model.base_models import BaseConfig, BaseSource

class NewPublisherSource(BaseSource):
    """Configuration for a single source."""
    url: str
    type: str = "journal"
    # Add custom fields as needed

class NewPublisherConfig(BaseConfig):
    """Configuration model for NewPublisherScraper."""
    base_url: str
    sources: List[NewPublisherSource]
    # Add any custom configuration fields
    custom_field: str = Field(default="default_value")

Note: If you need enumerators, extend Enum from the base_enum module.
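
As a sketch of such an enumerator (the concrete base_enum API is not shown here, so the standard-library Enum stands in, and the member names are illustrative):

```python
from enum import Enum  # stand-in for the project's base_enum module

class NewPublisherSourceType(Enum):
    """Illustrative enumerator for the `type` field of a source."""
    JOURNAL = "journal"
    BOOK = "book"
```

Using string values lets Pydantic coerce the plain strings found in config.json into enum members during validation.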

3. Choose the Right Base Class

Select the appropriate base class for your scraper:

  • BaseScraper: For custom scraping logic
  • BaseIterativePublisherScraper: For journal → volume → issue hierarchies
  • BasePaginationPublisherScraper: For paginated search results
  • BaseUrlPublisherScraper: For simple URL lists
  • BaseMappedPublisherScraper: For two-stage mapping and scraping
  • BaseCrawlingScraper: For recursive web crawling
  • BaseSourceDownloadScraper: For direct file downloads

Example using BaseIterativePublisherScraper:

from scraper.base_iterative_publisher_scraper import BaseIterativePublisherScraper

class NewJournalScraper(BaseIterativePublisherScraper):
    @property
    def config_model_type(self):
        return NewJournalConfig

    def _scrape_journal(self, journal):
        # Implement journal-specific scraping
        pass

4. Add Configuration to config.json

Add your scraper's configuration to config/config.json (the existing entries in that file provide further examples):

{
  "NewPublisherScraper": {
    "bucket_key": "{main_folder}/new_publisher",
    "base_url": "https://newpublisher.com",
    "cookie_selector": "button#accept-cookies",
    "files_by_request": true,
    "sources": [
      {
        "url": "https://newpublisher.com/articles",
        "type": "journal"
      }
    ]
  }
}

Configuration Keys

Required:

  • bucket_key: S3 storage path (use {main_folder} placeholder)

Optional:

  • base_url: Base URL of the website
  • cookie_selector: CSS selector for cookie banner acceptance button
  • files_by_request: Whether to download files via HTTP (default: true) or scrape them
  • request_with_proxy: Use proxy for requests (default: false)
  • sources: List of sources to scrape (structure depends on your model)

5. Test Your Scraper

Run your new scraper:

# Run with force to test from scratch
make run args="--scrapers NewPublisherScraper --force"

# Check logs for errors
docker logs <container-id>

6. Verify Results

Check that data was scraped and uploaded correctly:

  1. MinIO Console: Visit http://localhost:9100 and check your bucket
  2. Database: Query the scraper_output and uploaded_resource tables
  3. Analytics: Run make run args="--analytics-only --scrapers NewPublisherScraper"

Example: Complete Simple Scraper

Here's a complete example of a simple scraper:

# scraper/example_scraper.py
from typing import List
from scraper.base_scraper import BaseScraper
from model.base_models import BaseConfig, BaseSource

class ExampleConfig(BaseConfig):
    base_url: str
    sources: List[BaseSource]

class ExampleScraper(BaseScraper):
    @property
    def config_model_type(self):
        return ExampleConfig

    def scrape(self):
        pdf_links = []

        for source in self._config_model.sources:
            self._driver.open(source.url)

            # Find all PDF links
            links = self._driver.find_elements("a[href$='.pdf']")
            for link in links:
                href = link.get_attribute('href')
                if href:
                    pdf_links.append(href)

        return {"pdf_links": pdf_links}

    def post_process(self, scraped_data):
        return scraped_data.get("pdf_links", [])

With configuration:

{
  "ExampleScraper": {
    "bucket_key": "{main_folder}/example",
    "base_url": "https://example.com",
    "sources": [
      {"url": "https://example.com/papers"}
    ]
  }
}

Common Selenium Operations

# Open URL
self._driver.open(url)

# Click element
self._driver.click(selector)

# Find elements
elements = self._driver.find_elements(selector)

# Get attribute
href = element.get_attribute('href')

# Get text
text = element.text

# Wait for element
self._driver.wait_for_element(selector)

# Execute JavaScript
self._driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
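
These operations typically combine into a small helper like the one below; the FakeDriver class is a stand-in for the project's driver wrapper so the sketch runs without a browser, and all names are illustrative:

```python
def collect_pdf_links(driver, url, selector="a[href$='.pdf']"):
    """Open a page, wait for the links to render, and return their hrefs."""
    driver.open(url)
    driver.wait_for_element(selector)
    return [el.get_attribute("href") for el in driver.find_elements(selector)]

class FakeDriver:
    """Minimal stand-in for the project's Selenium wrapper, for illustration."""
    def open(self, url):
        self.url = url
    def wait_for_element(self, selector):
        pass  # a real driver would block until the element appears
    def find_elements(self, selector):
        class Link:
            def get_attribute(self, name):
                return "https://example.com/paper.pdf"
        return [Link()]
```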

Code Reference

Below is the detailed API documentation for all scraper classes:

Base Classes

scraper.base_scraper

BaseScraper

Bases: ABC

config_model_type abstractmethod property

Return the configuration model type. This property must be implemented in the derived class.

Returns:

Type Description
Type[BaseConfig]

Type[BaseConfig]: The configuration model type

post_process(scrape_output) abstractmethod

Post-process the scraped output. This method is called after the sources have been scraped. It is used to retrieve the final list of processed URLs. This method must be implemented in the derived class.

Parameters:

Name Type Description Default
scrape_output Any

The scraped output

required

Returns:

Type Description
Dict[str, List[str]] | List[str]

Dict[str, List[str]] | List[str]: A dictionary or a list containing the processed links

resume_scraping()

Resume the scraping of the resources that failed to scrape.

resume_uploads()

Resume the uploads of the resources that failed to upload.

scrape() abstractmethod

Scrape the resources links. This method must be implemented in the derived class.

Returns:

Name Type Description
Any Any | None

The output of the scraping, or None if something went wrong.

scrape_failure(failure) abstractmethod

Scrape the failed resource. This method must be implemented in the derived class.

Parameters:

Name Type Description Default
failure ScraperFailure

The failure model.

required

Returns:

Type Description
List[str]

List[str]: The list of the successfully scraped links

upload_to_s3(sources_links)

Upload the source files to S3.

Parameters:

Name Type Description Default
sources_links Dict[str, List[str]] | List[str]

The list of links of the various sources.

required

scraper.base_crawling_scraper

BaseCrawlingScraper

Bases: BaseScraper

crawling_folder_path abstractmethod property

The folder path where the crawling files are stored. This property must be implemented in the derived class.

Returns:

Name Type Description
str str

The folder path.

scrape()

Crawl the website recursively, scraping each discovered page.

Returns:

Name Type Description
BaseCrawlingScraperOutput BaseCrawlingScraperOutput | None

The output of the scraper, or None if the scraping failed.

scraper.base_iterative_publisher_scraper

BaseIterativePublisherScraper

Bases: BaseScraper

journal_identifier(model) abstractmethod

Return the journal identifier. This method must be implemented in the derived class.

Parameters:

Name Type Description Default
model BaseIterativePublisherJournal

The configuration model.

required

Returns:

Name Type Description
str str

The journal identifier

post_process(scrape_output)

Extract the PDF links from the dictionary.

Parameters:

Name Type Description Default
scrape_output IterativePublisherScrapeOutput

A dictionary containing the PDF links.

required

Returns:

Type Description
List[str]

List[str]: A list of strings containing the PDF links

scrape()

Scrape the journals for PDF links.

Returns:

Type Description
IterativePublisherScrapeOutput | None

IterativePublisherScrapeOutput | None: A dictionary containing the PDF links, or None if no link was found.

scraper.base_mapped_publisher_scraper

BaseMappedPublisherScraper

Bases: BaseScraper

config_model_type property

Return the configuration model type.

Returns:

Type Description
Type[BaseMappedConfig]

Type[BaseMappedConfig]: The configuration model type

mapping abstractmethod property

Return the mapping of the scraper to the source. This method must be implemented in the derived class.

Returns:

Type Description
Dict[str, Type[BaseMappedSubScraper]]

Dict[str, Type[BaseMappedSubScraper]]: The mapping of the scraper to the source

post_process(scrape_output)

Post-process the scraped output. This method is called after the sources have been scraped. It is used to retrieve the final list of processed URLs.

Parameters:

Name Type Description Default
scrape_output Dict[str, List[str] | Dict[str, List[str]]]

The scraped output

required

Returns:

Type Description
Dict[str, List[str]]

Dict[str, List[str]]: The results of the scraping

scrape()

Scrape the resources links.

Returns:

Type Description
Dict[str, List[str] | Dict[str, List[str]]] | None

Dict[str, List[str] | Dict[str, List[str]]] | None: The output of the scraping, or None if the scraping failed.

scraper.base_pagination_publisher_scraper

BasePaginationPublisherScraper

Bases: BaseScraper

post_process(scrape_output)

Extract the href attribute from the links.

Parameters:

Name Type Description Default
scrape_output BasePaginationPublisherScrapeOutput

A dictionary containing the PDF links. Each key is the name of a source for which PDF links have been found, and the value is the list of those links.

required

Returns:

Type Description
List[str]

List[str]: A list of strings containing the PDF links

scrape() abstractmethod

Scrape the resources links. This method must be implemented in the derived class.

Returns:

Type Description
BasePaginationPublisherScrapeOutput | None

BasePaginationPublisherScrapeOutput | None: The output of the scraping, i.e., a dictionary containing the PDF links. Each key is the name of a source for which PDF links have been found, and the value is the list of those links.

scraper.base_url_publisher_scraper

BaseUrlPublisherScraper

Bases: BaseScraper

post_process(scrape_output)

Extract the href attribute from the links.

Parameters:

Name Type Description Default
scrape_output ResultSet | List[Tag]

A ResultSet (i.e., a list) or a list of Tag objects containing the tags to the PDF links.

required

Returns:

Type Description
List[str]

List[str]: A list of strings containing the PDF links

scrape()

Scrape the source URLs for PDF links.

Returns:

Type Description
ResultSet | List[Tag] | None

ResultSet | List[Tag]: A ResultSet (i.e., a list) or a list of Tag objects containing the tags to the PDF links. If no tag was found, return None.

scraper.base_source_download_scraper

BaseSourceDownloadScraper

Bases: BaseScraper, ABC

Publisher Scrapers

scraper.ams_scraper

scraper.arxiv_scraper

scraper.cambridge_university_press_scraper

scraper.elsevier_scraper

ElsevierScraper

Bases: BaseSourceDownloadScraper

__scrape_issue(source)

Scrape the issue for the PDFs. The logic is as follows:

  • Find the next issue URL, i.e., the URL of the previous issue, if it exists.
  • Check if there are any PDFs to download. If not, try with the next issue.
  • Download the PDFs in a zip file and wait for the download to complete.
  • Unpack the zip files in a temporary folder.
  • Return the result of the scraping. If the issue was scraped successfully, return the next issue URL, i.e., the URL of the previous issue to scrape next.
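
The steps above can be sketched as follows, with the browser interactions injected as callables; all names here are illustrative, not the project's actual API:

```python
def scrape_issue(issue_url, find_previous_issue, list_pdfs, download_zip, unpack):
    """Return whether the issue yielded PDFs, plus the previous issue to visit next."""
    next_url = find_previous_issue(issue_url)  # the "next" issue is the older one
    if not list_pdfs(issue_url):
        return False, next_url                 # nothing to download; try the next issue
    archive = download_zip(issue_url)          # zip containing the issue's PDFs
    unpack(archive)                            # extract into a temporary folder
    return True, next_url
```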

Parameters:

Name Type Description Default
source ElsevierSource

The source model.

required

Returns:

Name Type Description
ElsevierScrapeIssueOutput ElsevierScrapeIssueOutput

The result of the scraping.

__scrape_journal(source)

Scrape the journal for the issues. The logic is as follows:

  • Get the first issue link from the journal page, i.e., the newest issue.
  • Scrape the issue and get the next issue URL. If the issue was scraped successfully, add the issue URL to the list of journal links.
  • Repeat the process until there are no more issues to scrape.
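
The journal walk then chains the per-issue scrapes from newest to oldest; again, a sketch with illustrative names:

```python
def scrape_journal(first_issue_url, scrape_issue):
    """scrape_issue(url) -> (success, previous_issue_url_or_None)."""
    journal_links, url = [], first_issue_url
    while url:
        success, next_url = scrape_issue(url)
        if success:
            journal_links.append(url)  # keep the URL of each scraped issue
        url = next_url                 # repeat until no older issue exists
    return journal_links
```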

Parameters:

Name Type Description Default
source ElsevierSource

The source model.

required

Returns:

Type Description
List[str] | None

List[str] | None: The list of journal links if the journal was scraped successfully, None otherwise

scraper.frontiers_scraper

scraper.ieee_scraper

scraper.intechopen_scraper

scraper.iop_scraper

scraper.isprs_scraper

ISPRSScraper

Bases: BaseScraper

__scrape_archive_article(article_link)

Scrape a single article from the archives. The article contains the PDF link. If the article does not contain a PDF link, it will be saved as a failure.

Parameters:

Name Type Description Default
article_link str

The article link to scrape.

required

Returns:

Type Description
str | None

str | None: The PDF link found in the article.

__scrape_archives(archive_links)

Scrape the archives for PDF links. The archives contain links to articles, which in turn contain the PDF links.

Parameters:

Name Type Description Default
archive_links List[str]

A list of archive links.

required

Returns:

Type Description
List[str]

List[str]: A list of PDF links found in the archives.

__scrape_proceedings(proceedings_urls)

Scrape the proceedings for PDF links. The proceedings contain links to articles, which in turn contain the PDF links.

Parameters:

Name Type Description Default
proceedings_urls List[str]

A list of proceedings links.

required

Returns:

Type Description
List[str]

List[str]: A list of PDF links found in the proceedings

scraper.mdpi_scraper

MDPIJournalsScraper

Bases: BaseIterativePublisherScraper, BaseMappedSubScraper

__scrape_url(url)

Scrape the issue URL for PDF links.

Parameters:

Name Type Description Default
url str

The issue URL.

required

Returns:

Type Description
IterativePublisherScrapeIssueOutput | None

IterativePublisherScrapeIssueOutput | None: A list of PDF links found in the issue, or None if something went wrong.

scraper.ncbi_scraper

scraper.oxford_academic_scraper

OxfordAcademicScraper

Bases: BaseIterativePublisherScraper

__scrape_issue(issue_url)

Scrape the issue URL for PDF links.

Parameters:

Name Type Description Default
issue_url str

The issue URL to scrape.

required

Returns:

Type Description
IterativePublisherScrapeIssueOutput | None

IterativePublisherScrapeIssueOutput | None: A list of PDF links found in the issue, or None if something went wrong

scraper.sage_scraper

scraper.springer_scraper

scraper.taylor_and_francis_scraper

scraper.wiley_scraper

Data Source Scrapers

scraper.copernicus_catalogue_scraper

scraper.copernicus_scraper

CopernicusScraper

Bases: BaseIterativeWithConstraintPublisherScraper

__scrape_article(article_url)

Scrape a single article.

Parameters:

Name Type Description Default
article_url str

The article URL to scrape.

required

Returns:

Type Description
str | None

str | None: The string containing the PDF link.

__scrape_issue(issue_url)

Scrape the issue URL for PDF links.

Parameters:

Name Type Description Default
issue_url str

The issue URL to scrape.

required

Returns:

Type Description
IterativePublisherScrapeIssueOutput | None

IterativePublisherScrapeIssueOutput | None: A list of PDF links found in the issue, or None if something went wrong

scraper.earth_data_science_scraper

scraper.eoa_scraper

scraper.eoge_scraper

scraper.eos_scraper

scraper.esa_scraper

scraper.eumetsat_scraper

scraper.jaxa_scraper

scraper.mit_scraper

scraper.nasa_scraper

scraper.open_night_lights_scraper

scraper.seos_scraper

SeosScraper

Bases: BaseScraper

config_model_type property

Return the configuration model type.

Returns:

Type Description
Type[SeosConfig]

Type[SeosConfig]: The configuration model type

__scrape_source(source)

Scrape the source URL for HTML links.

Parameters:

Name Type Description Default
source SeosSource

The source to scrape.

required

Returns:

Type Description
List[str]

List[str]: A list of HTML links.

post_process(scrape_output)

Extract the href attribute from the links.

Parameters:

Name Type Description Default
scrape_output Dict[str, List[Tag]]

A dictionary collecting, for each source, the corresponding list of Tag objects containing the tags to the HTML links.

required

Returns:

Type Description
List[str]

List[str]: A list of strings containing the HTML links

scrape()

Scrape the Seos sources for HTML links.

Returns:

Type Description
Dict[str, List[str]] | None

Dict[str, List[str]]: a dictionary collecting, for each source, the corresponding list of the HTML links. If no link was found, return None.

scraper.uk_met_office_scraper

scraper.wikipedia_scraper