Metadata Extraction Stage
The metadata extraction stage automatically identifies and extracts structured metadata from documents.
Extracted Metadata Fields
Document Identification
- title: Document title
- authors: List of author names
- doi: Digital Object Identifier
- url: Source URL or link
- year: Publication year
- journal: Journal or publication name
- publisher: Publisher name
Extraction Methods
PDF Metadata Extraction
Set up MonkeyOCR using the bash script under the \server directory. Then run the extraction with this command:
python3 parse.py <dir> --pred-abandon
The predictions are stored in the MonkeyOCR folder. You can then run the metadata extraction pipeline with the configuration below:
pipeline:
  batch_size: 2
  inputs:
    path: "htmls" # path to the folder
  stages:
    - name: metadata
      config:
        enabled_formats: ["pdf", "html", "txt", "md"]
    - name: export
      config: { format: "jsonl", output_dir: "output" }
- We first extract text from the first page of each PDF file using MonkeyOCR, since the DOI and the title usually appear on the first page.
- We extract DOIs using handwritten regex patterns.
- If the file is from arXiv, we call the arXiv API to retrieve the metadata.
- If the file is from another publisher, we call the Crossref API to retrieve the metadata.
- Fallback: if no DOI is present, we extract the title and query the Crossref API by title to retrieve the metadata.
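The lookup flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regex patterns, the function names `find_doi` and `plan_lookup`, and the rule used to decide that a file "is from arXiv" (an arXiv DOI prefix or an `arXiv:` stamp on the first page) are all assumptions. The sketch only builds the request URL for the public arXiv or Crossref endpoints and does not send any request.

```python
import re
from urllib.parse import quote

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# (Assumed pattern; the project's handwritten patterns may differ.)
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)
# New-style arXiv identifier stamp, e.g. "arXiv:2101.00001v2".
ARXIV_RE = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)
ARXIV_DOI_PREFIX = "10.48550/arxiv."


def find_doi(text):
    """Return the first DOI-looking string in text, or None."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def plan_lookup(first_page_text, title=None):
    """Return (source, request_url) for one document's first-page text."""
    doi = find_doi(first_page_text)
    if doi and doi.lower().startswith(ARXIV_DOI_PREFIX):
        arxiv_id = doi[len(ARXIV_DOI_PREFIX):]
        return "arxiv", f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    stamp = ARXIV_RE.search(first_page_text)
    if stamp:
        return "arxiv", f"http://export.arxiv.org/api/query?id_list={stamp.group(1)}"
    if doi:
        return "crossref", f"https://api.crossref.org/works/{quote(doi, safe='/')}"
    if title:
        # Fallback: no DOI found, so search Crossref by title instead.
        query = quote(title)
        return "crossref", f"https://api.crossref.org/works?query.bibliographic={query}&rows=1"
    return None, None
```

A title-based search can return a different paper with a similar title, so a real pipeline would likely compare the returned title against the extracted one before accepting the match.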
Other Format Extraction
For other formats such as HTML, TXT, and JSON, the extractor uses handwritten regex patterns to extract the document title and the page URL.
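As a rough illustration of regex-based title and URL extraction, the sketch below handles HTML, Markdown, and plain text. The patterns and the helper names `extract_title` and `extract_url` are hypothetical, and the heuristics (the `<title>` tag, a canonical link, the first Markdown heading, the first non-empty line) are assumptions rather than the extractor's actual rules.

```python
import html
import re

TITLE_TAG_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Simplification: only matches when rel="canonical" appears before href.
CANONICAL_RE = re.compile(
    r"<link[^>]*rel=[\"']canonical[\"'][^>]*href=[\"']([^\"']+)[\"']",
    re.IGNORECASE,
)
MD_HEADING_RE = re.compile(r"^#[ \t]+(.+)$", re.MULTILINE)
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")


def extract_title(text, fmt):
    """Best-effort title for a document of format fmt ("html", "md", "txt")."""
    if fmt == "html":
        match = TITLE_TAG_RE.search(text)
        if match is None:
            return None
        # Collapse whitespace and decode entities such as "&amp;".
        return html.unescape(" ".join(match.group(1).split()))
    if fmt == "md":
        match = MD_HEADING_RE.search(text)
        if match:
            return match.group(1).strip()
    # Plain-text fallback: the first non-empty line.
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return None


def extract_url(text, fmt):
    """Best-effort source URL: canonical link for HTML, else the first URL seen."""
    if fmt == "html":
        match = CANONICAL_RE.search(text)
        if match:
            return match.group(1)
    match = URL_RE.search(text)
    return match.group(0).rstrip(".,;") if match else None
```

Regexes are brittle on malformed or unusual markup, so a dedicated HTML parser may be a safer choice when the pages vary widely.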
Configuration Parameters
enabled_formats
- Type: List
- Default: ["pdf", "html", "txt", "md"]
- Description: The list of file formats to process.
export_metadata
- Type: Boolean
- Default: true
- Description: Whether to export metadata to a JSON file.
metadata_destination
- Type: String
- Default: ./output
- Description: Directory where the metadata file is saved.
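To make the defaults concrete, here is a minimal sketch of how these three parameters could be modeled in Python. The class name `MetadataConfig` and the `accepts` helper are hypothetical and are not part of the pipeline; only the field names and default values come from the list above.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class MetadataConfig:
    # Defaults mirror the documented parameter defaults.
    enabled_formats: list = field(
        default_factory=lambda: ["pdf", "html", "txt", "md"]
    )
    export_metadata: bool = True
    metadata_destination: str = "./output"

    def accepts(self, path):
        """Return True if the file extension is in enabled_formats."""
        suffix = Path(path).suffix.lower().lstrip(".")
        return suffix in self.enabled_formats
```

With the defaults, `MetadataConfig().accepts("paper.PDF")` is true, while a `.json` file is skipped unless "json" is added to `enabled_formats`.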
Next Steps
- Learn about document export