
Cleaning Stage

The cleaning stage improves document quality by removing OCR errors and noise artifacts, correcting formatting issues, and enhancing readability.

Features

  • Noise Removal: Fixes errors introduced during OCR extraction.
  • Nougat Correction: A series of post-processing cleaning steps from Nougat that make the document Markdown-compatible.
  • Rule-based Correction: Custom regex-based patterns that remove the most commonly occurring errors.
  • LaTeX Correction: Fixes mathematical equations and notation.
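
To make the rule-based correction concrete, here is a minimal Python sketch of how an ordered list of regex rules might be applied to extracted text. The patterns and the `apply_rules` name are hypothetical and chosen for illustration only; the patterns the stage actually ships with are not listed on this page.

```python
import re

# Hypothetical rules for illustration; not the stage's real pattern set.
# Each rule is (compiled pattern, replacement) and rules run in order.
RULES = [
    # Rejoin words split by a hyphen at a line break ("docu-\nment").
    # Note: this would also join genuine hyphenated compounds.
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),
    # Strip trailing spaces and tabs at the end of a line.
    (re.compile(r"[ \t]+\n"), "\n"),
    # Collapse three or more newlines into a single blank line.
    (re.compile(r"\n{3,}"), "\n\n"),
    # Collapse runs of spaces or tabs into one space.
    (re.compile(r"[ \t]{2,}"), " "),
]


def apply_rules(text: str) -> str:
    """Apply every rule to the text, in the order listed in RULES."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Order matters in a design like this: trailing whitespace is stripped before blank lines are collapsed, so lines that contain only spaces are treated as blank.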

Configuration

Basic Configuration

- name: cleaning
  config:
    ocr_threshold: 0.99

LLM Enhanced Cleaning (Optional)

For LaTeX correction, the LaTeX components are extracted and passed to an LLM for improvement. The corrected syntax is verified using pdflatex and then merged back into the document.

To use this module, set OPENROUTER_API_KEY in your .env file.

- name: cleaning
  config:
    enable_latex_correction: true
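
The extract, correct, verify, and merge flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the stage's actual implementation: it only handles display math delimited by `\[ ... \]`, the `correct` callable stands in for the LLM call, and the names `correct_latex` and `compiles_with_pdflatex` are hypothetical. The pdflatex check requires pdflatex to be on the PATH.

```python
import re
import subprocess
import tempfile
from pathlib import Path

# Assumption: only \[ ... \] display math is handled in this sketch.
MATH = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)


def compiles_with_pdflatex(body: str) -> bool:
    """Return True if pdflatex compiles the snippet in a minimal document."""
    doc = (
        "\\documentclass{article}\n\\usepackage{amsmath}\n"
        "\\begin{document}\n\\[" + body + "\\]\n\\end{document}\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        tex = Path(tmp) / "snippet.tex"
        tex.write_text(doc)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error",
             f"-output-directory={tmp}", str(tex)],
            capture_output=True,
        )
    return result.returncode == 0


def correct_latex(text, correct, verify=compiles_with_pdflatex):
    """Replace each math block with its corrected form if it verifies."""
    def swap(match):
        original = match.group(1)
        candidate = correct(original)  # hypothetical stand-in for the LLM call
        # Fall back to the original when the candidate fails verification.
        fixed = candidate if verify(candidate) else original
        return "\\[" + fixed + "\\]"
    return MATH.sub(swap, text)
```

Falling back to the original snippet when verification fails is a design choice of this sketch: it ensures a bad LLM suggestion never makes the document worse than the input.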

Configuration Parameters

ocr_threshold

  • Type: Float
  • Default: 0.99
  • Description: Controls the level of similarity required for two sentences to be considered duplicates.

min_words

  • Type: Int
  • Default: 2
  • Description: Defines the minimum number of words a sentence must have to be included in duplicate detection. The higher the value, the more accurately duplicate OCR segments are removed.
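
As a rough illustration of how ocr_threshold and min_words could interact, the Python sketch below drops a sentence when it is at least ocr_threshold similar to one already kept, and skips the comparison for sentences shorter than min_words. The similarity measure (`difflib.SequenceMatcher`), the choice to keep short sentences untouched, and the `remove_near_duplicates` name are all assumptions; the stage's actual metric is not documented here.

```python
from difflib import SequenceMatcher


def remove_near_duplicates(sentences, ocr_threshold=0.99, min_words=2):
    """Keep the first occurrence of each sentence, dropping near-duplicates."""
    kept = []
    for sentence in sentences:
        # Assumption: sentences below min_words are never compared, only kept.
        if len(sentence.split()) < min_words:
            kept.append(sentence)
            continue
        is_duplicate = any(
            len(prev.split()) >= min_words
            and SequenceMatcher(None, prev, sentence).ratio() >= ocr_threshold
            for prev in kept
        )
        if not is_duplicate:
            kept.append(sentence)
    return kept
```

With the default threshold of 0.99, only sentences that are almost character-for-character identical are dropped; lowering the threshold also catches copies that differ by a few OCR misreads, at the risk of removing sentences that are merely similar.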

enable_latex_correction

  • Type: Boolean
  • Default: false
  • Description: Uses an LLM to fix LaTeX formulas and tables.

openrouter_model

  • Type: String
  • Default: anthropic/claude-3-haiku
  • Description: The model used for LaTeX correction.

debug

  • Type: Boolean
  • Default: false
  • Description: Enables debug output.

Next Steps