Chunking

The chunking stage splits large documents into smaller, semantically meaningful chunks that are suitable for downstream processing like embedding generation and vector database upload.

Overview

Chunking is essential for:

  • Vector database upload: Breaking documents into appropriately-sized pieces for embedding
  • Semantic retrieval: Creating chunks that represent coherent topics or concepts
  • Context window management: Ensuring chunks fit within model token limits
  • Performance optimization: Parallelizing processing across multiple chunks

The Eve pipeline uses a two-step chunking strategy (header-based splitting followed by sentence-based splitting), augmented by two refinements, that preserves document structure and special content:

  1. Header-based splitting: First splits documents by Markdown headers to maintain semantic structure
  2. Sentence-based splitting: If sections exceed the size limit, further splits them by sentences
  3. Smart merging: Optionally merges small chunks back together when they share compatible heading levels
  4. Content preservation: Keeps LaTeX formulas, equations, and tables intact as atomic units

Features

  • Semantic chunking: Respects document structure by splitting on Markdown headers
  • LaTeX preservation: Keeps mathematical formulas and equations together
  • Table preservation: Maintains tables as complete units without splitting
  • Configurable overlap: Add word-based overlap between chunks for better retrieval
  • Parallel processing: Uses multiprocessing for fast chunking of large document sets
  • Header inclusion: Optionally adds section headers to chunks for context

Configuration

Step name: chunker

Configuration Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| max_chunk_size | int | No | 512 | Maximum size of any chunk in words |
| chunk_overlap | int | No | 0 | Number of characters to overlap between chunks during secondary splitting |
| word_overlap | int | No | 0 | Number of words to overlap between chunks (takes precedence over chunk_overlap) |
| add_headers | bool | No | False | Whether to prepend section headers to chunk content |
| merge_small_chunks | bool | No | True | Whether to merge small chunks that share compatible heading levels |
| headers_to_split_on | list[int] | No | [1, 2, 3, 4, 5, 6] | Markdown header levels to split on (1=#, 2=##, etc.) |
| max_workers | int | No | None | Number of parallel workers (None = CPU count) |

Basic Configuration

- name: chunker
  config:
    max_chunk_size: 512
    add_headers: true
    merge_small_chunks: true

Advanced Configuration

- name: chunker
  config:
    max_chunk_size: 1024
    chunk_overlap: 0
    word_overlap: 50
    add_headers: true
    merge_small_chunks: true
    headers_to_split_on: [1, 2, 3]  # Only split on H1, H2, H3
    max_workers: 8  # Use 8 parallel workers
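The word_overlap option can be pictured with a minimal sketch (not the pipeline's actual implementation): the last word_overlap words of each chunk are prepended to the next chunk, so consecutive chunks share context at their boundary.

```python
def apply_word_overlap(chunks: list[str], word_overlap: int) -> list[str]:
    """Prepend the last `word_overlap` words of each chunk to the next chunk."""
    if word_overlap <= 0 or len(chunks) < 2:
        return chunks
    result = [chunks[0]]
    for prev, curr in zip(chunks, chunks[1:]):
        # Carry the tail of the previous chunk into the current one
        tail = prev.split()[-word_overlap:]
        result.append(" ".join(tail) + " " + curr)
    return result
```

With word_overlap: 2, a chunk ending in "...three four" would cause the next chunk to begin with "three four ...".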

How It Works

Two-Step Chunking Strategy

Step 1: Header-Based Splitting

The chunker first splits the document based on Markdown headers:

# Introduction
This is the introduction text...

## Background
This is the background section...

## Methods
This section describes methods...

This creates initial chunks at natural document boundaries.
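Conceptually, the header pass works like the standalone sketch below. This is an illustration only: it handles ATX headers with a regex, whereas the real chunker delegates to a Markdown header splitter and tracks the full header path for each section.

```python
import re

def split_by_headers(markdown: str, levels: tuple[int, ...] = (1, 2, 3, 4, 5, 6)):
    """Split Markdown text into (header, body) sections at the given ATX header levels."""
    pattern = re.compile(r"^(#{1,6})\s+(.*)$")
    sections, header, body = [], None, []
    for line in markdown.splitlines():
        m = pattern.match(line)
        if m and len(m.group(1)) in levels:
            # Close the previous section before starting a new one
            if header is not None or body:
                sections.append((header, "\n".join(body).strip()))
            header, body = line, []
        else:
            body.append(line)
    sections.append((header, "\n".join(body).strip()))
    return sections
```

Restricting levels (e.g. to (1, 2)) mirrors the headers_to_split_on parameter: deeper headers then stay inside their parent section's chunk.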

Step 2: Size-Based Splitting

If any chunk exceeds max_chunk_size, it's further split using sentence boundaries while preserving:

  • LaTeX environments (\begin{...}...\end{...})
  • Inline and display math ($...$, $$...$$)
  • Markdown tables
  • Figure and table references
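The size pass can be sketched as follows, assuming a word-count size measure and a naive punctuation-based sentence split. The real chunker additionally treats the atomic spans listed above as unsplittable, which this sketch omits for brevity.

```python
import re

def split_by_sentences(text: str, max_chunk_size: int) -> list[str]:
    """Greedily pack sentences into chunks of at most `max_chunk_size` words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Flush the current chunk if adding this sentence would exceed the limit
        if current and count + words > max_chunk_size:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because packing is greedy at sentence boundaries, a chunk can only exceed max_chunk_size when a single sentence (or atomic span) is itself longer than the limit.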

Step 3: Smart Merging

If merge_small_chunks: true, the chunker merges adjacent chunks when:

  1. Combined length doesn't exceed max_chunk_size
  2. Chunks have compatible heading levels, meaning either:
       • Same-level headers (e.g., two H2 sections)
       • The previous chunk has a higher-level header (e.g., an H1 followed by an H2)

This prevents overly small chunks while maintaining semantic coherence.
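Under those rules, the merge pass could look like the sketch below. It is a simplification, not the pipeline's implementation: each chunk is represented as a (heading_level, text) pair, where level 1 (H1) is the highest, and size is measured in words.

```python
def merge_small_chunks(chunks: list[tuple[int, str]], max_chunk_size: int) -> list[tuple[int, str]]:
    """Merge adjacent (heading_level, text) chunks when the combined word count fits
    and the previous chunk's heading level is the same or higher (numerically <=)."""
    merged: list[tuple[int, str]] = []
    for level, text in chunks:
        if merged:
            prev_level, prev_text = merged[-1]
            combined = prev_text + "\n\n" + text
            # Same-level (H2 + H2) or parent-child (H1 + H2) sections may merge
            if prev_level <= level and len(combined.split()) <= max_chunk_size:
                merged[-1] = (prev_level, combined)
                continue
        merged.append((level, text))
    return merged
```

Note that a deep section followed by a higher-level one (e.g., an H2 followed by an H1) is never merged, since that would fold a new top-level section into a subsection of the previous one.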

Header Inclusion

When add_headers: true, section headers are prepended to each chunk:

Without headers:

This section describes the methodology used in the study...

With headers:

# Introduction
## Methods
This section describes the methodology used in the study...

This provides context for each chunk, especially useful for retrieval systems.
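The prepending step itself is straightforward, as the sketch below shows. The header path (outermost header first) is assumed to have been tracked during header splitting; here it is simply passed in.

```python
def prepend_headers(headers: list[str], content: str) -> str:
    """Prepend a chunk's header path (outermost first) to its content,
    e.g. ['# Introduction', '## Methods'] becomes two header lines above the text."""
    if not headers:
        return content
    return "\n".join(headers) + "\n" + content
```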

Content Preservation

The chunker intelligently handles special content:

LaTeX Formulas:

The equation \begin{equation}
E = mc^2
\end{equation} is preserved intact.

Tables:

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |

These are never split mid-formula or mid-table, even if they exceed max_chunk_size.
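One way to identify such atomic spans before sentence splitting is a single regex pass, as in the sketch below. The patterns are illustrative, not the pipeline's actual ones: they cover LaTeX environments, display math, and runs of Markdown table rows.

```python
import re

# Illustrative patterns for content that must never be split mid-span
ATOMIC_PATTERN = re.compile(
    r"\\begin\{(\w+)\}.*?\\end\{\1\}"   # LaTeX environments
    r"|\$\$.*?\$\$"                      # display math
    r"|(?:^\|.*\|\s*$\n?)+",             # consecutive Markdown table rows
    re.DOTALL | re.MULTILINE,
)

def find_atomic_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of spans that must stay intact."""
    return [m.span() for m in ATOMIC_PATTERN.finditer(text)]
```

A splitter can then treat each returned span as a single indivisible token when packing chunks, so a formula or table lands in exactly one chunk even if that chunk ends up larger than max_chunk_size.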

Use Cases

Small Chunks for Dense Retrieval

- name: chunker
  config:
    max_chunk_size: 256
    word_overlap: 20
    add_headers: true
    merge_small_chunks: false

Creates small, focused chunks with overlap for better semantic retrieval.

Large Chunks for Context

- name: chunker
  config:
    max_chunk_size: 2048
    add_headers: false
    merge_small_chunks: true
    headers_to_split_on: [1, 2]  # Only split on major sections

Creates larger chunks that preserve more context, suitable for summarization or large context windows.

Academic Papers

- name: chunker
  config:
    max_chunk_size: 512
    add_headers: true
    merge_small_chunks: true
    headers_to_split_on: [1, 2, 3, 4, 5, 6]

Respects the hierarchical structure of academic papers while maintaining readable chunk sizes.

Output Format

Each chunk becomes a separate Document with:

  • content: The chunk text (with headers if add_headers: true)
  • file_path: Original document file path
  • file_format: Original document format
  • metadata.headers: List of Markdown headers that apply to this chunk

Example Output

Document(
    content="# Introduction\n## Background\nThis paper discusses...",
    file_path="papers/paper1.pdf",
    file_format="pdf",
    metadata={
        "headers": ["#Introduction", "##Background"],
        # ... other metadata from original document
    }
)

Performance

The chunker uses parallel processing to handle large document sets efficiently:

  • Documents are processed in separate processes using ProcessPoolExecutor
  • Each process runs an independent chunker instance
  • Results are collected and flattened into a single list
  • Set max_workers to control parallelism (defaults to CPU count)

Performance tip: Chunking is CPU-bound, so the default max_workers=None (one worker per CPU core) is a good starting point. For very large documents or document sets, experiment with different worker counts to find the best throughput.

Integration with Other Steps

Typical Pipeline Order

pipeline:
  inputs:
    path: "documents"
  stages:
    - name: extraction

    - name: deduplication
      config:
        method: "lsh"

    - name: cleaning

    - name: chunker
      config:
        max_chunk_size: 512
        add_headers: true

    - name: embedding  # Or qdrant upload

    - name: export

Before Vector Database Upload

Chunking is typically done before uploading to vector databases:

- name: chunker
  config:
    max_chunk_size: 512
    add_headers: true

- name: qdrant
  config:
    database:
      collection_name: "documents"
    # ... other qdrant config

This ensures each chunk gets its own embedding vector in the database.

Best Practices

  1. Choose an appropriate chunk size:
       • Smaller chunks (256-512 words) for dense retrieval
       • Larger chunks (1024-2048 words) for summarization or large context models

  2. Add headers for context: Enable add_headers: true when chunks will be retrieved without surrounding context

  3. Merge small chunks: Keep merge_small_chunks: true to avoid tiny chunks that lack sufficient context

  4. Adjust header levels: For documents with deep nesting, limit headers_to_split_on to major sections only

Troubleshooting

Chunks are too large

  • Decrease max_chunk_size
  • Add more header levels to headers_to_split_on
  • Set merge_small_chunks: false

Chunks are too small

  • Increase max_chunk_size
  • Set merge_small_chunks: true
  • Reduce header levels in headers_to_split_on

LaTeX formulas are broken

The chunker should preserve LaTeX automatically. If formulas are breaking:

  • Check that LaTeX uses proper \begin{...} and \end{...} syntax
  • Verify formulas aren't malformed in the original document
  • Review the cleaning step output before chunking

Slow performance

  • Adjust max_workers (try different values)
  • Ensure you're chunking after deduplication and cleaning
  • Consider increasing max_chunk_size to reduce total chunk count

Code Reference

Document chunking step using semantic two-step chunking strategy.

ChunkerStep

Bases: PipelineStep

Chunk documents into smaller, semantically meaningful pieces.

Uses a two-step chunking strategy:

  1. Split by Markdown headers to maintain document structure
  2. Further split large sections by sentences while preserving LaTeX and tables
  3. Optionally merge small chunks that share compatible heading levels

The chunker processes documents in parallel using multiprocessing for performance.

Config parameters:

- max_chunk_size (int): Maximum size of any chunk in words (default: 512)
- chunk_overlap (int): Number of characters to overlap between chunks (default: 0)
- word_overlap (int): Number of words to overlap between chunks (default: 0)
- add_headers (bool): Whether to prepend section headers to chunks (default: False)
- merge_small_chunks (bool): Whether to merge small chunks with compatible headers (default: True)
- headers_to_split_on (list[int]): Markdown header levels to split on (default: [1, 2, 3, 4, 5, 6])
- max_workers (int): Number of parallel workers, None uses CPU count (default: None)

Examples:

Basic chunking with default settings

config: {max_chunk_size: 512}

Chunking with headers and overlap for retrieval

config: { max_chunk_size: 512, add_headers: true, word_overlap: 20, merge_small_chunks: true }

Large chunks for context preservation

config: { max_chunk_size: 2048, headers_to_split_on: [1, 2], merge_small_chunks: true }

Source code in eve/steps/chunking/chunker_step.py
class ChunkerStep(PipelineStep):
    """Chunk documents into smaller, semantically meaningful pieces.

    Uses a two-step chunking strategy:

    1. Split by Markdown headers to maintain document structure
    2. Further split large sections by sentences while preserving LaTeX and tables
    3. Optionally merge small chunks that share compatible heading levels

    The chunker processes documents in parallel using multiprocessing for performance.

    Config parameters:

        - max_chunk_size (int): Maximum size of any chunk in words (default: 512)
        - chunk_overlap (int): Number of characters to overlap between chunks (default: 0)
        - word_overlap (int): Number of words to overlap between chunks (default: 0)
        - add_headers (bool): Whether to prepend section headers to chunks (default: False)
        - merge_small_chunks (bool): Whether to merge small chunks with compatible headers (default: True)
        - headers_to_split_on (list[int]): Markdown header levels to split on (default: [1, 2, 3, 4, 5, 6])
        - max_workers (int): Number of parallel workers, None uses CPU count (default: None)

    Examples:
        # Basic chunking with default settings
        config: {max_chunk_size: 512}

        # Chunking with headers and overlap for retrieval
        config: {
            max_chunk_size: 512,
            add_headers: true,
            word_overlap: 20,
            merge_small_chunks: true
        }

        # Large chunks for context preservation
        config: {
            max_chunk_size: 2048,
            headers_to_split_on: [1, 2],
            merge_small_chunks: true
        }
    """

    def __init__(self, config: dict):
        """Initialize the chunker step.

        Args:
            config: Configuration dictionary containing chunking parameters
        """
        super().__init__(config, name="ChunkerStep")

        self.chunk_overlap = config.get("chunk_overlap", 0)
        self.max_chunk_size = config.get("max_chunk_size", 512)
        self.word_overlap = config.get("word_overlap", 0)
        self.add_headers = config.get("add_headers", False)
        self.merge_small_chunks = config.get("merge_small_chunks", True)
        self.headers_to_split_on = config.get("headers_to_split_on", [1, 2, 3, 4, 5, 6])
        self.max_workers = config.get("max_workers", None)  # None = CPU count

        self.chunker = MarkdownTwoStepChunker(
            self.max_chunk_size,
            self.chunk_overlap,
            self.add_headers,
            self.word_overlap,
            self.headers_to_split_on,
            self.merge_small_chunks,
        )

    async def execute(self, documents: List[Document]) -> List[Document]:
        """Execute chunking on documents in parallel.

        Processes each document independently using multiprocessing, then flattens
        all chunks into a single list.

        Args:
            documents: List of documents to chunk

        Returns:
            Flattened list of all chunks from all documents
        """
        self.logger.info(f"Chunking {len(documents)} documents")
        self.logger.info(f"Using max_chunk_size={self.max_chunk_size}, chunk_overlap={self.chunk_overlap}")
        self.logger.info(f"Parallel processing with max_workers={self.max_workers or 'CPU count'}")

        loop = asyncio.get_event_loop()

        # Serialize documents to plain dicts for pickling
        serialized_docs = [_serialize_document(doc) for doc in documents]

        # Create a partial function with the chunker configuration
        chunk_func = partial(
            _chunk_document,
            max_chunk_size=self.max_chunk_size,
            chunk_overlap=self.chunk_overlap,
            add_headers=self.add_headers,
            word_overlap=self.word_overlap,
            headers_to_split_on=self.headers_to_split_on,
            merge_small_chunks=self.merge_small_chunks,
        )

        # Process documents in parallel
        with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
            tasks = [
                loop.run_in_executor(executor, chunk_func, doc)
                for doc in serialized_docs
            ]
            results = await asyncio.gather(*tasks)

        # Flatten and deserialize results
        all_chunks = []
        for doc_chunks in results:
            all_chunks.extend([_deserialize_document(chunk) for chunk in doc_chunks])

        self.logger.info(f"Chunking complete: {len(documents)} documents -> {len(all_chunks)} chunks")

        return all_chunks

__init__(config)

Initialize the chunker step.

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing chunking parameters

required
Source code in eve/steps/chunking/chunker_step.py
def __init__(self, config: dict):
    """Initialize the chunker step.

    Args:
        config: Configuration dictionary containing chunking parameters
    """
    super().__init__(config, name="ChunkerStep")

    self.chunk_overlap = config.get("chunk_overlap", 0)
    self.max_chunk_size = config.get("max_chunk_size", 512)
    self.word_overlap = config.get("word_overlap", 0)
    self.add_headers = config.get("add_headers", False)
    self.merge_small_chunks = config.get("merge_small_chunks", True)
    self.headers_to_split_on = config.get("headers_to_split_on", [1, 2, 3, 4, 5, 6])
    self.max_workers = config.get("max_workers", None)  # None = CPU count

    self.chunker = MarkdownTwoStepChunker(
        self.max_chunk_size,
        self.chunk_overlap,
        self.add_headers,
        self.word_overlap,
        self.headers_to_split_on,
        self.merge_small_chunks,
    )

execute(documents) async

Execute chunking on documents in parallel.

Processes each document independently using multiprocessing, then flattens all chunks into a single list.

Parameters:

Name Type Description Default
documents List[Document]

List of documents to chunk

required

Returns:

Type Description
List[Document]

Flattened list of all chunks from all documents

Source code in eve/steps/chunking/chunker_step.py
async def execute(self, documents: List[Document]) -> List[Document]:
    """Execute chunking on documents in parallel.

    Processes each document independently using multiprocessing, then flattens
    all chunks into a single list.

    Args:
        documents: List of documents to chunk

    Returns:
        Flattened list of all chunks from all documents
    """
    self.logger.info(f"Chunking {len(documents)} documents")
    self.logger.info(f"Using max_chunk_size={self.max_chunk_size}, chunk_overlap={self.chunk_overlap}")
    self.logger.info(f"Parallel processing with max_workers={self.max_workers or 'CPU count'}")

    loop = asyncio.get_event_loop()

    # Serialize documents to plain dicts for pickling
    serialized_docs = [_serialize_document(doc) for doc in documents]

    # Create a partial function with the chunker configuration
    chunk_func = partial(
        _chunk_document,
        max_chunk_size=self.max_chunk_size,
        chunk_overlap=self.chunk_overlap,
        add_headers=self.add_headers,
        word_overlap=self.word_overlap,
        headers_to_split_on=self.headers_to_split_on,
        merge_small_chunks=self.merge_small_chunks,
    )

    # Process documents in parallel
    with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
        tasks = [
            loop.run_in_executor(executor, chunk_func, doc)
            for doc in serialized_docs
        ]
        results = await asyncio.gather(*tasks)

    # Flatten and deserialize results
    all_chunks = []
    for doc_chunks in results:
        all_chunks.extend([_deserialize_document(chunk) for chunk in doc_chunks])

    self.logger.info(f"Chunking complete: {len(documents)} documents -> {len(all_chunks)} chunks")

    return all_chunks

convert_langchain_doc(doc, chunk)

Convert a LangChain Document chunk to an Eve Document.

Extracts headers from chunk metadata and combines with original document metadata.

Parameters:

Name Type Description Default
doc Document

Original Eve Document

required
chunk Document

LangChain Document chunk with header metadata

required

Returns:

Type Description
Document

Eve Document with chunk content and combined metadata

Source code in eve/steps/chunking/chunker_step.py
def convert_langchain_doc(doc: Document, chunk: LangchainDocument) -> Document:
    """Convert a LangChain Document chunk to an Eve Document.

    Extracts headers from chunk metadata and combines with original document metadata.

    Args:
        doc: Original Eve Document
        chunk: LangChain Document chunk with header metadata

    Returns:
        Eve Document with chunk content and combined metadata
    """
    headers = ["#" * key + value for key, value in chunk.metadata.items()]
    return Document(
        content=chunk.page_content,
        file_path=doc.file_path,
        file_format=doc.file_format,
        metadata={"headers": headers, **doc.metadata},
    )