Chunking

The chunking stage splits large documents into smaller, semantically meaningful chunks that are suitable for downstream processing like embedding generation and vector database upload.

Overview

Chunking is essential for:

  • Vector database upload: Breaking documents into appropriately-sized pieces for embedding
  • Semantic retrieval: Creating chunks that represent coherent topics or concepts
  • Context window management: Ensuring chunks fit within model token limits
  • Performance optimization: Parallelizing processing across multiple chunks

The Eve pipeline uses a two-step chunking strategy (header-based splitting followed by sentence-based splitting), augmented by two refinements, that preserves document structure and special content:

  1. Header-based splitting: First splits documents by Markdown headers to maintain semantic structure
  2. Sentence-based splitting: If sections exceed the size limit, further splits them by sentences
  3. Smart merging: Optionally merges small chunks back together when they share compatible heading levels
  4. Content preservation: Keeps LaTeX formulas, equations, and tables intact as atomic units

Features

  • Semantic chunking: Respects document structure by splitting on Markdown headers
  • LaTeX preservation: Keeps mathematical formulas and equations together
  • Table preservation: Maintains tables as complete units without splitting
  • Configurable overlap: Add word-based overlap between chunks for better retrieval
  • Parallel processing: Uses multiprocessing for fast chunking of large document sets
  • Header inclusion: Optionally adds section headers to chunks for context

Configuration

Step name: chunker

Configuration Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| max_chunk_size | int | No | 512 | Maximum size of any chunk in words |
| chunk_overlap | int | No | 0 | Number of characters to overlap between chunks during secondary splitting |
| word_overlap | int | No | 0 | Number of words to overlap between chunks (takes precedence over chunk_overlap) |
| add_headers | bool | No | False | Whether to prepend section headers to chunk content |
| merge_small_chunks | bool | No | True | Whether to merge small chunks that share compatible heading levels |
| headers_to_split_on | list[int] | No | [1, 2, 3, 4, 5, 6] | Markdown header levels to split on (1=#, 2=##, etc.) |
| max_workers | int | No | None | Number of parallel workers (None = CPU count) |

Basic Configuration

- name: chunker
  config:
    max_chunk_size: 512
    add_headers: true
    merge_small_chunks: true

Advanced Configuration

- name: chunker
  config:
    max_chunk_size: 1024
    chunk_overlap: 0
    word_overlap: 50
    add_headers: true
    merge_small_chunks: true
    headers_to_split_on: [1, 2, 3]  # Only split on H1, H2, H3
    max_workers: 8  # Use 8 parallel workers
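The word_overlap option can be pictured with a minimal sketch (not the pipeline's actual implementation): the last word_overlap words of each chunk are prepended to the next chunk, so consecutive chunks share context at their boundary.

```python
def apply_word_overlap(chunks: list[str], word_overlap: int) -> list[str]:
    """Prepend the last `word_overlap` words of each chunk to the next chunk."""
    if word_overlap <= 0 or len(chunks) < 2:
        return chunks
    result = [chunks[0]]
    for prev, curr in zip(chunks, chunks[1:]):
        # Carry the tail of the previous chunk into the current one
        tail = prev.split()[-word_overlap:]
        result.append(" ".join(tail) + " " + curr)
    return result
```

With word_overlap: 2, a chunk ending in "...three four" would cause the next chunk to begin with "three four ...".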

How It Works

Two-Step Chunking Strategy

Step 1: Header-Based Splitting

The chunker first splits the document based on Markdown headers:

# Introduction
This is the introduction text...

## Background
This is the background section...

## Methods
This section describes methods...

This creates initial chunks at natural document boundaries.
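Conceptually, the header pass works like the standalone sketch below. This is an illustration only: it handles ATX headers with a regex, whereas the real chunker delegates to a Markdown header splitter and tracks the full header path for each section.

```python
import re

def split_by_headers(markdown: str, levels: tuple[int, ...] = (1, 2, 3, 4, 5, 6)):
    """Split Markdown text into (header, body) sections at the given ATX header levels."""
    pattern = re.compile(r"^(#{1,6})\s+(.*)$")
    sections, header, body = [], None, []
    for line in markdown.splitlines():
        m = pattern.match(line)
        if m and len(m.group(1)) in levels:
            # Close the previous section before starting a new one
            if header is not None or body:
                sections.append((header, "\n".join(body).strip()))
            header, body = line, []
        else:
            body.append(line)
    sections.append((header, "\n".join(body).strip()))
    return sections
```

Restricting levels (e.g. to (1, 2)) mirrors the headers_to_split_on parameter: deeper headers then stay inside their parent section's chunk.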

Step 2: Size-Based Splitting

If any chunk exceeds max_chunk_size, it's further split using sentence boundaries while preserving:

  • LaTeX environments (\begin{...}...\end{...})
  • Inline and display math ($...$, $$...$$)
  • Markdown tables
  • Figure and table references
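The size pass can be sketched as follows, assuming a word-count size measure and a naive punctuation-based sentence split. The real chunker additionally treats the atomic spans listed above as unsplittable, which this sketch omits for brevity.

```python
import re

def split_by_sentences(text: str, max_chunk_size: int) -> list[str]:
    """Greedily pack sentences into chunks of at most `max_chunk_size` words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Flush the current chunk if adding this sentence would exceed the limit
        if current and count + words > max_chunk_size:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because packing is greedy at sentence boundaries, a chunk can only exceed max_chunk_size when a single sentence (or atomic span) is itself longer than the limit.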

Step 3: Smart Merging

If merge_small_chunks: true, the chunker merges adjacent chunks when:

  1. Combined length doesn't exceed max_chunk_size
  2. Chunks have compatible heading levels, meaning either:
       • Same-level headers (e.g., two H2 sections)
       • The previous chunk has a higher-level header (e.g., an H1 followed by an H2)

This prevents overly small chunks while maintaining semantic coherence.
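Under those rules, the merge pass could look like the sketch below. It is a simplification, not the pipeline's implementation: each chunk is represented as a (heading_level, text) pair, where level 1 (H1) is the highest, and size is measured in words.

```python
def merge_small_chunks(chunks: list[tuple[int, str]], max_chunk_size: int) -> list[tuple[int, str]]:
    """Merge adjacent (heading_level, text) chunks when the combined word count fits
    and the previous chunk's heading level is the same or higher (numerically <=)."""
    merged: list[tuple[int, str]] = []
    for level, text in chunks:
        if merged:
            prev_level, prev_text = merged[-1]
            combined = prev_text + "\n\n" + text
            # Same-level (H2 + H2) or parent-child (H1 + H2) sections may merge
            if prev_level <= level and len(combined.split()) <= max_chunk_size:
                merged[-1] = (prev_level, combined)
                continue
        merged.append((level, text))
    return merged
```

Note that a deep section followed by a higher-level one (e.g., an H2 followed by an H1) is never merged, since that would fold a new top-level section into a subsection of the previous one.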

Header Inclusion

When add_headers: true, section headers are prepended to each chunk:

Without headers:

This section describes the methodology used in the study...

With headers:

# Introduction
## Methods
This section describes the methodology used in the study...

This provides context for each chunk, especially useful for retrieval systems.
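The prepending step itself is straightforward, as the sketch below shows. The header path (outermost header first) is assumed to have been tracked during header splitting; here it is simply passed in.

```python
def prepend_headers(headers: list[str], content: str) -> str:
    """Prepend a chunk's header path (outermost first) to its content,
    e.g. ['# Introduction', '## Methods'] becomes two header lines above the text."""
    if not headers:
        return content
    return "\n".join(headers) + "\n" + content
```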

Content Preservation

The chunker intelligently handles special content:

LaTeX Formulas:

The equation \begin{equation}
E = mc^2
\end{equation} is preserved intact.

Tables:

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |

These are never split mid-formula or mid-table, even if they exceed max_chunk_size.
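One way to identify such atomic spans before sentence splitting is a single regex pass, as in the sketch below. The patterns are illustrative, not the pipeline's actual ones: they cover LaTeX environments, display math, and runs of Markdown table rows.

```python
import re

# Illustrative patterns for content that must never be split mid-span
ATOMIC_PATTERN = re.compile(
    r"\\begin\{(\w+)\}.*?\\end\{\1\}"   # LaTeX environments
    r"|\$\$.*?\$\$"                      # display math
    r"|(?:^\|.*\|\s*$\n?)+",             # consecutive Markdown table rows
    re.DOTALL | re.MULTILINE,
)

def find_atomic_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of spans that must stay intact."""
    return [m.span() for m in ATOMIC_PATTERN.finditer(text)]
```

A splitter can then treat each returned span as a single indivisible token when packing chunks, so a formula or table lands in exactly one chunk even if that chunk ends up larger than max_chunk_size.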

Use Cases

Small Chunks for Dense Retrieval

- name: chunker
  config:
    max_chunk_size: 256
    word_overlap: 20
    add_headers: true
    merge_small_chunks: false

Creates small, focused chunks with overlap for better semantic retrieval.

Large Chunks for Context

- name: chunker
  config:
    max_chunk_size: 2048
    add_headers: false
    merge_small_chunks: true
    headers_to_split_on: [1, 2]  # Only split on major sections

Creates larger chunks that preserve more context, suitable for summarization or large context windows.

Academic Papers

- name: chunker
  config:
    max_chunk_size: 512
    add_headers: true
    merge_small_chunks: true
    headers_to_split_on: [1, 2, 3, 4, 5, 6]

Respects the hierarchical structure of academic papers while maintaining readable chunk sizes.

Output Format

Each chunk becomes a separate Document with:

  • content: The chunk text (with headers if add_headers: true)
  • file_path: Original document file path
  • file_format: Original document format
  • metadata.headers: List of Markdown headers that apply to this chunk

Example Output

Document(
    content="# Introduction\n## Background\nThis paper discusses...",
    file_path="papers/paper1.pdf",
    file_format="pdf",
    metadata={
        "headers": ["#Introduction", "##Background"],
        # ... other metadata from original document
    }
)

Performance

The chunker uses parallel processing to handle large document sets efficiently:

  • Documents are processed in separate processes using ProcessPoolExecutor
  • Each process runs an independent chunker instance
  • Results are collected and flattened into a single list
  • Set max_workers to control parallelism (defaults to CPU count)

Performance tip: Chunking is CPU-bound, so the default max_workers=None (one worker per CPU core) is a good starting point. For very large documents or document sets, experiment with different worker counts to find the best throughput.

Integration with Other Steps

Typical Pipeline Order

pipeline:
  inputs:
    path: "documents"
  stages:
    - name: extraction

    - name: deduplication
      config:
        method: "lsh"

    - name: cleaning

    - name: chunker
      config:
        max_chunk_size: 512
        add_headers: true

    - name: embedding  # Or qdrant upload

    - name: export

Before Vector Database Upload

Chunking is typically done before uploading to vector databases:

- name: chunker
  config:
    max_chunk_size: 512
    add_headers: true

- name: qdrant
  config:
    database:
      collection_name: "documents"
    # ... other qdrant config

This ensures each chunk gets its own embedding vector in the database.

Best Practices

  1. Choose an appropriate chunk size:
       • Smaller chunks (256-512 words) for dense retrieval
       • Larger chunks (1024-2048 words) for summarization or large context models

  2. Add headers for context: Enable add_headers: true when chunks will be retrieved without surrounding context

  3. Merge small chunks: Keep merge_small_chunks: true to avoid tiny chunks that lack sufficient context

  4. Adjust header levels: For documents with deep nesting, limit headers_to_split_on to major sections only

Troubleshooting

Chunks are too large

  • Decrease max_chunk_size
  • Add more header levels to headers_to_split_on
  • Set merge_small_chunks: false

Chunks are too small

  • Increase max_chunk_size
  • Set merge_small_chunks: true
  • Reduce header levels in headers_to_split_on

LaTeX formulas are broken

The chunker should preserve LaTeX automatically. If formulas are breaking:

  • Check that LaTeX uses proper \begin{...} and \end{...} syntax
  • Verify formulas aren't malformed in the original document
  • Review the cleaning step output before chunking

Slow performance

  • Adjust max_workers (try different values)
  • Ensure you're chunking after deduplication and cleaning
  • Consider increasing max_chunk_size to reduce total chunk count

Code Reference

Document chunking step using semantic two-step chunking strategy.

ChunkerStep

Bases: PipelineStep

Chunk documents into smaller, semantically meaningful pieces.

Uses a two-step chunking strategy:

  1. Split by Markdown headers to maintain document structure
  2. Further split large sections by sentences while preserving LaTeX and tables
  3. Optionally merge small chunks that share compatible heading levels

The chunker processes documents in parallel using multiprocessing for performance.

Config parameters:

- max_chunk_size (int): Maximum size of any chunk in words (default: 512)
- chunk_overlap (int): Number of characters to overlap between chunks (default: 0)
- word_overlap (int): Number of words to overlap between chunks (default: 0)
- add_headers (bool): Whether to prepend section headers to chunks (default: False)
- merge_small_chunks (bool): Whether to merge small chunks with compatible headers (default: True)
- headers_to_split_on (list[int]): Markdown header levels to split on (default: [1, 2, 3, 4, 5, 6])
- max_workers (int): Number of parallel workers, None uses CPU count (default: None)

Examples:

Basic chunking with default settings

config: {max_chunk_size: 512}

Chunking with headers and overlap for retrieval

config: { max_chunk_size: 512, add_headers: true, word_overlap: 20, merge_small_chunks: true }

Large chunks for context preservation

config: { max_chunk_size: 2048, headers_to_split_on: [1, 2], merge_small_chunks: true }

Source code in eve/steps/chunking/chunker_step.py
class ChunkerStep(PipelineStep):
    """Chunk documents into smaller, semantically meaningful pieces.

    Uses a two-step chunking strategy:

    1. Split by Markdown headers to maintain document structure
    2. Further split large sections by sentences while preserving LaTeX and tables
    3. Optionally merge small chunks that share compatible heading levels

    The chunker processes documents in parallel using multiprocessing for performance.

    Config parameters:

        - max_chunk_size (int): Maximum size of any chunk in words (default: 512)
        - chunk_overlap (int): Number of characters to overlap between chunks (default: 0)
        - word_overlap (int): Number of words to overlap between chunks (default: 0)
        - add_headers (bool): Whether to prepend section headers to chunks (default: False)
        - merge_small_chunks (bool): Whether to merge small chunks with compatible headers (default: True)
        - headers_to_split_on (list[int]): Markdown header levels to split on (default: [1, 2, 3, 4, 5, 6])
        - max_workers (int): Number of parallel workers, None uses CPU count (default: None)

    Examples:
        # Basic chunking with default settings
        config: {max_chunk_size: 512}

        # Chunking with headers and overlap for retrieval
        config: {
            max_chunk_size: 512,
            add_headers: true,
            word_overlap: 20,
            merge_small_chunks: true
        }

        # Large chunks for context preservation
        config: {
            max_chunk_size: 2048,
            headers_to_split_on: [1, 2],
            merge_small_chunks: true
        }
    """

    def __init__(self, config: dict):
        """Initialize the chunker step.

        Args:
            config: Configuration dictionary containing chunking parameters
        """
        super().__init__(config, name="ChunkerStep")

        self.chunk_overlap = config.get("chunk_overlap", 0)
        self.max_chunk_size = config.get("max_chunk_size", 512)
        self.word_overlap = config.get("word_overlap", 0)
        self.add_headers = config.get("add_headers", False)
        self.merge_small_chunks = config.get("merge_small_chunks", True)
        self.headers_to_split_on = config.get("headers_to_split_on", [1, 2, 3, 4, 5, 6])
        self.max_workers = config.get("max_workers", None)  # None = CPU count

        self.chunker = MarkdownTwoStepChunker(
            self.max_chunk_size,
            self.chunk_overlap,
            self.add_headers,
            self.word_overlap,
            self.headers_to_split_on,
            self.merge_small_chunks,
        )

    async def execute(self, documents: List[Document]) -> List[Document]:
        """Execute chunking on documents in parallel.

        Processes each document independently using multiprocessing, then flattens
        all chunks into a single list.

        Args:
            documents: List of documents to chunk

        Returns:
            Flattened list of all chunks from all documents
        """
        self.logger.info(f"Chunking {len(documents)} documents")
        self.logger.info(f"Using max_chunk_size={self.max_chunk_size}, chunk_overlap={self.chunk_overlap}")
        self.logger.info(f"Parallel processing with max_workers={self.max_workers or 'CPU count'}")

        loop = asyncio.get_event_loop()

        # Serialize documents to plain dicts for pickling
        serialized_docs = [_serialize_document(doc) for doc in documents]

        # Create a partial function with the chunker configuration
        chunk_func = partial(
            _chunk_document,
            max_chunk_size=self.max_chunk_size,
            chunk_overlap=self.chunk_overlap,
            add_headers=self.add_headers,
            word_overlap=self.word_overlap,
            headers_to_split_on=self.headers_to_split_on,
            merge_small_chunks=self.merge_small_chunks,
        )

        # Process documents in parallel
        with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
            tasks = [
                loop.run_in_executor(executor, chunk_func, doc)
                for doc in serialized_docs
            ]
            results = await asyncio.gather(*tasks)

        # Flatten and deserialize results
        all_chunks = []
        for doc_chunks in results:
            all_chunks.extend([_deserialize_document(chunk) for chunk in doc_chunks])

        self.logger.info(f"Chunking complete: {len(documents)} documents -> {len(all_chunks)} chunks")

        return all_chunks

__init__(config)

Initialize the chunker step.

Parameters:

Name Type Description Default
config dict

Configuration dictionary containing chunking parameters

required
Source code in eve/steps/chunking/chunker_step.py
def __init__(self, config: dict):
    """Initialize the chunker step.

    Args:
        config: Configuration dictionary containing chunking parameters
    """
    super().__init__(config, name="ChunkerStep")

    self.chunk_overlap = config.get("chunk_overlap", 0)
    self.max_chunk_size = config.get("max_chunk_size", 512)
    self.word_overlap = config.get("word_overlap", 0)
    self.add_headers = config.get("add_headers", False)
    self.merge_small_chunks = config.get("merge_small_chunks", True)
    self.headers_to_split_on = config.get("headers_to_split_on", [1, 2, 3, 4, 5, 6])
    self.max_workers = config.get("max_workers", None)  # None = CPU count

    self.chunker = MarkdownTwoStepChunker(
        self.max_chunk_size,
        self.chunk_overlap,
        self.add_headers,
        self.word_overlap,
        self.headers_to_split_on,
        self.merge_small_chunks,
    )

execute(documents) async

Execute chunking on documents in parallel.

Processes each document independently using multiprocessing, then flattens all chunks into a single list.

Parameters:

Name Type Description Default
documents List[Document]

List of documents to chunk

required

Returns:

Type Description
List[Document]

Flattened list of all chunks from all documents

Source code in eve/steps/chunking/chunker_step.py
async def execute(self, documents: List[Document]) -> List[Document]:
    """Execute chunking on documents in parallel.

    Processes each document independently using multiprocessing, then flattens
    all chunks into a single list.

    Args:
        documents: List of documents to chunk

    Returns:
        Flattened list of all chunks from all documents
    """
    self.logger.info(f"Chunking {len(documents)} documents")
    self.logger.info(f"Using max_chunk_size={self.max_chunk_size}, chunk_overlap={self.chunk_overlap}")
    self.logger.info(f"Parallel processing with max_workers={self.max_workers or 'CPU count'}")

    loop = asyncio.get_event_loop()

    # Serialize documents to plain dicts for pickling
    serialized_docs = [_serialize_document(doc) for doc in documents]

    # Create a partial function with the chunker configuration
    chunk_func = partial(
        _chunk_document,
        max_chunk_size=self.max_chunk_size,
        chunk_overlap=self.chunk_overlap,
        add_headers=self.add_headers,
        word_overlap=self.word_overlap,
        headers_to_split_on=self.headers_to_split_on,
        merge_small_chunks=self.merge_small_chunks,
    )

    # Process documents in parallel
    with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
        tasks = [
            loop.run_in_executor(executor, chunk_func, doc)
            for doc in serialized_docs
        ]
        results = await asyncio.gather(*tasks)

    # Flatten and deserialize results
    all_chunks = []
    for doc_chunks in results:
        all_chunks.extend([_deserialize_document(chunk) for chunk in doc_chunks])

    self.logger.info(f"Chunking complete: {len(documents)} documents -> {len(all_chunks)} chunks")

    return all_chunks

convert_langchain_doc(doc, chunk)

Convert a LangChain Document chunk to an Eve Document.

Extracts headers from chunk metadata and combines with original document metadata.

Parameters:

Name Type Description Default
doc Document

Original Eve Document

required
chunk Document

LangChain Document chunk with header metadata

required

Returns:

Type Description
Document

Eve Document with chunk content and combined metadata

Source code in eve/steps/chunking/chunker_step.py
def convert_langchain_doc(doc: Document, chunk: LangchainDocument) -> Document:
    """Convert a LangChain Document chunk to an Eve Document.

    Extracts headers from chunk metadata and combines with original document metadata.

    Args:
        doc: Original Eve Document
        chunk: LangChain Document chunk with header metadata

    Returns:
        Eve Document with chunk content and combined metadata
    """
    headers = ["#" * key + value for key, value in chunk.metadata.items()]
    return Document(
        content=chunk.page_content,
        file_path=doc.file_path,
        file_format=doc.file_format,
        metadata={"headers": headers, **doc.metadata},
    )