# Chunking
The chunking stage splits large documents into smaller, semantically meaningful chunks that are suitable for downstream processing like embedding generation and vector database upload.
## Overview

Chunking is essential for:

- **Vector database upload**: breaking documents into appropriately sized pieces for embedding
- **Semantic retrieval**: creating chunks that represent coherent topics or concepts
- **Context window management**: ensuring chunks fit within model token limits
- **Performance optimization**: parallelizing processing across multiple chunks
The Eve pipeline uses a two-step chunking strategy that preserves document structure and special content:

- **Header-based splitting**: first splits documents by Markdown headers to maintain semantic structure
- **Sentence-based splitting**: if sections exceed the size limit, further splits them by sentences
- **Smart merging**: optionally merges small chunks back together when they share compatible heading levels
- **Content preservation**: keeps LaTeX formulas, equations, and tables intact as atomic units
## Features

- **Semantic chunking**: respects document structure by splitting on Markdown headers
- **LaTeX preservation**: keeps mathematical formulas and equations together
- **Table preservation**: maintains tables as complete units without splitting
- **Configurable overlap**: adds word-based overlap between chunks for better retrieval
- **Parallel processing**: uses multiprocessing for fast chunking of large document sets
- **Header inclusion**: optionally adds section headers to chunks for context
## Configuration

**Step name:** `chunker`

### Configuration Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `max_chunk_size` | int | No | 512 | Maximum size of any chunk in words |
| `chunk_overlap` | int | No | 0 | Number of characters to overlap between chunks during secondary splitting |
| `word_overlap` | int | No | 0 | Number of words to overlap between chunks (takes precedence over `chunk_overlap`) |
| `add_headers` | bool | No | False | Whether to prepend section headers to chunk content |
| `merge_small_chunks` | bool | No | True | Whether to merge small chunks that share compatible heading levels |
| `headers_to_split_on` | list[int] | No | [1, 2, 3, 4, 5, 6] | Markdown header levels to split on (1 = `#`, 2 = `##`, etc.) |
| `max_workers` | int | No | None | Number of parallel workers (None = CPU count) |
### Basic Configuration

```yaml
- name: chunker
  config:
    max_chunk_size: 512
    add_headers: true
    merge_small_chunks: true
```
### Advanced Configuration

```yaml
- name: chunker
  config:
    max_chunk_size: 1024
    chunk_overlap: 0
    word_overlap: 50
    add_headers: true
    merge_small_chunks: true
    headers_to_split_on: [1, 2, 3]  # Only split on H1, H2, H3
    max_workers: 8                  # Use 8 parallel workers
```
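The effect of `word_overlap` can be pictured with a small sketch. Note that `add_word_overlap` is a hypothetical helper written for illustration, not part of the Eve API; it assumes the overlap is taken as the last N words of the preceding chunk.

```python
def add_word_overlap(chunks: list[str], word_overlap: int) -> list[str]:
    """Prepend the last `word_overlap` words of each chunk to the next one.

    Illustrative sketch of the word_overlap option, not the pipeline's code.
    """
    out = []
    for i, chunk in enumerate(chunks):
        if i > 0 and word_overlap > 0:
            # Repeat the tail of the previous chunk for shared context.
            tail = chunks[i - 1].split()[-word_overlap:]
            chunk = " ".join(tail) + " " + chunk
        out.append(chunk)
    return out

# With an overlap of 2 words, the second chunk repeats the last
# two words of the first.
print(add_word_overlap(["alpha beta gamma", "delta epsilon"], 2))
```

Overlap like this gives a retriever shared context at chunk boundaries, at the cost of some duplicated text in the index.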
## How It Works

### Two-Step Chunking Strategy

#### Step 1: Header-Based Splitting

The chunker first splits the document at Markdown headers:
```markdown
# Introduction
This is the introduction text...

## Background
This is the background section...

## Methods
This section describes methods...
```
This creates initial chunks at natural document boundaries.
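Conceptually, this step behaves like the following stdlib-only sketch. It is a simplified stand-in, not the pipeline's actual splitter, but it shows the key idea: each chunk carries the path of headers that applies to it.

```python
import re


def split_by_headers(markdown: str, levels=(1, 2, 3, 4, 5, 6)) -> list[dict]:
    """Split Markdown text at header lines of the given levels (sketch)."""
    chunks, current, headers = [], [], {}
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m and len(m.group(1)) in levels:
            if current:
                chunks.append({"headers": dict(headers), "text": "\n".join(current).strip()})
                current = []
            level = len(m.group(1))
            # Drop deeper headers when moving back up the hierarchy.
            headers = {k: v for k, v in headers.items() if k < level}
            headers[level] = m.group(2)
        else:
            current.append(line)
    if current:
        chunks.append({"headers": dict(headers), "text": "\n".join(current).strip()})
    return chunks
```

Running it on the example above yields one chunk per section, each annotated with its header path (e.g. `{1: "Introduction", 2: "Background"}`).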
#### Step 2: Size-Based Splitting

If any chunk exceeds `max_chunk_size`, it is further split at sentence boundaries while preserving:

- LaTeX environments (`\begin{...}...\end{...}`)
- Inline and display math (`$...$`, `$$...$$`)
- Markdown tables
- Figure and table references
#### Step 3: Smart Merging

With `merge_small_chunks: true`, the chunker merges adjacent chunks when:

- the combined length doesn't exceed `max_chunk_size`
- the chunks have compatible heading levels:
  - same-level headers (e.g., two H2 sections)
  - the previous chunk has a higher-level header (e.g., H1 followed by H2)
This prevents overly small chunks while maintaining semantic coherence.
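The merge rule can be sketched as a greedy pass over adjacent chunks. This is an illustration under stated assumptions, not the pipeline's implementation: chunks are `(heading_level, text)` pairs, and size is measured in words to match `max_chunk_size`.

```python
def merge_small_chunks(chunks: list[tuple[int, str]], max_chunk_size: int = 512):
    """Greedily merge adjacent chunks with compatible heading levels (sketch)."""
    merged = []
    for level, text in chunks:
        if merged:
            prev_level, prev_text = merged[-1]
            combined = prev_text + "\n\n" + text
            # Compatible: same level, or the previous chunk sits higher
            # in the hierarchy (e.g. H1 followed by H2).
            compatible = prev_level <= level
            if compatible and len(combined.split()) <= max_chunk_size:
                merged[-1] = (prev_level, combined)
                continue
        merged.append((level, text))
    return merged
```

With a generous size limit, an H1 section and its two short H2 subsections collapse into a single chunk; with a tight limit they stay separate.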
### Header Inclusion

When `add_headers: true`, section headers are prepended to each chunk.

Without headers:

```
This section describes the methodology used in the study...
```

With headers:

```markdown
# Introduction
## Methods
This section describes the methodology used in the study...
```
This provides context for each chunk, especially useful for retrieval systems.
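Building that prefix from a chunk's header path could look like the following. `with_headers` is a hypothetical helper named for illustration only:

```python
def with_headers(header_path: list[tuple[int, str]], body: str) -> str:
    """Prepend a Markdown header path to a chunk body (illustrative sketch).

    header_path is a list of (level, title) pairs,
    e.g. [(1, "Introduction"), (2, "Methods")].
    """
    prefix = "\n".join(f"{'#' * level} {title}" for level, title in header_path)
    return f"{prefix}\n{body}" if header_path else body
```

A chunk under "Methods" inside "Introduction" then starts with `# Introduction` and `## Methods`, so a retriever sees where the text came from.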
### Content Preservation

The chunker intelligently handles special content:

**LaTeX formulas:**

```latex
The equation \begin{equation}
E = mc^2
\end{equation} is preserved intact.
```

**Tables:**

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |

Neither is ever split mid-formula or mid-table, even if it exceeds `max_chunk_size`.
## Use Cases

### Small Chunks for Dense Retrieval

```yaml
- name: chunker
  config:
    max_chunk_size: 256
    word_overlap: 20
    add_headers: true
    merge_small_chunks: false
```
Creates small, focused chunks with overlap for better semantic retrieval.
### Large Chunks for Context

```yaml
- name: chunker
  config:
    max_chunk_size: 2048
    add_headers: false
    merge_small_chunks: true
    headers_to_split_on: [1, 2]  # Only split on major sections
```
Creates larger chunks that preserve more context, suitable for summarization or large context windows.
### Academic Papers

```yaml
- name: chunker
  config:
    max_chunk_size: 512
    add_headers: true
    merge_small_chunks: true
    headers_to_split_on: [1, 2, 3, 4, 5, 6]
```
Respects the hierarchical structure of academic papers while maintaining readable chunk sizes.
## Output Format

Each chunk becomes a separate Document with:

- `content`: The chunk text (with headers if `add_headers: true`)
- `file_path`: Original document file path
- `file_format`: Original document format
- `metadata.headers`: List of Markdown headers that apply to this chunk
### Example Output

```python
Document(
    content="# Introduction\n## Background\nThis paper discusses...",
    file_path="papers/paper1.pdf",
    file_format="pdf",
    metadata={
        "headers": ["#Introduction", "##Background"],
        # ... other metadata from original document
    },
)
```
## Performance

The chunker uses parallel processing to handle large document sets efficiently:

- Documents are processed in separate processes using `ProcessPoolExecutor`
- Each process runs an independent chunker instance
- Results are collected and flattened into a single list
- Set `max_workers` to control parallelism (defaults to the CPU count)

**Performance tip:** for I/O-bound workloads, the default `max_workers: None` is usually fine; for CPU-intensive chunking of very large documents, experiment with different worker counts.
## Integration with Other Steps

### Typical Pipeline Order

```yaml
pipeline:
  inputs:
    path: "documents"
  stages:
    - name: extraction
    - name: deduplication
      config:
        method: "lsh"
    - name: cleaning
    - name: chunker
      config:
        max_chunk_size: 512
        add_headers: true
    - name: embedding  # Or qdrant upload
    - name: export
```
### Before Vector Database Upload

Chunking is typically done before uploading to vector databases:

```yaml
- name: chunker
  config:
    max_chunk_size: 512
    add_headers: true
- name: qdrant
  config:
    database:
      collection_name: "documents"
    # ... other qdrant config
```
This ensures each chunk gets its own embedding vector in the database.
## Best Practices

- **Choose an appropriate chunk size:**
  - smaller chunks (256-512 words) for dense retrieval
  - larger chunks (1024-2048 words) for summarization or large-context models
- **Add headers for context:** enable `add_headers: true` when chunks will be retrieved without their surrounding context
- **Merge small chunks:** keep `merge_small_chunks: true` to avoid tiny chunks that lack sufficient context
- **Adjust header levels:** for documents with deep nesting, limit `headers_to_split_on` to major sections only
## Troubleshooting

### Chunks are too large

- Decrease `max_chunk_size`
- Add more header levels to `headers_to_split_on`
- Set `merge_small_chunks: false`

### Chunks are too small

- Increase `max_chunk_size`
- Set `merge_small_chunks: true`
- Reduce the header levels in `headers_to_split_on`

### LaTeX formulas are broken

The chunker should preserve LaTeX automatically. If formulas are breaking:

- Check that the LaTeX uses proper `\begin{...}` and `\end{...}` syntax
- Verify the formulas aren't malformed in the original document
- Review the cleaning step output before chunking

### Slow performance

- Adjust `max_workers` (try different values)
- Ensure you're chunking after deduplication and cleaning
- Consider increasing `max_chunk_size` to reduce the total chunk count
## Next Steps
- Set up Qdrant upload to store chunks in a vector database
- Learn about Export options for saving chunked documents
## Code Reference

Document chunking step using a semantic two-step chunking strategy.

### ChunkerStep

Bases: `PipelineStep`

Chunk documents into smaller, semantically meaningful pieces.

Uses a two-step chunking strategy:

1. Split by Markdown headers to maintain document structure
2. Further split large sections by sentences while preserving LaTeX and tables
3. Optionally merge small chunks that share compatible heading levels
The chunker processes documents in parallel using multiprocessing for performance.
**Config parameters:**

- `max_chunk_size` (int): Maximum size of any chunk in words (default: 512)
- `chunk_overlap` (int): Number of characters to overlap between chunks (default: 0)
- `word_overlap` (int): Number of words to overlap between chunks (default: 0)
- `add_headers` (bool): Whether to prepend section headers to chunks (default: False)
- `merge_small_chunks` (bool): Whether to merge small chunks with compatible headers (default: True)
- `headers_to_split_on` (list[int]): Markdown header levels to split on (default: [1, 2, 3, 4, 5, 6])
- `max_workers` (int): Number of parallel workers; None uses the CPU count (default: None)
**Examples:**

Basic chunking with default settings:

```yaml
config: {max_chunk_size: 512}
```

Chunking with headers and overlap for retrieval:

```yaml
config: {max_chunk_size: 512, add_headers: true, word_overlap: 20, merge_small_chunks: true}
```

Large chunks for context preservation:

```yaml
config: {max_chunk_size: 2048, headers_to_split_on: [1, 2], merge_small_chunks: true}
```
Source code in eve/steps/chunking/chunker_step.py
#### `__init__(config)`
Initialize the chunker step.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | dict | Configuration dictionary containing chunking parameters | *required* |
Source code in eve/steps/chunking/chunker_step.py
#### `execute(documents)` (async)
Execute chunking on documents in parallel.
Processes each document independently using multiprocessing, then flattens all chunks into a single list.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `documents` | List[Document] | List of documents to chunk | *required* |

**Returns:**

| Type | Description |
|---|---|
| List[Document] | Flattened list of all chunks from all documents |
Source code in eve/steps/chunking/chunker_step.py
#### `convert_langchain_doc(doc, chunk)`
Convert a LangChain Document chunk to an Eve Document.
Extracts headers from chunk metadata and combines with original document metadata.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `doc` | Document | Original Eve Document | *required* |
| `chunk` | Document | LangChain Document chunk with header metadata | *required* |

**Returns:**

| Type | Description |
|---|---|
| Document | Eve Document with chunk content and combined metadata |
Source code in eve/steps/chunking/chunker_step.py