Chunking
Learn how to optimize document segmentation for maximum retrieval relevance
This guide explains document chunking in GraphorLM - the critical process of dividing your documents into optimally-sized segments for retrieval. You’ll learn about different chunking strategies, how to configure the chunking component in your RAG pipeline, and best practices for improving retrieval quality through effective document segmentation.
Overview
Effective document chunking is critical for RAG pipeline performance. This guide explains:
- What chunking is and why it matters
- GraphorLM’s chunking strategies and capabilities
- How to configure chunking for your specific use case
- Best practices for optimizing retrieval quality
What is Chunking?
Chunking is the process of breaking down documents into smaller, manageable segments (chunks) that can be:
- Embedded effectively as vectors
- Retrieved efficiently during the search process
- Processed meaningfully by language models
The way you chunk your documents directly impacts:
- Retrieval accuracy: How well the system finds relevant information
- Context quality: How complete and coherent the retrieved information is
- Processing efficiency: How quickly the system can handle and respond to queries
Element-Aware Chunking in GraphorLM
GraphorLM leverages document structure and element classifications (from the data ingestion phase) to make intelligent chunking decisions:
- Preserves semantic units: Keeps related content together
- Respects document hierarchies: Maintains the relationship between headings and content
- Handles special elements: Properly processes tables, lists, and code blocks
Available Chunking Strategies
GraphorLM provides multiple chunking strategies, each with specific use cases and configuration requirements:
Smart Chunking
- How it works: Uses document structure and element classification to make intelligent splitting decisions
- Configuration required:
- Chunk Size: Maximum size of each chunk (actual chunks are typically smaller based on content boundaries)
- Best for: Complex documents with varied structures, headings, and special elements
- Advantages: Preserves meaning, maintains context, respects document hierarchy
- Limitations: May produce variable chunk sizes
By Character
- How it works: Splits text based on a fixed number of characters
- Configuration required:
- Chunk Size: Maximum number of characters per chunk (actual chunks may be smaller when using separators)
- Overlap: Number of characters to overlap between chunks
- Separator (optional): Character or string to use as chunk boundary
- Best for: Simple, uniform documents with straightforward content
- Advantages: Fast processing, consistent chunk sizes
- Limitations: May break semantic units and natural document flow
By Element
- How it works: Creates chunks based on document element types from the classification process
- Configuration required: None (automatically uses element boundaries)
- Best for: Documents with well-defined structural elements
- Advantages: Preserves natural document elements (paragraphs, sections, tables)
- Limitations: Chunk sizes can vary dramatically based on element length
By Tokens
- How it works: Creates chunks with a specific token count limit
- Configuration required:
- Chunk Size: Maximum number of tokens per chunk (actual chunks will often be smaller due to sentence or paragraph boundaries)
- Overlap: Number of tokens to overlap between chunks
- Best for: Optimizing for LLM context window sizes
- Advantages: Precise control over token usage, better for managing API costs
- Limitations: May not respect semantic boundaries
By Semantic
- How it works: Uses semantic analysis to identify natural conceptual boundaries in text
- Configuration required: None (automatically determines semantic boundaries)
- Best for: Documents where semantic meaning spans across structural elements
- Advantages: Creates more meaningful chunks based on content rather than just structure
- Limitations: More computationally intensive, may produce less predictable chunks
Using the Chunking Component
In the Flow Builder, the Chunking component processes your documents for optimal retrieval:
Step-by-Step Configuration
- Add the Chunking component to your flow
- Connect your Dataset component to the Chunking component’s input
- Double-click the Chunking component to open its configuration panel
- Configure the following settings:
1. Embedding / Indexer
Select the embedding model to use for converting text chunks into vector representations:
- text-embedding-3-small: OpenAI’s efficient embedding model (recommended for most use cases)
- text-embedding-3-large: OpenAI’s larger, more precise embedding model
- text-embedding-ada-002: Legacy OpenAI embedding model
- colqwen2-v0.1 (vision): Uses Colpali technology to generate embeddings from images rather than text
- Only compatible with PDF documents and images (PNG, JPG, etc.)
- Documents in other formats will be ignored during chunking when this option is selected
- Creates multimodal chunks that leverage both visual and textual information
- Particularly effective for documents where visual elements are important to understanding
- Performance note: Using Colpali technology can impact both simulation and runtime performance of your RAG pipeline
- Requires more computational resources and may increase processing time
- May result in slower response times during retrieval operations
- Consider these tradeoffs when choosing this option for production workflows
- Other models: Depending on your GraphorLM configuration
2. Elements to Remove
Choose which document element types to exclude from chunking using the multi-select dropdown:
- None: Include all document elements (default)
- Or select specific element types to exclude, such as:
- Header: Remove headers at the top of pages
- Footer: Remove footers at the bottom of pages
- Page number: Remove page numbering elements
- Figure caption: Remove image captions
- And other element types as needed
You can select multiple element types to exclude from your chunks. This helps improve retrieval quality by removing noise and repetitive content that might not contribute meaningful information.
3. Splitter Type
Select your preferred chunking strategy from the available options (Smart chunking, By Character, By Element, By Tokens, By Semantic).
4. Chunk Size and Other Parameters
Configure the additional parameters required for your selected strategy:
- For Smart Chunking: Set Chunk Size (maximum allowed size; actual chunks will typically be smaller)
- For By Character: Set Chunk Size (maximum), Overlap, and optional Separator
- For By Tokens: Set Chunk Size (maximum) and Overlap
Important Note: The Chunk Size setting in all strategies represents the maximum allowed size, not a fixed size. The actual chunks created will often be smaller than this value, as the chunking algorithm respects content boundaries like paragraphs, sentences, and document elements.
Chunk Size Considerations
The size of your chunks is a critical factor in retrieval quality:
Recommended Settings by Document Type
Different documents require different chunking approaches:
Document Type | Recommended Strategy | Typical Chunk Size | Notes |
---|---|---|---|
Technical documentation | Smart chunking | 3000-5000 | Preserves structure |
Articles and blogs | Smart chunking | 3000-4000 | Good for narrative flow |
Legal documents | By Tokens | 4000-6000 | Precise token control |
Code and technical specs | By Element | N/A | Keeps code blocks intact |
Structured data | By Element | N/A | Preserves table structures |
Multi-language content | By Semantic | N/A | Better language boundary handling |
Troubleshooting Common Issues
Best Practices
- Match strategy to content: Choose your chunking strategy based on document type and structure
- Start with Smart chunking: This works well for most general documents
- Test different approaches: Experiment with strategies and settings using the same queries
- Consider user questions: Align chunking with the types of questions users will ask
- Remove noise: Configure element removal to exclude irrelevant content
- Evaluate and iterate: Use evaluation tools to measure and refine your approach
Next Steps
After optimizing your chunking configuration, explore:
Retrieval
Configure search parameters and algorithms of your RAG
Evaluation
Measure and improve your RAG pipeline performance with comprehensive metrics
LLM Integration
Connect language models to utilize your chunked content effectively
Integrate Workflow
Connect your RAG systems to applications via REST API and MCP Server integration