This guide explains document chunking in GraphorLM: the process of dividing your documents into appropriately sized segments for retrieval. You’ll learn about the available chunking strategies, how to configure the Chunking component in your RAG pipeline, and best practices for improving retrieval quality through effective document segmentation.

Overview

Effective document chunking is critical for RAG pipeline performance. This guide explains:

  1. What chunking is and why it matters
  2. GraphorLM’s chunking strategies and capabilities
  3. How to configure chunking for your specific use case
  4. Best practices for optimizing retrieval quality

What is Chunking?

Chunking is the process of breaking down documents into smaller, manageable segments (chunks) that can be:

  • Embedded effectively as vectors
  • Retrieved efficiently during the search process
  • Processed meaningfully by language models

The way you chunk your documents directly impacts:

  • Retrieval accuracy: How well the system finds relevant information
  • Context quality: How complete and coherent the retrieved information is
  • Processing efficiency: How quickly the system can handle and respond to queries

Element-Aware Chunking in GraphorLM

GraphorLM leverages document structure and element classifications (from the data ingestion phase) to make intelligent chunking decisions:

  • Preserves semantic units: Keeps related content together
  • Respects document hierarchies: Maintains the relationship between headings and content
  • Handles special elements: Properly processes tables, lists, and code blocks

Available Chunking Strategies

GraphorLM provides multiple chunking strategies, each with specific use cases and configuration requirements:

Smart Chunking

  • How it works: Uses document structure and element classification to make intelligent splitting decisions
  • Configuration required:
    • Chunk Size: Maximum size of each chunk (actual chunks are typically smaller based on content boundaries)
  • Best for: Complex documents with varied structures, headings, and special elements
  • Advantages: Preserves meaning, maintains context, respects document hierarchy
  • Limitations: May produce variable chunk sizes
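
The exact algorithm behind Smart Chunking is not described in this guide. As a rough illustration of structure-aware splitting in general, the sketch below packs classified elements into chunks no larger than a maximum size, starts a new chunk at every heading, and never splits an element. The `Element` type, its `kind` labels, and `smart_chunk` are hypothetical names used for illustration, not GraphorLM APIs.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # hypothetical labels, e.g. "title", "paragraph", "table"
    text: str


def smart_chunk(elements: list[Element], max_size: int) -> list[str]:
    """Greedily pack elements into chunks of at most max_size characters.

    A heading always starts a new chunk so it stays with the content that
    follows it. An element longer than max_size is kept whole in this
    sketch rather than being split.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    def flush() -> None:
        nonlocal current, current_len
        if current:
            chunks.append("\n".join(current))
        current, current_len = [], 0

    for el in elements:
        # +1 accounts for the newline that joins elements inside a chunk
        added = len(el.text) + (1 if current else 0)
        if el.kind == "title" or current_len + added > max_size:
            flush()
            added = len(el.text)
        current.append(el.text)
        current_len += added
    flush()
    return chunks


doc = [
    Element("title", "Installation"),
    Element("paragraph", "Download the package and run the installer."),
    Element("title", "Usage"),
    Element("paragraph", "Start the service."),
    Element("paragraph", "Open the dashboard in a browser."),
]
chunks = smart_chunk(doc, max_size=60)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Each heading ends up in the same chunk as the paragraphs beneath it, and chunk lengths differ because boundaries follow the content rather than a fixed count.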

By Character

  • How it works: Splits text based on a fixed number of characters
  • Configuration required:
    • Chunk Size: Maximum number of characters per chunk (actual chunks may be smaller when using separators)
    • Overlap: Number of characters to overlap between chunks
    • Separator (optional): Character or string to use as chunk boundary
  • Best for: Simple, uniform documents with straightforward content
  • Advantages: Fast processing, consistent chunk sizes
  • Limitations: May break semantic units and natural document flow
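
To make the interaction between Chunk Size, Overlap, and Separator concrete, here is a minimal sketch of character-based splitting. It is an assumption about how such a splitter can work, not GraphorLM's implementation; `chunk_by_character` is a hypothetical helper.

```python
from typing import Optional


def chunk_by_character(text: str, chunk_size: int, overlap: int = 0,
                       separator: Optional[str] = None) -> list[str]:
    """Split text into windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if separator and end < len(text):
            # Prefer to cut just after the last separator inside the window
            cut = text.rfind(separator, start, end)
            # Accept the cut only if the next window would still move forward
            if cut > start + overlap:
                end = cut + len(separator)
        chunks.append(text[start:end])
        if end == len(text):
            break
        # The next chunk repeats the last `overlap` characters of this one
        start = end - overlap
    return chunks


print(chunk_by_character("abcdefghij", 4, overlap=1))
print(chunk_by_character("one two three four", 10, separator=" "))
```

With a separator, the first chunk of the second call stops at a word boundary and is shorter than the 10-character maximum.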

By Element

  • How it works: Creates chunks based on document element types from the classification process
  • Configuration required: None (automatically uses element boundaries)
  • Best for: Documents with well-defined structural elements
  • Advantages: Preserves natural document elements (paragraphs, sections, tables)
  • Limitations: Chunk sizes can vary dramatically based on element length
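
The idea of one chunk per classified element can be sketched as follows. The `(type, text)` tuples and the type labels are hypothetical stand-ins for whatever the ingestion phase produces.

```python
def chunk_by_element(elements: list[tuple[str, str]]) -> list[dict]:
    """Return one chunk per non-empty element, tagged with its type."""
    return [
        {"type": kind, "text": text, "size": len(text)}
        for kind, text in elements
        if text.strip()
    ]


elements = [
    ("title", "Quarterly Report"),
    ("paragraph", "Revenue grew in every region. " * 10),
    ("table", "Region | Revenue\nNorth | 10\nSouth | 12"),
]
element_chunks = chunk_by_element(elements)
sizes = [c["size"] for c in element_chunks]
print(sizes)
```

The title and the paragraph differ in size by more than an order of magnitude, which is the variability noted under Limitations.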

By Tokens

  • How it works: Creates chunks with a specific token count limit
  • Configuration required:
    • Chunk Size: Maximum number of tokens per chunk (actual chunks will often be smaller due to sentence or paragraph boundaries)
    • Overlap: Number of tokens to overlap between chunks
  • Best for: Optimizing for LLM context window sizes
  • Advantages: Precise control over token usage, better for managing API costs
  • Limitations: May not respect semantic boundaries
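
A sliding window over tokens is one simple way to picture Chunk Size and Overlap for this strategy. The sketch below is an assumption, not GraphorLM's code: it uses whitespace-separated words as a stand-in for model tokens (a real pipeline would use the tokenizer of the target model) and does not adjust for sentence or paragraph boundaries.

```python
def chunk_by_tokens(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()  # stand-in tokenizer: whitespace-separated words
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks


print(chunk_by_tokens("a b c d e f g", chunk_size=3, overlap=1))
```

Each chunk begins with the last token of the previous one, so content that straddles a boundary appears in both chunks.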

By Semantic

  • How it works: Uses semantic analysis to identify natural conceptual boundaries in text
  • Configuration required: None (automatically determines semantic boundaries)
  • Best for: Documents where semantic meaning spans across structural elements
  • Advantages: Creates more meaningful chunks based on content rather than just structure
  • Limitations: More computationally intensive, may produce less predictable chunks
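
This guide does not specify how semantic boundaries are detected. One common approach is to compare adjacent sentences and start a new chunk where their similarity drops. The sketch below illustrates that idea under simplifying assumptions: word-count vectors stand in for real embeddings, and the `threshold` value and function names are hypothetical.

```python
import math
import re
from collections import Counter


def _vector(sentence: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def chunk_by_semantic(text: str, threshold: float = 0.2) -> list[str]:
    """Start a new chunk when adjacent sentences fall below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if _cosine(_vector(prev), _vector(cur)) < threshold:
            groups.append([cur])      # topic shift: open a new chunk
        else:
            groups[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(g) for g in groups]


text = ("Cats are small pets. Many cats are playful pets. "
        "Invoices fall due monthly. Late invoices incur fees.")
semantic_chunks = chunk_by_semantic(text)
print(semantic_chunks)
```

Computing a similarity for every pair of neighboring sentences is what makes this family of strategies more expensive than fixed-size splitting, and the resulting chunk sizes depend entirely on the content.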

Using the Chunking Component

In the Flow Builder, the Chunking component processes your documents for optimal retrieval.

Step-by-Step Configuration

  1. Add the Chunking component to your flow
  2. Connect your Dataset component to the Chunking component’s input
  3. Double-click the Chunking component to open its configuration panel
  4. Configure the following settings:

1. Embedding / Indexer

Select the embedding model to use for converting text chunks into vector representations:

  • text-embedding-3-small: OpenAI’s efficient embedding model (recommended for most use cases)
  • text-embedding-3-large: OpenAI’s larger, more precise embedding model
  • text-embedding-ada-002: Legacy OpenAI embedding model
  • colqwen2-v0.1 (vision): Uses ColPali technology to generate embeddings from images rather than text
    • Only compatible with PDF documents and images (PNG, JPG, etc.)
    • Documents in other formats will be ignored during chunking when this option is selected
    • Creates multimodal chunks that leverage both visual and textual information
    • Particularly effective for documents where visual elements are important to understanding
    • Performance note: Using ColPali technology can impact both simulation and runtime performance of your RAG pipeline
      • Requires more computational resources and may increase processing time
      • May result in slower response times during retrieval operations
      • Consider these tradeoffs when choosing this option for production workflows
  • Other models: Depending on your GraphorLM configuration
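
Because the vision model ignores documents in unsupported formats, it can help to check a dataset before selecting it. The following sketch is illustrative only: the extension set is an assumed subset based on the formats named above (the actual supported list may be longer), and `split_by_compatibility` is a hypothetical helper, not part of GraphorLM.

```python
from pathlib import Path

VISION_MODEL = "colqwen2-v0.1"
# Assumed subset: the guide names PDF, PNG, and JPG; other image types may apply
VISION_COMPATIBLE = {".pdf", ".png", ".jpg", ".jpeg"}


def split_by_compatibility(filenames: list[str], model: str) -> tuple[list[str], list[str]]:
    """Return (processed, ignored) file lists for the chosen embedding model."""
    if model != VISION_MODEL:
        # Text embedding models are assumed here to accept every ingested file
        return list(filenames), []
    processed, ignored = [], []
    for name in filenames:
        target = processed if Path(name).suffix.lower() in VISION_COMPATIBLE else ignored
        target.append(name)
    return processed, ignored


files = ["report.pdf", "diagram.PNG", "notes.docx", "data.csv"]
processed, ignored = split_by_compatibility(files, VISION_MODEL)
print("processed:", processed)
print("ignored:", ignored)
```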

2. Elements to Remove

Choose which document element types to exclude from chunking using the multi-select dropdown:

  • None: Include all document elements (default)
  • Or select specific element types to exclude, such as:
    • Header: Remove headers at the top of pages
    • Footer: Remove footers at the bottom of pages
    • Page number: Remove page numbering elements
    • Figure caption: Remove image captions
    • And other element types as needed

You can select multiple element types to exclude from your chunks. This helps improve retrieval quality by removing noise and repetitive content that might not contribute meaningful information.
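
Conceptually, element removal is a filter applied before chunking. The sketch below assumes elements are dictionaries with a `type` label matching the names in the dropdown; that shape and the `remove_elements` helper are hypothetical.

```python
def remove_elements(elements: list[dict], excluded: set[str]) -> list[dict]:
    """Drop elements whose type is in the excluded set (case-insensitive)."""
    excluded_lower = {name.lower() for name in excluded}
    return [el for el in elements if el["type"].lower() not in excluded_lower]


page = [
    {"type": "Header", "text": "Internal handbook"},
    {"type": "Paragraph", "text": "Chunking splits documents into segments."},
    {"type": "Page number", "text": "7"},
    {"type": "Footer", "text": "Draft copy"},
]
kept = remove_elements(page, {"Header", "Footer", "Page number"})
print([el["text"] for el in kept])
```

Only the paragraph survives, so repeated page furniture never reaches the embedding step.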

3. Splitter Type

Select your preferred chunking strategy from the available options (Smart chunking, By Character, By Element, By Tokens, By Semantic).

4. Chunk Size and Other Parameters

Configure the additional parameters required for your selected strategy:

  • For Smart Chunking: Set Chunk Size (maximum allowed size; actual chunks will typically be smaller)
  • For By Character: Set Chunk Size (maximum), Overlap, and optional Separator
  • For By Tokens: Set Chunk Size (maximum) and Overlap

Important Note: In every strategy that uses it (Smart Chunking, By Character, and By Tokens), the Chunk Size setting is the maximum allowed size, not a fixed size. The actual chunks will often be smaller than this value, because the chunking algorithm respects content boundaries such as paragraphs, sentences, and document elements.
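
The difference between a maximum and a fixed size is easiest to see with numbers. The sketch below packs whole paragraphs into chunks without exceeding a limit; it is an illustrative assumption about boundary-respecting splitting, and `pack_paragraphs` is a hypothetical helper.

```python
def pack_paragraphs(text: str, max_size: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_size characters."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        # A lone paragraph longer than max_size is kept whole in this sketch
        if len(candidate) <= max_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "\n\n".join(["a" * 30, "b" * 40, "c" * 50])
packed = pack_paragraphs(sample, max_size=80)
print([len(c) for c in packed])
```

With a limit of 80, the chunks come out at 72 and 50 characters: the third paragraph does not fit in the first chunk, and it is moved whole to the next chunk instead of being cut at character 80.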

Chunk Size Considerations

The size of your chunks is a critical factor in retrieval quality, and different document types call for different chunking approaches:

Document Type            | Recommended Strategy | Typical Chunk Size | Notes
-------------------------|----------------------|--------------------|----------------------------------
Technical documentation  | Smart chunking       | 3000-5000          | Preserves structure
Articles and blogs       | Smart chunking       | 3000-4000          | Good for narrative flow
Legal documents          | By Tokens            | 4000-6000          | Precise token control
Code and technical specs | By Element           | N/A                | Keeps code blocks intact
Structured data          | By Element           | N/A                | Preserves table structures
Multi-language content   | By Semantic          | N/A                | Better language boundary handling
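
If you configure pipelines programmatically or keep notes per dataset, the recommendations above can be encoded as a simple lookup. This is only a convenience sketch: the dictionary keys and the `recommend` helper are hypothetical, and the fallback to Smart chunking follows the general advice in this guide rather than any GraphorLM API.

```python
from typing import Optional

# (strategy, typical chunk size range) per document type; None means N/A
RECOMMENDATIONS: dict[str, tuple[str, Optional[tuple[int, int]]]] = {
    "technical documentation": ("Smart chunking", (3000, 5000)),
    "articles and blogs": ("Smart chunking", (3000, 4000)),
    "legal documents": ("By Tokens", (4000, 6000)),
    "code and technical specs": ("By Element", None),
    "structured data": ("By Element", None),
    "multi-language content": ("By Semantic", None),
}


def recommend(doc_type: str) -> tuple[str, Optional[tuple[int, int]]]:
    """Look up a starting strategy; unknown types fall back to Smart chunking."""
    return RECOMMENDATIONS.get(doc_type.strip().lower(), ("Smart chunking", None))


print(recommend("Legal documents"))
print(recommend("Meeting notes"))
```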

Best Practices

  1. Match strategy to content: Choose your chunking strategy based on document type and structure
  2. Start with Smart chunking: This works well for most general documents
  3. Test different approaches: Experiment with strategies and settings using the same queries
  4. Consider user questions: Align chunking with the types of questions users will ask
  5. Remove noise: Configure element removal to exclude irrelevant content
  6. Evaluate and iterate: Use evaluation tools to measure and refine your approach

Next Steps

After optimizing your chunking configuration, explore: