Transforming unstructured documents into processable data is the foundation of any effective RAG pipeline. GraphorLM’s data ingestion capabilities provide advanced tools for extracting, processing, and organizing information from various document formats.

Overview

GraphorLM’s data ingestion process involves:

  1. Document upload - Import files from various sources
  2. Text extraction - Convert documents to machine-readable text
  3. Structure recognition - Identify document elements and hierarchy
  4. Metadata extraction - Capture important document properties
  5. Content classification - Categorize document sections

Supported Document Types

GraphorLM supports a wide range of document formats:

Document TypeExtensionsFeatures
Text DocumentsPDF, TXT, TEXT, MD, DOC, DOCX, ODT, HTML, HTMFull text extraction, structure preservation
ImagesPNG, JPG, JPEG, TIFF, BMP, HEICOCR for text extraction, image analysis
PresentationsPPT, PPTXSlide extraction, image processing
SpreadsheetsXLS, XLSX, CSV, TSVTable parsing, data extraction
Web ContentURLWeb scraping, content extraction

Importing Documents

There are several ways to import documents into GraphorLM:

Method 1: Local Files Upload

  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select Local files
  4. Drag and drop files or click to browse your file system
  5. Click Finish to begin processing

Method 2: URL Import

To import content directly from a web address:

  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select URL
  4. Enter the web address of the content you want to import
  5. Click Finish to begin processing

GraphorLM will crawl the specified URL, extract the content, and process it for ingestion.

Advanced OCR Processing

GraphorLM utilizes state-of-the-art OCR (Optical Character Recognition) to extract text from images and scanned documents.

OCR Features

  • Multi-language support - Recognize text in various languages
  • Layout preservation - Maintain document structure and formatting
  • Table detection - Extract structured data from tables
  • Image text extraction - Identify and capture text embedded in images
  • Handwriting recognition - Process handwritten notes (with varying accuracy)

Document Parsing Methods and Classification

GraphorLM offers four parsing and classification methods to optimize document processing based on your needs:

Basic

  • Does not utilize OCR processing
  • Classifies document elements using heuristic methods
  • Suitable for simple text documents with clear structure
  • Fastest processing option

OCR Only

  • Utilizes OCR for text extraction and parsing
  • Classifies document elements using heuristic methods
  • Recommended for scanned documents and images
  • Balances processing speed and accuracy

YOLOX

  • Utilizes OCR for text extraction and parsing
  • Classifies document elements using the YOLOX model
  • Better recognition of document structure and components
  • Improved accuracy for complex documents

Advanced (Premium)

  • Utilizes OCR for text extraction and parsing
  • Classifies document elements using a fine-tuned model
  • Highest accuracy for document structure recognition
  • Optimized for specialized document types

To select a parsing method for your document:

  1. Upload your document to GraphorLM
  2. Select the document in the Sources list to open the document settings
  3. In the document settings modal, locate the Methods dropdown in the top-right corner
  4. Select your preferred parsing method from the dropdown menu
  5. Click “Reprocess elements” to apply the new parsing method to your document
  6. Wait for the reprocessing to complete - this may take a few moments depending on document size and complexity

When reprocessing is complete, you’ll see the updated document structure and classification results in the preview panel.

Content Classification

GraphorLM can automatically classify document sections to improve retrieval relevance:

Document Element Types

The platform classifies content into the following specific element types:

  • Title - Document and section titles
  • Narrative text - Main body paragraphs and content
  • List item - Items in bullet points or numbered lists
  • Table - Complete data tables
  • Table row - Individual rows within tables
  • Image - Picture or graphic elements
  • Footer - Footer content at bottom of pages
  • Formula - Mathematical formulas and equations
  • Composite element - Elements containing multiple types
  • Figure caption - Text describing images or figures
  • Page break - Indicators of page separation
  • Address - Physical address information
  • Email address - Email contact information
  • Page number - Page numbering elements
  • Code snippet - Programming code segments
  • Header - Header content at top of pages
  • Form keys values - Key-value pairs in forms
  • Link - Hyperlinks and references
  • Uncategorized text - Text that doesn’t fit other categories

These classifications help GraphorLM understand document structure and perform more intelligent chunking during the RAG pipeline process. By recognizing different element types, the system can make better decisions about how to segment documents, keeping related elements together and creating more semantically meaningful chunks.

Metadata Extraction

GraphorLM automatically extracts and processes document metadata:

  • File name and type
  • Creation and modification dates
  • Document size and page count
  • Author information (when available)
  • Title and description

Monitoring Processing Status

Monitor the progress of document processing in the Sources dashboard:

  • Uploading - Documents currently being uploaded
  • Processing - Documents currently being converted
  • New - Documents ready for use in RAG pipelines
  • Failed - Documents that encountered errors during processing

For failed documents, you can view error details and retry processing with adjusted settings.

Best Practices

To optimize data ingestion results:

  1. Use consistent formats - When possible, standardize document formats
  2. Check processing results - Review extracted text for accuracy
  3. Customize for complex documents - Use advanced parsing for specialized content
  4. Monitor processing status - Check for failed documents and resolve issues

Troubleshooting

Common issues and solutions:

Using Ingested Data in Your RAG Pipeline

After successfully ingesting your documents, you’ll need to connect them to your RAG pipeline using the Dataset component in the Flow Builder.

The Dataset Component

The Dataset component is the entry point to your RAG pipeline that connects your ingested documents to the subsequent processing steps:

  1. Navigate to the Flows section in the left sidebar
  2. Create a new flow or open an existing one
  3. Drag the Dataset component from the component palette onto the canvas
  4. Double-click the component to open its configuration panel

Configuring the Dataset Component

In the configuration panel, you can:

  1. Select Sources: Choose which documents from your Sources library to include

    • Click on document thumbnails to select/deselect specific documents
    • Select multiple documents for comprehensive knowledge bases
  2. Preview Content: See a preview of the selected documents to confirm you’ve chosen the right sources

Once configured, the Dataset component will process the selected documents and make them available to subsequent components in your flow, particularly the Chunking component.

Best Practices for Dataset Configuration

  • Start focused: Begin with a smaller, high-quality set of documents for initial testing
  • Group related content: Include documents that cover similar or related topics in the same dataset
  • Consider performance: Very large datasets may impact processing time and performance
  • Review regularly: Update your dataset selection as your document library evolves

Next Steps

After successfully ingesting your documents, you’re ready to move on to Chunking to prepare your data for efficient retrieval in your RAG pipeline.