Transforming unstructured documents into processable data is the foundation of any effective RAG pipeline. GraphorLM’s data ingestion capabilities provide advanced tools for extracting, processing, and organizing information from various document formats.

Overview

GraphorLM’s data ingestion process involves:
  1. Document upload - Import files from various sources
  2. Text extraction - Convert documents to machine-readable text
  3. Structure recognition - Identify document elements and hierarchy
  4. Metadata extraction - Capture important document properties
  5. Content classification - Categorize document sections
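One way to picture the result is a single record that each stage enriches. The sketch below is purely illustrative; the field names are assumptions, not GraphorLM’s actual schema:

```python
from dataclasses import dataclass, field

# Illustrative only: field names are assumptions, not GraphorLM's actual schema.
@dataclass
class IngestedDocument:
    source: str                                     # 1. where the file came from (upload, URL, GitHub, YouTube)
    text: str = ""                                  # 2. machine-readable text extracted from the raw file
    elements: list = field(default_factory=list)    # 3. structural elements (titles, tables, list items...)
    metadata: dict = field(default_factory=dict)    # 4. file name, dates, author, page count...
    labels: list = field(default_factory=list)      # 5. classification labels for each section
```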

Supported Document Types

GraphorLM supports a wide range of document formats:
| Document Type | Extensions | Features |
| --- | --- | --- |
| Text Documents | PDF, TXT, TEXT, MD, DOC, DOCX, ODT, HTML, HTM | Full text extraction, structure preservation |
| Images | PNG, JPG, JPEG, TIFF, BMP, HEIC | OCR for text extraction, image analysis |
| Presentations | PPT, PPTX | Slide extraction, image processing |
| Spreadsheets | XLS, XLSX, CSV, TSV | Table parsing, data extraction |
| Audio Files | MP3, WAV, M4A, OGG, FLAC | Speech-to-text transcription, audio analysis |
| Video Files | MP4, MOV, AVI, MKV, WEBM | Video transcription, visual content extraction |
| Web Content | URL | Web scraping, content extraction |
| Code Repositories | GitHub URL | Repository content extraction, code analysis |
| Video Content | YouTube URL | Video transcription, content extraction |

Importing Documents

There are several ways to import documents into GraphorLM:

Method 1: Local Files Upload

  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select Local files
  4. Drag and drop files or click to browse your file system
  5. Click Finish to begin processing
Note: Large files (>100MB) are automatically uploaded in smaller chunks for improved reliability and progress tracking.
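If you script uploads rather than using the UI, the same chunked-upload idea can be applied client-side. A minimal sketch, assuming a hypothetical HTTP upload endpoint and bearer-token authentication (neither is documented here; check the GraphorLM API reference for the real interface):

```python
import os
import requests

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per part (illustrative size)

def upload_in_chunks(path: str, url: str, token: str) -> None:
    """Upload a large file in sequential parts so progress can be tracked
    and a failed part can be retried without restarting the whole upload."""
    total = os.path.getsize(path)
    with open(path, "rb") as fh:
        offset = 0
        while chunk := fh.read(CHUNK_SIZE):
            requests.post(
                url,  # hypothetical upload endpoint; see the API reference
                headers={
                    "Authorization": f"Bearer {token}",
                    # Standard HTTP header describing which bytes this part covers
                    "Content-Range": f"bytes {offset}-{offset + len(chunk) - 1}/{total}",
                },
                data=chunk,
                timeout=60,
            ).raise_for_status()
            offset += len(chunk)
```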

Method 2: URL Import

To import content directly from a web address:
  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select URL
  4. Enter the web address of the content you want to import
  5. Optionally enable Crawl URLs to extract links and import related pages
  6. Click Finish to begin processing
GraphorLM will fetch the specified URL, extract its content, and, if crawling is enabled, follow the discovered links to import related pages.
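Conceptually, crawling is fetch, extract, follow. The sketch below shows the general technique with requests and BeautifulSoup; it is not GraphorLM’s implementation, just an illustration of what Crawl URLs does:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def crawl(start_url: str, max_pages: int = 10) -> dict[str, str]:
    """Fetch a page, extract its visible text, and follow same-domain links."""
    seen, queue, pages = set(), [start_url], {}
    domain = urlparse(start_url).netloc
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:  # stay on the same site
                queue.append(link)
    return pages
```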

Method 3: GitHub Repository Import

To import content from a GitHub repository:
  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select GitHub
  4. Enter the GitHub repository URL (e.g., https://github.com/username/repository)
  5. Click Finish to begin processing
GraphorLM will clone the repository and extract code files, documentation, and README content for processing. This is particularly useful for:
  • Code documentation and analysis
  • Repository knowledge bases
  • Technical documentation extraction
  • Open source project analysis
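Under the hood, this pattern amounts to a shallow clone followed by a walk over text-bearing files. A rough illustration of the technique (not GraphorLM’s actual pipeline):

```python
import subprocess
import tempfile
from pathlib import Path

TEXT_SUFFIXES = {".md", ".rst", ".txt", ".py", ".js", ".ts", ".java", ".go"}

def extract_repo_text(repo_url: str) -> dict[str, str]:
    """Shallow-clone a repository and collect the contents of code and docs files."""
    files = {}
    with tempfile.TemporaryDirectory() as tmp:
        # --depth 1 fetches only the latest snapshot, not the full history
        subprocess.run(["git", "clone", "--depth", "1", repo_url, tmp], check=True)
        for path in Path(tmp).rglob("*"):
            if path.is_file() and path.suffix in TEXT_SUFFIXES:
                files[str(path.relative_to(tmp))] = path.read_text(errors="ignore")
    return files
```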

Method 4: YouTube Video Import

To import content from YouTube videos:
  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select YouTube
  4. Enter the YouTube video URL
  5. Click Finish to begin processing
GraphorLM will extract audio from the video, perform speech-to-text transcription, and process the resulting content. This enables:
  • Lecture and educational content extraction
  • Meeting and conference transcription
  • Video-based knowledge extraction
  • Audio content analysis
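The same extract-audio-then-transcribe pattern can be reproduced with open-source tools such as yt-dlp and Whisper. GraphorLM’s transcription stack isn’t documented here, so treat this purely as an illustration of the technique:

```python
import yt_dlp    # pip install yt-dlp
import whisper   # pip install openai-whisper (requires ffmpeg)

def transcribe_video(url: str) -> str:
    """Download a video's audio track, then run speech-to-text on it."""
    opts = {"format": "bestaudio/best", "outtmpl": "audio.%(ext)s"}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        audio_path = ydl.prepare_filename(info)
    model = whisper.load_model("base")  # small model; larger ones are more accurate
    return model.transcribe(audio_path)["text"]
```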

Advanced OCR Processing

GraphorLM utilizes state-of-the-art OCR (Optical Character Recognition) to extract text from images and scanned documents.

OCR Features

  • Multi-language support - Recognize text in various languages
  • Layout preservation - Maintain document structure and formatting
  • Table detection - Extract structured data from tables
  • Image text extraction - Identify and capture text embedded in images
  • Handwriting recognition - Process handwritten notes (with varying accuracy)
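For reference, here is what basic OCR extraction looks like with the open-source Tesseract engine via pytesseract. GraphorLM’s OCR stack may differ; this only illustrates plain-text extraction and word-level layout data:

```python
from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract (requires the tesseract binary)

img = Image.open("scanned_page.png")

# Plain text extraction; `lang` selects the recognition language model(s)
text = pytesseract.image_to_string(img, lang="eng")

# Word-level boxes and confidences, useful for preserving layout
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip():
        print(word, conf)
```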

Document Parsing Methods and Classification

GraphorLM offers five parsing and classification methods to optimize document processing based on your needs:

Basic

  • Does not utilize OCR processing
  • Classifies document elements using heuristic methods
  • Suitable for simple text documents with clear structure
  • Fastest processing option
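“Heuristic methods” here means rule-of-thumb checks on the raw text rather than a learned model. A toy example of what such rules can look like (GraphorLM’s actual heuristics are not published):

```python
import re

def classify_line(line: str) -> str:
    """Very rough heuristic element classification, for illustration only."""
    stripped = line.strip()
    if not stripped:
        return "page_break_or_empty"
    if re.match(r"^(\d+[.)]|[-*\u2022])\s", stripped):   # numbered or bulleted
        return "list_item"
    if len(stripped) < 60 and not stripped.endswith(".") and stripped.istitle():
        return "title"
    return "narrative_text"
```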

OCR Only

  • Utilizes OCR for text extraction and parsing
  • Classifies document elements using heuristic methods
  • Recommended for scanned documents and images
  • Balances processing speed and accuracy

YOLOX

  • Utilizes OCR for text extraction and parsing
  • Classifies document elements using the YOLOX model
  • Better recognition of document structure and components
  • Improved accuracy for complex documents

Advanced (Premium)

  • Utilizes OCR for text extraction and parsing
  • Classifies document elements using a fine-tuned model
  • Highest accuracy for document structure recognition
  • Optimized for specialized document types

GraphorLM (Beta)

  • Advanced graph-based RAG partitioning method
  • Utilizes knowledge graph structures for content organization
  • Optimized for complex document relationships
  • Beta feature with enhanced semantic understanding
To select a parsing method for your document:
  1. Upload your document to GraphorLM
  2. Select the document in the Sources list to open the document settings
  3. In the document settings modal, locate the Methods dropdown in the top-right corner
  4. Select your preferred parsing method from the dropdown menu
  5. Click “Reprocess elements” to apply the new parsing method to your document
  6. Wait for the reprocessing to complete - this may take a few moments depending on document size and complexity
When reprocessing is complete, you’ll see the updated document structure and classification results in the preview panel.

Content Classification

GraphorLM can automatically classify document sections to improve retrieval relevance:

Document Element Types

The platform classifies content into the following specific element types:
  • Title - Document and section titles
  • Narrative text - Main body paragraphs and content
  • List item - Items in bullet points or numbered lists
  • Table - Complete data tables
  • Table row - Individual rows within tables
  • Image - Picture or graphic elements
  • Footer - Footer content at bottom of pages
  • Formula - Mathematical formulas and equations
  • Composite element - Elements containing multiple types
  • Figure caption - Text describing images or figures
  • Page break - Indicators of page separation
  • Address - Physical address information
  • Email address - Email contact information
  • Page number - Page numbering elements
  • Code snippet - Programming code segments
  • Header - Header content at top of pages
  • Form keys values - Key-value pairs in forms
  • Link - Hyperlinks and references
  • Uncategorized text - Text that doesn’t fit other categories
These classifications help GraphorLM understand document structure and perform more intelligent chunking during the RAG pipeline process. By recognizing different element types, the system can make better decisions about how to segment documents, keeping related elements together and creating more semantically meaningful chunks.
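For example, element-aware chunking can start a new chunk at each title so headings stay attached to their body text, rather than splitting on raw character counts. A simplified sketch of that idea; the element shape here is an assumption, not GraphorLM’s schema:

```python
def chunk_by_title(elements: list[dict]) -> list[list[dict]]:
    """Start a new chunk at each Title element so headings stay with their body text."""
    chunks: list[list[dict]] = []
    for el in elements:
        if el["type"] == "Title" or not chunks:
            chunks.append([])          # a heading (or the first element) opens a chunk
        chunks[-1].append(el)
    return chunks

elements = [
    {"type": "Title", "text": "Overview"},
    {"type": "NarrativeText", "text": "GraphorLM ingests documents..."},
    {"type": "ListItem", "text": "Document upload"},
    {"type": "Title", "text": "Supported Document Types"},
]
print(len(chunk_by_title(elements)))   # 2 chunks, one per title
```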

Metadata Extraction

GraphorLM automatically extracts and processes document metadata:
  • File name and type
  • Creation and modification dates
  • Document size and page count
  • Author information (when available)
  • Title and description
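For PDFs, most of this information lives in the file itself. You can inspect it with the open-source pypdf library to see what an ingestion pipeline has to work with:

```python
from pypdf import PdfReader   # pip install pypdf

reader = PdfReader("report.pdf")
meta = reader.metadata        # may be None if the PDF carries no metadata

print("Pages:", len(reader.pages))
if meta:
    print("Title:  ", meta.title)
    print("Author: ", meta.author)
    print("Created:", meta.creation_date)
```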

Monitoring Processing Status

Monitor the progress of document processing in the Sources dashboard:
  • Uploading - Documents currently being uploaded
  • Processing - Documents currently being converted
  • New - Documents ready for use in RAG pipelines
  • Failed - Documents that encountered errors during processing
For failed documents, you can view error details and retry processing with adjusted settings.
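If you prefer to monitor programmatically, a polling loop over the same states is the usual pattern. The endpoint URL and response shape below are assumptions for illustration, not a documented GraphorLM API:

```python
import time

import requests

def wait_until_processed(source_id: str, token: str) -> str:
    """Poll a (hypothetical) status endpoint until the document leaves
    the Uploading/Processing states, then return the final state."""
    url = f"https://example.com/api/sources/{source_id}"   # placeholder URL
    while True:
        status = requests.get(
            url, headers={"Authorization": f"Bearer {token}"}, timeout=30
        ).json()["status"]
        if status not in ("Uploading", "Processing"):
            return status              # "New" on success, "Failed" on error
        time.sleep(5)
```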

Best Practices

To optimize data ingestion results:

General Practices

  1. Use consistent formats - When possible, standardize document formats
  2. Check processing results - Review extracted text for accuracy
  3. Customize for complex documents - Use advanced parsing for specialized content
  4. Monitor processing status - Check for failed documents and resolve issues

Source-Specific Practices

For Local Files:
  • Optimize file sizes before upload (large files are processed in chunks automatically)
  • Use descriptive filenames for better organization
  • Group related documents for batch processing
For Web URLs:
  • Enable “Crawl URLs” for comprehensive site extraction
  • Verify URLs are publicly accessible
  • Consider the depth of crawling for large websites
For GitHub Repositories:
  • Use specific branch or tag URLs when needed
  • Be aware that large repositories may take longer to process
For YouTube Videos:
  • Ensure videos have clear audio for better transcription
  • Consider video length - longer videos require more processing time
  • Check that videos are publicly accessible
For Audio/Video Files:
  • Use high-quality audio for better transcription accuracy
  • Consider file size limits and processing time
  • Ensure proper audio codecs for compatibility

Processing Method Selection

  • Basic: Use for simple text documents and fastest processing
  • OCR Only: Choose for scanned documents and images
  • YOLOX: Ideal for complex layouts and mixed content types
  • Advanced: Best for specialized document types requiring highest accuracy
  • GraphorLM (Beta): Experiment with for complex document relationships and advanced semantic understanding

Troubleshooting

Common issues and solutions:
  • Failed processing - Open the document in Sources to view error details, then retry with adjusted settings
  • Missing or garbled text - Reprocess with the OCR Only or YOLOX method, which handle scanned and complex documents better
  • Inaccessible URLs or videos - Verify that web pages, repositories, and videos are publicly accessible before importing
  • Slow processing - Large files, long videos, and big repositories take longer; monitor progress in the Sources dashboard

Using Ingested Data in Your RAG Pipeline

After successfully ingesting your documents, you’ll need to connect them to your RAG pipeline using the Dataset component in the Flow Builder.

The Dataset Component

The Dataset component is the entry point to your RAG pipeline that connects your ingested documents to the subsequent processing steps:
  1. Navigate to the Flows section in the left sidebar
  2. Create a new flow or open an existing one
  3. Drag the Dataset component from the component palette onto the canvas
  4. Double-click the component to open its configuration panel

Configuring the Dataset Component

In the configuration panel, you can:
  1. Select Sources: Choose which documents from your Sources library to include
    • Click on document thumbnails to select/deselect specific documents
    • Select multiple documents for comprehensive knowledge bases
  2. Preview Content: See a preview of the selected documents to confirm you’ve chosen the right sources
Once configured, the Dataset component will process the selected documents and make them available to subsequent components in your flow, particularly the Chunking component.

Best Practices for Dataset Configuration

  • Start focused: Begin with a smaller, high-quality set of documents for initial testing
  • Group related content: Include documents that cover similar or related topics in the same dataset
  • Consider performance: Very large datasets may impact processing time and performance
  • Review regularly: Update your dataset selection as your document library evolves

Document Reprocessing

After uploading documents, you may want to reprocess them with different parsing methods to improve accuracy or extract additional information.

When to Reprocess

  • Improve accuracy: Switch from Basic to more advanced methods for better results
  • Different content focus: Use GraphorLM method for enhanced semantic relationships
  • Quality issues: Reprocess with OCR or YOLOX if initial parsing missed content
  • Method comparison: Test different methods to find optimal results for your use case

How to Reprocess

  1. Navigate to your Sources and select the document you want to reprocess
  2. In the document details panel, locate the Methods dropdown
  3. Select a different parsing method from the available options
  4. Click “Reprocess elements” to start the reprocessing
  5. Monitor the progress
Note: Reprocessing will replace the existing document elements with newly extracted content using the selected method. The original file remains unchanged.

Next Steps

After successfully ingesting your documents, continue building your RAG pipeline: connect your sources through the Dataset component in the Flow Builder, then configure chunking and retrieval for your flow.