Transforming unstructured documents into processable data is the foundation of any effective RAG pipeline. GraphorLM’s data ingestion capabilities provide advanced tools for extracting, processing, and organizing information from various document formats.

Overview

GraphorLM’s data ingestion process involves:
  1. Document upload - Import files from various sources
  2. Text extraction - Convert documents to machine-readable text
  3. Structure recognition - Identify document elements and hierarchy
  4. Metadata extraction - Capture important document properties
  5. Content classification - Categorize document sections
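One way to picture the result is a single record that each stage enriches. The sketch below is purely illustrative; the field names are assumptions, not GraphorLM’s actual schema:

```python
from dataclasses import dataclass, field

# Illustrative only: field names are assumptions, not GraphorLM's actual schema.
@dataclass
class IngestedDocument:
    source: str                                     # 1. where the file came from (upload, URL, GitHub, YouTube)
    text: str = ""                                  # 2. machine-readable text extracted from the raw file
    elements: list = field(default_factory=list)    # 3. structural elements (titles, tables, list items...)
    metadata: dict = field(default_factory=dict)    # 4. file name, dates, author, page count...
    labels: list = field(default_factory=list)      # 5. classification labels for each section
```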

Supported Document Types

GraphorLM supports a wide range of document formats:
| Document Type | Extensions | Features |
| --- | --- | --- |
| Text Documents | PDF, TXT, TEXT, MD, DOC, DOCX, ODT, HTML, HTM | Full text extraction, structure preservation |
| Images | PNG, JPG, JPEG, TIFF, BMP, HEIC | OCR for text extraction, image analysis |
| Presentations | PPT, PPTX | Slide extraction, image processing |
| Spreadsheets | XLS, XLSX, CSV, TSV | Table parsing, data extraction |
| Audio Files | MP3, WAV, M4A, OGG, FLAC | Speech-to-text transcription, audio analysis |
| Video Files | MP4, MOV, AVI, MKV, WEBM | Video transcription, visual content extraction |
| Web Content | URL | Web scraping, content extraction |
| Code Repositories | GitHub URL | Repository content extraction, code analysis |
| Video Content | YouTube URL | Video transcription, content extraction |

Importing Documents

There are several ways to import documents into GraphorLM:

Method 1: Local Files Upload

  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select Local files
  4. Drag and drop files or click to browse your file system
  5. Click Finish to begin processing
Note: Large files (>100MB) are automatically uploaded in smaller chunks for improved reliability and progress tracking.
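If you script uploads rather than using the UI, the same chunked-upload idea can be applied client-side. A minimal sketch, assuming a hypothetical HTTP upload endpoint and bearer-token authentication (neither is documented here; check the GraphorLM API reference for the real interface):

```python
import os
import requests

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per part (illustrative size)

def upload_in_chunks(path: str, url: str, token: str) -> None:
    """Upload a large file in sequential parts so progress can be tracked
    and a failed part can be retried without restarting the whole upload."""
    total = os.path.getsize(path)
    with open(path, "rb") as fh:
        offset = 0
        while chunk := fh.read(CHUNK_SIZE):
            requests.post(
                url,  # hypothetical upload endpoint; see the API reference
                headers={
                    "Authorization": f"Bearer {token}",
                    # Standard HTTP header describing which bytes this part covers
                    "Content-Range": f"bytes {offset}-{offset + len(chunk) - 1}/{total}",
                },
                data=chunk,
                timeout=60,
            ).raise_for_status()
            offset += len(chunk)
```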

Method 2: URL Import

To import content directly from a web address:
  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select URL
  4. Enter the web address of the content you want to import
  5. Optionally enable Crawl URLs to extract links and import related pages
  6. Click Finish to begin processing
GraphorLM will fetch the specified URL, extract its content, and, if crawling is enabled, follow the discovered links to import related pages.
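Conceptually, crawling is fetch, extract, follow. The sketch below shows the general technique with requests and BeautifulSoup; it is not GraphorLM’s implementation, just an illustration of what Crawl URLs does:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def crawl(start_url: str, max_pages: int = 10) -> dict[str, str]:
    """Fetch a page, extract its visible text, and follow same-domain links."""
    seen, queue, pages = set(), [start_url], {}
    domain = urlparse(start_url).netloc
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:  # stay on the same site
                queue.append(link)
    return pages
```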

Method 3: GitHub Repository Import

To import content from a GitHub repository:
  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select GitHub
  4. Enter the GitHub repository URL (e.g., https://github.com/username/repository)
  5. Click Finish to begin processing
GraphorLM will clone the repository and extract code files, documentation, and README content for processing. This is particularly useful for:
  • Code documentation and analysis
  • Repository knowledge bases
  • Technical documentation extraction
  • Open source project analysis
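Under the hood, this pattern amounts to a shallow clone followed by a walk over text-bearing files. A rough illustration of the technique (not GraphorLM’s actual pipeline):

```python
import subprocess
import tempfile
from pathlib import Path

TEXT_SUFFIXES = {".md", ".rst", ".txt", ".py", ".js", ".ts", ".java", ".go"}

def extract_repo_text(repo_url: str) -> dict[str, str]:
    """Shallow-clone a repository and collect the contents of code and docs files."""
    files = {}
    with tempfile.TemporaryDirectory() as tmp:
        # --depth 1 fetches only the latest snapshot, not the full history
        subprocess.run(["git", "clone", "--depth", "1", repo_url, tmp], check=True)
        for path in Path(tmp).rglob("*"):
            if path.is_file() and path.suffix in TEXT_SUFFIXES:
                files[str(path.relative_to(tmp))] = path.read_text(errors="ignore")
    return files
```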

Method 4: YouTube Video Import

To import content from YouTube videos:
  1. Navigate to Sources in the left sidebar
  2. Click Add Sources
  3. Select YouTube
  4. Enter the YouTube video URL
  5. Click Finish to begin processing
GraphorLM will extract audio from the video, perform speech-to-text transcription, and process the resulting content. This enables:
  • Lecture and educational content extraction
  • Meeting and conference transcription
  • Video-based knowledge extraction
  • Audio content analysis
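The same extract-audio-then-transcribe pattern can be reproduced with open-source tools such as yt-dlp and Whisper. GraphorLM’s transcription stack isn’t documented here, so treat this purely as an illustration of the technique:

```python
import yt_dlp    # pip install yt-dlp
import whisper   # pip install openai-whisper (requires ffmpeg)

def transcribe_video(url: str) -> str:
    """Download a video's audio track, then run speech-to-text on it."""
    opts = {"format": "bestaudio/best", "outtmpl": "audio.%(ext)s"}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        audio_path = ydl.prepare_filename(info)
    model = whisper.load_model("base")  # small model; larger ones are more accurate
    return model.transcribe(audio_path)["text"]
```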

Advanced OCR Processing

GraphorLM utilizes state-of-the-art OCR (Optical Character Recognition) to extract text from images and scanned documents.

OCR Features

  • Multi-language support - Recognize text in various languages
  • Layout preservation - Maintain document structure and formatting
  • Table detection - Extract structured data from tables
  • Image text extraction - Identify and capture text embedded in images
  • Handwriting recognition - Process handwritten notes (with varying accuracy)
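For reference, here is what basic OCR extraction looks like with the open-source Tesseract engine via pytesseract. GraphorLM’s OCR stack may differ; this only illustrates plain-text extraction and word-level layout data:

```python
from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract (requires the tesseract binary)

img = Image.open("scanned_page.png")

# Plain text extraction; `lang` selects the recognition language model(s)
text = pytesseract.image_to_string(img, lang="eng")

# Word-level boxes and confidences, useful for preserving layout
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip():
        print(word, conf)
```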

Document Parsing Methods and Classification

GraphorLM offers five parsing and classification methods to optimize document processing based on your needs:

Basic

  • Does not utilize OCR processing
  • Classifies document elements using heuristic methods
  • Suitable for simple text documents with clear structure
  • Fastest processing option
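“Heuristic methods” here means rule-of-thumb checks on the raw text rather than a learned model. A toy example of what such rules can look like (GraphorLM’s actual heuristics are not published):

```python
import re

def classify_line(line: str) -> str:
    """Very rough heuristic element classification, for illustration only."""
    stripped = line.strip()
    if not stripped:
        return "page_break_or_empty"
    if re.match(r"^(\d+[.)]|[-*\u2022])\s", stripped):   # numbered or bulleted
        return "list_item"
    if len(stripped) < 60 and not stripped.endswith(".") and stripped.istitle():
        return "title"
    return "narrative_text"
```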

OCR Only

  • Utilizes OCR for text extraction and parsing
  • Classifies document elements using heuristic methods
  • Recommended for scanned documents and images
  • Balances processing speed and accuracy

YOLOX

  • Utilizes OCR for text extraction and parsing
  • Classifies document elements using the YOLOX model
  • Better recognition of document structure and components
  • Improved accuracy for complex documents

Advanced (Premium)

  • Utilizes OCR for text extraction and parsing
  • Classifies document elements using a fine-tuned model
  • Highest accuracy for document structure recognition
  • Optimized for specialized document types

GraphorLM (Beta)

  • Advanced graph-based RAG partitioning method
  • Utilizes knowledge graph structures for content organization
  • Optimized for complex document relationships
  • Beta feature with enhanced semantic understanding
To select a parsing method for your document:
  1. Upload your document to GraphorLM
  2. Select the document in the Sources list to open the document settings
  3. In the document settings modal, locate the Methods dropdown in the top-right corner
  4. Select your preferred parsing method from the dropdown menu
  5. Click “Reprocess elements” to apply the new parsing method to your document
  6. Wait for the reprocessing to complete - this may take a few moments depending on document size and complexity
When reprocessing is complete, you’ll see the updated document structure and classification results in the preview panel.

Content Classification

GraphorLM can automatically classify document sections to improve retrieval relevance:

Document Element Types

The platform classifies content into the following specific element types:
  • Title - Document and section titles
  • Narrative text - Main body paragraphs and content
  • List item - Items in bullet points or numbered lists
  • Table - Complete data tables
  • Table row - Individual rows within tables
  • Image - Picture or graphic elements
  • Footer - Footer content at bottom of pages
  • Formula - Mathematical formulas and equations
  • Composite element - Elements containing multiple types
  • Figure caption - Text describing images or figures
  • Page break - Indicators of page separation
  • Address - Physical address information
  • Email address - Email contact information
  • Page number - Page numbering elements
  • Code snippet - Programming code segments
  • Header - Header content at top of pages
  • Form keys values - Key-value pairs in forms
  • Link - Hyperlinks and references
  • Uncategorized text - Text that doesn’t fit other categories
These classifications help GraphorLM understand document structure and perform more intelligent chunking during the RAG pipeline process. By recognizing different element types, the system can make better decisions about how to segment documents, keeping related elements together and creating more semantically meaningful chunks.
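For example, element-aware chunking can start a new chunk at each title so headings stay attached to their body text, rather than splitting on raw character counts. A simplified sketch of that idea; the element shape here is an assumption, not GraphorLM’s schema:

```python
def chunk_by_title(elements: list[dict]) -> list[list[dict]]:
    """Start a new chunk at each Title element so headings stay with their body text."""
    chunks: list[list[dict]] = []
    for el in elements:
        if el["type"] == "Title" or not chunks:
            chunks.append([])          # a heading (or the first element) opens a chunk
        chunks[-1].append(el)
    return chunks

elements = [
    {"type": "Title", "text": "Overview"},
    {"type": "NarrativeText", "text": "GraphorLM ingests documents..."},
    {"type": "ListItem", "text": "Document upload"},
    {"type": "Title", "text": "Supported Document Types"},
]
print(len(chunk_by_title(elements)))   # 2 chunks, one per title
```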

Metadata Extraction

GraphorLM automatically extracts and processes document metadata:
  • File name and type
  • Creation and modification dates
  • Document size and page count
  • Author information (when available)
  • Title and description
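For PDFs, most of this information lives in the file itself. You can inspect it with the open-source pypdf library to see what an ingestion pipeline has to work with:

```python
from pypdf import PdfReader   # pip install pypdf

reader = PdfReader("report.pdf")
meta = reader.metadata        # may be None if the PDF carries no metadata

print("Pages:", len(reader.pages))
if meta:
    print("Title:  ", meta.title)
    print("Author: ", meta.author)
    print("Created:", meta.creation_date)
```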

Monitoring Processing Status

Monitor the progress of document processing in the Sources dashboard:
  • Uploading - Documents currently being uploaded
  • Processing - Documents currently being converted
  • New - Documents ready for use in RAG pipelines
  • Failed - Documents that encountered errors during processing
For failed documents, you can view error details and retry processing with adjusted settings.
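If you prefer to monitor programmatically, a polling loop over the same states is the usual pattern. The endpoint URL and response shape below are assumptions for illustration, not a documented GraphorLM API:

```python
import time

import requests

def wait_until_processed(source_id: str, token: str) -> str:
    """Poll a (hypothetical) status endpoint until the document leaves
    the Uploading/Processing states, then return the final state."""
    url = f"https://example.com/api/sources/{source_id}"   # placeholder URL
    while True:
        status = requests.get(
            url, headers={"Authorization": f"Bearer {token}"}, timeout=30
        ).json()["status"]
        if status not in ("Uploading", "Processing"):
            return status              # "New" on success, "Failed" on error
        time.sleep(5)
```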

Best Practices

To optimize data ingestion results:

General Practices

  1. Use consistent formats - When possible, standardize document formats
  2. Check processing results - Review extracted text for accuracy
  3. Customize for complex documents - Use advanced parsing for specialized content
  4. Monitor processing status - Check for failed documents and resolve issues

Source-Specific Practices

For Local Files:
  • Optimize file sizes before upload (large files are processed in chunks automatically)
  • Use descriptive filenames for better organization
  • Group related documents for batch processing
For Web URLs:
  • Enable “Crawl URLs” for comprehensive site extraction
  • Verify URLs are publicly accessible
  • Consider the depth of crawling for large websites
For GitHub Repositories:
  • Use specific branch or tag URLs when needed
  • Be aware that large repositories may take longer to process
For YouTube Videos:
  • Ensure videos have clear audio for better transcription
  • Consider video length - longer videos require more processing time
  • Check that videos are publicly accessible
For Audio/Video Files:
  • Use high-quality audio for better transcription accuracy
  • Consider file size limits and processing time
  • Ensure proper audio codecs for compatibility

Processing Method Selection

  • Basic: Use for simple text documents and fastest processing
  • OCR Only: Choose for scanned documents and images
  • YOLOX: Ideal for complex layouts and mixed content types
  • Advanced: Best for specialized document types requiring highest accuracy
  • GraphorLM (Beta): Experiment with for complex document relationships and advanced semantic understanding

Troubleshooting

Common issues and solutions:
  • Failed processing - Open the document in Sources to view error details, then retry with adjusted settings
  • Missing or garbled text - Reprocess with the OCR Only or YOLOX method, which handle scanned and complex documents better
  • Inaccessible URLs or videos - Verify that web pages, repositories, and videos are publicly accessible before importing
  • Slow processing - Large files, long videos, and big repositories take longer; monitor progress in the Sources dashboard

Using Ingested Data in Your RAG Pipeline

After successfully ingesting your documents, you’ll need to connect them to your RAG pipeline using the Dataset component in the Flow Builder.

The Dataset Component

The Dataset component is the entry point to your RAG pipeline that connects your ingested documents to the subsequent processing steps:
  1. Navigate to the Flows section in the left sidebar
  2. Create a new flow or open an existing one
  3. Drag the Dataset component from the component palette onto the canvas
  4. Double-click the component to open its configuration panel

Configuring the Dataset Component

In the configuration panel, you can:
  1. Select Sources: Choose which documents from your Sources library to include
    • Click on document thumbnails to select/deselect specific documents
    • Select multiple documents for comprehensive knowledge bases
  2. Preview Content: See a preview of the selected documents to confirm you’ve chosen the right sources
Once configured, the Dataset component will process the selected documents and make them available to subsequent components in your flow, particularly the Chunking component.

Best Practices for Dataset Configuration

  • Start focused: Begin with a smaller, high-quality set of documents for initial testing
  • Group related content: Include documents that cover similar or related topics in the same dataset
  • Consider performance: Very large datasets may impact processing time and performance
  • Review regularly: Update your dataset selection as your document library evolves

Document Reprocessing

After uploading documents, you may want to reprocess them with different parsing methods to improve accuracy or extract additional information.

When to Reprocess

  • Improve accuracy: Switch from Basic to more advanced methods for better results
  • Different content focus: Use GraphorLM method for enhanced semantic relationships
  • Quality issues: Reprocess with OCR or YOLOX if initial parsing missed content
  • Method comparison: Test different methods to find optimal results for your use case

How to Reprocess

  1. Navigate to your Sources and select the document you want to reprocess
  2. In the document details panel, locate the Methods dropdown
  3. Select a different parsing method from the available options
  4. Click “Reprocess elements” to start the reprocessing
  5. Monitor the progress
Note: Reprocessing will replace the existing document elements with newly extracted content using the selected method. The original file remains unchanged.

Next Steps

After successfully ingesting your documents, continue building your RAG pipeline: connect your sources through the Dataset component in the Flow Builder, then configure chunking and retrieval for your flow.