
Overview
GraphorLM’s data ingestion process involves:
- Document upload - Import files from various sources
- Text extraction - Convert documents to machine-readable text
- Structure recognition - Identify document elements and hierarchy
- Metadata extraction - Capture important document properties
- Content classification - Categorize document sections
Supported Document Types
GraphorLM supports a wide range of document formats:

Document Type | Extensions | Features |
---|---|---|
Text Documents | PDF, TXT, TEXT, MD, DOC, DOCX, ODT, HTML, HTM | Full text extraction, structure preservation |
Images | PNG, JPG, JPEG, TIFF, BMP, HEIC | OCR for text extraction, image analysis |
Presentations | PPT, PPTX | Slide extraction, image processing |
Spreadsheets | XLS, XLSX, CSV, TSV | Table parsing, data extraction |
Audio Files | MP3, WAV, M4A, OGG, FLAC | Speech-to-text transcription, audio analysis |
Video Files | MP4, MOV, AVI, MKV, WEBM | Video transcription, visual content extraction |
Web Content | URL | Web scraping, content extraction |
Code Repositories | GitHub URL | Repository content extraction, code analysis |
Video Content | YouTube URL | Video transcription, content extraction |
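The table can also be read as a lookup from a file name or URL to a document type, which is handy for checking a batch of sources before upload. The sketch below is only an illustration: `classify_source` and `EXTENSIONS_BY_TYPE` are hypothetical names, not part of GraphorLM, and the YouTube and GitHub host names are assumptions about what counts as such a URL.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extension groups copied from the table above.
# The dict and function names are hypothetical, not a GraphorLM API.
EXTENSIONS_BY_TYPE = {
    "Text Documents": {"pdf", "txt", "text", "md", "doc", "docx", "odt", "html", "htm"},
    "Images": {"png", "jpg", "jpeg", "tiff", "bmp", "heic"},
    "Presentations": {"ppt", "pptx"},
    "Spreadsheets": {"xls", "xlsx", "csv", "tsv"},
    "Audio Files": {"mp3", "wav", "m4a", "ogg", "flac"},
    "Video Files": {"mp4", "mov", "avi", "mkv", "webm"},
}

# Assumed host names for the two URL-based source types.
GITHUB_HOSTS = {"github.com", "www.github.com"}
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}


def classify_source(source: str) -> Optional[str]:
    """Return the table's document type for a file name or URL, or None if unsupported."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        if host in GITHUB_HOSTS:
            return "Code Repositories"
        if host in YOUTUBE_HOSTS:
            return "Video Content"
        return "Web Content"
    # Not a URL: compare the file extension case-insensitively.
    suffix = PurePosixPath(source).suffix.lower().lstrip(".")
    for doc_type, extensions in EXTENSIONS_BY_TYPE.items():
        if suffix in extensions:
            return doc_type
    return None
```

For example, `classify_source("report.PDF")` returns `"Text Documents"`, and `classify_source("archive.zip")` returns `None` because ZIP is not in the table.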
Importing Documents
There are several ways to import documents into GraphorLM:
Method 1: Local Files Upload
- Navigate to Sources in the left sidebar
- Click Add Sources
- Select Local files
- Drag and drop files or click to browse your file system
- Click Finish to begin processing

Method 2: URL Import
To import content directly from a web address:
- Navigate to Sources in the left sidebar
- Click Add Sources
- Select URL
- Enter the web address of the content you want to import
- Optionally enable Crawl URLs to extract links and import related pages
- Click Finish to begin processing
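To get a feel for how many pages a crawl can pull in, the sketch below walks a toy link graph breadth-first with a depth limit. It illustrates the general idea of link crawling only; it is not GraphorLM's crawler, and `crawl`, `SITE`, and `max_depth` are hypothetical names (the documentation does not say whether Crawl URLs exposes a depth setting).

```python
from collections import deque


def crawl(start: str, links: dict[str, list[str]], max_depth: int) -> list[str]:
    """Breadth-first walk of a link graph, visiting pages at most max_depth links from start."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# Toy site used only for illustration: page -> pages it links to.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/setup", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/setup": ["/docs/setup/advanced"],
}

for depth in range(4):
    print(depth, len(crawl("/", SITE, depth)))
```

On this toy site the page count grows from 1 at depth 0 to 3, 5, and 6 at depths 1 to 3, which is why crawling a large website can add far more content than the single page you entered.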
Method 3: GitHub Repository Import
To import content from a GitHub repository:
- Navigate to Sources in the left sidebar
- Click Add Sources
- Select GitHub
- Enter the GitHub repository URL (e.g., https://github.com/username/repository)
- Click Finish to begin processing
GitHub import is well suited to:
- Code documentation and analysis
- Repository knowledge bases
- Technical documentation extraction
- Open source project analysis
Method 4: YouTube Video Import
To import content from YouTube videos:
- Navigate to Sources in the left sidebar
- Click Add Sources
- Select YouTube
- Enter the YouTube video URL
- Click Finish to begin processing
YouTube import is well suited to:
- Lecture and educational content extraction
- Meeting and conference transcription
- Video-based knowledge extraction
- Audio content analysis
Advanced OCR Processing
GraphorLM utilizes state-of-the-art OCR (Optical Character Recognition) to extract text from images and scanned documents.
OCR Features
- Multi-language support - Recognize text in various languages
- Layout preservation - Maintain document structure and formatting
- Table detection - Extract structured data from tables
- Image text extraction - Identify and capture text embedded in images
- Handwriting recognition - Process handwritten notes (with varying accuracy)
Document Parsing Methods and Classification
GraphorLM offers six parsing and classification methods to optimize document processing based on your needs:
Basic
- Does not utilize OCR processing
- Classifies document elements using heuristic methods
- Suitable for simple text documents with clear structure
- Fastest processing option
OCR Only
- Utilizes OCR for text extraction and parsing
- Classifies document elements using heuristic methods
- Recommended for scanned documents and images
- Balances processing speed and accuracy
Hi-Res
- Utilizes OCR for text extraction and parsing
- Classifies document elements using the Hi-Res model
- Better recognition of document structure and components
- Improved accuracy for complex documents
Hi-Res (fine-tuned)
- Utilizes OCR for text extraction and parsing
- Classifies document elements using a fine-tuned model
- Highest accuracy for document structure recognition
- Optimized for specialized document types
GraphorLM (Beta)
- Advanced graph-based RAG partitioning method
- Utilizes knowledge graph structures for content organization
- Optimized for complex document relationships
- Beta feature with enhanced semantic understanding
MAI
- Model-assisted partitioning focused on text content
- Does not produce bounding boxes or page layout metadata (no bbox)
- Faster and lighter when layout-aware features are unnecessary
- Recommended when you only need clean text and semantic element types
- Performs page annotation (page-level labels and context)
- Performs document annotation (document-level labels and summaries)
- Performs image annotation when images are present in the document
- Best-in-class text parsing quality
- Limited element classification quality (focus is on textual content)
To change the parsing method for a document:
- Upload your document to GraphorLM
- Select the document in the Sources list to open the document settings
- In the document settings modal, locate the Methods dropdown in the top-right corner
- Select your preferred parsing method from the dropdown menu
- Click “Reprocess elements” to apply the new parsing method to your document
- Wait for the reprocessing to complete - this may take a few moments depending on document size and complexity

Content Classification
GraphorLM can automatically classify document sections to improve retrieval relevance:
Document Element Types
The platform classifies content into the following specific element types:
- Title - Document and section titles
- Narrative text - Main body paragraphs and content
- List item - Items in bullet points or numbered lists
- Table - Complete data tables
- Table row - Individual rows within tables
- Image - Picture or graphic elements
- Footer - Footer content at bottom of pages
- Formula - Mathematical formulas and equations
- Composite element - Elements containing multiple types
- Figure caption - Text describing images or figures
- Page break - Indicators of page separation
- Address - Physical address information
- Email address - Email contact information
- Page number - Page numbering elements
- Code snippet - Programming code segments
- Header - Header content at top of pages
- Form keys values - Key-value pairs in forms
- Link - Hyperlinks and references
- Uncategorized text - Text that doesn’t fit other categories
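One way to work with these element types downstream is to model them as an enumeration and filter out page furniture before retrieval. The sketch below assumes elements are available as `(type, text)` pairs; the `ElementType` enum, its member names, and `drop_page_furniture` are hypothetical and do not reflect identifiers used by GraphorLM.

```python
from enum import Enum


class ElementType(Enum):
    """The element types listed above; member names are illustrative."""
    TITLE = "Title"
    NARRATIVE_TEXT = "Narrative text"
    LIST_ITEM = "List item"
    TABLE = "Table"
    TABLE_ROW = "Table row"
    IMAGE = "Image"
    FOOTER = "Footer"
    FORMULA = "Formula"
    COMPOSITE_ELEMENT = "Composite element"
    FIGURE_CAPTION = "Figure caption"
    PAGE_BREAK = "Page break"
    ADDRESS = "Address"
    EMAIL_ADDRESS = "Email address"
    PAGE_NUMBER = "Page number"
    CODE_SNIPPET = "Code snippet"
    HEADER = "Header"
    FORM_KEYS_VALUES = "Form keys values"
    LINK = "Link"
    UNCATEGORIZED_TEXT = "Uncategorized text"


# Types that rarely carry answerable content (an assumption, adjust to your data).
PAGE_FURNITURE = {
    ElementType.HEADER,
    ElementType.FOOTER,
    ElementType.PAGE_NUMBER,
    ElementType.PAGE_BREAK,
}


def drop_page_furniture(elements: list[tuple[ElementType, str]]) -> list[tuple[ElementType, str]]:
    """Keep only elements whose type is not page furniture, preserving order."""
    return [(kind, text) for kind, text in elements if kind not in PAGE_FURNITURE]
```

Keeping the labels in one enumeration also makes it easy to look a type up by its display name, for example `ElementType("Table row")`.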
Metadata Extraction
GraphorLM automatically extracts and processes document metadata:
- File name and type
- Creation and modification dates
- Document size and page count
- Author information (when available)
- Title and description
Monitoring Processing Status
Monitor the progress of document processing in the Sources dashboard:
- Uploading - Documents currently being uploaded
- Processing - Documents currently being converted
- New - Documents ready for use in RAG pipelines
- Failed - Documents that encountered errors during processing
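If you track document statuses outside the dashboard, the four states map naturally onto a small enumeration. The sketch below assumes you already have a mapping of document names to statuses (for example, copied from the Sources dashboard); `Status`, `group_by_status`, and `ready_for_rag` are hypothetical names, not a GraphorLM API.

```python
from collections import defaultdict
from enum import Enum


class Status(Enum):
    """Processing states shown in the Sources dashboard."""
    UPLOADING = "Uploading"
    PROCESSING = "Processing"
    NEW = "New"
    FAILED = "Failed"


def group_by_status(documents: dict[str, Status]) -> dict[Status, list[str]]:
    """Group document names by their current status."""
    groups: dict[Status, list[str]] = defaultdict(list)
    for name, status in documents.items():
        groups[status].append(name)
    return dict(groups)


def ready_for_rag(documents: dict[str, Status]) -> list[str]:
    """Return the names of documents in the New state, sorted alphabetically."""
    return sorted(name for name, status in documents.items() if status is Status.NEW)
```

Documents grouped under `Status.FAILED` are the ones to revisit, for instance by reprocessing them with a different parsing method.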
Best Practices
To optimize data ingestion results:
General Practices
- Use consistent formats - When possible, standardize document formats
- Check processing results - Review extracted text for accuracy
- Customize for complex documents - Use advanced parsing for specialized content
- Monitor processing status - Check for failed documents and resolve issues
Source-Specific Practices
For Local Files:
- Optimize file sizes before upload (large files are processed in chunks automatically)
- Use descriptive filenames for better organization
- Group related documents for batch processing
For URLs:
- Enable “Crawl URLs” for comprehensive site extraction
- Verify URLs are publicly accessible
- Consider the depth of crawling for large websites
For GitHub Repositories:
- Use specific branch or tag URLs when needed
- Be aware that large repositories may take longer to process
For YouTube Videos:
- Ensure videos have clear audio for better transcription
- Consider video length - longer videos require more processing time
- Check that videos are publicly accessible
For Audio/Video Files:
- Use high-quality audio for better transcription accuracy
- Consider file size limits and processing time
- Ensure proper audio codecs for compatibility
Processing Method Selection
- Basic: Use for simple text documents and fastest processing
- OCR Only: Choose for scanned documents and images
- Hi-Res: Ideal for complex layouts and mixed content types
- Hi-Res (fine-tuned): Best for specialized document types requiring highest accuracy
- GraphorLM (Beta): Try for complex document relationships and advanced semantic understanding
- MAI: Prefer when you don’t need bounding boxes/layout metadata and want speed and text precision
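The guidance above can be condensed into a small decision helper. This is a sketch under stated assumptions: the function name, its flags, and especially the order in which the checks are applied are illustrative choices, not rules defined by GraphorLM.

```python
def recommend_method(
    *,
    scanned: bool = False,
    complex_layout: bool = False,
    specialized: bool = False,
    document_relationships: bool = False,
    text_only: bool = False,
) -> str:
    """Suggest a parsing method name from coarse document traits.

    The priority order below is an assumption made for illustration.
    """
    if text_only:
        # No bounding boxes or layout metadata needed: favor speed and text quality.
        return "MAI"
    if document_relationships:
        return "GraphorLM (Beta)"
    if specialized:
        return "Hi-Res (fine-tuned)"
    if complex_layout:
        return "Hi-Res"
    if scanned:
        return "OCR Only"
    return "Basic"
```

For example, `recommend_method(scanned=True)` returns `"OCR Only"`, while `recommend_method(scanned=True, complex_layout=True)` returns `"Hi-Res"`. Treat the result as a starting point and compare methods by reprocessing when accuracy matters.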
Troubleshooting
Common issues and solutions:
Poor OCR quality
For low-quality scanned documents, try:
- Using OCR Only, Hi-Res, Hi-Res (fine-tuned), or GraphorLM (Beta) instead of Basic
- Breaking large documents into smaller files
- Improving document quality before upload if possible
- Using MAI for the best text parsing and speed when you don’t need bounding boxes
Table extraction problems
If tables aren’t being properly recognized:
- Use Hi-Res, Hi-Res (fine-tuned), or GraphorLM (Beta) for better table detection
- Convert complex tables to simpler formats before upload
Multi-language document issues
For documents with multiple languages:
- Process different language sections as separate documents
- Use Hi-Res, Hi-Res (fine-tuned), or GraphorLM (Beta), which have better multi-language support
- For multilingual content where layout metadata is unnecessary, MAI provides strong OCR-backed text parsing without bounding boxes
Slow processing time
If processing is taking too long:
- Prefer MAI when you don’t need bounding boxes or layout metadata to reduce latency
- Avoid using Hi-Res (fine-tuned) unless you need its layout accuracy
GitHub repository access issues
If you’re having trouble importing from GitHub:
- Ensure the repository is public or you have proper access permissions
- Check that the repository URL is correctly formatted
- Large repositories may take longer to process - be patient during import
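A quick shape check can catch malformed repository URLs before you paste them into the import dialog. The sketch below only checks that the URL looks like `https://github.com/username/repository`, optionally followed by a branch or tag path. It does not contact GitHub, so it cannot tell whether the repository exists or is public. The function name and the allowed character set are assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Conservative character set for the owner and repository segments (an assumption).
_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_github_repo_url(url: str) -> bool:
    """Return True if url has the shape https://github.com/<owner>/<repo>[/...]."""
    parsed = urlparse(url.strip())
    if parsed.scheme != "https":
        return False
    if (parsed.hostname or "").lower() not in ("github.com", "www.github.com"):
        return False
    parts = [part for part in parsed.path.split("/") if part]
    # Require at least owner and repository; extra segments (e.g. a branch path) are allowed.
    return len(parts) >= 2 and all(_NAME.fullmatch(part) for part in parts[:2])
```

For instance, `is_github_repo_url("https://github.com/username")` is `False` because the repository segment is missing.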
YouTube video processing issues
For YouTube video import problems:
- Verify the video URL is accessible and not private
- Note that very long videos may take significant processing time
- Audio quality affects transcription accuracy
Audio/Video processing issues
For audio and video file problems:
- Ensure audio quality is clear for better transcription results
- Large video files may be automatically processed in chunks
- Check that audio language is supported for transcription
Using Ingested Data in Your RAG Pipeline
After successfully ingesting your documents, you’ll need to connect them to your RAG pipeline using the Dataset component in the Flow Builder.
The Dataset Component

- Navigate to the Flows section in the left sidebar
- Create a new flow or open an existing one
- Drag the Dataset component from the component palette onto the canvas
- Double-click the component to open its configuration panel
Configuring the Dataset Component
In the configuration panel, you can:
- Select Sources: Choose which documents from your Sources library to include
  - Click on document thumbnails to select/deselect specific documents
  - Select multiple documents for comprehensive knowledge bases
- Preview Content: See a preview of the selected documents to confirm you’ve chosen the right sources

Best Practices for Dataset Configuration
- Start focused: Begin with a smaller, high-quality set of documents for initial testing
- Group related content: Include documents that cover similar or related topics in the same dataset
- Consider performance: Very large datasets may impact processing time and performance
- Review regularly: Update your dataset selection as your document library evolves
Document Reprocessing
After uploading documents, you may want to reprocess them with different parsing methods to improve accuracy or extract additional information.
When to Reprocess
- Improve accuracy: Switch from Basic to more advanced methods for better results
- Different content focus: Use the GraphorLM (Beta) method for enhanced semantic relationships
- Quality issues: Reprocess with OCR Only or Hi-Res if initial parsing missed content
- Method comparison: Test different methods to find optimal results for your use case
How to Reprocess
- Navigate to your Sources and select the document you want to reprocess
- In the document details panel, locate the Methods dropdown
- Select a different parsing method from the available options
- Click “Reprocess elements” to start the reprocessing
- Monitor the progress
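When comparing methods, it helps to look at how the mix of element types changes between two runs of the same document. The sketch below assumes you have noted the element types from each run as plain lists of labels (for example, by reading them off the document view); `compare_runs` is a hypothetical helper, and the two sample lists are placeholders, not real GraphorLM output.

```python
from collections import Counter


def compare_runs(before: list[str], after: list[str]) -> dict[str, int]:
    """Return the change in count per element type (after minus before), omitting unchanged types."""
    before_counts, after_counts = Counter(before), Counter(after)
    all_types = sorted(set(before_counts) | set(after_counts))
    return {
        kind: after_counts[kind] - before_counts[kind]
        for kind in all_types
        if after_counts[kind] != before_counts[kind]
    }


# Placeholder labels for illustration only.
basic_run = ["Title", "Narrative text", "Narrative text", "Uncategorized text"]
hi_res_run = ["Title", "Narrative text", "Table", "Figure caption"]

print(compare_runs(basic_run, hi_res_run))
```

A drop in "Uncategorized text" alongside new "Table" or "Figure caption" elements suggests the second method recognized structure that the first one missed.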
Next Steps
After successfully ingesting your documents, explore these next steps to build your RAG pipeline:
Document Chunking
Learn how to optimize document segmentation for efficient retrieval in your RAG pipeline
API Integration
Integrate document upload into your applications using the GraphorLM REST API
Flow Builder
Create custom RAG workflows using the visual Flow Builder interface
API Tokens
Set up authentication for programmatic access to your documents and data
Smart RAG
Implement intelligent retrieval with context-aware document processing
Graph RAG
Build knowledge graphs from your documents for enhanced semantic retrieval