Data Ingestion
Learn how to efficiently import and process unstructured documents in GraphorLM
Transforming unstructured documents into processable data is the foundation of any effective RAG pipeline. GraphorLM’s data ingestion capabilities provide advanced tools for extracting, processing, and organizing information from various document formats.
Overview
GraphorLM’s data ingestion process involves:
- Document upload - Import files from various sources
- Text extraction - Convert documents to machine-readable text
- Structure recognition - Identify document elements and hierarchy
- Metadata extraction - Capture important document properties
- Content classification - Categorize document sections
Supported Document Types
GraphorLM supports a wide range of document formats:
| Document Type | Extensions | Features |
|---|---|---|
| Text Documents | PDF, TXT, TEXT, MD, DOC, DOCX, ODT, HTML, HTM | Full text extraction, structure preservation |
| Images | PNG, JPG, JPEG, TIFF, BMP, HEIC | OCR for text extraction, image analysis |
| Presentations | PPT, PPTX | Slide extraction, image processing |
| Spreadsheets | XLS, XLSX, CSV, TSV | Table parsing, data extraction |
| Web Content | URL | Web scraping, content extraction |
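When uploading in bulk, it can help to pre-filter files against the supported extensions above. The sketch below does exactly that; the extension sets come from the table, but the helper itself is illustrative and not part of GraphorLM:

```python
from pathlib import Path

# Extensions from the table above, grouped by document type.
SUPPORTED_EXTENSIONS = {
    "text": {".pdf", ".txt", ".text", ".md", ".doc", ".docx", ".odt", ".html", ".htm"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic"},
    "presentation": {".ppt", ".pptx"},
    "spreadsheet": {".xls", ".xlsx", ".csv", ".tsv"},
}

def classify_upload(path: str) -> str | None:
    """Return the document category for a file, or None if unsupported."""
    ext = Path(path).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if ext in extensions:
            return category
    return None

# Example: skip unsupported files before uploading.
files = ["report.pdf", "scan.heic", "notes.xyz"]
to_upload = [f for f in files if classify_upload(f) is not None]
print(to_upload)  # ['report.pdf', 'scan.heic']
```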
Importing Documents
There are two ways to import documents into GraphorLM:
Method 1: Local Files Upload
- Navigate to Sources in the left sidebar
- Click Add Sources
- Select Local files
- Drag and drop files or click to browse your file system
- Click Finish to begin processing
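The steps above cover the web interface. If you prefer to script uploads, the same flow can be expressed with a plain HTTP client. The endpoint and header below are placeholders, not the documented GraphorLM API, so check the API reference for the actual values:

```python
import requests

API_URL = "https://example.graphorlm.com/sources/upload"  # hypothetical endpoint
API_TOKEN = "YOUR_API_TOKEN"                              # hypothetical credential

def upload_file(path: str) -> dict:
    """Upload a local file as a new source (illustrative only)."""
    with open(path, "rb") as fh:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            files={"file": fh},
            timeout=60,
        )
    response.raise_for_status()
    return response.json()

print(upload_file("quarterly_report.pdf"))
```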
Method 2: URL Import
To import content directly from a web address:
- Navigate to Sources in the left sidebar
- Click Add Sources
- Select URL
- Enter the web address of the content you want to import
- Click Finish to begin processing
GraphorLM will crawl the specified URL, extract the content, and process it for ingestion.
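As with local files, a URL import can be scripted. Again, the endpoint and payload below are illustrative assumptions rather than the documented API:

```python
import requests

API_URL = "https://example.graphorlm.com/sources/url"  # hypothetical endpoint
API_TOKEN = "YOUR_API_TOKEN"                           # hypothetical credential

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"url": "https://en.wikipedia.org/wiki/Retrieval-augmented_generation"},
    timeout=60,
)
response.raise_for_status()
print(response.json())  # e.g. a source id plus an initial "Uploading" or "Processing" status
```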
Advanced OCR Processing
GraphorLM utilizes state-of-the-art OCR (Optical Character Recognition) to extract text from images and scanned documents.
OCR Features
- Multi-language support - Recognize text in various languages
- Layout preservation - Maintain document structure and formatting
- Table detection - Extract structured data from tables
- Image text extraction - Identify and capture text embedded in images
- Handwriting recognition - Process handwritten notes (with varying accuracy)
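GraphorLM's OCR runs server-side during ingestion, so no local setup is needed. If you want a rough sense of how OCR-friendly a scanned page is before uploading, a quick local check with the open-source pytesseract library (unrelated to GraphorLM's own OCR engine) can help; sparse output usually signals poor scan quality:

```python
from PIL import Image
import pytesseract  # pip install pytesseract; also requires the Tesseract binary

# Extract text from a scanned page; very little output often means a poor-quality scan.
image = Image.open("scanned_invoice.png")
text = pytesseract.image_to_string(image, lang="eng")  # e.g. "eng+deu" for multi-language pages

print(f"Recovered {len(text.split())} words")
print(text[:500])
```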
Document Parsing Methods and Classification
GraphorLM offers four parsing and classification methods to optimize document processing based on your needs:
Basic
- Does not utilize OCR processing
- Classifies document elements using heuristic methods
- Suitable for simple text documents with clear structure
- Fastest processing option
OCR Only
- Utilizes OCR for text extraction and parsing
- Classifies document elements using heuristic methods
- Recommended for scanned documents and images
- Balances processing speed and accuracy
YOLOX
- Utilizes OCR for text extraction and parsing
- Classifies document elements using the YOLOX model
- Better recognition of document structure and components
- Improved accuracy for complex documents
Advanced (Premium)
- Utilizes OCR for text extraction and parsing
- Classifies document elements using a fine-tuned model
- Highest accuracy for document structure recognition
- Optimized for specialized document types
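The trade-offs above can be condensed into a simple decision rule. The sketch below encodes one reasonable reading of them; the flags and their thresholds are illustrative, not part of GraphorLM:

```python
def choose_parsing_method(is_scanned: bool, complex_layout: bool, premium: bool = False) -> str:
    """Pick a parsing method based on the descriptions above (illustrative heuristic)."""
    if not is_scanned and not complex_layout:
        return "Basic"        # no OCR needed, fastest option
    if premium:
        return "Advanced"     # fine-tuned model, highest structure accuracy
    if complex_layout:
        return "YOLOX"        # OCR plus YOLOX element classification
    return "OCR Only"         # OCR plus heuristic classification

print(choose_parsing_method(is_scanned=True, complex_layout=False))  # OCR Only
print(choose_parsing_method(is_scanned=True, complex_layout=True))   # YOLOX
```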
To select a parsing method for your document:
- Upload your document to GraphorLM
- Select the document in the Sources list to open the document settings
- In the document settings modal, locate the Methods dropdown in the top-right corner
- Select your preferred parsing method from the dropdown menu
- Click “Reprocess elements” to apply the new parsing method to your document
- Wait for the reprocessing to complete - this may take a few moments depending on document size and complexity
When reprocessing is complete, you’ll see the updated document structure and classification results in the preview panel.
Content Classification
GraphorLM can automatically classify document sections to improve retrieval relevance:
Document Element Types
The platform classifies content into the following specific element types:
- Title - Document and section titles
- Narrative text - Main body paragraphs and content
- List item - Items in bullet points or numbered lists
- Table - Complete data tables
- Table row - Individual rows within tables
- Image - Picture or graphic elements
- Footer - Footer content at bottom of pages
- Formula - Mathematical formulas and equations
- Composite element - Elements containing multiple types
- Figure caption - Text describing images or figures
- Page break - Indicators of page separation
- Address - Physical address information
- Email address - Email contact information
- Page number - Page numbering elements
- Code snippet - Programming code segments
- Header - Header content at top of pages
- Form keys/values - Key-value pairs in forms
- Link - Hyperlinks and references
- Uncategorized text - Text that doesn’t fit other categories
These classifications help GraphorLM understand document structure and perform more intelligent chunking during the RAG pipeline process. By recognizing different element types, the system can make better decisions about how to segment documents, keeping related elements together and creating more semantically meaningful chunks.
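To make the chunking benefit concrete, here is a toy example of element-type-aware segmentation: titles start a new chunk and page furniture is dropped. This is a simplified illustration of the idea, not GraphorLM's actual chunking logic:

```python
# Each classified element is a (type, text) pair, as produced by ingestion.
elements = [
    ("Title", "2. Installation"),
    ("Narrative text", "Download the installer and run it."),
    ("List item", "Windows 10 or later"),
    ("List item", "8 GB RAM"),
    ("Page number", "4"),
    ("Title", "3. Configuration"),
    ("Narrative text", "Edit config.yaml before the first start."),
]

SKIP = {"Page number", "Footer", "Header", "Page break"}

def chunk_by_title(elements):
    """Group elements into chunks, starting a new chunk at every Title."""
    chunks, current = [], []
    for element_type, text in elements:
        if element_type in SKIP:
            continue  # drop page furniture that adds no retrieval value
        if element_type == "Title" and current:
            chunks.append(current)
            current = []
        current.append(text)
    if current:
        chunks.append(current)
    return ["\n".join(chunk) for chunk in chunks]

for chunk in chunk_by_title(elements):
    print(chunk, end="\n---\n")
```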
Metadata Extraction
GraphorLM automatically extracts and processes document metadata:
- File name and type
- Creation and modification dates
- Document size and page count
- Author information (when available)
- Title and description
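Downstream, these fields typically travel with every chunk so they can be used for filtering at retrieval time. A minimal way to represent them in your own code (the field names are illustrative, not GraphorLM's schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DocumentMetadata:
    file_name: str
    file_type: str
    size_bytes: int
    page_count: Optional[int] = None
    created_at: Optional[datetime] = None
    modified_at: Optional[datetime] = None
    author: Optional[str] = None          # only when available in the source file
    title: Optional[str] = None
    description: Optional[str] = None

meta = DocumentMetadata(file_name="handbook.pdf", file_type="PDF", size_bytes=1_245_000, page_count=42)
print(meta)
```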
Monitoring Processing Status
Monitor the progress of document processing in the Sources dashboard:
- Uploading - Documents currently being uploaded
- Processing - Documents currently being converted
- New - Documents ready for use in RAG pipelines
- Failed - Documents that encountered errors during processing
For failed documents, you can view error details and retry processing with adjusted settings.
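If you ingest documents in bulk, the same statuses can be polled from a script. The endpoint below is a placeholder rather than the documented API, but the status values match the list above:

```python
import time
import requests

API_URL = "https://example.graphorlm.com/sources"  # hypothetical endpoint
API_TOKEN = "YOUR_API_TOKEN"                       # hypothetical credential

def wait_until_processed(source_id: str, poll_seconds: int = 10) -> str:
    """Poll a source until it leaves the Uploading/Processing states (illustrative only)."""
    while True:
        response = requests.get(
            f"{API_URL}/{source_id}",
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            timeout=30,
        )
        response.raise_for_status()
        status = response.json()["status"]
        if status in ("New", "Failed"):
            return status
        time.sleep(poll_seconds)

if wait_until_processed("src_123") == "Failed":
    print("Check the error details in the Sources dashboard and retry with adjusted settings.")
```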
Best Practices
To optimize data ingestion results:
- Use consistent formats - When possible, standardize document formats
- Check processing results - Review extracted text for accuracy
- Customize for complex documents - Use advanced parsing for specialized content
- Monitor processing status - Check for failed documents and resolve issues
Troubleshooting
Common issues and solutions:
Using Ingested Data in Your RAG Pipeline
After successfully ingesting your documents, you’ll need to connect them to your RAG pipeline using the Dataset component in the Flow Builder.
The Dataset Component
The Dataset component is the entry point of your RAG pipeline, connecting your ingested documents to the subsequent processing steps:
- Navigate to the Flows section in the left sidebar
- Create a new flow or open an existing one
- Drag the Dataset component from the component palette onto the canvas
- Double-click the component to open its configuration panel
Configuring the Dataset Component
In the configuration panel, you can:
- Select Sources: Choose which documents from your Sources library to include
  - Click document thumbnails to select or deselect specific documents
  - Select multiple documents for comprehensive knowledge bases
- Preview Content: See a preview of the selected documents to confirm you’ve chosen the right sources
Once configured, the Dataset component will process the selected documents and make them available to subsequent components in your flow, particularly the Chunking component.
Best Practices for Dataset Configuration
- Start focused: Begin with a smaller, high-quality set of documents for initial testing
- Group related content: Include documents that cover similar or related topics in the same dataset
- Consider performance: Very large datasets may impact processing time and performance
- Review regularly: Update your dataset selection as your document library evolves
Next Steps
After successfully ingesting your documents, you’re ready to move on to Chunking to prepare your data for efficient retrieval in your RAG pipeline.