Sources API Overview - GraphorLM Docs

The GraphorLM Sources API provides endpoints for managing the documents in your projects: uploading files, reprocessing them with different parsing methods, listing them, and deleting them. Together, these endpoints let you build document ingestion pipelines and RAG applications.

What are Sources?

Sources in GraphorLM represent documents that serve as the foundation of your knowledge base. These can include:

Local files: PDFs, Word documents, text files, images, spreadsheets, presentations
Web content: URLs, web pages, online articles
Code repositories: GitHub repositories and documentation
Media content: Audio and video files for transcription and analysis

All sources are processed through GraphorLM’s advanced AI pipeline to extract text, recognize structure, and prepare content for retrieval-augmented generation (RAG) workflows.

API Endpoints Overview

The Sources API consists of four main endpoints that provide complete document lifecycle management:

Upload Source

POST https://sources.graphorlm.com/upload

Upload documents from your local system to GraphorLM for processing

Process Source

POST https://sources.graphorlm.com/process

Reprocess existing documents with different AI models and parsing methods

List Sources

GET https://sources.graphorlm.com

Retrieve information about all documents in your project

Delete Source

DELETE https://sources.graphorlm.com/delete

Permanently remove documents from your project
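
The four endpoints above can be collected into a small lookup table. The following is a minimal sketch using only the Python standard library. The `ENDPOINTS` dictionary and the `build_request` helper are illustrative names, not part of GraphorLM, and the request is constructed but never sent.

```python
import urllib.request

# Method and URL for each endpoint, copied from the overview above.
ENDPOINTS = {
    "upload": ("POST", "https://sources.graphorlm.com/upload"),
    "process": ("POST", "https://sources.graphorlm.com/process"),
    "list": ("GET", "https://sources.graphorlm.com"),
    "delete": ("DELETE", "https://sources.graphorlm.com/delete"),
}


def build_request(name: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated request for a named endpoint."""
    method, url = ENDPOINTS[name]
    return urllib.request.Request(
        url,
        method=method,
        headers={"Authorization": f"Bearer {api_token}"},
    )


request = build_request("list", "grlm_your_api_token_here")
print(request.get_method(), request.full_url)
```

Keeping the method and URL together in one place makes it harder to call an endpoint with the wrong HTTP verb, for example a GET against the delete URL.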

Document Processing Pipeline

Understanding how GraphorLM processes your documents helps you make the most of the Sources API:

1. Upload Stage

When you upload a document using the Upload Source endpoint:

The file is validated for type and size (max 100MB)
The document is stored securely in your project
Initial metadata is extracted (filename, size, type)
Processing begins automatically with the default method
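
Because the server validates type and size on upload, a local check before uploading can save a round trip. The sketch below is hypothetical client-side code, not GraphorLM's own validation: `check_before_upload` is an illustrative name, the extension set is copied from the Supported File Types section of this page, and reading "100MB" as 100 × 1024 × 1024 bytes is an assumption.

```python
import os

# Assumption: the documented "100MB" limit is interpreted as 100 MiB.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024

# Extensions taken from the Supported File Types section of this page.
SUPPORTED_EXTENSIONS = {
    "pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx", "odt",
    "txt", "text", "md", "html", "htm", "csv", "tsv",
    "png", "jpg", "jpeg", "tiff", "bmp", "heic",
    "mp3", "wav", "m4a", "ogg", "flac",
    "mp4", "mov", "avi", "mkv", "webm",
}


def check_before_upload(file_name: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    extension = os.path.splitext(file_name)[1].lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_UPLOAD_BYTES}-byte limit")
    return problems


print(check_before_upload("report.pdf", 2_500_000))  # prints []
```

For a file on disk, the size can be obtained with `os.path.getsize(path)` before calling the helper. The server remains the authority; a 400 or 413 response is still possible.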

2. Processing Methods

GraphorLM offers multiple processing methods, selectable via the Process Source endpoint:

Basic Method

OCR Method

YOLOX Method

Advanced Method

3. Document Status Lifecycle

Documents progress through various states that you can monitor using the List Sources endpoint:

Status	Description	Next Steps
New	Document uploaded, awaiting processing	Processing will begin automatically
Processing	AI models are analyzing the document	Wait for completion
Completed	Document ready for use in RAG pipelines	Can be used in flows
Failed	Processing encountered an error	Try a different processing method
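
The status table above can be modeled as a small enumeration, which is handy when deciding whether to keep polling. This is a sketch only: `SourceStatus`, `parse_status`, and `is_terminal` are illustrative names, and the exact status strings returned by the API are assumed to match the table apart from letter case.

```python
from enum import Enum


class SourceStatus(Enum):
    NEW = "new"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# "Next Steps" column from the status table.
NEXT_STEP = {
    SourceStatus.NEW: "Processing will begin automatically",
    SourceStatus.PROCESSING: "Wait for completion",
    SourceStatus.COMPLETED: "Can be used in flows",
    SourceStatus.FAILED: "Try a different processing method",
}


def parse_status(raw: str) -> SourceStatus:
    """Normalize a status string (case-insensitive) into a SourceStatus."""
    return SourceStatus(raw.strip().lower())


def is_terminal(status: SourceStatus) -> bool:
    """Completed and Failed are the states where polling can stop."""
    return status in (SourceStatus.COMPLETED, SourceStatus.FAILED)


status = parse_status("Processing")
print(status.name, "->", NEXT_STEP[status], "| stop polling:", is_terminal(status))
```

An unrecognized status string raises `ValueError`, which surfaces unexpected API values instead of polling forever on them.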

Authentication

All Sources API endpoints require authentication using API tokens:

Authorization: Bearer grlm_your_api_token_here

Learn how to create and manage API tokens in the API Tokens guide.
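
One way to avoid hard-coding the token is to read it from the environment and build the header shown above. This is a minimal sketch; the `GRAPHORLM_API_TOKEN` variable name and the `auth_headers` helper are assumptions made for illustration, not names defined by GraphorLM.

```python
import os


def auth_headers(env_var: str = "GRAPHORLM_API_TOKEN") -> dict:
    """Read the API token from an environment variable and build the Authorization header."""
    # The environment variable name is hypothetical; use whatever your deployment defines.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your API token")
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary can be passed as the `headers` argument of whichever HTTP client you use. Failing early on a missing token gives a clearer message than a 401 response from the API.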

Common Workflows

Basic Document Upload Workflow

Upload: Use Upload Source to add your document
Monitor: Check status with List Sources
Optimize: Reprocess with Process Source if needed
Use: Document is ready for your RAG workflows

Quality Optimization Workflow

Start with Basic method for speed
Review extraction quality
Upgrade to YOLOX or Advanced if needed
Use best results in your application

Document Lifecycle Management

Supported File Types

The Sources API supports a wide range of document formats:

Documents & Text

PDF: Portable Document Format files
Microsoft Office: DOC, DOCX, PPT, PPTX, XLS, XLSX
OpenDocument: ODT (Text documents)
Text Files: TXT, TEXT, MD (Markdown), HTML, HTM
Data Files: CSV, TSV (Comma/Tab-separated values)

Images & Media

Images: PNG, JPG, JPEG, TIFF, BMP, HEIC
Audio: MP3, WAV, M4A, OGG, FLAC
Video: MP4, MOV, AVI, MKV, WEBM

Processing Recommendations

File Type	Recommended Method	Notes
Clean PDFs	Basic or OCR	Fast processing for digital PDFs
Scanned PDFs	OCR or YOLOX	OCR needed for text extraction
Complex Documents	YOLOX or Advanced	Better structure recognition
Images with Text	OCR or YOLOX	Requires OCR for text extraction
Spreadsheets	Basic or YOLOX	YOLOX better for complex tables
Presentations	YOLOX or Advanced	Better slide layout recognition
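
The recommendations above can be encoded as a lookup so that a script picks a starting method automatically. The sketch below is illustrative: `recommend_methods` and the category keys are hypothetical names, the extension-to-category mapping is a simplification, and the lowercase method identifiers are assumed to match those used in the integration examples on this page.

```python
import os

# Rows of the recommendation table, in the order the table lists the methods.
RECOMMENDED_METHODS = {
    "clean_pdf": ["basic", "ocr"],
    "scanned_pdf": ["ocr", "yolox"],
    "complex_document": ["yolox", "advanced"],
    "image_with_text": ["ocr", "yolox"],
    "spreadsheet": ["basic", "yolox"],
    "presentation": ["yolox", "advanced"],
}

IMAGE_EXTENSIONS = {"png", "jpg", "jpeg", "tiff", "bmp", "heic"}


def recommend_methods(file_name: str, scanned: bool = False) -> list:
    """Return the recommended methods for a file, or [] if the table has no matching row."""
    extension = os.path.splitext(file_name)[1].lstrip(".").lower()
    if extension == "pdf":
        category = "scanned_pdf" if scanned else "clean_pdf"
    elif extension in IMAGE_EXTENSIONS:
        category = "image_with_text"
    elif extension in {"xls", "xlsx"}:
        category = "spreadsheet"
    elif extension in {"ppt", "pptx"}:
        category = "presentation"
    else:
        return []
    return list(RECOMMENDED_METHODS[category])


print(recommend_methods("invoice.pdf", scanned=True))  # prints ['ocr', 'yolox']
```

The "Complex Documents" row cannot be inferred from a file extension, so it is kept in the table for callers that already know a document has a complex layout.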

Rate Limits and Best Practices

Rate Limits

Upload: No strict limits, but large files may take longer
Processing: Allow adequate time for complex methods
List/Delete: Standard API rate limits apply

Best Practices

Upload Optimization

Processing Strategy

Management & Monitoring

Error Handling

All Sources API endpoints use consistent error responses:

Common Error Codes

Status Code	Meaning	Common Causes
400	Bad Request	Invalid file type, missing parameters, malformed request
401	Unauthorized	Invalid or missing API token
403	Forbidden	Insufficient permissions for the project
404	Not Found	File or project not found
413	Payload Too Large	File exceeds 100MB limit
500	Internal Server Error	Processing failure or server issues

Error Response Format

{
  "detail": "Descriptive error message explaining what went wrong"
}
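
Given the status codes and the `detail` field described above, an error can be turned into a single readable message. The following is a sketch under the assumption that error bodies are JSON objects of the shape shown; `describe_error` is an illustrative helper, and it falls back to the raw body when the response is not JSON.

```python
import json

# Meanings copied from the Common Error Codes table.
STATUS_MEANINGS = {
    400: "Bad Request",
    401: "Unauthorized",
    403: "Forbidden",
    404: "Not Found",
    413: "Payload Too Large",
    500: "Internal Server Error",
}


def describe_error(status_code: int, body: str) -> str:
    """Combine the status code, its documented meaning, and the 'detail' message."""
    meaning = STATUS_MEANINGS.get(status_code, "Unexpected status")
    detail = None
    try:
        payload = json.loads(body)
        if isinstance(payload, dict) and payload.get("detail") is not None:
            detail = str(payload["detail"])
    except json.JSONDecodeError:
        pass  # Not JSON (for example an HTML error page); use the raw body instead.
    message = detail or body.strip() or "no details provided"
    return f"{status_code} {meaning}: {message}"


print(describe_error(413, '{"detail": "File exceeds the size limit"}'))
# prints: 413 Payload Too Large: File exceeds the size limit
```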

Retry Strategy

import time
import requests
from typing import Optional

def api_call_with_retry(
    method: str, 
    url: str, 
    headers: dict, 
    max_retries: int = 3,
    **kwargs
) -> Optional[requests.Response]:
    """Make API call with exponential backoff retry logic."""
    
    for attempt in range(max_retries):
        is_last_attempt = attempt == max_retries - 1
        wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...

        try:
            response = requests.request(method, url, headers=headers, **kwargs)
        except requests.exceptions.RequestException as e:
            # Network-level failure (connection error, timeout, ...)
            if is_last_attempt:
                raise
            print(f"Request failed: {e}, retrying in {wait_time}s...")
            time.sleep(wait_time)
            continue

        # Success cases
        if response.status_code < 400:
            return response

        # Don't retry client errors (4xx), and give up on the final 5xx.
        # raise_for_status() is called outside the try block so the
        # resulting HTTPError is not caught and retried.
        if response.status_code < 500 or is_last_attempt:
            response.raise_for_status()

        # Retry server errors (5xx)
        print(f"Server error {response.status_code}, retrying in {wait_time}s...")
        time.sleep(wait_time)

    return None  # Normally reached only when max_retries <= 0

Integration Examples

Complete Document Management System

import os
import requests
from typing import List, Dict, Optional
import time

class GraphorLMSourcesClient:
    def __init__(self, api_token: str):
        self.api_token = api_token
        self.base_url = "https://sources.graphorlm.com"
        self.headers = {
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json"
        }
    
    def upload_document(self, file_path: str) -> Dict:
        """Upload a document to GraphorLM."""
        # Send only the auth header so requests can set the multipart Content-Type
        upload_headers = {"Authorization": f"Bearer {self.api_token}"}
        
        with open(file_path, "rb") as f:
            # Use the base name so the uploaded file name carries no local path
            files = {"file": (os.path.basename(file_path), f)}
            response = requests.post(
                f"{self.base_url}/upload",
                headers=upload_headers,
                files=files,
                timeout=300
            )
        
        response.raise_for_status()
        return response.json()
    
    def process_document(self, file_name: str, method: str = "yolox") -> Dict:
        """Reprocess a document with specified method."""
        payload = {
            "file_name": file_name,
            "partition_method": method
        }
        
        response = requests.post(
            f"{self.base_url}/process",
            headers=self.headers,
            json=payload,
            timeout=300
        )
        
        response.raise_for_status()
        return response.json()
    
    def list_documents(self) -> List[Dict]:
        """Get all documents in the project."""
        response = requests.get(
            self.base_url,
            headers=self.headers,
            timeout=30
        )
        
        response.raise_for_status()
        return response.json()
    
    def delete_document(self, file_name: str) -> Dict:
        """Delete a document from the project."""
        payload = {"file_name": file_name}
        
        response = requests.delete(
            f"{self.base_url}/delete",
            headers=self.headers,
            json=payload,
            timeout=60
        )
        
        response.raise_for_status()
        return response.json()
    
    def wait_for_processing(self, file_name: str, timeout: int = 300) -> str:
        """Wait for document processing to complete."""
        start_time = time.time()
        
        while time.time() - start_time < timeout:
            documents = self.list_documents()
            doc = next((d for d in documents if d['file_name'] == file_name), None)
            
            if not doc:
                raise ValueError(f"Document {file_name} not found")
            
            status = doc['status'].lower()
            if status == 'completed':
                return 'completed'
            elif status == 'failed':
                return 'failed'
            
            time.sleep(5)  # Check every 5 seconds
        
        raise TimeoutError(f"Processing timeout for {file_name}")

# Usage Example
client = GraphorLMSourcesClient("grlm_your_api_token")

# Upload and process a document
upload_result = client.upload_document("./document.pdf")
print(f"Uploaded: {upload_result['file_name']}")

# Wait for initial processing
status = client.wait_for_processing(upload_result['file_name'])
print(f"Initial processing: {status}")

# Upgrade to better processing method
if status == 'completed':
    process_result = client.process_document(
        upload_result['file_name'], 
        "yolox"
    )
    print(f"Reprocessed with YOLOX: {process_result['partition_method']}")

Document Quality Assessment Pipeline

class DocumentQualityPipeline {
  constructor(apiToken) {
    this.apiToken = apiToken;
    this.baseUrl = 'https://sources.graphorlm.com';
  }

  async assessAndOptimize(fileName) {
    const methods = ['basic', 'ocr', 'yolox', 'advanced'];
    let bestResult = null;
    let bestScore = 0;

    for (const method of methods) {
      try {
        console.log(`Testing ${method} method for ${fileName}...`);
        
        const result = await this.processDocument(fileName, method);
        const score = this.calculateQualityScore(result);
        
        console.log(`${method}: score ${score}`);
        
        if (score > bestScore) {
          bestScore = score;
          bestResult = { method, result, score };
        }
        
        // If score is good enough, don't try more expensive methods
        if (score >= 0.8 && method !== 'advanced') {
          break;
        }
        
      } catch (error) {
        console.error(`Failed with ${method}:`, error.message);
      }
    }

    return bestResult;
  }

  calculateQualityScore(result) {
    // Implement your quality assessment logic
    // This is a simple example based on available data
    let score = 0;
    
    // Base score for successful processing
    if (result.status === 'success') score += 0.5;
    
    // Higher score for more advanced methods
    const methodScores = {
      'basic': 0.1,
      'ocr': 0.2,
      'yolox': 0.3,
      'advanced': 0.4
    };
    score += methodScores[result.partition_method] || 0;
    
    // File size and type considerations
    if (result.file_size > 1000000) score += 0.1; // Larger files
    if (result.file_type === 'pdf') score += 0.1;  // PDFs often need better processing
    
    return Math.min(score, 1.0);
  }

  async processDocument(fileName, method) {
    const response = await fetch(`${this.baseUrl}/process`, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${this.apiToken}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        file_name: fileName,
        partition_method: method
      })
    });

    if (!response.ok) {
      throw new Error(`HTTP ${response.status}: ${await response.text()}`);
    }

    return response.json();
  }
}

// Usage
const pipeline = new DocumentQualityPipeline('grlm_your_token');
pipeline.assessAndOptimize('complex_document.pdf')
  .then(result => {
    if (result) {
      console.log(`Best method: ${result.method} (score: ${result.score})`);
    } else {
      console.log('No processing method succeeded');
    }
  });

Next Steps

Now that you understand the Sources API, explore these related topics:

Data Ingestion Guide

Learn best practices for document processing and optimization

API Tokens

Set up authentication for accessing the Sources API

Chunking Guide

Optimize document segmentation after processing for better RAG performance

Flows API

Build RAG pipelines using your processed documents

Support and Resources

Getting Help

Code Examples

Monitoring & Analytics
