The parse method allows you to reprocess previously uploaded documents using different parsing and classification methods. This enables you to optimize document processing for better text extraction, structure recognition, and retrieval performance without re-uploading the file.

Method Overview

Sync Method

client.sources.parse()

Async Method

await client.sources.parse()

Method Signature

client.sources.parse(
    file_id: str | None = None,          # Preferred
    file_name: str | None = None,        # Deprecated
    partition_method: PartitionMethod | None = None,   # Optional
    timeout: float | None = None
) -> PublicSource

Parameters

Parameter         Type             Description                                                                  Required
file_id           str              Unique identifier for the source (preferred)                                 No*
file_name         str              Name of the previously uploaded file to reprocess (deprecated, use file_id)  No*
partition_method  PartitionMethod  Processing method to use (see available methods below)                       No
timeout           float            Request timeout in seconds                                                   No
*At least one of file_id or file_name must be provided. file_id is preferred.
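
The identifier rule above can be checked before a request is sent. The helper below is a minimal sketch: `build_parse_kwargs` is a hypothetical name, not part of the SDK, and it assumes only the parameter names listed in the table. When both identifiers are given it keeps `file_id`, since that is the preferred one.

```python
from typing import Any, Optional


def build_parse_kwargs(
    file_id: Optional[str] = None,
    file_name: Optional[str] = None,
    partition_method: Optional[str] = None,
) -> dict[str, Any]:
    """Build keyword arguments for client.sources.parse() (hypothetical helper)."""
    if not file_id and not file_name:
        raise ValueError("Provide file_id (preferred) or file_name")

    kwargs: dict[str, Any] = {}
    if file_id:
        # file_id wins when both identifiers are supplied
        kwargs["file_id"] = file_id
    else:
        kwargs["file_name"] = file_name

    # Leave partition_method out entirely when the caller did not choose one
    if partition_method is not None:
        kwargs["partition_method"] = partition_method
    return kwargs
```

The result can then be passed as `client.sources.parse(**build_parse_kwargs(file_id=...))`.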

Available Processing Methods

The PartitionMethod type accepts the following literal values:
Value: "basic"
Best for: Simple text documents, quick processing
  • Fast processing with heuristic classification
  • No OCR processing
  • Suitable for plain text files and well-structured documents
  • Recommended for testing and development
Value: "hi_res"
Best for: Complex documents with varied layouts
  • OCR-based text extraction
  • AI-powered document structure classification using Hi-Res model
  • Better recognition of tables, figures, and document elements
  • Enhanced accuracy for complex layouts
Value: "hi_res_ft"
Best for: Premium accuracy, specialized documents
  • OCR-based text extraction
  • Fine-tuned AI model for document classification
  • Highest accuracy for document structure recognition
  • Optimized for specialized and complex document types
  • Note: Premium feature
Value: "mai"
Best for: Text-first parsing, manuscripts, and handwritten documents
  • Our best text-first parsing with high-quality output
  • Does not output bounding boxes or page layout (no bbox)
  • Best for MANUSCRIPT and HANDWRITTEN documents
  • Performs page annotation (page-level labels and context)
  • Performs document annotation (document-level labels and summaries)
  • Performs image annotation when images are present in the document
  • Best-in-class text parsing quality; element classification is limited
Value: "graphorlm"
Best for: Complex layouts, multi-page tables, diagrams, and images
  • Our highest parsing setting for complex layouts
  • Rich annotations for images and complex elements
  • Uses agentic processing for enhanced understanding
  • Advanced document understanding capabilities
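
Because the method is passed as a plain string, a typo only surfaces as a rejected request. The sketch below validates a value locally first. The `PartitionMethod` alias here is an assumption that mirrors the literal values listed above; the SDK's own definition may differ, and `validate_partition_method` is a hypothetical helper.

```python
from typing import Literal, get_args

# Assumed to mirror the literal values documented above
PartitionMethod = Literal["basic", "hi_res", "hi_res_ft", "mai", "graphorlm"]
VALID_METHODS: tuple[str, ...] = get_args(PartitionMethod)


def validate_partition_method(value: str) -> str:
    """Return value unchanged if it is a documented method, otherwise raise."""
    if value not in VALID_METHODS:
        raise ValueError(
            f"Unknown partition_method {value!r}; expected one of {VALID_METHODS}"
        )
    return value
```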

Method Reference

Method     partition_method Value
Fast       "basic"
Balanced   "hi_res"
Accurate   "hi_res_ft"
VLM        "mai"
Agentic    "graphorlm"
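
If your application shows the display names to users, the table above can be kept as a small lookup. This is an illustrative sketch; `METHOD_BY_LABEL` and `to_partition_method` are hypothetical names, and matching is made case-insensitive as a convenience.

```python
# Display name (lowercased) -> partition_method value, taken from the table above
METHOD_BY_LABEL = {
    "fast": "basic",
    "balanced": "hi_res",
    "accurate": "hi_res_ft",
    "vlm": "mai",
    "agentic": "graphorlm",
}


def to_partition_method(label: str) -> str:
    """Translate a display name such as 'Balanced' into its partition_method value."""
    key = label.strip().lower()
    try:
        return METHOD_BY_LABEL[key]
    except KeyError:
        raise ValueError(f"Unknown method label: {label!r}") from None
```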

Processing Method Comparison

MethodSpeedText ParsingElement ClassificationBounding BoxesBest Use CasesOCR
Fast⚡⚡⚡⭐⭐⭐⭐✅ (limited)Simple text files, testing
Balanced⭐⭐⭐⭐⭐⭐⭐⭐Complex layouts, mixed content
Accurate⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐Premium accuracy needed
VLM⚡⚡⚡⭐⭐⭐⭐⭐⭐⭐⭐Manuscripts, handwritten documents
Agentic⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐Complex layouts, multi-page tables, diagrams

Response Object

The method returns a PublicSource object with the following properties:
Property          Type        Description
status            str         Processing result (typically “success”)
message           str         Human-readable success message
file_name         str         Name of the processed file
file_size         int         Size of the file in bytes
file_type         str         File extension/type
file_source       str         Source type of the original file
project_id        str         UUID of the project containing the file
project_name      str         Name of the project
partition_method  str | None  Processing method that was applied
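
Since `partition_method` may be `None`, code that logs or displays the response should handle that case. The sketch below formats a one-line summary from the properties listed above. `describe_source` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in for a real `PublicSource` so the example runs without the SDK.

```python
from types import SimpleNamespace


def describe_source(source) -> str:
    """Summarize a PublicSource-like object using the documented properties."""
    method = source.partition_method or "not reported"
    size_kib = source.file_size / 1024  # file_size is documented as bytes
    return (
        f"{source.file_name} ({source.file_type}, {size_kib:.1f} KiB): "
        f"status={source.status}, method={method}"
    )


# Stand-in object with the documented attribute names (not a real response)
example = SimpleNamespace(
    status="success",
    file_name="report.pdf",
    file_size=2048,
    file_type="pdf",
    partition_method=None,
)
print(describe_source(example))
```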

Code Examples

Basic Usage

from graphor import Graphor

client = Graphor()

# Reprocess a document with the Balanced method
source = client.sources.parse(
    file_name="document.pdf",
    partition_method="hi_res"
)

print(f"Processed: {source.file_name}")
print(f"Method: {source.partition_method}")
print(f"Status: {source.status}")

Using Different Methods

from graphor import Graphor

client = Graphor()

# Fast processing for simple documents
source = client.sources.parse(
    file_name="simple-text.txt",
    partition_method="basic"
)

# Balanced for complex layouts
source = client.sources.parse(
    file_name="report.pdf",
    partition_method="hi_res"
)

# Accurate for premium quality
source = client.sources.parse(
    file_name="legal-contract.pdf",
    partition_method="hi_res_ft"
)

# VLM for handwritten documents
source = client.sources.parse(
    file_name="handwritten-notes.pdf",
    partition_method="mai"
)

# Agentic for complex diagrams and tables
source = client.sources.parse(
    file_name="technical-manual.pdf",
    partition_method="graphorlm"
)

Async Usage

import asyncio
from graphor import AsyncGraphor

async def process_document(file_name: str, method: str):
    client = AsyncGraphor()
    
    source = await client.sources.parse(
        file_name=file_name,
        partition_method=method
    )
    
    print(f"Processed: {source.file_name}")
    print(f"Method: {source.partition_method}")
    
    return source

# Run the async function
asyncio.run(process_document("document.pdf", "hi_res"))

With Extended Timeout

Processing complex documents can take several minutes. Configure appropriate timeouts:
from graphor import Graphor

# Configure default timeout for all requests
client = Graphor(timeout=300.0)  # 5 minutes

source = client.sources.parse(
    file_name="large-document.pdf",
    partition_method="graphorlm"
)

# Or per-request timeout
source = client.with_options(timeout=600.0).sources.parse(
    file_name="very-large-document.pdf",
    partition_method="hi_res_ft"
)
Processing can take several minutes depending on document size, complexity, and the selected processing method. Advanced methods like Balanced, Accurate, VLM and Agentic typically require more time for analysis.
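
One way to apply this is to pick the per-request timeout from the selected method. The sketch below uses the `with_options(timeout=...)` call shown above. The timeout values are illustrative assumptions, not documented limits, and `parse_with_method_timeout` is a hypothetical helper that accepts any client object exposing `with_options` and `sources.parse`.

```python
# Illustrative values only (assumptions); tune them for your own documents
TIMEOUT_BY_METHOD = {
    "basic": 60.0,
    "hi_res": 300.0,
    "hi_res_ft": 600.0,
    "mai": 300.0,
    "graphorlm": 600.0,
}
DEFAULT_TIMEOUT = 300.0  # 5 minutes, used for any method not listed above


def parse_with_method_timeout(client, file_name: str, method: str):
    """Call sources.parse() with a per-request timeout chosen by method."""
    timeout = TIMEOUT_BY_METHOD.get(method, DEFAULT_TIMEOUT)
    return client.with_options(timeout=timeout).sources.parse(
        file_name=file_name,
        partition_method=method,
    )
```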

Error Handling

import graphor
from graphor import Graphor

client = Graphor()

try:
    source = client.sources.parse(
        file_name="document.pdf",
        partition_method="hi_res"
    )
    print(f"Processing successful: {source.file_name}")
    
except graphor.NotFoundError as e:
    print(f"File not found: {e}")
    
except graphor.BadRequestError as e:
    print(f"Invalid request (check partition_method): {e}")
    
except graphor.AuthenticationError as e:
    print(f"Invalid API key: {e}")
    
except graphor.RateLimitError as e:
    print(f"Rate limit exceeded. Please wait and retry: {e}")
    
except graphor.InternalServerError as e:
    print(f"Processing failed on server: {e}")
    
except graphor.APITimeoutError as e:
    # Handled before APIConnectionError in case the timeout class subclasses it
    print(f"Request timed out. Try increasing timeout: {e}")
    
except graphor.APIConnectionError as e:
    print(f"Connection error: {e}")

Advanced Examples

Automatic Quality Improvement

Progressively try more advanced processing methods until quality is satisfactory:
from graphor import Graphor
import graphor

client = Graphor(timeout=300.0)

def improve_processing_quality(file_name: str):
    """Automatically upgrade processing method for better quality."""
    methods = ["basic", "hi_res", "hi_res_ft", "mai", "graphorlm"]
    
    for method in methods:
        try:
            print(f"Trying {method} method...")
            source = client.sources.parse(
                file_name=file_name,
                partition_method=method
            )
            
            # Add your quality assessment logic here
            if assess_quality(source):
                print(f"✅ Success with {method} method")
                return source
            else:
                print(f"⚠️ Quality insufficient with {method}, trying next...")
                
        except graphor.APIStatusError as e:
            print(f"❌ Failed with {method}: {e}")
            continue
    
    raise RuntimeError("All processing methods failed or produced insufficient quality")

def assess_quality(source) -> bool:
    """Add your quality assessment logic here."""
    # Example: check if processing was successful
    return source.status == "success"

# Usage
try:
    result = improve_processing_quality("complex-document.pdf")
    print(f"Final result: {result.partition_method}")
except Exception as e:
    print(f"Error: {e}")

Batch Reprocessing

Reprocess multiple files with the same method:
from graphor import Graphor
import graphor
import time

client = Graphor(timeout=300.0)

def batch_reprocess(file_names: list[str], method: str):
    """Reprocess multiple files with the same method."""
    results = []
    failed = []
    
    for file_name in file_names:
        try:
            print(f"Processing {file_name} with {method}...")
            source = client.sources.parse(
                file_name=file_name,
                partition_method=method
            )
            results.append(source)
            print(f"✅ {file_name} processed successfully")
            
            # Small delay between requests
            time.sleep(1.0)
            
        except graphor.APIStatusError as e:
            print(f"❌ Failed to process {file_name}: {e}")
            failed.append({"file_name": file_name, "error": str(e)})
    
    print(f"\nSummary: {len(results)} successful, {len(failed)} failed")
    return results, failed

# Usage
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
successful, failed = batch_reprocess(files, "hi_res")

Async Batch Processing

Process multiple files concurrently for better performance:
import asyncio
from graphor import AsyncGraphor
import graphor

async def process_single(client: AsyncGraphor, file_name: str, method: str):
    """Process a single file."""
    try:
        source = await client.sources.parse(
            file_name=file_name,
            partition_method=method
        )
        return {"file_name": file_name, "status": "success", "source": source}
    except graphor.APIStatusError as e:
        return {"file_name": file_name, "status": "failed", "error": str(e)}

async def batch_reprocess_async(file_names: list[str], method: str, max_concurrent: int = 3):
    """Reprocess multiple files with controlled concurrency."""
    client = AsyncGraphor(timeout=300.0)
    
    # Use semaphore to limit concurrent requests
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def process_with_semaphore(file_name: str):
        async with semaphore:
            print(f"Processing {file_name}...")
            result = await process_single(client, file_name, method)
            status_icon = "✅" if result["status"] == "success" else "❌"
            print(f"{status_icon} {file_name}: {result['status']}")
            return result
    
    tasks = [process_with_semaphore(f) for f in file_names]
    results = await asyncio.gather(*tasks)
    
    successful = [r for r in results if r["status"] == "success"]
    failed = [r for r in results if r["status"] == "failed"]
    
    print(f"\nSummary: {len(successful)} successful, {len(failed)} failed")
    return results

# Usage
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf", "doc5.pdf"]
results = asyncio.run(batch_reprocess_async(files, "hi_res", max_concurrent=3))

Processing with Progress Tracking

from graphor import Graphor
import graphor
import time
from typing import TypedDict

class ProcessingTask(TypedDict):
    file_name: str
    method: str

client = Graphor(timeout=300.0)

def process_with_progress(tasks: list[ProcessingTask]):
    """Process multiple files with progress tracking."""
    total = len(tasks)
    completed = 0
    results = []
    
    print(f"Starting batch processing of {total} files...\n")
    
    for task in tasks:
        file_name = task["file_name"]
        method = task["method"]
        
        try:
            print(f"[{completed + 1}/{total}] Processing {file_name} with {method}...")
            start_time = time.time()
            
            source = client.sources.parse(
                file_name=file_name,
                partition_method=method
            )
            
            duration = time.time() - start_time
            completed += 1
            
            results.append({
                "file_name": file_name,
                "method": method,
                "status": "success",
                "duration": duration,
                "source": source
            })
            
            print(f"✅ Completed {file_name} in {duration:.1f}s")
            
        except graphor.APIStatusError as e:
            completed += 1
            results.append({
                "file_name": file_name,
                "method": method,
                "status": "failed",
                "error": str(e)
            })
            print(f"❌ Failed {file_name}: {e}")
        
        # Progress update
        progress = (completed / total) * 100
        print(f"Progress: {progress:.1f}% ({completed}/{total})\n")
        
        # Small delay between requests
        time.sleep(0.5)
    
    return results

# Usage
processing_queue = [
    {"file_name": "document1.pdf", "method": "hi_res"},
    {"file_name": "document2.pdf", "method": "hi_res_ft"},
    {"file_name": "document3.pdf", "method": "mai"}
]

results = process_with_progress(processing_queue)

# Print final summary
successful = [r for r in results if r["status"] == "success"]
failed = [r for r in results if r["status"] == "failed"]
print(f"\n{'='*50}")
print(f"Final Summary: {len(successful)} successful, {len(failed)} failed")

When to Reprocess

Symptoms: Missing text, garbled characters, incomplete content
Recommended methods:
  • "hi_res" or "hi_res_ft" for complex layouts
  • "mai" for text-only documents when bounding boxes are not required
Symptoms: Tables not properly recognized, merged cells, structure lost
Recommended methods:
  • "hi_res" for better table detection
  • "hi_res_ft" for complex table structures
  • "graphorlm" for multi-page tables
Symptoms: Missing captions, poor figure recognition
Recommended methods:
  • "hi_res" for figure detection
  • "hi_res_ft" for comprehensive image analysis
  • "graphorlm" for rich image annotations
Symptoms: Headers/footers mixed with content, poor section detection
Recommended methods:
  • "hi_res" for structure recognition
  • "hi_res_ft" for complex document hierarchies
  • "graphorlm" for enhanced semantic structure and relationships
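
The recommendations above can be encoded as a lookup so that a document showing several symptoms is reprocessed with a method suited to all of them. This is a sketch under assumptions: the category keys (`"text"`, `"tables"`, `"figures"`, `"structure"`) are hypothetical labels for the four symptom groups, and `candidate_methods` is not part of the SDK.

```python
# Hypothetical category keys; the method lists follow the recommendations above
RECOMMENDED_METHODS = {
    "text": ["hi_res", "hi_res_ft", "mai"],
    "tables": ["hi_res", "hi_res_ft", "graphorlm"],
    "figures": ["hi_res", "hi_res_ft", "graphorlm"],
    "structure": ["hi_res", "hi_res_ft", "graphorlm"],
}


def candidate_methods(symptoms: list[str]) -> list[str]:
    """Return the methods recommended for every listed symptom, in listed order."""
    if not symptoms:
        return []
    common = list(RECOMMENDED_METHODS[symptoms[0]])
    for symptom in symptoms[1:]:
        allowed = RECOMMENDED_METHODS[symptom]
        # Keep only methods that are also recommended for this symptom
        common = [m for m in common if m in allowed]
    return common
```

For example, `candidate_methods(["text", "tables"])` keeps only the methods recommended for both groups.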

Best Practices

Processing Strategy

  • Start with Fast ("basic"): For testing and simple documents
  • Upgrade gradually: Move to "hi_res" → "hi_res_ft" → "mai" → "graphorlm" based on needs
  • Monitor results: Use document preview to evaluate processing quality
  • Consider efficiency vs. quality: Advanced methods take longer but provide better results

Performance Optimization

  • Batch processing: Process multiple files sequentially or with a small concurrency limit rather than all at once
  • Method selection: Choose the appropriate method for your document types
  • Timeout handling: Allow sufficient time for complex processing methods (5+ minutes)
  • Error recovery: Implement retry logic for transient failures
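
The retry advice above can be sketched as a small exponential-backoff wrapper. It is generic on purpose: `retry_with_backoff` is a hypothetical helper, and the attempt count and delays are illustrative assumptions rather than documented recommendations.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on the given exception types with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

With the SDK, `call` could be `lambda: client.sources.parse(file_name="document.pdf", partition_method="hi_res")` and `retry_on` a tuple such as `(graphor.RateLimitError, graphor.InternalServerError, graphor.APIConnectionError)`, the classes listed in the Error Reference. Errors such as `BadRequestError` or `NotFoundError` are left out because repeating the same request would not change the outcome.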

Quality Assessment

After processing, evaluate the results by:
  • Checking text extraction completeness
  • Verifying table and figure recognition
  • Reviewing document structure classification
  • Testing retrieval quality in your RAG pipeline

Error Reference

Error Type             Status Code  Description
BadRequestError        400          Invalid request format or partition method
AuthenticationError    401          Invalid or missing API key
PermissionDeniedError  403          Access denied to the specified project
NotFoundError          404          File not found in the project
RateLimitError         429          Too many requests, please retry after waiting
InternalServerError    ≥500         Processing failure or server error
APIConnectionError     N/A          Network connectivity issues
APITimeoutError        N/A          Request timed out
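
For logging or monitoring, the status-code rows of the table above can be expressed as a lookup. This is a sketch: `error_name_for_status` and `is_retryable` are hypothetical helpers, and treating 429 and 5xx responses as retryable is an assumption based on the descriptions in the table. Codes not listed fall back to `APIStatusError`, the class caught in the examples earlier on this page.

```python
# Status code -> error class name, from the table above
ERROR_NAME_BY_STATUS = {
    400: "BadRequestError",
    401: "AuthenticationError",
    403: "PermissionDeniedError",
    404: "NotFoundError",
    429: "RateLimitError",
}


def error_name_for_status(status_code: int) -> str:
    """Return the documented error class name for an HTTP status code."""
    if status_code >= 500:
        return "InternalServerError"
    return ERROR_NAME_BY_STATUS.get(status_code, "APIStatusError")


def is_retryable(status_code: int) -> bool:
    """Assumption: only rate limits and server-side failures are worth retrying."""
    return status_code == 429 or status_code >= 500
```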

Troubleshooting

Causes: Large files, complex documents, or heavy server load
Solutions:
  • Increase request timeout (5+ minutes recommended)
  • Try a simpler processing method first
  • Process during off-peak hours
client = Graphor(timeout=600.0)  # 10 minutes
Causes: Incorrect file name, file deleted, or wrong project
Solutions:
  • Verify exact file name (case-sensitive)
  • Use client.sources.list() to check available files
  • Ensure you’re using the correct API key for the project
# List all sources to find the correct file name
sources = client.sources.list()
for source in sources:
    print(source.file_name)
Causes: Corrupted files, unsupported content, or method incompatibility
Solutions:
  • Try a different processing method
  • Check file integrity
  • Re-upload the file if necessary using client.sources.upload()
Causes: Method not suitable for document type, or complex layout
Solutions:
  • Upgrade to "hi_res" or "hi_res_ft" method
  • Use "mai" for manuscripts and handwritten documents
  • Use "graphorlm" for complex layouts with tables and diagrams
  • Ensure document quality is good
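
For the file-not-found case above, where names are case-sensitive, it can help to suggest close matches from the names returned by `client.sources.list()`. The sketch below uses the standard library's `difflib`. `suggest_file_names` is a hypothetical helper that takes a plain list of names, so it runs without the SDK.

```python
import difflib


def suggest_file_names(requested: str, available: list[str], limit: int = 3) -> list[str]:
    """Return available names that closely match requested, ignoring case."""
    # Map lowercased names back to their exact spelling (first occurrence wins)
    by_lowercase: dict[str, str] = {}
    for name in available:
        by_lowercase.setdefault(name.lower(), name)

    matches = difflib.get_close_matches(
        requested.lower(), list(by_lowercase), n=limit, cutoff=0.6
    )
    return [by_lowercase[m] for m in matches]
```

The names could be collected with `[s.file_name for s in client.sources.list()]`, as in the listing example above, and the best suggestion passed back to `client.sources.parse()` with its exact spelling.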

Next Steps

After successfully processing your documents: