The load_elements method retrieves detailed information about document elements (partitions) from processed sources in your Graphor project. It gives you access to individual text blocks, images, tables, and other document components, together with their metadata, positioning, and content, so you can analyze document structure and extract specific information programmatically.

Method Overview

Sync Method

client.sources.load_elements()

Async Method

await client.sources.load_elements()

Method Signature

client.sources.load_elements(
    file_id: str | None = None,        # Preferred
    file_name: str | None = None,      # Deprecated
    page: int | None = None,
    page_size: int | None = None,
    filter: Filter | None = None,
    timeout: float | None = None
) -> SourceLoadElementsResponse

Parameters

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| file_id | str | Unique identifier for the source (preferred) | No* |
| file_name | str | Name of the source file to retrieve elements from (deprecated, use file_id) | No* |
| page | int | Page number for pagination (starts from 1) | No |
| page_size | int | Number of elements to return per page | No |
| filter | Filter | Filter criteria to refine element selection | No |
| timeout | float | Request timeout in seconds | No |
*At least one of file_id or file_name must be provided. file_id is preferred.
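
Since file_name is deprecated, prefer file_id whenever you have it. A minimal sketch (the file_id value below is a hypothetical placeholder; use the identifier your project assigned when the source was added):

from graphor import Graphor

client = Graphor()

# Retrieve elements by the source's unique identifier (preferred over file_name)
response = client.sources.load_elements(
    file_id="src_1a2b3c4d",  # hypothetical ID; look up real IDs via your project's sources
    page=1,
    page_size=20
)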

Filter Parameters

The filter parameter accepts a TypedDict with the following optional fields:
| Parameter | Type | Description |
| --- | --- | --- |
| type | str | Filter by specific element type (e.g., "Title", "NarrativeText", "Table") |
| page_numbers | list[int] | Filter elements from specific page numbers |
| elements_to_remove | list[str] | Exclude specific element types from results |
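
Because Filter is a TypedDict, a filter is an ordinary dict built from the keys above. A small sketch combining page_numbers and elements_to_remove (the file name and values are illustrative):

from graphor import Graphor

client = Graphor()

# Elements from pages 1-3, with footers and page numbers stripped out
element_filter = {
    "page_numbers": [1, 2, 3],
    "elements_to_remove": ["Footer", "PageNumber"],
}

response = client.sources.load_elements(
    file_name="document.pdf",
    filter=element_filter
)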

Response Object

The method returns a SourceLoadElementsResponse object:
| Property | Type | Description |
| --- | --- | --- |
| items | list[Item] | List of document elements in the current page |
| total | int | Total number of elements matching the filter |
| page | int \| None | Current page number |
| page_size | int \| None | Number of elements per page |
| total_pages | int \| None | Total number of pages available |

Item Object

Each item in the items list has the following properties:
| Property | Type | Description |
| --- | --- | --- |
| id | str \| None | Element identifier (may be None) |
| page_content | str | Text content of the element |
| type | Literal["Document"] \| None | Always "Document" for this method |
| metadata | dict \| None | Rich metadata about the element |

Metadata Fields

The metadata dictionary contains detailed information:
| Field | Type | Description |
| --- | --- | --- |
| coordinates | dict | Pixel coordinates and layout information |
| filename | str | Original filename of the source document |
| languages | list[str] | Detected languages in the element |
| last_modified | str | ISO timestamp of last modification |
| page_number | int | Page number where the element appears |
| filetype | str | MIME type of the source file |
| text_as_html | str | HTML representation of the element |
| element_type | str | Type classification of the element |
| element_id | str | Unique identifier for the element |
| position | int | Sequential position within the document |
| bounding_box | dict | Rectangular bounds of the element |
| page_layout | dict | Overall page dimensions |
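
The later examples focus on element_type and page_number, so here is a short sketch that reads the layout-oriented fields. The internal structure of coordinates, bounding_box, and page_layout is not documented here, so the sketch prints the dictionaries as-is:

from graphor import Graphor

client = Graphor()

response = client.sources.load_elements(file_name="document.pdf", page_size=20)

for item in response.items:
    metadata = item.metadata or {}
    print(f"Element {metadata.get('element_id')} on page {metadata.get('page_number')}:")
    print(f"  position:     {metadata.get('position')}")
    print(f"  bounding_box: {metadata.get('bounding_box')}")
    print(f"  page_layout:  {metadata.get('page_layout')}")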

Element Types

| Type | Description |
| --- | --- |
| Title | Document and section titles |
| NarrativeText | Main body paragraphs and content |
| ListItem | Items in bullet points or numbered lists |
| Table | Complete data tables |
| TableRow | Individual rows within tables |
| Image | Picture or graphic elements |
| Header | Header content at the top of pages |
| Footer | Footer content at the bottom of pages |
| Formula | Mathematical formulas and equations |
| CompositeElement | Elements containing multiple types |
| FigureCaption | Text describing images or figures |
| PageBreak | Indicators of page separation |
| Address | Physical address information |
| EmailAddress | Email contact information |
| PageNumber | Page numbering elements |
| CodeSnippet | Programming code segments |
| FormKeysValues | Key-value pairs in forms |
| Link | Hyperlinks and references |
| UncategorizedText | Text that doesn't fit other categories |
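
To check which of these types a document actually contains without downloading every element, you can issue one filtered request per type and read response.total, which reflects the filter. A sketch (page_size=1 merely keeps responses small; the type list is illustrative):

from graphor import Graphor

client = Graphor()

for element_type in ["Title", "NarrativeText", "Table", "Image", "ListItem"]:
    response = client.sources.load_elements(
        file_name="document.pdf",
        page_size=1,  # only the total is needed, not the items themselves
        filter={"type": element_type},
    )
    print(f"{element_type}: {response.total}")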

Code Examples

Basic Usage

from graphor import Graphor

client = Graphor()

# Get elements from a document (file_id is preferred; file_name still works but is deprecated)
response = client.sources.load_elements(
    file_name="document.pdf",
    page=1,
    page_size=20
)

print(f"Found {response.total} elements (page {response.page}/{response.total_pages})")

for item in response.items:
    element_type = item.metadata.get("element_type") if item.metadata else "Unknown"
    print(f"{element_type}: {item.page_content[:50]}...")

Filter by Element Type

from graphor import Graphor

client = Graphor()

# Get only titles
response = client.sources.load_elements(
    file_name="document.pdf",
    page_size=50,
    filter={"type": "Title"}
)

print(f"Found {response.total} titles")

for item in response.items:
    page_num = item.metadata.get("page_number") if item.metadata else "?"
    print(f"Page {page_num}: {item.page_content}")

Filter by Page Numbers

from graphor import Graphor

client = Graphor()

# Get elements from specific pages
response = client.sources.load_elements(
    file_name="document.pdf",
    page_size=100,
    filter={"page_numbers": [1, 2, 3]}
)

print(f"Found {response.total} elements on pages 1-3")

for item in response.items:
    metadata = item.metadata or {}
    print(f"Page {metadata.get('page_number', '?')}: {item.page_content[:80]}...")

Exclude Element Types

from graphor import Graphor

client = Graphor()

# Get all elements except footers and page numbers
response = client.sources.load_elements(
    file_name="document.pdf",
    page_size=50,
    filter={"elements_to_remove": ["Footer", "PageNumber"]}
)

print(f"Found {response.total} content elements (excluding footers/page numbers)")

Combine Filters

from graphor import Graphor

client = Graphor()

# Get tables from pages 2-5
response = client.sources.load_elements(
    file_name="document.pdf",
    page_size=50,
    filter={
        "type": "Table",
        "page_numbers": [2, 3, 4, 5]
    }
)

print(f"Found {response.total} tables on pages 2-5")

for item in response.items:
    metadata = item.metadata or {}
    print(f"Table on page {metadata.get('page_number')}:")
    print(f"  {item.page_content[:100]}...")

Async Usage

import asyncio
from graphor import AsyncGraphor

async def get_document_elements(file_name: str):
    client = AsyncGraphor()
    
    response = await client.sources.load_elements(
        file_name=file_name,
        page=1,
        page_size=50
    )
    
    print(f"Found {response.total} elements")
    
    for item in response.items:
        metadata = item.metadata or {}
        print(f"{metadata.get('element_type', 'Unknown')}: {item.page_content[:50]}...")
    
    return response

asyncio.run(get_document_elements("document.pdf"))

Paginate Through All Elements

from graphor import Graphor

client = Graphor()

def get_all_elements(file_name: str, page_size: int = 50):
    """Retrieve all elements from a document."""
    all_elements = []
    page = 1
    
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=page_size
        )
        
        all_elements.extend(response.items)
        print(f"Retrieved page {page}/{response.total_pages} ({len(all_elements)}/{response.total} elements)")
        
        if page >= response.total_pages:
            break
        page += 1
    
    return all_elements

# Usage
elements = get_all_elements("document.pdf")
print(f"Total elements retrieved: {len(elements)}")

Error Handling

import graphor
from graphor import Graphor

client = Graphor()

try:
    response = client.sources.load_elements(
        file_name="document.pdf",
        page=1,
        page_size=20
    )
    print(f"Found {response.total} elements")
    
except graphor.NotFoundError as e:
    print(f"File not found: {e}")
    
except graphor.BadRequestError as e:
    print(f"Invalid request parameters: {e}")
    
except graphor.AuthenticationError as e:
    print(f"Invalid API key: {e}")
    
except graphor.APIConnectionError as e:
    print(f"Connection error: {e}")
    
except graphor.APIStatusError as e:
    print(f"API error (status {e.status_code}): {e}")

Advanced Examples

Document Structure Analyzer

Analyze the structure of a document:
from graphor import Graphor
from collections import defaultdict

client = Graphor()

def analyze_document_structure(file_name: str):
    """Analyze document structure and element distribution."""
    all_elements = []
    page = 1
    
    # Fetch all elements
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=100
        )
        all_elements.extend(response.items)
        
        if page >= response.total_pages:
            break
        page += 1
    
    # Analyze structure
    type_counts = defaultdict(int)
    page_distribution = defaultdict(int)
    total_chars = 0
    languages = set()
    
    for item in all_elements:
        metadata = item.metadata or {}
        
        element_type = metadata.get("element_type", "Unknown")
        type_counts[element_type] += 1
        
        page_num = metadata.get("page_number", 0)
        page_distribution[page_num] += 1
        
        total_chars += len(item.page_content)
        
        for lang in metadata.get("languages", []):
            languages.add(lang)
    
    return {
        "total_elements": len(all_elements),
        "element_types": dict(type_counts),
        "pages": len(page_distribution),
        "elements_per_page": dict(page_distribution),
        "total_characters": total_chars,
        "average_element_length": total_chars / len(all_elements) if all_elements else 0,
        "detected_languages": list(languages)
    }

# Usage
analysis = analyze_document_structure("research_paper.pdf")
print(f"Document Analysis:")
print(f"  Total elements: {analysis['total_elements']}")
print(f"  Pages: {analysis['pages']}")
print(f"  Element types: {analysis['element_types']}")
print(f"  Languages: {analysis['detected_languages']}")

Extract Tables

Extract all tables from a document:
from graphor import Graphor

client = Graphor()

def extract_tables(file_name: str):
    """Extract all tables from a document."""
    tables = []
    page = 1
    
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=50,
            filter={"type": "Table"}
        )
        
        for item in response.items:
            metadata = item.metadata or {}
            tables.append({
                "content": item.page_content,
                "page": metadata.get("page_number"),
                "position": metadata.get("position"),
                "html": metadata.get("text_as_html"),
                "bounding_box": metadata.get("bounding_box")
            })
        
        if page >= response.total_pages:
            break
        page += 1
    
    return tables

# Usage
tables = extract_tables("financial_report.pdf")
print(f"Found {len(tables)} tables")

for i, table in enumerate(tables, 1):
    print(f"\nTable {i} (Page {table['page']}):")
    print(f"  {table['content'][:200]}...")

Build Document Outline

Create a document outline from titles:
from graphor import Graphor

client = Graphor()

def build_document_outline(file_name: str):
    """Build a document outline from titles."""
    response = client.sources.load_elements(
        file_name=file_name,
        page_size=500,
        filter={"type": "Title"}
    )
    
    outline = []
    
    for item in response.items:
        metadata = item.metadata or {}
        html = metadata.get("text_as_html") or ""
        
        # Detect the heading level from the HTML tag, defaulting to 5
        level = 5
        for n in (1, 2, 3, 4):
            if f"<h{n}>" in html:
                level = n
                break
        
        outline.append({
            "title": item.page_content,
            "page": metadata.get("page_number"),
            "level": level,
            "position": metadata.get("position")
        })
    
    # Sort by position
    outline.sort(key=lambda x: (x["page"] or 0, x["position"] or 0))
    
    return outline

# Usage
outline = build_document_outline("book.pdf")
print("Document Outline:")
for item in outline:
    indent = "  " * (item["level"] - 1)
    print(f"{indent}{item['title']} (Page {item['page']})")

Search Content in Elements

Search for specific content within document elements:
import re

from graphor import Graphor

client = Graphor()

def search_in_document(file_name: str, query: str):
    """Search for content within document elements."""
    matches = []
    page = 1
    
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=100,
            filter={"elements_to_remove": ["Footer", "PageNumber"]}
        )
        
        for item in response.items:
            if query.lower() in item.page_content.lower():
                metadata = item.metadata or {}
                matches.append({
                    "content": item.page_content,
                    "page": metadata.get("page_number"),
                    "type": metadata.get("element_type"),
                    "position": metadata.get("position")
                })
        
        if page >= response.total_pages:
            break
        page += 1
    
    return matches

def highlight_match(text: str, query: str) -> str:
    """Highlight the search query in text with bold markers."""
    pattern = re.compile(f"({re.escape(query)})", re.IGNORECASE)
    return pattern.sub(r"**\1**", text)

# Usage
query = "machine learning"
matches = search_in_document("research_paper.pdf", query)

print(f"Found {len(matches)} matches for '{query}':")
for i, match in enumerate(matches[:10], 1):
    print(f"\n{i}. Page {match['page']} ({match['type']}):")
    highlighted = highlight_match(match["content"][:200], query)
    print(f"   {highlighted}...")

Async Batch Processing

Process multiple documents concurrently:
import asyncio
from graphor import AsyncGraphor
import graphor

async def get_elements_async(client: AsyncGraphor, file_name: str):
    """Get all elements from a single document."""
    all_elements = []
    page = 1
    
    while True:
        try:
            response = await client.sources.load_elements(
                file_name=file_name,
                page=page,
                page_size=100
            )
            all_elements.extend(response.items)
            
            if page >= response.total_pages:
                break
            page += 1
            
        except graphor.APIStatusError as e:
            print(f"Error processing {file_name}: {e}")
            break
    
    return {"file_name": file_name, "elements": all_elements}

async def batch_get_elements(file_names: list[str], max_concurrent: int = 3):
    """Get elements from multiple documents concurrently."""
    client = AsyncGraphor()
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def process_with_semaphore(file_name: str):
        async with semaphore:
            print(f"Processing: {file_name}")
            result = await get_elements_async(client, file_name)
            print(f"  Completed: {file_name} ({len(result['elements'])} elements)")
            return result
    
    tasks = [process_with_semaphore(f) for f in file_names]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    
    return [r for r in results if not isinstance(r, Exception)]

# Usage
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
results = asyncio.run(batch_get_elements(files))

for result in results:
    print(f"{result['file_name']}: {len(result['elements'])} elements")

Document Comparator

Compare element structure between documents:
from graphor import Graphor
from collections import defaultdict

client = Graphor()

def get_document_stats(file_name: str) -> dict:
    """Get statistics for a document."""
    type_counts = defaultdict(int)
    total_chars = 0
    page = 1
    
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=100
        )
        
        for item in response.items:
            metadata = item.metadata or {}
            type_counts[metadata.get("element_type", "Unknown")] += 1
            total_chars += len(item.page_content)
        
        total_elements = response.total
        
        if page >= response.total_pages:
            break
        page += 1
    
    return {
        "file_name": file_name,
        "total_elements": total_elements,
        "total_characters": total_chars,
        "element_types": dict(type_counts)
    }

def compare_documents(file_name_1: str, file_name_2: str):
    """Compare two documents."""
    stats1 = get_document_stats(file_name_1)
    stats2 = get_document_stats(file_name_2)
    
    all_types = set(stats1["element_types"].keys()) | set(stats2["element_types"].keys())
    
    comparison = {
        "documents": [stats1["file_name"], stats2["file_name"]],
        "total_elements": [stats1["total_elements"], stats2["total_elements"]],
        "total_characters": [stats1["total_characters"], stats2["total_characters"]],
        "element_comparison": {}
    }
    
    for element_type in sorted(all_types):
        count1 = stats1["element_types"].get(element_type, 0)
        count2 = stats2["element_types"].get(element_type, 0)
        comparison["element_comparison"][element_type] = [count1, count2]
    
    return comparison

# Usage
comparison = compare_documents("version1.pdf", "version2.pdf")
print(f"Comparing: {comparison['documents'][0]} vs {comparison['documents'][1]}")
print(f"Elements: {comparison['total_elements'][0]} vs {comparison['total_elements'][1]}")
print(f"Characters: {comparison['total_characters'][0]} vs {comparison['total_characters'][1]}")
print("\nElement breakdown:")
for elem_type, counts in comparison["element_comparison"].items():
    print(f"  {elem_type}: {counts[0]} vs {counts[1]}")

Error Reference

| Error Type | Status Code | Description |
| --- | --- | --- |
| BadRequestError | 400 | Invalid request payload or parameters |
| AuthenticationError | 401 | Invalid or missing API key |
| NotFoundError | 404 | Specified file not found in the project |
| RateLimitError | 429 | Too many requests; retry after waiting |
| InternalServerError | ≥500 | Server-side error processing the request |
| APIConnectionError | N/A | Network connectivity issues |
| APITimeoutError | N/A | Request timed out |
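
RateLimitError (429) is meant to be retried after waiting. A minimal exponential-backoff sketch around load_elements (the retry count and delays are illustrative defaults of this sketch, not of the library):

import time

import graphor
from graphor import Graphor

client = Graphor()

def load_elements_with_retry(max_retries: int = 3, **kwargs):
    """Call load_elements, backing off and retrying on 429 responses."""
    for attempt in range(max_retries):
        try:
            return client.sources.load_elements(**kwargs)
        except graphor.RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...

response = load_elements_with_retry(file_name="document.pdf", page=1, page_size=20)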

Best Practices

Performance Optimization

  • Use appropriate page sizes: Start with 20-50 elements per page for optimal performance
  • Filter server-side: Use filter parameters to reduce data transfer
  • Cache results: Store element data locally for repeated access (see the caching sketch after the code below)
# Good: Filter on server
response = client.sources.load_elements(
    file_name="doc.pdf",
    filter={"type": "Title"}  # Filter on server
)

# Less efficient: fetch everything, then filter on the client
response = client.sources.load_elements(
    file_name="doc.pdf",
    page_size=500
)
titles = [item for item in response.items if (item.metadata or {}).get("element_type") == "Title"]
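
For the caching point above, a minimal in-memory cache keyed by the request parameters (a sketch: load_elements_cached and _element_cache are illustrative names, and a persistent store would be needed to share results across processes):

from graphor import Graphor

client = Graphor()

_element_cache: dict = {}

def load_elements_cached(file_name: str, page: int = 1, page_size: int = 50):
    """Return a cached page of elements when the same request was made before."""
    key = (file_name, page, page_size)
    if key not in _element_cache:
        _element_cache[key] = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=page_size,
        )
    return _element_cache[key]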

Data Processing

  • Element type awareness: Different element types need different processing
  • Use HTML field: The text_as_html field preserves formatting (see the sketch after the code below)
  • Handle None metadata: Always check if metadata exists before accessing
for item in response.items:
    # Safe metadata access
    metadata = item.metadata or {}
    element_type = metadata.get("element_type", "Unknown")
    page_num = metadata.get("page_number", 0)
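
For the text_as_html point above, a small sketch that writes each table's HTML representation to disk so its cell structure survives (this assumes text_as_html is populated for tables, which may depend on how the source was processed):

from graphor import Graphor

client = Graphor()

response = client.sources.load_elements(
    file_name="document.pdf",
    filter={"type": "Table"}
)

for i, item in enumerate(response.items, 1):
    html = (item.metadata or {}).get("text_as_html")
    if html:
        with open(f"table_{i}.html", "w", encoding="utf-8") as f:
            f.write(html)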

Memory Management

  • Stream large documents: Process in chunks rather than loading everything at once (a generator-based sketch follows the code below)
  • Clear processed data: Drop fields you no longer need
# Process large documents in chunks
page = 1
while True:
    response = client.sources.load_elements(
        file_name="large_doc.pdf",
        page=page,
        page_size=50
    )
    
    # Process this batch
    for item in response.items:
        process_element(item)  # Your processing logic
    
    if page >= response.total_pages:
        break
    page += 1
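
The chunked loop above can also be wrapped in a generator, so callers consume elements one at a time without ever holding the whole document in memory (a sketch of the streaming pattern; process_element stands in for your own logic):

from typing import Iterator

from graphor import Graphor

client = Graphor()

def iter_elements(file_name: str, page_size: int = 50) -> Iterator:
    """Yield document elements page by page."""
    page = 1
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=page_size,
        )
        yield from response.items
        if page >= response.total_pages:
            break
        page += 1

for item in iter_elements("large_doc.pdf"):
    process_element(item)  # your processing logic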

Troubleshooting

Slow responses or timeouts

Causes: Large page sizes, complex filters, or server load
Solutions:
  • Reduce page_size to 25-50 elements
  • Use specific filters to reduce the result set
  • Implement request timeouts:

client = Graphor(timeout=60.0)

No elements returned

Causes: File not processed, incorrect file name, or overly restrictive filters
Solutions:
  • Verify the file has been processed successfully with client.sources.list()
  • Check that the file name matches exactly (case-sensitive)
  • Remove or relax filter criteria

Missing or miscategorized elements

Causes: Processing method limitations, file format issues, or filter conflicts
Solutions:
  • Try different partition methods using client.sources.parse()
  • Check whether elements are categorized under different types
  • Remove the elements_to_remove filter temporarily

High memory usage

Causes: Processing too many elements at once
Solutions:
  • Reduce page_size and process incrementally
  • Filter out unnecessary element types
  • Use streaming processing patterns

Next Steps

After successfully retrieving document elements: