The get_elements method (matching the API endpoint of the same name) returns the parsed elements of a source. Each item is a BuildStatusElement with element_id, element_type, text, markdown, html, optional img_base64, position, page_number, bounding_box, page_layout, and more. Pass the file_id obtained from list() or from Get build status.

Method overview

client.sources.get_elements()

Method signature

client.sources.get_elements(
    file_id: str,                          # Required
    page: int | None = None,
    page_size: int | None = None,
    suppress_img_base64: bool = False,
    type: str | None = None,              # Filter by element type
    page_numbers: list[int] | None = None,
    elements_to_remove: list[str] | None = None,
    timeout: float | None = None
) -> SourceGetElementsResponse

Parameters

| Parameter | Type | Description | Required |
|---|---|---|---|
| file_id | str | Unique identifier of the source | Yes |
| page | int \| None | 1-based page number (use with page_size) | No |
| page_size | int \| None | Elements per page (1–100) | No |
| suppress_img_base64 | bool | When true, omit img_base64 from each element | No |
| type | str \| None | Filter by element type (e.g. Title, NarrativeText, Table) | No |
| page_numbers | list[int] \| None | Restrict to specific page numbers | No |
| elements_to_remove | list[str] \| None | Element types to exclude | No |
| timeout | float \| None | Request timeout in seconds | No |

Filter parameters

All filter parameters are passed at the top level (not as a nested object).
| Parameter | Python | TypeScript | Description |
|---|---|---|---|
| Element type filter | type | type | Filter by element type (e.g. Title, NarrativeText, Table) |
| Page number filter | page_numbers | page_numbers | Restrict to specific page numbers |
| Exclude types | elements_to_remove | elementsToRemove | Element types to exclude |

Response

Paginated response with BuildStatusElement items (same shape as elements in Get build status):
| Field | Type | Description |
|---|---|---|
| items | list | Elements in the current page (or all if no pagination) |
| total | int | Total elements matching filters |
| page | int \| null | Current page (1-based) or null |
| page_size | int \| null | Elements per page or null |
| total_pages | int \| null | Total pages or null |
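When both page and page_size are set, total_pages follows directly from total; a quick sanity check (the helper name is illustrative, not part of the SDK):

```python
import math

def expected_total_pages(total: int, page_size: int) -> int:
    """Compute how many pages a paginated element listing spans."""
    return math.ceil(total / page_size)

# 137 matching elements at 20 per page span 7 pages
print(expected_total_pages(137, 20))  # → 7
```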

BuildStatusElement (each item)

| Field | Type | Description |
|---|---|---|
| element_id | str \| null | Unique identifier for the element |
| element_type | str \| null | e.g. Title, NarrativeText, Table, Image |
| text | str | Plain text content |
| markdown | str \| null | Markdown when available |
| html | str \| null | HTML when available |
| img_base64 | str \| null | Base64 image (omitted if suppress_img_base64=true) |
| position | int \| null | Order within the document |
| page_number | int \| null | Page number (1-based) |
| bounding_box | object \| null | Bounding box (left, top, width, height) |
| page_layout | object \| null | Page dimensions |
| page_annotation | str \| null | Page-level annotation |
| page_keywords | array \| null | Keywords for the page |
| page_topics | array \| null | Topics for the page |
| metadata | object | Additional metadata |
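img_base64, when present, is a base64-encoded image payload; a minimal sketch for persisting it to disk (the function name and file path are illustrative):

```python
import base64

def save_element_image(img_base64: str, path: str) -> int:
    """Decode an element's base64 image payload and write it to disk.

    Returns the number of bytes written.
    """
    data = base64.b64decode(img_base64)
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

# Round-trip illustration with dummy bytes standing in for real image data
payload = base64.b64encode(b"\x89PNG...").decode("ascii")
print(save_element_image(payload, "element.png"))
```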

Element Types

| Type | Description |
|---|---|
| Title | Document and section titles |
| NarrativeText | Main body paragraphs and content |
| ListItem | Items in bullet points or numbered lists |
| Table | Complete data tables |
| TableRow | Individual rows within tables |
| Image | Picture or graphic elements |
| Header | Header content at top of pages |
| Footer | Footer content at bottom of pages |
| Formula | Mathematical formulas and equations |
| CompositeElement | Elements containing multiple types |
| FigureCaption | Text describing images or figures |
| PageBreak | Indicators of page separation |
| Address | Physical address information |
| EmailAddress | Email contact information |
| PageNumber | Page numbering elements |
| CodeSnippet | Programming code segments |
| FormKeysValues | Key-value pairs in forms |
| Link | Hyperlinks and references |
| UncategorizedText | Text that doesn’t fit other categories |
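A quick client-side tally of these types over already-fetched items can reveal a document's composition; sample dicts below stand in for BuildStatusElement objects:

```python
from collections import Counter

# Sample items standing in for response.items
items = [
    {"element_type": "Title"},
    {"element_type": "NarrativeText"},
    {"element_type": "NarrativeText"},
    {"element_type": None},  # some elements may lack a type
]

# Fold untyped elements into an "Unknown" bucket before counting
counts = Counter((item["element_type"] or "Unknown") for item in items)
print(counts.most_common())  # → [('NarrativeText', 2), ('Title', 1), ('Unknown', 1)]
```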

Code Examples

Basic usage

from graphor import Graphor

client = Graphor()
file_id = "file_abc123"  # from list() or get_build_status

response = client.sources.get_elements(file_id=file_id, page=1, page_size=20)

print(f"Found {response.total} elements (page {response.page}/{response.total_pages})")

for item in response.items:
    print(f"{item.element_type}: {item.text[:50]}...")

Filter by element type

response = client.sources.get_elements(
    file_id=file_id,
    page_size=50,
    type="Title"
)
for item in response.items:
    print(f"Page {item.page_number}: {item.text}")

Filter by page numbers

response = client.sources.get_elements(
    file_id=file_id,
    page_size=100,
    page_numbers=[1, 2, 3]
)
for item in response.items:
    print(f"Page {item.page_number}: {item.text[:80]}...")

Exclude element types

response = client.sources.get_elements(
    file_id=file_id,
    page_size=50,
    elements_to_remove=["Footer", "PageNumber"]
)

Combine filters

response = client.sources.get_elements(
    file_id=file_id,
    page_size=50,
    type="Table",
    page_numbers=[2, 3, 4, 5]
)
for item in response.items:
    print(f"Table on page {item.page_number}: {item.text[:100]}...")

Async usage

import asyncio
from graphor import AsyncGraphor

async def get_document_elements(file_id: str):
    client = AsyncGraphor()
    response = await client.sources.get_elements(file_id=file_id, page=1, page_size=50)
    print(f"Found {response.total} elements")
    for item in response.items:
        print(f"{item.element_type}: {item.text[:50]}...")
    return response

asyncio.run(get_document_elements("file_abc123"))

Paginate through all elements

def get_all_elements(file_id: str, page_size: int = 50):
    all_elements = []
    page = 1
    while True:
        response = client.sources.get_elements(file_id=file_id, page=page, page_size=page_size)
        all_elements.extend(response.items)
        if response.total_pages is None or page >= response.total_pages:
            break
        page += 1
    return all_elements

elements = get_all_elements("file_abc123")

Error handling

import graphor

try:
    response = client.sources.get_elements(file_id=file_id, page=1, page_size=20)
    print(f"Found {response.total} elements")
except graphor.NotFoundError as e:
    print("Source not found:", e)
except graphor.BadRequestError as e:
    print("Invalid request (e.g. missing file_id):", e)
except graphor.APIStatusError as e:
    print("API error:", e)

Advanced Examples

Document Structure Analyzer

Analyze the structure of a document:
from graphor import Graphor
from collections import defaultdict

client = Graphor()

def analyze_document_structure(file_id: str):
    """Analyze document structure and element distribution."""
    all_elements = []
    page = 1
    
    # Fetch all elements
    while True:
        response = client.sources.get_elements(
            file_id=file_id,
            page=page,
            page_size=100
        )
        all_elements.extend(response.items)
        
        if response.total_pages is None or page >= response.total_pages:
            break
        page += 1
    
    # Analyze structure
    type_counts = defaultdict(int)
    page_distribution = defaultdict(int)
    total_chars = 0
    languages = set()
    
    for item in all_elements:
        element_type = item.element_type or "Unknown"
        type_counts[element_type] += 1
        
        page_num = item.page_number or 0
        page_distribution[page_num] += 1
        
        total_chars += len(item.text)
        
        for lang in (item.metadata or {}).get("languages", []):
            languages.add(lang)
    
    return {
        "total_elements": len(all_elements),
        "element_types": dict(type_counts),
        "pages": len(page_distribution),
        "elements_per_page": dict(page_distribution),
        "total_characters": total_chars,
        "average_element_length": total_chars / len(all_elements) if all_elements else 0,
        "detected_languages": list(languages)
    }

# Usage
analysis = analyze_document_structure("file_abc123")
print(f"Document Analysis:")
print(f"  Total elements: {analysis['total_elements']}")
print(f"  Pages: {analysis['pages']}")
print(f"  Element types: {analysis['element_types']}")
print(f"  Languages: {analysis['detected_languages']}")

Extract Tables

Extract all tables from a document:
from graphor import Graphor

client = Graphor()

def extract_tables(file_id: str):
    """Extract all tables from a document."""
    tables = []
    page = 1
    
    while True:
        response = client.sources.get_elements(
            file_id=file_id,
            page=page,
            page_size=50,
            type="Table"
        )
        
        for item in response.items:
            tables.append({
                "content": item.text,
                "page": item.page_number,
                "position": item.position,
                "html": item.html,
                "bounding_box": item.bounding_box
            })
        
        if response.total_pages is None or page >= response.total_pages:
            break
        page += 1
    
    return tables

# Usage
tables = extract_tables("file_abc123")
print(f"Found {len(tables)} tables")

for i, table in enumerate(tables, 1):
    print(f"\nTable {i} (Page {table['page']}):")
    print(f"  {table['content'][:200]}...")

Build Document Outline

Create a document outline from titles:
from graphor import Graphor

client = Graphor()

def build_document_outline(file_id: str):
    """Build a document outline from titles."""
    response = client.sources.get_elements(
        file_id=file_id,
        page_size=500,
        type="Title"
    )
    
    outline = []
    
    for item in response.items:
        html = item.html or ""
        level = 5
        if "<h1" in html: level = 1
        elif "<h2" in html: level = 2
        elif "<h3" in html: level = 3
        elif "<h4" in html: level = 4
        outline.append({
            "title": item.text,
            "page": item.page_number,
            "level": level,
            "position": item.position
        })
    
    # Sort by page, then position within the page
    outline.sort(key=lambda x: (x["page"] or 0, x["position"] or 0))
    
    return outline

# Usage
outline = build_document_outline("file_abc123")
print("Document Outline:")
for item in outline:
    indent = "  " * (item["level"] - 1)
    print(f"{indent}{item['title']} (Page {item['page']})")

Search Content in Elements

Search for specific content within document elements:
from graphor import Graphor

client = Graphor()

def search_in_document(file_id: str, query: str):
    """Search for content within document elements."""
    matches = []
    page = 1
    
    while True:
        response = client.sources.get_elements(
            file_id=file_id,
            page=page,
            page_size=100,
            elements_to_remove=["Footer", "PageNumber"]
        )
        
        for item in response.items:
            if query.lower() in item.text.lower():
                matches.append({
                    "content": item.text,
                    "page": item.page_number,
                    "type": item.element_type,
                    "position": item.position
                })
        
        if response.total_pages is None or page >= response.total_pages:
            break
        page += 1
    
    return matches

def highlight_match(text: str, query: str) -> str:
    """Highlight search query in text."""
    import re
    pattern = re.compile(f"({re.escape(query)})", re.IGNORECASE)
    return pattern.sub(r"**\1**", text)

# Usage
query = "machine learning"
matches = search_in_document("file_abc123", query)

print(f"Found {len(matches)} matches for '{query}':")
for i, match in enumerate(matches[:10], 1):
    print(f"\n{i}. Page {match['page']} ({match['type']}):")
    highlighted = highlight_match(match["content"][:200], query)
    print(f"   {highlighted}...")

Async Batch Processing

Process multiple documents concurrently:
import asyncio
from graphor import AsyncGraphor
import graphor

async def get_elements_async(client: AsyncGraphor, file_id: str):
    """Get all elements from a single document."""
    all_elements = []
    page = 1
    
    while True:
        try:
            response = await client.sources.get_elements(
                file_id=file_id,
                page=page,
                page_size=100
            )
            all_elements.extend(response.items)
            
            if response.total_pages is None or page >= response.total_pages:
                break
            page += 1
            
        except graphor.APIStatusError as e:
            print(f"Error processing {file_id}: {e}")
            break
    
    return {"file_id": file_id, "elements": all_elements}

async def batch_get_elements(file_ids: list[str], max_concurrent: int = 3):
    """Get elements from multiple documents concurrently."""
    client = AsyncGraphor()
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def process_with_semaphore(fid: str):
        async with semaphore:
            print(f"Processing: {fid}")
            result = await get_elements_async(client, fid)
            print(f"  Completed: {fid} ({len(result['elements'])} elements)")
            return result
    
    tasks = [process_with_semaphore(f) for f in file_ids]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    
    return [r for r in results if not isinstance(r, Exception)]

# Usage
file_ids = ["file_1", "file_2", "file_3"]
results = asyncio.run(batch_get_elements(file_ids))

for result in results:
    print(f"{result['file_id']}: {len(result['elements'])} elements")

Document Comparator

Compare element structure between documents:
from graphor import Graphor
from collections import defaultdict

client = Graphor()

def get_document_stats(file_id: str) -> dict:
    """Get statistics for a document."""
    type_counts = defaultdict(int)
    total_chars = 0
    page = 1
    
    while True:
        response = client.sources.get_elements(
            file_id=file_id,
            page=page,
            page_size=100
        )
        
        for item in response.items:
            type_counts[item.element_type or "Unknown"] += 1
            total_chars += len(item.text)
        
        total_elements = response.total
        if response.total_pages is None or page >= response.total_pages:
            break
        page += 1
    
    return {
        "file_id": file_id,
        "total_elements": total_elements,
        "total_characters": total_chars,
        "element_types": dict(type_counts)
    }

def compare_documents(file_id_1: str, file_id_2: str):
    """Compare two documents."""
    stats1 = get_document_stats(file_id_1)
    stats2 = get_document_stats(file_id_2)
    
    all_types = set(stats1["element_types"].keys()) | set(stats2["element_types"].keys())
    
    comparison = {
        "documents": [stats1["file_id"], stats2["file_id"]],
        "total_elements": [stats1["total_elements"], stats2["total_elements"]],
        "total_characters": [stats1["total_characters"], stats2["total_characters"]],
        "element_comparison": {}
    }
    
    for element_type in sorted(all_types):
        count1 = stats1["element_types"].get(element_type, 0)
        count2 = stats2["element_types"].get(element_type, 0)
        comparison["element_comparison"][element_type] = [count1, count2]
    
    return comparison

# Usage
comparison = compare_documents("file_1", "file_2")
print(f"Comparing: {comparison['documents'][0]} vs {comparison['documents'][1]}")
print(f"Elements: {comparison['total_elements'][0]} vs {comparison['total_elements'][1]}")
print(f"Characters: {comparison['total_characters'][0]} vs {comparison['total_characters'][1]}")
print("\nElement breakdown:")
for elem_type, counts in comparison["element_comparison"].items():
    print(f"  {elem_type}: {counts[0]} vs {counts[1]}")

Error Reference

| Error Type | Status Code | Description |
|---|---|---|
| BadRequestError | 400 | Invalid request payload or parameters |
| AuthenticationError | 401 | Invalid or missing API key |
| NotFoundError | 404 | Source not found for the given file_id |
| RateLimitError | 429 | Too many requests; retry after waiting |
| InternalServerError | ≥500 | Server-side error processing the request |
| APIConnectionError | N/A | Network connectivity issues |
| APITimeoutError | N/A | Request timed out |

Best Practices

Performance Optimization

  • Use appropriate page sizes: Start with 20-50 elements per page for optimal performance
  • Filter server-side: Use filter parameters to reduce data transfer
  • Cache results: Store element data locally for repeated access
# Good: Filter on server
response = client.sources.get_elements(file_id=file_id, type="Title")

# Less efficient: Filter on client
response = client.sources.get_elements(file_id=file_id, page_size=500)
titles = [item for item in response.items if item.element_type == "Title"]

Data Processing

  • Element type awareness: Different element types need different processing
  • Use the html field: it preserves formatting (e.g. table markup) when available
  • Handle None metadata: Always check if metadata exists before accessing
for item in response.items:
    element_type = item.element_type or "Unknown"
    page_num = item.page_number or 0
    languages = (item.metadata or {}).get("languages", [])  # metadata may be None

Memory Management

  • Stream large documents: Process in chunks rather than loading all at once
  • Clear processed data: Remove unnecessary fields when not needed
# Process large documents in chunks
page = 1
while True:
    response = client.sources.get_elements(
        file_id=file_id,
        page=page,
        page_size=50
    )
    
    # Process this batch
    for item in response.items:
        process_element(item)  # Your processing logic
    
    if response.total_pages is None or page >= response.total_pages:
        break
    page += 1
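The chunked loop above can be wrapped in a generator so callers never hold more than one page in memory; fetch_page is a hypothetical stand-in for client.sources.get_elements(file_id, page=..., page_size=...):

```python
def iter_elements(fetch_page, page_size: int = 50):
    """Yield elements one at a time, fetching pages lazily."""
    page = 1
    while True:
        response = fetch_page(page, page_size)
        yield from response["items"]
        if response["total_pages"] is None or page >= response["total_pages"]:
            break
        page += 1

# Fake two-page source for illustration
def fake_fetch(page, page_size):
    pages = {1: ["a", "b"], 2: ["c"]}
    return {"items": pages[page], "total_pages": 2}

print(list(iter_elements(fake_fetch)))  # → ['a', 'b', 'c']
```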

Troubleshooting

Slow responses or timeouts

Causes: large page sizes, complex filters, or server load.
Solutions:
  • Reduce page_size to 25-50 elements
  • Use specific filters to reduce the result set
  • Set a longer request timeout:
client = Graphor(timeout=60.0)

Empty results

Causes: file not yet processed, incorrect file_id, or overly restrictive filters.
Solutions:
  • Verify the source is processed (status Completed) with client.sources.list()
  • Use the file_id from List sources or Get build status
  • Remove or relax filter criteria

Missing element types

Causes: processing method limitations, file format issues, or filter conflicts.
Solutions:
  • Try a different partition method using client.sources.reprocess()
  • Check whether elements are categorized under different types
  • Remove the elements_to_remove filter temporarily

High memory usage

Causes: processing too many elements at once.
Solutions:
  • Reduce page_size and process incrementally
  • Filter out unnecessary element types
  • Use streaming processing patterns

Next steps

After retrieving elements:

  • Get build status: poll build status and get elements for a build
  • List sources: list all sources and their file_ids
  • Upload: ingest files, URLs, GitHub, or YouTube
  • Reprocess source: re-process a source with a different partition method
  • Delete source: remove a source by file_id