The load_elements method allows you to retrieve detailed information about document elements (partitions) from processed sources in your Graphor project. This method provides access to individual text blocks, images, tables, and other document components with their metadata, positioning, and content, enabling you to analyze document structure and extract specific information programmatically.
Method Overview
- **Sync:** `client.sources.load_elements()`
- **Async:** `await client.sources.load_elements()`
Method Signature
```python
client.sources.load_elements(
    file_id: str | None = None,     # Preferred
    file_name: str | None = None,   # Deprecated
    page: int | None = None,
    page_size: int | None = None,
    filter: Filter | None = None,
    timeout: float | None = None,
) -> SourceLoadElementsResponse
```
Parameters
| Parameter | Type | Description | Required |
|-----------|------|-------------|----------|
| `file_id` | `str` | Unique identifier for the source (preferred) | No* |
| `file_name` | `str` | Name of the source file to retrieve elements from (deprecated, use `file_id`) | No* |
| `page` | `int` | Page number for pagination (starts from 1) | No |
| `page_size` | `int` | Number of elements to return per page | No |
| `filter` | `Filter` | Filter criteria to refine element selection | No |
| `timeout` | `float` | Request timeout in seconds | No |
*At least one of file_id or file_name must be provided. file_id is preferred.
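If you have already listed your sources, you can pass the identifier directly. A minimal sketch, assuming the response from `client.sources.list()` exposes an `id` on each entry (that shape is an assumption, not confirmed by this page):

```python
from graphor import Graphor

client = Graphor()

# Look up the source, then load elements by its ID (preferred over file_name).
# NOTE: the `.items[0].id` shape of the list() response is an assumption.
sources = client.sources.list()
file_id = sources.items[0].id

response = client.sources.load_elements(file_id=file_id, page=1, page_size=20)
print(f"Found {response.total} elements")
```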
Filter Parameters
The filter parameter accepts a TypedDict with the following optional fields:
| Parameter | Type | Description |
|-----------|------|-------------|
| `type` | `str` | Filter by a specific element type (e.g., "Title", "NarrativeText", "Table") |
| `page_numbers` | `list[int]` | Filter elements from specific page numbers |
| `elements_to_remove` | `list[str]` | Exclude specific element types from results |
Response Object
The method returns a SourceLoadElementsResponse object:
| Property | Type | Description |
|----------|------|-------------|
| `items` | `list[Item]` | Document elements in the current page |
| `total` | `int` | Total number of elements matching the filter |
| `page` | `int \| None` | Current page number |
| `page_size` | `int \| None` | Number of elements per page |
| `total_pages` | `int \| None` | Total number of pages available |
Item Object
Each item in the items list has the following properties:
| Property | Type | Description |
|----------|------|-------------|
| `id` | `str \| None` | Element identifier (may be None) |
| `page_content` | `str` | Text content of the element |
| `type` | `Literal["Document"] \| None` | Always "Document" for this method |
| `metadata` | `dict \| None` | Rich metadata about the element |
The metadata dictionary contains detailed information:
| Field | Type | Description |
|-------|------|-------------|
| `coordinates` | `dict` | Pixel coordinates and layout information |
| `filename` | `str` | Original filename of the source document |
| `languages` | `list[str]` | Detected languages in the element |
| `last_modified` | `str` | ISO timestamp of last modification |
| `page_number` | `int` | Page number where the element appears |
| `filetype` | `str` | MIME type of the source file |
| `text_as_html` | `str` | HTML representation of the element |
| `element_type` | `str` | Type classification of the element |
| `element_id` | `str` | Unique identifier for the element |
| `position` | `int` | Sequential position within the document |
| `bounding_box` | `dict` | Rectangular bounds of the element |
| `page_layout` | `dict` | Overall page dimensions |
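Because `position` records each element's sequential place and `page_number` its page, you can restore reading order for items fetched out of order (for example, across several filtered requests). A small sketch using only the metadata fields documented above:

```python
def in_reading_order(items):
    """Sort elements by (page_number, position); items without metadata sort last."""
    def sort_key(item):
        metadata = item.metadata or {}
        return (
            metadata.get("page_number", float("inf")),
            metadata.get("position", float("inf")),
        )
    return sorted(items, key=sort_key)
```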
Element Types
| Type | Description |
|------|-------------|
| `Title` | Document and section titles |
| `NarrativeText` | Main body paragraphs and content |
| `ListItem` | Items in bullet points or numbered lists |
| `Table` | Complete data tables |
| `TableRow` | Individual rows within tables |
| `Image` | Picture or graphic elements |
| `Header` | Header content at the top of pages |
| `Footer` | Footer content at the bottom of pages |
| `Formula` | Mathematical formulas and equations |
| `CompositeElement` | Elements containing multiple types |
| `FigureCaption` | Text describing images or figures |
| `PageBreak` | Indicators of page separation |
| `Address` | Physical address information |
| `EmailAddress` | Email contact information |
| `PageNumber` | Page numbering elements |
| `CodeSnippet` | Programming code segments |
| `FormKeysValues` | Key-value pairs in forms |
| `Link` | Hyperlinks and references |
| `UncategorizedText` | Text that doesn't fit other categories |
Code Examples
Basic Usage
```python
from graphor import Graphor

client = Graphor()

# Get elements from a document
response = client.sources.load_elements(
    file_name="document.pdf",
    page=1,
    page_size=20,
)

print(f"Found {response.total} elements (page {response.page}/{response.total_pages})")
for item in response.items:
    element_type = item.metadata.get("element_type") if item.metadata else "Unknown"
    print(f"{element_type}: {item.page_content[:50]}...")
```
Filter by Element Type
```python
from graphor import Graphor

client = Graphor()

# Get only titles
response = client.sources.load_elements(
    file_name="document.pdf",
    page_size=50,
    filter={"type": "Title"},
)

print(f"Found {response.total} titles")
for item in response.items:
    page_num = item.metadata.get("page_number") if item.metadata else "?"
    print(f"Page {page_num}: {item.page_content}")
```
Filter by Page Numbers
```python
from graphor import Graphor

client = Graphor()

# Get elements from specific pages
response = client.sources.load_elements(
    file_name="document.pdf",
    page_size=100,
    filter={"page_numbers": [1, 2, 3]},
)

print(f"Found {response.total} elements on pages 1-3")
for item in response.items:
    # metadata may be None, so guard the lookup
    page_num = (item.metadata or {}).get("page_number", "?")
    print(f"Page {page_num}: {item.page_content[:80]}...")
```
Exclude Element Types
```python
from graphor import Graphor

client = Graphor()

# Get all elements except footers and page numbers
response = client.sources.load_elements(
    file_name="document.pdf",
    page_size=50,
    filter={"elements_to_remove": ["Footer", "PageNumber"]},
)

print(f"Found {response.total} content elements (excluding footers/page numbers)")
```
Combine Filters
```python
from graphor import Graphor

client = Graphor()

# Get tables from pages 2-5 (combining type and page filters)
response = client.sources.load_elements(
    file_name="document.pdf",
    page_size=50,
    filter={
        "type": "Table",
        "page_numbers": [2, 3, 4, 5],
    },
)

print(f"Found {response.total} tables on pages 2-5")
for item in response.items:
    page_num = (item.metadata or {}).get("page_number", "?")
    print(f"Table on page {page_num}:")
    print(f"  {item.page_content[:100]}...")
```
Async Usage
```python
import asyncio

from graphor import AsyncGraphor

async def get_document_elements(file_name: str):
    client = AsyncGraphor()
    response = await client.sources.load_elements(
        file_name=file_name,
        page=1,
        page_size=50,
    )
    print(f"Found {response.total} elements")
    for item in response.items:
        element_type = (item.metadata or {}).get("element_type", "Unknown")
        print(f"{element_type}: {item.page_content[:50]}...")
    return response

asyncio.run(get_document_elements("document.pdf"))
```
Paginate Through All Elements
```python
from graphor import Graphor

client = Graphor()

def get_all_elements(file_name: str, page_size: int = 50):
    """Retrieve all elements from a document."""
    all_elements = []
    page = 1
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=page_size,
        )
        all_elements.extend(response.items)
        print(f"Retrieved page {page}/{response.total_pages} ({len(all_elements)}/{response.total} elements)")
        # total_pages is typed int | None, so guard against None
        if response.total_pages is None or page >= response.total_pages:
            break
        page += 1
    return all_elements

# Usage
elements = get_all_elements("document.pdf")
print(f"Total elements retrieved: {len(elements)}")
```
Error Handling
```python
import graphor
from graphor import Graphor

client = Graphor()

try:
    response = client.sources.load_elements(
        file_name="document.pdf",
        page=1,
        page_size=20,
    )
    print(f"Found {response.total} elements")
except graphor.NotFoundError as e:
    print(f"File not found: {e}")
except graphor.BadRequestError as e:
    print(f"Invalid request parameters: {e}")
except graphor.AuthenticationError as e:
    print(f"Invalid API key: {e}")
except graphor.APIConnectionError as e:
    print(f"Connection error: {e}")
except graphor.APIStatusError as e:
    print(f"API error (status {e.status_code}): {e}")
```
Advanced Examples
Document Structure Analyzer
Analyze the structure of a document:
```python
from collections import defaultdict

from graphor import Graphor

client = Graphor()

def analyze_document_structure(file_name: str):
    """Analyze document structure and element distribution."""
    all_elements = []
    page = 1
    # Fetch all elements
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=100,
        )
        all_elements.extend(response.items)
        if page >= response.total_pages:
            break
        page += 1
    # Analyze structure
    type_counts = defaultdict(int)
    page_distribution = defaultdict(int)
    total_chars = 0
    languages = set()
    for item in all_elements:
        metadata = item.metadata or {}
        element_type = metadata.get("element_type", "Unknown")
        type_counts[element_type] += 1
        page_num = metadata.get("page_number", 0)
        page_distribution[page_num] += 1
        total_chars += len(item.page_content)
        for lang in metadata.get("languages", []):
            languages.add(lang)
    return {
        "total_elements": len(all_elements),
        "element_types": dict(type_counts),
        "pages": len(page_distribution),
        "elements_per_page": dict(page_distribution),
        "total_characters": total_chars,
        "average_element_length": total_chars / len(all_elements) if all_elements else 0,
        "detected_languages": list(languages),
    }

# Usage
analysis = analyze_document_structure("research_paper.pdf")
print("Document Analysis:")
print(f"  Total elements: {analysis['total_elements']}")
print(f"  Pages: {analysis['pages']}")
print(f"  Element types: {analysis['element_types']}")
print(f"  Languages: {analysis['detected_languages']}")
```
Extract Tables
Extract all tables from a document:
```python
from graphor import Graphor

client = Graphor()

def extract_tables(file_name: str):
    """Extract all tables from a document."""
    tables = []
    page = 1
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=50,
            filter={"type": "Table"},
        )
        for item in response.items:
            metadata = item.metadata or {}
            tables.append({
                "content": item.page_content,
                "page": metadata.get("page_number"),
                "position": metadata.get("position"),
                "html": metadata.get("text_as_html"),
                "bounding_box": metadata.get("bounding_box"),
            })
        if page >= response.total_pages:
            break
        page += 1
    return tables

# Usage
tables = extract_tables("financial_report.pdf")
print(f"Found {len(tables)} tables")
for i, table in enumerate(tables, 1):
    print(f"\nTable {i} (Page {table['page']}):")
    print(f"  {table['content'][:200]}...")
```
Build Document Outline
Create a document outline from titles:
```python
from graphor import Graphor

client = Graphor()

def build_document_outline(file_name: str):
    """Build a document outline from titles."""
    response = client.sources.load_elements(
        file_name=file_name,
        page_size=500,
        filter={"type": "Title"},
    )
    outline = []
    for item in response.items:
        metadata = item.metadata or {}
        html = metadata.get("text_as_html", "")
        # Detect heading level from the HTML tag
        level = 5  # default
        if "<h1>" in html:
            level = 1
        elif "<h2>" in html:
            level = 2
        elif "<h3>" in html:
            level = 3
        elif "<h4>" in html:
            level = 4
        outline.append({
            "title": item.page_content,
            "page": metadata.get("page_number"),
            "level": level,
            "position": metadata.get("position"),
        })
    # Sort by page, then position
    outline.sort(key=lambda x: (x["page"] or 0, x["position"] or 0))
    return outline

# Usage
outline = build_document_outline("book.pdf")
print("Document Outline:")
for item in outline:
    indent = "  " * (item["level"] - 1)
    print(f"{indent}• {item['title']} (Page {item['page']})")
```
Search Content in Elements
Search for specific content within document elements:
```python
import re

from graphor import Graphor

client = Graphor()

def search_in_document(file_name: str, query: str):
    """Search for content within document elements."""
    matches = []
    page = 1
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=100,
            filter={"elements_to_remove": ["Footer", "PageNumber"]},
        )
        for item in response.items:
            if query.lower() in item.page_content.lower():
                metadata = item.metadata or {}
                matches.append({
                    "content": item.page_content,
                    "page": metadata.get("page_number"),
                    "type": metadata.get("element_type"),
                    "position": metadata.get("position"),
                })
        if page >= response.total_pages:
            break
        page += 1
    return matches

def highlight_match(text: str, query: str) -> str:
    """Highlight the search query in text."""
    pattern = re.compile(f"({re.escape(query)})", re.IGNORECASE)
    return pattern.sub(r"**\1**", text)

# Usage
query = "machine learning"
matches = search_in_document("research_paper.pdf", query)
print(f"Found {len(matches)} matches for '{query}':")
for i, match in enumerate(matches[:10], 1):
    print(f"\n{i}. Page {match['page']} ({match['type']}):")
    highlighted = highlight_match(match["content"][:200], query)
    print(f"  {highlighted}...")
```
Async Batch Processing
Process multiple documents concurrently:
```python
import asyncio

import graphor
from graphor import AsyncGraphor

async def get_elements_async(client: AsyncGraphor, file_name: str):
    """Get all elements from a single document."""
    all_elements = []
    page = 1
    while True:
        try:
            response = await client.sources.load_elements(
                file_name=file_name,
                page=page,
                page_size=100,
            )
            all_elements.extend(response.items)
            if page >= response.total_pages:
                break
            page += 1
        except graphor.APIStatusError as e:
            print(f"Error processing {file_name}: {e}")
            break
    return {"file_name": file_name, "elements": all_elements}

async def batch_get_elements(file_names: list[str], max_concurrent: int = 3):
    """Get elements from multiple documents concurrently."""
    client = AsyncGraphor()
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_with_semaphore(file_name: str):
        async with semaphore:
            print(f"Processing: {file_name}")
            result = await get_elements_async(client, file_name)
            print(f"  Completed: {file_name} ({len(result['elements'])} elements)")
            return result

    tasks = [process_with_semaphore(f) for f in file_names]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

# Usage
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
results = asyncio.run(batch_get_elements(files))
for result in results:
    print(f"{result['file_name']}: {len(result['elements'])} elements")
```
Document Comparator
Compare element structure between documents:
```python
from collections import defaultdict

from graphor import Graphor

client = Graphor()

def get_document_stats(file_name: str) -> dict:
    """Get statistics for a document."""
    type_counts = defaultdict(int)
    total_chars = 0
    page = 1
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=100,
        )
        for item in response.items:
            metadata = item.metadata or {}
            type_counts[metadata.get("element_type", "Unknown")] += 1
            total_chars += len(item.page_content)
        if page >= response.total_pages:
            total_elements = response.total
            break
        page += 1
    return {
        "file_name": file_name,
        "total_elements": total_elements,
        "total_characters": total_chars,
        "element_types": dict(type_counts),
    }

def compare_documents(file_name_1: str, file_name_2: str):
    """Compare two documents."""
    stats1 = get_document_stats(file_name_1)
    stats2 = get_document_stats(file_name_2)
    all_types = set(stats1["element_types"]) | set(stats2["element_types"])
    comparison = {
        "documents": [stats1["file_name"], stats2["file_name"]],
        "total_elements": [stats1["total_elements"], stats2["total_elements"]],
        "total_characters": [stats1["total_characters"], stats2["total_characters"]],
        "element_comparison": {},
    }
    for element_type in sorted(all_types):
        count1 = stats1["element_types"].get(element_type, 0)
        count2 = stats2["element_types"].get(element_type, 0)
        comparison["element_comparison"][element_type] = [count1, count2]
    return comparison

# Usage
comparison = compare_documents("version1.pdf", "version2.pdf")
print(f"Comparing: {comparison['documents'][0]} vs {comparison['documents'][1]}")
print(f"Elements: {comparison['total_elements'][0]} vs {comparison['total_elements'][1]}")
print(f"Characters: {comparison['total_characters'][0]} vs {comparison['total_characters'][1]}")
print("\nElement breakdown:")
for elem_type, counts in comparison["element_comparison"].items():
    print(f"  {elem_type}: {counts[0]} vs {counts[1]}")
```
Error Reference
| Error Type | Status Code | Description |
|------------|-------------|-------------|
| `BadRequestError` | 400 | Invalid request payload or parameters |
| `AuthenticationError` | 401 | Invalid or missing API key |
| `NotFoundError` | 404 | Specified file not found in the project |
| `RateLimitError` | 429 | Too many requests; retry after waiting |
| `InternalServerError` | ≥500 | Server-side error while processing the request |
| `APIConnectionError` | N/A | Network connectivity issues |
| `APITimeoutError` | N/A | Request timed out |
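`RateLimitError` is transient, so wrapping calls in a simple exponential backoff is often enough. A sketch (the retry count and delays are arbitrary choices, not SDK defaults):

```python
import time

import graphor
from graphor import Graphor

client = Graphor()

def load_elements_with_retry(max_retries: int = 3, **kwargs):
    """Retry load_elements on rate limits, backing off 1s, 2s, 4s, ..."""
    for attempt in range(max_retries):
        try:
            return client.sources.load_elements(**kwargs)
        except graphor.RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

response = load_elements_with_retry(file_name="document.pdf", page=1, page_size=20)
```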
Best Practices
- **Use appropriate page sizes**: Start with 20-50 elements per page for optimal performance
- **Filter server-side**: Use filter parameters to reduce data transfer
- **Cache results**: Store element data locally for repeated access (see the caching sketch below)
```python
# Good: filter on the server
response = client.sources.load_elements(
    file_name="doc.pdf",
    filter={"type": "Title"},
)

# Less efficient: filter on the client
response = client.sources.load_elements(
    file_name="doc.pdf",
    page_size=500,
)
titles = [
    item for item in response.items
    if (item.metadata or {}).get("element_type") == "Title"
]
```
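For the caching recommendation, one lightweight approach is to memoize responses in memory, keyed by the request parameters. A sketch, assuming responses fit in memory (swap in a persistent store for long-running jobs):

```python
import json

_element_cache: dict = {}

def load_elements_cached(**kwargs):
    """Return a cached load_elements response for identical parameters."""
    cache_key = json.dumps(kwargs, sort_keys=True, default=str)
    if cache_key not in _element_cache:
        _element_cache[cache_key] = client.sources.load_elements(**kwargs)
    return _element_cache[cache_key]
```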
Data Processing
- **Element type awareness**: Different element types need different processing
- **Use the HTML field**: The text_as_html field preserves formatting (a parsing sketch follows below)
- **Handle None metadata**: Always check that metadata exists before accessing it
```python
for item in response.items:
    # Safe metadata access
    metadata = item.metadata or {}
    element_type = metadata.get("element_type", "Unknown")
    page_num = metadata.get("page_number", 0)
```
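To turn text_as_html back into structured data, Python's built-in html.parser is sufficient for simple tables. A sketch, assuming the HTML uses standard `<tr>`/`<td>`/`<th>` markup (the exact markup Graphor emits is not specified here):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect <td>/<th> cell text from an HTML table into rows."""

    def __init__(self):
        super().__init__()
        self.rows: list[list[str]] = []
        self._row: list[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

def html_table_to_rows(html: str | None) -> list[list[str]]:
    parser = TableExtractor()
    parser.feed(html or "")
    return parser.rows
```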
Memory Management
- **Stream large documents**: Process in chunks rather than loading everything at once (see the generator sketch below)
- **Clear processed data**: Drop fields you no longer need
```python
# Process large documents in chunks
page = 1
while True:
    response = client.sources.load_elements(
        file_name="large_doc.pdf",
        page=page,
        page_size=50,
    )
    # Process this batch
    for item in response.items:
        process_element(item)  # Your processing logic
    if page >= response.total_pages:
        break
    page += 1
```
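The same loop can be wrapped in a generator so callers iterate element by element while pages are fetched lazily; memory use stays bounded by one page. A sketch built on the documented pagination fields:

```python
from typing import Iterator

def iter_elements(file_name: str, page_size: int = 50) -> Iterator:
    """Yield elements one at a time, fetching pages on demand."""
    page = 1
    while True:
        response = client.sources.load_elements(
            file_name=file_name,
            page=page,
            page_size=page_size,
        )
        yield from response.items
        if response.total_pages is None or page >= response.total_pages:
            break
        page += 1

# Usage: memory stays constant regardless of document size
for item in iter_elements("large_doc.pdf"):
    process_element(item)  # Your processing logic
```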
Troubleshooting
Slow responses or timeouts
Causes: Large page sizes, complex filters, or server load.
Solutions:
- Reduce page_size to 25-50 elements
- Use specific filters to reduce the result set
- Set a request timeout

```python
client = Graphor(timeout=60.0)
```
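The timeout parameter documented above also works per request, which is handy when only element loading needs a longer budget:

```python
# Per-request timeout (seconds), instead of a client-wide default
response = client.sources.load_elements(
    file_name="document.pdf",
    page=1,
    page_size=25,
    timeout=60.0,
)
```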
No elements returned
Causes: File not processed, incorrect file name, or overly restrictive filters.
Solutions:
- Verify the file was processed successfully with client.sources.list()
- Check that the file name matches exactly (case-sensitive)
- Remove or relax filter criteria
Missing expected elements
Causes: Processing method limitations, file format issues, or filter conflicts.
Solutions:
- Try different partition methods via client.sources.parse()
- Check whether elements are categorized under different types
- Temporarily remove the elements_to_remove filter
Memory issues with large documents
Causes: Processing too many elements at once.
Solutions:
- Reduce page_size and process incrementally
- Filter out unnecessary element types
- Use streaming processing patterns (see the generator sketch above)
Next Steps
After successfully retrieving document elements: