The extract method allows you to extract structured information from your documents using standard JSON Schema and natural language instructions. This is ideal for document processing pipelines that need to convert unstructured documents into structured data.
Method Overview
Sync method: client.sources.extract()
Async method: await client.sources.extract()
Method Signature
client.sources.extract(
    file_ids: list[str] | None = None,     # Preferred
    file_names: list[str] | None = None,   # Deprecated
    output_schema: dict[str, object],      # Required
    user_instruction: str,                 # Required
    thinking_level: str | None = None,
    timeout: float | None = None
) -> SourceExtractResponse
At least one of file_ids or file_names must be provided. file_ids is preferred.
Parameters
| Parameter | Type | Description | Required |
|---|---|---|---|
| file_ids | list[str] | List of file IDs to extract data from (preferred) | No* |
| file_names | list[str] | List of file names to extract data from (deprecated, use file_ids) | No* |
| output_schema | dict[str, object] | JSON Schema defining the structure of the extracted data | ✅ Yes |
| user_instruction | str | Natural language instructions to guide the extraction | ✅ Yes |
| thinking_level | str | Controls model and thinking configuration: "fast", "balanced", "accurate" (default) | No |
| timeout | float | Request timeout in seconds | No |
*At least one of file_ids or file_names must be provided. file_ids is preferred.
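The rule above can also be checked on the client side before a request is sent. The following is a minimal sketch, assuming a hypothetical helper named build_file_selector (it is not part of the Graphor SDK) that returns the keyword arguments identifying the files:

```python
def build_file_selector(file_ids=None, file_names=None):
    """Return kwargs that identify the files, preferring file_ids.

    Hypothetical helper for illustration; not part of the Graphor SDK.
    """
    if file_ids:
        return {"file_ids": list(file_ids)}
    if file_names:
        return {"file_names": list(file_names)}
    raise ValueError("Provide at least one of file_ids or file_names")


# file_ids wins when both are supplied
selector = build_file_selector(
    file_ids=["file_abc123"], file_names=["invoice-2024.pdf"]
)
print(selector)  # {'file_ids': ['file_abc123']}
```

The returned dictionary could then be unpacked into the call, for example client.sources.extract(**selector, ...).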
Thinking Level
The thinking_level parameter controls the model and thinking configuration used for extraction:
| Value | Description |
|---|---|
| "fast" | Uses a faster model without extended thinking. Best for simple extractions where speed is prioritized. |
| "balanced" | Uses a more capable model with low thinking. Good balance between quality and speed. |
| "accurate" | Default. Uses a more capable model with high thinking. Best for complex extractions requiring deep reasoning. |
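Because thinking_level is a plain string, a typo is easy to miss. The sketch below validates the value locally; resolve_thinking_level is a hypothetical helper, and how the service itself treats unknown values is not stated here:

```python
THINKING_LEVELS = ("fast", "balanced", "accurate")


def resolve_thinking_level(level=None):
    """Validate a thinking_level value; None maps to the documented default."""
    if level is None:
        return "accurate"
    if level not in THINKING_LEVELS:
        raise ValueError(
            f"thinking_level must be one of {THINKING_LEVELS}, got {level!r}"
        )
    return level


print(resolve_thinking_level())        # accurate
print(resolve_thinking_level("fast"))  # fast
```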
Output Schema
The output_schema parameter accepts a standard JSON Schema object. This defines the structure of the data you want to extract from your documents.
Supported Schema Features
Basic types: string, number, integer, boolean
Object types: Nested objects with properties
Array types: Lists with an items schema
Null unions: ["string", "null"] for optional fields
Required fields: Specify mandatory properties with the required array
Descriptions: Help the model understand what to extract
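As an illustration of the features listed above, the schema below combines a required string field with an optional field declared as a null union. The field names are examples only, and nullable_fields is a hypothetical helper that reports which top-level properties may come back as null:

```python
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The unique invoice identifier",
        },
        "due_date": {
            "type": ["string", "null"],
            "description": "Payment due date, or null if not stated",
        },
    },
    "required": ["invoice_number"],
}


def nullable_fields(schema):
    """List top-level properties whose type is a union that includes "null"."""
    names = []
    for name, prop in schema.get("properties", {}).items():
        prop_type = prop.get("type")
        if isinstance(prop_type, list) and "null" in prop_type:
            names.append(name)
    return names


print(nullable_fields(schema))  # ['due_date']
```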
Unsupported Schema Features
oneOf, anyOf, allOf combinators
$ref references
Complex regex patterns
External schema references
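A schema can be scanned for the unsupported keywords before it is sent. This is a sketch only: find_unsupported is a hypothetical helper, it checks the combinators and $ref but not regex patterns, and it walks only properties and items:

```python
UNSUPPORTED_KEYWORDS = ("oneOf", "anyOf", "allOf", "$ref")


def find_unsupported(schema, path="$"):
    """Return the paths of unsupported keywords found in a schema dict."""
    problems = []
    if not isinstance(schema, dict):
        return problems
    for key, value in schema.items():
        if key in UNSUPPORTED_KEYWORDS:
            problems.append(f"{path}.{key}")
        elif key == "properties" and isinstance(value, dict):
            # Property names are data, not keywords, so only recurse into values
            for name, sub in value.items():
                problems.extend(find_unsupported(sub, f"{path}.properties.{name}"))
        elif key == "items":
            problems.extend(find_unsupported(value, f"{path}.items"))
    return problems


bad_schema = {
    "type": "object",
    "properties": {
        "vendor": {"$ref": "#/definitions/vendor"},
        "tags": {"type": "array", "items": {"anyOf": [{"type": "string"}]}},
    },
}
print(find_unsupported(bad_schema))
# ['$.properties.vendor.$ref', '$.properties.tags.items.anyOf']
```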
Response Object
The method returns a SourceExtractResponse object with the following properties:
| Property | Type | Description |
|---|---|---|
| file_ids | list[str] \| None | List of file IDs used for extraction |
| file_names | list[str] | List of file names used for extraction |
| structured_output | dict[str, object] \| None | Extracted data matching your schema |
| raw_json | str \| None | Raw JSON text produced by the model before validation/correction |
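Since both structured_output and raw_json may be None, callers may want a defensive accessor. The sketch below uses a SimpleNamespace as a stand-in for the response object; best_effort_output is a hypothetical helper, and falling back to raw_json is a debugging aid rather than documented behavior, because that text has not been validated:

```python
import json
from types import SimpleNamespace


def best_effort_output(result):
    """Return structured_output, else parsed raw_json, else None."""
    if result.structured_output is not None:
        return result.structured_output
    if result.raw_json:
        try:
            return json.loads(result.raw_json)
        except json.JSONDecodeError:
            return None
    return None


# Stand-in for a SourceExtractResponse, used only for this illustration
fake = SimpleNamespace(structured_output=None, raw_json='{"title": "Q3 report"}')
print(best_effort_output(fake))  # {'title': 'Q3 report'}
```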
Code Examples
from graphor import Graphor
client = Graphor()
# Extract invoice data using file_ids (preferred)
result = client.sources.extract(
file_ids = [ "file_abc123" ],
user_instruction = "Extract all invoice information. Use YYYY-MM-DD format for dates." ,
output_schema = {
"type" : "object" ,
"properties" : {
"invoice_number" : {
"type" : "string" ,
"description" : "The unique invoice identifier"
},
"invoice_date" : {
"type" : "string" ,
"description" : "Invoice date in YYYY-MM-DD format"
},
"total_amount" : {
"type" : "number" ,
"description" : "Total amount due"
},
"vendor_name" : {
"type" : "string" ,
"description" : "Name of the company issuing the invoice"
}
},
"required" : [ "invoice_number" , "total_amount" ]
}
)
# Access extracted data
output = result.structured_output
print(f"Invoice: {output['invoice_number']}")
print(f"Amount: ${output['total_amount']}")
# invoice_date is not listed in "required", so read it defensively
print(f"Date: {output.get('invoice_date')}")
from graphor import Graphor
client = Graphor()
# Extract invoice data using file_names (deprecated)
result = client.sources.extract(
file_names = [ "invoice-2024.pdf" ],
user_instruction = "Extract all invoice information. Use YYYY-MM-DD format for dates." ,
output_schema = {
"type" : "object" ,
"properties" : {
"invoice_number" : {
"type" : "string" ,
"description" : "The unique invoice identifier"
},
"total_amount" : {
"type" : "number" ,
"description" : "Total amount due"
}
},
"required" : [ "invoice_number" , "total_amount" ]
}
)
print(f"Invoice: {result.structured_output['invoice_number']}")
Using Thinking Level
Control the model’s reasoning depth with thinking_level:
from graphor import Graphor
client = Graphor()
# Fast mode for simple extractions
result = client.sources.extract(
file_names = [ "simple-invoice.pdf" ],
user_instruction = "Extract the invoice number and total amount." ,
thinking_level = "fast" ,
output_schema = {
"type" : "object" ,
"properties" : {
"invoice_number" : { "type" : "string" , "description" : "Invoice ID" },
"total_amount" : { "type" : "number" , "description" : "Total due" }
}
}
)
print(f"Invoice: {result.structured_output}")
# Accurate mode for complex legal document analysis
result = client.sources.extract(
file_names = [ "complex-contract.pdf" ],
user_instruction = "Extract all legal clauses with their implications and potential risks." ,
thinking_level = "accurate" ,
output_schema = {
"type" : "object" ,
"properties" : {
"clauses" : {
"type" : "array" ,
"items" : {
"type" : "object" ,
"properties" : {
"title" : { "type" : "string" , "description" : "Clause title" },
"content" : { "type" : "string" , "description" : "Clause content" },
"implications" : { "type" : "string" , "description" : "Legal implications" },
"risks" : { "type" : "string" , "description" : "Potential risks" }
}
}
}
}
}
)
print(f"Clauses extracted: {len(result.structured_output['clauses'])}")
Extraction with Nested Objects and Arrays
from graphor import Graphor
client = Graphor()
# Extract invoice with line items and address
result = client.sources.extract(
file_names = [ "invoice-2024.pdf" ],
user_instruction = "Extract invoice with line items and address details." ,
output_schema = {
"type" : "object" ,
"properties" : {
"invoice_number" : {
"type" : "string" ,
"description" : "The unique invoice identifier"
},
"billing_address" : {
"type" : "object" ,
"description" : "Billing address details" ,
"properties" : {
"street" : { "type" : "string" , "description" : "Street address" },
"city" : { "type" : "string" , "description" : "City name" },
"zip_code" : { "type" : "string" , "description" : "Postal code" },
"country" : { "type" : "string" , "description" : "Country name" }
}
},
"tags" : {
"type" : "array" ,
"description" : "Invoice tags or categories" ,
"items" : { "type" : "string" }
},
"line_items" : {
"type" : "array" ,
"description" : "Invoice line items" ,
"items" : {
"type" : "object" ,
"properties" : {
"description" : { "type" : "string" , "description" : "Item description" },
"quantity" : { "type" : "number" , "description" : "Item quantity" },
"unit_price" : { "type" : "number" , "description" : "Price per unit" },
"total" : { "type" : "number" , "description" : "Line item total" }
}
}
}
},
"required" : [ "invoice_number" ]
}
)
output = result.structured_output
print(f"Invoice: {output['invoice_number']}")
print(f"City: {output['billing_address']['city']}")
print(f"Tags: {', '.join(output['tags'])}")
print("Line Items:")
for item in output["line_items"]:
    print(f"  - {item['description']}: {item['quantity']} x ${item['unit_price']}")
import asyncio
from graphor import AsyncGraphor

async def extract_invoice_data(file_name: str):
    client = AsyncGraphor()
    result = await client.sources.extract(
        file_names=[file_name],
        user_instruction="Extract invoice details.",
        output_schema={
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string", "description": "Invoice ID"},
                "total_amount": {"type": "number", "description": "Total due"},
                "invoice_date": {"type": "string", "description": "Invoice date"}
            },
            "required": ["invoice_number", "total_amount"]
        }
    )
    return result.structured_output

# Run the async function
data = asyncio.run(extract_invoice_data("invoice.pdf"))
print(f"Invoice: {data['invoice_number']}")
from graphor import Graphor
client = Graphor()
# Extract data from multiple related files
result = client.sources.extract(
file_names = [ "contract-part1.pdf" , "contract-part2.pdf" ],
user_instruction = "Extract key contract terms from both documents." ,
output_schema = {
"type" : "object" ,
"properties" : {
"contract_title" : { "type" : "string" , "description" : "Title of the contract" },
"effective_date" : { "type" : "string" , "description" : "Contract start date" },
"termination_date" : { "type" : "string" , "description" : "Contract end date" },
"parties" : {
"type" : "array" ,
"description" : "Parties involved in the contract" ,
"items" : {
"type" : "object" ,
"properties" : {
"name" : { "type" : "string" , "description" : "Party name" },
"role" : { "type" : "string" , "description" : "Role (e.g., Licensor, Licensee)" }
}
}
}
},
"required" : [ "contract_title" , "parties" ]
}
)
print(f"Contract: {result.structured_output['contract_title']}")
print(f"Files processed: {result.file_names}")
Error Handling
import graphor
from graphor import Graphor

client = Graphor()

try:
    result = client.sources.extract(
        file_names=["document.pdf"],
        user_instruction="Extract data from the document.",
        output_schema={
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Document title"}
            }
        }
    )
    print(f"Extracted: {result.structured_output}")
except graphor.NotFoundError as e:
    print(f"File not found: {e}")
except graphor.BadRequestError as e:
    print(f"Invalid schema or request: {e}")
except graphor.AuthenticationError as e:
    print(f"Invalid API key: {e}")
except graphor.RateLimitError as e:
    print(f"Rate limit exceeded. Please wait and retry: {e}")
except graphor.InternalServerError as e:
    print(f"Server error: {e}")
except graphor.APITimeoutError as e:
    # Handled before APIConnectionError in case the timeout class derives from it
    print(f"Request timed out: {e}")
except graphor.APIConnectionError as e:
    print(f"Connection error: {e}")
Schema Examples
invoice_schema = {
"type" : "object" ,
"properties" : {
"invoice_number" : { "type" : "string" , "description" : "Unique invoice identifier" },
"invoice_date" : { "type" : "string" , "description" : "Invoice date (YYYY-MM-DD)" },
"due_date" : { "type" : "string" , "description" : "Payment due date (YYYY-MM-DD)" },
"vendor_name" : { "type" : "string" , "description" : "Company issuing the invoice" },
"customer_name" : { "type" : "string" , "description" : "Customer being billed" },
"subtotal" : { "type" : "number" , "description" : "Subtotal before tax" },
"tax_amount" : { "type" : "number" , "description" : "Tax amount" },
"total_amount" : { "type" : "number" , "description" : "Total amount due" }
},
"required" : [ "invoice_number" , "total_amount" ]
}
result = client.sources.extract(
file_names = [ "invoice.pdf" ],
user_instruction = "Extract all invoice details. Convert amounts to numbers without currency symbols." ,
output_schema = invoice_schema
)
Contract Analysis
contract_schema = {
"type" : "object" ,
"properties" : {
"contract_title" : { "type" : "string" , "description" : "Title or name of the contract" },
"effective_date" : { "type" : "string" , "description" : "When the contract becomes effective" },
"termination_date" : { "type" : "string" , "description" : "When the contract ends" },
"auto_renewal" : { "type" : "boolean" , "description" : "Whether contract auto-renews" },
"parties" : {
"type" : "array" ,
"description" : "All parties involved in the contract" ,
"items" : {
"type" : "object" ,
"properties" : {
"name" : { "type" : "string" , "description" : "Party name" },
"role" : { "type" : "string" , "description" : "Role (e.g., Licensor, Licensee)" },
"address" : { "type" : "string" , "description" : "Party address" }
}
}
},
"key_terms" : {
"type" : "object" ,
"description" : "Key contract terms and conditions" ,
"properties" : {
"payment_terms" : { "type" : "string" , "description" : "Payment conditions" },
"liability_cap" : { "type" : "number" , "description" : "Maximum liability amount" },
"notice_period_days" : { "type" : "integer" , "description" : "Notice period in days" }
}
}
},
"required" : [ "contract_title" , "parties" ]
}
result = client.sources.extract(
file_names = [ "contract.pdf" ],
user_instruction = "Extract key contract terms with all parties and obligations." ,
output_schema = contract_schema
)
Resume Parsing
resume_schema = {
"type" : "object" ,
"properties" : {
"full_name" : { "type" : "string" , "description" : "Candidate's full name" },
"email" : { "type" : "string" , "description" : "Email address" },
"phone" : { "type" : "string" , "description" : "Phone number" },
"years_experience" : { "type" : "number" , "description" : "Total years of experience" },
"skills" : {
"type" : "array" ,
"description" : "List of technical and soft skills" ,
"items" : { "type" : "string" }
},
"work_experience" : {
"type" : "array" ,
"description" : "Work history" ,
"items" : {
"type" : "object" ,
"properties" : {
"company" : { "type" : "string" , "description" : "Company name" },
"title" : { "type" : "string" , "description" : "Job title" },
"start_date" : { "type" : "string" , "description" : "Start date" },
"end_date" : { "type" : "string" , "description" : "End date (or 'current')" }
}
}
},
"education" : {
"type" : "array" ,
"description" : "Educational background" ,
"items" : {
"type" : "object" ,
"properties" : {
"institution" : { "type" : "string" , "description" : "School or university name" },
"degree" : { "type" : "string" , "description" : "Degree obtained" },
"graduation_year" : { "type" : "integer" , "description" : "Year of graduation" }
}
}
}
},
"required" : [ "full_name" ]
}
result = client.sources.extract(
file_names = [ "resume.pdf" ],
user_instruction = "Extract complete candidate information including work history and education." ,
output_schema = resume_schema
)
Product Catalog
product_schema = {
"type" : "object" ,
"properties" : {
"product_name" : { "type" : "string" , "description" : "Product name" },
"sku" : { "type" : "string" , "description" : "Product SKU" },
"base_price" : { "type" : "number" , "description" : "Base price" },
"in_stock" : { "type" : "boolean" , "description" : "Whether product is in stock" },
"specifications" : {
"type" : "object" ,
"description" : "Product specifications" ,
"properties" : {
"weight" : { "type" : "number" , "description" : "Weight in kg" },
"dimensions" : { "type" : "string" , "description" : "Dimensions (LxWxH)" },
"material" : { "type" : "string" , "description" : "Main material" }
}
},
"categories" : {
"type" : "array" ,
"description" : "Product categories" ,
"items" : { "type" : "string" }
},
"variants" : {
"type" : "array" ,
"description" : "Product variants" ,
"items" : {
"type" : "object" ,
"properties" : {
"color" : { "type" : "string" , "description" : "Variant color" },
"size" : { "type" : "string" , "description" : "Variant size" },
"price_modifier" : { "type" : "number" , "description" : "Price adjustment" }
}
}
}
},
"required" : [ "product_name" , "sku" ]
}
result = client.sources.extract(
file_names = [ "catalog.pdf" ],
user_instruction = "Extract all products with their specifications and variants." ,
output_schema = product_schema
)
Advanced Examples
Build a complete extraction pipeline for processing multiple documents:
from dataclasses import dataclass
from typing import Any

import graphor
from graphor import Graphor


@dataclass
class ExtractionResult:
    file_name: str
    data: dict[str, Any] | None
    error: str | None = None


class DocumentExtractor:
    def __init__(self, api_key: str | None = None):
        self.client = Graphor(api_key=api_key) if api_key else Graphor()

    def extract_invoices(self, file_names: list[str]) -> list[ExtractionResult]:
        """Extract invoice data from multiple files."""
        results = []
        invoice_schema = {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string", "description": "Invoice ID"},
                "vendor_name": {"type": "string", "description": "Vendor name"},
                "total_amount": {"type": "number", "description": "Total amount"},
                "invoice_date": {"type": "string", "description": "Date (YYYY-MM-DD)"},
                "line_items": {
                    "type": "array",
                    "description": "Line items",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "amount": {"type": "number"}
                        }
                    }
                }
            },
            "required": ["invoice_number", "total_amount"]
        }
        # Process files one at a time so a single failure does not stop the batch
        for file_name in file_names:
            try:
                result = self.client.sources.extract(
                    file_names=[file_name],
                    user_instruction="Extract invoice information. Use YYYY-MM-DD for dates.",
                    output_schema=invoice_schema
                )
                results.append(ExtractionResult(
                    file_name=file_name,
                    data=result.structured_output
                ))
                print(f"✅ Extracted: {file_name}")
            except graphor.APIStatusError as e:
                results.append(ExtractionResult(
                    file_name=file_name,
                    data=None,
                    error=str(e)
                ))
                print(f"❌ Failed: {file_name} - {e}")
        return results

    def extract_with_custom_schema(
        self,
        file_names: list[str],
        schema: dict,
        instruction: str
    ) -> dict[str, Any] | None:
        """Extract data using a custom schema."""
        try:
            result = self.client.sources.extract(
                file_names=file_names,
                user_instruction=instruction,
                output_schema=schema
            )
            return result.structured_output
        except graphor.APIStatusError as e:
            print(f"Extraction error: {e}")
            return None


# Usage
extractor = DocumentExtractor()

# Process multiple invoices
invoices = extractor.extract_invoices([
    "invoice-001.pdf",
    "invoice-002.pdf",
    "invoice-003.pdf"
])

# Calculate totals, skipping files whose extraction failed
total = sum(
    inv.data["total_amount"]
    for inv in invoices
    if inv.data is not None
)
print(f"Total amount: ${total:,.2f}")
Process many documents efficiently with async:
import asyncio

import graphor
from graphor import AsyncGraphor


async def extract_single(
    client: AsyncGraphor,
    file_name: str,
    schema: dict,
    instruction: str
):
    """Extract data from a single file."""
    try:
        result = await client.sources.extract(
            file_names=[file_name],
            user_instruction=instruction,
            output_schema=schema
        )
        return {
            "file_name": file_name,
            "status": "success",
            "data": result.structured_output
        }
    except graphor.APIStatusError as e:
        return {
            "file_name": file_name,
            "status": "failed",
            "error": str(e)
        }


async def batch_extract(
    file_names: list[str],
    schema: dict,
    instruction: str,
    max_concurrent: int = 3
):
    """Extract data from multiple files with controlled concurrency."""
    client = AsyncGraphor(timeout=120.0)
    # Use semaphore to limit concurrent requests
    semaphore = asyncio.Semaphore(max_concurrent)

    async def extract_with_semaphore(file_name: str):
        async with semaphore:
            print(f"Processing: {file_name}...")
            result = await extract_single(client, file_name, schema, instruction)
            status_icon = "✅" if result["status"] == "success" else "❌"
            print(f"  {status_icon} {file_name}: {result['status']}")
            return result

    tasks = [extract_with_semaphore(f) for f in file_names]
    results = await asyncio.gather(*tasks)
    successful = [r for r in results if r["status"] == "success"]
    failed = [r for r in results if r["status"] == "failed"]
    print(f"\nSummary: {len(successful)} successful, {len(failed)} failed")
    return results


# Usage
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Document title"},
        "summary": {"type": "string", "description": "Brief summary"}
    }
}
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf", "doc5.pdf"]
results = asyncio.run(batch_extract(
    files,
    schema,
    "Extract the title and summary from this document.",
    max_concurrent=3
))
Add validation to ensure extracted data meets your requirements:
from typing import Any, Callable

import graphor
from graphor import Graphor


class ValidatedExtractor:
    def __init__(self):
        self.client = Graphor()

    def extract_and_validate(
        self,
        file_names: list[str],
        schema: dict,
        instruction: str,
        validators: dict[str, Callable[[Any], bool]] | None = None
    ) -> dict[str, Any]:
        """Extract data and validate the results."""
        result = self.client.sources.extract(
            file_names=file_names,
            user_instruction=instruction,
            output_schema=schema
        )
        data = result.structured_output
        if validators and data:
            validation_errors = []
            for field, validator in validators.items():
                # Fields absent from the output are skipped, not reported
                if field in data:
                    try:
                        if not validator(data[field]):
                            validation_errors.append(f"Validation failed for '{field}'")
                    except Exception as e:
                        validation_errors.append(f"Validator error for '{field}': {e}")
            if validation_errors:
                return {
                    "success": False,
                    "data": data,
                    "errors": validation_errors
                }
        return {
            "success": True,
            "data": data,
            "errors": []
        }


# Usage
extractor = ValidatedExtractor()

# Define validators
validators = {
    "invoice_number": lambda x: bool(x) and len(x) > 0,
    "total_amount": lambda x: bool(x) and x > 0,
    "invoice_date": lambda x: bool(x) and len(x) == 10,  # YYYY-MM-DD format
}

result = extractor.extract_and_validate(
    file_names=["invoice.pdf"],
    schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID"},
            "total_amount": {"type": "number", "description": "Total amount"},
            "invoice_date": {"type": "string", "description": "Date YYYY-MM-DD"}
        },
        "required": ["invoice_number", "total_amount"]
    },
    instruction="Extract invoice details. Use YYYY-MM-DD for dates.",
    validators=validators
)

if result["success"]:
    print(f"Valid extraction: {result['data']}")
else:
    print(f"Validation errors: {result['errors']}")
Debugging with Raw JSON
Use the raw_json field to debug extraction issues:
from graphor import Graphor
import json
client = Graphor()
result = client.sources.extract(
file_names = [ "document.pdf" ],
user_instruction = "Extract document information." ,
output_schema = {
"type" : "object" ,
"properties" : {
"title" : { "type" : "string" , "description" : "Document title" },
"author" : { "type" : "string" , "description" : "Author name" }
}
}
)
# Compare raw vs structured output
print("Raw JSON from model:")
print(result.raw_json)
print("\nStructured output (validated):")
print(json.dumps(result.structured_output, indent=2))
# Check for differences (useful for debugging)
if result.raw_json:
    raw_parsed = json.loads(result.raw_json)
    if raw_parsed != result.structured_output:
        print("\n⚠️ Note: Structured output differs from raw JSON (post-validation)")
Best Practices
Schema Design
Use clear descriptions: Detailed property descriptions improve extraction accuracy
Match types to data: Use number for amounts, string for dates, boolean for flags
Keep nesting shallow: Avoid deeply nested structures (2-3 levels maximum)
Define required fields: Use the required array to specify mandatory properties
Use arrays for lists: Extract repeating items using arrays with item schemas
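The "keep nesting shallow" guideline can be checked mechanically. The sketch below counts every object or array layer as one level; that counting convention is an assumption, since the text above does not define how levels are measured, and schema_depth is a hypothetical helper:

```python
def schema_depth(schema):
    """Count nested object/array layers in a JSON Schema dict."""
    if not isinstance(schema, dict):
        return 0
    children = []
    props = schema.get("properties")
    if isinstance(props, dict):
        children.extend(props.values())
    items = schema.get("items")
    if isinstance(items, dict):
        children.append(items)
    if not children:
        return 0  # scalar leaf such as string or number
    return 1 + max(schema_depth(child) for child in children)


flat = {"type": "object", "properties": {"title": {"type": "string"}}}
nested = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"total": {"type": "number"}},
            },
        }
    },
}
print(schema_depth(flat))    # 1
print(schema_depth(nested))  # 3
```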
Instruction Writing
Be specific: Include format preferences (e.g., “Use YYYY-MM-DD for dates”)
Handle edge cases: Specify what to do for missing data (e.g., “Use null if not found”)
Provide context: Explain what the document contains and what you need
Avoid ambiguity: Use clear, unambiguous language
Batch related files: Process related documents together for context
Use appropriate timeouts: Extraction can take time for complex documents
Implement retries: Handle transient errors with the SDK’s retry mechanism
Cache results: Store extraction results to avoid reprocessing
# Configure retries for reliability
client = Graphor( max_retries = 3 , timeout = 120.0 )
# Or per-request
result = client.with_options( max_retries = 5 , timeout = 180.0 ).sources.extract(
file_names = [ "large-document.pdf" ],
user_instruction = "Extract all data." ,
output_schema = schema
)
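The "cache results" advice can be sketched as an in-memory cache keyed by a hash of the request inputs. ExtractionCache and cache_key are hypothetical names, and fake_extract stands in for a real call to client.sources.extract(). The key ignores thinking_level and any change to the underlying documents, so a real cache would need to account for both:

```python
import hashlib
import json


def cache_key(file_ids, output_schema, user_instruction):
    """Derive a stable key from the inputs that shape the result."""
    payload = json.dumps(
        {"file_ids": file_ids, "schema": output_schema, "instruction": user_instruction},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExtractionCache:
    """In-memory cache around any callable that performs the extraction."""

    def __init__(self, extract_fn):
        self._extract_fn = extract_fn
        self._store = {}

    def get(self, file_ids, output_schema, user_instruction):
        key = cache_key(file_ids, output_schema, user_instruction)
        if key not in self._store:
            self._store[key] = self._extract_fn(file_ids, output_schema, user_instruction)
        return self._store[key]


calls = []


def fake_extract(file_ids, output_schema, user_instruction):
    # Stand-in for the real API call; records each invocation
    calls.append(file_ids)
    return {"title": "Example"}


cache = ExtractionCache(fake_extract)
title_schema = {"type": "object", "properties": {"title": {"type": "string"}}}
cache.get(["file_abc123"], title_schema, "Extract the title.")
cache.get(["file_abc123"], title_schema, "Extract the title.")
print(len(calls))  # 1
```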
Error Reference
| Error Type | Status Code | Description |
|---|---|---|
| BadRequestError | 400 | Invalid parameters or malformed schema |
| AuthenticationError | 401 | Invalid or missing API key |
| PermissionDeniedError | 403 | Access denied to the specified project |
| NotFoundError | 404 | File not found or no parsing history |
| RateLimitError | 429 | Too many requests, please retry after waiting |
| InternalServerError | ≥500 | Server-side processing error |
| APIConnectionError | N/A | Network connectivity issues |
| APITimeoutError | N/A | Request timed out |
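For logging or tests, the status-code rows of the table above can be expressed as a lookup. This is a sketch: error_name_for_status is a hypothetical helper, and it returns None for codes the table does not list rather than guessing which exception the SDK would raise:

```python
STATUS_TO_ERROR = {
    400: "BadRequestError",
    401: "AuthenticationError",
    403: "PermissionDeniedError",
    404: "NotFoundError",
    429: "RateLimitError",
}


def error_name_for_status(status_code):
    """Map an HTTP status code to the error type named in the table."""
    if status_code >= 500:
        return "InternalServerError"
    return STATUS_TO_ERROR.get(status_code)


print(error_name_for_status(404))  # NotFoundError
print(error_name_for_status(503))  # InternalServerError
```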
Troubleshooting
Causes: File doesn’t exist, hasn’t been processed, or wrong file name
Solutions:
Verify the exact file name (case-sensitive)
Ensure the file has been uploaded and processed
Use client.sources.list() to check available files
# List all sources to find correct file names
sources = client.sources.list()
for source in sources:
    if source.status == "Completed":
        print(source.file_name)
Causes: Malformed JSON Schema or unsupported features
Solutions:
Validate your schema against JSON Schema spec
Avoid unsupported features ($ref, oneOf, anyOf)
Ensure all type values are valid
Check that properties is an object, not an array
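Two of the checks above, valid type values and properties being an object, are easy to automate. The following sketch uses a hypothetical check_schema helper; it covers only those two checks and is not a full JSON Schema validator:

```python
VALID_TYPES = ("string", "number", "integer", "boolean", "object", "array", "null")


def check_schema(schema, path="$"):
    """Return a list of human-readable problems found in a schema dict."""
    if not isinstance(schema, dict):
        return [f"{path}: schema must be an object"]
    errors = []
    declared = schema.get("type")
    if declared is not None:
        # A type may be a single name or a union such as ["string", "null"]
        names = declared if isinstance(declared, list) else [declared]
        for name in names:
            if name not in VALID_TYPES:
                errors.append(f"{path}: invalid type {name!r}")
    props = schema.get("properties")
    if props is not None:
        if not isinstance(props, dict):
            errors.append(
                f"{path}.properties: must be an object, not {type(props).__name__}"
            )
        else:
            for key, sub in props.items():
                errors.extend(check_schema(sub, f"{path}.properties.{key}"))
    items = schema.get("items")
    if isinstance(items, dict):
        errors.extend(check_schema(items, f"{path}.items"))
    return errors


print(check_schema({"type": "object", "properties": [{"name": "title"}]}))
# ['$.properties: must be an object, not list']
```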
Causes: Large documents, complex schemas, or server load
Solutions:
Increase the timeout value
Simplify the schema (fewer fields, shallower nesting)
Process smaller document batches
client = Graphor( timeout = 180.0 ) # 3 minutes
Missing or incorrect data
Causes: Vague instructions, poor document quality, or inappropriate schema
Solutions:
Make instructions more specific
Reprocess the document with a better partition method
Add more context in property descriptions
Check the raw_json field for debugging
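When checking raw_json, it helps to see exactly which fields changed during validation. A minimal sketch, assuming the raw JSON parses to an object; diff_raw_vs_structured is a hypothetical helper that compares top-level keys only:

```python
import json


def diff_raw_vs_structured(raw_json, structured_output):
    """Map each differing top-level key to (raw value, validated value)."""
    raw = json.loads(raw_json)
    validated = structured_output or {}
    changed = {}
    for key in sorted(set(raw) | set(validated)):
        if raw.get(key) != validated.get(key):
            changed[key] = (raw.get(key), validated.get(key))
    return changed


raw_text = '{"invoice_number": "INV-1", "total_amount": "1,200.50"}'
validated_output = {"invoice_number": "INV-1", "total_amount": 1200.5}
print(diff_raw_vs_structured(raw_text, validated_output))
# {'total_amount': ('1,200.50', 1200.5)}
```

A field that appears in this diff is one where the model's raw answer did not match the schema, which often points to a description or instruction that needs to be more specific.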
Next Steps
After extracting data from your documents: