The `extract` method lets you pull structured information out of your documents using a standard JSON Schema and natural language instructions. It is suited to document processing pipelines that need to convert unstructured documents into structured data.
Method Overview
Python:
- Sync method: `client.sources.extract()`
- Async method: `await client.sources.extract()` (using `AsyncGraphor`)
TypeScript:
- Async method: `await client.sources.extract()`
All TypeScript methods are async and return a Promise.
Method Signature
Python:
client.sources.extract(
    output_schema: dict[str, object],      # Required
    user_instruction: str,                 # Required
    file_ids: list[str] | None = None,     # Preferred
    file_names: list[str] | None = None,   # Deprecated
    thinking_level: str | None = None,
    timeout: float | None = None
) -> SourceExtractResponse
TypeScript:
await client.sources.extract({
  output_schema: Record<string, unknown>,   // Required
  user_instruction: string,                 // Required
  file_ids?: string[] | null,               // Preferred
  file_names?: string[] | null,             // Deprecated
  thinking_level?: 'fast' | 'balanced' | 'accurate' | null,
}): Promise<SourceExtractResponse>
At least one of `file_ids` or `file_names` must be provided. `file_ids` is preferred.
Parameters
Python:

| Parameter | Type | Description | Required |
|---|---|---|---|
| `file_ids` | `list[str]` | List of file IDs to extract data from (preferred) | No* |
| `file_names` | `list[str]` | List of file names to extract data from (deprecated, use `file_ids`) | No* |
| `output_schema` | `dict[str, object]` | JSON Schema defining the structure of the extracted data | Yes |
| `user_instruction` | `str` | Natural language instructions to guide the extraction | Yes |
| `thinking_level` | `str` | Controls model and thinking configuration: `"fast"`, `"balanced"`, `"accurate"` (default) | No |
| `timeout` | `float` | Request timeout in seconds | No |

TypeScript:

| Parameter | Type | Description | Required |
|---|---|---|---|
| `file_ids` | `string[] \| null` | List of file IDs to extract data from (preferred) | No* |
| `file_names` | `string[] \| null` | List of file names to extract data from (deprecated, use `file_ids`) | No* |
| `output_schema` | `Record<string, unknown>` | JSON Schema defining the structure of the extracted data | Yes |
| `user_instruction` | `string` | Natural language instructions to guide the extraction | Yes |
| `thinking_level` | `'fast' \| 'balanced' \| 'accurate' \| null` | Controls model and thinking configuration (default: `"accurate"`) | No |

*At least one of `file_ids` or `file_names` must be provided. `file_ids` is preferred.
Thinking Level
The `thinking_level` parameter controls the model and thinking configuration used for extraction:

| Value | Description |
|---|---|
| `"fast"` | Uses a faster model without extended thinking. Best for simple extractions where speed is prioritized. |
| `"balanced"` | Uses a more capable model with low thinking. Good balance between quality and speed. |
| `"accurate"` | Default. Uses a more capable model with high thinking. Best for complex extractions requiring deep reasoning. |

Output Schema
The `output_schema` parameter accepts a standard JSON Schema object. This defines the structure of the data you want to extract from your documents.
Supported Schema Features
- Basic types: `string`, `number`, `integer`, `boolean`
- Object types: Nested objects with `properties`
- Array types: Lists with an `items` schema
- Null unions: `["string", "null"]` for optional fields
- Required fields: Specify mandatory properties with the `required` array
- Descriptions: Help the model understand what to extract
Unsupported Schema Features
- `oneOf`, `anyOf`, `allOf` combinators
- `$ref` references
- Complex regex patterns
- External schema references
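The unsupported keywords listed above can be caught locally before a request is sent. The following is a minimal sketch, not part of the SDK: `find_unsupported` and `UNSUPPORTED_KEYWORDS` are hypothetical names, and the check covers only the combinators and `$ref`. It does not try to judge whether a regex pattern is "complex".

```python
from __future__ import annotations

from typing import Any

# Keywords the documentation lists as unsupported (assumed to apply at any depth).
UNSUPPORTED_KEYWORDS = {"oneOf", "anyOf", "allOf", "$ref"}


def find_unsupported(schema: Any, path: str = "$") -> list[str]:
    """Return the paths of unsupported keywords found anywhere in the schema."""
    problems: list[str] = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}.{key}"
            if key in UNSUPPORTED_KEYWORDS:
                problems.append(child)
            if key == "properties" and isinstance(value, dict):
                # Keys under "properties" are field names, not schema keywords,
                # so only their subschemas are inspected.
                for name, subschema in value.items():
                    problems.extend(find_unsupported(subschema, f"{child}.{name}"))
            else:
                problems.extend(find_unsupported(value, child))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            problems.extend(find_unsupported(item, f"{path}[{index}]"))
    return problems


schema = {
    "type": "object",
    "properties": {
        # A null union is supported and marks the field as optional.
        "due_date": {"type": ["string", "null"], "description": "Payment due date"},
        "amount": {"anyOf": [{"type": "number"}, {"type": "string"}]},
        "vendor": {"$ref": "#/definitions/vendor"},
    },
}
print(find_unsupported(schema))
```

Running a check like this in tests or at startup can surface schema problems earlier than a failed request would.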
Response Object
The method returns a `SourceExtractResponse` object with the following properties:

| Property | Type | Description |
|---|---|---|
| `file_ids` | `list[str] \| None` | List of file IDs used for extraction |
| `file_names` | `list[str]` | List of file names used for extraction |
| `structured_output` | `dict \| None` | Extracted data matching your schema |
| `raw_json` | `str \| None` | Raw JSON text produced by the model before validation/correction |

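Because both `structured_output` and `raw_json` may be `None`, code that consumes the response should handle the empty case. The sketch below is one possible approach under stated assumptions: `FakeExtractResponse` is a stand-in that mirrors the documented properties so the example runs without the SDK, and `read_output` is a hypothetical helper. Falling back to `raw_json` is a design choice, since that text has not been validated or corrected.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeExtractResponse:
    """Stand-in with the documented properties; not the SDK class."""

    file_names: list[str]
    file_ids: list[str] | None = None
    structured_output: dict[str, Any] | None = None
    raw_json: str | None = None


def read_output(response: Any) -> dict[str, Any]:
    """Prefer structured_output; otherwise try to parse raw_json."""
    if response.structured_output is not None:
        return response.structured_output
    if response.raw_json:
        try:
            parsed = json.loads(response.raw_json)
        except json.JSONDecodeError as exc:
            raise ValueError("raw_json is not valid JSON") from exc
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("Response contains no usable extracted data")


response = FakeExtractResponse(
    file_names=["invoice-2024.pdf"],
    raw_json='{"invoice_number": "INV-2"}',
)
print(read_output(response))
```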
Code Examples
Basic Extraction
Python:
from graphor import Graphor
client = Graphor()
# Extract invoice data using file_ids (preferred)
result = client.sources.extract(
file_ids=["file_abc123"],
user_instruction="Extract all invoice information. Use YYYY-MM-DD format for dates.",
output_schema={
"type": "object",
"properties": {
"invoice_number": {
"type": "string",
"description": "The unique invoice identifier"
},
"invoice_date": {
"type": "string",
"description": "Invoice date in YYYY-MM-DD format"
},
"total_amount": {
"type": "number",
"description": "Total amount due"
},
"vendor_name": {
"type": "string",
"description": "Name of the company issuing the invoice"
}
},
"required": ["invoice_number", "total_amount"]
}
)
# Access extracted data
output = result.structured_output
print(f"Invoice: {output['invoice_number']}")
print(f"Amount: ${output['total_amount']}")
print(f"Date: {output['invoice_date']}")
TypeScript:
import Graphor from 'graphor';
const client = new Graphor();
// Extract invoice data using file_ids (preferred)
const result = await client.sources.extract({
file_ids: ['file_abc123'],
user_instruction: 'Extract all invoice information. Use YYYY-MM-DD format for dates.',
output_schema: {
type: 'object',
properties: {
invoice_number: {
type: 'string',
description: 'The unique invoice identifier',
},
invoice_date: {
type: 'string',
description: 'Invoice date in YYYY-MM-DD format',
},
total_amount: {
type: 'number',
description: 'Total amount due',
},
vendor_name: {
type: 'string',
description: 'Name of the company issuing the invoice',
},
},
required: ['invoice_number', 'total_amount'],
},
});
// Access extracted data
const output = result.structured_output as Record<string, unknown>;
console.log(`Invoice: ${output.invoice_number}`);
console.log(`Amount: $${output.total_amount}`);
console.log(`Date: ${output.invoice_date}`);
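In the example above only `invoice_number` and `total_amount` are required, so `invoice_date` may be absent from the output, and the requested date format is an instruction to the model rather than a guarantee. A small post-check can make both assumptions explicit. This is a sketch only; `check_invoice` is a hypothetical helper, not part of the SDK.

```python
from __future__ import annotations

from datetime import date, datetime
from typing import Any


def check_invoice(output: dict[str, Any]) -> date | None:
    """Verify required keys and parse the optional invoice_date field."""
    missing = [key for key in ("invoice_number", "total_amount") if key not in output]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")

    raw_date = output.get("invoice_date")
    if raw_date is None:
        return None
    # Raises ValueError if the model did not return a YYYY-MM-DD date.
    return datetime.strptime(raw_date, "%Y-%m-%d").date()


parsed = check_invoice(
    {"invoice_number": "INV-1", "total_amount": 10.0, "invoice_date": "2024-03-15"}
)
print(parsed)
```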
Basic Extraction (using file_names - deprecated)
Python:
from graphor import Graphor
client = Graphor()
# Extract invoice data using file_names (deprecated)
result = client.sources.extract(
file_names=["invoice-2024.pdf"],
user_instruction="Extract all invoice information. Use YYYY-MM-DD format for dates.",
output_schema={
"type": "object",
"properties": {
"invoice_number": {
"type": "string",
"description": "The unique invoice identifier"
},
"total_amount": {
"type": "number",
"description": "Total amount due"
}
},
"required": ["invoice_number", "total_amount"]
}
)
print(f"Invoice: {result.structured_output['invoice_number']}")
TypeScript:
import Graphor from 'graphor';
const client = new Graphor();
// Extract invoice data using file_names (deprecated)
const result = await client.sources.extract({
file_names: ['invoice-2024.pdf'],
user_instruction: 'Extract all invoice information. Use YYYY-MM-DD format for dates.',
output_schema: {
type: 'object',
properties: {
invoice_number: {
type: 'string',
description: 'The unique invoice identifier',
},
total_amount: {
type: 'number',
description: 'Total amount due',
},
},
required: ['invoice_number', 'total_amount'],
},
});
const output = result.structured_output as Record<string, unknown>;
console.log(`Invoice: ${output.invoice_number}`);
Using Thinking Level
Control the model's reasoning depth with `thinking_level`:
Python:
from graphor import Graphor
client = Graphor()
# Fast mode for simple extractions
result = client.sources.extract(
file_names=["simple-invoice.pdf"],
user_instruction="Extract the invoice number and total amount.",
thinking_level="fast",
output_schema={
"type": "object",
"properties": {
"invoice_number": {"type": "string", "description": "Invoice ID"},
"total_amount": {"type": "number", "description": "Total due"}
}
}
)
print(f"Invoice: {result.structured_output}")
# Accurate mode for complex legal document analysis
result = client.sources.extract(
file_names=["complex-contract.pdf"],
user_instruction="Extract all legal clauses with their implications and potential risks.",
thinking_level="accurate",
output_schema={
"type": "object",
"properties": {
"clauses": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": {"type": "string", "description": "Clause title"},
"content": {"type": "string", "description": "Clause content"},
"implications": {"type": "string", "description": "Legal implications"},
"risks": {"type": "string", "description": "Potential risks"}
}
}
}
}
}
)
print(f"Clauses extracted: {len(result.structured_output['clauses'])}")
TypeScript:
import Graphor from 'graphor';
const client = new Graphor();
// Fast mode for simple extractions
const result = await client.sources.extract({
file_names: ['simple-invoice.pdf'],
user_instruction: 'Extract the invoice number and total amount.',
thinking_level: 'fast',
output_schema: {
type: 'object',
properties: {
invoice_number: { type: 'string', description: 'Invoice ID' },
total_amount: { type: 'number', description: 'Total due' },
},
},
});
console.log('Invoice:', result.structured_output);
// Accurate mode for complex legal document analysis
const contractResult = await client.sources.extract({
file_names: ['complex-contract.pdf'],
user_instruction: 'Extract all legal clauses with their implications and potential risks.',
thinking_level: 'accurate',
output_schema: {
type: 'object',
properties: {
clauses: {
type: 'array',
items: {
type: 'object',
properties: {
title: { type: 'string', description: 'Clause title' },
content: { type: 'string', description: 'Clause content' },
implications: { type: 'string', description: 'Legal implications' },
risks: { type: 'string', description: 'Potential risks' },
},
},
},
},
},
});
const clauses = (contractResult.structured_output as Record<string, unknown>)
.clauses as unknown[];
console.log(`Clauses extracted: ${clauses.length}`);
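The Python signature types `thinking_level` as a plain `str`, so a typo would not be caught before the request is made. The sketch below validates the documented values, and the rule that at least one of `file_ids` or `file_names` is present, before building the keyword arguments. `build_extract_kwargs` and `THINKING_LEVELS` are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from typing import Any

# The three values documented for thinking_level.
THINKING_LEVELS = ("fast", "balanced", "accurate")


def build_extract_kwargs(
    output_schema: dict[str, Any],
    user_instruction: str,
    file_ids: list[str] | None = None,
    file_names: list[str] | None = None,
    thinking_level: str | None = None,
) -> dict[str, Any]:
    """Validate arguments locally and return kwargs for extract()."""
    if not file_ids and not file_names:
        raise ValueError("Provide at least one of file_ids or file_names")
    if thinking_level is not None and thinking_level not in THINKING_LEVELS:
        raise ValueError(f"thinking_level must be one of {THINKING_LEVELS}")

    kwargs: dict[str, Any] = {
        "output_schema": output_schema,
        "user_instruction": user_instruction,
    }
    # Only include the optional arguments that were actually supplied.
    if file_ids:
        kwargs["file_ids"] = file_ids
    if file_names:
        kwargs["file_names"] = file_names
    if thinking_level is not None:
        kwargs["thinking_level"] = thinking_level
    return kwargs


kwargs = build_extract_kwargs(
    {"type": "object", "properties": {}},
    "Extract the invoice number and total amount.",
    file_ids=["file_abc123"],
    thinking_level="fast",
)
print(sorted(kwargs))
```

The result would then be passed on as `client.sources.extract(**kwargs)`.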
Extraction with Nested Objects and Arrays
Python:
from graphor import Graphor
client = Graphor()
# Extract invoice with line items and address
result = client.sources.extract(
file_names=["invoice-2024.pdf"],
user_instruction="Extract invoice with line items and address details.",
output_schema={
"type": "object",
"properties": {
"invoice_number": {
"type": "string",
"description": "The unique invoice identifier"
},
"billing_address": {
"type": "object",
"description": "Billing address details",
"properties": {
"street": {"type": "string", "description": "Street address"},
"city": {"type": "string", "description": "City name"},
"zip_code": {"type": "string", "description": "Postal code"},
"country": {"type": "string", "description": "Country name"}
}
},
"tags": {
"type": "array",
"description": "Invoice tags or categories",
"items": {"type": "string"}
},
"line_items": {
"type": "array",
"description": "Invoice line items",
"items": {
"type": "object",
"properties": {
"description": {"type": "string", "description": "Item description"},
"quantity": {"type": "number", "description": "Item quantity"},
"unit_price": {"type": "number", "description": "Price per unit"},
"total": {"type": "number", "description": "Line item total"}
}
}
}
},
"required": ["invoice_number"]
}
)
output = result.structured_output
print(f"Invoice: {output['invoice_number']}")
print(f"City: {output['billing_address']['city']}")
print(f"Tags: {', '.join(output['tags'])}")
print("Line Items:")
for item in output["line_items"]:
    print(f" - {item['description']}: {item['quantity']} x ${item['unit_price']}")
TypeScript:
import Graphor from 'graphor';
const client = new Graphor();
// Extract invoice with line items and address
const result = await client.sources.extract({
file_names: ['invoice-2024.pdf'],
user_instruction: 'Extract invoice with line items and address details.',
output_schema: {
type: 'object',
properties: {
invoice_number: {
type: 'string',
description: 'The unique invoice identifier',
},
billing_address: {
type: 'object',
description: 'Billing address details',
properties: {
street: { type: 'string', description: 'Street address' },
city: { type: 'string', description: 'City name' },
zip_code: { type: 'string', description: 'Postal code' },
country: { type: 'string', description: 'Country name' },
},
},
tags: {
type: 'array',
description: 'Invoice tags or categories',
items: { type: 'string' },
},
line_items: {
type: 'array',
description: 'Invoice line items',
items: {
type: 'object',
properties: {
description: { type: 'string', description: 'Item description' },
quantity: { type: 'number', description: 'Item quantity' },
unit_price: { type: 'number', description: 'Price per unit' },
total: { type: 'number', description: 'Line item total' },
},
},
},
},
required: ['invoice_number'],
},
});
const output = result.structured_output as Record<string, any>;
console.log(`Invoice: ${output.invoice_number}`);
console.log(`City: ${output.billing_address.city}`);
console.log(`Tags: ${(output.tags as string[]).join(', ')}`);
console.log('Line Items:');
for (const item of output.line_items as Array<Record<string, unknown>>) {
console.log(` - ${item.description}: ${item.quantity} x $${item.unit_price}`);
}
Async Extraction
Python:
import asyncio
from graphor import AsyncGraphor

async def extract_invoice_data(file_name: str):
    client = AsyncGraphor()
    result = await client.sources.extract(
        file_names=[file_name],
        user_instruction="Extract invoice details.",
        output_schema={
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string", "description": "Invoice ID"},
                "total_amount": {"type": "number", "description": "Total due"},
                "invoice_date": {"type": "string", "description": "Invoice date"}
            },
            "required": ["invoice_number", "total_amount"]
        }
    )
    return result.structured_output

# Run the async function
data = asyncio.run(extract_invoice_data("invoice.pdf"))
print(f"Invoice: {data['invoice_number']}")
TypeScript:
import Graphor from 'graphor';
const client = new Graphor();
async function extractInvoiceData(fileName: string) {
const result = await client.sources.extract({
file_names: [fileName],
user_instruction: 'Extract invoice details.',
output_schema: {
type: 'object',
properties: {
invoice_number: { type: 'string', description: 'Invoice ID' },
total_amount: { type: 'number', description: 'Total due' },
invoice_date: { type: 'string', description: 'Invoice date' },
},
required: ['invoice_number', 'total_amount'],
},
});
return result.structured_output as Record<string, unknown>;
}
const data = await extractInvoiceData('invoice.pdf');
console.log(`Invoice: ${data.invoice_number}`);
Multi-File Extraction
Python:
from graphor import Graphor
client = Graphor()
# Extract data from multiple related files
result = client.sources.extract(
file_names=["contract-part1.pdf", "contract-part2.pdf"],
user_instruction="Extract key contract terms from both documents.",
output_schema={
"type": "object",
"properties": {
"contract_title": {"type": "string", "description": "Title of the contract"},
"effective_date": {"type": "string", "description": "Contract start date"},
"termination_date": {"type": "string", "description": "Contract end date"},
"parties": {
"type": "array",
"description": "Parties involved in the contract",
"items": {
"type": "object",
"properties": {
"name": {"type": "string", "description": "Party name"},
"role": {"type": "string", "description": "Role (e.g., Licensor, Licensee)"}
}
}
}
},
"required": ["contract_title", "parties"]
}
)
print(f"Contract: {result.structured_output['contract_title']}")
print(f"Files processed: {result.file_names}")
TypeScript:
import Graphor from 'graphor';
const client = new Graphor();
// Extract data from multiple related files
const result = await client.sources.extract({
file_names: ['contract-part1.pdf', 'contract-part2.pdf'],
user_instruction: 'Extract key contract terms from both documents.',
output_schema: {
type: 'object',
properties: {
contract_title: { type: 'string', description: 'Title of the contract' },
effective_date: { type: 'string', description: 'Contract start date' },
termination_date: { type: 'string', description: 'Contract end date' },
parties: {
type: 'array',
description: 'Parties involved in the contract',
items: {
type: 'object',
properties: {
name: { type: 'string', description: 'Party name' },
role: { type: 'string', description: 'Role (e.g., Licensor, Licensee)' },
},
},
},
},
required: ['contract_title', 'parties'],
},
});
const output = result.structured_output as Record<string, unknown>;
console.log(`Contract: ${output.contract_title}`);
console.log(`Files processed: ${result.file_names.join(', ')}`);
Error Handling
Python:
import graphor
from graphor import Graphor

client = Graphor()

try:
    result = client.sources.extract(
        file_names=["document.pdf"],
        user_instruction="Extract data from the document.",
        output_schema={
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Document title"}
            }
        }
    )
    print(f"Extracted: {result.structured_output}")
except graphor.NotFoundError as e:
    print(f"File not found: {e}")
except graphor.BadRequestError as e:
    print(f"Invalid schema or request: {e}")
except graphor.AuthenticationError as e:
    print(f"Invalid API key: {e}")
except graphor.RateLimitError as e:
    print(f"Rate limit exceeded. Please wait and retry: {e}")
except graphor.InternalServerError as e:
    print(f"Server error: {e}")
# Timeouts are caught before connection errors so this branch stays reachable
# if APITimeoutError is a subclass of APIConnectionError.
except graphor.APITimeoutError as e:
    print(f"Request timed out: {e}")
except graphor.APIConnectionError as e:
    print(f"Connection error: {e}")
TypeScript:
import Graphor from 'graphor';
const client = new Graphor();
try {
const result = await client.sources.extract({
file_names: ['document.pdf'],
user_instruction: 'Extract data from the document.',
output_schema: {
type: 'object',
properties: {
title: { type: 'string', description: 'Document title' },
},
},
});
console.log('Extracted:', result.structured_output);
} catch (err) {
if (err instanceof Graphor.NotFoundError) {
console.log(`File not found: ${err.message}`);
} else if (err instanceof Graphor.BadRequestError) {
console.log(`Invalid schema or request: ${err.message}`);
} else if (err instanceof Graphor.AuthenticationError) {
console.log(`Invalid API key: ${err.message}`);
} else if (err instanceof Graphor.RateLimitError) {
console.log(`Rate limit exceeded. Please wait and retry: ${err.message}`);
} else if (err instanceof Graphor.InternalServerError) {
console.log(`Server error: ${err.message}`);
} else if (err instanceof Graphor.APIConnectionError) {
console.log(`Connection error: ${err.message}`);
} else if (err instanceof Graphor.APIError) {
console.log(`API error (status ${err.status}): ${err.message}`);
} else {
throw err;
}
}
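The rate limit branch above suggests waiting and retrying. One common way to do that is exponential backoff, sketched below under stated assumptions: `call_with_retry` is a hypothetical helper, and `FakeRateLimitError` stands in for `graphor.RateLimitError` so the example runs without the SDK or network access.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


class FakeRateLimitError(Exception):
    """Stand-in for graphor.RateLimitError; not the SDK class."""


def call_with_retry(
    func: Callable[[], T],
    retry_on: tuple[type[BaseException], ...] = (FakeRateLimitError,),
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


# Simulated call that fails twice with a rate limit, then succeeds.
calls = {"count": 0}


def flaky_extract() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimitError("slow down")
    return "ok"


delays: list[float] = []
print(call_with_retry(flaky_extract, sleep=delays.append), delays)
```

With the real client, the same helper could be called as `call_with_retry(lambda: client.sources.extract(...), retry_on=(graphor.RateLimitError,))`.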
Schema Examples
Invoice Extraction
Python:
invoice_schema = {
"type": "object",
"properties": {
"invoice_number": {"type": "string", "description": "Unique invoice identifier"},
"invoice_date": {"type": "string", "description": "Invoice date (YYYY-MM-DD)"},
"due_date": {"type": "string", "description": "Payment due date (YYYY-MM-DD)"},
"vendor_name": {"type": "string", "description": "Company issuing the invoice"},
"customer_name": {"type": "string", "description": "Customer being billed"},
"subtotal": {"type": "number", "description": "Subtotal before tax"},
"tax_amount": {"type": "number", "description": "Tax amount"},
"total_amount": {"type": "number", "description": "Total amount due"}
},
"required": ["invoice_number", "total_amount"]
}
result = client.sources.extract(
file_names=["invoice.pdf"],
user_instruction="Extract all invoice details. Convert amounts to numbers without currency symbols.",
output_schema=invoice_schema
)
TypeScript:
const invoiceSchema = {
type: 'object',
properties: {
invoice_number: { type: 'string', description: 'Unique invoice identifier' },
invoice_date: { type: 'string', description: 'Invoice date (YYYY-MM-DD)' },
due_date: { type: 'string', description: 'Payment due date (YYYY-MM-DD)' },
vendor_name: { type: 'string', description: 'Company issuing the invoice' },
customer_name: { type: 'string', description: 'Customer being billed' },
subtotal: { type: 'number', description: 'Subtotal before tax' },
tax_amount: { type: 'number', description: 'Tax amount' },
total_amount: { type: 'number', description: 'Total amount due' },
},
required: ['invoice_number', 'total_amount'],
};
const result = await client.sources.extract({
file_names: ['invoice.pdf'],
user_instruction: 'Extract all invoice details. Convert amounts to numbers without currency symbols.',
output_schema: invoiceSchema,
});
Contract Analysis
Python:
contract_schema = {
"type": "object",
"properties": {
"contract_title": {"type": "string", "description": "Title or name of the contract"},
"effective_date": {"type": "string", "description": "When the contract becomes effective"},
"termination_date": {"type": "string", "description": "When the contract ends"},
"auto_renewal": {"type": "boolean", "description": "Whether contract auto-renews"},
"parties": {
"type": "array",
"description": "All parties involved in the contract",
"items": {
"type": "object",
"properties": {
"name": {"type": "string", "description": "Party name"},
"role": {"type": "string", "description": "Role (e.g., Licensor, Licensee)"},
"address": {"type": "string", "description": "Party address"}
}
}
},
"key_terms": {
"type": "object",
"description": "Key contract terms and conditions",
"properties": {
"payment_terms": {"type": "string", "description": "Payment conditions"},
"liability_cap": {"type": "number", "description": "Maximum liability amount"},
"notice_period_days": {"type": "integer", "description": "Notice period in days"}
}
}
},
"required": ["contract_title", "parties"]
}
result = client.sources.extract(
file_names=["contract.pdf"],
user_instruction="Extract key contract terms with all parties and obligations.",
output_schema=contract_schema
)
TypeScript:
const contractSchema = {
type: 'object',
properties: {
contract_title: { type: 'string', description: 'Title or name of the contract' },
effective_date: { type: 'string', description: 'When the contract becomes effective' },
termination_date: { type: 'string', description: 'When the contract ends' },
auto_renewal: { type: 'boolean', description: 'Whether contract auto-renews' },
parties: {
type: 'array',
description: 'All parties involved in the contract',
items: {
type: 'object',
properties: {
name: { type: 'string', description: 'Party name' },
role: { type: 'string', description: 'Role (e.g., Licensor, Licensee)' },
address: { type: 'string', description: 'Party address' },
},
},
},
key_terms: {
type: 'object',
description: 'Key contract terms and conditions',
properties: {
payment_terms: { type: 'string', description: 'Payment conditions' },
liability_cap: { type: 'number', description: 'Maximum liability amount' },
notice_period_days: { type: 'integer', description: 'Notice period in days' },
},
},
},
required: ['contract_title', 'parties'],
};
const result = await client.sources.extract({
file_names: ['contract.pdf'],
user_instruction: 'Extract key contract terms with all parties and obligations.',
output_schema: contractSchema,
});
Resume Parsing
Python:
resume_schema = {
"type": "object",
"properties": {
"full_name": {"type": "string", "description": "Candidate's full name"},
"email": {"type": "string", "description": "Email address"},
"phone": {"type": "string", "description": "Phone number"},
"years_experience": {"type": "number", "description": "Total years of experience"},
"skills": {
"type": "array",
"description": "List of technical and soft skills",
"items": {"type": "string"}
},
"work_experience": {
"type": "array",
"description": "Work history",
"items": {
"type": "object",
"properties": {
"company": {"type": "string", "description": "Company name"},
"title": {"type": "string", "description": "Job title"},
"start_date": {"type": "string", "description": "Start date"},
"end_date": {"type": "string", "description": "End date (or 'current')"}
}
}
},
"education": {
"type": "array",
"description": "Educational background",
"items": {
"type": "object",
"properties": {
"institution": {"type": "string", "description": "School or university name"},
"degree": {"type": "string", "description": "Degree obtained"},
"graduation_year": {"type": "integer", "description": "Year of graduation"}
}
}
}
},
"required": ["full_name"]
}
result = client.sources.extract(
file_names=["resume.pdf"],
user_instruction="Extract complete candidate information including work history and education.",
output_schema=resume_schema
)
TypeScript:
const resumeSchema = {
type: 'object',
properties: {
full_name: { type: 'string', description: "Candidate's full name" },
email: { type: 'string', description: 'Email address' },
phone: { type: 'string', description: 'Phone number' },
years_experience: { type: 'number', description: 'Total years of experience' },
skills: {
type: 'array',
description: 'List of technical and soft skills',
items: { type: 'string' },
},
work_experience: {
type: 'array',
description: 'Work history',
items: {
type: 'object',
properties: {
company: { type: 'string', description: 'Company name' },
title: { type: 'string', description: 'Job title' },
start_date: { type: 'string', description: 'Start date' },
end_date: { type: 'string', description: "End date (or 'current')" },
},
},
},
education: {
type: 'array',
description: 'Educational background',
items: {
type: 'object',
properties: {
institution: { type: 'string', description: 'School or university name' },
degree: { type: 'string', description: 'Degree obtained' },
graduation_year: { type: 'integer', description: 'Year of graduation' },
},
},
},
},
required: ['full_name'],
};
const result = await client.sources.extract({
file_names: ['resume.pdf'],
user_instruction: 'Extract complete candidate information including work history and education.',
output_schema: resumeSchema,
});
Product Catalog
Python:
product_schema = {
"type": "object",
"properties": {
"product_name": {"type": "string", "description": "Product name"},
"sku": {"type": "string", "description": "Product SKU"},
"base_price": {"type": "number", "description": "Base price"},
"in_stock": {"type": "boolean", "description": "Whether product is in stock"},
"specifications": {
"type": "object",
"description": "Product specifications",
"properties": {
"weight": {"type": "number", "description": "Weight in kg"},
"dimensions": {"type": "string", "description": "Dimensions (LxWxH)"},
"material": {"type": "string", "description": "Main material"}
}
},
"categories": {
"type": "array",
"description": "Product categories",
"items": {"type": "string"}
},
"variants": {
"type": "array",
"description": "Product variants",
"items": {
"type": "object",
"properties": {
"color": {"type": "string", "description": "Variant color"},
"size": {"type": "string", "description": "Variant size"},
"price_modifier": {"type": "number", "description": "Price adjustment"}
}
}
}
},
"required": ["product_name", "sku"]
}
result = client.sources.extract(
file_names=["catalog.pdf"],
user_instruction="Extract all products with their specifications and variants.",
output_schema=product_schema
)
TypeScript:
const productSchema = {
type: 'object',
properties: {
product_name: { type: 'string', description: 'Product name' },
sku: { type: 'string', description: 'Product SKU' },
base_price: { type: 'number', description: 'Base price' },
in_stock: { type: 'boolean', description: 'Whether product is in stock' },
specifications: {
type: 'object',
description: 'Product specifications',
properties: {
weight: { type: 'number', description: 'Weight in kg' },
dimensions: { type: 'string', description: 'Dimensions (LxWxH)' },
material: { type: 'string', description: 'Main material' },
},
},
categories: {
type: 'array',
description: 'Product categories',
items: { type: 'string' },
},
variants: {
type: 'array',
description: 'Product variants',
items: {
type: 'object',
properties: {
color: { type: 'string', description: 'Variant color' },
size: { type: 'string', description: 'Variant size' },
price_modifier: { type: 'number', description: 'Price adjustment' },
},
},
},
},
required: ['product_name', 'sku'],
};
const result = await client.sources.extract({
file_names: ['catalog.pdf'],
user_instruction: 'Extract all products with their specifications and variants.',
output_schema: productSchema,
});
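The schemas above repeat the same `type`/`description`/`properties` shapes many times. If that becomes unwieldy, a few small builder functions can generate the same dictionaries. This is only a sketch of one possible convenience layer; `field`, `obj`, and `arr` are hypothetical helpers, not SDK functions, and they produce plain dicts that can be passed as `output_schema`.

```python
from __future__ import annotations

from typing import Any, Iterable, Mapping


def field(type_: str, description: str) -> dict[str, Any]:
    """A leaf property such as a string or number."""
    return {"type": type_, "description": description}


def obj(
    properties: Mapping[str, dict[str, Any]],
    required: Iterable[str] = (),
    description: str | None = None,
) -> dict[str, Any]:
    """An object schema with optional description and required list."""
    schema: dict[str, Any] = {"type": "object", "properties": dict(properties)}
    if description:
        schema["description"] = description
    required_list = list(required)
    if required_list:
        schema["required"] = required_list
    return schema


def arr(items: dict[str, Any], description: str | None = None) -> dict[str, Any]:
    """An array schema whose elements follow the given items schema."""
    schema: dict[str, Any] = {"type": "array", "items": items}
    if description:
        schema["description"] = description
    return schema


line_item = obj({
    "description": field("string", "Item description"),
    "quantity": field("number", "Item quantity"),
})
invoice_schema = obj(
    {
        "invoice_number": field("string", "Unique invoice identifier"),
        "total_amount": field("number", "Total amount due"),
        "line_items": arr(line_item, "Invoice line items"),
    },
    required=["invoice_number", "total_amount"],
)
print(invoice_schema["required"])
```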
Advanced Examples
Document Extraction Pipeline
Build a complete extraction pipeline for processing multiple documents:
Python:
from graphor import Graphor
import graphor
from typing import Any
from dataclasses import dataclass

@dataclass
class ExtractionResult:
    file_name: str
    data: dict[str, Any] | None
    error: str | None = None

class DocumentExtractor:
    def __init__(self, api_key: str | None = None):
        self.client = Graphor(api_key=api_key) if api_key else Graphor()

    def extract_invoices(self, file_names: list[str]) -> list[ExtractionResult]:
        """Extract invoice data from multiple files."""
        results = []
        invoice_schema = {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string", "description": "Invoice ID"},
                "vendor_name": {"type": "string", "description": "Vendor name"},
                "total_amount": {"type": "number", "description": "Total amount"},
                "invoice_date": {"type": "string", "description": "Date (YYYY-MM-DD)"},
                "line_items": {
                    "type": "array",
                    "description": "Line items",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "amount": {"type": "number"}
                        }
                    }
                }
            },
            "required": ["invoice_number", "total_amount"]
        }
        for file_name in file_names:
            try:
                result = self.client.sources.extract(
                    file_names=[file_name],
                    user_instruction="Extract invoice information. Use YYYY-MM-DD for dates.",
                    output_schema=invoice_schema
                )
                results.append(ExtractionResult(
                    file_name=file_name,
                    data=result.structured_output
                ))
                print(f"OK - Extracted: {file_name}")
            except graphor.APIStatusError as e:
                results.append(ExtractionResult(
                    file_name=file_name,
                    data=None,
                    error=str(e)
                ))
                print(f"FAIL - {file_name}: {e}")
        return results

    def extract_with_custom_schema(
        self,
        file_names: list[str],
        schema: dict,
        instruction: str
    ) -> dict[str, Any] | None:
        """Extract data using a custom schema."""
        try:
            result = self.client.sources.extract(
                file_names=file_names,
                user_instruction=instruction,
                output_schema=schema
            )
            return result.structured_output
        except graphor.APIStatusError as e:
            print(f"Extraction error: {e}")
            return None

# Usage
extractor = DocumentExtractor()
# Process multiple invoices
invoices = extractor.extract_invoices([
"invoice-001.pdf",
"invoice-002.pdf",
"invoice-003.pdf"
])
# Calculate totals
total = sum(
inv.data["total_amount"]
for inv in invoices
if inv.data is not None
)
print(f"Total amount: ${total:,.2f}")
TypeScript:
import Graphor from 'graphor';
interface ExtractionResult {
fileName: string;
data: Record<string, unknown> | null;
error?: string;
}
class DocumentExtractor {
private client: Graphor;
constructor(apiKey?: string) {
this.client = apiKey ? new Graphor({ apiKey }) : new Graphor();
}
async extractInvoices(fileNames: string[]): Promise<ExtractionResult[]> {
const results: ExtractionResult[] = [];
const invoiceSchema = {
type: 'object',
properties: {
invoice_number: { type: 'string', description: 'Invoice ID' },
vendor_name: { type: 'string', description: 'Vendor name' },
total_amount: { type: 'number', description: 'Total amount' },
invoice_date: { type: 'string', description: 'Date (YYYY-MM-DD)' },
line_items: {
type: 'array',
description: 'Line items',
items: {
type: 'object',
properties: {
description: { type: 'string' },
quantity: { type: 'number' },
amount: { type: 'number' },
},
},
},
},
required: ['invoice_number', 'total_amount'],
};
for (const fileName of fileNames) {
try {
const result = await this.client.sources.extract({
file_names: [fileName],
user_instruction: 'Extract invoice information. Use YYYY-MM-DD for dates.',
output_schema: invoiceSchema,
});
results.push({
fileName,
data: result.structured_output as Record<string, unknown>,
});
console.log(`OK - Extracted: ${fileName}`);
} catch (err) {
results.push({
fileName,
data: null,
error: err instanceof Graphor.APIError ? err.message : String(err),
});
console.log(`FAIL - ${fileName}: ${err}`);
}
}
return results;
}
async extractWithCustomSchema(
fileNames: string[],
schema: Record<string, unknown>,
instruction: string,
): Promise<Record<string, unknown> | null> {
try {
const result = await this.client.sources.extract({
file_names: fileNames,
user_instruction: instruction,
output_schema: schema,
});
return result.structured_output as Record<string, unknown>;
} catch (err) {
console.log(`Extraction error: ${err}`);
return null;
}
}
}
// Usage
const extractor = new DocumentExtractor();
// Process multiple invoices
const invoices = await extractor.extractInvoices([
'invoice-001.pdf',
'invoice-002.pdf',
'invoice-003.pdf',
]);
// Calculate totals
const total = invoices
.filter((inv) => inv.data !== null)
.reduce((sum, inv) => sum + (inv.data!.total_amount as number), 0);
console.log(`Total amount: $${total.toLocaleString('en-US', { minimumFractionDigits: 2 })}`);
Async Batch Extraction
Process many documents efficiently with async:
- Python
- TypeScript
import asyncio
from graphor import AsyncGraphor
import graphor

async def extract_single(
    client: AsyncGraphor,
    file_name: str,
    schema: dict,
    instruction: str
):
    """Extract data from a single file."""
    try:
        result = await client.sources.extract(
            file_names=[file_name],
            user_instruction=instruction,
            output_schema=schema
        )
        return {
            "file_name": file_name,
            "status": "success",
            "data": result.structured_output
        }
    except graphor.APIStatusError as e:
        return {
            "file_name": file_name,
            "status": "failed",
            "error": str(e)
        }

async def batch_extract(
    file_names: list[str],
    schema: dict,
    instruction: str,
    max_concurrent: int = 3
):
    """Extract data from multiple files with controlled concurrency."""
    client = AsyncGraphor(timeout=120.0)

    # Use semaphore to limit concurrent requests
    semaphore = asyncio.Semaphore(max_concurrent)

    async def extract_with_semaphore(file_name: str):
        async with semaphore:
            print(f"Processing: {file_name}...")
            result = await extract_single(client, file_name, schema, instruction)
            status_icon = "OK" if result["status"] == "success" else "FAIL"
            print(f"{status_icon} - {file_name}: {result['status']}")
            return result

    tasks = [extract_with_semaphore(f) for f in file_names]
    results = await asyncio.gather(*tasks)

    successful = [r for r in results if r["status"] == "success"]
    failed = [r for r in results if r["status"] == "failed"]
    print(f"\nSummary: {len(successful)} successful, {len(failed)} failed")
    return results

# Usage
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Document title"},
        "summary": {"type": "string", "description": "Brief summary"}
    }
}
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf", "doc5.pdf"]
results = asyncio.run(batch_extract(
    files,
    schema,
    "Extract the title and summary from this document.",
    max_concurrent=3
))
import Graphor from 'graphor';

const client = new Graphor({ timeout: 120 * 1000 });

interface BatchResult {
  fileName: string;
  status: 'success' | 'failed';
  data?: Record<string, unknown>;
  error?: string;
}

async function extractSingle(
  fileName: string,
  schema: Record<string, unknown>,
  instruction: string,
): Promise<BatchResult> {
  try {
    const result = await client.sources.extract({
      file_names: [fileName],
      user_instruction: instruction,
      output_schema: schema,
    });
    return {
      fileName,
      status: 'success',
      data: result.structured_output as Record<string, unknown>,
    };
  } catch (err) {
    return {
      fileName,
      status: 'failed',
      error: err instanceof Graphor.APIError ? err.message : String(err),
    };
  }
}

async function batchExtract(
  fileNames: string[],
  schema: Record<string, unknown>,
  instruction: string,
  maxConcurrent = 3,
): Promise<BatchResult[]> {
  const results: BatchResult[] = [];

  // Process in batches to control concurrency
  for (let i = 0; i < fileNames.length; i += maxConcurrent) {
    const batch = fileNames.slice(i, i + maxConcurrent);
    console.log(`Processing batch: ${batch.join(', ')}...`);
    const batchResults = await Promise.all(
      batch.map((fileName) => extractSingle(fileName, schema, instruction)),
    );
    for (const result of batchResults) {
      const icon = result.status === 'success' ? 'OK' : 'FAIL';
      console.log(`${icon} - ${result.fileName}: ${result.status}`);
    }
    results.push(...batchResults);
  }

  const successful = results.filter((r) => r.status === 'success');
  const failed = results.filter((r) => r.status === 'failed');
  console.log(`\nSummary: ${successful.length} successful, ${failed.length} failed`);
  return results;
}

// Usage
const schema = {
  type: 'object',
  properties: {
    title: { type: 'string', description: 'Document title' },
    summary: { type: 'string', description: 'Brief summary' },
  },
};
const files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf', 'doc4.pdf', 'doc5.pdf'];
const results = await batchExtract(
  files,
  schema,
  'Extract the title and summary from this document.',
  3,
);
Extraction with Validation
Add validation to ensure extracted data meets your requirements:
- Python
- TypeScript
from graphor import Graphor
import graphor
from typing import Any, Callable

class ValidatedExtractor:
    def __init__(self):
        self.client = Graphor()

    def extract_and_validate(
        self,
        file_names: list[str],
        schema: dict,
        instruction: str,
        validators: dict[str, Callable[[Any], bool]] | None = None
    ) -> dict[str, Any]:
        """Extract data and validate the results."""
        result = self.client.sources.extract(
            file_names=file_names,
            user_instruction=instruction,
            output_schema=schema
        )
        data = result.structured_output

        if validators and data:
            validation_errors = []
            # Validators run only for fields present in the extracted data
            for field, validator in validators.items():
                if field in data:
                    try:
                        if not validator(data[field]):
                            validation_errors.append(f"Validation failed for '{field}'")
                    except Exception as e:
                        validation_errors.append(f"Validator error for '{field}': {e}")
            if validation_errors:
                return {
                    "success": False,
                    "data": data,
                    "errors": validation_errors
                }

        return {
            "success": True,
            "data": data,
            "errors": []
        }

# Usage
extractor = ValidatedExtractor()

# Define validators
validators = {
    "invoice_number": lambda x: x and len(x) > 0,
    "total_amount": lambda x: x and x > 0,
    "invoice_date": lambda x: x and len(x) == 10,  # YYYY-MM-DD format
}

result = extractor.extract_and_validate(
    file_names=["invoice.pdf"],
    schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID"},
            "total_amount": {"type": "number", "description": "Total amount"},
            "invoice_date": {"type": "string", "description": "Date YYYY-MM-DD"}
        },
        "required": ["invoice_number", "total_amount"]
    },
    instruction="Extract invoice details. Use YYYY-MM-DD for dates.",
    validators=validators
)

if result["success"]:
    print(f"Valid extraction: {result['data']}")
else:
    print(f"Validation errors: {result['errors']}")
import Graphor from 'graphor';

type Validator = (value: unknown) => boolean;

interface ValidationResult {
  success: boolean;
  data: Record<string, unknown> | null;
  errors: string[];
}

class ValidatedExtractor {
  private client: Graphor;

  constructor() {
    this.client = new Graphor();
  }

  async extractAndValidate(
    fileNames: string[],
    schema: Record<string, unknown>,
    instruction: string,
    validators?: Record<string, Validator>,
  ): Promise<ValidationResult> {
    const result = await this.client.sources.extract({
      file_names: fileNames,
      user_instruction: instruction,
      output_schema: schema,
    });
    const data = result.structured_output as Record<string, unknown> | null;

    if (validators && data) {
      const validationErrors: string[] = [];
      for (const [field, validator] of Object.entries(validators)) {
        if (field in data) {
          try {
            if (!validator(data[field])) {
              validationErrors.push(`Validation failed for '${field}'`);
            }
          } catch (e) {
            validationErrors.push(`Validator error for '${field}': ${e}`);
          }
        }
      }
      if (validationErrors.length > 0) {
        return { success: false, data, errors: validationErrors };
      }
    }

    return { success: true, data, errors: [] };
  }
}

// Usage
const extractor = new ValidatedExtractor();

// Define validators
const validators: Record<string, Validator> = {
  invoice_number: (x) => typeof x === 'string' && x.length > 0,
  total_amount: (x) => typeof x === 'number' && x > 0,
  invoice_date: (x) => typeof x === 'string' && x.length === 10, // YYYY-MM-DD
};

const result = await extractor.extractAndValidate(
  ['invoice.pdf'],
  {
    type: 'object',
    properties: {
      invoice_number: { type: 'string', description: 'Invoice ID' },
      total_amount: { type: 'number', description: 'Total amount' },
      invoice_date: { type: 'string', description: 'Date YYYY-MM-DD' },
    },
    required: ['invoice_number', 'total_amount'],
  },
  'Extract invoice details. Use YYYY-MM-DD for dates.',
  validators,
);

if (result.success) {
  console.log('Valid extraction:', result.data);
} else {
  console.log('Validation errors:', result.errors);
}
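The length check on invoice_date above accepts any 10-character string, including values that are not dates. A stricter pair of validators might look like the sketch below. It uses only the standard library; `is_iso_date` and `is_positive_number` are hypothetical helper names, not part of the SDK.

```python
from datetime import datetime


def is_iso_date(value: object) -> bool:
    """Return True only for strings that parse as a real YYYY-MM-DD date."""
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts unpadded input such as "2024-1-5";
    # requiring 10 characters enforces the zero-padded form.
    return len(value) == 10


def is_positive_number(value: object) -> bool:
    """Return True for ints or floats greater than zero (bools are rejected)."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return value > 0


# Drop-in replacement for the validators dict used above
validators = {
    "invoice_date": is_iso_date,
    "total_amount": is_positive_number,
}
```

Both helpers return plain booleans, so they fit the same validators mapping without further changes.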
Debugging with Raw JSON
Use the raw_json field to debug extraction issues:
- Python
- TypeScript
from graphor import Graphor
import json

client = Graphor()

result = client.sources.extract(
    file_names=["document.pdf"],
    user_instruction="Extract document information.",
    output_schema={
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Document title"},
            "author": {"type": "string", "description": "Author name"}
        }
    }
)

# Compare raw vs structured output
print("Raw JSON from model:")
print(result.raw_json)
print("\nStructured output (validated):")
print(json.dumps(result.structured_output, indent=2))

# Check for differences (useful for debugging)
if result.raw_json:
    raw_parsed = json.loads(result.raw_json)
    if raw_parsed != result.structured_output:
        print("\nNote: Structured output differs from raw JSON (post-validation)")
import Graphor from 'graphor';

const client = new Graphor();

const result = await client.sources.extract({
  file_names: ['document.pdf'],
  user_instruction: 'Extract document information.',
  output_schema: {
    type: 'object',
    properties: {
      title: { type: 'string', description: 'Document title' },
      author: { type: 'string', description: 'Author name' },
    },
  },
});

// Compare raw vs structured output
console.log('Raw JSON from model:');
console.log(result.raw_json);
console.log('\nStructured output (validated):');
console.log(JSON.stringify(result.structured_output, null, 2));

// Check for differences (useful for debugging)
if (result.raw_json) {
  const rawParsed = JSON.parse(result.raw_json);
  if (JSON.stringify(rawParsed) !== JSON.stringify(result.structured_output)) {
    console.log('\nNote: Structured output differs from raw JSON (post-validation)');
  }
}
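When the two outputs differ, it helps to see which fields changed rather than only that something changed. The following is a minimal sketch of such a comparison, assuming raw_json parses to a JSON object; `diff_top_level` and the `ABSENT` marker are hypothetical names introduced here for illustration.

```python
import json
from typing import Any

# Marker for a key that exists on one side only (distinct from an explicit null)
ABSENT = "<absent>"


def diff_top_level(raw_json: str, structured: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Map each differing top-level key to (raw value, structured value)."""
    raw = json.loads(raw_json)
    changes: dict[str, tuple[Any, Any]] = {}
    for key in sorted(set(raw) | set(structured)):
        before = raw.get(key, ABSENT)
        after = structured.get(key, ABSENT)
        if before != after:
            changes[key] = (before, after)
    return changes


# Example with hand-written values (not real API output)
print(diff_top_level('{"title": "Report", "author": 1}', {"title": "Report", "author": "1"}))
```

In practice the arguments would be result.raw_json and result.structured_output from the call above.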
Best Practices
Schema Design
- Use clear descriptions: Detailed property descriptions improve extraction accuracy
- Match types to data: Use number for amounts, string for dates, boolean for flags
- Keep nesting shallow: Avoid deeply nested structures (2-3 levels maximum)
- Define required fields: Use the required array to specify mandatory properties
- Use arrays for lists: Extract repeating items using arrays with item schemas
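The schema below is one way to apply these guidelines together. The field names reuse the invoice examples from this page, except `is_paid`, which is a hypothetical flag added to show a boolean. `nesting_depth` is likewise a hypothetical helper for checking the "keep nesting shallow" guideline locally; it is not part of the SDK.

```python
invoice_schema = {
    "type": "object",
    "properties": {
        # Clear descriptions, with types matched to the data
        "invoice_number": {"type": "string", "description": "Invoice identifier as printed on the document"},
        "invoice_date": {"type": "string", "description": "Issue date in YYYY-MM-DD format"},
        "total_amount": {"type": "number", "description": "Grand total of the invoice"},
        "is_paid": {"type": "boolean", "description": "True if the invoice is marked as paid"},
        # Repeating items modeled as an array with an item schema
        "line_items": {
            "type": "array",
            "description": "One entry per billed line",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string", "description": "Item description"},
                    "quantity": {"type": "number", "description": "Units billed"},
                    "amount": {"type": "number", "description": "Line total"},
                },
            },
        },
    },
    # Only fields that must always be present
    "required": ["invoice_number", "total_amount"],
}


def nesting_depth(schema: dict) -> int:
    """Count nested object levels, starting at 1 for a root object."""
    children = list(schema.get("properties", {}).values())
    if "items" in schema:
        children.append(schema["items"])
    deepest = max((nesting_depth(child) for child in children), default=0)
    return deepest + (1 if schema.get("type") == "object" else 0)


print(nesting_depth(invoice_schema))
```

Here the root object and the line item object give a depth of 2, which stays within the suggested 2-3 levels.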
Instruction Writing
- Be specific: Include format preferences (e.g., “Use YYYY-MM-DD for dates”)
- Handle edge cases: Specify what to do for missing data (e.g., “Use null if not found”)
- Provide context: Explain what the document contains and what you need
- Avoid ambiguity: Use clear, unambiguous language
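An instruction such as “Use null if not found” is easier to honor when the schema also permits null for the fields concerned. The sketch below shows one way to derive such a schema, assuming null unions of the form "type": ["string", "null"] are accepted; `make_optional_fields_nullable` is a hypothetical helper, not an SDK function.

```python
import copy
from typing import Any


def make_optional_fields_nullable(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy where each top-level property not listed in `required` also accepts null."""
    result = copy.deepcopy(schema)  # leave the caller's schema untouched
    required = set(result.get("required", []))
    for name, prop in result.get("properties", {}).items():
        if name in required:
            continue
        declared = prop.get("type")
        if isinstance(declared, str) and declared != "null":
            prop["type"] = [declared, "null"]
        elif isinstance(declared, list) and "null" not in declared:
            prop["type"] = [*declared, "null"]
    return result


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "invoice_date": {"type": "string", "description": "Date YYYY-MM-DD"},
    },
    "required": ["invoice_number"],
}
nullable_schema = make_optional_fields_nullable(schema)
print(nullable_schema["properties"]["invoice_date"]["type"])
```

The resulting schema can then be passed as output_schema alongside an instruction that states how missing values should be reported.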
Performance
- Batch related files: Process related documents together for context
- Use appropriate timeouts: Extraction can take time for complex documents
- Implement retries: Handle transient errors with the SDK’s retry mechanism
- Cache results: Store extraction results to avoid reprocessing
- Python
- TypeScript
from graphor import Graphor

# Configure retries for reliability
client = Graphor(max_retries=3, timeout=120.0)

# Or per-request (`schema` is a JSON Schema dict, as in the examples above)
result = client.with_options(max_retries=5, timeout=180.0).sources.extract(
    file_names=["large-document.pdf"],
    user_instruction="Extract all data.",
    output_schema=schema
)
import Graphor from 'graphor';

// Configure retries for reliability
const client = new Graphor({
  maxRetries: 3,
  timeout: 120 * 1000, // 2 minutes (milliseconds)
});

// Or per-request (`schema` is a JSON Schema object, as in the examples above)
const result = await client.sources.extract(
  {
    file_names: ['large-document.pdf'],
    user_instruction: 'Extract all data.',
    output_schema: schema,
  },
  { maxRetries: 5, timeout: 180 * 1000 },
);
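The “Cache results” recommendation above can be sketched as a small in-memory cache keyed on the file name, schema, and instruction. This is an illustrative sketch only: `ExtractionCache`, `cache_key`, and the `extract_fn` callback are hypothetical names, and a production setup would likely persist results to disk or a database.

```python
import hashlib
import json
from typing import Any, Callable


def cache_key(file_name: str, schema: dict[str, Any], instruction: str) -> str:
    """Build a stable key; sort_keys makes it independent of dict ordering."""
    payload = json.dumps(
        {"file": file_name, "schema": schema, "instruction": instruction},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExtractionCache:
    """Calls extract_fn only for (file, schema, instruction) combinations not seen before."""

    def __init__(self, extract_fn: Callable[[str, dict[str, Any], str], Any]) -> None:
        # extract_fn is supplied by the caller; it could wrap
        # client.sources.extract(...) and return result.structured_output.
        self._extract_fn = extract_fn
        self._store: dict[str, Any] = {}

    def get(self, file_name: str, schema: dict[str, Any], instruction: str) -> Any:
        key = cache_key(file_name, schema, instruction)
        if key not in self._store:
            self._store[key] = self._extract_fn(file_name, schema, instruction)
        return self._store[key]
```

Including the schema and instruction in the key means that changing either one triggers a fresh extraction instead of returning a stale result.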
Error Reference
| Error Type | Status Code | Description |
|---|---|---|
| BadRequestError | 400 | Invalid parameters or malformed schema |
| AuthenticationError | 401 | Invalid or missing API key |
| PermissionDeniedError | 403 | Access denied to the specified project |
| NotFoundError | 404 | File not found or no parsing history |
| RateLimitError | 429 | Too many requests, please retry after waiting |
| InternalServerError | ≥500 | Server-side processing error |
| APIConnectionError | N/A | Network connectivity issues |
| APITimeoutError | N/A | Request timed out |
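For application-level handling on top of the SDK's built-in retries, the table can be reduced to a simple rule. The sketch below treats 429, 5xx, and errors without a status code (connection, timeout) as worth retrying, and everything else as a caller error. This classification is an assumption based on the descriptions in the table, not documented SDK behavior; `is_retryable` and `backoff_delay` are hypothetical helpers.

```python
from typing import Optional


def is_retryable(status_code: Optional[int]) -> bool:
    """None stands for errors without a status code (connection or timeout)."""
    if status_code is None:
        return True
    # 429 asks the caller to wait and retry; 5xx are server-side errors
    return status_code == 429 or status_code >= 500


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff in seconds: base * 2**attempt, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


for code in (400, 404, 429, 503, None):
    print(code, is_retryable(code))
```

Errors such as 400, 401, 403, and 404 point at the request itself, so repeating the same call is unlikely to help until the parameters, key, or file reference are fixed.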
Troubleshooting
File not found errors
Causes: File doesn’t exist, hasn’t been processed, or wrong file name
Solutions:
- Verify the exact file name (case-sensitive)
- Ensure the file has been uploaded and processed
- Use client.sources.list() to check available files
- Python
- TypeScript
# List all sources to find correct file names
sources = client.sources.list()
for source in sources:
    if source.status == "Completed":
        print(source.file_name)
// List all sources to find correct file names
const sources = await client.sources.list();
for (const source of sources) {
  if (source.status === 'Completed') {
    console.log(source.file_name);
  }
}
Invalid schema errors
Causes: Malformed JSON Schema or unsupported features
Solutions:
- Validate your schema against the JSON Schema spec
- Avoid unsupported features ($ref, oneOf, anyOf)
- Ensure all type values are valid
- Check that properties is an object, not an array
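The checks above can be run locally before sending a request. The following sketch walks a schema and reports the problems listed: unsupported keywords, invalid type values, and a properties value that is not an object. `find_schema_problems` is a hypothetical helper, and the set of valid types is the standard JSON Schema primitive list; whether the API accepts every one of them is an assumption.

```python
from typing import Any

UNSUPPORTED_KEYWORDS = {"$ref", "oneOf", "anyOf"}
# Standard JSON Schema primitive types
VALID_TYPES = {"string", "number", "integer", "boolean", "object", "array", "null"}


def find_schema_problems(schema: Any, path: str = "$") -> list[str]:
    """Return human-readable problems found in a schema, recursing into properties and items."""
    if not isinstance(schema, dict):
        return [f"{path}: schema must be an object"]
    problems: list[str] = []
    for keyword in sorted(UNSUPPORTED_KEYWORDS.intersection(schema)):
        problems.append(f"{path}: unsupported keyword '{keyword}'")
    declared = schema.get("type")
    # "type" may be a single name or a list such as ["string", "null"]
    for name in declared if isinstance(declared, list) else [declared]:
        if name is not None and (not isinstance(name, str) or name not in VALID_TYPES):
            problems.append(f"{path}: invalid type '{name}'")
    properties = schema.get("properties")
    if properties is not None:
        if not isinstance(properties, dict):
            problems.append(
                f"{path}.properties: must be an object, not {type(properties).__name__}"
            )
        else:
            for name, sub in properties.items():
                problems.extend(find_schema_problems(sub, f"{path}.properties.{name}"))
    if "items" in schema:
        problems.extend(find_schema_problems(schema["items"], f"{path}.items"))
    return problems


print(find_schema_problems({"type": "object", "properties": [{"name": "title"}]}))
```

An empty list means none of these particular problems were found; it does not guarantee that the server will accept the schema.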
Timeout errors
Causes: Large documents, complex schemas, or server load
Solutions:
- Increase the timeout value
- Simplify the schema (fewer fields, shallower nesting)
- Process smaller document batches
- Python
- TypeScript
client = Graphor(timeout=180.0)  # 3 minutes
const client = new Graphor({ timeout: 180 * 1000 }); // 3 minutes
Missing or incorrect data
Causes: Vague instructions, poor document quality, or inappropriate schema
Solutions:
- Make instructions more specific
- Reprocess the document with a better partition method
- Add more context in property descriptions
- Check the raw_json field for debugging
Partial extraction
Causes: Document doesn’t contain all expected information
Solutions:
- Make non-essential fields optional (remove from required)
- Use null unions: "type": ["string", "null"]
- Add instructions for handling missing data
Next Steps
After extracting data from your documents:
- Upload Source: Upload new documents for extraction
- Parse Source: Reprocess documents for better extraction quality
- Chat with Documents: Ask questions about your documents
- Data Extraction Guide: Learn schema design and extraction best practices

