## Overview
The Data Extraction API allows you to extract structured information from your documents using standard JSON Schema and natural language instructions. The extraction uses the active parsing version of the specified document.
## Endpoint
POST https://sources.graphorlm.com/run-extraction
## Authentication
Include your API token in the Authorization header:
Authorization: Bearer YOUR_API_TOKEN
## Request
| Header | Value | Required |
|---|---|---|
| Authorization | Bearer YOUR_API_TOKEN | Yes |
| Content-Type | application/json | Yes |
### Body Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| file_ids | string[] | No* | List of file IDs to extract from (preferred) |
| file_names | string[] | No* | List of file names to extract from (deprecated; use file_ids) |
| user_instruction | string | Yes | Natural language instructions to guide the extraction |
| output_schema | object | Yes | JSON Schema defining the structure of the extracted data |
| thinking_level | string | No | Controls model and thinking configuration. Values: "fast", "balanced", "accurate" (default). See Thinking Level for details. |

*At least one of file_ids or file_names must be provided. file_ids is preferred.
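The parameter rules above can be checked locally before a request is sent. The following is a minimal sketch, not part of the API or any official client: the helper name `build_extraction_payload` and its error messages are hypothetical, and it only enforces what the table states (one of `file_ids` or `file_names`, the two required fields, and the three allowed `thinking_level` values).

```python
import json

# Values listed for thinking_level in the parameter table.
ALLOWED_THINKING_LEVELS = {"fast", "balanced", "accurate"}


def build_extraction_payload(user_instruction, output_schema,
                             file_ids=None, file_names=None,
                             thinking_level=None):
    """Assemble a request body and validate it against the documented rules.

    Hypothetical helper for illustration; the server remains the source of truth.
    """
    if not file_ids and not file_names:
        raise ValueError("Provide at least one of file_ids or file_names")
    if not user_instruction:
        raise ValueError("user_instruction is required")
    if not isinstance(output_schema, dict):
        raise ValueError("output_schema must be a JSON Schema object (dict)")
    if thinking_level is not None and thinking_level not in ALLOWED_THINKING_LEVELS:
        raise ValueError(
            f"thinking_level must be one of {sorted(ALLOWED_THINKING_LEVELS)}"
        )

    payload = {"user_instruction": user_instruction, "output_schema": output_schema}
    if file_ids:
        payload["file_ids"] = list(file_ids)
    if file_names:
        payload["file_names"] = list(file_names)
    # Omit thinking_level when unset so the documented default ("accurate") applies.
    if thinking_level is not None:
        payload["thinking_level"] = thinking_level
    return payload


payload = build_extraction_payload(
    "Extract the invoice number.",
    {"type": "object", "properties": {"invoice_number": {"type": "string"}}},
    file_ids=["file_abc123"],
    thinking_level="fast",
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized and sent as the request body. Failing early on a missing file reference avoids a round trip that would presumably end in a 400 response.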
### Output Schema
The output_schema parameter accepts a standard JSON Schema object. This is the same format used by the Chat API for structured outputs.
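Because the schema is ordinary JSON Schema, it can be inspected with plain Python before it is submitted. The sketch below walks a schema and reports properties without a description, `required` names that are not defined, and nesting beyond a chosen depth. The function name `lint_schema`, the default `max_depth` of 3, and the warning texts are illustrative assumptions; the API itself does not enforce any of them.

```python
def lint_schema(schema, max_depth=3):
    """Return a list of human-readable warnings for an output_schema dict.

    Hypothetical helper: the checks and the depth threshold are assumptions.
    """
    warnings = []

    def walk(node, path, depth):
        if depth > max_depth:
            warnings.append(f"{path}: nested deeper than {max_depth} levels")
        node_type = node.get("type")
        if node_type == "object":
            props = node.get("properties", {})
            # Every name in "required" should be a defined property.
            for name in node.get("required", []):
                if name not in props:
                    warnings.append(f"{path}: required field '{name}' is not defined")
            for name, child in props.items():
                child_path = f"{path}.{name}"
                if not child.get("description"):
                    warnings.append(f"{child_path}: missing description")
                walk(child, child_path, depth + 1)
        elif node_type == "array":
            items = node.get("items")
            if isinstance(items, dict):
                walk(items, f"{path}[]", depth + 1)
            else:
                warnings.append(f"{path}: array has no items schema")

    walk(schema, "$", 0)
    return warnings


example_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "line_items": {
            "type": "array",
            "description": "Line items",
            "items": {
                "type": "object",
                "properties": {"total": {"type": "number"}},
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}

for warning in lint_schema(example_schema):
    print(warning)
# $: required field 'total_amount' is not defined
# $.line_items[].total: missing description
```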
### Thinking Level
The thinking_level parameter controls the model and thinking configuration used for extraction:
| Value | Description |
|---|---|
| "fast" | Uses a faster model without extended thinking. Best for simple extractions where speed is prioritized. |
| "balanced" | Uses a more capable model with low thinking. Good balance between quality and speed. |
| "accurate" | Default. Uses a more capable model with high thinking. Best for complex extractions requiring deep reasoning. |
### Example Request (using file_ids)
curl -X POST "https://sources.graphorlm.com/run-extraction" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"file_ids": ["file_abc123"],
"user_instruction": "Extract all invoice information. Use YYYY-MM-DD format for dates.",
"output_schema": {
"type": "object",
"properties": {
"invoice_number": {
"type": "string",
"description": "The unique invoice identifier"
},
"invoice_date": {
"type": "string",
"description": "Invoice date in YYYY-MM-DD format"
},
"total_amount": {
"type": "number",
"description": "Total amount due"
},
"vendor_name": {
"type": "string",
"description": "Name of the company issuing the invoice"
}
},
"required": ["invoice_number", "total_amount"]
}
}'
### Example Request (using file_names, deprecated)
curl -X POST "https://sources.graphorlm.com/run-extraction" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"file_names": ["invoice-2024.pdf"],
"user_instruction": "Extract all invoice information. Use YYYY-MM-DD format for dates.",
"output_schema": {
"type": "object",
"properties": {
"invoice_number": {
"type": "string",
"description": "The unique invoice identifier"
},
"invoice_date": {
"type": "string",
"description": "Invoice date in YYYY-MM-DD format"
},
"total_amount": {
"type": "number",
"description": "Total amount due"
},
"vendor_name": {
"type": "string",
"description": "Name of the company issuing the invoice"
}
},
"required": ["invoice_number", "total_amount"]
}
}'
### Example Request with Thinking Level
curl -X POST "https://sources.graphorlm.com/run-extraction" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"file_names": ["complex-contract.pdf"],
"user_instruction": "Extract all legal clauses with their implications.",
"thinking_level": "accurate",
"output_schema": {
"type": "object",
"properties": {
"clauses": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"content": { "type": "string" },
"implications": { "type": "string" }
}
}
}
}
}
}'
### Example Request with Nested Objects and Arrays
curl -X POST "https://sources.graphorlm.com/run-extraction" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"file_names": ["invoice-2024.pdf"],
"user_instruction": "Extract invoice with line items and address details.",
"output_schema": {
"type": "object",
"properties": {
"invoice_number": {
"type": "string",
"description": "The unique invoice identifier"
},
"billing_address": {
"type": "object",
"description": "Billing address details",
"properties": {
"street": { "type": "string", "description": "Street address" },
"city": { "type": "string", "description": "City name" },
"zip_code": { "type": "string", "description": "Postal code" },
"country": { "type": "string", "description": "Country name" }
}
},
"tags": {
"type": "array",
"description": "Invoice tags or categories",
"items": { "type": "string" }
},
"line_items": {
"type": "array",
"description": "Invoice line items",
"items": {
"type": "object",
"properties": {
"description": { "type": "string", "description": "Item description" },
"quantity": { "type": "number", "description": "Item quantity" },
"unit_price": { "type": "number", "description": "Price per unit" },
"total": { "type": "number", "description": "Line item total" }
}
}
}
},
"required": ["invoice_number"]
}
}'
## Response
### Success Response (200 OK)
| Field | Type | Description |
|---|---|---|
| file_ids | array | List of file IDs used for extraction |
| file_names | array | List of file names used for extraction |
| structured_output | object | Extracted data matching your schema |
| raw_json | string | Raw JSON text produced by the model before validation and correction |
#### Example Response
{
"file_ids": ["file_abc123"],
"file_names": ["invoice-2024.pdf"],
"structured_output": {
"invoice_number": "INV-2024-001",
"invoice_date": "2024-01-15",
"total_amount": 1250.00,
"vendor_name": "Acme Corporation"
},
"raw_json": "{\"invoice_number\": \"INV-2024-001\", \"invoice_date\": \"2024-01-15\", \"total_amount\": 1250.00, \"vendor_name\": \"Acme Corporation\"}"
}
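A response like the one above can be checked on the client side once it has been parsed. The sketch below assumes the response body is already a Python dictionary. It lists top-level `required` fields that are absent from `structured_output` and decodes `raw_json` to see whether validation changed anything. The helper name `missing_required` is hypothetical, and it only looks at the top level of the schema.

```python
import json

# The example response above, as a Python dictionary.
response_body = {
    "file_ids": ["file_abc123"],
    "file_names": ["invoice-2024.pdf"],
    "structured_output": {
        "invoice_number": "INV-2024-001",
        "invoice_date": "2024-01-15",
        "total_amount": 1250.00,
        "vendor_name": "Acme Corporation",
    },
    "raw_json": '{"invoice_number": "INV-2024-001", "invoice_date": "2024-01-15", '
                '"total_amount": 1250.00, "vendor_name": "Acme Corporation"}',
}

# Only the part of the request schema needed for this check.
schema = {"type": "object", "required": ["invoice_number", "total_amount"]}


def missing_required(structured_output, schema):
    """Return top-level required field names absent from the output."""
    return [name for name in schema.get("required", []) if name not in structured_output]


missing = missing_required(response_body["structured_output"], schema)

# raw_json is a string, so it has to be decoded before it can be compared.
raw = json.loads(response_body["raw_json"])
changed_by_validation = raw != response_body["structured_output"]

print("Missing required fields:", missing)
print("Changed by validation:", changed_by_validation)
```

For this example nothing is missing and the two representations match. A difference between them would indicate that the validation step adjusted the model's raw output.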
#### Response with Object and Array Types
When your schema uses object and array types, structured_output mirrors that nesting:
{
"file_names": ["invoice-2024.pdf"],
"structured_output": {
"invoice_number": "INV-2024-001",
"billing_address": {
"street": "123 Main Street",
"city": "New York",
"zip_code": "10001",
"country": "USA"
},
"tags": ["urgent", "corporate", "Q1-2024"],
"line_items": [
{
"description": "Professional Services",
"quantity": 10,
"unit_price": 100.00,
"total": 1000.00
},
{
"description": "Software License",
"quantity": 1,
"unit_price": 250.00,
"total": 250.00
}
]
},
"raw_json": "{...}"
}
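Nested output makes simple arithmetic cross-checks possible. As a sketch, the code below takes the `line_items` from the example above and flags any item whose `quantity * unit_price` does not match its `total`, then sums the totals. The helper name `inconsistent_line_items` and the tolerance of 0.01 are assumptions for illustration; the field names come from the example schema and would differ for other schemas.

```python
import math

# line_items from the example response above.
structured_output = {
    "invoice_number": "INV-2024-001",
    "line_items": [
        {"description": "Professional Services", "quantity": 10,
         "unit_price": 100.00, "total": 1000.00},
        {"description": "Software License", "quantity": 1,
         "unit_price": 250.00, "total": 250.00},
    ],
}


def inconsistent_line_items(line_items, tolerance=0.01):
    """Return indexes of items that cannot be verified or do not add up."""
    flagged = []
    for index, item in enumerate(line_items):
        quantity = item.get("quantity")
        unit_price = item.get("unit_price")
        total = item.get("total")
        if None in (quantity, unit_price, total):
            # A field was not extracted, so the item cannot be checked.
            flagged.append(index)
            continue
        if not math.isclose(quantity * unit_price, total, abs_tol=tolerance):
            flagged.append(index)
    return flagged


flagged = inconsistent_line_items(structured_output["line_items"])
items_total = sum(item["total"] for item in structured_output["line_items"])

print("Inconsistent items:", flagged)   # Inconsistent items: []
print("Sum of line totals:", items_total)  # Sum of line totals: 1250.0
```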
### Error Responses
| Status Code | Description |
|-------------|-------------|
| 400 | Bad Request - Invalid parameters or schema |
| 401 | Unauthorized - Invalid or missing API token |
| 404 | Not Found - File not found or no parsing history |
| 500 | Internal Server Error |
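One way to surface these status codes in client code is to translate them into readable errors. The following sketch uses only the Python standard library. The names `describe_error` and `run_extraction`, the hint texts, and the 120-second timeout are assumptions rather than documented behavior, and the format of the error response body is not specified here, so the sketch relies on the status code alone.

```python
import json
import urllib.error
import urllib.request

API_URL = "https://sources.graphorlm.com/run-extraction"

# Hints derived from the error table above; the wording is illustrative.
ERROR_HINTS = {
    400: "Bad Request: check the parameters and the output_schema",
    401: "Unauthorized: check the API token",
    404: "Not Found: check the file ID or name, and that the file has been parsed",
    500: "Internal Server Error: the request may succeed if retried later",
}


def describe_error(status_code):
    """Map a documented status code to a hint, with a fallback for others."""
    return ERROR_HINTS.get(status_code, f"Unexpected status {status_code}")


def run_extraction(payload, api_token, timeout=120):
    """POST the payload and return the parsed JSON body.

    Raises RuntimeError with a hint when the server answers with an error status.
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        raise RuntimeError(describe_error(error.code)) from error


print(describe_error(404))
```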
## Schema Examples
### Invoice Extraction
```json
{
"file_names": ["invoice.pdf"],
"user_instruction": "Extract all invoice details. Convert amounts to numbers without currency symbols.",
"output_schema": {
"type": "object",
"properties": {
"invoice_number": { "type": "string", "description": "Unique invoice identifier" },
"invoice_date": { "type": "string", "description": "Invoice date (YYYY-MM-DD)" },
"due_date": { "type": "string", "description": "Payment due date (YYYY-MM-DD)" },
"vendor_name": { "type": "string", "description": "Company issuing the invoice" },
"customer_name": { "type": "string", "description": "Customer being billed" },
"subtotal": { "type": "number", "description": "Subtotal before tax" },
"tax_amount": { "type": "number", "description": "Tax amount" },
"total_amount": { "type": "number", "description": "Total amount due" }
},
"required": ["invoice_number", "total_amount"]
}
}
```
### Contract Analysis
{
"file_names": ["contract.pdf"],
"user_instruction": "Extract key contract terms with all parties and obligations.",
"output_schema": {
"type": "object",
"properties": {
"contract_title": { "type": "string", "description": "Title or name of the contract" },
"effective_date": { "type": "string", "description": "When the contract becomes effective" },
"termination_date": { "type": "string", "description": "When the contract ends" },
"auto_renewal": { "type": "boolean", "description": "Whether contract auto-renews" },
"parties": {
"type": "array",
"description": "All parties involved in the contract",
"items": {
"type": "object",
"properties": {
"name": { "type": "string", "description": "Party name" },
"role": { "type": "string", "description": "Role (e.g., Licensor, Licensee)" },
"address": { "type": "string", "description": "Party address" }
}
}
},
"key_terms": {
"type": "object",
"description": "Key contract terms and conditions",
"properties": {
"payment_terms": { "type": "string", "description": "Payment conditions" },
"liability_cap": { "type": "number", "description": "Maximum liability amount" },
"notice_period_days": { "type": "number", "description": "Notice period in days" }
}
}
},
"required": ["contract_title", "parties"]
}
}
### Resume Parsing
{
"file_names": ["resume.pdf"],
"user_instruction": "Extract complete candidate information including work history and education.",
"output_schema": {
"type": "object",
"properties": {
"full_name": { "type": "string", "description": "Candidate's full name" },
"email": { "type": "string", "description": "Email address" },
"phone": { "type": "string", "description": "Phone number" },
"years_experience": { "type": "number", "description": "Total years of experience" },
"skills": {
"type": "array",
"description": "List of technical and soft skills",
"items": { "type": "string" }
},
"work_experience": {
"type": "array",
"description": "Work history",
"items": {
"type": "object",
"properties": {
"company": { "type": "string", "description": "Company name" },
"title": { "type": "string", "description": "Job title" },
"start_date": { "type": "string", "description": "Start date" },
"end_date": { "type": "string", "description": "End date (or current)" }
}
}
},
"education": {
"type": "array",
"description": "Educational background",
"items": {
"type": "object",
"properties": {
"institution": { "type": "string", "description": "School or university name" },
"degree": { "type": "string", "description": "Degree obtained" },
"graduation_year": { "type": "number", "description": "Year of graduation" }
}
}
}
},
"required": ["full_name"]
}
}
### Product Catalog
{
"file_names": ["catalog.pdf"],
"user_instruction": "Extract all products with their specifications and variants.",
"output_schema": {
"type": "object",
"properties": {
"product_name": { "type": "string", "description": "Product name" },
"sku": { "type": "string", "description": "Product SKU" },
"base_price": { "type": "number", "description": "Base price" },
"in_stock": { "type": "boolean", "description": "Whether product is in stock" },
"specifications": {
"type": "object",
"description": "Product specifications",
"properties": {
"weight": { "type": "number", "description": "Weight in kg" },
"dimensions": { "type": "string", "description": "Dimensions (LxWxH)" },
"material": { "type": "string", "description": "Main material" }
}
},
"categories": {
"type": "array",
"description": "Product categories",
"items": { "type": "string" }
},
"variants": {
"type": "array",
"description": "Product variants",
"items": {
"type": "object",
"properties": {
"color": { "type": "string", "description": "Variant color" },
"size": { "type": "string", "description": "Variant size" },
"price_modifier": { "type": "number", "description": "Price adjustment" }
}
}
}
},
"required": ["product_name", "sku"]
}
}
## Usage Examples
### Python
import requests
url = "https://sources.graphorlm.com/run-extraction"
headers = {
"Authorization": "Bearer YOUR_API_TOKEN",
"Content-Type": "application/json"
}
# Basic extraction
payload = {
"file_names": ["invoice.pdf"],
"user_instruction": "Extract invoice information. Use YYYY-MM-DD for dates.",
"output_schema": {
"type": "object",
"properties": {
"invoice_number": {"type": "string", "description": "Invoice ID"},
"total_amount": {"type": "number", "description": "Total due"},
"invoice_date": {"type": "string", "description": "Invoice date"}
},
"required": ["invoice_number", "total_amount"]
}
}
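response = requests.post(url, headers=headers, json=payload)
# Stop on 4xx/5xx responses instead of parsing an error body
response.raise_for_status()
data = response.json()
output = data["structured_output"]
print(f"Invoice: {output['invoice_number']}")
print(f"Amount: ${output['total_amount']}")
# invoice_date is not in "required", so it may be absent
print(f"Date: {output.get('invoice_date', 'n/a')}")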
### Python with Nested Objects and Arrays
import requests
url = "https://sources.graphorlm.com/run-extraction"
headers = {
"Authorization": "Bearer YOUR_API_TOKEN",
"Content-Type": "application/json"
}
# Extraction with nested objects and arrays
payload = {
"file_names": ["invoice.pdf"],
"user_instruction": "Extract invoice with line items and address.",
"output_schema": {
"type": "object",
"properties": {
"invoice_number": {"type": "string", "description": "Invoice ID"},
"billing_address": {
"type": "object",
"description": "Billing address",
"properties": {
"street": {"type": "string", "description": "Street"},
"city": {"type": "string", "description": "City"},
"country": {"type": "string", "description": "Country"}
}
},
"line_items": {
"type": "array",
"description": "Invoice line items",
"items": {
"type": "object",
"properties": {
"description": {"type": "string", "description": "Item description"},
"quantity": {"type": "number", "description": "Quantity"},
"price": {"type": "number", "description": "Unit price"}
}
}
}
},
"required": ["invoice_number"]
}
}
response = requests.post(url, headers=headers, json=payload)
# Stop on 4xx/5xx responses instead of parsing an error body
response.raise_for_status()
data = response.json()
output = data["structured_output"]
print(f"Invoice: {output['invoice_number']}")
print(f"City: {output['billing_address']['city']}")
print("Line Items:")
for line in output["line_items"]:
    print(f" - {line['description']}: {line['quantity']} x ${line['price']}")
### JavaScript
const API_URL = "https://sources.graphorlm.com/run-extraction";
const API_TOKEN = "YOUR_API_TOKEN";
async function extractData(fileNames, instruction, schema) {
const response = await fetch(API_URL, {
method: "POST",
headers: {
"Authorization": `Bearer ${API_TOKEN}`,
"Content-Type": "application/json"
},
body: JSON.stringify({
file_names: fileNames,
user_instruction: instruction,
output_schema: schema
})
});
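// Surface HTTP errors (400, 401, 404, 500) instead of parsing an error body
if (!response.ok) {
throw new Error(`Extraction failed with status ${response.status}`);
}
return response.json();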
}
// Basic usage
const result = await extractData(
["invoice.pdf"],
"Extract invoice details",
{
type: "object",
properties: {
invoice_number: { type: "string", description: "Invoice ID" },
total_amount: { type: "number", description: "Total due" }
},
required: ["invoice_number", "total_amount"]
}
);
const { structured_output } = result;
console.log(`Invoice: ${structured_output.invoice_number}`);
console.log(`Amount: $${structured_output.total_amount}`);
### JavaScript with Nested Objects and Arrays
const API_URL = "https://sources.graphorlm.com/run-extraction";
const API_TOKEN = "YOUR_API_TOKEN";
// Extraction with nested objects and arrays
const schema = {
type: "object",
properties: {
invoice_number: { type: "string", description: "Invoice ID" },
billing_address: {
type: "object",
description: "Billing address",
properties: {
street: { type: "string", description: "Street" },
city: { type: "string", description: "City" },
country: { type: "string", description: "Country" }
}
},
tags: {
type: "array",
description: "Invoice tags",
items: { type: "string" }
},
line_items: {
type: "array",
description: "Invoice line items",
items: {
type: "object",
properties: {
description: { type: "string", description: "Item description" },
quantity: { type: "number", description: "Quantity" },
price: { type: "number", description: "Unit price" }
}
}
}
},
required: ["invoice_number"]
};
const response = await fetch(API_URL, {
method: "POST",
headers: {
"Authorization": `Bearer ${API_TOKEN}`,
"Content-Type": "application/json"
},
body: JSON.stringify({
file_names: ["invoice.pdf"],
user_instruction: "Extract invoice with all details",
output_schema: schema
})
});
const data = await response.json();
const { invoice_number, billing_address, tags, line_items } = data.structured_output;
console.log(`Invoice: ${invoice_number}`);
console.log(`Address: ${billing_address.street}, ${billing_address.city}`);
console.log(`Tags: ${tags.join(", ")}`);
console.log("Line Items:");
line_items.forEach(line => {
console.log(` - ${line.description}: ${line.quantity} x $${line.price}`);
});
## Best Practices
- Use standard JSON Schema: the API accepts any valid JSON Schema, giving you full flexibility
- Be specific in descriptions: detailed property descriptions improve extraction accuracy
- Use appropriate types: match property types to expected data (`number` for amounts, `string` for dates)
- Provide clear instructions: guide the extraction with format preferences and edge cases
- Use objects for structured data: group related fields using nested objects (e.g., address with street, city, zip)
- Use arrays for lists: extract repeating items using arrays with appropriate item schemas
- Keep nesting shallow: avoid deeply nested structures for better extraction accuracy
- Define required fields: use the `required` array to specify mandatory properties
- Use raw_json for debugging: the `raw_json` field contains the model's raw output before validation