Overview

The Data Extraction API allows you to extract structured information from your documents using standard JSON Schema and natural language instructions. The extraction uses the active parsing version of the specified document.

Endpoint

POST https://sources.graphorlm.com/run-extraction

Authentication

Include your API token in the Authorization header:
Authorization: Bearer YOUR_API_TOKEN

Request

Headers

| Header | Value | Required |
|--------|-------|----------|
| Authorization | Bearer YOUR_API_TOKEN | Yes |
| Content-Type | application/json | Yes |

Body Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file_ids | string[] | No* | List of file IDs to extract from (preferred) |
| file_names | string[] | No* | List of file names to extract from (deprecated, use file_ids) |
| user_instruction | string | Yes | Natural language instructions to guide the extraction |
| output_schema | object | Yes | JSON Schema defining the structure of the extracted data |
| thinking_level | string | No | Controls model and thinking configuration. Values: "fast", "balanced", "accurate" (default). See Thinking Level for details. |

*At least one of file_ids or file_names must be provided. file_ids is preferred.
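
The parameter rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `build_extraction_payload` and `ALLOWED_THINKING_LEVELS` are hypothetical names, and the checks only mirror what the table states (the server may apply further validation).

```python
from typing import Any, Optional

# Values listed for thinking_level in the Body Parameters table.
ALLOWED_THINKING_LEVELS = {"fast", "balanced", "accurate"}


def build_extraction_payload(
    user_instruction: str,
    output_schema: dict[str, Any],
    file_ids: Optional[list[str]] = None,
    file_names: Optional[list[str]] = None,
    thinking_level: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble a request body, enforcing the documented rules locally."""
    if not file_ids and not file_names:
        raise ValueError("Provide at least one of file_ids or file_names")
    if not user_instruction:
        raise ValueError("user_instruction is required")
    if not isinstance(output_schema, dict):
        raise ValueError("output_schema must be a JSON Schema object")
    if thinking_level is not None and thinking_level not in ALLOWED_THINKING_LEVELS:
        raise ValueError(
            f"thinking_level must be one of {sorted(ALLOWED_THINKING_LEVELS)}"
        )

    payload: dict[str, Any] = {
        "user_instruction": user_instruction,
        "output_schema": output_schema,
    }
    # Only include optional keys that were actually supplied.
    if file_ids:
        payload["file_ids"] = file_ids
    if file_names:
        payload["file_names"] = file_names
    if thinking_level is not None:
        payload["thinking_level"] = thinking_level
    return payload
```

The returned dictionary can be passed as the JSON body of the POST request. Omitting thinking_level leaves the server default ("accurate") in effect.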

Output Schema

The output_schema parameter accepts a standard JSON Schema object. This is the same format used by the Chat API for structured outputs.

Thinking Level

The thinking_level parameter controls the model and thinking configuration used for extraction:

| Value | Description |
|-------|-------------|
| "fast" | Uses a faster model without extended thinking. Best for simple extractions where speed is prioritized. |
| "balanced" | Uses a more capable model with low thinking. Good balance between quality and speed. |
| "accurate" | Default. Uses a more capable model with high thinking. Best for complex extractions requiring deep reasoning. |

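One possible way to trade speed against quality is to start with "fast" and move to a slower level only when the result looks incomplete. The sketch below is an illustrative client-side strategy, not documented API behavior. It assumes that an incomplete extraction shows up as a required property that is missing or null. `extract_with_escalation`, `missing_required`, and the `run_extraction` callable are hypothetical names; `run_extraction` stands in for whatever function sends the request with a given thinking_level and returns the parsed response body.

```python
from typing import Any, Callable

# Ordered from fastest to most thorough, following the Thinking Level table.
THINKING_LEVELS = ["fast", "balanced", "accurate"]


def missing_required(output: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return top-level required properties that are absent or null."""
    return [key for key in schema.get("required", []) if output.get(key) is None]


def extract_with_escalation(
    run_extraction: Callable[[str], dict[str, Any]],
    schema: dict[str, Any],
) -> tuple[str, dict[str, Any]]:
    """Try each level in order and stop at the first complete result.

    Returns the level that produced the result and the structured output.
    If no level fills every required property, the "accurate" result is returned.
    """
    level = THINKING_LEVELS[0]
    output: dict[str, Any] = {}
    for level in THINKING_LEVELS:
        output = run_extraction(level).get("structured_output") or {}
        if not missing_required(output, schema):
            break
    return level, output
```

Each escalation is a separate request, so this only pays off when most documents succeed at the first level.
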
Example Request (using file_ids)

curl -X POST "https://sources.graphorlm.com/run-extraction" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file_ids": ["file_abc123"],
    "user_instruction": "Extract all invoice information. Use YYYY-MM-DD format for dates.",
    "output_schema": {
      "type": "object",
      "properties": {
        "invoice_number": {
          "type": "string",
          "description": "The unique invoice identifier"
        },
        "invoice_date": {
          "type": "string",
          "description": "Invoice date in YYYY-MM-DD format"
        },
        "total_amount": {
          "type": "number",
          "description": "Total amount due"
        },
        "vendor_name": {
          "type": "string",
          "description": "Name of the company issuing the invoice"
        }
      },
      "required": ["invoice_number", "total_amount"]
    }
  }'

Example Request (using file_names - deprecated)

curl -X POST "https://sources.graphorlm.com/run-extraction" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file_names": ["invoice-2024.pdf"],
    "user_instruction": "Extract all invoice information. Use YYYY-MM-DD format for dates.",
    "output_schema": {
      "type": "object",
      "properties": {
        "invoice_number": {
          "type": "string",
          "description": "The unique invoice identifier"
        },
        "invoice_date": {
          "type": "string",
          "description": "Invoice date in YYYY-MM-DD format"
        },
        "total_amount": {
          "type": "number",
          "description": "Total amount due"
        },
        "vendor_name": {
          "type": "string",
          "description": "Name of the company issuing the invoice"
        }
      },
      "required": ["invoice_number", "total_amount"]
    }
  }'

Example Request with Thinking Level

curl -X POST "https://sources.graphorlm.com/run-extraction" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file_names": ["complex-contract.pdf"],
    "user_instruction": "Extract all legal clauses with their implications.",
    "thinking_level": "accurate",
    "output_schema": {
      "type": "object",
      "properties": {
        "clauses": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": { "type": "string" },
              "content": { "type": "string" },
              "implications": { "type": "string" }
            }
          }
        }
      }
    }
  }'

Example Request with Nested Objects and Arrays

curl -X POST "https://sources.graphorlm.com/run-extraction" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file_names": ["invoice-2024.pdf"],
    "user_instruction": "Extract invoice with line items and address details.",
    "output_schema": {
      "type": "object",
      "properties": {
        "invoice_number": {
          "type": "string",
          "description": "The unique invoice identifier"
        },
        "billing_address": {
          "type": "object",
          "description": "Billing address details",
          "properties": {
            "street": { "type": "string", "description": "Street address" },
            "city": { "type": "string", "description": "City name" },
            "zip_code": { "type": "string", "description": "Postal code" },
            "country": { "type": "string", "description": "Country name" }
          }
        },
        "tags": {
          "type": "array",
          "description": "Invoice tags or categories",
          "items": { "type": "string" }
        },
        "line_items": {
          "type": "array",
          "description": "Invoice line items",
          "items": {
            "type": "object",
            "properties": {
              "description": { "type": "string", "description": "Item description" },
              "quantity": { "type": "number", "description": "Item quantity" },
              "unit_price": { "type": "number", "description": "Price per unit" },
              "total": { "type": "number", "description": "Line item total" }
            }
          }
        }
      },
      "required": ["invoice_number"]
    }
  }'

Response

Success Response (200 OK)

| Field | Type | Description |
|-------|------|-------------|
| file_ids | array | List of file IDs used for extraction |
| file_names | array | List of file names used for extraction |
| structured_output | object | Extracted data matching your schema |
| raw_json | string | Raw JSON text produced by the model, before validation and correction |

Example Response

{
  "file_ids": ["file_abc123"],
  "file_names": ["invoice-2024.pdf"],
  "structured_output": {
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15",
    "total_amount": 1250.00,
    "vendor_name": "Acme Corporation"
  },
  "raw_json": "{\"invoice_number\": \"INV-2024-001\", \"invoice_date\": \"2024-01-15\", \"total_amount\": 1250.00, \"vendor_name\": \"Acme Corporation\"}"
}
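
Because raw_json holds the model's text before validation and correction, comparing it with structured_output shows which values were changed along the way. The sketch below is one way to do that comparison at the top level. `diff_raw_and_structured` is a hypothetical helper, and the sample body is invented for illustration (its raw total is a formatted string on purpose).

```python
import json
from typing import Any


def diff_raw_and_structured(body: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Map each top-level key that differs to a (raw, structured) pair."""
    structured = body.get("structured_output") or {}
    try:
        raw = json.loads(body.get("raw_json") or "{}")
    except json.JSONDecodeError:
        raw = None  # pre-validation text is not guaranteed to parse
    if not isinstance(raw, dict):
        # Nothing comparable on the raw side: report every structured key.
        return {key: (None, value) for key, value in structured.items()}
    return {
        key: (raw.get(key), structured.get(key))
        for key in sorted(set(raw) | set(structured))
        if raw.get(key) != structured.get(key)
    }


# Invented sample: the raw text carries a formatted amount, the final output a number.
example = {
    "structured_output": {"invoice_number": "INV-2024-001", "total_amount": 1250.00},
    "raw_json": '{"invoice_number": "INV-2024-001", "total_amount": "1,250.00"}',
}
print(diff_raw_and_structured(example))
# {'total_amount': ('1,250.00', 1250.0)}
```

An empty result means the parsed raw text and the structured output agree at the top level. Nested differences surface as a single differing parent key.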

Response with Object and Array Types

When your schema uses object and array types, structured_output nests the extracted values accordingly:
{
  "file_names": ["invoice-2024.pdf"],
  "structured_output": {
    "invoice_number": "INV-2024-001",
    "billing_address": {
      "street": "123 Main Street",
      "city": "New York",
      "zip_code": "10001",
      "country": "USA"
    },
    "tags": ["urgent", "corporate", "Q1-2024"],
    "line_items": [
      {
        "description": "Professional Services",
        "quantity": 10,
        "unit_price": 100.00,
        "total": 1000.00
      },
      {
        "description": "Software License",
        "quantity": 1,
        "unit_price": 250.00,
        "total": 250.00
      }
    ]
  },
  "raw_json": "{...}"
}
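
Extracted numbers come from a model, so a quick arithmetic cross-check on the client can catch misread values. The sketch below assumes line items shaped like the response above (quantity, unit_price, total). `find_total_mismatches` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
import math
from typing import Any


def find_total_mismatches(
    line_items: list[dict[str, Any]], tolerance: float = 0.01
) -> list[int]:
    """Return indexes of line items where quantity * unit_price differs from total."""
    mismatches = []
    for index, item in enumerate(line_items):
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tolerance):
            mismatches.append(index)
    return mismatches


# Line items copied from the example response above.
line_items = [
    {"description": "Professional Services", "quantity": 10, "unit_price": 100.00, "total": 1000.00},
    {"description": "Software License", "quantity": 1, "unit_price": 250.00, "total": 250.00},
]

print(find_total_mismatches(line_items))       # []
print(sum(item["total"] for item in line_items))  # 1250.0
```

A non-empty list points at items worth reviewing manually or re-extracting.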

### Error Responses

| Status Code | Description |
|-------------|-------------|
| 400 | Bad Request - Invalid parameters or schema |
| 401 | Unauthorized - Invalid or missing API token |
| 404 | Not Found - File not found or no parsing history |
| 500 | Internal Server Error |
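
A client can translate these status codes into exceptions so that callers handle failures in one place. The following is a minimal sketch. `ExtractionError` and `raise_for_extraction_status` are hypothetical names, and treating only 500 as retryable is an assumption, not documented behavior.

```python
# Status codes and meanings copied from the Error Responses table.
ERROR_MEANINGS = {
    400: "Bad Request: invalid parameters or schema",
    401: "Unauthorized: invalid or missing API token",
    404: "Not Found: file not found or no parsing history",
    500: "Internal Server Error",
}

# Assumption: only server-side failures are worth retrying unchanged.
RETRYABLE_STATUS_CODES = {500}


class ExtractionError(Exception):
    """Hypothetical exception carrying the HTTP status of a failed call."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code
        self.retryable = status_code in RETRYABLE_STATUS_CODES
        meaning = ERROR_MEANINGS.get(status_code, "Unexpected status")
        super().__init__(f"{status_code} {meaning}")


def raise_for_extraction_status(status_code: int) -> None:
    """Raise ExtractionError for any non-2xx status code; otherwise return None."""
    if not 200 <= status_code < 300:
        raise ExtractionError(status_code)
```

With the requests library, this would be called as `raise_for_extraction_status(response.status_code)` before reading `response.json()`.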

## Schema Examples

### Invoice Extraction

```json
{
  "file_names": ["invoice.pdf"],
  "user_instruction": "Extract all invoice details. Convert amounts to numbers without currency symbols.",
  "output_schema": {
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string", "description": "Unique invoice identifier" },
      "invoice_date": { "type": "string", "description": "Invoice date (YYYY-MM-DD)" },
      "due_date": { "type": "string", "description": "Payment due date (YYYY-MM-DD)" },
      "vendor_name": { "type": "string", "description": "Company issuing the invoice" },
      "customer_name": { "type": "string", "description": "Customer being billed" },
      "subtotal": { "type": "number", "description": "Subtotal before tax" },
      "tax_amount": { "type": "number", "description": "Tax amount" },
      "total_amount": { "type": "number", "description": "Total amount due" }
    },
    "required": ["invoice_number", "total_amount"]
  }
}
```

Contract Analysis

{
  "file_names": ["contract.pdf"],
  "user_instruction": "Extract key contract terms with all parties and obligations.",
  "output_schema": {
    "type": "object",
    "properties": {
      "contract_title": { "type": "string", "description": "Title or name of the contract" },
      "effective_date": { "type": "string", "description": "When the contract becomes effective" },
      "termination_date": { "type": "string", "description": "When the contract ends" },
      "auto_renewal": { "type": "boolean", "description": "Whether contract auto-renews" },
      "parties": {
        "type": "array",
        "description": "All parties involved in the contract",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string", "description": "Party name" },
            "role": { "type": "string", "description": "Role (e.g., Licensor, Licensee)" },
            "address": { "type": "string", "description": "Party address" }
          }
        }
      },
      "key_terms": {
        "type": "object",
        "description": "Key contract terms and conditions",
        "properties": {
          "payment_terms": { "type": "string", "description": "Payment conditions" },
          "liability_cap": { "type": "number", "description": "Maximum liability amount" },
          "notice_period_days": { "type": "number", "description": "Notice period in days" }
        }
      }
    },
    "required": ["contract_title", "parties"]
  }
}

Resume Parsing

{
  "file_names": ["resume.pdf"],
  "user_instruction": "Extract complete candidate information including work history and education.",
  "output_schema": {
    "type": "object",
    "properties": {
      "full_name": { "type": "string", "description": "Candidate's full name" },
      "email": { "type": "string", "description": "Email address" },
      "phone": { "type": "string", "description": "Phone number" },
      "years_experience": { "type": "number", "description": "Total years of experience" },
      "skills": {
        "type": "array",
        "description": "List of technical and soft skills",
        "items": { "type": "string" }
      },
      "work_experience": {
        "type": "array",
        "description": "Work history",
        "items": {
          "type": "object",
          "properties": {
            "company": { "type": "string", "description": "Company name" },
            "title": { "type": "string", "description": "Job title" },
            "start_date": { "type": "string", "description": "Start date" },
            "end_date": { "type": "string", "description": "End date (or current)" }
          }
        }
      },
      "education": {
        "type": "array",
        "description": "Educational background",
        "items": {
          "type": "object",
          "properties": {
            "institution": { "type": "string", "description": "School or university name" },
            "degree": { "type": "string", "description": "Degree obtained" },
            "graduation_year": { "type": "number", "description": "Year of graduation" }
          }
        }
      }
    },
    "required": ["full_name"]
  }
}

Product Catalog

{
  "file_names": ["catalog.pdf"],
  "user_instruction": "Extract all products with their specifications and variants.",
  "output_schema": {
    "type": "object",
    "properties": {
      "product_name": { "type": "string", "description": "Product name" },
      "sku": { "type": "string", "description": "Product SKU" },
      "base_price": { "type": "number", "description": "Base price" },
      "in_stock": { "type": "boolean", "description": "Whether product is in stock" },
      "specifications": {
        "type": "object",
        "description": "Product specifications",
        "properties": {
          "weight": { "type": "number", "description": "Weight in kg" },
          "dimensions": { "type": "string", "description": "Dimensions (LxWxH)" },
          "material": { "type": "string", "description": "Main material" }
        }
      },
      "categories": {
        "type": "array",
        "description": "Product categories",
        "items": { "type": "string" }
      },
      "variants": {
        "type": "array",
        "description": "Product variants",
        "items": {
          "type": "object",
          "properties": {
            "color": { "type": "string", "description": "Variant color" },
            "size": { "type": "string", "description": "Variant size" },
            "price_modifier": { "type": "number", "description": "Price adjustment" }
          }
        }
      }
    },
    "required": ["product_name", "sku"]
  }
}

Usage Examples

Python

import requests

url = "https://sources.graphorlm.com/run-extraction"
headers = {
    "Authorization": "Bearer YOUR_API_TOKEN",
    "Content-Type": "application/json"
}

# Basic extraction
payload = {
    "file_names": ["invoice.pdf"],
    "user_instruction": "Extract invoice information. Use YYYY-MM-DD for dates.",
    "output_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID"},
            "total_amount": {"type": "number", "description": "Total due"},
            "invoice_date": {"type": "string", "description": "Invoice date"}
        },
        "required": ["invoice_number", "total_amount"]
    }
}

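response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # Surface 4xx/5xx errors instead of parsing an error body
data = response.json()

output = data["structured_output"]
print(f"Invoice: {output['invoice_number']}")
print(f"Amount: ${output['total_amount']}")
# invoice_date is not in "required", so read it defensively
print(f"Date: {output.get('invoice_date', 'N/A')}")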

Python with Nested Objects and Arrays

import requests

url = "https://sources.graphorlm.com/run-extraction"
headers = {
    "Authorization": "Bearer YOUR_API_TOKEN",
    "Content-Type": "application/json"
}

# Extraction with nested objects and arrays
payload = {
    "file_names": ["invoice.pdf"],
    "user_instruction": "Extract invoice with line items and address.",
    "output_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID"},
            "billing_address": {
                "type": "object",
                "description": "Billing address",
                "properties": {
                    "street": {"type": "string", "description": "Street"},
                    "city": {"type": "string", "description": "City"},
                    "country": {"type": "string", "description": "Country"}
                }
            },
            "line_items": {
                "type": "array",
                "description": "Invoice line items",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string", "description": "Item description"},
                        "quantity": {"type": "number", "description": "Quantity"},
                        "price": {"type": "number", "description": "Unit price"}
                    }
                }
            }
        },
        "required": ["invoice_number"]
    }
}

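response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # Surface 4xx/5xx errors instead of parsing an error body
data = response.json()

output = data["structured_output"]
print(f"Invoice: {output['invoice_number']}")
# billing_address and line_items are not in "required", so read them defensively
address = output.get("billing_address") or {}
print(f"City: {address.get('city', 'N/A')}")
print("Line Items:")
for line in output.get("line_items") or []:
    print(f"  - {line['description']}: {line['quantity']} x ${line['price']}")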

JavaScript

const API_URL = "https://sources.graphorlm.com/run-extraction";
const API_TOKEN = "YOUR_API_TOKEN";

async function extractData(fileNames, instruction, schema) {
  const response = await fetch(API_URL, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${API_TOKEN}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      file_names: fileNames,
      user_instruction: instruction,
      output_schema: schema
    })
  });
  
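  if (!response.ok) {
    // Surface 4xx/5xx responses instead of parsing an error body as a result
    throw new Error(`Extraction failed with status ${response.status}`);
  }

  return response.json();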
}

// Basic usage
const result = await extractData(
  ["invoice.pdf"],
  "Extract invoice details",
  {
    type: "object",
    properties: {
      invoice_number: { type: "string", description: "Invoice ID" },
      total_amount: { type: "number", description: "Total due" }
    },
    required: ["invoice_number", "total_amount"]
  }
);

const { structured_output } = result;
console.log(`Invoice: ${structured_output.invoice_number}`);
console.log(`Amount: $${structured_output.total_amount}`);

JavaScript with Nested Objects and Arrays

const API_URL = "https://sources.graphorlm.com/run-extraction";
const API_TOKEN = "YOUR_API_TOKEN";

// Extraction with nested objects and arrays
const schema = {
  type: "object",
  properties: {
    invoice_number: { type: "string", description: "Invoice ID" },
    billing_address: {
      type: "object",
      description: "Billing address",
      properties: {
        street: { type: "string", description: "Street" },
        city: { type: "string", description: "City" },
        country: { type: "string", description: "Country" }
      }
    },
    tags: {
      type: "array",
      description: "Invoice tags",
      items: { type: "string" }
    },
    line_items: {
      type: "array",
      description: "Invoice line items",
      items: {
        type: "object",
        properties: {
          description: { type: "string", description: "Item description" },
          quantity: { type: "number", description: "Quantity" },
          price: { type: "number", description: "Unit price" }
        }
      }
    }
  },
  required: ["invoice_number"]
};

const response = await fetch(API_URL, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${API_TOKEN}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    file_names: ["invoice.pdf"],
    user_instruction: "Extract invoice with all details",
    output_schema: schema
  })
});

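if (!response.ok) {
  // Surface 4xx/5xx responses instead of parsing an error body as a result
  throw new Error(`Extraction failed with status ${response.status}`);
}

const data = await response.json();

// Only invoice_number is required, so default the optional fields
const { invoice_number, billing_address = {}, tags = [], line_items = [] } = data.structured_output;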

console.log(`Invoice: ${invoice_number}`);
console.log(`Address: ${billing_address.street}, ${billing_address.city}`);
console.log(`Tags: ${tags.join(", ")}`);

console.log("Line Items:");
line_items.forEach(line => {
  console.log(`  - ${line.description}: ${line.quantity} x $${line.price}`);
});

Best Practices

  1. Use standard JSON Schema — The API accepts any valid JSON Schema, giving you full flexibility
  2. Be specific in descriptions — Detailed property descriptions improve extraction accuracy
  3. Use appropriate types — Match property types to expected data (number for amounts, string for dates)
  4. Provide clear instructions — Guide the extraction with format preferences and edge cases
  5. Use objects for structured data — Group related fields using nested objects (e.g., address with street, city, zip)
  6. Use arrays for lists — Extract repeating items using arrays with appropriate item schemas
  7. Keep nesting shallow — Avoid deeply nested structures for better extraction accuracy
  8. Define required fields — Use the required array to specify mandatory properties
  9. Use raw_json for debugging — The raw_json field contains the model’s raw output before validation
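
Two of the practices above (specific descriptions, shallow nesting) can be checked mechanically before a schema is sent. The sketch below is a hypothetical lint, not part of the API. `schema_depth` and `properties_without_description` are invented names, and the walk only handles the object and array keywords used in this document's examples.

```python
from typing import Any


def schema_depth(schema: dict[str, Any]) -> int:
    """Count nested object/array levels; scalar types count as zero."""
    if schema.get("type") == "object":
        children = schema.get("properties", {}).values()
        return 1 + max((schema_depth(child) for child in children), default=0)
    if schema.get("type") == "array":
        return 1 + schema_depth(schema.get("items", {}))
    return 0


def properties_without_description(schema: dict[str, Any], path: str = "") -> list[str]:
    """List dotted paths of properties that have no description."""
    missing: list[str] = []
    if schema.get("type") == "object":
        for name, child in schema.get("properties", {}).items():
            child_path = f"{path}.{name}" if path else name
            if not child.get("description"):
                missing.append(child_path)
            missing.extend(properties_without_description(child, child_path))
    elif schema.get("type") == "array":
        # Array items are addressed with a [] suffix on the parent path.
        missing.extend(properties_without_description(schema.get("items", {}), f"{path}[]"))
    return missing


# Schema from the "Example Request with Thinking Level" section.
clauses_schema = {
    "type": "object",
    "properties": {
        "clauses": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "content": {"type": "string"},
                    "implications": {"type": "string"},
                },
            },
        }
    },
}

print(schema_depth(clauses_schema))  # 3
print(properties_without_description(clauses_schema))
# ['clauses', 'clauses[].title', 'clauses[].content', 'clauses[].implications']
```

What counts as too deep is a judgment call. The depth value is mainly useful for comparing schema variants against each other.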