Extraction API

Overview

The Data Extraction API allows you to extract structured information from your documents using custom schemas and natural language instructions. The extraction uses the active parsing version of the specified document.

Endpoint

POST https://sources.graphorlm.com/run-extraction

Authentication

Include your API token in the Authorization header:

Authorization: Bearer YOUR_API_TOKEN
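
The header above can be attached programmatically. The following is a minimal sketch using only the Python standard library. It assumes the token is stored in an environment variable named `GRAPHOR_API_TOKEN`; that name and the `build_request` helper are hypothetical choices for this illustration and are not defined by the API. The helper only constructs the request and does not send it.

```python
import json
import os
import urllib.request

API_URL = "https://sources.graphorlm.com/run-extraction"


def build_request(payload, token=None):
    """Build (but do not send) an authenticated POST request."""
    # GRAPHOR_API_TOKEN is a hypothetical variable name chosen for this sketch.
    token = token or os.environ.get("GRAPHOR_API_TOKEN")
    if not token:
        raise ValueError("API token is missing")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Reading the token from the environment keeps it out of source control. The resulting object can be passed to `urllib.request.urlopen` to send the request.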

Request

Headers

Header	Value	Required
`Authorization`	`Bearer YOUR_API_TOKEN`	Yes
`Content-Type`	`application/json`	Yes

Body Parameters

Parameter	Type	Required	Description
`file_name`	string	Yes	The name of the file to extract from
`user_instruction`	string	Yes	Natural language instructions to guide the extraction
`output_schema_fields`	array	Yes	List of field definitions for the extraction schema

Schema Field Object

Each field in `output_schema_fields` has the following properties:

Property	Type	Required	Description
`key`	string	Yes	Field name in the output (use snake_case)
`type`	string	Yes	Data type: `string`, `number`, `date`, `boolean`, `object`, or `array`
`description`	string	Yes	Description of what to extract
`nested_fields`	array	Conditional	Defines the nested field structure. Required when `type` is `object`, or when `type` is `array` and `items_type` is `object`
`items_type`	string	Conditional	Type of array items: `string`, `number`, `date`, `boolean`, or `object`. Required when `type` is `array`
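
The rules in the table above can be checked on the client before a request is sent. The sketch below is one possible pre-flight check. The `validate_fields` helper is hypothetical and only mirrors the table, so the server's own validation may be stricter or behave differently.

```python
PRIMITIVE_TYPES = {"string", "number", "date", "boolean"}
FIELD_TYPES = PRIMITIVE_TYPES | {"object", "array"}
ITEM_TYPES = PRIMITIVE_TYPES | {"object"}


def validate_fields(fields, path=""):
    """Return a list of human-readable problems found in a schema field list."""
    problems = []
    for index, field in enumerate(fields):
        key = field.get("key") or f"[{index}]"
        name = f"{path}{key}"
        # key, type, and description are required on every field.
        for prop in ("key", "type", "description"):
            if not field.get(prop):
                problems.append(f"{name}: missing required property '{prop}'")
        field_type = field.get("type")
        if field_type and field_type not in FIELD_TYPES:
            problems.append(f"{name}: unknown type '{field_type}'")
        # Objects always need nested_fields; arrays need them for object items.
        needs_nested = field_type == "object"
        if field_type == "array":
            items_type = field.get("items_type")
            if items_type not in ITEM_TYPES:
                problems.append(f"{name}: array needs a valid 'items_type'")
            needs_nested = items_type == "object"
        if needs_nested:
            nested = field.get("nested_fields")
            if not nested:
                problems.append(f"{name}: 'nested_fields' is required")
            else:
                problems.extend(validate_fields(nested, path=f"{name}."))
    return problems
```

An empty list means no problems were found. Nested fields are reported with a dotted path such as `billing_address.city`.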

Example Request

curl -X POST "https://sources.graphorlm.com/run-extraction" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file_name": "invoice-2024.pdf",
    "user_instruction": "Extract all invoice information. Use YYYY-MM-DD format for dates.",
    "output_schema_fields": [
      {
        "key": "invoice_number",
        "type": "string",
        "description": "The unique invoice identifier"
      },
      {
        "key": "invoice_date",
        "type": "date",
        "description": "Invoice date in YYYY-MM-DD format"
      },
      {
        "key": "total_amount",
        "type": "number",
        "description": "Total amount due"
      },
      {
        "key": "vendor_name",
        "type": "string",
        "description": "Name of the company issuing the invoice"
      }
    ]
  }'

Example Request with Object and Array Types

curl -X POST "https://sources.graphorlm.com/run-extraction" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file_name": "invoice-2024.pdf",
    "user_instruction": "Extract invoice with line items and address details.",
    "output_schema_fields": [
      {
        "key": "invoice_number",
        "type": "string",
        "description": "The unique invoice identifier"
      },
      {
        "key": "billing_address",
        "type": "object",
        "description": "Billing address details",
        "nested_fields": [
          { "key": "street", "type": "string", "description": "Street address" },
          { "key": "city", "type": "string", "description": "City name" },
          { "key": "zip_code", "type": "string", "description": "Postal code" },
          { "key": "country", "type": "string", "description": "Country name" }
        ]
      },
      {
        "key": "tags",
        "type": "array",
        "description": "Invoice tags or categories",
        "items_type": "string"
      },
      {
        "key": "line_items",
        "type": "array",
        "description": "Invoice line items",
        "items_type": "object",
        "nested_fields": [
          { "key": "description", "type": "string", "description": "Item description" },
          { "key": "quantity", "type": "number", "description": "Item quantity" },
          { "key": "unit_price", "type": "number", "description": "Price per unit" },
          { "key": "total", "type": "number", "description": "Line item total" }
        ]
      }
    ]
  }'

Response

Success Response (200 OK)

Field	Type	Description
`file_name`	string	The file name of the source
`build_id`	string	The parsing version used for extraction
`extracted_items`	array	List of extracted items with page references

Extracted Item Object

Each item in `extracted_items` contains:

Field	Type	Description
`output`	object	Extracted data matching your schema fields
`page_numbers`	array	List of page numbers where the data was found

Example Response

{
  "file_name": "invoice-2024.pdf",
  "build_id": "build_xyz789",
  "extracted_items": [
    {
      "output": {
        "invoice_number": "INV-2024-001",
        "invoice_date": "2024-01-15",
        "total_amount": 1250.00,
        "vendor_name": "Acme Corporation"
      },
      "page_numbers": [1]
    }
  ]
}

Multiple Items Response

When a document contains multiple extractable entities:

{
  "file_name": "product-catalog.pdf",
  "build_id": "build_abc456",
  "extracted_items": [
    {
      "output": {
        "product_name": "Widget Pro",
        "price": 49.99,
        "in_stock": true
      },
      "page_numbers": [2, 3]
    },
    {
      "output": {
        "product_name": "Widget Basic",
        "price": 29.99,
        "in_stock": true
      },
      "page_numbers": [4]
    }
  ]
}
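
When a response contains several items, it can help to group them by page for review. The sketch below assumes the response has already been parsed into a Python dictionary with the shape shown above. The `outputs_by_page` helper is a hypothetical name and not part of the API.

```python
from collections import defaultdict


def outputs_by_page(response):
    """Map each page number to the extracted outputs found on that page."""
    by_page = defaultdict(list)
    for item in response.get("extracted_items", []):
        # An item that spans several pages is listed under each of them.
        for page in item.get("page_numbers", []):
            by_page[page].append(item["output"])
    return dict(by_page)
```

For the catalog response above, the "Widget Pro" output would appear under pages 2 and 3, and "Widget Basic" under page 4.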

Response with Object and Array Types

When using object and array types in your schema:

{
  "file_name": "invoice-2024.pdf",
  "build_id": "build_xyz789",
  "extracted_items": [
    {
      "output": {
        "invoice_number": "INV-2024-001",
        "billing_address": {
          "street": "123 Main Street",
          "city": "New York",
          "zip_code": "10001",
          "country": "USA"
        },
        "tags": ["urgent", "corporate", "Q1-2024"],
        "line_items": [
          {
            "description": "Professional Services",
            "quantity": 10,
            "unit_price": 100.00,
            "total": 1000.00
          },
          {
            "description": "Software License",
            "quantity": 1,
            "unit_price": 250.00,
            "total": 250.00
          }
        ]
      },
      "page_numbers": [1, 2]
    }
  ]
}
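
Nested output can be sanity-checked after extraction. The sketch below flags line items whose `total` does not match `quantity * unit_price`. It assumes the field names from the example schema above and that all three values are present and numeric. The `mismatched_line_items` helper and the tolerance value are illustrative choices only.

```python
import math


def mismatched_line_items(output, tolerance=0.01):
    """Return line items whose total differs from quantity * unit_price."""
    mismatches = []
    for line in output.get("line_items", []):
        expected = line["quantity"] * line["unit_price"]
        # abs_tol absorbs small rounding differences in currency amounts.
        if not math.isclose(expected, line["total"], abs_tol=tolerance):
            mismatches.append(line)
    return mismatches
```

A mismatch does not necessarily mean the extraction is wrong, since the source document itself may include discounts or rounding. It is a hint to review the pages listed in `page_numbers`.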

Error Responses

Status Code	Description
400	Bad Request - Invalid parameters or schema
401	Unauthorized - Invalid or missing API token
404	Not Found - File not found or no parsing history
500	Internal Server Error
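
The status codes above can be translated into readable errors on the client. The following is a sketch using the Python standard library. The `describe_status` and `send` helpers are hypothetical, and the sketch makes no assumption about the format of any error body the API may return.

```python
import json
import urllib.error
import urllib.request

# Messages mirror the status table above.
ERROR_HINTS = {
    400: "Bad Request: invalid parameters or schema",
    401: "Unauthorized: invalid or missing API token",
    404: "Not Found: file not found or no parsing history",
    500: "Internal Server Error",
}


def describe_status(status_code):
    """Return a short description for a documented status code."""
    return ERROR_HINTS.get(status_code, f"Unexpected status {status_code}")


def send(request):
    """Send a prepared urllib request and return the parsed JSON body."""
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        # Re-raise with a message based on the documented status codes.
        raise RuntimeError(describe_status(error.code)) from error
```

`send` expects a `urllib.request.Request` that already carries the `Authorization` and `Content-Type` headers.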

Schema Examples

Invoice Extraction

{
  "file_name": "invoice.pdf",
  "user_instruction": "Extract all invoice details. Convert amounts to numbers without currency symbols.",
  "output_schema_fields": [
    { "key": "invoice_number", "type": "string", "description": "Unique invoice identifier" },
    { "key": "invoice_date", "type": "date", "description": "Invoice date (YYYY-MM-DD)" },
    { "key": "due_date", "type": "date", "description": "Payment due date (YYYY-MM-DD)" },
    { "key": "vendor_name", "type": "string", "description": "Company issuing the invoice" },
    { "key": "customer_name", "type": "string", "description": "Customer being billed" },
    { "key": "subtotal", "type": "number", "description": "Subtotal before tax" },
    { "key": "tax_amount", "type": "number", "description": "Tax amount" },
    { "key": "total_amount", "type": "number", "description": "Total amount due" }
  ]
}

Contract Analysis

{
  "file_name": "contract.pdf",
  "user_instruction": "Extract key contract terms with all parties and obligations.",
  "output_schema_fields": [
    { "key": "contract_title", "type": "string", "description": "Title or name of the contract" },
    { "key": "effective_date", "type": "date", "description": "When the contract becomes effective" },
    { "key": "termination_date", "type": "date", "description": "When the contract ends" },
    { "key": "auto_renewal", "type": "boolean", "description": "Whether contract auto-renews" },
    {
      "key": "parties",
      "type": "array",
      "description": "All parties involved in the contract",
      "items_type": "object",
      "nested_fields": [
        { "key": "name", "type": "string", "description": "Party name" },
        { "key": "role", "type": "string", "description": "Role (e.g., Licensor, Licensee)" },
        { "key": "address", "type": "string", "description": "Party address" }
      ]
    },
    {
      "key": "key_terms",
      "type": "object",
      "description": "Key contract terms and conditions",
      "nested_fields": [
        { "key": "payment_terms", "type": "string", "description": "Payment conditions" },
        { "key": "liability_cap", "type": "number", "description": "Maximum liability amount" },
        { "key": "notice_period_days", "type": "number", "description": "Notice period in days" }
      ]
    }
  ]
}

Resume Parsing

{
  "file_name": "resume.pdf",
  "user_instruction": "Extract complete candidate information including work history and education.",
  "output_schema_fields": [
    { "key": "full_name", "type": "string", "description": "Candidate's full name" },
    { "key": "email", "type": "string", "description": "Email address" },
    { "key": "phone", "type": "string", "description": "Phone number" },
    { "key": "years_experience", "type": "number", "description": "Total years of experience" },
    {
      "key": "skills",
      "type": "array",
      "description": "List of technical and soft skills",
      "items_type": "string"
    },
    {
      "key": "work_experience",
      "type": "array",
      "description": "Work history",
      "items_type": "object",
      "nested_fields": [
        { "key": "company", "type": "string", "description": "Company name" },
        { "key": "title", "type": "string", "description": "Job title" },
        { "key": "start_date", "type": "date", "description": "Start date" },
        { "key": "end_date", "type": "date", "description": "End date (or current)" }
      ]
    },
    {
      "key": "education",
      "type": "array",
      "description": "Educational background",
      "items_type": "object",
      "nested_fields": [
        { "key": "institution", "type": "string", "description": "School or university name" },
        { "key": "degree", "type": "string", "description": "Degree obtained" },
        { "key": "graduation_year", "type": "number", "description": "Year of graduation" }
      ]
    }
  ]
}

Product Catalog

{
  "file_name": "catalog.pdf",
  "user_instruction": "Extract all products with their specifications and variants.",
  "output_schema_fields": [
    { "key": "product_name", "type": "string", "description": "Product name" },
    { "key": "sku", "type": "string", "description": "Product SKU" },
    { "key": "base_price", "type": "number", "description": "Base price" },
    { "key": "in_stock", "type": "boolean", "description": "Whether product is in stock" },
    {
      "key": "specifications",
      "type": "object",
      "description": "Product specifications",
      "nested_fields": [
        { "key": "weight", "type": "number", "description": "Weight in kg" },
        { "key": "dimensions", "type": "string", "description": "Dimensions (LxWxH)" },
        { "key": "material", "type": "string", "description": "Main material" }
      ]
    },
    {
      "key": "categories",
      "type": "array",
      "description": "Product categories",
      "items_type": "string"
    },
    {
      "key": "variants",
      "type": "array",
      "description": "Product variants",
      "items_type": "object",
      "nested_fields": [
        { "key": "color", "type": "string", "description": "Variant color" },
        { "key": "size", "type": "string", "description": "Variant size" },
        { "key": "price_modifier", "type": "number", "description": "Price adjustment" }
      ]
    }
  ]
}

Usage Examples

Python

import requests

url = "https://sources.graphorlm.com/run-extraction"
headers = {
    "Authorization": "Bearer YOUR_API_TOKEN",
    "Content-Type": "application/json"
}

# Basic extraction
payload = {
    "file_name": "invoice.pdf",
    "user_instruction": "Extract invoice information. Use YYYY-MM-DD for dates.",
    "output_schema_fields": [
        {"key": "invoice_number", "type": "string", "description": "Invoice ID"},
        {"key": "total_amount", "type": "number", "description": "Total due"},
        {"key": "invoice_date", "type": "date", "description": "Invoice date"}
    ]
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # Raise on 4xx/5xx before reading the body
data = response.json()

for item in data["extracted_items"]:
    print(f"Invoice: {item['output']['invoice_number']}")
    print(f"Amount: ${item['output']['total_amount']}")
    print(f"Found on pages: {item['page_numbers']}")

Python with Object and Array Types

import requests

url = "https://sources.graphorlm.com/run-extraction"
headers = {
    "Authorization": "Bearer YOUR_API_TOKEN",
    "Content-Type": "application/json"
}

# Extraction with nested objects and arrays
payload = {
    "file_name": "invoice.pdf",
    "user_instruction": "Extract invoice with line items and address.",
    "output_schema_fields": [
        {"key": "invoice_number", "type": "string", "description": "Invoice ID"},
        {
            "key": "billing_address",
            "type": "object",
            "description": "Billing address",
            "nested_fields": [
                {"key": "street", "type": "string", "description": "Street"},
                {"key": "city", "type": "string", "description": "City"},
                {"key": "country", "type": "string", "description": "Country"}
            ]
        },
        {
            "key": "line_items",
            "type": "array",
            "description": "Invoice line items",
            "items_type": "object",
            "nested_fields": [
                {"key": "description", "type": "string", "description": "Item description"},
                {"key": "quantity", "type": "number", "description": "Quantity"},
                {"key": "price", "type": "number", "description": "Unit price"}
            ]
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # Raise on 4xx/5xx before reading the body
data = response.json()

for item in data["extracted_items"]:
    output = item["output"]
    print(f"Invoice: {output['invoice_number']}")
    print(f"City: {output['billing_address']['city']}")
    print("Line Items:")
    for line in output["line_items"]:
        print(f"  - {line['description']}: {line['quantity']} x ${line['price']}")

JavaScript

const API_URL = "https://sources.graphorlm.com/run-extraction";
const API_TOKEN = "YOUR_API_TOKEN";

async function extractData(fileName, instruction, schema) {
  const response = await fetch(API_URL, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${API_TOKEN}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      file_name: fileName,
      user_instruction: instruction,
      output_schema_fields: schema
    })
  });
  
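  // Surface 4xx/5xx responses instead of parsing them as results
  if (!response.ok) {
    throw new Error(`Extraction failed with status ${response.status}`);
  }

  return response.json();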
}

// Basic usage
const result = await extractData(
  "invoice.pdf",
  "Extract invoice details",
  [
    { key: "invoice_number", type: "string", description: "Invoice ID" },
    { key: "total_amount", type: "number", description: "Total due" }
  ]
);

result.extracted_items.forEach(item => {
  console.log(item.output);
  console.log(`Pages: ${item.page_numbers.join(", ")}`);
});

JavaScript with Object and Array Types

const API_URL = "https://sources.graphorlm.com/run-extraction";
const API_TOKEN = "YOUR_API_TOKEN";

// Extraction with nested objects and arrays
const schema = [
  { key: "invoice_number", type: "string", description: "Invoice ID" },
  {
    key: "billing_address",
    type: "object",
    description: "Billing address",
    nested_fields: [
      { key: "street", type: "string", description: "Street" },
      { key: "city", type: "string", description: "City" },
      { key: "country", type: "string", description: "Country" }
    ]
  },
  {
    key: "tags",
    type: "array",
    description: "Invoice tags",
    items_type: "string"
  },
  {
    key: "line_items",
    type: "array",
    description: "Invoice line items",
    items_type: "object",
    nested_fields: [
      { key: "description", type: "string", description: "Item description" },
      { key: "quantity", type: "number", description: "Quantity" },
      { key: "price", type: "number", description: "Unit price" }
    ]
  }
];

const response = await fetch(API_URL, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${API_TOKEN}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    file_name: "invoice.pdf",
    user_instruction: "Extract invoice with all details",
    output_schema_fields: schema
  })
});

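// Surface 4xx/5xx responses instead of parsing them as results
if (!response.ok) {
  throw new Error(`Extraction failed with status ${response.status}`);
}

const data = await response.json();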

data.extracted_items.forEach(item => {
  const { invoice_number, billing_address, tags, line_items } = item.output;
  
  console.log(`Invoice: ${invoice_number}`);
  console.log(`Address: ${billing_address.street}, ${billing_address.city}`);
  console.log(`Tags: ${tags.join(", ")}`);
  
  console.log("Line Items:");
  line_items.forEach(line => {
    console.log(`  - ${line.description}: ${line.quantity} x $${line.price}`);
  });
});

Best Practices

Be specific in descriptions — Detailed field descriptions improve extraction accuracy
Use appropriate types — Match field types to expected data (number for amounts, date for dates)
Provide clear instructions — Guide the extraction with format preferences and edge cases
Handle multiple items — Design your schema for documents that may contain multiple entities
Check page references — Use page_numbers to verify extraction accuracy
Use objects for structured data — Group related fields using object type (e.g., address with street, city, zip)
Use arrays for lists — Extract repeating items using array type with appropriate items_type
Keep nesting shallow — Avoid deeply nested structures for better extraction accuracy
Use primitive arrays when possible — For simple lists (tags, categories), use items_type: "string" instead of objects
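
The "keep nesting shallow" guideline can be checked mechanically. The sketch below measures how deeply `nested_fields` are nested in a schema. The `nesting_depth` helper is hypothetical, and the documentation above does not state a specific depth limit, so any threshold you apply is your own choice.

```python
def nesting_depth(fields):
    """Return the deepest level of nested_fields in a schema (flat schema = 1)."""
    deepest = 0
    for field in fields:
        nested = field.get("nested_fields") or []
        deepest = max(deepest, nesting_depth(nested))
    # An empty field list has depth 0; otherwise count this level plus children.
    return 1 + deepest if fields else 0
```

A schema with only primitive fields has depth 1, and an object or array of objects with primitive nested fields has depth 2, which matches every example on this page.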

Related

Data Extraction Guide

Learn schema design and extraction best practices

Data Ingestion

Improve parsing quality for better extraction results
