Graphor’s Data Extraction feature transforms unstructured documents into structured, actionable data. Define custom output schemas, provide natural language instructions, and extract exactly the information you need, with full page-level provenance.

Overview

Data Extraction uses LLM-powered processing to extract structured information from your documents. Common use cases include:
  • Invoice processing — Extract invoice numbers, dates, amounts, and line items
  • Contract analysis — Pull key terms, parties, dates, and obligations
  • Resume parsing — Extract contact info, skills, experience, and education
  • Product catalogs — Capture product names, prices, descriptions, and specifications
  • Research papers — Extract titles, authors, abstracts, and citations

How It Works

  1. Parse your document — First, ingest and parse your document using any parsing method
  2. Define your schema — Specify the fields you want to extract with types and descriptions
  3. Add instructions — Provide optional natural language guidance for the extraction
  4. Run extraction — The LLM processes the document and extracts matching data
  5. Review results — View extracted data with page-level provenance

Accessing Data Extraction

To access the Data Extraction feature:
  1. Navigate to Sources in the left sidebar
  2. Double-click on a processed document to open Source details
  3. Click the Extraction tab at the top center of the page
You can run Data Extraction on any parsing version of your document. Select the desired version from the Versions panel before running the extraction.

Defining Your Schema

The extraction schema defines what information to extract from your document. Each field in your schema has three components:

Field Name

The key that will be used in the extracted data output. Use descriptive, snake_case names:
  • invoice_number
  • total_amount
  • customer_name
  • line_items
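
A quick way to check candidate field names before entering them in the schema builder is a small validator. The sketch below is illustrative only: `is_snake_case` is a hypothetical helper, and this document does not state which naming rules (if any) Graphor actually enforces.

```python
import re

# Hypothetical rule: lowercase words separated by single underscores.
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Return True if `name` looks like snake_case (e.g. 'invoice_number')."""
    return _SNAKE_CASE.fullmatch(name) is not None


for candidate in ["invoice_number", "TotalAmount", "customer name", "line_items"]:
    print(candidate, is_snake_case(candidate))
```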

Field Type

Choose the appropriate data type for each field:
| Type | Description | Example Output |
|---|---|---|
| Text | String values | "INV-2024-001" |
| Number | Numeric values | 299.99 |
| Boolean | True/false values | true |
| Date | Date values (YYYY-MM-DD) | "2024-01-15" |
| Object | Nested structured data | {"street": "123 Main St", "city": "NYC"} |
| Array | Lists of values | ["item1", "item2"] or [{...}, {...}] |
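
The example outputs above map naturally onto JSON values. As a rough sketch, a check like the following could verify that an extracted value matches its declared type. The helper `matches_type` is hypothetical, and the lowercase type names `boolean` and `date` are assumptions: the API examples in this document only show `string`, `number`, `object`, and `array`.

```python
import re
from datetime import date

_ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def matches_type(value, field_type: str) -> bool:
    """Check a JSON-decoded value against a field type name (illustrative)."""
    if field_type == "string":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "boolean":  # assumed type name
        return isinstance(value, bool)
    if field_type == "date":  # assumed type name; expects YYYY-MM-DD
        if not isinstance(value, str) or not _ISO_DATE.fullmatch(value):
            return False
        try:
            date.fromisoformat(value)  # rejects impossible dates
        except ValueError:
            return False
        return True
    if field_type == "object":
        return isinstance(value, dict)
    if field_type == "array":
        return isinstance(value, list)
    raise ValueError(f"unknown field type: {field_type}")
```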

Object Type

Use the Object type when you need to group related fields together. This is useful for:
  • Addresses (street, city, zip, country)
  • Contact information (name, email, phone)
  • Specifications (weight, dimensions, material)
When you select Object type, you can define nested fields with their own key, type, and description.
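
In the request format shown in the API Integration section of this document (`key`, `type`, `description`, `nested_fields`), an Object field for contact information might be assembled as follows. The helpers `field` and `object_field` are hypothetical conveniences for this sketch, not part of any Graphor SDK.

```python
import json


def field(key: str, type_: str, description: str) -> dict:
    """Build a plain field spec."""
    return {"key": key, "type": type_, "description": description}


def object_field(key: str, description: str, nested_fields: list) -> dict:
    """Build an object field spec whose children live in `nested_fields`."""
    spec = field(key, "object", description)
    spec["nested_fields"] = list(nested_fields)
    return spec


contact = object_field(
    "contact",
    "Primary contact information",
    [
        field("name", "string", "Full name of the contact person"),
        field("email", "string", "Email address"),
        field("phone", "string", "Phone number including country code"),
    ],
)
print(json.dumps(contact, indent=2))
```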

Array Type

Use the Array type for extracting lists of items. Arrays require you to specify the Items Type:
| Items Type | Use Case | Example |
|---|---|---|
| Text | List of strings | Tags, categories, skills |
| Number | List of numbers | Page numbers, quantities |
| Boolean | List of booleans | Feature flags |
| Date | List of dates | Event dates, milestones |
| Object | List of structured items | Line items, work experience, products |
When using Object as the items type, you define nested fields that apply to each item in the array.
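
The relationship between the items type and nested fields can be sketched as a small builder that uses the `items_type` and `nested_fields` keys from the API examples in this document. The function `array_field` is hypothetical, and the two validation rules it applies are assumptions about sensible usage rather than documented API behavior.

```python
def array_field(key, description, items_type, nested_fields=None):
    """Build an array field spec; nested_fields describe each object item."""
    if items_type == "object" and not nested_fields:
        raise ValueError("arrays of objects need nested_fields")
    if items_type != "object" and nested_fields:
        raise ValueError("nested_fields only apply when items_type is 'object'")
    spec = {
        "key": key,
        "type": "array",
        "description": description,
        "items_type": items_type,
    }
    if nested_fields:
        spec["nested_fields"] = list(nested_fields)
    return spec


# A simple list of strings.
skills = array_field("skills", "List of technical and soft skills", "string")

# A list of structured items: the nested fields apply to every element.
line_items = array_field(
    "line_items",
    "Invoice line items",
    "object",
    nested_fields=[
        {"key": "description", "type": "string", "description": "Item description"},
        {"key": "quantity", "type": "number", "description": "Quantity"},
    ],
)
```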

Field Description

A natural language description that helps the LLM understand what to extract. Be specific and include:
  • What the field represents
  • Expected format (if applicable)
  • Any special instructions
Good descriptions:
  • “The unique invoice identifier, usually starting with ‘INV-’”
  • “Total amount due in USD, as a number without currency symbols”
  • “List of all product names mentioned in the document”
Avoid vague descriptions:
  • “The number”
  • “Amount”
  • “Items”
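
One crude way to catch vague descriptions before running an extraction is to flag very short ones. This is only a heuristic sketch: `is_vague` and its four-word threshold are assumptions chosen so that the examples above fall on the expected side, not a rule used by Graphor.

```python
def is_vague(description: str, min_words: int = 4) -> bool:
    """Flag descriptions with fewer than `min_words` words (rough heuristic)."""
    return len(description.split()) < min_words


descriptions = {
    "invoice_number": "The unique invoice identifier, usually starting with 'INV-'",
    "total_amount": "Amount",
    "line_items": "Items",
}
# Collect the field names whose descriptions probably need more detail.
needs_work = [key for key, text in descriptions.items() if is_vague(text)]
print(needs_work)
```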

Writing Effective Instructions

Instructions provide additional context and guidance for the extraction process. They help the LLM understand:
  • Scope — What parts of the document to focus on
  • Format — How to format the extracted data
  • Edge cases — How to handle ambiguous situations
  • Multiple items — How to handle documents with multiple extractable entities

Example Instructions

For invoice extraction:

    Extract all invoice information from the document. If multiple invoices
    are present, extract each one as a separate item. Use USD for currency
    values and convert dates to YYYY-MM-DD format. If a field is not found,
    leave it empty rather than guessing.

For contract analysis:

    Focus on extracting the main contractual terms. For dates, use ISO format
    (YYYY-MM-DD). For monetary values, include the currency. If there are
    multiple parties, list all of them in the parties field.

For product catalog:

    Extract each product as a separate item. Prices should be numeric values
    in the document's currency. Include all product variants as separate items.

Running an Extraction

Once your schema and instructions are ready:
  1. Review your field definitions in the schema builder
  2. Add your instructions in the instructions text area
  3. Click Extract to start the extraction process
  4. Wait for the extraction to complete (processing time varies by document size)
The extraction runs asynchronously. You can navigate away and return later — the results will be saved.

Viewing Results

After extraction completes, the Results view displays:

Extracted Data Table

A structured table showing all extracted items with:
  • Each row representing one extracted entity
  • Columns for each field in your schema
  • Values extracted from the document

Page References

Each extracted item includes page numbers indicating where the information was found. This provides:
  • Traceability — Know exactly where each piece of data came from
  • Verification — Quickly check the source for accuracy
  • Context — Understand the surrounding content

Export Options

Export your extracted data for use in other systems:
  • JSON — Structured data for programmatic use
  • CSV — Tabular format for spreadsheets and databases
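
If you export JSON and need a tabular file later, the conversion can be done with the standard library. The sketch below assumes the export is a list of objects keyed by your schema's field names; the actual export layout is not specified in this document, and `items_to_csv` is a hypothetical helper. Nested objects and arrays are stored as JSON strings so that each item stays on one row.

```python
import csv
import io
import json


def items_to_csv(items, fieldnames):
    """Serialize extracted items to CSV text, JSON-encoding nested values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for item in items:
        row = {}
        for name in fieldnames:
            value = item.get(name, "")
            if isinstance(value, (dict, list)):
                value = json.dumps(value)
            row[name] = value
        writer.writerow(row)
    return buffer.getvalue()


items = [
    {
        "invoice_number": "INV-2024-001",
        "total_amount": 299.99,
        "line_items": [{"description": "Widget", "quantity": 2}],
    }
]
csv_text = items_to_csv(items, ["invoice_number", "total_amount", "line_items"])
print(csv_text)
```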

Schema Examples

Invoice Extraction

Rows marked with ↳ are nested fields of the Object or Array (Object) field above them.

| Field Name | Type | Description |
|---|---|---|
| invoice_number | Text | The unique invoice identifier |
| invoice_date | Date | Invoice date in YYYY-MM-DD format |
| due_date | Date | Payment due date in YYYY-MM-DD format |
| vendor_name | Text | Name of the company issuing the invoice |
| customer_name | Text | Name of the customer being billed |
| billing_address | Object | Billing address details |
| ↳ street | Text | Street address |
| ↳ city | Text | City name |
| ↳ zip_code | Text | Postal code |
| ↳ country | Text | Country name |
| subtotal | Number | Subtotal amount before tax |
| tax_amount | Number | Tax amount |
| total_amount | Number | Total amount due |
| line_items | Array (Object) | List of products/services |
| ↳ description | Text | Item description |
| ↳ quantity | Number | Quantity ordered |
| ↳ unit_price | Number | Price per unit |
| ↳ total | Number | Line item total |
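
A schema like this one can be translated into the `output_schema_fields` format used in the API Integration section of this document. The sketch below converts a few of the invoice rows. The mapping of Text, Number, Object, and Array to `string`, `number`, `object`, and `array` follows the API examples in this document; the names `boolean` and `date` are assumptions, and `to_api_field` is a hypothetical helper.

```python
UI_TO_API_TYPE = {
    "Text": "string",
    "Number": "number",
    "Object": "object",
    "Array": "array",
    "Boolean": "boolean",  # assumed API name
    "Date": "date",  # assumed API name
}


def to_api_field(name, ui_type, description, children=(), items_ui_type=None):
    """Convert one schema row, plus optional nested rows, to an API field spec."""
    spec = {"key": name, "type": UI_TO_API_TYPE[ui_type], "description": description}
    if ui_type == "Array":
        spec["items_type"] = UI_TO_API_TYPE[items_ui_type]
    if children:
        # Each child is a (name, ui_type, description) tuple.
        spec["nested_fields"] = [to_api_field(*child) for child in children]
    return spec


invoice_schema = [
    to_api_field("invoice_number", "Text", "The unique invoice identifier"),
    to_api_field("total_amount", "Number", "Total amount due"),
    to_api_field(
        "billing_address",
        "Object",
        "Billing address details",
        children=[
            ("street", "Text", "Street address"),
            ("city", "Text", "City name"),
            ("zip_code", "Text", "Postal code"),
            ("country", "Text", "Country name"),
        ],
    ),
    to_api_field(
        "line_items",
        "Array",
        "List of products/services",
        items_ui_type="Object",
        children=[
            ("description", "Text", "Item description"),
            ("quantity", "Number", "Quantity ordered"),
            ("unit_price", "Number", "Price per unit"),
            ("total", "Number", "Line item total"),
        ],
    ),
]
```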

Resume Parsing

Rows marked with ↳ are nested fields of the Array (Object) field above them.

| Field Name | Type | Description |
|---|---|---|
| full_name | Text | Candidate’s full name |
| email | Text | Email address |
| phone | Text | Phone number |
| location | Text | City and country |
| summary | Text | Professional summary or objective |
| skills | Array (Text) | List of technical and soft skills |
| work_experience | Array (Object) | Work history |
| ↳ company | Text | Company name |
| ↳ title | Text | Job title |
| ↳ start_date | Date | Employment start date |
| ↳ end_date | Date | Employment end date (or “Present”) |
| ↳ responsibilities | Text | Key responsibilities |
| education | Array (Object) | Educational background |
| ↳ institution | Text | School or university name |
| ↳ degree | Text | Degree obtained |
| ↳ graduation_year | Number | Year of graduation |

Product Catalog

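Rows marked with ↳ are nested fields of the Object or Array (Object) field above them.

| Field Name | Type | Description |
|---|---|---|
| product_name | Text | Name of the product |
| sku | Text | Product SKU or identifier |
| price | Number | Product price |
| description | Text | Product description |
| in_stock | Boolean | Whether the product is in stock |
| categories | Array (Text) | Product categories |
| specifications | Object | Product specifications |
| ↳ weight | Number | Weight in kg |
| ↳ dimensions | Text | Dimensions (LxWxH) |
| ↳ material | Text | Main material |
| variants | Array (Object) | Product variants |
| ↳ color | Text | Variant color |
| ↳ size | Text | Variant size |
| ↳ price_modifier | Number | Price adjustment |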

Best Practices

Schema Design

  1. Start simple — Begin with a few essential fields, then expand
  2. Be specific — Detailed descriptions produce better results
  3. Use appropriate types — Match the field type to expected data
  4. Consider edge cases — Think about what happens when data is missing
  5. Use objects for structured data — Group related fields (address, contact info) using Object type
  6. Use arrays for repeating items — Line items, work history, and skills are perfect for Array type
  7. Keep nesting shallow — Avoid deeply nested structures for better extraction accuracy
  8. Choose the right array items type — Use Text arrays for simple lists (tags, skills), Object arrays for complex items (line items, experience)
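
To keep an eye on nesting while a schema grows, you can measure its depth. This sketch assumes the schema is expressed in the `key` / `nested_fields` format from the API Integration section of this document; `nesting_depth` is a hypothetical helper, and no particular depth limit is documented here.

```python
def nesting_depth(fields) -> int:
    """Return 1 for a flat schema, plus 1 for each level of nested_fields."""
    if not fields:
        return 0
    return 1 + max(nesting_depth(f.get("nested_fields", [])) for f in fields)


flat_schema = [
    {"key": "invoice_number", "type": "string", "description": "Invoice ID"},
]
nested_schema = [
    {
        "key": "line_items",
        "type": "array",
        "items_type": "object",
        "description": "Invoice line items",
        "nested_fields": [
            {"key": "quantity", "type": "number", "description": "Quantity"},
        ],
    },
]
print(nesting_depth(flat_schema), nesting_depth(nested_schema))
```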

Instructions

  1. Be explicit about format — Specify date formats, currency handling, etc.
  2. Handle multiples — Clarify how to handle multiple items (e.g., multiple invoices)
  3. Set defaults — Explain what to do when information isn’t found
  4. Provide context — Mention the document type if relevant

Parsing Method Selection

The quality of extraction depends on the quality of parsing. For best results:
  • Complex layouts — Use MAI or Graphor parsing methods
  • Scanned documents — Use Hi-Res or Hi-Res FT for better OCR
  • Simple text documents — Fast method is usually sufficient

API Integration

Data Extraction is available via the REST API for programmatic use:

Basic Extraction

curl -X POST "https://sources.graphorlm.com/run-extraction" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file_name": "invoice.pdf",
    "user_instruction": "Extract all invoice information. Use YYYY-MM-DD for dates.",
    "output_schema_fields": [
      {
        "key": "invoice_number",
        "type": "string",
        "description": "The unique invoice identifier"
      },
      {
        "key": "total_amount",
        "type": "number",
        "description": "Total amount due"
      }
    ]
  }'

Extraction with Object and Array Types

curl -X POST "https://sources.graphorlm.com/run-extraction" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file_name": "invoice.pdf",
    "user_instruction": "Extract invoice with line items and billing address.",
    "output_schema_fields": [
      {
        "key": "invoice_number",
        "type": "string",
        "description": "The unique invoice identifier"
      },
      {
        "key": "billing_address",
        "type": "object",
        "description": "Billing address details",
        "nested_fields": [
          { "key": "street", "type": "string", "description": "Street address" },
          { "key": "city", "type": "string", "description": "City name" },
          { "key": "country", "type": "string", "description": "Country name" }
        ]
      },
      {
        "key": "tags",
        "type": "array",
        "description": "Invoice tags",
        "items_type": "string"
      },
      {
        "key": "line_items",
        "type": "array",
        "description": "Invoice line items",
        "items_type": "object",
        "nested_fields": [
          { "key": "description", "type": "string", "description": "Item description" },
          { "key": "quantity", "type": "number", "description": "Quantity" },
          { "key": "unit_price", "type": "number", "description": "Price per unit" }
        ]
      }
    ]
  }'
The API returns the extracted data for the active document version. See the Extraction API Reference for complete documentation.
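
The same request can be issued from Python using only the standard library. This is a minimal sketch that reuses the endpoint, headers, and body keys from the curl examples above. The function names `build_extraction_request` and `run_extraction` are hypothetical, and the assumption that the response body is JSON should be checked against the Extraction API Reference.

```python
import json
import urllib.request

API_URL = "https://sources.graphorlm.com/run-extraction"


def build_extraction_request(api_token, file_name, user_instruction, output_schema_fields):
    """Prepare (but do not send) the POST request for an extraction."""
    payload = {
        "file_name": file_name,
        "user_instruction": user_instruction,
        "output_schema_fields": output_schema_fields,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_extraction(request, timeout=60.0):
    """Send the request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_extraction_request(
    "YOUR_API_TOKEN",
    "invoice.pdf",
    "Extract all invoice information. Use YYYY-MM-DD for dates.",
    [
        {"key": "invoice_number", "type": "string", "description": "The unique invoice identifier"},
        {"key": "total_amount", "type": "number", "description": "Total amount due"},
    ],
)
# result = run_extraction(request)  # uncomment once a real token is in place
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.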

Troubleshooting

If fields are missing or incorrect:
  • Improve field descriptions with more specific guidance
  • Add detailed instructions for edge cases
  • Try a different parsing method for better document understanding
  • Verify the document is properly parsed before extraction
Extraction time depends on document size and complexity:
  • Large documents take longer to process
  • Complex schemas with many fields require more processing
  • Consider extracting from specific page ranges for large documents
For scanned or image-heavy documents:
  • Use Hi-Res, Hi-Res FT, or MAI parsing methods
  • Ensure the document was properly OCR’d during parsing
  • Check the parsing results before running extraction
When extracting multiple items (e.g., multiple invoices in one document):
  • Explicitly state in instructions how to handle multiples
  • Each extracted entity appears as a separate row in results
  • Page references help identify which item came from where
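
When checking results for missing fields, a short script can list which expected keys came back empty for each extracted item. The sketch assumes the results are available as a list of objects keyed by field name (for example from a JSON export); that layout is an assumption, and `missing_fields` is a hypothetical helper.

```python
def missing_fields(items, expected_keys):
    """Map each item index to the expected keys that are absent or empty."""
    report = {}
    for index, item in enumerate(items):
        # 0 and False count as real values; None, "", [] and {} count as empty.
        missing = [k for k in expected_keys if item.get(k) in (None, "", [], {})]
        if missing:
            report[index] = missing
    return report


results = [
    {"invoice_number": "INV-2024-001", "total_amount": 299.99, "line_items": []},
    {"invoice_number": "", "total_amount": 0, "line_items": [{"quantity": 1}]},
]
report = missing_fields(results, ["invoice_number", "total_amount", "line_items"])
print(report)
```

Fields that show up repeatedly in the report are good candidates for a more specific description or an extra instruction.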

Next Steps