The Extractor node uses an LLM to extract structured information from documents based on custom schemas. It processes documents from Dataset or Chunking nodes with intelligent batching and parallel file processing, and supports multimodal content including text, HTML, and images.

Overview

The Extractor node transforms unstructured documents into structured data by:
  1. Processing documents — Receives documents from Dataset or Chunking nodes
  2. Applying custom schemas — Extracts data according to your defined fields
  3. Using LLM intelligence — Leverages language models to understand and extract information
  4. Outputting structured data — Returns results as structured JSON that can be exported to CSV
The Extractor node is different from Data Extraction on the Sources page: the Extractor node is part of a RAG pipeline and processes documents in batches, while Data Extraction works on individual documents with page-level provenance.

Using the Extractor Node

Adding the Extractor Node

  1. Open your flow in the Flow Builder
  2. Drag the Extractor node from the sidebar onto the canvas
  3. Connect an input node to the Extractor:
    • Dataset → Extractor (extracts from raw documents)
    • Chunking → Extractor (extracts from chunked content)
  4. Double-click the Extractor node to configure

Input Connections

The Extractor node accepts input from:
| Source Node | Use Case |
|---|---|
| Dataset | Extract from full documents (uses Mistral for PDFs/images) |
| Chunking | Extract from chunked content (better for large documents) |

Output Connections

The Extractor node can connect to:
| Target Node | Use Case |
|---|---|
| Response | Output extracted data as the pipeline result |

Configuring the Extractor Node

Double-click the Extractor node to open the configuration panel.

Defining Your Schema

The schema defines what information to extract. Each field has:
| Property | Description | Example |
|---|---|---|
| Key | Field name in the output | invoice_number |
| Type | Data type (string, number, boolean, date, object, array) | string |
| Description | What to extract | "The unique invoice identifier" |
| Example | Sample value (helps the LLM) | "INV-2024-001" |
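
For illustration only, a field defined with these properties maps naturally onto a JSON-Schema-style property. You configure fields through the panel rather than writing JSON, and the actual internal format is not documented here, so treat this as a conceptual sketch:

```json
{
  "invoice_number": {
    "type": "string",
    "description": "The unique invoice identifier",
    "example": "INV-2024-001"
  }
}
```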

Adding Fields

  1. In the Settings tab, click Add Field
  2. Fill in the field properties:
    • Key: Use snake_case names (e.g., customer_name)
    • Type: Choose the appropriate data type
    • Description: Be specific about what to extract
    • Example: Provide a realistic example value
  3. Repeat for all fields you need

Field Types

| Type | Description | Example Output |
|---|---|---|
| string | Text values | "John Doe" |
| number | Numeric values | 299.99 |
| boolean | True/false values | true |
| date | Date values | "2024-01-15" |
| object | Nested structured data | {"street": "123 Main St", "city": "NYC"} |
| array | Lists of values | ["item1", "item2"] or [{...}, {...}] |

Object Type

Use object type to group related fields together (e.g., address, specifications). When selecting object, define nested fields with their own key, type, and description.
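
For example, an address field defined as an object with nested street and city fields yields nested JSON in the output, matching the object example in the Field Types table above (the values shown are illustrative):

```json
{
  "address": {
    "street": "123 Main St",
    "city": "New York"
  }
}
```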

Array Type

Use array type for lists. Specify the Items Type to define what the array contains:
| Items Type | Use Case | Example |
|---|---|---|
| string | List of text | Tags, skills |
| number | List of numbers | Quantities |
| boolean | List of booleans | Feature flags |
| date | List of dates | Event dates |
| object | List of structured items | Line items, experience |
When using object as items type, define nested fields that apply to each array item.
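
As a sketch, a line_items field of type array with object items (nested fields description, quantity, and price) would be extracted as a list of uniform objects; the values below are illustrative:

```json
{
  "line_items": [
    { "description": "Widget A", "quantity": 2, "price": 50.00 },
    { "description": "Widget B", "quantity": 1, "price": 75.00 }
  ]
}
```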

Schema Examples

Invoice Extraction

| Key | Type | Description | Example |
|---|---|---|---|
| invoice_number | string | The unique invoice identifier | INV-2024-001 |
| invoice_date | date | Invoice date in YYYY-MM-DD format | 2024-01-15 |
| vendor_name | string | Name of the company issuing the invoice | Acme Corp |
| total_amount | number | Total amount due | 1250.00 |
| billing_address | object | Billing address details | - |
| ↳ street | string | Street address | 123 Main St |
| ↳ city | string | City name | New York |
| ↳ country | string | Country | USA |
| line_items | array (object) | List of products/services | - |
| ↳ description | string | Item description | Widget A |
| ↳ quantity | number | Quantity | 2 |
| ↳ price | number | Unit price | 50.00 |
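
With this schema, the structured JSON for one extracted invoice would look roughly like the following; the values and any surrounding result wrapper are illustrative assumptions:

```json
{
  "invoice_number": "INV-2024-001",
  "invoice_date": "2024-01-15",
  "vendor_name": "Acme Corp",
  "total_amount": 1250.00,
  "billing_address": {
    "street": "123 Main St",
    "city": "New York",
    "country": "USA"
  },
  "line_items": [
    { "description": "Widget A", "quantity": 2, "price": 50.00 }
  ]
}
```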

Contract Analysis

| Key | Type | Description | Example |
|---|---|---|---|
| contract_title | string | Title or name of the contract | Service Agreement |
| effective_date | date | When the contract becomes effective | 2024-02-01 |
| termination_date | date | When the contract ends | 2025-01-31 |
| auto_renewal | boolean | Whether the contract auto-renews | true |
| parties | array (object) | Parties involved in the contract | - |
| ↳ name | string | Party name | Company A |
| ↳ role | string | Role in contract | Licensor |
| key_terms | object | Key contract terms | - |
| ↳ payment_terms | string | Payment conditions | Net 30 |
| ↳ notice_period | number | Notice period in days | 30 |

Product Catalog

| Key | Type | Description | Example |
|---|---|---|---|
| product_name | string | Name of the product | Widget Pro |
| sku | string | Stock keeping unit identifier | WDG-PRO-001 |
| price | number | Product price | 49.99 |
| in_stock | boolean | Whether the product is available | true |
| features | array (string) | List of product features | ["Durable", "Lightweight"] |
| specifications | object | Product specifications | - |
| ↳ weight | number | Weight in kg | 0.5 |
| ↳ dimensions | string | Dimensions | 10x5x3 cm |
| variants | array (object) | Product variants | - |
| ↳ color | string | Variant color | Blue |
| ↳ size | string | Variant size | Large |
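
This schema combines an array of strings (features), an object (specifications), and an array of objects (variants); one extracted product would look roughly like this, with illustrative values:

```json
{
  "product_name": "Widget Pro",
  "sku": "WDG-PRO-001",
  "price": 49.99,
  "in_stock": true,
  "features": ["Durable", "Lightweight"],
  "specifications": { "weight": 0.5, "dimensions": "10x5x3 cm" },
  "variants": [
    { "color": "Blue", "size": "Large" }
  ]
}
```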

Viewing Results

After running the extraction (click Update Results):
  1. Go to the Results tab
  2. View extracted data in a table format
  3. Each row represents one extracted item
  4. Columns correspond to your schema fields

Exporting Results

Click Download CSV to export the extracted data:
  • All fields are included as columns
  • Each extracted item is a row
  • Values are properly escaped for CSV format

Best Practices

Schema Design

  1. Start simple — Begin with essential fields, then expand
  2. Be specific in descriptions — Tell the LLM exactly what to look for
  3. Provide examples — Example values help the LLM understand the expected format
  4. Use appropriate types — Match field types to expected data
  5. Use objects for structured data — Group related fields (address, specifications) using object type
  6. Use arrays for lists — Line items, skills, and features are perfect for array type
  7. Keep nesting shallow — Avoid deeply nested structures for better extraction accuracy

Input Selection

| Scenario | Recommended Input |
|---|---|
| Small documents (< 50 pages) | Dataset node |
| Large documents | Chunking node |
| Image-heavy PDFs | Dataset node (uses Mistral) |
| Text-heavy documents | Either works well |

Performance Tips

  1. Limit concurrent files — Default is 3-5 for optimal balance
  2. Reduce batch size for images — Image processing is more resource-intensive
  3. Use chunking for large docs — Better memory management and extraction quality

Pipeline Examples

Direct Extraction Pipeline

Dataset → Extractor → Response
Best for: Simple extraction from a collection of documents.

Chunked Extraction Pipeline

Dataset → Chunking → Extractor → Response
Best for: Large documents that need to be split for better extraction.

Combined RAG + Extraction Pipeline

| Path | Flow |
|---|---|
| RAG Path | Dataset → Chunking → Retrieval → LLM → Response |
| Extraction Path | Dataset → Extractor → Response |
Best for: Both Q&A and structured extraction from the same documents.

Troubleshooting

If no data is extracted:
  • Verify schema fields have clear descriptions
  • Add example values to guide the LLM
  • Check that input documents contain the expected information
  • Try using Chunking node for better document processing
If seeing duplicate items:
  • The Extractor automatically deduplicates, but check your schema
  • Ensure key fields are unique identifiers
  • Review if documents contain repeated information
If extraction is taking too long:
  • Reduce the number of input documents
  • Use Chunking node to split large documents
  • Consider extracting fewer fields
  • Check document complexity (image-heavy PDFs take longer)
If extracted values have wrong types:
  • Be explicit in field descriptions about expected format
  • Add examples that match the expected type
  • Use specific instructions (e.g., “as a number without currency symbols”)

Next Steps