This page documents how to ingest content into your Graphor project using the SDK. Ingestion is asynchronous: each ingest method returns a build_id immediately. You then call get_build_status to poll until processing completes and read the file_id for use in other API calls.
Supported sources: local file, web page URL, GitHub repository, and YouTube video.
Async flow
Call one of the ingest methods (ingest_file, ingest_url, ingest_github, ingest_youtube). The method returns a build_id.
Call get_build_status(build_id) to poll. When the returned status is completed, use the file_id for ask, extract, list elements, delete, etc.
Available Methods
Python

| Method | SDK call | Description |
| --- | --- | --- |
| Get build status | client.sources.get_build_status(build_id) | Poll status and optional elements for an async ingestion |
| Ingest file | client.sources.ingest_file() | Upload a local file; processing runs in the background |
| Ingest URL | client.sources.ingest_url() | Ingest a public web page by URL (async) |
| Ingest GitHub | client.sources.ingest_github() | Ingest a public GitHub repository (async) |
| Ingest YouTube | client.sources.ingest_youtube() | Ingest a public YouTube video (async) |

TypeScript

| Method | SDK call | Description |
| --- | --- | --- |
| Get build status | client.sources.getBuildStatus(buildId) | Poll status and optional elements for an async ingestion |
| Ingest file | client.sources.ingestFile() | Upload a local file; processing runs in the background |
| Ingest URL | client.sources.ingestURL() | Ingest a public web page by URL (async) |
| Ingest GitHub | client.sources.ingestGitHub() | Ingest a public GitHub repository (async) |
| Ingest YouTube | client.sources.ingestYoutube() | Ingest a public YouTube video (async) |
Installation
Python 3.9 or higher is required.
TypeScript 4.9+ and Node.js 20+ (LTS) are recommended.
Authentication
All SDK methods require authentication using an API key. You can provide your API key in two ways:
Environment Variable (Recommended)
Set the GRAPHOR_API_KEY environment variable:
export GRAPHOR_API_KEY="grlm_your_api_key_here"
Then initialize the client without any arguments:
from graphor import Graphor
client = Graphor()
import Graphor from 'graphor';
const client = new Graphor();
Direct Initialization
from graphor import Graphor
client = Graphor(api_key="grlm_your_api_key_here")
import Graphor from 'graphor';
const client = new Graphor({ apiKey: 'grlm_your_api_key_here' });
Never hardcode API keys in your source code. Use environment variables or a secrets manager.
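When relying on the environment variable, it can help to fail early with a clear message if the key is missing. The following is a minimal sketch under that assumption. `load_api_key` is a hypothetical helper, not part of the SDK; it only reads the `GRAPHOR_API_KEY` variable described above.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the key stored in GRAPHOR_API_KEY or raise a clear error.

    Hypothetical helper for illustration; it is not part of the graphor SDK.
    """
    source = os.environ if env is None else env
    key = source.get("GRAPHOR_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GRAPHOR_API_KEY is not set; export it before creating the client"
        )
    return key
```

The returned value could then be passed as `api_key` when constructing the client, or the check can simply run at startup before `Graphor()` is called without arguments.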
Get build status
Poll the status of an async ingestion (or reprocess). Use the build_id returned by any ingest method or by reprocess.
Method Signature
client.sources.get_build_status(
    build_id: str,  # Required
    suppress_elements: bool = False,
    suppress_img_base64: bool = False,
    page: int | None = None,
    page_size: int | None = None,
    timeout: float | None = None
) -> BuildStatus

await client.sources.getBuildStatus(buildId: string, options?: {
  suppressElements?: boolean;
  suppressImgBase64?: boolean;
  page?: number;
  pageSize?: number;
}): Promise<BuildStatus>
Return value
When the build has been persisted, the response includes success, status, file_id, file_name, and optionally paginated elements. Possible status values:
Completed — Build finished successfully; use file_id for subsequent calls.
Processing — Build is running; keep polling.
Pending — Request was received but the build has not started yet; keep polling.
Processing failed — Build failed; check error for details.
not_found — No history yet (build not started or invalid build_id).
Use file_id from a response where success is true for subsequent API calls.
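The status values listed above fall into three groups: finished, keep polling, and failed. The sketch below makes that mapping explicit. `PollAction` and `next_action` are hypothetical names, and the exact casing of the strings returned by the API is an assumption, so the comparison is case-insensitive.

```python
from enum import Enum


class PollAction(Enum):
    DONE = "done"
    WAIT = "wait"
    FAILED = "failed"


# Values taken from the list above; casing returned by the API is assumed.
_WAIT_STATUSES = {"processing", "pending", "not_found"}


def next_action(status: str) -> PollAction:
    """Map a build status string to what a polling loop should do next."""
    normalized = status.strip().lower()
    if normalized == "completed":
        return PollAction.DONE
    if normalized in _WAIT_STATUSES:
        return PollAction.WAIT
    # "Processing failed" and any unrecognized value are treated as failures.
    return PollAction.FAILED
```

Because not_found also covers an invalid build_id, a loop that waits on it should be paired with an overall deadline so it cannot poll forever.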
Poll until complete
from pathlib import Path
import time
from graphor import Graphor

client = Graphor()
response = client.sources.ingest_file(file=Path("./document.pdf"))
build_id = response.build_id
while True:
    status = client.sources.get_build_status(build_id)
    if status.success:
        file_id = status.file_id
        print(f"Ready. file_id: {file_id}")
        break
    # not_found can mean the build has not started yet, so keep polling
    if status.error and status.status != "not_found":
        raise RuntimeError(status.error)
    time.sleep(2)

import Graphor from 'graphor';
import fs from 'fs';

const client = new Graphor();
const { build_id: buildId } = await client.sources.ingestFile({ file: fs.createReadStream('./document.pdf') });
while (true) {
  const status = await client.sources.getBuildStatus(buildId);
  if (status.success) {
    console.log('Ready. file_id:', status.file_id);
    break;
  }
  // not_found can mean the build has not started yet, so keep polling
  if (status.error && status.status !== 'not_found') throw new Error(status.error);
  await new Promise((r) => setTimeout(r, 2000));
}
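The loops above poll every two seconds with no upper bound. For long-running builds, a deadline and a growing delay may be preferable. The following sketch assumes `get_status` behaves like `client.sources.get_build_status` and returns an object exposing `success`, `status`, `error`, and `file_id`. `wait_for_build` and its parameters are hypothetical and not part of the SDK.

```python
import time
from typing import Any, Callable


def wait_for_build(
    get_status: Callable[[str], Any],
    build_id: str,
    timeout_s: float = 600.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the build succeeds and return its file_id.

    Raises RuntimeError if the build reports an error and TimeoutError if
    the next wait would go past the deadline.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        status = get_status(build_id)
        if status.success:
            return status.file_id
        # Same rule as the loops above: not_found alone is not a failure.
        if status.error and status.status != "not_found":
            raise RuntimeError(status.error)
        if clock() + delay > deadline:
            raise TimeoutError(f"build {build_id} not ready after {timeout_s} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

A call such as `wait_for_build(client.sources.get_build_status, build_id)` would then replace the hand-written loop. The `sleep` and `clock` parameters exist only so the helper can be exercised in tests without real waiting.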
Ingest file
Upload a local file and schedule ingestion in the background. Returns a build_id; use Get build status to poll until the source is ready.
Method Signature
client.sources.ingest_file(
    file: FileTypes,  # Required
    method: str | None = None,  # Optional: fast, balanced, accurate, vlm, agentic
    timeout: float | None = None
) -> SourceIngestFileResponse
Returns SourceIngestFileResponse with .build_id.

await client.sources.ingestFile({
  file: Uploadable, // Required
  method?: 'fast' | 'balanced' | 'accurate' | 'vlm' | 'agentic' | null,
}): Promise<SourceIngestFileResponse>
Returns SourceIngestFileResponse with .build_id.
Parameters
Python

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| file | FileTypes | The file to upload. Accepts bytes, Path, or tuple (filename, contents, media_type) | Yes |
| method | str \| None | One of: fast, balanced, accurate, vlm, agentic (see Partition methods below) | No |
| timeout | float | Request timeout in seconds (default: 60) | No |

TypeScript

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| file | Uploadable | The file to upload. Accepts ReadStream, File, Response, or toFile() helper | Yes |
| method | string \| null | One of: fast, balanced, accurate, vlm, agentic (see Partition methods below) | No |
Partition methods
When provided, method controls how the document is parsed. If omitted, the system default is used.
| Value | Name | Description |
| --- | --- | --- |
| "fast" | Fast | Fast processing with heuristic classification. No OCR. |
| "balanced" | Balanced | OCR-based extraction with structure classification. |
| "accurate" | Accurate | Fine-tuned model for highest accuracy (Premium). |
| "vlm" | VLM | Best for manuscripts and handwritten content. |
| "agentic" | Agentic | Highest accuracy for complex layouts, tables, and diagrams. |
File requirements
Documents: PDF, DOC, DOCX, ODT, TXT, TEXT, MD, HTML, HTM
Presentations: PPT, PPTX
Spreadsheets: CSV, TSV, XLS, XLSX
Images: PNG, JPG, JPEG, TIFF, BMP, HEIC
Audio: MP3, WAV, M4A, OGG, FLAC
Video: MP4, MOV, AVI, MKV, WEBM
Maximum file size: 100 MB per file. The request must include a Content-Length so the server can enforce the limit.
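Since unsupported types and oversized files are rejected by the server, a local pre-flight check can save a round trip. The sketch below encodes the extension list and size limit stated above. `check_file` is a hypothetical helper, and treating "100 MB" as 100 × 1024 × 1024 bytes is an assumption; the server remains the authority on both rules.

```python
from pathlib import Path

# Assumption: "100 MB" is interpreted as 100 MiB. Adjust if the API counts 10**6 bytes.
MAX_FILE_BYTES = 100 * 1024 * 1024

SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".odt", ".txt", ".text", ".md", ".html", ".htm",
    ".ppt", ".pptx",
    ".csv", ".tsv", ".xls", ".xlsx",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic",
    ".mp3", ".wav", ".m4a", ".ogg", ".flac",
    ".mp4", ".mov", ".avi", ".mkv", ".webm",
}


def check_file(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {path.suffix or '(none)'}")
    if not path.is_file():
        problems.append("file does not exist")
    elif path.stat().st_size > MAX_FILE_BYTES:
        problems.append(f"file is larger than {MAX_FILE_BYTES} bytes")
    return problems
```

Calling `check_file(Path("./document.pdf"))` before `ingest_file` and skipping the upload when the list is non-empty keeps obvious failures local.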
Code examples
Ingest file and poll until ready
from pathlib import Path
import time
from graphor import Graphor

client = Graphor()
# ingest_file returns a SourceIngestFileResponse; the ID is on .build_id
response = client.sources.ingest_file(file=Path("./document.pdf"))
build_id = response.build_id
print(f"Build ID: {build_id}")
while True:
    status = client.sources.get_build_status(build_id)
    if status.success:
        print(f"Ready. file_id: {status.file_id}")
        break
    if status.error and status.status != "not_found":
        raise RuntimeError(status.error)
    time.sleep(2)

import Graphor from 'graphor';
import fs from 'fs';

const client = new Graphor();
const { build_id: buildId } = await client.sources.ingestFile({
  file: fs.createReadStream('./document.pdf'),
});
console.log('Build ID:', buildId);
while (true) {
  const status = await client.sources.getBuildStatus(buildId);
  if (status.success) {
    console.log('Ready. file_id:', status.file_id);
    break;
  }
  if (status.error && status.status !== 'not_found') throw new Error(status.error);
  await new Promise((r) => setTimeout(r, 2000));
}
Ingest with partition method
from pathlib import Path
from graphor import Graphor

client = Graphor()
response = client.sources.ingest_file(
    file=Path("./document.pdf"),
    method="balanced"
)
print(f"Build ID: {response.build_id}")

import Graphor from 'graphor';
import fs from 'fs';

const client = new Graphor();
const { build_id: buildId } = await client.sources.ingestFile({
  file: fs.createReadStream('./document.pdf'),
  method: 'balanced',
});
console.log('Build ID:', buildId);
Ingest from bytes / buffer
from graphor import Graphor

client = Graphor()
with open("document.pdf", "rb") as f:
    content = f.read()
# Tuple form: (filename, contents, media_type)
response = client.sources.ingest_file(file=("document.pdf", content, "application/pdf"))
print(f"Build ID: {response.build_id}")

import Graphor, { toFile } from 'graphor';
import fs from 'fs';

const client = new Graphor();
const buffer = fs.readFileSync('document.pdf');
const { build_id: buildId } = await client.sources.ingestFile({
  file: await toFile(buffer, 'document.pdf'),
});
console.log('Build ID:', buildId);
Batch ingest (returns build_ids)
from pathlib import Path
from graphor import Graphor

client = Graphor()
supported = {'.pdf', '.doc', '.docx', '.txt', '.md', '.html'}
build_ids = []
for path in Path("./documents").iterdir():
    if path.suffix.lower() in supported:
        try:
            # Collect the build_id string from each response
            bid = client.sources.ingest_file(file=path).build_id
            build_ids.append(bid)
            print(f"OK - Scheduled: {path.name} -> {bid}")
        except Exception as e:
            print(f"FAIL - {path.name}: {e}")
print(f"Summary: {len(build_ids)} scheduled")

import Graphor from 'graphor';
import fs from 'fs';
import path from 'path';

const client = new Graphor();
const exts = new Set(['.pdf', '.doc', '.docx', '.txt', '.md', '.html']);
const buildIds: string[] = [];
for (const file of fs.readdirSync('./documents')) {
  if (!exts.has(path.extname(file).toLowerCase())) continue;
  try {
    const { build_id: bid } = await client.sources.ingestFile({
      file: fs.createReadStream(path.join('./documents', file)),
    });
    buildIds.push(bid);
    console.log(`OK - Scheduled: ${file} -> ${bid}`);
  } catch (err) {
    console.log(`FAIL - ${file}:`, err);
  }
}
console.log(`Summary: ${buildIds.length} scheduled`);
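Scheduling a batch only yields build_ids; each one still has to be polled before its file_id is usable. The following sketch polls a list of builds in rounds and separates successes from failures. `wait_for_builds` is a hypothetical helper, and `get_status` is assumed to behave like `client.sources.get_build_status`, returning an object with `success`, `status`, `error`, and `file_id`.

```python
import time
from typing import Any, Callable, Dict, Iterable, Tuple


def wait_for_builds(
    get_status: Callable[[str], Any],
    build_ids: Iterable[str],
    interval_s: float = 2.0,
    max_rounds: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Poll every build until it finishes; return (file_ids, errors) keyed by build_id."""
    pending = list(build_ids)
    file_ids: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for _ in range(max_rounds):
        still_pending = []
        for build_id in pending:
            status = get_status(build_id)
            if status.success:
                file_ids[build_id] = status.file_id
            elif status.error and status.status != "not_found":
                errors[build_id] = status.error
            else:
                still_pending.append(build_id)
        pending = still_pending
        if not pending:
            break
        sleep(interval_s)
    # Anything still pending after max_rounds is reported as a timeout.
    for build_id in pending:
        errors[build_id] = "timed out while waiting for the build"
    return file_ids, errors
```

With the batch example above, `wait_for_builds(client.sources.get_build_status, build_ids)` would return one dictionary of ready file_ids and one of error messages.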
Error handling
Ingest methods throw on invalid file type, missing Content-Length, size over 100 MB, or server errors. Use Get build status to detect processing failures (e.g. status.error).
import graphor
from graphor import Graphor
from pathlib import Path

client = Graphor()
try:
    response = client.sources.ingest_file(file=Path("./document.pdf"))
    print(f"Scheduled. Build ID: {response.build_id}")
except graphor.BadRequestError as e:
    print(f"Invalid file type or request: {e}")
except graphor.APIStatusError as e:
    print(f"API error (status {e.status_code}): {e}")

import Graphor from 'graphor';
import fs from 'fs';

const client = new Graphor();
try {
  const { build_id: buildId } = await client.sources.ingestFile({
    file: fs.createReadStream('./document.pdf'),
  });
  console.log('Scheduled. Build ID:', buildId);
} catch (err) {
  if (err instanceof Graphor.BadRequestError) {
    console.log('Invalid file type or request:', err.message);
  } else if (err instanceof Graphor.APIError) {
    console.log('API error (status ' + err.status + '):', err.message);
  } else {
    throw err;
  }
}
Ingest URL
Ingest a web page by URL (async). Returns a build_id; use Get build status to poll until ready.
Method signature
client.sources.ingest_url(
    url: str,  # Required
    crawl_urls: bool = False,
    method: str | None = None,  # Optional: fast, balanced, accurate, vlm, agentic
    timeout: float | None = None
) -> SourceIngestURLResponse
Returns SourceIngestURLResponse with .build_id.

await client.sources.ingestURL({
  url: string, // Required
  crawlUrls?: boolean,
  method?: 'fast' | 'balanced' | 'accurate' | 'vlm' | 'agentic' | null,
}): Promise<SourceIngestURLResponse>
Returns SourceIngestURLResponse with .build_id.
Parameters
Python

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| url | str | The web page URL to ingest | Yes |
| crawl_urls | bool | When true, follow and ingest links from the page (default: False) | No |
| method | str \| None | One of: fast, balanced, accurate, vlm, agentic | No |
| timeout | float | Request timeout in seconds | No |

TypeScript

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| url | string | The web page URL to ingest | Yes |
| crawlUrls | boolean | When true, follow and ingest links from the page | No |
| method | string \| null | One of: fast, balanced, accurate, vlm, agentic | No |
URL Requirements
Public web pages
Pages that render primary content server-side and are reachable without interaction
The URL must be publicly reachable over HTTPS
Authentication-protected pages are not supported
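Some of these requirements can be checked locally before calling the API. The sketch below covers only what a URL string can reveal: the HTTPS scheme, the presence of a host, and the absence of embedded credentials. `looks_ingestable` is a hypothetical helper; it cannot tell whether a page is actually public or server-rendered.

```python
from urllib.parse import urlsplit


def looks_ingestable(url: str) -> bool:
    """Cheap local check: HTTPS scheme, a host, and no embedded credentials."""
    parts = urlsplit(url.strip())
    if parts.scheme != "https":
        return False
    if not parts.hostname:
        return False
    # user:password@host URLs imply authentication, which is not supported.
    if parts.username or parts.password:
        return False
    return True
```

A `True` result is only a necessary condition. The build status remains the place to learn whether the page could be fetched and parsed.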
Code Examples
Basic URL ingest
from graphor import Graphor

client = Graphor()
response = client.sources.ingest_url(url="https://example.com/article")
print(f"Build ID: {response.build_id}")

import Graphor from 'graphor';

const client = new Graphor();
const { build_id: buildId } = await client.sources.ingestURL({ url: 'https://example.com/article' });
console.log('Build ID:', buildId);
Ingest with crawling
response = client.sources.ingest_url(
    url="https://example.com/documentation",
    crawl_urls=True
)
build_id = response.build_id

const { build_id: buildId } = await client.sources.ingestURL({
  url: 'https://example.com/documentation',
  crawlUrls: true,
});
Ingest GitHub
Ingest a public GitHub repository (async). Returns a build_id; use Get build status to poll until ready.
Method signature
client.sources.ingest_github(
    url: str,  # Required
    timeout: float | None = None
) -> str
Returns build_id (str).

await client.sources.ingestGitHub({
  url: string, // Required
}): Promise<string>
Returns build_id (string).
Parameters
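Python

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| url | str | GitHub repo URL (e.g. https://github.com/org/repo) | Yes |
| timeout | float | Request timeout in seconds | No |

TypeScript

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| url | string | GitHub repo URL (e.g. https://github.com/org/repo) | Yes |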
Repository Requirements
Public GitHub repositories
HTTPS URLs (https://github.com/...)
Only public repositories are supported
Private repository ingestion is not supported
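The URL shape can be validated before a request is made. The sketch below accepts only root repository URLs of the form https://github.com/owner/repo; whether the API also accepts deeper paths (branches, subdirectories) is not stated here, so the strictness is an assumption. `parse_github_repo` is a hypothetical helper, and it cannot detect whether a repository is private.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def parse_github_repo(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) for https://github.com/owner/repo URLs, else None."""
    parts = urlsplit(url.strip())
    if parts.scheme != "https":
        return None
    if parts.hostname not in ("github.com", "www.github.com"):
        return None
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 2:
        return None
    owner, repo = segments
    # Tolerate clone-style URLs that end in .git
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return (owner, repo) if repo else None
```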
Code Examples
Basic GitHub ingest
from graphor import Graphor

client = Graphor()
build_id = client.sources.ingest_github(url="https://github.com/organization/repository")
print(f"Build ID: {build_id}")

import Graphor from 'graphor';

const client = new Graphor();
// Per the signature above, ingestGitHub resolves to the build_id string
const buildId = await client.sources.ingestGitHub({
  url: 'https://github.com/organization/repository',
});
console.log('Build ID:', buildId);
Ingest YouTube
Ingest a public YouTube video (async). Returns a build_id; use Get build status to poll until ready.
Method signature
client.sources.ingest_youtube(
    url: str,  # Required
    timeout: float | None = None
) -> str
Returns build_id (str).

await client.sources.ingestYoutube({
  url: string, // Required
}): Promise<string>
Returns build_id (string).
Parameters
Python

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| url | str | YouTube video URL (e.g. https://www.youtube.com/watch?v=...) | Yes |
| timeout | float | Request timeout in seconds | No |

TypeScript

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| url | string | YouTube video URL (e.g. https://www.youtube.com/watch?v=...) | Yes |
Video Requirements
Public YouTube video URLs (HTTPS)
Standard watch URLs (https://www.youtube.com/watch?v=VIDEO_ID)
The video must be publicly accessible
Private or access-restricted videos are not supported
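A standard watch URL can be recognized locally by its scheme, host, path, and `v` query parameter. The sketch below does exactly that and nothing more. `youtube_video_id` is a hypothetical helper; it does not cover short links or other URL forms, and it cannot tell whether a video is private.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def youtube_video_id(url: str) -> Optional[str]:
    """Return the v= parameter of a standard HTTPS watch URL, else None."""
    parts = urlsplit(url.strip())
    if parts.scheme != "https":
        return None
    if parts.hostname not in ("www.youtube.com", "youtube.com"):
        return None
    if parts.path != "/watch":
        return None
    values = parse_qs(parts.query).get("v", [])
    return values[0] if values else None
```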
Code Examples
Basic YouTube ingest
from graphor import Graphor

client = Graphor()
build_id = client.sources.ingest_youtube(url="https://www.youtube.com/watch?v=VIDEO_ID")
print(f"Build ID: {build_id}")

import Graphor from 'graphor';

const client = new Graphor();
// Per the signature above, ingestYoutube resolves to the build_id string
const buildId = await client.sources.ingestYoutube({
  url: 'https://www.youtube.com/watch?v=VIDEO_ID',
});
console.log('Build ID:', buildId);
Advanced Configuration
Custom timeout
For large files or slow connections, increase the ingest request timeout. Use Get build status with a suitable poll interval for long-running processing.
from graphor import Graphor
from pathlib import Path

client = Graphor(timeout=300.0)  # 5 minutes
# Or per-request
response = client.with_options(timeout=300.0).sources.ingest_file(
    file=Path("./large-document.pdf")
)
build_id = response.build_id

import Graphor from 'graphor';
import fs from 'fs';

const client = new Graphor({ timeout: 300 * 1000 }); // 5 minutes, in milliseconds
// Or per-request
const { build_id: buildId } = await client.sources.ingestFile(
  { file: fs.createReadStream('./large-document.pdf') },
  { timeout: 300 * 1000 },
);
Retry configuration
Configure automatic retries for transient errors on ingest or get_build_status:
from pathlib import Path
from graphor import Graphor

client = Graphor(max_retries=5)
# Or per-request
response = client.with_options(max_retries=5).sources.ingest_file(file=Path("./document.pdf"))
build_id = response.build_id

const client = new Graphor({ maxRetries: 5 });
// Or per-request
const { build_id: buildId } = await client.sources.ingestFile(
  { file: fs.createReadStream('./document.pdf') },
  { maxRetries: 5 },
);
Accessing raw response (Python only)
response = client.sources.with_raw_response.ingest_file(file=Path("./document.pdf"))
print("Headers:", response.headers)
parsed = response.parse()  # SourceIngestFileResponse
build_id = parsed.build_id
Using aiohttp for concurrency (Python only)
# pip install graphor[aiohttp]
import asyncio
from graphor import AsyncGraphor, DefaultAioHttpClient

async def ingest_many_files(file_paths: list):
    async with AsyncGraphor(http_client=DefaultAioHttpClient()) as client:
        tasks = [client.sources.ingest_file(file=p) for p in file_paths]
        return await asyncio.gather(*tasks, return_exceptions=True)
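Because the gather call uses return_exceptions=True, failures come back mixed in with successful responses, and every upload starts at once. The sketch below shows one way to cap concurrency with a semaphore and split the results afterwards. `gather_bounded` and `worker` are hypothetical names; `worker` stands in for any coroutine function, such as one that wraps the async client's ingest_file call.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence, Tuple


async def gather_bounded(
    items: Sequence[Any],
    worker: Callable[[Any], Awaitable[Any]],
    limit: int = 5,
) -> Tuple[List[Tuple[Any, Any]], List[Tuple[Any, BaseException]]]:
    """Run worker(item) for every item with at most `limit` running at once.

    Returns (successes, failures); each entry pairs the item with its result
    or with the exception it raised.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run(item: Any) -> Any:
        async with semaphore:
            return await worker(item)

    results = await asyncio.gather(*(run(i) for i in items), return_exceptions=True)
    successes: List[Tuple[Any, Any]] = []
    failures: List[Tuple[Any, BaseException]] = []
    for item, result in zip(items, results):
        if isinstance(result, BaseException):
            failures.append((item, result))
        else:
            successes.append((item, result))
    return successes, failures
```

Inside the `async with AsyncGraphor(...)` block, a call like `await gather_bounded(file_paths, lambda p: client.sources.ingest_file(file=p))` would return the scheduled responses and the failed paths separately.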
Error Reference
| Error Type | Status Code | Description |
| --- | --- | --- |
| BadRequestError | 400 | Invalid file type, missing filename, or malformed request |
| AuthenticationError | 401 | Invalid or missing API key |
| PermissionDeniedError | 403 | Access denied to the specified project |
| NotFoundError | 404 | Project or source not found |
| RateLimitError | 429 | Too many requests, please retry after waiting |
| InternalServerError | ≥500 | Server-side processing error |
| APIConnectionError | N/A | Network connectivity issues |
| APITimeoutError | N/A | Request timed out |
Next Steps
After ingesting, use Get build status to wait until processing completes, then:
Reprocess source: Reprocess a source with a different partition method
List sources: List all sources (optionally filter by file_ids)
Get elements: Retrieve parsed elements/chunks from a source
Delete source: Remove a source by file_id