This page documents how to ingest content into your Graphor project using the SDK. Ingestion is asynchronous: each ingest method returns a build_id immediately. You then poll Get build status until processing completes and read the file_id for use in other API calls. Supported sources: local file, web page URL, GitHub repository, and YouTube video.

Async flow

  1. Call one of the ingest methods (ingest_file, ingest_url, ingest_github, ingest_youtube). The method returns a build_id.
  2. Call get_build_status(build_id) to poll. When the returned status is completed, use the file_id for ask, extract, list elements, delete, etc.

Available Methods

Get build status

client.sources.get_build_status(build_id)
Poll status and optional elements for an async ingestion

Ingest file

client.sources.ingest_file()
Upload a local file; processing runs in the background

Ingest URL

client.sources.ingest_url()
Ingest a public web page by URL (async)

Ingest GitHub

client.sources.ingest_github()
Ingest a public GitHub repository (async)

Ingest YouTube

client.sources.ingest_youtube()
Ingest a public YouTube video (async)

Installation

pip install graphor
Python 3.9 or higher is required.

Authentication

All SDK methods require authentication using an API key. You can provide your API key in two ways.

Environment Variable

Set the GRAPHOR_API_KEY environment variable:
export GRAPHOR_API_KEY="grlm_your_api_key_here"
Then initialize the client without any arguments:
from graphor import Graphor

client = Graphor()

Direct Initialization

from graphor import Graphor

client = Graphor(api_key="grlm_your_api_key_here")
Never hardcode API keys in your source code. Use environment variables or a secrets manager.
Learn how to create and manage API tokens in the API Tokens guide.
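
To fail early with a clear message when the key is missing, the environment can be checked before the client is constructed. This is a minimal sketch: load_api_key is a hypothetical helper, not part of the graphor SDK.

```python
import os


def load_api_key(env=None, var_name="GRAPHOR_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper, not part of the graphor SDK. `env` defaults to
    os.environ and can be replaced with a plain dict in tests.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before creating the client")
    return key
```

Typical use would be client = Graphor(api_key=load_api_key()), which keeps the key out of source code while surfacing a missing variable at startup rather than on the first request.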

Get build status

Poll the status of an async ingestion (or reprocess). Use the build_id returned by any ingest method or by reprocess.

Method Signature

client.sources.get_build_status(
    build_id: str,                                    # Required
    suppress_elements: bool = False,
    suppress_img_base64: bool = False,
    page: int | None = None,
    page_size: int | None = None,
    timeout: float | None = None
) -> BuildStatus

Return value

When the build has been persisted, the response includes success, status, file_id, file_name, and optionally paginated elements. Possible status values:
  • Completed — Build finished successfully; use file_id for subsequent calls.
  • Processing — Build is running; keep polling.
  • Pending — Request was received but the build has not started yet; keep polling.
  • Processing failed — Build failed; check error for details.
  • not_found — No history yet (build not started or invalid build_id).
Use file_id from a response where success is true for subsequent API calls.
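
The status values above can be folded into the three decisions a polling loop has to make. The sketch below is an assumption-laden illustration: classify_status is a hypothetical helper, and because the list above mixes spellings such as "Completed" and "not_found", the value is normalized before comparison rather than matched exactly.

```python
def classify_status(status):
    """Map a build status string to 'done', 'wait', 'failed', or 'unknown'.

    Hypothetical helper. Assumption: casing and spacing of the returned
    string may differ from the documentation, so it is normalized first.
    """
    normalized = (status or "").strip().lower().replace(" ", "_")
    if normalized == "completed":
        return "done"
    if normalized in {"processing", "pending", "not_found"}:
        # not_found can also mean an invalid build_id, so callers should
        # bound how long they keep waiting on it.
        return "wait"
    if normalized == "processing_failed":
        return "failed"
    return "unknown"
```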

Poll until complete

from pathlib import Path
from graphor import Graphor
import time

client = Graphor()

response = client.sources.ingest_file(file=Path("./document.pdf"))
build_id = response.build_id
while True:
    status = client.sources.get_build_status(build_id)
    if status.success:
        file_id = status.file_id
        print(f"Ready. file_id: {file_id}")
        break
    if status.error and status.status != "not_found":
        raise RuntimeError(status.error)
    time.sleep(2)
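
The loop above runs until the build succeeds or reports an error, so an invalid build_id (which keeps returning not_found) would poll forever. A bounded variant might look like the sketch below. wait_for_build is a hypothetical helper; it assumes only the response fields used above (success, status, error, file_id) and takes the status call as a parameter so it can be exercised without network access.

```python
import time


def wait_for_build(get_status, build_id, timeout_s=600.0, interval_s=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(build_id) until success, a reported error, or timeout.

    `get_status` is any callable returning an object with .success, .status,
    .error and .file_id, for example client.sources.get_build_status.
    Returns the file_id on success.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status(build_id)
        if status.success:
            return status.file_id
        if status.error and status.status != "not_found":
            raise RuntimeError(status.error)
        if clock() >= deadline:
            raise TimeoutError(f"build {build_id} not ready after {timeout_s}s")
        sleep(interval_s)
```

Typical use would be file_id = wait_for_build(client.sources.get_build_status, build_id). The default timeout and interval are arbitrary placeholders, not values recommended by Graphor.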

Ingest file

Upload a local file and schedule ingestion in the background. Returns a build_id; use Get build status to poll until the source is ready.

Method Signature

client.sources.ingest_file(
    file: FileTypes,                              # Required
    method: str | None = None,                   # Optional: fast, balanced, accurate, vlm, agentic
    timeout: float | None = None
) -> SourceIngestFileResponse
Returns SourceIngestFileResponse with .build_id.

Parameters

Parameter  Type        Description                                                                         Required
file       FileTypes   The file to upload. Accepts bytes, Path, or tuple (filename, contents, media_type)  Yes
method     str | None  One of: fast, balanced, accurate, vlm, agentic (see Partition methods below)        No
timeout    float       Request timeout in seconds (default: 60)                                            No

Partition methods

When provided, method controls how the document is parsed. If omitted, the system default is used.
Value       Name      Description
"fast"      Fast      Fast processing with heuristic classification. No OCR.
"balanced"  Balanced  OCR-based extraction with structure classification.
"accurate"  Accurate  Fine-tuned model for highest accuracy (Premium).
"vlm"       VLM       Best for manuscripts and handwritten content.
"agentic"   Agentic   Highest accuracy for complex layouts, tables, and diagrams.
For more details, see Reprocess source documentation.

File requirements

Documents: PDF, DOC, DOCX, ODT, TXT, TEXT, MD, HTML, HTM · Presentations: PPT, PPTX · Spreadsheets: CSV, TSV, XLS, XLSX · Images: PNG, JPG, JPEG, TIFF, BMP, HEIC · Audio: MP3, WAV, M4A, OGG, FLAC · Video: MP4, MOV, AVI, MKV, WEBM
Maximum file size: 100 MB per file. The request must include a Content-Length header so the server can enforce the limit.
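
A local pre-flight check can catch unsupported extensions and oversized files before an upload is attempted. The following is a sketch only: check_file is a hypothetical helper, the extension set is copied from the list above, and the limit is assumed to mean 100 * 1024 * 1024 bytes, which the text above does not specify.

```python
from pathlib import Path

# Extensions copied from the File requirements list above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".odt", ".txt", ".text", ".md", ".html", ".htm",
    ".ppt", ".pptx",
    ".csv", ".tsv", ".xls", ".xlsx",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic",
    ".mp3", ".wav", ".m4a", ".ogg", ".flac",
    ".mp4", ".mov", ".avi", ".mkv", ".webm",
}
# Assumption: "100 MB" is interpreted as binary megabytes.
MAX_FILE_BYTES = 100 * 1024 * 1024


def check_file(path, max_bytes=MAX_FILE_BYTES):
    """Return a list of problems found; an empty list means the file looks OK."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} is not a file"]
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {path.suffix or '(none)'}")
    size = path.stat().st_size
    if size > max_bytes:
        problems.append(f"file is {size} bytes; limit is {max_bytes}")
    return problems
```

The server remains the authority on what is accepted; a check like this only avoids round trips for obvious rejections.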

Code examples

Ingest file and poll until ready

from pathlib import Path
from graphor import Graphor
import time

client = Graphor()

response = client.sources.ingest_file(file=Path("./document.pdf"))
build_id = response.build_id
print(f"Build ID: {build_id}")

while True:
    status = client.sources.get_build_status(build_id)
    if status.success:
        print(f"Ready. file_id: {status.file_id}")
        break
    if status.error and status.status != "not_found":
        raise RuntimeError(status.error)
    time.sleep(2)

Ingest with partition method

from pathlib import Path
from graphor import Graphor

client = Graphor()

response = client.sources.ingest_file(
    file=Path("./document.pdf"),
    method="balanced"
)
print(f"Build ID: {response.build_id}")

Ingest from bytes / buffer

from graphor import Graphor
client = Graphor()
with open("document.pdf", "rb") as f:
    content = f.read()
response = client.sources.ingest_file(file=("document.pdf", content, "application/pdf"))
print(f"Build ID: {response.build_id}")

Batch ingest (returns build_ids)

from pathlib import Path
from graphor import Graphor
client = Graphor()
supported = {'.pdf', '.doc', '.docx', '.txt', '.md', '.html'}
build_ids = []
for path in Path("./documents").iterdir():
    if path.suffix.lower() in supported:
        try:
            bid = client.sources.ingest_file(file=path).build_id
            build_ids.append(bid)
            print(f"OK - Scheduled: {path.name} -> {bid}")
        except Exception as e:
            print(f"FAIL - {path.name}: {e}")
print(f"Summary: {len(build_ids)} scheduled")
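
Scheduling a batch only yields build IDs; each one still has to be polled before its file_id is usable. One possible way to wait on several builds at once is sketched below. wait_for_builds is a hypothetical helper that assumes the same response fields as the polling example (success, status, error, file_id) and takes the status call as a parameter.

```python
import time


def wait_for_builds(get_status, build_ids, timeout_s=900.0, interval_s=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll several builds; return (ready, failed) dicts keyed by build_id.

    ready maps build_id -> file_id; failed maps build_id -> error message.
    Builds still pending at the deadline are reported as "timed out".
    """
    pending = list(build_ids)
    ready, failed = {}, {}
    deadline = clock() + timeout_s
    while pending:
        still_pending = []
        for build_id in pending:
            status = get_status(build_id)
            if status.success:
                ready[build_id] = status.file_id
            elif status.error and status.status != "not_found":
                failed[build_id] = status.error
            else:
                still_pending.append(build_id)
        pending = still_pending
        if pending:
            if clock() >= deadline:
                for build_id in pending:
                    failed[build_id] = "timed out"
                break
            sleep(interval_s)
    return ready, failed
```

Typical use would be ready, failed = wait_for_builds(client.sources.get_build_status, build_ids). Each round issues one status request per pending build, so the interval may need to grow for large batches to stay within rate limits.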

Error handling

Ingest methods raise an exception on invalid file type, missing Content-Length, size over 100 MB, or server errors. Use Get build status to detect processing failures (e.g. status.error).
import graphor
from graphor import Graphor
from pathlib import Path
client = Graphor()
try:
    response = client.sources.ingest_file(file=Path("./document.pdf"))
    print(f"Scheduled. Build ID: {response.build_id}")
except graphor.BadRequestError as e:
    print(f"Invalid file type or request: {e}")
except graphor.APIStatusError as e:
    print(f"API error (status {e.status_code}): {e}")

Ingest URL

Ingest a web page by URL (async). Returns a build_id; use Get build status to poll until ready.

Method signature

client.sources.ingest_url(
    url: str,                                     # Required
    crawl_urls: bool = False,
    method: str | None = None,                   # Optional: fast, balanced, accurate, vlm, agentic
    timeout: float | None = None
) -> SourceIngestURLResponse
Returns SourceIngestURLResponse with .build_id.

Parameters

Parameter   Type        Description                                                        Required
url         str         The web page URL to ingest                                         Yes
crawl_urls  bool        When true, follow and ingest links from the page (default: False)  No
method      str | None  One of: fast, balanced, accurate, vlm, agentic                     No
timeout     float       Request timeout in seconds                                         No

URL Requirements

  • Public web pages
  • Pages that render primary content server-side and are reachable without interaction
  • The URL must be publicly reachable over HTTPS
  • Authentication-protected pages are not supported
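
The HTTPS requirement can be checked locally before a request is sent. This is a minimal sketch using only the standard library; is_ingestable_url is a hypothetical helper, and it cannot tell whether a page is public, requires login, or renders its content server-side.

```python
from urllib.parse import urlparse


def is_ingestable_url(url):
    """Cheap local check: HTTPS scheme and a host name. Does not fetch the page."""
    parts = urlparse(url.strip())
    return parts.scheme == "https" and bool(parts.hostname)
```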

Code Examples

Basic URL ingest

from graphor import Graphor

client = Graphor()
response = client.sources.ingest_url(url="https://example.com/article")
print(f"Build ID: {response.build_id}")

Ingest with crawling

response = client.sources.ingest_url(
    url="https://example.com/documentation",
    crawl_urls=True
)
build_id = response.build_id

Ingest GitHub

Ingest a public GitHub repository (async). Returns a build_id; use Get build status to poll until ready.

Method signature

client.sources.ingest_github(
    url: str,              # Required
    timeout: float | None = None
) -> str
Returns build_id (str).

Parameters

Parameter  Type   Description                                         Required
url        str    GitHub repo URL (e.g. https://github.com/org/repo)  Yes
timeout    float  Request timeout in seconds                          No

Repository Requirements

  • Public GitHub repositories only; private repository ingestion is not supported
  • HTTPS URLs (https://github.com/...)
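
The URL shape can be validated locally before calling ingest_github. The sketch below is illustrative: parse_github_repo is a hypothetical helper, accepting a trailing .git suffix is an assumption, and a URL check cannot determine whether a repository is public.

```python
from urllib.parse import urlparse


def parse_github_repo(url):
    """Return (owner, repo) for an https://github.com/owner/repo URL, else None."""
    parts = urlparse(url.strip())
    if parts.scheme != "https" or parts.hostname not in {"github.com", "www.github.com"}:
        return None
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        return None
    owner, repo = segments[0], segments[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return (owner, repo) if repo else None
```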

Code Examples

Basic GitHub ingest

from graphor import Graphor
client = Graphor()
build_id = client.sources.ingest_github(url="https://github.com/organization/repository")
print(f"Build ID: {build_id}")

Ingest YouTube

Ingest a public YouTube video (async). Returns a build_id; use Get build status to poll until ready.

Method signature

client.sources.ingest_youtube(
    url: str,              # Required
    timeout: float | None = None
) -> str
Returns build_id (str).

Parameters

Parameter  Type   Description                                                   Required
url        str    YouTube video URL (e.g. https://www.youtube.com/watch?v=...)  Yes
timeout    float  Request timeout in seconds                                    No

Video Requirements

  • Public YouTube video URLs (HTTPS)
  • Standard watch URLs (https://www.youtube.com/watch?v=VIDEO_ID)
  • The video must be publicly accessible
  • Private or access-restricted videos are not supported
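
A standard watch URL can be recognized locally with the standard library. This is a sketch under stated assumptions: youtube_video_id is a hypothetical helper that accepts only the https://www.youtube.com/watch?v=... form listed above. Whether other forms (such as short links) are accepted by the API is not stated here, so they are rejected.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url):
    """Return the v= value of a standard watch URL, or None if it does not match."""
    parts = urlparse(url.strip())
    if parts.scheme != "https":
        return None
    if parts.hostname not in {"www.youtube.com", "youtube.com"}:
        return None
    if parts.path != "/watch":
        return None
    # parse_qs drops blank values, so "?v=" yields no entry for "v".
    values = parse_qs(parts.query).get("v", [])
    return values[0] if values else None
```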

Code Examples

Basic YouTube ingest

from graphor import Graphor
client = Graphor()
build_id = client.sources.ingest_youtube(url="https://www.youtube.com/watch?v=VIDEO_ID")
print(f"Build ID: {build_id}")

Advanced Configuration

Custom timeout

For large files or slow connections, increase the ingest request timeout. Use Get build status with a suitable poll interval for long-running processing.
from graphor import Graphor
from pathlib import Path

client = Graphor(timeout=300.0)  # 5 minutes

# Or per-request
response = client.with_options(timeout=300.0).sources.ingest_file(
    file=Path("./large-document.pdf")
)
build_id = response.build_id

Retry configuration

Configure automatic retries for transient errors on ingest or get_build_status:
from pathlib import Path
from graphor import Graphor

client = Graphor(max_retries=5)  # client-wide default
# Or per-request
response = client.with_options(max_retries=5).sources.ingest_file(file=Path("./document.pdf"))
build_id = response.build_id

Accessing raw response (Python only)

response = client.sources.with_raw_response.ingest_file(file=Path("./document.pdf"))
print("Headers:", response.headers)
build_id = response.parse().build_id  # parse() returns the object ingest_file() would return

Using aiohttp for concurrency (Python only)

import asyncio
from graphor import AsyncGraphor, DefaultAioHttpClient

async def ingest_many_files(file_paths: list):
    async with AsyncGraphor(http_client=DefaultAioHttpClient()) as client:
        tasks = [client.sources.ingest_file(file=p) for p in file_paths]
        return await asyncio.gather(*tasks, return_exceptions=True)

# pip install graphor[aiohttp]
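
asyncio.gather starts every upload at once, which can be more parallelism than intended for a large list of files. A semaphore is one common way to cap the number of requests in flight. The sketch below is generic asyncio code, not part of the SDK; gather_bounded and its limit parameter are hypothetical names.

```python
import asyncio


async def gather_bounded(make_call, items, limit=5):
    """Run make_call(item) for every item with at most `limit` in flight.

    Results are returned in input order; exceptions are returned as values
    rather than raised, matching return_exceptions=True above.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await make_call(item)

    return await asyncio.gather(*(run_one(i) for i in items), return_exceptions=True)
```

Inside the async with AsyncGraphor(...) block above, this could be called as await gather_bounded(lambda p: client.sources.ingest_file(file=p), file_paths, limit=5). The limit of 5 is an arbitrary placeholder.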

Error Reference

Error Type             Status Code  Description
BadRequestError        400          Invalid file type, missing filename, or malformed request
AuthenticationError    401          Invalid or missing API key
PermissionDeniedError  403          Access denied to the specified project
NotFoundError          404          Project or source not found
RateLimitError         429          Too many requests, please retry after waiting
InternalServerError    ≥500         Server-side processing error
APIConnectionError     N/A          Network connectivity issues
APITimeoutError        N/A          Request timed out
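
For application-level loops such as status polling, it can help to separate errors worth retrying from errors that will not succeed on a second attempt. The sketch below follows a common convention (retry 429, 5xx, and connection or timeout failures, with exponential backoff and jitter); it is not Graphor's documented retry policy, and is_retryable and backoff_delay are hypothetical helpers. The SDK's own max_retries setting already covers retries within individual requests.

```python
import random


def is_retryable(status_code):
    """True for 429 and 5xx. None stands for connection or timeout errors."""
    if status_code is None:
        return True
    return status_code == 429 or status_code >= 500


def backoff_delay(attempt, base_s=0.5, cap_s=30.0, rng=random.random):
    """Exponential backoff with full jitter for the given 0-based attempt.

    Returns a delay in seconds between 0 and min(cap_s, base_s * 2**attempt).
    """
    return rng() * min(cap_s, base_s * (2 ** attempt))
```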

Next Steps

After ingesting, use Get build status to wait until processing completes, then:

Reprocess source

Reprocess a source with a different partition method

List sources

List all sources (optionally filter by file_ids)

Get elements

Retrieve parsed elements/chunks from a source

Delete source

Remove a source by file_id