Use this endpoint to ingest content by scraping a public web page. It fetches the page, extracts text, and creates a source in your project for downstream processing.

Endpoint Overview

Authentication

This endpoint requires authentication using an API token. Include your API token as a Bearer token in the Authorization header.
Learn how to create and manage API tokens in the API Tokens guide.
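
For example, in Python the two required headers look like this (the grlm_ prefix follows the placeholder tokens used in the examples below):

headers = {
    "Authorization": "Bearer grlm_your_api_token_here",  # replace with your own token
    "Content-Type": "application/json",
}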

Request Format

Headers

Header         Value                  Required
Authorization  Bearer YOUR_API_TOKEN  ✅ Yes
Content-Type   application/json       ✅ Yes

Request Body

Send a JSON body with the following fields:
Field      Type     Description                                           Required
url        string   The URL of the web page to scrape                     ✅ Yes
crawlUrls  boolean  Whether to crawl and ingest links from the given URL  No (default: false)
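
For example, the body for ingesting a single page without crawling its links:

{
  "url": "https://example.com/",
  "crawlUrls": false
}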

URL Requirements

The URL must point to a publicly accessible web page. This endpoint scrapes web pages only; to ingest files (PDF, DOCX, etc.), use the Upload File endpoint.

Response Format

Success Response (200 OK)

{
  "status": "Processing",
  "message": "Source processed successfully",
  "file_name": "https://example.com/",
  "file_size": 0,
  "file_type": "",
  "file_source": "url",
  "project_id": "550e8400-e29b-41d4-a716-446655440000",
  "project_name": "My Project",
  "partition_method": "basic"
}

Response Fields

Field             Type     Description
status            string   Processing status (New, Processing, Completed, Failed, etc.)
message           string   Human-readable status message
file_name         string   Name or URL of the ingested source
file_size         integer  Size in bytes (0 for the initial URL-based record)
file_type         string   Detected file type, when applicable
file_source       string   Source type (url)
project_id        string   UUID of the project
project_name      string   Name of the project
partition_method  string   Document processing method used

Code Examples

JavaScript/Node.js

import fetch from 'node-fetch';

const uploadUrlSource = async (apiToken, url, crawlUrls = false) => {
  const response = await fetch('https://sources.graphorlm.com/upload-url-source', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiToken}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url, crawlUrls })
  });

  if (!response.ok) {
    throw new Error(`Upload from URL failed: ${response.status} ${response.statusText}`);
  }

  const result = await response.json();
  console.log('URL upload accepted:', result);
  return result;
};

// Usage (scrapes the page content)
uploadUrlSource('grlm_your_api_token_here', 'https://example.com/');

Python

import requests

def upload_url_source(api_token, url, crawl_urls=False):
    endpoint = "https://sources.graphorlm.com/upload-url-source"
    headers = {"Authorization": f"Bearer {api_token}", "Content-Type": "application/json"}
    payload = {"url": url, "crawlUrls": crawl_urls}

    response = requests.post(endpoint, headers=headers, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()

# Usage (scrapes the page content)
result = upload_url_source("grlm_your_api_token_here", "https://example.com")
print("URL scraping accepted:", result["file_name"])  # typically echoes the URL
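
# To also ingest pages linked from the URL, pass crawl_urls=True (sent as crawlUrls):
result = upload_url_source("grlm_your_api_token_here", "https://example.com", crawl_urls=True)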

cURL

curl -X POST https://sources.graphorlm.com/upload-url-source \
  -H "Authorization: Bearer grlm_your_api_token_here" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/","crawlUrls":false}'

Error Responses

Common Error Codes

Status Code  Error Type             Description
400          Bad Request            Invalid or missing URL, malformed JSON
401          Unauthorized           Invalid or missing API token
403          Forbidden              Access denied to the specified project
404          Not Found              Project or source not found
500          Internal Server Error  Error during URL processing

Error Response Format

{
  "detail": "Invalid input: URL is required"
}

Error Examples
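
A sketch of surfacing these errors in Python, reusing the upload_url_source helper from the Python example above; raise_for_status raises requests.HTTPError, which carries the response and its detail field:

import requests

try:
    result = upload_url_source("grlm_your_api_token_here", "https://example.com")
except requests.HTTPError as err:
    # surface the API's detail message alongside the HTTP status
    # (assumes a JSON error body as shown above)
    detail = err.response.json().get("detail", err.response.text)
    print(f"Upload failed ({err.response.status_code}): {detail}")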

Document Processing

After a successful request, GraphorLM begins fetching and scraping the web page in the background.

Processing Stages

  1. URL Accepted - The request is validated and scheduled
  2. Content Retrieval - The page is fetched over HTTPS
  3. Text Extraction - Visible text is extracted and normalized
  4. Structure Recognition - Document elements are identified and classified
  5. Ready for Use - Document is available for chunking and retrieval
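
Because these stages run asynchronously, the initial response usually reports a status of New or Processing; a minimal sketch of branching on that field:

result = upload_url_source("grlm_your_api_token_here", "https://example.com")

if result["status"] == "Failed":
    raise RuntimeError(f"Ingestion failed: {result['message']}")
elif result["status"] in ("New", "Processing"):
    # scraping continues in the background; the source becomes usable
    # for chunking and retrieval once processing completes
    print("Source accepted:", result["file_name"])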

Processing Methods

The system selects the optimal processing method based on the detected content. If a different method suits your content better, you can reprocess sources with the Process Source endpoint after ingestion.

Best Practices

  • Provide reachable URLs: Ensure the page is publicly accessible over HTTPS
  • Disable crawling when unneeded: Set crawlUrls to false to ingest only the provided URL
  • Respect site policies: Only scrape pages you are permitted to and consider website rate limits
  • Retry logic: Implement retries for transient network issues (see the sketch after this list)
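
A minimal retry sketch with exponential backoff, reusing the upload_url_source helper from the Python example; the attempt count and delays are illustrative, not prescribed by the API:

import time
import requests

def upload_with_retries(api_token, url, attempts=3):
    for attempt in range(attempts):
        try:
            return upload_url_source(api_token, url)
        except (requests.ConnectionError, requests.Timeout):
            # transient network failure: back off and try again
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)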

Next Steps