Dataset endpoints allow you to manage the data components of your flows in GraphorLM. Dataset nodes are the entry points that connect your uploaded documents to the RAG pipeline, determining which files are processed and how they’re configured.

What are Dataset Nodes?

Dataset nodes are fundamental components in GraphorLM flows that:

  • Connect Sources to Flows: Link your uploaded documents to processing pipelines
  • Control Data Input: Determine which files are included in each dataset
  • Enable Configuration: Allow customization of how documents are processed
  • Support Flow Logic: Act as the starting point for RAG pipelines

Available Endpoints

GraphorLM provides REST API endpoints for dataset management, covering the List Dataset Nodes and Update Dataset operations used in the workflows below.

Core Concepts

Dataset Node Structure

Each dataset node contains:

{
  "id": "dataset-1748287628684",
  "type": "dataset",
  "position": { "x": 100, "y": 200 },
  "data": {
    "name": "Research Papers",
    "config": {
      "files": [
        "attention.pdf",
        "transformer_architecture.pdf",
        "bert.pdf"
      ]
    },
    "result": {
      "updated": false,
      "lastRun": "2024-01-15T10:30:00Z"
    }
  }
}

Key Components

  • ID: Unique identifier for the dataset node
  • Config: File selection and processing settings
  • Result: Status and metadata from last processing
  • Position: Visual placement in the flow editor

File Management

Dataset nodes manage file relationships through:

  • File Selection: Choose which uploaded sources to include
  • Dynamic Updates: Add or remove files without recreating the node
  • Validation: Ensure all specified files exist as sources (see the validation sketch after this list)
  • Status Tracking: Monitor processing state and updates needed
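
Before sending an update, a node's file list can be cross-checked against the filenames returned by the List Sources endpoint. A minimal validation sketch, assuming available_sources holds the filenames obtained from that endpoint (not shown here):

# Sketch: verify a dataset node's file list against known source filenames
# before updating it. available_sources is assumed to come from the
# List Sources endpoint.

def find_missing_files(configured_files, available_sources):
    """Return configured files that do not exist as uploaded sources."""
    available = set(available_sources)
    return [f for f in configured_files if f not in available]

missing = find_missing_files(
    ["attention.pdf", "bert.pdf", "missing_paper.pdf"],
    ["attention.pdf", "transformer_architecture.pdf", "bert.pdf"],
)
if missing:
    print(f"Skipping update, missing sources: {missing}")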

Common Workflows

Setting Up a New Dataset

  1. List Available Nodes: Use the List Dataset Nodes endpoint to see existing datasets
  2. Update Configuration: Use the Update Dataset endpoint to configure file selection
  3. Deploy Flow: Deploy the flow to apply changes and make it executable (a sketch of these steps follows)
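
A compact sketch of this workflow using the requests library. The flow name is an assumption for illustration, and flow deployment itself is handled by the Flow endpoints, so it is only indicated by a placeholder comment:

import requests

API_TOKEN = "YOUR_API_TOKEN"
FLOW_NAME = "my-flow"  # assumed flow name for illustration
BASE_URL = f"https://{FLOW_NAME}.flows.graphorlm.com/datasets"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

# 1. List existing dataset nodes
nodes = requests.get(BASE_URL, headers=HEADERS).json()
node_id = nodes[0]["id"]  # pick the node to configure

# 2. Update its file selection
response = requests.patch(
    f"{BASE_URL}/{node_id}",
    headers={**HEADERS, "Content-Type": "application/json"},
    json={"config": {"files": ["attention.pdf", "bert.pdf"]}},
)
response.raise_for_status()

# 3. Deploy the flow via the Flow endpoints (not covered in this reference)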

Managing Dataset Files

// Example: Adding new files to an existing dataset
const updateDataset = async (flowName, nodeId, newFiles) => {
  const response = await fetch(`https://${flowName}.flows.graphorlm.com/datasets/${nodeId}`, {
    method: 'PATCH',
    headers: {
      'Authorization': 'Bearer YOUR_API_TOKEN',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      config: { files: newFiles }
    })
  });
  
  return await response.json();
};

Monitoring Dataset Status

# Example: Checking which datasets need updates
import requests

def check_dataset_status(flow_name, api_token):
    url = f"https://{flow_name}.flows.graphorlm.com/datasets"
    headers = {"Authorization": f"Bearer {api_token}"}
    
    response = requests.get(url, headers=headers)
    datasets = response.json()
    
    needs_update = [
        d for d in datasets 
        if not d['data']['result'].get('updated', True)
    ]
    
    return needs_update

Authentication

All dataset endpoints require authentication via API tokens:

Authorization: Bearer YOUR_API_TOKEN

Learn how to generate and manage API tokens in the API Tokens guide.
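
In practice the token is usually read from the environment rather than hard-coded. A small sketch (the GRAPHORLM_API_TOKEN variable name is an assumption, not a convention required by the API):

import os
import requests

# Assumed environment variable name; use whatever fits your deployment
api_token = os.environ["GRAPHORLM_API_TOKEN"]
headers = {"Authorization": f"Bearer {api_token}"}

response = requests.get(
    "https://my-flow.flows.graphorlm.com/datasets",  # example flow name
    headers=headers,
)
response.raise_for_status()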

URL Structure

Dataset endpoints follow a consistent URL pattern:

https://{flow_name}.flows.graphorlm.com/datasets[/{node_id}]

Where:

  • {flow_name}: The name of your deployed flow
  • {node_id}: The specific dataset node identifier (for update operations)

Response Formats

Dataset Node Object

All endpoints return dataset nodes with this structure:

{
  "id": "string",
  "type": "dataset",
  "position": {
    "x": "number",
    "y": "number"
  },
  "style": {
    "width": "number",
    "height": "number"
  },
  "data": {
    "name": "string",
    "config": {
      "files": ["string"]
    },
    "result": {
      "updated": "boolean",
      "lastRun": "string",
      "fileCount": "number"
    }
  }
}
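
As a sketch, the structure above can be mapped onto lightweight Python dataclasses for safer access in client code (field names follow the schema shown; optional fields may be absent):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DatasetResult:
    updated: bool = False
    last_run: Optional[str] = None
    file_count: Optional[int] = None

@dataclass
class DatasetNode:
    id: str
    name: str
    files: List[str] = field(default_factory=list)
    result: DatasetResult = field(default_factory=DatasetResult)

def parse_node(raw: dict) -> DatasetNode:
    """Map a raw dataset node object onto the dataclasses above."""
    data = raw.get("data", {})
    result = data.get("result", {})
    return DatasetNode(
        id=raw["id"],
        name=data.get("name", ""),
        files=data.get("config", {}).get("files", []),
        result=DatasetResult(
            updated=result.get("updated", False),
            last_run=result.get("lastRun"),
            file_count=result.get("fileCount"),
        ),
    )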

Success Responses

Update operations return confirmation:

{
  "success": true,
  "message": "Dataset node updated successfully",
  "node_id": "dataset-1748287628684"
}

Error Handling

Common error scenarios and responses:

  • Authentication (401): Invalid or missing API token
  • Flow Not Found (404): Flow doesn’t exist or isn’t accessible
  • Node Not Found (404): Dataset node doesn’t exist in the flow
  • Files Not Found (400): Specified files don’t exist as sources
  • Invalid Config (400): Configuration validation failed
  • Server Error (500): Internal processing error

Error Response Format

{
  "detail": "Descriptive error message explaining the issue"
}
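
A small sketch of how a client might surface these errors, assuming error responses carry the detail field shown above:

import requests

def update_dataset_safe(flow_name, node_id, files, api_token):
    """PATCH a dataset node and translate API errors into readable messages."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets/{node_id}"
    response = requests.patch(
        url,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        json={"config": {"files": files}},
    )
    if response.ok:
        return response.json()

    # Error responses are expected to carry a "detail" message
    try:
        detail = response.json().get("detail", response.text)
    except ValueError:
        detail = response.text
    raise RuntimeError(f"Dataset update failed ({response.status_code}): {detail}")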

Integration Examples

Dataset Management Class

class DatasetManager {
  constructor(flowName, apiToken) {
    this.flowName = flowName;
    this.apiToken = apiToken;
    this.baseUrl = `https://${flowName}.flows.graphorlm.com/datasets`;
  }

  async listNodes() {
    const response = await fetch(this.baseUrl, {
      headers: { 'Authorization': `Bearer ${this.apiToken}` }
    });
    return await response.json();
  }

  async updateNode(nodeId, files) {
    const response = await fetch(`${this.baseUrl}/${nodeId}`, {
      method: 'PATCH',
      headers: {
        'Authorization': `Bearer ${this.apiToken}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ config: { files } })
    });
    return await response.json();
  }

  async getNodeStatus(nodeId) {
    const nodes = await this.listNodes();
    return nodes.find(node => node.id === nodeId);
  }
}

Batch Operations

import requests

class BatchDatasetOperations:
    def __init__(self, flow_name, api_token):
        self.flow_name = flow_name
        self.api_token = api_token
        self.base_url = f"https://{flow_name}.flows.graphorlm.com/datasets"
    
    def get_all_nodes(self):
        """Get all dataset nodes in the flow"""
        response = requests.get(
            self.base_url,
            headers={"Authorization": f"Bearer {self.api_token}"}
        )
        response.raise_for_status()
        return response.json()
    
    def update_multiple_nodes(self, updates):
        """Update multiple dataset nodes"""
        results = []
        for node_id, files in updates.items():
            try:
                response = requests.patch(
                    f"{self.base_url}/{node_id}",
                    headers={
                        "Authorization": f"Bearer {self.api_token}",
                        "Content-Type": "application/json"
                    },
                    json={"config": {"files": files}}
                )
                response.raise_for_status()
                results.append({
                    "node_id": node_id,
                    "status": "success",
                    "result": response.json()
                })
            except Exception as e:
                results.append({
                    "node_id": node_id,
                    "status": "error",
                    "error": str(e)
                })
        return results

Best Practices

Configuration Management

  • Version Control: Keep track of dataset configurations over time
  • Validation: Always verify files exist before updating configurations
  • Atomic Updates: Update entire file lists rather than partial modifications
  • Documentation: Document the purpose of each dataset node

Performance Optimization

  • File Organization: Group related files in the same dataset nodes
  • Batch Operations: Update multiple nodes efficiently using batch processing
  • Monitoring: Regularly check dataset status to identify issues early
  • Caching: Cache dataset node information to reduce API calls (see the caching sketch after this list)
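
A simple time-based cache sketch (the 60-second TTL is an arbitrary choice):

import time
import requests

class CachedDatasetList:
    """Cache the dataset node list for a short TTL to reduce API calls."""

    def __init__(self, flow_name, api_token, ttl_seconds=60):
        self.url = f"https://{flow_name}.flows.graphorlm.com/datasets"
        self.headers = {"Authorization": f"Bearer {api_token}"}
        self.ttl = ttl_seconds
        self._cached = None
        self._fetched_at = 0.0

    def get_nodes(self, force_refresh=False):
        expired = time.time() - self._fetched_at > self.ttl
        if force_refresh or expired or self._cached is None:
            response = requests.get(self.url, headers=self.headers)
            response.raise_for_status()
            self._cached = response.json()
            self._fetched_at = time.time()
        return self._cached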

Error Prevention

  • File Validation: Use the List Sources endpoint to verify file availability
  • Status Monitoring: Check node update status before flow deployment
  • Graceful Degradation: Handle missing files and failed updates appropriately
  • Retry Logic: Implement retry mechanisms for transient failures (see the retry sketch after this list)
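
A minimal retry sketch with exponential backoff that retries only network errors and 5xx responses (the attempt count and delays are illustrative):

import time
import requests

def patch_with_retry(url, headers, payload, max_attempts=3, base_delay=1.0):
    """Retry PATCH requests on network errors and 5xx responses."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.patch(url, headers=headers, json=payload)
            if response.status_code < 500:
                return response  # success or a client error that should not be retried
        except requests.RequestException:
            pass  # network-level failure, treated as transient
        if attempt < max_attempts:
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"Request to {url} failed after {max_attempts} attempts")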

Relationship with Other Endpoints

Dataset endpoints work closely with other GraphorLM APIs:

Source Management

  • Use Sources endpoints to upload and manage files
  • Validate file availability before updating dataset configurations

Flow Management

  • Use Flow endpoints to deploy updated configurations
  • Monitor flow status after dataset changes

Flow Execution

  • Use Run Flow endpoint to execute flows with updated datasets
  • Verify results reflect the new dataset configurations

Use Cases

Content Management System

// Scenario: Managing document sets for different topics
const contentManager = new DatasetManager('knowledge-base', 'YOUR_API_TOKEN');

const topicDatasets = {
  'ai-research': ['ai_overview.pdf', 'machine_learning.pdf', 'neural_networks.pdf'],
  'documentation': ['user_guide.pdf', 'api_reference.pdf', 'tutorials.pdf'],
  'legal': ['terms_of_service.pdf', 'privacy_policy.pdf', 'compliance.pdf']
};

// Update each dataset with topic-specific files
// (findDatasetNodeByTopic is an application-specific lookup helper, not shown here)
for (const [topic, files] of Object.entries(topicDatasets)) {
  const nodeId = await findDatasetNodeByTopic(topic);
  await contentManager.updateNode(nodeId, files);
}

Dynamic Content Updates

# Scenario: Updating datasets based on new document uploads
def update_datasets_with_new_files(flow_name, api_token, new_files):
    manager = BatchDatasetOperations(flow_name, api_token)
    nodes = manager.get_all_nodes()
    
    updates = {}
    for node in nodes:
        current_files = node['data']['config'].get('files', [])
        # Add new files to existing configuration
        updated_files = list(set(current_files + new_files))
        updates[node['id']] = updated_files
    
    return manager.update_multiple_nodes(updates)

Quality Assurance

// Scenario: Auditing dataset configurations for consistency
async function auditDatasetConfigurations(flowName, apiToken) {
  const manager = new DatasetManager(flowName, apiToken);
  const nodes = await manager.listNodes();
  
  const audit = {
    totalNodes: nodes.length,
    emptyNodes: [],
    outdatedNodes: [],
    duplicateFiles: {}
  };
  
  const fileUsage = {};
  
  for (const node of nodes) {
    const files = node.data.config.files || [];
    
    // Check for empty nodes
    if (files.length === 0) {
      audit.emptyNodes.push(node.id);
    }
    
    // Check for outdated nodes
    if (!node.data.result?.updated) {
      audit.outdatedNodes.push(node.id);
    }
    
    // Track file usage for duplicate detection
    for (const file of files) {
      if (!fileUsage[file]) fileUsage[file] = [];
      fileUsage[file].push(node.id);
    }
  }
  
  // Find duplicate files
  for (const [file, nodeIds] of Object.entries(fileUsage)) {
    if (nodeIds.length > 1) {
      audit.duplicateFiles[file] = nodeIds;
    }
  }
  
  return audit;
}

Migration and Maintenance

Upgrading Dataset Configurations

When migrating or updating your flow configurations:

  1. Backup Current State: List all current dataset configurations (see the backup sketch after this list)
  2. Plan Updates: Design new file organization structure
  3. Test Changes: Update configurations in a development environment
  4. Apply Updates: Use batch operations to update production datasets
  5. Verify Results: Check that all nodes are properly updated
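
For step 1, a sketch of backing up the current configurations to a local JSON file (the backup filename is arbitrary):

import json
import requests

def backup_dataset_configs(flow_name, api_token, path="dataset_backup.json"):
    """Save the current dataset node file selections to a local JSON file."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets"
    response = requests.get(url, headers={"Authorization": f"Bearer {api_token}"})
    response.raise_for_status()
    nodes = response.json()

    # Keep only what is needed to restore file selections later
    backup = {node["id"]: node["data"]["config"].get("files", []) for node in nodes}
    with open(path, "w") as f:
        json.dump(backup, f, indent=2)
    return backup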

Monitoring and Alerting

Set up monitoring for dataset health:

def monitor_dataset_health(flow_name, api_token):
    """Monitor dataset health and return alerts"""
    manager = BatchDatasetOperations(flow_name, api_token)
    nodes = manager.get_all_nodes()
    
    alerts = []
    
    for node in nodes:
        # Check for empty configurations
        files = node['data']['config'].get('files', [])
        if not files:
            alerts.append(f"Empty dataset node: {node['id']}")
        
        # Check for outdated status
        if not node['data']['result'].get('updated', True):
            alerts.append(f"Outdated dataset node: {node['id']}")
    
    return alerts

Next Steps

Ready to start managing your dataset nodes? Begin with the List Dataset Nodes endpoint to review your current configuration, then use the Update Dataset endpoint to adjust file selections and deploy the flow to apply the changes.

For more advanced usage patterns and integration examples, see the Sources, Flow Management, and API Tokens guides referenced throughout this page.