Update the configuration of a specific dataset node within a flow in your GraphorLM project. This endpoint allows you to modify which files are included in a dataset node and automatically marks the node for reprocessing.

Overview

The Update Dataset endpoint allows you to modify the configuration of dataset nodes within your flows. Dataset nodes are components that connect document sources to your flow pipeline, and updating them is essential for managing data inputs and keeping your RAG pipelines current.

  • Method: PATCH
  • URL: https://{flow_name}.flows.graphorlm.com/datasets/{node_id}
  • Authentication: Required (API Token)

Authentication

All requests must include a valid API token in the Authorization header:

Authorization: Bearer YOUR_API_TOKEN

Learn how to generate API tokens in the API Tokens guide.

Request Format

Headers

Header | Value | Required
Authorization | Bearer YOUR_API_TOKEN | Yes
Content-Type | application/json | Yes

URL Parameters

Parameter | Type | Required | Description
flow_name | string | Yes | The name of the flow containing the dataset node
node_id | string | Yes | The unique identifier of the dataset node to update

Request Body

The request body should be a JSON object with the following structure:

Field | Type | Required | Description
config | object | Yes | The new configuration for the dataset node
config.files | array | No | List of file names to include in the dataset node

Example Request

PATCH https://my-rag-pipeline.flows.graphorlm.com/datasets/dataset-1748287628684
Authorization: Bearer YOUR_API_TOKEN
Content-Type: application/json

{
  "config": {
    "files": [
      "attention.pdf",
      "bert.pdf",
      "transformer_architecture.pdf"
    ]
  }
}

Response Format

Success Response (200 OK)

The response contains confirmation of the successful update:

{
  "success": true,
  "message": "Dataset node 'dataset-1748287628684' updated successfully",
  "node_id": "dataset-1748287628684"
}

Response Fields

Field | Type | Description
success | boolean | Whether the update operation was successful
message | string | Descriptive message about the operation result
node_id | string | The ID of the updated dataset node

Code Examples

JavaScript/Node.js

async function updateDatasetNode(flowName, nodeId, files, apiToken) {
  const response = await fetch(`https://${flowName}.flows.graphorlm.com/datasets/${nodeId}`, {
    method: 'PATCH',
    headers: {
      'Authorization': `Bearer ${apiToken}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      config: {
        files: files
      }
    })
  });

  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }

  return await response.json();
}

// Usage
updateDatasetNode(
  'my-rag-pipeline',
  'dataset-1748287628684',
  ['attention.pdf', 'bert.pdf', 'transformer_architecture.pdf'],
  'YOUR_API_TOKEN'
)
  .then(result => {
    console.log('Dataset updated:', result);
    console.log(`Node ${result.node_id} updated successfully`);
  })
  .catch(error => console.error('Error:', error));

Python

import requests

def update_dataset_node(flow_name, node_id, files, api_token):
    url = f"https://{flow_name}.flows.graphorlm.com/datasets/{node_id}"
    
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "config": {
            "files": files
        }
    }
    
    response = requests.patch(url, headers=headers, json=payload)
    response.raise_for_status()
    
    return response.json()

def manage_dataset_configuration(flow_name, node_id, api_token):
    """Example of comprehensive dataset management"""
    
    # First, get current configuration
    current_config = get_current_dataset_config(flow_name, node_id, api_token)
    print(f"Current files: {current_config.get('files', [])}")
    
    # Define new file configuration
    new_files = [
        "attention.pdf",
        "bert.pdf", 
        "transformer_architecture.pdf",
        "neural_networks.pdf"
    ]
    
    print(f"Updating to {len(new_files)} files...")
    
    try:
        result = update_dataset_node(flow_name, node_id, new_files, api_token)
        
        print("✅ Update successful!")
        print(f"Success: {result['success']}")
        print(f"Message: {result['message']}")
        print(f"Updated Node ID: {result['node_id']}")
        
        return result
        
    except requests.exceptions.HTTPError as e:
        print(f"❌ Update failed: {e}")
        if e.response.status_code == 404:
            print("Flow or dataset node not found")
        elif e.response.status_code == 400:
            print("Invalid configuration or files not found")
        raise

def get_current_dataset_config(flow_name, node_id, api_token):
    """Helper function to get current dataset configuration"""
    # This would use the List Dataset Nodes endpoint
    url = f"https://{flow_name}.flows.graphorlm.com/datasets"
    headers = {"Authorization": f"Bearer {api_token}"}
    
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    
    nodes = response.json()
    for node in nodes:
        if node['id'] == node_id:
            return node['data']['config']
    
    raise ValueError(f"Dataset node {node_id} not found")

# Usage
try:
    result = manage_dataset_configuration(
        flow_name="my-rag-pipeline",
        node_id="dataset-1748287628684",
        api_token="YOUR_API_TOKEN"
    )
except Exception as e:
    print(f"Error managing dataset: {e}")

cURL

# Basic update
curl -X PATCH https://my-rag-pipeline.flows.graphorlm.com/datasets/dataset-1748287628684 \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "files": [
        "attention.pdf",
        "bert.pdf",
        "transformer_architecture.pdf"
      ]
    }
  }'

# Update with empty files list (clear all files)
curl -X PATCH https://my-rag-pipeline.flows.graphorlm.com/datasets/dataset-1748287628684 \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "files": []
    }
  }'

# Add single file to configuration
curl -X PATCH https://my-rag-pipeline.flows.graphorlm.com/datasets/dataset-1748287628684 \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "files": ["new_document.pdf"]
    }
  }'

PHP

<?php
function updateDatasetNode($flowName, $nodeId, $files, $apiToken) {
    $url = "https://{$flowName}.flows.graphorlm.com/datasets/{$nodeId}";
    
    $data = [
        'config' => [
            'files' => $files
        ]
    ];
    
    $options = [
        'http' => [
            'header' => [
                "Authorization: Bearer {$apiToken}",
                "Content-Type: application/json"
            ],
            'method' => 'PATCH',
            'content' => json_encode($data)
        ]
    ];
    
    $context = stream_context_create($options);
    $result = file_get_contents($url, false, $context);
    
    if ($result === FALSE) {
        throw new Exception('Failed to update dataset node');
    }
    
    return json_decode($result, true);
}

function manageDatasetFiles($flowName, $nodeId, $apiToken) {
    // Define file operations
    $operations = [
        'add_research_papers' => [
            'attention.pdf',
            'bert.pdf',
            'transformer_architecture.pdf'
        ],
        'add_documentation' => [
            'api_guide.pdf',
            'user_manual.pdf'
        ],
        'minimal_set' => [
            'quick_reference.pdf'
        ]
    ];
    
    foreach ($operations as $operation => $files) {
        echo "🔄 Operation: {$operation}\n";
        echo "Files: " . implode(', ', $files) . "\n";
        
        try {
            $result = updateDatasetNode($flowName, $nodeId, $files, $apiToken);
            
            echo "✅ Success: {$result['message']}\n";
            echo "Updated Node: {$result['node_id']}\n\n";
            
            // Wait between operations
            sleep(1);
            
        } catch (Exception $e) {
            echo "❌ Failed: " . $e->getMessage() . "\n\n";
        }
    }
}

// Usage
try {
    $result = updateDatasetNode(
        'my-rag-pipeline',
        'dataset-1748287628684',
        ['attention.pdf', 'bert.pdf'],
        'YOUR_API_TOKEN'
    );
    
    echo "Dataset updated successfully!\n";
    echo "Result: " . json_encode($result, JSON_PRETTY_PRINT) . "\n";
    
} catch (Exception $e) {
    echo "Error: " . $e->getMessage() . "\n";
}
?>

Error Responses

Common Error Codes

Status Code | Description | Example Response
400 | Bad Request - Invalid configuration or files not found | {"detail": "The following files do not exist as sources in the dataset: missing_file.pdf"}
401 | Unauthorized - Invalid or missing API token | {"detail": "Invalid authentication credentials"}
404 | Not Found - Flow or dataset node not found | {"detail": "Dataset node with id 'invalid-node' not found in flow 'my-flow'"}
500 | Internal Server Error - Unexpected error during the update | {"detail": "Failed to update dataset node"}

Error Response Format

{
  "detail": "Error message describing what went wrong"
}

Example Error Responses

Files Not Found in Dataset

{
  "detail": "The following files do not exist as sources in the dataset: nonexistent_file.pdf, another_missing.pdf"
}

Dataset Node Not Found

{
  "detail": "Dataset node with id 'dataset-invalid' not found in flow 'my-rag-pipeline'"
}

Flow Not Found

{
  "detail": "Flow with name 'nonexistent-flow' not found"
}

Invalid API Token

{
  "detail": "Invalid authentication credentials"
}
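
Because every error uses the same {"detail": ...} shape, a client can surface the server's message instead of a bare status code. Below is a minimal Python sketch; the function name is illustrative and not part of the API:

import requests

def update_with_error_detail(flow_name, node_id, files, api_token):
    """Surface the API's 'detail' message when an update fails."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets/{node_id}"
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json"
    }

    response = requests.patch(url, headers=headers, json={"config": {"files": files}})

    if not response.ok:
        # Fall back to the raw body if the error response is not JSON
        try:
            detail = response.json().get("detail", response.text)
        except ValueError:
            detail = response.text
        raise RuntimeError(f"Update failed ({response.status_code}): {detail}")

    return response.json()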

Update Behavior

Node Status Changes

When you update a dataset node:

  1. Configuration Updated: The node’s file list is replaced with the new configuration
  2. Status Reset: The node is marked as "updated": false to indicate it needs reprocessing (see the verification sketch after this list)
  3. Successor Nodes: All downstream nodes in the flow are also marked as needing updates
  4. Flow State: The flow maintains its deployed status but requires redeployment to apply changes
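
The reprocessing flag can be confirmed programmatically. A minimal Python sketch, assuming the List Dataset Nodes response exposes the processed state under data.result.updated (as in the audit example later on this page); the function name is illustrative:

import requests

def node_needs_reprocessing(flow_name, node_id, api_token):
    """Check whether a dataset node is flagged for reprocessing after an update."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets"
    headers = {"Authorization": f"Bearer {api_token}"}

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    for node in response.json():
        if node["id"] == node_id:
            # False (or missing) means the node still needs reprocessing,
            # so the flow must be redeployed before the change takes effect.
            result = node.get("data", {}).get("result") or {}
            return result.get("updated") is not True

    raise ValueError(f"Dataset node {node_id} not found")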

File Validation

The endpoint validates the following; a local pre-flight check is sketched after the list:

  • All specified files exist as sources in the project
  • At least one file is specified (empty lists are not allowed)
  • File names exactly match uploaded source files (case-sensitive)
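
A minimal pre-flight sketch in Python, assuming you have already fetched the names of your uploaded sources (for example via the List Sources endpoint, which is not documented on this page); the helper name is illustrative:

def preflight_check(requested_files, available_source_names):
    """Catch invalid file lists locally before calling the PATCH endpoint."""
    if not requested_files:
        raise ValueError("At least one file must be specified")

    # Matching is case-sensitive, so compare names exactly as uploaded
    missing = [f for f in requested_files if f not in set(available_source_names)]
    if missing:
        raise ValueError(f"Files not found as sources: {', '.join(missing)}")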

Integration Examples

Dataset Configuration Manager

class DatasetConfigManager {
  constructor(flowName, apiToken) {
    this.flowName = flowName;
    this.apiToken = apiToken;
    this.baseUrl = `https://${flowName}.flows.graphorlm.com`;
  }

  async getCurrentNodes() {
    const response = await fetch(`${this.baseUrl}/datasets`, {
      headers: { 'Authorization': `Bearer ${this.apiToken}` }
    });

    if (!response.ok) {
      throw new Error(`Failed to get dataset nodes: ${response.status}`);
    }

    return await response.json();
  }

  async updateNode(nodeId, files) {
    const response = await fetch(`${this.baseUrl}/datasets/${nodeId}`, {
      method: 'PATCH',
      headers: {
        'Authorization': `Bearer ${this.apiToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        config: { files }
      })
    });

    if (!response.ok) {
      const error = await response.json();
      throw new Error(`Update failed: ${error.detail}`);
    }

    return await response.json();
  }

  async addFilesToNode(nodeId, newFiles) {
    // Get current configuration
    const nodes = await this.getCurrentNodes();
    const targetNode = nodes.find(node => node.id === nodeId);
    
    if (!targetNode) {
      throw new Error(`Dataset node ${nodeId} not found`);
    }

    const currentFiles = targetNode.data.config.files || [];
    const updatedFiles = [...new Set([...currentFiles, ...newFiles])]; // Remove duplicates

    console.log(`Adding ${newFiles.length} files to node ${nodeId}`);
    console.log(`Total files after update: ${updatedFiles.length}`);

    return await this.updateNode(nodeId, updatedFiles);
  }

  async removeFilesFromNode(nodeId, filesToRemove) {
    // Get current configuration
    const nodes = await this.getCurrentNodes();
    const targetNode = nodes.find(node => node.id === nodeId);
    
    if (!targetNode) {
      throw new Error(`Dataset node ${nodeId} not found`);
    }

    const currentFiles = targetNode.data.config.files || [];
    const updatedFiles = currentFiles.filter(file => !filesToRemove.includes(file));

    console.log(`Removing ${filesToRemove.length} files from node ${nodeId}`);
    console.log(`Files remaining: ${updatedFiles.length}`);

    return await this.updateNode(nodeId, updatedFiles);
  }

  async replaceAllFiles(nodeId, newFiles) {
    console.log(`Replacing all files in node ${nodeId} with ${newFiles.length} new files`);
    return await this.updateNode(nodeId, newFiles);
  }

  async auditNodeConfiguration() {
    const nodes = await this.getCurrentNodes();
    const audit = {
      totalNodes: nodes.length,
      totalFiles: 0,
      nodeDetails: [],
      duplicateFiles: {},
      emptyNodes: []
    };

    const fileUsage = {};

    for (const node of nodes) {
      const files = node.data.config.files || [];
      const nodeDetail = {
        id: node.id,
        name: node.data.name,
        fileCount: files.length,
        files: files,
        needsUpdate: !node.data.result?.updated
      };

      audit.nodeDetails.push(nodeDetail);
      audit.totalFiles += files.length;

      if (files.length === 0) {
        audit.emptyNodes.push(nodeDetail);
      }

      // Track file usage
      for (const file of files) {
        if (!fileUsage[file]) {
          fileUsage[file] = [];
        }
        fileUsage[file].push(node.id);
      }
    }

    // Find duplicate files
    for (const [file, nodeIds] of Object.entries(fileUsage)) {
      if (nodeIds.length > 1) {
        audit.duplicateFiles[file] = nodeIds;
      }
    }

    return audit;
  }
}

// Usage
const manager = new DatasetConfigManager('my-rag-pipeline', 'YOUR_API_TOKEN');

manager.auditNodeConfiguration()
  .then(audit => {
    console.log('Configuration Audit:', audit);
    
    if (audit.emptyNodes.length > 0) {
      console.log('Empty nodes found:', audit.emptyNodes.map(n => n.id));
    }
    
    if (Object.keys(audit.duplicateFiles).length > 0) {
      console.log('Duplicate files found:', audit.duplicateFiles);
    }
  })
  .catch(console.error);

Batch Configuration Tool

import requests
from typing import List, Dict, Any
import time

class BatchDatasetUpdater:
    def __init__(self, flow_name: str, api_token: str):
        self.flow_name = flow_name
        self.api_token = api_token
        self.base_url = f"https://{flow_name}.flows.graphorlm.com"
        
    def get_dataset_nodes(self) -> List[Dict[str, Any]]:
        """Get all dataset nodes in the flow"""
        response = requests.get(
            f"{self.base_url}/datasets",
            headers={"Authorization": f"Bearer {self.api_token}"}
        )
        response.raise_for_status()
        return response.json()
    
    def update_single_node(self, node_id: str, files: List[str]) -> Dict[str, Any]:
        """Update a single dataset node"""
        response = requests.patch(
            f"{self.base_url}/datasets/{node_id}",
            headers={
                "Authorization": f"Bearer {self.api_token}",
                "Content-Type": "application/json"
            },
            json={"config": {"files": files}}
        )
        response.raise_for_status()
        return response.json()
    
    def batch_update_nodes(self, updates: Dict[str, List[str]], delay: float = 1.0) -> Dict[str, Any]:
        """
        Update multiple dataset nodes with different file configurations
        
        Args:
            updates: Dictionary mapping node_ids to lists of files
            delay: Delay between updates in seconds
        """
        results = {
            "successful_updates": [],
            "failed_updates": [],
            "total_nodes": len(updates)
        }
        
        for node_id, files in updates.items():
            try:
                print(f"Updating node {node_id} with {len(files)} files...")
                result = self.update_single_node(node_id, files)
                results["successful_updates"].append({
                    "node_id": node_id,
                    "files": files,
                    "result": result
                })
                print(f"✅ Success: {result['message']}")
                
            except Exception as e:
                error_info = {
                    "node_id": node_id,
                    "files": files,
                    "error": str(e)
                }
                results["failed_updates"].append(error_info)
                print(f"❌ Failed: {e}")
            
            # Delay between updates
            if delay > 0:
                time.sleep(delay)
        
        return results
    
    def standardize_node_configurations(self, file_groups: Dict[str, List[str]]) -> Dict[str, Any]:
        """
        Apply standardized file configurations to dataset nodes
        
        Args:
            file_groups: Dictionary mapping group names to file lists
        """
        nodes = self.get_dataset_nodes()
        
        # Interactive selection of nodes for each group
        assignments = {}
        
        for group_name, files in file_groups.items():
            print(f"\n📁 Group: {group_name}")
            print(f"Files: {', '.join(files)}")
            print("Available nodes:")
            
            for i, node in enumerate(nodes):
                current_files = node['data']['config'].get('files', [])
                print(f"  {i+1}. {node['id']} - {node['data']['name']} ({len(current_files)} files)")
            
            # In a real implementation, you'd get user input here
            # For this example, we'll assign the first node to each group
            if nodes:
                selected_node = nodes[0]['id']
                assignments[selected_node] = files
                print(f"Assigned {group_name} to node {selected_node}")
        
        return self.batch_update_nodes(assignments)

# Usage
updater = BatchDatasetUpdater("my-rag-pipeline", "YOUR_API_TOKEN")

# Example: Standardize configurations
file_groups = {
    "research_papers": [
        "attention.pdf",
        "bert.pdf",
        "transformer_architecture.pdf"
    ],
    "documentation": [
        "api_guide.pdf",
        "user_manual.pdf",
        "troubleshooting.pdf"
    ],
    "datasets": [
        "training_data.csv",
        "validation_data.csv"
    ]
}

try:
    results = updater.standardize_node_configurations(file_groups)
    print(f"\nBatch update completed:")
    print(f"Successful: {len(results['successful_updates'])}")
    print(f"Failed: {len(results['failed_updates'])}")
    
except Exception as e:
    print(f"Batch update failed: {e}")

Best Practices

Configuration Management

  • Validate Files First: Use the List Sources endpoint to verify file availability before updating
  • Backup Configurations: Save current configurations before making changes (see the snapshot sketch after this list)
  • Incremental Updates: Make small, incremental changes rather than large configuration replacements
  • Document Changes: Keep track of configuration changes for rollback purposes
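
A minimal backup sketch in Python, using the List Dataset Nodes endpoint shown earlier to snapshot each node's file list to a local JSON file (the function name and backup file name are illustrative):

import json
import requests
from datetime import datetime, timezone

def backup_dataset_configs(flow_name, api_token, path=None):
    """Snapshot every dataset node's file list so a previous configuration
    can be restored later with a PATCH request."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets"
    headers = {"Authorization": f"Bearer {api_token}"}

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    snapshot = {
        "flow": flow_name,
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "nodes": {
            node["id"]: node["data"]["config"].get("files", [])
            for node in response.json()
        }
    }

    path = path or f"{flow_name}-datasets-backup.json"
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)

    return path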

File Organization

  • Consistent Naming: Use clear, consistent file naming conventions
  • Logical Grouping: Group related files in the same dataset nodes
  • Version Control: Include version information in file names when appropriate
  • Size Considerations: Balance the number of files per node for optimal performance

Error Handling

  • Validate Before Update: Check that files exist before attempting updates
  • Handle Partial Failures: In batch operations, handle individual failures gracefully
  • Retry Logic: Implement retry mechanisms for transient failures (a retry sketch follows this list)
  • Detailed Logging: Log all configuration changes for audit trails
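
A minimal retry sketch in Python: transient problems (network errors and 5xx responses) are retried with exponential backoff, while 4xx responses fail immediately since they indicate a configuration problem. The function name and retry policy are illustrative:

import time
import requests

def update_with_retries(flow_name, node_id, files, api_token, attempts=3):
    """Retry the update on transient failures; fail fast on client errors."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets/{node_id}"
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json"
    }
    payload = {"config": {"files": files}}

    for attempt in range(1, attempts + 1):
        try:
            response = requests.patch(url, headers=headers, json=payload, timeout=30)
        except requests.exceptions.RequestException:
            if attempt == attempts:
                raise
        else:
            if response.status_code < 500:
                response.raise_for_status()  # 4xx fails fast; 2xx falls through
                return response.json()
            if attempt == attempts:
                response.raise_for_status()  # persistent 5xx: give up
        time.sleep(2 ** attempt)  # exponential backoff between attempts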

Troubleshooting

Next Steps

After updating dataset configurations, you might want to: