Update the configuration of a specific dataset node within a flow in your GraphorLM project. This endpoint allows you to modify which files are included in a dataset node and automatically marks the node for reprocessing.

Overview

The Update Dataset endpoint allows you to modify the configuration of dataset nodes within your flows. Dataset nodes are components that connect document sources to your flow pipeline, and updating them is essential for managing data inputs and keeping your RAG pipelines current.

  • Method: PATCH
  • URL: https://{flow_name}.flows.graphorlm.com/datasets/{node_id}
  • Authentication: Required (API Token)

Authentication

All requests must include a valid API token in the Authorization header:

Authorization: Bearer YOUR_API_TOKEN

Learn how to generate API tokens in the API Tokens guide.

Request Format

Headers

Header | Value | Required
Authorization | Bearer YOUR_API_TOKEN | Yes
Content-Type | application/json | Yes

URL Parameters

Parameter | Type | Required | Description
flow_name | string | Yes | The name of the flow containing the dataset node
node_id | string | Yes | The unique identifier of the dataset node to update

Request Body

The request body should be a JSON object with the following structure:

Field | Type | Required | Description
config | object | Yes | The new configuration for the dataset node
config.files | array | No | List of file names to include in the dataset node

Example Request

PATCH https://my-rag-pipeline.flows.graphorlm.com/datasets/dataset-1748287628684
Authorization: Bearer YOUR_API_TOKEN
Content-Type: application/json

{
  "config": {
    "files": [
      "attention.pdf",
      "bert.pdf",
      "transformer_architecture.pdf"
    ]
  }
}

Response Format

Success Response (200 OK)

The response contains confirmation of the successful update:

{
  "success": true,
  "message": "Dataset node 'dataset-1748287628684' updated successfully",
  "node_id": "dataset-1748287628684"
}

Response Fields

Field | Type | Description
success | boolean | Whether the update operation was successful
message | string | Descriptive message about the operation result
node_id | string | The ID of the updated dataset node

Code Examples

JavaScript/Node.js

async function updateDatasetNode(flowName, nodeId, files, apiToken) {
  const response = await fetch(`https://${flowName}.flows.graphorlm.com/datasets/${nodeId}`, {
    method: 'PATCH',
    headers: {
      'Authorization': `Bearer ${apiToken}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      config: {
        files: files
      }
    })
  });

  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }

  return await response.json();
}

// Usage
updateDatasetNode(
  'my-rag-pipeline',
  'dataset-1748287628684',
  ['attention.pdf', 'bert.pdf', 'transformer_architecture.pdf'],
  'YOUR_API_TOKEN'
)
  .then(result => {
    console.log('Dataset updated:', result);
    console.log(`Node ${result.node_id} updated successfully`);
  })
  .catch(error => console.error('Error:', error));

Python

import requests

def update_dataset_node(flow_name, node_id, files, api_token):
    url = f"https://{flow_name}.flows.graphorlm.com/datasets/{node_id}"
    
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "config": {
            "files": files
        }
    }
    
    response = requests.patch(url, headers=headers, json=payload)
    response.raise_for_status()
    
    return response.json()

def manage_dataset_configuration(flow_name, node_id, api_token):
    """Example of comprehensive dataset management"""
    
    # First, get current configuration
    current_config = get_current_dataset_config(flow_name, node_id, api_token)
    print(f"Current files: {current_config.get('files', [])}")
    
    # Define new file configuration
    new_files = [
        "attention.pdf",
        "bert.pdf", 
        "transformer_architecture.pdf",
        "neural_networks.pdf"
    ]
    
    print(f"Updating to {len(new_files)} files...")
    
    try:
        result = update_dataset_node(flow_name, node_id, new_files, api_token)
        
        print("✅ Update successful!")
        print(f"Success: {result['success']}")
        print(f"Message: {result['message']}")
        print(f"Updated Node ID: {result['node_id']}")
        
        return result
        
    except requests.exceptions.HTTPError as e:
        print(f"❌ Update failed: {e}")
        if e.response.status_code == 404:
            print("Flow or dataset node not found")
        elif e.response.status_code == 400:
            print("Invalid configuration or files not found")
        raise

def get_current_dataset_config(flow_name, node_id, api_token):
    """Helper function to get current dataset configuration"""
    # This would use the List Dataset Nodes endpoint
    url = f"https://{flow_name}.flows.graphorlm.com/datasets"
    headers = {"Authorization": f"Bearer {api_token}"}
    
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    
    nodes = response.json()
    for node in nodes:
        if node['id'] == node_id:
            return node['data']['config']
    
    raise ValueError(f"Dataset node {node_id} not found")

# Usage
try:
    result = manage_dataset_configuration(
        flow_name="my-rag-pipeline",
        node_id="dataset-1748287628684",
        api_token="YOUR_API_TOKEN"
    )
except Exception as e:
    print(f"Error managing dataset: {e}")

cURL

# Basic update
curl -X PATCH https://my-rag-pipeline.flows.graphorlm.com/datasets/dataset-1748287628684 \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "files": [
        "attention.pdf",
        "bert.pdf",
        "transformer_architecture.pdf"
      ]
    }
  }'

# Update with empty files list (clear all files)
curl -X PATCH https://my-rag-pipeline.flows.graphorlm.com/datasets/dataset-1748287628684 \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "files": []
    }
  }'

# Add single file to configuration
curl -X PATCH https://my-rag-pipeline.flows.graphorlm.com/datasets/dataset-1748287628684 \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "files": ["new_document.pdf"]
    }
  }'

PHP

<?php
function updateDatasetNode($flowName, $nodeId, $files, $apiToken) {
    $url = "https://{$flowName}.flows.graphorlm.com/datasets/{$nodeId}";
    
    $data = [
        'config' => [
            'files' => $files
        ]
    ];
    
    $options = [
        'http' => [
            'header' => [
                "Authorization: Bearer {$apiToken}",
                "Content-Type: application/json"
            ],
            'method' => 'PATCH',
            'content' => json_encode($data)
        ]
    ];
    
    $context = stream_context_create($options);
    $result = file_get_contents($url, false, $context);
    
    if ($result === FALSE) {
        throw new Exception('Failed to update dataset node');
    }
    
    return json_decode($result, true);
}

function manageDatasetFiles($flowName, $nodeId, $apiToken) {
    // Define file operations
    $operations = [
        'add_research_papers' => [
            'attention.pdf',
            'bert.pdf',
            'transformer_architecture.pdf'
        ],
        'add_documentation' => [
            'api_guide.pdf',
            'user_manual.pdf'
        ],
        'minimal_set' => [
            'quick_reference.pdf'
        ]
    ];
    
    foreach ($operations as $operation => $files) {
        echo "🔄 Operation: {$operation}\n";
        echo "Files: " . implode(', ', $files) . "\n";
        
        try {
            $result = updateDatasetNode($flowName, $nodeId, $files, $apiToken);
            
            echo "✅ Success: {$result['message']}\n";
            echo "Updated Node: {$result['node_id']}\n\n";
            
            // Wait between operations
            sleep(1);
            
        } catch (Exception $e) {
            echo "❌ Failed: " . $e->getMessage() . "\n\n";
        }
    }
}

// Usage
try {
    $result = updateDatasetNode(
        'my-rag-pipeline',
        'dataset-1748287628684',
        ['attention.pdf', 'bert.pdf'],
        'YOUR_API_TOKEN'
    );
    
    echo "Dataset updated successfully!\n";
    echo "Result: " . json_encode($result, JSON_PRETTY_PRINT) . "\n";
    
} catch (Exception $e) {
    echo "Error: " . $e->getMessage() . "\n";
}
?>

Error Responses

Common Error Codes

Status Code | Description | Example Response
400 | Bad Request - Invalid configuration or files not found | {"detail": "The following files do not exist as sources in the dataset: missing_file.pdf"}
401 | Unauthorized - Invalid or missing API token | {"detail": "Invalid authentication credentials"}
404 | Not Found - Flow or dataset node not found | {"detail": "Dataset node with id 'invalid-node' not found in flow 'my-flow'"}
500 | Internal Server Error - Unexpected error during the update | {"detail": "Failed to update dataset node"}

Error Response Format

{
  "detail": "Error message describing what went wrong"
}

Example Error Responses

Files Not Found in Dataset

{
  "detail": "The following files do not exist as sources in the dataset: nonexistent_file.pdf, another_missing.pdf"
}

Dataset Node Not Found

{
  "detail": "Dataset node with id 'dataset-invalid' not found in flow 'my-rag-pipeline'"
}

Flow Not Found

{
  "detail": "Flow with name 'nonexistent-flow' not found"
}

Invalid API Token

{
  "detail": "Invalid authentication credentials"
}
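
Because every error uses the same {"detail": ...} shape, a client can surface the server's message instead of a bare status code. Below is a minimal Python sketch; the function name is illustrative and not part of the API:

import requests

def update_with_error_detail(flow_name, node_id, files, api_token):
    """Surface the API's 'detail' message when an update fails."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets/{node_id}"
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json"
    }

    response = requests.patch(url, headers=headers, json={"config": {"files": files}})

    if not response.ok:
        # Fall back to the raw body if the error response is not JSON
        try:
            detail = response.json().get("detail", response.text)
        except ValueError:
            detail = response.text
        raise RuntimeError(f"Update failed ({response.status_code}): {detail}")

    return response.json()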

Update Behavior

Node Status Changes

When you update a dataset node:

  1. Configuration Updated: The node’s file list is replaced with the new configuration
  2. Status Reset: The node is marked as "updated": false to indicate it needs reprocessing (see the verification sketch after this list)
  3. Successor Nodes: All downstream nodes in the flow are also marked as needing updates
  4. Flow State: The flow maintains its deployed status but requires redeployment to apply changes
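
The reprocessing flag can be confirmed programmatically. A minimal Python sketch, assuming the List Dataset Nodes response exposes the processed state under data.result.updated (as in the audit example later on this page); the function name is illustrative:

import requests

def node_needs_reprocessing(flow_name, node_id, api_token):
    """Check whether a dataset node is flagged for reprocessing after an update."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets"
    headers = {"Authorization": f"Bearer {api_token}"}

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    for node in response.json():
        if node["id"] == node_id:
            # False (or missing) means the node still needs reprocessing,
            # so the flow must be redeployed before the change takes effect.
            result = node.get("data", {}).get("result") or {}
            return result.get("updated") is not True

    raise ValueError(f"Dataset node {node_id} not found")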

File Validation

The endpoint validates the following; a local pre-flight check is sketched after the list:

  • All specified files exist as sources in the project
  • At least one file is specified (empty lists are not allowed)
  • File names exactly match uploaded source files (case-sensitive)
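
A minimal pre-flight sketch in Python, assuming you have already fetched the names of your uploaded sources (for example via the List Sources endpoint, which is not documented on this page); the helper name is illustrative:

def preflight_check(requested_files, available_source_names):
    """Catch invalid file lists locally before calling the PATCH endpoint."""
    if not requested_files:
        raise ValueError("At least one file must be specified")

    # Matching is case-sensitive, so compare names exactly as uploaded
    missing = [f for f in requested_files if f not in set(available_source_names)]
    if missing:
        raise ValueError(f"Files not found as sources: {', '.join(missing)}")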

Integration Examples

Dataset Configuration Manager

class DatasetConfigManager {
  constructor(flowName, apiToken) {
    this.flowName = flowName;
    this.apiToken = apiToken;
    this.baseUrl = `https://${flowName}.flows.graphorlm.com`;
  }

  async getCurrentNodes() {
    const response = await fetch(`${this.baseUrl}/datasets`, {
      headers: { 'Authorization': `Bearer ${this.apiToken}` }
    });

    if (!response.ok) {
      throw new Error(`Failed to get dataset nodes: ${response.status}`);
    }

    return await response.json();
  }

  async updateNode(nodeId, files) {
    const response = await fetch(`${this.baseUrl}/datasets/${nodeId}`, {
      method: 'PATCH',
      headers: {
        'Authorization': `Bearer ${this.apiToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        config: { files }
      })
    });

    if (!response.ok) {
      const error = await response.json();
      throw new Error(`Update failed: ${error.detail}`);
    }

    return await response.json();
  }

  async addFilesToNode(nodeId, newFiles) {
    // Get current configuration
    const nodes = await this.getCurrentNodes();
    const targetNode = nodes.find(node => node.id === nodeId);
    
    if (!targetNode) {
      throw new Error(`Dataset node ${nodeId} not found`);
    }

    const currentFiles = targetNode.data.config.files || [];
    const updatedFiles = [...new Set([...currentFiles, ...newFiles])]; // Remove duplicates

    console.log(`Adding ${newFiles.length} files to node ${nodeId}`);
    console.log(`Total files after update: ${updatedFiles.length}`);

    return await this.updateNode(nodeId, updatedFiles);
  }

  async removeFilesFromNode(nodeId, filesToRemove) {
    // Get current configuration
    const nodes = await this.getCurrentNodes();
    const targetNode = nodes.find(node => node.id === nodeId);
    
    if (!targetNode) {
      throw new Error(`Dataset node ${nodeId} not found`);
    }

    const currentFiles = targetNode.data.config.files || [];
    const updatedFiles = currentFiles.filter(file => !filesToRemove.includes(file));

    console.log(`Removing ${filesToRemove.length} files from node ${nodeId}`);
    console.log(`Files remaining: ${updatedFiles.length}`);

    return await this.updateNode(nodeId, updatedFiles);
  }

  async replaceAllFiles(nodeId, newFiles) {
    console.log(`Replacing all files in node ${nodeId} with ${newFiles.length} new files`);
    return await this.updateNode(nodeId, newFiles);
  }

  async auditNodeConfiguration() {
    const nodes = await this.getCurrentNodes();
    const audit = {
      totalNodes: nodes.length,
      totalFiles: 0,
      nodeDetails: [],
      duplicateFiles: {},
      emptyNodes: []
    };

    const fileUsage = {};

    for (const node of nodes) {
      const files = node.data.config.files || [];
      const nodeDetail = {
        id: node.id,
        name: node.data.name,
        fileCount: files.length,
        files: files,
        needsUpdate: !node.data.result?.updated
      };

      audit.nodeDetails.push(nodeDetail);
      audit.totalFiles += files.length;

      if (files.length === 0) {
        audit.emptyNodes.push(nodeDetail);
      }

      // Track file usage
      for (const file of files) {
        if (!fileUsage[file]) {
          fileUsage[file] = [];
        }
        fileUsage[file].push(node.id);
      }
    }

    // Find duplicate files
    for (const [file, nodeIds] of Object.entries(fileUsage)) {
      if (nodeIds.length > 1) {
        audit.duplicateFiles[file] = nodeIds;
      }
    }

    return audit;
  }
}

// Usage
const manager = new DatasetConfigManager('my-rag-pipeline', 'YOUR_API_TOKEN');

manager.auditNodeConfiguration()
  .then(audit => {
    console.log('Configuration Audit:', audit);
    
    if (audit.emptyNodes.length > 0) {
      console.log('Empty nodes found:', audit.emptyNodes.map(n => n.id));
    }
    
    if (Object.keys(audit.duplicateFiles).length > 0) {
      console.log('Duplicate files found:', audit.duplicateFiles);
    }
  })
  .catch(console.error);

Batch Configuration Tool

import requests
from typing import List, Dict, Any
import time

class BatchDatasetUpdater:
    def __init__(self, flow_name: str, api_token: str):
        self.flow_name = flow_name
        self.api_token = api_token
        self.base_url = f"https://{flow_name}.flows.graphorlm.com"
        
    def get_dataset_nodes(self) -> List[Dict[str, Any]]:
        """Get all dataset nodes in the flow"""
        response = requests.get(
            f"{self.base_url}/datasets",
            headers={"Authorization": f"Bearer {self.api_token}"}
        )
        response.raise_for_status()
        return response.json()
    
    def update_single_node(self, node_id: str, files: List[str]) -> Dict[str, Any]:
        """Update a single dataset node"""
        response = requests.patch(
            f"{self.base_url}/datasets/{node_id}",
            headers={
                "Authorization": f"Bearer {self.api_token}",
                "Content-Type": "application/json"
            },
            json={"config": {"files": files}}
        )
        response.raise_for_status()
        return response.json()
    
    def batch_update_nodes(self, updates: Dict[str, List[str]], delay: float = 1.0) -> Dict[str, Any]:
        """
        Update multiple dataset nodes with different file configurations
        
        Args:
            updates: Dictionary mapping node_ids to lists of files
            delay: Delay between updates in seconds
        """
        results = {
            "successful_updates": [],
            "failed_updates": [],
            "total_nodes": len(updates)
        }
        
        for node_id, files in updates.items():
            try:
                print(f"Updating node {node_id} with {len(files)} files...")
                result = self.update_single_node(node_id, files)
                results["successful_updates"].append({
                    "node_id": node_id,
                    "files": files,
                    "result": result
                })
                print(f"✅ Success: {result['message']}")
                
            except Exception as e:
                error_info = {
                    "node_id": node_id,
                    "files": files,
                    "error": str(e)
                }
                results["failed_updates"].append(error_info)
                print(f"❌ Failed: {e}")
            
            # Delay between updates
            if delay > 0:
                time.sleep(delay)
        
        return results
    
    def standardize_node_configurations(self, file_groups: Dict[str, List[str]]) -> Dict[str, Any]:
        """
        Apply standardized file configurations to dataset nodes
        
        Args:
            file_groups: Dictionary mapping group names to file lists
        """
        nodes = self.get_dataset_nodes()
        
        # Interactive selection of nodes for each group
        assignments = {}
        
        for group_name, files in file_groups.items():
            print(f"\n📁 Group: {group_name}")
            print(f"Files: {', '.join(files)}")
            print("Available nodes:")
            
            for i, node in enumerate(nodes):
                current_files = node['data']['config'].get('files', [])
                print(f"  {i+1}. {node['id']} - {node['data']['name']} ({len(current_files)} files)")
            
            # In a real implementation, you'd get user input here
            # For this example, we'll assign the first node to each group
            if nodes:
                selected_node = nodes[0]['id']
                assignments[selected_node] = files
                print(f"Assigned {group_name} to node {selected_node}")
        
        return self.batch_update_nodes(assignments)

# Usage
updater = BatchDatasetUpdater("my-rag-pipeline", "YOUR_API_TOKEN")

# Example: Standardize configurations
file_groups = {
    "research_papers": [
        "attention.pdf",
        "bert.pdf",
        "transformer_architecture.pdf"
    ],
    "documentation": [
        "api_guide.pdf",
        "user_manual.pdf",
        "troubleshooting.pdf"
    ],
    "datasets": [
        "training_data.csv",
        "validation_data.csv"
    ]
}

try:
    results = updater.standardize_node_configurations(file_groups)
    print(f"\nBatch update completed:")
    print(f"Successful: {len(results['successful_updates'])}")
    print(f"Failed: {len(results['failed_updates'])}")
    
except Exception as e:
    print(f"Batch update failed: {e}")

Best Practices

Configuration Management

  • Validate Files First: Use the List Sources endpoint to verify file availability before updating
  • Backup Configurations: Save current configurations before making changes (see the snapshot sketch after this list)
  • Incremental Updates: Make small, incremental changes rather than large configuration replacements
  • Document Changes: Keep track of configuration changes for rollback purposes
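
A minimal backup sketch in Python, using the List Dataset Nodes endpoint shown earlier to snapshot each node's file list to a local JSON file (the function name and backup file name are illustrative):

import json
import requests
from datetime import datetime, timezone

def backup_dataset_configs(flow_name, api_token, path=None):
    """Snapshot every dataset node's file list so a previous configuration
    can be restored later with a PATCH request."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets"
    headers = {"Authorization": f"Bearer {api_token}"}

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    snapshot = {
        "flow": flow_name,
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "nodes": {
            node["id"]: node["data"]["config"].get("files", [])
            for node in response.json()
        }
    }

    path = path or f"{flow_name}-datasets-backup.json"
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)

    return path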

File Organization

  • Consistent Naming: Use clear, consistent file naming conventions
  • Logical Grouping: Group related files in the same dataset nodes
  • Version Control: Include version information in file names when appropriate
  • Size Considerations: Balance the number of files per node for optimal performance

Error Handling

  • Validate Before Update: Check that files exist before attempting updates
  • Handle Partial Failures: In batch operations, handle individual failures gracefully
  • Retry Logic: Implement retry mechanisms for transient failures (a retry sketch follows this list)
  • Detailed Logging: Log all configuration changes for audit trails
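
A minimal retry sketch in Python: transient problems (network errors and 5xx responses) are retried with exponential backoff, while 4xx responses fail immediately since they indicate a configuration problem. The function name and retry policy are illustrative:

import time
import requests

def update_with_retries(flow_name, node_id, files, api_token, attempts=3):
    """Retry the update on transient failures; fail fast on client errors."""
    url = f"https://{flow_name}.flows.graphorlm.com/datasets/{node_id}"
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json"
    }
    payload = {"config": {"files": files}}

    for attempt in range(1, attempts + 1):
        try:
            response = requests.patch(url, headers=headers, json=payload, timeout=30)
        except requests.exceptions.RequestException:
            if attempt == attempts:
                raise
        else:
            if response.status_code < 500:
                response.raise_for_status()  # 4xx fails fast; 2xx falls through
                return response.json()
            if attempt == attempts:
                response.raise_for_status()  # persistent 5xx: give up
        time.sleep(2 ** attempt)  # exponential backoff between attempts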

Troubleshooting

Next Steps

After updating dataset configurations, you might want to: