LLM nodes are the final component in a RAG pipeline: they generate natural language responses from retrieved context. Each node combines a language model with a prompt template to produce contextually aware answers.

Key Features

  • Multiple Models: Access to 8 language models including GPT-4o, GPT-4.1, and high-speed Groq models
  • Prompt Integration: Custom prompt templates for specialized response behaviors
  • Temperature Control: Adjust creativity from deterministic (0.0) to creative (2.0) responses
  • Streaming Support: Real-time response generation for interactive applications
  • Quality Metrics: Built-in evaluation using DeepEval metrics

Available Endpoints

LLM nodes are managed through two endpoints, both used in the examples on this page:

  • GET /llm: List the LLM nodes in a flow
  • PATCH /llm/{node_id}: Update an LLM node's configuration

LLM Node Structure

{
  "id": "llm-1748287628685",
  "type": "llm",
  "position": { "x": 500, "y": 300 },
  "data": {
    "name": "Response Generator",
    "config": {
      "model": "gpt-4o",
      "promptId": "default_retrieval_prompt", 
      "temperature": 0.0
    },
    "result": {
      "total_responses": 1247,
      "avg_response_length": 342,
      "avg_processing_time": 2.8,
      "streaming_enabled": true
    }
  }
}
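
To inspect these fields for your own flow, you can list the LLM nodes and read each node's config and result block. A minimal sketch, assuming the GET /llm endpoint used in the examples below returns a list of node objects in the shape shown above (flow name and token are placeholders):

import requests

FLOW_NAME = "my-flow"          # placeholder flow name
API_TOKEN = "YOUR_API_TOKEN"

response = requests.get(
    f"https://{FLOW_NAME}.flows.graphorlm.com/llm",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
response.raise_for_status()

for node in response.json():
    config = node["data"]["config"]
    result = node["data"].get("result", {})
    print(f"{node['id']}: model={config['model']}, "
          f"temperature={config['temperature']}, "
          f"avg_processing_time={result.get('avg_processing_time')}s")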

Configuration Parameters

Parameter   | Type            | Description
model       | string          | Language model for response generation
promptId    | string          | Prompt template for instruction guidance
temperature | float (0.0-2.0) | Creativity and randomness control
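
These parameters map directly onto the config object shown above. As an illustration only (this helper is not part of the API), they can be wrapped in a small dataclass that rejects out-of-range temperatures before a request is sent:

from dataclasses import dataclass, asdict

@dataclass
class LLMNodeConfig:
    model: str
    promptId: str
    temperature: float = 0.0

    def __post_init__(self):
        # Temperature must stay within the documented 0.0-2.0 range.
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")

config = asdict(LLMNodeConfig(model="gpt-4o", promptId="default_retrieval_prompt"))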

Available Models

Model                | Context Window | Best For                         | Expected Latency
gpt-4o               | 128K tokens    | High accuracy, complex reasoning | 2-4 seconds
gpt-4o-mini          | 128K tokens    | Balanced quality and speed       | 1-2 seconds
gpt-4.1              | 128K tokens    | Latest capabilities              | 2-5 seconds
gpt-4.1-mini         | 128K tokens    | Modern features, efficient       | 1-2 seconds
gpt-4.1-nano         | 128K tokens    | Resource optimization            | 0.8-1.5 seconds
gpt-3.5-turbo-0125   | 16K tokens     | High-volume processing           | 0.5-1 second
mixtral-8x7b-32768   | 32K tokens     | Real-time processing             | 0.5-1 second
llama-3.1-8b-instant | 8K tokens      | Ultra-fast responses             | 0.3-0.8 seconds
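
If you select models programmatically, the matrix above can be captured in a small lookup table. The helper below is illustrative only; the latency figures are the documented expectations, not guarantees:

MODELS = [
    # (name, context window in tokens, upper bound of expected latency in seconds)
    ("gpt-4.1",              128_000, 5.0),
    ("gpt-4o",               128_000, 4.0),
    ("gpt-4o-mini",          128_000, 2.0),
    ("gpt-4.1-mini",         128_000, 2.0),
    ("gpt-4.1-nano",         128_000, 1.5),
    ("gpt-3.5-turbo-0125",    16_000, 1.0),
    ("mixtral-8x7b-32768",    32_000, 1.0),
    ("llama-3.1-8b-instant",   8_000, 0.8),
]

def pick_model(max_latency_seconds: float, min_context_tokens: int = 0) -> str:
    """Return the first model (listed roughly by capability) that fits the
    latency budget and context-window requirement."""
    for name, context, latency in MODELS:
        if latency <= max_latency_seconds and context >= min_context_tokens:
            return name
    raise ValueError("No model satisfies the given constraints")

print(pick_model(max_latency_seconds=1.0, min_context_tokens=32_000))  # mixtral-8x7b-32768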

Common Configurations

Maximum Accuracy

For technical documentation and critical applications:
{
  "model": "gpt-4o",
  "promptId": "default_retrieval_prompt",
  "temperature": 0.0
}

Balanced Performance

For general Q&A and customer support:
{
  "model": "gpt-4o-mini",
  "promptId": "default_retrieval_prompt",
  "temperature": 0.2
}

High-Speed Processing

For real-time chat and instant responses:
{
  "model": "mixtral-8x7b-32768",
  "promptId": "customer_support_agent",
  "temperature": 0.1
}

Creative Generation

For content creation and diverse outputs:
{
  "model": "gpt-4.1",
  "promptId": "creative_content_generator",
  "temperature": 0.8
}
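
These presets can also be applied programmatically. The sketch below collects them into a dictionary and pushes one to a node; the PATCH endpoint and {"config": ...} payload mirror the Python example further down, and the flow name, node ID, and token are placeholders:

import requests

PRESETS = {
    "maximum_accuracy":      {"model": "gpt-4o",             "promptId": "default_retrieval_prompt",   "temperature": 0.0},
    "balanced_performance":  {"model": "gpt-4o-mini",        "promptId": "default_retrieval_prompt",   "temperature": 0.2},
    "high_speed_processing": {"model": "mixtral-8x7b-32768", "promptId": "customer_support_agent",     "temperature": 0.1},
    "creative_generation":   {"model": "gpt-4.1",            "promptId": "creative_content_generator", "temperature": 0.8},
}

def apply_preset(flow_name, api_token, node_id, preset_name):
    # Send the chosen preset as the node's new config.
    response = requests.patch(
        f"https://{flow_name}.flows.graphorlm.com/llm/{node_id}",
        headers={"Authorization": f"Bearer {api_token}"},
        json={"config": PRESETS[preset_name]},
    )
    response.raise_for_status()
    return response.json()

apply_preset("my-flow", "YOUR_API_TOKEN", "llm-node-id", "balanced_performance")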

Quality Metrics

LLM nodes track response quality using DeepEval metrics:
  • Contextual Precision: Accuracy of context usage
  • Contextual Recall: Completeness of context utilization
  • Answer Relevancy: Relevance of response to question
  • Faithfulness: Adherence to context without hallucination
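
To reproduce these scores outside the platform, you can run the same DeepEval metrics directly. A minimal sketch, assuming DeepEval's LLMTestCase and metric classes (names may differ between library versions) and an LLM judge configured in your environment, for example an OpenAI API key:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
)

# One question/answer pair plus the retrieved context it was generated from.
test_case = LLMTestCase(
    input="What is the maximum temperature setting?",
    actual_output="The temperature parameter ranges from 0.0 to 2.0.",
    expected_output="Temperature can be set between 0.0 and 2.0.",
    retrieval_context=["temperature: float (0.0-2.0), creativity and randomness control"],
)

for metric in (
    ContextualPrecisionMetric(),
    ContextualRecallMetric(),
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
):
    metric.measure(test_case)
    print(f"{metric.__class__.__name__}: {metric.score:.2f}")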

JavaScript Example

class LLMManager {
  constructor(flowName, apiToken) {
    this.flowName = flowName;
    this.apiToken = apiToken;
    this.baseUrl = `https://${flowName}.flows.graphorlm.com`;
  }
  
  async updateLLMNode(nodeId, config) {
    const response = await fetch(`${this.baseUrl}/llm/${nodeId}`, {
      method: 'PATCH',
      headers: {
        'Authorization': `Bearer ${this.apiToken}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ config })
    });
    
    if (!response.ok) {
      throw new Error(`Update failed: ${response.status}`);
    }
    
    return await response.json();
  }
  
  async getLLMNodes() {
    const response = await fetch(`${this.baseUrl}/llm`, {
      headers: { 'Authorization': `Bearer ${this.apiToken}` }
    });
    
    if (!response.ok) {
      throw new Error(`Failed to get nodes: ${response.status}`);
    }
    
    return await response.json();
  }
}

// Usage
const manager = new LLMManager('my-flow', 'YOUR_API_TOKEN');

// Update for high accuracy
await manager.updateLLMNode('llm-node-id', {
  model: 'gpt-4o',
  promptId: 'default_retrieval_prompt',
  temperature: 0.0
});

Python Example

import requests

class LLMManager:
    def __init__(self, flow_name, api_token):
        self.flow_name = flow_name
        self.api_token = api_token
        self.base_url = f"https://{flow_name}.flows.graphorlm.com"
    
    def update_llm_node(self, node_id, config):
        response = requests.patch(
            f"{self.base_url}/llm/{node_id}",
            headers={
                "Authorization": f"Bearer {self.api_token}",
                "Content-Type": "application/json"
            },
            json={"config": config}
        )
        response.raise_for_status()
        return response.json()
    
    def get_llm_nodes(self):
        response = requests.get(
            f"{self.base_url}/llm",
            headers={"Authorization": f"Bearer {self.api_token}"}
        )
        response.raise_for_status()
        return response.json()

# Usage
manager = LLMManager("my-flow", "YOUR_API_TOKEN")

# Update for fast processing
manager.update_llm_node("llm-node-id", {
    "model": "mixtral-8x7b-32768",
    "promptId": "customer_support_agent", 
    "temperature": 0.1
})

Best Practices

Model Selection

  • Use gpt-4o for maximum accuracy in critical applications
  • Use gpt-4o-mini for balanced performance in general use cases
  • Use Groq models (mixtral, llama) for real-time applications
  • Use creative models (gpt-4.1) for content generation

Temperature Settings

  • 0.0-0.1: Deterministic responses for factual queries
  • 0.2-0.4: Slight variation for natural conversations
  • 0.5-0.8: Creative responses for content generation
  • 0.9-2.0: Highly creative and diverse outputs

Performance Optimization

  • Monitor processing time and adjust model selection accordingly
  • Enable streaming for better user experience with longer responses
  • Use appropriate context window sizes for your content
  • Track quality metrics to ensure response accuracy

Troubleshooting

Common issues and solutions:
  • Slow responses: Switch to a faster model (Groq models or the mini/nano variants)
  • Inconsistent quality: Use higher-quality models or lower temperature
  • Resource usage: Use efficient models (nano, mini) for high-volume processing
  • Poor context usage: Optimize prompt templates and verify context formatting

Next Steps