This guide explains how to use GraphorLM’s evaluation capabilities to measure, analyze, and improve the performance of your RAG pipeline.

Overview

Evaluation is a critical step in developing effective RAG pipelines. It helps you:

  1. Measure the quality of retrieved information
  2. Assess the relevance of chunks to user queries
  3. Identify gaps in your knowledge base
  4. Optimize your retrieval and chunking configurations
  5. Compare performance across different pipeline versions

GraphorLM offers built-in evaluation tools that enable systematic testing and optimization of your RAG components.

The Evaluation Component

In GraphorLM’s Flow Builder, the Evaluation component analyzes the performance of your retrieval and response generation.

This component can be attached to:

  • A Retrieval node to evaluate retrieval quality
  • An LLM node to evaluate response quality (if you’re using the optional LLM component)

Configuring the Evaluation Component

To set up evaluation for your RAG pipeline:

  1. Add the Evaluation component to your flow
  2. Connect it to the output of the component(s) you want to evaluate
  3. Make sure you have a Testset or Question node connected to your flow
  4. Double-click the Evaluation component to open its configuration panel
  5. Click Update Results to generate the evaluation

Once you click Update Results, the system will automatically calculate and display all five metrics (Relevance, Precision, Recall, Answer Relevance, and Faithfulness) in the dashboard.

Required Components

The Evaluation node requires:

  • A properly configured Testset or Question node with questions and expected answers
  • One or more Retrieval or LLM nodes to evaluate (you can connect multiple for comparison)

Comparing Multiple Strategies

One powerful feature of GraphorLM’s evaluation system is the ability to compare different configurations simultaneously:

  • You can connect multiple Retrieval nodes to a single Evaluation node
  • You can connect multiple LLM nodes to a single Evaluation node
  • Each connection creates a separate evaluation dataset for easy comparison

This capability enables direct A/B testing of different retrieval strategies, embedding models, chunking approaches, or LLM configurations using the same test questions.

Creating Effective Testsets

Testsets are collections of queries and expected answers that serve as ground truth for your evaluation.
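
Conceptually, a testset is just a collection of question and expected-answer pairs. The sketch below shows one way such a collection could be represented; the field names and example content are illustrative assumptions, not GraphorLM’s internal format (in practice you create testsets through the UI, as described next).

```python
# Illustrative only: a testset as a list of question / expected-answer pairs.
# The field names and JSON layout are assumptions for this sketch,
# not GraphorLM's internal storage format (testsets are created in the UI).
import json

testset = [
    {
        "question": "What is the warranty period for the X200 model?",
        "expected_answer": "The X200 is covered by a two-year limited warranty "
                           "starting from the date of purchase.",
    },
    {
        "question": "How do I reset the device to factory settings?",
        "expected_answer": "Hold the power and volume-down buttons for ten seconds, "
                           "then confirm the reset in the on-screen menu.",
    },
]

print(json.dumps(testset, indent=2))
```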

To create a testset:

  1. Navigate to Testsets in the left sidebar
  2. Click New Testset
  3. Add questions that represent typical user queries
  4. Provide expected answers for each question
  5. Save your testset with a descriptive name

Testset Best Practices

  • Include diverse question types: Mix factual, conceptual, and reasoning questions
  • Cover edge cases: Include questions that might be challenging for your system
  • Use real-world examples: Base questions on actual user queries when possible
  • Provide comprehensive expected answers: Include all key points that should be covered
  • Update regularly: Add new questions as you discover gaps or receive new user questions

Evaluation Metrics

GraphorLM provides five key metrics to evaluate different aspects of your RAG pipeline:

Retrieval Metrics

For evaluating retrieval quality:

  • Relevance: Measures how well the retrieved documents relate to the query

    • Higher values indicate retrieved chunks contain information pertinent to the question
    • Useful for general assessment of retrieval effectiveness
  • Precision: Measures the proportion of relevant documents among the retrieved results

    • Higher values indicate better precision in finding relevant information
    • Useful for evaluating the quality of your retrieval configuration
  • Recall: Measures the proportion of relevant documents retrieved out of all relevant documents

    • Higher values indicate better coverage of available relevant information
    • Helps identify if your retrieval is missing important information
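
To make Precision and Recall concrete, the sketch below computes both for a single query, given the chunks a Retrieval node returned and the chunks judged relevant. It is a simplified illustration using exact chunk-ID matching; GraphorLM computes its scores internally, so treat this as a conceptual aid rather than the product’s implementation.

```python
# Conceptual illustration of precision and recall for one query.
# GraphorLM computes these metrics internally; this sketch only shows
# the underlying idea, using exact ID matching as a simplifying assumption.

retrieved = ["chunk_03", "chunk_07", "chunk_11", "chunk_19"]   # what the Retrieval node returned
relevant = {"chunk_03", "chunk_11", "chunk_21"}                # ground-truth relevant chunks

true_positives = [c for c in retrieved if c in relevant]

precision = len(true_positives) / len(retrieved)   # relevant share of what was retrieved
recall = len(true_positives) / len(relevant)       # share of relevant material that was found

print(f"Precision: {precision:.2f}")  # 2 / 4 = 0.50
print(f"Recall:    {recall:.2f}")     # 2 / 3 ≈ 0.67
```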

Response Metrics (if using LLM component)

For evaluating generated responses:

  • Answer Relevance: Measures how relevant the response is to the question

    • Higher values indicate the response directly addresses the query
    • Helps identify off-topic or irrelevant responses
  • Faithfulness: Measures how well the response sticks to the information in the retrieved documents

    • Higher values indicate responses that accurately reflect the source material
    • Helps identify hallucinations or made-up information in responses
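
Answer Relevance and Faithfulness are the kind of metrics typically scored by an LLM acting as a judge. The outline below sketches that idea; the judge() helper and prompt wording are hypothetical, not GraphorLM’s API, and serve only to illustrate what each metric asks.

```python
# High-level sketch of LLM-as-judge scoring for response metrics.
# The judge() helper and prompt wording are hypothetical; they are not
# GraphorLM's API, only an illustration of the two questions being scored.

def judge(prompt: str) -> float:
    """Placeholder: call an LLM and parse a 0.0-1.0 score from its reply."""
    raise NotImplementedError

def answer_relevance(question: str, response: str) -> float:
    # Does the response actually address the question?
    return judge(
        f"Question: {question}\nResponse: {response}\n"
        "Rate from 0.0 to 1.0 how directly the response answers the question."
    )

def faithfulness(response: str, retrieved_chunks: list[str]) -> float:
    # Is every claim in the response supported by the retrieved context?
    context = "\n".join(retrieved_chunks)
    return judge(
        f"Context:\n{context}\n\nResponse: {response}\n"
        "Rate from 0.0 to 1.0 how fully the response is supported by the context."
    )
```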

Interpreting Evaluation Results

The Evaluation component displays results in a comprehensive dashboard.

The interface is divided into two main sections:

  • Question list: On the left side, you’ll see all the questions from your testset
  • Metrics dashboard: On the right side, you’ll see the evaluation metrics and results

Analyzing Individual Questions

To analyze the performance of specific questions:

  1. Select a question from the list on the left side of the Evaluation component
  2. The dashboard will update to display metrics specifically related to that question
  3. Compare how different questions perform across various metrics
  4. Identify patterns in which types of questions perform better or worse

This question-level analysis helps you understand which types of queries your RAG pipeline handles well and which may need optimization.

Understanding Metric Scores

Each metric score includes additional context to help you understand the evaluation:

  • Hover tooltips: When you hover over any metric score, a tooltip explains the reasoning behind that particular score
  • Score breakdown: See how individual test cases contribute to the overall score
  • Score range: All metrics are normalized on a scale from 0.0 to 1.0, with higher values indicating better performance

This contextual information helps you pinpoint specific areas for improvement in your RAG pipeline and understand why certain configurations perform better than others.

When analyzing results:

  1. Look for patterns: Are certain types of questions performing poorly?
  2. Compare configurations: Test different retrieval settings and compare metrics
  3. Identify thresholds: Determine acceptable performance levels for your use case
  4. Track over time: Monitor how changes to your pipeline affect performance

Common Evaluation Workflows

Basic Retrieval Evaluation

To evaluate just the retrieval component:

  1. Connect a Testset node → Retrieval node → Evaluation node
  2. Focus on Relevance, Precision, and Recall metrics
  3. Experiment with different retrieval settings (Search Type, Top K, Score Threshold); see the sketch after this list for how Top K and Score Threshold narrow the retrieved set
  4. Compare results across configurations
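
As a rough illustration of what the Top K and Score Threshold settings in step 3 do, the snippet below filters a list of scored chunks. The chunk IDs, scores, and variable names are invented for this sketch and do not reflect GraphorLM’s retrieval internals.

```python
# Conceptual illustration of how Top K and Score Threshold narrow results.
# The scored chunks and parameter names are invented for this sketch;
# GraphorLM applies these settings inside the Retrieval node.

scored_chunks = [
    ("chunk_07", 0.91),
    ("chunk_02", 0.84),
    ("chunk_19", 0.66),
    ("chunk_11", 0.42),
    ("chunk_05", 0.31),
]

top_k = 3
score_threshold = 0.5

# Keep only chunks above the threshold, then take the K highest-scoring ones.
filtered = [c for c in scored_chunks if c[1] >= score_threshold]
results = sorted(filtered, key=lambda c: c[1], reverse=True)[:top_k]

print(results)  # [('chunk_07', 0.91), ('chunk_02', 0.84), ('chunk_19', 0.66)]
```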

Full Pipeline Evaluation

To evaluate the complete RAG pipeline with LLM (if using):

  1. Connect a Testset node → Retrieval node → LLM node → Evaluation node
  2. Analyze both retrieval and response metrics
  3. Identify which component may be limiting performance
  4. Optimize components individually and test again

Comparative Evaluation

To directly compare different strategies:

  1. Connect a single Testset node to multiple pipeline branches
  2. Create different configurations for each branch (e.g., different retrieval settings or LLM prompts)
  3. Connect all branches to the same Evaluation node
  4. Analyze metrics side-by-side to determine which configuration performs best

This approach allows you to identify optimal configurations more quickly and make data-driven decisions about which strategies to implement in your production pipeline.
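
For example, a side-by-side comparison of two hypothetical retrieval branches might look like the sketch below; the configuration names and scores are invented purely to show the kind of precision-versus-recall trade-off that comparative evaluation surfaces.

```python
# Hypothetical side-by-side comparison of two retrieval configurations
# evaluated on the same testset. All names and numbers are invented.

results = {
    "hybrid_search_top_k_5":  {"relevance": 0.82, "precision": 0.74, "recall": 0.61},
    "vector_search_top_k_10": {"relevance": 0.78, "precision": 0.58, "recall": 0.79},
}

for config, metrics in results.items():
    row = "  ".join(f"{name}={value:.2f}" for name, value in metrics.items())
    print(f"{config:<24} {row}")
```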

Best Practices for Evaluation

  1. Start early: Integrate evaluation into your development process from the beginning
  2. Test frequently: Evaluate after each significant change to your pipeline
  3. Segment results: Analyze performance across different question types or domains
  4. Compare iteratively: Make one change at a time to isolate its impact
  5. Balance metrics: Optimize for the metrics most important to your use case
  6. Consider qualitative review: Supplement metrics with human review of responses
  7. Document findings: Track what changes led to improvements or regressions

Next Steps

After understanding how to evaluate your RAG pipeline, explore: