This guide explains how to use GraphorLM’s evaluation capabilities to measure, analyze, and improve the performance of your RAG pipeline.

Overview

Evaluation is a critical step in developing effective RAG pipelines. It helps you:

  1. Measure the quality of retrieved information
  2. Assess the relevance of chunks to user queries
  3. Identify gaps in your knowledge base
  4. Optimize your retrieval and chunking configurations
  5. Compare performance across different pipeline versions

GraphorLM offers built-in evaluation tools that enable systematic testing and optimization of your RAG components.

The Evaluation Component

In GraphorLM’s Flow Builder, the Evaluation component analyzes the performance of your retrieval and response generation.

This component can be attached to:

  • A Retrieval node to evaluate retrieval quality
  • An LLM node to evaluate response quality (if you’re using the optional LLM component)

Configuring the Evaluation Component

To set up evaluation for your RAG pipeline:

  1. Add the Evaluation component to your flow
  2. Connect it to the output of the component(s) you want to evaluate
  3. Make sure you have a Testset or Question node connected to your flow
  4. Double-click the Evaluation component to open its configuration panel
  5. Click Update Results to generate the evaluation

Once you click Update Results, the system will automatically calculate and display all five metrics (Relevance, Precision, Recall, Answer Relevance, and Faithfulness) in the dashboard.

Required Components

The Evaluation node requires:

  • A properly configured Testset or Question node with questions and expected answers
  • One or more Retrieval or LLM nodes to evaluate (you can connect multiple for comparison)

Comparing Multiple Strategies

One powerful feature of GraphorLM’s evaluation system is the ability to compare different configurations simultaneously:

  • You can connect multiple Retrieval nodes to a single Evaluation node
  • You can connect multiple LLM nodes to a single Evaluation node
  • Each connection creates a separate evaluation dataset for easy comparison

This capability enables direct A/B testing of different retrieval strategies, embedding models, chunking approaches, or LLM configurations using the same test questions.

Creating Effective Testsets

Testsets are collections of queries and expected answers that serve as ground truth for your evaluation.
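
Conceptually, a testset is just a collection of question and expected-answer pairs. The sketch below shows one way such a collection could be represented; the field names and example content are illustrative assumptions, not GraphorLM’s internal format (in practice you create testsets through the UI, as described next).

```python
# Illustrative only: a testset as a list of question / expected-answer pairs.
# The field names and JSON layout are assumptions for this sketch,
# not GraphorLM's internal storage format (testsets are created in the UI).
import json

testset = [
    {
        "question": "What is the warranty period for the X200 model?",
        "expected_answer": "The X200 is covered by a two-year limited warranty "
                           "starting from the date of purchase.",
    },
    {
        "question": "How do I reset the device to factory settings?",
        "expected_answer": "Hold the power and volume-down buttons for ten seconds, "
                           "then confirm the reset in the on-screen menu.",
    },
]

print(json.dumps(testset, indent=2))
```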

To create a testset:

  1. Navigate to Testsets in the left sidebar
  2. Click New Testset
  3. Add questions that represent typical user queries
  4. Provide expected answers for each question
  5. Save your testset with a descriptive name

Testset Best Practices

  • Include diverse question types: Mix factual, conceptual, and reasoning questions
  • Cover edge cases: Include questions that might be challenging for your system
  • Use real-world examples: Base questions on actual user queries when possible
  • Provide comprehensive expected answers: Include all key points that should be covered
  • Update regularly: Add new questions as you discover gaps or receive new user questions

Evaluation Metrics

GraphorLM provides five key metrics to evaluate different aspects of your RAG pipeline:

Retrieval Metrics

For evaluating retrieval quality:

  • Relevance: Measures how well the retrieved documents relate to the query

    • Higher values indicate retrieved chunks contain information pertinent to the question
    • Useful for general assessment of retrieval effectiveness
  • Precision: Measures the proportion of relevant documents among the retrieved results

    • Higher values indicate better precision in finding relevant information
    • Useful for evaluating the quality of your retrieval configuration
  • Recall: Measures the proportion of relevant documents retrieved out of all relevant documents

    • Higher values indicate better coverage of available relevant information
    • Helps identify if your retrieval is missing important information
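
To make Precision and Recall concrete, the sketch below computes both for a single query, given the chunks a Retrieval node returned and the chunks judged relevant. It is a simplified illustration using exact chunk-ID matching; GraphorLM computes its scores internally, so treat this as a conceptual aid rather than the product’s implementation.

```python
# Conceptual illustration of precision and recall for one query.
# GraphorLM computes these metrics internally; this sketch only shows
# the underlying idea, using exact ID matching as a simplifying assumption.

retrieved = ["chunk_03", "chunk_07", "chunk_11", "chunk_19"]   # what the Retrieval node returned
relevant = {"chunk_03", "chunk_11", "chunk_21"}                # ground-truth relevant chunks

true_positives = [c for c in retrieved if c in relevant]

precision = len(true_positives) / len(retrieved)   # relevant share of what was retrieved
recall = len(true_positives) / len(relevant)       # share of relevant material that was found

print(f"Precision: {precision:.2f}")  # 2 / 4 = 0.50
print(f"Recall:    {recall:.2f}")     # 2 / 3 ≈ 0.67
```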

Response Metrics (if using LLM component)

For evaluating generated responses:

  • Answer Relevance: Measures how relevant the response is to the question

    • Higher values indicate the response directly addresses the query
    • Helps identify off-topic or irrelevant responses
  • Faithfulness: Measures how well the response sticks to the information in the retrieved documents

    • Higher values indicate responses that accurately reflect the source material
    • Helps identify hallucinations or made-up information in responses
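
Answer Relevance and Faithfulness are the kind of metrics typically scored by an LLM acting as a judge. The outline below sketches that idea; the judge() helper and prompt wording are hypothetical, not GraphorLM’s API, and serve only to illustrate what each metric asks.

```python
# High-level sketch of LLM-as-judge scoring for response metrics.
# The judge() helper and prompt wording are hypothetical; they are not
# GraphorLM's API, only an illustration of the two questions being scored.

def judge(prompt: str) -> float:
    """Placeholder: call an LLM and parse a 0.0-1.0 score from its reply."""
    raise NotImplementedError

def answer_relevance(question: str, response: str) -> float:
    # Does the response actually address the question?
    return judge(
        f"Question: {question}\nResponse: {response}\n"
        "Rate from 0.0 to 1.0 how directly the response answers the question."
    )

def faithfulness(response: str, retrieved_chunks: list[str]) -> float:
    # Is every claim in the response supported by the retrieved context?
    context = "\n".join(retrieved_chunks)
    return judge(
        f"Context:\n{context}\n\nResponse: {response}\n"
        "Rate from 0.0 to 1.0 how fully the response is supported by the context."
    )
```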

Interpreting Evaluation Results

The Evaluation component displays results in a comprehensive dashboard.

The interface is divided into two main sections:

  • Question list: On the left side, you’ll see all the questions from your testset
  • Metrics dashboard: On the right side, you’ll see the evaluation metrics and results

Analyzing Individual Questions

To analyze the performance of specific questions:

  1. Select a question from the list on the left side of the Evaluation component
  2. The dashboard will update to display metrics specifically related to that question
  3. Compare how different questions perform across various metrics
  4. Identify patterns in which types of questions perform better or worse

This question-level analysis helps you understand which types of queries your RAG pipeline handles well and which may need optimization.

Understanding Metric Scores

Each metric score includes additional context to help you understand the evaluation:

  • Hover tooltips: When you hover over any metric score, a tooltip explains the reasoning behind that particular score
  • Score breakdown: See how individual test cases contribute to the overall score
  • Score range: All metrics are normalized on a scale from 0.0 to 1.0, with higher values indicating better performance

This contextual information helps you pinpoint specific areas for improvement in your RAG pipeline and understand why certain configurations perform better than others.

When analyzing results:

  1. Look for patterns: Are certain types of questions performing poorly?
  2. Compare configurations: Test different retrieval settings and compare metrics
  3. Identify thresholds: Determine acceptable performance levels for your use case
  4. Track over time: Monitor how changes to your pipeline affect performance

Common Evaluation Workflows

Basic Retrieval Evaluation

To evaluate just the retrieval component:

  1. Connect a Testset node → Retrieval node → Evaluation node
  2. Focus on Relevance, Precision, and Recall metrics
  3. Experiment with different retrieval settings (Search Type, Top K, Score Threshold); see the sketch after this list for how Top K and Score Threshold narrow the retrieved set
  4. Compare results across configurations
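
As a rough illustration of what the Top K and Score Threshold settings in step 3 do, the snippet below filters a list of scored chunks. The chunk IDs, scores, and variable names are invented for this sketch and do not reflect GraphorLM’s retrieval internals.

```python
# Conceptual illustration of how Top K and Score Threshold narrow results.
# The scored chunks and parameter names are invented for this sketch;
# GraphorLM applies these settings inside the Retrieval node.

scored_chunks = [
    ("chunk_07", 0.91),
    ("chunk_02", 0.84),
    ("chunk_19", 0.66),
    ("chunk_11", 0.42),
    ("chunk_05", 0.31),
]

top_k = 3
score_threshold = 0.5

# Keep only chunks above the threshold, then take the K highest-scoring ones.
filtered = [c for c in scored_chunks if c[1] >= score_threshold]
results = sorted(filtered, key=lambda c: c[1], reverse=True)[:top_k]

print(results)  # [('chunk_07', 0.91), ('chunk_02', 0.84), ('chunk_19', 0.66)]
```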

Full Pipeline Evaluation

To evaluate the complete RAG pipeline with LLM (if using):

  1. Connect a Testset node → Retrieval node → LLM node → Evaluation node
  2. Analyze both retrieval and response metrics
  3. Identify which component may be limiting performance
  4. Optimize components individually and test again

Comparative Evaluation

To directly compare different strategies:

  1. Connect a single Testset node to multiple pipeline branches
  2. Create different configurations for each branch (e.g., different retrieval settings or LLM prompts)
  3. Connect all branches to the same Evaluation node
  4. Analyze metrics side-by-side to determine which configuration performs best

This approach allows you to identify optimal configurations more quickly and make data-driven decisions about which strategies to implement in your production pipeline.
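
For example, a side-by-side comparison of two hypothetical retrieval branches might look like the sketch below; the configuration names and scores are invented purely to show the kind of precision-versus-recall trade-off that comparative evaluation surfaces.

```python
# Hypothetical side-by-side comparison of two retrieval configurations
# evaluated on the same testset. All names and numbers are invented.

results = {
    "hybrid_search_top_k_5":  {"relevance": 0.82, "precision": 0.74, "recall": 0.61},
    "vector_search_top_k_10": {"relevance": 0.78, "precision": 0.58, "recall": 0.79},
}

for config, metrics in results.items():
    row = "  ".join(f"{name}={value:.2f}" for name, value in metrics.items())
    print(f"{config:<24} {row}")
```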

Best Practices for Evaluation

  1. Start early: Integrate evaluation into your development process from the beginning
  2. Test frequently: Evaluate after each significant change to your pipeline
  3. Segment results: Analyze performance across different question types or domains
  4. Compare iteratively: Make one change at a time to isolate its impact
  5. Balance metrics: Optimize for the metrics most important to your use case
  6. Consider qualitative review: Supplement metrics with human review of responses
  7. Document findings: Track what changes led to improvements or regressions

Next Steps

After understanding how to evaluate your RAG pipeline, explore: