In GraphorLM’s Flow Builder, the Evaluation component analyzes the performance of your retrieval and response generation.

This component can be attached to:
A Retrieval node to evaluate retrieval quality
An LLM node to evaluate response quality (if you’re using the optional LLM component)
To use the Evaluation component:
Connect it to the output of the component(s) you want to evaluate
Make sure you have a Testset or Question node connected to your flow
Double-click the Evaluation component to open its configuration panel
Click Update Results to generate the evaluation
Once you click Update Results, the system will automatically calculate and display all five metrics (Relevance, Precision, Recall, Answer Relevance, and Faithfulness) in the dashboard.
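The scoring itself runs inside GraphorLM, but as a rough intuition for two of the metrics, the sketch below shows how retrieval Precision and Recall are commonly defined for a single test question. It is illustrative only: the function, chunk IDs, and ground-truth set are hypothetical and not GraphorLM’s implementation.

```python
# Illustrative sketch only -- not GraphorLM's actual scoring code.
# For one test question, retrieved chunks are compared against the set
# of chunks known to be relevant for that question.

def retrieval_precision_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one test question, each in [0.0, 1.0]."""
    if not retrieved_ids or not relevant_ids:
        return 0.0, 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    precision = hits / len(retrieved_ids)   # fraction of retrieved chunks that are relevant
    recall = hits / len(relevant_ids)       # fraction of relevant chunks that were retrieved
    return precision, recall

# Example: 3 of 5 retrieved chunks are relevant, out of 4 relevant chunks total
print(retrieval_precision_recall(["c1", "c2", "c3", "c4", "c5"], {"c1", "c3", "c5", "c9"}))
# -> (0.6, 0.75)
```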
One powerful feature of GraphorLM’s evaluation system is the ability to compare different configurations simultaneously:
You can connect multiple Retrieval nodes to a single Evaluation node
You can connect multiple LLM nodes to a single Evaluation node
Each connection creates a separate evaluation dataset for easy comparison
This capability enables direct A/B testing of different retrieval strategies, embedding models, chunking approaches, or LLM configurations using the same test questions.
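To illustrate what this comparison amounts to, the sketch below prints metric scores for two retrieval configurations side by side. The configuration names and values are made up; the actual comparison happens in the Evaluation dashboard.

```python
# Illustrative sketch only -- configuration names and scores are hypothetical.
# Shows the kind of side-by-side comparison the Evaluation node produces
# when several Retrieval or LLM nodes feed into it.

results = {
    "retrieval_top_k_4":  {"relevance": 0.81, "precision": 0.74, "recall": 0.62},
    "retrieval_top_k_10": {"relevance": 0.85, "precision": 0.58, "recall": 0.79},
}

metrics = ["relevance", "precision", "recall"]
print(f"{'configuration':<22}" + "".join(f"{m:>12}" for m in metrics))
for name, scores in results.items():
    print(f"{name:<22}" + "".join(f"{scores[m]:>12.2f}" for m in metrics))
```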
Each metric score includes additional context to help you understand the evaluation:
Hover tooltips: When you hover your mouse over any metric score, a tooltip appears explaining the reasoning behind that particular score
Score breakdown: See how individual test cases contribute to the overall score
Score range: All metrics are normalized on a scale from 0.0 to 1.0, with higher values indicating better performance
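As a rough illustration of how per-test-case scores roll up into a single dashboard value, the sketch below averages hypothetical individual scores; because every individual score lies in [0.0, 1.0], the mean does too. This is a simplified assumption, not necessarily GraphorLM’s exact aggregation.

```python
# Illustrative sketch only -- a simple mean of hypothetical per-test-case scores.
def overall_score(per_case_scores: list[float]) -> float:
    return sum(per_case_scores) / len(per_case_scores) if per_case_scores else 0.0

print(overall_score([0.9, 0.7, 1.0, 0.4]))  # -> 0.75
```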
This contextual information helps you pinpoint specific areas for improvement in your RAG pipeline and understand why certain configurations perform better than others.

When analyzing results:
Look for patterns: Are certain types of questions performing poorly?
Compare configurations: Test different retrieval settings and compare metrics
Identify thresholds: Determine acceptable performance levels for your use case
Track over time: Monitor how changes to your pipeline affect performance
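For example, a lightweight way to track runs and enforce acceptance thresholds outside the Flow Builder might look like the sketch below; the threshold values, dates, and scores are hypothetical.

```python
# Illustrative sketch only -- thresholds and historical values are hypothetical.
# Tracks evaluation runs over time and flags metrics that fall below the
# acceptance thresholds defined for your use case.

THRESHOLDS = {"faithfulness": 0.80, "answer_relevance": 0.75, "recall": 0.70}

history = [
    {"run": "2024-05-01", "faithfulness": 0.84, "answer_relevance": 0.79, "recall": 0.72},
    {"run": "2024-05-15", "faithfulness": 0.77, "answer_relevance": 0.81, "recall": 0.74},
]

latest = history[-1]
for metric, minimum in THRESHOLDS.items():
    status = "OK" if latest[metric] >= minimum else "BELOW THRESHOLD"
    print(f"{latest['run']}  {metric:<17} {latest[metric]:.2f} (min {minimum:.2f})  {status}")
```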
To compare configurations side by side:
Connect a single Testset node to multiple pipeline branches
Create different configurations for each branch (e.g., different retrieval settings or LLM prompts)
Connect all branches to the same Evaluation node
Analyze metrics side-by-side to determine which configuration performs best
This approach allows you to identify optimal configurations more quickly and make data-driven decisions about which strategies to implement in your production pipeline.