Overview
The Rhesis SDK provides a comprehensive metrics system for evaluating LLM-based systems. The metrics module supports multiple evaluation frameworks, lets you create custom metrics tailored to your use cases, and integrates with the backend so you can work with metrics directly from the platform.
Metric Types
Rhesis SDK supports two types of metrics:
Single-Turn Metrics
Single-turn metrics evaluate individual exchanges between user input and system output. These metrics are ideal for assessing:
- RAG Systems: Context relevance, faithfulness, and answer accuracy
- Response Quality: Clarity, relevance, and accuracy of individual responses
- Safety & Compliance: Bias, toxicity, PII leakage, and other safety concerns
- Custom Evaluations: Domain-specific quality assessments
View Single-Turn Metrics Documentation →
Conversational Metrics
Conversational metrics (multi-turn metrics) evaluate the quality of interactions across multiple conversation turns. These metrics are ideal for assessing:
- Conversation Flow: Turn relevancy and coherence across dialogue
- Goal Achievement: Whether objectives are met throughout the conversation
- Role Adherence: Consistency in maintaining assigned roles
- Knowledge Retention: Ability to recall and reference earlier conversation context
- Tool Usage: Appropriate selection and utilization of available tools
- Conversation Completeness: Whether conversations reach satisfactory conclusions
View Conversational Metrics Documentation →
Metric Scopes
Every metric has a metric_scope that controls which test types it can be used with. The scope can be Single-Turn, Multi-Turn, or both.
| Metric Class | Default Scope | Notes |
|---|---|---|
| NumericJudge | Single-Turn, Multi-Turn | See note below on multi-turn behavior |
| CategoricalJudge | Single-Turn, Multi-Turn | See note below on multi-turn behavior |
| ConversationalJudge | Single-Turn, Multi-Turn | Receives structured ConversationHistory |
| GoalAchievementJudge | Single-Turn, Multi-Turn | Receives structured ConversationHistory |
| GarakDetectorMetric | Single-Turn only | Operates on individual prompt/response pairs |
How single-turn metrics work in multi-turn tests: When a NumericJudge or CategoricalJudge is used in a multi-turn evaluation, the full conversation is serialized to plain text and passed as the output parameter. The metric does not receive a structured ConversationHistory object — it evaluates the conversation as a single text blob. This means the evaluation quality depends entirely on the evaluation_prompt you write. For turn-aware evaluation (e.g., analyzing coherence between specific turns), use a ConversationalJudge instead, which receives the full structured conversation with individual turns.
You can override the default scope when creating a metric:
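A sketch of what that override might look like, assuming the judge constructors accept a `metric_scope` argument (the import path, parameter name spelling, and scope values shown are assumptions, not confirmed API — check the SDK reference for the real signature):

```python
# Hypothetical sketch: restricting a NumericJudge to single-turn use.
# Import path and exact parameter values are assumptions.
from rhesis.sdk.metrics import NumericJudge  # assumed import path

clarity = NumericJudge(
    name="clarity",
    evaluation_prompt="Rate the clarity of the response from 0 to 10.",
    metric_scope=["Single-Turn"],  # override the default Single-Turn + Multi-Turn scope
)
```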
Framework Integration
Rhesis integrates with the following open-source evaluation frameworks:
- DeepEval - Apache License 2.0 - The LLM Evaluation Framework by Confident AI
- DeepTeam - Apache License 2.0 - The LLM Red Teaming Framework by Confident AI
- Ragas - Apache License 2.0 - Supercharge Your LLM Application Evaluations by Exploding Gradients
- Garak - Apache License 2.0 - LLM Vulnerability Scanner by NVIDIA
These tools are used through their public APIs. The original licenses and copyright notices can be found in their respective repositories. Rhesis is not affiliated with these projects.
Quick Example
API Key Required: All examples require a valid Rhesis API key. Set your API key using:
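For example, as an environment variable (the `RHESIS_API_KEY` variable name is an assumption — verify it against the Installation & Setup guide):

```shell
# Assumed environment variable name; check the Installation & Setup guide
export RHESIS_API_KEY="your-api-key-here"
```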
For more information, see the Installation & Setup guide.
Single-Turn Evaluation
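A minimal sketch of what a single-turn evaluation might look like. `NumericJudge` is named on this page, but the import path, `evaluate` method name, and argument names are assumptions:

```python
# Hypothetical single-turn evaluation sketch; method and argument names
# are assumptions, not confirmed API.
from rhesis.sdk.metrics import NumericJudge  # assumed import path

judge = NumericJudge(
    name="answer_relevance",
    evaluation_prompt="Score from 0 to 10 how well the output answers the input.",
)
result = judge.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
)
print(result)  # a score/result object, depending on the SDK
```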
Conversational Evaluation
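A corresponding sketch for a multi-turn evaluation. `ConversationalJudge` is named on this page, but the way the conversation history is passed (shown here as a list of role/content turns) is an assumption:

```python
# Hypothetical multi-turn evaluation sketch; the conversation format and
# method signature are assumptions, not confirmed API.
from rhesis.sdk.metrics import ConversationalJudge  # assumed import path

judge = ConversationalJudge(
    name="conversation_coherence",
    evaluation_prompt="Assess coherence and turn relevancy across the dialogue.",
)
result = judge.evaluate(
    conversation=[
        {"role": "user", "content": "I'd like to book a flight to Berlin."},
        {"role": "assistant", "content": "Sure - what dates are you considering?"},
        {"role": "user", "content": "Next Friday, returning Sunday."},
    ],
)
```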
Custom Metrics
In addition to framework-provided metrics, Rhesis offers custom metric builders:
For Single-Turn Evaluation
- NumericJudge: Create custom numeric scoring metrics (e.g., 0-10 scale)
- CategoricalJudge: Create custom categorical classification metrics
For Conversational Evaluation
- ConversationalJudge: Create custom conversational quality metrics
- GoalAchievementJudge: Evaluate goal achievement with custom criteria
Platform Integration
Metrics can be managed both in the platform and in the SDK. The SDK provides push and pull methods to synchronize metrics with the platform.
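A sketch of that synchronization flow, based only on the push/pull methods mentioned above (the exact method names, whether `pull` is a class method, and its parameters are all assumptions):

```python
# Hypothetical platform-sync sketch; method names and signatures are assumptions.
from rhesis.sdk.metrics import NumericJudge  # assumed import path

metric = NumericJudge(name="clarity", evaluation_prompt="Rate clarity from 0 to 10.")
metric.push()  # upload/update the metric definition on the platform

# Later, retrieve the platform-managed definition back into the SDK:
metric = NumericJudge.pull(name="clarity")  # assumed retrieval call
```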
Next Steps
- Single-Turn Metrics - Learn about all available single-turn metrics
- Conversational Metrics - Learn about all available conversational metrics
- Models Documentation - Configure LLM models for evaluation
- Installation & Setup - Setup instructions
Need Help?
If any metrics are missing from the list, or you would like to use a different provider, please let us know by creating an issue on GitHub.