Evaluate AI systems with the rigour they demand

Traditional software testing falls short for AI. EvalSpec provides a structured 5-dimension framework to assess accuracy, reliability, safety, alignment, and robustness — so you ship with confidence.

Start Evaluating

The 5 Evaluation Dimensions

Each dimension targets a distinct failure category. Together, they provide comprehensive coverage of AI system quality.

Accuracy & Groundedness

Is the output factually correct? Is it grounded in the provided context rather than fabricated?

Key questions: Does the system hallucinate? Can it distinguish between what it knows and what it doesn't? Does it cite sources accurately?
Measurement: Factual verification against ground truth, citation accuracy scoring, hallucination rate tracking, knowledge boundary detection.

Consistency & Reliability

Does the system deliver similar quality for similar inputs? Are outputs predictable in format and structure?

Key questions: Does the same prompt produce wildly different results? Does output format remain stable? Are quality levels consistent across sessions?
Measurement: Semantic similarity across repeated runs, format compliance rate, variance analysis on quality scores.

Safety & Compliance

Does the system refuse harmful requests? Can it withstand prompt injection? Does it meet regulatory requirements?

Key questions: Can the system be jailbroken? Does it leak sensitive data? Does it comply with GDPR, EU AI Act, and internal policies?
Measurement: Prompt injection success rate, harmful content generation rate, PII leakage testing, compliance checklist verification.

Alignment & Usefulness

Does the output serve the user's actual intent? Is the tone, length, and format appropriate for the context?

Key questions: Does the system understand implicit instructions? Does it follow specified constraints (word count, tone, audience)? Is the output actionable?
Measurement: Instruction-following compliance, user satisfaction scoring, task completion rate, constraint adherence metrics.

Robustness & Edge Cases

How does the system handle unexpected, malformed, or adversarial input? Does it degrade gracefully?

Key questions: What happens with empty input? Extremely long input? Mixed languages? Contradictory instructions? Does it fail silently or communicate the issue?
Measurement: Error handling coverage, input boundary testing, graceful degradation assessment, recovery behavior analysis.

Why 5 Dimensions?

Single-metric evaluation misses critical failure modes. A system can be accurate but unsafe, consistent but misaligned, or robust but unhelpful.

Accuracy alone is insufficient

A system that gives correct but harmful answers, or accurate outputs in the wrong format, still fails users. Accuracy without safety and alignment is dangerous.

Safety requires dedicated testing

Prompt injection, jailbreaking, and data leakage are adversarial problems that standard quality metrics never detect. They require purpose-built test cases.

Consistency reveals systemic issues

Non-deterministic systems can pass spot checks while failing in production. Measuring variance across identical inputs exposes hidden reliability problems.

Alignment captures user intent

The gap between "technically correct" and "actually useful" is where most AI systems fail. Alignment testing ensures outputs serve real user needs.

Edge cases define production readiness

Real-world inputs are messy, contradictory, and unexpected. Robustness testing reveals how your system behaves when the textbook ends and reality begins.

Holistic coverage prevents blind spots

Each dimension addresses failure modes invisible to the others. Together, they create a comprehensive quality picture — not just a point estimate.

EvalSpec — AI Systems Evaluation Framework. Built for quality engineers, by quality engineers.
EvalSpec — AI Systems Evaluation Framework

Test Case Library

Browse, filter, and select from curated test cases for evaluating AI systems across all dimensions.

EvalSpec — AI Systems Evaluation Framework

Sign in

Access your evaluation projects.

Forgot your password?

Don't have an account? Create one

Create account

Start evaluating your AI systems.

Already have an account? Sign in

Reset password

Enter your email and we'll send you a reset link.

Back to sign in

Choose a new password

Enter and confirm your new password below.

Evaluation Builder

Define your system, select criteria, choose test cases, and run your evaluation.

1
System
2
Criteria
3
Test Cases
4
Evaluate
5
Results

Define Your System

Select Evaluation Criteria

These five dimensions define what "good" looks like for your AI system. We've pre-set the weights based on your system type — higher weight means that dimension counts more toward your overall score.

Hover the icon next to each criterion to learn what it measures and when you might want to adjust its weight.

Select Test Cases

We've suggested test cases based on your system type. Select the ones you want to include, or add your own.

Run Evaluation

For each test case, paste the actual system output and score the result.

0 / 0 scored

Your Projects

+ New Evaluation
Back to Dashboard

Compare Evaluations

Session Analytics

Session-only event log. For production use, connect to a backend (Supabase, PostHog, etc.)

Pricing

Start free, pay only when you need more.

1 evaluation included with every account No credit card required to get started.
Get started free
Pay-as-you-go
£9.99
1 evaluation
  • Full evaluation report
  • Export results (CSV)
  • RAG status analysis
  • All 5 evaluation dimensions
Bundle of 20
£79.99
£4.00 per evaluation
  • Everything in Bundle of 5
  • Full analytics dashboard
  • Bulk evaluations
  • Best per-evaluation rate

Secure payment via Stripe. Credits never expire.

EvalSpec — AI Systems Evaluation Framework

404

Page not found

The page you're looking for doesn't exist.

Back to home