Evaluate AI systems with the rigour they demand

No code. No pipelines. Structured, standards-backed AI evaluation in your browser.

Start Evaluating

The 5 Evaluation Dimensions

Each dimension targets a distinct failure category. Together, they provide comprehensive coverage of AI system quality.

Accuracy & Groundedness

Is the output factually correct? Is it grounded in the provided context rather than fabricated?

Key questions: Does the system hallucinate? Can it distinguish between what it knows and what it doesn't? Does it cite sources accurately?
Measurement: Factual verification against ground truth, citation accuracy scoring, hallucination rate tracking, knowledge boundary detection.

Consistency & Reliability

Does the system deliver similar quality for similar inputs? Are outputs predictable in format and structure?

Key questions: Does the same prompt produce wildly different results? Does output format remain stable? Are quality levels consistent across sessions?
Measurement: Semantic similarity across repeated runs, format compliance rate, variance analysis on quality scores.

Safety & Compliance

Does the system refuse harmful requests? Can it withstand prompt injection? Does it meet regulatory requirements?

Key questions: Can the system be jailbroken? Does it leak sensitive data? Does it comply with GDPR, EU AI Act, and internal policies?
Measurement: Prompt injection success rate, harmful content generation rate, PII leakage testing, compliance checklist verification.

Alignment & Usefulness

Does the output serve the user's actual intent? Is the tone, length, and format appropriate for the context?

Key questions: Does the system understand implicit instructions? Does it follow specified constraints (word count, tone, audience)? Is the output actionable?
Measurement: Instruction-following compliance, user satisfaction scoring, task completion rate, constraint adherence metrics.

Robustness & Edge Cases

How does the system handle unexpected, malformed, or adversarial input? Does it degrade gracefully?

Key questions: What happens with empty input? Extremely long input? Mixed languages? Contradictory instructions? Does it fail silently or communicate the issue?
Measurement: Error handling coverage, input boundary testing, graceful degradation assessment, recovery behavior analysis.

Why 5 Dimensions?

Single-metric evaluation misses critical failure modes. A system can be accurate but unsafe, consistent but misaligned, or robust but unhelpful.

Accuracy alone is insufficient

A system that gives correct but harmful answers, or accurate outputs in the wrong format, still fails users. Accuracy without safety and alignment is dangerous.

Safety requires dedicated testing

Prompt injection, jailbreaking, and data leakage are adversarial problems that standard quality metrics never detect. They require purpose-built test cases.

Consistency reveals systemic issues

Non-deterministic systems can pass spot checks while failing in production. Measuring variance across identical inputs exposes hidden reliability problems.

Alignment captures user intent

The gap between "technically correct" and "actually useful" is where most AI systems fail. Alignment testing ensures outputs serve real user needs.

Edge cases define production readiness

Real-world inputs are messy, contradictory, and unexpected. Robustness testing reveals how your system behaves when the textbook ends and reality begins.

Holistic coverage prevents blind spots

Each dimension addresses failure modes invisible to the others. Together, they create a comprehensive quality picture — not just a point estimate.

Who it's for

EvalSpec is built for the people responsible for AI quality — not just the engineers who build it.

QA Lead & Test Manager

Told to "test the AI" but handed no framework for non-deterministic systems. EvalSpec gives you a structured 5-dimension methodology and 70 pre-built test cases to work from immediately.

Product Manager

Need to decide if an AI feature is ready to ship, but gut feel isn't good enough. EvalSpec gives you an objective readiness assessment with RAG status and risk-based thresholds.

Compliance & Governance Officer

Must demonstrate AI systems meet regulatory requirements — EU AI Act, ISO 42001, GDPR. EvalSpec maps directly to these standards and produces exportable evidence.

AI & ML Team Lead

Running ad-hoc tests with no consistent methodology across the team. EvalSpec gives you reusable evaluation projects with comparison and trend tracking built in.

CTO & VP Engineering

Need assurance that AI integrations meet quality standards before launch. EvalSpec gives you a dashboard view of all evaluations with at-a-glance readiness status.

No Python required

Every other AI evaluation tool requires code. EvalSpec runs entirely in your browser — structured, rigorous, and accessible to every stakeholder on the team.

Test Case Library

Browse, filter, and select from curated test cases for evaluating AI systems across all dimensions.

Sign in

Access your evaluation projects.

Forgot your password?

Don't have an account? Create one

Create account

Start evaluating your AI systems.

Already have an account? Sign in

Reset password

Enter your email and we'll send you a reset link.

Back to sign in

Choose a new password

Enter and confirm your new password below.

Evaluation Builder

Define your system, select criteria, choose test cases, and run your evaluation.

1
System
2
Criteria
3
Test Cases
4
Evaluate
5
Results

Define Your System

Select Evaluation Criteria

These five dimensions define what "good" looks like for your AI system. We've pre-set the weights based on your system type — higher weight means that dimension counts more toward your overall score.

Hover the icon next to each criterion to learn what it measures and when you might want to adjust its weight.

Select Test Cases

We've suggested test cases based on your system type. Select the ones you want to include, or add your own.

Run Evaluation

For each test case, paste the actual system output and score the result.

0 / 0 scored

Your Projects

+ New Evaluation
Back to Dashboard

Compare Evaluations

Session Analytics

Session-only event log. For production use, connect to a backend (Supabase, PostHog, etc.)

Pricing

Start free, pay only when you need more.

1 evaluation included with every account No credit card required to get started.
Get started free
Pay-as-you-go
£9.99
1 evaluation
  • Full evaluation report
  • Export results (CSV)
  • RAG status analysis
  • All 5 evaluation dimensions
Bundle of 20
£79.99
£4.00 per evaluation
  • Everything in Bundle of 5
  • Full analytics dashboard
  • Bulk evaluations
  • Best per-evaluation rate

Secure payment via Stripe. Credits never expire.

Legal

Privacy Policy

Last updated: 15 March 2026

1. Who we are

EvalSpec ("we", "us", "our") operates the website at evalspec.ai and provides a structured AI systems evaluation framework. We are the data controller for personal data collected through this service.

Contact: hello@evalspec.ai

2. What data we collect

We collect the minimum data necessary to provide the service:

Data Why we collect it
Email address Account creation, login, and password reset
Device fingerprint A non-identifying token derived from browser characteristics (canvas rendering). Used solely to prevent free-tier abuse. Not used for advertising or tracking across sites.
Evaluation data AI system descriptions, test cases, and scores you enter while using EvalSpec. Stored to your account so you can retrieve them later.
Payment confirmation When you purchase credits, Stripe notifies us that a payment was completed and which credit tier to apply. We do not receive or store your card details.

Session analytics — we track which steps of the evaluation wizard you reach and how long you spend on each step. This data is stored only in your browser's sessionStorage and is never sent to our servers or shared with third parties.

3. Legal basis for processing

We process your personal data on the following legal bases under UK GDPR:

  • Contract — email address is required to provide the account and service you signed up for.
  • Legitimate interests — device fingerprinting to protect the integrity of the free tier and prevent abuse, where our interest does not override your rights.
  • Contract — processing payment confirmation to deliver the credits you purchased.

4. Third-party services

We use a small number of trusted third-party services to operate EvalSpec:

  • Supabase — provides our database and authentication infrastructure. Your email address and evaluation data are stored on Supabase servers. Supabase is SOC 2 Type II certified. Supabase Privacy Policy ↗
  • Stripe — processes all payments. When you purchase credits you are redirected to a Stripe-hosted payment page. EvalSpec never sees your card number or billing address. Stripe Privacy Policy ↗
  • Vercel — hosts and serves the EvalSpec application. Vercel may process standard server logs (IP address, browser type) as part of normal web hosting. Vercel Privacy Policy ↗

We do not use any advertising networks, social media trackers, or third-party analytics tools.

5. Cookies

EvalSpec does not use advertising or tracking cookies. Supabase Auth uses a session token stored in localStorage to keep you logged in. This token is essential for the service to function and does not track you across other websites.

6. How long we keep your data

  • Account data — retained for as long as your account is active.
  • Evaluation data — retained for as long as your account is active, or until you delete individual evaluations.
  • Device fingerprint — retained for as long as your account is active.
  • Payment records — retained for 7 years to meet UK financial record-keeping obligations.

When you delete your account, all personal data is removed from our systems within 30 days, except where retention is required by law.

7. Your rights

Under UK GDPR you have the right to:

  • Access — request a copy of the personal data we hold about you.
  • Rectification — ask us to correct inaccurate data.
  • Erasure — ask us to delete your personal data ("right to be forgotten").
  • Restriction — ask us to limit how we process your data.
  • Portability — receive your data in a structured, machine-readable format.
  • Object — object to processing based on legitimate interests (including the device fingerprint).

To exercise any of these rights, email us at hello@evalspec.ai. We will respond within 30 days. You also have the right to lodge a complaint with the Information Commissioner's Office (ICO) ↗.

8. Data security

All data is transmitted over HTTPS. Access to the database is restricted by Supabase Row Level Security (RLS) policies, ensuring each user can only access their own data. Passwords are hashed by Supabase Auth and never stored in plain text.

9. Children

EvalSpec is not directed at children under 13. We do not knowingly collect personal data from children. If you believe a child has provided us with personal data, please contact us and we will delete it promptly.

10. Changes to this policy

If we make material changes to this policy we will update the "Last updated" date at the top of this page and, where appropriate, notify registered users by email.

11. Contact

For any privacy-related questions or requests, contact us at hello@evalspec.ai.

Legal

Terms of Service

Last updated: 15 March 2026

1. Acceptance of terms

By creating an account or using EvalSpec at evalspec.ai, you agree to these Terms of Service. If you do not agree, do not use the service. These terms form a binding agreement between you and EvalSpec ("we", "us", "our").

2. What EvalSpec is

EvalSpec is a browser-based diagnostic tool that provides a structured framework for evaluating AI systems across five dimensions: Accuracy & Groundedness, Consistency & Reliability, Safety & Compliance, Alignment & Usefulness, and Robustness & Edge Cases.

EvalSpec is a diagnostic tool, not a certification service. Evaluation results are generated based entirely on the test cases, scores, and weightings you provide. Results represent a structured assessment of the inputs you supplied — they are not a guarantee, certification, or assurance that your AI system is fit for purpose in any particular context, free from defects, or compliant with any regulation. You remain solely responsible for all decisions regarding the development, testing, deployment, and operation of your AI systems.

No two AI systems or deployment contexts are identical. EvalSpec provides a consistent framework, but results will vary depending on the test cases selected, the scores assigned, and the risk profile chosen. We strongly recommend using EvalSpec as one input among several in your quality assurance process.

3. Accounts

  • You must provide a valid email address to create an account.
  • You are responsible for maintaining the security of your account credentials.
  • You must not share your account with others or allow unauthorised access.
  • You must be at least 13 years old to use EvalSpec.
  • We reserve the right to suspend or terminate accounts that violate these terms.

4. Credits, payments & refunds

EvalSpec operates on a credit system. One credit is consumed per completed evaluation.

  • Every account receives one free evaluation credit on registration. No payment is required to get started.
  • Additional credits are purchased in the tiers listed on the pricing page. Prices are in GBP and include VAT where applicable.
  • Payments are processed securely by Stripe. EvalSpec does not store your card details.
  • Credits do not expire.
  • All purchases are non-refundable. Once a credit purchase is completed, it cannot be refunded, whether or not the credits have been used. If you experience a technical issue that prevents you from using a credit you have paid for, contact us at hello@evalspec.ai and we will investigate.

5. Acceptable use

You agree not to use EvalSpec to:

  • Attempt to gain unauthorised access to any part of the service or its infrastructure.
  • Reverse engineer, decompile, or attempt to extract the source code of the service.
  • Use automated tools to scrape, crawl, or stress-test the service.
  • Submit content that is unlawful, harmful, or infringes third-party rights.
  • Resell or sublicense access to the service without written permission.
  • Circumvent the free-tier credit limit through multiple account creation or other means.

6. Your content

You retain ownership of any AI system descriptions, test cases, and evaluation data you submit to EvalSpec. By submitting content, you grant us a limited licence to store and process it solely for the purpose of providing the service to you. We do not use your evaluation data for any other purpose, and we do not share it with third parties.

7. Intellectual property

EvalSpec, its framework, methodology, design, and software are owned by EvalSpec and protected by applicable intellectual property laws. You may not copy, reproduce, or distribute any part of the service without written permission.

8. Disclaimer of warranties

EvalSpec is provided "as is" and "as available" without warranties of any kind, express or implied. We do not warrant that the service will be uninterrupted, error-free, or that results will be accurate, complete, or suitable for any particular purpose. The evaluation framework is based on established standards (NIST, EU AI Act, ISO 42001, OWASP) but its application depends entirely on the inputs and configuration you provide.

9. Limitation of liability

To the fullest extent permitted by law, EvalSpec shall not be liable for any indirect, incidental, special, consequential, or punitive damages arising from your use of the service, including but not limited to losses arising from reliance on evaluation results in deployment decisions. Our total liability to you for any claim shall not exceed the amount you paid to us in the 12 months preceding the claim.

10. Service availability & changes

We reserve the right to modify, suspend, or discontinue any part of the service at any time. We will provide reasonable notice of material changes where possible. We may update these terms from time to time — continued use of the service after changes are posted constitutes acceptance of the updated terms.

11. Governing law

These terms are governed by the laws of England and Wales. Any disputes arising from these terms or your use of EvalSpec shall be subject to the exclusive jurisdiction of the courts of England and Wales.

12. Contact

For any questions about these terms, contact us at hello@evalspec.ai.

About

Why EvalSpec exists

AI systems still need to be tested — they just need to be tested differently. EvalSpec was built to give every team a structured, standards-backed way to do that, without needing to write a single line of code.

Nathan Gough

Founder · 20 years in QA & AI Consulting

In association with KINTAL

The problem

After two decades in quality assurance and AI consulting, one thing became impossible to ignore: most teams have no framework for testing AI systems. When asked how they validate their AI before deployment, the answer is usually some version of "we tried it a few times and it seemed fine."

That's not a testing methodology — it's a hope.

AI systems behave differently from traditional software. They're non-deterministic, context-sensitive, and capable of failing in ways that standard test scripts will never catch. They need to be evaluated for accuracy, safety, consistency, alignment, and robustness — not just "does it return a result?"

Why EvalSpec

The tools that existed were built for ML engineers — Python SDKs, CI/CD pipelines, pytest frameworks. Powerful, but completely inaccessible to the QA leads, product managers, and compliance officers who are actually responsible for signing off on AI deployments.

EvalSpec was built to close that gap. A structured, 5-dimension evaluation framework grounded in NIST, the EU AI Act, ISO 42001, and OWASP — accessible entirely in a browser, with no code required.

The goal is simple: give every team the tools to answer the question "is this AI system ready?" with evidence, not instinct.

The framework

Accuracy & Groundedness
Does it hallucinate? Is it grounded in facts?
Consistency & Reliability
Does it produce stable, predictable results?
Safety & Compliance
Does it meet regulatory and security standards?
Alignment & Usefulness
Does it actually serve the user's intent?
Robustness & Edge Cases
Does it handle the unexpected gracefully?

Get in touch

Questions, feedback, or want to talk about AI evaluation? Reach us at hello@evalspec.ai.

404

Page not found

The page you're looking for doesn't exist.

Back to home