Evaluate AI systems with the rigour they demand
No code. No pipelines. Structured, standards-backed AI evaluation in your browser.
The 5 Evaluation Dimensions
Each dimension targets a distinct failure category. Together, they provide comprehensive coverage of AI system quality.
Accuracy & Groundedness
Is the output factually correct? Is it grounded in the provided context rather than fabricated?
Consistency & Reliability
Does the system deliver similar quality for similar inputs? Are outputs predictable in format and structure?
Safety & Compliance
Does the system refuse harmful requests? Can it withstand prompt injection? Does it meet regulatory requirements?
Alignment & Usefulness
Does the output serve the user's actual intent? Are the tone, length, and format appropriate for the context?
Robustness & Edge Cases
How does the system handle unexpected, malformed, or adversarial input? Does it degrade gracefully?
Why 5 Dimensions?
Single-metric evaluation misses critical failure modes. A system can be accurate but unsafe, consistent but misaligned, or robust but unhelpful.
Accuracy alone is insufficient
A system that gives correct but harmful answers, or accurate outputs in the wrong format, still fails users. Accuracy without safety and alignment is dangerous.
Safety requires dedicated testing
Prompt injection, jailbreaking, and data leakage are adversarial problems that standard quality metrics never detect. They require purpose-built test cases.
Consistency reveals systemic issues
Non-deterministic systems can pass spot checks while failing in production. Measuring variance across identical inputs exposes hidden reliability problems.
Alignment captures user intent
The gap between "technically correct" and "actually useful" is where most AI systems fail. Alignment testing ensures outputs serve real user needs.
Edge cases define production readiness
Real-world inputs are messy, contradictory, and unexpected. Robustness testing reveals how your system behaves when the textbook ends and reality begins.
Holistic coverage prevents blind spots
Each dimension addresses failure modes invisible to the others. Together, they create a comprehensive quality picture — not just a point estimate.
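As a rough illustration of the consistency point above (measuring variance across identical inputs), the sketch below summarises reviewer scores from repeated runs of one prompt. The function name `summariseRuns`, the 1 to 5 scoring scale, and the sample numbers are assumptions made for this example; they are not taken from EvalSpec itself.

```typescript
// Hypothetical helper: summarise the scores from repeated runs of one input.
interface RunSummary {
  mean: number;
  stdDev: number; // population standard deviation
  range: number;  // highest score minus lowest score
}

function summariseRuns(scores: number[]): RunSummary {
  if (scores.length === 0) {
    throw new Error("at least one score is required");
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) * (s - mean), 0) / scores.length;
  return {
    mean,
    stdDev: Math.sqrt(variance),
    range: Math.max(...scores) - Math.min(...scores),
  };
}

// Five runs of the same prompt, each scored 1-5 by a reviewer (made-up numbers).
const summary = summariseRuns([4, 4, 5, 2, 4]);
console.log(summary);
```

A single spot check could easily have landed on one of the 4s here; the wide range (3 points on a 5-point scale) is what flags the reliability problem.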
Who it's for
EvalSpec is built for the people responsible for AI quality — not just the engineers who build it.
QA Lead & Test Manager
Told to "test the AI" but handed no framework for non-deterministic systems. EvalSpec gives you a structured 5-dimension methodology and 70 pre-built test cases to work from immediately.
Product Manager
Need to decide if an AI feature is ready to ship, but gut feel isn't good enough. EvalSpec gives you an objective readiness assessment with RAG (red/amber/green) status and risk-based thresholds.
Compliance & Governance Officer
Must demonstrate AI systems meet regulatory requirements — EU AI Act, ISO 42001, GDPR. EvalSpec maps directly to these standards and produces exportable evidence.
AI & ML Team Lead
Running ad-hoc tests with no consistent methodology across the team. EvalSpec gives you reusable evaluation projects with comparison and trend tracking built in.
CTO & VP Engineering
Need assurance that AI integrations meet quality standards before launch. EvalSpec gives you a dashboard view of all evaluations with at-a-glance readiness status.
No Python required
Most AI evaluation tools require code. EvalSpec runs entirely in your browser: structured, rigorous, and accessible to every stakeholder on the team.
Test Case Library
Browse, filter, and select from curated test cases for evaluating AI systems across all dimensions.
Evaluation Builder
Define your system, select criteria, choose test cases, and run your evaluation.
Define Your System
Select Evaluation Criteria
These five dimensions define what "good" looks like for your AI system. We've pre-set the weights based on your system type — higher weight means that dimension counts more toward your overall score.
Hover the ⓘ icon next to each criterion to learn what it measures and when you might want to adjust its weight.
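To make the effect of the weights concrete, here is a minimal sketch of one plausible way a weighted overall score could be computed: a weighted average of the five per-dimension scores. The weighted-average formula, the 0 to 100 scale, the dimension keys, and every number below are assumptions for illustration; the text above does not state how EvalSpec combines scores.

```typescript
type Dimension =
  | "accuracy"
  | "consistency"
  | "safety"
  | "alignment"
  | "robustness";

type PerDimension = Record<Dimension, number>;

// Weighted average: each dimension's score counts in proportion to its weight.
function overallScore(scores: PerDimension, weights: PerDimension): number {
  const dims = Object.keys(weights) as Dimension[];
  const totalWeight = dims.reduce((sum, d) => sum + weights[d], 0);
  if (totalWeight <= 0) {
    throw new Error("weights must sum to a positive number");
  }
  const weighted = dims.reduce((sum, d) => sum + scores[d] * weights[d], 0);
  return weighted / totalWeight;
}

// Made-up weights (summing to 100) and made-up scores on a 0-100 scale.
const weights: PerDimension = {
  accuracy: 30, consistency: 15, safety: 30, alignment: 15, robustness: 10,
};
const scores: PerDimension = {
  accuracy: 80, consistency: 90, safety: 60, alignment: 70, robustness: 50,
};

console.log(overallScore(scores, weights)); // 71
```

With these numbers the heavily weighted safety score (60) pulls the overall result down to 71, even though consistency scored 90. Raising a weight makes that dimension's result matter more in exactly this way.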
Select Test Cases
We've suggested test cases based on your system type. Select the ones you want to include, or add your own.
Run Evaluation
For each test case, paste the actual system output and score the result.
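Once results are scored, a readiness status with risk-based thresholds could be derived along the lines of the sketch below. It reads "RAG status" as red/amber/green. The three risk profiles and all threshold values are hypothetical placeholders, not EvalSpec's actual settings.

```typescript
type RagStatus = "red" | "amber" | "green";
type RiskProfile = "low" | "medium" | "high";

// Hypothetical cut-offs on a 0-100 score; higher-risk systems must score higher.
const thresholds: Record<RiskProfile, { green: number; amber: number }> = {
  low: { green: 70, amber: 50 },
  medium: { green: 80, amber: 60 },
  high: { green: 90, amber: 75 },
};

function ragStatus(score: number, risk: RiskProfile): RagStatus {
  const t = thresholds[risk];
  if (score >= t.green) return "green";
  if (score >= t.amber) return "amber";
  return "red";
}

console.log(ragStatus(71, "low"));    // green
console.log(ragStatus(71, "medium")); // amber
console.log(ragStatus(71, "high"));   // red
```

The same score lands in three different bands depending on the risk profile, which is the point of risk-based thresholds: what counts as ready for a low-stakes internal tool may not be ready for a high-stakes deployment.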
Your Projects
Session Analytics
Session-only event log. For production use, connect to a backend (Supabase, PostHog, etc.)
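A session-only event log of this kind can be sketched with the browser's Web Storage API. The example below records when each wizard step is reached and derives the time spent per step. The storage key, step names, and function names are hypothetical, and the small in-memory store exists only so the sketch also runs outside a browser; in a page you would pass `window.sessionStorage` instead.

```typescript
// Minimal subset of the Web Storage API; window.sessionStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

interface StepEvent {
  step: string;      // hypothetical step id, e.g. "define-system"
  enteredAt: number; // epoch milliseconds
}

const LOG_KEY = "wizard-events"; // hypothetical key name

function readEvents(store: KeyValueStore): StepEvent[] {
  const raw = store.getItem(LOG_KEY);
  return raw === null ? [] : (JSON.parse(raw) as StepEvent[]);
}

function logStep(store: KeyValueStore, step: string, now: number = Date.now()): void {
  const events = readEvents(store);
  events.push({ step, enteredAt: now });
  store.setItem(LOG_KEY, JSON.stringify(events));
}

// Time on a step = gap until the next logged step; the last step has no end yet.
function timePerStep(events: StepEvent[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (let i = 0; i + 1 < events.length; i++) {
    const elapsed = events[i + 1].enteredAt - events[i].enteredAt;
    totals[events[i].step] = (totals[events[i].step] ?? 0) + elapsed;
  }
  return totals;
}

// In-memory stand-in for sessionStorage, for running outside a browser.
function memoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    getItem: (key) => data.get(key) ?? null,
    setItem: (key, value) => { data.set(key, value); },
  };
}

const store = memoryStore();
logStep(store, "define-system", 0);
logStep(store, "select-criteria", 40000);
logStep(store, "select-test-cases", 65000);
console.log(timePerStep(readEvents(store)));
```

Because sessionStorage is scoped to the browser tab and cleared when it closes, nothing in this sketch leaves the device unless a backend is added separately.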
Pricing
Start free, pay only when you need more.
- Full evaluation report
- Export results (CSV)
- RAG status analysis
- All 5 evaluation dimensions
- Everything in Pay-as-you-go
- System comparison
- Priority support
- Longitudinal tracking
- Everything in Bundle of 5
- Full analytics dashboard
- Bulk evaluations
- Best per-evaluation rate
Secure payment via Stripe. Credits never expire.
Legal
Privacy Policy
Last updated: 15 March 2026
1. Who we are
EvalSpec ("we", "us", "our") operates the website at evalspec.ai and provides a structured AI systems evaluation framework. We are the data controller for personal data collected through this service.
Contact: hello@evalspec.ai
2. What data we collect
We collect the minimum data necessary to provide the service:
Session analytics — we track which steps of the evaluation wizard you reach and how long you spend on each step. This data is stored only in your browser's sessionStorage and is never sent to our servers or shared with third parties.
3. Legal basis for processing
We process your personal data on the following legal bases under UK GDPR:
- Contract: your email address is required to provide the account and service you signed up for.
- Legitimate interests: device fingerprinting to protect the integrity of the free tier and prevent abuse, where those interests are not overridden by your rights and freedoms.
- Contract: processing payment confirmation to deliver the credits you purchased.
4. Third-party services
We use a small number of trusted third-party services to operate EvalSpec:
- Supabase: provides our database and authentication infrastructure. Your email address and evaluation data are stored on Supabase servers. Supabase is SOC 2 Type II certified. See the Supabase Privacy Policy.
- Stripe: processes all payments. When you purchase credits you are redirected to a Stripe-hosted payment page. EvalSpec never sees your card number or billing address. See the Stripe Privacy Policy.
- Vercel: hosts and serves the EvalSpec application. Vercel may process standard server logs (IP address, browser type) as part of normal web hosting. See the Vercel Privacy Policy.
We do not use any advertising networks, social media trackers, or third-party analytics tools.
5. Cookies
EvalSpec does not use advertising or tracking cookies. Supabase Auth uses a session token stored in localStorage to keep you logged in. This token is essential for the service to function and does not track you across other websites.
6. How long we keep your data
- Account data — retained for as long as your account is active.
- Evaluation data — retained for as long as your account is active, or until you delete individual evaluations.
- Device fingerprint — retained for as long as your account is active.
- Payment records — retained for 7 years to meet UK financial record-keeping obligations.
When you delete your account, all personal data is removed from our systems within 30 days, except where retention is required by law.
7. Your rights
Under UK GDPR you have the right to:
- Access — request a copy of the personal data we hold about you.
- Rectification — ask us to correct inaccurate data.
- Erasure — ask us to delete your personal data ("right to be forgotten").
- Restriction — ask us to limit how we process your data.
- Portability — receive your data in a structured, machine-readable format.
- Object — object to processing based on legitimate interests (including the device fingerprint).
To exercise any of these rights, email us at hello@evalspec.ai. We will respond within 30 days. You also have the right to lodge a complaint with the Information Commissioner's Office (ICO).
8. Data security
All data is transmitted over HTTPS. Access to the database is restricted by Supabase Row Level Security (RLS) policies, ensuring each user can only access their own data. Passwords are hashed by Supabase Auth and never stored in plain text.
9. Children
EvalSpec is not directed at children under 13. We do not knowingly collect personal data from children. If you believe a child has provided us with personal data, please contact us and we will delete it promptly.
10. Changes to this policy
If we make material changes to this policy we will update the "Last updated" date at the top of this page and, where appropriate, notify registered users by email.
11. Contact
For any privacy-related questions or requests, contact us at hello@evalspec.ai.
Legal
Terms of Service
Last updated: 15 March 2026
1. Acceptance of terms
By creating an account or using EvalSpec at evalspec.ai, you agree to these Terms of Service. If you do not agree, do not use the service. These terms form a binding agreement between you and EvalSpec ("we", "us", "our").
2. What EvalSpec is
EvalSpec is a browser-based diagnostic tool that provides a structured framework for evaluating AI systems across five dimensions: Accuracy & Groundedness, Consistency & Reliability, Safety & Compliance, Alignment & Usefulness, and Robustness & Edge Cases.
EvalSpec is a diagnostic tool, not a certification service. Evaluation results are generated based entirely on the test cases, scores, and weightings you provide. Results represent a structured assessment of the inputs you supplied — they are not a guarantee, certification, or assurance that your AI system is fit for purpose in any particular context, free from defects, or compliant with any regulation. You remain solely responsible for all decisions regarding the development, testing, deployment, and operation of your AI systems.
No two AI systems or deployment contexts are identical. EvalSpec provides a consistent framework, but results will vary depending on the test cases selected, the scores assigned, and the risk profile chosen. We strongly recommend using EvalSpec as one input among several in your quality assurance process.
3. Accounts
- You must provide a valid email address to create an account.
- You are responsible for maintaining the security of your account credentials.
- You must not share your account with others or allow unauthorised access.
- You must be at least 13 years old to use EvalSpec.
- We reserve the right to suspend or terminate accounts that violate these terms.
4. Credits, payments & refunds
EvalSpec operates on a credit system. One credit is consumed per completed evaluation.
- Every account receives one free evaluation credit on registration. No payment is required to get started.
- Additional credits are purchased in the tiers listed on the pricing page. Prices are in GBP and include VAT where applicable.
- Payments are processed securely by Stripe. EvalSpec does not store your card details.
- Credits do not expire.
- All purchases are non-refundable. Once a credit purchase is completed, it cannot be refunded, whether or not the credits have been used. If you experience a technical issue that prevents you from using a credit you have paid for, contact us at hello@evalspec.ai and we will investigate.
5. Acceptable use
You agree not to use EvalSpec to:
- Attempt to gain unauthorised access to any part of the service or its infrastructure.
- Reverse engineer, decompile, or attempt to extract the source code of the service.
- Use automated tools to scrape, crawl, or stress-test the service.
- Submit content that is unlawful, harmful, or infringes third-party rights.
- Resell or sublicense access to the service without written permission.
- Circumvent the free-tier credit limit through multiple account creation or other means.
6. Your content
You retain ownership of any AI system descriptions, test cases, and evaluation data you submit to EvalSpec. By submitting content, you grant us a limited licence to store and process it solely for the purpose of providing the service to you. We do not use your evaluation data for any other purpose, and we do not share it with third parties.
7. Intellectual property
EvalSpec, its framework, methodology, design, and software are owned by EvalSpec and protected by applicable intellectual property laws. You may not copy, reproduce, or distribute any part of the service without written permission.
8. Disclaimer of warranties
EvalSpec is provided "as is" and "as available" without warranties of any kind, express or implied. We do not warrant that the service will be uninterrupted, error-free, or that results will be accurate, complete, or suitable for any particular purpose. The evaluation framework is based on established standards (NIST, EU AI Act, ISO 42001, OWASP) but its application depends entirely on the inputs and configuration you provide.
9. Limitation of liability
To the fullest extent permitted by law, EvalSpec shall not be liable for any indirect, incidental, special, consequential, or punitive damages arising from your use of the service, including but not limited to losses arising from reliance on evaluation results in deployment decisions. Our total liability to you for any claim shall not exceed the amount you paid to us in the 12 months preceding the claim.
10. Service availability & changes
We reserve the right to modify, suspend, or discontinue any part of the service at any time. We will provide reasonable notice of material changes where possible. We may update these terms from time to time — continued use of the service after changes are posted constitutes acceptance of the updated terms.
11. Governing law
These terms are governed by the laws of England and Wales. Any disputes arising from these terms or your use of EvalSpec shall be subject to the exclusive jurisdiction of the courts of England and Wales.
12. Contact
For any questions about these terms, contact us at hello@evalspec.ai.
About
Why EvalSpec exists
AI systems still need to be tested — they just need to be tested differently. EvalSpec was built to give every team a structured, standards-backed way to do that, without needing to write a single line of code.
The problem
After two decades in quality assurance and AI consulting, one thing became impossible to ignore: most teams have no framework for testing AI systems. When asked how they validate their AI before deployment, the answer is usually some version of "we tried it a few times and it seemed fine."
That's not a testing methodology — it's a hope.
AI systems behave differently from traditional software. They're non-deterministic, context-sensitive, and capable of failing in ways that standard test scripts will never catch. They need to be evaluated for accuracy, safety, consistency, alignment, and robustness — not just "does it return a result?"
Why EvalSpec
The tools that existed were built for ML engineers — Python SDKs, CI/CD pipelines, pytest frameworks. Powerful, but completely inaccessible to the QA leads, product managers, and compliance officers who are actually responsible for signing off on AI deployments.
EvalSpec was built to close that gap. A structured, 5-dimension evaluation framework grounded in NIST, the EU AI Act, ISO 42001, and OWASP — accessible entirely in a browser, with no code required.
The goal is simple: give every team the tools to answer the question "is this AI system ready?" with evidence, not instinct.
The framework
Get in touch
Questions, feedback, or want to talk about AI evaluation? Reach us at hello@evalspec.ai.