Evals Engineer Jobs: What the Role Is and Who's Hiring

There are 27 open evals and AI quality roles across 11 companies tracked by the Agentic Ready Jobs Index, as of 12 June 2026 — and not one of them is remote. Every posting names an office, against a 13% remote share across the full index. The hiring is concentrated where the measurement problems are hardest: model labs, agent builders, and the platform companies selling evaluation tooling to everyone else.

Open roles: 37
Companies: 19
Remote share: 8%

What is an evals engineer?

An evals engineer builds the machinery that measures whether an AI system works: test sets, scoring methods, regression suites, benchmarks, annotation pipelines, and the dashboards that turn model behaviour into numbers a team can act on. Deterministic software gets unit tests — the same input produces the same output, and a test either passes or fails. A model-backed system produces distributions, so "does it work" becomes a statistical question, and answering it repeatably is a full engineering discipline: choosing representative cases, deciding what counts as success, scoring at volume (often with one model grading another), and catching the regression when a prompt change or model upgrade quietly breaks something three workflows away.
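As a rough illustration of the regression-suite idea, the sketch below samples each test case several times, records a per-case pass rate, and flags cases whose rate fell against a baseline. Everything in it is hypothetical: the two-case test set, the exact-match scorer, the 0.1 tolerance, and the stand-in "models", which are plain functions made deterministic so the example is reproducible. A real suite would call a model API and would likely use a softer scorer.

```python
from itertools import count
from typing import Callable, Dict, List, Tuple

# Hypothetical test set: (case id, prompt, expected answer).
CASES: List[Tuple[str, str, str]] = [
    ("add", "2+2", "4"),
    ("capital", "capital of France", "Paris"),
]
ANSWERS = {"2+2": "4", "capital of France": "Paris"}


def pass_rates(model: Callable[[str], str], samples: int = 20) -> Dict[str, float]:
    """Sample each case `samples` times; return the exact-match rate per case."""
    rates = {}
    for case_id, prompt, expected in CASES:
        hits = sum(model(prompt).strip() == expected for _ in range(samples))
        rates[case_id] = hits / samples
    return rates


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.1) -> List[str]:
    """Return ids of cases whose pass rate dropped by more than `tolerance`."""
    return [cid for cid, base in baseline.items()
            if base - candidate.get(cid, 0.0) > tolerance]


def stable_model(prompt: str) -> str:
    """Stand-in for the current system: always answers correctly."""
    return ANSWERS[prompt]


def make_degraded_model() -> Callable[[str], str]:
    """Stand-in for a changed system: wrong on every second 'capital' call."""
    calls = count()

    def model(prompt: str) -> str:
        if prompt == "capital of France" and next(calls) % 2 == 1:
            return "Lyon"
        return ANSWERS[prompt]

    return model


baseline = pass_rates(stable_model)
candidate = pass_rates(make_degraded_model())
flagged = regressions(baseline, candidate)
print(flagged)
```

The point of the design is that the comparison runs per case rather than on one aggregate score, so a drop in a single workflow is not averaged away by the cases that still pass.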

The 27 postings sort into three distinct jobs that share a title fragment. Platform engineering builds evals products for other teams — LangChain's seven openings are all on its AI observability and evals platform (LangSmith), and Datadog is hiring an engineering manager for evaluation and annotation in Paris. Research evals measure frontier models themselves — OpenAI's research engineer for frontier evals and environments, Cohere's model evaluation research roles in Toronto and London, Mercor's benchmarking and failure analysis work. Product quality applies the discipline to a shipping product — Cursor's software engineer for agent evaluation and quality, Glean's machine learning engineers for LLM evals and observability, plus Glean's two "Product Manager, AI Quality" openings, which put a PM title on the same problem. Anthropic's three roles add a fourth flavour worth noting: safeguards evals, where measurement serves safety rather than capability.

Skills and tools

In the snapshot, "evals" is the tagged skill on 25 of the 27 postings, so the category is unusually pure. Underneath it, the platform roles want conventional product engineering (LangChain is hiring frontend, backend, and full-stack separately), the research roles want experimental design and failure analysis (Mercor's title says "Benchmarking, Evals & Failure Analysis" outright), and the product-quality roles want ML engineering plus annotation pipeline experience; Datadog's posting pairs evaluation with annotation explicitly. Statistical literacy is the common floor: you need to know when 200 test cases distinguish two prompts and when they do not. Seniority spans every level, though unevenly (10 mid-level, 5 senior, 3 staff-plus, 9 manager-level), so the category has room at most career stages, including an unusual number of management openings for its size.
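To make the 200-case question concrete, here is a minimal sketch using a two-proportion z-test from the standard library. The pass counts are invented for illustration, and the test assumes the two prompts were scored on independent samples; if both prompts run on the same 200 cases, a paired test such as McNemar's would be the more appropriate choice.

```python
import math


def two_proportion_p_value(pass_a: int, pass_b: int, n: int) -> float:
    """Two-sided p-value for a difference in pass rates, n cases per prompt."""
    rate_a, rate_b = pass_a / n, pass_b / n
    pooled = (pass_a + pass_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * (2 / n))
    if std_err == 0:
        return 1.0  # both prompts pass (or fail) every case: no evidence of a gap
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical results on 200 cases per prompt.
small_gap = two_proportion_p_value(170, 160, 200)  # 85% vs 80%
large_gap = two_proportion_p_value(180, 150, 200)  # 90% vs 75%

print(f"85% vs 80%: p = {small_gap:.3f}")
print(f"90% vs 75%: p = {large_gap:.5f}")
```

Under these assumptions a five-point gap on 200 cases is well within noise, while a fifteen-point gap is not. That is the practical meaning of knowing when the test set can and cannot separate two prompts.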

How to break in

The discipline is young enough that demonstrated work beats credentials. Build an eval suite for something real — an open-source agent, your own side project — and publish the design decisions: how you chose cases, what you scored, where automated grading disagreed with human judgement, and what regression it caught. That artefact maps directly onto these postings. Engineers arrive from AI engineer and agent engineer roles after living with systems nobody could measure; data and QA backgrounds transfer through the annotation and test-design side.
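One way to report where automated grading disagreed with human judgement is an agreement statistic plus the list of disagreeing cases. The sketch below computes Cohen's kappa, which corrects raw agreement for agreement expected by chance. The labels are made up, and kappa is only one reasonable choice of metric, not something the postings prescribe.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(a: Sequence[str], b: Sequence[str]) -> List[int]:
    """Indices of cases where the two raters differ, for manual review."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical labels for ten cases.
human = ["pass"] * 6 + ["fail"] * 4
auto = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]

kappa = cohens_kappa(human, auto)
to_review = disagreements(human, auto)
print(f"kappa = {kappa:.3f}, review cases {to_review}")
```

Here raw agreement is 8 of 10, but kappa comes out lower because both raters say "pass" most of the time. The disagreeing indices are usually the most useful part of the write-up, since they show what the automated grader gets wrong.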

Adjacent roles: agent ops engineer (evals in production become monitoring — the boundary is thin), AI product manager (they set the quality bar you instrument; Glean hires both on the same team), AI security engineer (safety and adversarial evals border on security testing), AI governance lead (eval evidence is what governance reviews consume), and prompt engineer — a title whose remaining market demand has largely folded into this one.

Skills appearing in real postings

Evals, test set design, scoring pipelines, regression suites, benchmarks, annotation pipelines, failure analysis, statistical literacy

Hiring for this role right now

Live from the Agentic AI Jobs Index, updated 16 June 2026.

Salary

None of the 27 tracked postings discloses a salary range, and the title is too new for public aggregators to report it separately, so any specific figure would be invented. Compensation in practice follows the engineering bands of the hiring company; the Agentic AI Jobs Index records disclosed ranges across all categories as they appear.

Frequently asked questions

How many AI evals jobs are open right now?

There are 27 open evals and AI quality roles across 11 companies as of 12 June 2026, per the Agentic Ready Jobs Index. LangChain is the largest single hirer with 7 openings.

Are evals engineer jobs remote?

In this snapshot, no — 0 of the 27 postings are remote. The roles cluster in San Francisco (LangChain, OpenAI, Cursor, Mercor), the Bay Area more broadly (Glean), Toronto and London (Cohere), Paris (Mistral AI, Datadog), and New York.

What does an evals engineer actually build?

Test sets and scoring pipelines, regression suites that gate deployments, benchmarks, annotation tooling and workflows, and observability dashboards for model behaviour. At platform companies the deliverable is the evals product itself; at labs it is the measurement of frontier models; at product companies it is the quality layer for one shipping system.
