Evals Engineer Jobs: What the Role Is and Who's Hiring
The Agentic AI Jobs Index tracked 27 open evals and AI quality roles across 11 companies as of 12 June 2026, and not one of them is remote. Every posting names an office, against a 13% remote share across the full index. The hiring is concentrated where the measurement problems are hardest: model labs, agent builders, and the platform companies selling evaluation tooling to everyone else.
What is an evals engineer?
An evals engineer builds the machinery that measures whether an AI system works: test sets, scoring methods, regression suites, benchmarks, annotation pipelines, and the dashboards that turn model behaviour into numbers a team can act on. Deterministic software gets unit tests — the same input produces the same output, and a test either passes or fails. A model-backed system produces distributions, so "does it work" becomes a statistical question, and answering it repeatably is a full engineering discipline: choosing representative cases, deciding what counts as success, scoring at volume (often with one model grading another), and catching the regression when a prompt change or model upgrade quietly breaks something three workflows away.
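To make the "statistical question" concrete, here is a minimal sketch of a regression check, assuming exact-match scoring and a stubbed system. The names `Case`, `pass_rate`, `regressed`, and `make_stub` are hypothetical, and the stub stands in for a real model call; a production suite would use richer scoring and far more cases.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str


def pass_rate(system: Callable[[str], str], cases: list[Case], samples: int = 20) -> float:
    """Sample each case several times and return the fraction of exact matches."""
    passed = 0
    for case in cases:
        for _ in range(samples):
            if system(case.prompt).strip() == case.expected:
                passed += 1
    return passed / (len(cases) * samples)


def regressed(baseline: float, candidate: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when the candidate falls more than `tolerance` below baseline."""
    return candidate < baseline - tolerance


def make_stub(accuracy: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: answers correctly with probability `accuracy`."""
    rng = random.Random(seed)
    answers = {"2+2": "4", "capital of France": "Paris"}

    def system(prompt: str) -> str:
        return answers[prompt] if rng.random() < accuracy else "unsure"

    return system


CASES = [Case("2+2", "4"), Case("capital of France", "Paris")]

baseline = pass_rate(make_stub(0.95, seed=1), CASES)
candidate = pass_rate(make_stub(0.60, seed=2), CASES)
print(f"baseline={baseline:.2f} candidate={candidate:.2f} regressed={regressed(baseline, candidate)}")
```

The tolerance is a design choice rather than a standard value: set it too tight and sampling noise trips the alarm, too loose and a real regression slips through.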
The 27 postings sort into three distinct jobs that share a title fragment. Platform engineering builds evals products for other teams — LangChain's seven openings are all on its AI observability and evals platform (LangSmith), and Datadog is hiring an engineering manager for evaluation and annotation in Paris. Research evals measure frontier models themselves — OpenAI's research engineer for frontier evals and environments, Cohere's model evaluation research roles in Toronto and London, Mercor's benchmarking and failure analysis work. Product quality applies the discipline to a shipping product — Cursor's software engineer for agent evaluation and quality, Glean's machine learning engineers for LLM evals and observability, plus Glean's two "Product Manager, AI Quality" openings, which put a PM title on the same problem. Anthropic's three roles add a fourth flavour worth noting: safeguards evals, where measurement serves safety rather than capability.
Skills and tools
In the snapshot, "evals" is the tagged skill on 25 of the 27 postings, so the category is unusually pure. Underneath it, the platform roles want conventional product engineering (LangChain is hiring frontend, backend, and full-stack separately), the research roles want experimental design and failure analysis (Mercor's title says "Benchmarking, Evals & Failure Analysis" outright), and the product-quality roles want ML engineering plus annotation pipeline experience; Datadog's posting pairs evaluation with annotation explicitly. Statistical literacy is the common floor: you need to know when 200 test cases distinguish two prompts and when they do not. Seniority spans the ladder (10 mid-level, 5 senior, 3 staff-plus, 9 manager-level), so the category has room at most career stages, including an unusual number of management openings for its size.
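The 200-case question can be worked through with a two-proportion z-test. This is a sketch under stated assumptions: each prompt is scored pass/fail on its own independent set of 200 cases, and the function name `two_proportion_p_value` and the pass counts are made up for illustration. If both prompts run on the same 200 cases, a paired test such as McNemar's is the better fit.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(pass_a: int, pass_b: int, n: int) -> float:
    """Two-sided p-value for the difference between two pass rates, n cases each."""
    p_a, p_b = pass_a / n, pass_b / n
    pooled = (pass_a + pass_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # both prompts passed (or failed) every case: no evidence of a difference
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Illustrative counts, not real data.
small_gap = two_proportion_p_value(160, 170, 200)  # 80% vs 85%
large_gap = two_proportion_p_value(140, 170, 200)  # 70% vs 85%

print(f"80% vs 85% on 200 cases: p = {small_gap:.3f}")
print(f"70% vs 85% on 200 cases: p = {large_gap:.5f}")
```

Under these assumptions a five-point gap on 200 cases is not distinguishable from noise at the usual 0.05 threshold, while a fifteen-point gap clearly is. That is the practical sense in which 200 cases sometimes separate two prompts and sometimes do not.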
How to break in
The discipline is young enough that demonstrated work beats credentials. Build an eval suite for something real — an open-source agent, your own side project — and publish the design decisions: how you chose cases, what you scored, where automated grading disagreed with human judgement, and what regression it caught. That artefact maps directly onto these postings. Engineers arrive from AI engineer and agent engineer roles after living with systems nobody could measure; data and QA backgrounds transfer through the annotation and test-design side.
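One way to report where automated grading disagreed with human judgement is an agreement statistic plus the list of disagreeing cases. The sketch below uses Cohen's kappa on pass/fail labels; the labels are invented for illustration and the function names are hypothetical.

```python
def cohens_kappa(human: list[bool], auto: list[bool]) -> float:
    """Chance-corrected agreement between two pass/fail label lists of equal length."""
    n = len(human)
    observed = sum(h == a for h, a in zip(human, auto)) / n
    p_human = sum(human) / n
    p_auto = sum(auto) / n
    expected = p_human * p_auto + (1 - p_human) * (1 - p_auto)
    if expected == 1:
        return 1.0  # both raters gave one identical label to every case
    return (observed - expected) / (1 - expected)


def disagreements(human: list[bool], auto: list[bool]) -> list[int]:
    """Indices of the cases where the automated grader and the human differ."""
    return [i for i, (h, a) in enumerate(zip(human, auto)) if h != a]


# Invented labels for ten cases: True means "pass".
human_labels = [True, True, True, False, False, True, False, True, True, False]
auto_labels = [True, True, False, False, False, True, True, True, True, False]

print(f"kappa = {cohens_kappa(human_labels, auto_labels):.2f}")
print(f"review these cases: {disagreements(human_labels, auto_labels)}")
```

The disagreeing cases are usually the most useful part of the write-up, because they show whether the grader or the rubric is at fault.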
Adjacent roles: agent ops engineer (evals in production become monitoring — the boundary is thin), AI product manager (they set the quality bar you instrument; Glean hires both on the same team), AI security engineer (safety and adversarial evals border on security testing), AI governance lead (eval evidence is what governance reviews consume), and prompt engineer — a title whose remaining market demand has largely folded into this one.
Hiring for this role right now
- LangChain: 7 roles, San Francisco
- Glean: 4 roles, Palo Alto
- General Motors: 4 roles, Detroit
- Anthropic: 3 roles, San Francisco
- Cohere: 3 roles, Toronto
Live from the Agentic AI Jobs Index, updated 16 June 2026.
Salary
None of the 27 tracked postings discloses a salary range, and the title is too new for public aggregators to report it separately, so any specific figure would be invented. Compensation in practice follows the engineering bands of the hiring company; the Agentic AI Jobs Index records disclosed ranges across all categories as they appear.