
Senior Research Scientist - Science of Evaluation


Permanent
Research Jobs
London, United Kingdom
Job Description
Senior Research Scientist - Science of Evaluation
London, UK

About the AI Security Institute

The AI Security Institute is the world's largest team in a government dedicated to understanding AI capabilities and risks.

Our mission is to equip governments with an empirical understanding of the safety of advanced AI systems. We conduct research to understand the capabilities and impacts of advanced AI, and we develop and test risk mitigations. We focus on risks with security implications, including the potential for AI to assist in the development of chemical and biological weapons, its use in carrying out cyber-attacks and enabling crimes such as fraud, and the possibility of loss of control.

The risks from AI are not sci-fi; they are urgent. By combining the agility of a tech start-up with the expertise and mission-driven focus of government, we're building a unique and innovative organisation to prevent AI's harms from impeding its potential.

This role sits outside of the DDaT pay framework because its scope requires in-depth technical expertise in frontier AI safety, robustness, and advanced AI architectures.

The deadline for applying for this role is Sunday 24 August 2025, end of day, anywhere on Earth.

About the Team

AISI's Science of Evaluation team develops and applies rigorous techniques for measuring and forecasting AI system capabilities, ensuring evaluation results are robust, meaningful, and useful for governance decisions.

Evaluations underpin both our scientific understanding and policy decisions around frontier AI systems. However, current evaluation designs, methodologies, and statistical techniques are poorly suited to extracting the insights we care about, such as underlying capabilities, dangerous failure modes, forecasts of future capabilities, and the robustness of model performance across varied settings. Our team addresses this by acting as an internal auditor: stress-testing the claims and methods in AISI's testing reports, developing new tools for evaluation analysis, and advancing methodologies that help anticipate dangerous capabilities before they emerge. Our work falls into three strands:

(1) Methodological red teaming: independently stress-testing the evidence and claims made in AISI's evaluation reports, which are shared with model developers;

(2) Consulting partnerships: collaborating with AISI's evaluations teams to improve methodologies and best practices;

(3) Targeted research bets: pursuing foundational work that enables new types of insights into model capabilities.

Our research is problem-driven, methodologically grounded, and focused on impact. We aim to improve epistemic rigour, increase confidence in the claims drawn from evaluation data, and translate those conclusions into actionable insights for model developers and policymakers.

Role Summary

This is a senior research scientist position focused on developing and applying evaluation methodologies to frontier AI systems. We're also excited to hear from earlier-career researchers with 2-3 years of hands-on experience with LLMs, especially those who've shown creative or rigorous empirical instincts.

As model capabilities scale rapidly, evaluations are becoming a critical bottleneck for safe deployment. This role offers the opportunity to shape how capabilities are measured and understood across the frontier AI ecosystem. It is a role for people who can identify flaws or hidden assumptions in evaluations and experimental setups: we care more about how you think about evidence than how many models you've fine-tuned.

You'll shape and conduct research on how to better extract signal from evaluation data, going beyond benchmark scores to uncover underlying model capabilities, safety-relevant behaviours, and emerging risks. You'll work closely with engineers and domain experts across AISI, as well as external research collaborators. Researchers on this team have substantial freedom to shape independent research agendas, lead collaborations, and initiate projects that push the frontier of what evaluations can reveal.

Example Projects
  • Conduct adversarial quality assurance of frontier AI evaluation reports, including targeted analyses to uncover potential issues, blind spots, or hidden/unexplored assumptions.
  • Support the design of evaluation suites that improve coverage, predictive validity, and robustness.
  • Contribute to protocols and internal best-practices that help other teams produce better, more actionable evaluation results.
  • Build tools for quantitatively analysing agent evaluation transcripts, surfacing failure modes or proxy signals of capability.
  • Develop new methodologies for understanding capability emergence, e.g., milestone or partial-progress analysis on complex agent-based evaluations, intervention-based probing of agent behaviours, and predictive models of agent performance based on task and model characteristics (a minimal sketch follows this list).
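For illustration only, here is a minimal sketch of the kind of milestone / partial-progress scoring over agent evaluation transcripts mentioned above. The transcript format, task identifier, and milestone names are hypothetical assumptions, not AISI's actual tooling.

```python
# Hypothetical sketch of milestone / partial-progress scoring over agent
# evaluation transcripts. Transcript format and milestone names are
# illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class Transcript:
    task_id: str
    events: list[str]  # ordered log of agent actions / environment observations


# Hypothetical ordered milestones for one task; reaching a later milestone
# indicates partial progress even if the final objective is not achieved.
MILESTONES = {
    "cyber-range-01": [
        "recon_complete",
        "foothold_gained",
        "privilege_escalated",
        "objective_reached",
    ],
}


def partial_progress(t: Transcript) -> float:
    """Fraction of ordered milestones reached, a crude proxy for capability."""
    milestones = MILESTONES.get(t.task_id, [])
    if not milestones:
        return 0.0
    reached = 0
    for m in milestones:
        if any(m in event for event in t.events):
            reached += 1
        else:
            break  # milestones are ordered; stop at the first one missed
    return reached / len(milestones)


if __name__ == "__main__":
    demo = Transcript("cyber-range-01", ["recon_complete", "foothold_gained", "timeout"])
    print(partial_progress(demo))  # 0.5: two of four milestones reached
```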
Responsibilities
  • Lead and conduct applied research into evaluation methodology, including the design of new techniques and tools.
  • Analyse evaluation results in depth to stress-test claims, understand the structure of model capabilities, and inform policy-relevant assessment against capability thresholds.
  • Develop predictive models of LLM capabilities, including through observational scaling laws, agent skill decomposition, or other techniques (see the sketch after this list).
  • Develop and validate new evaluation methodologies (e.g. transcript analysis, milestone or partial-progress analysis, hinting interventions).
  • Collaborate with policy, safety, and research teams to translate empirical results into governance insights.
  • Stay well informed about the details of evaluations across domains in AISI and the state of the art in frontier AI evaluations research more broadly, including by attending ML conferences.
  • Write and edit scientific reports, internal memos, and other materials that synthesise results into actionable guidance.
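To give a flavour of the capability-forecasting work above, the following is a minimal sketch of one possible observational approach: fitting a saturating curve to past models' benchmark accuracy as a function of training compute and extrapolating to a hypothetical future model. The data points and functional form are illustrative assumptions, not an endorsed forecasting method.

```python
# Hedged sketch: fit a sigmoid to observed benchmark accuracy versus
# log10(training FLOP), then extrapolate. Data and functional form are
# illustrative assumptions only.
import numpy as np
from scipy.optimize import curve_fit


def sigmoid(log_compute, midpoint, slope):
    """Accuracy as a saturating function of log10(training FLOP)."""
    return 1.0 / (1.0 + np.exp(-slope * (log_compute - midpoint)))


# Hypothetical observations: (log10 FLOP, benchmark accuracy) for past models.
log_flop = np.array([22.0, 23.0, 24.0, 24.5, 25.0])
accuracy = np.array([0.05, 0.12, 0.35, 0.52, 0.68])

# Fit the two sigmoid parameters to the observed points.
(midpoint, slope), _ = curve_fit(sigmoid, log_flop, accuracy, p0=[24.5, 1.0])

# Point forecast for a hypothetical next-generation model at 10^26 FLOP.
print(f"Forecast accuracy at 1e26 FLOP: {sigmoid(26.0, midpoint, slope):.2f}")
```

In practice such a point estimate would need uncertainty quantification and validation against held-out models before informing any assessment.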
Person Specification

We're flexible on the exact profile and expect successful candidates will meet many (but not necessarily all) of the criteria below. Depending on experience, we will consider candidates at either the RS or Senior RS level.
  • Strong track record in applied ML, evaluation science, or equivalent experimental sciences that face difficult methodological challenges, ideally including multiple publications, projects, or real-world deployments (e.g. a PhD in a technical field and/or spotlight papers at top-tier conferences).
  • Deep interest in methodology and measurement: strong instincts for finding flaws in experimental designs and for building methods that generalise.
  • Excellent scientific writing skills and the ability to clearly communicate complex ideas to technical and policy audiences.
  • Strong motivation to do impactful work at the intersection of science, safety, and governance.
  • Ability to work autonomously and with high agency, thriving in a constantly changing environment and a steadily growing team.
Nice to Have
  • Excellent understanding of the literature and hands-on experience with large language models, including designing and running evaluations, fine-tuning, scaffolding, and prompting.
  • Experience with experimental design, diagnostics, or tooling in other scientific disciplines (e.g. psychometrics, behavioural economics).
  • Understanding of (observational) scaling laws or predictive modelling for capabilities.
Core requirements
  • You should be able to spend at least 4 days per week working with us.
  • You should be able to join us for at least 24 months.
  • You should be able to work from our office in London (Whitehall) for parts of the week, but we provide flexibility for remote work.
Salary & Benefits

We are hiring individuals at all ranges of seniority and experience within this research unit, and this advert allows you to apply for any of the roles within this range. Your dedicated talent partner will work with you as you move through our assessment process to explain our internal benchmarking process. The full range of salaries is listed below; each salary comprises a base salary and a technical allowance, plus additional benefits as detailed on this page.
  • Level 3 - Total Package £65,000 - £75,000, inclusive of a base salary of £35,720 plus an additional technical talent allowance of between £29,280 - £39,280
  • Level 4 - Total Package £85,000 - £95,000, inclusive of a base salary of £42,495 plus an additional technical talent allowance of between £42,505 - £52,505
  • Level 5 - Total Package £105,000 - £115,000, inclusive of a base salary of £55,805 plus an additional technical talent allowance of between £49,195 - £59,195
  • Level 6 - Total Package £125,000 - £135,000, inclusive of a base salary of £68,770 plus an additional technical talent allowance of between £56,230 - £66,230
  • Level 7 - Total Package £145,000, inclusive of a base salary of £68,770 plus an additional technical talent allowance of £76,230
This role sits outside of the DDaT pay framework because its scope requires in-depth technical expertise in frontier AI safety and machine learning, together with empirical research experience.

There are a range of pension options available which can be found through the Civil Service website.

The Department for Science, Innovation and Technology offers a competitive mix of benefits including:
  • A culture of flexible working, such as job sharing, homeworking and compressed hours.
  • A minimum of 25 days of paid annual leave.