| Qualification Type: | PhD |
|---|---|
| Location: | Cambridge |
| Funding for: | UK Students, EU Students, International Students |
| Funding amount: | Fully-funded studentship (fees and maintenance) |
| Hours: | Full Time |
| Placed On: | 1st May 2026 |
| Closes: | 30th July 2026 |
| Reference: | NM49585 |
LLMs are becoming more capable and society increasingly relies on them, so it is important to ensure that they are safe. In this PhD you can use a variety of approaches, such as white-box mechanistic interpretability and black-box behavioural research, to evaluate the safety of LLMs, monitor their behaviour at inference time, and devise strategies for reducing the risks they pose. Initially, this PhD will focus on increasing chain-of-thought (CoT) faithfulness and mitigating encoded reasoning.
This PhD is funded by Coefficient Giving, whose focus areas are described at https://coefficientgiving.org/tais-rfp-research-areas/#6-encoded-reasoning-in-cot-and-inter-model-communication
The first 1.5 years of this PhD are already scoped out: you will investigate and carry out either Project 1 or Project 2 (described below). Once the chosen project has been completed to the highest standard, you will decide, together with your supervisor and Coefficient Giving, how to proceed and what to investigate next.
Project 1: Test whether the CoT carries its straightforward meaning, and mitigate deceptive behaviour via "perturbation methods".
First, apply a CoT perturbation method (e.g. paraphrasing intermediate outputs). Then compare the final outputs produced from the perturbed CoT with baseline final outputs. Performance deterioration after perturbation indicates that the model was using words in the CoT in a non-straightforward way. If you find such deterioration, the next step is to investigate the underlying cause (for example, using mechanistic interpretability), e.g. the model using a secret code or prompt-hacking itself. A minimal sketch of this evaluation loop is given below.
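Purely as an illustration, the Python sketch below shows the shape of this comparison. All helper callables (generate_cot, paraphrase, answer_with_cot, is_correct) are hypothetical placeholders for whichever model API, paraphraser and task-specific grader the project ends up using; this is not a prescribed implementation.

```python
# Hypothetical helpers (passed in as arguments, assumed to exist elsewhere):
#   generate_cot(question)          -> chain-of-thought string for the question
#   paraphrase(text)                -> paraphrased version of the text
#   answer_with_cot(question, cot)  -> final answer conditioned on the given CoT
#   is_correct(question, answer)    -> bool, task-specific grading

def perturbation_gap(questions, generate_cot, paraphrase, answer_with_cot, is_correct):
    """Estimate the accuracy drop caused by paraphrasing the CoT.

    A large drop suggests the model relies on features of the CoT beyond its
    straightforward meaning (e.g. encoded reasoning).
    """
    baseline_correct, perturbed_correct = 0, 0
    for q in questions:
        cot = generate_cot(q)

        # Baseline: answer produced from the original chain of thought.
        if is_correct(q, answer_with_cot(q, cot)):
            baseline_correct += 1

        # Perturbed: paraphrase the CoT, then regenerate the final answer.
        if is_correct(q, answer_with_cot(q, paraphrase(cot))):
            perturbed_correct += 1

    n = len(questions)
    return baseline_correct / n - perturbed_correct / n
```

A gap close to zero is consistent with the CoT being used in a straightforward way; a large positive gap would motivate the follow-up mechanistic-interpretability investigation described above.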
Project 2: Train for transparency using a human predictor
Use a human (or an AI imitating human behaviour, e.g. an LLM) to evaluate whether the final model outputs (and counterfactual outputs) can be predicted from the CoT. The accuracy of this human predictor is a measure of reasoning transparency and can be used as a reward signal during training; a sketch of this reward computation follows.
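As an illustration only, the sketch below computes such a transparency reward, assuming a hypothetical predict_answer callable (the human or LLM predictor) and a task-specific same_answer comparison; the actual training setup would be decided during the PhD.

```python
# Hypothetical helpers (assumed to exist elsewhere):
#   predict_answer(cot, question) -> the predictor's guess at the final answer,
#                                    given only the question and the CoT
#   same_answer(a, b)             -> bool, whether two answers match

def transparency_reward(episodes, predict_answer, same_answer):
    """Fraction of episodes in which a predictor, shown only the CoT,
    correctly anticipates the model's final answer.

    Each episode is a (question, cot, final_answer) tuple. A high score means
    the CoT transparently determines the output; a low score suggests the
    answer depends on information not legible in the CoT.
    """
    if not episodes:
        return 0.0
    hits = 0
    for question, cot, final_answer in episodes:
        guess = predict_answer(cot, question)
        if same_answer(guess, final_answer):
            hits += 1
    return hits / len(episodes)
```

This scalar could then serve as part of the reward in a fine-tuning loop, alongside the usual task reward.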
Qualifications required: Applicants should have (or expect to obtain by the start date) at least a first degree in Engineering or a related subject.
Ideally, applicants will have some experience with software development projects or research on LLMs.
This is a fully-funded studentship, covering fees and maintenance for either a home or an overseas candidate.
To apply for this studentship, please upload your two-page CV and research proposal via this form: https://forms.gle/Cm3MWPsWta73J2Gp7. The form responses will be evaluated on a rolling basis.
Please note that any offer of funding will be conditional on securing a place as a PhD student. Candidates will need to apply separately for admission through the University's Graduate Admissions application portal; this can be done before or after applying for this funding opportunity. The applicant portal can be accessed via the above 'Apply' button. University Postgraduate Admissions closing dates are 14 May for October start and 30 July for January start, although it is advisable to apply earlier than this. Please note that there is an application fee of £20 to apply via the Postgraduate Application Portal.
The University actively supports equality, diversity and inclusion and encourages applications from all sections of society.