| Qualification Type: | PhD |
|---|---|
| Location: | Cambridge |
| Funding for: | UK Students, EU Students, International Students |
| Funding amount: | Fully-funded studentship (fees and maintenance) |
| Hours: | Full Time |
| Placed On: | 1st May 2026 |
| Closes: | 30th July 2026 |
| Reference: | NM49585 |
LLMs are becoming more capable and society increasingly relies on them, so it is important to ensure that they are safe. In this PhD you can use a variety of approaches, such as white-box mechanistic interpretability and black-box behavioural research, to evaluate the safety of LLMs, monitor their behaviour at inference time, and devise strategies for reducing the risks they pose. Initially, this PhD will focus on increasing chain-of-thought (CoT) faithfulness and mitigating encoded reasoning.
This PhD is funded by Coefficient Giving, whose focus areas are described at https://coefficientgiving.org/tais-rfp-research-areas/#6-encoded-reasoning-in-cot-and-inter-model-communication
The first 1.5 years of this PhD are already scoped out: you will investigate and carry out either Project 1 or Project 2 (described below). Once the chosen project has been completed to the highest standard, you will decide, together with your supervisor and Coefficient Giving, how to proceed and what to investigate next.
Project 1: Test whether the CoT carries its straightforward meaning, and mitigate deceptive behaviour via "perturbation methods".
First, apply a CoT perturbation method (e.g. paraphrasing intermediate outputs). Then compare the final outputs produced from the perturbed CoT with baseline final outputs. Performance deterioration after perturbation indicates that the model was using words in the CoT in a non-straightforward way. If you find such deterioration, the next step is to investigate the underlying cause (for example, using mechanistic interpretability), e.g. the model using a secret code or prompt-hacking itself. A minimal sketch of this evaluation loop is given below.
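Purely as an illustration, the Python sketch below shows the shape of this comparison. All helper callables (generate_cot, paraphrase, answer_with_cot, is_correct) are hypothetical placeholders for whichever model API, paraphraser and task-specific grader the project ends up using; this is not a prescribed implementation.

```python
# Hypothetical helpers (passed in as arguments, assumed to exist elsewhere):
#   generate_cot(question)          -> chain-of-thought string for the question
#   paraphrase(text)                -> paraphrased version of the text
#   answer_with_cot(question, cot)  -> final answer conditioned on the given CoT
#   is_correct(question, answer)    -> bool, task-specific grading

def perturbation_gap(questions, generate_cot, paraphrase, answer_with_cot, is_correct):
    """Estimate the accuracy drop caused by paraphrasing the CoT.

    A large drop suggests the model relies on features of the CoT beyond its
    straightforward meaning (e.g. encoded reasoning).
    """
    baseline_correct, perturbed_correct = 0, 0
    for q in questions:
        cot = generate_cot(q)

        # Baseline: answer produced from the original chain of thought.
        if is_correct(q, answer_with_cot(q, cot)):
            baseline_correct += 1

        # Perturbed: paraphrase the CoT, then regenerate the final answer.
        if is_correct(q, answer_with_cot(q, paraphrase(cot))):
            perturbed_correct += 1

    n = len(questions)
    return baseline_correct / n - perturbed_correct / n
```

A gap close to zero is consistent with the CoT being used in a straightforward way; a large positive gap would motivate the follow-up mechanistic-interpretability investigation described above.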
Project 2: Train for transparency using a human predictor
Use a human (or an AI imitating human behaviour, e.g. an LLM) to evaluate whether the final model outputs (and counterfactual outputs) can be predicted from the CoT. The accuracy of this human predictor is a measure of reasoning transparency and can be used as a reward signal during training; a sketch of this reward computation follows.
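As an illustration only, the sketch below computes such a transparency reward, assuming a hypothetical predict_answer callable (the human or LLM predictor) and a task-specific same_answer comparison; the actual training setup would be decided during the PhD.

```python
# Hypothetical helpers (assumed to exist elsewhere):
#   predict_answer(cot, question) -> the predictor's guess at the final answer,
#                                    given only the question and the CoT
#   same_answer(a, b)             -> bool, whether two answers match

def transparency_reward(episodes, predict_answer, same_answer):
    """Fraction of episodes in which a predictor, shown only the CoT,
    correctly anticipates the model's final answer.

    Each episode is a (question, cot, final_answer) tuple. A high score means
    the CoT transparently determines the output; a low score suggests the
    answer depends on information not legible in the CoT.
    """
    if not episodes:
        return 0.0
    hits = 0
    for question, cot, final_answer in episodes:
        guess = predict_answer(cot, question)
        if same_answer(guess, final_answer):
            hits += 1
    return hits / len(episodes)
```

This scalar could then serve as part of the reward in a fine-tuning loop, alongside the usual task reward.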
Qualifications required: Applicants should have (or expect to obtain by the start date) at least a first degree in Engineering or a related subject.
Ideally, applicants will have some experience with software development projects or research on LLMs.
This is a fully-funded studentship, covering fees and maintenance for either a home or an overseas candidate.
To apply for this studentship, please upload your two-page CV and research proposal via this form: https://forms.gle/Cm3MWPsWta73J2Gp7. The form responses will be evaluated on a rolling basis.
Please note that any offer of funding will be conditional on securing a place as a PhD student. Candidates will need to apply separately for admission through the University's Graduate Admissions application portal; this can be done before or after applying for this funding opportunity. The applicant portal can be accessed via the above 'Apply' button. University Postgraduate Admissions closing dates are 14 May for October start and 30 July for January start, although it is advisable to apply earlier than this. Please note that there is an application fee of £20 to apply via the Postgraduate Application Portal.
The University actively supports equality, diversity and inclusion and encourages applications from all sections of society.