Research Scientist at Google DeepMind

I am a research scientist at Google DeepMind working on AI alignment. My goal is to develop interpretable and trustworthy AI systems that learn from human feedback.

I received my PhD from ETH Zurich, where I was part of the Learning & Adaptive Systems Group supervised by Prof. Andreas Krause and Dr. Katja Hofmann. My dissertation, “Algorithmic Foundations for Safe and Efficient Reinforcement Learning from Human Feedback”, focused on developing methods for learning safely and efficiently from human feedback. Before that, I received a master’s degree in Data Science from ETH Zurich and a bachelor’s degree in physics from the University of Cologne in Germany.

My research aims to build safe, robust, and interpretable artificial intelligence (AI). Currently, I work primarily on Reinforcement Learning from Human Feedback (RLHF), which I consider a key ingredient for building safe AI. My work in this area has two goals: first, making RLHF more sample-efficient via active learning, and second, using constraint models in addition to reward models to specify tasks, particularly in contexts where safety is a critical concern. Recently, my interests have expanded to other areas, including interpretability, specifically the mechanistic understanding of neural network models, and red-teaming models before and during deployment. Through my work, I strive to ensure that the AI models we deploy today and in the coming years are safe, robust, and transparent.