
Please join us for a Computer Science and DSI joint colloquium.

Thursday, February 20
2:00pm – 3:00pm
John Crerar Library, 390

Abstract: The AI research community has become increasingly concerned about risks arising from capable AI systems, ranging from misuse of generative models to misalignment of agents. My research aims to address problems in AI safety by tackling key issues in the interpretability and controllability of large language models (LLMs). In this talk, I present research showing that we are well beyond the point of thinking of AI systems as “black boxes.” AI models, and LLMs especially, are more interpretable than ever. Advances in interpretability have enabled us to control model reasoning and update knowledge in LLMs, among other promising applications. My work has also highlighted challenges that must be solved for interpretability to continue progressing. Building on this point, I argue that we can explain LLM behavior in terms of “beliefs,” meaning that core knowledge about the world determines the downstream behavior of models. Furthermore, model editing techniques provide a toolkit for intervening on beliefs in LLMs in order to test theories about their behavior. By better understanding beliefs in LLMs and developing robust methods for controlling their behavior, we will create a scientific foundation for building powerful and safe AI systems.

Bio: I am an AI researcher currently doing a residency at Anthropic. Before this, I completed my PhD at the University of North Carolina at Chapel Hill, where I was advised by Mohit Bansal. My work at UNC was supported by a Google PhD Fellowship and a Royster Fellowship.

My research focuses on AI safety and NLP. Below are some of the main areas I am interested in:

  1. Interpretability
  2. Model Editing & Unlearning
  3. Scalable Oversight

Broadly, I am interested in explaining and controlling the behavior of machine learning models. I see language models as a good object of study because we lack complete explanations for their behavior and human language provides a rich means of interacting with them. I find work on clarifying concepts and developing strong evaluation procedures especially valuable.
