Brandon Stewart (Princeton University) - Getting Inference Right with LLM Annotations in the Social Sciences

Part of the Spring 2024 Distinguished Speaker Series.

Getting Inference Right with LLM Annotations in the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analysis

Text-as-data methods, including large language models (LLMs), have allowed social scientists to measure a wide range of properties of documents. While such predicted text-based variables are often analyzed as if they were observed without error, we show that ignoring prediction errors leads to substantial bias and invalid confidence intervals in downstream analyses, even if the accuracy of the automated annotation step is high, e.g., above 90%. We propose a framework of design-based supervised learning (DSL) that can provide valid statistical estimates, even when predicted variables contain non-random prediction errors. DSL employs a doubly robust procedure to combine predicted labels and a smaller number of high-quality expert annotations. DSL allows scholars to apply advances in LLMs and natural language processing to social science research while maintaining statistical validity. We illustrate its general applicability using two applications where the outcome and independent variables are text-based. This work is joint with Naoki Egami, Musashi Hinck, and Hanying Wei. I will conclude the talk with a broader view of how we can think about the best use of LLMs in the social sciences.
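To make the idea of combining predicted labels with a small expert-labeled sample concrete, here is a minimal Python sketch. It is not the authors' implementation and covers only the simplest case: estimating a population mean. It assumes that documents are chosen for expert annotation at random with a known probability `pi`, and it uses a bias-corrected pseudo-outcome, predicted label minus an inverse-probability-weighted prediction error on the expert-labeled subset. The function name `design_adjusted_mean`, the simulated data, and all parameter values are hypothetical and chosen for illustration only.

```python
import numpy as np


def design_adjusted_mean(y_pred, y_expert, labeled, pi):
    """Estimate a mean from predicted labels plus a random expert-labeled subset.

    y_pred   : predicted label for every document
    y_expert : expert label where available (any value, e.g. NaN, elsewhere)
    labeled  : boolean mask, True where an expert label was collected
    pi       : known probability that a document was sampled for expert labeling
    """
    y_pred = np.asarray(y_pred, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    # Unlabeled documents get a placeholder of 0; they contribute no correction.
    y_exp = np.where(labeled, np.asarray(y_expert, dtype=float), 0.0)
    # Inverse-probability-weighted prediction error, nonzero only where labeled.
    correction = np.where(labeled, (y_pred - y_exp) / pi, 0.0)
    pseudo = y_pred - correction
    estimate = pseudo.mean()
    std_error = pseudo.std(ddof=1) / np.sqrt(len(pseudo))
    return estimate, std_error


# Simulated example (assumed values): the annotator is about 91% accurate,
# but its errors are non-random because it only ever misses true positives.
rng = np.random.default_rng(0)
n = 20_000
y_true = rng.binomial(1, 0.3, size=n)
missed = rng.random(n) < 0.3
y_pred = np.where((y_true == 1) & missed, 0, y_true)

pi = 0.1  # 10% of documents are sent to experts, chosen at random
labeled = rng.random(n) < pi
y_expert = np.where(labeled, y_true, np.nan)

naive = y_pred.mean()  # treats predictions as if they were the truth
est, se = design_adjusted_mean(y_pred, y_expert, labeled, pi)

print(f"truth    : {y_true.mean():.3f}")
print(f"naive    : {naive:.3f}")
print(f"adjusted : {est:.3f} (SE {se:.3f})")
```

In this simulation the naive mean of the predicted labels is biased downward despite the high accuracy, while the adjusted estimate is centered on the truth at the cost of a larger standard error. Extending the same correction to regression coefficients, as the abstract describes, is beyond this sketch.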

Bio: Brandon Stewart is an Associate Professor in the Department of Sociology and is also affiliated with the Department of Politics and the Office of Population Research. He develops new quantitative statistical methods for applications across the social sciences. Methodologically, his focus is on tools that facilitate automated text analysis and model complex heterogeneity in regression. Many recent applications of these methods have centered on using large corpora of text to better understand propaganda in contemporary China. His research has been published in journals such as American Journal of Political Science, Political Analysis, and the Proceedings of the Association for Computational Linguistics. His work has won the Edward R Chase Dissertation Prize, the Gosnell Prize for Excellence in Political Methodology, and the Political Analysis Editor’s Choice Award.

Agenda

Thursday, May 2, 2024

12:00 pm–12:30 pm

Lunch

Lunch will be provided on a first-come, first-served basis.

12:30 pm–1:30 pm

Talk and Q&A
