Part of the Spring 2024 Distinguished Speaker Series.

Getting Inference Right with LLM Annotations in the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analysis 

Text-as-data methods, including large language models (LLMs), have allowed social scientists to measure a wide range of properties of documents. While such predicted text-based variables are often analyzed as if they were observed without error, we show that ignoring prediction errors leads to substantial bias and invalid confidence intervals in downstream analyses, even if the accuracy of the automated annotation step is high, e.g., above 90%. We propose a framework of design-based supervised learning (DSL) that can provide valid statistical estimates, even when predicted variables contain non-random prediction errors. DSL employs a doubly robust procedure to combine predicted labels and a smaller number of high-quality expert annotations. DSL allows scholars to apply advances in LLMs and natural language processing to social science research while maintaining statistical validity. We illustrate its general applicability using two applications where the outcome and independent variables are text-based. This work is joint with Naoki Egami, Musashi Hinck, and Hanying Wei. I will conclude the talk with a broader view of how we can think about the best use of LLMs in the social sciences.
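
The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the simplest possible target (the mean of a binary text-based variable), assumes documents are chosen for expert annotation at random with a known probability `pi`, and uses one common doubly robust construction: the predicted label minus an inverse-probability-weighted correction on the expert-annotated subset. All names (`design_adjusted_outcome`, `simulate`) and all simulation parameters are hypothetical.

```python
import numpy as np


def design_adjusted_outcome(y_pred, y_expert, labeled, pi):
    """Bias-corrected pseudo-outcome (illustrative assumption, not the DSL package).

    y_pred   : predicted labels for every document
    y_expert : expert labels (only read where `labeled` is True)
    labeled  : boolean mask, True where an expert annotation was sampled
    pi       : known probability of sampling a document for expert annotation
    """
    y_pred = np.asarray(y_pred, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    # Where no expert label exists, substitute the prediction so the
    # correction term is exactly zero for that document.
    y_gold = np.where(labeled, y_expert, y_pred)
    return y_pred - (labeled / pi) * (y_pred - y_gold)


def simulate(seed=0, n=20_000, pi=0.1):
    """Toy data with non-random prediction errors that depend on a covariate x."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 1, n)
    y = rng.binomial(1, 0.3 + 0.4 * x)  # true (normally unobserved) label
    # The classifier misses some positives, more often when x is large.
    missed = (y == 1) & (rng.uniform(0, 1, n) < 0.2 * x)
    y_pred = np.where(missed, 0, y)
    # Expert annotations are sampled completely at random with probability pi.
    labeled = rng.uniform(0, 1, n) < pi
    y_tilde = design_adjusted_outcome(y_pred, y, labeled, pi)
    return {
        "accuracy": (y_pred == y).mean(),
        "truth": y.mean(),
        "naive": y_pred.mean(),      # treats predictions as error-free
        "adjusted": y_tilde.mean(),  # uses the expert subset to correct bias
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name:>9}: {value:.3f}")
```

In this toy setup the predictions are more than 90% accurate, yet the naive mean is systematically too low because the errors are one-sided; the adjusted mean stays close to the truth because the sampling probability of the expert annotations is known by design.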

Bio: Brandon Stewart is an Associate Professor in the Department of Sociology and is also affiliated with the Department of Politics and the Office of Population Research. He develops new quantitative statistical methods for applications across the social sciences. Methodologically, his focus is on tools that facilitate automated text analysis and model complex heterogeneity in regression. Many recent applications of these methods have centered on using large corpora of text to better understand propaganda in contemporary China. His research has been published in journals such as American Journal of Political Science, Political Analysis, and the Proceedings of the Association for Computational Linguistics. His work has won the Edward R Chase Dissertation Prize, the Gosnell Prize for Excellence in Political Methodology, and the Political Analysis Editor’s Choice Award.

Agenda

Thursday, May 2, 2024

12:00 pm–12:30 pm

Lunch

Lunch will be provided on a first-come, first-served basis.

12:30 pm–1:30 pm

Talk and Q&A
