Bio: Lucy Li is a PhD student at the University of California, Berkeley, working on natural language processing (NLP), computational social science, cultural analytics, and AI fairness. She researches how social groups are discussed and represented in language models and textual data, such as textbooks, fiction, and online forums. She is passionate about bridging NLP with the humanities and social sciences, especially education and curriculum studies. She is supported by an NSF Graduate Research Fellowship and has interned at Microsoft Research and the Allen Institute for AI, the latter of which awarded her Outstanding Intern of the Year.

Talk Title: Measuring Language By Social Groups in Natural Language Processing

Abstract: Language data embeds social identities, behaviors, and beliefs. That is, who we are is expressed through how we communicate. My research leverages natural language processing (NLP) methods to measure large-scale language patterns across social groups, and uses these measurements to answer sociolinguistic questions and inform model development. First, I’ll present studies quantifying community-specific words and meanings across two domains: online discussion forums and scholarly literature. In these studies, I leverage perspectives from sociolinguistics to relate communities’ use of distinctive language to various social factors, such as member loyalty, activity, network density, audience, and cross-community impact. Second, I’ll present an ongoing study analyzing the effects of large language model (LLM) pretraining data practices on text spanning a range of socioeconomic and geographic origins. Model developers often implement filters to extract “high-quality” data to train models. I’ll show whether notions of “quality” vary across popular LLMs’ filtering strategies, and what kinds of webtext are disproportionately removed during these curation processes. Finally, I will conclude by discussing challenges and questions raised by my research that point towards future directions for computational social science and NLP.
