Summer 2024
2024 Research Projects
DSSI students work on real-world data science projects. Many projects come from DSI’s partnership with the 11th Hour Project, a program of the Schmidt Family Foundation, where we serve as the centralized software and data science hub for social impact grantees and nonprofit organizations.
This year’s research theme was natural language processing (NLP) and large language models (LLMs). The research was led by Satadisha Saha Bhowmick, PhD, 2024 Summer Research Director and Preceptor at the UChicago Data Science Institute.
Text Classification via Prompt Generation – Center for Good Food Purchasing
The Center for Good Food Purchasing reviews receipts and invoices to track the kinds of foods that public institutions typically purchase. The Center staff categorizes food items and rates them on factors like sustainability, worker treatment, and animal welfare. Institutions can then review these ratings when making future purchasing decisions. The goal of the project is to automate this food labeling task using different methods of natural language processing.
Yannick Tanyi, Joanna John, and Atuwatse Okorodudu took on this task using prompt generation. The team applied different prompting methods, such as zero-shot prompting and few-shot prompting with a Retrieval-Augmented Generation (RAG) implementation, to guide large language models in classifying food products. They also compared Llama3 and OpenAI GPT models on cost-effectiveness and performance. While the students observed that OpenAI GPT grouped food products more accurately, the RAG implementation greatly increased the performance of the Llama3 model. A copy of their poster can be found here.
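To make the idea of few-shot prompting with retrieval concrete, here is a minimal sketch in Python. It is not the team's implementation: the labeled examples, the category names, and the function names are all hypothetical, and simple word overlap stands in for the embedding-based retrieval a real RAG pipeline would more likely use. The sketch only builds the prompt text and does not call any model.

```python
from typing import List, Tuple

# Hypothetical labeled examples (product description, food category).
LABELED: List[Tuple[str, str]] = [
    ("chicken breast boneless skinless frozen", "Poultry"),
    ("whole milk 1 gallon", "Dairy"),
    ("apple gala fresh", "Fruit"),
    ("cheddar cheese shredded", "Dairy"),
    ("ground beef 80/20", "Beef"),
]


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; an assumed stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve(query: str, examples: List[Tuple[str, str]], k: int = 2):
    """Return the k labeled examples most similar to the query."""
    ranked = sorted(examples, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Tuple[str, str]], k: int = 2) -> str:
    """Assemble a few-shot prompt from the retrieved examples plus the query."""
    lines = ["Classify each food product into a category.", ""]
    for text, label in retrieve(query, examples, k):
        lines.append(f"Product: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    lines.append(f"Product: {query}")
    lines.append("Category:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("chocolate whole milk 1 gallon", LABELED))
```

Setting `k=0` produces a zero-shot prompt with no examples, which makes it easy to compare the two prompting styles on the same query.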
Fine Tuning Large Language Models (LLMs) – Center for Good Food Purchasing
Cielo Martinez Flores, Perfect Sylvester, and Alondra Flores used traditional fine-tuning approaches for text classification. The team set out to build a classifier that could assign each product to its food group, category, and subtype with at least 90% accuracy. To meet this goal, the students followed an extensive data cleaning protocol that included deduplicating data, simplifying labels, spell-checking, and filtering to allow for accurate downstream sampling. They then used a RoBERTa model with a custom trainer to achieve 92% overall accuracy. A copy of their poster can be found here.
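The cleaning steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the team's protocol: the record format, the `min_count` threshold, and the interpretation of "filtering" as dropping labels with too few examples are assumptions, and spell-checking is left out because it would need an external dictionary.

```python
from collections import Counter
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def clean(records: List[Tuple[str, str]], min_count: int = 2) -> List[Tuple[str, str]]:
    """Deduplicate (text, label) pairs, simplify labels, and drop rare labels.

    min_count is a hypothetical threshold: labels with fewer examples are
    removed so that later sampling has enough rows per class.
    """
    seen = set()
    deduped = []
    for text, label in records:
        key = (normalize(text), normalize(label))
        if key not in seen:  # keep only the first copy of each pair
            seen.add(key)
            deduped.append(key)
    counts = Counter(label for _, label in deduped)
    return [(text, label) for text, label in deduped if counts[label] >= min_count]


if __name__ == "__main__":
    raw = [
        ("Whole  Milk", "Dairy"),
        ("whole milk", "dairy "),
        ("Cheddar", "Dairy"),
        ("Kale", "Vegetable"),
    ]
    print(clean(raw))
```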
Quotation Extraction – The mBio Project
The mBio Project explores how African media discuss biotechnology and agriculture, using text from articles published over the last 25 years. Since its founding in 2020, the project has aimed to trace the historical and current narratives and understanding of genetically modified crops on the continent.
Students Malee Her, Caitlin Hamer, Michelle Cobaxin, and Catie Lutz used various techniques, such as Quotation Extraction, Named Entity Recognition (NER), Coreference Resolution (Coref), and Aspect-Based Sentiment Analysis (ABSA), to perform sentiment analysis on articles from numerous African media outlets. Using a dataset of more than 1,800 quotes from over 2 million articles, the team identified advantages and disadvantages of using OpenAI, spaCy (transformer), and BERT models for NER, with BERT reaching 98% accuracy. The spaCy model performed better than OpenAI in Coref, allowing for identification of entities in 58% of the data. The team’s work may be used to further refine these models, improving the robustness and accuracy of the research and bringing the work closer to practical application. A copy of their poster can be found here.
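As a rough illustration of what quotation extraction involves, the sketch below pulls direct quotes and their speakers out of text with a regular expression. It is a deliberately simple, rule-based stand-in and not the approach the team used, which relied on OpenAI, spaCy, and BERT models. The pattern, the function name, and the sample sentences are hypothetical, and the pattern only handles straight double quotes followed by "said" or "says" and a capitalized name.

```python
import re
from typing import List, Tuple

# Assumed pattern: "quote," said Firstname Lastname
QUOTE_PATTERN = re.compile(
    r'"([^"]+)"\s*,?\s*(?:said|says)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)'
)


def extract_quotes(text: str) -> List[Tuple[str, str]]:
    """Return (speaker, quote) pairs for direct quotes followed by said/says."""
    results = []
    for match in QUOTE_PATTERN.finditer(text):
        quote = match.group(1).strip().rstrip(",.")  # drop trailing punctuation
        speaker = match.group(2)
        results.append((speaker, quote))
    return results


if __name__ == "__main__":
    sample = (
        '"GM maize could improve yields," said Jane Doe. '
        '"We need more trials," says Amina Okoye.'
    )
    for speaker, quote in extract_quotes(sample):
        print(f"{speaker}: {quote}")
```

A rule like this misses indirect speech, pronoun speakers ("she said"), and speakers named before the quote, which is part of why model-based NER and coreference resolution are useful for this task.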
Fine Tuning for Text Summarization – BankTrack
BankTrack is the international tracking, campaigning, and civil society support organization targeting private sector commercial banks and the activities they finance. BankTrack combines twenty years of critical yet constructive engagement with banks and banking initiatives. Form 8-K is a required report that companies must file with the SEC to announce major events that shareholders should know about, such as incurring debt. Using data from these forms, BankTrack aims to illuminate pertinent information regarding the monetary expenditures of financial institutions and the corresponding ramifications in order to inform the public.
Jonathan Garcia, Diego Sarria, Zaina Khalil, and Jack Sanderson worked together to distill complex text in approximately 100 8-K forms into informative summaries. The team constructed training labels from heuristically generated summaries; the labels and text were then used to train Longformer Encoder-Decoder (LED) and Llama large language models (LLMs). After training, refining, and adjusting parameters for these models, the group found that both models produced summaries that were much shorter than the original text while retaining approximately 90% of the necessary information. A copy of their poster can be found here.
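The source does not say which heuristics the team used to build its training summaries, so the following Python sketch shows just one possible heuristic: keep the sentences that mention debt-related keywords, in their original order. The keyword list, function names, and sample filing text are all hypothetical.

```python
import re
from typing import List

# Hypothetical keywords of interest; not taken from the project.
KEYWORDS = ("credit agreement", "loan", "notes", "debt", "borrow")


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break on whitespace after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def heuristic_summary(text: str, max_sentences: int = 2) -> str:
    """Keep up to max_sentences sentences with the most keyword hits."""
    scored = []
    for idx, sentence in enumerate(split_sentences(text)):
        lower = sentence.lower()
        score = sum(lower.count(keyword) for keyword in KEYWORDS)
        scored.append((score, idx, sentence))
    # Highest score first; ties broken by position in the document.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    # Drop sentences with no keyword hits, then restore document order.
    kept = sorted((item for item in top if item[0] > 0), key=lambda item: item[1])
    return " ".join(sentence for _, _, sentence in kept)


if __name__ == "__main__":
    filing = (
        "The company held its annual meeting. "
        "On May 1, the company entered into a credit agreement with a bank. "
        "The loan matures in 2029. "
        "The weather was pleasant."
    )
    print(heuristic_summary(filing))
```

Summaries produced this way could then serve as the target text when fine-tuning a sequence-to-sequence model, with the full filing as the input.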
2024 Leadership and Staff
- David Uminsky (he/him)
  Executive Director, Data Science Institute; Senior Research Associate, Department of Computer Science
- Satadisha Saha Bhowmick
  Preceptor, Data Science Institute
- Susan Paykin (she/her)
  Associate Director, Community-Centered Data Science
- Evelyn Campbell, Ph.D. (she/her)
  Program Manager, Community-Centered Data Science, Data Science Institute
- Amanda Kube
  Currently: Preceptor, UChicago Data Science Institute; Previously: PhD Candidate, Washington University in St. Louis
- Bill Trok
  Preceptor, Data Science Institute
David Uminsky joined the University of Chicago in September 2020 as a senior research associate and Executive Director of Data Science. He was previously an associate professor of Mathematics and Executive Director of the Data Institute at the University of San Francisco (USF). His research interests are in machine learning, signal processing, pattern formation, and dynamical systems. David is an associate editor of the Harvard Data Science Review. He was selected in 2015 by the National Academy of Sciences as a Kavli Frontiers of Science Fellow. He is also the founding Director of the BS in Data Science at USF and served as Director of the MS in Data Science program from 2014 to 2019. During the summer of 2018, David served as the Director of Research for the Mathematical Sciences Research Institute Undergraduate Program on the topic of Mathematical Data Science.
Before joining USF he was a combined NSF and UC President’s Fellow at UCLA, where he was awarded the Chancellor’s Award for outstanding postdoctoral research. He holds a Ph.D. in Mathematics from Boston University and a BS in Mathematics from Harvey Mudd College.
Satadisha is a Preceptor focusing on data science education at the Data Science Institute of the University of Chicago, where she is working jointly with the City Colleges of Chicago. She graduated with a PhD in Computer Science from the State University of New York at Binghamton, under the supervision of Prof. Weiyi Meng.
Her research lies at the intersection of Natural Language Processing, Information Retrieval, and Machine Learning. Her doctoral dissertation focused on the problem of scalable Named Entity Recognition for Microblog Streams. To this end, her work involves applying and building state-of-the-art Deep Learning pipelines that yield more effective results for NLP tasks on microblog messages.
She also briefly worked as a Machine Learning Engineer for the Content Growth and Moderation Team at Quora.
In addition to her research, she assisted the Department of Computer Science at SUNY Binghamton in teaching several graduate and undergraduate level courses. She is excited to continue her academic career and teaching journey in her role as a Preceptor. Outside her academic work, she spends time studying history, writing fiction, and passionately supporting Liverpool Football Club. But mostly, she yearns for Kolkata!
Susan Paykin is the Associate Director, Community-Centered Data Science at the DSI, where she leads social impact and strategic partnership initiatives across the organization’s three pillars of research, education, and engagement. She is also the Program Lead of the Open Spatial Lab, where she leads geospatial data science projects and partner engagement. Susan was previously the Research Manager at the Center for Spatial Data Science at UChicago and has served in leadership roles for environmental and social impact organizations. She holds a Master in Public Policy (M.P.P.) with a concentration in policy analysis from the Harris School of Public Policy at the University of Chicago and a B.A. from Brandeis University.
Evelyn Campbell is the Program Manager for Community-Centered Data Science at the Data Science Institute. She oversees and implements the DSI’s educational outreach programs, such as Data4All and the Data Science for Social Impact Research Experience. She was previously a Data Science Preceptor, where she taught data science curriculum for both the University of Chicago and City Colleges of Chicago. She obtained her PhD in Microbiology from the University of Chicago in 2022 and her BS in Biology from Rider University in 2016. She is an advocate for educational access and expanding representation in data science and other STEM fields.
Bio: Amanda Kube is a Ph.D. Candidate in the Division of Computational and Data Sciences at Washington University in St. Louis working with Dr. Sanmay Das in the Department of Computer Science and Dr. Patrick Fowler in the Brown School. She received her B.S. in Psychological and Brain Sciences and Mathematics with a concentration in Statistics from Washington University in St. Louis where she also received an M.S. in Data Analytics and Statistics. Her research interests involve the intersection of computation and the social sciences. Her current work combines machine learning and human decision-making to inform fair and efficient service allocations for homeless families.
Talk Title: Integrating Human Priorities and Data-Driven Improvements in Allocation of Scarce Homeless Services to Households in Need
Talk Abstract: Homelessness is a major public health issue in the United States that has gained visibility during the COVID-19 pandemic. Despite efforts at the federal level, rates of homelessness are not decreasing. Homeless services are a scarce public resource and current allocation systems have not been thoroughly investigated. Algorithmic techniques excel at modeling complex interactions between features and therefore have potential to model effects of homeless services at the individual level. These models can reason counterfactually about the effects of different services on each household and resulting predictions can be used for matching households to services. The ability to model heterogeneity in treatment effects of services provides the potential for “precision public health” where allocation of services is based on data-driven predictions of which service will lead to better outcomes. I discuss the scarce resource allocation problem as it applies to homeless service delivery, and the ability to improve upon the current allocation system using algorithmic techniques. I compare prediction algorithms to each other as well as to the ability of the general public to make these decisions. As homeless services are scarce public goods, it is vital to ensure allocations are not only efficient, but fair and ethical. I discuss efforts to ensure fair decisions and to understand how people prioritize households who should receive scarce homeless services. I also discuss future work and next steps as well as policy implications.
Bill Trok is a Preceptor in Data Science focusing on data science education as a joint instructor for both the University of Chicago and City Colleges of Chicago. He holds a PhD in Mathematics from the University of Kentucky, where he worked with Uwe Nagel researching polynomial interpolation and applications of combinatorial optimization to algebraic geometry. In his spare time he enjoys running and cooking.