Summer 2025

DSSI students work on real-world data science research projects centered around a shared data science theme or approach. The theme for the 2025 Summer Experience was natural language processing (NLP) and large language models (LLMs).
Students worked with real data to solve real-world problems in partnership with social impact organizations across the world. Many of these organizational partners come from The 11th Hour Project, a program of the Schmidt Family Foundation, for which UChicago DSI serves as the centralized software and data science hub.
2025 Research Projects
Text Classification for Commercial Debt Tracking
BankTrack is an international tracking, campaigning, and NGO support organization focused on private sector commercial banks and the activities they finance. Many such financial activities are disclosed through Form 8-Ks. A Form 8-K is a report that public companies in the United States file with the Securities and Exchange Commission (SEC) to disclose material events that may affect their financial health or investor interests. It essentially provides investors with real-time updates on significant developments. Form 8-Ks can also be an important source of information for revealing hidden roles that investors play in projects that raise ethical concerns, such as environmental or human rights harms. This summer, students worked to support BankTrack by developing a commercial debt tracker that helps advocacy organizations shed light on these financial relationships.
Tracy Nguyen, Sonia Pereira, Coral Fragoso Herrera, and Bidisha Dalai helped to build a text classifier to detect whether Form 8-Ks contain information relevant to commercial debt. The team used web-scraped data, as well as compiled training data and newly annotated data, to classify filings as debt-related or not using prompt engineering and fine-tuning. Using a combination of open-source and proprietary language models, the group assessed accuracy and financial costs to determine the best approach for deployment on a larger dataset. In the end, they found that the Lianglab model efficiently filtered out non-debt-related forms and recommended that this model be integrated into BankTrack’s debt analysis pipeline.
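When a classifier is used as a filter in front of a larger pipeline, accuracy alone can hide an important failure mode: a debt-related form that is filtered out never reaches downstream analysis. The sketch below shows one common way to check this, by computing precision and recall on the debt-related class. It is an illustration only, not the team's evaluation code; the function name and the toy labels are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, made up for illustration: 1 = debt-related, 0 = not debt-related.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low recall here would mean relevant filings are being discarded, which is usually the costlier error for a filtering step.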
Q/A Bot for Parsing Country-Specific Agricultural Regulations
A Growing Culture is an organization committed to advancing global food sovereignty by challenging industrial agriculture and advocating for the rights of small-scale farmers. Central to their mission is the protection of traditional seed systems and the empowerment of communities to maintain control over their seeds. Hence, much of their work focuses on seed laws, which in any given country are governed by complex national and international legal conditions. To make the seed laws of a given country easy to understand, it is important to parse this information and present it in a simplified manner. This summer, students worked to build a chatbot that can provide country-specific information about seed laws.
Together, Amari Gray, Jibek Gupta, Allen Jones, Gerardo Rojas, and Peter Ronemous worked with a dataset of 183 PDFs detailing seed laws in 78 countries and 9 languages. The group deployed the chatbot through a web-browser interface to make it accessible and user-friendly for farmers. Using a variety of language models, the group assessed each model on summarization, multilingual support, and performance. They were able to successfully obtain summarization outputs from multiple models, including Gemma, Phi-4, and Llama-3. They found that Phi-4 was the best model for the chatbot, as it offered fast, high-quality responses, was cost-effective as an open-source model, and was compatible with hardware limitations. Some future directions of this work include expanding chatbot functionality to more countries, extending support to more languages, and improving the website’s concurrency to handle more users simultaneously.
Topic Classification of Grievances Against Palm Oil Producers
The palm oil industry is tied to land conflicts, labor violations, and environmental harm. There are several online databases of grievances submitted against palm oil producers, yet many community complaints remain scattered and overlooked. To better surface those grievances, there are efforts to organize them into a clear, accessible dataset. Students worked to support a public tool that promotes transparency, holds companies accountable, and amplifies the voices of affected communities in this regard.
As a team, Theron White, Helena Card, Uriel Fuentes, Owakamare Princewill, Nicole Page, and Yuliia Ihnatesku worked together to build a palm oil grievance classifier. Using topic modeling, the group performed thematic analysis on grievances based on the co-occurrence of words. Several topic modeling techniques were evaluated using multiple methods, including visualizations (word clouds, topic listings, frequency graphs, etc.) and human oversight. Outcomes of the initial topic models were tested for sustainability on previously unseen grievances by training a multi-label classifier capable of tagging each grievance with multiple topics. In the end, the team demonstrated a sustainable and scalable pipeline to interpret palm oil grievances over time.
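What distinguishes a multi-label classifier from a single-label one is that each grievance maps to a binary indicator vector with one slot per topic, and each topic is decided independently at prediction time. The sketch below illustrates only that encoding and decoding step. The topic names are hypothetical (loosely taken from the themes mentioned above), the 0.5 threshold is an assumption, and the per-topic scores stand in for whatever model produces them.

```python
# Hypothetical topic vocabulary; the team's actual topics are not listed here.
TOPICS = ["land", "labor", "environment"]


def encode_labels(topic_sets, topics=TOPICS):
    """Turn a list of topic sets into binary indicator rows, one column per topic."""
    return [[1 if topic in assigned else 0 for topic in topics]
            for assigned in topic_sets]


def decode_scores(scores, topics=TOPICS, threshold=0.5):
    """Keep every topic whose score reaches the threshold (zero, one, or many)."""
    return [topic for topic, score in zip(topics, scores) if score >= threshold]


# Two toy grievances: the first touches two topics, the second only one.
targets = encode_labels([{"land", "labor"}, {"environment"}])
print(targets)

# Made-up per-topic scores for one unseen grievance.
print(decode_scores([0.8, 0.2, 0.6]))
```

Because each topic is thresholded separately, a single grievance can receive several tags or none, which a single-label classifier cannot express.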
Named Entity Recognition on Congressional Donation Records
Identifying which candidate is best aligned with an individual’s or organization’s values is key to actually voting in one’s best interest. This is particularly important for organizations like Climate Cabinet, which aim to identify and empower pro-climate candidates in smaller races where there is little information. To address this problem, students were tasked with helping classify donor data by donation type (individual or organization) to understand donor networks in politics.
Adinai Niiazbekova, Drew Day, Sloan Louis, Gabriel Romero Torres, and Jakob Ontiveros used donation records from five states: Pennsylvania, Arizona, Michigan, Minnesota, and Texas. After binarizing the labels during preprocessing, the team trained traditional statistical models (logistic regression, support vector machine, and random forest) and language models (ALBERT, RoBERTa, and DistilBERT). To evaluate generalizability of the models, the group trained and tested on data from pairs of states and on all states except Texas combined. Texas data was held out as a separate test set, never used for training, to get a true estimate of the model’s performance on previously unseen data. The baseline models generally performed well when trained and tested on the same set of states, but failed to generalize, showing signs of overfitting. The deep learning models also generally performed well when trained and tested on the same set of states. However, the models initially struggled to transfer when trained on one pair of states and tested on another. Through model experimentation and fine-tuning, students accurately classified donor names as either individuals or organizations across four states. Future work may focus on multi-class classification, breaking down organizations into subcategories such as committees, companies, parties, and more.
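The evaluation design described above can be sketched as two kinds of splits: one state held out entirely, and cross-pair transfer among the remaining states. This is a minimal illustration under assumptions, not the team's code: the record fields (`name`, `state`, `label`), the toy donor names, and the choice to pair only disjoint sets of states are all hypothetical.

```python
from itertools import combinations

TRAIN_STATES = ["PA", "AZ", "MI", "MN"]
HELD_OUT_STATE = "TX"  # never used for training


def split_by_state(records, train_states, test_states):
    """Partition records into train and test lists by their state field."""
    train = [r for r in records if r["state"] in train_states]
    test = [r for r in records if r["state"] in test_states]
    return train, test


def pairwise_transfer_splits(states):
    """Yield (train_pair, test_pair) where the two pairs share no state."""
    pairs = list(combinations(states, 2))
    for train_pair in pairs:
        for test_pair in pairs:
            if set(train_pair).isdisjoint(test_pair):
                yield train_pair, test_pair


# Toy records with binarized labels: 0 = individual, 1 = organization.
records = [
    {"name": "Jane Doe", "state": "PA", "label": 0},
    {"name": "Example Holdings LLC", "state": "AZ", "label": 1},
    {"name": "John Roe", "state": "TX", "label": 0},
]

train, test = split_by_state(records, TRAIN_STATES, [HELD_OUT_STATE])
print(len(train), len(test))

for train_pair, test_pair in pairwise_transfer_splits(TRAIN_STATES):
    print("train on", train_pair, "test on", test_pair)
```

Keeping the held-out state out of every training split is what makes its score an estimate of performance on unseen data rather than a measure of fit.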
Bias or Stereotype Detection in Online Humorous Content
Humor is prevalent in online textual content, but it can also serve as a source of trolling and perpetuate discrimination and cyberbullying in digital spaces. As a result, humor classification and generation tasks have attracted significant academic attention in recent years. However, existing corpora used for these tasks often lack detailed labelling of the specific elements that contribute to humor. Additionally, humor is highly subjective, making it difficult to analyze how hateful or offensive messages are softened through humor to gain acceptance. To address this, students worked to build a text classification system that is able to detect the use of stereotypes or biased representations in humorous content extracted from a collection of jokes in English publicly available online.
Avery Pratt, Malcolm Felix, Jiaqi (Scarlett) He, Casidhe Pierre, and Dickson Acheampong explored two modeling approaches. The students created a novel annotation framework, used it to label datasets, and compared their labels to those of a fine-tuned model. The framework used two independent annotators for each example and required texts to be labelled across three dimensions: joke type, rhetoric, and target. Mistral and OpenAI models were used for few-shot prompting, and a host of pretrained models were fine-tuned and optimized for performance. In the end, the study confirmed that while humor classification is challenging, prompt engineering offered flexibility and speed, especially for prototyping or exploring new joke types. However, fine-tuned models generally performed better on fixed tasks with well-labeled datasets. By combining these strategies, students moved the research closer to automated systems that can understand and mitigate bias hidden in online humor.
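Few-shot prompting, as mentioned above, means showing the model a handful of labeled examples before the text to be labeled. The sketch below assembles such a prompt for the three annotation dimensions. It is illustrative only: the instruction wording, the field names, and the placeholder examples are assumptions, the label values are deliberately left as placeholders, and no model API is called.

```python
# Field names are hypothetical stand-ins for the three annotation dimensions.
DIMENSIONS = ("joke_type", "rhetoric", "target")


def build_few_shot_prompt(examples, new_text):
    """Assemble a plain-text prompt: instruction, labeled examples, then the new text."""
    lines = [
        "Label the joke along three dimensions: joke type, rhetoric, and target.",
        "",
    ]
    for example in examples:
        lines.append(f"Text: {example['text']}")
        for dimension in DIMENSIONS:
            lines.append(f"{dimension}: {example[dimension]}")
        lines.append("")
    # End with the unlabeled text and an open first field for the model to complete.
    lines.append(f"Text: {new_text}")
    lines.append(f"{DIMENSIONS[0]}:")
    return "\n".join(lines)


# Placeholder examples; real prompts would use annotated jokes and real labels.
examples = [
    {"text": "<annotated joke 1>", "joke_type": "<type>",
     "rhetoric": "<rhetoric>", "target": "<target>"},
    {"text": "<annotated joke 2>", "joke_type": "<type>",
     "rhetoric": "<rhetoric>", "target": "<target>"},
]

print(build_few_shot_prompt(examples, "<joke to classify>"))
```

Changing the label scheme only requires editing the examples, which is one reason prompting suits prototyping, whereas a fine-tuned model would need to be retrained.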
2025 Student Profiles
- Adinai Niiazbekova, City Colleges of Chicago
- Allen Jones, Florida A&M University
- Amari Gray, Morehouse College
- Avery Pratt, Spelman College
- Bidisha Dalai, University of Texas, San Antonio
- Casidhe Pierre, Florida A&M University
- Coral Fragoso Herrera, University of Illinois at Chicago
- Dickson Acheampong, Howard University
- Drew Day, University of Texas, San Antonio
- Gabriel Romero Torres, University of Puerto Rico, Rio Piedras
- Gerardo Rojas, California State University, Fresno
- Helena Card, California State University, Fresno
- Jakob Ontiveros, California State University, Fresno
- Jiaqi (Scarlett) He, University of Chicago
- Jibek Gupta, Howard University
- Malcolm Felix, Prairie View A&M University
- Muritala Bello, Chicago State University
- Nicole Page, North Carolina State University
- Owakamare Princewill, University of Chicago
- Peter Ronemous, Prairie View A&M University
- Sloan Louis, Spelman College
- Sonia Pereira, North Carolina State University
- Theron White, Morehouse College
- Tracy Nguyen, City Colleges of Chicago
- Uriel Fuentes Figueroa, University of Puerto Rico, Rio Piedras
- Yuliia Ihnatesku, University of Chicago
Leadership and Staff
- David Uminsky (he/him), Executive Director and Research Professor, Data Science Institute
- Satadisha Saha Bhowmick, Preceptor, Data Science Institute
- Susan Paykin (she/her), Senior Associate Director, Community-Centered Data Science
David Uminsky joined the University of Chicago in September 2020 as a senior research associate and Executive Director of Data Science. He was previously an associate professor of Mathematics and Executive Director of the Data Institute at the University of San Francisco (USF). His research interests are in machine learning, signal processing, pattern formation, and dynamical systems. David is an associate editor of the Harvard Data Science Review. He was selected in 2015 by the National Academy of Sciences as a Kavli Frontiers of Science Fellow. He is also the founding Director of the BS in Data Science at USF and served as Director of the MS in Data Science program from 2014 to 2019. During the summer of 2018, David served as the Director of Research for the Mathematical Sciences Research Institute Undergrad Program on the topic of Mathematical Data Science.
Before joining USF, he was a combined NSF and UC President’s Fellow at UCLA, where he was awarded the Chancellor’s Award for outstanding postdoctoral research. He holds a Ph.D. in Mathematics from Boston University and a BS in Mathematics from Harvey Mudd College.
Satadisha is a Preceptor focusing on data science education at the Data Science Institute of the University of Chicago, where she is working jointly with the City Colleges of Chicago. She graduated with a PhD in Computer Science from the State University of New York at Binghamton, under the supervision of Prof. Weiyi Meng.
Her research lies at the intersection of Natural Language Processing, Information Retrieval, and Machine Learning. Her doctoral dissertation focused on the problem of scalable Named Entity Recognition for Microblog Streams. To this end, her work involves applying and building state-of-the-art Deep Learning pipelines that yield more effective results for NLP tasks on microblog messages.
She also briefly worked as a Machine Learning Engineer for the Content Growth and Moderation Team at Quora.
In addition to her research, she assisted the Department of Computer Science at SUNY Binghamton in teaching several graduate and undergraduate level courses. She is excited to continue her academic career and teaching journey in her role as a Preceptor. Outside her academic work, she spends time studying history, writing fiction, and passionately supporting Liverpool Football Club. But mostly, she yearns for Kolkata!
Susan Paykin is the Senior Associate Director, Community-Centered Data Science, at the DSI, where she oversees social impact and strategic partnership initiatives across the organization’s research, education, and engagement. She is also the Program Lead of the Open Spatial Lab, where she leads geospatial data science projects and partner engagement. Susan was previously the Research Manager at the Center for Spatial Data Science at UChicago and has served in leadership roles for environmental and social impact organizations. She holds a Master in Public Policy (M.P.P.) from the Harris School of Public Policy at the University of Chicago and a B.A. from Brandeis University.




