DSI Celebrates Culmination of 2024 Summer Programs
For eight weeks this summer, 36 students from across the country engaged in intensive research projects, deepening their understanding of data science applications across various disciplines. Ranging from high school to undergraduate upperclassmen, the UChicago Data Science Institute’s summer program students embarked on an intellectual journey filled with personal and professional growth. The students brought a diversity of experiences, academic backgrounds, and research interests to their projects. Majoring in physics, political science, chemical engineering, computer science, robotics, mathematics, and more, students came together to advance data science research. This brought about unique opportunities to broaden their technical skills, learn about new fields of study, and explore their scholastic and career options in new ways.
Students participating in the Summer Lab were paired with faculty mentors across the university. The program is designed to provide high school and undergraduate students with hands-on experience in data science research, fostering their skills in computational analysis, data management, and interdisciplinary collaboration. Participants engage in cutting-edge projects that address real-world problems, working closely with their faculty mentors and peers. 21 students took on wide-ranging projects, with examples including neutrino research, programming robots that teach social and emotional learning to children, and identifying climate infrastructure in Chicago. One of the most important aspects of Summer Lab is the cohort structure. Dr. Kyle Chard, Summer Lab program director and Research Associate Professor, said that “the cohort environment is integral to the Summer Lab program — it fosters a collaborative environment where students learn from one another and builds strong camaraderie that supports students through the challenges and successes of research.”
Learning alongside the Summer Lab students, participants in the Data Science for Social Impact (DSSI) summer program focused on natural language processing research projects to benefit external partners in social impact organizations spanning climate, health, policy, human rights, and finance. Partners for this year’s research projects were drawn from our 11th Hour Project partners, including BankTrack, the Center for Good Food Purchasing, and the mBio Project. The DSSI program welcomes students from UChicago and a consortium of diverse higher education partners, including historically Black colleges and universities (HBCUs), minority-serving institutions (MSIs), and Hispanic-serving institutions (HSIs). The consortium is a collaborative effort to broaden participation in the talent pipeline, serve communities of highest need, and introduce students to social impact data science career opportunities. This year, fifteen students from seven colleges and universities across the nation joined the DSSI program. The institutions included North Carolina State University; Howard University; City Colleges of Chicago; California State University; the University of Chicago; the University of Texas at San Antonio; and the University of Illinois Chicago. Over the eight-week program, students engaged with a rigorous curriculum and technical training to perform analysis and produce project-specific deliverables.
Participants in both summer programs engaged in several professional development activities, including a weekly speaker series featuring UChicago faculty, program alumni, and industry representatives; career seminars; and public speaking workshops. Additionally, students enjoyed various social activities, including a tour focused on the history of Bronzeville, a White Sox game, and a volunteer activity with the Chicago Park District.
While troubleshooting code and problem-solving with analytical techniques, participants found time for communal fun amidst their hard work, forming career-long friendships. “I believe the students were able to grow their professional network while participating in the summer programs, as well as make significant additions to their resume that could help them in their future professional pursuits,” said Satadisha Saha Bhowmick, Research Lead of the DSSI program.
A few projects from the DSI’s Summer Programs are highlighted below:
Implementing Large Language Models From Farm to Fork
The Center for Good Food Purchasing reviews receipts and invoices to track the kind of foods that public institutions typically purchase. The Center staff categorizes food items and rates them on factors like sustainability, worker treatment, and animal welfare. Institutions can then review these ratings when making future purchasing decisions. Using different methods of natural language processing, students worked to automate this food labeling task.
Yannick Tanyi (UChicago), Joanna John (North Carolina State University), and Atuwatse Okorodudu (Howard University) took on this task using prompt generation. The team used different prompting methods, such as zero-shot prompting and few-shot prompting with Retrieval-Augmented Generation (RAG), to guide large language models in classifying food products. They also compared Llama3 and OpenAI GPT models on cost-effectiveness and performance. While the students observed that OpenAI GPT grouped food products more accurately, adding RAG greatly increased the performance of the Llama3 model. Engagement in this project helped to inform students’ academic and career goals. “It was really beneficial to be able to understand how to incorporate my Political Science lens through a technology field and how I could work on social impact projects where I could incorporate the best of both worlds,” said Atuwatse Okorodudu.
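For readers unfamiliar with these prompting methods, the sketch below contrasts a zero-shot prompt with a few-shot prompt whose examples are retrieved for each item. It is an illustration only, not the team’s code: the example items, category labels, function names, and the token-overlap retrieval (a simple stand-in for the embedding search a RAG system would typically use) are all assumptions.

```python
# Illustrative sketch only: items, labels, and function names are hypothetical.
from collections import Counter

# Hypothetical labeled examples standing in for a retrieval corpus.
LABELED_EXAMPLES = [
    ("organic whole milk 1 gal", "dairy"),
    ("cheddar cheese shredded 5 lb", "dairy"),
    ("chicken breast boneless 40 lb", "poultry"),
    ("romaine lettuce 24 ct", "produce"),
    ("gala apples 40 lb case", "produce"),
]
CATEGORIES = sorted({label for _, label in LABELED_EXAMPLES})


def token_overlap(a: str, b: str) -> int:
    """Count shared tokens (a crude stand-in for embedding similarity)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    return sum((ca & cb).values())


def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k labeled examples most similar to the query."""
    ranked = sorted(
        LABELED_EXAMPLES,
        key=lambda ex: token_overlap(query, ex[0]),
        reverse=True,
    )
    return ranked[:k]


def zero_shot_prompt(item: str) -> str:
    """Ask for a category with no worked examples."""
    return (
        f"Classify the food item into one of: {', '.join(CATEGORIES)}.\n"
        f"Item: {item}\nCategory:"
    )


def few_shot_rag_prompt(item: str, k: int = 2) -> str:
    """Ask for a category after showing k retrieved, already-labeled items."""
    shots = "\n".join(
        f"Item: {text}\nCategory: {label}" for text, label in retrieve(item, k)
    )
    return (
        f"Classify the food item into one of: {', '.join(CATEGORIES)}.\n"
        f"{shots}\nItem: {item}\nCategory:"
    )
```

Either prompt string would then be sent to a language model. The only difference is that the few-shot version supplies similar labeled items as context, which is the part retrieval contributes.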
Extracting Financial Information for Fiscal and Social Accountability
BankTrack is an international tracking, campaigning, and civil society support organization targeting private-sector commercial banks and the activities they finance. BankTrack draws on twenty years of critical yet constructive engagement with banks and banking initiatives. Form 8-K is a required report that companies must file with the SEC to announce major events that shareholders should know about, such as incurring debt. Using data from these forms, BankTrack aims to inform the public about how financial institutions spend their money and the consequences of that spending.
Jonathan Garcia (California State University, Fresno), Diego Sarria (North Carolina State University), Zaina Khalil (University of Illinois Chicago), and Jack Sanderson (UChicago) worked together to distill complex text in approximately 100 Form 8-K filings into informative summaries. The team built training labels from heuristically generated summaries, then used the labels and source text to train Longformer Encoder-Decoder (LED) and Llama large language models (LLMs). After training, refining, and adjusting parameters for these models, the group found that both models were able to create summaries that were much shorter than the original text while retaining approximately 90% of the necessary information. The students overcame different hurdles in order to utilize models that efficiently summarize dense financial documents. While discussing challenging aspects of the research project, rising sophomore Zaina Khalil said, “I never worked with natural language processing before so it was definitely a learning curve developing models and getting acclimated to GitHub, but the project allowed me to gain a new skill set and learn how to work with a team to make progress.”
Socially Fair Regionalization
Yassir Atlas (University of Illinois Chicago) was mentored by Dr. Yue Lin at the UChicago Center for Spatial Data Science. For his Summer Lab project, he studied regionalization algorithms, which are used to group areas into contiguous regions for school districting, political districting, habitat delineation, and similar tasks. Regionalization algorithms inform major societal decisions, so it is important to ensure they are socially fair. Yassir’s Summer Lab research showed that commonly used regionalization algorithms can favor certain racial subgroups. He also proposed a solution to this problem that minimizes the maximum subgroup cost.
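To make the idea of minimizing the maximum subgroup cost concrete, here is a small sketch under stated assumptions. It is not Yassir’s algorithm: the toy data, the two subgroups, the cost definition (population-weighted squared distance to the region mean), and the brute-force search over areas arranged along a line are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: data, cost definition, and search are assumptions.
from itertools import combinations

# Hypothetical areas along a line, so regions are runs of consecutive areas.
# Each area: (attribute value, population of subgroup A, population of subgroup B).
AREAS = [(1.0, 10, 0), (2.0, 8, 2), (6.0, 1, 9), (7.0, 0, 10), (12.0, 5, 5)]


def contiguous_partitions(n, k):
    """Yield every way to split indices 0..n-1 into k consecutive runs."""
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def subgroup_costs(partition):
    """Per-subgroup average squared distance to the region mean."""
    totals, pops = [0.0, 0.0], [0, 0]
    for region in partition:
        mean = sum(AREAS[i][0] for i in region) / len(region)
        for i in region:
            dist = (AREAS[i][0] - mean) ** 2
            for g in (0, 1):
                totals[g] += dist * AREAS[i][1 + g]
                pops[g] += AREAS[i][1 + g]
    return [totals[g] / pops[g] for g in (0, 1)]


def fairest_partition(k):
    """Pick the partition whose worst-off subgroup has the lowest cost."""
    return min(
        contiguous_partitions(len(AREAS), k),
        key=lambda p: max(subgroup_costs(p)),
    )
```

A conventional objective would minimize the total cost across everyone, which can leave one subgroup with a much higher cost than another. Scoring each candidate by its worst subgroup cost instead is what the min-max criterion changes.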
Learn more about Yassir’s project in the video presentation below.
Navigating the Wild: Classifying and Detecting Animals in Low-Quality Camera Trap Images
For her Summer Lab project, Elva Lu (University of California, Berkeley) worked with Dr. Kyle Chard and his PhD student Matt Baughman. Camera traps are used to monitor wildlife and have important conservation uses, but the images they capture are often low-quality and difficult for computer image models to analyze. Elva developed an efficient approach to improving the detection and classification of animals in these low-quality camera trap images. She utilized a combination of existing computer vision models and pre-processing techniques to enhance image quality, achieving a significant increase in detection accuracy. By integrating the CLIP vision-language model for zero-shot classification, Elva and her mentors were able to further refine accuracy without the need for extensive manual labeling. The results, developed and tested with data from Wellington, New Zealand, show promise for aiding conservation efforts by making the analysis of camera trap images more efficient and reliable. Elva said, “DSI Summer Lab gave me the opportunity to learn more about object detection models, develop solutions to address challenges of working with low-quality images, and explore my interest in vision-language models. I also enjoyed presenting my findings to different audiences and collaborating with my mentors.”
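The zero-shot step can be pictured as comparing one image embedding against the embeddings of candidate text labels and choosing the closest. The sketch below shows only that comparison, with made-up three-dimensional vectors and hypothetical labels; in practice the vectors would come from CLIP’s image and text encoders, and nothing here reflects Elva’s actual pipeline or data.

```python
# Illustrative sketch only: vectors and labels are invented, not CLIP output.
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_vec, label_vecs, temperature=0.01):
    """Softmax over scaled cosine similarities between image and label vectors."""
    sims = {
        label: cosine(image_vec, vec) / temperature
        for label, vec in label_vecs.items()
    }
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical text-label embeddings and one hypothetical image embedding.
LABEL_VECS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a possum": [0.1, 0.9, 0.1],
    "a photo of a bird": [0.0, 0.2, 0.9],
}
IMAGE_VEC = [0.2, 0.8, 0.1]

probs = zero_shot_classify(IMAGE_VEC, LABEL_VECS)
prediction = max(probs, key=probs.get)
```

Because the labels are plain text, new species can be added by writing a new phrase rather than collecting and annotating new training images, which is why this approach reduces the need for manual labeling.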
Learn more about Elva’s project in the video presentation below.