Li Ma (BA ’06, MS ’06) returns to UChicago as Professor of Statistics and Data Science.

When Li Ma (BA ’06, MS ’06) was a student at the University of Chicago in the early 2000s, data generation and collection technologies were entering a new era: the Human Genome Project had just completed its ambitious goal of sequencing 3 billion DNA base pairs, opening new frontiers in bioinformatics and enabling genome-wide association studies that correlated genetic variants with disease risk. Meanwhile, companies were starting to analyze customer behavior through search patterns and clicks, and the term “big data” was just emerging from industry conferences into academic discourse. Ma witnessed the boundaries of the field expanding in real time.

“Then internet companies like Google and Facebook took off, and suddenly everyone was talking about internet user data,” Ma recalls. “If you were interested in statistics, it was a very exciting time—everything was getting bigger and more complex.”

But with data’s increasing complexity came the proliferation of analytical strategies to handle it. As Ma pursued his PhD in Statistics at Stanford University and joined the faculty of Duke University, his research focused on understanding how sound statistical principles could guide the development of methods for massive, messy, modern datasets.

Today, the data landscape continues to undergo massive shifts, this time driven by generative AI, but the same fundamental question remains. Now, more than two decades after first arriving on campus in 2002, Ma is back at UChicago as a Professor of Statistics and Data Science to continue his research.

The Making of a Modeler

Ma, who describes himself as a data modeler, sees modeling as encompassing challenges throughout the data analysis life cycle: expressing data generation mechanisms using mathematical and probabilistic language, building the computational algorithms to learn from the data, quantifying uncertainty in the results, understanding theoretical properties, and ultimately providing guidance to researchers making decisions based on these models, both for the design of their study and for the analysis of their data.

Over the years, he’s worked across multiple areas of statistical methodology including tree- and partition-based models, ensemble methods, mixture models, and hierarchical frameworks. But he sees them as connected pieces of a larger puzzle about how we represent and reason about uncertainty.

His approach reflects the influence of mentors who showed him how powerful theory can be when it meets real-world problems. As a senior and concurrent master’s student at UChicago, Ma worked with Michael Stein, whose groundbreaking research in spatial statistics was always driven by applications involving real spatial data. Later, his doctoral advisor, Wing Hung Wong (also a former UChicago professor), immersed him in biomedical data problems that Ma has continued to work on. Ma was particularly inspired by Wong’s practice of developing elegant theoretical formulations grounded in applied contexts, which then translate into effective data analytical solutions for domain practitioners.

“I learned over the years that it’s when we work with actual data that we discover what tools work, or not, and where we need other strategies,” Ma explains. “It gives me a better sense of priorities in setting my research agenda.”

Now, as models and algorithms become more powerful and flexible, capable of capturing increasingly complex patterns, a challenge emerges: How do we balance their impressive capabilities with the need for interpretability, reliability, and robust decision-making?

Design-Aware Assessment

Recently, Ma’s focus has shifted toward a question that grows increasingly urgent as AI is integrated into scientific research. Most generative models are trained to produce outputs that resemble their training data. But merely “looking real” is not enough in scientific applications; synthetic data must actually capture the key experimental features of real scientific studies.

For example, Ma’s own work in modeling and analyzing microbial communities, which began in 2015, grappled with just how complex these biological realities are: the human body harbors trillions of microbes with intricate evolutionary relationships, enormous variation across samples, and many rare species. If a researcher were to generate synthetic microbiome composition data, it’s not sufficient for that data to simply look like realistic microbiome samples at the individual-sample level. A synthetic dataset evaluated only with context-agnostic metrics could lead researchers badly astray.

In order to draw valid scientific conclusions, the data must reflect the full distribution of the population of interest and offer reliable ways to measure uncertainty. Achieving both requires incorporating assumptions that reflect the underlying biological structure and study-design constraints.

“You need to know which sources of variation matter the most for your scientific question, and model those sources accurately,” Ma says. “Accordingly, you need to assess a model in the context of the biological constraints it’s meant to emulate.” This philosophy reflects a principle long recognized in statistics: that reliable scientific conclusions depend critically on properly accounting for the data generation mechanism and experimental design. Applying this philosophy to synthetic data generated by modern generative models, in what Ma calls “design-aware assessment,” requires new thinking.

“Every model, no matter how large or small, makes assumptions,” Ma explains. “Making and recognizing the appropriate assumptions is key to balancing the generalizability, scalability, interpretability, and robustness of the resulting models, algorithms, and downstream analyses. How to achieve this balance when models become massive and highly complex is one of the central questions I aim to address.”

His approach leverages multi-scale techniques, nonparametric models, and deep learning (methods that can capture complex patterns in massive datasets while preserving mathematical rigor) to identify where and how synthetic data deviate from real data in ways that matter for a given scientific question. Some of Ma’s recent efforts, for example, have focused on developing ways to compare distributions of characteristics through density ratios, allowing researchers to pinpoint whether a generative model preserves the key relationships they’re examining, whether in a microbiome community or in cell populations. [1,2]
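To give a flavor of the density-ratio idea, here is a minimal, hypothetical sketch, not drawn from Ma’s papers [1,2]: train a probabilistic classifier to distinguish real from synthetic samples, then use its predicted odds as an estimate of the density ratio between the two distributions. The toy Gaussian data, the logistic-regression choice, and all variable names below are illustrative assumptions.

```python
# Illustrative sketch: classifier-based density-ratio estimation
# between "real" and "synthetic" samples (toy data, for intuition only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: synthetic samples have a shifted mean relative to real ones.
real = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
synth = rng.normal(loc=0.5, scale=1.0, size=(500, 2))

# Label real = 1, synthetic = 0, and fit a probabilistic classifier.
X = np.vstack([real, synth])
y = np.concatenate([np.ones(len(real)), np.zeros(len(synth))])
clf = LogisticRegression().fit(X, y)

# With equal class sizes, the odds p/(1-p) estimate the density ratio
# p_real(x) / p_synth(x); values far from 1 flag regions where the
# synthetic distribution deviates from the real one.
p = clf.predict_proba(X)[:, 1]
ratio = p / (1 - p)
```

Regions where the estimated ratio departs sharply from 1 are exactly where the synthetic distribution fails to match the real one; a design-aware assessment would apply such comparisons to the features and strata that the study design makes scientifically relevant, rather than to generic summaries.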

As synthetic data becomes more common in research—from training diagnostic AI to augmenting datasets in fields where data is scarce or sensitive—design-aware assessment offers rigorous methods to ensure it serves rather than undermines research.

The UChicago Chapter

Now, at UChicago, in addition to continuing his research in the biomedical space, Ma is keen to explore how statistical principles apply across domains. “The Data Science Institute provides exciting new opportunities for me to talk to experts from other fields who encounter data challenges. I’m excited about sitting with colleagues from computer science and other departments,” he says. “We’ve found we share a lot of mutual interests—like assessing the quality of different generative models and algorithms—but we have very different perspectives that complement each other.”

“The greatest statisticians have always drawn motivation from the pressing data challenges of their time,” Ma reflects. “Today, with the rise of generative AI and emerging data technologies, new challenges continue to arise, and ensuring we meet them with rigor has never been more important. It’s an exciting time to be a statistician.”

[1] Awaya, Naoki, Yuliang Xu, and Li Ma. “Two-sample comparison through additive tree models for density ratios.” arXiv preprint arXiv:2508.03059 (2025).

[2] Xu, Yuliang, Yun Wei, and Li Ma. “Distributional Evaluation of Generative Models via Relative Density Ratio.” arXiv preprint arXiv:2510.25507 (2025).
