  • Description

    Workshop Dates: November 14–15, 2022

    Workshop Agenda (as of September 2022)

    The Rising Stars in Data Science workshop at the University of Chicago focuses on celebrating and fast-tracking the careers of exceptional data scientists at a critical inflection point in their careers: the transition to a postdoctoral scholar, research scientist, industry research, or tenure-track position. The workshop will showcase the exciting, innovative data science initiatives at UChicago and give PhD students and postdocs the opportunity to plug into these networks, platforms, and opportunities. The workshop also aims to increase representation and diversity in data science by providing a platform and a supportive mentoring network for navigating academic careers in data science. Women and underrepresented minorities in computing are especially encouraged to apply.

    The two-day research workshop will feature career and research panels, networking and mentoring opportunities, and research talks from the Rising Stars (View the 2021 agenda). Participants will gain insights from faculty panels on career development questions such as: how to start your academic career in data science; how to strategically sustain your career through research collaborations, publications, and skill development; and how to form meaningful interdisciplinary collaborations in data science with industry and government partners. Participants will also hear inspiring keynote talks from established, cutting-edge leaders in data science.

    Eligibility & Guidelines: 

    If you have any questions about your eligibility, please feel free to send an email to data-science@uchicago.edu.

    • Applicants must be full time graduate students within ~1-2 years of obtaining a PhD, or a current postdoctoral scholar, fellow, or researcher.
    • We welcome applicants from a wide variety of fields and backgrounds: any eligible PhD or postdoc who is engaging in rigorous, data-driven inquiry is encouraged to apply.
    • Applicants both from and outside of the University of Chicago are encouraged to apply.
    • Applicants may only submit one application.
    • Applicants may have nominations from a maximum of 2 faculty members or advisors.

    Workshop Format

    • Rising Star research talks
    • Panels (career development, data science research)
    • Keynote address
    • 1:1 meetings with faculty members
    • Networking within the UChicago data science ecosystem

    This convening is open to all invitees who are compliant with UChicago vaccination requirements. Because of ongoing health risks, particularly to the unvaccinated, participants are expected to adopt the risk mitigation measures (masking, social distancing, etc.) appropriate to their vaccination status, as advised by public health officials, or to their individual vulnerabilities, as advised by a medical professional. Public convening may not be safe for all and carries a risk of contracting COVID-19, particularly for those who are unvaccinated. Participants will not know the vaccination status of others and should follow appropriate risk mitigation measures.

    Campus Visitor Information & COVID Policy

  • Rising Stars Profiles

    See profiles of alumni Rising Stars here.

    Rising Stars Autumn 2022 Cohort

    Bio: Adityanarayanan (Adit) Radhakrishnan is a Ph.D. candidate in the Electrical Engineering and Computer Science (EECS) department at MIT, advised by Caroline Uhler. He is also a Ph.D. fellow at the Eric and Wendy Schmidt Center of the Broad Institute of Harvard and MIT. Prior to his Ph.D., Adit completed his master’s in EECS and bachelor’s degrees in EECS and mathematics at MIT. Adit’s research focuses on advancing the theoretical understanding of over-parameterized neural networks in order to develop simple and effective methods for applications in healthcare and computational biology.

    Talk Title: Wide and Deep Networks Achieve Consistency in Classification

    Talk Abstract: While neural networks are used for classification tasks across domains, a long-standing open problem in machine learning is determining whether neural networks trained using standard procedures are consistent for classification, i.e., whether such models minimize the probability of misclassification for arbitrary data distributions. In this work, we identify and construct an explicit set of neural network classifiers that are consistent. Since effective neural networks in practice are typically both wide and deep, we analyze infinitely wide networks that are also infinitely deep. In particular, using the recent connection between infinitely wide neural networks and Neural Tangent Kernels, we provide explicit activation functions that can be used to construct networks that achieve consistency. Interestingly, these activation functions are simple and easy to implement, yet differ from commonly used activations such as ReLU or sigmoid. More generally, we create a taxonomy of infinitely wide and deep networks and show that these models implement one of three well-known classifiers depending on the activation function used: (1) 1-nearest neighbor (model predictions are given by the label of the nearest training example); (2) majority vote (model predictions are given by the label of the class with greatest representation in the training set); or (3) singular kernel classifiers (a set of classifiers containing those that achieve consistency). Our results highlight the benefit of using deep networks for classification tasks, in contrast to regression tasks, where excessive depth is harmful.
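    The first two cases of the taxonomy above correspond to classical classifiers that are easy to state concretely. As an illustration only (this is not the paper's network construction), a minimal numpy sketch of the 1-nearest-neighbor and majority-vote predictors, on made-up data:

```python
import numpy as np

def one_nearest_neighbor(X_train, y_train, x):
    """Case (1): predict the label of the closest training example."""
    idx = int(np.argmin(np.linalg.norm(X_train - x, axis=1)))
    return int(y_train[idx])

def majority_vote(y_train):
    """Case (2): predict the most frequent training label, ignoring the input."""
    labels, counts = np.unique(y_train, return_counts=True)
    return int(labels[np.argmax(counts)])

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0, 1, 1])
print(one_nearest_neighbor(X, y, np.array([0.2, 0.1])))  # nearest point is (0, 0) -> 0
print(majority_vote(y))                                  # label 1 appears twice -> 1
```

    Case (3), the singular kernel classifiers containing the consistent models, is the technical contribution of the work and has no comparably simple closed form.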

    Bio: Amanda Coston is a PhD student in Machine Learning and Public Policy at Carnegie Mellon University. Her research investigates how to make algorithmic decision-making more reliable and more equitable using causal inference and machine learning. She is advised by Alexandra Chouldechova and Edward H. Kennedy. She completed a B.S.E. from Princeton, where she was advised by Robert Schapire. Prior to her PhD, she worked at Microsoft, the consultancy Teneo, and the Nairobi-based startup HiviSasa. She is a Meta Research PhD Fellow (2022), K & L Gates Presidential Fellow in Ethics and Computational Technologies (2020), and NSF GRFP Fellow (2019).

    Talk Title: Validity, equity, and oversight in societally consequential machine learning

    Talk Abstract: Machine learning algorithms are widely used for decision-making in societally high-stakes settings, from child welfare and criminal justice to healthcare and consumer lending. Recent history has illuminated numerous examples where these algorithms proved unreliable or inequitable. We take a principled approach to the use of machine learning in societally high-stakes settings, guided by three pillars: validity, equity, and oversight. We address data problems that challenge the validity of algorithmic decision support systems by developing methods for algorithmic risk assessments that account for selection bias, confounding, and bandit feedback. We conduct audits for bias throughout the systems in which algorithms are used to inform decision-making. Throughout, we propose novel methods that use doubly robust techniques for bias correction. We apply our methods to the consumer lending context, where we find that our methods could extend lending opportunities to applicants from historically underbanked regions without an increase in the average rate of default. We present additional findings in the child welfare, criminal justice, and healthcare settings.

    Bio: Anushri Dixit is a Ph.D. candidate at California Institute of Technology in Control and Dynamical Systems and is advised by Prof. Joel Burdick. Her research focuses on motion planning and control of robots in extreme terrain while accounting for uncertainty in a principled manner. Her work on risk-aware methodologies for planning has been deployed on various robotic platforms as a part of the Team CoSTAR’s effort in the DARPA Subterranean Challenge. She has received the DE Shaw Zenith Fellowship and was selected as a rising star at the Southern California Robotics Symposium. Prior to her Ph.D., she earned her B.S. in Electrical Engineering from Georgia Institute of Technology in 2017.

    Talk Title: Towards Risk-Aware Robotic Autonomy in Extreme Environments: Robustness to Distribution Shifts in Control and Planning

    Talk Abstract: Robots provide the crucial ability to replace humans in environments that are inaccessible due to hazards, such as in search and rescue operations. They need to be able to reason about the risk in an environment that is perceptually degraded and complete tasks while maintaining safety. Providing safety and performance guarantees for motion planning and control algorithms is a well-studied problem for robotic systems with well-known dynamics that operate in structured environments. However, when robots operate in a real-world setting where the environment is dynamic and unstructured, common distributional assumptions used to develop the planning algorithms are no longer valid and, consequently, the safety guarantees no longer hold. In this talk, I will focus on risk-aware methodologies for robotic autonomy in unstructured environments. I will present data-driven motion planning techniques that account for uncertainty in dynamic environments and sample-based risk bounds that provide high-confidence verification statements for robotic systems. The goal of the talk is to understand how robots interpret the ambiguity in their environment and how to generate policies that better account for this uncertainty.

    Bio: Blair Bilodeau is a graduating PhD candidate in statistical sciences at the University of Toronto, advised by Daniel Roy. His PhD was funded by an NSERC Doctoral Canada Graduate Scholarship and the Vector Institute. Blair’s research area is broadly the theoretical foundations of statistical decision making, with applications to both statistical methodology and machine learning algorithms. His contributions span fundamental problems in inference and prediction, including: adaptive sequential decision making, minimax conditional density estimation, and numerical quadrature for statistical inference. Previously, Blair completed his BSc at Western University.

    Talk Title: Adaptively Exploiting d-Separators with Causal Bandits

    Talk Abstract: Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d- separator), recent “causal bandit” algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
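    For readers unfamiliar with the regret framework the abstract builds on, here is a minimal sketch of a standard (non-causal) multi-armed bandit with epsilon-greedy exploration and cumulative regret. The arm means and epsilon are made up, and this is not the adaptive causal algorithm from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli arm means
T = 5000
counts = np.zeros(3)
values = np.zeros(3)                # running estimate of each arm's mean
regret = 0.0

for _ in range(T):
    if rng.random() < 0.1:                  # explore with probability 0.1
        arm = int(rng.integers(3))
    else:                                   # otherwise exploit the best estimate
        arm = int(np.argmax(values))
    reward = float(rng.random() < means[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
    regret += means.max() - means[arm]      # cumulative regret vs. the best arm

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

    A causal bandit algorithm would additionally observe post-intervention variables each round; the adaptivity result in the talk concerns matching the better of the causal and classical regret rates without knowing which setting holds.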

    Bio: Chan Park is a fifth-year PhD student at the Language Technologies Institute at Carnegie Mellon University. She is advised by Professor Yulia Tsvetkov and is currently a visiting PhD student at the University of Washington. She is a recipient of a KFAS Overseas PhD Scholarship. Her research interests lie at the intersection of natural language processing and computational social science, and her work has been published at various top-tier conferences and journals including PNAS, EMNLP, ICWSM, and WWW. She has collaborated with the Washington Post and the Data for Black Lives organization to apply her research to real-world problems.

    Talk Title: Challenges in Opinion Manipulation Detection: An Examination of Wartime Russian Media

    Talk Abstract: Information warfare has been at the forefront of the 2022 Russia-Ukraine war, as all sides attempt to shape online narratives. NLP could be valuable for combating manipulation, but most prior work focuses on detecting extreme manipulation strategies like fake news and relies on supervised approaches with pre-annotated data, which is unavailable in emerging crises. We release a new dataset, VoynaSlov, containing 38M+ social media posts from Russian media and their public responses. We characterize VoynaSlov along three dimensions: media ownership (state-affiliated, independent), platform (Twitter, VKontakte), and time (pre-war, during-war). Drawing from political communication, we investigate subtle manipulation strategies by applying state-of-the-art NLP models to quantify agenda-setting, framing, and priming. Our methodology includes word frequencies, topic modeling, supervised classification, and consideration of metadata such as likes or shares. Each technique reveals interesting, yet sometimes conflicting, insights; for example, independent outlets are more likely to use war-related words, but state-affiliated outlets discuss war-related topics more. Throughout the talk, we discuss numerous limitations: topic model interpretability and instability, unclear framing typologies and domain-specificity, and issues surrounding validity and misuse for priming. Our dataset, initial investigation, and discussion of open technical challenges can facilitate the identification and mitigation of information manipulation tactics in ongoing crisis situations.
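    As a toy illustration of the word-frequency technique mentioned above (with made-up snippets, not the VoynaSlov data), one can compare how often a target word appears across two sets of posts:

```python
from collections import Counter

# Hypothetical snippets standing in for state-affiliated vs. independent outlet posts.
state_posts = ["special operation continues", "operation in the region continues"]
indep_posts = ["war in ukraine", "the war continues", "protests against the war"]

def freq(posts, word):
    """Relative frequency of `word` across all tokens in `posts`."""
    tokens = " ".join(posts).split()
    return Counter(tokens)[word] / len(tokens)

print(freq(indep_posts, "war"), freq(state_posts, "war"))  # 0.3 0.0
```

    The real analysis in the talk combines such frequency comparisons with topic modeling and supervised classifiers, precisely because any single measure can yield the conflicting signals the abstract describes.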

    Bio: Chao Ma is a Szego Assistant Professor in the Department of Mathematics at Stanford University. His research focuses on building mathematical theories for modern machine learning methods. His recent work provides an understanding of the implicit regularization effect of various optimization algorithms for neural networks by studying the interaction between the algorithms and the network structures. Before joining Stanford, Chao obtained his PhD from the Program in Applied and Computational Mathematics at Princeton University, advised by Prof. Weinan E. His thesis mathematically analyzed the approximation and generalization capacity of neural networks.

    Talk Title: Implicit bias of optimization algorithms for neural networks and their effects on generalization

    Talk Abstract: Modern neural networks are usually over-parameterized—the number of parameters exceeds the number of training data points. In this case, the loss functions tend to have many (or even infinitely many) global minima, which imposes on optimization algorithms an additional challenge of minima selection beyond convergence. Specifically, when training a neural network, the algorithm not only has to find a global minimum, but also needs to select minima with good generalization from among many bad ones. In this talk, I will share a series of works studying the mechanisms that facilitate the global minima selection of optimization algorithms. First, with a linear stability theory, we show that stochastic gradient descent (SGD) favors flat and uniform global minima. Then, we build a theoretical connection between flatness and generalization performance based on a common structure of neural networks. Next, we study the global minima selection dynamics—the process by which an optimizer leaves bad minima for good ones—in two settings. For a manifold of minima around which the loss function grows quadratically, we derive effective exploration dynamics on the manifold for SGD and Adam using a quasistatic approach. For a manifold of minima around which the loss function grows subquadratically, we study the behavior and effective dynamics of GD, which also explains the edge-of-stability phenomenon.

    Bio: I am a philosopher of science and a computational researcher. My research agenda is broad: it advances both our theoretical and empirical understanding of how emerging technologies (particularly AI) affect the way we do science, and investigates how we can both study and use computational methods to better understand the joint processes of discovery and justification. I am currently a joint PhD Candidate in the Department of Philosophy and the Committee on the Conceptual and Historical Studies of Science at the University of Chicago. I am a Fellow in the Pritzker School of Molecular Engineering AI-Enabled Molecular Engineering of Materials and Systems for Sustainability program, an Affiliated Researcher at Globus Labs, and a member of the KnowledgeLab.

    Talk Title: Digital Doubles of Everything (Even You)

    Talk Abstract: Scientists have long made and interrogated doubles of reality. Often developed to describe narrow aspects of otherwise immense systems, traditional models have always been relatively poor substitutes for the bewildering tangle of complexity that pervades the world. However, the rise of AI allows scientists to generate and interrogate increasingly sophisticated digitized captures of entire systems of interest. These “digital doubles” represent everything from system I/O to complete characterizations of internal mechanisms and processes. In this project, we argue that the opportunity with doubles of many such systems is not just expedient exploration or accurate prediction. Instead, it is to combine them, represent interactions between them faithfully, and even enable them to interrogate and explore one another. We probe the possibility of developing digital doubles of dynamic systems of knowledge, values, and commitments, as well as using these doubles to operationalize and automate a logic of anomaly detection and correction, not only in our data but in our theoretical and normative, non-epistemic commitments.

    Bio: Gemma Moran is a postdoctoral research scientist at the Data Science Institute at Columbia University, mentored by David Blei. She received her PhD in Statistics from the University of Pennsylvania, advised by Edward George and Veronika Rockova. Her research is on probabilistic machine learning and Bayesian inference. She aims to develop interpretable and reliable methods for large-scale data that help researchers gain scientific insight and guide their decision-making.

    Talk Title: Identifiable deep generative models via sparse decoding

    Talk Abstract: We develop the sparse VAE for unsupervised representation learning on high-dimensional data. The sparse VAE learns a set of latent factors (representations) which summarize the associations in the observed data features. The underlying model is sparse in that each observed feature (i.e., each dimension of the data) depends on a small subset of the latent factors. As examples: in ratings data, each movie is described by only a few genres; in text data, each word is applicable to only a few topics; in genomics, each gene is active in only a few biological processes. We prove such sparse deep generative models are identifiable: with infinite data, the true model parameters can be learned. (In contrast, most deep generative models are not identifiable.) We empirically study the sparse VAE with both simulated and real data. We find that it recovers meaningful latent factors and has smaller held-out reconstruction error than related methods.
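    The sparsity structure (each observed feature depending on only a few latent factors) can be pictured with a toy masked linear decoder. This is a simplification with a made-up mask, not the sparse VAE itself, which uses neural network decoders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each of 6 observed features (rows) loads on at most 2 of 3 latent factors,
# e.g., each movie belongs to only a few genres.
mask = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
])
W = rng.normal(size=mask.shape) * mask   # sparse decoder weights

z = rng.normal(size=3)                   # latent factors for one data point
x = W @ z + 0.01 * rng.normal(size=6)    # observed features (decoder output + noise)

print(x.shape)  # (6,)
```

    The identifiability result in the abstract says, roughly, that with this kind of per-feature sparsity the factors and loadings can be recovered from data, whereas a dense decoder admits many equivalent parameterizations.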

    Bio: Harlin Lee is a Hedrick Assistant Adjunct Professor at UCLA Mathematics. She received her PhD in Electrical and Computer Engineering at Carnegie Mellon University in 2021. She also has a MS in Machine Learning from Carnegie Mellon University, and a BS + MEng in Electrical Engineering and Computer Science from MIT. Her research is on learning from high-dimensional data supported on structures such as graphs (networks), low-dimensional subspace, or sparsity, motivated by applications in healthcare and social science. Harlin’s lifelong vision is to use data theory to help everyone live physically, mentally and socially healthier.

    Talk Title: Understanding scientific fields via network analysis and topic modeling

    Talk Abstract: As scientific disciplines get larger and more complex, it becomes impossible for an individual researcher to be familiar with the entire body of literature. This forces them to specialize in a sub-field, and unfortunately such insulation can hinder the birth of ideas that arise from new connections, eventually slowing down scientific progress. As such, discovering fruitful interdisciplinary connections by analyzing scientific publications is an important problem in the science of science. This talk will present several past and ongoing projects towards answering that question using tools from network analysis and topic modeling: 1) a dynamic-embedding-based method for link prediction in a machine learning/AI semantic network, 2) finding communities in cognitive science that study similar topics but do not cite each other or publish in the same venues, and 3) developing theoretically grounded hypergraph embedding methods to capture surprising collaborations or missed opportunities.
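    A common baseline for the link-prediction task mentioned in (1) is common-neighbor scoring. The sketch below uses a made-up toy graph and is not the dynamic-embedding method from the talk:

```python
from itertools import combinations

# Toy undirected graph: hypothetical concepts that co-occur in the literature.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

def common_neighbors(u, v):
    return len(graph[u] & graph[v])

# Score every non-edge; the highest-scoring pair is the predicted future link.
non_edges = [(u, v) for u, v in combinations(sorted(graph), 2) if v not in graph[u]]
scores = {(u, v): common_neighbors(u, v) for u, v in non_edges}
predicted = max(scores, key=scores.get)
print(predicted)  # ('A', 'D'): they share neighbors B and C
```

    Embedding-based methods generalize this idea by scoring node pairs in a learned vector space, which is what allows them to surface connections between communities that never cite each other directly.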

    Bio: I am a fourth year PhD candidate in Physics at the University of Pennsylvania, advised by Dr. Masao Sako. I work with type Ia supernovae as cosmological probes as part of the Dark Energy Survey Collaboration as well as the Nancy Grace Roman Space Telescope supernova science investigation team. I am particularly interested in applications of machine learning and other modern data science techniques to large-scale astronomical survey data.

    Talk Title: A Convolutional Neural Network Approach to Supernova Classification

    Talk Abstract: Among the brightest objects in the universe, supernovae are powerful explosions marking the end of a star’s lifetime. Supernova type is defined by spectroscopic emission lines, but obtaining spectroscopy is often logistically unfeasible. Thus, the ability to identify supernovae by type using time-series image data alone is crucial, especially in light of the increasing breadth and depth of upcoming telescopes. We present a convolutional neural network method for fast supernova time-series classification, with observed brightness data smoothed in both the wavelength and time directions with Gaussian process regression. We apply this method to full-duration as well as truncated supernova time series, to simulate retrospective as well as real-time classification performance. Retrospective classification is used to differentiate cosmologically useful Type Ia supernovae from other supernova types, and this method achieves >99% accuracy on this task. We are also able to differentiate between six supernova types with 60% accuracy given only two nights of data, and with 98% accuracy retrospectively.
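    The Gaussian process smoothing step can be sketched in one dimension (time only; the actual method smooths jointly in wavelength and time, and the kernel, noise level, and data here are all made up):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of times."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

rng = np.random.default_rng(42)
t_obs = np.linspace(0, 10, 15)                             # observation times
flux = np.sin(t_obs) + 0.1 * rng.normal(size=t_obs.size)   # noisy "brightness"

t_grid = np.linspace(0, 10, 50)                            # regular grid for a CNN input
K = rbf_kernel(t_obs, t_obs) + 0.1 ** 2 * np.eye(t_obs.size)  # noise on the diagonal
K_star = rbf_kernel(t_grid, t_obs)
flux_smooth = K_star @ np.linalg.solve(K, flux)            # GP posterior mean on the grid

print(flux_smooth.shape)  # (50,)
```

    Resampling irregular observations onto a regular grid like this is what makes the light curves usable as fixed-size CNN inputs.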

    Bio: Jian Kang is a final-year Ph.D. candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He received his M.C.S. degree from the University of Virginia in 2016 and his B.Eng. degree from Beijing University of Posts and Telecommunications in 2014. His research interests lie in trustworthy learning and mining on graphs. His work on related topics has been published at several major conferences and journals in data mining and artificial intelligence (e.g., KDD, WWW, CIKM, TKDE, TVCG). He is the recipient of the Mavis Future Faculty Fellowship and three reviewer awards (ICML’20, ICLR’21, CIKM’21).

    Talk Title: Algorithmic Foundation of Fair Graph Mining

    Talk Abstract: Graphs are ubiquitous in many real-world applications. To date, researchers have developed a plethora of theories, algorithms, and systems that are successful in answering what and who questions on graphs. Despite the remarkable progress, unfairness often occurs in many graph mining algorithms, hindering their deployment in high-stakes applications. To make the graph mining process and its results fair, a paradigm shift is crucial: from answering what and who to understanding how and why. Four desired properties (i.e., utility, fairness, transparency, and robustness) are called for in order to build an algorithmic foundation of fair graph mining. In this talk, I will present our efforts in building the algorithmic foundation of fair graph mining by addressing the tensions among these desired properties. First, we present a generic algorithmic framework to understand how the graph mining results relate to the input graph. Second, we systematically study individual fairness in graph mining, including its measurement, mitigation strategies, and cost. Finally, we analyze the root cause of degree-related unfairness in graph neural networks and present a family of debiasing algorithms, followed by my thoughts on future work.

    Bio: Jingyan Wang is a Ronald J. and Carol T. Beerman President’s Postdoctoral Fellow in the H. Milton Stewart School of Industrial and Systems Engineering at Georgia Institute of Technology. Her research interests lie in improving high-stakes decision-making processes such as hiring and admissions, using tools from statistics and machine learning. She is the recipient of the Best Student Paper Award at AAMAS 2019. She received her Ph.D. from the School of Computer Science at Carnegie Mellon University in 2021, advised by Nihar Shah. She received her B.S. in Electrical Engineering and Computer Sciences with a minor in Mathematics from the University of California, Berkeley in 2015.

    Talk Title: Modeling and Correcting Bias in Sequential Evaluation

    Talk Abstract: We consider the problem of sequential evaluation in applications such as hiring and competitions, in which an evaluator observes candidates in a sequence and assigns scores to these candidates in an online, irrevocable fashion. Motivated by the psychology literature that has studied sequential bias in such settings — namely, dependencies between the evaluation outcome and the order in which the candidates appear — we propose a natural model for the evaluator’s rating process that captures the lack of calibration inherent to such a task. We conduct crowdsourcing experiments to demonstrate various facets of our model. We then proceed to study how to correct sequential bias under our model by posing this as a statistical inference problem. We propose a near-linear time, online algorithm for this task and prove guarantees in terms of two canonical ranking metrics. We also prove that our algorithm is information theoretically optimal, by establishing matching lower bounds in both metrics. Finally, we show that our algorithm significantly outperforms the de facto method of naively using the rankings induced by the reported scores.
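    A toy simulation of the problem setup (not the paper's correction algorithm; the drift model here is made up) shows how an order-dependent miscalibration perturbs the ranking induced by reported scores:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20
true_quality = rng.uniform(0, 1, size=n)   # candidates in arrival order

# Hypothetical sequential bias: the evaluator starts harsh and drifts lenient,
# so later candidates receive inflated scores regardless of quality.
drift = np.linspace(-0.2, 0.2, n)
reported = true_quality + drift

# Count pairs whose order flips between true quality and reported score.
flips = sum(
    (reported[i] - reported[j]) * (true_quality[i] - true_quality[j]) < 0
    for i in range(n)
    for j in range(i + 1, n)
)
print(flips > 0)  # True: ranking by reported scores misorders some pairs
```

    The de facto method the abstract refers to is exactly this naive ranking by reported scores; the proposed algorithm instead estimates and corrects the order-dependent component.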

    Bio: Jonatas Marques is a research assistant at the Federal University of Rio Grande do Sul (UFRGS, Brazil), working at the intersection of network management and artificial intelligence. He earned a PhD (2022, with honors) in Computer Science from UFRGS, advised by Prof. Luciano Paschoal Gaspary. Jonatas also spent a year (Sep 2019 – Aug 2020) as a research scholar at the University of Illinois at Urbana-Champaign, working with Prof. Kirill Levchenko. His research helps advance network management in the programmable network era. He is the recipient of Best Paper Awards at IFIP/IEEE IM (2019) and SBC SBRC (2018).

    Talk Title: Advancing Network Monitoring and Operation with In-band Telemetry and Data Plane Programmability

    Talk Abstract: Modern communication networks operate under high expectations for performance and resilience (e.g., latency, bandwidth, availability), mainly due to the continuous proliferation of non-elastic, highly distributed applications. In this context, closely monitoring the state, behavior, and performance of networking devices and their traffic, as well as quickly troubleshooting problems as they arise, is essential to the operation of network infrastructures. In this talk, I will describe monitoring opportunities and challenges in the era of programmable data planes. I will present past and current work on advancing how we monitor networks and their traffic to improve network management practices. First, I will present IntSight, a system for highly accurate and fine-grained detection and diagnosis of SLO violations based on path-wise in-band telemetry. Second, I will present Euclid, a fully in-network, fine-grained, low-footprint, and low-delay traffic analysis mechanism for DDoS attack detection and mitigation. Finally, I will present ongoing work exploring in-network computation for running machine learning models directly in the data plane.

    Bio: Laixi Shi is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Carnegie Mellon University, advised by Prof. Yuejie Chi. She has also interned at the Google Research Brain Team and Mitsubishi Electric Research Laboratories. Her research interests include reinforcement learning (RL), non-convex optimization, high-dimensional statistical estimation, and signal processing, ranging from theory to applications. Her current research focuses on 1) theoretical work: designing provably sample-efficient algorithms for value-based RL, offline RL, and robust RL problems, drawing on optimization and statistics; and 2) practical work: deep reinforcement learning (DRL) algorithms for large-scale problems such as Atari games and web navigation.

    Talk Title: Provable Algorithms for Reinforcement Learning: Efficiency and Robustness

    Talk Abstract: Reinforcement learning (RL) has recently achieved remarkable success in dealing with unknown environments and maximizing long-term cumulative rewards. Contemporary reinforcement learning has to deal with unknown environments of unprecedentedly large dimensionality. In order to fundamentally understand and improve the effectiveness of RL algorithms in high dimensions, a recent body of work has sought to develop a finite-sample theoretical framework to analyze the algorithms of interest. My work belongs to this line, delineating the dependency of algorithm performance on salient problem parameters in a non-asymptotic fashion across different RL settings. Besides maximizing the expected total reward, an equally important goal for an RL agent is safety and robustness, especially in high-stakes applications such as robotics, autonomous driving, clinical trials, and financial investments. Inspired by this robustness requirement, in this talk we introduce our work on distributionally robust RL with the Kullback-Leibler (KL) divergence as the distance measure: the problem of learning an optimal policy that is robust to future environment perturbations. I will introduce our proposed robust RL algorithm, which, to the best of our knowledge, is the first near-optimal algorithm for this problem, matching the minimax lower bound up to a polynomial factor of the effective horizon length.

    Bio: Linyi Li is a fifth-year Ph.D. student in the Computer Science Department of the University of Illinois Urbana-Champaign, advised by Prof. Bo Li and Prof. Tao Xie. Linyi’s research lies at the intersection of machine learning and computer security. Recently, he has focused on building certifiably trustworthy deep learning systems at scale, achieving state-of-the-art certifiable robustness against noise perturbations, semantic perturbations, poisoning attacks, and distributional shift, as well as state-of-the-art certifiable fairness. His research has been published at top-tier deep learning and computer security conferences, including ICML, NeurIPS, ICLR, S&P, and CCS. Linyi is a recipient of the AdvML Rising Star Award and the Wing Kai Cheng Fellowship, and a finalist for the Two Sigma PhD Fellowship and the Qualcomm Innovation Fellowship. He received his bachelor’s degree in computer science with distinction from Tsinghua University in 2018.

    Talk Title: Enabling Large-Scale Certifiable Deep Learning towards Trustworthy Machine Learning

    Talk Abstract: Given rising societal concerns about the safety and ethics of deployed deep learning systems, designing certifiable large-scale deep learning systems that meet real-world requirements is urgently needed. This talk will introduce our series of work on building certifiable large-scale deep learning systems for trustworthy machine learning: achieving robustness against noise perturbations, semantic transformations, poisoning attacks, and distributional shifts; fairness; and reliability against numerical defects. I will also present applications of these certifiable methods in modern deep reinforcement learning and computer vision systems. These works were recently published at machine learning and security conferences such as ICML, ICLR, NeurIPS, CCS, and S&P. I will then introduce the shared core backbone for designing certifiable deep learning systems, including threat-model-dependent smoothing, efficient and exact model state abstraction, statistical worst-case characterization, and diversity-enhanced model training. These backbone methodologies not only enable us to achieve state-of-the-art certified trustworthiness under multiple existing notions and outperform existing baselines (where they exist) by a large margin, but also point toward a promising direction for achieving certified and “meta-trustworthy” ML. At the end of the talk, I will summarize several challenges in certifiable ML, including scalability, tightness, deployment, the gap between theory and practice, and the societal implications of certifiably trustworthy ML, and conclude with highlighted future directions for research and applications.
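    One of the listed backbones, threat-model-dependent smoothing, can be illustrated with a randomized-smoothing-style certificate. The sketch below uses the standard binary-case bound with a plug-in class probability (rather than a proper confidence lower bound) on a toy classifier; it is a simplified illustration, not the speaker's system.

```python
import numpy as np
from scipy.stats import norm

def certified_radius(f, x, sigma=0.25, n=2000, seed=0):
    """Monte Carlo randomized-smoothing certificate (simplified: a plug-in
    class probability stands in for a confidence lower bound).
    f maps a batch of inputs to integer class labels."""
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(scale=sigma, size=(n, x.size))
    counts = np.bincount(f(noisy), minlength=2)
    top = int(counts.argmax())
    p_a = min(counts[top] / n, 1.0 - 1.0 / n)   # avoid ppf(1) = inf
    if p_a <= 0.5:
        return top, 0.0                          # abstain: no certificate
    return top, sigma * norm.ppf(p_a)            # certified L2 radius

# Toy "classifier": label 1 iff the mean coordinate is positive.
f = lambda batch: (batch.mean(axis=1) > 0).astype(int)
x = np.full(4, 0.1)                              # weakly in class 1
label, radius = certified_radius(f, x)
assert label == 1 and radius > 0.0
```

The smoothed prediction is certified not to change for any L2 perturbation smaller than the returned radius, under the assumptions of the bound.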

    Bio: Lu Gan is currently a Postdoctoral Scholar at the California Institute of Technology, working with Soon-Jo Chung and Yisong Yue. She completed her Ph.D. in robotics at the University of Michigan, where she was co-advised by Ryan Eustice, Jessy Grizzle, and Maani Ghaffari. She is broadly interested in Robotics, Computer Vision, and Machine Learning. Her current research focuses on perception and navigation for robot autonomy in unstructured environments and adverse conditions. Her work has been published at premier conferences and journals in robotics such as ICRA, IROS, BMVC, IEEE RAL and IEEE TRO. More details can be found at https://ganlumomo.github.io/.

    Talk Title: Semantic-Aware Robotic Mapping in Unknown, Loosely Structured Environments

    Talk Abstract: Robotic mapping is the problem of inferring a representation of a robot’s surroundings from noisy measurements as it navigates through an environment. As robotic systems move toward more challenging behaviors in more complex scenarios, they require richer maps so that the robot understands the significance of the scene and the objects within it. This talk focuses on semantic-aware robotic mapping in unknown, loosely structured environments. The first part presents a Bayesian kernel inference semantic mapping framework that formulates a unified probabilistic model for occupancy and semantics and provides a closed-form solution for scalable dense semantic mapping. This framework significantly reduces the computational complexity of learning-based continuous semantic mapping while maintaining high accuracy. Next, I will present a novel and flexible multi-task multi-layer Bayesian mapping framework that provides even richer environmental information. A two-layer robotic map of semantics and traversability is built as a constructive example, and the framework is readily extendable to include more layers as needed. Both mapping algorithms were verified on publicly available datasets and through experiments on a Cassie-series bipedal robot. Finally, instead of modeling terrain traversability using metrics defined by domain knowledge, I will present an energy-based deep inverse reinforcement learning method that learns robot-specific traversability from demonstrations. This method considers robot proprioception and can learn reward maps that lead to more energy-efficient future trajectories. Experiments are conducted on a dataset collected by a Mini-Cheetah robot in various scenes of a campus environment.
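    The closed-form flavor of Bayesian-kernel-inference-style semantic mapping can be sketched as kernel-weighted Dirichlet counts per map cell: each labeled measurement increments the concentration parameters of nearby cells. The kernel, map, and data below are simplified stand-ins, not the paper's exact model.

```python
import numpy as np

def bki_update(alpha, cells, points, labels, ell=0.5):
    """One closed-form, Bayesian-kernel-inference-style update (a sketch):
    each map cell keeps Dirichlet concentrations alpha[cell, class],
    incremented by kernel-weighted counts of observed semantic labels."""
    for p, y in zip(points, labels):
        d = np.linalg.norm(cells - p, axis=1)
        k = np.clip(1.0 - d / ell, 0.0, None)   # simple compact-support kernel
        alpha[:, y] += k
    return alpha

cells = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
alpha = np.full((4, 3), 1e-3)                    # near-uninformative prior
points = np.array([[0.1, 0.0], [0.05, 0.1], [1.0, 0.95]])
labels = [2, 2, 0]                               # semantic class per point
alpha = bki_update(alpha, cells, points, labels)
probs = alpha / alpha.sum(axis=1, keepdims=True)  # posterior class means
assert probs[0].argmax() == 2    # cell (0,0) dominated by class 2
assert probs[3].argmax() == 0    # cell (1,1) dominated by class 0
```

Because the posterior is conjugate, the update is a running sum: no retraining is needed as new scans arrive, which is the source of the scalability claimed in the abstract.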

    Bio: Nwamaka Okafor is a final-year PhD student at University College Dublin, where she works with the Smartbog Observation Group on Irish peatlands and is advised by Dr. Declan Delaney. Her work is supported by the Schlumberger Faculty for the Future (FFTF) Fellowship Programme, the Environmental Protection Agency (EPA), and TETFUND, and focuses on applications of IoT and machine learning in environmental monitoring. Key to her research are means to process and efficiently harness inherently noisy data collected from IoT deployments, web/crowdsourced data, and remote sensing of ecologically significant sites. Nwamaka holds an MSc in Computer Forensics and Cyber Security (Distinction) from the University of Greenwich, London. During the Covid-19 pandemic, Nwamaka collaborated with researchers from the Universities of Oxford and Cambridge on a British Telecom-sponsored project exploring the impact of containment measures on the rate of spread of Covid-19 in selected countries. The project won the Best Project with Real World Impact Award among 44 projects from across the world. Nwamaka is a recipient of the 2020 Google Women Techmakers award (now the Generation Google Scholarship) and an ACM SRC award, and was recently listed as one of the eighty women advancing AI in Africa and around the world by African Shapers. Her work has been published in leading journals and conferences including ICML and IEEE Sensors. Upon completion of her PhD, Nwamaka will join Argonne National Laboratory, IL, USA as a postdoctoral appointee, focusing on developing novel methods to analyse logs from supercomputers, including Aurora and Polaris.

    Talk Title: Efficient data processing pipelines for IoT sensor devices in environmental monitoring networks

    Talk Abstract: Low-cost IoT sensors (LCS) have the capacity to provide high-resolution spatio-temporal datasets of key variables in environmental monitoring networks. The use of LCS in environmental monitoring, however, continues to raise important questions, especially pertaining to the reliability, field performance, and data quality of the sensors. IoT technologies are at an early stage of development in many application areas and are still challenged by a number of issues, including data handling and the lack of a framework: there is currently no universal framework or formalized architecture for the application of IoT sensor devices in environmental monitoring, and no complete tool has been developed for data handling and analysis.
    Developing effective data processing pipelines for IoT sensor data is essential for the adoption and application of LCS in environmental monitoring. In this talk, I will discuss our research efforts aimed at improving the field performance and data quality of LCS devices in environmental monitoring networks through the development of reliable data processing and error correction techniques such as automatic sensor calibration, data fusion, data imputation, outlier detection, and effective feature selection.
    I will discuss how the capabilities of AI/ML can be leveraged to support the effective application of IoT in ecological sensing; in particular, how these technologies can: 1) support the operation of LCS in data acquisition models to capture large-scale and longitudinal observations, 2) support the development of LCS-specific data handling and Data Quality Improvement (DQI) techniques, and 3) serve as effective strategies to support on-site sensor calibration.
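    As a concrete illustration of one listed technique, automatic sensor calibration is often posed as regressing co-located low-cost readings against a reference-grade instrument. A minimal sketch with synthetic readings (the gain/offset error model below is an assumption for illustration only):

```python
import numpy as np

# Hypothetical co-located readings: a low-cost sensor with gain/offset error
# plus noise, and a reference instrument providing ground truth.
rng = np.random.default_rng(0)
truth = rng.uniform(10, 60, size=200)                # e.g. PM2.5 in ug/m3
lcs = 1.4 * truth + 5.0 + rng.normal(0, 1.0, 200)    # biased low-cost readings

# Ordinary least squares calibration: truth ~ a * lcs + b
A = np.vstack([lcs, np.ones_like(lcs)]).T
(a, b), *_ = np.linalg.lstsq(A, truth, rcond=None)
calibrated = a * lcs + b

raw_rmse = np.sqrt(np.mean((lcs - truth) ** 2))
cal_rmse = np.sqrt(np.mean((calibrated - truth) ** 2))
assert cal_rmse < raw_rmse   # calibration removes the systematic bias
```

In deployments, the same fit would be refreshed periodically against reference co-locations, since LCS response drifts with environmental conditions.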

    Bio: Oscar Leong is a von Karman Instructor in the Computing and Mathematical Sciences department at Caltech, hosted by Venkat Chandrasekaran. He also works with Katie Bouman and the Computational Cameras group. He received his PhD in 2021 in Computational and Applied Mathematics from Rice University under the supervision of Paul Hand, where he was an NSF Graduate Fellow. His research interests lie in the mathematics of data science, optimization, inverse problems, and machine learning. The core focus of his doctoral work was on proving recovery theorems for ill-posed inverse problems using generative models. He is broadly interested in using tools from convex geometry, high-dimensional statistics, and nonlinear optimization to better understand and improve data-driven, decision-making algorithms.

    Talk Title: The power and limitations of convex regularization

    Talk Abstract: The incorporation of regularization functionals to promote structure in inverse problems has a long history in applied mathematics, with a recent surge of interest in both convex and nonconvex functionals. The choice of such functionals is largely driven by computational and modeling considerations, but we lack an overarching understanding of the power and limitations of enforcing convexity versus nonconvexity, and of when one should be preferred over the other for a given data distribution. In this presentation, we tackle this question from a convex-geometric perspective and ask: for a given data source, what is the optimal regularizer induced by a convex body? To answer this, we pose a variational optimization problem over the space of convex bodies to search for the optimal regularizer. We analyze the structure of population risk minimizers by connecting minimization of the population objective to Minkowski’s inequality for convex bodies. These results are also shown to be robust, as we establish convergence of empirical risk minimizers to the population risk minimizer in the limit of infinite data. Based on this characterization of optimal convex regularizers, we then consider whether convexity is the right structure for certain distributions and use our theory to characterize when convex or nonconvex structure should be preferred.
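    For orientation, one standard way to read “regularizer induced by a convex body” is as the gauge (Minkowski functional) of the body; the formulation below is the usual one and may differ in details from the talk’s exact setup:

```latex
% Gauge (Minkowski functional) of a convex body C containing the origin:
\|x\|_C \;=\; \inf\{\, t > 0 \;:\; x \in t\,C \,\},
% used as a regularizer in a variational reconstruction of x from data y = Ax + noise:
\hat{x} \;\in\; \operatorname*{arg\,min}_{x}\; \tfrac{1}{2}\,\|y - Ax\|_2^2 \;+\; \lambda\,\|x\|_C .
```

Familiar regularizers arise as special cases: the $\ell_1$ norm is the gauge of the cross-polytope, and the nuclear norm is the gauge of the ball of rank-one matrices’ convex hull.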

    Bio: Ruishan Liu is a postdoctoral researcher in the Department of Biomedical Data Science at Stanford University, working with Prof. James Zou. She received her PhD from the Department of Electrical Engineering at Stanford University in 2022. She is broadly interested in the intersection of machine learning and applications in human diseases, health, and genomics. The results of her work have been published in top-tier venues such as Nature, Nature Medicine, and ICLR. She is the recipient of a Stanford Graduate Fellowship and was selected as a rising star in engineering in health by Johns Hopkins University and Columbia University in 2022. She led the Trial Pathfinder project, which was selected as a 2021 Top Ten Clinical Research Achievement and was a finalist for the 2021 Global Pharma Award.

    Talk Title: AI for clinical trials and precision medicine

    Talk Abstract: Clinical trials are the gatekeepers of medicine but can be very costly and lengthy to conduct. Precision medicine is transforming healthcare but is limited by available clinical knowledge. This talk explores how AI can help on both fronts: making clinical trials more efficient and generating hypotheses for precision medicine. I will first discuss Trial Pathfinder, a computational framework that simulates synthetic patient cohorts from medical records to optimize cancer trial designs (Liu et al., Nature 2021). Trial Pathfinder enables inclusive criteria and data valuation for clinical trials, benefiting diverse patients and trial sponsors. In the second part, I will discuss how to quantify the effectiveness of cancer therapies in patients with specific mutations (Liu et al., Nature Medicine 2022). This work demonstrates how computational analysis of large real-world data generates insights, hypotheses, and resources to enable precision oncology.

    Bio: Ryan Shi is a final-year Ph.D. candidate in Societal Computing in the School of Computer Science at Carnegie Mellon University. He works with nonprofit organizations to address societal challenges in food security, sustainability, and public health using AI, and his research has been deployed at these organizations worldwide. Shi studies game theory, online learning, and reinforcement learning problems motivated by nonprofit applications. He is the recipient of a Siebel Scholar award, an IEEE Computer Society Upsilon Pi Epsilon Scholarship, and a Carnegie Mellon Presidential Fellowship. Shi grew up in Henan, China before moving to the U.S., where he graduated from Swarthmore College with a B.A. in mathematics and computer science.

    Talk Title: Towards the Science of AI for Nonprofits

    Talk Abstract: I work on AI for nonprofits, with nonprofits, so that it can be used by nonprofits. This talk is about delineating AI for nonprofits as a research discipline of its own. I will start with a four-year collaboration with a food rescue organization. We developed a series of machine learning-based algorithms to help the organization engage with food rescue volunteers, and our tools have been deployed since February 2020. Our randomized controlled trial showed that the algorithm not only improved the accuracy of engagement but also made it significantly easier for rescues to get claimed. We are rolling it out to over 15 cities across the US. Motivated by the pain points we experienced in this and other work, we proposed bandit data-driven optimization, a new learning paradigm that integrates online bandit learning and offline predictive models to address the unique challenges that arise in data science projects for nonprofits. We prove theoretical guarantees for our algorithm and show that it achieves superior performance on simulated and real datasets. I will conclude the talk with an inquiry into the state of AI for nonprofits as its own research discipline. Using another project, in which we developed and deployed an NLP tool with a conservation nonprofit, as a primer, I argue that AI for nonprofits needs its own set of research questions to systematically guide future projects. I will share a few such directions that I am working on now and plan to pursue in the near future.

    Bio: Tian Li is a fifth-year Ph.D. student in the Computer Science Department at Carnegie Mellon University working with Virginia Smith. Her research interests are in distributed optimization, large-scale machine learning, federated learning, and data-intensive systems. Prior to CMU, she received her undergraduate degrees in Computer Science and Economics from Peking University. She was a research intern at Google Research in 2022. She received the Best Paper Award at ICLR Workshop on Security and Safety in Machine Learning Systems (2021), was selected as Rising Stars in Machine Learning (2021), and was invited to participate in EECS Rising Stars Workshop (2022).

    Talk Title: Scalable and Trustworthy Learning in Heterogeneous Networks

    Talk Abstract: Federated learning stands to power next-generation machine learning applications by aggregating a wealth of knowledge from distributed data sources. However, federated networks introduce a number of challenges beyond traditional distributed learning scenarios. In addition to being accurate, federated methods must scale to potentially massive and heterogeneous networks of devices, and must exhibit trustworthy behavior, addressing pragmatic concerns such as fairness, robustness, and user privacy. In this talk, I discuss how heterogeneity lies at the center of the constraints of federated learning: it not only affects the accuracy of the models, but also competes with other critical metrics such as fairness, robustness, and privacy. To address these concerns, I present new, scalable federated learning objectives and algorithms that rigorously account for and address sources of heterogeneity. Although our work is grounded in the application of federated learning, I show that many of the techniques and fundamental tradeoffs extend well beyond this use case.
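    The tension between heterogeneity and aggregation can be made concrete with a federated averaging loop. The sketch below simulates clients with different sample sizes and feature scalings, and adds a FedProx-style proximal term as one standard way to temper client drift; it is a generic illustration, not the speaker's objectives or algorithms.

```python
import numpy as np

def local_update(w, X, y, w_global, lr=0.1, steps=10, mu=0.1):
    """Local full-batch gradient steps on a least-squares objective, plus a
    FedProx-style proximal term (mu/2)||w - w_global||^2 limiting drift."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y) + mu * (w - w_global)
        w = w - lr * grad
    return w

rng = np.random.default_rng(1)
d, n_clients = 5, 4
w_true = rng.normal(size=d)

# Heterogeneous clients: different sample sizes and feature scalings.
data = []
for c in range(n_clients):
    X = rng.normal(scale=1.0 + 0.2 * c, size=(20 * (c + 1), d))
    data.append((X, X @ w_true))             # noiseless labels for clarity

w = np.zeros(d)
for _ in range(40):                          # communication rounds
    local_ws = [local_update(w.copy(), X, y, w_global=w) for X, y in data]
    sizes = [len(y) for _, y in data]
    w = np.average(local_ws, axis=0, weights=sizes)   # FedAvg aggregation

assert np.linalg.norm(w - w_true) < 0.2      # global model recovers w_true
```

With real, conflicting client objectives the plain average can systematically favor data-rich clients, which is one way heterogeneity trades off against fairness.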

    Bio: Valentin is a final-year PhD student at the University of Oxford and a research assistant at LMU Munich. His work broadly focuses on the intersection of NLP, linguistics, and computational social science, with specific interests in tokenization algorithms, socially and temporally aware NLP systems, and computational models of political ideology. The findings of his research have been published at major NLP and machine learning conferences such as ACL, EMNLP, NAACL, ICWSM, and ICML. During his PhD, Valentin also spent time as a research intern in DeepMind’s Language Team and as a visiting scholar in Stanford’s NLP Group.

    Talk Title: Modeling Ideological Language with Graph Neural Networks and Structured Sparsity

    Talk Abstract: The increasing polarization of online political discourse calls for computational tools that automatically detect and monitor ideological divides in social media. While many methods to track ideological polarization have been proposed, most of them rely on knowing in advance the political orientation of text, a requirement seldom met in practice. In this talk, I will discuss research recently published at NAACL and ICML that fully dispenses with the need for labeled data and instead leverages the ubiquitous network structure of online discussion forums, specifically Reddit, to detect ideological polarization. I will first give an overview of prior research on polarization in natural language processing, highlighting salience and framing as two key mechanisms by which ideology manifests itself in language. I will then present Slap4slip, a method that combines graph neural networks with structured sparsity learning to determine the polarization of political issues (e.g., abortion) along the dimensions of salience and framing. In the third part of the talk, I will show that polarization is also reflected by the existence of an ideological subspace in contextualized embeddings, which can be found by adding orthogonality regularization to Slap4slip. The ideological subspace encodes abstract evaluative semantics and indicates pronounced changes in the political left-right spectrum during the presidency of Donald Trump.

    Bio: Vishwali Mhasawade is a Ph.D. candidate in Computer Science at New York University, advised by Prof. Rumi Chunara. Her research is supported by the Google Fellowship in Health. She focuses on designing fair and equitable machine learning systems for mitigating health disparities and developing methods in causal inference and algorithmic fairness. Vishwali was an intern at Fiddler AI Labs and Spotify Research. She has been involved in mentoring roles involving high school students through the NYU ARISE program, reviewer mentoring through the Machine Learning for Health initiative, and career mentoring for Ph.D. applicants through the Women in Machine Learning program.

    Talk Title: Advancing Health Equity with Machine Learning

    Talk Abstract: When a patient visits the hospital for treatment, factors outside the hospital, such as where the individual resides and what educational and vocational opportunities are present, play a vital role in the patient’s health trajectory. In contrast, most advances in machine learning for healthcare remain restricted to data from within hospitals and clinics. Health equity, defined as minimizing avoidable disparities in health and its determinants between groups of people with different social privileges in terms of power, wealth, and prestige, is the primary principle underlying public health research, yet it has been largely ignored by current machine learning systems. Inequality at the social level is harmful to the population as a whole. A focus on factors affecting health outside the hospital is therefore imperative, not only to address specific challenges for high-risk individuals but also to determine which policies will benefit the community as a whole. In this talk, I will first demonstrate the challenges of mitigating health disparities that result from the differing representations of demographic groups based on attributes like gender and self-reported race, focusing on machine learning systems that use person-generated data and in-hospital data from multiple geographical locations worldwide. Next, I will present a causal remedial approach to health inequity that uses algorithmic fairness to reduce health disparities. Finally, I will discuss how algorithmic fairness can be leveraged to achieve health equity by incorporating social factors, and illustrate how residual disparities persist when social factors are ignored.

    Bio: I am a joint NIST-IMA Postdoctoral Fellow in Analysis of Machine Learning at the Institute for Mathematics and its Applications (IMA) at the University of Minnesota (UMN). I completed my Ph.D. in mathematics at the University of California, Los Angeles in 2022 under the guidance of Professor Stan Osher. Throughout my Ph.D., I developed optimal-transport-based algorithms to solve nonlinear partial differential equations (PDEs) such as Darcy’s law, tumor growth models, and mean field games. As a postdoc, I will be working with Professors Jeff Calder, Gilad Lerman, and Li Wang at UMN to develop PDE-based algorithms for high-dimensional machine learning problems and to analyze the theoretical properties of these algorithms.

    Talk Title: The Back-And-Forth Method For Wasserstein Gradient Flows

    Talk Abstract: We present a method to efficiently compute Wasserstein gradient flows. Our approach is based on a generalization of the back-and-forth method (BFM) introduced by Jacobs and Leger to solve optimal transport problems. We evolve the gradient flow by solving the dual problem to the JKO scheme; in general, the dual problem is much better behaved than the primal problem. This allows us to efficiently run large-scale gradient flow simulations for a large class of internal energies, including singular and non-convex energies. This is joint work with Matt Jacobs (Purdue University) and Flavien Leger (INRIA Paris).
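    For reference, the JKO scheme mentioned above is the standard minimizing-movements time discretization of a Wasserstein gradient flow:

```latex
% JKO (Jordan--Kinderlehrer--Otto) scheme with time step \tau > 0:
% each step is a proximal problem in the Wasserstein-2 metric.
\rho^{k+1} \;\in\; \operatorname*{arg\,min}_{\rho}\;
  \frac{1}{2\tau}\, W_2^2\bigl(\rho, \rho^{k}\bigr) \;+\; E(\rho)
```

As $\tau \to 0$ the iterates approximate the continuous flow $\partial_t \rho = \nabla \cdot \bigl(\rho\, \nabla \tfrac{\delta E}{\delta \rho}\bigr)$; the talk's method attacks the dual of each proximal step.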

    Bio: Xintong Wang is a postdoctoral fellow in the Harvard EconCS group hosted by David Parkes. She received her Ph.D. in April 2021 from the CSE Department at the University of Michigan, advised by Michael Wellman. Her research is centered around developing computational approaches to model complex agent behaviors and to design better market-based algorithmic systems, drawing on tools from machine learning, game theory, and optimization, with applications ranging from platform economies to financial systems. During her Ph.D., she worked as a research intern at Microsoft Research NYC (mentored by David Pennock) and J.P. Morgan AI Research (mentored by Tucker Balch). Prior to Michigan, she received her Bachelor’s degree from Washington University in St. Louis in 2015.

    Talk Title: Market Manipulation: An Adversarial Learning Framework for Detection and Evasion

    Talk Abstract: I will talk about our recently proposed adversarial learning framework, which captures the evolving game between a regulator who develops tools to detect market manipulation and a manipulator who obfuscates actions to evade detection. The model includes three main parts: (1) a generator that learns to adapt a sequence of known manipulation activities to resemble the patterns of a normal (benign) trading agent while preserving the manipulation intent; (2) a discriminator that differentiates the adversarially adapted manipulation actions from normal trading activities; and (3) an agent-based model that provides feedback regarding the manipulation effectiveness and profitability of the adapted outputs. We conduct experiments on simulated trading actions associated with a manipulator and a benign agent, respectively. We show examples of adapted manipulation order streams that mimic the specified benign trading patterns and appear qualitatively different from the original manipulation strategy encoded in the simulator. These results demonstrate the possibility of automatically generating a diverse set of (unseen) manipulation strategies that can facilitate the training of more robust detection algorithms.

    Bio: I am a postdoc hosted by the Department of Statistics at the University of Chicago. Prior to this, I obtained my Ph.D. in Statistics at the University of Washington, advised by Fang Han and Daniela Witten. I am broadly interested in nonparametric and semiparametric statistics, high dimensional inference, and applied probability.

    Talk Title: Universality of regularized regression estimators in high dimensions

    Talk Abstract: The Convex Gaussian Min-Max Theorem (CGMT) has emerged as a prominent theoretical tool for analyzing the precise stochastic behavior of various statistical estimators in the so-called high-dimensional proportional regime, where the sample size and the signal dimension are of the same order. However, a well-recognized limitation of the existing CGMT machinery lies in its stringent requirement of exact Gaussianity of the design matrix, rendering the resulting precise high-dimensional asymptotics largely a Gaussian-specific theory in various important statistical models. This work provides a structural universality framework for a broad class of regularized regression estimators that is particularly compatible with the CGMT machinery. In particular, we show that with a good enough $\ell_\infty$ bound for the regression estimator $\hat{\mu}_A$, any “structural property” that can be detected via the CGMT for $\hat{\mu}_G$ (under a standard Gaussian design $G$) also holds for $\hat{\mu}_A$ under a general design $A$ with independent entries. As a proof of concept, we demonstrate our universality framework in three key examples of regularized regression estimators: the Ridge, Lasso, and regularized robust regression estimators, where new universality properties of risk asymptotics and/or distributions of the regression estimators and other related quantities are proved. As a major statistical implication of the Lasso universality results, we validate inference procedures using the degrees-of-freedom-adjusted debiased Lasso under general design and error distributions. We also provide a counterexample showing that universality properties for regularized regression estimators do not extend to general isotropic designs.
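    For orientation, the class of estimators in question takes the standard regularized form below (the exact normalization is an assumption here, not taken from the talk):

```latex
% Regularized regression estimator under a design matrix A
% (f convex; f(\mu) = \lambda\|\mu\|_2^2 gives ridge, \lambda\|\mu\|_1 gives the Lasso):
\hat{\mu}_A \;\in\; \operatorname*{arg\,min}_{\mu \in \mathbb{R}^d}\;
  \frac{1}{2n}\,\bigl\| y - A\mu \bigr\|_2^2 \;+\; f(\mu)
```

The universality statement then says that distributional conclusions established for $\hat{\mu}_G$ via the CGMT transfer to $\hat{\mu}_A$ for non-Gaussian designs with independent entries.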

    Bio: Denny Wu is a Ph.D. candidate at the University of Toronto and the Vector Institute for Artificial Intelligence, under the supervision of Jimmy Ba and Murat A. Erdogdu. He also maintains close collaboration with the University of Tokyo and RIKEN AIP, where he is currently working with Taiji Suzuki and Atsushi Nitanda. Denny is interested in developing a theoretical understanding (e.g., optimization and generalization) of modern machine learning systems, especially overparameterized models such as neural networks, using tools from high-dimensional statistics.

    Talk Title: How neural networks learn representations: a random matrix theory perspective

    Talk Abstract: Random matrix theory (RMT) provides powerful tools to characterize the performance of random neural networks in high dimensions. However, it is not clear whether similar tools can be applied to trained neural networks, where the parameters are no longer i.i.d. due to gradient-based learning. In this work we use RMT to precisely quantify the benefit of feature learning in the “early phase” of gradient descent training. We study the first gradient step under the MSE loss on the first-layer weights W of a two-layer neural network. In the proportional asymptotic limit and an idealized student-teacher setting, we show that the gradient update contains a rank-1 “spike”, which results in an alignment between the weights and the linear component of the teacher model f*. To understand the impact of this alignment, we compute the asymptotic prediction risk of ridge regression on the conjugate kernel after one feature learning step with learning rate \eta, and consider two scalings of \eta. For small \eta, we establish a Gaussian equivalence property for the trained features and prove that the learned kernel improves upon the initial random features but cannot outperform the best linear model. For sufficiently large \eta, in contrast, we prove that for certain f* the same ridge estimator moves beyond this “linear regime” and outperforms a wide range of (fixed) kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.
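    The rank-1 spike in the first gradient step can be observed directly in simulation. The sketch below uses a linear teacher, a tanh student, and a deliberately small second layer so the residual is dominated by the teacher signal; all dimensions and scalings are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 300, 300, 8000           # input dim, width, samples
beta = rng.normal(size=d); beta /= np.linalg.norm(beta)   # linear teacher
W = rng.normal(size=(N, d)) / np.sqrt(d)                  # first-layer init
a = rng.choice([-1.0, 1.0], size=N) * 0.01 / np.sqrt(N)   # small 2nd layer

X = rng.normal(size=(n, d))
y = X @ beta                        # teacher targets f*(x) = beta . x
pre = X @ W.T                       # (n, N) preactivations
resid = np.tanh(pre) @ a - y        # ~ -y, since the network starts tiny
# First-layer MSE gradient: G = (1/n) sum_i resid_i (a * tanh'(Wx_i)) x_i^T
G = ((resid[:, None] * (1.0 - np.tanh(pre) ** 2)) * a).T @ X / n

u, s, vt = np.linalg.svd(G, full_matrices=False)
assert s[0] > 1.5 * s[1]                           # a rank-1 spike dominates
assert abs(vt[0] @ beta) > 0.6                     # spike aligns with teacher
assert abs(u[:, 0] @ a) / np.linalg.norm(a) > 0.6  # and with 2nd-layer signs
```

The top singular pair is approximately (a, beta), i.e. a single gradient step already writes the teacher's linear direction into the first-layer weights.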

    Bio: Yiping Lu is a fourth-year Ph.D. student at ICME Stanford, working with Lexing Ying and Jose Blanchet. Before that, he got his Bachelor’s degree in computational mathematics and information science at Peking University. His research interest lies in Non-parametric Statistics, Inverse Problems, Stochastic Modeling, and Econometrics. His research aims to design algorithms that combine data-driven techniques with (differential equation/structural) modeling and provide corresponding statistical and computational guarantees. Yiping’s research is supported by the Stanford Interdisciplinary Graduate Fellowship.

    Talk Title: Machine Learning for Differential Equations: Statistics, Optimization, and Multilevel Learning

    Talk Abstract: In this talk, we study the statistical limits of, and optimal algorithms for, solving elliptic PDEs from random samples. In the first part, I will discuss the statistical and optimization properties of learning elliptic inverse problems using the Deep Ritz Method (DRM) and Physics-Informed Neural Networks (PINNs). We provide a minimax lower bound for this problem and discover that the variance of DRM makes its sample complexity suboptimal; based on this observation, we propose a modification of DRM with optimal sample complexity. I will then discuss the optimization properties of a class of Sobolev norms used as objective functions. Although early-stopped gradient descent is statistically optimal for all of these objectives, we observe that using a Sobolev norm can accelerate training. Combining the results of the two papers, we conclude that DRM is fast for high-dimensional smooth functions, due to the low computational cost of each iteration, while PINNs should be preferred for low-dimensional rough functions. In the second part, I will discuss the statistical limits of linear operator learning, where both the input and output spaces are infinite-dimensional. We show that a proper bias-variance trade-off leads to optimal learning rates, which differ from those for learning finite-dimensional linear operators. We also illustrate how this theory inspires a multilevel machine learning algorithm. Our theory has wide applications in learning differential operators and generative modeling.
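    For reference, the two objectives compared in the first part take the following standard forms for the elliptic model problem $-\Delta u = f$ in $\Omega$ with $u|_{\partial\Omega} = 0$ (details may differ in the talk's exact setting):

```latex
% Deep Ritz minimizes the variational (weak-form) energy,
E_{\mathrm{DRM}}(u) \;=\; \int_\Omega \tfrac{1}{2}\,\lvert \nabla u \rvert^2 - f\,u \,\mathrm{d}x,
% while a PINN minimizes the strong-form residual,
E_{\mathrm{PINN}}(u) \;=\; \int_\Omega \bigl( \Delta u + f \bigr)^2 \,\mathrm{d}x,
% both estimated by Monte Carlo over random sample points in \Omega.
```

DRM needs only first derivatives of the network per sample while PINNs need second derivatives, which is the source of the per-iteration cost gap mentioned above.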

    Bio: Yiqun Chen is a Stanford Data Science Postdoctoral Fellow hosted by Professor James Zou. His research focuses on quantifying, calibrating, and communicating the uncertainty in modern data analysis, with applications to biomedical and health data. Previously, Yiqun received his Ph.D. in Biostatistics from the University of Washington under the supervision of Professor Daniela Witten. He completed undergraduate degrees in Statistics, Computer Science, and Chemical Biology at the University of California Berkeley. Yiqun is the recipient of a young investigator award at CROI 2020, a best paper award at WNAR 2021, a student research award at NESS 2022, a University of Washington Outstanding Teaching Assistant Award, and the Thomas R. Fleming Excellence in Biostatistics Award.

    Talk Title: Selective inference for k-means clustering

    Talk Abstract: We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we take a selective inference approach. We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering, and show that it can be efficiently computed. We apply our proposal in simulation, and on hand-written digits data and single-cell RNA-sequencing data.
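
    As a minimal illustration of the double-dipping problem the abstract describes (a hypothetical numpy sketch, not the authors' implementation): clustering pure noise and then naively testing the difference in cluster means yields a wildly inflated Type I error.

    ```python
    import math
    import numpy as np

    rng = np.random.default_rng(0)

    def two_means(x, iters=50):
        # Plain Lloyd's algorithm for k = 2 on 1-D data.
        centers = np.array([x.min(), x.max()], dtype=float)
        for _ in range(iters):
            labels = (np.abs(x - centers[0]) > np.abs(x - centers[1])).astype(int)
            centers = np.array([x[labels == k].mean() for k in (0, 1)])
        return labels

    # Global null: a single Gaussian population, i.e. no true clusters.
    x = rng.normal(size=100)
    labels = two_means(x)
    a, b = x[labels == 0], x[labels == 1]

    # Naive two-sample z-test that ignores how the clusters were chosen.
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    p_naive = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    print(p_naive)  # essentially 0: "significant" clusters found in pure noise
    ```

    The selective approach in the talk replaces this naive p-value with one that conditions on the event that k-means produced these clusters, restoring Type I error control.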

    Bio: Yuntian Deng is a PhD student at Harvard University advised by Prof. Alexander Rush and Prof. Stuart Shieber. Yuntian’s research aims at building coherent, transparent, and efficient long-form text generation models. Yuntian is also the main contributor to open-source projects such as OpenNMT, Im2LaTeX, and LaTeX2Im that are widely used in academia and industry. For his research, teaching, and professional service, he has been awarded an Nvidia Fellowship, a Baidu Fellowship, the ACL 2017 Best Demo Paper Runner-Up, a DAC 2020 Best Paper award, a Harvard Certificate of Distinction in Teaching, and recognition as a NeurIPS 2020 top 10% reviewer.

    Talk Title: Model Criticism for Long-Form Text Generation

    Talk Abstract: Language models have demonstrated the ability to generate highly fluent text; however, it remains unclear whether their output retains coherent high-level structure (e.g., story progression). Here, we propose to apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of the generated text. Model criticism compares the distributions of real and generated data in a latent space obtained according to an assumed generative process. Different generative processes identify specific failure modes of the underlying model. We perform experiments on three representative aspects of high-level discourse—coherence, coreference, and topicality—and find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference structures.
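
    A toy version of the idea (a hypothetical numpy sketch; the latent statistic and data are invented for illustration, not the paper's models): map real and generated sequences into a simple latent statistic and compare the two resulting distributions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def latent_stat(seqs):
        # Toy "latent space": each sequence's rate of repeating the previous
        # token, a crude stand-in for topical coherence.
        return np.array([np.mean(s[1:] == s[:-1]) for s in seqs])

    def coherent_seq(n, stay=0.7, vocab=4):
        # "Real" text: a sticky Markov chain that lingers on each topic.
        s = np.empty(n, dtype=int)
        s[0] = rng.integers(vocab)
        for t in range(1, n):
            s[t] = s[t - 1] if rng.random() < stay else rng.integers(vocab)
        return s

    real = [coherent_seq(200) for _ in range(100)]
    fake = [rng.integers(4, size=200) for _ in range(100)]  # locally fluent, no structure

    r, f = latent_stat(real), latent_stat(fake)
    print(r.mean(), f.mean())  # the generated texts separate cleanly in latent space
    ```

    A locally fluent but structureless generator is indistinguishable from real text token-by-token, yet the mismatch is obvious once both are projected into the latent statistic; the talk's method makes this comparison rigorous for learned generative processes.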

    Bio: She is a postdoctoral researcher in the Statistics Department at the University of Chicago, advised by Professor Rina Foygel Barber. Before joining the University of Chicago, she obtained her Ph.D. in Statistics from Stanford University, advised by Professor Emmanuel Candès.

    Talk Title: Derandomized knockoffs: leveraging e-values for false discovery rate control

    Talk Abstract: Model-X knockoffs is a flexible wrapper method for high-dimensional regression algorithms, which provides guaranteed control of the false discovery rate (FDR). Due to the randomness inherent to the method, different runs of model-X knockoffs on the same dataset often result in different sets of selected variables, which is undesirable in practice. In this paper, we introduce a methodology for derandomizing model-X knockoffs with provable FDR control. The key insight of our proposed method lies in the discovery that the knockoffs procedure is in essence an e-BH procedure. We make use of this connection, and derandomize model-X knockoffs by aggregating the e-values resulting from multiple knockoff realizations. We prove that the derandomized procedure controls the FDR at the desired level, without any additional conditions (in contrast, previously proposed methods for derandomization are not able to guarantee FDR control). The proposed method is evaluated with numerical experiments, where we find that the derandomized procedure achieves comparable power and dramatically decreased selection variability when compared with model-X knockoffs.
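
    The aggregation step can be sketched as follows (a hypothetical numpy illustration using synthetic e-values, not the paper's knockoff-derived ones): since the mean of valid e-values is itself a valid e-value, averaging across runs and then applying e-BH preserves FDR control while stabilizing the selection.

    ```python
    import numpy as np

    def ebh(e, alpha=0.1):
        # e-BH with m e-values: reject the k* largest, where
        # k* = max{k : e_(k) >= m / (alpha * k)}.
        m = len(e)
        order = np.argsort(e)[::-1]
        k_star = 0
        for k, idx in enumerate(order, start=1):
            if e[idx] >= m / (alpha * k):
                k_star = k
        return set(order[:k_star].tolist())

    rng = np.random.default_rng(1)
    m, n_runs, alpha = 20, 30, 0.1

    # Synthetic e-values standing in for repeated knockoff runs: the first
    # five variables are signals; the rest are nulls with mean e-value <= 1.
    e_runs = np.empty((n_runs, m))
    e_runs[:, :5] = rng.exponential(scale=200.0, size=(n_runs, 5))  # signals
    e_runs[:, 5:] = rng.exponential(scale=1.0, size=(n_runs, 15))   # nulls

    e_avg = e_runs.mean(axis=0)  # an average of e-values is still an e-value
    print(sorted(ebh(e_avg, alpha)))  # typically the five signal indices
    ```

    A single noisy run might select different variables each time; averaging the e-values before the e-BH step is what removes that run-to-run variability.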

    Bio: Zichao (Jack) Wang is a final-year Ph.D. student in electrical and computer engineering at Rice University, advised by Prof. Richard Baraniuk. Previously, he obtained M.S. (2020) and B.S. (2016), both in electrical and computer engineering at Rice University. His research focuses on developing artificial intelligence, machine learning, and natural language processing methods with applications in education and human learning. He has been a research intern with Microsoft Research, NVIDIA Research, and Google AI.

    Talk Title: Machine learning for human learning

    Talk Abstract: Despite the recent advances in artificial intelligence and machine learning, we have yet to witness the transformative breakthroughs that they can bring to education, and more broadly, to how humans learn. In this talk, I will introduce our two recent research directions in developing AI/ML methods to enable more personalized learning experiences on a large scale. First, I will describe generative modeling and representation learning techniques to enable adaptive learning materials, such as assessment questions and scientific equations, that are personalized to each learner. Second, I will describe a framework for analyzing learners’ open-ended solutions to assessment questions such as code submissions in computer science education. I will also discuss my vision for AI/ML methods for education and human learning, highlighting the fundamental technical challenges, my preliminary attempts, and their potential real-world impacts.

  • Committee & Mentors

    Faculty Committee

    Yuxin Chen is an assistant professor in the Department of Computer Science at the University of Chicago. Previously, he was a postdoctoral scholar in Computing and Mathematical Sciences at Caltech, hosted by Prof. Yisong Yue. He received his Ph.D. degree in Computer Science from ETH Zurich, under the supervision of Prof. Andreas Krause. He is a recipient of the PIMCO Postdoctoral Fellowship in Computing and Mathematical Sciences, a Swiss National Science Foundation Early Postdoc.Mobility fellowship, and a Google European Doctoral Fellowship in Interactive Machine Learning.

    His research interest lies broadly in probabilistic reasoning and machine learning. He is currently working on developing interactive machine learning systems that involve active learning, sequential decision making, interpretable models, and machine teaching. More information is available on his Google Scholar profile.

    Homepage.

    Aaron Dinner is a Professor of Chemistry and Deputy Dean of the Physical Sciences Division at the University of Chicago. He obtained his A.B. in Biochemical Sciences in 1994 and his Ph.D. in Biophysics in 1999, both from Harvard University. Following postdoctoral research at Oxford University from 1999 to 2001 and the University of California, Berkeley from 2001 to 2003, he joined the faculty at the University of Chicago in 2003. From 2012 to 2018 he served as the Director of the James Franck Institute, a longstanding interdisciplinary institute bringing together research in condensed matter physics; atomic, molecular, and optical (AMO) physics; biophysics; and physical chemistry. His research seeks to understand how the complex behaviors of living systems arise from molecular interactions, and he pioneered machine learning methods for interpreting molecular dynamics simulations. His honors include a Searle Scholarship, a National Science Foundation CAREER Award, a Sloan Fellowship, and an American Physical Society Fellowship.

    I completed my PhD in Statistics in June 2020 at Stanford University, where I was lucky to be supervised by Professor Susan Holmes and where I had the chance to work with Prof. Jure Leskovec. Prior to Stanford, I studied Applied Mathematics and Engineering at Ecole Polytechnique (France), where I received the equivalent of an M.S. and a B.S.

    My research interests lie in the analysis of patterns and the quantification of uncertainty in high-dimensional datasets, and in particular, graphs and networks, geared towards biomedical applications.

    https://donnate.github.io/

    I’m an assistant professor in the Department of Statistics at the University of Chicago. I am also a member of the Committee on Computational and Applied Mathematics (CCAM). I am interested in computational problems in structural biology and quantum many-body physics. I was extremely fortunate to have Amit Singer at Princeton as my Ph.D. adviser during 2012-2016, Lexing Ying at Stanford as my post-doc mentor during 2016-2019, and Phuan Ong at Princeton supervising my master’s thesis in experimental physics during 2010-2012.

    https://www.stat.uchicago.edu/~ykhoo/

    My research interests fall in two main threads. The first includes the use of the isotopic composition of atmospheric water vapor as a tracer of convective processes, cirrus formation, and stratosphere-troposphere exchange; and the design of spectroscopic techniques for in-situ trace gas measurements. The second includes climate (and human) response to greenhouse-gas forcing; development of tools for impacts assessment; statistical emulation of climate model output; and climate and energy policy evaluation.

    Dan Nicolae obtained his Ph.D. in statistics from The University of Chicago and has been a faculty at the same institution since 1999, with appointments in Statistics (since 1999) and Medicine (since 2006). His research focus is on developing statistical and computational methods for understanding the human genetic variation and its influence on the risk for complex traits, with an emphasis on asthma related phenotypes. The current focus in his statistical genetics research is centered on data integration and system-level approaches using large datasets that include clinical and environmental data as well as various genetics/genomics data types: DNA variation, gene expression (RNA-seq), methylation and microbiome.

    Homepage

    Lorenzo Orecchia is an assistant professor in the Department of Computer Science. His research focuses on applying mathematical techniques from discrete and continuous optimization to design algorithms for computational challenges arising in a variety of applications, including Machine Learning, Numerical Analysis and Combinatorial Optimization.

    David Uminsky joined the University of Chicago in September 2020 as a senior research associate and Executive Director of Data Science. He was previously an associate professor of Mathematics and Executive Director of the Data Institute at the University of San Francisco (USF). His research interests are in machine learning, signal processing, pattern formation, and dynamical systems. David is an associate editor of the Harvard Data Science Review. He was selected in 2015 by the National Academy of Sciences as a Kavli Frontiers of Science Fellow. He is also the founding Director of the BS in Data Science at USF and served as Director of the MS in Data Science program from 2014 to 2019. During the summer of 2018, David served as the Director of Research for the Mathematical Sciences Research Institute Undergraduate Program on the topic of Mathematical Data Science.

    Before joining USF he was a combined NSF and UC President’s Fellow at UCLA, where he was awarded the Chancellor’s Award for outstanding postdoctoral research. He holds a Ph.D. in Mathematics from Boston University and a BS in Mathematics from Harvey Mudd College.

    Faculty Mentors

    Aloni Cohen is an Assistant Professor of Computer Science and Data Science at the University of Chicago. Previously, Cohen was a Postdoctoral Associate at Boston University, with a joint appointment at the Hariri Institute for Computing and the School of Law. His research explores the interplay between theoretical cryptography, privacy, law, and policy. Aloni earned his PhD in electrical engineering and computer science at MIT where he was advised by Shafi Goldwasser and supported by a Facebook Fellowship and an NSF Graduate Student Fellowship. Aloni is a former affiliate at the Berkman Klein Center for Internet & Society and a Fellow at the Aspen Tech Policy Hub.

    My research explores the interplay between theoretical cryptography, privacy, law, and policy. Specifically, I aim to understand and resolve the tensions between the theory of cryptography and the privacy and surveillance law that governs its eventual real-world context. Right now, I’m thinking about differential privacy, GDPR, the Fifth Amendment, encryption, multiparty computation, and the Census.

    Nick Feamster is Neubauer Professor in the Department of Computer Science and the College and the Faculty Director of Research for the Data Science Institute. He researches computer networking and networked systems, with a particular interest in Internet censorship, privacy, and the Internet of Things. His work on experimental networked systems and security aims to make networks easier to manage, more secure, and more available.

    Website

    My main research interests are in speech and language processing, as well as related aspects of machine learning.

    I am an Associate Professor at TTI-Chicago, a philanthropically endowed academic computer science institute located on the University of Chicago campus. We are recruiting students to our PhD program and visiting student program, as well as additional faculty, including in speech and language-related areas (more on Speech and Language at TTIC).

    I completed my PhD in 2005 at MIT in the Spoken Language Systems group of the Computer Science and Artificial Intelligence Laboratory. In 2005-2007 I was a post-doctoral lecturer in the MIT EECS department. In Feb.-Aug. 2008 I was a Research Assistant Professor at TTI-Chicago.

    Homepage

    I am an assistant professor in the Department of Statistics at the University of Chicago.

    Previously, I was a postdoctoral researcher at UC Berkeley, advised by Professor Martin Wainwright. I obtained my Ph.D. at Princeton University in 2020, advised by Professor Yuxin Chen and Professor Jianqing Fan. Prior to graduate school, I received my bachelor’s degree in Electrical Engineering from Tsinghua University in 2015.

    I am broadly interested in the mathematics of data science, reinforcement learning, high-dimensional statistics, and convex and nonconvex optimization, as well as their applications to neuroscience.

    Website

    I am an Assistant Professor in the Department of Political Science at The University of Chicago.

    I work on quantitative methodology for social science research, with a focus on causal inference, machine learning, and experimental design, particularly for adaptive experiments. My PhD is from Yale, joint in Political Science and Statistics & Data Science.

    Previously, I was a post-doctoral fellow in Susan Athey’s Golub Capital Social Impact Lab at the Stanford Graduate School of Business.

    In addition to the PhD, I hold a Masters in Statistics, also from Yale, and a Masters in Public Affairs, from the Princeton School of Public and International Affairs. My undergraduate degree was in cultural anthropology from Grinnell College; after college, I spent a year in Lesotho, teaching high school students, and two years in Madagascar, as a Peace Corps volunteer.

    Website

    Since 2018, I have been an Assistant Professor in the Department of Statistics at the University of Chicago, where I am a member of the Committee on Computational and Applied Mathematics. Previously, I was a postdoctoral research associate and a member of the Data Science Initiative at Brown University. In 2016 I completed my PhD in Mathematics and Statistics at the University of Warwick, UK, under the supervision of Andrew Stuart and Gareth Roberts.

    My research interests are in graph-based learning, inverse problems, and data assimilation. The main theme that drives my research across these three disciplines is the desire to blend complex predictive models with large datasets. My work addresses both theoretical and computational challenges motivated by data-centric applications.

    My work is currently funded by the National Science Foundation, the National Geospatial-Intelligence Agency, and the BBVA Foundation. I was awarded the 2020 José Luis Rubio de Francia prize, given by the Spanish Royal Society of Mathematics to the best Spanish mathematician under 32.

    I am the organizer of the CAM Colloquium.

    Aaron is an Assistant Professor in the Statistics Department and Data Science Institute at UChicago. His research develops methodology in Bayesian statistics, causal inference, and machine learning for applied problems in political science, economics, and genetics, among other fields. Prior to joining UChicago, Aaron was a postdoctoral fellow in the Data Science Institute at Columbia University. He received his PhD in Computer Science from UMass Amherst, as well as an MA in Linguistics and a BA in Political Science. He is on Twitter @AaronSchein.

    Chenhao Tan is an assistant professor at the Department of Computer Science and the UChicago Data Science Institute. His main research interests include language and social dynamics, human-centered machine learning, and multi-community engagement. He is also broadly interested in computational social science, natural language processing, and artificial intelligence.

    Website

    Rebecca Willett is a Professor of Statistics and Computer Science at the University of Chicago. She completed her PhD in Electrical and Computer Engineering at Rice University in 2005 and was an Assistant, then tenured Associate, Professor of Electrical and Computer Engineering at Duke University from 2005 to 2013. She was an Associate Professor of Electrical and Computer Engineering, Harvey D. Spangler Faculty Scholar, and Fellow of the Wisconsin Institutes for Discovery at the University of Wisconsin-Madison from 2013 to 2018. Prof. Willett received the National Science Foundation CAREER Award in 2007, was a member of the DARPA Computer Science Study Group from 2007 to 2011, and received an Air Force Office of Scientific Research Young Investigator Program award in 2010. Prof. Willett has also held visiting researcher positions at the Institute for Pure and Applied Mathematics at UCLA in 2004, the University of Wisconsin-Madison from 2003 to 2005, the French National Institute for Research in Computer Science and Control (INRIA) in 2003, and the Applied Science Research and Development Laboratory at GE Medical Systems (now GE Healthcare) in 2002. Her research interests include network and imaging science with applications in medical imaging, wireless sensor networks, astronomy, and social networks. She is also an instructor for FEMMES (Females Excelling More in Math, Engineering, and Science) and a local exhibit leader for Sally Ride Festivals. She was a recipient of the National Science Foundation Graduate Research Fellowship, the Rice University Presidential Scholarship, the Society of Women Engineers Caterpillar Scholarship, and the Angier B. Duke Memorial Scholarship.

    Homepage

  • FAQ

    Any questions regarding the application or the workshop can be directed to data-science@uchicago.edu.

    • When is the application due?

      The student application is due Friday, September 30th by 11:59pm CT. Letters of recommendation are due Monday, October 3rd by 5:00pm CT. One letter of recommendation is required, and you may provide up to two. Recommenders can submit their letters of recommendation through our online application system; specific instructions are available through the application website. If recommenders have any issue submitting their letters via the InfoReady application interface, they can email them as PDF attachments to krosengarten@uchicago.edu.

    • Where can I apply?

      The application is available here through InfoReady. If you have not previously used InfoReady, you will be required to create an account in order to submit your application.

    • How many recommendations can I receive?

      You must request at least one letter of recommendation, and no more than two.

    • Who can I request a recommendation from?

      You are encouraged to request a letter of recommendation from a faculty member or advisor who can speak first-hand to your research and academic strengths and qualifications for the program.

    • Where do faculty submit recommendations?

      Your requested recommender can submit their letter through the InfoReady system; the application provides a template email that you can edit and automatically send to the person you’ve requested a letter from.

    • Can the research statement include references?

      Yes, your research statement can include references. References are allowed to run onto an additional third page beyond the two-page limit for the statement.

    • Do I have to be in a data science program (PhD or postdoc) to apply?

      We do not require that applicants are currently in a Data Science program to apply. However, applicants should be pursuing doctoral degrees or postdocs in computer science, statistics, computational and applied math, data science, or a related computational field.

    • Who can apply?

      Applicants must be full time graduate students within ~1-2 years of obtaining a PhD, or a current postdoctoral scholar, fellow, or researcher. If you have any questions about your eligibility, please feel free to send an email to data-science@uchicago.edu.

    • Will the Autumn 2022 workshop be remote or in person?

      We are committed to providing a safe and healthy environment for UChicago Rising Stars attendees, and will abide by recommendations made by UChicago Medicine and the University of Chicago leadership regarding the impact of the ongoing coronavirus pandemic on travel and in-person events. The structure of the workshop may change based on regulations put in place to ensure everyone’s health and safety, but we remain committed to creating an event that is engaging, supportive, and informative for all participants. We will immediately inform workshop participants if alternative plans are necessary, and will update all communication channels, including the program website.

      This convening is open to all invitees regardless of vaccination status and, because of ongoing health risks, particularly to the unvaccinated, participants are expected to adopt the risk mitigation measures (masking and social distancing, etc.) appropriate to their vaccination status as advised by public health officials or to their individual vulnerabilities as advised by a medical professional. Public convening may not be safe for all and carries a risk for contracting COVID-19, particularly for those unvaccinated. Participants will not know the vaccination status of others and should follow appropriate risk mitigation measures.

    • Is travel funding available?

      Yes, we offer a travel stipend of up to $1000, reimbursable after the workshop.

    • When will the workshop agenda be available?

      The workshop agenda will be available in October.

    • Who is the audience for the student research talks?

      The primary audience for the student research talks is the Rising Stars Committee and the PhD students, researchers, faculty, and staff in the data science ecosystem at UChicago. Certain panels and workshop events will be open to the public. Registration for the event will open in October.

    • What should the research talk cover?

      The research talk should highlight your research interests in data science and computing, and ideally showcase your unique approaches to the nascent field as it takes shape.

    • Do I need to cover established research in my talk, or can it cover early-stage projects?

      Your talk can use research work that is either early-stage or published, so long as you are confident that it appropriately and best showcases your research methodology and approaches.