Rising Stars in Data Science

Autumn 2021 Rising Stars

Bio: Maria Antoniak is a PhD candidate in Information Science at Cornell University. Her research focuses on unsupervised natural language processing methods and applications to computational social science and cultural analytics. Her work translates methods from natural language processing to insights about communities and self-disclosure by modeling personal experiences shared in online communities. She has a master’s degree in computational linguistics from the University of Washington and a bachelor’s degree in humanities from the University of Notre Dame, and she has completed research internships at Microsoft, Facebook, Twitter, and Pacific Northwest National Laboratory.

Talk Title: Modeling Personal Experiences Shared in Online Communities

Talk Abstract: Written communications about personal experiences—and the emotions, narratives, and values that they contain—can be both rhetorically powerful and statistically difficult to model. The first goal of my research is to use natural language processing methods to represent complex personal experiences and self-disclosures communicated in online communities. Two fruitful sites for this research are online communities grounded in structured cultural experiences (books, games) and online communities grounded in healthcare experiences (childbirth, contraception, pain management). These communities situate personal opinions and stories in social contexts of reception, expectation, and judgment. The second goal of my research is critical re-examination of measurement methods: I probe models designed for traditional natural language processing tasks involving large, generic datasets by exploring their results on small, socially-specific datasets that are popular in cultural analytics and computational social science.

Bio: Arjun studies the security of machine learning systems, with a focus on adversarial and distributed learning. His work has exposed new vulnerabilities in learning algorithms and developed a theoretical framework to analyze them. He was a finalist for the 2020 Bede Liu Best Dissertation Award, and won the 2019 Yan Huo *94 Graduate Fellowship and 2018 SEAS Award for Excellence at Princeton University. He received the 2018 Siemens FutureMakers Fellowship in Machine Learning, and was a finalist for the 2017 Bell Labs Prize. He is currently a postdoctoral scholar at UChicago with Ben Zhao and Nick Feamster.

Talk Title: The Role of Data Geometry in Adversarial Machine Learning

Talk Abstract: Understanding the robustness of machine learning systems has become a problem of critical interest due to their increasing deployment in safety-critical systems. Of particular interest are adversarial examples, which are maliciously perturbed test-time examples designed to induce misclassification. Most research on adversarial examples has focused on developing better attacks and ad hoc defenses, resulting in an attacker-defender arms race.

In this talk, we will step away from this paradigm and show how fundamental bounds on learning in the presence of adversarial examples can be obtained by viewing the problem through an information-theoretic lens. For fixed but arbitrary distributions, we demonstrate lower bounds on both the 0-1 and cross-entropy losses for robust learning. We compare these bounds to the performance of state-of-the-art robust classifiers and analyze the impact of different layers on robustness.

Bio: Lingjiao Chen is a PhD candidate in the computer sciences department at Stanford University. He is broadly interested in machine learning, data management and optimization. Working with Matei Zaharia and James Zou, he is currently exploring the fast-growing marketplaces of artificial intelligence and data. His work has been published at premier conferences and journals such as ICML, NeurIPS, SIGMOD and PVLDB, and partially supported by a Google fellowship.

Talk Title: Understanding and Exploiting Machine Learning Prediction APIs

Talk Abstract: Machine Learning (ML) prediction APIs are a fast-growing industry and an important part of ML as a service. For example, one could use a Google prediction API to classify an image for $0.0015 or to classify the sentiment of a text passage for $0.00025. While many such services are available, the heterogeneity in their price and performance makes it challenging for users to decide which API or combination of APIs to use for their own data.

In this talk, I will present FrugalML, a principled framework that jointly learns the strengths and weaknesses of each API on different data, and performs an efficient optimization to automatically identify the best sequential strategy to adaptively use the available APIs within a budget constraint. Our theoretical analysis shows that natural sparsity in the formulation can be leveraged to make FrugalML efficient. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Baidu and other providers for tasks including facial emotion recognition, sentiment analysis and speech recognition. Across various tasks, FrugalML can achieve up to 90% cost reduction while matching the accuracy of the best single API, or up to 5% better accuracy while matching the best API’s cost. If time permits, I will also discuss recent follow-up studies on API performance shifts and multi-label APIs.
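To make the idea of a sequential strategy concrete, here is a much-simplified sketch: call a cheap API first and pay for a stronger one only when the cheap prediction is not confident enough. This is not FrugalML itself, which learns its strategy from data under a budget; the `Api` class, the two stub predictors, the cost units, and the fixed threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Prediction = Tuple[str, float]  # (label, confidence in [0, 1])


@dataclass
class Api:
    """Hypothetical wrapper around a paid prediction service."""
    name: str
    cost: float  # cost per call, in arbitrary illustrative units
    predict: Callable[[str], Prediction]


def cascade(text: str, cheap: Api, strong: Api, threshold: float) -> Tuple[str, float]:
    """Return (label, amount spent) for one input.

    The cheap API is always called. The strong API is called only when the
    cheap API's confidence falls below `threshold`.
    """
    label, confidence = cheap.predict(text)
    spent = cheap.cost
    if confidence < threshold:
        label, _ = strong.predict(text)
        spent += strong.cost
    return label, spent


def cheap_predict(text: str) -> Prediction:
    # Toy stand-in: confident only when an obvious keyword is present.
    if "great" in text:
        return ("positive", 0.95)
    if "awful" in text:
        return ("negative", 0.95)
    return ("positive", 0.55)


def strong_predict(text: str) -> Prediction:
    # Toy stand-in for a more accurate, more expensive service.
    negative = any(word in text for word in ("awful", "not", "never"))
    return ("negative" if negative else "positive", 0.99)


CHEAP = Api("cheap-api", 1.0, cheap_predict)
STRONG = Api("strong-api", 10.0, strong_predict)

if __name__ == "__main__":
    texts = ["great phone", "awful battery", "not what I expected"]
    total = 0.0
    for t in texts:
        label, spent = cascade(t, CHEAP, STRONG, threshold=0.8)
        total += spent
        print(f"{t!r}: {label} (spent {spent})")
    print(f"cascade total: {total}, always-strong total: {len(texts) * STRONG.cost}")
```

In this toy setting only the ambiguous input reaches the expensive API, so the cascade spends less than calling the strong API on every input. A learned strategy would choose the APIs and thresholds from data rather than by hand.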

Bio: Amrita Roy Chowdhury is a PhD student at the University of Wisconsin-Madison and is advised by Prof. Somesh Jha. She completed her Bachelor of Engineering in Computer Science from the Indian Institute of Engineering Science and Technology, Shibpur where she was awarded the President of India Gold Medal. Her work explores the synergy between differential privacy and cryptography through novel algorithms that expose the rich interconnections between the two areas, both in theory and practice. She has been recognized as a Rising Star in EECS at MIT, 2021 and UC Berkeley, 2020, and a 2021 Facebook Fellowship finalist. She has also been awarded the 2021 CRA/CCC Computing Innovation Fellowship.

Talk Title: Crypt$\epsilon$: Crypto-Assisted Differential Privacy on Untrusted Servers

Talk Abstract: Differential privacy (DP) is currently the de-facto standard for achieving privacy in data analysis, which is typically implemented either in the “central” or “local” model. The local model has been more popular for commercial deployments as it does not require a trusted data collector. This increased privacy, however, comes at the cost of utility and algorithmic expressibility as compared to the central model.

In this talk, I will be presenting Crypt$\epsilon$, a system and programming framework that (1) achieves the accuracy guarantees and algorithmic expressibility of the central model (2) without any trusted data collector, as in the local model. Crypt$\epsilon$ achieves the “best of both worlds” by employing two non-colluding untrusted servers that run DP programs on encrypted data from the data owners. In theory, straightforward implementations of DP programs using off-the-shelf secure multi-party computation tools can achieve the above goal. However, in practice, they are beset with many challenges like poor performance and tricky security proofs. To this end, Crypt$\epsilon$ allows data analysts to author logical DP programs that are automatically translated to secure protocols that work on encrypted data. These protocols ensure that the untrusted servers learn nothing more than the noisy outputs, thereby guaranteeing DP for all Crypt$\epsilon$ programs. Crypt$\epsilon$ supports a rich class of DP programs that can be expressed via a small set of transformation and measurement operators followed by arbitrary post-processing. Further, I will talk about a novel performance optimization that leverages the fact that the output is noisy. Consequently, Crypt$\epsilon$ achieves performance that is practical for real-world usage.
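The utility gap between the central and local models mentioned in the abstract can be illustrated with a counting query. The sketch below is not Crypt$\epsilon$ and involves no cryptography; it compares two textbook mechanisms, the Laplace mechanism (central) and randomized response (local). The function names and the toy data are assumptions made for illustration.

```python
import math
import random


def central_count(bits, epsilon, rng):
    """Central model: a trusted curator sees the raw bits and releases the
    true count plus Laplace noise of scale 1/epsilon (the count has
    sensitivity 1)."""
    # The difference of two i.i.d. exponentials with rate epsilon is
    # Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return sum(bits) + noise


def randomize_bit(bit, epsilon, rng):
    """Local model: each user reports the true bit with probability
    e^eps / (1 + e^eps) and the flipped bit otherwise."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def local_count(bits, epsilon, rng):
    """Local model: the collector only sees randomized reports and
    debiases their sum to estimate the true count."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    n = len(bits)
    reports = [randomize_bit(b, epsilon, rng) for b in bits]
    # E[sum(reports)] = k * (2p - 1) + n * (1 - p), solved here for k.
    return (sum(reports) - (1.0 - p) * n) / (2.0 * p - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    bits = [1] * 300 + [0] * 700  # toy data: true count is 300
    print("central estimate:", central_count(bits, 1.0, rng))
    print("local estimate:  ", local_count(bits, 1.0, rng))
```

At the same privacy parameter, the central estimate carries noise that does not grow with the number of users, while the error of the local estimate grows with the square root of the number of users. This is the accuracy cost of removing the trusted collector.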

Bio: Xiaoan Ding is a Ph.D. candidate in the Department of Computer Science at the University of Chicago, advised by Prof. Kevin Gimpel. Her interests lie in developing machine learning methods for natural language processing and applying deep learning to language applications. Her research seeks to build data-efficient, resilient, fair, trusted models for text classification and text generation, with her Ph.D. work focusing on developing models and algorithms spanning these directions. In the past, she has interned at the Microsoft Research NLP group working on hallucination detection, Amazon Alexa AI on neural information retrieval, and the Google dialogue group on task-oriented dialogue systems.

Talk Title: Data-Efficient Text Classifier for Robust NLP

Talk Abstract: With the unprecedented progress in deep learning architectures, large-scale training, and learning algorithms, pre-trained models have become pivotal in AI. Concurrently, the definition of model robustness has broadened to cover more aspects: data efficiency, model resilience, fairness, and faithfulness. In this talk, I will focus on data efficiency and model resilience and present my efforts to build robust text classifiers by introducing discrete latent variables into the generative story. In modeling, we parameterize the distributions using standard neural architectures used in conditional language modeling. Our training objective combines generative pretraining and discriminative finetuning. The results show that our generative classifiers outperform discriminative baselines, including BERT-style models, across several challenging experimental settings.

Bio: I am a postdoctoral researcher in the Department of Statistics at Harvard University. My research interests lie at the intersection of high-dimensional statistics and applied probability. Currently, I am excited about understanding phase transitions, universality, and computational-statistical gaps in high-dimensional inference problems. Before joining Harvard, I obtained a Ph.D. in Statistics from Columbia University and a B.Tech. in Electrical Engineering from the Indian Institute of Technology, Delhi.

Talk Title: High-Dimensional Asymptotics for Phase Retrieval with Structured Sensing Matrices

Talk Abstract: Phase Retrieval is the problem of recovering an unknown complex-valued signal vector from the magnitudes of several linear measurements. This problem arises in applications like X-ray crystallography, where it is infeasible to acquire the phase of the measurements. In this talk, I will describe some results regarding the analysis of this problem in the high-dimensional asymptotic regime where the number of measurements and the signal dimension diverge proportionally so that their ratio remains fixed. The measurement mechanism in phase retrieval is specified by a sensing matrix. A limitation of existing high-dimensional analyses of this problem is that they model this matrix as a random matrix with independent and identically distributed (i.i.d.) Gaussian entries. In practice, this matrix is highly structured with limited randomness. I will describe a correction to the i.i.d. sensing model, known as the sub-sampled Haar sensing model, which faithfully captures a crucial orthogonality property of realistic sensing matrices. For the Haar sensing model, I will present a precise asymptotic characterization of the performance of commonly used spectral estimators for solving the phase retrieval problem. This characterization can be leveraged to tune certain parameters involved in the spectral estimator optimally. The resulting estimator is information-theoretically optimal. Next, I will describe an empirical universality phenomenon: the performance curves derived for the Haar model accurately describe the observed performance curves for realistic sensing matrices. Finally, I will present recent progress towards obtaining a theoretical understanding of this universality phenomenon that causes practical sensing matrices to behave like Haar sensing matrices.
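For readers unfamiliar with the setup, the measurement model and a spectral estimator of the kind described in the abstract can be written as follows. The symbols ($x$, $a_i$, $y_i$, $m$, $n$, $\delta$, $\mathcal{T}$) are notation chosen here for illustration and are not taken from the talk.

$$
y_i = |\langle a_i, x \rangle|, \quad i = 1, \dots, m, \qquad x \in \mathbb{C}^n, \qquad \frac{m}{n} \to \delta
$$

$$
M = \frac{1}{m} \sum_{i=1}^{m} \mathcal{T}(y_i)\, a_i a_i^{\mathsf{H}}, \qquad \hat{x} \propto \text{leading eigenvector of } M
$$

Here $a_i^{\mathsf{H}}$ denotes the $i$-th row of the sensing matrix and $\mathcal{T}$ is a scalar preprocessing function applied to each measurement. The choice of $\mathcal{T}$ is presumably among the parameters that the abstract says can be tuned optimally.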

DSI Postdoctoral Scholar 2021-2023

Shi Feng was a postdoctoral fellow at the University of Chicago, working on human-in-the-loop and interpretable NLP. Recently, he has focused on investigating the role of interpretability in the alignment of NLP systems. He holds a PhD from the University of Maryland, supervised by Jordan Boyd-Graber.

Bio: Anjalie is a PhD candidate at the Language Technologies Institute at Carnegie Mellon University and a visiting student at the University of Washington, where she is advised by Yulia Tsvetkov. Her work focuses on the intersection of NLP and computational social science, including both developing NLP models that are socially aware and using NLP models to examine social issues like propaganda, stereotypes, and prejudice. She has presented her work in NLP and interdisciplinary conferences, receiving a nomination for best paper at SocInfo 2020, and she is also the recipient of an NSF graduate research fellowship and a Google PhD fellowship. Prior to graduate school, she received her undergraduate degree in computer science, with minors in Latin and ancient Greek, from Princeton University.

Talk Title: Building Language Technologies for Analyzing Online Activism

Talk Abstract: While recent advances in natural language processing (NLP) have greatly enhanced our ability to analyze online text, distilling broad socially oriented research questions into tasks concrete enough for NLP models remains challenging. In this work, we develop state-of-the-art NLP models grounded in frameworks from social psychology in order to analyze two social media movements: online media coverage of the #MeToo movement in 2017-2018 and tweets about #BlackLivesMatter protests in 2020. In the first part, we show that despite common perception of the #MeToo movement as empowering, media coverage of events often portrayed women as sympathetic but unpowerful. In the second, we show that positive emotions like hope and optimism are prevalent in tweets with pro-BlackLivesMatter hashtags and significantly correlated with the presence of on-the-ground protests, whereas anger and disgust are not. These results contrast with stereotypical portrayals of protesters as perpetuating anger and outrage. Overall, our work provides insight into social movements and debunks harmful stereotypes. We aim to bridge the gap between NLP, where models are often not designed to address socially oriented questions, and computational social science, where state-of-the-art NLP has often been underutilized.

Bio: Neil Gaikwad is a doctoral scholar at MIT, specializing in Human-centered AI and Public Policy for Sustainable Systems. He develops computational and data science lenses to address public policy issues concerning sustainability and international development. This research focuses on the community-based design of data-intensive public interest computing systems to advance equitable public policy interventions for improving the livelihood of historically disadvantaged populations affected by climate change, structural inequalities, and human rights violations. Neil’s scholarship has resulted in publications at AI & HCI conferences, talks at UN and EU global policy forums, environmental art exhibitions, and featured articles in the New York Times, New Scientist, WIRED, and the Wall Street Journal. He has mentored over 20 students who pursued careers in research and published influential scholarship that has shifted the discourse on AI fairness. His research, teaching, leadership, and commitment to diversity, inclusion and belonging have been recognized with a Facebook Ph.D. Fellowship in Computational Social Science, an MIT Human Rights & Technology Fellowship, a William Asbjornsen Albert Memorial Science & Engineering MIT Fellowship, an MIT Graduate Teaching Award, and the Karl Taylor Compton Prize (the highest student award at MIT). Neil earned a master’s degree from the School of Computer Science at Carnegie Mellon University.

Talk Title: Community-based designs of Human-centered AI and Public Policy for Global Inclusion, Resilience, and Sustainability

Talk Abstract: The field of computational social science has been traditionally focused on analyzing large-scale human and social dynamics in social media. However, millions of people from historically disadvantaged communities, harmed by climate change, structural inequalities, and human rights violations, remain missing from the digital databases, Internet traces, census records, and mainstream policy design processes. While data-intensive algorithms could help improve the livelihood of these at-risk populations, the key challenge is that their prejudiced designs overlook the interconnected nature of colonial, socioeconomic, institutional, and ecological processes that govern our world and amplify algorithmic harms.

In this talk, I will present a data science research program demonstrating how we can bring digitally invisible communities to the center of designing data-intensive public interest computing systems for collaborative public policy decision-making concerning sustainable development. During the talk, I will discuss resilience and adaptation mechanisms required to address climate disasters and agricultural market failures that have led to over 300,000 farmer suicides, food systems crises, and forced displacements of women and children in the Global South. Finally, I will showcase how human-AI collaboration system designs, coupled with participatory and remote sensing satellite datasets, can advance equitable policymaking and make socio-technical systems (such as markets and food systems) more resilient to the perils of sustainability.

By studying the interconnected dynamics of socioeconomic and environmental processes and their impact on digitally missing and underserved communities, this scholarship broadens the horizon of computational social science research beyond the Internet ecosystems. It has already led to the informed data science research and policy program in Data-driven Humanitarian Mapping that convenes a global community of stakeholders from industry, academia, NGOs, and governments to tackle overarching sustainability challenges posed by climate change and the COVID-19 pandemic.

DSI Postdoctoral Scholar 2021-2022

Bio: Sainyam Galhotra worked as a CI postdoctoral fellow at the University of Chicago. He received his PhD from the University of Massachusetts Amherst. Previously, he was a researcher at Xerox Research and received his Bachelor’s degree in computer science from the Indian Institute of Technology, Delhi. His research is broadly in the area of data management with a specific focus on designing algorithms to not only be efficient but also transparent and equitable in their decision-making capabilities. He is a recipient of the Best Paper Award in FSE 2017 and Most Reproducible Paper Award in SIGMOD 2017 and 2018. He is a DAAD AInet Fellow and the first recipient of the Krithi Ramamritham Award at UMass for contribution to database research.

Bio: Mengdi Huai is a Ph.D. candidate in the Department of Computer Science at the University of Virginia, advised by Professor Aidong Zhang. Her research interests are in the general area of data mining and machine learning, with an emphasis on the aspects of model transparency, security, privacy and algorithm design. Mengdi’s research has been published in international conferences and journals, including top conferences in data mining and AI (KDD, AAAI, IJCAI, NeurIPS, WWW, ICDM, SDM, BIBM) and top journals (TKDD, NanoBioscience). She has received multiple awards, including the Rising Star in EECS at MIT, the John A. Stankovic Research Award, the Sture G. Olsson Fellowship in Engineering, and the Best Paper Runner-up for KDD2020.

Talk Title: Malicious Attacks against Deep Reinforcement Learning Interpretations

Talk Abstract: The past years have witnessed the rapid development of deep reinforcement learning (DRL), which incorporates deep learning into the solution and makes decisions from unstructured input data without manual engineering of the state space. However, the adoption of deep neural networks makes the decision-making process of DRL opaque. Motivated by this, various interpretation methods for DRL have been proposed. Those interpretation methods make an implicit assumption that they are performed in a reliable and secure environment. However, given their data-driven nature, these DRL interpretation methods themselves are potentially susceptible to malicious manipulations. In spite of the prevalence of malicious attacks, there is no existing work studying the possibility and feasibility of malicious attacks against DRL interpretations. To bridge this gap, in my work, I investigated the vulnerability of DRL interpretation methods. Specifically, I introduced the first study of adversarial attacks against DRL interpretations, and proposed an optimization framework from which the optimal adversarial attack strategy can be derived. In addition, I studied the vulnerability of DRL interpretation methods to model poisoning attacks, and presented an algorithmic framework to rigorously formulate the proposed model poisoning attack. Finally, I conducted both theoretical analysis and extensive experiments to validate the effectiveness of the proposed malicious attacks against DRL interpretations.

Bio: Haojian Jin is a final-year Ph.D. student in the Human-Computer Interaction Institute at Carnegie Mellon University, advised by Jason Hong and Swarun Kumar. Haojian’s research explores new software architecture and toolkits that make it easier for users, developers, and auditors to protect users’ privacy. His work has been recognized with a UbiComp Gaetano Borriello Outstanding Student Award, Research Highlights at Communications of the ACM and GetMobile, and best paper awards at Ubicomp and ACM Computing Reviews.

Talk Title: My Data is None of Your Business: Separation of Concerns for Privacy through Modular Privacy Flows

Talk Abstract: The wide-scale deployment of tiny sensors, coupled with improvements in recognition and data mining algorithms, will enable numerous new applications for personal and societal benefits. But we have also seen many undesired data-driven applications deployed, such as price discrimination and shopping behavior persuasion. Once users’ data is out of their direct control, it may potentially be used at places and times far removed from its original context. How can we computer scientists assure users that a data-driven world is the one everyone wants to live in?

In this talk, I will introduce my thesis work on separating concerns for privacy through a new software design pattern, named Modular Privacy Flows. Rather than continuing to build privacy support in an ad-hoc manner, my research demonstrates how we can separate the privacy logic from the application logic. This separation can help users gain independent and unified control of their data while reducing the burden on developers and auditors of ensuring privacy.

Bio: I am a Postdoc working with Robert Nowak at the University of Wisconsin. Previously, I was a Postdoc at the Paul G. Allen School of Computer Science & Engineering at the University of Washington under Kevin Jamieson. I completed my PhD in the Electrical Engineering and Computer Science Department at the University of Michigan where my advisor was Clayton Scott. Prior to that, I double-majored in mathematics and philosophy at the University of Chicago. My research focuses on pure exploration multi-armed bandits, recommender systems, and nonparametric estimation. I am also interested in applications of machine learning that promote the social good. As a Data Science for Social Good fellow at the University of Chicago in 2015, I helped develop the Legislative Influence Detector.

Talk Title: Practical Algorithms for Interactive Learning with Generic Function Classes

Talk Abstract: We consider interactive learning in the realizable setting and develop a general framework to handle problems ranging from best arm identification to active classification. We begin our investigation with the observation that agnostic algorithms cannot be minimax-optimal in the realizable setting. Hence, we design novel algorithms for the realizable setting that are nearly minimax optimal, computationally efficient, and general-purpose, accommodating a wide variety of function classes including kernel methods, Hölder smooth functions, and convex functions. The sample complexities of our algorithms can be quantified in terms of well-known quantities like the extended teaching dimension and haystack dimension. However, unlike algorithms based directly on those combinatorial quantities, our algorithms are computationally efficient. To achieve computational efficiency, our algorithms sample from the version space using Monte Carlo “hit-and-run” algorithms instead of maintaining the version space explicitly. Our approach has two key strengths. First, it is simple, consisting of two unifying, greedy algorithms. Second, our algorithms have the capability to seamlessly leverage prior knowledge that is often available and useful in practice. In addition to our new theoretical results, we demonstrate empirically that our algorithms are competitive with and in some cases outperform Gaussian process UCB methods. This talk is based on work to appear in NeurIPS 2021.

Bio: Aditi Krishnapriyan is the 2020 Alvarez Fellow in Computing Sciences at Lawrence Berkeley National Laboratory and UC Berkeley. Previously, she received a PhD at Stanford University, supported by the Department of Energy Computational Science Graduate Fellowship. During her PhD, she also spent time working on machine learning research at Los Alamos National Laboratory, Toyota Research Institute, and Google Research. Her research interests include combining domain-driven scientific mechanistic modeling with data-driven machine learning methodologies to accelerate and improve spatial and temporal modeling.

Talk Title: Integrating Machine Learning with Physics-Based Spatial and Temporal Modeling

Talk Abstract: Deep learning has achieved great success in numerous areas, and is also seeing increasing interest in scientific applications. However, challenges still remain: scientific phenomena are difficult to model, and can also be limited by a lack of training data. As a result, scientific machine learning approaches are being developed by incorporating domain knowledge into the machine learning process to enable more accurate and general predictions. One such popular approach, colloquially known as physics-informed neural networks (PINNs), incorporates domain knowledge as soft constraints on an empirical loss function. I will discuss the challenges associated with such an approach, and show that by changing the learning paradigm to curriculum regularization or sequence-to-sequence learning, we can achieve significantly lower error. Another approach, colloquially known as ODE-Nets, aims to couple dynamical systems/numerical methods with neural networks. I will discuss how exploiting techniques from numerical analysis for these systems can enable learning continuous, function-to-function mappings for scientific problems.

Bio: Amanda Kube is a Ph.D. Candidate in the Division of Computational and Data Sciences at Washington University in St. Louis working with Dr. Sanmay Das in the Department of Computer Science and Dr. Patrick Fowler in the Brown School. She received her B.S. in Psychological and Brain Sciences and Mathematics with a concentration in Statistics from Washington University in St. Louis where she also received an M.S. in Data Analytics and Statistics. Her research interests involve the intersection of computation and the social sciences. Her current work combines machine learning and human decision-making to inform fair and efficient service allocations for homeless families.

Talk Title: Integrating Human Priorities and Data-Driven Improvements in Allocation of Scarce Homeless Services to Households in Need

Talk Abstract: Homelessness is a major public health issue in the United States that has gained visibility during the COVID-19 pandemic. Despite efforts at the federal level, rates of homelessness are not decreasing. Homeless services are a scarce public resource and current allocation systems have not been thoroughly investigated. Algorithmic techniques excel at modeling complex interactions between features and therefore have potential to model effects of homeless services at the individual level. These models can reason counterfactually about the effects of different services on each household and resulting predictions can be used for matching households to services. The ability to model heterogeneity in treatment effects of services provides the potential for “precision public health” where allocation of services is based on data-driven predictions of which service will lead to better outcomes. I discuss the scarce resource allocation problem as it applies to homeless service delivery, and the ability to improve upon the current allocation system using algorithmic techniques. I compare prediction algorithms to each other as well as to the ability of the general public to make these decisions. As homeless services are scarce public goods, it is vital to ensure allocations are not only efficient, but fair and ethical. I discuss efforts to ensure fair decisions and to understand how people prioritize households who should receive scarce homeless services. I also discuss future work and next steps as well as policy implications.

Bio: Lihua Lei is a postdoctoral scholar in Statistics at Stanford University, advised by Emmanuel Candès. His current research focuses on developing rigorous statistical methodologies for uncertainty quantification in applications involving complicated decision-making processes, to enhance reliability, robustness and fairness of the system. Prior to joining Stanford, he obtained his Ph.D. in statistics at UC Berkeley, advised by Peter Bickel and Michael Jordan. His research areas include causal inference, multiple hypothesis testing, network clustering, and stochastic optimization.

Talk Title: Distribution-Free Assessment of Population Overlap in Observational Studies

Talk Abstract: Overlap in baseline covariates between treated and control groups, also known as positivity or common support, is one of the most fundamental assumptions in observational causal inference. Assessing this assumption is often ad hoc, however, and can give misleading results. For example, the common practice of examining the empirical distribution of estimated propensity scores is heavily dependent on model specification and has poor uncertainty quantification. In this paper, we propose a formal statistical framework for assessing the extrema of the population propensity score; e.g., the propensity score lies in [0.1, 0.9] almost surely. We develop a family of upper confidence bounds, which we term O-values, for this quantity. We show these bounds are valid in finite samples so long as the observations are independent and identically distributed, without requiring any further modeling assumptions on the data generating process. We also use extensive simulations to show that these bounds are reasonably tight in practice. Finally, we demonstrate this approach using several benchmark observational studies, showing how to build our proposed method into the observational causal inference workflow.

Bio: Konstantin Mishchenko received his double-degree MSc from Paris-Dauphine and École normale supérieure Paris-Saclay in 2017. He did his PhD under the supervision of Peter Richtárik, and had research internships at Google Brain and Amazon. Konstantin has been recognized as an outstanding reviewer for NeurIPS19, ICML20, AAAI20, ICLR21, and ICML21. He has published 8 conference papers at ICML, NeurIPS, AISTATS, and UAI, 1 journal paper at SIOPT, 6 workshop papers, and co-authored 8 preprints, some of which are currently under peer review. In 2021, Konstantin is joining the group of Alexandre d’Aspremont and Francis Bach in Paris as a Postdoctoral Researcher.

Talk Title: Optimization for Federated Learning

Talk Abstract: Optimization has been a vital tool for enabling the success of machine learning. In the recently introduced paradigm of federated learning, where devices or organizations unite to train a model without revealing their private data, optimization has been particularly nontrivial. The peculiarities of federated learning that make it difficult include unprecedented privacy constraints, the difficulty of communication with a server, and high heterogeneity of the data across the participating parties. Nevertheless, the potential applications of federated learning, such as machine learning for health care, banking, and smartphones, have sparked global interest in the problem and quick growth in the number of publications.

In this talk, we will discuss some of the recent advances in optimization for federated learning. We will formulate the key challenges in communication efficiency and personalization and propose ways for tackling them that are motivated by theory. To this end, we will discuss the convergence properties of some existing and new federated learning algorithms that leverage on-device (local) iterations as a way to limit communication.

Bio: Faidra Monachou is a final-year Ph.D. candidate in Operations Research at the Department of Management Science and Engineering at Stanford University. She is interested in market and information design, with a particular focus on the interplay between policy design and discrimination in education and labor. Faidra’s research has been supported by various scholarships and fellowships from Stanford Data Science, Stanford HAI, Google, and other organizations. She won the Best Paper with a Student Presenter Award at ACM EAAMO’21. She co-chaired the 2020 Mechanism Design for Social Good workshop and co-organized the 2021 Stanford Data Science for Social Good program. Faidra received her undergraduate degree in Electrical and Computer Engineering from the National Technical University of Athens in Greece.

Talk Title: Discrimination, Diversity, and Information in Selection Problems

Talk Abstract: Despite the large empirical literature on disparities in college admissions, our theoretical understanding is limited. In this talk, I will introduce a theoretical framework to study how a decision-maker concerned with both merit and diversity selects candidate students under imperfect information, limited capacity, and legal constraints. Motivated by recent decisions to drop standardized testing in admissions, we apply this framework to study how information differences lead to disparities across equally skilled groups and quantify the trade-off between information and access in test-free and test-based policies with and without affirmative action. Using application and transcript data from the University of Texas at Austin, we illustrate that there exist practical settings where dropping standardized testing improves or worsens both merit and diversity. Furthermore, we extend this model to demonstrate how privilege differences lead to intra-group disparities and establish that the direction of discrimination at the observable level may differ from the unobservable level. We compare common policies used in practice and take an optimization approach to design an optimal policy under legal constraints.

Bio: Omar Montasser is a fifth-year PhD student at TTI-Chicago advised by Nathan Srebro. His main research interest is the theory of machine learning. Recently, his research has focused on understanding and characterizing adversarially robust learning, and designing algorithms with provable robustness guarantees under different settings. His work has been recognized by a best student paper award at COLT (2019).

Talk Title: What, How and When can we Learn Adversarially Robustly?

Talk Abstract: In this talk, we will discuss the problem of learning an adversarially robust predictor from clean training data. That is, learning a predictor that performs well not only on future test instances, but also when these instances are corrupted adversarially. There has been much empirical interest in this question, and in this talk we will take a theoretical perspective and see how it leads to practically relevant insights, including: the need to depart from an empirical (robust) risk minimization approach, and thinking of what kind of accesses and reductions can allow learning.

Bio: Jeffrey Negrea is a 5th year Ph.D. candidate and Vanier scholar at the University of Toronto in the department of Statistical Sciences, and a graduate student researcher at the Vector Institute, working with Daniel Roy on foundational problems in computational statistics, machine learning, and sequential decision making. His research focuses on questions of reliability and robustness for statistical and machine learning methods. His contributions are broad: he has recent work addressing robustness to the IID assumption in sequential decision making, the role of regularization in statistical learning, the connection between stochastic optimization and uncertainty quantification, and approximation methods in MCMC. Previously, Jeff completed his B.Math. at the University of Waterloo, and his M.Sc. in Statistics at the University of Toronto.

Talk Title: Adapting to failure of the IID assumption for sequential prediction

Talk Abstract: We consider sequential prediction with expert advice when data are generated from distributions varying arbitrarily within an unknown constraint set. We quantify relaxations of the classical IID assumption in terms of these constraint sets, with IID sequences at one extreme and adversarial mechanisms at the other. The Hedge algorithm, long known to be minimax optimal in the adversarial regime, was recently shown to be minimax optimal for IID data. We show that Hedge with deterministic learning rates is suboptimal between these extremes, and present new algorithms that adaptively achieve the minimax optimal rate of regret with respect to our relaxations of the IID assumption, and do so without knowledge of the underlying constraint set. We analyze our algorithm using the follow-the-regularized-leader framework, and prove it corresponds to Hedge with adaptive learning rates.
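For readers unfamiliar with Hedge, the following is a minimal sketch of the standard exponential-weights update with a fixed (deterministic) learning rate. It is not the adaptive algorithm presented in the talk; the function name `hedge_regret`, the loss matrix, and the choice of learning rate are illustrative assumptions, and losses are assumed to lie in [0, 1].

```python
import math


def hedge_regret(losses, eta):
    """Run Hedge with a fixed learning rate and return its regret.

    losses: list of rounds; each round is a list of per-expert losses in [0, 1].
    eta:    fixed learning rate (hypothetical choice supplied by the caller).
    """
    n_experts = len(losses[0])
    cumulative = [0.0] * n_experts  # cumulative loss of each expert so far
    learner_loss = 0.0              # learner's cumulative expected loss

    for round_losses in losses:
        # Weights are proportional to exp(-eta * cumulative loss).
        # Subtracting the minimum keeps the exponentials numerically stable.
        best = min(cumulative)
        raw = [math.exp(-eta * (c - best)) for c in cumulative]
        normalizer = sum(raw)
        weights = [r / normalizer for r in raw]

        # The learner suffers the weighted average of the experts' losses.
        learner_loss += sum(w * l for w, l in zip(weights, round_losses))

        # Only now are this round's losses revealed and accumulated.
        cumulative = [c + l for c, l in zip(cumulative, round_losses)]

    # Regret is measured against the best single expert in hindsight.
    return learner_loss - min(cumulative)


# Usage: two experts, one always right and one always wrong, over T rounds.
T = 100
example_losses = [[0.0, 1.0] for _ in range(T)]
example_eta = math.sqrt(8 * math.log(2) / T)  # classical tuning for a known horizon
print(hedge_regret(example_losses, example_eta))
```

With this classical tuning, the regret against the best expert is at most sqrt(T ln(N) / 2) for N experts and any loss sequence in [0, 1], which is the adversarial guarantee the abstract refers to.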

Bio: Abhilasha is a Ph.D. student at Carnegie Mellon University, working in the Language Technologies Institute. Her research focuses on understanding neural model performance, and consequently developing robust and trustworthy NLP technologies. She has published papers in premier NLP conferences and has been the recipient of the outstanding reviewer awards at ACL and EMNLP. Her work has also received the “Area Chair Favorite Paper” award at COLING 2018. In the past, she interned at Allen Institute for AI and Microsoft Research, where she worked on understanding how deep learning models process challenging semantic phenomena in natural language.

Talk Title: Developing User-Centric Models for Question Answering

Talk Abstract: Everyday users now benefit from powerful QA technologies in a range of consumer-facing applications. Voice assistants such as Amazon Alexa or Google Home have brought natural language technologies to several million homes globally. Yet, even with millions of users now interacting with these technologies on a daily basis, there has been surprisingly little research attention devoted to studying the issues that arise when people use QA systems. Traditional QA evaluations do not reflect the needs of many users who can benefit from QA technologies. For example, users with a range of visual and motor impairments would prefer the option to interact with voice interfaces for efficient text entry. Keeping these needs in mind, we construct evaluations considering the interfaces through which users interact with QA systems. We analyze and mitigate errors introduced by three interface types that could be connected to a QA engine: speech recognizers converting spoken queries to text, keyboards used to type queries into the system, and translation systems processing queries in other languages. Our experiments and insights present a useful starting point for both practitioners and researchers to develop usable question-answering systems.

Bio: Alexander Rodriguez is a Ph.D. student in Computer Science at Georgia Tech advised by Prof. B. Aditya Prakash. His research interests include data science and AI, with emphasis on time-series and real-world network problems motivated by epidemiology and community resilience. In response to COVID-19, he has been the student lead in his research group's effort to forecast the progression of the pandemic, and these predictions have been featured on the CDC’s website and FiveThirtyEight.com. His work has been published in AAAI, KDD, NeurIPS, and BigData, and was awarded 1st place in the Facebook/CMU COVID-19 Challenge and 2nd place in the C3.ai COVID-19 Grand Challenge. He has also served as a workshop organizer for BPDM @ KDD 2017 and epiDAMIK @ KDD 2021.

Talk Title: Deep Learning Frameworks for Epidemic Forecasting

Talk Abstract: Our vulnerability to emerging infectious diseases has been illustrated by the devastating impact of the COVID-19 pandemic. Forecasting epidemic trajectories (such as future incidence over the next four weeks) gives policymakers a valuable input for designing effective healthcare policies and optimizing supply chain decisions. However, this is a non-trivial task with multiple open questions. In this talk, I will present our neural frameworks for epidemic forecasting, using seasonal influenza and COVID-19 as examples. I will introduce our efforts in three research directions: (1) awareness of multiple facets of the epidemic dynamics, (2) coping with challenges from using public health data, and (3) readiness to provide actionable forecasts and insights. I will first discuss our deployed model for predicting COVID-associated indicators, which has been recognized as a top short-term forecasting model among all models submitting predictions to the CDC. I will also introduce how to use deep learning to adapt a historical flu model to an emerging scenario where COVID and flu coexist by leveraging auxiliary data sources. Next, I will introduce deep learning frameworks for incorporating expert guidance, principled uncertainty quantification for well-calibrated forecasts, and handling data revisions for refining forecasts. Finally, I will share some future research directions.

Bio: Martin Saveski is a postdoctoral scholar at the Management Science and Engineering department at Stanford University. He completed his Ph.D. at MIT in September 2020. Martin’s broad research area is Computational Social Science. He uses Causal Inference and Social Network Analyses to study pressing social problems online, such as political polarization and toxicity. He has also made methodological contributions in the areas of causal inference in networks and recommender systems. Previously, he has interned at Facebook, LinkedIn, Amazon, and Yahoo. His work has been covered by major media outlets, including the New York Times, NPR, MIT Tech Review, and others.

Talk Title: Engaging Politically Diverse Audiences on Social Media

Talk Abstract: In this talk, I will present our study of how political polarization is reflected in the language used by media outlets to promote their content online and what we can do to reduce it. We tracked the Twitter posts of several media outlets over the course of more than three years (566K tweets), and the engagement with these tweets from other users (104M retweets). We then used this data to model the relationship between the tweet text and the political diversity of the audience. We built a tool that integrates our models and helps journalists craft tweets that are engaging to a politically diverse audience, guided by the model predictions. To test the real-world impact of the tool, we partnered with the award-winning PBS documentary series Frontline and ran a series of advertising experiments on Twitter testing how tens of thousands of users respond to the tweets. We found that in seven out of the ten experiments, the tweets selected by our model were indeed engaging to a more politically diverse audience, illustrating the effectiveness of our tool. I will close by discussing the methodological challenges and opportunities in using advertisements to test interventions on social media platforms.

Bio: Liyue Shen is a final-year Ph.D. candidate in Electrical Engineering at Stanford University, co-advised by Professor John Pauly and Professor Lei Xing. Her research focuses on Medical AI, which spans the interdisciplinary research areas of AI/ML, computer vision, biomedical imaging and data science. Her dissertation research develops efficient AI/ML-driven computational algorithms and techniques for carrying out biomedical imaging and informatics to tackle real-world biomedicine and healthcare problems through engineering and data science. Her work has been published in both computer vision conferences (ICCV, CVPR) and medical journals (Nature Biomedical Engineering, IEEE TMI, MedIA). She is the recipient of the Stanford Bio-X Bowes Graduate Student Fellowship (2019-2021) and was selected as a Rising Star in EECS by MIT (2021). She co-organized the Women in Machine Learning (WiML) Workshop at ICML 2021 and the Machine Learning for Healthcare (ML4H) Workshop at NeurIPS 2021. Prior to her PhD, Liyue received her bachelor’s degree in Electronic Engineering from Tsinghua University.

Talk Title: Exploiting Prior Knowledge in Physical World Incorporated with Machine Learning for Solving Medical Imaging Problems

Talk Abstract: Medical imaging is crucial for image-guided clinical patient care. In my research in the interdisciplinary area of medical AI, I develop efficient machine learning algorithms for medical imaging by exploiting prior knowledge from the physical world ("exploit what you know") and incorporating it into machine learning models.

I present two main directions of my research. First, since data-driven machine learning methods always suffer from limitations in generalizability, reliability and interpretability, I proposed physics-aware and geometry-informed deep learning frameworks for radiation-reduced sparse-view CT and accelerated MR imaging that exploit geometry and physics priors from the imaging system. By incorporating these priors, the trained deep networks show more robust generalization across patients and better interpretability. Second, motivated by a unique characteristic of medical imaging, namely that patients are often scanned serially over time during clinical treatment, so that earlier images provide abundant prior knowledge of the patient’s anatomy, I proposed a prior embedding method to encode internal information of image priors through coordinate-based neural representation learning. Since this method requires no training data from external subjects, it relaxes the burden of data collection, and can be easily generalized across different imaging modalities and anatomies. Following this, I developed a novel algorithm of temporal neural representation learning for longitudinal study. Combining both physics priors and image priors, I showed that the proposed algorithm can successfully capture subtle yet significant structural changes such as tumor progression in sparse-sampling image reconstruction, which can be applied to tackle real-world challenges in cancer patient treatment and radiation therapy.

Bio: Guanya Shi received a B.E. in mechanical engineering (summa cum laude) from Tsinghua University in 2017. He is currently working toward a Ph.D. degree in computing and mathematical sciences at the California Institute of Technology. He was a deep learning research intern at NVIDIA in 2020. His research interests are centered around the intersection of machine learning and control theory, spanning the entire spectrum from theory and foundations, through algorithm design, to solving cutting-edge problems and demonstrating new capabilities in robotics and autonomy. Guanya was the recipient of several awards, including the Simoudis Discovery Prize and the WAIC Yunfan Award.

Talk Title: Safety-Critical Learning and Control in Dynamic Environments: Towards Unified Theory and Learned Robotic Agility

Talk Abstract: Deep-learning-based methods have made exciting progress in many decision-making problems such as playing complicated strategy games. However, for complex real-world settings, such as agile robotic control in hazardous or poorly-sensed environments (e.g., autonomous driving), end-to-end deep-learning-based methods are often unreliable. In this talk, I will first present the Neural-Control Family, which is a family of nonlinear deep-learning-based control methods with stability, safety, and robustness guarantees. The Neural-Control Family bridges learning and control theory in a unified framework, and demonstrates new capabilities in agile robot control (e.g., agile flight maneuvers in unknown strong wind conditions). In the second part, I will discuss progress towards establishing clean interfaces that fundamentally connect learning and control. A strong focus will be on non-asymptotic analysis for online learning and control. In particular, we will discuss the intersection of representation learning and adaptive control, no-regret and competitive control, and safe exploration in dynamical systems.

Bio: Tyler is a Ph.D. candidate in Computer Science at the University of Chicago, advised by Kyle Chard and Ian Foster. His research interests lie at the intersection of data management, data science, and HPC, focusing on enabling scientists to maximize the utility of massive amounts of data. His work has culminated in the design of the open-source system Xtract that can intelligently formulate metadata extraction workflows for data stored in heterogeneous file formats across leadership-scale computing facilities. Before joining the University of Chicago, he received his B.A. in Applied Mathematics and Statistics from Macalester College.

Talk Title: Enabling Data Utility Across the Sciences

Talk Abstract: Scientific data repositories are generally chaotic—files spanning heterogeneous domains, studies, and users are stuffed into an increasingly-unsearchable data swamp without regard for organization, discoverability, or usability. Files that could contribute to scientists’ future research may be spread across storage facilities and submerged beneath petabytes of other files, rendering manual annotation and navigation virtually impossible. To remedy this lack of navigability, scientists require a rich search index of metadata, or data about data, extracted from individual files. In this talk, we will explore automated metadata extraction workflows for converting dark data swamps into navigable data collections, given no prior knowledge regarding each file’s schema or provenance. I enable such extraction from files of vastly different structures by building a robust suite of “extractors” that leverage data scientific methods (e.g., keyword analysis, entity recognition, and file type identification) in order to maximize our body of knowledge about a diversity of files.

In this talk, I outline the construction, optimization, and evaluation of Xtract—a scalable metadata extraction system—that automatically constructs extraction plans for files distributed across remote cyberinfrastructure. I illustrate the scale challenges in processing these data, and outline techniques to maximize extraction throughput, by analyzing Xtract’s performance on three real science data sets.

Bio: Jennifer is a PhD candidate in Computing and Mathematical Sciences at Caltech, advised by Pietro Perona and Yisong Yue. Her research is on machine learning for scientific applications, in order to enable efficient interactions between scientists and data analysis systems. Her current work is at the intersection of machine learning and behavior analysis, with projects on learning behavioral representations, social behavior recognition, interpretable modeling, and keypoint discovery. She worked on organizing two interdisciplinary workshops in 2021, on affective computing (AUVi) and multi-agent behavior modeling (MABe). In particular, MABe is organized with the Kennedy Lab at Northwestern, which aims to connect researchers across science and data science. Her work was awarded best student paper at CVPR2021 and is supported by the Kortschak Scholars Program and a Natural Sciences and Engineering Research Council of Canada (NSERC) Postgraduate Fellowship.

Talk Title: AI for Science: Learning from Experts and Data

Talk Abstract: In many fields, the amount of recorded scientific data is increasing much faster than the speed at which researchers can analyze and interpret them. For example, recorded videos of animal behavior over a few days can take domain experts months to analyze. Innovations in data science, such as machine learning, provide a promising direction to enable scientists to scalably perform data-driven experiments. However, scientific applications raise a number of challenges for existing methods: data creation is expensive, model interpretability is important, and tools are often needed to translate algorithmic improvements to practical benefits.

To address these challenges, my current work has focused on incorporating domain knowledge into machine learning to reduce human effort for data analysis. I will discuss methods to improve the sample-efficiency and interpretability of models in the context of behavior modeling. To learn annotation-sample efficient representations, we developed a framework to unify self-supervision with weak programmatic supervision from domain experts. We demonstrated that our method reduces annotation requirements by up to a factor of 10 without compromising accuracy, compared to previous approaches. Furthermore, we investigate program synthesis as a promising direction to produce interpretable descriptions of behavior. We integrate interpretable programs from our method with an existing tool in behavioral neuroscience. These interdisciplinary approaches of machine learning with experts in the loop are important to broaden the application of data science across scientific domains.

Bio: Wei is a PhD candidate in the Department of Computer Science & Engineering at Washington University in St. Louis, advised by Chien-Ju Ho. His research interests are in online learning, algorithmic economics, optimization, and behavioral experiments, with a focus on developing theoretically rigorous, empirically grounded frameworks to understand and design human-centered algorithms. He received his B.E. degree from Tianjin University in 2017.

Talk Title: Learning with Understanding: Human Behavior in Algorithm Design

Talk Abstract: Algorithms increasingly pervade every sphere of human life and thus have great potential to reshape various sectors of our modern society. It is therefore important to understand the role humans play in the design of algorithms. However, human involvement also creates unique challenges. Humans might be careless, strategic, or have behavioral biases.

In this talk, I will present two works from my own research on dealing with these challenges, theoretically and empirically, when humans are involved in algorithm design. First, I will describe the problem of learning with biased human behavior. In this problem, the learner cannot directly observe the realized reward of an action but can only observe biased human feedback on the realized reward. I explored two natural human feedback models. Our results show that a small deviation in the user behavior model and/or the design of the information structure could significantly impact the overall system outcome.

I then step back and examine whether the standard behavior models capture human behavior in practice by utilizing behavioral experiments. I studied this question in AI-assisted decision-making, where AI intelligently abstracts out useful information from a large amount of data. Humans then review the information output by the AI and make the decision. I have run behavioral experiments to characterize humans’ responses in practice and established an empirically grounded human behavior model.

Bio: I am a postdoctoral researcher in the Machine Learning Foundations group at Microsoft Research Redmond. My research interests are in designing algorithms for massive datasets and large-scale machine learning, especially in the contexts of high-dimensional metric data, fast linear algebra, and learning on data streams. My recent work is focused on harnessing the power of big data and machine learning to guide us toward better algorithm design. I received my PhD from the EECS department at MIT in September 2020, and have spent time as a research intern at Microsoft, Amazon and VMware.

Talk Title: On the Role of Data in Algorithm Design

Talk Abstract: Recently, there has been a growing interest in harnessing the power of big datasets and modern machine learning for designing new scalable algorithms. This invites us to rethink the role of data in algorithm design: not just the input to pre-defined algorithms, but also a factor that enters the algorithm design process itself, driving it in a strong and possibly automated manner. In this talk, I will describe my work on data-driven and learning-based algorithms for high-dimensional metric spaces and nearest neighbor search. In particular, I will show that data-dependence is necessary for optimal compressed representations of high-dimensional Euclidean distances, and that neural networks can be used to build better data structures for nearest neighbor search.

January 2021 Rising Stars

Talk Title: Evaluating the Impact of Entity Resolution in Social Network Metrics

Watch Abby’s Research Lightning Talk

Talk Abstract: Modern databases are filled with potential duplicate entries, caused by misspellings, changes of address, or differences in abbreviations. The probabilistic disambiguation of entries is often referred to as entity resolution.

Entity resolution of individuals (nodes) in relational datasets is often viewed as a pre-processing step in network analysis. Studies in bibliometrics have indicated that entity resolution changes network properties in citation networks, but little research has used real-world social networks that vary in size and type. We present a novel perspective on entity resolution in networks—where we propagate error from the entity resolution process into downstream network inferences. We also seek to understand how match thresholds in unsupervised entity resolution affect both global and local network properties, such as the degree distribution, centrality, transitivity, and motifs such as stars and triangles. We propose a calibration of these network metrics given measures of entity resolution quality, such as node “splitting” and “lumping” errors.

We use a respondent-driven sample of people who use drugs (PWUD) in Appalachia and a longitudinal network study of Chicago-based young men who have sex with men (YMSM) to demonstrate the implications this has for social and public health policy.

Bio: Abby Smith is a Ph.D. Candidate in Statistics at Northwestern University. Her work centers around evaluating the impact of entity resolution error in social network inferences. She is particularly interested in collaborative research and data science for social good applications, and most recently served as a Solve for Good consultant at the mHealth nonprofit Medic Mobile. Abby is passionate about building community for women in statistics and data science in Chicago, and serves as a WiDS Ambassador and R-Ladies: Chicago board member. She holds a Masters in Statistical Practice and a B.S. in Mathematics, both from Carnegie Mellon.

Talk Title: Modeling the Impact of Social Determinants of Health on Covid-19 Transmission and Mortality to Understand Health Inequities

Watch Abby’s Spotlight Research Talk

Talk Abstract: The Covid-19 pandemic has highlighted drastic health inequities, particularly in cities such as Chicago, Detroit, New Orleans, and New York City. Reducing Covid-19 morbidity and mortality will likely require an increased focus on social determinants of health, given their disproportionate impact on populations most heavily affected by Covid-19. A better understanding of how factors such as household income, housing location, health care access, and incarceration contribute to Covid-19 transmission and mortality is needed to inform policies around social distancing and testing and vaccination scale-up.

This work builds upon an existing agent-based model of Covid-19 transmission in Chicago, CityCOVID. CityCOVID consists of a synthetic population that is statistically representative of Chicago’s population (2.7 million persons), along with their associated places (1.4 million places) and behaviors (13,000 activity schedules). During a simulated day, agents move from place-to-place, hour-by-hour, engaging in social activities and interactions with other colocated agents, resulting in an endogenous colocation or contact network. Covid-19 transmission is determined via a simulated epidemiological model based on this generated contact network by tuning (fitting) model parameters that result in simulation output that matches observed Covid-19 death and hospitalization data from the City of Chicago. Using the CityCOVID infrastructure, we quantify the impact of social determinants of health on Covid-19 transmission dynamics by applying statistical techniques to empirical data to study the relationship between social determinants of health and Covid-19 outcomes.

Bio: Abby Stevens is a fourth-year statistics PhD student at the University of Chicago advised by Rebecca Willett. She is interested in using data science techniques to address important social and political issues, such as climate science, public health, and algorithmic fairness. She graduated with a math degree from Grinnell College in 2014 and then worked as a data scientist at a healthcare tech company before entering graduate school. She has been involved in a number of data science for social good organizations and is a primary organizer of the Women in Data Science Chicago annual event.

Talk Title: Covariant Neural Networks for Physics Applications

Watch Alexander’s Research Lightning Talk

Talk Abstract: Most traditional neural network architectures do not respect any intrinsic structure of the input data, and instead are expected to “learn” it. CNNs are the first widespread example of a symmetry, in this case the translational symmetry of images, being used to advise much more efficient and transparent network architectures. More recently, CNNs were generalized to other non-commutative symmetry groups such as SO(3). However, in physics applications one is more likely to encounter input data that belong to linear representations of Lie Groups, as opposed to being functions (or “images”) on a symmetric space of the group.

To deal with such problems, I will present a general feed-forward architecture that takes vectors as inputs, works entirely in the Fourier space of the symmetry group, and is fully covariant. This approach allows one to achieve equal performance with drastically fewer learnable parameters. Moreover, the models become much more physically meaningful and more likely to be interpretable. My application of choice is in particle physics, where the main symmetry is the 6-dimensional Lorentz group. I will demonstrate the success of covariant architectures compared to more conventional approaches.

Bio: I am a PhD student at the University of Chicago working on theoretical hydrodynamics problems in relation to the quantum Hall effect. In addition, I am working on developing new group-covariant machine learning tools for physics applications, such as Lorentz-covariant neural networks for particle physics. My background is in mathematical physics, in which I hold a master’s degree from the Saint-Petersburg University in Russia. My interests lie at the intersection of theoretical and mathematical physics and new inter-disciplinary applications of such ideas.

Talk Title: Credible and Effective Data-Driven Decision-Making: Minimax Policy Learning under Unobserved Confounding

Talk Abstract: We study the problem of learning causal-effect maximizing personalized decision policies from observational data while accounting for possible unobserved confounding. Since policy value and regret may not be point-identifiable, we study a method that minimizes the worst-case estimated regret over an uncertainty set for propensity weights that controls the extent of unobserved confounding. We prove generalization guarantees that ensure our policy will be safe when applied in practice and will in fact obtain the best-possible uniform control on the range of all possible population regrets that agree with the possible extent of confounding. Finally, we assess and compare our methods on synthetic and semi-synthetic data. In particular, we consider a case study on personalizing hormone replacement therapy based on the parallel WHI observational study and clinical trial. We demonstrate that hidden confounding can hinder existing policy learning approaches and lead to unwarranted harm, while our robust approach guarantees safety and focuses on well-evidenced improvement.  This work is joint with Nathan Kallus. An earlier version was circulated as “Confounding-Robust Policy Improvement”.  Time permitting, I will highlight recent follow-up work on robust policy evaluation for infinite-horizon reinforcement learning. 

Bio: My research interests are at the intersection of statistical machine learning and operations research, with the aim of informing reliable data-driven decision-making. Specifically, I have developed fundamental contributions and algorithmic frameworks for robust causal-effect-maximizing personalized decision rules in view of unobserved confounding, as well as methodology for credible impact evaluation for algorithmic fairness with high potential impact in industry and policy. My work has been published in journals such as Management Science and top-tier computer science/machine learning venues (NeurIPS/ICML), and has received an INFORMS Data Mining Section Best Paper award. My work was previously supported by an NDSEG (National Defense Science and Engineering) Graduate Fellowship.

Talk Title: AI for Population Health: Melding Data and Algorithms on Networks

Watch Bryan’s Spotlight Research Talk

Talk Abstract: As exemplified by the COVID-19 pandemic, our health and wellbeing depend on a difficult-to-measure web of societal factors and individual behaviors. Tackling social challenges with AI requires algorithmic and data-driven paradigms which span the full process of gathering costly data, learning models to understand and predict interactions, and optimizing the use of limited resources in interventions. This talk presents methodological developments at the intersection of machine learning, optimization, and social networks which are motivated by on-the-ground collaborations on HIV prevention, tuberculosis treatment, and the COVID-19 response. These projects have produced deployed applications and policy impact. For example, I will present the development of an AI-augmented intervention for HIV prevention among homeless youth. This system was evaluated in a field test enrolling over 700 youth and found to significantly reduce key risk behaviors for HIV.

Bio: Bryan Wilder is a final-year PhD student in Computer Science at Harvard University, where he is advised by Milind Tambe. His research focuses on the intersection of optimization, machine learning, and social networks, motivated by applications to population health. His work has received or been nominated for best paper awards at ICML and AAMAS, and was a finalist for the INFORMS Doing Good with Good OR competition. He is supported by the Siebel Scholars program and previously received an NSF Graduate Research Fellowship.

Talk Title: Towards Data-Driven Internet Routing Security

Talk Abstract: The Internet ecosystem is critical for the reliability of online daily life. However, key Internet protocols, such as the Border Gateway Protocol (BGP), were not designed to cope with untrustworthy parties, making them vulnerable to misconfigurations and attacks from anywhere in the network. In this talk, I will present an evidence-based, data-driven approach to improving routing infrastructure security, which I use to identify and characterize BGP serial hijackers: networks that persistently hijack IP address blocks in BGP. I’ll also show how similar approaches can quantify the benefits of the RPKI security framework against prefix hijacks, and identify route leaks. This work improves our understanding of how the Internet actually works and has been used by industry and researchers for network reputation and monitoring of operational security practices.

Bio: Cecilia Testart is a PhD candidate in EECS at MIT, working with David D. Clark. Her research is at the intersection of computer networks, data science and policy. Her doctoral thesis focuses on securing the Internet’s core routing protocols, leveraging machine learning and data science approaches to understand the impact of protocol design on security, and considering both technical and policy challenges to improve the current state of the art. Cecilia holds Engineering Degrees from Universidad de Chile and Ecole Centrale Paris and a dual master’s degree in Technology and Policy and EECS from MIT. Prior to joining MIT, she helped set up the Chilean office of Inria (the French National Institute for Research in Digital Science and Technology) and worked for the research lab of the .CL, the Chilean top-level domain. She has interned at Akamai, MSR and the OECD. Cecilia’s work received a Distinguished Paper Award at the ACM Internet Measurement Conference in 2019.

Talk Title: Machine Learning for Astrophysics & Cosmology in the Era of Large Astronomical Surveys and an Application for the Discovery and Classification of Faint Galaxies

Watch Dimitrios’ Research Lightning Talk

Talk Abstract: Observational astrophysics & cosmology are entering the era of big data. Future astronomical surveys are expected to collect hundreds of petabytes of data and detect billions of objects. Machine learning will play an important role in the analysis of these surveys, with the potential to revolutionize astronomy, and will also provide challenging problems that offer opportunities for breakthroughs in the fundamental understanding of machine learning.
In this talk I will present the discovery of Low Surface Brightness Galaxies (LSBGs) from the Dark Energy Survey (DES) data. LSBGs are galaxies with intrinsic brightness less than that of the dark sky, and so are hard to detect and study. At the same time, they are expected to dominate the number density of galaxies in the universe, which thus remains relatively unexplored. I will discuss the development of automated, deep learning-based pipelines for LSBG detection (separation of LSB galaxies from LSB artifacts present in images) and morphological classification. Such techniques will be extremely valuable with the advent of very large future surveys like the planned Legacy Survey of Space and Time (LSST) at the Vera C. Rubin Observatory.

Bio: Dimitrios Tanoglidis is a fifth-year PhD student in the Department of Astronomy & Astrophysics at the University of Chicago. He holds a BSc in Physics and MSc in Theoretical Physics, both from the University of Crete, Greece. His research interests lie in cosmology, analysis of large galaxy surveys, and data science applications in astrophysics. He has led the research for the discovery and analysis of Low Surface Brightness Galaxies from the Dark Energy Survey data using machine learning. Interdisciplinary in nature, he is also pursuing a certificate in Computational Social Science.

Talk Title: Network Effects on Outcomes and Unequal Distribution of Resources

Watch Eaman’s Research Lightning Talk

Talk Abstract: We study how networks affect different groups differently and provide pathways to reinforce existing inequalities. First, we provide observational evidence for differential network advantages in access to information: individuals from the low status group receive lower marginal benefit from networking than the high status group. Second, we provide causal evidence for differential diffusion of a new behavior in the network, driven mainly by homophily and slight initial advantages of a group. Third, we develop a theoretical network model that captures the network structure of unequal access to opportunities. We show that any departure from the uniform distribution of links to information sources among members of a group limits the diffusion of information to the group as a whole. Fourth, we develop an online lab experiment to further study the network mechanisms that widen inter-group differences and yield different returns on social capital to different groups. We recruit individuals to play an online collaborative game in which they have to find and dig gold mines and in the process can pass information to their network neighbors. By changing the network structure and composition of groups with low and high initial advantage, we generate the processes that lead to unequal distribution of opportunities, beyond what’s expected by individual differences. Finally, we contribute to the literature on network structure and performance and propose the concept of bandwidth-diversity matching: individuals who match the tie strength to their contacts with their information novelty achieve truly diverse networks and better outcomes.

Bio: I am a PhD candidate in the Social and Engineering Systems program at MIT IDSS, under the supervision of Prof. Pentland and Prof. Eckles. I am also receiving a second PhD in Statistics from the Statistics and Data Science Center at MIT. I received my Bachelor’s and Master’s degrees in Computer Science, both from the University of Michigan – Ann Arbor.
My PhD research is focused on micro-level structural factors, such as network structure, that contribute to unequal distribution of resources or information. As a computational social scientist, I use methods from network science, statistics, experiment design and causal inference. I am also interested in understanding the collective behavior in institutional settings, the institutional mechanisms that promote cooperative behavior in networks, or in contrast lead to unequal outcomes for different groups.
In a previous life, I worked at Google New York City as a software engineer from 2011 to 2015. Currently, I am also a research contractor at Facebook working on how networks affect economic outcomes.

Talk Title: What and How Students Read: A Data-driven Insight

Talk Abstract: Reading is an integral part of learning. The purpose of reading to learn is to comprehend meaning from informational texts. Reading comprehension tasks require self-regulated learning (SRL) behaviors: planning, monitoring, and evaluating one’s reading strategies. Students without SRL skills may struggle with reading, which in turn may prevent them from acquiring domain-specific knowledge. Thus, understanding students’ reading behavior and SRL usage is important for intervention. Digital reading platforms can provide opportunities to learn and practice SRL strategies in classroom settings. These platforms log a rich array of student and teacher interaction data. Retrospective analysis of these logged data can yield insights that can be used to support tailored interventions by instructors and students in complex learning activities. In this talk, I will discuss students’ science reading and SRL behaviors, and connect those behaviors with performance within a digital literacy platform, Actively Learn. The talk consists of two studies: (i) identifying patterns that differ between productive and unproductive students, and (ii) analyzing the association between teachers’ behavior and students’ SRL usage. I will finish my talk by outlining possible future directions.

Bio: Effat Farhana is a Ph.D. Candidate in the Computer Science Department at North Carolina State University working with Dr. Collin F. Lynch in the ArgLab research group. She received her B.S. in Computer Science and Engineering from Bangladesh University of Engineering and Technology. Her research focuses on mining educational software to derive data-driven heuristics, machine learning, and designing interpretable machine learning algorithms.

Talk Title: Quantifying The Power of Mental Shortcuts in Persuasive Communication with Causal Inference from Text

Talk Abstract: The reliance of individuals on mental shortcuts based on factors such as gender, affiliation, and social status could distort the equitability of interpersonal discussions in various settings. Yet, the impact of such shortcuts in real-world discussions remains challenging to quantify. In this talk, I propose a novel quasi-experimental study that incorporates unstructured text in a principled manner to quantify the causal effect of status indicators in persuasive communication. I also examine how linguistic and rhetorical devices moderate this effect, and thus provide communication strategies to potentially reduce individuals’ reliance on mental shortcuts. I discuss implications for fair communication policies both within organizations and in society at large.

Bio: Emaad Manzoor is a PhD candidate in the Heinz College of Information Systems and Public Policy at Carnegie Mellon University, and will begin as an assistant professor of Operations and Information Management at the University of Wisconsin-Madison in Fall 2021. Substantively, he designs randomized experiments and quasi-experimental studies to quantify the persuasive power of mental shortcuts in text-based communication, and how language can be used to moderate this power. Methodologically, he develops data-mining techniques for evolving networks and statistical frameworks for causal inference with text. He is funded by a 2020 McKinsey & Company PhD Fellowship, and was a finalist for the 2019 Snap Research PhD Fellowship, the 2019 Jane Street Depth First Learning Fellowship, and the 2019 INFORMS Annual Meeting Best Paper award.

Talk Title: Machine Learning in Dynamical Systems

Talk Abstract: Many branches of science and engineering involve estimation and control in dynamical systems; consider, for example, using data to help stabilize the flight of a drone or predict the path of a hurricane. We consider control in dynamical systems from the perspective of regret minimization. Unlike most prior work in this area, we focus on the problem of designing an online controller which competes with the best dynamic sequence of control actions selected in hindsight, instead of the best controller in some specific class of controllers. This formulation is attractive when the environment changes over time and no single controller achieves good performance over the entire time horizon. We derive the structure of the regret-optimal online controller using techniques from robust control theory and present a clean data-dependent bound on its regret. We also present numerical simulations which confirm that our regret-optimal controller significantly outperforms various classical controllers in dynamic environments.

Bio: Gautam is a PhD student in the Computing and Mathematical Sciences (CMS) department at Caltech, where he is advised by Babak Hassibi. He is broadly interested in machine learning, optimization, and control, especially 1) online learning and online decision-making and 2) integrating machine learning with physics, dynamics and control. Much of his PhD work has been supported by a National Science Foundation Graduate Research Fellowship and an Amazon AWS AI Fellowship. Prior to joining Caltech, he obtained a BS in Mathematics from Georgia Tech.

Talk Title: Adversarial Collusion on the Web: State-of-the-art and Future Directions

Talk Abstract: The growth and popularity of online media have made it the most important platform for collaboration and communication among its users. Given its tremendous growth, the social reputation of an entity in online media plays an important role. Because the natural way to build social reputation is time-consuming, users have turned to artificial ways of gaining it by means of black-market services. We refer to such artificial ways of boosting social reputation as collusion. In this talk, we will comprehensively review recent developments in analyzing and detecting collusive entities on online media. First, we give an overview of the problem and motivate the need to detect these entities. Second, we survey the state-of-the-art models that range from designing feature-based methods to more complex models, such as using deep learning architectures and advanced graph concepts. Third, we detail the annotation guidelines, provide a description of tools/applications and explain the publicly available datasets. The talk concludes with a discussion of future trends.

Bio: Hridoy Sankar Dutta is currently pursuing his Ph.D. in Computer Science and Engineering at IIIT-Delhi, India. Starting January 2021, he will be joining the University of Cambridge as a Research Assistant in the Cambridge Cybercrime Centre. His current research interests include data-driven cybersecurity, social network analysis, natural language processing, and applied machine learning. He received his B.Tech degree in Computer Science and Engineering from Institute of Science and Technology, Gauhati University, India in 2013. From 2014 to 2015, he worked as an Assistant Project Engineer at the Indian Institute of Technology, Guwahati (IIT-G), India, for the project ‘Development of Text to Speech System in Assamese and Manipuri Languages’. He completed his M.Tech in Computer Science and Engineering at NIT Durgapur, India in 2015. More details can be found at https://hridaydutta123.github.io/.

Talk Title: Computer-Aided Diagnosis of Thoracic CT Scans Through Multiple Instance Transfer Learning

Talk Abstract: Computer-aided diagnosis systems have demonstrated significant potential in improving patient care and clinical outcomes by providing more extensive information to clinicians.  The development of these systems typically requires a large amount of well-annotated data, which can be challenging to acquire in medical imaging.  Several techniques have been investigated in an attempt to overcome insufficient data, including transfer learning, or the application of a pre-trained model to a new domain and/or task.  The successful translation of transfer learning models to complex medical imaging problems holds significant potential and could lead to widespread clinical implementation.

However, transfer learning techniques often fail to translate effectively because they are limited by the domain in which they were initially trained. For example, computed tomography (CT) is a powerful medical imaging modality that leverages 3D images in clinical decision-making, but transfer learning models are typically trained on 2D images and thus cannot incorporate the additional information provided by the third dimension. This evaluation of the available data in a CT scan is inefficient and potentially does not effectively improve clinical decisions. In this project, the 3D information available in CT scans is incorporated with transfer learning through a multiple instance learning (MIL) scheme, which can individually assess 2D images and form a collective 3D prediction based on the 2D information, similar to how a radiologist would read a CT scan. This approach has been applied to evaluate both COVID-19 and emphysema in thoracic CT scans and demonstrated strong clinical potential.

Bio: Jordan Fuhrman is a student in the Graduate Program in Medical Physics at the University of Chicago. Since joining the program after his graduation from the University of Alabama in 2017, Jordan’s research has focused on the investigation of computer-aided diagnosis techniques for evaluating CT scans. Generally, this includes implementation of machine learning, deep learning, and computer vision algorithms to accomplish such tasks as disease detection, image segmentation, and prognosis assessments. His primary research interests lie in the development of novel approaches that incorporate the full wealth of information in CT scans to better inform clinical predictions, the exploration of explainable, interpretable outputs to improve clinical understanding of deep learning algorithm performance, and the early detection and prediction of patient progress to inform clinical decisions (e.g., most appropriate treatment) and improve patient outcomes. His work has largely focused on incidental disease assessment in low-dose CT lung screening scans, including emphysema, osteoporosis, and coronary artery calcifications, but has also included non-screening scan assessments of hypoxic ischemic brain injury and COVID-19. Jordan is a student member of both the American Association of Physicists in Medicine (AAPM) and the Society of Photo-optical Instrumentation Engineers (SPIE).

Talk Title: How to Preserve Privacy in Data Analysis?

Talk Abstract: The past decade has witnessed the tremendous success of large-scale data science. However, recent studies show that many existing powerful machine learning tools used in large-scale data science pose severe threats to personal privacy. Therefore, one of the major challenges in data analysis is how to learn effectively from the enormous amounts of sensitive data without giving up on privacy. Differential Privacy (DP) has recently emerged as a new gold standard for private data analysis due to the statistical data privacy it can provide for sensitive information. Nevertheless, the adaptation of DP to data analysis remains challenging due to the complex models we often encounter in data analysis. In this talk, I will focus on two commonly used models, i.e., the centralized and distributed/federated models, for differentially private data analysis. For the centralized model, I will present my efforts to provide strong privacy and utility guarantees in high-dimensional data analysis. For the distributed/federated model, I will discuss new efficient and effective privacy-preserving learning algorithms.

Bio: Lingxiao Wang is a final year Ph.D. student in the Department of Computer Science at the University of California, Los Angeles, advised by Dr. Quanquan Gu. Previously he obtained his MS degree in Statistics at the University of Washington. Lingxiao’s research interests are broadly in machine learning, including privacy-preserving machine learning, optimization, deep learning, low-rank matrix recovery, high-dimensional statistics, and data mining. Lingxiao aims to apply his research for social good, and he is one of the core members of the Combating COVID-19 project (https://covid19.uclaml.org/).

Talk Title: Systematic Evaluation of Privacy Risks of Machine Learning Models

Talk Abstract: Machine learning models are prone to memorizing sensitive data, making them vulnerable to membership inference attacks in which an adversary aims to guess if an input sample was used to train the model. In this talk, we show that prior work on membership inference attacks may severely underestimate the privacy risks by relying solely on training custom neural network classifiers to perform attacks and focusing only on aggregate results over data samples, such as the attack accuracy.

To overcome these limitations, we first propose to benchmark membership inference privacy risks by improving existing non-neural network based inference attacks and proposing a new inference attack method based on a modification of prediction entropy. Using our benchmark attacks, we demonstrate that existing membership inference defense approaches are not as effective as previously reported.

Next, we introduce a new approach for fine-grained privacy analysis by formulating and deriving a new metric called the privacy risk score. Our privacy risk score metric measures an individual sample’s likelihood of being a training member, which allows an adversary to perform membership inference attacks with high confidence. We experimentally validate the effectiveness of the privacy risk score metric and demonstrate the distribution of privacy risk scores across individual samples is heterogeneous. Our work emphasizes the importance of a systematic and rigorous evaluation of privacy risks of machine learning models.
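As a rough illustration of the entropy-based membership inference idea mentioned in this abstract, the following is a minimal sketch, not the speaker's implementation. It uses plain Shannon entropy rather than the modified entropy the talk proposes, and the function names, example probability vectors, and threshold value are all hypothetical.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a predicted class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def infer_membership(probs, threshold):
    """Guess 'training member' when the prediction is confident (low entropy).

    Assumption: models tend to be more confident on samples they were
    trained on, so low entropy is treated as evidence of membership.
    """
    return prediction_entropy(probs) <= threshold


# Hypothetical model outputs for two input samples.
confident = [0.97, 0.02, 0.01]   # low entropy
uncertain = [0.40, 0.35, 0.25]   # high entropy
threshold = 0.5                  # hypothetical; in practice tuned on held-out data

print(infer_membership(confident, threshold))
print(infer_membership(uncertain, threshold))
```

A single global threshold is the simplest choice; an attacker could instead tune one threshold per class, which this sketch does not attempt.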

Bio: Liwei Song is a fifth-year PhD student in the Department of Electrical Engineering at Princeton University, advised by Prof. Prateek Mittal. Before coming to Princeton, he received his Bachelor’s degree in Electrical Engineering from Peking University.

His current research focuses on investigating security and privacy issues of machine learning models, including membership inference attacks, evasion attacks, and backdoor attacks. His evaluation methods on membership inference have been integrated into Google’s TensorFlow Privacy library. Besides that, he has also worked on attacking voice assistants with ultrasound, which received widespread media coverage, including from BBC News and The New York Times.

Talk Title: Reasoning about Social Dynamics and Social Bias in Language

Watch Maarten’s Spotlight Research Talk

Talk Abstract: Humans easily make inferences to reason about the social and power dynamics of situations (e.g., stories about everyday interactions), but such reasoning is still a challenge for modern NLP systems. In this talk, I will address how we can make machines reason about social commonsense and social biases in text, and how this reasoning could be applied in downstream applications.

In the first part, I will discuss PowerTransformer, our new unsupervised model for controllable debiasing of text through the lens of connotation frames of power and agency. Trained using a combined reconstruction and paraphrasing objective, this model can rewrite story sentences such that their characters are portrayed with more agency and decisiveness. After establishing its performance through automatic and human evaluations, we show how PowerTransformer can be used to mitigate gender bias in portrayals of movie characters. Then, I will introduce Social Bias Frames, a conceptual formalism that models the pragmatic frames in which people project social biases and stereotypes onto others to reason about biased or harmful implications in language. Using a new corpus of 150k structured annotations, we show that models can learn to reason about high-level offensiveness of statements, but struggle to explain why a statement might be harmful. I will conclude with future directions for better reasoning about social dynamics and social biases.

Bio: Maarten Sap is a final year PhD student in the University of Washington’s natural language processing (NLP) group, advised by Noah Smith and Yejin Choi. His research focuses on endowing NLP systems with social intelligence and social commonsense, and understanding social inequality and bias in language. In the past, he’s interned at AI2 on project Mosaic working on social commonsense reasoning, and at Microsoft Research working on long-term memory and storytelling with Eric Horvitz.

Talk Title: Formal Logic Enhanced Deep Learning for Cyber-Physical Systems

Watch Meiyi’s Research Lightning Talk

Talk Abstract: Deep Neural Networks (DNNs) are broadly applied and have achieved outstanding results in prediction and decision-making support for Cyber-Physical Systems (CPS). However, for large-scale and complex integrated CPS with high uncertainties, DNN models are not always robust and are often subject to anomalies and erroneous predictions, especially when the predictions are projected into the future (uncertainty and errors grow over time). To increase the robustness of DNNs for CPS, in my work, I developed a novel formal logic enhanced learning framework with logic-based criteria to enhance DNN models to follow system critical properties and build well-calibrated uncertainty estimation models. Trained in an end-to-end manner with back-propagation, this framework is general and can be applied to various DNN models. The evaluation results on large-scale real-world city datasets show that my work not only improves the accuracy of predictions and effectiveness of uncertainty estimation, but importantly also guarantees the satisfaction of model properties and increases the robustness of DNNs. This work can be applied to a wide spectrum of applications, including the Internet of Things, smart cities, healthcare, and many others.

Bio: Meiyi Ma is a Ph.D. candidate in the Department of Computer Science at the University of Virginia, working with Prof. John A. Stankovic and Prof. Lu Feng. Her research interest lies at the intersection of Machine Learning, Formal Methods, and Cyber-Physical Systems. Specifically, her work integrates formal methods and machine learning, and applies new integrative solutions to build safe and reliable integrated Cyber-Physical Systems, with a focus on smart city and healthcare applications. Meiyi’s research has been published in top-tier machine learning and cyber-physical systems conferences and journals, including NeurIPS, ACM TCPS, ICCPS, PerCom, etc. She has received multiple awards, including the EECS Rising Star at UC Berkeley, the Outstanding Graduate Research Award at the University of Virginia and the Best Master Thesis Award. She is serving as the information director for ACM Transactions on Computing for Healthcare and as a reviewer for multiple conferences and journals. She has also served on organizing committees for several international workshops.

Talk Title: Human-AI Collaborative Decision Making on Rehabilitation Assessment

Talk Abstract: Rehabilitation monitoring systems with sensors and artificial intelligence (AI) provide an opportunity to improve current rehabilitation practices by automatically collecting quantitative data on patients’ status. However, the adoption of these systems still remains a challenge. This paper presents an interactive AI-based system that supports collaborative decision making with therapists for rehabilitation assessment. This system automatically identifies salient features of assessment to generate patient-specific analysis for therapists, and is tuned with their feedback. In two evaluations with therapists, we found that our system helps therapists reach significantly higher agreement on assessment (0.71 average F1-score) than a traditional system without analysis (0.66 average F1-score, p < 0.05). In addition, after tuning with therapists’ feedback, our system significantly improves its performance (from 0.8377 to 0.9116 average F1-scores, p < 0.01). This work discusses the potential of a human and AI collaborative system that supports more accurate decision making while learning from each other’s strengths.

Bio: Min Lee is a PhD student at Carnegie Mellon University. His research interests lie at the intersection of human-computer interaction (HCI) and machine learning (ML), where he designs, develops, and evaluates human-centered ML systems to address societal problems. His thesis focuses on creating interactive hybrid intelligence systems to improve the practices of stroke rehabilitation (e.g. a decision support system for therapists and a robotic coaching system for post-stroke survivors).

Talk Title: Mathematical Models of Brain Connectivity and Behavior: Network Optimization Perspectives, Deep-Generative Hybrids, and Beyond

Talk Abstract: Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder characterized by multiple impairments and levels of disability that vary widely across the ASD spectrum. Currently, quantifying symptom severity relies almost solely on a trained clinician’s evaluation. Recently, neuroimaging studies, for example, using resting state functional MRI (rs-fMRI) and Diffusion Tensor Imaging (DTI), have been gaining popularity for studying brain dysfunction. My work aims at linking the symptomatic characterization of ASD with the functional and structural organization of a patient’s brain via machine learning. To set the stage, I will first introduce a joint network optimization to predict clinical severity from rs-fMRI data. Our model couples two terms, a generative matrix factorization and a discriminative regression, in a joint optimization. Next, we extend this to a deep-generative hybrid that jointly models the complementarity between structure (DTI) and functional dynamics (dynamic rs-fMRI connectivity) to extract predictive disease biomarkers. The generative part of our framework is now a structurally-regularized matrix factorization on dynamic rs-fMRI correlation matrices, guided by DTI tractography to learn anatomically informed connectivity profiles. The deep part of our framework is an LSTM-ANN, which models the temporal evolution of the scan to map to behavior. Our main novelty lies in our coupled optimization, which collectively estimates the matrix factors and the neural network weights. We outperform several state-of-the-art baselines to extract multi-modal neural signatures of brain dysfunction. Finally, I will present our current exploration based on graph neural networks and manifold learning to better capture the underlying data geometry.

Bio: Niharika is a PhD candidate in the department of Electrical and Computer Engineering. Her research interests lie at the intersection of deep learning, non-convex optimization, manifold learning and graph signal processing applied to neuroimaging data. She has developed novel machine learning algorithms that predict behavioral deficits in patients with Autism by decoding their brain organization from their functional and structural neuroimaging scans. Prior to joining Hopkins, she obtained a bachelor’s degree (B. Tech with Hons.) in Electrical Engineering with a minor in Electronics and Electrical Communications Engineering from the Indian Institute of Technology, Kharagpur.

Talk Title: Power Outage Risk Interconnection: Relationship with Social and Environmental Critical Risk Indicators

Watch Olukunle’s Research Lightning Talk

Talk Abstract: The interconnections between diverse components in a system can provide profound insights on the health and risk states of the system as a whole. Highly interconnected systems tend to accumulate risks until a large, systemic crisis hits. For example, in the 2007-09 financial crisis, the interconnection of financial institutions heightened near the collapse, suggesting the system could no longer absorb risks. Extending concepts of interconnectedness and systemic risk to coupled human-natural systems, one might expect similar behaviours of risk accumulation and heightened connectivity, leading to potential system failures. The Predictive Risk Investigation System (PRISM) for Multi-layer Dynamic Interconnection Analysis aims to explore the complex interconnectedness and systemic risks in human-natural systems.

Applying the PRISM approach, we could uncover dynamic relationships and trends in climate resilience and preparedness using energy, environmental, and social indicators. This study proposes a case-study application of the PRISM approach to the State of Massachusetts using a dataset of over 130,000 power outages in the state from 2013 to 2018. Random Forest, Locally Weighted Scatterplot Smoothing (LOWESS), and Generalized Additive Models (GAMs) are applied to understand the interconnections between power outages, population density, and environmental factors (weather indicators, e.g. wind speed and precipitation).

Bio: I am a Data Scientist with domain expertise in Energy – Oil, Gas, Renewables and Power Systems. With a BS in Petroleum Engineering and an MS in Sustainable Energy Systems, I have always enjoyed a data-centric approach to solving interdisciplinary problems. In my bachelor’s degree, I used Neural Networks to solve a practical oil-field (Production Engineering) problem. In my master’s, I explored the potential for optimizing clean-energy microgrids in low-income, underserved communities while leveraging insights from large, messy, unstructured data. In my PhD at Tufts, I am working in an interdisciplinary team of Data and Domain Scientists where I am applying Data Science/Machine Learning Techniques and Tools to Energy, Climate, Financial, and Ecological systems. One word to describe my experience is diversity. I am fortunate to have enjoyed a fair share of diversity in my academic and professional experience, in geography and in scope. My experience spans three continents and has equipped me with a broad scientific and engineering background. This exemplifies my interest in complex, interdisciplinary, and multifaceted problems traversing fields such as science, engineering, and data science. I am enthusiastic about applying my knowledge and skills in Data Science to new, challenging, unfamiliar terrains to discover and garner insights and solve problems that improve experiences and affect people, communities, and organizations.

Talk Title: Data-Efficient Optimization in Reinforcement Learning

Watch Pan’s Research Lightning Talk

Talk Abstract: Optimization lies at the heart of modern machine learning and data science research. How to design data-efficient optimization algorithms that have a low sample complexity while enjoying a fast convergence at the same time has remained a challenging but imperative topic in machine learning. My research aims to answer this question from two facets: providing the theoretical analysis and understanding of optimization algorithms; and developing new algorithms with strong empirical performance in a principled way. In this talk, I will introduce our recent work in developing and improving data-efficient optimization algorithms for decision-making (reinforcement learning) problems. In particular, I will introduce the variance reduction technique in optimization and show how it can improve the data efficiency of policy gradient methods in reinforcement learning. I will present the variance reduced policy gradient algorithm, which constructs an unbiased policy gradient estimator for the value function. I will show that it provably reduces the sample complexity of vanilla policy gradient methods such as REINFORCE and GPOMDP.

Bio: Pan Xu is a Ph.D. candidate in the Department of Computer Science at the University of California, Los Angeles. His research spans the areas of machine learning, data science, and optimization, with a focus on the development and improvement of large-scale nonconvex optimization algorithms for machine learning and data science applications. Pan obtained his B.S. degree in mathematics from the University of Science and Technology of China. Pan received the Presidential Fellowship in Data Science from the University of Virginia. He has published over 20 high-quality papers in top machine learning conferences and journals such as ICML, NeurIPS, ICLR, AISTATS, and JMLR.

Talk Title: Efficient Neural Question Answering for Heterogeneous Platforms

Watch Qingqing’s Research Lightning Talk

Talk Abstract: Natural language processing (NLP) systems power many real-world applications like Alexa, Siri, or Google and Bing. Deep learning NLP systems are becoming more effective due to increasingly larger models with multiple layers and millions to billions of parameters. It is challenging to deploy these systems because they are compute-intensive, consume much more energy, and cannot run on mobile devices. In this talk, I will present two works on optimizing efficiency in question answering systems and my current research in studying large NLP models’ energy consumption. First, I will introduce DeQA, which provides an on-device question-answering capability to help mobile users find information more efficiently without privacy issues. Deep learning based QA systems are slow and unusable on mobile devices. We design latency and memory optimizations, widely applicable to state-of-the-art QA systems, that let them run locally on mobile devices. Second, I will present DeFormer, a simple decomposition-based technique that takes pre-trained Transformer models and modifies them to enable faster inference for QA for both the cloud and mobile. Lastly, I will introduce how we can accurately measure the energy consumption of NLP models using hardware power meters and build reliable energy estimation models by abstracting meaningful features of the NLP workloads and profiling runtime resource usage.

Bio: Qingqing Cao is a graduating Computer Science Ph.D. candidate at Stony Brook University. His research interests include natural language processing (NLP), mobile computing, and machine learning systems. He has focused on building efficient and practical NLP systems for both edge devices and the cloud, such as on-device question answering (MobiSys 2019), faster Transformer models (ACL 2020), and accurate energy estimation of NLP models. He has two fantastic advisors: Prof. Aruna Balasubramanian and Prof. Niranjan Balasubramanian. He is looking for postdoc openings in academia or research positions in the industry.

Talk Title: Artificial Intelligence for Medical Image Analysis for Breast Cancer Multiparametric MRI

Watch Isabelle’s Spotlight Research Talk

Talk Abstract: Artificial intelligence is playing an increasingly important role in medical imaging. Computer-aided diagnosis (CADx) systems using human-engineered features or deep learning can potentially assist radiologists in image interpretation by extracting quantitative biomarkers to improve diagnostic performance and circumvent unnecessary invasive procedures. Multiparametric MRI (mpMRI) has become a part of routine clinical assessment for screening of high-risk patients for breast cancer and monitoring therapy response because it has been shown to improve diagnostic accuracy. Current CADx methods for breast lesion assessment on MRI, however, are mostly focused on one sequence, the dynamic contrast-enhanced (DCE)-MRI. Therefore, we investigated methods for incorporating three sequences in mpMRI to improve the CADx performance in differentiating benign and malignant breast lesions. We compared integrating the mpMRI information at the image level, feature level, or classifier output level. In addition, transfer learning is often employed in deep learning applications in medical imaging due to data scarcity. However, pretrained convolutional neural networks (CNNs) used in transfer learning require two-dimensional (2D) inputs, limiting the ability to utilize high-dimensional information in medical imaging. To address this problem, we investigated a transfer learning method that collapses volumetric information to 2D by taking the maximum intensity projection (MIP) at the feature level within CNNs, which outperformed a previous method of using MIPs of images themselves in the task of distinguishing between benign and malignant breast lesions. We proposed a method that combines feature fusion and feature MIP for computer-aided breast cancer diagnosis using high-dimensional mpMRI that outperforms the current benchmarks.
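
As a rough illustration of the projection step only, the sketch below computes a maximum intensity projection of a small 3D volume along its depth axis. It is a plain-Python toy, not the method from the talk: the abstract applies the analogous maximum to feature maps inside a CNN, which is not shown here, and the function name and nested-list layout (depth, height, width) are assumptions made for the example.

```python
def max_intensity_projection(volume):
    """Collapse a 3D volume to 2D by taking the maximum along the depth axis.

    `volume` is a list of 2D slices, indexed as volume[z][y][x]
    (a hypothetical layout chosen for this sketch).
    """
    # zip(*volume) groups the rows that share the same y across all slices;
    # zip(*rows) then groups the values that share the same x across slices.
    return [[max(values) for values in zip(*rows)] for rows in zip(*volume)]


# Two 2x2 slices; each output pixel keeps the brightest value across slices.
toy_volume = [
    [[1, 5], [2, 0]],
    [[3, 4], [1, 7]],
]
projection = max_intensity_projection(toy_volume)
```

The result has the height and width of a single slice, which is what makes the projected data compatible with networks that expect 2D inputs.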

Bio: Isabelle is a PhD candidate in Medical Physics at the University of Chicago, supervised by Dr. Maryellen Giger. Her research is centered around developing automated methods for quantitative medical image analysis to assist in clinical decision-making. She has proposed novel methodologies to diagnose breast cancer using multiparametric MRI exams. Since the pandemic, she has also been working on AI solutions that leverage medical images to enhance the early detection and prognosis of COVID-19. She has first-hand experience tackling unique challenges faced by medical imaging applications of machine learning due to high-dimensionality, data scarcity, noisy labels, etc. She loves working at the intersection of physics, medicine, and data science, and she is motivated by the profound potential impact that her research can have on improving access to high-quality care and providing a proactive healthcare system. She hopes to dedicate her career to building AI-empowered technology to transform healthcare, accelerate scientific discoveries, and improve human well-being.

Talk Title: Asymptotically Optimal Exact Minibatch Metropolis-Hastings

Talk Abstract: Metropolis-Hastings (MH) is one of the most fundamental Bayesian inference algorithms, but it can be intractable on large datasets due to requiring computations over the whole dataset. In this talk, I will discuss minibatch MH methods, which use subsamples to enable scaling. First, I will talk about existing minibatch MH methods, and demonstrate that inexact methods (i.e. they may change the target distribution) can cause arbitrarily large errors in inference. Then, I will introduce a new exact minibatch MH method, TunaMH, which exposes a tunable trade-off between its batch size and its theoretically guaranteed convergence rate. Finally, I will present a lower bound on the batch size that any minibatch MH method must use to retain exactness while guaranteeing fast convergence—the first such bound for minibatch MH—and show TunaMH is asymptotically optimal in terms of the batch size.
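
To make the scaling problem concrete, here is a minimal sketch of a standard full-batch random-walk MH sampler. It is not TunaMH or any minibatch method from the talk. It assumes a toy model (Gaussian likelihood with unit variance and a flat prior on the mean), and the function names, step size, and seed are illustrative choices.

```python
import math
import random


def log_likelihood(theta, data):
    # Sums over every data point: this is the O(N) cost paid at each MH step.
    return sum(-0.5 * (x - theta) ** 2 for x in data)


def metropolis_hastings(data, n_steps=5000, step_size=0.2, seed=0):
    """Random-walk MH for the mean of a unit-variance Gaussian (flat prior)."""
    rng = random.Random(seed)
    theta = 0.0
    current_ll = log_likelihood(theta, data)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)
        proposal_ll = log_likelihood(proposal, data)  # full pass over the data
        # Symmetric proposal and flat prior: the acceptance ratio reduces to
        # the likelihood ratio, capped at 1.
        accept_prob = math.exp(min(0.0, proposal_ll - current_ll))
        if rng.random() < accept_prob:
            theta, current_ll = proposal, proposal_ll
        samples.append(theta)
    return samples
```

Every iteration calls `log_likelihood` on the whole dataset, so the per-step cost grows linearly with the number of data points. That full pass is the computation minibatch MH methods replace with a subsample.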

Bio: Ruqi Zhang is a fifth-year Ph.D. student in Statistics at Cornell University, advised by Professor Chris De Sa. Her research interests lie in probabilistic modeling for data science and machine learning. She currently focuses on developing fast and robust inference methods with theoretical guarantees and their applications with modern model architectures, such as deep neural networks, on real-world big data. Her work has been published in top machine learning venues such as NeurIPS, ICLR and AISTATS, and has been recognized through an Oral Award at ICLR and two Spotlight Awards at NeurIPS.

Talk Title: Towards Global-Scale Biodiversity Monitoring – Scaling Geospatial and Taxonomic Coverage Using Contextual Clues

Watch Sara’s Research Lightning Talk

Talk Abstract: Biodiversity is declining globally at unprecedented rates. We need to monitor species in real time and in greater detail to quickly understand which conservation efforts are most effective and take corrective action. Current ecological monitoring systems generate data far faster than researchers can analyze it, making scaling up impossible without automated data processing. However, ecological data collected in the field presents a number of challenges that current methods, like deep learning, are not designed to tackle. Biodiversity data is correlated in time and space, resulting in overfitting and poor generalization to new sensor deployments. Environmental monitoring sensors have limited intelligence, resulting in objects of interest that are often too close/far, blurry, or in clutter. Further, the distribution of species is long-tailed, which results in highly-imbalanced datasets. These challenges are not unique to the natural world, so advances in any one of these areas will have far-reaching impact across domains. To address these challenges, we take inspiration from the value of additional contextual information for human experts, and seek to incorporate it within the structure of machine learning systems. Incorporating species distributions and access across data collected within a sensor at inference time can improve generalization to new sensors without additional human data labeling. Going beyond single sensor deployment, there is a large degree of contextual information shared across multiple data streams. Our long-term goal is to develop learning methods that efficiently and adaptively benefit from many different data streams on a global scale.

Bio: Sara Beery has always been passionate about the natural world, and she saw a need for technology-based approaches to conservation and sustainability challenges. This led her to pursue a PhD at Caltech, where she is advised by Pietro Perona and funded by an NSF Graduate Research Fellowship, a PIMCO Fellowship in Data Science, and an Amazon/Caltech AI4Science Fellowship. Her research focuses on computer vision for global-scale biodiversity monitoring. She works closely with Microsoft AI for Earth and Google Research to translate her work into usable tools, including widely-used models and benchmarks for detection and recognition of animal species in challenging camera trap data at a global scale. She has worked to bridge the interdisciplinary gap between ecology and computer science by hosting the iWildCam challenge at the FGVC Workshop at CVPR from 2018 to 2021, and by founding and managing a highly successful AI for Conservation Slack channel, which provides a meeting point for experts from each community to discuss new methods and best practices for conservation technology. Sara’s prior experience as a professional ballerina and a nontraditional student has taught her the value of unique and diverse perspectives in the research community. She’s passionate about increasing diversity and inclusion in STEM through mentorship and outreach.

Talk Title: Promoting Worker Performance with Human-Centered Data Science

Watch Teng’s Spotlight Research Talk

Talk Abstract: Addressing real-world problems about human behavior is one of the main avenues through which advances in data science techniques and social science theories achieve the greatest social impact. To approach these problems, we propose a human-centered data science framework that synergizes strengths across machine learning, causal inference, field experiments, and social science theories to understand, predict, and intervene in human behavior. In this talk, I will present three empirical studies that promote worker performance with human-centered data science. In the first project, we work with New York City’s Mayor’s Office and deploy explainable machine learning models to predict the risk of tenant harassment in New York City. In the second project, we leverage insights from social identity theory and conduct a large-scale field experiment on DiDi, a leading ride-sharing platform, showing that the intervention of bonus-free team ranking/contest systems can improve driver engagement. Third, to further unpack the effect of team contests on individual DiDi drivers, we bring together causal inference, machine learning, and social science theories to predict individual treatment effects. Insights from this study are directionally actionable to improve team recommender systems and contest design. More promising future directions will be discussed to showcase the effectiveness and flexibility of this framework.

Bio: I am a final-year Ph.D. candidate at the School of Information, University of Michigan, Ann Arbor, working with Professor Qiaozhu Mei. My research focuses on human-centered data science, where I couple data science techniques and social science theories to address real-world problems by understanding, predicting, and intervening in human behavior.

Specifically, I synergize strengths across machine learning, causal inference, field experiments, and social science theories to solve practical problems in the areas of data science for social good, the sharing economy, crowdsourcing, crowdfunding, social media, and health. For example, we have collaborated with New York City’s Mayor’s Office and helped to prioritize government outreach to tenants vulnerable to landlord harassment in New York City by deploying machine learning models. In collaboration with Didi Chuxing, a leading ride-sharing platform, we have leveraged field experiments and machine learning models to enhance driver engagement and intervention design. The results of my work have been integrated into real-world products that involve millions of users and have been published across data mining, social computing, and human-computer interaction venues.

Talk Title: PAPRIKA: Private Online False Discovery Rate Control

Watch Wanrong’s Spotlight Research Talk

Talk Abstract: In hypothesis testing, a false discovery occurs when a hypothesis is incorrectly rejected due to noise in the sample. When adaptively testing multiple hypotheses, the probability of a false discovery increases as more tests are performed. Thus the problem of False Discovery Rate (FDR) control is to find a procedure for testing multiple hypotheses that accounts for this effect in determining the set of hypotheses to reject. The goal is to minimize the number (or fraction) of false discoveries, while maintaining a high true positive rate (i.e., correct discoveries).

In this work, we study False Discovery Rate (FDR) control in multiple hypothesis testing under the constraint of differential privacy for the sample. Unlike previous work in this direction, we focus on the online setting, meaning that a decision about each hypothesis must be made immediately after the test is performed, rather than waiting for the output of all tests as in the offline setting. We provide new private algorithms based on state-of-the-art results in non-private online FDR control. Our algorithms have strong provable guarantees for privacy and statistical performance as measured by FDR and power. We also provide experimental results to demonstrate the efficacy of our algorithms in a variety of data environments.
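
For contrast with the online, private setting described above, the following sketch shows the classical Benjamini-Hochberg procedure, which is offline and non-private: it needs all p-values before deciding anything. It is not PAPRIKA or any algorithm from the talk, and the function name and example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by Benjamini-Hochberg.

    Offline procedure: every p-value must be available up front.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])


# Illustrative p-values only; not data from the talk.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(example_p, alpha=0.05)
```

Because the rejection threshold depends on the rank of each p-value among all of them, this procedure cannot commit to a decision as each test arrives, which is the constraint the online setting imposes.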

Bio: Wanrong Zhang is a PhD candidate at Georgia Tech supervised by Rachel Cummings and Yajun Mei. Her research interests lie primarily in data privacy, with connections to statistics and machine learning. Her research focuses on designing privacy-preserving algorithms for machine learning models and statistical analysis tools, as well as identifying and preventing privacy vulnerabilities in modern collaborative learning. Before joining Georgia Tech, she received her B.S. in Statistics from Peking University.

Talk Title: Towards Better Informed Extraction of Events from Documents

Watch Xinya’s Spotlight Research Talk

Talk Abstract: Large amounts of text are written and published online every day. As a result, applications that read through documents to automatically extract useful information and answer user questions are increasingly needed to help people absorb information efficiently. In this talk, I will focus on the problem of finding and organizing information about events and introduce my recent research on document-level event extraction. First, I will briefly summarize the high-level goal and several key challenges (including modeling context and better leveraging background knowledge), as well as my efforts to tackle them. Then I will focus on the work where we formulate event extraction as a question answering problem, both to access relevant knowledge encoded in large models and to reduce the cost of human annotation required for training data creation/construction.

Bio: Xinya Du is a Ph.D. candidate in the Computer Science Department of Cornell University, advised by Prof. Claire Cardie. He received a bachelor’s degree in Computer Science from Shanghai Jiao Tong University. His research is on natural language processing, especially methods that enable learning with fewer annotations for document-level information extraction. His work has been published in leading NLP conferences and has been covered by New Scientist and TechRepublic.

Talk Title: Understanding Success and Failure in Science and Technology

Talk Abstract: 21st-century society is largely driven by science and innovation, but our quantitative understanding of why, how, and when innovators and innovations succeed or fail remains limited. Despite the long-standing interest in this topic, current science of science research relies on citation and publication records as its major data sources. Yet science functions as a complex system that is much more than published papers, and ignorance of this multidimensional nature precludes a deeper examination of many fundamental elements of innovation lifecycles, from failure to scientific breakthrough, from public funding to broad impact. In this talk, I will touch on a few examples of success and failure across science and technology, hoping to illustrate a way toward a better understanding of the full innovation lifecycle. By combining various large-scale datasets and interdisciplinary analytical frameworks rooted in data mining, statistical physics, and computational social science, we discover a series of fundamental mechanisms and signals underlying the processes in which (1) individuals and organizations build on previous repeated failures towards ultimate victory or defeat in science, startups and security; (2) scientific elites produce breakthrough discoveries in their scientific careers; and (3) scientific research gets funded and used by the general public. The uncovered patterns in these studies not only unveil regularity and predictability underlying the often-noisy social systems, but also offer a new theoretical and empirical basis that is practically relevant for individual scientists, research institutes, and innovation policymakers.

Bio: Yian Yin is a Ph.D. candidate in Industrial Engineering & Management Sciences at Northwestern University, advised by Dashun Wang and Noshir Contractor. He also holds affiliations with the Northwestern Institute on Complex Systems and the Center for Science of Science and Innovation. Prior to joining Northwestern, he received his bachelor’s degrees in Statistics and Economics from Peking University in 2016.

Yian studies computational social science, with a particular focus on integrating theoretical insights in innovation studies, computational tools in data science, and modeling frameworks in complex systems to examine various fundamental elements of innovation lifecycles, from dynamics of failure to emergence of scientific breakthrough, from public funding for science to broad uses of science in public domains. His research has been published in multidisciplinary journals including Science, Nature, Nature Human Behaviour, and Nature Reviews Physics, and has been featured in Science, Lancet, Forbes, Washington Post, Scientific American, Harvard Business Review, MIT Technology Review, among other outlets.

Talk Title: Towards Interpretable Machine Learning by Human Knowledge Reasoning

Talk Abstract: Despite the great success achieved by statistical learning theories for building intelligent systems, a long-standing challenge of artificial intelligence remains: bridging the gaps between what machines know, what humans think machines know, and what humans know about the real world. By doing so, we expect to first ground the prior knowledge of machines in human knowledge and then perform explicit reasoning for various downstream tasks, yielding more interpretable machine learning.

In this talk, I will briefly present two pieces of my existing work that leverage human expert and commonsense knowledge reasoning to increase the interpretability and transparency of machine learning models in the field of natural language processing. Firstly, I will show how existing cognitive theories on human memory can inspire an interpretable framework for rationalizing the medical relation prediction task based on expert knowledge. Secondly, I will introduce how we can learn better word representations based on commonsense knowledge and reasoning. Our proposed framework learns a commonsense reasoning module guided by a self-supervision task and provides word pair and single word representations distilled from learned reasoning modules. Both the above works are able to offer reasoning paths to justify their decisions and boost the model interpretability that humans can understand with minimal knowledge barriers.

Bio: Zhen Wang is a Ph.D. student in the Department of Computer Science and Engineering at the Ohio State University, advised by Prof. Huan Sun. His research centers on natural language processing, data mining, and machine learning, with emphasis on information extraction, question answering, graph learning, text understanding, and interpretable machine learning. In particular, he is interested in improving the trustworthiness and generalizability of data-driven machine learning models through interpretable and robust knowledge representation and reasoning. He has published papers in several top-tier data science conferences, such as KDD, ACL, and WSDM, as well as journals like Bioinformatics. He conducts interdisciplinary research that connects artificial intelligence with cognitive neuroscience, linguistics, software engineering, and medical informatics.
