
Reid McIlroy-Young builds AI systems that better understand people, drawing on insights from human problem solving to design machine learning models and applying these systems to produce fairer outcomes in decisions such as hiring. McIlroy-Young was a Postdoctoral Fellow at the Hire Aspirations Institute and the Theory of Computation group at the Harvard John A. Paulson School of Engineering and Applied Sciences. He received a PhD in Computer Science from the University of Toronto, where he was also a fellow at the Schwartz Reisman Institute for Technology and Society. McIlroy-Young’s research focuses on developing machine learning systems that can collaborate with humans and examining the societal implications of these systems. He is also known for creating Maia Chess, a family of neural-network chess engines that play in a human-like style. Most recently, he has examined the medical data ecosystem in the paper “Medical Data for Sale: Accessing Reproductive Health Information via the Data Brokerage Landscape.” The paper was co-authored with fellow researchers Diana Freed (Brown University), Sarah Radway (Harvard University), Gabriela Becher (Brown University), Diane Bernabei (Debevoise & Plimpton), Christina Lee (George Washington University), and Cynthia Dwork (Harvard University). See McIlroy-Young’s previous publications here.

Health data is among the most sensitive information humans produce, and protecting it has led to a web of regulatory safeguards. In the United States, for instance, the Health Insurance Portability and Accountability Act (HIPAA) sets strict rules for how personal health information can be used and shared. Yet despite these protections, marketplaces for health data continue to emerge, driven by their immense value to pharmaceutical companies, insurers, and researchers working to advance patient care. When shared responsibly, such data has the potential to be transformative. But is the current health data marketplace ecosystem designed to maximize social value while minimizing risks to privacy and welfare? Dr. Reid McIlroy-Young has investigated these questions in depth. In this interview, we explore his research and the dynamics that shape this evolving ecosystem.

  1. What are the dataflows in the “medical data ecosystem” that you identified in your study?

For the paper [“Medical Data for Sale: Accessing Reproductive Health Information via the Data Brokerage Landscape”], we were concerned with a broad conception of medical data. To data brokers, the main factor that matters about a piece of data is whether it is HIPAA-covered (medical records, medical scribe transcriptions, etc.) or non-HIPAA (web searches, credit card receipts). To me, the most important discovery of the study is how easy it is for things that look like medical records to be collected and sold: for example, an inference that someone is pregnant, or location data revealing visits to medical providers. If records are non-HIPAA, then they flow easily and, in our model, are trivially available for purchase.

  2. Are there examples of dataflows that are (potentially) beneficial or harmful to agents of the ecosystem?

Taking the position of a privacy researcher, most if not all of the dataflows that reach commercial vendors are harmful to most of society, and in particular to people seeking access to sensitive medical treatments. The benefits accrue solely to those wishing to exploit information asymmetries for personal gain.

In a broader view, there are other potential benefits; the main ones I have seen discussed are drug testing and understanding markets and customers.

  3. Can you describe how these insights are being monetized and in what way, is it impacting drug pricing, food pricing, is it targeting specific groups, etc.? What groups are most at risk?

To be clear, I’m describing the privacy research position, under which dataflows without user consent and full knowledge are inherently harmful. There are other valid ethical frameworks through which to view these ecosystems. To answer your question more directly: in my work we do not look at how the data are being used to harm people. Instead, we hypothesize different potential threat models (agents that wish to do harm) and consider how they could get access to data that would allow them to do that harm. So I can’t point to specific companies or individuals doing harm. The first threat model that inspired my work was bounty hunters: Texas famously allows individuals to collect bounties for aiding the State in finding people participating in “illegal” abortions (the law is rather roundabout, so this is not a fully precise description). Access to location data would allow these bounty hunters to locate people visiting abortion services. A less charged example is drug development and pricing: one of the most commonly named use cases for medical data sales is aiding drug development. The example I saw showed that people with different medical conditions could be mapped and classified so that drug companies could optimally target marketing, new developments, and so on. These types of practices tend to hurt low-income and minority populations by reducing their access to resources.

  4. What legal or technical steps could protect sensitive health data from exploitation while still enabling beneficial research?

Legally, there are many options, but the main one is restricting people’s ability to sell medical data. Most other countries don’t have the same amount of medical data for sale, so there are clearly many possible legal approaches. I’m Canadian, so I’m familiar with Canadian Blood Services: they allow data sharing with researchers, their data are regularly used in research, and to the best of my knowledge there have been no negative consequences of this sharing.

Technically, there is a large body of work on conducting privacy-preserving database queries. The main method is differential privacy, which was co-invented by one of my coauthors, Dr. Dwork. In our work we looked for mentions of privacy-preserving techniques, and the only ones we found were limitations on minimum sample size and simple anonymization by censoring or removing information (e.g., removing names, or truncating ZIP codes to their first three digits, “ZIP 3”). These methods can be easily circumvented and have not been considered sufficient by researchers for decades.
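To give a sense of what differential privacy looks like in practice, here is a minimal sketch of the classic Laplace mechanism applied to a count query. The dataset, the `dp_count` function, and the `(zip3, visited_clinic)` records are all hypothetical illustrations, not code or data from the study:

```python
import math
import random

def dp_count(records, predicate, epsilon=1.0):
    """Return an epsilon-differentially-private count of matching records.

    A count query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so adding Laplace noise with scale 1/epsilon
    yields epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for r in records if predicate(r))
    # Sample Laplace(0, 1/epsilon) noise via the inverse-CDF method.
    u = random.uniform(-0.5, 0.5)
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Hypothetical toy dataset: (ZIP-3 prefix, inferred clinic visit).
records = [("021", True), ("021", False), ("100", True), ("100", True)]
noisy = dp_count(records, lambda r: r[1], epsilon=0.5)
print(noisy)  # true count is 3; the released value is randomly perturbed
```

Unlike a minimum-sample-size rule, the noise here gives a formal guarantee: the released count is nearly equally likely whether or not any one individual is in the dataset, so the answer cannot be used to confirm a specific person’s presence. Smaller `epsilon` means more noise and stronger privacy.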

  5. What interventions have you identified in the ecosystem that aim to control dataflows, directly or indirectly? How do you assess the effectiveness of those interventions? For example, regarding the Expert Determination clause within HIPAA, what do you think about the effectiveness of this process? What are potential pitfalls and what are potential benefits?

[HIPAA approves two methods for de-identification of personal health information (PHI), the Expert Determination method and the Safe Harbor method. The Expert Determination approach for de-identification requires “(1) A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable: (i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and (ii) Documents the methods and results of the analysis that justify such determination.” Read more about HIPAA and the Expert Determination approach from the U.S. Department of Health and Human Services.]

As I mentioned in question one, HIPAA is the main mechanism for controlling dataflows. It is very effective in some ways: it does mean that raw medical data are not for sale. But, as you mentioned, the Expert Determination clause and Safe Harbor provisions mean that circumventing HIPAA is possible, and it is even possible to hire HIPAA-covered entities to make queries on your behalf (this is from our newer, unpublished work).

We do take the stance that laws are effective in controlling dataflows in this ecosystem; for example, most data brokers claim to follow various state laws, and when we talked to them they were very aware of the legal landscape. The main issue is that the laws don’t do very much, so it’s tricky to say how effective stronger ones would be in practice. That said, New Jersey’s Daniel’s Law might test this in the near future.

[Daniel’s Law prohibits disclosure of the residential address of a “covered person” and allows a covered person to request that anyone disclosing their residential address cease doing so. It was passed at both the state and federal levels: in New Jersey in 2020, and by Congress in 2022. Daniel’s Law is named after Daniel Anderl, the son of U.S. District Court Judge Esther Salas, who was shot and killed at the family’s home by a gunman who had found Judge Salas’ home address online. Read more about Daniel’s Law here.]

  6. Regarding the New Jersey Daniel’s Law, what challenges do you foresee with implementation of this law?

Daniel’s Law is already facing significant challenges. The primary pushback is the claim that it limits free speech rights. When I talked to legal experts about it, they were pessimistic about its surviving this challenge, and it seems some companies may simply ignore the law, hoping it is struck down. Since there is very little transparency in the industry, it is also not clear how much enforcement there will be.

  7. What are the next steps for this project?

We (Diana Freed, Cynthia Dwork, and I) are currently focused on understanding how different types of attackers can exploit access to data brokers to surveil pregnant women. In particular, we’re looking at the marketplace from the perspective of an attacker, so that we understand what the data look like after they have “gone through the ecosystem.”
