Fred Trotter
October 24, 2016
Did it matter how many numbers were on this safe’s combination lock?
Nope! Only the tools pictured were used to open this safe, according to the forums where I found this image. This situation is analogous to the one we face with healthcare cybersecurity and sophisticated cyberattacks. All the energy we put into protecting patient privacy is similar to the safe combination: very quickly we make legitimate access difficult, without actually making it harder for the bad guys.
Where are we with healthcare cybersecurity? Is the safe something that requires a crowbar to open, power tools, a blow torch, or a tank? How can we better address cybersecurity threats?
I am so glad you asked! The first Healthcare Industry CyberSecurity Task Force is calling for feedback on these issues, and we very much appreciate hearing your opinions about this here:
https://www.reddit.com/r/medicine/comments/55qmft/healthcare_cybersecurity_task_force_ama/
In addition to being a consumer representative of the Task Force, I am a healthcare data journalist, data scientist, and cybersecurity consultant. I make money and a difference in the world by mining value from large patient data sets. I work at CareSet Systems, and I am comfortable claiming that CareSet is the first Medicare data vendor.
At CareSet, the best term we have yet found to describe our work remains “Healthcare Data Journalism”. That term captures the work that we do at CareSet to gather and disseminate high quality healthcare data to solve problems in healthcare. We also feel a profound responsibility to share our work not only with our customers, but with the public, whenever it is obvious that there is a public interest in the datasets that we uncover. But we are hardly the only organization that relies on deidentification methods to create value from patient data. For the moment, let’s refer to organizations with this motivation as “ethical miners” of healthcare data.
Ethical miners frequently take data that was once associated with individual patients and change it into information that contains no specific patient details but can still describe how large groups of patients were treated as a whole. This process, known as “Patient Data Aggregation”, is a powerful way to benefit patients in the long term, using data they entrust to the healthcare system. CMS will only allow organizations like CareSet to release aggregated patient data that covers more than 10 (i.e. 11+) patients.
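As an illustration, that aggregation-with-suppression step can be sketched in a few lines of Python. The records, field names, and function here are hypothetical, not CareSet’s actual pipeline; only the 11-patient minimum comes from the CMS rule described above.

```python
# Hypothetical claim records: (patient_id, diagnosis_code) pairs.
# Aggregation counts distinct patients per code, dropping the IDs.
claims = [
    ("p1", "E11"), ("p2", "E11"), ("p3", "E11"), ("p4", "E11"),
    ("p5", "E11"), ("p6", "E11"), ("p7", "E11"), ("p8", "E11"),
    ("p9", "E11"), ("p10", "E11"), ("p11", "E11"),
    ("p12", "I10"), ("p13", "I10"),
]

def aggregate(records, min_cell_size=11):
    """Count distinct patients per diagnosis code, suppressing small cells.

    Any cell covering fewer than `min_cell_size` patients (CMS's
    11+ rule) is withheld entirely rather than published.
    """
    patients_per_code = {}
    for patient_id, code in records:
        patients_per_code.setdefault(code, set()).add(patient_id)
    return {
        code: len(ids)
        for code, ids in patients_per_code.items()
        if len(ids) >= min_cell_size
    }

print(aggregate(claims))  # {'E11': 11} -- the 2-patient I10 cell is suppressed
```

The published output contains only group-level counts; the patient identifiers never leave the function, and cells too small to hide an individual in the crowd are dropped entirely.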
Note that I did not label this process as “Deidentification”, which occurs a step prior to Aggregation. Deidentification includes stripping away details like “first name”, “last name” and “phone number” from the data. In the case of Medicare claims, CMS takes care of deidentification before my company aggregates the data. This is a very effective way for data holders like CMS to work with ethical miners.
However, without getting too technical, it is possible to mount a mosaic attack on an aggregated data set, which could effectively turn it back into a partially deidentified data set. This means it might be possible to do some math on multiple data sources and piece together what happened to a single patient, while still not knowing who that patient is. This “disaggregation” attack involves merging available data sets from various sources to infer something that would be very difficult to infer otherwise.
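One of the simplest forms of such an attack is a “differencing” attack, where two aggregate releases that are each safe on their own are subtracted to isolate a single patient’s records. The hospitals and counts below are invented purely for illustration:

```python
# Two hypothetical aggregate releases from the same source:
# counts of patients with a given diagnosis, by hospital,
# covering two overlapping reporting periods.
release_through_june = {"Hospital A": 14, "Hospital B": 23}
release_through_july = {"Hospital A": 15, "Hospital B": 23}

# Subtracting the releases isolates July's new patients.
july_only = {
    hospital: release_through_july[hospital] - release_through_june[hospital]
    for hospital in release_through_june
}
print(july_only)  # {'Hospital A': 1, 'Hospital B': 0}

# A difference of 1 means everything both releases say about that cell
# now describes a single (still unnamed) patient -- the aggregate has
# been "disaggregated" even though neither release was unsafe alone.
```

Note that neither release violates a minimum-cell-size rule; the leak comes from combining them, which is exactly what makes mosaic attacks hard to defend against release-by-release.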
Happily, “disaggregation” attacks are much less worrisome than “re-identification” attacks. After all, only a full re-identification attack actually reveals the identity of a patient. For a properly deidentified dataset, it should be impossible to identify more than a handful of records; let’s overestimate that as 0.1% of the original deidentified data set. Likewise, if a disaggregation attack is successful, it might turn a handful of aggregated records into merely deidentified records; let’s overestimate that as 0.1% of the aggregated data set.
What is the chance of an individual record being fully reidentified from a properly aggregated healthcare data set? Well, based on the assumptions we have made here, it is 0.1% of 0.1%, or one in a million, which is almost always going to be exactly 0 people. That math makes the data mining that I perform both ethical and safe. And it lets the pointy-haired lawyers at places like HHS sleep soundly at night.
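The arithmetic behind that claim, using the overestimates from the previous paragraph (the dataset size is an arbitrary example, not a real release):

```python
# Back-of-the-envelope math from the text: overestimate that a
# disaggregation attack turns 0.1% of an aggregated data set into
# deidentified records, and that 0.1% of those could then be
# fully re-identified.
disaggregation_rate = 0.001    # 0.1%, the overestimate from the text
reidentification_rate = 0.001  # 0.1%, the overestimate from the text

combined = disaggregation_rate * reidentification_rate  # one in a million

# For a hypothetical release covering 100,000 patients, the expected
# number of fully re-identified individuals is well under one person.
dataset_size = 100_000
expected_reidentified = dataset_size * combined
print(f"{expected_reidentified:.1f}")  # 0.1
```

The key point is that the two rates multiply: each layer of protection only has to be good, not perfect, for the combined risk to round down to zero people.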
But what if there is a cybersecurity breach that makes patients’ medical records available on the dark web? Our assumptions here rely on the idea that the only data sets used in combination with public healthcare data sets are other public data sets. With leaked EHR records in hand, an attacker could combine the data sets released by CMS and other data vendors to create a vivid picture of a given patient.
Let’s pretend our imaginary friend, John Smith, had a personal mental health problem while working in one of the many industries that still regard such issues with disdain. Then, John’s psychologist gets hacked, and his entire record is dumped on the dark web.
Getting access to patient records on the dark web is not a theoretical problem. This Senate briefing from ICIT is a good source for more details. Once a “black hat” accesses this information, they could use other publicly available data sets to sort out where else the patient in question had received care. This information might help the black hat in subsequent attacks, or it might be enough for the black hat to infer facts that could damage Mr. Smith.
Because of the American aversion to “Big Brother” the notion of a centralized health record has never been able to get political traction. As a result, we have our healthcare records spread across organizations. The fact that organization A takes every possible step to properly protect healthcare information as they release an aggregated data set, makes little difference if organization B cannot protect their healthcare records from hackers.
Note that organization B can be an email provider rather than an EHR vendor. It remains to be seen just how exposed the public was as a result of the recent Yahoo email breach. Personally, I had a Yahoo account for years and most certainly held multiple explicit and inferred discussions regarding my personal healthcare there. Countless people, including me, have had material that would qualify as individually identifiable health information under HIPAA released as part of the Yahoo data breach.
Ethical healthcare data miners like me should continue to obsess over how to use math to protect the privacy of healthcare data releases. This helps ensure that researchers, entrepreneurs, and innovators have the data they need to improve healthcare. Our work to ensure that our data releases protect patient privacy is just like choosing a good safe combination. But we need to be careful not to kid ourselves as we attend to privacy math.
The problem with a cheap home safe (grade “B”, as it’s called in the forum) is that it makes you think you are protected, when in fact you are only protected from ignorant and inexperienced attackers. Because of your imagined protection, you might make bad decisions that make the problem of a weak safe much worse (e.g. putting the safe at the front of the store where it can be readily seen).
Sadly, this is exactly the position we find ourselves in with healthcare cybersecurity. The issue of protecting patient data is complex, nuanced, and large in scope. It includes using math to protect patient privacy, but it also includes a host of other things that are far more important, and far more difficult to quantify. This is why the Task Force is calling on thoughtful experts everywhere to help.
Again, please share your thoughts and advice:
https://www.reddit.com/r/medicine/comments/55qmft/healthcare_cybersecurity_task_force_ama/
– Fred Trotter, CTO CareSet Systems