As you might already have recognized, I am in favor of strong privacy technology to counter the surveillance of our current age. One pillar to achieve this is to get developers thinking about the topic and to teach methods for building privacy-protecting systems.
In this article you will learn about a technique called k-anonymity, which you can use to obfuscate a data set and protect the people whose data you collected.
Usage Scenario
Let’s start with the typical example scenario in this space:
A hospital collects data about its patients in order to help research gather insights about diseases and, in the end, serve its patients even better.
Let’s assume we have the following (intentionally tiny) data set (Source: Wikipedia):
Name | Age | Gender | State of domicile | Religion | Disease |
--- | --- | --- | --- | --- | --- |
Ramsha | 30 | Female | Tamil Nadu | Hindu | Cancer |
Yadu | 24 | Female | Kerala | Hindu | Viral infection |
Salima | 28 | Female | Tamil Nadu | Muslim | TB |
Sunny | 27 | Male | Karnataka | Parsi | No illness |
Joan | 24 | Female | Kerala | Christian | Heart-related |
Bahuksana | 23 | Male | Karnataka | Buddhist | TB |
Rambha | 19 | Male | Kerala | Hindu | Cancer |
Kishor | 29 | Male | Karnataka | Hindu | Heart-related |
Johnson | 17 | Male | Kerala | Christian | Heart-related |
John | 19 | Male | Kerala | Christian | Viral infection |
I assume we agree that it is not a good idea to expose the patients and their diseases by handing this data as-is to research. Doing so would violate the patients’ privacy and could cause problems for them as well as for the hospital. (GDPR to the rescue ;))
This is the moment where k-anonymity comes into play.
Definition of k-anonymity
Wikipedia states: *“A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k − 1 individuals whose information also appear in the release.”*
We get the following insights from this definition:
- k-anonymity is a property of a data set, not a method
- k-anonymity means that every individual in the data set is indistinguishable from at least k − 1 other individuals in the set
To achieve 2-anonymity (k = 2), the data set needs to be obfuscated (by some method) so that for every entry there is at least one other entry that is identical. For 3-anonymity there must be at least two others, and so on…
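The definition translates directly into code. Here is a minimal sketch, assuming the data set is a list of dicts and that we pick ourselves which attributes count as identifying (the function name `k_anonymity` and the sample records are made up for illustration):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k of the data set: the size of the smallest group of
    records sharing the same values for all identifying attributes."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return min(groups.values())

patients = [
    {"Age": 24, "Gender": "Female", "State": "Kerala"},
    {"Age": 24, "Gender": "Female", "State": "Kerala"},
    {"Age": 19, "Gender": "Male",   "State": "Kerala"},
]

# The third record is unique on (Age, Gender, State), so k is only 1.
print(k_anonymity(patients, ["Age", "Gender", "State"]))  # 1
```

Note that the result depends entirely on which attributes you declare as identifying: dropping Age and Gender from the check would already give k = 3 for this tiny set.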
Obfuscation
The best protection would be to remove as many attributes as possible and hand out only a minimal amount of data. While that would protect the patients’ privacy, it would also render the research useless. We have to find the sweet spot between patient privacy and research interest by obfuscating just enough information to effectively protect the patients’ privacy.
There are two common obfuscation methods in order to achieve k-anonymity:
- Generalization: Replaces attribute values with broader ones. E.g. replacing the exact age of a patient with an age range (say, replacing 23 with “20 < Age ≤ 30”) still gives research good enough information, but removes a directly identifying attribute.
- Suppression: This method removes certain attributes altogether, as without them the identification of single people in the data set might no longer be possible. E.g. the first step of obfuscating a data set is usually removing the unique names.
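Both methods can be sketched in a few lines. This is a hypothetical implementation, assuming records are dicts and using the same age categories as the tables below (`generalize_age` and `suppress` are names chosen here, not from any library):

```python
def generalize_age(record):
    """Generalization: replace the exact age with a coarse range."""
    record = dict(record)  # work on a copy, keep the original intact
    record["Age"] = "Age <= 20" if record["Age"] <= 20 else "20 < Age <= 30"
    return record

def suppress(record, attributes):
    """Suppression: drop direct identifiers from the record entirely."""
    return {k: v for k, v in record.items() if k not in attributes}

row = {"Name": "Yadu", "Age": 24, "Religion": "Hindu", "Disease": "Viral infection"}
row = suppress(generalize_age(row), ["Name", "Religion"])
print(row)  # {'Age': '20 < Age <= 30', 'Disease': 'Viral infection'}
```

Applied over the whole data set, these two functions produce exactly the transformations shown in the tables below.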
Obfuscation applied
Generalization
Our data set would look like the following if we apply generalization to the Age attribute with the two categories “Age ≤ 20” and “20 < Age ≤ 30”:
Name | Age | Gender | State of domicile | Religion | Disease |
--- | --- | --- | --- | --- | --- |
Ramsha | 20 < Age ≤ 30 | Female | Tamil Nadu | Hindu | Cancer |
Yadu | 20 < Age ≤ 30 | Female | Kerala | Hindu | Viral infection |
Salima | 20 < Age ≤ 30 | Female | Tamil Nadu | Muslim | TB |
Sunny | 20 < Age ≤ 30 | Male | Karnataka | Parsi | No illness |
Joan | 20 < Age ≤ 30 | Female | Kerala | Christian | Heart-related |
Bahuksana | 20 < Age ≤ 30 | Male | Karnataka | Buddhist | TB |
Rambha | Age ≤ 20 | Male | Kerala | Hindu | Cancer |
Kishor | 20 < Age ≤ 30 | Male | Karnataka | Hindu | Heart-related |
Johnson | Age ≤ 20 | Male | Kerala | Christian | Heart-related |
John | Age ≤ 20 | Male | Kerala | Christian | Viral infection |
Suppression
Our data set would look like the following if we apply suppression to the Name and Religion attributes:
Name | Age | Gender | State of domicile | Religion | Disease |
--- | --- | --- | --- | --- | --- |
– | 30 | Female | Tamil Nadu | – | Cancer |
– | 24 | Female | Kerala | – | Viral infection |
– | 28 | Female | Tamil Nadu | – | TB |
– | 27 | Male | Karnataka | – | No illness |
– | 24 | Female | Kerala | – | Heart-related |
– | 23 | Male | Karnataka | – | TB |
– | 19 | Male | Kerala | – | Cancer |
– | 29 | Male | Karnataka | – | Heart-related |
– | 17 | Male | Kerala | – | Heart-related |
– | 19 | Male | Kerala | – | Viral infection |
Combined
Most of the time both methods are used in combination. Combining the two examples above, our data set would look like the following:
Name | Age | Gender | State of domicile | Religion | Disease |
--- | --- | --- | --- | --- | --- |
– | 20 < Age ≤ 30 | Female | Tamil Nadu | – | Cancer |
– | 20 < Age ≤ 30 | Female | Kerala | – | Viral infection |
– | 20 < Age ≤ 30 | Female | Tamil Nadu | – | TB |
– | 20 < Age ≤ 30 | Male | Karnataka | – | No illness |
– | 20 < Age ≤ 30 | Female | Kerala | – | Heart-related |
– | 20 < Age ≤ 30 | Male | Karnataka | – | TB |
– | Age ≤ 20 | Male | Kerala | – | Cancer |
– | 20 < Age ≤ 30 | Male | Karnataka | – | Heart-related |
– | Age ≤ 20 | Male | Kerala | – | Heart-related |
– | Age ≤ 20 | Male | Kerala | – | Viral infection |
Evaluating k-anonymity
To determine the degree of k-anonymity, we count how many entries are indistinguishable from each other on the remaining identifying attributes (everything except the Disease column), as the goal of our anonymization is to not disclose which person has which disease.
Let’s do this for 4 different data sets:
- Original: 10 unique entries, 0 2-anonymous entries, 0 3-anonymous entries => no anonymity
- Generalization example: 10 unique entries (the names alone make every entry unique), 0 2-anonymous entries, 0 3-anonymous entries => no anonymity
- Suppression example: 6 unique entries, 4 2-anonymous entries, 0 3-anonymous entries => no anonymity
- Generalization & suppression combined: 0 unique entries, 4 2-anonymous entries, 6 3-anonymous entries => 2-anonymity
The whole data set is always classified by its lowest k-anonymity, even if some entries offer much higher protection. Thus, even though the suppression example contains some 2-anonymous entries, the overall class is still 1-anonymity, as unique entries remain. Even if only a single unique entry remained and all others were in higher classes, the class would stay the same.
Only the last example, combining generalization & suppression, offers 2-anonymity, as the weakest anonymity the data set provides is 2-anonymity.
Thereby no person can be linked to a single disease, as there are always at least two entries with the same identifying attributes.
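The evaluation above can be reproduced by grouping the combined data set on its remaining identifying attributes (age range, gender, state of domicile). This sketch hard-codes those tuples straight from the combined table:

```python
from collections import Counter

# The combined data set, reduced to the identifying attributes
# (Age range, Gender, State of domicile) — the Disease column is excluded.
rows = [
    ("20 < Age <= 30", "Female", "Tamil Nadu"),
    ("20 < Age <= 30", "Female", "Kerala"),
    ("20 < Age <= 30", "Female", "Tamil Nadu"),
    ("20 < Age <= 30", "Male",   "Karnataka"),
    ("20 < Age <= 30", "Female", "Kerala"),
    ("20 < Age <= 30", "Male",   "Karnataka"),
    ("Age <= 20",      "Male",   "Kerala"),
    ("20 < Age <= 30", "Male",   "Karnataka"),
    ("Age <= 20",      "Male",   "Kerala"),
    ("Age <= 20",      "Male",   "Kerala"),
]

groups = Counter(rows)
# Two groups of size 2 (the female entries) and two of size 3 (the male
# entries); the smallest group determines the k of the whole data set.
print("k =", min(groups.values()))  # k = 2
```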
Possible attacks & caveats
There are possible attacks on k-anonymity to re-identify the information in the data set.
- Homogeneity Attack: If the anonymized entries within a group are too similar, they might still give insights about a person. E.g. if all males under 20 in the data set living in a certain city have cancer, then this allows reasoning about a single person again.
- Background Knowledge Attack: By using additional information, it might be possible to de-anonymize the data. E.g. if you know that heart-related diseases occur at a higher rate for people living in a certain city, you can link this city to the data again.
Other problems are:
- Forward Knowledge Attack (my term): If you have certain knowledge about people in the data set, you might gain unintended insights. E.g. if you know that Simon is male and 19 years old, you can tell from the data set that he is sick (either Cancer, Heart-related, or a Viral infection), which is a privacy violation again.
- Skewing of the data set: By removing too much information, the validity of the data set might suffer, as important information could be lost, which renders the research useless again.
Conclusion
If you work with data sets you should always keep the privacy rights of the people in the set in mind. You can use the presented techniques, generalization & suppression, to achieve a certain strength of k-anonymity. The whole process is known to be hard to automate and needs time and understanding; small mistakes can result in de-anonymization.
Please keep in mind this article is a simplified explanation. If you want to gain even deeper knowledge about this topic, following paper is a good start: https://dataprivacylab.org/dataprivacy/projects/kanonymity/paper3.pdf