k-anonymity

As you might already have recognized, I am in favor of strong privacy technology to counter the surveillance of our current age. One pillar of achieving this is to get developers thinking about privacy and to teach methods for building privacy-protecting systems.

In this article you will learn about a theoretical technique called k-anonymity that helps you obfuscate data and protect the people whose data you collect.

Usage Scenario

Let’s start with the typical example scenario in this space:
A hospital collects data about its patients in order to help researchers gather insights about diseases and, in the end, serve the patients even better.
Let’s assume we have the following (intentionally super small) data set (source: Wikipedia):

Name       Age  Gender  State of domicile  Religion   Disease
Ramsha     30   Female  Tamil Nadu         Hindu      Cancer
Yadu       24   Female  Kerala             Hindu      Viral infection
Salima     28   Female  Tamil Nadu         Muslim     TB
Sunny      27   Male    Karnataka          Parsi      No illness
Joan       24   Female  Kerala             Christian  Heart-related
Bahuksana  23   Male    Karnataka          Buddhist   TB
Rambha     19   Male    Kerala             Hindu      Cancer
Kishor     29   Male    Karnataka          Hindu      Heart-related
Johnson    17   Male    Kerala             Christian  Heart-related
John       19   Male    Kerala             Christian  Viral infection

I assume we agree that it is not a good idea to expose the patients and their diseases by handing this data as-is to researchers. Doing so would violate the patients’ privacy and could cause problems for them as well as for the hospital. (GDPR to the rescue ;))

This is the moment where k-anonymity comes into play.

Definition of k-anonymity

Wikipedia states: “A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k − 1 individuals whose information also appear in the release.”

We get the following insights from this definition:

  • k-anonymity is a property of a data set, not a method
  • k-anonymity means that every entry is indistinguishable from at least k − 1 other entries in the data set

To achieve 2-anonymity (k = 2), the data set needs to be obfuscated (by some method) so that for every entry there is at least one other entry that is identical. For 3-anonymity there must be at least two others, and so on…
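
To make this concrete, here is a minimal Python sketch (the function name is my own, not an established API) that computes the k of a data set by counting identical entries:

```python
# A minimal sketch: k is simply the size of the smallest group of
# identical entries in the released data.
from collections import Counter

def k_anonymity(records):
    """Return the k for which the given records are k-anonymous."""
    if not records:
        return 0
    groups = Counter(records)    # count identical entries
    return min(groups.values())  # the rarest entry determines k

# Two identical entries and one unique entry: the unique one caps k at 1.
records = [
    ("Female", "Kerala"),
    ("Female", "Kerala"),
    ("Male", "Karnataka"),
]
print(k_anonymity(records))  # -> 1, i.e. effectively no anonymity
```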

Obfuscation

The best protection would be to remove as many attributes as possible and only hand out a minimal amount of data. While this would protect the privacy of the patients, it would also render the data useless for research. We have to find the sweet spot between patient privacy and research interest by obfuscating just enough information to effectively protect the patients’ privacy.

There are two common obfuscation methods for achieving k-anonymity (a code sketch follows the list):

  • Generalization: Replaces attributes with broader ones. E.g. replacing the exact age of a patient with an age range (replace 33 with 30 < Age ≤ 40) still gives research good enough information, but removes a directly identifying attribute.
  • Suppression: Removes certain attributes altogether, as without them identifying single people in the data set might no longer be possible. E.g. a first step in obfuscating a data set is removing the unique names.
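
Here is a minimal Python sketch of how both methods could look; generalize_age and suppress are hypothetical helpers, and the age ranges are the ones used in the examples below:

```python
# A minimal sketch of the two obfuscation methods.

def generalize_age(age):
    """Generalization: replace an exact age with a broader range.
    Only covers the ages present in our example data set (max. 30)."""
    return "Age <= 20" if age <= 20 else "20 < Age <= 30"

def suppress(record, attributes):
    """Suppression: remove the given attributes from a record entirely."""
    return {key: value for key, value in record.items() if key not in attributes}

patient = {"Name": "Ramsha", "Age": 30, "Gender": "Female",
           "State of domicile": "Tamil Nadu", "Religion": "Hindu",
           "Disease": "Cancer"}

anonymized = suppress(patient, {"Name", "Religion"})
anonymized["Age"] = generalize_age(anonymized["Age"])
print(anonymized)
# {'Age': '20 < Age <= 30', 'Gender': 'Female',
#  'State of domicile': 'Tamil Nadu', 'Disease': 'Cancer'}
```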

Obfuscation applied

Generalization

Our data set would look like the following if we apply generalization to the age attribute with the two categories “Age ≤ 20” and “20 < Age ≤ 30”:

Name       Age            Gender  State of domicile  Religion   Disease
Ramsha     20 < Age ≤ 30  Female  Tamil Nadu         Hindu      Cancer
Yadu       20 < Age ≤ 30  Female  Kerala             Hindu      Viral infection
Salima     20 < Age ≤ 30  Female  Tamil Nadu         Muslim     TB
Sunny      20 < Age ≤ 30  Male    Karnataka          Parsi      No illness
Joan       20 < Age ≤ 30  Female  Kerala             Christian  Heart-related
Bahuksana  20 < Age ≤ 30  Male    Karnataka          Buddhist   TB
Rambha     Age ≤ 20       Male    Kerala             Hindu      Cancer
Kishor     20 < Age ≤ 30  Male    Karnataka          Hindu      Heart-related
Johnson    Age ≤ 20       Male    Kerala             Christian  Heart-related
John       Age ≤ 20       Male    Kerala             Christian  Viral infection

Suppression

Our data set would look like the following if we apply suppression to the Name and Religion attributes (suppressed values are shown as *):

Name  Age  Gender  State of domicile  Religion  Disease
*     30   Female  Tamil Nadu         *         Cancer
*     24   Female  Kerala             *         Viral infection
*     28   Female  Tamil Nadu         *         TB
*     27   Male    Karnataka          *         No illness
*     24   Female  Kerala             *         Heart-related
*     23   Male    Karnataka          *         TB
*     19   Male    Kerala             *         Cancer
*     29   Male    Karnataka          *         Heart-related
*     17   Male    Kerala             *         Heart-related
*     19   Male    Kerala             *         Viral infection

Combined

Most of the time both methods are used in combination. Combining the two examples, our data set would look like the following:

Name  Age            Gender  State of domicile  Religion  Disease
*     20 < Age ≤ 30  Female  Tamil Nadu         *         Cancer
*     20 < Age ≤ 30  Female  Kerala             *         Viral infection
*     20 < Age ≤ 30  Female  Tamil Nadu         *         TB
*     20 < Age ≤ 30  Male    Karnataka          *         No illness
*     20 < Age ≤ 30  Female  Kerala             *         Heart-related
*     20 < Age ≤ 30  Male    Karnataka          *         TB
*     Age ≤ 20       Male    Kerala             *         Cancer
*     20 < Age ≤ 30  Male    Karnataka          *         Heart-related
*     Age ≤ 20       Male    Kerala             *         Heart-related
*     Age ≤ 20       Male    Kerala             *         Viral infection

Evaluating k-anonymity

To determine which degree of k-anonymity we have, we group the entries by their identifying attributes (everything except Disease, since the goal of our anonymization is not to disclose which person has which disease) and count how many indistinguishable entries each group contains.

Let’s do this for 4 different data sets:

  • Original: 10 unique entries, 0 2-anonym entries, 0 3-anonym entries => no anonymity
  • Generalization example: 10 unique entries, 0 2-anonym entries, 0 3-anonym entries => no anonymity
  • Suppression example: 6 unique entries, 4 2-anonym entries, 0 3-anonym entries => no anonymity
  • Generalization & Suppression examples combined: 0 unique entries, 4 2-anonym entries, 6 3-anonym entries => 2-anonymity

The data set as a whole is always assigned the lowest k-anonymity of any of its entries, even if some entries offer much higher protection. Thus, even though the suppression example contains some 2-anonym entries, the overall class is still 1-anonymity, as unique entries remain. Even if only one unique entry were left and all others were in higher classes, the class would stay the same.

Only the last example, combining generalization & suppression, offers 2-anonymity, as the weakest anonymity the data set provides is 2-anonymity.
Thereby no person can be linked to a single disease, as there are always at least two entries with the same identifying attributes.
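
We can verify this mechanically. The following sketch reduces the combined data set to its remaining identifying attributes and computes the smallest group size:

```python
# Verifying the combined example: group by the remaining identifying
# attributes (Age range, Gender, State of domicile) and take the size
# of the smallest group.
from collections import Counter

combined = [
    ("20 < Age <= 30", "Female", "Tamil Nadu"),
    ("20 < Age <= 30", "Female", "Kerala"),
    ("20 < Age <= 30", "Female", "Tamil Nadu"),
    ("20 < Age <= 30", "Male",   "Karnataka"),
    ("20 < Age <= 30", "Female", "Kerala"),
    ("20 < Age <= 30", "Male",   "Karnataka"),
    ("Age <= 20",      "Male",   "Kerala"),
    ("20 < Age <= 30", "Male",   "Karnataka"),
    ("Age <= 20",      "Male",   "Kerala"),
    ("Age <= 20",      "Male",   "Kerala"),
]

groups = Counter(combined)
print(min(groups.values()))  # -> 2, the data set is 2-anonymous
```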

Possible attacks & caveats

There are attacks on k-anonymity that can re-identify individuals in the data set.

  • Homogeneity Attack: If the anonymized data within a group is too similar, it might still give insights about a person. E.g. if all males in the data set under 20 living in a certain city have cancer, then this allows reasoning about a single person again (see the sketch after this list).
  • Background Knowledge Attack: By using additional information it might be possible to de-anonymize the data. E.g. if you know that heart attacks occur at a higher rate for people living in a certain city, you are able to link this city to the data again.
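
The homogeneity attack can be spotted mechanically; the following sketch (with hypothetical example rows) flags groups in which every entry shares the same disease:

```python
# A sketch of a homogeneity check: if every entry in a group shares the
# same disease, the disease leaks despite k-anonymity.
from collections import defaultdict

rows = [  # (identifying attributes, disease)
    (("Age <= 20", "Male", "Kerala"), "Cancer"),
    (("Age <= 20", "Male", "Kerala"), "Cancer"),  # homogeneous group
    (("20 < Age <= 30", "Female", "Kerala"), "Viral infection"),
    (("20 < Age <= 30", "Female", "Kerala"), "Heart-related"),
]

diseases_per_group = defaultdict(set)
for identifiers, disease in rows:
    diseases_per_group[identifiers].add(disease)

for group, diseases in diseases_per_group.items():
    if len(diseases) == 1:  # 2-anonym, but the disease is still exposed
        print(f"Homogeneous group {group}: everyone has {next(iter(diseases))}")
```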

Other problems are:

  • Forward Knowledge Attack (my term): If you have certain knowledge about people in the data set, you might gain unintended insights again. E.g. if you know that Simon is male and 19 years old, you know from the data set that he is sick (either Cancer, Heart-related, or Viral infection), which is a privacy violation again.
  • Skewing of the data set: By removing too much information from the data set, its validity might be skewed, as important information could be lost, which renders the research useless again.

Conclusion

If you work with data sets you should always keep the privacy rights of the people in the set in mind. You can use the given techniques, generalization & suppression, to achieve a certain strength of k-anonymity. The whole process is hard to automate (finding an optimal k-anonymization is provably NP-hard) and needs time and understanding. Small mistakes can result in de-anonymization.


Please keep in mind that this article is a simplified explanation. If you want to gain even deeper knowledge about this topic, the following paper is a good start: https://dataprivacylab.org/dataprivacy/projects/kanonymity/paper3.pdf
