k-anonymity

As you might already have recognized, I am in favor of strong privacy technology to counter the surveillance of our current age. One pillar of achieving this is to get developers thinking about privacy and to teach methods for building privacy-protecting systems.

In this article you will learn about a theoretical technique called k-anonymity that helps you obfuscate data and protect the people whose data you collect.

Usage Scenario

Let’s start with the typical example scenario in this space:
A hospital collects data about its patients in order to help researchers gather insights about diseases and, in the end, serve the patients even better.
Let’s assume we have the following (intentionally super small) data set (source: Wikipedia):

Name       Age  Gender  State of domicile  Religion   Disease
Ramsha     30   Female  Tamil Nadu         Hindu      Cancer
Yadu       24   Female  Kerala             Hindu      Viral infection
Salima     28   Female  Tamil Nadu         Muslim     TB
Sunny      27   Male    Karnataka          Parsi      No illness
Joan       24   Female  Kerala             Christian  Heart-related
Bahuksana  23   Male    Karnataka          Buddhist   TB
Rambha     19   Male    Kerala             Hindu      Cancer
Kishor     29   Male    Karnataka          Hindu      Heart-related
Johnson    17   Male    Kerala             Christian  Heart-related
John       19   Male    Kerala             Christian  Viral infection

I assume we agree that it is not a good idea to expose the patients and their diseases by handing this data as-is to researchers. Doing so would violate the patients’ privacy and could cause problems for them as well as for the hospital. (GDPR to the rescue ;))

This is the moment where k-anonymity comes into play.

Definition of k-anonymity

Wikipedia states: “A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k − 1 individuals whose information also appear in the release.”

We get the following insights from this definition:

  • k-anonymity is a property of a data set, not a method
  • k-anonymity means that every entry is indistinguishable from at least k − 1 other entries in the data set

To achieve 2-anonymity (k = 2), the data set needs to be obfuscated (by some method) so that for every entry there is at least one other entry that is identical. For 3-anonymity there must be at least two others, and so on…
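
To make this concrete, here is a minimal Python sketch (the function name is my own, not an established API) that computes the k of a data set by counting identical entries:

```python
# A minimal sketch: k is simply the size of the smallest group of
# identical entries in the released data.
from collections import Counter

def k_anonymity(records):
    """Return the k for which the given records are k-anonymous."""
    if not records:
        return 0
    groups = Counter(records)    # count identical entries
    return min(groups.values())  # the rarest entry determines k

# Two identical entries and one unique entry: the unique one caps k at 1.
records = [
    ("Female", "Kerala"),
    ("Female", "Kerala"),
    ("Male", "Karnataka"),
]
print(k_anonymity(records))  # -> 1, i.e. effectively no anonymity
```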

Obfuscation

The best protection would be to remove as many attributes as possible and only hand out a minimal amount of data. While this would protect the privacy of the patients, it would also render the data useless for research. We have to find the sweet spot between patient privacy and research interest by obfuscating just enough information to effectively protect the patients’ privacy.

There are two common obfuscation methods for achieving k-anonymity (a code sketch follows the list):

  • Generalization: Replaces attributes with broader ones. E.g. replacing the exact age of a patient with an age range (replace 33 with 30 < Age ≤ 40) still gives research good enough information, but removes a directly identifying attribute.
  • Suppression: Removes certain attributes altogether, as without them identifying single people in the data set might no longer be possible. E.g. a first step in obfuscating a data set is removing the unique names.
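
Here is a minimal Python sketch of how both methods could look; generalize_age and suppress are hypothetical helpers, and the age ranges are the ones used in the examples below:

```python
# A minimal sketch of the two obfuscation methods.

def generalize_age(age):
    """Generalization: replace an exact age with a broader range.
    Only covers the ages present in our example data set (max. 30)."""
    return "Age <= 20" if age <= 20 else "20 < Age <= 30"

def suppress(record, attributes):
    """Suppression: remove the given attributes from a record entirely."""
    return {key: value for key, value in record.items() if key not in attributes}

patient = {"Name": "Ramsha", "Age": 30, "Gender": "Female",
           "State of domicile": "Tamil Nadu", "Religion": "Hindu",
           "Disease": "Cancer"}

anonymized = suppress(patient, {"Name", "Religion"})
anonymized["Age"] = generalize_age(anonymized["Age"])
print(anonymized)
# {'Age': '20 < Age <= 30', 'Gender': 'Female',
#  'State of domicile': 'Tamil Nadu', 'Disease': 'Cancer'}
```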

Obfuscation applied

Generalization

Our data set would look like the following if we apply generalization to the age attribute with the two categories “Age ≤ 20” and “20 < Age ≤ 30”:

Name       Age            Gender  State of domicile  Religion   Disease
Ramsha     20 < Age ≤ 30  Female  Tamil Nadu         Hindu      Cancer
Yadu       20 < Age ≤ 30  Female  Kerala             Hindu      Viral infection
Salima     20 < Age ≤ 30  Female  Tamil Nadu         Muslim     TB
Sunny      20 < Age ≤ 30  Male    Karnataka          Parsi      No illness
Joan       20 < Age ≤ 30  Female  Kerala             Christian  Heart-related
Bahuksana  20 < Age ≤ 30  Male    Karnataka          Buddhist   TB
Rambha     Age ≤ 20       Male    Kerala             Hindu      Cancer
Kishor     20 < Age ≤ 30  Male    Karnataka          Hindu      Heart-related
Johnson    Age ≤ 20       Male    Kerala             Christian  Heart-related
John       Age ≤ 20       Male    Kerala             Christian  Viral infection

Suppression

Our data set would look like the following if we apply suppression to the Name and Religion attributes (suppressed values are shown as *):

Name  Age  Gender  State of domicile  Religion  Disease
*     30   Female  Tamil Nadu         *         Cancer
*     24   Female  Kerala             *         Viral infection
*     28   Female  Tamil Nadu         *         TB
*     27   Male    Karnataka          *         No illness
*     24   Female  Kerala             *         Heart-related
*     23   Male    Karnataka          *         TB
*     19   Male    Kerala             *         Cancer
*     29   Male    Karnataka          *         Heart-related
*     17   Male    Kerala             *         Heart-related
*     19   Male    Kerala             *         Viral infection

Combined

Most of the time both methods are used in combination. Combining the two examples, our data set would look like the following:

Name  Age            Gender  State of domicile  Religion  Disease
*     20 < Age ≤ 30  Female  Tamil Nadu         *         Cancer
*     20 < Age ≤ 30  Female  Kerala             *         Viral infection
*     20 < Age ≤ 30  Female  Tamil Nadu         *         TB
*     20 < Age ≤ 30  Male    Karnataka          *         No illness
*     20 < Age ≤ 30  Female  Kerala             *         Heart-related
*     20 < Age ≤ 30  Male    Karnataka          *         TB
*     Age ≤ 20       Male    Kerala             *         Cancer
*     20 < Age ≤ 30  Male    Karnataka          *         Heart-related
*     Age ≤ 20       Male    Kerala             *         Heart-related
*     Age ≤ 20       Male    Kerala             *         Viral infection

Evaluating k-anonymity

To determine which degree of k-anonymity we have, we group the entries by their identifying attributes (everything except Disease, since the goal of our anonymization is not to disclose which person has which disease) and count how many indistinguishable entries each group contains.

Let’s do this for 4 different data sets:

  • Original: 10 unique entries, 0 2-anonym entries, 0 3-anonym entries => no anonymity
  • Generalization example: 10 unique entries, 0 2-anonym entries, 0 3-anonym entries => no anonymity
  • Suppression example: 6 unique entries, 4 2-anonym entries, 0 3-anonym entries => no anonymity
  • Generalization & Suppression examples combined: 0 unique entries, 4 2-anonym entries, 6 3-anonym entries => 2-anonymity

The data set as a whole is always assigned the lowest k-anonymity of any of its entries, even if some entries offer much higher protection. Thus, even though the suppression example contains some 2-anonym entries, the overall class is still 1-anonymity, as unique entries remain. Even if only one unique entry were left and all others were in higher classes, the class would stay the same.

Only the last example, combining generalization & suppression, offers 2-anonymity, as the weakest anonymity the data set provides is 2-anonymity.
Thereby no person can be linked to a single disease, as there are always at least two entries with the same identifying attributes.
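
We can verify this mechanically. The following sketch reduces the combined data set to its remaining identifying attributes and computes the smallest group size:

```python
# Verifying the combined example: group by the remaining identifying
# attributes (Age range, Gender, State of domicile) and take the size
# of the smallest group.
from collections import Counter

combined = [
    ("20 < Age <= 30", "Female", "Tamil Nadu"),
    ("20 < Age <= 30", "Female", "Kerala"),
    ("20 < Age <= 30", "Female", "Tamil Nadu"),
    ("20 < Age <= 30", "Male",   "Karnataka"),
    ("20 < Age <= 30", "Female", "Kerala"),
    ("20 < Age <= 30", "Male",   "Karnataka"),
    ("Age <= 20",      "Male",   "Kerala"),
    ("20 < Age <= 30", "Male",   "Karnataka"),
    ("Age <= 20",      "Male",   "Kerala"),
    ("Age <= 20",      "Male",   "Kerala"),
]

groups = Counter(combined)
print(min(groups.values()))  # -> 2, the data set is 2-anonymous
```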

Possible attacks & caveats

There are attacks on k-anonymity that can re-identify individuals in the data set.

  • Homogeneity Attack: If the anonymized data within a group is too similar, it might still give insights about a person. E.g. if all males in the data set under 20 living in a certain city have cancer, then this allows reasoning about a single person again (see the sketch after this list).
  • Background Knowledge Attack: By using additional information it might be possible to de-anonymize the data. E.g. if you know that heart attacks occur at a higher rate for people living in a certain city, you are able to link this city to the data again.
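
The homogeneity attack can be spotted mechanically; the following sketch (with hypothetical example rows) flags groups in which every entry shares the same disease:

```python
# A sketch of a homogeneity check: if every entry in a group shares the
# same disease, the disease leaks despite k-anonymity.
from collections import defaultdict

rows = [  # (identifying attributes, disease)
    (("Age <= 20", "Male", "Kerala"), "Cancer"),
    (("Age <= 20", "Male", "Kerala"), "Cancer"),  # homogeneous group
    (("20 < Age <= 30", "Female", "Kerala"), "Viral infection"),
    (("20 < Age <= 30", "Female", "Kerala"), "Heart-related"),
]

diseases_per_group = defaultdict(set)
for identifiers, disease in rows:
    diseases_per_group[identifiers].add(disease)

for group, diseases in diseases_per_group.items():
    if len(diseases) == 1:  # 2-anonym, but the disease is still exposed
        print(f"Homogeneous group {group}: everyone has {next(iter(diseases))}")
```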

Other problems are:

  • Forward Knowledge Attack (my term): If you have certain knowledge about people in the data set, you might gain unintended insights again. E.g. if you know that Simon is male and 19 years old, you know from the data set that he is sick (either Cancer, Heart-related, or Viral infection), which is a privacy violation again.
  • Skewing of the data set: By removing too much information from the data set, its validity might be skewed, as important information could be lost, which renders the research useless again.

Conclusion

If you work with data sets you should always keep the privacy rights of the people in the set in mind. You can use the given techniques, generalization & suppression, to achieve a certain strength of k-anonymity. The whole process is hard to automate (finding an optimal k-anonymization is provably NP-hard) and needs time and understanding. Small mistakes can result in de-anonymization.


Please keep in mind that this article is a simplified explanation. If you want to gain even deeper knowledge about this topic, the following paper is a good start: https://dataprivacylab.org/dataprivacy/projects/kanonymity/paper3.pdf
