Removing personal details from data to make it anonymous does not stop individuals from being identified, leaving their privacy compromised, a new study suggests.
Researchers from Imperial College London, UK, and the Catholic University of Louvain, Belgium, have shown that even so-called “anonymised” datasets – with names, addresses and other unique identifiers removed – can be traced back to individuals using statistical modelling.
The accuracy is unnerving: 99.98% of Americans could be correctly re-identified in “anonymous” data that contained just 15 characteristics, including age, gender and marital status.
The researchers say their findings, which are published in the journal Nature Communications, should be a wake-up call for policymakers on the need to tighten the rules for what constitutes truly anonymous data.
Companies and governments routinely collect and use our personal data, stripping it of personal identifiers so that it can be used legally in tasks ranging from scientific and social research to product planning.
Companies can also sell de-identified data to third parties, such as advertising agencies and data brokers.
This process of anonymisation has been the main tool used by data collectors to address privacy concerns in the wake of scandals such as the sale of Facebook data to Cambridge Analytica.
But the new research shows how easily the data can be reverse engineered to re-identify individuals, using a statistical modelling technique known as a generative copula-based method.
For example, a 61-year-old man living in Chelsea, New York, would be correctly identified 81% of the time just from gender, birth date and postcode data. Adding just five more data points (including marital status, vehicle ownership, home ownership status and employment status) pushes the likelihood of this person being correctly identified to 100%, as the combination of attributes becomes unique.
The more data points are gathered, the easier it becomes to identify the individual behind the data – and some anonymous data sets contain as many as 248 data points.
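To see why each extra attribute makes a record easier to single out, consider the minimal Python sketch below. It is not the researchers' generative copula-based model: it simply counts how many records in a made-up population are the only ones with a given combination of attribute values. The population is synthetic and randomly generated, and every name in the code (`unique_fraction`, `make_synthetic_population`, the attribute labels and their value ranges) is a hypothetical choice made for illustration.

```python
import random
from collections import Counter


def unique_fraction(records, attributes):
    """Return the share of records whose values on `attributes`
    are not shared with any other record."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


def make_synthetic_population(n, seed=0):
    """Build n random records. All values are invented for illustration
    and do not come from the study or any real dataset."""
    rng = random.Random(seed)
    return [
        {
            "gender": rng.choice(["F", "M"]),
            "birth_year": rng.randint(1940, 2005),
            "postcode": rng.randint(0, 49),  # 50 hypothetical postcodes
            "marital_status": rng.choice(
                ["single", "married", "divorced", "widowed"]
            ),
            "owns_vehicle": rng.choice([True, False]),
            "owns_home": rng.choice([True, False]),
            "employment": rng.choice(
                ["employed", "unemployed", "retired", "student"]
            ),
        }
        for _ in range(n)
    ]


population = make_synthetic_population(20_000)
attributes = [
    "gender",
    "birth_year",
    "postcode",
    "marital_status",
    "owns_vehicle",
    "owns_home",
    "employment",
]

# Add one attribute at a time and report how many records become unique.
for k in range(1, len(attributes) + 1):
    share = unique_fraction(population, attributes[:k])
    print(f"{k} attribute(s): {share:.1%} of records are unique")
```

In this toy setting the share of unique records can only stay the same or rise as attributes are added, because a record that is already unique on a few attributes remains unique when more are considered. Real populations are not uniformly random, which is part of why the study relies on a statistical model rather than a simple count.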
The researchers say this ease of discovery could expose sensitive information about re-identified individuals, and allow data buyers to build increasingly comprehensive personal profiles of them.
Senior author Yves-Alexandre de Montjoye, from Imperial College London, says companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete.
“Our findings contradict this and demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for,” he says.
“The goal of anonymisation is so we can use data to benefit society. This is extremely important but should not and does not have to happen at the expense of people’s privacy.”
Alongside their paper, the researchers have developed an online demonstration tool that allows people to estimate how likely they are to be traced from anonymous data.
The tool does not save data, they add. Its function is to help people see which characteristics make them unique – and therefore identifiable – in anonymised datasets.