Free Republic
Browse · Search
General/Chat
Topics · Post Article

De-identify, re-identify: Anonymised data's dirty little secret
The Register ^ | 16 September 2021 | Danny Bradbury

Posted on 09/16/2021 10:39:36 AM PDT by ShadowAce

Publishing data of all kinds offers big benefits for government, academic, and business users. Regulators demand that we make that data anonymous to deliver its benefits while protecting personal privacy. But what happens when people read between the lines?

Making data anonymous is known as de-identifying it, but doing it properly is more challenging than it seems, says Wei Wang, professor of computer science and director of the Scalable Analytics Institute at UCLA.

"It's one thing to remove the identity, but we also need to keep in mind that the remaining data right after we remove that entity is still useful," she says.

With a little work, people can often recreate your identity from these remaining data points. This process is called re-identification, and it can ruin lives.

In a recent case, an online newsletter outed a Catholic priest who was a frequent user of the Grindr gay hookup app. The newsletter purchased the Grindr usage data from a third-party data broker. Even though the data set had no identifying information, the newsletter found him using his device ID and location data. The ID showed up in gay bars, his work address, and family addresses, which was enough to find his name and out him. He later resigned.

The spectre of re-identification has grave implications for us all, and should give us pause as we rush to publish anonymous data sets. Re-identifying people has become a sport for some researchers, such as those who mined anonymous AOL search queries in 2006 and those who identified individuals from de-identified Netflix usage data. Both organisations had published the data in the name of research. Back in 2009, a gay woman sued Netflix, alleging that the data could have outed her.

How de-identification works

There are different ways to de-identify data. These include deleting identifiable fields from records, which theoretically should let researchers use the data without linking it back to an individual.

The danger here is that smart third parties could re-identify someone using data elements that were deemed innocuous enough to leave in the records. In an explainer on the topic, the Georgetown University Law School describes multiple levels of identifiability.

These levels begin with data such as a phone number and social security number that can directly identify a person. At the level below that are items such as gender, birth date, and zip code. These might not identify an individual alone but can quickly single a person out when combined. At still lower levels, the data points relate less specifically to a single person, such as favourite restaurants and movies.

In the mid-nineties, the state of Massachusetts published scrubbed data on every state employee's hospital visits, but left in some level-two data: zip code, gender, and age.

Re-identification researcher Latanya Sweeney used public voter registration records, matched on zip code and the other two data points, to single out the one person matching them all: state governor William Weld. His medical records, gleaned from the data set, landed on his desk shortly afterwards.
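
The linkage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every record, name, and field below is invented, and the join uses zip code, gender, and birth date as the quasi-identifiers rather than the exact fields in the Massachusetts data.

```python
from collections import defaultdict

# Hypothetical "de-identified" hospital records: names removed,
# quasi-identifiers left in. All values are invented.
hospital_records = [
    {"zip": "02138", "gender": "M", "birth_date": "1950-01-15", "diagnosis": "condition A"},
    {"zip": "02138", "gender": "F", "birth_date": "1962-03-02", "diagnosis": "condition B"},
    {"zip": "02139", "gender": "M", "birth_date": "1950-01-15", "diagnosis": "condition C"},
]

# Hypothetical public roster that carries names next to the same fields.
public_roster = [
    {"name": "Alice Example", "zip": "02138", "gender": "F", "birth_date": "1962-03-02"},
    {"name": "Bob Example", "zip": "02138", "gender": "M", "birth_date": "1950-01-15"},
    {"name": "Carol Example", "zip": "02139", "gender": "F", "birth_date": "1971-07-30"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "birth_date")


def link_records(records, roster, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs where exactly one roster entry matches."""
    names_by_key = defaultdict(list)
    for person in roster:
        names_by_key[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for record in records:
        candidates = names_by_key.get(tuple(record[k] for k in keys), [])
        # Only a unique match singles a person out.
        if len(candidates) == 1:
            matches.append((candidates[0], record["diagnosis"]))
    return matches


reidentified = link_records(hospital_records, public_roster)
for name, diagnosis in reidentified:
    print(name, "->", diagnosis)
```

Two of the three invented records link to a single name even though the hospital data never contained one. The third stays anonymous only because nobody on the roster shares all three values.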

A token gesture

Another approach to de-identification replaces identifiable data with a token. This theoretically allows the data set's producer to map the tokens back to the user's real ID while leaving others guessing.

This is also sometimes vulnerable to attack. If those tokens aren't truly random and an attacker can reverse-engineer them to retrieve a real-world data attribute, they could find the data's owner. This happened in 2014, when someone reverse-engineered tokens created from New York taxi medallions and mined information about specific taxi rides.
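
To see why non-random tokens are weak, here is a minimal sketch. It assumes, purely for illustration, that each token is an unsalted SHA-256 hash of a short identifier in a hypothetical digit-letter-digit-digit format such as "7B42". Neither the hash function nor the format is taken from the taxi data set. With so few possible identifiers, an attacker can simply hash them all and look the token up.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def tokenize(identifier: str) -> str:
    """Derive a token deterministically from an identifier (no salt, no secret)."""
    return hashlib.sha256(identifier.encode("ascii")).hexdigest()


def build_lookup() -> dict:
    """Hash every identifier in the hypothetical format: digit, letter, digit, digit."""
    lookup = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        identifier = f"{d1}{letter}{d2}{d3}"
        lookup[tokenize(identifier)] = identifier
    return lookup


# 10 * 26 * 10 * 10 = 26,000 candidates, which takes well under a second.
lookup = build_lookup()

published_token = tokenize("7B42")   # what the data set would expose
recovered = lookup[published_token]  # what the attacker gets back
print(recovered)
```

A token drawn at random and stored in a private mapping table would not be reversible this way, which is the property the paragraph above has in mind.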

Even if you can't reverse-engineer the token, you can use it to correlate a single data subject's activity over time. That's how researchers pinpointed people in the 2006 AOL dataset; tokens representing individuals allowed them to group search queries and attribute them to a single person, gleaning lots of information about them.
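
The grouping step needs no cleverness at all, as this short sketch suggests. The log entries and token names are invented for illustration and are not drawn from the AOL data.

```python
from collections import defaultdict

# Hypothetical pseudonymised search log: (token, query) pairs.
search_log = [
    ("token_4417", "gardeners near elm street"),
    ("token_9021", "cheap flights to lisbon"),
    ("token_4417", "numb fingers causes"),
    ("token_4417", "homes for sale elm street"),
    ("token_9021", "lisbon weather in may"),
]


def build_profiles(log):
    """Collect every query issued under the same token, in log order."""
    profiles = defaultdict(list)
    for token, query in log:
        profiles[token].append(query)
    return dict(profiles)


profiles = build_profiles(search_log)
for token, queries in profiles.items():
    print(token, "->", queries)
```

Each query looks harmless alone. Grouped under one stable token, they start to describe a neighbourhood, a health worry, and a plan to move.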

Using additional sources

The availability of multiple data sets compounds the problem of re-identification, warns Wang. "There's a lot of information that you can collect from different sources and correlate them together," she says. Taken individually, each data set might seem innocuous enough. Put them together, and you can cross-reference that information. "Then you can figure out a lot of information that's going to surprise you," she adds.

The problem, as the UK's ICO outlines in its own Anonymisation Code, is that you can never be sure what other data is out there and how someone might map it against your anonymous data set. Neither can you tell what data will surface tomorrow, or how re-identification techniques might evolve. Data brokers that readily sell location data without the owners' knowledge amplify the dangers.

Other de-identification techniques include aggregating data. This, the fourth level of data on Georgetown Law's list, includes summarised data such as census records.

You could aggregate neighbourhood-level health records at a county level. Even that can be dangerous, warns Wang. You might be able to correlate aggregate data with other data sets, especially if the number of people with a specific attribute at the aggregated level is low enough.
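
One way to spot the low-count problem before publishing is to flag any aggregate cell that covers too few people. The sketch below uses invented records, and the threshold of five is an arbitrary assumption chosen for illustration, not a figure from Wang or the ICO.

```python
from collections import Counter

# Hypothetical county-level health records. All values are invented.
records = (
    [{"county": "A", "condition": "flu"}] * 6
    + [{"county": "B", "condition": "flu"}] * 5
    + [{"county": "B", "condition": "rare disorder"}]
)


def aggregate_counts(rows, keys):
    """Count how many rows fall into each combination of the given keys."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def risky_cells(counts, threshold=5):
    """Return the cells whose count is below the threshold."""
    return {cell: n for cell, n in counts.items() if n < threshold}


counts = aggregate_counts(records, ("county", "condition"))
flagged = risky_cells(counts)
print(flagged)
```

Here the single person in county B with the rare disorder is flagged: anyone who knows one local resident has that condition learns that the published row is about them. Typical responses are to suppress such cells or merge them into broader groups.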

Concerns about re-identification have surfaced of late with NHS Digital's recent push to collect the public's health data en masse under its General Practice Data for Planning and Research initiative. The scheme would have transferred GP medical records for all of England's residents to a central research store, giving people a short window to opt out.

NHS Digital had outlined specific data fields that it would transfer under the scheme, which would have allowed it to share that data with third parties. After delaying the deadline in response to pressure from GPs and relaxing opt-out deadlines, it had to put the project on hold.

Solving the re-identification problem

One theoretical way to cut through the whole tangled mess is to just keep removing data points that could reveal someone's identity. Taking out age, zip (post) code, and gender might have stopped Sweeney's Weld discovery, for example. But each piece of data that you take out lessens the data set's value, warns Eerke Boiten, professor of cybersecurity at De Montfort University's School of Computer Science and Informatics.

"If your objective is to make the information less specific, less specifically pinpointing one specific person, you're also taking out the utility," he says.

One way to reconcile anonymity and usefulness could be differential privacy. This technique adds statistical noise to the data by subtly altering parameters, perhaps shifting someone's age or zip code slightly, which makes it harder to correlate them.

Scientists can still filter out that noise with repeated database queries, so another factor of differential privacy is a restriction on the number of times that they can access that data. This restriction is known as a privacy budget, or epsilon, and you can alter the anonymity of a database by changing it.
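
A rough sketch of how noise and a budget fit together is shown below. It uses the Laplace mechanism on counting queries, a common textbook form of differential privacy, rather than the record-level perturbation described above, and it is not any particular product's implementation. The class name, the sample ages, and the epsilon values are all assumptions made for illustration.

```python
import random


class BudgetedCounter:
    """Answers counting queries with Laplace noise and tracks a privacy budget."""

    def __init__(self, values, total_epsilon, seed=None):
        self._values = list(values)
        self.remaining_epsilon = total_epsilon
        self._rng = random.Random(seed)

    def _laplace(self, scale):
        # The difference of two independent exponential draws with mean
        # `scale` follows a Laplace distribution centred on zero.
        rate = 1.0 / scale
        return self._rng.expovariate(rate) - self._rng.expovariate(rate)

    def noisy_count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining_epsilon:
            raise RuntimeError("privacy budget exhausted")
        self.remaining_epsilon -= epsilon
        true_count = sum(1 for value in self._values if predicate(value))
        # Adding or removing one person changes a count by at most 1,
        # so the noise scale is 1 / epsilon.
        return true_count + self._laplace(1.0 / epsilon)


ages = [34, 45, 29, 61, 52, 38, 47]  # invented sample data
counter = BudgetedCounter(ages, total_epsilon=1.0, seed=7)

first = counter.noisy_count(lambda age: age > 40, epsilon=0.5)
second = counter.noisy_count(lambda age: age > 40, epsilon=0.5)

try:
    counter.noisy_count(lambda age: age > 40, epsilon=0.5)
    budget_exhausted = False
except RuntimeError:
    budget_exhausted = True

print(first, second, budget_exhausted)
```

Each answer is the true count plus random noise. A smaller epsilon per query means more noise and stronger privacy, and once the total budget is spent the third query is refused, so repeated querying cannot average the noise away.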

That involves retaining control over the data, Boiten says, pointing out: "Control and accountability disappears when you hand it over." An alternative is to avoid publishing the data openly and instead make it available in a controlled research environment. "Rather than sharing the data set you share the access," he explains.

The ICO's Anonymisation Code makes it clear that in some scenarios, where re-identification could be damaging, organisations should seek consent before distributing anonymous data sets. Some situations might demand restricting disclosure to a closed community, it adds, and in some cases the data shouldn't be shared at all.

Regulating our way out of it

Scientists also call for more legislation around de-identification. The GDPR excludes data that it deems de-identified from its regulation.

The ICO warns that if the data can be re-identified using "any reasonably available means," then it won't pass muster under the EU General Data Protection Regulation. Olivier Thereaux, head of research and development for the non-profit Open Data Institute, says that misjudging this can get companies into hot water.

"GDPR does state that it does not apply to anonymous information, so anonymisation has sometimes been seen as a way to 'get out' of data protection obligations," he says. "That is often a mistake as there are many ways to anonymise data, and some may be regarded by data protection authorities as 'not reasonably anonymised'."

Danish taxi service Taxa 4x35 is a case in point. It deleted the names associated with trip records from its database after two years, but regulators penalised it anyway, finding that the customers were still re-identifiable.

It's a question of risk

No de-identification technique is completely foolproof though, warns Omer Tene, chief knowledge officer at the International Association of Privacy Professionals.

"While there are scientific remedies, most practical remedies are limited in terms of really being risk-based," he says. "They minimise or limit risk but don't completely eliminate it."

The ICO makes this clear in its Code, pointing out that it's "impossible to assess re-identification risk with absolute certainty."

It recommends what it calls a 'motivated intruder' test, which asks whether a person without prior knowledge could re-identify individuals using publicly available tools.

Does this mean that we shouldn't publish data at all? Not at all, says Thereaux. To do so would have a chilling effect on research. "Statistics bodies like the ONS do publish data that is anonymised to a minute risk of re-identification, and that publication is hugely valuable to our society," he says.

Lowering risk involves taking a careful and multi-faceted approach to de-identification. Thereaux points to the UK Anonymisation Network, which is a non-profit originally created by the ICO to share best practices in de-identification. It publishes a decision-making framework to help navigate the de-identification process.

The framework emphasises the need to engage with people who might be affected. "Making sure you are transparent and honest about how risks were mitigated, and how you are responding to a breach is key," Thereaux warns. "Organisations who fail to engage and plan for what they might do if anonymisation is breached are the ones who end up at the heart of data scandals."

The data broker who sold Grindr data without considering the implications could perhaps have done with some of that thinking. Come to that, so could everyone involved in that supply chain. Clearly, when it comes to understanding and protecting identities in anonymous data, there's still a lot of work to be done. ®


TOPICS: Computers/Internet
KEYWORDS: data

1 posted on 09/16/2021 10:39:36 AM PDT by ShadowAce

To: rdb3; JosephW; martin_fierro; Still Thinking; zeugma; Vinnie; ironman; Egon; raybbr; AFreeBird; ...

2 posted on 09/16/2021 10:40:03 AM PDT by ShadowAce (Linux - The Ultimate Windows Service Pack )

To: ShadowAce

Check out N3C.
https://covid.cd2h.org/

Database of 7.9M patients with and without COVID-19 (running on Palantir).


3 posted on 09/16/2021 10:55:30 AM PDT by yevgenie (Does one really need to add sarcasm tags?)

To: ShadowAce

First step: don’t let your data get in their datasets.

Second step: realize that you don’t have any privacy, the only things that can be kept private are those that are scrupulously kept separate from yourself.

Note “scrupulously kept separate” is difficult to impossible to achieve.


4 posted on 09/16/2021 11:03:29 AM PDT by glorgau

To: ShadowAce

“Even though the data set had no identifying information, the newsletter found him using his DEVICE ID AND LOCATION DATA.”

THIS is a huge problem... And they need to quit requiring it in order to participate. And using hidden 3rd party APIs that collect it without your knowledge.


5 posted on 09/16/2021 11:05:10 AM PDT by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

To: glorgau

Doesn’t cure the “Device ID” issue...


6 posted on 09/16/2021 11:06:05 AM PDT by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

To: Openurmind
“Even though the data set had no identifying information, the newsletter found him using his DEVICE ID AND LOCATION DATA.”

Want to drive around with your cellphone but not be followed? Buy a cheap countertop microwave and put your phone in it...

7 posted on 09/16/2021 11:35:47 AM PDT by GOPJ (Those who can make you believe absurdities, can make you commit atrocities. -- Voltaire)

To: GOPJ

“Buy a cheap countertop microwave and put your phone in it...”

And turn it on until the smoke comes out.


8 posted on 09/16/2021 12:13:25 PM PDT by beef (The Chinese have a little secret—diversity is _not_ a strength.)

To: GOPJ

The problem is not the real time tracking, the real problem is carriers and websites gathering and sharing your device ID. This along with the IP address and carriers selling location data for “Demographics” concerning commercial pricing, availability of products, Etc.

And then this data is sold over and over again to party after party until it ends up public on the deep web. Kind of like debt collectors buying the debt and then auctioning that same debt again and again, party to party until it is finally collected and stops.

Except the data selling never stops in the chain until it ends up being public. The new two factor device recognition authentication is one of the huge problems, that and websites being able to even detect a device ID. And carriers selling our device ID and location data.

Yep... Getting rid of net neutrality was a great thing right? Them being allowed to do this legally is a direct result of removing net neutrality. I saw it coming and knew this would happen when all privacy restrictions were removed. Especially with the data that internet carriers and providers can now sell.


9 posted on 09/16/2021 12:36:41 PM PDT by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

To: beef

LOL - no I meant put it in the back seat of your car...then put your phone in it.

Of course with newer cars we have on-star etc... my car lets me know when my tire pressure is low. So being worried about a group knowing my whereabouts has flown the coop.


10 posted on 09/16/2021 1:49:54 PM PDT by GOPJ (Those who can make you believe absurdities, can make you commit atrocities. -- Voltaire)

To: Openurmind

Many many years ago I spoke with a commie whose two issues were ‘net neutrality’ and ‘rare earth elements’... From that time forward I took a knee jerk uninformed stance on both. Happens - I should have looked at it closer...


11 posted on 09/16/2021 1:57:06 PM PDT by GOPJ (Those who can make you believe absurdities, can make you commit atrocities. -- Voltaire)

To: GOPJ

The problem was it was actually two part. One needed to go for business rights and lack of government control, and the other needed to stay to maintain our privacy protection.

Instead of a compromise they just got rid of the whole thing, and with it our protections that were in place with it. Greed over personal rights. But what is new?

They should have put the privacy protections back, but everyone hailed it as the best thing since sliced bread so these were forgotten about.

Now here we are... Your phone company and internet carriers have the legal right to sell away your privacy and data. Which has now led to bad characters abusing it with ill intent. It is much worse than having net neutrality.

I fought against repealing it because of the privacy concerns. If they had promised to put the protections back I would have supported repealing it. I screamed and hollered on here about that difference back when.

I saw this result coming...


12 posted on 09/16/2021 2:15:24 PM PDT by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

To: Openurmind

> Doesn’t cure the “Device ID” issue...

Get a device that isn’t linked somehow to yourself (job, credit card paying for it, location, etc)

Go to some location that you NEVER go to otherwise.

Perform the operations over a VPN. etc.

Keeping privacy these days is pretty darn hard. I’d say it’s not worth the effort unless it is for one or two specific things.


13 posted on 09/16/2021 2:15:55 PM PDT by glorgau

To: Openurmind

It was a complicated issue - we need to create our own ‘non-profits’ so people like you can fight the parts that are bad - and do it with grants... and donations... like liberals do on the other side.


14 posted on 09/16/2021 2:42:29 PM PDT by GOPJ (Those who can make you believe absurdities, can make you commit atrocities. -- Voltaire)

To: glorgau

Yep, no-contract burner phones. And never give customer support a real name or email address. And take the time and effort to figure out how to use just basic features without supplying a Google account. Just live without the fancy apps. Only the camera, phonebook, calling and text. Uninstall everything else you can, or disable it.

A VPN is no good in the case of device ID. Your carrier or ISP has that and your location, and a VPN still does not hide your international device ID, not even from websites. They will not let you use their system without these, and they detect them. Especially phone companies. They will not even give you service without the international device ID. The device location they have in real time.

We are now starting to see why allowing carriers and ISPs to sell data was a huge mistake. We are paying our own phone companies big money to throw us under the bus.


15 posted on 09/16/2021 2:45:08 PM PDT by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

To: GOPJ

No grants, just donations, But because just about every business supports and benefits from this overreach invasion of privacy there will be none coming from them. The greed is too strong. They don’t care how many toes they have to step on to make a buck.


16 posted on 09/16/2021 2:57:20 PM PDT by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

To: beef

Haha


17 posted on 09/16/2021 3:47:49 PM PDT by Silentgypsy (In my defense, I was left unsupervised.)

To: ShadowAce

bump for later


18 posted on 09/18/2021 8:24:19 AM PDT by Albion Wilde ("Let us not talk falsely now, the hour is getting late." —Bob Dylan)

To: ShadowAce

Bookmark


19 posted on 09/18/2021 8:30:38 AM PDT by RightField

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.
