The real Strava heatmap story is not about threats to national security, but about privacy and de-anonymization
At the end of January, Nathan Ruser posted a tweet about Strava, a Website and mobile app used to track athletic activity via GPS coordinates. It concerned Strava’s global heatmap: “the largest, richest, and most beautiful dataset of its kind. It is a visualization of Strava’s global network of athletes,” as the company puts it. The most recent release includes 3 trillion latitude/longitude points, 13 trillion pixels rasterized, and 10 terabytes of raw input data. Ruser noticed that the tendency of military personnel to run around tracks on military bases while wearing their Strava devices reveals all kinds of hitherto secret details about those installations. That’s clearly a serious failure of operational security, but fairly easy to fix, by ordering military personnel to use more stringent privacy controls with activity trackers.
It turns out that the privacy implications of the Strava heatmap are more problematic. For example, on Twitter, somebody noted that by zooming in it is possible to make out the exercise patterns of some people living in individual houses. Steve Loughran showed how you could even de-anonymize many of the heatmap routes. Although his technique is specific to Strava, there is always the risk that similarly clever ways will be used to obtain highly personal information from other systems. For example, a 2016 paper showed how triangulation could be applied to locate individuals exactly using dating applications, even though the latter correctly tried to obfuscate that sensitive information. More recently, Privacy News Online reported on how encrypted data streams from the Internet of Things are leaking sensitive information, and that merely using certain browser extensions may result in your full Internet history ending up for sale and easily de-anonymized.
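To make the triangulation idea concrete, here is a minimal Python sketch. It does not reproduce the method of the 2016 paper, which is not described here. It is the textbook calculation for recovering a position from three distance readings, and it assumes a flat two-dimensional plane and exact distances. The function name trilaterate, the coordinates and the observer positions are all hypothetical, chosen only for illustration.

```python
import math


def trilaterate(p1, d1, p2, d2, p3, d3):
    """Recover an (x, y) position from its distances to three known points.

    Assumes a flat 2-D plane and exact distances (illustrative only).
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Subtracting the first circle equation from the other two leaves
    # two linear equations of the form a*x + b*y = c.
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        raise ValueError("observation points must not lie on one line")
    # Solve the 2x2 system with Cramer's rule.
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)


# Hypothetical scenario: the app never reveals the target's position,
# only how far away the target is from wherever the observer claims to be.
target = (3.0, 4.0)
observers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
distances = [math.hypot(target[0] - ox, target[1] - oy) for ox, oy in observers]

estimate = trilaterate(
    observers[0], distances[0],
    observers[1], distances[1],
    observers[2], distances[2],
)
print(estimate)
```

Real services tend to round or perturb the distances they report, so an actual attack would need more readings and some error handling. The point of the sketch is only that a "distance to this user" figure, queried from a few different places, is enough to pin down a location.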
A common thread here is that supposedly anonymized data very often is nothing of the kind. By exploiting other features of the system that generated it, or combining the data with other kinds of personal information, it may be quite easy to de-anonymize the datasets. A detailed examination of these issues is provided by a 2015 paper from Edward W. Felten, Joanna Huey and Arvind Narayanan, called “A Precautionary Approach to Big Data Privacy”, pointed out in a post on Boing Boing. The fact that this was published some years ago, and yet people remain surprised when datasets are de-anonymized, reveals how little progress has been made on this front. The researchers suggest the problems of de-anonymization are likely to become more widespread:
“high-dimensional datasets, which contain many data points for each individual’s record, have become the norm: social network data has at least a hundred dimensions and genetic data can have millions. We expect that datasets will continue this trend towards higher dimensionality as the costs of data storage decrease and the ability to track a large number of observations about a single individual increase.”
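A rough way to see why dimensionality matters is to count how many records in a dataset are unique once more attributes are taken into account. The Python sketch below uses purely synthetic, randomly generated data; the dataset size, the number of attributes and the helper name unique_fraction are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def unique_fraction(records, k):
    """Fraction of records whose first k attributes match no other record."""
    counts = Counter(tuple(r[:k]) for r in records)
    return sum(1 for r in records if counts[tuple(r[:k])] == 1) / len(records)


rng = random.Random(0)  # fixed seed so repeated runs give the same data

# Hypothetical "anonymized" dataset: 10,000 people with no names attached,
# each described by 12 attributes that can take one of 5 values.
records = [[rng.randrange(5) for _ in range(12)] for _ in range(10_000)]

fractions = [unique_fraction(records, k) for k in range(1, 13)]
for k, fraction in enumerate(fractions, start=1):
    print(f"{k:2d} attributes: {fraction:.1%} of records are unique")
```

With only one or two attributes nobody stands out, but as attributes accumulate almost every record becomes one of a kind. A unique record is not yet an identified one, but it means that anyone who learns the same handful of attributes from another source can pick that person out of the "anonymous" data.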
The more data points there are for each individual – the higher the dimensionality of the datasets – the greater the risk that one or more of them can be linked to a specific individual at a later date through the use of newly-available information. A related problem is that once datasets have been released, they can’t be recalled when they become vulnerable to de-anonymization for whatever reason. Matters are made worse by the ad hoc way in which de-identification is carried out in the first place. Typically, the researchers write, a “penetrate-and-patch” approach is adopted:
“Proponents [of anonymization] ask whether a de-identification method can resist certain past attacks, rather than insisting on affirmative evidence that the method cannot leak information regardless of what the attacker does.
The penetrate-and-patch approach is denounced in the field of computer security because systems following that approach tend to fail repeatedly. Ineffective as the penetrate-and-patch approach is for securing software, it is even worse for de-identification. End users will install patches to fix security bugs in order to protect their own systems, but data users have no incentive to replace a dataset found to have privacy vulnerabilities with a patched version that is no more useful to them. When no one applies patches, penetrate-and-patch becomes simply penetrate.”
In one respect, things have moved on since the 2015 paper was published. Artificial intelligence (AI) has made unexpectedly rapid progress in the last few years. As this blog noted a couple of months ago, the increasing capabilities of AI pose new and serious threats to privacy. In an op-ed published in the wake of the realization that the Strava heatmap revealed sensitive information about military installations, Zeynep Tufekci, associate professor at the School of Information and Library Science at the University of North Carolina, pointed out that machine learning is likely to make the problem of de-anonymization even worse. One of the strengths of AI systems is that they can ingest huge quantities of seemingly unrelated data to find hidden patterns and relationships. That makes them perfect for cross-referencing “anonymized” personal information from multiple sources in order to reveal the identity of individuals who are the source of that data. Tufekci’s solution is as follows:
“With the implications of our current data practices unknown, and with future uses of our data unknowable, data storage must move from being the default procedure to a step that is taken only when it is of demonstrable benefit to the user, with explicit consent and with clear warnings about what the company does and does not know.”
Interestingly, the 2015 paper from Felten, Huey and Narayanan also recommends providing greater transparency about the risks of re-identification when providing personal data:
“Giving users information about privacy protection measures and re-identification risks helps to even the information asymmetry between them and data collectors. It would allow users to make more informed decisions and could motivate more conscientious privacy practices, including the implementation of provable privacy methods. It is also possible that data collectors could give users options about the privacy protection measures to be applied to their information. Such segmentation would permit personal assessments of the risks and benefits of the data collection: people who have strong desires for privacy could choose heavier protections or non-participation; people who do not care about being identified or who strongly support the potential research [using their personal data] could choose lighter, or no, protections. This segmentation is a helpful complement to narrowed releases of data: instead of restricting access to the people who can create the most benefit, segmentation restricts participation to the people who feel the least risk.”
That may be a welcome approach for people who take privacy protection seriously, such as readers of this blog. But when few general users of online services bother reading the terms and conditions before signing up, it seems rather optimistic to expect them to choose among a complex range of protection options offering variable risks of re-identification.
Featured image by Strava.