Another threat to your privacy: the way you write

Posted on Sep 13, 2017 by Glyn Moody

The ‘creator’ of Bitcoin, Satoshi Nakamoto, has been identified. That, at least, is the claim in a recent article by Alexander Muse on Medium. But don’t get too excited. Not only does the article fail to name him/her/them, but Muse admits he doesn’t know either. All he will say is that the Department of Homeland Security (DHS) has discovered the true identity of Satoshi Nakamoto, but that it won’t publicly confirm that fact. Not much of a story, you might think. But the real interest lies in how the DHS is alleged to have discovered Bitcoin’s biggest secret:

“Throughout the years Satoshi wrote thousands of posts and emails and most of which are publicly available. According to my source, the NSA was able to the use the ‘writer invariant’ method of stylometry to compare Satoshi’s ‘known’ writings with trillions of writing samples from people across the globe.”

Stylometry is only useful if you have other holdings of text linked to named individuals, against which you can compare a kind of stylistic fingerprint extracted from the texts under study. The problem is that Satoshi Nakamoto could be anyone, anywhere. That means stylometry is only likely to be helpful if you have a huge database of writings that includes everyone on the planet who is active on the Internet; people who are not online can probably be excluded, since they are unlikely to have come up with something as inherently Net-based as Bitcoin. Although we are not generally aware of the fact, the NSA has just such a database, as the Medium article explains:

“The NSA then took bulk emails and texts collected from their mass surveillance efforts. First through PRISM (a court-approved front-door access to Google and Yahoo user accounts) and then through MUSCULAR (where the NSA copies the data flows across fiber optic cables that carry information among the data centers of Google, Yahoo, Amazon, and Facebook) the NSA was able to place trillions of writings from more than a billion people in the same plane as Satoshi’s writings to find his true identity. The effort took less than a month and resulted in positive match.”

Again, leaving aside the fact that we are not told the supposed true identity of Bitcoin’s creator, what is much more relevant for readers of this blog is that the NSA possesses trillions of texts written by more than a billion people, and can therefore fruitfully apply stylometry to work out the author of a document, provided it is substantial enough to make any match that is found statistically meaningful.

This means that, for practical purposes, it is very difficult to write longer documents, or produce sets of smaller texts, anonymously. All the NSA needs to do is to calculate the stylometric fingerprint for a document or group of posts, and then compare it with the huge holdings of texts with identifiable authors in its database. Of course, the NSA will not expend large amounts of time and money doing so unless the document is of particular importance or – as in the case of Satoshi Nakamoto – the person sought is of particular note.
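To make the idea of a "stylometric fingerprint" a little more concrete, here is a toy sketch in Python. It is not the method the Medium article describes, and certainly not anything the NSA is known to use: it simply assumes that a fingerprint can be approximated by how often a writer uses a handful of common function words, and that two fingerprints can be compared with cosine similarity. The word list and the function names are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny feature set: a few common function words.
# Real stylometric systems use far more features than this.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "for", "but", "not", "with", "as", "on"]


def fingerprint(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(unknown_text, known_texts):
    """Rank named authors by how closely their writing matches the unknown text.

    known_texts maps an author name to a sample of that author's writing.
    Returns (author, similarity) pairs, best match first.
    """
    target = fingerprint(unknown_text)
    scores = {author: cosine_similarity(target, fingerprint(sample))
              for author, sample in known_texts.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Calling `rank_candidates(anonymous_post, {"alice": alice_emails, "bob": bob_emails})` would return the candidates ordered by similarity. With short texts the scores mean very little, which is why a match is only statistically meaningful when the document is substantial.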

The quantity of digital data being generated continues to grow rapidly. As a result, the number of emails and social media posts that the NSA must store in order to have a comprehensive record of everyone’s writing style is also growing rapidly. However, don’t start hoping that the NSA will be overwhelmed, and forced to store only a portion of that data flood. Last December, Amazon announced a new service called the AWS Snowmobile:

“This secure data truck stores up to 100 PB of data and can help you to move exabytes to AWS in a matter of weeks (you can get more than one if necessary). Designed to meet the needs of our customers in the financial services, media & entertainment, scientific, and other industries, Snowmobile attaches to your network and appears as a local, NFS-mounted volume.”

The AWS Snowmobile is primarily designed to move petabytes – or even exabytes – from company data centers to Amazon’s AWS cloud. But if Amazon can put that much storage in a single container, think how much the NSA might have crammed into its extensive facilities. Given that ten AWS Snowmobile containers can store an exabyte, the NSA could easily be running databases holding a zettabyte or even a yottabyte. To put that in perspective, a Wired article on Amazon’s product notes that a single AWS Snowmobile could hold five copies of the Internet Archive – effectively a backup copy of the Web, past and present – which contains “only” about 18.5 petabytes of unique data. Storing every email and social media post it intercepts is clearly quite feasible for the NSA.
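The arithmetic behind those figures is easy to check. The short Python sketch below assumes decimal (SI) units, so that a petabyte is 10^15 bytes, and takes the 100 PB capacity and the 18.5 PB Internet Archive figure from the quotes above; the function name is just for illustration.

```python
# Decimal (SI) storage units, in bytes. This is an assumption; the quoted
# figures do not say whether decimal or binary units are meant.
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21
YOTTABYTE = 10**24

SNOWMOBILE_BYTES = 100 * PETABYTE          # capacity quoted by Amazon
INTERNET_ARCHIVE_BYTES = 18.5 * PETABYTE   # figure quoted from Wired


def snowmobiles_needed(total_bytes):
    """Number of 100 PB containers needed to hold total_bytes (rounded up)."""
    return (total_bytes + SNOWMOBILE_BYTES - 1) // SNOWMOBILE_BYTES


# Copies of the Internet Archive that fit in one container: about 5.4
archive_copies = SNOWMOBILE_BYTES / INTERNET_ARCHIVE_BYTES

print(snowmobiles_needed(EXABYTE))    # 10
print(snowmobiles_needed(ZETTABYTE))  # 10000
print(snowmobiles_needed(YOTTABYTE))  # 10000000
print(round(archive_copies, 1))       # 5.4
```

So an exabyte is ten containers, a zettabyte is ten thousand, and a yottabyte is ten million.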

Even if it doesn’t (currently) do this, there is little doubt that the NSA, and other top intelligence agencies in other countries, have vast holdings of data about our digital activities. That’s important not just for existing applications like stylometric analysis, but particularly for training future artificial intelligence systems. Indeed, most of the power of such AI tools comes from feeding in lots of relevant data to hone the system. Whatever algorithms the NSA and other spy agencies have developed, they are probably already pretty good at analyzing our digital lives thanks to the huge data stores available for training.

That’s the bad news. Some good news is that just as stylometric analysis is gaining new power through the application of technology, so it can perhaps be defeated by technology. There’s an open source project on GitHub called Anonymouth:

“a Java-based application that aims to give users the tools and knowledge needed to begin anonymizing documents they have written.

It does this by firing up JStylo libraries (an author detection application also developed by [the Privacy, Security and Automation Lab at Drexel University, Philadelphia]) to detect stylometric patterns and determine features (like word length, bigrams, trigrams, etc.) that the user should remove/add to help obscure their style and identity.”
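For a rough feel of what features such as word length, bigrams and trigrams look like, here is a small Python sketch. It is not Anonymouth or JStylo code (both are Java applications, and their actual APIs are not shown here); it assumes, for illustration, that bigrams and trigrams mean sequences of two and three characters, and all function names are hypothetical.

```python
import re
from collections import Counter


def average_word_length(text):
    """Mean number of characters per word (0.0 for text with no words)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def char_ngrams(text, n):
    """Count every run of n consecutive characters, ignoring case.

    Whitespace is collapsed to single spaces first, so line breaks and
    indentation do not affect the counts.
    """
    cleaned = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def style_features(text, top=5):
    """Collect a few simple style features for one document."""
    return {
        "average_word_length": average_word_length(text),
        "top_bigrams": char_ngrams(text, 2).most_common(top),
        "top_trigrams": char_ngrams(text, 3).most_common(top),
    }
```

Running `style_features` on a draft before and after rewording it gives a crude view of how much the measurable style has shifted, which is the kind of feedback a tool of this sort aims to provide in a far more sophisticated form.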

And so the great digital arms race continues, between those wanting to preserve their anonymity and privacy online, and those wishing to strip them away.

Featured image by Amazon.
