Time to embrace federated analytics – it’s no privacy panacea, but probably the closest we will get to one for many situations
A couple of weeks ago, a post on this blog explained how the OpenSAFELY project allowed trusted analysts to run large-scale computation across live pseudonymized patient records inside the data centre of the electronic health records software company. At a time when the world is grappling with the coronavirus pandemic, that’s a hugely important task, but it might seem rather specialized. It’s not, for reasons that go back to a Privacy News Online post from last year. The latter was mainly about the difficulty of respecting privacy under Google’s business model, which is based on gathering detailed personal information about its users. However, it also mentioned some important work Google is conducting on “federated learning”. This enables developers to train machine learning models across many devices without centralized data collection, ensuring that only the user has a copy of their data, and avoiding the privacy issues of gathering it all in one place. This is clearly quite similar to what OpenSAFELY does, and Google has now extended its work to produce what it calls “federated analytics”:
Today we’re introducing federated analytics, the practice of applying data science methods to the analysis of raw data that is stored locally on users’ devices. Like federated learning, it works by running local computations over each device’s data, and only making the aggregated results — and never any data from a particular device — available to product engineers. Unlike federated learning, however, federated analytics aims to support basic data science needs.
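To make that description concrete, here is a minimal Python sketch of a federated analytics round, using a hypothetical “recognition rate” query and simulated devices. It is only an illustration of the local-computation-plus-aggregation pattern, not Google’s implementation.

```python
# Minimal sketch of a federated analytics round (illustrative only).
# Each simulated device runs a computation over its own raw data; the raw
# data never leaves the device, and the analyst sees only the aggregate.

from statistics import mean

def local_computation(song_history):
    """Runs on the device: compute a recognition rate from raw data
    that stays on the device."""
    recognised = sum(1 for song in song_history if song["recognised"])
    return recognised / len(song_history)

def federated_analytics_round(devices):
    """Runs on the server: collect per-device metrics and expose only
    the aggregated result to product engineers."""
    per_device_rates = [local_computation(history) for history in devices]
    return mean(per_device_rates)  # the only value an analyst ever sees

# Hypothetical on-device data for three simulated phones.
devices = [
    [{"recognised": True}, {"recognised": False}],
    [{"recognised": True}, {"recognised": True}],
    [{"recognised": False}, {"recognised": True}, {"recognised": True}],
]
print(federated_analytics_round(devices))  # roughly 0.72
```

In this sketch the server still receives individual per-device rates; the secure aggregation and differential privacy techniques described below are what remove even that exposure.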
Google already uses this approach to improve the Now Playing feature on its Pixel phones, a tool that shows you what song is playing around you. Song recognition is carried out on the device, so no data is sent back to Google when the feature is used. To create and improve the on-device database of songs, Google uses federated analytics: each participating smartphone computes the recognition rate for the songs in its Now Playing history, and then sends those rates to Google in a specially encrypted form, using the “secure aggregation protocol”.
Under this approach, encrypted rates are sent to the federated analytics server, which does not have the keys to decrypt them individually. Only when they are combined with the encrypted counts from the other phones in the round can the final tally of all song counts – and nothing else – be decrypted by the server. The secure aggregation protocol is an important additional layer of protection, ensuring that personal information cannot be extracted from the data sent back by individual smartphones. The Google post provides a simplified explanation of how it works, which gets across well how sending data in a particular form allows it to be useful in aggregate while remaining immune to further interrogation about each individual source:
Let’s say that Rakshita wants to know how often her friends Emily and Zheng have listened to a particular song. Emily has heard it S_Emily times and Zheng S_Zheng times, but neither is comfortable sharing their counts with Rakshita or each other. Instead, the trio could perform a secure aggregation: Emily and Zheng meet to decide on a random number M, which they keep secret from Rakshita. Emily reveals to Rakshita the sum S_Emily + M, while Zheng reveals the difference S_Zheng – M. Rakshita sees two numbers that are effectively random (they are masked by M), but she can add them together, (S_Emily + M) + (S_Zheng – M) = S_Emily + S_Zheng, to reveal the total number of times that the song was heard by both Emily and Zheng.
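The toy exchange above translates directly into a few lines of Python. The counts below are hypothetical, and this is only the simplified two-person scheme from the quote, not the production protocol, which spreads cancelling masks across many devices and adds cryptographic protection.

```python
import secrets

# Toy secure aggregation, following the Emily/Zheng/Rakshita example above.
# Rakshita only ever sees two masked numbers, yet their sum is the true total.

s_emily = 17   # hypothetical: how often Emily heard the song (kept private)
s_zheng = 5    # hypothetical: how often Zheng heard the song (kept private)

M = secrets.randbelow(1_000_000)  # shared random mask, kept secret from Rakshita

from_emily = s_emily + M          # what Emily reveals: effectively random
from_zheng = s_zheng - M          # what Zheng reveals: effectively random

total = from_emily + from_zheng   # the masks cancel out
print(total)                      # 22 – the combined count, and nothing more
```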
In practice, the privacy protection is further strengthened by adding small random values to the song counts, and then summing not over two sources, as in the explanation above, but over many users. This approach is one example of what is known as “differential privacy”. Last year, Google released an open source differential privacy library based on the code it uses in its own products. According to the company, key features include statistical functions, rigorous testing, modularity and the fact that it can be used in real-world situations immediately. To help people try out the approach, it has included a PostgreSQL extension along with some “recipes”; details are provided in an accompanying technical paper.
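The noise-addition step can be sketched in the same way. The snippet below adds Laplace noise to each user’s count before summing, which is one standard way of making a sum query differentially private; the counts and the epsilon value are made up for illustration, and this is not the API of Google’s library.

```python
import random

def noisy_count(true_count, sensitivity=1.0, epsilon=0.5):
    """Return the count plus Laplace noise with scale sensitivity/epsilon.
    (The difference of two exponential samples gives a Laplace sample.)"""
    scale = sensitivity / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# Hypothetical per-user song counts; only the noisy values would ever be summed.
true_counts = [3, 0, 7, 1, 4, 2, 0, 5]
noisy_total = sum(noisy_count(c) for c in true_counts)
print(f"true total = {sum(true_counts)}, noisy total = {noisy_total:.1f}")
```

Each individual noisy count is deliberately unreliable, but because the noise is centred on zero its effect largely averages out over many users, so the aggregate remains useful while no single contribution can be pinned down.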
This work is important because it shows how privacy can be respected when aggregating data from distributed sources, with minimal risk of de-anonymization, because the raw data itself is never accessible. It also provides open source code so that people can try the techniques out for themselves. And Google is deploying the approach in its own products – proof that this is more than a theoretical exercise.
Although this kind of differential privacy may not be appropriate in every situation, it is nonetheless sufficiently mature that it should be considered wherever possible. It goes a long way towards resolving the age-old tension between preserving individual privacy and obtaining the benefits of aggregating large quantities of data from many sources.
Featured image by pixabairis.