How data minimization can protect privacy and reduce the harms of collecting personal information

Updated on Jan 25, 2024 by Glyn Moody

It’s no secret that many companies and governments try to collect as much personal information as possible. This might be because they believe it will improve the results of their analyses, or simply “just in case” they need something at a later date. According to a new paper from the digital rights organization Access Now, the practice of collecting more data than is necessary is widespread. One study of companies in Europe showed that 72% gathered data that they never used. Another global report showed that 55% of all data collected is “dark data” that is not used for any purpose after collection. As the title of Access Now’s report, “Data minimization: Key to protecting privacy and reducing harm”, suggests, the organization believes that simply reducing the amount of personal data that is gathered could have big benefits for privacy. As it writes:

Privacy is a human right, and data minimization is a human rights issue. The most important impact of strong data minimization is harm reduction: data that is not collected cannot harm people. As organizations collect more data, the potential for and real harms to people grow. Reducing the amount of data collected is important for at least two reasons: people do not want organizations collecting every bit of information about them, and personal information can be, and often is, misused in ways that perpetuate significant harms.

Statistics quoted by the paper show that people around the world are worried by the quantity of data that businesses and governments collect about them. A study of more than 25,000 people in 40 countries showed that 70% of them were concerned about sharing personal information, and two-thirds were unhappy with the current privacy practices of data collectors. Access Now points out that specific groups are especially at risk from excessive data collection, which can reduce the opportunities for Black, Hispanic, Indigenous, and other communities of color, or actively target them for discriminatory campaigns and deception.

Another obvious risk of extensive data collection is the use of that information for government surveillance. That can lead to the abuse of government authority and have chilling effects on free expression. Data minimization helps by limiting the harms that governmental use of personal data might cause. The paper quotes the example of Signal, which offers end-to-end encryption of all its communications, and keeps the data held about users to an absolute minimum. As a result, when the US government recently asked for the names and addresses of users, Signal said it couldn’t comply because it didn’t hold that data.

Similarly, the unnecessary collection and retention of personal information creates a growing treasure trove of valuable data. Inevitably those stores become targets for third parties, whether they are law enforcement, foreign governments, or criminals. Moreover, the larger the stores of personal information, the greater the opportunities for cross-linking different sets of data when they are exfiltrated, creating even more detailed and thus even more harmful profiles of people.

The new paper recognizes that there are circumstances in which data minimization may not be straightforward. For example, it may be necessary for an organization to collect data on protected classes where the purpose is to address its own discriminatory practices and mitigate or eliminate the resulting harms, or to benefit certain underrepresented populations. However, Access Now emphasizes that once that data on protected classes is collected and stored, it should be put to no other use, and should be strictly protected against unauthorized access, unauthorized disclosure, and other data protection violations.

Unsurprisingly, the paper is concerned about what it terms “behavioral advertising” – that is, advertising that uses micro-targeting – which typically draws on large databases holding personal data and detailed profiles. Ideally, it says, such intrusive advertising would be banned completely:

While the evidence of harm is abundant, there is little to show the benefit of behavioral advertising for the companies deploying it. It may not be as effective as claimed. A recent study found that publishers retain only 4% of the increased revenue from behavioral advertising. In 2019, when The New York Times cut off ad exchanges and turned to contextual advertising, it saw its revenues rise. Data from the Dutch broadcaster NPO showed that when it ditched behavioral advertising for contextual ads across its sites for the first half of 2020, its revenue increased each month.

At the very least, Access Now suggests, any company that collects data for advertising purposes should be required to delete – not merely de-identify – that information, as well as any information they inferred from that data, after 30 days.
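
For readers who want to see what such a rule might look like in practice, here is a minimal Python sketch of a 30-day purge. It is an illustration only, not something taken from the Access Now paper: the `AdRecord` type, its fields, and the `purge_expired` function are all hypothetical names, and a real system would also need to delete the data from backups, logs, and any third parties it was shared with.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# The 30-day limit suggested in the paper.
RETENTION = timedelta(days=30)


@dataclass
class AdRecord:
    """Hypothetical record of advertising data held about one user."""
    user_id: str
    collected_at: datetime  # timezone-aware collection time
    raw_data: dict          # what was collected directly
    inferred: dict          # what was derived from raw_data, e.g. interest segments


def purge_expired(records, now=None):
    """Return only the records that are still inside the retention window.

    Expired records are dropped whole, so the raw data and anything
    inferred from it disappear together. Nothing is kept in a
    de-identified form.
    """
    now = now or datetime.now(timezone.utc)
    return [r for r in records if now - r.collected_at < RETENTION]
```

The point of keeping the inferred data in the same record as the raw data is that the two cannot drift apart: once the 30 days are up, there is no separate profile left behind.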

Finally, there is an interesting section about using “thoughtful” data minimization when building machine learning systems. These are typically trained on large datasets, which often include personal information. The paper notes that the common assumption that more data is always better is not necessarily true. Instead, it suggests, those building AI systems based on machine learning should aim to train them on good data. There are real benefits from doing so, since the use of inappropriate data can negatively impact the system in various ways, not least by producing biased algorithms and skewed results. Adopting data minimization as a principle would help engineers ensure that the training sets are high quality and appropriate.

Data minimization seems like a rather obvious, even trivial idea. But as the Access Now paper makes clear, applying this simple technique can have big benefits not just in terms of enhancing privacy, but also in improving the quality of services that are based on the collection and analysis of personal data.

Featured image by Alexei Chizhov.