Coming soon: everyone’s genetic anonymity undermined by distant relatives – and there’s nothing you can do about it

Posted on Oct 20, 2018 by Glyn Moody

Earlier this year, Privacy News Online wrote about how long-standing linked but unsolved murder cases were resolved by checking genetic material found at some of the crime scenes against online DNA-based genealogy sites. The partial matches with others on the database indicated that they were relatives of the murderer. By drawing up a family tree containing some 1000 people, the investigators were able to work out who might be the killer.

The genetic genealogist who played a key role in establishing the identity of the murderer was Barbara Rae-Venter. The New York Times has just published a fascinating tale of how Rae-Venter used DNA tests and publicly-available genetic information to establish the identity of a woman who was kidnapped as a child. Once more, Rae-Venter searched DNA-based genealogical sites for rough matches with the victim’s DNA. From these, she was ultimately able to establish the kidnapped girl’s family tree and thus identity.

The increasingly successful use of this technique raises an important question: given a DNA sample, how likely is it that there will be a rough match among consumer genetics databases? A US-Israeli group of researchers have published a paper exploring exactly that issue, and the results have major implications for privacy.

The researchers used a dataset of 1.28 million individuals who had sent their DNA to be analyzed by one of the increasingly popular consumer genomics companies such as 23andMe. They took random individuals from this pool, and searched for any distant family members that might also be present in the dataset. Interestingly, they did not look for close relationships, because there is apparently a tendency for near-relatives to get tested together, which would skew the results for finding matches. Distant relatives are less likely to act in a coordinated fashion, and so searching for such matches gives a better indication of the true power of this kind of genetic analysis.

In 15% of the searches, a match which corresponded to a second cousin or closer relative was found. In 60% of the searches carried out by the research team, a match which corresponded to a third cousin or closer relative was found. That’s significant, because the case involving the unsolved murders mentioned above used matches at this level. The new paper therefore indicates that for around 60% of the genetic pool studied there would be a match that would probably be good enough to identify them if they had left DNA at a location.

However, it’s important to note that the dataset used by the researchers was not representative of the US population as a whole. About 75% of the 1.28 million individuals were primarily of North European genetic background. This means that individuals primarily from that background were more likely to have a match than individuals whose genetic background was primarily from sub-Saharan Africa, say. Nonetheless, the figures in the study still give a good idea of how easy it has become to find matches for genetic material in DNA genealogy databases.

The researchers went on to calculate how big the pool of DNA samples would have to be to make the probability of finding a rough match near to certainty. They found that a genetic database needs to cover only 2% of the target population to provide a third cousin match to nearly any person:

we predict that with a database size of [about] 3 million US individuals of European descent (2% of the adults of this population), over 99% of the people of this ethnicity would have at least a single 3rd cousin match and over 65% are expected to have at least one 2nd cousin match. With the exponential growth of consumer genomics, we posit that such database scale is foreseeable for some 3rd party websites in the near future.

It may take a little longer, but the same will be true for people who descend from most other ethnic groups. The researchers went on to consider how easy it would be to establish the exact identity of a person of interest after finding one or more distant relatives in a familial search. The group tried to reduce the number of people who would need to be interviewed, using basic demographic information, such as geography, age, and gender:

On the basis of counting relevant relatives of the match, the initial list of candidates contains on average [about] 850 individuals. Our simulations indicate that localizing the target to within 100 miles will exclude 57% of the candidates on average. Next, availability of the target’s age to within [plus or minus five years] will exclude 91% of the remaining candidates. Finally, inference of the biological sex of the target will halve the list to just around 16-17 individuals, a search space that is small enough for manual inspection.
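
The arithmetic in that passage can be checked with a few lines of Python. This is only a minimal sketch: it assumes each filter removes its stated fraction of whatever candidates remain at that point, which is how the quoted figures read, and the function name `narrow_candidates` and the `FILTERS` list are invented here for illustration rather than taken from the paper.

```python
def narrow_candidates(initial, filters):
    """Apply successive exclusion fractions to a candidate count.

    `filters` is a list of (label, fraction_excluded) pairs. Each fraction
    is applied to the candidates remaining after the previous step
    (an assumption of this sketch, not a method from the paper).
    """
    remaining = float(initial)
    steps = []
    for label, fraction_excluded in filters:
        remaining *= 1.0 - fraction_excluded
        steps.append((label, remaining))
    return steps


# Figures as quoted above: about 850 initial candidates, then three filters.
FILTERS = [
    ("location within 100 miles", 0.57),
    ("age within plus or minus five years", 0.91),
    ("biological sex", 0.50),
]

steps = narrow_candidates(850, FILTERS)
for label, remaining in steps:
    print(f"after {label}: about {remaining:.1f} candidates remain")
```

Under these assumptions the final count falls between 16 and 17, which agrees with the range the researchers give.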

This has important implications for people who have provided their DNA for scientific purposes, and allowed it to be released anonymously. As the researchers go on to show in their paper, it is now possible to take DNA from a supposedly anonymous dataset, find matches in public genetic databases, and then work out the identity of the individual by building a family tree. That’s a big problem, because it means that it will be possible to put names to DNA sequences that may have easily-identified medical problems. Clearly, this might pose a real privacy challenge for people who have donated their DNA in the belief that it would remain anonymous, and that their possibly-serious medical conditions would never be connected with them.

That new capability may impact the willingness of people to allow their DNA to be released for scientific research. But there is a broader problem that will affect everyone. The research described above indicates that soon, US individuals of European descent will have lost their genetic anonymity. Those descended from other major populations will find themselves in a similar situation in due course.

Already, given a DNA sample, distant relatives can probably be found. That’s the case whether or not the individual concerned has uploaded DNA to a consumer genomics company. And from those distant relatives it is likely that a family tree could be built up that would allow the individual to be correctly identified. Moreover, the more people add their genetic profiles to genealogical databases, the easier, quicker and cheaper it will become to put a name to any given sample.

It is not unreasonable to assume that in a few years’ time, there will be well-populated family trees for more or less everyone in countries where consumer genomics is offered as a low-cost service. As a result, for almost any genetic material found on an object, or at a site, it will be possible to establish the likely identity of the person who left it there. Since we are continuously shedding our DNA wherever we go, this could become the perfect way to identify people and to track their movements and activities – without the need to install any surveillance equipment beforehand.

Featured image by 23andMe.