Friday, December 6, 2013

23andMe's Ancestry Composition is now better, but still not great

I just had a look at the updated Ancestry Composition (AC) results of many of the people I'm sharing with at 23andMe. Yes, after a year of waiting the AC has finally been updated, with an overfitting fix and an improved disambiguation of African and Asian ancestry. But to be blunt, the AC still sucks.

There's actually no excuse why it should still suck. The 23andMe scientists have obviously done a great job with the algorithms that make up the AC, because they appear to be highly accurate when used together with well defined reference sets that represent robust biogeographical clusters. So the hard part is done. However, the problem is that several of the reference sets aren't well defined, and that's putting it mildly. Here are some examples:

- Croatians are in a Balkan reference set alongside Greeks, and even more unbelievably, Maltese. On the other hand, their genetically very similar neighbors, the Slovenians, are in the Eastern European reference set, alongside the HGDP Russians from Kargopol, in the north of Russia.

- The North African reference set is mostly made up of samples from the Near East, not North Africa. Also, the samples from Northwest Africa, like the Mozabite Berbers, are genetically quite distinct from all of the other samples in this reference set.

- Czechs are in the Eastern European reference set, while their genetically very similar neighbors, the Austrians, are in the French & German reference set. Below is a Principal Component Analysis (PCA) from Nelis et al. showing the genetic relationship between these two Central European populations, relative to the differences within another three European countries.

This appears to be causing problems for some users, because supervised admixture analyses like the AC need relatively pure reference sets from robust biogeographical clusters to work properly. For instance, I'm sharing with Germans and Austrians who, in the standard mode, aren't even classified as 2% French & German, but over 10% Eastern European, and mostly nonspecific Northern European and European. In speculative mode their French & German proportions rise a few per cent.

I think what's happened in this case is that 23andMe has ignored the existence of the biogeographical region known as Central Europe. As a result, people from this area are mostly sitting in a nonspecific European no-man's land. That's because they're not particularly French, nor are they typically Eastern European.
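
To make the reference-set point concrete, here's a toy sketch in Python. It is not 23andMe's method, and every number in it is made up. It assumes a simple two-source model in which a person's expected allele frequency at each SNP is a weighted blend of a "West" and an "East" reference set, and it estimates the West share by least squares. The function name `west_fraction` and the frequencies are hypothetical.

```python
def west_fraction(dosages, freq_west, freq_east):
    """Least-squares estimate of the 'West' share in a two-source model.

    Model (an assumption of this sketch): at each SNP the expected allele
    frequency is a * west + (1 - a) * east, with a clamped to [0, 1].
    """
    num = 0.0
    den = 0.0
    for g, pw, pe in zip(dosages, freq_west, freq_east):
        diff = pw - pe
        num += (g / 2.0 - pe) * diff  # dosage 0..2 -> frequency 0..1
        den += diff * diff
    if den == 0.0:
        raise ValueError("reference sets are identical at every SNP")
    return min(1.0, max(0.0, num / den))


# Made-up allele frequencies at five SNPs (illustrative only).
west = [0.10, 0.80, 0.30, 0.60, 0.20]
east = [0.50, 0.40, 0.70, 0.20, 0.60]

# A hypothetical third population sitting halfway between the two references.
central = [(w + e) / 2 for w, e in zip(west, east)]

# Use expected dosages (2 * frequency) to keep the example deterministic.
central_person = [2 * p for p in central]

share = west_fraction(central_person, west, east)
print(f"West share: {share:.2f}, East share: {1 - share:.2f}")
```

With no Central reference set on offer, the model can only describe this person as a blend of the two sets it knows about, even though in this toy setup they come from a single distinct population.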

Perhaps some might argue that this Central European biogeographical cluster doesn't really exist, and that it's actually a buffer zone between Western and Eastern Europe? If so, I beg to differ. Clusters specific to Central Europe show up in fine scale analyses with such programs as ChromoPainter and Mclust, and they're quite distinct from clusters specific to France (see here).

So unfortunately it seems that the scientists at 23andMe aren't doing enough to search for the most robust clusters in their dataset. Instead, based on what I've read at the 23andMe website, they seem to be using basic PCAs and their customers' self-reported ancestry to guide them. I reckon they should give ChromoPainter and Mclust a go. There's a PDF article about how these two methods compare to each other here. ChromoPainter is better overall, but Mclust is much faster.

If, like me, you're a client at 23andMe and agree with what I've said here, then please send a link to this blog post to the scientists at 23andMe responsible for the AC. I think it'd be a shame to see this powerful tool and the thousands of reference samples available to 23andMe not used to their full potential.

Maybe someone over there will listen, and next time there's an update we'll see Northwest Africans in a reference set of their own, Palestinians, Jordanians and Saudi Arabians not classified as North Africans, the Maltese taken out of the Balkan reference set, Germans in a Central European or their own reference set, and plenty of other improvements.

See also...

23andMe’s Ancestry Composition – a preliminary review


Nelis M, Esko T, Mägi R, Zimprich F, Zimprich A, et al. (2009)
Genetic Structure of Europeans: A View from the North–East. PLoS ONE 4(5): e5472. doi:10.1371/journal.pone.0005472


Maju said...

"The North African reference set is mostly made up of samples from the Near East, not North Africa".

That's absolutely ridiculous, not even just "unprofessional" or "amateurish" but simply idiotic.

"... supervised admixture analyses (...) need relatively pure reference sets from robust biogeographical clusters to work properly".

Indeed, at the very least to provide some sort of coherence. Of course detecting and selecting these reference sets is also problematic and fraught with controversy. For example: should we allow small isolated populations like Sardinians or Lithuanians to tip the scales, or should we instead take more representative but also more heterogeneous ones like, say, peninsular Italians or Russians as the references? But whatever imperfect solution you choose, the categories used by these people seem to have been drawn up by some random and not too brilliant primary school kid.

"Clusters specific to Central Europe show up in fine scale analyses with such programs as ChromoPainter and Mclust, and they're quite distinct from clusters specific to France (see here)".

But that's such a refined analysis, and with such an extreme focus on Northern Europeans, that you should not expect a commercial company testing people from all over the planet to even bother.

Whatever the case, I suspect that their groupings have been done based on rough Y-DNA zones. That would explain why Croats (dominant haplogroup I2a) are placed in the Balkans, while Slovenes and Czechs (dominant haplogroup R1a) are placed in Eastern Europe, or why all J1-dominated populations are placed in North Africa. This of course makes little sense when talking of autosomal DNA, which only sometimes bears any relation to Y-DNA.

And when you look at Y-DNA, especially without enough phylogenetic depth, Germans do indeed look like a mix of West and East Europeans, although closer to West (more R1b than R1a). That may be the "original sin" of 23&Me in this matter.

Davidski said...

The AC algorithms are powerful enough to split Northern Europe into French, British & Irish, Central, Scandinavian, Finnish and Eastern clusters, especially when more samples come in. But yeah, these samples would have to be chosen very carefully, and not just with the help of a 2D PCA or an ancestry questionnaire.

spagetiMeatball said...

David, this is off-topic, but I think it would be of high interest to you:

What do you think? ~2,000 years ago seems to be in the right time frame, when Arkaim was up and running and the chariots there were found. I guess in this case, if you want to come up with theories, margins for error in dating are slim.

Davidski said...

You mean 4,000 years ago, right? If the dating's right, that's very early for a chariot. Arkaim just got going at that time, so maybe whoever built Arkaim also moved west into the Balkans at about the same time?

spagetiMeatball said...

Yes, I meant 4,000. What you said sounds logical. Well, we sort of do know who built Arkaim: some Indo-European group, since all the trails lead to those metallurgical centers. Which got me thinking about something: assuming horse-riding, chariotry, and the whole cultural practice were introduced to large portions of Eurasia by people similar to the Arkaim residents, we should be able to find out something about them from the people they traded/interacted with. Maybe one of the ancient Near Eastern city states in Sumer? They surely must have left some record of something so momentous as being introduced to a chariot or a horse.

Eduardo Pinto said...

It is not all negative, I must say. I was able for the first time to cross-reference the AC with the AF and connect my 0.2% SSA with my North African matches. Maybe we can finally assess the genetic impact that the Moors had in the Iberian peninsula.

Davidski said...

Right, but you've waited over a year to see that 0.2% Sub-Saharan, and you'll probably have to wait another year to see the 0.5%, or whatever, of Northwest African, if they even decide to show a proper Northwest African component.

Also, I think there are some issues with the smoothing algorithm they use to take out noise. I don't know enough about the details of how that works to really comment, but I've seen some results where entire half-chromosomes are painted a certain ancestral component, and then there's nothing at all like that in the rest of the genome.

Helgenes50 said...

Can the Eastern European component be taken seriously for someone of (recent) Northwest European origin, a Norman to be exact?
I did my analysis to see if one of my ancestors was Russian, as the family tradition says: my ggg grandfather (= 3.125%).

In the old AC, no Eastern European. Just a sizeable segment in COA, shared by people from the east (mainly Ashkenazi), painted as Nonspecific European.

In the new AC, I have 0.9% Eastern European and 0.3% Ashkenazi, precisely the segment mentioned before.

It's more or less what I was expecting from the first AC. But with all these changes
I don't know what to think.
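
For reference, the 3.125% figure above is just the expected autosomal share from a single ancestor five generations back, which halves with each generation. A minimal sketch of that arithmetic in Python (the function name `expected_share` is made up for illustration):

```python
def expected_share(generations_back):
    """Expected autosomal share from one ancestor n generations back."""
    return 0.5 ** generations_back


# parent = 1, grandparent = 2, great-grandparent = 3, gg = 4, ggg = 5
ggg_grandfather = expected_share(5)
print(f"{ggg_grandfather:.3%}")  # 3.125%
```

This is only an expectation. Because of recombination, the share actually inherited from any one ancestor that far back can be noticeably higher or lower.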

Davidski said...

Let's wait for another update, with hopefully more reference samples and better defined ancestral clusters. Then you can pick the best of the three versions.

Helgenes50 said...

Thank you, I also think that it's better to wait

Gui S said...

Well, I went from an unrealistic 95% French/German from overfitting to a non-specific mush that seems focused on Western Europe. Looking at my contacts, it seems that the French/German component shows most strongly in Benelux people and Swiss. Actual French and actual Germans don't have as much of it, and they have other (mostly Nonspecific) components to balance things out.
Treating it like the results of admixture runs and making population averages is, I think, the best way to get anything meaningful out of it. For instance, from my sampling I am within 1 standard deviation of the French average for all components except Ashkenazi and Balkan, where I have a lot more of both, which seems to be a realistic look at my ancestry. But this is only taking us back to square one of IBS admixture-type clustering rather than the great IBD revolution we were expecting. Oh well...
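
The population-average approach described above could be sketched roughly like this in Python. The percentages are invented for illustration and are not anyone's real results. The helper `within_one_sd` is hypothetical and simply checks, per component, whether a person falls within one sample standard deviation of the group mean.

```python
from statistics import mean, stdev


def within_one_sd(person, population):
    """Map each component to True if the person is within 1 SD of the mean.

    `person` is a dict of component -> percentage; `population` is a list
    of such dicts (at least two). Missing components count as 0.0.
    """
    result = {}
    for component, value in person.items():
        values = [p.get(component, 0.0) for p in population]
        result[component] = abs(value - mean(values)) <= stdev(values)
    return result


# Invented AC percentages for four contacts with French ancestry.
french_sample = [
    {"French/German": 20.0, "Balkan": 1.0, "Ashkenazi": 0.0},
    {"French/German": 24.0, "Balkan": 2.0, "Ashkenazi": 0.1},
    {"French/German": 16.0, "Balkan": 1.5, "Ashkenazi": 0.0},
    {"French/German": 22.0, "Balkan": 0.5, "Ashkenazi": 0.1},
]
me = {"French/German": 19.0, "Balkan": 5.0, "Ashkenazi": 2.0}

flags = within_one_sd(me, french_sample)
for component, ok in flags.items():
    print(component, "typical" if ok else "outlier")
```

With only a handful of contacts per population the standard deviations are rough, so anything flagged this way is a hint rather than a finding.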

Lisa Evans said...

I don't want to simply complain, but I don't understand some of their other procedures. Maybe someone understands better than I do.

First, their phasing procedure. I am guessing that it is all done with statistics, and completely ignores data from obvious close relatives. If they've just identified your closest relatives for the relative finder, then why not use that for phasing? Why do you have to tell them who your mother is? Shouldn't they already know? And they don't seem to take into account siblings or cousins even if you do tell them.

Second, if I have linked my account to both of my parents, then why can't they give me the 500 closest relatives on each side, instead of 999 Ashkenazi Jews and 1 Irish person? I am lucky enough to have several great aunts and uncles, but most people do not have that luxury.

It would be great if they could "reconstruct" the partial genomes of parents from those of several children. This could then be used for ancestry and relative finding. Let me know if I'm wrong, but this surely can't be that difficult.

Also, why can't I sort relative finder matches by chromosome or segment? Too intrusive? They could at least do it for public profiles.

Fanty said...

I went from 95% French/German, 2.5% Nonspecific Northern Euro, 2.5% Nonspecific Euro to:

45% Nonspecific Northern European
38.5% Nonspecific European
10.7% Eastern Euro
4% French/German
0.8% Nonspecific Southern European
0.5% Balkan
0.3% Unassigned
0.1% British/Irish
0.1% East Asian/Native American


29.7% Nonspecific Northern European
20.4% Eastern European
18.4% French/German
9.6% Scandinavian
9.5% Nonspecific European
7.4% British/Irish
2.9% Balkan
2% Nonspecific Southern European
0.1% East Asian/Native American

78.5% Nonspecific European
17.5% Nonspecific Northern European
2.5% Unassigned
1.4% Eastern European
0.1% East Asian/Native American

For comparison, a typical Oracle result:
4 parents: Spanish, Norwegian, Swedish, Russian
Spanish, German, Swedish, Lithuanian

Oracle X, language groups summed up:
60% Germanic speaking countries
25% Slavic/Baltic speaking countries
15% Romanic speaking (SW) countries

mikej2 said...

Those big nonspecific portions are caused by reference samples that are too specific. In other words, 23andMe pins its reference samples down to very small areas of local settlement. In Scandinavia, for example, it seems to be only around 0.5% of the population for whom the typical Scandinavian result of 20-30% increases to 100%.

Helgenes50 said...

This means that some of us may have good results and others do not.

Davidski said...

Yeah, pretty much.

The only way to make sure everyone gets top notch results is to design the reference pools so that they reflect reality as much as possible. But in my opinion 23andMe hasn't done that yet.

Davidski said...

Here are some details about the phasing part of the analysis...

"We phase customer and reference data using our own version of Brian Browning's BEAGLE software. With a tip of the hat to Darwin, we named our version "Finch." Finch uses statistical analysis to separate each parent's contribution to a person's DNA, without requiring the parent's DNA. It doesn't say which DNA is from your mother, and which is from your father — for that you do need a parent's DNA.

We wrote our own version of BEAGLE so it would work smoothly in our production environment. Because Finch and BEAGLE use the same underlying algorithm, Finch achieves phasing accuracy consistent with that of BEAGLE.

There's one important difference between Finch and BEAGLE. BEAGLE makes the assumption that all of the individuals that need to be phased are available when the program is run. That assumption is not true for the 23andMe database, since new customers join every day. To avoid the computational costs of re-running the analysis from scratch, we modified BEAGLE to efficiently handle customers that weren't present in the initial sample."
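
To illustrate the point in the quote that you need a parent's DNA to say which side each allele came from, here's a toy Python sketch of simple Mendelian phasing against one parent. It is not Finch or BEAGLE, which are statistical methods, and the function name and genotypes are invented for the example.

```python
def phase_with_parent(child, parent):
    """Phase a child's genotypes using one parent's genotypes.

    Each genotype is an unordered pair of alleles, e.g. ("A", "G").
    Returns a list of (from_this_parent, from_other_parent) tuples, or
    None where the site can't be resolved with this parent alone.
    Homozygous child sites are returned as-is without a consistency check.
    """
    phased = []
    for (c1, c2), parent_gt in zip(child, parent):
        if c1 == c2:
            phased.append((c1, c2))  # homozygous: nothing to decide
        elif c1 in parent_gt and c2 not in parent_gt:
            phased.append((c1, c2))  # only c1 can come from this parent
        elif c2 in parent_gt and c1 not in parent_gt:
            phased.append((c2, c1))  # only c2 can come from this parent
        else:
            # Parent carries both alleles (ambiguous) or neither
            # (inconsistent with parentage or a genotyping error).
            phased.append(None)
    return phased


child = [("A", "G"), ("C", "C"), ("T", "C"), ("A", "G")]
mother = [("A", "A"), ("C", "T"), ("C", "C"), ("A", "G")]

phased = phase_with_parent(child, mother)
print(phased)
```

Sites where both child and parent are heterozygous for the same alleles stay unresolved here, which is one reason statistical phasing is still useful even when a parent has been tested.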

Adele said...

Vis-a-vis the discussion of how 23andMe deals with central Europe, does anybody know how they classify Roma (gypsy) people? These are people with South Asian/Sindhi origins who have been in diaspora in Europe for centuries. Would they be classified as European by 23andMe?

Davidski said...

Roma aren't used as references by 23andMe. But the AC is likely to show much higher South Asian and Near Eastern ancestry proportions for Europeans with recent Roma ancestry than for other Europeans. That's because, as you say, Roma have South Asian origins, but also a lot of West Asian admixture that they picked up along the way into Europe.

Most of them are definitely outliers from the mainstream European gene pool, even more so than Ashkenazi Jews.

Alex Lee said...

My parents recently had their DNA tested by 23andMe. My father is either 1/2 or 1/4 Roma. His mother is of Northern European ancestry. His results came back saying he was 2% South Asian. They also trace your paternal lineage and maternal lineage. I was surprised that when we clicked on paternal lineage the result said "of South Asian origin most likely Roma or 'Gypsy'" is how they put it.