I just had a look at the updated Ancestry Composition (AC) results of many of the people I'm sharing with at 23andMe. Yes, after a year of waiting the AC has finally been updated, with an overfitting fix and an improved disambiguation of African and Asian ancestry. But to be blunt, the AC still sucks.
There's actually no excuse for it to still suck. The 23andMe scientists have obviously done a great job with the algorithms that make up the AC, because they appear to be highly accurate when used together with well-defined reference sets that represent robust biogeographical clusters. So the hard part is done. The problem is that several of the reference sets aren't well defined, and that's putting it mildly. Here are some examples:
- Croatians are in a Balkan reference set alongside Greeks, and even more unbelievably, Maltese. On the other hand, their genetically very similar neighbors, the Slovenians, are in the Eastern European reference set, alongside the HGDP Russians from Kargopol, in the north of Russia.
- The North African reference set is mostly made up of samples from the Near East, not North Africa. Also, the samples from Northwest Africa, like the Mozabite Berbers, are genetically quite distinct from all of the other samples in this reference set.
- Czechs are in the Eastern European reference set, while their genetically very similar neighbors, the Austrians, are in the French & German reference set. Below is a Principal Component Analysis (PCA) from Nelis et al. showing the genetic relationship between these two Central European populations, relative to the differences within another three European countries.
This appears to be causing problems for some users, because supervised admixture analyses like the AC need relatively pure reference sets from robust biogeographical clusters to work properly. For instance, I'm sharing with Germans and Austrians who, in the standard mode, aren't even classified as 2% French & German, but over 10% Eastern European, and mostly nonspecific Northern European and European. In speculative mode their French & German proportions rise a few per cent.
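To make the point about reference sets a bit more concrete, here's a toy sketch in Python. It is not 23andMe's algorithm, and the allele frequencies are invented purely for illustration. It assumes a very simple model in which a person's expected allele frequencies are a weighted mix of the reference populations' frequencies, and it grid-searches for the best-fitting weights. The names `fit_proportions`, `west`, `central` and `east` are all hypothetical.

```python
from itertools import product

def fit_proportions(individual, references, steps=100):
    """Grid-search the mixture of reference allele frequencies that is
    closest (least squares) to the individual's frequencies.
    Returns (proportions by reference name, squared residual)."""
    names = list(references)
    best_weights, best_residual = None, float("inf")
    # Enumerate all weight combinations on a 1% grid that sum to 100%.
    for combo in product(range(steps + 1), repeat=len(names) - 1):
        if sum(combo) > steps:
            continue
        weights = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        residual = 0.0
        for i, observed in enumerate(individual):
            expected = sum(w * references[name][i] for w, name in zip(weights, names))
            residual += (observed - expected) ** 2
        if residual < best_residual:
            best_weights, best_residual = weights, residual
    return dict(zip(names, best_weights)), best_residual

# Invented allele frequencies at six SNPs (illustration only).
# "central" sits between the other two at the first four SNPs,
# but has its own distinct frequencies at the last two.
west    = [0.10, 0.80, 0.30, 0.60, 0.20, 0.50]
east    = [0.40, 0.50, 0.70, 0.20, 0.20, 0.50]
central = [0.25, 0.65, 0.50, 0.40, 0.60, 0.10]

# No Central reference set: the person is forced into a West/East split.
two_way, two_way_residual = fit_proportions(central, {"West": west, "East": east})

# With a Central reference set the same person fits cleanly.
three_way, three_way_residual = fit_proportions(
    central, {"West": west, "Central": central, "East": east}
)

print(two_way, round(two_way_residual, 4))
print(three_way, round(three_way_residual, 4))
```

In this toy setup the Central individual comes out as an even West/East mix with a large leftover residual when no Central reference exists, and as fully Central with no residual once one is added. The leftover residual is a loose analogue of the "nonspecific" assignments described above.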
I think what's happened in this case is that 23andMe has ignored the existence of the biogeographical region known as Central Europe. As a result, people from this area are mostly sitting in a nonspecific European no-man's land. That's because they're not particularly French, nor are they typically Eastern European.
Perhaps some might argue that this Central European biogeographical cluster doesn't really exist, and that it's actually a buffer zone between Western and Eastern Europe? If so, I beg to differ. Clusters specific to Central Europe show up in fine-scale analyses with programs such as ChromoPainter and Mclust, and they're quite distinct from clusters specific to France (see here).
So unfortunately it seems that the scientists at 23andMe aren't doing enough to search for the most robust clusters in their dataset. Instead, based on what I've read at the 23andMe website, they seem to be using basic PCAs and their customers' self-reported ancestry to guide them. I reckon they should give ChromoPainter and Mclust a go. There's a PDF article about how these two methods compare to each other here. ChromoPainter is better overall, but Mclust is much faster.
If, like me, you're a client at 23andMe and agree with what I've said here, then please send a link to this blog post to the scientists at 23andMe responsible for the AC. I think it'd be a shame to see this powerful tool and the thousands of reference samples available to 23andMe not used to their full potential.
Maybe someone over there will listen, and next time there's an update we'll see Northwest Africans in a reference set of their own, Palestinians, Jordanians and Saudi Arabians not classified as North Africans, the Maltese taken out of the Balkan reference set, Germans in a Central European or their own reference set, and plenty of other improvements.
23andMe’s Ancestry Composition – a preliminary review
Nelis M, Esko T, Mägi R, Zimprich F, Zimprich A, et al. (2009) Genetic Structure of Europeans: A View from the North–East. PLoS ONE 4(5): e5472. doi:10.1371/journal.pone.0005472