Wednesday, October 19, 2011

Erroneous results from Dodecad (aka. Dienekes)

A while back, Dienekes welcomed "peer review" of his work, which I thought was very commendable. I recently spotted a serious error in his analysis, and let him know about it over at his blog. I was hoping to see a correction, and also an admission that his methodology was faulty. Unfortunately, this hasn't happened to date, so I thought I'd describe the problem in detail here.

In the blog entry Yunusbayev et al. (2011) data assessed with Dodecad v3, Dienekes analyzed samples with ADMIXTURE in "supervised" mode using allele frequencies obtained from a run that didn't include these samples. He posted the results in a spreadsheet, which can be accessed here.

Obviously, my area of interest is the genetic ancestry of Poles, other Balto-Slavs, and nearby populations. So it only took me a matter of seconds to notice that something was off about the results for several of these groups. For instance, Poles are listed in the spreadsheet as 34.5% West European, and 44.3% East European. On the other hand, the more easterly Ukrainians show 38.5% West European, and only 31.5% East European. Also, the Mordvinian sample from near the Volga scores 38.1% West European, and only 32.5% East European.
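To make the comparison explicit, here is a minimal sketch that turns the six percentages quoted above into West/East European ratios. The figures are the ones from the spreadsheet as cited in this post; the dictionary layout and the helper name `west_east_ratio` are my own, purely for illustration.

```python
# West and East European percentages as quoted above from the
# supervised Dodecad v3 spreadsheet: (West %, East %).
proportions = {
    "Poles": (34.5, 44.3),
    "Ukrainians": (38.5, 31.5),
    "Mordvinians": (38.1, 32.5),
}


def west_east_ratio(west, east):
    """Return the West European share divided by the East European share."""
    return west / east


ratios = {pop: west_east_ratio(w, e) for pop, (w, e) in proportions.items()}

# Print the populations from lowest to highest West/East ratio.
for pop, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{pop}: W/E = {ratio:.2f}")
# Poles: W/E = 0.78
# Mordvinians: W/E = 1.17
# Ukrainians: W/E = 1.22
```

So in these results the Poles are the only one of the three groups with a ratio below 1, even though they live the furthest west.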

The first port of call when checking the validity of such results is to see whether they gel with geography. Clearly these results don't. So either something isn't right, or there are factors that work against the general rule of genes = geography. When I alerted Dienekes to these seemingly implausible figures, he was in favor of the second scenario. His reply was as follows:

Ukrainians' higher West/East European ratio makes perfect sense as it is transitional to both the Caucasus (where there are even higher such ratios) and to the Balkans. Their ratio is exactly what one might expect from their geographical position vis a vis Russians, Belorussians, and Balts, i.e., populations with a high E/W ratio.

Mordvins are also in line with other Uralic populations (Finns, Selkups) in having an inverted European ratio relative to Balto-Slavs.

The results don't make perfect sense. They make no sense at all. There's no way these Ukrainians can be described as transitional to the Balkans and the Caucasus compared to Poles, even if the term is used very loosely. Below are two MDS plots. The first one shows that the same Ukrainians (UA) used by Dienekes do not cluster closer to the Balkans than Poles do (PL), and only barely closer, on average, than the Belorussians. The second plot shows that Ukrainians (UA), Poles (PL) and Belorussians (BY) are all about the same distance from the Caucasus.

In theory, it's possible to argue that the plots above produced different results from Dienekes' analysis because they used only the two most significant dimensions of genetic variation, whereas ADMIXTURE works in a very different way, and so can reveal details past the first two dimensions. But that would be a stretch, because generally speaking, when a population appears to be transitional between two others in an ADMIXTURE run, such results are very easily reproduced with MDS/PCA plots.

Moreover, I've actually analyzed the same and similar samples with ADMIXTURE and have been unable to reproduce Dienekes' results. In other words, as per geography, Ukrainians are less Western European than Poles, and more Eastern European. This shows up in my latest Eurasian K=10 run (see here), where, on the balance of all the components, the Ukrainians and Mordvinians are more Eastern than Poles.

Below are two PCAs. The first one shows the bizarre results produced using data from Dienekes' spreadsheet, with Mordvinians clustering with Ukrainians and Hungarians along Component 1. The result is more reliable along Component 2, because that seems to be picking up North Eurasian admixture in the Mordvinians and Russians, which is much lower in Hungarians, Poles and Belorussians. The second plot is based on my K=10, and shows a more expected result all round, with the Mordvinians lining up with both Russian samples (RU and North Russian) along Component 1, and also very close to the North Russians along Component 2. They also cluster with the same North Russians in Yunusbayev et al., rather than with the Ukrainians.

A whole range of PCA plots can be produced using the data from the supervised Dodecad V3 and my Eurasian K=10, in which the former results look at least a little out of whack with reality, while the latter appear as expected.
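For anyone who wants to try this sort of check themselves, here is a rough sketch of a PCA on admixture proportions. It is not how the plots in this post were made: it uses only the West and East European percentages quoted earlier for three populations, not the full set of components and samples, and the choice of numpy and the function name `pca_scores` are my assumptions.

```python
import numpy as np

# Rows are populations, columns are [West European %, East European %],
# taken from the supervised Dodecad v3 figures quoted earlier in this post.
# Only two of the components are used here, so this is a toy example.
labels = ["PL", "UA", "Mordvinian"]
proportions = np.array([
    [34.5, 44.3],  # Poles
    [38.5, 31.5],  # Ukrainians
    [38.1, 32.5],  # Mordvinians
])


def pca_scores(matrix):
    """Project each row onto the principal components of the matrix."""
    centered = matrix - matrix.mean(axis=0)
    # The rows of vt are the principal axes, ordered by variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt.T


scores = pca_scores(proportions)
component_1 = dict(zip(labels, scores[:, 0]))

# The sign of a principal component is arbitrary, so compare distances
# between populations rather than reading anything into the raw values.
for label, value in component_1.items():
    print(f"{label}: Component 1 = {value:.2f}")
```

Even with just these two columns, the Mordvinians fall next to the Ukrainians along Component 1, with the Poles on the opposite side, which is the same odd pattern described above for the spreadsheet data.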

Interestingly, Dienekes' new euro7 analysis supports the results obtained by me. In this experiment, the same Ukrainians and Mordvinians were used in the initial run that set up the clusters, and came out amongst the most Northeastern European and least Northwestern + Southwestern European samples on the sheet. Now that makes perfect sense.

So what happened? Are these euro7 components different enough to make the results better match geography? Yes, they're a lot more in tune with reality due to a higher quality dataset, with more samples from key areas of Europe and the Caucasus. However, it's also clear that the supervised analysis produced erroneous results. It's obvious that it's not always possible to correctly analyze samples with allele frequencies from ADMIXTURE runs in which they were not included, especially when they're compared against samples that were included.

Now that the sampling is better, Dienekes' euro7 shows the previously mentioned Uralic Selkups to have a higher level of membership in the cluster that peaks in Balto-Slavs, than in those which peak in Northwestern and Southwestern Europeans. This is obviously a turn-around from his Dodecad V3 result. So which is correct? Strictly speaking, they're both correct, because the components that form in ADMIXTURE runs are dependent on the allele frequencies in the dataset used, and the number of K (clusters) set by the user. These clusters might peak in different groups depending on the dataset, but the results will usually make pretty good sense in relative terms. Indeed, on the balance of their overall results, across all the ancestral components in the V3 and euro7, the Selkups don't appear very different. They cluster in generally the same area relative to the other samples. See, for instance, their positions on two PCAs based on the V3 and euro7. So unlike the supervised results, it's not possible to outright declare the unsupervised Dodecad V3 results as erroneous.

However, I would say that the appearance of such a dominant Western European-based cluster as seen in the V3 is, at the very least, surprising. For instance, why would the Siberian Selkups show allele frequencies that appear more Western European than Eastern European? The Uralic theory proposed by Dienekes really doesn't seem plausible. I don't know how many times Dienekes repeated his experiment to see if the results were stable, but scientists often run their experiments as many as 100 times each, and then publish the most consistent results.

If Dienekes obtained those results from multiple runs, and it was a stable effort, then that's fine. However, the Western European-based cluster still looks unusual enough to treat it with great caution. Suffice it to say that it's not something that can be reliably used to theorize about the peopling of Europe, or the genetic ancestry of linguistic groups, like the Uralics. Dienekes did this, which I thought was very naive of him. But it was even more naive of many people to take his musings seriously. I don't believe that he'll ever be able to produce similar results, like the higher West/East European ratio in the Ukrainians, Mordvinians and Selkups, with his updated dataset.

Obviously, there's nothing wrong with experimentation. That's what science and genome blogging are all about. We're not just here to provide a genetic ancestry service, but also to try and unravel mysteries that are taking scientists years to get around to via the convoluted peer review system in journals. Mistakes will happen, because boundaries are being pushed, but these mistakes have to be corrected.

Update: Dienekes attempts to strike back...and trips up again

