Sunday, October 12, 2014

Ancient genomes and the calculator effect


Several ancient genomes have been posted online as text files and uploaded to GEDmatch over the last couple of weeks, and many more are likely to follow in the future. A lot of people have already taken this opportunity to analyze these files with various online ancestry tools, usually DIY calculators.

That's actually not a bad way of doing things, as long as everyone's aware that almost all of these calculators produce biased results. They produce biased results because they violate a very basic rule of science, which is this:
Do not test more than one variable at a time.
Obviously, the variable we want to test with these calculators is ancestry. However, the reference samples are usually tested in a different way to the test samples: they're part of the ADMIXTURE run that defines the clusters, while the test samples are only fitted to the resulting allele frequencies afterwards. This adds another variable to the proceedings. As a result, we simply can't compare the results of the reference samples to those of the test samples.

I know that a lot of people find this difficult to grasp, and many just seem hell bent on not grasping it. However, anyone who isn't completely insane, and takes five minutes out of their day to try and understand the concepts involved, has to agree that this is a real problem. It can be proven empirically, like I did over two years ago (see here).

I suspect that a lot of confusion has been caused by the fact that the people who were used as reference samples in the making of the various DIY calculators saw highly accurate results when running them, and so assumed everything was fine. The accuracy of the DIY calculators for such people is indeed impressive, and I show that at the link above, but unfortunately the story is very different for everyone else.

Here's the good news: the Eurogenes calculators don't suffer from the calculator effect. That's because the reference samples are treated in the same way as the test samples, so there's only one variable: ancestry. What this means is that when you run a modern or ancient genome with a Eurogenes calculator you can confidently compare the result to those of the reference samples (provided enough SNPs are used), and then be able to make sensible inferences about its genetic origins.

28 comments:

  1. Yes, it makes sense to analyse all the samples under the same conditions.

    But the question is: apart from Eurogenes, do all the other calculators suffer from this effect?

  2. Yes, they all suffer from it except the latest MDLP K23b, which was designed to get around this problem.

  3. Thanks David, important to know.

    Indeed, the results of the latest MDLP calculator are close enough to those of Eurogenes (in my case!).

  4. I have to admit the earlier article of yours wasn't that clear to me, so I checked up a little more

    http://dodecad.blogspot.co.uk/2012/08/on-so-called-calculator-effect.html

    As Dienekes describes it, calculators will always provide slightly inaccurate approximations compared to including a person in a real ADMIXTURE run, though less so the more SNPs and people are included in the run.

    On the Dodecad project, project members are included with academic samples. So these project members who are included will have less noisy results than non-members who use the calculator without being included in the run.

    Whereas on Eurogenes, as only academic references are used in the ADMIXTURE run, every person using the calculator is noisy compared to the academic references.

    So it wasn't clear to me how the Eurogenes strategy was actually better, so long as everyone knows that project participants and academic references aren't comparable to other users. In both cases, the samples in the ADMIXTURE run (whether academic only or academic plus project members) are fixed, and comparing them to other people using the calculator will be inaccurate.

    But then the creator of the MDLP explained in more detail how he adjusted his calculators in line with Davidski's advice:

    "1) Set aside five of the most typical individuals from each of your reference populations.
    2) Run your ADMIXTURE analysis with the remaining samples.
    3) Use the allele frequencies from the ADMIXTURE run to test the individuals you set aside and produce population averages from their results only.
    This should fix all the problems."
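
The three steps quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used by MDLP or Eurogenes: the function names are hypothetical, "most typical" is assumed to mean the first entries of an already ordered list, step 2 (the ADMIXTURE run itself) happens outside the sketch, and the projection in step 3 uses a standard EM update with the cluster allele frequencies held fixed.

```python
def split_holdout(samples_by_pop, n_holdout=5):
    """Step 1: set aside the first n_holdout individuals of each population.

    Assumption: each list is already ordered with the most typical
    individuals first; the quoted advice does not say how to pick them.
    """
    reference, held_out = {}, {}
    for pop, samples in samples_by_pop.items():
        held_out[pop] = samples[:n_holdout]
        reference[pop] = samples[n_holdout:]
    return reference, held_out


def project(genotype, freqs, n_iter=200, eps=1e-6):
    """Step 3: estimate one individual's ancestry proportions while the
    cluster allele frequencies stay fixed (EM update).

    genotype: list of 0/1/2 allele counts, one per SNP
    freqs:    freqs[k][j] = frequency of the counted allele at SNP j in cluster k
    """
    k, m = len(freqs), len(genotype)
    # Clamp frequencies away from 0 and 1 to avoid division by zero.
    f = [[min(max(x, eps), 1 - eps) for x in row] for row in freqs]
    q = [1.0 / k] * k
    for _ in range(n_iter):
        new_q = [0.0] * k
        for j, g in enumerate(genotype):
            p_alt = sum(q[c] * f[c][j] for c in range(k))
            p_ref = sum(q[c] * (1 - f[c][j]) for c in range(k))
            for c in range(k):
                new_q[c] += g * q[c] * f[c][j] / p_alt
                new_q[c] += (2 - g) * q[c] * (1 - f[c][j]) / p_ref
        q = [x / (2 * m) for x in new_q]
    return q


def population_averages(held_out, freqs):
    """Average the projected results of the held-out individuals only."""
    averages = {}
    for pop, genotypes in held_out.items():
        results = [project(g, freqs) for g in genotypes]
        averages[pop] = [sum(col) / len(results) for col in zip(*results)]
    return averages


# Toy data: 2 clusters, 4 SNPs. The numbers are made up, and `freqs` stands in
# for the output of step 2 (an ADMIXTURE run on `reference`), which is not run here.
freqs = [[0.9, 0.9, 0.1, 0.1],
         [0.1, 0.1, 0.9, 0.9]]
samples = {"PopA": [[2, 2, 0, 0]] * 6, "PopB": [[0, 0, 2, 2]] * 6}
reference, held_out = split_holdout(samples, n_holdout=5)
averages = population_averages(held_out, freqs)
```

The point of the design is that the population averages come from individuals who were projected in exactly the same way as any later test sample, so the two sets of results differ in one variable only.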

    Which kind of seems to make more sense. Basically, it's a strategy of leaving out some samples that should represent that population's mean, running them through the calculator, and then using the results to quantify the degree of calculator effect and adjust the calculator to fit. Or just giving averages based on the individuals set aside; I'm not sure which you do exactly.

    At the least it seems like it would be worth explaining to the Harappa project, as that project has a different focus not well served by this or MDLP.

    (Well, MDLP also has a more global focus, but the current one's mix of ancient and modern clusters seems strange. The goals of explaining modern relatedness and explaining descent from ancient clusters seem at odds, because if you want modern relatedness you need to take account of modern drift and recombination since admixture.)

    I guess this calculator effect must also multiply with low SNP coverage samples, while being minimal at high coverage? That might explain why the noise genetiker is getting seems quite large, beyond what I would expect from what modern people with good coverage get out of Dodecad calculators. Zak of the Harappa Project puts these errors on modern samples at around 1% on his calculators (http://www.harappadna.org/2014/05/harappaworld-hrp0385-hrp0419/), and the errors in genetiker's runs of these samples through Dodecad are higher than that. Modern NW Europeans (and the Hinxton samples are close to that) might get calculator error, but I doubt it would be as large as the 6.5% African that G is finding; instead they'd just get blur on their West Eurasian clusters.

  5. Minimal noise isn't really the issue, as long as it affects both the test and reference samples. My K13 and K15 tests do show a bit of noise, but when their results are plugged into the oracles they're spot on for many people.

    This might be more difficult to appreciate if you haven't tested yourself and tried these tools. People get all sorts of strange results because of this problem, with Irish coming out German and what not in the oracles. It can be a complete farce for many people.

    It's a problem similar to the PCA projection bias which I blogged about before, and just as obvious. Also, just like PCA bias, the fewer the number of samples in the analysis, the more extreme its effects. That's why some of the calcs produce much noisier and stranger results than others. They were released a few years ago when very few samples were available, and unfortunately they're still online confusing the crap out of many people.

  6. Matt wrote

    "Whereas on Eurogenes, as only academic references are used in the ADMIXTURE run, every person using the calculator is noisy compared to the academic references."

    In fact that is not true. If allele frequencies are calculated professionally, the result for "the others" is closer to the truth than for project members. This happens because the admix procedure tends to exaggerate cluster profiles. For "the others", allele frequencies are more objective.

  7. How would y'all rank 23andme, ancestry.com, and FTDNA's non-parental tests? Is there any better option than those three, that can still be used at GEDmatch?

  8. Can someone please explain why researchers are still using "ancient" academic reference samples such as HGDP? I've had an issue with "Orcadian" for many years at this point. I truly appreciate the work of the early pioneers in this field, including the doyen himself, but isn't it time to move on to more modern and representative sampling?

  9. Bazza,

    If you're looking for good raw data, then Ancestry and FTDNA are now better than 23andMe, which, as far as I know, imputes some markers instead of genotyping them. Imputation is OK if done right, but why impute when you can genotype for the same price, more or less?

    Mark,

    The papers I've seen lately have used samples from the HGDP as well as other sources, including the 1000 Genomes, Estonian Biocentre and Reich Lab (Human Origins dataset). Sampling can usually be improved, but I think it's now at a fairly reasonable level.

    The HGDP is still very useful, because it has some very unusual samples, like those Neolithic farmer-like Sardinians and ANE-rich Karitiana Indians.

    In fact, I've come to appreciate more the sampling strategy of the HGDP since we've seen a few ancient genomes. It'd actually be interesting to find out why the HGDP sampled those particular Sardinians and the Karitiana Indians. Did they know something about them that no one else did back then?

  10. @David
    When exactly will we get the data on the Corded Ware genome and the others?

  11. The ASHG presentation on the Corded Ware and other ancient European genomes is next Monday. Scroll down to the update here...

    http://eurogenes.blogspot.com.au/2014/09/corded-ware-culture-linked-to-spread-of.html

    Let's hope someone tweets or blogs from it.

  12. A different question related to ancient genomes and calculator effect.

    Looking at the k13 averages I find results like these for ancient genomes from WHG:

    La-Brana-1:
    North Atlantic: 44.6%
    Baltic: 49.5%
    West_Med: 0%
    East_Med: 0%

    Loschbour:
    North Atlantic: 48.1%
    Baltic: 50.4%
    West_Med: 0%
    East_Med: 0%

    And from EEF:

    Ötzi:
    North Atlantic: 24.6%
    Baltic: 0%
    West_Med: 50.9%
    East_Med: 24.5%

    Stuttgart:
    North Atlantic: 17.5%
    Baltic: 0%
    West_Med: 50.5%
    East_Med: 29%

    As far as we know, EEF were some Middle Eastern population mixed with WHG. But these results don't show this. It looks as if EEF were a mix of a population that was 100% Atlantic with some population that was 100% Mediterranean (or Mediterranean mixed with some Atlantic).

    But where is that population that was so Atlantic without being Baltic at all? It never existed, as far as we know.

    Are we seeing a calculator effect (a defect, actually) here, where all of the Baltic component shifts to West_Med for some reason? In that case, could it be that anything above K7 is just too inaccurate due to this calculator defect?
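
The arithmetic behind this puzzle can be sketched under a deliberately naive assumption: that calculator components mix linearly, so a target's percentages are a weighted average of its sources' percentages. The helper name below is hypothetical, and only the four components quoted above are used. The sketch shows why the quoted figures look odd under that assumption; it doesn't say which explanation is right.

```python
# K13 figures quoted above, in percent. Only four of the thirteen components
# were listed in the comment, so the rows do not sum to 100.
loschbour = {"North_Atlantic": 48.1, "Baltic": 50.4, "West_Med": 0.0, "East_Med": 0.0}
stuttgart = {"North_Atlantic": 17.5, "Baltic": 0.0, "West_Med": 50.5, "East_Med": 29.0}


def implied_other_source(target, source, share):
    """Solve target = share * source + (1 - share) * other for `other`.

    Assumes purely linear mixing of component percentages (a simplification).
    """
    if not 0 <= share < 1:
        raise ValueError("share must be in [0, 1)")
    return {c: (target[c] - share * source[c]) / (1 - share) for c in target}


# For any positive Loschbour-like share, the second source would need a
# negative Baltic percentage, which is impossible.
for share in (0.1, 0.2, 0.3):
    other = implied_other_source(stuttgart, loschbour, share)
    print(share, round(other["Baltic"], 1))
```

So, taken at face value, a 0% Baltic score leaves no room for a source that is about half Baltic, which is exactly the tension described above.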

  13. The K13 and K15 results do show that EEF is a mixture of Near Eastern and WHG populations.

    You can see that in the oracle results, even though the fit is very poor, probably because Yemenite Jews are not very similar to Neolithic Near Easterners.

    https://drive.google.com/file/d/0B9o3EYTdM8lQeTdWTkJjVXJIVzA/view?usp=sharing

    The reason I first had to process the K15 results to show it (and I could do exactly the same with the K13, but I don't have time now) is because the clusters were based on modern populations. If we ever get enough high quality ancient genomes from Mesolithic Europe and Neolithic Near East to design tests like this, I won't need to do that.

    And I wouldn't be so sure that a purely K13 Atlantic population didn't ever exist in Europe. For instance, it'll be interesting to see what high quality Megalithic samples score in the K13 and K15 tests. I suspect they might come close to 100% Atlantic.

    I also have a hunch that there will be genomes from the Bronze and/or Iron Ages that score almost 100% in the North Sea and East Euro clusters.

  14. Yes, the quality of the ancient genomes obviously has some limitations and introduces some inaccuracies and noise. But in my example there are two WHG samples that are both about 50-50 Atlantic and Baltic, and two EEF samples that are both 0% Baltic. That seems like more than a coincidence of low quality samples.

    So yes, there are two options: either it is a calculator defect, or there existed some population that was highly Atlantic without being Baltic (contemporary in Western Europe with the La Braña type that was 50-50). But we still have to find that population (not impossible, of course, but quite a finding it would be!)

  15. "In fact, I've come to appreciate more the sampling strategy of the HGDP since we've seen a few ancient genomes. It'd actually be interesting to find out why the HGDP sampled those particular Sardinians and the Karitiana Indians. Did they know something about them that no one else did back then?"

    I understand. Cavalli-Sforza was infatuated with Sardinia; I guess that's one reason the HGDP had several samples from Italy but none from Germany. I still have a problem with Orcadian being presented as anything British. The Orkneys were mostly Norwegian, with a few Scots and Brits. Still, I agree that the more samples that are added to the mix, the better the results generally.

  16. David,

    Do you think it's possible to try out Balaji's idea of running f3 stats on the Human Origins data-set (including all ancient genomes and modern populations, but perhaps excluding AG-2)? Just like what you did here:

    http://eurogenes.blogspot.com/2014/07/f3-stats-100-present-day-populations.html

    I think this would be quite interesting, since it would now involve so many ancient samples.

  17. Is there any commercial Y-DNA test that lets you choose which Y SNPs to test?

  18. I'm just trying to figure out the most economic way of running the f3 tests on the Human Origins dataset.

    I might have the results tomorrow.

  19. Mark D,
    Orcadians are only 25% Norwegian, and maybe less than that when considering isle blood in Norway.

  20. "Orcadians are only 25% Norwegian, and maybe less than that when considering isle blood in Norway."

    Source?

    The history of Orkney and low population growth (fewer now than in 1801) would indicate otherwise. Compare, http://www.orkneyjar.com/orkney/norn.htm

  21. The British population study that's been passed around here a few times. I'll post a link. Give me a bit.

  22. I can't find the damn thing. It was in pre-print with Nature, in June. There are two distinct populations on Orkney, and there is a 25% Norwegian input. Maybe David or someone remembers the link.

  23. barakobama, of course there is. It's possible on FTDNA, for instance. But I fear it's only available as an upgrade for people who have already bought one of their STR tests. Their SNP tests are quite cheap, though.

  24. Uh, I have a FTDNA Account too. (STR 111 or something) and FGS mtDNA. And ordered several SNP Tests... a million years ago.

    Which reminds me, there may be new SNPs to break down my Y-DNA further....maybe.... *yawn*

  25. I posted some of those f3 ratios in the other thread...

    http://eurogenes.blogspot.com.au/2014/10/scratch-north-caucasus.html?showComment=1413377726014#c5768609114209502942

  26. This comment has been removed by the author.

  27. Thank you for this mind-opening conversation. I would like to know where I can find a breakdown of what constitutes North Atlantic and the other categories in the Eurogenes analysis. The others are easier to guess, but I'd like to be sure.

    JMH
