Eurogenes Blog: Harvard

Saturday, June 18, 2022

David Reich on the origin of the Yamnaya people (!?)

Harvard's David Reich is doing a talk next month about the genetic history of West Asia and nearby parts of Europe. This is a quote from an online abstract of the talk (found here).

The impermeability of Anatolia to exogenous migration contrasts with our finding that the Yamnaya had two distinct gene flows, both from West Asia, suggesting that the Indo-Anatolian language family originated in the eastern wing of the Southern Arc and that the steppe served only as a secondary staging area of Indo-European language dispersal.

If this is actually what David Reich is going to claim then I'd say his team has a lot of work to do before they put out their paper on the topic.

First of all, Yamnaya did not have two distinct gene flows from West Asia. I don't even know what that means exactly, but there's no way that this statement is correct no matter how one interprets it.

In fact, the Yamnaya population formed on the Pontic-Caspian steppe from earlier groups native to this part of Eastern Europe, such as the people associated with the Sredny Stog culture.

That is, there were no migrations from West Asia into Eastern Europe that can be claimed to have been instrumental in the emergence of the Yamnaya population. On the other hand, Yamnaya may have been significantly influenced by cultural impulses from West Asia, but this is nothing new.

In terms of deep population structure, the Yamnaya genotype can be described as a mixture between Eastern European and West Asian-related genetic components. However, these West Asian-related components were already in Europe thousands of years before Yamnaya came into existence.

Indeed, soon to be published ancient DNA shows that hunter-gatherers very similar to the Yamnaya people, packing quite a lot of West Asian-related ancestry, lived in the Middle Don region (just north of the Pontic-Caspian steppe) well before 5,000 BCE (see here).

So, did the West Asian ancestors of these Middle Don hunter-gatherers speak Proto-Indo-European, or, as David Reich calls it, Indo-Anatolian? Keep in mind that most linguists put the birth of Indo-Anatolian around 4,000 BCE, which is actually the Sredny Stog period.

Moreover, in underlining Anatolia's supposed impermeability to exogenous migration, David Reich is arguing against things that no one worth their salt ever claimed. That's because the spread of Indo-Anatolian speakers into Anatolia has never really been described by archeologists and linguists as a massive migration, but rather as an infiltration into lands already heavily populated by the Hattians (for instance, see here).

We may have already seen the genetic evidence of this infiltration in the presence of steppe Y-chromosome haplogroup R-V1636 in a Chalcolithic burial at Arslantepe (see here and here). Let's wait and see what else crops up over the next few years as many more ancient Anatolian genomes are sequenced by David Reich and colleagues.

Tuesday, November 9, 2021

Crazy stuff

I'm hoping that 2022 is the year when this problem is finally straightened out. Over to you David Reich, Nick Patterson, Iosif Lazaridis, David Anthony, Wolfgang Haak, Johannes Krause and colleagues.

Wednesday, August 19, 2020

Yamnaya-related ancestry proportions in present-day Poles

Modeling ancient ancestry proportions in present-day Europeans with the qpAdm software is now a lot more difficult. The reasons for this are updates to qpAdm as well as the availability of more useful outgroups, or right pops.

This isn't necessarily a bad thing, because users are forced to work harder to find successful models, which is likely to lead to some interesting discoveries. But it can be very frustrating.

I don't think that settling for poor statistical fits or using a small number of outgroups is an acceptable shortcut. Perhaps sequencing modern-day samples in exactly the same way as the ancient samples, and thus increasing the compatibility between them, might help?

Limiting qpAdm runs to higher quality SNPs from transversion sites does help, but perhaps largely because of the significant reduction in markers?
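To make the transversion restriction concrete, here's a minimal Python sketch that picks out transversion sites from lines in the EIGENSTRAT .snp layout (SNP ID, chromosome, genetic position, physical position, reference allele, variant allele). The SNP IDs in the example are made up, and this is just one way such a filter could be written, not necessarily how the runs in this post were restricted.

```python
# Transitions swap A<->G or C<->T; any other substitution between two
# different bases is a transversion.
BASES = set("ACGT")
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_transversion(ref, alt):
    """Return True if the two alleles form a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        return False
    return frozenset((ref, alt)) not in TRANSITIONS


def transversion_ids(snp_lines):
    """Yield SNP IDs of transversion sites from EIGENSTRAT .snp lines."""
    for line in snp_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        snp_id, ref, alt = fields[0], fields[4], fields[5]
        if is_transversion(ref, alt):
            yield snp_id


# Hypothetical example lines (IDs and positions are invented).
example = [
    "snp_a 1 0.010 1000 A G",  # transition
    "snp_b 1 0.020 2000 A C",  # transversion
    "snp_c 2 0.030 3000 C T",  # transition
    "snp_d 2 0.040 4000 G T",  # transversion
]

if __name__ == "__main__":
    print(list(transversion_ids(example)))
```

Only four of the twelve possible base substitutions are transitions, but transitions are the more common type in real data, so a filter like this typically removes the majority of markers. That's consistent with the suspicion that part of the improvement comes from the smaller marker set.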

In any case, I've now given up on running such analyses, at least until I see some serious pointers on the topic from Harvard's qpAdm experts. But before I put this project to bed for the time being, I'd like to share some new results for Poles from eastern and western Poland, respectively.

right pops:

CMR_Shum_Laka_8000BP
MAR_Taforalt
IRN_Ganj_Dareh_N
Levant_PPNB
GEO_CHG
TUR_Barcin_N
RUS_Piedmont_En
SRB_Iron_Gates_HG
WHG
RUS_Karelia_HG
MNG_North_N
RUS_Ust_Kyakhta

left pops:

Polish_East
CWC_Baltic_early 0.572±0.024
SWE_TRB 0.428±0.024
chisq 11.776
tail prob 0.300296
Full output
Polish_West
CWC_Baltic_early 0.587±0.021
SWE_TRB 0.413±0.021
chisq 11.165
tail prob 0.34478
Full output

Even using transversion sites, this is one of the very few combinations of ancient reference samples that works for the Poles with these right pops. That is, the combination of early Corded Ware samples from the East Baltic (CWC_Baltic_early) and Funnel Beaker samples from Scandinavia (SWE_TRB). The former are obviously the proxy here for Yamnaya-related ancestry.
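For anyone who wants to try a comparable run, here's a minimal Python sketch that writes the left and right population lists plus a qpAdm parameter file. The file names and the dataset prefix are hypothetical placeholders, and this is an assumed setup rather than the exact files behind the results above. The parameter keys used (genotypename, snpname, indivname, popleft, popright, details, allsnps) are standard qpAdm options. The sketch doesn't handle the restriction to transversion sites.

```python
from pathlib import Path

# Right pops exactly as listed in this post.
RIGHT_POPS = [
    "CMR_Shum_Laka_8000BP", "MAR_Taforalt", "IRN_Ganj_Dareh_N",
    "Levant_PPNB", "GEO_CHG", "TUR_Barcin_N", "RUS_Piedmont_En",
    "SRB_Iron_Gates_HG", "WHG", "RUS_Karelia_HG", "MNG_North_N",
    "RUS_Ust_Kyakhta",
]
TARGET = "Polish_East"
SOURCES = ["CWC_Baltic_early", "SWE_TRB"]


def render_par(prefix, left_file, right_file, allsnps=False):
    """Build the text of a qpAdm parameter file.

    `prefix` is a hypothetical EIGENSTRAT dataset prefix.
    """
    lines = [
        f"genotypename: {prefix}.geno",
        f"snpname: {prefix}.snp",
        f"indivname: {prefix}.ind",
        f"popleft: {left_file}",
        f"popright: {right_file}",
        "details: YES",
    ]
    if allsnps:
        lines.append("allsnps: YES")
    return "\n".join(lines) + "\n"


def write_inputs(outdir, prefix):
    """Write left.txt, right.txt and qpadm.par into outdir."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    # qpAdm treats the first entry of the left list as the target.
    (outdir / "left.txt").write_text("\n".join([TARGET] + SOURCES) + "\n")
    (outdir / "right.txt").write_text("\n".join(RIGHT_POPS) + "\n")
    par = outdir / "qpadm.par"
    par.write_text(render_par(prefix, "left.txt", "right.txt"))
    return par


if __name__ == "__main__":
    print(render_par("mydata", "left.txt", "right.txt"), end="")
```

With the files in place, the run itself would be something like `qpAdm -p qpadm.par`, started from the directory that holds the list files.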

Adding any sort of hunter-gatherer population to this model doesn't help or even makes things worse (for instance, see here and here). It is possible to add Baltic hunter-gatherers to a similar model after dropping CWC_Baltic_early in favor of closely related samples from the Early to Middle Bronze Age Pontic-Caspian steppe. Note, however, that the statistical fits are poorer, with tail probs of about 0.047 and 0.008.

Polish_East
Baltic_LTU_Narva 0.032±0.014
PC_steppe_EMBA 0.483±0.019
SWE_TRB 0.485±0.019
chisq 17.143
tail prob 0.0465198
Full output
Polish_West
Baltic_LTU_Narva 0.031±0.011
PC_steppe_EMBA 0.491±0.015
SWE_TRB 0.477±0.016
chisq 22.444
tail prob 0.00757421
Full output
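The gap in fit between the two-source and three-source models can be checked by recomputing the tail probs from the chisq values. The Python sketch below assumes that the tail prob is the upper tail of a chi-square distribution, with degrees of freedom given by the usual rank-test count: (rows - rank) x (columns - rank) for a matrix with one row per source and one column per right pop other than the base. That's my reading of the output, not something taken from the qpAdm source code. Under that assumption, 12 right pops give 10 degrees of freedom with two sources and 9 with three.

```python
import math


def rank_test_dof(n_sources, n_right):
    """Degrees of freedom for the rank test (assumed formula)."""
    rows = n_sources        # left pops minus the target
    cols = n_right - 1      # right pops minus the base
    rank = n_sources - 1    # rank implied by an n_sources-way mixture
    return (rows - rank) * (cols - rank)


def chi2_tail(x, dof):
    """Upper-tail probability of a chi-square variable, integer dof >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, dof // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd dof: normal tail plus a finite series in sqrt(x).
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    pdf = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * pdf * total


# (label, chisq, number of sources) taken from the results in this post.
models = [
    ("Polish_East, 2 sources", 11.776, 2),
    ("Polish_West, 2 sources", 11.165, 2),
    ("Polish_East, 3 sources", 17.143, 3),
    ("Polish_West, 3 sources", 22.444, 3),
]

if __name__ == "__main__":
    for label, chisq, n_sources in models:
        dof = rank_test_dof(n_sources, n_right=12)
        print(f"{label}: dof={dof} tail prob={chi2_tail(chisq, dof):.6f}")
```

The recomputed values agree closely with the tail probs reported above, which suggests the assumed degrees of freedom are right for these runs. It also shows why the three-source models look worse on two counts: the chisq is higher and there is one less degree of freedom to absorb it.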

Interestingly, but not surprisingly, the ancestry of many present-day Northwestern European populations can be modeled in basically the same way. That's because ancient ancestry proportions are more closely correlated with latitude than longitude across much of the European continent.

English_Kent
CWC_Baltic_early 0.527±0.024
SWE_TRB 0.473±0.024
chisq 13.042
tail prob 0.221357
Full output
Icelandic
CWC_Baltic_early 0.586±0.023
SWE_TRB 0.414±0.023
chisq 16.517
tail prob 0.085751
Full output
Scottish
CWC_Baltic_early 0.583±0.021
SWE_TRB 0.417±0.021
chisq 12.144
tail prob 0.275536
Full output

A zip file with the qpAdm output from this analysis and a list of the most relevant ancients is available here. I might try to run a few more populations over the next few days, but probably only from the northern half of Europe, so please check the zip file in a week or so to see what else is in there.

If anyone wants to challenge my results, note that these and very similar samples are freely available to the public via Harvard University here and here.

Update 22/08/2020: From Nick Patterson (Broad) in the comments:

My general advice for qpAdm is 1) Work on the right hand set. Don't include irrelevant population (except for one population as an outgroup); picking the best RHS can dramatically reduce s. errors on the admixture weights. 2) If qpAdm gives a very low p-value try and understand why, sometimes it is telling you that the target is not a mixture of the sources but sometimes the assumptions are violated, for example recent gene-flow from left pops -> right.

See also...

Ancient ancestry proportions in present-day Europeans

Saturday, June 27, 2020

Major updates to ADMIXTOOLS

An important message from Nick Patterson:

Dear Eurogenes bloggers,

Many of you use ADMIXTOOLS and you might like to know that there is a new release on github [LINK] with some important enhancements.

From the README

*** NEW ***

1) Version 7.0 has numerous upgrades.

a) Two new executables --qpfstats qpfmv allow precomputation of f-statistic basis. This can greatly reduce computation costs.
b) qpAdm, qpWave, qpGraph support qpfstats output as input.
*** This is a much improved way of running with allsnps: YES. ***
c) A new experimental feature of qpGraph (halfscore: YES) allows comparison of 2 phylogenies + a (weak) goodness of fit score. Be careful if running with a large number of populations and consider reducing block size say blgsize: .005

2) Note that several of the new ideas implemented in version 7.0 were developed collaboratively with Robert Maier, who has implemented them along with the great majority of other ADMIXTOOLS functionality in R: See https://github.com/uqrmaie1/admixtools
Executables run fast, and it has features not available in this C version, such as interactive exploration of graph phylogenies.
A manuscript describing the algorithmic ideas and providing documentation of the methods is in preparation.

qpfstats is the most important new executable. This estimates f-statistics and covariance on a basis.

a) This can be passed into other programs of the package without having to reaccess the genotype files, greatly speeding the computations.
b) In allsnps: YES mode a new computation is carried out (explained in qpfs.pdf) that is much more logical when there is a lot of missing data. Sometimes standard errors are greatly reduced.
qpfstats can be used with up to 30 populations. Much beyond that the output files become large.

As usual there may be bugs...

Nick Patterson 6/27/2020
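The 30-population guideline in the message above makes sense once you count what has to be stored. All f-statistics for n populations can be expressed in terms of n(n-1)/2 basis statistics, and a covariance matrix over that basis grows roughly with the fourth power of n. The Python sketch below just does the counting. How qpfstats actually lays out its output file is not something I've checked, so treat the numbers as an illustration of the scaling rather than as file sizes.

```python
def basis_size(n_pops):
    """Number of independent f-statistics for n_pops populations."""
    return n_pops * (n_pops - 1) // 2


def covariance_entries(n_pops):
    """Distinct entries in a symmetric covariance matrix over the basis."""
    m = basis_size(n_pops)
    return m * (m + 1) // 2


if __name__ == "__main__":
    for n in (10, 30, 60):
        print(n, basis_size(n), covariance_entries(n))
```

Going from 30 to 60 populations roughly quadruples the basis (435 to 1,770 statistics) and multiplies the number of covariance entries by more than 16.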

Update 29/06/2020: As pointed out above, qpfstats is the most important new executable. Indeed, Nick Patterson now recommends that qpAdm analyses run with the allsnps: YES flag should be based on qpfstats output.

Several of my recent blog posts featured qpAdm models run with the allsnps: YES flag, but they were based on genotype data because obviously I didn't know anything about qpfstats at the time.

So I went back and ran some of these models again, just to make sure that they were still relevant. Below are three examples which you can compare to the original analyses here, here and here, respectively.

TUR_Arslantepe_LC_Maykop
RUS_Maykop_Novosvobodnaya 0.281±0.042
TUR_Arslantepe_LC 0.719±0.042
chisq 10.923
tail prob 0.449752
Full output

TUR_Barcin_C
RUS_Vonyuchka_En 0.137±0.031
TUR_Buyukkaya_EC 0.863±0.031
chisq 15.074
tail prob 0.0889099
Full output

UKR_N_admixed
RUS_Progress_En 0.083±0.020
UKR_N 0.917±0.020
chisq 6.825
tail prob 0.65538
Full output

As far as I can tell, they're very similar to the original runs, which is a relief, because it means that the conclusions in my blog posts still make sense.
