Tuesday, June 13, 2017

qpGraph models for the Kalash & Yamnaya

I'm pretty happy with this effort, but it's a very complex topology with a lot of admixture edges. Moreover, its highest Z score of nearly 3 suggests that it can be improved (Z >3 would mean a failed model). Indeed, I'd say that the Basal Eurasian admixture coefficients are a little too high, and perhaps Steppe_EBA is a few per cent more West Asian/Caucasian than it should be. More details about all of the graphs in this post are available here.

Obviously, the labels for the inferred ancestral populations, like North Caucasian, are speculative. In hindsight, it may have been better to use something like single letter labels.

But now that I have a fairly robust topology, I can try and ask some questions. For instance, is the inferred Caspian pop a better source of West Asian ancestry in Yamnaya than the so called North Caucasian one? The answer is probably no.

My main graph is also a decent statistical fit for at least a number of Indian groups, such as one of the Gujarati subpopulations labeled GujaratiD in the Human Origins dataset. But it fails marginally for Pathans, so it's not a robust solution for all of South Asia. Incredibly, using Andronovo instead of Yamnaya in the Pathan model makes it work. Tajiks can also be modeled in this way using Andronovo. I say incredibly, because Pathans and Tajiks are obviously Iranic speakers, and their Iranic ancestors in all likelihood arrived in South Asia from the Eurasian steppe much later than the Indo-Aryan ancestors of the Kalash and most Indians.

So what we might be seeing here is substructure within the steppe-related admixture amongst South Asians, with Indo-Aryan speakers apparently showing Yamnaya-related (Catacomb?) ancestry, and Iranic speakers, as well as possibly groups with significant Iranic ancestry, showing a preference for later Andronovo-related ancestry. I need to have a closer look at this. But it won't happen overnight; my brain is fried as it is after this effort, and I need to get some fresh air.

Update 14/06/2017: I've now had the chance to test many more Indo-Aryan and Iranic groups with my model. Most of these groups show a slight, non-significant, preference for Yamnaya_Samara as the steppe reference population. However, those that show a slight, and again non-significant, preference for Andronovo are usually Iranic, such as the Balochi in the graphs below. I'm not claiming that this proves anything, but I do think that it hints at something, and I'll try testing a few different hypotheses in the near future with qpGraph.

See also...

qpGraph open thread


Nirjhar007 said...

Thank you for the effort.

Nick Patterson (Broad) said...

A major (the major?) weakness of qpGraph is that there is no clear
goodness of fit test. But it's not as though Z=2.9 is OK, Z=3.1 is fail.

If you add enough admixture edges you can always get a good fit. I tend to
look at the bad scores [you can code zthresh: 2 (for example) if you want to look
at more "outlier" scores], and see if I am worried. Sometimes there are mild
troubles in a part of the graph that for the purpose at hand one doesn't care
so much about.
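To make the threshold idea concrete, here is a minimal Python sketch of what listing "outlier" scores amounts to: compute Z as (observed - fitted) / standard error for each statistic and keep those with |Z| above a chosen cutoff. This is not qpGraph's own code; the `FitStat` class and `outliers` function are hypothetical names, and the three statistics are the residuals quoted later in this thread (from different graphs, pooled here only for illustration).

```python
from dataclasses import dataclass


@dataclass
class FitStat:
    pops: tuple      # the four populations of the f4 statistic
    fitted: float    # value predicted by the graph
    observed: float  # value estimated from the data
    std_err: float   # standard error of the difference

    @property
    def z(self) -> float:
        # Residual Z score: how many standard errors the model is off by.
        return (self.observed - self.fitted) / self.std_err


def outliers(stats, zthresh=3.0):
    """Return the statistics with |Z| above zthresh, worst first."""
    flagged = [s for s in stats if abs(s.z) > zthresh]
    return sorted(flagged, key=lambda s: abs(s.z), reverse=True)


# Residuals quoted in the comments of this thread.
stats = [
    FitStat(("Onge", "Yamnaya", "MA1", "Barcin"), -0.008567, 0.002038, 0.001945),
    FitStat(("CHG", "Iberia", "Yamnaya", "Barcin"), 0.023125, 0.026996, 0.00124),
    FitStat(("CHG", "Barcin", "Yamnaya", "Iberia"), 0.026248, 0.030638, 0.001388),
]

for s in outliers(stats, zthresh=2.0):
    print(s.pops, round(s.z, 3))
```

Lowering the cutoff only changes which residuals get listed; it does not turn the list into a formal goodness-of-fit test.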

Davidski said...


Yep, the choice of 3 is arbitrary, but it seems OK as a general rule. And in this case Indo-Aryans are generally <3 and Iranics >3, and the latter are <3 using Andronovo. That's actually pretty awesome.

Nirjhar007 said...


I hope the aDNA paper that has Pakistani samples will not suffer from creationist narratives...

Davidski said...

Yeah, I'm sure you'll get an answer saying something like that.

Nirjhar007 said...

Well, we can always hope that people will be cautious and take sources and data from scholars who are experts in each specific area of ancient cultures. If they don't, and try to 'solve' something they think ''is already solved and a matter of time'', then they are in deep shit...

For example for Swat culture , contact Dr. Luca Maria Olivieri . A true scholar .....

Carlos Aramayo said...


You wrote:

"Pathans and Tajiks are obviously Iranic speakers, and their Iranic ancestors in all likelihood arrived in South Asia from the Eurasian steppe much later than the Indo-Aryan ancestors of the Kalash and most Indians".

Do you agree with "a first wave" of (Pre-Vedic) Indo-Aryan arrivals in South Asia as per Parpola's hypothesis (in the first half of the 2nd millennium BC)? When do you consider the first arrival of Indo-Aryans took place? I would like to see your opinion in years BCE.

Davidski said...

I don't know about that, but I don't think it's controversial to say that Iranics arrived in South Asia later than Indo-Aryans.

Taymas said...

How would the Z-score react to the difference between Onge and subcontinental HG? I don't have a good feel for how these computations behave with different inputs.

That difference could be fairly substantial: (a) we know HGs can have high genetic distance in close proximity, (b) IIRC you've also used other Asian pops as the proxy, (c) the route between the Andamans and the Indus is long and ecologically varied.

Really cool work, thanks Davidski.

Synome said...

If I try to combine this tentative fit of Steppe_EMBA to Iranic and Steppe_EBA to Indo Aryan, I'd get something like this. This is directly related to Davidski's earlier post about the Poltavka outlier and its relationship with the Indo Aryans.

There's historical evidence that the Iranians and Indo Aryans became adversaries, and people like Christopher Beckwith (Empires of the Silk Road) think the Iranians chased the IAs out of central Asia and into Mesopotamia and the Indus Valley. If we associate the elite Andronovo remains genetically with Iranic speaking peoples, what that could suggest is an elite stratum in central asia that later pushed out earlier Yamnaya like inhabitants, or pushed out Yamnaya like immigrants from further afield from places like the Afanasievo culture area at the edges of the Andronovo horizon who had come and partaken in the multitude of interactions that led to the formation of the complex Indo-Iranian culture.

Were the people who eventually became the Iranians already the elite from the beginning? Wealthy Corded Ware immigrants who imposed on and eventually displaced both the BMAC and earlier IE peoples, who mixed and became the Indo Aryans?

Seinundzeit said...


Very fascinating; I think you're on very solid footing here.

For quite some time, we've been seeing multiple analyses that all show the Kalasha at anywhere between 35% to 50% Steppe_EMBA, 55% to 40% Iran_Neo, and 10% ENA, so it's good to see that you get identical results using qpGraph.

And, I think the Andronovo vs Yamnaya difference is a great indication that your efforts paid off, as we've seen this pattern elsewhere (it's pretty consistent).

Samuel Andrews said...

" If we associate the elite Andronovo remains genetically with Iranic speaking peoples, what that could suggest is an elite stratum in central asia that later pushed out earlier Yamnaya like inhabitants, "

Should we assume there was a genetic difference between elites and ordinary people in prehistoric cultures? I see no indication of that in ancient DNA. Elite Yamnaya genomes fit as an ancestor of all LNBA Europeans no matter their status. Elite Andronovo genomes fit as an ancestor of much later Scythian and Sarmatian genomes.

And isn't the genetic difference between modern high caste Indians and low caste Indians microscopic?

Ryan said...

I'm not sure why this substructure is that surprising? The DNA pool of the steppe should have been constantly in flux as new groups of people are encountered and admixed with steppe nomads, who then spread that DNA across the steppe. Each wave from the steppe should be distinguishable, no?

Synome said...

@Samuel Andrews

"Should we assume there was a genetic difference between elites and ordinary people in pre historic cultures?"

I would say no. We shouldn't assume there was a difference. But if there is archeological, historical support for such a difference then we can ask that question.

I think that there were prehistoric cultures with genetically identical elites. I think there were also some prehistoric cultures with genetically distinct elites. I think the case of BA Europe isn't necessarily representative of what was going on in central Asia. The Mitanni were Indo Aryan elites who migrated into northern Mesopotamia and ruled over Hurrian speaking peoples. We have plenty of reason to believe they were genetically distinct from the Hurrians, even if intermixture lessened the difference over time. This was the period of the early Iranians as well. I don't see it as impossible that genetically different strata can exist at this time on the steppe and in central asia. The Indo Iranian peoples may even be one of the earliest examples of such stratification.

The reason for theorizing the existence of a stratum is that the archaeological, historic and linguistic evidence shows a common origin of the Indo Aryan and Iranian cultures, but some genetic evidence now suggests that Iranians may have been genetically distinct from Indo Aryans. So how can two genetically distinct peoples occupy the same area and speak closely related languages with a common origin and yet remain separate? Social stratification caused by an elite group of foreign origin.

I'm not sure you can compare modern Indians and their history to the type of mobile, client based society that existed on the steppe and Central Asia in the Bronze Age. There were millennia for Indo Aryan elites to settle side by side with lower status peoples in urban societies and intermix with them. If the Iranians drove the Indo Aryans out and caused a mass exodus, they might never have gotten the opportunity to mix to the same degree. The precursor Sintashta culture only began ~2100 BCE, and by ~1800 BCE the Indo Aryan migrations had begun. That's a maximum of around 300 years together, probably less for those people closer to the edge of the Andronovo horizon.

andrew said...

@Carlos Aramayo I would put the date at about 1900 BCE based upon Cemetery H and the earliest evidence of new types of metallurgy in South Asia.

@Samuel Andrews There is non-trivial caste variation in Indian genetics. See, e.g., papers discussed at:, and

Generally speaking there is more steppe ancestry in high caste than in low caste individuals. The disparity is particularly great in Southern India where steppe ancestry/ANI is particularly low in low caste individuals but comparable to the rest of India's high caste population. Genetic tests suggest that very strong caste endogamy dates to about 70 generations ago (about 2000 years), although historical evidence suggests that this may be a few hundred years off, a date long after initial Indo-Aryan admixture.

EastPole said...

Vedic Sanskrit and Avestan are very similar. Much, much more similar than for example Baltic and Slavic. Early Indo-Aryans and early Iranians should be the same people genetically.

Matt said...

@Davidski, I had a go at modifying your basic main model to try and include WHG and Barcin_N as well:

Ancient Populations:
Pops file:
Topology A: (EHG and WHG split together from West Asian's non-Basal Ancestry)
Topology B: (WHG splits from EHG + West Asian's non-Basal Ancestry).

(I'm assuming the Ancient Populations file is responsible for bracketing together samples into the CHG, EHG, etc. as in your files, and used the sample ids from Mathieson's latest supplemental, hopefully getting them right).

If this runs relatively quickly for you, enough to be no trouble, would like to see if it runs.

Matt said...

Also if poss.

Bit more complex graph, modified from your main graph, aiming to model

a) Yamnaya_Samara, Iberia_Chalcolithic, Iran_N, CHG, Barcin_N, EHG all as admixed between AG-3, MA-1, WHG, and ghost clades (Basal, West Asian HG ghost, etc.)
b) use both Ust-Ishim and Onge together as constraints on Basal Eurasian levels:

Ancient Pops:

Not surprised if this one ends up throwing out the "not enough overlapping SNPs" problem which turned up with the previous models I posted in the other thread, but I thought worth a try!

Carlos Aramayo said...


Thanks for your opinion. I still wonder about a dating, although based on modern DNA, published here in this blog a year or two ago, which showed a possible timeframe between 2500 and 2000 BC for R1a in the Indus Valley. I think Heggarty et al. included this one in their "hybrid hypothesis" map.

Samuel Andrews said...

Thanks for the helpful response Synome.

Davidski said...


Your first two graphs.

Btw, the ancient samples file isn't necessary for qpGraph. It's for my own use so that I know which samples I grouped together to form the various sets.

Davidski said...


There's a problem with your third graph.

fatalx: edge pUst_Ishim exists!
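That error message reads as if the name pUst_Ishim was declared for more than one edge, although that reading is an assumption. A small pre-flight check along these lines could catch it before a run. The sketch below assumes a graph file layout with lines of the form `edge <name> <parent> <child>`; the function name and the example graph are hypothetical, not taken from Matt's actual file.

```python
from collections import Counter


def duplicate_edge_names(graph_text):
    """Return the edge names that are declared more than once."""
    names = []
    for line in graph_text.splitlines():
        fields = line.split()
        # Assumed layout: edge <name> <parent> <child>
        if len(fields) >= 4 and fields[0] == "edge":
            names.append(fields[1])
    return sorted(name for name, n in Counter(names).items() if n > 1)


# Hypothetical fragment with the same edge name used twice.
example = """
root  R
edge  pUst_Ishim  R  Early_Main_Eurasian
edge  pUst_Ishim  Early_Main_Eurasian  Ust_Ishim
edge  pOnge       Early_Main_Eurasian  Onge
"""

print(duplicate_edge_names(example))
```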

Genetonaut said...

What is the make-up of the West Asian component in the graphs? It's apparently separate from Basal so I assume it's UP West Eurasian derived (ANE and/or WHG-UHG)?

Davidski said...

Yes, and I'd say mostly UHG, for the lack of a better term.

Matt said...

@ Davidski, cheers. The above two graphs don't fit very well. I would guess that's due to neither of the simple bifurcations in each topology working well to capture the different "West Eurasian" descended structure (unless it's in the "Basal Eurasian" structure?).

The graph with the divergence between (West_Asian+pEHG) and (pWHG+pUHG) fails a bit less badly than the one with the divergence between (West_Asian) and (pWHG+pEHG)...
Looking over it again, I really messed up that Graph3 in multiple ways (kind of a complex topology to put together).

Hopefully this works:

Also, same pops with a slightly different split order between ANE and other HG:

Jijnasu said...

While it does appear that male lineages are closely related to linguistic affiliation in patrilineal societies, there is no reason to expect as close a relationship as regards their autosomal make-up. So no reason to assume that early IA and Iranics were genetically identical autosomally though sharing closely related paternal lineages.

Davidski said...


It's a good idea, and you're generally pretty close, but for one reason or another that Z score is still too high.

Arza said...

@ Davidski

So vertices not connected to the Root are allowed? (Matt_graph4.png - pANE)

If so, can you model Kalash using Andronovo + Srubnaya Outlier (or substitute) mixed in?

Davidski said...

So vertices not connected to the Root are allowed? (Matt_graph4.png - pANE)

They're not. Matt made a mistake and I didn't notice it.

That's probably the reason for the bloated Z score in that model.
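A stray vertex like that is easy to miss by eye in a complex topology. As a rough illustration, a breadth-first walk from the root can list any vertices that cannot be reached. The sketch assumes the graph has already been reduced to (parent, child) pairs, with each admixture edge contributing one pair per source; the function name and the tiny example graph are hypothetical.

```python
from collections import defaultdict, deque


def unreachable_vertices(root, edges):
    """Return vertices that cannot be reached from root along (parent, child) edges."""
    children = defaultdict(list)
    vertices = {root}
    for parent, child in edges:
        children[parent].append(child)
        vertices.update((parent, child))

    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(vertices - seen)


# Hypothetical graph in which pANE was never attached to the rest of the tree.
edges = [
    ("R", "pSiberiaHG"),
    ("pSiberiaHG", "West_Eurasian"),
    ("pANE", "MA1"),
]

print(unreachable_vertices("R", edges))
```

In this example both pANE and its descendant MA1 are reported, since neither hangs off the root.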

If so, can you model Kalash using Andronovo + Srubnaya Outlier (or substitute) mixed in?

Sounds way too hard.

MaxT said...

Tajik_Shugnan is 71% steppe!? The Z score looks normal, but are you sure the steppe % in Tajik_Shugnan is not inflated here? That's the highest steppe admix I've seen in a modern population so far.

Davidski said...

Yeah, but that's using Andronovo as the steppe reference. Northern and Eastern Europeans would also score that much steppe ancestry with Andronovo.

Having said that, the Shugnans show 60% steppe with Yamnaya, and I suspect that this may well be correct, and if so, it would make them the most steppe admixed population around.

Davidski said...

I updated the post.

I may have been somewhat presumptuous in my original post in regards to that Yamnaya vs Andronovo split in South Asia, but I still think there's something in it, and if so, it needs a different test to really flesh it out beyond any doubt.

MaxT said...

Can you include western Iranics in your post, Persian for example. What do they prefer, Andronovo?

Davidski said...

Persians basically get the same Z scores with Andronovo and Yamnaya, both under 3. That's like most of the groups that I tested. So I reckon I need a different model to suss this out properly.

Btw, Balochis are West Iranics too.

Nirjhar007 said...


Samuel Andrews said...

This is just a btw to everyone. I think the Lithuanian Narva HGs and the Latvian Narva HGs *might* represent two different populations.

The Lithuanian HGs mtDNA/Y DNA looks like hunter gatherers from Germany-France(U5b1, I2a1), while the Latvian HGs mtDNA/Y DNA looks like the Iron Gate hunter gatherers(U5a1c, U5a2, R1b1a). Plus the Latvian HGs had significantly more EHG(28% compared to 3-8%).

Matt said...

@ Davidski, thanks.

One more try with that second topology, actually connecting the ANE group to "SiberiaHG" this time: (See if this can get any closer).

Gökhan said...

David, will you work on the place of "EEF", "Anatolian Farmer" or "Natufian" genomes on this qpGraph? I'm curious if it will yield the same results as Jones et al., which says CHG and EEF split from one source population into two pieces during the ice age.

Davidski said...


Getting warmer.


Matt's latest tree is a pretty good way to visualize the relationship between CHG and EEF, even though the Z score is still a little too high.

Salden said...

A study on the Parsi is out. Apparently, they're closer to Neolithic Iranians than modern Persians. And also closer to modern Persians than to South Asians.

Davidski said...


There's potentially a big problem with that Parsi study. Its main conclusion is this:

Finally, we show that Parsis are genetically closer to Neolithic Iranians than to modern Iranians, who have witnessed a more recent wave of admixture from the Near East.

This seems like a strange comment, considering that the so called Neolithic Iranians probably no longer existed in Iran already during the Chalcolithic, because Chalcolithic Iranians are different, and the ancestors of the Parsis moved to South Asia well after the Chalcolithic.

So Parsis can't be more similar to Neolithic Iranians than modern Iranians are due to gene flow into Iran after the ancestors of the Parsis moved to South Asia. In all likelihood, the reason they're more similar to Neolithic Iranians is because of admixture from South Asians, who have a lot of Neolithic Iranian ancestry.

At the same time, most Iranians are basically a mixture of Chalcolithic Iranians and steppe peoples. Only some Iranian groups have very recent, post Parsi migration, ancestry from the Middle East.

The data from the Parsi paper is probably available somewhere, so I might blog about it on the weekend.

Davidski said...


I reckon your model works better with MA1 instead of AG3.

Arza said...

@ Davidski

Can you check what happens if you add admixture from an Iranian or Caucasian direction into Barcin?

The reason:

If true, judging by this plot, it should be 8-10% on average.

Chad Rohlfsen said...


I'm not sure if Nick told you too, but if you have an edge with zero, that pop needs a more ancestral position.

Davidski said...

Yeah, I try to avoid zero edges, but sometimes it's not possible in models for all of the test populations. And I've seen some zero edges in literature, like in Figure 5 here.

So I don't think it's a huge problem to have one zero edge in an otherwise solid model.

Davidski said...


Barcin doesn't take any of the Caspian stuff.

But it seems quite receptive to Caucasian stuff, and this edge even lowers the highest Z score very slightly.

Arza said...


6% - close enough.

When it comes to drift=0 I saw recently in some paper a comparison between two models and it was simply stated that "due to 0 drift this part of the second model is topologically identical to the previous one", so it was not a problem for the author, but I cannot find now where I saw it.

Another idea:
pCaucasian -72-> CHG
pCaucasian --2-> North_Caucasian >> pSteppe_EBA

Doesn't this look like "long long time ago there was a pCaucasian population that contributed to pSteppe_EBA and then, after thousands of years of evolution it drifted to CHG"?

Drift needs time, and now it looks like the populations that contributed to the steppe are way older than CHG (2 vs. 72) and EHG (4 vs. 91). Possibly we see the same with WHG and Iberia, but here the difference [in years] is rather small (1 vs. 9).

So maybe pSteppe_EBA should be connected directly to EHG and CHG? It works great e.g. in nMonte, so maybe the algorithm here should not be allowed to create any ghost population?

Z-score will probably sky-rocket and the drift will be shifted to pSteppe_EBA -> Yamnaya, so probably this test will also require deletion of pSteppe_EBA node to see where the additional drift will reappear.

Rob said...

I doubt there is any universal "Proto- steppe"

Arza said...

Ideally North_Caucasian and Steppe_HG should look a little bit younger than CHG and EHG, but there should not be much drift at all on all 4 arrows, because drift = different DNA = different position on PCA and we already know that Yamna is practically EHG/CHG without any major shifts.

EHG - 91 - pEHG - 4 - Steppe_HG : total difference 95

compared to:

pSiberiaHG - 70 - West_Eurasian - 30 - pMainHG - 0 - pVillabruna - 30 - pWHG - 9 - WHG : total difference 139

In other words if I'm reading this graph correctly it says that difference between Steppe_HG-ghost and real EHG is as big as 68% of the difference between pSiberiaHG and WHG. Isn't this too big?
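The arithmetic behind the 68% figure can be checked quickly. This is only a sketch of the sums along the two paths as read off the graph above; the variable names are illustrative, and it says nothing about whether summed drift lengths are the right thing to compare.

```python
# Drift lengths along each path, as read off the graph in the comment above.
steppe_hg_to_ehg = [91, 4]               # EHG - pEHG - Steppe_HG
siberia_to_whg = [70, 30, 0, 30, 9]      # pSiberiaHG ... WHG

total_steppe = sum(steppe_hg_to_ehg)     # 95
total_whg = sum(siberia_to_whg)          # 139
ratio = total_steppe / total_whg

print(total_steppe, total_whg, f"{ratio:.1%}")
```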

Davidski said...


Doesn't this look like "long long time ago there was a pCaucasian population that contributed to pSteppe_EBA and then, after thousands of years of evolution it drifted to CHG"?

These graphs aren't showing chronological relationships. In this particular instance the graph is showing that something like a sister clade to CHG contributed southern ancestry to pSteppe_EBA.

However, I think sister clade in this context might mean either of two things: an actual sister clade, as in a phylogenetic CHG twin population from, say, the Northwest Caucasus, or rather essentially just CHG but with a bit of admixture from somewhere that the graph is missing.

In other words if I'm reading this graph correctly it says that difference between Steppe_HG-ghost and real EHG is as big as 68% of the difference between pSiberiaHG and WHG. Isn't this too big?

I wouldn't take that too literally.

Matt said...

@ Davidski, thanks re:MA-1. May be that AG-3 has slightly different / admixed history and that means ANE is modeled strangely.

Not sure what else would improve the fit of topology. Is there any way to use the zthresh: 2 to see if there are any particularly bad outliers in the topology, or if it's more a question of lots of slightly off stats?

Davidski said...


Does this help?

Btw, AG3 is not of the best quality and offers few markers. So I think that basing ANE on this sample is likely to drag down the whole analysis.

EastPole said...

“While it does appear that male lineages are closely related to linguistic affiliation in patrilineal societies, there is no reason to expect as close a relationship as regards their autosomal make-up. So no reason to assume that early IA and Iranics were genetically identical autosomally though sharing closely related paternal lineages.”

Languages change when people mix. Also autosomal mixing is important. Mothers teach children how to speak.
Early IA and Iranics should be similar. I don’t think a theory that IA came from Yamnaya and Iranics from Andronovo is possible. IA mixed in India and Iranics in Iran with non-IE speakers, and their languages were still as similar as Vedic Sanskrit and Avestan? Hard to believe.
Indo-Iranians probably were one tribe either from Yamnaya or from Andronovo which split into IA and Iranics somewhere in Central Asia.

Matt said...

@ Davidski: Does this help?

Thanks. Err.. might be helpful, I just have to switch my brain on and understand how to interpret them.

I'm assuming that that sheet works something like this:

So if so...

In which case most significantly wrong f4 outgroup stats are:

And the most significant f stats overall:

I might try and look at the outgroup stats first, as I find those easiest to interpret.

Note to anyone who didn't download the above link, doesn't just include outgroup f4 but general f4 (e.g. MA1 Barcin Yamnaya Iberia_Chalcolithic) also other f stats (e.g. I think f3 would be like Onge Yamnaya Onge Barcin and f2 would be like CHG Yamnaya CHG Yamnaya).

qpGraph is using all the f stats to fit the topology.

IIUC, the stat on the top of the print of the graph is the most significant f stat difference between real and model, in this case f4 (Onge, Yamnaya, MA1, Barcin_Neolithic) model: -0.008567, real: 0.002038, difference: 0.010605, standard error? 0.001945, Z: 5.453 (see

(Or in the case of the Kalash model from the post, the most different between real and model would be Ong Yam CHG Yam).
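If the columns are read this way (an assumption about the layout of the outliers sheet, not something confirmed by the qpGraph documentation), the Z score is simply the difference between the real and model values divided by the standard error. Worked through for the f4 (Onge, Yamnaya, MA1, Barcin_Neolithic) statistic quoted above:

```latex
Z = \frac{f_4^{\text{real}} - f_4^{\text{model}}}{\mathrm{SE}}
  = \frac{0.002038 - (-0.008567)}{0.001945}
  = \frac{0.010605}{0.001945}
  \approx 5.45
```

That agrees with the reported Z of 5.453 up to rounding of the printed inputs.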

Matt said...

The problems with that model of HG+Neolithic I made, based on outlying f4 Yoruba outgroup stats above 2, seem to be (if I'm reading them right):

1) All Barcin stats>2 involve Yamnaya or CHG:

a) Barcin is too far from Yamnaya relative to its difference from EHG, Onge, WHG, Iberia_Chal, IranN, Ust Ishim

b) CHG is too far from Barcin relative to its difference from WHG, Iran_Neolithic, EHG, Onge

2) All CHG stats>2 involve being closer to Barcin / Yamnaya relative to other groups than predicted.

3) EHG stat is being closer to Yamnaya relative to Barcin than model predicts.

4) Iberia stats are again being closer to Yamnaya than predicted by model (relative to Onge and EHG).

5) MA1 is closer to Onge relative to Yamnaya and closer to Barcin and Iberia_CA relative to Yamnaya than predicted by model.

6) Similar patterns with Onge being closer to MA1, Iran_N and Yamnaya relative to Iberia_CA, Barcin, WHG, than model would predict.

On balance it looks like the way to fit the f4 outgroup statistics better might be to add in some kind of interactions between Onge / ENA and ANE and also some kind of interactions between the West Anatolian complex (represented by pAnatolian) and the Caucasus complex (represented by pCaucasian).

Updated graph file trying to include that:

(I *really* don't love how complex this graph is. As Nick said "If you add enough admixture edges you can always get a good fit". But perhaps it may be unavoidable at the limits of qpGraph if you are dealing with historically and geographically closer populations, and rare, infrequent admixture breaks down and you get continuous admixture?).

Matt said...

Looking at my last post, seems like the strongest stat in the outliers set for that graph encompasses all those trends:

D (Ong, Yam, MA1, Bar) : model -0.008567, real 0.002038, difference 0.010605, st error? 0.001945, Z: 5.453

That is:

1) model predicts a negative stat: more relatedness between pair Yamnaya+MA1 and Onge+Barcin relative to less relatedness between pairs Onge+MA1 and Yamnaya+Barcin?

2) real stat is positive, so opposite: less relatedness between pair Yamnaya+MA1 and Onge+Barcin relative to more relatedness between pairs Onge+MA1 and Yamnaya+Barcin?

(Or it's all the opposite, or means something quite different, if I've misunderstood the outliers sheet... hopefully not).
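For anyone wanting to sanity-check the sign convention, here is a toy Python sketch using the standard definition of f4(A, B; C, D) as the average over SNPs of (a - b)(c - d), where the lowercase letters are allele frequencies. The frequencies below are made up purely for illustration: A and C are built to share drift, as are B and D. Under this definition a positive value means A groups with C and B with D, which matches reading 2) above for f4 (Onge, Yamnaya; MA1, Barcin).

```python
def f4(a, b, c, d):
    """Average of (a - b) * (c - d) over SNPs; inputs are allele-frequency lists."""
    products = [(pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)]
    return sum(products) / len(products)


# Toy allele frequencies (made up): A resembles C, and B resembles D.
A = [0.9, 0.1, 0.8, 0.2]
C = [0.8, 0.2, 0.9, 0.1]
B = [0.5, 0.5, 0.4, 0.6]
D = [0.4, 0.6, 0.5, 0.5]

print(f4(A, B, C, D) > 0)   # A pairs with C, B pairs with D
print(f4(A, B, D, C) < 0)   # swapping the last two populations flips the sign
```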

Balaji said...

Davidski, thanks for this interesting work. The limits of this kind of modeling are shown by the fact that in this model the Kalash have 10% ASI and in the model in your previous post they had 25% ASI. A comparable difference in estimates of Sub-saharan African ancestry in Egyptians was also found in the recent Egyptian paper.

I do think there is something in your finding that Iranics are better modeled with Steppe_EMBA and Indics with Steppe_EBA. This is because of the presence of EEF ancestry in Iranics and its absence in Indics. Seinundzeit and Kurti have made the same observation.

However, as EastPole pointed out the close linguistic relationship between Indian and Iranian languages calls for a close genetic relationship as well. In an Out-of-India scenario, the Iranians would originally have lacked EEF and then acquired it from the Andronovo, Persians, Greeks, Arabs and others.

When the Swat Valley aDNA results are published, you can repeat these modeling efforts. You will probably find that you can model Steppe_EBA very well as Swat Valley + EHG.

Davidski said...


The limits of this kind of modeling are shown by the fact that in this model the Kalash have 10% ASI and in the model in your previous post they had 25% ASI.

That's probably because the ANI/ASI model assumes wrongly that ANI is a single stream of ancestry, and forces it to look more southern, and thus basal, than it really is. As a result, ASI is inflated, because the non-basal ancestry that the Kalash have still has to go somewhere.

That's not to say my Kalash/Yamnaya topology is perfect, but it definitely reflects the reality that the Kalash derive almost 50% of their ancestry from ancient populations living between the Black and Caspian Seas in Eastern Europe. There's no way around this fact now, even without ancient DNA from South Central Asia.

In an Out-of-India scenario, the Iranians would originally have lacked EEF and then acquired it from the Andronovo, Persians, Greeks, Arabs and others.

Out of India is nonsense for both the Proto-Indo-Europeans and Indo-Iranians. You'll have to accept this fact at some point.

Davidski said...


Samuel Andrews said...

New Post at my blog: Three new U5b subclades in Eastern Europe

Matt said...

@ Davidski, thanks. That seems to have lowered the most extreme Z by 1. But judging from the most extreme stat, the real sharing between Yamnaya and Barcin is still higher than in the model (f stat (Yamnaya Barcin Yamnaya Barcin) is stronger in the model than reality)...

Another model: Shifts admixture between Caucasian and Anatolian downstream to between pAnatolian and a pNorth_Caucasian.

If poss., and if you have time to run this one, could you run the outliers file again? I'd like to check whether any of the outgroup f4 stats problems are strong or whether it is other f stats.

Matt said...

Also if quick to run, this model, as there seems to be a lot of excessive complexity around the divergences with what I've designated "Main_HG".

Davidski said...


Graph 6...

I made a mistake running graph 7. Gotta do it again.

Davidski said...


Graph7: apparently worse than graph 6.

Matt said...

@ Davidski, thanks.

For Graph6, the strongest are:

f4 (CHG, Ibe, Yam, Bar) model= 0.023125, real = 0.026996, difference = 0.003871, standard error? 0.00124, Z = 3.122

f4 (CHG, Bar, Yam, Ibe) model = 0.026248, real = 0.030638, difference = 0.004389, standard error? 0.001388, Z = 3.163

Effectively, the pair Barcin-Iberia_CA and CHG-Yamnaya are less close within pair compared to out of pair in reality compared to model.

Quickest way to fix that is add extra admixture edge between the NorthCaucasian and Barcin:

This may get all the stats under 3, but if so still a very complicated model... so ideally either you would need a simpler model, or the populations are just too close and continuously admixing to easily fit.

Still, I guess this model does give some rough levels of Basal Eurasian (which was kind of what I was looking at it for):

Compare to the earlier model with MA-1:

Both fairly consistent; roughly, Basal Eurasian: Yamnaya 18-19, Iberia_CA 28-29, Barcin_N 36-38, CHG 38-44, Iran_N 57-64.

Davidski said...


Didn't work.

In regards to the high levels of Basal in Caspian/Iran_Neolithic, maybe there's a way to check if those aren't being inflated by some sort of basal East Eurasian admix?

Matt said...

@Davidski: Cheers. Outlier scores are strange stuff; I can't see anything wrong in the graph file compared to the last one that would make it go to those kinds of extreme Z scores. Maybe it's just a bug in qpGraph, or that's the nature of qpGraph's algorithm with those kinds of repeated flows. Did it generate a dotfile that you could screencap, just in case I somehow messed up the topology?

Re: adding an edge from basal East Eurasian to pCaspian or pWest_Asian, a couple of models like that:

pOnge (basal East Eurasian) to HG-like ancestry of Caspian-Caucasian:

pOnge (basal East Eurasian) to Caspian:

I don't think it would improve the f stats here, as none of the >2 outliers seem to involve relationships between IranN and Onge / IranN and Ust Ishim. Doesn't seem necessary to improve fit with this model and adds complexity, and a model constraint to the pOnge admixture into ANE. But you could give it a shot.

Davidski said...


From the Z ~84 run...

Gill said...

I'm interested in what the inferred Caspian population could represent and how soon we can get ancient DNA from the region.

Davidski said...

In my qpGraph models, the inferred Caspian pop is closely related to Iran_Neolithic and Iran_Chalcolithic. It's essentially a twin of either one or the other, depending on the model.

Davidski said...


Matt said...

@ Davidski, ENA to Caspian actually does give some small improvement to fit. Not sure the small improvement is worth the more complex model, but then it might be more important in future if we were adding East Eurasian ancient DNA to the graph...

ENA to West Asian graph was a mistake.

ENA to West Asian (properly):

btw, on these graphs, might actually make sense to place Ust Ishim after the Late_Main_Eurasian split, either on the West or East Eurasian branch, as there's a zero drift length between Early and Late Main Eurasian. That might decrease things a little further below 3.

But I'll keep the topology steady for the minute.

Davidski said...


Matt said...

@ Davidski, thanks, so the edge from pOnge into pCaspian improves fit, but pOnge to pWest_Asian gives a worse fit and pWest_Asian doesn't take any pOnge.

A couple of graphs that modify the best working model I've got so far (with pOnge to pCaspian) by changing the Ust-Ishim position, to have either:

Ust Ishim weakly on West Eurasia branch:
Ust Ishim weakly on East Eurasia branch:

Davidski said...


Seinundzeit said...

Pretty cool to see ANE as a North Eurasian-ENA hybrid, and to see the non-Basal portion of Iran_Neo as a combination of some distant relative of WHG + ANE + minor additional ENA.

Although not accounted for in these topologies, there is the whole notion of some ENA ancestry for WHG (as per Reich and Lipson).

So, basically, there might be ENA admixture in most ancient West Eurasians.

It'll be very interesting to see if these patterns of East Eurasian-West Eurasian gene flow pan out, once we see some East Eurasian aDNA.

Shaikorth said...


Reich & Lipson included Ami in those models, so I wonder what happens to the scores if these are added to Matt's Ust_Ishim on W-Eurasian branch and E-Eurasian branch models respectively:

label Ami Ami
edge Ami pOnge Ami


label Ami Ami
edge pAmi East_Eurasian pAmi
edge Ami pAmi Ami

Matt said...

@ Sein and Shaikorth: Yes. Also of course adding Kostenki14 and Natufians/Levant_N to the graph could also smoke out a stronger signal of sharing between the ENA edge and some other populations.

Re: Ami specifically I suspect that you could try those models just with those changes you suggest Shaikorth, but the edge into the Caspian and ANE populations is *very* Basal ENA, and the protoOnge+protoAmi population might be downstream.

So you might get a better bet with splitting East Eurasian into pANE-ENA, then pOngeAmi which splits to pOnge and pAmi, and having the edge to ANE and Caspian from pANE-ENA.

Also some problems I still have with this model are:

1) the drift lengths to Siberia_HG, South_HG and UHG look really extreme. I wonder whether those could be reduced by adding more structure within the Basal Eurasian and East Eurasian sides.

At the moment there's a bit of balancing, where there tend to be very derived and drifted edges on the West Eurasian side of the tree, where we've got lots of structure, and then much less drifted admixture from Basal ENA and Basal Eurasian that counteracts this.

With a graph where more structure was allowed on the ENA and Basal Eurasian sides, with separate ENA and Basal Eurasian sources contributing to later populations, the drift lengths on the West Eurasian side might be less extreme. But of course this would be less parsimonious, in terms of adding more and more ghosts.

2) There are still some zero length drift edges in the best fitting graph that would ideally not be present (e.g. leading up to West_Eurasian, leading up to MainHG, and almost no drift between pWHG and descendants)

Shaikorth said...

The idea with putting Ami there sans any changes to admixtures was to check if the resulting changes to Z-scores would be informative about its real optimal position.

Reich & Lipson model got the East Eurasian in ANE and WHG from pAmi (post Onge and Papuan split), but there might be other options.

Arza said...

Yoruba, Ust_Ishim, CHG, AG2, MA1, WHG, EHG, Srubnaya Outlier - Z-score: -1.685

+SHG - Z-score: -2.271

+Steppe_EMBA modelled in the same way as HG - Z-score: 2.888

Ust_Ishim related ancestry in MA1 and AG2.
A lot of CHG in EHG already (J in Karelia!).
HG-cline doesn't point to ANE, but to an ANE+CHG mix, or rather various mixes, as proto-SHG (Ertebølle pottery?) was different than proto-EHG.
CHG-like population probably lived at least in the Southern Ural and was assimilated by the incoming ANE wave.
Then they hit the WHG wall.
Possible more northern origin of CWC, BB, Srubnaya, Sintashta etc. as well as PIE language (resolved mismatch with Yamnayan Y-haplos and resolved lack of expected clones of Latvia_LN1 in Ukraine).

Possible improvements:
Basal in CHG.
Maybe also Ust_Ishim related ancestry in other pops (CHG also).
Modelling Yamnaya in a traditional way (EHG related + CHG).
Separate xCHG ghost for Yamnaya (the supposed Uralian CHG-like population was probably different than CHG proper).
Merging pMA1 and pAG2.

Graph file:
Data set:
Merged ancients and HO from Lazaridis 2016 (Srubnaya outlier pulled out of Steppe_EMBA).

Davidski said...

A lot of CHG in EHG already (J in Karelia!).


No Basal in EHG, so no CHG.

Arza said...

Visual representation of the model (Global10):

4mix (also Global10):

Population,AfontovaGora3:I9050.damage,MA1:MA1,Kotias:KK1,Loschbour:Loschbour,D statistic

Population,AfontovaGora3:I9050.damage,Ust_Ishim,Kotias:KK1,Loschbour:Loschbour,D statistic

Population,AfontovaGora3:I9050.damage,Ust_Ishim,X,X,D statistic

Arza said...

Maybe... it's just a wild guess... but why it's working?

Davidski said...

How many zthresh: 2 outlier scores does that model have? I bet it's a lot.

Arza said...

Arza said...
but why it's working?

It works because proto-CHG moved north before the wave of Basilisks diluted the ones that stayed in the Caucasus:


Matt said...

@ Shaikorth, OK.

Modified the graph file to add Ami for

Ust Ishim on East Eurasian side:
Ust Ishim on West Eurasian side:

Should be topologically what you're looking for (I have added a couple of extra specifically proto-Ami and proto-Onge nodes but that is just for future proofing and should not affect outcomes compared to the topology you set out).

Davidski said...


Your first model has >2 edges at East_Eurasian.
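For anyone editing graph files by hand, a rough sketch of a pre-flight check for this kind of problem. It assumes the graph file uses lines of the form `edge <name> <parent> <child>` (as in Shaikorth's snippet) and `admix <child> <parent1> <parent2>`, and that the complaint refers to a node with more than two outgoing edges. The helper names are hypothetical, and the first three edges in the example are invented for illustration rather than taken from Matt's actual file.

```python
from collections import defaultdict


def outgoing_edges(graph_text):
    """Map each parent node to its children, based on edge and admix lines."""
    children = defaultdict(list)
    for line in graph_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank lines, root lines, label lines, etc.
        if fields[0] == "edge":
            # Assumed layout: edge <name> <parent> <child>
            parent, child = fields[2], fields[3]
            children[parent].append(child)
        elif fields[0] == "admix":
            # Assumed layout: admix <child> <parent1> <parent2> [weights]
            child, parent1, parent2 = fields[1], fields[2], fields[3]
            children[parent1].append(child)
            children[parent2].append(child)
    return children


def overfull_nodes(graph_text, max_children=2):
    """Return the nodes that have more than max_children outgoing edges."""
    return {
        node: kids
        for node, kids in outgoing_edges(graph_text).items()
        if len(kids) > max_children
    }


# Hypothetical fragment: e1 to e3 are invented; the last three lines follow
# the Ami additions suggested earlier in the thread.
example_graph = """
edge e1 Main_Eurasian East_Eurasian
edge e2 East_Eurasian pOnge
edge e3 East_Eurasian pOther
edge pAmi East_Eurasian pAmi
edge Ami pAmi Ami
label Ami Ami
"""

for node, kids in overfull_nodes(example_graph).items():
    print(f"{node} has {len(kids)} outgoing edges: {', '.join(kids)}")
```

In this made-up fragment, hanging pAmi directly off East_Eurasian gives that node a third child, which is the sort of thing that inserting an intermediate node would avoid.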

Here's your second model...

Matt said...

@ Davidski: Ah, oops. This should be what I was trying to do :

@ Shaikorth: On the Ust Ishim West Eurasian branch model, the worst fitting Z scores here relate to greater sharing between Ami and EHG / ANE descendants.

So modification of the Ust Ishim West Eurasian branch model to change edge from proto-Ami into SiberiaHG2 and into proto-Caspian:

Davidski said...


Matt said...

Thanks Davidski.

@ Shaikorth, do these graphs give you any more ideas about the questions you were interested in?

For visualising side by side:


Strongest problem Z stats in 11 seem to relate to MA-1 being relatively closer to Onge and EHG being relatively closer to Ami, a la what we've observed previously in Lazaridis's stats: I guess you could add more edges to try and resolve that.

One thing I notice with clarity is that the ENA edge into ANE goes down as the amount of shared drift with East Eurasian populations increases.

Graph with just pOnge and 12 drift from Main_Eurasian->pOnge: 44% into ANE (cascades to 16% in Steppe_EMBA)

Graph with pOnge and pAmi, then 22 drift from Main_Eurasian->East_Eurasian: 34% into ANE (cascades to 10% in Steppe_EMBA)

Graph with 22+32 drift leading up to pAmi and pAmi edge into ANE: 17% into ANE (cascades to 5% in Steppe_EMBA).

Shaikorth said...

Ami's extra drift with EHG was likely something that Reich & Lipson solved with Ami-related admix in both WHG and ANE. That being the case, the next step could be a single edge from pAmi into pVillabruna or two edges into pEHG and pWHG respectively. Ami to WHG-ANE has a stronger signal than the opposite but two-way geneflow might be tested with pSiberiaHG into Ami.

Matt said...

@ Shaikorth, cheers. Seems like if you have pAmi -> pVillabruna to boost EHG -> Ami, then the model would need to give more pAmi -> pVillabruna than pAmi -> SiberiaHG2? That would have the problem that it would set WHG to be closer to pAmi than EHG is?

Model I would try is a separate pANE and pMA1 split, and have pAmi->pANE and pOnge -> pMA1:

(also if possible Davidski, could you run an alternate version where I have added extra structure in Basal Eurasian to see if that limits some of the extreme drift in the West Eurasian groups:

please just let me know if this is too many qpGraphs I am asking you to run).

This should be the graph for the pAmi -> pVillabruna model though anyway:

Shaikorth said...

If EHG has Ami relation from Villabruna and ANE getting Ami, it will be closer to pAmi than WHG as long as ANE has more Ami than Villabruna.

I alternatively suggested separate Ami mixture events into EHG's WHG part and the node of full WHG's (like the pWHG and pEHG nodes in 11), should do the trick.

Davidski said...


Matt said...

@ Shaikorth, looks like your pAmi edge to WHG does add some improvement, but very small (-0.050 to Z):

My pOnge edge to pMA-1 model was a total washout.

Chad Rohlfsen said...

I think I know what's going on here. I'll try to get something up here this evening.

Shaikorth said...

It probably would be better to have no Ami to pVillabruna and instead separate streams to pEHG, pWHG and UHG. That way none of them acts as a constraint (now pEHG's Ami limits pVillabruna's Ami)

Matt said...

@Shaikorth, that sounds pretty plausible, but it adds more complexity to the model; I also don't know if that would improve the model's worst Z, which relates to patterns of relatedness between CHG, MA1, Yamnaya and Iberia_Chalcolithic. I don't have the time to mod the file atm, but if you want to have a go, I'm sure Davidski would be happy to run it off for you.

@Chad, would be interested to see what you've got, especially if you've got any ideas that could radically simplify this very complex model...

Daniel Fernandes said...

What files are really necessary to run qpGraph? Can someone show me an example of a working parameter file? I was trying to replicate a graph with Anatolia_Neolithic, LBK, Iberia_EN, and WHG, but qpGraph shows nonsense admixture results (even if I input approximate admixture proportions in the graph file). And on that, where do you prefer to get the admixture proportions for it? Are ADMIXTURE and qpAdm similarly good? Thanks!

Davidski said...

Text in my parameter file...

data: data
indivname: /home/davidski/data.ind
snpname: /home/davidski/data.snp
genotypename: /home/davidski/data.geno
outpop: Yoruba
blgsize: 0.05
forcezmode: YES
lsqmode: YES
diag: .0001
bigiter: 6
hires: YES
inbreed: YES
zthresh: 2