Wednesday, September 2, 2015

A multidimensional approach

This is arguably the most interesting Principal Component Analysis (PCA) I've run to date. Note that overall the ancient steppe genomes appear to be the crucial link in the Indo-European chain, basically bridging the gap between European and South Asian Indo-European-speakers. A plot with all of the samples labeled individually can be downloaded here. If you have any questions about the methodology, just ask in the comments.

In this analysis I used samples from the Allentoft et al., Haak et al. and Lazaridis et al. datasets, all of which are publicly available. The latter two are found at the Reich Lab site here.

See also...

No significant genetic substructures within (eastern) Yamnaya


Maju said...

Looks interesting but I'm a bit lost without pointers admittedly.

The most intriguing PC dimension for me is PC3, because it shows a very strong Paleo- vs Neo-European polarity and, perplexingly, it is some West Asians (not sure which ones) who tend most strongly (by far) towards the Paleo-European pole here (they are, however, separated from European HGs by PC1, which is clearly a Europe vs West Asia polarity).

Re. Indo-European "blood" tendencies, it is interesting that there seem to be none, or very mild ones at best, with IEs and non-IEs spreading very similarly through the graphs, regardless of dimension. Pakistan is no exception (it may appear to be one on a shallow reading), as it's clear that Balochi and Brahui fully overlap too, and the differences are with other populations and not between directly comparable IEs vs non-IEs.

Matt said...

The size of PC2 as a % of the total variance becomes much larger when the S-C Asians are included, and is almost the same size as PC1. Part of this is an overall rotation - part of the Bedouin-WHG PC1 dimension from the European PCA has been subsumed into the PC2 dimension here, placing Bedouin west and WHG slightly east and moving everyone else accordingly as well. There are also changes that are more than simple rotation, though.

PC3's a strange one. It's the one you always tend to get, where there is Mediterranean-Caucasus-Early Farmer unity to the exclusion of the Bedouin and Hunter Gatherers, who cluster together on it.

So some shared element of genetic closeness all the Med-Caucasus (and to a lesser extent others) share that the Bedouin and HG do not have, or a shared element of difference to BedouinB and HG.

Then PC4 is remaining distinctions of the Bedouin from one another and from others. Possibly relates to genetic drift within the Middle East (possibly early in history, or more recently).

Btw, no Volga Urals? One of the things I found quite interesting about them in the IBS stats you posted before was the Yamnaya affinity was quite low in the Mansi (comparable to Tadjiks and Turkmen) while the EHG affinity was quite high, comparable to NW Europeans. With a similar but lesser phenomenon in Mari and Chuvash. I took that as an indication that some of the sharing between Volga-Ural and Yamnaya in position, etc. was mediated by shared EHG compared to direct ancestry with Yamnaya (esp. net of ancestry from East Eurasia). OTOH the Russian Erzya sample is not really an outlier in the same way. This was opposed to mildly the opposite pattern in Georgian / Abkhasian.

Assume with the steppe samples, one is the Sintashta, then the other three are Afanasievo, Haak Yamnaya and Allentoft Yamnaya?

Matt said...

For a graphical on the IBS EHG vs Yamnaya thing -

bellbeakerblogger said...

Judging from the lack of the typical 200 comments at this point, I'd guess everyone is having difficulty with interpretation.

Here's my wag. I think what we are seeing with these charts is that Europeans are largely an amalgam of two distinct peoples, not so much three (one associated with the early Near East and a West Central Asian population that encroached upon it). This formula is stretched at the margins by the less populous local hunters who were less successful but whose distinctiveness gives a deceiving differentiation to an otherwise similar group over a large area.

Basically, Europeans are sandwiched between farmers and West Asia. Everyone else is an outlier.

Krefter said...

"Judging from the lack of the typical 200 comments at this point, I'd guess everyone is having difficulty with interpretation."

I just don't have time.

Davidski said...

I would rather say that basically Europeans are sandwiched between Neolithic farmers and the Bronze Age Eastern European steppe.

But some southeastern European populations do have inflated levels of post-Neolithic admixture from West Asia.

Kurti said...

Almost a year ago, I opened a thread about the unnatural genetic gap between Europe, West and South_Central Asia and said that the Indo Europeans and Northeast Iranic speakers in the Steppes would be the missing link.

Take a look if interested.

Kurti said...

However, it seems I was wrong with one statement in that thread, namely that Semitic speakers did not cause any major gap. In fact they did in the Levant, Mesopotamia and Anatolia, as far as we know now.

Rob said...


What is your point? It's not clear.

Kurti said...

@CroMagnon. Read the thread I provided. What is my point? Have you actually read the thread? It says that Indo Europeans are the missing link between Europe and South Asia. A year ago I had this theory that the Indo Europeans (Iranic speakers) were the missing link between those. Back then some experts didn't take my theory seriously, and now it looks like I was for the most part right. There once wasn't this big genetic gap between Europe, South_Central Asia and West Asia. We had EEF farmers connecting Anatolia with the Balkans and mainland Europe. We had the Indo Europeans connecting West Asia and South_Central Asia with East Europe. That's the point.

Hope it's clear now.

Davidski said...

It's likely that the gap between Europe, including the European steppe, and South Central Asia was much bigger than it is now.

During the Indo-European expansions there was a major pulse of admixture from the European steppe to South Central Asia. So if we test some South Central Asian samples from just after this period they might actually form an unbroken cline running from Eastern Europe to South Asia.

However, since then there's been a lot of mixing in South Asia, which has leveled out the steppe admixture and again created a gap between Europe and South Central Asia.

Seinundzeit said...

Interesting, these steppe samples do seem to bridge South Central Asia and Europe. In some dimensions, they cluster with South Central Asians.

Unknown said...

Certainly the steppes linked vast areas of Eurasia. They were definitely great vectors for spreading languages which might have originated outside the steppe.

I'm curious what Bronze Age genomes from Central Asia and Anatolia will look like once we get them, and if they "complete the circle".

Anonymous said...

I do not see this bridge davidski sees. This PCA looks like Yunusbayev update only. Still it is useful.


'Looks interesting but I'm a bit lost without pointers admittedly.'

Each picture is a different view of the multivariate clustering (human gene diversity can be thought of as 3D pattern). PCAs like this are 2D, however. To try and capture some of this, multidimensional views are used.

Most informative view is always PC1-2. PC3+4 here are interpreted with caution. The numbers you see in diagonal square tell you how much total variation is explained.
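
As a rough illustration of what those percentages mean: each principal component has an eigenvalue, and its share of the total variation is that eigenvalue divided by the sum of all eigenvalues. The sketch below assumes made-up eigenvalues; they are not taken from the plots in this post, and the function name is hypothetical.

```python
def explained_variance_share(eigenvalues):
    """Return each component's share of total variance (fractions summing to 1).

    The list should hold the eigenvalues of ALL components; passing only
    the top few would overstate every share.
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [ev / total for ev in eigenvalues]


# Hypothetical eigenvalues for a four-component toy example.
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
shares = explained_variance_share(example_eigenvalues)

for index, share in enumerate(shares, start=1):
    print(f"PC{index}: {share:.1%}")
```

When two components have nearly equal shares, as Matt notes for PC1 and PC2 here, the split between them is less stable and a rotation between the two axes becomes more likely.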

It is clear to see that Europe and West Asia without SSA mix more or less form one 'clade' not shared by South Central Asia. Central Asians are more Euro-shifted (but not more West Asia-shifted) than South Central Asians.

This makes perfect sense and nobody should expect otherwise. North Caucasus has thousands years connection to steppes. South Central Asia influence from steppes is very recent.




On first glance yes. Two reasons why this is false:
- Steppe and South Central Asia link only in less informative dimension views (PC2-3, PC3-4, but not the most informative one, PC1-2)
- The steppe samples linking Europe to South Central Asia are from Yamna, not Arkaim-Sintashta (davidski can check this). Yamna has nothing to do with this part of the world. Ark-Sin is the source of all steppe ancestry there. If davidski runs a multidimensional view showing just Ark-Sin, you will see no 'bridge'. Ark-Sin is North European.

John Thomas said...

Sterling work, David, in researching genetic links between the Corded Ware Culture, Sintashta, modern day south central Asian populations and extant eastern European populations.
Just one question, David. Will it ever be possible to unearth genetic links between Corded Ware and the extant Indo-European speaking populations of western Europe, particularly north western Europe?
The stubborn r1a/r1b dichotomy seems to be a major stumbling block against any hard theorizing in this particular problem.
Though it would be satisfying to pinpoint Corded Ware as the 'daddy of them all' at least as the IE expansion is concerned.

Karl_K said...

"human gene diversity can be thought of as 3D pattern"

Actually, it has many more dimensions than three. These plots can be made with any clustered combination of the markers used. The number of possible variations is staggering. It is actually amazing that programs can pull out meaningful information so easily.

But, this does not mean that the reason for the clusters is straightforward. Different historical events involving very similar populations could produce the same results. So the actual history requires more information to be understood.

Anonymous said...


From the technical side of things you are right. That sentence was meant to give Maju a means of mentally visualizing this sort of multidimensional analysis.

Indeed much caution is needed when viewing such outputs, especially in low gene variance explaining dimensions. All sorts of false inference can be made from them.

Unknown said...

Ideally, and I'm sure it'll happen one day, we can create these plots period by period (i.e. Palaeolithic, Mesolithic ... pre-modern and modern). Then we'll be on a pretty darn solid footing.

Davidski said...


Here are the plots with the Chuvash and Mansi.


I can already see strong connections between Corded Ware and Northwest Europeans, like high levels of shared drift and lots of shared mtDNA lineages. Also, Northwest European R1a-L664 and R1a-Z284 are definitely from Corded Ware, because they've been found in Corded Ware remains.

It's hard to say where R1b-L51 comes from exactly. Maybe from Corded Ware or some related group? But the fact that it reaches high frequencies in Northwestern Europe today doesn't mean its expansion somehow wiped out Corded Ware ancestry there.


These plots seem to be particularly informative for the Bronze Age steppe groups. Note that in eigenvectors 1 and 2 Afanasievo and both Yamnaya groups can be modeled as around 50/50 Eastern_HG and an as yet unsampled population from the North Caucasus. I think this will prove to be correct.
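
To make the "around 50/50" reading concrete, here is a minimal sketch of how a position between two reference groups on a PC1-PC2 plot can be turned into a rough mixture fraction: project the target onto the straight line joining the two sources. All coordinates below are hypothetical (the North Caucasus source is unsampled, so its position can only be assumed), the function name is made up, and PCA positions are only approximately linear in ancestry, so this is an eyeballing aid rather than a formal admixture test.

```python
def mixture_fraction(target, source_a, source_b):
    """Project target onto the line from source_a to source_b.

    Returns the fraction attributed to source_b: 0.0 means the target
    sits at source_a, 1.0 means it sits at source_b. Values outside
    [0, 1] mean the target falls beyond one of the two sources.
    """
    ax, ay = source_a
    bx, by = source_b
    tx, ty = target
    dx, dy = bx - ax, by - ay
    squared_length = dx * dx + dy * dy
    if squared_length == 0:
        raise ValueError("the two sources must have different coordinates")
    return ((tx - ax) * dx + (ty - ay) * dy) / squared_length


# Hypothetical (PC1, PC2) centroids, for illustration only.
eastern_hg = (0.0, 0.0)
assumed_caucasus_source = (2.0, 2.0)
steppe_group = (1.0, 1.0)

fraction = mixture_fraction(steppe_group, eastern_hg, assumed_caucasus_source)
print(f"Share from the assumed Caucasus-like source: {fraction:.0%}")
```

A target that lies well off the line between the two sources is a hint that a third source, or a different pair of sources, is needed.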

Also, Andronovo clusters with Corded Ware, which is correct.

Other dimensions are indeed much more difficult to interpret, but on the balance of things, I don't think my comment was out of line. If the Tajiks and South Central Asians didn't have that ANE-like something and ASI, they'd probably be pushing up into that empty space between the North Caucasians, Andronovo and Yamnaya.

I can't run Sintashta on these plots because there aren't enough markers, and missing markers tend to have a big effect on the higher dimensions. But the Sintashta set we have isn't much different from Andronovo, just a little more western.

Seinundzeit said...


For whatever it's worth, PC2-3 and PC3-4 are still important, one can't just ignore them. The fact that Yamnaya + Afanasievo cluster with South Central Asians in those dimensions is certainly of some significance.

Regardless, PC1-2 are showing an interesting pattern. Basically, South Central Asians differ from West Asians in the same way that Yamnaya/Afanasievo + EHG differ from Europeans. I guess one could say that modern Europeans are close to constituting a cline from Neolithic Europeans to EHG, while the South Central Asians are behaving in a very similar manner (but in relation to modern southwestern Asia, rather than Neolithic Europe). In the latter case, they are at the endpoint of a cline that starts in southwestern Asia and extends to something related to EHG/ANE. Whatever this "something related to EHG/ANE" is, it is best represented today among South Central Asians.

I think if one accounts for the 60%-70% EBA European/BA Eurasian steppe ancestry in the Hindu Kush/Pamirs, the remaining 30%-40% is going to be very rich in something closely related to EHG/ANE, which makes for a lot of complexity.

Unknown said...

This seals the deal for me!
Cardial Neolithic and LBK must derive from the same source:


Are the raw data publicly available yet?

Anonymous said...


I was not criticizing anything you wrote, I just can't see any sort of 'bridge' from Europe to South Central Asia. All you typed is correct. The idea about 'rebound' South Asian DNA in South Central Asia over time makes sense.


I explained to Maju why basing anything off PC3-4 is super problematic (low gene variance speaks for itself). It may or may not show us hidden ancestral streams. Do yourself a favor and reread that. The rest of what you wrote also reads like strange conjecture. Are we viewing the same PCAs? LoL

Here's a more logical take based on PC1-2: Corded Ware and Ark-Sin are the ones sitting outside the European group. They are basically North Europeans like us. Yamna and Afanasevo drift more east thanks to other mixtures. Brahuis and Balochis are basically one group and pull evenly to Ark-Sin next to the Iranians. Central Asian Tajiks sit above them, suggesting a stronger steppe pull. North Caucasus is pulled even more that way. This makes sense.

Saying South Central Asians (like Brahuis?) are a bit less than "60%-70% EBA European/BA Eurasian steppe ancestry" (remember they sit right beneath Central Asian Tajik) is anti-intellectual and not holistic.

Davidski said...

Yes, it's here....

Sequence Read Archive (SRA) at the accession code SRP057056

I suppose Felix will do it soon. Or I might have a go tomorrow.

Karl_K said...

"This seals the deal for me!
Cardial Neolithic and LBK must derive from the same source:"

Not that there was any doubt!

PF said...

"I would rather say that basically Europeans are sandwiched between Neolithic farmers and the Bronze Age Eastern European steppe."

The questions remain: where, when, and from whom did the Neolithic farmers get their WHG, and the same for Bronze Age steppe people regarding "teal" and EHG. The devil's in the details, and lots remains to be figured out.

Seinundzeit said...


Do yourself a favor by taking prior information and other genetic analyses into consideration. After doing so, you'll benefit from rereading what I wrote. To help you out, let's think about this together for just a second or so.

The BA steppe samples are shifted towards the "east" for a reason. It's called EHG ancestry! There is a cline from Neolithic Europe to the EHG sample (we are ignoring the shift towards the WHG/SHG samples right now). South Central Asians are also shifted towards the east, just like the BA steppe samples. But instead of being shifted eastward in relation to Neolithic Europe, they are shifted eastward in relation to southwestern Asia. That is pretty obvious, and we already know why (David's previous analyses have shown a robust ANE affinity in the region, even when accounting for the massive amounts of BA steppe ancestry that exist in groups like Pamiri Tajiks, Kalash, Pashtuns, etc). Besides the excess ANE/EHG affinity in South Central Asia, there is also ENA admixture.

Also, no one is saying that the lower order dimensions are somehow determinative in relation to PC1-2. Rather, it's simply the case that the lower order dimensions still convey important information. With that in mind, the fact remains that the Yamnaya-Afanasievo samples cluster with South Central Asians in PC2-3 and PC3-4. They are distinctly "South Asian" in ways that modern Europeans aren't, and that probably involves both shared "North Eurasian" ancestry and direct genetic ties via unsampled southern Andronovo.

Davidski said...

OK, settle down with the insults and stuff.

It's not necessary. This will be settled sooner rather than later with ancient DNA.

Chad said...

When breaking up Yamnaya into EHG and that Caucasus like stuff, I get a clearer picture of Yamnaya, Corded, and Beaker. Yamnaya has the most EHG, and the most of the Caucasus. Corded and Beaker are about identical in Caucasus, but Corded has more EHG. So, relatively speaking, both show the same relation to Yamnaya if we base it on those two components, but Corded appears closer on PCA, due to more EHG. This basically makes both Corded and Beaker look about 50% or so Yamnaya-like, but obviously they have different stories on how they got that way and it doesn't appear the two mixed with each other until later on.

Dmytro said...

"When breaking up Yamnaya into EHG and that Caucasus like stuff, I get a clearer picture of Yamnaya, Corded, and Beaker. Yamnaya has the most EHG, and the most of the Caucasus. Corded and Beaker are about identical in Caucasus, but Corded has more EHG" (Chad Rohlfsen)

RISE 552 seems well integrated autosomally with the other Haak Yamnas. But his YDNA is WHG is it not? How could one distinguish his autosomal WHG from the EHG he shares with his R1b buddies?

Chad said...

Rise 552 comes out about 11% less EHG than Haak I0429. It is a bit noisier, with about 2% showing up in other components that don't make sense. They are very low coverage compared to Haak, so I'm not sure. A couple of the Haak and Rise samples show a couple percent in WHG and EEF, so it does show up a bit. I'm sure it would be much higher west of the Don. We will have those samples, eventually.

Rob said...


Re. that Eupedia thread, are you "Alan"?

In any case, it's clear that there's been significant turnover on the steppe. But I'm not sure how much of it was caused solely by Turkic invaders. It'd be a combination of push and pull factors.

Simon_W said...

@ Chad Rohlfsen

Sounds interesting, but how do you rule out that part of the Bell Beakers' Caucasus-related ancestry is from EEF? The same component needn't refer to the same stuff in different samples.

In the Eurogenes K15, Corded people from Germany have 8 - 13% of the West_Asian component that was largely absent from the EEF. Whereas Bell Beakers from Germany seem to vary more, with some having less than 2% West_Asian, others as much as 10%. A Czech Bell Beaker had 0.7% West_Asian. So I'm not sure if I can share your confidence that Corded Ware and Bell Beaker had the same amount of the new, Yamnaya-related Caucasus stuff.

Simon_W said...

On an unrelated note, check out the map of the MDLP World-22 Northeast_European component:

The distribution of this component in Afghanistan, Pakistan, Iran and India is certainly compatible with a steppe origin of Indo-Iranians.

Note that the MDLP World-22 analysis also includes a highland West_Asian component similar to Dienekes' West_Asian components, that is, a West Asian component quite unlike the EEF people and obviously including quite a bit of ANE. It also includes a Kalash centered component, misleadingly called „Proto-Indo-Iranian“ by Vadim Verenich. For this reason, the presence of the Northeast_European component in West Asia cannot be the result of common ancestry and a lack of ANE-rich „teal-like“ components in the World-22 analysis. It most likely is real admixture.

Interestingly the Northeast_European component is absent in the middle of Georgia. This in spite of the considerable ANE there. So this ANE in Georgia must either have been there for a very long time or it must have been brought by ANE-rich teal people, as some here hypothesize.

To me it would be most interesting to know if the Hittites had some admixture of the Northeast_European component or not. I think this would be really the decision between West Asian theories of a PIE origin and the steppe theory.

Unknown said...

I'm referring to my own run with EHG. Nothing from here.

a said...

I would boldly slant/project the results in another interpretation. Tabasarans/region[proto-Kartvelian] can already be linked with PIE and R1b-z2103.
R1b predates Sintashta by 1000 years in the same region. Eurogenes K15 gives the Samara R1b H.G./[R1b z2103] cluster the highest Eastern European score to date, until proven otherwise. Therefore Samara has the variance: R1b of different clades, coupled with high European scores, plus antiquity in its favour.

The R1b substrate comprising the oldest element of Indo-Iranians should then also be evident in the migration pattern shown in MDLP World-22, coupled with variance in regions of known migrations. Also, Vedic traditions of the ox-drawn multi-person cart, and kurgans in the region, might be interesting to compare.

Areas like Jawzjan [BMAC, also known as the Oxus] fall within the above parameters, linking core PIE homelands and possibly the Afanasevo culture, having R1b-z2103 + M73 and elevated East European as per MDLP World-22.

Matt said...

Cool. The West Siberians are the Mansi, and they do indeed deviate strongly to the HG-Bedouin pole in the PC3 that pushes Yamnaya further from EHG, smoothing out their high relatedness to EHG with not too high relatedness to Yamnaya. The effect is different in the Chuvash who are closest to Yamnaya in PC2, then a little further towards EHG in PC3.

Generally these agree well with the raw IBD scores (they should as these are IBD based PCA using between individual's IBD as the input variables?).

Also re: PC3, I find it interesting to note that there seems more variability on it in EN-MN farmers relative to PC2, while modern Europeans are no more or less variable in it than PC2.

Davidski said...


That's not the middle of Georgia on Vadim's map, it's Kalmykia. East Asians live there.


The PCA plots are based on pairwise IBS not IBD.
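
For readers unfamiliar with the term, here is a minimal sketch of one common way to score pairwise IBS (identity by state) between two samples: at each marker, count how many alleles they share (0, 1 or 2), skip markers missing in either sample, and average. The genotype coding (0/1/2 copies of one allele, `None` for missing), the sample names and the function names are all assumptions for illustration; the exact pipeline behind the plots is not described in this post. The resulting similarity matrix is what a PCA or MDS step would then take as input.

```python
def ibs_similarity(genotypes_1, genotypes_2):
    """Mean proportion of alleles shared by state between two samples.

    Genotypes are coded as 0, 1 or 2 copies of one allele, or None if
    the marker is missing. Markers missing in either sample are skipped.
    """
    shared_alleles = 0
    markers_compared = 0
    for g1, g2 in zip(genotypes_1, genotypes_2):
        if g1 is None or g2 is None:
            continue
        shared_alleles += 2 - abs(g1 - g2)  # 2, 1 or 0 alleles shared
        markers_compared += 1
    if markers_compared == 0:
        return float("nan")  # no overlapping markers to compare
    return shared_alleles / (2 * markers_compared)


def ibs_matrix(samples):
    """Return {(name_1, name_2): IBS similarity} for every pair of samples."""
    names = list(samples)
    return {(a, b): ibs_similarity(samples[a], samples[b])
            for a in names for b in names}


# Toy genotypes for three hypothetical samples over five markers.
toy_samples = {
    "sample_A": [0, 1, 2, 2, None],
    "sample_B": [0, 1, 1, 2, 0],
    "sample_C": [2, 1, 0, None, 2],
}

matrix = ibs_matrix(toy_samples)
print(matrix[("sample_A", "sample_B")])
```

Because each pair is scored only on the markers both samples have, low-coverage ancient samples end up with estimates based on far fewer markers, which makes their scores noisier.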

Krefter said...

I getting an awesome second post at my blog ready. I'm focused on finding diversity in mtDNA H because it's the most popular and least defined haplogroup. But I'll also be looking at all other haplogroups and searching for shared haplotypes. And this time Ancient mtDNA will be just as involved in the analysis as modern. I've found many haplogroups are represented 80%+ of the time by a handful of young subclades in ancients and moderns. I'll probably be done in several days to a week.

There are some interesting trends in H. H1, H3 look like EEF dead-ringers. There's plenty of them in Neolithic mtDNA and most Sardinian and Iberian H is H1+H3. Of my Hs only Basque and some Ancient European H is of full-sequence.

Of Neolithic H1s that were fully sequenced most are H1e and H1j, and most Basque H1 is H1e and H1j. You can't ask for more. Modern H1 looks like a direct descendant of Neolithic H1.

H1, H3, T2b, J1c, HV0, K1a, X2a-o, are all major subclades of haplogroups that look like EEF or mostly EEF lineages. No doubt EEF is a big contributor to modern mtDNA. The East Mediterranean was a nesting place for lots of European mtDNA 8,000-10,000 years ago. There's certainly space for some JT, R0, N* lineages to be of pre-Neolithic expansions but most are not.

There are interesting trends outside of H. Neolithic Euros and Bronze age Steppe people being direct ancestors becomes very clear when you look deeply at the mtDNA.

Grey said...

"I getting an awesome second post at my blog ready."


Matt said...

David: The PCA plots are based on pairwise IBS not IBD.

Apologies, not IBD. Actually the distinction between IBD and IBS is based on haplotypes (in IBD, of varying lengths) vs unlinked SNPs, isn't it? I didn't realise that before a few days ago.

Just out of interest, thinking about loadings, I guess these load mainly on higher IBSs for Northern European samples with one another and for them with the HGs. IIRC the southern samples and farmers tended to have relatively lower IBS with one another, which makes sense if they have higher genetic diversity, so it's either less likely that they will be identical by state for any variant or at least the IBS ceiling is closer to the IBS floor. Similar story with IBD to a stronger extent (as haplotypes have a much stronger effect from recent ancestry with relatively lower weighting to ancient).

So high score on PC1 = higher IBS with Northern Europe / HGs, while low score = low IBS with Northern Europe / HGs (rather than low score on PC1 = higher IBS with Middle East)?

Onur Dincer said...

"H1, H3, T2b, J1c, HV0, K1a, X2a-o, are all major subclades of haplogroups that look like EEF or mostly EEF lineages. No doubt EEF is a big contributor to modern mtDNA. The East Mediterranean was a nesting place for lots of European mtDNA 8,000-10,000 years ago. There's certainly space for some JT, R0, N* lineages to be of pre-Neolithic expansions but most are not."

Krefter, I am X2e2a. Is X2e2a among the major EEF lineages as well? My paternal grandmother is T2b, which you say is in the inventory of major EEF mtDNA haplogroups. My paternal grandfather is H2a1 and my maternal grandfather is H5, what can you say about their origins?

Simon_W said...

@ David

I was referring to this:

Granted, it's not exactly in the middle of Georgia, but close to it. And it's less Northeast_European than Kalmykia, at least according to this MDLP analysis.

Krefter said...


X2e2 is under X2a-o, which close to 100% of Neolithic, Bronze age, modern X belongs to. Your specific subclade X2e2 appears to be very rare and I don't have an example of X2e2 in modern or ancient mtDNA. That's the case for most people. The old haplogroup we belong to is popular but the young deep subclade we belong to is very rare.

Yes, T2b looks like a mostly EEF lineage. It's rare in West Asia, popular in Europe and Neolithic Europeans. H2a1 takes up most modern European H2 but H2 is rare and so therefore H2a1 is too. H2a1 today peaks in Arabia and Egypt, but it's probably some type of founder effect. No Neolithic H2a1 has been found yet. The only ancient H2a1 examples come from Catacomb (2600 BC), Corded Ware (2600 BC), Unetice (2000 BC).

H5 is found in all Pre-Historic European groups(inclu. EEF) except hunter gatherers. That's because it's the only major H clade that can be defined with low-coverage testing. It exists today in West Asia and Europe at similar frequencies(1-5%), with a possible peak in Iberia. Some 90% of modern European H5 is H5a1 and I think the same may be true for West Asia.

Simon_W said...

@ Chad

Yes, that was clear to me. I was just pointing to an analysis in conflict with your results. It's really crucial to separate the Yamnaya related West Asian influence from the EEF-mediated one.

Onur Dincer said...

Thank you for your valuable insights, Krefter.

Davidski said...

I added more ancient samples to the plots, including Germany_MN, Italy_Copper_Age and a couple of Karasuk samples.