Friday, October 14, 2016

The peopling of South Asia: an illustrated guide


For your pleasure and my satisfaction: a nice little slide show on the ancient population history of South Asia. Click on the first image to get started. The images are based on my latest Principal Component Analysis (PCA) of the world (see here). Any other questions? Ask in the comments.


See also...

Caste is in the genes

51 comments:

Nirjhar007 said...

Any other questions? Ask in the comments

Well, nope. There is no data from there yet, so I don't think there are any permanent answers :). But I would love to see what others suggest.

Davidski said...

I'm gonna edit some of the slides a bit; the Mesolithic South Asians are too high up the plot IMO.

Nirjhar007 said...

And where did you find Mesolithic S Asians, BTW?

Davidski said...

Indian tribals minus 25% West Eurasian or so and no East Asian = South Asian Mesolithic.
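
The arithmetic behind that shorthand can be sketched as follows, assuming (this is an assumption for illustration, not something established in the post) that admixture behaves like a simple weighted average of PCA coordinates. If tribal = (1 - f) * Mesolithic + f * West Eurasian, then the unsampled Mesolithic position is (tribal - f * West Eurasian) / (1 - f). The coordinates and the function name `remove_component` are hypothetical placeholders, not values from the actual datasheet.

```python
def remove_component(mixed, component, fraction):
    """Solve mixed = (1 - fraction) * rest + fraction * component for rest."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return [(m - fraction * c) / (1 - fraction) for m, c in zip(mixed, component)]


# Hypothetical PC1/PC2 coordinates, for illustration only.
tribal = [0.010, -0.020]        # stand-in for an Indian tribal average
west_eurasian = [0.040, 0.010]  # stand-in for a West Eurasian reference

# Strip out an assumed 25% West Eurasian share to get the inferred position.
ghost = remove_component(tribal, west_eurasian, 0.25)

# Sanity check: mixing the inferred point back with 25% West Eurasian
# should land on the tribal coordinates again.
remixed = [0.75 * g + 0.25 * w for g, w in zip(ghost, west_eurasian)]
print(ghost, remixed)
```

The inferred point always lies on the line through the two inputs, beyond the mixed sample, so its position is only as good as the assumed fraction and the choice of West Eurasian reference.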

Nirjhar007 said...

Seriously, you are suggesting this? Without having any aDNA? Anyway, I am not going to take this seriously, nope...

Ryan said...

Could just use the Onge. Would that satisfy you Nirjhar?

Nirjhar007 said...

Well, I would be really satisfied if Dave could run some aDNA samples right now! ;) What's your address, Dave? I will send them via courier!

kkkk

Davidski said...

Alright, edit complete. Please refresh your pages.

I didn't do too much in the end, but I want my predictions to be as accurate as possible, and I reckon these plots are now spot on.

Btw, yeah, the posited Mesolithic South Asians do cluster very close to where Onge cluster on similar plots.

Rob said...

Make it into an animated video

Davidski said...

Too lazy today. Might upload it to Youtube later.

Ryan said...

BTW, I love how that weird migration edge between Oceanians and the Mbuti sort of shows up for the Sahul samples on this PCA. IMHO not necessarily a recent thing, but just the remnants of an Out-Of-Africa migration from someone not in the "Afrasian" or NeoAfrican clade that had a greater impact there than elsewhere, though I realize this PCA isn't robust evidence of anything like that. Just a hint... maybe.


I wouldn't be completely certain that that SE Asian migration edge into South Asia is entirely Neolithic though. There was pretty big turnover in Europe prior to the Neolithic, and it's quite possible that South Asia was just as dynamic.

Davidski said...

There was pretty big turnover in Europe prior to the Neolithic, and it's quite possible that South Asia was just as dynamic.

Not really sure what the gist of this is, but please note that there couldn't have been any Southeast Asian ancestry per se in much of India prior to migration of the Neolithic farmers from the Zagros into South Asia, because the South Asian genetic cline looks more like a banana than a straight line, meaning that some Indians are still practically mixtures of South Asian foragers and Zagros farmers.

Gioiello said...

@ Davidski (: @ Nirjhar007)

"but I want my predictions to be as accurate as possible"

Science works like this: predictions, then experiments that prove or disprove them.
I wiwh you finding your Villabruna.

Gioiello said...

wish

Matt said...

That's one way it could've happened.

How does nMonte fit South Asian populations based on just PC1 and PC2? If it's all evident in PC1 and PC2, the fits should be roughly the same as with all 10 dimensions.

(Although I think it could've happened like this, the 2D view might be misleading about the proportions. For example, I tried using just PC1 and PC2 in this case:

English_Kent
Loschbour - 62.6
Levant_Neolithic - 37.4
AfontovaGora3 0
Iberia_Mesolithic 0
Iran_Hotu 0
Iran_Neolithic 0
Israel_Natufian 0
Karelia_HG 0
Samara_HG 0
Ust_Ishim 0

distance% = 0.3983 %)
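
A toy sketch of why a fit on two dimensions can be ambiguous. This is not nMonte; a simple greedy search that shifts small amounts of weight between sources is swapped in, and the three sources `A`, `B`, `C` and their coordinates are hypothetical. `C` is built to sit exactly between `A` and `B` on the first two dimensions and to differ only on the third, so a 2D fit can reach zero distance with a large share of `C` even though the target contains none.

```python
import math


def blend(weights, sources, dims):
    """Weighted average of the source coordinates on the chosen dimensions."""
    return [sum(w * src[d] for w, src in zip(weights, sources)) for d in dims]


def distance(a, b):
    """Plain Euclidean distance between two equal-length coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_mixture(target, sources, n_dims, steps=(0.1, 0.01, 0.001)):
    """Greedy search for non-negative weights that sum to 1 (not nMonte)."""
    dims = range(n_dims)
    goal = [target[d] for d in dims]
    n = len(sources)
    weights = [1.0 / n] * n
    best = distance(blend(weights, sources, dims), goal)
    for step in steps:
        improved = True
        while improved:
            improved = False
            for i in range(n):
                for j in range(n):
                    if i == j or weights[i] < step:
                        continue
                    # Move `step` of weight from source i to source j.
                    trial = list(weights)
                    trial[i] -= step
                    trial[j] += step
                    d = distance(blend(trial, sources, dims), goal)
                    if d < best - 1e-12:
                        weights, best, improved = trial, d, True
    return weights, best


# Hypothetical coordinates: C is the midpoint of A and B on dims 1-2.
names = ["A", "B", "C"]
sources = [[0.0, 0.0, 0.0], [1.0, 0.2, 0.0], [0.5, 0.1, 1.0]]
target = [0.4, 0.08, 0.0]  # exactly 60% A + 40% B

for k in (2, 3):
    w, d = fit_mixture(target, sources, k)
    shares = ", ".join(f"{n} {100 * x:.1f}" for n, x in zip(names, w))
    print(f"{k} dims: {shares} (distance {d:.4f})")
```

With two dimensions the search stops at the first zero-distance solution it finds, which includes a sizeable `C` share; with three dimensions `C` is pushed towards zero. A distance of 0 therefore says the target is reachable, not that the proportions are unique.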

Davidski said...

@Matt

English_Kent are a mixture of relatively similar ancient West Eurasian groups, while South Asians are a mixture of highly distinct global and West Eurasian groups, so modeling the fine scale ancestry of English_Kent using the first two dimensions of a global plot is close to a pointless exercise, while it is indeed a very useful exercise for South Asians.

Davidski said...

@Matt

Btw, if you get the chance this weekend, plz test all of the South Asian samples as a mix of Yamnaya (and/or Andronovo), Iran_N and Bonda using only the first two dimensions, and tell us what patterns you see.

I reckon the results will look pretty solid, even if simplified. Kalash should be the most Yamnaya, Makrani the most Iran_N and Paniya the most Bonda, basically in line with formal stats.

And that's because unlike Europeans, South Asians have been part of the global village since the Neolithic.

Jijnasu said...

Is it possible that rather than being spread evenly across the subcontinent, the Iranian Neolithic-like ancestry largely remained confined to the northwest until the 2nd millennium B.C.E.? Non-Brahmin Dravidian-speaking groups in southern India too show a cline of decreasing ANI ancestry from upper castes to lower castes, suggesting a rather recent expansion of ANI groups into peninsular India.

Matt said...

Davidski, I'll have a try if I get the time. I reckon the overall trend will be there like you say. There are quite a few populations though, so just starting with Kalash, fitting with PC1 and PC2:

Kalash - Steppe EMBA 41.8 (Yamnaya_Samara 23.45, Yamnaya_Kalmykia 18.35), Steppe LMBA 16.8 (Andronovo 16.8), Iran_Neolithic+CHG 23.6 (Kotias 12.5, Satsurblia 7.7, Iran_Neolithic 3.6) Austroasiatic_Bonda 17.6 - distance% = 0

or without the CHG and just limiting to the Yamnaya Samara and Andronovo, not including Yamnaya_Kalmykia:

Kalash - Yamnaya_Samara 41.8, Andronovo 26.15, Iran_Neolithic 16.15, Austroasiatic_Bonda 15.9 - distance% = 1e-04 %

Using the full 10 dimensions and the same calc populations

Kalash - Iran_Neolithic 42.1, Yamnaya_Samara 36.8, Austroasiatic_Bonda 21.1, Andronovo 0, Kotias 0, Satsurblia 0, Yamnaya_Kalmykia 0 - distance% = 1.0313 %

Kalash - Iran_Neolithic 42.1, Yamnaya_Samara 36.8, Austroasiatic_Bonda 21.1, Andronovo 0 - distance% = 1.0313 %

So there is presumably some degree of switcharound in the higher dimensions where Kalash is closer to Iran_Neolithic and steppe is more distant (and to a lesser extent Kalash is slightly closer to Bonda).

Matt said...

Paniya with 2 dimensions:

Paniya - Austroasiatic_Bonda 75.2, Iran_Neolithic 24.8, Andronovo 0, Kotias 0, Satsurblia 0, Yamnaya_Kalmykia 0, Yamnaya_Samara 0 - distance% = 0.1491 %

Paniya - Austroasiatic_Bonda 75.2, Iran_Neolithic 24.8, Andronovo 0, Yamnaya_Samara 0 - distance% = 0.1491 %

Paniya with 10 dimensions:

Paniya - Austroasiatic_Bonda 91.4, Iran_Neolithic 8.6, Andronovo 0, Kotias 0, Satsurblia 0, Yamnaya_Kalmykia 0, Yamnaya_Samara 0 - distance% = 3.7484 %

Paniya - Austroasiatic_Bonda 91.4, Iran_Neolithic 8.6, Andronovo 0, Yamnaya_Samara 0 - distance% = 3.7484 %

(Paniya becomes better approximated by Bonda in the full 10 dimensions and less so by Iran Neolithic, but it's not a great fit)

Makrani with 2 dimensions:

Makrani - Iran Neolithic+CHG 72.9(Iran_Neolithic 29.3, Satsurblia 22.85, Kotias 20.75) Steppe EMBA 11.1 (Yamnaya_Kalmykia 8.85, Yamnaya_Samara 2.25), Steppe LMBA 9.1 (Andronovo 9.1), Austroasiatic_Bonda 6.9 - distance% = 0 %

Makrani - Iran_Neolithic 53.45, Andronovo 25.65, Yamnaya_Samara 17.3, Austroasiatic_Bonda 3.6 - distance% = 0 %

Makrani with 10 dimensions:

Makrani - Iran_Neolithic 61.65, Andronovo 25.3, Austroasiatic_Bonda 13.05, Kotias 0, Satsurblia 0, Yamnaya_Kalmykia 0, Yamnaya_Samara 0 - distance% = 1.1538 %

Makrani - Iran_Neolithic 61.65, Andronovo 25.3, Austroasiatic_Bonda 13.05, Yamnaya_Samara 0 - distance% = 1.1538 %

(Makrani is probably a bit more Iran_Neolithic and Bonda like in the 10 dimensions, and slightly less steppe like, though the inclusion of the CHG in one of the above 2 dimension nMontes complicates matters a little).

Matt said...

For quick demonstration, neighbour joining using just dimensions 1-2 and 3-10:

http://i.imgur.com/p43ugm2.png

(Dimensions 1-2 mostly concern African vs non-African variation, so Africans are clearly distinct in that tree, while Yamnaya, Loschbour and Sintashta sit close to populations like Adygei, Circassian and Abkhasian. Dimensions 3-10 have less information about Africans, so Africans are closer on the tree to everyone else, but these dimensions carry a lot of information about structure between Eurasian populations.)

Like, the basic concepts of the post are OK, I just wouldn't use the PC1 and PC2 positions to infer proportions. (And putting a date on all the Iran_Neolithic and ANE relatedness into South Asia is still speculative atm, IMO.)

aniasi said...

Question...

Is it possible to see when the South Indian Mesolithic, Iranian Neolithic, and Austroasiatic start to internally diverge?

This is still broad for a subcontinent of 1.5 billion people, and I would like to see if there was some form of tribal matching going on. It might also provide some resolution on the various groups that existed in India. I am beginning to wonder if there were Paleolithic relics in the South and East that would become Veddas and other isolated groups.

Alberto said...

One interesting thing about South Asian populations is how remarkably stable the West Eurasian side of all the populations is (what we call ANI). The difference between populations is basically how much ASI/ANI they have. If one could remove the ASI part, they'd all be pretty similar.

Here are a few plots based on some D-stats that RK requested some time ago involving Chamar, Brahmin_UP, Chenchu and Velamas:

http://imgur.com/a/jeJ1a

Comparing any pair of populations gives a really good correlation, even if their cultural and geographical backgrounds are very different. It looks like Velamas (high caste) is the most shifted towards Iran_N and Iran_ChL. The other 3 are more homogeneous, with the Chamar (low caste) and Chenchu (Dravidian tribe) being the most shifted towards EHG, while Brahmin_UP (highest caste) might be the most shifted towards Sintashta, though all the differences are pretty small and hardly significant.

Matt said...

@ Alberto, I would say, though, that if you take the Fsts from Chaubey et al 2016 (http://www.nature.com/articles/srep19166) and stick them on a neighbour joining tree:

http://i.imgur.com/85KfQ7q.png

You do get some populations who are remote on the cline from their nearest neighbours. Like if it were just clinal between two identical populations with no drift then I think they would just sit flat on the tree without any branching off of their own.

I would think that this would be because of high levels of recent group specific drift (e.g. most likely for Indian Jewish groups). But it could be ancient and reflect substructure within ASI that isn't much reflected by different affinities outside it. ASI might not have been one thing any more than ENA as much as a bunch of structured populations that contribute differently to different Indian groups (and those distinctions get lost in very high dimensions of PCA).

Shaikorth said...

Matt, I figure these Paniya, Chenchu and Pulliyar are quite drifted too.

But let's try and test this. Chenchu and their tree neighbour Kol need no ancients in Broushaki fits when Indian and ancient donors are allowed, which should already be an indicator.
Chenchu with no Indian (modern ASI-rich) donors, but ancients allowed in Broushaki fits:
7% Cambodian 7% Ust-Ishim rest Sindhi and Pathan.
Kol:
5% Cambodian 10% Ust-Ishim rest Sindhi and Pathan

Doesn't look like there's ancient structure involved. For more ASI examples without Indian donors:

Malayan: 8% Cambodian 26% Ust-Ishim rest Sindhi and Pathan.
Nihali (language isolate):
13% Cambodian 30% Ust-Ishim rest Sindhi and Pathan

Matt said...

Could you dumb it down for me why you think that shows that?

Shaikorth said...

I figure that if there's unique ASI-related ancient structure in Chenchu that's missing from other Indians and that's the cause of their high fst-differentiation, it should probably show up as a donation from Ust-Ishim, not from modern Indians (Meghawal and Dhurwa tribals in this case are enough to cover Chenchu). When Indian donors are removed, the bulk of ASI shows up as Cambodian + Ust-Ishim and the Chenchu are again no exception.

Ganesh hatwar said...

I would like to make a point: you seem to have considered caste but not sub-caste. Each caste has several sub-castes. What if the sub-castes have different ancestries? Then Brahmins may share a common ancestry; this applies to every caste.

MaxT said...

How does this chart explain Yamnaya affinity in South Asia? We're missing a Yamnaya-like steppe population that contributed to South Asians before Andronovo-like population.

Lazaridis et al (2016)

"While the Early/Middle Bronze Age ‘Yamnaya’-related group (Steppe_EMBA) is a good genetic match (together with Neolithic Iran) for ANI, the later Middle/Late Bronze Age steppe population Sintashta-Andronovo (Steppe_MLBA) is not."

"Our results do not resolve the relationship between ANI and the origin of Indo-European speakers in South Asia, in the sense that they reveal that South Asian populations have ancestry both from regions related to the Eurasian steppe and ancient Iran, which is compatible with alternative homeland solutions."

http://biorxiv.org/content/biorxiv/early/2016/06/16/059311.full.pdf

They don't seem to know which steppe population contributed to South Asians, but steppe ancestry in South Asia is Yamnaya-like rather than Andronovo/Sintashta-like.

MaxT said...

Lazaridis et al (2016)

"ANI ancestry related to both ancient Iran and the steppe is found across South Asia making it difficult to associate it strongly with any particular language family (Indo-European or otherwise)."

"Nonetheless, the fact that we can reject West Eurasian population sources from Anatolia, mainland Europe, and the Levant diminishes the likelihood that these areas were sources of Indo-European (or other) languages in South Asia."

They also seem to rule out Anatolia/Europe/Levant as the source of I.E. languages in South Asia.

Davidski said...

The ASHG is this week. Mathieson is doing his poster presentation on the Balkan farmers on Wednesday. Wonder if a preprint at bioRxiv will follow soon after?

Chad Rohlfsen said...

I'll follow tweets. I'm off Wednesday.

Seinundzeit said...

There is a question of how many dimensions one should retain.

Some of the higher dimensions might muddle things, add to the noise.

Something to look into. I can recall an online piece, dealing with this very question (it was about "how many dimensions are informative?", or something along those lines).

Nirjhar007 said...

Dave,

What's the schedule for that Balkan paper, if you remember?

Davidski said...

The posters are being presented from 2-3pm Vancouver time. So I guess someone will tweet about them right after, and then maybe we'll see some of the papers at bioRxiv. No idea when Mathieson's paper is coming out.

Nirjhar007 said...

Sounds worth looking forward to :).

Davidski said...

@Sein

There is a question of how many dimensions one should retain.

It might be an idea to check which of the ten dimensions produce results that most closely resemble models based on formal stats.

That would probably take a while to figure out, but I have a feeling that it might be the first few dimensions.

Seinundzeit said...

David,

That sounds like a good idea. I'll give it a try, once I find some time.

Jijnasu said...

Quite frustrating that the data from Rakhigarhi is taking much longer than expected. Hopefully it will arrive soon

Davidski said...

Quite frustrating that the data from Rakhigarhi is taking much longer than expected. Hopefully it will arrive soon.

I wouldn't be surprised if the paper was being delayed by debates on how to frame the results, which in all likelihood do not support the idea that Harappans were Indo-Aryans.

Matt said...

Re: dimensions, in this set, at the very least I would really, really not discard Dimensions 4, 6, and 9, since they're highly informative in distinguishing Euro and Siberian pre-Neolithic from other West Eurasian groups (4), in distinguishing western West Eurasian groups from eastern West Eurasian groups(6), and in distinguishing South Asians from Steppe, Iran Neolithic and Caucasus populations (9):

Not the best graphic (I'd use Davidski's colour coding if I had the time):

http://i.imgur.com/V6C8tFL.png

(4 by 6 is pretty much your "West Eurasia PCA" shape).

If you kept up 1-6, I'd imagine that South Asians would come out more Iran Neolithic than with just 1-2 and more Iran Neolithic than under 1-10.

Dimensions 1 and 2 split Africans from Eurasians, then ENA from present-day West Eurasians. The other dimensions distinguish:

3: Oceanians (and to a lesser extent South Asians) from other Eurasians
5: Biaka from West Africans, esp Gambians
7: EuroHG and East Asians from recent Siberians and the Middle East
8: Bougainvilleans from Papuans+Koinanbe
10: Gambians and North Africans from other African groups

So you could potentially try and discard all those dimensions except 7 if you really believed they contained noise. I think they might have useful info at a trace level though.

I'm not sure you'd be able to pick a subset of these dimensions to approximate the formal stats to be honest.

Davidski said...

@Jijnasu

Is it possible that rather than being spread evenly across the subcontinent, the Iranian Neolithic-like ancestry largely remained confined to the northwest until the 2nd millennium B.C.E.? Non-Brahmin Dravidian-speaking groups in southern India too show a cline of decreasing ANI ancestry from upper castes to lower castes, suggesting a rather recent expansion of ANI groups into peninsular India.

Hard to say, but the plot isn't making any predictions about where the Neolithic Zagros/Mesolithic South Asian mixed populations lived, just that they existed somewhere in South Asia, and during the Neolithic included Zagros admixture from close to 100% down to 0%.

In other words, that whole South Asian cline on the third and fourth slides may have been located in Northwestern India and Pakistan from the Early Neolithic to the Copper Age.

But the fact that there are Indian populations that might still be straight two-way mixtures between Zagros farmers and Mesolithic South Asians, like the Paniya, suggests that at some point populations with only Zagros and Mesolithic ancestry, or even just Zagros ancestry, pushed deep into India.

I suppose it could have been a late migration that also included groups with steppe ancestry, but this is unlikely. It does seem more likely that the descendants of Zagros farmers moved a long way south before the steppe people got to India.

Nirjhar007 said...

I wouldn't be surprised if the paper was being delayed by debates on how to frame the results, which in all likelihood do not support the idea that Harappans were Indo-Aryans.

I know you are feeling the tension. But please stop talking out of your rear...

huijbregts said...

@ Matt, Davidski
I tried a different experiment. I chose an nMonte specification at k=10 and then subsequently lowered the number of dimensions.
At k=9 and k=8 I got essentially the same nMonte output, but at k=7 I lost an important population.
The Euclidean distance hardly increased from 8 to 10 dimensions. So dimensions 9 and 10 didn't add new information, but they also didn't add to the noise.
My first impression is that 10 dimensions may not be an overkill.
Maybe somebody can repeat this experiment?
(I have averaged the populations in the datasheet. My nMonte run had a European target; three of the admixtures were also European, the fourth was SW-Asian.
What I lost in dimension 7 was the discrimination between Dutch and Croatians. Indeed this is a small difference in the datasheet, maybe it is in a higher dimension than 10)
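
A minimal sketch of how one might check where two populations start to separate as dimensions are added. The ten-value rows below are hypothetical stand-ins for averaged datasheet rows (not the real Dutch or Croatian values), constructed so that the two rows differ only on dimension 8.

```python
import math


def truncated_distances(row_a, row_b):
    """Euclidean distance between two rows using dimensions 1..k, for each k."""
    total = 0.0
    out = []
    for x, y in zip(row_a, row_b):
        total += (x - y) ** 2
        out.append(math.sqrt(total))
    return out


def first_separating_k(row_a, row_b, threshold):
    """Smallest k at which the truncated distance exceeds threshold, else None."""
    for k, d in enumerate(truncated_distances(row_a, row_b), start=1):
        if d > threshold:
            return k
    return None


# Hypothetical averaged coordinates; only dimension 8 differs.
pop_a = [0.02, 0.05, 0.0, 0.01, 0.0, 0.03, 0.0, 0.012, 0.0, 0.0]
pop_b = [0.02, 0.05, 0.0, 0.01, 0.0, 0.03, 0.0, 0.004, 0.0, 0.0]

for k, d in enumerate(truncated_distances(pop_a, pop_b), start=1):
    print(f"k={k}: distance {d:.4f}")
print("first separated at k =", first_separating_k(pop_a, pop_b, 0.001))
```

In this toy case the two rows are indistinguishable for k up to 7 and the distance stops growing after k = 8, which matches the pattern described above: a fit cannot tell two sources apart until the dimension that separates them is included.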

Matt said...

@ huijbregts, I think there's really not any dispersal of different Europeans in dimension 9, so I think you wouldn't really get any difference including that dimension for models of European populations, especially as mixtures of other Europeans (which is what I think you're running?). If you're modeling the steppe or Caucasus or South Asia, I would think you'd get differences with including 9 or not though, because those populations show dispersal on that dimension.

The different dimensions add value for different populations, so it's hard to say "For all populations, no more difference will exist beyond this many dimensions". Take dimensions 5 and 10, which are intra-African dimensions: http://i.imgur.com/OKI1fCB.png

There's not much dispersal of Eurasian populations there, and where there is, the pattern basically just mirrors lower / other dimensions. But if you're trying to model Mozabites as having Gambian or Nigerian African ancestry, or Ethiopians as having Kenyan or Nigerian ancestry, then those dimensions would be able to point nMonte to the right answer, one that makes sense with geography (Mozabite is in a Eurasian-Gambian intermediate position, etc.).

(Even with these 10 dimensions, they're probably good enough for Europeans, but I do think you might still get some West vs East European dimensions eventually in the PCA dimensions beyond the 10, as there are differences in Fst that are slightly more structured than they are here.)

Davidski said...

Seems like it's possible to get results for South Asians that resemble models based on formal stats using dimensions 1-5 or even just 1-3.

On the other hand, to model Europeans successfully requires most of the dimensions, and indeed probably as many as 9.

Nirjhar007 said...

But I am really excited about the Balkan paper presentation. Who is going?

huijbregts said...

@ Matt
You did a fine job in characterizing the dimensions. Please continue.
Somehow you are confident that once you have completed the characterization of the lower dimensions, this proves that the remaining dimensions are redundant.
I am not convinced that the logic is this simple.
DNA is a huge datastructure which may have hundreds of eigenvectors; 10 may be just the tip of the iceberg.
The algorithm is an unsupervised mathematical analysis on DNA data; it doesn't know about geographical or historical features.
I agree that in the lower dimensions the geographical interpretation seems dominant. As the number of geographical dimensions is limited, it is also plausible that the higher dimensions cannot be geographically interpreted.
However, this does not imply that the higher dimensions are just redundant.
Also remember that in the higher dimensions it is the columns that are geographically neutral. It remains possible that some of the rows may have significant loadings in the higher dimensions.
Until we have learned more about this application of PCA, I consider it a very lucky coincidence if the PCA were exclusively structured by geographical information.

Matt said...

@ huijbregts: "Somehow you are confident that once you have completed the characterization of the lower dimensions, this proves that the remaining dimensions are redundant."

No, I don't believe that.

(Or else why would I have said

"(Even with these 10 dimensions, they're probably good enough for Europeans, but I do think you might still get some West vs East European dimensions eventually in the PCA dimensions beyond the 10, as there are differences in Fst that are slightly more structured than they are here.)"

That's me pretty much saying I think you would have extra information in higher dimensions (beyond these 10) about population structure. I do understand that by saying "good enough" I may have muddied the waters a little).

There will come a point at which either the power of the algorithm to identify meaningful dimensions that reflect real population structure or the actual population structure will be exhausted. That could be beyond these 10.

Ryan said...

@David - "Not really sure what the gist of this is, but please note that there couldn't have been any Southeast Asian ancestry per se in much of India prior to migration of the Neolithic farmers from the Zagros into South Asia, because the South Asian genetic cline looks more like a banana than a straight line, meaning that some Indians are still practically mixtures of South Asian foragers and Zagros farmers."

I'm saying I wouldn't exclude a somewhat banana-shaped cline existing prior to the Neolithic too.

terryt said...

@ Ryan:

Probably at least two separate movements from Southeast Asia into India widely separated in time. To me the most interesting aspect is that although Mesolithic South Asia is closest to the Papuan group the diagram cannot support at all the idea that the Papuan element formed as a result of an arrival via South Asia. Rather it looks to be the other way round: the South Asian is derived from the Papuan. In other words South Asia has never been a route between east and west. Mesolithic/Neolithic Iran and Mesolithic South Asia are at opposite ends of a cline.