Tuesday, March 28, 2017

"Heavily sex-biased" population dispersals into the Indian Subcontinent


And so it begins. BMC Evolutionary Biology has a very interesting, but hardly surprising, new paper by Silva et al. on the population history of the Indian Subcontinent. Emphasis is mine:

Background: India is a patchwork of tribal and non-tribal populations that speak many different languages from various language families. Indo-European, spoken across northern and central India, and also in Pakistan and Bangladesh, has been frequently connected to the so-called “Indo-Aryan invasions” from Central Asia ~3.5 ka and the establishment of the caste system, but the extent of immigration at this time remains extremely controversial. South India, on the other hand, is dominated by Dravidian languages. India displays a high level of endogamy due to its strict social boundaries, and high genetic drift as a result of long-term isolation which, together with a very complex history, makes the genetic study of Indian populations challenging.

Results: We have combined a detailed, high-resolution mitogenome analysis with summaries of autosomal data and Y-chromosome lineages to establish a settlement chronology for the Indian Subcontinent. Maternal lineages document the earliest settlement ~55–65 ka (thousand years ago), and major population shifts in the later Pleistocene that explain previous dating discrepancies and neutrality violation. Whilst current genome-wide analyses conflate all dispersals from Southwest and Central Asia, we were able to tease out from the mitogenome data distinct dispersal episodes dating from between the Last Glacial Maximum to the Bronze Age. Moreover, we found an extremely marked sex bias by comparing the different genetic systems.

Conclusions: Maternal lineages primarily reflect earlier, pre-Holocene processes, and paternal lineages predominantly episodes within the last 10 ka. In particular, genetic influx from Central Asia in the Bronze Age was strongly male-driven, consistent with the patriarchal, patrilocal and patrilineal social structure attributed to the inferred pastoralist early Indo-European society. This was part of a much wider process of Indo-European expansion, with an ultimate source in the Pontic-Caspian region, which carried closely related Y-chromosome lineages, a smaller fraction of autosomal genome-wide variation and an even smaller fraction of mitogenomes across a vast swathe of Eurasia between 5 and 3.5 ka.

...

There are now sufficient high-quality Y-chromosome data available (especially Poznik et al. [58]) to be able to draw clear conclusions about the timing and direction of dispersal of R1a (Fig. 5). The indigenous South Asian subclades are too young to signal Early Neolithic dispersals from Iran, and strongly support Bronze Age incursions from Central Asia. The derived R1a-Z93 and the further derived R1a-Z94 subclades harbour the bulk of Central and South Asian R1a lineages [55, 58], as well as including some Russian and European lineages, and have been variously dated to 5.6 [4.0;7.3] ka [55], 4.5-5.3 ka with expansions ~4.0-4.5 ka [58], or 4.7 [4.0;5.5] ka (Yfull tree v4.10 [54]). The South Asian R1a-L657, dated to ~4.2 ka [3.3;5.1] (Yfull tree v4.10 [54]), is the largest (in the 1KG dataset) of several closely related subclades within R1a-Z94 of very similar time depth. Moreover, not only has R1a been found in all Sintashta and Sintashta-derived Andronovo and Srubnaya remains analysed to date at the genome-wide level (nine in total) [76, 77], and been previously identified in a majority of Andronovo (2/3) and post-Andronovo Iron Age (Tagar and Tachtyk: 6/6) male samples from southern central Siberia tested using microsatellite analysis [101], it has also been identified in other remains across Europe and Central Asia ranging from the Mesolithic up until the Iron Age (Fig. 5).

The other major member of haplogroup R in South Asia, R2, shows a strikingly different pattern. It also has deep non-Subcontinental branches, nesting a South Asian specific subclade. But the deep lineages are mainly seen in the eastern part of the Near East, rather than Central Asia or eastern Europe, and the Subcontinental specific subclade is older, dating to ~8 ka [55].

Altogether, therefore, the recently refined Y-chromosome tree strongly suggests that R1a is indeed a highly plausible marker for the long-contested Bronze Age spread of Indo-Aryan speakers into South Asia, although dated aDNA evidence will be needed for a precise estimate of its arrival in various parts of the Subcontinent. aDNA will also be needed to test the hypothesis that there were several streams of Indo-Aryan immigration (each with a different pantheon), for example with the earliest arriving ~3.4 ka and those following the Rigveda several centuries later [12]. Although they are closely related, suggesting they likely spread from a single Central Asian source pool, there do seem to be at least three and probably more R1a founder clades within the Subcontinent [58], consistent with multiple waves of arrival. Genomic Y-chromosome phylogeography is in its infancy compared to mito-genome analysis so it is of course likely that the picture will evolve with sequencing of further South Asian Y-chromosomes, but the picture is already sufficiently clear that we do not expect it to change drastically.



Silva et al., A genetic chronology for the Indian Subcontinent points to heavily sex-biased dispersals, BMC Evolutionary Biology, Published: 23 March 2017, DOI: 10.1186/s12862-017-0936-9

See also...

Descendants of ancient European (fair?) maidens in Central Asia's highlands

Ancient herders from the Pontic-Caspian steppe crashed into India: no ifs or buts

Children of the Divine Twins

The Aryan Trail (3500 - 1500 BC)

The Poltavka outlier

Indian genetic history in three simple graphs

The peopling of South Asia: an illustrated guide

Caste is in the genes

149 comments:

André de Vasconcelos said...

Oh man, this is going to be an interesting one *grabs popcorn*

Salden said...

Wait for the Hindu Nationalists to show up.

Gioiello said...

We came at the goal, at last!

postneo said...

So how many samples have they considered in their analysis of R1a phylogeny? There does not seem to be any fresh sampling, just the same repeats from the last 5 years.

Z2123 has a spread in South Asia too, but somehow it appears they are trying to underplay it?

Gioiello said...

Read the paper. Nothing new. We have nothing to learn from these scholars. They base their Y dates upon the YFull tree, which I say are underestimated by a factor of 1.17 to 1.26. My theory about R1b1 remains strong. About the expansion of R1a and satem IE languages, we have known that for years.
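To put the claimed 1.17 to 1.26 factor in concrete terms, here is a minimal sketch that rescales the two YFull v4.10 point estimates quoted in the post (4.7 ka for R1a-Z93/Z94 and ~4.2 ka for R1a-L657). The factor is the commenter's own claim, not something the paper or YFull endorses, and the names `yfull_dates_ka` and `rescale` are made up for illustration.

```python
# YFull v4.10 point estimates quoted in the post, in thousands of years (ka).
yfull_dates_ka = {"R1a-Z93/Z94": 4.7, "R1a-L657": 4.2}

# Correction factors claimed in the comment (an assumption, not an
# established calibration).
LOW_FACTOR, HIGH_FACTOR = 1.17, 1.26


def rescale(date_ka, factor):
    """Return a date multiplied by a correction factor, rounded to 2 decimals."""
    return round(date_ka * factor, 2)


# Map each clade to its (low, high) rescaled date range.
rescaled = {
    clade: (rescale(date, LOW_FACTOR), rescale(date, HIGH_FACTOR))
    for clade, date in yfull_dates_ka.items()
}

for clade, (low, high) in rescaled.items():
    print(f"{clade}: {low}-{high} ka")
```

Under that assumption the L657 estimate would move from ~4.2 ka to roughly 4.9 to 5.3 ka, which stays inside the paper's quoted interval for Z93/Z94.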

postneo said...

Good to have higher res mtdna. Will go thru when I have time. Their discussion of U2/U2e is inadequate. I am sure it has a bearing on the overall picture. Seems like they made assumptions.

Rami said...

There is nothing new here, though I would say R1a was there a few hundred years before. The article is wrong in associating Sintashta with Proto-Indo-Iranian; it's Iranian. Also, BMAC did not start trading with the IVC 4 kya; that occurred far earlier.

Only interesting bit was on Mtdna other than that meh.

blogspot said...

@Salden said...
'Wait for the Hindu Nationalists to show up'

he fled to Mom in great frustration.

TruthPrevails said...

More good news!! Glad so many folks are expending their resources in rebuilding the smaller details.

In our research we only have to connect the dots from Indian subcontinent to Siberia during the late paleolithic and everything will fall into place. Which no doubt Kostenki 14 and Ust-Ishim have done to an extent.

capra internetensis said...

The paper is mainly about mtDNA. Soares et al have been doing a whole series of these in different parts of the world, trying to sort out the different layers of maternal relationship. In this case they have looked into genome wide and paternal correlations as well, but there is not going to be anything new there.


@Rami

BMAC is only one, late stage of the Namazga culture. There was plenty going on earlier but at that stage it wasn't the BMAC.

postneo said...

@Al Bundy
I don't pay attention to linguistic assumptions in these papers. I am talking about the assumption that R1a and z93 in south asia is well understood and that the handful of modern samples are adequate.

The structure indicates otherwise.

Nathan Paul said...

This is a dead cat bounce. Indus-Mehrgarh-BMAC-Andronovo: they are right next to each other. Current South Asian political boundaries are being used for one's own gratification.

India is both a source and recipient. People mix like Pizza dough.

Where are they going to put Y G, L, T, R2 boundaries? Same goes for J2.

capra internetensis said...

@Nathan Paul

It is 4000 km from Andronovo (the type site) to Merv Oasis, by modern roads. Or let's say from Petrovka - 2700 km.

Then it is about 1400 km from Merv to Mehrgarh, and another 1200 km from Mehrgarh to Rakhigarhi.
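Adding up the legs quoted above gives the end-to-end distances. This is a small sketch only; it assumes the Petrovka figure is also measured to the Merv Oasis, and the constant and function names are invented for illustration.

```python
# Leg distances in km, as quoted in the comment (modern road distances).
ANDRONOVO_TO_MERV = 4000
PETROVKA_TO_MERV = 2700  # assumption: measured to Merv, like the Andronovo leg
MERV_TO_MEHRGARH = 1400
MEHRGARH_TO_RAKHIGARHI = 1200


def route_length(*legs_km):
    """Total length of a route given its individual legs in km."""
    return sum(legs_km)


from_andronovo = route_length(
    ANDRONOVO_TO_MERV, MERV_TO_MEHRGARH, MEHRGARH_TO_RAKHIGARHI
)
from_petrovka = route_length(
    PETROVKA_TO_MERV, MERV_TO_MEHRGARH, MEHRGARH_TO_RAKHIGARHI
)

print(f"Andronovo to Rakhigarhi: {from_andronovo} km")
print(f"Petrovka to Rakhigarhi: {from_petrovka} km")
```

On these figures the full chain runs to about 6600 km from Andronovo, or about 5300 km from Petrovka.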

Matt said...

"Maternal lineages primarily reflect earlier, pre-Holocene processes, and paternal lineages predominantly episodes within the last 10 ka." seems fine, but is universal everywhere and not particular to India, taking Karmin et al as true.

Though I'm a bit shady on whether these kind of studies can actually tell if the mtdna pool comes from India during the pre-Holocene. I'm not sure if they've had a good track record in Europe, by comparison.
Also would say:

"This greater impact in Europe is also reflected in the genome-wide picture. In Europe, although the CHG component is only 10–15% in most populations, it is thought to have been accompanied by a similar fraction of indigenous Mesolithic European lineages from the steppe, seen in Yamnaya samples [53]. This component does not seem to have spread significantly east and south into Central and South Asia, however [76]."

...

In the Subcontinent, the Levantine component is (like the European Mesolithic component) minor, due to a deep east–west separation across the Fertile Crescent prior to the spread of the Neolithic [75]. As a result, both the Southwest Asian source for the Late Palaeolithic/Early Holocene and the Steppe/Central Asian source for the Bronze Age largely share the same ancestral pool, which may have arisen in the region of the Caucasus and eastern Fertile Crescent and expanded both north and south during the later Neolithic and Early Bronze Age [74, 75, 95].


Seems confused, e.g. in that it implies Europeans only have 20-30% Steppe ancestry, which is a little at odds with much of the formal literature that we have.

"Furthermore, in the case of Europe, the major stages are simpler to disentangle from the genome-wide evidence. This is because the distinctiveness of the Levantine source for the Early Neolithic, compared to the Pontic-Caspian steppe, gives most European populations a clear tripartite ancestry that is less evident in South Asia." is also a rather odd comment - it was not particularly simple to disentangle, we just *actually* have adna there!

Nathan Paul said...

Capra,

Fact is, the Indus area is the most widespread in history by population or sq km. You cannot just separate Merv by 100 km or 1400 km from that region. Mehrgarh is not at the edge. Look at it the other way: Iranian and Indus settlements are 200 km apart. Their genetics are a lot different in some ways and a lot similar in some ways. But your 4000 km Andronovo, BMAC, Indus spread is more similar.

Davidski said...

@postneo @Nathan Paul

Maybe try some coherent arguments for a change?

Davidski said...

@Matt

The authors of this paper aren't exactly experts when it comes to genome-wide DNA.

Note, for instance, that they ran the PCA without lsqproject: YES for Yamnaya, causing some of the Yamnaya samples to pull towards 0 in both dimensions due to missing markers.

I just skimmed the parts of this paper pertaining to autosomal DNA. The really interesting and competent parts of this paper deal with uniparental markers.
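The missing-marker effect described above can be illustrated with a one-axis toy example. This is a sketch of the general idea only, not EIGENSOFT's actual implementation of `lsqproject`; the loadings, the sample and both function names are made up. A plain dot product treats missing markers as zero and shrinks the score toward 0, whereas a least-squares fit over the observed markers recovers it.

```python
import math

# Toy loadings for one principal axis over 8 markers, normalised to unit length.
raw = [0.9, -0.4, 0.7, 0.2, -0.8, 0.5, -0.3, 0.6]
norm = math.sqrt(sum(w * w for w in raw))
loadings = [w / norm for w in raw]

TRUE_SCORE = 2.0
# A noise-free sample lying exactly on the axis: centred genotype = score * loading.
sample = [TRUE_SCORE * w for w in loadings]

# Pretend the last four markers were not called (None = missing).
observed = sample[:4] + [None] * 4


def naive_projection(x, w):
    """Dot product with missing markers skipped, i.e. treated as 0."""
    return sum(xi * wi for xi, wi in zip(x, w) if xi is not None)


def lsq_projection(x, w):
    """Least-squares score fitted on the observed markers only."""
    pairs = [(xi, wi) for xi, wi in zip(x, w) if xi is not None]
    return sum(xi * wi for xi, wi in pairs) / sum(wi * wi for _, wi in pairs)


naive = naive_projection(observed, loadings)
lsq = lsq_projection(observed, loadings)

print(f"true score: {TRUE_SCORE}")
print(f"naive projection: {naive:.3f}")
print(f"least-squares projection: {lsq:.3f}")
```

In this toy case the naive score lands at roughly half the true value, which is the kind of pull toward the origin the comment attributes to the low-coverage Yamnaya samples.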

Ryan said...

This is really interesting:

The other major member of haplogroup R in South Asia, R2, shows a strikingly different pattern. It also has deep non-Subcontinental branches, nesting a South Asian specific subclade. But the deep lineages are mainly seen in the eastern part of the Near East, rather than Central Asia or eastern Europe, and the Subcontinental specific subclade is older, dating to ~8 ka [55].

There was possible R2 in the Iranian Chalcolithic samples, correct? So that would make R2 an Elamo-Dravidian possibly? Would that mean R2 was in HG populations in Iran though - meaning some sort of early and minor ANE migration there?


Davidski said...

R2 is likely for one or two Iranian Neolithic samples.

http://eurogenes.blogspot.com.au/2016/06/the-genetic-structure-of-worlds-first.html

These Iranian Neolithic farmers and the Iranian Hotu forager all have a lot of ANE (30-40%), so finding R2 in them isn't surprising. I also expect some Q when more samples from Neolithic Iran come in.

Simon said...

How valid is the idea that CHGs are part ANE, part Basal? Is it possible that there was no such admixture, and that the clinal position of CHGs is due to common descent?

Davidski said...

@Simon

It's valid, because in Central Anatolia and the Levant we have Epipaleolithic forager-farmers that lack ANE.

So ANE somehow managed to infiltrate the highland areas of West Asia, but did not move further west until the Chalcolithic.

By the way, CHG is actually a mixture of ANE, Basal Eurasian and something related to WHG. So I think it's likely that before ANE moved into highland West Asia, this region was populated by foragers similar to the Boncuklu Central Anatolian forager-farmers.

Rami said...

How much WHG does CHG have? Iran_N has much more ANE right ?

Davidski said...

CHG and Iran_N are basically the same, with around the same amount of ANE and other stuff, like some sort of mysterious East Asian or South Asian-like components, except Iran_N is more Basal Eurasian, and seems to have a slightly lower WHG-related ratio compared to the other non-Basal stuff.

Aram said...

What is that B110 SNP in Kyrgyz?
I can't find it in YFull or in ISOGG.

Also Kyrgyz and Altaians are quite close to Scythian in Yfull.


https://www.yfull.com/tree/R-S23592/

Matt said...

Davidski: I just skimmed the parts of this paper pertaining to autosomal DNA. The really interesting and competent parts of this paper deal with uniparental markers.

Yes, though y-dna isn't anything new (no new data, no new tree), and the mtdna I'm... skeptical of their date estimates without mtdna (all the mtdna they believe is pre-Neolithic could have been from outside India coming in with the Neolithic even - I don't really have a strong belief they have a reliable way to know).

Davidski: By the way, CHG is actually a mixture of ANE, Basal Eurasian and something related to WHG. So I think it's likely that before ANE moved into highland West Asia, this region was populated by foragers similar to the Boncuklu Central Anatolian forager-farmers.

Though CHG is usually richer in Basal Eurasian in Lazaridis's formal stat, so might be more like the Levant_HG / Natufians?

Coldmountains said...

Interesting that we have M780* from Ukraine and L657* from Tajikistan. L657 was also recently found among Tatars, so "Indian" L657/M780 still exists in East Europe. It is just much rarer than Z2124, but modern-day y-dna in this region is very much bottlenecked.

Davidski said...

Though CHG is usually richer in Basal Eurasian in Lazaridis's formal stat, so might be more like the Levant_HG / Natufians?

Yeah, that's true. CHG is too basal to be modeled successfully as Boncuklu Neolithic/MA1, but even too basal to be modeled as Natufian/MA1.

So I suppose we'd need a reference like Boncuklu Neolithic or the Natufians, but more basal than the Natufians.

Jaydeep said...

Matt,

"Yes, though y-dna isn't anything new (no new data, no new tree), and the mtdna I'm... skeptical of their date estimates without mtdna (all the mtdna they believe is pre-Neolithic could have been from outside India coming in with the Neolithic even - I don't really have a strong belief they have a reliable way to know)."

Though I do not think this is a very good paper and I have many reservations about their assumptions, I think that on this point of yours the authors have covered the bases very well.

Please refer to Additional file: figure S2. You can see that the date of arrival of the 'West Eurasian' mtDNA into South Asia is actually the date of expansion of the Indian specific mtDNA clades which are not found outside of South Asia. It is therefore highly unlikely that these Indian specific 'West Eurasian' subclades first expanded outside South Asia and then after thousands of years came to South Asia during the Neolithic with all of its initial diversity intact but losing all trace of it outside South Asia.

In fact, the dates for arrival of these 'West Eurasian' mtdna clades into South Asia, as proposed in this paper, are in all likelihood very conservative. There are other 'West Eurasian' mtDNA subclades which are shared between the Near East and South Asia which have an earlier age of expansion. There is no reason to believe that they came to South Asia only during the Neolithic.

What is clear from the paper is that 'West Eurasian' mtDNA and therefore 'West Eurasian' y-dna have a deep palaeolithic presence in South Asia and there is no reason to propose that all of it came after the Neolithic.

Davidski said...

It is therefore highly unlikely that these Indian specific 'West Eurasian' subclades first expanded outside South Asia and then after thousands of years came to South Asia during the Neolithic with all of its initial diversity intact but losing all trace of it outside South Asia.

They probably expanded within West Eurasian populations that are no longer extant in West Eurasia.

As we know from ancient DNA, human populations routinely went extinct in prehistoric times and left descendants in highly admixed forms, often far away from their homelands.

What is clear from the paper is that 'West Eurasian' mtDNA and therefore 'West Eurasian' y-dna have a deep palaeolithic presence in South Asia and there is no reason to propose that all of it came after the Neolithic.

Nope, it's highly unlikely that we'll see any West Eurasian admixture in South Asia before the Neolithic. And certainly no R1a before the late Bronze Age.

Gioiello said...

Jaydeep is completely wrong in what he says: mtDNA lineages of Western European origin, which entered India before R1a, expanded again with the R1a expansion. In Italian we speak of "le mosche cocchiere", i.e. if flies are on a yak and the yak moves, the flies move too.

Jaydeep said...

That's your own dogma. Please stick to the facts and not your preferred narrative. When there is no proof of existence of Indian specific 'West Eurasian' mtDNA subclades outside South Asia, how can you say that their origin lies outside South Asia? Based on what? The dates of expansion of the Indian specific WE mtDNAs are calculated based on the subclade's diversity within South Asia. It would be very strange if a subclade originally expanded in Anatolia or Iran around 20 kya and then reached South Asia around 10 kya, preserving all the diversity of its 20 kya expansion while losing all trace in Anatolia or Iran.

As Indian populations are sampled more and analysed at greater resolution, all of the bogus theories about South Asian population history are going to bite the dust. Quite simply, the ANI is not intrusive to South Asia during the Neolithic. It is most likely far older.

Davidski said...

When there is no proof of existence of Indian specific 'West Eurasian' mtDNA subclades outside South Asia, how can you say that their origin lies outside South Asia?

Well, it's really easy.

South Asians can be modeled successfully as mixtures of ancient West Eurasian populations that lived outside of South Asia during the Neolithic or later.

So there's no need to posit migrations from West Eurasia to South Asia during the Paleolithic, or the existence of some sort of pseudo-West Eurasian shared ancestry, based on modern-day genetic diversity, which can be very deceptive.

If you want to prove that West Eurasian ancestry existed in, say, Paleolithic South Asia, then you gotta do it with ancient DNA.

Jaydeep said...

It doesn't seem to occur to you, David, that just like elsewhere, South Asia also would have had many mtDNA and y-DNA clades during the palaeolithic which are now extinct in South Asia but present elsewhere. So the argument cuts both ways. It's better if we stop speculating about the unknown.

Jaydeep said...

South Asians can be modeled successfully as mixtures of ancient West Eurasian populations that lived outside of South Asia during the Neolithic or later.

How do you know that populations similar to Iran Neolithic / CHG did not live in South Asia before the Neolithic ? Do you have any Mesolithic/Palaeolithic aDNA from South Asia that proves your case ? The present paper clearly shows that the age of several 'West Eurasian' clades present in South Asia goes well into the Palaeolithic. Can you cite me a paper that comes to a different conclusion ? Let me refresh your memory a bit. There was a paper on India-specific y-dna J2 subclades about 2 years back and that paper too argued that Indian J2 clades show a pre-Neolithic separation from West Asian/ Near Eastern J2 clades.

Further, the Iranian Neolithic is clearly related to South Asian Neolithic which is far off to its East but it is very distant from the geographically closer Anatolian & Levantine Neolithic. It is very probable that the origins of Iran Neolithic lies far to its east, somewhere close to South Asia. We just need more research on the Neolithic in South Asia and Eastern Iran, which are not as well researched as West Iranian Neolithic.

Davidski said...

All we need is one Mesolithic genome from North India that has no West Eurasian admixture, and I'm proven right.

postneo said...

This is an asymmetric study on mtdna and ydna.
There was the recent paper with clades of Q which is a better example of complementary cladal analysis on the y side.
We need to look at R2, Q3, H, L, R1a on the y side in a similar manner. They have done that to a limited extent with R1a in this paper with no fresh samples from south asia.

Davidski said...

South Asian R1a is irrelevant, since we already have Mesolithic R1a samples from Northeastern Europe.

There's no way that R1a can be a Mesolithic Northeastern European lineage, and at the same time indigenous to South Asia or even Iran.

It obviously arrived in South Asia during the Bronze Age mostly as R1a-Z645(Z93+), in a population closely related to the early Corded Ware rich in R1a-Z645.

PF said...

Davidski said...
"CHG and Iran_N are basically the same, with around the same amount of ANE and other stuff, like some sort of mysterious East Asian or South Asian-like components, except Iran_N is more Basal Eurasian, and seems to have a slightly lower WHG-related ratio compared to the other non-Basal stuff."

Perhaps these mysterious South and East Asian signals are related to an ANE-like group that admixed into both East Asians and EHG. In this case some kind of ANE/EHG hybrid probably existed in West Asia for a long time.

Basal ancestry was brought later when the climate opened up around ~15K years ago. I propose that a Basal-related group came out of the Gulf refugium and mixed with this ANE/EHG hybrid around western Iran to form something similar to CHG, which seems to be foundational to Caucasians, Iranians, and many Central and South Asians.

Moreover, this Basal group could have been a relic of the original out-of-Africa population en route east, existing along a drift path from original Basals --> ASI.

capra internetensis said...

@Aram

B110 is YP1505.

Rami said...

@jaydeep, clearly a MA1-like population brought West Eurasian DNA, but it's more complicated given the fact that MA1 itself does have an ASI/ASE component. It seems there was definitely a now-extinct SC Asian hunter-gatherer population, which I have said, and which even Sein has inferred from his models.
The ANI model is dead IMO, as it was thought ANI was one source population which mixed only 4000 years ago. It's definitely three now: Iran_N/Hotu, Steppe Indo-Iranians, and some yet-to-be-discovered SC Asian population. In the case of Iran_N/Hotu and these SC Asian hunter-gatherers, the mixing occurred far earlier.

It is NOT possible for Iran_N to have originated in South Asia, given they're so Basal-rich.
They could have lived there 10-11 kya, but ultimately they came from the Iranian plateau or regions bordering SC Asia and Iran. The Iran_N component peaks in SW Pakistan, where the first Neolithic settlements in Mehrgarh and the wheat varieties found are clearly West Asian in origin.

Rami said...

"CHG and Iran_N are basically the same, with around the same amount of ANE and other stuff, like some sort of mysterious East Asian or South Asian-like components, except Iran_N is more Basal Eurasian, and seems to have a slightly lower WHG-related ratio compared to the other non-Basal stuff."

That's a volte-face from what you used to say before, because you were very firm on CHG and Iran_N being different due to their uniparental markers. Based on Genetiker's phenotype analysis, they also differed physically.

So is it possible CHGs are Iran_N who moved into the Caucasus, or are Iran_N CHGs who moved into the Iranian Plateau?

TruthPrevails said...

@Rami

I agree with what you are trying to say, however it will be considered hypothesis.

The big problem is, it's like the wild wild west, where the finders are keepers. And that is exactly what has happened due to the well-preserved aDNA which has been found up north. And which has led to all these silly nomenclatures like EHG, WHG, CHG, ANE etc. You get the hang.

Till it's proven that the aDNA found in northern areas actually originated in South Asia, this is going to be a never-ending debate, and till then the wild west prevails, and the finders will be the keepers.

John Smith said...

It seems likely that prehistoric India had a strong connection to Paleolithic Europe. Kostenki had U2 and C1b, both found in India. U2 is Western Eurasian and seems to have been in India for a long time. C1a2 has been reported in Nepal. The type of K2a found in Oase seems to be closest to X only found in India. However, ancient DNA suggests R2 came from Neolithic Iran, R1a from the Bronze Age steppe, L was the Indus Civilization, and most of India had H and C1b before 8 kya. H was associated with M and C1b with U2; the mtDNA in India has probably changed only a little since the Paleolithic.

Seinundzeit said...

PF,

That is a brilliant proposal.

In fact, I'd say that this line of thinking is completely warranted, looking at the current evidence.

Rami,

I wouldn't really say that MA1 has any ASI ancestry.

Rather, South Asian populations have extra ANE. So, when MA1 is parsed in the context of contemporary variation, it looks like he is partially South Asian (and also partially Native American, and partially Northern European).

The same occurs with the Neolithic Iranians, as they look heavily South Asian in unsupervised ADMIXTURE, but I really doubt that they have any ASI admixture.

Just so that we're on the same page, when I say "ASI" I mean the ENA ancestry seen in South Asia (which is apparently neither East Asian-like, nor Andamanese-like, but rather from a distinct clade within the ENA group, as per Reich and Lipson).

Honestly, I think PF's ideas are of considerable interest.

Also, I do broadly agree with you, when it comes to the different possible waves of West Eurasian ancestry in South Asian populations.

I'd sum up the history of "ANI" like this:

Perhaps, during the Mesolithic or even Upper Paleolithic, there might have been a forager population in South Central Asia that was on the CHG/Iran_Neolithic/Iran_Hotu continuum, but with more ANE than those samples, and with even less of the WHG-related ancestry seen in those samples.

If so, the analyses I've done seem to suggest that this ancestry survives strongest in South India (although, all South Indians, with exception to Brahmins and people like the Paniya/Pulliyar, are also 40% ASI. Brahmins are 25%-30%, while Paniya/Pulliyar are 60%/50% ASI). This explains the 16% Yamnaya admixture seen in low-caste South Indians; it's an artifact of extra ANE, from an ancient population quite similar to Iran_Hotu.

Later, the Neolithic wave into South Asia might have involved populations on that very same continuum of CHG/Iran_Neolithic/Iran_Hotu, but much more similar to the current samples that we have.

In fact, I wouldn't be surprised if those people were virtually identical to the Iran_Neolithic samples.

If so, this ancestry survives strongest in southern Iran, Balochistan, and Sindh. Essentially, the Elamo-Dravidian homeland (and its South Asian environs).

Now, looking at South Central Asia, much of the ancient Iranian ancestry isn't really from this Neolithic wave.

Rather, it is much more similar to Iran_Chalcolithic, but with a CHG-related shift. In fact, Pamiri Tajiks have 0%-5% Iranian Neolithic ancestry, almost all of their ancient Iranian plateau heritage is basically like Iran_Chalcolithic/CHG.

Pashtuns and Afghan Tajiks show a mix of this and Iranian_Neolithic, although with a shift towards Iran_Chalcolithic/CHG.

I'd say Namazga/BMAC, and the less studied archaeological cultures of Afghanistan, were more Iran_Chalcolithic/CHG-related, rather than Iran_Neolithic-related, which is why this ancestry survives strongest in Tajikistan/Uzbekistan/Turkmenistan and Afghanistan/western Pakistan (basically, South Central Asia).

Finally, the last wave of what some call "ANI" definitely involved steppe populations.

I find that Indians prefer the Srubnaya_outlier, but with a little Sintashta/Andronovo/Srubnaya.

So, I think we'll find that a population intermediate between Srubnaya_outlier and "Steppe_MLBA" was responsible for the spread of Indo-Aryan languages, and R1a, into the Subcontinent.

South Central Asians (of Iranian linguistic extraction) prefer Scythians/Sarmatians, on top of a Srubnaya_outlier "base" (remnant of Indo-Aryan ancestry in South Central Asia). These later Iranian waves from the steppe occurred during the historical era, and people like Pashtuns, Ormur, Parachi, and the Pamiri Tajiks derive a huge portion of their genetic ancestry (and much of their culture) from these historical migrations.

I'll eventually post some models here, to demonstrate all of this.

Onur Dinçer said...

@LiePrevails

Till it's proven that the aDNA found in northern areas actually originated in South Asia, this is going to be a never-ending debate, and till then the wild west prevails, and the finders will be the keepers.

Hogwash! Native populations from the northern areas lack the South Asian component in ADMIXTURE analyses and also lack the native South Asian uniparental markers. This is the case in ancient genomes too, so not a recent development for sure.

TruthPrevails said...

Agree about kostenki, however the claimed ownership of Mal'ta aDNA is the root cause of the problem.

And most genetic studies try to confirm with foolish theories put forth by witzel/anthony and group, even though the descendant split and spread timings do not match with their stated timelines of their theories.

As you can imagine the change in origin/ownership of R descendants can and will result in significant changes to history of language and culture, which will need to be revisited.


TruthPrevails said...

@Onur Dincer

I am sure you are aware that programs like ADMIXTURE are manual input programs, so the saying "garbage in, garbage out" applies very well to them.

In other words, the quality of the reports is equivalent to the quality of the input. Since there is no high quality, verified ancient DNA from South Asia, the output is not very relevant.

Onur Dinçer said...

@LiePrevails

Formal test results confirm ADMIXTURE results pointing to the lack of South Asian ancestry in the northern areas, so there is nothing open to debate there. Ancient genome results from South Asia won't change this.

TruthPrevails said...

you wish... it will change a lot of things

Onur Dinçer said...

@LiePrevails

"you wish... it will change a lot of things"

Maybe, but not in the way you hope.

For the king said...

Wow, what a massive cuckolding event! It seems most of Europe, South Asia and much of Central Asia were cuckolded by Indo-Europeans. It seems that the Middle East*, parts of the Caucasus and parts of Balochistan were the only West Eurasian areas not dominated (majority) by Indo-European-linked Y-DNA. That is probably because locals in those regions had superior military qualities compared to early Indo-Europeans, large, strong population centers and, in the case of Balochistan and some Caucasus areas, isolation. This makes sense because the Mitanni/Hittites were defeated or absorbed by local populations. Persians and Medes were either subjugated by local groups or allied to them. They eventually became culturally dominant, but their cultures were already localized and too foreign to other Indo-Europeans in many ways (and that explains the constant wars with eastern Iranics, which often resulted in decisive west Iranic victories).

*Only Some Armenian/Assyrian groups seem to be over 50% R1b in the ME, but they are relatively small in population.

capra internetensis said...

So many retarded comments...

The proposed expansion of M subclades from central and eastern India 45-35 kya with a major expansion of M4'67 at ~38 kya is interesting. The latter agrees with the estimated age of H1-M69 at ~40 ky. Maybe associated with the spread of the microlithic Indian Upper Palaeolithic?

According to Mishra this basic technology is the basis of the Upper Paleolithic-Mesolithic into the Holocene, at least in Peninsular India. So were these guys ASI? On the other hand we also have core-and-flake tools in the Himalayan foothills (and East India?) resembling the Hoabinhian and other such very primitive-looking stone tool industries from East and Southeast Asia. That would very plausibly be related to ENA ancestry.

I can find very little about the archaeology of Upper Palaeolithic South-Central Asia or the NW of the subcontinent. Anyone know how that might line up?

Carlos Aramayo said...

@Davidsky

I agree that R1a could have entered South Asia in the Bronze Age, but not necessarily the Late Bronze Age. Archaeological (and linguistic) hypotheses by Asko Parpola point to many waves of Indo-Aryans coming to South Asia. These people could have been there since the Harappan 3C period (2200 to 1900 BC).

Ryan said...

@David - "So there's no need to posit migrations from West Eurasia to South Asia during the Paleolithic, or the existence of some sort of pseudo-West Eurasian shared ancestry, based on modern-day genetic diversity, which can be very deceptive."

With respect, I think on the mtDNA side at least you still do need pseudo-West Eurasian, since haplogroup R seems to be South Asian in origin.

"So ANE somehow managed to infiltrate the highland areas of West Asia, but did not move further west until the Chalcolithic."

You may want to revisit your dismissal of Underhill's suggestion of basal R1a in the Zagros then. It wouldn't be an origin for the IE branches of R1a, but some sort of early R1a wandering into Iran seems more plausible now, even if that branch isn't ancestral to very many modern carriers of R1a.

Davidski said...

Why does the pre-Indo-European ANE in West Asia have to be linked to R1a? Why not just R2 and maybe Q?

Rami said...

MA1 and Ust-Ishim do have archaic connections with South Asia. Whether that ancestry came from South Asia or not is yet to be determined, but there is an archaic connection. That SC population was either MA1-like or Iran_Hotu-like, but at 20-25 kya it would likely be MA1- or AG3-like.

Afghan Pashtuns are complex. No doubt Pashtuns in the south would have some Chalcolithic ancestry via Jiroft, but most of their non-steppe ancestry is still Iran_N. Another thing to note: large swathes of where Pashtuns live now were populated by Indo-Aryan-speaking populations which got Pashtunized, so their steppe ancestry would consist of two types, one from the Bronze Age and one from Antiquity. Ironically, the Indo-Aryan cultural tool kit emanated out of areas Pashtuns live in.

I doubt the BMAC was populated with Iran_ChL-rich people; otherwise you would see a good amount of it in northern Pakistanis, who lack or have very low amounts of Barcin ancestry. Whatever bits of Barcin-like ancestry they have are mostly coming via steppe nomads. It's possible the area got repopulated afterwards by new Iran_ChL-like groups who mixed with fresh waves of Eastern Iranian groups arriving. But for the most part, the non-steppe ancestry of those Pakistani Pashtuns is largely derived from Neolithic Iranians, just like that of their Sindhi, Balochi and Punjabi neighbours. I would posit the same for most of those Pashtuns in Laghman, Kunar and Nangarhar, who are not much different from their relatives across the border.

Uzbekistan/Tajikistan/Kazakhstan are regions where populations were largely replaced with new groups, as the populations in that area were largely decimated in the 12th and 13th centuries, so getting ancient genomes would be the only way to confirm any speculations.




Davidski said...

You're just interpreting modern DNA in the way that suits you.

There's no hard evidence of the presence of ANE or any sort of AG3/MA1 related ancestry in South Asia during the Mesolithic. As far as we know for now, Mesolithic South Asians were an East Eurasian branch parallel to those of the Onge and East Asians.

Considering the high level of ANE among Iran_N and Iran_Hotu, it's plausible that all of the ANE present in South Asia today arrived there during and after the Neolithic.

Rami said...

@ Carlos

Actually they were there by 2100-1900 BC, as you start seeing a shift in grave burials and granary structures in Swat, as well as in the derivative Cemetery H.
They came in waves across the Khyber. The IVC was in rapid decline by that time and the cities were largely abandoned by 1700 BC, with populations moving eastward and leading a more rural and agricultural existence. There was a cultural vacuum, and the Indo-Aryans rapidly filled it with a very male-dominated society.

Seinundzeit said...

@Rami

"MA1 , Ushtishim do have archaic connections with South Asia..."

You often say this, but it makes no sense.

Ust-Ishim doesn't have any special connection with ANE; I honestly have no clue as to where you get this idea.

"Either that SC population was MA1 like or an Iran Hotu like , but at 20-25 Kya, it would likely be MA1 or AG3 like."

If you read what I've said, that's exactly what I've been claiming. Although, the manner in which you've described it is quite confused/muddled.

"Afghan Pashtuns are complex , no doubt Pashtun in the South would have some Chalcolithic ancestry via Joroft , but most of their non Steppe ancestry is still Iran_N. Another thing to note , large swathes of where Pashtuns stay now were populated by Indo Aryan speaking populations which got Pashtunized so their steppe ancestry would consist of 2 types one from the Bronze and one from Antiquity. Ironically the Indo Aryan cultural tool kit emanated out of areas Pashtuns live in .

I doubt the BMAC was populated with Iran_CHl rich people , otherwise you would see a good amount of it in Northern Pakistanis , who lack or have very low amounts of Barcin ancestry . Whatever bits of Barcin like ancestry are mostly coming via Steppe Nomads. Its possible it got repopulated after by new Iran Chl like groups who mixed with fresh waves of Eastern Iranian groups arriving. But for the most part most of those Pakistani Pashtun's non Steppe ancestry is largely derived from Neolithic Iranians , just like their Sindhi, Balochi and Punjabi neighbours. I would posit the same for most of those Pashtuns in Laghman,Kunar and Nangarhar , who are not much different from their relatives across the border."

Most of this is incorrect.

Much of what you're saying here is actually a regurgitation of claims that I myself have made.

Turns out that I've often been wrong, so I feel that I have an obligation to explain to you how my old ideas are incorrect.

I have 3 Pakistani Pashtun samples, 5 Afghan Pashtun samples, and the samples from academic data-sets (Pashtun_Afghan, "Pathans"). I've been examining them using the PCA data-sheet for quite some time now, and a few consistent patterns have emerged. These are pretty airtight findings.

With the exception of some Pashtuns like Chagarzai from Buner or true Yusufzai from Swat (many "Yusufzai" aren't ethnically Yusufzai, but have complex roots, like Malala Yusufzai's father, who comes from a family of mullahs/teachers. It's complicated stuff, outside the scope of what we're currently discussing), all Pashtuns have very substantial Iran_Chalcolithic/CHG-related ancestry. It often exceeds the Iran_Neolithic-related ancestry.

The Kalash, Pashayi, Kohistanis, and Pashtuns with substantial Dardic/Indo-Aryan ancestry (like Chagarzai, or true Yusufzai from Swat) don't have much, if any, Iran_Chalcolithic.

But all other Pashtuns (including Pakistani Pashtuns in Mohmand Agency, Khyber Agency, the frontier region near Peshawar, etc), Chitrali people, and basically all other South Central Asians show very substantial Iran_Chalcolithic. This includes northern Pakistanis, even the Burusho.

I've tried countless different setups, experimented with different dimensions, tried using weighted eigenvalues, and have also begun to implement the scaling procedure described by Matt/Alberto.

Despite me playing around with these different methods, the results are the same; what I just described is how things look.

Rami man, you always make very confident assertions, but you never back them up with any analyses.

By contrast, I keep an open mind, and have no problem with being wrong. And, once proven wrong, I immediately, and quite shamelessly, revise/reevaluate my opinions.

Try to do the same. It's science, not religion.

And yes, until we see aDNA from Tajikistan/Afghanistan/Northern India, everything is tentative, and any of us could be wrong. Oh well.

Rami said...

"You're just interpreting modern DNA in the way that suits you."

You do it all the time. It's not about what suits me; it's putting 2 and 2 together. Common sense is not that hard, you know.
No population is ever a monolith, and messy ones like those found in South Asia would likely be no different.
There is no genome from the Mesolithic, but the paper reports an intrusion of West Eurasian-related mtDNA 20-23 kya, so it is almost certain that SC Asia was harboring an ANE-rich population, whether Iran_Hotu-like, MA1-like or intermediate.

Coldmountains said...

If Iran_Neolithic had high ANE, then Neolithic Afghanistan/Hindu Kush had high ANE for sure. There is still much R2 in this region, and SC Asia sits right between Iran, India and Siberia. South Asia/India is another story, but we know that the farmers who arrived in India before the steppe Indo-Europeans were Iran_Neolithic-like, so they could already have brought some ANE-related ancestry. Rakhigarhi was mainly L and R2 (according to some rumors). Haplogroup L looks like it was brought to South Asia from West Asia by an Iran_Neolithic-like population, and if it is true that R2 was found in Rakhigarhi, ANE-related ancestry could already have existed in NW South Asia before the Indo-Europeans. But yes, much of this is my speculation and we need ancient DNA.

Matt said...

On autosomals, here are some comparable Fst scores tables between South Asian populations and between European populations as a complement to this discussion:

South Asian Populations - http://i.imgur.com/SSyTOyq.png (sorted by differentiation from Brahui)

European Populations - http://i.imgur.com/lkuSsNB.png (sorted by differentiation from Sardinians)

(The above are taken from Chaubey 2016 and Lazaridis 2016, but are equivalent, as can be seen from comparing the European values from Lazaridis to the subset of European values from Chaubey - http://i.imgur.com/1pIoni4.png - which are the same for crossover populations).

Differentiation between the following European vs South Asian populations is about comparable:

* Sardinian-Lithuanian comparable to Gond-Brahui
* Greek-English comparable to TN_Low_Caste-Pathan
* Ukraine-Irish comparable to Brahmins_UP-UP_Low_Caste
* Sicilian-Ukrainian comparable to UP_Low_Caste-Pathan
* Sicilian-Scottish comparable to Gond-Brahmins_UP
* Paniya-any other population is beyond the European range

Differentiation between the Brahmin and Low Caste groups is within the mainstream European range; differentiation between Brahmins and some scheduled tribes is at the very highest end of European differentiation; and differentiation between the least differentiated tribal groups and the NW frontier is not matched in Europe except by the Sardinian-Lithuanian/Finnish comparison.

(Note - A larger set of South Asian populations here - http://i.imgur.com/FwLhLDb.png. The table above has removed some outliers whose Fsts seem far inflated beyond any position on a cline).

(Limitations of Fst disclaimer etc.)

This is not meant to imply anything, I just thought it would be interesting to put down here, as a comment on the relative degree of differentiation of each.

Coldmountains said...

@Sein

Very interesting. I agree, it seems that BMAC also had archaeological links with NW Iran and represented a newer wave of farmers from the West.

"Cranial analysis by B. Hemphill and A. Christensen (1994) from the Bactrian cemeteries of Sapalli and Jarkutan have shown that the Bactria-Margiana Archaeological Complex (BMAC) was formed from a population migrating from north-west Iran, and its creators did not migrate afterwards into the Indian subcontinent"

"The BMAC mythology forms part of the circle of ancient Near Eastern beliefs and, judging by the artistic images, is closest to the pantheon of Elam (Amiet 1977, 1997; Francfort 2001; Litvinsky 1989; Klochkhov 1997; Antonova 2000)"

So BMAC could have brought Iran_Chalcolithic ancestry. The West Asian ancestry (excluding the ancestry which is likely steppe derived) of Tajiks, Pashtuns and also Uyghurs seems to be more EEF/Western-shifted than that of Kalash, Dardics and many South Asians.

Davidski said...

@Coldmountains

R2 is found in Neolithic Iran without accompanying Steppe_EMBA-like ancestry, so why not in Neolithic South Asia?

Balaji said...

According to the commonly held view, subscribed to by the authors of this paper and Davidski, hordes of “Indo-Aryans” derived from the Andronovo people invaded India around 1500 B.C. and substantially altered the genetic landscape. Using qpAdm, Davidski has calculated that Brahmins from U.P. have about 48% ancestry from the “steppe”.

http://eurogenes.blogspot.com/2017/03/bring-it-on.html

We know that the Reich Lab people have results based on aDNA ready to be published and that Nick Patterson gave a talk a couple of weeks ago on this.

https://www.shh.mpg.de/events/9234/369480

But they have not published their paper yet or even put it on bioRxiv. I suspect that this is because the results are shattering to the conventional wisdom and they want to double-check everything before publishing.

Here is another reason to doubt the conventional wisdom.

http://www.ias.ac.in/article/fulltext/jgen/092/01/0135-0139

This paper is about pigmentation genes in several different groups in India and the results are summarized in Table 3 of the paper. In particular, we can consider the Kanyakubja Brahmins of north India. The prevalence of the derived allele of SLC24A5 for them is 88% but of the derived allele of SLC45A2 only 12%. The Andronovo people by about 1500 B.C. probably were like modern Europeans who have only the derived allele for both SLC24A5 and SLC45A2. If the Kanyakubja Brahmins derived 48% of their ancestry from the “steppe”, the derived allele of SLC45A2 should have been much higher than 12%.
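The arithmetic behind this argument can be sketched as a simple two-source mixture. This is only an illustration under the comment's own assumptions: the 48% steppe share and the 12% observed frequency are the figures quoted above, fixation of the derived allele in the steppe source is the commenter's premise, and the function names are made up for this sketch. It ignores drift and selection after admixture.

```python
def expected_frequency(steppe_share, p_steppe, p_other):
    """Expected allele frequency in a two-source mixture (no drift, no selection)."""
    return steppe_share * p_steppe + (1.0 - steppe_share) * p_other


def implied_source_frequency(observed, steppe_share, p_other=0.0):
    """Frequency the steppe source would need for the mixture to match `observed`."""
    return (observed - (1.0 - steppe_share) * p_other) / steppe_share


steppe_share = 0.48  # qpAdm estimate quoted in the comment
observed = 0.12      # derived SLC45A2 frequency quoted for Kanyakubja Brahmins

# Lowest expected frequency if the steppe source were fixed for the derived
# allele and the other source carried none of it (both are assumptions).
floor = expected_frequency(steppe_share, p_steppe=1.0, p_other=0.0)

# Conversely, the steppe-source frequency that would reproduce the observed value.
needed = implied_source_frequency(observed, steppe_share)

print(f"expected minimum with a fixed steppe source: {floor:.2f}")
print(f"steppe-source frequency consistent with {observed:.2f}: {needed:.2f}")
```

So the argument stands or falls on how common the derived allele really was in the incoming population, which is an assumption here rather than a measured value.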

Ryan said...

@David - "Why does the pre-Indo-European ANE in West Asia have to be linked to R1a? Why not just R2 and maybe Q?"

Agreed that it doesn't have to be. I'm only saying it could be.

Coldmountains said...

@Davidski

I was talking about ANE or something distantly related to EHG. I am not an expert on this, but ANE and R2 were found in Neolithic Iran (correct me if I am wrong), so I would expect more ANE-related ancestry in the East (Afghanistan, Tajikistan, NW South Asia). These regions were populated later by West Asian farmers and are closer to the steppe, so theoretically local hunter-gatherers should have left more genetic traces, especially in isolated and mountainous regions of the Hindu Kush. They surely did not bring Steppe_EMBA ancestry, but mixed populations with high Iran_Neolithic and pseudo-EHG/ANE could maybe to some extent resemble Steppe_EMBA populations.

John Smith said...

Ust-Ishim has a connection to South Asia, as he had a type of K2a which, if I understand correctly, has its closest living relative in South India. He was a cousin of Oase; this group had R mtDNA, it seems, and was possibly the oldest group in Europe. Europe's population was then replaced by a group with U2 mtDNA and C1a2 and, in the sole case of Kostenki, C1b Y-DNA. U2 and C1b are quite common in India; C1a2 is not common anywhere, but if I understand correctly it has been found in Nepal. This group had M as well; I don't know the type. Was it M1 (U6 was found at this time), a type of M found only in South Asia, or something else? For the most part the Aurignacians seem to have the most connections to South Asians, before they got replaced by the Gravettians, who had little connection to India, at least uniparentally, with U5 and I. Do South Asians descend partially from the Aurignacians and the pre-Aurignacians?

batman said...

@ Sein

"It is 4000 km from Andronovo (the type site) to Merv Oasis, by modern roads. Or let's say from Petrovka - 2700 km.

Then it is about 1400 km from Merv to Mehrgarh, and another 1200 km from Mehrgarh to Rakhigarhi."

From the NE bay of the Atlantic - today's Gulf of Finland - to the shores of the Caspian Sea there's a traveling distance of 5,000+ km.

You may divide by 70 km a day to find the average travel-time.

Ancient travellers could use a year to complete their trade-cycles.
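As a rough check on that division, here is a minimal sketch using only the distances quoted in this comment and the 70 km/day pace it suggests. The leg labels and variable names are just for illustration, and the distances are the commenters' figures, not independently verified.

```python
KM_PER_DAY = 70  # daily pace suggested in the comment

# Distances as quoted in the thread (km).
legs_km = {
    "Andronovo (type site) to Merv": 4000,
    "Petrovka to Merv": 2700,
    "Merv to Mehrgarh": 1400,
    "Mehrgarh to Rakhigarhi": 1200,
    "Gulf of Finland to Caspian (commenter's figure)": 5000,
}

# Travel time in days for each leg at the assumed pace.
travel_days = {leg: km / KM_PER_DAY for leg, km in legs_km.items()}

for leg, days in travel_days.items():
    print(f"{leg}: about {days:.0f} days")
```

At that pace every leg fits comfortably inside a single year, which is the point being made about annual trade cycles.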

Coldmountains said...

@Sein

Thanks. Could you also model Burusho?

batman said...

From the paper:

"Maternal lineages primarily reflect earlier, pre-Holocene processes, and paternal lineages predominantly episodes within the last 10 ka. In particular, genetic influx from Central Asia in the Bronze Age was strongly male-driven, consistent with the patriarchal, patrilocal and patrilineal social structure..."

As mentioned a time or two already - the early Holocene represents a meeting between arctic and tropic survivors of the LGM-YD, 24,000-12,000 years ago.

In this analysis it's even clarified that this re-connection between 'archaic' populations was spearheaded by arctic descendants of Y-DNA F (GHIJK) moving into the tropic refugia of mtDNA M.

The "Aryan invasion" was definitely nothing but travel, trade and diplomacy between the arctic and tropic parts of central Asia - leading to inter-marriages. This resulted in a mix between the ancient natives of the Indian sub-continent and their new neighbours in the north - arriving as the early Holocene made inter-continental travel possible.

Harry Parihar said...

@Sein
To piggyback off of @ColdMountains lol
could you also model any Indian Punjabi samples that you have Sein ? Perhaps Sapporo if it doesn't infringe on his privacy

Gioiello said...

@ Ryan
"You may want to revisit your dismissal of Underhill's suggestion of basal R1a in the Zagros then. It wouldn't be an origin for the IE branches of R1a, but some sort of early R1a wandering into Iran seems more plausible now, even if that branch isn't ancestral to very many modern carriers of R1a".

I wrote a lot about (and against) Underhill's paper and about its agendas: 1) it set out to demonstrate that R1a (and also R1b) came from the Middle East; 2) he used only the Iranian samples and said nothing about the unique sample of R-M420 found in Europe, in Italy.

Of course these old subclades of R1a expanded after the Younger Dryas and range from Western Europe to Iran (and later to Arabia). They don't say anything about the point of expansion, and only aDNA will be able to tell the truth, but the possibility that the subsequent expansion happened from Europe should be taken into account, as with hg. R1b1, for which I think I have demonstrated where it came from.

Seinundzeit said...

Harry Parihar,

Absolutely my friend.

I have Sapporo's data, and I don't think he'll mind.

But, just to be safe, I'll either double-check with him, or I'll ask Khana to talk to him.

In the meantime, here are a few results.

Indo-Aryans:

UP_Brahmin

41.60% Iran_Neolithic

25.20% Srubnaya_outlier + 7.35% Srubnaya

25.85% ASI

Distance=0.4946

GujaratiA

43.9% Iran_Neolithic

23.6% Srubnaya_outlier + 12.2% Srubnaya

20.3% ASI

Distance=0.5074

Sindhi

51.00% Iran_Neolithic + 1.25% Iran_Chalcolithic

15.00% Srubnaya_outlier + 14.15% Srubnaya

18.60% ASI

Distance=0.4764

Kalash

43.65% Iran_Neolithic + 4.80% Iran_Chalcolithic

28.75% Srubnaya_outlier + 13.05% Srubnaya

9.75% ASI

Distance=0.282

Iranian South Central Asians:

Pakistani Pashtun, KPK

39.5% Iran_Neolithic + 13.2% Iran_Chalcolithic

20.8% Srubnaya_outlier + 12.6% Srubnaya

13.9% ASI

Distance=0.3096

Pakistani Pashtun, FATA (north)

32.40% Iran_Chalcolithic + 20.35% Iran_Neolithic

25.80% Srubnaya_outlier + 8.65% Srubnaya

12.80% ASI

Distance=0.2442

Me

31.4% Iran_Chalcolithic + 24.9% Iran_Neolithic

29.6% Srubnaya_outlier + 2.2% Scythian_Pazyryk

11.9% ASI

Distance=0.1997

Pakistani Pashtun, FATA (south)

34.95% Iran_Chalcolithic + 17.00% Iran_Neolithic

22.40% Srubnaya_outlier + 16.20% Srubnaya

9.20% ASI

0.25% Mongola

Distance=0.2042

Afghan Pashtun (Ghazni and Paktika)

29.7% Iran_Chalcolithic + 24.9% Iran_Neolithic

25.6% Srubnaya_outlier + 10.4% Srubnaya

9.3% ASI

Distance=0.2485

Tajik_Shugnan

32.20% Srubnaya_outlier + 19.25% Srubnaya

35.25% Iran_Chalcolithic + 6.05% Iran_Neolithic

5.50% ASI

1.75% Mongola

Distance=0.1009

South Central Asian Linguistic Isolate:

Burusho

26.7% Iran_Neolithic + 20.4% Iran_Chalcolithic

30.3% Srubnaya_outlier

16.60% ASI

6.0% Mongola

Distance=0.2073

As is quite evident, all South Central Asians have varying levels of Iran_Chalcolithic-related ancestry, with the KPK Pashtun sample showing the least (13.2%), and the Pamiri peoples showing the most.

By contrast, northern South Asians lack Iran_Chalcolithic-related ancestry; they only show Iran_Neolithic.

Although, a trace amount in Sindhis is pretty interesting. I'd say it's solid evidence of minor Baloch admixture.

Also, the Kalash are somewhat of an outlier (in the context of South Central Asia). The ASI levels resemble Iranian South Central Asians, but their West Eurasian ancestry is more similar to northwestern South Asians.
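For readers wondering what kind of calculation produces percentages plus a "Distance" figure like the ones above, here is a minimal sketch of the general idea: find non-negative weights that sum to 1 so that the weighted average of the source populations' PCA coordinates lands as close as possible to the target's coordinates. This is not the actual tool or data-sheet used for the results above; the coordinates are invented, the function names are hypothetical, and projected gradient descent is just one way to solve the problem.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)


def fit_mixture(target, sources, iters=5000):
    """Fit mixture weights for `target` (d,) from `sources` (k, d).

    Returns (weights, distance), where distance is the Euclidean distance
    between the target and the fitted weighted average of the sources.
    """
    k = sources.shape[0]
    w = np.full(k, 1.0 / k)
    # Step size 1/L, with L the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * np.linalg.norm(sources @ sources.T, 2))
    for _ in range(iters):
        grad = 2.0 * sources @ (w @ sources - target)
        w = project_to_simplex(w - step * grad)
    return w, float(np.linalg.norm(w @ sources - target))


# Invented PCA coordinates for three hypothetical source populations.
toy_sources = np.array([
    [0.10, 0.02, -0.01],
    [-0.05, 0.08, 0.03],
    [0.00, -0.06, 0.09],
])
true_weights = np.array([0.5, 0.3, 0.2])
toy_target = true_weights @ toy_sources  # a target that is an exact mixture

weights, distance = fit_mixture(toy_target, toy_sources)
print(np.round(weights * 100, 1), round(distance, 4))
```

With real data the target is never an exact mixture, so the distance stays above zero, and lower values indicate a closer fit in the chosen PCA space.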

Aram said...

Thanks capra

Now it makes sense.

Rob said...

Sein
I think you'll find that (for the most part) Iran Chalcolithic ancestry disappears if you include Armenia EBA in "sources".

For the king said...

That Iran ChL admix in Afghans and Pamiris has to be related to BMAC. Uzbeks and Turkic admixed Tajiks also show strong Iran ChL signs.

Armenia EBA probably shows up because it's largely similar to modern Kurds and some Iranians (and thus Iran ChL); per the archaeology, the Kura-Araxes culture didn't have much influence south or east of NW Iran.

Rob said...

@ FT King
Yes KA influence was limited to NW Iran; but its impact on modern Afghans, say, could still be large, if they migrated east from NW Iran.
Indeed, Pashtuns & Tajiks score 30% or so "EBA Armenia"; it is 30-40% in various Iranian plateau groups, whilst it drops precipitously to 0 in Punjabis, Brahmins and Gujaratis.
So to me it seems to be a good differentiator of Iranics from Indo-Aryans (in addition to the differential "ASI" levels).

Davidski said...

But aren't Andronovo, Sintashta and Sarmatians even better proxies for this type of ancestry in Eastern Iranics than Armenia_EBA/Kura-Araxes?

Rob said...

Yes yes, for the steppe part.
So eg

GujaratA
Iran Neolithic 30%
Srubnaya Outlier 25 %
Paniya 45%

Tajik
Iran Neolithic 10%
Armenia EBA 35%
Srubnaya Outlier 26%
Scythian 9%
Paniya 15%

Rob said...

So my original remark was that 'Iran Neolithic' ancestry (in 'unadulterated' form) is best preserved outside the Iranian plateau (as per the Broushaki paper), whilst on the Iranian plateau itself, and among Iranic speakers outside it, more recent movements have attenuated its presence.

So there seem to have been at least 3 movements into Iran since the Neolithic:

- something contributing to Iran Chalcolithic (but this hasn't left much legacy, in my runs)

- Kura -Araxes (widespread effect from north Levant to Iranian plateau)

- steppe EMBA and even IA

Coldmountains said...

@Sein

Thanks very much wrora. Some Afghan Pashtuns and especially non-Pamiri Tajiks may have more recent Persian-like ancestry from the West, but it is probably mostly limited to urban areas. Cranial analysis also showed that BMAC-like populations did not move into South Asia. Maybe Indo-Aryans entered South Asia with a BMACized material culture but without actually mixing much with BMAC people. The lack of Iran_Chalcolithic seems to support this, and BMAC shows too strong archaeological links with Mesopotamia/Elam/NW Iran to lack Iran_Chalcolithic IMHO.

Some actually argue (Asko Parpola) that the Dasa tribes of the Rigveda were (Iranicized) BMAC tribes of Afghanistan and that Vedic Aryans were quite hostile to BMAC people. Dasa/Dahae was used as an ethnonym by some Central Asian Iranics but somehow had a very negative meaning among Indo-Aryans.

"The same Rigvedic verse which mentions Divodāsa’s birth as a gift of the Sarasvatī (of Arachosia), RV 6,61,1, also says that the river took away from the (enemy) Paṇi his nourishment (pasture or cattle). The Paṇis are often mentioned as enemies of Rigvedic kings, sometimes along with the Dāsas and Dasyus, and in very similar terms. Paṇí is supposed to be a Prakritic development of *Pṛṇi, a reduced-grade variant of an ethnic name that has a full grade in Parṇáya, the name of an enemy of King (Divodāsa) Atithigva in RV 1,53,8 and 10,48,8. Strabo (11,9,2) describes the foundation of the Parthian empire of the Arsacids around 240 BCE as follows: “Arsakēs, a Scythian man, who had (in his command) some men of the Dāai, namely nomads called Párnoi, who were living along the River Okhos (modern Tejend), invaded Parthia and conquered it.”"

"Sanskrit Dāsa- as an ethnic name thus has an exact counterpart in Avestan Dåŋha-, which stands for *Dāha-. The corresponding Old Persian ethnic name is Daha-. The plural form Dahā is included among the subjects of the Great King in the “empire list” of Xerxes, immediately before the two kinds of Sakā."

Parpola, Asko. The Roots of Hinduism: The Early Aryans and the Indus Civilization

Matt said...

Complementing above Fst scores, some PCoA plots on those Fst scores for them:

South Asian Populations - http://i.imgur.com/SSyTOyq.png (sorted by differentiation from Brahui)

Corresponding PCoA: http://i.imgur.com/o0lLLVs.png with Paniya / http://i.imgur.com/bGzuXmu.png without Paniya

With or without Paniya, the first dimension is mostly dominated by the ANI:ASI contrast and "Indian cline" while the second dimension is either dominated by Paniya's difference if they are included, or shows substructure beyond the cline if it is not.

European Populations - http://i.imgur.com/lkuSsNB.png (sorted by differentiation from Sardinians)

Corresponding PCoA: http://i.imgur.com/B38tRtA.png.

Pretty much reproduces a well done PCA of European genotypes. First dimension separates most EEF (Sardinian) from most Euro_HG (NE European and Lithuanian) then second dimension reproduces extra similarities between related West Europe / East Europe populations.

...

Further PCoA based on all the modern West Eurasian Fst scores from Lazaridis (minus Italian_South, which looks broken):

Dimension 1 vs 2: http://i.imgur.com/W6oqJzN.png, 2 vs 3: http://i.imgur.com/3VXCp50.png

Again reproduces the result of genotype PCA. 1 is North Europe vs Southern ME (Lithuanian-BedouinB at extremes), 2 is West Med vs Caucasus (Sardinian vs Georgian), 3 is Southern ME vs West Med (Sardinian vs BedouinB) and 4 is NW Europe vs NE Europe.

PCoA on Europe and West Eurasia are really just included as "proof of concept" that PCoA on Fst can reproduce the same substructures found in PCA.

Garvan said...

@Matt

Is PCoA the same as MDS? Or can you easily explain the difference between PCoA and PCA?

Garvan

Jaydeep said...

Coming back to the paper, I think it's good to see some effort at dating the expansion of mtDNA lineages of South Asia. However, the paper could have been a lot better than it finally turned out.

As far as the study of WE mtDNA lineages of South Asia is concerned, Palanichamy et al 2015 is certainly the most comprehensive. It is therefore quite informative when we compare the data from that paper with what we have here.

Palanichamy et al scoured through about 14,000 samples from across India to get a sample size of about 1,180 WE mtDNA samples. The current paper only has about 220 WE mtDNA samples from across South Asia (India, Pakistan, Bangladesh, Sri Lanka).

South Asia has a population of about 1.6 billion, out of which approx. 750 million should be the female population. If we assume that 20% of this belongs to WE mtDNA, that would mean a population of 150 million females with WE mtDNA. That is twice the size of the entire population of Iran.

Can a measly sample size of 220 be expected to capture the entire WE mtDNA diversity of 150 million? Obviously not.

I would say that even Palanichamy et al fell short. A sample size of about 5,000 would do justice. Obviously that would require a huge effort. But it should be done.
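To put rough numbers on the sampling point, here is a small sketch. The population figures and the 20% share are the ones assumed above; the sample sizes are the approximate counts quoted for the two papers plus the suggested 5,000. The 0.5% lineage frequency is purely a hypothetical value chosen for illustration, and the calculation assumes independent random sampling, which real collections only approximate.

```python
def miss_probability(lineage_freq, sample_size):
    """Chance that a lineage at `lineage_freq` is absent from a random sample."""
    return (1.0 - lineage_freq) ** sample_size


females = 750e6   # approximate female population assumed in the comment
we_share = 0.20   # assumed share carrying WE mtDNA
we_carriers = females * we_share
print(f"WE mtDNA carriers: about {we_carriers / 1e6:.0f} million")

rare_lineage_freq = 0.005  # hypothetical: 0.5% of WE mtDNA carriers

for n in (220, 1180, 5000):
    p = miss_probability(rare_lineage_freq, n)
    print(f"sample of {n}: chance of missing the lineage = {p:.4f}")
```

Under these assumptions a sample of a couple of hundred would miss such a lineage roughly a third of the time, while a sample in the thousands would almost never miss it.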

Nevertheless, the diversity captured by Palanichamy et al is greater than that in the current paper.

Let me give the examples:-

The current paper (CP) only lists N1a1b1 & N1a2 as WE lineages under N which are found in South Asia, Palanichamy et al found in addition to above, N1a1a, N1a3 & N1b1.

The CP lists under W, W3a1, W4 & W6 in South Asia while Palanichamy have in addition, basal W, W1c, Wlg, basal W3 and W3a2.

The CP lists under HV, HV2a2, HV14, HV12, HV13, while Palanichamy have in addition HV2a, HV2a1, HV6b & HV14a1.

Under I, the CP only has I1 while Palanichamy have in addition I4 & I6.

Under H, the CP lists H2b, H7b, H13, H15a1a & H29, while Palanichamy missed H29, it also had basal H, H2a, H3g, H5a, H6a, H9b, H14a, H14c & H103.

The CP also lists R1a1 and R2a, but misses R1b and R2c.

Under J, it has J1b1, J1b3 & J1d but Palanichamy also has basal J, J1b, J1c, J2a & J2b.

Under T, the CP has T1a1, T2a1, T2b, T2d, T2e but misses basal T1, T1a5, T1b, T2c.

Under K it has K1a1b, K2a5 but Palanichamy has in addition basal K & K1.

Lastly, the CP only analyses pre-U1c, U1a1, U1a3 under U1 and U7, U7a, U7a2, U7a3, pre-U7a3, U7b under U7. Palanichamy has in addition U1a2 under U1 and U7a1, U7a4, U7a6, U7a7 & U7c.

The CP also totally fails to add the following U clades found by Palanichamy - basal U, U2e, U2e1, U2e3, U3, U3a, U3b, U4, U4a1, U4b2, U4c1, U5, U5a, U5a1a, U5a1b, U9a1 & U9b1.

In effect, the current paper only manages to capture less than half of the WE mtdna diversity that was found by Palanichamy et al that focussed only on India.

So how can the dates of expansion as proposed by this paper give us an accurate picture on WE mtdna expansions in South Asia ? It cannot.

There is more criticism but I shall write that down later.

Seinundzeit said...

Coldmountains,

Absolutely wrora, no problem.

And, I completely agree; many scholars have identified links between BMAC and northwestern Iran, so an Iran_Chalc/CHG-related connection has always been a possibility.

When Tajiks/Pashtuns across Tajikistan, Afghanistan, and Pakistan show substantial Iran_Chalc, it just further cements the notion.

Also, I do think that Parpola's Dasa theory has considerable merit.

Matt said...

@ Garvan, what I understand of it is, from https://folk.uio.no/ohammer/past/multivar.html (which will tell you more than I can)

It is: Metric Multidimensional Scaling (different from Non-metric Multidimensional Scaling!), so yes, MDS.

While I don't understand the mathematics of it, I gather from the link above that MDS is the right multivariate reduction technique to use for a matrix of distances, as it covers data where "The PCO routine finds the eigenvalues and eigenvectors of a matrix containing the distances or similarities between all data points".

PCA, on the other hand, requires columns of variables, then produces "the variance-covariance matrix or the correlation matrix" and subjects these to analysis to find a set of "eigenvalues and eigenvectors". So it will not function on a distance matrix like an Fst matrix.

I find that Metric MDS (i.e. PCoA) works better than the Non-metric MDS function in PAST3, as I gather that N-M MDS "attempts to place the data points in a two- or three-dimensional coordinate system such that the ranked differences are preserved". When I run this over Fst matrices, it looks inherently more unstable with respect to the actual distances between populations, while this is not a problem with PCoA.

There are variable similarity index settings for PCoA in PAST3; I find with Fst data it is best to just use the "User-supplied distance" setting, as the other similarity / distance indices apply processing to the data where the outcome doesn't seem to align with what I am expecting (though I would have a tough time technically telling you why).
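For anyone who wants to see what the PCoA step amounts to outside PAST3, here is a minimal sketch of classical (metric) MDS in Python with NumPy. It treats the matrix values directly as distances, which is an assumption; PAST3's own routine may differ in detail, and the four populations and their Fst-like values are made up.

```python
import numpy as np


def pcoa(dist, n_dims=3):
    """Classical (metric) MDS / PCoA on a symmetric distance matrix.

    Returns (coords, eigenvalues) for the leading positive eigenvalues.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Gower double-centring of the squared distances.
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (d ** 2) @ centring
    # b is symmetric, so eigh returns real eigenvalues (in ascending order).
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep positive eigenvalues only; an Fst matrix need not be Euclidean.
    keep = eigvals > 1e-12
    eigvals, eigvecs = eigvals[keep][:n_dims], eigvecs[:, keep][:, :n_dims]
    coords = eigvecs * np.sqrt(eigvals)
    return coords, eigvals


# Toy 4-population matrix of made-up Fst-like values (not from any paper).
labels = ["PopA", "PopB", "PopC", "PopD"]
fst = np.array([
    [0.000, 0.010, 0.050, 0.060],
    [0.010, 0.000, 0.045, 0.055],
    [0.050, 0.045, 0.000, 0.015],
    [0.060, 0.055, 0.015, 0.000],
])
coords, eigvals = pcoa(fst, n_dims=2)
for name, xy in zip(labels, coords):
    print(name, np.round(xy, 4))
```

The signs of the output axes are arbitrary, so a plot from this sketch may be mirrored relative to one from another program.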

Matt said...

@ Garvan (split my post into two) here's another example if you're interested in East Asian population history at all, or if not, you still might find it illustrative:

http://i.imgur.com/rnwyR3j.png - This Fst matrix includes a little-compared set of populations, including some small Tibeto-Burman groups from Nagaland (Ao Naga and Naga) and Bangladesh (Garos), along with a fairly robust sample size of Siberian, Northeast and mainland Southeast Asians, and a couple of "ASI" tribal pops (Munda and Paniya).

Put this through the PCoA and the first 3 output coordinates look like this: http://i.imgur.com/VqGHz2G.png

1: separates the "ASI" tribal pops from the East Asians, and as you'd expect, the Burmese, Garo and Cambodians look to have a position that looks intermediate and admixed (though the Naga and Ao-Naga do not).

2: separates Siberians and Naga from Southeast Asians, with Han-Japanese falling around the middle. Interestingly the position of Burmese and Garos vs Cambodians here indicates that the Burmese samples here are more like a Han population mixing with "ASI", where Cambodians are more like Dai mixing with "ASI".

Both of these dimensions are completely orthodox with what genome wide PCA normally tells us about all these populations.

Now, 3: adds more information, this time separating the Western Tibeto-Burman populations (Aonaga, Nysha, Burmese, etc.) at one end of the pole from North Asians at the other, with Southeast Asians in between.

So this is some cool information and a sort of new minor dimension of East Asian variance in the mainland that really doesn't show up in most studies of East and South Asia. Probably as it's mostly concentrated in relatively small populations in the west of East Asia and Tibet bordering on India, who are not normally the topic of study. But using the PCoA and the matrices of Fst which have been made available in papers we can visualise it. Maybe it implies to us something new about population history (or maybe not!).

Seinundzeit said...

Harry Parihar,

Absolutely, no problem bro.

And, I think you're right.

The primary differentiating stream of ancestry for neighboring Punjabis and Pashtuns would be Iran_Chalcolithic-related stuff, not ASI or steppe admixture.

For example, Swati Yusufzai live right on a cultural shatter zone. In their area, Central and South Asia basically melt into each other.

The whole valley has many thriving Hindkowan communities. This is how they stack up (I'm sure my Yusufzai friend here is representative of true Yusufzai Pashtuns):

39.5% Iran_Neolithic + 13.2% Iran_Chalcolithic

20.8% Srubnaya_outlier + 12.6% Srubnaya

13.9% ASI

Distance=0.3096

For comparison, I have Sapporo's data, and he is around 15% ASI (out of all the South Asian results I've seen, Sapporo has the least ASI admixture. At minimum, the most West Eurasian Indians show 20% ASI, but Sapporo is consistently only 15%, sometimes even lower).

So, there isn't much of a difference between an Indian Punjabi Jatt and a KPK Pashtun, when it comes to ASI.

On top of that, Sapporo has more steppe admixture than this KPK Pashtun!

But, Sapporo has 0% Iran_Chalcolithic, while this Pashtun from the Punjabi/Pashtun frontier has around 15%.

Other Pakistani Pashtuns show up to 35% Iran_Chalc, and Afghan Pashtuns show a range of 25% to 35%.

So again, CHG/Iran_Chalc is what really differentiates populations that live east and west of the Indus river.

Oddly though, if old ADMIXTURE runs are any indication, Pakistani Punjabi Jatts have more ASI than their Indian coethnics; in my setup they'll probably be around 20%-25%.

Definitely out of whack with geography, but it was a consistent pattern.

And when it comes to steppe ancestry, I think Punjab is quite complicated.

Obviously you know this better than me, but if I'm not mistaken, the caste system operates somewhat differently in Greater Punjab, due to this region's proximity to South Central Asia.

At the end of the day, I think Punjabi Brahmins only have Aryan ancestry, nothing from Scythians/Kushans/Hephthalites. By contrast, Jatts, especially the Jatts of Haryana (and even Rajasthan), have substantial Scythian/Kushan/Hephthalite-related ancestry.

Also, I think there is an Aryan substrate, when looking at northeastern Pashtuns.

But broadly, I think what you've said does sum things up.

Seinundzeit said...

Oh, also, I should add that I now use the scaling procedure discussed by Alberto and Matt.

Works wonderfully.

Matt said...

@ Garvan, now getting totally OT, but also using Chaubey's West Eurasian Fst scores, you can also see an interesting thing with the Jewish populations they've sampled which Lazaridis / Reich didn't use in that matrix I posted above (not the focus of what they're looking at!).

Take this matrix from Chaubey: http://i.imgur.com/XaE9Wie.png

PCoA (metric MDS) from this: http://i.imgur.com/a3jloKa.png

You can kind of see here in this PCoA a pretty interesting position for the Italian Jewish population, where they're essentially much closer to the Near East than the Ashkenazy or Sephardic population. Differentiation is very low in 1 (North Europe vs Southern ME), 2 (East vs West of West Eurasia) and 3 (Caucasus+West Med vs NE Europe+Southern ME). The matrix accordingly fitted them as having very low differentiation from Syrian / Jordanian populations.

Carlos Aramayo said...

@Rami

What do you think about J. M. Kenoyer`s view (based on linguistic research done by Southworth)that Harappan Civilization could have had, at least, four kinds of language families: Dravidian, Sino-Tibetan, Austro-Asiatic, and Indo-Aryan?

bmdriver said...

" Salden said...
Wait for the Hindu Nationalists to show"

.......Hindu nationalistic have always been slower and less violent, barbaric, imperialistic than the White Christian nationalistic that went onto occupy, enslave, and then impose a European white Christian format in Africa, North America, South America, and Australia.

So if we take logic, Hindu nationalism is medicine against a white Christian virus that consumed the world, and today their germs still linger in informing others of their roots.........but some history cannot be simply wiped out or its direction reversed, because you believe white skinned, blue eyed, blonde haired people are gods chosen to rule the earth, then calling others nationalistic is truly amusing to say the least.

"India has been conquered once, but India must be conquered again, and that second conquest should be a conquest by education.” – Max Muller

1455, edict by Pope Nicholas V that grants the ''right of conquest to invade, search out, capture, vanquish and subdue all Non believers whatsoever and other enemies of Christ wheresoever placed, and the kingdoms, dukedoms, principalities, dominions, possessions and all movable and immovable goods whatever held and possessed by them and to reduce their persons to perpetual slavery"

bmdriver said...

Bigger picture South Asia is the source of west Asia, Europe, steppes, with migrating Indian tribes moving out settling, then some returning to their Indian homeland as dharmic pagan civilisation not of the abrhamic hue that discarded its roots to worship an Israeli tribe in which God allows his chosen people to exploit everyone as an act of God and righteousness itself. Similar to Isis.

The day abrhamic cults started to destroy their Indian pagan roots, they become European Caucasian and Aryan, before that they where part of the Indian dharmic civilianisation.

postneo said...

@Balaji
yes the current paper seems superficial.

@Sein
Pakistani Punjabi Jatts have more ASI than they're Indian coethnics; in my setup they'll probably be around 20%-25%.

phonemically panjabi's are more dravidian shifted than hindi and bengali, Marwari, probably even sindhi speakers to their east and south( its hard to find sindhi speakers nowadays).

They are weak on voiced aspiration, higher cadence complexity and gemination, strong on retroflexion.
it's also interesting that Pashtuns east of Iran have retained retroflexion, unlike Iranis.

Saqib said...

@Seinundzeit

How is Sindhi scoring 18% ASI and GujaratiA 20% ASI in your results, when Sindhi HGDP score 30% ASI and GujaratiA score 36% ASI on HarappaWorld?

thanks

André de Vasconcelos said...

bmdriver, quoting stuff from over 500 years ago to fuel your (thus current) anti-European hate? Jeez, you're as much of a nutter as that afrocentrist we recently got here

John Smith said...

Any idea on when new adna will come out? 😞

Seinundzeit said...

postneo,

Interestingly, the Pamiri languages also display retroflexion.

Saqib,

With regard to the HarappaWorld "South Indian" component, it isn't really ASI.

Like all South Asian components seen in ADMIXTURE, it's a mix of ASI, Iran_Neolithic, and extra ANE.

This is why Iranian_Neolithic samples, which are basically 0% ENA, show 10%-15% "South Indian" on HarappaWorld.

In this context, 18% ASI for Sindhis only involves ENA, no ANE or Iran_Neolithic-related admixture.

That's pretty much it.

Saqib said...

@Seinundzeit

UP brahmin score 39% south indian but 25.85% ASI in your program. Sindhi 30% SI and 18.60% ASI, GujaratiA 35-36% SI and 20.3% ASI. It looks like one can get good idea how much one will score ASI based on his south indian score on Harappaworld? Just to compare with other populations found on HarappaDNA.net.

Is it possible for you to release a calculator for other people to run their samples?

thanks

Seinundzeit said...

Saqib,

Absolutely, the correlation is very solid.

Based on what I've seen, actual ASI will usually be around 60%-70% of the "South_Indian" score for HarappaWorld.

Also, if you have your coordinates for David's PCA data-sheet, I can try to model your data.
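As a quick check of that rule of thumb against the figures quoted in this exchange, here is a minimal sketch. The helper `asi_range_from_south_indian` is a hypothetical name, and the midpoint 35.5 for GujaratiA is an assumption, since the score was given as 35-36%.

```python
# Figures quoted in the comments above:
# (HarappaWorld "South Indian" %, modelled ASI %).
quoted = {
    "UP Brahmin": (39.0, 25.85),
    "Sindhi": (30.0, 18.60),
    "GujaratiA": (35.5, 20.30),   # midpoint of the quoted 35-36%
}

for pop, (south_indian, asi) in quoted.items():
    print(f"{pop}: ASI / South Indian = {asi / south_indian:.2f}")


def asi_range_from_south_indian(south_indian_pct, low=0.60, high=0.70):
    """Rough ASI range implied by the 60%-70% rule of thumb above."""
    return south_indian_pct * low, south_indian_pct * high


print(asi_range_from_south_indian(30.0))
```

With these three data points the ratio sits in or just under the 60%-70% band (GujaratiA comes out slightly below 60%), so it is best read as a rough guide rather than a conversion formula.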

Garvan said...

@Matt,

Thanks for the explanation of PCoA, and the interesting examples. I will try it for myself.

Reza said...

Sein, really interesting concepts!

Would you mind running some bengali results, I think khana may have sent you my mother's at one point?

I wonder how the Eastern end of the Gangetic plains compare with other south Indians when it comes to ASI modelling.

How would a higher level of presumed Austroasiatic input in eastern SA come up on your model? ASI+mongola? As well as the extra mongola to be seen in Bangladeshis?

Saqib said...

@Seinundzeit

To get other component scores, does one have to actually run results in the program? I don't have David's PCA sheet.

capra internetensis said...

@Matt

Interesting. Do you have any idea why Paniya behave that way? Are they just especially drifted (I couldn't find any data about RoHs or anything for them), or is it possible this is some kind of ASI population structure?

@Jaydeep

Thanks for pulling out that other mtDNA data. I wish I knew more about mtDNA so I could comment. All I can say is that we can't assume a larger dataset should break the pattern established by smaller one, and vague numerical comparisons aren't a substitute for statistical measures of significance.

PF said...

@Matt

Those are interesting but strange results. Previous work has shown that Italian Jews are basically the same as Ashkenazi/Sephardic; if anything closer to Ashkenazi. I tried to find details of the sample but it seems to be missing from the paper...

Seinundzeit said...

Hey Reza,

Sorry for the delay.

This is what I find, for your mother:

39.50% Iran_Neolithic
30.75% ASI
22.90% Srubnaya_outlier
6.85% Dai

Distance=0.1586

Saqib,

I think you have to send your file to David, and pay him, in order to receive your coordinates.

Matt said...

@ Capra, good question.

To be honest there are a few other populations that show higher differentiation from others in the panel. Paniya was the one with most seeming "extra differentiation" from its clinal position that I included in the PCoA of South Asian populations upthread, but in the full table of Fsts (I think I put this in my first post) Pulliyars, Chenchu, a few of the Indian Jewish populations (1,2,4), Dusadhs, Dhakars, Kanjars were also pretty extreme in their Fst relative to general position on the PCoA cline, and in terms of generating an extra dimension displacing them from the South Asian cline in the PCoA. Pulliyars very much so.

This could well just be extra drift (and I think for the Indian Jewish examples it seems pretty certain to be).

One way I would've checked this would've been to look at their Fst from the African outgroups as a check. Can't do that in Chaubey's data as that was not provided, but for some populations we could in Metspalu's 2011 paper with some of the same populations - https://www.ncbi.nlm.nih.gov/pubmed/22152676. Fsts - http://i.imgur.com/hiaLr5b.png. PCoA - http://i.imgur.com/UCOTBUz.png.

Does seem like extra drift for Pulliyars as they seem extra differentiated from Africans as well, but can't judge this way for Paniya.

As well as RoH, a better check for all this (strong drift vs ancient substructure) than my above African Fst comparison might be to look at an outgroup f3 sharing matrix between these South Asian populations, since that's in theory much less sensitive to drift than Fst. Haven't got myself the setup yet to get the data and run that though, unfortunately, and besides not sure if the dataset isn't restricted.

(Btw, here's some other PCoA from the Fst from that Metspalu paper, with all samples: http://i.imgur.com/xUraCov.png, Eurasian samples only: http://i.imgur.com/Wh9FxkW.png)
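For reference, the outgroup f3 check mentioned above can be sketched very simply from allele frequencies. This is a naive version under stated assumptions: it skips the finite-sample correction and block jackknife that real analyses normally apply, and the frequencies below are made up purely for illustration.

```python
import numpy as np


def outgroup_f3(p_out, p_a, p_b):
    """Naive outgroup f3(Outgroup; A, B): mean over SNPs of
    (p_out - p_a) * (p_out - p_b), from population allele frequencies.

    No sample-size correction or standard errors are computed here.
    """
    p_out, p_a, p_b = (np.asarray(x, dtype=float) for x in (p_out, p_a, p_b))
    return float(np.mean((p_out - p_a) * (p_out - p_b)))


# Made-up allele frequencies at five SNPs (purely illustrative).
outgroup = [0.10, 0.80, 0.50, 0.30, 0.90]
pop_a = [0.40, 0.60, 0.70, 0.10, 0.50]
pop_b = [0.45, 0.55, 0.75, 0.15, 0.55]   # close to pop_a
pop_c = [0.15, 0.85, 0.40, 0.35, 0.80]   # close to the outgroup

print(outgroup_f3(outgroup, pop_a, pop_b))
print(outgroup_f3(outgroup, pop_a, pop_c))
```

A larger value means A and B share more drift away from the outgroup. Drift that happened in A alone, after it split from B, does not add to the statistic in expectation, which is the reason it is less sensitive to recent drift than Fst.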

Seinundzeit said...

Off-topic, but Matt's PCoA of fst data brings back memories of my own experiments with the Metspalu et al. fsts.

Basically, I ran nMonte on the Metspalu fst data (this was a very long time ago).

These are the models I obtained.

Pashtuns ("Pathans")

63.7% Tajik
32.5% Brahui
3.8% Munda Tribal

Sindhi
66.9% Brahui
26.3% Tajik
6.8% Munda Tribal

UP Brahmin
75.8% Tajik
24.2% Munda Tribal
0% Brahui

Now that I look back, these look pretty interesting/good, especially considering the nature of the data.
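For readers unfamiliar with this kind of modelling, here is a minimal sketch of the general idea: find mixture weights over reference populations whose weighted average lies closest to the target. It is a simple Monte Carlo hill-climb written in the spirit of tools like nMonte; it is not the nMonte code, and the function name `fit_mixture`, the step size and the coordinates are all assumptions made for illustration.

```python
import numpy as np


def fit_mixture(target, refs, units=200, n_iter=20000, seed=0):
    """Monte Carlo hill-climb for mixture weights.

    `refs` has one row of coordinates per reference population. Weights are
    held as integer counts summing to `units` (0.5% steps when units=200);
    each step moves one unit between two references and is kept only if the
    Euclidean distance to `target` shrinks.
    """
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    refs = np.asarray(refs, dtype=float)
    k = refs.shape[0]
    counts = np.full(k, units // k)
    counts[: units - counts.sum()] += 1      # spread any remainder

    def dist(c):
        return np.linalg.norm((c / units) @ refs - target)

    best = dist(counts)
    for _ in range(n_iter):
        src, dst = rng.choice(k, size=2, replace=False)
        if counts[src] == 0:
            continue
        counts[src] -= 1
        counts[dst] += 1
        d = dist(counts)
        if d < best:
            best = d
        else:                                 # undo a move that did not help
            counts[src] += 1
            counts[dst] -= 1
    return counts / units, best


# Hypothetical 3-D coordinates for three reference populations.
refs = np.array([
    [1.0, 0.0, 0.0],   # "RefA"
    [0.0, 1.0, 0.0],   # "RefB"
    [0.0, 0.0, 1.0],   # "RefC"
])
target = 0.6 * refs[0] + 0.4 * refs[1]   # a known 60/40 mix, for checking
weights, distance = fit_mixture(target, refs)
print(np.round(weights, 3), round(distance, 4))
```

The reported distance depends on the scale of the input coordinates, so it is only comparable between runs on the same data.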

Matt said...

@PF, yes it's a strange one. I couldn't find direct mention of the Italian Jewish sample Chaubey use except in the Fst table (http://www.nature.com/article-assets/npg/srep/2016/160113/srep19166/extref/srep19166-s1.pdf, p12) so the sourcing was a bit obscure.

However their other Jewish samples are largely sourced from Atzmon et al. 2010 and Behar et al 2010.

So looking at Atzmon 2010 - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3032072/ - includes mention of an Italian Jewish sample - "the Italian Jewish community in Rome" - so I would assume this is the origin of the sample in Chaubey's paper. Atzmon's PCA shows that the ItJ is slightly closer to the ME than other European Jewish populations, but appears to have a strange shape (from Dienekes blog - http://3.bp.blogspot.com/_Ish7688voT0/TAimWIpgDAI/AAAAAAAACao/0uuOXPXvqcQ/s1600/pc-jews.jpg), perhaps from its limited samples (HGDP plus a few others).

Taking the lower bounds Fsts from Atzmon's supplement (page 12) and running PCoA: http://i.imgur.com/eFJdMSf.png

The Italian Jews do have a distinct position from Ashkenazis, while being close to Turkish and Greek populations, but aren't as close to the Levant as in the Chaubey run above.

All that said, I might personally probably trust the Fst and positioning Chaubey's paper more, assuming the same sample, as the range of comparison populations is so much larger, so it's much easier for a technique like PCoA to represent their position accurately.

@ Sein, talking of nMonte, output of PCoA on Fst from Lazaridis including the ancients (Yamnaya, Europe_MN, etc.) actually worked quite decently for fitting with nMonte. Gave fairly similar proportions for recent Europeans and ancients as would be expected. Though I think not the best method because of likely some slight deflations / inflations of Fst in some ancients, but pretty good really. I don't know if nMonte would work as well on the Fst matrix without first a dimensional reduction / transformation technique like PCoA, but maybe.

Seinundzeit said...

Matt,

I really wish we could see some Fst distances for Central and Southern Asian populations (in relation to the ancient samples that we currently have), but I don't think it's ever been done.

Shaikorth said...

@Matt

The Italian Jewish sample has a strange Fst of 0 to Lebanese which may show up on the PCA. The distances between Levantine populations are higher, and ItJ's distances to other MENA are similar to those of Sephardics. Usually Fst 0 appears when populations are basically indistinguishable on a genotype-based PCA, but Lebanese and Italian Jews aren't.

Matt said...

@ Shaikorth, I cut the Lebanese samples from the matrix and reran the PCoA:
http://i.imgur.com/wuYo9Tb.png

(side by side with the original: http://i.imgur.com/aSnKtXl.png)

Positions of others basically completely invariant.

The method is fitting to the whole matrix of 41-42 populations, and with that n of populations, it's unlikely that removing any one population would have an overall effect on the position of all the others (unless there is *very* strong and systematic differentiation with that population).
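The same kind of robustness check can be scripted. The sketch below reruns a classical PCoA with one population removed and reports the largest change in any between-population distance in the plot. Comparing distances rather than raw coordinates is a design choice here, since PCoA axes can flip or rotate between runs. The 41 random points are a synthetic stand-in for a real Fst matrix, and all function names are hypothetical.

```python
import numpy as np


def pcoa(dist, n_dims=2):
    """Classical (metric) MDS on a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ (d ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]
    vals = np.clip(eigvals[order], 0.0, None)   # ignore negative eigenvalues
    return eigvecs[:, order] * np.sqrt(vals)


def pairwise(coords):
    """Euclidean distances between all rows of `coords`."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def max_shift_after_drop(dist, idx, n_dims=2):
    """Largest change in any between-population distance in the plot when
    population `idx` is removed and the PCoA is rerun."""
    dist = np.asarray(dist, dtype=float)
    kept = np.delete(pcoa(dist, n_dims), idx, axis=0)
    reduced_dist = np.delete(np.delete(dist, idx, axis=0), idx, axis=1)
    reduced = pcoa(reduced_dist, n_dims)
    return float(np.abs(pairwise(kept) - pairwise(reduced)).max())


# Synthetic stand-in for a 41-population matrix: random points in 5-D.
rng = np.random.default_rng(0)
points = rng.normal(size=(41, 5))
toy_dist = pairwise(points)
print(max_shift_after_drop(toy_dist, idx=7))
```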

Matt said...

Sein: I really wish we could see some Fst distances for Central and Southern Asian populations (in relation to the ancient samples that we currently have), but I don't think it's ever been done.

I think it could be illustrative, yeah, but I don't think it has either - the only thing I remember is this subset from the Fst from Haak: http://i.imgur.com/C1LMjVt.png, but it's very incomplete compared to what we have now, and the available panel of SCA populations, and the n of populations necessary for good confidence that it represents general trends and has not been hijacked by any specific population's drift. I'm also not sure if they didn't change the means to calculate Fsts for ancients after that paper (the ancient Fst from Mathieson and Lazaridis's paper seem to work better).

ak2014b said...

Many in South Asia and some of Afghan descent, particularly Pathans, have the surname "Khan". Although this largely occurs in the region's Muslim groups, the name itself was derived from Turko-Mongolian influences, which need not all have been genetic. However, the "Mughal" empire in South Asia is related to the word "Moghul" which means "Mongol" too. So questions about the extent of such later steppe genetic influence remains interesting.

We know from studies, including the current one, that mtDNA would be local to whatever region the Moghuls settled in, confirming that the ruling Moghul elites would have taken local women. The predominance of local women contributing genetically to subsequent generations carrying surnames like "Khan" would also have overpowered Turko-Mongolian signals in the autosomal DNA. So a study of the paternal markers of South and South Central Asians who hold Turko-Mongolian surnames or who otherwise claim Moghul ancestry could be more productive in identifying the extent of actual Turko-Mongolian genetic influences. (Alternatively, if possible, specific mutations or other identifying variants that can clarify the migration path of Turko-Mongolian branches in the region would first need to be discovered.)

The aDNA of Turko-Mongolian branches that migrated to South Central and South Asia would certainly have to be studied to identify what qualifies as a clear Turko-Mongolian signal and paternal lineages. I'm not up to guessing what the Y of more regionally relevant Turko-Mongolian Khaganate branches were, but there was a report of 2/2 ancient Khazars being R1a-Z93. This may turn out as having some application in a wider Turko-Mongolian context since the Turks were found to be descended from eastern Scythians, who would have harboured some Z93 too. If further resolution of the phylogeny is possible, it might be possible to distinguish between the bronze age signals from Sintashta or Andronovo, and later Turko-Mongolian movements that could have brought the same or other common Y haplos to the South Central and South Asian region. Then it may become possible to quantify how much (or little) genetic influence Turko-Mongolians actually had, at least paternally, on their namesakes in these other Asian regions.

Matt said...

@ Sein, if you wanted to see how these PCoA on Fst scores can perform in nMonte, here is one based on the Fst scores from the Lazaridis paper:
Lazaridis Fst PCoA - https://pastebin.com/DRf30EU4

(Left in about 21 dimensions on this one, as that captures all the modern day European differentiation including NW and NE, but you can probably trim to 10 if you're just interested in how moderns are related to the main ancients, as 3 and 6 capture the vast majority of West Eurasian structure. The full PCoA generated more dimensions than this, but eventually those seemed to just be increasing population distances above reality without meaningfully changing the structure.).

Example nMonte Calc files:

Calc1: https://pastebin.com/HTQ1TRp7 - ancients for Europe, not including MNChl / Steppe_EMBA

Calc2: https://pastebin.com/q61QcrF3 - ancients for Europe, including MNChl / Steppe_EMBA
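On the question of how many PCoA dimensions to keep, one common heuristic is to look at how much of the total positive eigenvalue sum the leading dimensions account for. The sketch below shows that heuristic only; it is not necessarily how the 21 dimensions above were chosen, and the eigenvalues and the name `dims_for_share` are made up.

```python
import numpy as np


def dims_for_share(eigvals, share=0.95):
    """Smallest number of leading dimensions whose positive eigenvalues
    add up to at least `share` of the total positive eigenvalue sum."""
    ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ev = ev[ev > 0]
    cumulative = np.cumsum(ev) / ev.sum()
    n = int(np.searchsorted(cumulative, share) + 1)
    return min(n, len(ev))       # guard against rounding at share=1.0


# Made-up eigenvalues, largest first (not from the Lazaridis PCoA).
example = [5.0, 2.5, 1.2, 0.6, 0.3, 0.2, 0.1, 0.1]
print(dims_for_share(example, 0.90))
```

Trailing dimensions with tiny eigenvalues mostly add noise to the distances, which fits the observation above that the later dimensions inflated population distances without changing the structure.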

Matt said...

Further some example runs using Calc1:

Steppe_EMBA: https://pastebin.com/3UsjZUbs, Europe_MNChl: https://pastebin.com/gFJKq4MX

Some example runs using Calc2:

Europe_LNBA: https://pastebin.com/x68SSRer, Steppe_LMBA: https://pastebin.com/gqyvjjFi
Basque: https://pastebin.com/GDQNGUxj, English: https://pastebin.com/JRvAW6k8, Finnish: https://pastebin.com/YgEtAqmf, Greek: https://pastebin.com/8sGXFyyP, Lithuanian: https://pastebin.com/emkSGERN, Norwegian: https://pastebin.com/PRz0AtJp, Polish: https://pastebin.com/BNh88d68, Sardinian: https://pastebin.com/yPRMxth5, Sicilian: https://pastebin.com/s4AZkB3f, Spanish: https://pastebin.com/eEf0yu4P.

Intuitive fits, and ones that match *very* well with formal tests. (Only odd fits are the Southern Europeans, but I suspect that's because you'd actually need to include more actual late Bronze Age / Iron Age / recent samples from the Near East and North Africa in the calc file to fit them properly with this method.).

(for the runs, don't pay too literal attention to the distance fit % in comparison to other PCA runs. this depends strongly on the scale of the dimensions which varies from different PCA to PCA).

Matt said...

Finally, also, fitting some modern Europeans with a "leave one out from the calc file" methodology:

Basque: https://pastebin.com/s7wsHbCC, English: https://pastebin.com/YYH3zwLZ, Lithuanian: https://pastebin.com/8KLbWWzz, Norwegian: https://pastebin.com/hpViXCkq, Polish: https://pastebin.com/7cA5W66L

Will be cool to do with SC Asian+ancient Fst at some point as well...

Seinundzeit said...

Matt,

Exceedingly interesting; I think this could prove to be very fruitful.

I'll definitely give these a spin, thanks for posting the PCoA/calcs.

Also, I just played around with the Lazaridis et al. Sindhi Fsts, and the output looks rather nice.

63.55% Iraqi_Jew
32.35% Yamnaya
4.10% Onge
0.00% EHG

distance=4.8112

or

51.55% Mala
26.65% Yamnaya
21.80% Iraqi_Jew
0.00% EHG

distance=2.4459

The first one doesn't look bad at all, considering how Iraqi_Jew is a poor proxy for Iran_Neolithic, and how Onge are an even poorer proxy for ASI.

The second one is also very sensible (even though I'd rather not use Mala, as they are 60%-50% West Eurasian).

Looking at something as basic/spare as this, I can't help but share your enthusiasm for an extensive listing of ancient Fsts in relation to SCA.

David,

Is it feasible? If not, that's completely alright.

But if it is a possibility, I think it might really be worth it.

Seinundzeit said...

Ah, so things get really interesting for the Mala, if I add Han:

Mala

42.70% Iraqi_Jew
27.65% Han + 6.75% Onge
22.90% Yamnaya
0.00% EHG

distance=3.8161

If the Mala are similar to Primalai Kallars, then 35% ASI makes sense (I usually have the Kallar at 35%-40% ASI, using the PCA data-sheet).

My 60%-50% West Eurasian estimate was based on the Pulliyar (as I've always found that they tend to be around 50%), but perhaps Mala resemble Kallar more.

Matt said...

@Sein, fairly sensible outcome!

I've still got a few reservations about using the matrix through nMonte itself.

With the Haak data I generated a quick PCoA of the whole Fst matrix - https://pastebin.com/Xe6eD6CN.

(You might want to reduce the overall n of dimensions to a more sensible number if you were running with this - I didn't bother in the upload).

I'm not so sure about it as the Lazaridis Fst data though, where the PCoA+nMonte seems to produce very sensible looking dimensions and output proportions that essentially match what comes out of formal tests.

Couple calc sets for that for use for Sindhi and/or Mala: 1 - https://pastebin.com/efmDWNBC, 2 - https://pastebin.com/S4Q2E7yq
and some output proportions for them

1 - https://pastebin.com/GACcb756 (Sindhi: Iraqi_J 48.5, Yamnaya 30.3, Han 15, Onge 6.2), 2 - https://pastebin.com/mcm79fZJ (Mala: Iraqi_J 34.35, Han 29.75, Yamnaya 22.3, Onge 13.6) , 3 - https://pastebin.com/7f6AknrJ (Sindhi: Mala 50.35, Iraqi_J 30.85, Yamnaya 18.8, Han 0, Onge 0)

Seinundzeit said...

Matt,

Thanks! The PCoA output proportions look pretty sensible for both Sindhis and Mala.

I've done some models for West Asians/Caucasus pops, using the Lazaridis Fst PCoA, and I must say that these results are the most reasonable I've ever seen:

Iranians

66.9% Iran_ChL
24.6% Steppe_MLBA
5.6% Kharia
1.9% She
1.0% Yoruba

distance=4.9775

Georgians

45.45% Iran_ChL
21.25% Steppe_MLBA
17.35% CHG
14.10% Anatolia_N
1.20% She
0.65% Kharia

distance=7.0233

Chechen

45.85% Steppe_MLBA
40.50% Iran_ChL
8.60% CHG
3.05% She
1.75% Kharia

distance=6.1864

The reference populations I used were:

Anatolia_N, CHG, EHG, Iran_N, Iran_ChL, Levant_BA, Steppe_EMBA, Steppe_MLBA, Steppe_Eneolithic, and all of the non-West Eurasian populations.

Also, I used all 21 dimensions, just to see what would happen.

Again, definitely the most reasonable/sensible results I've ever seen for these West Asians.

So, I truly think you've found a very effective/fruitful method of going about things.

I really wish there was a Central/South Asian population in there (lol).

Alberto said...

@Matt

Thanks for that Fst based PCA. It seems to work really well. I ran European modern populations with the same sources used by David in his qpAdm tour of Europe (http://eurogenes.blogspot.mk/2017/01/qpadm-tour-of-europe-mesolithic-to.html) and the results are very similar:

https://docs.google.com/spreadsheets/d/1xcix-erUisL3UjRHpXgmU5QZF_jOrIFCx4iYLx5tlL8/edit?usp=sharing

Matt said...

@ Alberto and @ Sein, thanks, yeah, actually surprising to me how well this seems to fit with other methods and work, for such a brutally simple method unexplored in the literature (just the Fst matrix through a sort of MDS). Also in case you didn't spot, Davidski released a post of mine above from the spam folder purgatory with a few more of the fits for European populations.

Also Alberto, interesting to see the models with that choice of ancestors, which I didn't test (e.g. populations pick up Ulchi in absence of Steppe_EMBA, Iran_N, etc. and with choice of Anatolia_N, CHG, EHG, WHG as the other West Eurasians). Nitpicking, bear in mind this is PCoA (a kind of MDS), and if we just dropped a Fst matrix into PCA, IME you wouldn't get very good results (strange horseshoe shapes when graphed, etc. and the resulting dimensions don't work well in nMonte).

(A few graphics from these PCoA. Graphed: http://i.imgur.com/rCzjglv.png for WE populations and a neighbour joining tree - http://i.imgur.com/Xa0ewhf.png)

Also had a go at using a calc with all the ancient populations - https://pastebin.com/KuSZtNZa - to test what the outcome was, as I thought that might capture populations outside Europe and within South Europe a bit better.

Results - https://pastebin.com/Hit4NeUC

Seems like most North-Central Europeans are covered by Europe LNBA with approx 10% of differences in expected directions (a bit of WHG in Polish, Steppe EMBA and WHG in Lithuanians, etc.), Basques are a mix of roughly equal parts Europe LNBA and EuropeMN.

As we move through the Med from west to east, nMonte chooses to mix in more Armenia_MLBA and drop off EuropeMN for Europe EN, from 20% Armenia_MLBA in Spain to 50% in Greece. Near Eastern / West Asian populations seem chiefly to model as the Armenia_MLBA melting pot, with varying components as expected for where they are (e.g. Iran takes on Iran_Chl and Kharia, Palestinians take on Levant_BA and some Yoruba, etc.)

Not sure if this maps to anything real though, but given it performs well elsewhere it's interesting to me.

Matt said...

Few more experiments.

This time ran a PCoA on the Fst matrix from Mathieson et al 2015. Compared to Lazaridis 2016, this one has many fewer modern West Eurasian populations, and world populations, and ancient populations but has a potential advantage that the ancients were split into more specific cultures. So rather than Europe_MNChl or Europe_EN or Steppe_EMBA, it has Iberia_MN, Iberia_CA, Central_MN, Iberia_EN, Hungary_EN, etc.

So the PCoA on this Fst matrix: https://pastebin.com/EZDb58ez. As previously, I cut this down to 21 dimensions (https://pastebin.com/j4xeP2W6) for the following.

Calc 1 - Anatolia and Europe EN-CA, Yamnaya_Samara, WHG, EHG, SHG and other world populations: https://pastebin.com/6TKAmg2q. Results: https://pastebin.com/4rZdP01c

Kind of as expected in overall proportions of Yamnaya / Europe EN / MN / extra HG. One facet that shows up here is that there are fairly substantial proportions of Iberia Early / Middle Neolithic detected in Southwest Europe today (20% Sardinian, 15% Basque, 10% Spanish) which are not found in other populations tested who are exclusively fit with Central_MN.

Calc 2 - All ancients (including LNBA Europeans) and world populations: https://pastebin.com/9u3YwVn7. Results: https://pastebin.com/8YVABmVb

Here the NW and W Central Europeans tend to be fit almost exclusively with Bell Beaker, Spanish+Spanish pick up Bell Beaker+Iberia&CentralMN, NE Europe (Lithuanians) get HungaryBA and Northern_LNBA, then E Central Europeans get Bell Beaker with a significant HungaryBA fraction. Could be driven by the WHG:AN:Steppe ratios, but also from looking at dimensions, possibly picking up extra information beyond that... It's a little similar to the haplotype peak of HungaryBR2 in Eastern Europe and Rathlin in the British Isles.

(All this said, I'm a little less confident on these than the Lazaridis Fst PCoA as there are those weird 1-3% Han showing up in some populations.)

Seinundzeit said...

Matt,

I'd say that this is vastly superior to using D-stats with nMonte.

In fact, after exploring some "basal" modelling, it's quite clear (at least to me) that this also performs better when compared to qpAdm.

For example, if my memory serves me right, Lazaridis et al. had Anatolia_N construed as 40% WHG, 30% Iran_Neolithic, and 30% Levant_Neolithic. Europe_EN was 45% Iran_Neolithic, 40% WHG, and 15% Levant_Neolithic.

Quite frankly, that was/is rather strange.

With your PCoA coordinates, this is what I've found (when only using Levant_N, Iran_N, WHG, and EHG):

Anatolia_N

85.55% Levant_N
6.85% EHG
5.55% WHG
2.05% Iran_N

distance% = 10.1869 %

Europe_EN

80.85% Levant_N
12.80% WHG
6.35% EHG
0.00% Iran_N

distance% = 10.4282 %

Far more sensible.
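For anyone who wants to see what this kind of fit is doing, here is a minimal sketch in Python. It is not nMonte itself (which is an R script) but a simple hill-climb in the same spirit: start from equal weights, shuffle small amounts of weight between source populations, and keep any move that brings the weighted mix of source coordinates closer to the target. The function name `fit_mixture`, the step size and the toy coordinates are all assumptions made for illustration.

```python
import numpy as np

def fit_mixture(target, sources, steps=20000, step_size=0.005, seed=0):
    """Estimate non-negative mixture weights (summing to 1) for `sources`.

    target  : 1-D array of coordinates for the population being modelled
    sources : 2-D array, one row of coordinates per reference population
    Returns (weights, distance), where distance is the Euclidean distance
    between the fitted mix and the target.
    """
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    sources = np.asarray(sources, dtype=float)
    k = sources.shape[0]
    weights = np.full(k, 1.0 / k)          # start from an equal mix
    best = np.linalg.norm(weights @ sources - target)
    for _ in range(steps):
        # Move a little weight from one source to another.
        giver, taker = rng.choice(k, size=2, replace=False)
        delta = min(step_size, weights[giver])
        if delta == 0.0:
            continue
        trial = weights.copy()
        trial[giver] -= delta
        trial[taker] += delta
        dist = np.linalg.norm(trial @ sources - target)
        if dist < best:                    # keep only improving moves
            weights, best = trial, dist
    return weights, best
```

In real use the rows of `sources` would be the PCoA coordinates of the reference populations (e.g. Levant_N, Iran_N, WHG, EHG) and `target` the coordinates of the population being modelled. The weights are only resolved to roughly the step size.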

Again, I really have to thank you, for your work on this.

Also, I'm pretty sure that we can now (finally) solve the "MA1/AG3, or Steppe_EMBA, or Steppe_MLBA" conundrum for South Central Asia (if we can eventually utilize this sort of data for Central and South Asian populations).

Davidski said...

What is this stuff based on, just Fst?

Seinundzeit said...

Obviously, Matt can explain this far better, and he can also go into much richer detail; but yeah, it seems that he just ran an Fst matrix through some kind of MDS.

By far, this is the best data I've ever seen for inferring admixture proportions. I'd say that it works better than nMonte with unlinked genotype PCA data, works better than nMonte with D-stats, and it also seems to work better in comparison to qpAdm.

Every result makes sense in terms of the reference populations used and the subject population being tested, and it also yields very consistent/stable output, no matter what you throw at it (the latter is in sharp contrast to nMonte with PCA).

If you ever find the time, and if you ever have the inclination, I think it would prove to be very fruitful if you tried to replicate the Lazaridis et al. Fst matrix, but added AG3/MA1 (and any other ancients), and all the Central/Southern Asians that you can/want to add.

As always, no rush or pressure.

Matt said...

@ Davidski, yeah, pretty much the full extent of the steps to replicate this is as follows:

1. Download Fst matrix file from Lazaridis 2016 supplement - http://biorxiv.org/highwire/filestream/16360/field_highwire_adjunct_files/3/059311-4.xlsx

2. Download Past3 - https://folk.uio.no/ohammer/past/

3. Adjust formatting on Fst data. In the raw table the standard error was above the diagonal, and the Fst below, so mirrored Fst above and below the diagonal with 0 on the diagonal - https://pastebin.com/31kvLwgc

4. Paste into Past3 - http://i.imgur.com/G2mQFzM.png

5. Run PCoA function. Set the Similarity Index as User Defined Distance, Transformation Exponent = 1, and set Eigenvalue Scale. Recompute. Copy out the scores - http://i.imgur.com/aavsQYD.png

(To explain the settings, which I believe work as follows: 1) Similarity Index as User Defined Distance means Past3 applies the PCoA function to the distance matrix you provide, rather than calculating a secondary similarity index on the data first; 2) Transformation Exponent = 1 means that you are computing on the distances as provided, whereas 2 would use the square of the distances, etc. I found 1 represented the structure better and that using a transformation distorted the dimensions, but you might find differently; 3) Eigenvalue Scaling just means that the exported data will be eigenvalue scaled.)

6. Paste the exported data and save as a .csv file for use in nMonte - https://pastebin.com/vaRivrPd. You might have to do some minor formatting before the file is usable.
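For those who would rather script steps 3 to 5 than click through Past3, the following is a rough sketch in Python with NumPy. It mirrors the below-diagonal Fst values and runs a principal coordinates analysis by double-centring the distance matrix. Two things here are assumptions rather than confirmed Past3 behaviour: that "Transformation Exponent" is the power the distances are raised to before the eigenanalysis (2 being the classical form), and that "Eigenvalue Scale" multiplies each axis by the square root of its eigenvalue. The function names are made up for this example.

```python
import numpy as np

def mirror_lower_triangle(raw):
    """Keep the below-diagonal values (Fst), copy them above the diagonal,
    and put 0 on the diagonal. Values above the diagonal (standard errors
    in the raw table) are discarded."""
    lower = np.tril(np.asarray(raw, dtype=float), k=-1)
    return lower + lower.T

def pcoa(dist, exponent=1.0):
    """Principal coordinates of a symmetric distance matrix.

    exponent=2 is the classical form; exponent=1 is assumed to match the
    'Transformation Exponent = 1' setting described above.
    Returns (scores, eigenvalues) for axes with positive eigenvalues,
    with each axis scaled by sqrt(eigenvalue).
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    a = -0.5 * d ** exponent
    centring = np.eye(n) - np.ones((n, n)) / n
    b = centring @ a @ centring            # double-centred matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-12                 # drop zero / negative axes
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    return eigvecs * np.sqrt(eigvals), eigvals
```

The scores can then be written out with population labels as a .csv, e.g. with `np.savetxt`, for use in nMonte. Output from this sketch has not been checked against Past3, so axis signs and scaling may differ.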

...

The full file runs to about 70 axes, to try and fit all the populations, but cumulative variances are done at about 80% by 10 and 90% by 20, so I would trim to 21 to get nMonte to run in practical time. As always with dimensional data on nMonte there is the subjective element on when you choose to cutoff the number of dimensions.
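As a small aid for choosing the cutoff, here is a hedged Python sketch that takes the PCoA eigenvalues and reports how many leading axes are needed to reach a given share of the variance. It assumes the share is computed over the positive eigenvalues only, which may not be exactly how Past3 reports its percentages. `axes_for_variance` is a hypothetical helper name.

```python
import numpy as np

def axes_for_variance(eigvals, target=0.90):
    """Return (n_axes, cumulative), where n_axes is the smallest number of
    leading axes whose eigenvalues reach `target` of the positive total."""
    vals = np.asarray(eigvals, dtype=float)
    vals = np.sort(vals[vals > 0])[::-1]   # positive eigenvalues, descending
    cumulative = np.cumsum(vals) / vals.sum()
    n_axes = int(np.searchsorted(cumulative, target)) + 1
    return min(n_axes, len(vals)), cumulative
```

This only makes the cutoff reproducible. It does not remove the subjective element of deciding which variance share to aim for.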

In the copy I put up for download earlier, I'd curated out the Italian_South, and the Jewish populations other than Ashkenazis, as they looked very drifted (or there's an error) which led to weird extra axes which they dominated at around axis 20+, but really if you're using relatively small numbers of the dimensions this may not actually matter or be necessary.

Matt said...

Couple more technical details (put in a separate post to not data dump you lot in one post):

--------------

1) Comparing the PCoA on the matrix to trying to run PCA on the West Eurasian subset of the matrix: http://i.imgur.com/FsqFLrt.png. There are similarities, but there's obviously a strange "horseshoeing" in the PCA data where a big major axis forms separating the modern populations who are globally low drift / diverse from the high drift WHG / SHG / Natufian. The PCoA forms a shape which makes much more sense.

That said, discarding the PC1 in the PCA (the "drift dimension") and plotting PC2 vs PC3 does give you something like Axis 1 vs Axis 2 from the PCoA: http://i.imgur.com/QRs7RFK.png. But I think there are still features that look worse and make less sense than the PCoA, so I don't know if it would work as well in nMonte. Less subjectively, it seems that formally PCA is not really the technique for analysing a matrix of distances (see upthread discussion with Garvan).

2) Re: the Transformation Exponent = 1 setting I used above, I found that this made dimensions which matched conventional PCA best and fit fine structural divergences best. For example: http://i.imgur.com/yJwoZsz.png. Axis 1 vs Axis 2 with exponent = 1 "makes sense" while Axis 1 vs Axis 2 with exponent = 2 looks like a hybrid of the "West Eurasian dimensions" with the "World dimensions" (incorporating that information together).

But again "That said...", there could be benefits to an exponent of 2. Comparison of the Neighbour Joining Tree with the distances from the PCoA with Transformation Exponent = 1 - http://i.imgur.com/PS9Jbm1.png. You can see populations get their own longer branches than on the raw Fst matrix (this may be what is called "overfitting"?). As well, the EuroHG and ancients generally are less embedded as joining modern neighbours. Whereas with Transformation Exponent = 2 - http://i.imgur.com/Owspu1r.png this aspect is in check, but some of the fine scale structure at Fst of 0.001-0.003 among recent Europeans has been transformed away and is less precise.

(Intuitively I'd think some of the problem with overfitting might be reduced by running Fst to more decimal places, as I'd think there would then be fewer cases on the fine scale where rounding gives sets of Fst distances which are identical save for a few, and which are then difficult to fit.)

@Ryukendo, fully agree with your note of caution there. IIUC it should be that the procedure is using the Fst distances to a position on the underlying shared genetic space (a la the way the example here of the PCoA technique maps physical distances between places to positions http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/) but yes I'm also not sure about interpretation, if that's coherent, and what that really means. It was most interesting to me that it did work at all, and matched formal outcomes / other methods fairly closely, but beyond that, I don't disagree with you here.

Davidski said...

I'll try and knock out an Fst matrix with a few extra pops that weren't featured in literature yet, but I'm not sure merging AG3 and MA1 will work.

Seinundzeit said...

David,

Thanks!

If it works as well as the Lazaridis et al. Fst matrix (when subjected to the PCoA technique), it'll be of considerable interest.

Just spit-balling here, but if merging MA1 with AG3 isn't sensible in this context, and if only MA1 can't be added, would it work with your old synthetic ANE samples?

They behaved just like MA1/AG3 in ADMIXTURE/PCA, so would they work in the context of Fsts?

Matt said...

Couple bits of potential food for thought, David (for after you get comfortable that you're able to get an autosomal matrix that is similar to that from Lazaridis 2016 and works well with this):

1) Given the Goldberg sex asymmetric migration paper, it might be cool to make an fstX matrix for a good set of populations. Maybe Lazaridis 2016 plus extra world populations in Asia and Africa, and any of the new ancient European samples since available that you've been able to import working with. Then we could compare the same method on that matrix and see if it coheres to a similar set of axes and if so how ancestral proportions estimate over that with nMonte.

2) I guess you'll have to consider whether for the various European ancient cultures, you lump a la Lazaridis 2016 (e.g. Steppe_EMBA, Europe_MNChl) or split a la Mathieson 2015 (e.g. Yamnaya, Iberia_MNChl, Hungary_CA, etc.). Using the split European cultures a la Mathieson with the wider range of recent and ancient West Eurasia populations from Lazaridis might be a way to potentially yield new insights, but OTOH estimating Fst might be tougher to do accurately with fewer samples per group. (I'd be interested to see the Baltic_BA, Anglo-Saxon, Iron Age Britain, etc. in this context).

Davidski said...

@Matt and Sein

Here's my first attempt at an Fst matrix.

https://drive.google.com/file/d/0B9o3EYTdM8lQYWtGYlBtMEx6eFE/view?usp=sharing

Can't add the British Iron Age and AS samples to this. Might be able to at some point when better sequences are released.

Seinundzeit said...

David,

Thanks! I truly appreciate this. We're looking at a massive treasure trove of data here.

I'll let Matt do what he does, and then we'll see how well this works (although, I'm pretty confident that this will work great).

But before that, I'll do some fits just using the Fsts, to see how things look.

Matt said...

@ Davidski, thanks for taking the time for this.

Here are a few graphics based on the data:

Neighbour Joining Structure: http://i.imgur.com/7dBelZY.png

PCoA of all world: Axes 1 vs 2- http://i.imgur.com/hrrcTFe.png / http://i.imgur.com/OukHp2Y.png, Axes 4 vs 6 - http://i.imgur.com/2YNfXwz.png,

"West Eurasians": Axes 1 vs 2 - http://i.imgur.com/lQG4FmH.png
"West and South Eurasians": Axes 1 vs 2 - http://i.imgur.com/oPDfbvq.png

Generally similar to the same calculated off Lazaridis Fst scores with some differences, like in "West Eurasians only" Natufians tending less to take an extreme position in low dimension and more to fall in another separate dimension higher up. Haven't checked to tell if this is because of differences in Fst scores, or maybe because of differences in sample composition (e.g. AG3-MA1).

@ Sein, I don't have time to do any testing today, but here is the same PCoA procedure I described a few posts up run over this set:

Full Set (158 axes!) https://pastebin.com/NjZRU0a0

Reduced to 25 dimensions (with approx 90% variance) https://pastebin.com/6sGH0evu

When I looked at the "West Eurasian only" set there tentatively look to be some interesting relationships: between Iran_N / Iran_Chl and present day Iran and Tajik populations that are distinct from others, between Hungary_EN, Hungary_BA and present day Eastern Europe (and Corded Ware / Bell Beaker and Central_MN with Northwest Europe). May or may not work in nMonte. Though I expect you're of course going to be more interested in running the SCA models first and making sure this replicates the ancient European modelling work, so you have confidence that the SCA models would make sense.

Seinundzeit said...

Matt,

Thanks for the data-sheets!

Also, I can now see why you utilize PAST, lots of interesting functions (I copy and pasted David's Fst matrix earlier today, and quickly played around with K-means, PCA, PCoA, etc. Fun stuff).

Regardless, (thankfully) I think David's efforts have paid off.

Just to see if things make sense, I tried some basal models involving a few ancient populations.

So far, I really like what I see.

EHG:

63.55% AG3-MA1
36.45% WHG

distance=10.6302

Motala_HG

79.8% WHG
20.2% AG3

distance=10.622

LBK_EN

79.75% Levant_Neolithic
16.40% WHG
2.35% AG3-MA1
0.80% Ami
0.70% Onge

Distance=10.1974

CHG

91.25% Iran_Neolithic
4.85% WHG
3.65% AG3-MA1
0.25% Ami

Distance=9.8501

Iran_Chalcolithic

58.65% Iran_Neolithic
34.35% Levant_Neolithic
7.00% AG3-MA1

distance=6.4142

Very solid results.

To further test things, I decided to use more recent references for Yamnaya.

Yamnaya_Samara

50.85% EHG
42.95% CHG
6.20% LBK_EN

Distance=5.7887

Finnish

33.25% LBK_EN
27.05% EHG
16.95% WHG
15.35% CHG
7.40% Ulchi

Distance=7.4709

Sensible results.

So, I'm pretty confident that whatever I find for South Central Asians will closely reflect reality.

"Pathans" (Pashtuns):

37.15% Sarmatian_Pokrovka + 3.50% Yamnaya_Samara
36.15% Iran_Chalcolithic + 9.10% Iran_Neolithic + 1.55% AG3-MA1
7.15% Ami + 3.15% Onge + 2.25% Papuan

distance=6.97

Tajik_Ishkahim:

38.10% Sarmatian_Pokrovka + 15.50% Yamnaya_Samara
35.10% Iran_Chalcolithic
7.20% Ami + 2.35% Onge + 1.75% Papuan

distance=5.2283

Tajik_Rushan:

38.20% Sarmatian_Pokrovka + 15.45% Yamnaya_Samara + 3.00% Srubnaya
36.15% Iran_Chalcolithic
5.65% Ami + 0.95% Papuan + 0.60% Onge

Distance=4.7159

Very similar to what I've found using nMonte with PCA, which definitely boosts my confidence in these results.

I'll try a few more things tomorrow.

Davidski said...

@Sein

How many dimensions (PCoA coords) are you using there?

Seinundzeit said...

David,

Just 25, although I do want to see if things change with more dimensions (tomorrow).

Davidski said...

New sheets. I tweaked a couple of things.

Fst matrix

https://drive.google.com/file/d/0B9o3EYTdM8lQUmxXWFliaHhaQWs/view?usp=sharing

PCoA coords

https://drive.google.com/file/d/0B9o3EYTdM8lQLVV6TjJOX0tmUHM/view?usp=sharing

Matt said...

@ Sein: "Also, I can now see why you utilize PAST, lots of interesting functions (I copy and pasted David's Fst matrix earlier today, and quickly played around with K-means, PCA, PCoA, etc. Fun stuff)."

Yeah, I don't know if I ever conveyed properly that there are many multivariate functions in Past3 that basically run in *seconds* and allow very quick visualisation of outliers, general relationships, etc. I began to use it because Davidski had datasheets for it, and it's got a lot of uses beyond that. I find it's always best to use Past3 to visualise before working on any data with nMonte, etc, as you can quickly spot what "makes sense" to test and any odd population relationships / outliers.

@ Davidski, looking at your newest set, one note about this set compared to the one from Lazaridis is that when I run the PCoA on it, I find the inclusion of far more populations outside of West Eurasia does mean that many of the <20 or even <30 dimensions are dominated by contrasts in Fst between these populations. Particularly this is driven by the accelerated Fst between some of the small hunter gatherer / indigenous populations in Africa, South America, Siberia who've been at low population size or through recent bottlenecks.

So I think if we're looking at using low dimension (<30 or <20) in nMonte (for quicker runtimes, etc.) you might get the maximum bang for buck among West Eurasian differences by curating some of them out of versions of the matrix before running the PCoA.

Examples I've chosen to redact were: Hadza, Nganasan, Atayal, Ami, Lahu, Karitiana, Surui, Mixe, Pima, Piapoco, Bougainville, Mbuti, Eskimo, Itelman. Some are small populations with lots of drift, others aren't so much, but still not that relevant to what I'm looking to pack into lower dimensions. PCoA result - https://pastebin.com/XSdLN4Gg, Fst matrix - https://pastebin.com/i5ckdeC3
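For anyone repeating this curation on their own copy of the matrix, a minimal Python sketch follows. It drops the chosen populations from both the rows and the columns of a labelled Fst matrix before the PCoA is rerun. The helper name `drop_populations` and the assumption that the labels are held in a plain list alongside the matrix are illustrative only.

```python
import numpy as np

def drop_populations(names, fst, exclude):
    """Remove the populations in `exclude` from a square Fst matrix.

    names   : list of population labels, in row/column order
    fst     : square matrix of pairwise Fst values
    exclude : labels to remove (labels not present are ignored)
    Returns (kept_names, reduced_matrix).
    """
    exclude = set(exclude)
    keep = [i for i, name in enumerate(names) if name not in exclude]
    kept_names = [names[i] for i in keep]
    fst = np.asarray(fst, dtype=float)
    # np.ix_ selects the same indices on both rows and columns.
    return kept_names, fst[np.ix_(keep, keep)]
```

The PCoA has to be recomputed on the reduced matrix. Simply deleting rows from an existing set of coordinates would leave the axes shaped by the removed populations.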

To be honest, it's also the case that the Onge has a dimension dominating effect, and so do the Kalash, Iranian_Zoroastrians, among ancients some populations, particularly the Potapovka (which seems to have an extremely unusual Fst character compared to other Steppe_EMBA). To some degree Laz dodged the accelerated Fst between some ancients (may be due to high drift?) by lumping populations together under categories like Steppe_EMBA. But if you want to estimate those populations, or estimate from those populations (like you want to model Kalash with Onge as a potential ancestor) you can't remove them.

Without Kalash, Zoroastrian, Potapovka, Onge: PCoA - https://pastebin.com/NP8zBzwB, up to axis 23 which seems pretty good for lots of interesting West Eurasia variation - https://pastebin.com/sGAJ8EsC, Fst matrix - https://pastebin.com/ezrCdT2n.

That said, I think there may be an issue with this technique when using high numbers of dimensions: populations are fit in the low dimensions in a way that removes noise, and are then overfit in the high dimensions in a way that doesn't totally reflect population relationships. Overfitting problems. Overall this might average out over all the dimensions to give a roughly correct topology but with accelerated distances, and also if you cut off at an arbitrary high dimension (even 20+?) you might end up over / underfitting. I don't like the arbitrariness of picking a dimensionality to stop looking at (though I guess in a sense it's no worse than choosing a pright and pleft or choosing a level of K in ADMIXTURE, etc.).

As the high dimensions are where the very slight differences between different BA populations, etc. live, some of what I might try to do with this might not be practical, even assuming no error - e.g. model Europeans with various different BA and older ancients and world pops (so we can model if e.g. Lithuanian fits better as Unetice / Bell Beaker / Hungary_BA descended, Spanish as having Central European Bell Beaker ancestry, etc).

Davidski said...

Ah, just noticed that, of course, if I remove some outliers, other outliers eventually take their place, and it's impossible to decide which populations and dimensions to use.

Think I'll stick to qpAdm for now, and Global 10 when more ancient samples come out.

On a positive note, I like the way the Fst is more sensitive than D-stats and f3-stats for population affinities, both ancient and modern. It's similar to a haplotype test in that regard, except easier to run.

Seinundzeit said...

Matt,

Not having the Kalash, Zoroastrian, Potapovka, and the Onge really seems to clean things up.

Pashtuns (Pathans):

37.40% Sarmatian_Pokrovka + 3.10% Andronovo + 0.10% Yamnaya_Samara
34.30% Iran_Chalcolithic + 10.80% Iran_Neolithic + 2.55% AG3-MA1
9.20% Dai + 2.50% Australian

distance=6.9555

I wouldn't worry about not having the Onge in there, as the levels of ENA are still the same (around 12%), and as per Reich and Lipson (personal communications) it seems that they no longer think ASI has any heightened relatedness to the Onge.

I think if one adopts the Lazaridis et al. strategy of lumping similar ancient populations together (Steppe_EMBA, Steppe_MLBA, etc), and if one avoids hyper-drifted modern populations (like the Kalash), this method is definitely superior to qpAdm, and it works better than unlinked genotype PCA data.

Although, I should note that the model above is quite similar to what I get for Pashtuns using the Global_10 (45% Sarmatian-related, 45% Iran_Chalcolithic + Iran_Neolithic, and 10% ASI).

Matt said...

@ Davidski, I do think it's probably worth running off an Fst matrix and running these and other analyses when substantial amounts of new samples come out *but* yeah I'm pretty cautious about these stats being a better method. They can be comparable in some ways and have advantages and disadvantages.

@ Sein, that's pretty cool as a fit for Pashtuns. Got some interesting fits for Europeans with that 23 dimension set that I posted upthread, using the "All Ancients+Populations outside West Eurasia":
https://pastebin.com/3uKTreXM

So some of the qualities there are: 1) Some continuity of Iberia_EN in Spain and Sardinia, 2) Bell Beaker Germany linked across West European populations, 3) Unetice across Northern and NE Europe (Iceland, Lithuania, Finland), 4) Hungary_BA links across Central Europe (Polish, Ukraine most strongly), 5) small levels of Sarmatian in Baltic-Slavic populations. That's all pretty nice. (Also a few things that didn't make sense at low levels (e.g. Sintashta in Basque).)

Also very low levels (1-2% generally) of Siberian ancestry that didn't quite make sense. One theory I have on this point is that a slight issue with these Fst->PCoA models might be that, as they capture a lot of differentiation, more than formal stats, they also capture damage, etc. that makes ancients systematically different from moderns and elevates their Fst. Particularly at high numbers of axes. So this may mean that, for example for West Eurasians, even slight amounts of undamaged West Eurasian ancestry in moderns can get picked up as better fits than ancients which may have some damage. This problem seems worse if we allow populations with substantial (25%-50%) WE into the calculator. (Examples - https://pastebin.com/H1vaw8MF with pops picking up pretty substantial fractions of Yukaghir Forest that aren't real). I hope this doesn't affect anything you're trying to model in SC Asia much.

There are a bunch of advantages to different modes. Genotype PCA, ADMIXTURE and qpAdm, for instance, let us project individuals on, which you can't really do here. Likewise genotype PCAs are particularly cool in that they can detect specific SNPs that are driving the dimensions (and so show particularly high pop differentiation), and you can be more certain that any relationship reflects real structure in particular SNPs and is not just making the levels of differentiation fit, as with us using Fsts.
(Drawback of this genotype PCA may be that "For principal component analysis (PCA), Patterson et al. [5] have argued that structure reveals itself much like a phase change in physics; namely if the product of the number of genetic markers (m) and individuals (n) is greater than 1/(FST^2) then structure will be evident." (Novembre et al 2016 - http://biorxiv.org/content/biorxiv/early/2016/09/04/073221.full.pdf) so it may be that it requires more individuals and markers to visualise the structure than can currently practically be obtained for some ancient cultures).
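To put rough numbers on that quoted threshold, here is a small Python sketch of the rule as stated in the quote, i.e. structure should be evident when m × n > 1/FST^2. The function names and the example figures are illustrative assumptions, not values taken from any of the papers.

```python
import math

def structure_detectable(n_markers, n_individuals, fst):
    """True if markers x individuals exceeds 1 / Fst^2 (the quoted rule)."""
    return n_markers * n_individuals > 1.0 / fst ** 2

def min_individuals(n_markers, fst):
    """Smallest whole number of individuals that exceeds the threshold
    for a given number of markers and a given Fst."""
    return math.floor(1.0 / (fst ** 2 * n_markers)) + 1

# Illustrative only: at Fst = 0.001 the threshold is about 1,000,000,
# so 100,000 markers typed in 20 individuals would clear it,
# while the same markers in 5 individuals would not.
print(structure_detectable(100_000, 20, 0.001))
print(structure_detectable(100_000, 5, 0.001))
```

The point of the arithmetic is that at the very fine Fst values seen among recent Europeans (0.001-0.003), the product of markers and individuals needs to be in the hundreds of thousands to a million, which is demanding for sparsely sampled ancient cultures.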

Matt said...

In both Lazaridis and the Eurogenes Fst, does seem like Fsts for ancient dna and modern dna tend to be systematically closer to each other, comparing ancients to closest (or one of closest) modern: http://i.imgur.com/Jj9o50q.png / http://i.imgur.com/oHlZieN.png / http://i.imgur.com/f096AFZ.png :/. E.g. not simply the case that Bell Beaker is systematically closer to some ancients compared to Icelandic / Scotland (indicating different admix), but rather it seems slightly closer to all, and vice versa.

Seems of less importance comparing ancient to ancient: http://i.imgur.com/pcg9x9Z.png

Seinundzeit said...

Matt,

Those are some pretty interesting patterns.

For South Central Asia + South Asia, this data seems to provide evidence for a few ideas that were first articulated by RK.

RK has often noted that the best way to differentiate scheduled-caste populations from Brahmins (across all of India) is by looking at affinity towards Steppe_MLBA populations.

Those are also the exact same populations which had Central/South Asian-specific R1a subclades.

Yet, the ancient West Eurasian populations closest to all modern South Asians, be they low-caste or Brahmin, are actually Steppe_EMBA.

If we take into account the fact that those populations were dominated by R1b, which is lacking in South Asia (although Pashtuns, Tajiks, Uzbeks, etc, do show R1b at low percentages), this leaves us with a bit of a conundrum.

RK's solution was (and, I'm assuming still is?) the notion that there is a Central Asian ANE substrate across South Asia, and some Iran_Chalcolithic/CHG-related ancestry in all South Asian populations.

So, the combination of ANE, Iran_Chalc, and actual Steppe_MLBA, produces the "illusion" of Steppe_EMBA ancestry.

Yesterday, that was the exact pattern I saw, when I modeled South Asians. They all showed Srubnaya, but had 0% Yamnaya.

Instead of Yamnaya, they showed ANE and Iran_Chalc; so I feel like RK deserves a shout-out.