
Saturday, November 12, 2016

Days of high adventure


I've redesigned and streamlined my Principal Component Analysis (PCA) plot of West Eurasia in anticipation of the arrival of many more ancient samples. Rumor has it we'll not only get stuff from the Balkans, but also finally from the steppes north of the Black Sea and South Asia.

I'd say my new plot does a better job of highlighting relationships between the different prehistoric groups and population shifts across space and time. The datasheet is available here.
It should be pretty clear from this plot how the modern-day European gene pool came about. So I don't expect any major surprises when the new samples come in. Nevertheless, the wait is killing me, and many others I'm sure.

Update 13/11/2017: A year on, I've acquired many new samples and, as a result, streamlined the PCA some more, see...

Who's your (proto) daddy Western Europeans?

235 comments:

Matt said...

Lovely datasheet. Thanks.

How possible would it be to do a version of the datasheet that added the following samples in : http://textuploader.com/d5fy8 ?

Applying Linear Discriminant Analysis to these dimensions (maximises the degree to which dimensions capture differences between labeled groups rather than individual differences):

http://i.imgur.com/MH7UAeB.png
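For readers unfamiliar with the technique, here is a minimal sketch of what a discriminant axis does, reduced to two labelled groups in two dimensions (Fisher's linear discriminant). It is not the multi-group analysis run on the datasheet above; the function names and any coordinates fed to them are hypothetical.

```python
def group_mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter(points, m):
    """Within-group scatter matrix [[a, b], [b, d]] returned as (a, b, d)."""
    a = sum((p[0] - m[0]) ** 2 for p in points)
    b = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    d = sum((p[1] - m[1]) ** 2 for p in points)
    return a, b, d


def fisher_direction(g0, g1):
    """Unit vector w proportional to Sw^-1 (m1 - m0) for two labelled groups."""
    m0, m1 = group_mean(g0), group_mean(g1)
    a0, b0, d0 = scatter(g0, m0)
    a1, b1, d1 = scatter(g1, m1)
    a, b, d = a0 + a1, b0 + b1, d0 + d1      # pooled within-group scatter
    det = a * d - b * b
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # inverse of [[a, b], [b, d]] is (1/det) * [[d, -b], [-b, a]]
    wx = (d * dx - b * dy) / det
    wy = (-b * dx + a * dy) / det
    norm = (wx * wx + wy * wy) ** 0.5
    return wx / norm, wy / norm


def project(point, w):
    """Position of a sample along the discriminant axis."""
    return point[0] * w[0] + point[1] * w[1]
```

The axis ignores directions in which samples vary a lot within each group and favours the direction that separates the group labels, which is why the result looks different from a plain PCA of the same samples.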

Axis 1 and Axis 2 give a bit more of a straight line and not much of an arch from NearEast_N-Epi to WHG through Anatolia_N (with some AN tilted towards Caucasus).

Axis 3 and 4 don't look nice, as Iran_Hotu wants to sit with Iran_Neolithic, preventing the impression of a clean Caucasus vs Iran discriminant dimension from forming.

After relabelling Iran_Hotu as Iran_Hunter_Gatherer and Kotias and Satsurblia as CHG:

http://i.imgur.com/uenMlhe.png

There seem to be some additional low-level affinities between CHG and Anatolia_Neolithic here splitting off the other southern groups (Iran and Near East) and then it looks like a residual very low level affinity between some pre-Kurgan Europe, some WHG and the Iran Neolithic cluster (seems like these samples are Iberia Chal and MN for pre-Kurgan and then for WHG Falkenstein, Ranchot, Rochdane, Iberia_Meso, Continenza, Villabruna, BerryAuBac, Chadardes).

In the above, I dumped the following outliers:

- The sample Levant_Neolithic:I701 is a far outlier on some of these dimensions, PC8 and PC9.

- The sample Iboussieres39:ADI_d seems like an outlier in at least dimension PC5.

They both have an odd position in neighbour joining, with very long branches, because of this: http://i.imgur.com/bGPDWeD.png

Neighbour joining w/out them looks a little better: http://i.imgur.com/SqHY2Rc.png
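One simple way to flag samples like these is a per-dimension z-score check, sketched below. This is an assumption for illustration, not necessarily how the outliers above were identified; the function name, the default threshold and the input layout (a dict of label to coordinate list) are all hypothetical.

```python
from statistics import mean, pstdev


def flag_outliers(samples, threshold=3.0):
    """Return {label: [dimension numbers]} for extreme coordinates.

    samples: dict mapping a sample label to its list of PC coordinates.
    A coordinate is flagged when it lies more than `threshold` population
    standard deviations from the mean of that dimension.
    """
    names = list(samples)
    n_dims = len(samples[names[0]])
    flagged = {}
    for d in range(n_dims):
        column = [samples[name][d] for name in names]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            continue  # a constant dimension cannot contain outliers
        for name in names:
            if abs(samples[name][d] - mu) / sd > threshold:
                flagged.setdefault(name, []).append(d + 1)  # 1-based PC number
    return flagged
```

Note that with only a handful of samples no point can reach a z-score of 3, so a check like this only makes sense on a full datasheet.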

I tried sorting the populations into K-means clusters (for the same number of groups, 16) and using those as the groups for Linear Discriminant Analysis instead:
http://i.imgur.com/fSmMUU4.png
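For anyone who wants to try the same kind of grouping, here is a bare-bones K-means sketch (Lloyd's algorithm). The farthest-point initialisation is my own choice to keep the result reproducible; common implementations use random or k-means++ starts, and the function names here are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two coordinate sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100):
    """Return (labels, centers) for `points` split into `k` clusters."""
    # Deterministic start: first point, then repeatedly the point that is
    # farthest from all centers chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

The resulting labels can then be used as the group labels for a discriminant analysis, in place of the population names.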

Clusters on these dimensions seem to lose some of the substructure (e.g. between Nordic HG and WHG, between later Caucasus and CHG) and find some other substructure (chiefly among Europe pre-Kurgan and Levant, and within Europe+Steppe LNBA).

Tree structuring with colour matching to those clusters instead: http://i.imgur.com/r12KGgA.png

(Here's the clustering assignments: http://textuploader.com/d5fyi)

Nathan Paul said...

I bet Rakhigarhi will be Y R2 and H. If we get Harappan Y DNA from northwest South Asia (current Pakistan) it will be an R1a, R2, L and H combo.

Unknown said...

Thanks for the update on Rakhigarhi samples. It is heartening to know that we will have aDNA from as many as 12 samples. The paper when it comes out is going to be a landmark, no doubt about it. The earliest of these Rakhigarhi samples may be as old as 5500 BC.

But I was expecting the Rakhigarhi paper to come out by the end of this year. Why do you say that it will take until the middle of next year to come out? Will we have a preprint or not? And which team is it, by the way, if you have any idea about it?

It would be wonderful if along with the Rakhigarhi samples, they also manage to incorporate copper/bronze age samples of Sumerians & Central Asians & East Iranians (from sites like Jiroft, Konar Sandal, Tepe Hissar, Shahr-i-Sokhta etc). If that is the reason for the delay in publication there is nothing to complain.

FrankN said...

I don't like the "Europe_post_Kurgan-invasion" label, because, as far as archeology tells us, such an invasion never happened. Instead, we had a late 4th mBC population decline in Central Europe, possibly caused by a combination of climate deterioration and diseases (plague), and subsequent immigration from the East. Whether those immigrants were "Kurgan" (Yamnaya), or represent two separate streams (EHG from Finland/ E. Baltic; CHG from CT and further east) still is an open question. As such, "LN/BA Europe" would be a much more adequate labelling.

Olympus Mons said...

@FrankN

"CHG from CT" - CT Will be anything but!

MaxT said...

@Nathan Paul

There won't be any R1a from Indus Valley Civilization South Asia or South Central Asia, it came with Indo-Europeans, like in Europe.

Samuel Andrews said...

@FrankN,

CordedWare and Bell Beaker's EHG/CHG is of the same proportion as Yamnaya. There weren't two separate waves of EHG and CHG!!! How stupid are you people? Open ur eyes already. Some of us wanna move past "did an IE R1 steppe BA migration occur" but idiots like u refuse to accept the obvious. At this point being offensive is justified because there have literally been years of useless discussion with you guys. We've laid out the facts 1,000s of times yet you still don't. God damn it, enough with discussing the basics with you stubborn idiots.

Iranocentrist said...

@Samuel Andrews/Krefter

How dare you call Frank an idiot, Frank has forgotten more about History/Archeology/Anthropology than you will learn in your entire life, you little pest, now fix your spelling and grammar before you dare reply to someone of FrankN's caliber.

Nirjhar007 said...

The headline should be renamed Days of High Anxiety ;) .

Keep calm folks, everything is going smoothly and steadily.

Samuel Andrews said...

@Iranocentrist,

"How dare you call Frank an idiot, Frank has forgotten more about History/Archeology/Anthropology than you will learn in your entire life, you little pest, now fix your spelling and grammar before you dare reply to someone of FrankN's caliber."

FrankN is intelligent but his refusal to accept the basics about the Steppe's role in the genetics of BA Eurasia is idiotic. I've lost all patience for posters here who do the same.

Davidski said...

@FrankN

The question is not open. Just look at the plot.

And there was no EHG in the East Baltic, as you'll soon find out.

Azarov Dmitry said...

@MaxT
"There won't be any R1a from Indus Valley Civilization South Asia or South Central Asia, it came with Indo-Europeans, like in Europe."


I believe we’ll see relic subclades of R1a and R1b haplos in samples from South Iran and Central Asia (during the Neolithic period and earlier). And the arrival of bearers of R1a-L657 and R1a-Z2124 subclades in South Iran, South Asia (Indus Valley Civilization) and Central Asia after ~2300-2200 BC.

MaxT said...

@Azarov Dmitry

It's more likely that we might find R1a in BMAC culture which is on borders of Central Asia than in Iran or South Asia.

Davidski said...

On the above plot R1a is only found in the light blue and green samples. There's a good reason for that; a very strong association between R1a and EHG.

There was no EHG south of the steppe until after the Neolithic. The EHG in Mesolithic Caucasus isn't real.

So there won't be any R1a south of the steppe until after the Neolithic.

Davidski said...

Please note, I updated the plot and datasheet by removing the most extreme outliers across the 9 dimensions.

Also, Matt, try this. If the samples you're looking for aren't on this sheet, then for one reason or another I can't get them on there.

https://drive.google.com/file/d/0B9o3EYTdM8lQbmV0VDZSMnZKOVE/view?usp=sharing

EastPole said...

I have run mclust on the data. There are 15 clusters.

Iboussieres39:ADI_d and Levant_Neolithic:I1701 are really outliers assigned together to cluster 13 (huge ellipse on clusters plot).

Sintashta:RISE394 is really interesting. It is assigned to cluster 7 together with Vatya:RISE479, Hungary_BA, BB and some Unetice and CW with probability 0.52. But probability for it to be in cluster 8 with CW, Andronovo, Srubnaya is 0.47. So it is really in between clusters 7 and 8.
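The "in between" status of Sintashta:RISE394 can be put as a single number. mclust reports an uncertainty for each sample equal to one minus its highest membership probability; the sketch below applies that definition to the two probabilities quoted above. The function name is hypothetical, and the small remainder of probability mass on other clusters is ignored.

```python
def classify(memberships):
    """Return (assigned cluster, uncertainty) from membership probabilities.

    memberships: dict mapping cluster id to the sample's probability of
    belonging to that cluster. Uncertainty is 1 minus the largest probability.
    """
    cluster = max(memberships, key=memberships.get)
    return cluster, 1.0 - memberships[cluster]


# Probabilities quoted in the comment above for Sintashta:RISE394.
cluster, uncertainty = classify({7: 0.52, 8: 0.47})
```

An uncertainty close to 0.5 with two candidate clusters means the assignment is close to a coin toss, so the sample is best read as sitting between the two groups rather than belonging firmly to either.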

PC12:

https://www.dropbox.com/s/jkj55ou6tjig5d7/CWE_PCA12.pdf?dl=0

Clusters and probabilities :

https://www.dropbox.com/s/p6mllh8qjzliuru/CWEmclust.xls?dl=0

Cluster assignment plot:

https://www.dropbox.com/s/6e1n12spx7n115x/CWEmclust.pdf?dl=0

Uncertainty plot:

https://www.dropbox.com/s/0qi51gwe8uof9yd/CWEmclustUncertainty.pdf?dl=0

I think that Sintashta is not a typical steppe population but a mix of steppe and some Central European population. Maybe this is the cause of similarities between Slavic and Indo-Iranian languages.

Azarov Dmitry said...

@MaxT
"It's more likely that we might find R1a in BMAC culture which is on borders of Central Asia than in Iran or South Asia."



I believe ancestors of Indo-Aryans (subclade R1a-Z94 or R1a-Z95) migrated in the middle of the 3rd millennium BC from the Pontic-Caspian steppe via the Caucasus Mountains to South Caucasus and North Iran. And later (~2300-2200 BC) they migrated from Iran to South and Central Asia.
http://s017.radikal.ru/i409/1611/54/3a9687bb4b31.jpg

Matt said...

@ Davidski, thanks. I merged that up with the original datasheet from your first post minus outliers* and got really sharp clustering on that with K-means, for the moderns:

http://i.imgur.com/Ks1DiQa.png

(I left all the extra Bronze Age Europeans in the big and diverse Europe post Kurgan cluster, who overlap to some degree with most of modern Europe except the modern Baltic, even though K means wouldn't have).

There are really clear differences between all expected regional subpopulations. In particular what seems striking and a new thing from other PCA is a really clear clustering of Arabians with many of the Natufians in a dimension that is distinguishing between Anatolia+CHG vs Natufian/Euro HG.

Here are two sheets on Scribd with those K means cluster assignments if anyone wants:

Without South Central Asia: https://www.scribd.com/document/330916168/Ancient-Modern-Central-West-Eurasia

With South Central Asia: https://www.scribd.com/document/330916174/Ancient-Modern-Central-West-Eurasia-SCA

(left Iboussieres and Levant_Neolithic:I701 in the above datasheets and graphic. Bell_Beaker_Czech:RISE568 seems a bit of an outlier as well, but not to as much a degree).

I wanted to cross some of this up with some other datasheets and see what came out. I might try with EastPole's clusters as well.

* didn't download the newest version of the datasheet so hopefully nothing changed other than removing the outliers.

Davidski said...

The new datasheet is exactly the same as the old one, except for the missing outliers, and a new color scheme for Kurgan_Eneolithic.

Rob said...

@ Azarov

I think it's possible. There is evidence of movement after 2500 BC southward from the steppe via the Caucasus. But then where/ how ? Via BMAC ?

Matt said...

Merged that up with the dimensions from a Europe based PCA with ancients added from an earlier post, where there's overlap:

Linear Discriminant Analysis : http://i.imgur.com/dUNp5p3.png

PCA: http://i.imgur.com/tgvhsbO.png

PCA with Steppe_EMBA to Iberia Chalcolithic clines drawn on: http://i.imgur.com/UcBtsuo.png

(Ancients are dots, moderns are pluses).

Combining all the dimensions, forms a nice cline in PC1 and PC2, then in the smaller PC3 and PC4, the specific drift for Baltic-Slavic populations*, and then extra drift for Iranian and Levantine Neolithic is also apparent.

Groups were guided by the K means clustering, then split and merged a few (based on the PCA and intuition).

For the combined dimensions - https://www.scribd.com/document/330924566/ancient-EuropePCA

* Like the Welsh specific drift from Galinsky et al 2016, Figure 3 - http://biorxiv.org/content/biorxiv/early/2016/05/27/055855.full.pdf
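The merge of two sets of PCA dimensions "where there's overlap" described in this comment amounts to an inner join on sample labels. A minimal sketch, assuming each datasheet has been read into a dict of label to coordinate list (the function name and that layout are hypothetical, not taken from the actual files):

```python
def merge_on_overlap(sheet_a, sheet_b):
    """Keep only samples present in both sheets and concatenate their dimensions.

    sheet_a, sheet_b: dicts mapping a sample label to a list of coordinates.
    The output preserves the sample order of sheet_a.
    """
    return {label: list(sheet_a[label]) + list(sheet_b[label])
            for label in sheet_a if label in sheet_b}
```

Samples missing from either sheet are dropped, so the combined sheet can be smaller than both inputs.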

Davidski said...

@Rob

"I think it's possible. There is evidence of movement after 2500 BC southward from the steppe via the Caucasus. But then where/ how ? Via BMAC ?"

South Asians have ancestry directly from the steppe not from the Caucasus.

huijbregts said...

@Eastpole
I tried to reproduce your mclust results with the new datasheet.
After averaging the populations I found only 11 clusters (modeltype = 'VEE').
Can you check this?
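For anyone trying to reproduce this, the averaging step can be sketched as follows, assuming the sample labels follow the Population:SampleID pattern seen elsewhere in this thread (e.g. Sintashta:RISE394). The function name and input layout are hypothetical.

```python
from collections import defaultdict


def average_by_population(rows):
    """Average coordinates per population.

    rows: iterable of (label, coords) pairs, where label looks like
    "Population:SampleID" and coords is a list of PC values.
    Returns {population: [mean of each dimension]}.
    """
    groups = defaultdict(list)
    for label, coords in rows:
        population = label.split(":", 1)[0]  # text before the first colon
        groups[population].append(coords)
    return {pop: [sum(col) / len(col) for col in zip(*members)]
            for pop, members in groups.items()}
```

Averaging removes within-population spread before clustering, which is one plausible reason the number of clusters found on averaged data is lower than on individual samples.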

EastPole said...

@huijbregts
Yes, I got the same for averaged populations: 11 clusters VEE model, and very strange assignments: Sintashta is in cluster 4 with Balkans, Italians, Scandinavians, Greeks, although on PCA it is close to Russians:

Matt said...

I'm looking at some of these stats and I wonder if the Baltic (Jones) + Balkan (Mathieson) samples when they are published might show the following model:

1) Really sparse settlement in Baltic and East Central Europe prior to Steppe / Kurgan expansions, mostly by WHG. Attribute that to difficulty raising early neolithic founder crops due to climate.

2) At post-Kurgan / LNBA, continues to be sparsely populated landscape around the Baltic and East Central Europe with high Steppe / Kurgan EMBA and WHG ancestry. Present day Baltic populations still represent this configuration to a degree.

3) Late Iron Age, farmers from Balkans introduce new crops (millet, rye, etc.) to the region. Slavic speaking groups akin to Baltic populations from 2) mix with Balkan farmers who bring new crops (probably also IE speaking - Thracian?), and leads to population explosion. Ancestry gradient between these sources in different Slavic populations.

Above is totally uninformed by archaeology, I admit.

EastPole said...

@huijbregts
I have mixed feelings about applying mclust to PCA data of this type. Surely not all dimensions are equally important. Only those with the highest eigenvalues, which explain most of the variance, should be used for clustering. Other dimensions may just be introducing noise which confuses the algorithm.
The problem is which dimensions are more important than others. When I look at the plots of all the data, the first 6 dimensions show some interesting structure, while 7, 8 and 9 look more like noise.

http://s18.postimg.org/8l3909wfd/screenshot_90.png
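One way to make the "which dimensions matter" question concrete is to look at each dimension's share of the total eigenvalue sum and keep only those above a cutoff. The sketch below assumes the eigenvalues are available, which the datasheet itself may not provide; the function name and the cutoff value are hypothetical.

```python
def select_dimensions(eigenvalues, min_share=0.05):
    """Return (kept dimension numbers, share of variance per dimension).

    eigenvalues: list of PCA eigenvalues, one per dimension, in order.
    A dimension is kept when its share of the eigenvalue total is at
    least `min_share`. Dimension numbers are 1-based (PC1, PC2, ...).
    """
    total = sum(eigenvalues)
    shares = [ev / total for ev in eigenvalues]
    kept = [i + 1 for i, share in enumerate(shares) if share >= min_share]
    return kept, shares
```

A fixed cutoff is only one convention. Inspecting the plots, as described above, or scaling each dimension by the square root of its eigenvalue before clustering are alternatives with the same aim.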

@Matt
Poland and Ukraine have the best agricultural land in Europe. They have always been well populated. I think the history of Slavs and the history of IE will be rewritten when we get more aDNA. They just have to get the right samples.
Slavs and Tocharians have the same word for millet. And I am sure they got it from us as they got many other Slavic words.
“As the Tocharians began to move east, the last contacts that they had with other Indo-Europeans (before their much later interaction with the Indians and Iranians) was with the Slavs, resulting in some Slavic influence in the lexicon”

http://www.oxuscom.com/eyawtkat.htm

huijbregts said...

@Eastpole
I share your feeling that these models have unexpected outcomes.

Matt said...

@ EastPole. Fair enough, I don't know too much about the quality of land and history of raising plant agriculture in East Europe, other than remembering Peter Turchin putting weird exceptions into his computer models for where empires and states would form (otherwise it would form loads of mega states in East Europe which never happened) and justifying it with very low agricultural productivity there until recently*.

* http://www.pnas.org/content/suppl/2013/09/20/1308825110.DCSupplemental/sapp.pdf - "Additionally, we have taken into account the terrestrial plant threshold (TPT) occurring at Effective Temperatures of 12.75°C (4: Table 4.02 and figure 4.12). The TPT marks the northern boundary of the area where plant productivity is sufficiently high to permit plant-dominated subsistence strategies by hunter-gatherers. This is also the area where intensification of production leads to the domestication and diffusion of plant cultivars (5). When plants were first domesticated (in regions south of the isotherm) they spread readily West and East (6). The northward spread, especially across the TPT, was much slower and required a lengthy period of adaptation to colder climates. For example, Bronze Age populations in the European forest belt north of the TPT practiced agropastoralism with a heavy emphasis on animal-dominated strategies (7). Productive agriculture, capable of sustaining complex societies, appeared in Northern and later Eastern Europe only during the first millennium CE. "

EastPole said...

@Matt
“Productive agriculture, capable of sustaining complex societies, appeared in Northern and later Eastern Europe only during the first millennium CE. "
By Northern and later Eastern Europe they probably mean Scandinavia and Eastern Baltic. But Slavs originated from Vistula-Dnieper area:

http://s18.postimg.org/mujn5hz6x/screenshot_74.png

So we come from the area of the best agricultural land where agriculture was practiced from 6000 BC by Tripolye and TRB cultures.
I think we didn’t originate from Yamnaya but rather from Late Sredny Stog Dereivka steppe culture mixing with Tripolye and TRB Neolithic farmers around 3000 BC in Dnieper-Vistula area. Slavs are really Corded Ware culture populations which mixed with farmers and settled down and started farming between Vistula and Dnieper as early hydronyms show.
I hope that when we get aDNA from that area it will explain a lot.

huijbregts said...

@EastPole
The mclust results are much better if I remove the two outliers that Matt indicated.

Matt said...

@EastPole: "By Northern and later Eastern Europe they probably mean Scandinavia and Eastern Baltic. But Slavs originated from Vistula-Dnieper area"

Possibly. It's not really clear. They only display their model of the expansion of agriculture in three shots in their Figure S2: 1) first as having excluded all of Eastern Europe north of Greece and the Balkans at a time when all of Eurasia and all of North and Central Africa are agricultural, 2) then with all of Eastern Europe and Scandinavia as agricultural (the area you linked in your image is just on the edge of their limit of agriculture there), 3) then finally allowing agriculture into Russia and Southern Africa. No dates on any of these, presumably the transition from 1) to 2) is "during the first millennium CE" judging by their comments.

But I'm not sure that's even a correct model anyway and I suspect it's just been fixed to prevent the model creating mega empires in Russia and Eastern Europe without them having to seriously abandon the predictions of the model (agriculture+near to steppe military frontier = states).

Nice theory on the Slavic origin. I'll try to keep it in mind for when we have the right samples and then we can see how to test it!

Rob said...

The area north of Ukraine & east of the Visla developed an agriculture more than slash/burn c. 500 BC, with permeation of eastern Hallstatt influences. Basically as Matt suggested.

Samuel Andrews said...

@Matt,

Your theory/hypothesis/whatever makes sense. However, the Corded Ware Estonian lady has a significant amount of Anatolia Neolithic ancestry. I don't know what you mean by EastCentral Europe, do you mean only areas bordering the Baltic sea or everything from Germany to Ukraine? We do have Unetice and a Urnfield genome from Poland and Germany and they have similar proportions of EEF/Steppe/WHG as modern Northern Europeans.

Davidski said...

I changed the color scheme on the plot again. Really happy with this color scheme. This will be the color scheme for all future "Days of high adventure" plots.

I'm such a perfectionist. Or maybe I'm just obsessive compulsive? Never mind, that was a rhetorical question.

Rob said...

yes I like the gradienting

Azarov Dmitry said...

@Rob
"I think it's possible. There is evidence of movement after 2500 BC southward from the steppe via the Caucasus. But then where/ how ? Via BMAC ?"


I think the formation of BMAC is connected with the migration of bearers of R1a-Z2123 subclades from Iran, while bearers of R1a-L657 subclades migrated from Iran into South Asia (more likely peaceful colonization than military invasion). The BMAC population (mostly R1a-Z2123, G2, J2) adopted from their steppe neighbors (mostly R1a-Z93*(xZ94)) all these famous military innovations (spoke-wheeled chariots and cavalry) and in the post-BMAC period they brought these innovations into South Asia and Iran (military invasion). So there were three major groups of R1a-Z93 folks in Central and South Asia: R1a-L657 – peaceful South Asian farmers (PC steppe->Caucasus->Iran->South Asia), R1a-Z2123 – BMAC farmers (PC steppe->Caucasus->Iran->Central Asia) that adopted military innovations (spoke-wheeled chariots and cavalry) from their steppe neighbors and R1a-Z93*(xZ94) – badass steppe nomads and inventors of spoke-wheeled chariots (ancestors of Tocharians and the Srubnaya population).

Davidski said...

Nonsense.

The founding population of BMAC came from the South Caspian and had nothing to do with the steppes. So BMAC will be loaded with J2.

Z93 arrived in the area at the Bronze Age/Iron Age transition, and will show up in remains from cultures like Yaz.

http://eurogenes.blogspot.com.au/2016/08/maybe-first-direct-hints-of-yamnaya.html

Rob said...

Thanks Azary
Dave, I note your reservations, but the whole point of aDNA was to help solve archaeological mysteries, and to this point no one can honestly come up with a decisive line on where BMAC people came from (natives, south, Syro-Anatolia, steppe have all been proposed). In all reality it's a syncretic and original formation.

Azary, however, David's point that it seems unlikely that steppe admixture into South Asia came via Iran, because Iran represents a lull in steppe ancestry cf. further east, is a significant obstacle to your theory (barring contravening aDNA); but the Mitanni are supportive.

Seinundzeit said...

I can't wait to see those Rakhigarhi samples.

With South Asia in mind, I finally found a setup which allows for the creation of exceedingly sensible models for populations from Central/South Asia, and which also works great with European populations. In addition, the ASI levels finally make sense, and the fits are good.

I used the first 7 PCs from the recent data-sheet David posted, the one with a global focus (since Matt demonstrated that the first 7 PCs were packed with relevant information, but 8+9+10 not so much, at least for the populations I was seeking to model). The same reference populations were utilized for all groups which were tested, whether South Asian, West Asian, Central Asian, or European. I did not remove or add any reference populations when testing all of these groups. The output is posted below.
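The comment does not say which tool produced the models that follow, so the code below is only a sketch of the general idea under a stated assumption: each target is modelled as a weighted average of reference populations in PC space, with the weights chosen to minimise the Euclidean distance to the target. The function names, the 1% step size and the greedy search are all my own illustrative choices.

```python
import math


def mix(weights, refs):
    """Weighted average of reference coordinates."""
    dims = len(next(iter(refs.values())))
    return [sum(weights[name] * refs[name][d] for name in refs)
            for d in range(dims)]


def distance(a, b):
    """Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_mixture(target, refs, step=0.01, max_rounds=10000):
    """Greedy search for non-negative weights summing to 1.

    Starts with all weight on the first reference and repeatedly moves one
    `step` of weight between a pair of references whenever that lowers the
    distance. Integer counts are used internally to avoid rounding drift.
    Returns ({reference: weight}, final distance).
    """
    names = list(refs)
    units = int(round(1 / step))
    counts = {name: 0 for name in names}
    counts[names[0]] = units

    def dist_of(c):
        return distance(mix({n: c[n] / units for n in names}, refs), target)

    best = dist_of(counts)
    for _ in range(max_rounds):
        improved = False
        for src in names:
            for dst in names:
                if src == dst or counts[src] == 0:
                    continue
                counts[src] -= 1
                counts[dst] += 1
                d = dist_of(counts)
                if d < best - 1e-12:
                    best, improved = d, True   # keep the move
                else:
                    counts[src] += 1           # undo the move
                    counts[dst] -= 1
        if not improved:
            break
    return {n: counts[n] / units for n in names}, best
```

A low final distance only says the target can be placed close to a mix of the chosen references in this PC space. It does not by itself show that the references are the true ancestral sources, which is worth keeping in mind when reading the percentages below.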

Seinundzeit said...

South India:

Paniya

60.3% Jarawa
39.7% Iran_Neolithic
Distance=3.2884

Pulliyar

52.55% Jarawa
47.45% Iran_Neolithic
Distance=3.2562

Hakkipikki

54.15% Iran_Neolithic
41.90% Jarawa
3.95% Iran_Hotu
Distance=3.24

Chenchu

55.85% Iran_Neolithic
41.55% Jarawa
2.60% Yamnaya_Samara
Distance=2.8999

Sakilli

58% Iran_Neolithic
41.25% Jarawa
0.75% Iran_Hotu
Distance=3.4653

Piramalai Kallar

62.15% Iran_Neolithic
36.90% Jarawa
0.95% Yamnaya_Samara
Distance=3.2786

Kurumba

56.20% Iran_Neolithic
34.35% Jarawa
6.45% Iran_Hotu
Distance=3.0516

Tamil Nadu Brahmin

57.95% Iran_Neolithic
26.40% Jarawa
15.65% Yamnaya_Samara
Distance=2.2208

Seinundzeit said...

Northern India/eastern Pakistan:

Chamar

55.5% Iran_Neolithic
38.5% Jarawa
3.1% Yamnaya_Samara
Distance=3.6215

Dusadh

59.15% Iran_Neolithic
36.30% Jarawa
4.55% Yamnaya_Samara
Distance=3.2785

Dharkar

54.00% Iran_Neolithic
33.35% Jarawa
12.75% Yamnaya_Samara
Distance=2.6026

UP Kshatriya

52.4% Iran_Neolithic
26.2% Jarawa
21.4% Yamnaya_Samara
Distance=2.2015

UP Brahmin

50.30% Iran_Neolithic
24.95% Yamnaya_Samara
24.75% Jarawa
Distance=2.0721

GujaratiD

44.10% Iran_Neolithic
26.95% Jarawa
22.15% Iran_Hotu
6.80% Yamnaya_Samara
Distance=2.8073

GujaratiC

54.55% Iran_Neolithic
25.40% Jarawa
14.40% Yamnaya_Samara
5.65% Iran_Hotu
Distance=2.6255

GujaratiB

51.05% Iran_Neolithic
25.15% Jarawa
23.80% Yamnaya_Samara
Distance=1.8327

GujaratiA

52% Iran_Neolithic
28.6% Yamnaya_Samara
19.4% Jarawa
Distance=1.4351

Punjabi_Lahore

55.6% Iran_Neolithic
29.4% Jarawa
15.00% Yamnaya_Samara
Distance=2.4456

Sindhi

59.10% Iran_Neolithic
22.15% Yamnaya_Samara
18.75% Jarawa
Distance=1.0703

Seinundzeit said...

South Central Asia (northern/northwestern Pakistan, Afghanistan, Tajikistan):

Burusho

46.45% Iran_Neolithic
27.85% Yamnaya_Samara
18.80% Jarawa
6.90% Daur
Distance=0.6483

Pakistani Pashtun, Khyber Pakhtunkhwa (of the Yusufzai tribe, from the actual Pashtun landholding class)

55.30% Iran_Neolithic
27.15% Yamnaya_Samara
15.70% Jarawa
1.70% Loschbour
0.15% Daur
Distance=0.5797


“Pathan” (In reality, these are just Pakistani Pashtuns. It’s a curated set, one which only includes the samples that ChromoPainter + MClust construe as belonging to a single population. Coincidentally, these more tightly clustered samples are also much less genetically “South Asian” than the other samples, the ones that cluster close to Sindhis/Punjabi Jatts. Anyway, these samples were collected in Kurram Agency, and are of the Karlani confederacy)

51.70% Iran_Neolithic
30.20% Yamnaya_Samara
13.65% Jarawa
2.25% Barcin_Neolithic
1.30% Loschbour
Distance=0.5161

Kalash

50.5% Iran_Neolithic
37.2% Yamnaya_Samara
12.3% Jarawa
Distance=0.6058

Myself (I’m a Pashtun with roots in both Afghanistan and Pakistan, most similar to Afghan Pashtuns in Kunar/Laghman/Nangarhar and Pakistani Pashtuns in Bajaur/Mohmand/Khyber. Although, those people usually lack the 4%-5% Siberian/East Asian that I have, which I can pin on my Uzbek great-grandmother)

41.3% Iran_Neolithic
26.70% Yamnaya_Samara
13.25% Iran_Hotu
11.95% Jarawa
4.70% Daur
2.10% Barcin_Neolithic
Distance=0.4905

Afghan Pashtun, Ghazni (Central Afghanistan, of the Ghilzai confederacy)

47.10% Iran_Neolithic
32.75% Yamnaya_Samara
11.15% Jarawa
7.70% Barcin_Neolithic
1.30% Daur
Distance=0.5233

Pashtun_Afghanistan (samples collected in the provinces of northern Afghanistan, probably hodgepodge of Ghilzai and Durrani)

44.30% Iran_Neolithic
30.60% Yamnaya_Samara
10.05% Barcin_Neolithic
9.20% Jarawa
4.15% Daur
1.70% Loschbour
Distance=0.3995

Afghan Pashtun, Kandahar (southeastern Afghanistan, of the Durrani confederacy)

49.4% Iran_Neolithic
29.60% Yamnaya_Samara
10.30% Barcin_Neolithic
5.70% Daur
5.00% Jarawa
Distance=0.3514

Pakistani Pashtun, Waziristan (northwestern Pakistan, of the Karlani confederacy)

42.55% Iran_Neolithic
18.45% Yamnaya_Samara
8.95% Loschbour
7.80% Barcin_Neolithic
4.75% Daur
4.65% Jarawa
Distance=0.4702

(This individual turned out really distinctive, not sure why the Yamnaya percentage is so low, but the WHG quite high. I find this interesting, because the Pashtuns of Waziristan/Khost are culturally distinctive, so it’s quite fascinating/possibly significant that an individual from those parts gets a different sort of model)

Tajik_Ishkashim

44.35% Yamnaya_Samara
28.45% Iran_Neolithic
8.50% Iran_Hotu
7.05% Daur
6.20% Barcin_Neolithic
5.45% Jarawa
Distance=0.3625

Tajik_Shugnan

48.60% Yamnaya_Samara
21.15% Iran_Neolithic
10.80% Barcin_Neolithic
10.40% Iran_Hotu
6.70% Daur
2.35% Jarawa
Distance=0.2857

Tajik_Rushan

42.60% Yamnaya_Samara
21.55% Iran_Hotu
16.65% Barcin_Neolithic
9.65% Iran_Neolithic
9.55% Daur
Distance=0.2715

Seinundzeit said...

Iran/Pakistani Balochistan:

Baloch

60.45% Iran_Neolithic
18.65% Yamnaya_Samara
12.00% Barcin_Neolithic
8.40% Jarawa
0.50% Daur
Distance=0.4388

Brahui

68.4% Iran_Neolithic
12.85% Yamnaya_Samara
7.30% Jarawa
6.65% Barcin_Neolithic
4.80% Loschbour
Distance=0.4772

Makrani

74.85% Iran_Neolithic
9.90% Barcin_Neolithic
9.25% Loschbour
5.55% Jarawa
0.45% Yamnaya_Samara
Distance=0.3574

Iranian_Fars

44.95% Iran_Neolithic
40.55% Barcin_Neolithic
9.20% Yamnaya_Samara
3.50% Daur
1.80% Iran_Hotu
Distance=0.2185

Iranian_Mazandarani

53.3% Iran_Neolithic
35.65% Barcin_Neolithic
9.70% Yamnaya_Samara
0.80% Jarawa
0.55% Daur
Distance=0.332

Iranian_Lor

45.05% Barcin_Neolithic
41.15% Iran_Neolithic
7.05% Iran_Hotu
3.40% Daur
3.35% Yamnaya_Samara
Distance=0.3001

Iranian_Jew

57.2% Barcin_Neolithic
41.7% Iran_Neolithic
1.1% Daur
Distance=0.7037

Seinundzeit said...

Northern Europe:

Polish

38.7% Yamnaya_Samara
32.7% Barcin_Neolithic
28.5% Loschbour
0.1% Daur
Distance = 0.3734

Lithuanian

40.65% Yamnaya_Samara
35.25% Loschbour
24.10% Barcin_Neolithic
Distance=0.454

Norwegian

37.50% Yamnaya_Samara
35.35% Barcin_Neolithic
26.55% Loschbour
0.60% Daur
Distance=0.395

Icelandic

37.15% Yamnaya_Samara
35.85% Barcin_Neolithic
27.00% Loschbour
Distance=0.4705

Seinundzeit said...

General observations:

Basically, it seems most South Indian populations are a straightforward mix of Iranian Neolithic-related and ASI ancestries, with most of them being in the neighborhood of 60% Iranian Neolithic-related and 40% ASI. Only the tribal Paniya and "untouchable" Pulliyar constitute real exceptions, with 60% ASI and 50% ASI, respectively.

And the South Indian Brahmins are huge outliers, at around only 25% ASI, and with noticeable steppe ancestry.

In addition, as someone with great interest in Pashtun genetics, these results are quite fascinating.

For example, the ASI percentages don’t seem to differ across the Durand Line. Myself (an eastern Pashtun), an Afghan Pashtun from Ghazni, and Afghan Pashtuns from the far north of the country, all seem to be at around 10% ASI. By contrast, a Pakistani Pashtun from Waziristan has only 5% ASI, less than many Afghan Pashtuns and me, and thus the same amount as an Afghan Pashtun from Kandahar.

But Pashtuns do differ on an east-west axis, even though ASI isn’t the differentiating component. Mainly, the Barcin_Neolithic percentages are much higher among all Afghan Pashtuns (and the Pakistani Pashtun from Waziristan), but very low in my case, while my Yusufzai friend has none.

The Pamiri people have higher Barcin_Neolithic percentages compared to Afghan Pashtuns. So, the Barcin_Neolithic percentages probably track Sintashta-like versus Yamnaya-like genetic contributions in this region. Although, with Afghan Pashtuns, there is the possible confound of isolation-by-distance dynamics increasing the Barcin_Neolithic contribution, taking into account networks of gene-flow across the Iranian plateau.

In addition, these models prove that the recent Lazaridis et al. paper was correct. In it, it was claimed that the Kalash have around the same amount of Yamnaya-related ancestry as Northern Europeans.

This is verified here, as the Kalash are comparable to Icelanders, Norwegians, Lithuanians, and Polish people, when it comes to Bronze Age steppe admixture. Also, most Pashtun samples have around the same amount of Yamnaya-related admixture seen in more northerly European populations. Good to see an agreement between the formal stats and PCA. Although, I should note that the estimates here are much more conservative. Formal stats have Northern Europeans and the Kalash at around 50% Bronze Age steppe-admixed, while this PCA method has Northern Europeans and the Kalash as closer to 40%-35% Bronze Age steppe-admixed. Not sure which is more correct?

And finally, it seems Pamiri people are the most highly Bronze Age steppe-admixed people around. Although, it is possible some Russian groups have more (or are at the very least equal), compared to Pamiri people (I’ll test that out later).

*For those who don’t know, the Jarawa are another Andaman population, genetically identical to the Onge.

Davidski said...

ASI to me looks like the eastern but non-East Asian Mesolithic South Asian forager stuff plus whatever managed to mix into this main component from Central Asia prior to the Neolithic. This is why I reckon Ust_Ishim got picked as the most likely surrogate for South Asian ancestry in Broushaki et al. 2016.

On the other hand, ANI appears to be all the Iran_Neolithic and Steppe_EMBA/MLBA related stuff in South Asia.

FrankN said...

@Matt: The archeological record of the E. Baltics hardly suggests a sparse population during the Mesolithic and Early Neolithic (ceramic, but pre-agricultural). As in the densely populated W. Baltic (Kongemose/Ertebolle/Nord. FB), aquatic foraging prevailed on the E. Baltic shores and around inland lakes/wetlands. Such foraging included seal-hunting, fishery (nets, fish-traps) and hunting for waterfowl (89% of the bone assemblage of Riigikula 3, Narva Culture, N. of Narva town on the Estonian-Russian border, is composed of waterfowl, mainly ducks). Plants preferring a maritime climate, especially hazelnut, water chestnut, crab apple, blackberry, possibly also beet roots (brassica, sea beet, parsnip etc.) complemented the diet. Such aquatic foraging appears not to have been covered by Turchin’s model.

As concerns CW/BattleAxe, 340 settlement sites have been documented from SE Finland, and 50 from Estonia. That is again anything but “sparsely populated”. Animal husbandry (sheep, cattle, pigs, but no horses) is secured for E. Baltic CW. There is evidence of earlier sea transport of boar, e.g. to Gotland, adding to the ongoing discussion that boar may already have been (semi-)domesticated in Mesolithic Europe prior to EEF introducing Anatolian pigs. Millet is evidenced from Turlojiskes, SW Lithuania (EBA, ca. 1900 BC).

CW Finland:
http://www.kirj.ee/public/Archaeology/2014/issue_1/arch-2014-1-3-29.pdf

Dairy farming Finland:
http://rspb.royalsocietypublishing.org/content/281/1791/20140819

https://www.researchgate.net/publication/255716945_New_dates_for_the_Late_Neolithic_Corded_Ware_Culture_burials_and_early_husbandry_in_the_East_Baltic_region

Botanical Macroremains Lithuania:
https://homepages.uni-tuebingen.de/simone.riehl/Litauen.htm

Matt said...

@ Sein, Hmmm, well I wouldn't agree totally with discarding any of the 10 PCs, particularly as PC9 did add a lot of extra differentiation to Iran_Neolithic vs South Asia:
http://i.imgur.com/lId9Soy.png (PC9 has a larger degree of variance across this set of populations than the Loschbour vs Near East PC5!) / http://i.imgur.com/QVQYLtO.png
http://i.imgur.com/BU304C4.png

I don't really think there are noisy dimensions that cause significant problems; the biggest thing that happens is that the dimension is simply irrelevant to what nMonte is doing for Europeans / South Asians, etc. I can't see a downside to including any of them.

PC9 will make your models worse with the set of admixing populations though, as none of them will peak the difference in that dimension, and we don't have a good proxy for what ancient might peak that dimension. It strongly represents a divergence between an ancestry in India from Iran Neolithic and Caucasus (and this will also decrease the sharing with Yamnaya for Indians).

One thing with the fits you've posted up is that they do seem to require a lot of Loschbour for Europe. Where the most WHG mixed MN populations are at 25% Loschbour, 75% Barcin, you would need a population of 45% Loschbour, 55% Barcin to be the non-Yamnaya side of the Polish model. Why not run the models for Europe again with Europe_MN fixed as Iberia_MN or Baalberge_MN and allow the EHG vs CHG level to vary?

Seinundzeit said...

Karl_K,

I was surprised by the consistency. Almost all South Indians seem to be 60% Neolithic Iranian-related and 40% South Asian Hunter Gatherer (I agree with you, South Asian Hunter Gatherer is probably a much better way to describe Andaman-related ancestry in South Asia), with no Yamnaya-related admixture. Not much difference between the powerful Kallar caste and tribal Hakkipikki. Brahmins are the only real outliers, they look like they came almost straight from Uttar Pradesh, in all the models I've seen.

I guess South Asia is a very complex genetic stew, but it's good that we finally have the recipe (please forgive the culinary language, lol).

I mean, we have a cline of indigenous South Asian ancestry, ranging from 60% in the Paniya to around 5%-10% in the Kalash and assorted Pashtun tribal groups. And the steppe element clearly arrived long after this Iran_Neolithic + Andaman-related cline was established, as South Indians consistently have 0% Yamnaya-related admixture (at most, it seems 2% in some groups), while people like the Kalash are comparable to Northern Europeans when it comes to this sort of ancestry. Basically, steppe ancestry is very high in Afghanistan + northwestern Pakistan, and still quite important in northern India, yet it almost disappears south of the Vindhya mountain ranges. So, the steppe element was definitely a "disruption".

David,

I think that makes sense, probably the case.

Matt,

I definitely don't think the lesser dimensions cause problems. Rather, I had what you wrote right in my mind:

"the biggest thing that happens is that the dimension is simply irrelevant to what nMonte is doing for Europeans / South Asians, etc"

So, since I wanted to use the most dimensions I could, but with the quickest run time that would be afforded, due to the fact that I was running a lot of populations (and I mean a lot of populations, I haven't even begun posting all the tested populations), the last three dimensions were discarded.

I'm pretty sure the models (meaning, the proportions) won't change when using all 10 dimensions.

For example, I had a different setup a few days ago. To test the questions you've mentioned, I just varied/played around with the dimensions, using the same test populations and the same reference populations in every case. I tried 5 dimensions, 7 dimensions, and 10 dimensions. To my pleasant surprise, nothing really changed. The models were literally almost identical, no matter how many dimensions were used. The fits just got slightly worse, but that is to be expected, totally obvious.
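The "fits just got slightly worse" part follows from how a Euclidean distance is built: for the same proportions, every extra dimension adds a non-negative squared term, so the best achievable distance with more dimensions can never be smaller than with fewer. A minimal sketch with made-up coordinates (the function name and every number here are assumptions for illustration, not values from the datasheet and not nMonte's code):

```python
import math

def distance(target, sources, weights, dims):
    """Euclidean distance between the target and a weighted source mix,
    using only the first `dims` coordinates."""
    total = 0.0
    for d in range(dims):
        mixed = sum(w * s[d] for w, s in zip(weights, sources))
        total += (target[d] - mixed) ** 2
    return math.sqrt(total)

# Made-up 10-dimensional coordinates, for illustration only
target = [0.10, -0.05, 0.02, 0.01, -0.01, 0.004, 0.003, -0.002, 0.006, 0.001]
sources = [
    [0.20, -0.10, 0.00, 0.02, 0.00, 0.000, 0.001, 0.000, 0.000, 0.000],
    [0.00, 0.00, 0.05, 0.00, -0.02, 0.010, 0.000, -0.001, 0.002, 0.003],
]
weights = [0.5, 0.5]  # the same proportions in every run

# Same model, evaluated on 5, 7 and 10 dimensions
by_dims = {k: distance(target, sources, weights, k) for k in (5, 7, 10)}
for k, dist in by_dims.items():
    print(k, round(dist, 5))
```

So a slightly larger distance with 10 dimensions than with 7 does not by itself say that the proportions got worse.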

Truth be told, one doesn't get much (practically speaking), beyond the first 4 dimensions. In fact, I was actually pondering whether using 7 dimensions might constitute overkill. But I went ahead anyway with 7, rather than 4. I think I'll try this again with only the first four dimensions.

And for the fun of it, I'll also do this exercise again with 10 dimensions. Again, I'm 100% certain the models will pretty much be the same (as far as percentages go). But I'll still do these same fits with only 4, and 10, dimensions, for everyone to see.

Also, I wanted a setup in which South Asians, Central Asians, West Asians, Caucasians, and Europeans would be tested under identical conditions. This setup turned out very successful, everything makes sense for all the West Eurasian populations tested, so I went with it.

But, if one seeks a more intensive exploration of European genetic ancestry, I'd try what you suggest. In fact, I'll give that a spin, in a few minutes.

Shaikorth said...

I'm not sure ASI needs anything from pre-Neolithic Central Asia (unless we go back to the Paleolithic?) to be modeled as Ust-Ishim in the Broushaki paper, Onge prefers Ust-Ishim as a donor over all modern populations.

Seinundzeit said...

Hmm, the model gets a bit worse, and doesn't look as good as what I had previously:

Polish

51.95% Iberia_MN
40.65% Karelia_HG
7.40% Satsurblia
Iran_Chalcolithic
Daur
Yoruba

Distance=1.0126

But this model is pretty cool:

Polish

82.60% Bell_Beaker_Germany
13.95% Corded_Ware_Germany
3.45% Iberia_MN

Distance=0.9803

It's an interesting model, but curiously, the fit still isn't as good as my other setup. This is despite the fact that these two models are also based on 7 dimensions.
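For readers wondering how proportions and "Distance" values like the ones above come about, here is a toy random-search sketch. It assumes the goal is to find source proportions whose weighted average lies closest, in Euclidean distance, to the target's PCA coordinates. This is not nMonte's code; the function names, the `slots` and `cycles` parameters and all coordinates are made up for illustration.

```python
import math
import random

def mix(sources, weights):
    """Weighted average of the source coordinates."""
    dims = len(sources[0])
    return [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dims)]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_mixture(target, sources, slots=200, cycles=20000, seed=0):
    """Toy random search: keep a batch of `slots` draws from the sources,
    reassign one random slot at a time, and keep the change only if the
    distance to the target does not get worse."""
    rng = random.Random(seed)
    n = len(sources)
    batch = [rng.randrange(n) for _ in range(slots)]
    counts = [batch.count(i) for i in range(n)]
    best = euclid(target, mix(sources, [c / slots for c in counts]))
    for _ in range(cycles):
        slot = rng.randrange(slots)
        old, new = batch[slot], rng.randrange(n)
        if new == old:
            continue
        counts[old] -= 1
        counts[new] += 1
        trial = euclid(target, mix(sources, [c / slots for c in counts]))
        if trial <= best:
            best = trial
            batch[slot] = new
        else:
            # Undo the proposed swap
            counts[old] += 1
            counts[new] -= 1
    return [c / slots for c in counts], best

# Made-up coordinates: the target is an exact 60/30/10 mix of three sources
sources = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
target = mix(sources, [0.6, 0.3, 0.1])
weights, dist = fit_mixture(target, sources)
print([round(w, 3) for w in weights], round(dist, 6))
```

With real data the target is never an exact mix of the sources, so the distance stays above zero and its size is only comparable between runs that use the same dimensions and scaling.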

Seinundzeit said...

Okay, so Albanians fit much better with the Bell Beaker/Corded Ware model (German samples for both cultures, and using Baalberge now):

41.10% Anatolia_Chalcolithic
29.35% Baalberge_MN
16.70% Corded Ware
11.70% Bell Beaker
1.15% Daur

Distance=0.1169

But it isn't optimal for Finnish people:

51.55% Bell Beaker
44.45% Corded Ware
4% Daur

Distance=1.9615

Seinundzeit said...

One more model:

Russian_Kargopol

57.45% Corded Ware
37.40% Bell Beaker
5.15% Daur

Distance=1.4328

Jijnasu said...

I wonder how Velamas would look using this model. Would they show significantly more Iran Neolithic ancestry?

Nirjhar007 said...

Come on Dave , give us the GAC updates .........

Davidski said...

GAC won't be a game changer; the GAC mtDNA results I've seen suggest it'll be more or less Middle Neolithic like Esperstedt_MN, but maybe with inflated WHG, or Middle Neolithic with a bit of steppe admixture. If there's R1a in GAC, it'll be a minor lineage associated with minor steppe admixture.

By the way, I modified the color scheme on the plot again. Now it's perfect. LOL

Nirjhar007 said...

Come on Dave! everyone here knows you can give much better than that! ( on GAC ) ! ;) ......

Davidski said...

Not at the moment I can't.

Samuel Andrews said...

Speaking of GAC, it's important to know that an Eastern/Western Europe distinction doesn't work for describing where MN and Steppe lived, or where WHG and EHG lived. WHG and MN inhabited most of Eastern Europe. Steppe and EHG really only inhabited Russia. Ukraine might have experienced basically the same population history as Ireland. First WHG, second MN, third Steppe/MN.

Chad said...

Sammy,

That won't be correct. Ukraine will be EHG, possibly with Caucasus ancestry before MN.

Matt said...

@ Sam Andrews: I don't know what you mean by EastCentral Europe, do you mean only areas bordering the Baltic sea or everything from Germany to Ukraine? We do have Unetice and a Urnfield genome from Poland and Germany and they have similar proportions of EEF/Steppe/WHG as modern Northern Europeans.

In my mind's eye, I was thinking roughly from Baltic+Belarussia+Ukraine+Northern and Eastern Poland and maybe some of Romania really.

@ Sein, I've reviewed some of the PCA dimensions by doing a meta PCA on that World PCA set with different numbers of dimensions, and I think you're probably right that 4 is enough to get an approximation of most West Eurasians:

http://i.imgur.com/ZMFy7Vb.png - meta PCA with increasing numbers of dimensions.

The intra-West Eurasia specific PCA crops up at PC4, and that's mostly enough combined with PC1-3 which split off Africans, West Eurasians, East Eurasians and South Eurasians to distinguish. PC6 is the EHG vs WHG and Iran_N vs Levant_N PCA, so by 6 it's very complete.
With full 10 dimensions - http://i.imgur.com/WDsWeZM.png

I think those models for South Asia are pretty good in the context of what we know - 30% Steppe EMBA in much of South and South Central Asia seems very right (compared to the Lazaridis 2016 estimates around 50%!). I see why you wanted the comparable setup, but it is a strange result to have anything that WHG rich.

I guess I'll row it back from suspecting the model being wrong with dimensions 1-7 only. I'd say more that I don't think it's necessary to use a subset of the information, and that using 1-7 might show a better fit than 1-10, but that a worse fit with 1-10 wouldn't reflect noise (just more structure than the putative ancestral populations we can use can capture).

Seinundzeit said...

Matt,

I am really pleased that I found this setup, because the models I've created with it all seem to be in good agreement with the Lazaridis 2016 paper, as far as the relative genetic affinities of different South Central Asian/South Asian populations are concerned.

On top of that, the Kalash hold the same position they do in the paper, as Lazaridis 2016 claimed that the Kalash have around the same amount of steppe EMBA admixture as Northern Europeans (like Lithuanians), and this setup yields that same result. Also, Pashtuns are still on the more northern edge of the cline seen in Europe (with regard to Yamnaya-related ancestry), as was also the case in that paper.

But as you mentioned, the percentages themselves seem much more sensible here, and here all South Indians (with only the extreme exception of Brahmins) are pretty much 0%-2% steppe EMBA (with most at 0%, and only one population at 2%). Furthermore, lower caste North Indians get very low percentages, ranging from 3% to 7%. And even North Indian Brahmins are only around 25% here.

This just (prima facie) seems much more reasonable than the percentages we see in the paper. I mean, the modelling in the paper has even lower caste South Indians with some very serious steppe admixture. Not to mention the ENA levels, which seem extremely inflated in the paper.

Regardless of whether this sort of modelling is better or if qpAdm is better, you're absolutely right, 7 dimensions is quite good for this sort of thing, and 4 is probably enough.

In my case, I could have tried the full 10 dimensions, and I had/have no methodological problems/qualms with doing so. It's just that, when I was running these, I wanted to save some time. It boils down to the fact that I was testing so many populations. Basically, I ran every European, Caucasian, West Asian, Central Asian, and South Asian population in the sheet, and my setup has many more reference populations which I didn't show (since the tested populations I posted had 0% from them). But despite the need for speed, I didn't want to use too few dimensions, and thus risk the possibility of having the reliability of the models questioned. So, I compromised between using only 4 or just using all 10, and went with 7.

So we really agree on everything here.

Karl_K said...

@Seinundzeit

I also am impressed that in your analysis, the southern Dravidian speaking tribal Chenchu had obvious Yamnaya ancestry. They are also well known for having ~25% R1a1 Y-haplogroups.

In general, the pattern is that populations with high R1a also have high Yamnaya-like ancestry.



Seinundzeit said...

Karl_K,

No doubt, the R1a1 correlation is pretty solid, and definitely involves causality.

Unknown said...

@Seinundzeit

Nice results but those "South India" samples are too limited, not good representation of general populations of that region. I would like to see more populations added, which will most likely be similar to Lazaridis estimates with lower caste Malla with 16% EMBA and Tribal with 6% or less.

Jijnasu said...

One thing that surprises me is that with this model even the Paniyas have around 40% ANI. Is this consistent with the fact that the majority of Y chromosomal lineages in this group are C & F ?

Jijnasu said...

@Karl_K
Are you suggesting that rather than being purely Onge like, mesolithic south asians had some iran-neolithic like ancestry?

By the way, Peninsular India never had a Bronze Age but transitioned directly to the Iron Age in the early first millennium BCE.

FrankN said...

@Jijnasu: Are you suggesting that rather than being purely Onge like, mesolithic south asians had some iran-neolithic like ancestry?

The mix may even date back to the Upper/ Epi-Paleolithic. The LGM seriously weakened the Monsoon. The Bay of Bengal branch appears to have completely terminated after ca. 18 kya, turning most of the Indian subcontinent into an uninhabitable desert. At the same time, however, the SW Monsoon seems to have strengthened, to the benefit of now arid areas around the mouth of the Indus river. That would have yielded a well habitable (now mostly inundated) coastal zone stretching between the Persian Gulf and Sri Lanka.

Matt said...

@ All, back on the topic of the West Eurasian PCA from this post, with ancients added, I found something quite interesting when running a meta PCA on these PCA dimensions and the population averages.

I got the familiar PC1 North-South (HG vs ancient Near East) and PC2 East-West dimensions (IranN + EHG vs Levant N + WHG), and then PC3 as CHG vs Iran_N+Levant N:

http://i.imgur.com/ZWRR0sJ.png

Then a PC4 forms, and is not totally negligible at 5% of the variance (the previous are basically 95%):

http://i.imgur.com/5pOnKhr.png

This splits Basque and Iberia Neolithic on the one end (with some South Asians too), from a mix of Baltic-Uralic populations on the other and also opposes the Near East. It also interestingly spans different WHG populations from the Villabruna cluster.

It loads on a PC6 in the original West Eurasian PCA, which distinguishes South Asian+West Mediterranean on one end, from CHG and Eastern Europe on the other.

I think that contributes to when I run in neighbour joining, there's a fairly simple split between Eastern and Western European populations: http://i.imgur.com/5JncGcO.png.

While in lots of previous PCA we've had, when transformed into neighbour joining, you get joinage of East and West Europeans with similar overall levels of HG and Neolithic farmer ancestries (e.g. Hungarian and English, Scottish and Ukrainian, Norwegian and Polish, etc.).

Seems quite interesting, particularly if this is mainly based on the ancient dna itself (because it would indicate that the West Europe vs East Europe genetic divergences - generally too low to be picked up by most PCA but present when Europeans are run in PCA together - might come from the ancient dna structure). I'm not sure it represents a real connection between some of the regions on either end though.

(Btw, Sein, using this West Eurasia PCA and your estimates of ASI, I made an ASI zombie for this PCA, then when I reprocessed PCA data with this zombie and the ancestral populations (Steppe and Iran and ASI Zombie) through nMonte, it gave effectively the same result for Kalash - Iran_Neolithic 50.7, Yamnaya_Samara 35.6, ASI 9.45, Boncuklu_Neolithic 4.25. If that all makes any sense.)
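One way such a "zombie" point could be constructed, offered only as a sketch and not necessarily what was done above: if a population's coordinates are assumed to be a weighted average of known sources plus an unsampled component, the unsampled component's coordinates can be solved for by subtraction. The function name, the shares and the coordinates below are all made up for illustration.

```python
def ghost_coordinates(population, known_sources, known_weights):
    """Solve population = sum(w_i * source_i) + g * ghost for the ghost,
    where g = 1 - sum(w_i) is the ghost's assumed share."""
    ghost_share = 1.0 - sum(known_weights)
    if ghost_share <= 0:
        raise ValueError("known weights must leave a positive share for the ghost")
    dims = len(population)
    known_part = [
        sum(w * s[d] for w, s in zip(known_weights, known_sources))
        for d in range(dims)
    ]
    return [(population[d] - known_part[d]) / ghost_share for d in range(dims)]

# Made-up 3-dimensional coordinates, for illustration only
iran_like = [0.05, -0.02, 0.01]
steppe_like = [0.09, 0.04, -0.01]
true_ghost = [-0.10, 0.03, 0.02]
shares = [0.5, 0.3]  # assumed shares of the two known sources; the ghost gets 0.2

# Build a population that really is a 50/30/20 mix, then recover the ghost
population = [
    0.5 * i + 0.3 * s + 0.2 * g
    for i, s, g in zip(iran_like, steppe_like, true_ghost)
]
zombie = ghost_coordinates(population, [iran_like, steppe_like], shares)
print([round(x, 4) for x in zombie])
```

Note that the subtraction divides by the ghost's share, so any error in the population's coordinates or in the assumed proportions is amplified when that share is small.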

Alberto said...

The problem with dimensions and West Eurasians is that most of the variance is really just captured in 2 dimensions. This makes the kind of models we do with nMonte a bit unstable. If anyone looks at the plot above in this post (very nice colours!), you can imagine that with only 2 dimensions you can model the samples in the centre in various different ways (one way would be as Kurgan_EMBA + Europe_pre_Kurgan and another one would be as WHG + Caucasus and other Near Eastern pops). The theory to prevent this simplistic approach is that the other dimensions won't allow all these combinations to work, choosing the correct one. But when we have such a high percentage of variance in just 2 dimensions (about 85%?), it becomes difficult to get the other dimensions to really make much of a difference.

As an example, with the datasheet from the plot above:

Corded_Ware_Germany:I0103
"Kotias:KK1" 38.4
"Iberia_Mesolithic:I0585" 32.6
"Motala_HG:I0017" 23.6
"Iran_Chalcolithic:I1661" 4.1
"Loschbour:Loschbour" 1.3
"Karelia_HG:I0061" 0
"Iran_Neolithic:I1290" 0
"Yamnaya_Samara:I0231" 0
"Iberia_MN:I0408" 0
"Israel_Natufian:I1072" 0

We know with quite some certainty that the above model is not correct. But it just happens to work. I used for the source the higher coverage samples from each population, and the target CW sample is also the highest coverage, all this to avoid noise. But the problem still happened. If instead of CW sample I0103 I run I0104 (the other high coverage one) I do get a more expected result with good amounts of Yamnaya and EHG. So the point is that very small details can break the balance in one or other direction. The models based mostly on 2 dimensions are simply a bit unreliable.*
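A toy version of this problem, with made-up coordinates (all names and numbers are assumptions for illustration): in the first two dimensions, two very different source pairs reproduce the target exactly, and only a low-variance third dimension tells them apart.

```python
import math

def model_distance(target, sources, weights, dims):
    """Euclidean distance between the target and a weighted source mix
    on the first `dims` dimensions."""
    total = 0.0
    for d in range(dims):
        mixed = sum(w * s[d] for w, s in zip(weights, sources))
        total += (target[d] - mixed) ** 2
    return math.sqrt(total)

# Made-up coordinates: two high-variance dimensions plus one low-variance one
target        = [0.0, 0.0, 0.1]
steppe_like   = [1.0, 1.0, 0.1]
farmer_like   = [-1.0, -1.0, 0.1]
hunter_like   = [1.0, -1.0, -0.1]
caucasus_like = [-1.0, 1.0, -0.1]

model_a = ([steppe_like, farmer_like], [0.5, 0.5])
model_b = ([hunter_like, caucasus_like], [0.5, 0.5])

for dims in (2, 3):
    d_a = model_distance(target, *model_a, dims)
    d_b = model_distance(target, *model_b, dims)
    print(dims, round(d_a, 3), round(d_b, 3))
```

In this sketch the two models tie on two dimensions, and the third dimension separates them by a gap that is small compared with the scale of the first two, so a little noise in the coordinates could be enough to tip the result either way.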

So it's interesting what Matt notes above about that 4th dimension showing interesting differences. The question is how do we get those higher dimensions to matter more (on the assumption that they are not noise, but signal - a weak but important signal that can make a difference). Enough that maybe we can figure out if those Bronze Age Hungarians have CHG (as the PCA based nMonte model was showing) or EHG (as Admixture was showing) or both.

I think that the fact that nMonte uses the Euclidean distance does not help much for this problem. It's an algorithm that actually gives more weight to those first 2 dimensions with higher variance than to the next ones, neglecting any discriminatory power that could lie in those higher dimensions.

I've personally tried using the Absolute (AKA Manhattan) distance, and while it does help in these situations I don't think it's the ideal solution either (besides, nMonte has a hard time finding the lowest Absolute distance in its current state). But maybe someone feels like exploring this issue and finding a good alternative? For anyone interested in maths it just takes a bit of googling for "best distance for high dimensional data" or similar to get started and exploring the options.

*This datasheet with only ancients is more unreliable than global ones with modern populations, which made it easier to show the problem. But the problem (limitation?) does happen in any datasheet to some degree, as seen with the BA Hungarians.

huijbregts said...

@Matt

This is the first time I heard about MetaPCA. I suppose it does a meta-analysis on two or more datasets.
But how do you get independent datasets? They all use the same DNA samples.

By the way, about reducing the number of dimensions: did you consider the possibility that the HG-populations in dimension 1 might be noisier than the more recent populations in the higher dimensions?

Chad said...

I think it would make more sense to model SC and S Asians with Sintashta or Andronovo plus MA1. SC Asians do not derive up to 50% from the steppes and nothing directly from Yamnaya. They're all mostly Neolithic plus local.

Davidski said...

No, they're not.

Northern Indo-Aryans are 30-40% Steppe_EMBA (probably Catacomb) with minor Steppe_MLBA (Andronovo).

Eastern Iranians are 30-50% Steppe_EMBA/Steppe_MLBA, which is probably mostly Andronovo and Scythians, with minor Indo-Aryan input.

The steppe admix drops like a rock as one moves south of Gujarat. This gels with Y-hg data.

Davidski said...

This is gonna be interesting.

http://www.dorsetcountymuseum.org/events?show_archive=0&filter=ARCHAEOLOGY+UNEARTHED&filterfrom=keywords&item=716

Davidski said...

All Dravidians have low to non-existent levels of steppe ancestry.

Most northern Indo-Aryans, as in from Uttar Pradesh and northwards, have high levels (>25%) of steppe ancestry, unless they're low caste, like the Chamar.

Almost all Indo-Aryans have higher levels of steppe ancestry than almost all Dravidians. But this isn't true of overall West Eurasian ancestry, with some Dravidian groups having higher levels of West Eurasian ancestry than many Indo-Aryans, except that of course this West Eurasian ancestry is almost entirely, and sometimes entirely, Iran Neolithic-related.

We're all seeing these same patterns using different methods. Only the ancestral proportions vary, and that's probably mostly due to the different methods dealing in different ways with the problem that we don't yet have all the precise ancient reference samples to run these tests.

Jijnasu said...

@Karl_K
That wouldn't explain why the paternal lineages of the Paniyas, Pullayars and other similar groups are so different from those of the surrounding Dravidian-speaking low & middle caste populations amongst whom L, J2, H, R2 and R1a are the major lineages. This is despite their seemingly not being too different autosomally.

Davidski said...

As you can see groups such as Jatt Sikhs and Haryana Jaats are clustering with Pakistani Pashtuns and Kalash and some even clustering with Afghan Tajik/Pashtuns why is that?

Populations from the Indus Valley like Sindhis are amongst the most West Eurasian in South Asia, but most of their ancestry is Iran Neolithic-related, probably via Harappan and BMAC ancestors.

Jatts are a different story though. They do have a lot of steppe ancestry. I haven't looked at what type of steppe ancestry it is, but my guess is that it's a mixture of Steppe_EMBA and Steppe_MLBA.

Jaydeep said...

David,

I think you should do some analysis of the Jatts as they are a very interesting and important group for two reasons :-

1. They inhabit the same region from where we have the Rakhigarhi samples i.e. Haryana, Punjab & Western UP. Moreover recent evidence suggests that the region of Haryana had a very important role in the formation of the classical Harappan phase with Rakhigarhi being not only the largest Harappan site so far discovered but also potentially being an urban city significantly earlier than Harappa and Mohenjo Daro.

2. Rigvedic geography is centered around present day Haryana with the site of Kurukshetra being one of the most holy sites. In the epic Mahabharata as well the center of action is in Haryana. The heroes of the epics lived in Haryana.

I am pretty sure the present day Jatts are closely related to both the Rakhigarhi Harappans as well as to the epic heroes. Incidentally, a lot of Brahmins trace their lineage to the ancient sages residing around the extinct Saraswati river which flowed primarily in Haryana and Rajasthan. It could be the reason why the Brahmins have so much 'steppe ancestry' in the South.

It is also very likely that the Rigvedic culture and religion spread out across the rest of North India from its locus in Haryana.

Therefore a study of the Jatts could be very useful in understanding the South Asian population dynamics from the Bronze Age onwards.

Jaydeep said...

The Kalash are also a very interesting group. According to the Broushaki et al paper, the Kalash are one of the closest groups to the Iranian Neolithic. At the same time they have the highest MA-1 affinity among all the South Asian groups so far sampled. They are also considered to have the highest steppe related admixture. How do we explain this ?

Simon_W said...

I fear if the Rakhigarhi samples as expected don't have any steppe admixture and differ a lot from modern South Asians, the Indian opponents to the steppe theory may be tempted to regard the samples as excellent proof that IE had nothing to do with the steppe, and they may continue to surmise that Harappans were IE and that steppe admixture in South Asia is from a much later date. Until they analyze samples that are only slightly younger which display decent steppe admixture.

Davidski said...

The Rakhigarhi samples won't be very different from many Makranis, Brahuis and Balochs. In other words, I do expect to see a lot of genetic continuity in South Asia despite later steppe admixture.

But yeah, I reckon what will happen is that the Iran_Neolithic/Harappa + PIE association won't die until it's proven that steppe admixture arrived in South Asia during the early Iron Age (as, of course, there was no Bronze Age per se in South Asia).

Karl_K said...

@Davidski

"But yeah, I reckon what will happen is that the Iran_Neolithic/Harappa + PIE association won't die until it's proven that steppe admixture arrived in South Asia during the early Iron Age (as, of course, there was no Bronze Age per se in South Asia)."

What exactly do you mean by South Asia here? There was definitely a Bronze Age around the Indus Valley.

Never mind how this will all be spun for political reasons, it looks like it will be clear to historians without any agenda.


Karl_K said...

@Simon_W

"I fear if the Rakhigarhi samples as expected don't have any steppe admixture and differ a lot from modern South Asians, the Indian opponents to the steppe theory may be tempted to regard the samples as excellent proof that IE had nothing to do with the steppe"

The only things we have heard from the people who actually know the results is that modern people in the region ARE related to the ancient samples.

This sounds like what Davidski is suggesting. The Indus Valley samples will be mostly related to Iran_Neolithic. The locals still have 60-70% of this ancestry autosomally.

The timing of the arrival of the steppe ancestry will be interesting, but not too surprising.

The data will be great for determining the relationship to other Indo-European groups.

Davidski said...

My understanding is that South Asia went from the Chalcolithic to the Iron Age without an actual Bronze Age like in Europe and West Asia.

Jijnasu said...

There were Bronze Age cultures in Northern/Northwestern India, while Central and southern India transitioned directly from chalcolithic & neolithic cultures to the Iron Age.

postneo said...

IE is not a steppe marker. It entered Europe thru multiple paths reinforcing each other. Yamnaya may be just one such path.

Zoroastrian and Bakhtiari DNA suggest that IE speakers were not exclusively from the steppe.

It may have spread to the steppe thru transhumance, picked up by people that were not closely related to the original speakers.

To claim that IE is exclusively from the steppe you need proof.

Today French is on the decline in Canada vs English. Such a process could have happened in the steppe when they increasingly started speaking IE languages vs local ones




Karl_K said...

@postneo

Where and when was the Proto-Indo-European language spoken?

Are you saying that Zoroastrian and Bakhtiari have NO Yamnaya-like DNA?

Please expand on this. It is very interesting.

Karl_K said...

@postneo

I thought I read somewhere that Zoroastrians in Iran had >15% R1a1a Y-haplogroup. Is that not true? I can't remember which study it was.

Coldmountains said...

@Karl_K
Zoroastrians are rather low in R1a. They have more or less the same frequency of R1a as Muslim Persians. R1a peaks in Sistan/Baluchistan and Khorasan.
http://dienekes.blogspot.de/2012/07/huge-study-on-y-chromosome-variation-in.html

huijbregts said...

@Alberto #1

You are repeating the suggestion that nMonte may be flawed because it should use the Manhattan distance instead of the Euclidean distance.
But your arguments are not valid. Specifically you do not understand that Manhattan distance is fundamentally different from Euclidean distance and cannot coexist with variance or PCA.


A few rare mathematical calculations can be formulated in Manhattan distances (see Wikipedia), where Pythagoras' theorem is violated.
However in the vast majority of mathematical calculations the distance between two points in a multidimensional space is the Euclidean distance.
This is the basis for the concepts of variance, least square statistics and PCA.
The variance has the useful property that its estimator is unbiased and consistent, which means that for large samples the estimate converges to the real value.
Unfortunately, being a sum of squares, the estimation of the variance is dominated by the larger values/outliers.
The consequence is that the variance estimate does a poor job in estimating the variance of small samples. But nobody has suggested that we had better use a Manhattan variance.
That the PCA is vulnerable to outliers is unfortunate, but it doesn't prevent people from using it.
Also PCA and clustering software do reveal distinct and plausible structures in DNA datasets. There are zero hints that the concept of Euclidean distance is unsatisfactory.
Yet, about the rather conventional nMonte algorithm you are tempted to say 'If the Manhattan distance is less vulnerable to outliers, then let's try the Manhattan distance'.
That is wishful thinking, unless you have arguments that the mathematical structure of nMonte is better described by the rare Manhattan distance.

Your main argument is that you can produce unsatisfactory nMonte results.
A simple explanation is that the specification of the admixtures has been unsatisfactory. But you turn it into an argument for Manhattan distance.
That is an exceptional suggestion which is not backed by exceptional arguments.

huijbregts said...

@Alberto #2

If I understand your line of reasoning correctly, you are saying:
1. We know that the example nMonte run is not correct (True, but maybe I0103 needs a different specification).
2. Very small details can break the balance in one or other direction (True, that is a feature of small sample least square calculations)
3. The models based mostly on 2 dimensions are simply a bit unreliable (I would prefer the formulation that they do not capture enough variation)
4. nMonte is an algorithm that actually gives more weight to those first 2 dimensions with higher variance than to the next ones, neglecting any discriminatory power that could lie in those higher dimensions.
(False, the calculation of the Euclidean distance gives the same weight to all the dimensions; but the first two dimensions capture 85% of the variance).
5. We need an algorithm that grants more variance to the higher dimensions (False, 15% can accommodate a lot of structure and remember #2)
6. The Manhattan distance is less sensitive to outliers (True)
7. Maybe the Manhattan distance will grant more variance to the higher dimensions of the PCA (False, a PCA is a partitioning of the variance matrix; this cannot coexist with a Manhattan distance)
8. The Manhattan distance seems more attractive, so it is to be preferred in nMonte. (False. The decision to use Manhattan distance should follow from the fundamental structure of the modelled data, not from more or less pleasing results. One cannot change it like a fresh shirt.
In the rare cases that the model needs a Manhattan distance, this cannot coexist with variance-related models like a PCA.)
9. nMonte has a hard time finding the lowest Absolute distance in its current state. (That doesn't surprise me. It cannot find a minimum distance until it has picked up enough noise to get stuck in a local minimum).
10. It just takes a bit of googling for "best distance for high dimensional data". (High dimensional data is an interesting subject, but different from Manhattan distance. Instead look at Wikipedia.)

And just in case: I do not license the name 'nMonte'.
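
The reply to point 4 in the list above can be checked numerically. The sketch below uses randomly generated stand-in data (the spreads and the sample size are assumptions, not values from the datasheet). Each dimension enters the squared Euclidean distance with the same coefficient, yet its share of the summed pairwise squared distances equals its share of the total variance.

```python
import random
import statistics

random.seed(0)

# Hypothetical PCA-like data: 50 samples, 4 dimensions with shrinking spread.
spreads = [0.08, 0.05, 0.02, 0.01]
data = [[random.gauss(0.0, s) for s in spreads] for _ in range(50)]

n_dims = len(spreads)
columns = [[row[d] for row in data] for d in range(n_dims)]

# Share of the total variance held by each dimension.
variances = [statistics.pvariance(col) for col in columns]
var_share = [v / sum(variances) for v in variances]

# Share of the summed squared Euclidean distance over all sample pairs.
# Every dimension is added with coefficient 1; no dimension is weighted.
sq_by_dim = [sum((a - b) ** 2 for a in col for b in col) for col in columns]
dist_share = [s / sum(sq_by_dim) for s in sq_by_dim]

for d in range(n_dims):
    print(f"dim {d + 1}: variance share {var_share[d]:.3f}, "
          f"distance share {dist_share[d]:.3f}")
```

So the first dimensions dominate the distance only because they hold most of the variance, not because the calculation favours them.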

Davidski said...

@postneo

Just like almost all Europeans and most South Asians, Zoroastrians show a strong signal of ancestry from the Bronze Age steppe.

So you have no argument.

Alberto said...

@Huijbregts

No, I don't think that nMonte is flawed for using the Euclidean distance. 4mix and other "oracles" have used it too. It's a generic approach that works well. And I and many others have used nMonte extensively and it's been very useful for us.

But I reckon we're probably forcing it with use cases it was not specifically designed for. And for some specific problems there might be more specific solutions. So it might be interesting for someone interested in maths and statistical analysis to explore the problem and try to find the most suitable solution.

I don't think that using the Manhattan distance is the answer to this problem. It would require something a bit more sophisticated than that.

As for the license, if you're concerned about it, it's usually a good idea when distributing any software to do so with some kind of license (whichever suits you better - free, proprietary or whatever).

Matt said...

@ huijbregts: That could be a good idea, but no sadly "MetaPCA" was just a slightly goofy term I have used for running a PCA in Past3 on the PCA data which Davidski has already provided. It is probably not a good term, but that is what I was doing.

Possibly MetaPCA might be a better term to use, for example, on a PCA on multiple sets of PCA with incompletely bridging sample sets, then imputed missing values.

@ Alberto, I'm not totally sure why you would think the dimensions with a <% of the variance should matter more? I'm not really following your argument here.

I think you'll get better models with high dimension by identifying samples which are just as useful in the low dimension and fit better in the high dimension. But if you did that by deemphasising the low dimensions (e.g. in the simplest case by simply excluding them from your nMonte sheets) it seems like you'd rapidly become unstuck with models where you'd wander across combinations that didn't actually work in the low dimensions.

Alberto said...

@Matt

No, obviously I'm not suggesting that the higher dimensions should matter more than the lower dimensions. And of course, removing the lower dimensions would destroy the models.

It's about finding the right balance for the kind of problem we're dealing with. Basically, that West Eurasians are quite similar and most of their variance is captured in just 2 dimensions. While those first 2 dimensions are obviously fundamental, it's also the case that with just 2 dimensions we can't get any kind of reliability, and for "checking" which of all the possible combinations that are possible in the first 2 dimensions to see which one is good and which one is clearly bad, we rely on the next ones, which contain a very small amount of variance.

So that little 15% variance in the higher dimensions is very valuable for us (always assuming it's signal and not noise, because if it's noise then we're doomed and there's nothing we can do about it).

So finding the best solution for our specific problem can help to make the most out of the data we have, that's all. I wouldn't know which specific algorithm would be the most suitable, but there are probably options that would work better for us than using the Euclidean distance (but I'd have to refer you to read a bit on the subject because it's something complex that I'm not qualified to explain accurately).

Anonymous said...

I would really love to see some samples from the Nilgiri tribes, besides the Panniya. It would help establish the origin of the Dravidian languages, and help provide some resolution on the actual 'ASI' complexities that may have otherwise been missed.

Ryukendo K said...
This comment has been removed by the author.
Alberto said...

@Matt

Since indeed the use of different methods for measuring distance in multi dimensional spaces is very complex for any of us not into it, I thought I'd clarify further my thoughts.

In general, I don't think we need any of the more complex and exotic methods. Our problem is a quite simple numerical one. And regarding using things like fractional (L < 1) methods, that's probably out of the question too, since even an L1 method like the Manhattan distance already introduces some undesired effects. So given that we also don't want to use any L > 2 method (which would neglect further the importance of lower variance dimensions), I'm basically speaking about some kind of weighted Euclidean distance. Examples of this kind of distance would be the chi-square (used by qpAdm) or Mahalanobis (used extensively in craniometric analysis). We'd just need to find the simplest way of applying some weight to the dimensions to best match our data set and desired results (but of course finding the right way is the difficult part that requires knowledge and testing).
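
As a rough illustration of what a weighted Euclidean distance could look like, here is a minimal Python sketch. The function name, the coordinates and the variances are all made up for the example. The inverse-variance weights correspond to a Mahalanobis distance only in the special case of a diagonal covariance matrix.

```python
import math

def weighted_euclidean(a, b, weights):
    """Square root of the weighted sum of squared differences."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical coordinates and per-dimension variances (illustrative values only).
a = [0.10, 0.02, 0.005]
b = [0.06, 0.01, 0.001]
variances = [0.0064, 0.0025, 0.0001]

# All weights equal to 1 gives the ordinary Euclidean distance.
plain = weighted_euclidean(a, b, [1.0, 1.0, 1.0])

# Inverse-variance weights lift the low-variance dimensions.
scaled = weighted_euclidean(a, b, [1.0 / v for v in variances])

print(f"plain: {plain:.4f}, inverse-variance weighted: {scaled:.4f}")
```

Which weights would suit a given datasheet is exactly the open question raised above; the sketch only shows the mechanics.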

Alberto said...

@RK

Is this the paper from Parpola you mentioned?

http://www.sgr.fi/sust/sust266/sust266_parpola.pdf

Do you know any details about the possible introduction of domestic sheep into China?

https://www.researchgate.net/publication/268788928_Oldest_Directly_Dated_Remains_of_Sheep_in_China

Could it be related to the people from sites like Begash?

https://www.academia.edu/6632362/Early_agriculture_and_crop_transmission_among_Bronze_Age_mobile_pastoralists_of_Central_Eurasia

Ryukendo K said...
This comment has been removed by the author.
Davidski said...

I think the results from our methods will converge with more thorough sampling.

In other words, when we get more relevant ancient samples, the more generous qpAdm method will produce lower levels of steppe ancestry in South Asia, while the more conservative PCA/nMonte method will raise it up 5-10%.

Ryukendo K said...
This comment has been removed by the author.
Davidski said...

The ancient population sets are usually made up of individuals with highly variable SNP counts, and there's often not enough high quality individuals in the sets to split them up into pairs. This is especially true for the population sets from the Near East.

Ryukendo K said...
This comment has been removed by the author.
Romulus the I2a L233+ Proto Balto-Slav, layer of Corded Ware Women said...

David, have you seen this?

https://www.researchgate.net/publication/309394790_Migration_und_Integration_von_der_Urgeschichte_bis_zum_Mittelalter_Migration_and_Integration_from_Prehistory_to_the_Middle_Ages

Romulus the I2a L233+ Proto Balto-Slav, layer of Corded Ware Women said...

Celtic migrations in the Early La Tène period – fact or fiction?
The historically defined term »Celtic migrations« used by classical authors who wrote accounts of »Celts« as
marauding mercenaries and migrating tribes wrongly brings to mind an image of an »invasion of entire peoples«
in the second half of the 1st millennium BC. The historical hypotheses about mobility and migration scrutinised in
archaeology have been verified by drawing on methods of scientific analysis. Sixteen Early La Tène grave fields
from the La Tène Culture’s core territory and area of expansion extending into eastern and south-eastern Europe
and southwards over the Alps into Italy were selected for this in the DFG project »Mobility and Migration in the
Iron Age (4th–3rd century BC): Archaeological and Bioarchaeometric Approaches to Identifying Locals and
Immigrants«. The archaeological evidence indicates that mobility and migration transpired in significantly more
complex processes than »migrations« in one direction. Moreover, since there is no evidence that settlement north
of the Alps ceased entirely, the more likely assumption would be that smaller fractions of tribal communities
moved away from their original areas of settlement. The bioarchaeological analyses of stable isotopes (87Sr/86Sr and
δ18O) employed in the project largely corroborate the archaeologically based hypothesis that, although Celtic
migrations are indicative of an extensive network of contacts that can be documented by single individuals’ relocation,
large bodies of people did not relocate over long distances at all in the Early La Tène period

huijbregts said...

@all
I am a retired biologist who is interested in algorithms; I am definitely not an expert in abstract statistical mathematics.
I created nMonte as an experiment after some thinking about the K=4 limitation of 4Mix; I wondered whether it would be possible to guide a Monte Carlo simulation by monitoring the Euclidean distance.
After solving some performance issues, that turned out to be quite feasible.
I think nMonte is a useful 'light' tool, but one should be aware of its limitations, which many users are not.
IMO the main problem with nMonte is that the specification of the admixtures is really critical; not only which ones, but also how many.
One has to steer between Scylla and Charybdis. Choose too few admixtures and you will miss ancestral DNA.
But choose too many and you are overfitting because the variables are collinear. Nobody trusts a regression analysis with 20 variables.
When you look at posted nMonte results, they often have more than 20 admixtures. Often their distances are far less than 0.1%. These are clearly overfitted.
But it is hard to find a safe haven between Scylla and Charybdis. It is hard to avoid overfitting because there are no populations without common ancestry.
And actually the multicollinearity is an important feature of nMonte, because it allows a few populations to clean up by 'eating' related populations.
My feeling is that the problems of finetuning the mathematical model pale when compared with the specification problem.
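
For readers who have not seen the idea, the following Python sketch shows one way a random search over mixture proportions can be steered by the Euclidean distance. It is not the nMonte code. The function names, step size, number of steps and source coordinates are assumptions chosen only to make the example run.

```python
import math
import random

def mix(sources, weights):
    """Weighted average of the source coordinates."""
    dims = len(sources[0])
    return [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dims)]

def distance(a, b):
    """Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_mixture(target, sources, steps=20000, step_size=0.01, seed=1):
    """Random search over proportions, keeping only moves that lower the distance."""
    rng = random.Random(seed)
    k = len(sources)
    weights = [1.0 / k] * k
    best = distance(mix(sources, weights), target)
    for _ in range(steps):
        i, j = rng.sample(range(k), 2)
        delta = min(step_size, weights[i])  # never push a proportion below zero
        trial = list(weights)
        trial[i] -= delta
        trial[j] += delta
        d = distance(mix(sources, trial), target)
        if d < best:
            weights, best = trial, d
    return weights, best

# Hypothetical source populations in a 3-dimensional PCA space.
sources = [[0.10, 0.00, 0.02], [-0.05, 0.08, 0.00], [0.00, -0.06, -0.03]]
true_weights = [0.6, 0.3, 0.1]
target = mix(sources, true_weights)

weights, dist = fit_mixture(target, sources)
print([round(w, 2) for w in weights], round(dist, 5))
```

With three well separated sources the search lands close to the proportions used to build the target. Adding many closely related sources would give the search more ways to reach a tiny distance, which is the overfitting concern described above.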

@Alberto
Maybe I misunderstood you when you again brought up the Manhattan distance. But I do not share your opinion that the use of Manhattan is simply a numerical detail.
Have you read the Wikipedia lemma on Manhattan distance? It is an entirely different geometry. Statistically this means that you are leaving familiar statistics of average, variance, Euclidean distance and PCA.
Effectively you enter a world of medians, cluster analysis and multidimensional scaling. I doubt whether that will help you to understand the PCA. The use of fractional methods is new for me; I think this is exotic and if I may say so, quixotic.
I do not understand your remarks about a weighted Euclidean distance; both chisquare and Mahalanobis are derived from the variance and definitely L=2.
If your purpose is just to get more variance in the higher dimensions, why don't you normalize the data by dividing each column by its standard deviation?
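
The normalisation suggested in the last line could be done along these lines. This is a minimal sketch with invented coordinates; using the population standard deviation rather than the sample one is an arbitrary choice here.

```python
import statistics

def standardize_columns(rows):
    """Divide every column by its own standard deviation."""
    n_dims = len(rows[0])
    sds = [statistics.pstdev([row[d] for row in rows]) for d in range(n_dims)]
    return [[row[d] / sds[d] for d in range(n_dims)] for row in rows]

# Hypothetical PCA coordinates: rows are samples, columns are dimensions.
rows = [
    [0.10, 0.004],
    [-0.08, -0.002],
    [0.02, 0.001],
    [-0.04, -0.003],
]

scaled = standardize_columns(rows)
scaled_sds = [statistics.pstdev([row[d] for row in scaled]) for d in range(2)]
print(scaled_sds)  # both columns now have the same spread
```

After this step every dimension carries the same variance, so an unweighted Euclidean distance on the scaled sheet treats a low-variance dimension like a high-variance one.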

@Ryukendo
The D-stats datasheet is not symmetric, so I do not understand what you mean by the diagonal. Or do you mean its covariance matrix?
But if your idea is directed against the multicollinearity I mentioned above, I feel positive about it.
I am just worried about the mathematical correctness of throwing out the heart of a datasheet. It is not a pineapple.
Moreover, the columns of the Dstat sheet are not orthogonal.

Romulus the I2a L233+ Proto Balto-Slav, layer of Corded Ware Women said...

Here is another interesting bit, I think it's new information that Nordic Funnelbeakers displaced LBK.


Populations of Nordic Funnelbeaker Cultures
turned out to have displaced the genetic descendants of the Linear Pottery Culture, i. e. the Baalberge and
Salzmünde Cultures, over the course of the 4th millennium BC. The Globular Amphora Culture specialising in
livestock keeping also arrived from the East toward the 4th millennium BC. The situation changed in the 3rd millennium
BC when the Corded Ware Culture (ca. 2800 BC) from the eastern European steppe and the Bell-Beaker
Culture (ca. 2500 BC) from western Europe moved into central Germany and set off massive cultural changes.
How the different protagonists acted and how these prehistoric migrations transpired still require investigation,
of course. Social anthropological research on ethnic groups and their origins and structure can bring us
further here. Fredrik Barth’s volume of essays »Ethnic Groups and Boundaries« from 1969 in particular is playing
an important role. It demonstrates that ethnicity manifests itself particularly succinctly at the boundaries between
different ethnic groups in particular, whereas it has little importance in ethnically homogeneous regions. Central
Germany is a perfect example since 15 cultural groups appeared there during the 4th and 3rd millennium BC,
settlers encountering autochthone populations. The assimilation of individuals by particular cultures, which
undoubtedly took place, can only be inferred from the rapid proliferation of evidence of these cultures that surpasses
the natural reproduction rate or, in the case of some regional Corded Ware groups, from reminiscences of former
local cultures in material culture.
The interdisciplinary interplay of archaeology, genetics, and social anthropology is enabling us to identify and
understand prehistoric migrations better now and, thus, ultimately to write some real history as well.

FrankN said...

@Romulus: "I think it's new information that Nordic Funnelbeakers displaced LBK."

It's not new, it just seems to be the first time the fact has been explicitly reported in an English language publication.

It has long (1930s?) been known that dolmen construction spread out of East Holstein/ Danish Isles southward at a rate of ca. 100 km/century (30 km/generation) from ca. 3500 BC onward. By around 3300 BC, that expansion had passed sparsely-populated Mecklenburg/ N. Lower Saxony and reached the Loess Belt, old EEF (LBK) territory. There, a militarised border (lots of fortifications and "warrior burials") evolved along the line Erfurt-Potsdam-Stettin, before eventually around 3100 BC the Salzmünde Settlement and the eponymous culture was violently destroyed. Not just by Nordic FB (afterwards in the Elbe-Saale Region known as Bernburg Culture), but in alliance with GAC.
Hence my constant warning that, until we have Bernburg / Western GAC aDNA, we can't tell whether EHG found its way into Central Europe only with Corded Ware, or already with Late Nordic FB aka Bernburg, plus GAC.

Baalberge, of course, wasn't a straight "descendent of LBK", but had been transformed by Rössen and Michelsberg, both more pastoralism-oriented arrivals from the West (Paris Basin and beyond), and also by previous interaction with Nordic FB. In fact, Baalberge got relatively seamlessly absorbed into Bernburg.
Salzmünde is a different case - it was a Baden offspring and represents additional Danubian influence.

Rami said...

Modelling Paniya as 40% Iranian Neolithic is incorrect, and the same goes for those other South Indian groups, as they contain a lot of Ust'-Ishim-like ANE ancestry.
The percentages for groups in the North West of the subcontinent seem more reasonable but without any ancient genomes from Central or South Asia there is quite a bit of speculation. Till last year nobody even knew Zagros/ Neolithic Iranian Farmers existed and were quite distinct from their neighbours.

Ric Hern said...

Wonder if someone could help me. I have seen a map of R1b DF27 in Eupedia. It was updated recently and I seem to remember that R1b DF27 were not shown in Northern Romania and Southwestern Ukraine previously. Wonder if it was included because of Ancient or Modern Samples that became available ?

Anonymous said...

@Davidski,

Have you been able to get this new information into a Gedmatch heritage test? Would be interesting to see with ANE as well.

huijbregts said...

@Alberto

I prepared a heatmap of the population averages.
It confirms that most information can be found in the dimensions 1:4.
Dimension 8 is conspicuously barren.
https://www.dropbox.com/sh/m3lwiv9ynqejyx8/AABwQJCZOxVfv69EmM3ifoLQa?dl=0

Matt said...

@ Ryu, I'm not sure if this is entirely on topic for what you're discussing, but I did play around with one of the D-stats sheets we had to try and get as close as possible to a symmetrical matrix:
http://i.imgur.com/Oix45eW.png

That's kind of not symmetrical though. (To actually get symmetry, I think you'd have to use a genetically identical same outgroup on both sides of the D-stat?).

So I averaged it with a transposed version (where columns and rows are swapped) to get a symmetrical D-stat based matrix
http://i.imgur.com/UdRlSSx.png
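
The averaging step described above amounts to the following. This is a minimal Python sketch; the 3x3 matrix is a made-up stand-in for the D-stat sheet.

```python
def symmetrize(matrix):
    """Average a square matrix with its transpose."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2.0 for j in range(n)] for i in range(n)]

# Hypothetical, slightly asymmetric similarity scores between three populations.
m = [
    [1.00, 0.30, 0.10],
    [0.34, 1.00, 0.20],
    [0.12, 0.26, 1.00],
]

s = symmetrize(m)
for row in s:
    print([round(v, 3) for v in row])
```

The diagonal is left as it was, and each off-diagonal pair is replaced by its mean.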

Then applied neighbour joining trees using that matrix as user defined similarity index (a la what I've found as best practice to use to transform Fst scores into trees and plots, only that's distance not similarity) and not a Euclidean dimensional data:

http://i.imgur.com/HyfD9SD.png

and MDS:

http://i.imgur.com/80NbIvs.png
http://i.imgur.com/JJzkMf2.png

The self similarity level does seem to change the outcomes as the length of branches on trees changed when it set the diagonal to 1 or 0 rather than what it was.

(Not sure what outgroups were used in these, whether it was the Mbuti, X, Chimp, Y formulation or vice versa, or something else).

Matt said...

Btw, with the Global 10 dimensions, they seem really dominated in many dimensions by space between Siberian and SE Asian populations, e.g. putting their averages all in Euclidean Neighbour Joining and rooting in the most outgroup African population:

http://i.imgur.com/ZvhS6UU.png

You get the Beringian populations at one end, and then SE Asians at the other and West Eurasians bridge between them. The degree of drifts between those populations represented in this PCA must be huge for that to happen.

huijbregts said...

@Matt
Very nice manipulation of the Dstats!

biaystior said...

Europe Post Kurgan, is highly shifted towards Iranian_N component. Interesting.

Davidski said...

Europe Post Kurgan, is highly shifted towards Iranian_N component. Interesting.

Europe post-Kurgan is shifted towards Kurgan Eneolithic/EMBA.

Iran_N is not relevant to Europe or the steppe.

huijbregts said...

I have a better heatmap of the 'Adventure' data. I think this is a very good way to explore the dimensions.
For better visibility I have improved the color scheme; also I have added a trace line.
For global purposes the dimensions 1 and 2 capture most of the variance.
However, even on dimension 9 a faint blip can be discerned (put your glasses on).

https://www.dropbox.com/sh/m3lwiv9ynqejyx8/AABwQJCZOxVfv69EmM3ifoLQa?dl=0

postneo said...

@karlk

You hinted at accumulated wealth in bronze behind the explosion of certain y lineages. Before that, during the early chalc, a similar advantage would be the accumulation of surplus food, or more perishable resources that could be bartered.

I think Transhumance along mountain corridors by pastoralists had vital advantages and influenced proto PIE.

populations like the kalash and bakhtiars are examples of aboriginal IE speakers practicing old subsistence and cultural practices hard to fake.

The incidence of R1a is low in Bakhtiars and Iranians in general, ANE/"steppe" component seems lower in iran than south asia.

Karl_K said...

@postneo

Surplus food lasts only a short time. Tons of cultures had surplus food now and then.

Surplus fine stone would be better, and that is probably why we find so many large caches of flint blanks and large handaxes. They were buried in secret locations, but the owners never made it back.

Hard metals, however, never totally wear out, and can be reshaped without becoming smaller. And you need knowledge to know how to produce them, and how to reuse them. Casting allows for large scale production.

The reason that it is called the bronze age is because bronze started to be used to transfer wealth and power.

This is why money was made of metal right up until the most modern times. Money was never made of wheat.

postneo said...

a sustained surplus/security in food is essential before you can graduate to other value add products. Any industry in fashioning blades, flint tools, mining or smelting would not be possible in societies that faced unpredictable famines or starvation die offs. Such societies must have existed.

Davidski said...

The incidence of R1a is low in Bakhtiars and Iranians in general, ANE/"steppe" component seems lower in iran than south asia.

Well duh. That's because the Indo-Iranians from the steppes entered the Iranian Plateau via South Central Asia.

Ramber said...

@Seinundzeit,

Do you know if there is indirect West Eurasian/West Eurasian-like ancestry in many SE Asian populations which they likely received indirectly from South Asian gene flow? Also if SE Asians have indirect West Eurasian/West Eurasian-like ancestry, is it in the form of Iran Neolithic/Iran N-like?

Also I am curious why is Balochi and Brahui having rather significant Yamnaya Samara in the d-stats? Don't they lack Steppe or have very negligible Steppe ancestry? Sorry for asking such questions but I am very beginner to genetics.

postneo said...

@davidski
Well duh David, look at the map.

Iran and south asia are equidistant from "south central asia". In fact parts of south asia are quite far from south central asia vs Iran and yet have higher or comparable ANE.

Matt said...

It seems hard to get a good test panel of mainland Southeast Asians together. Like, the newest PCA sheet lacks Southeast Asians.

Thai, Burmese and Cambodian are the SE Asians that tend to pick up Indian elements, though IIRC often only the Burmese show any degree of "ANI" components and even then the Indian elements they include are heavily ASI.

(e.g. http://journals.plos.org/plosone/article/figure/image?id=10.1371/journal.pone.0127655.g001&size=large where there is some very small trace of a light green ANI component in Burmese, not in Cambodian, or the same thing here - http://www.nature.com/articles/srep19166/figures/3).

Using Davidski's K7 ADMIXTURE datasheet (http://eurogenes.blogspot.co.uk/2016/07/sneak-peek-basal-eurasian-k7.html), which included Burmese and the best available, and using all the available Indians, here are a few models for them:

First with Han only as the East Asian reference and all available South Asian references:

Burmese - Han 72.8, Chamar 14.35, Paniya 12.85 - distance% = 3.6778 %

Then adding Naxi, Kinh Vietnamese, Yi people as proxies for the East Asian side.

Burmese - Naxi 79.25, Paniya 16.2, Chamar 4.55 - distance% = 1.2102 %

Then also adding Ulchi (from Siberia), Dai and She people (from South China) as proxies for the East Asian side.

Burmese - Naxi 68.6, Paniya 20.5, Ulchi 10.9 - distance% = 0.9836 %

Naxi are people from Yunnan, China, and the Ulchi are Siberian, so the formula seems to be Yunnanese plus South Indian with a little Siberian ancestry (possibly actually from Tibet).

If the Paniya / Chamar don't need any Steppe EMBA ancestry in Sein's models, it seems unlikely that any Southeast Asians have it as any appreciable quantity of their South Asian related ancestry.

Rob said...

@ Karl K

"@postneo

Surplus food lasts only a short time. Tons of cultures had surplus food now and then.

Surplus fine stone would be better, and that is probably why we find so many large caches of flint blanks and large handaxes. They were buried in secret locations, but the owners never made it back.

Hard metals, however, never totally wear out, and can be reshaped without becoming smaller. And you need knowledge to know how to produce them, and how to reuse them. Casting allows for large scale production.

The reason that it is called the bronze age is because bronze started to be used to transfer wealth and power.

This is why money was made of metal right up until the most modern times. Money was never made of wheat."

Some very good reasoning.
Can you shed light where the largest depositions of metals are to be found c. 4-3000 BC ?

Davidski said...

@postneo

Yeah, no shit there's more Bronze Age steppe ancestry in the Pamirs, Hindu Kush and North India than in Iran. Should be obvious by now.

No idea what you're whining about.

Karl_K said...

@Rob

"Can you shed light where the largest depositions of metals are to be found c. 4-3000 BC ?"

I am not sure I understand your question. Are you talking about where the ore came from? Or where they produced the metal? Or where artifacts have been found?

If you want to know who was producing the best Bronze at the earliest dates, then look at the ancient Near East. They first had food surpluses, which led to job specialization, which allowed social stratification, and accumulation of power, and wealth. They could then control trade and production of metals, and became very rich civilizations.

In other places, where people didn't usually have excess food, or were much more mobile, metal objects were one of the only ways to transfer wealth over multiple generations.

Karl_K said...

@postneo

"a sustained surplus/security in food is essential before you can graduate to other value add products. Any industry in fashioning blades flint tools mining smelting would not possible be possible in societies that faced unpredictable famines or starvation die offs. Such societies must have existed."

I think you are missing the point. High value non-perishable objects allow wealth to be accumulated, and then used as a buffer against famines. They can be traded for food if necessary. Gold jewelry made 6000 years ago can be traded for food today, this is why grave robbers exist, not to dig up the food offerings to eat.

The people who mined and smelted the ores do not matter here. The steppe people who moved into Europe didn't get their advantage there by mining. But the rich ones could pass their wealth to a child using metal.

Davidski said...

In answer to one of the questions above, I do have a new test on the way within a few days. But it won't be on GEDmatch. I'll be using it as part of a fund-raising effort for 2017, along with the Basal-rich K7.

So yeah, two tests for a donation of, say, $15, plus my new global PCA for an extra donation for anyone who's feeling more adventurous/generous.

As in the past, the idea is to give the readers an insight into what I do here in exchange for their support. I find that most people show more interest in the concepts when their genomes are part of the analyses.

The relevant post should be up by Monday or Tuesday latest.

Ramber said...

@Matt

So Southeast Asians that tend to show "South Asian" elements like Burmese, Cambodians, Malays have very little indirect West Eurasian ancestry or none at all? This means that the "South Asian or Indian" that they tend to show is heavily ASI?

Do you know why in some ADMIXTURE like CHG K8: http://www.anthrogenica.com/showthread.php?6073-K8-CHG-Test-Results&p=129076&viewfull=1#post129076

or even Steppe K10: https://docs.google.com/spreadsheets/d/1Hb0GVyrf2ztR_QvoIYcmhWtsYv0p39avjqM-G3-6Xew/edit?pli=1#gid=1809893991

,most Southeast Asians have a lot of CHG or Hindu Kush? Because these two components in both calculators seem to be West Eurasian components as they peak in Balochis/Brahuis with other West Eurasian populations scoring a lot of them in each calculator.

Or would you say the CHG and Hindu Kush in these calculators, are heavily ASI-admixed as well? But since most Southeast Asians score such components in these calculators, they should have some West Eurasian in my opinion.

Seinundzeit said...

Hi Ramber,

Matt is right, it seems Cambodians and Malay have very negligible West Eurasian ancestry.

I used the same setup that worked great with all kinds of West Eurasians + South Indians. This is what I get:

Cambodians

80.15% Dai
16.00% Jarawa
3.35% Iran_Neolithic
0.50% Yamnaya_Samara

Distance=0.762

Malay

80.65% Dai
19.35% Jarawa

Distance=1.3394

So, I'm guessing that the South Asian signal seen in other analyses might be reflective of an ENA substratum shared with peninsular South Asia?

As Matt noted, the Burmese are probably a different story. Unfortunately, the PCA data-sheet in question doesn't have them.

With regard to Yamnaya-related ancestry in Balochistanis, although the percentages differ between methods, the genetic affinities seen with these populations are very consistent between various methods.

Mainly, it's obvious that the Balochistanis are outliers in the genetic context of South Central Asia. When compared with other West Eurasian populations in this region, it is always certain that the Balochistanis will have the least EMBA steppe admixture.

For example, if the Kalash/Pashtuns are construed as 50%-45% EMBA steppe, the Balochistanis are around 30%. When the Kalash/Pashtuns are construed as 30%-35% EMBA steppe, the Balochistanis are around 10%-15%.

Also, it's a set/consistent pattern that the Baloch proper have the most EMBA steppe admixture in this area, the Makrani have the least, and the Brahui are intermediate with respect to these two other populations (when it comes to this sort of ancestry).

Finally, an interesting pattern I've noticed is that the Makrani almost always turn out around 75% Neolithic Iranian.

Truth be told, the patterns themselves are firmly established. The real uncertainties all involve the percentages.

And as David noted, more aDNA will narrow things down, and probably allow some convergence between methods.



Matt said...

@ Ramber, I was going to run some nMonte for those extra spreadsheets you gave, but unfortunately it doesn't seem like the populations or format work out. Steppe K10 sheet has Thai, Cambodian and Burmese together, but there are no South Indian / high ASI populations (the most ASI population there is GujaratiA), while the version of the CHG K8 uploaded to the forum on Anthrogenica has weird formatting issues going on.

Re: "This means that the "South Asian or Indian" that they tend to show is heavily ASI?" I do think it is. That could be because it came from ASI Neolithic people, or even pre-Neolithic ASI like people. Possible on the outside could also be some model where it is a composite of a pre-Neolithic and post-Neolithic connection that works out most similar to ASI.

"Or would you say the CHG and Hindu Kush in these calculators, are heavily ASI-admixed as well?" I wouldn't say the components themselves are ASI admixed exactly, but the proportions of ancestry in SE Asians that get them I think would be consistent with a heavily ASI source in SE Asians.

Btw, even though the Steppe K10 doesn't have any ASI populations or anyone more ASI than GujaratiA, I tried using the method of taking Burmese as roughly 80% Naxi to visualise where its other ancestry would land in terms of proportions*

http://i.imgur.com/WyFUr21.png
http://i.imgur.com/SVYRX9O.png

Or alternatively, if we instead just assume that Cambodian, Thai and Burmese are all mixing with a uniform South Asian population (might be a bit unlikely, but why not try?):

http://i.imgur.com/RsEuHn5.png

SteppeK10 is not really precise for this purpose though.

(Please don't take these PCA to represent any genetic distance though - they're mapping the amount of components and the genetic distance between the components being mapped is not equal).

*Note though, this did mean that the ghost population had to go strongly negative in many proportions, so it was well outside the scope of anything within the spreadsheet.
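
For anyone wanting to follow the arithmetic behind that footnote, here is a minimal Python sketch. It assumes the target is a simple linear mix of one known source and one unknown "ghost" source. The function name, component names and proportions are all made up for illustration and are not taken from the Steppe K10 sheet.

```python
def residual_source(target, known, known_share):
    """Solve target = known_share * known + (1 - known_share) * ghost for ghost."""
    rest = 1.0 - known_share
    return {c: (target[c] - known_share * known[c]) / rest for c in target}


# Hypothetical component proportions (each row sums to 1).
burmese = {"East_Asian": 0.75, "South_Asian": 0.20, "Steppe": 0.05}
naxi = {"East_Asian": 0.98, "South_Asian": 0.01, "Steppe": 0.01}

# Assume 80% of the target comes from the known source.
ghost = residual_source(burmese, naxi, known_share=0.80)
for component, value in ghost.items():
    print(f"{component}: {value:+.2f}")
```

With these invented numbers the ghost's East_Asian proportion comes out negative. No real population could have such a value, so the assumed 80% share (or the choice of known source) is too strong for that component.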

Garvan said...

@Ramber & Matt & Seinundzeit

In a study of 125 Cambodian children in Siem Reap province, 7.2% had R1a1*. This is missing in the Han and Miao (Yao & Hmong). (Black 2006, Genetic ancestries in northwest Cambodia).

The elites in SE Asia were speaking Sanskrit-Pali at the time of first contact with Europeans, so I expect some admixture.

I work and travel in rural areas of Myanmar and my observation is that the population is highly variable. The label "Burmese" should be treated with caution.

Garvan

Matt said...

@Garvan, yes, the different samples are also obviously very heterogeneous, with some recent admixture showing in ADMIXTURE, so the population average could be slightly misleading. I do think the general picture from it is probably correct: at the fine structure level, the ancestry in Burma reflects what you'd expect from a cline from ASI->Yunnan and the lowlands near Tibet, while Thailand & Cambodia fit a cline like minority ASI->South Vietnam. But getting a high N from that nation and then looking at substructure to try and find meaningful subpopulations would refine the results (ideally).

If R1a is an elite lineage then it would probably spread to a degree beyond autosomal ancestry. If you've got South Asian populations with 50-70% R1a and 30% steppe ancestry, then in a population where the subpopulations richest in R1a have 7% R1a, you could expect actual steppe ancestry an order of magnitude lower, or less, even in those subpopulations.
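
The back-of-envelope scaling here can be written out as a short Python sketch. It assumes, purely for illustration, that steppe ancestry scales linearly with R1a frequency. The function name is hypothetical, and the only inputs are the 50-70% R1a, 30% steppe and 7% R1a figures quoted above.

```python
def scaled_steppe(r1a_target, r1a_reference, steppe_reference):
    """Scale a reference steppe share by the ratio of R1a frequencies."""
    return steppe_reference * r1a_target / r1a_reference


# Reference populations: 50-70% R1a alongside roughly 30% steppe ancestry.
for r1a_reference in (0.50, 0.70):
    estimate = scaled_steppe(0.07, r1a_reference, 0.30)
    print(f"R1a reference {r1a_reference:.0%}: steppe estimate {estimate:.1%}")
```

Under that linear assumption the result is only a few percent, roughly a tenth of the reference value. Since an elite lineage can spread further than autosomal ancestry, this is better read as a ceiling than as an estimate.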

Alberto said...

@Huijbregts

Thanks for those heatmaps. For West Eurasians it's indeed 4 dimensions that capture almost 100% of the variance, and the problem is that the percentages are very uneven, something like 50%, 35%, 10% and 5% respectively. My thinking was that giving relatively more weight to dimensions 3 and 4 could have some benefit in making models a bit more robust. So I did a few tests using 4mix (which only takes 4 populations as sources, but does test all the combinations) and it really doesn't seem to make any difference. So I don't think it'll help to try to fine-tune the algorithm of nMonte or any other tool. It all comes down to the quality of the input data, and that can only be improved by having more samples, as many have pointed out.

Rob said...

@ Karl K

Are we referring here to the introduction of actual Tin-Bronze (variously c. 2200 BC), or more broadly to the rhythms of metallurgy? If the latter, then what is obvious is the close correlation of events around the Ponto-Caspian region c. 4000 BC. If the former, then it relates to a different set of phenomena c. 2200 BC (i.e. after the major genetic shifts).

Ramber said...

@Seinundzeit

Dear Seinundzeit

Hmm Cambodians and Malays do have West Eurasian ancestry, but it is very minor like 1-5%? When you mentioned that Burmese might be a different story, you mean they might have more ANI/West Eurasian? If this is the case, most of the "South Asian" component that Southeast Asians usually score in ADMIXTURE calculators is mostly ASI/ASI-like?

Thank you for your verification on Balochistanis. :) I'm trying to grasp this: does it mean that, contrary to the previous belief, Balochis, Brahuis and Makranis actually do have Steppe admixture, but very little, and the least among Northern South Asians and South Central Asians?

@Matt

Dear Matt,

Hmm that's weird the spreadsheets do not work. How about these two spreadsheets from Kurd's ANE K6 and Iran Neolithic K6 ? Unfortunately Cambodians are the only SE Asians but they can probably give an idea. Cambodians score some ANE+very little Natufian in ANE K6, while they score minor Iran Neolithic+little Natufian in Iran Neolithic K6. However there are populations with significant ASE (I think it is ASI) in both spreadsheets like Paniyas, Pulliyars, Onge with the ASE peaking in the Andamanese sample:
1) https://docs.google.com/spreadsheets/d/13O_IYAv4SE8jLO9FKOQ5RiHf4vQeOQbSRWa-KsN7wO4/edit#gid=1957523915

2) https://docs.google.com/spreadsheets/d/1ByPKVwgDy1lXpwjb-YOfVHPRHCkM1G5Ce6FdVumBBLE/edit#gid=1636308841

Ramber said...

@Seinundzeit,

I have another question. Do you know if there are any Mongolians or any other North Asians/Siberians without any West Eurasian admix? The lowest one seems to be Mongola, who are ethnic Mongols from China.

Matt said...

@ Ramber, OK, yes, of the three SE Asian pops that usually look to have South Asian ancestry, I can only really see Cambodians on these sheets, so I've tested that row with nMonte.

Gedrosia ANE K6:

Cambodian - Tibetan 65.3, Paniyas 17.5, Han 17.2 - distance% = 3.0412 %

Without Tibetan

Cambodian - Han 75.2, Paniyas 24.8 - distance% = 3.1622 %

Iran Neolithic K6 -

Cambodian - Dai 80.15, Paniyas 19.85 - distance% = 2.9616 %

I included Tibetan and Han as populations on that one as well, so the Iran_N K6 seems more precise at placing the East Asian ancestry (more similar to Dai than either Ami / Han / Tibetan / Ulchi etc.). I think the setup with an Iran_N cluster opposed to an ANE cluster seems better at ferreting out the precise admix.

I didn't use the Kharia / Kusunda, who are ASI-rich populations from India with a lot of East Asian ancestry. Otherwise I used as many Indian populations as I could spot, and it doesn't seem to want any contribution from any other than Paniyas.

With Kharia, with Gedrosia ANE K6, Cambodian places as:

Cambodian - Han 45.55, Kharia 27.4, Tibetan 27.05 - distance% = 2.7398 %

and Iran Neo K6:

Cambodian - Dai 74.45, Kharia 25.55 - distance% = 2.6225 %
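
For readers who haven't used this kind of tool, the general idea of fitting a target as a mix of source populations can be sketched in Python like this. To be clear, this is not nMonte or its algorithm. It is a plain grid search that minimises Euclidean distance, and the population names and component values are invented for illustration.

```python
import math
from itertools import product


def mixtures(n_sources, step=5):
    """Yield every tuple of percentages (multiples of `step`) that sums to 100."""
    for combo in product(range(0, 101, step), repeat=n_sources - 1):
        rest = 100 - sum(combo)
        if rest >= 0:
            yield combo + (rest,)


def fit(target, sources, step=5):
    """Return (distance, proportions) for the closest mixture of the sources."""
    names = list(sources)
    best = None
    for props in mixtures(len(names), step):
        model = [
            sum(p / 100 * sources[name][i] for p, name in zip(props, names))
            for i in range(len(target))
        ]
        dist = math.dist(model, target)
        if best is None or dist < best[0]:
            best = (dist, dict(zip(names, props)))
    return best


# Hypothetical component proportions for three reference populations.
sources = {
    "EastAsian_ref": [0.90, 0.05, 0.05],
    "SouthAsian_ref": [0.10, 0.80, 0.10],
    "WestEurasian_ref": [0.05, 0.15, 0.80],
}
# Hypothetical target built as 75% EastAsian_ref + 25% SouthAsian_ref.
target = [0.70, 0.2375, 0.0625]

distance, proportions = fit(target, sources)
print(proportions, round(distance, 4))
```

Because the target was constructed from two of the references, the search gives the third one 0%, much as the fits above pick Paniyas or Kharia and ignore the other Indian populations on offer.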

Ramber said...

@Matt

Thank you. Do these Cambodians have any West Eurasian ancestry, or is it also in very minor amounts like 1-5%? Because don't Paniyas have West Eurasian/West Eurasian-like ancestry?

Sorry for asking this but I hardly know anything about d-stats unlike ADMIXTURE which I understand more.



Seinundzeit said...

Ramber,

For what it's worth, it seems Balochistanis have the least steppe ancestry, and the most Neolithic Iranian ancestry, in South Central Asia (with regard to the Neolithic Iranian angle, it seems that Balochistanis, along with some Bandari people in Iran, probably have the highest amounts of Neolithic Iranian ancestry among contemporary populations).

With East Asians, I think it's probable that all East Asian populations have some West Eurasian ancestry, if ANE is defined as West Eurasian. Even the Mongola have some ANE admixture.

Although, it really depends on the ENA baseline. Compared to the indigenous people of the Andaman islands, all East Asians appear to be ANE-admixed. And, it's actually possible that even the Onge (and their relatives) have low levels of ANE admixture, albeit much less compared to East Asians.

It'll be very interesting to see what East Asian aDNA will reveal.

Anyway, just a side-note, but the West Eurasian concept is (philosophically speaking) very muddled, considering that ANE and WHG are rather distinct, and considering that all contemporary West Eurasian populations possess varying quantities of extremely distinct/divergent "Basal Eurasian" admixture. Not to mention varying levels of African and ENA affinity. But this is a completely different topic.

Ramber said...

@Sein

Thank you again for your clarification. So Balochistanis like Baloch, Brahui and Makrani have the least Steppe in Northern South Asia and South Central Asia, but still have a fairly substantial amount, around 10-15%?

Regarding East Asians, if ANE can be considered "West Eurasian", is Malta Boy a West Eurasian individual then? Are all East Asians ANE-admixed, including Southeast Asians as well? I ask because I was very surprised when I saw these data figures, C and D, which show all East Asians having a West Eurasian-related value below zero in the graph: http://www.nature.com/nature/journal/v536/n7617/fig_tab/nature19310_SF5.html

This is unlike East Africa, where there are still ethnicities like Dinka, Anuak and Sudanese who seem to be completely African and lack Eurasian ancestry. Does this mean East Asians are, overall, more West Eurasian-related than Africans are?

Hmm so how many groups of populations are there in the world? Is it two main groups which is Africans and various non-Africans from West Eurasians to other groups?

Forgive me for asking so many questions, but I am very curious and still a beginner in genetics. :)

Seinundzeit said...

Ramber,

East Asian genetic history is quite uncharted, compared to what we know concerning West Eurasia.

But based on the current evidence, I would say that all East Asian populations have ANE admixture, at low levels.

We will know much more, with aDNA from East Eurasia.

huijbregts said...

@ Alberto

I am glad we found some common ground.
If you liked the heatmaps, you may also be interested in one more detail.
The clades which show up most prominently are western hunter gatherers on the negative end of dimension 1. You may have noticed that the same clades also show up on the positive side of dimension 3. How can this happen?
I did some preprocessing by first dropping the two outliers and next averaging the populations (At this moment I am not really interested in the within population variance).
However, some populations have only 1 item, while others have 10. So the populations are unbalanced and the dimensions of the datasheet will not be the eigenvectors of the averaged populations.
To correct this, I have done a secondary PCA and made a heatmap of the scores. I have added the heatmap 'scores.pdf' to
https://www.dropbox.com/sh/m3lwiv9ynqejyx8/AABwQJCZOxVfv69EmM3ifoLQa?dl=0

Again we have 4 dimensions. Now the hunter gatherers are on the positive side of dimension 1. But this time they do not show up on another dimension.
The traces of the dimensions 6:9 are very flat. Surprisingly, there is one positive blip on dimension 5. Maybe this local detail can be identified on the dendrogram.
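
To make the point about unbalanced populations concrete, here is a small Python sketch in two dimensions. The four populations and their coordinates are invented: they are chosen so that the sample-level axes are centred and uncorrelated, as PCA scores would be. The 2D rotation stands in for a secondary PCA and is not the procedure actually used for scores.pdf.

```python
import math
from statistics import mean

# Hypothetical PCA scores (dim 1, dim 2); population sizes are unequal.
samples = {
    "PopA": [(3.0, 3.0)],
    "PopB": [(-3.0, 3.0), (0.0, -3.0), (0.0, -3.0)],
    "PopD": [(1.0, -1.0), (1.0, -1.0)],
    "PopE": [(1.0, 2.0), (-3.0, 0.0)],
}


def moments(points):
    """Return (var_x, var_y, cov_xy) of the points about their own mean."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    n = len(points)
    var_x = sum((x - mx) ** 2 for x, _ in points) / n
    var_y = sum((y - my) ** 2 for _, y in points) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in points) / n
    return var_x, var_y, cov_xy


# 1. Sample level: the two dimensions are uncorrelated.
all_points = [p for pts in samples.values() for p in pts]
raw = moments(all_points)

# 2. After averaging each population, the same axes become correlated.
pop_means = [
    (mean(x for x, _ in pts), mean(y for _, y in pts)) for pts in samples.values()
]
averaged = moments(pop_means)

# 3. Secondary PCA in 2D: rotate onto the principal axes of the averages.
var_x, var_y, cov_xy = averaged
theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
rotated = [
    (x * math.cos(theta) + y * math.sin(theta),
     -x * math.sin(theta) + y * math.cos(theta))
    for x, y in pop_means
]
secondary = moments(rotated)

print("covariance, raw samples:      ", round(raw[2], 6))
print("covariance, population means: ", round(averaged[2], 6))
print("covariance, after rotation:   ", round(secondary[2], 6))
```

The first and last covariances are zero up to rounding, while the middle one is not. That is the sense in which the datasheet dimensions stop being the eigenvectors once populations of unequal size are averaged.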

Ramber said...

@Sein,

Forgive me if I ask few more questions. Hope you don't mind.

In this Steppe K10, it seems all Middle Eastern groups have Steppe with the exclusion of native Bedouins/Arabians: https://docs.google.com/spreadsheets/d/1Hb0GVyrf2ztR_QvoIYcmhWtsYv0p39avjqM-G3-6Xew/edit?pli=1#gid=1809893991

Please correct me if I am understanding it wrong. There is very little to no Steppe in the Middle East, including Mesopotamia and the Levant, and the majority of the "Steppe" is actually Iran Neolithic/Iran Neolithic-like?

Iranians and Caucasian populations have the most Steppe/Steppe-like admix in this region?

Thanks again.

Seinundzeit said...

Ramber,

No problem.

And yes, Iranians have the most steppe ancestry of all populations in West Asia.

The Caucasus is different. Chechens/Lezgins have substantial EMBA steppe-related admixture.

Ramber said...

@Sein,

Thank you.

Most Levantines and Mesopotamians don't have at all or very little "Steppe"? Most of their "Steppe" is actually Iranian Neolithic/Iran N-like?

Hmm so Iranians have the most like 10-20% Steppe? Also aren't Iranians more genetically related to South Caucasians like Azerbaijanis, Armenians and Georgians than the rest of West Asia? Please correct me if I am wrong.

Do Chechens/Lezgins have the most in Caucasus and like around 20-30%? What about Turkish Turks?

Sorry for shooting a few more questions.

I will kindly appreciate your generous answers.


huijbregts said...

@Alberto:
The positive side of this dimension 5 is:

Continenza 0.026980808
Corded_Ware_Poland 0.021927944
Ofnet 0.015357032
LBKT_EN 0.012744204
Vestonice16 0.010923049

Both Continenza and Corded_Ware_Poland are single item populations;
this dimension 5 seems to be just noise.

Matt said...

OT: Some possibly interesting recent biorxiv stuff:

https://arxiv.org/pdf/1610.07306.pdf - Quite interesting (though I don't fully understand)

Novel probabilistic models of spatial genetic ancestry with applications to stratification correction in genome-wide association studies "In this work, we marry the previously mentioned model-free and model-based approaches to geographic ancestry localization by developing a flexible spatial stochastic process model that subsumes previously developed parametric allele frequency models such as SPA, SCAT and SpaceMix as special cases. Furthermore, we develop a data-driven spatial reconstruction algorithm Geographic Ancestry Positioning (GAP), that exploits the structural properties of our stochastic process while being agnostic to its minutiae. Our localization algorithm is inspired by principles from manifold learning, and can be viewed as a generalization of PCA."

"Figure 1 illustrates the conceptual difference between GAP and PCA. PCA tries to embed individuals into a two dimensional space which preserves the pairwise genetic distance between all pairs of individuals as estimated from their genotype data. On the other hand, GAP takes a more local approach by using the genotype data from only genetically similar pairs of individuals to estimate their spatial distance. This leads to a qualitatively better low-dimensional embedding."


Applied to GLOBETROTTER, Human Origins and Popres datasets. I wonder how this algorithm would cope with ancient data.

http://biorxiv.org/content/early/2016/11/19/088716 - Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men

"We have developed an algorithm to rapidly and accurately identify the Y-chromosome haplogroup of each male in a sample of one to millions. The algorithm, implemented in the yHaplo software package (yHaplo), does not rely on any particular genotyping modality or platform. Full sequences yield the most granular haplogroup classifications, but genotyping arrays can yield reliable calls, provided a reasonable number of phylogenetically informative variants has been assayed."

postneo said...

@karlk

Valuation is based not so much on accumulation of assets as on repeatability, a supply chain. It's the bankability and repeatability of a behaviour, not just the asset alone, that counts. Lottery tickets/buried treasure can only go so far.

There would be no bats if insect swarms were unpredictable.

Unlike Mao's famine-ridden China, today's China can play games in the pricing and supply of neodymium/steel/solar cells. Why? Because they have solved food security, invested in fertilizer plants etc.

Transhumance and pastoralism were an important component that provided food security. The animal dung functioned as crop insurance. Grazing rapidly converted non-food-producing land to crops.

In part this enabled division of labor and more specialized industry.

huijbregts said...

NICE CLUSTER PLOT

https://www.dropbox.com/sh/4njpy307pui1kpq/AACx32eLw9s363IrqhsW4wifa?dl=0

Recently I learned some details about the dataset:
Two items must be considered outliers.
After averaging the populations, there are only 4 relevant dimensions.
Because the numbers per population are unbalanced, the dimensions of the raw dataset are no longer the eigenvectors of the averages. So the averaged data must be orthogonalized by a secondary PCA.

Knowing this, it was easy to find the optimal clustering with mclust, see jpg file.
The only problem was assigning labels to the clusters.
Maybe somebody has better ideas. The clustering can be found in the textfile clusters.txt.

Alberto said...

@Ryukendo

It's difficult to say intuitively what would happen, but I suspect that you'd need to drop the column for the target population (I'm assuming you would have a column for it, as for every other population). Otherwise the diff in that column is going to be so huge that it's going to make all other diffs mostly irrelevant and the algorithm might just choose the population that minimizes the diff to the target one in its column (IOW, if you have Yoruba in the source pops, you might end up with a model that is 100% Yoruba, simply because the diff in that column from Yoruba to 0 is much lower than any Eurasian population).

But I'm certainly not sure of that, you'd have to test it and see how it works.

Ramber said...

@Sein

In case you did not see my last few questions, I am resending them.

Thank you.

Most Levantines and Mesopotamians don't have at all or very little "Steppe"? Most of their "Steppe" is actually Iranian Neolithic/Iran N-like?

Hmm so Iranians have the most like 10-20% Steppe? Also aren't Iranians more genetically related to South Caucasians like Azerbaijanis, Armenians and Georgians than the rest of West Asia? Please correct me if I am wrong.

Do Chechens/Lezgins have the most in Caucasus and like around 20-30%? What about Turkish Turks?

Sorry for shooting a few more questions.

I will kindly appreciate your generous answers.

Emanuela said...

@Ramber said:

Most Levantines and Mesopotamians don't have at all or very little "Steppe"? Most of their "Steppe" is actually Iranian Neolithic/Iran N-like?

Hmm so Iranians have the most like 10-20% Steppe? Also aren't Iranians more genetically related to South Caucasians like Azerbaijanis, Armenians and Georgians than the rest of West Asia? Please correct me if I am wrong.

Do Chechens/Lezgins have the most in Caucasus and like around 20-30%? What about Turkish Turks?

Sorry for shooting a few more questions.

I will kindly appreciate your generous answers.


@Ramber

Neither Levantines nor Mesopotamians have considerable Steppe ancestry. It's often less than 10%, with the exception of Assyrians who have slightly higher.

Eastern Europeans such as Balts, North Slavs, North Caucasians, and some Balkan Slavs have it the most. In Western Europe, northerners have the most Steppe ancestry which is mainly EHG-CHG (not Iran N because Iran N doesn't appear in North Europe where CHG is found). For Iran, such ancestry is actually low, but the failure for some calculators to distinguish between Iran N versus EHG-rich CHG which also has WHG often leads to erroneous results when it comes to calculate such ancestry in Iran and South-Central Asian Iranic populations. These Iranic groups are rich for Iran N and ANE, but not particularly for CHG and ANE-rich EHG which are the main actors of the Steppe and rather shaped the genepool of modern Northwest and East/Northeast Europeans, not that of West, Central, or South Asians.

Regarding the other question, Iran is a peculiar case, not really related to the South Caucasus proper. Georgians have ~25% Steppe ancestry which is still less than their Northwestern and Northeastern Caucasian neighbours who all have ~30%. Armenians, who are an intermediate between South Caucasians (Georgians) and West Asians, have it ~17%. For West Asians proper (Iranians), Steppe ancestry is more or less 13% which is however higher than what Southwestern Europeans have. Finally, Turkish people have it more than Iranians do.

Seinundzeit said...

Ramber,

Sorry about that, was occupied by stuff IRL.

With the setup I had, Levantines were pretty much 0% for steppe ancestry.

Iranians were (at most) around 10%, and the Lur people turned out to be only 3%.

In the Caucasus, Georgians turned out to be around 10%-15% steppe-admixed, while Chechens turned out to be around 35%, so there is a huge gap between northern Caucasians and Georgians.

But yeah, EMBA steppe ancestry is obviously way higher in South Central Asia than it is in West Asia or most of the Caucasus (Chechens are comparable to Pashtuns/Kalash, but no one approaches the levels seen with Pamiri people).

Ramber said...

@Sein

Thank you for your kind replies. :)

It seems that my understanding that Arabians, Levantines and Mesopotamians like Assyrians, Mandaeans, Iraqi Jews completely lack Steppe ancestry and that all of their "Steppe" is basically Iranian Neolithic+some CHG is correct? Do Cypriots also lack Steppe ancestry?

Wow!! I am pretty surprised and shocked to know that Iranians are at most around 10% Steppe-admixed with Lurs having very little. I would expect them to have slightly more like around 10-20%. It seems Iranians have very little Indo-European ancestry despite being Indo-Europeanized linguistically and probably culturally.

According to you, it looks like Georgians are slightly more Steppe-admixed than Iranians at 10-15%, but a lot less than Chechens. I am very surprised that Chechens are a lot more admixed than Georgians despite both being close geographically to one another. Hmm, maybe it's because Chechens live in closer proximity to the Steppes and don't dwell in mountainous terrain that blocks most of the Steppe gene flow, as it does for Georgians? Furthermore, do you know if Azerbaijanis, Armenians and Turkish Turks have similar levels of Steppe admixture to Iranians and Georgians?

Awesome to know that South Central Asia has one of the highest levels of Steppe ancestry in the world, way higher than in Southwest/West Asia or in most of the Caucasus!! It is surprising that even Balochistanis like Baloch, Brahui and Makrani are more Steppe-admixed than the majority of Southwest/West Asia, and that they have Steppe at similar levels to Iranians and Georgians.

I am very surprised to find out that Chechens have Steppe at levels comparable to Pashtun and Kalash!! I thought Pashtun were more similar to Pamiris in their levels of Steppe ancestry. Do Tajiks and Pamiris have higher Steppe than Northern and East Europeans, or lower levels than them? Do Southern Europeans have lower levels of Steppe than Chechens but higher than most of the Caucasus and Southwest/West Asia?

Finally, do you know if Burushos and NW South Asians like Sindhis, Punjabis, Gujaratis, Brahmins, Jatts have the same level of Steppe ancestry as Pashtun/Kalash or lower than them? Please correct me if I am wrong, from what I know, NW South Asians have way way higher Steppe ancestry than in West Asia and most of the Caucasus.

Thank you very much again for your generous replies and patience in answering my inquiries.

Best regards,

Ramber

Seinundzeit said...

Ramber,

I guess one has to remember that all these populations are quite genetically interconnected (the people of Southwest Asia, the Caucasus, West Asia, South Central Asia, and South Asia), in very complex/multiple ways.

And the percentages will always differ between methods.

We need more aDNA, and even then, it's a matter of how the methods/stats work, not to mention deeper epistemological issues.

Honestly, we will never be able to quantify the "exact" amount of ancestry any ancient stream of populations might have contributed to modern people.

It is an impossible task. In broadly theoretical terms, it is an enterprise that, on a very deep level, involves failure.

Regardless, this sort of conversation involves certain ontological and epistemological issues, issues that are beyond the scope of what we can discuss here.

Anyway, objectively speaking, the patterns are what we should really look at. These are very consistent between methods.

Also, the patterns are quite "tight", these populations always model in ways that suggest that they are (for the most part) interconnected.

For example, if Pashtuns are 50% EMBA steppe, then our sampled Iranians are 25% EMBA steppe. If Pashtuns are 35% EMBA steppe, then our sampled Iranians are 5%-10% EMBA steppe.

Or, if Pashtuns are 20% ASI, Iranians are 10% ASI. If Pashtuns are 10% ASI, Iranians are 0%-3% ASI.

Basically, Caucasus populations, West Asians, and South Central Asians vary together. Many threads connect these populations.

If that makes sense.

Chad said...

What do people get using Sintashta as the incoming population? Using Yamnaya makes no sense as Yamnaya never went to SC Asia. You're picking up local ancestry that mimics Steppe ancestry. The population that brought IE to SC Asia was basically identical to modern Northeastern Europeans, not Yamnaya. If we're going to guess the amount, we might as well use appropriate populations.

I'd recommend using Sintashta, AfontovaGora/MA1, Onge/Jarawa, Iran_EN, Levant_EN, and Nganasan.

Davidski said...

The currently available Sintashta samples have too much EEF/MN ancestry to be relevant to South Asia. The Andronovo samples are much less EEF/MN than them, and even they're not quite right, apart maybe from outlier RISE512.

So we need more samples from the Sintashta-Andronovo horizon to catch the right samples, or Indo-Aryan languages arrived in South Asia from an earlier archaeological culture, like Catacomb or Khvalynsk.

Either way, based on current sampling, the steppe population that arrived in India was much more Yamnaya-like than Sintashta-like.

Karl_K said...

@RK

"Just wanted to point out that the timing of the bottleneck does not coincide with the introduction of Bronze in East Asia, the Middle East, or South Asia. In East Asia and the Middle East, the bottleneck peaks 1-1.5 millenia before the introduction of metals, and Y chromosomal diversity is on the way to recovery with the introduction of bronze."

I wasn't talking about a bottleneck. I was talking about the rapid expansion of particular Y lineages. When and where did R1a-Z93 originate? Why are there now so many descendants? What mtDNA haplogroup shows a similar pattern of growth and spread?

As for money, I was only talking about commodity money. I am well aware of the history of money.

But in each situation where a different form of money arose, gold, silver, copper, bronze (or other metals) would still have been preferred. They just were not available in enough abundance (or at all) for daily transactions. However, wealth could still be accumulated in a small number of metal objects. This is why people were buried with rings or small daggers or ornaments.

As soon as people started finding metals, there has never been a lasting substitute. Even today, people stockpile commodity gold bought with their bitcoin/USD/GBP/EUR or whatever.

Seinundzeit said...

Yeah, I included Sintashta in my first models, but all South Asians had 0%, they kept picking Yamnaya.

Iran_Hotu was also included. So any local Neolithic Iran-related ancestry was well covered.

And I have included MA1/AG3 before.

In all cases, South Asians got 0% Sintashta, 0% MA1, 0% AG3, and substantial Yamnaya percentages. Yamnaya is clearly preferred.

I'm sure we'll find out why, with more aDNA sampling of the steppe.



Karl_K said...

@Davidski

"or Indo-Aryan languages arrived in South Asia from an earlier archaeological culture, like Catacomb or Khvalynsk."

Are you suggesting that R1a-Z93, didn't live somewhere within the Corded Ware horizon, even as far west as Germany or Poland?

Karl_K said...

@Davidski

"So we need more samples from the Sintashta-Andronovo horizon to catch the right samples"

This seems like it will provide the answers. It is clear that the male lines extensively admixed into the local populations in each migration.

Men similar to those that brought Z93 to the Sintashta-Andronovo cultures must have also admixed into multiple groups that had never left the steppe.

The migration-prone Y haplogroups dominated, while the mtDNA and autosome were dominated by the never-left-the-steppe people.

The men from these cultures would also have a culture of this behavior, and that would explain how this haplogroup became so dominant in autosomally very different peoples.

Ramber said...


@Sein

"I guess one has to remember that all these populations are quite genetically interconnected (the people of Southwest Asia, the Caucasus, West Asia, South Central Asia, and South Asia), in very complex/multiple ways."

I agree that all these populations from these five regions are substantially genetically interconnected to one another.

"And the percentages will always differ between methods.

We need more aDNA, and even then, it's a matter of how the methods/stats work, not to mention deeper epistemological issues."

How do we know which percentage and method is the most accurate?
You are right, we do need more aDNA to clarify the overall picture. Can you elaborate on the epistemological issues regarding this?

"Honestly, we will never be able to quantify the "exact" amount of ancestry any ancient stream of populations might have contributed to modern people.

It is an impossible task. In broadly theoretical terms, it is an enterprise that, on a very deep level, involves failure."

If it is impossible to quantify the exact amount of ancestral contribution from various ancient populations to modern people, how would we figure out what is the most "accurate" and "precise" amounts or this is impossible?

"Regardless, this sort of conversation involves certain ontological and epistemological issues, issues that are beyond the scope of what we can discuss here.

Anyway, objectively speaking, the patterns are what we should really look at. These are very consistent between methods.

Also, the patterns are quite "tight", these populations always model in ways that suggest that they are (for the most part) interconnected."

So the best we can do right now is wait for more aDNA from these regions to clear things up in order to have a better understanding? Hmm if that's the case, the way we will know, as you mentioned, is to observe the pattern between various methods, which you also stated, is very accurate. Does "tight" mean these populations are very interconnected to one another?

"For example, if Pashtuns are 50% EMBA steppe, then our sampled Iranians are 25% EMBA steppe. If Pashtuns are 35% EMBA steppe, then our sampled Iranians are 5%-10% EMBA steppe.

Or, if Pashtuns are 20% ASI, Iranians are 10% ASI. If Pashtuns are 10% ASI, Iranians are 0%-3% ASI."

In this case, how do we know which percentages are the most accurate and reliable for the Pashtuns and Iranians' EMBA Steppe and ASI ancestry?

"Basically, Caucasus populations, West Asians, and South Central Asians vary together. Many threads connects these populations."

Yes, they are indeed very interconnected. Yes, it makes sense to me.

Thank you again and regards,

Karl_K said...

@Ramber

'If it is impossible to quantify the exact amount of ancestral contribution from various ancient populations to modern people, how would we figure out what is the most "accurate" and "precise" amounts or this is impossible?'

It is even impossible to determine the exact amount of parental ancestry shared between siblings.

Most of the genome is identical between any two modern humans. Only where sites differ can you say from which side it came.

If both ancestry sides contain both alleles at any frequency, then it is all statistics. And the statistics are dominated by the samples we have.

It is actually amazing that it works as well as it does. But remember, it is all biased statistics in the end.

More samples = less bias = higher accuracy.
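
One piece of this, the sampling noise in an allele frequency estimated from a handful of individuals, can be sketched in Python with the standard binomial standard error. The sketch assumes 2n independent allele draws from n diploid individuals, which is optimistic for ancient samples with missing data. The function name is hypothetical.

```python
import math


def allele_freq_se(p, n_individuals):
    """Binomial standard error of an allele frequency from n diploid individuals."""
    return math.sqrt(p * (1 - p) / (2 * n_individuals))


# True frequency 0.5, estimated from 1, 10 or 100 sampled individuals.
for n in (1, 10, 100):
    print(n, round(allele_freq_se(0.5, n), 3))
```

The error shrinks only with the square root of the sample size, so populations represented by a single individual are by far the noisiest points in any analysis.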

Ramber said...

@Karl K

"It is even impossible to determine the exact amount of parental ancestry shared between siblings.

Most of the genome is identical between any two modern humans. Only where sites differ can you say from which side it came.

If both ancestry sides contain both alleles at any frequency, then it is all statistics. And the statistics are dominated by the samples we have.

It is actually amazing that it works as well as it does. But remember, it is all biased statistics in the end.

More samples = less bias = higher accuracy."

Thank you for your detailed explanation regarding this rather complex issue. Does "site" mean the population source? Can we say that statistics is the only thing for now that can give an idea of the "exact" amount of shared ancestry? What I mean is: is it thanks to statistics that we now have so much information in population genetics, from ancient movements to derived ancestry?

Chad said...

I don't see that happening. Outliers are just that. The great majority have a good amount of MN ancestry. Sampling of SC Asia and C Asia is what we need. Not more steppe stuff. That's where you'll get your answer.

Sorry, but when pops are only 30% R1a and much less "steppe" mtDNA, they won't be 50% derived from the early I-I, I-A crowd. Pop densities favor more of a Saxon type story rather than some big upheaval like with BB or CW. I think that's what we'll see with BMAC and others.

Davidski said...

R1a reaches 60-70% among many groups from Afghanistan and North India.

On the other hand, something like 25% of the Kalash carry mtDNA U4. So the uniparental signals from the steppe are strong.

Jijnasu said...

Catacomb and the Indo-Aryans ??
http://www.samorini.it/doc1/alt_aut/ek/klejn.htm

Seinundzeit said...

@Chad

What pop densities are you referring to?

And outliers don't appear out of thin air. They are always indicative of actual patterns of gene-flow between (and within) populations, providing very strong hints of what we have yet to sample.

Regardless, we already have samples from the steppe that are mostly ANE in terms of genetic ancestry (the Srubnaya outliers).

In addition, we have an Andronovo sample that closely resembles Yamnaya.

Also, in general, Andronovo and Srubnaya samples are already more Yamnaya-shifted than the Sintashta samples we currently have. If Sintashta is around 50% Yamnaya-related, Andronovo/Srubnaya are more like 60%-70% Yamnaya-related.

And that Iron Age Scythian was obviously derived from the same broad cluster of "steppe" populations. His roots lie in the very same streams of ancestries from which the Bronze Age steppe populations emerged. Yet, he is very far from being identical to Sintashta.

At the end of the day, these were very heterogeneous populations, and it is very probable that South Central Asians derive their steppe ancestry from populations extremely similar to Yamnaya.

What seems much more fanciful, and requires stretching the imagination a fair bit, is assuming that there were Yamnaya-like people in South Central Asia since the Mesolithic or something.

Yamnaya and company were highly specific/distinctive mixtures between EHG-related, CHG-related, and Iran/Anatolia Chalcolithic-related ancestries. You can't mimic that sort of mix by having an ANE-shifted version of the Iranian Neolithic samples.

I mean, Iran_Hotu cannot be confused for Yamnaya.

Also, ditto what David said, populations in Afghanistan, Pakistan, and northern India are usually 50% R1a, at the least. And some reach 80% (Ghilzai come to mind).

Shaikorth said...

How about Sintashta, Yamnaya, Onge/Jarawa, Iran_EN for northwestern South Asia? There's been this idea floating around that some groups like Jatts and Tajiks have more recent (and more Euro-neolithic) steppe input than Kalash and Brahmins. For any extras over those, Daur or Buryat should do.

Then an ANE-preference test: Euro_MN Euro_EN Iran_Hotu CHG Iran_N Jarawa Dai Karelia_HG Srubnaya_Outlier AG3/MA1

Seinundzeit said...

For what it's worth, I tried that sort of model.

And I support this idea, it's something I've argued for since the beginning.

For some reason, Sintashta keeps coming up as 0% for the Tajiks, but the Tajiks do get noticeable Barcin_Neolithic percentages, along with 45%-50% Yamnaya.

Assuming that the Barcin_Neolithic percentages are reflective of their steppe ancestry, that would put Pamiri Tajiks at around 55%-60% for Sintashta/Andronovo-related ancestry, which makes sense.

Some Pashtun samples also have these Barcin_Neolithic percentages. If we use the same assumption for them that we've used for the Pamiri, that puts certain Pashtun tribal groups at around 45% for Sintashta/Andronovo-related ancestry.

But the Kalash are almost allergic to Anatolian/EEF ancestry, in all analyses. They turn out straight 40% Yamnaya, no Barcin_Neolithic percentages, and the same lack of Anatolian affinity applies to Brahmins.

So, I think this idea has great merit. Eastern Iranians (and probably Jatts) have strong links with Sintashta/Andronovo-like populations and probably yet unsampled Scythian groups. While the Dardic peoples and Brahmins have a clear shift towards Yamnaya/Afanasievo.

But I haven't tried this ANE-preference test. I'll get on that.

Jaydeep said...

Dear Sein,

You are mistaken in your analysis. The Central Asians have been heavily influenced since the advent of the historical era, from the Iranians (who by then must already have Anatolian Neolithic admixture) and to a lesser extent the Arabs. The Pashtuns have also been influenced by the Iranians. This is the reason why the Tajiks & Pashtuns show Barcin Neolithic while Kalash do not. This has nothing to do with any steppe admixture.

Seinundzeit said...

Shaikorth,

Your idea for that ANE-preference test was wonderful, very brilliant. The patterns here are of great interest. I used all the populations you requested, but added the Daur as an extra control.

Steppe populations:

Yamnaya

60.75% Karelia_HG
37.40% Kotias
1.85% Iberia_MN
Distance=0.7006

Andronovo

48.05% Karelia_HG
26.15% Kotias
25.80% Iberia_MN
Distance=0.5831

Sintashta

42.5% Karelia_HG
35.3% Iberia_MN
22.2% Kotias
Distance=0.5132

Scythian_IA

38.90% Srubnaya_outlier
26.40% LBK_EN
17.55% Karelia_HG
10.85% Iran_Hotu
6.10% Daur
Distance=0.1019

South Central Asian populations:

Kalash

45.95% Iran_Hotu
23.95% Srubnaya_outlier
19.75% Kotias
7.35% Jarawa
3.00% Dai
Distance=0.7534

Tajik_Rushan

34.50% Srubnaya_outlier
25.25% Kotias
19.30% LBK_EN
6.70% Iran_Neolithic
5.90% Iran_Hotu
4.55% Daur
3.45% Jarawa
0.35% Dai
Distance=0.1645

Tajik_Shugnan

36.10% Srubnaya_outlier
36.05% Kotias
12.40% LBK_EN
6.35% Iran_Hotu
5.05% Jarawa
2.95% Dai
0.85% Iran_Neolithic
0.25% Daur
Distance=0.1894

South Asian populations:

Paniya

57.85% Jarawa
33.35% Iran_Hotu
8.80% Iran_Neolithic
Distance=3.4505

Chamar

64.65% Iran_Hotu
32.70% Jarawa
2.65% Afontova Gora3
Distance=3.6875

Fascinating stuff.

Yamnaya is basically a simple mix of EHG and CHG, in this analysis. Andronovo/Sintashta are like Yamnaya, with the addition of substantial Middle Neolithic European admixture. The Iron Age Scythian is rather different. He has a striking affinity with the Srubnaya outliers, and prefers Early Neolithic populations.

Furthermore, genetically isolated South Asians display no EHG, Srubnaya outlier, CHG, or EN/MN European percentages.

So, EHG, Srubnaya outlier, CHG, and EN/MN Europeans percentages in South Central Asia will reflect steppe ancestry. This is exceedingly obvious.

With that in mind, the Kalash appear to be 45% steppe in terms of genetic ancestry, with their steppe half involving a Yamnaya-like population (as a reminder, the Srubnaya outliers are just EHG samples, but with more ANE affinity, and very minor levels of the Near Eastern admixture seen in Yamnaya).

The Pamiri, by contrast, almost display continuity with the Scythian sample! This is very striking, because many have hypothesized that the Pamiri Tajiks are descended from Scythians. As per this analysis, Pamiri Tajiks appear to be of predominately steppe ancestry, related to people like the Iron Age Scythian.

Jatts would be so cool to try, in this analysis.

Regardless, this adds further evidence for very substantial steppe admixture in South Central Asia.
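As a rough check on the arithmetic, the sketch below sums the components that this comment treats as steppe-related (Karelia_HG as EHG, Srubnaya_outlier, Kotias as CHG, and the EN/MN European references LBK_EN and Iberia_MN) for three of the models above. The percentages are copied from the output in this comment; the `STEPPE_PROXIES` grouping and the function name are illustrative assumptions, not part of any modelling tool.

```python
# Percentages copied from the model output in this comment.
models = {
    "Kalash": {
        "Iran_Hotu": 45.95, "Srubnaya_outlier": 23.95, "Kotias": 19.75,
        "Jarawa": 7.35, "Dai": 3.00,
    },
    "Tajik_Rushan": {
        "Srubnaya_outlier": 34.50, "Kotias": 25.25, "LBK_EN": 19.30,
        "Iran_Neolithic": 6.70, "Iran_Hotu": 5.90, "Daur": 4.55,
        "Jarawa": 3.45, "Dai": 0.35,
    },
    "Tajik_Shugnan": {
        "Srubnaya_outlier": 36.10, "Kotias": 36.05, "LBK_EN": 12.40,
        "Iran_Hotu": 6.35, "Jarawa": 5.05, "Dai": 2.95,
        "Iran_Neolithic": 0.85, "Daur": 0.25,
    },
}

# Assumption: these references are read as steppe-related, as argued above.
STEPPE_PROXIES = {"Karelia_HG", "Srubnaya_outlier", "Kotias", "LBK_EN", "Iberia_MN"}

def steppe_total(model):
    """Sum the percentages assigned to the steppe-related references."""
    return round(sum(pct for ref, pct in model.items() if ref in STEPPE_PROXIES), 2)

for name, model in models.items():
    print(f"{name}: {steppe_total(model):.2f}% steppe-related")
```

For the Kalash this comes to 43.70%, close to the roughly 45% quoted above; for the two Pamiri Tajik groups it comes to 79.05% and 84.55%.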

Nirjhar007 said...

Jaydeep,

I think you are right in suggesting that samples from ~4000 BC India are the next step forward. It's going to be awesome.

Anyone who is at Harvard now and connected to the paleogenomics department might be able to give an update on the Swat ones...

Nirjhar007 said...

Frank

Can you give me your contact email? :) I would love to discuss a few things now.

Matt said...

@ Sein, btw, where did you get the Jarawa values for your models? I couldn't find them in the datasheet at http://eurogenes.blogspot.co.uk/2016/10/a-fresh-look-at-global-genetic-diversity.html...

Also, the following might also be interesting to some:

I combined the values from the West Eurasian PCA from this post with the World PCA values (used mainly populations with high levels of West Eurasian type ancestry and a couple Siberians and Austroasiatic from India):

Example image: http://i.imgur.com/2wPWBHH.png, .csv file: https://www.scribd.com/document/331696902/Imputation-of-SCA-and-West-Eurasia-2

Now there are obviously a lot of gaps between the two sets, but PAST3 has a function that allows those gaps to be imputed in, using the known crossups between the two sets: http://i.imgur.com/7spupkK.png

So once this is done, this outputs a secondary PCA based on both: https://www.scribd.com/document/331697175/World-and-West-Secondary-PCA
(image: http://i.imgur.com/JcWglPU.png)

Taking the dimensions, the neighbour joining is then: http://i.imgur.com/cg0p7PM.png

Seems like the imputation process works fairly well for these, and combines some of the extra fine detail of the West Eurasia PCA with some of the global scope of the World PCA. I don't know how well it would work with populations from further afield (e.g. Africans). (I also removed a lot of outliers here, particularly Paleolithic and Mesolithic samples.)
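The general idea of filling gaps from the samples that appear in both sets can be sketched with a deliberately simplified stand-in: fit a straight line on the overlapping samples for one pair of dimensions, then predict the missing values from it. This is not PAST3's imputation routine, and the coordinates and function names below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def impute_missing(world_pc, west_pc):
    """Fill None entries in west_pc using a line fitted on the samples
    that have values in both lists."""
    overlap = [(x, y) for x, y in zip(world_pc, west_pc) if y is not None]
    slope, intercept = fit_line([x for x, _ in overlap], [y for _, y in overlap])
    return [y if y is not None else intercept + slope * x
            for x, y in zip(world_pc, west_pc)]

# Made-up coordinates: the third sample is missing from the West Eurasian set.
world = [0.0, 1.0, 2.0, 3.0]
west = [1.0, 3.0, None, 7.0]
print(impute_missing(world, west))
```

A real imputation would use all dimensions jointly, so this only conveys why a decent number of shared samples is needed before the two PCAs can be stitched together.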

Test of Afghanistan Pashtun with nMonte:

Pashtun_Afghanistan: Chamar 27.15, Steppe EMBA 29.6(Afanasievo 23, Yamnaya_Kalmykia 6.6) Satsurblia 21.45, Levant_Neolithic 13, Iran_Neolithic 8.8 - distance% = 0.6461 %

(No Sintashta, Srubnaya or Iran Hotu)

w/out Satsurblia / Kotias:
Pashtun_Afghanistan: Yamnaya_Kalmykia 37.1, Chamar 25.6, Iran_Neolithic 22.7, Levant_Neolithic 13.1, Iran_Hotu 1.5 distance% = 1.624 %

Still looks about right.
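The models in this thread express a target as a weighted mix of reference populations, with a distance reported between the mix and the target. A minimal sketch of that idea, assuming a plain Euclidean distance in PCA space and a simple greedy search that shifts small amounts of weight between references. This is not nMonte's actual algorithm, and the coordinates are made up for illustration.

```python
import math

def mix(weights, sources):
    """Weighted average of the source coordinates."""
    dims = len(sources[0])
    return [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dims)]

def distance(a, b):
    """Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_mixture(target, sources, step=0.01, max_rounds=10000):
    """Greedy search: move `step` of weight from one source to another
    whenever that brings the mix closer to the target."""
    n = len(sources)
    weights = [1.0 / n] * n
    best = distance(mix(weights, sources), target)
    for _ in range(max_rounds):
        improved = False
        for i in range(n):
            for j in range(n):
                if i == j or weights[i] < step:
                    continue  # nothing left to take from source i
                trial = weights[:]
                trial[i] -= step
                trial[j] += step
                d = distance(mix(trial, sources), target)
                if d < best - 1e-12:
                    weights, best, improved = trial, d, True
        if not improved:
            break
    return weights, best

# Hypothetical 2-D coordinates for three references and one target.
sources = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target = (0.5, 0.3)  # equals 20% / 50% / 30% of the three references
weights, dist = fit_mixture(target, sources)
print([round(w, 2) for w in weights], round(dist, 4))
```

The weights always sum to 1 and never go negative, which is why adding or dropping a reference reshuffles the other percentages, as in the two Pashtun runs above.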

Jaydeep said...

Nirjhar,

Any idea when the Rakhigarhi paper might be out ? Any kind of update ? And what about Swat ? Do we have DNA from Iron Age samples there ?

Karl_K said...

@Ramber

'Does "site" mean the population source?'

No. By 'sites' I meant SNPs. Differences in nucleotide polymorphism frequencies.

huijbregts said...

This thread invites methodological experiments. I too have a new(?) idea.
Our workflow usually starts with a datasheet. Next we spend a lot of energy on nMonte to estimate the admixtures which are hidden in the target sample.
One admixture might be 15% Corded_Ware_Germany. But if I also specify Yamnaya_Kalmykia, I get a different percentage.
This is not what I actually want.
I want the total percentage of the Steppe cluster, and of all the other relevant clusters.
The algorithm which can do this for us is called 'fuzzy clustering'. In R it is implemented in the package 'fclust'.

In my previous post I have demonstrated that the sophisticated cluster software mclust finds a nice model with 8 clusters.
https://www.dropbox.com/sh/m3lwiv9ynqejyx8/AABwQJCZOxVfv69EmM3ifoLQa?dl=0
I have tested whether fclust produces a comparable cluster structure, and it does.
The first lines of the fuzzy output are:
Population  Clus1  Clus2  Clus3  Clus4  Clus5  Clus6  Clus7  Clus8
Abkhasian    0.05   0.07   0.12   0.05   0.05   0.04   0.01   0.62
Afanasievo   0.83   0.03   0.01   0.02   0.07   0.01   0.01   0.02
Albanian     0.05   0.30   0.20   0.03   0.09   0.16   0.02   0.14
For all the population averages the 100% admixture is partitioned among the 8 clusters.
You can find the complete list in the file fuzzy8.csv. Paste it directly in a spreadsheet.
The 8 clusters are calculated by the software. Test whether you find them helpful.

Whoever wants to reproduce these results: look at the preprocessing in my previous post.
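For anyone curious how such membership rows arise, here is a minimal sketch of the standard fuzzy c-means membership formula for one point and fixed cluster centres. It is not the fclust code itself; the centres, the fuzziness parameter `m = 2` and the function name are assumptions for illustration.

```python
import math

def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means memberships of one point, given fixed centroids.
    Each membership lies in [0, 1] and the memberships sum to 1."""
    dists = [math.dist(point, c) for c in centroids]
    if any(d == 0.0 for d in dists):
        # The point sits exactly on one or more centroids: share it among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]

# Hypothetical centres at distance 1 and 2 from the point.
print(fuzzy_memberships((0.0, 0.0), [(1.0, 0.0), (2.0, 0.0)]))
```

The closer a population average lies to a cluster centre, the larger its share in that cluster, which is how a row such as the Afanasievo one ends up dominated by a single cluster while the Albanian row is spread over several.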

Matt said...

@ huijbregts, interesting, fuzzy clustering sounds like an output that is rather more comparable with ADMIXTURE clustering. I was wondering if that was possible, and also with the D-stat data. I couldn't find the file fuzzy8.csv in your link. Does fclust also feed you back the centroid values for the fuzzy clusters, so that you can visualise them in the original PCA?

huijbregts said...

@Matt
Sorry, I gave you the wrong link
https://www.dropbox.com/sh/4njpy307pui1kpq/AACx32eLw9s363IrqhsW4wifa?dl=0
Remember that these clusters are never exactly the ones you wanted. The ones from mclust seem slightly better, but I don't believe that mclust can do fuzzy clustering.
I don't know how it will work with D-stats. We should first find the number of clusters with mclust. Do you have a link to a representative set of D-stats? I am not familiar with fclust. But mclust gives the means. Do you want them?

Matt said...

@huijbregts

This is the most representative set of D-stats which I could find that I had on my hd - https://www.scribd.com/document/331728969/dstatsset. Others may have a better one if they'd like to chime in and mention it.

If you do have the chance at any time in the future, I'd also be interested to see how fuzzy clustering works on the population averages for:

Davidski's latest World PCA (which is what Sein has been using for the models upthread) - https://www.scribd.com/document/331729069/world-PCA

This PCA - https://www.scribd.com/document/331728998/Europe-PCA - which Davidski ran before which is based on European samples with ancients added to it.

once the optimal number of clusters is determined. If no time, np for any of these.

Thanks.

If it is possible to upload the means for cluster positions, that would be interesting to put back in with the PCA data to visualise.

huijbregts said...

@Matt
Clustering table:
1 2 3 4 5 6 7 8
32 19 34 24 37 23 19 14

Mixing probabilities:
1 2 3 4 5 6 7 8
0.14700973 0.10199840 0.16131499 0.12212806 0.18843400 0.11429843 0.09550952 0.06930687

Means:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
PC1 -0.046067006 0.024825670 0.003575595 -0.04223664 0.027257320 -0.014790328 -0.02282073 0.109015202
PC2 0.015974037 0.041759169 -0.016122594 -0.02303741 0.004995472 -0.060509981 0.06372697 -0.018829749
PC3 0.012843139 0.007589334 0.009798025 -0.01373437 0.014753661 -0.008451524 -0.03098099 -0.020495781
PC4 0.002699873 -0.004505945 0.003847190 -0.01972857 -0.004097709 0.015038853 0.01039278 -0.001268062

These are the parameters of the estimated mclust model.
Be aware of my data processing:
removing 2 outliers
averaging populations
secondary PCA because after averaging the data are no longer orthogonal
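One point that may help when reading these parameters: the clustering table counts each population average once, in its most probable cluster, whereas the mixing probabilities are the fitted component weights, so the two are similar but need not match. A quick check using only the numbers above (the variable names are illustrative):

```python
# Hard cluster sizes and mixing probabilities, copied from the output above.
cluster_sizes = [32, 19, 34, 24, 37, 23, 19, 14]
mixing_probs = [0.14700973, 0.10199840, 0.16131499, 0.12212806,
                0.18843400, 0.11429843, 0.09550952, 0.06930687]

total = sum(cluster_sizes)  # number of population averages that were clustered
hard_shares = [n / total for n in cluster_sizes]

for k, (share, prob) in enumerate(zip(hard_shares, mixing_probs), start=1):
    print(f"cluster {k}: hard share {share:.3f}, mixing probability {prob:.3f}")
```

The hard counts add up to 202 population averages, and the mixing probabilities add up to 1.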

I will work on your other requests, but I will take my time.

Samuel Andrews said...

@David,
"R1a reaches 60-70% among many groups from Afghanistan and North India.

On the other hand, something like 25% of the Kalash carry mtDNA U4. So the uniparental signals from the steppe are strong."

But few in India have more than half that much R1a, and probably under 10% Steppe mtDNA. So I can't imagine them being something even like 30% Steppe. Also, I've noticed ASI levels have gone way down, from almost 50% to something like 20% in India, and their Iran_Neo levels are over 50%. This is difficult to believe because well over 50% of their mtDNA is M.

All I'm saying is there may be inconsistencies with uniparental markers.

Davidski said...

Well, the methods I use don't show more than a few per cent of steppe admix for Indian groups that aren't Indo-European speaking and/or high caste.

My methods only show really high (>30%) levels of steppe admix for Indo-European/high-caste South Asian groups that are known to have high levels of Y-hg R1 and/or other steppe-specific markers.

Davidski said...

Btw, I think what Nirjhar is hinting at is that Broad MIT/Harvard have tested Iron Age skeletons from the Swat Valley. Probably these...

http://phys.org/news/2012-11-ancient-tombs-pakistan-swat.html

If so, that'd be awesome, because these are likely to be the recent descendants of some of the first and thus least admixed migrants from the steppes to South Asia.

Expect R1a-Z93 and loads of Yamnaya-like ancestry.

Unknown said...

First, ADMIXTURE is not an appropriate tool to do ancient-modern comparisons, because of it being driven by recent drift.

@ Ramber / Sein

Formal methods, such as D-stats, should be used. With regard to the most steppe-admixed W Asian pop, Kurds would have to be among the top scorers. Using an Iraqi Kurd sample [Kurd C3], I found it to be significantly more steppe-shifted than the Turkish, Iranian, Georgian and Armenian samples, no matter what steppe pop I used [Eneolithic Khvalynsk, Steppe EMBA, Andronovo, Scythian, etc.], although the sample that stands out across the board in all comparisons is the Iron Age Western Scythian sample, Scythian IA. Kurd C3 seems to be significantly more Scythian IA shifted in ALL the comparisons. This is not surprising, since Scythians and Mitanni ruled over Kurdistan. In fact, Scythians would have enough steppe ancestry to elevate steppe admixture in Kurds to levels higher than their current neighbors, and sway D (Georgian/Armenian, Kurd, Steppe, Outgroup) to negative territory (Kurd-steppe shift) in spite of elevated SW Asian admixture in Kurds, as compared to Georgians and SC Asians.

An additional observation here and with the previous comparisons is that Kurd C3 shares considerably more drift with Satsurblia (labelled Satsurbila in the tables) than with Kotias. In other words, the excess of shared drift between Satsurblia and Kurd C3, relative to Satsurblia and the Turkish, Iranian, Armenian or Georgian samples, is considerably larger than the corresponding excess for Kotias.

Negative D indicates that Kurd C3 shares more total genetic drift with the steppe sample in the TARGET column than the POP 1 population shares with that sample. Tables are sorted with the samples Kurd C3 shares the most drift with on top.
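For readers unfamiliar with the statistic, here is a minimal sketch of how a D value of the form D(Pop1, Pop2; Target, Outgroup) can be computed from per-SNP allele frequencies, following the usual ADMIXTOOLS-style definition. The frequencies are invented for illustration. Z scores like those in the tables are typically obtained from a block jackknife over the genome, which is not shown here.

```python
def d_statistic(p1, p2, p3, p4):
    """D(Pop1, Pop2; Target, Outgroup) from per-SNP allele frequencies.
    Negative values mean Pop2 shares more alleles with Target than Pop1 does."""
    numerator = sum((a - b) * (c - d) for a, b, c, d in zip(p1, p2, p3, p4))
    denominator = sum((a + b - 2 * a * b) * (c + d - 2 * c * d)
                      for a, b, c, d in zip(p1, p2, p3, p4))
    return numerator / denominator

# Hypothetical frequencies at two SNPs (not real data).
pop1 = [0.1, 0.2]
pop2 = [0.5, 0.6]       # closer to the target than pop1
target = [0.9, 0.8]
outgroup = [0.0, 0.0]
print(round(d_statistic(pop1, pop2, target, outgroup), 3))
```

Swapping Pop1 and Pop2 flips the sign of D, so only the order of the first two columns determines which direction counts as negative.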

POP 1 POP 2 TARGET OUTGROUP D Z SNPs
Georgians .Kurd_C3 Andronovo2 Chimp -0.0126 -1.738 64920
Georgians .Kurd_C3 Yamnaya_S7 Chimp -0.012 -1.555 58423
Georgians .Kurd_C3 Scythian_IA Chimp -0.0099 -1.651 98625
Georgians .Kurd_C3 Steppe_Eneolithic Chimp -0.0096 -1.246 58375
Georgians .Kurd_C3 EHG_61 Chimp -0.0087 -1.373 103695
Georgians .Kurd_C3 Andronovo1 Chimp -0.0087 -1.381 86114
Georgians .Kurd_C3 Poltavka1 Chimp -0.0083 -1.126 57507
Georgians .Kurd_C3 Sintashta Chimp -0.0067 -1.022 85128
Georgians .Kurd_C3 Steppe_MLBA Chimp -0.0063 -1.314 105789
Georgians .Kurd_C3 Yamnaya_S2 Chimp -0.0047 -0.599 53056
Georgians .Kurd_C3 Yamnaya_K2 Chimp -0.0044 -0.625 70565
Georgians .Kurd_C3 Yamnaya_S5 Chimp -0.0038 -0.529 65230
Georgians .Kurd_C3 Yamnaya_S3 Chimp -0.0035 -0.496 67600
Georgians .Kurd_C3 Yamnaya_S4 Chimp -0.0035 -0.499 67961
Georgians .Kurd_C3 Andronovo3 Chimp -0.0027 -0.427 104529
Georgians .Kurd_C3 Poltavka4 Chimp -0.0011 -0.176 103582


POP 1 POP 2 TARGET OUTGROUP D Z SNPs
Iranian .Kurd_C3 Andronovo1 Chimp -0.0153 -2.897 86114
Iranian .Kurd_C3 Scythian_IA Chimp -0.0136 -2.799 98625
Iranian .Kurd_C3 Sintashta Chimp -0.0133 -2.401 85128
Iranian .Kurd_C3 Andronovo2 Chimp -0.0131 -2.139 64920
Iranian .Kurd_C3 Satsurbila Chimp -0.0125 -2.135 74429
Iranian .Kurd_C3 Yamnaya_K2 Chimp -0.0118 -2.047 70565
Iranian .Kurd_C3 Steppe_MLBA Chimp -0.0116 -2.948 105789
Iranian .Kurd_C3 Andronovo3 Chimp -0.0113 -2.186 104529
Iranian .Kurd_C3 Steppe_Eneolithic Chimp -0.0105 -1.664 58375
Iranian .Kurd_C3 Yamnaya_S7 Chimp -0.0103 -1.667 58423
Iranian .Kurd_C3 Poltavka1 Chimp -0.0095 -1.546 57507
Iranian .Kurd_C3 Yamnaya_S2 Chimp -0.0094 -1.48 53056
Iranian .Kurd_C3 Yamnaya_S4 Chimp -0.0091 -1.542 67961
Iranian .Kurd_C3 Yamnaya_S5 Chimp -0.0082 -1.362 65230
Iranian .Kurd_C3 EHG_61 Chimp -0.0078 -1.519 103695
Iranian .Kurd_C3 Yamnaya_S3 Chimp -0.0074 -1.263 67600
Iranian .Kurd_C3 Afansievo Chimp -0.0059 -1.263 103812
Iranian .Kurd_C3 Poltavka4 Chimp -0.005 -1.021 103582
Iranian .Kurd_C3 Yamnaya_S6 Chimp -0.0035 -0.659 103164
Iranian .Kurd_C3 Poltavka2 Chimp -0.0024 -0.44 90988
Iranian .Kurd_C3 Kotias Chimp -0.0008 -0.159 105796

Unknown said...

Kurd C3 shares more total drift with steppe than Iran Chl and Zoroastrians

POP 1 POP 2 TARGET OUTGROUP D Z SNPs
Iran_ChL .Kurd_C3 Yamnaya_K2 Chimp -0.0248 -3.2 66591
Iran_ChL .Kurd_C3 EHG_61 Chimp -0.0207 -3.127 97213
Iran_ChL .Kurd_C3 Steppe_Eneolithic Chimp -0.0202 -2.561 55978
Iran_ChL .Kurd_C3 Andronovo1 Chimp -0.0178 -2.448 81001
Iran_ChL .Kurd_C3 Sintashta Chimp -0.0177 -2.428 80344
Iran_ChL .Kurd_C3 Scythian_IA Chimp -0.0176 -2.679 92951
Iran_ChL .Kurd_C3 Andronovo3 Chimp -0.0146 -2.129 97829
Iran_ChL .Kurd_C3 Steppe_MLBA Chimp -0.0144 -2.648 98950
Iran_ChL .Kurd_C3 Yamnaya_S5 Chimp -0.0125 -1.555 62074
Iran_ChL .Kurd_C3 Poltavka1 Chimp -0.0117 -1.409 55355
Iran_ChL .Kurd_C3 Afansievo Chimp -0.0111 -1.774 97225
Iran_ChL .Kurd_C3 Yamnaya_S7 Chimp -0.0102 -1.195 55685
Iran_ChL .Kurd_C3 Andronovo2 Chimp -0.0093 -1.132 60993
Iran_ChL .Kurd_C3 Yamnaya_S2 Chimp -0.0084 -0.975 50325
Iran_ChL .Kurd_C3 Yamnaya_S4 Chimp -0.0083 -1.097 64962
Iran_ChL .Kurd_C3 Poltavka2 Chimp -0.0067 -0.905 86426
Iran_ChL .Kurd_C3 Yamnaya_S3 Chimp -0.0056 -0.723 64902
Iran_ChL .Kurd_C3 Yamnaya_S6 Chimp -0.004 -0.563 96773
Iran_ChL .Kurd_C3 Poltavka4 Chimp -0.0032 -0.498 97140
Iran_ChL .Kurd_C3 Satsurbila Chimp -0.0032 -0.397 69522
Iran_ChL .Kurd_C3 Kotias Chimp 0.0112 1.692 98947
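When reading tables like this one, a common convention is to treat |Z| of about 3 or more as significant. A small sketch of how one might flag such rows, using a handful of rows copied from the Iran_ChL table above; the threshold is the conventional one rather than something stated in this comment, and the names are illustrative.

```python
# (target, D, Z) rows copied from the Iran_ChL table above.
rows = [
    ("Yamnaya_K2", -0.0248, -3.2),
    ("EHG_61", -0.0207, -3.127),
    ("Steppe_Eneolithic", -0.0202, -2.561),
    ("Scythian_IA", -0.0176, -2.679),
    ("Steppe_MLBA", -0.0144, -2.648),
    ("Kotias", 0.0112, 1.692),
]

Z_THRESHOLD = 3.0  # assumption: conventional cut-off, not given in the comment

def significant_targets(table, threshold=Z_THRESHOLD):
    """Return the targets whose |Z| meets or exceeds the threshold."""
    return [target for target, _, z in table if abs(z) >= threshold]

print(significant_targets(rows))
```

Rows below the threshold still show the direction of the shift, but are better read as suggestive than as firm results.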


POP 1 POP 2 TARGET OUTGROUP D Z SNPs
Iran_Zoroastrian .Kurd_C3 Scythian_IA Chimp -0.0125 -2.505 98554
Iran_Zoroastrian .Kurd_C3 Yamnaya_S7 Chimp -0.0111 -1.74 58374
Iran_Zoroastrian .Kurd_C3 Andronovo1 Chimp -0.0107 -1.963 86046
Iran_Zoroastrian .Kurd_C3 Satsurbila Chimp -0.0096 -1.593 74377
Iran_Zoroastrian .Kurd_C3 Andronovo2 Chimp -0.0095 -1.533 64869
Iran_Zoroastrian .Kurd_C3 Sintashta Chimp -0.008 -1.419 85060
Iran_Zoroastrian .Kurd_C3 Yamnaya_K2 Chimp -0.0073 -1.242 70505
Iran_Zoroastrian .Kurd_C3 Steppe_MLBA Chimp -0.0071 -1.745 105715
Iran_Zoroastrian .Kurd_C3 Poltavka1 Chimp -0.0069 -1.084 57456
Iran_Zoroastrian .Kurd_C3 Yamnaya_S2 Chimp -0.0063 -0.976 53019
Iran_Zoroastrian .Kurd_C3 Andronovo3 Chimp -0.0057 -1.073 104457
Iran_Zoroastrian .Kurd_C3 Yamnaya_S5 Chimp -0.0048 -0.769 65177
Iran_Zoroastrian .Kurd_C3 Steppe_Eneolithic Chimp -0.0046 -0.726 58324
Iran_Zoroastrian .Kurd_C3 Yamnaya_S4 Chimp -0.0039 -0.671 67900
Iran_Zoroastrian .Kurd_C3 Yamnaya_S3 Chimp -0.0037 -0.609 67540
Iran_Zoroastrian .Kurd_C3 Afansievo Chimp -0.0023 -0.476 103739
Iran_Zoroastrian .Kurd_C3 EHG_61 Chimp -0.0017 -0.328 103622
Iran_Zoroastrian .Kurd_C3 Poltavka4 Chimp -0.0007 -0.137 103508

Unknown said...

Kurd C3 also shares more total drift with steppe [especially Scythian IA] than Turks

POP 1 POP 2 TARGET OUTGROUP D Z SNPs
Turkish .Kurd_C3 Satsurbila Chimp -0.0117 -2.017 74429
Turkish .Kurd_C3 Scythian_IA Chimp -0.0087 -1.786 98625
Turkish .Kurd_C3 Steppe_Eneolithic Chimp -0.0084 -1.324 58375
Turkish .Kurd_C3 Steppe_MLBA Chimp -0.0052 -1.313 105789
Turkish .Kurd_C3 Yamnaya_K2 Chimp -0.0071 -1.233 70565
Turkish .Kurd_C3 Andronovo1 Chimp -0.0065 -1.228 86114
Turkish .Kurd_C3 Yamnaya_S7 Chimp -0.0072 -1.155 58423
Turkish .Kurd_C3 Poltavka1 Chimp -0.0062 -1.005 57507
Turkish .Kurd_C3 Andronovo2 Chimp -0.0062 -1 64920
Turkish .Kurd_C3 Yamnaya_S4 Chimp -0.0058 -0.984 67961
Turkish .Kurd_C3 Sintashta Chimp -0.0055 -0.983 85128
Turkish .Kurd_C3 Yamnaya_S5 Chimp -0.0056 -0.921 65230
Turkish .Kurd_C3 Andronovo3 Chimp -0.0041 -0.806 104529
Turkish .Kurd_C3 Yamnaya_S3 Chimp -0.0047 -0.788 67600
Turkish .Kurd_C3 Yamnaya_S2 Chimp -0.0049 -0.764 53056
Turkish .Kurd_C3 Afansievo Chimp -0.0015 -0.308 103812
Turkish .Kurd_C3 EHG_61 Chimp -0.0015 -0.292 103695
Turkish .Kurd_C3 Poltavka4 Chimp -0.0012 -0.248 103582
Turkish .Kurd_C3 Kotias Chimp -0.0003 -0.065 105796

Unknown said...

Kurd C3 shares more total drift with Andronovo Rise 503 than various ancients and modern Asians

POP 1 POP 2 TARGET OUTGROUP D Z SNPs
Iran_N .Kurd_C3 Andronovo2 Chimp -0.0625 -5.899 52923
Kostenki14 .Kurd_C3 Andronovo2 Chimp -0.0487 -4.576 60648
Iran_N_WC1 .Kurd_C3 Andronovo2 Chimp -0.0486 -4.717 40909
Iranian_Bandari .Kurd_C3 Andronovo2 Chimp -0.0381 -5.949 64789
GujaratiD .Kurd_C3 Andronovo2 Chimp -0.0378 -5.574 64920
Jordanians .Kurd_C3 Andronovo2 Chimp -0.0371 -5.102 64920
Levant_N .Kurd_C3 Andronovo2 Chimp -0.0369 -3.372 47124
BedouinB .Kurd_C3 Andronovo2 Chimp -0.0317 -4.706 64920
Levant_BA .Kurd_C3 Andronovo2 Chimp -0.0303 -3.504 56074
Balochi .Kurd_C3 Andronovo2 Chimp -0.0219 -3.55 64920
MA1 .Kurd_C3 Andronovo2 Chimp -0.0216 -1.874 47093
Pathan .Kurd_C3 Andronovo2 Chimp -0.0154 -2.486 64920
Satsurbila .Kurd_C3 Andronovo2 Chimp -0.0145 -1.274 45790
Pashtun_Afghan .Kurd_C3 Andronovo2 Chimp -0.0139 -2.064 62424
Iranian .Kurd_C3 Andronovo2 Chimp -0.0131 -2.139 64920
Druze .Kurd_C3 Andronovo2 Chimp -0.0126 -1.777 64920
Georgians .Kurd_C3 Andronovo2 Chimp -0.0126 -1.738 64920
Azerbaijanis .Kurd_C3 Andronovo2 Chimp -0.0121 -1.744 64920
Jew_iraqi .Kurd_C3 Andronovo2 Chimp -0.0116 -1.742 64920
Kotias .Kurd_C3 Andronovo2 Chimp -0.0103 -1.01 64905
Kalash .Kurd_C3 Andronovo2 Chimp -0.01 -1.442 64920
Iran_Zoroastrian .Kurd_C3 Andronovo2 Chimp -0.0095 -1.533 64869
Iran_ChL .Kurd_C3 Andronovo2 Chimp -0.0093 -1.132 60993
Turkish .Kurd_C3 Andronovo2 Chimp -0.0062 -1 64920
Armenians .Kurd_C3 Andronovo2 Chimp -0.006 -0.898 64920
Iran_recent .Kurd_C3 Andronovo2 Chimp -0.0056 -0.513 50605
Assyrians .Kurd_C3 Andronovo2 Chimp -0.003 -0.423 64920
Anatolia_N .Kurd_C3 Andronovo2 Chimp -0.0028 -0.421 64920

Unknown said...

Kurd C3 shares more total drift with Khvalynsk Eneolithic than various ancients and modern W and SC Asians.

POP 1 POP 2 TARGET OUTGROUP D Z SNPs
Iran_N .Kurd_C3 Steppe_Eneolithic Chimp -0.0518 -4.782 49183
Kostenki14 .Kurd_C3 Steppe_Eneolithic Chimp -0.0454 -4.015 54600
Levant_N .Kurd_C3 Steppe_Eneolithic Chimp -0.0445 -4.067 44284
BedouinB .Kurd_C3 Steppe_Eneolithic Chimp -0.041 -5.954 58375
Iran_N_WC1 .Kurd_C3 Steppe_Eneolithic Chimp -0.041 -3.809 36815
Jordanians .Kurd_C3 Steppe_Eneolithic Chimp -0.0356 -4.619 58375
Iranian_Bandari .Kurd_C3 Steppe_Eneolithic Chimp -0.0289 -4.335 58238
Druze .Kurd_C3 Steppe_Eneolithic Chimp -0.0257 -3.383 58375
Levant_BA .Kurd_C3 Steppe_Eneolithic Chimp -0.0229 -2.696 52265
Jew_iraqi .Kurd_C3 Steppe_Eneolithic Chimp -0.0224 -3.323 58375
Iran_ChL .Kurd_C3 Steppe_Eneolithic Chimp -0.0202 -2.561 55978
GujaratiD .Kurd_C3 Steppe_Eneolithic Chimp -0.0192 -2.84 58375
Armenians .Kurd_C3 Steppe_Eneolithic Chimp -0.0147 -2.147 58375
Assyrians .Kurd_C3 Steppe_Eneolithic Chimp -0.013 -1.78 58375
Balochi .Kurd_C3 Steppe_Eneolithic Chimp -0.0124 -1.918 58375
Anatolia_N .Kurd_C3 Steppe_Eneolithic Chimp -0.0106 -1.552 58375
Iranian .Kurd_C3 Steppe_Eneolithic Chimp -0.0105 -1.664 58375
Georgians .Kurd_C3 Steppe_Eneolithic Chimp -0.0096 -1.246 58375
Turkish .Kurd_C3 Steppe_Eneolithic Chimp -0.0084 -1.324 58375
Azerbaijanis .Kurd_C3 Steppe_Eneolithic Chimp -0.0081 -1.104 58375
Kotias .Kurd_C3 Steppe_Eneolithic Chimp -0.0074 -0.714 58371
North-Ossetians .Kurd_C3 Steppe_Eneolithic Chimp -0.0047 -0.604 58375
Iran_Zoroastrian .Kurd_C3 Steppe_Eneolithic Chimp -0.0046 -0.726 58324
Iran_recent .Kurd_C3 Steppe_Eneolithic Chimp -0.0044 -0.397 47804
Pathan .Kurd_C3 Steppe_Eneolithic Chimp -0.004 -0.633 58375
Kalash .Kurd_C3 Steppe_Eneolithic Chimp -0.0039 -0.558 58375
Pashtun_Afghan .Kurd_C3 Steppe_Eneolithic Chimp -0.0023 -0.331 55946

Aram said...

Kurd

Kurds share more drift with steppe than North Ossetians? Amazing. Ossetians claim Scythian and Alanian ancestry, and they speak an East Iranian language, like Pashtuns.

Can someone model North Ossetians and Balkars? They are very similar and both claim Scythian ancestry, but Balkars are Turkic speakers. On the Russian Molgen forum there is a lot of discussion on this.

Seinundzeit said...

David,

Those skeletons belong to a culture which (in all likelihood) is directly ancestral to Dardic Indo-Aryans (like the Kalash), so this will be very huge.

Combine these samples with those from Rakhigarhi, and we could do amazing/unambiguously definitive analyses.

I really hope this actually pans out. I would love to see how these Swat people relate to contemporary populations in the region (and it'll be very fun from a personal perspective, since I probably have substantial direct ancestry from these people).

huijbregts,

Those clusters are very interesting.

Fascinating that CHG, Iran_Chalcolithic, and some recent West Asians/Caucasians are construed as belonging in a single cluster, while Iran_Neolithic, Iran_Hotu, and recent South Central Asians are construed as belonging in another unified cluster.

And we look forward to seeing that fuzzy clustering output.

Matt,

How do Afghan Pashtuns model, when Austroasiatic_Bonda are used?

Based on my previous setup, the Bonda are the least West Eurasian of all mainland Indian populations, like only 20% Iran_Neolithic-related, and around 65% Andaman (Jarawa)-related, with the rest being Southeast Asian ancestry. The Chamar are more like 60%-65% Iran_Neolithic-related, with the rest being Andaman-related.

Then again, if the goal isn't estimating total West Eurasian ancestry, but rather estimating possible IVC-related admixture, the Chamar might be better, as I think it's possible that the Rakhigarhi samples will show a great similarity to Chamar-like people.

So, the kind of modelling you've applied to the Afghan Pashtun samples could prove to be of substantial value. It might constitute a hint of what we’ll see with IVC samples.

All,

For the fun of it, I took Shaikorth's ANE affinity test (which worked great), and just threw in a bunch of additional populations.

This isn't meant to be taken seriously. Rather, I mainly did it for the sake of seeing what the patterns would look like.

Seinundzeit said...

Here is the output:

South Central Asians:

Kalash

50.60% Iran_Hotu
21.85% Yamnaya_Samara
9.95% Iran_Neolithic
8.75% Srubnaya_outlier
5.20% Jarawa
3.65% Vietnamese_south

Distance=0.6912

Afghan Pashtun, Ghazni (Central Afghanistan, of the Ghilzai confederacy)

34.85% Kotias
20.75% Poltavka
14.95% Iran_Neolithic
14.95% Iran_Hotu
5.60% Jarawa
5.40% Vietnamese_South
3.50% Yamnaya_Samara

Distance=0.366

Myself

30.15% Iran_Hotu
29.70% Iran_Neolithic
21.05% Poltavka
6.50% Srubnaya_outlier
6.50% Vietnamese_south
5.70% Jarawa
0.25% Buryat
0.15% Daur

Distance=0.3922

Pashtun_Afghanistan (northern Afghanistan, hodgepodge of Durrani and Ghilzai)

32.65% Poltavka
17.35% Iran_Hotu
16.40% Iran_Neolithic
13.00% Iran_Chalcolithic
10.25% Kotias
7.30% Vietnamese_south
3.05% Jarawa

Distance=0.2157

Afghan Pashtun, Kandahar (southeastern Afghanistan, of the Durrani confederacy)

40.65% Iran_Neolithic
31.00% Poltavka
15.50% Iran_Chalcolithic
6.95% Vietnamese_south
2.95% Srubnaya_outlier
2.60% Kotias
0.45% Buryat
0.35% Daur

Distance=0.1691

Pakistani Pashtun, Waziristan (northwestern Pakistan, of the Karlani confederacy)

33.45% Iran_Hotu
30.4% Poltavka
15.95% Iran_Chalcolithic
11.75% Kotias
7.85% Vietnamese_south
0.60% Jarawa

Distance=0.256

South Asians:

Chamar

64.65% Iran_Hotu
32.70% Jarawa
2.65% Afontova Gora3

Distance=3.6878

Pulliyar

48.40% Jarawa
46.75% Iran_Hotu
4.85% Iran_Neolithic

Distance=3.423

Paniya

57.85% Jarawa
33.35% Iran_Hotu
8.80% Iran_Neolithic

Distance=3.4505

Austroasiatic_Bonda

66.25% Jarawa
15.45% Vietnamese_south
12.90% Iran_Neolithic
5.40% Iran_Hotu

Distance=3.2655

As expected, things get complicated for South Central Asians, with so many reference populations in the mix. Again, this isn’t meant to be taken too seriously.

But, it is of great interest that the southern Vietnamese really eat up Andaman-related percentages for South Central Asians. From previous modelling, I know that these Vietnamese have a good amount of Andaman-related admixture, and South Central Asians like Pashtuns and Pamiri Tajiks have Siberian admixture. So, it isn’t surprising that an East Asian population with Andaman-related ancestry will eat up the already quite minor Andaman-related and low level Siberian-related admixture found in these populations.

At the same time, there is evidence that rice cultivation came to South Asia via perhaps East Asian-like people. In addition, if I’m not mistaken, many Harappan cranial remains do display East Asian features, and I remember a paper that claimed that the dental traits of some ancient Pakistani remains were clearly East Asian, not West Eurasian or South Eurasian.

On top of that, the Kalash do not have any Siberian/recent East Asian admixture, but their indigenous South Asian ancestry is still split between the Jarawa and the Vietnamese.

Basically, there could be something very significant in these results. But I wouldn’t bet on it.

Also, it’s interesting that despite the complicated picture seen in South Central Asia, isolated South Asians are still being construed as just simple mixtures between Iran_Neolithic/Iran_Hotu-related and Andaman-related ancestries. That seems to be all there really is, with regard to their genetic history.

Nevertheless, the fits aren’t excellent, unlike the South Central Asian models.

In their case, I guess we need South Asian Mesolithic aDNA to get the kind of tight fits we see with South Central Asians.

Jijnasu said...

I doubt the skeletons have anything to do with the Kalash. They were probably ancient Gandharans, likely the precursors of modern Hindkowans.
