
Thursday, December 29, 2016

Early Indo-European migrations map

Wikipedia has a new animated gif of early Indo-European migrations (available at various resolutions here). It's pretty good overall, but very speculative and potentially erroneous in parts. For instance, my understanding is that the Vedic Aryans did not emerge from BMAC per se, as the map suggests, but rather from a post-BMAC phenomenon heavily influenced by steppe pastoralists. Hi-res ancient DNA from BMAC and post-BMAC sites should be able to resolve this issue.

As far as I know, BMAC remains were being tested at Harvard earlier this year, but the year is almost out, and nothing has been published. So either David Reich and co. are keeping the results for a new paper on the Indo-European homeland question, or they couldn't get any usable data from the samples. Keep in mind that only 30-40% of the ancient remains that are tested at Harvard are successfully genotyped. I can imagine that the success rate for samples from arid locations, like former BMAC sites in Turkmenistan, is even lower.

Update 31/12/2016: Commentator Tapatuevik Kaarmkyno points us to an article from earlier this year at NIH Record featuring this quote from David Reich:"We’ve sequenced more than 1,000 samples in our own lab — there’s not enough time to publish". That's probably why the second half of 2016 was so agonizingly slow. Next year should be awesome.

See also...

Maybe first direct hints of Yamnaya-related gene flow into South Central Asia


rozenblatt said...

Well, they could try to test samples from other, less arid parts of BMAC territory, like Sarazm in Tajikistan.

Davidski said...

Maybe they are. Btw, I know that GeoGenetics has ancient horse samples from Tajikistan, so why not human samples, including from BMAC?

I think we'll see a lot of good stuff eventually, in all likelihood next year.

Jaydeep said...

I think, if as many as 12 samples out of 15 could have been genotyped from Rakhigarhi in Haryana, India, there is no reason why DNA cannot be extracted from Central Asia, which is much colder in comparison to South Asia. Also, any ideas as to why aDNA from the Tarim mummies is not being taken by Reich & company? They are probably the most well preserved.

rozenblatt said...

@Jaydeep I don't know precisely, but I have a strong suspicion that the Chinese government doesn't allow foreign labs to analyze samples from China. The only exception that I know of is the remains from Tianyuan, but in that case one of the main researchers is Fu Qiaomei, so maybe the exception was made because of that.

Davidski said...

As far as I know, desert conditions like in much of Turkmenistan are by far the worst for DNA preservation, not just because of the mostly arid climate but also extreme variation in temperatures and humidity.

No idea what's happening now with the Tarim Basin samples. Maybe the Chinese want to be first to publish their genomes, so everyone else is waiting for them?

Jaydeep said...

As per the above paper, arid weather appears to be excellent for preservation of ancient DNA. I think it is just a matter of time. As far as the Tarim mummies are concerned, I hope the Chinese govt is not being a spoilsport. Any idea how soon the Rakhigarhi DNA might be published in the new year?

Davidski said...

Yeah, but it depends on what sort of arid weather.

As far as I know, the Rakhigarhi paper will be published before the middle of the year. And it might include all sorts of other stuff too, like data on Cemetery H samples.

But I guess we'll see.

Anonymous said...

I do not understand how this map can be correct for the paternal side. Yamna is not the ancestor of Corded Ware paternally according to ancient DNA evidence. Yamna had R1b and Corded Ware had R1a. Maybe it's true for the mtDNA side, as well as much of what is in between, but it is obviously incorrect for the paternal side.

Anonymous said...

R1a was in Karelia before Yamna, and even at Baikal according to one study, when R1b was in the Yamna area, so I do not see how it is from the Yamna culture at all. Has there been any new ancient Y-DNA published?

Davidski said...

How do you know there was no R1a in the entire Yamnaya horizon? I'd be very surprised if there wasn't, so you must know something I don't. What is it?

Rob said...

Decent effort, with some problems.
- Afanasievo isn't from "Yamnaya" but from an earlier migratory group from the Don steppe
- the Greek migration is imagined at best
- the Armenian ethnogenesis doesn't seem correct.

Anonymous said...

I don't, and maybe R1a comes from Yamnaya/Yamna and this will be discovered in the future. However, with the limited samples that we have access to, I don't think it is likely. For example, a number of Yamnaya had R1b1a2a2-Z2103 and seemed to lack R1a according to the few samples collected, while R1a was already found in places like Karelia a few thousand years before that, and even one in Lokomotiv ("Y-chromosomal DNA analyzed for four prehistoric cemeteries from Cis-Baikal, Siberia") long before. So R1a to me seems to probably not come from Yamnaya, even if they did have a minority with R1a. Maybe Rob is right and Corded Ware and the other R1a "isn't from Yamnaya".

Davidski said...

There was R1a in Khvalynsk so there will be R1a in Yamnaya.

Tapatuevik Kaarmkyno said...

David Reich: “We’ve sequenced more than 1,000 samples in our own lab—there’s not enough time to publish” all the data they are collecting.

Rob said...

@ 5c63f22..

Hi. I do think it likely CWC derives from Yamnaya, or at least some "post-Stog" group thereabouts. But yes, the final proof is still awaited ....

As for Afanasievo: I think it comes from a pre-Yamnaya group originating b/w the Kuban & Volga, which was then overlaid by the ancestors of Yamnaya (sensu stricto) expanding from just west of that (~ Dnieper-Don steppe).

Aram said...


I think they placed Armenians around Lake Urmia based on the Iron Age Iran F38 sample, who was R1b-L584 (the most frequent SNP among Armenians). He was quite close to modern Armenians by f3 stats. That sample also showed some shared drift with Kum6, so they presume it crossed Anatolia.

Olympus Mons said...

I see. When it suits you, you stick to the aDNA. When it doesn't, you are perfectly happy to infer from negatives? So there's no R1a in Yamnaya, and the response is ... there will be?

Ok, just show some lenience to others who use the same arguments.

Olympus Mons said...

and do you have an idea of when such migration occurred?

From Shulaveri-Shomu to Kuban!

Davidski said...

It's very likely that there was R1a in Yamnaya...

However, even if there wasn't, then we know for a fact that there was R1a in Khvalynsk (ie. pre-Yamnaya)...

So R1a-M417 comes from the steppe, because both Y-DNA and autosomal DNA show the same thing (ie. Corded Ware is indeed very much 75% Yamnaya-like).

You will have to accept sooner or later that both R1a-M417 and R1b-M269 are from the steppes.

Olympus Mons said...

Earlier layer of Kumtepe (so, Kum6) really points to circular architecture and since it has a dating of 4900/4800BC you know where they came from, right? :)

OT: have you checked Gadachrili Gora in Georgia?

Great videos. They really try to show you everything.
I don't think the Canadians have reached the inhumations/pits part yet, and since it was only occupied by Shulaveri, I think the best expectation for aDNA of Shulaveri-Shomu might be there.

Nirjhar007 said...


Azarov Dmitry said...

This map was made by a Yamnaya fanboy who never heard about the Sredny Stog, Khvalynsk, Cernavoda, Maykop, Catacomb, Fatyanovo–Balanovo and Baden cultures. When modeling migrations of R1 folks and the spread of magic steppe admixture, one has to keep in mind that for now all feasible migration models can be divided into two major groups:
a) magic steppe admixture was brought into Europe from the PC steppe;
b) magic steppe admixture was brought into Europe by R1 folks.
Now we know that the spread of magic steppe admixture in Europe can be reliably connected with Bronze Age migrations of R1 folks. But at the same time, migrations of R1b (subclades R1b-L51 and R1b-L151) folks from the PC steppe look quite problematic. That’s why I believe that steppe admixture was brought into Europe by R1 folks and that they came from the Iranian Plateau.

Davidski said...

You're an Iranian Plateau fanboy who doesn't know how to interpret the ancient DNA we already have.

Olympus Mons said...


HAPPY NEW YEAR!!!!!!!!!!
May 2017 bring peace to our doubts, and may we all be grown-up enough to man up to whatever aDNA is published.

Davidski said...

I'll drink to that.

Rob said...

@ Arame

"I think they placed Armenians around Lake Urmia based on the Iron Age Iran F38 sample, who was R1b-L584 (the most frequent SNP among Armenians). He was quite close to modern Armenians by f3 stats. That sample also showed some shared drift with Kum6, so they presume it crossed Anatolia."

Yes, that sample, as you say, is from the Iron Age; but this illustration has Armenians appearing around Urmia c. 2500 BC. They also mark it specifically as coming from the lower Danube segment of Yamnaya. I think it's a great animation, but for discussion's sake, what do we know so far?

The Chalcolithic Anatolian samples (c. 4000 BC) can be summarized as 40-50% Anatolian-Levant farmer, 30-40% Kotias, 5-15% EHG. The current EBA (3300- 2200 BC) samples show a distinctly higher Kotias component, doubling to ~60%. This represents, IMO, a completely different (?new) population to the Chalcolithic one, not just admixture. Of course this could be due to structure within the Caucasus, and local expansion/ replacements.

By the MLBA, there is considerable continuity with the EBA, but also a significant 'cosmopolitanization' of the population, with a major chunk coming from a steppe population: a 5-10% increase in EHG, but an overall 20-30% steppe impact in certain individuals (no doubt associated with the appearance of kurgans south of the Caucasus after the fragmentation of the K-A culture).

Lastly, modern Armenians are again quite ANF/Levant farmer shifted. However, this doesn't come from an actual Levantine source, but rather from Anatolia & the Balkans, as legends suggest historical movements. There are links with Greek, Phrygian, etc.

So proto-Armenians could have arrived with the shifts in the EBA, or with the arrival of some newcomers in the MBA, or with western influences (?) during the Iron Age. I guess it partly depends on what other regions show, in comparison.

Samuel Andrews said...

Natural Selection transformed/unified European mtDNA.

To me it's irrefutable now that something other than direct ancestry and founder effects formed (uniform) haplogroup frequencies in Europe. I've looked at 1,000s of H genomes. There is no massive H founder effect and there was no H people. What else could it be but natural selection?

Atriðr said...

If teleportation was invented in the Bronze Age, then this is a good map.

Gaspar said...

PIE and IE

first split was ~4000BC in Anatolia

This link has the same map

IMO, no haplogroup can claim creation of PIE, but any Haplogroup from Yamnaya can.

The Phrygians, or more exactly proto-Phrygians, would have made the first PIE split in Anatolia

Romulus the I2a L233+ Proto Balto-Slav, layer of Corded Ware Women said...

@Samuel Andrews

How much do you know about RH factor incompatibility? Rhesus negative blood types peak in Europeans and in Basques among Europeans. Given the distribution local to Europe it's fair to assume that Mesolithic Western Europeans were high or perhaps homozygous for the Rhesus negative blood type.

If a Rhesus negative woman conceives a child with a Rhesus positive man then typically she would not be able to have more than 1 child due to RH factor incompatibility. Heterozygotes are RH positive so there would be no Rhesus factor incompatibility for a homozygous RH negative man conceiving children with a homozygous RH positive woman.

If we have a situation in Neolithic Western Europe where incoming (homozygous) RH positive farmer groups are exchanging wives with local RH negative Hunter Gatherer groups, then we would see a significant decrease in fertility for EEF Male / HG Female pairings. Especially in relation to HG Male / EEF Female pairings. As RH positive alleles spread through HG groups this would further reduce fertility among Rhesus negative women in HG groups.

Unfortunately none of the ancient genomes we have so far have been typed for the Rhesus factor snps/gene (I've checked). But given modern distributions, the only way we could have a peak of RH negative blood types in Basques is almost certainly from some Meso or Paleolithic founder effect in a refugium around the Pyrenees.

The effect of this compounded over thousands of years would explain the disappearance of both mtDNA U5b (local HG mtdna) and Y DNA G2a. Natural selection in effect as you suggested.
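
The asymmetry described in this comment can be illustrated with a toy Mendelian sketch. It assumes a single locus with a dominant `D` (Rh positive) allele and a recessive `d` allele, and it only computes the chance that an Rh-negative mother carries an Rh-positive child. It says nothing about how severe the fertility effect would be, and the function names are made up for illustration.

```python
from fractions import Fraction
from itertools import product


def is_rh_positive(genotype):
    """A genotype such as 'DD', 'Dd' or 'dd' is Rh positive if it carries at least one D."""
    return "D" in genotype


def offspring_distribution(mother, father):
    """Return {child_genotype: probability} for one child of the given parents."""
    dist = {}
    for m_allele, f_allele in product(mother, father):
        child = "".join(sorted((m_allele, f_allele)))  # 'D' sorts before 'd'
        dist[child] = dist.get(child, Fraction(0)) + Fraction(1, 4)
    return dist


def incompatibility_risk(mother, father):
    """Probability that an Rh-negative mother carries an Rh-positive child."""
    if is_rh_positive(mother):
        return Fraction(0)  # only Rh-negative mothers are at risk in this model
    children = offspring_distribution(mother, father)
    return sum((p for g, p in children.items() if is_rh_positive(g)), Fraction(0))


if __name__ == "__main__":
    for mother, father in [("dd", "DD"), ("dd", "Dd"), ("DD", "dd"), ("dd", "dd")]:
        print(mother, "x", father, "->", incompatibility_risk(mother, father))
```

Under these assumptions, only pairings with an Rh-negative mother and a father carrying `D` show any risk, which is the asymmetry between the two kinds of pairings that the comment relies on.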

Anonymous said...


Gravetto-Danubian on Anthrogenica states that several Central Asian samples yielded no usable DNA:

Maybe related to said BMAC samples? The list there is the one Pinhasi published.

Davidski said...

@Tapatuevik Kaarmkyno

Thanks, I updated the post with that quote from David Reich.

Aram said...


For the EBA I agree it is a new population. But it does not make for 100% population replacement. Maybe some 50-70%.
For the MLBA look at Alberto's datasheet. Notice the high level of Esperstedt_MN in Armenia_MLBA. This Esperstedt_MN is not an artefact of nMonte. It shows up in other admixture tools. For example, in Gedmatch it shows as an unusually high level of Atlantic Med. So the migration occurred something like this: from the Carpathian/North Black Sea region, a movement toward the South Caucasus. This is, btw, backed up by some Y-DNA like I2c2, present in Unetice and today prominent in the South Caucasus. Maybe R1b came in this way also?
R1b-L584 can't have come via Anatolia, because L584* was found in Dagestan. There are other reasons to think so.

"Lastly, modern Armenians are again quite ANF/Levant farmer shifted. However, this doesn't come from an actual Levantine source"

Well, this is a specifically Levantine shift, not something Anatolian. The closest pop to Armenians by FST are the Lebanon Muslims, who themselves are the closest pop to Jordan_EBA. This Levantine shift is a real thing; it was found in Busby et al. If you draw a line from Armenia_MLBA to modern Armenians on Davidski's PCA, you will end up in the Levant, not in Anatolia.
But of course this Levantine shift doesn't come from most of the Levant. It comes from Diyarbekir, Gaziantep, and parts of North East Syria. Maybe even Subartu. Why I think so is a little bit long to elaborate.

Aram said...


Look at these tables, especially the second

Fst Scores -
Shift between Fst scores between each era -

Note that the main difference between Armenia MLBA and modern Armenians is a very Levantine shift. I would even say that it is an Arabian shift because of the presence of Saudis and Yemenite Jews.

As for legends. Eh oh. If we start to rely on legends then we must investigate the possibility of Germans and Brits coming from Armenia. :) Anglo Saxon Chronicles.

Aram said...

Summing up things.
2500 BC migration is correct.
But the route of migration is dubious.

EastPole said...

Sintashta is shown as a mix of Corded Ware and Yamna. If Sintashta was proto-Indo-Iranian, then what languages were spoken in Corded Ware and what languages in Yamna? Genetically Sintashta is much closer to Slavs (who are a mix of Steppe herders and Neolithic farmers) than to Yamna PIE herders. And Sintashta has R1a and doesn’t have Yamna R1b. Lots of questions for linguists.

Rob said...


Sure. Whatever you want to call a 70% population shift doesn't bother me, but you seem intent on excluding the EBA shift, which is more notable than the MLBA one.

I think the difference with my analysis is that I looked at all individuals, not grouped ones. Thus, the results will be very different, i.e. more accurate.

Lastly, yes, some myths are just that (e.g. Trojan origins of the Franks, but even these have hidden meaning). Some myths, however, contain more grains of truth. It's up to us to be able to discern that instead of summarily dismissing them, which would be of utmost folly.

Anyhow: the biggest shifts in Armenia are likely to have occurred after the Mesolithic (not relevant for PIE), then c. 3500 BC.
The MLBA shifts are more of an adstratum, and whilst possible to cause linguistic shift, I think you're jumping the gun in proclaiming what is correct, given that corresponding shifts are absent in Iran.
Happy NY Bud

Nirjhar007 said...

Yes Dave , they did lots of senseless sampling..

Davidski said...

I'm pretty sure that amongst those 1,000+ samples there's something for everyone, including you.

Nirjhar007 said...

Happy 2017 for my Aussie mates !.

Well Dave , I will make sure that you get your share too ;) .

jv said...

Look forward to 2017 and the results from David Reich! Maybe someday some of my unanswered questions regarding my ancient Grandmothers will be answered. 1) Did they live in the Pre-Yamnaya Hunter-Gatherer Pontic-Caspian Steppe? 2) Did they arrive with the Elshanka Culture or with an earlier Culture? 3) Did they follow R1b from the Zagros Mts to the Pontic-Caspian Steppe? 4) WHEN did they migrate to the Balkans? After the Catacomb Culture or with the Yamnaya Culture migrations?............maybe someday I will know?

Rob said...

@ Arame

I have more time for an answer now. The important thing to realize is that results will be different between pooled Dstat data and PCA data which contains individuals from each time period. The latter gives a more accurate picture, IMO, as the pooled data 'averages' out giving a phantom phenomenon which might not have occurred.

* "For the EBA I agree it is a new population. But it does not make for 100% population replacement. Maybe some 50-70%."

Sure, it's never 100%, but could be more like 60%. In fact I can't really put a good description on the change from Chalc to EBA, other than a definite rise in Kotias-type ancestry (with a relative drop of Anatolian Farmer ancestry cf. the preceding Chalcolithic). NB: one Armenian Chalc is an outlier (I1407), with already relatively higher Kotias:ANF ratios, suggesting that we are looking at population structure in the sth Caucasus, & "internal shifts".

* For Iran there is a significant shift from Neolithic to Chalcolithic, as we know, characterised by a population of mixed Kotias-Levant Farmer ancestry. I agree with Dave's suggestion of Halaf here.

* From EBA Armenia to MLBA there is overriding continuity but with a definite shift toward EHG. As you say, there is some European farmer as part of this. Curiously, the type of steppe chosen is Afanasievo over Yamnaya. So does this represent *one* migration from a group which also gave rise to something like the Sintashta type, as Dave suggested a while ago, or movement from at least 2 different populations (one from the central Eurasian steppe, one minor one from the Balkans)?

* The shift in modern Armenians is not Levantine. Apart from Gavar & Yegvard, who score ~7% Jordan Bronze Age, the majority of the 'farmer' shift seen in modern Armenians compared to their MLBA predecessors is the classic Anatolia farmer, even LBK-like (i.e. they do not choose Levant EBA or even Neolithic). This is obvious in the PCAs attached & raw data breakdown.

* You say "As for legends. Eh oh. If we start to rely on legends then we must investigate the possibility of Germans and Brits coming from Armenia. :) Anglo Saxon Chronicles."

It would be a folly to summarily dismiss legends. What is required is careful interpretation instead of literal reading. All recorded legends have hidden meaning. Exactly what they inform us on is the skill. Eg one myth places the Franks descending from Trojans. Obviously nonsense but it does inform us about what was happening in Frankia/ Gaul in Late Antiquity: Frankish integration into Roman society - hence shared Trojan origins with the Romans was invented. So then, what is the meaning behind the alleged Balkan origin of Armenians ?

* For Iranians, there are still too many gaps, but they do indeed show 5-15% EHG admixture. Unlike Armenians, they have a more consistent & significant actual Levant shift (20-25%).

PCA of individual PCA-IBS data

Pooled DStat data

Raw Data

Aram said...

Thanks for the data. I will analyse it and will answer later.
About the Levantine shift: it was my error. I should have elaborated more on what I mean by 'Levantine' shift.
What population diluted the Natufian during the transition from Natufian to Levant_N?
It was not from Anatolia. Right?
So did this population disappear, or did it evolve somewhere in the North Levant/Taurus region, maybe even parts of historic Armenia? I was referring to that unsampled territory. And yes, Iranians have much more Natufian than Armenians.

As for ancestry from the IE people of BA/IA Anatolia and the Balkans: how can they be a good source for Armenians if, according to the Steppe theory, they should carry some reasonable levels of EHG ancestry? So how could classical Anatolian farmer ancestry survive in pure form after all those Hittites, Anatolians, Greeks and Phrygians overran Anatolia?

**The MLBA shifts are more of an adstratum, and whilst possible to cause linguistic shift, I think you're jumping the gun in proclaiming what is correct, given that corresponding shifts are absent in Iran. **

It is normal from an archaeological point of view. The early kurgans of Transcaucasia never crossed the Araxes river in large numbers. They didn't even venture into West Armenia. Kura-Araxes sites continued to flourish in the south-western parts of historic Armenia while they completely disappeared in the South Caucasus.
Only later, when the Trialeti culture formed, was there some expansion toward the West and South, but again they never expanded into the whole of Armenia.

I link this MBA period with the arrival of R1b-Z2103 lineages. The two lineages have a TMRCA fitting into this period: 4800 and 4600 ybp. Meanwhile the L584 lineage has a strong re-expansion in the 3300-3100 ybp period, consistent with the Mushki expansion period.
Of course I could be wrong, but is there any data that I missed and should reconsider?

**It would be a folly to summarily dismiss legends.**

Yes it is folly. It is an even greater folly to dismiss hard cuneiform data or other non-Greek sources. Or even other Greek historians besides Herodotus. Or even other Herodotus texts except that one frickin sentence about the same clothing of Phrygians and Armenians, who are the colonists of the Phrygians.
Unfortunately this folly has long plagued Armenology, for various historic reasons. Armenian scholars are also to blame here. But fortunately things are evolving in the right direction. After all, it is a genetics blog and I don't want to flood it with non-genetic 'soft data', as Dave calls them.

Aram said...

Happy New Year to everybody.

I learned a Shulaveri sample from Aknashen is already sent to lab. So maybe next year we will see it.

Tapatuevik Kaarmkyno said...

News from southern India and K. Thangaraj:

CCMB ... has already entered into an agreement with Anthropological Survey of India (AnSI) to unravel the mystery of skeletal remains discovered during excavation of ancient sites. On Thursday, it signed MoU with Telangana state archaeology department to study bone samples collected from various places in [Andhra Pradesh] and Telangana over years.

Rob said...

@ Aram

"Thanks for the data. I will analyse it and will answer later.
About the Levantine shift: it was my error. I should have elaborated more on what I mean by 'Levantine' shift.
What population diluted the Natufian during the transition from Natufian to Levant_N?
It was not from Anatolia. Right? "

That's Ok. But I'm not following

* "As for ancestry from the IE people of BA/IA Anatolia and the Balkans: how can they be a good source for Armenians if, according to the Steppe theory, they should carry some reasonable levels of EHG ancestry? So how could classical Anatolian farmer ancestry survive in pure form after all those Hittites, Anatolians, Greeks and Phrygians overran Anatolia?"

You probably misunderstood again, as with the data. I'm not saying that IE came to Armenia with Balkan farmer ancestry. All I am stating is that compared to the MLBA, there is a shift in Armenians to Anatolian-type farmer ancestry.
I'm not sure exactly when and with whom PIE came to Armenia, but beginning your process with 'according to X, Y or Z theory' already weakens any deduction you make, because they are post hoc.

* "It is normal from an archaeological point of view. The early kurgans of Transcaucasia never crossed the Araxes river in large numbers. They didn't even venture into West Armenia. Kura-Araxes sites continued to flourish in the south-western parts of historic Armenia while they completely disappeared in the South Caucasus.
Only later, when the Trialeti culture formed, was there some expansion toward the West and South, but again they never expanded into the whole of Armenia."

Yes that is the explanation I saw too.

* "Yes it is folly. It is an even greater folly to dismiss hard cuneiform data or other non-Greek sources. Or even other Greek historians besides Herodotus. Or even other Herodotus texts except that one frickin sentence about the same clothing of Phrygians and Armenians, who are the colonists of the Phrygians.
Unfortunately this folly has long plagued Armenology, for various historic reasons. Armenian scholars are also to blame here. But fortunately things are evolving in the right direction."

I can see this is a passionate topic for you ;).
But you must understand better. I am not coming from the position of past Armenian scholars. I am stating that Herodotus was not simply tripping out when he wrote that sentence, but did so because he had heard of such things, or adapted a previous story to explain something, etc. I'm not sure what it means, but perhaps it is some form of anachronism or semi-legend. Tales of migration between Anatolia & northern Greece/Thrace/Macedonia were commonplace in Greek tradition. I'm not stating this should have any bearing on how we interpret the DNA data. Indeed, even the position of Armenian is difficult to characterize: is it more like Iranian, or more like Greek?

Rob said...

@ Áram

I also figured out where the discrepancy comes from.
Alberto's spreadsheet only uses "EHG" as a source pop, instead of something like Yamnaya or Poltavka. Hence Esperstedt showing up as the MNE proxy, which mixed into and is found in all EMBA steppe groups (Poltavka outlier, Andronovo, Srubnaya).

So setting the source populations is a careful task, depending on what you're asking. Thus the answers drawn need to be considered in that light, not literally.

FrankN said...

Happy New Year to everybody!

Dave, I concur with your opening statement that the map sequence is “very speculative and potentially erroneous in parts”.

All such exercises are speculative, as long as linguistics hasn’t reached consensus on:
1. The basic phylogeny of IE families,
2. The dating of PIE.

I’ll deal with both issues in longer write-ups to follow. First, here is a list of the sources those write-ups are based upon.

T. Warnow: Phylogeny Reconstruction Methods in Linguistics (includes representations of IE phylogenies prepared by Ringe 2002 and Nakhleh 2005)

Da Silva/ Tehrani 2016: Comparative phylogenetic analyses uncover the ancient roots of Indo-European folktales

Chang e.a. 2015: Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis (includes an overview on IE root ages estimated by several authors, Fig. 3, p 201)

D. Anthony 2013: Two IE phylogenies, three PIE migrations, and four kinds of steppe pastoralism

Bouckaert e.a. 2012: Mapping the Origins and Expansion of the Indo-European Language Family

Greenhill e.a. 2010: The shape and tempo of language evolution (morphology-based phylogenies)

Jaeger e.a. 2016: World Language Tree

FrankN said...

1. IE phylogeny:
a) Earliest splits: There is consensus that Anatolian languages were the first to split from PIE. Most trees have Tocharian following; however, Bouckaert e.a. reconstruct an Armeno-Tocharian clade instead. In the neighbour-joining (NJ) tree reported by Warnow, Albanian splits before Tocharian. Warnow’s UPGMA-based tree, finally, inserts Osco-Umbrian and Old Persian as second/ third splits before Albanian and Tocharian.

b) Position of Proto-IndoIranian (PII):
Following the early (Anatolian, Tocharian, poss. Albanian) splits,
- Witzel and many Uralicists postulate a direct split between PII and Western IE (everything else);
- Chang e.a., Graca da Silva/ Tehrani, and Warnow’s NJ, Gray&Atkinson (GA) and Maximum Compatibility (MC) trees have Graeco-Armenian (-Albanian) splitting off before PII separates from Germano-BaltoSlavic-ItaloCeltic;
- Bouckaert e.a. see Germano-BaltoSlavic-ItaloCeltic splitting off a joint clade of Graeco-Albanian and PII;
- Acc. to Ringe (2002), Nakhleh (2005) and Anthony (2015), Italo-Celtic splits off early. They generally propose an IndoIranian-BaltoSlavic dialect continuum. Anthony adds Germanic to this continuum, while Ringe and Nakhleh have Germano-Albanian and Graeco-Armenian as subsequent splits.
- Warnow’s UPGMA tree has Celtic, then Germanic, then Balto-Slavic splitting off after Tocharian. What remains is a PII - Graeco-Armenian – Latin continuum (note that UPGMA has a pre-Tocharian split of Osco-Umbrian).
- Jaeger’s “World Language Tree” (WLT), which doesn’t include Anatolian and Tocharian, finally has first Albano-Celtic, then Armenian, and then PII branching off. Modern Greek clusters with Romance there.

c) Position of individual families: As can be seen from the above, except for Anatolian and to some extent Tocharian, there is hardly any consensus on the position of individual families within the phylogeny. Two examples:

- Albanian: Pre-Tocharian split (Warnow NJ, UPGMA), Graeco-Armeno-Albanian (Chang, da Silva/Tehrani, Warnow GA, MC), Graeco-Albanian (Bouckaert), Germano-Albanian (Ringe, Nakhleh), Albano-Celtic (WLT), not considered by Anthony.

- Germanic: Germano-Albanian (Ringe, Nakhleh), Germano-Celtic (Warnow NJ, UPGMA), Germano-ItaloCeltic (Chang, da Silva/Tehrani, Bouckaert, Warnow GA, MC), Germano-BaltoSlavic (WLT), Germano-IndoIranian-BaltoSlavic dialect continuum (Anthony)

The reasons for that mess are clear: IE hasn’t evolved in a tree-like fashion. Instead, there have been multiple instances of post-PIE “admixture”, i.e. language contact, including borrowing and influence of lexical and morphological sub-/adstrate (Cf Greenhill e.a. on typologically German being closer to French than to English, Polish clustering with Baltic languages, and the Graeco-Albano-Bulgarian-Romanian cluster). Important bridge languages, e.g. Thracian (Graeco-Germano-Slavic), Venetic (ItaloCeltic - Germanic, poss. also BaltoSlavic), are insufficiently documented to be included in phylogenic analysis, and the earliest millennia of IE aren’t recorded at all.
The way forward is also clear – supplementing phylogenic by admixture analyses. First attempts (Nakhleh 2005) have yielded admixture edges between Graeco-Armenian and Italo-Celtic, Italic and pre-Germanic, and Germanic and Baltic. More detailed admixture analyses are ongoing within Jaeger’s EVOLAEMP project at Tübingen. However, such analyses, if run on a sizeable number of languages and going to higher-level edges, require massive calculation power that isn’t available so far [IIRC, Jaeger’s team reported that a complete analysis for IE would take up Tübingen University’s computer capacity for one month.]
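
For readers unfamiliar with the distance-based methods named above, here is a minimal sketch of UPGMA clustering, one of the algorithms in Warnow's comparison. The leaf labels `A` to `D` and the distance matrix are invented purely for illustration and do not represent real lexical data, and the function name `upgma` is made up for this sketch rather than taken from any of the cited studies.

```python
from itertools import combinations


def upgma(distances):
    """Cluster leaves with UPGMA (average linkage).

    distances maps frozenset({a, b}) to the distance between leaves a and b.
    Returns (tree, merges): tree is a nested tuple, merges lists each join
    as (cluster_a, cluster_b, height).
    """
    leaves = sorted({leaf for pair in distances for leaf in pair})
    size = {leaf: 1 for leaf in leaves}  # cluster -> number of leaves in it
    d = {pair: float(v) for pair, v in distances.items()}
    merges = []
    while len(size) > 1:
        # Pick the closest pair of current clusters (ties broken by sort order).
        a, b = min(combinations(sorted(size, key=str), 2),
                   key=lambda p: d[frozenset(p)])
        merged = (a, b)
        merges.append((a, b, d[frozenset((a, b))] / 2))
        # Distance to every other cluster is the size-weighted average.
        for other in size:
            if other == a or other == b:
                continue
            d[frozenset((merged, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    (root,) = size
    return root, merges


if __name__ == "__main__":
    # Invented toy distances, not real language data.
    toy = {
        frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
        frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 10,
    }
    tree, merges = upgma(toy)
    print(tree)
    for a, b, height in merges:
        print(a, "+", b, "at height", height)
```

UPGMA assumes a constant rate of change along all branches, which is one reason it can place branches differently from neighbour joining on the same data.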

FrankN said...

2. Dating of PIE:
Chang (2015) in Fig.3, p.201 sums up previous dating attempts for PIE. The majority (Gray&Atkinson 2003, Nicholls&Gray 2008, Ryder&Nicholls 2011, Bouckaert e.a. 2013) comes in between 7,500 and 5,000 BC, which leaves a slight possibility for a spread with ANF, but most likely is too late for that.

The minority position is made up by Chang e.a., and Anthony. Chang is problematic for their use of clade constraints, i.e. postulating a specific IE phylogeny (Anat>Toch>GraecoArmenoAlbanian>PII) that isn’t secured so far (see above). Anthony’s estimate isn’t based on systematic lexical comparison, but only uses vehicle terminology, i.e. a subset of seven putative IE roots. Both date the split of Anatolian, i.e. the end of PIE as a dialect continuum, to around 4,500-4,200 BC.

Indirect help for dating is provided by Da Silva/ Tehrani (2016), who identify the tale of “The Smith and the Devil” as shared, with 87% likelihood, by all IE families, and as such belonging to common (P)IE heritage (Anat. and Tocharian not considered for lack of data). The smith may theoretically have been a goldsmith (Varna Culture, 5th mBC) or have worked on native copper. However, his nails are strong enough to tie the devil to an immovable object (e.g. a tree), which rather suggests bronze as material.
So far, the earliest arsenic bronzes known are from Tall-i-Iblis, SE Iran, early 5th mBC; the earliest tin bronzes date from ca. 4,500 BC, Vinca Culture. (Bronze) smithing as a shared IE mythological motif thus ties in better with a late chronology as proposed by Chang and Anthony. However, we don’t know whether that motif was already present before the earliest splits (Anatolian/ Tocharian). Moreover, the “Smith and the Devil” tale is also absent from Albanian, which may have been another early split.

What does that mean for the WP maps?

1. Yamnaya can’t be PIE. It at earliest evolved around 3,500 BC, recent corrections for reservoir effects rather place it at 2,900-2,500 BC (i.e. contemporary with/ slightly later than CW). The latest dating for Anatolian splitting off PIE is 4,200 BC (Anthony).

2. Proponents of a Steppe origin need to consider presence of bronze metallurgy in early (post Anat./ Toch) IE. This rules out Sredny Stog, Dniepr-Donetsk etc., and also any tying of PIE to EHG. It essentially leaves us with Maykop, but Maykop in all likelihood originated on the NW Iranian Plateau, cf

3. Alternatively, for its early metallurgy, Vinca may be considered as PIE home culture. However, this poses various challenges, most notably the strong ANF presence on the Balkans during the EN.

4. Having the starting point (Yamnaya) wrong, lacking a credible (pre 4,200 BC, Anthony proposes Cernavoda) path to Anatolian, and the early “Danube route” being at odds with pre-BA Hungarian aDNA, those maps shouldn’t be taken seriously and in fact be removed from WP as soon as possible!

Olympus Mons said...

That is great news. Aknashen prior to 5000BC would be a great addition to aDNA. Let's keep our fingers crossed.

Olympus Mons said...

Thank you for making my case of Shulaveri-Shomu as origin of PIE so strong…
I think Johannes Krause at Max Planck would be happy with your explanation.

Kidding. – Shulaveri (6300BC-4900BC) was not the origin of PIE. They were PIE for sure, but if they were then I believe Fikirtepe (all of the 7th millennium in northwestern Anatolia), their forefathers, also were PIE and might be the origin of PIE.

So, origin of PIE was south shores of black sea which means I cannot rule out western part therefore… might even go to Vinca, but not likely at all.

There was something going on around the south/west/east Black Sea during the Neolithic that, regretfully, the expansion of the Black Sea shores has submerged and destroyed.
Looks like even EEF avoided it. Expansion of agriculture seemed to go around it/them and jump into Greece and move on… but avoiding it. Then a possible back flow from West to East might have brought something back to Anatolia.

Anyway I think a component going from west to Anatolia (maybe Gioello R1b?) made an impact:
a. Might have made Fikirtepe 7000bc-6000bc (so, … PIE arising?)
b. Moved to south Caucasus (6000bc-5000bc) as Shulaveri-Shomu.
c. Moved back to northwest Anatolia in for instance Kumtepe (Kum6 – 4800bc) and later made the Anatolian branch(?) of PIE.
d. Moved up the east shores of the Black Sea and into the Kuban river, and naturally from there to the Samara region. That is how it got to the North Caucasus.

e… moved south to… well that is another story. Check Merimda/El-omari aDna…

Olympus Mons said...

If Reich has over 1000 samples sequenced he for sure knows what the results of those samples are. And if those results had confirmed his preferred narrative there would not be this drought of papers. That is not how humans work.

So the reason why we are having this aDNA drought is because results are creating a conflict.

I wonder what that/those conflicts are!

Davidski said...

Conflicts, or rather disagreements, shouldn't result in any delays, because they can publish papers that argue for several different scenarios and let the readers decide which scenario works best.

I think the problem is more straightforward: when you have so much data, it's more difficult to explain it all and easier to make mistakes.

In that last paper, they seemed pretty confident that Khvalynsk and Yamnaya had ancestry from Iran, but they didn't look at the mtDNA, which is typically South Caspian in the ancient Iranian groups, and nothing like that in any of the Bronze Age steppe groups.

That was a massive oversight, despite the huge team of experienced people they had working on the project. When we get more relevant data, I will go back to this issue, because I almost fell off my chair when I saw their conclusions.

Olympus Mons said...

It can be that too. - That is also how humans (the good ones) operate. :)

Rob said...

@ FrankN

I was not aware that the Bouckaert position was "majority". Quite the opposite from my recollection. I have no strong position (or the required skills) to say which is right. I think it would be a hard task to convince us that PIE should be dated to 5300 BC (say) instead of 4500BC (and vice-versa). How can such precision be claimed to be achieved ?

Anyhow, my opinion is that the simple trees we have traditionally miss the real picture of IE dialect formation.

Aram said...


"I am stating that Herodotus was not simply tripping out when he stated that sentence, but did so because he had heard of such things, or adapted a previous story to explain something, etc."

* For example he can repeat what is basically said in Assyrian texts. There is an Assyrian text with the phrase Mitas of Mushki, basically meaning Phrygians are called Mushki. Mushki were probably also related to Armenians, which creates an amalgam of Phrygians and Armenians.
* There is some evidence that the Mushki were a vassal or client state of Phrygia. And the real meaning of that phrase is that Armenians are vassals or clients of Phrygia, not colonists.
* In NW Anatolia there were Mysians, whose name in Greek is like Musoi. I have little doubt that this name is related to Mushki, where the final -ki is a suffix. The Mysians' language is unknown. Probably initially they were Anatolians who switched to Phrygian, thus creating a reason why Assyrians would call Phrygians Mushki.

Etc etc.

**All I am stating that compared to the MLBA, there is a shift in Armenians to Anatolian type farmer ancestry. **

Yes I agree with you. It is an obvious shift, no need to deny that. But understanding that shift is impossible without BA/IA ancient DNA, at least from Anatolia. What I am saying is that Occam's razor says that the most probable origin of that shift is a proximate place. A more western type of Kura Araxes will fit the bill. And I said why I think that the existence of such a more Anatolian farmer rich KA is very possible in the western parts. Extrapolating data from the South Caucasus to large parts of North West Asia can be misleading.
Will find time to look more into this after the holidays.

Btw speaking about LBK. In Alberto's spreadsheet Iranians have more of that Esperstedt_MN. Andronovo effect? How is that?

Aram said...

Speaking about PIE age. In a recent report in Russian Academy of Science it was declared that PIE is older than Yamna thus Yamna can't be a start.

Davidski said...

In Horse, Wheel and Language, Anthony speculates that Khvalynsk was early PIE, while Yamnaya was late PIE. I've seen other authors claim that Sredny Stog was early PIE.

So Yamnaya is not generally accepted as the earliest stage of PIE, and indeed it may not have been PIE at all, but rather one of the early branches.

Rob said...

@ Aram

Ah yes I recall reading that somewhere: it could be a copy or adaptation from Phrygians, who have far more solid links to the central Balkans. Anyhow, you didn't say your opinion on the relative links of Armenian to Iranian & Greek, Slavic, etc.

And i look forward to more aDNA from the Caucasus, Armenia, etc. It'll be very interesting

Ric Hern said...

How accurate is the description of the Wessex Culture in Wikipedia ?

Ric Hern said...

The reason I ask is the far reaching trade between Wessex, Germany, the Baltic, even as far as Greece. Could there have been a later standardisation of some Western Indo-European words through trade ? This may create the illusion of some shared words before PIE dispersal ?

postneo said...

there are 14 frames in the animation
Only the last 3 have any semblance of empirical support.

They should color code the frames with empirical linguistic evidence differently so that people don't get misled.

a said...


As everyone continues in the peer review of the merits of Joshua Jonathan's Yamnaya map/work....never ending PIE origin saga...
Would it be possible to make a R1b-Northern-Neanderthal-Rich calculator? Using exclusive/only R1b samples from Europe, as they match/share components with many modern day R1b populations in Europe, that have Neanderthal ancestry?
Some possible samples of R1b that could be used that all European R1b have in common with Neanderthal and shared components within R1b samples-

Ancient Eurasia K6 Oracle results:
gedrosia K6 Oracle
Kit M236020
Admix Results (sorted):
# Population Percent
1 West_European_Hunter_Gartherer 94.39
2 Sub_Saharan 1.84
3 East_Asian 1.74
4 Natufian 1.22
5 Ancestral_South_Eurasian 0.81

7.5k+/-YBP-Els Trocs R1b
Ancient Eurasia K6 Oracle results:
gedrosia K6 Oracle
Kit M641265
Admix Results (sorted):
# Population Percent
1 Natufian 54.78
2 West_European_Hunter_Gartherer 45.22

7.5k+/-YBP-Samara Hunter Gatherer R1b
Ancient Eurasia K6 Oracle results:
gedrosia K6 Oracle
Kit M218547
Admix Results (sorted):
# Population Percent
1 Ancestral_North_Eurasian 71.22
2 West_European_Hunter_Gartherer 28.78

6.5k+/-YBP-Khvalynsk R1b
Ancient Eurasia K6 Oracle results:
gedrosia K6 Oracle
Kit M340431
Admix Results (sorted):
# Population Percent
1 West_European_Hunter_Gartherer 47.06
2 Ancestral_North_Eurasian 39.19
3 Natufian 7.02
4 Ancestral_South_Eurasian 6.73

5.3K+/-YBP-Yamnaya R1b
Ancient Eurasia K6 Oracle results:
gedrosia K6 Oracle
Kit M343758
Admix Results (sorted):
# Population Percent
1 Ancestral_North_Eurasian 40.89
2 West_European_Hunter_Gartherer 32.88
3 Natufian 21.86
4 Ancestral_South_Eurasian 4.08
5 East_Asian 0.29

Davidski said...

It wouldn't work. There's no connection between R1b and the data used to make calculators. They're on different chromosomes.

a said...

"It wouldn't work. There's no connection between R1b and the data used to make calculators. They're on different chromosomes."

How about a new calculator, one just for European R1b? One that doesn't mix all the ydna lines with non-autochthonous geographical regions where R1b/Neanderthal have not been found. It's turning into some genetic fruit salad cocktail mix- free for all?
It's getting a tad boring, all these years trying to convince European R1b men they come from Africa/Levant/ and or Iran, with no ancient samples/evidence.
For example, "Villabruna like" component K7 calculator- why not use "Bichon like"- since IJ-M429 has more affinity with the "Basal" component region[Iran and Levant affinity IJ-M429 33k +/- Southern Italian Sample]; Villabruna "exactly" R1b-L754 line from Italy has around 3% Neanderthal+/- Something not seen in the region/ with the moniker named "Basal rich"[so far]and or Sub-Saharan region.
The above 5 R1b samples found in Europe span 9k+/- YBP and have Northern Neanderthal rich admixture[R1b-L754's and R1b-Hunter Gatherer are rather "Basal" poor]; however the samples have all the components of modern day R1b Europeans.

I'm an exotic branch R1b, and I'm pretty sure that if I have all the elements of the above European R1b samples with Neanderthal- that all R1b throughout Europe will also.

Admix Results (sorted):

# Population Percent
1 West_European_Hunter_Gartherer 43.38
2 Natufian 33.42
3 Ancestral_North_Eurasian 21.07
4 East_Asian 1.64

Olympus Mons said...

Well, if things are so slow, here is something from Anna Dybo

It tells brilliantly the story of a non steppe PIE.

"...Based on these juxtapositions for a number of proto-lexical microsystems, the following conclusions can be proposed. The peculiarities of the landscape-related lexicon in both families are as follows. First of
all, the steppe must be excluded from the regions potentially inhabited by Proto-Indo-Europeans. Some relatively high mountains with many kinds of rocks and sharp or big stones are present. Some of these mountains are covered by forests. There are words for narrow passages, canyons, precipices, mines and caves, foothills, valleys and dells, meadows in forests and on the river-banks. The rivers have fords and are definitely smaller than their Proto-Altaic counterparts (there is no semantic variation between “river” and “sea”; nota bene that the only trace of the name of flood is GA; the lower Danube?); cf. here the noticeably weaker function of fish in the Indo-European economy (expressed in a substantially smaller number of terms for fishing tools, fish body parts and fish species – see the example below). But they could have lived near a sea or a big lake with sandy banks."

H/t to MarkoZ at Eupedia.

well, anyone who checks the area btw caucasus and lesser Caucasus, passages from Georgia to Armenia, kura river... that description fits like a glove in Shulaveri Shomu, you know those guys where the earliest plough, made of antler, was found?

Olympus Mons said...

…And something that always gets people going… Bell beakers.

This STASO FORENBAHER work from late last century, really tells the marvelous story of large bifacial flaked dagger and points in the heart of pre-Bell Beakers Zambujal and VNSP Portugal where nonfunctional, ritualistic and Symbolic …and completely useless… long daggers were made in a specific artisan location for the purpose of an elite. The artisan place (not a settlement) was Arruda dos Pisôes.
Those blades were useless for functional purposes. Hence they were only found at burials.
Just something to add to the “Copos” they also made on their way to become bell beakers.

Just fine reading on a slow time of the year.

PS: that image of a Hollow Alcalar arrow bifacial flaked (8). Those are exactly the same as the arrows found in Merimde, called Hollow based projectile point from the Fayum A… wonder where those were also found between 4000bc and 3000bc apart from these two places. Can anyone point me to other places where those were made?

Samuel Andrews said...

BlogPost: South Asia's West Eurasian mtDNA

If India West Eurasian ancestry is Iran Neo Bronze age Steppe, they have lots more Iran Neo mtDNA. There are interesting matches between Iran Neolithic mtDNA and Steppe mtDNA with modern South Asia. U5a1b1 for example was found in Bell Beaker Spain and a Tarim Mummy and modern India. It was originally pretty much only in Russia and expanded to the western and eastern edges of Eurasia.

Olympus Mons said...

Just to be precise...U5a1b1 or U5a1?
U5a1 was found in Portugal 1000 years prior to bell beakers at 3750BC... but U5a1b1 can't seem to find any prior to 2500bc when bloody Bell beakers out of Iberia met the CWC in Bohemia/Germany.

Samuel Andrews said...

@Olympus Mons,

U5a1 and U5a1b1 were found in Spanish Bell Beaker. Spanish Bell Beaker had mostly local Iberian mtDNA. Their U5a1s though are a sign of Steppe ancestry. U5a1b1 has been found in a Tarim Mummy, bell beaker Germany, Corded Ware Germany, Unetice Poland, and early Bronze Age Ireland. So yeah it looks like a Steppe lineage.

There's no way to determine whether German Bell Beaker had Iberian ancestry or not because Neolithic Iberians were essentially identical to Neolithic Central/East Europeans. Alright. Their R1b-P312 isn't from Corded Ware, it's from Yamnaya or another related people. The high H1, H3 in German Bell Beaker is evidence of Neolithic Iberian ancestry but not definitive evidence.

Unknown said...

H1 and H3 were in Central Europe 1000 years before Bell Beaker.

Samuel Andrews said...

They were, but they were probably more frequent in Iberia and France. ~80% of Neo Iberian and French H was H1 and H3. That's a higher ratio than any modern Europeans have.

Not much Neo Central/East Euro H has been tested for H subclades. However, 40 from Early Neo Hungary were, and only 15% of their H was H1 and 0% was H3.

So the H1 and H3 in German Bell Beaker can be used as evidence of Iberian ancestry.

Chad said...

Not when Funnelbeaker has no Iberian ancestry. Rarecoal says basically no Iberian in them. The hits with the MN in BB is strongest with Germany and Central Europe MN.

FrankN said...

@Ric Hern: “How accurate is the description of the Wessex Culture in Wikipedia ?”
Considering that the latest source listed in the English article dates from 1971, why do you even ask…
German WP, as so often, is more up to date, incorporating some continental literature, but still lacks recent reviews, most notably A.M. Martin “Archaeology beyond Postmodernity: A Science of the Social”, Chapt. 6 (available as Google Book). Needham e.a 2010, unfortunately pay-walled, should also be a good source:

Here my understanding of the current knowledge:

a) Chronology: There is insufficient AMS dating for absolute dating. Available dates (e.g. Amesbury archer, ca 1,700 BC) point to the EMBA. The traditional distinction between Wessex I and II doesn’t hold up to AMS dating (Needham 2010). Acc. to German WP, Wessex burials align well with rich EBA (Reinecke A1) burials as known from Brittany, Singen (S. Black Forest), and Unetice (Leubingen, Helmsdorf). Currently available information suggests the start of the Wessex culture by 2,000 BC or possibly sometime later (Hilversum Culture from 18th cBC). Relation to Middle Rhine Beakers (early 3rd mBC), as suggested in the English WP article, would clearly be anachronistic.

b) Character/ origin: To which extent the Wessex Culture sets forth late BB traditions is debated. Martin argues for an intrusive character, especially as concerns the widespread change to cremation, and various cases of Wessex cremation barrows intruding onto late BB inhumation sites, and vice versa (implying that Wessex intruders coexisted with the “native” BB population for some time, but there was quite a “clash of cultures”). Cremation isn’t unknown from continental 3rd mBC contexts: Swiss CW, e.g., seems to have mostly used it; also, the scarceness of Finnish Battle Axe Culture burials relative to settlement finds is commonly attributed to cremation. The cremating culture closest to England was the Schönfelder Group along the Lower Middle Elbe (Magdeburg to Hamburg). [I’d say aDNA could help in this respect, but it will probably take quite some time until aDNA can be extracted from cremations, if ever.]

c) Cultural Links: Wessex was quite “cosmopolitan”, with linkage to Ireland, Scotland, Wales, and the rest of England, Brittany, the Dutch Hilversum Culture, and Germany. Very prominent is the connection to Unetice. The Nebra Sky Disk was manufactured using Cornish tin, and displays the same astronomical concept as does Stonehenge. Unetice is attributed a major role in channeling Cornish tin and Baltic amber to the Aegean (Danubian Cultures such as Mako and Otomany serving as further relays), and seems the most likely path for Aegean goods having reached England [“Mycenaean” faience beads mentioned in the English WP article, however, seem to have been of local manufacture, and were also found in Brittany and the Netherlands].

FrankN said...

Ric: Your related question; “ Could there have been a later standardisation of some Western Indo-European words through trade ?” is a good one. In fact, one might even go further and ascribe a good part of the expansion of IE to trade, especially in metals. Several linguists have credited Unetice with a major role in spreading IE around Central Europe (plus Britain?). BB, also essentially a trade network, may have assumed a pioneering role in this respect. [The disparate distribution of Celtic, during the early IA documented from NW Iberia and the Rhone/ W. Alps, separated from each other by non-IE Vascones and Iberians, necessitates sea-bound connections as seem to have existed during the BB period].

Ric Hern said...

Thank you very much Frank.

Ric Hern said...

Yes it seems to me that there were many consecutive standardisations that blanketed the European population since the First Indo-Europeans arrived at least from Unetice Culture onward...Urnfield, etc.all quite wide spread.

Aram said...


The most comprehensive work on Armenian and Greek relations is the book of James Clackson

The Linguistic Relationship Between Armenian and Greek (Publications of the Philological Society)

And a review

Devastating critique of the notion that Greek and Armenian form an IE subgroup, and good intro to Armenian IE issues

By Christopher Culver TOP 1000 REVIEWER on July 14, 2014

In the history of comparative Indo-European linguistics, scholars (such as Meillet and Pedersen) have occasionally posited that Greek and Armenian form a subgroup within Indo-European, pointing to various lexical, phonological and morphological features held in common. This 1994 monograph by James Clackson, which originated in his Cambridge PhD thesis, weighs the evidence. Clackson's ultimate conclusion is that Armenian and Greek are not especially closely related within the Indo-European family.

And here is the opinion of Ringe on this book. David Anthony relies on Ringe's tree.

"But though the quantity of the evidence is comparable to that for Italo-Celtic, the absence of any phonological or inflectional character makes it qualitatively poorer. In sum, though we think that Clackson (1994) has overstated his case in denying any evidence for Graeco-Armenian, we readily admit that the evidence is disappointingly meagre; in effect, he and we seem both to be quite close to the line that divides our positions, even though we are on opposite sides of it."

Aram said...

This is a good FAQ about Armenian language prepared by Luc Vartan Baronian.
All most frequent questions are treated. Even the question of Basque Armenian relations.

Some citations.

The third possibility is the dialect continuum hypothesis, which appears to be most likely to most specialists. Under this view, Armenian separated early on from the other IE languages, but remained in close contact with Greek, Phrygian and Indo-Iranian. At the time, these languages must have been separate, but still mutually intelligible, such that it was possible for them to undergo common changes. Armenian underwent some innovative changes with Greek and Phrygian, but others with Indo-Iranian.[14]


Nevertheless, there remains a set of common features between Armenian and Indo-Iranian that, if they are due to contact, must be very old (before the two branches were too differentiated). Particularly striking are the common features of Armenian and Indic that are lacking in Iranian.


As for further aDNA: yes, there are plans (that are currently moving forward) to have dense sampling of the Iron Age, pre and post Urartean period. Also the specific Urartean sites, especially their elite burials, settlers and people who were deported by Urarteans to the Ararat Valley from south-western places. So we can have a hint of who was living there even without having samples from IA Eastern Turkey.

Seinundzeit said...

I just wanted to give a shout-out to Sangarius, a member at Anthrogenica. Due to some work he's done, one can now produce sensible "basal" models with the PCA data.

For the record, I totally understand what Huijbregts was trying to say (in his conversation with Sangarius), he was/is absolutely right.

But, I approach things from a Peirce-James-Dewey perspective, if that makes sense. Basically, if it works, it is what we should do. And, it seems that weighing the PCs by their eigenvalues works very well, a clear improvement in the results.

Again, I understand why this can be construed as problematic, but again, the results are now much more sensible, for many populations.
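For anyone who wants to see the idea in miniature, here is a rough Python sketch of weighting PCs by their eigenvalues before fitting a mixture. This is not Sangarius' actual modification of nMonte (which is an R script); the function names, the toy coordinates, the eigenvalues and the simple grid-search fit are all assumptions made up for illustration. Whether the weights should be the eigenvalues themselves or something like their square roots is also an assumption here.

```python
import math

def weight_coords(coords, eigenvalues):
    """Scale each PC coordinate by its eigenvalue (assumed weighting scheme)."""
    return [c * w for c, w in zip(coords, eigenvalues)]

def distance(a, b):
    """Plain Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_two_way(target, src_a, src_b, steps=1000):
    """Grid-search the share of src_a whose mix with src_b lies closest to target."""
    best_share, best_dist = 0.0, float("inf")
    for i in range(steps + 1):
        share = i / steps
        mix = [share * a + (1 - share) * b for a, b in zip(src_a, src_b)]
        d = distance(target, mix)
        if d < best_dist:
            best_share, best_dist = share, d
    return best_share, best_dist

# Hypothetical toy data: three PCs with decreasing eigenvalues.
eigenvalues = [10.0, 4.0, 0.5]
src_a = [0.10, 0.00, 0.02]
src_b = [-0.10, 0.00, -0.02]
# PC1 of the target implies a 70/30 mix; PC3 carries a conflicting signal.
target = [0.04, 0.00, -0.02]

plain_share, _ = fit_two_way(target, src_a, src_b)
weighted_share, _ = fit_two_way(
    weight_coords(target, eigenvalues),
    weight_coords(src_a, eigenvalues),
    weight_coords(src_b, eigenvalues),
)
print("unweighted share of src_a:", plain_share)
print("weighted share of src_a:  ", weighted_share)
```

In this toy case the weighted fit lands closer to the share implied by PC1, because the low-eigenvalue dimension has less pull on the distance. That is the effect described above, and also the reason it can be seen as throwing away information.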

Here are some examples, a few North Asian/Siberian/Native American populations. Without tinkering with the 10 dimensions, here is what one gets, if one tries to model these people with basic/divergent ancient populations:


55.6% Ami
24.7% MA1
19.7% Natufian



38.8% MA1
34.7% Ami
26.5% Natufian



61.1% MA1
25.6% Natufian
13.3% Ami



65.3% MA1
21.9% Ami
12.7% Natufian



72.0% MA1
23.1% Natufian
4.9% Ami


None of these results even remotely resemble what we've seen in the scientific literature. Construing these populations as primarily MA1 + Natufian, with varying amounts of East Asian, makes no sense.

Now, here is what happens with the same 10 dimensions, and the same reference populations, if we take advantage of Sangarius' work:


86.2% Ami
13.8% MA1



89.0% Ami
11.1% MA1



72.8% Ami
27.2% MA1



60.9% Ami
39.1% MA1



58.2% MA1
30.1% Ami
11.7% Bichon


Perfection, everything matches what we've seen in the scientific literature!

A random North African example.



61.95% Natufian
14.85% Gambian
14.75% Bichon
4.80% Iran_Neolithic
3.10% Onge
0.55% Dinka




68.60% Natufian
15.45% Gambian
13.80% Bichon
2.15% Iran_Neolithic


The weighted result is much cleaner, and has no unexpected Onge percentage.

A European example.



51.5% Natufian
33.6% Bichon
10.6% Iran_Neolithic
4.4% Ami




50.4% Bichon
49.6% Natufian


Again, it just looks much more sensible.

Seinundzeit said...

Also, Sangarius' modification allows one to produce a cline seen with the d-stat data, in the case of South Asia/Central Asia.

As per some d-stats Chad ran for me, long ago (during a conversation we had here at this blog), South Central Asians prefer East Asian populations to the Onge, even when compared to West Asians, while South Indians prefer the Onge when compared to West Asians.

In addition, a few months ago I created a basal test with one of David's d-stat sheets. It had a "Basal Eurasian" simulation (created on the assumption that the Natufians are 50% Basal Eurasian, which may/may not be true), along with the Onge, Ami, Yoruba, and a bunch of ancient West Eurasian hunter-gatherers. Here is a demonstration of the South Asian-to-South Central Asian ENA cline, using that d-stat sheet, and then showing the weighted PCA-based results for comparison:


80.2% Onge
11.2% Ami
6.95% AG3-MA1
1.65% Villabruna


59.1% Andamanese_Onge
27.4% AG3-MA1
8.7% Basal Eurasian
4.5% Villabruna
0.3% Ami


49.45% AG3-MA1
22.25% Basal Eurasian
14.85% Villabruna
13.45% Ami
0% Onge


52.65% AG3-MA1
22% Basal Eurasian
13.85% Villabruna
11.5% Ami
0% Onge

Weighted PCA


69.8% Onge
20.5% Ami
6.2% Iran_Neolithic
3.5% MA1



65% Andamanese_Onge
25% Iran_Neolithic
10% MA1
0% Ami



53.0% Iran_Neolithic
24.2% MA1
17.9% Bichon
5.0% Ami
0% Onge



49.7% Iran_Neolithic
30.9% MA1
16.1% Bichon
3.2% Ami
0% Onge


We used different kinds of data, yet we see the same cline, and qualitatively similar results.

Looking at these examples, I think it would be safe to recommend Sangarius’ modification of nMonte.

Alberto said...


Thanks for pointing out this discussion. It's something I'm interested in and have looked into, and it's always good to see others thinking about it too.

It seems I was late and missed the details (and the modified script), but my impression is that this is not the right approach. It tries to solve a problem that is not our main problem, and it does so in the wrong way (I think). Indeed, giving more weight to the lower dimensions will get better results if you, for example, try to model a Turkish as a 2-way admixture between Yoruba and Loschbour. But the right approach for this would be to just use the first dimension, rather than to "smash" the information in the higher dimensions so that it doesn't create "noise" in that model.

In realistic models, we really want that information in higher dimensions. Getting rid of it will make things look cleaner, but the results will be less realistic (I think - I didn't try the implementation and testing always has the last word).

This seems to be more a problem of "too many dimensions" for the basic models he's trying. But for finer details, those extra dimensions are very valuable (I actually wanted to give them more weight, rather than less, though after some basic testing I decided that the standard Euclidean distance is the best compromise - unless someone wants to do something really fancy and complicated which I wouldn't know how to do to test it).
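The options discussed here (standard Euclidean distance, using only the first dimension, and down-weighting the higher dimensions) can all be written as one weighted distance with different weight vectors. A minimal Python sketch, with made-up coordinates and weights that are purely illustrative and not taken from any real PCA:

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with a per-dimension weight on each squared difference."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Two hypothetical populations that differ only on the last (highest) dimension.
pop_a = [0.10, 0.03, 0.02]
pop_b = [0.10, 0.03, -0.02]

uniform = [1.0, 1.0, 1.0]        # standard Euclidean distance
first_only = [1.0, 0.0, 0.0]     # keep only the first dimension
down_weighted = [1.0, 0.5, 0.1]  # assumed weights favouring lower dimensions

for name, w in [("uniform", uniform),
                ("first_only", first_only),
                ("down_weighted", down_weighted)]:
    print(name, weighted_distance(pop_a, pop_b, w))
```

With only the first dimension the two populations become indistinguishable, and with down-weighting their difference shrinks. That is the fine detail that is at risk when higher dimensions get less weight.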

So I would check with the most realistic models and where certainties about the results are higher (basically European populations, where we have better references and knowledge of them) to see how it affects those models.

War Lord said...

I wonder, who created this bold map? The fact that Tocharians and Indo-Iranians spread from the Pontic steppe and the Corded Ware people were Indo-European is not so surprising. Therefore, the new genetic evidence for the steppe invasion to Central Europe solves nothing, because something like this was already anticipated. We still have the burdensome Hittites and Luwians in Anatolia. Until we get some genetic evidence for the invasion of the Yamnaya to Anatolia, the Pontic origin of Indo-Europeans is only an impetuous hypothesis.

What may be surprising is the magnitude of the steppe admixture, which I still doubt, because despite the influx of R1a Corded Ware, the Central European R1b-U106 and I-M170 guys soon gained the upper hand and R1a was reduced to insignificance. The most important thing that they took from the Corded Ware invaders was lactose tolerance, which subsequently enabled their expansion - indirectly even to Western Europe, via the Bell Beakers, who mixed with them in the Rhine valley.

War Lord said...

Unless we get some evidence for the Yamnaya invasion to Anatolia, I will stick to my assumption that the origin of Indo-Europeans was in Anatolia, and the Indo-European language spread to Eastern Europe via the Cucuteni-Tripolje culture and its wagons.

And for those, who say that the Yamnaya and Afanasievo people don't have any Neolithic admixture: The samples of the Yamnaya men belong almost exclusively to R1b aborigines from the Volga basin, and something similar applies for Afanasievo (Tocharians), who probably expanded from the Lower Volga area. Therefore, these people could not be in any direct contact with the Near Eastern farmers in western Ukraine, and they could have adopted the Indo-European language via cultural diffusion.

Indo-Iranians (Andronovo and Sintashta) DO HAVE Near Eastern Neolithic admixture (about 15-20%). While some people interpret it as their relatedness with the Corded Ware people, I think that this genetic profile represents the true Yamnaya people in the Pontic steppe.

We will see. I am waiting for genetic samples from the Bronze Age Anatolia and for Y haplogroups of Tocharians.

Davidski said...

Your theories aren't based on any evidence, just your own personal preferences.

Early farmers could not have been Proto-Indo-Europeans, because key words related to farming in Indo-European languages have non-Indo-European origins.

Also, farmer admixture in Bronze Age steppe groups looks like it came via female mediated gene flow, which suggests that the non-Indo-European words related to farming came with non-Indo-European farmer women who mixed with Indo-European males of steppe origin.

See here for more details...

Ebizur said...

Munda peoples' Y-DNA as per Chaubey et al. (2011):

2/42 = 4.8% H-M69
40/42 = 95.2% O-M95

4/54 = 7.4% F-M89(xM69, M304, M9)
48/54 = 88.9% O-M95
2/54 = 3.7% P-M45(xR-M207)

3/27 = 11.1% H-M69
24/27 = 88.9% O-M95

Birhor (Chhattisgarh)
2/27 = 7.4% H-M69
24/27 = 88.9% O-M95
1/27 = 3.7% R1-M173

Birhor (Maharashtra)
3/35 = 8.6% H-M69
31/35 = 88.6% O-M95
1/35 = 2.9% R1-M173

1/45 = 2.2% C-M216(xM38, M217)
1/45 = 2.2% F-M89(xM69, M304, M9)
2/45 = 4.4% H-M69
2/45 = 4.4% J2-M172
38/45 = 84.4% O-M95
1/45 = 2.2% P-M45(xR-M207)

Mawasi (Madhya Pradesh)
1/12 = 8.3% C-M216(xM38, M217)
10/12 = 83.3% O-M95
1/12 = 8.3% R1-M173

South Munda (Bonda, Gadaba, Juang, Kharia, & Savara) total
1/181 = 0.6% C-M216(xM38, M217)
13/181 = 7.2% F-M89(xM69, M304, M9)
15/181 = 8.3% H-M69
3/181 = 1.7% J2-M172
141/181 = 77.9% O-M95
4/181 = 2.2% P-M45(xR-M207)
4/181 = 2.2% R1-M173

Mawasi (Jharkhand)
2/27 = 7.4% F-M89(xM69, M304, M9)
4/27 = 14.8% H-M69
2/27 = 7.4% J2-M172
18/27 = 66.7% O-M95
1/27 = 3.7% R1-M173

North Munda (Asur, Ho, Mawasi, Mahali, Santhal, & Birhor) total
4/286 = 1.4% C-M216(xM38, M217)
16/286 = 5.6% F-M89(xM69, M304, M9)
41/286 = 14.3% H-M69
16/286 = 5.6% J2-M172
187/286 = 65.4% O-M95
3/286 = 1.0% P-M45(xR-M207)
14/286 = 4.9% R1-M173
5/286 = 1.7% R2-M124

1/88 = 1.1% C-M216(xM38, M217)
7/88 = 8.0% F-M89(xM69, M304, M9)
15/88 = 17.0% H-M69
6/88 = 6.8% J2-M172
53/88 = 60.2% O-M95
5/88 = 5.7% R1-M173
1/88 = 1.1% R2-M124

2/21 = 9.5% F-M89(xM69, M304, M9)
5/21 = 23.8% H-M69
11/21 = 52.4% O-M95
1/21 = 4.8% P-M45(xR-M207)
2/21 = 9.5% R1-M173

1/37 = 2.7% C-M216(xM38, M217)
7/37 = 18.9% F-M89(xM69, M304, M9)
5/37 = 13.5% H-M69
3/37 = 8.1% J2-M172
18/37 = 48.6% O-M95
1/37 = 2.7% P-M45(xR-M207)
2/37 = 5.4% R1-M173

3/20 = 15.0% F-M89(xM69, M304, M9)
5/20 = 25.0% H-M69
1/20 = 5.0% J2-M172
7/20 = 35.0% O-M95
1/20 = 5.0% P-M45(xR-M207)
2/20 = 10.0% R1-M173
1/20 = 5.0% R2-M124

1/32 = 3.1% C-M216(xM38, M217)
3/32 = 9.4% F-M89(xM69, M304, M9)
10/32 = 31.3% H-M69
5/32 = 15.6% J2-M172
6/32 = 18.8% O-M95
1/32 = 3.1% P-M45(xR-M207)
3/32 = 9.4% R1-M173
3/32 = 9.4% R2-M124

Ebizur said...

According to YFull,

O-M268 > O-P49 vs. O-K18 28,500 [95% CI 26,100 <-> 30,900] ybp
O-K18 > O-CTS10887 vs. O-PK4 22,100 [19,900 <-> 24,400] ybp
O-PK4 > O-F838 vs. O-M95 12,500 [11,000 <-> 14,100] ybp
O-M95 > O-CTS10007 vs. O-M1310 10,800 [9,500 <-> 12,200] ybp

O-P49 is now found most commonly in Japan and Korea, though it also has been observed occasionally in samples from China, Mongolia, southeastern Siberia, Vietnam, Micronesia, etc.

O-CTS10887 has been found in Han Chinese (including the purported descendants of Cao Cao, an infamously diabolical chancellor and warlord of the final years of the Han Dynasty, who was enfeoffed as King of Wei and ushered in the Three Kingdoms period of Chinese history), Vietnamese, Dai, and Japanese. A great deal of ancient substructure (perhaps even of Paleolithic time depth) within this clade has been passed down to the present, though it does not include a large proportion of the population anywhere (except perhaps among the Yao people of Bama Yao Autonomous County, Guangxi, where O-M268(xP49, M95) has been found in about 20% of samples, though those may belong to O-F838 instead).

O-F838 or O-PK4(xM95) has been found heretofore in Han Chinese (cf. YFull and Shi Yan, Chuan-Chao Wang, Hui Li, et al., "An updated tree of Y-chromosome Haplogroup O and revised phylogenetic positions of mutations P164 and PK4," European Journal of Human Genetics (2011) 19, 1013–1015).

O-CTS10007, the rarer of the two primary branches of O-M95, also has been found in two Southern Han according to the current version of the YFull tree.

Thus, currently available data suggest that all the males in those Munda peoples of India who belong to O-M95 are patrilineal descendants of some member of a Neolithic culture that flourished on territory that now belongs to China. Perhaps Pengtoushan, whose archeological remains, including cord-marked pottery, soybeans, and rice (it has perhaps the most thorough traces of the development of rice cultivation), have been found around the middle reaches of the Yangtze River.

Why is this apparently Neolithic "Chinese" ancestry so strongly overrepresented on the Y-chromosome in comparison to the autosomes? For example, please consider the case of the Bonda:

69.8% Onge
20.5% Ami
6.2% Iran_Neolithic
3.5% MA1

2/42 = 4.8% H-M69
40/42 = 95.2% O-M95

For your estimates to be close to reality, their male ancestors must have married out to such an extent that it equates to (roughly) completely replacing the female members of the tribe with genetically distinct females on two separate occasions (plus a little more besides that, in fact). That seems like quite a feat for a rather primitive tribal population that is known for its hostile attitude toward foreigners and "strong women."
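
The "two replacements, plus a little more" figure can be checked with one line of arithmetic. This is only a rough sketch: it assumes the Ami share (20.5%) stands in for the autosomal ancestry brought by the O-M95 founders, that the founders started at 100% of that ancestry, and that each complete replacement of the tribe's women halves it. None of these assumptions comes from Chaubey et al.

```python
import math

# Autosomal share attributed to the East Asian-like source (the Ami figure above).
ami_share = 0.205

# Assumption: every full replacement of the women by outside women halves the
# founders' autosomal share (1 -> 0.5 -> 0.25 -> ...), so we count halvings.
halvings = math.log2(1 / ami_share)

print(f"{halvings:.2f} halvings")  # 2.29 halvings
```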

I am not really challenging your results; I am just a bit troubled by the discordance between the Munda peoples' Y-DNA and their autosomal DNA.

huijbregts said...

@ Alberto
This time we are on the same side, you even have some praise for the Euclidean distance!
I think your ideas about the higher dimensions are sound. Have you experimented with placing (some of) the Africans in an outgroup (by dropping them from the dataset) and recalculating the PCA scores?
That way you will get more variance in the higher dimensions. You can do the same with the Austronesians. Of course you can use this trick only for West_Eurasian models.

Shaikorth said...

Austroasiatic tribal Y-DNA could have a strong founder effect, though we don't know for sure until there are some results on Yfull.

Sein, how do Santhal, Savara and Asur fit with the PCA model?

Seinundzeit said...


I think you make excellent points.

And from a mathematical perspective, I definitely see why this can be construed as problematic.

But, I'm looking at this issue via instrumentalist lenses. For example, when I restrict myself to only the first 2 or 3 dimensions (non-weighted), and try some basal models, the results are still nonsensical. Same as using 10 dimensions without weighting. I'll provide some additional examples (hopefully, today).

By contrast, basal models done using Sangarius' modified script are always exceedingly sensible, and always perfectly concordant with estimates in the scientific literature and/or analyses we've seen here at this blog.

So, just taking a purely pragmatic perspective, weighing the PCs looks like a solid idea.

I think I should try to do models that use much more genetically proximate (and more contemporary) populations, and see what happens.

My prediction is that perhaps with more proximate reference populations, weighing might lead to less realistic results, while weighing in conjunction with more diverged/basal reference populations definitely leads to far more realistic results.

I'll give that a try, and report the results here.


No doubt, Austroasiatic populations in India definitely show evidence of East Asian ancestry.

Depending on the population, it ranges from 10% to almost 30%. But none of them, with the obvious exception of the Khasi, are predominately East Asian. Whatever ASI is, that is the largest component of their autosomal ancestry, and all of them also have minor West Eurasian admixture which ranges from only 10% to almost 35%-40%.

That's how they stack up with basically every method used to analyze autosomal data (with the goal of finding proportions).


Absolutely, I'll give those a spin as well, (hopefully, today).

huijbregts said...

"And from a mathematical perspective, I definitely see why this can be construed as problematic."

I doubt that you understand the consequences.
By weighting the dimensions, you destroy the PCA properties.
For instance, this transformation does not conserve the pairwise distances. It generates gigantic projection errors and distorts the geometry like a distorting mirror.
Populations that were close together may now become apart, and separate populations may now closely cluster. Also the proportion of the admixtures will dramatically change.
We may expect spectacular discoveries from nDistorter.

Davidski said...

There won't be any distortion if Sangarius weighted the PCs correctly.

The eigenvalues are mainly there so that we can create more realistic plots than just squares, in which all PCs plotted carry equal weight.

So if you want to create a more realistic plot using PC1 and PC2, you give PC1 a ratio of 38.86 and PC2 a ratio of 27.86, and you get a rectangle instead of a square.

In other words, the distortion is there when the plot is a square and when the PCs are not weighted.

I'm just talking about visuals here, but why would the concept be different for estimating ancestry proportions?

Ric Hern said...

Could the similarities of the beads rather point to a common origin shared between Mycenaeans and Wessex? Maybe within the Vucedol Culture or relations to the Wietenberg Culture?

Matt said...

To blunder in, yes, weighting (not all PCs have the same eigenvector scale) seems essential to preserve distances.

(So I have not had much use for unweighted PCA in building neighbour joining trees based on distance, etc, e.g. I found unweighted PCA gave really odd distances between Africans and other populations due to lack of lots of extra weight in the first PC then minimal differentiation in higher PCs.)

But the correct weighting should be the outcome of the PCA process and algorithm and the information that goes in, already in your data, and not just a sort of circular post-hoc "These are the weightings after semi-random tweaking that give the nMonte outcomes I expect, so they must be correct" adjustment process. I can't tell from this thread how this Sangarius person has done it.

Alberto said...

Yes, I don't think there's anything wrong per se in weighting the PCA dimensions by their eigenvalues. But shouldn't this be already implied in the values themselves (if the eigenvectors have the same scale, which I thought they did)?

I was just afraid that, while by doing this you'd get better models for very broad components, you might lose small details for the more difficult and realistic models (which depend on dimensions with lower variance). But I could be wrong, and that's why my attitude is rather the opposite of strongly advising against doing it: I actually encourage anyone with the inclination to test it (this or anything else) to go ahead and do it, because it might turn out to work better for all the models (and if it doesn't, you can just forget about it and nothing's lost).

Though for people to experiment and share the changes, it requires a free licensed script (Huijbregts made his position clear about this, so no one should tinker with nMonte).

Any news about the new versions of 4mix? I did write my own script for my tests, but since I never wrote any R script before I'm not confident in distributing it if a better version might come soon. Otherwise, for anyone willing to test things I'd gladly send the script so they can modify it as they want for their tests.

Grey said...


"...their male ancestors must have married out to such an extent that it equates to (roughly) completely replacing the female members of the tribe with genetically distinct females..."

or a dramatic selection process where females from one population had a dramatic advantage (at some point in time)

so possibly another candidate for cold-climate vs warm climate mtdna?

i.e. cold climate males and their cold climate females move into a warm climate zone and replacement local mtdna gets selected over time?

Davidski said...

Yes, I don't think there's anything wrong per se in weighting the PCA dimensions by their eigenvalues. But shouldn't this be already implied in the values themselves (if the eigenvectors have the same scale, which I thought they did)?

I don't know.

But clearly, if the PCs aren't weighted by their eigenvalues when creating plots, then the plots suggest that they're all of comparable weight, even though PC1 has almost 20 times the weight of PC10.

The question is, are we dealing with basically the same situation when calculating ancestry proportions from PCs?

Davidski said...

Interesting comment at Anthrogenica.

Just a word about this topic of weighting PCs with eigenvalues. I'm a mathematician (though not at all a statistician, far from it), and I've never (I emphasize: never) seen that done. PCAs are, as the Generalissimo said, a technique for representing variability data in usually 2 dimensions, sometimes 3. More dimensions would sometimes be useful, but we would have to change our universe, unfortunately. Anyway, even if most often the first 2 PCs summarize more than 80% of the total variability, a PCA representation is inevitably distorting (like any 2D or 3D representation of a higher dimensional object). Weighting a PCA would emphasize this distortion in an absolutely unrealistic way.

Off topic, and while I'm at it, I have worked with many data sets, comparing Global 10 with all the PCs and with only 5, 6 or 7 PCs, after a statistician friend had told me that it would be better to leave out the minor PCs. Honestly, I've not seen one case where those minor PCs cause problems.

I might try and look into this further when some relevant people get back from holidays.

Davidski said...

Those example plots I put up were sloppy as hell, so I got rid of them.

Essentially, what's really bugging me is why do all the models thus far done with Sangarius' modification actually make sense?

Have we seen any that don't make sense?

Seinundzeit said...


"Also the proportion of the admixtures will dramatically change."

Yes, as a matter of fact, the proportions do change dramatically. For the better!

To put it in the simplest terms possible, construing the Karitiana as mostly ANE and Levantine hunter-gatherer, with a mild dose of East Asian, is obviously incorrect.

By contrast, construing Karitiana as 60% East Asian and 40% ANE is nicely in line with the scientific literature.

Also, does this model of Lithuanians (with weighted PCs) seem unreasonable?

42.1% Yamnaya_Samara
33.4% LBK_EN
24.6% Bichon


Or this Polish result (again, weighted PCs)?

42.8% Yamnaya_Samara
38.0% LBK_EN
19.3% Bichon


Also, is this French_Basque result somehow problematic (again, weighted PCs)?

59.70% LBK_EN
23% Yamnaya_Samara
17.30% Bichon


Perhaps I’m blind, or just really stupid, but instead of seeing odd or nonsensical results, what I’m seeing is that these European populations are stacking up just like they should stack up, as per the recent scientific literature.

So, using weighted PCs is not a horrible idea. Surely, it is not the sort of thing that could make Gödel turn in his grave (lol). On the contrary, it seems to be a rather decent idea.

Now, let’s have a look at those complex/tough-to-crack populations from the far southeastern end of West Eurasia. Again, pretty sensible results, don’t you think (same reference populations here as the ones used on the Europeans, weighted PCs)?


45.55% Yamnaya_Samara
24.30% Iran_Neolithic
14.65% Iran_Chalcolithic
11.20% Onge
4.30% Oroqen


It’s pretty neat, right? As should be the case, they have around the same amount of Yamnaya-related admixture as Lithuanians/Polish people (in fact, a slight hair more of this kind of ancestry when compared to those two northern/eastern European populations), and the rest of their ancestry is primarily ancient Iranian-related ancestry, with a mild touch of ENA. It’s Lazaridis et al. all over again.

In the same neighborhood as the Kalash, these Afghan Pashtuns deserve a spin.

39.2% Yamnaya_Samara
35.30% Iran_Chalcolithic
10.3% Iran_Neolithic
8.95% Onge
6.25% Oroqen

Isn’t it interesting how they differ from the Kalash in precisely those ways in which one would expect them to differ from the Kalash, based on an understanding of Central Asian history and anthropology? But, I guess this is just a huge, probabilistically astounding coincidence.

Also, what about those Pamiri Eastern Iranians, another interesting cluster of peoples in the “South Central Asian culture area”, how do they get construed when using this horrible weighted PCs-thing we’ve been trying?

I mean, this should be interesting. Just like how the Kalash have long been hypothesized by serious anthropologists as direct descendants of “Aryans” or whatever from Central Asia, the Eastern Iranian peoples of the Pamir region have long been hypothesized as direct descendants of “Scythians” or whatever.


49.45% Yamnaya_Samara
31.85% Iran_Chalcolithic
8.15% Oroqen
7.20% Onge
3.35% Iran_Neolithic



50.1% Yamnaya_Samara
38.0% Iran_Chalcolithic
9.2% Oroqen
1.5% LBK_EN
1.1% Onge


And how could we forget about the direct descendants of the Sogdians?

47.20% Iran_Chalcolithic
42.6% Yamnaya_Samara
6.95% Oroqen
3.25% LBK_EN


At this point, I’m really puzzled. I mean, when will we start seeing weird/nonsensical results? So far, everything here would be approved by Tarski, if he was interested in this kind of thing (we all know he wasn’t a history/population genetics/anthropology kind of dude. Way more abstract stuff was his game, but you get my point).

To be continued...

Eren said...

Hi guys, this is Sangarius from Anthrogenica.

First off thanks to @SeinundZeit for the shout-out and tinkering with it. I really didn't do any validity testing after finishing it. Just tested some basal/Meta-population level modeling for myself and it looked reasonable enough. For me this is all really just procrastination at the moment, since I have to work on my thesis. So, I shared it so others could play around with it. I removed it on Huijbregts request, though.
The results you produced look quite good actually, didn't really know what to expect.

Anyway @all, the way I implemented the scaling is analogous to what Davidski did in his example. I took the Eigenvalues of the G10 PCA that Davidski posted on Anthrogenica. Then I created factors for the PCs by dividing the Eigenvalues of each PC by the Eigenvalue of PC1. So PC1 remains unscaled, the rest are rescaled according to how much variance they explain in proportion to PC1. The result looks like this:

Eigenvalues: 38.86, 27.86, 7.97, 4.07, 3.47, 3.15, 2.87, 1.89, 1.83, 1.78
Factors: 1.00, 0.72, 0.21, 0.10, 0.09, 0.08, 0.07, 0.05, 0.05, 0.05

distance = sqrt( ((p1-q1)*f1)^2 + ((p2-q2)*f2)^2 + ... + ((p10-q10)*f10)^2 )
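
For anyone who wants to reproduce the factors and the distance above, here is a minimal Python sketch. The eigenvalues are the ones quoted in this comment; the names `FACTORS` and `weighted_distance` are made up for illustration and are not taken from nMonte or from the modified script.

```python
import math

# Eigenvalues of the Global 10 PCA, as quoted above.
EIGENVALUES = [38.86, 27.86, 7.97, 4.07, 3.47, 3.15, 2.87, 1.89, 1.83, 1.78]

# Factor for each PC = its eigenvalue divided by the eigenvalue of PC1.
FACTORS = [ev / EIGENVALUES[0] for ev in EIGENVALUES]


def weighted_distance(p, q, factors=FACTORS):
    """Distance between samples p and q with each PC difference scaled by its factor."""
    return math.sqrt(sum(((pi - qi) * f) ** 2 for pi, qi, f in zip(p, q, factors)))


# Rounded to two decimals, this prints the same list as the "Factors" line above.
print([round(f, 2) for f in FACTORS])
```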

A similar method is described in this paper:

But the consensus seems to be that weighting the output of a PCA is mathematically incorrect, because they are already weighted by the definition of PCA. So, I don't know why it produces seemingly more sensible results in those instances.

Anyway, I'm out of the discussion for now. Regards

Seinundzeit said...


Now, you could say that perhaps weighted PCs only work with West Eurasian populations, even though those West Eurasian populations are as different as Basques-Lithuanians and Kalash-Tajik_Rushan.

Perhaps, this sort of thing won’t do so well with populations outside of West Eurasian variation, populations with complex ancestry that isn’t properly understood right now, populations with ancestry that might seem unique in a Eurasian context. Basically, populations like South Indian tribal people.

So, here we go. These are the Paniya, same reference populations used on the West Eurasians from Europe and South Central Asia, weighted PCs.

78.0% Onge
21.6% Iran_Neolithic
0.3% Yamnaya_Samara


Again, considering the data that we currently have, they seem to have turned out perfectly fine.

Lol though, all the sarcasm and cheekiness aside (huijbregts, I’ve been busting your balls, to use a Midwestern Americanism), I honestly do appreciate your perspective (I’m being serious now), and we all owe you a debt of gratitude, for your work on nMonte. Again, I’m actually being serious now.

It’s no small feat, and jokes/teasing aside; I do have a deep respect for your work.

Also, I do understand why you found Sangarius’ modification unpalatable, both in terms of your position of being the dude who came up with nMonte, and in terms of your position as someone who looks at this sort of thing through the lenses of a mathematician (I’m assuming mathematics is what you do).

But for the record, Sangarius’ tinkering with nMonte allows for very, very sensible models.

Philosophically speaking, I am an instrumentalist (not in the sense that I play the guitar or drums, lol, but rather in the sense that I look at things the way William James did); I look for methods that work. And this weighting of PCs works, for fine-tuned models that involve Bronze Age and Upper Paleolithic + Mesolithic + Neolithic West Eurasians in conjunction with ENA and Sub-Saharan African references, and for broad models that try to hone in on deep/divergent ancestral streams (especially for the latter).

Again, I’ll leave the mathematical side of the analytic equation to people who do it for a living or have degrees in it (basically, people who know what they’re actually talking about, people like yourself).

Yet, the fact remains that Native Americans don’t have Natufian ancestry, and anything that shows Native Americans as having 10%-15% ancestry from the ancient Levant is wrong, plain and simple.

So, this is indeed an issue that needs further exploration. I look forward to hearing what David finds, after the holidays.

Anyway, this is what Sangarius said, with regard to what he has attempted:

“Well, that's reassuring, I really started to doubt my reasoning. There are of course many ways of weighing, I sure tried a lot of variations. I ended up with dividing the Eigenvalues of the PCs by the Eigenvalue of the first PC and then multiplying the PC scores with them. I saw that approach in a paper which I can't find right now. Anyway, how would you do it?”

Seinundzeit said...


Lol bro, we posted at the same time.

But thanks for finally chiming in! It really clarifies things.

huijbregts said...

@ Seinundzeit

I checked your Karitiana example.
The correlation between Natufian and MA1 is 0.79 so these two populations should not have been together in one specification.
You have used this combination several times.

Seinundzeit said...


To further hit on this point, here are the Karitiana, using unweighted PCs, and only two reference populations, MA1 and Ami.

(Previously, I tested them with my standard setup, the one I use on every population)

70% MA1
30% Ami


Even with only two relevant reference populations, the result just doesn't reflect reality, or at least reality as we know it.

By contrast, now with weighted PCs, but only using those same two reference populations.

60.9% Ami
39.1% MA1


It's the exact same result I found with my standard PCA data-sheet, and it also happens to be the correct result.
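
To see how rescaling the dimensions alone can shift the proportions in a two-reference model like this one, here is a small sketch. It is not nMonte's algorithm (nMonte searches by Monte Carlo); with only two references the closest mixture has a closed form, namely the projection of the target onto the line between them. The function name and the two-dimensional coordinates are invented for illustration and are not real Global 10 values for Karitiana, Ami or MA1.

```python
def two_source_share(target, ref_a, ref_b, factors=None):
    """Share of ref_a (between 0 and 1) in the ref_a/ref_b mix closest to target."""
    if factors is None:
        factors = [1.0] * len(target)
    # Scale every coordinate by its factor, then use ordinary Euclidean geometry.
    t = [x * f for x, f in zip(target, factors)]
    a = [x * f for x, f in zip(ref_a, factors)]
    b = [x * f for x, f in zip(ref_b, factors)]
    ab = [ai - bi for ai, bi in zip(a, b)]
    tb = [ti - bi for ti, bi in zip(t, b)]
    # Least-squares projection of the target onto the line from b to a.
    share = sum(x * y for x, y in zip(tb, ab)) / sum(d * d for d in ab)
    return min(1.0, max(0.0, share))  # clip to a valid proportion


# Hypothetical coordinates, chosen only to show the effect.
target, ref_a, ref_b = [0.6, 0.0], [1.0, 0.0], [0.0, 1.0]

unweighted = two_source_share(target, ref_a, ref_b)
weighted = two_source_share(target, ref_a, ref_b, factors=[1.0, 0.1])
print(f"{unweighted:.2f} {weighted:.2f}")  # 0.80 0.60
```

In this toy case the second dimension pulls the estimate away from what the first dimension alone suggests, and down-weighting it moves the share back. Whether that is what happens with the real data is a separate question.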

There are actually many more examples (too many, to be quite frank), but I keep coming back to this one, because the Karitiana are the least mysterious population around, in terms of their deep genetic history.

Regardless, when all is said and done, the weighted results are always better than the non-weighted results.

It's just a fact.

Since I'm no Riemann or Hilbert (lol), since my mind is still in "it's too early in the morning" mode, and since my girlfriend does often tell me that I'm a certifiable idiot (and she knows her stuff), I do (quite honestly) defer to you with regard to this issue.

Still, if you could just explain to me why Eren's modification works better than working with the non-weighted PCs, I would be very pleased. Taking your word with regard to the mathematics, this is all rather puzzling.

But before that, keeping in mind my still somewhat "sleepy" mental state at the moment, I want to ask, is Shaikorth not correct when he states:

"Although in this case you're actually adjusting nMonte to deal with different eigenvalues, not changing the PCA itself. "Unweighted" means it'll treat every PC the same, doesn't it? Would be a distortion by default since every dimension has a different eigenvalue."

Thanks in advance.

huijbregts said...

A note about the distance used by Eren.

Above I found the formula used to transform the original distances into 'weighed' distances:
distance = sqrt( ((p1-q1)*f1)^2 + ((p2-q2)*f2)^2 + ... + ((p10-q10)*f10)^2 )
where f1..f10 is 1.00, 0.72, 0.21, 0.10, 0.09, 0.08, 0.07, 0.05, 0.05, 0.05
This formula tells us that the difference on the first dimension is left unchanged, but the differences on the higher dimensions are progressively reduced.
So after this transformation the total distance will necessarily be considerably smaller than with the original data. This is an arithmetic artefact, NOT AN INDICATION OF A BETTER FIT.
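
A quick numerical check of this point, as a sketch only: the factors are the rounded ones listed above, while the coordinates of `p` and `q` are made up and do not belong to any real sample.

```python
import math

FACTORS = [1.00, 0.72, 0.21, 0.10, 0.09, 0.08, 0.07, 0.05, 0.05, 0.05]


def distance(p, q, factors=None):
    """Euclidean distance, optionally with each dimension scaled by a factor."""
    factors = factors or [1.0] * len(p)
    return math.sqrt(sum(((a - b) * f) ** 2 for a, b, f in zip(p, q, factors)))


# Hypothetical 10-dimensional coordinates for two samples.
p = [0.010, -0.020, 0.030, 0.000, 0.015, -0.010, 0.005, 0.020, -0.005, 0.010]
q = [0.0] * 10

plain = distance(p, q)
weighted = distance(p, q, FACTORS)

# Every factor is at most 1, so no term under the square root can grow.
print(weighted <= plain)  # True
```

Because the shrinkage applies to every candidate mixture alike, the raw distance from a weighted run cannot be compared with the raw distance from an unweighted run.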

Seinundzeit said...

For what it's worth, the total distance isn't what I had in mind, when I spoke of better fits.

Rather, the proportions constitute the real point of interest.

Basically, why do the proportions make much more sense, with weighted PCs?

That's the primary question that we need to examine.

huijbregts said...


"Regardless, when all is said and done, the weighted results are always better than the non-weighted results.
It's just a fact."

Wait a moment.
As you can read above, the distance of Eren is necessarily smaller than the distance of nMonte; this is a mathematical artefact, not an indication of a better fit.
So you will have to base your evaluation on another argument than the distance.
As to the quote of Shaikorth, no, he is not right, he just doesn't understand that in a PCA the dimensions have a different eigenvalue by definition.

Seinundzeit said...


There is a misunderstanding at play here, which is my fault, mainly due to imprecise language on my part.

When I spoke of better fits, I meant that the percentages match the scientific literature, I didn't have the smaller distances in mind.

huijbregts said...

OK, I did indeed misunderstand that.
Did you really check all those results against the scientific literature? You must be an efficient worker or have a photographic memory. What do you think is the best example? What I have seen from Karitiana is mainly a very poor fit, I doubt whether this specification is complete.

By the way I am not a mathematician but a retired biologist. I just very much want to understand what an algorithm effectively does. Now that I have seen that Eren's distance is not the Euclidean distance but a distance with dimensional preferences, I am completely at a loss to understand its effects. But I get even more convinced that it should not be used in nMonte.

Alberto said...

I'm trying to understand the exact consequences of this approach. But basically, as it's clear by the formula, it reduces the weight of the higher dimensions, essentially making this an *almost* 3 dimension PCA (if the first 3 dimensions already had most of the weight by themselves, after this weighting the next 7 become almost negligible).
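
As a rough way to put a number on "almost 3 dimensions": the factors enter the distance squared, so their squares show how much each dimension can contribute. The sketch below assumes, purely for illustration, that two samples differ by the same amount in every dimension, which real data need not satisfy.

```python
# Rounded factors from Eren's comment.
FACTORS = [1.00, 0.72, 0.21, 0.10, 0.09, 0.08, 0.07, 0.05, 0.05, 0.05]

# Each dimension contributes (difference * factor)^2 to the squared distance,
# so with equal differences its share is proportional to factor^2.
squared = [f ** 2 for f in FACTORS]
share_first_three = sum(squared[:3]) / sum(squared)

print(f"{share_first_three:.1%}")  # 97.7%
```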

So trying a simple comparison, running that Karitiana model with just 3 dimensions instead of 10:

Ami 66 %
MA1:MA1 34 %

Distance 0.006672

Indeed much better than with the 10 dimensions. Though I'm pretty sure that using only 3 dimensions will screw many other models (didn't try it, though), while this weighted approach seems to work normally with every model. So to me it looks like it deserves a closer look, even if we might not understand exactly why this is beneficial yet. In the end, when something works (if it works) there's always a mathematical reason behind it.

Seinundzeit said...

Hmm, in all honesty, I don't think that I'm a very efficient worker (quite the contrary, lol), nor do I seem to display the classic signs of having a photographic memory.

Rather, perhaps it's just that I "read better" than most people, and/or perhaps because I have a natural inclination towards holistic/systematized knowledge. ;-)

All kidding aside, I think we've reached an impasse. So, I guess I'll drop out of the conversation, until any further developments come up. Regardless, I do appreciate the fact that you explained your ideas to me. I guess we'll have to agree to disagree (obviously, with all due respect).

Anyway, on a related note, I've realized that my old "Basal Eurasian" d-stat sheet (for nMonte) was/is pretty awesome, so I will make sure to eventually post that output here.

Seinundzeit said...


I should have added "@ huijbregts".


Would you be willing to share that script you mentioned previously? That would definitely be appreciated.

Eren said...

Just to clarify, one doesn't need to modify nMonte for this. In my first version I had to modify nMonte, but after fixing it further I noticed modifying nMonte is not required.
One can instead just modify the dataset and the target file and use regular nMonte. One can do that even in Excel. Just multiply the values in each dimension with the respective factors. Do that in both files, and that's it. That's because it does use regular Euclidean distance:
sqrt(((p1-q1)*f1)^2) is the same as sqrt((p1*f1-q1*f1)^2).

The effect is that as the variance in the dimensions gets progressively smaller, so does their effect on the distance. The information is conserved, but reduced in magnitude. Kinda like dropping dimensions light. When you drop a dimension, you essentially multiply it with 0. In this solution you multiply it with the ratio it explains compared to the first dimension. Actually pretty simple conceptually.
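
The equivalence can be checked in a few lines. This is a sketch with invented coordinates; `scale`, `euclidean` and `weighted_distance` are illustrative names, not functions from nMonte.

```python
import math

FACTORS = [1.00, 0.72, 0.21, 0.10, 0.09, 0.08, 0.07, 0.05, 0.05, 0.05]


def scale(row, factors=FACTORS):
    """Multiply each PC value by its factor (the step one could do in Excel)."""
    return [x * f for x, f in zip(row, factors)]


def euclidean(p, q):
    """Plain Euclidean distance, as used on an unmodified datasheet."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def weighted_distance(p, q, factors=FACTORS):
    """The weighted formula applied to the original, unscaled values."""
    return math.sqrt(sum(((a - b) * f) ** 2 for a, b, f in zip(p, q, factors)))


# Hypothetical target and reference rows.
target = [0.02, -0.01, 0.03, 0.01, 0.00, -0.02, 0.01, 0.02, -0.01, 0.00]
reference = [0.01, 0.01, -0.02, 0.00, 0.01, 0.01, -0.01, 0.00, 0.02, 0.01]

# Scaling both files first and then using the regular distance gives the same number.
print(math.isclose(euclidean(scale(target), scale(reference)),
                   weighted_distance(target, reference)))  # True
```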

Shaikorth said...

"As to the quote of Shaikorth, no, he is not right, he just doesn't understand that in a PCA the dimensions have a different eigenvalue by definition."

That's not what my post is implying at all. I'm suggesting nMonte doesn't understand the eigenvalue differences between different dimensions until adjusted, and that's the reason why unadjusted fits are not necessarily sensible. Like in the case of Karitiana.

Alberto said...

I implemented that in my own script to do some quick testing. For the Karitiana case, I got:

Ami 61 %
MA1:MA1 39 %

So it's probably working as expected. However, going onto European models things got ugly. For example, for Estonian:

Non weighted
Loschbour:Loschbour 40 %
Kotias:KK1 22.6 %
Karelia_HG:I0061 19 %
Barcin_Neolithic:I0707 18.4 %

Weighted
Barcin_Neolithic:I0707 38.2 %
Karelia_HG:I0061 36.6 %
Loschbour:Loschbour 25.2 %
Kotias:KK1 0 %

For English_Cornwall:

Non weighted
Barcin_Neolithic:I0707 34.4 %
Loschbour:Loschbour 29.4 %
Kotias:KK1 23.4 %
Karelia_HG:I0061 12.8 %

Weighted
Barcin_Neolithic:I0707 57 %
Karelia_HG:I0061 25.2 %
Loschbour:Loschbour 17.6 %
Kotias:KK1 0.2 %

Those weighted ones definitely look bad, with 0% Kotias. So I either did something wrong or the method doesn't work that well. Sein, could you test one of those simple models to see if you get the same with the weighted approach?

(I'll send the script ASAP)

Matt said...

@ Eren, interesting. If the original PCA was already scaled by the size of the eigenvalues (that Davidski provided to you), that scaling seems like the wrong thing to do, as it would square the weighting.

But! If the original PCA data you had for G10 were originally unscaled, that scaling looks like exactly the right thing to do to me. (And what should have originally been done with the data anyway?).

I have to say I wasn't sure about the scaling on the original datasheet ( from

The max-min on the dimensions there follows the pattern of:
PC1 - 0.14266, PC2 - 0.1193857, PC3 - 0.26094, PC4 - 0.1631, PC5 - 0.18856, PC6 - 0.166, PC7 - 0.17432, PC8 - 0.2867, PC9 - 0.1374, PC10 - 0.07527

So it seems like in that case the distance between the maximum and minimal samples is higher on many of the lower dimensions (PC3, PC8) compared to PC1. Rather than following a pattern of decreasing sized dimensions. Certainly the distances between max and min don't decline in anything like the progression of the eigenvalues you've quoted, where PC3 is 0.21 the eigenvalue of PC1, etc.
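
The comparison can be laid out side by side. In this sketch the ranges are the max-min figures listed above and the eigenvalues are the ones Eren quoted; the variable names are just for illustration.

```python
# Max-min range of each PC on the datasheet, as listed above.
RANGES = [0.14266, 0.1193857, 0.26094, 0.1631, 0.18856,
          0.166, 0.17432, 0.2867, 0.1374, 0.07527]

# Eigenvalues quoted by Eren.
EIGENVALUES = [38.86, 27.86, 7.97, 4.07, 3.47, 3.15, 2.87, 1.89, 1.83, 1.78]

# Express both relative to PC1 so the two progressions can be compared.
range_ratio = [r / RANGES[0] for r in RANGES]
eigen_ratio = [ev / EIGENVALUES[0] for ev in EIGENVALUES]

for i, (rr, er) in enumerate(zip(range_ratio, eigen_ratio), start=1):
    print(f"PC{i}: range ratio {rr:.2f}, eigenvalue ratio {er:.2f}")
```

If the datasheet were already scaled by the eigenvalues, the two columns would be expected to fall off together. Here the range ratios for PC3 and PC8 come out above 1 while the eigenvalue ratios drop steadily.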

Applying "Sangarius scaling" to the PCs*, then building neighbour joining trees on euclidean distances in the dimensions:



I don't know if either match my preconceptions exactly, but there seem like many more odd features on the unscaled tree - 1) the most distant parts are the EEF and Africans, 2) the primary split between Africans and Eurasians is quite weak, with modern day Middle Eastern populations sitting closer on the tree to Africans than they are other West Eurasians, and more distantly rooted on the tree to West Eurasians than various Eastern Non-Africans. The scaled tree seems to fit more with fewer odd features.

Doing PCA on PCA on scaled and unscaled dimensions as well:



Scaled produces an output whose highest dimensions closely resembles and is dominated by PC1 and PC2 of the world PCA. Unscaled produces a plot dominated by many lower order dimensions and which resembles most closely a West Eurasia PCA.

(For good measure, here are distance matrixes calculated by PAST3 from:



Download and save as .csv and then open as spreadsheet. You can sort the columns by distance and see how logically distances between different populations seem or do not seem).

Though I can't talk about any alternative distance measure you've used to modify nMonte, and I don't really understand its function. I'm not sure I can actually see why it's a necessary improvement / adjustment over the euclidean distance after the scaling is taken into account.

* when I scaled, I just multiplied the values in each dimension by the eigenvalues, because only their relative size to each other should matter.
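
The scaling described in this footnote can be sketched as follows. The eigenvalues and coordinates are hypothetical placeholders (the real ones are not reproduced in this thread), and dividing by the first eigenvalue is an assumption made only to keep PC1 unchanged; since only relative sizes matter, it changes all distances by the same constant.

```python
import math

def scale_by_eigenvalues(coords, eigenvalues):
    """Multiply each coordinate by its eigenvalue relative to the first one."""
    first = eigenvalues[0]
    return [c * (ev / first) for c, ev in zip(coords, eigenvalues)]

def euclidean(a, b):
    """Plain Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical eigenvalues and 3-dimensional coordinates, for illustration only.
eigenvalues = [10.0, 5.0, 2.0]
pop_a = [0.02, 0.01, 0.10]
pop_b = [0.01, 0.03, -0.10]

unscaled = euclidean(pop_a, pop_b)
scaled = euclidean(scale_by_eigenvalues(pop_a, eigenvalues),
                   scale_by_eigenvalues(pop_b, eigenvalues))
print(f"unscaled distance: {unscaled:.4f}, scaled distance: {scaled:.4f}")
```

In this toy case most of the unscaled distance comes from the third dimension, so the scaled distance is much smaller.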

huijbregts said...

That is computationally very simple. But your formula is not the Euclidean distance, but a dimension-selective distance, which is always smaller than the Euclidean distance.

Seinundzeit said...


I actually have to make a run for it. But once I find the time tonight or tomorrow, I'll make sure to give those a try.

And thanks, I'd definitely like to give your script a try as well.

Matt said...

I also just modified the datafile rather than any adjustment to the script.

Seinundzeit said...


One final observation+question before I leave.

Mainly, I like the fact that your scaled neighbor joining tree resembles fineSTRUCTURE output, in terms of how the populations place with respect to each other.

Out of curiosity, is this a nice coincidence, or perhaps a reflection of the scaled tree being more accurate, in your estimation?

Alberto said...


That's really strange. So where did people get the eigenvalues from if not from the variance (the maximum spread) in each dimension? The numbers you posted are correct, it's just that it doesn't seem to make sense that there's higher variance in PC8 than in PC1. And how was that scaling done? Arbitrarily? (Since it's not made to scale all the eigenvalues to have the same length).


Could you email me at alberto6674 at (I can't PM you on Anthrogenica since I have not enough posts there).

FrankN said...

On PCA-based models (e.g. nMonte):

1. It is important to realize that PCA constitutes an enormous reduction in information content. A medium-quality sample of 100k SNPs contains 200 kbit of information (4 possible nucleobases per SNP = 2 bits). Standard PCA reduces this to 16 bits per dimension (5 decimal places = 15 bits, plus 1 bit for plus/minus). So, a PC10 (160 bits) means information compression by a factor of 1,000 or more. The same essentially applies to ADMIXTURE, except for only having 15 bits (no negative values) per component.
Before using such heavily compressed data as input into further calculations (e.g. nMonte), it would be prudent to check the quality of the compression. PCA should in principle be able to provide the residual for each sample, and it seems like a good idea to exclude all samples with above-average residual from further use in nMonte.
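
The arithmetic in point 1 can be written out as a small sketch. The figures are the ones given in the comment (including its estimate of 16 bits per dimension); the variable names are illustrative.

```python
snps = 100_000
bits_per_snp = 2              # 4 possible nucleobases -> 2 bits
raw_bits = snps * bits_per_snp

bits_per_dimension = 16       # the comment's estimate: 15 bits + 1 sign bit
dimensions = 10
pca_bits = bits_per_dimension * dimensions

compression = raw_bits / pca_bits
print(f"{raw_bits} bits -> {pca_bits} bits, factor {compression:.0f}")
```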

- t.b.c - text too long for a single comment

FrankN said...

2. I had a closer look at Dave’s Global10 PCA. Whatever operations lay behind it, the vectors obviously haven’t been normalized. The absolute span between the highest and lowest value per dimension ranges between 0.075 (PC10) and 0.287 (PC8). Moreover, the vectors aren’t scaled symmetrically. PC1, e.g., is heavily negatively shifted (Max 0.02 @Lithuanians, Min -0.12 @Biaka), as is PC3 (Max 0.041 @Nganasan, Min -0.22 @Koinanbe), and, to a lesser extent, also PC2 (Max 0.0365 @ Barcin_Neolithic:I0709, Min -0.083@Igorot).
Most problematic in these respects seems PC8, which describes SEA-Sahulian substructure (Max 0.1985 @Bougainville, followed by Australian and Lapita, Min -0.0882 @Kosipe, followed by Papuan and S.Vietnamese; zero values for Mota, LBK_EN:I0046, Mordovian).
My hunch is that one reason Sangarius’ transformation works so well result-wise may be that it
a) Enhances resolution, especially in the positive range, of PC1 (Baltics), PC2 (ANF) and PC3 (ANE); while
b) Downscales the problematic PC8, which is rather meaningless for any non-SEA/Sahulian analysis, yet, because of its high numerical load, will impose itself strongly on calculations, and consequently may lead to high errors/distances.
Note in this context that PC8 has moderate positive (Oceanian) loads for certain Steppe pops (Potapovka:I0418, Altai_IA:RISE602, Srubnaya_outlier:I0354 all >0.008), and negative (SEA) loads for Iberia_EN/MN/CA and Esperstedt (all < -0.008). The PC8 load difference between Potapovka:I0418 and Iberia_MN:I0405, e.g., is 0.021, compared to 0.006 on PC1 (Baltics-SSA) and 0.013 on PC2 (ANF-ENA).
To a lesser extent, this also applies to PC3 (load span 0.261) that essentially describes a Beringian-Sahul/ASI cline. It is in general reasonably well setting apart Steppe (positive, i.e. Beringian-loaded) from ANF pops (negative, i.e. Sahul/ASI loaded). However, some loads seem counter-intuitive: Yamnaya_Samara:I0439 (-0.0003), e.g., comes in close to Iberia_MN:I0406 and Remedello:RISE489 (both -0.0002), but quite distant from Yamnaya_Kalmykia:RISE546 (0.0048) and especially Altai_IA:RISE602 (0.0164); I0349’s PC1 distance to the latter is 0.0039. Conversely, Iberia_MN: I0405/08 (both 0.0004) slightly fall towards the positive (Steppe/Beringian) side, while Iberia_CA:I1271 loads moderately negative (-0.0054). PC1 can hardly tell them apart (0.0001 load difference). Whether PC3 really reflects respective substructure, or tends to produce noise away from its Beringian and Sahulian poles, is difficult to say.

Hence, as alternative to Sangarius’ transformation, I suggest:
a) For each dimension, adding/subtracting a constant so that both poles have the same absolute value. In the case of PC1 (0.0198 to -0.1228), the offset would be +0.0515, and place the zeropoint between Somali and Amharans, with all SSA pops having negative, and all other pops (including Amhara, Tigray, Sahrahwi N. Africans) positive loads. PC2 (ANF – ANE, offset +0.0232) would zero in between Ust-Ishim and Okunevo, PC4 (Natufian – UHG, offset +0.0192) on Serbs, PC6 (WHG/pre-BA Iberia – Iran_Neol/ S.Asia, offset +0.0112) on Armenia_CA, which all intuitively feels right.
b) Normalizing all PCs to the same distance between their poles. This would reduce the weights of PC8 (factor 0.6) and PC3 (factor 0.66), while enhancing PC2 (factor 1.44) and PC1 (factor 1.2). The most prominent effect would be on PC10 (factor 2.28), which is worthwhile monitoring: Its poles (Gambian – BantuKenyan) suggest an inner-African cline, but it also does a pretty good job in setting apart ANF/EEF/Levante/AustroAsiatics (on the W.African side) from UHG/Iran_Neol/Austronesians (on the E.African side). The same applies to PC9 (factor 1.25) that has ASI, WHG and N. Africans on the negative, and CHG, Austronesians and ANE on the positive pole.
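
Steps a) and b) can be sketched as below. The pole values for PC1 come from this comment and the ten spans are the max-min ranges quoted earlier in the thread. Taking the mean span as the common target is an assumption; it is not stated in the comment, but it reproduces the factors quoted there. Function names are illustrative.

```python
def center_offset(max_value, min_value):
    """Constant to add so that both poles end up with the same absolute value."""
    return -(max_value + min_value) / 2

def span_factors(spans):
    """Factor per dimension that rescales every span to the mean span (assumed target)."""
    target = sum(spans) / len(spans)
    return [target / s for s in spans]

# Step a): PC1 runs from 0.0198 down to -0.1228.
pc1_offset = center_offset(0.0198, -0.1228)

# Step b): max-min span per dimension, PC1..PC10.
spans = [0.14266, 0.1193857, 0.26094, 0.1631, 0.18856,
         0.166, 0.17432, 0.2867, 0.1374, 0.07527]
factors = span_factors(spans)

print(f"PC1 offset: {pc1_offset:+.4f}")
for i, f in enumerate(factors, start=1):
    print(f"PC{i}: factor {f:.2f}")
```

To apply both steps to a dimension, add the offset to every value in that column and then multiply by the column's factor.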

Eren said...

I don’t think you did anything wrong, it’s probably not working so well.

I started questioning the PCA because some of the population distances seem not to reflect the population affinities I’m familiar with from the literature. In this PCA WHGs are one of the most distant populations to me in the whole dataset, much more distant than Yoruba. When I looked where the large distance came from, I noted that they were the product of the smaller dimensions. Dimension 4 in particular. Using up to dimension 3, WHG are super close. But when going to 4 dimensions the distance increases by 4000%. Suddenly WHG are as distant as Japanese and the distance only increases with each additional PC. I got the impression that something was not right with the scaling. How can dimensions with Eigenvalues of 4 and lower have such an incredible impact on the distance? I shared this with Davidski and he agreed that the dimensions probably shouldn’t all have the same weight on the distance. This was a perfect opportunity to procrastinate, so I started tinkering with the scaling.
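
The check described above, watching how a distance grows as dimensions are added, can be sketched like this. The coordinates are hypothetical and only meant to show a case where one later dimension dominates; they are not taken from the Global 10 datasheet.

```python
import math

def cumulative_distances(a, b):
    """Euclidean distance using only the first k dimensions, for k = 1..n."""
    total = 0.0
    result = []
    for x, y in zip(a, b):
        total += (x - y) ** 2
        result.append(math.sqrt(total))
    return result

# Hypothetical 4-dimensional coordinates, for illustration only.
sample = [0.010, 0.020, 0.005, 0.090]
reference = [0.012, 0.018, 0.004, -0.050]

steps = cumulative_distances(sample, reference)
for k, d in enumerate(steps, start=1):
    print(f"first {k} dimension(s): distance = {d:.4f}")
```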

I’ve also noticed that the min-max ranges on the PCs don’t match with the progression of the Eigenvalues (Davidski posted the link on Anthrogenica: I’m not intimate with the intricacies of PCA, so I don’t know if that is an anomaly of this PCA or normal. Intuitively, I would expect that they should follow the progression of the Eigenvalues.

Regarding the PCA plots, are you sure you labeled them correctly, because the unscaled PCA looks more like your description of the scaled PCA and vice versa. And I couldn’t open the distance matrices properly, it looks all like gibberish.

I actually didn't implement any other distance measure in the version I shared, just code that takes care of the scaling. Only afterwards I noticed that that wasn't necessary, one can just modify the data itself.

Seems like I was right that something is not right with the PCA. Could you maybe upload a properly modified version of the data?

Davidski said...


I'm suggesting nMonte doesn't understand the eigenvalue differences between different dimensions until adjusted, and that's the reason why unadjusted fits are not necessarily sensible. Like in the case of Karitiana.

That was my original impression, but apparently it does.


This weighted approach seems to work normally with every model. So to me it looks like it deserves a closer look, even if we might not understand exactly why this is beneficial yet. In the end, when something works (if it works) there's always a mathematical reason behind it.


huijbregts said...

I think that by now I understand most of the problems.
You have to discriminate between the deep structure on the first two dimensions, and the more local structures on the higher dimensions.
It is well known that the first two dimensions of a PCA on DNA data invariably have much larger eigenvalues than dimension 3.
That is because the deep DNA structure is bound to the two-dimensional structure of the earth's surface.
The higher dimensions contain more local information, but when looking for the deep structure, the higher dimensions are just noise which should be discarded.

Now for the weighting of the PCA. Mathematically speaking this is a bad idea because it compromises the PCA and distorts the geometry.
The effect of squaring the eigenvalues is that the higher dimensions lose all their influence on Euclidean distances.
But when looking for the deep structure this is a blessing in disguise. Because it effectively drops all the information in the dimensions 3 and higher.
When looking for more local structures, the higher dimensions are essential and the weighting is very disadvantageous.

Much attention has been given to the modelling of Karitiana, because in nMonte the ratio of Ami to MA1 is the reverse from what is found in the scientific literature.
But weight the model and you get the right ratio. In line with the above, this is exactly what is to be expected.
I tested this by feeding into nMonte only the first two dimensions of the data and presto, I got the correct ratio.

Rob said...

So what's the upshot ? If we want to model someone on deep (Palaeolithic) source Pops, drop the number of D's ? And keep them for modelling from more recent prehistory (BA, Iron Age) and contemporary source lists ?

Seinundzeit said...


I'm just going to quickly chime in, for a sec.

Using only the first 3 dimensions isn't exactly the solution, if we want sensible basal models.

It does help the Karitiana, but it leads to really weird results in my case, and in the case of many other populations.

For example, with 3 dimensions, my "closest single item distance" reference population is MA1 (expected), followed by Iran_Neolithic (again, expected), followed by Bichon, followed by Levant_Neolithic, and then after a large gap I see ENA and Sub-Saharan Africans. Totally reasonable.

Yet, here is how I model with those same 3 dimensions and reference populations, despite seeing a rational ordering of my reference populations from nearest to furthest:

58.3% Bichon
17.35% Onge
16.7% MA1
5.9% Levant_Neolithic
1.1% Ami
0.65% Iran_Neolithic


So, I'm 60% WHG, and basically 0% Iran_Neolithic, despite those two being my closest reference populations (not to mention that my Onge has gone from around 10% to now being closer to 20%)?

Again, it just doesn't make sense. If we go by everything else, MA1 + Iran_Neolithic should be, in any basal model, the two primary elements of my ancestry, with WHG being a minor component, and Natufian/Levant Neolithic percentages being almost trace/barely noticeable.

Basically, using only the first 3 dimensions won't fix things for most populations.

Anyway, for West Eurasians, I found out long ago that the optimal amount of dimensions to use is 7, not the full 10, nor anything less than 7. But that is an entirely different discussion.

Regardless, I'll catch up with the conversation tomorrow.

Davidski said...


Now for the weighting of the PCA. Mathematically speaking this is a bad idea because it compromises the PCA and distorts the geometry.
The effect of squaring the eigenvalues is that the higher dimensions lose all their influence on Euclidean distances.
But when looking for the deep structure this is a blessing in disguise. Because it effectively drops all the information in the dimensions 3 and higher.
When looking for more local structures, the higher dimensions are essential and the weighting is very disadvantageous.

In theory, as far as we know, yes, in practice no. Scaling with the eigenvalues produces very sensible models even when higher dimensions are required for fine scale models.

Here are three models for Ashkenazi Jews. The third one is the most sensible as far as I know based on other analyses and scientific literature.

I've run a number of models that I'm familiar with, and none appeared to be distorted by the scaling process. The important question is why?

Are there any models that do look distorted?

nMonte original

Italian_Tuscan 61.05
Samaritan 23.60
Cypriot 10.70
Polish 2.20
Han 1.70
Yoruba 0.75
Avar 0.00

First two PCs only

Italian_Tuscan 38.15
Cypriot 26.45
Polish 18.25
Samaritan 10.30
Avar 4.75
Yoruba 1.20
Han 0.90

Using Eren's added script

Italian_Tuscan 51.5
Samaritan 29.9
Cypriot 8.4
Polish 8.0
Han 1.5
Yoruba 0.7
Avar 0.0


Basically it's like this:

1) Dropping most higher dimensions from the sheets is in theory a sensible idea, but in practice a horrible idea.

2) Scaling the PCs according to the eigenvalues is in theory a horrible idea, but in practice a good idea.

FrankN said...

@Sein e.a.;
I had a look at with/without eigenvalue correction runs to find out what actually happened. I concentrated on your Mongola and Karitiana cases, the other Siberian pops look similar enough for the findings being transferable. Essentially, I manually multiplied the source pop’s raw PCA values (i.e. uncorrected for eigenvalues or acc. to my proposal above) with the ratios given by you, and assessed the square error per dimension.
Turns out that for both Mongola and Karitiana, your “correct” result beats the initial nMonte result, or is only marginally worse, on six dimensions (1,2,3,8,9,10), and is only slightly inferior on PC5. The biggest problems rest with PC7 (75% of the total error for Mongola), and PC4 (esp. Karitiana). OTOH, the “w/o Natufian” scenarios have significantly lower errors on PCs 2 and 3. By reweighting the PCs according to their eigenvalues, the weight of those well-fitting PC2/3 is enhanced, and poorly-fitting PC8 literally taken out of the equation. Et voila – everything is fine!

In fact, it isn’t. A closer look on those problematic PCs 4 and 7 reveals that the correction just cures the symptom, but not the cause:
- PC7 spans between UHG (max. load Loschbour, Bichon, Villabruna, Motala) and ANE (min. Nganasan, Sakha). Unsurprisingly, all your target populations have a substantial negative load. However, out of the source pops, only Natufian is negatively loaded (-0.051), while both MA1 (+0.03) and Ami (+0.047) fall on the positive side. This “motivates” nMonte to include a substantial Natufian share (which in fact still leaves an enormous error on that dimension).
- PC4 poles are BedouinB/ Natufians (pos.) and KareliaHG/ AG3/ Loschbour/ Motala (neg.), with Ami following soon on the positive, and MA1 on the negative side. The target pops range between neutral (Mongola +0.003) and strongly negative (Eskimo Chaplin -0.055, Karitiana -0.049), and the latter call for a lot of MA1 in order to somewhat counter the positive score of Ami.
- PC6 is somewhat similar, spanning between WHG (pos.) and Iran_Neol/ Tribal Indians (neg.). Here again, Ami fall quite strongly on the positive side (+0.036), while the target pops range between moderately negative (Karitiana -0.025) and slightly positive (Mongola +0.009).

In short: Ami as source pop aren’t well aligned to the target pops in at least three dimensions (PC5 may possibly be added here), and that is the root of the problem. I bet if Ami were replaced by E.Siberians, say, Itelmen, and some SEA, e.g. Lahu, were added as source pop, Natufians wouldn’t show up anymore. Ami and Lahu in fact don’t look that different PCA-wise. However, as the SEA adstrate in the Kunza (Atacameno) language looks one-third each Thai, Polynesian, and Gelao, I thought Lahu might better represent the source pop.
[I should download nMonte one day – but from manually playing around with the raw data, 60% Itelmen, 27% MA1 and 13% Lahu look like a very decent fit for Karitiana.]
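
The manual check described in this comment (mix the source pops' PCA values with the given ratios, then look at the squared error per dimension) can be sketched as follows. All coordinates, names and weights are hypothetical placeholders; this is not nMonte code and does not use values from the datasheet.

```python
def mix(sources, weights):
    """Weighted average of source coordinates; weights are fractions summing to 1."""
    n_dims = len(next(iter(sources.values())))
    return [sum(weights[name] * sources[name][d] for name in weights)
            for d in range(n_dims)]

def squared_error_per_dimension(target, sources, weights):
    """Squared difference between target and mixture, one value per dimension."""
    mixture = mix(sources, weights)
    return [(t - m) ** 2 for t, m in zip(target, mixture)]

# Hypothetical 3-dimensional coordinates and ratios, for illustration only.
sources = {"SourceA": [0.02, -0.03, 0.05], "SourceB": [-0.01, 0.04, 0.01]}
target = [0.01, 0.00, -0.04]
weights = {"SourceA": 0.6, "SourceB": 0.4}

errors = squared_error_per_dimension(target, sources, weights)
total = sum(errors)
shares = [e / total for e in errors]
for d, share in enumerate(shares, start=1):
    print(f"PC{d}: {share:.1%} of the total squared error")
```

A dimension that carries most of the error while no source falls on the target's side of it is the kind of misalignment the comment points to.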

Davidski said...


See how you go at reproducing the Karitiana model from literature, with Han, MA1 and Onge as references.


Also, try some models for the Bronze Age samples, like Corded Ware as Yamnaya, and also have a look if Motala is a viable reference for anyone past or present.

Rob said...

OK I had a go for Mongola, Karitiana, Kalash using Natufians, Bichon, MA-1 as "Big 3"

I got similar findings to Sein...


Dai 65.4 %
MA1:MA1 20 %
Israel_Natufian:I0861 14.6 %
Yoruba 0 %
Ust_Ishim 0 %


Dai 89.3
MA1:MA1 10.7
Israel_Natufian:I0861 0.0
Yoruba 0.0
Ust_Ishim 0.0
Bichon:Bichon 0.0


MA1:MA1 62.2 %
Dai 28.6 %
Israel_Natufian:I0861 9.2 %
Yoruba 0 %
Ust_Ishim 0 %
Bichon:Bichon 0 %

Distance 0.069119


Dai 63.1
MA1:MA1 36.9
Israel_Natufian:I0861 0.0
Yoruba 0.0
Ust_Ishim 0.0
Bichon:Bichon 0.0


MA1:MA1 54.2 %
Ust_Ishim 36 %
Israel_Natufian:I0861 9.8 %
Dai 0 %
Yoruba 0 %
Bichon:Bichon 0 %

Distance 0.049919



MA1:MA1 71.75
Israel_Natufian:I0861 27.00
Bichon:Bichon 1.25
Dai 0.00
Yoruba 0.00
Ust_Ishim 0.00

Davidski said...

I did find a model in which nMonte is far superior to the modified script.

nMonte original


Loschbour 36.0
Samara_HG 22.9
LBK_EN 21.1
Kotias 20.0

Modified script


Samara_HG 41.2
LBK_EN 37.0
Loschbour 21.9
Kotias 0.0

Rob said...

Briefly CWC using 'world at 3000 BC' as sources:

The Battle Axe sampling (initially it looked very different to other CWC individuals);

Motala 38.3
Kotias 34.25
Barcin 3.95
Salzmeunde 12
Remedello R489 7.8
Fit: 0.003734


Yamnaya_Samara:I0429 28.70
Kotias:KK1 23.15
Loschbour:Loschbour 22.10
Remedello_BA:RISE489 12.30
LBK_EN:I0056 10.20
Hungary_HG:I1507 2.85
Yoruba 0.70
Iberia_Mesolithic:I0585 0.00
Bichon:Bichon 0.00
Motala_HG:I0012 0.00
Karelia_HG:I0061 0.00
Fit: 0.0003

A CWC sample which tested as expected unweighted
CWC I0103
Loschbour: 6.65
Karelia HG: 4.8
Barcin: 9.1
LBK: 9.85
Yamnaya Samara: 68.45
Fit: 0.002243

Corded_Ware_Germany: I0103
Yamnaya_Samara: 68.8
Baalberge_MN: 13.2
LBK_EN: 9.3
Motala_HG: 8.7
Loschbour: 0.0

Another CWC I1544
Motala : 26.15
Salzmeunde: 18.3
Yamnaya Sam: 46.8
Fit: 0.007401


Yamnaya_Samara:I0429 67.8
Baalberge_MN:I0559 27.4
Loschbour:Loschbour 3.8
Motala_HG:I0012 1.0
Hungary_HG:I1507 0.0
Iberia_Mesolithic:I0585 0.0
Bichon:Bichon 0.0
Fit: 0.001

Davidski said...

After seeing my Estonian models, I don't think weighting is the answer.

Maybe Frank has the answer? Let's see how he goes...

FrankN said...

@Dave: I am not equipped with nMonte, all I can do so far is check where problems may have come from based on comparing "original" with "weighted" runs. So far, evidence points to source pops that are unsuited to higher PC dimensions, thus producing strange results, which get "weighted out" afterwards - e.g., in Sein's case, Ami instead of a more Beringian source.

I looked a bit into your Ashkenazi experiment, which appears to fail on PCs 7,8, and 10. Would you mind running it again with South_Italians and Georgian_Jews as additional source pops (if the model allows, also add Span_Extremadura)? I am quite confident a lot of noise, including those bits of Yoruba and Han, will disappear [In fact, from my calcs it seems that South Italians alone are closer to Ashkenazi than your nMonte mix, though a bit of Polish admix should improve the fit further.]

Also, is it possible to produce an error/ residual matrix on your original PCA, in order to judge how well potential source pops have been approximated?

Rob: I'll have a look at your comparisons to see what goes on there, and will come back on it later.

Davidski said...

Frank, don't worry about those Ashkenazi models. Both the nMonte and modified nMonte models are fine. Ashkenazi Jews do often show minor East Asian and African signals, so those results make sense.

Try modeling Bronze Age Northern Europe with your methods.

Samuel Andrews said...


Does Rathlin1 have less MN, more Steppe than modern Irish? These D-stats show Irish are uniquely close to Irish Neolithic, meaning extra MN was probably picked up in Ireland.

Davidski said...

From memory, Rathlin1 was more Yamnaya-like, and very similar to some of the more eastern shifted German Bell Beakers and even Srubnaya, while modern-day Irish had more Europe MN stuff.

But I can't confirm that now. Don't have Rathlin1 in my new dataset.

Alberto said...


Yes, I posted above also those models weighted and unweighted for Estonian and English_Cornwall, both rejecting Kotias. So my runs were correct in the weighted form, which indeed proves that this is not the right solution even if it works for some cases like Karitiana.

But anyway I think that with everyone's effort we're getting to understand part of the problems with the data we're feeding nMonte.

I'll look too in the direction that FrankN is looking, because there's where there seems to lie the problem. Some dimensions seem to have some outliers with surprisingly big values (when we would expect higher dimensions to have little variance by definition). Reducing the weight of those dimensions to 1/20th of their original weight fixes those corner cases, but it's definitely not a good idea for the more common cases with populations that behave normally in those higher dimensions.

So is a possible solution to manually "prune" those outliers in higher dimensions? Or can we do better with a more elegant solution?

huijbregts said...

"Basically it's like this:
1) Dropping most higher dimensions from the sheets is in theory a sensible idea, but in practice a horrible idea.
2) Scaling the PCs according to the eigenvalues is in theory a horrible idea, but in practice a good idea."

To complete the madness of the situation: according to the logic of his system, Eren should not have scaled by the eigenvalues but by the square root of the eigenvalues.
The corresponding factors are
1 0.846718713 0.452874391 0.323627789 0.298822531 0.284710759 0.271762513 0.220536005 0.217007202 0.214022091
Anybody motivated to see whether this works even better?
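
For anyone who wants to try it, here is a minimal sketch of what the two weightings do to Euclidean distances. It assumes the factors quoted above are the square roots of the eigenvalues relative to PC1 (the first factor is 1, which suggests this); the variable names are illustrative.

```python
# Factors quoted above for PC1..PC10, assumed to be sqrt(eigenvalue / first eigenvalue).
sqrt_factors = [1, 0.846718713, 0.452874391, 0.323627789, 0.298822531,
                0.284710759, 0.271762513, 0.220536005, 0.217007202, 0.214022091]

# Squaring them recovers the relative eigenvalues.
relative_eigenvalues = [f ** 2 for f in sqrt_factors]

# Multiplying coordinates by a factor w scales that dimension's contribution
# to the squared Euclidean distance by w ** 2:
#   w = sqrt(eigenvalue) -> contribution scales by the eigenvalue
#   w = eigenvalue       -> contribution scales by the eigenvalue squared
for i, ev in enumerate(relative_eigenvalues, start=1):
    print(f"PC{i}: sqrt weighting x{ev:.4f}, eigenvalue weighting x{ev ** 2:.4f}")
```

Under these assumptions, eigenvalue weighting leaves PC10 with roughly 0.2% of its original contribution to squared distance, while square-root weighting leaves about 4.6%.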

I still think that the effectiveness of the weighting system is somehow related to the suppression of the higher dimensions.
Some explanations might be:
1. Some of the higher eigenvalues are artefacts due to the non-normality of the within-population distribution
2. The model behind the PCA is the multivariate normal distribution. On the other hand admixture models suppose a linear mixture of normal populations.
3. The deflation of the higher dimensions somehow alleviates a weakness of the nMonte process.
I find each of these hypotheses hard to test.

Davidski said...

Yeah, it's a lottery with the weighted system; many models look exceptionally good, but others fail totally.

Well, I'm happy that I've confirmed for myself that we've been running these tests correctly from the start.

But it's all about the end game here, meaning results, so if anyone can figure out an elegant way to further improve results, please don't be shy.

Eren said...

This is all really strange. Plotting of the scaled PCA produces a plot identical in proportions to Davidski's Global 10 plot. Whereas plotting the unscaled PCA produces a very distorted plot. Based on that, one would expect that the distances resulting from the scaled PCA should be correct, since they are in accordance with the visuals.

@huijbregts, scaling by the square root of the Eigenvalues worsens the plot. I'm not sure how it affects nMonte.

Alberto said...

Yes, I think that for most of the models we've been doing we're on safe ground. Most of the time the models look really good.

Looking at the dimensions, I can't think of an elegant way to fix the problematic cases. My impression (from a quick look, this is a difficult task that takes time) is that it might just not be possible to pack all the world populations into a multidimensional PCA and get sensible results for all kinds of models. Many dimensions are created by variance among outlier populations (like PC8 for Oceanians) and in those dimensions the rest of the populations are completely "lost" and get quite random values. Same with other dimensions for African variation.

Maybe a safe approach could be to simply do a "Eurasian 6" PCA instead of a "Global 10" PCA. I don't really know if that would make any significant difference for our most common cases (since those ones look good as it is now), but maybe it does improve things a bit (and prevents people getting weird results with other kinds of models and thinking that the whole thing doesn't work).

Just thinking aloud...

EastPole said...

When doing PCA on Admixture of modern populations I have experimented with many methods and scaling. SVD with no scaling, no centering worked best. By best I mean the plotted PCA results corresponded best with the known geographical locations of populations and we know from many works that there is a close correspondence between genetic and geographic distances in Europe.
Maybe we should use this approach with nMonte. Model modern populations as a composition of ancient populations with nMonte using various methods of scaling and weighting and then see which methods produce results best corresponding with geography.

Ravai said...

Hello, what model do you recommend for the Sephardim? Which era is best to use? And specimens?

Thank you


Anonymous said...


Sorry for going off topic, but if you don't mind, do you know how much West Eurasian admix North Asians/Siberians like Mongolian, Altaian, Evenk, Yakut, Tuvan have?

In the case of Mongolians and closely related groups, they seem to be divisible into genetically distinct subgroups: Daur, Mongola (from China), Mongolian (from Mongolia), Buryat, Kalmyk. Do you know how much West Eurasian admix each group has on average?

Thank you very much.

huijbregts said...

I have been thinking about the weighting paradox (who doesn't).
Many people agree that, mathematically speaking, it doesn't seem a good idea to weight by the eigenvalues.
Yet we have seen models where a weighted model outperforms the unweighted model, as evidenced by a better agreement with the scientific literature.
How can this be?

We have noticed that an effect of the weighting is the deflation of the higher dimensions.
Now suppose that all these models are overfitted. In that case weighting might reduce the degree of overfitting, because it reduces the number of operational dimensions.

Is there any reason to think that these models are overfitted?
In some scenarios there is.
Davidski's PCA is calculated on the basis of 1000 samples (?).
Now let us take a subset of these 1000, for instance the West_Eurasians. Does this subset result in the same PCA? We know it doesn't because David has experimented with this subset.
The Euclidean distances have to be calculated against different eigenvectors. Technically we might say that the local geometry of the West_Eurasians is different from the global geometry.

The extreme consequence of this reasoning is that for each model you have to recalculate the eigenvectors. And now you are in for a new surprise.
If your PCA software is based on matrix calculations you will get an error because you have more dimensions than observations.
If your PCA software is based on svd you will get a result, but the estimated parameters cannot be independent.
So yes, in some scenarios an overfitting problem is quite possible and it may be relaxed by weighting.

I noted that by a different route you arrived at comparable ideas.

FrankN said...

@Rob: Took a look at your Kalash comparison. Interesting model, not sure what you are after with it…
First a look at each source pop’s distance to the Kalash (in %):

MA1: 9.69
UI: 5.8
Natufian: 9.45
Bichon: 14.84

UI is surprisingly close, but the other sources are far off and of course can’t yield good approximations.
5/6th of the error are produced on three dimensions, namely PC4, 6 and 9. PC4 spans between Natufian (pos) and UHG (neg.), with both MA1 and Bichon having high negative loads, while Kalash scores close to zero on it. nMonte unsuccessfully tries to balance both poles, but the only somewhat “neutral” source pop to achieve this is UI, hence its high share in the unweighted scenario.
PC6 and 9 are both anchored on CHG/Iran_Neol. All source pops are further away from this pole than the Kalash, and thus can’t approximate it correctly.
So essentially – this has already become clear from Sein’s modeling – your setup lacks adequate IranoCaucasian source pops to yield satisfactory results. Weighting decreases the relevance of the a/m higher dimensions and zooms into PC1-3, where the Kalash score relatively neutral and may be approximated by various combinations of Eurasian source pops.

I took this as opportunity to check a bit on individual pop’s distances to the Kalash. Current populations:

Iranian_Persians: 4.46
Armenian_Chambarak: 4.65
Pashtun_Afghan: 5.42
Kyrgyz: 6.52
Burusho: 6.61
Burmese: 7.5

Quite a surprise: Closeness to Iranians may still have been expectable, but Kalash being closer to Armenians than to Afghan Pashtuns..

It gets even more interesting when looking at aDNA (no tricking, I have checked each and every ancient Steppe sample, and only took the best ones):

Armenia_CA (early Kura-Araxes): I1634: 3.69
Armenia MLBA: 3.99
Maros:RISE373, Vatya:RISE480: 4.95
Germany_MBA:RISE471: 5.09
Armenia_EBA:I1635: 5.27
Unetice_EBA:RISE577: 5.3
Altai_IA:RISE492: 5.37
Sintashta:RISE386: 5.55
Potapovka:I0419: 5.72
Srubnaya:I0235: 5.89
Barcin_Neolithic:I0708: 6.25
Iran_Chalcolithic:I1670: 6.27
Poltavka_outlier:I0432: 6.53
Yamnaya_Kalmykia:RISE240: 7.21
Samara_Eneolithic:I0434: 8.85
Iran_Late_Neolithic:I1671: 9.06
Iran_Hotu: 9.09

The Armenian connection obviously has quite some tradition. Moreover, there apparently was some kind of connection between the Northern Alps and the Hindukush during the EMBA, and that connection went rather south of (or across?) the Caspian Sea than via the Caspian Steppe (note UI beating Yamnaya and Srubnaya).

The above should provide you and others with some ideas on how to better tune nMonte models for the Kalash.

Matt said...

@Alberto: "That's really strange. So where did people get the eigenvalues from if not from the variance (the maximum spread) in each dimension?"

I don't really know. I would infer that this is a methodological detail of how Davidski ran the PCA and it just gave the dimensions separated from their eigenvalues.

The question for me is more that I don't know what the different min-max range scaling on each dimension actually represents if *not* related to the eigenvalue (also true for the ancient West Eurasia plot here, where the dimensions don't reduce in min-max range in a strictly decreasing fashion, though the pattern is closer).

Btw, I did another scaling exercise in which I first scaled each dimension from (0-1) and then scaled each dimension to the eigenvalue:

Different from multiplying each dimension by the eigenvalue as it would adjust for different sizes of each dimension that existed in the ordinary dataset. Not too different though.
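
To make the difference between the two scalings concrete, here is a minimal sketch in plain Python. The coordinates and eigenvalues are made-up toy numbers, and the function names are hypothetical; this is not the script that was actually used.

```python
def minmax_then_eigen(rows, eigenvalues):
    """Rescale every column to the range 0-1, then multiply it by its eigenvalue.

    Assumes no column is constant (otherwise the range would be zero).
    """
    n_dims = len(eigenvalues)
    lows = [min(r[d] for r in rows) for d in range(n_dims)]
    highs = [max(r[d] for r in rows) for d in range(n_dims)]
    return [
        [(r[d] - lows[d]) / (highs[d] - lows[d]) * eigenvalues[d]
         for d in range(n_dims)]
        for r in rows
    ]


def multiply_by_eigen(rows, eigenvalues):
    """Multiply every column by its eigenvalue, keeping the original spread."""
    return [[v * e for v, e in zip(r, eigenvalues)] for r in rows]


def column_range(rows, d):
    values = [r[d] for r in rows]
    return max(values) - min(values)


# Toy data: three samples on two dimensions (hypothetical values).
coords = [[0.10, -0.02], [-0.05, 0.06], [0.00, 0.01]]
eigen = [2.0, 0.5]

scaled_a = minmax_then_eigen(coords, eigen)
scaled_b = multiply_by_eigen(coords, eigen)

# After min-max scaling the column ranges equal the eigenvalues exactly;
# after plain multiplication they still depend on the original ranges.
print([column_range(scaled_a, d) for d in range(2)])
print([column_range(scaled_b, d) for d in range(2)])
```

The first variant forces every dimension's spread to match its eigenvalue, while the second keeps whatever spread the dimension already had and only stretches it.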

@ Eren:Regarding the PCA plots, are you sure you labeled them correctly, because the unscaled PCA looks more like your description of the scaled PCA and vice versa.

Yes, labelled them backwards. Scaled: Unscaled:

And I couldn’t open the distance matrices properly, it looks all like gibberish.

Should've downloaded as a set of comma separated data that opens as a text file like this - - and then when you rename the file extension to .csv, opens in a spreadsheet as a distance matrix like this -

You get anything like that or just random numbers and characters? (Gibberish).


The following may be sort of on topic, but also I was pretty impressed that the neighbour joining trees and PCA on the scaled / weighted data seemed to reproduce world relationships a bit better than the unscaled data, so I thought I would try to restrict to just West Eurasian samples and do the same thing with PCA:

PCA on PCA with just broadly West Eurasian samples, on unscaled data:

NJ equivalent:

PCA on PCA with just broadly West Eurasian samples, on scaled data:

NJ equivalent:

For just West Eurasian samples, PCA seems to function less well to separate and structure the West Eurasian samples on scaled data in the predicted fashion. As the result is more strongly driven by PC1 and PC2 which are relatively noisily associated with WHG, EHG, Iran_N and Levant_N. Even though it works better to separate populations on the world scale. The NJ I find harder to call.

It seems to me like the unscaled don't really represent the distance of world populations quite right. (Looking at the euclidean distance matrix I've run, you have things like the distance of Saudi to Thai and Latvian looking equal.) At the same time, the scaled version looks maybe noisy for fine scale local relationships?

Rob said...

@ FrankN

Thanks Frank. I know Armenia MBA is a great proxy for many west and even south Asian groups. The point of the above was to try to model Kalash using *only LUP samples, excluding Satsurblia*. An artificial exercise, but done for the sake of comparisons, specifically something Sein had an issue with (some runs showing "Natufian" in Kalash). A forced use of selective source populations gives windows into different phenomena, and tinkering with using only Mesolithic samples vs Eneolithic samples vs Bronze Age samples might give different clues cumulatively. Perhaps I'll illustrate later.

FrankN said...

@Rob: It’s o.k., I had already assumed you had a reason to develop and post such a counterintuitive model for the Kalash.

I have looked at your BattleAxe RISE94. A bit cumbersome: The unweighted model had 3.7% missing, which I have ascribed to Loschbour. Moreover, I wasn’t sure what Motala meant in particular (I took I0011), neither about Barcin (I took I1097). My distance is a bit off from the one given by you (0.384% vs. 0.3734%), meaning I have gotten something slightly wrong, but it still feels close enough to your data to allow for qualified comments.

In general, the unweighted output looks quite decent. I actually don’t understand why weighted nMonte produces such a differing result, including a minor phantom Yoruba component. Except for PC3, the unweighted output beats the weighted one across all dimensions – the latter is clearly suboptimal. Possibly, you have set some nMonte parameters in a way that allows termination before the best solution has been found.

The unweighted as well as the weighted output have larger (yet overall still moderate) errors on PCs 3, 7 and 8. “Weighted” in addition is quite a disaster on PC4 (EHG/SHG - Natufian), as Yamnaya, which replaces Motala in the “weighted” version, doesn’t have sufficient SHG-ness to provide for an acceptable fit.
More specifically:
- PC3 (Sahul-ENA) calls for more “easterness” than provided by the source pops – Motala and Yamnaya are fine in this respect, but offset by Kotias and ANF/EEF-rich pops;
- PC7 (ENA-WHG/SHG) requests some more UHG,
- PC8 (SEA - Oceania), finally, has RISE94 scoring higher on the Oceania side than any source pop except for Kotias.

Making sense of these weaknesses is quite difficult. MA1 could work fine on PC7/8, but would worsen the situation with PC3. Ket, OTOH, are eastern enough to do a good job on PCs 3 & 8, but would fail on PC7. Those Lapita samples from Tonga and Vanuatu have everything required to close the gaps, but Vanuatu to Gotland? Srubnaya outlier I0354 also looks about right, but is unfortunately too young. Nevertheless, she indicates that adequate pops may have been around, but haven’t been sampled so far (or have they? – Pitted Ware is not covered by the PCA). In any case, anybody tired of waiting for fresh Baltic aDNA might want to include some Siberian pops – say Ket or Khanty – as source pops for the next nMonte run on RISE94 to see what happens.

Bottom line, however, is: Eren’s weighting doesn’t improve the output quality for RISE94, but worsens it.

I would love to do a similar analysis on your CW cases, Rob, but the outputs provided by you unfortunately also miss some percentages, i.e. don’t add up to 100. Can you e-mail the complete outputs to me, or re-post them here?

Rob said...

Yes Frank
For CWC I had to manually transcribe them from a saved wordpad and got lazy (omitting the final minor percentages).
There is also an "Ncycle=300" option in running individual analyses.
I'll repost the Full gamut later today.
Should I stick with CWC for now ? Do you want anything else added (Unetice, BB, south Asian aDNA) ?

Anonymous said...

Do you want anything else added (Unetice, BB, south Asian aDNA) ?

Do you already have access to the south Asian aDNA?

Ryukendo once referred to Gravetto-Danubian on Anthrogenica as "Rob". And around October 2016, Gravetto-Danubian had said that the south Asian aDNA was already analysed and the results known to a few. And since you now referred to south Asian aDNA, does that mean you have access to the south Asian aDNA results? And is it BMAC, Swat, or IVC, or all together, that you have access to?

Aram said...

Kalash have high level of Y DNA L. 25%. Afaik it is mostly L1a-M27. The same branch as in Armenia Chalcolithic. How to interpret this I don't know.

Aram said...

Woow. A new L-M27* is added to Yfull
He is from Saudi Arabia.
Areni cave males were also M27*.

Davidski said...


Apparently a number of South Asian samples are now ready, but the results and data are embargoed until papers based on them come out in 5-6 months time.

Rakhigarhi is just one site from which we'll see South and Central Asian DNA, and there will be Mesolithic samples as well.

huijbregts said...

You have invited us not to be shy in speculating about the causes of the weighting paradox.
So here I go: maybe the dimensions of the Global 10 are not as accurate as we would wish.
My doubts started when I asked myself the question "How many samples would you need to estimate a PCA in 10 dimensions". My gut feeling immediately answered "millions" and you have 'only' about 500.
To put some empirical flesh on this question, I have done bootstrap analysis on the datasheet. The next table shows the 95% reliability margins of the eigenvalues:
         2.5 %             97.5 %
 [1,]    0.001516851       0.0018093809
 [2,]    0.001283789       0.0015458585
 [3,]    0.0008120617      0.0011207352
 [4,]    0.00005826861     0.0008933734
 [5,]    0.0004631113      0.0007224738
 [6,]    0.0003049551      0.0006314343
 [7,]    0.0001258633      0.0003544125
 [8,]    0.00002424219     0.0002730802
 [9,]    0.00001286359     0.0001319374
[10,]    0.000006936709    0.0000333966
As you can see, the estimates of the first two eigenvalues have a small overlap, but they are completely separate from eigenvalue 3.
Eigenvalue 3 has a definite overlap with eigenvalue 4, but maybe that is still acceptable. However, the overlap between eigenvalues 4 and 5 is huge.
Now this is the first time I have ever seen a bootstrapping of eigenvalues, so my interpretation is necessarily speculative.
But I think it is plausible that, if the algorithm cannot discriminate eigenvalues 4 and 5, it also cannot reliably project the DNA on eigenvectors 4 and 5.
So the scores on dimensions 4 and 5 are just a matter of chance.
If this interpretation is true, dimensions 4 and 5 (and following) are invalidated for the calculation of the Euclidean distance.
It would also explain why the nMonte results of the weighted scores often appear more realistic than the results of a regular nMonte run on 10 dimensions.
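
For anyone who wants to repeat this kind of check, below is a simplified sketch in plain Python. It is not the analysis that produced the table: it assumes the datasheet columns are already principal axes, so that the variance of each column can stand in for its eigenvalue, and the function names are hypothetical. The `REPORTED` list just copies the first five intervals from the table above so the overlap check can be run on them.

```python
import random
import statistics


def bootstrap_variance_intervals(rows, n_boot=500, seed=0):
    """Percentile bootstrap of the per-column variance.

    Assumption: each column is one principal axis, so its variance is used
    as a stand-in for the eigenvalue of that dimension.
    """
    rng = random.Random(seed)
    n, n_dims = len(rows), len(rows[0])
    draws = [[] for _ in range(n_dims)]
    for _ in range(n_boot):
        # Resample whole samples (rows) with replacement.
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        for d in range(n_dims):
            draws[d].append(statistics.pvariance([r[d] for r in resample]))
    intervals = []
    for d in range(n_dims):
        ordered = sorted(draws[d])
        intervals.append((ordered[int(0.025 * (n_boot - 1))],
                          ordered[int(0.975 * (n_boot - 1))]))
    return intervals


def overlaps(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# First five intervals as reported in the table above.
REPORTED = [
    (0.001516851, 0.0018093809),
    (0.001283789, 0.0015458585),
    (0.0008120617, 0.0011207352),
    (0.00005826861, 0.0008933734),
    (0.0004631113, 0.0007224738),
]

# Does each interval overlap with the next one?
pairwise = [overlaps(REPORTED[i], REPORTED[i + 1]) for i in range(4)]
print(pairwise)
```

On the reported numbers, the check flags an overlap for 1 vs 2, none for 2 vs 3, and overlaps again for 3 vs 4 and 4 vs 5, which matches the reading given above.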

Rob said...

@ AK
No I don't have access to anything. Sorry For false alarm ! :)
I meant Iran/ west Asia

Davidski said...


I use many more samples than 500 to run the PCA. About 1,600 in fact. But you're only seeing the results of about 500.

In any case, it's not realistic in this context to run much more than a couple thousand.

Alberto said...


It seems to me like the unscaled don't really represent the distance of world populations quite right. (Looking at the euclidean distance matrix I've run, you have things like the distance of Saudi to Thai and Latvian looking equal.) At the same time, the scaled version looks maybe noisy for fine scale local relationships?

Yes, that's a pretty good summary of the situation. The reason is probably that the first 3 dimensions provide a better estimate of world population distances, while the next 7 look at more local phenomena that distort the big picture. So for a few specific models, the scaling works well (the example of Karitiana being an obvious one), but for the more usual models that we use it doesn't work well (like North Europeans getting 0% Kotias when modeled using the "big 4" European ancestors).

That's why I said it might just not be possible to get all the world populations into a single 10 dimensional PCA and get sensible results for every model. Eurasian models with reasonably proximate sources seem to work well (I posted that K11 for all Eurasians and nothing looked really bad), so for the most part this works pretty well for our purposes. I just wonder if doing a PCA with only Eurasian populations (Oceanians do weird things, like on PC8, while SSA are also big outliers with little use for us) and maybe 6-7 dimensions would make it even better or roughly the same.

huijbregts said...

I am glad to hear that you have used 1600 samples. The more the better.
But even 1600 samples is IMO far too few to accurately estimate a 10-dimensional PCA. I expect that it may shift the problems from dimension 4 to dimension 5.
The problem remains that at some dimension the estimations will start to be inaccurate.
The decision not to include these noisy dimensions in the calculation of the Euclidean distance can considerably increase the quality of the nMonte estimation.
Which is exactly the effect of weighting the data.

Anonymous said...

Thanks both for the response.

"Rakhigarhi is just one site from which we'll see South and Central Asian DNA, and there will be Mesolithic samples as well."

Will these Mesolithic samples be from South or Central Asia, or both? And are the sampling localities known with any more specificity, such as "Tibet/Himalayan" or "Andaman"?

Any chance of Neolithic and early Chalcolithic samples as well?

the results and data are embargoed until papers based on them come out in 5-6 months time.

This is the part I don't get. If the hard work of results and analysis was already done, why wait so long to write up the papers presenting the findings? There's so much more of the world's geography and eras, not to mention other species, to cover.

Davidski said...

It takes 5-6 months to analyze the ancient data, write up the manuscript, get it through peer review, and then see it published online.

Often it takes as much as two months from acceptance to publication in a journal.

Rob said...

@ Aram

I suspect L-M27 broadly overlaps with the 'Iran Neolithic' component seen in both Armenia Chalc, and modern Kalash, so the link is somewhat indirect ?

FrankN said...

Re: Kalash distances that I have posted above

Alberto kindly pointed me towards an error - essentially, using Excel, I had a cell reference wrong. So, please discard my numbers posted above, and excuse any irritation those wrong numbers may have caused. The comparison of goodness of fit for the “unweighted” and “weighted” Kalash outputs wasn’t affected by that error. [Dave - is there any way to edit instead of completely deleting older post?]

Alberto has prepared a correct list of indiv. pops distance to the Kalash that can be found here:

I have to retreat from my previous statement that Kalash substantially differ from nearby pops. In fact, the PCA has them being quite close to Pathan, Burusho, Pashtun_Afg, Brahui etc. However, the general observation of genetic closeness to Iranian and Armenian aDNA, and higher distance to Steppe aDNA, still stands, albeit with some shifts in the sequence. The closest aDNA to the Kalash is Iran_Hotu, followed by Armenia_CA:I1409, Iran_Neolithic:I1945, and Armenia_MLBA:RISE397.

As to the general discussion: PCA already constitutes a massive information compression by factor 1,000 or higher. In view of this massive compression, I can’t imagine overfitting to be an issue.
While it is legitimate to investigate the goodness of fit for higher PCA dimensions, I wouldn’t a priori discard them and thereby reduce the information content even further. My comparisons above of selected “unweighted” vs. “weighted” outputs have shown that even the higher dimensions contain valuable and reasonably well structured information. Intuitively better-looking “weighted” results were in all cases due to the provided source pops insufficiently covering higher dimensions. For the time being, I think it is better to take seemingly “wrong” unweighted nMonte outputs as a signal of missing source pops, rather than just “weighting” this signal out and obtaining a superficially correct-looking output based on a further information-compressed (i.e. mostly just PCs 1-3) dataset.

Having said that, I think we should try to better figure out why PCAs, or at least Dave’s GlobalK10, apparently come out unscaled/ non-normalised, and which effect this has on the quality of nMonte outputs.

huijbregts said...

You must have done a lot of systematic comparisons.
I do not quite understand your opinion about the higher dimensions. On the one hand you say that they contain valuable information, but in the next sentence you state that 'better-looking “weighted” results were in all cases due to the provided source pops insufficiently covering higher dimensions'.
Are you familiar with the standard practice of big data scientists to reduce the dimensionality before calculating a Euclidean distance? If not, see the Wikipedia lemmas about 'dimensionality reduction' and 'the curse of dimensionality'.
Weighting the data by the eigenvalues also has the effect of a dimensionality reduction. I think this is more than a coincidence.

Matt said...

Alberto: That's why I said it might just not be possible to get all the world populations into a single 10 dimensional PCA and get sensible results for every model.

Yes, though, that's tricky if you really want to make comparisons with the world admixed populations in ancient terms.

With these data, there are compromise solutions I can think of, like scaling to the square root of the eigenvalue, which preserves some of each distinction. But it's hard to scientifically justify why that's not just an arbitrary choice...

FrankN said...

@huijbregts; I hold a BA of Business Mathematics and when still in University did some C+ programming on numerical optimization, so you may assume me to be acquainted with the general concepts in question here.
Let me in this context address one of your earlier misunderstandings (I understand you are a trained biologist, not a mathematician): Neither re-scaling a PCA dimension by adding/subtracting a constant, nor by multiplying it with a scaling factor, changes the orthogonality of the system. It affects, however, the distances between individual datapoints within that system. In fact, that distance change is the idea behind re-scaling/normalising PC axes.
So, the question is not whether re-scaling is theoretically good or bad - it is in principle neutral. The question is, whether a specific re-scaling ("weighting") makes sense for the data in question or not. I lack the understanding why Dave's K10 data comes across as seemingly non-normalised/ distorted as it does, as I don't have insight into the algorithm that produced it. nMonte can work with that PCA, and for Europe produces reasonable outputs, so I'd rather refrain from any operations until I understand why the PCA looks as it looks.
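
A tiny numerical illustration of this point, as a sketch in plain Python with made-up 2-D points and hypothetical helper names. It separates the two operations: a constant shift leaves pairwise distances unchanged, while a per-axis scaling factor changes them, and in both cases the axes stay orthogonal.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rescale(p, factors, shifts):
    """Multiply each coordinate by its factor, then add its shift."""
    return [x * f + s for x, f, s in zip(p, factors, shifts)]


p, q = [3.0, 4.0], [0.0, 0.0]
factors, shifts = [1.0, 0.5], [10.0, -2.0]

# The axis directions after scaling are still perpendicular.
axis_1 = [factors[0], 0.0]
axis_2 = [0.0, factors[1]]
axes_dot = dot(axis_1, axis_2)

d_original = dist(p, q)
d_shift_only = dist(rescale(p, [1.0, 1.0], shifts),
                    rescale(q, [1.0, 1.0], shifts))
d_scaled = dist(rescale(p, factors, shifts), rescale(q, factors, shifts))

print(axes_dot, d_original, d_shift_only, d_scaled)
```

So the weighting debate is entirely about which relative distances one wants, not about breaking the geometry of the PCA.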

On the one hand you say that they [higher PC dimensions] contain valuable information, but in the next sentence you state that 'better-looking “weighted” results were in all cases due to the provided source pops insufficiently covering higher dimensions'.

Yes, that’s how I see it. The problem doesn’t lie with the higher dimensions, but with the source pops that have been specified.

Let’s take Sein’s Mongola / Karitiana runs as example: Both targets are quite negatively-loaded on PC7 (Mong. -0,0225, Karit. -0,02923), while the “desired” sources have a substantial positive load (Ami 0,04734, MA1 0,0304). In order to solve that dilemma, nMonte draws strongly on the source pop with the highest negative PC7 load, which happen to be Natufians (-0,0509) - quite unsuccessfully, because even after boosting Natufians, PC7 still makes up for 55% of the total error, and on PC2, pos. loaded Natufians are a poor choice for the neg. loaded targets.

Eren’s weighting virtually discards PC7, and, by enhancing PC2, throws out Natufians from the mix. Rightly so, of course Natufians are a bad fit for Mongola and Karitiana.
But that doesn’t solve the real problem, which is absence of a suitable, significantly negatively-loaded source pop on PC7. Such source pops are available – PC7 has a well-defined negative pole made up by Beringian pops like Nganasan, Sakha, Eskimo, Yukaghir, Chukchi, Koryak etc.. Including any of these as source pop should remove Natufians from the unweighted output and also produce quite decent fits. I manually (i.e. shifting percentages around in whole percentage steps) produced Karitiana as 60% Itelmen, 27% MA1 and 13% Lahu; dist. 0.732. For comparison: Sein’s 60.9% Ami, 39.1% MA1 mix on unweighted nMonte yields a distance of 9,179. Leaves the final question: Is it, from a theoretical standpoint, unadvisable to include a Beringian source into nMonte analysis of Karitiana? I don’t think so...
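
The kind of per-dimension error check used in this argument can be sketched as follows, in plain Python. The coordinates are hypothetical three-dimensional toy values rather than rows from the actual datasheet, and the function names are made up for illustration.

```python
def mix(sources, weights):
    """Weighted average of source coordinates (weights should sum to 1)."""
    n_dims = len(sources[0])
    return [sum(w * s[d] for s, w in zip(sources, weights))
            for d in range(n_dims)]


def error_shares(target, fitted):
    """Share of the total squared error contributed by each dimension."""
    squared = [(t - f) ** 2 for t, f in zip(target, fitted)]
    total = sum(squared)
    if total == 0:
        return [0.0] * len(squared)
    return [e / total for e in squared]


# Hypothetical toy values: the target is negative on the last dimension,
# while both available sources are positive on it.
target = [0.02, -0.01, -0.03]
source_a = [0.01, 0.00, 0.05]
source_b = [-0.01, 0.00, 0.03]

fitted = mix([source_a, source_b], [0.5, 0.5])
shares = error_shares(target, fitted)
worst_dim = shares.index(max(shares))
print(worst_dim, [round(s, 3) for s in shares])
```

When one dimension carries most of the error like this, no reshuffling of percentages between the given sources can fix it; only a source on the other side of that dimension can.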

Shaikorth said...

"I manually (i.e. shifting percentages around in whole percentage steps) produced Karitiana as 60% Itelmen, 27% MA1 and 13% Lahu; dist. 0.732. For comparison: Sein’s 60.9% Ami, 39.1% MA1 mix on unweighted nMonte yields a distance of 9,179. Leaves the final question: Is it, from a theoretical standpoint, unadvisable to include a Beringian source into nMonte analysis of Karitiana? I don’t think so.."

If your goal is to get closer to Karitiana's Paleolithic source components it's wise to leave them out, as Siberians are ANE-admixed in comparison to Ami. Further, modern Beringians have more recent ethnogenesis than Native Americans by far so they are unlikely to be Karitiana's real source populations, even if the models work. The Onge-ANE model of Lazaridis 2016 may be very hard to replicate with PCA so using Ami or She or Dai for Karitiana is a good choice.

Alberto said...


With these data, there are compromise solutions I can think of, like scaling to the square root of the eigenvalue, which preserves some of each distinction. But it's hard to scientifically justify why that's not just an arbitrary choice...

Well, two things are clear: there is something strange with the values in higher dimensions not being in accordance with the eigenvalues, but applying the factors provided by Eren above doesn't really work well. So yes, why not try that suggestion that also huijbregts made above and which is a compromise between both extremes?

These are the factors applied: 1, 0.846718713, 0.452874391, 0.323627789, 0.298822531, 0.284710759, 0.271762513, 0.220536005, 0.217007202, 0.214022091
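
As a sketch of how such factors enter the distance calculation (plain Python, with a hypothetical function name; the factors are copied from the line above and the test points are artificial unit steps, not real samples):

```python
import math

# Factors quoted in the comment above, one per dimension (PC1..PC10).
FACTORS = [1, 0.846718713, 0.452874391, 0.323627789, 0.298822531,
           0.284710759, 0.271762513, 0.220536005, 0.217007202, 0.214022091]


def weighted_distance(p, q, factors=FACTORS):
    """Euclidean distance after multiplying each dimension by its factor."""
    return math.sqrt(sum((f * (a - b)) ** 2
                         for a, b, f in zip(p, q, factors)))


origin = [0.0] * 10
step_pc1 = [1.0] + [0.0] * 9    # differs from origin on PC1 only
step_pc10 = [0.0] * 9 + [1.0]   # differs from origin on PC10 only

d1 = weighted_distance(origin, step_pc1)
d10 = weighted_distance(origin, step_pc10)
print(d1, d10)
```

With these factors, a difference of a given size on PC10 counts for roughly a fifth of the same difference on PC1, instead of counting equally as in the unweighted distance.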

First thing that Eren noticed: Loschbour was very distant from him, more than Yoruba. That does seem definitely wrong.

Unweighted euclidean distance:
Turkish <-> Loschbour = 0.158371
Turkish <-> Yoruba = 0.143799

Weighted with square root of eigenvalues factors:
Turkish <-> Loschbour = 0.046
Turkish <-> Yoruba = 0.134087

So this problem gets solved (as it did with the original factors). Now for Karitiana using Ami and MA1 (I do agree with FrankN that probably Ami is not a good reference, but still the model should be better than the unweighted one):

MA1:MA1 70 %
Ami 30 %

Distance 0.074276

Ami 50 %
MA1:MA1 50 %

Distance 0.025165

Using Ulchi instead of Ami:

Ulchi 63 %
MA1:MA1 37 %

Distance 0.010238

Ulchi 67 %
MA1:MA1 33 %

Distance 0.004028
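
For readers who want to see what such a two-source fit amounts to, here is a minimal sketch in plain Python. It is a simple grid search in whole-percent steps, not the nMonte algorithm, and the coordinates are hypothetical 2-D toy values rather than real datasheet rows.

```python
import math


def best_two_way_mix(target, source_a, source_b, steps=100):
    """Try every split of source_a/source_b in 1/steps increments.

    Returns (weight_of_source_a, distance) for the closest mixture.
    """
    best = None
    for k in range(steps + 1):
        w = k / steps
        fitted = [w * a + (1 - w) * b for a, b in zip(source_a, source_b)]
        d = math.dist(target, fitted)
        if best is None or d < best[1]:
            best = (w, d)
    return best


# Hypothetical toy coordinates.
source_a = [0.0, 0.0]
source_b = [1.0, 0.0]
target = [0.33, 0.1]

weight_a, distance = best_two_way_mix(target, source_a, source_b)
print(weight_a, distance)
```

In this toy case the best split is 67% of the first source and 33% of the second, and the remaining distance of about 0.1 comes entirely from the second dimension, which neither source can reach. The same search can be run on weighted coordinates by multiplying each dimension by its factor first.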

And now for the more usual models. The original weighting was making Estonians get 0% Kotias (vs. 22% unweighted), which was clearly wrong. But with the new weighting, they get 18% Kotias, and the model looked perfectly reasonable. So I went along and replicated the same models I did with the original unweighted data to be able to compare them side by side. I just had time to upload them and hardly to look at them (time to run to bed here), but they seem to be quite similar. I leave both here for anyone to take a closer look. Maybe this is indeed a good solution (in any case this should be applied to the datasheet itself, not as a change in nMonte).

Unweighted (original):


(I did all this in a bit of a rush, so I hope I didn't make any mistake. Tomorrow I'll double check, just in case).

Aram said...

Yes that is a possibility. The only problem is the low TMRCA of modern branches. 6200 ybp. This means either it suffered a severe bottleneck or it had another expansion at Chalcolithic.

huijbregts said...

I am glad to hear that you are a practicing mathematician. So you should understand the abstract mathematics if I can state it clearly.
There is a fundamental problem with higher dimensions. I did not recognize it initially, so I may have posted misleading comments.
Orthogonality is not the problem. The problem is that 10-dimensional space is unimaginably vast; compare the volume of a 10-dimensional cube with the volume of a 3-dimensional cube.
10-dimensional data are necessarily sparse data. If you had the DNA of all the people that have ever lived (and supposing you had a 10-dimensional hard disk to store it), that would not change the fact that it is sparse data in 10-dimensional space.
The problem with these sparse data is that samples do not capture all the possible variance. Sure, you can extract structures; but take a fresh batch of data and you will extract another structure.
Big data scientists have recognized these problems long ago; an important paper had the title 'The curse of dimensionality'. Their solution is simple: after you have collected your high-dimensional data, you should reduce the number of dimensions before you use the data for further calculations like calculating the Euclidean distance.
If your data are from a PCA there is a simple way to do that: just drop the higher dimensions. By the transformation of Eren the higher dimensions are deflated, which has an effect that is comparable to dimensional reduction. That is why I think it is effective.
A few posts ago I reported the result of bootstrapping the Global 10 datasheet that David has published. My interpretation of the output is that in this dataset the 4th and 5th eigenvalue cannot be discriminated.
In my opinion when we use the Global 10 or other high dimensional DNA data, we should reduce the number of dimensions to 3 before we calculate the Euclidean distance or feed the data into nMonte.
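
The sparsity argument can be put into rough numbers with a short sketch in plain Python. It assumes, purely for illustration, that each axis is cut into 10 bins, and it uses the figure of about 1,600 samples mentioned earlier in the thread; the function name is hypothetical.

```python
def max_occupied_fraction(n_samples, n_dims, bins_per_axis=10):
    """Upper bound on the fraction of grid cells that can hold a sample.

    Assumption: the space is cut into bins_per_axis equal bins per dimension.
    """
    cells = bins_per_axis ** n_dims
    return min(1.0, n_samples / cells)


in_3d = max_occupied_fraction(1600, 3)    # 1,000 cells
in_10d = max_occupied_fraction(1600, 10)  # 10,000,000,000 cells
print(in_3d, in_10d)
```

With 1,600 samples every cell of the 3-dimensional grid could in principle be occupied, while in 10 dimensions at most about 0.000016% of the cells can be.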

Matt said...

The orthogonality of the data does mean that you will drop distinctions between many populations if you drop K above 3. The structure of the world plus fine structure just does not fit well into 3 dimensions.

Particularly for estimating Euro-Siberian HG ancestry, a heavy reliance on dimensions 1-3 (whether by weighting or exclusion) will not provide a very meaningful output, as these individual samples do not form a clade in dimensions 1-3, have a somewhat widely distributed position in those, and some are very close to current populations. (This makes sense as dimensions 1-3 mainly capture affinity to Africans, Neolithic Europeans and Eastern Non-Africans and the Euro HG are diverse on this measure and not outside the range of present day Europeans in these dimensions as constructed, and don't form a clade on these dimensions). Pooling the Euro HG to a single average would not solve this problem.

Here is the neighbour joining tree based on just Dimensions 1-3:

Here is a set of euclidean distances from populations, in increasing order, for the ancient individuals:

Siberian HG aren't closest to Euro HG, Early Neolithic Europeans are not structured in a way that correlates with any estimated Euro HG contribution, Euro HG are not closest to one another, etc.

You'll get very widely varying estimates depending on which EuroHG sample you use in your model and being overlapping with present day West Eurasians in these dimensions means that EuroHG will often tend to inflate strongly compared to Neolithic contribution.

Matt said...

@ Alberto: Weighted with square root of eigenvalues factors:....

Using those proportions, to try and visualise via PCA I transformed as follows

KareliaHG unchanged

(Example outcomes of process:
Basque_Spanish - WHG: 35%, Karelia: 11%, Levant_N: 44%, CHG/Iran: 10%
Dutch - WHG: 41%, Karelia: 6%, Levant_N: 26%, CHG/Iran: 27%
England_Roman:6DRIF-23 - WHG: 42%, Karelia: 6%, Levant_N: 26%, CHG/Iran: 26%
Irish - WHG: 36% Karelia: 13%, Levant_N: 26%, CHG/Iran: 25%
Polish - WHG: 37%, Karelia: 17%, Levant_N: 24%, CHG/Iran: 21%
Estonian - WHG: 38%, Karelia: 26%, Levant_N: 19%, CHG/Iran: 17%
Lithuania - WHG: 44%, Karelia: 16%, Levant_N: 17%, CHG/Iran: 22%
Hungary_BA:I502 - WHG: 49%, Levant_N: 29%, CHG/Iran: 18%
Iberia_Chal:I281 - WHG: 41%, Levant_N: 58%, African: 0.7%
Sardinian - WHG: 32%, Levant_N: 64%, CHG: 3%, African: 1%
Finnish - WHG: 14%, Karelia: 54%, Levant_N: 29%, CHG/Iran_N: 0%, Dai: 2%
Russian_Kargopol - WHG: 13%, Karelia: 52%, Levant_N 27%, CHG/Iran_N: 5%, Dai: 2%
Abkhasian - WHG: 7%, Levant_N: 28%, CHG/Iran_N: 62%, Nganasan: 2%
Saami - WHG: 13%, Karelia: 48%, Levant_N: 18%, Nganasan 20%
Chuvash - WHG: 8%, Karelia: 44%, Levant_N: 22%, CHG/Iran: 7%, Nganasan: 16%, Dai 3%
Kalash - Karelia: 20%, Levant_N: 1%, CHG/Iran_N: 56%, Paniya: 22%
Uzbek - WHG: 3%, Karelia_HG: 11%, Levant_N: 13%, CHG/Iran_N: 24%, Nganasan: 26%, Paniya: 8%, Dai: 15% [Total from WHG+Karelia+Levant+CHG-Iran: 49%])

Then filtered out populations with less than 76.7% (Pathan value setting the lower bound) in those clusters.

PCA result:

Forms a fairly decent West Eurasia PCA. Most unusual aspect would be that PC3 here suggests some slight additional tendency for what look like Bronze Age West European samples (mainly Roman Era England, and the Dutch it looks like) to have a higher WHG+CHG/Iran_N count and some Russian populations more EHG+Levant_N than suggested by the PC1 and PC2.

On that basis this weighting schema seems at least OK for preserving West Eurasia ancient structure and the proportions in Kalash, Saami, Chuvash and Uzbeks look reasonable (albeit I've used proportions from the literature here to estimate very ancient proportions here, and gone high for Levant_N in Mozabite).

(with some odd presence of Sudanese and Luo Africans in that PCA, as they came out as 100% Mozabite in your analysis, without a more African proxy, and I didn't bother to filter them out).

huijbregts said...

"The structure of the world plus fine structure just does not fit well into 3 dimensions."

No, it doesn't. I have realized this all along.
The question I was trying to answer was: "How can it be that a dubious idea like the Eren transformation appears to increase the quality of the nMonte estimations?"
As the Eren transformation deflates the higher dimensions, might it be possible that there is something wrong with them?
I knew already that it is common practice among big data and machine learning scientists to preprocess their high-dimensional data by a dimensional reduction.

In my previous posts I have elaborated the idea that high-dimensional data are necessarily sparse data. It involves abstract mathematics, but I think it is relevant.
After having sent this post, I realised that there is another, simpler reason to shun the higher dimensions: estimation errors.
In the calculation of a PCA the algorithm starts by finding the dimension with the largest component of the variance. This involves some estimation errors, but they are small.
However, these small errors carry over to the estimation of the second dimension and so on. So in the series of dimensions, the signal gets smaller all the time, while the errors accumulate.
Moreover, between the dimension 2 and 3 there is a large difference in eigenvalue. So somewhere on the road, the signal will be smaller than the estimation error.
So I am even more convinced that the higher dimensions should not be used in calculating the Euclidean distance.

I see that you are experimenting with the square root weighting.
Do your results deviate very much from what you get by only keeping dimensions 1-3?

Alberto said...


Yes, now that I had time to look at the differences everything looks good to me using those factors you provided above for weighting the dimensions (yes, Luo and Sudanese are there by mistake, I forgot to remove them).

So that weighting seems to solve those strange problems without introducing new ones, so it might be the correct answer.

For example, the euclidean distances now look good for all the populations. Without weighting, they look ok for proximate ones, but very wrong at the bottom of the list. For example, for Loschbour:

And after applying the weighting:

With the African populations at the bottom, which is the correct way.

The only models I find a bit suspicious after the weighting are the ones for Finnish and a few related populations, getting too much Karelia_HG, no Loschbour and no Kotias (and no Nganasan). The unweighted model seemed more balanced, and compatible with the D-stat:

Loschbour EHG Finnish Chimp 0.0138 2.675 341763

So apart from those few models (Finnish, Russian_Kargopol, Saami,...) all the rest looks good to me.

huijbregts said...

That is ironic.
By using my factors you have perfected a method which I distrust!
Anyway, thanks for the data.

Matt said...

@ Alberto: The only models I find a bit suspicious after the weighting are the ones for Finnish and a few related populations, getting too much Karelia_HG, no Loschbour and no Kotias (and no Nganasan).

It seems it's in the higher dimensions where a combination of ancestry from WHG+Nganasan like populations is most distinctive from a combination of Karelia_HG+Dai, so it makes sense that reducing the weight of higher dimensions would increase uncertainty on this.

Those are the same higher dimensions where African, South Asian, SE Asian populations have more similar position to West Eurasians, and where weighting against those dimensions avoids phenomena like Sardinians and Iberian Middle Neolithic picking up as much South Asian fractions (because with weighting, improving fit in high dimensions is a bad trade for decreasing fit in low dimensions). So it seems like it's maybe an unavoidable tradeoff in this data. (Unless as huijbregts suggests there is a problem in the construction of these higher dimensions, which is associated to this overlap.)

(To be honest, except at the margins, the proportions are very similar for both your spreadsheets. Most differences are 1-2%, except in those instances where unweighted finds Nganasan+WHG proportions which weighted often supplants with Dai+Karelia proportions to a degree, though not in any absolute pattern insofar as I can tell.)

huijbregts said...

I just thought about something. When you do a PCA, you have the choice whether or not to normalize the column variables. In this case the variables should not be normalized, because all the variables are of the same data type (some measure of DNA proportion). However, if you do normalize, the proportion of variance in the higher dimensions will be too high, and the Eren transformation with the root of the eigenvalues might more or less correct this error.

Do you know whether your software normalized the column variables during the PCA calculation?

Shaikorth said...

@Matt, Alberto

Excess EHG could be caused by mixed populations being used as sources: Karelia_HG does have significant WHG compared to some other ANE-type populations of the time, if the Srubnaya outlier is any indication, and Nganasan is just Dai with ANE or EHG added.

If so, dropping Nganasan, Paniya (ASI+Iran_N) and Karelia_HG while adding AG3 (or maybe Srubnaya outlier since it was preferred by most moderns in Sein's fits) and Onge could fix the WHG proportions.

Alberto said...

Yes, playing a bit with the populations could fix it and you can find a better balance (though modelling Estonian with AG3 and Onge is not that realistic for most of our purposes).

But my 2 cents after testing all this: Some time ago I was worried that the low variance in higher dimensions might not give us enough information to have consistent models, so I wanted to increase their relative weight. However, after some tests I gave up on the idea, since it didn't seem worth it.

Now what we've found out is that actually the higher dimensions already have a bigger weight than they should (mathematically speaking), so with applying this weighting we can fix it and give them their correct (lower) weight. This is mathematically correct, and works fine for things like calculating the total Euclidean distance from one population to another, or for very broad models.

However, when we need to look at finer details, we do get close to that line of uncertainty where a few of the "normal" models show inconsistencies. So as it turns out, leaving things as they are seems actually beneficial for those situations (which I think are more important many times than the other ones that get fixed with the correct weight of each dimension).

So overall I think it's good that the problem and solution were found, so that we know about it and correct for it when it matters. But most of the time I would just use the datasheet as it is, with that extra weight in higher dimensions helping our more common models to be consistent.

Just my opinion of course (based on my own testing).

Shaikorth said...

The model I proposed would keep most of your K=11 populations (replacing/removing only three) so it's unlikely Estonians would be getting Onge.

If Dai was also dropped and Onge kept there would be an opportunity to test if PCA can replicate Lazaridis 2016's "ASE" model where all East Eurasian populations including Dai are some kind of ANE-Onge mixes, but that would probably be too difficult. Onge don't seem distinctive enough along major dimensions, perhaps it'd work if they were oversampled enough to get "extreme" positions in more significant dimensions.

Alberto said...


I don't have Onge, but here with and without Dai:

Loschbour:Loschbour 32.6 %
AfontovaGora3:I9050.damage 25.6 %
Barcin_Neolithic:I0707 20.6 %
Esperstedt_MN:I0172 10.7 %
Kotias:KK1 9 %
Dai 1.5 %
Mozabite 0 %
Iran_Neolithic:I1290 0 %
Israel_Natufian:I0861 0 %

Distance 0.004843

Loschbour:Loschbour 31 %
AfontovaGora3:I9050.damage 28.9 %
Barcin_Neolithic:I0707 23.6 %
Esperstedt_MN:I0172 9.8 %
Kotias:KK1 6.7 %
Mozabite 0 %
Iran_Neolithic:I1290 0 %
Israel_Natufian:I0861 0 %

Distance 0.004986

And here with Dai and without applying the weighting:

Esperstedt_MN:I0172 37.7 %
AfontovaGora3:I9050.damage 27.5 %
Loschbour:Loschbour 24.1 %
Kotias:KK1 9.6 %
Dai 1.1 %
Barcin_Neolithic:I0707 0 %
Mozabite 0 %
Iran_Neolithic:I1290 0 %
Israel_Natufian:I0861 0 %

Distance 0.01065

So yes, with those populations the balance doesn't break like it does with Karelia_HG. We get similar amount of Kotias and of ANE.

But my point is that without weighting the dimensions things look more robust for the kind of models we use more frequently. In any case, I'm not opposed to fixing the weight so that we use the mathematically correct version. Just stating that in my experience it will make our life more difficult more often than it will help us.

Matt said...

@ Alberto, request for you if you're still following this comment thread.

I was considering the comments above about normalisation. One of the forms of scaling to the eigenvalues that I tried, which I'd mentioned to you upthread, involved a preliminary step of normalising each PC to fit a 0-1 scale. Then I scaled each dimension to the eigenvalue. (This is one kind of normalisation and there are others, like normalising to the SD, etc.)

So I was interested in what the neighbour joining structure would look like if I'd stopped at the step of normalising 0-1 and left at that. Expecting a horrible outcome.

Surprisingly, though:

The preservation of the structure looks relatively good. (Though with the big caveat that the West and East Eurasians do not form clades here and rather there's a big Central Asian clade on the tree that comprises Siberians and South Asians and is wedged in the middle of the West Eurasian part of the tree.)

The Euclidean distance matrices don't look great but are less poor than I would've expected them to be, compared to the unweighted form.

(Thinking about it, the fact that it preserves the African-Non-African split so much also might be due to the fact that, although the African split in Dimension 1 does not have that much weight, there are also splits away from Africans in the higher dimensions 10 and 5 which also have extra weight here.)

So my request to you is whether you could rerun your analysis from upthread with my normalised sheet? I'm assuming you have a neat script or something for mass nMonte analysis that would make it quick to do. :)

Here's the sheet in .csv format: (.txt, rename to .csv)

If that doesn't download OK, the formula I used in Excel to normalise is =(PCxValue-MIN(AllValuesinPCx))/(MAX(AllValuesinPCx)-MIN(AllValuesinPCx)), where AllValuesinPCx is replaced with the whole range for the PC, and PCxValue is the value in each PC to be normalised.
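For anyone not working in Excel, the same 0-1 scaling can be sketched in plain Python. This is only a minimal equivalent of the formula above, applied to one PC column at a time; the function name and the sample values are made up for illustration.

```python
def min_max_normalise(column):
    """Rescale one PC column so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no range to scale by.
        raise ValueError("cannot normalise a constant column")
    return [(value - lo) / (hi - lo) for value in column]

# Hypothetical PC values for three samples.
pc_values = [-2.0, 0.0, 6.0]
normalised = min_max_normalise(pc_values)
print(normalised)
```

Each dimension would be passed through the function separately, so every PC ends up spanning exactly 0 to 1 regardless of its eigenvalue.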

I'm basically wondering if the (marginal) problems with the unweighted sheet are actually due to the scales on each dimension being sort of random, as far as we can tell, rather than whether they did or did not scale with the eigenvalues. As another possibility to eliminate.

(Now, doing a secondary PCA on the 0-1 normalised PCA looks pretty strange and not like a normal world PCA at all - Rather like a strange "All Eurasia" PCA with Africans in a null position).

Alberto said...


Without careful inspection yet, it looks quite similar to the others. We'll need to look for the details to see how it compares.

huijbregts said...

@Alberto, Matt
On a pragmatic level I can go a long way with you.
You remind me, though, of a Northern Dutch aphorism: 'If it cannot be done as it should, it should be done as it can.'
Alexander the Great might have said this when he solved the problem of the Gordian knot.

I cannot see any scientific logic behind the Sangarius-Eren transformation.
It is just there because the results are more agreeable. That is a poor basis for scientific progress.
From a mathematical viewpoint, the weighting of the dimensions of a PCA is anathema.
True, it conserves the orthogonality of the dimensions, but it destroys the defining property of a PCA that the variances are distributed according to the eigenvalues.
Inspecting the arithmetic of this transformation shows that it effectively deflates the higher dimensions. That is a useful property, because the higher dimensions may have a low reliability anyway.
But there are umpteen other ways to deflate the higher dimensions, so this transformation is highly arbitrary.
In the spirit of the PCA logic, the simplest and most transparent way is just dropping the higher dimensions, as Davidski already did with the dimensions above 10.
So my opinion is that, before we spend a lot of energy interpreting the detailed effects of the Sangarius-Eren transformation, we should convincingly demonstrate that it performs better than dropping dimensions 4 through 10.
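The point about orthogonality versus variance can be illustrated with a toy example. This is only a sketch with two made-up, uncorrelated columns standing in for two PCA dimensions: rescaling one of them leaves the covariance at zero but changes the ratio of the variances. The helper name and the numbers are hypothetical.

```python
def covariance(xs, ys):
    """Population covariance of two equally long columns."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

# Two hypothetical, uncorrelated dimensions with variances 4 and 1.
dim_a = [2.0, -2.0, 2.0, -2.0]
dim_b = [1.0, 1.0, -1.0, -1.0]

# Deflate the second dimension by an arbitrary factor of 0.5.
dim_b_scaled = [0.5 * value for value in dim_b]

ratio_before = covariance(dim_a, dim_a) / covariance(dim_b, dim_b)
ratio_after = covariance(dim_a, dim_a) / covariance(dim_b_scaled, dim_b_scaled)

print(covariance(dim_a, dim_b), covariance(dim_a, dim_b_scaled))
print(ratio_before, ratio_after)
```

The covariance stays at zero in both cases, while the variance ratio moves from 4:1 to 16:1, so the rescaled columns no longer reflect the original spread of variance.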

Alberto said...


Sorry, I forgot to put Iran_Neolithic in the source pops. I'll post the correct one in a few minutes.

Alberto said...


My understanding is a bit different. I first assumed that the values in the original PCA sheet were perfectly correct and corresponded to their eigenvalues. So what Sangarius proposed looked like a bad idea to me (as to you).

However, after looking at the values it seems that they are indeed wrong for some reason. They don't follow the principle of lower variance in each dimension and don't correspond to their eigenvalues.

So then we realised that Sangarius had found a real problem, and his solution was in the right direction, but it was not totally correct. It was actually you who then suggested to use the square root of the eigenvalues, and that seems to be the correct thing to do. So after applying that correction, the PCA values seem to be what they should have always been (or at least close to them).

For me the question then became: should we fix the values or leave them as they are? And I don't have a strong preference for either, since the results are very close most of the time, but I lean towards leaving them as they are for practical reasons.

Dropping the dimensions 4-10 works really badly, I tested that. With 3 dimensions we really can't work our models.

Alberto said...


I hope this time the correct one:

Chad said...

Your EHG is way too low and CHG way too high. There's an issue in there somewhere. These stats make it plainly obvious that it is a Yamnaya-like population, but maybe more EHG shifted. Not CHG shifted.

result: Germany_MN Bell_Beaker_Czech Loschbour Karelia_HG 0.0346 5.596 30720 28669 615312
result: Germany_MN Bell_Beaker_Czech Loschbour Kotias 0.0142 2.261 37015 35977 712510
result: Germany_MN Bell_Beaker_Czech Loschbour Yamnaya_Samara 0.0288 6.004 36056 34037 703380
result: Germany_MN Bell_Beaker_Czech Karelia_HG Kotias -0.0199 -3.209 30692 31939 621083
result: Germany_MN Bell_Beaker_Czech Karelia_HG Yamnaya_Samara -0.0057 -1.177 29312 29649 614479
result: Germany_MN Bell_Beaker_Czech Kotias Yamnaya_Samara 0.0143 3.134 35399 34402 710238
result: Germany_MN Bell_Beaker_Germany Loschbour Karelia_HG 0.0343 7.454 45451 42433 890684
result: Germany_MN Bell_Beaker_Germany Loschbour Kotias 0.0116 2.442 54766 53505 1033770
result: Germany_MN Bell_Beaker_Germany Loschbour Yamnaya_Samara 0.0273 7.149 53331 50497 1020055
result: Germany_MN Bell_Beaker_Germany Karelia_HG Kotias -0.0209 -4.705 45151 47082 898448
result: Germany_MN Bell_Beaker_Germany Karelia_HG Yamnaya_Samara -0.0062 -1.702 43352 43896 888489
result: Germany_MN Bell_Beaker_Germany Kotias Yamnaya_Samara 0.0149 4.477 52416 50874 1029219
result: Germany_MN Czech Loschbour Karelia_HG 0.0289 5.572 22452 21192 495731
result: Germany_MN Czech Loschbour Kotias 0.0088 1.769 26714 26250 568303
result: Germany_MN Czech Loschbour Yamnaya_Samara 0.0204 5.106 26171 25124 565153
result: Germany_MN Czech Karelia_HG Kotias -0.0187 -3.940 22432 23287 500027
result: Germany_MN Czech Karelia_HG Yamnaya_Samara -0.0069 -1.749 21688 21988 497558
result: Germany_MN Czech Kotias Yamnaya_Samara 0.0115 3.220 25785 25201 570164
result: Germany_MN German Loschbour Karelia_HG 0.0262 5.051 22137 21006 488023
result: Germany_MN German Loschbour Kotias 0.0080 1.606 26393 25976 559260
result: Germany_MN German Loschbour Yamnaya_Samara 0.0182 4.555 25815 24890 556262
result: Germany_MN German Karelia_HG Kotias -0.0172 -3.600 22232 23009 492287
result: Germany_MN German Karelia_HG Yamnaya_Samara -0.0070 -1.785 21461 21764 489931
result: Germany_MN German Kotias Yamnaya_Samara 0.0101 2.846 25471 24960 561236

Matt said...

Alberto thanks. Doing a quick comparison to the unweighted values (the unmodified spreadsheet) the normalized proportions show very similar values for most populations. Like 0.5% difference per proportion generally, or 0.

The particular exceptions are again the Finno-Ugric and Russian populations (Finnish, Vepsa, Ingrian, North Russian), who again flip from having fractions which are WHG+Nganasan+Kotias to more EHG+Dai+EEF fraction, despite the overall outcomes being more similar to the unweighted spreadsheet.

It does seem like, whether you make the dimensions more equal in weight or add more weight towards the lower dimensions, the spreadsheet is very sensitive to weighting in whether the ANE in these populations is expressed via EHG or via other populations with low ANE fractions (Kotias, Nganasan). Most other assignments, by contrast, look relatively insensitive to any changes in weighting.

(I don't actually know if that means anything - it's difficult to say if either formulation actually makes more or less sense. East Asian ancestry populations that mixed with Finns, etc. may not have been exactly like the North Russian Nganasan after all, and it's possible they may in some sense actually really better approximate Dai+EHG anyway).

Alberto said...


Thanks for those stats. This is something that always makes me wonder and I don't have a good answer. Using D-stats based models you always get more EHG and Anatolian, and less WHG and CHG. But I've found problems with the values in the sheets and the models too. So right now I'm leaning towards the view that these models from the PCA might be closer to the truth.

It's hard to measure Kotias with D-stats. Apparently it's always pretty low. I don't even know if Georgians themselves (who have the highest affinity to Kotias) would get good values with D-stats to prove their most likely model.

And then you also need some way to explain the shift that the stats show in the WHG-EHG axis. For example:

Loschbour EHG Yamnaya Chimp -0.0354 -6.545 341312
Loschbour EHG Germany_MN Chimp 0.0314 5.105 316966

If you want to make north Europeans something close to 50% Germany_MN and 50% Yamnaya (and even more so if you say that it was more EHG-rich than Yamnaya), then the resulting population would have a slightly negative D-stat. But look:

Loschbour EHG Scottish Chimp 0.0202 3.771 341763

How could you explain that?
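The arithmetic behind that expectation can be sketched as follows, assuming the statistic behaves roughly linearly in the mixing proportion. That is an assumption (it holds for f4, and only approximately for D), and the function names are made up. The input values are the two stats quoted above.

```python
def expected_stat(stat_a, stat_b, fraction_a):
    """Linear interpolation between two source stats for a given mixing fraction of A."""
    return fraction_a * stat_a + (1.0 - fraction_a) * stat_b

def implied_fraction(stat_a, stat_b, observed):
    """Fraction of A that would reproduce the observed stat under the same linear assumption."""
    return (observed - stat_b) / (stat_a - stat_b)

yamnaya = -0.0354      # D(Loschbour, EHG; Yamnaya, Chimp)
germany_mn = 0.0314    # D(Loschbour, EHG; Germany_MN, Chimp)
scottish = 0.0202      # D(Loschbour, EHG; Scottish, Chimp)

half_and_half = expected_stat(yamnaya, germany_mn, 0.5)
yamnaya_share = implied_fraction(yamnaya, germany_mn, scottish)

print(round(half_and_half, 4))
print(round(yamnaya_share, 3))
```

Under that linear assumption a 50:50 mix comes out slightly negative, whereas the Scottish value would correspond to a much smaller Yamnaya share. This is only arithmetic on the quoted stats, not a model fit.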

Northern Europeans, by these stats, cannot be mostly Corded Ware derived. There is an important population shift from CW, which is also significantly negative:

Loschbour EHG Corded_Ware_LN Chimp -0.0189 -3.32 341342

Loschbour EHG Lithuanian Chimp 0.0189 3.687 341763

So I don't know how to solve this conundrum. For now I think the better explanation is the strangely "low performance" of Kotias in D-stats. But if you find a better explanation by running other stats to clear it up, I'd be interested in seeing them.

Alberto said...


Yes, overall they're really very similar. The reason why I think that the models with those populations seem to break after the normalization or weighting is not so much because of the Nganasan, but mostly the EHG/WHG/Kotias ratios. It makes Finnish a big outlier, even compared to Estonians. And then it's hard to explain why they would have also a positive stat in the WHG-EHG comparison with those numbers:

Loschbour EHG Finnish Chimp 0.0138 2.675 341763

On the other hand there are the benefits like having correct euclidean distances, better models for Amerindians and probably other kind of models too. So I don't have a strong preference for either version. Better to know about the issue and have both options to double check when needed.

Alberto said...


By the way, did you find some way to test reliably with qpAdm with the 4 main European ancestors? Like having in the left pops something like:


I remember there were issues with trying that, but I don't know if some solution has been found. Would be interesting to test that, if possible.

Alternatively, are you using qpGraph already? Maybe there it could be possible to test a similar setup.

huijbregts said...

Thank you for covering the details of my arguments. As to my conjecture that 3 dimensions might cover enough detail, that has apparently been too drastic.
But I remain very suspicious about the (lack of) justification for your transformation.
Moreover, I have realised that there is also a serious problem with the factors that I have calculated.
David has published a datasheet of some 500 samples. But he has informed me that for his calculations he has used a much larger dataset of some 1600 samples.
Now I suppose his eigenvalues are from the larger dataset. But you have used these eigenvalues on the smaller dataset. That is only correct if the smaller dataset has more or less the same eigenvalues as the larger dataset.
I don't buy that. For instance, the smaller dataset should have the same degree of oversampling of Europeans as the larger dataset; that seems highly improbable.

Rereading my last sentence, I am wondering whether the advantage of the Sangarius-Eren transformation may be caused by its deflation of the oversampled 'European' dimensions 3-7.
Stated differently, does this transformation have the side effect of weighting by sampling density? If so, this might explain why for some European populations the weighted estimation is worse than the unweighted estimation.
Anyway, weighting by sampling density appears far more natural than weighting by (the root of) the eigenvalue.

Davidski said...

Alfredo's a great name. :)

Matt said...

@ Alberto: Yeah, I have no understanding of how and when those D (Loschbour EHG X Chimp) stats are compatible with our models. I'm not sure if they work such that a 50:50 mix population would necessarily also be intermediate on that stat.

@ huijbregts: Though if the eigenvalue scale is expressed through the min-max distance in each dimension, it shouldn't matter the relative number of European vs non-European samples present, so long as the samples that take extreme positions aren't systematically excluded. (If they were, what would they be since the datasheet seems to include all samples that would take extreme positions, barring perhaps San and maybe some Native Americans?). The min-max should still match with the eigenvalue.
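A small sketch of that distinction, using made-up values for a single dimension: dropping non-extreme samples leaves the min-max range untouched, while the variance (which is what an eigenvalue reflects) does change with the sampling density. The sample lists and the helper name are hypothetical.

```python
import statistics

def value_range(values):
    """Min-max distance along one dimension."""
    return max(values) - min(values)

# Hypothetical coordinates in one dimension: many samples clustered near zero,
# plus two samples at the extremes.
full_dataset = [-1.0, -0.1, 0.0, 0.0, 0.1, 0.1, 1.0]
# A smaller dataset that keeps the extremes but drops most of the cluster.
smaller_dataset = [-1.0, 0.0, 1.0]

print(value_range(full_dataset), value_range(smaller_dataset))
print(statistics.pvariance(full_dataset), statistics.pvariance(smaller_dataset))
```

The range is 2.0 in both cases, but the variance is clearly larger in the smaller dataset, so whether the eigenvalues carry over depends on which of the two the scale is tied to.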
