Monday, March 21, 2016

R1a in Yamnaya


Do you ever get the feeling when reading some of the ancient DNA papers that the authors know more than they're willing to admit? That's probably because they do.

It looks like Y-chromosome haplogroup R1a has already been found in Yamnaya remains, which, I believe, was something that was hinted at by David Reich and colleagues last year. It's just that, for some reason, the result hasn't yet been published. Crank up Google translate and navigate to here.

As for the difference between the predominance of R1a in the Corded Ware Culture and R1b among the Yamnaya, I would like to remind you that the paper reports only 5 haplogroup determinations for Yamnaya individuals, which is of course very few! And after the paper came out we obtained a new sample, this time of haplogroup R1a, from a Yamnaya burial. All the determinations we are currently working with come from a small and remote territory, the northern part of the Volga steppe region. I am waiting with great optimism for determinations of Yamnaya individuals from the Black Sea steppes.

...

Alexander, good day! Nadezhdinka: this is burial 1 of kurgan 1. It is the primary burial. In a deep pit lay an adolescent, a boy: Y-hapl. R1a1a1d2a. He lay on his back, legs flexed; the arms were not preserved. The head was sprinkled with dark red ochre, a round-bottomed vessel stood at the left shoulder, and a shell valve lay there. Both the rite and the grave goods are typical of the Yamnaya culture of the Volga region. The kurgan stood on the bank of the B. Irgiz (Bolshoy Irgiz) River, a left tributary of the Volga: N 52 degrees 12 minutes, E 48 degrees 39 minutes. I should note that the eastern wing of the Yamnaya (Volga-Urals) covers the period 3400-2900 BC, and the western wing 3100-2400 BC. Alexander, where did you come across information on the haplogroups of the Eneolithic Khvalynsk people?

...

Dear Elena, the result was sent by David Reich. He heads a department at Harvard Medical School. Alas, it has not been published yet. Yours, KP

Yep, the Corded Ware people were closely related to the Yamnaya folk. No doubt about it. I'd say it's just a matter of time before an early Yamnaya or proto-Yamnaya group is found on the steppe with R1a lineages ancestral to those present among Corded Ware males, and indeed many millions of present-day Europeans.

97 comments:

Nirjhar007 said...

We wait and see if it was a dead branch or not .

Very interesting news of course...

Davidski said...

If that's not a typo in that comment above, where he says R1a1a1d2a, then it looks like some type of Z93.

Nirjhar007 said...

Yes you are right. But we wait for confirmation .

Nirjhar007 said...

BTW isn't it older than the Poltavka outlier?

Davidski said...

Yes, by a few hundred years. But I think most people knew Poltavka outlier was largely of Yamnaya stock one way or another. My TreeMix graph from the Poltavka outlier write up...

https://drive.google.com/file/d/0B9o3EYTdM8lQU3ZnMkItT2lUazQ/view?usp=sharing

Nirjhar007 said...


Perhaps that Z-94 (still to be sure) was a result of Intrusion . Yamnaya is a R1b dominant culture.

And those R1a/R1b of Khvalynsk are most likely dead branches.

Davidski said...

There's no intrusion into Yamnaya and Poltavka except from the west, and even that's either from western Yamnaya or Corded Ware, which is basically Yamnaya anyway.

Karl_K said...

@Nirjhar007

"Perhaps that Z-94 (still to be sure) was a result of Intrusion . Yamnaya is a R1b dominant culture."

Intrusion from where exactly? From the people with the exact same autosomal genetics 50km to the south-east?

I don't understand the point of your comment.

Please can you clarify?

Coldmountains said...

R1a1a1d is P98 and Z93 is R1a1a1h according to Wikipedia
https://en.wikipedia.org/wiki/Haplogroup_R1a

Where is R1a1a1d most frequent today? Looks like another very rare R1a lineage.

Nirjhar007 said...

There are lots of ifs and buts. Without genetic data from the large areas/sites, it's futile to make such conclusions. I certainly doubt there will be any Z-94 in Ukraine.

Let us have the results from Majkop, India, Afanasievo etc., then we will be in a position to compare and conclude. At the moment we are only seeing results from one specific area! It's totally frustrating and disgusting.

Do the scientists think publishing samples only from the steppes is all that's needed?

Krefter said...

@Karl_K,
"Intrusion from where exactly? From the people with the exact same autosomal genetics 50km to the south-east?

I don't understand the point of your comment."

That is an invasion. Did Germany invade Poland in WW2? Yes, even though they have basically the same autosomal genetics. Poltavka outlier, I believe, was 20-30% Anatolia_Neolithic. So, it was a big migration without a lot of admixture.

Nirjhar007 said...

Anyway, for people who are interested in the IE, Aryan issue etc., please take a look and discuss here:
https://throneoftruth.wordpress.com/2015/12/19/aryan-invasion-or-migration-theory-and-indo-european-origins-vedic-origins/

There you will see a great coverage of the situation.

Karl_K said...

@Krefter

"Poltavka outlier, I believe, was 20-30% Anatolia_Neolithic. So, it was a big migration without a lot of admixture."

Or... it could also have been a quite local shift over several hundreds of years. We have very little time and space resolution here.

Think more like a scientist and you will soon be one.

Krefter said...

@Karl_K,
"Or... it could also have been a quite local shift over several hundreds of years. We have very little time and space resolution here."

That would suggest R1a-Z93 came from Neolithic Turkey, so..... We see an autosomally identical people in Germany with R1a-M417*. So, it's pretty obvious hybrid Steppe/EEF populations formed west of the Volga river then migrated en masse east to the Volga river.

"Think more like a scientist and you will soon be one."

I'm not being arrogant, but my positions do usually come out mostly correct. Also, most of the time they aren't just my positions, they're the positions of most people here, because the truth is so obvious.

Sometimes you have to go with what is most likely, instead of treating every possibility as equal. For example, instead of using Nigeria and Dai and Eskimo in nMonte as possible ancestors of Kalash because that's more scientific, I use realistic possible ancestors.

capra internetensis said...

@Nirjhar

I'm pretty sure that palaeogeneticists would give their left nuts for decent aDNA from South Asia (too bad about the hot conditions). Last I heard though Indian bureaucracy ain't exactly speedy and forthcoming with permits and permissions for that kind of thing. But maybe I'm wrong.

Nirjhar007 said...

Capra,
Yes, you are wrong. We are going to get some good amounts of aDNA from the subcontinent, but at the moment it's largely one specific area (steppes) that is getting publications.

Be sure that aDNA from at least 5-6 important sites (of both Asia and Europe) is ready to be published at the moment, but for some reason they are not doing so.
Please don't ask which ones...

capra internetensis said...

I'm wrong that the Indian bureaucracy is notoriously slow, or I'm wrong that palaeogeneticists care about South Asia?

Nirjhar007 said...

The first.

DMXX said...

Gents,

Some of the members at Anthrogenica have dug through the original Russian thread. It looks like the contributors are actually discussing the Khvalynsk results from several months back (R1b and Q are also mentioned).

Doesn't appear as if these are new findings (Y-DNA R1a1a from Yamnaya) and, instead, Khvalynsk is being confused with Yamnaya.

Please see here for clarification:

http://www.anthrogenica.com/showthread.php?3978-When-are-we-to-expect-the-next-round-of-ancient-y-dna-results&p=146905#post146905

Karl_K said...

Actually DMXX.

I agree with Krefter. He seems to be making the most sense here.

And he's not being arrogant, but his positions do usually come out mostly correct.

DMXX said...

Karl,

Not sure why my username is being cited in contradistinction to Krefter's. I'm not contesting (or even addressing) anything he is.

Karl_K said...

@DMXX

Because... Krefter clearly said:
"That would suggest R1a-Z93 came from Neolithic Turkey, so..... We see an autosomally identical people in Germany with R1a-M417*. So, it's pretty obvious hybrid Steppe/EEF populations formed west of the Volga river then migrated en masse east to the Volga river."

And then you said:
"Doesn't appear as if these are new findings (Y-DNA R1a1a from Yamnaya) and, instead, Khvalynsk is being confused with Yamnaya."

So, I was just siding with Krefter because he is usually right and not arrogant.

Davidski said...

DMXX,

The new sample is indeed from a Yamnaya burial.

But the Khvalynsk R1a also gets a mention.

Rob said...

Much ado about nothing

Gioiello said...

For what I know no R1a nor R1b has been found in Anatolia, and long before the samples from Russia. Are R1a and R1b in Anatolian aDNA or in Krefter's head?

capra internetensis said...

@Nirjhar

That's good to hear. People do tend to exaggerate the failings of their local government departments. :)

Nirjhar007 said...

Capra,

I am very happy to inform you that Rakhigarhi is not the only one which is being tested! :)

Gioiello,

Well they haven't published results from the key Eastern Regions.

Davidski said...

Krefter & Karl,

These steppe populations were highly mobile, so it's possible that Z93 came from the Volga steppes, but Poltavka outlier, with his Z93 and relatively western genetic structure, arrived on the Volga steppes from as far west as Ukraine or even Poland. In other words, he may have returned to the homeland of his paternal ancestors.

On the other hand, there's evidence of some vicious battles in the forest steppe zone of Russia occupied by the Abashevo culture.

So it's also possible that Poltavka outlier was one of the early invaders from that area, and a series of invasions of highly militaristic groups from Abashevo basically wiped out the Yamnaya/Poltavka R1b-Z2103 clan of the Volga region, and then expanded into Asia with their militaristic culture as the Sintashta and Andronovo people.

Rob said...

Dave
So it sounds like you're finally heeding my directions?
;)

Davidski said...

Probably not.

It's clear to me that the initial migrations of these steppe people during the Early Bronze Age were from areas around the Volga and probably the Don.

So Abashevo, Andronovo, Sintashta and all the rest ultimately hail from Yamnaya and/or pre-Yamnaya (Repin) of the Pontic-Caspian steppe.

Rob said...

I see, M417 moved west from the Don, picked up a lot of MNE wives, then moved back east to their homelands.
Sounds biblical

Gioiello said...

@ Nirjhar007

“Gioiello,
Well they haven't published results from the key Eastern Regions”.
We are waiting for them… but:
1) I think the hgs found in Western Anatolia, both Y and mt, might have linked with the Balkan ones.
2) Autosomally they were far from other Middle Easterner samples (they say that there happened migrations from elsewhere and aren't representative of the old Anatolian pool). We'll see.
3) So far in the aDNA from Hungary no R1b… and it will be interesting to see what will be found in Vinca of 6000 years ago.
4) Of course I think that any discussion about R1a from Central Asia is hardly believable, as from the YFull tree it is evident that the oldest haplotypes come from the West.
5) Many say that R-V88 isn't linked with R1a and R1b1-L389+, but that it came from Sardinia/Italy and Iberia very likely is already demonstrated both from the aDNA (7100 years ago in Iberia) and from the modern data: the oldest samples are in Sardinia (and also Italy till the Isles).
6) Read sometimes the YFull blog at FB where both Ted Kandell and me may write, and perhaps you will understand many things behind.

Davidski said...

@Rob

Something like that.

This hybrid steppe/MNE population may have been more advanced culturally and militaristically than the groups that remained on the steppe, thanks to experiences with Copper Age Europe. So they may have simply moved back in to take what they wanted, like land for stock and metal ores.

It's not like this sort of thing hasn't happened in historic times. That's what Germany tried in Eastern Europe and failed. Maybe there was an ancient battle of Kursk on the Bronze Age steppe, except unlike Germans, who got their asses whooped, the ancient western invaders actually prevailed. Haha.

Rob said...


Hmm, but it makes you wonder, as you say: suggestive previous papers, little leaks like this.
Maybe something new and big is on the way?

huijbregts said...

@Davidsky
Is it possible to get Motala_HG in a row? It might be interesting to see where their DNA went to.

Kristiina said...

Onur, this is particularly for you! I noticed it only today. It is not about Yamnaya but it is however about R1a1-Z93.

In 2015-2016 at Fudan University (Shanghai), a team headed by ethnogenomist Shao-Qing Wen (文少卿) tested representatives of the aristocratic Turkic clan Ashina (creators and rulers of the Turkic Khanate in the 6th-7th centuries) and of the clan Ashide (阿史德: another dominant clan, which produced empresses, the so-called Khatuns, and supreme military leaders) to determine their Y-DNA haplogroups.

Ashina, also spelled Asen, Asena, or Açina, was a tribe and the ruling dynasty of the ancient Turks who rose to prominence in the mid-6th century when their leader, Bumin Khan, revolted against the Rouran. The two main branches of the family, one descended from Bumin and the other from his brother Istemi, ruled over the eastern and western parts of the Göktürk empire, respectively.

The result:
Subclade of clan Ashina: R1a-Z93, Z94+, Z2123-, Y2632-. Recommended to request SNPs — Z2124, Z2122.

Subclade of clan Ashide: Q1a-L53.

When I check the existing ancient yDNA samples, I see that Z2122 is in Sintashta (RISE386, RISE392) and in Andronovo (RISE512). It looks like the Ashina haplotype could be the typical Turkic haplotype under Z2125 found in Kyrgyz and Altaians.

By contrast, Q1a-L53 seems to be a typical Altaian and Central Siberian haplogroup as it is distributed as follows: 1st branch Northern Altaians, Western Khakasses; 2nd branch Southern Altaians; 3rd branch Todjins, Soyots, Eastern Khakasses, Tuvinians and 4th branch Tuvinians, Mongols, Kets, Khantys. My understanding is that L-53 was not found in Karasuk or Iron Age Altai who are all Q-M25. Maybe the Khvalynsk Q1a belonged to this line, because Genetiker defines it as xQ1a2. If so, it clearly took its revenge during the Turkic Empire. :-)

Fanty said...

"Maybe there was an ancient battle of Kursk on the Bronze Age steppe, except unlike Germans, who got their asses whooped, the ancient western invaders actually prevailed. Haha."

With "war-waggons" on both sides too? ;)

Davidski said...

huijbregts,

Try this...

https://drive.google.com/file/d/0B9o3EYTdM8lQUzh3T0EwMHdSeUU/view?usp=sharing

Onur said...

@Kristiina

Yes, I know that Wen et al. study. Since the official study has not been published yet, they have not published the details of the individuals they tested. Until then, I will be in the wait and see mode regarding that study.

batman said...

Davidski,

If one starts with the timeline from Lyngby/RBK noe geys to TRB/PWC/CWC, parallell to Swidrien/Kunda/PCW/CWC.

Yamna may well be a result of the western Baltic, which obviously was a central area between the European and Caucasian populations, for cultural interchange, travel and trade as well as genealogical relationships.

Consequently we have to consider the possibility that EBK/TRB can be ancestral to both CWC and Yamna.

Another point addressing the same topic is the option that the 7,500-year-old Karelian with R1a, growing chickpeas and barley, represents an early cousin of the Samara barley-growers.

Underlining this is the fact that the Holocene warm period moved across the Eurasian continent from the west, first benefitting Occidental Europe before Oriental Eurasia, in terms of median temperatures, abundant humidity and bioproduction.

huijbregts said...

@Davidsky
Thanks for the datasheet.
I noticed that you dropped the columns Kostenki and Samara_HG.
You also dropped the row Karelia_HG.
Can I restore the row Karelia from the previous datasheet (after removing the columns Kostenki and Samara_HG)?

huijbregts said...

@Davidsky
Motala_HG was interesting:

Motala_HG:
"Western_HG" 63.9
"Karelia_HG" 36.1
"Anatolia_Neolithic" 0
"Caucasus_HG" 0
"Eastern_HG" 0

The distance is not good though: distance%=4.21
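
For readers who want to see what a two-source fit like the Motala_HG model above involves, here is a minimal sketch. It is not nMonte's code: nMonte searches by Monte Carlo sampling, whereas this swaps in a closed-form least-squares projection that only works for exactly two sources. The D-stat rows are made-up toy numbers, and `fit_two_sources` is a hypothetical helper name. The distance is taken to be the Euclidean distance between the target row and the fitted row, which is an assumption.

```python
import math

def fit_two_sources(target, src_a, src_b):
    """Return (weight of src_a in [0, 1], distance) for the best two-way mixture.

    The fitted row is w * src_a + (1 - w) * src_b; w minimises the
    Euclidean distance between the fitted row and the target row.
    """
    diff_ab = [a - b for a, b in zip(src_a, src_b)]
    diff_tb = [t - b for t, b in zip(target, src_b)]
    denom = sum(d * d for d in diff_ab)
    if denom == 0:
        raise ValueError("the two source rows are identical")
    # Unconstrained least-squares solution, then clipped to a valid proportion.
    w = sum(x * y for x, y in zip(diff_tb, diff_ab)) / denom
    w = min(1.0, max(0.0, w))
    fitted = [w * a + (1 - w) * b for a, b in zip(src_a, src_b)]
    distance = math.sqrt(sum((f - t) ** 2 for f, t in zip(fitted, target)))
    return w, distance

# Toy D-stat rows (hypothetical values, not taken from any datasheet).
source_a = [0.40, 0.30, 0.10]
source_b = [0.30, 0.40, 0.20]
# Build a target that is an exact 64/36 mix, so the fit should recover it.
target = [0.64 * a + 0.36 * b for a, b in zip(source_a, source_b)]

weight_a, dist = fit_two_sources(target, source_a, source_b)
print(f"source_a {100 * weight_a:.1f}  source_b {100 * (1 - weight_a):.1f}  "
      f"distance%={100 * dist:.4f}")
```

With real data the target is never an exact mix of the sources, so the distance stays above zero; a value like 4.21% would mean the best two-way combination still misses the target row by a wide margin.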

Matt said...

@ Davidski and huijbregts convo: Davidski, if you add in Samara and Kostenki columns, it'd be nice to have the Swedish_NHG (Pitted Ware Culture) in as a row as well, just to try testing if they do model as like Motala with German_MN admixture (or vice versa) as an abstract from Skoglund's group last year suggested.

It does seem like Motala does have somewhat of its own personality, looking at the Motala population's results, although I don't see it being important in Europe today generally, and probably in the low single digits around the Baltic (Lithuania?). Since it seems like Motala does have an independent personality to some degree, with EHG, WHG and Motala all in rows and columns though, we could try and test the influence of all together (and with NHG could test Lithuanian as NHG+Corded Ware).

MA-1 would be good as a row as well.

Romulus said...

I could be wrong but I've seen admix results that have Swedish_NHG (Pitted Ware Culture) showing 0 EEF.

Chad Rohlfsen said...

The Pitted Ware samples range from about 15-25% EEF.

pequerobles said...

Ashina,
is also the name of the very pretty Dutch/Turkish girl who posts on Anthroscape and Anthrogenica

batman said...

"Ashina" seems to be a common family name (ethnonym), defining a common tribe/ethnicity rather than a common geographical, cultural or professional denominator.

The word-stem 'as' is known as an ethnic identity in several Indo-European branches, such as As-vin (Vedic), Az-eri (Persian), As-er (Scandia) and As-ir (Norse). Today this 'tribal' identifier is reflected in geonyms like As-ia, As-tra-khan, Azer-bad, As-gar-bad, As-gar, As-gard and As-hov.

Similar to the Vedic Asvinas, we find the Ashinas to be an ethnicity/tribe whose branches had spread, later to be unified as a "federation" (clan) due to a common danger.

It's an old idea that the unification of the "Ashina clan" was a 'tribal reunification' rather than a purely political construct based on common geography but random relations under occasional circumstances.

The Kok-turs/Turks seem to be close relatives of the Onugurs, also known as Ounu-gard, Vana-gar(ia) and Hunagar(ia).

The (re-)unification orchestrated by the Ashina dynasty has some obvious connection to the 'Huns', also known as 'Vends' and 'Venedae', referring to populations of eastern Europe/Russia.

Classical sources refer to both ouan/huns/vends/vanir and aser/azeri/asvinas as "gentes" and "aets". Thus we may have to view them both as 'ethnicities'.

Seemingly, the post-Persian Bactrians/Tocharians formed a federacy aimed at reconstructing an old, pre-war constitution, based on ethnic relationships remembered through common identifiers such as language, economies and trade routes. Obviously the various 'tribes' ("arrows") based their federacy on 'kinship'/'kind-ship', that is, on a common background.

The Ashinas were obviously instrumental in the reunification of the highlanders, being able to produce and head the union as a "union-of-kin" (tur-ki, turk-men) of eastern ounu-garians/unu-gurs/ungurs.

Since the ouna/hunic/venedae have obvious ties to the ouan/van/ven/vend and ven-eti, there's obviously an Uralian connection between the Venedae and the Ungurs, raising the issue of the Ashina possibly being Uralic speakers of 'Vendic' origin who once formed their own dynasty, kingdom and tribe in the east, creating the stem and outliers of hg Q.

Matt said...

Fitting a few populations with nMonte using the Motala rows, plus some imputed values for that row for the Kostenki and Samara columns:

Belarusian - Anatolia_Neolithic 38.3, Karelia_HG 23.8, Western_HG 13.45, Caucasus_HG 13.3, Nganasan 4, BedouinB 3.6, Motala_HG 2.25, Onge 0.7, Kharia 0.6

Finnish - Anatolia_Neolithic 34.7, Karelia_HG 23.45, Western_HG 16.35, Caucasus_HG 11.8, Nganasan 11.4, Motala_HG 2.3

Lithuanian - Anatolia_Neolithic 38.4, Karelia_HG 22.55, Western_HG 18.7, Caucasus_HG 12.4, Nganasan 4.8, Motala_HG 3.15

Unetice - Anatolia_Neolithic 41.95, Karelia_HG 19.9, Caucasus_HG 17.3, Motala_HG 9.75, Western_HG 8.15, Nganasan 2.65, Onge 0.3

Nordic_LNBA - Anatolia_Neolithic 33.45, Karelia_HG 24.25, Caucasus_HG 19.1, Western_HG 17.3, BedouinB 3.6, Masai_Kinyawa 1.1, Nganasan 1.1, Esan_Nigeria 0.1

Seems like the Karelia_HG+WHG combinations are generally preferred to including Motala_HG (at least with the imputed values). This is probably because Karelia+WHG gets the relatedness to each right, while not giving any excessive relatedness to Motala?

Romulus said...

@Matt

The Unetice numbers nicely reflect the apparent SHG paternal ancestry of that culture, likely by way of FunnelBeaker.

Krefter said...

@Davidski,

Can you make a Sycthian_IA row? Also, can you post Sycthian_IA's Eurogenes K15 results?

huijbregts said...

@Matt
I did not impute Kostenki and Samara and got different values:
Belarusian
"Anatolia_Neolithic" 38.65
"Karelia_HG" 23
"Western_HG" 15.8
"Caucasus_HG" 13.8
"Nganasan" 4.45
"BedouinB" 3.2
"Onge" 0.65
"Kharia" 0.45
"Motala_HG" 0
distance% 0.5538

Unetice
"Anatolia_Neolithic" 41.65
"Karelia_HG" 21.8
"Caucasus_HG" 17.25
"Western_HG" 11.35
"Motala_HG" 5.25
"Nganasan" 2.4
"Onge" 0.3
distance% 0.4642

I also did an experiment. In the Belarus file I dropped the Karelia and Western HG rows. This results in an awful distance.
Now when I look at the fitted values of Motala_HG2, Iberia_Mesolithic and Karitiana, I find that these values are off the mark.
The fitted values seem to be related to the real population percentages. I agree with Romulus.

Davidski said...

Here's the new datasheet, also including Scythian_IA and Poltavka_outlier.

https://drive.google.com/file/d/0B9o3EYTdM8lQRmRlZzdvOExDaDQ/view?usp=sharing

Matt said...

Looks like the real values in the Kostenki and Samara columns allow it to pick up a bit more Motala over the imputed values, which must have had a contribution to Motala ancestry fitting less well (bolded Motala, if any):

Belarusian - Anatolia_Neolithic 38.05, Karelia_HG 22.85, Caucasus_HG 13.2, Western_HG 12.3, Motala_HG 4.65, Nganasan 4.05, BedouinB 3.7, Onge 0.7, Kharia 0.5 - distance% = 0.6425 %

Finnish - Anatolia_Neolithic 34.55, Karelia_HG 22.3, Western_HG 14.9, Caucasus_HG 11.7, Nganasan 11.4, Motala_HG 5.15 - distance% = 0.8355%

Lithuanian - Anatolia_Neolithic 38.2, Karelia_HG 21.4, Western_HG 17.2, Caucasus_HG 12.3, Motala_HG 6.1, Nganasan 4.8 - distance% = 0.7903 %

Norwegian - Anatolia_Neolithic 43.8, Karelia_HG 21.15, Western_HG 14.3, Caucasus_HG 13.25, Nganasan 4.8, Motala_HG 1.75, BedouinB 0.95 - distance% = 0.7788 %

Unetice - Anatolia_Neolithic 41.7, Karelia_HG 20.15, Caucasus_HG 17.1, Motala_HG 11.25, Western_HG 7.1, Nganasan 2.4, Onge 0.3 - distance% = 0.6432 %

Nordic_LNBA - Anatolia_Neolithic 33.1, Karelia_HG 24.5, Caucasus_HG 18.95, Western_HG 17.65, BedouinB 3.6, Masai_Kinyawa 1.25, Nganasan 0.95 - distance% = 0.6173 %

Saami - Nganasan 26.7, Anatolia_Neolithic 21.45, Karelia_HG 19, Western_HG 13.55, Motala_HG 10.95, Caucasus_HG 6.25, Papuan 1.3, Itelmen 0.8 - distance% = 1.9088 %

Kargopol Russian - Anatolia_Neolithic 32.5, Karelia_HG 22.1, Western_HG 14.9, Caucasus_HG 13.4, Nganasan 12.45, BedouinB 2.4, Motala_HG 2.2 - distance% = 0.7075 %

Yamnaya_Samara - Karelia_HG 52, Caucasus_HG 29.35, Anatolia_Neolithic 14.4, Western_HG 2.85, Onge 0.85, Papuan 0.55 - distance% = 0.6253 %

Corded Ware Germany - Karelia_HG 34, Anatolia_Neolithic 30.6, Caucasus_HG 22.85, Western_HG 8.7, Motala_HG 2.9, Nganasan 0.95 - distance% = 0.5957 %

Andronovo - Karelia_HG 36.4, Anatolia_Neolithic 24.7, Caucasus_HG 23.45, Western_HG 6.8, BedouinB 4.85, Nganasan 3.5 - distance% = 0.5066 %

Germany_MN - Anatolia_Neolithic 76.3, Western_HG 20.4, BedouinB 2.05, Onge 1, Papuan 0.25 - distance% = 0.419 %

Iberia_MN - Anatolia_Neolithic 71.35, Western_HG 28.1, Atayal 0.55 - distance% = 1.4139 %

Hungary_EN - Anatolia_Neolithic 89.05, Western_HG 9.45, Motala_HG 1.2, Esan_Nigeria 0.3 - distance% = 0.7084 %
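
The one-line model format used in the list above is regular enough to parse, which makes it easy to check that the percentages add up to 100 before comparing models. A minimal sketch, assuming every line follows the "Name - Source pct, Source pct, ... - distance% = X %" layout; `parse_model` is a hypothetical helper, not part of nMonte.

```python
def parse_model(line):
    """Split 'Name - Source pct, ... - distance% = X %' into its three parts."""
    name, components, distance_part = line.split(" - ")
    weights = {}
    for item in components.split(", "):
        # The source name comes first, the percentage is the last token.
        source, pct = item.rsplit(" ", 1)
        weights[source] = float(pct)
    # Keep only the number after the '=' sign and drop the trailing '%'.
    distance_pct = float(distance_part.split("=")[1].replace("%", "").strip())
    return name, weights, distance_pct

# Model line copied from the list above.
line = ("Germany_MN - Anatolia_Neolithic 76.3, Western_HG 20.4, BedouinB 2.05, "
        "Onge 1, Papuan 0.25 - distance% = 0.419 %")

name, weights, distance_pct = parse_model(line)
total = sum(weights.values())
print(name, weights, round(total, 2), distance_pct)
```

Lines that omit the distance, or that use a different separator, would need their own handling.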

PCAs with these stats:

1. PCA based on new stats for selected West-Central Eurasian populations: http://i.imgur.com/0nBrq8j.png

2. same without MA1: http://i.imgur.com/lUQopHs.png

3. same as 2 without ENA and African columns - http://i.imgur.com/fzSGBJs.png

Krefter said...

@Davidski,

Can you also add Chuvash, Udmurt, and Erzya as rows?

Davidski said...

Erzya are listed as Mordovian. There aren't any Udmurts in this dataset.

I'll see about adding Chuvash tomorrow.

huijbregts said...

@ all

I have added an extra line to the output of nMonte, which I think has some heuristic value.
The output starts with Ncycles. The next line gives the names of the columns.
The third line gives the Dstats in the respective columns. On the fourth line are the values as estimated by nMonte.
The new output is on the fifth line. It is just the difference between the two preceding lines. Not too difficult to calculate them yourself, but the computer can do it faster.
Update at https://www.dropbox.com/sh/1iaggxyc2alafow/AACIjLtnkuaNNsJ5oKME_3XHa?dl=0
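
The new fifth line is simple to reproduce by hand. A minimal sketch in Python (nMonte itself is an R script, so this is an illustration rather than its code), using five of the columns from the Iberia_MN output shown further down in this comment:

```python
# Observed D-stats for the target (third output line), five columns only.
observed = {
    "Ami": 0.3338000,
    "Anatolia_Neolithic2": 0.4202000,
    "Iberia_Chalcolithic": 0.43100000,
    "Motala_HG2": 0.407700,
    "Samara_HG": 0.39110000,
}
# Values estimated by nMonte for the same columns (fourth output line).
fitted = {
    "Ami": 0.3320973,
    "Anatolia_Neolithic2": 0.4198205,
    "Iberia_Chalcolithic": 0.41904635,
    "Motala_HG2": 0.411635,
    "Samara_HG": 0.39453595,
}

# Fifth line: fitted minus observed, column by column.
dif = {col: fitted[col] - observed[col] for col in observed}

for col, value in dif.items():
    print(f"{col:22s} {value: .8f}")
```

A negative value means the fitted D-stat is lower than the observed one for that column; a positive value means it is higher.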

As a demonstration the model which Matt gave for Iberia_MN:
Iberia_MN -> Anatolia_Neolithic 71.35, Western_HG 28.1, Atayal 0.55 - distance% = 1.4139 %
The new output of nMonte reads:
[1] "Ncycles= 1000"
Ami Anatolia_Neolithic2 Australian BedouinB2 Biaka Caucasus_HG2 Dai2 Eskimo_Naukan Han Iberia_Chalcolithic Iberia_Mesolithic
Iberia_MN 0.3338000 0.4202000 0.3014000 0.38260000 0.02100000 0.38270000 0.3330000 0.34160000 0.3353000 0.43100000 0.4192000
fitted 0.3320973 0.4198205 0.3002321 0.38155805 0.02105235 0.38184285 0.3321925 0.34058695 0.3332289 0.41904635 0.4160451
dif -0.0017027 -0.0003795 -0.0011679 -0.00104195 0.00005235 -0.00085715 -0.0008075 -0.00101305 -0.0020711 -0.01195365 -0.0031549
Karitiana Kharia2 Kostenki14 Mansi2 Motala_HG2 Onge2 Papuan2 Samara_HG Selkup Yoruba
Iberia_MN 0.3462000 0.3388000 0.3557000 0.36470000 0.407700 0.32910000 0.3011000 0.39110000 0.3597000 0.09550000
fitted 0.3458683 0.3373728 0.3571005 0.36519215 0.411635 0.32813285 0.2996489 0.39453595 0.3592727 0.09560855
dif -0.0003317 -0.0014272 0.0014005 0.00049215 0.003935 -0.00096715 -0.0014511 0.00343595 -0.0004273 0.00010855
[1] "distance%=1.4139 / distance=0.014139"

The largest negative value of dif is in the column Iberia_Chalcolithic (-0.01195365).
This tells us that the Dstat for Iberia_Chalcolithic is estimated too low by nMonte.
The cause might be that the subset [Anatolia_Neolithic, Western_HG, Atayal] is missing a relevant population.
Now the second most negative value is Iberia_Mesolithic (-0.0031549).
So Iberia_Mesolithic is also deficient in [Anatolia_Neolithic, Western_HG, Atayal].
It is plausible to add Iberia_EN to the set.
Lo and behold:
Iberia_MN -> Iberia_EN 80.45, Western_HG 19.45, Atayal 0.1, Anatolia_Neolithic 0 - distance% = 0.9761 %
A considerable improvement!

(Matt, I know that you have your arguments for favoring these populations.
But this sitting duck was just what needed to demonstrate the idea.)
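
The dif row can also be used to check the reported distance and to rank the columns automatically. A minimal sketch: the numbers are copied from the Iberia_MN output above, and the reading of the distance as the Euclidean norm of the dif row is inferred from those numbers, not taken from nMonte's source.

```python
import math

# dif row (fitted minus observed) copied from the Iberia_MN output above.
dif = {
    "Ami": -0.0017027, "Anatolia_Neolithic2": -0.0003795,
    "Australian": -0.0011679, "BedouinB2": -0.00104195,
    "Biaka": 0.00005235, "Caucasus_HG2": -0.00085715,
    "Dai2": -0.0008075, "Eskimo_Naukan": -0.00101305,
    "Han": -0.0020711, "Iberia_Chalcolithic": -0.01195365,
    "Iberia_Mesolithic": -0.0031549, "Karitiana": -0.0003317,
    "Kharia2": -0.0014272, "Kostenki14": 0.0014005,
    "Mansi2": 0.00049215, "Motala_HG2": 0.003935,
    "Onge2": -0.00096715, "Papuan2": -0.0014511,
    "Samara_HG": 0.00343595, "Selkup": -0.0004273,
    "Yoruba": 0.00010855,
}

# Square root of the summed squared differences over all 21 columns.
distance = math.sqrt(sum(d * d for d in dif.values()))

# Columns sorted from most underestimated to most overestimated.
ranked = sorted(dif, key=dif.get)
most_underfitted = ranked[:2]

print(f"distance={distance:.6f} / distance%={100 * distance:.4f}")
print("most underestimated columns:", most_underfitted)
```

The result agrees with the printed distance=0.014139 to six decimals, and the two columns at the top of the ranking are the ones singled out above as pointing to a missing Iberian source.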

Matt said...

@ huijbregts, interesting, and that looks like a useful feature. It makes intuitive sense that the Iberia_Chalcolithic and Iberia_Mesolithic column stats should be important to improving the fits for Iberia_MN. As you would expect Iberia_EN, Iberia_MN, Iberia_Chalcolithic and Iberia_Mesolithic to all share some fine scale drift or phylogenic structure similarity that can't quite be approximated simply by Anatolia_Neo and WHG rows.

Funnily enough, I was myself having a look at the fine scale differences in fits, trying to understand precisely what, if anything, is really lacking in the fits for West Eurasians with just ancient groups, by doing fits with just (Anatolia_Neolithic, Karelia_HG, Caucasus_HG, Western_HG, MA1, BedouinB, Motala_HG) and comparing the fitted values to real values (what the diff does in huijbregts' updated version of nMonte):

Column English_Cornwall Eng_C_fitted Difference
Kostenki14 0.3522 0.3556 0.0034
Samara_HG 0.4057 0.4078 0.0021
Karitiana 0.3579 0.3596 0.0017
Anatolia_Neolithic2 0.4019 0.4034 0.0015
Motala_HG2 0.4087 0.4095 0.0008
Eskimo_Naukan 0.3511 0.3515 0.0004
Iberia_Mesolithic 0.4025 0.4023 -0.0002
Biaka 0.0203 0.0201 -0.0002
Caucasus_HG2 0.3899 0.3896 -0.0003
Yoruba 0.095 0.0946 -0.0004
Australian 0.3033 0.3021 -0.0012
Papuan2 0.3026 0.3013 -0.0013
Iberia_Chalcolithic 0.4032 0.4018 -0.0014
Onge2 0.3321 0.3307 -0.0014
Ami 0.3375 0.336 -0.0015
Dai2 0.3382 0.3363 -0.0019
Selkup 0.3684 0.3665 -0.0019
Han 0.3392 0.3372 -0.002
Kharia2 0.3433 0.3407 -0.0026
Mansi2 0.3757 0.3717 -0.004
BedouinB2 0.3776 0.3733 -0.0043

Column Belarusian Belarus_fitted Difference
Kostenki14 0.3513 0.3572 0.0059
Samara_HG 0.4085 0.4112 0.0027
Karitiana 0.3615 0.3637 0.0022
Eskimo_Naukan 0.3536 0.3546 0.001
Anatolia_Neolithic2 0.3975 0.3984 0.0009
Caucasus_HG2 0.3872 0.3876 0.0004
Biaka 0.0203 0.0201 -0.0002
Iberia_Chalcolithic 0.3989 0.3986 -0.0003
Iberia_Mesolithic 0.405 0.4047 -0.0003
Motala_HG2 0.4123 0.412 -0.0003
Australian 0.3042 0.3034 -0.0008
Yoruba 0.0954 0.0945 -0.0009
Papuan2 0.3039 0.3028 -0.0011
Onge2 0.3344 0.3323 -0.0021
Selkup 0.3707 0.3684 -0.0023
Ami 0.3401 0.3376 -0.0025
Dai2 0.3403 0.3378 -0.0025
Kharia2 0.3451 0.3419 -0.0032
Mansi2 0.3768 0.3736 -0.0032
Han 0.342 0.3387 -0.0033
BedouinB2 0.3749 0.3711 -0.0038

It looks like where the fits fall down is in being too close to Kostenki, Samara and Native Americans in particular, and not close enough to Kharia, East Asians and BedouinB at the other end.

MfA said...

Has anyone tried D-Stats admixtures on the Iron Age Hungarian sample supposedly the Cimmerian?

MfA said...

I'm a big fan of the D-stats ancestry portions, I wish the Kurdish samples were also part of the data set.

Alberto said...

Another interesting thing related to Motala is that EHG does prefer it clearly over WHG when modeling as admixed with MA1:

Karelia_HG
"MA1" 52.65
"Motala_HG" 47.35
"Western_HG" 0
distance=0.037513

Karelia_HG
"MA1" 64.15
"Western_HG" 35.85
distance=0.050959

This could, in any case, have to do with needing less MA1 when taking Motala, and MA1 being old and more "noisy" because of age and quality. However, when testing Okunevo:

Okunevo
"MA1" 50.15
"Dai" 22.35
"Karelia_HG" 18.25
"Atayal" 9.25
"Western_HG" 0
"Motala_HG" 0
distance=0.024875

It doesn't seem like MA1 would be systematically rejected due to quality when an option exists (EHG, in this case, even if clearly not the same thing). So I'm wondering if someone has ever explored the possibility of EHG being admixed with SHG rather than WHG (which makes geographical sense if SHG were in the East Baltic, as suggested by Matt's numbers above).

huijbregts said...

A few comments above I presented an automatic analysis of Dstat estimation errors in nMonte.
This method capitalizes on noisy data, so it surely cannot be very reliable.
On the other hand, the results are for free, so why not do a few experiments.

I tried a model of Hungary_CA:
Hungary_CA -> Iberia_MN 53.25, Anatolia_Neolithic 42.7, Western_HG 2.2, Motala_HG 1.6, Caucasus_HG 0.25 dist% = 0.9055
I am not proud of this model. The distance is bad and a 50% admixture of Iberians into Hungarians is not very plausible.

But the Dstat estimation errors are interesting. The largest underestimation is in the column Kostenki14:
column Kostenki14
Hungary_CA 0.3581000
fitted 0.3533389
dif -0.0047611

Indeed in the sheet from David the top 6 positions on Kostenki14 are:
Western_HG 0.3796
Motala_HG 0.3697
MA1 0.3665
Karelia_HG 0.366
Hungary_CA 0.3581
Iberia_MN 0.3557

This makes a lot of sense in the above model of Hungary_CA.
If only we had a row of Kostenki14. :-(

Matt said...

@ Alberto, fits I get for the Okunevo are:
Okunevo - MA1 42 Dai 22.4 Karelia_HG 21.6 Atayal 10.2 Caucasus_HG 3.8 - distance=0.024448

or with better fit allowing Ulchi as a group, but no other North Asians:

Okunevo - Ulchi 39.4, Karelia_HG 27.25, MA1 21.7, Caucasus_HG 11.65 - distance=0.011036

or with everything:

Okunevo - Karelia_HG 24.6, MA1 18.85, Ulchi 17.35, Caucasus_HG 14.35, Itelmen 14.15, Nganasan 10.4, Papuan 0.25, Esan_Nigeria 0.05 - distance=0.009045

Essentially combines to be approximately something Amerindish (more or less 1:3 MA1:Northeast Asian) plus something Afanasievoish (3:2 EHG:CHG).

Makes some sense as a Siberian culture with unusually heavy MA1 for their time mixing with Afanasievo (their predecessor culture), then replaced by Andronovo.

Doesn't seem too important in West Eurasia, since as a component they'd raise the degree of Amerind and North Asian affinity, which is already covered by EHG.

Karelia_HG does model sort of with Motala_HG and MA_1. At the same time, WHG does model like this (with an extremely bad fit but free choice from the ancient populations):

Western_HG
Motala_HG 88.3
Anatolia_Neolithic 11.7
Karelia_HG 0
distance=0.057609

It is tough to intuit any direction of admixture from any of this.

Davidski said...

Krefter,

Sheet with Chuvashs and Erzya (Mordovians).

https://drive.google.com/file/d/0B9o3EYTdM8lQemVmTjE0VjdsVGs/view?usp=sharing

Alberto said...

@Matt

That's interesting. When I tried with an older set with fewer columns to model WHG as EHG + Anatolia_Neolithic, it was giving me 100% EHG, but now I get:

Western_HG
"Karelia_HG" 74.9
"Anatolia_Neolithic" 25.1
distance=0.104052

Which kind of makes sense to balance the difference in affinities (the opposite, which is what I was getting, Karelia_HG and MA1 taking some Anatolia Neolithic in addition to WHG is what didn't seem to make much sense, unless Karelia_HG and MA1 had some Basal Eurasian admixture, but who knows, I'll test with nMonte's latest version posted by huijbregts earlier -thanks!- to check the reasons better).

For completeness about what I wrote above, it is worth noting that the model of Motala_HG as a mix of WHG and Karelia_HG is worse than the model of Karelia_HG as a mix of Motala_HG and MA1. So in this model, with all samples being Mesolithic, it's clear that this didn't happen anytime close to the Mesolithic between these populations:

Motala_HG
"Western_HG" 61.9
"Karelia_HG" 38.1
distance=0.043612

And the fit is worse with WHG + MA1:

Motala_HG
"Western_HG" 81.8
"MA1" 18.2
distance=0.054151

Alberto said...

Now I checked. That previous set of stats didn't have Samara_HG as a column, so using Karelia_HG as a row didn't generate the big excess of affinity in that column to create the need to reduce the amount of Karelia_HG to ameliorate it.

Now I tried the model of Karelia_HG as WHG + Anatolia_Neolithic but removing Iberia_Mesolithic as a column from this last set, and things improve (I guess?):

Karelia_HG
"Western_HG" 100
"Anatolia_Neolithic" 0
distance=0.084824

The example represents a worst case scenario, but apparently the deal is: we want the best reference populations in the columns, we also want them in the rows, but having the same population in both rows and columns creates an imbalance for the excess of shared drift of a population with itself.

This will need a bit of testing to see the impact in more real world cases and if there is any other option that has more benefits than drawbacks.

huijbregts said...

@ Alberto

I prefer another model. Using the latest version of the spreadsheet (with the Chuvash), I got the following result:
Karelia_HG
"Western_HG" 72.8
"Caucasus_HG" 27.2
"Anatolia_Neolithic" 0
distance=0.099085

Even for very ancient pops, this distance is very poor.
Inspection of the residuals per column shows two columns that were conspicuously too low, Samara_HG (-0.0491728) and Eskimo_Naukan (-0.033216).
So the model can be improved by adding Yamnaya_Samara.

Karelia_HG
"Yamnaya_Samara" 73.7
"Western_HG" 26.3
"Anatolia_Neolithic" 0
"Caucasus_HG" 0
distance=0.06914

Now the distance is improved, but not dramatically.
The underallocated columns are now Samara_HG (-0.0376636, cannot increase because that makes Anatolia_Neolithic too high)
and Karitiana (-0.0308352). Caucasus_HG is not really 0, because it is present within Samara_HG.

We can try to improve the model by adding Motala_HG
Karelia_HG
"Yamnaya_Samara" 54.05
"Motala_HG" 45.95
"Anatolia_Neolithic" 0
"Caucasus_HG" 0
"Western_HG" 0
distance=0.05686
The underallocated columns are Yamnaya_Samara (-0.02769365), Karitiana (-0.0274567) and Eskimo_Naukan (-0.02182695).

A last idea is the addition of Itelmen and Nganasan, because they carry both Karitiana and Eskimo_Naukan.
Karelia_HG
"Motala_HG" 54.35
"Yamnaya_Samara" 30.55
"Itelmen" 15.1
"Anatolia_Neolithic" 0
"Caucasus_HG" 0
"Nganasan" 0
"Western_HG" 0
distance=0.042942

My conclusion is that Karelia_HG is mainly from the EHG/SHG stock. I see no indication for a role of WHG, which is however not a column.
All in all, the fit remains poor, probably because of the absence of relevant DNAs.
Dropping the column Iberia_Mesolithic does not seem necessary.

Matt said...

Alberto:

And the fit is worse with WHG + MA1

I haven't run this through nMonte but I think this worse fit is probably related to the fact there is no MA-1 column in any of the datasheets.
Motala as WHG+Karelia has three columns relevant to the rows that can be under- or overfitted (Iberia_Mesolithic, Samara, Motala2), while MA-1 lacks a relevant column in the same way (it has only proxies), so there's less there that is relevant to the population and difficult to fit.

we also want them in the rows, but having the same population in both rows and columns creates an imbalance for the excess of shared drift of a population with itself.

Yeah, I think if I understand you correctly this is definitely an issue and is the reasoning for Haak et al (and qpAdm) to use outgroups rather than the actual populations under study (although I can't find the quote in the SI9 of Haak's paper). Using the actual populations might mean there is additional drift between, say, Anatolia_Neolithic members that can't necessarily be covered by the modern populations, whose ancestors were drawn from a slightly more general population of which Anatolia_Neolithic was a more tightly related subset. This might mean the fits are necessarily imperfect without quite large / diverse sample sizes that should smooth out this issue.

(ADMIXTURE, of course, IIRC essentially "monte carlo" simulates populations until all real populations are fitted perfectly. But there is no guarantee that these ever actually existed, or that they aren't overgeneralised due to a lack of specific linking / link-breaking input populations that break apart correlations between certain allele frequencies).

I think the flipside of this for me is that I'm skeptical that the divergences in relatedness to the outgroups are actually sufficient to use estimation via outgroups only. The relatedness to Native Americans is probably sufficient to pull out Yamnaya and EHG ancestry (since it's relatively recent). But the differences in relatedness to ENA just seem too close between CHG, AN and WHG, and don't follow strong patterning, so that even small, almost negligible gene flow from ENA can seriously perturb them as a signal for ancient WHG / CHG / AN ancestry. As I think we found with the experiment with 4mix and the D stats with outgroups. (And that's still true with nMonte, using only outgroups as columns, or even outgroups+Kostenki).

huijbregts said...

@ Alberto, Matt

I have run the model:
Iberia_MN
"Iberia_EN" 80.45
"Western_HG" 19.45
"Atayal" 0.1
"Anatolia_Neolithic" 0
"BedouinB" 0
"Caucasus_HG" 0
"Dai" 0
"Kharia" 0
"Motala_HG" 0
distance=0.009761

Next I dropped the column Anatolia_Neolithic2 from datasheet and target file.
The result was:
Iberia_MN
"Iberia_EN" 79.7
"Western_HG" 19.9
"Atayal" 0.4
"Anatolia_Neolithic" 0
"BedouinB" 0
"Caucasus_HG" 0
"Dai" 0
"Kharia" 0
"Motala_HG" 0
distance=0.008958

So removing the column did not make Anatolia_Neolithic show up.

Krefter said...

@Davidski,

Can you add the ancient Irish genomes and Otzi to the spreadsheet?

Matt said...

@ huijbregts

Not 100% sure I understand what we're doing atm, but to add something, I'm not sure the columns "Iberia Chalcolithic" and "Anatolia Neolithic" have a lot of independent predictive value, since "Iberia Chalcolithic" is in theory "Anatolia Neolithic" plus WHG admixture (including Iberia Mesolithic and maybe other sources).

The changes in models should be fairly weak if you drop either one, alone.

Although huijbregts there is that change that you identified upthread, where the relatedness to "Iberia Chalcolithic" and "Iberia Mesolithic" makes Iberia_EN a better fit than can be achieved only by Anatolia_Neolithic plus WHG for ancestry of Iberia_MN (because Iberia_EN has additional drift related to both columns that a mix of Anatolia_Neolithic and WHG doesn't quite have).

BedouinB column also has a similar pattern in relatedness to Anatolia_Neolithic as well and that will contribute to fitting populations with Anatolia_Neolithic.

If you dropped both the "Iberia Chalcolithic" and "Iberia Mesolithic" columns, then Iberia_EN might not be preferred for Iberia_MN.

Alberto said...

@Matt, huijbregts

I've been testing a bit with removing the columns that I want to use as rows and overall by reducing the references (especially the most relevant ones) I think the results get more noisy.

Indeed this problem I showed above of excess of affinity of a population to itself is related to the problem that qpAdm tries to completely avoid, so if we tried to avoid it completely we'd be reinventing qpAdm: one step forward, two steps backwards. We're definitely better comparing to "ingroups" than just to real outgroups.

To make the point I was trying to make with the example of Karelia_HG as WHG + Anatolia_Neolithic more clear:

- Phylogenetically, it makes more sense a model where Karelia_HG is 100% WHG than one that is 75% WHG and 25% Anatolia_Neolithic (even if both models are terrible, I know). In theory, Anatolia_Neolithic would basically add Basal Eurasian, which would be a worse option than a 100% WHG fit (ok, it could add also some Kostenki-related ancestry that could help a bit, but it's unlikely that it would outweigh the disadvantage of adding a larger amount of Basal Eurasian).

- Indeed, adding 25% Anatolia_Neolithic to the mix makes most of the diffs to the columns bigger. But the total distance improves due to the big improvement in one single column: Iberia_Mesolithic (this is a WHG, so the same population, just a different individual). In other words, Anatolia_Neolithic is messing up most of the columns in order to compensate for the big difference in one column, which is caused by the non-proportional affinity that a population has to itself. That's why by just removing the Iberia_Mesolithic column the fit is better as 100% Western_HG. (Motala_HG and Kostenki are the other 2 columns that improve when adding Anatolia_Neolithic, and that's because of their high shared drift with WHG, but they're not enough by themselves to "spoil" the mix; Iberia_Mesolithic is the one that makes the biggest difference).

So for now, after testing, I'd say that except in unrealistic models (like trying to model Karelia_HG as 100% WHG), we're better off by leaving all the columns intact. Though an option to be explored would be to use a more complex algorithm in which an outlier column could be given less weight in the calculation of total distance, since it's probably caused by this disproportionate effect of high intra-group shared drift. But I'm not sure if that would be easy and if it would introduce some uncertainty in itself instead of improving things.
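One hypothetical way to give an outlier column less weight is a Huber-style loss, which is quadratic for small residuals and only linear for large ones. This is a minimal sketch and not part of nMonte or 4Mix; the function names, the cutoff `delta` and the residual values are assumptions chosen for illustration.

```python
def squared_loss(residuals):
    # Half the sum of squares, so it matches the Huber loss for small residuals.
    return sum(0.5 * r * r for r in residuals)

def huber_loss(residuals, delta=0.02):
    # Quadratic inside [-delta, delta], linear outside, so a single
    # large residual contributes less than it would under squared loss.
    total = 0.0
    for r in residuals:
        a = abs(r)
        if a <= delta:
            total += 0.5 * a * a
        else:
            total += delta * (a - 0.5 * delta)
    return total

# Made-up per-column residuals: three small ones and one outlier column.
residuals = [0.01, 0.01, 0.01, 0.10]
print(squared_loss(residuals), huber_loss(residuals))
```

The choice of `delta` is the uncertainty such a scheme introduces: too small and it behaves like the sum of absolute values, too large and it is just least squares again.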

huijbregts said...

@ Matt
You are right, I chose a very unfortunate example.
I followed your suggestions. My original model was:

Iberia_MN
"Iberia_EN" 80.45
"Western_HG" 19.45
"Atayal" 0.1
"Anatolia_Neolithic" 0
"BedouinB" 0
"Caucasus_HG" 0
"Dai" 0
"Kharia" 0
"Motala_HG" 0
distance=0.009761

After removing the column Iberia_Mesolithic few things changed.
After also dropping Iberia_Chalcolithic I got:

Iberia_MN
"Anatolia_Neolithic" 64.3
"Western_HG" 22.85
"Iberia_EN" 10.7
"Atayal" 2.1
"Motala_HG" 0.05
"BedouinB" 0
"Caucasus_HG" 0
"Dai" 0
"Kharia" 0
distance=0.003132

So most of the row Iberia_EN moved to Anatolia_Neolithic.
What I did not expect is that the distance diminished to one third.
It seems that with too many columns, which are not independent and moreover noisy, ugly things happen.
Maybe we should do something like calculating the number of eigenvectors.

@ Alberto
This is difficult stuff. I have to let it sink in.

huijbregts said...

@ Matt, Alberto
I looked at the eigenvectors. I found that with 5 eigenvectors, 99% of the variance is explained.
That is, with all the data. With a restriction to European populations, the situation will be worse.
So somehow we must limit ourselves to 4 or 5 independent columns.
Maybe the author of 4Mix had much knowledge of genetics.
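The "how many eigenvectors explain 99% of the variance" check can be sketched as below. The eigenvalues are made up for illustration and are not the ones from the D-stats sheet; `components_needed` is a hypothetical helper name.

```python
def components_needed(eigenvalues, threshold=0.99):
    # Each eigenvalue is the variance captured by one principal component.
    # Return the smallest number of leading components whose cumulative
    # share of the total variance reaches the threshold.
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues (not taken from the real datasheet).
eigenvalues = [60.0, 25.0, 8.0, 4.0, 2.2, 0.5, 0.3]
print(components_needed(eigenvalues))
```

Note that a component with a small share of the total variance can still carry the contrast one cares about, so this count is only a rough guide.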

Alberto said...

@huijbregts

I hope that I'll have time to look at this more systematically so I can come up with some numbers, but for now I think that the more columns we have, the more robust the numbers are going to be.

Anyway, I have one question. Why do you use:

eval1 <- sum((colM1 - myTarget)^2)

Instead of just:

eval1 <- sum(abs(colM1 - myTarget))

I guess it's intentional and I'm just missing something (I'm not into this kind of scripting), but I wonder what is it.

huijbregts said...

@ Alberto

I think that more rows are always good. The situation with the columns is different. When you have all the variance covered, more columns will be redundant.
Maybe it is not a good idea to select a subset of relevant rows. It is hard to restrict the number of rows without causing a projection error. IMO this is one of the reasons for the instabilities.

You found the most important line in the script. It calculates the square of the Euclidean distance which is also the basis of the least square statistics.

Davidski said...

I was talking to Ger over e-mail just now about using PCA data instead of D-stats with nMonte. Here's what I got for Yamnaya. I'll try and work out what type of PCA data is most useful for nMonte, and might post a datasheet later today or tomorrow.

Yamnaya_Samara_I0231
"Karelia_HG" 55.1
"Kotias" 35.8
"Loschbour" 5.4
"Anatolia_Neolithic" 3.7

"distance% = 0.954 %"

huijbregts said...

@ Matt, Alberto, Davidski

After (not) having slept a night about it, I fear that the method in its present form should be discontinued.
There are several problems:

1. The Dstats have a low dimensionality, k=4 seems enough. This is considerably lower than the dimensionality of autosomal DNA.

2. Euclidean distance. (Alberto, the code you asked about is the square of the euclidean distance).
Theoretically, the Euclidean distance should be calculated on the basis of orthogonal dimensions.
As this entails a choice of k, I have not done this in nMonte.
Remember that nMonte is designed as a tool in combination with a calculator datasheet.
The author of the calculator will prefer a structure which is orthogonal and publish a test for this.
So in combination with calculator sheets, orthogonality will not be a major problem.
However the columns in a Dstats sheet are generally far from orthogonal.
So I should have orthogonalized them in nMonte. Which I have neglected because I wrongly suspected that Dstats would mirror DNA itself.

3. nMonte was designed to use as much information as is available in the datasheet.
In this project however, an expert chooses a small 'relevant' subset of the rows.
This not only reduces the already small variance in the data. It also causes projection errors.
As a result with one subset row A has a dominant position, while it can be zero with another subset.
This is not a problem of inexpert subset selection, it is a problem of subsetting as such.

The Monte Carlo process in nMonte is guided by minimization of the Euclidean distance.
Under the present conditions, the guide has a pair of variable focus glasses.

What can be done?
A PCA of the Dstats is no problem at all. And each population can be assigned to its valid position (if you use all the information in the Dstats sheet).
Can the combination of Dstats and 4Mix/nMonte be modified to be used in a correct way?
I think the start is to construct a PCA of say 4 eigenvectors and project the sheet of Dstats on these eigenvectors.
We need a sheet with as many rows as we can get and enough columns to cover all the variance.
It would be nice if the eigenvectors would be interpretable. But I fear they will not be, because of differences between the continents.
Next we can define a small set of compound categories. For Europe they might be WHG, EHG, early farmer, steppe, Bell Beaker.
For each of these compound categories we can calculate the K4-vector of projected scores. The compound groups will generally be overlapping and not be orthogonal.
Next we can use nMonte to project a target population on the compound scores. This is a Procrustes process, because the sum will be forced to 100%.
Main points:
orthogonalize
don't use subsets

Davidski said...

Global PCA data using as few as 10 dimensions is producing very coherent and clean results in the few nMonte tests that I ran.

I'll definitely have a datasheet ready for some tests later today. Stay tuned.

Alberto said...

@huijbregts

Yes, I see the rationale behind using the Euclidean distance. Though I'm not sure it really applies to this case. Outlier values have a bigger impact using the sum of the squares of the residuals than the sum of the absolute values of the residuals. But if Davidski might have found a way to avoid these outlier values it might not be a problem.

I did try to test it, and while using the sum of the absolute values did solve the problem for the corner case above (it did model Karelia_HG as 100% Western_HG and 0% Anatolia_Neolithic without needing to remove the Iberia_Mesolithic column), I couldn't get it to work well for normal cases. For some reason the algorithm has a hard time finding the lowest distance when given more populations. Not really sure why. So I can't say if the method has any benefit or if it's detrimental for other cases.

For the rest I can't comment much. I don't understand well what you mean by orthogonalize and not using subsets. But in any case, it's probably better to wait to see how it works with the new input data (and see how that new data looks exactly) before doing an overhaul.

Matt said...

@ Davidski, is that PCA based on IBS stats? Is it possible to post the underlying data used to build the PCA at the same time as the datasheet, or not really possible?

@ huijbregts and Alberto:

Just testing running the d-stats datasheet I was using through PCA, then using the PCA eigenvector scores (all dimensions) to fit a population:

English_Cornwall - Anatolia_Neolithic 47.35, Karelia_HG 21.25, Caucasus_HG 14.8, Western_HG 11.65, Nganasan 3, Motala_HG 1.4, BedouinB 0.45, Onge 0.05, Papuan 0.05 - distance=0.006986

For the same data without running the d-stat data through PCA to orthogonalize, but using the same calc populations:

English_Cornwall - Anatolia_Neolithic 47.35, Karelia_HG 21.25, Caucasus_HG 14.8, Western_HG 11.65, Nganasan 3, Motala_HG 1.4, BedouinB 0.45, Onge 0.1 - distance=0.006986

So essentially no difference between using the data processed through PCA to data not processed through PCA, at least for this sample and calc set?

(The slight difference that is, between Onge and Papuan estimated proportion, may only be because in the PCA data I replaced a few values like "5.64699E-07" generated by Past3's PCA with 0 because my spreadsheet software kept turning them into text values).

I would've expected that I suppose, since the PCA are containing all the variance in the d-stats set (21 dimensions, but 99.2 % variance in first 6).

Using just the first 2 dimensions, which contain 93% of variance, we get:

English_Cornwall - Western_HG 28.3, Motala_HG 18.5, Anatolia_Neolithic 16.3, BedouinB 10.8, Karelia_HG 8.2, Caucasus_HG 6.7, MA1 4.3, Ust_Ishim 2.8, Kharia 1.4, Esan_Nigeria 1, Itelmen 0.8, Yakut 0.6, Onge 0.3 - distance=6e-06 (0)

But then, that is because the first 2 dimensions, are simply more or less world PCA dimensions, dominated by the contrast between African and Eurasian ancestry, then within West and ENA ancestry, without much useful relevance to distinctions between West Eurasians and ENA.

Using dimensions 1-4 (97.6% of variance), we get:

English_Cornwall - Anatolia_Neolithic 53.85, Western_HG 15.3, Karelia_HG 15.2, Caucasus_HG 5.2, Nganasan 2.85, Itelmen 2.55, Motala_HG 2.5, BedouinB 2.35 - distance=3.9e-05

So closer, because while dimension 3 is just Papuan-Onge distinction, dimension 4 actually has information to distinguish between Anatolia_Neolithic/Caucasus type ancestry and HG ancestry.

Using dimensions 1-6 (99.2%), we get:

English_Cornwall - Anatolia_Neolithic 58.6, Karelia_HG 35.35, Caucasus_HG 4.35, Onge 1.15, Papuan 0.4, Western_HG 0.15 - distance=0.002201

Even with these dimensions, still seems to run short of replicating the data with all dimensions. Possibly because some of the data distinguishing between Karelia_HG and WHG and Anatolia_Neolithic and CHG is actually hidden in quite high PCA dimension?

The PCA with all dimensions seemed subjectively to take a little longer to run through nMonte than the raw data they're based on, but possibly not and certainly not much (obviously the reduced dimensions are quite a bit quicker, because far fewer columns).

Matt said...

@ huijbregts:

On the other point, re subsets, this is with using limited subsets of the whole set for the calc file? I can kind of understand that using a subset is a problem for finding the best fit.

On the other hand, using a calc file with all populations except English Cornwall and English Cornwall as the target, and the PCA data I used at the beginning of this comment :

English_Cornwall - Orcadian 33.25, Icelandic 12.2, Anatolia_Neolithic 9.25, Scottish_Argyll 7.9, Sintashta 7.05, Potapovka 6, Poltavka 5.6, Basque_Spanish 4.8, Poltavka_outlier 3.2, Italian_Bergamo 2, English_Kent 1.9, Ukrainian_East 1.65, Iberia_EN 1.2, Hungary_EN 1.1, Saami 1, Basque_French 0.55, Corded_Ware_Germany 0.4, Hungary_CA 0.25, BedouinB 0.2, Motala_HG 0.2, Afanasievo 0.1, Hungarian 0.05, Italian_Tuscan 0.05, Lezgin 0.05, Norwegian 0.05 - distance=0.001014

which is better fit yes, and probably summarizes the kind of population affinities of the sample.

So it's exactly the kind of thing that you might want if you had an individual of unknown origin and wanted to try and use this information to investigate their background. So there is a value to it.

(Although for that purpose it seems like it would be useful to have virtually all the rows as columns).

But it is also really hard to interpret in terms of, how exactly is this population rooted in the 5 or so theorised ancestral populations of Europeans - Anatolia Neolithic, WHG, SHG, EHG, CHG - and other world populations, if that's the question we're asking, rather than what the best fit is.

Btw, for a couple of ancients, same experiment:

Yamnaya_Samara - Poltavka 46.55, Karelia_HG 22.8, Caucasus_HG 11.85, Sintashta 10.1, Poltavka_outlier 3.75, Potapovka 2.55, Anatolia_Neolithic 1.25, BedouinB 0.9, Onge 0.25 - distance=0.002568

LBK_EN - Anatolia_Neolithic 64.7, Hungary_EN 19.85, Iberia_MN 10.25, Scottish_Argyll 2.75, BedouinB 1.35, Nordic_LNBA 0.3, Motala_HG 0.2, Sardinian 0.2, Western_HG 0.15, Itelmen 0.1, Spanish_Cantabria 0.1, Mordovian 0.05 - distance=0.002194

huijbregts said...

@ Alberto
The function of that line of code is not to detect outliers.
Its function is to evaluate whether a new random value has improved the approximation of the target file.
As far as I remember, the absolute value might have worked also, but at the end of the day mathematicians ask for a Euclidean distance, not for the sum of absolute values.
What David is doing is not about outliers, but about a more efficient use of the information by using PCA scores of the raw data instead of Dstats. He seems very enthusiastic.
And yes, my comments did not excel by clarity. I have never been a gifted teacher, especially not after a wakeful night.
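The role of that line (score a random change, keep it only if the squared distance to the target drops) can be sketched as follows. This is a toy hill-climber written for illustration, not nMonte's actual code; `fit_mixture`, the step size and the three made-up source rows are all assumptions.

```python
import random

def squared_distance(a, b):
    # Same quantity as sum((colM1 - myTarget)^2) in the R line quoted above.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mix(rows, weights):
    # Weighted average of the source rows, column by column.
    return [sum(w * r[j] for w, r in zip(weights, rows))
            for j in range(len(rows[0]))]

def fit_mixture(rows, target, steps=20000, step_size=0.01, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    weights = [1.0 / n] * n          # start from an even mixture
    best = squared_distance(mix(rows, weights), target)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        delta = min(step_size, weights[i])   # keep weights non-negative
        if delta <= 0:
            continue
        trial = list(weights)
        trial[i] -= delta            # move a little ancestry from i to j,
        trial[j] += delta            # so the weights still sum to 1
        score = squared_distance(mix(rows, trial), target)
        if score < best:             # keep the change only if it improves
            weights, best = trial, score
    return weights, best

# Made-up sources; the target is exactly 70% of the first plus 30% of the second.
rows = [[0.10, 0.02, 0.05], [0.02, 0.12, 0.04], [0.03, 0.03, 0.15]]
target = [0.7 * a + 0.3 * b for a, b in zip(rows[0], rows[1])]
weights, best = fit_mixture(rows, target)
print([round(w, 3) for w in weights], best)
```

Swapping `squared_distance` for a sum of absolute values changes only the scoring, which is the substitution being discussed here.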

huijbregts said...

@ Matt

As you have also noted, the first few dimensions of the Dstats PCA contain nearly all the information. Compare that to David's statement that in the PCA of the raw data he needs "just 10" dimensions to keep virtually all of the information. Obviously, in the process of the Dstat calculation, you drop a lot of information (which in no way discounts its value as a formal method; but in terms of data it is just not efficient).
If you want an objective measurement of the run time, you can time it with: system.time(getMonte(....))

The problem with the subsets really bothers me. Finding the optimal solution for one subset is just not equivalent to finding the optimal solution for an other subset; especially when your columns are not orthogonal.

My understanding of what David is doing is that he leaves all the rows in place and that as columns he uses the first 10 eigenvectors of the raw data. So you need not sacrifice rows for columns. Oh, what a lovely world. Moreover the eigenvectors are by definition orthogonal, while the Dstat columns are definitely not. nMonte was not expecting these non-orthogonal columns, so now the optimizing process works more reliably. And another advantage: you need not adjust to new sheets every few days.

As to the hard interpretation, yeah that is true, but people seem to have also adjusted to Dstats :)

Alberto said...

@huijbregts

Yes, the line is self explanatory. I was not changing the evaluation itself, I was just changing the calculation of the value (as you can see in the code above).

What I mean about making worse the problem of outlier values is this: If we have 4 columns with the following diffs:

0.05, 0.05, 0.05, 0.05
Distance with absolute values = 0.2
Euclidean distance = 0.1

And then in the next iteration we get the following diffs:
0.01, 0.01, 0.01, 0.10
Distance with absolute values = 0.13
Euclidean distance = 0.1015

So (if I didn't make any mistake) with absolute values this iteration represents a clear improvement (so the model would be stored), but with Euclidean distance it got slightly worse (so the model would be discarded).
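The arithmetic above can be checked with a few lines. This is only a sketch of the two distance measures applied to the two sets of diffs; the function names are made up.

```python
import math

def abs_distance(diffs):
    # Sum of absolute residuals (L1).
    return sum(abs(d) for d in diffs)

def euclidean_distance(diffs):
    # Square root of the sum of squared residuals (L2).
    return math.sqrt(sum(d * d for d in diffs))

first = [0.05, 0.05, 0.05, 0.05]
second = [0.01, 0.01, 0.01, 0.10]

for name, diffs in (("first", first), ("second", second)):
    print(name,
          round(abs_distance(diffs), 4),
          round(euclidean_distance(diffs), 4))
```

The script compares the squared distance rather than its square root, but since the square root is monotonic the ranking of the two iterations is the same.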

Now, the question is if the second model is better or worse for our purposes. I don't have a clear answer, I'd like to see it in real world cases, but as I said the script was not able to find the lowest distance when using the sum of absolute values for some reason.

The outlier values are a consequence of recent gene flow detected by D-stats. So in general we want to avoid them for our calculations, that's why I think the absolute values could be a better approach. But Davidski is now going to use data that doesn't come from D-stats, so the question might turn irrelevant.

An example of D-stat showing this effect of disproportionate value (in this case presumably because of recent gene flow between Spanish_EN and Spanish_MN):

D(Mbuti, Spanish_MN : Hungarian_EN, Spanish_EN) D=0.0379 Z=14.28

Normal values when comparing a middle Neolithic European population to 2 Early Neolithic ones would be:

D(Mbuti, Gok2 : Hungarian_EN, Spanish_EN) D=0.009 Z=2.199
D(Mbuti, Gok2 : LBK_EN, Hungarian_EN) D=0.0021 Z=0.562
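For readers unfamiliar with how a D value and its Z score come about, here is a simplified sketch from ABBA/BABA site-pattern counts split into genomic blocks. The counts are made up, the sign convention is an assumption, and the plain delete-one jackknife below is a simplification of the weighted block jackknife that real tools typically use.

```python
import math

def d_stat(blocks):
    # blocks: list of (abba_count, baba_count) pairs, one per genomic block.
    abba = sum(b[0] for b in blocks)
    baba = sum(b[1] for b in blocks)
    return (abba - baba) / (abba + baba)

def jackknife_z(blocks):
    # Delete-one-block jackknife: recompute D with each block left out,
    # then turn the spread of those estimates into a standard error.
    n = len(blocks)
    full = d_stat(blocks)
    leave_one_out = [d_stat(blocks[:i] + blocks[i + 1:]) for i in range(n)]
    mean = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((d - mean) ** 2 for d in leave_one_out)
    se = math.sqrt(variance)
    return full, se, full / se

# Hypothetical counts for four blocks (not real data).
blocks = [(120, 100), (130, 95), (110, 105), (125, 98)]
d, se, z = jackknife_z(blocks)
print(d, se, z)
```

A large |Z| means the excess sharing is consistent across blocks, which is why the Z values are quoted alongside D in the examples above.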

Matt said...

@huijbregts

Hmm... Still feeling my way around my thoughts on this and trying to revisit my understanding of what D-stats do, so my thoughts are not totally clear; thanks for the discussion. This may be a little due to thinking at cross purposes, but I've been interested in the D-stats not so much as a summary of the autosomal information (which, as you note, is going to be lossy), but rather because of the particular statistical properties these D-stats, as I understand them, are supposed to have. Specifically, my understanding is that because they measure matching in shared derived alleles in the two test populations relative to the outgroups, they cut deeper into the phylogenetic position of each population and are relatively blind to drift causing changes in the frequency of ancestral alleles (they're lossy in a specifically interesting / useful way?).

As you have also noted, the first few dimensions of the Dstats PCA contain nearly all the information. Compare that to David's statement that in the PCA of the raw data he needs "just 10" dimensions to keep virtually all of the information.

OK, yes, the first few dimensions of the Dstats PCA do contain very large %s of the overall variance. But the results from using just the first few dimensions of the Dstats PCA vs the full 21 are very starkly different for the West Eurasian populations we're interested in. So it is hard for me to think the Dstats set in the current datasheets functionally has few relevant dimensions (functionally low dimensionality?), even if many of the later dimensions have a relatively small eigenvalue (as a function of how differences in the genetic information scale to a D-stat). They're still systematic in a useful way. Although I'm not totally sure if this was your meaning?

My understanding of what David is doing is that he leaves all the rows in place and that as columns he uses the first 10 eigenvectors of the raw data. So you need not sacrifice rows for columns...

The problem with the subsets really bothers me. Finding the optimal solution for one subset is just not equivalent to finding the optimal solution for an other subset; especially when your columns are not orthogonal.


Although I'm pretty sure Davidski will be using pretty similar subsets to the ones Alberto and I have been running though, when he uses the PCA data (for the same reasons, because the idea is to find the best matching combination of ancient populations, not the best matching combination overall). If I'm wrong, please correct.

(As an aside, comparing with the first result from Davidski, I have some pause about the differences in method, as the D-stats via PCA, although very similar, seem quite clear that Yamnaya Samara is behaving as Karelia_HG 52, Caucasus_HG 29.35, Anatolia_Neolithic 14.4, Western_HG 2.85, Onge 0.85, Papuan 0.55. So the question will be where those slight differences in Kotias, Anatolia and WHG are coming from (possibly issues in making the CaucasusHG2 column?).)

Moreover the eigenvectors are by definition orthogonal, while the Dstat columns are definitely not.

D-stat columns on their own, OK are not, but surely once the D-stat column information is transformed through PCA it is? PCA, I'm not familiar with the maths, is "a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. " So the output of the D-stats sheet through PCA, should be orthogonal. It didn't seem to make any difference in results for the English_Cornwall sample, using either the raw D-stats or the orthogonally transformed data via PCA. Is that something you'd expect or not expect?
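One way to see why keeping all PCA dimensions might leave the fit unchanged: a full PCA on the covariance matrix amounts to centering plus a rotation, and a rotation does not change Euclidean distances. The sketch below shows this with a plain 2D rotation standing in for the PCA step; the points, weights and angle are made up, and it assumes the PCA scores are not rescaled (a correlation-matrix PCA standardizes the columns and would not preserve distances).

```python
import math

def rotate(point, angle):
    # Orthogonal transformation of a 2D point.
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def mix(points, weights):
    # Weighted average of the source points.
    return (sum(w * p[0] for w, p in zip(weights, points)),
            sum(w * p[1] for w, p in zip(weights, points)))

# Hypothetical source rows, target and mixture weights (two columns only).
sources = [(0.36, 0.10), (0.12, 0.30), (0.05, 0.02)]
target = (0.20, 0.18)
weights = [0.5, 0.4, 0.1]
angle = 0.7

before = distance(mix(sources, weights), target)
rotated_sources = [rotate(p, angle) for p in sources]
after = distance(mix(rotated_sources, weights), rotate(target, angle))
print(before, after)
```

Since every candidate mixture keeps the same distance to the target, the best mixture is the same before and after the rotation. Results can only change once dimensions are dropped.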

huijbregts said...

@ Alberto
You found a good demonstration of how Euclidean distances work. Extreme values get more weight than average values.
Under some plausible conditions like normal distribution etc. it can be proven that the second sample, with the large outlier, has a smaller probability than the first sample with a more even distribution. This is the basis of the familiar least squares statistics.
As far as I know there is no popular statistical test based on a least absolute value statistic.
It can also be proven that under plausible conditions the least square solution based on the Euclidean distance is the optimal solution.
So, according to the statistical texts, there is a clear answer to the question which is better.
It will take strong empirical evidence to convince me of the opposite.
But as I said above, I think that minimizing the absolute value will lead to about the same results. Maybe it will take some more time to get there.
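A small sketch of the weighting point, using two made-up residual vectors (assumed values, not taken from any fit in this thread). Both have the same sum of absolute differences, but the Euclidean distance penalises the one with a single large outlier much more.

```python
import math

def abs_distance(residuals):
    """Sum of absolute differences (least absolute value criterion)."""
    return sum(abs(r) for r in residuals)

def euclid_distance(residuals):
    """Square root of the sum of squares (least squares criterion)."""
    return math.sqrt(sum(r * r for r in residuals))

# Hypothetical residuals: same total misfit, spread differently.
even = [2, 2, 2, 2]      # misfit spread evenly over four columns
outlier = [0, 0, 0, 8]   # all of the misfit in one column

print(abs_distance(even), abs_distance(outlier))        # 8 8
print(euclid_distance(even), euclid_distance(outlier))  # 4.0 8.0
```

Under the absolute value criterion the two fits are tied; under the Euclidean criterion the even fit is clearly preferred.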

huijbregts said...

@ Matt
So you say that the first few dimensions from a Dstat PCA give a different result than the full 21 dimensions. That implies that you need more than 3 dimensions, let's say 5. David needs 10 dimensions of his PCA. It is like words of 5 letters vs words of 10 letters. The 10-letter words contain more information.
This information must have been lost during the calculation of the Dstats. (An alternative explanation is that the Dstat columns are not a representative sample from all the possible columns. This alternative seems even worse.)
You ask about the subset of ancient populations. I expect no problem. All the populations in the rows can be described as combinations of the eigenvectors; that includes the ancient rows. But in theory a situation might occur where an eigenvector seems highly specific to present populations and yet an ancient population loads on that modern vector. According to Dr. Murphy we will see such a situation.
The time here is now 1.00 AM. I am going to sleep. I hope to see a post from David tomorrow. There will be a lot to discuss.

Matt said...

@ huijbregts: Re: "let's say 5".

Late here also, so just on this, to give the actual results (rather than just assuming 5 for the sake of discussion), for English Cornwall they don't converge close to the full set of D-stats PCA dimensions until about 8-10 dimensions, from what I can tell. 10 is convergent. 8 is still slightly off (as in different results) to the full set but not by much. 6 doesn't give the same result as the full set, as I showed in the post above (I haven't tested 7).
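A toy sketch of how the answer can depend on the number of dimensions kept. The coordinates are invented (four hypothetical PCA dimensions, not the D-stat PCA itself): with only the first two dimensions one reference looks closest, and once the third is included the other one wins and stays ahead.

```python
import math

def truncated_distance(a, b, k):
    """Euclidean distance using only the first k dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:k], b[:k])))

# Hypothetical coordinates (assumed values, for illustration only).
target = [0.0, 0.0, 0.0, 0.0]
refs = {
    "ref_a": [1.0, 1.0, 3.0, 0.0],  # close in dims 1-2, far in dim 3
    "ref_b": [2.0, 2.0, 0.0, 0.0],  # further in dims 1-2, identical after that
}

def nearest(k):
    """Name of the reference closest to the target using k dimensions."""
    return min(refs, key=lambda name: truncated_distance(target, refs[name], k))

for k in range(1, 5):
    print(k, nearest(k))
```

The distance can only grow or stay the same as dimensions are added, so once the later dimensions carry little variation the ranking stops changing, which is the kind of convergence described above.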

Like you say, we'll have to wait for the PCA dimensions Davidski produces in a datasheet before I can properly compare how the results vary with decreasing dimension between Davidski's PCA and the PCA based on the D-stat columns.

Davidski said...

Matt, I can't post the underlying PCA data. Or maybe I can, but I don't know how to yet. Anyway, I think you'll really dig this datasheet.

Check this out. RISE552 is the only Yamnaya sample with Y-HG I2a.

Yamnaya_Kalmykia_RISE552
Anatolia_Neolithic_I0709 0
Esperstedt_MN_I0172 0
Hungary_CA_I1497 0
Karasuk_outlier_RISE497 2.1
Kotias_KK1 44.8
Loschbour_Loschbour 0
Motala_HG_I0017 20.8
Samara_HG_I0124 32.3
Stuttgart_LBK380 0

Yamnaya_Samara_I0231
Anatolia_Neolithic_I0709 0
Esperstedt_MN_I0172 0
Hungary_CA_I1497 0
Karasuk_outlier_RISE497 0
Kotias_KK1 40.5
Loschbour_Loschbour 0
Motala_HG_I0017 16.7
Samara_HG_I0124 42.8
Stuttgart_LBK380 0
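For readers wondering how percentages like these can be produced in principle, here is a rough sketch. It is not the method used for the numbers above, which isn't described here. It simply tries every combination of reference percentages (in fixed steps, summing to 100) and keeps the one whose weighted average lies closest to the target in Euclidean distance. The reference coordinates, the names `ref_a`, `ref_b`, `ref_c` and the 5% step are all assumptions.

```python
import math
from itertools import product

def mix(weights, refs):
    """Weighted average of the reference rows, column by column."""
    dims = len(refs[0])
    return [sum(w * r[d] for w, r in zip(weights, refs)) for d in range(dims)]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_fit(target, refs, step=5):
    """Brute-force search over percentages (multiples of step) summing to 100."""
    best = None
    levels = range(0, 101, step)
    for combo in product(levels, repeat=len(refs) - 1):
        last = 100 - sum(combo)
        if last < 0:
            continue  # percentages may not exceed 100 in total
        pct = list(combo) + [last]
        model = mix([p / 100 for p in pct], refs)
        d = euclid(target, model)
        if best is None or d < best[0]:
            best = (d, pct)
    return best

# Hypothetical reference rows (ref_a, ref_b, ref_c) and a target built from them.
refs = [
    [0.020, -0.010, 0.030],
    [-0.015, 0.025, 0.005],
    [0.000, 0.010, -0.020],
]
target = mix([0.45, 0.35, 0.20], refs)

dist, pct = best_fit(target, refs)
print(pct, dist)
```

A brute-force grid like this becomes slow with many reference populations, which is one reason real tools tend to use random search or a constrained least squares solver instead.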

huijbregts said...

@ Davidski

As I am Y-HG_I2a myself, I am very pleased by your example.
It says my big daddy had more Kotias than Samara_HG!

I hope you find a solution for the posting problem.

Davidski said...

Almost there. I'm doing a really big run to cover as much variation as possible.

batman said...

@Huijbregts

Seemingly pre-mesolithic 'Bichon' belonged to y-dna I2a1a. Thus he shared the ancestor of Motala 9, as well as the first farming herders of Gotland; Ajvide 52, 58, and 70.

The same family-line seems to have been involved with the early agricultures in the highlands of Middle Europe, such as KO1 and NE7, from the Körös-Lengyel hemisphere.

Based on percentages of the extant populations with y-dna I1 and I2 respectively, we find a clear correlation to Gotland and the historical 'gots'/'goths' - as well as to the 'venetian' side of the Balkans - today known to share alleles with the mesolithic Motalas and the early-neolithic Ajvide.

Do the mutations available imply that men of hg I2a had a last common ancestor on the Balkan west coast or in the Scanian waters?

Krefter said...

@batman,

Most hg I today is from post-Neolithic founder effects. I1 and I2a1b2-Dinaric are two known ones. There might be others. So, we can't interpret modern I2 or I frequencies as reflective of anything that happened before 3000 BC.

batman said...

@ Krefter
Hg I1 was found in a 7,500-year-old sample from a tiny island in the Baltic Sea (Stora Förvar).

Doesn't that give us a minimum age for the branching (bifurcation) between I1 and I2?

Isn't the location of this early I1 telling us something about the origin of the later I1, from the early stock-herders on the neighbouring islands, as well as mainland Fenno-Scandia? The earliest known so far are some LN/BA samples from Scania.

Meanwhile the I2-lines came to spread the first boat-building, herding, agriculture and metallurgy in northern Europe, with outliers in all of mainland Europe. (Side-by-side with their cousins of hgs G, H and J. At times even alongside some common ancestral family-line of hg F*...?)

The I2 herders from Scania seem to cline with the neolithic I2 from continental Europe, all the way down to Trielles and Sardinia. Perhaps also with the EN I2 samples from Moldovia and the Balkans?

Doesn't that add up to anything?