Wednesday, May 18, 2016

Yamnaya = Khvalynsk + extra CHG + maybe something else


Thanks to recently published ancient DNA it's now generally accepted that the Yamnaya pastoralists of the Pontic-Caspian Steppe were basically a mixture of Eastern European Hunter-Gatherers (EHG) and Caucasus Hunter-Gatherers (CHG).

However, I thought it might be useful to dig a little deeper with the D-stats/nMonte method to see what else crops up. I put together a special datasheet for the job, with Yamnaya Samara featured both in the rows and columns. It's available for download here. The nMonte R script can be gotten here.
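
To make the idea behind this kind of modelling concrete, here is a minimal Python sketch of fitting a target row as a non-negative mixture of reference rows by minimizing Euclidean distance. It is not the nMonte script (which is written in R and uses its own search). The function names, the simple pairwise weight-shifting search and the toy numbers are all assumptions for illustration. The distance% figure reported with each model appears to be simply the distance multiplied by 100.

```python
import math

def mix(refs, weights):
    """Weighted average of reference rows (weights are assumed to sum to 1)."""
    ncol = len(refs[0])
    return [sum(w * r[k] for w, r in zip(weights, refs)) for k in range(ncol)]

def distance(target, refs, weights):
    """Euclidean distance between the target row and the mixed reference row."""
    mixed = mix(refs, weights)
    return math.sqrt(sum((t - m) ** 2 for t, m in zip(target, mixed)))

def fit(target, refs, step=0.1, min_step=1e-4):
    """Hypothetical search: shift weight between pairs of references while the fit improves."""
    n = len(refs)
    w = [1.0 / n] * n                      # start from equal proportions
    best = distance(target, refs, w)
    while step >= min_step:
        improved = False
        for i in range(n):
            for j in range(n):
                if i == j or w[i] <= 0.0:
                    continue
                d = min(step, w[i])        # never push a weight below zero
                trial = list(w)
                trial[i] -= d
                trial[j] += d
                dist = distance(target, refs, trial)
                if dist < best - 1e-12:
                    w, best, improved = trial, dist, True
        if not improved:
            step /= 2.0                    # refine the step once nothing helps
    return w, best

# Toy data (invented): three reference rows and a target built as 60/30/10 of them.
refs = [[0.10, 0.02, 0.05], [0.02, 0.12, 0.03], [0.04, 0.03, 0.15]]
target = mix(refs, [0.6, 0.3, 0.1])
weights, dist = fit(target, refs)
print([round(100 * x, 1) for x in weights], "distance%%=%.4f" % (100 * dist))
```

Because the weights are constrained to be non-negative and to sum to one, references that do not help the fit end up at 0, which is the pattern seen in the model outputs in this post.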

Using the most plausible reference samples currently available - almost all of them older than Yamnaya, and thus unlikely to skew the results with Yamnaya admixture - reveals the following models for the two Yamnaya sets from Kalmykia and Samara, respectively.

Yamnaya_Kalmykia
Khvalynsk 57.7
Kotias 28.3
Hungary_EN 12.9
Ulchi 1.1
AfontovaGora3 0
Anatolia_Neolithic 0
Karelia_HG 0
Loschbour 0
MA1 0
Motala_HG 0

distance%=1.9125 / distance=0.019125

Yamnaya_Samara
Khvalynsk 56.75
Kotias 26.4
Hungary_EN 10.85
Karelia_HG 4.4
Loschbour 1.6
AfontovaGora3 0
Anatolia_Neolithic 0
MA1 0
Motala_HG 0
Ulchi 0

distance%=2.1354 / distance=0.021354

Very interesting but hardly surprising. Essentially what we're seeing there is potentially very strong genetic continuity from the Eneolithic to the Early Bronze Age on the Pontic-Caspian Steppe. In other words, from Khvalynsk to Yamnaya.

However, at some point between the Eneolithic and the Early Bronze Age, the steppes saw a major influx of extra CHG, represented by the ~27% of Kotias-related admixture. Considering the relevant uniparental data, with lots of Y-HG R1b and no Y-HG J among Yamnaya males, I'd say this CHG came with women.

Also, the relatively high admixture related to early Hungarian Plain farmers (Hungary EN) is a fairly curious detail that has not been reported before. If real, it probably represents gene flow from the Neolithic and/or Chalcolithic Balkans to the Pontic-Caspian Steppe. Again, in all likelihood it mostly came with women, perhaps from Tripolye-Cucuteni and/or Varna communities.

But, to make sure the results weren't erroneous - perhaps skewed by the worryingly high affinity between the two CHG genomes, Kotias and Satsurblia, both pseudo-diploid sequences - I reran the tests using a different version of the same datasheet (available here). In this version, Kotias and Satsurblia are part of a single Caucasus HG sample included in the rows.

Yamnaya_Kalmykia
Khvalynsk 53.95
Caucasus_HG 38.45
Hungary_EN 4.7
Motala_HG 2.9
AfontovaGora3 0
Anatolia_Neolithic 0
Karelia_HG 0
Loschbour 0
MA1 0
Ulchi 0

distance%=1.7492 / distance=0.017492

Yamnaya_Samara
Khvalynsk 56.7
Caucasus_HG 33.55
Hungary_EN 4.25
Loschbour 2.95
Motala_HG 2.55
AfontovaGora3 0
Anatolia_Neolithic 0
Karelia_HG 0
MA1 0
Ulchi 0

distance%=2.0964 / distance=0.020964

Yup, the Hungary EN contributions take a dive. But they're still above noise level (>2%), so they might well represent a real signal that entered the Yamnaya horizon or late Khvalynsk from the west. Can't wait to see how Yamnaya genomes from north of the Black Sea come out.

Update 20/08/2017: It appears that I was onto something. See here.

See also...

Modeling Steppe_EMBA

Yamnaya =/= Eastern Hunter-Gatherers + Iran Chalcolithic

Another look at the genetic structure of Yamnaya

77 comments:

Alberto said...

Thanks David & RK.

One thing that I think would be good to look at is the method for running these D-stats. I again couldn't yet run enough tests to get a more complete picture, but from a quick look the differences from swapping the position of Mbuti and Chimp look very significant. For example, with the datasheet in this post (with Caucasus_HG as one population):

Kalash
"Caucasus_HG" 78.3
"Atayal" 10.95
"AfontovaGora3" 10.75
"Anatolia_Neolithic" 0
"Esan_Nigeria" 0
"Munda" 0
"MA1" 0
distance=0.061205

With the datasheet in the previous post:

Kalash
"Kotias" 26.9
"Anatolia_Neolithic" 25.75
"Munda" 24.95
"MA1" 22.4
"AfontovaGora3" 0
"Atayal" 0
"Esan_Nigeria" 0
distance=0.014476

So with this new datasheet a similar phenomenon happens as when using ADMIXTURE and the IBS-based PCA data: CHG becomes very important (though here it seems even more so than ever). On one hand it's good to see 3 different methods agreeing, and this one cannot be skewed by using many modern populations in the comparison. But on the other hand, don't the results look just too "good" to be true (too simple and easy)?

With a European population, where it should be easier to get both methods to agree because there are more and better references, the differences are quite significant too. With the datasheet in this post:

Spanish_Extremadura
"Anatolia_Neolithic" 46.3
"Caucasus_HG" 28.15
"Hungary_HG" 21.65
"Karelia_HG" 3.35
"Atayal" 0.55
"Esan_Nigeria" 0
"Loschbour" 0
distance=0.020011

With the previous one:

Spanish_Extremadura
"Anatolia_Neolithic" 61.8
"Loschbour" 11.5
"Karelia_HG" 11.5
"Kotias" 10.6
"Atayal" 2.8
"Esan_Nigeria" 1.8
"Hungary_HG" 0
distance=0.01071

So before going on with more models, I think we should try to figure out what's going on and which one is better (if any, maybe both are half right, but it would be good to understand that too).

huijbregts said...

@ Alberto
If the differences between Kotias and Satsurblia are noisy and Caucasus_HG is less noisy, then nMonte will prefer Caucasus_HG above Kotias.

Anonymous said...

Any possibility that this extra component might reflect an original Central Asian component?

Alberto said...

Yes, indeed using the second sheet in the post with Caucasus_HG as a combination of both Kotias and Satsurblia accounts for a good part of the effect. So here with the first one using Kotias as a row, which should be a better comparison for the swapping of Chimp and Mbuti:

Kalash
"Kotias" 46.35
"MA1" 25.35
"Anatolia_Neolithic" 15.3
"Atayal" 8.05
"Munda" 4.95
"Esan_Nigeria" 0
"AfontovaGora3" 0
distance=0.064622

Spanish_Extremadura
"Anatolia_Neolithic" 55.85
"Kotias" 16.2
"Hungary_HG" 15.2
"Karelia_HG" 8.9
"Loschbour" 2.05
"Atayal" 1.8
"Esan_Nigeria" 0
distance=0.020772

Now the differences are not nearly as big, but still there. So 2 things to look at: the difference by swapping Mbuti/Chimp and the difference by using Caucasus_HG vs. Kotias alone.

Davidski said...

@aniasi

Any possibility that this extra component might reflect an original Central Asian component?

Yamnaya is a phenomenon derived from populations living between the Black Sea, Caspian Sea and the Caucasus. The extra component, if real, is from the Balkans.

There's no evidence that it had any post-Ice Age admixture from Central Asia.

Krefter said...

@Davidski,

This has been an issue for over 3 months. I think Yamnaya definitely does have EEF or closely related Near Eastern ancestry. We need CHG or a very CHG-rich population (like Georgians) as a row population to get accurate CHG ancestry percentages. When we do, Yamnaya needs EEF admixture.

Davidski said...

Georgians are in the rows and columns in the above sheets. So are Armenians.

But using them to model Yamnaya won't give you accurate information about the origins of Yamnaya, because both are modern populations with complex ancestry.

Hungary EN looks like the best proxy we have for the minor non-CHG southern ancestry in Yamnaya. But what we really need are CT and Varna. They'll probably end up being better fits than Hungary EN.

Davidski said...

rk,

result: Chimp Kostenki14 Mbuti Han 0.3404 81.463 21161 10413 241538
result: Chimp GoyetQ116-1 Mbuti Han 0.3491 85.519 21550 10399 241538
result: Chimp Vestonice16 Mbuti Han 0.3431 83.527 21230 10384 241538
result: Chimp AfontovaGora3 Mbuti Han 0.3580 86.132 19855 9386 223846
result: Chimp ElMiron Mbuti Han 0.3504 84.795 21440 10313 241538
result: Chimp Karelia_HG Mbuti Han 0.3597 94.218 22004 10362 241538
result: Chimp MA1 Mbuti Han 0.3598 90.209 21768 10248 241538
result: Chimp Loschbour Mbuti Han 0.3528 92.354 21656 10360 241538
result: Chimp Villabruna Mbuti Han 0.3431 81.283 21480 10506 241538
result: Chimp LaBrana1 Mbuti Han 0.3485 89.042 21645 10457 241538
result: Chimp Anatolia_Neolithic Mbuti Han 0.3354 100.000 21273 10587 241538
result: Chimp Kotias Mbuti Han 0.3407 90.516 21382 10515 241538
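
As a cross-check on lines like these, the D value can be recomputed from the two site-pattern counts. The Python sketch below assumes that the columns after the four population labels are D, Z, BABA count, ABBA count and number of SNPs. That reading of the columns is an assumption, although it does reproduce the printed D values for the rows above. The function name is made up for illustration.

```python
def parse_dstat_line(line):
    """Split one 'result:' line into labels and numbers (column order is assumed)."""
    parts = line.split()
    if parts[0] != "result:":
        raise ValueError("not a result line")
    pops = parts[1:5]
    d, z = float(parts[5]), float(parts[6])
    baba, abba, nsnps = int(parts[7]), int(parts[8]), int(parts[9])
    # D is the normalized difference between the two discordant site patterns.
    d_recomputed = (baba - abba) / (baba + abba)
    return {"pops": pops, "D": d, "Z": z, "BABA": baba, "ABBA": abba,
            "nsnps": nsnps, "D_recomputed": d_recomputed}

row = parse_dstat_line(
    "result: Chimp Kostenki14 Mbuti Han 0.3404 81.463 21161 10413 241538"
)
print(row["pops"], row["D"], round(row["D_recomputed"], 4))
```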

Anonymous said...

Is there any genetic material yet from Otomani-Füzesabony burials?

Davidski said...

From Otomani-Füzesabony burials.

Unfortunately not.

Anonymous said...

There is a report online on the examination of the body (remains) of General Sikorski. It includes a Y-DNA report, which indicates that the General belonged to the R1b-Z2103 branch. There is a small cluster of R1b-Z2103 around Przeworsk. A second cluster is in Wielkopolska. I suspect that Polish R1b-Z2103 comes directly from the Otomani-Füzesabony culture.

Davidski said...

rk,

Unless I messed something up, then these are the ratios. Doesn't look like Basal Eurasian.

https://drive.google.com/file/d/0B9o3EYTdM8lQU0VsaGtacWdRLXM/view?usp=sharing

Davidski said...

Something's still off...

Mbuti Afanasievo Ami Ust_Ishim : Mbuti Loschbour Ami Ust_Ishim 1.74193 0.91148 1.911
Mbuti Albanian Ami Ust_Ishim : Mbuti Loschbour Ami Ust_Ishim 1.811601 0.932488 1.943
Mbuti Anatolia_Neolithic Ami Ust_Ishim : Mbuti Loschbour Ami Ust_Ishim 0.962777 0.413903 2.326
Mbuti Andronovo Ami Ust_Ishim : Mbuti Loschbour Ami Ust_Ishim 1.659072 0.833482 1.991
Mbuti Armenian Ami Ust_Ishim : Mbuti Loschbour Ami Ust_Ishim 1.538128 0.774075 1.987
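
The last three numbers on each of these lines look like an f4-ratio estimate, its standard error and the resulting Z-score. That reading is an assumption, but dividing the first number by the second does reproduce the third. A minimal Python sketch of that check (the function name is made up):

```python
def ratio_z(line):
    """Return (estimate, std_err, printed_z, recomputed_z) from one ratio line."""
    est, se, z = (float(x) for x in line.split()[-3:])
    return est, se, z, est / se

est, se, z, z_check = ratio_z(
    "Mbuti Anatolia_Neolithic Ami Ust_Ishim : "
    "Mbuti Loschbour Ami Ust_Ishim 0.962777 0.413903 2.326"
)
print(est, se, z, round(z_check, 3))
```

Under that reading, the standard errors here are roughly half the size of the estimates themselves, so none of these ratios is tightly constrained.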

FrankN said...

rk: Thx for the illustrative description of methodological issues! I wouldn't exclude slight epipaleolithic/mesolithic admixture between Mbuti and Near East (the Mbuti's Basenji dog has Israeli Wolf admixture!), such admix (~1%) is actually shown in one of Dave's Treemix diagrams posted a few days ago. This might explain the erratic results. To get rid of such effects, one would need to replace Mbuti by Chimp.

Alberto: The following, very recent paper discusses methodological problems with genetic clustering analysis. It refers specifically to STRUCTURE and ADMIXTURE, but the theoretical considerations should also apply to other Monte-Carlo-based clustering approaches, including nMonte. Specifically, three problems are highlighted:
(i) Label switching;
(ii) Multimodality, i.e. possible solutions lie so close to each other that the algorithms have a hard time differentiating between them, leading to non-trivial differences between various runs;
(iii) Alignment across K: When an input parameter (dimension) is added, the result matrices for K and K+1 dimensions will differ and cannot be mathematically transformed into each other.
http://biorxiv.org/content/biorxiv/early/2016/05/18/031815.full.pdf
The article presents a software tool to better deal with these issues - not solving them, which seems mathematically impossible, but highlighting their occurrence as a warning sign, and providing an entry into further analysis.

Two more fresh papers on bioRxiv that I found interesting:
1. Detection of Human adaptation during the past 2000 years
"Human" is British, but nevertheless. The strongest signal is on lactase persistence, which apparently is a quite recent process, followed by the HLA system that plays a key role in immune defense. They furthermore identified selection on sugar metabolism (fasting insulin, glucose intolerance, insulin response), cholesterol, plus a number of features that may have come under sexual (partner choice) selection: blonde hair, blue eyes, female tanner (skin colour), male and female height, male BMI, female hip size and waist/hip ratio. The latter may also relate to selection for higher birth weight and infant head size.
http://biorxiv.org/content/biorxiv/early/2016/05/07/052084.full.pdf

2. Adaptively introgressed Neandertal haplotype at the OAS locus functionally impacts innate immune responses in humans
While many Neandertal genes were apparently selected against in humans, Neandertal OAS genes were positively selected for. OAS genes play a prime role in viral defense, including against influenza, but also in combating Salmonella. The Neandertal-derived variant is especially beneficial against Flavivirus (Hepatitis C, Dengue, West Nile etc.). The study suggests that Neandertals, like most African populations, had preserved the original OAS1 splice variant, which was replaced by other variants in Basal Eurasians due to a geographically different virus pressure, but then re-introduced by admixture with Neandertals.
http://biorxiv.org/content/biorxiv/early/2016/05/04/051466.full.pdf

OT: My niece has her first publication out, a few more are to follow over the coming months:
http://www.pnas.org/content/early/2016/05/13/1520255113.full#fn-3

Davidski said...

@Frank

Mbutis don't have any West Eurasian ancestry. They have some sort of archaic admixture that possibly occasionally affects Treemix graphs. And if they had 1% West Eurasian ancestry, this wouldn't have any appreciable effect on the f4 ratios.

Also, we're not using nMonte in the same way as Structure or Admixture are used. Even the supervised tests offered as part of Structure and Admixture are very different to what we're doing here.

Davidski said...

rk,

Chimp Kotias Ami Ust_Ishim : Chimp Loschbour Ami Ust_Ishim 1.376175 0.313522 4.389
Chimp Stuttgart Ami Ust_Ishim : Chimp Loschbour Ami Ust_Ishim 1.064245 0.247987 4.292
Chimp Yamnaya_Samara Ami Ust_Ishim : Chimp Loschbour Ami Ust_Ishim 1.490595 0.295643 5.042

Grey said...

"I think Yamnaya definitly does have EEF or closely related Near Eastern ancestry."

If CHG expanded onto the steppe did they do so as HGs or farmers? If farmers then did they develop it independently or was there a catalyst?

I recall reading there were pottery-using sedentary HGs around the edge of the Black Sea, so were those people primed to adopt farming, unlike most HG populations?

If so maybe a few ENF were all that was needed as a catalyst to turn them into farmers?

In other words CHG around the Black Sea weren't over-run by ENF like most HGs because they were already primed for farming so they adopted it instead and a majority CHG / minority ENF farmer population expanded onto the steppe to become the catalyst that turned EHG into PIE.

Alberto said...

@FrankN

Thanks, I'll look at the paper as soon as I can, even if as David said the methods are different it might still be helpful.

I guess we can all accept that the methods we use are never going to be perfect, but when an apparently trivial change can have such a big impact on the results, it really affects the confidence we can have in the models we're trying to make. It's not like the case of Motala vs. a mix of WHG-EHG, where the line is thin. A much bigger uncertainty is introduced when a population can go from 78% to 27% CHG, or from 26% to 0% Anatolia_Neolithic, or when Munda (representing something akin to ASI) can go from 25% to nothing, etc. This is something we need to look at, to figure out whether there's something wrong or whether we simply have to accept that this is how it is.

Does anyone have any idea of what might be going on there? Any suggestion as to what to test to see what works better, etc...?

Davidski said...

Kalash are in the columns in these sheets. This might be exaggerating the differences, at least for them anyway, because they're so drifted.

FrankN said...

@Dave: Against which outgroup did you establish that Mbuti don't have West Eurasian ancestry?

Also, even at 1-2% West Eurasian admix in Mbuti, things might become complicated when the admixture source hasn't spread equally across W. Eurasia.

What I am particularly thinking about is this new paper:
Mapping human dispersals into the Horn of Africa from Arabian Ice Age refugia using mitogenomes
http://www.nature.com/articles/srep25472
Based on phylogenetic and founder analysis, and Bayesian skyline plots, they attempt to reconstruct the migration history of mtDNA R0a. They postulate an Arabian origin/glacial refuge, either near the Persian Gulf or in the Red Sea plains, and a strong expansion starting with the Bölling-Alleröd interstadial 14.7 kya. Expansion into the Horn of Africa is estimated to have occurred in a major peak ca. 11.8 kya, i.e. just after the Younger Dryas. At the same time, the line also expanded in a peak, via the Levant, across the Mediterranean and into the Fertile Crescent. South Asia shows a small peak 7.8 kya, and a larger one 2 kya. The latter concerns especially the Kalash, who harbour most of South Asian R0a. Further mid- to late Holocene spread occurred with the Afroasiatic/Semitic expansions (Amhara, Cushites, Berber, Arabs etc.).

Let's assume that the East African wave also reached the Mbuti - it at least fits timewise with the introduction of the Basenji dog there. We would get a quite scattered pattern. Eventually, what was in fact an early Holocene expansion out of the Arabian Peninsula might be mistaken for SSA admix in Bedouins, Druze etc. Whether such a scenario actually explains the inconsistencies that rk has uncovered - I don't know. But some testing on that scenario might be a good idea.

Alberto said...

Yes, that could have something to do with it, because other populations related to Kalash don't get so radical results:

Tajik_Ishkashim
"Caucasus_HG" 50.7
"Anatolia_Neolithic" 19.05
"AfontovaGora3" 17.45
"Atayal" 12.8
"Esan_Nigeria" 0
"Munda" 0
"MA1" 0

Pathan
"Caucasus_HG" 58.1
"Anatolia_Neolithic" 12.5
"Atayal" 10.6
"AfontovaGora3" 9.8
"Munda" 9
"Esan_Nigeria" 0
"MA1" 0

GujaratiA
"Caucasus_HG" 50.45
"Munda" 22.95
"AfontovaGora3" 10.6
"Anatolia_Neolithic" 10.5
"Atayal" 5.5
"Esan_Nigeria" 0
"MA1" 0

Understanding this effect of having a population in the columns should help in making the best choices for columns. I'll try to investigate this a bit.

Nirjhar007 said...

http://www.nature.com/articles/srep25501

Davidski said...

Frank,

Only one paper claimed that Mbutis had West Eurasian admixture, the one about the Mota genome, and that was shown to be an error.

Alberto,

Just remove Kalash2 from the columns.

Chad said...

Well, I retried the qpAdm runs assuming there is no Basal Eurasian, and the fits aren't bad. Loschbour comes out 28% Anatolian, 45% GoyetQ, 23% MA1, and 3% Han. Ust-Ishim was in the outgroups and the fit is about as good as making Corded Ware as LBK, WHG, and Yamnaya. Something is going on here.

Kristiina said...

Thank you Nirjhar and Frank for the links! Basal U6 in Romania in a 35000 year old genome is highly interesting. I am still wondering about the route it took to Africa: via Gibraltar or Yemen/Sinai.

Frank, when I look at the distribution of R0a, I see that it radiates from Saudi-Arabia and the Horn of Africa to Egypt/Morocco and to India and Central Asia via Iran. Again, it is the same cultural/political pathway I was talking about yesterday.

I do not think that gene flow from Arabian Ice Age refugia to Mbuti would be (closely) related to West Eurasian ancestry as defined by Upper Palaeolithic Europeans (or MA1) or modern Europeans, so I do not think it is a problem that Mbuti do not appear to have West Eurasian admixture as defined on the basis of modern West Eurasians or UPH from Europe. Arabian Ice Age ancestry would probably be Basal or similar gene flow that remains to be discovered/identified.

huijbregts said...

@ Alberto

I find it hard to understand your selection of reference genomes.
Why do you compare Kalash with Atayal or Munda?
IMO after Caucasus_HG, Andronovo is a logical choice.
Karasuk_outlier seems even more important.
With the last dataset:

KALASH
"Caucasus_HG" 52.7
"Karasuk_outlier" 29.65
"Andronovo" 14.9
"Scythian_IA" 2.3
"Afanasievo" 0.3
"Okunevo" 0.1
"Altai_IA" 0.05
distance%=5.3999

Alberto said...

@Davidski

Yes, removing Kalash from the columns indeed improves things for them (compare to the first run above with 78% Caucasus_HG):

Kalash
"Caucasus_HG" 55.7
"Anatolia_Neolithic" 15.3
"Atayal" 13.7
"AfontovaGora3" 13.1
"Munda" 2.2
"Esan_Nigeria" 0
"MA1" 0

Which basically confirms what we already knew, that recent gene flow amplifies D-stats values disproportionately. Which complicates the choices for rows and columns. But ok, we already knew this.

There are still a couple more things:

- What about the very significant difference between using Kotias alone or using Caucasus_HG as a group with both Kotias and Satsurblia? In general I would think that using both together should be better, though I wouldn't expect such a difference (otherwise, what to think about all the others we're using as individuals?).

- And then about changing the position in the stats of Chimp and Mbuti, one consequence is that Spanish_Extremadura stops showing SSA admixture, which is strange. Results look cleaner, though, but not sure if that means they are better or worse.

I'll try to get time this weekend to test in more detail.

PF said...

The R0 paper is good, but just provides a higher-res view of what was already known -- I looked into it when I found out I am R0a2 myself. Previously I suggested R0 might relate to Basal, and perhaps the weird stats we are seeing are because there were multiple pulses of related ancestry across large swaths of time (i.e., a very early out-of-Arabia, followed by later expansion/s).

RE: Mbuti and West Eurasians. Weren't all the paths in Treemix *from* Mbuti to Eurasians? I recall consistent ~5% from Mbuti to the base of Anatolia-Neolithic and Kotias. This persisted with Africans and Paleo Eurasians in the tree. So what the hell is going on?!

FrankN said...

@Dave, Kristiina: When it comes to the Mbuti, my main concern is their traditional dog, the Basenji. Some ethnological background is provided here:
https://shibasenji.wordpress.com/tag/democratic-republic-of-the-congo/
http://www.dogslife.com.au/dog-news/dog-stories/Dogs-in-Africa-Part-2

When, and from whom the Mbuti learned to use dogs for hunting is unclear. But dogs aren't native to SSA, so somebody must have brought the Basenji to the Congo Basin and taught the locals how to use them in hunting.

Then, there is (a)DNA research on the Basenji:
http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004016
"We found significant evidence of admixture for three population pairs: Israeli wolf and Basenji, Chinese wolf and Dingo, and Israeli wolf and Boxer (..) [D]ivergence (..) is estimated to have occurred (..)at 12.8 kya (11.8–13.7 kya; divergence of Dingo) and 12.1 kya (10.9–13.1 kya; divergence between Boxer and Basenji)." Add to that the Natufian dog burials 14 kya, and the population expansion from the Arabic peninsula into East Africa around 11.8 kya as reported above, and you get a quite convincing timeline.

As you have said, Kristiina, UPH European admix in Mbuti would only be expected if (a) Levantine pops (Natufian) already received WHG/Villabruna genetic inflow prior to the Younger Dryas, and (b) the expansion into East Africa started directly from the Levant and was carried by Natufian-like populations. Neither can be excluded, but both are as yet unproven. If, however, the transfer of the domesticated dog into SSA was mediated by a population on the Red Sea coast, we should just expect Red Sea-like admix in Mbuti. I don't know whether such admix has yet been tested. Mota would be a prime candidate here, but mtDNA R0a-heavy Ethiopian Jews, or even Bedouin B, might also do the job.

Matt said...

All, re the differences between stats with D(Chimp,X)(Mbuti,Outgroup)* and the new stuff with D(Chimp,Outgroup)(Mbuti,X), I tried to run a comparison of two PCA for the same populations, with the same columns (from the datasheets from Davidski's last post and comments):

PCA - http://i.imgur.com/oMZCLe8.png

Classical Clustering - http://i.imgur.com/zvpc39J.png

It looks like the PCA with D(Chimp,Outgroup)(Mbuti,X) places much more emphasis on the distinctness of EUP HG from Neolithic groups, and generally much more emphasis on contrasts between HG and Neolithic, while D(Chimp,X)(Mbuti,Outgroup) less so. I don't know which seems more correct, exactly.

*what we were using for most of the previous D-stat experiments

Shaikorth said...

If there was West Eurasian admix in Mbuti how do we explain the following stat from Wong et al.

East_Asian French Mbuti Chimp D 0.009 Z 2.17

without assuming Mbuti without this admixture would be significantly East Asian shifted?

East Asian is Dai+Han+Malaysian, genomes are high coverage (30-40x) sequences.

huijbregts said...

One more problem with feeding Dstat-data to nMonte:
the Dstat columns are not nearly orthogonal, hardly any correlation is below 0.9.
So the validity of Euclidian distance calculations is questionable at least.
Probably the best workflow is: Dstats, PCA, nMonte.
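
A quick way to check the correlation claim on any datasheet is to compute Pearson correlations between pairs of columns. The sketch below is in Python rather than R, and the toy matrix is invented. It only stands in for a real D-stats sheet with samples in rows and outgroup columns.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def column_correlations(rows):
    """Correlation for every pair of columns, keyed by (column_i, column_j)."""
    cols = list(zip(*rows))
    out = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            out[(i, j)] = pearson(cols[i], cols[j])
    return out

# Toy sheet (invented numbers): 4 samples x 3 columns.
toy = [
    [0.10, 0.11, 0.30],
    [0.20, 0.19, 0.10],
    [0.30, 0.32, 0.20],
    [0.40, 0.41, 0.40],
]
corr = column_correlations(toy)
for pair, r in sorted(corr.items()):
    print(pair, round(r, 3))
```

If most pairs come out above 0.9, the columns carry largely the same information, and a plain Euclidean distance effectively counts that shared signal many times over.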

Anonymous said...

Hervella has a paper out on the mtDNA of Muierii1, the other sample from the Romanian cave. It is also U6, but apparently basal U6. Hervella's paper thus concludes it is a sign of a back migration to North Africa.

Could this be the source of the rising Middle-Eastern affinity?

http://www.nature.com/articles/srep25501

Alberto said...

@Matt

Thanks for looking into this. Much appreciated.

From your PCAs it's difficult to really say which one looks more correct. But there are definitely some differences.

I've been looking at the column values and there it's also easy to see differences. Maybe collecting some of the bigger outliers we could check with direct D-stats (that should not suffer from the double outgroup problem) to see if there's a pattern of ones being clearly more correct than others. I mean for example, looking at the Dai2 column, an outlier would be Papuan. In the original type of stats:

Dai2 - Papuan: 3746
Dai2 - Turkmen: 3702

While in the new type:

Dai2 - Turkmen: 3727
Dai2 - Papuan: 3510

So a stat like:

Mbuti Dai Turkmen Papuan

would be informative as to which one could be more correct. Then collecting more outliers we could see if for most of them (or all, ideally) one type agrees with the direct ones.

On the column Druze2, not so significant, but still useful:

Original type:
Druze2 - Anatolia_Neolithic: 3976
Druze2 - Germany_MN: 3957

New type:
Druze2 - Germany_MN: 3937
Druze2 - Anatolia_Neolithic: 3885

Mbuti Druze Anatolia_Neolithic Germany_MN

Etc... Not sure if with Past3 there's an easy way to spot the outliers in both types of stats. If there is and you (or RK?) could do it it would be great. Otherwise I'll go through them manually over the weekend and try to collect a dozen or two and see if that's helpful.

Nick Patterson (Broad) said...

@Alberto
Chimp vs Mbuti

Ancient DNA is tricky and there can be attraction to outgroups from bad DNA. The further the outgroup the worse the problems. So for example in our old Neanderthal work D(Chimp, Altai; Vindija, X) was hard to interpret.

If African gene-flow is implausible I strongly prefer Mbuti as these effects are much less. Also if a result looks dubious, definitely try with transversions only.
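
For anyone wanting to try the transversions-only check: transitions are A<->G and C<->T changes, and every other single-base change is a transversion. Ancient DNA damage mostly shows up as C-to-T and G-to-A changes, which is why dropping transition SNPs is a common sanity check. A small Python sketch follows. The SNP list is made up for illustration and does not come from any real panel.

```python
VALID_BASES = set("ACGT")
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def is_transversion(ref, alt):
    """True if the ref/alt pair is a transversion, False if it is a transition."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in VALID_BASES or alt not in VALID_BASES or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    return frozenset((ref, alt)) not in TRANSITIONS

# Hypothetical SNPs as (id, ref allele, alt allele).
snps = [("snp1", "A", "G"), ("snp2", "C", "A"), ("snp3", "C", "T"), ("snp4", "G", "T")]
kept = [snp_id for snp_id, ref, alt in snps if is_transversion(ref, alt)]
print(kept)
```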

Matt said...

@ Alberto, it may possibly shed light that in the datasheets I was looking at, the biggest differences between the two sets of stats (D(Chimp,X)(Mbuti,Outgroup) and D(Chimp,Outgroup)(Mbuti,X)) were where Outgroup (column) was Denisovan.

Essentially for D(Chimp,Denisovan)(Mbuti,X), the stats tend to be mildly negative for all Eurasians, which indicates that Mbuti tends to be non-significantly closer to Denisovan than Eurasians are. The exceptions are Papuan, which gets a positive stat of 0.0298 (they're closer to Denisovan than Mbuti are), and Neanderthal, which is quite strongly closer to Denisovan than modern people are (0.3509).

While in the D(Chimp,X)(Mbuti,Denisovan), the stats are strongly negative, because of course any modern human X will be *far* more related to Mbuti than they are to Denisovan. Neanderthal though gets 0.0903 in this stat (in the grand scheme, seemingly slightly closer to the other Eurasian homo species than the African one that lacks any geneflow).

Generally using PCA on the matrix of differences, the differences between the two sets of stats compressed into basically a single vector in PCA when I tried it (if that makes sense to you?). It looks like the D(Chimp,Outgroup)(Mbuti,X) stats systematically show less shared drift with the column populations for Papuan, Esan_Nigeria, all the ancient samples (particularly the older ones and HGs, less systematically) and then higher shared drift with column populations for recent Europeans, particularly recent Mediterranean populations. Neanderthal is a big outlier in showing much less shared drift in D(Chimp,Outgroup)(Mbuti,X) than D(Chimp,X)(Mbuti,Outgroup).

If I had to speculate, and it was due to something real, and not just ascertainment bias or problems with adna as seems quite possible and even more probable... possibly related to fractionally higher Neanderthal ancestry in ancient HGs and then also possibly very low level gene flow from/to North Africa in modern West Eurasia?

Alberto said...

@Nick Patterson

Thank you for commenting about this. The people who usually run D-stats here are definitely using Mbuti now as an outgroup when dealing with ancient samples. The problem is when they need to use 2 outgroups, which is never going to be ideal, but in some situations is needed in order to use that output for other statistics. So we're trying to figure out what the best strategy is in these cases.

Alberto said...

@Matt

Yes, that makes sense, and it's what I've observed too by looking at the values of the columns and comparing them. Archaic samples are the ones that suffer a bigger effect when swapping the position of the outgroups, though fortunately those ones (and Papuans) won't matter much to us in most cases.

Assuming that Chimp is either neutral to all populations, or that it does weird things with ancients, but we can't control that (and it would be random), it would leave us in the position that the differences we're seeing are mostly due to SSA admixture in the populations used. If this is the case, then maybe it's the combination of using populations with SSA admixture in the columns and not just in the rows, which is responsible for the differences?

Difficult to tell theoretically. Only testing different combinations and trying to figure out what works better will help, I guess.

huijbregts said...

A few comments back I asserted that the columns of the Dstats spreadsheet are not orthogonal.
Moreover, most of the correlations are greater than 0.9.
The implication is that the raw Dstats cannot be used for calculation of the Euclidean distance.
I suggested orthogonalizing the data with a PCA.

But first I did some cleaning:
# read data (the first column holds the row names)
df <- read.csv('D-stats1b.txt', header = TRUE, row.names = 1)
# delete rows and columns
drop_rows <- c('Esan_Nigeria','Masai_Kinyawa','Mozabite','Neandertal_Altai','Papuan','Ust_Ishim')
drop_cols <- c('Denisovan', 'Yoruba')
dat <- df[!rownames(df) %in% drop_rows, !colnames(df) %in% drop_cols]
# get PCA scores on centered and scaled columns
myPCA <- prcomp(dat, retx = TRUE, center = TRUE, scale. = TRUE)
# keep the first five principal components
scores <- myPCA$x[, 1:5]
plot(scores[,1], scores[,2])

Next I used the scores to run nMonte with the same model as last time:
KALASH
"Afanasievo" 55.1
"Caucasus_HG" 19.05
"Munda" 16.2
"Andronovo" 9.65
"Karasuk_outlier" 0
"Scythian_IA" 0
"Okunevo" 0
"Altai_IA" 0
"Anatolia_Neolithic" 0
"Atayal" 0
"AfontovaGora3" 0
"MA1" 0
distance=2.829232

These results are dramatically different.
At first sight it seems to confirm my suspicion that the raw Dstats cannot be used to calculate Euclidean distances.

Alberto said...

I don't know about the other components, but for the SSA there seems to be something wrong. Using the datasheets with the same rows and columns, the one with the original D(Chimp,X)(Mbuti,Outgroup) gives:

Spanish_Extremadura
"Anatolia_Neolithic" 55.9
"Karelia_HG" 11.8
"Loschbour" 11.2
"Kotias" 9.55
"Mozabite" 9.15
"Atayal" 2.4
"Esan_Nigeria" 0
"Hungary_HG" 0
distance=0.009526

With the alternative D(Chimp,Outgroup)(Mbuti,X):

Spanish_Extremadura
"Anatolia_Neolithic" 58.5
"Kotias" 12.55
"Hungary_HG" 11.5
"Karelia_HG" 8.45
"Loschbour" 5.7
"Atayal" 3.3
"Esan_Nigeria" 0
"Mozabite" 0
distance=0.018026

If even using Mozabite I can't get Spanish_Extremadura to take any of it, something has to be wrong. (I tried removing the columns with Denisovan, BedouinB and Druze, but no change).

Tobus said...

@Matt:
D(Chimp,Denisovan)(Mbuti,X), the stats tend to be mildly negative for all Eurasians

Wouldn't that be due to Neanderthal in Eurasians?

Davidski said...

Alberto,

I think the problem is that the puzzle you're trying to solve is too complex for the method, which relies on input based on more or less correct assumptions. For instance, here's a model I got for the Extremadura Spaniards based on some of my assumptions about their ancestry (using the Caucasus HG datasheet).

[1] "distance%=1.5072 / distance=0.015072"

Spanish_Extremadura
"Bell_Beaker_Germany" 61.2
"Iberia_MN" 30.5
"Mozabite" 8.3
"Esan_Nigeria" 0

This looks fine to me. I wouldn't complain about a result like this knowing what I know about Iberian population history, and I reckon I know enough.

Going further back in time is going to be tricky, but maybe doable if we use other tools to try and figure out which reference samples to use. Maybe Anatolia Neolithic is the wrong reference sample for Iberians?

Davidski said...

Another thing to keep in mind is that by using a lot of ancient reference samples, especially those that aren't UDG-treated, you can potentially cancel out minor Sub-Saharan ancestry in the test sample. And when you do that, you also might cancel out North African ancestry.

FrankN said...

@Shaikorth: how do we explain the following stat from Wong et al.
East_Asian French Mbuti Chimp D 0.009 Z 2.17.


On several occasions, a strange signal of West African admix into East/South Asians has popped up. The TreeMix diagrams in Raghavan 2015 "Peopling of the Americas" show a Yoruba->Han migration edge. Another such TreeMix was presented here a few months ago (I don't recall the commenter anymore, Kristiina drew my attention to it). And then look at Fig. S10 of the following, fresh paper:
Contrasting Linguistic and Genetic Origins of the Asian Source Populations of Malagasy
http://www.nature.com/articles/srep26066

There, we see a quite solid migration edge from Yoruba into the French-Brahmin bifurcation. This can't be deamination noise, because the TreeMix is using current populations (including Han, Dai, and Malaysians).
I don't have much of an idea where this signal comes from. The targets change: Han with Raghavan, close to Brahmins in the above-mentioned paper. Possibly both of them just serve as proxies for another population not included in the TreeMix panel. Nevertheless, whatever the origin and target, if such a West African signal is quite regularly picked up by TreeMix, it should also affect D-stats.

Even more strangely, the TreeMix in the above-mentioned paper has the Aeta, negritos from West Luzon/Philippines, branching off from West (French and Brahmins) instead of from East Eurasians (Han, Dai, SEA). The French-Brahmin bifurcation sends an admixture edge into East Indonesians (Alor, Lembata etc.), possibly signifying the same West Eurasian relation. Another edge goes from the Aeta to other Philippine negritos (Ati, Agta), which is not surprising in general, but in the context of the first two it reinforces the impression of a very ancient connection between West Eurasians and Eastern Insular SEA.

In order to understand that impression, I took a look at Aeta and Agta uniparental markers as described in Heyer et al. 2013
http://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=2057&context=humbiol
What stands out there is
(i) an elevated share of yDNA C-M216 (25% with San Juan Aeta), which calls to mind negrito-like La Brana, and the prevalence of yDNA C (CT) among UP Europeans (including Kostenki).

(ii) mtDNA P9/P10 virtually exclusively occurring with Philippine negritos, there accounting for 28% of all mtDNA. P, otherwise primarily found with Papuans and Australian Aborigines, is a sister to West Eurasian U and Near Eastern R0, as well as Far Eastern B and F (Aeta: 39% B, 23% F1a3a), all derived from R (UI, AG3, KO1).

However, while the Aeta/negritos harbour quite some private haplotypes, their uniparental markers in general don't look that alien to SEA. At least not alien enough to place them on the West Eurasian branch.
So - just some strange TreeMix artefact? Or is TreeMix telling a different story, about two or more OOA migrations, one of which (yDNA C, mtDNA R*) reached out to West Eurasia and the Pacific coast, while another one, represented by that Yoruba admix, had a much more East Asian focus? But the treemix signals in both the Raghavan and the Malagasy paper look too strong to be that ancient.
Another possibility, which leaves the Aeta puzzle unsolved, but might at least explain the Yoruba link, is S(E) Asian - W. African trade contacts. Some 4 ky ago, IVC received sorghum, donkeys and a couple of other agricultural items from Africa; W. Africa was cultivating SEA bananas by some 2.5 ky ago at the latest, and other possible imports from SEA include yams and coconuts; the Wolof's sheep breeding and metallurgical terminology closely resembles Dravidian, etc. So, there was definitely contact during BA/IA, and it was direct (by boat around Africa), not just mediated via the Near East. Intensive enough to bring forward such a strong admixture signal? Honestly, I don't know...

Krefter said...

@epoch,

"Could this be the source of the rising Middle-Eastern affinity?"

He or She lived too long ago. El Miron and WHG have more affinity to Middle Easterners than early Europeans, they lived almost 20,000 years after that U6 person.

BTW, it should be worded "Middle Easterners have WHG affinity" not "WHG has Middle Eastern affinity". Middle Easterners can fit as being part WHG but WHG can't be fit as part Middle Eastern. Middle Easterners probably have WHG ancestry.

Davidski said...

@huijbregts

Your results aren't dramatically different. They're basically the same, broadly speaking. It's just that the models are extremely complex, with several reference samples carrying very similar components. So what's happening is that the two different methods come up with different ways to describe the same thing.

Keep in mind that we probably don't have the perfect reference samples for the Kalash yet, and we might never have them. Both sets of results are abstracted reality.

@Frank

TreeMix is software for simple models and for exploring more complex models for further analysis. Its output should not be taken too literally.

Alberto said...

@Davidski

Yes, I see. Well, I understand that there are limitations, and that the easier the problem, the fewer differences there will be between methods. But in this specific case it seems that we've introduced a limitation that didn't exist before (the PCA based datasheet had no problems with Anatolia_Neolithic as a source either), so that looks like a regression.

Of course it's only one case. Thorough testing could reveal that it has more advantages than disadvantages over the previous one. We have to keep an eye on possible regressions and improvements when testing other methods to decide if it's better or worse than the previous one. Otherwise we might go backwards.

Kristiina said...

Frank, TreeMix trees were posted by Matt. I still have them on my computer. In the first set, Mota receives gene flow from ENA base. In the tree with Papuan, Mota receives gene flow from Eurasian base. In the tree with Near Easterners, Mota branch splits in two: in Atayal/ENA branch and in Western Eurasian branch.

If I remember correctly, in the paper on Native Americans (published in August), there were TreeMix trees in which there were separate northern and southern paths across Eurasia. And am I right that in OpenGenome's 3D model there were also two paths across Eurasia, a southern and a northern one?

@Davidski TreeMix's output should not be taken too literally.
I agree that there are usually very many possible interpretations, and we usually pick those we like and discard as errors those we do not like.

I am sure that when we get more and more genomes, the picture becomes more complex and we will see migrations and gene flow here and there and back and forth.

Ryukendo K said...

@ Alberto @ Krefter @ Chad

During the modelling in nMonte did you guys drop any columns?

Ryukendo K said...

@ Matt

Matt, do you mind providing us with the regression equation-derived ghost basal Eurasians 1, 2, and 3 to fit into the following datasheet?

https://drive.google.com/file/d/0B9o3EYTdM8lQak1CRW5jLWZLMTA/view

With the following columns:
,Anatolia_Neolithic2,BedouinB2,Cypriot2,Dai2,Druze2,Han2,Iberia_Chalcolithic,Iraqi_Jew2,Karitiana2,Ket2,LaBrana1,Motala_HG2,Paniya2,Papuan2,Samara_HG,Satsurblia,Sudanese,Ust_Ishim,Yoruba2

In some of my own nMonte both Basal-containing and UP European-containing models do about equally well, and I suspect this is because we've been focusing on those sets without Ust Ishim, and also there is lack of orthogonalisation. Looking at the PCAs of this particular datasheet's column values, the positions of UP Europeans are also not where they should be if they could explain the differential relations of the Neolithics vs the WHGs to the columns. I suspect orthogonalising on this datasheet with the Basal ghosts may cause the two models to pull away from each other.

Krefter said...

@ryuk,
"During the modelling in nMonte did you guys drop any columns?"

Hells yeah. We be dropping those &*()^^%%$$^ columns.

Seriously though, yes I do. You have to sometimes. For example, you can't estimate Basal Eurasian ancestry if Anatolia_Neolithic is in the columns. If you want to estimate WHG ancestry, you need WHG in the columns because they are so drifted. Different tests require different column populations.

Kristiina said...

@Frank, "mtDNA P9/P10 virtually exclusively occurring with Philippine negritos, there accounting for 28% of all mtDNA. P, otherwise primarily found with Papuans and Australian Aborigines, is a sister to West Eurasian U and Near Eastern R0, as well as Far Eastern B and F (Aeta: 39% B, 23% F1a3a), all derived from R (UI, AG3, KO1)."

I noticed that in this new paper on Peştera Muierii-1 mitogenome, there is a new mtDNA tree in Figure 1. In this PM1 phylotree, Fumane, Ust Ishim and Tianyuan are all on a separate branch (R) different from U.

Interestingly, in this tree, P and R0/HV form a clade to the exclusion of B/F and R3/X, while J/T forms a clade to the exclusion of the two above. However, it is odd that Ust Ishim has been said to be pre-B and, in the tree, Ust Ishim, Tianyuan (B) and modern B are all on different branches.

In this tree, G2, which belongs to M, is an outgroup to all U, R and N as in Phylotree.org.
L3, M, N and R are aligned as in Phylotree.org.

Grey said...

FrankN

"At several occasions, a strange signal of West African admix into East/South Asians popped up. The TreeMix diagrams in Raghavan 2015 "Peopling of the Americas" show a Yoruba->Han migration edge."

I found a paper once during random googling showing that iodine in the oceans decreases away from the equator and as iodine mostly comes from the oceans that would make the region north of the Himalayas an iodine desert - relatively speaking.

If iodine is necessary for brain development then it seems likely to me populations in an iodine desert might develop adaptations that make the most of the limited iodine available.

So if humans came out of Africa and spread clockwise and counter-clockwise around the coasts until eventually somehow reaching the region north of the Himalayas then maybe some kind of iodine retaining adaptation would reverse the dna tide - heading back south again both clockwise and counter-clockwise around the Himalayas.

If so genes like EDAR and SLCwhatever would have some kind of iodine boosting effect either in the kids or the mothers.

DNA from the OoA tide that was swamped by the reversal would then still be found in refugia - especially refugia around the coasts if the OoA tide was mostly coastal - and DNA from the reverse tide would be found in Africa.

Alberto said...

I went through the values in both datasheets looking for specific differences that could help in determining if one agrees more with non-double outgroup D-stats. If anyone (David? Chad?) has the chance to run them:

Mbuti Dai Turkmen Papuan
Mbuti Han Turkmen Papuan
Mbuti Druze Germany_MN Anatolia_Neolithic
Mbuti Iberia_Chalcolithic ElMiron
Mbuti Iranian_Jew Greek2 Hungary_CA
Mbuti Karitiana Dai AfontovaGora3
Mbuti LaBrana1 Bell_Beaker_Germany Hungary_BA
Mbuti LaBrana1 Ukrainian_West Hungary_CA
Mbuti LaBrana1 Chechen AfontovaGora3
Mbuti Anatolia_Neolithic Karasuk_outlier MA1
Mbuti BedouinB Karasuk Villabruna
Mbuti BedouinB Okunevo AfontovaGora3
Mbuti Satsurblia Spanish_Pais_Vasco AfontovaGora3
Mbuti Motala_HG Spanish_Aragon MA1
Mbuti Samara_HG Cypriot Itelmen
Mbuti Cypriot Punjabi_Lahore AfontovaGora3
Mbuti Cypriot Germany_MN Hungary_CA
Mbuti Mansi Icelandic Bichon
Mbuti Mansi Spanish_Pais_Vasco Hungary_BA
Mbuti Mansi Uygur MA1
Mbuti Mansi Dai MA1

FrankN said...

@Dave: "TreeMix is a software for simple models and exploring more complex models for further analysis. Its output should not be taken too literally."
I feel this applies to all the methods we have available. Monte Carlo-based clustering may converge on several quite different solutions, depending on the starting point of the iteration. See
https://en.wikipedia.org/wiki/Mandelbrot_set
for further details (any stochastic process based on the method of least squares, or minimising Euclidean distance, produces a quadratic polynomial that will converge in Mandelbrot sets).
And the caveats of D-statistics have been described extensively by rk - we need to be sure that the outgroup is really an outgroup, which, with Neandertal and Denisova admix, and the extent of (Epi-)Paleolithic mobility that has become apparent in the Fu study, can't be taken for granted.

I have already commented on Mbuti in this respect. Let me furthermore add genetic evidence for Paleolithic transatlantic transport of the bottle gourd [I don't believe the "oceanic drift" meme for various reasons I can detail on request, suffice to say here that the bottle gourd prefers sunny, well drained terrain that is rare along the Congo and Amazonas rivers, and there are no wild varieties documented further west than Zimbabwe].
https://pgl.soe.ucsc.edu/PNAS-2014-Kistler-1318678111.pdf
Furthermore, all available studies on dog aDNA point towards the domesticated dog having reached (North) America from West Eurasia, see, e.g.
http://science.sciencemag.org/content/342/6160/871

The question of how West Eurasian (Caucasus, NE, EastMed) mtDNA X2 reached North America at least 9 kya (Kennewick Man) is as yet unsolved. Now we learn that West Eurasian mtDNA U forms a clade with Papuan P (thx for that hint, Kristiina). So, apparently there has been a lot of Paleolithic (and Mesolithic) worldwide mobility that we are only gradually starting to grasp, and which will affect Dstats in strange ways.

What are the consequences:
(a) "Playing around" with all these tools is fine. But mainly in the sense mentioned by you, i.e. for exploring, and possibly noting "strange things" for further analysis. We should be cautious about drawing conclusions too early, especially when dealing with deeper time horizons and intercontinental/global scales.
(b) In more restricted contexts, the tools can nevertheless be helpful for hypothesis testing, just as you have exemplified with your Extremadura example to Alberto.

Davidski said...

rk & Alberto,

https://drive.google.com/file/d/0B9o3EYTdM8lQcld1dnRPRzloeFE/view?usp=sharing

https://drive.google.com/file/d/0B9o3EYTdM8lQNWRIS0VIN0tOc28/view?usp=sharing

Krefter said...

@FrankN,

mtDNA P and U are not sister clades. They descend from R but that's it. They don't have a relationship besides being descended of R.

FrankN said...

@Krefter: You are right. According to that new Peştera Muierii paper, the W Eurasian sister clade to Papuan P is R0/HV. Sloppy reading on my part.

huijbregts said...

@ Frank

Thank you for your survey of the vulnerabilities of explorative methods.
I want to add some details about problems of nMonte, especially in combination with Dstat spreadsheets.
You mention that:
1. Monte-Carlo based clustering may converge on several quite different solutions. That is especially true if the iteration is stopped before convergence is complete.
In nMonte I have set the number of iterations to 1000, which is very generous. But it is also time consuming, so I have given the user an option to choose a smaller number of iterations.
I think it is good practice to repeat runs with important results. A related danger is a local minimum, but so far I have seen no examples of it.
2. "Any stochastic process based on the method of least squares, or minimising Euclidean distance, produces a quadratic polynomial that will converge in Mandelbrot sets"
I would formulate this differently. It is widely known that a PCA is vulnerable to outliers, because they create spurious dimensions.
It is less widely known that outliers are also a danger to least square or euclidean distance methods.
That is because a large value has more influence on a sum of squares than a small value.
So it is important to inspect whether the datasheets contain columns with outliers. In the present datasheets there are several problems:
The column Denisovan is dominated by Neandertal_Altai, Papuan2 is dominated by Papuan and Kalash2 is dominated by Kalash.
All these populations are outliers which feed the Euclidean distances with steroids (and a PCA with extra dimensions).
Also the column Yoruba is a problem. It contains two African pops, but by 'Out of Africa' this dimension is an attractor for rare birds. (Davidski, can you add one or two more Africans?).
A population like Neandertal_Altai (whatever that is) will be an outlier in any sheet; but it will be clustered somewhere.
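
To make the sum-of-squares point concrete, here is a toy calculation in R with made-up values (not taken from any datasheet): two rows that differ slightly on 19 columns and strongly on one outlier column.

```r
# Hypothetical rows: 19 ordinary columns plus one outlier column at the end
row_a <- c(rep(0.40, 19), -0.58)
row_b <- c(rep(0.38, 19),  0.10)

# Squared per-column differences, as used in a Euclidean distance
sq_diff <- (row_a - row_b)^2

# Fraction of the squared distance contributed by the outlier column alone
outlier_share <- sq_diff[20] / sum(sq_diff)
print(round(outlier_share, 3))
```

With these numbers the single outlier column accounts for about 98% of the squared distance, so the other 19 columns barely influence which rows look close to each other.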

IMO the main reason why nMonte runs show widely differing results is that these runs are fed widely differing input sets.
A good example are the Kalash. nMonte runs show a steppe component like Afanasievo and a lot of CHG, about 50/50. Alberto has shown that the amount of CHG is widely varying.
In part this may be caused by differences in the Dstat sheets. But it is mainly dependent on the presence of other South-Asian components in the input set.
If you don't add South-Asians you may get up to 65% CHG. But if you add GujaratiA, only 10% CHG is remaining.
So, another good practice with nMonte: do not use a restricted input set before you have studied the results of the entire set.
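
The two practices above (repeat runs, and watch out for very similar reference populations) can be tried out on a toy problem. The sketch below is not nMonte's code: fit_mixture, its step size of 0.01 and the simulated populations A, B and C are all hypothetical, and the search is a plain random walk over mixture weights that keeps any move that does not increase the Euclidean distance.

```r
# Toy random-search mixture fit (hypothetical, NOT the nMonte algorithm).
fit_mixture <- function(target, sources, iters = 2000, seed = 1) {
  set.seed(seed)
  k <- nrow(sources)
  w <- rep(1 / k, k)                        # start from equal proportions
  names(w) <- rownames(sources)
  # Euclidean distance between the target and the weighted mix of sources
  dist_to <- function(wt) sqrt(sum((target - colSums(wt * sources))^2))
  best <- dist_to(w)
  for (i in seq_len(iters)) {
    from <- sample(k, 1)
    to <- sample(k, 1)
    step <- min(w[from], 0.01)              # never push a weight below zero
    cand <- w
    cand[from] <- cand[from] - step
    cand[to] <- cand[to] + step
    d <- dist_to(cand)
    if (d <= best) { w <- cand; best <- d } # keep moves that do not hurt
  }
  list(weights = w, distance = best)
}

# Simulated sources: B is nearly a copy of A, C is distinct.
set.seed(42)
A <- rnorm(10)
sources <- rbind(A = A, B = A + rnorm(10, sd = 0.01), C = rnorm(10))
target <- 0.6 * sources["A", ] + 0.4 * sources["C", ]

# Repeat the fit with different seeds and compare weights and distances.
runs <- lapply(1:5, function(s) fit_mixture(target, sources, seed = s))
for (r in runs) print(c(round(100 * r$weights, 1), distance = r$distance))
```

When two sources are near-duplicates, as A and B are here, separate runs can split the weight between them differently while ending at similar distances, which is one reason to repeat runs and to look at the full reference set first.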

Davidski said...

The reason GujaratiA eats up the CHG in Kalash is because GujaratiA is mostly CHG.

But GujaratiA is a modern population that in all likelihood did not contribute ancestry to the Kalash, so taking away CHG from the Kalash by using GujaratiA as a reference isn't informative when studying Kalash origins.

So this comes back to what I said before; since we're humans, we can use all sorts of variables including archeology, historical sources, as well as output from other DNA analyses to help the algorithm produce a sensible model. There's nothing wrong with this, and it beats feeding the program a whole set of populations that are very similar, and sometimes unnecessary, that might confound the results.

I'll add more Sub-Saharan Africans to the columns in future datasheets.

Matt said...

@ Ryu, I think these were what you were looking for, calculated using the same multiple regression method based on PC1 and PC2 of those stats plus the new Ust Ishim column:

Anatolia_Neolithic2 BedouinB2 Cypriot2 Dai2 Denisovan Druze2 Han2 Iberia_Chalcolithic Iranian_Jew2 Karitiana2 LaBrana1 Mansi2 Motala_HG2 Munda2 Papuan2 Samara_HG Satsurblia South_Indian2 Ust_Ishim Yoruba
UHG_Sim_1 0.4216 0.384 0.4079 0.3433 -0.5813 0.4017 0.3445 0.4361 0.4001 0.3609 0.4541 0.3877 0.4537 0.3461 0.3083 0.4273 0.3854 0.3595 0.3196 0.0955

Basal_Sim_1 0.4156 0.3927 0.4057 0.319 -0.595 0.4004 0.3191 0.3992 0.3998 0.3168 0.349 0.3436 0.3521 0.3251 0.283 0.3453 0.3782 0.3406 0.2913 0.0975

Basal_Sim_2 0.4133 0.3962 0.4049 0.3093 -0.6004 0.4 0.309 0.3846 0.3996 0.2994 0.3074 0.3261 0.3119 0.3169 0.2729 0.3128 0.3753 0.3331 0.2801 0.0982

On PCAs:

With Ust Ishim column: http://i.imgur.com/a9kf6zn.png

Without Ust Ishim column: http://i.imgur.com/Fh14SCZ.png

Note because the Sims are all calculated on PC1 and PC2 only, they fall at 0 on all higher PCs. There are probably non-zero positions on the higher PCs that would generate "Basal" and "UHG" with better fit to the data
Another note, just looking at them closely the populations I selected for these PCA were defined by some fairly arbitrary cutoffs in statistics that seemed to lens in on West Eurasian PCA populations mostly without complicating South Asians, but the Balochi and Makrani got through as they're statistically quite unusual for South-Central Asia.

Comparing PC with and without Ust Ishim, Ust Ishim as a column doesn't really impact either PC1 or PC2 or PC3, but it does seem to influence a change in what PC4 and PC5 are, bringing out a contrast between El Miron and Kotias and Motala_HG that is less expressed without it (though these PCs are small amounts of variance).

If you did want any other positions, download the PCA images, dot the points you want, then reupload them and I'll try to see if I have time to do those. (Although of course this stuff can get a bit circular, when you're creating "ghost" populations that match your expectations).

huijbregts said...

@ Davidski

My idea was that more Africans in ROWS will make the African column less attractive for non-African outliers.

Alberto said...

David, thanks for the stats, and others (FrankN, huijbregts, Matt...) thanks for the input.

The results don't give a definitive answer (as it usually happens). More are in agreement with the original D(Chimp,X)(Mbuti,Outgroup) than with the alternative (14 vs. 6 if I'm forced to draw a line, but the magnitude varies).

The other observations by looking at the differences in values go in line with all the things pointed out already above. The older genomes are by far the ones that show higher variability (Ust-Ishim, Kostenki14, Vestonice16, ElMiron...). Then it would be MA1 and AG3. From WHGs Villabruna and Bichon show it more than Loschbour and Hungary.KO1. Nothing unexpected.

Two populations that seem to stand out are Hungary_BA and Hungary_CA. They vary as much as the much older MA1 and AG3.

Among the modern populations, clearly Papuan is an outlier, and also Esan_Nigeria (though in this case I'm not sure of the effect it has. I'll look into it more carefully, though probably the suggestion of adding other SSA to the columns seems like a good one).

The last datasheet had many modern populations as columns, so I'll try to see if there's a way to compare how they behave vs. ancient. I'm thinking that one possibility could be to use the higher quality modern populations in the columns so that the ancients that we usually want for the rows don't get the recent gene flow effect and can be used more reliably. Something like instead of using WHG, EHG, CHG and Anatolia_Neolithic as columns, using Basque_Spanish, Lithuanian, Georgian, Sardinian. As long as we don't want to target those modern ones for the models, that is.

Matt said...

Btw, comparing the D(Mbuti,Outgroup)(Chimp,X) with the f3 stats from the Ice Age Europe f3 stats:

http://i.imgur.com/3s4s9qb.png / http://i.imgur.com/mxy6aYm.png (without Dai)

Dendrograms - http://i.imgur.com/kv6rNoC.png

Pretty minor differences. Might be more serious for the ancient Europeans, but I think that Davidski said the marker differences would prevent this comparison (or using f3 for nMonte).

Alberto said...

That's good, so no big problem in using double outgroup stats as the input.

BTW, I tested removing ancient samples from the columns and it's definitely a bad idea. We need those ancient samples as references in the columns.

huijbregts said...

While exploring these data, I learned a few things about the PCA methodology. Maybe some of them are useful for others too.

To recapitulate the theory:
A PCA has optimal results when all the data belong to a single multivariate normal distribution.
Random processes in finite samples are responsible for the phenomenon of outliers. Due to the sum-of-squares calculations these outliers can cause spurious dimensions in the PCA.
They also generate large Euclidean distances. If it is suspected that these outliers are caused by random disturbances, it is common practice to remove them from the dataset.
If the data are not from a multivariate normal distribution, this will also cause extra dimensions in the PCA.

Compare this genomic data:
In PCA-terminology SSA and Papuan are outliers and should be removed. But in a genetic setting this is the very information we are looking for.
In genetics it seems more appropriate to describe the genomes as a MIXTURE of several multivariate normal distributions (clusters). Indeed, none of the Dstat columns does fit a single normal distribution.
But if some of these clusters are widely separated, the PCA-idea of common eigenvectors may become problematic.
I cannot assess the consequences of mixture models. I would say: if you construct a PCA, do not scale and especially do not center (this may remove the average distance between the clusters).
As to nMonte: this tool selects on small Euclidean distances so it should be robust to outliers; but it does need orthogonal dimensions.
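
For anyone who wants to try the "do not scale, do not center" variant, a minimal sketch in R follows. It uses two simulated clusters with made-up means rather than the real sheet. With prcomp, setting both center and scale. to FALSE gives a pure rotation of the data, so the offset from the origin is kept and, when all PCs are retained, row-to-row Euclidean distances are unchanged.

```r
set.seed(2)
# Two hypothetical clusters, both offset from the origin
cl1 <- matrix(rnorm(40, mean = 0.3, sd = 0.02), ncol = 4)
cl2 <- matrix(rnorm(40, mean = 0.4, sd = 0.02), ncol = 4)
x <- rbind(cl1, cl2)

# Usual PCA: centered and scaled
centered <- prcomp(x, center = TRUE, scale. = TRUE)
# Variant suggested above: neither centered nor scaled
uncentered <- prcomp(x, center = FALSE, scale. = FALSE)

# Centered scores average to zero on every PC; uncentered scores do not
mean_centered <- colMeans(centered$x)
mean_uncentered <- colMeans(uncentered$x)

# Largest change in any pairwise distance after the uncentered rotation
max_dist_change <- max(abs(as.vector(dist(x)) - as.vector(dist(uncentered$x))))

print(round(mean_centered, 6))
print(round(mean_uncentered, 3))
print(max_dist_change)
```

Distances only change once the columns are scaled or some PCs are dropped, so those two choices matter more for the nMonte input than the rotation itself.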

huijbregts said...

In my previous comment I claimed that PCA may not be the perfect instrument to illustrate the relations of genomes.
As an alternative I want to take a look at hierarchical clustering.
A priori it can be expected that problems arise when populations get admixed, but let's see how far we can get.

I used the last set of Dstats from Davidski; but I removed the row Neandertal_Altai and the column Denisovan.
I use the R function hclust. The user has to choose between a number of linkage methods.
The default is 'complete'; the alternatives are single, average, median, centroid, mcquitty, ward.D and ward.D2.

At 2 clusters, a SSA cluster is created for Esan_Nigeria and Masai_Kinyawa.
All clustering methods do this except ward.D which forms a N=52 cluster.

At 3 clusters I expected the Papuan to split off.
But most hclust linkage methods just split the SSA cluster.
A new Asian cluster is separated by 'complete linking' (N=10) and ward.D2 (N=47).
In both cases this cluster also contains the Papuan. Both also have the Mozabite.
It is possible that the failure to split off the Papuan is caused by the fact that I had dropped the row Neandertal_Altai and the column Denisovan.
But with the complete datasheet ward.D2 produced a third cluster of Neandertal_Altai and the Asian cluster (N=47).

I have also tried a few non-hierarchical clustering methods.
kmeans(k=3) separates the SSA-cluster (N=2) and an Asian cluster (N=31).
PAM(k=3) separates the SSA-cluster (N=2) and an Asian cluster (N=40).
mclust(k=3) separates the SSA-cluster (N=2) and an Asian cluster (N=52).

So the various clustering functions have no problem finding the African cluster, but cannot find the Papuan cluster at k=3.
The conclusion is that two rows of SSA are sufficient, but the Dstat-sheet badly needs one or two more Australian/East-Asian/Papuan rows.
It is also possible that the sheet contains an overkill of Eurasian which makes it hard to separate the Papuan.
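
The comparison of linkage methods can be reproduced in outline with base R. The sketch below runs on simulated rows, not on the Dstat sheet: a large cluster, a distant pair and one lone row, with made-up names and coordinates. It cuts each hclust tree at k=3 and lists the cluster sizes per method. The 'median' and 'centroid' methods are left out here because they are meant to be used with squared distances.

```r
set.seed(3)
# Hypothetical stand-in for a datasheet: 50 similar rows, a distant pair, one lone row
big  <- matrix(rnorm(50 * 3, mean = 0, sd = 0.05), ncol = 3)
pair <- matrix(rnorm(2 * 3, mean = 1, sd = 0.05), ncol = 3)
lone <- matrix(c(0.5, -0.5, 0.5), nrow = 1)
x <- rbind(big, pair, lone)
rownames(x) <- c(paste0("main", 1:50), "far1", "far2", "lone")

d <- dist(x)  # Euclidean distances between rows
methods <- c("complete", "single", "average", "mcquitty", "ward.D", "ward.D2")

# Cluster sizes at k = 3 for each linkage method, largest first
sizes <- lapply(methods, function(m) {
  cl <- cutree(hclust(d, method = m), k = 3)
  sort(as.vector(table(cl)), decreasing = TRUE)
})
names(sizes) <- methods
print(sizes)
```

Checking at which k a single distinct row first gets its own cluster, and whether that changes when more similar rows are added, is one way to test the idea that the sheet needs more Papuan-like rows.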

anthrospain said...

@Davidski wrote :

" [1] "distance%=1.5072 / distance=0.015072"

Spanish_Extremadura
"Bell_Beaker_Germany" 61.2
"Iberia_MN" 30.5
"Mozabite" 8.3
"Esan_Nigeria" 0" "


David, which datasheet did you use? I thought it was CHG_K10 but couldn't find North Africans in it...