
Friday, December 1, 2017

Descendants of ancient European (fair?) maidens in Central Asia's highlands


Several South Central Asian populations have a reputation for producing individuals who look surprisingly European, even the lighter shade sort of European from Eastern and Northern Europe. This is especially true of the Pamiri Tajiks, and that's unlikely to be a coincidence, because these people probably do harbor a lot of ancient Eastern European ancestry.

My own estimates, using various ancestry modeling methods, suggest that Pamiri Tajiks derive ~50% of their genome-wide genetic ancestry from populations closely related to, and probably derived from, Eneolithic/Early Bronze Age pastoralists from the Pontic-Caspian steppe of Eastern Europe, such as the Sredny Stog and Yamnaya peoples. Below is a simple Admixture graph using the mostly Yamnaya-derived Iron Age Sarmatians from Pokrovka, Russia, in far Eastern Europe, to illustrate the point. Note that Sarmatians were East Iranic-speakers, which is what Pamiri Tajiks are. The relevant graph file is available here.


But, some of you might retort, this is all just statistical smoke and mirrors, and what it really shows is that these so-called Europeans came from Central Asia or even India.

Not so, because my models can't be twisted any which way, and they have strong support from uniparental marker data.

Many South Central Asian groups, and especially Indo-European-speakers, like the Tajiks, show moderate to high frequencies of two Y-chromosome haplogroups typical of Bronze Age Eastern Europeans: R1a-M417 and R1b-M269. This is old news to the regular visitors here and its implications are obvious, so if you still think that these haplogroups expanded from South Central Asia to Eastern Europe, rather than the other way around, then please update yourself (for some pointers, see here and here).

And now, courtesy of Peng et al. 2017, we also have a much better understanding of ancient European influence on the maternal gene pool of Pamiri groups (see here). The paper doesn't specifically cover the topic of European admixture in South Central Asia, but it nevertheless demonstrates it unequivocally.

Below are a couple of phylogenetic trees from the paper featuring a wide range of mitochondrial DNA (mtDNA) sequences shared between Europeans and Central and South Asians; quite a few of these lineages are rooted in Eastern Europe, as shown by both modern-day and ancient DNA, so they strongly imply gene flow, and indeed considerable maternal gene flow, from Eastern Europe deep into Asia.


Worthy of note are the lineages belonging to such relatively young (likely post-Neolithic) haplogroups as U5a1a1, U5a1d2b, U5a2a1, and U5b2a1, all of which have already been found in ancient remains from the Pontic-Caspian steppe.

I'm no longer wondering whether there were massive population movements from Eastern Europe to South Central Asia during the metal ages. It's a given that they happened, and I'm now looking forward to learning about the details from ancient DNA. For instance, what was the ratio of men to women amongst these migrants? And how fair were they exactly?

See also...

Late PIE ground zero now obvious; location of PIE homeland still uncertain, but...

239 comments:

Dilawer (Eurasian DNA) said...

I meant:

1- Lt to med brown
2- Olive to lt brown

Still working on the 36 SNP predictor

Rob said...

@ Dave

You and your disciples make statements on matters you haven't done one minute of research. Instead, you're all just parroting each other's BS in grand scale of conceited ignorance.
If you had, you'd realise the basic tenets, such as the fact that the 'deep steppe', as the eminent Sam attempted to argue, was only moved into during the Yamnaya period. Until then, foragers stuck to the riverine niches, connected to the main bulk of the meta-population further north or west via the rivers, which served as 'arteries'. The main bulk of the population lived in the forest zone, which is why you see early R1a in Karelia, and divergent R1a5 in CCC.
And do you think cattle just wandered onto the steppe by themselves, or do you think the techniques of arsenic-alloyed copper smithing, potentially lethal if done wrong, were just learned by osmosis, or via stolen females ?
And yet your attempt to suggest that I have an axe to grind merely shows that you're a pretty pathetic character. How would a forest vs steppe affect me personally?


Anyhow, when I2a2a1b actually turns up in the Mesolithic deep steppe, then I'll modify my theory.

Davidski said...

@Rob

The evidence points to the steppe being the R1a demographic hub, because Corded Ware was very similar to Yamnaya, and its main lineage was R1a-M417, which has been found in Eneolithic remains from the steppe.

The entire Eastern European forest and forest steppe may have been inhabited sparsely with foragers with R1a, but they didn't have the demographic impact that Corded Ware and other Yamnaya-like groups from the steppe did. This is why R1a-M417 is now the most important R1a lineage, and that's why CHG admixture is found across Northern and Eastern Europe. The lineages from further up north are either extinct today or very rare.

No idea why you would attempt to ignore these facts, and instead favor the forest zone in spite of the evidence, but that's what you're doing.

Ric Hern said...

@ Thanks Davidski

It is certainly an eye opener. Now I wonder if Hittites could have been dominated by I2a2a1b1b ? I guess we will know soon...

Rob said...

@ Dave
I did not suggest that M417 expanded from the forests. It expanded with Yamnaya or antecedent cultures like Dereivka, or what have you, c. 3000 BC.
I am talking about preceding periods. The ecotone of R1a seems to be the forest zone. That's probably the route via which it arrived in E Europe from Asia during the late glacial.

Davidski said...

@Rob

R1a has been found in Mesolithic and Neolithic forager samples on the steppe. There were high forager population densities around the river valleys there. You must know that.

So it's surely not going out on a limb to suggest that R1a-M417 is derived from one of these older R1a steppe lineages.

But it is going out on a limb to suggest that it arrived on the steppe from the forest zone and then expanded from the steppe.

Samuel Andrews said...

@Rob,
"Stop insinuating ulterior motives if you don't understand what I'm saying. "

I have a pretty good idea of what you're saying. No, you have no motive for saying it. I just notice you've been pushing for the theme 'Caucasus and Balkan movement into the Steppe and everywhere else' very consistently. There's nothing wrong with that. The only issue is in the genetic data nobody else is seeing evidence of movement from the Balkans into the Steppe.

Rob said...

@ Sam

That's okay if you can't see it, although it's plainly evident.
I'm just a European who wants to understand the history of my people the way it really happened, instead of a semi-fictionalised rendering. So do bear with me.

jv said...

Davidski,
Thank you. I see my H6a1a is shared equally among some populations East & West of the Pontic Caspian Steppe. However, my H6a1a2 is prevalent in Germany, H6a1a3 Finland, H6a1a5 Ukraine & Russian Federation, H6a1a8 British Isles. As more people test, the subclades are showing regional differences.

Archaelog said...

@Davidski I2a must have entered the steppe gene pool from Europe sometime during the Paleolithic/Mesolithic and persisted in the western steppe regions. That would explain its absence in the IE migrations to the east. Not a single I2a sample to the east of the Pontic steppes, right?

Davidski said...

@Chetan

Yamnaya/Catacomb RISE552 from Kalmykia belongs to I2a2a1b1b, so this marker wasn't restricted to the western parts of the Pontic-Caspian steppe.

It's just that the steppe clans that migrated to West, Central and South Asia had very high frequencies of R1a-Z93 and R1b-Z2103. They probably did come from the eastern end of the Pontic-Caspian, but we don't know yet.

Samuel Andrews said...

@Rob,
"You and your disciples make statements on matters you haven't done one minute of research. Instead, you're all just parroting each other's BS in grand scale of conceited ignorance. "

I think for myself. Think about it....I2a2a1b, U5b2a1a in Mesolithic Ukraine. A migration from the Balkans isn't necessary. This is simple stuff. Your archaeology evidence is weightless! Until someone gets DNA showing without a doubt Balkan farmers went up into the Steppe and gave Yamnaya I2a2a1b I won't be convinced.

Matt said...

@Sein, thanks, though sorry, if you're trying this and it seems strange, I got the upthread wrong. Agh.

Meant to say "2C) ^2 that output distance matrix" (so square, not square root). Really dumb mistake, sorry again.

@Azra, agree with that way of putting things, at least in the PCoA dimensions which capture the variance from Fst matrix, if moving towards WHG means an increase in distance to Balts, then it has to be moving away in some dimensions. Question whether this is real or just an artefact of the matrix and method, but I think it could be real.

So yeah, again, with your explanation, broad proportions of WHG:EHG:CHG:AN may not be as important to explaining modern day Fst though your explanation leads more to Narva->BichonLoschbour type distinctions (that is, higher level dimensions sharply distinctive between Narva and BichonLoschbour), rather than Bronze Age drift, etc.

(Will say that using SHG+WHG instead of WHG didn't change the outcomes in this scenario though, so it can't be anything to do with dimensions contrasting those two in this set...).

Rob said...

@ Sammy

"I think for myself."

You do, you're coming along well. I'd recommend reading up more on archaeology, it'll really supplement your mtDNA work.

"Think about it....I2a2a1b, U5b2a1a in Mesolithic Ukraine."

I've thought about it, but the problem is - there isn't any I2a2 in Mesolithic Ukraine.
As I've pointed out, it first appears in the Mariupol - Sredny Stog horizon, which is Neolithic, c. 5000 BC. So I'm not making stuff up out of thin air.
Nor did I say I2a2a1b is a 'Balkan farmer lineage'; it's the lineage of East-central European natives, who also moved west toward the Dnieper steppe, although one of the I2a2 in Neolithic Ukraine in the latest version of Mathieson looks like straight up EEF. The obvious conclusion is that these people were moving onto the steppe, mixing with other people on the steppe (R1's) and moving back over the course of an individual's life, and cumulatively over hundreds of years. This is what the archaeology shows, and what isotopic studies show, and now aDNA. Pastoralists had set moving terrains, special paths they followed between the settled communities and the steppe.
It'll become clearer and clearer over the next years.

@ Dave

Safe travels.

Happy Season's to all.

Ric Hern said...

Falkenstein I2a2a Mesolithic +-7200 BCE ?

https://www.nature.com/articles/nature17993

Ric Hern said...

A lot of I2a2a in Late Hunter Gatherers of Latvia....

Dilawer (Eurasian DNA) said...

To clarify and expand my previous post regarding Chad's concerns regarding "bad alignment". After clarifying to me what he meant by "bad alignment", namely odd Dstats showing certain pops sharing unexpected drift with others, a more appropriate term would be "reference bias", which is something that I had extensively discussed and analyzed on my website at http://www.eurasiandna.com/2017/10/02/diploid-genotyping-low-medium-coverage-ancient-dna/ tables 2,3,4 and fig 4.

Reference bias affects all sequenced aDNA to a certain degree and is also dependent on the pipeline used to genotype the sequences. For example, I experienced more reference bias using the GATK pipeline than other pipelines better suited for aDNA. The reason is that GATK requires more evidence of variation (minor/alternate alleles), i.e. greater read depths, thus it is best suited for sequences with coverages >20X.

Now, as for what causes the odd and unexpected shared drift with Hg19 that Chad was seeing: it has to do with the fact that the reference genome itself is haploid and comprised of bits and pieces of DNA from 13 ANONYMOUS volunteers. Oddly the Human Reference Consortium does not even divulge the ethnicities of the volunteers. However, my tests indicate that there are relatively large haplotypes from W AFRICANS, and Europeans also tend to be well represented. That is why my GATK processed aDNA showed 2-7 % African admixture. Thus I don't use GATK for diploid processing of anything <20X.

Therefore genomes that share more drift with the 13 ANONYMOUS individuals comprising Hg19, will share more drift with Hg19 and contribute to odd dstats.

The problem is magnified somewhat because aDNA reads containing the REFERENCE allele get mapped better than reads containing alternate alleles. So if a position is homozygous alternate, chances are it will not get mapped at all or perhaps get mapped to the wrong location. The fact that Hg19 is haploid aggravates the problem.

So you can imagine aDNA that shares more reference alleles with Hg19 will have a higher mapping %.
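As a rough illustration of the mapping effect described above, the sketch below computes the reference-allele fraction that would be observed at a truly heterozygous site if reads carrying the reference allele map a little more often than reads carrying the alternate allele. The function name and the mapping rates (0.95 and 0.85) are made-up assumptions for illustration, not measurements from any pipeline.

```python
def observed_ref_fraction(true_ref_fraction, map_rate_ref, map_rate_alt):
    """Fraction of mapped reads that carry the reference allele."""
    mapped_ref = true_ref_fraction * map_rate_ref
    mapped_alt = (1.0 - true_ref_fraction) * map_rate_alt
    return mapped_ref / (mapped_ref + mapped_alt)


# Hypothetical heterozygous site: half the reads truly carry each allele.
unbiased = observed_ref_fraction(0.5, 0.95, 0.95)
biased = observed_ref_fraction(0.5, 0.95, 0.85)   # assumed mapping rates

print(f"equal mapping rates:   {unbiased:.3f}")
print(f"reference maps better: {biased:.3f}")
```

At low coverage, even a modest skew of this kind can tip a heterozygous site towards a homozygous-reference call.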

There really isn't any easy fix, short of :

1- making the Human reference diploid;
2- Having multiple Human references, for ex an African one for African aDNA, an E Asian one for E Asian aDNA, etc
3- Creating mapping software able to work with diploid Human Reference positions. To my knowledge BWA is not well suited for this.

To summarize, certain sequence genotyping pipelines, such as the one I use; ATLAS, can mitigate reference bias a little, and reference bias affects all aDNA.

Matt said...

@Kurd, interesting stuff there, thanks.

@Sein, I think I have some more understanding of why dimensions constructed with Fst+PCoA with Exponent1, run through nMonte, seems to be so effective at modeling populations.

I was looking back through some old threads where Alberto was trying to look into how nMonte worked, to understand how it worked with Global10 data.

In this, he notes that nMonte "calculates the distance as the .... sum of the squared values of the (Euclidean distance) of the residuals", in the context where he's talking about the merits of this vs using the absolute Euclidean distance on Global10 data (complicating factor, first he's scaling by the sqrt of the eigenvectors, which is the correct way to eigenvector scale).
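To make the objective concrete, here is a minimal sketch of fitting mixture proportions by minimising the sum of squared residuals between a target and a weighted average of sources. It is not nMonte's actual code: the search used here is a simple deterministic pairwise-exchange loop swapped in for illustration, and the coordinates, function names and "true" proportions are all made up.

```python
def squared_distance(a, b):
    """Sum of squared residuals between two coordinate vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mixture(weights, sources):
    """Weighted average of the source coordinates."""
    dims = len(sources[0])
    return [sum(w * s[k] for w, s in zip(weights, sources)) for k in range(dims)]


def fit_mixture(target, sources, min_step=1e-6):
    """Find non-negative weights summing to 1 that minimise the squared distance."""
    n = len(sources)
    weights = [1.0 / n] * n
    best = squared_distance(mixture(weights, sources), target)
    step = 0.25
    while step > min_step:
        improved = False
        for i in range(n):          # source losing weight
            for j in range(n):      # source gaining weight
                delta = min(step, weights[i])
                if i == j or delta <= 0.0:
                    continue
                trial = list(weights)
                trial[i] -= delta
                trial[j] += delta
                dist = squared_distance(mixture(trial, sources), target)
                if dist < best:
                    weights, best, improved = trial, dist, True
        if not improved:
            step /= 2.0             # refine once no move of this size helps
    return weights, best


# Hypothetical 3-dimensional coordinates for three source populations.
sources = [[0.10, 0.02, -0.03], [-0.05, 0.08, 0.01], [0.00, -0.04, 0.09]]
true_weights = [0.5, 0.3, 0.2]
target = mixture(true_weights, sources)   # synthetic target with a known answer

weights, dist = fit_mixture(target, sources)
print([round(w, 3) for w in weights], dist)
```

Because the target here is an exact mixture of the sources, the fit should land very close to the proportions used to build it; real data would leave a non-zero residual distance.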

Getting back to the Fst+PCoA with Exponent1 data, I discussed upthread that the absolute Euclidean distances in this, when squared, were equal to real Fsts. In other words, these dimensions summarise the Fst matrix and the squared Euclidean distance in these dimensions basically approximately equals Fst.

Following from this, as absolute Euclidean distance^2 is the distance that nMonte is minimizing(!), this effectively means that when we feed the data output from Fst+PCoA with Exponent1 into nMonte, then nMonte is minimizing the distance of the simulated population from the real population in a space representing real Fst (and even with as little as 10 dimensions, doing so with a high degree of accuracy).
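One way to see why the squared distances line up with Fst is to sketch the PCoA step itself. The code below assumes that "Exponent1" means the input distances are raised to the power 1 (rather than squared) before double-centring, so the input matrix is effectively treated as a matrix of squared distances. That reading is an assumption, as are the function names, and the "Fst" matrix is synthetic (built from random points so that it embeds exactly), not real data.

```python
import numpy as np


def pcoa(dist, exponent=1.0):
    """Principal coordinates from a distance matrix raised to `exponent`."""
    d = np.asarray(dist, dtype=float) ** exponent
    n = d.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centring @ d @ centring          # double-centred matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1]              # largest eigenvalue first
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-10                         # drop null/negative axes
    return vecs[:, keep] * np.sqrt(vals[keep])


def pairwise_sq_dist(x):
    """Matrix of squared Euclidean distances between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(axis=-1)


rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))
toy_fst = pairwise_sq_dist(points)              # synthetic stand-in for an Fst matrix

coords = pcoa(toy_fst, exponent=1.0)
recovered = pairwise_sq_dist(coords)

print("squared distances match input:", np.allclose(recovered, toy_fst))
print("plain distances match input:  ", np.allclose(np.sqrt(recovered), toy_fst))
```

With `exponent=2.0` the same function is the more familiar form of PCoA, in which the plain (unsquared) distances between the coordinates reproduce the input.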

For some summary graphs about how much more squared Euclidean distances on PCoA+Fst correlate with the true Fst, vs against Global10 with eigenvector scaling and without, see: https://imgur.com/a/9XP3M

Scaling Global10 by Eigenvectors very much improves how much the relative distances match to Fst compared to unscaled Global10 (though a scalar would need to be applied to get the squared Euclidean distances to match Fst).
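A minimal sketch of the scaling step, assuming that "eigenvector scaling" here means multiplying each PCA dimension by the square root of its eigenvalue. The coordinates and eigenvalues are made up and only stand in for Global10-style output.

```python
import math


def scale_by_eigenvalues(coords, eigenvalues):
    """Multiply each PCA dimension by the square root of its eigenvalue."""
    roots = [math.sqrt(v) for v in eigenvalues]
    return [[x * r for x, r in zip(row, roots)] for row in coords]


def squared_distance(a, b):
    """Squared Euclidean distance between two coordinate vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical unscaled coordinates for two populations on two PCs.
coords = [[0.02, 0.05], [-0.01, -0.04]]
eigenvalues = [9.0, 1.0]   # made-up: PC1 carries far more variance than PC2

scaled = scale_by_eigenvalues(coords, eigenvalues)
print(squared_distance(*coords), squared_distance(*scaled))
```

Without scaling, the minor dimension contributes most of the distance between these two toy populations; after scaling, the two dimensions contribute equally.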

However, the Fst+PCoA certainly seems to capture Fst more than G10 with or without scaling (not totally surprising!). For whatever it's worth (some questions about value / accuracy of Fst).

(There are definitely things that genotype PCA may be doing better than Fst+PCoA though - 1) you certainly can't place an individual onto the Fst+PCoA plots the way you can with genotype, 2) dimensions in genotype are more likely to be genetically real in the sense that genotype PCA can actually find which individual SNPs load very heavily on a given dimension, and dimensions must load on frequency differences in real variants...)

This all seems like a fairly good argument for at least continuing to try and use the Fst+PCoA method... (and grounds that my concerns that incorrect raw absolute Euclidean distances made Fst+PCoA,Exponent=1 bad data for nMonte were basically not well founded!).

Seinundzeit said...

Matt,

"Following from this, as absolute Euclidean distance^2 is the distance that nMonte is minimizing(!), this effectively means that when we feed the data output from Fst+PCoA with Exponent1 into nMonte, then nMonte is minimizing the distance of the simulated population from the real population in a space representing real Fst (and even with as little as 10 dimensions, doing so with a high degree of accuracy)."

Very interesting; I think that this does explain the power of this method, and I completely concur with you in regard to continued utilization.

In fact, for the fun of it, some models:

Lithuanian

34.65% WHG
32.10% Levant_N
18.00% AG3-MA1
9.50% Iran_N
5.75% Xibo

distance=12.0055

Iranian

44.25% Iran_N
33.00% Levant_N
8.00% AG3-MA1
7.60% Xibo
4.85% WHG
1.20% ASI
1.10% Yoruba

distance=8.7735

Karlani Pashtun, central highlands

43.80% Iran_Neolithic
19.30% AG3-MA1
16.50% Levant_N
9.30% Xibo
7.15% WHG
3.95% ASI

distance=8.8876

Matt said...

@Sein, using just your Lithuanian example plotting the real against Fst estimated via this method: https://i.imgur.com/qHKjsB6.png (using Tu in place of Xibo for your proportions, and a run by myself on similar factors -https://pastebin.com/WZdW3Rir).

Pretty good correlation overall; the proxy is somewhat over-attracted to its components and their relatives, relative to the real population, and under-attracted to real modern West Eurasians.

Some of this is going to be due to the fact that AG3-MA1 is "under drifted" compared to the real components. That's not as bad as using Kostenki in place of WHG, but it's on a similar order, and Fst methods should be more sensitive to this kind of problem, and to using likely real ancestors vs related proxies, than f3 stats etc., which reflect relatively deep rooted divisions and are far less sensitive to drift, perhaps at the cost of being less informative in some ways.

Comparison using more recent components which are a better proxy for real ancestors (Steppe_EMBA, WHG, Barcin_N), which results in improved distance (https://pastebin.com/ZeQL3hdM): https://imgur.com/a/9I3yW.

You can see that this is a bunch more linear, and fits real Lithuanian better. Though some things are slightly off, like real Lithuanians being less related to other ancients and modern South Central Asians than the 60% Steppe_EMBA proxy is, so this combo is not perfect. (Using higher dimension in nMonte, where I used 10 might also help slightly, but using 100+ dimensions in nMonte certainly not practical!).

I would also say though, that I am wary of mixing the ancient and modern in this method, as I think that there may still be some issues in Fsts of ancient attraction to other ancients, and modern attraction to other moderns, so introducing a population which may have contributed no ancestry, but is modern may cause it to pick up some fraction (e.g. East Asian populations in this scenario). We can see this later if we try repeating these models where low fraction East Asian picks up with the Devil's Gate samples. (Ancient-to-ancient attraction in Fst may also be making fits slightly worse than reality!)

No good solutions for this, I think yet. So I'd prefer to stick to modern+modern or ancient+ancient models...

Seinundzeit said...

Matt,

"I would also say though, that I am wary of mixing the ancient and modern in this method, as I think that there may still be some issues in Fsts of ancient attraction to other ancients, and modern attraction to other moderns, so introducing a population which may have contributed no ancestry, but is modern may cause it to pick up some fraction (e.g. East Asian populations in this scenario). We can see this later if we try repeating these models where low fraction East Asian picks up with the Devil's Gate samples. (Ancient-to-ancient attraction in Fst may also be making fits slightly worse than reality!)"

A very reasonable point.

That being said, even though I recognize the somewhat problematic status of mixing ancient and modern with this method, what is your view on obtaining models in nMonte by combining, say, 13 Fst PCoA dimensions with 10 global PCA dimensions (eigenvector scaled)?

When I have time later this weekend, or perhaps next week, I'll post a few examples.

Matt said...

@Sein, hmmm... good question. Though that's something I have tried in the past (with combining recent European based PCA and Global10, for'ex), I'm still unsure if that's more than the average of its parts; do the results from nMonte minimising squared Euclidean distance on the whole set of combined dimensions work out differently, more accurately, from just running both sets of dimensions separately and averaging results?

I think at the least, you'd want them to be on a similar scale, as if either one of the two sets of dimensions that you pooled together was much larger, it would simply dominate the results.
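The scale issue can be sketched as follows: divide each set of dimensions by its own overall spread before concatenating, so that neither set dominates the squared distances. Equalising the root-mean-square spread is just one possible choice, and the function names and coordinates below are hypothetical.

```python
import math


def block_scale(rows):
    """Root-mean-square distance of the rows from their centroid."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(dims)]
    total = sum((r[k] - means[k]) ** 2 for r in rows for k in range(dims))
    return math.sqrt(total / n)


def rescale(rows):
    """Divide a block of dimensions by its overall scale, so block_scale becomes 1."""
    scale = block_scale(rows)
    return [[x / scale for x in r] for r in rows]


def pool(block_a, block_b):
    """Concatenate two rescaled blocks row by row (same populations, same order)."""
    return [a + b for a, b in zip(rescale(block_a), rescale(block_b))]


# Hypothetical coordinates for three populations in two sets of dimensions.
fst_pcoa = [[0.10, 0.02], [-0.05, 0.03], [-0.05, -0.05]]
global_pca = [[12.0, -3.0], [-4.0, 5.0], [-8.0, -2.0]]

print(block_scale(fst_pcoa), block_scale(global_pca))   # very different scales
pooled = pool(fst_pcoa, global_pca)
print(pooled[0])
```

Pooled without rescaling, the second block in this toy example would account for nearly all of the squared distance between any two populations.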

A said...

@Kurd, you should write a paper on this, or at least publish an article online. There seems to be quite a lot of confusion regarding the skin colour of prehistoric populations.

Samuel Andrews said...

@Philipe,

Yeah, maybe Kurd should.

To prevent misconceptions, one must first understand there are few SNPs which can differentiate a typically brownish Iranian to the typically whiteish German. There's a noticeable difference in skin tone between the two but we don't yet know how to differentiate them using DNA. Therefore, anything from brown to white is possible for most ancient Europeans whose DNA has been tested for SNPs related to skin color.

huijbregts said...

@Matt
"do the results from nMonte minimising squared Euclidean distance on the whole set of combined dimensions work out differently, more accurately, from just running both set of dimensions separately and averaging results?"
nMonte is a simple algorithm. In particular it doesn't protect against overfitted solutions. Indeed, complete protection against overfitting is not possible, because the supposedly independent samples share a lot of DNA. But nMonte adds its own amount of overfitting.
Now you are (hesitantly) proposing doing nMonte on a combination of two datasets which are both derived from Davidskis database. Of course these datasets are not independent. This adds a whole new layer of overfitting.
So IMO the answer to your question is obvious: nMonte on the combined datasets is not more accurate, but more overfitted.

Matt said...

As a quick exercise using this Fst+PCoA prediction, I generated predicted ancestry levels of AN, WHG, Steppe for some European populations using nMonte and the Fst+PCoA dimensions: https://i.imgur.com/U9VYzJ1.png

Using these, I then produced Fst estimates, which would be "Fst assuming all populations are just admixtures of unstructured AN, WHG and Steppe populations and no drift happens after that", and compared them to real Fst to produce correlations and residual relatedness: https://i.imgur.com/aw98iEj.png
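One reading of this procedure, sketched with made-up numbers: place each population at the weighted average of the source coordinates in the Fst+PCoA space, take the squared Euclidean distance between two such points as the predicted Fst, and subtract it from the real Fst to get a residual. The coordinates, ancestry proportions and "real" Fst below are all hypothetical.

```python
def mixture(weights, sources):
    """Weighted average of the source coordinates."""
    dims = len(sources[0])
    return [sum(w * s[k] for w, s in zip(weights, sources)) for k in range(dims)]


def squared_distance(a, b):
    """Squared Euclidean distance between two coordinate vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical 2-D coordinates for the three source populations.
sources = {"AN": [0.00, 0.10], "WHG": [0.12, 0.00], "Steppe": [-0.08, -0.06]}
order = ["AN", "WHG", "Steppe"]
coords = [sources[name] for name in order]

# Made-up ancestry proportions, listed in the same order as `order`.
pop_a = [0.50, 0.10, 0.40]
pop_b = [0.30, 0.20, 0.50]

predicted_fst = squared_distance(mixture(pop_a, coords), mixture(pop_b, coords))
real_fst = 0.004                      # made-up "observed" value
residual = real_fst - predicted_fst   # distance not explained by the proportions

print(predicted_fst, residual)
```

A positive residual corresponds to a pair of populations that are further apart in reality than their ancient proportions alone would predict.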

This is sort of an explicit test of how much of present-day fine-scale variation between populations is based on ancient variation.

So

1) real tends to show a correlation with predicted (so ancient proportions are somewhat predictive)
2) real is always higher than the predicted (more distance than expected based on ancient proportions)
3) residuals tend to show logical spatial relationships - for instance Basque residual distance follows the order Lithuanian>Sardinian>Polish>Bulgarian>Norwegian>Irish>English>Spanish, while Bulgarian residual distance tends to follow Basque>Sardinian>Irish>Norwegian>English>Spanish>Polish (post / pre BA structure is logically regional).

(As well, as a thought exercise bearing in mind the relatively low Fsts shown between the predicted samples, I multiplied the Predicted values by 3.9, assuming the predictions were in the right direction but wrong in magnitude, then repeated again: https://i.imgur.com/YMtXa2s.png

These show different patterns, where the most negative residuals for populations which are most related in reality compared to prediction now tend to be Sardinians and NE European populations, and relatively high residuals Fsts are present between Spanish and Basque French. There is a way that we could explain that outcome, assuming that margin European populations have experienced less gene flow so are in common more related compared to central populations that have had extra African / Greater ME geneflow... but I think the initial unadjusted prediction residuals make more sense.).

Be nice in future when labs can test this directly by simulating populations from ancient samples as well...

(PCoA plots comparing the predicted and real fst also here: https://imgur.com/a/XISrx)

Matt said...

@huijbregts, there is a conflict here between underfitting and overfitting, I suppose. I am wary of models that are too underfit, which Global10 alone possibly may be for fine scale structure. I think I am perhaps less wary about overfitting in this context than you are (huge sample size of populations here, each population fairly good n of individuals, each individual sampled at many independent loci).

Ryan said...

New and good paper on Ireland that folks may be interested in:
https://www.nature.com/articles/s41598-017-17124-4

@David - "It's just that the steppe clans that migrated to West, Central and South Asia had very high frequencies of R1a-Z93 and R1b-Z2103. They probably did come from the eastern end of the Pontic-Caspian, but we don't know yet."

FWIW as far as I've been able to find out R1b in South Asia is exclusively Z2103. Are there even R1b-M269+ and R1b-Z2103- samples in Central Asia?

Ebizur said...

Ryan wrote,

"FWIW as far as I've been able to find out R1b in South Asia is exclusively Z2103. Are there even R1b-M269+ and R1b-Z2103- samples in Central Asia?"

Di Cristofaro et al. (2013) have reported finding R1b-M269(xL23) in 1/77 Hazaras (1/69 Hazara in Bamiyan, Afghanistan), 1/160 Mongolians (1/23 Mongol in SE Mongolia), and 1/186 Iranians (1/42 Esfahan). They also have reported finding R1b-U106 in 1/186 Iranians (1/27 Gilan) and R1b-U152 in 1/186 Iranians (1/20 Khorasan).

However, it appears that the majority of Asian members of haplogroup R1b who have been tested for markers of downstream subclades do belong to R1b-Z2103.

ERR445312 from India on the YFull tree belongs to R-Z2103 > R-Y4364 > R-M12135. His TMRCA with the four other members of this clade who are currently tabulated on YFull, all of whom are Armenian, is estimated to be 3,000 [95% CI 2,400 <-> 3,700] ybp.

NA20866 from Gujarat, India has been classified as R-Z2103 > R-Z2106 > R-Z2108*. The TMRCA of the entire R-Z2108 clade is estimated to be 5,700 [95% CI 5,000 <-> 6,400] ybp.

ERR445322 and ERR445298 from India are currently the only members of a clade tabulated on the YFull tree as R-Z2103 > R-Z2106 > R-Z2108 > R-Y14415 > R-Y35099. Their TMRCA in R-Y14415 with the members of R-Y14512, a clade that is currently represented on YFull by several Americans and other individuals who have reported origins in Sweden, Hungary, Germany, and Scotland, is estimated to be 5,000 [95% CI 3,900 <-> 6,200] ybp.

Both NA18645 of the CHB sample and bhu-1953 from Bhutan (cf. Hallast et al. 2014) belong to R-Z2103 > R-Z2106 > R-CTS8966.

Seinundzeit said...

Matt,

I finally have some free time...

"I think at the least, you'd want them to be on a similar scale, as if either one of the two sets of dimensions that you pooled together was much larger, it would simply dominate the results."

That makes sense.

For what it's worth, I combined a D-stat nMonte data-sheet with 13 Fst PCoA and 10 global PCA dimensions.

In technical terms, very problematic; but I'm just experimenting for the pure fun of it (I have no methodological goals).

Quite honestly, combining the D-stat data and PCA data with the Fst-based output wasn't really worth the effort; not much bang for the buck.

For example:

Lithuanian

52.80% Corded_Ware
39.00% Bell_Beaker
7.30% WHG
0.85% Xibo
0.05% Vietnamese

"distance%=5.7041 / distance=0.057041"

Kinda "cleaner" than just the Fst PCoA-based model, which is good, but it's still virtually identical.

Looking at other West Eurasians:

Iranian_Mazandarani

61.20% Iran_Chalcolithic + 5.30% CHG + 0.95% Levant_Neolithic
23.50% Corded_Ware + 1.75% Bell_Beaker
4.60% ASI
2.35% Xibo
0.35% Yoruba

"distance%=4.7254 / distance=0.047254"

Kalash

39.95% Steppe_EMBA + 6.75% Scythian_East
15.20% Iran_Chalcolithic + 15.20% Iran_Neolithic + 10.20% CHG
12.70% ASI

"distance%=5.2237 / distance=0.052237"

Sensible stuff, but not really worth the effort.

That being said, I should probably give D-stats-based PCoA a spin.

Matt said...

@Sein: That being said, I should probably give D-stats-based PCoA a spin.

I think the problem I could see with that is that the D-stats wouldn't really form a distance / similarity matrix in the same way, at least from the datasheets we've got.

To use the user defined distance or similarity function in the PCoA, you really do need a matrix where every row has a corresponding column, and every population has 0 with itself (if distance) or 1 (if similarity).
PCoA is really just a technique we're using with the Fst distance matrix because it *is* a matrix of distances after all, which we want to represent in a dimensional form, and distance matrices (apparently) don't work well with the dimensional reduction assumptions of PCA.
So I don't think there'd be a gain from running it through PCoA rather than PCA as we have been.
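The matrix requirement for the distance case can be written as a quick check. This is only a sketch with a hypothetical function name and made-up values; it tests for a square, symmetric matrix with zeros on the diagonal.

```python
def is_distance_matrix(matrix, tol=1e-9):
    """True if the matrix is square, symmetric and has a zero diagonal."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False                      # every row needs a matching column
    for i in range(n):
        if abs(matrix[i][i]) > tol:
            return False                  # each population is at distance 0 from itself
        for j in range(i + 1, n):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                return False
    return True


# Made-up examples: a pairwise Fst-style matrix and a rectangular D-stat style sheet.
fst_like = [[0.0, 0.02, 0.05], [0.02, 0.0, 0.04], [0.05, 0.04, 0.0]]
dstat_sheet = [[0.01, 0.03, -0.02], [0.00, 0.02, 0.04]]

print(is_distance_matrix(fst_like), is_distance_matrix(dstat_sheet))
```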

I suppose in theory you could get a matrix of outgroup D-stat row and column, as a similarity matrix... the only problem I could see is that it is not so clear that the values on the diagonal would be 1. E.g., D(Mbuti,Anatolia_N;Mbuti,Anatolia_N), which would represent a diagonal where row and column are Anatolia_N, and I think in theory would be equivalent to f3(Mbuti,Anatolia_N,Anatolia_N), would not be 1 for any pairs of samples in Anatolia_N.

Arza said...

https://github.com/DReichLab/AdmixTools/blob/master/README.Dstatistics
New program: qpDpart; implements partitioned D-stats
(Eaton and Ree. Inferring phylogeny and introgression .. Systematic Biology (2013) (Vol 62:5) pp 689-706)


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3739883/
Materials and Methods > Tests for Introgression
Partitioned D-statistic test. The D-statistic as described above does not take full advantage of the information available from incongruent allele patterns in multiple taxa. Importantly, it detects only whether alleles from one lineage occur excessively in another lineage, but does not distinguish whether this stems from direct gene flow from the lineage in question, or gene flow from a close relative. This distinction becomes increasingly important when the D-statistic is applied at deeper or broader phylogenetic scales with redundant sampling of taxa.

Directionality of gene flow. While the four-taxon D-statistic cannot distinguish the directionality of gene flow, that is, whether it occurred from P2 into P3, from P3 into P2, or in both directions, the partitioned D-statistic can infer directionality through its measurement of introgression of shared ancestral alleles, D12. For example, if gene flow occurred from P31 into P2, then derived P3 alleles which arose in the ancestor of P31 and P32, and are thus shared by both taxa, will also appear in P2. In contrast, if gene flow occurred only in the opposite direction, from P2 into P31, P2 will not contain alleles that are shared by the two P3 taxa, and thus the partitioned test would find a non-significant D12. In this way, D12 acts as an indicator, showing whether introgression occurred from the P3 lineage into P2, versus whether the signal is caused by gene flow in the opposite direction. Contrast this with the four-taxon test, where a significant D for tests (P1, P2, P31, O), (P1, P2, P32, O), or (P31, P32, P2, O) would all indicate introgression, but fail to distinguish that only P31 and not P32 introgressed into P2 (which D1 vs. D2 would indicate), or that introgression occurred in only one direction (which D12 indicates).
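For reference, the ordinary four-taxon D-statistic that the partitioned test extends can be sketched from site-pattern counts. This is the basic ABBA-BABA form, not the partitioned version implemented in qpDpart, and the site patterns below are made up. "A" is the allele carried by the outgroup and "B" the derived allele.

```python
def d_statistic(sites):
    """Four-taxon D from biallelic site patterns given as (P1, P2, P3, O) strings."""
    abba = sum(1 for p1, p2, p3, o in sites if p1 == o and p2 == p3 and p1 != p2)
    baba = sum(1 for p1, p2, p3, o in sites if p2 == o and p1 == p3 and p1 != p2)
    if abba + baba == 0:
        return 0.0                       # no informative sites
    return (abba - baba) / (abba + baba)


# Made-up site patterns: 6 ABBA, 2 BABA and 10 uninformative sites.
sites = ["ABBA"] * 6 + ["BABA"] * 2 + ["AABA"] * 10

d = d_statistic(sites)
print(d)
```

Under this sign convention a positive D means P2 shares more derived alleles with P3 than P1 does, but, as the quoted passage notes, the value alone does not say in which direction the gene flow went.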

Seinundzeit said...

Matt,

"To use the user defined distance or similarity function in the PCoA, you really do need a matrix where every row has a corresponding column, and every population has 0 with itself (if distance) or 1 (if similarity)."

For what it's worth, Euclidean distance does pretty well.

But yeah, PCA makes more sense.

I'm experimenting with a few things; so far, so good.

If things work out in every single case, I'll post the results.

Suevi said...

Genetic landscapes reveal how human genetic diversity aligns with geography - https://www.biorxiv.org/content/early/2017/12/13/233486

Ryukendo K said...

Matt,

Your discovery of the PCoA method recovering fst distances very precisely is really great! Seems like this method allows for sensitivity to recent drift in admixture calculation; the only issue is that drift in fst space should not be expected to occur in Euclidean ways?

Either way more mathematical exploration of this is warranted!

Matt said...

@Ryukendo, though note that the absolute euclidean distance in the dimensions generated by PCoA does *not* capture Fst distances so well - it's the squared euclidean distance which does. This also appears to be the case when comparing distances in eigenvector scaled Global10 to Fst - the absolute euclidean distances in Global10 space don't have much to do with pairwise Fst, while squared euclidean distances do.

But yeah, it really surprised me that applying this technique of PCoA to the Fst matrix would produce fairly coherent dimensions that recapitulate the same genetic shape as genotype Fst. It would be cool to see someone mathematically explore / explain why that is the case and why squared euclidean distances on these PCoA and on PCA seem to present a much better fit to Fst generally (though I'm pretty unfit to do that, even if it would be fairly obvious to someone with a decent grasp of mathematical geography and how Fst is computed) .

(What Huijbregts says is true about this method though; with very high dimension this PCoA is producing a very close fit to the Fst, which would include any errors in the Fst scores. If you do it and look at some of the high dimensions that load on single, very drifted populations, then there's obviously some contribution of very marginal error or rounding differences to those.

So there's definitely a tradeoff between capturing some of the very fine structure Fsts which are nonetheless decisive in very low Fst regions (like Europe, perhaps between Tibeto-Burman speaking groups, etc.) and almost certainly real, vs error which is about the same order of magnitude once all the main dimensions are considered... If you're using PCA / PCoA as a sort of noise reduction technique, then high dimension will not be useful, but at the same time because distances are so close and structure so fine between some populations, there's the risk of removing real signal with noise by simply removing higher dimensions).
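To make the squared-distance point concrete, here is a rough Python sketch of classical PCoA in which the input matrix is treated as a matrix of squared distances. This is an assumption on my part about why squared euclidean distances in the output line up with the input; it is not a description of what any particular PCoA program does with its exponent setting. The function names and the toy matrix are invented, and the toy matrix is built from random points rather than real Fst values.

```python
import numpy as np


def pcoa_from_squared(d2):
    """Classical PCoA, treating d2 as a matrix of squared distances."""
    d2 = np.asarray(d2, dtype=float)
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering  # Gower double-centering
    evals, evecs = np.linalg.eigh(gram)
    order = np.argsort(evals)[::-1]  # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    # Keep only clearly positive eigenvalues; real Fst matrices may also
    # yield small negative ones, which this sketch simply drops.
    keep = evals > 1e-9 * evals.max()
    coords = evecs[:, keep] * np.sqrt(evals[keep])
    return coords, evals


def squared_euclidean(coords):
    """All pairwise squared euclidean distances between rows of coords."""
    diff = coords[:, None, :] - coords[None, :, :]
    return (diff ** 2).sum(axis=-1)


# Toy stand-in for an Fst matrix: squared distances among random points.
rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))
toy = squared_euclidean(points)

coords, evals = pcoa_from_squared(toy)
recovered = squared_euclidean(coords)
max_error = np.abs(recovered - toy).max()
print(coords.shape, max_error)
```

Truncating `coords` to the first few columns is the noise-reduction tradeoff discussed above: fewer dimensions means a looser fit to the input matrix, including both its errors and its fine structure.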

@Sein, yeah, true actually, with Exponent=1 that (D-stats data, euclidean distance measure) does look different to just running through PCA. May give you something more useful to run through nMonte if that provides a set of dimensions which better fits (e.g. squared euclidean distance, not absolute euclidean distance).

Ryukendo K said...

@ Matt

Yeah, you're right, it's not linear. What I meant is that there really shouldn't be any obvious reason why mathematical relationships under addition should be preserved at all, fst being such a mathematically intractable measure.

Matt said...

Hmm... "mathematically intractable"; certainly, if we treat it as linear, a la "Distance for D (which is 0.5 A 0.5 B) from C = 0.5 A-C + 0.5 B-C". Fst does not behave that way; just empirically, the differentiation of, say, D (Europe_LNBA) from C (Yoruba) does not equal 0.5 A (Steppe_EMBA) and 0.5 B (Europe_MNChl), and neither do other similar cases like, approximately, D (Uyghur), A (Tajik), B (Han_N_China). In both cases the distance D-C is not intermediate between A-C and B-C but less than that.

I don't know if it's necessarily mathematically intractable, if there are specific algorithms to derive distances for intermediate points from a matrix of squared euclidean distances and then we use those, without first transforming through PCoA into points in a set of n linear dimensions. (Even if it's realistically actually more computationally *practical* and tractable in practice to use PCoA transformation first!).

Though, again, I don't know if those methods exist! (Is there even a way that you can do this for a matrix of squared euclidean distances without PCoA transformation?).

(I mean, thinking contextually, it's a distance, so why would we even expect it to be tractable under the former assumptions of linearity anyway? Does any set of absolute euclidean / squared euclidean distances, or any kind of distances, behave that way?).
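For true squared euclidean distances, at least, there is a standard identity that gives the distance of a mixed point from a third point directly from the pairwise matrix, with no PCoA step. If D = w A + (1 - w) B, then d2(D, C) = w d2(A, C) + (1 - w) d2(B, C) - w (1 - w) d2(A, B). The Python sketch below just checks that numerically on invented coordinates; whether Fst behaves closely enough like a squared euclidean distance for this to apply is an assumption, not something shown here. The function names are made up.

```python
def sq_dist(x, y):
    """Squared euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mixture_sq_dist(d_ac, d_bc, d_ab, w):
    """Squared distance from C to the mixture w*A + (1-w)*B,
    using only the three pairwise squared distances."""
    return w * d_ac + (1 - w) * d_bc - w * (1 - w) * d_ab


# Invented toy points; D is a 50/50 mixture of A and B.
A = (0.0, 0.0)
B = (4.0, 0.0)
C = (2.0, 3.0)
w = 0.5
D = tuple(w * a + (1 - w) * b for a, b in zip(A, B))

direct = sq_dist(D, C)
from_matrix = mixture_sq_dist(sq_dist(A, C), sq_dist(B, C), sq_dist(A, B), w)
linear_guess = w * sq_dist(A, C) + (1 - w) * sq_dist(B, C)
print(direct, from_matrix, linear_guess)
```

The subtracted term w (1 - w) d2(A, B) is never negative, so the mixed point always comes out no farther from C than the simple weighted average would suggest, which is at least consistent with the D-C pattern described above.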

This is just my guess and I probably know less math than you (just never been able to get the concentration and interest and correct perspective to go beyond what I learned in secondary school).

(In a perfect world there are probably some smart Harvardian math guys and girls lurking in this thread who could answer this instantly ;) Should be low hanging fruit for them.).

Seinundzeit said...

Matt,

"May give you something more useful to run through nMonte if that provides a set of dimensions which better fits (e.g. squared euclidean distance, not absolute euclidean distance)."

I'll be honest; I've been experimenting with different techniques (a whole bunch of them), and it seems that out of all of them, the Fst-based PCoA method is the most effective tool for inferring admixture proportions.

So for now, your method will constitute my preferred tool for admixture calculation.
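As a rough illustration of how admixture proportions can be read off a set of coordinates, here is a minimal two-source Python sketch. It is not nMonte and makes no claim about how nMonte works internally; it simply finds the weight w in [0, 1] that minimizes the squared euclidean distance between a target and w * A + (1 - w) * B. The function name and the coordinates are invented for the example.

```python
def two_source_proportion(target, src_a, src_b):
    """Least-squares weight of src_a in a two-way mixture that
    approximates target, clipped to the range [0, 1]."""
    ab = [a - b for a, b in zip(src_a, src_b)]
    tb = [t - b for t, b in zip(target, src_b)]
    denom = sum(x * x for x in ab)
    if denom == 0:
        raise ValueError("the two sources are identical")
    w = sum(x * y for x, y in zip(tb, ab)) / denom
    return min(1.0, max(0.0, w))


# Invented coordinates in a three-dimensional space.
source_a = (1.0, 0.0, 0.0)
source_b = (0.0, 1.0, 0.0)
target = (0.3, 0.7, 0.05)

w_a = two_source_proportion(target, source_a, source_b)
print(w_a, 1 - w_a)
```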

(To reiterate, your discovery of this technique was pure brilliance; we owe you!)
