Monday, August 24, 2015

Smarter than the average bear


Using a few ancient hunter-gatherer sequences, formal statistics, and enough present-day samples, I can predict with basically 100% accuracy whether an ethnic group is of European or extra-European origin. Actually, there's probably an infinite number of ways of doing the same thing nowadays, but I thought this was an effective way of visualizing it. The datasheet can be downloaded here.
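For readers unfamiliar with formal statistics, here is a minimal sketch of a frequency-based D-statistic of the form D(W,X)(Y,Z), the shape of the stats discussed in the comments, e.g. D(Ju_hoan_North,MA1)(BedouinB,Test). It is not the code used to produce the datasheet. The function name and the toy allele frequencies are assumptions made up for illustration, and sign conventions can differ between tools. A real analysis would also need standard errors (e.g. from a block jackknife), which are omitted here.

```python
def d_statistic(w, x, y, z):
    """Frequency-based D(W, X)(Y, Z).

    Each argument is a list of allele frequencies (0..1) for one
    population, with all four lists covering the same SNPs in the same order.
    With this sign convention, a positive value means Z shares more
    alleles with X than Y does.
    """
    num = 0.0
    den = 0.0
    for wi, xi, yi, zi in zip(w, x, y, z):
        num += (wi - xi) * (yi - zi)
        den += (wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / den


# Toy allele frequencies, invented purely for illustration.
outgroup = [0.0, 0.0, 1.0, 0.0]   # stands in for W (e.g. an African outgroup)
ancient = [1.0, 0.0, 0.0, 1.0]    # stands in for X (e.g. an ancient genome)
reference = [0.2, 0.6, 0.8, 0.4]  # stands in for Y (the baseline population)
test_pop = [0.7, 0.3, 0.4, 0.9]   # stands in for Z (the population being tested)

d_test = d_statistic(outgroup, ancient, reference, test_pop)
d_self = d_statistic(outgroup, ancient, reference, reference)
print(f"D for test population: {d_test:.4f}")
print(f"D for reference vs itself: {d_self:.4f}")
```

Note that when the test population is the reference itself, every numerator term is zero, so the statistic is exactly 0. That is why the baseline population can always be placed at the origin of a plot like this one.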


I also tried to analyze the Indo-European expansion in a similar way. The results are a lot less obvious, and there are a number of reasons for this. One of the main factors, I'd say, is that languages can be learned or imposed very quickly, and this happened often during historic times, well after the Proto-Indo-European dispersals.

For instance, Sardinians spoke Paleo-Sardinian or Nuragic languages until they adopted Indo-European speech, in the form of Latin, from the Romans (see page 118 here). Indeed, it's highly unlikely that any Proto-Indo-Europeans ever set foot on Sardinia. The relevant datasheet is here.


In this analysis I used samples from the Allentoft et al., Haak et al. and Lazaridis et al. datasets, all of which are publicly available. The latter two are found at the Reich Lab site here.

See also...

Pre and Post-Kurgan Europe

24 comments:

Alberto said...

No surprise that WHG is the clear European marker and can be used to separate Europeans from other populations. But regarding the language, yes, there's never going to be perfect correlation between genes and language for many reasons that you outlined above. However, I wonder if using something like Tajik instead of MA1 would give a significantly better correlation.

Davidski said...

Well, using Tajik Shugnans doesn't make things better. It makes things worse. Here's the datasheet.

https://drive.google.com/file/d/0B9o3EYTdM8lQVTZqNUNuc1c5VHM/view?usp=sharing

It's an interesting question why it makes things worse, and I suspect it's because Indo-European languages were initially spread by groups during the Bronze Age with very high levels of ANE, higher than Tajiks, but sometimes (or often?) different proportions of other components. And, unlike modern Tajiks, I don't think these groups had any significant levels of ENA.

truth said...

I can't see the images.

Alberto said...

Thanks David.

Yes, it doesn't look any better, it looks worse. But what I find a bit surprising is the low affinity of Tajiks to Indian populations.

D(Ju_hoan_North,MA1)(BedouinB,Druze) D=0.0248
D(Ju_hoan_North,Tajik_Shugnan)(BedouinB,Druze) D=0.0243

D(Ju_hoan_North,MA1)(BedouinB,GujaratiD) D=0.0377
D(Ju_hoan_North,Tajik_Shugnan)(BedouinB,GujaratiD) D=0.0219

I expected the opposite effect. If both Tajik and GujaratiD have ENA, the affinity should increase relative to BedouinB and compared with Druze.

I didn't suggest something like Kalash because I thought they have too much ASI, but it seems that it's Tajiks that have too little to work better than MA1.
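To make the comparison above concrete, here is a small worked calculation using only the four D values quoted in this comment. The variable names are arbitrary, and no standard errors were given, so the differences say nothing about statistical significance.

```python
# D(Ju_hoan_North, X)(BedouinB, Test) values quoted in the comment above,
# keyed by (X, Test).
d_stats = {
    ("MA1", "Druze"): 0.0248,
    ("Tajik_Shugnan", "Druze"): 0.0243,
    ("MA1", "GujaratiD"): 0.0377,
    ("Tajik_Shugnan", "GujaratiD"): 0.0219,
}

# How much the stat changes when MA1 is swapped for Tajik_Shugnan.
diffs = {}
for test in ("Druze", "GujaratiD"):
    diffs[test] = d_stats[("Tajik_Shugnan", test)] - d_stats[("MA1", test)]
    print(f"{test}: Tajik_Shugnan minus MA1 = {diffs[test]:+.4f}")
```

Swapping MA1 for Tajik_Shugnan barely moves the Druze stat (a change of 0.0005) but lowers the GujaratiD stat by 0.0158, which is the drop being described.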

Davidski said...

Not sure what's happening with the images? Have a look now.

I'm getting weird links from Google today. I'll try again tomorrow.

John Thomas said...

So, do Indians (and Pakistanis) have any appreciable European ancestry?
Is this correlated at all with caste?
And what about south India?

- and finally, is this the 'smoking gun', or can we expect decades and hundreds of postings' worth of quibbles by the usual suspects?

Alberto said...

I definitely think that D-stats go crazy when one population has BEA and the other (from the other side) has ASI. I think that an IBS sharing analysis of Tajiks with other populations would give quite different results from these ones:

https://docs.google.com/spreadsheets/d/1wLz53qVlZcVgPgYgkhzfzvzIuPdYT5nwVs_ofa3X6FU/edit?usp=sharing

Matt said...

The colour gradient makes it a very pretty / elegant looking stat demonstration. You could also add BedouinB as 0 on both stats, since BedouinB cannot by definition have any variance from itself (for an example - http://i.imgur.com/fsoh7w3.png).

The Spanish samples in this view do seem quite strikingly more WHG-like, particularly comparing Spain_EN to LBK_EN, but perhaps also Spain_MN to Germany_MN. I wonder if graphing SHG vs MA1 or EHG vs MA1, or PCing the stats MA1, WHG, SHG, EHG all together would provide a slightly different pattern. Bell Beaker and Sintashta really look intermediate between Spain_MN and Yamnaya in this view, or alternately like 60:40 Corded_Ware and Spain_EN, while modern North-Central Europeans generally look more WHG-like (possibly because of some minority ancestry from "hidden" farmers with higher WHG fractions - the Gokhem farmer tends to be more WHG by D-stats, I think, than the Spain_MN or Germany_MN). Corded Ware itself deviates in having what looks like more HG ancestry than would fit an intermediate between Yamnaya and any MN population in Germany or Spain.

Matt said...

Graphing the Tajik_Shugnan stat against WHG is good for breaking out the MN Neolithic Europeans (Remedello, Spain_MN, Baalberge MN) from populations close to them. This is because they are both relatively high in WHG and low in the Neolithic "West Asian" / teal that makes up part of the ancestry of Tajik_Shugnan (and possibly something to do with ENA), while the MN are also relatively high in the divergently evolving (drifting) European descendant of early farmer ancestry. It gives perhaps more of a distinction than MA1 vs WHG, in terms of separating these guys out from the rest of the pack.

See - http://i.imgur.com/fCnVBhK.png

OTOH, it's not so good for breaking apart the LNBA from modern Europeans, nor Indo-European West Asians from others, as these pairs are more similar in their closeness to Tajik_Shugnan than they are in their similarity to MA1. The similarity to Tajik_Shugnan is flatter and more compressed in the sample set, more uniform, so it doesn't form a good measure for distinguishing between samples. Separation comes from ancestry that varies strongly between samples, totally irrespective of how much or how little of their ancestry it makes up (proviso: it's relatively unlikely that something which makes up a lot of ancestry will have little variation between populations, unless it spread *really* uniformly).

Davidski said...

Matt,

Here's the datasheet with EHG and SHG added.

https://drive.google.com/file/d/0B9o3EYTdM8lQTmFWUnRUeWlKX2M/view?usp=sharing

It produces a very nice PCA which looks a lot like my genotype-based PCA.

https://drive.google.com/file/d/0B9o3EYTdM8lQb2d0Y1NwWnZxMUE/view?usp=sharing

Mickey,

The plots show that Europeans, West Asians and South Central Asians form clines that stretch out towards the Bronze Age steppe. This suggests that there were migrations from the steppe into these regions, although we really need ancient DNA from Asia to figure out who was moving there and when.

Krefter said...

@Davidski,

"Smarter than the average bear"?! That doesn't make sense. Did you just give up thinking of a title name?

Davidski said...

Holy shit, you've never heard of Yogi Bear? I must be getting old.

https://en.wikipedia.org/wiki/Yogi_Bear#Catchphrases

Matt said...

Thanks. It is a nice complement; the EHG (and SHG) variables really seem to help in boosting the position of the ancient LNBA Yamnaya-influenced "north", compared to the WHG vs MA1 PCA / graphs, because of the strong affinities to EHG these populations have which others, particularly in the south of West Eurasia, lack.

Considering these HG ancestries all together does seem to bring it into a slightly closer alignment with the IBS PCA than just MA1 and WHG alone (although nothing's ever perfect - the Spanish samples and ancient samples have a much wider distribution on that PCA than the IBS PCA, for example, while Bell Beaker is still quite interestingly EHG shifted compared to recent Europeans, Sintashta takes its place very nicely in a modern Volga-Ural position etc).

The Spain_MN still retains its quite separate position from the other MN populations considered here, and Spain_EN is also still well separated from LBK_EN. Subjectively, this continues when only SHG vs MA1 or SHG vs EHG is plotted, but breaks down when EHG vs MA1 is considered, as do a lot of the positions - http://i.imgur.com/2oLeogI.png

Unknown said...

So does this mean that Kennewick Man would plot somewhere around the Iranians, since on the Eurogenes ANE K7 calculator he scored 16% WHG-UHG and 29% ANE?

Davidski said...

No he wouldn't. I have no idea where he'd plot, but he'd probably be an outlier.

Btw, for low coverage and old samples like Kennewick Man the ANE K7 only really predicts the level of ANE accurately, more or less. You can ignore the other components.

Unknown said...

Okay then, but how would you explain him scoring 7.20% on the Atlantic component and 2.77% on the North Sea component in Eurogenes 15?

Davidski said...

Well, keep in mind that the Kennewick Man's genome is low coverage. And then there's also the calculator effect.

Moreover, I've noticed that the samples coming out of the Copenhagen Uni aDNA lab must be run on transversion SNPs only when using ADMIXTURE and similar algorithms.

You'll note that I mention this a lot whenever I run analyses with them here. In fact, the D-stats above are based on transversion sites only.

Just ignore the calculator results. They're almost always garbage for ancient samples.
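As general background on what "transversion sites only" means in practice: transitions (A<->G and C<->T) are the substitutions that ancient DNA damage typically mimics, so restricting to transversions (a purine swapped for a pyrimidine or vice versa) sidesteps much of that noise. Below is a minimal sketch of such a filter. The SNP IDs and the tuple layout are hypothetical, and this is not the actual pipeline used for the D-stats above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(ref, alt):
    """Return True if ref -> alt swaps a purine for a pyrimidine or vice versa."""
    ref, alt = ref.upper(), alt.upper()
    return (ref in PURINES and alt in PYRIMIDINES) or (
        ref in PYRIMIDINES and alt in PURINES
    )


# Hypothetical SNP list: (snp_id, allele_1, allele_2).
snps = [
    ("snp_hypothetical_1", "C", "T"),  # transition, dropped
    ("snp_hypothetical_2", "A", "C"),  # transversion, kept
    ("snp_hypothetical_3", "G", "A"),  # transition, dropped
    ("snp_hypothetical_4", "G", "T"),  # transversion, kept
]

kept = [snp for snp in snps if is_transversion(snp[1], snp[2])]
for snp_id, a1, a2 in kept:
    print(snp_id, a1, a2)
```

The obvious cost is that most SNPs are transitions, so a filter like this discards a large share of the data in exchange for cleaner signal.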

Unknown said...

Thanks for the explanation. I know you have some issues with Dienekes’ calculators as well, but there are some consistencies between his and your calculators’ results in Kennewick Man’s breakdowns: his Dodecad v3 has Kennewick Man showing 9% more “West European” than Anzick 1, and his Dodecad K12b has Kennewick Man showing a curiously higher ratio of “North European” to “Gedrosia” than Anzick 1 (2.84 to 2.27). And the large majority of the Mesolithic Western European Hunter Gatherers had zero “Gedrosia” there. Any guesses as to how this could be?

Davidski said...

The results will depend on the age, coverage depth, ancestry of the ancient samples and design of the calculators. But in theory, we should only be testing the ancestry of the samples.

So sometimes the outcomes will make a lot of sense, but much of the time they're best ignored.

Unknown said...

Thanks.

Unknown said...

Ha ha
They stopped playing Hanna-Barbera cartoons like 20 years ago, unfortunately.

DMXX said...

"Holy shit, you've never heard of Yogi Bear? I must be getting old."

:D Booboo, picnic baskets etc. aren't stock phrases among the younger lot... Unless they watch re-runs in developing countries.

WesternPonticSteppe said...

how can I calculate my coordinates? :) some formula involving GedMATCH run results? thanks (MD2)

Anonymous said...

Nice work, Davidski. Sicilians look basically Middle Eastern on these and plot far away from Europe. There's a big gap between modern South Europe and LBK, which of course means a big shift east due to Yamna.

I don't see Cyprus here, but I figure they will be in the same region as Sicily.