Wednesday, July 29, 2015

The ancient DNA case against the Anatolian hypothesis


In the debate over the location of the Proto-Indo-European urheimat, Colin Renfrew's Anatolian hypothesis is usually mentioned as the most viable alternative to the steppe or Kurgan hypothesis. But probably not for very much longer.

Below is a Principal Component Analysis (PCA) featuring extant Indo-European and non-Indo-European groups from West Eurasia, a couple of typical early Neolithic farmers from Central Europe, a typical Western Hunter-Gatherer, also from Central Europe, and the Iceman from the Copper Age Tyrolean Alps, again typical of his time and place.*

It's just a taste of the ancient genomic data we have available from prehistoric Europe, but it has almost everything that is pertinent to the issue at hand.


You don't need to be familiar with PCA methodology to be able to read the plot. Basically, it shows that the present-day European population structure is the result of two main events:

- the arrival of early farmers from Anatolia during the Neolithic transition, which eventually caused the extinction of people like the Western Hunter-Gatherer, who is the most obvious outlier on the plot

- the expansion of Kurgan groups such as the Yamnaya, which led to the formation of the Corded Ware horizon across much of Europe and shifted the genetic structure of almost all Europeans to the east, away from the Neolithic and Copper Age samples.

These were massive population turnovers, and, as a rule, massive population turnovers are accompanied by language change. So it's highly unlikely that any Europeans today are speaking languages derived from those of the Western Hunter-Gatherers or early Neolithic farmers of Central Europe (i.e. the ancestors of Celts, Germanics and other Indo-Europeans, according to Renfrew). Moreover, consider this:

- most present-day Indo-European speaking Europeans form an elongated cluster between the Neolithic farmers and the Corded Ware sample, pointing to the steppe-derived Corded Ware Culture as the proximate agent of the Indo-European expansion in much of Europe

- the only present-day Europeans who closely resemble Neolithic farmers are some Sardinians (the small Romance cluster just above the two Neolithic samples), but Sardinians spoke Paleo-Sardinian or Nuragic languages until they adopted Indo-European speech, in the form of Latin, from the Romans (see page 118 here).

Also, this isn't shown on the plot, but the dominant Y-chromosome haplogroup of early Neolithic farmers is G2a, which is a low frequency marker in Europe today. The two most common Y-chromosome haplogroups among present-day Europeans are R-M198 and R-M269, which are also typical of Corded Ware and Yamnaya males, respectively, and probably originally from the steppe.

So is there any way to rework the Anatolian hypothesis so that it can be salvaged? I doubt it. Even making the steppe a homeland for all of the main Indo-European branches apart from Anatolian and Armenian probably won't help.

It is true that the Yamnaya nomads carried Near Eastern-related ancestry which may represent Proto-Indo-European admixture from outside of the steppe. But there's no evidence that it came from Anatolia.

In fact, if Neolithic Anatolians were basically identical to early Neolithic European farmers, which seems to be the case (see here and here), then it's unlikely that it did, because the latter carried a peculiar genome-wide signal that is missing in Yamnaya genomes (orange cluster in the ADMIXTURE bar graph below).** Heck, even the early Corded Ware genomes from Germany barely show any of it.

I won't go into the linguistic arguments against the Anatolian hypothesis here. But it might be worth checking out a new book on the topic by linguists Asya Pereltsvaig and Martin W. Lewis: The Indo-European Controversy: Facts and Fallacies in Historical Linguistics. I haven't read it yet, so I welcome the opinions here of those who have. I did, however, read a lot of the online articles on which the book is based. As far as I know most of them are still available here and here.


*Another version of the same PCA, with the samples labeled individually, is available here. All possible combinations of dimensions 1 to 4 are shown here. The samples are listed here. All of the samples are from Haak et al. and Allentoft et al. The PCA was run using ~56K high confidence SNPs listed here.

The Corded Ware sample is a composite of Corded Ware sequences from Germany, Scandinavia, Estonia and Poland. The Yamnaya sample is a composite of Yamnaya sequences from the Kalmykia and Samara regions of Russia.

I chose to use these composites instead of individual sequences because I didn't want to run any samples with genotype rates of less than 98%.
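The post doesn't say how the composites were built, so the following is only a minimal sketch of one plausible approach: at each marker, take the first non-missing call across the contributing sequences, then check the genotype rate of the result. The function names and the 0/1/2 genotype coding are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# 0, 1 or 2 copies of the alternate allele; None means no call (assumed coding).
Genotype = Optional[int]


def genotype_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of markers with a non-missing call."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def make_composite(samples: Sequence[Sequence[Genotype]]) -> List[Genotype]:
    """At each marker, take the first non-missing call across the samples."""
    composite = []
    for calls_at_marker in zip(*samples):
        call = next((c for c in calls_at_marker if c is not None), None)
        composite.append(call)
    return composite


# Two low-coverage toy sequences that cover different markers.
seq_a = [0, None, 2, None]
seq_b = [None, 1, None, None]
composite = make_composite([seq_a, seq_b])
print(genotype_rate(seq_a), genotype_rate(composite))
```

A composite like this can only clear a threshold such as 98% if the individual sequences are missing different markers, which is usually the case with low-coverage ancient data.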

** For a more detailed ADMIXTURE analysis comparing early Neolithic farmers to Yamnaya refer to Haak et al. Supplementary Information 6. Note the minimal sharing of components at the higher K values between the early Neolithic farmers and Yamnaya, especially at K=16, which has the lowest median cross-validation (CV) error. This is in agreement with the PCA above.
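For anyone unsure what "lowest median CV error" means in practice, here's a rough sketch: run several replicates at each K, take the median CV error per K, and pick the K with the smallest median. The numbers below are made up for illustration and are not taken from Haak et al.

```python
from statistics import median
from typing import Dict, List


def best_k(cv_errors: Dict[int, List[float]]) -> int:
    """Return the K whose replicate runs have the lowest median CV error."""
    return min(cv_errors, key=lambda k: median(cv_errors[k]))


# Hypothetical CV errors from three replicate runs per K (made-up numbers).
example = {
    14: [0.512, 0.515, 0.511],
    15: [0.509, 0.510, 0.513],
    16: [0.507, 0.508, 0.512],
}
print(best_k(example))
```

Using the median rather than the minimum keeps a single lucky replicate from deciding the outcome.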

See also...

Population genomics of Early Bronze Age Europe in three simple graphs

Sunday, July 26, 2015

Global PCA of selected Late Neolithic/Bronze Age Eurasians


I was curious how the Bronze Age steppe and Corded Ware genomes from the Rise dataset would behave in Principal Component Analyses (PCA) alongside populations from across the globe. Ten genomes had enough high confidence (transversion) markers to be analyzed accurately in such a way. I also ran an Iron Age Swedish sample, just to see how it differed from the older genomes.
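For readers who want a feel for what a PCA does with genotype data, below is a toy sketch in plain Python. It is not the software used for these plots. It centres each marker, fills missing calls with the marker mean (a simplifying assumption), and finds each sample's position on the first principal component by power iteration. The function name and the 0/1/2 genotype coding are hypothetical.

```python
from typing import List, Optional, Sequence


def first_pc(genotypes: Sequence[Sequence[Optional[float]]],
             iterations: int = 200) -> List[float]:
    """Return each sample's coordinate on PC1, scaled to unit length."""
    n_samples = len(genotypes)
    n_markers = len(genotypes[0])

    # Centre each marker; a missing call becomes 0, i.e. the marker mean.
    centred = [[0.0] * n_markers for _ in range(n_samples)]
    for j in range(n_markers):
        seen = [row[j] for row in genotypes if row[j] is not None]
        mean = sum(seen) / len(seen) if seen else 0.0
        for i in range(n_samples):
            value = genotypes[i][j]
            centred[i][j] = 0.0 if value is None else value - mean

    # Power iteration on the sample-by-sample matrix centred * centred^T.
    scores = [float(i + 1) for i in range(n_samples)]
    for _ in range(iterations):
        loadings = [sum(centred[i][j] * scores[i] for i in range(n_samples))
                    for j in range(n_markers)]
        scores = [sum(centred[i][j] * loadings[j] for j in range(n_markers))
                  for i in range(n_samples)]
        norm = sum(s * s for s in scores) ** 0.5
        if norm == 0.0:
            return [0.0] * n_samples
        scores = [s / norm for s in scores]
    return scores


# Two toy "populations" that differ at most markers.
toy = [[0, 0, 0, 0], [0, 0, 1, 0], [2, 2, 2, 2], [2, 2, 1, 2]]
print(first_pc(toy))
```

The sign of a principal component is arbitrary, so only the relative positions of the samples matter. For real datasets with tens of thousands of markers a dedicated package is the sensible choice.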

Click on the links to go to my drive to download the plots. If you're having trouble finding the ancient samples, type their IDs into the PDF search field and hit enter.

RISE509_Afanasievo

RISE511_Afanasievo

RISE500_Andronovo

RISE505_Andronovo

RISE00_Corded_Ware

RISE94_Corded_Ware

RISE493_Karasuk

RISE496_Karasuk

RISE548_Yamnaya

RISE552_Yamnaya

RISE174_Iron_Age_Scandinavia

I can't see any major surprises. But I do find it remarkable how very European the Andronovo individuals appear on these plots. Keep in mind that they're ~3,000-year-old samples from the Altai region of Russia. Their ancestors probably migrated there from the Trans-Urals steppe sometime during the Middle Bronze Age.

The Andronovo Culture was succeeded in the Altai region during the Late Bronze Age by the Karasuk Culture, which was probably a new composite of local and perhaps foreign groups. Interestingly, the Karasuk samples featured above are obviously of mixed European/East Asian origin.

Note also that the Afanasievo and Yamnaya individuals fall outside the range of present-day European variation in many of the dimensions, basically as if they were pulling towards the Karitiana Indians of the Amazon. No doubt, this is their excess ANE talking.

By the way, I recently ran some of the same samples in PCA limited to West Eurasian populations. You can see the results here.

Wednesday, July 22, 2015

High-res R1b tree featuring 16 ancient sequences


Here's a useful R1b phylogenetic tree that was posted recently at the R1b-M269 (P312- U106-) DNA Project site.


If these results are correct (and judging by the quality of work at the aforementioned R1b project, I'm pretty sure they are), it would appear that the Samara hunter-gatherer, marked I0124, was not directly ancestral or even all that closely related to any of the Yamnaya/Pit-Grave samples from the North Caspian region (each one also marked with an I~ ID).

On the other hand, the North Caspian Yamnaya sequences are very similar to the rest of the Yamnaya sequences, which come from just north of the Caucasus (marked RISE~). Indeed, all of these Yamnaya samples are almost identical in terms of genome-wide genetic structure (see here).

What this suggests is that the Yamnaya nomads emigrated to the North Caspian from somewhere near the Caucasus, or they were the descendants of such migrants. And if we assume that their ancestral homeland abutted the territory of the Maikop Culture, as shown on this map from Dolukhanov 2014 (look for 9 - early Pit-graves), it becomes easy to understand why they carried such significant maternal and genome-wide genetic Caucasus-related admixture (usually estimated at around 50%).

However, if you're one of those online Near Eastern patriots who like to imagine the Yamnaya as your own, please don't jump for joy just yet. The Yamnaya nomads still look very much like a people native to the western steppe, and this is probably also where their R1b comes from.

Sunday, July 19, 2015

The real thing


A couple of years ago Moorjani et al. concluded that present-day Georgians of the Transcaucasus were the best available proxy for the ancient West Eurasian population that mixed into the South Asian gene pool.

This was a solid statistical fit. And you can see on the TreeMix graph below, featuring a Georgian and a Kalash, why it worked so well.




But it was also a big fat coincidence, because check out what happens when I add another migration edge to the same graph.




Thus, the Indo-Iranian and hence Indo-European speaking Kalash no longer looks very similar to the Kartvelian speaking Georgian. In fact, he appears to be most closely related to the supposedly Indo-European speaking Afanasievo and Yamnaya nomads of the Early Bronze Age Eurasian steppe. The rest of his ancestry is probably best described as South Central Asian, which is an unknown quantity to me at this stage, but probably in large part of indigenous South Asian origin (see here).

I'm only able to show this thanks to the ancient samples that are on the tree, for which, as far as I know, there aren't any useful substitutes among present-day populations. Obviously, Moorjani et al. didn't have this luxury, so they ended up with a model that was statistically sound, but didn't make much sense otherwise, especially in terms of linguistics.

My TreeMix model is easily reproducible with most of the other South Asian samples from the Human Origins dataset, and it gels nicely with uniparental marker data too. For instance, here's a close-up from a similar graph featuring a Pathan, with a few extra details.




Yep, not only do Pathans cluster among these ancients of the Eurasian steppe, but most of them also carry the same Y-chromosome haplogroup: R1a-Z93, which is derived from R1a-M417, and in all likelihood first expanded in a big way with the Proto-Indo-Iranians of the Trans-Ural steppe.

By the way, the Human Origins dataset has four different sets of Gujarati samples from Houston, USA, marked A, B, C and D, and each one shows a different level of ancient steppe admixture as inferred with my test. GujaratiA scores around 50%, while GujaratiD scores only 40%. Does anyone know why these Gujaratis were grouped in such a way? Was it based on genetic structure or caste origin?





Full output from the analysis above is available in a zip file here. The reference samples and markers are listed here and here. The ancient samples are from Allentoft et al. 2015 and Haak et al. 2015.

See also...

The Poltavka outlier

Friday, July 17, 2015

Iron Age and Anglo-Saxon genomes from eastern England (Schiffels et al. preprint)


I haven't read this properly yet, but the results appear to be very similar to those I obtained with some of the same ancient genomes (see here), which must be very heartening for the authors (j/k). By the way, it's interesting to note that the word Celtic doesn't appear anywhere in the paper. I wonder why?

British population history has been shaped by a series of immigrations and internal movements, including the early Anglo-Saxon migrations following the breakdown of the Roman administration after 410CE. It remains an open question how these events affected the genetic composition of the current British population. Here, we present whole-genome sequences generated from ten ancient individuals found in archaeological excavations close to Cambridge in the East of England, ranging from 2,300 until 1,200 years before present (Iron Age to Anglo-Saxon period). We use present-day genetic data to characterize the relationship of these ancient individuals to contemporary British and other European populations. By analyzing the distribution of shared rare variants across ancient and modern individuals, we find that today’s British are more similar to the Iron Age individuals than to most of the Anglo-Saxon individuals, and estimate that the contemporary East English population derives 30% of its ancestry from Anglo-Saxon migrations, with a lower fraction in Wales and Scotland. We gain further insight with a new method, rarecoal, which fits a demographic model to the distribution of shared rare variants across a large number of samples, enabling fine scale analysis of subtle genetic differences and yielding explicit estimates of population sizes and split times. Using rarecoal we find that the ancestors of the Anglo-Saxon samples are closest to modern Danish and Dutch populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain.

Schiffels et al., Iron Age and Anglo-Saxon genomes from East England reveal British migration history, bioRxiv, Posted July 17, 2015. doi: http://dx.doi.org/10.1101/022723

Wednesday, July 15, 2015

Population genomics of Early Bronze Age Europe in three simple graphs


Thanks to recent advances in ancient genomics there's very little doubt now that the Pontic-Caspian Steppe was the source of massive population movements deep into Europe during the Late Neolithic/Early Bronze Age.

But some people still don't get it, maybe because genomics isn't their thing. Others just refuse to get it, probably because it's at odds with what they've been hoping to see.

To help the former, and piss off the latter some more, I've put together three simple TreeMix graphs featuring ancient samples from a wide range of European archeological cultures, along with a little bit of commentary. Enjoy.





Full output from the analysis above is available in a zip file here. The samples and markers are listed here and here. The ancient samples are from Allentoft et al. 2015 and Haak et al. 2015. The Sub-Saharan Africans are from the fully public Human Origins dataset available here.

Wednesday, July 8, 2015

Another look at the ancient mtDNA from Xiaohe, Tarim Basin


BMC Genetics has just published a new paper on the famous Tarim Basin mummies. It's a bit of a shame that it only deals with their mtDNA. Here's the abstract:

Background: The Tarim Basin in western China, known for its amazingly well-preserved mummies, has been for thousands of years an important crossroad between the eastern and western parts of Eurasia. Despite its key position in communications and migration, and highly diverse peoples, languages and cultures, its prehistory is poorly understood. To shed light on the origin of the populations of the Tarim Basin, we analysed mitochondrial DNA polymorphisms in human skeletal remains excavated from the Xiaohe cemetery, used by the local community between 4000 and 3500 years before present, and possibly representing some of the earliest settlers.

Results: Xiaohe people carried a wide variety of maternal lineages, including West Eurasian lineages H, K, U5, U7, U2e, T, R*, East Eurasian lineages B, C4, C5, D, G2a and Indian lineage M5.

Conclusion: Our results indicate that the people of the Tarim Basin had a diverse maternal ancestry, with origins in Europe, central/eastern Siberia and southern/western Asia. These findings, together with information on the cultural context of the Xiaohe cemetery, can be used to test contrasting hypotheses of route of settlement into the Tarim Basin.

Five years ago some of the same scientists published a paper on an older set of human remains from the same burial site, and found that all of the males belonged to Y-chromosome haplogroup R1a (see here). Last year one of them apparently left a comment under that paper saying this:

Our results show that Xiaohe settlers carried Hg R1a1 in paternal lineages, and Hgs H, K, C4, M* in maternal lineages. Though Hg R1a1a is found at highest frequency in both Europe and South Asia, Xiaohe R1a1a more likely originate from Europe because of it not belong to R1a1a-Z93 branch (our recently unpublished data) which mainly found in Asians.

So I'm pretty sure another paper is on the way. But hopefully the data will include much more than just broad Y-haplogroup classifications. A few full genomes from several layers of the Xiaohe cemetery would be really nice.

Citation...

Chunxiang Li et al., Analysis of ancient human mitochondrial DNA from the Xiaohe cemetery: insights into prehistoric population movements in the Tarim Basin, China, BMC Genetics 2015, 16:78, doi:10.1186/s12863-015-0237-5

See also...

Lots of ancient Y-DNA from China

Bronze Age Tarim Basin Caucasoids belonged to Y-haplogroup R1a1a

Friday, July 3, 2015

ADMIXTURE analysis of Allentoft et al. and Haak et al. ancient genomes


I haven't had a chance to study the output in detail yet, and I don't know what the cross-validation errors are for each of these unsupervised runs, but I'd say they all look pretty good. A Principal Component Analysis (PCA) of some of the K=10 data, showing how present-day Armenians compare to two Bronze Age Armenians, can be seen here.

K=6 spreadsheet

K=7 spreadsheet

K=8 spreadsheet

K=9 spreadsheet

K=10 spreadsheet

I did attempt to go up to K=11, but the algorithm appeared to be struggling to find a solution, so I killed the run. I'll have another go when more samples come in.

By the way, the analysis is based on the Human Origins fully public dataset available at the Reich lab website here.

To reduce errors, I limited the markers to transversion SNPs, and only kept samples with minimum call rates of 20%. This left 113K SNPs and 101 ancient genomes: 47 from Allentoft et al., 36 from Haak et al., and 18 from other recent papers. I didn't thin the markers to correct for LD, because in my experience this often results in less accurate outcomes.
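The two filters described above can be sketched roughly as follows. This is not the actual pipeline: the data layout, the function names, and the choice to compute call rates after the marker filter are all assumptions. A common reason given for the transversion-only step in ancient DNA work is that post-mortem damage shows up mostly as C-to-T and G-to-A changes, which are transitions.

```python
from typing import Dict, List, Optional, Tuple

BASES = {"A", "C", "G", "T"}
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_transversion(allele1: str, allele2: str) -> bool:
    """True for purine/pyrimidine swaps, i.e. any SNP that isn't A/G or C/T."""
    pair = frozenset((allele1.upper(), allele2.upper()))
    return len(pair) == 2 and pair <= BASES and pair not in TRANSITIONS


def filter_dataset(
    markers: List[Tuple[str, str]],
    samples: Dict[str, List[Optional[int]]],
    min_call_rate: float = 0.20,
) -> Tuple[List[Tuple[str, str]], Dict[str, List[Optional[int]]]]:
    """Keep transversion markers, then samples at or above the call rate."""
    keep = [i for i, (a1, a2) in enumerate(markers) if is_transversion(a1, a2)]
    filtered = {}
    for name, calls in samples.items():
        kept_calls = [calls[i] for i in keep]
        called = sum(c is not None for c in kept_calls)
        # Call rate is measured on the retained markers only (an assumption).
        if kept_calls and called / len(kept_calls) >= min_call_rate:
            filtered[name] = kept_calls
    return [markers[i] for i in keep], filtered


# Toy data: five markers, two samples with None marking missing calls.
toy_markers = [("A", "G"), ("A", "C"), ("C", "T"), ("G", "T"), ("A", "T")]
toy_samples = {
    "sample1": [0, 1, None, None, 2],
    "sample2": [1, None, 1, None, None],
}
kept_markers, kept_samples = filter_dataset(toy_markers, toy_samples)
print(kept_markers, sorted(kept_samples))
```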