
Sunday, October 12, 2014

Ancient genomes and the calculator effect


Several ancient genomes have been posted online as text files and uploaded to GEDmatch over the last couple of weeks, and many more are likely to follow in the future. A lot of people have already taken this opportunity to analyze these files with various online ancestry tools, usually DIY calculators.

That's actually not a bad way of doing things, as long as everyone's aware that almost all of these calculators produce biased results. They produce biased results because they violate a very basic rule of science, which is this:
Do not test more than one variable at a time.
Obviously, the variable we want to test with these calculators is ancestry. However, when the reference samples are tested in a different way to the test samples, which is what usually happens, then this adds another variable to the proceedings. As a result, we simply can't compare the results of the reference samples to those of the test samples.
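To make the point concrete, here's a toy simulation in Python. It's only a sketch, and it rests on an assumption that isn't spelled out above: that a calculator's component allele frequencies are estimated from the reference samples themselves, while test samples are fitted afterwards against those fixed frequencies. All names (`estimate_freqs`, `fit_fraction`, `simulate`) and all parameter values are made up for illustration and don't come from any real calculator.

```python
import math
import random


def estimate_freqs(panel, n_snps):
    """Allele frequency per SNP from a panel of diploid genotypes (0/1/2).

    A small pseudocount keeps every frequency strictly between 0 and 1.
    """
    n_alleles = 2 * len(panel)
    return [(sum(g[j] for g in panel) + 0.5) / (n_alleles + 1.0)
            for j in range(n_snps)]


def fit_fraction(genotype, freq_a, freq_b, steps=50):
    """Grid-search the fraction q of component A that maximises the binomial
    likelihood of one genotype, with both component frequencies held fixed."""
    best_q, best_ll = 0.0, -math.inf
    for k in range(steps + 1):
        q = k / steps
        ll = 0.0
        for g, fa, fb in zip(genotype, freq_a, freq_b):
            p = q * fa + (1.0 - q) * fb
            ll += g * math.log(p) + (2 - g) * math.log(1.0 - p)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q


def simulate(seed=1, n_snps=300, n_ref=5, n_test=5):
    """Return (mean q for references, mean q for new samples), both from pop A."""
    rng = random.Random(seed)
    # Two closely related populations: B is A plus a little drift (assumed values).
    true_a = [rng.uniform(0.2, 0.8) for _ in range(n_snps)]
    true_b = [min(0.95, max(0.05, f + rng.gauss(0.0, 0.1))) for f in true_a]

    def draw(freqs):
        # One diploid genotype per SNP: two independent allele draws.
        return [(rng.random() < f) + (rng.random() < f) for f in freqs]

    ref_a = [draw(true_a) for _ in range(n_ref)]    # used to build component A
    ref_b = [draw(true_b) for _ in range(n_ref)]    # used to build component B
    test_a = [draw(true_a) for _ in range(n_test)]  # same population, not used

    freq_a = estimate_freqs(ref_a, n_snps)
    freq_b = estimate_freqs(ref_b, n_snps)

    in_sample = [fit_fraction(g, freq_a, freq_b) for g in ref_a]
    out_sample = [fit_fraction(g, freq_a, freq_b) for g in test_a]
    return sum(in_sample) / n_ref, sum(out_sample) / n_test


if __name__ == "__main__":
    ref_mean, test_mean = simulate()
    print("references:", ref_mean, "new samples:", test_mean)
```

In this setup the reference individuals from population A should come out close to 100% component A, while new individuals drawn from the very same population should score noticeably lower. Their ancestry is identical; the only thing that differs is how they were treated, which is the second variable.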

I know that a lot of people find this difficult to grasp, and many just seem hell bent on not grasping it. However, anyone who isn't completely insane, and takes five minutes out of their day to try and understand the concepts involved, has to agree that this is a real problem. It can be proven empirically, like I did over two years ago (see here).

I suspect that a lot of confusion has been caused by the fact that the people who were used as reference samples in the making of the various DIY calculators saw highly accurate results when running them, and so assumed everything was fine. The accuracy of the DIY calculators for such people is indeed impressive, and I show that at the link above, but unfortunately the story is very different for everyone else.

Here's the good news: the Eurogenes calculators don't suffer from the calculator effect. That's because the reference samples are treated in the same way as the test samples, so there's only one variable: ancestry. What this means is that when you run a modern or ancient genome with a Eurogenes calculator you can confidently compare the result to those of the reference samples (provided enough SNPs are used), and then be able to make sensible inferences about its genetic origins.

Wednesday, October 8, 2014

Analysis of an ancient genome from Hinxton


I've just added an ancient sample from Hinxton, England, to my burgeoning ancient genomes collection. It's a pre-publication release freely available here as ERS389795. Thanks to Felix C. for breaking the news. We've both called this sample Hinxton1.

Unfortunately, its archeological context is a mystery to me, but it's possibly one of the ancient genomes mentioned in the recent Schiffels et al. ASHG abstract (see here).

In terms of genome-wide genetic structure, Hinxton1 is most similar to present-day Orcadians, Irish, western Scots, Icelanders and western Norwegians, more or less in that order. However, it's fairly distinct from the modern inhabitants of England, or at least those in my datasets, who mostly come from Kent and Cornwall.

Please note, this analysis features two different datasets: Eurogenes and Human Origins. Eurogenes, which is my own dataset, includes more populations than Human Origins, and is based on SNPs used in commercial ancestry and medical work. On the other hand, Human Origins shows a more varied sampling strategy, and is based on SNPs specifically chosen for population genetics.




Shared drift stats in the form f3(Mbuti;Hinxton1,Test) - Eurogenes dataset

Shared drift stats in the form f3(Mbuti;Hinxton1,Test) - Human Origins dataset
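
For readers unfamiliar with the statistic named in the captions above: an outgroup f3 of the form f3(Outgroup; A, B) is, in its simplest form, the average across SNPs of (o - a)(o - b), where o, a and b are the allele frequencies in the outgroup and the two populations being compared. Higher values mean that A and B share more genetic drift since splitting from the outgroup. The Python below is a bare-bones sketch of that definition only. It leaves out the finite-sample bias correction and block jackknife standard errors that dedicated tools such as ADMIXTOOLS apply, and the function names and toy frequencies are hypothetical.

```python
def outgroup_f3(freq_out, freq_a, freq_b):
    """Naive f3(Outgroup; A, B): mean over SNPs of (o - a) * (o - b)."""
    if not (len(freq_out) == len(freq_a) == len(freq_b)):
        raise ValueError("all three frequency lists must cover the same SNPs")
    products = [(o - a) * (o - b)
                for o, a, b in zip(freq_out, freq_a, freq_b)]
    return sum(products) / len(products)


def rank_by_shared_drift(freq_out, freq_target, tests):
    """Sort test populations from most to least shared drift with the target.

    `tests` maps a population label to its list of allele frequencies.
    """
    scores = {name: outgroup_f3(freq_out, freq_target, freqs)
              for name, freqs in tests.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up frequencies at two SNPs, purely to show the call pattern.
    outgroup = [0.1, 0.5]
    target = [0.3, 0.7]
    tests = {"PopX": [0.3, 0.7], "PopY": [0.1, 0.5]}
    for name, value in rank_by_shared_drift(outgroup, target, tests):
        print(name, round(value, 4))
```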



Eurogenes K15 4 Ancestors Oracle results

See also...

Analysis of Hinxton2 - ERS389796

Analysis of Hinxton3 - ERS389797

Analysis of Hinxton4 - ERS389798

Analysis of Hinxton5 - ERS389799

Hinxton ancient genomes roundup

Sunday, August 24, 2014

Genetic structure in the Western Balkans


PLoS ONE has a new paper by Kovacevic et al. on the genetic structure of Western Balkan populations. Here's the abstract:

Contemporary inhabitants of the Balkan Peninsula belong to several ethnic groups of diverse cultural background. In this study, three ethnic groups from Bosnia and Herzegovina - Bosniacs, Bosnian Croats and Bosnian Serbs - as well as the populations of Serbians, Croatians, Macedonians from the former Yugoslav Republic of Macedonia, Montenegrins and Kosovars have been characterized for the genetic variation of 660 000 genome-wide autosomal single nucleotide polymorphisms and for haploid markers. New autosomal data of the 70 individuals together with previously published data of 20 individuals from the populations of the Western Balkan region in a context of 695 samples of global range have been analysed. Comparison of the variation data of autosomal and haploid lineages of the studied Western Balkan populations reveals a concordance of the data in both sets and the genetic uniformity of the studied populations, especially of Western South-Slavic speakers. The genetic variation of Western Balkan populations reveals the continuity between the Middle East and Europe via the Balkan region and supports the scenario that one of the major routes of ancient gene flows and admixture went through the Balkan Peninsula.

Among the most eye-catching figures from the study is this TreeMix graph with ten migration edges or admixture events. Note the 44% migration edge running from the base of the Eastern European branch to the French. Is this perhaps a legacy of the Proto-Celts and early Germanics? In any case, something similar can be seen on this TreeMix graph from the supplementary PDF to Skoglund et al. 2014, where a French genome is modeled as a clade closely related to Upper Paleolithic Siberian forager MA-1, but with considerable Sardinian admixture.


Also, the position of the Poles at the tip of the tree, and thus near the North Russians, is somewhat curious. However, I know that several of these individuals are ethnic Poles from Estonia, so that might be the problem.

Update 25/08/2014: Here's a typical Eurogenes Principal Component Analysis (PCA) of West Eurasia with the new samples from this paper (Bosnians, Kosovars, Macedonians, Montenegrins and Serbs).



Citation...

Kovacevic L, Tambets K, Ilumäe A-M, Kushniarevich A, Yunusbayev B, et al. (2014) Standing at the Gateway to Europe - The Genetic Structure of Western Balkan Populations Based on Autosomal and Haploid Markers. PLoS ONE 9(8): e105090. doi:10.1371/journal.pone.0105090


Wednesday, August 6, 2014

Haplotype-based PCA of West Eurasia and Europe


The Principal Component Analyses (PCA) below are based on pairwise Identity-by-Descent (IBD) sharing inferred with fastIBD. My aim was to create PCA that took into account haplotype information to see how they might differ from similar plots based on unlinked loci (such as here).






Clearly, they're less reflective of geography and isolation-by-distance, and instead more profoundly influenced by relatively recent isolation, founder effects and/or rapid expansions, especially in Northern and Eastern Europe, and in particular among the Finns, Balts and East Slavs. Unfortunately, I don't have time to say much more about these results. But feel free to post any questions or observations in the comments below. I have done something very similar in the past, but with far fewer samples (see here).

Please note, to ensure that the PCA were as informative as possible I was forced to drop several populations that produced unusual results, probably because of extreme founder effects. This is why, for instance, there are no Ashkenazi Jews on any of the plots, and the only Finns you'll find come from western Finland.

I'll try this again on a much larger dataset when more samples come in, and also include populations from Central and South Asia.

Update 7/8/2014: Apparently some people are wondering what the plots with Finns and Jews look like. Here you go...




Thursday, June 5, 2014

Coming soon: genome-wide data from more than forty 3-9K year-old humans from the ancient Russian steppe


Below is a presentation abstract from the upcoming SMBE 2014 conference. I simply can't wait to see the paper, which I'm guessing will be published very soon.

A central challenge in ancient DNA research is that for many bones that contain genuine DNA, the great majority of molecules in sequencing libraries are microbial. Thus, it has been impractical to carry out whole genome analyses of substantial numbers of ancient individuals. We report a strategy for in-solution capture of ancient DNA from approximately 390,000 single nucleotide polymorphism (SNP) targets, adapting a method of Fu et al. PNAS 2013 who enriched a 40,000 year old DNA sample for the entire chromosome 21. Of the SNPs targets, the vast majority overlap the Affymetrix Human Origins array, allowing us to compare the ancient samples to a database of more than 2,700 present-day humans from 250 groups.

We applied the SNP capture as well as mitochondrial genome enrichment to a series of 65 bones dating to between 3,000-9,000 years ago from the Samara district of Russia in the far east of Europe, a region that has been suggested to be part of the Proto-Indo-European homeland. We successfully extracted nuclear data from 10-90% of targeted SNPs for more than 40 of the samples, and for all of these samples also obtained complete mitochondrial genomes. We report three key findings:

- Samples from the Samara region possess Ancient North Eurasian (ANE) admixture related to a recently published 24,000 year old Upper Paleolithic Siberian genome. This contrasts with both European agriculturalists and with European hunter-gatherers from Luxembourg and Iberia who had little such ancestry (Lazaridis et al. arXiv.org 2013). This suggests that European steppe groups may be implicated in the dispersal of ANE ancestry across Europe where it is currently pervasive.

- The mtDNA composition of the steppe population is primarily West Eurasian, in contrast with northwest Russian samples of this period (Der Sarkissian et al. PLoS Genetics 2013) where an East Eurasian presence is evident.

- Samara experienced major population turnovers over time: early samples (>6000 years) belong primarily to mtDNA haplogroups U4 and U5, typical of European hunter-gatherers but later ones include haplogroups W, H, T, I, K, J.

We report modeling analyses showing how the steppe samples may relate to ancient and present-day DNA samples from the rest of Europe, the Caucasus, and South Asia, thereby clarifying the relationship of steppe groups to the genetic, archaeological and linguistic transformations of the late Neolithic and Bronze ages.

David Reich et al., Genotyping of 390,000 SNPs in more than forty 3,000-9,000 year old humans from the ancient Russian steppe, SMBE 2014 abstract.

The other really interesting abstract from this conference concerns the Ust-Ishim genome from Upper Paleolithic western Siberia (see here). I'm betting its Y-chromosome haplogroup will be P*, but that's pure speculation on my part.


Update 11/02/2015: Massive migration from the steppe is a source for Indo-European languages in Europe (Haak et al. 2015 preprint).

Tuesday, May 13, 2014

PCA projection bias in ancient DNA studies


Many Principal Component Analyses (PCA) in papers on ancient genomes clearly suffer from projection bias. However, most people don't seem to understand this problem and the impact it can have on the interpretation of the data.

Here's a demonstration of this effect using two PCA. In the first PCA, La Brana-1, a Mesolithic genome from Iberia, was projected onto the PC eigenvectors computed with modern individuals from the HGDP. However, in the second PCA the ancient genome was run together with these samples. Note the clear difference between the two outcomes.




The second outcome does look a bit strange, but it's actually the correct one, because it's now an established fact that Mesolithic hunter-gatherers, like La Brana-1, were clearly outside the range of modern European, and indeed West Eurasian, genetic variation.
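
The effect is easy to reproduce with simulated data. The Python sketch below is not the La Brana-1 analysis and doesn't use genotypes at all: it draws two groups of Gaussian "samples" with many more features than individuals, computes the first principal axis from the training samples by power iteration, and then compares the PC1 scores of samples that helped define the axis with those of new samples from the same group that were merely projected. The function names, sample sizes and the `shift` value are all assumptions chosen for illustration.

```python
import random


def top_pc(rows, iters=100):
    """Leading principal axis of already-centred rows, found by power iteration."""
    n_cols = len(rows[0])
    v = [1.0 / (j + 1) for j in range(n_cols)]  # deterministic starting vector
    for _ in range(iters):
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]
        new_v = [sum(s * row[j] for s, row in zip(scores, rows))
                 for j in range(n_cols)]
        norm = sum(w * w for w in new_v) ** 0.5
        v = [w / norm for w in new_v]
    return v


def simulate(seed=7, n_snps=400, n_train=10, n_test=10, shift=0.3):
    """Return (mean |PC1| of in-sample group, mean |PC1| of projected group)."""
    rng = random.Random(seed)

    def sample(sign):
        # Group membership shifts every feature by +shift or -shift.
        return [sign * shift + rng.gauss(0.0, 1.0) for _ in range(n_snps)]

    train = ([sample(+1) for _ in range(n_train)]
             + [sample(-1) for _ in range(n_train)])
    held_out = [sample(+1) for _ in range(n_test)]  # same group, not in the PCA

    means = [sum(row[j] for row in train) / len(train) for j in range(n_snps)]

    def centre(row):
        return [x - m for x, m in zip(row, means)]

    axis = top_pc([centre(row) for row in train])

    def score(row):
        return sum(x * w for x, w in zip(centre(row), axis))

    in_sample = [score(row) for row in train[:n_train]]
    projected = [score(row) for row in held_out]
    return (abs(sum(in_sample) / len(in_sample)),
            abs(sum(projected) / len(projected)))


if __name__ == "__main__":
    inside, outside = simulate()
    print("in-sample PC1:", inside, "projected PC1:", outside)
```

The projected samples should land closer to the origin than their in-sample counterparts, even though both sets come from exactly the same distribution. With an ancient genome that genuinely sits outside modern variation, the same pull towards the centre drags it back in among the modern samples.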

For a technical discussion of this problem, which is also sometimes known as "shrinkage", refer to Lee et al. 2012. To get an idea of the confusion that it can cause, see the discussion in the comments section under my last blog post:

More info on two Thracian genomes from Iron Age Bulgaria + a complaint

The above experiment with La Brana-1 was run with PLINK 2, which is freely available here, using just over 16K SNPs. Only markers with a read depth of 4x or higher were considered, and the marker set was further pruned to account for no-calls (--geno 0.005), LD (--indep-pairwise 200 25 0.4), and minor allele frequency (--maf 0.05).
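
For anyone unsure what the no-call and minor allele frequency thresholds mentioned above actually do, here's a rough Python equivalent applied to a toy genotype table. It's a sketch of the logic only, not PLINK's implementation: LD pruning is left out, and the function names and SNP labels are hypothetical.

```python
def passes_qc(genotypes, max_missing=0.005, min_maf=0.05):
    """Decide whether one SNP survives the missingness and MAF filters.

    `genotypes` holds one entry per individual: 0, 1 or 2 copies of the
    alternate allele, or None for a no-call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = 1.0 - len(called) / len(genotypes)
    if missing_rate > max_missing:
        return False
    alt_freq = sum(called) / (2.0 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    return maf >= min_maf


def filter_snps(table, max_missing=0.005, min_maf=0.05):
    """Return the SNP labels in `table` (label -> genotype list) that pass."""
    return [snp for snp, genotypes in table.items()
            if passes_qc(genotypes, max_missing, min_maf)]


if __name__ == "__main__":
    toy = {
        "snp_ok": [0, 1, 2, 1],
        "snp_missing": [0, 1, None, 1],
        "snp_rare": [0] * 19 + [1],
    }
    print(filter_snps(toy))
```

Note that with a single low-coverage ancient genome in the mix, a no-call threshold this strict effectively restricts the analysis to markers called in that genome, which is one reason the final marker count can be so much smaller than the starting set.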

Friday, May 9, 2014

More info on two Thracian genomes from Iron Age Bulgaria + a complaint


PLoS Genetics has just published a new paper on the genetic affinities of Oetzi the Iceman (see here). As far as I can tell, it simply affirms what we've already learned about Oetzi from previous studies, but it does feature interesting new insights into a couple of genomes from Iron Age Bulgaria, aka Thrace:

The first individual (P192-1) was excavated from a pit sanctuary near Svilengrad, Bulgaria, dated to 800–500 BCE. The other individual (K8) was found in the Yakimova Mogila Tumulus in southeastern Bulgaria, dated to 450–400 BCE.

...

For the Thracian individuals from Bulgaria, no clear pattern emerges. While P192-1 still shows the highest proportion of Sardinian ancestry, K8 more resembles the HG individuals, with a high fraction of Russian ancestry.

...

Interestingly, this individual [K8] was excavated from an aristocratic inhumation burial containing rich grave goods, indicating a high social standing, as opposed to the other individual, who was found in a pit [15]. However, the DNA damage pattern of this individual does not appear to be typical of ancient samples (Table S4 in [15]), indicating a potentially higher level of modern DNA contamination.


K8 might well be contaminated with modern DNA to some degree, but I'd say there's a much better explanation for these signals of non-trivial genetic substructures within the Thracian population.

Archeology suggests that during the Bronze Age the Balkans were invaded from the east by nomads associated with the Yamnaya culture of the Pontic-Caspian Steppe. These invaders, possibly of early Indo-European stock, liked to build Tumuli mounds for their important dead, which were essentially copies of the Kurgan mounds built by the Yamnaya and related peoples.

Moreover, we now know that indigenous European hunter-gatherer (HG) ancestry survived best in Eastern Europe (see here), so it's very likely that the aforementioned invaders from the steppe were significantly HG-like in terms of genetic structure.

Therefore, the fact that K8 was buried in a richly furnished Tumulus (essentially a Kurgan), and was genetically more similar to indigenous Europeans than P192-1, who was genetically more Near Eastern-like and basically thrown into a ditch after he died, doesn't appear to be a coincidence.

In other words, perhaps K8 belonged to a ruling class of steppe origin, while P192-1 was largely of native Balkan stock, whose ancestors were conquered centuries earlier by the steppe nomads and forced to live as an underclass? If so, it wouldn't be the only time in history that this sort of thing has happened, especially within Indo-European societies.

By the way, unfortunately I have to add that the Principal Component Analyses (PCA) in this paper featuring the two HG genomes, ajv70 and La Brana-1, are simply woeful (PDF link). These genomes should be clearly outside the range of modern European genetic variation, but here they land among the Orcadian and French samples. Where was the peer review, I wonder?

Citation...

Sikora M, Carpenter ML, Moreno-Estrada A, Henn BM, Underhill PA, et al. (2014) Population Genomic Analysis of Ancient and Modern Genomes Yields New Insights into the Genetic Ancestry of the Tyrolean Iceman and the Genetic Structure of Europe. PLoS Genet 10(5): e1004353. doi:10.1371/journal.pgen.1004353

See also...

Ancient DNA from prehistoric Bulgaria and Denmark

PCA projection bias in ancient DNA studies

Thursday, April 3, 2014

The really old Europe is mostly in Eastern Europe


A new version of the Lazaridis et al. ancient genomes preprint has just appeared at arXiv (see here). It includes several new Principal Component Analyses (PCA), TreeMix graphs, a ChromoPainter/fineSTRUCTURE co-ancestry matrix, and an updated ADMIXTURE analysis. The revised text underlines the relatively close genetic relationship between indigenous European hunter-gatherers and present-day Eastern Europeans:


The co-ancestry matrix (Fig. S19.3) confirms the ability of this method to meaningfully cluster individuals. We highlight two clusters: Stuttgart joins all Sardinian individuals in cluster A and Loschbour joins a cluster B that encompasses all Belarusian, Ukrainian, Mordovian, Russian, Estonian, Finnish, and Lithuanian individuals. These results confirm Sardinia as a refuge area where ancestry related to Early European Farmers has been best preserved, and also the greater persistence of WHG-related ancestry in present-day Eastern European populations. The latter finding suggests that West European Hunter-Gatherers (so-named because of the prevalence of Loschbour and La Braña) or populations related to them have contributed to the ancestry of present-day Eastern European groups. Additional research is needed to determine the distribution of WHG-related populations in ancient Europe.


Fig. S10.5 suggests that the main axis of differentiation in Europe when the subcontinent is considered as a whole may tend to Northeastern Europe rather than SSE/NNW (8). This is consistent with our analysis of ancestry proportions in European populations (Fig. 2B, Extended Data Table 3) which indicate a cline of reduced EEF (and increasing WHG) ancestry along that direction.

Citation...

Iosif Lazaridis, Nick Patterson, Alissa Mittnik, et al., Ancient human genomes suggest three ancestral populations for present-day Europeans, arXiv, April 2, 2014, arXiv:1312.6639v2

Monday, March 10, 2014

Extreme positive selection for light skin, hair and eyes on the Pontic-Caspian steppe...or not


Unusually strong positive selection over the past 5,000 years, rather than population replacement or even admixture, is responsible for the high frequencies of light skin, hair and eyes among present-day Eastern Europeans, according to a new paper by Wilde et al. at PNAS.

The authors were able to infer pigmentation traits from ancient DNA for 63 Eneolithic and Bronze Age samples, mostly from Kurgan mounds from the Pontic-Caspian steppe of Ukraine and surrounds. The results suggest that the ancient individuals were overall much darker than present-day Ukrainians, who, nevertheless, appear to be their direct descendants based on mitochondrial DNA (mtDNA) sequences. Quoting the paper:

To this end we compared the 60 mtDNA HVR1 sequences obtained from our ancient sample to 246 homologous modern sequences (29–31) from the same geographic region and found low genetic differentiation (FST = 0.00551; P = 0.0663) (32). Coalescent simulations based on the mtDNA data, accommodating uncertainty in the ancient sample age, failed to reject population continuity under a wide range of assumed ancestral population size combinations (Fig. 1).

Conversely, continuity between early central European farmers and modern Europeans has been rejected in a previous study (33). However, the Eneolithic and Bronze Age sequences presented here are ∼500–2,000 y younger than the early Neolithic and belong to lineages identified both in early farmers and late hunter–gatherers from central Europe (33).

...

In sum, a combination of selective pressures associated with living in northern latitudes, the adoption of an agriculturalist diet, and assortative mating may sufficiently explain the observed change from a darker phenotype during the Eneolithic/Early Bronze age to a generally lighter one in modern Eastern Europeans, although other selective factors cannot be discounted. The selection coefficients inferred directly from serially sampled data at these pigmentation loci range from 2 to 10% and are among the strongest signals of recent selection in humans.

Well, either this is indeed a remarkable finding, or something's not quite right. I think it's the latter.

The argument for genetic continuity from the Eneolithic/Bronze Age to the present on the Pontic-Caspian steppe based on mtDNA sequences is actually very weak. The results could simply mean that the ancient samples shared deep maternal ancestry with modern Ukrainians and most other Europeans.
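
It helps to remember what an FST value measures. One common sequence-based form is FST = 1 - Hw/Hb, where Hw is the mean number of pairwise differences within populations and Hb is the mean number between them. The Python sketch below implements that form on toy sequences. It's an assumption that this is a fair stand-in: the paper's own estimate may well have been computed with a different estimator, and the function names and example sequences here are invented.

```python
from itertools import combinations, product


def hamming(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(seq_a, seq_b))


def mean_within(seqs):
    """Mean pairwise difference inside one population (needs >= 2 sequences)."""
    pairs = list(combinations(seqs, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def fst_pairwise(pop1, pop2):
    """FST = 1 - Hw / Hb from aligned sequences of two populations."""
    within = (mean_within(pop1) + mean_within(pop2)) / 2.0
    between = (sum(hamming(a, b) for a, b in product(pop1, pop2))
               / (len(pop1) * len(pop2)))
    if between == 0:
        return 0.0  # every sequence identical: no differentiation to measure
    return 1.0 - within / between


if __name__ == "__main__":
    ancient = ["AAAA", "AAAT"]
    modern = ["AATT", "AATA"]
    print(round(fst_pairwise(ancient, modern), 3))
```

Any two samples that draw on a similar pool of haplotypes will produce a value near zero, whether or not one descends directly from the other. A short, low-resolution stretch of sequence such as HVR1 makes that all the more likely.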

Indeed, we know for a fact that much of the Pontic-Caspian steppe was occupied by Turkic groups of Asian origin from the early Middle Ages until only a couple of hundred years ago. They were eventually cleared out by Tsarist Russia, and mainly replaced by East Slavic settlers from just northwest of the steppe. This process might not be easy to see by comparing low resolution mtDNA data, even between European populations separated by 5,000 years, but it's likely to be obvious when looking at full mtDNA genomes, high-density genome-wide data, and/or Y-chromosome haplogroups.

Surprisingly, the article doesn't mention Keyser et al. 2009, a very important study which showed that a sample of Kurgan nomads from Bronze and Iron Age South Siberia had frequencies of light hair and eyes comparable to those of present-day Northern and Eastern Europeans (see here). Also worth noting is that the most common Y-chromosome haplogroup among these individuals was R1a, which is today the most frequent haplogroup in Eastern Europe, including Ukraine.

What this suggests to me is that the Kurgan cultural horizon was not genetically homogeneous. I suspect that Kurgan groups closer to the Balkans carried significantly higher levels of Near Eastern Neolithic farmer ancestry, and were thus much darker than those in the more temperate northerly regions. However, it seems that at some point, the Neolithic farmer DNA was diluted enough by continuous movements of light pigmented groups from the north and east, possibly made up mostly of males, that there was a major shift in pigmentation traits from Near Eastern-like to North European-like across most of Eastern Europe. This scenario actually fits very nicely with the latest on the genetic origins of Europeans (see here).

We won't know what really happened until we see at least a few complete ancient genomes from Eastern Europe. But for now, I'd have to suspend my disbelief to accept that present-day Eastern Europeans are, by and large, descendants of these exceedingly brunet prehistoric people of the Pontic-Caspian steppe.

Citation...

Wilde et al., Direct evidence for positive selection of skin, hair, and eye pigmentation in Europeans during the last 5,000 y, PNAS, Published online before print on March 10, 2014, doi:10.1073/pnas.1316513111

See also...

PCA of ancient European mtDNA

Thursday, February 27, 2014

Khazar shmazar


Human Biology recently posted several open access manuscripts dealing with the topic of Jewish origins (see submissions from 2013 here). One of these preprints is essentially a rebuttal to an Eran Elhaik paper from a couple of years ago, which argued that a substantial part of Ashkenazi Jewish ancestry was derived from within the Khazar Empire. The leading author of the new preprint is Doron M. Behar, but thirty people in all, many of them well-known scientists, have put their names on it. Here's the abstract:

The origin and history of the Ashkenazi Jewish population have long been of great interest, and advances in high-throughput genetic analysis have recently provided a new approach for investigating these topics. We and others have argued on the basis of genome-wide data that the Ashkenazi Jewish population derives its ancestry from a combination of sources tracing to both Europe and the Middle East. It has been claimed, however, through a reanalysis of some of our data, that a large part of the ancestry of the Ashkenazi population originates with the Khazars, a Turkic-speaking group that lived to the north of the Caucasus region ~1,000 years ago. Because the Khazar population has left no obvious modern descendants that could enable a clear test for a contribution to Ashkenazi Jewish ancestry, the Khazar hypothesis has been difficult to examine using genetics. Furthermore, because only limited genetic data have been available from the Caucasus region, and because these data have been concentrated in populations that are genetically close to populations from the Middle East, the attribution of any signal of Ashkenazi-Caucasus genetic similarity to Khazar ancestry rather than shared ancestral Middle Eastern ancestry has been problematic. Here, through integration of genotypes on newly collected samples with data from several of our past studies, we have assembled the largest data set available to date for assessment of Ashkenazi Jewish genetic origins. This data set contains genome-wide single-nucleotide polymorphisms in 1,774 samples from 106 Jewish and non-Jewish populations that span the possible regions of potential Ashkenazi ancestry: Europe, the Middle East, and the region historically associated with the Khazar Khaganate. The data set includes 261 samples from 15 populations from the Caucasus region and the region directly to its north, samples that have not previously been included alongside Ashkenazi Jewish samples in genomic studies.
Employing a variety of standard techniques for the analysis of population-genetic structure, we find that Ashkenazi Jews share the greatest genetic ancestry with other Jewish populations, and among non-Jewish populations, with groups from Europe and the Middle East. No particular similarity of Ashkenazi Jews with populations from the Caucasus is evident, particularly with the populations that most closely represent the Khazar region. Thus, analysis of Ashkenazi Jews together with a large sample from the region of the Khazar Khaganate corroborates the earlier results that Ashkenazi Jews derive their ancestry primarily from populations of the Middle East and Europe, that they possess considerable shared ancestry with other Jewish populations, and that there is no indication of a significant genetic contribution either from within or from north of the Caucasus region.

I'm really not sure what to make of all the attention that the Khazar hypothesis is still getting. It's been obvious for a while now that in terms of genetic structure Ashkenazi Jews are basically a group of East Mediterranean origin. But Elhaik's paper did get a fair bit of media coverage, so I suppose after that a rebuttal was to be expected.

In any case, I'm not complaining. This paper includes a very interesting genotype dataset of many previously unpublished samples, which I tested last week with PCA (see here).

Citations...

Behar, Doron M.; Metspalu, Mait; Baran, Yael; Kopelman, Naama M.; Yunusbayev, Bayazit; Gladstein, Ariella; Tzur, Shay; Sahakyan, Hovhannes; Bahmanimehr, Ardeshir; Yepiskoposyan, Levon; Tambets, Kristiina; Khusnutdinova, Elza K.; Kushniarevich, Aljona; Balanovsky, Oleg; Balanovsky, Elena; Kovacevic, Lejla; Marjanovic, Damir; Mihailov, Evelin; Kouvatsi, Anastasia; Triantaphyllidis, Costas; King, Roy J.; Semino, Ornella; Torroni, Antonio; Hammer, Michael F.; Metspalu, Ene; Skorecki, Karl; Rosset, Saharon; Halperin, Eran; Villems, Richard; and Rosenberg, Noah A., No Evidence from Genome-Wide Data of a Khazar Origin for the Ashkenazi Jews (2013). Human Biology Open Access Pre-Prints. Paper 41.

Elhaik E. The missing link of Jewish European Ancestry: contrasting the Rhineland and Khazarian hypotheses. Genome Biol Evol. 2012. doi:10.1093/gbe/evs119, Advance Access publication December 14, 2012.

See also...

Near Eastern origin of Ashkenazi Levite R1a