Tuesday, May 13, 2014

PCA projection bias in ancient DNA studies

Many Principal Component Analyses (PCA) in papers on ancient genomes clearly suffer from projection bias. However, most people don't seem to understand this problem and the impact it can have on the interpretation of the data.

Here's a demonstration of this effect using two PCAs. In the first, La Brana-1, a Mesolithic genome from Iberia, was projected onto the PC eigenvectors computed from modern HGDP individuals only. In the second, the ancient genome was run together with these samples, so it contributed to the eigenvectors. Note the clear difference between the two outcomes.

The second outcome does look a bit strange, but it's actually the correct one, because it's now an established fact that Mesolithic hunter-gatherers, like La Brana-1, were clearly outside the range of modern European, and indeed West Eurasian, genetic variation.
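The same effect is easy to reproduce on toy data. The following is a minimal Python sketch, not the La Brana-1 analysis: it assumes simulated Gaussian "markers" rather than real genotypes, two hypothetical groups separated by a small per-marker shift, and many more markers than samples. All names and parameter values in it are made up for illustration. It compares the PC1 coordinates of the same held-out samples when they are projected onto axes fitted without them and when they are included in the PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: far more markers than samples, as in genotype data.
N_REF, N_NEW, N_MARKERS = 100, 20, 5000
# Assumed per-marker mean shift separating two toy groups.
SHIFT = np.full(N_MARKERS, 0.1)


def simulate(n):
    """Draw n toy samples, each from one of two groups, plus unit noise."""
    side = rng.choice([-1.0, 1.0], size=n)
    return side[:, None] * SHIFT + rng.standard_normal((n, N_MARKERS))


def fit_pc1(X):
    """Return the column means and the first principal axis of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[0]


def pc1_scores(X, mean, axis):
    """Coordinates of the rows of X along a previously fitted axis."""
    return (X - mean) @ axis


reference = simulate(N_REF)
held_out = simulate(N_NEW)

# Case 1: fit on the reference only, then project the held-out samples.
mean_ref, axis_ref = fit_pc1(reference)
projected = pc1_scores(held_out, mean_ref, axis_ref)

# Case 2: run the held-out samples together with the reference.
joint = np.vstack([reference, held_out])
mean_joint, axis_joint = fit_pc1(joint)
included = pc1_scores(joint, mean_joint, axis_joint)[N_REF:]

# The sign of a PC is arbitrary, so compare distances from the origin.
shrinkage = np.abs(projected).mean() / np.abs(included).mean()
print(f"mean |PC1| when projected: {np.abs(projected).mean():.2f}")
print(f"mean |PC1| when included:  {np.abs(included).mean():.2f}")
print(f"ratio (projected / included): {shrinkage:.2f}")
```

In this setup the ratio should come out well below 1: the projected samples are pulled toward the origin even though they come from exactly the same groups as the reference samples. The held-out samples here are not outliers like La Brana-1, so this only illustrates the pull toward the origin, not the full difference seen in the two PCAs above.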

For a technical discussion of this problem, which is also sometimes known as "shrinkage", refer to Lee et al. 2012. To get an idea of the confusion that it can cause, see the discussion in the comments section under my last blog post:

More info on two Thracian genomes from Iron Age Bulgaria + a complaint

The above experiment with La Brana-1 was run with PLINK 2, which is freely available here, using just over 16K SNPs. Only markers with a read depth of 4x or higher were considered, and the marker set was further pruned to account for no-calls (--geno 0.005), LD (--indep-pairwise 200 25 0.4), and minor allele frequency (--maf 0.05).
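To make the no-call and allele frequency filters concrete, here is a rough Python sketch of what --geno 0.005 and --maf 0.05 amount to on a small genotype matrix. It is not PLINK's code. The toy matrix, the function name filter_markers, and the 0/1/2 allele-count coding with NaN for no-calls are assumptions for illustration, and the LD pruning step is left out.

```python
import numpy as np


def filter_markers(genotypes, max_missing=0.005, min_maf=0.05):
    """Return a boolean mask of markers (columns) that pass both filters.

    genotypes: samples x markers array of alternate-allele counts (0, 1, 2),
    with np.nan marking a no-call. This coding is an assumption of the sketch.
    """
    # Fraction of samples with no call at each marker.
    missing_rate = np.isnan(genotypes).mean(axis=0)
    # Alternate-allele frequency among the called genotypes.
    alt_freq = np.nanmean(genotypes, axis=0) / 2.0
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    # Keep markers that are well called and not too rare.
    return (missing_rate <= max_missing) & (maf >= min_maf)


# Hypothetical toy data: 4 samples x 4 markers.
toy = np.array(
    [
        [0, 2, 0, 1],
        [1, 2, 0, np.nan],
        [2, 2, 0, 1],
        [1, 2, 1, 0],
    ],
    dtype=float,
)

mask = filter_markers(toy)
print(mask)
```

In the toy matrix the second marker is monomorphic and the fourth has a no-call in one of four samples, so only the first and third markers pass.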

Friday, May 9, 2014

More info on two Thracian genomes from Iron Age Bulgaria + a complaint

PLoS Genetics has just published a new paper on the genetic affinities of Oetzi the Iceman (see here). As far as I can tell, it simply affirms what we've already learned about Oetzi from previous studies, but it does feature interesting new insights into a couple of genomes from Iron Age Bulgaria, a.k.a. Thrace:

The first individual (P192-1) was excavated from a pit sanctuary near Svilengrad, Bulgaria, dated to 800–500 BCE. The other individual (K8) was found in the Yakimova Mogila Tumulus in southeastern Bulgaria, dated to 450–400 BCE.


For the Thracian individuals from Bulgaria, no clear pattern emerges. While P192-1 still shows the highest proportion of Sardinian ancestry, K8 more resembles the HG individuals, with a high fraction of Russian ancestry.


Interestingly, this individual [K8] was excavated from an aristocratic inhumation burial containing rich grave goods, indicating a high social standing, as opposed to the other individual, who was found in a pit [15]. However, the DNA damage pattern of this individual does not appear to be typical of ancient samples (Table S4 in [15]), indicating a potentially higher level of modern DNA contamination.

K8 might well be contaminated with modern DNA to some degree, but I'd say there's a much better explanation for these signals of non-trivial genetic substructures within the Thracian population.

Archeology suggests that during the Bronze Age the Balkans were invaded from the east by nomads associated with the Yamnaya culture of the Pontic-Caspian Steppe. These invaders, possibly of early Indo-European stock, liked to build tumuli (burial mounds) for their important dead, essentially copies of the kurgans built by the Yamnaya and related peoples.

Moreover, we now know that indigenous European hunter-gatherer (HG) ancestry survived best in Eastern Europe (see here), so it's very likely that the aforementioned invaders from the steppe were significantly HG-like in terms of genetic structure.

Therefore, it doesn't appear to be a coincidence that K8, who was buried in a richly furnished tumulus (essentially a kurgan), was genetically more similar to indigenous Europeans than P192-1, who was genetically more Near Eastern-like and was basically thrown into a ditch after he died.

In other words, perhaps K8 belonged to a ruling class of steppe origin, while P192-1 was largely of native Balkan stock, whose ancestors were conquered centuries earlier by the steppe nomads and forced to live as an underclass? If so, it wouldn't be the only time in history that this sort of thing has happened, especially within Indo-European societies.

By the way, I unfortunately have to add that the Principal Component Analyses (PCA) in this paper featuring the two HG genomes, ajv70 and La Brana-1, are simply woeful (PDF link). These genomes should be clearly outside the range of modern European genetic variation, but here they land among the Orcadian and French samples. Where was the peer review, I wonder?


Sikora M, Carpenter ML, Moreno-Estrada A, Henn BM, Underhill PA, et al. (2014) Population Genomic Analysis of Ancient and Modern Genomes Yields New Insights into the Genetic Ancestry of the Tyrolean Iceman and the Genetic Structure of Europe. PLoS Genet 10(5): e1004353. doi:10.1371/journal.pgen.1004353

