Tuesday, May 13, 2014
PCA projection bias in ancient DNA studies
Many principal component analyses (PCAs) in papers on ancient genomes clearly suffer from projection bias. However, most people don't seem to understand this problem or the impact it can have on the interpretation of the data.
Here's a demonstration of this effect using two PCAs. In the first PCA, La Brana-1, a Mesolithic genome from Iberia, was projected onto the PC eigenvectors computed with modern individuals from the HGDP. In the second PCA, however, the ancient genome was included with these samples when the eigenvectors were computed. Note the clear difference between the two outcomes.
The second outcome does look a bit strange, but it's actually the correct one, because it's now an established fact that Mesolithic hunter-gatherers, like La Brana-1, were clearly outside the range of modern European, and indeed West Eurasian, genetic variation.
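The shrinkage effect itself is easy to reproduce with simulated data. Here's a minimal sketch in Python with NumPy; it isn't the analysis behind the plots above, and everything in it is an assumption made for illustration: two made-up populations, Gaussian noise standing in for genotypes, and arbitrary sample sizes and effect sizes. It computes PC1 from a reference panel, projects one held-out sample from population A onto it, and then reruns the PCA with that sample included.

```python
import numpy as np

# Toy setup (all values are illustrative assumptions, not real genotype data).
rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 5000
mu = rng.normal(0.0, 0.1, n_snps)  # per-SNP offset separating the two populations


def sample(sign, n):
    """Draw n samples from population A (sign=+1) or B (sign=-1)."""
    return sign * mu + rng.normal(size=(n, n_snps))


def first_pc(data):
    """Return (column means, unit-length PC1) for a samples x SNPs matrix."""
    center = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - center, full_matrices=False)
    return center, vt[0]


ref = np.vstack([sample(+1, n_per_pop), sample(-1, n_per_pop)])
new = sample(+1, 1)  # held-out sample drawn from population A

# Case 1: eigenvector computed from the reference panel only, new sample projected.
center, pc1 = first_pc(ref)
ref_scores = (ref - center) @ pc1
if ref_scores[:n_per_pop].mean() < 0:  # orient PC1 so population A is positive
    pc1, ref_scores = -pc1, -ref_scores
ref_a_mean = ref_scores[:n_per_pop].mean()
projected = float(((new - center) @ pc1)[0])

# Case 2: new sample included when the eigenvector is computed.
joint = np.vstack([ref, new])
j_center, j_pc1 = first_pc(joint)
joint_scores = (joint - j_center) @ j_pc1
if joint_scores[:n_per_pop].mean() < 0:
    joint_scores = -joint_scores
joint_new = float(joint_scores[-1])

print(f"mean PC1 score, reference population A: {ref_a_mean:.2f}")
print(f"held-out sample, projected:             {projected:.2f}")
print(f"held-out sample, run jointly:           {joint_new:.2f}")
```

With many more markers than samples, the projected sample is pulled toward the origin relative to reference samples from its own population, while the jointly run sample is not. Note that this toy sample comes from one of the reference populations, so it only illustrates the shrinkage, not the separate point that La Brana-1 falls outside modern variation.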
For a technical discussion of this problem, which is also sometimes known as "shrinkage", refer to Lee et al. 2012. To get an idea of the confusion that it can cause, see the discussion in the comments section under my last blog post:
More info on two Thracian genomes from Iron Age Bulgaria + a complaint
The above experiment with La Brana-1 was run with PLINK 2, which is freely available here, using just over 16K SNPs. Only markers with a read depth of 4x or higher were considered, and the marker set was further pruned to account for no-calls (--geno 0.005), LD (--indep-pairwise 200 25 0.4), and minor allele frequency (--maf 0.05).
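For anyone who wants to try something similar, here's a rough sketch of how the filtering and joint PCA could be strung together. The exact commands aren't given above, so treat this as an assumption: the binary name `plink` and the file prefixes `merged`, `pruned`, and `joint_pca` are hypothetical, and the read depth filter is assumed to have been applied upstream, before the data reached PLINK. The script only builds and prints the command lines; it doesn't run them.

```python
import shlex

PLINK = "plink"      # assumed binary name; adjust to your installation
DATASET = "merged"   # hypothetical prefix: ancient genome merged with the HGDP samples

# Step 1: apply the no-call and MAF filters, then list LD-pruned markers.
prune_cmd = [
    PLINK, "--bfile", DATASET,
    "--geno", "0.005",
    "--maf", "0.05",
    "--indep-pairwise", "200", "25", "0.4",
    "--out", "pruned",
]

# Step 2: run the PCA on the retained markers with all samples together,
# so the ancient genome contributes to the eigenvectors instead of being projected.
pca_cmd = [
    PLINK, "--bfile", DATASET,
    "--extract", "pruned.prune.in",
    "--pca",
    "--out", "joint_pca",
]

for cmd in (prune_cmd, pca_cmd):
    print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute the steps, assuming PLINK is on the path and the input files exist.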