Tuesday, May 13, 2014

PCA projection bias in ancient DNA studies

Many Principal Component Analyses (PCA) in papers on ancient genomes clearly suffer from projection bias. However, most people don't seem to understand this problem and the impact it can have on the interpretation of the data.

Here's a demonstration of this effect using two PCAs. In the first PCA, La Brana-1, a Mesolithic genome from Iberia, was projected onto the PC eigenvectors computed with modern individuals from the HGDP. However, in the second PCA the ancient genome was run together with these samples. Note the clear difference between the two outcomes.

The second outcome does look a bit strange, but it's actually the correct one, because it's now an established fact that Mesolithic hunter-gatherers, like La Brana-1, were clearly outside the range of modern European, and indeed West Eurasian, genetic variation.
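The effect is easy to reproduce with a toy simulation. The NumPy sketch below is not the PLINK analysis described in this post. It assumes two made-up populations that differ slightly at every marker, Gaussian noise in place of real genotypes, and many more markers than individuals. The function names (`pc_scores`, `simulate`, `compare`) and all parameter values are hypothetical. It fits a PCA on the reference individuals, projects one extra individual onto PC1, and then reruns the PCA with that individual included.

```python
import numpy as np


def pc_scores(train, new=None, k=1):
    """Fit a PCA on `train` (individuals x markers) and return the scores of
    the training individuals, plus the scores of `new` projected onto the
    same axes (or None if `new` is not given)."""
    mean = train.mean(axis=0)
    centered = train - mean
    # Rows of vt are the principal axes of the centered training data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:k].T
    train_scores = centered @ axes
    new_scores = None if new is None else (new - mean) @ axes
    return train_scores, new_scores


def simulate(n_per_pop=20, n_markers=2000, shift=0.12, seed=0):
    """Two toy populations whose means differ by 2 * shift at every marker
    (an assumption made purely for illustration)."""
    rng = np.random.default_rng(seed)

    def draw(n, sign):
        return sign * shift + rng.standard_normal((n, n_markers))

    reference = np.vstack([draw(n_per_pop, +1), draw(n_per_pop, -1)])
    held_out = draw(1, +1)  # one extra individual from the first population
    return reference, held_out


def compare(seed=0):
    reference, held_out = simulate(seed=seed)
    n_a = reference.shape[0] // 2
    ref_scores, projected = pc_scores(reference, held_out)
    joint_scores, _ = pc_scores(np.vstack([reference, held_out]))
    return {
        # Mean |PC1| of the first population in the reference-only PCA.
        "reference_pop_a": float(np.abs(ref_scores[:n_a, 0]).mean()),
        # |PC1| of the extra individual when projected.
        "projected": float(abs(projected[0, 0])),
        # |PC1| of the same individual when it is part of the PCA.
        "joint": float(abs(joint_scores[-1, 0])),
    }


if __name__ == "__main__":
    print(compare())
```

In this toy setting the projected individual should land much closer to zero on PC1 than the members of its own population, while the same individual run jointly gets a score comparable to theirs. Absolute values are used because the sign of a principal component is arbitrary.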

For a technical discussion of this problem, which is also sometimes known as "shrinkage", refer to Lee et al. 2012. To get an idea of the confusion that it can cause, see the discussion in the comments section under my last blog post:

More info on two Thracian genomes from Iron Age Bulgaria + a complaint

The above experiment with La Brana-1 was run with PLINK 2, which is freely available here, using just over 16K SNPs. Only markers with a read depth of 4x or higher were considered, and the marker set was further pruned to account for no-calls (--geno 0.005), LD (--indep-pairwise 200 25 0.4), and minor allele frequency (--maf 0.05).
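For readers who want to see what the no-call and minor allele frequency filters do, here is a rough NumPy sketch. It is not PLINK code. It assumes a genotype matrix coded as 0, 1 or 2 copies of one allele, with `np.nan` marking no-calls, and the function name `filter_markers` is hypothetical. The thresholds mirror the --geno 0.005 and --maf 0.05 settings. The read depth filter and LD pruning are left out of this sketch.

```python
import numpy as np


def filter_markers(genotypes, max_missing=0.005, min_maf=0.05):
    """Return a boolean mask of markers (columns) to keep.

    genotypes: individuals x markers array of allele counts (0, 1, 2),
    with np.nan for no-calls.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)
    missing_rate = 1.0 - called.mean(axis=0)
    n_called = called.sum(axis=0)
    allele_count = np.nansum(g, axis=0)
    # Markers with no calls at all give 0/0 here; the resulting NaN fails
    # the comparison below, so those markers are dropped.
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = allele_count / (2.0 * n_called)
        maf = np.minimum(freq, 1.0 - freq)
        keep = (missing_rate <= max_missing) & (maf >= min_maf)
    return keep


if __name__ == "__main__":
    toy = np.array([
        [0, 0, 1, np.nan],
        [1, 0, 1, 2],
        [2, 0, 0, 1],
        [1, 0, 2, 0],
    ])
    print(filter_markers(toy))
```

In the toy matrix the second marker is monomorphic and the fourth has a no-call in one of four individuals, so only the first and third markers pass.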


Maju said...

Thank you very much for the experimental demonstration, David.

Maju said...

BTW, does this also affect the EEF samples, I wonder?

Davidski said...

Yes it does, but probably to a much lesser degree, because they're a significantly better fit for Sardinians and even most other southern Europeans than European hunter-gatherers are for any present-day Europeans.

I'll test this when I get my hands on Stuttgart, which should be soon.

Maju said...

Cool. It'll be interesting to see it.

Maju said...

On second look, I find also interesting that the already discussed cross-shape along two axes (Sardinian-Russian, Basque-Adigey) when West Asians are not included in the sample happens here again and that, logically, Braña 1 scores high in both the Russian and Basque poles of these axes. This pattern is perceptible in both the projected and the natural eigenspaces, even if in different ways.

However EV1 (horizontal axis, I presume) plots Bra1 almost exactly at the same non-Sardinian score as Kargopol Russians (who may well be genetically Finnic, considering their geography), which surely represents the almost absolute lack of Neolithic genetics (EEF) in both populations. On the other hand EV2 suggests (to my eyes at least) that the Bra1 specificity (i.e. not the generic Euro-HG score but a more specific subpopulation aspect of it) may be more present in modern Basques than in the other sampled populations, much as Loschbour's seems to be more present in the modern French.

In a sense, both the Russian and Basque poles of "greater Europeanness" seem to converge at the extreme in Bra1 here, although they must mean different subtleties and in different degrees.

I find this experiment really interesting and I must say that I strongly prefer the non-projected PCA to the projected ones. I really hope that other similar tests will follow for other ancient samples: it's a most valuable visualization tool. Thanks again.

Shaikorth said...

Maju, the Kargopol/HGDP Russians are genetically most similar to Erzya/Moksha despite being relatively far away from Mordovia in most experiments including admixture calculators, PCAs, Mclust and Finestructure.

Southern Russians are basically Ukrainians. On these two PCAs they'd be mingled with Kargopol Russians, but in the Finestructure run of the Lazaridis paper they'd be in the Ukrainian/Belorussian subcluster and not in the Kargopol Russian/Mordovian cluster.

Maju said...

Alright, Uralic anyhow and not core-Russian (as from Moscow, Leningrad or Volgograd).

That said, it's possible that core-Russians score similarly. I just can't know from this dataset.

Matt said...

Maju's comments are interesting. Yeah, this PCA represents La Brana as more amplified in the Basque vs Adygei dimension, while it sits with the Kargopol Russians on the dimension distinguishing Kargopol Russians and Adygei from Sardinia.

It's a lot better at representing La Brana's genetic distance to other populations than the projected PCA where it's N=0, and the dimensions, although generally similar in shape, are not weighted to represent it at all (and so the other populations in the projected PCA are less "biased towards 0"? Although this effect is small across the large number of samples).

Compare to the more generalised West Eurasian PCAs Eurogenes has run before, where La Brana sits much higher than any present day population on the Saudi-Lithuanian axis, then around where France does on the Adygei-Sardinian dimension.

Does the PCA here tell us more or less? The generalised West Eurasian PCA samples more populations, but do we know for sure that the trends to variation between those populations (and so the PCAs they form) are as relevant to La Brana?

Davidski said...


It depends what you're testing. Usually it's better to run as many samples as possible, but in some exceptional circumstances all you really need are a handful of individuals.

So it's difficult to argue that one correctly run PCA is better than another correctly run PCA. But it's very easy to argue that any correctly run PCA is better than one that is not run correctly.

Shaikorth said...

Maju, while this isn't a PCA but a MDS of Europe and has Turks proxying for Adygei as the "anti-Basque", it has general Russians (RU), Mordvins and HGDP North Russians. Their relation isn't much different in a PCA with the same populations. (large image)

Davidski said...

I changed the shape of the plots slightly, to help people appreciate that the horizontal axis represents eigenvector (or PC) one.

Maju said...

"Does the PCA here tell us more or less? The generalised West Eurasian PCA samples more populations, but do we know for sure that the trends to variation between those populations (and so the PCAs they form) are as relevant to La Brana?"

They are just two different approaches. Probably, as you incorporate more and more distinctive West Asian pops., notably Palestinians and such, who may be key in altering the graph's organization because of their stronger distinctiveness (not just to Europeans but also to Northern West Asians), the shape will generally tend to become more and more V-like and the Basque/Sardinian distinctiveness will get blurred.

I'm speaking from experience only, of course. It's still possible that different datasets or even specific algorithm variants may cause other alterations, but in general terms this seems to be the main pattern: cross-shaped for Europe with clear Basque/Sardinian distinction in one extreme (Europe only or similar) and V-shaped for West Eurasia with blurred out Basque/Sardinian distinction.

Shaikorth's example may be a transitional one. It's not "rocket science" and we should not expect any single PCA or equivalent to provide a single synthetic snapshot of all the complexity involved in autosomal DNA comparison, just the main one for that particular dataset.

Onur Dincer said...

David, could you carry out a similar experiment with ADMIXTURE analyses rather than PCAs?

Chad said...

Great work, David! I am wondering where the French samples were taken. It appears that many French are lacking as much ANE as was reported. It looks like they can be modeled as having almost zero ANE ancestry. I am sure Northern France has to be different. Southern France, I am sure, is going to be similar to the Basque. The Southern French are almost the same as the Basque in terms of EEF ancestry, so it is not much of a shock. The second PCA plot looks great and spot on in terms of ranging everyone from Sardinians/EEF, to La Brana/WHG.

One thing that I am curious about is if Loschbour will change anything. The fact that the French and the Basque sit in the same area still has me believing that Loschbour and La Brana are largely indistinguishable. Will the genome on Loschbour be available soon? Thanks, for all of your hard work.

Krefter said...

"The fact that the French and the Basque sit in the same area still has me believing that Loschbour and La Brana are largely indistinguishable. Will the genome on Loschbour be available soon? Thanks, for all of your hard work."

Loschbour had La Brana-1-like west European hunter gatherer ancestry and around as much central-north European hunter gatherer ancestry, which it seems is what Motala12 and the Swedish Baltic hunter gatherers were, minus their ANE.

We really need some Mesolithic genomes from the Balkans, northeast Europe (preferably Russia and central Asia-Siberia), and Italy to understand the genetic landscape of Europe during the Mesolithic. I am very curious to know what Y DNA Mesolithic Balkaners and northeast Europeans had since they certainly were not mainly hg I, like Mesolithic west-central-north Europeans.

Krefter said...

Davidski, Genetiker already has the DNA of the stone age Swedes from Skoglund 2014, and is doing very interesting things with them.

Krefter said...

Here are likely Y DNA haplogroups of all the confirmed males from Skoglund 2014.

This is my source for the Y SNP calls.

I only listed positives and negatives for the samples from Skoglund 2014, if I thought they were significant.

Ajv59, hunter gatherer PWC culture 4,900-4,600 cal B.P, Ajvide, Eksta, Gotland Sweden: ?(E-, G-)

Gok4, farmer TRB culture, 5,050-4,750YBP Frälsegården Sweden: I2 (I2a1a L159.1/S169.1-, I2a2a L622-, and I2a2a1c1 L701-)

Ire8, hunter gatherer PWC culture 5,100-4,150 cal. B.P, Ire, Hangvar, Gotland Sweden: F(G-, I1 CTS11042/S66-, I1a-DF29/S438-, I1a2b1 Z2541-, I1a3a2 S15301-, I2a2a L622-, I2a2a1b-L1229-, I2a2a1c2a2a1a Z190-)

Sf11, hunter gatherer, 7,500-7,250YBP, Stora Karlso Sweden: F(G-, I1 L121/S62-, I1a2 S244/Z58-, I1a2b PF2805.2/Z2540.2-, I2a1b L178/S328-, I2a1b M423-, I2a2 L37/PF6900/S153-, I2a2a P223/PF3860/S117-, I2a2a1c2 Z161-, I2a2b L38/S154-, I2a2b L39/S155-)

Ajv52, hunter gatherer PWC culture, 4,900-4,600 cal B.P Ajvide, Eksta, Gotland Sweden: I2a2a1-CTS616!!!!!!!!

Ajv70, hunter gatherer PWC, 4,900-4,600 cal B.P, Ajvide, Eksta, Gotland Sweden: Probably F, but maybe C(I-, G-, C-V183+, C-P184-,F-P146/PF2623+)

Ajv58, hunter gatherer PWC, 4,900-4,600 cal B.P, Ajvide, Eksta, Gotland Sweden: I2a1*(I2a1a-, I2a1b-, I2a1d-, I2a1e-).

Loschbour, 6,000BC Loschbour Luxembourg: Y DNA=pre-I2a1b or brother lineage to I2a1b(I L41+, I PF3742+, I M258+, I M170+, I P389+, I2 L68+, I2 M438+, I2a L460+, I2a1 P37.2+, I2a1b M423+, I2a1b CTS8239+, I2a1b CTS7218+, I2a1b CTS5985+, I2a1b L178+, I2a1b CTS1293+, I2a1b CTS176+, I2a1b CTS5375-, I2a1b CTS8486-, I2a1b1 M359.2-, I2a1b2 L161.1, I2a1b3 L621-)

Motala2, 6,000BC Motala Sweden: Y DNA=I* (I P38+, I PF3742+, I L41+, I1 S108-, I1 L845-, I1 M253-, I2a1b CTS1293-, I2a2 L37-)

Motala3 6,000BC Motala Sweden: Y DNA=I2a1b*(I M258+, I PF3742+, I2 L68+, I2a1 P37.2+, I2a1b CTS7218+, I2a1b CTS1293+, I2a1b CTS176+, I2a1b1 M359.2-, I2a1b3 L621-)

Motala6 6,000BC Motala Sweden: Y DNA=? (Q1 L232- Q1a2a L55+)

Motala9 6,000BC Motala Sweden: Y DNA=I* (I P38+, I1 P40-)

Motala12 6,000BC Motala Sweden: Y DNA=pre-I2a1b or brother lineage to I2a1b(I PF3742+, I M258+, I M170+, I2 L68+, I2a L460+, I2a1 P37.2+, I2a1b CTS7218+, I2a1b CTS5985+, I2a1b L178+, I2a1b CTS1293+, I2a1b CTS176+, I2a1b CTS5375-, I2a1b CTS8486-, I2a1b1 M359.2-, I2a1b3 L621-)

La Brana-1,~7940-7690YBP, La Braña-Arintero, Leon Spain: C1a2-V20(no reason to list results for haplogroup defining SNPs he was tested for).

Krefter said...

I know my list is kind of sloppy, but it does show that there was some noticeable Y DNA diversity in stone age European hunter gatherers. 11/11 have BT, 9/10 have F, 7/9 have hg I, 1/8 have C1a2-V20, 5/7 have I2a, 4/7 have I2a1, 1/9 have I2a2, 1/10 have I2a2a1, 3/9 have I2a1b (most likely pre-I2a1b). 10 have been tested and are negative for at least one of the I1 defining mutations (there are 25) or were found to have another haplogroup.

Likely hunter gatherer descended I2 has also been found in farmer Y DNA samples: Two I2a1 (likely I2a1a1-M26) ~5,000 year old samples were found in farmers from Treilles, France, and two I2a1 (likely I2a1a1-M26) were found in ~4,750-4,725 year old Megalithic burials in northwestern France. ~5,050-4,750 year old TRB Swedish hunter gatherer Gok4 had I2 but not I2a1a, which may mean his I2 lineage is descended from northern hunter gatherers.

Ajv58 may have had I2a1c-L2888 which is very exclusive to northwest Europe (like I2a1e and I2a1d), and especially popular in the British Isles. It's not surprising a high amount of Mesolithic northwest Europeans had I2a1 and pre-I2a1b, because I2a1 and subclade I2a1b today are most diverse in western Europe (especially northwest Europe), but in eastern Europe nearly all of the I2a1 is under the deep and young subclade I2a1b3a L147.2. I2a2 and I1 are both probably descended from Mesolithic west Europeans, and were picked up by Indo Europeans who arrived in the metal ages. I am now convinced that I2a was born in the Solutrean culture or they at least had members, and that Y DNA I existed in pre-LGM Europeans, probably members of the Gravettian culture (who we already know through ancient DNA had mtDNA U5, U2, and U8). Y DNA I may be west European, since there is such a low amount of diversity in eastern Europe (most or all came from the west in post-Mesolithic times). This is why I am desperate for Y DNA samples from Mesolithic Russia. I bet they had R1a, N1c, and/or some type of P.

Krefter said...

Gok4 was a farmer, my bad.

Davidski said...

Indeed, they're now available here...

I'll download them next week.

Davidski said...

By the way, that genetiker guy appears to be clinically insane.

Krefter said...

"By the way, that genetiker guy appears to be clinically insane."

I have a problem with him giving big labels that are 100's of years old to define people, that are somewhat accurate but not established yet and are probably not pure branches of the human family. Plus he has minor white-racist tendencies. I don't think he is insane, just he has some problems like everyone that can be corrected.

He is still showing real DNA results for the ancient Swedes.

Krefter said...

How do you download the DNA in that link?

Davidski said...

The files are stored here...

Davidski said...

No wait, that bozo genetiker is probably legally insane.

Krefter said...

After searching for depictions of dark skinned people in stone age Europe (hunter gatherers of course), for just a few minutes, I hit the jackpot.

Cave paintings from Iberia known as Levantine art, have depictions of men with dark brown or black hair, beards, long hair, and dark brown skin.

They look a lot like La Brana-1.

It is hard to believe that Mesolithic Europeans, who were literally pure Europeans, probably had deep brown skin. Now I think European light skin one way or another descends from EEF, and that light hair(decent amounts)+even lighter skin+revival of light eyes(became connected with light hair) evolved after the Neolithic in the main ancestors of all north Europeans.

I have not been able to find an age for that cave painting. I really hope it's legit because that will finally put this debate to rest.

Maju said...

Neolithic rock art, for Chaos sake!

Krefter said...

"Neolithic rock art, for Chaos sake!"

That's disappointing. I knew there was a low chance that painting was Mesolithic, it was just too good to be true.

I found that it dates to 4,000-2,000BC. Plus it may not be its original coloring. How stupid were prehistoric people? How hard was it for them to simply paint themselves? How could there be so many known cave paintings in Europe, yet not one detailed painting of a human?

Maju said...

That kind of art is Neolithic. It's somewhat similar to what we find in North Africa in the same period (and later among Bushmen).

In the European Paleolithic there are very few instances of human representation but there are some: Lespulges woman's head is a rare case of semi-realistic female representation, in the famous scene of the bison hunt of Lascaux a schematic man falls dead to a much more realistic bison, which also seems injured. But the most important and often ignored self-representation of Paleolithic Europeans comes from La Marche (Ardéche), where a large number of floor slabs had faces engraved on them, possibly the various inhabitants of the cave. The style is a bit cartoonish though.


You won't see your obsessive skin color in any of them, nor in the Neolithic ones either - they just used a limited range of pigments (ochre and coal basically).

"Stupid?" Nope they were not the least concerned about your opinions on them or their art. They had their own reasons, naturally.

Krefter said...

My favorite depiction of a human from the stone age comes from the Gravettian people at the Dolni Vestonice site (has the oldest U5 samples), and is 26,000 years old.

Unlike most prehistoric depictions of humans it's realistic, and not abstract, spiritual, Venus, mystical crap.

It should be the logo for WHG-ANE hunter gatherers.

" But the most important and often ignored self-representation of Paleolithic Europeans comes from La Marche (Ardéche), where a large number of floor slabs had faces engraved on them, possibly the various inhabitants of the cave. The style is a bit cartoonish though."

Has La Marche been dated since its discovery in 1937?

I kind of doubt all of the carvings are legit, because they look very modern. The hats, boots, purses(?!!), and clean shaven men with short or medium hair, don't make sense for Upper Palaeolithic people. But since many were carved on limestone slabs and some now extinct animals were depicted, they may be legit.

This guy looks a lot like you.

"You won't see your obsessive skin color in any of them, nor in the Neolithic ones either - they just used a limited range of pigments (ochre and coal basically)."

I would call it an interest not an obsession. I mostly post about other things. It's not my fault I live in a culture which puts so much emphasis on skin color, not surprisingly they rub off on me. Just like it's not your fault you're so interested in the Basque, and things associated with them like R1b.

Besides it is not just my culture, it doesn't matter if you ask Chinese people, an isolated Papuan tribe, or ancient Mayans, they will all tell you that skin color is one of the most significant differences in physical appearance between human populations, and I think it is interesting to know the history and evolution of skin color in humans. I am also interested in it because I would like to know what my great great great great..... grandparents looked like in person, and what I got from them.

Krefter said...

Pigmentation of stone age Swedish hunter gatherers and farmers.

Spreadsheet of stone age European hunter gatherers' and farmers' genotypes in pigmentation SNPs of the 8-plex and HIrisPlex systems, and SNPs of blue eye haplotypes.

Map of Mesolithic-Neolithic European hunter gatherer DNA

The new results are consistent with older samples (Loschbour, La Brana-1, Motala12, Stuttgart, Otzi). Except farmer Gok2 had light eyes, which is probably because he had around 40% total WHG, while Otzi and Stuttgart had around 20%. The hunter gatherer ancestry of the Swedish TRB farmers varies significantly (some had more than Gok2 and around as much as modern north Europeans), and Gok2 is especially close to Ajv58 when compared to La Brana-1 and Sf11, which probably means TRB farmers were constantly mixing with local Swedish hunter gatherers. I am sure some of the Swedish farmers including Gok2 (who had a hunter gatherer paternal lineage, hg I2) had Swedish hunter gatherer grandparents.

All of the hunter gatherer and farmer samples according to the HIrisPlex system were definitely dark haired. The hunter gatherers were probably uniformly dark haired, because a very low percentage (1/4) had the 374f mutation. If anything the farmers were lighter haired than the hunter gatherers, because most probably had the 374f mutation.

The hunter gatherers were probably dark skinned while the farmers were probably light skinned.

Gok2 is the first stone age European sample who was certainly pigmented like a typical northern European; light skin, dark hair, and blue eyes. Maybe farmers in northwest Europe were heavily WHG compared to southern ones, and then Indo Europeans raised their WHG and ANE percentages even more to create modern northwest Europeans. Northwest European's hunter ancestry doesn't have enough affinities to eastern Europe for them to be a simple admixture of Sardinian like farmers and Indo Europeans.

I don't know though, that was just an idea that went through my head. When I have time I really want to investigate all this stuff.

I know I am repeating myself but a breakthrough has been made by looking at stone age European pigmentation. We now know light hair rose in popularity AFTER the farmers and hunter gatherers mixed. There was major depigmentation that occurred especially in northern Europeans' ancestors: their skin got lighter, their hair got lighter, and light eyes became attached to light hair and rose in popularity with it.

Krefter said...

Pigmentation SNPs of MA1 and AG2.

The results confirm that MA1 had dark eyes unlike Mesolithic and Neolithic European hunter gatherers. The results are consistent with MA1 and AG2 having dark hair, but there are not enough SNPs to be sure.

AG2 had the skin-lightening mutation rs1426654 A/A, like Motala12, Stuttgart, Otzi, Gok2, and bronze-iron age Siberian Indo Iranians, proving this mutation is very ancient in west Eurasians. AG2 having this mutation is consistent with a study which estimated this mutation to be 22,000-28,000 years old. I think it is older though because it existed in WHG, ANE, and Middle Easterners (brother to WHG+basal Eurasian), who are the three main ancestral groups of modern west Eurasians.

MA1 had rs1426654 G/G, like most European hunter gatherer samples, and is evidence the majority of WHG-ANE hunter gatherers did.

Krefter said...

Y SNP calls for AG2.

The results confirm that he had hg F, and probably P. He has the only P mutation he was tested for, P-L781/PF5875/YSC0000255. He is Q1a1-F1215+, but was not tested for anything in between P and Q1a1. AG2 is R1-P245/PF6117+ and R1a1a1-Page7+, but he was R-P224/PF6050-, R1-P286/PF6136-, and R1a1-L122/M448/PF6237-.

Now I understand why others besides Genetiker who did the same tests said he belonged to either R1a1 or Q1a.

AG2 confirms that hg P was popular and probably largely spread with ANE populations.

P was all over Eurasia before the Neolithic. R1b and probably R1a were in west Asia and or central Asia-eastern Europe, R2 was in south Asia, Q was in south Asia-central Asia-Siberia-and the Americas, and P* lineages were in south Asia.

Krefter said...

Calls for mtDNA SNPs of 17,000YBP Siberian AG2.

AG2 is a member of mtDNA R and could not be placed in any R subclade he was tested for, notably west Eurasian R2′JT and nearly every R0 subclade including H and HV0. It is really disappointing that he could not get calls for U SNPs, but we do know he did not belong to any of the known U subclades: U5, U6, U1, and U2’3’4’7’8’9. The same is true for MA1 though, because he belonged to his own U subclade which has not been found in modern people.

He was not tested for U-MA1 mutations, and Genetiker will probably find out soon whether he had U-MA1.

Krefter said...

You should read this. Among other things R1a Z93 has been confirmed in bronze age Mongolia, and is without a doubt of Indo Iranian origin.