Thursday, September 5, 2013

A multidimensional view of East Asia

Asia isn't the focus of my project, but I thought many readers would find these PCA plots interesting:

Basically, there are three main poles of genetic variation on these plots: Northeast Siberian (Koryaks and Chukchi), East Asian (Japanese and Korean) and Southeast Asian (Malayan).

Overall, the Northeast Siberians appear to be the most distinct group, and that's because they're more closely related to some Amerindians (like Greenlanders) than even other Siberians. Interestingly, the two Koreans cluster firmly with the Japanese across the first two PCs, but are clearly separated from them in PC 4.

It's also worth noting that the Han Chinese sample from Beijing (from the HapMap project) doesn't look particularly homogeneous, with some individuals overlapping with the Japanese and others with the Vietnamese.

See also...

PCA of the world


Maju said...

Thanks, David. One thing that catches my attention is that even PC1 captures less than 2% of the variability. That is very little and means that the differences are all quite subtle. I had not noticed before, but in the West Eurasian graphs the situation is even worse, with PC1 capturing only 0.9% of the variability.

Overall in East Asia, PC1 and PC2 together capture only 2.5% of the variability, while in West Eurasia it is a mere 1.2%. Can we really draw any conclusions from that?

On the other hand, I know that inter-group variability makes up roughly just 10% of all inter-personal variability (at the global level). I can only imagine that the percentage is even lower in such homogeneous areas as East Asia or West Eurasia. Should we therefore "correct" the figures in order to express inter-group variability more properly? If so, using what kind of equation?

Davidski said...

You're right, single PCs capture very little variation. But the thing is that the first few PCs capture a lot of variation relative to all the other PCs, and the first PC is usually by far the most important in this regard. What this means in practice is that we can usually just focus on the first one, two or perhaps three PCs, and still come up with solid conclusions. For instance, I'm yet to see any evidence that the north to south differentiation within Europe and West Eurasia isn't by far the most important.

Here's a more complex, but also more informative, way of saying what I just said (the paper is open access)...
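As a rough illustration of why a PC with a small absolute share can still dominate, the sketch below compares PC1's share of the total variance with the average share per PC. The eigenvalues are made up for the example (they are not taken from these plots), and the function names are hypothetical.

```python
def variance_shares(eigenvalues):
    """Return each PC's fraction of the total variance."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def ratio_to_average(eigenvalues, index=0):
    """Return how many times larger one PC's share is than the average share."""
    shares = variance_shares(eigenvalues)
    average_share = 1.0 / len(eigenvalues)
    return shares[index] / average_share


# Hypothetical spectrum: 1000 PCs, the first two stand out, the rest are equal.
eigenvalues = [20.0, 6.0] + [1.0] * 998
shares = variance_shares(eigenvalues)

print(f"PC1 share of total variance: {shares[0]:.2%}")
print(f"PC1 relative to the average PC: {ratio_to_average(eigenvalues):.1f}x")
```

With a spectrum like this, PC1 holds just under 2% of the total variance yet is roughly twenty times larger than a typical PC, which is the sense in which the first few PCs matter most.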

Ebizur said...

PC1 is mainly a representation of IBD (isolation by distance) across latitude. It exhibits a very strong negative correlation with latitude, though the spread is disproportionately great in the northern half of East Asia compared to the southern half (which I suppose might be due to an increased rate of gene flow among populations in the southern half of East Asia, or perhaps some confounding external admixture in populations of the northern half of East Asia).
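A claim like this could be checked with a plain Pearson correlation between each population's mean PC1 score and its approximate latitude. The sketch below is only an illustration: the latitudes and PC1 values are invented placeholders, not readings from the plots, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder values (assumed, for illustration only): one entry per population.
latitudes = [3, 11, 21, 31, 40, 60]
pc1_scores = [0.05, 0.04, 0.02, 0.00, -0.02, -0.06]

print(f"r(PC1, latitude) = {pearson_r(latitudes, pc1_scores):.3f}")
```

A value close to -1 would support the reading of PC1 as a latitude axis. Running the same check separately on the northern and southern populations would show whether the spread really differs between the two halves.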

The interpretation of PC2 is less straightforward. One pole is represented by the Japanese, and the other pole is represented by Malays, Cambodians, and Koryaks (though some members of these populations appear to be somewhat moderate in regard to PC2). My best guess is that PC2 maps roughly to longitude, again with a negative correlation. In the case of PC2, it seems to be not so much geographical longitude, but rather some sort of "genetic longitude" (West Eurasian vs. East Eurasian) that is at play, with the Malays, Cambodians, Koryaks, and Chukchis obtaining such high values for PC2 on behalf of their presumably greater Western Eurasian-affiliated admixture, and the Japanese obtaining their extremely low values for PC2 on behalf of their presumably purer East Eurasian ancestry.

Davidski said...

None of these Koryaks or Chukchis have any detectable European ancestry. They're actually very similar to unadmixed Greenlanders. See here...

The most European and/or West Asian of these samples are the Mongolians and Altaians, although I removed all the outliers with significant European ancestry.

The Malayans and Cambodians do have some Indian ancestry, but again I removed all the excessively admixed samples.

Maju said...

I also do not think it's about "purity" of any sort. PC2 obviously is something that Malays and East Siberians (and presumably Native Americans, I'd venture) share, and I doubt it's something extraneous to East Asia. Yet I do not know what it is. It is something extraneous to the Japanese, who are the most isolated (not purest) East Asian population, and therefore something shared to some extent by all mainland East Asians, maybe some sort of substrate layer diluted less at those extremes than in the center of the region.

Sometimes I and others perceive certain similarities in appearance between Native Americans and SE Asians, in contrast with your typical Han Chinese or Japanese; maybe this is the genetics behind that. It would not be something about being mixed with an outsider population but more likely some deep structure within the region, which I would guess is maybe linked (not strictly but more diffusely) to Y-DNA C.

Japanese are outstanding in PC4 (negative pole: Nganasans?), probably because of the island isolation and admixture with ancient Jomon aboriginals (Y-DNA D1, etc.), who should also have been quite isolated.

PC3 seems to be about what gathers together the vast majority of East Asians with a few exceptions. Who are they: Koryaks and Chukchi? Seems so.

Davidski said...

Perhaps the Japanese are drifted in a specific way and the Malayans and Koryaks share less of that drift than the Koreans, Mongolians, Chinese, and so on?

Maju said...

But it is a positive correlation: they are in the same positive zone of PC2, which should mean that they share a decent amount of autosomal genetics, and not just in contrast to someone else but positively so.

I don't know what it is exactly, but it may be related, a bit obliquely in any case, to the distribution of the two main Y-DNA haplogroups of the region: NO and C. However, Malays are high in O and low in C, which is why I say "obliquely".

Davidski said...

I think that what they share is a relatively non-drifted East Eurasian component which has become heavily drifted in the Japanese. That doesn't mean the Koryaks and Chukchis aren't genetically drifted - they are, but they show it in the other dimensions.

Genetic drift, or lack of, is very important on these sorts of plots. It explains a lot of what we see on them.

Ebizur said...


Thank you for your explanation. However, I have a hard time believing in the concept of unadmixed Koryaks, Chukchis, Malays, or Cambodians. Samples of each of these populations have revealed some individuals bearing stereotypical Caucasoid or South Asian Y-DNA (R1b, J2, G2, H, L, T, etc.), and I think the same can be said for Greenlanders.

On the cultural side of things, most people who identify themselves as ethnic Koryaks or Chukchis actually can speak only the Russian language, and the languages of the Malays and Khmers exhibit heavy, long-term influence from Indo-Aryan languages.


If you want to look for correlates of PC2 in Y-DNA, I would consider Y-DNA haplogroup D as a more plausible candidate than Y-DNA haplogroup C. Y-DNA haplogroup C is at least as frequent among Japanese as it is among Malays or Cambodians, and also very diverse (Japanese contain representatives of both C1-M8 and C3-M217, two highly divergent subclades of C-M130). Furthermore, Y-DNA haplogroup C3-M217 is very frequent among Koryaks, but its frequency is much lower among the ethnolinguistically closely related Chukchis.

The distribution of Y-DNA haplogroup D is a much better fit for the pattern of PC2. It might suggest that earlier populations of East Asia had high values for PC2, like modern Koryaks, Chukchis, Malays, and Cambodians, and the populations in the center of East Asia were subsequently affected by an expansion of populations carrying Y-DNA haplogroup D that had low values for PC2, with the Japanese being the most heavily affected by the Y-DNA haplogroup D/low PC2 newcomers. (Alternatively, the hypothetical Y-DNA haplogroup D/low PC2 population might have been isolated in Japan for a long time, only emerging relatively recently to influence populations of nearby parts of mainland East Asia, but with that influence not extending so far as the Malays, Cambodians, Koryaks, or Chukchis.)

Davidski said...

It might be useful to cross check the East Asian PCA with the global PCA.

It's interesting that on the global PCA the Malays and Koryaks also line up in PC2, more or less. This indicates that they do share about the same amount of DNA from Western Eurasia. However, the Chinese appear clearly more East Asian than the Japanese.

Maju said...

"Y-DNA haplogroup C is at least as frequent among Japanese as it is among Malays or Cambodians"...

Sure, but this sample does not include more C-high populations like those of Wallacea, to which I believe Malays are close either by origin or admixture or both. That's the kind of thing I meant.

What I really mean is that in the early stages of East Asian colonization there may have been some duality (or multiplicity, but PCs are unidimensional and hence bipolar) still at the SE Asian stage of it, and that PC2 may be capturing a residual signal of that, maybe reinforced by Neolithic flows or whatever (surely not in Siberia, of course).

Similarly, the WEA PC2 seems to suggest some ancient affinities or structure, like Highlander West Asians with Eastern Europeans, which are blurred only when bidimensionalized (i.e. contrasted with the more continental-like PC1).

" Y-DNA haplogroup D as a more plausible candidate"

That may be for PC4, where Japanese are outstanding in the positive pole, but Tibetans are outstandingly neutral, so it is not too obvious. Here I'd lean toward island isolation and drift, both at the Jomon stage and later, of which D1 is an element but that's it.

In any case, if PC2 represents here West/South Eurasian admixture (which I doubt), it can be checked by adding a few such samples (not many, as they would distort the results; just two or so). If real, they should be very high on that PC axis; if not, they should be quasi-neutral (and I honestly expect this result).
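A related check, and a different technique from adding samples to the run, is to project the extra samples onto the PC2 axis that has already been computed, so that they cannot distort the axes at all. The sketch below assumes the per-SNP means and PC2 loadings from the original PCA are available; every name and number in it is hypothetical and only illustrates the arithmetic.

```python
def project(genotypes, means, loadings):
    """Project one sample onto one PC axis.

    Centers the sample's genotypes on the reference per-SNP means,
    then takes the dot product with that PC's loadings.
    """
    if not (len(genotypes) == len(means) == len(loadings)):
        raise ValueError("genotypes, means and loadings must have equal length")
    return sum((g - m) * w for g, m, w in zip(genotypes, means, loadings))


# Hypothetical reference values for four SNPs (assumed, not real data).
snp_means = [1.0, 0.5, 1.5, 1.0]
pc2_loadings = [0.5, -0.5, 0.5, -0.5]

# Hypothetical genotype counts (0, 1 or 2 copies of an allele) for a test sample.
test_sample = [2, 0, 2, 0]

print("PC2 score of test sample:", project(test_sample, snp_means, pc2_loadings))
```

A projected West or South Eurasian sample landing far out on the positive side of PC2 would favor the admixture reading, while a score near zero would favor the quasi-neutral expectation.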

Matt said...

Cool stuff. East Asian populations tend to break down under a 3 way division in ADMIXTURE along these lines, with closely related East and Southeast Asian components and a more distant North Asian component.

Who are the Tibeto-Burman speakers here? Am I right in assuming they are the sampled populations from Chaubey (Lahus, Naxi, Tujias) but not including the Burmese or other groups who showed as having significant South Asian ancestry (because I would've expected that to form a principal component)?

Davidski said...

The Tibeto-Burman samples are indeed from Chaubey, and they're from Burma.

Tibeto-Burman_Burmese bumaBR110
Tibeto-Burman_Burmese bumaBR50
Tibeto-Burman_Burmese bumaBR54
Tibeto-Burman_Burmese bumaBR55
Tibeto-Burman_Burmese bumaBR56
Tibeto-Burman_Burmese bumaBR62
Tibeto-Burman_Burmese bumaBR64
Tibeto-Burman_Burmese bumaBR68
Tibeto-Burman_Burmese bumaBR69
Tibeto-Burman_Burmese bumaBR78
Tibeto-Burman_Burmese bumaBR81
Tibeto-Burman_Burmese bumaBR83
Tibeto-Burman_Burmese bumaBR84
Tibeto-Burman_Burmese bumaBR98
Tibeto-Burman_Garo GA1
Tibeto-Burman_Garo GA13

Ebizur said...

The Garos are a distinctive population of Tibeto-Burman speakers who inhabit the Garo Hills of Meghalaya, just west of the Khasis, who are an equally peculiar tribe of Austroasiatic speakers. I have never heard of any Garos living in Burma, so two of the "Tibeto-Burman" data points should be considered as representing a population of Northeast India rather than Burma. It would be nice to know which two data points in the PC plot represent the Garo individuals.

In any case, we can say for certain that all the Tibeto-Burman populations in this PC plot (Yi, Garo, Burmese, and Tujia) fall between the Northern Han and Southern Han in regard to PC1, which generally seems to correlate with latitude. However, the Yi appear the most "northern" (low value for PC1) of the sampled Tibeto-Burmans, with one Yi individual actually clustering with Mongolic-speaking peoples of the PRC, followed by the Burmese/Garo and finally the Tujia, whose average for PC1 is very close to that of the Southern Han. This is a bit of a perturbation, because the Tujia at present are the northernmost of these populations, and the Burmese are the southernmost.

In regard to PC2, which I previously have hypothesized might be related to Western Eurasian affinity, most of the Burmese must have values roughly equal to those of the Altaians, Tuvinians, Vietnamese, and Dais, being somewhat lower than those of the Koryaks, Chukchis, and Nganasans in the north or the Malays, Cambodians, and perhaps Filipinos in the south.

Ebizur said...

I think it also should be pointed out that the Mongolic-speaking or Tungusic-speaking populations from the PRC generally seem to be genetically more similar to the Han Chinese majority population of that country than to their linguistic relatives in Mongolia and Siberia. The only one of these populations whose genetic affinity is rather ambiguous is the Oroqens, a small population from the northern fringes of Inner Mongolia and Manchuria, who speak a language that is very similar to that of the Evenks. The Oroqens are strung out along a vector between the Evenks and the Northern Han, without clearly tending toward either end.

A cursory comparison with data on the haploid DNA of the Evenks and Oroqens reveals that some Evenks belong to typically Western Eurasian haplogroups (Y-DNA R1a1 (including R1a1a-M17/M198 and R1a1a1b1a1-M458), Y-DNA R1-M173(xR1a1a-M17), Y-DNA I-M170 (including I2a1-P37), mtDNA J (including J1c5 and J2), mtDNA U (including U4a1), mtDNA H (including H1(xH1a) and H8)) that are not apparent in published samples of Oroqens. This agrees with my "PC2 = Western Eurasian affinity" hypothesis because the tested Evenks all have positive values for PC2, whereas the Oroqens all have negative values for PC2. Besides the aforementioned stereotypical Western Eurasian haplogroups, the Evenks also exhibit Y-DNA belonging to N1c1-Tat or N1c2b-P43 with much greater frequency than the Oroqens, though each of these subclades of haplogroup N also has been found in at least one Oroqen individual.

The main Y-DNA haplogroup shared among the Oroqens and the Evenks is C3b2a-M86, which is generally the modal Y-DNA haplogroup in each of these populations. C3-M217(xC3b2-M48), N1c2b-P43, and N1c1-Tat also are shared, but with lower frequency (in particular, C3-M217(xC3b2-M48) is more frequent among Oroqens than among Evenks, and N1c2b-P43 and N1c1-Tat are quite common among Evenks but rare among Oroqens). The remainder of the Evenk Y-DNA pool seems more-or-less Western, whereas the remainder of the Oroqen Y-DNA pool is composed of various subclades of haplogroup O-M175, namely O-M175(xO1a-M119, O2-P31, O3-M122), O2-P31(xO2a1-M95, O2b-SRY465), O3-M122(xO3a2a-M159, O3a2b-M7, O3a2c1-M134), O3a2b-M7, O3a2c1-M134(xO3a2c1a-M117), and O3a2c1a-M117.

On the mtDNA side, the haplogroups shared among the Evenks and the Oroqens are C, Z, D(xD5), D5, G(xG1a, G3), A(xA5), and F1b. C is overwhelmingly predominant (47%-72%) among Evenks, but it is found in only about 30% of Oroqens. D(xD5), D5, and G, on the other hand, are more frequent among Oroqens than among Evenks. A, F1b, and Z do not exhibit any particular tendency, but each of these haplogroups occurs with low frequency (<5%) among these populations. The Evenks also have a bunch of Western Eurasian mtDNA, and M7c has been found in at least one Evenk individual from Yakutia. B4(xB4a, B4b) and N(xA, N1a, N1b, N9a, R, W, X, Y1) have been found once each in a sample of Oroqens from Oroqen Autonomous Banner, Inner Mongolia, PRC.

In summary, among at least these two Tungusic-speaking peoples, it appears that subclades of Y-DNA haplogroup O-M175 and mtDNA haplogroups D and G correlate with East Asian (particularly Northern Han, but obliquely also Korean and Japanese) affinity, and Y-DNA haplogroups N1c2b-P43 and N1c1-Tat along with mtDNA haplogroup C and Western Eurasian haplogroups correlate with Northern Siberian (Koryak/Chukchi/Nganasan/Yukaghir) affinity.

Davidski said...

Please note, I updated the PCAs with new samples.

The old PCAs can still be accessed here...