Tuesday, March 22, 2011

Reconstructing the Ancestral North Indian (ANI) genome

Back in 2009, Reich et al. theorized that the current South Asian gene pool was basically made up of two founding genetic components: Ancestral North Indian (ANI) and Ancestral South Indian (ASI). The distilled ANI, they noted, was more similar to the genomes of modern Northwest Europeans than to those of the Adygei from the Caucasus. This is obviously out of whack with geography, but it does make sense based on what I've seen in my experiments on the Pakistani samples from the HGDP. Many of them, especially the Pathans, carry numerous segments, or haploblocks, that basically look North European. This gave me the idea to try to reconstruct the ANI genome from such fragments. The first chromosome of my composite sample, which I call the "ANI composite", is available for download here. It's a PLINK PED file in Illumina AB format with 19,261 SNPs.

Below are several PCA plots featuring the "ANI composite", obviously not including the HGDP samples used to make it (see below). Overall, it most closely resembles my reference samples from Eastern Europe. I have to admit that I was very pleased to see it behaving like a set of genotypes from a real human subject across many dimensions of genetic variation. PCA is very sensitive to anomalies, such as unusually long runs of homozygosity, so the fact that my composite can pass for a normal sample on these plots is fantastic.

So how did I do this? Well, it wasn't very difficult, but a bit tedious, so I need a break before continuing. I used information from my earlier experiments with ADMIXMAP, HAPMIX and RHH Counter to locate and delineate North European-like segments in phased Pakistani HGDP samples. I phased the data myself with BEAGLE, in a pool of South Asian and Middle Eastern samples, so as not to bias the results of phasing and imputation towards Northern Europe. In order to keep the alleles in phase when loaded into PLINK, I duplicated the haplotypes, producing completely homozygous individuals out of each one. Then I created an ANI composite dummy with 100% no calls, and loaded the haplotypes into this sample with a Python script. The first to load were the Pathan haplotypes, followed by the Burusho. I chose individuals from these two groups to make up the backbone of the putative ANI genome because they always seem to come out most "North European" in my ADMIXTURE and PCA/MDS runs compared to other South Asians. The empty spaces were filled with haplotypes from the Brahui and Balochi. Below is a list of all the samples used:

Pathan HGDP00213
Pathan HGDP00214
Pathan HGDP00218
Pathan HGDP00224
Pathan HGDP00241
Pathan HGDP00243
Pathan HGDP00254
Pathan HGDP00258
Pathan HGDP00259
Pathan HGDP00262
Pathan HGDP00264

Burusho HGDP00338
Burusho HGDP00356
Burusho HGDP00364
Burusho HGDP00382
Burusho HGDP00392
Burusho HGDP00412
Burusho HGDP00417
Burusho HGDP00423
Burusho HGDP00428
Burusho HGDP00433

Brahui HGDP00007
Brahui HGDP00009
Brahui HGDP00017
Brahui HGDP00041
Brahui HGDP00047

Balochi HGDP00054
Balochi HGDP00058
Balochi HGDP00062
Balochi HGDP00072
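
For anyone who wants to try something similar, here is a minimal Python sketch of the duplication and loading steps described above. It is not the script used for this experiment. It assumes haplotypes are lists of Illumina "A"/"B" alleles, that "0" is the no-call code, that the North European-like segments are already known as (start, end) SNP index ranges, and that the composite holds two haplotype slots which are filled SNP by SNP in priority order. All function names and the toy data are hypothetical.

```python
MISSING = "0"  # assumed no-call allele code (PLINK's default)


def duplicate_haplotype(hap):
    """Turn one phased haplotype into a fully homozygous genotype list."""
    return [(allele, allele) for allele in hap]


def empty_composite(n_snps):
    """Composite dummy with 100% no calls: two empty haplotype slots."""
    return [[MISSING] * n_snps, [MISSING] * n_snps]


def load_segment(composite, hap, start, end):
    """Copy hap[start:end] into the composite.

    Assumption: at each SNP the allele goes into the first slot that is
    still empty, and is skipped if both slots are already filled.
    """
    for i in range(start, end):
        for slot in composite:
            if slot[i] == MISSING:
                slot[i] = hap[i]
                break


def build_composite(n_snps, donors):
    """donors: list of (haplotype, [(start, end), ...]) in priority order."""
    composite = empty_composite(n_snps)
    for hap, segments in donors:
        for start, end in segments:
            load_segment(composite, hap, start, end)
    return composite


def to_ped_line(fid, iid, composite):
    """Format the composite as one PED row (6 leading columns, then alleles)."""
    alleles = []
    for a, b in zip(*composite):
        if MISSING in (a, b):
            # Assumption: write a full no-call rather than a half-missing genotype.
            a = b = MISSING
        alleles += [a, b]
    return " ".join([fid, iid, "0", "0", "0", "-9"] + alleles)


# Toy data: three donor haplotypes over six SNPs, highest priority first.
pathan = list("ABBAAB")
burusho = list("BBAABA")
brahui = list("AAAAAA")
donors = [
    (pathan, [(0, 3)]),
    (burusho, [(0, 6)]),
    (brahui, [(2, 6)]),
]
composite = build_composite(6, donors)
ped_line = to_ped_line("ANI", "ANI_composite", composite)
```

Loading the highest-priority donors first means that later donors only touch positions that are still empty, which is how the Brahui and Balochi haplotypes end up filling gaps rather than overwriting the Pathan and Burusho backbone.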

The phased data and the "ANI" haplotypes used in this experiment are available on request from eurogenesblog [at] hotmail [dot] com. I welcome feedback and suggestions on how to improve my methodology. Admittedly, this was a test run, so it's unlikely to be perfect.