Thursday, January 8, 2015

SpaceMix: A Spatial Framework for Understanding Population Structure and Admixture


Analyzing admixture isn't easy, especially among spatially more or less continuous populations that exchange DNA gradually by mixing with their immediate neighbors. A new preprint at bioRxiv explains this problem in detail and provides a possible solution: SpaceMix (available here as an R script). Below are a few excerpts from the paper. I highlighted the Polish sample in the two figures for my own use.



Of the European samples, the Spanish and the East and West Sicilian samples all draw small amounts of admixture from close to the Ethiopian samples, presumably reflecting a North African ancestry component [Moorjani et al., 2011, Botigué et al., 2013].

...

The Chuvash move close to Russian and Lithuanian samples, drawing admixture from close to the Yakut; the Turkish sample also draws a smaller amount of admixture from there. There are several other East-West connections: the Russian and Adygei samples have admixture from a location "north" of the East Asian samples, and the Cambodia sample draws admixture from close to the Egyptian sample [Pickrell and Pritchard, 2012, Hellenthal et al., 2014].

There are also a number of samples that draw admixture from locations that are not immediately interpretable. For example, the Hadza and Bantu Kenyan samples draw admixture from somewhat close to India, and the Xibo and Yakut from close to "northwest" of Europe. The Pathan samples draw admixture from a location far from any other samples' locations, but close to where the India samples also draws admixture from.

...

There are a number of possible explanations for these results. As we only allow a single admixture arrow for each sample, populations with multiple, geographically distinct sources of admixture may be choosing admixture locations that average over those sources. This may be the case for the Hadza and Bantu Kenyan samples [Hellenthal et al., 2014]. A second possibility is that the relatively harsh prior on admixture proportion forces samples to choose lower proportions of admixture from locations that overshoot their true sources; this may explain the Xibo and Yakut admixture locations. A final explanation is that good proxies for the sources of admixture may not be included in our sampling, either because of the limited geographic sampling of current day populations, or because of old admixture events from populations that are no longer extant.

Citation...

Bradburd et al., A Spatial Framework for Understanding Population Structure and Admixture, bioRxiv, Posted January 7, 2015. doi: http://dx.doi.org/10.1101/013474

12 comments:

Nirjhar007 said...

Cool! But since you are doing so many things, it would have been great if you had managed to create software that accurately (in relative terms, of course) calculates the age of admixture in each individual!

Matt said...

I've had a quick skim read. Obviously this is pretty dense, so I've only skim read some of it.

From what I understand, SpaceMix is a *lot* like TreeMix.

Both build a model that explains most population relatedness, then build in edges to explain additional variance. Drift is anything that is left unexplained, that doesn't increase relatedness to one population more than another.

The difference is in the model: rather than a tree, SpaceMix builds two PCA dimensions and assumes these relate to IBD (one dimension relates to northings and one to eastings). Which may be more realistic or may not be.

So it is kind of showing whether populations are more related than would be expected based on their position in the first two dimensions of PCA, which again are assumed to relate to isolation by distance. It depends to some extent then, on how well those first two dimensions are *actually* characterised by real isolation by distance.
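To make the "more related than expected from position" idea concrete, here is a minimal Python sketch. It assumes that covariance in allele frequencies decays with distance on the map and that each sample may draw a fraction of its ancestry from one other location. The exponential decay form, the parameter names (`a0`, `a1`, `a2`) and the function names are my own assumptions for illustration, not taken from the paper or from the SpaceMix R script.

```python
import numpy as np

def decay_cov(points_a, points_b, a0=1.0, a1=1.0, a2=1.0):
    """Covariance that decays with distance (assumed form: exp(-(a1*d)**a2) / a0)."""
    diff = points_a[:, None, :] - points_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.exp(-(a1 * dist) ** a2) / a0

def admixed_cov(locs, src_locs, w):
    """Covariance when sample i is a (1 - w[i]) : w[i] mix of its own
    location and a single admixture source location (one arrow per sample)."""
    own = 1.0 - w
    c_ll = decay_cov(locs, locs)          # own location vs own location
    c_ls = decay_cov(locs, src_locs)      # own location vs source location
    c_ss = decay_cov(src_locs, src_locs)  # source vs source
    return (np.outer(own, own) * c_ll
            + np.outer(own, w) * c_ls
            + np.outer(w, own) * c_ls.T
            + np.outer(w, w) * c_ss)

# Three hypothetical samples on a line; the middle one draws 30% of its
# ancestry from the location of the far-away third sample.
locs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
src_locs = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 0.0]])
no_admix = admixed_cov(locs, src_locs, np.zeros(3))
with_admix = admixed_cov(locs, src_locs, np.array([0.0, 0.3, 0.0]))
print(no_admix[1, 2], with_admix[1, 2])
```

In this toy setup the admixture arrow raises the covariance between the second and third samples well above what their map distance alone would give, which is the kind of excess relatedness an admixture edge is meant to absorb.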

Both SpaceMix and TreeMix have the same difference from MixMapper, in that, unlike MixMapper, neither of them uses a scaffold of populations which are tested to be unadmixed via formal tests or direct tests of admixture (f3 stats, d stats, recent haplotype sharing) to build their basic models (the PCA space in SpaceMix and the tree in TreeMix). So there is no testing that the edge populations which define the space and tree are themselves necessarily unadmixed.

It would be interesting to see their analysis repeated, just with MA-1 and Loschbour dropped in, without changing the sample composition they use (or using Human Origins instead?), to see what is found.

Their PCA is obviously pretty different from what we typically see. Why?

Well, the difference, from skim reading ( I might be wrong) between their PCA and the other PCA we've seen in the past seems to be that their PCA doesn't just use genotype data.

Rather they also use longitude and latitude data, which has the effect of forcing the PCA to select SNPs for the first two dimensions that are related to longitude and latitude. Otherwise they would probably recapitulate the more "normal" PCA we see, where almost none of the particular differentiation of Native Americans and Oceanians shows up in the first and second dimension. This does seem to mean that the first two dimensions will necessarily and inevitably show less in terms of relatedness in genotype frequency, compared to typical PCA.

Another interesting thing, from skim reading, here is that the number of SNPs they used seems pretty low - 10,000. That's kind of interesting. I wonder if that relates to algorithmic constraints of their methods, or is a choice to show they get good results from low coverage or what. That does mean that the method could be used on the low SNP coverage Pan-Asian dataset, which could add some insights there, possibly, which aren't possible from ADMIXTURE or typical genotype PCA or TreeMix or MixMapper very well.

Nirjhar007 said...

Gosh, I wish there were a way to calculate the age of admixtures! :)

George said...

See this Reich paper (fig 5) to est timing of admixture events. http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000519

Davidski said...

It's not possible to accurately estimate dates of admixture from modern DNA.

Scientists keep trying, and they keep coming up with nonsense.

Matt said...

Talking about time estimates, the paper includes an interesting bit on this in its discussion, and I've bolded the interesting subsection:

"The increased sequencing of ancient DNA [see Pickrell and Reich, 2014, for a recent review] promises an interesting way forward on that front, and it will also be exciting to learn where ancient individuals fall on modern maps, as well as how the inclusion of ancient individuals changes the configuration of those maps [Skoglund et al., 2014]. The inclusion of ancient DNA samples in the analyzed sample offers a way to get better representation of the ancestral populations from which the ancestors of modern samples received their admixture.

However, it is also possible to model genetic drift as a spatiotemporal process, in which covariance in allele frequencies decays with distance in both space and in time. We are currently exploring using ancient DNA samples as `fossil calibrations' on allele frequency landscapes at points in the past, so that modern day samples may draw admixture from coordinates estimated in spacetime."


In principle, we could see that it might be possible to extend SpaceMix to include a dimensional handling for time.

Also recommend reading the discussion in this paper as it is pretty good for talking about likely limitations of the method and comparisons to other methods.

- SpaceMix's lower two dimensions hold more data than the lowest two PCA dimensions, assuming isolation-by-distance holds. However PCA can hold more information across a number of dimensions, which is hard to make sense of with SpaceMix over high dimension (what exactly do these higher dimensions than the first two and admixture edges in them, actually mean?).

- SpaceMix may not show populations as admixed if there are a large enough number of them to warp the plot and they have similar admixture histories, or ancient samples are lacking. E.g. Europeans.

- SpaceMix does not use linkage disequilibrium data in the way Hellenthal 2014's Globetrotter does, and lacks the ability to make inferences that this can provide, like some of the 1%-2% level admixture events which Globetrotter seems to find. Personally, while I'm not 100% convinced about this, Globetrotter may be able to find events which escape causing meaningful shifts in genotype frequencies, as they're so low they cause almost no shift in frequencies but show structure in LD (sharing of haplotype blocks). Although LD cannot find ancient admixture due to constraints on decay of LD over time, which may confuse matters when using LD-only methods.

Simon_W said...

Matt, I've also skim read. First of all, their results are visualized as a so called geogenetic map. They don't call this their PC dimensions, because their method has nothing to do with a PC analysis (PCA). Their map doesn't show their first two dimensions, it shows their entire result. In contrast to a PCA, they don't have any higher dimensions! Though, as they write, it would be possible to plot the positions of the populations in a space with more dimensions, which would allow for more structure to be captured. Well, at least a three dimensional space could be easily visualised, it just wouldn't relate intuitively to a geographical map.

As for your suggestion that their geogenetic map differs from the known patterns we usually see in the first two dimension of a PCA because it takes the sampling location into account – I'm not sure. At the beginning of the chapter on empirical applications, they state:

„For all analyses presented below, we used random ‘observed’ locations as the priors on population locations. The geogenetic maps shown here were maximum a posteriori estimates (over all parameters).“

So this seems to imply that sampling locations were not used in the making of the map.

But in the next sentence they state:

„For clarity and ease of interpretation, we then present a full Procrustes superimposition of the inferred population locations (G) and their sources of admixture (G∗ ), using the observed latitude and longitude of the samples/individuals to give a reference position and orientation.“

I don't understand this. Does this mean the sampling location was somehow used a posteriori to warp the map?

Furthermore, you said: „SpaceMix (...) builds two PCA dimensions, and assumes these relate to IBD“.

No, these are not PCA dimensions, and I didn't see any reference to IBD (identity by descent). I suppose you mean isolation by distance. ;-)

Matt said...

Simon : They don't call this their PC dimensions, because their method has nothing to do with a PC analysis (PCA). Their map doesn't show their first two dimensions, it shows their entire result. In contrast to a PCA, they don't have any higher dimensions!

Furthermore, you said: „SpaceMix (...) builds two PCA dimensions, and assumes these relate to IBD“.

Yes, they compare and contrast it to PCA, but the technique is mathematically distinct. PCA like dimensions if you prefer, that would certainly be closer to accuracy. They use forms of analysis which explain variation in dimensional terms.

Also, well done, I'm referring to the thing relevant to the paper commonly referred to by the acronym IBD, not Identity by Descent or Inflammatory Bowel Disease for that matter.

I don't understand this. Does this mean the sampling location was somehow used a posteriori to warp the map?

Yes sounds like, contrary to my initial impression, the Spacemix algorithm doesn't actually use the geographic information; what it does is it outputs a position estimate for each sample in its two dimensions (parameter G), then the real geographic information has been used to fit these to latitude and longitude.

My impression is that "Procrustes Superimposition" should only involve rotation, position and scaling of the whole plot, not moving any of the coordinates away from one another relative to the rotation, position and scaling of the plot.
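As a rough illustration of that impression, here is a minimal Python sketch of a full Procrustes fit using only translation, one uniform scale factor and a rotation/reflection. The function name `procrustes_fit` and the toy coordinates are hypothetical; this is not the code the authors used.

```python
import numpy as np

def procrustes_fit(target, source):
    """Align source onto target using only translation, uniform scaling
    and an orthogonal transform (rotation or reflection)."""
    mu_t = target.mean(axis=0)
    mu_s = source.mean(axis=0)
    t0 = target - mu_t
    s0 = source - mu_s
    # Orthogonal Procrustes: the R minimizing ||s0 @ R - t0|| comes from
    # the SVD of s0.T @ t0.
    u, sing, vt = np.linalg.svd(s0.T @ t0)
    rot = u @ vt
    # Least-squares optimal uniform scale for the rotated source.
    scale = sing.sum() / (s0 ** 2).sum()
    return scale * (s0 @ rot) + mu_t

rng = np.random.default_rng(0)
geo = rng.normal(size=(6, 2))            # stand-in for observed lon/lat
theta = 0.7
turn = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])
# A hypothetical inferred map: same shape as geo, but rotated, scaled, shifted.
inferred = 2.5 * (geo @ turn) + np.array([10.0, -3.0])
aligned = procrustes_fit(geo, inferred)
print(np.abs(aligned - geo).max())
```

Because every point gets the same rotation, scale and shift, the relative distances between samples in the inferred map are unchanged; only the orientation and units are matched to the reference.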

The algorithm's differences from PCA must be quite different from what I thought they were. Do you want to summarize them, if you have a grasp of them?

Simon_W said...

As far as I understand it (which isn't far) this Spacemix geogenetic map is a new method to visualize the overall similarity between populations; the genetic distance between two populations is somehow proportional to the distance in the map and the mutual distances between all samples kind of impose one solution. So far that stands to reason, and I know I'm saying the obvious. But I didn't read in detail how exactly this is achieved and how the distances between populations are calculated, as maths was never my strong point. But to me it looks like a useful method, as the first two dimensions of a PCA often capture only a limited fraction of the variance.

Simon_W said...

Also, a PCA plot depends on the sampling, it changes depending on the populations or individual samples you add. The geogenetic map doesn't change its pattern if samples are added or removed.

Matt said...

Simon: Also, a PCA plot depends on the sampling, it changes depending on the populations or individual samples you add. The geogenetic map doesn't change its pattern if samples are added or removed.

OK, I personally doubt that Spacemix's genetic geographic map is agnostic to the choice of sample composition - their comments include "The landscape of allele frequencies on which the location of populations that were the source of population's admixture are estimated is entirely informed by the placement of other modern samples, even though the admixture events may have occurred many generations ago ... it will also be exciting to learn where ancient individuals fall on modern maps, as well as how the inclusion of ancient individuals changes the configuration of those maps" and "One concern is that the multiple admixed samples (from a single admixture event) may simply choose to cluster close to each other, and not need to draw admixture from elsewhere due to the fact that their frequencies are well described by their proximity to other admixed populations. Along these lines, it is noticeable that many of our European samples draw little admixture from elsewhere [also noted by Hellenthal et al., 2014, using a different approach], despite evidence of substantial admixture [Lazaridis et al., 2014]". I can't see how any dimensional method would not produce dimensions which are different to at least some degree due to different sampling strategies.

If they'd resimulated the genetic geographic map with different balances of individuals (e.g. run a 5 cluster ADMIXTURE analysis, then include equal numbers of members of each cluster), then I'd have more confidence.
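The sampling point is easy to show for plain PCA, at least. Below is a toy Python sketch in which the population centres, sample counts and noise level are all made up for illustration; it says nothing about how SpaceMix itself behaves.

```python
import numpy as np

def first_pc(data):
    """Return the unit vector of the first principal component of the rows."""
    centered = data - data.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def sample_panel(counts, rng):
    """Draw noisy individuals around three hypothetical population centres."""
    centres = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 3.0)}
    rows = [rng.normal(loc=centres[pop], scale=0.1, size=(n, 2))
            for pop, n in counts.items()]
    return np.vstack(rows)

rng = np.random.default_rng(1)
# Same three populations, two different sampling balances.
panel_many_b = sample_panel({"A": 50, "B": 50, "C": 2}, rng)
panel_many_c = sample_panel({"A": 50, "B": 2, "C": 50}, rng)
pc1_many_b = first_pc(panel_many_b)
pc1_many_c = first_pc(panel_many_c)
print(pc1_many_b, pc1_many_c)
```

With B heavily sampled the first component runs along the A-B axis; with C heavily sampled it swings to the A-C axis, even though the underlying populations are identical.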

But to me it looks like a useful method, as the first two dimensions of a PCA often capture only a limited fraction of the variance.

It's true that they do fit more information into two dimensions than PCA of many higher dimensions. I'm not sure how their method compares to multidimensional scaling set to two dimensions.

Simon_W said...

Good point. As far as I see it, they don't even mention MDS. I think there are two main differences. 1. Their geogenetic maps are a model-based approach, i.e. based on a population genetic model (it's about covariance among alleles at the same locus), while MDS works with distance matrices in general, which needn't have anything to do with genetics. And 2., their approach allows admixture edges to be plotted.
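For comparison, classical (Torgerson) MDS needs nothing but a distance matrix. The Python sketch below is a generic textbook version with made-up coordinates, not anything from the paper; the function name `classical_mds` is my own.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Embed n items in k dimensions from an n x n distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to get a Gram (inner product) matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:k]       # k largest eigenvalues
    top = np.clip(vals[order], 0.0, None)    # guard against tiny negatives
    return vecs[:, order] * np.sqrt(top)

# Four hypothetical items whose true layout is a 3 x 4 rectangle.
coords = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
embedding = classical_mds(dist, k=2)
recovered = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
print(np.abs(recovered - dist).max())
```

The embedding reproduces the input distances up to rotation and reflection, but nothing in it knows about allele frequencies or admixture, which is the contrast with a model-based map.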