The aim of this survey is to review the methods and results that serve as the basis for myOrigins. After generating a reference data set pruned of related individuals, we used the model-based clustering framework of the Admixture package to capture the genetic variation of the world’s populations and to infer ancestral proportions for particular individuals. Twenty-four clusters were defined, spanning extant modern human genetic variation while reflecting ancient migrations and admixtures.
Over the 50,000 years since modern humans left Africa, numerous populations have diverged [Mendez2013] and recombined [Hellenthal2014] to generate the diversity we see around us today. Though most human genetic variation is apportioned within, rather than across, populations, there have been barriers to gene flow on an inter-continental scale [Jorde2004]. As a result, genetic distances between groups are great enough that the groups can be identified accurately with molecular markers [Rosenberg2002]. The markers are diverse, from microsatellites to allozymes. For our purposes, though, it is critical to note that the human genome has approximately 10 million single nucleotide polymorphisms (SNPs) [Reich2003]. These are single base-pair positions (A, C, G, or T) that vary across populations, and they can be used both for inferring deep history on a paleontological scale [Patterson2006a] and for individual forensic identification [Tian2008].
With a sufficient density of markers, very fine-grained inferences can be made as to the ancestral makeup of any given individual, assuming a finite number of discrete ancestries. Though the reality of human variation and its history is more nuanced than any given population genetic summary, important aspects of deep ancestry and genetic relatedness can be established through methods which reduce the complexity of the data to a finite set of dimensions. These are then represented visually so as to convey the general qualitative result to a non-specialist audience.
2 Model-based inferences of genetic variation
There are many ways to represent genetic distance and clustering between individuals and populations. A classic method is Principal Component Analysis (PCA), which reduces the total variation into a set of independent dimensions ordered by proportion of variation explained for each dimension [Patterson2006]. Therefore, PC 1 is the dimension which explains the largest proportion of variation, and PC 2 explains the second largest proportion of the variation, and so on. Populations or individuals can then be represented along a two-dimensional plane which exhibits patterns of clustering which correspond to genetic relatedness. More recently dense multi-locus marker sets have allowed for the development of robust model-based methodologies which give more explicit results [Pritchard2000].
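The ordering of principal components described above can be illustrated with a minimal sketch. It assumes a small matrix of allele counts (0, 1, or 2 per SNP) and uses a plain singular value decomposition; the function name `genotype_pca` and the toy data are hypothetical, and this is not the normalization or software used for myOrigins.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Project individuals onto the top principal components.

    genotypes: (n_individuals, n_snps) array of allele counts (0, 1, 2).
    Returns (coords, explained): coords has shape
    (n_individuals, n_components); explained holds the proportion of
    total variance carried by each returned component.
    """
    g = np.asarray(genotypes, dtype=float)
    # Center each SNP so components describe variation around the mean.
    centered = g - g.mean(axis=0)
    # SVD returns singular values in descending order, so PC 1 comes first.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:n_components]

# Toy data (hypothetical): two groups of three individuals that differ
# mainly at the first two SNPs.
toy = np.array([
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [0, 1, 1, 1],
    [2, 2, 1, 2],
    [2, 2, 2, 1],
    [2, 1, 1, 1],
])
coords, explained = genotype_pca(toy)
```

Plotting `coords[:, 0]` against `coords[:, 1]` gives the two-dimensional view described above, with the two toy groups falling on opposite sides of PC 1. Unlike the model-based methods just introduced, this projection assumes no explicit number of ancestral populations.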
Model-based methods posit a number “K” of ancestral populations as a parameter, assuming that true populations will be in Hardy-Weinberg Equilibrium. The genotypic variation of each individual is modeled as a composite of these ancestral populations, weighted by that individual’s ancestry proportions. Assuming K = 2, an individual may be assigned 50 percent to each of the posited clusters, or any pair of values which sum to 100 percent (e.g., 10 percent to one component and 90 percent to the other). Note that the “K” clusters do not necessarily denote “real” populations in a concrete sense which underwent discrete admixture events, though they may. Often, however, a given number of populations will still be highly informative about population and individual relatedness, setting aside any necessary correspondence to historical genetic fact.
The modern era of model-based admixture analysis using multi-locus data began with the release of the Structure program [Pritchard2000]. It utilizes a Bayesian approach and a Markov chain Monte Carlo (MCMC) algorithm to sample the distribution of probabilities. Given the range of models and parameters which can fit the genotypic data, the ultimate outcomes are a series of assignments for explicit K values which are fixed as inputs. The resulting weighted assignments of individuals to each of the K clusters are generally visualized in the form of colored barplots.
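To make the idea of a weighted composite concrete, the sketch below scores how well a set of ancestry proportions explains a genotype. It assumes the usual binomial formulation for this family of models: an individual's allele frequency at a SNP is the ancestry-weighted average of the K population frequencies, and the genotype is two independent draws at that frequency. The function name, the toy frequencies, and the omission of any fitting procedure are all simplifications for illustration.

```python
import numpy as np

def admixture_log_likelihood(genotypes, q, p):
    """Binomial log-likelihood of genotypes under a K-population mixture.

    genotypes: (n_individuals, n_snps) counts of a chosen allele (0, 1, 2).
    q: (n_individuals, K) ancestry fractions; each row must sum to 1.
    p: (K, n_snps) frequency of that allele in each ancestral population.
    """
    g = np.asarray(genotypes, dtype=float)
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    if not np.allclose(q.sum(axis=1), 1.0):
        raise ValueError("each individual's ancestry fractions must sum to 1")
    # Individual-specific allele frequency: a weighted blend of the K populations.
    freq = np.clip(q @ p, 1e-9, 1 - 1e-9)
    # Hardy-Weinberg within populations gives a binomial(2, freq) genotype;
    # the constant binomial coefficient is dropped.
    return float(np.sum(g * np.log(freq) + (2 - g) * np.log(1 - freq)))

# K = 2 toy example (hypothetical frequencies): the allele is common in
# population A and rare in population B at all three SNPs.
p = np.array([[0.9, 0.8, 0.9],
              [0.1, 0.2, 0.1]])
heterozygous_everywhere = np.array([[1, 1, 1]])
ll_mixed = admixture_log_likelihood(heterozygous_everywhere, [[0.5, 0.5]], p)
ll_skewed = admixture_log_likelihood(heterozygous_everywhere, [[0.1, 0.9]], p)
```

For an individual who is heterozygous at every SNP, the 50/50 assignment scores higher than the 10/90 assignment, which is the intuition behind reading the fitted proportions as ancestry fractions.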
Figure 1: Hubisz, Melissa J., et al. “Inferring weak population structure with the assistance of sample group information.” Molecular ecology resources 9.5 (2009): 1322-1332.
Because Structure implements a computationally intensive Bayesian method, it has a relatively long run-time. Several packages have emerged that operate within the same model-based framework as Structure, but utilize more computationally efficient algorithms and a less exhaustive maximum likelihood approach. Foremost among these is the Admixture program, which has become the widely used standard package for model-based admixture analysis [Alexander2009]. Though there are differences in the technical details between Structure and Admixture, both generate results that are visually represented as barplots. With dense SNP data, Admixture has been shown to be as accurate as Structure and much faster.
4 Reference Populations
Model-based methods of admixture inference are sensitive to the input data which is evaluated to establish population substructure. To maintain consistency, it is important to fix a reference set that can be used across runs of individual ancestry predictions. In other words, though individuals whose ancestries are being computed may vary, the reference genotypes against which they are compared to infer those ancestries will remain constant. To construct a reference set, we collected populations from multiple sources:
- GeneByGene DNA customer database
- Human Genome Diversity Project
- International HapMap Project
- Estonian Biocentre
- 1000 Genomes Project
The GeneByGene data derives from a custom Illumina OmniExpress 710K microarray chip. The HGDP data was from Illumina HumanHap650K Beadchips [Li2008]. The Estonian Biocentre data derives from Illumina 610K or 660K bead arrays [Behar2010] or Illumina HumanOmniExpress BeadChip.
Figure 2: PCA used to remove outliers
We assembled a large number of candidate reference populations which were relatively unadmixed and sampled widely in terms of geography. From these, we removed related or outlier individuals with the Plink software, utilizing identity-by-descent (IBD) analysis and visually inspecting multi-dimensional scaling (MDS) plots. Further visualization established that the reference population sets were indeed genetically distinct from each other. We also ran Admixture and MDS with specific populations to assess whether any individuals were outliers or exhibited notable gene flow from other reference groups, and removed those that did. Admixture was run on both an inter- and intra-continental scale to establish a plausible number of K values utilizing the cross-validation method [Alexander2011]. After removing markers with genotypes missing in more than 5 percent of individuals and those with minor allele frequencies below 1 percent, the total intersection of SNPs across the pooled data set was 245,039. The final number of individuals was 2,943.
To validate our reference population set, we tested it against a list of well-studied benchmark groups whose ancestral background is well attested in the literature. We also cross-checked against individuals with attested provenance within the GeneByGene DNA database.
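The two marker filters mentioned above (missingness and minor allele frequency) can be sketched as follows. This is an illustration only, assuming genotypes are held as a NumPy array with `np.nan` for missing calls and that missingness is measured per marker across individuals; the function name and toy matrix are hypothetical, and the actual filtering may have been done with different tooling.

```python
import numpy as np

def filter_snps(genotypes, max_missing=0.05, min_maf=0.01):
    """Return a boolean mask of SNPs (columns) that pass both thresholds.

    genotypes: (n_individuals, n_snps) allele counts (0, 1, 2), with
    np.nan marking a missing call.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)
    n_called = called.sum(axis=0)
    missing_rate = 1.0 - n_called / g.shape[0]
    # Allele frequency from called genotypes only (two alleles per individual).
    allele_sum = np.where(called, g, 0.0).sum(axis=0)
    freq = np.divide(allele_sum, 2.0 * n_called,
                     out=np.zeros(g.shape[1]), where=n_called > 0)
    maf = np.minimum(freq, 1.0 - freq)
    return (missing_rate <= max_missing) & (maf >= min_maf)

# Toy matrix (hypothetical): 20 individuals, 3 SNPs.
toy = np.zeros((20, 3))
toy[:10, 0] = 1        # SNP 0: polymorphic and fully called
toy[:10, 1] = 1
toy[:2, 1] = np.nan    # SNP 1: 10 percent missing, above the 5 percent cutoff
# SNP 2 stays all zeros: monomorphic, so its minor allele frequency is 0
mask = filter_snps(toy)
```

Only the first SNP survives here; applying `toy[:, mask]` would keep the markers that pass both thresholds.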
5 Generation of cluster maps
To generate the contour maps of the distribution of a particular cluster as a function of geography, we used the fraction of representation in a myOrigins cluster at latitude-longitude points derived from distinct populations within our benchmark list and produced smoothed density maps. The R packages spatstat and maps were employed to generate polygons overlain upon a world map.
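As a rough illustration of the smoothing step, the sketch below computes a Gaussian-kernel weighted average of cluster fractions over a latitude-longitude grid. It is written in Python with NumPy rather than with the R packages named above, and it is not the spatstat procedure itself. The function name, the bandwidth, and the sample coordinates are hypothetical, and distances are treated as planar degrees, ignoring the curvature of the Earth and longitude wrap-around.

```python
import numpy as np

def smooth_cluster_surface(lats, lons, fractions, grid_lats, grid_lons,
                           bandwidth_deg=5.0):
    """Gaussian-kernel weighted average of cluster fractions on a grid.

    lats, lons, fractions: one entry per sampled population.
    grid_lats, grid_lons: 1-D coordinates of the output grid, in degrees.
    Returns an array of shape (len(grid_lats), len(grid_lons)).
    """
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    fractions = np.asarray(fractions, dtype=float)
    glat, glon = np.meshgrid(grid_lats, grid_lons, indexing="ij")
    # Squared planar distance (in degrees) from every grid cell to every sample.
    d2 = (glat[..., None] - lats) ** 2 + (glon[..., None] - lons) ** 2
    weights = np.exp(-0.5 * d2 / bandwidth_deg ** 2)
    return (weights * fractions).sum(axis=-1) / weights.sum(axis=-1)

# Hypothetical cluster fractions at three sampling locations.
sample_lats = [60.0, 50.0, 40.0]
sample_lons = [25.0, 20.0, 15.0]
sample_fractions = [0.9, 0.5, 0.2]
surface = smooth_cluster_surface(sample_lats, sample_lons, sample_fractions,
                                 grid_lats=np.arange(40.0, 61.0, 10.0),
                                 grid_lons=np.arange(15.0, 26.0, 5.0))
```

Each grid value is a weighted mean of the sampled fractions, so the surface stays within the range of the input values. Contour levels drawn on such a surface would play the role of the polygons overlain on the world map.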
6 Table of reference populations