1. Take a look at Extended Data Figure 3 from Huerta-Sánchez et al (2014). How does this figure emphasize that the Tibetan allele for EPAS1 came from Denisovans? (10 points)
2. You are given the following seven aligned sequences from a single population (variants with respect to the first sequence are shown in bold). Justify whether or not you see any evidence of selection. (20 points)
TGGAGTGTGACCATAGCGAT
TGGAGTTTGACCATAGCCAT
TGGAGTGTGACCATAACCAT
TGGTGTTTGCCCATAACGAT
TGGTGTGTGACCATAGCGAT
TGGTGTGTGCCCATAGCGAT
TGGAGTGTGACCATAACGAT
3. Explain the concept of the “Neutral Theory of Molecular Evolution” and how it relates to the idea of a molecular clock? (10 points)
Natural selection on EPAS1 (HIF2α) associated with
low hemoglobin concentration in Tibetan highlanders
Cynthia M. Bealla,1, Gianpiero L. Cavallerib,1, Libin Dengc,2, Robert C. Elstond, Yang Gaoc, Jo Knighte,f, Chaohua Lic,
Jiang Chuan Lig, Yu Liangh, Mark McCormackb, Hugh E. Montgomeryi,1, Hao Panc, Peter A. Robbinsj,1,3,
Kevin V. Shiannak, Siu Cheung Taml, Ngodrop Tseringm, Krishna R. Veeramahn, Wei Wangh, Puchung Wangduim,
Michael E. Wealee,1, Yaomin Xuo, Zhe Xuc, Ling Yangh, M. Justin Zamanp, Changqing Zengc,1,3, Li Zhango,1,
Xianglong Zhangc, Pingcuo Zhaxih,1,4, and Yong Tang Zhengq
aDepartment of Anthropology, Case Western Reserve University, Cleveland, OH 44106-7125; bMolecular and Cellular Therapeutics, The Royal College of
Surgeons in Ireland, Education and Research Centre, Beaumont Hospital, Dublin 9, Ireland; cBeijing Institute of Genomics, Key Laboratory of Genome Sciences
and Information, Chinese Academy of Sciences, Beijing 100029, China; dDepartment of Epidemiology and Biostatistics, Case Western Reserve University,
Cleveland, OH 44106-7281; eDepartment of Medical and Molecular Genetics, King’s College London, Guy’s Hospital, London SE1 9RT, United Kingdom;
fNational Institute for Health Research, Biomedical Research Centre, Guy’s and St. Thomas’ National Health Service Foundation Trust and King’s College
London, London SE1 7EH, United Kingdom; gYunnan Institute of Population and Family Planning Research, Kunming 650021, China; hBeijing Genomics
Institute at Shenzhen, Shenzhen 518000, China; IInstitute for Human Health and Performance, University College London, London N19 5LW, United Kingdom;
jDepartment of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3PT, United Kingdom; kInstitute for Genome Sciences and Policy, Center
for Human Genome Variation, Duke University, Durham, NC 27708; lSchool of Biomedical Sciences, Chinese University of Hong Kong, Shatin NT, Hong Kong,
China; mTibet Academy of Social Sciences, Lhasa 850000, Tibet Autonomous Region, China; nDepartment of History, The Centre for Society and Genetics and
the Novembre Laboratory, University of California, Los Angeles, CA 90095-7221; oDepartment of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH
44195; pDepartment of Epidemiology and Public Health, University College London, London WC1E 6BT, United Kingdom; and qKunming Institute of Zoology,
Chinese Academy of Sciences, Kunming 650223, China
Edited by Peter T. Ellison, Harvard University, Cambridge, MA, and approved May 17, 2010 (received for review February 26, 2010)
By impairing both function and survival, the severe reduction in
oxygen availability associated with high-altitude environments is
likely to act as an agent of natural selection. We used genomic and
candidate gene approaches to search for evidence of such genetic
selection. First, a genome-wide allelic differentiation scan (GWADS)
comparing indigenous highlanders of the Tibetan Plateau (3,200–
3,500 m) with closely related lowland Han revealed a genome-wide
significantdivergenceacrosseightSNPs locatednearEPAS1. Thisgene
encodes the transcription factor HIF2α, which stimulates production
of red blood cells and thus increases the concentration of hemoglobin
in blood. Second, in a separate cohort of Tibetans residing at 4,200m,
we identified 31 EPAS1 SNPs in high linkage disequilibrium that
correlated significantly with hemoglobin concentration. The sex-ad-
justed hemoglobin concentration was, on average, 0.8 g/dL lower in
the major allele homozygotes compared with the heterozygotes.
These findings were replicated in a third cohort of Tibetans residing
at 4,300m. The alleles associatingwith lower hemoglobin concentra-
tions were correlated with the signal from the GWADS study and
were observed at greatly elevated frequencies in the Tibetan cohorts
comparedwith the Han. High hemoglobin concentrations are a cardi-
nal feature of chronicmountain sickness offering oneplausiblemech-
anism for selection. Alternatively, as EPAS1 is pleiotropic in its effects,
selectionmay have operated on someother aspect of the phenotype.
Whichever of these explanations is correct, the evidence for genetic
selection at the EPAS1 locus from the GWADS study is supported by
the replicated studies associating function with the allelic variants.
chronic mountain sickness | high altitude | human genome variation |
hypoxia | hypoxia-inducible factor
The high plateaus of Central Asia and the Andes were among
the last areas occupied as Homo sapiens spread across the
globe during the past 100,000–200,000 y. In the case of the Ti-
betan plateau, early visitors appearedmore than 30,000 y ago, and
the plateau has been colonized for more than 10,000 y (1, 2). The
low partial pressure of oxygen resulting from the extreme altitude
would have presented a formidable biological challenge to such
colonists. Individuals from low-altitude populations (European
and Han) who move to live at high altitude suffer from a number
of potentially lethal diseases specifically related to the low levels
of oxygen (3–5) and struggle to reproduce at these altitudes (6–9).
The hypoxia of altitude (hypobaric hypoxia) would thus have
exerted substantial evolutionary selection pressure.
The classic disease associated with long term residence at high
altitude is chronic mountain sickness, or Monge’s disease, after
Carlos Monge-Medrano who first identified the condition among
Andean highlanders (10). The cardinal feature is a very high
concentration of the oxygen-carrying pigment, hemoglobin, in the
blood, caused by an overproduction of red blood cells (excessive
erythrocytosis). Tibetan highlanders are particularly resistant to
developing chronic mountain sickness (4, 11), and exhibit little or
no increase in hemoglobin concentration with increasing altitude,
even at 4,000 m (13,200 ft) and only moderate increases beyond
(12, 13). Typically, Tibetans average at least 1 g/dL and as much
as approximately 3.5 gm/dL (i.e. approximately 10–20%) lower
hemoglobin concentration in comparison with their Andean
counterparts (14–16) or acclimatized lowlanders, such as the Han
who have moved to altitudes above 2,500 m (4, 17–23). This sug-
gests that Tibetans have evolved a blunted erythropoietic response
to high-altitude hypoxia. The induction of erythrocytosis by hyp-
oxia involves the hypoxia-inducible factor (HIF) family of tran-
scription factors and, in particular, EPAS1 (or HIF2α) (24, 25).
Three independent studies, but with mutually reinforcing
results, were performedby groups that have since come together to
form a consortium with the aim of reporting on the findings. The
first study was a genome-wide allelic differentiation scan that
compared SNP frequencies of a Yunnan Tibetan population re-
siding at 3,200–3,500 m with the HapMap Phase III Han sample.
Author contributions: C.M.B., G.L.C., J.C.L., Y.L., H.E.M., P.A.R., S.C.T., N.T., W.W., P.W.,
L.Y., C.Z., P.Z., and Y.T.Z. designed research; C.M.B., G.L.C., Y.G., C.L., J.C.L., Y.L., M.M.,
H.P., K.V.S., S.C.T., N.T., W.W., P.W., Z.X., L.Y., C.Z., X.Z., P.Z., and Y.T.Z. performed re-
search; G.L.C., L.D., R.C.E., Y.G., J.K., K.R.V., M.E.W., Y.X., Z.X., M.J.Z., and L.Z. analyzed
data; and C.M.B., G.L.C., H.E.M., P.A.R., M.E.W., and C.Z. wrote the paper.
The authors are listed alphabetically and declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
1C.M.B., G.L.C., H.E.M., P.A.R., M.E.W., C.Z., L.Z., and P.Z. contributed equally to this work.
2Present address: Faculty of Basic Medical Science, Nanchang University, Nanchang
330006, China.
3To whom correspondence may be addressed. E-mail: peter.robbins@dpag.ox.ac.uk or
czeng@big.ac.cn.
4Present address: The People’s Hospital of the Tibet Autonomous Region, Lhasa, 850000
Tibet Autonomous Region, China.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.
1073/pnas.1002443107/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1002443107 PNAS | June 22, 2010 | vol. 107 | no. 25 | 11459–11464
EV
O
LU
TI
O
N
D
ow
nl
oa
de
d
by
g
ue
st
o
n
Ja
nu
ar
y
29
, 2
02
1
mailto:peter.robbins@dpag.ox.ac.uk
mailto:czeng@big.ac.cn
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental
www.pnas.org/cgi/doi/10.1073/pnas.1002443107
Mitochondrial, Y chromosome, and autosomal DNA evidence all
suggest a northor eastAsianorigin formodernTibetans (1, 26–28).
Thus, theHan comprise a lowlandpopulation that is closely related
to the Tibetans but which has not undergone selection for high-
altitude adaptation. From this study, a signal of selection close to
EPAS1 was identified at a genome-wide level of significance. The
second study comprised a candidate gene analysis of EPAS1 in
a separate sample of Tibetans from 4,200m on theTibetan plateau
and identified a significant association between genotype and he-
moglobin concentration, with the major alleles associating with
the lower hemoglobin levels. These alleles were present at low
frequency in the Han. The third study replicated the hemoglobin
association in an independent sample of Tibetans from 4,300 m.
Results
Genome-Wide Allelic Differentiation Scan. A genome-wide allelic
differentiation scan (GWADS) was used to compare a cohort of
Tibetan residents (n=35) sampled from four townships at altitudes
of 3,200–3,500 m in Yunnan Province, China, with HapMap Phase
III Han individuals (n = 84). We postulated that any marked
differences in SNP frequencies between the Yunnan Tibetan and
the HapMap Han populations could reflect a history of divergent
selection on functional variation that contributes to increased sur-
vival at high altitude. (See SI Materials and Methods for detailed
methodology.) Of the 502,722 SNPs that were included in the
analysis, eight SNPs emerged as having genome-wide significance
(P values ranging from 2.81× 10−7 to 1.49× 10−9), all locatedwithin
235 kb on chromosome 2 (Fig. 1 and Table S1).
All eight GWADS significant SNPs were in high pairwise
linkage disequilibrium in the Yunnan sample (0.23< r2 < 0.82), forming an extended haplotype with a frequency of 46% in the Yunnan Tibetan sample but only 2% in the Han sample [esti- mated via an expectation-maximization algorithm using Haplo- view software (29, 30)]. The SNPs lie between 366 bp and 235 kb downstream of EPAS1 but, as we show below, the region of high linkage disequilibrium extends into EPAS1 itself. In addition to this genome-wide significant finding relating to EPAS1, evidence for other signals of selection was also found. Regions of sub- genome wide significance were in close proximity to other genes of the HIF pathway and present intriguing targets for follow-up studies (see SI Text for further details).
Candidate Gene Study for EPAS1. Independent of the GWADS
study, a candidate gene study (based on the pathway linking
hypoxia, EPAS1, and erythropoietin) addressed the functional
consequence of EPAS1 variants by testing for association with
hemoglobin concentration in a sample of 70 Tibetans residing at
4,200 m in Mag Xiang, Xigatse Prefecture in the Tibet Autono-
mous Region, China (Table S2). One hundred and three non-
coding SNPs across theEPAS1 gene were selected for genotyping.
Of these, 49 had a minor allele frequency ≥5%, and were thus
amenable to regression analysis (Materials and Methods) that
identified 31 SNP sites significantly associated with hemoglobin
concentration (Fig. 2 and Table S3). The major (most frequent)
allele of every significant SNP was associated with lower hemo-
globin concentration. After adjusting for sex differences, indi-
viduals who were homozygous for the major allele had an average
hemoglobin concentration that was 0.8 ± 0.15 g/dL (range from
0.3 to 1.0 g/dL) lower than individuals who were heterozygous for
the major allele. Conditional linear regression analyses showed
that once the most significant SNP (rs4953354) was included, no
significant improvement in fit was obtained after Bonferroni
correction by adding any other SNP, consistent with a single
causal variant model. Many of the SNP sites were in high linkage
disequilibrium (Fig. 2). Genotypes for the eight GWADS signif-
icant SNPs were available on 29 of the 70 individuals in the Mag
Xiang cohort, too few to show statistical association with hemo-
globin concentration. However, all eight GWADS SNPs were
highly correlated (0.54 < r2 ≤ 1) with variants associating with
hemoglobin concentration in the complete Mag Xiang cohort
(Table S4). Thus, the genome-wide and the candidate-gene
analyses can be linked, with the latter study demonstrating that
there is a phenotype associated with the signal of selection.
Replication of Candidate Gene Study for EPAS1. We replicated the
association of EPAS1 SNPs with hemoglobin concentration in
another sample of 91 Tibetans residing at 4,300 m in Zhaxizong
Xiang, Xigatse Prefecture, China (Table S2). Of the 49 Mag
Xiang SNPs with a minor allele frequency ≥5%, 48 were suc-
cessfully genotyped in the Zhaxizong Xiang sample. Of these,
45 sites had a minor allele frequency ≥5% and 32 sites were
significantly associated with hemoglobin concentration. After
adjusting for sex differences, individuals who were homozygous
for the major allele had an average hemoglobin concentration
that was 1.0 ± 0.14 g/dL (range from 0.5 to 1.2 g/dL) lower than
individuals who were heterozygous for the major allele (Fig. 3 and
Table S3). Twenty-six SNPs were associated with hemoglobin
concentration in both samples and the direction of the effect was
the same. Conditional linear regression again found that, after
including the most significant SNP (rs13419896), no further SNPs
were significant after Bonferroni correction. This was also the
case if the most significant SNP from the Mag Xiang sample
Fig. 1. A genome-wide allelic differentiation scan that compares Tibetan residents at 3,200–3,500 m in Yunnan Province, China with HapMap Han samples.
Eight SNPs near one another and EPAS1 have genome-wide significance. The horizontal axis is genomic position with colors indicating chromosomes. The
vertical axis is the negative log of SNP-by-SNP P values generated from the Yunnan Tibetan vs. HapMap Han comparison. The red line indicates the threshold
for genome-wide significance used (P = 5 × 10−7). Values are shown after correction for background population stratification using genomic control.
11460 | www.pnas.org/cgi/doi/10.1073/pnas.1002443107 Beall et al.
D
ow
nl
oa
de
d
by
g
ue
st
o
n
Ja
nu
ar
y
29
, 2
02
1
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/pnas.201002443SI ?targetid=nameddest=STXT
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/st01
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/pnas.201002443SI ?targetid=nameddest=STXT
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/st02
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/st03
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/st04
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/st02
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/st03
www.pnas.org/cgi/doi/10.1073/pnas.1002443107
(rs4953354) was used instead of rs13419896. Genotypes for the
eight GWADS significant SNPs were available on 89 samples
from the Zhaxizong Xiang cohort. Three of these SNPs correlated
significantly with hemoglobin concentration (Table S4), thus sup-
porting the association of a phenotype with the signal of selection
that has been obtained from this area of the genome.
Comparing allelic frequencies between the two Tibetan samples
and the HapMap Han sample, we note that the largest allele
frequency differences occur at the EPAS1 SNP sites that are as-
sociated with low hemoglobin concentration (Fig. 4). Linkage dis-
equilibrium (LD) among these 26 SNP sites was also elevated in
the twoTibetan cohorts compared with theHapMapHan (SI Text).
Discussion
The results from the GWADS study revealed a level of allele
frequency differentiation near EPAS1 that far exceeds the ge-
nome-wide average (Fig. 1). The association studies demonstrated
that the SNP variants that were at higher frequencies within the
Tibetan population were associated with lower hemoglobin con-
centrations. As large genome-wide association studies of the
determinants of hemoglobin concentration in other populations at
low altitude have failed to detect a signal associated with EPAS1
(31–34), our results suggest either that there is a genetic variant
that is quite specific to the Tibetan population or that the variant is
quite specific to moderating hemoglobin concentration only under
conditions of high altitude. Such specificity of effect in relation to
Tibetan highlanders is in keeping with a model of selection pres-
sure on EPAS1 under the stress of high-altitude hypoxia. In-
terestingly, a comparison between the HapMap Han and Andean
highlanders—both of whom have a vigorous erythropoietic re-
sponse (15)—did not detect selection at the EPAS1 locus (35). It
should be noted, however, that the Andean study (35) applied
a different array of methodologies to detect selection and over-
lapping results are not necessarily expected, given the differing
nature of the selection signals that particular techniques are
powered to detect. It is also possible that the Andean and Tibetan
populations have developed different genetic adaptations to the
hypoxia of high altitude given the differences in physiology that
are known to exist between these populations (13).
The association studies revealed that genetic variation across
EPAS1 accounts for a large proportion of the variation in he-
moglobin concentration in these populations. After controlling
for sex, the average difference in hemoglobin concentration be-
tween major allele homozygotes and heterozygotes was 53% of
one SD in theMagXiang sample and 50% in the ZhaxizongXiang
sample. In absolute terms, these differences were several fold
larger than for any of the determinants of hemoglobin concen-
tration in populations residing at low altitude (31–34). Our find-
ings are, however, consistent with previous high estimates of
heritability (h2) for hemoglobin concentration of 0.66 among
Fig. 2. Sex-adjusted hemoglobin concentrations and allelic variation in EPAS1 SNPs in a Tibetan sample from Mag Xiang (4,200 m), Tibet Autonomous
Region, China. Values were, on average, 0.8 g/dL lower for individuals who were homozygous for the major alleles compared with those who were het-
erozygous. (Top) The results of testing variants at 49 SNPs with a minor allele frequency ≥5% for genotypic association with sex-adjusted hemoglobin
concentration. (Middle) The estimated hemoglobin concentration difference (mean ± 95% confidence interval) between the major allele homozygote and
heterozygote genotypes at each SNP. Filled circles represent SNPs detected as having a significant association with hemoglobin concentration, with the false
discovery rate controlled at <0.05 across the EPAS1 locus. Open diamonds represent SNPs without significant association. (Bottom) The pairwise linkage
disequilibrium measured as r2 between SNPs.
Beall et al. PNAS | June 22, 2010 | vol. 107 | no. 25 | 11461
EV
O
LU
TI
O
N
D
ow
nl
oa
de
d
by
g
ue
st
o
n
Ja
nu
ar
y
29
, 2
02
1
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/st04
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/pnas.201002443SI ?targetid=nameddest=STXT
Tibetans at 4,850–5,450 m (36) and of 0.86 among Tibetans at
3,800–4,065 m (15). These values estimate the proportion of
additive genetic variation relative to total phenotypic variance.
The combined findings of our association and conditional linear
regression analysis are consistent with a model in which a single
causal variant at the EPAS1 locus accounts for a substantial
proportion of the heritability. Under this model, hemoglobin-
associating SNPs should be interpreted as markers and are pre-
sumed to have differentiated because they are closely linked to an
as yet to be identified causal variant. Functional studies will be
required to identify how this variation works to restrain the
hematopoietic response.
We have described a signal of natural selection on or near
EPAS1 that is associated with a blunting of the normal erythro-
poietic response to hypoxia. As EPAS1 is pleiotropic, other res-
ponses to hypoxia may be similarly affected. Some insight into
these may be given by studies of a few individuals/families, living
at low altitudes, who have been reported to have gain of function
mutations inEPAS1 (37–40). As expected, these individuals exhibit
excessive erythrocytosis, but they also appear to be particularly
susceptible to thrombotic events and to developing pulmonary
hypertension—although the total number of cases reported is
small. A larger number of cases have been reported for the slightly
less specific genetic disorder of Chuvash Polycythemia, where ho-
mozygosity for hypomorphic alleles for VHL results in elevated
levels of both HIF1α and EPAS1/HIF2α (41). The phenotype for
Chuvash Polycythemia appears very similar to that for the specific
EPAS1 gain of function mutations, with excessive erythrocytosis,
an excess risk of thrombotic events at a young age, and pulmonary
hypertension (42–45). In both conditions, the excessive eryth-
rocytosis is generally managed by venesection in the belief that this
may reduce the incidence of thrombotic events.
The clinical similarity between the phenotypes of these genetic
disorders and chronic mountain sickness is striking. Indeed, it
caused one group of investigators to observe in respect of the
EPAS1 gain of function mutations that “it raises the possibility
that polymorphic variation in HIF2α [EPAS1] contributes to the
marked differential susceptibility to erythrocytosis, reduced
plasma volume and pulmonary hypertension in humans at high
altitude” (39). Chronic mountain sickness occurs among Tibetans
at a lower prevalence than Han lowlanders (1.2% compared with
5.6%) living in the Tibet Autonomous Region (4, 46). Chronic
mountain sickness remains at that low level throughout adulthood
among Tibetans but, in Peruvians, prevalence increases with age
from approximately 13% in 20 to 39 y olds to approximately 36%
in 55 to 69 y olds at 4,300 m (47). In Andeans, excessive eryth-
rocytosis at high altitude has been associated with significant
pulmonary hypertension (48), an increased risk of stroke (49), and
also an increased risk of poor outcome in pregnancy (stillbirth,
preterm birth, or small for gestational age at birth) (50). These
Fig. 3. Sex-adjusted hemoglobin concentrations and allelic variation in EPAS1 SNPs in a Tibetan sample from Zhaxizong Xiang (4,300 m), Tibet Autonomous
Region, China. Values were, on average, 1.0 g/dL lower for individuals who were homozygous for the major alleles compared with those who were het-
erozygous (Top) The results of testing variants at 45 SNPs with a minor allele frequency ≥5% for genotypic association with sex-adjusted hemoglobin
concentration. (Middle) The estimated hemoglobin concentration difference (mean ± 95% confidence interval) between the major allele homozygote and
heterozygote genotypes at each SNP. Filled circles represent SNP detected as having a significant association with hemoglobin concentration with the false
discovery rate controlled at <0.05 across the EPAS1 locus. Open diamonds represent SNP without significant association. (Bottom) The pairwise linkage
disequilibrium measured as r2 between SNPs.
11462 | www.pnas.org/cgi/doi/10.1073/pnas.1002443107 Beall et al.
D
ow
nl
oa
de
d
by
g
ue
st
o
n
Ja
nu
ar
y
29
, 2
02
1
www.pnas.org/cgi/doi/10.1073/pnas.1002443107
findings provide insight into some of the sources of elevated
morbidity and mortality on which selection may have operated to
influence allelic frequencies for EPAS1 among the early colo-
nizers of the Tibetan plateau.
Although the similarity between chronic mountain sickness
and the EPAS1 gain of function phenotype in lowlanders is
striking, there nevertheless may be other aspects to the pheno-
type that are not revealed at low altitude but are only revealed at
high altitude, when oxygen availability is also limited. In partic-
ular, EPAS1 plays very important, if still poorly understood, roles
in both placental and embryonic development (51–54) and pos-
sibly also in the pathogenesis of fetal growth restriction (55). It is
well recognized that reproductive success is more difficult at high
altitude than at low altitude, and more difficult for nonnatives
than natives (6). For example, pre- and postnatal mortality are
threefold higher in the Han than in the Tibetans, and birth
weight is significantly lower (56). This may relate in part to the
presence of greater uterine arterial blood flow and lower he-
moglobin concentration in the Tibetans (9, 57). As such, natural
selection on EPAS1 may also have operated via effects during
pregnancy that affect both pre- and postnatal mortality.
In conclusion, this study provides evidence for natural selection
in Tibetan highlanders at a specific human gene locus. The finding
is further supported by a demonstration, in two independent
samples, that genetic variation at this locus has an associated
phenotype. The known physiological roles associated with this
gene locus provide insight into some of the factors that are likely
to have influenced human adaptation and survival following col-
onization of the Tibetan Plateau.
Materials and Methods
Human Volunteers. Ethics and consent. This study was approved by the ethics
committees of the Yunnan Population and Family Planning Institute
(Kunming, China); the Beijing Genomics Institute at Shenzhen (Shenzhen,
China); the Beijing Institute of Genomics, Chinese Academy of Sciences (Bei-
jing, China)and Case Western University (Cleveland, OH). All work was con-
ducted in accordance with the principles of the Declaration of Helsinki. All
participants were recruited after obtaining informed consent.
Sample collection. SamplingwasconductedinthreegeographicregionsofChina
approximately 2,400 kilometers apart. Theywere (i) theNorthWestern region
of Yunnan province (28°26’N 98°52’E), (ii) Mag Xiang, Xigatse Prefecture,
Tibet Autonomous Region (29°15’N 88°53’E), and (iii) Zhaxizong Xiang,
Xigatse Prefecture, Tibet Autonomous Region (28°34’N 86°38’E). Genotypic
data from the HapMap Phase III Han population were also included in this
analysis. Further details on sample collection are given in the SI Materials
and Methods.
Genotyping. All genotyping was conducted at the Beijing Institute of Geno-
mics. The whole genome genotyping was conducted using the Illumina
Veracode platform and 610-Quad high throughput genotyping chips. Gen-
otyping within EPAS1 was conducted using a customer-designed Illumina
GoldenGate assay (384 SNP plex) for all samples from Mag Xiang and some
of the samples from Zhaxizong Xiang. The remainder of the samples from
Zhaxizong Xiang were genotyped using MassARRAY assays. Further details
of these and the quality control procedures are given in the SI Materials
and Methods.
Phenotyping. Hemoglobin concentration was measured in duplicate imme-
diately after provision of a venipuncture blood sample by individuals in the
Mag Xiang sample (58). Individuals were screened with the aim of including
only healthy, native residents. Excluded were individuals who had anemia
(men and women with hemoglobin concentrations below 13.7 g/dL and 12
g/dL, respectively), hypertension, fever, poor lung function, extreme hyp-
oxemia, or who were currently or recently pregnant, or who had symptoms
or medication indicative of heart or lung disease. The Zhaxizong Xiang
sample was obtained in the course of a health survey and included all vol-
unteers who were native residents.
Statistical Analysis. GWADS. To identify variation between the Yunnan Ti-
betan and the HapMap Han populations, we calculated SNP-by-SNP χ2 sta-
tistics for allele frequencies and corrected for background population
stratification through a genomic control procedure (30). This approach
allows genome-wide significant signals of allele frequency differentiation to
be readily declared by examining genomic distributions of χ2 values in the
sample of approximately 500,000 SNPs. A threshold of genome-wide sig-
nificance was set at 5 × 10−7 (59). A full description of the method, including
a simulation for two populations with a degree of genomic divergence
equal to that between the Yunnan Tibetan and HapMap Han populations, is
given in the SI Text.
Candidate gene studies. Candidate gene association analysis of EPAS1 SNP
genotype with hemoglobin concentration phenotype was performed sepa-
rately in the two Tibet Autonomous Region samples. Mean characteristics for
these populations are given in Table S2. For each SNP, a linear additive genetic
model was fitted with hemoglobin concentration as the response variable,
the SNP as the predictive variable (entered as a numerical variable—1, 2, 3—
corresponding to the three genotypes sorted by descending allelic frequency)
and with gender as a covariate. The P values of the likelihood ratio test were
obtained from a comparison with the null model (i.e., only gender in the
model). The estimated difference stands for the increase in the sex-adjusted
mean with the addition of one copy of the minor allele taking the most
frequent homozygous genotypes as the reference. Unless otherwise stated,
an adjustment for multiple comparisons was implemented by controlling the
false discovery rate at less than 0.05 across the EPAS1 gene. The R language
and environment (R Project for Statistical Computing, http://www.r-project.
org) was used for all related analysis and graphics. Conditional linear analyses
were undertaken by including a specified SNP as an additional covariate in
the model and were implemented using plink (http://pngu.mgh.harvard.edu/
~purcell/plink/).
ACKNOWLEDGMENTS. We thank Wei Chen, Jian Bai, and Feng Cheng of
Beijing Institute of Genomics for their contribution in genotyping and data
processing and three anonymous reviewers for their critical and constructive
comments. We also thank the people of Shangri-La and De Qin Xians,
Yunnan Province; Mag Xiang and Zhaxizong Xiang, Tibet Autonomous
Region, for their cooperation and hospitality during data collection. We are
grateful to the Tibet Academy of Social Sciences for their collaboration and
enabling permission to collect data in Mag Xiang. This work was supported
by the National Science Foundation; National Institutes of Health National
Center for Research Resources, National Institute of General Medical Scien-
ces, National Cancer Institute, National Heart, Lung, and Blood Institute;
National Natural Science Foundation of China Grant 30890031 and Ministry
of Science and Technology Grant 2006DFA31850 (to C.Z.); Chinese Academy
of Sciences Grant KSCX2-YW-R-76 and Science and Technology Plan of
the Tibet Autonomous Region Grant 2007-2-18 to Beijing Genomics Institute
at Shenzhen; and an International Joint Project award from the Royal Soci-
ety. This consortium grew from a catalysis meeting sponsored by the Na-
tional Science Foundation-supported National Evolutionary Synthesis Center
(http://www.NESCent.org).
Fig. 4. Differences in allelic frequency at SNPs within EPAS1 between the
HapMap Han, Mag Xiang and Zhaxizong Xiang cohorts. The horizontal axis
is SNP position according to build 36.1. The vertical axis is allelic frequency,
with the allele selected for display as the one occurring most frequently in
the Mag Xiang cohort. Squares denote data for HapMap Han; circles denote
data for Mag Xiang Tibetans; triangles denote data for Zhaxizong Xiang
Tibetans. Filled symbols denote those SNPs having significant associations
with hemoglobin in both Mag Xiang and Zhaxizong Xiang cohorts; open
symbols denote those SNPs without both such associations.
Beall et al. PNAS | June 22, 2010 | vol. 107 | no. 25 | 11463
EV
O
LU
TI
O
N
D
ow
nl
oa
de
d
by
g
ue
st
o
n
Ja
nu
ar
y
29
, 2
02
1
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/pnas.201002443SI ?targetid=nameddest=STXT
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/pnas.201002443SI ?targetid=nameddest=STXT
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/pnas.201002443SI ?targetid=nameddest=STXT
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/pnas.201002443SI ?targetid=nameddest=STXT
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/pnas.201002443SI ?targetid=nameddest=STXT
http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1002443107/-/DCSupplemental/st02
http://www.r-project.org
http://www.r-project.org
http://pngu.mgh.harvard.edu/~purcell/plink/
http://pngu.mgh.harvard.edu/~purcell/plink/
http://www.NESCent.org
1. ZhaoM, et al. (2009)Mitochondrial genome evidence reveals successful Late Paleolithic
settlement on the Tibetan Plateau. Proc Natl Acad Sci USA 106:21230–21235.
2. Aldenderfer M (2003) Moving up in the world: Archaeologists seek to understand
how and when people came to occupy the Andean and Tibetan plateaus. Am Sci 91:
542–549.
3. Wu T, Miao C (2002) High altitude heart disease in children in Tibet. High Alt Med Biol
3:323–325.
4. León-Velarde F, et al. (2005) Consensus statement on chronic and subacute high
altitude diseases. High Alt Med Biol 6:147–157.
5. Wu T (2004) Life on the high Tibetan plateau. High Alt Med Biol 5:1–2.
6. Julian CG, Wilson MJ, Moore LG (2009) Evolutionary adaptation to high altitude: A
view from in utero. Am J Hum Biol 21:614–622.
7. Monge C (1973)Acclimatization in the Andes (Johns Hopkins University Press, Baltimore).
8. Moore LG, Young D, McCullough RE, Droma T, Zamudio S (2001) Tibetan protection
from intrauterine growth restriction (IUGR) and reproductive loss at high altitude. Am
J Hum Biol 13:635–644.
9. Moore LG, Zamudio S, Zhuang J, Sun S, Droma T (2001) Oxygen transport in tibetan
women during pregnancy at 3,658 m. Am J Phys Anthropol 114:42–53.
10. Monge-M C, Encinas E, Heraud C, Hurtado A (1928) La Enfermedad de los Andes
(Sindromes Eritemicos) (Imprenta Americana, Lima).
11. Pei SX, et al. (1989) Chronic mountain sickness in Tibet. Q J Med 71:555–574.
12. Beall CM (2007) Two routes to functional adaptation: Tibetan and Andean high-
altitude natives. Proc Natl Acad Sci USA 104 (Suppl 1):8655–8660.
13. Beall CM (2002) Biodiversity of Human Populations in Mountain Environments.
Mountain Biodiversity: A Global Assessment, eds Korner C, Spehn EM (The Parthenon
Publishing Group, New York), pp 199–210.
14. Winslow RM, et al. (1989) Different hematologic responses to hypoxia in Sherpas and
Quechua Indians. J Appl Physiol 66:1561–1569.
15. Beall CM, et al. (1998) Hemoglobin concentration of high-altitude Tibetans and
Bolivian Aymara. Am J Phys Anthropol 106:385–400.
16. Beall CM (2002) Biodiversity of Human Populations in Moutain Environments.
Mountain Biodiversity: A Global Assessment, eds Korner C, Spehn EM (The Parthenon
Publishing Group, New York), pp 199–210.
17. Garruto RM, et al. (2003) Hematological differences during growth among Tibetans
and Han Chinese born and raised at high altitude in Qinghai, China. Am J Phys
Anthropol 122:171–183.
18. Ward MP, Milledge JS, West JB (1995) High Altitude Medicine and Physiology
(Chapman and Hall Medical, London).
19. Zhuang J, et al. (1996) Smaller alveolar-arterial O2 gradients in Tibetan than Han
residents of Lhasa (3658 m). Respir Physiol 103:75–82.
20. Moore LG, et al. (1992) Are Tibetans better adapted? Int J Sports Med 13(Suppl 1):
S86–S88.
21. Droma T, et al. (1991) Increased vital and total lung capacities in Tibetan compared to
Han residents of Lhasa (3,658 m). Am J Phys Anthropol 86:341–351.
22. Sun SF, et al. (1990) Greater maximal O2 uptakes and vital capacities in Tibetan than
Han residents of Lhasa. Respir Physiol 79:151–161.
23. Zhuang J, et al. (1993) Hypoxic ventilatory responsiveness in Tibetan compared with
Han residents of 3,658 m. J Appl Physiol 74:303–311.
24. Chi JT, et al. (2006) Gene expression programs in response to hypoxia: Cell type
specificity and prognostic significance in human cancers. PLoS Med 3:e47.
25. Rankin EB, et al. (2007) Hypoxia-inducible factor-2 (HIF-2) regulates hepatic
erythropoietin in vivo. J Clin Invest 117:1068–1077.
26. Su B, et al. (2000) Y chromosome haplotypes reveal prehistorical migrations to the
Himalayas. Hum Genet 107:582–590.
27. Qian Y, et al. (2000) Multiple origins of Tibetan Y chromosomes. Hum Genet 106:
453–454.
28. Gayden T, et al. (2009) Genetic insights into the origins of Tibeto-Burman populations
in the Himalayas. J Hum Genet 54:216–223.
29. Barrett JC, Fry B, Maller J, Daly MJ (2005) Haploview: Analysis and visualization of LD
and haplotype maps. Bioinformatics 21:263–265.
30. Devlin B, Roeder K, Wasserman L (2001) Genomic control, a new approach to genetic-
based association studies. Theor Popul Biol 60:155–166.
31. Benyamin B, et al. (2009) Common variants in TMPRSS6 are associated with iron status
and erythrocyte volume. Nat Genet 41:1173–1175.
32. Chambers JC, et al. (2009) Genome-wide association study identifies variants in
TMPRSS6 associated with hemoglobin levels. Nat Genet 41:1170–1172.
33. Ganesh SK, et al. (2009) Multiple loci influence erythrocyte phenotypes in the
CHARGE Consortium. Nat Genet 41:1191–1198.
34. Soranzo N, et al. (2009) A genome-wide meta-analysis identifies 22 loci associated
with eight hematological parameters in the HaemGen consortium. Nat Genet 41:
1182–1190.
35. Bigham AW, et al. (2009) Identifying positive selection candidate loci for high-altitude
adaptation in Andean populations. Hum Genomics 4:79–90.
36. Beall CM, Blangero J, Williams-Blangero S, Goldstein MC (1994) Major gene for
percent of oxygen saturation of arterial hemoglobin in Tibetan highlanders. Am J
Phys Anthropol 95:271–276.
37. Percy MJ, et al. (2008) A gain-of-function mutation in the HIF2A gene in familial
erythrocytosis. N Engl J Med 358:162–168.
38. Percy MJ, et al. (2008) Novel exon 12 mutations in the HIF2A gene associated with
erythrocytosis. Blood 111:5400–5402.
39. Gale DP, Harten SK, Reid CDL, Tuddenham EGD, Maxwell PH (2008) Autosomal
dominant erythrocytosis and pulmonary arterial hypertension associated with an
activating HIF2 α mutation. Blood 112:919–921.
40. Martini M, et al. (2008) A novel heterozygous HIF2AM535I mutation reinforces the role
of oxygen sensing pathway disturbances in the pathogenesis of familial erythrocytosis.
Haematologica 93:1068–1071.
41. Ang SO, et al. (2002) Disruption of oxygen homeostasis underlies congenital Chuvash
polycythemia. Nat Genet 32:614–621.
42. Gordeuk VR, et al. (2004) Congenital disorder of oxygen sensing: Association of the
homozygous Chuvash polycythemia VHL mutation with thrombosis and vascular
abnormalities but not tumors. Blood 103:3924–3932.
43. Sergeyeva A, et al. (1997) Congenital polycythemia in Chuvashia. Blood 89:2148–2154.
44. Smith TG, et al. (2006) Mutation of von Hippel-Lindau tumour suppressor and human
cardiopulmonary physiology. PLoS Med 3:e290.
45. Bushuev VI, et al. (2006) Endothelin-1, vascular endothelial growth factor and systolic
pulmonary artery pressure in patients with Chuvash polycythemia. Haematologica 91:
744–749.
46. Ren YH, et al. (2010) Incidence of high altitude illnesses among unacclimatized
persons who acutely ascended to Tibet. High Alt Med Biol 11:39–42.
47. Monge-C C, Arregui A, León-Velarde F (1992) Pathophysiology and epidemiology of
chronic mountain sickness. Int J Sports Med 13 (Suppl 1):S79–S81.
48. Penaloza D, Arias-Stella J (2007) The heart and pulmonary circulation at high
altitudes: Healthy highlanders and chronic mountain sickness. Circulation 115:
1132–1146.
49. Jaillard AS, Hommel M, Mazetti P (1995) Prevalence of stroke at high altitude (3380
m) in Cuzco, a town of Peru. A population-based study. Stroke 26:562–568.
50. Gonzales GF, Steenland K, Tapia V (2009) Maternal hemoglobin level and fetal
outcome at low and high altitudes. Am J Physiol Regul Integr Comp Physiol 297:
R1477–R1485.
51. Pringle KG, Kind KL, Sferruzzi-Perri AN, Thompson JG, Roberts CT (2009) Beyond
oxygen: Complex regulation and activity of hypoxia inducible factors in pregnancy.
Hum Reprod Update, 10.1093/humupd/dmp046.
52. Dunwoodie SL (2009) The role of hypoxia in development of the Mammalian embryo.
Dev Cell 17:755–773.
53. Meade ES, Ma YY, Guller S (2007) Role of hypoxia-inducible transcription factors
1alpha and 2alpha in the regulation of plasminogen activator inhibitor-1 expression
in a human trophoblast cell line. Placenta 28:1012–1019.
54. Cowden Dahl KD, et al. (2005) Hypoxia-inducible factors 1alpha and 2alpha regulate
trophoblast differentiation. Mol Cell Biol 25:10479–10491.
55. Dai SY, Kanenishi K, Ueno M, Sakamoto H, Hata T (2004) Hypoxia-inducible factor-
2alpha is involved in enhanced apoptosis in the placenta from pregnancies with fetal
growth restriction. Pathol Int 54:843–849.
56. Moore LG, Young D, McCullough RE, Droma T, Zamudio S (2001) Tibetan protection
from intrauterine growth restriction (IUGR) and reproductive loss at high altitude. Am
J Hum Biol 13:635–644.
57. Xing Y, et al. (2009) Hemoglobin levels and anemia evaluation during pregnancy in
the highlands of Tibet: A hospital-based study. BMC Public Health 9:336.
58. Hoit BD, et al. (2005) Nitric oxide and cardiopulmonary hemodynamics in Tibetan
highlanders. J Appl Physiol 99:1796–1801.
59. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of
14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:
661–678.
11464 | www.pnas.org/cgi/doi/10.1073/pnas.1002443107 Beall et al.
D
ow
nl
oa
de
d
by
g
ue
st
o
n
Ja
nu
ar
y
29
, 2
02
1
www.pnas.org/cgi/doi/10.1073/pnas.1002443107
LETTER
doi:10.1038/nature13408
Altitude adaptation in Tibetans caused by
introgression of Denisovan-like DNA
Emilia Huerta-Sánchez1,2,3*, Xin Jin1,4*, Asan1,5,6*, Zhuoma Bianba7*, Benjamin M. Peter2, Nicolas Vinckenbosch2, Yu Liang1,5,6,
Xin Yi1,5,6, Mingze He1,8, Mehmet Somel9, Peixiang Ni1, Bo Wang1, Xiaohua Ou1, Huasang1, Jiangbai Luosang1, Zha Xi Ping Cuo10,
Kui Li11, Guoyi Gao12, Ye Yin1, Wei Wang1, Xiuqing Zhang1,13,14, Xun Xu1, Huanming Yang1,15,16, Yingrui Li1, Jian Wang1,16,
Jun Wang1,15,17,18,19 & Rasmus Nielsen1,2,20,21
As modern humans migrated out of Africa, they encountered many
new environmental conditions, including greater temperature extremes,
different pathogens and higher altitudes. These diverse environments
are likely to have acted as agents of natural selection and to have led to
local adaptations. One of the most celebrated examples in humans is
the adaptation of Tibetans to the hypoxic environment of the high-
altitude Tibetan plateau1–3. A hypoxia pathway gene, EPAS1, was pre-
viously identified as having the most extreme signature of positive
selection in Tibetans4–10, and was shown to be associated with differ-
ences in haemoglobin concentration at high altitude. Re-sequencing
the region around EPAS1 in 40 Tibetan and 40 Han individuals, we
find that this gene has a highly unusual haplotype structure that can
only be convincingly explained by introgression of DNA from Deni-
sovan or Denisovan-related individuals into humans. Scanning a
larger set of worldwide populations, we find that the selected haplo-
type is only found in Denisovans and in Tibetans, and at very low
frequency among Han Chinese. Furthermore, the length of the hap-
lotype, and the fact that it is not found in any other populations, makes
it unlikely that the haplotype sharing between Tibetans and Deni-
sovans was caused by incomplete ancestral lineage sorting rather
than introgression. Our findings illustrate that admixture with other
hominin species has provided genetic variation that helped humans
to adapt to new environments.
The Tibetan plateau (at greater than 4,000 m) is inhospitable to human
settlement because of low atmospheric oxygen pressure (,40% lower
than at sea level), cold climate and limited resources (for example, sparse
vegetation). Despite these extreme conditions, Tibetans have success-
fully settled in the plateau, in part due to adaptations that confer lower
infant mortality and higher fertility than acclimated women of low-
altitude origin. The latter tend to have difficulty bearing children at high
altitude, and their offspring typically have low birth weights compared to
offspring from women of high-altitude ancestry1,2. One well-documented
pregnancy-related complication due to high altitude is the higher inci-
dence of preeclampsia2,11 (hypertension during pregnancy). In addition,
the physiological response to low oxygen differs between Tibetans and
individuals of low-altitude origin. For most individuals, acclimatiza-
tion to low oxygen involves an increase in blood haemoglobin levels.
However, in Tibetans, the increase in haemoglobin levels is limited3,
presumably because high haemoglobin concentrations are associated
with increased blood viscosity and increased risk of cardiac events, thus
resulting in a net reduction in fitness12,13.
Recently, the genetic basis underlying adaptation to high altitude in
Tibetans was elucidated4–10 using exome and single nucleotide polymor-
phism (SNP) array data. Several genes seem to be involved in the res-
ponse but most studies identified EPAS1, a transcription factor induced
under hypoxic conditions, as the gene with the strongest signal of Tibetan
specific selection4–10. Furthermore, SNP variants in EPAS1 showed sig-
nificant associations with haemoglobin levels in the expected direction
in several of these studies; individuals carrying the derived allele have
lower haemoglobin levels than individuals homozygous for the ances-
tral allele. Here, we re-sequence the complete EPAS1 gene in 40 Tibetan
and 40 Han individuals at more than 2003 coverage to further char-
acterize this impressive example of human adaptation. Remarkably, we
find the source of adaptation was likely to be due to the introduction of
genetic variants from archaic Denisovan-like individuals (individuals
closely related to the Denisovan individual from the Altai Mountains14)
into the ancestral Tibetan gene pool.
After applying standard next-generation sequencing filters (see Meth-
ods), we call a total of 477 SNPs in a region of approximately 129 kilo-
bases (kb) in the combined Han and Tibetan samples (Supplementary
Tables 1 and 2). We compute the fixation index (FST; see Methods) between
Han and Tibetans, and confirm that it is highly elevated in the
EPAS1
region as expected under strong local selection (Extended Data Fig. 1).
Indeed, by comparison to 26 populations from the Human Genome
Diversity Panel15,16 (Fig. 1) it is clear that the variants in this region are
far more differentiated than one would expect given the average genome-
wide differentiation between Han and Tibetans (FST ,0.02, ref. 4). The
only other genes with comparably large frequency differences between
any closely related populations are the previously identified loci associ-
ated with lighter skin pigmentation in Europeans, SLCA45A2 and
HERC2
(refs 17–20), although in these examples the populations compared (for
example, Hazara and French, Brahui and Russians) are more genetically
differentiated than Han and Tibetans. In populations as closely related
as Han and Tibetans, we find no examples of SNPs with as much differen-
tiation as seen in EPAS1, illustrating the uniqueness of its selection signal.
FST is particularly elevated in a 32.7-kb region containing the 32 most
differentiated SNPs (green box in Extended Data Fig. 1 and Supplemen-
tary Table 3), which is the best candidate region for the advantageous
mutation(s). We therefore focus the subsequent analyses primarily on
this region. Phasing the data (see Methods) to identify Han and Tibetan
haplotypes in this region (Fig. 2), we find that Tibetans carry a high-
frequency haplotype pattern that is strikingly different from both their
*These authors contributed equally to this work.
1BGI-Shenzhen, Shenzhen 518083, China. 2Department of Integrative Biology, University of California, Berkeley, California 94720 USA. 3School of Natural Sciences, University of California, Merced,
California 95343 USA. 4School of Bioscience and Bioengineering, South China University of Technology, Guangzhou 510006, China. 5Binhai Genomics Institute, BGI-Tianjin, Tianjin 300308, China.
6Tianjin Translational Genomics Center, BGI-Tianjin, Tianjin 300308, China. 7The People’s Hospital of Lhasa, Lhasa 850000, China. 8Bioinformatics and Computational Biology Program, Iowa State
University, Ames, Iowa 50011, USA. 9Department of Biological Sciences, Middle East Technical University, 06800 Ankara, Turkey. 10The Second People’s Hospital of Tibet Autonomous Region, Lhasa
850000, China. 11The People’s Hospital of the Tibet Autonomous Region, Lhasa 850000, China. 12The hospital of XiShuangBanNa Dai Nationalities, Autonomous Jinghong, 666100 Yunnan, China. 13The
Guangdong Enterprise Key Laboratory of Human Disease Genomics, BGI-Shenzhen, 518083 Shenzhen, China. 14Shenzhen Key Laboratory of Transomics Biotechnologies, BGI-Shenzhen, 518083
Shenzhen, China. 15Princess Al Jawhara Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Jeddah 21589, Saudi Arabia. 16James D. Watson Institute of Genome
Science, 310008 Hangzhou, China. 17Department of Biology, University of Copenhagen, Ole MaaløesVej 5, 2200 Copenhagen, Denmark. 18Macau University of Science and Technology, AvenidaWai long,
Taipa, Macau 999078,China. 19Departmentof Medicine, University of Hong Kong 999077,Hong Kong. 20Departmentof Statistics, University of California, Berkeley, California 94720,USA. 21Department of
Biology, University of Copenhagen, 2200 Copenhagen, Denmark.
1 9 4 | N A T U R E | V O L 5 1 2 | 1 4 A U G U S T 2 0 1 4
Macmillan Publishers Limited. All rights reserved©2014
www.nature.com/doifinder/10.1038/nature13408
minority haplotypes and the common haplotype observed in Han Chinese
For example, the region harbours a highly differentiated 5-SNP haplo-
type motif (AGGAA) within a 2.5-kb window that is only seen in Tibetan
samples and in none of the Han samples (the first five SNPs in Sup-
plementary Table 3, and blue arrows in Fig. 2). The pattern of genetic
variation within Tibetans appears even more unusual because none of
the variants in the five-SNP motif is present in any of the minority hap-
lotypes of Tibetans. Even when subject to a selective sweep, we would
not generally expect a single haplotype to contain so many unique muta-
tions not found on other haplotypes.
We investigate whether a model of selection on either a de novo muta-
tion (SDN) or selection on standing variation (SSV) could possibly lead
to so many fixed differences between haplotype classes in such a short
region within a single population. To do so, we simulate a 32.7-kb region
under these models assuming different strengths of selection and con-
ditioning on the current allele frequency in the sample (see Methods).
We find that the observed number of fixed differences between the hap-
lotype classes is significantly higher than what is expected by simula-
tions under any of the models explored (Extended Data Fig. 2). Thus
the degree of differentiation between haplotypes is significantly larger
than expected from mutation, genetic drift and directional selection alone.
In other words, it is unlikely (P , 0.02 under either a SSV scenario or
under a SDN scenario) that the high degree of haplotype differentiation
could be caused by a single beneficial mutation landing by chance on a
background of rare SNPs, which are then brought to high frequency by
selection. The remaining explanations are the presence of strong epistasis
between many mutations, or that a divergent population introduced the
haplotype into Tibetans by gene flow or through ancestral lineage sorting.
We search for potential donor populations in two different data sets:
the 1000 Genomes Project21 and whole genome data from ref. 14. We
originally defined the EPAS1 32.7-kb region boundaries by the level of
observed differentiation between the Tibetans and Han only (Supplemen-
tary Table 3, Extended Data Fig. 1 and Fig. 2) as described in the pre-
vious section. In that region, the most common haplotype in Tibetans
is tagged by the distinctive five-SNP motif (AGGAA; the first five SNPs
in Fig. 2), not found in any of our 40 Han samples. We first focus on this
five-SNP motif and determine whether it is unique to Tibetans or if it is
found in other populations.
Intriguingly, when we examine the 1000 Genomes Project data set, we
discover that the Tibetan five-SNP motif (AGGAA) is not present in any
of these populations, except for a single CHS (Southern Han Chinese)
and a single CHB (Beijing Han Chinese) individual. Extended Data Fig. 3
contains the frequencies of all the haplotypes present in the fourteen
1000 Genomes populations21 at these five SNP positions. Furthermore,
when we examine the data set from ref. 14 containing both modern (Pap-
uan, San, Yoruba, Mandeka, Mbuti, French, Sardinian, Han Dai, Dinka,
Karitiana, and Utah residents of northern and western European ances-
try (CEU)) and archaic (high-coverage Denisovan and low-coverage
Croatian Neanderthal) human genomes14, we discover that the five-SNP
motif is completely absent in all of their modern human population sam-
ples (Supplementary Table 4). Therefore, apart from one CHS and one
CHB individual, none of the other extant human populations sampled
to date carry this five-SNP haplotype. Notably, the Denisovan haplo-
type at these five sites (AGGAA) exactly matches the five-SNP Tibetan
motif (Supplementary Table 4 and Extended Data Fig. 3).
We observe the same pattern when focusing on the entire 32.7-kb
region and not just the five-SNP motif. Twenty SNPs in this region have
unusually high frequency differences of at least 0.65 between Tibetans
and all the other populations from the 1000 Genomes Project (Extended
Data Fig. 4). However, in Tibetans, 15 out of these 20 SNPs are identical
to the Denisovan haplotype generating an overall pattern of high hap-
lotype similarity between the selected Tibetan haplotype and the Deni-
sovan haplotype (Supplementary Tables 5–7). Interestingly, five of these
SNPs in the region are private SNPs shared between Tibetans and the
Denisovan, but not shared with any other population worldwide, except
Genome-wide FST
M
ax
im
um
fr
eq
ue
nc
y
di
ffe
re
nc
e
0.00 0.05 0.10 0.15 0.20
0.4
0.5
0.6
0.7
0.8
0.
9
1.0
EPAS1
SLC45A2
SLC45A2
SLC45A2
HERC2
Figure 1 | Genome-wide FST versus maximal allele frequency difference.
The relationship between genome-wide FST (x axis) computed for each pair of
the 26 populations and maximal allele frequency difference (y axis), first
explored in ref. 19. Maximal allele frequency difference is defined as the largest
frequency difference observed for any SNP between a population pair. The 26
populations are from the Human Genome Diversity Panel (HGDP). The
labels highlight genes that harbour SNPs previously identified as having strong
local adaptation. The grey points represent the observed relationship between
population differentiation (FST ) and maximal allele frequency difference;
the more differentiated populations tend to have mutations with larger
frequency differences. The star symbol and the yellow symbols represent
outliers; these are populations that are not highly differentiated but where we
find some mutations that have higher frequency differences than expected
(light blue line).
Ti
be
ta
ns
H
an
C
hi
ne
se
***** ** * * * * ** * ** * * * * * * * *** ** * **
Figure 2 | Haplotype pattern in a region defined by SNPs that are at high
frequency in Tibetans and at low frequency in Han Chinese. Each column is
a polymorphic genomic location (95 in total), each row is a phased haplotype
(80 Han and 80 Tibetan haplotypes), and the coloured column on the left
denotes the population identity of the individuals. Haplotypes of the Denisovan
individual are shown in the top two rows (green). The black cells represent the
presence of the derived allele and the grey space represents the presence of
the ancestral allele (see Methods). The first and last columns correspond to the
first and last positions in Supplementary Table 3, respectively. The red and
blue arrows indicate the 32 sites in Supplementary Table 3. The blue arrows
represent a five-SNP haplotype block defined by the first five SNPs in the
32.7-kb region. Asterisks indicate sites at which Tibetans share a derived allele
with the Denisovan individual.
LETTER RESEARCH
1 4 A U G U S T 2 0 1 4 | V O L 5 1 2 | N A T U R E | 1 9 5
Macmillan Publishers Limited. All rights reserved©2014
for two SNPs at low frequency in Han Chinese (Extended Data Fig. 4
and Supplementary Table 7).
If we consider all SNPs (not just the most differentiated) in the 32.7-kb
region annotated in humans, to build a haplotype network22 using the
40 most common haplotypes, we observe a clear pattern in which the
Tibetan haplotype is much closer to the Denisovan haplotype than any
modern human haplotype (Fig. 3 and Extended Fig. 5a; see Extended
Data Fig. 6a, b for haplotype networks constructed using other criteria).
Furthermore, we find that the Tibetan haplotype is slightly more diver-
gent from other modern human populations than the Denisovan haplo-
type is, a pattern expected under introgression (see Methods and Extended
Data Fig. 5b). Raw sequence divergence for all sites and all haplotypes
shows a similar pattern (Extended Data Fig. 7). Moreover, the divergence
between the common Tibetan haplotype and Han haplotypes is larger
than expected for comparisons among modern humans, but well within
the distribution expected from human–Denisovan comparisons (Extended
Data Fig. 8). Notably, sequence divergence between the Tibetans’ most
common haplotype and Denisovan is significantly lower (P 5 0.0028)
than expected from human–Denisovan comparisons (Extended Data
Fig. 8). We also find that the number of pairwise differences between
the common Tibetan haplotype and the Denisovan haplotype (n 5 12)
is compatible with the levels one would expect from mutation accu-
mulation since the introgression event (see Methods for Extended Data
Fig. 8). Finally, if we compute D (ref. 14) and S* (refs 23, 24), two statis-
tics that have been designed to detect archaic introgression into mod-
ern humans, we obtain significant values (D-statistic P , 0.001, and S*
P # 0.035) for the 32.7-kb region using multiple null models of no gene
flow (see Methods, Supplementary Tables 8–10, and Extended Data
Figs 9 and 10a).
Thus, we conclude that the haplotype associated with altitude adapta-
tion in Tibetans is likely to be a product of introgression from Denisovan
or Denisovan-related populations. The only other possible explanation
is ancestral lineage sorting. However, this explanation is very unlikely
as it cannot explain the significant D and S* values and because it would
require a long haplotype to be maintained without recombination since
the time of divergence between Denisovans and humans (estimated to
be at least 200,000 years (ref. 14)). The chance of maintaining a 32.7-kb
fragment in both lineages throughout 200,000 years is conservatively
estimated at P 5 0.0012 assuming a constant recombination of 2.3 3
1028 per base pair (bp) per generation (see Methods). Furthermore, the
haplotype would have to have been independently lost in all African and
non-African populations, except for Tibetans and Han Chinese.
We have re-sequenced the EPAS1 region and found that Tibetans har-
bour a highly differentiated haplotype that is only found at very low fre-
quency in the Han population among all the 1000 Genomes populations,
and is otherwise only observed in a previously sequenced Denisovan
individual14. As the haplotype is observed in a single individual in both
CHS and CHB samples, it suggests that it was introduced into humans
before the separation of Han and Tibetan populations, but subject to
selection in Tibetans after the Tibetan plateau was colonized. Alterna-
tively, recent admixture from Tibetans to Hans may have introduced
the haplotype to nearby Han populations outside Tibet. The CHS and
CHB individuals carrying the five-SNP Tibetan–Denisovan haplotype
(Extended Data Fig. 3) show no evidence of being recent migrants from
Tibet (see Methods and Extended Data Fig. 10b), suggesting that if the
haplotype was carried from Tibet to China by migrants, this migration
did not occur within the last few generations.
Previous studies examining the genetic contributions of Denisovans
to modern humans14,25 suggest that Melanesians have a much larger Deni-
sovan component than either Han or Mongolians, even though the latter
populations are geographically much closer to the Altai mountains14,25.
Interestingly, the putatively beneficial Denisovan EPAS1 haplotype is
not observed in modern-day Melanesians or in the high-coverage Altai
Neanderthal26 (Supplementary Table 4). Evidence has been found for
I
II
III
I
V
V
VI
VII
VIII
I
XX
XI
XII
XIII
XIV
XV
XVI
XVII
XVIII
XIX
XX
XXI
XXII
XXIII
XXIV
XXV
XXVI
XXVII
XXVIII
XXIX
XXX
XXXI
XXXII
XXXIII
XXXIV
XXXV
XXXVI
XXXVII
XXXVIII
XXXIX
XL
XLI
CEU
YRICHB
CHS
GBR
PUR
JPT
LWK
ASWTibetan
Denisovan TSI
FIN
MXL
CLM
2 10 20
35
9
1
40
Figure 3 | A haplotype network based on the number of pairwise differences
between the 40 most common haplotypes. The haplotypes were defined from
all the SNPs present in the combined 1000 Genomes and Tibetan samples:
515 SNPs in total within the 32.7-kb EPAS1 region. The Denisovan haplotypes
were added to the set of the common haplotypes. The R software package
pegas23 was used to generate the figure, using pairwise differences as distances.
Each pie chart represents one unique haplotype, labelled with Roman numerals,
and the radius of the pie chart is proportional to the log2(number of
chromosomes with that haplotype) plus a minimum size so that it is easier
to see the Denisovan haplotype. The sections in the pie provide the breakdown
of the haplotype representation amongst populations. The width of the edges
is proportional to the number of pairwise differences between the joined
haplotypes; the thinnest edge represents a difference of one mutation. The
legend shows all the possible haplotypes among these populations. The
numbers (1, 9, 35 and 40) next to an edge (the line connecting two haplotypes)
in the bottom right are the number of pairwise differences between the
corresponding haplotypes. We added an edge afterwards between the Tibetan
haplotype XXXIII and its closest non-Denisovan haplotype (XXI) to indicate its
divergence from the other modern human groups. Extended Data Fig. 5a
contains all the pairwise differences between the haplotypes presented in this
figure. ASW, African Americans from the south western United States; CEU,
Utah residents with northern and western European ancestry; GBR, British;
FIN, Finnish; JPT, Japanese; LWK, Luhya; CHS, southern Han Chinese; CHB,
Han Chinese from Beijing; MXL, Mexican; PUR, Puerto Rican; CLM,
Colombian; TSI, Toscani; YRI, Yoruban. Where there is only one line within a
pie chart, this indicates that only one population contains the haplotype.
RESEARCH LETTER
1 9 6 | N A T U R E | V O L 5 1 2 | 1 4 A U G U S T 2 0 1 4
Macmillan Publishers Limited. All rights reserved©2014
Denisovan admixture throughout southeast Asia (as well as in Melane-
sians) based on a global analysis of SNP array data from 1,600 individuals
from a diverse set of populations27, and this finding has been recently
confirmed by ref. 26. Therefore, it appears that sufficient archaic admix-
ture into populations near the Tibetan region occurred to explain the
presence of this Denisovan haplotype outside Melanesia. Furthermore,
the haplotype may have been maintained in some human populations,
including Tibetans and their ancestors, through the action of natural
selection.
Recently, a few studies have supported the idea of adaptive introgres-
sion from archaic humans to modern humans as having a role in the
evolution of immunity-related genes (HLA (ref. 28) and STAT2 (ref. 29))
and in the evolution of skin pigmentation genes (BNC2 (refs 23, 30)).
Our findings imply that one of the most clear-cut examples of human
adaptation is likely to be due to a similar mechanism of gene flow from
archaic hominins into modern humans. With our increased understand-
ing that human evolution has involved a substantial amount of gene
flow from various archaic species, we are now also starting to under-
stand that adaptation to local environments may have been facilitated
by gene flow from other hominins that may already have been adapted
to those environments.
METHODS SUMMARY
DNA samples included in this work were extracted from peripheral blood of 41
unrelated Tibetan individuals living at more than 4,300 m above sea level within
the Himalayan Plateau, with informed consent. Tibetan identity was based on self-
reported family ancestry. The individuals were from two villages of Dingri (4,300 m
altitude) and Naqu (4,600 m altitude). These individuals are a subset of the 50 indi-
viduals exome-sequenced analysed in ref. 4. Samples of 40 Han Chinese (CHB) are
from the 1000 Genomes Project. A combined strategy of long-range PCR and next-
generation sequencing was used to decipher the whole EPAS1 gene and its 630-kb
flanking region. We designed 38 pairs of long-range PCR primers to amplify the
region in 41 Tibetan and 40 Han individuals. PCR products from all individuals
were fragmented and indexed, then sequenced to higher than 260-fold depth for
each individual with the Illumina Hiseq2000 sequencer. The reads were aligned to
the UCSC human reference genome (hg18) using the SOAPaligner. Genotypes of
each individual at every genomic location of the EPAS1 gene and the flanking region
were called by SOAPsnp. To make comparisons with the 40 Han easier, we only used
40 Tibetan samples for this study.
Online Content Methods, along with any additional Extended Data display items
andSourceData, are available in the online version of the paper; references unique
to these sections appear only in the online paper.
Received 17 December 2013; accepted 28 April 2014.
Published online 2 July; corrected online 13 August 2014 (see full-text HTML
version for details).
1. Moore, L. G., Young, D., McCullough, R. E., Droma, T. & Zamudio, S. Tibetan
protection from intrauterine growth restriction (IUGR) and reproductive loss at
high altitude. Am. J. Hum. Biol. 13, 635–644 (2001).
2. Niermeyer, S. et al. Child health and living at high altitude. Arch. Dis. Child. 94,
806–811 (2009).
3. Wu, T. et al. Hemoglobin levels in Quinghai-Tibet: different effects of gender for
Tibetans vs. Han. J. Appl. Physiol. 98, 598–604 (2005).
4. Yi, X. et al. Sequencing of 50 human exomes reveals adaptation to high altitude.
Science 329, 75–78 (2010).
5. Bigham, A. et al. Identifying signature of natural selection in Tibetan and Andean
populations using dense genome scan data. PLoS Genet. 6, e1001116 (2010).
6. Simonson, T.S.et al.Geneticevidence forhigh-altitudeadaptation inTibet.Science
329, 72–75 (2010).
7. Beall, C. M. et al. Natural selection on EPAS1 (HIF2a) associated with low
hemoglobin concentration in Tibetan highlanders. Proc. Natl Acad. Sci. USA 107,
11459–11464 (2010).
8. Peng, Y. et al. Genetic variations in Tibetan populations and high-altitude
adaptation at the Himalayas. Mol. Biol. Evol. 28, 1075–1081 (2011).
9. Xu, S. et al. A genome-wide search for signals of high-altitude adaptation in
Tibetans. Mol. Biol. Evol. 28, 1003–1011 (2011).
10. Wang, B. et al. On the origin of Tibetans and their genetic basis in adapting
high-altitude environments. PLoS ONE 6, e17002 (2011).
11. Moore, L. G. et al. Maternal adaptation to high-altitude pregnancy: an experiment
of nature—a review. Placenta 25, S60–S71 (2004).
12. Vargas, E. & Spielvogel, H. Chronic mountain sickness, optimal hemoglobin, and
heart disease. High Alt. Med. Biol. 7, 138–149 (2006).
13. Yip, R. Significance of anabnormally low orhighhemoglobin concentrationduring
pregnancy: special consideration of iron nutrition1’2’3. Am. J. Clin. Nutr. 72,
272S–279S (2000).
14. Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan
individual. Science 338, 222–226 (2012).
15. Li, J. Z. et al. Worldwide human relationships inferred from genome-wide patterns
of variation. Science 319, 1100–1104 (2008).
16. Rosenberg, N. A. Standardized subsets of the HGDP-CEPH Human Genome
DiversityCell LinePanel, accounting for atypical andduplicatedsamples andpairs
of close relatives. Ann. Hum. Genet. 70, 841–847 (2006).
17. Soejima, M. & Koda, Y. Population differences of two coding SNPs. in
pigmentation-related genes SLC24A5 and SLC45A2. Int. J. Legal Med. 121, 36–39
(2007).
18. Sulem, P. et al. Genetic determinants of hair, eye and skin pigmentation in
Europeans. Nature Genet. 39, 1443–1452 (2007).
19. Coop, G. et al. The role of geography in human adaptation. PLoS Genet. 5,
e1000500 (2009).
20. Pickrell, J. K. et al. Signals of recent positive selection in a worldwide sample of
human populations. Genome Res. 19, 826–837 (2009).
21. An integrated map of genetic variation from 1,092 human genomes. Nature 491,
56–65 (2012).
22. Paradis, E. Pegas: an R package for population genetics with an integrated–
modular approach. Bioinformatics 26, 419–420 (2010).
23. Vernot, B. & Akey, J. Resurrecting Surviving neandertal lineages from modern
human genomes. Science (2014).
24. Plagnol, V. & Wall, J. D. Possible ancestral structure in human populations. PLoS
Genet. 2, e105 (2006).
25. Reich, D. et al. Genetic history of an archaic hominin group from Denisova cave in
Siberia. Nature 468, 1053–1060 (2010).
26. Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai
Mountains. Nature 505, 43–49 (2014).
27. Skoglund, P.& Jakobsson, M.Archaic human ancestry inEast Asia. Proc. Natl Acad.
Sci. USA 108, 18301–18306 (2011).
28. Abi-Rached, L. et al. The shaping of modern human immune systems by
multiregional admixture with archaic humans. Science 334, 89–94 (2011).
29. Mendez, F. L., Watkins, J. C. & Hammer, M. F. A haplotype at STAT2 introgressed
from Neanderthals and serves as a candidate of positive selection in Papua New
Guinea. Am. J. Hum. Genet. 91, 265–274 (2012).
30. Sankararaman, S. et al. The genomic landscape of Neanderthal ancestry in
present-day humans. Nature (2014).
Supplementary Information is available in the online version of the paper.
Acknowledgements This researchwas fundedby the StateKey Development Program
for Basic Research of China, 973 Program (2011CB809203, 2012CB518201,
2011CB809201, 2011CB809202), China National GeneBank-Shenzhen and
Shenzhen Key Laboratory of Transomics Biotechnologies (no. CXB201108250096A).
This work was also supported by research grants from the US NIH; R01HG003229 to
R.N. and R01HG003229-08S2 to E.H.S. We thank F. Jay, M. Liang and F. Casey for
useful discussions.
Author Contributions R.N., Ji.W. and Ju.W. supervised the project. X.J., A., Z.B., Y.L., X.Y.,
M.H., P.N., B.W., X.O., H., J.L., Z.X.P.C., K.L., G.G., Y.Y., W.W., X.Z., X.X., H.Y., Y.L., Ji.W. and
Ju.W. collected and generated the data, and performed the preliminary bioinformatic
analyses to call SNPs and indels from the raw data. E.H.-S. andN.V. filtered the data and
B.M.P. phased the data. E.H.-S. performed the majority of the population genetic
analysis with some contributions from B.M.P. and M.S. E.H.-S. and R.N. wrote the
manuscript with critical input from all the authors.
Author Information Sequencedatahavebeendeposited in theSequenceReadArchive
under accession number SRP041218. Reprints and permissions information is
available at www.nature.com/reprints. The authors declare no competing financial
interests. Readers are welcome to comment on the online version of the paper.
Correspondence and requests for materials should be addressed to
Ju.W. (wangj@genomics.cn) or Ji.W. (wangjian@genomics.cn) or
R.N. (rasmus_nielsen@berkeley.edu).
LETTER RESEARCH
1 4 A U G U S T 2 0 1 4 | V O L 5 1 2 | N A T U R E | 1 9 7
Macmillan Publishers Limited. All rights reserved©2014
www.nature.com/doifinder/10.1038/nature13408
www.nature.com/doifinder/10.1038/nature13408
http://www.ncbi.nlm.nih.gov/sra/?term=SRP041218
www.nature.com/reprints
www.nature.com/doifinder/10.1038/nature13408
mailto:wangj@genomics.cn
mailto:wangjian@genomics.cn
mailto:rasmus_nielsen@berkeley.edu
METHODS
DNA samples. DNA samples included in this work were extracted from peripheral
blood of 41 unrelated Tibetan individuals living at more than 4,300 m above sea level
within the Himalayan Plateau. Tibetan identity was based on self-reported family
ancestry. The individuals were from two villages of Dingri (20 samples prefixed DR
(Dingri); 4300m altitude) and Naqu (21 samples prefixed NQ (Naqu); 4,600 m
altitude). All participants gave a self-report of at least three generations living at the
sampling site, and provided informed consent for this study. These individuals are
a subset of the 50 individuals exome-sequenced from ref. 4. Samples of Han Chinese
(CHB) are from the 1000 Genomes Project21 (40 samples prefixed NA in the 1000
Genomes Project).
A combined strategy of long-range PCR and next-generation sequencing was
used to decipher the whole EPAS1 gene and its 630-kb flanking region (147 kb in
total). We designed 38 pairs of long-range PCR primers to amplify the region in 41
Tibetan and 40 Han individuals. PCR products from all individuals were fragmen-
ted and indexed, then sequenced to higher than 260-fold depth for each individual
with the Illumina Hiseq2000 sequencer. The reads were aligned to the UCSC human
reference genome (hg18) using the SOAPaligner31. Genotypes of each individual at
every genomic location of the EPAS1 gene and the flanking region were called by
SOAPsnp32. To make comparisons easier with the Han samples, we only used 40
Tibetan samples for this study.
Data filtering. The coverage for each individual is roughly 2003. Genotypes of
each individual at every site in the EPAS1 gene and the flanking region were called
by SOAPsnp32, which resulted in 700 SNPs in the combined Tibetan–Han sample.
However, we filtered out some sites post genotype calling by performing standard
filters that are applied in the analyses of next generation sequencing data. Specifi-
cally, of the 700 SNPs called, we removed SNPs that fell into the following categories:
SNPs that were not in Hardy–Weinberg equilibrium; SNPs that were located in
hard-to-map regions (the SNP is located at a position with mappability 5 0, using
the Duke Unique 35 track); SNPs where more than half the heterozygote indivi-
duals have a statistically unequal distribution of counts for the two alleles (that is,
the counts of the two alleles deviate from the expectation of 50% of counts for each
assuming a binomial distribution); SNPs with different quality-score distributions
for the two alleles; SNPs that fell on or near a deletion or insertion; and SNPs that
were tri-allelic. A total of 477 SNPs in the combined sample remained after apply-
ing all the filters. Within Tibetans, 354 sites (out of the 477 sites) were SNPs, and
within Han Chinese, 364 sites (out of the 477) were SNPs. After filtering, we used
Beagle to phase the Tibetan and Han individuals together33.
Human Genome Diversity Panel data. We downloaded the HGDP SNP data
from the University of Chicago website (http://hgdp.uchicago.edu/data/plink_data/)
and followed the filtering criteria in ref. 34. We used the formula of ref. 35 to com-
pute FST between pairs of populations. We used the intersection of SNPs between
the 50 Tibetan exomes from ref. 7 and the HGDP SNPs, resulting in 8,362 SNPs.
Note, the number 354 quoted in the previous section refers to Tibetan SNPs from
the full re-sequencing of the EPAS1 gene in this study.
We calculated FST for each pair of populations and also scored the frequencies of
the SNP with the largest frequency difference between pairs of populations. Using
the genotypes from the 26 populations we have re-created Fig. 2a of ref. 34 using the
SNPs overlapping in two data sets: the 50 Tibetan exomes data set and the HGDP15,16
data set. The re-created figure (Fig. 1) displays the maximum SNP frequency differ-
ence against the mean FST across all SNPs for each pair of the HGDP populations.
We added one data point to this figure consisting of the mean Fst between Tibetans
and Han (FST is approximately 0.018) and the SNP with the largest frequency dif-
ference between Han Chinese and Tibetans (approximately 0.8), which is a SNP in
the EPAS1 gene.
Tibetan and Han haplotypes at the 32.7-kb highly differentiated region. The
32.7-kb region was identified as the region of highest genetic differentiation between
Tibetans and Han Chinese (green box in Extended Data Fig. 1), providing the stron-
gest candidate region for the location of the selective sweep. To examine the haplo-
types in this region, we first filtered out SNPs that occurred with a frequency of #5%
or $95% in both the Tibetan and Han samples; that is, SNPs that were very common
or very rare in both populations simultaneously. We computed the number of
pairwise differences between every pair of haplotypes. We then ordered the hap-
lotypes based on their number of pairwise distances from the most common hap-
lotype in each population separately. Figure 2 is generated using the heatmap.2
function from the gplots package of the statistical computing platform R (ref. 36),
with derived and ancestral alleles coloured black and light grey, respectively. We
used the chimpanzee sequence to define the ancestral state. However, the chim-
panzee allele was missing at one of the 32 most differentiated sites (see arrows in
Fig. 2), so in that case we used the orangutan and rhesus macaque alleles to define
the ancestral allele.
Simulations, selection on a de novo mutation and on standing variation. We used
msms37 to simulate data for a population of constant size with mutation, recombination
and positive selection affecting a single site. We simulated under conditions of a
current allele frequency of 69 out of 80 in the population, the observed value in the
real EPAS1 data, so that the beneficial mutation had frequency at the present time.
We estimated a Tibetan effective population size of N 5 7,000 (see Supplementary
Information; section on estimating the effective population size). In addition, we
assumed three different selection parameters (2Ns 5 200, 500, 1,000, where s is the
selection coefficient of the beneficial mutation) and a recombination rate per bp
per generation of 2.3 3 1028 (this is the average recombination rate in the EPAS1
gene using the fine-scale estimates from the map of ref. 38). This recombination
estimate is very similar to the estimate from the African American map39 for the
EPAS1 gene which is 2.4 3 1028. We set the mutation rate to 2.0 3 1028 per bp per
generation because this is what we estimated using the patterns of genetic diversity
in the EPAS1 gene in Tibetans under an approximate bayesian computation (ABC)
framework (see Supplementary Information for details; section on estimating the
mutation rate). This mutation-rate estimate is similar to the phylogenetic estimates
reviewed in ref. 40. We note that the human–chimpanzee differences in other intro-
nic regions in the genome of the same size has a mean (417) and median (410) close
to that found for the EPAS1 32.7-kb region (see Supplementary Information; section
on the distribution of human–chimpanzee differences in 32.7-kb regions for details),
suggesting that this region does not have an unusual mutation rate. In the simula-
tions, we examined a region of 32.7 kb around the selected site and grouped the
haplotypes into two groups: those that carried the beneficial allele and those that
did not. If k is the number of chromosomes carrying the beneficial mutation, we
counted how many mutations within the 32.7-kb region had frequency bigger or
equal to (k/80 – 0.05) in the group that harboured the beneficial allele (that is, fixed
or almost fixed in that group), and simultaneously had frequency 0 in the other group.
To simulate data for a sweep from standing variation, we used mbs41 and the
same parameters as in the previous set of simulations, but assuming an initial allele
frequency of the advantageous allele of 1% when selection starts. To be able to com-
pare the number of almost fixed sites from these simulations to the observed data,
we needed to make a call in the Han–Tibetan data set of what could plausibly be the
selected site. The most straightforward choice is the site that has the highest Han–
Tibetan differentiation; see the circled SNP in Extended Data Fig. 1 (this site also
has the largest frequency difference (,0.85) between Tibetans and any of the 1000
Genomes populations). Tibetan individuals with the derived mutation at this site
were defined as carrying the selected haplotype, and the remaining individuals were
defined as not carrying the selected haplotype. Then we performed the same count-
ing of ‘almost fixed’ sites between these two groups as was done for the simulations.
The simulated distribution of almost fixed differences and the real data are shown in
Extended Data Fig. 2 (histograms of almost fixed differences).
For the SDN model under a selection parameter of 2Ns 5 200, 500, 1,000, the
P values are 0.004, 0.006 and 0.006, respectively. Under SSV with a selection para-
meter of 2Ns 5 200, 500, 1,000, the P values are 0.002, 0.012 and 0.015, respectively.
We note that increasing the initial frequency of the selected allele (to 5%) also leads
to a smaller number of fixed differences than what we observe in the real data, thereby
making the simulated SSV scenario similarly unlikely (P values are 0.007, 0.01 and
0.006 for 2Ns 5 200, 500, 1,000 respectively). We also note that simulating data with a
smaller mutation rate will not result in an increase in the number of fixed differences.
Five-SNP motif. We identified the contiguous five-SNP haplotype motif that is most
common in the 40 Tibetan samples, but entirely absent in the 40 Han individuals
(see the five-SNP haplotype defined by the first five blue arrows in Fig. 2). The five
SNPs comprising this motif (positioned at 46421420, 46422184, 46422521, 46423274
and 46423846), span a 2.5-kb region (46423846–46421420, ,2.5 kb) containing
no other SNPs (even when including low- and high-frequency SNPs). The genomic
positions of this five-SNP motif were then examined in the phased 1000 Genomes21
data set to compute the frequency of this five-SNP haplotype in the populations
sequenced in the 1000 Genomes project (see Extended Data Fig. 3, and below). We
will refer here to this five-SNP motif as the ‘core’ Tibetan haplotype.
Haplotype frequencies at the five-SNP motif in 1000 Genomes data. For all
samples or populations in the 1000 genomes project21, we interrogated the five sites
in the ‘core’ Tibetan haplotype identified in EPAS1, and counted the frequencies of
each of the unique haplotypes within each population group of the 1000 genomes.
The bar-plot in Extended Data Fig. 3 is a summary of these frequencies within each
population, coloured by the unique haplotype sequences present.
Haplotype network. We constructed a haplotype network including Tibetans, Deni-
sovans and the 1000 Genomes samples (ASW, African American from south western
United States; CLM, Colombian; CEU, Utah Residents with Northern and Western
European ancestry; CHB, Han Chinese from Beijing; CHS, Southern Han Chinese;
FIN, Finnish; GBR, British; JPT, Japanese; LWK, Luhya; MXL, Mexican; PUR, Puerto
Rican; TSI, Toscani; YRI, Yoruban) for the 32.7-kb EPAS1 region in the combined
1000 Genomes samples. To limit the number of haplotypes to display, we identi-
fied the 40 most common haplotypes. There are a total of 515 SNPs in the 32.7-kb
EPAS1 region that pass all quality filters. We used the R (ref. 36) software package
RESEARCH LETTER
Macmillan Publishers Limited. All rights reserved©2014
http://hgdp.uchicago.edu/data/plink_data
pegas (ref. 22) to build a tree that connects haplotypes based on pairwise differ-
ences (Fig. 3). The Denisovan individual is homozygous at all the 515 sites. Note
that for clarity (to reduce the number of colours needed for the plot) and because
the small sample of Iberians (18) only contain haplotypes observed in other Euro-
pean samples, Fig. 3 does not include the Iberians (IBS). We find that the Denisovan
haplotype is closest to the Tibetan haplotypes. Extended Data Fig. 5a contains all
the pairwise differences between the 41 (40 from modern humans and 1 from Deni-
sovan) haplotypes in Fig. 3.
The observed haplotype structure is compatible with the introgression hypothesis.
As expected under the introgression hypothesis, the Tibetan haplotype is more
distant to the non-Tibetan haplotypes than the Denisovan haplotype because, after
the admixture event, the introgressed haplotype accumulated extra mutations. In
contrast, the Denisovan haplotype, being the product of DNA extracted from an
ancient specimen, did not have time to accumulate a similar number of mutations.
This effect is illustrated in Extended Data Fig. 5b. For example, the divergence between
the introgressed haplotype (that is, the Tibetan haplotype) and the Yoruban hap-
lotype would be larger than between the observed Denisovan haplotype and the
Yoruban haplotype (see Extended Data Fig. 5b and Supplementary Information;
section on Extended Data Fig. 5b).
Haplotype network. Figure 3 plots the network of the 40 most common haplotypes.
Alternatively, we also used the 20 sites such that the frequency difference between
Tibetans and each of the 14 populations from the 1000 Genomes project21 is at least
0.65 (see Extended Data Fig. 4) to construct a haplotype network (Extended Data
Fig. 6a). The resulting region spanned by these SNPs is the same 32.7-kb region as
identified previously by considering sites that are the most differentiated between
Tibetans and Han (Supplementary Table 3). For more details about Extended Data
Fig. 6a, b, see Supplementary Information; section on haplotype networks con-
structed using other criteria).
Denisovan–human number of pairwise differences at the EPAS1 32.7-kb
region. We computed the number of pairwise differences as described in Supplemen-
tary Information (section on the number of pairwise differences). We removed all
the Denisovan sites that had a genotype quality smaller than 40 and a mapping
quality smaller than 30, using the same thresholds as in the Denisovan paper14.
This filtering resulted in the removal of 782 sites (out of 32,746). We also removed
another 27 sites within the 32.7-kb region that did not pass our quality filters in
Tibetans (see the data filtering section). The total number of SNPs in the com-
bined Tibetan, 1000 Genomes and the Denisovan samples is 520. For the 32.7-kb
region in EPAS1, we computed the number of pairwise differences between the
Denisovan haplotypes and each of Tibetan haplotypes (red histograms, Extended
Data Fig. 7). We also computed the number of pairwise differences between the
Denisovan haplotypes and each of the haplotypes in the 1000 Genomes Project’s
populations (see the blue histograms in Extended Data Fig. 7). Notice that for this
comparison, we compared every site that passes the quality filters even if the site is
not polymorphic in modern humans. This is in contrast to Fig. 3 where we only
considered the sites that are polymorphic in modern humans. Furthermore, if a
site is not polymorphic in our sample, we assumed that all of our samples carry the
human reference allele. We plot two histograms in each panel of Extended Data
Fig. 7: the distribution of Tibetan–Denisovan comparison (red histogram) and the
distribution of pairwise differences between the Denisovan haplotype and each
population (blue histogram) from the 1000 Genomes Project (Extended Data Fig. 7).
Denisovan–modern human divergence and modern human–modern human
divergence. To compute the genomic distribution of modern human–Denisovan
pairwise differences we examined all windows of intronic sequence of size 32.7 kb
(using a table from Ensembl with the exon boundaries for all genes) from chromo-
somes 1 to 6. Within each 32.7-kb region, we removed all the Denisovan sites that
had a genotype quality smaller than 40 and a mapping quality smaller than 30. We
computed divergence by computing all the pairwise differences between a human
haplotype and the Denisovan haplotypes (see Supplementary Information; section
on the number of pairwise differences) and dividing by the effective sequence length
(that is, all the sites in the 32.7-kb region that passed all the filters; a mapping quality
higher or equal to 30 and a genotype quality higher or equal to 40). We only kept the
32.7-kb regions where at least 20,000 sites passed these quality criteria. The modern
humans used in these comparisons were the first 80 CEU chromosomes, the first 80
CHS chromosomes and the first 80 CHB chromosomes from the 1000 Genomes
data. If a site was not polymorphic in modern humans, we assumed that they carried
the reference allele.
We also computed modern human–modern human divergence at the same intro-
nic regions. In this case, we compare modern human populations (CHB versus CHS,
CHB versus CEU, CHS versus CEU) by comparing all 80 haplotypes in one group
to all 80 haplotypes in the other group for a total of 3 3 80 3 80 comparisons. The
distributions of modern human–Denisovan and modern human–modern human
pairwise differences are both plotted in Extended Data Fig. 8. We also display the
distribution of Tibetan–Han pairwise differences in the 32.7-kb region of the EPAS1
locus (80 Tibetan and 80 Han for a total of 6,400 comparisons). Finally, we include
the pairwise differences between the Denisovan and the Tibetans computed as in
Extended Data Fig. 7, standardized by the number of sites that passed all quality filters.
This number (12/31,937) leads to a sequence divergence of 0.000375 for the most
common Tibetan haplotypes, and this is indeed significantly lower (P 5 0.0028)
than what is expected under the distribution of human–Denisovan divergence (see
Extended Data Fig. 8). Supplementary Table 11 contains the details regarding the
12 differences between the Tibetan and the Denisovan haplotypes.
To address further the issue of whether 12 differences are expected between the
Denisovans and Tibetans under the introgression hypothesis, we computed the
number of mutations theoretically expected for an introgressed region of this size,
given published estimates of the age of the sample, and the coalescence time within
Denisovans. We assumed that mutations occur as a Poisson process and used the
estimates of split times from ref. 26 between the called introgressed Denisovan hap-
lotypes and the Denisovan haplotypes (see page 114 of the Supplementary Infor-
mation of ref. 26). Using these estimates, the number of expected mutations between
the Denisovan haplotype and our introgressed haplotype (the Tibetans’ most com-
mon haplotype) is simply: (2 3 tMRCA 2 age) 3 L 3 mu 5 11.25, where tMRCA is
the time to the most recent common ancestor estimated at 394,000 years (394 kyr),
mu 5 0.5 3 1029 per bp per year, L 5 32.7 kb, and age is the age of the Denisovan
sample, which we conservatively set to 100,000 years. Clearly, the observed value of
12 mutations is remarkably close to the expected number (11.25). In fact, we would
need to observe 17 or more mutations to be able to reject the introgression hypo-
thesis at the 5% significance level. If we use our estimate of the mutation rate in the
EPAS1 gene, mu 5 1.0 3 1029 per bp per year (2.0 3 1028 per bp per generation),
then the expected number of differences is 22.5. Therefore we conclude the number
of differences we observe are compatible with the previous estimates of introgressed
Denisovan versus sampled Denisovan sequence divergence.
Probability of 32.7-kb haplotype block from shared ancestral lineage. We cal-
culate the probability of a haplotype, of length at least 32.7 kb, shared by modern
Tibetans and the archaic Denisovan due to incomplete ancestral lineage sorting.
Let r be the recombination rate per generation per bp. Let t be the length of the
human and Denisovan branches since divergence. The expected length of a shared
ancestral sequence is 1/(r 3 t). Let this expected length 5 L. Assuming an exponen-
tial distribution of admixture tracts, the probability of seeing a shared fragment of
length $ m is exp(2m/L). However, conditional on observing the Denisovan nucle-
otide at position j, the expected length is the sum of two exponential random vari-
ables with expected lengths L, therefore it follows a Gamma distribution with shape
parameter 2, and rate parameter lambda 5 1/L. Inserting numbers for human branch
length after divergence at a conservative lower estimate of 200 kyr, and the Denisovan
branch of 100 kyr (divergence minus the estimated age of the Denisovan sample,
which can be as old as 100 kyr (refs 14, 26)), and assuming a generation time of
25 years, we get L 5 1/(2.3 3 1028 3 (300 3 103/25)) 5 3623.18 bp, and the proba-
bility of a length of at least m 5 32,700 bp is 1 2 GammaCDF(32700, shape 5 2,
rate 5 1/L) 5 0.0012. GammaCDF is the Gamma distribution function and the
numbers within the parenthesis are its arguments. Here the recombination of 2.3
3 1028 is the average recombination rate in EPAS1 calculated from the estimates
in ref. 14. We should mention, both this divergence estimate for the Denisovan–
human split and the age of the Denisovan sample are highly conservative14,25,26, so
the actual probability may be considerably less. Also, the haplotype would have to
have been independently lost in all African and non-African populations, except
for Tibetans and Han Chinese.
Null distributions of D statistics under models of no gene flow. As another
approach to assess the probability of an ancestral lineage having given rise to the
32.7-kb haplotype that we observe in Tibetans in the absence of gene flow, we
compared D statistics between human populations under simulations42 of several
demographic models described in ref. 43. D statistics were calculated according to
equation (2) in ref. 44. The two modern human populations used in computing D
statistics are Tibetans and either CHB, CEU or YRI (see Supplementary Informa-
tion for details; section on D statistics under Models of no gene flow). All simulations
results result in P , 0.001 for all comparisons (see Extended Data Fig. 9, Supplemen-
tary Tables 8–10 and Supplementary Information).
Genome-wide value of D statistics. D statistics have been used to assess genome-
wide levels of archaic introgression in previous studies14,25. To assess whether Tibetans
carry more Denisovan admixture than other populations (CEU or CHB), we used the
SNP genotype data from ref. 45 and computed D statistics as in ref. 44: D(chimpanzee,
Denisovan, Tibetan and CHB) and D(chimpanzee, Denisovan,Tibetan and CEU).
At the genome-wide level, using the D statistic, we found no evidence that there is
more Denisovan admixture in Tibetans than in the Han (D 5 0.000504688). We
also did not find evidence that there is more Denisovan admixture in Tibetans
than in the Europeans (D 5 0.001898642).
Empirical distributions of D statistics for 32.7-kb intronic regions. The EPAS1
32.7-kb region was chosen due to its positive selection signal, and not based on a
LETTER RESEARCH
Macmillan Publishers Limited. All rights reserved©2014
genome-wide analysis of Denisovan introgression. Therefore, we only performed
one test when testing for introgression and did not have to correct P values for mul-
tiple testing. We do not have Tibetan whole genome sequence data, but as shown in
the previous section, genotype array data suggests that the level of Denisovan intro-
gression between Han and Tibetans is similar. Moreover, Tibetans and Han are closely
related populations. Therefore, using Han data as a proxy, we can determine whether
the observed D values at the EPAS1 region (D(TIB, YRI, DEN, chimpanzee) 5
20.8818433) is an outlier compared to the distribution of D values at other 32.7-kb
intronic regions. Using the empirical distribution of D values across chromosomes
1 to 22, substituting the 80 Han chromosomes for our 80 Tibetan chromosomes
and computing D(HAN, YRI, DEN, chimpanzee) for each 32.7-kb intronic region,
we obtain a P , 0.008. However, as the variance in D depends on the number of
informative sites, this is probably an overestimate of the true P value. In fact, there
are no other regions with as many informative sites and as extreme a D value as that
observed for EPAS1. This region is clearly a strong outlier.
Null distributions of S* statistics under models of no gene flow. As a final
approach for eliminating the hypothesis of ancestral lineage sorting, we follow the
methods of ref. 23 to compute S* (originally derived by ref. 24). S*was designed to
identify regions of archaic introgression. As in the previous section, we used all the
four models of ref. 43 that do not include gene flow and simulated data to compute
the null distributions of S*. Distributions are generated from 1000 simulations, and
within each simulation we have representation of the 80 Tibetan chromosomes,
and 20 Yoruban chromosomes as the outgroup. For each simulated data set we follow
ref. 23 and compute S* on a per-chromosome basis, after sampling at random 20
chromosomes from the Tibetan group and removing SNPs that are observed in the
Yoruban chromosomes. The maximum S* is then recorded. The above process is
carried out for 10 random samplings of 20 Tibetan chromosomes and the max-
imum of the 10 is the final recorded S*. The exact same procedure is applied to the
simulated data and the real data of 80 Tibetan chromosomes. Extended Data Fig. 10a
shows that under all four models, S* is significantly different from the null distri-
bution with all the empirical p-values lying below 0.035. The grey vertical line is the
S* value computed for the real data. The P values are 0.035, 0.028, 0.019 and 0.017,
respectively, for each model (top to bottom).
Principal component analyses using Chinese and Tibetan samples. As one
single CHB individual carries a haplotype that is very similar to the Denisovan hap-
lotype in EPAS1 (Extended Data Fig. 6a, b), we wanted to assess whether this simi-
larity might be due to recent gene-flow from Tibetans to CHB (Chinese samples
were from the 1000 Genomes Project; Tibetan samples were from ref. 45). If that
were true, then we would expect to observe similarities at other loci. Therefore we
compute the first and second principal components using all of chromosome 2. For
simplicity, we only used chromosome 2 because it contains the EPAS1 gene and has
a sufficiently high number of SNPs to carry out the PCA analysis. We do not have
genome-wide genotype calls for the 40 Tibetan samples considered in this study.
Therefore, as a proxy, we used the Tibetan genotype data from ref. 45 and com-
pared their Tibetan samples to the CHB and CHS individuals from 1000 Genomes.
Extended Data Fig. 10b shows that all the CHB and the CHS individuals cluster
together and principal component 1 clearly separates Tibetans from CHB and CHS
individuals. Furthermore, the CHB individual with the Denisovan EPAS1 haplo-
type (Extended Data Figs 6a, b) clearly clusters with other CHB and CHS indivi-
duals and do not show any closer genetic affinity with Tibetans. This suggests that
the CHB individual with a Denisovan-like haplotype in EPAS1 is not a descendant
of a recent immigrant from Tibet.
31. Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment
program. Bioinformatics 24, 713–714 (2008).
32. Li, R. et al. SNP detection for massively parallel whole-genome resequencing.
Genome Res. 19, 1124–1132 (2009).
33. Browning, B. L. & Browning, S. R. A fast, powerful method for detecting identity by
descent. Am. J. Hum. Genet. 88, 173–182 (2011).
34. Coop, G. et al. The role of geography in human adaptation. PLoS Genet. 5,
e1000500 (2009).
35. Reynolds, J., Weir, B. S. & Cockerham, C. C. Estimation of the coancestry
coefficient: basis for a short-term genetic distance. Genetics 105, 767–779
(1983).
36. R Development Core Team R: A language and environment for statistical
computing http://www.R-project.org/ (R Foundation for Statistical Computing,
2011).
37. Ewing, G. & Hermisson, J. MSMS: a coalescent simulation program including
recombination, demographic structure, and selection at a single locus.
Bioinformatics 26, 2064–2065 (2010).
38. Myers, S. et al. A fine-scale map of recombination rates and hotspots across the
human genome. Science 310, 321–324 (2005).
39. Hinch, A. G. et al. The landscape of recombination in African Americans. Nature
476, 170–175 (2011).
40. Scally, A. & Durbin, R. Revising the human mutation rate: implications for
understanding human evolution. Nature Rev. Genet. 13, 745–753 (2012).
41. Teshima, K. M. & Innan, H. mbs: modifying Hudson’s ms software to generate
samples ofDNAsequenceswith a biallelic site under selection. BMC Bioinformatics
10, 166 (2009).
42. Hudson, R. R. Generating samples under a Wright–Fisher neutral model of genetic
variation. Bioinformatics 18, 337–338 (2002).
43. Sankararaman, S. et al. The date of interbreeding between Neandertals and
modern humans. PLoS Genet. 8, e1002947 (2012).
44. Durand, E. Y. et al. Testing for ancient admixture between closely related
populations. Mol. Biol. Evol. 28, 2239–2252 (2011).
45. Simonson, T.S.et al.Geneticevidence forhigh-altitudeadaptation inTibet.Science
329, 72–75 (2010).
RESEARCH LETTER
Macmillan Publishers Limited. All rights reserved©2014
http://www.R-project.org
Extended Data Figure 1 | FST calculated for each SNP between Tibetan and
Han populations. Each dot represents the FST value for each SNP in EPAS1.
The x axis is the physical position in the gene. Positions are based on the
hg18 build of the human genome. The green box defines a 32.7-kb region where
we observe the largest genetic differentiation between Han Chinese and
Tibetans. The first and last positions of this 32.7-kb region correspond to
the first and last position of the SNPs listed in Supplementary Table 3. For
comparison, in ref. 4 the genome-wide FST between Han and Tibetans is 0.02.
The site with the largest frequency difference (and therefore largest FST)
is circled.
LETTER RESEARCH
Macmillan Publishers Limited. All rights reserved©2014
Extended Data Figure 2 | Distribution of fixed differences. The left panel is
the distribution of fixed differences between two haplotype groups under a
scenario of selection on a de novo mutation (see Methods), and the right panel
is the distribution under a scenario of selection on standing variation (see
Methods) for a region of size ,32.7 kb. The initial frequency of the selected
allele in the SSV model is 1%. Each row of panels corresponds to different
selection strengths (2Ns) from 200 to 1,000. The red lines mark the number of
fixed differences observed between the two haplotype classes in the real data for
the given window size.
RESEARCH LETTER
Macmillan Publishers Limited. All rights reserved©2014
Extended Data Figure 3 | Haplotype frequencies for Tibetans, our Han
samples and the populations from the 1000 genomes project for the
five-SNP motif in the EPAS1 region. The y axis is the haplotype frequency.
The legend shows all the possible haplotypes for the region considered among
these populations: ASW, African American from the south western United
States; CEU, Utah Residents with Northern and Western European ancestry;
CHB, Han Chinese from Beijing; CHS, Southern Han Chinese; CLM,
Colombian; FIN, Finnish; GBR, British; HAN, Han Chinese from Beijing; IBS,
Iberian; JPT, Japanese; MXL, Mexican; PUR, Puerto Rican; LWK, Luhya; TSI,
Toscani; TIB, Tibetan; YRI, Yoruban (see Methods).
LETTER RESEARCH
Macmillan Publishers Limited. All rights reserved©2014
Extended Data Figure 4 | Derived allele frequency of the SNPs with the
largest frequency difference between Tibetans and the 1000 Genomes
Project populations. At these SNPs, the frequency difference between Tibetans
and the 1000 Genomes project populations is 0.65 or larger. Positions
46571435, 46579689, 46584859 and 46600358 were not called as SNPs in
the 1000 Genomes data, so we assume these positions were fixed for the human
reference allele. Note that even though position 46577251, 46588331, 46594122
and 46598025 appear to have a frequency of 0.0 for the populations in the
1000 Genomes data, the derived allele in these SNPs are observed at very low
frequency in at least one population (for example, CHB).
RESEARCH LETTER
Macmillan Publishers Limited. All rights reserved©2014
Extended Data Figure 5 | Differences between haplotypes. a, The full matrix
of pairwise differences between all the unique haplotypes in Fig. 3, for the
40 most common haplotypes identified in the 1000 Genomes and the Tibetan
samples in the 32.7-kb region of EPAS1. The Denisovan haplotype (of
frequency two) was added afterwards for comparison. The unique haplotypes
are labelled with Roman numerals (here and in Fig. 3), and the Denisovan
haplotype is the first column, haplotype I. Refer to Fig. 3 in the main text and the
supplementary material for the representation of populations for each
haplotype. b, Illustration of the genealogical structure in a model with gene flow
from Denisovans to Tibet. Letters a–k are the labels for the branch lengths and
are adjacent to their corresponding branches. The divergence between modern
human haplotypes and the introgressed haplotype in Tibetans would be larger
than the haplotypes in other modern human populations and the Denisovan
haplotype (see Methods and Supplementary Information). TIB, CEU and YRI
denote Tibetan, European and Yoruban populations. Note that the lengths i
and k are unknown as we do not know when these populations went extinct.
LETTER RESEARCH
Macmillan Publishers Limited. All rights reserved©2014
Extended Data Figure 6 | Other haplotype networks. a, A haplotype network
based on the number of pairwise differences between 43 unique haplotypes
defined from the 20 most differentiated SNPs between Tibetans and the 14
populations from the 1000 Genomes Project. The R software package pegas
(ref. 22) was used to generate the figure. The haplotype distances are from
pairwise differences. Each pie chart represents one unique haplotype and the
size of the pie chart is proportional to log2(number of chromosomes with that
haplotype). The sections in the pie provide the breakdown of the haplotypes
amongst populations. The width of the edges is proportional to the number of
pairwise differences between the joined haplotypes; the thinnest edge width
represents a difference of one mutation. The number 57 next to a Tibetan
haplotype is the number of Tibetan chromosomes with that haplotype.
Similarly, the number 1,912 is the number of chromosomes (across several
populations) with that haplotype. b, The number of pairwise differences
between the Denisovan haplotype and the 43 unique haplotypes defined from
the 20 most differentiated SNPs between Tibetans and the 14 populations from
the 1000 Genomes Project (same haplotypes as in a). Each bar is a unique
haplotype, and they are sorted in increasing order of pairwise differences. The
colours within each bar represent the proportion of chromosomes with that
haplotype broken down by populations. The numbers on top of each bar
represent the total number of chromosomes within the 1000 Genomes data set
and Tibetans that have the haplotype. Note this is the same data set used to
create the haplotype network in panel a. Supplementary Tables 5 and 6 contain
the 43 haplotypes and the frequencies within each of the populations.
RESEARCH LETTER
Macmillan Publishers Limited. All rights reserved©2014
Extended Data Figure 7 | Number of pairwise differences. Red bars are the
histograms of the number of pairwise differences between Denisovan and
Tibetans. Blue bars are the histograms of the number of pairwise differences
between Denisovan and GBR, CHS, FIN, PUR, CLM, IBS, CEU, YRI, CHB,
JPT, LWK, ASW, MXL or TSI. All comparisons are within the 32.7-kb region of
high differentiation (green box in Extended Data Fig. 1).
LETTER RESEARCH
Macmillan Publishers Limited. All rights reserved©2014
Extended Data Figure 8 | Divergence distributions. Modern human–
Denisovan divergence (see Methods) for intronic regions of size 32.7 kb is
plotted in red. Modern human–modern human divergence for the same
intronic regions is plotted in blue. At the EPAS1 32.7-kb region, in green, is
plotted the Tibetan–Han divergence. The black arrow points to the number of
nucleotide differences between the Denisovan and the most common Tibetan
haplotype (0.0038). This value is significantly lower than what we observe
between modern human–Denisovan (red curve, P 5 0.0028).
RESEARCH LETTER
Macmillan Publishers Limited. All rights reserved©2014
Extended Data Figure 9 | Null distributions of D for an assumed Tibet–Han
divergence of 3,000 years. Each histogram corresponds to the D values
obtained under null models without gene flow, and the red vertical bar
corresponds to the D values observed in the real data. The observed D values are
significant (P , 0.001) even when we assume Tibet–Han divergence of 5,000
or 10,000 years (see Methods and Supplementary Tables 8–10) (model
abbreviations are given in the Supplementary Information; section on
D statistics under models of no gene flow).
LETTER RESEARCH
Macmillan Publishers Limited. All rights reserved©2014
Extended Data Figure 10 | S* statistics and PCA plot. a, A measure of
introgression, S*, from ref. 23. Distributions are for 1,000 simulations under the
four demographic models described in the Supplementary Information; section
on D statistics under models of no gene flow. S* for the Tibetan individuals
is shown as a vertical grey line. For all models, the empirical P values are 0.035,
0.028, 0.019 and 0.017, respectively, for each model (top to bottom). b, Plots the
first and second principal components using all the CHS (100 individuals)
and the CHB (97 individuals) from the 1000 Genomes and the 77 Tibetan
individuals from ref. 45 (see Methods). The black circle and the black triangle
represent the single CHB and the CHS individuals carrying the five-SNP
Tibetan–Denisovan-haplotype (Extended Data Fig. 3). All SNPs in the
intersection between the 1000 Genomes populations and the 77 Tibetan
individuals from chromosome 2 were used for this analysis.
RESEARCH LETTER
Macmillan Publishers Limited. All rights reserved©2014