Skip to content

Commit 7ec0128

Browse files
committed
Add PCA tutorial for genetic datasets
Added tutorial on comparing branch and SNP-based PCA using msprime and tskit, including code examples for simulating ARGs and performing PCA. Add PCA file to the table of contents Modify R installation command in README Updated R package installation command to specify CRAN repository. Refactor PCA.md for clarity and organization Updated kernel specifications and code formatting for clarity. Added comments and organized code cells for better readability. Revise PCA tutorial and update Jupyter metadata Updated Jupyter Notebook metadata and improved PCA tutorial content. Update kernelspec display name and language in PCA.md Update PCA.md Co-authored-by: Gregor Gorjanc <gregor.gorjanc@gmail.com> Update PCA.md Co-authored-by: Gregor Gorjanc <gregor.gorjanc@gmail.com> Update PCA.md Co-authored-by: Gregor Gorjanc <gregor.gorjanc@gmail.com> PR comments implemented heading
1 parent da1f625 commit 7ec0128

3 files changed

Lines changed: 240 additions & 1 deletion

File tree

PCA.md

Lines changed: 238 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,238 @@
1+
---
2+
jupytext:
3+
formats: ipynb,md:myst
4+
text_representation:
5+
extension: .md
6+
format_name: myst
7+
format_version: 0.13
8+
jupytext_version: 1.19.1
9+
kernelspec:
10+
display_name: Python 3
11+
language: python
12+
name: python3
13+
---
14+
15+
# PCA, on branches and SNPs
16+
17+
+++
18+
19+
Principal Component Analysis (PCA) is commonly used for exploring population structure in genetic datasets, where it is usually computed from SNP genotyped data. In the context of ARGs, it is also possible to perform branch PCA as implemented in `tskit`. This does not use variant data. (Of course, it may indirectly rely on variant data if the ARG was inferred from data.)
20+
21+
In this tutorial, we demonstrate both approaches. We will apply these to haplotypes and diploid genotype data.
22+
23+
The documentation of `tskit.TreeSequence.pca` can be found [here](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.pca).
24+
25+
+++
26+
27+
:::{note}
28+
Usually, PCA is carried out on a diploid genotype matrix (individuals in rows, loci in columns) with values 0, 1, and 2. PCA can then be achieved through singular value decomposition (SVD) of the column-centred genotype matrix. This results in a matrix of principal component (PC) scores, which are linear combinations of the genotype columns. The PC scores are ordered, decreasingly, by the amount of variation from the original data they account for.
29+
:::
30+
31+
+++
32+
33+
First, we'll simulate an ARG with population structure:
34+
35+
```{code-cell} ipython3
36+
# load required libraries
37+
import msprime
38+
import tskit
39+
import numpy as np
40+
import matplotlib.pyplot as plt
41+
from scipy.stats import linregress
42+
```
43+
44+
```{code-cell} ipython3
45+
# set a mutation rate
46+
mu = 1e-8
47+
# number of sub-populations/'islands'
48+
nPop = 5
49+
# number of diploids sampled from each sub-population
50+
nSamp = 10
51+
# number of haplotypes sampled
52+
nHap = 2* nSamp
53+
# per-island effective population size
54+
nn = 1e4
55+
# migration rate (per individual and island pair)
56+
migRate = 1e-5
57+
```
58+
59+
Simulate an ARG using an island model demography. There are five islands, each with a population size of 10,000. Pairwise migration rates are $10^{-5}$.
60+
61+
```{code-cell} ipython3
62+
# Island model demography, 5 islands connected by low gene flow
63+
dmg = msprime.Demography.island_model([nn] * nPop, migration_rate=migRate)
64+
```
65+
66+
```{code-cell} ipython3
67+
# Simulate ARG
68+
ts = msprime.sim_ancestry(samples={i: nSamp for i in range(nPop)},
69+
demography=dmg,
70+
random_seed=1234,
71+
sequence_length=1e6,
72+
recombination_rate=1e-8)
73+
ts
74+
```
75+
76+
The same ARG, but with mutations added.
77+
78+
```{code-cell} ipython3
79+
# Add mutations
80+
tsm = msprime.sim_mutations(ts, rate=mu, random_seed=1234)
81+
tsm
82+
```
83+
84+
The migration rates between the islands are quite low. This should lead to considerable genetic differentiation. Let us compute pairwise $F_{ST}$:
85+
86+
```{code-cell} ipython3
87+
# Considerable pairwise Fst between the 'islands'
88+
fstmat = np.zeros([nPop,nPop])
89+
for i in range(nPop-1):
90+
for j in range(i+1,nPop):
91+
fstmat[i,j] = tsm.Fst([range(i*nHap,(i+1)*nHap), range((i+1)*nHap,(i+2)*nHap)])
92+
fstmat
93+
```
94+
95+
## Branch PCA (tskit)
96+
To demstrate that branch PCA works without variant data, we run it on the ARG without mutations, `ts`.
97+
98+
```{code-cell} ipython3
99+
# haplotypes, each sample haplotype is ues by default
100+
hapBranchPca=ts.pca(num_components=10)
101+
```
102+
103+
```{code-cell} ipython3
104+
# genotypes, all individuals are specified
105+
dipBranchPca=ts.pca(num_components=10, individuals=range(5*nSamp))
106+
```
107+
108+
## PCA 'by hand'
109+
To compute a traditional SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
110+
111+
```{code-cell} ipython3
112+
# obtain a haplotype matrix from the tree sequence with mutation; print its shape
113+
# 100 haplotypes (= 10 individual samples * 5 islands * 2 haplotypes per individual)
114+
# 13683 variant sites
115+
htMat=tsm.genotype_matrix().transpose()
116+
htMat.shape
117+
```
118+
119+
```{code-cell} ipython3
120+
# Add each individual's two haplotypes to generate individual genotypes
121+
sample_ids_to_mat_index = np.full_like(tsm.samples(), tskit.NULL, shape=tsm.num_nodes)
122+
sample_ids_to_mat_index[tsm.samples()] = np.arange(len(tsm.samples()))
123+
gtMat = htMat[sample_ids_to_mat_index[ts.individuals_nodes]].sum(axis=1)
124+
```
125+
126+
```{code-cell} ipython3
127+
# Haplotype SVD (column-centred)
128+
hapSvd = np.linalg.svd(htMat - htMat.mean(axis=0), full_matrices=False)
129+
```
130+
131+
```{code-cell} ipython3
132+
# Genotype SVD (column-centred)
133+
dipSvd = np.linalg.svd(gtMat - gtMat.mean(axis=0), full_matrices=False)
134+
```
135+
136+
## Plot for comparison
137+
Note that PCA does not preserve the axis orientation. The plots in the panels below will show similar patterns but one or both axes may be flipped.
138+
139+
```{code-cell} ipython3
140+
fig, axs = plt.subplots(2, 2)
141+
plt.tight_layout()
142+
axs[0, 0].scatter(hapSvd.U[:,0],
143+
hapSvd.U[:,1],
144+
c=np.repeat([1,2,3,4,5], [nHap] * nPop))
145+
axs[0, 0].set_title('Haplotypes (sites)')
146+
147+
axs[0,0].set_ylabel("PC2")
148+
axs[0,1].scatter(dipSvd.U[:,0],
149+
dipSvd.U[:,1],
150+
c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
151+
axs[0,1].set_title("Individuals (sites)")
152+
153+
# flipping the axes to make similarity clearer:
154+
axs[1,0].scatter(hapBranchPca.factors[:,0],
155+
hapBranchPca.factors[:,1],
156+
c=np.repeat([1,2,3,4,5], [nHap] * nPop))
157+
axs[1,0].set_title("Haplotypes (branches)")
158+
axs[1,0].set_ylabel("PC2")
159+
axs[1,0].set_xlabel("PC1")
160+
161+
axs[1,1].scatter(dipBranchPca.factors[:,0],
162+
dipBranchPca.factors[:,1],
163+
c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
164+
axs[1,1].set_title("Individuals (branches)")
165+
axs[1,1].set_xlabel("PC1")
166+
167+
plt.show()
168+
```
169+
170+
The plots on the left show one dot per haplotype. These have twice as many dots as the plots on the right, which show individuals. The colours indicate from which of the five islands a haplotype or individual was sampled. As expected with low geneflow, there is some grouping by island. Feel free to re-run with higher or lower values of `migRate` to see how the separations between the island samples changes.
171+
172+
+++
173+
174+
## Comparing variance components between branch and SNP PCA
175+
Both `numpy.linalg.svd` and `tskit.TreeSequence.pca` return information about the amount of variation accounted for by each PC. These information are stored in the slots `S` (standard variation for SVD) and `eigenvalues` (variance for branch PCA). To make the two match, we need to multiply the eigenvalues by the mutation rate before taking the square root.
176+
177+
```{code-cell} ipython3
178+
# square root of (branch eigenvalues multiplied by the mutation rate)
179+
xx=np.sqrt(hapBranchPca.eigenvalues * mu)
180+
# SVD S values
181+
yy=hapSvd.S[:10]
182+
```
183+
184+
We now fit a least-squares regression model to demonstrate the match between SVD standard variation and transformed eigenvalues.
185+
186+
```{code-cell} ipython3
187+
res = linregress(xx, yy)
188+
print(f"Intercept: {res.intercept:.4f}\n Slope: {res.slope:.4f}\n r^2: {res.rvalue**2:.4f}")
189+
```
190+
191+
$r^2$ is close to 1. Let us visualise this. Each dot below shows a standard deviation value associate with one PC. The fact that they are well correlated suggests that both SNP and branch PCA yielded very similar results.
192+
193+
```{code-cell} ipython3
194+
plt.scatter(xx, yy)
195+
plt.xlabel("sqrt(Branch eigenvals)")
196+
plt.ylabel("GT svd.S")
197+
plt.plot(xx, res.intercept + res.slope*xx, 'r', label='fitted line')
198+
plt.xlabel(r"Branches: $\sqrt{eigenvals * \mu}$") # use raw string to avoid error message about \s
199+
plt.ylabel("SNPs: $S$")
200+
plt.title("Variance components of SNP and branch PCA")
201+
plt.grid()
202+
plt.show()
203+
```
204+
205+
## Time windows
206+
Above we showed how variant and branch-based PCA are equivalent. But the ARG is a much richer data type than the genotype matrix. ARGs contain information about the historic relationships between the samples (possibly blurred by a inference step). Branch PCA allows one to specify a time window over which the PCA is to be computed, something that cannot be done for SNP PCA. Next, we compute PCA in time slices with breaks 0, 10, 100, 1000, 10,000, 100,000, 100,0000, 1,000,000, and 10,000,000. The results are stored in a list.
207+
208+
```{code-cell} ipython3
209+
pctime=[tsm.pca(num_components=10, time_windows=[10**i, 10**(i+1)]) for i in range(8)]
210+
```
211+
212+
Being of class `PCAResult`, the elements of the list have a `factors` property. This has a shape of (100,10). I.e., 10 PCs for 100 haplotypes.
213+
214+
```{code-cell} ipython3
215+
pctime[0].factors.shape
216+
```
217+
218+
```{code-cell} ipython3
219+
for i in range(8):
220+
evecs = pctime[i].factors[:,:10]
221+
222+
plt.scatter(evecs[:,0],
223+
evecs[:,1],
224+
c=np.repeat(range(5), 20))
225+
plt.title(f"Branch PCA, (window: {10**i} - {10**(i+1)} gens)")
226+
plt.show()
227+
```
228+
229+
When selecting a very old window, each individual contributes to its own PC, causing most to be plotted at the origin (0,0). We can see this when inspecting the oldest window's PC scores, which are an identityt matrix. All haplotypes below the first two have 0 entries for the first two PC scores (the two left-most columns).
230+
231+
```{code-cell} ipython3
232+
pctime[7].factors[:20,:10]
233+
```
234+
235+
## Empirical data
236+
Here, we demonstrated using simulated data how SNPs and ARG branches lead to equivalent PCA results. For empirical data, the ancestral states of variant sites are not known a priori, which will in practice often lead to polarisation differences. That may affect the outcome of PCA.
237+
238+
**TODO:** Extend Tutorial to empirical data.

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ $ python -m pip install -r requirements.txt
2525
You will also need a working R installation with `reticulate` and `irkernel` installed.
2626
This command should do the trick:
2727
```
28-
$ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
28+
$ R -e 'options(repos=c(CRAN="http://cran.r-project.org")); install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
2929
```
3030

3131
# Building tutorials

_toc.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ parts:
1515
- file: incremental_algorithms
1616
- file: counting_topologies
1717
- file: parallelization
18+
- file: PCA
1819
- caption: Further tskit tutorials
1920
chapters:
2021
- file: tables_and_editing

0 commit comments

Comments
 (0)