Skip to content

Commit 13b60c1

Browse files
authored
Refactor PCA.md for clarity and organization
Updated kernel specifications and code formatting for clarity. Added comments and organized code cells for better readability.
1 parent afbd4e6 commit 13b60c1

1 file changed

Lines changed: 156 additions & 8 deletions

File tree

PCA.md

Lines changed: 156 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -8,9 +8,9 @@ jupyter:
88
format_version: '1.3'
99
jupytext_version: 1.19.1
1010
kernelspec:
11-
display_name: msprime
11+
display_name: Python 3
1212
language: python
13-
name: myenv
13+
name: python3
1414
---
1515

1616
# Comparing branch and SNP-based PCA
@@ -30,7 +30,7 @@ Usually, PCA is carried out on a diploid genotype matrix (individuals in rows, l
3030

3131
First, we'll simulate an ARG with population structure:
3232

33-
```python
33+
```{code-cell} ipython3
3434
# load required libraries
3535
import msprime as msp
3636
import tskit
@@ -39,7 +39,7 @@ import matplotlib.pyplot as plt
3939
from scipy.stats import linregress
4040
```
4141

42-
```python
42+
```{code-cell} ipython3
4343
# set a mutation rate
4444
mu = 1e-8
4545
# number of sub-populations/'islands'
@@ -56,12 +56,12 @@ migRate = 1e-5
5656

5757
Simulate an ARG using an island model demography. There are five islands, each with a population size of 10,000. Pairwise migration rates are $10^{-5}$.
5858

59-
```python
59+
```{code-cell} ipython3
6060
# Island model demography, 5 islands connected by low gene flow
6161
dmg = msp.Demography.island_model([nn] * nPop, migration_rate=migRate)
6262
```
6363

64-
```python
64+
```{code-cell} ipython3
6565
# Simulate ARG
6666
ts = msp.sim_ancestry(samples={i: nSamp for i in range(nPop)},
6767
demography=dmg,
@@ -73,15 +73,15 @@ ts
7373

7474
The same ARG, but with mutations added.
7575

76-
```python
76+
```{code-cell} ipython3
7777
# Add mutations
7878
tsm = msp.sim_mutations(ts, rate=mu, random_seed=1234)
7979
tsm
8080
```
8181

8282
The migration rates between the islands are quite low. This should lead to considerable genetic differentiation. Let us compute pairwise $F_{ST}$:
8383

84-
```python
84+
```{code-cell} ipython3
8585
# Considerable pairwise Fst between the 'islands'
8686
fstmat = np.zeros([nPop,nPop])
8787
for i in range(nPop-1):
@@ -93,6 +93,154 @@ fstmat
9393
## PCA 'by hand'
9494
To compute a SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
9595

96+
```{code-cell} ipython3
97+
# obtain a haplotype matrix and print its shape
98+
# 100 haplotypes (= 10 individual samples * 5 islands * 2 haplotypes per individual)
99+
# 13683 variant sites
100+
htMat=tsm.genotype_matrix().transpose()
101+
htMat.shape
102+
```
103+
104+
```{code-cell} ipython3
105+
# Add each individual's two haplotypes to generate individual genotypes
106+
sample_ids_to_mat_index = np.full_like(tsm.samples(), tskit.NULL, shape=tsm.num_nodes)
107+
sample_ids_to_mat_index[tsm.samples()] = np.arange(len(tsm.samples()))
108+
gtMat = htMat[sample_ids_to_mat_index[ts.individuals_nodes]].sum(axis=1)
109+
```
110+
111+
```{code-cell} ipython3
112+
# Haplotype SVD (column-centred)
113+
htSvd = np.linalg.svd(htMat - htMat.mean(axis=0), full_matrices=False)
114+
```
115+
116+
```{code-cell} ipython3
117+
# Genotype SVD (column-centred)
118+
gtSvd = np.linalg.svd(gtMat - gtMat.mean(axis=0), full_matrices=False)
119+
```
120+
121+
## Branch PCA (tskit)
122+
To demstrate that branch PCA works without variant data, we run it on the ARG without mutations, `ts`.
123+
124+
125+
```{code-cell} ipython3
126+
# haplotypes, each sample haplotype is ues by default
127+
hapBranchPca=ts.pca(num_components=10)
128+
```
129+
130+
```{code-cell} ipython3
131+
# genotypes, all individuals are specified
132+
dipBranchPca=ts.pca(num_components=10, individuals=range(5*nSamp))
133+
```
134+
135+
Plot for comparison
136+
137+
```{code-cell} ipython3
138+
fig, axs = plt.subplots(2, 2)
139+
plt.tight_layout()
140+
axs[0, 0].scatter(htSvd.U[:,0],
141+
htSvd.U[:,1],
142+
c=np.repeat([1,2,3,4,5], [nHap] * nPop))
143+
axs[0, 0].set_title('Haplotypes (sites)')
144+
145+
axs[0,0].set_ylabel("PC2")
146+
axs[0,1].scatter(gtSvd.U[:,0],
147+
gtSvd.U[:,1]*-1,
148+
c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
149+
axs[0,1].set_title("Individuals (sites)")
150+
151+
# flipping the axes to make similarity clearer:
152+
axs[1,0].scatter(hapBranchPca.factors[:,0]*-1,
153+
hapBranchPca.factors[:,1]*-1,
154+
c=np.repeat([1,2,3,4,5], [nHap] * nPop))
155+
axs[1,0].set_title("Haplotypes (branches)")
156+
axs[1,0].set_ylabel("PC2")
157+
axs[1,0].set_xlabel("PC1")
158+
159+
axs[1,1].scatter(dipBranchPca.factors[:,0]*-1,
160+
dipBranchPca.factors[:,1],
161+
c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
162+
axs[1,1].set_title("Individuals (branches)")
163+
axs[1,1].set_xlabel("PC1")
164+
165+
plt.show()
166+
```
167+
168+
The plots on the left show one dot per haplotype. These have twice as many dots as the plots on the right, which show individuals. The colours indicate from which of the five islands a haplotype or individual was sampled. As expected with low geneflow, there is some grouping by island. Feel free to re-run with higher or lower values of `migRate` to see how the separations between the island samples changes.
169+
170+
171+
## Comparing variance components between branch and SNP PCA
172+
Both `numpy.linalg.svd` and `tskit.TreeSequence.pca` return information about the amount of variation accounted for by each PC. These information are stored in the slots `S` (standard variation for SVD) and `eigenvalues` (variance for branch PCA). To make the two match, we need to multiply the eigenvalues by the mutation rate before taking the square root.
173+
174+
```{code-cell} ipython3
175+
# square root of (branch eigenvalues multiplied by the mutation rate)
176+
xx=np.sqrt(hapBranchPca.eigenvalues * mu)
177+
# SVD S values
178+
yy=htSvd.S[:10]
179+
```
180+
181+
We now fit a least-squares regression model to demonstrate the match between SVD standard variation and transformed eigenvalues.
182+
183+
```{code-cell} ipython3
184+
res = linregress(xx, yy)
185+
print(f"Intercept: {res.intercept:.4f}\n Slope: {res.slope:.4f}\n r^2: {res.rvalue**2:.4f}")
186+
187+
```
188+
189+
$r^2$ is close to 1. Let us visualise this. Each dot below shows a standard deviation value associate with one PC. The fact that they are well correlated suggests that both SNP and branch PCA yielded very similar results.
190+
191+
```{code-cell} ipython3
192+
plt.scatter(xx, yy)
193+
plt.xlabel("sqrt(Branch eigenvals)")
194+
plt.ylabel("GT svd.S")
195+
plt.plot(xx, res.intercept + res.slope*xx, 'r', label='fitted line')
196+
plt.xlabel(r"Branches: $\sqrt{eigenvals * \mu}$") # use raw string to avoid error message about \s
197+
plt.ylabel("SNPs: $S$")
198+
plt.title("Variance components of SNP and branch PCA")
199+
plt.grid()
200+
plt.show()
201+
```
202+
203+
## Time windows
204+
Above we showed how variant and branch-based PCA are equivalent. But the ARG is a much richer data type than the genotype matrix. ARGs contain information about the historic relationships between the samples (possibly blurred by a inference step). Branch PCA allows one to specify a time window over which the PCA is to be computed, something that cannot be done for SNP PCA. Next, we compute PCA in time slices with breaks 0, 10, 100, 1000, 10,000, 100,000, 100,0000, 1,000,000, and 10,000,000. The results are stored in a list.
205+
206+
```{code-cell} ipython3
207+
pctime=[tsm.pca(num_components=10, time_windows=[10**i, 10**(i+1)]) for i in range(8)]
208+
```
209+
210+
Being of class `PCAResult`, the elements of the list have a `factors` property. This has a shape of (100,10). I.e., 10 PCs for 100 haplotypes.
211+
212+
```{code-cell} ipython3
213+
pctime[0].factors.shape
214+
```
215+
216+
```{code-cell} ipython3
217+
for i in range(8):
218+
evecs = pctime[i].factors[:,:10]
219+
220+
plt.scatter(evecs[:,0],
221+
evecs[:,1],
222+
c=np.repeat(range(5), 20))
223+
plt.title(f"Branch PCA, (window: {10**i} - {10**(i+1)} gens)")
224+
plt.show()
225+
```
226+
227+
When selecting a very old window, each individual contributes to its own PC, causing most to be plotted at the origin (0,0). We can see this when inspecting the oldest window's PC scores, which are an identityt matrix. All haplotypes below the first two have 0 entries for the first two PC scores (the two left-most columns).
228+
229+
```{code-cell} ipython3
230+
pctime[7].factors[:20,:10]
231+
```
232+
233+
## Empirical data
234+
Here, we demonstrated using simulated data how SNPs and ARG branches lead to equivalent PCA results. For empirical data, the ancestral states of variant sites are not known a priori, which will in practice often lead to polarisation differences. That may affect the outcome of PCA.
235+
236+
**TODO:** Extend Tutorial to empirical data.
237+
fstmat[i,j] = tsm.Fst([range(i*nHap,(i+1)*nHap), range((i+1)*nHap,(i+2)*nHap)])
238+
fstmat
239+
```
240+
241+
## PCA 'by hand'
242+
To compute a SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
243+
96244
```python
97245
# obtain a haplotype matrix and print its shape
98246
# 100 haplotypes (= 10 individual samples * 5 islands * 2 haplotypes per individual)

0 commit comments

Comments
 (0)