PCA.md: 156 additions & 8 deletions
@@ -8,9 +8,9 @@ jupyter:
       format_version: '1.3'
       jupytext_version: 1.19.1
   kernelspec:
-    display_name: msprime
+    display_name: Python 3
     language: python
-    name: myenv
+    name: python3
 ---

 # Comparing branch and SNP-based PCA
@@ -30,7 +30,7 @@ Usually, PCA is carried out on a diploid genotype matrix (individuals in rows, l

 First, we'll simulate an ARG with population structure:

-```python
+```{code-cell} ipython3
 # load required libraries
 import msprime as msp
 import tskit
@@ -39,7 +39,7 @@ import matplotlib.pyplot as plt
 from scipy.stats import linregress
 ```

-```python
+```{code-cell} ipython3
 # set a mutation rate
 mu = 1e-8
 # number of sub-populations/'islands'
@@ -56,12 +56,12 @@ migRate = 1e-5

 Simulate an ARG using an island model demography. There are five islands, each with a population size of 10,000. Pairwise migration rates are $10^{-5}$.

-```python
+```{code-cell} ipython3
 # Island model demography, 5 islands connected by low gene flow
The migration rates between the islands are quite low. This should lead to considerable genetic differentiation. Let us compute pairwise $F_{ST}$:

-```python
+```{code-cell} ipython3
 # Considerable pairwise Fst between the 'islands'
 fstmat = np.zeros([nPop,nPop])
 for i in range(nPop-1):
@@ -93,6 +93,154 @@ fstmat
 ## PCA 'by hand'
 To compute a SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
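The haplotype-summing step can be sketched with plain NumPy. This is a minimal illustration, not the notebook's own code: `hap_genotypes` and the toy `individuals_nodes` array below are made-up stand-ins, and the sketch assumes the haplotype matrix has one row per haploid genome (a matrix stored as sites by samples would need transposing first).

```python
import numpy as np

# Hypothetical toy data: 6 haploid genomes (rows) x 4 sites (columns),
# coded 0 = ancestral allele, 1 = derived allele.
hap_genotypes = np.array([
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
])

# Stand-in for the individuals_nodes array: one row per individual,
# holding the row indices of that individual's two haplotypes.
individuals_nodes = np.array([[0, 1], [2, 3], [4, 5]])

# Indexing with the 2D array gives shape (individuals, 2, sites);
# summing over axis 1 adds each individual's two haplotypes.
dip_genotypes = hap_genotypes[individuals_nodes].sum(axis=1)
print(dip_genotypes)
```

Each entry of `dip_genotypes` is 0, 1, or 2, the number of derived alleles an individual carries at that site.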
The plots on the left show one dot per haplotype. These have twice as many dots as the plots on the right, which show individuals. The colours indicate from which of the five islands a haplotype or individual was sampled. As expected with low gene flow, there is some grouping by island. Feel free to re-run with higher or lower values of `migRate` to see how the separation between the island samples changes.
+
+
+## Comparing variance components between branch and SNP PCA
+Both `numpy.linalg.svd` and `tskit.TreeSequence.pca` return information about the amount of variation accounted for by each PC. This information is stored in the attributes `S` (standard deviations for SVD) and `eigenvalues` (variances for branch PCA). To make the two match, we need to multiply the eigenvalues by the mutation rate before taking the square root.
+
+```{code-cell} ipython3
+# square root of (branch eigenvalues multiplied by the mutation rate)
+xx = np.sqrt(hapBranchPca.eigenvalues * mu)
+# SVD S values
+yy = htSvd.S[:10]
+```
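Why a square root is needed can be checked on a toy matrix: the singular values of a centred matrix are the square roots of the eigenvalues of its Gram matrix. The sketch below uses a randomly generated 0/1 matrix as a hypothetical stand-in for haplotype data (all names are illustrative). It does not involve the mutation-rate scaling, which is specific to converting branch lengths into expected numbers of mutations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for haplotype data: 20 haplotypes x 200 sites of 0/1.
G = rng.integers(0, 2, size=(20, 200)).astype(float)
Gc = G - G.mean(axis=0)  # centre each site (column)

# Singular values of the centred matrix, in descending order.
S = np.linalg.svd(Gc, compute_uv=False)

# Eigenvalues of the haplotype-by-haplotype Gram matrix;
# eigvalsh returns them in ascending order, so reverse them.
eigvals = np.linalg.eigvalsh(Gc @ Gc.T)[::-1]

# Clip tiny negative values caused by rounding before taking the root.
root_eig = np.sqrt(np.clip(eigvals, 0, None))

print(np.allclose(S[:10], root_eig[:10]))
```

Only the leading ten values are compared here, mirroring the ten components used in the text.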
+
+We now fit a least-squares regression model to demonstrate the match between the SVD standard deviations and the transformed eigenvalues.
$r^2$ is close to 1. Let us visualise this. Each dot below shows a standard deviation value associated with one PC. The fact that they are well correlated suggests that both SNP and branch PCA yielded very similar results.
plt.xlabel(r"Branches: $\sqrt{eigenvals * \mu}$") # use raw string to avoid error message about \s
+plt.ylabel("SNPs: $S$")
+plt.title("Variance components of SNP and branch PCA")
+plt.grid()
+plt.show()
+```
+
+## Time windows
+Above we showed how variant and branch-based PCA are equivalent. But the ARG is a much richer data type than the genotype matrix. ARGs contain information about the historical relationships between the samples (possibly blurred by an inference step). Branch PCA allows one to specify a time window over which the PCA is to be computed, something that cannot be done for SNP PCA. Next, we compute PCA in time slices with breaks at 1, 10, 100, 1,000, 10,000, 100,000, 1,000,000, 10,000,000, and 100,000,000 (that is, $10^0$ to $10^8$). The results are stored in a list.
+
+```{code-cell} ipython3
+# one PCA per time slice [10**i, 10**(i+1)], for i = 0..7
+pctime = [tsm.pca(num_components=10, time_windows=[10**i, 10**(i+1)]) for i in range(8)]
+```
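To see exactly which intervals the list comprehension produces, the window boundaries can be built on their own. This is a small sketch of the same pattern without the call to `pca`; the variable name `time_windows_list` is illustrative only.

```python
# Build the same [lower, upper] pairs as in the comprehension above.
time_windows_list = [[10**i, 10**(i + 1)] for i in range(8)]

for lower, upper in time_windows_list:
    print(f"{lower:>11,} to {upper:>11,}")
```

The eight slices are contiguous: each upper bound is the next slice's lower bound, the first slice starts at 1, and the last ends at $10^8$.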
+
+Being of class `PCAResult`, each element of the list has a `factors` attribute with shape (100, 10), i.e. 10 PCs for 100 haplotypes.
When selecting a very old window, each individual contributes to its own PC, causing most to be plotted at the origin (0,0). We can see this when inspecting the oldest window's PC scores, which are an identity matrix. All haplotypes below the first two have 0 entries for the first two PC scores (the two left-most columns).
+
+```{code-cell} ipython3
+pctime[7].factors[:20, :10]
+```
+
+## Empirical data
+Here, we used simulated data to demonstrate that SNPs and ARG branches lead to equivalent PCA results. For empirical data, the ancestral states of variant sites are not known a priori, which in practice often leads to polarisation differences that may affect the outcome of PCA.