Skip to content

Commit b6159fa

Browse files
authored
Revise PCA tutorial and update Jupyter metadata
Updated Jupyter Notebook metadata and improved PCA tutorial content.
1 parent 13b60c1 commit b6159fa

1 file changed

Lines changed: 16 additions & 159 deletions

File tree

PCA.md

Lines changed: 16 additions & 159 deletions
Original file line numberDiff line numberDiff line change
@@ -1,32 +1,34 @@
11
---
2-
jupyter:
3-
jupytext:
4-
formats: ipynb,md
5-
text_representation:
6-
extension: .md
7-
format_name: markdown
8-
format_version: '1.3'
9-
jupytext_version: 1.19.1
10-
kernelspec:
11-
display_name: Python 3
12-
language: python
13-
name: python3
2+
jupytext:
3+
formats: ipynb,md:myst
4+
text_representation:
5+
extension: .md
6+
format_name: myst
7+
format_version: 0.13
8+
jupytext_version: 1.19.1
9+
kernelspec:
10+
display_name: msprime
11+
language: python
12+
name: myenv
1413
---
1514

1615
# Comparing branch and SNP-based PCA
1716

17+
+++
1818

1919
Principal Component Analysis (PCA) is commonly used for exploring population structure in genetic datasets, where it is usually computed from SNP genotyped data. In the context of ARGs, it is also possible to perform branch PCA as implemented in `tskit`. This does not use variant data. (Of course, it may indirectly rely on variant data if the ARG was inferred from data.)
2020

2121
In this tutorial, we demonstrate both approaches. We will apply these to haplotypes and diploid genotype data.
2222

2323
The documentation of `tskit.TreeSequence.pca` can be found [here](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.pca).
2424

25+
+++
2526

2627
:::{note}
2728
Usually, PCA is carried out on a diploid genotype matrix (individuals in rows, loci in columns) with values 0, 1, and 2. PCA can then be achieved through singular value decomposition (SVD) of the column-centred genotype matrix. This results in a matrix of principal component (PC) scores, which are linear combinations of the genotype columns. The PC scores are ordered, decreasingly, by the amount of variation from the original data they account for.
2829
:::
2930

31+
+++
3032

3133
First, we'll simulate an ARG with population structure:
3234

@@ -121,7 +123,6 @@ gtSvd = np.linalg.svd(gtMat - gtMat.mean(axis=0), full_matrices=False)
121123
## Branch PCA (tskit)
122124
To demstrate that branch PCA works without variant data, we run it on the ARG without mutations, `ts`.
123125

124-
125126
```{code-cell} ipython3
126127
# haplotypes, each sample haplotype is ues by default
127128
hapBranchPca=ts.pca(num_components=10)
@@ -167,6 +168,7 @@ plt.show()
167168

168169
The plots on the left show one dot per haplotype. These have twice as many dots as the plots on the right, which show individuals. The colours indicate from which of the five islands a haplotype or individual was sampled. As expected with low geneflow, there is some grouping by island. Feel free to re-run with higher or lower values of `migRate` to see how the separations between the island samples changes.
169170

171+
+++
170172

171173
## Comparing variance components between branch and SNP PCA
172174
Both `numpy.linalg.svd` and `tskit.TreeSequence.pca` return information about the amount of variation accounted for by each PC. These information are stored in the slots `S` (standard variation for SVD) and `eigenvalues` (variance for branch PCA). To make the two match, we need to multiply the eigenvalues by the mutation rate before taking the square root.
@@ -183,7 +185,6 @@ We now fit a least-squares regression model to demonstrate the match between SVD
183185
```{code-cell} ipython3
184186
res = linregress(xx, yy)
185187
print(f"Intercept: {res.intercept:.4f}\n Slope: {res.slope:.4f}\n r^2: {res.rvalue**2:.4f}")
186-
187188
```
188189

189190
$r^2$ is close to 1. Let us visualise this. Each dot below shows a standard deviation value associate with one PC. The fact that they are well correlated suggests that both SNP and branch PCA yielded very similar results.
@@ -234,151 +235,7 @@ pctime[7].factors[:20,:10]
234235
Here, we demonstrated using simulated data how SNPs and ARG branches lead to equivalent PCA results. For empirical data, the ancestral states of variant sites are not known a priori, which will in practice often lead to polarisation differences. That may affect the outcome of PCA.
235236

236237
**TODO:** Extend Tutorial to empirical data.
237-
fstmat[i,j] = tsm.Fst([range(i*nHap,(i+1)*nHap), range((i+1)*nHap,(i+2)*nHap)])
238-
fstmat
239-
```
240-
241-
## PCA 'by hand'
242-
To compute a SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
243-
244-
```python
245-
# obtain a haplotype matrix and print its shape
246-
# 100 haplotypes (= 10 individual samples * 5 islands * 2 haplotypes per individual)
247-
# 13683 variant sites
248-
htMat=tsm.genotype_matrix().transpose()
249-
htMat.shape
250-
```
251-
252-
```python
253-
# Add each individual's two haplotypes to generate individual genotypes
254-
sample_ids_to_mat_index = np.full_like(tsm.samples(), tskit.NULL, shape=tsm.num_nodes)
255-
sample_ids_to_mat_index[tsm.samples()] = np.arange(len(tsm.samples()))
256-
gtMat = htMat[sample_ids_to_mat_index[ts.individuals_nodes]].sum(axis=1)
257-
```
258-
259-
```python
260-
# Haplotype SVD (column-centred)
261-
htSvd = np.linalg.svd(htMat - htMat.mean(axis=0), full_matrices=False)
262-
```
263-
264-
```python
265-
# Genotype SVD (column-centred)
266-
gtSvd = np.linalg.svd(gtMat - gtMat.mean(axis=0), full_matrices=False)
267-
```
268-
269-
## Branch PCA (tskit)
270-
To demstrate that branch PCA works without variant data, we run it on the ARG without mutations, `ts`.
271-
272-
273-
```python
274-
# haplotypes, each sample haplotype is ues by default
275-
hapBranchPca=ts.pca(num_components=10)
276-
```
277-
278-
```python
279-
# genotypes, all individuals are specified
280-
dipBranchPca=ts.pca(num_components=10, individuals=range(5*nSamp))
281-
```
282-
283-
Plot for comparison
284-
285-
```python
286-
fig, axs = plt.subplots(2, 2)
287-
plt.tight_layout()
288-
axs[0, 0].scatter(htSvd.U[:,0],
289-
htSvd.U[:,1],
290-
c=np.repeat([1,2,3,4,5], [nHap] * nPop))
291-
axs[0, 0].set_title('Haplotypes (sites)')
292-
293-
axs[0,0].set_ylabel("PC2")
294-
axs[0,1].scatter(gtSvd.U[:,0],
295-
gtSvd.U[:,1]*-1,
296-
c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
297-
axs[0,1].set_title("Individuals (sites)")
298-
299-
# flipping the axes to make similarity clearer:
300-
axs[1,0].scatter(hapBranchPca.factors[:,0]*-1,
301-
hapBranchPca.factors[:,1]*-1,
302-
c=np.repeat([1,2,3,4,5], [nHap] * nPop))
303-
axs[1,0].set_title("Haplotypes (branches)")
304-
axs[1,0].set_ylabel("PC2")
305-
axs[1,0].set_xlabel("PC1")
306-
307-
axs[1,1].scatter(dipBranchPca.factors[:,0]*-1,
308-
dipBranchPca.factors[:,1],
309-
c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
310-
axs[1,1].set_title("Individuals (branches)")
311-
axs[1,1].set_xlabel("PC1")
312-
313-
plt.show()
314-
```
315-
316-
The plots on the left show one dot per haplotype. These have twice as many dots as the plots on the right, which show individuals. The colours indicate from which of the five islands a haplotype or individual was sampled. As expected with low geneflow, there is some grouping by island. Feel free to re-run with higher or lower values of `migRate` to see how the separations between the island samples changes.
317-
318-
319-
## Comparing variance components between branch and SNP PCA
320-
Both `numpy.linalg.svd` and `tskit.TreeSequence.pca` return information about the amount of variation accounted for by each PC. These information are stored in the slots `S` (standard variation for SVD) and `eigenvalues` (variance for branch PCA). To make the two match, we need to multiply the eigenvalues by the mutation rate before taking the square root.
321-
322-
```python
323-
# square root of (branch eigenvalues multiplied by the mutation rate)
324-
xx=np.sqrt(hapBranchPca.eigenvalues * mu)
325-
# SVD S values
326-
yy=htSvd.S[:10]
327-
```
328-
329-
We now fit a least-squares regression model to demonstrate the match between SVD standard variation and transformed eigenvalues.
330238

331-
```python
332-
res = linregress(xx, yy)
333-
print(f"Intercept: {res.intercept:.4f}\n Slope: {res.slope:.4f}\n r^2: {res.rvalue**2:.4f}")
334-
335-
```
336-
337-
$r^2$ is close to 1. Let us visualise this. Each dot below shows a standard deviation value associate with one PC. The fact that they are well correlated suggests that both SNP and branch PCA yielded very similar results.
338-
339-
```python
340-
plt.scatter(xx, yy)
341-
plt.xlabel("sqrt(Branch eigenvals)")
342-
plt.ylabel("GT svd.S")
343-
plt.plot(xx, res.intercept + res.slope*xx, 'r', label='fitted line')
344-
plt.xlabel(r"Branches: $\sqrt{eigenvals * \mu}$") # use raw string to avoid error message about \s
345-
plt.ylabel("SNPs: $S$")
346-
plt.title("Variance components of SNP and branch PCA")
347-
plt.grid()
348-
plt.show()
349-
```
350-
351-
## Time windows
352-
Above we showed how variant and branch-based PCA are equivalent. But the ARG is a much richer data type than the genotype matrix. ARGs contain information about the historic relationships between the samples (possibly blurred by a inference step). Branch PCA allows one to specify a time window over which the PCA is to be computed, something that cannot be done for SNP PCA. Next, we compute PCA in time slices with breaks 0, 10, 100, 1000, 10,000, 100,000, 100,0000, 1,000,000, and 10,000,000. The results are stored in a list.
353-
354-
```python
355-
pctime=[tsm.pca(num_components=10, time_windows=[10**i, 10**(i+1)]) for i in range(8)]
356-
```
357-
358-
Being of class `PCAResult`, the elements of the list have a `factors` property. This has a shape of (100,10). I.e., 10 PCs for 100 haplotypes.
359-
360-
```python
361-
pctime[0].factors.shape
362-
```
363-
364-
```python
365-
for i in range(8):
366-
evecs = pctime[i].factors[:,:10]
367-
368-
plt.scatter(evecs[:,0],
369-
evecs[:,1],
370-
c=np.repeat(range(5), 20))
371-
plt.title(f"Branch PCA, (window: {10**i} - {10**(i+1)} gens)")
372-
plt.show()
373-
```
374-
375-
When selecting a very old window, each individual contributes to its own PC, causing most to be plotted at the origin (0,0). We can see this when inspecting the oldest window's PC scores, which are an identityt matrix. All haplotypes below the first two have 0 entries for the first two PC scores (the two left-most columns).
239+
```{code-cell} ipython3
376240
377-
```python
378-
pctime[7].factors[:20,:10]
379241
```
380-
381-
## Empirical data
382-
Here, we demonstrated using simulated data how SNPs and ARG branches lead to equivalent PCA results. For empirical data, the ancestral states of variant sites are not known a priori, which will in practice often lead to polarisation differences. That may affect the outcome of PCA.
383-
384-
**TODO:** Extend Tutorial to empirical data.

0 commit comments

Comments
 (0)