Skip to content

Commit 3ebb0b1

Browse files
committed
PR comments implemented
1 parent 208c5ac commit 3ebb0b1

1 file changed

Lines changed: 33 additions & 36 deletions

File tree

PCA.md

Lines changed: 33 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@ First, we'll simulate an ARG with population structure:
3434

3535
```{code-cell} ipython3
3636
# load required libraries
37-
import msprime as msp
37+
import msprime
3838
import tskit
3939
import numpy as np
4040
import matplotlib.pyplot as plt
@@ -60,24 +60,24 @@ Simulate an ARG using an island model demography. There are five islands, each w
6060

6161
```{code-cell} ipython3
6262
# Island model demography, 5 islands connected by low gene flow
63-
dmg = msp.Demography.island_model([nn] * nPop, migration_rate=migRate)
63+
dmg = msprime.Demography.island_model([nn] * nPop, migration_rate=migRate)
6464
```
6565

6666
```{code-cell} ipython3
6767
# Simulate ARG
68-
ts = msp.sim_ancestry(samples={i: nSamp for i in range(nPop)},
68+
ts = msprime.sim_ancestry(samples={i: nSamp for i in range(nPop)},
6969
demography=dmg,
7070
random_seed=1234,
7171
sequence_length=1e6,
72-
recombination_rate=1e-9)
72+
recombination_rate=1e-8)
7373
ts
7474
```
7575

7676
The same ARG, but with mutations added.
7777

7878
```{code-cell} ipython3
7979
# Add mutations
80-
tsm = msp.sim_mutations(ts, rate=mu, random_seed=1234)
80+
tsm = msprime.sim_mutations(ts, rate=mu, random_seed=1234)
8181
tsm
8282
```
8383

@@ -92,11 +92,24 @@ for i in range(nPop-1):
9292
fstmat
9393
```
9494

95+
## Branch PCA (tskit)
96+
To demstrate that branch PCA works without variant data, we run it on the ARG without mutations, `ts`.
97+
98+
```{code-cell} ipython3
99+
# haplotypes, each sample haplotype is ues by default
100+
hapBranchPca=ts.pca(num_components=10)
101+
```
102+
103+
```{code-cell} ipython3
104+
# genotypes, all individuals are specified
105+
dipBranchPca=ts.pca(num_components=10, individuals=range(5*nSamp))
106+
```
107+
95108
## PCA 'by hand'
96-
To compute a SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
109+
To compute a traditional SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
97110

98111
```{code-cell} ipython3
99-
# obtain a haplotype matrix and print its shape
112+
# obtain a haplotype matrix from the tree sequence with mutation; print its shape
100113
# 100 haplotypes (= 10 individual samples * 5 islands * 2 haplotypes per individual)
101114
# 13683 variant sites
102115
htMat=tsm.genotype_matrix().transpose()
@@ -112,52 +125,40 @@ gtMat = htMat[sample_ids_to_mat_index[ts.individuals_nodes]].sum(axis=1)
112125

113126
```{code-cell} ipython3
114127
# Haplotype SVD (column-centred)
115-
htSvd = np.linalg.svd(htMat - htMat.mean(axis=0), full_matrices=False)
128+
hapSvd = np.linalg.svd(htMat - htMat.mean(axis=0), full_matrices=False)
116129
```
117130

118131
```{code-cell} ipython3
119132
# Genotype SVD (column-centred)
120-
gtSvd = np.linalg.svd(gtMat - gtMat.mean(axis=0), full_matrices=False)
121-
```
122-
123-
## Branch PCA (tskit)
124-
To demonstrate that branch PCA works without variant data, we run it on the ARG without mutations, `ts`.
125-
126-
```{code-cell} ipython3
127-
# haplotypes, each sample haplotype is used by default
128-
hapBranchPca=ts.pca(num_components=10)
133+
dipSvd = np.linalg.svd(gtMat - gtMat.mean(axis=0), full_matrices=False)
129134
```
130135

131-
```{code-cell} ipython3
132-
# genotypes, all individuals are specified
133-
dipBranchPca=ts.pca(num_components=10, individuals=range(5*nSamp))
134-
```
135-
136-
Plot for comparison
136+
## Plot for comparison
137+
Note that PCA does not preserve the axis orientation. The plots in the panels below will show similar patterns but one or both axes may be flipped.
137138

138139
```{code-cell} ipython3
139140
fig, axs = plt.subplots(2, 2)
140141
plt.tight_layout()
141-
axs[0, 0].scatter(htSvd.U[:,0],
142-
htSvd.U[:,1],
142+
axs[0, 0].scatter(hapSvd.U[:,0],
143+
hapSvd.U[:,1],
143144
c=np.repeat([1,2,3,4,5], [nHap] * nPop))
144145
axs[0, 0].set_title('Haplotypes (sites)')
145146
146147
axs[0,0].set_ylabel("PC2")
147-
axs[0,1].scatter(gtSvd.U[:,0],
148-
gtSvd.U[:,1]*-1,
148+
axs[0,1].scatter(dipSvd.U[:,0],
149+
dipSvd.U[:,1],
149150
c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
150151
axs[0,1].set_title("Individuals (sites)")
151152
152153
# flipping the axes to make similarity clearer:
153-
axs[1,0].scatter(hapBranchPca.factors[:,0]*-1,
154-
hapBranchPca.factors[:,1]*-1,
154+
axs[1,0].scatter(hapBranchPca.factors[:,0],
155+
hapBranchPca.factors[:,1],
155156
c=np.repeat([1,2,3,4,5], [nHap] * nPop))
156157
axs[1,0].set_title("Haplotypes (branches)")
157158
axs[1,0].set_ylabel("PC2")
158159
axs[1,0].set_xlabel("PC1")
159160
160-
axs[1,1].scatter(dipBranchPca.factors[:,0]*-1,
161+
axs[1,1].scatter(dipBranchPca.factors[:,0],
161162
dipBranchPca.factors[:,1],
162163
c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
163164
axs[1,1].set_title("Individuals (branches)")
@@ -177,7 +178,7 @@ Both `numpy.linalg.svd` and `tskit.TreeSequence.pca` return information about th
177178
# square root of (branch eigenvalues multiplied by the mutation rate)
178179
xx=np.sqrt(hapBranchPca.eigenvalues * mu)
179180
# SVD S values
180-
yy=htSvd.S[:10]
181+
yy=hapSvd.S[:10]
181182
```
182183

183184
We now fit a least-squares regression model to demonstrate the match between SVD standard variation and transformed eigenvalues.
@@ -208,7 +209,7 @@ Above we showed how variant and branch-based PCA are equivalent. But the ARG is
208209
pctime=[tsm.pca(num_components=10, time_windows=[10**i, 10**(i+1)]) for i in range(8)]
209210
```
210211

211-
Being of class `PCAResult`, the elements of the list have a `factors` property. This has a shape of `(100,10)`, that is, 10 PCs for 100 haplotypes.
212+
Being of class `PCAResult`, the elements of the list have a `factors` property. This has a shape of (100,10). I.e., 10 PCs for 100 haplotypes.
212213

213214
```{code-cell} ipython3
214215
pctime[0].factors.shape
@@ -235,7 +236,3 @@ pctime[7].factors[:20,:10]
235236
Here, we demonstrated using simulated data how SNPs and ARG branches lead to equivalent PCA results. For empirical data, the ancestral states of variant sites are not known a priori, which will in practice often lead to polarisation differences. That may affect the outcome of PCA.
236237

237238
**TODO:** Extend Tutorial to empirical data.
238-
239-
```{code-cell} ipython3
240-
241-
```

0 commit comments

Comments
 (0)