-To compute a SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
+To compute a traditional SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.

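A minimal NumPy sketch of the step described in the paragraph above: pairing each individual's two haplotypes through an `individuals_nodes`-style array and summing them. The random haplotype matrix, its (haplotypes × sites) orientation, the stand-in `individuals_nodes` array, and the per-site centring are all assumptions for illustration; the tutorial's own extraction code is not shown in this hunk.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_sites = 5, 8

# Stand-in for a haplotype matrix: one row per sample node, one column per site.
hap = rng.integers(0, 2, size=(2 * n_ind, n_sites))

# Stand-in for `ts.individuals_nodes`: row i holds the two node IDs of individual i.
individuals_nodes = np.arange(2 * n_ind).reshape(n_ind, 2)

# Sum the two haplotypes of each individual to get 0/1/2 diploid genotypes.
dip = hap[individuals_nodes[:, 0]] + hap[individuals_nodes[:, 1]]

# Centre each site (an assumption here), then take the SVD.
# The columns of U give the individuals' coordinates on each PC, up to scale.
dip_centred = dip - dip.mean(axis=0)
U, S, Vt = np.linalg.svd(dip_centred, full_matrices=False)
```

With a real tree sequence, `hap` would come from the genotype data and `individuals_nodes` from the `TreeSequence` object, as the paragraph describes.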
 ```{code-cell} ipython3
-# obtain a haplotype matrix and print its shape
+# obtain a haplotype matrix from the tree sequence with mutation; print its shape
 Note that PCA does not preserve the axis orientation. The plots in the panels below will show similar patterns but one or both axes may be flipped.

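Because the sign of each principal component is arbitrary, two analyses of the same data can differ by a flip of one or both axes. One possible way to make such plots directly comparable, sketched here under assumptions, is to flip each column of one score matrix so that it correlates positively with a reference. The helper `align_signs` and the random test matrix are hypothetical and not part of the tutorial.

```python
import numpy as np

def align_signs(reference, scores):
    """Flip columns of `scores` so each has a non-negative dot product with `reference`."""
    dots = np.sum(reference * scores, axis=0)
    signs = np.where(dots < 0, -1.0, 1.0)
    return scores * signs

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
X -= X.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Simulate a second analysis that returns the same two PCs with PC1 flipped.
flipped = U[:, :2] * np.array([-1.0, 1.0])

# After alignment the flipped scores match the reference again.
aligned = align_signs(U[:, :2], flipped)
```

This only works when both score matrices list the same samples in the same order.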
 ```{code-cell} ipython3
 fig, axs = plt.subplots(2, 2)
 plt.tight_layout()
-axs[0, 0].scatter(htSvd.U[:,0],
-                  htSvd.U[:,1],
+axs[0, 0].scatter(hapSvd.U[:,0],
+                  hapSvd.U[:,1],
                   c=np.repeat([1,2,3,4,5], [nHap] * nPop))
 axs[0, 0].set_title('Haplotypes (sites)')

 axs[0,0].set_ylabel("PC2")
-axs[0,1].scatter(gtSvd.U[:,0],
-                 gtSvd.U[:,1]*-1,
+axs[0,1].scatter(dipSvd.U[:,0],
+                 dipSvd.U[:,1],
                  c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
 axs[0,1].set_title("Individuals (sites)")

 # flipping the axes to make similarity clearer:
-axs[1,0].scatter(hapBranchPca.factors[:,0]*-1,
-                 hapBranchPca.factors[:,1]*-1,
+axs[1,0].scatter(hapBranchPca.factors[:,0],
+                 hapBranchPca.factors[:,1],
                  c=np.repeat([1,2,3,4,5], [nHap] * nPop))
 axs[1,0].set_title("Haplotypes (branches)")
 axs[1,0].set_ylabel("PC2")
 axs[1,0].set_xlabel("PC1")

-axs[1,1].scatter(dipBranchPca.factors[:,0]*-1,
+axs[1,1].scatter(dipBranchPca.factors[:,0],
                  dipBranchPca.factors[:,1],
                  c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
 axs[1,1].set_title("Individuals (branches)")
@@ -177,7 +178,7 @@ Both `numpy.linalg.svd` and `tskit.TreeSequence.pca` return information about th
 # square root of (branch eigenvalues multiplied by the mutation rate)
 xx=np.sqrt(hapBranchPca.eigenvalues * mu)
 # SVD S values
-yy=htSvd.S[:10]
+yy=hapSvd.S[:10]
 ```

 We now fit a least-squares regression model to demonstrate the match between SVD standard variation and transformed eigenvalues.
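A least-squares fit of this kind might look like the sketch below. It is not the tutorial's regression code, which this hunk does not show. As a stand-in for the branch eigenvalues scaled by `mu`, it uses the eigenvalues of the Gram matrix of a random centred 0/1 matrix, for which the relation to the singular values is exact; with real data the match is only approximate. Using `np.polyfit` is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in data: 30 haplotypes by 200 sites, centred per site.
X = rng.integers(0, 2, size=(30, 200)).astype(float)
X -= X.mean(axis=0)

# Top 10 singular values of the data matrix (the "S values").
yy = np.linalg.svd(X, compute_uv=False)[:10]

# Top 10 eigenvalues of the sample-by-sample Gram matrix, largest first.
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1][:10]
xx = np.sqrt(eigvals)

# Degree-1 least-squares fit of yy on xx.
slope, intercept = np.polyfit(xx, yy, deg=1)
```

In this idealised case the slope is 1 and the intercept 0 up to floating-point error, since the squared singular values of `X` are the eigenvalues of `X @ X.T`.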
@@ -208,7 +209,7 @@ Above we showed how variant and branch-based PCA are equivalent. But the ARG is
 pctime=[tsm.pca(num_components=10, time_windows=[10**i, 10**(i+1)]) for i in range(8)]
 ```

-Being of class `PCAResult`, the elements of the list have a `factors` property. This has a shape of `(100,10)`, that is, 10 PCs for 100 haplotypes.
+Being of class `PCAResult`, the elements of the list have a `factors` property. This has a shape of (100,10). I.e., 10 PCs for 100 haplotypes.

 ```{code-cell} ipython3
 pctime[0].factors.shape
@@ -235,7 +236,3 @@ pctime[7].factors[:20,:10]
 Here, we demonstrated using simulated data how SNPs and ARG branches lead to equivalent PCA results. For empirical data, the ancestral states of variant sites are not known a priori, which will in practice often lead to polarisation differences. That may affect the outcome of PCA.