PR comments implemented

hannesbecher · hannesbecher · commit 3ebb0b151286 · 2026-03-26T09:40:15.000Z
diff --git a/PCA.md b/PCA.md
@@ -34,7 +34,7 @@ First, we'll simulate an ARG with population structure:
 
 ```{code-cell} ipython3
 # load required libraries
-import msprime as msp
+import msprime
 import tskit
 import numpy as np
 import matplotlib.pyplot as plt
@@ -60,24 +60,24 @@ Simulate an ARG using an island model demography. There are five islands, each w
 
 ```{code-cell} ipython3
 # Island model demography, 5 islands connected by low gene flow
-dmg = msp.Demography.island_model([nn] * nPop, migration_rate=migRate)
+dmg = msprime.Demography.island_model([nn] * nPop, migration_rate=migRate)
 ```
 
 ```{code-cell} ipython3
 # Simulate ARG
-ts = msp.sim_ancestry(samples={i: nSamp for i in range(nPop)},
+ts = msprime.sim_ancestry(samples={i: nSamp for i in range(nPop)},
                       demography=dmg,
                       random_seed=1234,
                       sequence_length=1e6,
-                      recombination_rate=1e-9)
+                      recombination_rate=1e-8)
 ts
 ```
 
 The same ARG, but with mutations added.
 
 ```{code-cell} ipython3
 # Add mutations
-tsm = msp.sim_mutations(ts, rate=mu, random_seed=1234)
+tsm = msprime.sim_mutations(ts, rate=mu, random_seed=1234)
 tsm
 ```
 
@@ -92,11 +92,24 @@ for i in range(nPop-1):
 fstmat
 ```
 
+## Branch PCA (tskit)
+To demonstrate that branch PCA works without variant data, we run it on the ARG without mutations, `ts`.
+
+```{code-cell} ipython3
+# haplotypes, each sample haplotype is used by default
+hapBranchPca=ts.pca(num_components=10)
+```
+
+```{code-cell} ipython3
+# genotypes, all individuals are specified
+dipBranchPca=ts.pca(num_components=10, individuals=range(5*nSamp))
+```
+
 ## PCA 'by hand'
-To compute a SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
+To compute a traditional SNP PCA, we start by extracting the haploid 'genotypes' from the ARG. We then make use of the `TreeSequence` object's `individuals_nodes` property (an array) to select each individual's two haplotypes and to add them to create individual diploid genotypes.
 
 ```{code-cell} ipython3
-# obtain a haplotype matrix and print its shape
+# obtain a haplotype matrix from the tree sequence with mutations; print its shape
 # 100 haplotypes (= 10 individual samples * 5 islands * 2 haplotypes per individual)
 # 13683 variant sites
 htMat=tsm.genotype_matrix().transpose()
@@ -112,52 +125,40 @@ gtMat = htMat[sample_ids_to_mat_index[ts.individuals_nodes]].sum(axis=1)
 
 ```{code-cell} ipython3
 # Haplotype SVD (column-centred)
-htSvd = np.linalg.svd(htMat - htMat.mean(axis=0), full_matrices=False)
+hapSvd = np.linalg.svd(htMat - htMat.mean(axis=0), full_matrices=False)
 ```
 
 ```{code-cell} ipython3
 # Genotype SVD (column-centred)
-gtSvd = np.linalg.svd(gtMat - gtMat.mean(axis=0), full_matrices=False)
-```
-
-## Branch PCA (tskit)
-To demonstrate that branch PCA works without variant data, we run it on the ARG without mutations, `ts`.
-
-```{code-cell} ipython3
-# haplotypes, each sample haplotype is used by default
-hapBranchPca=ts.pca(num_components=10)
+dipSvd = np.linalg.svd(gtMat - gtMat.mean(axis=0), full_matrices=False)
 ```
 
-```{code-cell} ipython3
-# genotypes, all individuals are specified
-dipBranchPca=ts.pca(num_components=10, individuals=range(5*nSamp))
-```
-
-Plot for comparison
+## Plot for comparison
+Note that the sign of each principal component is arbitrary, so PCA does not preserve axis orientation. The panels below will show similar patterns, but one or both axes may be flipped.
 
 ```{code-cell} ipython3
 fig, axs = plt.subplots(2, 2)
 plt.tight_layout()
-axs[0, 0].scatter(htSvd.U[:,0],
-                  htSvd.U[:,1],
+axs[0, 0].scatter(hapSvd.U[:,0],
+                  hapSvd.U[:,1],
                   c=np.repeat([1,2,3,4,5], [nHap] * nPop))
 axs[0, 0].set_title('Haplotypes (sites)')
 
 axs[0,0].set_ylabel("PC2")
-axs[0,1].scatter(gtSvd.U[:,0],
-                 gtSvd.U[:,1]*-1,
+axs[0,1].scatter(dipSvd.U[:,0],
+                 dipSvd.U[:,1],
                  c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
 axs[0,1].set_title("Individuals (sites)")
 
 # flipping the axes to make similarity clearer:
-axs[1,0].scatter(hapBranchPca.factors[:,0]*-1,
-                 hapBranchPca.factors[:,1]*-1,
+axs[1,0].scatter(hapBranchPca.factors[:,0],
+                 hapBranchPca.factors[:,1],
                  c=np.repeat([1,2,3,4,5], [nHap] * nPop))
 axs[1,0].set_title("Haplotypes (branches)")
 axs[1,0].set_ylabel("PC2")
 axs[1,0].set_xlabel("PC1")
 
-axs[1,1].scatter(dipBranchPca.factors[:,0]*-1,
+axs[1,1].scatter(dipBranchPca.factors[:,0],
                  dipBranchPca.factors[:,1],
                  c=np.repeat([1,2,3,4,5], [nSamp] * nPop))
 axs[1,1].set_title("Individuals (branches)")
@@ -177,7 +178,7 @@ Both `numpy.linalg.svd` and `tskit.TreeSequence.pca` return information about th
 # square root of (branch eigenvalues multiplied by the mutation rate)
 xx=np.sqrt(hapBranchPca.eigenvalues * mu)
 # SVD S values
-yy=htSvd.S[:10]
+yy=hapSvd.S[:10]
 ```
 
 We now fit a least-squares regression model to demonstrate the match between SVD standard variation and transformed eigenvalues.
@@ -208,7 +209,7 @@ Above we showed how variant and branch-based PCA are equivalent. But the ARG is
 pctime=[tsm.pca(num_components=10, time_windows=[10**i, 10**(i+1)]) for i in range(8)]
 ```
 
-Being of class `PCAResult`, the elements of the list have a `factors` property. This has a shape of `(100,10)`, that is, 10 PCs for 100 haplotypes.
+Being of class `PCAResult`, the elements of the list have a `factors` property. This has a shape of `(100, 10)`, i.e., 10 PCs for 100 haplotypes.
 
 ```{code-cell} ipython3
 pctime[0].factors.shape
@@ -235,7 +236,3 @@ pctime[7].factors[:20,:10]
 Here, we demonstrated using simulated data how SNPs and ARG branches lead to equivalent PCA results. For empirical data, the ancestral states of variant sites are not known a priori, which will in practice often lead to polarisation differences. That may affect the outcome of PCA.
 
 **TODO:** Extend Tutorial to empirical data.
-
-```{code-cell} ipython3
-
-```
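
Since this change drops the manual `*-1` flips from the plotting cell, the panels may now differ by axis orientation. If matching orientations are wanted again, one option is to align signs programmatically rather than hard-coding them. The following is a minimal sketch, not part of this commit: `align_sign` is a hypothetical helper, and it assumes the two score vectors refer to the same samples in the same order (for example the first column of `hapSvd.U` and of `hapBranchPca.factors`). It is written in plain Python so it runs without dependencies.

```python
def align_sign(reference, scores):
    """Return `scores`, negated if it points away from `reference`.

    Both arguments are equal-length sequences of PC scores for the
    same samples. The sign of a principal component is arbitrary, so
    a negative dot product means only that the axis is flipped.
    """
    dot = sum(r * s for r, s in zip(reference, scores))
    if dot < 0:
        return [-s for s in scores]
    return list(scores)


# Toy values (illustrative only, not taken from the simulation):
site_pc1 = [1.0, 2.0, -1.0]
branch_pc1 = [-1.1, -1.9, 0.8]      # same pattern, opposite sign
aligned = align_sign(site_pc1, branch_pc1)
print(aligned)
```

With NumPy arrays the same check could be written as `np.dot(reference, scores) < 0`, applied column by column before plotting.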