Skip to content

Commit 9be5ad3

Browse files
authored
Merge pull request #220 from hyanwong/R-tutorial-notebook
Update R tute to include plotting and reading trees
2 parents b2cf9c0 + 259346e commit 9be5ad3

2 files changed

Lines changed: 91 additions & 10 deletions

File tree

.github/workflows/build.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -36,7 +36,7 @@ jobs:
3636
run: |
3737
# We need to remove R to pull in a version that's compatible with CRAN, weirdly.
3838
sudo apt-get remove r-base-core
39-
sudo apt-get install r-cran-reticulate r-cran-pbdzmq r-cran-uuid
39+
sudo apt-get install r-cran-reticulate r-cran-pbdzmq r-cran-uuid r-cran-ape
4040
sudo R -e 'install.packages("IRkernel")'
4141
R -e 'IRkernel::installspec()'
4242

tskitr.md

Lines changed: 90 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -12,37 +12,65 @@ kernelspec:
1212
name: ir
1313
---
1414

15+
```{currentmodule} tskit
16+
```
17+
1518
(sec_tskit_r)=
1619

1720
# Tskit and R
1821

19-
To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate/) R package, which lets you call Python functions within an R session. In this short tutorial, we'll go through a couple of examples to show you how to get started. If you haven't done so already, you'll need to install `reticulate` in your R session via `install.packages("reticulate")`.
22+
To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate/) R package, which lets you call Python functions within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. If you haven't done so already, you'll need to install `reticulate` in your R session via `install.packages("reticulate")`.
2023

2124
We'll begin by simulating a small tree sequence using `msprime`.
2225

2326
```{code-cell}
2427
msprime <- reticulate::import("msprime")
2528
26-
ts <- msprime$sim_ancestry(10, sequence_length = 100, random_seed=42)
29+
ts <- msprime$sim_ancestry(80, sequence_length=1e4, recombination_rate=1e-4, random_seed=42)
2730
ts
2831
```
2932

30-
`reticulate` allows us to access a Python object's attributes or call its methods via the `$` operator. For example, we can access (and assign to a variable) the number of samples in the tree sequence:
33+
## Attributes and methods
34+
35+
`reticulate` allows us to access a Python object's attributes or call its methods via
36+
the `$` operator. For example, we can access (and assign to a variable) the number of
37+
samples in the tree sequence:
3138

3239
```{code-cell}
3340
n <- ts$num_samples
3441
n
3542
```
3643

37-
We can also use `tskit`'s powerful [Statistics](https://tskit.dev/tskit/docs/stable/stats.html) framework to efficiently compute many different summary statistics from a tree sequence. To illustrate this, we'll first add some mutations to our tree sequence with `msprime`'s `sim_mutations` function, and then compute the genetic diversity for each of the tree sequence's sample nodes:
44+
We can also call tskit methods on this tree sequence, such as
45+
{meth}`TreeSequence.simplify`. The parameters are given as native R objects
46+
(but note that we still retain tskit's 0-based indexing system
3847

3948
```{code-cell}
40-
ts_mut = msprime$sim_mutations(ts, rate = 1e-2, random_seed = 42)
41-
ts_mut$num_mutations
42-
ts_mut$diversity()
49+
reduced_ts <- ts$simplify(0:7) # only keep the first 8 samples
50+
reduced_ts <- reduced_ts$delete_intervals(list(c(6000, 10000))) # keep the first 6kb
51+
reduced_ts <- reduced_ts$trim() # remove the deleted end
52+
paste("Reduced from", ts$num_trees, "trees to", reduced_ts$num_trees, "trees")
4353
```
4454

45-
As a final example, we can also use the tree sequence `genotype_matrix()` method to return the genotypes of the the tree sequence as a matrix object in R.
55+
56+
## Analysis
57+
58+
From within R we can use `tskit`'s powerful
59+
[Statistics](https://tskit.dev/tskit/docs/stable/stats.html) framework to efficiently
60+
compute many different summary statistics from a tree sequence. To illustrate this,
61+
we'll first add some mutations to our tree sequence with the
62+
{func}`msprime:msprime.sim_mutations` function, and then compute the genetic diversity
63+
for each of the tree sequence's sample nodes:
64+
65+
```{code-cell}
66+
ts_mut = msprime$sim_mutations(reduced_ts, rate=1e-4, random_seed=321)
67+
68+
paste(ts_mut$num_mutations, "mutations, genetic diversity is", ts_mut$diversity())
69+
```
70+
71+
Numerical arrays and matrices work as expected. For instance, we can use the tree
72+
sequence {meth}`~TreeSequence.genotype_matrix()` method to return the genotypes of
73+
the tree sequence as a matrix object in R.
4674

4775
```{code-cell}
4876
G = ts_mut$genotype_matrix()
@@ -56,4 +84,57 @@ allele_frequency = rowMeans(G)
5684
allele_frequency
5785
```
5886

59-
It's as simple as that! Be sure to check out the [reticulate](https://rstudio.github.io/reticulate/) documentation, in particular on [Calling Python from R](https://rstudio.github.io/reticulate/articles/calling_python.html), which includes important information on how R data types are converted to their equivalent Python types.
87+
## Jupyter notebook use
88+
89+
If you are running R within a [Jupyter notebook](https://jupyter.org), then you can
90+
define a few magic functions that will display tskit tables and plots within the notebook:
91+
92+
```{code-cell}
93+
# Define some magic functions to allow objects to be displayed in R Jupyter notebooks
94+
repr_html.tskit.trees.TreeSequence <- function(obj, ...){obj$`_repr_html_`()}
95+
repr_html.tskit.trees.Tree <- function(obj, ...){obj$`_repr_html_`()}
96+
repr_svg.tskit.drawing.SVGString <- function(obj, ...){obj$`__str__`()}
97+
```
98+
99+
This leads to much nicer summary tables:
100+
101+
```{code-cell}
102+
ts_mut
103+
```
104+
105+
And also allows trees and tree sequences to be easily plotted
106+
107+
```{code-cell}
108+
ts_mut$draw_svg(y_axis=TRUE, y_ticks=0:10)
109+
```
110+
111+
112+
## Interaction with R libraries
113+
114+
R has a number of libraries to deal with genomic data and trees. Below we focus on the
115+
phylogenetic tree representation defined in the the popular
116+
[ape](http://ape-package.ird.fr) package, taking all the trees
117+
:meth:`exported in Nexus format<TreeSequence.write_nexus>`, or
118+
individual trees :meth:`exported in Newick format<TreeSequence.as_newick>`:
119+
120+
```{code-cell}
121+
file = tempfile()
122+
ts_mut$write_nexus(file)
123+
# Warning - ape trees are stored independently, so this will use much more memory than tskit
124+
trees <- ape::read.nexus(file, force.multi = TRUE) # return a set of trees
125+
126+
# Or simply read in a single tree
127+
tree <- ape::read.tree(text=ts_mut$first()$as_newick())
128+
129+
# Now we can plot the tree in tskit style, but using the ape library
130+
plot(tree, direction="downward", srt=90, adj=0.5) # or equivalently use trees[[1]]
131+
```
132+
133+
Note that nodes are labelled with the prefix `n`, so that nodes `0`, `1`, `2`, ...
134+
become `n0`, `n1`, `n2` ... etc. This helps to avoid
135+
confusion between the the zero-based counting system used natively
136+
by `tskit`, and the one-based counting system used in `R`.
137+
138+
## Further information
139+
140+
Be sure to check out the [reticulate](https://rstudio.github.io/reticulate/) documentation, in particular on [Calling Python from R](https://rstudio.github.io/reticulate/articles/calling_python.html), which includes important information on how R data types are converted to their equivalent Python types.

0 commit comments

Comments
 (0)