Skip to content

Commit 563e3c1

Browse files
authored
Merge pull request #299 from gregorgorjanc/r_tut_edits
Polish and mention RcppTskit
2 parents c3bc44a + d7f5570 commit 563e3c1

1 file changed

Lines changed: 80 additions & 57 deletions

File tree

tskitr.md

Lines changed: 80 additions & 57 deletions
Original file line numberDiff line numberDiff line change
@@ -19,62 +19,85 @@ kernelspec:
1919

2020
# Tskit and R
2121

22-
To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate) R package, which lets you call `tskit`'s Python API within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. You'll need to install `reticulate` in your R session via `install.packages("reticulate")` and at a minimum state the required Python packages via `reticulate::py_require(c("tskit", "msprime"))`. This setting will use `reticulate`'s isolated Python virtual environment, but you can also use an alternative Python environment as described at https://rstudio.github.io/reticulate.
22+
To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate)
23+
package, which lets you call `tskit`'s Python API from an R session. In this
24+
tutorial, we walk through a few simple examples to help you get started.
25+
26+
First, install `reticulate` in R with `install.packages("reticulate")`. Then,
27+
install the required Python packages with
28+
`reticulate::py_require(c("tskit", "msprime"))`.
29+
30+
By default, this uses `reticulate`'s isolated Python virtual environment, but
31+
you can also use another Python environment as described at
32+
<https://rstudio.github.io/reticulate/>.
33+
34+
```
35+
install.packages("reticulate")
36+
reticulate::py_require(c("tskit", "msprime"))
37+
```
38+
39+
::::{margin}
40+
:::{note}
41+
The [slendr](https://www.slendr.net) R package uses `reticulate` extensively to work with tree sequences in R and also provides R wrappers for some `tskit` functions.
42+
:::
43+
::::
2344

2445
::::{margin}
2546
:::{note}
26-
The [slendr](https://www.slendr.net) R package uses `reticulate` extensively to work with tree sequences in R and also provides R wrappers for some `tskit` functions.
47+
The [RcppTskit](https://cran.r-project.org/package=RcppTskit) R package provides
48+
access to `tskit`'s C API for advanced use cases and also shows how to interface
49+
with `reticulate` Python.
2750
:::
2851
::::
2952

30-
We'll begin by simulating a small tree sequence using `msprime`.
53+
We'll begin by simulating a small tree sequence
54+
by calling `msprime$sim_ancestry()` from R.
3155

3256
```{code-cell}
3357
msprime <- reticulate::import("msprime")
3458
ts <- msprime$sim_ancestry(80, sequence_length=1e4, recombination_rate=1e-4, random_seed=42)
35-
ts # See "Jupyter notebook tips", below for how to render this nicely
59+
ts # See "Jupyter notebook tips" below for nicer rendering
3660
```
3761

3862
## Attributes and methods
3963

40-
`reticulate` allows us to access a Python object's attributes via
41-
the `$` operator. For example, we can access (and assign to a native R object)
42-
the number of samples in the tree sequence:
64+
`reticulate` lets us access a Python object's attributes with the `$` operator.
65+
For example, we can access the number of samples in the tree sequence and
66+
store it into an R object:
4367

4468
```{code-cell}
4569
n <- ts$num_samples
4670
n
4771
is(n) # Show the type of the object
4872
```
4973

50-
The `$` operator can also be used to call methods, for example, the
51-
{meth}`~TreeSequence.simplify` method associated with the tree sequence.
52-
The method parameters are given as native R objects
53-
(but note that object IDs still use tskit's 0-based indexing system).
74+
The `$` operator can also be used to call a Python object's methods, for example
75+
the {meth}`~TreeSequence.simplify` method for a tree sequence.
76+
Method arguments are given as native R objects and then passed to `reticulate` Python
77+
(but note that object IDs still use `tskit`'s 0-based indexing system).
5478

5579
```{code-cell}
56-
reduced_ts <- ts$simplify(0:7) # only keep samples with ids 0, 1, 2, 3, 4, 5, 6, 7
57-
reduced_ts <- reduced_ts$delete_intervals(list(c(6000, 10000))) # delete data after 6kb
80+
reduced_ts <- ts$simplify(0:7) # only keep samples with IDs 0, 1, 2, 3, 4, 5, 6, 7
81+
reduced_ts <- reduced_ts$delete_intervals(list(c(6000, 10000))) # delete data after 6 kb
5882
reduced_ts <- reduced_ts$trim() # remove the deleted region
5983
paste(
6084
"Reduced from", ts$num_trees, "trees over", ts$sequence_length/1e3, "kb to",
6185
reduced_ts$num_trees, "trees over", reduced_ts$sequence_length/1e3, "kb.")
6286
```
6387

64-
## IDs and indexes
88+
## IDs and indexing
6589

66-
Note that if a bare digit is provided to one of these methods, it will be treated as a
67-
floating point number (numeric in R). This is useful to know when calling `tskit` methods that
68-
require integers (such as object IDs). For example, the following will not work:
90+
If you pass a bare number to one of these methods, R treats it as a floating
91+
point value (the default `numeric` type in R). This matters when calling
92+
`tskit` methods that require integers (such as object IDs). For example, the
93+
following raises a `TypeError` because `0` is passed as a float:
6994

7095
```{code-cell}
71-
:tags: [raises-exception, remove-output]
72-
ts$node(0) # Will raise a TypeError
96+
try(ts$node(0)) # Will raise a TypeError
7397
is(0)
7498
```
7599

76-
In this case, to force the `0` to be passed as an integer, you can either coerce it
77-
using `as.integer` or simply append the letter `L`:
100+
To pass `0` as an integer, either coerce it with `as.integer` or append `L`:
78101

79102
```{code-cell}
80103
is(as.integer(0))
@@ -84,43 +107,43 @@ is(0L)
84107
ts$node(0L)
85108
```
86109

87-
Coercing in this way is only necessary when passing parameters to those underlying
88-
`tskit` methods that expect integers. It is not needed to index into numeric arrays.
89-
_However_, when using arrays, very careful attention must be paid to the fact that
90-
`tskit` IDs start at zero, whereas R indexes start at one:
110+
You only need this coercion when calling underlying `tskit` methods that expect
111+
integer arguments. You do not need it for array indexing itself. However, when
112+
indexing `tskit` arrays in R, remember that `tskit` IDs start at zero, whereas
113+
R indexes start at one:
91114

92115
```{code-cell}
93116
root_id_int <- ts$first()$root
94117
is(root_id_int)
95118
root_id_num <- as.numeric(root_id_int)
96119
97120
# Using integer ID will work
98-
paste("Root time via tskit method:", ts$node(root_id_int)$time)
99-
# Using numeric ID will not work
100-
# paste("Root time via tskit method:", ts$node(root_id_num)$time)
121+
paste("Root time via tskit method using integer ID:", ts$node(root_id_int)$time)
122+
# Using numeric ID will raise TypeError
123+
try(paste("Root time via tskit method using float ID:", ts$node(root_id_num)$time))
101124
102125
# When indexing into tskit arrays in R, add 1 to the ID (integer or numeric)
103-
paste("Root time via array access:", ts$nodes_time[root_id_int + 1])
104-
paste("Root time via array access:", ts$nodes_time[root_id_num + 1])
126+
paste("Root time via integer array:", ts$nodes_time[root_id_int + 1])
127+
paste("Root time via float array:", ts$nodes_time[root_id_num + 1])
105128
```
106129

107130
## Analysis
108131

109-
From within R we can use `tskit`'s powerful
110-
[Statistics](https://tskit.dev/tskit/docs/stable/stats.html) framework to efficiently
111-
compute many different summary statistics from a tree sequence. To illustrate this,
112-
we'll first add some mutations to our tree sequence with the
113-
{func}`msprime:msprime.sim_mutations` function, and then compute the genetic diversity
114-
for each of the tree sequence's sample nodes:
132+
From within R, we can use `tskit`'s powerful
133+
[Statistics](https://tskit.dev/tskit/docs/stable/stats.html) framework to
134+
efficiently compute many summary statistics from a tree sequence. To illustrate
135+
this, we'll first add mutations with the {func}`msprime:msprime.sim_mutations`
136+
function and then compute genetic diversity:
115137

116138
```{code-cell}
117139
ts_mut <- msprime$sim_mutations(reduced_ts, rate=1e-4, random_seed=321)
118140
paste(ts_mut$num_mutations, "mutations, genetic diversity is", ts_mut$diversity())
119141
```
120142

121-
Numerical arrays and matrices work as expected. For instance, we can use the tree
122-
sequence {meth}`~TreeSequence.genotype_matrix()` method to return the genotypes of
123-
the tree sequence as a matrix object in R.
143+
Numerical arrays and matrices work as expected:
144+
they are converted to equivalent R objects.
145+
For instance, we can use the tree sequence {meth}`~TreeSequence.genotype_matrix()`
146+
method to return the genotypes of the tree sequence as a matrix object in R.
124147

125148
```{code-cell}
126149
G <- ts_mut$genotype_matrix()
@@ -137,8 +160,8 @@ allele_frequency
137160

138161
## Jupyter notebook tips
139162

140-
When running R within a [Jupyter notebook](https://jupyter.org), a few magic functions
141-
can be defined that allow tskit objects to be rendered within the notebook:
163+
When running R in a [Jupyter notebook](https://jupyter.org), you can define a
164+
few functions so that `tskit` objects render nicely in notebook output:
142165

143166
```{code-cell}
144167
# Define some magic functions to allow objects to be displayed in R Jupyter notebooks
@@ -161,16 +184,16 @@ ts_mut$draw_svg(y_axis=TRUE, y_ticks=0:10)
161184

162185
## Interaction with R libraries
163186

164-
R has a number of libraries to deal with genomic data and trees. Below we demo the
165-
phylogenetic tree representation defined in the the popular
166-
[ape](http://ape-package.ird.fr) package, taking all the trees
187+
R has a number of libraries for genomic data and trees. Below we show the
188+
phylogenetic tree representation from the popular
189+
[ape](http://ape-package.ird.fr) package, using all trees
167190
{meth}`exported in Nexus format<TreeSequence.write_nexus>`, or
168191
individual trees {meth}`exported in Newick format<Tree.as_newick>`:
169192

170193
```{code-cell}
171194
file <- tempfile()
172195
ts_mut$write_nexus(file)
173-
# Warning - ape trees are stored independently, so this will use much more memory than tskit
196+
# Warning: ape trees are stored independently, so this uses much more memory than tskit
174197
trees <- ape::read.nexus(file, force.multi=TRUE) # return a set of trees
175198
176199
# Or simply read in a single tree
@@ -180,10 +203,10 @@ tree <- ape::read.tree(text=ts_mut$first()$as_newick())
180203
plot(tree, direction="downward", srt=90, adj=0.5) # or equivalently use trees[[1]]
181204
```
182205

183-
Note that nodes are labelled with the prefix `n`, so that nodes `0`, `1`, `2`, ...
184-
become `n0`, `n1`, `n2`, ... This helps to avoid
185-
confusion between the the zero-based counting system used natively
186-
by `tskit`, and the one-based counting system used in `R`.
206+
Note that nodes are labelled with the prefix `n`, so nodes `0`, `1`, `2`, ...
207+
become `n0`, `n1`, `n2`, ... This helps avoidinng confusion between the
208+
zero-based counting system used natively by `tskit` and the one-based counting
209+
system used in `R`.
187210

188211
## Further information
189212

@@ -193,12 +216,12 @@ documentation, in particular on
193216
which includes important information on how R data types are converted to their
194217
equivalent Python types.
195218

196-
## Advanced usage through the tskit's C API
219+
## Advanced usage through tskit's C API
197220

198-
While the approach with `reticulate` is very powerful and will be useful for
199-
most analyses, it does have an overhead due to calling Python and conversion of objects.
200-
This overhead will be minimal for some use cases but could be significant for others.
201-
An alternative approach is to use the `tskit` C API from R,
202-
via the standard [R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) or
203-
via [Rcpp](https://www.rcpp.org),
204-
but these are advanced topics beyond the scope of this tutorial.
221+
While the `reticulate` approach is powerful and useful for most analyses, it
222+
does add overhead from Python calls and object conversion. This overhead may be
223+
minimal for some use cases but significant for others, such as repeatedly
224+
calling Python from an R loop.
225+
An alternative is to use `RcppTskit`, which provides access to the `tskit` C API
226+
(see <https://cran.r-project.org/package=RcppTskit/>), but this is an advanced
227+
topic beyond the scope of this tutorial.

0 commit comments

Comments
 (0)