Skip to content

Commit 9915dfb

Browse files
gregorgorjancbenjeffery
authored andcommitted
Polish and mention C/C++ API
1 parent 6db9629 commit 9915dfb

2 files changed

Lines changed: 44 additions & 24 deletions

File tree

README.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -17,12 +17,12 @@ under construction.
1717

1818
# Requirements
1919

20-
Install the Python requirements from requirements.txt:
20+
Install the Python requirements from `requirements.txt`:
2121
```
2222
$ python -m pip install -r requirements.txt
2323
```
2424

25-
You will also need a working R installation with reticulate and irkernel installed.
25+
You will also need a working R installation with `reticulate` and `irkernel` installed.
2626
This command should do the trick:
2727
```
2828
$ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
@@ -35,8 +35,8 @@ $ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
3535
[jupytext](https://github.com/mwouts/jupytext) to convert the notebook into
3636
the right format.
3737
- To build locally, run ``make``. The output tells you where to find the
38-
built HTML.
39-
- Pages rendered at https://tskit.dev/tutorials
38+
built `HTML` file(s).
39+
- Pages rendered at https://tskit.dev/tutorials.
4040
- Pages might take a while to be updated after a new tutorial is merged.
4141

4242
If you have an idea for a tutorial, please open an issue to discuss.

tskitr.md

Lines changed: 40 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -19,29 +19,29 @@ kernelspec:
1919

2020
# Tskit and R
2121

22-
To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate/) R package, which lets you call Python functions within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. If you haven't done so already, you'll need to install `reticulate` in your R session via `install.packages("reticulate")`.
22+
To interface with `tskit` in R, we can use the [reticulate](https://rstudio.github.io/reticulate) R package, which lets you call `tskit`'s Python API within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. You'll need to install `reticulate` in your R session via `install.packages("reticulate")` and at a minimum state the required Python packages via `reticulate::py_require(c("tskit", "msprime"))`. This setting will use `reticulate`'s isolated Python virtual environment, but you can also use an alternative Python environment as described at https://rstudio.github.io/reticulate.
2323

2424
We'll begin by simulating a small tree sequence using `msprime`.
2525

2626
```{code-cell}
2727
msprime <- reticulate::import("msprime")
28-
2928
ts <- msprime$sim_ancestry(80, sequence_length=1e4, recombination_rate=1e-4, random_seed=42)
3029
ts # See "Jupyter notebook tips", below for how to render this nicely
3130
```
3231

3332
## Attributes and methods
3433

3534
`reticulate` allows us to access a Python object's attributes via
36-
the `$` operator. For example, we can access (and assign to a variable) the number of
37-
samples in the tree sequence:
35+
the `$` operator. For example, we can access (and assign to a native R object)
36+
the number of samples in the tree sequence:
3837

3938
```{code-cell}
4039
n <- ts$num_samples
4140
n
41+
is(n) # Show the type of the object
4242
```
4343

44-
The `$` operator can also be used to call methods, for example, the
44+
The `$` operator can also be used to call methods, for example, the
4545
{meth}`~TreeSequence.simplify` method associated with the tree sequence.
4646
The method parameters are given as native R objects
4747
(but note that object IDs still use tskit's 0-based indexing system).
@@ -58,33 +58,44 @@ paste(
5858
### IDs and indexes
5959

6060
Note that if a bare digit is provided to one of these methods, it will be treated as a
61-
floating point number. This is useful to know when calling `tskit` methods that
62-
require integers (e.g. object IDs). For example, the following will not work:
61+
floating point number (numeric in R). This is useful to know when calling `tskit` methods that
62+
require integers (such as object IDs). For example, the following will not work:
6363

6464
```{code-cell}
6565
:tags: [raises-exception, remove-output]
66-
ts$node(0) # Will raise an error
66+
ts$node(0) # Will raise a TypeError
67+
is(0)
6768
```
6869

6970
In this case, to force the `0` to be passed as an integer, you can either coerce it
70-
using `as.integer` or simply prepend the letter `L`:
71+
using `as.integer` or simply append the letter `L`:
7172

7273
```{code-cell}
74+
is(as.integer(0))
7375
ts$node(as.integer(0))
7476
# or
77+
is(0L)
7578
ts$node(0L)
7679
```
7780

7881
Coercing in this way is only necessary when passing parameters to those underlying
79-
`tskit` methods that expect integers. It is not needed e.g. to index into numeric arrays.
82+
`tskit` methods that expect integers. It is not needed to index into numeric arrays.
8083
_However_, when using arrays, very careful attention must be paid to the fact that
8184
`tskit` IDs start at zero, whereas R indexes start at one:
8285

8386
```{code-cell}
84-
root_id <- ts$first()$root
85-
paste("Root time via tskit method:", ts$node(root_id)$time)
86-
# When indexing into tskit arrays in R, add 1 to the ID
87-
paste("Root time via array access:", ts$nodes_time[root_id + 1])
87+
root_id_int <- ts$first()$root
88+
is(root_id_int)
89+
root_id_num <- as.numeric(root_id_int)
90+
91+
# Using integer ID will work
92+
paste("Root time via tskit method:", ts$node(root_id_int)$time)
93+
# Using numeric ID will not work
94+
# paste("Root time via tskit method:", ts$node(root_id_num)$time)
95+
96+
# When indexing into tskit arrays in R, add 1 to the ID (integer or numeric)
97+
paste("Root time via array access:", ts$nodes_time[root_id_int + 1])
98+
paste("Root time via array access:", ts$nodes_time[root_id_num + 1])
8899
```
89100

90101
## Analysis
@@ -98,7 +109,6 @@ for each of the tree sequence's sample nodes:
98109

99110
```{code-cell}
100111
ts_mut = msprime$sim_mutations(reduced_ts, rate=1e-4, random_seed=321)
101-
102112
paste(ts_mut$num_mutations, "mutations, genetic diversity is", ts_mut$diversity())
103113
```
104114

@@ -109,6 +119,7 @@ the tree sequence as a matrix object in R.
109119
```{code-cell}
110120
G = ts_mut$genotype_matrix()
111121
G
122+
is(G)
112123
```
113124

114125
We can then use R functions directly on the genotype matrix:
@@ -142,10 +153,9 @@ It also allows trees and tree sequences to be plotted inline:
142153
ts_mut$draw_svg(y_axis=TRUE, y_ticks=0:10)
143154
```
144155

145-
146156
## Interaction with R libraries
147157

148-
R has a number of libraries to deal with genomic data and trees. Below we focus on the
158+
R has a number of libraries to deal with genomic data and trees. Below we demo the
149159
phylogenetic tree representation defined in the the popular
150160
[ape](http://ape-package.ird.fr) package, taking all the trees
151161
{meth}`exported in Nexus format<TreeSequence.write_nexus>`, or
@@ -165,14 +175,24 @@ plot(tree, direction="downward", srt=90, adj=0.5) # or equivalently use trees[[
165175
```
166176

167177
Note that nodes are labelled with the prefix `n`, so that nodes `0`, `1`, `2`, ...
168-
become `n0`, `n1`, `n2` ... etc. This helps to avoid
178+
become `n0`, `n1`, `n2`, ... This helps to avoid
169179
confusion between the the zero-based counting system used natively
170180
by `tskit`, and the one-based counting system used in `R`.
171181

172182
## Further information
173183

174-
Be sure to check out the [reticulate](https://rstudio.github.io/reticulate/)
184+
Be sure to check out the [reticulate](https://rstudio.github.io/reticulate)
175185
documentation, in particular on
176186
[Calling Python from R](https://rstudio.github.io/reticulate/articles/calling_python.html),
177187
which includes important information on how R data types are converted to their
178-
equivalent Python types.
188+
equivalent Python types.
189+
190+
## Advanced usage through the tskit's C API
191+
192+
While the approach with `reticulate` is very powerful and will be useful for
193+
most analyses, it does have an overhead due to calling Python and conversion of objects.
194+
This overhead will be minimal for some use cases but could be significant for others.
195+
An alternative approach is to use the `tskit` C API from R,
196+
via the standard [R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) or
197+
via [Rcpp](https://www.rcpp.org),
198+
but these are advanced topics beyond the scope of this tutorial.

0 commit comments

Comments
 (0)