@@ -19,29 +19,29 @@ kernelspec:
1919
2020# Tskit and R
2121
22- To interface with ` tskit ` in R, we can use the [ reticulate] ( https://rstudio.github.io/reticulate/ ) R package, which lets you call Python functions within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. If you haven't done so already, you' ll need to install ` reticulate ` in your R session via ` install.packages("reticulate") ` .
22+ To interface with ` tskit ` in R, we can use the [ reticulate] ( https://rstudio.github.io/reticulate ) R package, which lets you call ` tskit ` 's Python API within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. You' ll need to install ` reticulate ` in your R session via ` install.packages("reticulate") ` and at a minimum state the required Python packages via ` reticulate::py_require(c("tskit", "msprime")) ` . This setting will use ` reticulate ` 's isolated Python virtual environment, but you can also use an alternative Python environment as described at https://rstudio.github.io/reticulate .
2323
2424We'll begin by simulating a small tree sequence using ` msprime ` .
2525
2626``` {code-cell}
2727msprime <- reticulate::import("msprime")
28-
2928ts <- msprime$sim_ancestry(80, sequence_length=1e4, recombination_rate=1e-4, random_seed=42)
3029ts # See "Jupyter notebook tips", below for how to render this nicely
3130```
3231
3332## Attributes and methods
3433
3534` reticulate ` allows us to access a Python object's attributes via
36- the ` $ ` operator. For example, we can access (and assign to a variable) the number of
37- samples in the tree sequence:
35+ the ` $ ` operator. For example, we can access (and assign to a native R object)
36+ the number of samples in the tree sequence:
3837
3938``` {code-cell}
4039n <- ts$num_samples
4140n
41+ is(n) # Show the type of the object
4242```
4343
44- The ` $ ` operator can also be used to call methods, for example, the
44+ The ` $ ` operator can also be used to call methods, for example, the
4545{meth}` ~TreeSequence.simplify ` method associated with the tree sequence.
4646The method parameters are given as native R objects
4747(but note that object IDs still use tskit's 0-based indexing system).
@@ -58,33 +58,44 @@ paste(
5858### IDs and indexes
5959
6060Note that if a bare digit is provided to one of these methods, it will be treated as a
61- floating point number. This is useful to know when calling ` tskit ` methods that
62- require integers (e.g. object IDs). For example, the following will not work:
61+ floating point number (numeric in R) . This is useful to know when calling ` tskit ` methods that
62+ require integers (such as object IDs). For example, the following will not work:
6363
6464``` {code-cell}
6565:tags: [raises-exception, remove-output]
66- ts$node(0) # Will raise an error
66+ ts$node(0) # Will raise a TypeError
67+ is(0)
6768```
6869
6970In this case, to force the ` 0 ` to be passed as an integer, you can either coerce it
70- using ` as.integer ` or simply prepend the letter ` L ` :
71+ using ` as.integer ` or simply append the letter ` L ` :
7172
7273``` {code-cell}
74+ is(as.integer(0))
7375ts$node(as.integer(0))
7476# or
77+ is(0L)
7578ts$node(0L)
7679```
7780
7881Coercing in this way is only necessary when passing parameters to those underlying
79- ` tskit ` methods that expect integers. It is not needed e.g. to index into numeric arrays.
82+ ` tskit ` methods that expect integers. It is not needed to index into numeric arrays.
8083_ However_ , when using arrays, very careful attention must be paid to the fact that
8184` tskit ` IDs start at zero, whereas R indexes start at one:
8285
8386``` {code-cell}
84- root_id <- ts$first()$root
85- paste("Root time via tskit method:", ts$node(root_id)$time)
86- # When indexing into tskit arrays in R, add 1 to the ID
87- paste("Root time via array access:", ts$nodes_time[root_id + 1])
87+ root_id_int <- ts$first()$root
88+ is(root_id_int)
89+ root_id_num <- as.numeric(root_id_int)
90+
91+ # Using integer ID will work
92+ paste("Root time via tskit method:", ts$node(root_id_int)$time)
93+ # Using numeric ID will not work
94+ # paste("Root time via tskit method:", ts$node(root_id_num)$time)
95+
96+ # When indexing into tskit arrays in R, add 1 to the ID (integer or numeric)
97+ paste("Root time via array access:", ts$nodes_time[root_id_int + 1])
98+ paste("Root time via array access:", ts$nodes_time[root_id_num + 1])
8899```
89100
90101## Analysis
@@ -98,7 +109,6 @@ for each of the tree sequence's sample nodes:
98109
99110``` {code-cell}
100111ts_mut = msprime$sim_mutations(reduced_ts, rate=1e-4, random_seed=321)
101-
102112paste(ts_mut$num_mutations, "mutations, genetic diversity is", ts_mut$diversity())
103113```
104114
@@ -109,6 +119,7 @@ the tree sequence as a matrix object in R.
109119``` {code-cell}
110120G = ts_mut$genotype_matrix()
111121G
122+ is(G)
112123```
113124
114125We can then use R functions directly on the genotype matrix:
@@ -142,10 +153,9 @@ It also allows trees and tree sequences to be plotted inline:
142153ts_mut$draw_svg(y_axis=TRUE, y_ticks=0:10)
143154```
144155
145-
146156## Interaction with R libraries
147157
148- R has a number of libraries to deal with genomic data and trees. Below we focus on the
158+ R has a number of libraries to deal with genomic data and trees. Below we demo the
149159phylogenetic tree representation defined in the the popular
150160[ ape] ( http://ape-package.ird.fr ) package, taking all the trees
151161{meth}` exported in Nexus format<TreeSequence.write_nexus> ` , or
@@ -165,14 +175,24 @@ plot(tree, direction="downward", srt=90, adj=0.5) # or equivalently use trees[[
165175```
166176
167177Note that nodes are labelled with the prefix ` n ` , so that nodes ` 0 ` , ` 1 ` , ` 2 ` , ...
168- become ` n0 ` , ` n1 ` , ` n2 ` ... etc . This helps to avoid
178+ become ` n0 ` , ` n1 ` , ` n2 ` , ... This helps to avoid
169179confusion between the the zero-based counting system used natively
170180by ` tskit ` , and the one-based counting system used in ` R ` .
171181
172182## Further information
173183
174- Be sure to check out the [ reticulate] ( https://rstudio.github.io/reticulate/ )
184+ Be sure to check out the [ reticulate] ( https://rstudio.github.io/reticulate )
175185documentation, in particular on
176186[ Calling Python from R] ( https://rstudio.github.io/reticulate/articles/calling_python.html ) ,
177187which includes important information on how R data types are converted to their
178- equivalent Python types.
188+ equivalent Python types.
189+
190+ ## Advanced usage through the tskit's C API
191+
192+ While the approach with ` reticulate ` is very powerful and will be useful for
193+ most analyses, it does have an overhead due to calling Python and conversion of objects.
194+ This overhead will be minimal for some use cases but could be significant for others.
195+ An alternative approach is to use the ` tskit ` C API from R,
196+ via the standard [ R Extensions] ( https://cran.r-project.org/doc/manuals/R-exts.html ) or
197+ via [ Rcpp] ( https://www.rcpp.org ) ,
198+ but these are advanced topics beyond the scope of this tutorial.
0 commit comments