@@ -19,62 +19,85 @@ kernelspec:
1919
2020# Tskit and R
2121
22- To interface with ` tskit ` in R, we can use the [ reticulate] ( https://rstudio.github.io/reticulate ) R package, which lets you call ` tskit ` 's Python API within an R session. In this tutorial, we'll go through a couple of examples to show you how to get started. You'll need to install ` reticulate ` in your R session via ` install.packages("reticulate") ` and at a minimum state the required Python packages via ` reticulate::py_require(c("tskit", "msprime")) ` . This setting will use ` reticulate ` 's isolated Python virtual environment, but you can also use an alternative Python environment as described at https://rstudio.github.io/reticulate .
22+ To interface with ` tskit ` in R, we can use the [ reticulate] ( https://rstudio.github.io/reticulate )
23+ package, which lets you call ` tskit ` 's Python API from an R session. In this
24+ tutorial, we walk through a few simple examples to help you get started.
25+
26+ First, install ` reticulate ` in R with ` install.packages("reticulate") ` . Then,
27+ install the required Python packages with
28+ ` reticulate::py_require(c("tskit", "msprime")) ` .
29+
30+ By default, this uses ` reticulate ` 's isolated Python virtual environment, but
31+ you can also use another Python environment as described at
32+ < https://rstudio.github.io/reticulate/ > .
33+
34+ ```
35+ install.packages("reticulate")
36+ reticulate::py_require(c("tskit", "msprime"))
37+ ```
38+
39+ ::::{margin}
40+ :::{note}
41+ The [ slendr] ( https://www.slendr.net ) R package uses ` reticulate ` extensively to work with tree sequences in R and also provides R wrappers for some ` tskit ` functions.
42+ :::
43+ ::::
2344
2445::::{margin}
2546:::{note}
26- The [ slendr] ( https://www.slendr.net ) R package uses ` reticulate ` extensively to work with tree sequences in R and also provides R wrappers for some ` tskit ` functions.
47+ The [ RcppTskit] ( https://cran.r-project.org/package=RcppTskit ) R package provides
48+ access to ` tskit ` 's C API for advanced use cases and also shows how to interface
49+ with ` reticulate ` Python.
2750:::
2851::::
2952
30- We'll begin by simulating a small tree sequence using ` msprime ` .
53+ We'll begin by simulating a small tree sequence
54+ by calling ` msprime$sim_ancestry() ` from R.
3155
3256``` {code-cell}
3357msprime <- reticulate::import("msprime")
3458ts <- msprime$sim_ancestry(80, sequence_length=1e4, recombination_rate=1e-4, random_seed=42)
35- ts # See "Jupyter notebook tips", below for how to render this nicely
59+ ts # See "Jupyter notebook tips" below for nicer rendering
3660```
3761
3862## Attributes and methods
3963
40- ` reticulate ` allows us to access a Python object's attributes via
41- the ` $ ` operator. For example, we can access (and assign to a native R object)
42- the number of samples in the tree sequence :
64+ ` reticulate ` lets us access a Python object's attributes with the ` $ ` operator.
65+ For example, we can access the number of samples in the tree sequence and
66+ store it into an R object :
4367
4468``` {code-cell}
4569n <- ts$num_samples
4670n
4771is(n) # Show the type of the object
4872```
4973
50- The ` $ ` operator can also be used to call methods, for example, the
51- {meth}` ~TreeSequence.simplify ` method associated with the tree sequence.
52- The method parameters are given as native R objects
53- (but note that object IDs still use tskit's 0-based indexing system).
74+ The ` $ ` operator can also be used to call a Python object's methods, for example
75+ the {meth}` ~TreeSequence.simplify ` method for a tree sequence.
76+ Method arguments are given as native R objects and then passed to ` reticulate ` Python
77+ (but note that object IDs still use ` tskit ` 's 0-based indexing system).
5478
5579``` {code-cell}
56- reduced_ts <- ts$simplify(0:7) # only keep samples with ids 0, 1, 2, 3, 4, 5, 6, 7
57- reduced_ts <- reduced_ts$delete_intervals(list(c(6000, 10000))) # delete data after 6kb
80+ reduced_ts <- ts$simplify(0:7) # only keep samples with IDs 0, 1, 2, 3, 4, 5, 6, 7
81+ reduced_ts <- reduced_ts$delete_intervals(list(c(6000, 10000))) # delete data after 6 kb
5882reduced_ts <- reduced_ts$trim() # remove the deleted region
5983paste(
6084 "Reduced from", ts$num_trees, "trees over", ts$sequence_length/1e3, "kb to",
6185 reduced_ts$num_trees, "trees over", reduced_ts$sequence_length/1e3, "kb.")
6286```
6387
64- ## IDs and indexes
88+ ## IDs and indexing
6589
66- Note that if a bare digit is provided to one of these methods, it will be treated as a
67- floating point number (numeric in R). This is useful to know when calling ` tskit ` methods that
68- require integers (such as object IDs). For example, the following will not work:
90+ If you pass a bare number to one of these methods, R treats it as a floating
91+ point value (the default ` numeric ` type in R). This matters when calling
92+ ` tskit ` methods that require integers (such as object IDs). For example, the
93+ following raises a ` TypeError ` because ` 0 ` is passed as a float:
6994
7095``` {code-cell}
71- :tags: [raises-exception, remove-output]
72- ts$node(0) # Will raise a TypeError
96+ try(ts$node(0)) # Will raise a TypeError
7397is(0)
7498```
7599
76- In this case, to force the ` 0 ` to be passed as an integer, you can either coerce it
77- using ` as.integer ` or simply append the letter ` L ` :
100+ To pass ` 0 ` as an integer, either coerce it with ` as.integer ` or append ` L ` :
78101
79102``` {code-cell}
80103is(as.integer(0))
@@ -84,43 +107,43 @@ is(0L)
84107ts$node(0L)
85108```
86109
87- Coercing in this way is only necessary when passing parameters to those underlying
88- ` tskit ` methods that expect integers. It is not needed to index into numeric arrays.
89- _ However _ , when using arrays, very careful attention must be paid to the fact that
90- ` tskit ` IDs start at zero, whereas R indexes start at one:
110+ You only need this coercion when calling underlying ` tskit ` methods that expect
111+ integer arguments. You do not need it for array indexing itself. However, when
112+ indexing ` tskit ` arrays in R, remember that ` tskit ` IDs start at zero, whereas
113+ R indexes start at one:
91114
92115``` {code-cell}
93116root_id_int <- ts$first()$root
94117is(root_id_int)
95118root_id_num <- as.numeric(root_id_int)
96119
97120# Using integer ID will work
98- paste("Root time via tskit method:", ts$node(root_id_int)$time)
99- # Using numeric ID will not work
100- # paste("Root time via tskit method:", ts$node(root_id_num)$time)
121+ paste("Root time via tskit method using integer ID :", ts$node(root_id_int)$time)
122+ # Using numeric ID will raise TypeError
123+ try( paste("Root time via tskit method using float ID :", ts$node(root_id_num)$time) )
101124
102125# When indexing into tskit arrays in R, add 1 to the ID (integer or numeric)
103- paste("Root time via array access :", ts$nodes_time[root_id_int + 1])
104- paste("Root time via array access :", ts$nodes_time[root_id_num + 1])
126+ paste("Root time via integer array :", ts$nodes_time[root_id_int + 1])
127+ paste("Root time via float array :", ts$nodes_time[root_id_num + 1])
105128```
106129
107130## Analysis
108131
109- From within R we can use ` tskit ` 's powerful
110- [ Statistics] ( https://tskit.dev/tskit/docs/stable/stats.html ) framework to efficiently
111- compute many different summary statistics from a tree sequence. To illustrate this,
112- we'll first add some mutations to our tree sequence with the
113- {func}` msprime:msprime.sim_mutations ` function, and then compute the genetic diversity
114- for each of the tree sequence's sample nodes:
132+ From within R, we can use ` tskit ` 's powerful
133+ [ Statistics] ( https://tskit.dev/tskit/docs/stable/stats.html ) framework to
134+ efficiently compute many summary statistics from a tree sequence. To illustrate
135+ this, we'll first add mutations with the {func}` msprime:msprime.sim_mutations `
136+ function and then compute genetic diversity:
115137
116138``` {code-cell}
117139ts_mut <- msprime$sim_mutations(reduced_ts, rate=1e-4, random_seed=321)
118140paste(ts_mut$num_mutations, "mutations, genetic diversity is", ts_mut$diversity())
119141```
120142
121- Numerical arrays and matrices work as expected. For instance, we can use the tree
122- sequence {meth}` ~TreeSequence.genotype_matrix() ` method to return the genotypes of
123- the tree sequence as a matrix object in R.
143+ Numerical arrays and matrices work as expected:
144+ they are converted to equivalent R objects.
145+ For instance, we can use the tree sequence {meth}` ~TreeSequence.genotype_matrix() `
146+ method to return the genotypes of the tree sequence as a matrix object in R.
124147
125148``` {code-cell}
126149G <- ts_mut$genotype_matrix()
@@ -137,8 +160,8 @@ allele_frequency
137160
138161## Jupyter notebook tips
139162
140- When running R within a [ Jupyter notebook] ( https://jupyter.org ) , a few magic functions
141- can be defined that allow tskit objects to be rendered within the notebook:
163+ When running R in a [ Jupyter notebook] ( https://jupyter.org ) , you can define a
164+ few functions so that ` tskit ` objects render nicely in notebook output :
142165
143166``` {code-cell}
144167# Define some magic functions to allow objects to be displayed in R Jupyter notebooks
@@ -161,16 +184,16 @@ ts_mut$draw_svg(y_axis=TRUE, y_ticks=0:10)
161184
162185## Interaction with R libraries
163186
164- R has a number of libraries to deal with genomic data and trees. Below we demo the
165- phylogenetic tree representation defined in the the popular
166- [ ape] ( http://ape-package.ird.fr ) package, taking all the trees
187+ R has a number of libraries for genomic data and trees. Below we show the
188+ phylogenetic tree representation from the popular
189+ [ ape] ( http://ape-package.ird.fr ) package, using all trees
167190{meth}` exported in Nexus format<TreeSequence.write_nexus> ` , or
168191individual trees {meth}` exported in Newick format<Tree.as_newick> ` :
169192
170193``` {code-cell}
171194file <- tempfile()
172195ts_mut$write_nexus(file)
173- # Warning - ape trees are stored independently, so this will use much more memory than tskit
196+ # Warning: ape trees are stored independently, so this uses much more memory than tskit
174197trees <- ape::read.nexus(file, force.multi=TRUE) # return a set of trees
175198
176199# Or simply read in a single tree
@@ -180,10 +203,10 @@ tree <- ape::read.tree(text=ts_mut$first()$as_newick())
180203plot(tree, direction="downward", srt=90, adj=0.5) # or equivalently use trees[[1]]
181204```
182205
183- Note that nodes are labelled with the prefix ` n ` , so that nodes ` 0 ` , ` 1 ` , ` 2 ` , ...
184- become ` n0 ` , ` n1 ` , ` n2 ` , ... This helps to avoid
185- confusion between the the zero-based counting system used natively
186- by ` tskit ` , and the one-based counting system used in ` R ` .
206+ Note that nodes are labelled with the prefix ` n ` , so nodes ` 0 ` , ` 1 ` , ` 2 ` , ...
207+ become ` n0 ` , ` n1 ` , ` n2 ` , ... This helps avoidinng confusion between the
208+ zero-based counting system used natively by ` tskit ` and the one-based counting
209+ system used in ` R ` .
187210
188211## Further information
189212
@@ -193,12 +216,12 @@ documentation, in particular on
193216which includes important information on how R data types are converted to their
194217equivalent Python types.
195218
196- ## Advanced usage through the tskit's C API
219+ ## Advanced usage through tskit's C API
197220
198- While the approach with ` reticulate ` is very powerful and will be useful for
199- most analyses, it does have an overhead due to calling Python and conversion of objects.
200- This overhead will be minimal for some use cases but could be significant for others.
201- An alternative approach is to use the ` tskit ` C API from R,
202- via the standard [ R Extensions ] ( https://cran.r-project.org/doc/manuals/R-exts.html ) or
203- via [ Rcpp ] ( https://www.rcpp .org ) ,
204- but these are advanced topics beyond the scope of this tutorial.
221+ While the ` reticulate ` approach is powerful and useful for most analyses, it
222+ does add overhead from Python calls and object conversion. This overhead may be
223+ minimal for some use cases but significant for others, such as repeatedly
224+ calling Python from an R loop.
225+ An alternative is to use ` RcppTskit ` , which provides access to the ` tskit ` C API
226+ (see < https://cran.r-project .org/package=RcppTskit/ > ), but this is an advanced
227+ topic beyond the scope of this tutorial.
0 commit comments