Skip to content

Commit 24db011

Browse files
authored
Merge pull request #235 from hyanwong/ARG-traversal
Add ARG traversal examples
2 parents cfe1ec3 + 82eb1ef commit 24db011

1 file changed

Lines changed: 78 additions & 0 deletions

File tree

args.md

Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -316,6 +316,84 @@ structures for simulation or inference is therefore infeasible.
316316

317317
## Working with the ARG
318318

319+
All the normal tskit functions can be used to analyse an ARG stored in tskit form. However, some
320+
operations are naturally though of in terms of the tree sequence as a graph.
321+
322+
### Graph traversal
323+
324+
The standard edge iterator, {meth}`TreeSequence.edge_diffs()`, goes from left to
325+
right along the genome, matching the {meth}`TreeSequence.trees()` iterator. This
326+
means that unlike most conventional graph traversal methods, the returned edges
327+
are *not* necessarily grouped by node ID (either
328+
the edge's parent node or the edge's child node).
329+
330+
To traverse the graph by node ID, the {meth}`TreeSequence.nodes()` iterator can be
331+
used. In particular, because parents are required to be strictly older than their
332+
children, iterating through nodes using `order="timeasc"` will ensure that children
333+
are always visited before their parents (similar to a breadth-first or "level order"
334+
search). However, using {meth}`TreeSequence.nodes()` is inefficient if you also
335+
want to access the *edges* associated with each node.
336+
337+
The examples below show how to efficiently traverse the connected nodes in a tree
338+
sequence graph, visiting each
339+
only once, ensuring children are visited before parents (or vice versa), while
340+
simultaneously giving access to the edges associated with each node.
341+
342+
#### Traversing parent nodes
343+
344+
The most efficient graph traversal method visits all the parent nodes in the
345+
tree sequence, grouping edges for which that node is a parent. This is simple
346+
because edges in a tree sequence are {ref}`ordered<tskit:sec_edge_requirements>`
347+
firstly by the time of the parent node, then by node ID.
348+
349+
```{code-cell}
350+
import itertools
351+
import operator
352+
def edges_by_parent_timeasc(ts):
353+
return itertools.groupby(ts.edges(), operator.attrgetter("parent"))
354+
355+
for parent_id, edges in edges_by_parent_timeasc(ts):
356+
t = ts.node(parent_id).time
357+
children = {e.child for e in edges}
358+
print(f"Node {parent_id} at time {t} has these child node IDs: {children}")
359+
```
360+
361+
This visits children before parents. To visit parents before children, you can
362+
simply use ``reversed(ts.edges())`` rather than ``ts.edges()`` within the
363+
``groupby`` function. Note that terminal nodes (i.e. which are either isolated
364+
or leaf nodes in all local trees) are not visited by this function: therefore
365+
the method above will omit all the terminal sample nodes.
366+
367+
#### Traversing child nodes
368+
369+
Sometimes you may wish to iterate over all the edges for which a node is a
370+
child (rather than a parent). This can be done by sorting the edges by
371+
child node time (then child node id). This is a little slower, but can be
372+
done relatively efficiently as follows:
373+
374+
```{code-cell}
375+
import itertools
376+
import operator
377+
import numpy as np
378+
def edges_by_child_timeasc(ts):
379+
# edges sorted by child node time, then child node id using np.lexsort
380+
it = (ts.edge(u) for u in np.lexsort((ts.edges_child, ts.nodes_time[ts.edges_child])))
381+
return itertools.groupby(it, operator.attrgetter("child"))
382+
383+
for child_id, edges in edges_by_child_timeasc(ts):
384+
t = ts.node(child_id).time
385+
parents = {e.parent for e in edges}
386+
print(f"Node {child_id} at time {t} has these parent node IDs: {parents}")
387+
```
388+
389+
To visit parents before children, the ``lexsort`` can take the negated
390+
``-ts.nodes_time`` rather than simply using ``ts.nodes_time``. Note that nodes
391+
which are never children of an edge are not visited by this algorithm. Such
392+
nodes are either {ref}`isolated<sec_data_model_tree_isolated_nodes>` or a
393+
{ref}`root<sec_data_model_tree_roots> in each local tree.
394+
395+
### Misc
396+
319397
:::{todo}
320398
Add extra content as per [https://github.com/tskit-dev/tutorials/issues/43](https://github.com/tskit-dev/tutorials/issues/43)
321399
:::

0 commit comments

Comments
 (0)