Add BVBRC statistics R Markdown document#13
Conversation
jananiravi
left a comment
There was a problem hiding this comment.
needs the following to serve as a vignette
- to reside in
vignettes/ - call specific tidyverse packages (bioc requirement)
- check path for bvbrc_data.tsv & bvbrc vs. bvbrc_clean
- use of named r chunks
- text description between chunks
- missing YAML
- documentation
cc: @AbhirupaGhosh @eboyer221
jananiravi
left a comment
There was a problem hiding this comment.
needs annotation (text around code chunks)
eboyer221
left a comment
There was a problem hiding this comment.
Suggested code changes to format the vignette shown in review.
Additional comment : The Rmd should live at vignettes/BVBRC_stats.Rmd rather than doc/ (R packages auto-generated doc/ from vignettes/ at build time).
| ```{r setup, include=FALSE} | ||
| library(tidyverse) | ||
|
|
||
| bvbrc <- read.table("bvbrc_data.tsv", sep = "\t", header = TRUE, fill = TRUE, |
There was a problem hiding this comment.
where is this file , 'bvbrc_data.tsv', coming from? does the user export it from BV-BRC? Right now I don't think the Rmd will work for anyone without that file.
| --- | ||
| title: "bvbrc_stats" | ||
| author: "Abhirupa" | ||
| date: "`r Sys.Date()`" | ||
| output: html_document | ||
| --- |
There was a problem hiding this comment.
| --- | |
| title: "bvbrc_stats" | |
| author: "Abhirupa" | |
| date: "`r Sys.Date()`" | |
| output: html_document | |
| --- | |
| --- | |
| title: "BV-BRC Metadata Statistics" | |
| author: "Abhirupa Ghosh" | |
| date: "`r Sys.Date()`" | |
| output: rmarkdown::html_vignette | |
| vignette: > | |
| %\VignetteIndexEntry{BV-BRC Metadata Statistics} | |
| %\VignetteEngine{knitr::rmarkdown} | |
| %\VignetteEncoding{UTF-8} | |
| --- |
Bioconductor vignette-style header with the vignette: block so R recognizes it as a package vignette.
| --- | ||
|
|
||
| ```{r setup, include=FALSE} | ||
| library(tidyverse) |
There was a problem hiding this comment.
| library(tidyverse) | |
| library(dplyr) | |
| library(stringr) | |
| library(purrr) | |
| library(tibble) |
Bioconductor does not want library(tidyverse) import if we can just import the parts that are actually used.
| col_stats <- tibble(column = colnames(bvbrc)) %>% | ||
| mutate( | ||
| n_total = nrow(bvbrc), | ||
| n_missing = map_int(column, ~ sum(is.na(bvbrc[[.x]]))), | ||
| n_present = n_total - n_missing, | ||
| pct_missing = 100 * n_missing / n_total, | ||
| n_distinct_non_missing = map_int(column, ~ n_distinct(bvbrc[[.x]], na.rm = TRUE)) | ||
| ) %>% | ||
| arrange(desc(pct_missing), column) |
There was a problem hiding this comment.
| col_stats <- tibble(column = colnames(bvbrc)) %>% | |
| mutate( | |
| n_total = nrow(bvbrc), | |
| n_missing = map_int(column, ~ sum(is.na(bvbrc[[.x]]))), | |
| n_present = n_total - n_missing, | |
| pct_missing = 100 * n_missing / n_total, | |
| n_distinct_non_missing = map_int(column, ~ n_distinct(bvbrc[[.x]], na.rm = TRUE)) | |
| ) %>% | |
| arrange(desc(pct_missing), column) | |
| col_stats <- tibble(column = colnames(bvbrc_clean)) %>% | |
| mutate( | |
| n_total = nrow(bvbrc_clean), | |
| n_missing = map_int(column, ~ sum(is.na(bvbrc_clean[[.x]]))), | |
| n_present = n_total - n_missing, | |
| pct_missing = 100 * n_missing / n_total, | |
| n_distinct_non_missing = map_int(column, ~ n_distinct(bvbrc_clean[[.x]], na.rm = TRUE)) | |
| ) %>% | |
| arrange(desc(pct_missing), column) |
currently, col_stats is using bvbrc, but the stats below use bvbrc_clean so that the missing values are accounted for. Suggesting changing from bvbrc -> bvbrc_clean here so that it is consistent if that is appropriate.
| arrange(desc(pct_missing), column) | ||
| ``` | ||
|
|
||
| ```{r} |
There was a problem hiding this comment.
| ```{r} | |
| ``````{r col_stats_output} |
| # --- Outputs ----------------------------------------------------------------- | ||
| cat("\n--- Per-column stats (whole table) ---\n"); print(col_stats, n = Inf) | ||
| ``` | ||
| ```{r} |
There was a problem hiding this comment.
| ```{r} | |
| ``````{r host_block_output} |
| ```{r} | ||
| cat("\n--- Host block stats ---\n"); print(host_block_stats) | ||
| ``` | ||
| ```{r} |
There was a problem hiding this comment.
| ```{r} | |
| ``````{r geo_block_output} |
| ```{r} | ||
| cat("\n--- Geographic block stats ---\n"); print(geo_block_stats) | ||
| ``` | ||
| ```{r} |
There was a problem hiding this comment.
| ```{r} | |
| ``````{r diag_output} |
| ) | ||
| ``` | ||
|
|
||
| ```{r functions} |
There was a problem hiding this comment.
Each of these code chunks needs a paragraph saying what it computes and why. Need to add a paragraph before this functions chunk, before stats (line 84), and before each of the four output chunks (lines 134-145).
Description
What kind of change(s) are included?
Checklist
Please ensure that all boxes are checked before indicating that this pull request is ready for review.