NLPstudio is an R package for scalable text analysis in research workflows. It is built around quanteda, data.table, and portable parallel backends, and pays particular attention to reproducible social science work on financial disclosures, regulatory filings, and other structured document collections.
The package has two main workflows:
- Corpus preparation and document-level text analysis, from SEC-style JSON files to quanteda corpora, tokens, dictionaries, readability, similarity, and export-ready tables.
- A consistent topic-model API for fitting, adopting, evaluating, selecting, diagnosing, summarizing, and exporting topic models across supported R backends.
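To make the first workflow more concrete, here is a minimal sketch of one document-level step: counting dictionary hits per document and turning them into a plain table. It uses quanteda directly rather than any NLPstudio helper, and the toy documents and the two-category dictionary are invented for illustration only.

```r
library(quanteda)

# Toy documents standing in for parsed filings (illustrative only).
docs <- data.frame(
  doc_id = c("a", "b"),
  text = c(
    "Revenue growth improved while liquidity risk declined.",
    "Debt covenants increased refinancing risk."
  )
)

corp <- corpus(docs, text_field = "text", docid_field = "doc_id")
toks <- tokens_tolower(tokens(corp, remove_punct = TRUE))

# Hypothetical dictionary; the categories and entries are made up for this sketch.
dict <- dictionary(list(
  performance = c("revenue", "growth"),
  risk = c("risk", "debt", "liquidity")
))

# Count dictionary hits per document and convert to a plain data frame.
hits <- dfm(tokens_lookup(toks, dictionary = dict))
dict_table <- convert(hits, to = "data.frame")
print(dict_table)
```

The resulting data frame has one row per document and one column per dictionary category, which is the general shape an export-ready table would take.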
The detailed reference manual and vignettes are published at contefranz.github.io/NLPstudio.
NLPstudio is a stable public release intended for reproducible social science text-analysis workflows, with frozen output schemas for the core corpus and topic-model APIs. Repository archiving and DOI minting through Zenodo are handled from each public GitHub release.
The full output-schema contract for the topic-model API — the frozen result classes and the standardized evaluation/selection columns — is documented in the Topic Model API vignette under Public API Stability.
Install NLPstudio from GitHub with pak:
install.packages("pak")
pak::pkg_install("contefranz/NLPstudio")

Some modeling backends are optional. Install backend packages only when you need them; for example, STM support requires stm, and embedded topic models require both topicmodels.etm and a working torch backend.
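One way to see which optional backends are already present is to check for them with base R before installing anything. The sketch below is an assumption about how you might do this yourself; `backend_available()` is a hypothetical helper, not part of NLPstudio.

```r
# Hypothetical helper: report which optional backend packages are installed.
# requireNamespace() returns TRUE or FALSE without attaching the package.
backend_available <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

optional_backends <- c("stm", "topicmodels.etm", "torch")
status <- backend_available(optional_backends)
print(status)

# Install only what is missing and actually needed, for example STM support:
# if (!status[["stm"]]) install.packages("stm")
```

Note that having the torch package installed does not by itself guarantee a working torch backend, so treat this check as a first pass only.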
library(NLPstudio)
library(quanteda)

# Six short example documents
docs <- data.frame(
  doc_id = paste0("doc", 1:6),
  text = c(
    "Revenue growth improved after subscription demand increased.",
    "Operating margin expanded as cloud costs declined.",
    "Audit committee oversight focused on internal controls.",
    "Risk disclosures emphasized liquidity and refinancing pressure.",
    "Customer retention supported recurring software revenue.",
    "Debt covenants and interest expense shaped capital allocation."
  )
)

# Build a corpus, then tokenize, lowercase, and drop English stopwords
corp <- quanteda::corpus(docs, text_field = "text", docid_field = "doc_id")
toks <- quanteda::tokens(corp, remove_punct = TRUE)
toks <- quanteda::tokens_tolower(toks)
toks <- quanteda::tokens_remove(toks, pattern = quanteda::stopwords("en"))
dfm <- quanteda::dfm(toks)

# Fit a 2-topic LDA model with Gibbs sampling through the topicmodels engine
fit <- fit_topic_model(
  dfm,
  engine = "topicmodels",
  model = "lda",
  method = "Gibbs",
  k = 2,
  control = list(fit = list(seed = 1L, iter = 50L, burnin = 0L, thin = 1L))
)

# Inspect the top 4 terms per topic
get_top_terms(fit, n = 4)

# Evaluate the fitted model on the training document-feature matrix
evaluate_topic_model(
  fit,
  training = dfm,
  metrics = c("diversity", "exclusivity", "coherence_umass"),
  top_n = 4L
)

For complete workflows, see the vignettes at contefranz.github.io/NLPstudio.
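As a rough illustration of what a metric such as `diversity` in the evaluation call above can capture, one widely used definition is the share of unique terms among all topics' top-n terms. The sketch below implements that definition in base R on made-up term lists. It is an assumption that NLPstudio's `diversity` metric follows the same formula, and `topic_diversity()` is a hypothetical helper, not a package function.

```r
# Hypothetical top-term lists for two topics (not output from a fitted model).
top_terms <- list(
  topic1 = c("revenue", "growth", "subscription", "demand"),
  topic2 = c("revenue", "debt", "interest", "covenants")
)

# Topic diversity: number of unique top terms divided by the total number
# of top-term slots across all topics. A value of 1 means no term is shared.
topic_diversity <- function(terms) {
  all_terms <- unlist(terms, use.names = FALSE)
  length(unique(all_terms)) / length(all_terms)
}

diversity <- topic_diversity(top_terms)
print(diversity)
```

Here the two topics share one term ("revenue"), so 7 of the 8 top-term slots are unique and the score is 0.875.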
If you use NLPstudio in academic work, please cite the package. Citation metadata is available from R:
citation("NLPstudio")

Francesco Grossetti
Assistant Professor of Accounting Analytics and Data Science
Department of Accounting, Bocconi University
Fellow at Bocconi Institute for Data Science and Analytics (BIDSA)
Contact: francesco.grossetti@unibocconi.it
