NLPstudio

NLPstudio is an R package for scalable text analysis in research workflows. It is built around quanteda, data.table, and portable parallel backends, with particular attention to reproducible social science workflows that work with financial disclosures, regulatory filings, and other structured document collections.

The package has two main workflows:

Corpus preparation and document-level text analysis, from SEC-style JSON files to quanteda corpora, tokens, dictionaries, readability, similarity, and export-ready tables.
A consistent topic-model API for fitting, adopting, evaluating, selecting, diagnosing, summarizing, and exporting topic models across supported R backends.
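
As a rough illustration of the document-level side of the first workflow, the sketch below counts dictionary hits per document and converts the result to a plain table. It uses quanteda functions directly rather than NLPstudio's own helpers, and both the two-document corpus and the `performance`/`risk` dictionary are hypothetical, made up for this example.

```r
library(quanteda)

# Hypothetical two-document collection, for illustration only
docs <- data.frame(
  doc_id = c("a", "b"),
  text = c(
    "Revenue growth improved.",
    "Liquidity risk and debt pressure increased."
  )
)

corp <- corpus(docs, text_field = "text", docid_field = "doc_id")
toks <- tokens_tolower(tokens(corp, remove_punct = TRUE))

# Hypothetical dictionary: category names and word lists are assumptions
dict <- dictionary(list(
  performance = c("revenue", "growth", "margin"),
  risk = c("risk", "liquidity", "debt")
))

# Count dictionary matches per document, one column per dictionary key
counts <- dfm_lookup(dfm(toks), dictionary = dict)

# Convert to a data.frame with a doc_id column, ready to merge or export
tab <- convert(counts, to = "data.frame")
print(tab)
```

The resulting table has one row per document and one column per dictionary category, which is a convenient shape for joining with firm-level or filing-level metadata.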

The detailed reference manual and vignettes are published at contefranz.github.io/NLPstudio.

Release Status

NLPstudio is a stable public release intended for reproducible social science text-analysis workflows, with frozen output schemas for the core corpus and topic-model APIs. Each public GitHub release triggers repository archiving and DOI minting through Zenodo.

The full output-schema contract for the topic-model API (the frozen result classes and the standardized evaluation/selection columns) is documented in the Topic Model API vignette under Public API Stability.

Installation

Install NLPstudio from GitHub with pak:

install.packages("pak")
pak::pkg_install("contefranz/NLPstudio")

Some modeling backends are optional. Install backend packages only when you need them; for example, STM support requires stm, and embedded topic models require both topicmodels.etm and a working torch backend.
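
One way to see which optional backends are already installed is sketched below. It relies only on base R and checks the three package names mentioned above; it does not load any of them, and it does not verify that torch's native libraries are set up, which is a separate step.

```r
# Optional backend packages named in this README
optional_backends <- c("stm", "topicmodels.etm", "torch")

# system.file() returns "" for a package that is not installed,
# so nzchar() yields TRUE only for installed packages.
available <- vapply(
  optional_backends,
  function(pkg) nzchar(system.file(package = pkg)),
  logical(1)
)
print(available)

# Report anything that would need installing before using that backend
not_installed <- names(available)[!available]
if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
}
```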

Quick Example

library(NLPstudio)
library(quanteda)

docs <- data.frame(
  doc_id = paste0("doc", 1:6),
  text = c(
    "Revenue growth improved after subscription demand increased.",
    "Operating margin expanded as cloud costs declined.",
    "Audit committee oversight focused on internal controls.",
    "Risk disclosures emphasized liquidity and refinancing pressure.",
    "Customer retention supported recurring software revenue.",
    "Debt covenants and interest expense shaped capital allocation."
  )
)

# Build a corpus, then tokenize, lowercase, and drop English stopwords
corp <- quanteda::corpus(docs, text_field = "text", docid_field = "doc_id")
toks <- quanteda::tokens(corp, remove_punct = TRUE)
toks <- quanteda::tokens_tolower(toks)
toks <- quanteda::tokens_remove(toks, pattern = quanteda::stopwords("en"))
# Named dfmat so the object does not mask quanteda::dfm()
dfmat <- quanteda::dfm(toks)

# Fit a two-topic LDA with Gibbs sampling and a fixed seed
fit <- fit_topic_model(
  dfmat,
  engine = "topicmodels",
  model = "lda",
  method = "Gibbs",
  k = 2,
  control = list(fit = list(seed = 1L, iter = 50L, burnin = 0L, thin = 1L))
)

# Inspect the top terms per topic, then evaluate the fit on the training dfm
get_top_terms(fit, n = 4)
evaluate_topic_model(
  fit,
  training = dfmat,
  metrics = c("diversity", "exclusivity", "coherence_umass"),
  top_n = 4L
)
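
Stopword removal can leave very short documents with no tokens at all, and LDA in the topicmodels package refuses a matrix that contains an all-zero row. Whether fit_topic_model guards against this internally is not stated here, so the sketch below shows a manual check with quanteda alone. The three texts are hypothetical, and the second one is built from stopwords on purpose.

```r
library(quanteda)

# Hypothetical texts; doc2 contains only English stopwords
txt <- c(
  doc1 = "Revenue growth improved.",
  doc2 = "The and of.",
  doc3 = "Debt covenants shaped capital allocation."
)

toks <- tokens(corpus(txt), remove_punct = TRUE)
toks <- tokens_remove(tokens_tolower(toks), pattern = stopwords("en"))
dfmat <- dfm(toks)

# Flag documents that have no tokens left after preprocessing
empty <- ntoken(dfmat) == 0
print(names(empty)[empty])

# Keep only non-empty documents for model fitting
dfmat_fit <- dfm_subset(dfmat, !empty)
print(ndoc(dfmat_fit))
```

Recording which documents were dropped keeps the step reproducible, since the fitted model then covers fewer documents than the original corpus.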

For complete workflows, see the vignettes at contefranz.github.io/NLPstudio.

Citation

If you use NLPstudio in academic work, please cite the package. Citation metadata is available from R:

citation("NLPstudio")

Author

Francesco Grossetti
Assistant Professor of Accounting Analytics and Data Science
Department of Accounting, Bocconi University
Fellow at Bocconi Institute for Data Science and Analytics (BIDSA)
Contact: francesco.grossetti@unibocconi.it
