---
title: "Process Untidy Text"
author: "Michael Weisner"
date: "February 14, 2019"
output:
word_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Processing Untidy Text in R
So, what about untidy text?
## Federalist Papers Data
Download the Federalist Papers data from here:
[https://github.com/mdweisner/textmining_workshop/raw/master/federalist.zip](https://github.com/mdweisner/textmining_workshop/raw/master/federalist.zip)
OR
[goo.gl/y5v6bx](https://goo.gl/y5v6bx)
The texts of the Federalist Papers need to be in a subdirectory called `federalist` below your working directory.
```{r}
download.file("https://github.com/mdweisner/textmining_workshop/raw/master/federalist.zip", destfile = "./federalist.zip")
unzip("federalist.zip")
dir("federalist")
```
You should now have a folder called federalist in your working directory.
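If you knit this document more than once, you may want to skip the download when the files are already there. Here is a minimal guard, using the same file names as above:
```{r, eval=FALSE}
# Optional: only download and unzip if we have not already done so
if (!file.exists("federalist.zip")) {
  download.file("https://github.com/mdweisner/textmining_workshop/raw/master/federalist.zip",
                destfile = "./federalist.zip")
}
if (!dir.exists("federalist")) unzip("federalist.zip")
```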
## Basics of Text Processing
The basic workflow for text data includes:

1. Convert your text data into a corpus, a special collection of text documents
2. Clean the corpus (strip whitespace, convert to lowercase, remove stop words, reduce words to their stems)
3. Create a Document-Term Matrix (DTM) from the corpus
4. Conduct the analysis
## Corpus & Corpora
First, we indicate which documents are to be included in the corpus:
```{r, message = FALSE}
library(tm)
corpus_raw <- Corpus(DirSource(directory = "federalist", pattern = "fp"))
corpus_raw
```
In this case, there are 85 documents in total; text analysis often works with a much larger set of documents.
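To peek at the raw text of a single document, one option is `tm`'s `as.character()` accessor (the document index and character count here are just examples):
```{r}
# First ~300 characters of the first paper in the corpus
substr(paste(as.character(corpus_raw[[1]]), collapse = " "), 1, 300)
```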
Text analysis usually works with words or phrases, without regard to sentence or paragraph structure. Common operations on a corpus of text include:

* making everything lowercase
* removing extra whitespace
* removing punctuation
* removing numbers
* removing stop words (like "the", which are common but carry little meaning)
* reducing words to their stems (like "politic", which covers "political" and "politics")
Next, we apply some operations to the texts in the corpus:
```{r}
corpus <- tm_map(corpus_raw, content_transformer(tolower))  # lowercase
corpus <- tm_map(corpus, stripWhitespace)                   # collapse extra whitespace
corpus <- tm_map(corpus, removePunctuation)                 # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                     # drop numbers
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop stop words
corpus <- tm_map(corpus, stemDocument)                      # reduce words to stems
```
We can create a `DocumentTermMatrix` that has one row for each document in the corpus, one column for each word (stem), and one cell for the number of times that word (stem) appears in that document:
```{r}
dtm <- DocumentTermMatrix(corpus) # sparse (simple triplet) form
is.list(dtm) # TRUE: internally a DTM is a list of triplets, not a matrix
dtm.mat <- as.matrix(dtm) # dense form, as a plain matrix
library(Matrix) # sparse form using the Matrix package
dtm.Mat <- sparseMatrix(dtm$i, dtm$j, x = dtm$v,
                        dims = c(dtm$nrow, dtm$ncol),
                        dimnames = dtm$dimnames)
dtm.Mat[1:10, 1:6]
```
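It can also help to eyeball the vocabulary. `tm`'s `findFreqTerms()` lists every term that appears at least a given number of times across the corpus (the threshold of 200 here is arbitrary):
```{r}
# Word stems appearing at least 200 times across all 85 papers
findFreqTerms(dtm, lowfreq = 200)
```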
To find words (stems) whose counts across documents are highly correlated with those of a given word (stem), use `findAssocs()` with a minimum correlation:
```{r}
findAssocs(dtm, "govern", corlimit = 0.5)
```
We can convert `dtm` into a tidy data frame, with one row per document-term pair:
```{r}
library(tidyverse)
library(tidytext)
corpus_tidy <- tidy(dtm)
```
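`tidy(dtm)` returns one row per non-zero document-term pair, with columns `document`, `term`, and `count`. For example, the most frequent pairs:
```{r}
# Most frequent (document, term) pairs in the tidy form
corpus_tidy %>% arrange(desc(count)) %>% head()
```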
We often weight by term frequency-inverse document frequency (tf-idf), which down-weights words that appear in many documents and so gives a better measure of how distinctive a word is to a particular document.
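For reference, the usual definition (which is what `bind_tf_idf()` computes, up to the choice of log base) is

$$
\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\frac{N}{N_t}
$$

where $n_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $N_t$ is the number of documents containing $t$.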
```{r}
corpus_tidy_tfidf <- corpus_tidy %>% bind_tf_idf(term, document, count)
corpus_tidy_tfidf
corpus_tidy_tfidf %>%
  select(-count) %>%
  arrange(desc(tf_idf))
```
## Predicting Authorship
Now we want a modified corpus that keeps the stopwords, because function words such as "upon" and "whilst" are exactly the stylistic markers that distinguish one author from another:
```{r}
corpus1 <- tm_map(corpus_raw, content_transformer(tolower))
corpus1 <- tm_map(corpus1, stripWhitespace)
corpus1 <- tm_map(corpus1, removePunctuation)
corpus1 <- tm_map(corpus1, removeNumbers)
dtm1 <- as.matrix(DocumentTermMatrix(corpus1))
dtm1 <- dtm1 / rowSums(dtm1) * 1000 # scale so that each row sums to 1000
```
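A quick sanity check that the scaling did what the comment says:
```{r}
# Every row of dtm1 should now sum to (almost exactly) 1000
range(rowSums(dtm1))
```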
We can then code an outcome variable by author and predict it from the word frequencies:
```{r}
hamilton <- c(1, 6:9, 11:13, 15:17, 21:36, 59:61, 65:85)
madison <- c(10, 14, 37:48, 58)
author <- rep(NA, nrow(dtm1))
author[hamilton] <- 1 # 1 if Hamilton
author[madison] <- -1 # -1 if Madison
## training set data
train <- data.frame(author = author[c(hamilton, madison)],
                    dtm1[c(hamilton, madison), ])
## fit linear model
hm_fit <- lm(author ~ upon + there + consequently + whilst, data = train)
hm_fit
```
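As a quick in-sample check, we can compare the sign of the fitted values with the known authors:
```{r}
# 1 = Hamilton, -1 = Madison; a good fit classifies most training papers correctly
table(predicted = sign(fitted(hm_fit)), actual = train$author)
```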
Now we can predict the authorship of the disputed Federalist Papers:
```{r}
disputed <- c(49:57, 62, 63) # the papers with disputed authorship
tf_disputed <- as.data.frame(dtm1[disputed, ])
pred <- predict(hm_fit, newdata = tf_disputed)
sign(pred) # 1 = predicted Hamilton, -1 = predicted Madison
```
# Plot
Finally, we can plot the distribution of word (stem) lengths in the vocabulary, marking the mean length with a vertical line:
```{r, message=FALSE}
library(ggplot2)
data.frame(nletters = nchar(colnames(dtm))) %>%
  ggplot(aes(x = nletters)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mean(nchar(colnames(dtm))),
             color = "green", size = 1, alpha = .5) +
  labs(x = "Number of Letters", y = "Number of Words")
```
See also the CRAN Task View on Web Technologies: https://cran.r-project.org/web/views/WebTechnologies.html