---
title: "Process Untidy Text"
author: "Michael Weisner"
date: "February 14, 2019"
output:
word_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Processing Untidy Text in R
So, what about untidy text?
## Federalist Papers Data
Download the Federalist Papers data from here:
[https://github.com/mdweisner/textmining_workshop/raw/master/federalist.zip](https://github.com/mdweisner/textmining_workshop/raw/master/federalist.zip)
OR
[goo.gl/y5v6bx](https://goo.gl/y5v6bx)
The texts of the Federalist Papers need to be in a subdirectory called `federalist` below your working directory.
```{r}
download.file("https://github.com/mdweisner/textmining_workshop/raw/master/federalist.zip", destfile = "./federalist.zip")
unzip("federalist.zip")
dir("federalist")
```
You should now have a folder called federalist in your working directory.
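If you knit this document more than once, you may want to skip the download when the files are already there. Here is a minimal guard, using the same file names as above:
```{r, eval=FALSE}
# Optional: only download and unzip if we have not already done so
if (!file.exists("federalist.zip")) {
  download.file("https://github.com/mdweisner/textmining_workshop/raw/master/federalist.zip",
                destfile = "./federalist.zip")
}
if (!dir.exists("federalist")) unzip("federalist.zip")
```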
## Basics of Text Processing
The basic workflow for text data includes:

1. Convert your text data into a corpus, a special collection of text documents
2. Clean the corpus (strip whitespace, convert to lowercase, remove stop words, reduce words to their stems)
3. Create a Document-Term Matrix (DTM) from the corpus
4. Conduct the analysis
## Corpus & Corpora
First, we indicate which documents are to be included in the corpus:
```{r, message = FALSE}
library(tm)
corpus_raw <- Corpus(DirSource(directory = "federalist", pattern = "fp"))
corpus_raw
```
In this case, there are 85 documents in total; text analysis often works with a much larger set of documents.
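To peek at the raw text of a single document, one option is `tm`'s `as.character()` accessor (the document index and character count here are just examples):
```{r}
# First ~300 characters of the first paper in the corpus
substr(paste(as.character(corpus_raw[[1]]), collapse = " "), 1, 300)
```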
Text analysis usually works with words or phrases, without regard to sentence or paragraph structure. Common operations on a corpus of text include:

* making everything lowercase
* removing extra whitespace
* removing punctuation
* removing numbers
* removing stop words (like "the", which are common but carry little meaning)
* reducing words to their stems (like "politic", which covers "political" and "politics")
Next, we apply some operations to the texts in the corpus:
```{r}
corpus <- tm_map(corpus_raw, content_transformer(tolower))  # lowercase
corpus <- tm_map(corpus, stripWhitespace)                   # collapse extra whitespace
corpus <- tm_map(corpus, removePunctuation)                 # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                     # drop numbers
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop stop words
corpus <- tm_map(corpus, stemDocument)                      # reduce words to stems
```
We can create a `DocumentTermMatrix` that has one row for each document in the corpus, one column for each word (stem), and one cell for the number of times that word (stem) appears in that document:
```{r}
dtm <- DocumentTermMatrix(corpus) # sparse (simple triplet) form
is.list(dtm) # TRUE: internally a DTM is a list of triplets, not a matrix
dtm.mat <- as.matrix(dtm) # dense form, as a plain matrix
library(Matrix) # sparse form using the Matrix package
dtm.Mat <- sparseMatrix(dtm$i, dtm$j, x = dtm$v,
                        dims = c(dtm$nrow, dtm$ncol),
                        dimnames = dtm$dimnames)
dtm.Mat[1:10, 1:6]
```
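It can also help to eyeball the vocabulary. `tm`'s `findFreqTerms()` lists every term that appears at least a given number of times across the corpus (the threshold of 200 here is arbitrary):
```{r}
# Word stems appearing at least 200 times across all 85 papers
findFreqTerms(dtm, lowfreq = 200)
```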
To find words (stems) whose counts across documents are highly correlated with those of a given word (stem), use `findAssocs()` with a minimum correlation:
```{r}
findAssocs(dtm, "govern", corlimit = 0.5)
```
We can convert `dtm` into a tidy data frame, with one row per document-term pair:
```{r}
library(tidyverse)
library(tidytext)
corpus_tidy <- tidy(dtm)
```
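`tidy(dtm)` returns one row per non-zero document-term pair, with columns `document`, `term`, and `count`. For example, the most frequent pairs:
```{r}
# Most frequent (document, term) pairs in the tidy form
corpus_tidy %>% arrange(desc(count)) %>% head()
```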
We often weight by term frequency-inverse document frequency (tf-idf), which down-weights words that appear in many documents and so gives a better measure of how distinctive a word is to a particular document.
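For reference, the usual definition (which is what `bind_tf_idf()` computes, up to the choice of log base) is

$$
\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\frac{N}{N_t}
$$

where $n_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $N_t$ is the number of documents containing $t$.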
```{r}
corpus_tidy_tfidf <- corpus_tidy %>% bind_tf_idf(term, document, count)
corpus_tidy_tfidf
corpus_tidy_tfidf %>%
  select(-count) %>%
  arrange(desc(tf_idf))
```
## Predicting Authorship
Now we want a modified corpus that keeps the stopwords, because function words such as "upon" and "whilst" are exactly the stylistic markers that distinguish one author from another:
```{r}
corpus1 <- tm_map(corpus_raw, content_transformer(tolower))
corpus1 <- tm_map(corpus1, stripWhitespace)
corpus1 <- tm_map(corpus1, removePunctuation)
corpus1 <- tm_map(corpus1, removeNumbers)
dtm1 <- as.matrix(DocumentTermMatrix(corpus1))
dtm1 <- dtm1 / rowSums(dtm1) * 1000 # scale so that each row sums to 1000
```
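A quick sanity check that the scaling did what the comment says:
```{r}
# Every row of dtm1 should now sum to (almost exactly) 1000
range(rowSums(dtm1))
```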
We can then code an outcome variable by author and predict it from the word frequencies:
```{r}
hamilton <- c(1, 6:9, 11:13, 15:17, 21:36, 59:61, 65:85)
madison <- c(10, 14, 37:48, 58)
author <- rep(NA, nrow(dtm1))
author[hamilton] <- 1 # 1 if Hamilton
author[madison] <- -1 # -1 if Madison
## training set data
train <- data.frame(author = author[c(hamilton, madison)],
                    dtm1[c(hamilton, madison), ])
## fit linear model
hm_fit <- lm(author ~ upon + there + consequently + whilst, data = train)
hm_fit
```
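As a quick in-sample check, we can compare the sign of the fitted values with the known authors:
```{r}
# 1 = Hamilton, -1 = Madison; a good fit classifies most training papers correctly
table(predicted = sign(fitted(hm_fit)), actual = train$author)
```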
Now we can predict the authorship of the disputed Federalist Papers:
```{r}
disputed <- c(49:57, 62, 63) # the papers with disputed authorship
tf_disputed <- as.data.frame(dtm1[disputed, ])
pred <- predict(hm_fit, newdata = tf_disputed)
sign(pred) # 1 = predicted Hamilton, -1 = predicted Madison
```
# Plot
Finally, we can plot the distribution of word (stem) lengths in the vocabulary, marking the mean length with a vertical line:
```{r, message=FALSE}
library(ggplot2)
data.frame(nletters = nchar(colnames(dtm))) %>%
  ggplot(aes(x = nletters)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mean(nchar(colnames(dtm))),
             color = "green", size = 1, alpha = .5) +
  labs(x = "Number of Letters", y = "Number of Words")
```
See also the CRAN Task View on Web Technologies: https://cran.r-project.org/web/views/WebTechnologies.html