-
Notifications
You must be signed in to change notification settings - Fork 2
Expand file tree
/
Copy path2. Tidytext Mining Examples.Rmd
More file actions
131 lines (108 loc) · 3.55 KB
/
2. Tidytext Mining Examples.Rmd
File metadata and controls
131 lines (108 loc) · 3.55 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
---
title: "Text Mining Examples"
author: "Michael Weisner"
date: "February 14, 2019"
output:
word_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Jane Austen Books
```{r}
library(janeaustenr)
library(stringr)
```
```{r}
library(dplyr)
library(tidyverse)
library(tidytext)
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
tidy_books
```
### Sentiment
There are several sentiment analysis datasets, for example here are sentiment assignments from the [National Resource Council of Canada's Emotional Lexicon](https://www.nrc-cnrc.gc.ca/eng/solutions/advisory/emotion_lexicons.html)
Let's find words that are associated with being joyful
```{r}
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy") # extract joyful words as determined by the nrc group
head(nrc_joy)
```
Let's look at the sentiment of words in the book "Emma"
```{r}
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
```
Now let's do an inner join (so just of the words that are in both datsaets) of Emma with the Bing sentiment analysis
```{r}
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
head(jane_austen_sentiment)
```
Notice it has both negative and positive scores and a net sentiment (representing positive and negative language, for whatever that's worth).
And now let's plot it
```{r}
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
```
Lastly, let's do a sentiment analysis of the most frequent words' contribution to positive and negative sentiment.
```{r}
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
```
### Word Clouds
R Also has good libraries for wod clouds (which are less useful for statistics, but fun)
```{r}
library(wordcloud)
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
```
We could color the words by sentiment.
A good list of colors is available [here](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
```{r}
sentiment_books <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
mutate(word = reorder(word, n))
n <- nrow(sentiment_books)
colors <- rep("grey", n)
colors[sentiment_books$sentiment == "negative"] <- "coral2"
colors[sentiment_books$sentiment == "positive"] <- "cyan3"
sentiment_books %>%
with(wordcloud(word, n, colors = colors, max.words = 100))
library(ggwordcloud)
sentiment_books_small <- sentiment_books %>%
arrange(desc(n)) %>%
mutate(rank = row_number()) %>%
filter(rank <= 100)
book_sent_gg <- ggplot(sentiment_books_small, aes(label = word, size = n, color = sentiment)) +
geom_text_wordcloud_area() +
scale_size_area(max_size = 24)
book_sent_gg
```