textmining_workshop/2. Tidytext Mining Examples.Rmd at master · mdweisner/textmining_workshop · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
---
title: "Text Mining Examples"
author: "Michael Weisner"
date: "February 14, 2019"
output:
  word_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

### Jane Austen Books
```{r}
library(janeaustenr)
library(stringr)
```

```{r}
library(dplyr)
library(tidyverse)
library(tidytext)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

tidy_books
```

### Sentiment

There are several sentiment analysis datasets, for example here are sentiment assignments from the [National Resource Council of Canada's Emotional Lexicon](https://www.nrc-cnrc.gc.ca/eng/solutions/advisory/emotion_lexicons.html)

Let's find words that are associated with being joyful
```{r}
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy") # extract joyful words as determined by the nrc group

head(nrc_joy)
```

Let's look at the sentiment of words in the book "Emma"
```{r}
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
```

Now let's do an inner join (so just of the words that are in both datsaets) of Emma with the Bing sentiment analysis
```{r}
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

head(jane_austen_sentiment)
```

Notice it has both negative and positive scores and a net sentiment (representing positive and negative language, for whatever that's worth).

And now let's plot it
```{r}
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
```

Lastly, let's do a sentiment analysis of the most frequent words' contribution to positive and negative sentiment.
```{r}
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
```


### Word Clouds
R Also has good libraries for wod clouds (which are less useful for statistics, but fun)
```{r}
library(wordcloud)

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
```

We could color the words by sentiment.
A good list of colors is available [here](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
```{r}
sentiment_books <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  mutate(word = reorder(word, n))

n <- nrow(sentiment_books)
colors <- rep("grey", n)
colors[sentiment_books$sentiment == "negative"] <- "coral2"
colors[sentiment_books$sentiment == "positive"] <-  "cyan3"

sentiment_books %>%
with(wordcloud(word, n, colors = colors, max.words = 100))

library(ggwordcloud)
sentiment_books_small <- sentiment_books %>%
  arrange(desc(n)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 100)

book_sent_gg <- ggplot(sentiment_books_small, aes(label = word, size = n, color = sentiment)) +
                         geom_text_wordcloud_area() +
                         scale_size_area(max_size = 24)

book_sent_gg
```