= Working with text in Gephi
Clément Levallois <clementlevallois@gmail.com>
2017-03-07

last modified: {docdate}

We call "semantic network" a visualization where textual items (words, expressions) are connected to each other.
We will see in turn:
//ST: !
- why semantic networks are interesting
- how to create a semantic network
== Why semantic networks?
//ST: Why semantic networks?
//ST: !
A text, or many texts, can be hard to summarize.
Drawing a semantic network highlights the most frequent terms, shows how they relate to each other, and reveals the different groups or "clusters" they form.
//ST: !
Often, a cluster of terms characterizes a topic.
Hence, converting a text into a semantic network helps detect topics in the text, from micro-topics to the general themes discussed in the documents.
//ST: !
Semantic networks are regular networks, where:

- nodes are terms (single words or n-grams)
- relations usually signify a co-occurrence: two words are connected if they co-occur.
//ST: !
It means that if you have a textual network, you can visualize it with Gephi just like any other network.
Yet, not everything is the same: this tutorial provides tips and tricks for the ways textual data can be a bit different from other data.
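To make this concrete, here is a minimal sketch of how such a network can be built, assuming Python 3 with the networkx package installed (the example sentences and the file name are just illustrations). It writes a GEXF file, a format Gephi opens directly:

[source,python]
----
# A minimal sketch: build a word co-occurrence network and export it for Gephi.
# Assumes Python 3 with networkx installed (pip install networkx).
from itertools import combinations

import networkx as nx

sentences = [
    "my sister lives in the united states",
    "the united states of america",
    "my sister likes chocolate ice cream",
]

G = nx.Graph()
for sentence in sentences:
    words = sentence.split()
    # connect every pair of distinct words co-occurring in the same sentence
    for w1, w2 in combinations(sorted(set(words)), 2):
        if G.has_edge(w1, w2):
            G[w1][w2]["weight"] += 1
        else:
            G.add_edge(w1, w2, weight=1)

nx.write_gexf(G, "semantic_network.gexf")  # open this file in Gephi
----

Each node is a word, and each edge weight counts how many sentences the two words share.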
== Choosing what a "term" is in a semantic network
//ST: Choosing what a "term" is in a semantic network
//ST: !
The starting point can be: a term is a single word. So in the following sentence, we would have 7 terms:
My sister lives in the United States (7 words -> 7 terms)
This means that each single term is a meaningful semantic unit.
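As a minimal illustration in plain Python (no extra libraries), splitting the example sentence on whitespace yields exactly these 7 terms:

[source,python]
----
# The simplest possible definition of a "term": one whitespace-separated word.
sentence = "My sister lives in the United States"
terms = sentence.lower().split()
print(terms)       # ['my', 'sister', 'lives', 'in', 'the', 'united', 'states']
print(len(terms))  # 7 words -> 7 terms
----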
Yet some single words, like `my`, `in` or `the`, carry little meaning on their own.
You can find a list of these useless terms in many languages, called "stopwords".
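A minimal sketch of stopword removal; the stopword list below is deliberately tiny and only an illustration, real lists contain a few hundred entries:

[source,python]
----
# Filter stopwords out of the list of terms.
STOPWORDS = {"my", "in", "the", "of", "a", "an", "and", "to"}

terms = ["my", "sister", "lives", "in", "the", "united", "states"]
content_terms = [t for t in terms if t not in STOPWORDS]
print(content_terms)  # ['sister', 'lives', 'united', 'states']
----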
==== 2. Considering "n-grams"
//ST: !
So, `United States` should probably be a meaningful unit, not just `United` and `States`.
Because `United States` is composed of 2 terms, it is called a "bi-gram".
Trigrams are interesting as well obviously (e.g., `chocolate ice cream`).
//ST: !
People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs, which are under a free license].
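If you prefer to stay in Python, here is a small sketch, using only the standard library, that counts the bi-grams, trigrams and quadrigrams of a tokenized text:

[source,python]
----
# Count n-grams with a sliding window over the token list.
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "the united states of america is in the united states".split()
for n in (2, 3, 4):
    counts = Counter(" ".join(gram) for gram in ngrams(tokens, n))
    print(n, counts.most_common(2))
----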
Take `United States`: it is a noun (`States`) preceded by an adjective (`United`).
Detecting such grammatical patterns is another way to identify meaningful semantic units.

This approach is interesting (implemented for example in the software http://www.vosviewer.com[Vosviewer]), but it has drawbacks:

- you need to detect adjectives and nouns in your text. This is language dependent (French puts adjectives after nouns, for instance), and the processing is slow for large corpora.

- what about verbs, and noun phrases comprising non-adjectives, such as "United States *of* America"? These are not going to be included in the network.

//ST: !

==== 3. Stemming and lemmatization
//ST: !

Take the word `lives` in our example sentence: should `lives` and `live` count as two different terms?

- Stemming consists in chopping off the end of words, so that here, we would have only `live`.
- Lemmatization is the same, but in a more subtle way: it takes grammar into account. So, "good" and "better" would both be reduced to "good", because the same basic semantic unit lies behind these two words, even if their spellings differ completely (see the sketch below).
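Here is a quick sketch of the difference, assuming Python with the nltk package installed (and its wordnet data downloaded); nltk is only one convenient option among many:

[source,python]
----
# Stemming chops suffixes blindly; lemmatization uses a dictionary and grammar.
# Setup (once): pip install nltk, then nltk.download("wordnet") in Python.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("lives"))                    # 'live'
print(stemmer.stem("better"))                   # 'better': stemming misses it
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' (pos="a" = adjective)
----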
//ST: !
A tool performing lemmatization is https://textgrid.de/en/[TextGrid].
It has many functions for textual analysis, and lemmatization https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool[is explained there].
== Should we represent all terms in a semantic network?
//ST: Should we represent all terms in a semantic network?
//ST: !
We have seen that some words are more interesting than others in a corpus:
- stopwords can be removed
- n-grams can be more meaningful than single words
- stemming and lemmatization can reduce different word forms to a single term

//ST: !
Once this is done, we have transformed the text into plenty of words to represent. Should they all be included in the network?
Imagine we have a word appearing just once, in a single footnote of a 2,000-page text.
Should this word appear? Probably not.
Which rule should we apply to keep or leave out a word?
A starting point can be the number of words you would like to see on a visualization.
Around 300 words is a reasonable maximum:
- it already fills in all the space of a computer screen.
- 300 words provides enough information to allow the micro-topics of a text to be distinguished.
//ST: !
More words can be crammed in a visualization, but in this case the viewer would have to spend time zooming in and out and panning around to explore it.
The viewer transforms into an analyst, instead of a regular reader.
//ST: !
==== 2. Representing only the most frequent terms
To visualize the semantic network *for a long, single text*, the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).
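A minimal sketch of that selection, using only Python's standard library (the file name is a placeholder for your own text):

[source,python]
----
# Keep only the 300 most frequent terms of a long text.
from collections import Counter

tokens = open("my_long_text.txt", encoding="utf-8").read().lower().split()
top_terms = {term for term, count in Counter(tokens).most_common(300)}
# Downstream, only co-occurrences between two terms of top_terms
# would become edges of the network.
----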
In the case of a collection of texts to visualize (several documents instead of one), there are two possibilities:
//ST: !
1. Either you also take the most frequent terms across these documents, like before
2. Or you weight the terms with tf-idf, to highlight the terms that are characteristic of each document.

//ST: !

Applying the tf-idf correction will highlight terms which are frequently used within a document, but rare in the rest of the corpus.

(to go further, here is a webpage giving a simple example: http://www.tfidf.com/)

//ST: !

So, should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?

Both are interesting, as they show different info.
I'd suggest that the simple frequency count is easier to interpret.
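For the curious, here is a hand-rolled sketch of the computation; tf-idf exists in several variants, and this is the common tf × log(N/df) one:

[source,python]
----
# tf-idf: a term's frequency in a document, weighted down
# by how many documents of the corpus contain that term.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "chocolate ice cream tastes better than the log".split(),
]

n_docs = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

for doc in docs:
    tf = Counter(doc)  # term frequency within this document
    scores = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    print(sorted(scores, key=scores.get, reverse=True)[:3])
----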
//ST: (to be continued)

== More tutorials on working with semantic networks
//ST: More tutorials on working with semantic networks
//ST: !

== the end
//ST: The end!
Visit https://www.facebook.com/groups/gephi/[the Gephi group on Facebook] to get help,
or visit https://seinecle.github.io/gephi-tutorials/[the website for more tutorials]