We call a "semantic network" a visualization where textual items (words, expressions) are connected to each other, like above.

We will see in turn:

//ST: !

- why semantic networks are interesting
- how to create a semantic network
- tips and tricks to visualize semantic networks in the best possible way in Gephi
== Why semantic networks?

//ST: Why semantic networks?

//ST: !

A text, or many texts, can be hard to summarize.

Drawing a semantic network highlights which terms are most frequent, how they relate to each other, and the different groups or "clusters" they form.

//ST: !

Often, a cluster of terms characterizes a topic.

Hence, converting a text into a semantic network helps detect topics in the text, from micro-topics to the general themes discussed in the documents.

//ST: !
Semantic networks are regular networks, where:

- relations usually signify co-occurrence: two words are connected if they co-occur.

//ST: !

It means that if you have a textual network, you can visualize it with Gephi just like any other network.

Yet, not everything is the same, and this tutorial provides tips and tricks on how textual data can be a bit different from other data.
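To make the co-occurrence idea concrete, here is a minimal sketch (the sentences and the sentence-sized window are illustrative choices, not part of the tutorial) that builds weighted co-occurrence edges, connecting two words whenever they appear in the same sentence:

```python
from collections import Counter
from itertools import combinations

# Each sentence is one co-occurrence window: two words are
# connected if they appear in the same sentence.
sentences = [
    "my sister lives in the united states",
    "the united states are large",
]

edges = Counter()
for sentence in sentences:
    words = sorted(set(sentence.split()))
    for a, b in combinations(words, 2):
        edges[(a, b)] += 1  # edge weight = number of co-occurrences

# ("states", "united") co-occurs in both sentences
print(edges[("states", "united")])  # -> 2
```

An edge list like this (word pair plus weight) is exactly the kind of table Gephi can import as an undirected weighted network.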
== Choosing what a "term" is in a semantic network

//ST: Choosing what a "term" is in a semantic network

//ST: !

The starting point can be: a term is a single word. So in this sentence, we would have 7 terms:

My sister lives in the United States (7 words -> 7 terms)

This means that each single term is a meaningful semantic unit.
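The count above can be checked with a one-line whitespace tokenization (a sketch; real tokenizers also handle punctuation and case):

```python
# Naive whitespace tokenization: one term per word.
sentence = "My sister lives in the United States"
terms = sentence.split()
print(len(terms))  # -> 7
```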
You can find a list of these useless terms in many languages, called "stopwords".

==== 2. Considering "n-grams"

//ST: !

So, `United States` should probably be a meaningful unit, not just `United` and `States`.

Because `United States` is composed of 2 terms, it is called a "bi-gram".

Trigrams are interesting as well, obviously (eg, `chocolate ice cream`).

//ST: !

People often stop there, but quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.

Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
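Beyond dedicated tools, bi-grams, trigrams and quadrigrams can be listed with a few lines of Python (a generic sketch, not one of the linked programs):

```python
def ngrams(words, n):
    """Return the list of n-grams (as tuples) in a sequence of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the united states of america".split()
print(ngrams(words, 2))  # bi-grams, e.g. ('united', 'states')
print(ngrams(words, 4))  # quadrigrams, e.g. ('united', 'states', 'of', 'america')
```

In practice one keeps only the n-grams that recur often enough in the corpus to be meaningful units.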
This approach is interesting (implemented for example in the software http://www.

- Stemming consists in chopping off the end of words, so that here, we would have only `live`.
- Lemmatization is the same, but in a more subtle way: it takes grammar into account. So, "good" and "better" would be reduced to "good", because the same basic semantic unit lies behind these two words, even if their lettering differs completely.

//ST: !

A tool performing lemmatization is https://textgrid.de/en/[TextGrid].

It has many functions for textual analysis, and lemmatization https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool[is explained there].

//ST: !
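Stemming can be approximated in a few lines; the suffix rules below are purely illustrative (real stemmers such as the Porter stemmer use far more careful rules, and lemmatizing `better` to `good` needs a dictionary, not suffix chopping):

```python
def naive_stem(word):
    """Chop a few common English suffixes (illustrative rules only)."""
    for suffix in ("ing", "ed", "s"):
        # keep at least a 3-letter stem so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(naive_stem("lives"))   # -> live
print(naive_stem("walked"))  # -> walk
```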
== Should we represent all terms in a semantic network?

//ST: Should we represent all terms in a semantic network?

//ST: !

We have seen that some words are more interesting than others in a corpus:

//ST: !

Once this is done, we have transformed the text into plenty of words to represent. Should they all be included in the network?

Imagine we have a word appearing just once, in a single footnote of a 2,000-page text.

Should this word appear? Probably not.

Which rule should we apply to keep or leave out a word?
A starting point can be the number of words you would like to see on a visualization:

- it already fills all the space of a computer screen.
- 300 words provide enough information to allow the micro-topics of a text to be distinguished.

//ST: !

More words can be crammed into a visualization, but in this case the viewer would have to take time zooming in and out, and panning around, to explore it.

The viewer transforms into an analyst, instead of a regular reader.

//ST: !

==== 2. Representing only the most frequent terms
If ~ 300 words would fit in the visualization of the network, and the text you s

To visualize the semantic network *for a long, single text*, the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).

In the case of a collection of texts to visualize (several documents instead of one), there are two possibilities:

//ST: !

1. Either you also take the most frequent terms across these documents, like before

tf-idf can be left for specialists of the textual data under consideration, afte
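Picking the most frequent terms is straightforward with a counter. This is a sketch: the text and stopword list are placeholders, and in practice `most_common(300)` would replace `most_common(2)`:

```python
from collections import Counter

text = "the states the united states the union"  # placeholder text
stopwords = {"the", "of", "in"}                  # placeholder stopword list

# Count every token that is not a stopword.
counts = Counter(w for w in text.split() if w not in stopwords)

top_terms = counts.most_common(2)  # use 300 for a real visualization
print(top_terms)
```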
== The end

//ST: The end!

Visit https://www.facebook.com/groups/gephi/[the Gephi group on Facebook] to get help,

or visit https://seinecle.github.io/gephi-tutorials/[the website for more tutorials].