
Commit c7be2ab ("formatting")
1 parent 978a813

9 files changed: 73 additions & 33 deletions
Binary file (2.09 MB) not shown.

docs/generated-html/working-with-text-en.html

Lines changed: 2 additions & 2 deletions
@@ -686,7 +686,7 @@ <h4 id="_2_considering_n_grams">2. Considering "n-grams"</h4>
 <p>Trigrams are interesting as well obviously (eg, <code>chocolate ice cream</code>).</p>
 </div>
 <div class="paragraph">
-<p>People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: <code>United States of America</code>, <code>functional magnetic resonance imaging</code>, <code>The New York Times</code>, etc.</p>
+<p>People often stop there, but quadrigrams can be meaningful as well, if less frequent: <code>United States of America</code>, <code>functional magnetic resonance imaging</code>, <code>The New York Times</code>, etc.</p>
 </div>
 <div class="paragraph">
 <p>Many tools exist to extract n-grams from texts, for example <a href="http://homepages.inf.ed.ac.uk/lzhang10/ngram.html">these programs which are under a free license</a>.</p>
@@ -891,7 +891,7 @@ <h2 id="_the_end">the end</h2>
 <div id="footer">
 <div id="footer-text">
 Version 1.0<br>
-Last updated 2017-03-07 22:17:16 CET
+Last updated 2017-03-07 22:26:29 CET
 </div>
 </div>
 </body>
Binary file (131 Bytes) not shown.

docs/generated-slides/subdir/working-with-text-en_temp_common.md

Lines changed: 2 additions & 1 deletion
@@ -30,6 +30,7 @@ image::en/cooccurrences-computer/gephi-result-1-en.png[align="center", title="a
 We call "semantic network" a visualization where textual items (words, expressions) are connected to each others, like above.
 
 We will see in turn:
+
 //ST: !
 
 - why are semantic networks interesting
@@ -106,7 +107,7 @@ Trigrams are interesting as well obviously (eg, `chocolate ice cream`).
 
 //ST: !
 
-People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
+People often stop there, but quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
 
 Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
 
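The changed paragraphs above discuss bi-grams, trigrams and quadrigrams. A minimal sketch of what extracting and counting such n-grams looks like, assuming simple whitespace tokenization (the sample sentence is illustrative, not from the tutorial):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token sequence, joined as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "my sister lives in the united states of america".split()

# Bi-grams (n=2), trigrams (n=3) and quadrigrams (n=4), as in the tutorial text.
for n in (2, 3, 4):
    counts = Counter(ngrams(tokens, n))
    print(n, counts.most_common(3))
```

Real tools (such as the free-licensed programs linked in the diff) add tokenization, frequency cutoffs and scoring on top of this basic sliding window.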
docs/generated-slides/subdir/working-with-text-en_temp_html.md

Lines changed: 31 additions & 12 deletions
@@ -26,10 +26,12 @@ This tutorial explains how to draw "semantic networks" like this one:
 image::en/cooccurrences-computer/gephi-result-1-en.png[align="center", title="a semantic network"]
 {nbsp} +
 
+//ST: !
+
 We call "semantic network" a visualization where textual items (words, expressions) are connected to each others, like above.
 
-//ST: !
 We will see in turn:
+//ST: !
 
 - why are semantic networks interesting
 - how to create a semantic network
@@ -38,12 +40,16 @@ We will see in turn:
 
 == Why semantic networks?
 //ST: Why semantic networks?
+//ST: !
 
 A text, or many texts, can be hard to summarize.
 
 Drawing a semantic network highlights what are the most frequent terms, how they relate to each other, and reveal the different groups or "clusters" of they form.
 
-Often, a cluster of terms characterizes a topic. Hence, converting a text into a semantic network helps detecting topics in the text, from micro-topics to the general themes discussed in the documents.
+//ST: !
+
+Often, a cluster of terms characterizes a topic.
+Hence, converting a text into a semantic network helps detecting topics in the text, from micro-topics to the general themes discussed in the documents.
 
 //ST: !
 
@@ -53,15 +59,18 @@ Semantic networks are regular networks, where:
 
 - relations are, usually, signifying a co-occurrences: two words are connected if they co-occur.
 
-It means that if you have a textual network, you can visualize it with Gephi just like any other network. Yet, not everything is the same, and this tutorial provides tips and tricks on why textual data can be a bit different than other data.
-
 //ST: !
+
+It means that if you have a textual network, you can visualize it with Gephi just like any other network.
+
+Yet, not everything is the same, and this tutorial provides tips and tricks on why textual data can be a bit different than other data.
+
 == Choosing what a "term" is in a semantic network
+//ST: Choosing what a "term" is in a semantic network
 //ST: !
 
 The starting point can be: a term is a single word. So in this sentence, we would have 7 terms:
 
-
 My sister lives in the United States (7 words -> 7 terms)
 
 This means that each single term is a meaningful semantic unit.
@@ -91,10 +100,13 @@ You can find a list of these useless terms in many languages, called "stopwords"
 ==== 2. Considering "n-grams"
 //ST: !
 
-So, `United States` should probably be a meaningful unit, not just `United` and `States`. Because `United States` is composed of 2 terms, it is called a "bi-gram".
+So, `United States` should probably be a meaningful unit, not just `United` and `States`.
+Because `United States` is composed of 2 terms, it is called a "bi-gram".
 
 Trigrams are interesting as well obviously (eg, `chocolate ice cream`).
 
+//ST: !
+
 People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
 
 Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
@@ -130,11 +142,14 @@ This approach is interesting (implemented for example in the software http://www
 - Stemming consists in chopping the end of the words, so that here, we would have only `live`.
 - Lemmatization is the same, but in a more subtle way: it takes grammar into account. So, "good" and better" would be reduced to "good" because there is the same basic semantic unit behind these two words, even if their lettering differ completely.
 
-A tool performing lemmatization is https://textgrid.de/en/[TextGrid]. It has many functions for textual analysis, and lemmatization https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool[is explained there].
+//ST: !
+
+A tool performing lemmatization is https://textgrid.de/en/[TextGrid].
+It has many functions for textual analysis, and lemmatization https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool[is explained there].
 
 
-//ST: !
 == Should we represent all terms in a semantic network?
+//ST: Should we represent all terms in a semantic network?
 
 //ST: !
 We have seen that some words are more interesting than others in a corpus:
@@ -146,7 +161,8 @@ We have seen that some words are more interesting than others in a corpus:
 //ST: !
 Once this is done, we have transformed the text into plenty of words to represent. Should they all be included in the network?
 
-Imagine we have a word appearing just once, in a single footnote of a text long of 2,000 pages. Should this word appear? Probably not.
+Imagine we have a word appearing just once, in a single footnote of a text long of 2,000 pages.
+Should this word appear? Probably not.
 
 Which rule to apply to keep or leave out a word?
 
@@ -159,7 +175,10 @@ A starting point can be the number of words you would like to see on a visualiza
 - it already fills in all the space of a computer screen.
 - 300 words provides enough information to allow micro-topics of a text to be distinguished
 
-More words can be crammed in, but in this case the viewer would have to take time zooming in and out, panning to explore the visualization. The viewer transforms into an analyst, instead of a regular reader.
+//ST: !
+
+More words can be crammed in a visualization, but in this case the viewer would have to take time zooming in and out, panning to explore the visualization.
+The viewer transforms into an analyst, instead of a regular reader.
 
 //ST: !
 ==== 2. Representing only the most frequent terms
@@ -169,8 +188,8 @@ If ~ 300 words would fit in the visualization of the network, and the text you s
 
 To visualize the semantic network *for a long, single text* the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).
 
-
 In the case of a collection of texts to visualize (several documents instead of one), two possibilities:
+
 //ST: !
 
 1. Either you also take the most frequent terms across these documents, like before
@@ -214,8 +233,8 @@ tf-idf can be left for specialists of the textual data under consideration, afte
 
 
 == the end
-
 //ST: The end!
+
 Visit https://www.facebook.com/groups/gephi/[the Gephi group on Facebook] to get help,
 
 or visit https://seinecle.github.io/gephi-tutorials/[the website for more tutorials]
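
The tutorial text in this diff suggests keeping roughly the 300 most frequent terms, after stopword removal, before drawing the network. A minimal sketch of that selection step (the regex tokenizer and tiny stopword list are illustrative assumptions, not the tutorial's own code):

```python
import re
from collections import Counter

STOPWORDS = {"the", "in", "of", "a", "and"}  # tiny illustrative list; real lists are much longer

def top_terms(text, k=300):
    """Keep the k most frequent terms after lowercasing and stopword removal."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(k)]

print(top_terms("The states united the states of America", k=2))
```

In practice the same cutoff is applied to n-grams as well as single words, and the surviving terms become the nodes of the co-occurrence network visualized in Gephi.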

docs/generated-slides/subdir/working-with-text-en_temp_pdf.md

Lines changed: 31 additions & 12 deletions
@@ -26,10 +26,12 @@ This tutorial explains how to draw "semantic networks" like this one:
 image::en/cooccurrences-computer/gephi-result-1-en.png[align="center", title="a semantic network"]
 {nbsp} +
 
+//ST: !
+
 We call "semantic network" a visualization where textual items (words, expressions) are connected to each others, like above.
 
-//ST: !
 We will see in turn:
+//ST: !
 
 - why are semantic networks interesting
 - how to create a semantic network
@@ -38,12 +40,16 @@ We will see in turn:
 
 == Why semantic networks?
 //ST: Why semantic networks?
+//ST: !
 
 A text, or many texts, can be hard to summarize.
 
 Drawing a semantic network highlights what are the most frequent terms, how they relate to each other, and reveal the different groups or "clusters" of they form.
 
-Often, a cluster of terms characterizes a topic. Hence, converting a text into a semantic network helps detecting topics in the text, from micro-topics to the general themes discussed in the documents.
+//ST: !
+
+Often, a cluster of terms characterizes a topic.
+Hence, converting a text into a semantic network helps detecting topics in the text, from micro-topics to the general themes discussed in the documents.
 
 //ST: !
 
@@ -53,15 +59,18 @@ Semantic networks are regular networks, where:
 
 - relations are, usually, signifying a co-occurrences: two words are connected if they co-occur.
 
-It means that if you have a textual network, you can visualize it with Gephi just like any other network. Yet, not everything is the same, and this tutorial provides tips and tricks on why textual data can be a bit different than other data.
-
 //ST: !
+
+It means that if you have a textual network, you can visualize it with Gephi just like any other network.
+
+Yet, not everything is the same, and this tutorial provides tips and tricks on why textual data can be a bit different than other data.
+
 == Choosing what a "term" is in a semantic network
+//ST: Choosing what a "term" is in a semantic network
 //ST: !
 
 The starting point can be: a term is a single word. So in this sentence, we would have 7 terms:
 
-
 My sister lives in the United States (7 words -> 7 terms)
 
 This means that each single term is a meaningful semantic unit.
@@ -91,10 +100,13 @@ You can find a list of these useless terms in many languages, called "stopwords"
 ==== 2. Considering "n-grams"
 //ST: !
 
-So, `United States` should probably be a meaningful unit, not just `United` and `States`. Because `United States` is composed of 2 terms, it is called a "bi-gram".
+So, `United States` should probably be a meaningful unit, not just `United` and `States`.
+Because `United States` is composed of 2 terms, it is called a "bi-gram".
 
 Trigrams are interesting as well obviously (eg, `chocolate ice cream`).
 
+//ST: !
+
 People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
 
 Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
@@ -130,11 +142,14 @@ This approach is interesting (implemented for example in the software http://www
 - Stemming consists in chopping the end of the words, so that here, we would have only `live`.
 - Lemmatization is the same, but in a more subtle way: it takes grammar into account. So, "good" and better" would be reduced to "good" because there is the same basic semantic unit behind these two words, even if their lettering differ completely.
 
-A tool performing lemmatization is https://textgrid.de/en/[TextGrid]. It has many functions for textual analysis, and lemmatization https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool[is explained there].
+//ST: !
+
+A tool performing lemmatization is https://textgrid.de/en/[TextGrid].
+It has many functions for textual analysis, and lemmatization https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool[is explained there].
 
 
-//ST: !
 == Should we represent all terms in a semantic network?
+//ST: Should we represent all terms in a semantic network?
 
 //ST: !
 We have seen that some words are more interesting than others in a corpus:
@@ -146,7 +161,8 @@ We have seen that some words are more interesting than others in a corpus:
 //ST: !
 Once this is done, we have transformed the text into plenty of words to represent. Should they all be included in the network?
 
-Imagine we have a word appearing just once, in a single footnote of a text long of 2,000 pages. Should this word appear? Probably not.
+Imagine we have a word appearing just once, in a single footnote of a text long of 2,000 pages.
+Should this word appear? Probably not.
 
 Which rule to apply to keep or leave out a word?
 
@@ -159,7 +175,10 @@ A starting point can be the number of words you would like to see on a visualiza
 - it already fills in all the space of a computer screen.
 - 300 words provides enough information to allow micro-topics of a text to be distinguished
 
-More words can be crammed in, but in this case the viewer would have to take time zooming in and out, panning to explore the visualization. The viewer transforms into an analyst, instead of a regular reader.
+//ST: !
+
+More words can be crammed in a visualization, but in this case the viewer would have to take time zooming in and out, panning to explore the visualization.
+The viewer transforms into an analyst, instead of a regular reader.
 
 //ST: !
 ==== 2. Representing only the most frequent terms
@@ -169,8 +188,8 @@ If ~ 300 words would fit in the visualization of the network, and the text you s
 
 To visualize the semantic network *for a long, single text* the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).
 
-
 In the case of a collection of texts to visualize (several documents instead of one), two possibilities:
+
 //ST: !
 
 1. Either you also take the most frequent terms across these documents, like before
@@ -214,8 +233,8 @@ tf-idf can be left for specialists of the textual data under consideration, afte
 
 
 == the end
-
 //ST: The end!
+
 Visit https://www.facebook.com/groups/gephi/[the Gephi group on Facebook] to get help,
 
 or visit https://seinecle.github.io/gephi-tutorials/[the website for more tutorials]
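
The hunk context above mentions tf-idf as an alternative to raw frequency for selecting terms across a collection of documents. A sketch of one plain-vanilla tf-idf formulation (term frequency times log inverse document frequency); this is a textbook variant for illustration, not necessarily the exact weighting the tutorial has in mind:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term per document: term frequency * log(N / document frequency)."""
    df = Counter()  # in how many documents each term appears
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] / len(doc) * math.log(n / df[t]) for t in tf})
    return scores

# A term present in every document (here "gephi") scores zero:
docs = [["gephi", "network", "network"], ["gephi", "text"]]
for s in tf_idf(docs):
    print(sorted(s.items(), key=lambda kv: -kv[1]))
```

This is why tf-idf surfaces terms distinctive of a document rather than terms that are merely frequent everywhere, which matches the tutorial's point that it is a choice for specialists of the corpus at hand.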

docs/generated-slides/subdir/working-with-text-en_temp_slides.md

Lines changed: 2 additions & 1 deletion
@@ -31,6 +31,7 @@ image::en/cooccurrences-computer/gephi-result-1-en.png[align="center", title="a
 We call "semantic network" a visualization where textual items (words, expressions) are connected to each others, like above.
 
 We will see in turn:
+
 == !
 
 - why are semantic networks interesting
@@ -105,7 +106,7 @@ Trigrams are interesting as well obviously (eg, `chocolate ice cream`).
 
 == !
 
-People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
+People often stop there, but quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
 
 Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
 

docs/generated-slides/working-with-text-en.html

Lines changed: 3 additions & 4 deletions
@@ -93,9 +93,8 @@
 <section><div class="paragraph"><p>This tutorial explains how to draw "semantic networks" like this one:</p></div>
 <div class="imageblock stretch" style="text-align: center"><img src="images/en/cooccurrences-computer/gephi-result-1-en.png" alt="gephi result 1 en" height="100%" /></div><div class="title">Figure 1. a semantic network</div></section>
 <section><div class="paragraph"><p>We call "semantic network" a visualization where textual items (words, expressions) are connected to each others, like above.</p></div>
-<div class="paragraph"><p>We will see in turn:
-== !</p></div>
-<div class="ulist"><ul><li><p>why are semantic networks interesting</p></li><li><p>how to create a semantic network</p></li><li><p>tips and tricks to visualize semantic networks in the best possible way in Gephi</p></li></ul></div></section>
+<div class="paragraph"><p>We will see in turn:</p></div></section>
+<section><div class="ulist"><ul><li><p>why are semantic networks interesting</p></li><li><p>how to create a semantic network</p></li><li><p>tips and tricks to visualize semantic networks in the best possible way in Gephi</p></li></ul></div></section>
 <section id="_why_semantic_networks"><h2>Why semantic networks?</h2></section>
 <section><div class="paragraph"><p>A text, or many texts, can be hard to summarize.</p></div>
 <div class="paragraph"><p>Drawing a semantic network highlights what are the most frequent terms, how they relate to each other, and reveal the different groups or "clusters" of they form.</p></div></section>
@@ -120,7 +119,7 @@
 <section><div class="paragraph"><p>So, <code>United States</code> should probably be a meaningful unit, not just <code>United</code> and <code>States</code>.
 Because <code>United States</code> is composed of 2 terms, it is called a "bi-gram".</p></div>
 <div class="paragraph"><p>Trigrams are interesting as well obviously (eg, <code>chocolate ice cream</code>).</p></div></section>
-<section><div class="paragraph"><p>People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: <code>United States of America</code>, <code>functional magnetic resonance imaging</code>, <code>The New York Times</code>, etc.</p></div>
+<section><div class="paragraph"><p>People often stop there, but quadrigrams can be meaningful as well, if less frequent: <code>United States of America</code>, <code>functional magnetic resonance imaging</code>, <code>The New York Times</code>, etc.</p></div>
 <div class="paragraph"><p>Many tools exist to extract n-grams from texts, for example <a href="http://homepages.inf.ed.ac.uk/lzhang10/ngram.html">these programs which are under a free license</a>.</p></div></section>
 <section><h3>2 bis. Considering "noun phrases"</h3></section>
 <section><div class="paragraph"><p>Another approach to go beyond single word terms (<code>United</code>, <code>States</code>) takes a different approach than n-grams. It says:</p></div>

src/main/asciidoc/en/working-with-text-en.adoc

Lines changed: 2 additions & 1 deletion
@@ -30,6 +30,7 @@ image::en/cooccurrences-computer/gephi-result-1-en.png[align="center", title="a
 We call "semantic network" a visualization where textual items (words, expressions) are connected to each others, like above.
 
 We will see in turn:
+
 //ST: !
 
 - why are semantic networks interesting
@@ -106,7 +107,7 @@ Trigrams are interesting as well obviously (eg, `chocolate ice cream`).
 
 //ST: !
 
-People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
+People often stop there, but quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
 
 Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
 
