
Commit 978a813

committed
formatting
1 parent c4142bc commit 978a813

8 files changed

Lines changed: 152 additions & 84 deletions

docs/generated-html/working-with-text-en.html

Lines changed: 15 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -605,7 +605,8 @@ <h2 id="_why_semantic_networks">Why semantic networks?</h2>
605605
<p>Drawing a semantic network highlights the most frequent terms, shows how they relate to each other, and reveals the different groups or "clusters" they form.</p>
606606
</div>
607607
<div class="paragraph">
608-
<p>Often, a cluster of terms characterizes a topic. Hence, converting a text into a semantic network helps detecting topics in the text, from micro-topics to the general themes discussed in the documents.</p>
608+
<p>Often, a cluster of terms characterizes a topic.
609+
Hence, converting a text into a semantic network helps detect topics in the text, from micro-topics to the general themes discussed in the documents.</p>
609610
</div>
610611
<div class="paragraph">
611612
<p>Semantic networks are regular networks, where:</p>
@@ -621,7 +622,10 @@ <h2 id="_why_semantic_networks">Why semantic networks?</h2>
621622
</ul>
622623
</div>
623624
<div class="paragraph">
624-
<p>It means that if you have a textual network, you can visualize it with Gephi just like any other network. Yet, not everything is the same, and this tutorial provides tips and tricks on why textual data can be a bit different than other data.</p>
625+
<p>It means that if you have a textual network, you can visualize it with Gephi just like any other network.</p>
626+
</div>
627+
<div class="paragraph">
628+
<p>Yet, not everything is the same, and this tutorial provides tips and tricks on how textual data can be a bit different from other data.</p>
625629
</div>
626630
</div>
627631
</div>
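The definition in this section (nodes are terms, edges count co-occurrences) can be sketched in a few lines of Python; the function name and toy sentences are illustrative, not part of the tutorial:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences):
    """Weight an edge between two words by how many sentences contain both."""
    edges = Counter()
    for sentence in sentences:
        # unique, sorted words so each unordered pair is counted once
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges

edges = cooccurrence_edges([
    "Gephi draws networks",
    "semantic networks connect words",
    "words form clusters",
])
```

Such a weighted edge list is exactly the kind of network data that can then be loaded into Gephi like any other graph.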
@@ -675,7 +679,8 @@ <h4 id="_1_removing_stopwords">1. Removing "stopwords"</h4>
675679
<div class="sect3">
676680
<h4 id="_2_considering_n_grams">2. Considering "n-grams"</h4>
677681
<div class="paragraph">
678-
<p>So, <code>United States</code> should probably be a meaningful unit, not just <code>United</code> and <code>States</code>. Because <code>United States</code> is composed of 2 terms, it is called a "bi-gram".</p>
682+
<p>So, <code>United States</code> should probably be a meaningful unit, not just <code>United</code> and <code>States</code>.
683+
Because <code>United States</code> is composed of 2 terms, it is called a "bi-gram".</p>
679684
</div>
680685
<div class="paragraph">
681686
<p>Trigrams are obviously interesting as well (e.g., <code>chocolate ice cream</code>).</p>
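As a rough illustration of how such units are extracted, a minimal n-gram function might look like this; it is a sketch, and real extractors also handle tokenization, punctuation, and frequency counting:

```python
def ngrams(words, n):
    """Return all n-grams (tuples of n consecutive words) in a word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "my sister lives in the united states".split()
bigrams = ngrams(tokens, 2)   # 6 bi-grams, including ('united', 'states')
trigrams = ngrams(tokens, 3)  # 5 tri-grams
```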
@@ -733,7 +738,8 @@ <h4 id="_3_stemming_and_lemmatization">3. Stemming and lemmatization</h4>
733738
</ul>
734739
</div>
735740
<div class="paragraph">
736-
<p>A tool performing lemmatization is <a href="https://textgrid.de/en/">TextGrid</a>. It has many functions for textual analysis, and lemmatization <a href="https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool">is explained there</a>.</p>
741+
<p>A tool performing lemmatization is <a href="https://textgrid.de/en/">TextGrid</a>.
742+
It has many functions for textual analysis, and lemmatization <a href="https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool">is explained there</a>.</p>
737743
</div>
738744
</div>
739745
</div>
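To make the idea concrete, here is a deliberately naive stemmer in Python; real stemmers (e.g. the Porter stemmer) use careful rule sets, and lemmatizers such as the TextGrid tool mentioned above rely on grammar:

```python
def crude_stem(word):
    """Naively chop a few common English suffixes off a word."""
    for suffix in ("ing", "ed", "s"):
        # only strip when a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# "lives" is reduced to "live", as in the tutorial's example,
# but "better" stays as-is: only lemmatization can map it to "good".
```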
@@ -761,7 +767,8 @@ <h2 id="_should_we_represent_all_terms_in_a_semantic_network">Should we represen
761767
<p>Once this is done, we have turned the text into a large set of words to represent. Should they all be included in the network?</p>
762768
</div>
763769
<div class="paragraph">
764-
<p>Imagine we have a word appearing just once, in a single footnote of a text long of 2,000 pages. Should this word appear? Probably not.</p>
770+
<p>Imagine we have a word appearing just once, in a single footnote of a 2,000-page text.
771+
Should this word appear? Probably not.</p>
765772
</div>
766773
<div class="paragraph">
767774
<p>Which rule should we apply to keep or leave out a word?</p>
@@ -782,7 +789,8 @@ <h4 id="_1_start_with_how_many_words_can_fit_in_your_visualization">1. Start wit
782789
</ul>
783790
</div>
784791
<div class="paragraph">
785-
<p>More words can be crammed in, but in this case the viewer would have to take time zooming in and out, panning to explore the visualization. The viewer transforms into an analyst, instead of a regular reader.</p>
792+
<p>More words can be crammed into a visualization, but in that case the viewer would have to spend time zooming in and out and panning to explore the visualization.
793+
The viewer transforms into an analyst, instead of a regular reader.</p>
786794
</div>
787795
</div>
788796
<div class="sect3">
@@ -883,7 +891,7 @@ <h2 id="_the_end">the end</h2>
883891
<div id="footer">
884892
<div id="footer-text">
885893
Version 1.0<br>
886-
Last updated 2017-03-07 22:06:25 CET
894+
Last updated 2017-03-07 22:17:16 CET
887895
</div>
888896
</div>
889897
</body>

docs/generated-slides/subdir/working-with-text-en_temp_common.md

Lines changed: 31 additions & 12 deletions
@@ -25,10 +25,12 @@ This tutorial explains how to draw "semantic networks" like this one:
2525
image::en/cooccurrences-computer/gephi-result-1-en.png[align="center", title="a semantic network"]
2626
{nbsp} +
2727

28+
//ST: !
29+
2830
We call a "semantic network" a visualization where textual items (words, expressions) are connected to each other, like above.
2931

30-
//ST: !
3132
We will see in turn:
33+
//ST: !
3234

3335
- why are semantic networks interesting
3436
- how to create a semantic network
@@ -37,12 +39,16 @@ We will see in turn:
3739

3840
== Why semantic networks?
3941
//ST: Why semantic networks?
42+
//ST: !
4043

4144
A text, or many texts, can be hard to summarize.
4245

4346
Drawing a semantic network highlights the most frequent terms, shows how they relate to each other, and reveals the different groups or "clusters" they form.
4447

45-
Often, a cluster of terms characterizes a topic. Hence, converting a text into a semantic network helps detecting topics in the text, from micro-topics to the general themes discussed in the documents.
48+
//ST: !
49+
50+
Often, a cluster of terms characterizes a topic.
51+
Hence, converting a text into a semantic network helps detect topics in the text, from micro-topics to the general themes discussed in the documents.
4652

4753
//ST: !
4854

@@ -52,15 +58,18 @@ Semantic networks are regular networks, where:
5258

5359
- relations usually signify a co-occurrence: two words are connected if they co-occur.
5460

55-
It means that if you have a textual network, you can visualize it with Gephi just like any other network. Yet, not everything is the same, and this tutorial provides tips and tricks on why textual data can be a bit different than other data.
56-
5761
//ST: !
62+
63+
It means that if you have a textual network, you can visualize it with Gephi just like any other network.
64+
65+
Yet, not everything is the same, and this tutorial provides tips and tricks on how textual data can be a bit different from other data.
66+
5867
== Choosing what a "term" is in a semantic network
68+
//ST: Choosing what a "term" is in a semantic network
5969
//ST: !
6070

6171
The starting point can be: a term is a single word. So in this sentence, we would have 7 terms:
6272

63-
6473
My sister lives in the United States (7 words -> 7 terms)
6574

6675
This means that each single term is a meaningful semantic unit.
@@ -90,10 +99,13 @@ You can find a list of these useless terms in many languages, called "stopwords"
9099
==== 2. Considering "n-grams"
91100
//ST: !
92101

93-
So, `United States` should probably be a meaningful unit, not just `United` and `States`. Because `United States` is composed of 2 terms, it is called a "bi-gram".
102+
So, `United States` should probably be a meaningful unit, not just `United` and `States`.
103+
Because `United States` is composed of 2 terms, it is called a "bi-gram".
94104

95105
Trigrams are obviously interesting as well (e.g., `chocolate ice cream`).
96106

107+
//ST: !
108+
97109
People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
98110

99111
Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
@@ -129,11 +141,14 @@ This approach is interesting (implemented for example in the software http://www
129141
- Stemming consists in chopping off the ends of words, so that here, we would have only `live`.
130142
- Lemmatization is the same, but in a more subtle way: it takes grammar into account. So, "good" and "better" would be reduced to "good" because the same basic semantic unit lies behind these two words, even if their lettering differs completely.
131143

132-
A tool performing lemmatization is https://textgrid.de/en/[TextGrid]. It has many functions for textual analysis, and lemmatization https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool[is explained there].
144+
//ST: !
145+
146+
A tool performing lemmatization is https://textgrid.de/en/[TextGrid].
147+
It has many functions for textual analysis, and lemmatization https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool[is explained there].
133148

134149

135-
//ST: !
136150
== Should we represent all terms in a semantic network?
151+
//ST: Should we represent all terms in a semantic network?
137152

138153
//ST: !
139154
We have seen that some words are more interesting than others in a corpus:
@@ -145,7 +160,8 @@ We have seen that some words are more interesting than others in a corpus:
145160
//ST: !
146161
Once this is done, we have turned the text into a large set of words to represent. Should they all be included in the network?
147162

148-
Imagine we have a word appearing just once, in a single footnote of a text long of 2,000 pages. Should this word appear? Probably not.
163+
Imagine we have a word appearing just once, in a single footnote of a 2,000-page text.
164+
Should this word appear? Probably not.
149165

150166
Which rule should we apply to keep or leave out a word?
151167

@@ -158,7 +174,10 @@ A starting point can be the number of words you would like to see on a visualiza
158174
- it already fills in all the space of a computer screen.
159175
- 300 words provide enough information to distinguish the micro-topics of a text.
160176

161-
More words can be crammed in, but in this case the viewer would have to take time zooming in and out, panning to explore the visualization. The viewer transforms into an analyst, instead of a regular reader.
177+
//ST: !
178+
179+
More words can be crammed into a visualization, but in that case the viewer would have to spend time zooming in and out and panning to explore the visualization.
180+
The viewer transforms into an analyst, instead of a regular reader.
162181

163182
//ST: !
164183
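The rule of thumb above (keep roughly 300 words, after dropping stopwords) can be sketched with a simple frequency count; the tiny stopword set here is only a placeholder for a real stopword list:

```python
from collections import Counter

def top_terms(words, n=300, stopwords=frozenset({"the", "a", "of", "in"})):
    """Drop stopwords, then keep the n most frequent remaining terms."""
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

words = "the united states in the united kingdom".split()
# top_terms(words, n=1) keeps only ('united', 2); "the" never appears.
```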
==== 2. Representing only the most frequent terms
@@ -168,8 +187,8 @@ If ~ 300 words would fit in the visualization of the network, and the text you s
168187

169188
To visualize the semantic network *for a long, single text* the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).
170189

171-
172190
In the case of a collection of texts to visualize (several documents instead of one), two possibilities:
191+
173192
//ST: !
174193

175194
1. Either you also take the most frequent terms across these documents, like before
@@ -213,8 +232,8 @@ tf-idf can be left for specialists of the textual data under consideration, afte
213232

214233

215234
== the end
216-
217235
//ST: The end!
236+
218237
Visit https://www.facebook.com/groups/gephi/[the Gephi group on Facebook] to get help,
219238

220239
or visit https://seinecle.github.io/gephi-tutorials/[the website for more tutorials]

docs/generated-slides/subdir/working-with-text-en_temp_html.md

Lines changed: 8 additions & 10 deletions
@@ -1,6 +1,6 @@
11
= Working with text in Gephi
22
Clément Levallois <clementlevallois@gmail.com>
3-
2017-02-28
3+
2017-03-07
44

55
last modified: {docdate}
66

@@ -95,7 +95,7 @@ So, `United States` should probably be a meaningful unit, not just `United` and
9595

9696
Trigrams are obviously interesting as well (e.g., `chocolate ice cream`).
9797

98-
People often stop there, but I find quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
98+
People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
9999

100100
Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
101101

@@ -116,9 +116,9 @@ Take `United States`: it is a noun (`States`) preceded by an adjective (`United`
116116

117117
This approach is interesting (implemented for example in the software http://www.vosviewer.com[Vosviewer]), but it has drawbacks:
118118

119-
- you need to detect adjectives and nouns in your text. This is very language dependent, and slow for large corpora.
119+
- you need to detect adjectives and nouns in your text. This is language dependent (French puts adjectives after nouns, for instance), and the processing is slow for large corpora.
120120

121-
- what about verbs, and noun phrases comprising non adjectives, such as "United States *of* America"? These are not included in the network.
121+
- what about verbs, and noun phrases comprising non-adjectives, such as "United States *of* America"? These are not going to be included in the network.
122122

123123
//ST: !
124124
[start=3]
@@ -170,7 +170,7 @@ If ~ 300 words would fit in the visualization of the network, and the text you s
170170
To visualize the semantic network *for a long, single text* the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).
171171

172172

173-
In the case of a colection of texts to visualize (several documents instead of one), two possibilities:
173+
In the case of a collection of texts to visualize (several documents instead of one), two possibilities:
174174
//ST: !
175175

176176
1. Either you also take the most frequent terms across these documents, like before
@@ -198,7 +198,7 @@ Applying the tf-idf correction will highlight terms which are frequently used wi
198198
(to go further, here is a webpage giving a simple example: http://www.tfidf.com/)
199199

200200
//ST: !
201-
Should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?
201+
So, should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?
202202

203203
Both are interesting, as they show different information. I'd suggest that the simple frequency count is easier to interpret.
204204
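Under the simple definition given above (tf is the raw count in a document, idf the log of the inverse document frequency), the comparison can be sketched as follows; the toy corpus is illustrative:

```python
import math

def tf_idf(term, doc, docs):
    """Simple tf-idf: raw count in doc times log(N / docs containing term)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

docs = [
    "gephi draws semantic networks".split(),
    "semantic networks connect words".split(),
    "words about chocolate ice cream".split(),
]
# "gephi" is rarer across documents than "semantic", so it scores higher
# in docs[0] even though both occur there exactly once.
```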

@@ -208,12 +208,10 @@ tf-idf can be left for specialists of the textual data under consideration, afte
208208
//ST: (to be continued)
209209

210210

211-
== More tutorials on importing data to Gephi
212-
//ST: More tutorials on importing data to Gephi
211+
== More tutorials on working with semantic networks
212+
//ST: More tutorials on working with semantic networks
213213
//ST: !
214214

215-
- https://github.com/gephi/gephi/wiki/Import-CSV-Data[The Gephi wiki on importing csv]
216-
- https://www.youtube.com/watch?v=3Im7vNRA2ns[Video "How to import a CSV into Gephi" by Jen Golbeck]
217215

218216
== the end
219217

docs/generated-slides/subdir/working-with-text-en_temp_pdf.md

Lines changed: 8 additions & 10 deletions
@@ -1,6 +1,6 @@
11
= Working with text in Gephi
22
Clément Levallois <clementlevallois@gmail.com>
3-
2017-02-28
3+
2017-03-07
44

55
last modified: {docdate}
66

@@ -95,7 +95,7 @@ So, `United States` should probably be a meaningful unit, not just `United` and
9595

9696
Trigrams are obviously interesting as well (e.g., `chocolate ice cream`).
9797

98-
People often stop there, but I find quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
98+
People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
9999

100100
Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
101101

@@ -116,9 +116,9 @@ Take `United States`: it is a noun (`States`) preceded by an adjective (`United`
116116

117117
This approach is interesting (implemented for example in the software http://www.vosviewer.com[Vosviewer]), but it has drawbacks:
118118

119-
- you need to detect adjectives and nouns in your text. This is very language dependent, and slow for large corpora.
119+
- you need to detect adjectives and nouns in your text. This is language dependent (French puts adjectives after nouns, for instance), and the processing is slow for large corpora.
120120

121-
- what about verbs, and noun phrases comprising non adjectives, such as "United States *of* America"? These are not included in the network.
121+
- what about verbs, and noun phrases comprising non-adjectives, such as "United States *of* America"? These are not going to be included in the network.
122122

123123
//ST: !
124124
[start=3]
@@ -170,7 +170,7 @@ If ~ 300 words would fit in the visualization of the network, and the text you s
170170
To visualize the semantic network *for a long, single text* the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).
171171

172172

173-
In the case of a colection of texts to visualize (several documents instead of one), two possibilities:
173+
In the case of a collection of texts to visualize (several documents instead of one), two possibilities:
174174
//ST: !
175175

176176
1. Either you also take the most frequent terms across these documents, like before
@@ -198,7 +198,7 @@ Applying the tf-idf correction will highlight terms which are frequently used wi
198198
(to go further, here is a webpage giving a simple example: http://www.tfidf.com/)
199199

200200
//ST: !
201-
Should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?
201+
So, should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?
202202

203203
Both are interesting, as they show different information. I'd suggest that the simple frequency count is easier to interpret.
204204

@@ -208,12 +208,10 @@ tf-idf can be left for specialists of the textual data under consideration, afte
208208
//ST: (to be continued)
209209

210210

211-
== More tutorials on importing data to Gephi
212-
//ST: More tutorials on importing data to Gephi
211+
== More tutorials on working with semantic networks
212+
//ST: More tutorials on working with semantic networks
213213
//ST: !
214214

215-
- https://github.com/gephi/gephi/wiki/Import-CSV-Data[The Gephi wiki on importing csv]
216-
- https://www.youtube.com/watch?v=3Im7vNRA2ns[Video "How to import a CSV into Gephi" by Jen Golbeck]
217215

218216
== the end
219217
