Commit c4142bc

committed
typos
1 parent 03c4f5a commit c4142bc

9 files changed

Lines changed: 503 additions & 55 deletions

docs/generated-html/working-with-text-en.html

Lines changed: 10 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -516,7 +516,7 @@ <h1>Working with text in Gephi</h1>
516516
<div class="details">
517517
<span id="author" class="author">Clément Levallois</span><br>
518518
<span id="email" class="email"><a href="mailto:clementlevallois@gmail.com">clementlevallois@gmail.com</a></span><br>
519-
<span id="revdate">2017-02-28</span>
519+
<span id="revdate">2017-03-07</span>
520520
</div>
521521
<div id="toc" class="toc">
522522
<div id="toctitle">Table of Contents</div>
@@ -538,7 +538,7 @@ <h1>Working with text in Gephi</h1>
538538
</ul>
539539
</li>
540540
<li><a href="#__to_be_continued">(to be continued)</a></li>
541-
<li><a href="#_more_tutorials_on_importing_data_to_gephi">More tutorials on importing data to Gephi</a></li>
541+
<li><a href="#_more_tutorials_on_working_with_semantic_networks">More tutorials on working with semantic networks</a></li>
542542
<li><a href="#_the_end">the end</a></li>
543543
</ul>
544544
</div>
@@ -681,7 +681,7 @@ <h4 id="_2_considering_n_grams">2. Considering "n-grams"</h4>
681681
<p>Trigrams are interesting as well obviously (eg, <code>chocolate ice cream</code>).</p>
682682
</div>
683683
<div class="paragraph">
684-
<p>People often stop there, but I find quadrigrams can be meaningful as well, if less frequent: <code>United States of America</code>, <code>functional magnetic resonance imaging</code>, <code>The New York Times</code>, etc.</p>
684+
<p>People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: <code>United States of America</code>, <code>functional magnetic resonance imaging</code>, <code>The New York Times</code>, etc.</p>
685685
</div>
686686
<div class="paragraph">
687687
<p>Many tools exist to extract n-grams from texts, for example <a href="http://homepages.inf.ed.ac.uk/lzhang10/ngram.html">these programs which are under a free license</a>.</p>
@@ -709,10 +709,10 @@ <h4 id="_2_bis_considering_noun_phrases">2 bis. Considering "noun phrases"</h4>
709709
<div class="ulist">
710710
<ul>
711711
<li>
712-
<p>you need to detect adjectives and nouns in your text. This is very language dependent, and slow for large corpora.</p>
712+
<p>you need to detect adjectives and nouns in your text. This is language dependent (French puts adjectives after nouns, for instance), and the processing is slow for large corpora.</p>
713713
</li>
714714
<li>
715-
<p>what about verbs, and noun phrases comprising non adjectives, such as "United States <strong>of</strong> America"? These are not included in the network.</p>
715+
<p>what about verbs, and noun phrases comprising non adjectives, such as "United States <strong>of</strong> America"? These are not going to be included in the network.</p>
716716
</li>
717717
</ul>
718718
</div>
@@ -794,7 +794,7 @@ <h4 id="_2_representing_only_the_most_frequent_terms">2. Representing only the m
794794
<p>To visualize the semantic network <strong>for a long, single text</strong> the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).</p>
795795
</div>
796796
<div class="paragraph">
797-
<p>In the case of a colection of texts to visualize (several documents instead of one), two possibilities:</p>
797+
<p>In the case of a collection of texts to visualize (several documents instead of one), two possibilities:</p>
798798
</div>
799799
<div class="olist arabic">
800800
<ol class="arabic">
@@ -828,7 +828,7 @@ <h4 id="_2_representing_only_the_most_frequent_terms">2. Representing only the m
828828
<p>(to go further, here is a webpage giving a simple example: <a href="http://www.tfidf.com/" class="bare">http://www.tfidf.com/</a>)</p>
829829
</div>
830830
<div class="paragraph">
831-
<p>Should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?</p>
831+
<p>So, should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?</p>
832832
</div>
833833
<div class="paragraph">
834834
<p>Both are interesting, as they show different information. I&#8217;d suggest that the simple frequency count is easier to interpret.</p>
@@ -846,18 +846,9 @@ <h2 id="__to_be_continued">(to be continued)</h2>
846846
</div>
847847
</div>
848848
<div class="sect1">
849-
<h2 id="_more_tutorials_on_importing_data_to_gephi">More tutorials on importing data to Gephi</h2>
849+
<h2 id="_more_tutorials_on_working_with_semantic_networks">More tutorials on working with semantic networks</h2>
850850
<div class="sectionbody">
851-
<div class="ulist">
852-
<ul>
853-
<li>
854-
<p><a href="https://github.com/gephi/gephi/wiki/Import-CSV-Data">The Gephi wiki on importing csv</a></p>
855-
</li>
856-
<li>
857-
<p><a href="https://www.youtube.com/watch?v=3Im7vNRA2ns">Video "How to import a CSV into Gephi" by Jen Golbeck</a></p>
858-
</li>
859-
</ul>
860-
</div>
851+
861852
</div>
862853
</div>
863854
<div class="sect1">
@@ -892,7 +883,7 @@ <h2 id="_the_end">the end</h2>
892883
<div id="footer">
893884
<div id="footer-text">
894885
Version 1.0<br>
895-
Last updated 2017-03-07 21:46:55 CET
886+
Last updated 2017-03-07 22:06:25 CET
896887
</div>
897888
</div>
898889
</body>
-1.54 KB
Binary file not shown.

docs/generated-slides/subdir/working-with-text-en_temp_common.md

Lines changed: 8 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
= Working with text in Gephi
22
Clément Levallois <clementlevallois@gmail.com>
3-
2017-02-28
3+
2017-03-07
44

55
last modified: {docdate}
66

@@ -94,7 +94,7 @@ So, `United States` should probably be a meaningful unit, not just `United` and
9494

9595
Trigrams are interesting as well obviously (eg, `chocolate ice cream`).
9696

97-
People often stop there, but I find quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
97+
People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
9898

9999
Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
100100

@@ -115,9 +115,9 @@ Take `United States`: it is a noun (`States`) preceded by an adjective (`United`
115115

116116
This approach is interesting (implemented for example in the software http://www.vosviewer.com[Vosviewer]), but it has drawbacks:
117117

118-
- you need to detect adjectives and nouns in your text. This is very language dependent, and slow for large corpora.
118+
- you need to detect adjectives and nouns in your text. This is language dependent (French puts adjectives after nouns, for instance), and the processing is slow for large corpora.
119119

120-
- what about verbs, and noun phrases comprising non adjectives, such as "United States *of* America"? These are not included in the network.
120+
- what about verbs, and noun phrases comprising non adjectives, such as "United States *of* America"? These are not going to be included in the network.
121121

122122
//ST: !
123123
[start=3]
@@ -169,7 +169,7 @@ If ~ 300 words would fit in the visualization of the network, and the text you s
169169
To visualize the semantic network *for a long, single text* the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).
170170

171171

172-
In the case of a colection of texts to visualize (several documents instead of one), two possibilities:
172+
In the case of a collection of texts to visualize (several documents instead of one), two possibilities:
173173
//ST: !
174174

175175
1. Either you also take the most frequent terms across these documents, like before
@@ -197,7 +197,7 @@ Applying the tf-idf correction will highlight terms which are frequently used wi
197197
(to go further, here is a webpage giving a simple example: http://www.tfidf.com/)
198198

199199
//ST: !
200-
Should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?
200+
So, should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?
201201

202202
Both are interesting, as they show different information. I'd suggest that the simple frequency count is easier to interpret.
203203

@@ -207,12 +207,10 @@ tf-idf can be left for specialists of the textual data under consideration, afte
207207
//ST: (to be continued)
208208

209209

210-
== More tutorials on importing data to Gephi
211-
//ST: More tutorials on importing data to Gephi
210+
== More tutorials on working with semantic networks
211+
//ST: More tutorials on working with semantic networks
212212
//ST: !
213213

214-
- https://github.com/gephi/gephi/wiki/Import-CSV-Data[The Gephi wiki on importing csv]
215-
- https://www.youtube.com/watch?v=3Im7vNRA2ns[Video "How to import a CSV into Gephi" by Jen Golbeck]
216214

217215
== the end
218216

Lines changed: 240 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,240 @@
1+
= Working with text in Gephi
2+
Clément Levallois <clementlevallois@gmail.com>
3+
2017-02-28
4+
5+
last modified: {docdate}
6+
7+
:icons!:
8+
:iconsfont: font-awesome
9+
:revnumber: 1.0
10+
:example-caption!:
11+
:sourcedir: ../../../main/java
12+
13+
:title-logo-image: gephi-logo-2010-transparent.png[width="450" align="center"]
14+
15+
image::gephi-logo-2010-transparent.png[width="450" align="center"]
16+
{nbsp} +
17+
18+
//ST: 'Escape' or 'o' to see all sides, F11 for full screen, 's' for speaker notes
19+
20+
== Presentation of this tutorial
21+
//ST: Presentation of this tutorial
22+
23+
//ST: !
24+
This tutorial explains how to draw "semantic networks" like this one:
25+
26+
image::en/cooccurrences-computer/gephi-result-1-en.png[align="center", title="a semantic network"]
27+
{nbsp} +
28+
29+
We call "semantic network" a visualization where textual items (words, expressions) are connected to each other, like above.
30+
31+
//ST: !
32+
We will see in turn:
33+
34+
- why semantic networks are interesting
35+
- how to create a semantic network
36+
- tips and tricks to visualize semantic networks in the best possible way in Gephi
37+
38+
39+
== Why semantic networks?
40+
//ST: Why semantic networks?
41+
42+
A text, or many texts, can be hard to summarize.
43+
44+
Drawing a semantic network highlights the most frequent terms, shows how they relate to each other, and reveals the different groups, or "clusters", that they form.
45+
46+
Often, a cluster of terms characterizes a topic. Hence, converting a text into a semantic network helps detect topics in the text, from micro-topics to the general themes discussed in the documents.
47+
48+
//ST: !
49+
50+
Semantic networks are regular networks, where:
51+
52+
- nodes are words ("USA") or groups of words ("United States of America")
53+
54+
- relations usually signify co-occurrence: two words are connected if they co-occur.
55+
56+
It means that if you have a textual network, you can visualize it with Gephi just like any other network. Yet, not everything is the same, and this tutorial provides tips and tricks on why textual data can be a bit different from other data.
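The co-occurrence rule described above can be sketched in a few lines. This is a minimal illustration only (the sample sentences and the sentence-level co-occurrence window are invented for the example; real pipelines tokenize and window more carefully):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences):
    """Count how often two words appear in the same sentence."""
    edges = Counter()
    for sentence in sentences:
        # one co-occurrence per unordered pair of distinct words per sentence
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges

# invented toy corpus
sentences = [
    "gephi draws networks",
    "semantic networks connect words",
    "gephi draws semantic networks",
]
edges = cooccurrence_edges(sentences)
# A weighted edge list like this can be exported as CSV (Source,Target,Weight)
# and opened in Gephi like any other network.
for (a, b), weight in sorted(edges.items()):
    print(f"{a},{b},{weight}")
```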
57+
58+
//ST: !
59+
== Choosing what a "term" is in a semantic network
60+
//ST: !
61+
62+
The starting point can be: a term is a single word. So in this sentence, we would have 7 terms:
63+
64+
65+
My sister lives in the United States (7 words -> 7 terms)
66+
67+
This means that each single term is a meaningful semantic unit.
68+
69+
This approach is simple but not great. Look again at the sentence:
70+
71+
//ST: !
72+
73+
My sister lives in the United States
74+
75+
1. `My`, `in`, `the` are frequent terms which have no special significance: they should probably be discarded
76+
2. `United` and `States` are meaningful separately, but here they should probably be considered together: `United States`
77+
3. `lives` is the conjugated form of the verb `to live`. In a network, it would make sense to regroup `live`, `lives` and `lived` as one single node.
78+
79+
Analysts, facing each of these issues, have imagined several solutions:
80+
81+
//ST: !
82+
==== 1. Removing "stopwords"
83+
//ST: !
84+
85+
To remove these little terms without informational value, the most basic approach is to keep a list of them, and remove any word from the text which belongs to this list.
86+
87+
You can find a list of these useless terms in many languages, called "stopwords", http://www.ranks.nl/stopwords/[on this website].
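As a minimal sketch of this filtering step (the short stopword list here is illustrative only; in practice you would load a full, language-specific list such as the one linked above):

```python
# Tiny illustrative stopword list -- a real one is far longer.
STOPWORDS = {"my", "in", "the", "a", "an", "of", "to"}

def remove_stopwords(text):
    """Keep only the words of `text` that are not stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("My sister lives in the United States"))
# -> ['sister', 'lives', 'united', 'states']
```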
88+
89+
//ST: !
90+
[start=2]
91+
==== 2. Considering "n-grams"
92+
//ST: !
93+
94+
So, `United States` should probably be a meaningful unit, not just `United` and `States`. Because `United States` is composed of 2 terms, it is called a "bi-gram".
95+
96+
Trigrams are interesting as well obviously (eg, `chocolate ice cream`).
97+
98+
People often stop there, but I find that quadrigrams can be meaningful as well, if less frequent: `United States of America`, `functional magnetic resonance imaging`, `The New York Times`, etc.
99+
100+
Many tools exist to extract n-grams from texts, for example http://homepages.inf.ed.ac.uk/lzhang10/ngram.html[these programs which are under a free license].
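For small texts, extracting n-grams needs no special tool at all; a minimal sketch (the sample phrase is just an illustration):

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined into one term."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the united states of america".split()
print(ngrams(words, 2))  # bigrams, e.g. 'united states'
print(ngrams(words, 4))  # quadrigrams, e.g. 'united states of america'
```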
101+
102+
//ST: !
103+
[start=2]
104+
==== 2 bis. Considering "noun phrases"
105+
//ST: !
106+
107+
Another way to go beyond single-word terms (`United`, `States`) differs from n-grams. It says:
108+
109+
"delete everything in the text except groups of words made of nouns and adjectives, ending with a noun"
110+
111+
-> (these are called, a bit improperly, "noun phrases")
112+
113+
Take `United States`: it is a noun (`States`) preceded by an adjective (`United`). It will be considered as a valid term.
114+
115+
//ST: !
116+
117+
This approach is interesting (implemented for example in the software http://www.vosviewer.com[Vosviewer]), but it has drawbacks:
118+
119+
- you need to detect adjectives and nouns in your text. This is language dependent (French puts adjectives after nouns, for instance), and the processing is slow for large corpora.
120+
121+
- what about verbs, and noun phrases comprising non-adjectives, such as "United States *of* America"? These are not going to be included in the network.
122+
123+
//ST: !
124+
[start=3]
125+
==== 3. Stemming and lemmatization
126+
//ST: !
127+
128+
`live`, `lives`, `lived`: in a semantic network, it is probably useless to have 3 nodes, one for each of these 3 forms of the same root.
129+
130+
- Stemming consists in chopping off the end of words, so that here, we would have only `live`.
131+
- Lemmatization is the same, but in a more subtle way: it takes grammar into account. So, "good" and "better" would be reduced to "good" because the same basic semantic unit lies behind these two words, even if their lettering differs completely.
132+
133+
A tool performing lemmatization is https://textgrid.de/en/[TextGrid]. It has many functions for textual analysis, and lemmatization https://wiki.de.dariah.eu/display/TextGrid/The+Lemmatizer+Tool[is explained there].
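As a toy illustration of the idea behind stemming only (this naive suffix chopping is not the algorithm used by TextGrid or by real stemmers such as Porter's, which apply far subtler rules):

```python
# A deliberately naive stemmer: chop a few common English suffixes.
# Real stemmers use much more careful rules, and lemmatizers also
# take grammar into account ("better" -> "good").
SUFFIXES = ("ing", "s", "d")

def naive_stem(word):
    for suffix in SUFFIXES:
        # keep at least a 3-letter stem so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("live", "lives", "lived"):
    print(w, "->", naive_stem(w))  # all three collapse to 'live'
```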
134+
135+
136+
//ST: !
137+
== Should we represent all terms in a semantic network?
138+
139+
//ST: !
140+
We have seen that some words are more interesting than others in a corpus:
141+
142+
- stopwords should be removed,
143+
- some variants of a word (`lived`, `lives`) could be grouped together (`live`).
144+
- sequences of words (`baby phone`) can be added because they mean more than their words taken separately (`baby`, `phone`)
145+
146+
//ST: !
147+
Once this is done, we have transformed the text into plenty of words to represent. Should they all be included in the network?
148+
149+
Imagine we have a word appearing just once, in a single footnote of a 2,000-page text. Should this word appear? Probably not.
150+
151+
Which rule should we apply to keep or leave out a word?
152+
153+
//ST: !
154+
==== 1. Start with: how many words can fit in your visualization?
155+
//ST: !
156+
157+
A starting point can be the number of words you would like to see on a visualization. *A ballpark figure is 300 words max*:
158+
159+
- it already fills all the space of a computer screen.
160+
- 300 words provide enough information to allow the micro-topics of a text to be distinguished
161+
162+
More words can be crammed in, but in this case the viewer would have to take time zooming in and out and panning to explore the visualization. The viewer turns into an analyst, instead of a regular reader.
163+
164+
//ST: !
165+
==== 2. Representing only the most frequent terms
166+
//ST: !
167+
168+
If ~ 300 words would fit in the visualization of the network, and the text you start with contains 5,000 different words: which 300 words should be selected?
169+
170+
To visualize the semantic network *for a long, single text* the straightforward approach consists in picking the 300 most frequent words (or n-grams, see above).
171+
172+
173+
In the case of a collection of texts to visualize (several documents instead of one), two possibilities:
174+
//ST: !
175+
176+
1. Either you also take the most frequent terms across these documents, like before
177+
178+
2. Or you can apply a more subtle rule called "tf-idf", detailed below.
179+
180+
//ST: tf-idf
181+
182+
The idea with tf-idf is that terms which appear in all documents are not interesting, because they are so ubiquitous.
183+
184+
Example: you retrieve all the webpages mentioning the word `Gephi`, and then want to visualize the semantic network of the texts contained in these webpages.
185+
186+
//ST: !
187+
188+
-> by definition, all these webpages will mention Gephi, so Gephi will probably be the most frequent term.
189+
190+
-> so your network will end up with a node "Gephi" connected to many other terms, but you actually knew that. Boring.
191+
192+
-> terms used in all web pages are less interesting to you than terms which are used frequently, but not uniformly across webpages.
193+
194+
//ST: !
195+
196+
Applying the tf-idf correction will highlight terms which are frequently used within some texts, but not used in many texts.
197+
198+
(to go further, here is a webpage giving a simple example: http://www.tfidf.com/)
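A minimal sketch of the textbook tf-idf weighting described above (the three tiny documents are invented for the example; libraries typically use smoothed variants of the formula):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document: tf * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count documents, not occurrences
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

# invented toy corpus, already tokenized
docs = [
    ["gephi", "network", "layout"],
    ["gephi", "semantic", "network"],
    ["gephi", "tutorial"],
]
scores = tf_idf(docs)
# 'gephi' appears in every document, so its idf is log(3/3) = 0: scored as boring.
print(scores[0]["gephi"])     # 0.0
print(scores[2]["tutorial"])  # > 0, because 'tutorial' is rare
```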
199+
200+
//ST: !
201+
So, should you visualize the most frequent words in your corpus, or the words which rank highest according to tf-idf?
202+
203+
Both are interesting, as they show different information. I'd suggest that the simple frequency count is easier to interpret.
204+
205+
tf-idf can be left for specialists of the textual data under consideration, after they have been presented with the simple frequency count version.
206+
207+
== (to be continued)
208+
//ST: (to be continued)
209+
210+
211+
== More tutorials on importing data to Gephi
212+
//ST: More tutorials on importing data to Gephi
213+
//ST: !
214+
215+
- https://github.com/gephi/gephi/wiki/Import-CSV-Data[The Gephi wiki on importing csv]
216+
- https://www.youtube.com/watch?v=3Im7vNRA2ns[Video "How to import a CSV into Gephi" by Jen Golbeck]
217+
218+
== the end
219+
220+
//ST: The end!
221+
Visit https://www.facebook.com/groups/gephi/[the Gephi group on Facebook] to get help,
222+
223+
or visit https://seinecle.github.io/gephi-tutorials/[the website for more tutorials]
224+
pass:[ <!-- Start of StatCounter Code for Default Guide -->
225+
<script type="text/javascript">
226+
var sc_project = ;
227+
var sc_invisible = 1;
228+
var sc_security = "";
229+
var scJsHost = (("https:" == document.location.protocol) ?
230+
"https://secure." : "http://www.");
231+
document.write("<sc" + "ript type='text/javascript' src='" +
232+
scJsHost +
233+
"statcounter.com/counter/counter.js'></" + "script>");
234+
</script>
235+
<noscript><div class="statcounter"><a title="site stats"
236+
href="http://statcounter.com/" target="_blank"><img
237+
class="statcounter"
238+
src="//c.statcounter.com//0//1/" alt="site
239+
stats"></a></div></noscript>
240+
<!-- End of StatCounter Code for Default Guide -->]
