seinecle
diff --git a/‎docs/asciidoctor-reveal.js-master.zip‎
-2.09 MB b/‎docs/asciidoctor-reveal.js-master.zip‎
diff --git a/‎docs/generated-html/working-with-text-en.html‎
Lines changed: 185 additions & 8 deletions b/‎docs/generated-html/working-with-text-en.html‎
diff --git a/‎docs/generated-pdf/images/.png‎
91.7 KB b/‎docs/generated-pdf/images/.png‎
diff --git a/‎docs/generated-pdf/images/3-groups-of-icons.png‎
38.1 KB b/‎docs/generated-pdf/images/3-groups-of-icons.png‎
diff --git a/‎docs/generated-pdf/images/A-few-nodes-have-been-created.png‎
18.1 KB b/‎docs/generated-pdf/images/A-few-nodes-have-been-created.png‎
diff --git a/‎docs/generated-pdf/images/Adding-a-column-for-Names.png‎
164 KB b/‎docs/generated-pdf/images/Adding-a-column-for-Names.png‎
diff --git a/‎docs/generated-pdf/images/Adding-terms-and-launching-the-collection-of-tweets.png‎
38.3 KB b/‎docs/generated-pdf/images/Adding-terms-and-launching-the-collection-of-tweets.png‎
diff --git a/‎docs/generated-pdf/images/Adjusting-edge-thickness.png‎
111 KB b/‎docs/generated-pdf/images/Adjusting-edge-thickness.png‎
diff --git a/‎docs/generated-pdf/images/Adjusting-label-size.png‎
106 KB b/‎docs/generated-pdf/images/Adjusting-label-size.png‎
diff --git a/‎docs/generated-pdf/images/An-Excel-file-with-weights.png‎
116 KB b/‎docs/generated-pdf/images/An-Excel-file-with-weights.png‎
@@ -537,6 +537,13 @@ <h1>Working with text in Gephi</h1>
 <li><a href="#_2_representing_only_the_most_frequent_terms">2. Representing only the most frequent terms</a></li>
 </ul>
 </li>
+<li><a href="#_computing_connections_edges_in_the_network">Computing connections (edges) in the network</a>
+<ul class="sectlevel3">
+<li><a href="#_1_co_occurrences">1. Co-occurrences</a></li>
+<li><a href="#_2_what_weight_for_the_edges">2. What "weight" for the edges?</a></li>
+</ul>
+</li>
+<li><a href="#_visualizing_semantic_networks_with_gephi">Visualizing semantic networks with Gephi</a></li>
 <li><a href="#__to_be_continued">(to be continued)</a></li>
 <li><a href="#_more_tutorials_on_working_with_semantic_networks">More tutorials on working with semantic networks</a></li>
 <li><a href="#_the_end">the end</a></li>
@@ -547,7 +554,7 @@ <h1>Working with text in Gephi</h1>
 <div id="preamble">
 <div class="sectionbody">
 <div class="paragraph">
-<p>last modified: 2017-03-07</p>
+<p>last modified: 2017-03-08</p>
 </div>
 <div class="imageblock" style="text-align: center">
 <div class="content">
@@ -602,7 +609,7 @@ <h2 id="_why_semantic_networks">Why semantic networks?</h2>
 <p>A text, or many texts, can be hard to summarize.</p>
 </div>
 <div class="paragraph">
-<p>Drawing a semantic network highlights what are the most frequent terms, how they relate to each other, and reveal the different groups or "clusters" of they form.</p>
+<p>Drawing a semantic network highlights the most frequent terms and how they relate to each other, and reveals the different groups or "clusters" they form.</p>
 </div>
 <div class="paragraph">
 <p>Often, a cluster of terms characterizes a topic.
@@ -617,7 +624,7 @@ <h2 id="_why_semantic_networks">Why semantic networks?</h2>
 <p>nodes are words ("USA") or groups of words ("United States of America")</p>
 </li>
 <li>
-<p>relations are, usually, signifying a co-occurrences: two words are connected if they co-occur.</p>
+<p>relations usually signify co-occurrences: two words are connected if they appear in the same document, the same paragraph, or the same sentence&#8230;&#8203; you decide.</p>
 </li>
 </ul>
 </div>
@@ -699,7 +706,7 @@ <h4 id="_2_bis_considering_noun_phrases">2 bis. Considering "noun phrases"</h4>
 </div>
 <div class="literalblock">
 <div class="content">
-<pre>"delete all in the text except for all groups of words made of nouns and adjectives, ending by a noun"</pre>
+<pre>"delete everything in the text except for groups of words made of nouns and adjectives, ending with a noun"</pre>
 </div>
 </div>
 <div class="paragraph">
@@ -845,6 +852,176 @@ <h4 id="_2_representing_only_the_most_frequent_terms">2. Representing only the m
 <p>tf-idf can be left for specialists of the textual data under consideration, after they have been presented with the simple frequency count version.</p>
 </div>
 </div>
+</div>
+</div>
+<div class="sect1">
+<h2 id="_computing_connections_edges_in_the_network">Computing connections (edges) in the network</h2>
+<div class="sectionbody">
+<div class="paragraph">
+<p>We have now extracted the most interesting / meaningful terms from the text.
+How do we decide which connections between them make sense?</p>
+</div>
+<div class="sect3">
+<h4 id="_1_co_occurrences">1. Co-occurrences</h4>
+<div class="paragraph">
+<p>Connections between terms are usually drawn from co-occurrences: two terms will be connected if they appear together in some pre-defined unit of text:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>in the same sentence</p>
+</li>
+<li>
+<p>in the same paragraph</p>
+</li>
+<li>
+<p>in the same document (if the corpus is made of several documents)</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>(note on vocabulary: in what follows, we will call each such sentence, paragraph or document a "unit of text").</p>
+</div>
+<div class="paragraph">
+<p>For example, in bibliometrics (the study of the publications produced by scientists), this could give:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>collect <strong>abstracts</strong> (short summaries) of all scientific articles discussing "nano-technologies".</p>
+</li>
+<li>
+<p>so, abstracts are our units of text here.</p>
+</li>
+<li>
+<p>two terms will be connected if they frequently appear <strong>in the same abstracts</strong>.</p>
+</li>
+</ul>
+</div>
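+<div class="paragraph">
+<p>To make the choice of a unit of text concrete, here is a minimal Python sketch. The helper names <code>split_into_units</code> and <code>terms_in_unit</code>, the naive splitting rules (blank lines for paragraphs, end punctuation for sentences) and the sample text are all assumptions made for illustration, not part of any Gephi tooling.</p>
+</div>

```python
import re


def split_into_units(text, unit="sentence"):
    """Split a text into units of text (naive rules, for illustration only)."""
    if unit == "document":
        return [text.strip()]
    if unit == "paragraph":
        # Assumption: paragraphs are separated by blank lines.
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if unit == "sentence":
        # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    raise ValueError(f"unknown unit: {unit}")


def terms_in_unit(unit_text, vocabulary):
    """Return the vocabulary terms (single words here) found in one unit."""
    words = set(re.findall(r"[a-z]+", unit_text.lower()))
    return {term for term in vocabulary if term in words}


# Invented sample text and vocabulary.
text = "Gold is a metal. Nanotubes are made of carbon.\n\nCarbon and gold are elements."
vocabulary = {"gold", "carbon", "metal"}

for unit in ("sentence", "paragraph", "document"):
    units = split_into_units(text, unit)
    print(unit, [sorted(terms_in_unit(u, vocabulary)) for u in units])
```

+<div class="paragraph">
+<p>With this sample, <code>gold</code> and <code>carbon</code> share a unit only at the paragraph or document level, which shows how the choice of unit changes which terms end up connected.</p>
+</div>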
+</div>
+<div class="sect3">
+<h4 id="_2_what_weight_for_the_edges">2. What "weight" for the edges?</h4>
+<div class="paragraph">
+<p>An edge between two terms will have:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>weight of "1" if these two terms co-occur in just one unit of text.</p>
+</li>
+<li>
+<p>weight of "2" if they co-occur in two units of text.</p>
+</li>
+<li>
+<p>etc&#8230;&#8203;</p>
+</li>
+</ul>
+</div>
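+<div class="paragraph">
+<p>This counting rule can be sketched in a few lines of Python. The sketch assumes each unit of text has already been reduced to the set of terms it contains; the function name <code>cooccurrence_weights</code> and the sample units are invented for the example.</p>
+</div>

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(units):
    """Count, for each pair of terms, the number of units where both appear.

    units: list of sets of terms, one set per unit of text.
    Returns a Counter mapping (term_a, term_b) to the edge weight,
    with term_a < term_b alphabetically so each pair has a single key.
    """
    weights = Counter()
    for terms in units:
        for a, b in combinations(sorted(terms), 2):
            weights[(a, b)] += 1
    return weights


# Invented sample: three units of text, already reduced to their terms.
units = [
    {"carbon", "nanotube", "molecular"},
    {"carbon", "nanotube"},
    {"molecular", "nanotechnology"},
]
weights = cooccurrence_weights(units)
for (a, b), w in sorted(weights.items()):
    print(a, b, w)
```

+<div class="paragraph">
+<p>Here <code>carbon</code> and <code>nanotube</code> co-occur in two units, so their edge gets a weight of 2; every other pair present gets a weight of 1.</p>
+</div>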
+<div class="paragraph">
+<p>The logic is simple, and yet there are some refinements to discuss. It will be up to you to decide what&#8217;s preferable:</p>
+</div>
+<div class="sect4">
+<h5 id="_if_2_terms_appear_several_times_strong_in_a_given_unit_of_text_strong_should_their_co_occurences_be_counted_several_times">If 2 terms appear several times <strong>in a given unit of text</strong>, should their co-occurrences be counted several times?</h5>
+<div class="paragraph">
+<p>An example to clarify. Let&#8217;s imagine that we are interested in web pages discussing nanotechnology.
+We want to draw the semantic network of the vocabulary used in these web pages.</p>
+</div>
+<div class="paragraph">
+<p>Here, a co-occurrence means that 2 terms are used on the same web page.</p>
+</div>
+<div class="paragraph">
+<p>Among the pages we collected, there is the Wikipedia page discussing nanotechnology:</p>
+</div>
+<div class="quoteblock">
+<blockquote>
+<div class="paragraph">
+<p><span class="red">Nanotechnology</span> ("nanotech") is manipulation of matter on an atomic, <span class="blue">molecular</span>, and supramolecular scale.
+The earliest, widespread description of <span class="red">nanotechnology</span> referred to the particular technological goal of precisely manipulating atoms and molecules for fabrication of macroscale products, also now referred to as <span class="blue">molecular</span> <span class="red">nanotechnology</span></p>
+</div>
+</blockquote>
+<div class="attribution">
+&#8212; <a href="https://en.wikipedia.org/wiki/Nanotechnology">Wikipedia</a>
+</div>
+</div>
+<div class="paragraph">
+<p>The question is:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>should I count only <strong>one</strong> co-occurrence between <code>molecular</code> and <code>nanotechnology</code>, because it happened on this one web page?</p>
+</li>
+<li>
+<p>or should I consider that <code>molecular</code> appears twice on this page, and <code>nanotechnology</code> three times, so <strong>multiple</strong> co-occurrences between these 2 terms should be counted, just on this page already?</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>There is no single right answer, and you can experiment with both possibilities.</p>
+</div>
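+<div class="paragraph">
+<p>Both options can be compared on the Wikipedia excerpt quoted above with a short Python sketch. The two function names are invented, and multiplying the two occurrence counts is only one possible convention for counting "multiple" co-occurrences within a unit; it is an assumption of this example, not a rule stated in this tutorial.</p>
+</div>

```python
import re
from collections import Counter

# The excerpt quoted above, treated as one unit of text (one web page).
page = (
    'Nanotechnology ("nanotech") is manipulation of matter on an atomic, '
    "molecular, and supramolecular scale. The earliest, widespread description "
    "of nanotechnology referred to the particular technological goal of precisely "
    "manipulating atoms and molecules for fabrication of macroscale products, "
    "also now referred to as molecular nanotechnology"
)

# Occurrences of each whole word on the page, ignoring case.
counts = Counter(re.findall(r"[a-z]+", page.lower()))


def weight_once(counts, a, b):
    """Option 1: at most one co-occurrence per unit of text."""
    return 1 if counts[a] > 0 and counts[b] > 0 else 0


def weight_multiple(counts, a, b):
    """Option 2 (assumed convention): one co-occurrence per pair of occurrences."""
    return counts[a] * counts[b]


print(counts["molecular"], counts["nanotechnology"])
print(weight_once(counts, "molecular", "nanotechnology"))
print(weight_multiple(counts, "molecular", "nanotechnology"))
```

+<div class="paragraph">
+<p>With option 2, long pages that repeat their key terms weigh much more heavily in the network than short ones, which is worth keeping in mind when units of text vary a lot in length.</p>
+</div>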
+</div>
+<div class="sect4">
+<h5 id="_if_two_terms_are_very_frequent_is_their_co_occurrence_really_of_interest">If two terms are very frequent, is their co-occurrence really of interest?</h5>
+<div class="paragraph">
+<p>Example:</p>
+</div>
+<div class="paragraph">
+<p>Chun-Yuen Teng, Yu-Ru Lin and Lada Adamic have studied (using Gephi!) <a href="https://arxiv.org/abs/1111.3919">the pairing of ingredients in cooking recipes</a>.</p>
+</div>
+<div class="paragraph">
+<p>So, in their study the unit of text was the "recipe", and the terms in the semantic network were the ingredients in all these recipes.</p>
+</div>
+<div class="paragraph">
+<p>Just because they are so common, some ingredients (like <code>flour</code>, <code>sugar</code>, <code>salt</code>) are bound to appear more frequently in the same recipes (to co-occur), than infrequent ingredients.</p>
+</div>
+<div class="paragraph">
+<p>The authors of this study chose to highlight <strong>complementary ingredients</strong>: some ingredients are often used together in the same recipes, <em>even if they are ingredients which are quite rarely used</em>.</p>
+</div>
+<div class="paragraph">
+<p>"Complementary" here means that these ingredients have some interesting relationship: when one is used, the other "must" be used as well.</p>
+</div>
+<div class="paragraph">
+<p>If we just count co-occurrences, this special relationship between infrequent complementary ingredients will be lost: by definition, 2 infrequent ingredients can&#8217;t co-occur often.</p>
+</div>
+<div class="paragraph">
+<p>To fix this, one solution is to compare how many times the 2 ingredients co-occur with how frequent each of them is across all recipes:</p>
+</div>
+<div class="paragraph">
+<p>&#8594; ingredients co-occurring <em>each and every time they are used</em> will have a large edge weight,</p>
+</div>
+<div class="paragraph">
+<p>&#8594; ingredients co-occurring many times, <em>but also appearing many times in different recipes</em>, will get a low edge weight.</p>
+</div>
+<div class="paragraph">
+<p>A simple formula does this operation. For ingredients A and B:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>weight of edge between A and B =
+nb of recipes where A &amp; B co-occur
+divided by
+(total nb of recipes where A appears x total nb of recipes where B appears)</pre>
+</div>
+</div>
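+<div class="paragraph">
+<p>As a sketch, the formula above translates into Python as follows. The function name <code>normalized_weight</code> and the four toy recipes are invented for the example; they are not data from the study.</p>
+</div>

```python
def normalized_weight(recipes, a, b):
    """Co-occurrence count of a and b, divided by (count of a x count of b).

    recipes: list of sets of ingredients, one set per recipe.
    """
    n_a = sum(1 for r in recipes if a in r)
    n_b = sum(1 for r in recipes if b in r)
    n_ab = sum(1 for r in recipes if a in r and b in r)
    if n_a == 0 or n_b == 0:
        return 0.0  # an ingredient that never appears gets no edge
    return n_ab / (n_a * n_b)


# Invented toy data: flour and salt are everywhere, saffron and cardamom are rare.
recipes = [
    {"flour", "sugar", "salt"},
    {"flour", "sugar", "salt", "butter"},
    {"flour", "salt", "saffron", "cardamom"},
    {"flour", "sugar", "salt", "eggs"},
]

print(normalized_weight(recipes, "flour", "salt"))        # 4 / (4 * 4)
print(normalized_weight(recipes, "saffron", "cardamom"))  # 1 / (1 * 1)
```

+<div class="paragraph">
+<p>In this toy data, <code>flour</code> and <code>salt</code> co-occur four times but get a weight of 0.25, while <code>saffron</code> and <code>cardamom</code> co-occur only once and get a weight of 1.0, because they appear together every time they are used.</p>
+</div>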
+<div class="paragraph">
+<p>Taking the logarithm of this ratio, computed with probabilities instead of raw counts, gives a measure called "pointwise mutual information" (PMI):</p>
+</div>
+<div class="stemblock">
+<div class="content">
+\$PMI = log((p(A, B)) /(p(A) p(B)))\$
+</div>
+</div>
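+<div class="paragraph">
+<p>A possible Python reading of this formula is sketched below. It assumes that p(A) is estimated as the share of recipes containing A, that p(A, B) is the share containing both, and that the natural logarithm is used (the formula does not specify a base). The function name <code>pmi</code> and the toy recipes are invented for the example.</p>
+</div>

```python
import math


def pmi(recipes, a, b):
    """Pointwise mutual information of ingredients a and b.

    Probabilities are estimated as shares of recipes (an assumption of this
    sketch). Returns -inf when a and b never co-occur.
    """
    n = len(recipes)
    p_a = sum(1 for r in recipes if a in r) / n
    p_b = sum(1 for r in recipes if b in r) / n
    p_ab = sum(1 for r in recipes if a in r and b in r) / n
    if p_ab == 0:
        return float("-inf")
    return math.log(p_ab / (p_a * p_b))


# Invented toy data, not taken from the study.
recipes = [
    {"flour", "sugar", "salt"},
    {"flour", "sugar", "salt", "butter"},
    {"flour", "salt", "saffron", "cardamom"},
    {"flour", "sugar", "salt", "eggs"},
]

print(pmi(recipes, "flour", "salt"))
print(pmi(recipes, "saffron", "cardamom"))
print(pmi(recipes, "sugar", "saffron"))
```

+<div class="paragraph">
+<p>A PMI of 0 means the two ingredients co-occur exactly as often as their individual frequencies would predict; positive values indicate pairs that appear together more often than expected.</p>
+</div>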
+<div class="paragraph">
+<p>We now have nodes and their relations: a semantic network. Let&#8217;s see now how to visualize it in Gephi.</p>
+</div>
+</div>
+</div>
+</div>
+</div>
+<div class="sect1">
+<h2 id="_visualizing_semantic_networks_with_gephi">Visualizing semantic networks with Gephi</h2>
+<div class="sectionbody">
+
 </div>
 </div>
 <div class="sect1">
@@ -869,9 +1046,9 @@ <h2 id="_the_end">the end</h2>
 <p>or visit <a href="https://seinecle.github.io/gephi-tutorials/">the website for more tutorials</a>
     <!-- Start of StatCounter Code for Default Guide -->
     <script type="text/javascript">
-        var sc_project = ;
+        var sc_project = 11238920;
         var sc_invisible = 1;
-        var sc_security = "";
+        var sc_security = "8dac6cd5";
         var scJsHost = (("https:" == document.location.protocol) ?
             "https://secure." : "http://www.");
         document.write("<sc" + "ript type='text/javascript' src='" +
@@ -881,7 +1058,7 @@ <h2 id="_the_end">the end</h2>
     <noscript><div class="statcounter"><a title="site stats"
     href="http://statcounter.com/" target="_blank"><img
     class="statcounter"
-    src="//c.statcounter.com//0//1/" alt="site
+    src="//c.statcounter.com/11238920/0/8dac6cd5/1/" alt="site
     stats"></a></div></noscript>
     <!-- End of StatCounter Code for Default Guide --></p>
 </div>
@@ -891,7 +1068,7 @@ <h2 id="_the_end">the end</h2>
 <div id="footer">
 <div id="footer-text">
 Version 1.0<br>
-Last updated 2017-03-07 22:26:29 CET
+Last updated 2017-03-08 12:14:26 CET
 </div>
 </div>
 </body>