Commit bd51515

committed
modified pom, added main class to prepare for book ouput
1 parent c7be2ab commit bd51515

484 files changed

Lines changed: 22954 additions & 265 deletions

-2.09 MB
Binary file not shown.

docs/generated-html/working-with-text-en.html

Lines changed: 185 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -537,6 +537,13 @@ <h1>Working with text in Gephi</h1>
537537
<li><a href="#_2_representing_only_the_most_frequent_terms">2. Representing only the most frequent terms</a></li>
538538
</ul>
539539
</li>
540+
<li><a href="#_computing_connections_edges_in_the_network">Computing connections (edges) in the network</a>
541+
<ul class="sectlevel3">
542+
<li><a href="#_1_co_occurrences">1. Co-occurrences</a></li>
543+
<li><a href="#_2_what_weight_for_the_edges">2. What "weight" for the edges?</a></li>
544+
</ul>
545+
</li>
546+
<li><a href="#_visualizing_semantic_networks_with_gephi">Visualizing semantic networks with Gephi</a></li>
540547
<li><a href="#__to_be_continued">(to be continued)</a></li>
541548
<li><a href="#_more_tutorials_on_working_with_semantic_networks">More tutorials on working with semantic networks</a></li>
542549
<li><a href="#_the_end">the end</a></li>
@@ -547,7 +554,7 @@ <h1>Working with text in Gephi</h1>
547554
<div id="preamble">
548555
<div class="sectionbody">
549556
<div class="paragraph">
550-
<p>last modified: 2017-03-07</p>
557+
<p>last modified: 2017-03-08</p>
551558
</div>
552559
<div class="imageblock" style="text-align: center">
553560
<div class="content">
@@ -602,7 +609,7 @@ <h2 id="_why_semantic_networks">Why semantic networks?</h2>
602609
<p>A text, or many texts, can be hard to summarize.</p>
603610
</div>
604611
<div class="paragraph">
605-
<p>Drawing a semantic network highlights what are the most frequent terms, how they relate to each other, and reveal the different groups or "clusters" of they form.</p>
612+
<p>Drawing a semantic network highlights the most frequent terms, shows how they relate to each other, and reveals the different groups or "clusters" they form.</p>
606613
</div>
607614
<div class="paragraph">
608615
<p>Often, a cluster of terms characterizes a topic.
@@ -617,7 +624,7 @@ <h2 id="_why_semantic_networks">Why semantic networks?</h2>
617624
<p>nodes are words ("USA") or groups of words ("United States of America")</p>
618625
</li>
619626
<li>
620-
<p>relations are, usually, signifying a co-occurrences: two words are connected if they co-occur.</p>
627+
<p>relations usually signify co-occurrences: two words are connected if they appear in the same document, the same paragraph, or the same sentence&#8230;&#8203; you decide.</p>
621628
</li>
622629
</ul>
623630
</div>
@@ -699,7 +706,7 @@ <h4 id="_2_bis_considering_noun_phrases">2 bis. Considering "noun phrases"</h4>
699706
</div>
700707
<div class="literalblock">
701708
<div class="content">
702-
<pre>"delete all in the text except for all groups of words made of nouns and adjectives, ending by a noun"</pre>
709+
<pre>"delete everything in the text except groups of words made of nouns and adjectives, ending with a noun"</pre>
703710
</div>
704711
</div>
705712
<div class="paragraph">
@@ -845,6 +852,176 @@ <h4 id="_2_representing_only_the_most_frequent_terms">2. Representing only the m
845852
<p>tf-idf can be left for specialists of the textual data under consideration, after they have been presented with the simple frequency count version.</p>
846853
</div>
847854
</div>
855+
</div>
856+
</div>
857+
<div class="sect1">
858+
<h2 id="_computing_connections_edges_in_the_network">Computing connections (edges) in the network</h2>
859+
<div class="sectionbody">
860+
<div class="paragraph">
861+
<p>We have now extracted the most interesting or meaningful terms from the text.
862+
How do we decide which connections between them make sense?</p>
863+
</div>
864+
<div class="sect3">
865+
<h4 id="_1_co_occurrences">1. Co-occurrences</h4>
866+
<div class="paragraph">
867+
<p>Connections between terms are usually drawn from co-occurrences: two terms will be connected if they appear together in some pre-defined unit of text:</p>
868+
</div>
869+
<div class="ulist">
870+
<ul>
871+
<li>
872+
<p>in the same sentence</p>
873+
</li>
874+
<li>
875+
<p>in the same paragraph</p>
876+
</li>
877+
<li>
878+
<p>in the same document (if the corpus is made of several documents)</p>
879+
</li>
880+
</ul>
881+
</div>
882+
<div class="paragraph">
883+
<p>(note on vocabulary: in the following, we will call this a "unit of text").</p>
884+
</div>
885+
<div class="paragraph">
886+
<p>For example, in bibliometrics (the study of the publications produced by scientists), this could give:</p>
887+
</div>
888+
<div class="ulist">
889+
<ul>
890+
<li>
891+
<p>collect <strong>abstracts</strong> (short summaries) of all scientific articles discussing "nano-technologies".</p>
892+
</li>
893+
<li>
894+
<p>so, abstracts are our units of text here.</p>
895+
</li>
896+
<li>
897+
<p>two terms will be connected if they frequently appear <strong>in the same abstracts</strong>.</p>
898+
</li>
899+
</ul>
900+
</div>
901+
</div>
902+
<div class="sect3">
903+
<h4 id="_2_what_weight_for_the_edges">2. What "weight" for the edges?</h4>
904+
<div class="paragraph">
905+
<p>An edge between two terms will have:</p>
906+
</div>
907+
<div class="ulist">
908+
<ul>
909+
<li>
910+
<p>weight of "1" if these two terms co-occur in just one unit of text.</p>
911+
</li>
912+
<li>
913+
<p>weight of "2" if they co-occur in two units of text.</p>
914+
</li>
915+
<li>
916+
<p>etc&#8230;&#8203;</p>
917+
</li>
918+
</ul>
919+
</div>
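<div class="paragraph">
<p>The counting rule above can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular tool: the function name <code>cooccurrence_weights</code> and the sample units of text are hypothetical, and each unit is assumed to be already reduced to its list of extracted terms.</p>
</div>

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(units):
    """Count, for each pair of terms, how many units of text contain both."""
    weights = Counter()
    for terms in units:
        # set() counts a pair at most once per unit of text;
        # sorted() gives every pair a stable (alphabetical) order.
        for pair in combinations(sorted(set(terms)), 2):
            weights[pair] += 1
    return weights


# Hypothetical units of text, already reduced to their extracted terms
units = [
    ["nanotechnology", "molecular", "atoms"],
    ["nanotechnology", "molecular"],
    ["atoms", "fabrication"],
]

weights = cooccurrence_weights(units)
for (term_a, term_b), weight in sorted(weights.items()):
    print(term_a, term_b, weight)
```

<div class="paragraph">
<p>Each key of <code>weights</code> is an edge (a pair of terms) and each value is its weight, which is the kind of edge list that can then be exported for Gephi.</p>
</div>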
920+
<div class="paragraph">
921+
<p>The logic is simple, and yet there are some refinements to discuss. It will be up to you to decide what&#8217;s preferable:</p>
922+
</div>
923+
<div class="sect4">
924+
<h5 id="_if_2_terms_appear_several_times_strong_in_a_given_unit_of_text_strong_should_their_co_occurences_be_counted_several_times">If 2 terms appear several times <strong>in a given unit of text</strong>, should their co-occurrences be counted several times?</h5>
925+
<div class="paragraph">
926+
<p>An example to clarify. Let&#8217;s imagine that we are interested in webpages discussing nanotechnology.
927+
We want to draw the semantic network of the vocabulary used in these web pages.</p>
928+
</div>
929+
<div class="paragraph">
930+
<p>Here, a co-occurrence means that 2 terms are used on the same web page.</p>
931+
</div>
932+
<div class="paragraph">
933+
<p>Among the pages we collected, there is the Wikipedia page discussing nanotechnology:</p>
934+
</div>
935+
<div class="quoteblock">
936+
<blockquote>
937+
<div class="paragraph">
938+
<p><span class="red">Nanotechnology</span> ("nanotech") is manipulation of matter on an atomic, <span class="blue">molecular</span>, and supramolecular scale.
939+
The earliest, widespread description of <span class="red">nanotechnology</span> referred to the particular technological goal of precisely manipulating atoms and molecules for fabrication of macroscale products, also now referred to as <span class="blue">molecular</span> <span class="red">nanotechnology</span></p>
940+
</div>
941+
</blockquote>
942+
<div class="attribution">
943+
&#8212; <a href="https://en.wikipedia.org/wiki/Nanotechnology">Wikipedia</a>
944+
</div>
945+
</div>
946+
<div class="paragraph">
947+
<p>The question is:</p>
948+
</div>
949+
<div class="ulist">
950+
<ul>
951+
<li>
952+
<p>should I count only <strong>one</strong> co-occurrence between <code>molecular</code> and <code>nanotechnology</code>, because it happened on this one web page?</p>
953+
</li>
954+
<li>
955+
<p>or should I consider that <code>molecular</code> appears twice on this page, and <code>nanotechnology</code> three times, so <strong>multiple</strong> co-occurrences between these 2 terms should be counted, just on this page already?</p>
956+
</li>
957+
</ul>
958+
</div>
959+
<div class="paragraph">
960+
<p>There is no single right answer, and you can experiment with both possibilities.</p>
961+
</div>
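<div class="paragraph">
<p>The two options can be compared on the Wikipedia excerpt with a short Python sketch. The helper <code>pair_count</code> is hypothetical, and multiplying the two occurrence counts is only one possible convention for "multiple" co-occurrences (taking the smaller of the two counts would be another); it is an assumption made for illustration.</p>
</div>

```python
from collections import Counter


def pair_count(terms, a, b, count_repeats=False):
    """Co-occurrences of terms a and b within a single unit of text."""
    counts = Counter(terms)
    if counts[a] == 0 or counts[b] == 0:
        return 0  # the two terms do not co-occur in this unit
    if not count_repeats:
        return 1  # option 1: one co-occurrence per unit of text
    # Option 2 (assumed convention): every occurrence of a is paired
    # with every occurrence of b.
    return counts[a] * counts[b]


# Highlighted terms of the excerpt, in order of appearance
page = ["nanotechnology", "molecular", "nanotechnology",
        "molecular", "nanotechnology"]

once = pair_count(page, "molecular", "nanotechnology")
repeated = pair_count(page, "molecular", "nanotechnology", count_repeats=True)
print(once, repeated)
```

<div class="paragraph">
<p>With the second option, a single long page that repeats both terms can dominate the edge weight, which is worth keeping in mind when comparing the two networks.</p>
</div>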
962+
</div>
963+
<div class="sect4">
964+
<h5 id="_if_two_terms_are_very_frequent_is_their_co_occurrence_really_of_interest">If two terms are very frequent, is their co-occurrence really of interest?</h5>
965+
<div class="paragraph">
966+
<p>Example:</p>
967+
</div>
968+
<div class="paragraph">
969+
<p>Chun-Yuen Teng, Yu-Ru Lin and Lada Adamic have studied (using Gephi!) <a href="https://arxiv.org/abs/1111.3919">the pairing of ingredients in cooking recipes</a>.</p>
970+
</div>
971+
<div class="paragraph">
972+
<p>So, in their study the unit of text was the "recipe", and the terms in the semantic network were the ingredients in all these recipes.</p>
973+
</div>
974+
<div class="paragraph">
975+
<p>Simply because they are so common, some ingredients (like <code>flour</code>, <code>sugar</code>, <code>salt</code>) are bound to appear together in the same recipes (to co-occur) more often than infrequent ingredients.</p>
976+
</div>
977+
<div class="paragraph">
978+
<p>The authors of this study chose to highlight <strong>complementary ingredients</strong>: some ingredients are often used together in the same recipes, <em>even if they are ingredients which are quite rarely used</em>.</p>
979+
</div>
980+
<div class="paragraph">
981+
<p>"Complementary" here means that these ingredients have some interesting relationship: when one is used, the other "must" be used as well.</p>
982+
</div>
983+
<div class="paragraph">
984+
<p>If we just count co-occurrences, this special relationship between infrequent complementary ingredients will be lost: by definition, 2 infrequent ingredients can&#8217;t co-occur often.</p>
985+
</div>
986+
<div class="paragraph">
987+
<p>To fix this, a solution consists in comparing how many times the 2 ingredients co-occur, with how frequent they are in all recipes:</p>
988+
</div>
989+
<div class="paragraph">
990+
<p>&#8594; ingredients co-occurring <em>each and every time they are used</em> will have a large edge weight,</p>
991+
</div>
992+
<div class="paragraph">
993+
<p>&#8594; ingredients co-occurring many times, <em>but also appearing many times in different recipes</em>, will get a low edge weight.</p>
994+
</div>
995+
<div class="paragraph">
996+
<p>A simple formula does this operation. For ingredients A and B:</p>
997+
</div>
998+
<div class="literalblock">
999+
<div class="content">
1000+
<pre>weight of edge between A and B =
1001+
nb of recipes where A &amp; B co-occur
1002+
divided by
1003+
(total nb of recipes where A appears x total nb of recipes where B appears)</pre>
1004+
</div>
1005+
</div>
1006+
<div class="paragraph">
1007+
<p>A logarithm is often applied to this ratio, computed on probabilities rather than raw counts; the result is called "Pointwise mutual information" (PMI):</p>
1008+
</div>
1009+
<div class="stemblock">
1010+
<div class="content">
1011+
\$PMI = log((p(A, B)) /(p(A) p(B)))\$
1012+
</div>
1013+
</div>
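<div class="paragraph">
<p>As a rough sketch of how this could be computed, the Python snippet below estimates each probability as a share of recipes: p(A) is the fraction of recipes containing A, and p(A, B) the fraction containing both. The function name <code>pmi_weights</code>, the toy recipes and the choice of the natural logarithm are assumptions made for illustration, not taken from the study cited above.</p>
</div>

```python
import math
from collections import Counter
from itertools import combinations


def pmi_weights(recipes):
    """PMI for every pair of ingredients that co-occurs at least once."""
    n = len(recipes)
    term_counts = Counter()   # nb of recipes where each ingredient appears
    pair_counts = Counter()   # nb of recipes where each pair co-occurs
    for recipe in recipes:
        ingredients = sorted(set(recipe))
        term_counts.update(ingredients)
        pair_counts.update(combinations(ingredients, 2))
    return {
        (a, b): math.log(
            (n_ab / n) / ((term_counts[a] / n) * (term_counts[b] / n))
        )
        for (a, b), n_ab in pair_counts.items()
    }


# Hypothetical toy corpus: each recipe is a list of ingredients
recipes = [
    ["flour", "sugar", "salt"],
    ["flour", "sugar"],
    ["flour", "salt"],
    ["flour", "sugar", "salt", "saffron", "cardamom"],
    ["sugar", "salt"],
    ["flour"],
]

pmi = pmi_weights(recipes)
for pair in [("cardamom", "saffron"), ("flour", "sugar")]:
    print(pair, round(pmi[pair], 3))
```

<div class="paragraph">
<p>In this toy corpus the rare pair <code>cardamom</code> / <code>saffron</code>, which always appears together, gets a higher weight than the very common pair <code>flour</code> / <code>sugar</code>, even though the latter co-occurs in more recipes. Note that PMI can be negative, so the values may need to be shifted or filtered before being used as edge weights.</p>
</div>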
1014+
<div class="paragraph">
1015+
<p>We now have nodes and their relations: a semantic network. Let&#8217;s now see how to visualize it in Gephi.</p>
1016+
</div>
1017+
</div>
1018+
</div>
1019+
</div>
1020+
</div>
1021+
<div class="sect1">
1022+
<h2 id="_visualizing_semantic_networks_with_gephi">Visualizing semantic networks with Gephi</h2>
1023+
<div class="sectionbody">
1024+
8481025
</div>
8491026
</div>
8501027
<div class="sect1">
@@ -869,9 +1046,9 @@ <h2 id="_the_end">the end</h2>
8691046
<p>or visit <a href="https://seinecle.github.io/gephi-tutorials/">the website for more tutorials</a>
8701047
<!-- Start of StatCounter Code for Default Guide -->
8711048
<script type="text/javascript">
872-
var sc_project = ;
1049+
var sc_project = 11238920;
8731050
var sc_invisible = 1;
874-
var sc_security = "";
1051+
var sc_security = "8dac6cd5";
8751052
var scJsHost = (("https:" == document.location.protocol) ?
8761053
"https://secure." : "http://www.");
8771054
document.write("<sc" + "ript type='text/javascript' src='" +
@@ -881,7 +1058,7 @@ <h2 id="_the_end">the end</h2>
8811058
<noscript><div class="statcounter"><a title="site stats"
8821059
href="http://statcounter.com/" target="_blank"><img
8831060
class="statcounter"
884-
src="//c.statcounter.com//0//1/" alt="site
1061+
src="//c.statcounter.com/11238920/0/8dac6cd5/1/" alt="site
8851062
stats"></a></div></noscript>
8861063
<!-- End of StatCounter Code for Default Guide --></p>
8871064
</div>
@@ -891,7 +1068,7 @@ <h2 id="_the_end">the end</h2>
8911068
<div id="footer">
8921069
<div id="footer-text">
8931070
Version 1.0<br>
894-
Last updated 2017-03-07 22:26:29 CET
1071+
Last updated 2017-03-08 12:14:26 CET
8951072
</div>
8961073
</div>
8971074
</body>

docs/generated-pdf/images/.png

91.7 KB
Binary file not shown.
38.1 KB
18.1 KB
164 KB
38.3 KB
111 KB
106 KB
116 KB
