<p>A text, or many texts, can be hard to summarize.</p>
603
610
</div>
604
611
<div class="paragraph">
605
-
<p>Drawing a semantic network highlights what are the most frequent terms, how they relate to each other, and reveal the different groups or "clusters" of they form.</p>
612
+
<p>Drawing a semantic network highlights the most frequent terms and how they relate to each other, and reveals the different groups or "clusters" they form.</p>
606
613
</div>
607
614
<div class="paragraph">
608
615
<p>Often, a cluster of terms characterizes a topic.
<p>nodes are words ("USA") or groups of words ("United States of America")</p>
618
625
</li>
619
626
<li>
620
-
<p>relations are, usually, signifying a co-occurrences: two words are connected if they co-occur.</p>
627
+
<p>relations usually signify co-occurrences: two words are connected if they appear in the same document, the same paragraph, or the same sentence… you decide.</p>
<pre>"delete all in the text except for all groups of words made of nouns and adjectives, ending by a noun"</pre>
709
+
<pre>"delete everything in the text except groups of words made of nouns and adjectives, ending with a noun"</pre>
703
710
</div>
704
711
</div>
705
712
<div class="paragraph">
@@ -845,6 +852,176 @@ <h4 id="_2_representing_only_the_most_frequent_terms">2. Representing only the m
845
852
<p>tf-idf can be left for specialists of the textual data under consideration, after they have been presented with the simple frequency count version.</p>
846
853
</div>
847
854
</div>
855
+
</div>
856
+
</div>
857
+
<div class="sect1">
858
+
<h2 id="_computing_connections_edges_in_the_network">Computing connections (edges) in the network</h2>
859
+
<div class="sectionbody">
860
+
<div class="paragraph">
861
+
<p>We have now extracted the most interesting or meaningful terms from the text.
862
+
How do we decide which connections between them make sense?</p>
863
+
</div>
864
+
<div class="sect3">
865
+
<h4 id="_1_co_occurrences">1. Co-occurrences</h4>
866
+
<div class="paragraph">
867
+
<p>Connections between terms are usually drawn from co-occurrences. Two terms are connected if they appear together in the same pre-defined unit of text:</p>
868
+
</div>
869
+
<div class="ulist">
870
+
<ul>
871
+
<li>
872
+
<p>in the same sentence</p>
873
+
</li>
874
+
<li>
875
+
<p>in the same paragraph</p>
876
+
</li>
877
+
<li>
878
+
<p>in the same document (if the corpus is made of several documents)</p>
879
+
</li>
880
+
</ul>
881
+
</div>
882
+
<div class="paragraph">
883
+
<p>(note on vocabulary: in the following, we will call this a "unit of text").</p>
884
+
</div>
885
+
<div class="paragraph">
886
+
<p>For example, in bibliometrics (the study of the publications produced by scientists), this could give:</p>
887
+
</div>
888
+
<div class="ulist">
889
+
<ul>
890
+
<li>
891
+
<p>collect <strong>abstracts</strong> (short summaries) of all scientific articles discussing "nano-technologies".</p>
892
+
</li>
893
+
<li>
894
+
<p>so, abstracts are our units of text here.</p>
895
+
</li>
896
+
<li>
897
+
<p>two terms will be connected if they frequently appear <strong>in the same abstracts</strong>.</p>
898
+
</li>
899
+
</ul>
900
+
</div>
901
+
</div>
902
+
<div class="sect3">
903
+
<h4 id="_2_what_weight_for_the_edges">2. What "weight" for the edges?</h4>
904
+
<div class="paragraph">
905
+
<p>An edge between two terms will have:</p>
906
+
</div>
907
+
<div class="ulist">
908
+
<ul>
909
+
<li>
910
+
<p>weight of "1" if these two terms co-occur in just one unit of text.</p>
911
+
</li>
912
+
<li>
913
+
<p>weight of "2" if they co-occur in two units of text.</p>
914
+
</li>
915
+
<li>
916
+
<p>etc…​</p>
917
+
</li>
918
+
</ul>
919
+
</div>
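<div class="paragraph">
<p>A minimal Python sketch of this counting rule, assuming each unit of text has already been reduced to its list of extracted terms. The function name <code>cooccurrence_weights</code> and the sample units are hypothetical, chosen for illustration only.</p>
</div>

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(units):
    """For each pair of terms, count the units of text in which both appear."""
    weights = Counter()
    for terms in units:
        # set() makes sure a pair is counted at most once per unit of text;
        # sorted() gives each pair a stable (alphabetical) orientation.
        for pair in combinations(sorted(set(terms)), 2):
            weights[pair] += 1
    return weights


# Hypothetical units of text, each already reduced to its extracted terms
units = [
    ["nanotechnology", "molecular", "atom"],
    ["nanotechnology", "molecular"],
    ["atom", "energy"],
]

weights = cooccurrence_weights(units)
for (term_a, term_b), weight in sorted(weights.items()):
    print(term_a, term_b, weight)
```

<div class="paragraph">
<p>With these sample units, the edge between <code>molecular</code> and <code>nanotechnology</code> gets a weight of 2, because the two terms share two units of text.</p>
</div>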
920
+
<div class="paragraph">
921
+
<p>The logic is simple, and yet there are some refinements to discuss. It will be up to you to decide what’s preferable:</p>
922
+
</div>
923
+
<div class="sect4">
924
+
<h5 id="_if_2_terms_appear_several_times_strong_in_a_given_unit_of_text_strong_should_their_co_occurences_be_counted_several_times">If 2 terms appear several times <strong>in a given unit of text</strong>, should their co-occurrences be counted several times?</h5>
925
+
<div class="paragraph">
926
+
<p>An example to clarify. Let’s imagine that we are interested in web pages discussing nanotechnology.
927
+
We want to draw the semantic network of the vocabulary used in these web pages.</p>
928
+
</div>
929
+
<div class="paragraph">
930
+
<p>A co-occurrence happens when 2 terms are used on the same web page.</p>
931
+
</div>
932
+
<div class="paragraph">
933
+
<p>Among the pages we collected, there is the Wikipedia page discussing nanotechnology:</p>
934
+
</div>
935
+
<div class="quoteblock">
936
+
<blockquote>
937
+
<div class="paragraph">
938
+
<p><span class="red">Nanotechnology</span> ("nanotech") is manipulation of matter on an atomic, <span class="blue">molecular</span>, and supramolecular scale.
939
+
The earliest, widespread description of <span class="red">nanotechnology</span> referred to the particular technological goal of precisely manipulating atoms and molecules for fabrication of macroscale products, also now referred to as <span class="blue">molecular</span> <span class="red">nanotechnology</span></p>
<p>should I count only <strong>one</strong> co-occurrence between <code>molecular</code> and <code>nanotechnology</code>, because it happened on this one web page?</p>
953
+
</li>
954
+
<li>
955
+
<p>or should I consider that <code>molecular</code> appears twice on this page, and <code>nanotechnology</code> three times, so <strong>multiple</strong> co-occurrences between these 2 terms should be counted, just on this page already?</p>
956
+
</li>
957
+
</ul>
958
+
</div>
959
+
<div class="paragraph">
960
+
<p>There is no single right answer, and you can experiment with both possibilities.</p>
961
+
</div>
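<div class="paragraph">
<p>The two options can be compared with a short Python sketch. The text does not say how "multiple" co-occurrences should be computed, so multiplying the occurrence counts of the two terms is only one possible convention, assumed here for illustration (taking the smaller of the two counts would be another). Function names and the term list are hypothetical.</p>
</div>

```python
from collections import Counter


def pair_count_once(terms, a, b):
    """Option 1: one co-occurrence if both terms appear in the unit, else none."""
    return 1 if a in terms and b in terms else 0


def pair_count_multiple(terms, a, b):
    """Option 2 (assumed convention): occurrences of a times occurrences of b."""
    counts = Counter(terms)
    return counts[a] * counts[b]


# Terms of interest found in the quoted page, in order of appearance:
# "nanotechnology" appears three times and "molecular" twice.
page_terms = [
    "nanotechnology", "molecular", "nanotechnology",
    "molecular", "nanotechnology",
]

once = pair_count_once(page_terms, "molecular", "nanotechnology")
multiple = pair_count_multiple(page_terms, "molecular", "nanotechnology")
print(once, multiple)
```

<div class="paragraph">
<p>Under the first option this page contributes 1 to the edge weight; under the product convention it contributes 6, so long pages weigh much more heavily on the network.</p>
</div>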
962
+
</div>
963
+
<div class="sect4">
964
+
<h5 id="_if_two_terms_are_very_frequent_is_their_co_occurrence_really_of_interest">If two terms are very frequent, is their co-occurrence really of interest?</h5>
965
+
<div class="paragraph">
966
+
<p>Example:</p>
967
+
</div>
968
+
<div class="paragraph">
969
+
<p>Chun-Yuen Teng, Yu-Ru Lin and Lada Adamic have studied (using Gephi!) <a href="https://arxiv.org/abs/1111.3919">the pairing of ingredients in cooking recipes</a>.</p>
970
+
</div>
971
+
<div class="paragraph">
972
+
<p>So, in their study the unit of text was the "recipe", and the terms in the semantic network were the ingredients in all these recipes.</p>
973
+
</div>
974
+
<div class="paragraph">
975
+
<p>Just because they are so common, some ingredients (like <code>flour</code>, <code>sugar</code>, <code>salt</code>) are bound to appear in the same recipes (to co-occur) more frequently than infrequent ingredients do.</p>
976
+
</div>
977
+
<div class="paragraph">
978
+
<p>The authors of this study chose to highlight <strong>complementary ingredients</strong>: some ingredients are often used together in the same recipes, <em>even if they are rarely used overall</em>.</p>
979
+
</div>
980
+
<div class="paragraph">
981
+
<p>"Complementary" here means that these ingredients have some interesting relationship: when one is used, the other "must" be used as well.</p>
982
+
</div>
983
+
<div class="paragraph">
984
+
<p>If we just count co-occurrences, this special relationship between infrequent complementary ingredients will be lost: by definition, 2 infrequent ingredients can’t co-occur often.</p>
985
+
</div>
986
+
<div class="paragraph">
987
+
<p>To fix this, one solution is to compare how many times the 2 ingredients co-occur with how frequent each of them is across all recipes:</p>
988
+
</div>
989
+
<div class="paragraph">
990
+
<p>→ ingredients co-occurring <em>each and every time they are used</em> will have a large edge weight,</p>
991
+
</div>
992
+
<div class="paragraph">
993
+
<p>→ ingredients co-occurring many times, <em>but also appearing many times in different recipes</em>, will get a low edge weight.</p>
994
+
</div>
995
+
<div class="paragraph">
996
+
<p>A simple formula does this operation. For ingredients A and B:</p>
997
+
</div>
998
+
<div class="literalblock">
999
+
<div class="content">
1000
+
<pre>weight of edge between A and B =
1001
+
number of recipes where A and B co-occur
1002
+
divided by
1003
+
(total number of recipes where A appears x total number of recipes where B appears)</pre>
1004
+
</div>
1005
+
</div>
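<div class="paragraph">
<p>As a rough illustration, the ratio above can be computed in a few lines of Python. The recipes below are invented for the example (they are not data from the study), and the function name <code>normalized_weight</code> is hypothetical.</p>
</div>

```python
def normalized_weight(units, a, b):
    """Co-occurrence count of a and b divided by the product of their frequencies."""
    n_a = sum(1 for terms in units if a in terms)
    n_b = sum(1 for terms in units if b in terms)
    n_ab = sum(1 for terms in units if a in terms and b in terms)
    if n_a == 0 or n_b == 0:
        return 0.0  # a term that never appears cannot have an edge
    return n_ab / (n_a * n_b)


# Invented recipes, each given as its set of ingredients
recipes = [
    {"flour", "sugar", "salt"},
    {"flour", "sugar", "butter"},
    {"flour", "salt", "egg"},
    {"sugar", "salt", "egg"},
    {"saffron", "cardamom", "flour"},
    {"saffron", "cardamom", "sugar"},
]

common_pair = normalized_weight(recipes, "flour", "sugar")
rare_pair = normalized_weight(recipes, "saffron", "cardamom")
print(common_pair, rare_pair)
```

<div class="paragraph">
<p>In this toy data, <code>flour</code> and <code>sugar</code> co-occur in 2 recipes but each appears in 4, giving 2 / 16 = 0.125. <code>saffron</code> and <code>cardamom</code> also co-occur in 2 recipes, but never appear apart, giving 2 / 4 = 0.5: the rare but complementary pair gets the heavier edge.</p>
</div>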
1006
+
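<div class="paragraph">
1007
+
<p>Replacing the counts with probabilities (each count divided by the total number of recipes) and taking the logarithm of the ratio gives a measure called "pointwise mutual information" (PMI):</p>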
1008
+
</div>
1009
+
<div class="stemblock">
1010
+
<div class="content">
1011
+
\$PMI = log((p(A, B)) /(p(A) p(B)))\$
1012
+
</div>
1013
+
</div>
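<div class="paragraph">
<p>A small Python sketch of PMI, under the assumption that each probability is estimated as the share of units of text containing the term (or both terms). The natural logarithm is used here, which is a choice rather than a requirement: another base only rescales the weights. The recipes and the function name <code>pmi</code> are hypothetical.</p>
</div>

```python
import math


def pmi(units, a, b):
    """Pointwise mutual information of a and b, estimated over units of text."""
    n = len(units)
    if n == 0:
        raise ValueError("at least one unit of text is needed")
    p_a = sum(1 for terms in units if a in terms) / n
    p_b = sum(1 for terms in units if b in terms) / n
    p_ab = sum(1 for terms in units if a in terms and b in terms) / n
    if p_ab == 0:
        return float("-inf")  # the two terms never co-occur
    return math.log(p_ab / (p_a * p_b))


# Invented recipes, each given as its set of ingredients
recipes = [
    {"flour", "sugar", "salt"},
    {"flour", "sugar", "butter"},
    {"flour", "salt", "egg"},
    {"sugar", "salt", "egg"},
    {"saffron", "cardamom", "flour"},
    {"saffron", "cardamom", "sugar"},
]

pmi_common = pmi(recipes, "flour", "sugar")
pmi_rare = pmi(recipes, "saffron", "cardamom")
print(pmi_common, pmi_rare)
```

<div class="paragraph">
<p>A positive PMI means the two terms co-occur more often than their individual frequencies would suggest, and a negative PMI means less often. In this toy data the rare pair scores log(3) while the common pair scores log(0.75), which is below zero.</p>
</div>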
1014
+
<div class="paragraph">
1015
+
<p>We now have nodes and their relations: a semantic network. Let’s now see how to visualize it in Gephi.</p>
1016
+
</div>
1017
+
</div>
1018
+
</div>
1019
+
</div>
1020
+
</div>
1021
+
<div class="sect1">
1022
+
<h2 id="_visualizing_semantic_networks_with_gephi">Visualizing semantic networks with Gephi</h2>