more work on ep2

josenino95 · josenino95 · commit 6c8b36631964 · 2024-10-27T22:25:34.000-07:00
diff --git a/episodes/a-real-website.md b/episodes/a-real-website.md
@@ -21,7 +21,7 @@ exercises: 15
 
 In the previous episode we used a simple HTML document, not an actual website. Now that we move to more real, complex escenario, we need to add another package to our toolbox, the `requests` package. For the purpose of this web scraping lesson, we will only use `requests` to get the HTML behind a website. However, there's a lot of extra functionality that we are not covering but you can find in the [Requests package documentation](https://requests.readthedocs.io/en/latest/).
 
-We'll be scraping The Carpentries website, [https://carpentries.org/](https://carpentries.org/), and the list of upcoming and past workshop you can find in there. For that, first we'll load the `requests` package and then use the code `.get().text` to store the HTML document of the website.
+We'll be scraping The Carpentries website, [https://carpentries.org/](https://carpentries.org/), and specifically, the list of upcoming and past workshops you can find there. For that, we'll first load the `requests` package and then use the code `.get().text` to store the HTML document of the website.
 
 ```python
 import requests
@@ -57,7 +57,67 @@ print(req)
 </html>
 ```
 
-The output from our previous code was truncated, as it is too long, but we can see that it is HTML and has some elements we didn't see in our previous simple example, like those identified with the `<meta>`, `<link>` and `<script>` tags.
+The output from our previous code was truncated, as it is too long, but we can see it is HTML and has some elements we didn't see in the example of the previous episode, like those identified with the `<meta>`, `<link>` and `<script>` tags.
+
+There's another way to see the HTML document behind a website, directly from your web browser. In Google Chrome, you can right-click on any part of the website (on a Mac, press and hold the Control key on your keyboard while you click) and, from the pop-up menu, click 'View page source', as the next image shows. If the 'View page source' option doesn't appear, try right-clicking on another part of the website. A new tab will open with the HTML document of the website you were on.
+
+![](fig/view_page_source.png){alt="A screenshot of The Carpentries homepage in the Google Chrome web browser, showing how to View page source"}
+
+In the HTML page source in your browser, you can scroll down and look for the second-level header (`<h2>`) with the text "Upcoming Carpentries Workshops". Or, more easily, you can use the Find Bar (Ctrl + F on Windows and Command + F on Mac) to search for "Upcoming Carpentries Workshops". Right below that header is the table element we are interested in, which starts with the opening tag `<table class="table table-striped" style="width: 100%;">`. Nested inside that element we see the elements you would expect in a table, the rows (`<tr>`) and the cells of each row (`<td>`), and additionally image (`<img>`) and hyperlink (`<a>`) elements.
+
+Now, going back to our code, we left off after getting the HTML behind the website using `requests` and storing it in the variable called `req`. From here we can proceed with BeautifulSoup as we learned in the previous episode, using the `BeautifulSoup()` function to parse our HTML, as the following code block shows. With the parsed document, we can use the `.find()` or `.find_all()` methods to find the table element.
+
+```python
+soup = BeautifulSoup(req, 'html.parser')
+tables_by_tag = soup.find_all('table')
+print("Number of table elements found: ", len(tables_by_tag))
+print("Printing only the first 1000 characters of the table element: \n", str(tables_by_tag[0])[0:1000])
+```
+
+```output
+Number of table elements found:  1
+Printing only the first 1000 characters of the table element:
+ <table class="table table-striped" style="width: 100%;">
+<tr>
+<td>
+<img alt="swc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/swc.svg" title="swc workshop" width="24">
+</img></td>
+<td>
+<img alt="mx" class="flags" src="https://carpentries.org/assets/img/flags/24/mx.png" title="MX">
+<img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online">
+<a href="https://galn3x.github.io/-2024-10-28-Metagenomics-online/">Nodo Nacional de BioinformÃ¡tica UNAM</a>
+<br>
+<b>Instructors:</b> CÃ©sar Aguilar, Diana Oaxaca, Nelly Selem-Mojica
+      
+      
+          <br>
+<b>Helpers:</b> Andreas Chavez, JosÃ© Manuel Villalobos Escobedo, Aaron Espinosa Jaime, AndrÃ©s Arredondo, Mirna VÃ¡zquez Rosas-Landa, David Alberto GarcÃ­a-Estrada
+      
+	</br></br></img></img></td>
+<td>
+		Oct 28 - Oct 31, 2024
+	</td>
+</tr>
+<tr>
+<td>
+<img alt="dc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/dc.svg" title="dc 
+```
+
+We can see that only one table element was found in the entire HTML, and it is the table we are looking for. The output you see in the previous code block will differ from what you get on your computer, as the data in the upcoming workshops table is continuously being updated.
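+
+To make the difference between `.find()` and `.find_all()` concrete, here is a minimal sketch. It assumes a small, hard-coded HTML string (the workshop names and the `demo_soup` variable are hypothetical, not taken from The Carpentries website), so it runs without a network request.
+
+```python
+from bs4 import BeautifulSoup
+
+# Hypothetical, hard-coded HTML so the example runs without a network request
+html = """
+<html><body>
+<h2>Upcoming workshops</h2>
+<table class="table table-striped">
+<tr><td>Workshop A</td><td>Online</td></tr>
+<tr><td>Workshop B</td><td>In person</td></tr>
+</table>
+</body></html>
+"""
+demo_soup = BeautifulSoup(html, "html.parser")
+
+# .find_all() returns a list-like ResultSet, even when there is a single match
+all_tables = demo_soup.find_all("table")
+print("Tables found:", len(all_tables))
+
+# .find() returns the first matching Tag, or None when nothing matches
+first_table = demo_soup.find("table")
+missing = demo_soup.find("form")
+print(type(first_table).__name__, missing)
+
+# Searching from the table Tag only returns the rows nested inside it
+rows = first_table.find_all("tr")
+print("Rows in the table:", len(rows))
+```
+
+Because `.find()` can return `None`, it is usually worth checking the result before accessing anything inside it.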
+
+Besides searching for elements by tag, sometimes it will be useful to search by attribute, like `id` or `class`. For example, we can see the table element has a class attribute with two values, "table table-striped", which are shared by all elements with similar styling. Therefore, we could get the same result as before using the `class_` argument of the `.find_all()` method, as follows.
+
+```python
+tables_by_class = soup.find_all(class_="table table-striped")
+```
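+
+One detail to keep in mind is how `class_` treats elements with several classes. The following sketch uses a hypothetical HTML string with two tables (the `id` values are made up for illustration) to compare matching the full class string, matching a single class, and using a CSS selector with `.select()`.
+
+```python
+from bs4 import BeautifulSoup
+
+# Hypothetical HTML: two tables with the same classes written in different order
+html = """
+<table class="table table-striped" id="first"></table>
+<table class="table-striped table" id="second"></table>
+"""
+demo_soup = BeautifulSoup(html, "html.parser")
+
+# The full string only matches when the attribute value is written identically
+exact = demo_soup.find_all(class_="table table-striped")
+print([t["id"] for t in exact])
+
+# A single class name matches any element that carries that class
+single = demo_soup.find_all(class_="table-striped")
+print([t["id"] for t in single])
+
+# A CSS selector requires both classes, regardless of their order
+both = demo_soup.select("table.table.table-striped")
+print([t["id"] for t in both])
+```
+
+If the order of the classes on a page might vary, searching by a single class or with a CSS selector is likely the more robust choice.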
+
+Now that we know there is only one table
+
+## Navigating the tree
+
+
+
+
 
 ::::::::::::::::::::::::::::::::::::: keypoints 
 
diff --git a/episodes/fig/view_page_source.png b/episodes/fig/view_page_source.png
diff --git a/notebooks/ep1.ipynb b/notebooks/ep1.ipynb
@@ -9,7 +9,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1309,7 +1309,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 47,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [
     {
@@ -5285,13 +5285,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 127,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [],
    "source": [
     "# We'll use BeautifulSoup to parse the HTML,\n",
     "# as it has useful functions and tools to access the data in the HTML\n",
-    "soup = BeautifulSoup(req.text, 'html.parser')"
+    "soup = BeautifulSoup(req, 'html.parser')"
    ]
   },
   {
@@ -6733,32 +6733,111 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 129,
+   "execution_count": 24,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of table elements found:  1\n",
+      "Printing only the first 1000 characters of the table element: \n",
+      " <table class=\"table table-striped\" style=\"width: 100%;\">\n",
+      "<tr>\n",
+      "<td>\n",
+      "<img alt=\"swc logo\" class=\"flags\" height=\"24\" src=\"https://carpentries.org/assets/img/logos/swc.svg\" title=\"swc workshop\" width=\"24\">\n",
+      "</img></td>\n",
+      "<td>\n",
+      "<img alt=\"mx\" class=\"flags\" src=\"https://carpentries.org/assets/img/flags/24/mx.png\" title=\"MX\">\n",
+      "<img alt=\"globe image\" class=\"flags\" src=\"https://carpentries.org/assets/img/flags/24/w3.png\" title=\"Online\">\n",
+      "<a href=\"https://galn3x.github.io/-2024-10-28-Metagenomics-online/\">Nodo Nacional de BioinformÃ¡tica UNAM</a>\n",
+      "<br>\n",
+      "<b>Instructors:</b> CÃ©sar Aguilar, Diana Oaxaca, Nelly Selem-Mojica\n",
+      "      \n",
+      "      \n",
+      "          <br>\n",
+      "<b>Helpers:</b> Andreas Chavez, JosÃ© Manuel Villalobos Escobedo, Aaron Espinosa Jaime, AndrÃ©s Arredondo, Mirna VÃ¡zquez Rosas-Landa, David Alberto GarcÃ­a-Estrada\n",
+      "      \n",
+      "\t</br></br></img></img></td>\n",
+      "<td>\n",
+      "\t\tOct 28 - Oct 31, 2024\n",
+      "\t</td>\n",
+      "</tr>\n",
+      "<tr>\n",
+      "<td>\n",
+      "<img alt=\"dc logo\" class=\"flags\" height=\"24\" src=\"https://carpentries.org/assets/img/logos/dc.svg\" title=\"dc \n"
+     ]
+    }
+   ],
+   "source": [
+    "soup = BeautifulSoup(req, 'html.parser')\n",
+    "tables_by_tag = soup.find_all('table')\n",
+    "print(\"Number of table elements found: \", len(tables_by_tag))\n",
+    "print(\"Printing only the first 1000 characters of the table element: \\n\", str(tables_by_tag[0])[0:1000])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of table elements found:  1\n",
+      "<table class=\"table table-striped\" style=\"width: 100%;\">\n",
+      "<tr>\n",
+      "<td>\n",
+      "<img alt=\"swc logo\" class=\"flags\" height=\"24\" src=\"https://carpentries.org/assets/img/logos/swc.svg\" title=\"swc workshop\" width=\"24\">\n",
+      "</img></td>\n",
+      "<td>\n",
+      "<img alt=\"mx\" class=\"flags\" src=\"https://carpentries.org/assets/img/flags/24/mx.png\" title=\"MX\">\n",
+      "<img alt=\"globe image\" class=\"flags\" src=\"https://carpentries.org/assets/img/flags/24/w3.png\" title=\"Online\">\n",
+      "<a href=\"https://galn3x.github.io/-2024-10-28-Metagenomics-online/\">Nodo Nacional de BioinformÃ¡tica UNAM</a>\n",
+      "<br>\n",
+      "<b>Instructors:</b> CÃ©sar Aguilar, Diana Oaxaca, Nelly Selem-Mojica\n",
+      "      \n",
+      "      \n",
+      "          <br>\n",
+      "<b>Helpers:</b> Andreas Chavez, JosÃ© Manuel Villalobos Escobedo, Aaron Espinosa Jaime, AndrÃ©s Arredondo, Mirna VÃ¡zquez Rosas-Landa, David Alberto GarcÃ­a-Estrada\n",
+      "      \n",
+      "\t</br></br></img></img></td>\n",
+      "<td>\n",
+      "\t\tOct 28 - Oct 31, 2024\n",
+      "\t</td>\n",
+      "</tr>\n",
+      "<tr>\n",
+      "<td>\n",
+      "<img alt=\"dc logo\" class=\"flags\" height=\"24\" src=\"https://carpentries.org/assets/img/logos/dc.svg\" title=\"dc \n"
+     ]
+    }
+   ],
    "source": [
-    "tables = soup.find_all('table')"
+    "tables_by_class = soup.find_all(class_=\"table table-striped\")\n",
+    "print(\"Number of table elements found: \", len(tables_by_class))\n",
+    "print(str(tables_by_class[0])[0:1000])"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 130,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "1"
+       "bs4.element.Tag"
       ]
      },
-     "execution_count": 130,
+     "execution_count": 9,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "# How many tables (elements with the table tag) did we find?\n",
-    "len(tables)"
+    "# What type of object is the first table element we found?\n",
+    "type(tables_by_tag[0])"
    ]
   },
   {