Commit 6c8b366

committed
more work on ep2
1 parent 1ec65df commit 6c8b366

3 files changed

Lines changed: 152 additions & 13 deletions

episodes/a-real-website.md

Lines changed: 62 additions & 2 deletions
@@ -21,7 +21,7 @@ exercises: 15
2121

2222
In the previous episode we used a simple HTML document, not an actual website. Now that we are moving to a more realistic, complex scenario, we need to add another package to our toolbox: the `requests` package. For the purpose of this web scraping lesson, we will only use `requests` to get the HTML behind a website. However, `requests` offers a lot of extra functionality that we do not cover here; you can find it in the [Requests package documentation](https://requests.readthedocs.io/en/latest/).
2323

24-
We'll be scraping The Carpentries website, [https://carpentries.org/](https://carpentries.org/), and the list of upcoming and past workshops you can find there. For that, first we'll load the `requests` package and then use the code `.get().text` to store the HTML document of the website.
24+
We'll be scraping The Carpentries website, [https://carpentries.org/](https://carpentries.org/), and specifically the list of upcoming and past workshops you can find there. For that, we'll first load the `requests` package and then use the code `.get().text` to store the HTML document of the website.
2525

2626
```python
2727
import requests
@@ -57,7 +57,67 @@ print(req)
5757
</html>
5858
```
5959

60-
The output from our previous code was truncated, as it is too long, but we can see that it is HTML and has some elements we didn't see in our previous simple example, like those identified with the `<meta>`, `<link>` and `<script>` tags.
60+
The output from our previous code was truncated because it is too long, but we can see that it is HTML and that it has some elements we didn't see in the example from the previous episode, such as those identified with the `<meta>`, `<link>` and `<script>` tags.
61+
62+
There's another way to see the HTML document behind a website: directly from your web browser. Using Google Chrome, you can right-click on any part of the website (on a Mac, press and hold the Control key on your keyboard while you click) and, from the pop-up menu, click 'View page source', as the next image shows. If the 'View page source' option doesn't appear, try clicking on another part of the website. A new tab will open with the HTML document of the website you were on.
63+
64+
![](fig/view_page_source.png){alt="A screenshot of The Carpentries homepage in the Google Chrome web browser, showing how to View page source"}
65+
66+
In the HTML page source in your browser, you can scroll down and look for the second-level header (`<h2>`) with the text "Upcoming Carpentries Workshops". Or, more easily, you can use the Find Bar (Ctrl + F on Windows and Command + F on Mac) to search for "Upcoming Carpentries Workshops". Right below that header is the table element we are interested in, which starts with the opening tag `<table class="table table-striped" style="width: 100%;">`. Nested inside that element we see the elements familiar from any table, the rows (`<tr>`) and the cells of each row (`<td>`), as well as image (`<img>`) and hyperlink (`<a>`) elements.
67+
68+
Now, going back to our code: we left off after getting the HTML behind the website using `requests` and storing it in the variable called `req`. From here we can proceed with BeautifulSoup as we learned in the previous episode, using the `BeautifulSoup()` function to parse our HTML, as the following code block shows. With the parsed document, we can use the `.find()` or `.find_all()` methods to find the table element.
69+
70+
```python
71+
soup = BeautifulSoup(req, 'html.parser')
72+
tables_by_tag = soup.find_all('table')
73+
print("Number of table elements found: ", len(tables_by_tag))
74+
print("Printing only the first 1000 characters of the table element: \n", str(tables_by_tag[0])[0:1000])
75+
```
76+
77+
```output
78+
Number of table elements found: 1
79+
Printing only the first 1000 characters of the table element:
80+
<table class="table table-striped" style="width: 100%;">
81+
<tr>
82+
<td>
83+
<img alt="swc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/swc.svg" title="swc workshop" width="24">
84+
</img></td>
85+
<td>
86+
<img alt="mx" class="flags" src="https://carpentries.org/assets/img/flags/24/mx.png" title="MX">
87+
<img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online">
88+
<a href="https://galn3x.github.io/-2024-10-28-Metagenomics-online/">Nodo Nacional de Bioinformática UNAM</a>
89+
<br>
90+
<b>Instructors:</b> César Aguilar, Diana Oaxaca, Nelly Selem-Mojica
91+
92+
93+
<br>
94+
<b>Helpers:</b> Andreas Chavez, José Manuel Villalobos Escobedo, Aaron Espinosa Jaime, Andrés Arredondo, Mirna Vázquez Rosas-Landa, David Alberto García-Estrada
95+
96+
</br></br></img></img></td>
97+
<td>
98+
Oct 28 - Oct 31, 2024
99+
</td>
100+
</tr>
101+
<tr>
102+
<td>
103+
<img alt="dc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/dc.svg" title="dc
104+
```
105+
106+
We can see that only one table element was found in the entire HTML, which corresponds to the table we are looking for. The output you see in the previous code block will differ from what you get on your computer, as the data in the upcoming workshops table is continuously being updated.
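To make the difference between `.find()` and `.find_all()` concrete, here is a minimal sketch. It assumes a small made-up HTML string (`sample_html`, with hypothetical workshop names) instead of the real Carpentries page, so the result does not depend on the live website.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, NOT the real Carpentries HTML
sample_html = """
<html><body>
<h2>Upcoming Carpentries Workshops</h2>
<table class="table table-striped">
  <tr><td>Workshop A</td><td>Oct 28 - Oct 31, 2024</td></tr>
  <tr><td>Workshop B</td><td>Nov 04 - Nov 05, 2024</td></tr>
</table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# .find() returns the first matching element, or None if nothing matches
first_table = sample_soup.find("table")
missing = sample_soup.find("form")

# .find_all() always returns a list-like result, which may be empty
all_rows = sample_soup.find_all("tr")
no_forms = sample_soup.find_all("form")

print(first_table.name)               # table
print(missing)                        # None
print(len(all_rows), len(no_forms))   # 2 0
```

Because `.find()` can return `None`, it is usually worth checking its result before calling further methods on it, while the result of `.find_all()` can be checked with `len()` as we did above.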
107+
108+
Besides searching for elements by tag, it is sometimes useful to search by attribute, such as `id` or `class`. For example, we can see that the table element has a class attribute with two values, "table table-striped", which identifies all elements with similar styling. Therefore, we can get the same result as before using the `class_` argument of the `.find_all()` method, as follows.
109+
110+
```python
111+
tables_by_class = soup.find_all(class_="table table-striped")
112+
```
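One detail worth keeping in mind: when `class_` receives a string with several class names, BeautifulSoup compares it against the exact value of the `class` attribute, so the order of the names matters. The following sketch illustrates this on a made-up snippet (`sample_html` is hypothetical, not taken from the Carpentries website) and shows two alternatives that do not depend on the order, searching for a single class value and using a CSS selector with `.select()`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the same two classes, written in a different order
sample_html = """
<table class="table table-striped"><tr><td>first</td></tr></table>
<table class="table-striped table"><tr><td>second</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact string match on the class attribute: only the first table matches
exact = sample_soup.find_all(class_="table table-striped")

# Matching a single class value finds both tables
single = sample_soup.find_all("table", class_="table-striped")

# CSS selector: requires both classes, in any order
selected = sample_soup.select("table.table.table-striped")

print(len(exact), len(single), len(selected))  # 1 2 2
```

On the page we are scraping this makes no difference, since there is only one table, but it can matter on websites where class names are written in varying order.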
113+
114+
Now that we know there is only one table
115+
116+
## Navigating the tree
117+
118+
119+
120+
61121

62122
::::::::::::::::::::::::::::::::::::: keypoints
63123

episodes/fig/view_page_source.png

336 KB

notebooks/ep1.ipynb

Lines changed: 90 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
},
1010
{
1111
"cell_type": "code",
12-
"execution_count": 3,
12+
"execution_count": 1,
1313
"metadata": {},
1414
"outputs": [],
1515
"source": [
@@ -1309,7 +1309,7 @@
13091309
},
13101310
{
13111311
"cell_type": "code",
1312-
"execution_count": 47,
1312+
"execution_count": 2,
13131313
"metadata": {},
13141314
"outputs": [
13151315
{
@@ -5285,13 +5285,13 @@
52855285
},
52865286
{
52875287
"cell_type": "code",
5288-
"execution_count": 127,
5288+
"execution_count": 4,
52895289
"metadata": {},
52905290
"outputs": [],
52915291
"source": [
52925292
"# We'll use BeautifulSoup to parse the HTML,\n",
52935293
"# as it has useful functions and tools to access the data in the HTML\n",
5294-
"soup = BeautifulSoup(req.text, 'html.parser')"
5294+
"soup = BeautifulSoup(req, 'html.parser')"
52955295
]
52965296
},
52975297
{
@@ -6733,32 +6733,111 @@
67336733
},
67346734
{
67356735
"cell_type": "code",
6736-
"execution_count": 129,
6736+
"execution_count": 24,
67376737
"metadata": {},
6738-
"outputs": [],
6738+
"outputs": [
6739+
{
6740+
"name": "stdout",
6741+
"output_type": "stream",
6742+
"text": [
6743+
"Number of table elements found: 1\n",
6744+
"Printing only the first 1000 characters of the table element: \n",
6745+
" <table class=\"table table-striped\" style=\"width: 100%;\">\n",
6746+
"<tr>\n",
6747+
"<td>\n",
6748+
"<img alt=\"swc logo\" class=\"flags\" height=\"24\" src=\"https://carpentries.org/assets/img/logos/swc.svg\" title=\"swc workshop\" width=\"24\">\n",
6749+
"</img></td>\n",
6750+
"<td>\n",
6751+
"<img alt=\"mx\" class=\"flags\" src=\"https://carpentries.org/assets/img/flags/24/mx.png\" title=\"MX\">\n",
6752+
"<img alt=\"globe image\" class=\"flags\" src=\"https://carpentries.org/assets/img/flags/24/w3.png\" title=\"Online\">\n",
6753+
"<a href=\"https://galn3x.github.io/-2024-10-28-Metagenomics-online/\">Nodo Nacional de Bioinformática UNAM</a>\n",
6754+
"<br>\n",
6755+
"<b>Instructors:</b> César Aguilar, Diana Oaxaca, Nelly Selem-Mojica\n",
6756+
" \n",
6757+
" \n",
6758+
" <br>\n",
6759+
"<b>Helpers:</b> Andreas Chavez, José Manuel Villalobos Escobedo, Aaron Espinosa Jaime, Andrés Arredondo, Mirna Vázquez Rosas-Landa, David Alberto García-Estrada\n",
6760+
" \n",
6761+
"\t</br></br></img></img></td>\n",
6762+
"<td>\n",
6763+
"\t\tOct 28 - Oct 31, 2024\n",
6764+
"\t</td>\n",
6765+
"</tr>\n",
6766+
"<tr>\n",
6767+
"<td>\n",
6768+
"<img alt=\"dc logo\" class=\"flags\" height=\"24\" src=\"https://carpentries.org/assets/img/logos/dc.svg\" title=\"dc \n"
6769+
]
6770+
}
6771+
],
6772+
"source": [
6773+
"soup = BeautifulSoup(req, 'html.parser')\n",
6774+
"tables_by_tag = soup.find_all('table')\n",
6775+
"print(\"Number of table elements found: \", len(tables_by_tag))\n",
6776+
"print(\"Printing only the first 1000 characters of the table element: \\n\", str(tables_by_tag[0])[0:1000])"
6777+
]
6778+
},
6779+
{
6780+
"cell_type": "code",
6781+
"execution_count": 25,
6782+
"metadata": {},
6783+
"outputs": [
6784+
{
6785+
"name": "stdout",
6786+
"output_type": "stream",
6787+
"text": [
6788+
"Number of table elements found: 1\n",
6789+
"<table class=\"table table-striped\" style=\"width: 100%;\">\n",
6790+
"<tr>\n",
6791+
"<td>\n",
6792+
"<img alt=\"swc logo\" class=\"flags\" height=\"24\" src=\"https://carpentries.org/assets/img/logos/swc.svg\" title=\"swc workshop\" width=\"24\">\n",
6793+
"</img></td>\n",
6794+
"<td>\n",
6795+
"<img alt=\"mx\" class=\"flags\" src=\"https://carpentries.org/assets/img/flags/24/mx.png\" title=\"MX\">\n",
6796+
"<img alt=\"globe image\" class=\"flags\" src=\"https://carpentries.org/assets/img/flags/24/w3.png\" title=\"Online\">\n",
6797+
"<a href=\"https://galn3x.github.io/-2024-10-28-Metagenomics-online/\">Nodo Nacional de Bioinformática UNAM</a>\n",
6798+
"<br>\n",
6799+
"<b>Instructors:</b> César Aguilar, Diana Oaxaca, Nelly Selem-Mojica\n",
6800+
" \n",
6801+
" \n",
6802+
" <br>\n",
6803+
"<b>Helpers:</b> Andreas Chavez, José Manuel Villalobos Escobedo, Aaron Espinosa Jaime, Andrés Arredondo, Mirna Vázquez Rosas-Landa, David Alberto García-Estrada\n",
6804+
" \n",
6805+
"\t</br></br></img></img></td>\n",
6806+
"<td>\n",
6807+
"\t\tOct 28 - Oct 31, 2024\n",
6808+
"\t</td>\n",
6809+
"</tr>\n",
6810+
"<tr>\n",
6811+
"<td>\n",
6812+
"<img alt=\"dc logo\" class=\"flags\" height=\"24\" src=\"https://carpentries.org/assets/img/logos/dc.svg\" title=\"dc \n"
6813+
]
6814+
}
6815+
],
67396816
"source": [
6740-
"tables = soup.find_all('table')"
6817+
"tables_by_class = soup.find_all(class_=\"table table-striped\")\n",
6818+
"print(\"Number of table elements found: \", len(tables_by_class))\n",
6819+
"print(str(tables_by_class[0])[0:1000])"
67416820
]
67426821
},
67436822
{
67446823
"cell_type": "code",
6745-
"execution_count": 130,
6824+
"execution_count": 9,
67466825
"metadata": {},
67476826
"outputs": [
67486827
{
67496828
"data": {
67506829
"text/plain": [
6751-
"1"
6830+
"bs4.element.Tag"
67526831
]
67536832
},
6754-
"execution_count": 130,
6833+
"execution_count": 9,
67556834
"metadata": {},
67566835
"output_type": "execute_result"
67576836
}
67586837
],
67596838
"source": [
67606839
"# How many tables (elements with the table tag) did we find?\n",
6761-
"len(tables)"
6840+
"type(tables[0])"
67626841
]
67636842
},
67646843
{
