Commit 9c49464 (parent 6c8b366): updates on ep2. File changed: episodes/a-real-website.md (205 additions, 58 deletions).

::::::::::::::::::::::::::::::::::::::::::::::::

## "Requests" the website HTML

In the previous episode we used a simple HTML document, not an actual website. Now that we move to a more realistic, complex scenario, we need to add another package to our toolbox: the `requests` package. For the purposes of this web scraping lesson, we will only use `requests` to get the HTML behind a website. However, there is a lot of extra functionality that we are not covering here but that you can find in the [Requests package documentation](https://requests.readthedocs.io/en/latest/).

We'll be scraping The Carpentries website, [https://carpentries.org/](https://carpentries.org/), and specifically, the list of upcoming and past workshops you can find at the bottom. For that, first we'll load the `requests` package and then use the code `.get().text` to store the HTML document of the website. Furthermore, to simplify our navigation through the HTML document, we will use the [Regular Expressions](https://docs.python.org/3/howto/regex.html) `re` module to remove all newline characters ("\n") and their surrounding whitespace. You can think of removing newlines as a preprocessing or cleaning step, but in this lesson we won't be explaining the intricacies of regular expressions. For that, you can refer to this introductory explanation from the [Library Carpentry](https://librarycarpentry.org/lc-data-intro/01-regular-expressions.html).

```python
import requests
import re

url = 'https://carpentries.org/'
req = requests.get(url).text
cleaned_req = re.sub(r'\s*\n\s*', '', req).strip()
print(cleaned_req[0:1000])
```

```output
<!doctype html><html class="no-js" lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>The Carpentries</title><link rel="stylesheet" type="text/css" href="https://carpentries.org/assets/css/styles_feeling_responsive.css"><script src="https://carpentries.org/assets/js/modernizr.min.js"></script><!-- matomo --><script src="https://carpentries.org/assets/js/matomo-analytics.js"></script><link href="https://fonts.googleapis.com/css?family=Lato:400,400i,700,700i|Roboto:400,400i,700,700i&display=swap" rel="stylesheet"><!-- Search Engine Optimization --><meta name="description" content="The Carpentries is a fiscally sponsored project of Community Initiatives, a registered 501(c)3 non-profit organisation based in California, USA. We are a global community teaching foundational computational and data science skills to researchers in academia, industry and government."><link rel="canonical" href="https://carpentries.org/index.html"><
```

We truncated the output to only the first 1000 characters of the document, as it is too long, but we can see it is HTML and that it has some elements we didn't see in the example of the previous episode, like those identified with the `<meta>`, `<link>`, and `<script>` tags.

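To see concretely what the cleaning step does, here is a minimal sketch applying the same regular expression to a small HTML snippet (the snippet is made up for illustration, not taken from The Carpentries site):

```python
import re

snippet = "<html>\n  <head>\n    <title>Demo</title>\n  </head>\n</html>"

# Each newline, together with the whitespace around it, is removed
cleaned = re.sub(r'\s*\n\s*', '', snippet).strip()
print(cleaned)  # <html><head><title>Demo</title></head></html>
```

The pattern `\s*\n\s*` matches a newline plus any surrounding whitespace, so the indentation that only exists for human readability disappears and the document becomes a single line.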
There's another way to see the HTML document behind a website, directly from your web browser. Using Google Chrome, you can right-click on any part of the website (on a Mac, press and hold the Control key on your keyboard while you click) and, from the pop-up menu, click 'View page source', as the next image shows. If the 'View page source' option doesn't appear for you, try clicking on another part of the website. A new tab will open with the HTML document for the website you were on.

![](fig/view_page_source.png){alt="A screenshot of The Carpentries homepage in the Google Chrome web browser, showing how to View page source"}

In the HTML page source in your browser you can scroll down and look for the second-level header (`<h2>`) with the text "Upcoming Carpentries Workshops". Or, more easily, you can use the Find Bar (Ctrl + F on Windows, Command + F on Mac) to search for "Upcoming Carpentries Workshops". Just below that header we find the table element we are interested in, which starts with the opening tag `<table class="table table-striped" style="width: 100%;">`. Inside that element we see the different nested elements familiar from a table, the rows (`<tr>`) and the cells of each row (`<td>`), and additionally the image (`<img>`) and hyperlink (`<a>`) elements.

## Finding the information we want

Now, going back to our coding, we left off after getting the HTML behind the website using `requests` and storing its cleaned version in the variable called `cleaned_req`. From here we can proceed with BeautifulSoup as we learned in the previous episode, using the `BeautifulSoup()` function to parse our HTML, as the following code block shows. With the parsed document, we can use the `.find()` or `.find_all()` methods to find the table element.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(cleaned_req, 'html.parser')
tables_by_tag = soup.find_all('table')
print("Number of table elements found: ", len(tables_by_tag))
print("Printing only the first 1000 characters of the table element: \n", str(tables_by_tag[0])[0:1000])
```

```output
Number of table elements found:  1
Printing only the first 1000 characters of the table element: 
 <table class="table table-striped" style="width: 100%;"><tr><td><img alt="swc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/swc.svg" title="swc workshop" width="24"/></td><td><img alt="mx" class="flags" src="https://carpentries.org/assets/img/flags/24/mx.png" title="MX"><img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online"><a href="https://galn3x.github.io/-2024-10-28-Metagenomics-online/">Nodo Nacional de Bioinformática UNAM</a><br><b>Instructors:</b> César Aguilar, Diana Oaxaca, Nelly Selem-Mojica<br><b>Helpers:</b> Andreas Chavez, José Manuel Villalobos Escobedo, Aaron Espinosa Jaime, Andrés Arredondo, Mirna Vázquez Rosas-Landa, David Alberto García-Estrada</br></br></img></img></td><td>Oct 28 - Oct 31, 2024</td></tr><tr><td><img alt="dc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/dc.svg" title="dc workshop" width="24"/></td><td><img alt="de" class="flags" s
```

From our output we see that there was only one table element in the entire HTML, which corresponds to the table we are looking for. The output you see in the previous code block will differ from what you get on your computer, as the data in the upcoming workshops table is continuously updated.

Besides searching for elements using tags, it is sometimes useful to search using attributes, like `id` or `class`. For example, we can see the table element has a `class` attribute with two values, "table table-striped", which identifies all elements that share the same styling. Therefore, we could get the same result as before using the `class_` argument of the `.find_all()` method as follows.

```python
tables_by_class = soup.find_all(class_="table table-striped")
```
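Attribute-based searching is not limited to `class`. As a small illustrative sketch (the HTML snippet here is made up, not from The Carpentries site), `.find_all()` also accepts an `attrs` dictionary to match on any attribute:

```python
from bs4 import BeautifulSoup

html = '<div class="note">A</div><div class="warn">B</div><div class="note">C</div>'
mini_soup = BeautifulSoup(html, 'html.parser')

# Match elements by arbitrary attribute values with the attrs dictionary
notes = mini_soup.find_all('div', attrs={'class': 'note'})
print([n.get_text() for n in notes])  # ['A', 'C']
```

The `class_` argument we used above is just a convenient shorthand for this kind of attribute filter, needed because `class` is a reserved word in Python.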

Now that we know there is only one table element, we can start working with it directly by storing the first and only item of the `tables_by_tag` result set in another variable, which we will call `workshops_table`. Checking the types before and after, we see that we moved from working with a "ResultSet" object to a "Tag" object, which we can use to extract information from each row and cell.

```python
print("Before: ", type(tables_by_tag))
workshops_table = tables_by_tag[0]
print("After:", type(workshops_table))
print("Element type:", workshops_table.name)
```

```output
Before:  <class 'bs4.element.ResultSet'>
After: <class 'bs4.element.Tag'>
Element type: table
```

## Navigating the tree

If we use the `prettify()` method on the `workshops_table` variable, we see that this table element has a nested tree structure. On the first level is the `<table>` tag. Inside it we have rows (`<tr>`), and inside rows we have table data cells (`<td>`). We can start to identify certain information we may be interested in, for example:

- The type of workshop ('swc' for Software Carpentry, 'dc' for Data Carpentry, 'lc' for Library Carpentry, and 'cp' for workshops based on The Carpentries curriculum). We find this in the first `<td>` tag, or said in a different way, in the first cell of the row.
- The country where the workshop was held. We can see the two-letter country code in the second cell of the row.
- The URL of the workshop website, which contains additional information. It is also in the second cell of the row, as the `href` attribute of the `<a>` tag.
- The institution that is hosting the workshop, also in the second `<td>`, in the text of the hyperlink `<a>` tag.
- The names of the instructors and helpers involved in the workshop.
- The dates of the workshop, in the third and final cell.

```python
print(workshops_table.prettify())
```

```output
<table class="table table-striped" style="width: 100%;">
 <tr>
  <td>
   <img alt="swc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/swc.svg" title="swc workshop" width="24"/>
  </td>
  <td>
   <img alt="mx" class="flags" src="https://carpentries.org/assets/img/flags/24/mx.png" title="MX">
    <img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online">
     <a href="https://galn3x.github.io/-2024-10-28-Metagenomics-online/">
      Nodo Nacional de Bioinformática UNAM
     </a>
     <br>
      <b>
       Instructors:
      </b>
      César Aguilar, Diana Oaxaca, Nelly Selem-Mojica
      <br>
       <b>
        Helpers:
       </b>
       Andreas Chavez, José Manuel Villalobos Escobedo, Aaron Espinosa Jaime, Andrés Arredondo, Mirna Vázquez Rosas-Landa, David Alberto García-Estrada
      </br>
     </br>
    </img>
   </img>
  </td>
  <td>
   Oct 28 - Oct 31, 2024
  </td>
 </tr>
 <tr>
  <td>
   ...
  </td>
 </tr>
</table>
```

To navigate this HTML document tree we can use the `.contents` attribute (to access the direct children of a node), the `.parent` attribute (to access the parent node), and the `.next_sibling` and `.previous_sibling` attributes (to access the siblings of a node). For example, if we wanted to access the second row of the table, which is the second child of the table element, we could use the following code.

```python
# The second child (index 1 in Python indexing) of our table element
workshops_table.contents[1]
```

If you go back to the 'View page source' of the website, you'll notice that the table element is nested inside a `<div class="medium-12 columns">` element, which means this `<div>` is the parent of our `<table>`. If we needed to, we could access this parent by using `workshops_table.parent`.

Now imagine we had selected the second data cell of the fifth row using `workshops_table.contents[4].contents[1]`. From there, we could access the third data cell with `.next_sibling`, or the first data cell with `.previous_sibling`.

```python
# Access the fifth row, and from there, the second data cell
row5_cell2 = workshops_table.contents[4].contents[1]
# Access the third cell of the fifth row
row5_cell3 = row5_cell2.next_sibling
# Access the first cell of the fifth row
row5_cell1 = row5_cell2.previous_sibling
```
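Navigation with `.contents`, `.next_sibling`, `.previous_sibling`, and `.parent` is easier to grasp on a tiny, self-contained example. The following sketch uses a made-up three-cell row, not the real workshops table:

```python
from bs4 import BeautifulSoup

mini_row = BeautifulSoup('<tr><td>a</td><td>b</td><td>c</td></tr>', 'html.parser').tr

middle = mini_row.contents[1]              # second child: <td>b</td>
print(middle.get_text())                   # b
print(middle.next_sibling.get_text())      # c
print(middle.previous_sibling.get_text())  # a
print(middle.parent.name)                  # tr
```

Note that `.contents` only sees direct children; here the row has exactly three `<td>` children, so indexing and sibling hops behave predictably.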

Why do we bother to learn all of these? Depending on your web scraping use case, they may prove useful on complex websites. Let's apply them to extract the information we want about the workshops; for example, to see how many upcoming workshops there are, which corresponds to the number of children the table element has.

```python
num_workshops = len(workshops_table.contents)
print("Number of upcoming workshops listed: ", num_workshops)
```

Let's first extract data from only the first row; later we can use a loop to iterate over all the rows of the table.

```python
# Empty dictionary to hold the data
dict_w = {}

# First row of data
first_row = workshops_table.contents[0]

# Get the first cell, and from it, its siblings
first_cell = first_row.contents[0]
second_cell = first_cell.next_sibling
third_cell = second_cell.next_sibling

# From the first cell, find the <img> tag and get its 'title' attribute, which contains the type of workshop
dict_w['type'] = first_cell.find('img')['title']

# In the second cell, get the country from the 'title' attribute of the <img> tag
dict_w['country'] = second_cell.find('img')['title']

# The link to the workshop website is in the 'href' attribute of the <a> tag
dict_w['link'] = second_cell.find('a')['href']

# The institution that hosts the workshop is the text inside that <a> tag
dict_w['host'] = second_cell.find('a').get_text()

# Get all the text from the second cell
dict_w['all_text'] = second_cell.get_text(strip=True)

# Get the dates from the third cell
dict_w['date'] = third_cell.get_text(strip=True)

print(dict_w)
```

```output
{'type': 'swc workshop',
 'country': 'MX',
 'link': 'https://galn3x.github.io/-2024-10-28-Metagenomics-online/',
 'host': 'Nodo Nacional de Bioinformática UNAM',
 'all_text': 'Nodo Nacional de Bioinformática UNAMInstructors:César Aguilar, Diana Oaxaca, Nelly Selem-MojicaHelpers:Andreas Chavez, José Manuel Villalobos Escobedo, Aaron Espinosa Jaime, Andrés Arredondo, Mirna Vázquez Rosas-Landa, David Alberto García-Estrada',
 'date': 'Oct 28 - Oct 31, 2024'}
```

This was just for one row, but we can iterate over all the rows in the table by adding a for loop and appending each dictionary to a list. That list can then be transformed into a Pandas dataframe so we can see the results nicely.

```python
import pandas as pd

list_of_workshops = []
for row in range(num_workshops):
    n_row = workshops_table.contents[row]
    first_cell = n_row.contents[0]
    second_cell = first_cell.next_sibling
    third_cell = second_cell.next_sibling
    dict_w = {}
    dict_w['type'] = first_cell.find('img')['title']
    dict_w['country'] = second_cell.find('img')['title']
    dict_w['link'] = second_cell.find('a')['href']
    dict_w['host'] = second_cell.find('a').get_text()
    dict_w['all_text'] = second_cell.get_text(strip=True)
    dict_w['date'] = third_cell.get_text(strip=True)
    list_of_workshops.append(dict_w)

result_df = pd.DataFrame(list_of_workshops)
```

Great! We've finished our first scraping task on a real website. Be aware that there are multiple ways of achieving the same result. For example, instead of using the `.contents` attribute to access the different rows of the table, we could have used `.find_all('tr')` to scan the table and loop through the row elements. Similarly, instead of moving to the siblings of the first data cell, we could have used `.find_all('td')`. Code using that alternative approach would look like this. Remember, the results are the same!

```python
list_of_workshops = []
for row in workshops_table.find_all('tr'):
    cells = row.find_all('td')
    first_cell = cells[0]
    second_cell = cells[1]
    third_cell = cells[2]
    dict_w = {}
    dict_w['type'] = first_cell.find('img')['title']
    dict_w['country'] = second_cell.find('img')['title']
    dict_w['link'] = second_cell.find('a')['href']
    dict_w['host'] = second_cell.find('a').get_text()
    dict_w['all_text'] = second_cell.get_text(strip=True)
    dict_w['date'] = third_cell.get_text(strip=True)
    list_of_workshops.append(dict_w)

result_df = pd.DataFrame(list_of_workshops)
```
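As an aside, for tables made of plain text, pandas can parse HTML directly with `pandas.read_html()` (it needs an HTML parser such as lxml or html5lib installed). It returns a list of dataframes, one per table found, but it keeps only the cell text, so attributes like the `href` of our links would be lost; for this table, the BeautifulSoup approach above preserves more information. A minimal sketch with a made-up table:

```python
from io import StringIO

import pandas as pd

html = "<table><tr><th>host</th></tr><tr><td>Some University</td></tr></table>"

# read_html returns a list with one dataframe per <table> found
dfs = pd.read_html(StringIO(html))
print(dfs[0])
```

This shortcut is handy for quick checks, but it is no substitute for tag-level control when the cells carry structure, like the logos, flags, and links in the workshops table.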

## Automating data collection

When we want to extract information from a website, we need to understand how the website is structured and how we can identify the elements we want.
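To move toward automation, the parsing steps of this episode can be collected into a reusable function. The following is a sketch under the assumption that the page structure stays as described; the function name `parse_first_table_rows` is our own, not from any library:

```python
import re

from bs4 import BeautifulSoup


def parse_first_table_rows(html):
    """Clean an HTML string and return the row elements of its first table."""
    cleaned = re.sub(r'\s*\n\s*', '', html).strip()
    soup = BeautifulSoup(cleaned, 'html.parser')
    table = soup.find('table')
    # If the page has no table, return an empty list instead of failing
    return table.find_all('tr') if table is not None else []


rows = parse_first_table_rows('<table>\n <tr><td>1</td></tr>\n <tr><td>2</td></tr>\n</table>')
print(len(rows))  # 2
```

Paired with `requests.get(url).text`, a helper like this reproduces the download-clean-parse steps of this episode in one call, which is a first step toward collecting data on a schedule.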