In the previous episode we used a simple HTML document, not an actual website. Now that we are moving to a more realistic and complex scenario, we need to add another package to our toolbox, the `requests` package. For the purpose of this web scraping lesson, we will only use `requests` to get the HTML behind a website. However, it offers a lot of extra functionality that we are not covering here, which you can find in the [Requests package documentation](https://requests.readthedocs.io/en/latest/).

We'll be scraping The Carpentries website, [https://carpentries.org/](https://carpentries.org/), and specifically the list of upcoming and past workshops you can find at the bottom of the page. For that, we'll first load the `requests` package and then use the code `.get().text` to store the HTML document of the website. Furthermore, to simplify our navigation through the HTML document, we will use the [Regular Expressions](https://docs.python.org/3/howto/regex.html) `re` module to remove all newline characters ("\n") and their surrounding whitespace. You can think of removing newlines as a preprocessing or cleaning step, but in this lesson we won't be explaining the intricacies of regular expressions. For that, you can refer to this introductory explanation in the [Library Carpentry](https://librarycarpentry.org/lc-data-intro/01-regular-expressions.html) lesson.
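As a rough sketch of this cleaning step, the snippet below removes newlines and the whitespace around them with `re.sub`. The pattern `\s*\n\s*`, the helper name `clean_html`, and the small sample document are assumptions made for illustration. In the lesson the input would be the text returned by `requests.get('https://carpentries.org/').text`.

```python
import re


def clean_html(html):
    """Remove newline characters together with the whitespace around them."""
    return re.sub(r'\s*\n\s*', '', html).strip()


# In the lesson the input would be the page itself, for example:
#   import requests
#   req = requests.get('https://carpentries.org/').text
# A small made-up document stands in for it here so the example runs offline.
sample = """
<html>
  <head>
    <title>Example</title>
  </head>
</html>
"""

cleaned_req = clean_html(sample)
print(cleaned_req)
```

With the real page, printing `cleaned_req[0:1000]` instead of the whole string keeps the output to the first 1000 characters.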
```output
<!doctype html><html class="no-js" lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>The Carpentries</title><link rel="stylesheet" type="text/css" href="https://carpentries.org/assets/css/styles_feeling_responsive.css"><script src="https://carpentries.org/assets/js/modernizr.min.js"></script><!-- matomo --><script src="https://carpentries.org/assets/js/matomo-analytics.js"></script><link href="https://fonts.googleapis.com/css?family=Lato:400,400i,700,700i|Roboto:400,400i,700,700i&display=swap" rel="stylesheet"><!-- Search Engine Optimization --><meta name="description" content="The Carpentries is a fiscally sponsored project of Community Initiatives, a registered 501(c)3 non-profit organisation based in California, USA. We are a global community teaching foundational computational and data science skills to researchers in academia, industry and government."><link rel="canonical" href="https://carpentries.org/index.html"><
```

We truncated the output to print only the first 1000 characters of the document, as it is too long, but we can see it is HTML and has some elements we didn't see in the example of the previous episode, like those identified with the `<meta>`, `<link>` and `<script>` tags.

There's another way to see the HTML document behind a website, directly from your web browser. Using Google Chrome, you can right-click on any part of the website (on a Mac, press and hold the Control key on your keyboard while you click), and from the pop-up menu, click 'View page source', as the next image shows. If the 'View page source' option doesn't appear for you, try clicking on another part of the website. A new tab will open with the HTML document for the website you were on.

{alt="A screenshot of The Carpentries homepage in the Google Chrome web browser, showing how to View page source"}

In the HTML page source in your browser you can scroll down and look for the second-level header (`<h2>`) with the text "Upcoming Carpentries Workshops". Or, more easily, you can use the Find Bar (Ctrl + F on Windows and Command + F on Mac) to search for "Upcoming Carpentries Workshops". Right below that header we have the table element we are interested in, which starts with the opening tag `<table class="table table-striped" style="width: 100%;">`. Inside that element we see different nested elements familiar from a table, the rows (`<tr>`) and the cells for each row (`<td>`), and additionally the image (`<img>`) and hyperlink (`<a>`) elements.

## Finding the information we want

Now, going back to our coding, we left off at getting the HTML behind the website using `requests` and storing it, after the cleaning step, in the variable called `cleaned_req`. From here we can proceed with BeautifulSoup as we learned in the previous episode, using the `BeautifulSoup()` function to parse our HTML, as the following code block shows. With the parsed document, we can use the `.find()` or `.find_all()` methods to find the table element.
```python
soup = BeautifulSoup(cleaned_req, 'html.parser')
tables_by_tag = soup.find_all('table')
print("Number of table elements found: ", len(tables_by_tag))
print("Printing only the first 1000 characters of the table element: \n", str(tables_by_tag[0])[0:1000])
```

```output
Number of table elements found: 1
Printing only the first 1000 characters of the table element:
```

From our output we see that there was only one table element in the entire HTML, which corresponds to the table we are looking for. The output you see in the previous code block will be different from what you get on your computer, as the data in the upcoming workshops table is continuously updated.

Besides searching for elements using tags, it is sometimes useful to search using attributes, like `id` or `class`. For example, we can see the table element has a class attribute with two values, "table table-striped", which identifies all elements with similar styling. Therefore, we could get the same result as before using the `class_` argument of the `.find_all()` method as follows.
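A minimal sketch of searching by class, assuming a made-up HTML snippet with two tables rather than the real page (the variable names `tables_by_class` and `tables_by_one_class` are also just illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page: two tables, only one styled like ours
html = (
    '<div>'
    '<table class="table table-striped"><tr><td>workshops</td></tr></table>'
    '<table class="plain"><tr><td>other</td></tr></table>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# Search by tag only: both tables are returned
tables_by_tag = soup.find_all('table')
print("By tag:", len(tables_by_tag))

# Search by tag and class: the string must equal the full class attribute value
tables_by_class = soup.find_all('table', class_='table table-striped')
print("By class:", len(tables_by_class))

# A single class name matches any element that carries that class
tables_by_one_class = soup.find_all('table', class_='table-striped')
print("By one class:", len(tables_by_one_class))
```

The argument is spelled `class_` with a trailing underscore because `class` is a reserved word in Python.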
Now that we know there is only one table element, we can start working with it directly by storing the first and only item of the `tables_by_tag` result set in another variable, which we will call `workshops_table`. We can see that we moved from working with a "ResultSet" object to a "Tag" object, from which we can extract information for each row and cell.

```python
print("Before: ", type(tables_by_tag))
workshops_table = tables_by_tag[0]
print("After:", type(workshops_table))
print("Element type:", workshops_table.name)
```

```output
Before: <class 'bs4.element.ResultSet'>
After: <class 'bs4.element.Tag'>
Element type: table
```
## Navigating the tree

If we use the `prettify()` method on the `workshops_table` variable, we see that this table element has a nested tree structure. On the first level is the `<table>` tag. Inside that, we have rows (`<tr>`), and inside the rows we have table data cells (`<td>`). We can start to identify certain information we may be interested in, for example:

- The type of workshop ('swc' for Software Carpentry, 'dc' for Data Carpentry, 'lc' for Library Carpentry, and 'cp' for workshops based on The Carpentries curriculum). We find this in the first `<td>` tag or, said differently, in the first cell of the row.
- The country in which the workshop was held. We can see the two-letter country code in the second cell of the row.
- The URL of the workshop website, which contains additional information. It is also in the second cell of the row, as the `href` attribute of the `<a>` tag.
- The institution that is hosting the workshop. Also in the second `<td>`, in the text of the hyperlink `<a>` tag.
- The names of the instructors and helpers involved in the workshop.
- The dates of the workshop, in the third and final cell.

To navigate this HTML document tree we can use the attributes `.contents` (to access the direct children of a node), `.parent` (to access the parent node), and `.next_sibling` and `.previous_sibling` (to access the siblings of a node). For example, if we want to access the second row of the table, which is the second child of the table element, we could use the following code.
```python
# The second child of our table element (index 1 in Python indexing)
workshops_table.contents[1]
```
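To see these navigation attributes in isolation, here is a small sketch on a made-up two-row table wrapped in a `<div>` (the HTML, the `wrapper` class, and the variable names are assumptions, not the real page). Note that they are attributes, so they are accessed without parentheses.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table written on a single line, so there are no
# whitespace-only text nodes between the tags (the reason for the cleaning step)
html = (
    '<div class="wrapper"><table>'
    '<tr><td>swc</td><td>Site A</td><td>Jan 1</td></tr>'
    '<tr><td>dc</td><td>Site B</td><td>Feb 2</td></tr>'
    '</table></div>'
)
table = BeautifulSoup(html, 'html.parser').find('table')

# .contents is a list of the direct children, here the two <tr> rows
second_row = table.contents[1]
print(len(table.contents), second_row.contents[0].string)

# .parent moves one level up the tree, to the enclosing <div>
print(table.parent.name, table.parent['class'])

# Siblings share the same parent: start at the middle cell of the second row
middle_cell = second_row.contents[1]
print(middle_cell.previous_sibling.string, middle_cell.next_sibling.string)
```

If the HTML still contained newlines between tags, those newlines would show up as extra text nodes in `.contents` and as siblings, which is why removing them first keeps the indexing simple.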
If you go back to the 'View page source' of the website, you'll notice that the table element is nested inside a `<div class="medium-12 columns">` element, which means this `<div>` is the parent of our `<table>`. If we needed to, we could access this parent by using `workshops_table.parent`.

Now imagine we had selected the second data cell of our fifth row using `workshops_table.contents[4].contents[1]`. We could then access the third data cell using `.next_sibling` or the first data cell using `.previous_sibling`.

```python
# Access the fifth row, and from there, the second data cell
workshops_table.contents[4].contents[1]
```
Why do we bother to learn all these attributes? Depending on your web scraping use case, they may prove useful on complex websites. Let's apply them to extract the information we want about the workshops, for example, to see how many upcoming workshops there are, which corresponds to the number of children the table element has.

```python
num_workshops = len(workshops_table.contents)
print("Number of upcoming workshops listed: ", num_workshops)
```
Let's work on extracting data from only the first row, and later we can use a loop to iterate over all the rows of the table.

```python
# Empty dictionary to hold the data
dict_w = {}

# First row of data
first_row = workshops_table.contents[0]

# Get the first cell, then move to its siblings
first_cell = first_row.contents[0]
second_cell = first_cell.next_sibling
third_cell = second_cell.next_sibling

# From the first cell, find the <img> tag and get the 'title' attribute, which contains the type of workshop
dict_w['type'] = first_cell.find('img')['title']

# In the second cell, get the country from the 'title' attribute of the <img> tag
dict_w['country'] = second_cell.find('img')['title']
```
This was just for one row, but we can iterate over all the rows in the table by adding a for loop and appending each dictionary to a list. That list can then be transformed into a Pandas DataFrame so we can see the results nicely.
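One way this loop could look is sketched below. The two-row table is a simplified, made-up stand-in for the real one (real rows carry more content, such as instructor names), and the dictionary keys `link`, `host` and `date` are illustrative choices.

```python
from bs4 import BeautifulSoup

# Hypothetical rows that mimic the three-cell structure described above
html = (
    '<table>'
    '<tr><td><img title="swc"></td>'
    '<td><img title="US"><a href="https://example.org/w1">Site A</a></td>'
    '<td>Jan 1 - Jan 2</td></tr>'
    '<tr><td><img title="dc"></td>'
    '<td><img title="CA"><a href="https://example.org/w2">Site B</a></td>'
    '<td>Feb 3 - Feb 4</td></tr>'
    '</table>'
)
workshops_table = BeautifulSoup(html, 'html.parser').find('table')

workshop_list = []
for row in workshops_table.contents:
    # Move across the cells of the row through their sibling links
    first_cell = row.contents[0]
    second_cell = first_cell.next_sibling
    third_cell = second_cell.next_sibling

    dict_w = {}  # a fresh dictionary for every row
    dict_w['type'] = first_cell.find('img')['title']
    dict_w['country'] = second_cell.find('img')['title']
    dict_w['link'] = second_cell.find('a')['href']
    dict_w['host'] = second_cell.find('a').get_text()
    dict_w['date'] = third_cell.get_text()
    workshop_list.append(dict_w)

print(workshop_list)
```

If pandas is installed, `pd.DataFrame(workshop_list)` turns this list of dictionaries into a table with one row per workshop and one column per key.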
Great! We've finished our first scraping task on a real website. Please be aware that there are multiple ways of achieving the same result. For example, instead of using the `.contents` attribute to access the different rows of the table, we could have used `.find_all('tr')` to scan the table and loop through the row elements. Similarly, instead of moving to the siblings of the first data cell, we could have used `.find_all('td')`. Code using that other approach would look like this. Remember, the results are the same!
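A sketch of that alternative, again on a made-up two-row table rather than the real page (the HTML and the dictionary keys are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical rows that mimic the three-cell structure of the workshops table
html = (
    '<table>'
    '<tr><td><img title="swc"></td>'
    '<td><img title="US"><a href="https://example.org/w1">Site A</a></td>'
    '<td>Jan 1 - Jan 2</td></tr>'
    '<tr><td><img title="dc"></td>'
    '<td><img title="CA"><a href="https://example.org/w2">Site B</a></td>'
    '<td>Feb 3 - Feb 4</td></tr>'
    '</table>'
)
workshops_table = BeautifulSoup(html, 'html.parser').find('table')

workshop_list = []
for row in workshops_table.find_all('tr'):
    # find_all('td') returns the cells of this row in document order
    cells = row.find_all('td')
    workshop_list.append({
        'type': cells[0].find('img')['title'],
        'country': cells[1].find('img')['title'],
        'link': cells[1].find('a')['href'],
        'host': cells[1].find('a').get_text(),
        'date': cells[2].get_text(),
    })

print(workshop_list)
```

Because `.find_all()` only returns tags, this version does not depend on the newlines having been removed from the HTML beforehand.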