
Commit 1ec65df

committed
finished ep1 and started ep2
1 parent 667d5a9 commit 1ec65df

6 files changed

Lines changed: 3370 additions & 279 deletions


.vscode/settings.json

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
{
    "r.rterm.windows": "C:\\Program Files\\R\\R-4.4.1\\bin\\x64\\Rterm.exe",
    "r.rpath.windows": "C:\\Program Files\\R\\R-4.4.1\\bin\\x64\\R.exe"
}

config.yaml

Lines changed: 2 additions & 1 deletion
@@ -66,7 +66,8 @@ contact: 'jose_nino@ucsb.edu' # FIXME
 
 # Order of episodes in your lesson
 episodes:
-- introduction.md
+- hello-scraping.md
+- a-real-website.md
 
 # Information for Learners
 learners:

episodes/a-real-website.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
---
title: "Scraping a real website"
teaching: 30
exercises: 15
---

:::::::::::::::::::::::::::::::::::::: questions

- How do you write a lesson using Markdown and `{sandpaper}`?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Explain how to use markdown with The Carpentries Workbench
- Demonstrate how to include pieces of code, figures, and nested challenge blocks

::::::::::::::::::::::::::::::::::::::::::::::::

## A "Requests" to a website

In the previous episode we used a simple HTML document, not an actual website. Now that we are moving to a more realistic, complex scenario, we need to add another package to our toolbox, the `requests` package. For the purposes of this web scraping lesson, we will only use `requests` to get the HTML behind a website. The package has plenty of additional functionality that we won't cover here, but that you can find in the [Requests package documentation](https://requests.readthedocs.io/en/latest/).

We'll be scraping The Carpentries website, [https://carpentries.org/](https://carpentries.org/), and the list of upcoming and past workshops you can find there. First we'll load the `requests` package, and then use `.get().text` to store the HTML document of the website.

```python
import requests
url = 'https://carpentries.org/'
req = requests.get(url).text
print(req)
```

```output
<!doctype html>
<html class="no-js" lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>The Carpentries</title>

<link rel="stylesheet" type="text/css" href="https://carpentries.org/assets/css/styles_feeling_responsive.css">


<script src="https://carpentries.org/assets/js/modernizr.min.js"></script>

<!-- matomo -->
<script src="https://carpentries.org/assets/js/matomo-analytics.js"></script>

<link href="https://fonts.googleapis.com/css?family=Lato:400,400i,700,700i|Roboto:400,400i,700,700i&display=swap" rel="stylesheet">

<!-- Search Engine Optimization -->
<meta name="description" content="The Carpentries is a fiscally sponsored project of Community Initiatives, a registered 501(c)3 non-profit organisation based in California, USA. We are a global community teaching foundational computational and data science skills to researchers in academia, industry and government.">

...
</body>
</html>
```

The output of the previous code was truncated because it is too long, but we can see that it is HTML and that it has some elements we didn't see in our earlier simple example, like those identified by the `<meta>`, `<link>`, and `<script>` tags.
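
If we want to peek at these new elements programmatically, we could hand the HTML we just downloaded to BeautifulSoup, introduced in the previous episode, and count how often each of these tags appears. This is only a quick sketch; the exact counts will change as the live website changes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(req, 'html.parser')  # req holds the HTML we downloaded above
for tag in ['meta', 'link', 'script']:
    # .find_all() returns every element with that tag; len() counts them
    print(tag, len(soup.find_all(tag)))
```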

::::::::::::::::::::::::::::::::::::: keypoints

- Use `.md` files for episodes when you want static content
- Use `.Rmd` files for episodes when you need to generate output
- Run `sandpaper::check_lesson()` to identify any issues with your lesson
- Run `sandpaper::build_lesson()` to preview your lesson locally

::::::::::::::::::::::::::::::::::::::::::::::::

[r-markdown]: https://rmarkdown.rstudio.com/

episodes/hello-scraping.md

Lines changed: 304 additions & 0 deletions
@@ -0,0 +1,304 @@
---
title: "Hello-Scraping"
teaching: 30
exercises: 5
---

:::::::::::::::::::::::::::::::::::::: questions

- How do you write a lesson using Markdown and `{sandpaper}`?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Explain how to use markdown with The Carpentries Workbench
- Demonstrate how to include pieces of code, figures, and nested challenge blocks

::::::::::::::::::::::::::::::::::::::::::::::::

## Introduction

This is part two of the Introduction to Web Scraping workshop we offered in February 2024. You can refer to those [workshop materials](https://ucsbcarpentry.github.io/2024-02-27-ucsb-webscraping/) for a gentler introduction to scraping using XPath and the `Scraper` Chrome extension.

We'll refresh some of the concepts covered there to build a practical understanding of how content and data are structured in a website. For that purpose, we'll see what Hypertext Markup Language (HTML) is and how it structures and formats content using `tags`. From there, we'll use the BeautifulSoup library to parse the HTML content so we can easily search and access the elements of the website we are interested in. Starting from basic examples, we'll move on to scraping more complex, real-life websites.

## HTML quick overview

All websites have a Hypertext Markup Language (HTML) document behind them. The following text is the HTML for a very simple website with only three sentences. If you read it, can you imagine how that website would look?

```html
<!DOCTYPE html>
<html>
<head>
<title>Sample web page</title>
</head>
<body>
<h1>h1 Header #1</h1>
<p>This is a paragraph tag</p>
<h2>h2 Sub-header</h2>
<p>A new paragraph, now in the <b>sub-header</b></p>
<h1>h1 Header #2</h1>
<p>
This other paragraph has two hyperlinks,
one to <a href="https://carpentries.org/">The Carpentries homepage</a>,
and another to the
<a href="https://carpentries.org/past_workshops/">past workshops</a> page.
</p>
</body>
</html>
```

Well, if you put that text in a file with a .html extension, the job of your web browser, when opening the file, is to interpret that (markup) language and display a nicely formatted website.

![](fig/simple_website.PNG){alt="Screenshot of a simple website rendered from the previous HTML"}

An HTML document is composed of **elements**, which can be identified by **tags** written inside angle brackets (`<` and `>`). For example, the HTML root element, which delimits the beginning and end of an HTML document, is identified by the `<html>` tag.

Most elements have both an opening and a closing tag, determining the span of the element. In the previous simple website, we see a head element that goes from the opening tag `<head>` up to the closing tag `</head>`. Given that an element can be inside another element, an HTML document has a tree structure, where every element is a node that can contain child nodes, as the following image shows.

![The Document Object Model (DOM) that represents an HTML document with a tree structure. Source: Wikipedia. Author: Birger Eriksson](https://upload.wikimedia.org/wikipedia/commons/5/5a/DOM-model.svg){alt="Diagram of the DOM tree structure of an HTML document"}

Finally, we can define or modify the behavior, appearance, or functionality of an element by using **attributes**. Attributes live inside the opening tag and consist of a name and a value, formatted as `name="value"`. For example, in the previous simple website, we added a hyperlink with the `<a>...</a>` tags, but to set the destination URL we used the `href` attribute by writing `a href="https://carpentries.org/past_workshops/"` in the opening tag.

Here is a non-exhaustive list of elements you'll find in HTML and their purpose:

- `<html>...</html>` The root element, which contains the entirety of the document.
- `<head>...</head>` Contains metadata, for example, the title that the web browser displays.
- `<body>...</body>` The content that is going to be displayed.
- `<h1>...</h1>, <h2>...</h2>, <h3>...</h3>` Define headers of level 1, 2, 3, etc.
- `<p>...</p>` A paragraph.
- `<a href="">...</a>` Creates a hyperlink; we provide the destination URL with the `href` attribute.
- `<img src="" alt="">` Embeds an image, giving the source of the image with the `src` attribute and specifying alternative text with `alt`.
- `<table>...</table>, <th>...</th>, <tr>...</tr>, <td>...</td>` Define a table, whose children are header cells (defined inside `th`), rows (defined inside `tr`), and cells inside a row (defined inside `td`).
- `<div>...</div>` Is used to group sections of HTML content.
- `<script>...</script>` Embeds or references JavaScript code.

In the previous list we've described some attributes specific to hyperlink elements (`<a>`) and image elements (`<img>`), but there are a few other global attributes that most HTML elements can have and that are useful for identifying specific elements when doing web scraping:

- `id=""` Assigns a unique identifier to an element, which cannot be repeated anywhere else in the HTML document.
- `title=""` Provides extra information, displayed as a tooltip when the user hovers over the element.
- `class=""` Is used to apply a similar styling to multiple elements at once.

To summarize, an **element** is identified by **tags**, and we can assign properties to an element by using **attributes**. Knowing this about HTML will make our lives easier when trying to get some specific data from a website.
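
As a quick, made-up illustration (this snippet is not part of our example page), here is how those global attributes might look inside the opening tags of a couple of elements:

```html
<p id="intro" class="lead" title="Shown on hover">Welcome to the page</p>
<p class="lead">Every element with class "lead" can share the same styling</p>
```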

## Parsing HTML with BeautifulSoup

Now that we know how a website is structured, we can start extracting information from it. The BeautifulSoup package is our main tool for that task, as it parses the HTML so we can search and access the elements of interest in a programmatic way.

To see how this package works, we'll use the simple website example we showed before. As our first step, we will load the BeautifulSoup package, along with Pandas.

```python
from bs4 import BeautifulSoup
import pandas as pd
```

Let's put the HTML content inside a string variable called `example_html`.

```python
example_html = """
<!DOCTYPE html>
<html>
<head>
<title>Sample web page</title>
</head>
<body>
<h1>h1 Header #1</h1>
<p>This is a paragraph tag</p>
<h2>h2 Sub-header</h2>
<p>A new paragraph, now in the <b>sub-header</b></p>
<h1>h1 Header #2</h1>
<p>
This other paragraph has two hyperlinks,
one to <a href="https://carpentries.org/">The Carpentries homepage</a>,
and another to the
<a href="https://carpentries.org/past_workshops/">past workshops</a> page.
</p>
</body>
</html>
"""
```

We parse this HTML using the `BeautifulSoup()` function we imported, specifying that we want to use the `html.parser`. The resulting object represents the document as a nested data structure, similar to the tree we mentioned before. If we use the `.prettify()` method on this object, we can see the nested structure, as inner elements are indented to the right.

```python
soup = BeautifulSoup(example_html, 'html.parser')
print(soup.prettify())
```

```output
<!DOCTYPE html>
<html>
 <head>
  <title>
   Sample web page
  </title>
 </head>
 <body>
  <h1>
   h1 Header #1
  </h1>
  <p>
   This is a paragraph tag
  </p>
  <h2>
   h2 Sub-header
  </h2>
  <p>
   A new paragraph, now in the
   <b>
    sub-header
   </b>
  </p>
  <h1>
   h1 Header #2
  </h1>
  <p>
   This other paragraph has two hyperlinks, one to
   <a href="https://carpentries.org/">
    The Carpentries homepage
   </a>
   , and another to the
   <a href="https://carpentries.org/past_workshops/">
    past workshops
   </a>
   .
  </p>
 </body>
</html>
```

Now that our `soup` variable holds the parsed document, we can use the `.find()` and `.find_all()` methods. `.find()` will search for the tag we specify and return the entire element, including the opening and closing tags. If there are multiple elements with the same tag, `.find()` will only return the first one. If you want all the elements that match your search, you'd use `.find_all()` instead, which returns them in a list. Additionally, to return the text contained in a given element and all its children, you'd use `.get_text()`. Below you can see how all these commands play out in our simple website example.

```python
print("1.", soup.find('title'))
print("2.", soup.find('title').get_text())
print("3.", soup.find('h1').get_text())
print("4.", soup.find_all('h1'))
print("5.", soup.find_all('a'))
print("6.", soup.get_text())
```

```output
1. <title>Sample web page</title>
2. Sample web page
3. h1 Header #1
4. [<h1>h1 Header #1</h1>, <h1>h1 Header #2</h1>]
5. [<a href="https://carpentries.org/">The Carpentries homepage</a>, <a href="https://carpentries.org/past_workshops/">past workshops</a>]
6.



Sample web page


h1 Header #1
This is a paragraph tag
h2 Sub-header
A new paragraph, now in the sub-header
h1 Header #2
This other paragraph has two hyperlinks, one to The Carpentries homepage, and another to the past workshops.


```
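
Besides searching by tag, `.find()` and `.find_all()` can also filter on attribute values such as `id` and `class`, which becomes handy on larger pages. Our example page doesn't use those attributes, so the snippet below is a minimal, made-up illustration (the `id` and `class` values are hypothetical):

```python
attribute_html = '<p id="intro" class="lead">Welcome</p><p class="lead">Thanks for visiting</p>'
attribute_soup = BeautifulSoup(attribute_html, 'html.parser')

# An id should be unique, so searching by id returns a single element
print(attribute_soup.find(id='intro'))

# Several elements can share a class; the keyword is class_ (with an underscore)
# because "class" is a reserved word in Python
print(attribute_soup.find_all('p', class_='lead'))
```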

How would you extract all hyperlinks, identified with `<a>` tags? In our example, we see that there are only two hyperlinks, and we can extract them in a list using the `.find_all('a')` method.

```python
links = soup.find_all('a')
print("Number of hyperlinks found: ", len(links))
print(links)
```
```output
Number of hyperlinks found: 2
[<a href="https://carpentries.org/">The Carpentries homepage</a>, <a href="https://carpentries.org/past_workshops/">past workshops</a>]
```

To access the value of a given attribute in an element, for example the value of the `href` attribute in `<a href="">`, we use square brackets and the name of the attribute (`['href']`), just as we would access a value in a Python dictionary using its key. Let's write a loop that prints only the URL of each hyperlink in our example.

```python
for item in links:
    print(item['href'])
```
```output
https://carpentries.org/
https://carpentries.org/past_workshops/
```

::::::::::::::::::::::::::::::::::::: challenge

Create a Python dictionary with the following three items, containing information about the **first** hyperlink in the HTML of our example.

```python
first_link = {
    'element': the complete hyperlink element,
    'url': the destination url of the hyperlink,
    'text': the text that the website displays as the hyperlink
}
```

:::::::::::::::::::::::: solution

One way of completing the exercise is as follows.

```python
first_link = {
    'element': str(soup.find('a')),
    'url': soup.find('a')['href'],
    'text': soup.find('a').get_text()
}
```
An alternative, similar way is to store the element we found, so we don't call `soup.find('a')` multiple times, and to first create an empty dictionary and then add the keys and values we want. This pattern will be useful when we do the same thing repeatedly inside a for loop.

```python
find_a = soup.find('a')
first_link = {}
first_link['element'] = str(find_a)
first_link['url'] = find_a['href']
first_link['text'] = find_a.get_text()
```
:::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::

To finish this introduction to HTML and BeautifulSoup, let's write code that extracts, in a structured way, all the hyperlink elements, their destination URLs, and the text displayed for each link. We'll use the `links` variable that we created before as `links = soup.find_all('a')`. We'll loop over each hyperlink element found, store the three pieces of information we want for each one in a dictionary, and append that dictionary to a list called `list_of_dicts`. At the end we will have a list with two dictionaries, which we can transform into a Pandas dataframe.

```python
links = soup.find_all('a')
list_of_dicts = []
for item in links:
    dict_a = {}
    dict_a['element'] = str(item)
    dict_a['url'] = item['href']
    dict_a['text'] = item.get_text()
    list_of_dicts.append(dict_a)

links_df = pd.DataFrame(list_of_dicts)
print(links_df)
```

```output
                                             element  \
0  <a href="https://carpentries.org/">The Carpent...   
1  <a href="https://carpentries.org/past_workshop...   

                                       url                      text  
0                 https://carpentries.org/  The Carpentries homepage  
1  https://carpentries.org/past_workshops/            past workshops  
```
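
If you want to keep what you scraped, the dataframe can be written to disk with Pandas; the filename below is just an example.

```python
# Save the table of links as a CSV file (the filename is arbitrary)
links_df.to_csv('carpentries_links.csv', index=False)
```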

You'll find more useful information about the BeautifulSoup package and how to use all its methods in the [Beautiful Soup Documentation website](https://beautiful-soup-4.readthedocs.io/en/latest/).

::::::::::::::::::::::::::::::::::::: keypoints

- Use `.md` files for episodes when you want static content
- Use `.Rmd` files for episodes when you need to generate output
- Run `sandpaper::check_lesson()` to identify any issues with your lesson
- Run `sandpaper::build_lesson()` to preview your lesson locally

::::::::::::::::::::::::::::::::::::::::::::::::

[r-markdown]: https://rmarkdown.rstudio.com/
