| 1 | +--- |
| 2 | +title: "Hello-Scraping" |
| 3 | +teaching: 30 |
| 4 | +exercises: 5 |
| 5 | +--- |
| 6 | + |
| 7 | +:::::::::::::::::::::::::::::::::::::: questions |
| 8 | + |
| 9 | +- How do you write a lesson using Markdown and `{sandpaper}`? |
| 10 | + |
| 11 | +:::::::::::::::::::::::::::::::::::::::::::::::: |
| 12 | + |
| 13 | +::::::::::::::::::::::::::::::::::::: objectives |
| 14 | + |
| 15 | +- Explain how to use markdown with The Carpentries Workbench |
| 16 | +- Demonstrate how to include pieces of code, figures, and nested challenge blocks |
| 17 | + |
| 18 | +:::::::::::::::::::::::::::::::::::::::::::::::: |
| 19 | + |
| 20 | +## Introduction |
| 21 | + |
| 22 | +This is part two of an Introduction to Web Scraping workshop we offered in February 2024. You can refer to those [workshop materials](https://ucsbcarpentry.github.io/2024-02-27-ucsb-webscraping/) for a gentler introduction to scraping using XPath and the `Scraper` Chrome extension. |
| 23 | + |
| 24 | +We'll refresh some of the concepts covered there to have a practical understanding of how content/data is structured in a website. For that purpose, we'll see what Hypertext Markup Language (HTML) is and how it structures and formats the content using `tags`. From there, we'll use the BeautifulSoup library to parse the HTML content so we can easily search and access elements of the website we are interested in. Starting from basic examples, we'll move to scrape more complex, real-life websites. |
| 25 | + |
| 26 | +## HTML quick overview |
| 27 | + |
| 28 | +All websites have a Hypertext Markup Language (HTML) document behind them. The following text is HTML for a very simple website, with only three sentences. If you read it, can you imagine how that website looks? |
| 29 | + |
| 30 | +```html |
| 31 | +<!DOCTYPE html> |
| 32 | +<html> |
| 33 | +<head> |
| 34 | +<title>Sample web page</title> |
| 35 | +</head> |
| 36 | +<body> |
| 37 | +<h1>h1 Header #1</h1> |
| 38 | +<p>This is a paragraph tag</p> |
| 39 | +<h2>h2 Sub-header</h2> |
| 40 | +<p>A new paragraph, now in the <b>sub-header</b></p> |
| 41 | +<h1>h1 Header #2</h1> |
| 42 | +<p> |
| 43 | +This other paragraph has two hyperlinks, |
| 44 | +one to <a href="https://carpentries.org/">The Carpentries homepage</a>, |
| 45 | +and another to the |
| 46 | +<a href="https://carpentries.org/past_workshops/">past workshops</a> page. |
| 47 | +</p> |
| 48 | +</body> |
| 49 | +</html> |
| 50 | +``` |
| 51 | + |
| 52 | +If you save that text in a file with a `.html` extension and open it, your web browser will interpret the markup and display a nicely formatted website. |
| 53 | + |
| 54 | +{alt="Screenshot of a simple website with the previous HTML"} |
| 55 | + |
| 56 | +An HTML document is composed of **elements**, which can be identified by **tags** written inside angle brackets (`<` and `>`). For example, the HTML root element, which delimits the beginning and end of an HTML document, is identified by the `<html>` tag. |
| 57 | + |
| 58 | +Most elements have both an opening and a closing tag, which determine the span of the element. In the previous simple website, we see a head element that goes from the opening tag `<head>` up to the closing tag `</head>`. Given that an element can be inside another element, an HTML document has a tree structure, where every element is a node that can contain child nodes, as the following image shows. |
| 59 | + |
| 60 | +{alt="Screenshot of a simple website with the previous HTML"} |
| 61 | + |
| 62 | +Finally, we can define or modify the behavior, appearance, or functionality of an element by using **attributes**. Attributes are inside the opening tag, and consist of a name and a value, formatted as `name="value"`. For example, in the previous simple website, we added a hyperlink with the `<a>...</a>` tags, but to set the destination URL we used the `href` attribute by writing the opening tag as `<a href="https://carpentries.org/past_workshops/">`. |
| 63 | + |
| 64 | +Here is a non-exhaustive list of elements you'll find in HTML and their purpose: |
| 65 | + |
| 66 | +- `<html>...</html>` The root element, which contains the entirety of the document. |
| 67 | +- `<head>...</head>` Contains metadata, for example, the title that the web browser displays. |
| 68 | +- `<body>...</body>` The content that is going to be displayed. |
| 69 | +- `<h1>...</h1>, <h2>...</h2>, <h3>...</h3>` Defines headers of level 1, 2, 3, etc. |
| 70 | +- `<p>...</p>` A paragraph. |
| 71 | +- `<a href="">...</a>` Creates a hyperlink, and we provide the destination URL with the `href` attribute. |
| 72 | +- `<img src="" alt="">` Embeds an image, giving a source to the image with the `src` attribute and specifying alternate text with `alt`. |
| 73 | +- `<table>...</table>, <th>...</th>, <tr>...</tr>, <td>...</td>` Defines a table, whose children are header cells (defined inside `th`), rows (defined inside `tr`), and the cells inside a row (defined inside `td`). |
| 74 | +- `<div>...</div>` Is used to group sections of HTML content. |
| 75 | +- `<script>...</script>` Embeds or references JavaScript code. |
| 76 | + |
| 77 | +In the previous list we've described some attributes specific for the hyperlink elements (`<a>`) and the image elements (`<img>`), but there are a few other global attributes that most HTML elements can have and are useful to identify specific elements when doing web scraping: |
| 78 | + |
| 79 | +- `id=""` Assigns a unique identifier to an element, which cannot be repeated in the entire HTML document. |
| 80 | +- `title=""` Provides extra information, displayed as a tooltip when the user hovers over the element. |
| 81 | +- `class=""` Is used to apply a similar styling to multiple elements at once. |
| 82 | + |
| 83 | +To summarize, an **element** is identified by **tags**, and we can assign properties to an element by using **attributes**. Knowing this about HTML will make our lives easier when trying to get some specific data from a website. |
| 84 | + |
| 85 | + |
| 86 | +## Parsing HTML with BeautifulSoup |
| 87 | + |
| 88 | +Now that we know how a website is structured, we can start extracting information from it. The BeautifulSoup package is our main tool for that task, as it will parse the HTML so we can search and access the elements of interest in a programmatic way. |
| 89 | + |
| 90 | +To see how this package works, we'll use the simple website example we showed before. As our first step, we will load the BeautifulSoup package, along with Pandas. |
| 91 | + |
| 92 | +```python |
| 93 | +from bs4 import BeautifulSoup |
| 94 | +import pandas as pd |
| 95 | +``` |
| 96 | + |
| 97 | +Let's store the HTML content in a string variable called `example_html`: |
| 98 | + |
| 99 | +```python |
| 100 | +example_html = """ |
| 101 | +<!DOCTYPE html> |
| 102 | +<html> |
| 103 | +<head> |
| 104 | +<title>Sample web page</title> |
| 105 | +</head> |
| 106 | +<body> |
| 107 | +<h1>h1 Header #1</h1> |
| 108 | +<p>This is a paragraph tag</p> |
| 109 | +<h2>h2 Sub-header</h2> |
| 110 | +<p>A new paragraph, now in the <b>sub-header</b></p> |
| 111 | +<h1>h1 Header #2</h1> |
| 112 | +<p> |
| 113 | +This other paragraph has two hyperlinks, |
| 114 | +one to <a href="https://carpentries.org/">The Carpentries homepage</a>, |
| 115 | +and another to the |
| 116 | +<a href="https://carpentries.org/past_workshops/">past workshops</a> page. |
| 117 | +</p> |
| 118 | +</body> |
| 119 | +</html> |
| 120 | +""" |
| 121 | +``` |
| 122 | + |
| 123 | +We parse this HTML by calling `BeautifulSoup()`, which we imported, and specifying that we want to use the `html.parser` parser. The resulting object represents the document as a nested data structure, similar to the tree we mentioned before. If we use the `.prettify()` method on this object, we can see the nested structure, as inner elements are indented to the right. |
| 124 | + |
| 125 | +```python |
| 126 | +soup = BeautifulSoup(example_html, 'html.parser') |
| 127 | +print(soup.prettify()) |
| 128 | +``` |
| 129 | + |
| 130 | +```output |
| 131 | +<!DOCTYPE html> |
| 132 | +<html> |
| 133 | + <head> |
| 134 | + <title> |
| 135 | + Sample web page |
| 136 | + </title> |
| 137 | + </head> |
| 138 | + <body> |
| 139 | + <h1> |
| 140 | + h1 Header #1 |
| 141 | + </h1> |
| 142 | + <p> |
| 143 | + This is a paragraph tag |
| 144 | + </p> |
| 145 | + <h2> |
| 146 | + h2 Sub-header |
| 147 | + </h2> |
| 148 | + <p> |
| 149 | + A new paragraph, now in the |
| 150 | + <b> |
| 151 | + sub-header |
| 152 | + </b> |
| 153 | + </p> |
| 154 | + <h1> |
| 155 | + h1 Header #2 |
| 156 | + </h1> |
| 157 | + <p> |
| 158 | + This other paragraph has two hyperlinks, one to |
| 159 | + <a href="https://carpentries.org/"> |
| 160 | + The Carpentries homepage |
| 161 | + </a> |
| 162 | + , and another to the |
| 163 | + <a href="https://carpentries.org/past_workshops/"> |
| 164 | + past workshops |
| 165 | + </a> |
| 166 | + . |
| 167 | + </p> |
| 168 | + </body> |
| 169 | +</html> |
| 170 | +``` |
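
The nested structure can also be explored from Python. The following is a minimal sketch, not part of the original lesson: it uses a shortened copy of the example page, and the names `small_html`, `small_soup`, and `outline` are made up for illustration. It walks the `.children` of each element and records the tag name at each level of the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A shortened copy of the example page, so this block runs on its own.
small_html = """
<html>
<head><title>Sample web page</title></head>
<body>
<h1>h1 Header #1</h1>
<p>A paragraph with <b>bold text</b></p>
</body>
</html>
"""

small_soup = BeautifulSoup(small_html, 'html.parser')

def outline(node, depth=0):
    """Return one indented line per element found under `node`."""
    lines = []
    for child in node.children:
        # Skip plain text nodes; only elements (Tag objects) have a tag name.
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(small_soup)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one step down the tree, so `title` appears under `head`, and `b` appears under `p`.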
| 171 | + |
| 172 | +Now that our `soup` variable holds the parsed document, we can use the `.find()` and `.find_all()` methods. `.find()` will search for the tag that we specify, and return the entire element, including the opening and closing tags. If there are multiple elements with the same tag, `.find()` will only return the first one. If you want to return all the elements that match your search, use `.find_all()` instead, which will return them in a list. Additionally, to return the text contained in a given element and all its children, use `.get_text()`. Below you can see how all these commands play out in our simple website example. |
| 173 | + |
| 174 | +```python |
| 175 | +print("1.", soup.find('title')) |
| 176 | +print("2.", soup.find('title').get_text()) |
| 177 | +print("3.", soup.find('h1').get_text()) |
| 178 | +print("4.", soup.find_all('h1')) |
| 179 | +print("5.", soup.find_all('a')) |
| 180 | +print("6.", soup.get_text()) |
| 181 | +``` |
| 182 | + |
| 183 | +```output |
| 184 | +1. <title>Sample web page</title> |
| 185 | +2. Sample web page |
| 186 | +3. h1 Header #1 |
| 187 | +4. [<h1>h1 Header #1</h1>, <h1>h1 Header #2</h1>] |
| 188 | +5. [<a href="https://carpentries.org/">The Carpentries homepage</a>, <a href="https://carpentries.org/past_workshops/">past workshops</a>] |
| 189 | +6. |
| 190 | +
|
| 191 | +
|
| 192 | +
|
| 193 | +Sample web page |
| 194 | +
|
| 195 | +
|
| 196 | +h1 Header #1 |
| 197 | +This is a paragraph tag |
| 198 | +h2 Sub-header |
| 199 | +A new paragraph, now in the sub-header |
| 200 | +h1 Header #2 |
| 201 | +This other paragraph has two hyperlinks, one to The Carpentries homepage, and another to the past workshops. |
| 202 | +
|
| 203 | +
|
| 204 | +
|
| 205 | +
|
| 206 | +``` |
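
The blank lines in the last result come from the line breaks between elements, which `.get_text()` keeps by default. As a hedged sketch (the snippet and the variable names `snippet`, `snippet_soup`, `raw_text`, and `clean_text` are illustrative, not from the lesson), the `separator` and `strip` arguments of `.get_text()` can produce a more compact string:

```python
from bs4 import BeautifulSoup

# A small made-up fragment based on the example page.
snippet = """
<body>
<h1>h1 Header #1</h1>
<p>A new paragraph, now in the <b>sub-header</b></p>
</body>
"""
snippet_soup = BeautifulSoup(snippet, 'html.parser')

# Default: every piece of text is concatenated as-is, line breaks included.
raw_text = snippet_soup.get_text()

# strip=True trims each piece of text and drops whitespace-only pieces;
# separator is placed between the pieces that remain.
clean_text = snippet_soup.get_text(separator=" ", strip=True)
print(clean_text)
```

A single space is used as the separator here because, with `strip=True`, an empty separator would glue together words that sit in different elements.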
| 207 | + |
| 208 | +How would you extract all hyperlinks identified with `<a>` tags? In our example, we see that there are only two hyperlinks, and we could extract them in a list using the `.find_all('a')` method. |
| 209 | + |
| 210 | +```python |
| 211 | +links = soup.find_all('a') |
| 212 | +print("Number of hyperlinks found: ", len(links)) |
| 213 | +print(links) |
| 214 | +``` |
| 215 | +```output |
| 216 | +Number of hyperlinks found: 2 |
| 217 | +[<a href="https://carpentries.org/">The Carpentries homepage</a>, <a href="https://carpentries.org/past_workshops/">past workshops</a>] |
| 218 | +``` |
| 219 | + |
| 220 | +To access the value of a given attribute in an element, for example the value of the `href` attribute in `<a href="">`, we would use square brackets and the name of the attribute (`['href']`), just like how in a Python dictionary we would access the value using the respective key. Let's make a loop that prints only the URL for each hyperlink we have in our example. |
| 221 | + |
| 222 | +```python |
| 223 | +for item in links: |
| 224 | + print(item['href']) |
| 225 | +``` |
| 226 | +```output |
| 227 | +https://carpentries.org/ |
| 228 | +https://carpentries.org/past_workshops/ |
| 229 | +``` |
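
Attributes are also useful for searching, and not every element is guaranteed to have the attribute you ask for. The sketch below assumes a hypothetical snippet (the `id` and `class` values, and the names `attr_html` and `attr_soup`, are invented for illustration) and shows searching by `id` and `class`, plus the `.get()` method as a tolerant alternative to square brackets.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the id and class values are made up for illustration.
attr_html = """
<p id="intro" class="note">First paragraph</p>
<p class="note">Second paragraph</p>
<a href="https://carpentries.org/">With a destination</a>
<a>Without a destination</a>
"""
attr_soup = BeautifulSoup(attr_html, 'html.parser')

# Search by attribute value; class_ has a trailing underscore because
# `class` is a reserved word in Python.
intro = attr_soup.find(id="intro")
notes = attr_soup.find_all('p', class_="note")

# .get() returns None when the attribute is missing, whereas
# square brackets would raise a KeyError for the second <a>.
urls = [a.get('href') for a in attr_soup.find_all('a')]

print(intro.get_text())
print(len(notes))
print(urls)
```

Using `.get()` is one reasonable choice when scraping real pages, where some `<a>` elements may lack an `href`.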
| 230 | + |
| 231 | +::::::::::::::::::::::::::::::::::::: challenge |
| 232 | + |
| 233 | +Create a Python dictionary that has the following three items, containing information about the **first** hyperlink in the HTML of our example. |
| 234 | + |
| 235 | +```python |
| 236 | +first_link = { |
| 237 | + 'element': the complete hyperlink element, |
| 238 | + 'url': the destination url of the hyperlink, |
| 239 | + 'text': the text that the website displays as the hyperlink |
| 240 | +} |
| 241 | +``` |
| 242 | + |
| 243 | +:::::::::::::::::::::::: solution |
| 244 | + |
| 245 | +One way of completing the exercise is as follows. |
| 246 | + |
| 247 | +```python |
| 248 | +first_link = { |
| 249 | + 'element': str(soup.find('a')), |
| 250 | + 'url': soup.find('a')['href'], |
| 251 | + 'text': soup.find('a').get_text() |
| 252 | +} |
| 253 | +``` |
| 254 | +An alternative is to store the element in a variable, so that `soup.find('a')` is called only once, and to start from an empty dictionary and add each key and value to it. This pattern will be useful when we repeat the process in a for loop. |
| 255 | + |
| 256 | +```python |
| 257 | +find_a = soup.find('a') |
| 258 | +first_link = {} |
| 259 | +first_link['element'] = str(find_a) |
| 260 | +first_link['url'] = find_a['href'] |
| 261 | +first_link['text'] = find_a.get_text() |
| 262 | +``` |
| 263 | +::::::::::::::::::::::::::::::::: |
| 264 | + |
| 265 | +:::::::::::::::::::::::::::::::::::::::::::::::: |
| 266 | + |
| 267 | +To finish this introduction to HTML and BeautifulSoup, let's write code that extracts, in a structured way, all the hyperlink elements, their destination URLs, and the text displayed for each link. For that, we'll use the `links` variable that we created before with `links = soup.find_all('a')`. We'll loop over each hyperlink element found, store the three pieces of information we want in a dictionary, and append that dictionary to a list called `list_of_dicts`. At the end we will have a list with two elements, which we can transform into a Pandas dataframe. |
| 268 | + |
| 269 | +```python |
| 270 | +links = soup.find_all('a') |
| 271 | +list_of_dicts = [] |
| 272 | +for item in links: |
| 273 | + dict_a = {} |
| 274 | + dict_a['element'] = str(item) |
| 275 | + dict_a['url'] = item['href'] |
| 276 | + dict_a['text'] = item.get_text() |
| 277 | + list_of_dicts.append(dict_a) |
| 278 | + |
| 279 | +links_df = pd.DataFrame(list_of_dicts) |
| 280 | +print(links_df) |
| 281 | +``` |
| 282 | + |
| 283 | +```output |
| 284 | + element \ |
| 285 | +0 <a href="https://carpentries.org/">The Carpent... |
| 286 | +1 <a href="https://carpentries.org/past_workshop... |
| 287 | +
|
| 288 | + url text |
| 289 | +0 https://carpentries.org/ The Carpentries homepage |
| 290 | +1 https://carpentries.org/past_workshops/ past workshops |
| 291 | +``` |
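
If the same extraction is needed for several pages, the loop can be wrapped in a function. This is only a sketch under stated assumptions: `extract_links` and `sample` are hypothetical names, and the `href=True` filter is an addition that skips `<a>` elements without a destination, which the loop in the lesson does not do.

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return one dictionary per <a> element that has an href attribute."""
    page = BeautifulSoup(html, 'html.parser')
    rows = []
    # href=True keeps only the <a> elements that define an href attribute.
    for item in page.find_all('a', href=True):
        rows.append({
            'element': str(item),
            'url': item['href'],
            'text': item.get_text(),
        })
    return rows

# Made-up input with one complete hyperlink and one bare anchor.
sample = '<p>See <a href="https://carpentries.org/">The Carpentries homepage</a> and <a>a bare anchor</a>.</p>'
rows = extract_links(sample)
print(rows)
```

The returned list has the same shape as `list_of_dicts`, so it can be passed to `pd.DataFrame()` in the same way.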
| 292 | + |
| 293 | +You'll find more useful information about the BeautifulSoup package and how to use all its methods in the [Beautiful Soup Documentation website](https://beautiful-soup-4.readthedocs.io/en/latest/). |
| 294 | + |
| 295 | +::::::::::::::::::::::::::::::::::::: keypoints |
| 296 | + |
| 297 | +- Use `.md` files for episodes when you want static content |
| 298 | +- Use `.Rmd` files for episodes when you need to generate output |
| 299 | +- Run `sandpaper::check_lesson()` to identify any issues with your lesson |
| 300 | +- Run `sandpaper::build_lesson()` to preview your lesson locally |
| 301 | + |
| 302 | +:::::::::::::::::::::::::::::::::::::::::::::::: |
| 303 | + |