
Commit d028f50

added questions, objectives, key points for 3 episodes, and setup instructions

2 parents 16ac023 + 84cc0f6

5 files changed: 52 additions & 71 deletions


episodes/a-real-website.md

Lines changed: 9 additions & 8 deletions
@@ -6,14 +6,16 @@ exercises: 15

 :::::::::::::::::::::::::::::::::::::: questions

-- How do you write a lesson using Markdown and `{sandpaper}`?
+- How can I get the data and information from a real website?
+- How can I start automating my web scraping tasks?

 ::::::::::::::::::::::::::::::::::::::::::::::::

 ::::::::::::::::::::::::::::::::::::: objectives

-- Explain how to use markdown with The Carpentries Workbench
-- Demonstrate how to include pieces of code, figures, and nested challenge blocks
+- Use the `requests` package to get the HTML document behind a website.
+- Navigate the tree structure behind an HTML document to extract the information we need.
+- Know how to avoid being blocked by sending too many requests to a website.

 ::::::::::::::::::::::::::::::::::::::::::::::::

@@ -139,7 +141,7 @@ print(workshops_table.prettify())
 </table>
 ```

-To navigate in this HTML document tree we can use the methods `.contents()` (to access direct children nodes), `.parent()` (to access the parent node), `.next_sibling()`, and `.previous_sibling()` (to access the siblings of a node) methods. For example, if we want to access the second row of the table, which is the second child of the table element we could use the following code.
+To navigate this HTML document tree, we can use the following properties of the `bs4.element.Tag` object: `.contents` (to access direct child nodes), `.parent` (to access the parent node), and `.next_sibling` and `.previous_sibling` (to access the siblings of a node). For example, if we want to access the second row of the table, which is the second child of the table element, we could use the following code.

 ```python
 # The second [1 in Python indexing] child of our table element
@@ -424,10 +426,9 @@ print(req.status_code)

 ::::::::::::::::::::::::::::::::::::: keypoints

-- Use `.md` files for episodes when you want static content
-- Use `.Rmd` files for episodes when you need to generate output
-- Run `sandpaper::check_lesson()` to identify any issues with your lesson
-- Run `sandpaper::build_lesson()` to preview your lesson locally
+- We can get the HTML behind any website using the `requests` package and the function `requests.get('website_url').text`.
+- An HTML document is a nested tree of elements. Therefore, from a given element, we can access its child, parent, or sibling using `.contents`, `.parent`, `.next_sibling`, and `.previous_sibling`.
+- It's polite not to send too many requests to a website in a short period of time. For that, we can use the `sleep()` function of the built-in Python module `time`.

 ::::::::::::::::::::::::::::::::::::::::::::::::
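The keypoints added in this episode can be sketched in a few lines. This is a minimal illustration, not the lesson's own code: the HTML snippet stands in for a page we would normally fetch with `requests`.

```python
import time

import requests
from bs4 import BeautifulSoup

# A small HTML table standing in for a real page; on a live site we would
# fetch it with: html = requests.get("website_url").text
html = """<table>
<tr><td>Workshop A</td></tr>
<tr><td>Workshop B</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# .contents lists direct children (including whitespace text nodes);
# .parent, .next_sibling, and .previous_sibling move around the tree.
# find_all("tr") skips the whitespace nodes for us.
rows = table.find_all("tr")
second_row = rows[1]
print(second_row.get_text(strip=True))  # Workshop B
print(second_row.parent.name)           # table

time.sleep(1)  # be polite: pause between successive requests to a site
```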

episodes/dynamic-websites.md

Lines changed: 17 additions & 8 deletions
@@ -6,14 +6,17 @@ exercises: 5

 :::::::::::::::::::::::::::::::::::::: questions

-- How do you write a lesson using Markdown and `{sandpaper}`?
+- What are the differences between static and dynamic websites?
+- Why is it important to understand these differences when doing web scraping?
+- How can I start my own web scraping project?

 ::::::::::::::::::::::::::::::::::::::::::::::::

 ::::::::::::::::::::::::::::::::::::: objectives

-- Explain how to use markdown with The Carpentries Workbench
-- Demonstrate how to include pieces of code, figures, and nested challenge blocks
+- Use the `Selenium` package to scrape dynamic websites.
+- Identify the elements of interest using the browser's "Inspect" tool.
+- Understand the usual pipeline of a web scraping project.

 ::::::::::::::::::::::::::::::::::::::::::::::::

@@ -257,11 +260,17 @@ This scraping pipeline helps break down complex scraping tasks into manageable s

 ::::::::::::::::::::::::::::::::::::: keypoints

-- Use `.md` files for episodes when you want static content
-- Use `.Rmd` files for episodes when you need to generate output
-- Run `sandpaper::check_lesson()` to identify any issues with your lesson
-- Run `sandpaper::build_lesson()` to preview your lesson locally
-
+- Dynamic websites load content using JavaScript, which isn't present in the initial or source HTML. It's important to distinguish between static and dynamic content when planning your scraping approach.
+- The `Selenium` package and its `webdriver` module simulate a real user interacting with a browser, allowing it to execute JavaScript and to click, scroll, or fill in text boxes.
+- Here are the commands we learned for using `Selenium`:
+  - `webdriver.Chrome()` # Start the Google Chrome browser simulator
+  - `.get("website_url")` # Go to a given website
+  - `.find_element(by, value)` and `.find_elements(by, value)` # Get a given element or elements
+  - `.click()` # Click the selected element
+  - `.page_source` # Get the HTML after JavaScript has executed, which can later be parsed with BeautifulSoup
+  - `.quit()` # Close the browser simulator
+- The browser's "Inspect" tool allows users to view the HTML document after dynamic content has loaded, revealing elements added by JavaScript. This tool helps identify the specific elements you are interested in scraping.
+- A typical scraping pipeline involves understanding the website's structure, determining content type (static or dynamic), using the appropriate tools (requests and BeautifulSoup for static, Selenium and BeautifulSoup for dynamic), and structuring the scraped data for analysis.
 ::::::::::::::::::::::::::::::::::::::::::::::::
[r-markdown]: https://rmarkdown.rstudio.com/

episodes/hello-scraping.md

Lines changed: 13 additions & 9 deletions
@@ -1,19 +1,21 @@
 ---
 title: "Hello-Scraping"
-teaching: 30
-exercises: 5
+teaching: 40
+exercises: 10
 ---

 :::::::::::::::::::::::::::::::::::::: questions

-- How do you write a lesson using Markdown and `{sandpaper}`?
+- What is behind a website and how can I extract its information?
+- What is there to consider before I do web scraping?

 ::::::::::::::::::::::::::::::::::::::::::::::::

 ::::::::::::::::::::::::::::::::::::: objectives

-- Explain how to use markdown with The Carpentries Workbench
-- Demonstrate how to include pieces of code, figures, and nested challenge blocks
+- Identify the structure and basic components of an HTML document.
+- Use BeautifulSoup to locate elements, tags, attributes, and text in an HTML document.
+- Understand the situations in which web scraping is not suitable for obtaining the desired data.

 ::::::::::::::::::::::::::::::::::::::::::::::::

@@ -320,10 +322,12 @@ To conclude, here is a brief code of conduct you should consider when doing web

 ::::::::::::::::::::::::::::::::::::: keypoints

-- Use `.md` files for episodes when you want static content
-- Use `.Rmd` files for episodes when you need to generate output
-- Run `sandpaper::check_lesson()` to identify any issues with your lesson
-- Run `sandpaper::build_lesson()` to preview your lesson locally
+- Every website has an HTML document behind it that gives a structure to its content.
+- An HTML document is composed of elements, which usually have an opening `<tag>` and a closing `</tag>`.
+- Elements can have different properties, assigned by attributes in the form of `<tag attribute_name="value">`.
+- We can parse any HTML document with `BeautifulSoup()` and find elements using the `.find()` and `.find_all()` methods.
+- We can access the text of an element using the `.get_text()` method, and attribute values as we do with Python dictionaries (`element["attribute_name"]`).
+- We must be careful not to violate the Terms of Service (TOS) of the website we are scraping.

 ::::::::::::::::::::::::::::::::::::::::::::::::
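The parsing keypoints above fit in a short snippet. A minimal sketch, with a made-up one-line HTML document; the tag names and attribute values are hypothetical.

```python
from bs4 import BeautifulSoup

# A minimal HTML document (hypothetical content)
html = '<p class="intro">Hello <a href="https://example.com">world</a>!</p>'
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p")   # first element matching the tag name
links = soup.find_all("a")   # every matching element, as a list

print(paragraph.get_text())  # Hello world!
print(paragraph["class"])    # ['intro']  (attributes work like dict lookups)
print(links[0]["href"])      # https://example.com
```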

learners/setup.md

Lines changed: 12 additions & 45 deletions
@@ -2,54 +2,21 @@
 title: Setup
 ---

-FIXME:
-- Python, Jupyter notebook, libraries: b4s, requests, selenium, pandas
-- Google Chrome
+In this workshop you will learn how to extract data from websites, what you'd call web scraping, using Python. In Episode 1 we begin by reviewing the structure of websites in HTML and how to retrieve information from them using your browser and the `BeautifulSoup` package. In Episode 2 we'll dive deep into how to get the HTML behind any website using the `requests` package and how to parse it and find information with `BeautifulSoup`. At the end, you'll learn about the differences between static and dynamic webpages, and how to scrape the latter with the `Selenium` package.

-## Data Sets
+This workshop is designed for participants who already have a basic understanding of Python programming. In particular, it's best to know how to:

-<!--
-FIXME: place any data you want learners to use in `episodes/data` and then use
-a relative link ( [data zip file](data/lesson-data.zip) ) to provide a
-link to it, replacing the example.com link.
--->
-Download the [data zip file](https://example.com/FIXME) and unzip it to your Desktop
+- Install and import packages and modules
+- Use lists and dictionaries
+- Use conditional statements (`if`, `else`, `elif`)
+- Use `for` loops
+- Call functions, understanding parameters/arguments and return values

 ## Software Setup

-::::::::::::::::::::::::::::::::::::::: discussion
-
-### Details
-
-Setup for different systems can be presented in dropdown menus via a `spoiler`
-tag. They will join to this discussion block, so you can give a general overview
-of the software used in this lesson here and fill out the individual operating
-systems (and potentially add more, e.g. online setup) in the solutions blocks.
-
-:::::::::::::::::::::::::::::::::::::::::::::::::::
-
-:::::::::::::::: spoiler
-
-### Windows
-
-Use PuTTY
-
-::::::::::::::::::::::::
-
-:::::::::::::::: spoiler
-
-### MacOS
-
-Use Terminal.app
-
-::::::::::::::::::::::::
-
-:::::::::::::::: spoiler
-
-### Linux
-
-Use Terminal
-
-::::::::::::::::::::::::
+Steps:

+1. If you already have Anaconda, Jupyter Lab, or Jupyter Notebook installed on your computer, skip to step 2. Otherwise, follow Miniforge's [download](https://github.com/conda-forge/miniforge?tab=readme-ov-file#download) and [installation](https://github.com/conda-forge/miniforge?tab=readme-ov-file#install) instructions for your operating system. If you are using a Windows machine, make sure you mark the option to "Add Miniforge3 to my PATH environment variable".
+2. If you are using Mac or Linux, open the 'Terminal'. If you are using Windows, open the 'Command Prompt' or 'Miniforge Prompt'.
+3. Activate the base conda environment by running the 'conda activate' command.
+4. Install the necessary packages by running 'pip install requests beautifulsoup4 selenium webdriver-manager pandas tqdm jupyterlab'.
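The command-line steps above can be collected into one terminal session. This is a sketch of the setup, not part of the lesson itself; it assumes conda is already on your PATH.

```shell
# Activate the base conda environment (from Miniforge or Anaconda)
conda activate

# Install the packages used in this lesson
pip install requests beautifulsoup4 selenium webdriver-manager pandas tqdm jupyterlab

# Launch Jupyter Lab to follow along with the episodes
jupyter lab
```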

notebooks/config-conda-env.txt

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 conda create --name webscraping python=3.11.9
 conda activate webscraping
-pip install requests beautifulsoup4 selenium pandas tqdm
+pip install requests beautifulsoup4 selenium webdriver-manager pandas tqdm jupyterlab
