
Commit 1b6dbb8

Added questions, objectives, and key points to all 3 episodes
1 parent 16ac023 commit 1b6dbb8

3 files changed

Lines changed: 39 additions & 25 deletions


episodes/a-real-website.md

Lines changed: 9 additions & 8 deletions
@@ -6,14 +6,16 @@ exercises: 15
 
 :::::::::::::::::::::::::::::::::::::: questions
 
-- How do you write a lesson using Markdown and `{sandpaper}`?
+- How can I get the data and information from a real website?
+- How can I start automating my web scraping tasks?
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::: objectives
 
-- Explain how to use markdown with The Carpentries Workbench
-- Demonstrate how to include pieces of code, figures, and nested challenge blocks
+- Use the `requests` package to get the HTML document behind a website.
+- Navigate the tree structure behind an HTML document to extract the information we need.
+- Know how to avoid being blocked by sending too many requests to a website.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::

@@ -139,7 +141,7 @@ print(workshops_table.prettify())
 </table>
 ```
 
-To navigate in this HTML document tree we can use the methods `.contents()` (to access direct children nodes), `.parent()` (to access the parent node), `.next_sibling()`, and `.previous_sibling()` (to access the siblings of a node) methods. For example, if we want to access the second row of the table, which is the second child of the table element we could use the following code.
+To navigate this HTML document tree, we can use the following properties of the `bs4.element.Tag` object: `.contents` (to access direct children nodes), `.parent` (to access the parent node), and `.next_sibling` and `.previous_sibling` (to access the siblings of a node). For example, if we want to access the second row of the table, which is the second child of the table element, we could use the following code.
 
 ```python
 # The second [1 in Python indexing] child of our table element
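As a small illustration of these navigation properties (the two-row table below is a made-up stand-in, not the episode's workshops table):

```python
from bs4 import BeautifulSoup

# A hypothetical two-row table, standing in for the table parsed in the episode
html = """
<table>
  <tr><th>Workshop</th></tr>
  <tr><td>Web Scraping</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# .contents lists direct children; the whitespace between tags shows up as
# text nodes, so we keep only the <tr> tag elements
rows = [child for child in table.contents if child.name == "tr"]
second_row = rows[1]

print(second_row.get_text(strip=True))  # text inside the second row
print(second_row.parent.name)           # .parent walks back up to the <table>
```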
@@ -424,10 +426,9 @@ print(req.status_code)
 
 ::::::::::::::::::::::::::::::::::::: keypoints
 
-- Use `.md` files for episodes when you want static content
-- Use `.Rmd` files for episodes when you need to generate output
-- Run `sandpaper::check_lesson()` to identify any issues with your lesson
-- Run `sandpaper::build_lesson()` to preview your lesson locally
+- We can get the HTML behind any website using the `requests` package and the function `requests.get('website_url').text`.
+- An HTML document is a nested tree of elements. Therefore, from a given element, we can access its children, parent, or siblings using `.contents`, `.parent`, `.next_sibling`, and `.previous_sibling`.
+- It's polite not to send too many requests to a website in a short period of time. For that, we can use the `sleep()` function of the built-in Python module `time`.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
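The polite-scraping keypoint can be sketched as a small helper. The `fetch_all` function and its parameters are hypothetical, and a stand-in fetcher is used so the sketch runs without network access; with `requests` installed, the fetcher would be `lambda url: requests.get(url).text`.

```python
import time

def fetch_all(urls, fetcher, delay=1.0):
    """Call fetcher(url) for each URL, sleeping `delay` seconds between requests
    so we don't overload the server."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # be polite: pause before the next request
        pages.append(fetcher(url))
    return pages

# Stand-in fetcher (no network); swap in requests.get(url).text for real use
pages = fetch_all(
    ["https://example.org/a", "https://example.org/b"],
    fetcher=lambda url: f"<html>{url}</html>",
    delay=0.1,
)
print(len(pages))
```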

episodes/dynamic-websites.md

Lines changed: 17 additions & 8 deletions
@@ -6,14 +6,17 @@ exercises: 5
 
 :::::::::::::::::::::::::::::::::::::: questions
 
-- How do you write a lesson using Markdown and `{sandpaper}`?
+- What are the differences between static and dynamic websites?
+- Why is it important to understand these differences when doing web scraping?
+- How can I start my own web scraping project?
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::: objectives
 
-- Explain how to use markdown with The Carpentries Workbench
-- Demonstrate how to include pieces of code, figures, and nested challenge blocks
+- Use the `Selenium` package to scrape dynamic websites.
+- Identify the elements of interest using the browser's "Inspect" tool.
+- Understand the usual pipeline of a web scraping project.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::

@@ -257,11 +260,17 @@ This scraping pipeline helps break down complex scraping tasks into manageable s
 
 ::::::::::::::::::::::::::::::::::::: keypoints
 
-- Use `.md` files for episodes when you want static content
-- Use `.Rmd` files for episodes when you need to generate output
-- Run `sandpaper::check_lesson()` to identify any issues with your lesson
-- Run `sandpaper::build_lesson()` to preview your lesson locally
-
+- Dynamic websites load content using JavaScript, which isn't present in the initial or source HTML. It's important to distinguish between static and dynamic content when planning your scraping approach.
+- The `Selenium` package and its `webdriver` module simulate a real user interacting with a browser, allowing it to execute JavaScript and to click, scroll, or fill in text boxes.
+- These are the commands we learned for `Selenium`:
+  - `webdriver.Chrome()` # Start the Google Chrome browser simulator
+  - `.get("website_url")` # Go to a given website
+  - `.find_element(by, value)` and `.find_elements(by, value)` # Get a given element
+  - `.click()` # Click the selected element
+  - `.page_source` # Get the HTML after JavaScript has executed, which can later be parsed with BeautifulSoup
+  - `.quit()` # Close the browser simulator
+- The browser's "Inspect" tool allows users to view the HTML document after dynamic content has loaded, revealing elements added by JavaScript. This tool helps identify the specific elements you are interested in scraping.
+- A typical scraping pipeline involves understanding the website's structure, determining content type (static or dynamic), using the appropriate tools (requests and BeautifulSoup for static, Selenium and BeautifulSoup for dynamic), and structuring the scraped data for analysis.
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 [r-markdown]: https://rmarkdown.rstudio.com/
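The Selenium commands in that keypoint list can be assembled into one sketch of a scraping function. The function name, URL, and selector here are hypothetical, and since Selenium plus a Chrome driver must be installed for it to actually run, the imports are kept inside the function.

```python
def scrape_dynamic(url, css_selector):
    """Open a browser, click one element, and return the rendered page
    parsed with BeautifulSoup. A sketch, not the lesson's exact code."""
    # Imports live inside the function: they require selenium/bs4 installed
    # and a working Chrome driver, which may not be present everywhere.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()  # start the Chrome browser simulator
    try:
        driver.get(url)  # go to the website
        element = driver.find_element(By.CSS_SELECTOR, css_selector)
        element.click()  # interact like a real user would
        html = driver.page_source  # HTML after JavaScript has executed
    finally:
        driver.quit()  # always close the browser simulator
    return BeautifulSoup(html, "html.parser")

# Usage (not executed here):
#   soup = scrape_dynamic("https://example.org", "button.load-more")
```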

episodes/hello-scraping.md

Lines changed: 13 additions & 9 deletions
@@ -1,19 +1,21 @@
 ---
 title: "Hello-Scraping"
-teaching: 30
-exercises: 5
+teaching: 40
+exercises: 10
 ---
 
 :::::::::::::::::::::::::::::::::::::: questions
 
-- How do you write a lesson using Markdown and `{sandpaper}`?
+- What is behind a website and how can I extract its information?
+- What is there to consider before I do web scraping?
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::: objectives
 
-- Explain how to use markdown with The Carpentries Workbench
-- Demonstrate how to include pieces of code, figures, and nested challenge blocks
+- Identify the structure and basic components of an HTML document.
+- Use BeautifulSoup to locate elements, tags, attributes, and text in an HTML document.
+- Understand the situations in which web scraping is not suitable for obtaining the desired data.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::

@@ -320,10 +322,12 @@ To conclude, here is a brief code of conduct you should consider when doing web
 
 ::::::::::::::::::::::::::::::::::::: keypoints
 
-- Use `.md` files for episodes when you want static content
-- Use `.Rmd` files for episodes when you need to generate output
-- Run `sandpaper::check_lesson()` to identify any issues with your lesson
-- Run `sandpaper::build_lesson()` to preview your lesson locally
+- Every website has an HTML document behind it that gives structure to its content.
+- An HTML document is composed of elements, which usually have an opening `<tag>` and a closing `</tag>`.
+- Elements can have different properties, assigned by attributes in the form of `<tag attribute_name="value">`.
+- We can parse any HTML document with `BeautifulSoup()` and find elements using the `.find()` and `.find_all()` methods.
+- We can access the text of an element using the `.get_text()` method, and attribute values as we do with Python dictionaries (`element["attribute_name"]`).
+- We must be careful not to violate the Terms of Service (TOS) of the website we are scraping.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
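A minimal sketch of these BeautifulSoup keypoints, using a made-up HTML snippet rather than any page from the lesson:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two link elements
html = ('<p>See <a href="https://example.org" id="link1">this link</a>'
        ' and <a href="https://example.org/2">another</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")      # .find() returns only the first match
links = soup.find_all("a")  # .find_all() returns every match

print(first.get_text())     # the text inside the element
print(first["href"])        # attributes are accessed like dictionary keys
print(len(links))
```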
