Commit 7116edb

updated challenges for episode 3
1 parent 41f455f commit 7116edb

4 files changed

Lines changed: 729 additions & 60 deletions

episodes/a-real-website.md

Lines changed: 6 additions & 6 deletions
@@ -386,12 +386,12 @@ for item in tqdm(urls):
     # Use the find function to search for the <meta> tag that
     # has each specific 'name' attribute and get the value in the
     # 'content' attribute
-    dict_w['startdate'] = soup.find('meta', attrs ={'name': 'startdate'})['content']
-    dict_w['enddate'] = soup.find('meta', attrs ={'name': 'enddate'})['content']
-    dict_w['language'] = soup.find('meta', attrs ={'name': 'language'})['content']
-    dict_w['latlng'] = soup.find('meta', attrs ={'name': 'latlng'})['content']
-    dict_w['instructor'] = soup.find('meta', attrs ={'name': 'instructor'})['content']
-    dict_w['helper'] = soup.find('meta', attrs ={'name': 'helper'})['content']
+    dict_w['startdate'] = soup.find('meta', attrs = {'name': 'startdate'})['content']
+    dict_w['enddate'] = soup.find('meta', attrs = {'name': 'enddate'})['content']
+    dict_w['language'] = soup.find('meta', attrs = {'name': 'language'})['content']
+    dict_w['latlng'] = soup.find('meta', attrs = {'name': 'latlng'})['content']
+    dict_w['instructor'] = soup.find('meta', attrs = {'name': 'instructor'})['content']
+    dict_w['helper'] = soup.find('meta', attrs = {'name': 'helper'})['content']
 
     # Append to our list
     list_of_workshops.append(dict_w)

episodes/dynamic-websites.md

Lines changed: 145 additions & 2 deletions
@@ -59,7 +59,7 @@ How can we direct Selenium to click the text "2015" for the table of that year t
 
 If we wanted to find a table element that has the `<table>` tag, we would run `driver.find_element(by=By.TAG_NAME, value="table")`. If we wanted to find an element with a specific value in its "class" attribute, for example an element like `<tr class="film">`, we would run `driver.find_element(by=By.CLASS_NAME, value="film")`. To know which element we need to click to open the 2015 table of Oscar winners, we can use the "Inspect" tool (remember, you can do this in Google Chrome by pointing your mouse over the "2015" value, right-clicking, and selecting "Inspect" from the pop-up menu). In the DevTools window, you'll see the element `<a href="#" class="year-link" id="2015">2015</a>`. Because an ID attribute is unique to a single element in the HTML, we can select the element directly by this attribute, using the code you'll find after the following image.
 
-![](fig/inspect_element.png){alt="A screenshot of Google Chrome web browser, showing how to search a specific element by using Inspect from the Chrome DevTools"}
+![](fig/inspect_element.PNG){alt="A screenshot of Google Chrome web browser, showing how to search a specific element by using Inspect from the Chrome DevTools"}
 
 
 ```python
@@ -101,7 +101,150 @@ print(soup.find(class_='film').prettify())
 </tr>
 ```
 
-# The scraping pipeline
+The following code repeats the process of clicking and loading the 2015 data, but now in "headless" mode (i.e. without opening a browser window). It then extracts the data from the table one column at a time, taking advantage of the fact that each column has a unique class attribute that identifies it. Instead of using for loops to extract data from each element that `.find_all()` finds, we use list comprehensions. You can learn more about them by reading [Python's documentation for list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) or this [Programiz short tutorial](https://www.programiz.com/python-programming/list-comprehension).
+
+```python
+# Create the Selenium webdriver and make it headless
+options = ChromeOptions()
+options.add_argument("--headless=new")
+driver = webdriver.Chrome(options=options)
+
+# Load the website. Find and click 2015. Get the HTML after the JavaScript has run. Close the webdriver
+driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
+button_2015 = driver.find_element(by=By.ID, value="2015")
+button_2015.click()
+sleep(3)
+html_2015 = driver.page_source
+driver.quit()
+
+# Parse the HTML using BeautifulSoup and extract each column as a list of values using list comprehensions
+soup = BeautifulSoup(html_2015, 'html.parser')
+titles_lc = [elem.get_text() for elem in soup.find_all(class_="film-title")]
+nominations_lc = [elem.get_text() for elem in soup.find_all(class_="film-nominations")]
+awards_lc = [elem.get_text() for elem in soup.find_all(class_="film-awards")]
+
+# For the best picture column, we can't use .get_text() as there is no text
+# Rather, we check whether the cell contains an <i> tag (the icon that marks a Best Picture win)
+best_picture_lc = ["No" if elem.find("i") is None else "Yes" for elem in soup.find_all(class_="film-best-picture")]
+
+# Create a dataframe based on the previous lists
+movies_2015 = pd.DataFrame(
+    {'titles': titles_lc, 'nominations': nominations_lc, 'awards': awards_lc, 'best_picture': best_picture_lc}
+)
+```
+
+::::::::::::::::::::::::::::::::::::: challenge
+
+Based on what we've learned in this episode, write code to get the data for all the years from 2010 to 2015 on [Hartley Brody's website](https://www.scrapethissite.com/pages/ajax-javascript/) with information about Oscar-winning films. Hint: you'll use the same code, but add a loop through the years.
+
+:::::::::::::::::::::::: solution
+
+Besides adding a loop over the years, the following solution refactors the code into two functions: one that finds and clicks a given year and returns the HTML after the data appears, and another that parses that HTML to extract the data and build a dataframe.
+
+So that you can watch how Selenium opens the browser and clicks each year, we are not adding the "headless" option here.
+
+```python
+# Function to search for a year's hyperlink, click it, and return the loaded HTML
+def findyear_click_gethtml(year):
+    button = driver.find_element(by=By.ID, value=year)
+    button.click()
+    sleep(3)
+    html = driver.page_source
+    return html
+
+# Function to parse the HTML, extract the table data, and assign a year column
+def parsehtml_extractdata(html, year):
+    soup = BeautifulSoup(html, 'html.parser')
+    titles_lc = [elem.get_text() for elem in soup.find_all(class_="film-title")]
+    nominations_lc = [elem.get_text() for elem in soup.find_all(class_="film-nominations")]
+    awards_lc = [elem.get_text() for elem in soup.find_all(class_="film-awards")]
+    best_picture_lc = ["No" if elem.find("i") is None else "Yes" for elem in soup.find_all(class_="film-best-picture")]
+    movies_df = pd.DataFrame(
+        {'titles': titles_lc, 'nominations': nominations_lc, 'awards': awards_lc, 'best_picture': best_picture_lc, 'year': year}
+    )
+    return movies_df
+
+# Open Selenium webdriver and go to the page
+driver = webdriver.Chrome()
+driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
+
+# Create an empty dataframe where we will append/concatenate the dataframes we get for each year
+result_df = pd.DataFrame()
+
+for year in ["2010", "2011", "2012", "2013", "2014", "2015"]:
+    html_year = findyear_click_gethtml(year)
+    df_year = parsehtml_extractdata(html_year, year)
+    result_df = pd.concat([result_df, df_year])
+
+# Close the browser that Selenium opened
+driver.quit()
+```
+
+:::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+::::::::::::::::::::::::::::::::::::: challenge
+
+If you are tired of scraping table data like we've done in the last two episodes, here is another dynamic website exercise where you can practice what you've learned. Go to [this product page](https://www.scrapingcourse.com/javascript-rendering) created by scrapingcourse.com and extract all product names and prices, as well as the hyperlink on each product card that leads to a detailed view page.
+
+When you complete that, and if you are up for an additional challenge, scrape the SKU, category, and description from the detailed view page of each product.
+
+:::::::::::::::::::::::: solution
+
+To identify which elements contain the data you need, use the "Inspect" tool in your browser. The following image is a screenshot of the website. There we can see that each product card is a `<div>` element with multiple attributes that we can use to narrow down our search to the specific elements we want. For example, we can use `'data-testid'='product-item'`. After we find all *divs* that satisfy that condition, we can extract from each one the hyperlink, the name, and the price. The hyperlink is the 'href' attribute of the `<a>` tag. The name and price are inside `<span>` tags, and we could use multiple attributes to get each of them. In the following code, we use `'class'='product-name'` to get the name and `'data-content'='product-price'` to get the price.
+
+![](fig/product_cards_challenge.PNG){alt="A screenshot of Google Chrome web browser, highlighting the `<div>` element that contains the data we want about the product"}
+
+```python
+# Open Selenium webdriver in headless mode and go to the desired page
+options = webdriver.ChromeOptions()
+options.add_argument("--headless=new")
+driver = webdriver.Chrome(options=options)
+driver.get("https://www.scrapingcourse.com/javascript-rendering")
+
+# As we don't have to click anything, we just wait for the JavaScript to load and get the HTML right away
+sleep(3)
+html = driver.page_source
+
+# Parse the HTML
+soup = BeautifulSoup(html, 'html.parser')
+# Find all <div> elements that have a 'data-testid' attribute with the value of 'product-item'
+divs = soup.find_all("div", attrs = {'data-testid': 'product-item'})
+
+# Loop through the <div> elements we found, and for each get the href,
+# the name (inside a <span> element with attribute class="product-name"),
+# and the price (inside a <span> element with attribute data-content="product-price")
+list_of_dicts = []
+for div in divs:
+    # Create a dictionary to store the data we want for each product
+    item_dict = {
+        'link': div.find('a')['href'],
+        'name': div.find('span', attrs = {'class': 'product-name'}).get_text(),
+        'price': div.find('span', attrs = {'data-content': 'product-price'}).get_text()
+    }
+    list_of_dicts.append(item_dict)
+
+all_products = pd.DataFrame(list_of_dicts)
+```
+
+We could arrive at the same result by replacing the for loop with list comprehensions. Here is another possible solution using that approach.
+
+```python
+links = [elem['href'] for elem in soup.find_all('a', attrs = {'class': 'product-link'})]
+names = [elem.get_text() for elem in soup.find_all('span', attrs = {'class': 'product-name'})]
+prices = [elem.get_text() for elem in soup.find_all('span', attrs = {'data-content': 'product-price'})]
+all_products_v2 = pd.DataFrame(
+    {'link': links, 'name': names, 'price': prices}
+)
+```
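+
+For the additional challenge (the SKU, category, and description on each product's detailed view page), one possible approach is sketched below. We haven't inspected the detail pages here, so the class names used to locate each field ('product-sku', 'product-category', and 'product-description') are assumptions you should verify with the "Inspect" tool. The overall pattern, though, is the same one we've been using: visit each link we collected, wait for the JavaScript to run, and parse the resulting HTML.
+
+```python
+# Visit each product's detail page and extract its SKU, category, and description
+# NOTE: the class names below are assumptions; verify them with the Inspect tool
+details = []
+for link in all_products['link']:
+    driver.get(link)
+    sleep(3)
+    detail_soup = BeautifulSoup(driver.page_source, 'html.parser')
+    details.append({
+        'link': link,
+        'sku': detail_soup.find(class_='product-sku').get_text(),
+        'category': detail_soup.find(class_='product-category').get_text(),
+        'description': detail_soup.find(class_='product-description').get_text()
+    })
+
+# Combine the detail data with the products we already scraped
+all_products_full = pd.merge(all_products, pd.DataFrame(details), on='link')
+```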
+
+:::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+## The scraping pipeline
 
 By now, you've learned about the core tools for web scraping: requests, BeautifulSoup, and Selenium. These three tools form a versatile pipeline for almost any web scraping task. When starting a new scraping project, there are several important steps to follow that will help ensure you capture the data you need.
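
As a minimal sketch of how the three tools can fit together (the URL is the practice page used earlier in this episode, and the element we check for, `film-title`, is just an illustrative choice for deciding whether the Selenium fallback is needed):

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep

url = "https://www.scrapethissite.com/pages/ajax-javascript/"

# Step 1: try the lightweight route first: requests + BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Step 2: if the element you need is missing from the static HTML (here,
# the film data is rendered by JavaScript), fall back to Selenium
if soup.find(class_="film-title") is None:
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    sleep(3)  # give the JavaScript time to run; click elements here if the data needs it
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
```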