Commit 7116edb

updated challenges for episode 3
1 parent 41f455f commit 7116edb

4 files changed

Lines changed: 729 additions & 60 deletions

episodes/a-real-website.md

Lines changed: 6 additions & 6 deletions
@@ -386,12 +386,12 @@ for item in tqdm(urls):
     # Use the find function to search for the <meta> tag that
     # has each specific 'name' attribute and get the value in the
     # 'content' attribute
-    dict_w['startdate'] = soup.find('meta', attrs ={'name': 'startdate'})['content']
-    dict_w['enddate'] = soup.find('meta', attrs ={'name': 'enddate'})['content']
-    dict_w['language'] = soup.find('meta', attrs ={'name': 'language'})['content']
-    dict_w['latlng'] = soup.find('meta', attrs ={'name': 'latlng'})['content']
-    dict_w['instructor'] = soup.find('meta', attrs ={'name': 'instructor'})['content']
-    dict_w['helper'] = soup.find('meta', attrs ={'name': 'helper'})['content']
+    dict_w['startdate'] = soup.find('meta', attrs = {'name': 'startdate'})['content']
+    dict_w['enddate'] = soup.find('meta', attrs = {'name': 'enddate'})['content']
+    dict_w['language'] = soup.find('meta', attrs = {'name': 'language'})['content']
+    dict_w['latlng'] = soup.find('meta', attrs = {'name': 'latlng'})['content']
+    dict_w['instructor'] = soup.find('meta', attrs = {'name': 'instructor'})['content']
+    dict_w['helper'] = soup.find('meta', attrs = {'name': 'helper'})['content']
 
     # Append to our list
     list_of_workshops.append(dict_w)

episodes/dynamic-websites.md

Lines changed: 145 additions & 2 deletions
@@ -59,7 +59,7 @@ How can we direct Selenium to click the text "2015" for the table of that year t
 
 If we wanted to find a table element that has the `<table>` tag, we would run `driver.find_element(by=By.TAG_NAME, value="table")`. If we wanted to find an element with a specific value in its "class" attribute, for example an element like `<tr class="film">`, we would run `driver.find_element(by=By.CLASS_NAME, value="film")`. To know which element we need to click to open the 2015 table of Oscar winners, we can use the "Inspect" tool (remember, you can do this in Google Chrome by pointing your mouse over the "2015" value, right-clicking, and selecting "Inspect" from the pop-up menu). In the DevTools window, you'll see the element `<a href="#" class="year-link" id="2015">2015</a>`. Because an ID attribute is unique to a single element in the HTML, we can select the element directly by this attribute, using the code you'll find after the following image.
 
-![](fig/inspect_element.png){alt="A screenshot of Google Chrome web browser, showing how to search a specific element by using Inspect from the Chrome DevTools"}
+![](fig/inspect_element.PNG){alt="A screenshot of Google Chrome web browser, showing how to search a specific element by using Inspect from the Chrome DevTools"}
 
 
 ```python
@@ -101,7 +101,150 @@ print(soup.find(class_='film').prettify())
 </tr>
 ```
 
-# The scraping pipeline
+The following code repeats the process of clicking and loading the 2015 data, but now in "headless" mode (i.e. without opening a browser window). It then extracts the data from the table one column at a time, taking advantage of the fact that each column has a unique class attribute that identifies it. Instead of using for loops to extract data from each element that `.find_all()` finds, we use list comprehensions. You can learn more about them by reading [Python's documentation for list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) or this [Programiz short tutorial](https://www.programiz.com/python-programming/list-comprehension).
+
+```python
+# Create the Selenium webdriver and make it headless
+options = ChromeOptions()
+options.add_argument("--headless=new")
+driver = webdriver.Chrome(options=options)
+
+# Load the website. Find and click 2015. Get the HTML after the JavaScript has run. Close the webdriver
+driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
+button_2015 = driver.find_element(by=By.ID, value="2015")
+button_2015.click()
+sleep(3)
+html_2015 = driver.page_source
+driver.quit()
+
+# Parse the HTML using BeautifulSoup and extract each column as a list of values using list comprehensions
+soup = BeautifulSoup(html_2015, 'html.parser')
+titles_lc = [elem.get_text() for elem in soup.find_all(class_="film-title")]
+nominations_lc = [elem.get_text() for elem in soup.find_all(class_="film-nominations")]
+awards_lc = [elem.get_text() for elem in soup.find_all(class_="film-awards")]
+
+# For the best picture column, we can't use .get_text() as there is no text
+# Rather, we check whether the cell contains an <i> tag (the icon that marks a Best Picture win)
+best_picture_lc = ["No" if elem.find("i") is None else "Yes" for elem in soup.find_all(class_="film-best-picture")]
+
+# Create a dataframe based on the previous lists
+movies_2015 = pd.DataFrame(
+    {'titles': titles_lc, 'nominations': nominations_lc, 'awards': awards_lc, 'best_picture': best_picture_lc}
+)
+```
+
+::::::::::::::::::::::::::::::::::::: challenge
+
+Based on what we've learned in this episode, write code to get the data for all the years from 2010 to 2015 on [Hartley Brody's website](https://www.scrapethissite.com/pages/ajax-javascript/) with information about Oscar-winning films. Hint: you'll use the same code, but add a loop through the years.
+
+:::::::::::::::::::::::: solution
+
+Besides adding a loop over the years, the following solution refactors the code into two functions: one that finds and clicks a given year and returns the HTML after the data appears, and another that parses that HTML to extract the data and build a dataframe.
+
+So that you can watch how Selenium opens the browser and clicks each year, we are not adding the "headless" option here.
+
+```python
+# Function to search for a year's hyperlink, click it, and return the loaded HTML
+def findyear_click_gethtml(year):
+    button = driver.find_element(by=By.ID, value=year)
+    button.click()
+    sleep(3)
+    html = driver.page_source
+    return html
+
+# Function to parse the HTML, extract the table data, and assign a year column
+def parsehtml_extractdata(html, year):
+    soup = BeautifulSoup(html, 'html.parser')
+    titles_lc = [elem.get_text() for elem in soup.find_all(class_="film-title")]
+    nominations_lc = [elem.get_text() for elem in soup.find_all(class_="film-nominations")]
+    awards_lc = [elem.get_text() for elem in soup.find_all(class_="film-awards")]
+    best_picture_lc = ["No" if elem.find("i") is None else "Yes" for elem in soup.find_all(class_="film-best-picture")]
+    movies_df = pd.DataFrame(
+        {'titles': titles_lc, 'nominations': nominations_lc, 'awards': awards_lc, 'best_picture': best_picture_lc, 'year': year}
+    )
+    return movies_df
+
+# Open Selenium webdriver and go to the page
+driver = webdriver.Chrome()
+driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
+
+# Create an empty dataframe where we will append/concatenate the dataframes we get for each year
+result_df = pd.DataFrame()
+
+for year in ["2010", "2011", "2012", "2013", "2014", "2015"]:
+    html_year = findyear_click_gethtml(year)
+    df_year = parsehtml_extractdata(html_year, year)
+    result_df = pd.concat([result_df, df_year])
+
+# Close the browser that Selenium opened
+driver.quit()
+```
+
+:::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+::::::::::::::::::::::::::::::::::::: challenge
+
+If you are tired of scraping table data like we've done in the last two episodes, here is another dynamic website exercise where you can practice what you've learned. Go to [this product page](https://www.scrapingcourse.com/javascript-rendering) created by scrapingcourse.com and extract all product names and prices, as well as the hyperlink on each product card that leads to a detailed view page.
+
+When you complete that, and if you are up for an additional challenge, scrape the SKU, category, and description from the detailed view page of each product.
+
+:::::::::::::::::::::::: solution
+
+To identify which elements contain the data you need, use the "Inspect" tool in your browser. The following image is a screenshot of the website. There we can see that each product card is a `<div>` element with multiple attributes that we can use to narrow down our search to the specific elements we want. For example, we can use `'data-testid'='product-item'`. After we find all *divs* that satisfy that condition, we can extract from each one the hyperlink, the name, and the price. The hyperlink is the 'href' attribute of the `<a>` tag. The name and price are inside `<span>` tags, and we could use multiple attributes to get each of them. In the following code, we use `'class'='product-name'` to get the name and `'data-content'='product-price'` to get the price.
+
+![](fig/product_cards_challenge.PNG){alt="A screenshot of Google Chrome web browser, highlighting the `<div>` element that contains the data we want about the product"}
+
+```python
+# Open Selenium webdriver in headless mode and go to the desired page
+options = webdriver.ChromeOptions()
+options.add_argument("--headless=new")
+driver = webdriver.Chrome(options=options)
+driver.get("https://www.scrapingcourse.com/javascript-rendering")
+
+# As we don't have to click anything, we just wait for the JavaScript to load and get the HTML right away
+sleep(3)
+html = driver.page_source
+
+# Parse the HTML
+soup = BeautifulSoup(html, 'html.parser')
+# Find all <div> elements that have a 'data-testid' attribute with the value of 'product-item'
+divs = soup.find_all("div", attrs = {'data-testid': 'product-item'})
+
+# Loop through the <div> elements we found, and for each get the href,
+# the name (inside a <span> element with attribute class="product-name"),
+# and the price (inside a <span> element with attribute data-content="product-price")
+list_of_dicts = []
+for div in divs:
+    # Create a dictionary to store the data we want for each product
+    item_dict = {
+        'link': div.find('a')['href'],
+        'name': div.find('span', attrs = {'class': 'product-name'}).get_text(),
+        'price': div.find('span', attrs = {'data-content': 'product-price'}).get_text()
+    }
+    list_of_dicts.append(item_dict)
+
+all_products = pd.DataFrame(list_of_dicts)
+```
+
+We could arrive at the same result by replacing the for loop with list comprehensions. Here is another possible solution using that approach.
+
+```python
+links = [elem['href'] for elem in soup.find_all('a', attrs = {'class': 'product-link'})]
+names = [elem.get_text() for elem in soup.find_all('span', attrs = {'class': 'product-name'})]
+prices = [elem.get_text() for elem in soup.find_all('span', attrs = {'data-content': 'product-price'})]
+all_products_v2 = pd.DataFrame(
+    {'link': links, 'name': names, 'price': prices}
+)
+```
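+
+For the additional challenge (the SKU, category, and description on each product's detailed view page), one possible approach is sketched below. We haven't inspected the detail pages here, so the class names used to locate each field ('product-sku', 'product-category', and 'product-description') are assumptions you should verify with the "Inspect" tool. The overall pattern, though, is the same one we've been using: visit each link we collected, wait for the JavaScript to run, and parse the resulting HTML.
+
+```python
+# Visit each product's detail page and extract its SKU, category, and description
+# NOTE: the class names below are assumptions; verify them with the Inspect tool
+details = []
+for link in all_products['link']:
+    driver.get(link)
+    sleep(3)
+    detail_soup = BeautifulSoup(driver.page_source, 'html.parser')
+    details.append({
+        'link': link,
+        'sku': detail_soup.find(class_='product-sku').get_text(),
+        'category': detail_soup.find(class_='product-category').get_text(),
+        'description': detail_soup.find(class_='product-description').get_text()
+    })
+
+# Combine the detail data with the products we already scraped
+all_products_full = pd.merge(all_products, pd.DataFrame(details), on='link')
+```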
+
+:::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+## The scraping pipeline
 
 By now, you've learned about the core tools for web scraping: requests, BeautifulSoup, and Selenium. These three tools form a versatile pipeline for almost any web scraping task. When starting a new scraping project, there are several important steps to follow that will help ensure you capture the data you need.
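
As a minimal sketch of how the three tools can fit together (the URL is the practice page used earlier in this episode, and the element we check for, `film-title`, is just an illustrative choice for deciding whether the Selenium fallback is needed):

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep

url = "https://www.scrapethissite.com/pages/ajax-javascript/"

# Step 1: try the lightweight route first: requests + BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Step 2: if the element you need is missing from the static HTML (here,
# the film data is rendered by JavaScript), fall back to Selenium
if soup.find(class_="film-title") is None:
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    sleep(3)  # give the JavaScript time to run; click elements here if the data needs it
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
```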