How can we direct Selenium to click the text "2015" for the table of that year to load?

If we wanted to search for a table element, which has the `<table>` tag, we would run `driver.find_element(by=By.TAG_NAME, value="table")`. If we wanted to search for an element with a specific value in its "class" attribute, for example an element like `<tr class="film">`, we would run `driver.find_element(by=By.CLASS_NAME, value="film")`. To know what element we need to click to open the 2015 table of Oscar winners, we can use the "Inspect" tool (remember you can do this in Google Chrome by pointing your mouse at the "2015" value, right-clicking, and selecting "Inspect" from the pop-up menu). In the DevTools window, you'll see the element `<a href="#" class="year-link" id="2015">2015</a>`. As an ID attribute is unique to a single element in the HTML, we can directly select the element by this attribute using the code you'll find after the following image.

{alt="A screenshot of Google Chrome web browser, showing how to search a specific element by using Inspect from the Chrome DevTools"}
The following code repeats the process of clicking and loading the 2015 data, but now using "headless" mode (i.e. without opening a browser window). Then, it extracts data from the table one column at a time, taking advantage of the fact that each column has a unique class attribute that identifies it. Instead of using for loops to extract data from each element that `.find_all()` finds, we use list comprehensions. You can learn more about them by reading [Python's documentation for list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions), or with this [Programiz short tutorial](https://www.programiz.com/python-programming/list-comprehension).

```python
# Import the tools we need (these may already have been imported earlier in the episode)
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By

# Create the Selenium webdriver and make it headless
options = ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# Load the website. Find and click 2015. Get post JavaScript execution HTML. Close webdriver
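# A sketch of the steps the comment above describes; the fixed two-second
# wait is a simplifying assumption (an explicit WebDriverWait is more robust)
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
driver.find_element(by=By.ID, value="2015").click()
time.sleep(2)
html = driver.page_source
driver.quit()
```

With the rendered HTML in hand, the column-by-column extraction can look like the following sketch. The row class `film` appeared earlier in this episode, but the column class names used here (`film-title`, `film-nominations`, `film-awards`) are assumptions about the page's markup, so verify them with the "Inspect" tool.

```python
# Parse the post-JavaScript HTML and pull out each column by its class attribute
soup = BeautifulSoup(html, "html.parser")
titles = [elem.get_text() for elem in soup.find_all('td', attrs={'class': 'film-title'})]
nominations = [elem.get_text() for elem in soup.find_all('td', attrs={'class': 'film-nominations'})]
awards = [elem.get_text() for elem in soup.find_all('td', attrs={'class': 'film-awards'})]

df_2015 = pd.DataFrame({'title': titles, 'nominations': nominations, 'awards': awards})
```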
::::::::::::::::::::::::::::::::::::: challenge

Based on what we've learned in this episode, write code to get the data for all the years from 2010 to 2015 from [Hartley Brody's website](https://www.scrapethissite.com/pages/ajax-javascript/) with information on Oscar Winning Films. Hint: You'll use the same code, but add a loop through the years.

:::::::::::::::::::::::: solution

Besides adding a loop for each year, the following solution refactors the code into two functions: one that finds and clicks a year and returns the HTML after the data shows up, and another that parses that HTML to extract the data into a dataframe.

So you can see how Selenium opens the browser and clicks each year, we are not adding the "headless" option this time.
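
The two helper functions are called `findyear_click_gethtml` and `parsehtml_extractdata` below. A minimal sketch of how they could be written, reusing the same assumed column class names as before, is:

```python
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a visible (non-headless) browser window
driver = webdriver.Chrome()

def findyear_click_gethtml(year):
    """Load the page, click the link for the given year, and return the rendered HTML."""
    driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
    driver.find_element(by=By.ID, value=year).click()
    time.sleep(2)  # simple fixed wait for the AJAX request to finish
    return driver.page_source

def parsehtml_extractdata(html, year):
    """Parse the HTML, extract each table column by its class, and return a dataframe."""
    soup = BeautifulSoup(html, "html.parser")
    return pd.DataFrame({
        'year': year,
        'title': [elem.get_text() for elem in soup.find_all('td', attrs={'class': 'film-title'})],
        'nominations': [elem.get_text() for elem in soup.find_all('td', attrs={'class': 'film-nominations'})],
        'awards': [elem.get_text() for elem in soup.find_all('td', attrs={'class': 'film-awards'})],
    })
```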
```python
# Create empty dataframe where we will append/concatenate the dataframes we get for each year
result_df = pd.DataFrame()

for year in ["2010", "2011", "2012", "2013", "2014", "2015"]:
    html_year = findyear_click_gethtml(year)
    df_year = parsehtml_extractdata(html_year, year)
    result_df = pd.concat([result_df, df_year])

# Close the browser that Selenium opened
driver.quit()
```
:::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: challenge
If you are tired of scraping table data like we've been doing for the last two episodes, here is another dynamic website exercise where you can practice what you've learned. Go to [this product page](https://www.scrapingcourse.com/javascript-rendering) created by scrapingcourse.com and extract all product names and prices, as well as the hyperlink each product card has to a detailed view page.

When you complete that, and if you are up for an additional challenge, scrape the SKU, category, and description from the detailed view page of each product.
:::::::::::::::::::::::: solution
To identify which elements contain the data you need, you should use the "Inspect" tool in your browser. The following image is a screenshot of the website. There we can see that each product card is a `<div>` element with multiple attributes we can use to narrow down our search to the specific elements we want. For example, we can use `'data-testid'='product-item'`. After we find all *divs* that satisfy that condition, we can extract from each one the hyperlink, the name, and the price. The hyperlink is the 'href' attribute of the `<a>` tag. The name and price are inside `<span>` tags, and we could use multiple attributes to get each of them. In the following code, we will use `'class'='product-name'` to get the name and `'data-content'='product-price'` to get the price.

{alt="A screenshot of Google Chrome web browser, highlighting the `<div>` element that contains the data we want about the product"}

```python
# Open Selenium webdriver in headless mode and go to the desired page
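# A sketch of the remaining steps; the fixed wait is a simplifying assumption
options = ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://www.scrapingcourse.com/javascript-rendering")
time.sleep(2)  # give the page's JavaScript time to render the product cards
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Find every product card, then extract the hyperlink, name, and price from each one
products = []
for card in soup.find_all('div', attrs={'data-testid': 'product-item'}):
    products.append({
        'link': card.find('a', attrs={'class': 'product-link'})['href'],
        'name': card.find('span', attrs={'class': 'product-name'}).get_text(),
        'price': card.find('span', attrs={'data-content': 'product-price'}).get_text(),
    })

all_products = pd.DataFrame(products)
```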
We could arrive at the same result by replacing the for loop with list comprehensions. Here is another possible solution with that approach.

```python
links = [elem['href'] for elem in soup.find_all('a', attrs={'class': 'product-link'})]
names = [elem.get_text() for elem in soup.find_all('span', attrs={'class': 'product-name'})]
prices = [elem.get_text() for elem in soup.find_all('span', attrs={'data-content': 'product-price'})]
all_products_v2 = pd.DataFrame(
    {'link': links, 'name': names, 'price': prices}
)
```
:::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::
## The scraping pipeline
By now, you've learned about the core tools for web scraping: requests, BeautifulSoup, and Selenium. These three tools form a versatile pipeline for almost any web scraping task. When starting a new scraping project, there are several important steps to follow that will help ensure you capture the data you need.