
Commit d6426bb

Finished episode 3
1 parent 4c458d2 commit d6426bb

3 files changed

Lines changed: 959 additions & 12 deletions

File tree

episodes/dynamic-websites.md

Lines changed: 80 additions & 0 deletions
@@ -31,6 +31,86 @@ As the `requests` package retrieves the source HTML, we need a different approac
## Using Selenium to scrape dynamic websites

[Selenium](https://www.selenium.dev/) is an open source project for web browser automation. It will be useful for our scraping tasks because it acts as a real user interacting with a webpage in a browser. Selenium renders pages in a browser environment, allowing JavaScript to load dynamic content, and therefore gives us access to the website HTML after JavaScript has executed. Additionally, this package simulates real user interactions like filling in text boxes, clicking, scrolling, or selecting drop-down menus, which will be useful when we scrape dynamic websites.

To use it, we'll load `webdriver` and `By` from the `selenium` package. `webdriver` opens or simulates a web browser and interacts with it based on the instructions we give. `By` allows us to specify how we will select a given element in the HTML: by tag (using `By.TAG_NAME`) or by attributes like class (`By.CLASS_NAME`), id (`By.ID`), or name (`By.NAME`). We will also load the other packages we used in the previous episode.
```python
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
```
Selenium can simulate multiple browsers like Chrome, Firefox, Safari, etc. For now, we'll use Chrome. When you run the following line of code, you'll notice that a Google Chrome window opens up. Don't close it, as this is how Selenium interacts with the browser. Later we'll see how to do *headless* browser interactions, headless meaning that browser interactions happen in the background, without opening a new browser window or user interface.

```python
driver = webdriver.Chrome()
```
Now, to tell the browser to visit our Oscar winners page, use the `.get()` method on the `driver` object we just created.

```python
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
```
How can we direct Selenium to click the text "2015" so the table for that year shows up? First, we need to find that element, in a similar way to how we find elements with BeautifulSoup. Just like we used `.find()` in BeautifulSoup to find the first element that matched the specified criteria, in Selenium we have `.find_element()`. Likewise, as we used `.find_all()` in BeautifulSoup to return a list of all matches for our search criteria, we can use `.find_elements()` in Selenium. But the syntax for specifying the search parameters is a little different.

If we wanted to search for a table element that has the `<table>` tag, we would run `driver.find_element(by=By.TAG_NAME, value="table")`. If we wanted to search for an element with a specific value in the "class" attribute, for example an element like `<tr class="film">`, we would run `driver.find_element(by=By.CLASS_NAME, value="film")`. To know what element we need to click to open the 2015 table of Oscar winners, we can use the "Inspect" tool (remember you can do this in Google Chrome by pointing your mouse over the "2015" value, right-clicking, and selecting "Inspect" from the pop-up menu). In the DevTools window, you'll see the element `<a href="#" class="year-link" id="2015">2015</a>`. As the ID attribute is unique to a single element in the HTML, we can directly select the element by this attribute using the code you'll find after the following image.

![](fig/inspect_element.png){alt="A screenshot of Google Chrome web browser, showing how to search a specific element by using Inspect from the Chrome DevTools"}

```python
button_2015 = driver.find_element(by=By.ID, value="2015")
```
We've located the hyperlink element we want to click to get the table for that year, and on that element we will use the `.click()` method to interact with it. As the table takes a couple of seconds to load, we will also use the `sleep()` function from the "time" module to wait while the JavaScript runs and the table loads. Then, we use `driver.page_source` for Selenium to get the HTML document from the website, and we store it in a variable called `html_2015`. Finally, we close the web browser that Selenium was using with `driver.quit()`.

```python
button_2015.click()
sleep(3)
html_2015 = driver.page_source
driver.quit()
```
Importantly, the HTML document we stored in `html_2015` **is the HTML after the dynamic content loaded**, so it will contain the table values for 2015 that weren't there originally and that we wouldn't be able to see if we had used the `requests` package instead.

We could continue using Selenium and its `.find_element()` and `.find_elements()` methods to extract our data of interest. But instead, we will use BeautifulSoup to parse the HTML and find elements, as we already have some practice using it. If we search for the first element with the class attribute equal to "film" and print it, we see that this HTML contains the "Spotlight" movie.
```python
soup = BeautifulSoup(html_2015, 'html.parser')
print(soup.find(class_='film').prettify())
```
```output
<tr class="film">
 <td class="film-title">
  Spotlight
 </td>
 <td class="film-nominations">
  6
 </td>
 <td class="film-awards">
  2
 </td>
 <td class="film-best-picture">
  <i class="glyphicon glyphicon-flag">
  </i>
 </td>
</tr>
```
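From here, extraction works just like in the previous episode. As a sketch (using an inline HTML snippet with the same row structure as above, so it runs without launching a browser), we can loop over every row with class "film" and collect the values into a pandas DataFrame:

```python
from bs4 import BeautifulSoup
import pandas as pd

# A small inline snippet mimicking the structure of the rendered table,
# so this sketch runs without a browser.
html_snippet = """
<table>
  <tr class="film">
    <td class="film-title">Spotlight</td>
    <td class="film-nominations">6</td>
    <td class="film-awards">2</td>
  </tr>
  <tr class="film">
    <td class="film-title">The Revenant</td>
    <td class="film-nominations">12</td>
    <td class="film-awards">3</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html_snippet, 'html.parser')
rows = []
for film in soup.find_all(class_='film'):
    rows.append({
        'title': film.find(class_='film-title').get_text(strip=True),
        'nominations': int(film.find(class_='film-nominations').get_text(strip=True)),
        'awards': int(film.find(class_='film-awards').get_text(strip=True)),
    })
df = pd.DataFrame(rows)
print(df)
```

The same loop applied to the real `html_2015` would collect every 2015 film into one table.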
## The scraping pipeline

By now, you've learned about the core tools for web scraping: requests, BeautifulSoup, and Selenium. These three tools form a versatile pipeline for almost any web scraping task. When starting a new scraping project, there are several important steps to follow that will help ensure you capture the data you need.

The first step is **understanding the website structure**. Every website is different and structures data in its own particular way. Spend some time exploring the site and identifying the HTML elements that contain the information you want. Next, **determine if the content is static or dynamic**. Static content can be directly accessed from the HTML source code using requests and BeautifulSoup, while dynamic content often requires Selenium to load JavaScript on the page before BeautifulSoup can parse it.

Once you know how the website presents its data, **start building your pipeline**. If the content is static, make a `requests` call to get the HTML document, and use `BeautifulSoup` to locate and extract the necessary elements. If the content is dynamic, use `Selenium` to load the page fully, perform any interactions (like clicking or scrolling), and then pass the rendered HTML to `BeautifulSoup` for parsing and extracting the necessary elements. Finally, **format and store the data** in a structured way that's useful for your specific project and that makes it easy to analyse later.
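These steps can be sketched as a pair of helper functions (hypothetical names, shown only to illustrate the branching; the static path uses `requests`, the dynamic path uses Selenium):

```python
import requests
from bs4 import BeautifulSoup

def get_static_html(url):
    """Static content: a plain HTTP request returns the full HTML."""
    response = requests.get(url)
    response.raise_for_status()  # stop early on a failed request
    return response.text

def get_dynamic_html(url, wait_seconds=3):
    """Dynamic content: render the page in a browser so JavaScript runs."""
    from selenium import webdriver
    from time import sleep
    driver = webdriver.Chrome()
    driver.get(url)
    sleep(wait_seconds)  # give JavaScript time to load the content
    html = driver.page_source
    driver.quit()
    return html

# Either way, the resulting HTML is handed to BeautifulSoup for parsing:
# soup = BeautifulSoup(get_static_html("https://example.com"), 'html.parser')
```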
This scraping pipeline helps break down complex scraping tasks into manageable steps and allows you to adapt the tools based on the website’s unique features. With practice, you’ll be able to efficiently combine these tools to extract valuable data from almost any website.
::::::::::::::::::::::::::::::::::::: keypoints

episodes/fig/inspect_element.PNG

196 KB
