Skip to content

Commit 4c458d2

Browse files
committed
add first section on ep3 on dynamic websites
1 parent 8b350d6 commit 4c458d2

3 files changed

Lines changed: 14 additions & 3 deletions

File tree

episodes/dynamic-websites.md

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -17,9 +17,20 @@ exercises: 5
1717

1818
::::::::::::::::::::::::::::::::::::::::::::::::
1919

20-
## To-do
20+
## You see it but the HTML doesn't have it!
21+
22+
Visit the following webpage created by Hartley Brody for the purpose of learning and practicing web scraping: [https://www.scrapethissite.com/pages/ajax-javascript/](https://www.scrapethissite.com/pages/ajax-javascript/) (read first the [terms of use](https://www.scrapethissite.com/faq/)). Select "2015" to see that year's Oscar winning films. Now look at the HTML behind it as we've learned, using the "View page source" tool on your browser or in Python using the requests and BeautifulSoup packages. Can you find in any place of the HTML the best picture winner "Spotlight"? Can you find any other of the movies or the data from the table? If you can't, how could you scrape this page?
23+
24+
When you explore a page like this, you'll notice that the movie data, including the title "Spotlight," isn’t actually in the initial HTML source. This is because the website uses JavaScript to load the information dynamically. JavaScript is a programming language that runs in your browser, allowing websites to fetch, process, and display content on the fly, often based on user actions like clicking a button. When you select "2015", the browser executes JavaScript (called by one of the `<script>` elements you see in the HTML) to retrieve the relevant movie information from the web server and build a new HTML document with actual information in the table. This makes the page feel more interactive, but it also means the initial HTML you see doesn’t contain the movie data itself.
25+
26+
Let’s explore a new way to view HTML elements in your browser to better understand the differences in an HTML document before and after JavaScript is executed. On the Oscar winners page you just visited, right-click (or Control key + Click on Mac) and select "Inspect" from the pop-up menu, as shown in the image below. This opens DevTools on the side of your browser, offering a range of features for inspecting, debugging, and analyzing web pages in real-time. For this workshop, however, we’ll focus on just the "Elements" tab. In the "Elements" tab, you’ll see an HTML document that actually includes the table element, unlike what you saw in "View Page Source". This difference is because "View Page Source" displays the original HTML, before any JavaScript is run, while "Inspect" reveals the HTML as it appears after JavaScript has executed.
27+
28+
![](fig/inspect.png){alt="A screenshot of Google Chrome web browser, showing how to use Inspect from the Chrome DevTools"}
29+
30+
As the `requests` package retrieves the source HTML, we need a different approach to scrape these types of websites. For this, we will use the `Selenium` package. But don't forget about the "Inspect" tool we just learned, it will be handy when scraping.
31+
32+
## Using Selenium to scrape dynamic websites
2133

22-
To-do
2334

2435
::::::::::::::::::::::::::::::::::::: keypoints
2536

episodes/fig/inspect.png

175 KB
Loading

episodes/hello-scraping.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ We'll refresh some of the concepts covered there to have a practical understandi
2525

2626
## HTML quick overview
2727

28-
All websites have a Hypertext Markup Language (HTML) document behind them. The following text is HTML for a very simple website, with only three sentences. If you read it, can you imagine how that website looks?
28+
All websites have a Hypertext Markup Language (HTML) document behind them. The following text is HTML for a very simple website, with only three sentences. If you look at it, can you imagine how that website looks?
2929

3030
```html
3131
<!DOCTYPE html>

0 commit comments

Comments
 (0)