Commit 38d6c5a

finished ep 2
1 parent c14283d commit 38d6c5a

2 files changed

Lines changed: 806 additions & 747 deletions

File tree

episodes/a-real-website.md

Lines changed: 159 additions & 5 deletions
---
title: "Scraping a real website"
teaching: 40
exercises: 15
---

```python
for row in workshops_table.find_all('tr'):
    # ... (cell extraction code as before) ...
    dict_w['date'] = third_cell.get_text(strip=True)
    list_of_workshops.append(dict_w)

upcomingworkshops_df = pd.DataFrame(list_of_workshops)
```

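As a quick aside, `pd.DataFrame` turns a list of dictionaries like the one built above into a table, with one column per dictionary key. A minimal self-contained sketch with made-up rows (not data from the site):

```python
import pandas as pd

# Made-up rows shaped like the dictionaries our scraping loop builds
rows = [
    {'host': 'University A', 'date': '1 Jan 2024', 'country': 'US'},
    {'host': 'University B', 'date': '2 Feb 2024', 'country': 'GB'},
]
df = pd.DataFrame(rows)
print(df.shape)          # two rows, three columns
print(list(df.columns))  # column names come from the dictionary keys
```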
A key takeaway from this exercise is that, when we want to scrape data in a structured way, we have to spend some time getting to know how the website is structured and how we can identify and extract only the elements we are interested in.
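When sizing up an unfamiliar page, it can help to prettify a small piece of the parsed HTML and read the tag nesting before writing any extraction code. A toy sketch using an inline HTML snippet (a hypothetical stand-in, not fetched from the live site):

```python
from bs4 import BeautifulSoup

# A miniature stand-in for one table row of a workshops page
html = ('<tr><td><img title="SWC"/></td>'
        '<td><a href="/2024-01-01-uni">University A</a></td>'
        '<td>Jan 1 - Jan 2, 2024</td></tr>')
soup = BeautifulSoup(html, 'html.parser')

# prettify() indents nested tags, making the structure easy to scan
print(soup.prettify())
```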
::::::::::::::::::::::::::::::::::::: challenge

Extract the same information as in the previous exercise, but this time from the Past Workshops page at [https://carpentries.org/past_workshops/](https://carpentries.org/past_workshops/). Which five countries have held the most workshops, and how many has each held?

:::::::::::::::::::::::: solution

We can directly reuse the code we wrote before, changing only the URL we request the HTML from.
```python
url = 'https://carpentries.org/past_workshops/'
req = requests.get(url).text
cleaned_req = re.sub(r'\s*\n\s*', '', req).strip()

soup = BeautifulSoup(cleaned_req, 'html.parser')
workshops_table = soup.find('table')

list_of_workshops = []
for row in workshops_table.find_all('tr'):
    cells = row.find_all('td')
    first_cell = cells[0]
    second_cell = cells[1]
    third_cell = cells[2]
    dict_w = {}
    dict_w['type'] = first_cell.find('img')['title']
    dict_w['country'] = second_cell.find('img')['title']
    dict_w['link'] = second_cell.find('a')['href']
    dict_w['host'] = second_cell.find('a').get_text()
    dict_w['all_text'] = second_cell.get_text(strip=True)
    dict_w['date'] = third_cell.get_text(strip=True)
    list_of_workshops.append(dict_w)

pastworkshops_df = pd.DataFrame(list_of_workshops)

print('Total number of workshops in the table: ', len(pastworkshops_df))

print('Top 5 of countries by number of workshops held: \n',
      pastworkshops_df['country'].value_counts().head())
```
```output
Total number of workshops in the table:  3830
Top 5 of countries by number of workshops held: 
 country
US    1837
GB     468
AU     334
CA     225
DE     172
Name: count, dtype: int64
```

:::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::
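The `value_counts` method used in the solution counts how many times each distinct value appears in a column, sorted from most to least frequent. A small illustration with made-up data:

```python
import pandas as pd

# Toy country column; 'US' appears three times
countries = pd.Series(['US', 'GB', 'US', 'CA', 'US'])
print(countries.value_counts())
```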

::::::::::::::::::::::::::::::::::::: challenge

For a more challenging exercise, try adding to our output dataframe whether or not each workshop was held online.

You'll notice on the website that online workshops have a world icon between the country flag and the name of the institution hosting the workshop.

:::::::::::::::::::::::: solution

To start, we can see in the HTML document that the world icon appears in the second data cell of a row. For workshops that are online, there is an additional image element with these attributes: `<img title="Online" alt="globe image" class="flags"/>`. So we can check whether the second data cell contains an element with the attribute `title="Online"`. If it doesn't, the `.find()` method returns nothing, which in Python is the `None` value. So if `.find()` returns `None`, we fill the respective cell in our dataframe with "No", meaning the workshop was not held online, and otherwise we fill it with "Yes". Here is a possible solution, which you would add to the previous code where we extracted the other data and created the dataframe.
```python
if second_cell.find(title="Online") is None:
    online_value = "No"
else:
    online_value = "Yes"
dict_w['online'] = online_value
```
330+
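To see why the `None` check works, here is a tiny self-contained illustration with made-up HTML cells:

```python
from bs4 import BeautifulSoup

# Toy cells: one with the "Online" globe image, one without
online_cell = BeautifulSoup('<td><img title="Online"/>Host A</td>', 'html.parser')
onsite_cell = BeautifulSoup('<td>Host B</td>', 'html.parser')

print(online_cell.find(title="Online"))  # prints the matching <img> tag
print(onsite_cell.find(title="Online"))  # no match, prints None
```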
:::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::

## Automating data collection

Until now we've only scraped one page at a time. But there may be situations where the information you need is split across different pages, or where you have to follow a trail of hyperlinks. With the tools we've learned so far, this new task is straightforward: we add a loop that visits those other pages, gets each HTML document using the `requests` package, and parses it with `BeautifulSoup` to extract the required information.

The additional and important step in this task is to add a wait time between requests, so we don't overload the web server that is providing the information we need. If we send too many requests in a short period of time, we can prevent other "normal" users from accessing the site during that time, or even cause the server to run out of resources and crash. If the provider of the website detects excessive use, it could block our computer from accessing that website, or even take legal action in extreme cases.

To make sure we don't crash the server, we can add a wait time between the steps of our loop with the built-in Python module `time` and its `sleep()` function. With this function, Python waits the specified number of seconds before executing the next line of code. For example, when you run the following code, Python will wait 10 seconds between the two print calls.
```python
from time import sleep

print('First')
sleep(10)
print('Second')
```
Let's apply this important principle while extracting additional information from each workshop website in our upcoming workshops list. We already have our `upcomingworkshops_df` dataframe, which includes a `link` column with the URL of each individual workshop's website. For example, let's request the HTML of the first workshop in the dataframe and take a look.
```python
first_url = upcomingworkshops_df.loc[0, 'link']
print("URL we are visiting: ", first_url)

req = requests.get(first_url).text
cleaned_req = re.sub(r'\s*\n\s*', '', req).strip()

soup = BeautifulSoup(cleaned_req, 'html.parser')
print(soup.prettify())
```
If we explore the HTML this way, or use 'View page source' in the browser, we notice something interesting in the `<head>` element. Because this information is inside `<head>` rather than the `<body>` element, it won't be displayed in our browser when we visit the page; instead, the `<meta>` elements provide metadata that helps search engines understand, display, and index the page. Each of these `<meta>` tags contains information that is useful for our table of workshops: for example, well-formatted start and end dates, the exact latitude and longitude of the workshop (for those not online), the language it will be taught in, and a more structured list of instructors and helpers. This is precisely the information we will extract with the following code.
```python
from tqdm import tqdm  # displays a progress bar as the loop runs

# List of URLs in our dataframe
urls = list(upcomingworkshops_df.loc[:, 'link'])

# Start an empty list to store the different dictionaries with our data
list_of_workshops = []

# Start a loop over each URL
for item in tqdm(urls):
    # Get the HTML and parse it
    req = requests.get(item).text
    cleaned_req = re.sub(r'\s*\n\s*', '', req).strip()
    soup = BeautifulSoup(cleaned_req, 'html.parser')

    # Start an empty dictionary and fill it with the URL, which
    # is our identifier to link back to our other dataframe
    dict_w = {}
    dict_w['link'] = item

    # Use the find method to search for the <meta> tag that
    # has each specific name attribute and get the value of
    # its content attribute
    dict_w['startdate'] = soup.find('meta', attrs={'name': 'startdate'})['content']
    dict_w['enddate'] = soup.find('meta', attrs={'name': 'enddate'})['content']
    dict_w['language'] = soup.find('meta', attrs={'name': 'language'})['content']
    dict_w['latlng'] = soup.find('meta', attrs={'name': 'latlng'})['content']
    dict_w['instructor'] = soup.find('meta', attrs={'name': 'instructor'})['content']
    dict_w['helper'] = soup.find('meta', attrs={'name': 'helper'})['content']

    # Append to our list
    list_of_workshops.append(dict_w)

    # Be respectful: wait at least 3 seconds before the next request
    sleep(3)

extradata_upcoming_df = pd.DataFrame(list_of_workshops)
```
::::::::::::::::::::::::::::::::::::: challenge

It is possible that you received an error when executing the previous code block, and the most probable reason is that one of the URLs you tried to visit didn't exist. This is known as a 404 error, a response code that indicates the requested page could not be found on the server. What would be your approach to work around this possible error?

:::::::::::::::::::::::: solution

A crude but Pythonic way of working around any error for a given URL is to use a [try and except block](https://docs.python.org/3/tutorial/errors.html), with which you would ignore any URL that throws an error and continue with the next one.

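A minimal sketch of that try/except approach, using a hypothetical URL list (the second URL is deliberately broken to show the skip):

```python
import requests
from requests.exceptions import RequestException

# Hypothetical list of URLs; the second one does not exist
urls = ['https://carpentries.org/past_workshops/',
        'https://example.invalid/not-a-page/']

collected = []
for url in urls:
    try:
        req = requests.get(url, timeout=10)
        req.raise_for_status()  # raises HTTPError for 404 and other bad codes
        collected.append(req.text)
    except RequestException as err:
        # Any failed URL (404, timeout, bad domain) is reported and skipped
        print('Skipping', url, '->', err)
```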
A more graceful way to handle a web page that doesn't exist is to check the response code `requests` gets when it tries to reach the page. A 200 code means the request was successful; for any other code, you'd want to store the code and skip scraping that page. The code you'd use to get the response code is:

```python
req = requests.get(url)
print(req.status_code)  # 200 means the request was successful
```
:::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: keypoints
