Commit 16ac023

Merge pull request carpentries-incubator#7 from josenino95/main
Finished episodes 1 to 3
2 parents: 1b6dbf4 + 7116edb

7 files changed

Lines changed: 2268 additions & 14 deletions

File tree

episodes/a-real-website.md

Lines changed: 13 additions & 11 deletions
@@ -360,11 +360,13 @@ soup = BeautifulSoup(cleaned_req, 'html.parser')
 print(soup.prettify())
 ```
 
-If we explore the HTML this way, or using the 'View page source' in the browser, we notice something interesting in the `<head>` element. As this information is inside `<head>` instead of the `<body>` element, it won't be displayed in our browser when we visit the page, but the meta elements will provide metadata for search engines to better understand, display, and index the page. Each of this `<meta>` tags contain useful information for our table of workshops, for example, a well formatted start and end date, the exact location of the workshop with latitude and longitude (for those not online), the language in which it will be taught, and a more structured way of listing instructors and helpers. This is precisely the information we will extract with the following code.
+If we explore the HTML this way, or using 'View page source' in the browser, we notice something interesting in the `<head>` element. Because this information is inside `<head>` instead of the `<body>` element, it won't be displayed in our browser when we visit the page; instead, the meta elements provide metadata that helps search engines understand, display, and index the page. Each of these `<meta>` tags contains useful information for our table of workshops, for example: a well-formatted start and end date, the exact location of the workshop as latitude and longitude (for workshops that are not online), the language in which it will be taught, and a more structured way of listing instructors and helpers. Each of these data points can be identified by the "name" attribute of its `<meta>` tag, and the information we want to extract is the value of its "content" attribute.
+
+The following code automates the process of getting this data from each website, for the first five workshops in our `upcomingworkshops_df` dataframe. We limit ourselves to five workshops so we don't overwhelm the server with too many requests, but we could also do it for all the workshops.
 
 ```python
 # List of URLs in our dataframe
-urls = list(upcomingworkshops_df.loc[:, 'link'])
+urls = list(upcomingworkshops_df.loc[:5, 'link'])
 
 # Start an empty list to store the different dictionaries with our data
 list_of_workshops = []
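A side note on the changed line in this hunk: pandas `.loc` slices by label and is end-inclusive, so with a default `RangeIndex`, `loc[:5]` actually returns six rows (labels 0 through 5), not five. A minimal sketch with an invented dataframe (the column name `link` matches the episode; the URLs are made up):

```python
import pandas as pd

# Invented stand-in for upcomingworkshops_df, with a default RangeIndex
df = pd.DataFrame({'link': [f'https://example.org/w{i}' for i in range(10)]})

# .loc slices by label and is end-inclusive: labels 0..5 -> 6 rows
six_links = list(df.loc[:5, 'link'])

# .iloc slices by position and is end-exclusive: exactly 5 rows
five_links = list(df.iloc[:5]['link'])

print(len(six_links), len(five_links))  # 6 5
```

If exactly five workshops are wanted, `df['link'].head(5)` or `df.iloc[:5]` would be the safer choice.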
@@ -382,14 +384,14 @@ for item in tqdm(urls):
     dict_w['link'] = item
 
     # Use the find function to search for the <meta> tag that
-    # has each specific name attribute and get the value in the
-    # content attribute
-    dict_w['startdate'] = soup.find('meta', attrs ={'name': 'startdate'})['content']
-    dict_w['enddate'] = soup.find('meta', attrs ={'name': 'enddate'})['content']
-    dict_w['language'] = soup.find('meta', attrs ={'name': 'language'})['content']
-    dict_w['latlng'] = soup.find('meta', attrs ={'name': 'latlng'})['content']
-    dict_w['instructor'] = soup.find('meta', attrs ={'name': 'instructor'})['content']
-    dict_w['helper'] = soup.find('meta', attrs ={'name': 'helper'})['content']
+    # has each specific 'name' attribute and get the value in the
+    # 'content' attribute
+    dict_w['startdate'] = soup.find('meta', attrs = {'name': 'startdate'})['content']
+    dict_w['enddate'] = soup.find('meta', attrs = {'name': 'enddate'})['content']
+    dict_w['language'] = soup.find('meta', attrs = {'name': 'language'})['content']
+    dict_w['latlng'] = soup.find('meta', attrs = {'name': 'latlng'})['content']
+    dict_w['instructor'] = soup.find('meta', attrs = {'name': 'instructor'})['content']
+    dict_w['helper'] = soup.find('meta', attrs = {'name': 'helper'})['content']
 
     # Append to our list
     list_of_workshops.append(dict_w)
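As a standalone illustration of the `<meta>` lookup used in this loop, the sketch below runs against a small invented HTML fragment (the tag names mirror the workshop pages; the HTML itself is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical <head> fragment shaped like a workshop page
html = """
<html><head>
  <meta name="startdate" content="2024-05-01" />
  <meta name="enddate" content="2024-05-02" />
  <meta name="language" content="en" />
</head><body></body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# For each field, find the <meta> tag whose 'name' attribute matches,
# then read the value of its 'content' attribute
record = {name: soup.find('meta', attrs={'name': name})['content']
          for name in ('startdate', 'enddate', 'language')}

print(record['startdate'])  # 2024-05-01
```

Note that `soup.find` returns `None` when no matching tag exists, so indexing the result with `['content']` raises a `TypeError` on pages missing a field; checking the result before indexing is one defensive option.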
@@ -402,7 +404,7 @@ extradata_upcoming_df = pd.DataFrame(list_of_workshops)
 
 ::::::::::::::::::::::::::::::::::::: challenge
 
-It is possible that you received an error when executing the previous block code, and the most probable reason is that the URL your tried to visit didn't exist. This is known as 404 code error, that indicates the requested page cannot be found on the server. What would be your approach to work around this possible error?
+It is possible that you received an error when executing the previous code block, and the most probable reason is that the URL you tried to visit didn't exist. This is known as a 404 error code, which indicates that the requested page doesn't exist or, more precisely, cannot be found on the server. What would be your approach to work around this possible error?
 
 :::::::::::::::::::::::: solution
 