Skip to content

Commit 8b350d6

Browse files
committed
updates on episode 1 and 2
1 parent 44e0304 commit 8b350d6

3 files changed

Lines changed: 79 additions & 12 deletions

File tree

episodes/a-real-website.md

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -360,11 +360,13 @@ soup = BeautifulSoup(cleaned_req, 'html.parser')
360360
print(soup.prettify())
361361
```
362362

363-
If we explore the HTML this way, or using the 'View page source' in the browser, we notice something interesting in the `<head>` element. As this information is inside `<head>` instead of the `<body>` element, it won't be displayed in our browser when we visit the page, but the meta elements will provide metadata for search engines to better understand, display, and index the page. Each of this `<meta>` tags contain useful information for our table of workshops, for example, a well formatted start and end date, the exact location of the workshop with latitude and longitude (for those not online), the language in which it will be taught, and a more structured way of listing instructors and helpers. This is precisely the information we will extract with the following code.
363+
If we explore the HTML this way, or using the 'View page source' in the browser, we notice something interesting in the `<head>` element. As this information is inside `<head>` instead of the `<body>` element, it won't be displayed in our browser when we visit the page, but the meta elements will provide metadata for search engines to better understand, display, and index the page. Each of this `<meta>` tags contain useful information for our table of workshops, for example, a well formatted start and end date, the exact location of the workshop with latitude and longitude (for those not online), the language in which it will be taught, and a more structured way of listing instructors and helpers. Each of these data points can be identified by the the "name" attribute in the `<meta>` tags, and the information we want to extract is the value in the "content" attribute.
364+
365+
The following code automates the process of getting this data from each website, for the first five workshops in our `upcomingworkshops_df` dataframe. We will only do it for five workshops to not send too many requests overwhelming the server, but we could also do it for all the workshops.
364366

365367
```python
366368
# List of URLs in our dataframe
367-
urls = list(upcomingworkshops_df.loc[:, 'link'])
369+
urls = list(upcomingworkshops_df.loc[:5, 'link'])
368370

369371
# Start an empty list to store the different dictionaries with our data
370372
list_of_workshops = []
@@ -382,8 +384,8 @@ for item in tqdm(urls):
382384
dict_w['link'] = item
383385

384386
# Use the find function to search for the <meta> tag that
385-
# has each specific name attribute and get the value in the
386-
# content attribute
387+
# has each specific 'name' attribute and get the value in the
388+
# 'content' attribute
387389
dict_w['startdate'] = soup.find('meta', attrs ={'name': 'startdate'})['content']
388390
dict_w['enddate'] = soup.find('meta', attrs ={'name': 'enddate'})['content']
389391
dict_w['language'] = soup.find('meta', attrs ={'name': 'language'})['content']
@@ -402,7 +404,7 @@ extradata_upcoming_df = pd.DataFrame(list_of_workshops)
402404

403405
::::::::::::::::::::::::::::::::::::: challenge
404406

405-
It is possible that you received an error when executing the previous block code, and the most probable reason is that the URL your tried to visit didn't exist. This is known as 404 code error, that indicates the requested page cannot be found on the server. What would be your approach to work around this possible error?
407+
It is possible that you received an error when executing the previous block code, and the most probable reason is that the URL your tried to visit didn't exist. This is known as 404 code error, that indicates the requested page doesn't exist, or more precisely, it cannot be found on the server. What would be your approach to work around this possible error?
406408

407409
:::::::::::::::::::::::: solution
408410

episodes/hello-scraping.md

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -310,13 +310,13 @@ Besides checking the website's policies, you should also be aware of the legisla
310310
To conclude, here is a brief code of conduct you should consider when doing web scraping:
311311

312312

313-
1. Ask nicely. If your project requires data from a particular organisation, you can try asking them directly if they could provide you what you are looking for, or check if they have an API to access the data. With some luck, they will have the primary data that they used on their website in a structured format, saving you the trouble.
314-
1. Don’t download copies of documents that are clearly not public. For example, academic journal publishers often have very strict rules about what you can and what you cannot do with their databases. Mass downloading article PDFs is probably prohibited and can put you (or at the very least your friendly university librarian) in trouble. If your project requires local copies of documents (e.g. for text mining projects), special agreements can be reached with the publisher. The library is a good place to start investigating something like that.
315-
1. Check your local legislation. For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly avaialable web sites, can be illegal (e.g. in Australia).
316-
1. Don’t share downloaded content illegally. Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don’t hold the right to share is illegal.
317-
1. Share what you can. If the data you scraped is in the public domain or you got permission to share it, then put it out there for other people to reuse it (e.g. on datahub.io). If you wrote a web scraper to access it, share its code (e.g. on GitHub) so that others can benefit from it.
318-
1. Don’t break the Internet. Not all web sites are designed to withstand thousands of requests per second. If you are writing a recursive scraper (i.e. that follows hyperlinks), test it on a smaller dataset first to make sure it does what it is supposed to do. Adjust the settings of your scraper to allow for a delay between requests. More on this topic on the next episode.
319-
1. Publish your own data in a reusable way. Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.
313+
1. **Ask nicely if you can access the data in another way**. If your project requires data from a particular organisation, you can try asking them directly if they could provide you what you are looking for, or check if they have an API to access the data. With some luck, they will have the primary data that they used on their website in a structured format, saving you the trouble.
314+
1. **Don’t download copies of documents that are clearly not public**. For example, academic journal publishers often have very strict rules about what you can and what you cannot do with their databases. Mass downloading article PDFs is probably prohibited and can put you (or at the very least your friendly university librarian) in trouble. If your project requires local copies of documents (e.g. for text mining projects), special agreements can be reached with the publisher. The library is a good place to start investigating something like that.
315+
1. **Check your local legislation**. For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly avaialable web sites, can be illegal (e.g. in Australia).
316+
1. **Don’t share downloaded content illegally**. Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don’t hold the right to share is illegal.
317+
1. **Share what you can**. If the data you scraped is in the public domain or you got permission to share it, then put it out there for other people to reuse it (e.g. on datahub.io). If you wrote a web scraper to access it, share its code (e.g. on GitHub) so that others can benefit from it.
318+
1. **Publish your own data in a reusable way**. Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.
319+
1. **Don’t break the Internet**. Not all web sites are designed to withstand thousands of requests per second. If you are writing a recursive scraper (i.e. that follows hyperlinks), test it on a smaller dataset first to make sure it does what it is supposed to do. Adjust the settings of your scraper to allow for a delay between requests. More on this topic in the next episode.
320320

321321
::::::::::::::::::::::::::::::::::::: keypoints
322322

notebooks/ep3.ipynb

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,65 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# First webscraping task"
8+
]
9+
},
10+
{
11+
"cell_type": "code",
12+
"execution_count": 2,
13+
"metadata": {},
14+
"outputs": [],
15+
"source": [
16+
"# Load BeautifulSoup, and pandas to store in a dataframe\n",
17+
"from bs4 import BeautifulSoup\n",
18+
"import pandas as pd\n",
19+
"import requests\n",
20+
"import re\n",
21+
"from time import sleep\n",
22+
"from tqdm import tqdm"
23+
]
24+
},
25+
{
26+
"cell_type": "code",
27+
"execution_count": 3,
28+
"metadata": {},
29+
"outputs": [],
30+
"source": [
31+
"from selenium import webdriver"
32+
]
33+
},
34+
{
35+
"cell_type": "code",
36+
"execution_count": 4,
37+
"metadata": {},
38+
"outputs": [],
39+
"source": [
40+
"driver = webdriver.Firefox()"
41+
]
42+
}
43+
],
44+
"metadata": {
45+
"kernelspec": {
46+
"display_name": "webscraping",
47+
"language": "python",
48+
"name": "python3"
49+
},
50+
"language_info": {
51+
"codemirror_mode": {
52+
"name": "ipython",
53+
"version": 3
54+
},
55+
"file_extension": ".py",
56+
"mimetype": "text/x-python",
57+
"name": "python",
58+
"nbconvert_exporter": "python",
59+
"pygments_lexer": "ipython3",
60+
"version": "3.13.0"
61+
}
62+
},
63+
"nbformat": 4,
64+
"nbformat_minor": 2
65+
}

0 commit comments

Comments
 (0)