updates on episode 1 and 2

josenino95 · josenino95 · commit 8b350d6be0e3 · 2024-10-30T17:07:46.000-07:00
diff --git a/episodes/a-real-website.md b/episodes/a-real-website.md
@@ -360,11 +360,13 @@ soup = BeautifulSoup(cleaned_req, 'html.parser')
 print(soup.prettify())
 ```
 
-If we explore the HTML this way, or using the 'View page source' in the browser, we notice something interesting in the `<head>` element. As this information is inside `<head>` instead of the `<body>` element, it won't be displayed in our browser when we visit the page, but the meta elements will provide metadata for search engines to better understand, display, and index the page. Each of this `<meta>` tags contain useful information for our table of workshops, for example, a well formatted start and end date, the exact location of the workshop with latitude and longitude (for those not online), the language in which it will be taught, and a more structured way of listing instructors and helpers. This is precisely the information we will extract with the following code.
+If we explore the HTML this way, or using the 'View page source' in the browser, we notice something interesting in the `<head>` element. As this information is inside `<head>` instead of the `<body>` element, it won't be displayed in our browser when we visit the page, but the meta elements will provide metadata for search engines to better understand, display, and index the page. Each of this `<meta>` tags contain useful information for our table of workshops, for example, a well formatted start and end date, the exact location of the workshop with latitude and longitude (for those not online), the language in which it will be taught, and a more structured way of listing instructors and helpers. Each of these data points can be identified by the the "name" attribute in the `<meta>` tags, and the information we want to extract is the value in the "content" attribute.
+
+The following code automates the process of getting this data from each website, for the first five workshops in our `upcomingworkshops_df` dataframe. We will only do it for five workshops to not send too many requests overwhelming the server, but we could also do it for all the workshops.
 
 ```python
 # List of URLs in our dataframe
-urls = list(upcomingworkshops_df.loc[:, 'link'])
+urls = list(upcomingworkshops_df.loc[:5, 'link'])
 
 # Start an empty list to store the different dictionaries with our data
 list_of_workshops = []
@@ -382,8 +384,8 @@ for item in tqdm(urls):
     dict_w['link'] = item
 
     # Use the find function to search for the <meta> tag that 
-    # has each specific name attribute and get the value in the
-    # content attribute
+    # has each specific 'name' attribute and get the value in the
+    # 'content' attribute
     dict_w['startdate'] = soup.find('meta', attrs ={'name': 'startdate'})['content']
     dict_w['enddate'] = soup.find('meta', attrs ={'name': 'enddate'})['content']
     dict_w['language'] = soup.find('meta', attrs ={'name': 'language'})['content']
@@ -402,7 +404,7 @@ extradata_upcoming_df = pd.DataFrame(list_of_workshops)
 
 ::::::::::::::::::::::::::::::::::::: challenge
 
-It is possible that you received an error when executing the previous block code, and the most probable reason is that the URL your tried to visit didn't exist. This is known as 404 code error, that indicates the requested page cannot be found on the server. What would be your approach to work around this possible error?
+It is possible that you received an error when executing the previous block code, and the most probable reason is that the URL your tried to visit didn't exist. This is known as 404 code error, that indicates the requested page doesn't exist, or more precisely, it cannot be found on the server. What would be your approach to work around this possible error?
 
 :::::::::::::::::::::::: solution
 
diff --git a/episodes/hello-scraping.md b/episodes/hello-scraping.md
@@ -310,13 +310,13 @@ Besides checking the website's policies, you should also be aware of the legisla
 To conclude, here is a brief code of conduct you should consider when doing web scraping:
 
 
-1. Ask nicely. If your project requires data from a particular organisation, you can try asking them directly if they could provide you what you are looking for, or check if they have an API to access the data. With some luck, they will have the primary data that they used on their website in a structured format, saving you the trouble.
-1. Don’t download copies of documents that are clearly not public. For example, academic journal publishers often have very strict rules about what you can and what you cannot do with their databases. Mass downloading article PDFs is probably prohibited and can put you (or at the very least your friendly university librarian) in trouble. If your project requires local copies of documents (e.g. for text mining projects), special agreements can be reached with the publisher. The library is a good place to start investigating something like that.
-1. Check your local legislation. For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly avaialable web sites, can be illegal (e.g. in Australia).
-1. Don’t share downloaded content illegally. Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don’t hold the right to share is illegal.
-1. Share what you can. If the data you scraped is in the public domain or you got permission to share it, then put it out there for other people to reuse it (e.g. on datahub.io). If you wrote a web scraper to access it, share its code (e.g. on GitHub) so that others can benefit from it.
-1. Don’t break the Internet. Not all web sites are designed to withstand thousands of requests per second. If you are writing a recursive scraper (i.e. that follows hyperlinks), test it on a smaller dataset first to make sure it does what it is supposed to do. Adjust the settings of your scraper to allow for a delay between requests. More on this topic on the next episode.
-1. Publish your own data in a reusable way. Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.
+1. **Ask nicely if you can access the data in another way**. If your project requires data from a particular organisation, you can try asking them directly if they could provide you what you are looking for, or check if they have an API to access the data. With some luck, they will have the primary data that they used on their website in a structured format, saving you the trouble.
+1. **Don’t download copies of documents that are clearly not public**. For example, academic journal publishers often have very strict rules about what you can and what you cannot do with their databases. Mass downloading article PDFs is probably prohibited and can put you (or at the very least your friendly university librarian) in trouble. If your project requires local copies of documents (e.g. for text mining projects), special agreements can be reached with the publisher. The library is a good place to start investigating something like that.
+1. **Check your local legislation**. For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly avaialable web sites, can be illegal (e.g. in Australia).
+1. **Don’t share downloaded content illegally**. Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don’t hold the right to share is illegal.
+1. **Share what you can**. If the data you scraped is in the public domain or you got permission to share it, then put it out there for other people to reuse it (e.g. on datahub.io). If you wrote a web scraper to access it, share its code (e.g. on GitHub) so that others can benefit from it.
+1. **Publish your own data in a reusable way**. Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.
+1. **Don’t break the Internet**. Not all web sites are designed to withstand thousands of requests per second. If you are writing a recursive scraper (i.e. that follows hyperlinks), test it on a smaller dataset first to make sure it does what it is supposed to do. Adjust the settings of your scraper to allow for a delay between requests. More on this topic in the next episode.
 
 ::::::::::::::::::::::::::::::::::::: keypoints 
 
diff --git a/notebooks/ep3.ipynb b/notebooks/ep3.ipynb
@@ -0,0 +1,65 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# First webscraping task"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load BeautifulSoup, and pandas to store in a dataframe\n",
+    "from bs4 import BeautifulSoup\n",
+    "import pandas as pd\n",
+    "import requests\n",
+    "import re\n",
+    "from time import sleep\n",
+    "from tqdm import tqdm"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from selenium import webdriver"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "driver = webdriver.Firefox()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "webscraping",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}