Commit 443333d

Merge pull request #1 from UCSBCarpentry/expand-intro
add more content to intro with why, be respectful and intro to http
2 parents cc9152f + 47e9eb7 commit 443333d

3 files changed

Lines changed: 47 additions & 1 deletion

episodes/hello-scraping.md

Lines changed: 47 additions & 1 deletion
@@ -24,12 +24,58 @@ exercises: 10
2424
This workshop is a continuation of our Introduction to Web Scraping workshop.
2525
If you're looking for a gentler introduction that uses XPath and the Scraper Chrome extension, take a look at the [workshop materials for that workshop](https://carpentries-incubator.github.io/lc-webscraping/).
2626

27+
As a reminder, web scraping is necessary when a website offers no interface for automated data retrieval, whether through Web services such as REST or SOAP or through other Application Programming Interfaces (APIs). In that case, the only option is to “scrape” the information embedded in the website itself.
28+
29+
When you want to extract information or download data from a website, and the data is too large to download manually or needs to be updated frequently, you should first:
30+
31+
1. Check if the website has any available Web services or if APIs have been developed to this end
32+
2. Check if any R (or other language you know) package has been developed by others as a wrapper around the API to facilitate the use of these Web services
33+
3. Nothing found? Well, let's code this ourselves then!
34+
35+
2736
Here, we’ll revisit some of those core ideas to build a more hands-on understanding of how content and data are structured on the web.
28-
We’ll start by exploring what HTML (Hypertext Markup Language) is and how it uses tags to organize and format content.
37+
We’ll start by exploring what HTTP (Hypertext Transfer Protocol) and HTML (Hypertext Markup Language) are, and how HTML uses tags to organize and format content in a website.
2938
Then, we’ll introduce the BeautifulSoup library to parse HTML and make it easier to search for and extract specific elements from a webpage.
3039

3140
We'll begin with simple examples and gradually move on to scraping more complex, real-world websites.
3241

42+
### Be respectful
43+
44+
When scraping data, it is essential to adhere to two main guidelines:
45+
46+
1. **Data Privacy and Confidentiality**: Always confirm that the data being collected is publicly available and contains no personal or confidential information.
47+
2. **Server Load**: Avoid overwhelming the web server. When collecting large amounts of data, best practice is to insert pauses between requests to allow the server to manage other traffic.
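The server-load guideline can be sketched in Python as follows. This is a minimal illustration, not the lesson's own code: the helper names `fetch_page` and `fetch_politely` and the one-second default delay are assumptions chosen for the example, and a suitable delay depends on the site being scraped.

```python
import time
from urllib.request import urlopen


def fetch_page(url):
    """Download one page and return its body as bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_politely(urls, fetch=fetch_page, delay_seconds=1.0):
    """Fetch each URL in turn, sleeping between consecutive requests.

    `delay_seconds` is an assumed default; pick a value that suits the site.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one,
            # so the server has time to handle other traffic.
            time.sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages
```

Passing the `fetch` function as a parameter keeps the pausing logic separate from the download itself, so the same loop can be reused with a different HTTP client.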
48+
49+
50+
## HTTP: Hypertext Transfer Protocol quick overview
51+
52+
### URL
53+
54+
At the heart of web communication is the request message, which is addressed to a *U*niform *R*esource *L*ocator (URL). The basic URL structure is:
55+
56+
![credits: https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177](fig/http1-url-structure.png){alt='An anatomical breakdown of a URL string, labeling its components: protocol (http), host (www.domain.com), port (1234), resource path (/path/to/resource), and query (?a=b&x=y)'}
57+
58+
The protocol is typically http, or https for secure communications. The default port is 80 for http and 443 for https, but a different one can be set explicitly, as illustrated in the above image. The resource path is the local path to the resource on the server.
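As an illustration, Python's standard library can split a URL into the components labeled in the figure. The URL below is assembled from the figure's example values and is only a placeholder, and the dictionary keys are names chosen for this sketch.

```python
from urllib.parse import parse_qs, urlsplit

# Placeholder URL built from the example values in the figure above.
url = "http://www.domain.com:1234/path/to/resource?a=b&x=y"

parts = urlsplit(url)

components = {
    "protocol": parts.scheme,         # "http"
    "host": parts.hostname,           # "www.domain.com"
    "port": parts.port,               # 1234 (None when the URL does not set one)
    "resource_path": parts.path,      # "/path/to/resource"
    "query": parse_qs(parts.query),   # {"a": ["b"], "x": ["y"]}
}

for name, value in components.items():
    print(f"{name}: {value}")
```

Note that `parse_qs` returns a list for each query key, because the same key may appear more than once in a query string.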
59+
60+
### Request
61+
62+
![credits: https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177](fig/http1-req-res-details.png){alt='A diagram showing the HTTP request-response cycle between a client computer and a server, highlighting the URL + Verb request and the Status Code + Message Body response'}
63+
64+
The actions that should be performed on the host are specified via HTTP verbs. Today we are going to focus on two actions that are often used in web forms:
65+
66+
- `GET`: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
67+
- `POST`: create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
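The difference between the two verbs can be sketched with the standard library by building, but not sending, one request of each kind. The URL and the form fields are placeholders assumed for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint and form fields, assumed for illustration only.
base_url = "http://www.domain.com/search"
fields = {"a": "b", "x": "y"}

# GET: everything the server needs is encoded in the URL itself.
get_request = Request(base_url + "?" + urlencode(fields), method="GET")

# POST: the data travels in the request body (the payload), not in the URL.
payload = urlencode(fields).encode("utf-8")
post_request = Request(base_url, data=payload, method="POST")

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Nothing is transmitted here; passing either object to `urllib.request.urlopen` would actually send the request.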
68+
69+
### Response
70+
71+
Status codes:
72+
73+
- `1xx`: Informational Messages
74+
- `2xx`: Successful; the best known is `200 OK`, meaning the request was successfully processed
75+
- `3xx`: Redirection
76+
- `4xx`: Client Error; the famous 404: resource not found
77+
- `5xx`: Server Error
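A small sketch of how these classes could be checked in code, using the first digit of the status code. The function name `describe_status` and the output format are assumptions made for this example; `http.HTTPStatus` is part of Python's standard library.

```python
from http import HTTPStatus

# The five status classes listed above, keyed by the code's first digit.
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found (Client Error)'."""
    category = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered codes known to HTTPStatus.
        phrase = "unregistered code"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

When scraping, a check like this is typically applied to the response status before trying to parse the body, since an error page would otherwise be parsed as if it were data.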
78+
3379
## HTML quick overview
3480

3581
All websites have a Hypertext Markup Language (HTML) document behind them.
