episodes/hello-scraping.md
4 additions & 4 deletions
@@ -34,7 +34,7 @@ When you want to extract information or download data from a website that is too
Here, we’ll revisit some of those core ideas to build a more hands-on understanding of how content and data are structured on the web.
-We’ll start by exploring what HTTP (Hypertext Transfer Protocol) and HTML (Hypertext Markup Language) is and how it uses tags to organize and format content.
+We’ll start by exploring what HTTP (Hypertext Transfer Protocol) and HTML (Hypertext Markup Language) are, and how HTML uses tags to organize and format content in a website.
Then, we’ll introduce the BeautifulSoup library to parse HTML and make it easier to search for and extract specific elements from a webpage.
We'll begin with simple examples and gradually move on to scraping more complex, real-world websites.
@@ -44,7 +44,7 @@ We'll begin with simple examples and gradually move on to scraping more complex,
When scraping data, it is essential to adhere to two main guidelines:
1. **Data Privacy and Confidentiality**: Always confirm that the data being collected is publicly available and contains no personal or confidential information.
-2. **Server Load**: Avoid overwhelming the web server. When collecting large amount of data, best practice is to insert pauses between requests to allow the server to manage other traffic.
+2. **Server Load**: Avoid overwhelming the web server. When collecting large amounts of data, best practice is to insert pauses between requests to allow the server to manage other traffic.
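
The server-load guideline above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lesson: the helper name `fetch_politely`, its parameters, and the one-second default delay are all assumptions chosen for the example. The download step is passed in as a function so the pause logic can be tried without touching a real server.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch_politely`, `fetch`, `delay_seconds`, and `sleep` are hypothetical
    names for this sketch; they do not come from the lesson.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first, so the server
            # has time to handle other traffic.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetch function: a real scraper would download the page here.
    pages = fetch_politely(
        ["page-1", "page-2", "page-3"],
        fetch=lambda url: f"<html>{url}</html>",
        delay_seconds=0.1,
    )
    print(pages)
```

In a real scraper, `fetch` could be something like `lambda url: requests.get(url).text`, assuming the `requests` library is installed. A suitable delay depends on the site, so the value here is only a placeholder.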
## HTTP: Hypertext Transfer Protocol quick overview
@@ -53,13 +53,13 @@ When scraping data, it is essential to adhere to two main guidelines:
At the heart of web communications is the request message, which is sent via *U*niform *R*esource *L*ocators (URLs). Basic `URL` structure:
{alt='An anatomical breakdown of a URL string, labeling its components: protocol (http), host (www.domain.com), port (1234), resource path (/path/to/resource), and query (?a=b&x=y)'}
The protocol is typically http or https for secure communications. The default port is 80, but one can be set explicitly, as illustrated in the above image. The resource path is the local path to the resource on the server.
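
To see these components concretely, the example URL from the figure can be taken apart with Python's standard `urllib.parse` module. This is a small sketch added for illustration. The URL is assembled from the parts named in the figure's alt text, and the variable names are arbitrary.

```python
from urllib.parse import parse_qs, urlsplit

# Example URL assembled from the components labeled in the figure.
url = "http://www.domain.com:1234/path/to/resource?a=b&x=y"

parts = urlsplit(url)

print(parts.scheme)    # protocol
print(parts.hostname)  # host
print(parts.port)      # explicit port, or None when the URL omits it
print(parts.path)      # resource path on the server
print(parts.query)     # raw query string, without the leading "?"

# parse_qs turns the query string into a dict mapping each key to a list
# of values, because a key may appear more than once in a query.
query = parse_qs(parts.query)
print(query)
```

When a URL leaves the port out, `parts.port` is `None` and the client falls back to the protocol's default, which is 80 for http and 443 for https.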
{alt='A diagram showing the HTTP request-response cycle between a client computer and a server, highlighting the URL + Verb request and the Status Code + Message Body response'}
The actions that should be performed on the host are specified via HTTP verbs. Today we are going to focus on two actions that are often used in web forms: