Commit 667d5a9

work updated
1 parent 791f1e8 commit 667d5a9

2 files changed

Lines changed: 29 additions & 16 deletions

File tree

episodes/introduction.md

Lines changed: 29 additions & 7 deletions
@@ -25,7 +25,7 @@ We'll refresh some of the concepts covered there to have a practical understandi
2525

2626
## HTML quick overview
2727

28-
All websites have Hypertext Markup Language (HTML) behind them. The following text is HTML for a very simple website, with only three sentences. If you read it, can you imagine how that website looks?
28+
All websites have a Hypertext Markup Language (HTML) document behind them. The following text is HTML for a very simple website, with only three sentences. If you read it, can you imagine how that website looks?
2929

3030
```html
3131
<!DOCTYPE html>
@@ -34,13 +34,13 @@ All websites have Hypertext Markup Language (HTML) behind them. The following te
3434
<title>Sample web page</title>
3535
</head>
3636
<body>
37-
<h1>h1 Header #1</h1>
37+
<h1 id="head1">h1 Header #1</h1>
3838
<p>This is a paragraph tag</p>
39-
<h2>h2 Sub-header</h2>
39+
<h2 id="subhead1">h2 Sub-header</h2>
4040
<p>A new paragraph, now in the <b>sub-header</b></p>
41-
<h1>h1 Header #2</h1>
41+
<h1 id="head2">h1 Header #2</h1>
4242
<p>
43-
This other paragraph has two hyperlinks,
43+
This other paragraph has two hyperlinks,
4444
one to <a href="https://carpentries.org/">The Carpentries homepage</a>,
4545
and another to the
4646
<a href="https://carpentries.org/past_workshops/">past workshops</a> page.
@@ -53,12 +53,34 @@ Well, if you put that text in a file with a .html extension, the job of your web
5353

5454
![](fig/simple_website.PNG){alt="Screenshot of a simple website with the previous HTML"}
5555

56-
HTML is composed of tags
56+
An HTML document is composed of elements, which can be identified by tags written inside angle brackets (`<` and `>`). For example, the HTML root element, which delimits the beginning and end of an HTML document, is identified by the `<html>` tag.
5757

58+
Most elements have both an opening and a closing tag, which determine the span of the element. In the previous simple website, we see a head element that goes from the opening tag `<head>` up to the closing tag `</head>`. Given that an element can be inside another element, an HTML document has a tree structure, where every element is a node that can contain child nodes, as the following image shows.
5859

60+
![The Document Object Model (DOM) that represents an HTML document with a tree structure. Source: Wikipedia. Author: Birger Eriksson](https://upload.wikimedia.org/wikipedia/commons/5/5a/DOM-model.svg){alt="Diagram of the Document Object Model (DOM) showing an HTML document as a tree of nodes"}
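The tree structure described above can be made visible with a few lines of Python. The following is a minimal sketch, not part of the lesson itself: it uses the standard library's `html.parser` module, and the `TreePrinter` class, the `VOID_ELEMENTS` set, and the shortened `sample` document are names and data assumed here for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase the depth.
# (A partial list, chosen for this illustration.)
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TreePrinter(HTMLParser):
    """Collects one indented line per element to show how elements nest."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the element at the current depth, then descend into it.
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag ends the element, so go back up one level.
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


# A shortened version of the simple website (hypothetical abbreviation).
sample = """<!DOCTYPE html>
<html>
<head>
<title>Sample web page</title>
</head>
<body>
<h1 id="head1">h1 Header #1</h1>
<p>This is a paragraph tag</p>
</body>
</html>"""

printer = TreePrinter()
printer.feed(sample)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting: `head` and `body` are children of `html`, and `title`, `h1`, and `p` are grandchildren.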
5961

62+
Finally, we can define or modify the behavior, appearance, or functionality of an element by using attributes. Attributes are inside the opening tag, and consist of a name and a value, formatted as `name="value"`. For example, we can give a unique id to any element using the `id` attribute.
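To see how attributes travel with the opening tag, here is a small sketch that reads the `id` attributes added to the headers of the simple website. It relies only on Python's standard `html.parser` module; the `IdCollector` class and the `fragment` string are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Maps each id value to the tag name of the element that carries it."""

    def __init__(self):
        super().__init__()
        self.ids = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the opening tag.
        for name, value in attrs:
            if name == "id":
                self.ids[value] = tag


# The three headers from the simple website, with their id attributes.
fragment = """
<h1 id="head1">h1 Header #1</h1>
<h2 id="subhead1">h2 Sub-header</h2>
<h1 id="head2">h1 Header #2</h1>
"""

collector = IdCollector()
collector.feed(fragment)
collector.close()
print(collector.ids)
```

Because every `id` is meant to be unique within a document, a dictionary keyed by id is a reasonable structure for looking up a single element later.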
63+
64+
Here is a non-exhaustive list of elements you'll find in HTML and their purpose:
65+
66+
- `<html>...</html>` The root, which contains the entirety of the document.
67+
- `<head>...</head>` Contains metadata, for example, the title that the web browser displays.
68+
- `<body>...</body>` The content that is going to be displayed.
69+
- `<h1>...</h1>, <h2>...</h2>, <h3>...</h3>` Defines headers of level 1, 2, 3, etc.
70+
- `<p>...</p>` A paragraph.
71+
- `<a href="">...</a>` Creates a hyperlink, and we provide the destination URL with the `href` attribute.
72+
- `<img src="" alt="">` Embeds an image, giving a source to the image with the `src` attribute and specifying alternate text with `alt`.
73+
- `<table>...</table>, <th>...</th>, <tr>...</tr>, <td>...</td>` Define a table, whose rows are defined with `tr` and contain header cells (`th`) and data cells (`td`).
74+
- `<div>...</div>` Is used to group sections of HTML content.
75+
- `<script>...</script>` Embeds or references JavaScript code.
76+
77+
To summarize, an *element* is identified by *tags*, and we can assign properties to an element by using *attributes*. Knowing this about HTML will make our lives easier when trying to get some specific data from a website.
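As an illustration of how elements, tags, and attributes combine when pulling specific data out of a page, the sketch below collects the text and destination of each hyperlink in the last paragraph of the simple website. It is one possible approach using only the standard library's `html.parser`; the `LinkCollector` class and the `paragraph` variable are assumptions made for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # pieces of text seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# The paragraph with two hyperlinks from the simple website.
paragraph = """<p>
This other paragraph has two hyperlinks,
one to <a href="https://carpentries.org/">The Carpentries homepage</a>,
and another to the
<a href="https://carpentries.org/past_workshops/">past workshops</a> page.
</p>"""

collector = LinkCollector()
collector.feed(paragraph)
collector.close()
for text, href in collector.links:
    print(f"{text}: {href}")
```

The element (`a`) tells us where to look, the tags delimit the link text, and the `href` attribute supplies the URL.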
78+
79+
80+
## Parsing HTML with BeautifulSoup
81+
82+
Now that we know how a website is structured, we can start extracting information from it.
6083

61-
## Introduction
6284

6385
This is a lesson created via The Carpentries Workbench. It is written in
6486
[Pandoc-flavored Markdown](https://pandoc.org/MANUAL.txt) for static files and

notebooks/ep1.ipynb

Lines changed: 0 additions & 9 deletions
@@ -18,15 +18,6 @@
1818
"import pandas as pd"
1919
]
2020
},
21-
{
22-
"cell_type": "markdown",
23-
"metadata": {},
24-
"source": [
25-
"Simple example of an HTML website\n",
26-
"\n",
27-
"HTML tags (h1, h2, ..., a, b, img) and attributes (href, id, class)"
28-
]
29-
},
3021
{
3122
"cell_type": "code",
3223
"execution_count": 8,
