Merge pull request #539 from datacarpentry/caesoma

LilithElina · web-flow · commit 0d1b72dd6007 · 2023-04-11T20:22:04.000+02:00
Caesoma - lesson 3 proposal
diff --git a/_episodes/03-index-slice-subset.md b/_episodes/03-index-slice-subset.md
@@ -118,7 +118,7 @@ the related Python data type dictionary).
 > the names of built-in data structures and methods. For example, a _list_ is a built-in
 > data type. It is possible to use the word 'list' as an identifier for a new object,
 > for example `list = ['apples', 'oranges', 'bananas']`. However, you would then
-> be unable to create an empty list using `list()` or convert a tuple to a 
+> be unable to create an empty list using `list()` or convert a tuple to a
 > list using `list(sometuple)`.
 {: .callout}
 
@@ -364,22 +364,38 @@ gives the **output**
 Remember that Python indexing begins at 0. So, the index location [2, 6]
 selects the element that is 3 rows down and 7 columns over in the DataFrame.
 
+It is worth noting that rows are selected when using `loc` with a single list of
+labels (or `iloc` with a single list of integers). However, unlike `loc` or `iloc`,
+indexing a data frame directly with a list of labels will select columns (e.g.
+`surveys_df[['species_id', 'plot_id', 'weight']]`), while ranges of integers will
+select rows (e.g. `surveys_df[0:13]`). Direct indexing of rows is redundant with
+using `iloc`, and will raise a `KeyError` if a single integer or list of integers
+is used; the error will also occur if index labels are used without `loc` (or
+column labels are used with it).
+A useful rule of thumb is the following: integer-based slicing is best done with
+`iloc` and will avoid errors (and is generally consistent with indexing of NumPy
+arrays), label-based slicing of rows is done with `loc`, and slicing of columns is
+done by directly indexing with column names.
 
 
 > ## Challenge - Range
 >
 > 1. What happens when you execute:
 >
 >    - `surveys_df[0:1]`
+>    - `surveys_df[0]`
 >    - `surveys_df[:4]`
 >    - `surveys_df[:-1]`
 >
 > 2. What happens when you call:
 >
+>    - `surveys_df.iloc[0:1]`
+>    - `surveys_df.iloc[0]`
+>    - `surveys_df.iloc[:4, :]`
 >    - `surveys_df.iloc[0:4, 1:4]`
 >    - `surveys_df.loc[0:4, 1:4]`
 >
-> - How are the two commands different?
+> - How are the last two commands different?
 {: .challenge}
 
 
diff --git a/_extras/guide.md b/_extras/guide.md
@@ -192,19 +192,28 @@ previous steps visible.
 
 * What happens when you execute:
 
-	`surveys_df[0:3]`
-	`surveys_df[0:1]` slicing only the first element
-	`surveys_df[:5]` slicing from first element makes 0 redundant
-	`surveys_df[-1:]` you can count backwards
+	- `surveys_df[0:3]`
+	- `surveys_df[0]` results in a `KeyError`, since a single integer is treated as a column label; use `iloc` to select a row by position
+	- `surveys_df[0:1]` slicing only the first element
+	- `surveys_df[:5]` slicing from first element makes 0 redundant
+	- `surveys_df[-1:]` you can count backwards
 
   *Suggestion*: You can also select every Nth row: `surveys_df[1:10:2]`. So, how to interpret
   `surveys_df[::-1]`?
 
+* What happens when you call:
+
+  - `surveys_df.iloc[0:1]` returns the first row as a one-row DataFrame
+  - `surveys_df.iloc[0]` returns the first row as a Series indexed by column name
+  - `surveys_df.iloc[:4, :]` returns all columns of the first four rows
+  - `surveys_df.iloc[0:4, 1:4]` selects the second to fourth columns of the first four rows
+  - `surveys_df.loc[0:4, 1:4]` results in a `TypeError`
+
 * What is the difference between `surveys_df.iloc[0:4, 1:4]` and `surveys_df.loc[0:4, 1:4]`?
 
-  Check the position, or the name. Cfr. the second is like it would be in a dictionary, asking for
-  the key-names. Column names 1:4 do not exist, resulting in an error. Check also the difference
-  between `surveys_df.loc[0:4]` and `surveys_df.iloc[0:4]`
+  While `iloc` uses integers as indices and slices accordingly, `loc` works with labels. It is
+  like accessing values from a dictionary, asking for the key names. Column names 1:4 do not exist,
+  resulting in an error. Check also the difference between `surveys_df.loc[0:4]` (five rows, since label slices include the end label) and `surveys_df.iloc[0:4]` (four rows).
 
 ### Advanced Selection Challenges