Commit 0d1b72d

Merge pull request #539 from datacarpentry/caesoma
Caesoma - lesson 3 proposal
2 parents 4cd25b5 + 2ca04b4 commit 0d1b72d

2 files changed

Lines changed: 34 additions & 9 deletions

_episodes/03-index-slice-subset.md

Lines changed: 18 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -118,7 +118,7 @@ the related Python data type dictionary).
118118
> the names of built-in data structures and methods. For example, a _list_ is a built-in
119119
> data type. It is possible to use the word 'list' as an identifier for a new object,
120120
> for example `list = ['apples', 'oranges', 'bananas']`. However, you would then
121-
> be unable to create an empty list using `list()` or convert a tuple to a
121+
> be unable to create an empty list using `list()` or convert a tuple to a
122122
> list using `list(sometuple)`.
123123
{: .callout}
124124

@@ -364,22 +364,38 @@ gives the **output**
364364
Remember that Python indexing begins at 0. So, the index location [2, 6]
365365
selects the element that is 3 rows down and 7 columns over in the DataFrame.
366366
367+
It is worth noting that rows are selected when using `loc` with a single list of
368+
labels (or `iloc` with a single list of integers). However, unlike `loc` or `iloc`,
369+
indexing a data frame directly with labels will select columns (e.g.
370+
`surveys_df[['species_id', 'plot_id', 'weight']]`), while ranges of integers will
371+
select rows (e.g. `surveys_df[0:13]`). Direct indexing of rows is redundant with
372+
using `iloc`, and will raise a `KeyError` if a single integer or list is used; the
373+
error will also occur if index labels are used without `loc` (or column labels used
374+
with it).
375+
A useful rule of thumb is the following: integer-based slicing is best done with
376+
`iloc` and will avoid errors (and is generally consistent with indexing of NumPy
377+
arrays), label-based slicing of rows is done with `loc`, and slicing of columns is done by
378+
directly indexing column names.
367379
368380
369381
> ## Challenge - Range
370382
>
371383
> 1. What happens when you execute:
372384
>
373385
> - `surveys_df[0:1]`
386+
> - `surveys_df[0]`
374387
> - `surveys_df[:4]`
375388
> - `surveys_df[:-1]`
376389
>
377390
> 2. What happens when you call:
378391
>
392+
> - `surveys_df.iloc[0:1]`
393+
> - `surveys_df.iloc[0]`
394+
> - `surveys_df.iloc[:4, :]`
379395
> - `surveys_df.iloc[0:4, 1:4]`
380396
> - `surveys_df.loc[0:4, 1:4]`
381397
>
382-
> - How are the two commands different?
398+
> - How are the last two commands different?
383399
{: .challenge}
384400
385401
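
The rule of thumb added in this hunk can be illustrated with a minimal sketch. It assumes a small, hypothetical stand-in for `surveys_df` (the lesson itself loads the data from a CSV file), with string column names and the default integer row index; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; values are invented for illustration.
surveys_df = pd.DataFrame(
    {
        "species_id": ["NL", "NL", "DM", "DM", "PF"],
        "plot_id": [2, 3, 2, 7, 3],
        "weight": [40.0, 48.0, 29.0, 46.0, 7.0],
    }
)

# Direct indexing with a list of labels selects columns.
columns = surveys_df[["species_id", "weight"]]

# Direct indexing with an integer slice selects rows by position.
first_rows = surveys_df[0:3]

# A single integer is looked up as a column label, so this raises KeyError
# here because no column is named 0.
try:
    surveys_df[0]
    single_int_error = None
except KeyError as err:
    single_int_error = type(err).__name__

# iloc is purely positional; the end of each slice is excluded.
by_position = surveys_df.iloc[0:3, 1:3]

# loc is label-based; the end of the row slice is included.
by_label = surveys_df.loc[0:3, ["plot_id", "weight"]]

print(columns.shape, first_rows.shape, single_int_error)
print(by_position.shape, by_label.shape)
```

With the default integer index, `loc[0:3]` and `iloc[0:3]` look alike but return different numbers of rows, which is one reason to keep positional and label-based selection separate.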

_extras/guide.md

Lines changed: 16 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -192,19 +192,28 @@ previous steps visible.
192192

193193
* What happens when you execute:
194194

195-
`surveys_df[0:3]`
196-
`surveys_df[0:1]` slicing only the first element
197-
`surveys_df[:5]` slicing from first element makes 0 redundant
198-
`surveys_df[-1:]` you can count backwards
195+
- `surveys_df[0:3]`
196+
- `surveys_df[0]` results in a 'KeyError', since a single integer is looked up as a column label; use `iloc` to select a row by position
197+
- `surveys_df[0:1]` slicing only the first element
198+
- `surveys_df[:5]` slicing from first element makes 0 redundant
199+
- `surveys_df[-1:]` you can count backwards
199200

200201
*Suggestion*: You can also select every Nth row: `surveys_df[1:10:2]`. So, how to interpret
201202
`surveys_df[::-1]`?
202203

204+
* What happens when you call:
205+
206+
- `surveys_df.iloc[0:1]` returns the first row
207+
- `surveys_df.iloc[0]` returns the first row as a Series (similar to a named list)
208+
- `surveys_df.iloc[:4, :]` returns all columns of the first four rows
209+
- `surveys_df.iloc[0:4, 1:4]` selects specified columns of the first four rows
210+
- `surveys_df.loc[0:4, 1:4]` results in a 'TypeError'
211+
203212
* What is the difference between `surveys_df.iloc[0:4, 1:4]` and `surveys_df.loc[0:4, 1:4]`?
204213

205-
Check the position, or the name. Cfr. the second is like it would be in a dictionary, asking for
206-
the key-names. Column names 1:4 do not exist, resulting in an error. Check also the difference
207-
between `surveys_df.loc[0:4]` and `surveys_df.iloc[0:4]`
214+
While `iloc` uses integers as indices and slices accordingly, `loc` works with labels. It is
215+
like accessing values from a dictionary, asking for the key names. Column names 1:4 do not exist,
216+
resulting in an error. Check also the difference between `surveys_df.loc[0:4]` and `surveys_df.iloc[0:4]`.
208217

209218
### Advanced Selection Challenges
210219
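
The instructor notes in this hunk on `iloc` versus `loc` can be checked with a short sketch. It again assumes a hypothetical stand-in for `surveys_df` with string column names and the default integer index; the `TypeError` matches what the guide states, though the exact message may vary between pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; values are invented for illustration.
surveys_df = pd.DataFrame(
    {
        "record_id": [1, 2, 3, 4, 5, 6],
        "species_id": ["NL", "NL", "DM", "DM", "PF", "PE"],
        "plot_id": [2, 3, 2, 7, 3, 1],
        "weight": [40.0, 48.0, 29.0, 46.0, 7.0, 22.0],
    }
)

# A single integer with iloc returns one row as a Series ...
first_row = surveys_df.iloc[0]

# ... while a slice of length one returns a one-row DataFrame.
first_row_frame = surveys_df.iloc[0:1]

# Positional slice: rows 0-3 and columns 1-3 (end positions excluded).
positional = surveys_df.iloc[0:4, 1:4]

# Label slice on columns fails: the column labels are strings, not 1..4.
try:
    surveys_df.loc[0:4, 1:4]
    loc_error = None
except TypeError as err:
    loc_error = type(err).__name__

# On rows, loc includes the end label and iloc excludes the end position.
n_loc_rows = len(surveys_df.loc[0:4])
n_iloc_rows = len(surveys_df.iloc[0:4])

print(type(first_row).__name__, first_row_frame.shape, positional.shape)
print(loc_error, n_loc_rows, n_iloc_rows)
```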
