Commit 7219b49

02-index-slice-subset: fix metadata & code blocks
1 parent bf5432d commit 7219b49

1 file changed

Lines changed: 85 additions & 59 deletions

_episodes/02-index-slice-subset.md

@@ -3,18 +3,20 @@ title: Indexing, Slicing and Subsetting DataFrames in Python
33
teaching: 30
44
exercises: 30
55
questions:
6-
- " How can I access specific data within my data set? "
7-
- " How can Python and Pandas help me to analyse my data?"
6+
- "How can I access specific data within my data set?"
7+
- "How can Python and Pandas help me to analyse my data?"
88
objectives:
9-
- Describe what 0-based indexing is.
10-
- Manipulate and extract data using column headings and index locations.
11-
- Employ slicing to select sets of data from a DataFrame.
12-
- Employ label and integer-based indexing to select ranges of data in a dataframe.
13-
- Reassign values within subsets of a DataFrame.
14-
- Create a copy of a DataFrame.
9+
- "Describe what 0-based indexing is."
10+
- "Manipulate and extract data using column headings and index locations."
11+
- "Employ slicing to select sets of data from a DataFrame."
12+
- "Employ label and integer-based indexing to select ranges of data in a dataframe."
13+
- "Reassign values within subsets of a DataFrame."
14+
- "Create a copy of a DataFrame."
1515
- "Query/select a subset of data using a set of criteria and the following operators: ==, !=, >, <, >=, <=."
16-
- Locate subsets of data using masks.
17-
- Describe BOOLEAN objects in Python and manipulate data using BOOLEANs.
16+
- "Locate subsets of data using masks."
17+
- "Describe BOOLEAN objects in Python and manipulate data using BOOLEANs."
18+
keypoints:
19+
- "FIXME"
1820
---
1921

2022
In lesson 01, we read a CSV into a Python pandas DataFrame. We learned:
@@ -36,13 +38,14 @@ using:
3638
We will continue to use the surveys dataset that we worked with in the last
3739
lesson. Let's reopen and read in the data again:
3840

39-
```python
41+
~~~
4042
# Make sure pandas is loaded
4143
import pandas as pd
4244
4345
# Read in the survey CSV
4446
surveys_df = pd.read_csv("data/surveys.csv")
45-
```
47+
~~~
48+
{: .language-python}
4649

4750
## Indexing and Slicing in Python
4851

@@ -57,30 +60,32 @@ We use square brackets `[]` to select a subset of a Python object. For example,
5760
we can select all data from a column named `species_id` from the `surveys_df`
5861
DataFrame by name. There are two ways to do this:
5962

60-
```python
63+
~~~
6164
# TIP: use the .head() method we saw earlier to make output shorter
6265
# Method 1: select a 'subset' of the data using the column name
6366
surveys_df['species_id']
6467
6568
# Method 2: use the column name as an 'attribute'; gives the same output
6669
surveys_df.species_id
67-
```
70+
~~~
71+
{: .language-python}
6872

6973
We can also create a new object that contains only the data within the
7074
`species_id` column as follows:
7175

72-
```python
76+
~~~
7377
# Creates an object, surveys_species, that only contains the `species_id` column
7478
surveys_species = surveys_df['species_id']
75-
```
79+
~~~
80+
{: .language-python}
7681

7782
We can pass a list of column names too, as an index to select columns in that
7883
order. This is useful when we need to reorganize our data.
7984

8085
**NOTE:** If a column name is not contained in the DataFrame, an exception
8186
(error) will be raised.
8287

83-
```python
88+
~~~
8489
# Select the species and plot columns from the DataFrame
8590
surveys_df[['species_id', 'plot_id']]
8691
@@ -89,7 +94,8 @@ surveys_df[['plot_id', 'species_id']]
8994
9095
# What happens if you ask for a column that doesn't exist?
9196
surveys_df['speciess']
92-
```
97+
~~~
98+
{: .language-python}
9399

94100
Python tells us what type of error it is in the traceback: at the bottom it says `KeyError: 'speciess'`, which means that `speciess` is not a column name (or key, in the related Python dictionary data type).
95101
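
If a script should keep running when a column name is misspelled, the `KeyError` can be caught. The sketch below assumes a small made-up DataFrame in place of `surveys_df`; `demo_df` and `select_column` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df, holding two of its columns
demo_df = pd.DataFrame({"species_id": ["NL", "DM", "PF"], "plot_id": [2, 3, 2]})


def select_column(df, name):
    """Return the named column, or None if the DataFrame has no such column."""
    try:
        return df[name]
    except KeyError:
        # Report the problem and list the names that do exist
        print(f"No column named {name!r}; available columns: {list(df.columns)}")
        return None


species = select_column(demo_df, "species_id")  # existing column
missing = select_column(demo_df, "speciess")    # misspelled column
```

Catching the error is optional; while exploring data interactively, the traceback is usually the quickest way to spot the typo.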

@@ -102,10 +108,11 @@ indexing. This means that the first element in an object is located at position
102108
0. This is different from other tools like R and Matlab that index elements
103109
within objects starting at 1.
104110

105-
```python
111+
~~~
106112
# Create a list of numbers:
107113
a = [1, 2, 3, 4, 5]
108-
```
114+
~~~
115+
{: .language-python}
109116

110117
![indexing diagram](../fig/slicing-indexing.png)
111118
![slicing diagram](../fig/slicing-slicing.png)
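
As a small sketch of what the diagrams show, the list `a` can be indexed and sliced as follows. The variable names other than `a` are illustrative only.

```python
# Create a list of numbers
a = [1, 2, 3, 4, 5]

# Indexing is 0-based: position 0 holds the first element
first = a[0]

# Negative positions count from the end of the list
last = a[-1]

# A slice includes the start bound but excludes the stop bound
first_three = a[0:3]
middle = a[1:4]

print(first, last, first_three, middle)
```

Here `a[0:3]` returns three elements (positions 0, 1 and 2), because the stop bound is one step beyond the last element selected.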
@@ -143,22 +150,24 @@ DataFrame. To slice out a set of rows, you use the following syntax:
143150
output. The stop bound is one step BEYOND the row you want to select. So if you
144151
want to select rows 0, 1 and 2 your code would look like this:
145152
146-
```python
153+
~~~
147154
# Select rows 0, 1, 2 (row 3 is not selected)
148155
surveys_df[0:3]
149-
```
156+
~~~
157+
{: .language-python}
150158
151159
The stop bound in Python is different from what you might be used to in
152160
languages like Matlab and R.
153161
154-
```python
162+
~~~
155163
# Select the first 5 rows (rows 0, 1, 2, 3, 4)
156164
surveys_df[:5]
157165
158166
# Select the last row of the DataFrame
159167
# (the slice starts at the last row, and ends at the end of the DataFrame)
160168
surveys_df[-1:]
161-
```
169+
~~~
170+
{: .language-python}
162171
163172
We can also reassign values within subsets of our DataFrame.
164173
@@ -169,13 +178,14 @@ copying objects and the concept of referencing objects in Python.
169178
170179
Let's start with an example:
171180
172-
```python
181+
~~~
173182
# Using the 'copy()' method
174183
true_copy_surveys_df = surveys_df.copy()
175184
176185
# Using the '=' operator
177186
ref_surveys_df = surveys_df
178-
```
187+
~~~
188+
{: .language-python}
179189
180190
You might think that the code `ref_surveys_df = surveys_df` creates a fresh
181191
distinct copy of the `surveys_df` DataFrame object. However, using the `=`
@@ -190,20 +200,22 @@ DataFrame.
190200
Let's look at what happens when we reassign the values within a subset of the
191201
DataFrame that references another DataFrame object:
192202
193-
```python
203+
~~~
194204
# Assign the value `0` to the first three rows of data in the DataFrame
195205
ref_surveys_df[0:3] = 0
196-
```
206+
~~~
207+
{: .language-python}
197208
198209
Let's try the following code:
199210
200-
```python
211+
~~~
201212
# ref_surveys_df was created using the '=' operator
202213
ref_surveys_df.head()
203214
204215
# surveys_df is the original dataframe
205216
surveys_df.head()
206-
```
217+
~~~
218+
{: .language-python}
207219
208220
What is the difference between these two dataframes?
209221
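
One way to see the difference is to repeat the experiment on a small made-up DataFrame rather than the survey data. The names `original`, `reference` and `true_copy` are hypothetical and stand in for `surveys_df`, `ref_surveys_df` and `true_copy_surveys_df`.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df
original = pd.DataFrame({"record_id": [1, 2, 3, 4], "weight": [40, 48, 29, 46]})

# copy() builds a new, independent DataFrame
true_copy = original.copy()

# '=' only binds a second name to the same object
reference = original

# Overwrite the first two rows through the second name
reference[0:2] = 0

print(reference is original)  # both names point at one object
print(original)               # the first two rows are now 0 here as well
print(true_copy)              # the independent copy keeps the old values
```

Because `reference` and `original` are two names for one object, the assignment shows up under both names, while `true_copy` is unaffected.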
@@ -230,9 +242,10 @@ the other will see the same changes to the reference object.
230242
Okay, that's enough of that. Let's create a brand new clean dataframe from
231243
the original data CSV file.
232244
233-
```python
245+
~~~
234246
surveys_df = pd.read_csv("data/surveys.csv")
235-
```
247+
~~~
248+
{: .language-python}
236249
237250
## Slicing Subsets of Rows and Columns in Python
238251
@@ -247,10 +260,11 @@ To select a subset of rows **and** columns from our DataFrame, we can use the
247260
`iloc` method. For example, we can select month, day and year (columns 2, 3
248261
and 4 if we start counting at 1), like this:
249262
250-
```python
263+
~~~
251264
# iloc[row slicing, column slicing]
252265
surveys_df.iloc[0:3, 1:4]
253-
```
266+
~~~
267+
{: .language-python}
254268
255269
which gives the **output**
256270
@@ -267,7 +281,7 @@ ask for 0:3, you are actually telling Python to start at index 0 and select rows
267281
268282
Let's explore some other ways to index and select subsets of data:
269283
270-
```python
284+
~~~
271285
# Select all columns for rows of index values 0 and 10
272286
surveys_df.loc[[0, 10], :]
273287
@@ -276,7 +290,8 @@ surveys_df.loc[0, ['species_id', 'plot_id', 'weight']]
276290
277291
# What happens when you type the code below?
278292
surveys_df.loc[[0, 10, 35549], :]
279-
```
293+
~~~
294+
{: .language-python}
280295
281296
**NOTE**: Labels must be found in the DataFrame or you will get a `KeyError`.
282297
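
A minimal sketch of how label-based and position-based selection differ, using a made-up five-row DataFrame with the default integer index (`demo_df` and the other names are illustrative, not part of the lesson data):

```python
import pandas as pd

# Hypothetical five-row stand-in for surveys_df; index labels are 0..4
demo_df = pd.DataFrame(
    {
        "species_id": ["NL", "DM", "PF", "PE", "DS"],
        "weight": [40.0, 48.0, 7.0, 22.0, 120.0],
    }
)

# iloc is position based: the stop bound 4 is excluded (positions 1, 2, 3)
by_position = demo_df.iloc[1:4]

# loc is label based: the stop label 4 is included (labels 1, 2, 3, 4)
by_label = demo_df.loc[1:4]

# Asking loc for a label that is not in the index raises a KeyError
try:
    demo_df.loc[[0, 10], :]
    missing_label_error = False
except KeyError:
    missing_label_error = True

print(len(by_position), len(by_label), missing_label_error)
```

With the default index the labels and positions happen to coincide, which is why the inclusive stop bound of `loc` is easy to overlook.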
@@ -289,16 +304,18 @@ will get a different result than using `iloc` to select rows 1:4.
289304
We can also select a specific data value using a row and
290305
column location within the DataFrame and `iloc` indexing:
291306
292-
```python
307+
~~~
293308
# Syntax for iloc indexing to find a specific data element
294309
dat.iloc[row, column]
295-
```
310+
~~~
311+
{: .language-python}
296312
297313
In this `iloc` example,
298314
299-
```python
315+
~~~
300316
surveys_df.iloc[2, 6]
301-
```
317+
~~~
318+
{: .language-python}
302319
303320
gives the **output**
304321
@@ -333,13 +350,14 @@ selects the element that is 3 rows down and 7 columns over in the DataFrame.
333350
We can also select a subset of our data using criteria. For example, we can
334351
select all rows that have a year value of 2002:
335352
336-
```python
353+
~~~
337354
surveys_df[surveys_df.year == 2002]
338-
```
355+
~~~
356+
{: .language-python}
339357
340358
Which produces the following output:
341359
342-
```python
360+
~~~
343361
record_id month day year plot_id species_id sex hindfoot_length weight
344362
33320 33321 1 12 2002 1 DM M 38 44
345363
33321 33322 1 12 2002 1 DO M 37 58
@@ -354,19 +372,22 @@ record_id month day year plot_id species_id sex hindfoot_length weight
354372
35548 35549 12 31 2002 5 NaN NaN NaN NaN
355373
356374
[2229 rows x 9 columns]
357-
```
375+
~~~
376+
{: .language-python}
358377
359378
Or we can select all rows that do not contain the year 2002:
360379
361-
```python
380+
~~~
362381
surveys_df[surveys_df.year != 2002]
363-
```
382+
~~~
383+
{: .language-python}
364384
365385
We can define sets of criteria too:
366386
367-
```python
387+
~~~
368388
surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]
369-
```
389+
~~~
390+
{: .language-python}
370391
371392
### Python Syntax Cheat Sheet
372393
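
As a small sketch of how comparison operators build the criteria used for subsetting, the example below stores each condition as a Boolean Series before using it as an index. It assumes a made-up `demo_df` with only a `year` column; all names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df with only a year column
demo_df = pd.DataFrame({"year": [1977, 1980, 1983, 1985, 2002]})

# Each comparison yields a Boolean Series; '&' combines them element-wise.
# The parentheses are required because '&' binds tighter than '>=' and '<='.
in_range = (demo_df.year >= 1980) & (demo_df.year <= 1985)
subset = demo_df[in_range]

# '!=' keeps the rows where the condition does not match
not_2002 = demo_df[demo_df.year != 2002]

# '|' selects rows that satisfy either condition
either = demo_df[(demo_df.year < 1980) | (demo_df.year == 2002)]

print(in_range.tolist())
print(subset.year.tolist(), not_2002.year.tolist(), either.year.tolist())
```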
@@ -414,7 +435,7 @@ we also need to understand `BOOLEAN` objects in Python.
414435
415436
Boolean values are either `True` or `False`. For example,
416437
417-
```python
438+
~~~
418439
# Set x to 5
419440
x = 5
420441
@@ -423,7 +444,8 @@ x > 5
423444
424445
# How about this?
425446
x == 5
426-
```
447+
~~~
448+
{: .language-python}
427449
428450
When we ask Python what the value of `x > 5` is, we get `False`. This is
429451
because the condition, `x` is not greater than 5, is not met since `x` is equal
@@ -442,13 +464,14 @@ null (missing or NaN) data values. We can use the `isnull` method to do this.
442464
The `isnull` method will compare each cell with a null value. If an element
443465
has a null value, it will be assigned a value of `True` in the output object.
444466
445-
```python
467+
~~~
446468
pd.isnull(surveys_df)
447-
```
469+
~~~
470+
{: .language-python}
448471
449472
A snippet of the output is below:
450473
451-
```python
474+
~~~
452475
record_id month day year plot_id species_id sex hindfoot_length weight
453476
0 False False False False False False False False True
454477
1 False False False False False False False False True
@@ -457,26 +480,29 @@ A snippet of the output is below:
457480
4 False False False False False False False False True
458481
459482
[35549 rows x 9 columns]
460-
```
483+
~~~
484+
{: .language-python}
461485
462486
To select the rows where there are null values, we can use
463487
the mask as an index to subset our data as follows:
464488
465-
```python
489+
~~~
466490
# To select just the rows with NaN values, we can use the 'any()' method
467491
surveys_df[pd.isnull(surveys_df).any(axis=1)]
468-
```
492+
~~~
493+
{: .language-python}
469494
470495
Note that the `weight` column of our DataFrame contains many `null` or `NaN`
471496
values. We will explore ways of dealing with this in Lesson 03.
472497
473498
We can run `isnull` on a particular column too. What does the code below do?
474499
475-
```python
500+
~~~
476501
# What does this do?
477502
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
478503
print(empty_weights)
479-
```
504+
~~~
505+
{: .language-python}
480506
481507
Let's take a minute to look at the statement above. We are using the Boolean
482508
object `pd.isnull(surveys_df['weight'])` as an index to `surveys_df`. We are
@@ -488,7 +514,7 @@ asking Python to select rows that have a `NaN` value of weight.
488514
> 1. Create a new DataFrame that only contains observations with sex values that
489515
> are **not** female or male. Assign each sex value in the new DataFrame to a
490516
> new value of 'x'. Determine the number of null values in the subset.
491-
>
517+
>
492518
> 2. Create a new DataFrame that contains only observations that are of sex male
493519
> or female and where weight values are greater than 0. Create a stacked bar
494520
> plot of average weight by plot with male vs female values stacked for each
