Commit 20bda05

04-merging-data: fix metadata & code blocks
1 parent 76fef9d commit 20bda05

1 file changed

Lines changed: 89 additions & 77 deletions

_episodes/04-merging-data.md
@@ -3,13 +3,15 @@ title: Combining DataFrames with Pandas
33
teaching: 20
44
exercises: 25
55
questions:
6-
- " Can I work with data from multiple sources? "
7-
- " How can I combine data from different data sets? "
6+
- "Can I work with data from multiple sources?"
7+
- "How can I combine data from different data sets?"
88
objectives:
9-
- Combine data from multiple files into a single DataFrame using merge and concat.
10-
- Combine two DataFrames using a unique ID found in both DataFrames.
11-
- Employ `to_csv` to export a DataFrame in CSV format.
12-
- Join DataFrames using common fields (join keys).
9+
- "Combine data from multiple files into a single DataFrame using merge and concat."
10+
- "Combine two DataFrames using a unique ID found in both DataFrames."
11+
- "Employ `to_csv` to export a DataFrame in CSV format."
12+
- "Join DataFrames using common fields (join keys)."
13+
keypoints:
14+
- "FIXME"
1315
---
1416

1517
In many "real world" situations, the data that we want to use come in multiple
@@ -21,7 +23,7 @@ DataFrames](http://pandas.pydata.org/pandas-docs/stable/merging.html) including
2123
To work through the examples below, we first need to load the species and
2224
surveys files into pandas DataFrames. In IPython:
2325

24-
```python
26+
~~~
2527
import pandas as pd
2628
surveys_df = pd.read_csv("data/surveys.csv",
2729
keep_default_na=False, na_values=[""])
@@ -59,7 +61,8 @@ species_df
5961
53 ZM Zenaida macroura Bird
6062
6163
[54 rows x 4 columns]
62-
```
64+
~~~
65+
{: .language-python}
6366

6467
Take note that the `read_csv` method we used can take some additional options which
6568
we didn't use previously. Many functions in python have a set of options that
@@ -73,15 +76,16 @@ We can use the `concat` function in pandas to append either columns or rows from
7376
one DataFrame to another. Let's grab two subsets of our data to see how this
7477
works.
7578

76-
```python
79+
~~~
7780
# Read in first 10 lines of surveys table
7881
survey_sub = surveys_df.head(10)
7982
# Grab the last 10 rows
8083
survey_sub_last10 = surveys_df.tail(10)
8184
# Reset the index values so the second dataframe appends properly
8285
survey_sub_last10=survey_sub_last10.reset_index(drop=True)
8386
# drop=True option avoids adding new index column with old index values
84-
```
87+
~~~
88+
{: .language-python}
8589

8690
When we concatenate DataFrames, we need to specify the axis. `axis=0` tells
8791
pandas to stack the second DataFrame under the first one. It will automatically
@@ -92,13 +96,14 @@ same columns and associated column format in both datasets. When we stack
9296
horizontally, we want to make sure what we are doing makes sense (i.e., the data are
9397
related in some way).
9498

95-
```python
99+
~~~
96100
# Stack the DataFrames on top of each other
97101
vertical_stack = pd.concat([survey_sub, survey_sub_last10], axis=0)
98102
99103
# Place the DataFrames side by side
100104
horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis=1)
101-
```
105+
~~~
106+
{: .language-python}
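
As a minimal sketch of what the two `axis` values do, the example below uses two tiny made-up DataFrames (`top` and `bottom` are hypothetical stand-ins, not the lesson's survey data) and compares the shapes of the results:

```python
import pandas as pd

# Hypothetical two-row frames standing in for survey_sub and survey_sub_last10
top = pd.DataFrame({"record_id": [1, 2], "species_id": ["NL", "DM"]})
bottom = pd.DataFrame({"record_id": [3, 4], "species_id": ["PE", "PF"]})

# axis=0 stacks rows: same columns, more rows
vertical = pd.concat([top, bottom], axis=0)

# axis=1 places the frames side by side, aligning on the row index
horizontal = pd.concat([top, bottom], axis=1)

print(vertical.shape)    # (4, 2)
print(horizontal.shape)  # (2, 4)
```

Stacking vertically doubles the number of rows, while stacking horizontally doubles the number of columns and leaves the row count unchanged.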
102107

103108
### Row Index Values and Concat
104109
Have a look at the `vertical_stack` dataframe. Notice anything unusual?
@@ -113,19 +118,21 @@ save it to a different folder by adding the foldername and a slash to the file
113118
`vertical_stack.to_csv('foldername/out.csv')`. We use `index=False` so that
114119
pandas doesn't include the index number for each line.
115120

116-
```python
121+
~~~
117122
# Write DataFrame to CSV
118123
vertical_stack.to_csv('data_output/out.csv', index=False)
119-
```
124+
~~~
125+
{: .language-python}
120126

121127
Check out your working directory to make sure the CSV wrote out properly, and
122128
that you can open it! If you want, try to bring it back into python to make sure
123129
it imports properly.
124130

125-
```python
131+
~~~
126132
# For kicks read our output back into python and make sure all looks good
127133
new_output = pd.read_csv('data_output/out.csv', keep_default_na=False, na_values=[""])
128-
```
134+
~~~
135+
{: .language-python}
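
The round trip can also be tried on a small made-up DataFrame. This is only a sketch: the data, the temporary directory, and the names `df`, `out_path`, and `round_trip` are assumptions for illustration and are not part of the lesson files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data with one missing weight
df = pd.DataFrame({"record_id": [1, 2], "weight": [40.0, None]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "out.csv")
    # index=False keeps the row index out of the file
    df.to_csv(out_path, index=False)
    # Read it back with the same NA handling used elsewhere in the lesson
    round_trip = pd.read_csv(out_path, keep_default_na=False, na_values=[""])

print(list(round_trip.columns))  # ['record_id', 'weight']
```

Because the file was written with `index=False`, no extra unnamed index column appears when the file is read back.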
129136

130137
> ## Challenge - Combine Data
131138
>
@@ -173,14 +180,15 @@ To better understand joins, let's grab the first 10 lines of our data as a
173180
subset to work with. We'll use the `.head` method to do this. We'll also read
174181
in a subset of the species table.
175182

176-
```python
183+
~~~
177184
# Read in first 10 lines of surveys table
178185
survey_sub = surveys_df.head(10)
179186
180187
# Import a small subset of the species data designed for this part of the lesson.
181188
# It is stored in the data folder.
182189
species_sub = pd.read_csv('data/speciesSubset.csv', keep_default_na=False, na_values=[""])
183-
```
190+
~~~
191+
{: .language-python}
184192

185193
In this example, `species_sub` is the lookup table containing genus, species, and
186194
taxa names that we want to join with the data in `survey_sub` to produce a new
@@ -197,7 +205,7 @@ the same name that also contain the same data. If we are less lucky, we need to
197205
identify a (differently-named) column in each DataFrame that contains the same
198206
information.
199207

200-
```python
208+
~~~
201209
>>> species_sub.columns
202210
203211
Index([u'species_id', u'genus', u'species', u'taxa'], dtype='object')
@@ -206,7 +214,8 @@ Index([u'species_id', u'genus', u'species', u'taxa'], dtype='object')
206214
207215
Index([u'record_id', u'month', u'day', u'year', u'plot_id', u'species_id',
208216
u'sex', u'hindfoot_length', u'weight'], dtype='object')
209-
```
217+
~~~
218+
{: .language-python}
210219

211220
In our example, the join key is the column containing the two-letter species
212221
identifier, which is called `species_id`.
@@ -230,40 +239,41 @@ page](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/) is below:
230239
![Inner join -- courtesy of codinghorror.com](../fig/inner-join.png)
231240

232241
The pandas function for performing joins is called `merge` and an Inner join is
233-
the default option:
242+
the default option:
234243

235-
```python
244+
~~~
236245
merged_inner = pd.merge(left=survey_sub,right=species_sub, left_on='species_id', right_on='species_id')
237246
# In this case `species_id` is the only column name in both dataframes, so if we skipped the `left_on`
238247
# and `right_on` arguments we would still get the same result
239248
240249
# What's the size of the output data?
241250
merged_inner.shape
242251
merged_inner
243-
```
252+
~~~
253+
{: .language-python}
244254

245255
**OUTPUT:**
246256

247257
```
248258
record_id month day year plot_id species_id sex hindfoot_length \
249-
0 1 7 16 1977 2 NL M 32
250-
1 2 7 16 1977 3 NL M 33
251-
2 3 7 16 1977 2 DM F 37
252-
3 4 7 16 1977 7 DM M 36
253-
4 5 7 16 1977 3 DM M 35
254-
5 8 7 16 1977 1 DM M 37
255-
6 9 7 16 1977 1 DM F 34
256-
7 7 7 16 1977 2 PE F NaN
257-
258-
weight genus species taxa
259-
0 NaN Neotoma albigula Rodent
260-
1 NaN Neotoma albigula Rodent
261-
2 NaN Dipodomys merriami Rodent
262-
3 NaN Dipodomys merriami Rodent
263-
4 NaN Dipodomys merriami Rodent
264-
5 NaN Dipodomys merriami Rodent
265-
6 NaN Dipodomys merriami Rodent
266-
7 NaN Peromyscus eremicus Rodent
259+
0 1 7 16 1977 2 NL M 32
260+
1 2 7 16 1977 3 NL M 33
261+
2 3 7 16 1977 2 DM F 37
262+
3 4 7 16 1977 7 DM M 36
263+
4 5 7 16 1977 3 DM M 35
264+
5 8 7 16 1977 1 DM M 37
265+
6 9 7 16 1977 1 DM F 34
266+
7 7 7 16 1977 2 PE F NaN
267+
268+
weight genus species taxa
269+
0 NaN Neotoma albigula Rodent
270+
1 NaN Neotoma albigula Rodent
271+
2 NaN Dipodomys merriami Rodent
272+
3 NaN Dipodomys merriami Rodent
273+
4 NaN Dipodomys merriami Rodent
274+
5 NaN Dipodomys merriami Rodent
275+
6 NaN Dipodomys merriami Rodent
276+
7 NaN Peromyscus eremicus Rodent
267277
```
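
To see on a smaller scale why an inner join can return fewer rows than the left DataFrame, here is a sketch with made-up data. The frames `surveys` and `species` are hypothetical, and `on='species_id'` is used as shorthand for passing the same name to both `left_on` and `right_on`.

```python
import pandas as pd

# Hypothetical survey records; "PF" has no entry in the lookup table
surveys = pd.DataFrame({"record_id": [1, 2, 3],
                        "species_id": ["NL", "DM", "PF"]})
species = pd.DataFrame({"species_id": ["NL", "DM"],
                        "genus": ["Neotoma", "Dipodomys"]})

# Inner join (the default): keep only rows whose key appears in both frames
inner = pd.merge(left=surveys, right=species, on="species_id")

print(inner.shape)                  # (2, 3)
print(inner["record_id"].tolist())  # [1, 2]
```

The record with `species_id` equal to `PF` is dropped because that key has no match in the lookup table.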
268278

269279
The result of an inner join of `survey_sub` and `species_sub` is a new DataFrame
@@ -313,37 +323,38 @@ have values for the join key(s) in the `left` DataFrame.
313323
A left join is performed in pandas by calling the same `merge` function used for
314324
inner join, but using the `how='left'` argument:
315325

316-
```python
326+
~~~
317327
merged_left = pd.merge(left=survey_sub,right=species_sub, how='left', left_on='species_id', right_on='species_id')
318328
319329
merged_left
320330
321331
**OUTPUT:**
322332
323333
record_id month day year plot_id species_id sex hindfoot_length \
324-
0 1 7 16 1977 2 NL M 32
325-
1 2 7 16 1977 3 NL M 33
326-
2 3 7 16 1977 2 DM F 37
327-
3 4 7 16 1977 7 DM M 36
328-
4 5 7 16 1977 3 DM M 35
329-
5 6 7 16 1977 1 PF M 14
330-
6 7 7 16 1977 2 PE F NaN
331-
7 8 7 16 1977 1 DM M 37
332-
8 9 7 16 1977 1 DM F 34
333-
9 10 7 16 1977 6 PF F 20
334-
335-
weight genus species taxa
336-
0 NaN Neotoma albigula Rodent
337-
1 NaN Neotoma albigula Rodent
338-
2 NaN Dipodomys merriami Rodent
339-
3 NaN Dipodomys merriami Rodent
340-
4 NaN Dipodomys merriami Rodent
341-
5 NaN NaN NaN NaN
342-
6 NaN Peromyscus eremicus Rodent
343-
7 NaN Dipodomys merriami Rodent
344-
8 NaN Dipodomys merriami Rodent
345-
9 NaN NaN NaN NaN
346-
```
334+
0 1 7 16 1977 2 NL M 32
335+
1 2 7 16 1977 3 NL M 33
336+
2 3 7 16 1977 2 DM F 37
337+
3 4 7 16 1977 7 DM M 36
338+
4 5 7 16 1977 3 DM M 35
339+
5 6 7 16 1977 1 PF M 14
340+
6 7 7 16 1977 2 PE F NaN
341+
7 8 7 16 1977 1 DM M 37
342+
8 9 7 16 1977 1 DM F 34
343+
9 10 7 16 1977 6 PF F 20
344+
345+
weight genus species taxa
346+
0 NaN Neotoma albigula Rodent
347+
1 NaN Neotoma albigula Rodent
348+
2 NaN Dipodomys merriami Rodent
349+
3 NaN Dipodomys merriami Rodent
350+
4 NaN Dipodomys merriami Rodent
351+
5 NaN NaN NaN NaN
352+
6 NaN Peromyscus eremicus Rodent
353+
7 NaN Dipodomys merriami Rodent
354+
8 NaN Dipodomys merriami Rodent
355+
9 NaN NaN NaN NaN
356+
~~~
357+
{: .language-python}
347358

348359
The result DataFrame from a left join (`merged_left`) looks very much like the
349360
result DataFrame from an inner join (`merged_inner`) in terms of the columns it
@@ -353,17 +364,18 @@ number of rows** as the original `survey_sub` DataFrame. When we inspect
353364
come from `species_sub` (i.e., `genus`, `species`, and `taxa`) is
354365
missing (they contain NaN values):
355366

356-
```python
367+
~~~
357368
merged_left[ pd.isnull(merged_left.genus) ]
358369
**OUTPUT:**
359370
record_id month day year plot_id species_id sex hindfoot_length \
360-
5 6 7 16 1977 1 PF M 14
361-
9 10 7 16 1977 6 PF F 20
371+
5 6 7 16 1977 1 PF M 14
372+
9 10 7 16 1977 6 PF F 20
362373
363-
weight genus species taxa
364-
5 NaN NaN NaN NaN
374+
weight genus species taxa
375+
5 NaN NaN NaN NaN
365376
9 NaN NaN NaN NaN
366-
```
377+
~~~
378+
{: .language-python}
367379

368380
These rows are the ones where the value of `species_id` from `survey_sub` (in this
369381
case, `PF`) does not occur in `species_sub`.
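
Another way to find such rows, sketched here on made-up data, is the `indicator=True` option of `merge`, which adds a `_merge` column recording where each row came from. The frames `surveys` and `species` below are hypothetical and much smaller than the lesson's data.

```python
import pandas as pd

# Hypothetical survey records; "PF" is missing from the lookup table
surveys = pd.DataFrame({"record_id": [1, 2, 3],
                        "species_id": ["NL", "DM", "PF"]})
species = pd.DataFrame({"species_id": ["NL", "DM"],
                        "genus": ["Neotoma", "Dipodomys"]})

# Left join: every row of the left frame is kept
left = pd.merge(left=surveys, right=species, how="left",
                on="species_id", indicator=True)

# Rows found only in the left frame have no match in the lookup table
unmatched = left[left["_merge"] == "left_only"]

print(len(left))                         # 3
print(unmatched["species_id"].tolist())  # ['PF']
```

This gives the same rows as filtering on `pd.isnull(...)`, but it does not rely on a lookup column such as `genus` never being legitimately empty.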
@@ -396,14 +408,14 @@ The pandas `merge` function supports two other join types:
396408
>
397409
> 1. In the data folder, there is a plot `CSV` that contains information about the
398410
> type associated with each plot. Use that data to summarize the number of
399-
> plots by plot type.
411+
> plots by plot type.
400412
> 2. Calculate a diversity index of your choice for control vs rodent exclosure
401-
> plots. The index should consider both species abundance and number of
402-
> species. You might choose to use the simple [biodiversity index described
403-
> here](http://www.amnh.org/explore/curriculum-collections/biodiversity-counts/plant-ecology/how-to-calculate-a-biodiversity-index)
404-
> which calculates diversity as:
413+
> plots. The index should consider both species abundance and number of
414+
> species. You might choose to use the simple [biodiversity index described
415+
> here](http://www.amnh.org/explore/curriculum-collections/biodiversity-counts/plant-ecology/how-to-calculate-a-biodiversity-index)
416+
> which calculates diversity as:
405417
>
406-
> the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
418+
> the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
407419
{: .challenge}
408420

409421
{% include links.md %}