@@ -3,13 +3,15 @@ title: Combining DataFrames with Pandas
33teaching : 20
44exercises : 25
55questions :
6- - " Can I work with data from multiple sources? "
7- - " How can I combine data from different data sets? "
6+ - " Can I work with data from multiple sources?"
7+ - " How can I combine data from different data sets?"
88objectives :
9- - Combine data from multiple files into a single DataFrame using merge and concat.
10- - Combine two DataFrames using a unique ID found in both DataFrames.
11- - Employ `to_csv` to export a DataFrame in CSV format.
12- - Join DataFrames using common fields (join keys).
9+ - " Combine data from multiple files into a single DataFrame using merge and concat."
10+ - " Combine two DataFrames using a unique ID found in both DataFrames."
11+ - " Employ `to_csv` to export a DataFrame in CSV format."
12+ - " Join DataFrames using common fields (join keys)."
13+ keypoints :
14+ - " FIXME"
1315---
1416
1517In many "real world" situations, the data that we want to use come in multiple
@@ -21,7 +23,7 @@ DataFrames](http://pandas.pydata.org/pandas-docs/stable/merging.html) including
2123To work through the examples below, we first need to load the species and
2224surveys files into pandas DataFrames. In IPython:
2325
24- ``` python
26+ ~~~
2527import pandas as pd
2628surveys_df = pd.read_csv("data/surveys.csv",
2729 keep_default_na=False, na_values=[""])
@@ -59,7 +61,8 @@ species_df
596153 ZM Zenaida macroura Bird
6062
6163[54 rows x 4 columns]
62- ```
64+ ~~~
65+ {: .language-python}
6366
6467Take note that the `read_csv` function we used can take some additional options which
6568we didn't use previously. Many functions in Python have a set of options that
@@ -73,15 +76,16 @@ We can use the `concat` function in pandas to append either columns or rows from
7376one DataFrame to another. Let's grab two subsets of our data to see how this
7477works.
7578
76- ``` python
79+ ~~~
7780# Read in first 10 lines of surveys table
7881survey_sub = surveys_df.head(10)
7982# Grab the last 10 rows
8083survey_sub_last10 = surveys_df.tail(10)
8184# Reset the index values so the second dataframe appends properly
8285survey_sub_last10=survey_sub_last10.reset_index(drop=True)
8386# drop=True option avoids adding new index column with old index values
84- ```
87+ ~~~
88+ {: .language-python}
8589
8690When we concatenate DataFrames, we need to specify the axis. ` axis=0 ` tells
8791pandas to stack the second DataFrame under the first one. It will automatically
@@ -92,13 +96,14 @@ same columns and associated column format in both datasets. When we stack
9296horizontally, we want to make sure what we are doing makes sense (i.e. the data are
9397related in some way).
9498
95- ``` python
99+ ~~~
96100# Stack the DataFrames on top of each other
97101vertical_stack = pd.concat([survey_sub, survey_sub_last10], axis=0)
98102
99103# Place the DataFrames side by side
100104horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis=1)
101- ```
105+ ~~~
106+ {: .language-python}
102107
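As a minimal sketch of what the `axis` argument changes, the example below uses two tiny made-up DataFrames (`top` and `bottom` are hypothetical names, not part of the lesson data) and compares the shapes of the results.

```python
import pandas as pd

# Two tiny, made-up DataFrames with identical columns (hypothetical data).
top = pd.DataFrame({"record_id": [1, 2], "species_id": ["NL", "DM"]})
bottom = pd.DataFrame({"record_id": [3, 4], "species_id": ["PE", "PF"]})

# axis=0 stacks rows: 2 + 2 rows, same 2 columns.
stacked_rows = pd.concat([top, bottom], axis=0)

# axis=1 places the frames side by side: 2 rows, 2 + 2 columns.
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked_rows.shape)   # (4, 2)
print(side_by_side.shape)   # (2, 4)
```

Checking `.shape` after a `concat` is a quick way to confirm that the data were stacked in the direction you intended.
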
103108### Row Index Values and Concat
104109Have a look at the `vertical_stack` dataframe. Notice anything unusual?
@@ -113,19 +118,21 @@ save it to a different folder by adding the foldername and a slash to the file
113118`vertical_stack.to_csv('foldername/out.csv')`. We use `index=False` so that
114119pandas doesn't include the index number for each line.
115120
116- ``` python
121+ ~~~
117122# Write DataFrame to CSV
118123vertical_stack.to_csv('data_output/out.csv', index=False)
119- ```
124+ ~~~
125+ {: .language-python}
120126
121127Check out your working directory to make sure the CSV wrote out properly, and
122128that you can open it! If you want, try to bring it back into python to make sure
123129it imports properly.
124130
125- ``` python
131+ ~~~
126132# For kicks read our output back into python and make sure all looks good
127133new_output = pd.read_csv('data_output/out.csv', keep_default_na=False, na_values=[""])
128- ```
134+ ~~~
135+ {: .language-python}
129136
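One way to check a round trip without touching the file system is sketched below. It assumes a small made-up DataFrame and writes to an in-memory `io.StringIO` buffer instead of `data_output/out.csv`; the names `df`, `buffer`, and `round_trip` are illustrative only.

```python
import io
import pandas as pd

# Small made-up DataFrame with one missing weight (hypothetical data).
df = pd.DataFrame({"record_id": [1, 2, 3], "weight": [40.0, None, 48.0]})

# Write to an in-memory text buffer; index=False leaves out the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read it back with the same options used in the lesson.
buffer.seek(0)
round_trip = pd.read_csv(buffer, keep_default_na=False, na_values=[""])

# Same shape and column names, and the missing value is still missing.
print(round_trip.shape == df.shape)
print(list(round_trip.columns) == list(df.columns))
print(round_trip["weight"].isna().sum())
```

Had the index been written out, the file would contain an extra unnamed column and the shapes would no longer match.
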
130137> ## Challenge - Combine Data
131138>
@@ -173,14 +180,15 @@ To better understand joins, let's grab the first 10 lines of our data as a
173180subset to work with. We'll use the ` .head ` method to do this. We'll also read
174181in a subset of the species table.
175182
176- ``` python
183+ ~~~
177184# Read in first 10 lines of surveys table
178185survey_sub = surveys_df.head(10)
179186
180187# Import a small subset of the species data designed for this part of the lesson.
181188# It is stored in the data folder.
182189species_sub = pd.read_csv('data/speciesSubset.csv', keep_default_na=False, na_values=[""])
183- ```
190+ ~~~
191+ {: .language-python}
184192
185193In this example, ` species_sub ` is the lookup table containing genus, species, and
186194taxa names that we want to join with the data in ` survey_sub ` to produce a new
@@ -197,7 +205,7 @@ the same name that also contain the same data. If we are less lucky, we need to
197205identify a (differently-named) column in each DataFrame that contains the same
198206information.
199207
200- ``` python
208+ ~~~
201209>>> species_sub.columns
202210
203211Index([u'species_id', u'genus', u'species', u'taxa'], dtype='object')
@@ -206,7 +214,8 @@ Index([u'species_id', u'genus', u'species', u'taxa'], dtype='object')
206214
207215Index([u'record_id', u'month', u'day', u'year', u'plot_id', u'species_id',
208216 u'sex', u'hindfoot_length', u'weight'], dtype='object')
209- ```
217+ ~~~
218+ {: .language-python}
210219
211220In our example, the join key is the column containing the two-letter species
212221identifier, which is called ` species_id ` .
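
Rather than comparing the two column listings by eye, candidate join keys can also be found programmatically. The sketch below rebuilds two empty DataFrames with the column names printed above (so it runs without the lesson's data files) and intersects them; `survey_cols`, `species_cols`, and `shared` are hypothetical names.

```python
import pandas as pd

# Empty DataFrames that carry only the column names shown above.
survey_cols = pd.DataFrame(columns=[
    "record_id", "month", "day", "year", "plot_id",
    "species_id", "sex", "hindfoot_length", "weight",
])
species_cols = pd.DataFrame(columns=["species_id", "genus", "species", "taxa"])

# Index.intersection returns the column names present in both DataFrames.
shared = survey_cols.columns.intersection(species_cols.columns)
print(list(shared))  # ['species_id']
```

A shared name is only a candidate: it is still worth confirming that both columns hold the same kind of values before joining on it.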
@@ -230,40 +239,41 @@ page](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/) is below:
230239![ Inner join -- courtesy of codinghorror.com] ( ../fig/inner-join.png )
231240
232241The pandas function for performing joins is called ` merge ` and an Inner join is
233- the default option:
242+ the default option:
234243
235- ``` python
244+ ~~~
236245merged_inner = pd.merge(left=survey_sub,right=species_sub, left_on='species_id', right_on='species_id')
237246# In this case `species_id` is the only column name shared by both dataframes, so if we skipped the `left_on`
238247# and `right_on` arguments we would still get the same result
239248
240249# What's the size of the output data?
241250merged_inner.shape
242251merged_inner
243- ```
252+ ~~~
253+ {: .language-python}
244254
245255** OUTPUT:**
246256
247257```
248258 record_id month day year plot_id species_id sex hindfoot_length \
249- 0 1 7 16 1977 2 NL M 32
250- 1 2 7 16 1977 3 NL M 33
251- 2 3 7 16 1977 2 DM F 37
252- 3 4 7 16 1977 7 DM M 36
253- 4 5 7 16 1977 3 DM M 35
254- 5 8 7 16 1977 1 DM M 37
255- 6 9 7 16 1977 1 DM F 34
256- 7 7 7 16 1977 2 PE F NaN
257-
258- weight genus species taxa
259- 0 NaN Neotoma albigula Rodent
260- 1 NaN Neotoma albigula Rodent
261- 2 NaN Dipodomys merriami Rodent
262- 3 NaN Dipodomys merriami Rodent
263- 4 NaN Dipodomys merriami Rodent
264- 5 NaN Dipodomys merriami Rodent
265- 6 NaN Dipodomys merriami Rodent
266- 7 NaN Peromyscus eremicus Rodent
259+ 0 1 7 16 1977 2 NL M 32
260+ 1 2 7 16 1977 3 NL M 33
261+ 2 3 7 16 1977 2 DM F 37
262+ 3 4 7 16 1977 7 DM M 36
263+ 4 5 7 16 1977 3 DM M 35
264+ 5 8 7 16 1977 1 DM M 37
265+ 6 9 7 16 1977 1 DM F 34
266+ 7 7 7 16 1977 2 PE F NaN
267+
268+ weight genus species taxa
269+ 0 NaN Neotoma albigula Rodent
270+ 1 NaN Neotoma albigula Rodent
271+ 2 NaN Dipodomys merriami Rodent
272+ 3 NaN Dipodomys merriami Rodent
273+ 4 NaN Dipodomys merriami Rodent
274+ 5 NaN Dipodomys merriami Rodent
275+ 6 NaN Dipodomys merriami Rodent
276+ 7 NaN Peromyscus eremicus Rodent
267277```
268278
269279The result of an inner join of ` survey_sub ` and ` species_sub ` is a new DataFrame
@@ -313,37 +323,38 @@ have values for the join key(s) in the `left` DataFrame.
313323A left join is performed in pandas by calling the same ` merge ` function used for
314324inner join, but using the ` how='left' ` argument:
315325
316- ``` python
326+ ~~~
317327merged_left = pd.merge(left=survey_sub,right=species_sub, how='left', left_on='species_id', right_on='species_id')
318328
319329merged_left
320330
321331**OUTPUT:**
322332
323333 record_id month day year plot_id species_id sex hindfoot_length \
324- 0 1 7 16 1977 2 NL M 32
325- 1 2 7 16 1977 3 NL M 33
326- 2 3 7 16 1977 2 DM F 37
327- 3 4 7 16 1977 7 DM M 36
328- 4 5 7 16 1977 3 DM M 35
329- 5 6 7 16 1977 1 PF M 14
330- 6 7 7 16 1977 2 PE F NaN
331- 7 8 7 16 1977 1 DM M 37
332- 8 9 7 16 1977 1 DM F 34
333- 9 10 7 16 1977 6 PF F 20
334-
335- weight genus species taxa
336- 0 NaN Neotoma albigula Rodent
337- 1 NaN Neotoma albigula Rodent
338- 2 NaN Dipodomys merriami Rodent
339- 3 NaN Dipodomys merriami Rodent
340- 4 NaN Dipodomys merriami Rodent
341- 5 NaN NaN NaN NaN
342- 6 NaN Peromyscus eremicus Rodent
343- 7 NaN Dipodomys merriami Rodent
344- 8 NaN Dipodomys merriami Rodent
345- 9 NaN NaN NaN NaN
346- ```
334+ 0 1 7 16 1977 2 NL M 32
335+ 1 2 7 16 1977 3 NL M 33
336+ 2 3 7 16 1977 2 DM F 37
337+ 3 4 7 16 1977 7 DM M 36
338+ 4 5 7 16 1977 3 DM M 35
339+ 5 6 7 16 1977 1 PF M 14
340+ 6 7 7 16 1977 2 PE F NaN
341+ 7 8 7 16 1977 1 DM M 37
342+ 8 9 7 16 1977 1 DM F 34
343+ 9 10 7 16 1977 6 PF F 20
344+
345+ weight genus species taxa
346+ 0 NaN Neotoma albigula Rodent
347+ 1 NaN Neotoma albigula Rodent
348+ 2 NaN Dipodomys merriami Rodent
349+ 3 NaN Dipodomys merriami Rodent
350+ 4 NaN Dipodomys merriami Rodent
351+ 5 NaN NaN NaN NaN
352+ 6 NaN Peromyscus eremicus Rodent
353+ 7 NaN Dipodomys merriami Rodent
354+ 8 NaN Dipodomys merriami Rodent
355+ 9 NaN NaN NaN NaN
356+ ~~~
357+ {: .language-python}
347358
348359The result DataFrame from a left join (` merged_left ` ) looks very much like the
349360result DataFrame from an inner join (` merged_inner ` ) in terms of the columns it
@@ -353,17 +364,18 @@ number of rows** as the original `survey_sub` DataFrame. When we inspect
353364come from `species_sub` (i.e., `genus`, `species`, and `taxa`) is
354365missing (they contain NaN values):
355366
356- ``` python
367+ ~~~
357368merged_left[ pd.isnull(merged_left.genus) ]
358369**OUTPUT:**
359370 record_id month day year plot_id species_id sex hindfoot_length \
360- 5 6 7 16 1977 1 PF M 14
361- 9 10 7 16 1977 6 PF F 20
371+ 5 6 7 16 1977 1 PF M 14
372+ 9 10 7 16 1977 6 PF F 20
362373
363- weight genus species taxa
364- 5 NaN NaN NaN NaN
374+ weight genus species taxa
375+ 5 NaN NaN NaN NaN
3653769 NaN NaN NaN NaN
366- ```
377+ ~~~
378+ {: .language-python}
367379
368380These rows are the ones where the value of ` species_id ` from ` survey_sub ` (in this
369381case, ` PF ` ) does not occur in ` species_sub ` .
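
Another way to find the unmatched rows is the `indicator=True` option of `merge`, which adds a `_merge` column recording where each row came from. The sketch below uses small made-up stand-ins for the two tables (`surveys` and `species` are hypothetical, not the lesson's `survey_sub` and `species_sub`).

```python
import pandas as pd

# Made-up stand-ins for the survey and species tables (hypothetical data).
surveys = pd.DataFrame({"record_id": [1, 6, 7],
                        "species_id": ["NL", "PF", "PE"]})
species = pd.DataFrame({"species_id": ["NL", "PE"],
                        "genus": ["Neotoma", "Peromyscus"]})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join.
merged = pd.merge(left=surveys, right=species, how="left",
                  on="species_id", indicator=True)

# Keys that exist only in the left table have no match in the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "species_id"].unique()
print(list(unmatched))              # ['PF']
print(len(merged) == len(surveys))  # True
```

The left join keeps every row of the left table, so the row count is unchanged and only the lookup columns are left empty for the unmatched key.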
@@ -396,14 +408,14 @@ The pandas `merge` function supports two other join types:
396408>
397409> 1 . In the data folder, there is a plot ` CSV ` that contains information about the
398410> type associated with each plot. Use that data to summarize the number of
399- > plots by plot type.
411+ > plots by plot type.
400412> 2 . Calculate a diversity index of your choice for control vs rodent exclosure
401- > plots. The index should consider both species abundance and number of
402- > species. You might choose to use the simple [ biodiversity index described
403- > here] ( http://www.amnh.org/explore/curriculum-collections/biodiversity-counts/plant-ecology/how-to-calculate-a-biodiversity-index )
404- > which calculates diversity as:
413+ > plots. The index should consider both species abundance and number of
414+ > species. You might choose to use the simple [ biodiversity index described
415+ > here] ( http://www.amnh.org/explore/curriculum-collections/biodiversity-counts/plant-ecology/how-to-calculate-a-biodiversity-index )
416+ > which calculates diversity as:
405417>
406- > the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
418+ > the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
407419 {: .challenge}
408420
409421{% include links.md %}