Commit b6a9907

tobyhodges and deppen8 committed
Apply changes proposed by @deppen8 in #468, but without styling "pandas" with backticks.
- Add lots of code style formatting where appropriate
- Replace references to isnull() with isna(). These methods are exactly the same for pandas, but isna() is consistent with fillna, dropna, etc.
- Replace a reference to merged_left.genus with merged_left['genus'], the preferred notation style for pandas
- Add more consistent use of {: .output} formatting
- Remove u'string' notation, which is a Python 2.x leftover, to match what learners will see with Python 3.x

Co-authored-by: Toby Hodges <tobyhodges@carpentries.org>
Co-authored-by: Jacob Deppen <deppen.8@gmail.com>
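The `isna()`/`isnull()` equivalence and the bracket-notation preference described in this message are easy to verify; a minimal sketch with a throwaway DataFrame (not the lesson data):

~~~
import pandas as pd
import numpy as np

# isna() and isnull() are aliases in pandas; isna() matches the
# naming of fillna() and dropna()
df = pd.DataFrame({"genus": ["Amphispiza", np.nan]})
assert df["genus"].isna().equals(df["genus"].isnull())

# Bracket notation works for any column label; attribute access
# (df.genus) fails for labels with spaces or labels that shadow
# DataFrame attributes
df["genus"].isna()
~~~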
1 parent 8ef6d50 commit b6a9907

1 file changed

Lines changed: 69 additions & 37 deletions


_episodes/05-merging-data.md

@@ -24,14 +24,17 @@ DataFrames](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
 `merge` and `concat`.
 
 To work through the examples below, we first need to load the species and
-surveys files into pandas DataFrames. In iPython:
+surveys files into pandas DataFrames. In a Jupyter Notebook or iPython:
 
 ~~~
 import pandas as pd
 surveys_df = pd.read_csv("data/surveys.csv",
                          keep_default_na=False, na_values=[""])
 surveys_df
+~~~
+{: .language-python}
 
+~~~
 record_id month day year plot species sex hindfoot_length weight
 0 1 7 16 1977 2 NA M 32 NaN
 1 2 7 16 1977 3 NA M 33 NaN
@@ -46,10 +49,17 @@ surveys_df
 35548 35549 12 31 2002 5 NaN NaN NaN NaN
 
 [35549 rows x 9 columns]
+~~~
+{: .output}
 
+~~~
 species_df = pd.read_csv("data/species.csv",
                          keep_default_na=False, na_values=[""])
 species_df
+~~~
+{: .language-python}
+
+~~~
 species_id genus species taxa
 0 AB Amphispiza bilineata Bird
 1 AH Ammospermophilus harrisi Rodent
@@ -65,14 +75,14 @@ species_df
 
 [54 rows x 4 columns]
 ~~~
-{: .language-python}
+{: .output}
 
 Take note that the `read_csv` method we used can take some additional options which
 we didn't use previously. Many functions in Python have a set of options that
 can be set by the user if needed. In this case, we have told pandas to assign
-empty values in our CSV to NaN `keep_default_na=False, na_values=[""]`.
+empty values in our CSV to `NaN` with the parameters `keep_default_na=False` and `na_values=[""]`.
 We have explicitly requested to change empty values in the CSV to NaN,
-this is however also the default behaviour of `read_csv`.
+this is however also the default behaviour of `read_csv`.
 [More about all of the `read_csv` options here and their defaults.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv)
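A quick check of what these two parameters do; a minimal sketch, assuming the lesson's `data/surveys.csv` is on disk:

~~~
import pandas as pd

# keep_default_na=False switches off pandas' built-in NA markers
# (so the species code "NA" survives as a real string), and
# na_values=[""] re-enables only empty cells as NaN
surveys_df = pd.read_csv("data/surveys.csv",
                         keep_default_na=False, na_values=[""])
surveys_df.isna().sum()   # per-column count of cells read in as NaN
~~~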
 
 # Concatenating DataFrames
@@ -111,16 +121,16 @@ horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis=1)
 {: .language-python}
 
 ### Row Index Values and Concat
-Have a look at the `vertical_stack` dataframe? Notice anything unusual?
-The row indexes for the two data frames `survey_sub` and `survey_sub_last10`
-have been repeated. We can reindex the new dataframe using the `reset_index()` method.
+Have a look at the `vertical_stack` DataFrame. Notice anything unusual?
+The row indexes for the two DataFrames `survey_sub` and `survey_sub_last10`
+have been repeated. We can reindex the new DataFrame using the `reset_index()` method.
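The repeated indexes and the `reset_index()` fix are easy to see in miniature; a sketch with toy frames standing in for `survey_sub` and `survey_sub_last10` (hypothetical values):

~~~
import pandas as pd

a = pd.DataFrame({"weight": [10, 12]})
b = pd.DataFrame({"weight": [30, 31]})

stacked = pd.concat([a, b], axis=0)        # index repeats: 0, 1, 0, 1
stacked = stacked.reset_index(drop=True)   # fresh index: 0, 1, 2, 3
# drop=True discards the old index instead of keeping it as a column
~~~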
 
 ## Writing Out Data to CSV
 
 We can use the `to_csv` command to export a DataFrame in CSV format. Note that the code
 below will by default save the data into the current working directory. We can
 save it to a different folder by adding the foldername and a slash to the file
-`vertical_stack.to_csv('foldername/out.csv')`. We use the 'index=False' so that
+`vertical_stack.to_csv('foldername/out.csv')`. We use the `index=False` so that
 pandas doesn't include the index number for each line.
 
 ~~~
@@ -130,7 +140,7 @@ vertical_stack.to_csv('data/out.csv', index=False)
 {: .language-python}
 
 Check out your working directory to make sure the CSV wrote out properly, and
-that you can open it! If you want, try to bring it back into Python to make sure
+that you can open it! If you want, try to bring it back into pandas to make sure
 it imports properly.
 
 ~~~
@@ -142,17 +152,17 @@ new_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])
 > ## Challenge - Combine Data
 >
 > In the data folder, there are two survey data files: `surveys2001.csv` and
-> `surveys2002.csv`. Read the data into Python and combine the files to make one
-> new data frame. Create a plot of average plot weight by year grouped by sex.
-> Export your results as a CSV and make sure it reads back into Python properly.
+> `surveys2002.csv`. Read the data into pandas and combine the files to make one
+> new DataFrame. Create a plot of average plot weight by year grouped by sex.
+> Export your results as a CSV and make sure it reads back into pandas properly.
 {: .challenge}
 
 # Joining DataFrames
 
-When we concatenated our DataFrames we simply added them to each other -
+When we concatenated our DataFrames, we simply added them to each other -
 stacking them either vertically or side by side. Another way to combine
 DataFrames is to use columns in each dataset that contain common values (a
-common unique id). Combining DataFrames using a common field is called
+common unique identifier). Combining DataFrames using a common field is called
 "joining". The columns containing the common values are called "join key(s)".
 Joining DataFrames in this way is often useful when one DataFrame is a "lookup
 table" containing additional data that we want to include in the other.
@@ -163,13 +173,13 @@ SQL database.
 For example, the `species.csv` file that we've been working with is a lookup
 table. This table contains the genus, species and taxa code for 55 species. The
 species code is unique for each line. These species are identified in our survey
-data as well using the unique species code. Rather than adding 3 more columns
-for the genus, species and taxa to each of the 35,549 line Survey data table, we
+data as well using the unique species code. Rather than adding three more columns
+for the genus, species and taxa to each of the 35,549 line `survey` DataFrame, we
 can maintain the shorter table with the species information. When we want to
 access that information, we can create a query that joins the additional columns
-of information to the Survey data.
+of information to the `survey` DataFrame.
 
-Storing data in this way has many benefits including:
+Storing data in this way has many benefits.
 
 1. It ensures consistency in the spelling of species attributes (genus, species
 and taxa) given each species is only entered once. Imagine the possibilities
@@ -182,7 +192,7 @@ Storing data in this way has many benefits including:
 ## Joining Two DataFrames
 
 To better understand joins, let's grab the first 10 lines of our data as a
-subset to work with. We'll use the `.head` method to do this. We'll also read
+subset to work with. We'll use the `.head()` method to do this. We'll also read
 in a subset of the species table.
 
 ~~~
@@ -211,16 +221,25 @@ identify a (differently-named) column in each DataFrame that contains the same
 information.
 
 ~~~
->>> species_sub.columns
+species_sub.columns
+~~~
+{: .language-python}
 
+~~~
 Index([u'species_id', u'genus', u'species', u'taxa'], dtype='object')
+~~~
+{: .output}
 
->>> survey_sub.columns
+~~~
+survey_sub.columns
+~~~
+{: .language-python}
 
+~~~
 Index([u'record_id', u'month', u'day', u'year', u'plot_id', u'species_id',
        u'sex', u'hindfoot_length', u'weight'], dtype='object')
 ~~~
-{: .language-python}
+{: .output}
 
 In our example, the join key is the column containing the two-letter species
 identifier, which is called `species_id`.
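When the key columns do not share a name, `left_on` and `right_on` tell `merge` which column to use on each side; a minimal sketch with made-up frames (the `code` column is hypothetical):

~~~
import pandas as pd

surveys = pd.DataFrame({"species_id": ["AB", "AH"],
                        "weight": [32, 33]})
species = pd.DataFrame({"code": ["AB", "AH"],
                        "genus": ["Amphispiza", "Ammospermophilus"]})

# left_on/right_on name the join key in each DataFrame
merged = pd.merge(left=surveys, right=species,
                  left_on="species_id", right_on="code")
~~~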
@@ -232,13 +251,10 @@ also need to decide which type of join makes sense for our analysis.
 
 ## Inner joins
 
-The most common type of join is called an _inner join_. An inner join combines
+The most common type of join is called an **inner join**. An inner join combines
 two DataFrames based on a join key and returns a new DataFrame that contains
-**only** those rows that have matching values in *both* of the original
-DataFrames.
-
-Inner joins yield a DataFrame that contains only rows where the value being
-joined exists in BOTH tables. An example of an inner join, adapted from [Jeff Atwood's blogpost about SQL joins][join-types] is below:
+*only* those rows that have matching values in *both* of the original
+DataFrames. An example of an inner join, adapted from [Jeff Atwood's blogpost about SQL joins][join-types] is below:
 
 ![Inner join -- courtesy of codinghorror.com](../fig/inner-join.png)
 
@@ -247,11 +263,26 @@ the default option:
 
 ~~~
 merged_inner = pd.merge(left=survey_sub, right=species_sub, left_on='species_id', right_on='species_id')
-# In this case `species_id` is the only column name in both dataframes, so if we skipped `left_on`
-# And `right_on` arguments we would still get the same result
+~~~
+{: .language-python}
+
+In this case, `species_id` is the only column name in both DataFrames, so if we skipped the `left_on`
+and `right_on` arguments, `pandas` would guess that we wanted to use that column to join. However, it is
+usually better to be explicit.
 
-# What's the size of the output data?
+So what is the size of the output data?
+
+~~~
 merged_inner.shape
+~~~
+{: .language-python}
+
+~~~
+(8, 12)
+~~~
+{: .output}
+
+~~~
 merged_inner
 ~~~
 {: .language-python}
@@ -298,8 +329,8 @@ DataFrame). For inner joins, the order of the `left` and `right` arguments does
 not matter.
 
 The result `merged_inner` DataFrame contains all of the columns from `survey_sub`
-(record id, month, day, etc.) as well as all the columns from `species_sub`
-(species_id, genus, species, and taxa).
+(`record_id`, `month`, `day`, etc.) as well as all the columns from `species_sub`
+(`species_id`, `genus`, `species`, and `taxa`).
 
 Notice that `merged_inner` has fewer rows than `survey_sub`. This is an
 indication that there were rows in `surveys_df` with value(s) for `species_id` that
@@ -360,16 +391,17 @@ merged_left
 
 The result DataFrame from a left join (`merged_left`) looks very much like the
 result DataFrame from an inner join (`merged_inner`) in terms of the columns it
-contains. However, unlike `merged_inner`, `merged_left` contains the **same
-number of rows** as the original `survey_sub` DataFrame. When we inspect
+contains. However, unlike `merged_inner`, `merged_left` contains the *same
+number of rows* as the original `survey_sub` DataFrame. When we inspect
 `merged_left`, we find there are rows where the information that should have
 come from `species_sub` (i.e., `species_id`, `genus`, and `taxa`) is
-missing (they contain NaN values):
+missing (they contain `NaN` values):
 
 ~~~
-merged_left[ pd.isnull(merged_left.genus) ]
+merged_left[merged_left['genus'].isna()]
 ~~~
 {: .language-python}
+
 ~~~
 record_id month day year plot_id species_id sex hindfoot_length \
 5 6 7 16 1977 1 PF M 14
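The left-join behaviour this hunk documents can be reproduced in miniature; a sketch with made-up rows, where the unmatched `XX` species code is hypothetical:

~~~
import pandas as pd

surveys = pd.DataFrame({"record_id": [1, 2, 3],
                        "species_id": ["AB", "XX", "AH"]})
species = pd.DataFrame({"species_id": ["AB", "AH"],
                        "genus": ["Amphispiza", "Ammospermophilus"]})

# how="left" keeps every survey row; rows with no match in the
# species table get NaN for the joined columns
merged_left = pd.merge(left=surveys, right=species,
                       how="left", on="species_id")
merged_left[merged_left["genus"].isna()]   # the unmatched "XX" row
~~~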
