Commit b6a9907

tobyhodges and deppen8 committed
Apply changes proposed by @deppen8 in #468, but without styling "pandas" with backticks.
- Add lots of code style formatting where appropriate
- Replace references to isnull() with isna(). These methods are exactly the same for pandas, but isna() is consistent with fillna, dropna, etc.
- Replace a reference to merged_left.genus with merged_left['genus'], the preferred notation style for pandas
- Add more consistent use of {: .output} formatting
- Remove u'string' notation, which is a Python 2.x leftover, to match what learners will see with Python 3.x

Co-authored-by: Toby Hodges <tobyhodges@carpentries.org>
Co-authored-by: Jacob Deppen <deppen.8@gmail.com>
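The `isna()`/`isnull()` equivalence and the bracket-notation preference described in this message are easy to verify; a minimal sketch with a throwaway DataFrame (not the lesson data):

~~~
import pandas as pd
import numpy as np

# isna() and isnull() are aliases in pandas; isna() matches the
# naming of fillna() and dropna()
df = pd.DataFrame({"genus": ["Amphispiza", np.nan]})
assert df["genus"].isna().equals(df["genus"].isnull())

# Bracket notation works for any column label; attribute access
# (df.genus) fails for labels with spaces or labels that shadow
# DataFrame attributes
df["genus"].isna()
~~~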
1 parent 8ef6d50 commit b6a9907

1 file changed

Lines changed: 69 additions & 37 deletions


_episodes/05-merging-data.md

@@ -24,14 +24,17 @@ DataFrames](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
 `merge` and `concat`.
 
 To work through the examples below, we first need to load the species and
-surveys files into pandas DataFrames. In iPython:
+surveys files into pandas DataFrames. In a Jupyter Notebook or iPython:
 
 ~~~
 import pandas as pd
 surveys_df = pd.read_csv("data/surveys.csv",
                          keep_default_na=False, na_values=[""])
 surveys_df
+~~~
+{: .language-python}
 
+~~~
 record_id month day year plot species sex hindfoot_length weight
 0 1 7 16 1977 2 NA M 32 NaN
 1 2 7 16 1977 3 NA M 33 NaN
@@ -46,10 +49,17 @@ surveys_df
 35548 35549 12 31 2002 5 NaN NaN NaN NaN
 
 [35549 rows x 9 columns]
+~~~
+{: .output}
 
+~~~
 species_df = pd.read_csv("data/species.csv",
                          keep_default_na=False, na_values=[""])
 species_df
+~~~
+{: .language-python}
+
+~~~
 species_id genus species taxa
 0 AB Amphispiza bilineata Bird
 1 AH Ammospermophilus harrisi Rodent
@@ -65,14 +75,14 @@ species_df
 
 [54 rows x 4 columns]
 ~~~
-{: .language-python}
+{: .output}
 
 Take note that the `read_csv` method we used can take some additional options which
 we didn't use previously. Many functions in Python have a set of options that
 can be set by the user if needed. In this case, we have told pandas to assign
-empty values in our CSV to NaN `keep_default_na=False, na_values=[""]`.
+empty values in our CSV to `NaN` with the parameters `keep_default_na=False` and `na_values=[""]`.
 We have explicitly requested to change empty values in the CSV to NaN,
-this is however also the default behaviour of `read_csv`.
+this is however also the default behaviour of `read_csv`.
 [More about all of the `read_csv` options here and their defaults.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv)
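A quick check of what these two parameters do; a minimal sketch, assuming the lesson's `data/surveys.csv` is on disk:

~~~
import pandas as pd

# keep_default_na=False switches off pandas' built-in NA markers
# (so the species code "NA" survives as a real string), and
# na_values=[""] re-enables only empty cells as NaN
surveys_df = pd.read_csv("data/surveys.csv",
                         keep_default_na=False, na_values=[""])
surveys_df.isna().sum()   # per-column count of cells read in as NaN
~~~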
 
 # Concatenating DataFrames
@@ -111,16 +121,16 @@ horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis=1)
 {: .language-python}
 
 ### Row Index Values and Concat
-Have a look at the `vertical_stack` dataframe? Notice anything unusual?
-The row indexes for the two data frames `survey_sub` and `survey_sub_last10`
-have been repeated. We can reindex the new dataframe using the `reset_index()` method.
+Have a look at the `vertical_stack` DataFrame. Notice anything unusual?
+The row indexes for the two DataFrames `survey_sub` and `survey_sub_last10`
+have been repeated. We can reindex the new DataFrame using the `reset_index()` method.
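The repeated indexes and the `reset_index()` fix are easy to see in miniature; a sketch with toy frames standing in for `survey_sub` and `survey_sub_last10` (hypothetical values):

~~~
import pandas as pd

a = pd.DataFrame({"weight": [10, 12]})
b = pd.DataFrame({"weight": [30, 31]})

stacked = pd.concat([a, b], axis=0)        # index repeats: 0, 1, 0, 1
stacked = stacked.reset_index(drop=True)   # fresh index: 0, 1, 2, 3
# drop=True discards the old index instead of keeping it as a column
~~~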
 
 ## Writing Out Data to CSV
 
 We can use the `to_csv` command to export a DataFrame in CSV format. Note that the code
 below will by default save the data into the current working directory. We can
 save it to a different folder by adding the foldername and a slash to the file
-`vertical_stack.to_csv('foldername/out.csv')`. We use the 'index=False' so that
+`vertical_stack.to_csv('foldername/out.csv')`. We use the `index=False` so that
 pandas doesn't include the index number for each line.
 
 ~~~
@@ -130,7 +140,7 @@ vertical_stack.to_csv('data/out.csv', index=False)
 {: .language-python}
 
 Check out your working directory to make sure the CSV wrote out properly, and
-that you can open it! If you want, try to bring it back into Python to make sure
+that you can open it! If you want, try to bring it back into pandas to make sure
 it imports properly.
 
 ~~~
@@ -142,17 +152,17 @@ new_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])
 > ## Challenge - Combine Data
 >
 > In the data folder, there are two survey data files: `surveys2001.csv` and
-> `surveys2002.csv`. Read the data into Python and combine the files to make one
-> new data frame. Create a plot of average plot weight by year grouped by sex.
-> Export your results as a CSV and make sure it reads back into Python properly.
+> `surveys2002.csv`. Read the data into pandas and combine the files to make one
+> new DataFrame. Create a plot of average plot weight by year grouped by sex.
+> Export your results as a CSV and make sure it reads back into pandas properly.
 {: .challenge}
 
 # Joining DataFrames
 
-When we concatenated our DataFrames we simply added them to each other -
+When we concatenated our DataFrames, we simply added them to each other -
 stacking them either vertically or side by side. Another way to combine
 DataFrames is to use columns in each dataset that contain common values (a
-common unique id). Combining DataFrames using a common field is called
+common unique identifier). Combining DataFrames using a common field is called
 "joining". The columns containing the common values are called "join key(s)".
 Joining DataFrames in this way is often useful when one DataFrame is a "lookup
 table" containing additional data that we want to include in the other.
@@ -163,13 +173,13 @@ SQL database.
 For example, the `species.csv` file that we've been working with is a lookup
 table. This table contains the genus, species and taxa code for 55 species. The
 species code is unique for each line. These species are identified in our survey
-data as well using the unique species code. Rather than adding 3 more columns
-for the genus, species and taxa to each of the 35,549 line Survey data table, we
+data as well using the unique species code. Rather than adding three more columns
+for the genus, species and taxa to each of the 35,549 line `survey` DataFrame, we
 can maintain the shorter table with the species information. When we want to
 access that information, we can create a query that joins the additional columns
-of information to the Survey data.
+of information to the `survey` DataFrame.
 
-Storing data in this way has many benefits including:
+Storing data in this way has many benefits.
 
 1. It ensures consistency in the spelling of species attributes (genus, species
 and taxa) given each species is only entered once. Imagine the possibilities
@@ -182,7 +192,7 @@ Storing data in this way has many benefits including:
 ## Joining Two DataFrames
 
 To better understand joins, let's grab the first 10 lines of our data as a
-subset to work with. We'll use the `.head` method to do this. We'll also read
+subset to work with. We'll use the `.head()` method to do this. We'll also read
 in a subset of the species table.
 
 ~~~
@@ -211,16 +221,25 @@ identify a (differently-named) column in each DataFrame that contains the same
 information.
 
 ~~~
->>> species_sub.columns
+species_sub.columns
+~~~
+{: .language-python}
 
+~~~
 Index([u'species_id', u'genus', u'species', u'taxa'], dtype='object')
+~~~
+{: .output}
 
->>> survey_sub.columns
+~~~
+survey_sub.columns
+~~~
+{: .language-python}
 
+~~~
 Index([u'record_id', u'month', u'day', u'year', u'plot_id', u'species_id',
        u'sex', u'hindfoot_length', u'weight'], dtype='object')
 ~~~
-{: .language-python}
+{: .output}
 
 In our example, the join key is the column containing the two-letter species
 identifier, which is called `species_id`.
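When the key columns do not share a name, `left_on` and `right_on` tell `merge` which column to use on each side; a minimal sketch with made-up frames (the `code` column is hypothetical):

~~~
import pandas as pd

surveys = pd.DataFrame({"species_id": ["AB", "AH"],
                        "weight": [32, 33]})
species = pd.DataFrame({"code": ["AB", "AH"],
                        "genus": ["Amphispiza", "Ammospermophilus"]})

# left_on/right_on name the join key in each DataFrame
merged = pd.merge(left=surveys, right=species,
                  left_on="species_id", right_on="code")
~~~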
@@ -232,13 +251,10 @@ also need to decide which type of join makes sense for our analysis.
 
 ## Inner joins
 
-The most common type of join is called an _inner join_. An inner join combines
+The most common type of join is called an **inner join**. An inner join combines
 two DataFrames based on a join key and returns a new DataFrame that contains
-**only** those rows that have matching values in *both* of the original
-DataFrames.
-
-Inner joins yield a DataFrame that contains only rows where the value being
-joined exists in BOTH tables. An example of an inner join, adapted from [Jeff Atwood's blogpost about SQL joins][join-types] is below:
+*only* those rows that have matching values in *both* of the original
+DataFrames. An example of an inner join, adapted from [Jeff Atwood's blogpost about SQL joins][join-types] is below:
 
 ![Inner join -- courtesy of codinghorror.com](../fig/inner-join.png)
 
@@ -247,11 +263,26 @@ the default option:
 
 ~~~
 merged_inner = pd.merge(left=survey_sub, right=species_sub, left_on='species_id', right_on='species_id')
-# In this case `species_id` is the only column name in both dataframes, so if we skipped `left_on`
-# And `right_on` arguments we would still get the same result
+~~~
+{: .language-python}
+
+In this case, `species_id` is the only column name in both DataFrames, so if we skipped the `left_on`
+and `right_on` arguments, `pandas` would guess that we wanted to use that column to join. However, it is
+usually better to be explicit.
 
-# What's the size of the output data?
+So what is the size of the output data?
+
+~~~
 merged_inner.shape
+~~~
+{: .language-python}
+
+~~~
+(8, 12)
+~~~
+{: .output}
+
+~~~
 merged_inner
 ~~~
 {: .language-python}
@@ -298,8 +329,8 @@ DataFrame). For inner joins, the order of the `left` and `right` arguments does
 not matter.
 
 The result `merged_inner` DataFrame contains all of the columns from `survey_sub`
-(record id, month, day, etc.) as well as all the columns from `species_sub`
-(species_id, genus, species, and taxa).
+(`record_id`, `month`, `day`, etc.) as well as all the columns from `species_sub`
+(`species_id`, `genus`, `species`, and `taxa`).
 
 Notice that `merged_inner` has fewer rows than `survey_sub`. This is an
 indication that there were rows in `surveys_df` with value(s) for `species_id` that
@@ -360,16 +391,17 @@ merged_left
 
 The result DataFrame from a left join (`merged_left`) looks very much like the
 result DataFrame from an inner join (`merged_inner`) in terms of the columns it
-contains. However, unlike `merged_inner`, `merged_left` contains the **same
-number of rows** as the original `survey_sub` DataFrame. When we inspect
+contains. However, unlike `merged_inner`, `merged_left` contains the *same
+number of rows* as the original `survey_sub` DataFrame. When we inspect
 `merged_left`, we find there are rows where the information that should have
 come from `species_sub` (i.e., `species_id`, `genus`, and `taxa`) is
-missing (they contain NaN values):
+missing (they contain `NaN` values):
 
 ~~~
-merged_left[ pd.isnull(merged_left.genus) ]
+merged_left[merged_left['genus'].isna()]
 ~~~
 {: .language-python}
+
 ~~~
 record_id month day year plot_id species_id sex hindfoot_length \
 5 6 7 16 1977 1 PF M 14
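The left-join behaviour this hunk documents can be reproduced in miniature; a sketch with made-up rows, where the unmatched `XX` species code is hypothetical:

~~~
import pandas as pd

surveys = pd.DataFrame({"record_id": [1, 2, 3],
                        "species_id": ["AB", "XX", "AH"]})
species = pd.DataFrame({"species_id": ["AB", "AH"],
                        "genus": ["Amphispiza", "Ammospermophilus"]})

# how="left" keeps every survey row; rows with no match in the
# species table get NaN for the joined columns
merged_left = pd.merge(left=surveys, right=species,
                       how="left", on="species_id")
merged_left[merged_left["genus"].isna()]   # the unmatched "XX" row
~~~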
