
Commit ec3f6f0

tobyhodges and deppen8 committed

Apply changes proposed by @deppen8 in #467, but without styling "pandas" with backticks.

- Make references more explicit to "pandas" rather than "Python"
- Add lots of `code` style formatting where appropriate
- Replace references to `isnull()` with `isna()`. These methods are exactly the same in pandas, but `isna()` is consistent with `fillna`, `dropna`, etc., which are also used here.
- Replace a reference to `survey_df.weight` with `survey_df['weight']`, the preferred notation style for pandas

Co-authored-by: Toby Hodges <tobyhodges@carpentries.org>
Co-authored-by: Jacob Deppen <deppen.8@gmail.com>

1 parent c71cb18 · commit ec3f6f0

1 file changed: _episodes/04-data-types-and-format.md
Lines changed: 48 additions & 52 deletions
@@ -6,26 +6,26 @@ questions:
 - "What types of data can be contained in a DataFrame?"
 - "Why is the data type important?"
 objectives:
-- "Describe how information is stored in a Python DataFrame."
-- "Define the two main types of data in Python: text and numerics."
+- "Describe how information is stored in a pandas DataFrame."
+- "Define the two main types of data in pandas: text and numerics."
 - "Examine the structure of a DataFrame."
 - "Modify the format of values in a DataFrame."
 - "Describe how data types impact operations."
-- "Define, manipulate, and interconvert integers and floats in Python."
+- "Define, manipulate, and interconvert integers and floats in Python/pandas."
 - "Analyze datasets having missing/null values (NaN values)."
 - "Write manipulated data to a file."
 keypoints:
-- "Pandas uses other names for data types than Python, for example: `object` for textual data."
+- "pandas uses other names for data types than Python, for example: `object` for textual data."
 - "A column in a DataFrame can only have one data type."
 - "The data type in a DataFrame's single column can be checked using `dtype`."
 - "Make conscious decisions about how to manage missing data."
 - "A DataFrame can be saved to a CSV file using the `to_csv` function."
 ---

 The format of individual columns and rows will impact analysis performed on a
-dataset read into Python. For example, you can't perform mathematical
+dataset read into a pandas DataFrame. For example, you can't perform mathematical
 calculations on a string (text formatted data). This might seem obvious,
-however sometimes numeric values are read into Python as strings. In this
+however sometimes numeric values are read into pandas as strings. In this
 situation, when you then try to perform calculations on the string-formatted
 numeric data, you get an error.
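The pitfall described in this hunk — numeric-looking values stored as strings — can be sketched with a tiny hypothetical DataFrame (not the lesson's `surveys.csv`):

```python
import pandas as pd

# A hypothetical column of numbers that was read in as strings
df = pd.DataFrame({"value": ["1", "2", "3"]})
print(df["value"].dtype)  # object (i.e. string), not a numeric dtype

# "+" on an object column concatenates strings instead of adding numbers
print((df["value"] + df["value"]).tolist())  # ['11', '22', '33']

# Converting to a numeric dtype first makes arithmetic behave as expected
print((df["value"].astype("int64") + 1).tolist())  # [2, 3, 4]
```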
@@ -44,26 +44,26 @@ this lesson: numeric and text data types.
 Numeric data types include integers and floats. A **floating point** (known as a
 float) number has decimal points even if that decimal point value is 0. For
 example: 1.13, 2.0, 1234.345. If we have a column that contains both integers and
-floating point numbers, Pandas will assign the entire column to the float data
+floating point numbers, pandas will assign the entire column to the float data
 type so the decimal points are not lost.

 An **integer** will never have a decimal point. Thus if we wanted to store 1.13 as
 an integer it would be stored as 1. Similarly, 1234.345 would be stored as 1234. You
-will often see the data type `Int64` in Python which stands for 64 bit integer. The 64
+will often see the data type `int64` in pandas, which stands for 64-bit integer. The 64
 refers to the memory allocated to store data in each cell which effectively
 relates to how many digits it can store in each "cell". Allocating space ahead of time
 allows computers to optimize storage and processing efficiency.

 ## Text Data Type

-Text data type is known as Strings in Python, or Objects in Pandas. Strings can
+The text data type is known as a *string* in Python, or *object* in pandas. Strings can
 contain numbers and / or characters. For example, a string might be a word, a
-sentence, or several sentences. A Pandas object might also be a plot name like
-'plot1'. A string can also contain or consist of numbers. For instance, '1234'
-could be stored as a string, as could '10.23'. However **strings that contain
+sentence, or several sentences. A pandas object might also be a plot name like
+`'plot1'`. A string can also contain or consist of numbers. For instance, `'1234'`
+could be stored as a string, as could `'10.23'`. However **strings that contain
 numbers can not be used for mathematical operations**!

-Pandas and base Python use slightly different names for data types. More on this
+pandas and base Python use slightly different names for data types. More on this
 is in the table below:

 | Pandas Type | Native Python Type | Description |
@@ -85,7 +85,6 @@ same `surveys.csv` dataset that we've used in previous lessons.
 ~~~
 # Make sure pandas is loaded
 import pandas as pd
-
 # Note that pd.read_csv is used because we imported pandas as pd
 surveys_df = pd.read_csv("data/surveys.csv")
 ~~~
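The `read_csv` call shown in this hunk infers one dtype per column. A minimal sketch, using an inline stand-in for `surveys.csv` (the column names and values here are assumptions, not the real dataset):

```python
import io
import pandas as pd

# Inline stand-in for data/surveys.csv; columns and values are hypothetical
csv_text = "record_id,species_id,weight\n1,NL,32.0\n2,DM,\n3,PE,21.5\n"
df = pd.read_csv(io.StringIO(csv_text))

# read_csv infers one dtype per column: int64, object, float64
print(df.dtypes)

# The empty weight cell was read as NaN, so that column is float64
print(df["weight"].dtype)  # float64
```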
@@ -103,9 +102,9 @@ pandas.core.frame.DataFrame
 ~~~
 {: .output}

-Next, let's look at the structure of our surveys data. In pandas, we can check
+Next, let's look at the structure of our `surveys_df` data. In pandas, we can check
 the type of one column in a DataFrame using the syntax
-`dataFrameName[column_name].dtype`:
+`dataframe_name['column_name'].dtype`:

 ~~~
 surveys_df['sex'].dtype
@@ -117,7 +116,7 @@ dtype('O')
 ~~~
 {: .output}

-A type 'O' just stands for "object" which in Pandas' world is a string
+A type 'O' just stands for "object" which in pandas is a string
 (text).

 ~~~
@@ -130,16 +129,16 @@ dtype('int64')
 ~~~
 {: .output}

-The type `int64` tells us that Python is storing each value within this column
-as a 64 bit integer. We can use the `dat.dtypes` command to view the data type
+The type `int64` tells us that pandas is storing each value within this column
+as a 64 bit integer. We can use the `dataframe_name.dtypes` command to view the data type
 for each column in a DataFrame (all at once).

 ~~~
 surveys_df.dtypes
 ~~~
 {: .language-python}

-which **returns**:
+which returns:

 ~~~
 record_id int64
@@ -155,16 +154,16 @@ dtype: object
 ~~~
 {: .language-python }

-Note that most of the columns in our Survey data are of type `int64`. This means
-that they are 64 bit integers. But the weight column is a floating point value
+Note that most of the columns in our `surveys_df` data are of type `int64`. This means
+that they are 64 bit integers. But the `weight` column is a floating point value
 which means it contains decimals. The `species_id` and `sex` columns are objects which
 means they contain strings.

 ## Working With Integers and Floats

 So we've learned that computers store numbers in one of two ways: as integers or
 as floating-point numbers (or floats). Integers are the numbers we usually count
-with. Floats have fractional parts (decimal places). Let's next consider how 
+with. Floats have fractional parts (decimal places). Let's next consider how
 the data type can impact mathematical operations on our data. Addition,
 subtraction, division and multiplication work on floats and integers as we'd expect.
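The arithmetic and conversion behaviour discussed above can be sketched with a small Series (hypothetical values, not the survey data):

```python
import pandas as pd

s = pd.Series([5, 10, 15])  # an int64 Series

# True division always produces floats, even on integer input
print((s / 2).dtype)  # float64

# astype converts between numeric dtypes explicitly
print(s.astype("float").tolist())  # [5.0, 10.0, 15.0]

# Converting floats to integers truncates the decimal part
print(pd.Series([1.9, 2.1]).astype("int64").tolist())  # [1, 2]
```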
@@ -264,21 +263,20 @@ dtype('float64')
 > Try converting the column `plot_id` to floats using
 >
 > ~~~
-> surveys_df.plot_id.astype("float")
+> surveys_df['plot_id'].astype("float")
 > ~~~
 > {: .language-python}
 >
-> Next try converting `weight` to an integer. What goes wrong here? What is Pandas telling you?
+> Next try converting `weight` to an integer. What goes wrong here? What is pandas telling you?
 > We will talk about some solutions to this later.
 {: .challenge}
-
 ## Missing Data Values - NaN

 What happened in the last challenge activity? Notice that this throws a value error:
 `ValueError: Cannot convert NA to integer`. If we look at the `weight` column in the surveys
 data we notice that there are NaN (**N**ot **a** **N**umber) values. **NaN** values are undefined
-values that cannot be represented mathematically. Pandas, for example, will read
-an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we
+values that cannot be represented mathematically. pandas, for example, will read
+an empty cell in a CSV or Excel sheet as `NaN`. NaNs have some desirable properties: if we
 were to average the `weight` column without replacing our NaNs, Python would know to skip
 over those cells.
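The two NaN behaviours mentioned in this hunk — aggregations skipping NaN, and the integer-conversion error from the challenge — can be sketched as:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0])

# Aggregations skip NaN by default, so the mean is (2 + 4) / 2
print(s.mean())  # 3.0

# Casting a column that contains NaN to a plain integer dtype raises
# a ValueError, as in the challenge above
try:
    s.astype("int64")
except ValueError as err:
    print("conversion failed:", err)
```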
@@ -302,26 +300,26 @@ values were handled.
 For instance, in some disciplines, like Remote Sensing, missing data values are
 often defined as -9999. Having a bunch of -9999 values in your data could really
 alter numeric calculations. Often in spreadsheets, cells are left empty where no
-data are available. Pandas will, by default, replace those missing values with
-NaN. However it is good practice to get in the habit of intentionally marking
-cells that have no data, with a no data value! That way there are no questions
+data are available. pandas will, by default, replace those missing values with
+`NaN`. However, it is good practice to get in the habit of intentionally marking
+cells that have no data with a no data value! That way there are no questions
 in the future when you (or someone else) explores your data.

 ### Where Are the NaN's?

-Let's explore the NaN values in our data a bit further. Using the tools we
-learned in lesson 02, we can figure out how many rows contain NaN values for
-weight. We can also create a new subset from our data that only contains rows
-with weight values > 0 (i.e., select meaningful weight values):
+Let's explore the `NaN` values in our data a bit further. Using the tools we
+learned in lesson 02, we can figure out how many rows contain `NaN` values for
+`weight`. We can also create a new subset from our data that only contains rows
+with `weight > 0` (i.e., select meaningful weight values):

 ~~~
-len(surveys_df[pd.isnull(surveys_df.weight)])
+len(surveys_df[surveys_df['weight'].isna()])
 # How many rows have weight values?
-len(surveys_df[surveys_df.weight > 0])
+len(surveys_df[surveys_df['weight'] > 0])
 ~~~
 {: .language-python}

-We can replace all NaN values with zeroes using the `.fillna()` method (after
+We can replace all `NaN` values with zeroes using the `.fillna()` method (after
 making a copy of the data so we don't lose our work):

 ~~~
@@ -331,8 +329,8 @@ df1['weight'] = df1['weight'].fillna(0)
 ~~~
 {: .language-python}

-However NaN and 0 yield different analysis results. The mean value when NaN
-values are replaced with 0 is different from when NaN values are simply thrown
+However `NaN` and `0` yield different analysis results. The mean value when `NaN`
+values are replaced with `0` is different from when `NaN` values are simply thrown
 out or ignored.

 ~~~
@@ -345,37 +343,36 @@ df1['weight'].mean()
 ~~~
 {: .output}

-We can fill NaN values with any value that we chose. The code below fills all
-NaN values with a mean for all weight values.
+We can fill `NaN` values with any value that we choose. The code below fills all
+`NaN` values with the mean of all weight values.

 ~~~
 df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
 ~~~
 {: .language-python}

 We could also choose to create a subset of our data, only keeping rows that do
-not contain NaN values.
+not contain `NaN` values.

 The point is to make conscious decisions about how to manage missing data. This
 is where we think about how our data will be used and how these values will
 impact the scientific conclusions made from the data.

-Python gives us all of the tools that we need to account for these issues. We
+pandas gives us all of the tools that we need to account for these issues. We
 just need to be cautious about how the decisions that we make impact scientific
 results.

 > ## Counting
 > Count the number of missing values per column.
 >
-> > ## Hint
-> > The method `.count()` gives you the number of non-NA observations per column.
-> > Try looking to the `.isnull()` method.
+> > ## Hints
+> > The method `.count()` gives you the number of non-NaN observations per column.
+> > Try looking at the `.isna()` method.
 > {: .solution}
 {: .challenge}
-
 ## Writing Out Data to CSV

-We've learned about using manipulating data to get desired outputs. But we've also discussed
+We've learned about manipulating data to get desired outputs. But we've also discussed
 keeping data that has been manipulated separate from our raw data. Something we might be interested
 in doing is working with only the columns that have full data. First, let's reload the data so
 we're not mixing up all of our previous manipulations.
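The Counting challenge above can be approached like this (tiny hypothetical frame, not the survey data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [10.0, np.nan, 30.0],
                   "sex": ["F", "M", None]})

# isna() gives a boolean mask; summing it counts missing values per column
print(df.isna().sum())

# Equivalently, subtract the non-NaN counts from the row total
print(len(df) - df.count())

# Filling NaN with the column mean, as in the lesson
print(df["weight"].fillna(df["weight"].mean()).tolist())  # [10.0, 20.0, 30.0]
```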
@@ -385,7 +382,7 @@ surveys_df = pd.read_csv("data/surveys.csv")
 ~~~
 {: .language-python}
 Next, let's drop all the rows that contain missing values. We will use the command `dropna`.
-By default, dropna removes rows that contain missing data for even just one column.
+By default, `dropna` removes rows that contain missing data for even just one column.

 ~~~
 df_na = surveys_df.dropna()
@@ -398,7 +395,7 @@ and 9 columns, much smaller than the 35549 row original.
 We can now use the `to_csv` command to export a DataFrame in CSV format. Note that the code
 below will by default save the data into the current working directory. We can
 save it to a different folder by adding the foldername and a slash before the filename:
-`df.to_csv('foldername/out.csv')`. We use 'index=False' so that
+`df.to_csv('foldername/out.csv')`. We use `index=False` so that
 pandas doesn't include the index number for each line.

 ~~~
@@ -423,4 +420,3 @@ What we've learned:
 + How to use `to_csv` to write manipulated data to a file.

 {% include links.md %}
-
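The `dropna` and `index=False` behaviour covered in the final hunks can be sketched end to end (hypothetical data, not the real surveys dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"record_id": [1, 2, 3],
                   "weight": [10.5, np.nan, 12.0]})

# dropna removes any row with a missing value in any column
df_na = df.dropna()
print(len(df_na))  # 2

# index=False keeps the row index out of the CSV; with no path
# argument, to_csv returns the CSV text instead of writing a file
print(df_na.to_csv(index=False))
```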