
Commit ec3f6f0

tobyhodges and deppen8 committed

Apply changes proposed by @deppen8 in #467, but without styling "pandas" with backticks.

- Make references more explicit to "pandas" rather than "Python"
- Add lots of `code` style formatting where appropriate
- Replace references to `isnull()` with `isna()`. These methods are exactly the same in pandas, but `isna()` is consistent with `fillna`, `dropna`, etc., which are also used here.
- Replace a reference to `survey_df.weight` with `survey_df['weight']`, the preferred notation style for pandas

Co-authored-by: Toby Hodges <tobyhodges@carpentries.org>
Co-authored-by: Jacob Deppen <deppen.8@gmail.com>

1 parent c71cb18 · commit ec3f6f0

1 file changed: _episodes/04-data-types-and-format.md
Lines changed: 48 additions & 52 deletions
@@ -6,26 +6,26 @@ questions:
 - "What types of data can be contained in a DataFrame?"
 - "Why is the data type important?"
 objectives:
-- "Describe how information is stored in a Python DataFrame."
-- "Define the two main types of data in Python: text and numerics."
+- "Describe how information is stored in a pandas DataFrame."
+- "Define the two main types of data in pandas: text and numerics."
 - "Examine the structure of a DataFrame."
 - "Modify the format of values in a DataFrame."
 - "Describe how data types impact operations."
-- "Define, manipulate, and interconvert integers and floats in Python."
+- "Define, manipulate, and interconvert integers and floats in Python/pandas."
 - "Analyze datasets having missing/null values (NaN values)."
 - "Write manipulated data to a file."
 keypoints:
-- "Pandas uses other names for data types than Python, for example: `object` for textual data."
+- "pandas uses other names for data types than Python, for example: `object` for textual data."
 - "A column in a DataFrame can only have one data type."
 - "The data type in a DataFrame's single column can be checked using `dtype`."
 - "Make conscious decisions about how to manage missing data."
 - "A DataFrame can be saved to a CSV file using the `to_csv` function."
 ---

 The format of individual columns and rows will impact analysis performed on a
-dataset read into Python. For example, you can't perform mathematical
+dataset read into a pandas DataFrame. For example, you can't perform mathematical
 calculations on a string (text formatted data). This might seem obvious,
-however sometimes numeric values are read into Python as strings. In this
+however sometimes numeric values are read into pandas as strings. In this
 situation, when you then try to perform calculations on the string-formatted
 numeric data, you get an error.
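The pitfall described in this hunk — numeric-looking values stored as strings — can be sketched with a tiny hypothetical DataFrame (not the lesson's `surveys.csv`):

```python
import pandas as pd

# A hypothetical column of numbers that was read in as strings
df = pd.DataFrame({"value": ["1", "2", "3"]})
print(df["value"].dtype)  # object (i.e. string), not a numeric dtype

# "+" on an object column concatenates strings instead of adding numbers
print((df["value"] + df["value"]).tolist())  # ['11', '22', '33']

# Converting to a numeric dtype first makes arithmetic behave as expected
print((df["value"].astype("int64") + 1).tolist())  # [2, 3, 4]
```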
@@ -44,26 +44,26 @@ this lesson: numeric and text data types.
 Numeric data types include integers and floats. A **floating point** (known as a
 float) number has decimal points even if that decimal point value is 0. For
 example: 1.13, 2.0, 1234.345. If we have a column that contains both integers and
-floating point numbers, Pandas will assign the entire column to the float data
+floating point numbers, pandas will assign the entire column to the float data
 type so the decimal points are not lost.

 An **integer** will never have a decimal point. Thus if we wanted to store 1.13 as
 an integer it would be stored as 1. Similarly, 1234.345 would be stored as 1234. You
-will often see the data type `Int64` in Python which stands for 64 bit integer. The 64
+will often see the data type `int64` in pandas, which stands for 64-bit integer. The 64
 refers to the memory allocated to store data in each cell which effectively
 relates to how many digits it can store in each "cell". Allocating space ahead of time
 allows computers to optimize storage and processing efficiency.

 ## Text Data Type

-Text data type is known as Strings in Python, or Objects in Pandas. Strings can
+The text data type is known as a *string* in Python, or *object* in pandas. Strings can
 contain numbers and / or characters. For example, a string might be a word, a
-sentence, or several sentences. A Pandas object might also be a plot name like
-'plot1'. A string can also contain or consist of numbers. For instance, '1234'
-could be stored as a string, as could '10.23'. However **strings that contain
+sentence, or several sentences. A pandas object might also be a plot name like
+`'plot1'`. A string can also contain or consist of numbers. For instance, `'1234'`
+could be stored as a string, as could `'10.23'`. However **strings that contain
 numbers can not be used for mathematical operations**!

-Pandas and base Python use slightly different names for data types. More on this
+pandas and base Python use slightly different names for data types. More on this
 is in the table below:

 | Pandas Type | Native Python Type | Description |
@@ -85,7 +85,6 @@ same `surveys.csv` dataset that we've used in previous lessons.
 ~~~
 # Make sure pandas is loaded
 import pandas as pd
-
 # Note that pd.read_csv is used because we imported pandas as pd
 surveys_df = pd.read_csv("data/surveys.csv")
 ~~~
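The `read_csv` call shown in this hunk infers one dtype per column. A minimal sketch, using an inline stand-in for `surveys.csv` (the column names and values here are assumptions, not the real dataset):

```python
import io
import pandas as pd

# Inline stand-in for data/surveys.csv; columns and values are hypothetical
csv_text = "record_id,species_id,weight\n1,NL,32.0\n2,DM,\n3,PE,21.5\n"
df = pd.read_csv(io.StringIO(csv_text))

# read_csv infers one dtype per column: int64, object, float64
print(df.dtypes)

# The empty weight cell was read as NaN, so that column is float64
print(df["weight"].dtype)  # float64
```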
@@ -103,9 +102,9 @@ pandas.core.frame.DataFrame
 ~~~
 {: .output}

-Next, let's look at the structure of our surveys data. In pandas, we can check
+Next, let's look at the structure of our `surveys_df` data. In pandas, we can check
 the type of one column in a DataFrame using the syntax
-`dataFrameName[column_name].dtype`:
+`dataframe_name['column_name'].dtype`:

 ~~~
 surveys_df['sex'].dtype
@@ -117,7 +116,7 @@ dtype('O')
 ~~~
 {: .output}

-A type 'O' just stands for "object" which in Pandas' world is a string
+A type 'O' just stands for "object" which in pandas is a string
 (text).

 ~~~
@@ -130,16 +129,16 @@ dtype('int64')
 ~~~
 {: .output}

-The type `int64` tells us that Python is storing each value within this column
-as a 64 bit integer. We can use the `dat.dtypes` command to view the data type
+The type `int64` tells us that pandas is storing each value within this column
+as a 64 bit integer. We can use the `dataframe_name.dtypes` command to view the data type
 for each column in a DataFrame (all at once).

 ~~~
 surveys_df.dtypes
 ~~~
 {: .language-python}

-which **returns**:
+which returns:

 ~~~
 record_id int64
@@ -155,16 +154,16 @@ dtype: object
 ~~~
 {: .language-python }

-Note that most of the columns in our Survey data are of type `int64`. This means
-that they are 64 bit integers. But the weight column is a floating point value
+Note that most of the columns in our `surveys_df` data are of type `int64`. This means
+that they are 64 bit integers. But the `weight` column is a floating point value
 which means it contains decimals. The `species_id` and `sex` columns are objects which
 means they contain strings.

 ## Working With Integers and Floats

 So we've learned that computers store numbers in one of two ways: as integers or
 as floating-point numbers (or floats). Integers are the numbers we usually count
-with. Floats have fractional parts (decimal places). Let's next consider how 
+with. Floats have fractional parts (decimal places). Let's next consider how
 the data type can impact mathematical operations on our data. Addition,
 subtraction, division and multiplication work on floats and integers as we'd expect.
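The arithmetic and conversion behaviour discussed above can be sketched with a small Series (hypothetical values, not the survey data):

```python
import pandas as pd

s = pd.Series([5, 10, 15])  # an int64 Series

# True division always produces floats, even on integer input
print((s / 2).dtype)  # float64

# astype converts between numeric dtypes explicitly
print(s.astype("float").tolist())  # [5.0, 10.0, 15.0]

# Converting floats to integers truncates the decimal part
print(pd.Series([1.9, 2.1]).astype("int64").tolist())  # [1, 2]
```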
@@ -264,21 +263,20 @@ dtype('float64')
 > Try converting the column `plot_id` to floats using
 >
 > ~~~
-> surveys_df.plot_id.astype("float")
+> surveys_df['plot_id'].astype("float")
 > ~~~
 > {: .language-python}
 >
-> Next try converting `weight` to an integer. What goes wrong here? What is Pandas telling you?
+> Next try converting `weight` to an integer. What goes wrong here? What is pandas telling you?
 > We will talk about some solutions to this later.
 {: .challenge}
-
 ## Missing Data Values - NaN

 What happened in the last challenge activity? Notice that this throws a value error:
 `ValueError: Cannot convert NA to integer`. If we look at the `weight` column in the surveys
 data we notice that there are NaN (**N**ot **a** **N**umber) values. **NaN** values are undefined
-values that cannot be represented mathematically. Pandas, for example, will read
-an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we
+values that cannot be represented mathematically. pandas, for example, will read
+an empty cell in a CSV or Excel sheet as `NaN`. NaNs have some desirable properties: if we
 were to average the `weight` column without replacing our NaNs, Python would know to skip
 over those cells.
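The two NaN behaviours mentioned in this hunk — aggregations skipping NaN, and the integer-conversion error from the challenge — can be sketched as:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0])

# Aggregations skip NaN by default, so the mean is (2 + 4) / 2
print(s.mean())  # 3.0

# Casting a column that contains NaN to a plain integer dtype raises
# a ValueError, as in the challenge above
try:
    s.astype("int64")
except ValueError as err:
    print("conversion failed:", err)
```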
@@ -302,26 +300,26 @@ values were handled.
 For instance, in some disciplines, like Remote Sensing, missing data values are
 often defined as -9999. Having a bunch of -9999 values in your data could really
 alter numeric calculations. Often in spreadsheets, cells are left empty where no
-data are available. Pandas will, by default, replace those missing values with
-NaN. However it is good practice to get in the habit of intentionally marking
-cells that have no data, with a no data value! That way there are no questions
+data are available. pandas will, by default, replace those missing values with
+`NaN`. However, it is good practice to get in the habit of intentionally marking
+cells that have no data with a no data value! That way there are no questions
 in the future when you (or someone else) explores your data.

 ### Where Are the NaN's?

-Let's explore the NaN values in our data a bit further. Using the tools we
-learned in lesson 02, we can figure out how many rows contain NaN values for
-weight. We can also create a new subset from our data that only contains rows
-with weight values > 0 (i.e., select meaningful weight values):
+Let's explore the `NaN` values in our data a bit further. Using the tools we
+learned in lesson 02, we can figure out how many rows contain `NaN` values for
+`weight`. We can also create a new subset from our data that only contains rows
+with `weight > 0` (i.e., select meaningful weight values):

 ~~~
-len(surveys_df[pd.isnull(surveys_df.weight)])
+len(surveys_df[surveys_df['weight'].isna()])
 # How many rows have weight values?
-len(surveys_df[surveys_df.weight > 0])
+len(surveys_df[surveys_df['weight'] > 0])
 ~~~
 {: .language-python}

-We can replace all NaN values with zeroes using the `.fillna()` method (after
+We can replace all `NaN` values with zeroes using the `.fillna()` method (after
 making a copy of the data so we don't lose our work):

 ~~~
@@ -331,8 +329,8 @@ df1['weight'] = df1['weight'].fillna(0)
 ~~~
 {: .language-python}

-However NaN and 0 yield different analysis results. The mean value when NaN
-values are replaced with 0 is different from when NaN values are simply thrown
+However `NaN` and `0` yield different analysis results. The mean value when `NaN`
+values are replaced with `0` is different from when `NaN` values are simply thrown
 out or ignored.

 ~~~
@@ -345,37 +343,36 @@ df1['weight'].mean()
 ~~~
 {: .output}

-We can fill NaN values with any value that we chose. The code below fills all
-NaN values with a mean for all weight values.
+We can fill `NaN` values with any value that we choose. The code below fills all
+`NaN` values with the mean of all weight values.

 ~~~
 df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
 ~~~
 {: .language-python}

 We could also choose to create a subset of our data, only keeping rows that do
-not contain NaN values.
+not contain `NaN` values.

 The point is to make conscious decisions about how to manage missing data. This
 is where we think about how our data will be used and how these values will
 impact the scientific conclusions made from the data.

-Python gives us all of the tools that we need to account for these issues. We
+pandas gives us all of the tools that we need to account for these issues. We
 just need to be cautious about how the decisions that we make impact scientific
 results.

 > ## Counting
 > Count the number of missing values per column.
 >
-> > ## Hint
-> > The method `.count()` gives you the number of non-NA observations per column.
-> > Try looking to the `.isnull()` method.
+> > ## Hints
+> > The method `.count()` gives you the number of non-NaN observations per column.
+> > Try looking at the `.isna()` method.
 > {: .solution}
 {: .challenge}
-
 ## Writing Out Data to CSV

-We've learned about using manipulating data to get desired outputs. But we've also discussed
+We've learned about manipulating data to get desired outputs. But we've also discussed
 keeping data that has been manipulated separate from our raw data. Something we might be interested
 in doing is working with only the columns that have full data. First, let's reload the data so
 we're not mixing up all of our previous manipulations.
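The Counting challenge above can be approached like this (tiny hypothetical frame, not the survey data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [10.0, np.nan, 30.0],
                   "sex": ["F", "M", None]})

# isna() gives a boolean mask; summing it counts missing values per column
print(df.isna().sum())

# Equivalently, subtract the non-NaN counts from the row total
print(len(df) - df.count())

# Filling NaN with the column mean, as in the lesson
print(df["weight"].fillna(df["weight"].mean()).tolist())  # [10.0, 20.0, 30.0]
```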
@@ -385,7 +382,7 @@ surveys_df = pd.read_csv("data/surveys.csv")
 ~~~
 {: .language-python}
 Next, let's drop all the rows that contain missing values. We will use the command `dropna`.
-By default, dropna removes rows that contain missing data for even just one column.
+By default, `dropna` removes rows that contain missing data for even just one column.

 ~~~
 df_na = surveys_df.dropna()
@@ -398,7 +395,7 @@ and 9 columns, much smaller than the 35549 row original.
 We can now use the `to_csv` command to export a DataFrame in CSV format. Note that the code
 below will by default save the data into the current working directory. We can
 save it to a different folder by adding the foldername and a slash before the filename:
-`df.to_csv('foldername/out.csv')`. We use 'index=False' so that
+`df.to_csv('foldername/out.csv')`. We use `index=False` so that
 pandas doesn't include the index number for each line.

 ~~~
@@ -423,4 +420,3 @@ What we've learned:
 + How to use `to_csv` to write manipulated data to a file.

 {% include links.md %}
-
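The `dropna` and `index=False` behaviour covered in the final hunks can be sketched end to end (hypothetical data, not the real surveys dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"record_id": [1, 2, 3],
                   "weight": [10.5, np.nan, 12.0]})

# dropna removes any row with a missing value in any column
df_na = df.dropna()
print(len(df_na))  # 2

# index=False keeps the row index out of the CSV; with no path
# argument, to_csv returns the CSV text instead of writing a file
print(df_na.to_csv(index=False))
```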