Commit 76fef9d

03-data-types-and-format: fix metadata & code blocks
1 parent 7219b49 commit 76fef9d

1 file changed

Lines changed: 62 additions & 45 deletions

_episodes/03-data-types-and-format.md

@@ -3,19 +3,19 @@ title: Data Types and Formats
33
teaching: 20
44
exercises: 25
55
questions:
6-
- " What types of data can be contained in a DataFrame?
7-
"
8-
- " Why is the data type important? "
6+
- "What types of data can be contained in a DataFrame?"
7+
- "Why is the data type important?"
98
objectives:
10-
- Describe how information is stored in a Python DataFrame.
9+
- "Describe how information is stored in a Python DataFrame."
1110
- "Define the two main types of data in Python: text and numerics."
12-
- Examine the structure of a DataFrame.
13-
- Modify the format of values in a DataFrame.
14-
- Describe how data types impact operations.
15-
- Define, manipulate, and interconvert integers and floats in Python.
16-
- Analyze datasets having missing/null values (NaN values).
17-
- Write manipulated data to a file.
18-
11+
- "Examine the structure of a DataFrame."
12+
- "Modify the format of values in a DataFrame."
13+
- "Describe how data types impact operations."
14+
- "Define, manipulate, and interconvert integers and floats in Python."
15+
- "Analyze datasets having missing/null values (NaN values)."
16+
- "Write manipulated data to a file."
17+
keypoints:
18+
- "FIXME"
1919
---
2020

2121
The format of individual columns and rows will impact analysis performed on a
@@ -78,45 +78,50 @@ Now that we're armed with a basic understanding of numeric and text data
7878
types, let's explore the format of our survey data. We'll be working with the
7979
same `surveys.csv` dataset that we've used in previous lessons.
8080

81-
```python
81+
~~~
8282
# Note that pd.read_csv is used because we imported pandas as pd
8383
surveys_df = pd.read_csv("data/surveys.csv")
84-
```
84+
~~~
85+
{: .language-python}
8586

8687
Remember that we can check the type of an object like this:
8788

88-
```python
89+
~~~
8990
type(surveys_df)
90-
```
91+
~~~
92+
{: .language-python}
9193

9294
**OUTPUT:** `pandas.core.frame.DataFrame`
9395

9496
Next, let's look at the structure of our surveys data. In pandas, we can check
9597
the type of one column in a DataFrame using the syntax
9698
`dataFrameName[column_name].dtype`:
9799

98-
```python
100+
~~~
99101
surveys_df['sex'].dtype
100-
```
102+
~~~
103+
{: .language-python}
101104

102105
**OUTPUT:** `dtype('O')`
103106

104107
A type 'O' just stands for "object", which in pandas is most often a string
105108
(text).
106109

107-
```python
110+
~~~
108111
surveys_df['record_id'].dtype
109-
```
112+
~~~
113+
{: .language-python}
110114

111115
**OUTPUT:** `dtype('int64')`
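
The per-column checks above can be tried without the survey file. The sketch below uses a small made-up DataFrame (`toy_df` and its values are hypothetical stand-ins, not rows from `surveys.csv`). Depending on the pandas version, a text column may be reported as `object` or as a dedicated string type, so the example only checks the numeric columns.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the survey data
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "sex": ["M", "F", "M"],
    "weight": [40.0, 41.5, 48.0],
})

# .dtype on a single column returns that column's data type
record_id_dtype = toy_df["record_id"].dtype
weight_dtype = toy_df["weight"].dtype
sex_dtype = toy_df["sex"].dtype

# Helper predicates avoid depending on the exact dtype name
is_int = pd.api.types.is_integer_dtype(record_id_dtype)
is_float = pd.api.types.is_float_dtype(weight_dtype)

print(record_id_dtype, weight_dtype, sex_dtype)
```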
112116

113117
The type `int64` tells us that Python is storing each value within this column
114118
as a 64-bit integer. We can use the `surveys_df.dtypes` attribute to view the data type
115119
for each column in a DataFrame (all at once).
116120

117-
```python
121+
~~~
118122
surveys_df.dtypes
119-
```
123+
~~~
124+
{: .language-python}
120125

121126
which **returns**:
122127

@@ -146,31 +151,33 @@ with. Floats have fractional parts (decimal places). Let's next consider how
146151
the data type can impact mathematical operations on our data. Addition,
147152
subtraction, division and multiplication work on floats and integers as we'd expect.
148153

149-
```python
154+
~~~
150155
print(5+5)
151156
10
152157
153158
print(24-4)
154159
20
155-
```
160+
~~~
161+
{: .language-python}
156162

157163
If we divide one integer by another, we get a float.
158164
The result in Python 3 is different from that in Python 2, where the result is an
159165
integer (integer division).
160166

161-
```python
167+
~~~
162168
print(5/9)
163169
0.5555555555555556
164170
165171
print(10/3)
166172
3.3333333333333335
167-
```
173+
~~~
174+
{: .language-python}
168175

169176
We can also convert a floating point number to an integer or an integer to
170177
a floating point number. Notice that Python truncates toward zero (drops the fractional part) when it
171178
converts from floating point to integer.
172179

173-
```python
180+
~~~
174181
# Convert a to an integer
175182
a = 7.83
176183
int(a)
@@ -180,19 +187,21 @@ int(a)
180187
b = 7
181188
float(b)
182189
7.0
183-
```
190+
~~~
191+
{: .language-python}
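
As a small illustration of the float-to-integer conversion above, the sketch below contrasts `int()` with `math.floor()` and `round()`. The variable names are made up for this example. The difference between truncating and rounding down only shows up for negative numbers.

```python
import math

# int() drops the fractional part, i.e. it truncates toward zero
positive = int(7.83)
negative = int(-7.83)

# math.floor() always rounds down, so it differs from int() for negatives
floored = math.floor(-7.83)

# round() goes to the nearest integer instead of truncating
nearest = round(7.83)

print(positive, negative, floored, nearest)  # 7 -7 -8 8
```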
184192

185193
# Working With Our Survey Data
186194

187195
Getting back to our data, we can modify the format of values within our data, if
188196
we want. For instance, we could convert the `record_id` field to floating point
189197
values.
190198

191-
```python
199+
~~~
192200
# Convert the record_id field from an integer to a float
193201
surveys_df['record_id'] = surveys_df['record_id'].astype('float64')
194202
surveys_df['record_id'].dtype
195-
```
203+
~~~
204+
{: .language-python}
196205

197206
**OUTPUT:** `dtype('float64')`
198207

@@ -219,10 +228,11 @@ an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable propert
219228
were to average the `weight` column without replacing our NaNs, pandas would know to skip
220229
over those cells.
221230
222-
```python
231+
~~~
223232
surveys_df['weight'].mean()
224233
42.672428212991356
225-
```
234+
~~~
235+
{: .language-python}
226236
Dealing with missing data values is always a challenge. It's sometimes hard to
227237
know why values are missing - was it because of a data entry error? Or data that
228238
someone was unable to collect? Should the value be 0? We need to know how
@@ -245,36 +255,40 @@ learned in lesson 02, we can figure out how many rows contain NaN values for
245255
weight. We can also create a new subset from our data that only contains rows
246256
with weight values > 0 (i.e., select meaningful weight values):
247257
248-
```python
258+
~~~
249259
len(surveys_df[pd.isnull(surveys_df.weight)])
250260
# How many rows have weight values?
251261
len(surveys_df[surveys_df.weight > 0])
252-
```
262+
~~~
263+
{: .language-python}
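
The two counts above can be checked on a tiny made-up column (`toy_df` is hypothetical). Note that the `> 0` filter drops NaN rows and genuine zeros alike, which may or may not be what an analysis needs.

```python
import pandas as pd

# Hypothetical weights: one missing value and one true zero
toy_df = pd.DataFrame({"weight": [40.0, None, 0.0, 48.0]})

# Rows where weight is NaN
n_missing = len(toy_df[pd.isnull(toy_df.weight)])

# Rows where weight is strictly positive (NaN compares as False)
n_positive = len(toy_df[toy_df.weight > 0])

# An equivalent way to count missing values
n_missing_alt = int(toy_df["weight"].isna().sum())

print(n_missing, n_positive, n_missing_alt)  # 1 2 1
```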
253264
254265
We can replace all NaN values with zeroes using the `.fillna()` method (after
255266
making a copy of the data so we don't lose our work):
256267
257-
```python
268+
~~~
258269
df1 = surveys_df.copy()
259270
# Fill all NaN values with 0
260271
df1['weight'] = df1['weight'].fillna(0)
261-
```
272+
~~~
273+
{: .language-python}
262274
263275
However, NaN and 0 yield different analysis results. The mean value when NaN
264276
values are replaced with 0 is different from when NaN values are simply thrown
265277
out or ignored.
266278
267-
```python
279+
~~~
268280
df1['weight'].mean()
269281
38.751976145601844
270-
```
282+
~~~
283+
{: .language-python}
271284
272285
We can fill NaN values with any value that we choose. The code below fills all
273286
NaN values with the mean of all weight values.
274287
275-
```python
288+
~~~
276289
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
277-
```
290+
~~~
291+
{: .language-python}
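
To make the effect of each choice concrete, here is a minimal sketch on a made-up Series (the values are hypothetical, not taken from the survey data). It compares the mean when NaNs are skipped, replaced with 0, and replaced with the column mean.

```python
import pandas as pd

# Hypothetical weights with one missing value
weights = pd.Series([10.0, 20.0, None, 30.0])

# By default, mean() skips NaN: (10 + 20 + 30) / 3
mean_skipping_nan = weights.mean()

# Replacing NaN with 0 pulls the mean down: (10 + 20 + 0 + 30) / 4
mean_with_zeros = weights.fillna(0).mean()

# Filling NaN with the mean leaves the mean unchanged
mean_with_mean_fill = weights.fillna(weights.mean()).mean()

print(mean_skipping_nan, mean_with_zeros, mean_with_mean_fill)  # 20.0 15.0 20.0
```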
278292
279293
We could also choose to create a subset of our data, only keeping rows that do
280294
not contain NaN values.
@@ -299,16 +313,18 @@ keeping data that has been manipulated separate from our raw data. Something we
299313
in doing is working with only the columns that have full data. First, let's reload the data so
300314
we're not mixing up all of our previous manipulations.
301315
302-
```python
316+
~~~
303317
surveys_df = pd.read_csv("data/surveys.csv")
304-
```
318+
~~~
319+
{: .language-python}
305320
Next, let's drop all the rows that contain missing values. We will use the command `dropna`.
306321
By default, `dropna` removes rows that contain missing data in even just one column.
307322
308-
```python
323+
~~~
309324
df_na = surveys_df.dropna()
310325
311-
```
326+
~~~
327+
{: .language-python}
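
A small sketch of what `dropna` does along each axis, using a made-up DataFrame (`toy_df` is hypothetical). With no arguments it drops rows; passing `axis=1` drops columns instead.

```python
import pandas as pd

# Hypothetical data: the second row is missing its weight
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "weight": [40.0, None, 48.0],
})

# Default (axis=0): drop every row that has at least one NaN
rows_kept = toy_df.dropna()

# axis=1: drop every column that has at least one NaN
cols_kept = toy_df.dropna(axis=1)

print(rows_kept.shape, cols_kept.shape)  # (2, 2) (3, 1)
```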
312328
313329
If you now type `df_na`, you should observe that the resulting DataFrame has 30676 rows
314330
and 9 columns, much smaller than the 35549 row original.
@@ -319,10 +335,11 @@ save it to a different folder by adding the foldername and a slash before the fi
319335
`df.to_csv('foldername/out.csv')`. We use `index=False` so that
320336
pandas doesn't include the index number for each line.
321337
322-
```python
338+
~~~
323339
# Write DataFrame to CSV
324340
df_na.to_csv('data_output/surveys_complete.csv', index=False)
325-
```
341+
~~~
342+
{: .language-python}
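
One way to confirm what `index=False` does is to write a small made-up DataFrame to a temporary file and look at the header line. The file name `toy_complete.csv` and the data are assumptions for this sketch only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for the cleaned survey table
toy_df = pd.DataFrame({"record_id": [1, 2], "weight": [40.0, 48.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_complete.csv")

    # index=False leaves the row index out of the file
    toy_df.to_csv(out_path, index=False)

    # The header contains only the real column names
    with open(out_path) as handle:
        header = handle.readline().strip()

    # Reading the file back gives the same shape
    reloaded = pd.read_csv(out_path)

print(header)  # record_id,weight
```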
326343
327344
We will use this data file later in the workshop. Check out your working directory to make
328345
sure the CSV wrote out properly, and that you can open it! If you want, try to bring it
