@@ -3,19 +3,19 @@ title: Data Types and Formats
33teaching : 20
44exercises : 25
55questions :
6- - " What types of data can be contained in a DataFrame?
7- "
8- - " Why is the data type important? "
6+ - " What types of data can be contained in a DataFrame?"
7+ - " Why is the data type important?"
98objectives :
10- - Describe how information is stored in a Python DataFrame.
9+ - " Describe how information is stored in a Python DataFrame."
1110 - " Define the two main types of data in Python: text and numerics."
12- - Examine the structure of a DataFrame.
13- - Modify the format of values in a DataFrame.
14- - Describe how data types impact operations.
15- - Define, manipulate, and interconvert integers and floats in Python.
16- - Analyze datasets having missing/null values (NaN values).
17- - Write manipulated data to a file.
18-
11+ - " Examine the structure of a DataFrame."
12+ - " Modify the format of values in a DataFrame."
13+ - " Describe how data types impact operations."
14+ - " Define, manipulate, and interconvert integers and floats in Python."
15+ - " Analyze datasets having missing/null values (NaN values)."
16+ - " Write manipulated data to a file."
17+ keypoints :
18+ - " FIXME"
1919---
2020
2121The format of individual columns and rows will impact analysis performed on a
@@ -78,45 +78,50 @@ Now that we're armed with a basic understanding of numeric and text data
7878types, let's explore the format of our survey data. We'll be working with the
7979same `surveys.csv` dataset that we've used in previous lessons.
8080
81- ``` python
81+ ~~~
8282# Note that pd.read_csv is used because we imported pandas as pd
8383surveys_df = pd.read_csv("data/surveys.csv")
84- ```
84+ ~~~
85+ {: .language-python}
8586
8687Remember that we can check the type of an object like this:
8788
88- ``` python
89+ ~~~
8990type(surveys_df)
90- ```
91+ ~~~
92+ {: .language-python}
9193
9294**OUTPUT:** `pandas.core.frame.DataFrame`
9395
9496Next, let's look at the structure of our surveys data. In pandas, we can check
9597the type of one column in a DataFrame using the syntax
9698`dataFrameName[column_name].dtype`:
9799
98- ``` python
100+ ~~~
99101surveys_df['sex'].dtype
100- ```
102+ ~~~
103+ {: .language-python}
101104
102105**OUTPUT:** `dtype('O')`
103106
104107A type 'O' stands for "object", which is the dtype pandas uses for columns
105108that hold strings (text).
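
To see the text/numeric distinction without loading the survey file, here is a minimal sketch on a small made-up DataFrame (the name `toy_df` and its values are hypothetical, not part of `surveys.csv`). Depending on your pandas version, the text column may be reported as `object` or as a dedicated string dtype, so the sketch checks the kind of dtype rather than its exact name.

```python
import pandas as pd

# Hypothetical toy data standing in for two columns of the survey table
toy_df = pd.DataFrame({"sex": ["M", "F", "M"], "record_id": [1, 2, 3]})

sex_dtype = toy_df["sex"].dtype        # text column
id_dtype = toy_df["record_id"].dtype   # integer column

print(sex_dtype)
print(id_dtype)

# Ask pandas what kind of data each dtype represents
print(pd.api.types.is_numeric_dtype(sex_dtype))
print(pd.api.types.is_integer_dtype(id_dtype))
```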
106109
107- ``` python
110+ ~~~
108111surveys_df['record_id'].dtype
109- ```
112+ ~~~
113+ {: .language-python}
110114
111115**OUTPUT:** `dtype('int64')`
112116
113117The type `int64` tells us that Python is storing each value within this column
114118as a 64-bit integer. We can use the `surveys_df.dtypes` attribute to view the data type
115119for each column in a DataFrame (all at once).
116120
117- ``` python
121+ ~~~
118122surveys_df.dtypes
119- ```
123+ ~~~
124+ {: .language-python}
120125
121126which **returns**:
122127
@@ -146,31 +151,33 @@ with. Floats have fractional parts (decimal places). Let's next consider how
146151the data type can impact mathematical operations on our data. Addition,
147152subtraction, division and multiplication work on floats and integers as we'd expect.
148153
149- ``` python
154+ ~~~
150155print(5+5)
15115610
152157
153158print(24-4)
15415920
155- ```
160+ ~~~
161+ {: .language-python}
156162
157163If we divide one integer by another, we get a float.
158164This result in Python 3 differs from Python 2, where dividing two integers
159165returns an integer (integer division).
160166
161- ``` python
167+ ~~~
162168print(5/9)
1631690.5555555555555556
164170
165171print(10/3)
1661723.3333333333333335
167- ```
173+ ~~~
174+ {: .language-python}
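
If we do want an integer result in Python 3, the floor division operator `//` is available. The short sketch below contrasts the two operators; the variable names are just for illustration.

```python
true_quotient = 7 / 2     # true division always returns a float
floor_quotient = 7 // 2   # floor division of two integers returns an integer

print(true_quotient, type(true_quotient))
print(floor_quotient, type(floor_quotient))
```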
168175
169176We can also convert a floating point number to an integer or an integer to a
170177floating point number. Notice that Python truncates toward zero (drops the
171178fractional part) when it converts from floating point to integer.
172179
173- ``` python
180+ ~~~
174181# Convert a to an integer
175182a = 7.83
176183int(a)
@@ -180,19 +187,21 @@ int(a)
180187b = 7
181188float(b)
1821897.0
183- ```
190+ ~~~
191+ {: .language-python}
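
One detail worth checking when converting floats to integers: `int()` drops the fractional part, which for negative numbers is not the same as rounding down. A small sketch with made-up values, using `math.floor` from the standard library for comparison:

```python
import math

positive = 7.83
negative = -7.83

truncated_pos = int(positive)        # fractional part dropped
truncated_neg = int(negative)        # moves toward zero, not downward
floored_neg = math.floor(negative)   # always rounds down

print(truncated_pos, truncated_neg, floored_neg)
```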
184192
185193# Working With Our Survey Data
186194
187195Getting back to our data, we can modify the format of values within our data, if
188196we want. For instance, we could convert the `record_id` field to floating point
189197values.
190198
191- ``` python
199+ ~~~
192200# Convert the record_id field from an integer to a float
193201surveys_df['record_id'] = surveys_df['record_id'].astype('float64')
194202surveys_df['record_id'].dtype
195- ```
203+ ~~~
204+ {: .language-python}
196205
197206**OUTPUT:** `dtype('float64')`
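
The same `astype` call works in both directions. As a minimal sketch on a made-up Series (the name `ids` is hypothetical), we can convert integers to floats and back again when every value is a whole number:

```python
import pandas as pd

# Hypothetical stand-in for the record_id column
ids = pd.Series([1, 2, 3])

ids_float = ids.astype("float64")      # integers -> floats
ids_back = ids_float.astype("int64")   # floats -> integers (all values are whole)

print(ids_float.dtype)
print(ids_back.dtype)
```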
198207
@@ -219,10 +228,11 @@ an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable propert
219228were to average the `weight` column without replacing our NaNs, Python would know to skip
220229over those cells.
221230
222- ```python
231+ ~~~
223232surveys_df['weight'].mean()
22423342.672428212991356
225- ```
234+ ~~~
235+ {: .language-python}
226236Dealing with missing data values is always a challenge. It's sometimes hard to
227237know why values are missing - was it because of a data entry error? Or data that
228238someone was unable to collect? Should the value be 0? We need to know how
@@ -245,36 +255,40 @@ learned in lesson 02, we can figure out how many rows contain NaN values for
245255weight. We can also create a new subset from our data that only contains rows
246256with weight values > 0 (i.e., select meaningful weight values):
247257
248- ``` python
258+ ~~~
249259len(surveys_df[pd.isnull(surveys_df.weight)])
250260# How many rows have weight values?
251261len(surveys_df[surveys_df.weight > 0])
252- ```
262+ ~~~
263+ {: .language-python}
253264
254265We can replace all NaN values with zeroes using the `.fillna()` method (after
256267making a copy of the data so we don't lose our work):
256267
257- ``` python
268+ ~~~
258269df1 = surveys_df.copy()
259270# Fill all NaN values with 0
260271df1['weight'] = df1['weight'].fillna(0)
261- ```
272+ ~~~
273+ {: .language-python}
262274
263275However, NaN and 0 yield different analysis results. The mean value when NaN
264276values are replaced with 0 is different from when NaN values are simply thrown
265277out or ignored.
266278
267- ``` python
279+ ~~~
268280df1['weight'].mean()
26928138.751976145601844
270- ```
282+ ~~~
283+ {: .language-python}
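
The arithmetic behind this difference is easier to follow on a tiny example. The sketch below uses a made-up Series (`weights` is hypothetical, not the survey column): skipping NaN averages over two values, while filling with 0 averages over four.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with two missing entries
weights = pd.Series([10.0, np.nan, 30.0, np.nan])

n_missing = weights.isnull().sum()            # count of NaN entries
mean_skipping_nan = weights.mean()            # (10 + 30) / 2
mean_zero_filled = weights.fillna(0).mean()   # (10 + 0 + 30 + 0) / 4

print(n_missing, mean_skipping_nan, mean_zero_filled)
```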
271284
272285We can fill NaN values with any value that we choose. The code below fills all
273286NaN values with the mean of all weight values.
274287
275- ``` python
288+ ~~~
276289df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
277- ```
290+ ~~~
291+ {: .language-python}
278292
279293We could also choose to create a subset of our data, only keeping rows that do
280294not contain NaN values.
@@ -299,16 +313,18 @@ keeping data that has been manipulated separate from our raw data. Something we
299313in doing is working with only the columns that have full data. First, let's reload the data so
300314we're not mixing up all of our previous manipulations.
301315
302- ``` python
316+ ~~~
303317surveys_df = pd.read_csv("data/surveys.csv")
304- ```
318+ ~~~
319+ {: .language-python}
305320Next, let's drop all the rows that contain missing values. We will use the command `dropna`.
306321By default, dropna removes rows that contain missing data for even just one column.
307322
308- ``` python
323+ ~~~
309324df_na = surveys_df.dropna()
310325
311- ```
326+ ~~~
327+ {: .language-python}
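
To check which axis `dropna` works on, here is a minimal sketch on a made-up three-row table (`toy_df` and its values are hypothetical). With no arguments, `dropna` drops any row containing a missing value; passing `axis=1` drops columns instead.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing weight
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "weight": [40.0, np.nan, 48.0],
    "sex": ["M", "F", "F"],
})

rows_kept = toy_df.dropna()         # default: drop rows with any NaN
cols_kept = toy_df.dropna(axis=1)   # drop columns with any NaN instead

print(rows_kept.shape)
print(cols_kept.shape)
```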
312328
313329If you now type `df_na`, you should observe that the resulting DataFrame has 30676 rows
314330and 9 columns, much smaller than the 35549 row original.
@@ -319,10 +335,11 @@ save it to a different folder by adding the foldername and a slash before the fi
319335`df.to_csv('foldername/out.csv')`. We use 'index=False' so that
320336pandas doesn't include the index number for each line.
321337
322- ``` python
338+ ~~~
323339# Write DataFrame to CSV
324340df_na.to_csv('data_output/surveys_complete.csv', index=False)
325- ```
341+ ~~~
342+ {: .language-python}
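
To see what `index=False` changes without touching the file system, the sketch below writes a made-up table to in-memory text buffers (`io.StringIO` from the standard library) and compares the header lines. The names `toy_df`, `with_index`, and `without_index` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical two-row table
toy_df = pd.DataFrame({"record_id": [1, 2], "weight": [40.0, 48.0]})

with_index = io.StringIO()
toy_df.to_csv(with_index)                   # default: index is written as the first column

without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)   # index column is left out

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

print(repr(header_with))
print(repr(header_without))
```

With the index included, the header starts with an empty, unnamed column; with `index=False`, only the data columns are written.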
326343
327344We will use this data file later in the workshop. Check out your working directory to make
328345sure the CSV wrote out properly, and that you can open it! If you want, try to bring it