@@ -3,14 +3,16 @@ title: Data Workflows and Automation
33teaching : 40
44exercises : 50
55questions :
6- - " Can I automate operations in Python? "
7- - " What are functions and why should I use them? "
6+ - " Can I automate operations in Python?"
7+ - " What are functions and why should I use them?"
88objectives :
9- - Describe why for loops are used in Python.
10- - Employ for loops to automate data analysis.
11- - Write unique filenames in Python.
12- - Build reusable code in Python.
13- - Write functions using conditional statements (if, then, else).
9+ - " Describe why for loops are used in Python."
10+ - " Employ for loops to automate data analysis."
11+ - " Write unique filenames in Python."
12+ - " Build reusable code in Python."
13+ - " Write functions using conditional statements (if, then, else)."
14+ keypoints :
15+ - " FIXME"
1416---
1517
1618So far, we've used Python and the pandas library to explore and manipulate
@@ -30,7 +32,7 @@ errors by making mistakes while processing each file by hand.
3032Let's write a simple for loop that simulates what a kid might see during a
3133visit to the zoo:
3234
33- ``` python
35+ ~~~
3436>>> animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
3537>>> print(animals)
3638['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
4244crocodile
4345vulture
4446hippo
45- ```
47+ ~~~
48+ {: .language-python}
4649
4750The line defining the loop must start with `for` and end with a colon, and the
4851body of the loop must be indented.
@@ -52,14 +55,15 @@ entry in `animals` every time the loop goes around. We can call the loop variabl
5255anything we like. After the loop finishes, the loop variable will still exist
5356and will have the value of the last entry in the collection:
5457
55- ``` python
58+ ~~~
5659>>> animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
5760>>> for creature in animals:
5861... pass
5962
6063>>> print('The loop variable is now: ' + creature)
6164The loop variable is now: hippo
62- ```
65+ ~~~
66+ {: .language-python}
6367
6468We are not asking Python to print the value of the loop variable anymore, but
6569the for loop still runs and the value of `creature` changes on each pass through
@@ -83,16 +87,17 @@ file.
8387Let's start by making a new directory inside the folder `data` to store all of
8488these files using the module `os`:
8589
86- ``` python
90+ ~~~
8791 import os
8892
8993 os.mkdir('data/yearly_files')
90- ```
94+ ~~~
95+ {: .language-python}
9196
9297The command `os.mkdir` is equivalent to `mkdir` in the shell. Just so we are
9398sure, we can check that the new directory was created within the `data` folder:
9499
95- ``` python
100+ ~~~
96101>>> os.listdir('data')
97102['plots.csv',
98103 'portal_mammals.sqlite',
@@ -102,7 +107,8 @@ sure, we can check that the new directory was created within the `data` folder:
102107 'surveys.csv',
103108 'surveys2002_temp.csv',
104109 'yearly_files']
105- ```
110+ ~~~
111+ {: .language-python}
106112
107113The command `os.listdir` is equivalent to `ls` in the shell.
108114
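One caveat: `os.mkdir` raises `FileExistsError` if the directory already exists, so running the `os.mkdir` command above a second time fails. Below is a minimal sketch of a re-runnable alternative using `os.makedirs` with `exist_ok=True`. It assumes a temporary scratch folder (`base_dir`, a name made up for this illustration) rather than the lesson's `data` folder:

```python
import os
import tempfile

# Hypothetical scratch location standing in for the lesson's 'data' folder
base_dir = tempfile.mkdtemp()
target = os.path.join(base_dir, 'yearly_files')

# Creates the directory (and any missing parents)
os.makedirs(target, exist_ok=True)

# A second call is harmless because exist_ok=True suppresses the error
os.makedirs(target, exist_ok=True)

print(os.listdir(base_dir))
```

Whether to tolerate an existing directory is a design choice: a plain `os.mkdir` is stricter and can catch an accidental overwrite of earlier results.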
@@ -111,7 +117,7 @@ data into memory as a DataFrame, how to select a subset of the data using some
111117criteria, and how to write the DataFrame into a CSV file. Let's write a script
112118that performs those three steps in sequence for the year 2002:
113119
114- ``` python
120+ ~~~
115121import pandas as pd
116122
117123# Load the data into a DataFrame
@@ -122,7 +128,8 @@ surveys2002 = surveys_df[surveys_df.year == 2002]
122128
123129# Write the new DataFrame to a CSV file
124130surveys2002.to_csv('data/yearly_files/surveys2002.csv')
125- ```
131+ ~~~
132+ {: .language-python}
126133
127134To create yearly data files, we could repeat the last two commands over and
128135over, once for each year of data. Repeating code is neither elegant nor
@@ -138,7 +145,7 @@ confirm that the loop is behaving as we expect.
138145We have seen that we can loop over a list of items, so we need a list of years
139146to loop over. We can get the years in our DataFrame with:
140147
141- ``` python
148+ ~~~
142149>>> surveys_df['year']
143150
1441510 1977
@@ -150,21 +157,23 @@ to loop over. We can get the years in our DataFrame with:
15015735546 2002
15115835547 2002
15215935548 2002
153- ```
160+ ~~~
161+ {: .language-python}
154162
155163but we want only unique years, which we can get using the `unique` method
156- which we have already seen.
164+ which we have already seen.
157165
158- ``` python
166+ ~~~
159167>>> surveys_df['year'].unique()
160168array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
161169 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
162170 1999, 2000, 2001, 2002], dtype=int64)
163- ```
171+ ~~~
172+ {: .language-python}
164173
165174Putting this into our for loop, we get:
166175
167- ``` python
176+ ~~~
168177>>> for year in surveys_df['year'].unique():
169178... filename='data/yearly_files/surveys' + str(year) + '.csv'
170179... print(filename)
@@ -195,11 +204,12 @@ data/yearly_files/surveys1999.csv
195204data/yearly_files/surveys2000.csv
196205data/yearly_files/surveys2001.csv
197206data/yearly_files/surveys2002.csv
198- ```
207+ ~~~
208+ {: .language-python}
199209
200210We can now add the rest of the steps we need to create separate text files:
201211
202- ``` python
212+ ~~~
203213# Load the data into a DataFrame
204214surveys_df = pd.read_csv('data/surveys.csv')
205215
@@ -211,7 +221,8 @@ for year in surveys_df['year'].unique():
211221 # Write the new DataFrame to a CSV file
212222 filename = 'data/yearly_files/surveys' + str(year) + '.csv'
213223 surveys_year.to_csv(filename)
214- ```
224+ ~~~
225+ {: .language-python}
215226
216227Look inside the `yearly_files` directory and check a couple of the files you
217228just created to confirm that everything worked as expected.
@@ -220,7 +231,10 @@ just created to confirm that everything worked as expected.
220231
221232Notice that the code above created a unique filename for each year.
222233
234+ ~~~
223235 filename = 'data/yearly_files/surveys' + str(year) + '.csv'
236+ ~~~
237+ {: .language-python}
224238
225239Let's break down the parts of this name:
226240
@@ -272,7 +286,7 @@ easy to write functions that can be used by different programs.
272286
273287Functions are declared following this general structure:
274288
275- ``` python
289+ ~~~
276290def this_is_the_function_name(input_argument1, input_argument2):
277291
278292 # The body of the function is indented
@@ -281,7 +295,8 @@ def this_is_the_function_name(input_argument1, input_argument2):
281295
282296 # And returns their product
283297 return input_argument1 * input_argument2
284- ```
298+ ~~~
299+ {: .language-python}
285300
286301The function declaration starts with the word `def`, followed by the function
287302name and any arguments in parentheses, and ends in a colon. The body of the
@@ -290,13 +305,14 @@ it is called, it includes a return statement at the end.
290305
291306This is how we call the function:
292307
293- ``` python
308+ ~~~
294309>>> product_of_inputs = this_is_the_function_name(2,5)
295310The function arguments are: 2 5 (this is done inside the function!)
296311
297312>>> print('Their product is:', product_of_inputs, '(this is done outside the function!)')
298313Their product is: 10 (this is done outside the function!)
299- ```
314+ ~~~
315+ {: .language-python}
300316
301317> ## Challenge - Functions
302318>
@@ -315,7 +331,7 @@ many different "chunks" of this code that we can turn into functions, and we can
315331even create functions that call other functions inside them. Let's first write a
316332function that separates data for just one year and saves that data to a file:
317333
318- ``` python
334+ ~~~
319335def one_year_csv_writer(this_year, all_data):
320336 """
321337 Writes a csv file for data from a given year.
@@ -330,21 +346,24 @@ def one_year_csv_writer(this_year, all_data):
330346 # Write the new DataFrame to a csv file
331347 filename = 'data/yearly_files/function_surveys' + str(this_year) + '.csv'
332348 surveys_year.to_csv(filename)
333- ```
349+ ~~~
350+ {: .language-python}
334351
335352The text between the two sets of triple double quotes is called a docstring and
336353contains the documentation for the function. It does nothing when the function
337354is running and is therefore not necessary, but it is good practice to include
338355docstrings as a reminder of what the code does. Docstrings in functions also
339356become part of their 'official' documentation:
340357
341- ``` python
358+ ~~~
342359one_year_csv_writer?
343- ```
360+ ~~~
361+ {: .language-python}
344362
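The trailing `?` is IPython/Jupyter syntax and is a syntax error in a plain Python script. As a rough sketch of the plain-Python equivalents, the docstring can be read from the function's `__doc__` attribute or displayed with the built-in `help`. The function `describe_year` below is a made-up stand-in, not part of the lesson:

```python
def describe_year(this_year):
    """Return a short label for a survey year (hypothetical helper)."""
    return 'surveys' + str(this_year)

# The docstring is stored on the function object itself
print(describe_year.__doc__)

# help() formats the signature and the docstring together
help(describe_year)
```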
345- ``` python
363+ ~~~
346364one_year_csv_writer(2002, surveys_df)
347- ```
365+ ~~~
366+ {: .language-python}
348367
349368We changed the root of the name of the CSV file so we can distinguish it from
350369the one we wrote before. Check the `yearly_files` directory for the file. Did it
@@ -356,7 +375,7 @@ the entire For loop by simply looping through a sequence of years and repeatedly
356375calling the function we just wrote, `one_year_csv_writer`:
357376
358377
359- ``` python
378+ ~~~
360379def yearly_data_csv_writer(start_year, end_year, all_data):
361380 """
362381 Writes separate CSV files for each year of data.
@@ -369,7 +388,8 @@ def yearly_data_csv_writer(start_year, end_year, all_data):
369388 # "end_year" is the last year of data we want to pull, so we loop to end_year+1
370389 for year in range(start_year, end_year+1):
371390 one_year_csv_writer(year, all_data)
372- ```
391+ ~~~
392+ {: .language-python}
373393
374394Because people will naturally expect that the end year for the files is the last
375395year with data, the for loop inside the function ends at `end_year + 1`. By
@@ -379,13 +399,14 @@ first and last year for which we want files, we can even use this function to
379399create files for a subset of the years available. This is how we call this
380400function:
381401
382- ``` python
402+ ~~~
383403# Load the data into a DataFrame
384404surveys_df = pd.read_csv('data/surveys.csv')
385405
386406# Create CSV files
387407yearly_data_csv_writer(1977, 2002, surveys_df)
388- ```
408+ ~~~
409+ {: .language-python}
389410
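The `end_year + 1` inside the function matters because `range` stops just before its second argument. A quick check, using arbitrary years chosen only for illustration:

```python
start_year, end_year = 1999, 2002

# Without the +1, the last year is silently skipped
without_plus_one = list(range(start_year, end_year))
print(without_plus_one)   # [1999, 2000, 2001]

# With the +1, the end year is included
with_plus_one = list(range(start_year, end_year + 1))
print(with_plus_one)      # [1999, 2000, 2001, 2002]
```
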
390411BEWARE! If you are using IPython Notebooks and you modify a function, you MUST
391412re-run that cell in order for the changed function to be available to the rest
@@ -422,7 +443,7 @@ sign in the function declaration. Any arguments in the function without default
422443values (here, `all_data`) are required and MUST come before the
423444arguments with default values (which are optional in the function call).
424445
425- ``` python
446+ ~~~
426447 def yearly_data_arg_test(all_data, start_year = 1977, end_year = 2002):
427448 """
428449 Modified from yearly_data_csv_writer to test default argument values!
@@ -440,7 +461,8 @@ argument with default values (which are optional in the function call).
440461
441462 start,end = yearly_data_arg_test (surveys_df)
442463 print('Default values:\t\t\t', start, end)
443- ```
464+ ~~~
465+ {: .language-python}
444466
445467```
446468 Both optional arguments: 1988 1993
@@ -454,7 +476,7 @@ But what if our dataset doesn't start in 1977 and end in 2002? We can modify the
454476function so that it looks for the start and end years in the dataset if those
455477dates are not provided:
456478
457- ``` python
479+ ~~~
458480 def yearly_data_arg_test(all_data, start_year = None, end_year = None):
459481 """
460482 Modified from yearly_data_csv_writer to test default argument values!
@@ -477,7 +499,8 @@ dates are not provided:
477499
478500 start,end = yearly_data_arg_test (surveys_df)
479501 print('Default values:\t\t\t', start, end)
480- ```
502+ ~~~
503+ {: .language-python}
481504```
482505 Both optional arguments: 1988 1993
483506 Default values: 1977 2002
@@ -510,7 +533,7 @@ The body of the test function now has two conditionals (if statements) that
510533check the values of `start_year` and `end_year`. If statements execute a segment
511534of code when some condition is met. They commonly look something like this:
512535
513- ``` python
536+ ~~~
514537 a = 5
515538
516539 if a<0: # Meets first condition?
@@ -527,7 +550,8 @@ of code when some condition is met. They commonly look something like this:
527550
528551 # if a ISN'T less than zero and ISN'T more than zero
529552 print('a must be zero!')
530- ```
553+ ~~~
554+ {: .language-python}
531555
532556Which would return:
533557
@@ -556,7 +580,7 @@ calling the function using keyword arguments, where each of the arguments in the
556580function definition is associated with a keyword and the function call passes
557581values to the function using these keywords:
558582
559- ``` python
583+ ~~~
560584 start,end = yearly_data_arg_test (surveys_df)
561585 print('Default values:\t\t\t', start, end)
562586
@@ -574,7 +598,8 @@ values to the function using these keywords:
574598
575599 start,end = yearly_data_arg_test (surveys_df, end_year = 1993)
576600 print('One keyword, default start:\t', start, end)
577- ```
601+ ~~~
602+ {: .language-python}
578603```
579604 Default values: 1977 2002
580605 No keywords: 1988 1993