Commit 87ecab0

Merge pull request #559 from tobyhodges/episode9-solutions
relocate solutions for episode 09
2 parents 6ec0248 + 81c34a3 commit 87ecab0

2 files changed

Lines changed: 109 additions & 103 deletions

File tree

episodes/09-working-with-sql.md

Lines changed: 109 additions & 7 deletions
@@ -113,13 +113,82 @@ benchmarks][these benchmarks]).
113113
## Challenge - SQL
114114

115115
1. Create a query that contains survey data collected between 1998 - 2001 for
116-
observations of sex "male" or "female" that includes observation's genus and
117-
species and site type for the sample. How many records are returned?
118-
116+
observations of sex "male" or "female" that includes observation's genus and
117+
species and site type for the sample. How many records are returned?
119118
2. Create a dataframe that contains the total number of observations (count)
120-
made for all years, and sum of observation weights for each site, ordered by
121-
site ID.
122-
119+
made for all years, and sum of observation weights for each site, ordered by
120+
site ID.
121+
122+
::::::::::::::::::::::: solution
123+
124+
1.
125+
```python
126+
#Connect to the database
127+
con = sqlite3.connect("data/portal_mammals.sqlite")
128+
129+
cur = con.cursor()
130+
131+
# Return all results of query: year, plot type (site type), genus, species and sex
132+
# from the join of the tables surveys, plots and species, for the years 1998-2001 where sex is 'M' or 'F'.
133+
cur.execute('SELECT surveys.year,plots.plot_type,species.genus,species.species,surveys.sex \
134+
FROM surveys INNER JOIN plots ON surveys.plot = plots.plot_id INNER JOIN species ON \
135+
surveys.species = species.species_id WHERE surveys.year>=1998 AND surveys.year<=2001 \
136+
AND ( surveys.sex = "M" OR surveys.sex = "F")')
137+
138+
print('The query returned ' + str(len(cur.fetchall())) + ' records.')
139+
140+
# Close the connection
141+
con.close()
142+
```
143+
144+
```output
145+
The query returned 5546 records.
146+
```
147+
2.
148+
```python
149+
# Create two sqlite queries results, read as pandas DataFrame
150+
# Include 'year' in both queries so we have something to merge (join) on.
151+
con = sqlite3.connect("data/portal_mammals.sqlite")
152+
df1 = pd.read_sql_query("SELECT year,COUNT(*) FROM surveys GROUP BY year", con)
153+
df2 = pd.read_sql_query("SELECT year,plot,SUM(wgt) FROM surveys GROUP BY \
154+
year,plot ORDER BY plot ASC",con)
155+
156+
# Turn the plot_id column values into column names by pivoting
157+
df2 = df2.pivot(index='year',columns='plot')['SUM(wgt)']
158+
df = pd.merge(df1, df2, on='year')
159+
160+
# Verify that result of the SQL queries is stored in the combined dataframe
161+
print(df.head())
162+
163+
con.close()
164+
```
165+
166+
```output
167+
year COUNT(*) 1 2 3 4 5 6 7 \
168+
0 1977 503 567.0 784.0 237.0 849.0 943.0 578.0 202.0
169+
1 1978 1048 4628.0 4789.0 1131.0 4291.0 4051.0 2371.0 43.0
170+
2 1979 719 1909.0 2501.0 430.0 2438.0 1798.0 988.0 141.0
171+
3 1980 1415 5374.0 4643.0 1817.0 7466.0 2743.0 3219.0 362.0
172+
4 1981 1472 6229.0 6282.0 1343.0 4553.0 3596.0 5430.0 24.0
173+
174+
8 ... 15 16 17 18 19 20 21 22 \
175+
0 595.0 ... 48.0 132.0 1102.0 646.0 336.0 640.0 40.0 316.0
176+
1 3669.0 ... 734.0 548.0 4971.0 4393.0 124.0 2623.0 239.0 2833.0
177+
2 1954.0 ... 472.0 308.0 3736.0 3099.0 379.0 2617.0 157.0 2250.0
178+
3 3596.0 ... 1071.0 529.0 5877.0 5075.0 691.0 5523.0 321.0 3763.0
179+
4 4946.0 ... 1083.0 176.0 5050.0 4773.0 410.0 5379.0 600.0 5268.0
180+
181+
23 24
182+
0 169.0 NaN
183+
1 NaN NaN
184+
2 137.0 901.0
185+
3 742.0 4392.0
186+
4 57.0 3987.0
187+
188+
[5 rows x 26 columns]
189+
```
190+
191+
::::::::::::::::::::::::::::::::
123192

124193
::::::::::::::::::::::::::::::::::::::::::::::::::
125194

@@ -157,7 +226,40 @@ con.close()
157226

158227
2. What are some of the reasons you might want to save the results of your queries back into the
159228
database? What are some of the reasons you might avoid doing this?
160-
229+
230+
::::::::::::::::::::::: solution
231+
232+
1.
233+
```python
234+
#Connect to the database
235+
con = sqlite3.connect("data/portal_mammals.sqlite")
236+
237+
# Read the results into a DataFrame
238+
df1 = pd.read_sql_query('SELECT surveys.year,plots.plot_type,species.genus,species.species,surveys.sex \
239+
FROM surveys INNER JOIN plots ON surveys.plot = plots.plot_id INNER JOIN species ON \
240+
surveys.species = species.species_id WHERE surveys.year>=1998 AND surveys.year<=2001 \
241+
AND ( surveys.sex = "M" OR surveys.sex = "F")', con)
242+
243+
df1.to_sql("New Table 1", con, if_exists="replace")
244+
245+
# We already have the 'df' DataFrame created in the earlier exercise
246+
df.to_sql("New Table 2", con, if_exists="replace")
247+
248+
# Close the connection
249+
con.close()
250+
```
251+
2. If the database is shared with others and common queries
252+
(and potentially data corrections) are likely to be required by many,
253+
it may be efficient for one person to perform the work
254+
and save it back to the database as a new table
255+
so others can access the results directly instead of performing the query themselves,
256+
particularly if it is complex.
257+
258+
However, we might avoid doing this if the database is an authoritative source
259+
(potentially version controlled) which should not be modified by users.
260+
Instead, we might save the query results to a new database that is more appropriate for downstream work.
261+
262+
::::::::::::::::::::::::::::::::
161263

162264
::::::::::::::::::::::::::::::::::::::::::::::::::
163265
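The last point in the solution above (saving query results to a separate database rather than modifying an authoritative source) can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the commit: it uses only the standard-library `sqlite3` module, two in-memory databases, a cut-down `surveys` table with made-up rows, and a hypothetical table name `surveys_1998_2001`.

```python
import sqlite3

# Toy stand-in for the authoritative database.
# The rows below are made up for illustration; they are not real survey data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE surveys (year INTEGER, sex TEXT)")
source.executemany(
    "INSERT INTO surveys VALUES (?, ?)",
    [(1997, "M"), (1998, "F"), (1999, "M"), (2001, None), (2002, "F")],
)

# Query the source database without writing anything back to it
rows = source.execute(
    "SELECT year, sex FROM surveys "
    "WHERE year BETWEEN 1998 AND 2001 AND sex IN ('M', 'F')"
).fetchall()

# Save the results into a separate database intended for downstream work.
# ":memory:" keeps the sketch self-contained; a file path would persist it.
derived = sqlite3.connect(":memory:")
derived.execute("CREATE TABLE surveys_1998_2001 (year INTEGER, sex TEXT)")
derived.executemany("INSERT INTO surveys_1998_2001 VALUES (?, ?)", rows)
derived.commit()

# The derived database holds only the filtered rows; the source is unchanged
saved_count = derived.execute("SELECT COUNT(*) FROM surveys_1998_2001").fetchone()[0]
source_count = source.execute("SELECT COUNT(*) FROM surveys").fetchone()[0]
print(saved_count, source_count)

source.close()
derived.close()
```

With pandas, the same idea would amount to opening a second connection and passing it to `to_sql` in place of the original `con`.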

instructors/instructor-notes.md

Lines changed: 0 additions & 96 deletions
@@ -240,102 +240,6 @@ plt.show()
240240

241241
[This page][matplotlib-mathtext] contains more information.
242242

243-
## 09-working-with-sql
244-
245-
### Challenge - SQL
246-
247-
- Create a query that contains survey data collected between 1998 - 2001 for observations of sex "male" or "female" that includes observation's genus and species and site type for the sample. How many records are returned?
248-
249-
```python
250-
#Connect to the database
251-
con = sqlite3.connect("data/portal_mammals.sqlite")
252-
253-
cur = con.cursor()
254-
255-
# Return all results of query: year, plot type (site type), genus, species and sex
256-
# from the join of the tables surveys, plots and species, for the years 1998-2001 where sex is 'M' or 'F'.
257-
cur.execute('SELECT surveys.year,plots.plot_type,species.genus,species.species,surveys.sex \
258-
FROM surveys INNER JOIN plots ON surveys.plot_id = plots.plot_id INNER JOIN species ON \
259-
surveys.species_id = species.species_id WHERE surveys.year>=1998 AND surveys.year<=2001 \
260-
AND ( surveys.sex = "M" OR surveys.sex = "F")')
261-
262-
print(len(cur.fetchall()))
263-
264-
# Close the connection
265-
con.close()
266-
```
267-
268-
```output
269-
5546
270-
```
271-
272-
Answer: 5546 records are found.
273-
274-
- Create a dataframe that contains the total number of observations (count) made for all years, and sum of observation weights for each site, ordered by site ID.
275-
276-
This question is a little ambiguous but we could e.g. do two SQL queries into dataframes, then pivot the second and merge them to create a table of observation count and plot total weight per year. The PIVOT operation could alternatively be performed in SQL.
277-
278-
```python
279-
import pandas as pd
280-
import sqlite3
281-
282-
# Create two sqlite queries results, read as pandas DataFrame
283-
# Include 'year' in both queries so we have something to merge (join) on.
284-
con = sqlite3.connect("data/portal_mammals.sqlite")
285-
df1 = pd.read_sql_query("SELECT year,COUNT(*) FROM surveys GROUP BY year", con)
286-
df2 = pd.read_sql_query("SELECT year,plot_id,SUM(weight) FROM surveys GROUP BY \
287-
year,plot_id ORDER BY plot_id ASC",con)
288-
289-
# Turn the plot_id column values into column names by pivoting
290-
df2 = df2.pivot(index='year',columns='plot_id')['SUM(weight)']
291-
df = pd.merge(df1, df2, on='year')
292-
293-
# Verify that result of the SQL queries is stored in the combined dataframe
294-
print(df.head())
295-
296-
con.close()
297-
```
298-
299-
Output looks something like
300-
301-
```output
302-
year COUNT(*) 1 2 3 4 5 6 7 \
303-
0 1977 503 567.0 784.0 237.0 849.0 943.0 578.0 202.0
304-
1 1978 1048 4628.0 4789.0 1131.0 4291.0 4051.0 2371.0 43.0
305-
2 1979 719 1909.0 2501.0 430.0 2438.0 1798.0 988.0 141.0
306-
3 1980 1415 5374.0 4643.0 1817.0 7466.0 2743.0 3219.0 362.0
307-
4 1981 1472 6229.0 6282.0 1343.0 4553.0 3596.0 5430.0 24.0
308-
309-
8 ... 15 16 17 18 19 20 21 22 \
310-
0 595.0 ... 48.0 132.0 1102.0 646.0 336.0 640.0 40.0 316.0
311-
1 3669.0 ... 734.0 548.0 4971.0 4393.0 124.0 2623.0 239.0 2833.0
312-
2 1954.0 ... 472.0 308.0 3736.0 3099.0 379.0 2617.0 157.0 2250.0
313-
3 3596.0 ... 1071.0 529.0 5877.0 5075.0 691.0 5523.0 321.0 3763.0
314-
4 4946.0 ... 1083.0 176.0 5050.0 4773.0 410.0 5379.0 600.0 5268.0
315-
316-
23 24
317-
0 169.0 NaN
318-
1 NaN NaN
319-
2 137.0 901.0
320-
3 742.0 4392.0
321-
4 57.0 3987.0
322-
```
323-
324-
### Challenge - Saving your work
325-
326-
- For each of the challenges in the previous challenge block, modify your code to save the results to their own tables in the portal database.
327-
328-
Per the example in the lesson, create a variable for the results of the SQL query, then add something like
329-
330-
```python
331-
<new_table>.to_sql("New Table", con, if_exists="replace")
332-
```
333-
334-
- What are some of the reasons you might want to save the results of your queries back into the database? What are some of the reasons you might avoid doing this?
335-
336-
If the database is shared with others and common queries (and potentially data corrections) are likely to be required by many it may be efficient for one person to perform the work and save it back to the database as a new table so others can access the results directly instead of performing the query themselves, particularly if it is complex.
337-
However, we might avoid doing this if the database is an authoritative source (potentially version controlled) which should not be modified by users. Instead, we might save the query results to a new database that is more appropriate for downstream work.
338-
339243
[seaborn]: https://stanford.edu/~mwaskom/software/seaborn
340244
[altair]: https://github.com/ellisonbg/altair
341245
[matplotlib-mathtext]: https://matplotlib.org/users/mathtext.html
