Commit 1021a9c

Merge pull request #555 from tobyhodges/episode5-solutions
relocate solutions for episode 05
2 parents 87ecab0 + 8a12dee commit 1021a9c

2 files changed

Lines changed: 145 additions & 164 deletions

File tree

episodes/05-merging-data.md

Lines changed: 145 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -149,11 +149,42 @@ new_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])
149149

150150
### Challenge - Combine Data
151151

152-
In the data folder, there are two survey data files: `surveys2001.csv` and
153-
`surveys2002.csv`. Read the data into pandas and combine the files to make one
154-
new DataFrame. Create a plot of average plot weight by year grouped by sex.
152+
In the data folder, there is another folder called `yearly_files`
153+
that contains survey data broken down into individual files by year.
154+
Read the data from two of these files,
155+
`surveys2001.csv` and `surveys2002.csv`,
156+
into pandas and combine the files to make one new DataFrame.
157+
Create a plot of average plot weight by year grouped by sex.
155158
Export your results as a CSV and make sure it reads back into pandas properly.
156159

160+
::::::::::::::::::::::: solution
161+
162+
```python
163+
# read the files:
164+
survey2001 = pd.read_csv("data/yearly_files/surveys2001.csv")
165+
survey2002 = pd.read_csv("data/yearly_files/surveys2002.csv")
166+
# concatenate
167+
survey_all = pd.concat([survey2001, survey2002], axis=0)
168+
# get the weight for each year, grouped by sex:
169+
weight_year = survey_all.groupby(['year', 'sex'])["wgt"].mean().unstack()
170+
# plot:
171+
weight_year.plot(kind="bar")
172+
plt.tight_layout() # tip: use this to improve the plot layout.
173+
# Try running the code without this line to see
174+
# what difference applying plt.tight_layout() makes.
175+
```
176+
177+
![](fig/04_chall_weight_year.png){alt='average weight for each year, grouped by sex'}
178+
179+
```python
180+
# writing to file:
181+
weight_year.to_csv("weight_for_year.csv")
182+
# reading it back in:
183+
pd.read_csv("weight_for_year.csv", index_col=0)
184+
```
185+
186+
::::::::::::::::::::::::::::::::
187+
157188

158189
::::::::::::::::::::::::::::::::::::::::::::::::::
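
A minimal sketch of the same workflow on made-up data, for checking the round trip without the lesson files. The two small DataFrames are hypothetical stand-ins for `surveys2001.csv` and `surveys2002.csv`, and an in-memory buffer replaces the CSV file on disk. Selecting `wgt` before calling `mean()` is a choice made here so that non-numeric columns are never averaged.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two yearly survey files.
survey2001 = pd.DataFrame(
    {"year": [2001, 2001, 2001], "sex": ["F", "M", "M"], "wgt": [40.0, 48.0, 52.0]}
)
survey2002 = pd.DataFrame(
    {"year": [2002, 2002], "sex": ["F", "M"], "wgt": [44.0, 46.0]}
)

# Stack the rows of both years into one DataFrame.
survey_all = pd.concat([survey2001, survey2002], axis=0, ignore_index=True)

# Mean weight per year and sex, with one column per sex.
weight_year = survey_all.groupby(["year", "sex"])["wgt"].mean().unstack()

# Write to an in-memory CSV and read it back, using the first column as the index.
buffer = io.StringIO()
weight_year.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer, index_col=0)
```

If the round trip worked, `round_trip` holds the same values as `weight_year`, with `year` as the index and one column per sex.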
159190

@@ -425,10 +456,88 @@ Create a new DataFrame by joining the contents of the `surveys.csv` and
425456

426457
1. taxa by plot
427458
2. taxa by sex by plot
459+
460+
::::::::::::::::::::::: solution
461+
462+
```python
463+
merged_left = pd.merge(left=surveys_df, right=species_df, how='left', on="species_id")
464+
```
465+
466+
1. taxa per plot (number of distinct taxa recorded in each plot):
467+
468+
```python
469+
merged_left.groupby(["plot_id"])["taxa"].nunique().plot(kind='bar')
470+
```
471+
472+
![](fig/04_chall_ntaxa_per_site.png){alt='taxa per plot'}
473+
474+
*Suggestion*: It is also possible to plot the number of individuals for each taxa in each plot
475+
(stacked bar chart):
428476

477+
```python
478+
merged_left.groupby(["plot_id", "taxa"]).count()["record_id"].unstack().plot(kind='bar', stacked=True)
479+
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.05)) # stop the legend from overlapping with the bar plot
480+
```
481+
482+
![](fig/04_chall_taxa_per_site.png){alt='taxa per plot'}
483+
484+
2. taxa by sex by plot:
485+
Replace the NaN values in `sex` with `M|F` (any other placeholder, such as 'x', would also work):
486+
487+
```python
488+
merged_left.loc[merged_left["sex"].isnull(), "sex"] = 'M|F'
489+
ntaxa_sex_site = merged_left.groupby(["plot_id", "sex"])["taxa"].nunique().reset_index(level=1)
490+
ntaxa_sex_site = ntaxa_sex_site.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_site.index)
491+
ntaxa_sex_site.plot(kind="bar", legend=False, stacked=True)
492+
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.08),
493+
           fontsize='small', frameon=False)
494+
```
495+
496+
![](fig/04_chall_ntaxa_per_site_sex.png){alt='taxa per plot per sex'}
497+
498+
::::::::::::::::::::::::::::::::
429499

430500
::::::::::::::::::::::::::::::::::::::::::::::::::
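
A small sketch of what the left join and the per-plot count do, using invented rows rather than the lesson data. The species code `ZZ` is a hypothetical code with no match in the species table; it shows that `how='left'` keeps every survey record and that `nunique()` ignores the resulting missing `taxa` value.

```python
import pandas as pd

# Hypothetical survey records; "ZZ" has no entry in the species table.
surveys_df = pd.DataFrame(
    {
        "record_id": [1, 2, 3, 4],
        "plot_id": [1, 1, 2, 2],
        "species_id": ["DM", "AB", "DM", "ZZ"],
    }
)
# Hypothetical species lookup table.
species_df = pd.DataFrame(
    {"species_id": ["DM", "AB"], "taxa": ["Rodent", "Bird"]}
)

# A left join keeps all rows of surveys_df, matched or not.
merged_left = pd.merge(left=surveys_df, right=species_df, how="left", on="species_id")

# Number of distinct taxa per plot; missing values are not counted.
ntaxa_per_plot = merged_left.groupby("plot_id")["taxa"].nunique()
```

Here plot 1 has two taxa and plot 2 has one, because the unmatched `ZZ` record contributes no taxon.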
431501

502+
::::::::::::::::::::::: instructor
503+
504+
## Suggestion (for discussion only)
505+
506+
The number of individuals for each taxa in each plot per sex can be derived as well.
507+
508+
```python
509+
sex_taxa_site = merged_left.groupby(["plot_id", "taxa", "sex"]).count()['record_id']
510+
sex_taxa_site.unstack(level=[1, 2]).plot(kind='bar', logy=True)
511+
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.15),
512+
           fontsize='small', frameon=False)
513+
```
514+
515+
![](fig/04_chall_sex_taxa_site_intro.png){alt='taxa per plot per sex'}
516+
517+
This plot is not a good choice because it is hard to read.
518+
A first option to improve it is to split it into facets.
519+
However, pandas/matplotlib do not provide this by default.
520+
Just as a pure matplotlib example (`M|F` is used for records with no defined sex):
521+
522+
```python
523+
fig, axs = plt.subplots(3, 1)
524+
for sex, ax in zip(["M", "F", "M|F"], axs):
525+
    sex_taxa_site.xs(sex, level="sex").unstack(level="taxa").plot(kind='bar', ax=ax, legend=False)
526+
    ax.set_ylabel(sex)
527+
    if not ax.get_subplotspec().is_last_row():
528+
        ax.set_xticks([])
529+
        ax.set_xlabel("")
530+
axs[0].legend(loc='upper center', ncol=5, bbox_to_anchor=(0.5, 1.3),
531+
              fontsize='small', frameon=False)
532+
```
533+
534+
![](fig/04_chall_sex_taxa_site.png){alt='taxa per plot per sex'}
535+
536+
However, it would be better to link to [Seaborn][seaborn]
537+
and [Altair][altair] for this kind of multivariate visualisation.
538+
539+
::::::::::::::::::::::::::::::::::
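
A faceted plot needs one slice of the counts per sex. The grouped counts form a Series indexed by `plot_id`, `taxa` and `sex`, so one possible way to get each slice is to select on the `sex` index level with `xs` and then spread `taxa` into columns. The sketch below uses invented records and leaves out the plotting; it is an illustration under those assumptions, not the lesson's own solution.

```python
import pandas as pd

# Hypothetical merged records (the real lesson data has many more columns).
merged_left = pd.DataFrame(
    {
        "record_id": [1, 2, 3, 4, 5, 6],
        "plot_id": [1, 1, 1, 2, 2, 2],
        "taxa": ["Rodent", "Rodent", "Bird", "Rodent", "Bird", "Bird"],
        "sex": ["M", "F", "M", "M", "F", "F"],
    }
)

# Count of individuals per plot, taxon and sex: a Series with a 3-level index.
sex_taxa_site = merged_left.groupby(["plot_id", "taxa", "sex"])["record_id"].count()

# One table per sex: rows are plots, columns are taxa, missing combinations are 0.
per_sex = {
    sex: sex_taxa_site.xs(sex, level="sex").unstack("taxa", fill_value=0)
    for sex in ["M", "F"]
}
```

Each value of `per_sex` is a DataFrame that could be passed to `.plot(kind='bar', ax=ax)` on its own subplot.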
540+
432541
::::::::::::::::::::::::::::::::::::::: challenge
433542

434543
### Challenge - Diversity Index
@@ -441,17 +550,46 @@ Create a new DataFrame by joining the contents of the `surveys.csv` and
441550
plots. The index should consider both species abundance and number of
442551
species. You might choose to use the simple [biodiversity index described
443552
here](https://www.amnh.org/explore/curriculum-collections/biodiversity-counts/plant-ecology/how-to-calculate-a-biodiversity-index)
444-
which calculates diversity as:
553+
which calculates diversity as: the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
445554

446-
the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
555+
::::::::::::::::::::::: solution
556+
557+
1.
558+
```python
559+
plot_info = pd.read_csv("data/plots.csv")
560+
plot_info.groupby("plot_type").count()
561+
```
562+
563+
2.
564+
```python
565+
merged_site_type = pd.merge(merged_left, plot_info, on='plot_id')
566+
# For each plot, get the number of species
567+
nspecies_site = merged_site_type.groupby(["plot_id"])["species"].nunique().rename("nspecies")
568+
# For each plot, get the number of individuals
569+
nindividuals_site = merged_site_type.groupby(["plot_id"]).count()['record_id'].rename("nindiv")
570+
# combine the two series
571+
diversity_index = pd.concat([nspecies_site, nindividuals_site], axis=1)
572+
# calculate the diversity index
573+
diversity_index['diversity'] = diversity_index['nspecies']/diversity_index['nindiv']
574+
```
447575

576+
Making a bar chart from this diversity index:
577+
578+
```python
579+
diversity_index['diversity'].plot(kind="barh")
580+
plt.xlabel("Diversity index")
581+
```
448582

449-
::::::::::::::::::::::::::::::::::::::::::::::::::
583+
![](fig/04_chall_diversity_index.png){alt='horizontal bar chart of diversity index by plot'}
450584

585+
::::::::::::::::::::::::::::::::
451586

587+
::::::::::::::::::::::::::::::::::::::::::::::::::
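
To make the arithmetic of the simple index concrete, here is a sketch on a handful of invented records. The species labels and counts are hypothetical; only the formula (number of species divided by number of individuals, per plot) comes from the challenge text.

```python
import pandas as pd

# Hypothetical records: plot 1 has 4 individuals of 3 species,
# plot 2 has 2 individuals of 1 species.
records = pd.DataFrame(
    {
        "record_id": [1, 2, 3, 4, 5, 6],
        "plot_id": [1, 1, 1, 1, 2, 2],
        "species": ["a", "a", "b", "c", "a", "a"],
    }
)

# Number of distinct species and number of individuals in each plot.
nspecies_site = records.groupby("plot_id")["species"].nunique().rename("nspecies")
nindividuals_site = records.groupby("plot_id")["record_id"].count().rename("nindiv")

# Combine the two Series and divide species by individuals.
diversity_index = pd.concat([nspecies_site, nindividuals_site], axis=1)
diversity_index["diversity"] = diversity_index["nspecies"] / diversity_index["nindiv"]
```

Plot 1 gets an index of 3 / 4 = 0.75 and plot 2 gets 1 / 2 = 0.5, so under this measure the plot with more species per individual scores higher.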
452588

453-
[join-types]: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
454589

590+
[altair]: https://github.com/ellisonbg/altair
591+
[join-types]: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
592+
[seaborn]: https://stanford.edu/~mwaskom/software/seaborn
455593

456594
:::::::::::::::::::::::::::::::::::::::: keypoints
457595

instructors/instructor-notes.md

Lines changed: 0 additions & 157 deletions
Original file line numberDiff line numberDiff line change
@@ -2,8 +2,6 @@
22
title: Instructor Notes
33
---
44

5-
# Challenge solutions
6-
75
## Install the required workshop packages
86

97
Please use the instructions in the [Setup][lesson-setup] document to perform installs. If you
@@ -47,159 +45,6 @@ it or not at their preference. For example, if a student worries about keeping u
4745
typing, let them know they can skip the `.head()`, but that you'll use it to keep more lines of
4846
previous steps visible.
4947

50-
## 04-data-types-and-format
51-
52-
### Writing Out Data to CSV
53-
54-
If the students have trouble generating the output, or anything happens with that, the folder
55-
[`sample_output`](https://github.com/datacarpentry/python-ecology-lesson/tree/main/sample_output)
56-
in this repository contains the file `surveys_complete.csv` with the data they should generate.
57-
58-
## 05-merging-data
59-
60-
- In the data folder, there are two survey data files: survey2001.csv and survey2002.csv. Read the
61-
data into Python and combine the files to make one new data frame. Create a plot of average plot
62-
weight by year grouped by sex. Export your results as a CSV and make sure it reads back into
63-
Python properly.
64-
65-
```python
66-
# read the files:
67-
survey2001 = pd.read_csv("data/survey2001.csv")
68-
survey2002 = pd.read_csv("data/survey2002.csv")
69-
# concatenate
70-
survey_all = pd.concat([survey2001, survey2002], axis=0)
71-
# get the weight for each year, grouped by sex:
72-
weight_year = survey_all.groupby(['year', 'sex']).mean()["wgt"].unstack()
73-
# plot:
74-
weight_year.plot(kind="bar")
75-
plt.tight_layout() # tip(!)
76-
```
77-
78-
![](fig/04_chall_weight_year.png){alt='average weight for each year, grouped by sex'}
79-
80-
```python
81-
# writing to file:
82-
weight_year.to_csv("weight_for_year.csv")
83-
# reading it back in:
84-
pd.read_csv("weight_for_year.csv", index_col=0)
85-
```
86-
87-
- Create a new DataFrame by joining the contents of the surveys.csv and species.csv tables.
88-
89-
```python
90-
merged_left = pd.merge(left=surveys_df,right=species_df, how='left', on="species_id")
91-
```
92-
93-
Then calculate and plot the distribution of:
94-
95-
**1\. taxa per plot** (number of species of each taxa per plot):
96-
97-
Species distribution (number of taxa for each plot) can be derived as follows:
98-
99-
```python
100-
merged_left.groupby(["plot_id"])["taxa"].nunique().plot(kind='bar')
101-
```
102-
103-
![](fig/04_chall_ntaxa_per_site.png){alt='taxa per plot'}
104-
105-
*Suggestion*: It is also possible to plot the number of individuals for each taxa in each plot
106-
(stacked bar chart):
107-
108-
```python
109-
merged_left.groupby(["plot_id", "taxa"]).count()["record_id"].unstack().plot(kind='bar', stacked=True)
110-
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.05))
111-
```
112-
113-
(the legend otherwise overlaps the bar plot)
114-
115-
![](fig/04_chall_taxa_per_site.png){alt='taxa per plot'}
116-
117-
**2\. taxa by sex by plot**:
118-
Providing the Nan values with the M|F values (can also already be changed to 'x'):
119-
120-
```python
121-
merged_left.loc[merged_left["sex"].isnull(), "sex"] = 'M|F'
122-
```
123-
124-
Number of taxa for each plot/sex combination:
125-
126-
```python
127-
ntaxa_sex_site= merged_left.groupby(["plot_id", "sex"])["taxa"].nunique().reset_index(level=1)
128-
ntaxa_sex_site = ntaxa_sex_site.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_site.index)
129-
ntaxa_sex_site.plot(kind="bar", legend=False)
130-
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.08),
131-
fontsize='small', frameon=False)
132-
```
133-
134-
![](fig/04_chall_ntaxa_per_site_sex.png){alt='taxa per plot per sex'}
135-
136-
*Suggestion (for discussion only)*:
137-
138-
The number of individuals for each taxa in each plot per sex can be derived as well.
139-
140-
```python
141-
sex_taxa_site = merged_left.groupby(["plot_id", "taxa", "sex"]).count()['record_id']
142-
sex_taxa_site.unstack(level=[1, 2]).plot(kind='bar', logy=True)
143-
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.15),
144-
fontsize='small', frameon=False)
145-
```
146-
147-
![](fig/04_chall_sex_taxa_site_intro.png){alt='taxa per plot per sex'}
148-
149-
This is not really the best plot choice: not readable,... A first option to make this better, is to
150-
make facets. However, pandas/matplotlib do not provide this by default. Just as a pure matplotlib
151-
example (`M|F` is for not-defined sex records):
152-
153-
```python
154-
fig, axs = plt.subplots(3, 1)
155-
for sex, ax in zip(["M", "F", "M|F"], axs):
156-
    sex_taxa_site[sex_taxa_site["sex"] == sex].plot(kind='bar', ax=ax, legend=False)
157-
    ax.set_ylabel(sex)
158-
    if not ax.is_last_row():
159-
        ax.set_xticks([])
160-
        ax.set_xlabel("")
161-
axs[0].legend(loc='upper center', ncol=5, bbox_to_anchor=(0.5, 1.3),
162-
fontsize='small', frameon=False)
163-
```
164-
165-
![](fig/04_chall_sex_taxa_site.png){alt='taxa per plot per sex'}
166-
167-
However, it would be better to link to [Seaborn][seaborn] and [Altair][altair] for its kind of
168-
multivariate visualisations.
169-
170-
- In the data folder, there is a plot CSV that contains information about the type associated with
171-
each plot. Use that data to summarize the number of plots by plot type.
172-
173-
```python
174-
plot_info = pd.read_csv("data/plots.csv")
175-
plot_info.groupby("plot_type").count()
176-
```
177-
178-
- Calculate a diversity index of your choice for control vs rodent exclosure plots. The index should
179-
consider both species abundance and number of species. You might choose the simple biodiversity
180-
index described here which calculates diversity as `the number of species in the plot / the total number of individuals in the plot = Biodiversity index.`
181-
182-
```python
183-
merged_site_type = pd.merge(merged_left, plot_info, on='plot_id')
184-
# For each plot, get the number of species for each plot
185-
nspecies_site = merged_site_type.groupby(["plot_id"])["species"].nunique().rename("nspecies")
186-
# For each plot, get the number of individuals
187-
nindividuals_site = merged_site_type.groupby(["plot_id"]).count()['record_id'].rename("nindiv")
188-
# combine the two series
189-
diversity_index = pd.concat([nspecies_site, nindividuals_site], axis=1)
190-
# calculate the diversity index
191-
diversity_index['diversity'] = diversity_index['nspecies']/diversity_index['nindiv']
192-
```
193-
194-
Making a bar chart:
195-
196-
```python
197-
diversity_index['diversity'].plot(kind="barh")
198-
plt.xlabel("Diversity index")
199-
```
200-
201-
![](fig/04_chall_diversity_index.png){alt='taxa per plot per sex'}
202-
20348
## 07-visualization-ggplot-python
20449

20550
Note that `plotnine` emits a *lot* of deprecation warnings with some versions of Python/matplotlib; warnings may need to be suppressed with
@@ -240,8 +85,6 @@ plt.show()
24085

24186
[This page][matplotlib-mathtext] contains more information.
24287

243-
[seaborn]: https://stanford.edu/~mwaskom/software/seaborn
244-
[altair]: https://github.com/ellisonbg/altair
24588
[matplotlib-mathtext]: https://matplotlib.org/users/mathtext.html
24689

24790
