@@ -149,11 +149,42 @@ new_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])
149149
150150### Challenge - Combine Data
151151
152- In the data folder, there are two survey data files: `surveys2001.csv` and
153- `surveys2002.csv`. Read the data into pandas and combine the files to make one
154- new DataFrame. Create a plot of average plot weight by year grouped by sex.
152+ In the data folder, there is another folder called `yearly_files`
153+ that contains survey data broken down into individual files by year.
154+ Read the data from two of these files,
155+ `surveys2001.csv` and `surveys2002.csv`,
156+ into pandas and combine the files to make one new DataFrame.
157+ Create a plot of average plot weight by year grouped by sex.
155158Export your results as a CSV and make sure it reads back into pandas properly.
156159
160+ ::::::::::::::::::::::: solution
161+
162+ ```python
163+ # read the files:
164+ survey2001 = pd.read_csv("data/yearly_files/surveys2001.csv")
165+ survey2002 = pd.read_csv("data/yearly_files/surveys2002.csv")
166+ # concatenate
167+ survey_all = pd.concat([survey2001, survey2002], axis=0)
168+ # get the mean weight for each year, grouped by sex (select "wgt" first so only that column is averaged):
169+ weight_year = survey_all.groupby(['year', 'sex'])["wgt"].mean().unstack()
170+ # plot:
171+ weight_year.plot(kind="bar")
172+ plt.tight_layout()  # tip: use this to improve the plot layout.
173+ # Try running the code without this line to see
174+ # what difference applying plt.tight_layout() makes.
175+ ```
176+
177+ ![](fig/04_chall_weight_year.png){alt='average weight for each year, grouped by sex'}
178+
179+ ```python
180+ # writing to file:
181+ weight_year.to_csv("weight_for_year.csv")
182+ # reading it back in:
183+ pd.read_csv("weight_for_year.csv", index_col=0)
184+ ```
185+
186+ ::::::::::::::::::::::::::::::::
187+
157188
158189::::::::::::::::::::::::::::::::::::::::::::::::::
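
Reviewer note: the combine, aggregate, and round-trip pattern in the solution above can be checked without the lesson data. The following is a minimal sketch under stated assumptions: the two small DataFrames are made-up stand-ins for `surveys2001.csv` and `surveys2002.csv`, and the CSV is written to an in-memory buffer rather than to `weight_for_year.csv`.

```python
import io

import pandas as pd

# Synthetic stand-ins for the two yearly files (made-up values, not lesson data).
survey2001 = pd.DataFrame({
    "year": [2001, 2001, 2001, 2001],
    "sex": ["F", "F", "M", "M"],
    "wgt": [40.0, 44.0, 30.0, 34.0],
})
survey2002 = pd.DataFrame({
    "year": [2002, 2002, 2002],
    "sex": ["F", "M", "M"],
    "wgt": [50.0, 20.0, 24.0],
})

# Stack the two years row-wise; ignore_index gives a fresh 0..n-1 index.
survey_all = pd.concat([survey2001, survey2002], axis=0, ignore_index=True)

# Mean weight per (year, sex); unstack moves sex from the index into the columns.
weight_year = survey_all.groupby(["year", "sex"])["wgt"].mean().unstack()

# Round-trip through CSV text held in memory instead of a file on disk.
buffer = io.StringIO()
weight_year.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer, index_col=0)

print(weight_year)
print(round_trip)
```

Passing `index_col=0` on the way back in restores `year` as the index, which is what makes the reloaded table match the one that was written.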
159190
@@ -425,10 +456,88 @@ Create a new DataFrame by joining the contents of the `surveys.csv` and
425456
4264571. taxa by plot
4274582. taxa by sex by plot
459+
460+ ::::::::::::::::::::::: solution
461+
462+ ```python
463+ merged_left = pd.merge(left=surveys_df, right=species_df, how='left', on="species_id")
464+ ```
465+
466+ 1. taxa per plot (number of distinct taxa recorded in each plot):
467+
468+ ```python
469+ merged_left.groupby(["plot_id"])["taxa"].nunique().plot(kind='bar')
470+ ```
471+
472+ ![](fig/04_chall_ntaxa_per_site.png){alt='taxa per plot'}
473+
474+ *Suggestion*: It is also possible to plot the number of individuals for each taxon in each plot
475+ (stacked bar chart):
428476
477+ ```python
478+ merged_left.groupby(["plot_id", "taxa"]).count()["record_id"].unstack().plot(kind='bar', stacked=True)
479+ plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.05))  # stop the legend from overlapping with the bar plot
480+ ```
481+
482+ ![](fig/04_chall_taxa_per_site.png){alt='taxa per plot'}
483+
484+ 2. taxa by sex by plot:
485+ Replace the NaN values in `sex` with `M|F` (they could equally be changed to 'x'):
486+
487+ ```python
488+ merged_left.loc[merged_left["sex"].isnull(), "sex"] = 'M|F'
489+ ntaxa_sex_site = merged_left.groupby(["plot_id", "sex"])["taxa"].nunique().reset_index(level=1)
490+ ntaxa_sex_site = ntaxa_sex_site.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_site.index)
491+ ntaxa_sex_site.plot(kind="bar", legend=False, stacked=True)
492+ plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.08),
493+            fontsize='small', frameon=False)
494+ ```
495+
496+ ![](fig/04_chall_ntaxa_per_site_sex.png){alt='taxa per plot per sex'}
497+
498+ ::::::::::::::::::::::::::::::::
429499
430500::::::::::::::::::::::::::::::::::::::::::::::::::
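
Reviewer note: the left join and the two groupings used in the solution above can be tried on a toy example. This is a minimal sketch, assuming two made-up tables in place of `surveys.csv` and `species.csv`; the species code `"ZZ"` is hypothetical and is included only to show what a left join does with an unmatched key.

```python
import pandas as pd

# Made-up stand-ins for surveys.csv and species.csv (not lesson data).
surveys = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "plot_id": [1, 1, 1, 2, 2],
    "species_id": ["DM", "DM", "AB", "DM", "ZZ"],  # "ZZ" has no match below
})
species = pd.DataFrame({
    "species_id": ["DM", "AB"],
    "taxa": ["Rodent", "Bird"],
})

# A left join keeps every survey record; unmatched records get NaN for taxa.
merged = pd.merge(left=surveys, right=species, how="left", on="species_id")

# Number of distinct taxa per plot (nunique ignores NaN by default).
taxa_per_plot = merged.groupby("plot_id")["taxa"].nunique()

# Number of individuals per taxon per plot, one column per taxon.
individuals = merged.groupby(["plot_id", "taxa"])["record_id"].count().unstack()

print(taxa_per_plot)
print(individuals)
```

Calling `.plot(kind='bar')` on `taxa_per_plot`, or `.plot(kind='bar', stacked=True)` on `individuals`, would then give the same kinds of chart as in the solution.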
431501
502+ ::::::::::::::::::::::: instructor
503+
504+ ## Suggestion (for discussion only)
505+
506+ The number of individuals for each taxon in each plot per sex can be derived as well.
507+
508+ ```python
509+ sex_taxa_site = merged_left.groupby(["plot_id", "taxa", "sex"]).count()['record_id']
510+ sex_taxa_site.unstack(level=[1, 2]).plot(kind='bar', logy=True)
511+ plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.15),
512+            fontsize='small', frameon=False)
513+ ```
514+
515+ ![](fig/04_chall_sex_taxa_site_intro.png){alt='taxa per plot per sex'}
516+
517+ This is not the best plot choice because it is not easily readable.
518+ A first option to improve it is to split the plot into facets.
519+ However, pandas and matplotlib do not provide this by default.
520+ As a pure matplotlib example (`M|F` is for records with no defined sex):
521+
522+ ```python
523+ fig, axs = plt.subplots(3, 1)
524+ for sex, ax in zip(["M", "F", "M|F"], axs):
525+     sex_taxa_site.xs(sex, level="sex").unstack(level="taxa").plot(kind='bar', ax=ax, legend=False)  # sex is an index level of this Series, not a column
526+     ax.set_ylabel(sex)
527+     if not ax.get_subplotspec().is_last_row():  # Axes.is_last_row() was removed in newer matplotlib releases
528+         ax.set_xticks([])
529+         ax.set_xlabel("")
530+ axs[0].legend(loc='upper center', ncol=5, bbox_to_anchor=(0.5, 1.3),
531+               fontsize='small', frameon=False)
532+ ```
533+
534+ ![](fig/04_chall_sex_taxa_site.png){alt='taxa per plot per sex'}
535+
536+ However, it would be better to link to [Seaborn][seaborn]
537+ and [Altair][altair] for this kind of multivariate visualisation.
538+
539+ ::::::::::::::::::::::::::::::::::
540+
432541::::::::::::::::::::::::::::::::::::::: challenge
433542
434543### Challenge - Diversity Index
@@ -441,17 +550,46 @@ Create a new DataFrame by joining the contents of the `surveys.csv` and
441550 plots. The index should consider both species abundance and number of
442551 species. You might choose to use the simple [ biodiversity index described
443552 here] ( https://www.amnh.org/explore/curriculum-collections/biodiversity-counts/plant-ecology/how-to-calculate-a-biodiversity-index )
444- which calculates diversity as:
553+ which calculates diversity as: the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
445554
446- the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
555+ ::::::::::::::::::::::: solution
556+
557+ 1.
558+ ```python
559+ plot_info = pd.read_csv("data/plots.csv")
560+ plot_info.groupby("plot_type").count()
561+ ```
562+
563+ 2.
564+ ```python
565+ merged_site_type = pd.merge(merged_left, plot_info, on='plot_id')
566+ # For each plot, get the number of species
567+ nspecies_site = merged_site_type.groupby(["plot_id"])["species"].nunique().rename("nspecies")
568+ # For each plot, get the number of individuals
569+ nindividuals_site = merged_site_type.groupby(["plot_id"]).count()['record_id'].rename("nindiv")
570+ # combine the two series
571+ diversity_index = pd.concat([nspecies_site, nindividuals_site], axis=1)
572+ # calculate the diversity index
573+ diversity_index['diversity'] = diversity_index['nspecies'] / diversity_index['nindiv']
574+ ```
447575
576+ Making a bar chart from this diversity index:
577+
578+ ```python
579+ diversity_index['diversity'].plot(kind="barh")
580+ plt.xlabel("Diversity index")
581+ ```
448582
449- ::::::::::::::::::::::::::::::::::::::::::::::::::
583+ ![](fig/04_chall_diversity_index.png){alt='horizontal bar chart of diversity index by plot'}
450584
585+ ::::::::::::::::::::::::::::::::
451586
587+ ::::::::::::::::::::::::::::::::::::::::::::::::::
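
Reviewer note: the index used in this challenge (number of species in a plot divided by number of individuals in that plot) can be worked through by hand on a tiny table. This is a sketch under stated assumptions: the records below are made up, and named aggregation is used as one possible alternative to concatenating two Series as the solution does.

```python
import pandas as pd

# Made-up records: plot 1 has 4 individuals of 3 species,
# plot 2 has 2 individuals of 1 species.
records = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "plot_id": [1, 1, 1, 1, 2, 2],
    "species": ["merriami", "merriami", "ordii", "flavus", "merriami", "merriami"],
})

# One pass over the groups: distinct species and number of records per plot.
per_plot = records.groupby("plot_id").agg(
    nspecies=("species", "nunique"),
    nindiv=("record_id", "count"),
)

# Biodiversity index = number of species / number of individuals.
per_plot["diversity"] = per_plot["nspecies"] / per_plot["nindiv"]

print(per_plot)
```

Plot 1 scores 3 / 4 = 0.75 and plot 2 scores 1 / 2 = 0.5, so the plot with more species relative to its number of individuals ranks as more diverse.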
452588
453- [join-types]: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
454589
590+ [altair]: https://github.com/ellisonbg/altair
591+ [join-types]: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
592+ [seaborn]: https://stanford.edu/~mwaskom/software/seaborn
455593
456594:::::::::::::::::::::::::::::::::::::::: keypoints
457595