@@ -132,7 +132,7 @@ We can use Pandas' `read_csv` function to pull the file directly into a [DataFra
132132### So What's a DataFrame?
133133
134134A DataFrame is a 2-dimensional data structure that can store data of different
135- types (including characters, integers, floating point values, factors and more)
135+ types (including strings, numbers, categories and more)
136136in columns. It is similar to a spreadsheet or an SQL table or the ` data.frame ` in
137137R. A DataFrame always has an index (0-based). An index refers to the position of
138138an element in the data structure.
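
As a minimal sketch of the description above, the following builds a small DataFrame whose columns hold different data types. The name `toy_df` and its values are hypothetical and are not taken from the surveys file.

```python
import pandas as pd

# Hypothetical toy data, not taken from the surveys file
toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", "PF"],   # strings
    "weight": [40.0, 48.5, 7.0],        # floating point numbers
    "plot_id": [2, 3, 2],               # integers
})

print(toy_df.dtypes)        # one data type per column
print(list(toy_df.index))   # default 0-based index: [0, 1, 2]
```

Each column keeps its own type, while the default index labels the rows by position starting at 0.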
@@ -347,9 +347,43 @@ what they return.
347347
3483484 . ` surveys_df.tail() `
349349
350+ ::::::::::::::::::::::: solution
351+
352+ 1. `surveys_df.columns` provides the names of the columns in the DataFrame.
353+ 2. `surveys_df.shape` provides the dimensions of the DataFrame as a tuple
354+ in `(r,c)` format,
355+ where `r` is the number of rows and `c` the number of columns.
356+ 3. `surveys_df.head()` returns the first 5 rows of the DataFrame,
357+ annotated with column and row labels.
358+ Adding an integer as an argument to the method
359+ specifies the number of rows to return from the top of the DataFrame,
360+ e.g. `surveys_df.head(15)` will return the first 15 rows.
361+ 4. `surveys_df.tail()` returns the last 5 rows,
362+ and behaves similarly to the `head()` method.
363+
364+ ::::::::::::::::::::::::::::::::
350365
351366::::::::::::::::::::::::::::::::::::::::::::::::::
352367
368+ ::::::::::::::::::::::: instructor
369+
370+ ## Recapping object (im)mutability
371+
372+ Working through solutions to the challenge above
373+ is a good opportunity to recap the mutability and immutability
374+ of different objects.
375+ Show that the DataFrame's column labels
376+ (the `columns` attribute) are immutable, e.g.
377+ `surveys_df.columns[4] = "plotid"` raises a `TypeError`.
378+
379+ To change a column name, use the `rename` method, which returns a new DataFrame:
380+
381+ ``` python
382+ surveys_df.rename(columns={"plot_id": "plotid"})
383+ ```
384+
385+ ::::::::::::::::::::::::::::::::::
386+
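
The instructor note above can be demonstrated without the surveys file. The sketch below uses a hypothetical two-column `toy_df` to show that item assignment on `columns` fails, while `rename` returns a renamed copy and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data standing in for surveys_df
toy_df = pd.DataFrame({"plot_id": [1, 2], "weight": [40.0, 48.5]})

try:
    toy_df.columns[0] = "plotid"   # Index objects do not support item assignment
except TypeError as err:
    print("TypeError:", err)

# rename returns a new DataFrame; toy_df itself keeps its column names
renamed_df = toy_df.rename(columns={"plot_id": "plotid"})
print(list(toy_df.columns))
print(list(renamed_df.columns))
```

To keep the new name, assign the result back to a variable, as done here with `renamed_df`.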
353387### Calculating Statistics From Data In A Pandas DataFrame
354388
355389We've read our data into Python. Next, let's perform some quick summary
@@ -394,12 +428,26 @@ array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
394428
395429### Challenge - Statistics
396430
397- 1 . Create a list of unique site ID's ("plot\_ id") found in the surveys data. Call it
431+ 1. Create a list of unique site IDs ("plot\_id") found in the surveys data. Call it
398432 ` site_names ` . How many unique sites are there in the data? How many unique
399433 species are in the data?
400434
4014352 . What is the difference between ` len(site_names) ` and ` surveys_df['plot_id'].nunique() ` ?
402-
436+
437+ ::::::::::::::::::::::: solution
438+
439+ 1. `site_names = pd.unique(surveys_df["plot_id"])`
440+ - How many unique sites are in the data?
441+ `site_names.size` or `len(site_names)` provides the answer: 24
442+ - How many unique species are in the data?
443+ `len(pd.unique(surveys_df["species_id"]))` tells us there are 49 species
444+ 2. `len(site_names)` and `surveys_df['plot_id'].nunique()`
445+ both provide the same output:
446+ they are alternative ways of counting the unique values.
447+ The `nunique` method combines the count and unique value extraction,
448+ and can help avoid the creation of intermediate variables like `site_names`.
449+
450+ ::::::::::::::::::::::::::::::::
403451
404452::::::::::::::::::::::::::::::::::::::::::::::::::
405453
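
The two approaches agree only when the column has no missing values. As a hedged illustration on hypothetical toy data (not the surveys file), `pd.unique` keeps a missing value as one of the unique entries, whereas `nunique` skips missing values unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column containing one missing value
toy_df = pd.DataFrame({"species_id": ["NL", "DM", np.nan, "DM"]})

unique_values = pd.unique(toy_df["species_id"])
print(len(unique_values))                           # counts the missing value too
print(toy_df["species_id"].nunique())               # skips missing values by default
print(toy_df["species_id"].nunique(dropna=False))   # includes the missing value
```

It is worth checking a column for missing values before comparing counts obtained in these two ways.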
@@ -496,27 +544,66 @@ summary stats.
496544
497545::::::::::::::: solution
498546
499- ### Did you get #3 right?
547+ 1. The first column of output from `grouped_data.describe()` (count)
548+ tells us that the data contains 15690 records for female individuals
549+ and 17348 records for male individuals.
550+ - Note that these two numbers do not sum to 35549,
551+ the total number of rows we know to be in the `surveys_df` DataFrame.
552+ Why do you think some records were excluded from the grouping?
553+ 2. Calling the `mean()` method on data grouped by these two columns
554+ calculates and returns
555+ the mean value for each combination of plot and sex.
556+ - Note that the mean is not meaningful for some variables,
557+ e.g. day, month, and year.
558+ You can specify particular columns and particular summary statistics
559+ using the `agg()` method (short for _aggregate_),
560+ e.g. to obtain
561+ the last survey year,
562+ median hindfoot length
563+ and mean weight for each plot/sex combination:
565+ ``` python
566+ surveys_df.groupby(['plot_id', 'sex']).agg({"year": 'max',
567+                                             "hindfoot_length": 'median',
568+                                             "weight": 'mean'})
569+ ```
500570
501- ** A Snippet of the Output from challenge 3 looks like: **
571+ 3. `surveys_df.groupby(['plot_id'])['weight'].describe()`
502572
503573``` output
504- site
505- 1 count 1903.000000
506- mean 51.822911
507- std 38.176670
508- min 4.000000
509- 25% 30.000000
510- 50% 44.000000
511- 75% 53.000000
512- max 231.000000
513- ...
574+ count mean std min 25% 50% 75% max
575+ plot_id
576+ 1 1903.0 51.822911 38.176670 4.0 30.0 44.0 53.0 231.0
577+ 2 2074.0 52.251688 46.503602 5.0 24.0 41.0 50.0 278.0
578+ 3 1710.0 32.654386 35.641630 4.0 14.0 23.0 36.0 250.0
579+ 4 1866.0 47.928189 32.886598 4.0 30.0 43.0 50.0 200.0
580+ 5 1092.0 40.947802 34.086616 5.0 21.0 37.0 48.0 248.0
581+ 6 1463.0 36.738893 30.648310 5.0 18.0 30.0 45.0 243.0
582+ 7 638.0 20.663009 21.315325 4.0 11.0 17.0 23.0 235.0
583+ 8 1781.0 47.758001 33.192194 5.0 26.0 44.0 51.0 178.0
584+ 9 1811.0 51.432358 33.724726 6.0 36.0 45.0 50.0 275.0
585+ 10 279.0 18.541219 20.290806 4.0 10.0 12.0 21.0 237.0
586+ 11 1793.0 43.451757 28.975514 5.0 26.0 42.0 48.0 212.0
587+ 12 2219.0 49.496169 41.630035 6.0 26.0 42.0 50.0 280.0
588+ 13 1371.0 40.445660 34.042767 5.0 20.5 33.0 45.0 241.0
589+ 14 1728.0 46.277199 27.570389 5.0 36.0 44.0 49.0 222.0
590+ 15 869.0 27.042578 35.178142 4.0 11.0 18.0 26.0 259.0
591+ 16 480.0 24.585417 17.682334 4.0 12.0 20.0 34.0 158.0
592+ 17 1893.0 47.889593 35.802399 4.0 27.0 42.0 50.0 216.0
593+ 18 1351.0 40.005922 38.480856 5.0 17.5 30.0 44.0 256.0
594+ 19 1084.0 21.105166 13.269840 4.0 11.0 19.0 27.0 139.0
595+ 20 1222.0 48.665303 50.111539 5.0 17.0 31.0 47.0 223.0
596+ 21 1029.0 24.627794 21.199819 4.0 10.0 22.0 31.0 190.0
597+ 22 1298.0 54.146379 38.743967 5.0 29.0 42.0 54.0 212.0
598+ 23 369.0 19.634146 18.382678 4.0 10.0 14.0 23.0 199.0
599+ 24 960.0 43.679167 45.936588 4.0 19.0 27.5 45.0 251.0
514600```
515601
516602:::::::::::::::::::::::::
517603
518604::::::::::::::::::::::::::::::::::::::::::::::::::
519605
606+
520607### Quickly Creating Summary Counts in Pandas
521608
522609Let's next count the number of samples for each species. We can do this in a few
@@ -542,6 +629,70 @@ What's another way to create a list of species and associated `count` of the
542629records in the data? Hint: you can perform ` count ` , ` min ` , etc. functions on
543630groupby DataFrames in the same way you can perform them on regular DataFrames.
544631
632+ ::::::::::::::::::::::: solution
633+
634+ As well as calling ` count() ` on the ` record_id ` column of the grouped
635+ DataFrame as above,
636+ an equivalent result can be obtained by extracting ` record_id ` from the
637+ result of ` count() ` called directly on the grouped DataFrame:
638+
639+ ``` python
640+ surveys_df.groupby('species_id').count()['record_id']
641+ ```
642+
643+ ``` output
644+ species_id
645+ AB 303
646+ AH 437
647+ AS 2
648+ BA 46
649+ CB 50
650+ CM 13
651+ CQ 16
652+ CS 1
653+ CT 1
654+ CU 1
655+ CV 1
656+ DM 10596
657+ DO 3027
658+ DS 2504
659+ DX 40
660+ NL 1252
661+ OL 1006
662+ OT 2249
663+ OX 12
664+ PB 2891
665+ PC 39
666+ PE 1299
667+ PF 1597
668+ PG 8
669+ PH 32
670+ PI 9
671+ PL 36
672+ PM 899
673+ PP 3123
674+ PU 5
675+ PX 6
676+ RF 75
677+ RM 2609
678+ RO 8
679+ RX 2
680+ SA 75
681+ SC 1
682+ SF 43
683+ SH 147
684+ SO 43
685+ SS 248
686+ ST 1
687+ SU 5
688+ UL 4
689+ UP 8
690+ UR 10
691+ US 4
692+ ZL 2
693+ ```
694+
695+ ::::::::::::::::::::::::::::::::
545696
546697::::::::::::::::::::::::::::::::::::::::::::::::::
547698
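
To check that the two ways of counting records per species agree, here is a small sketch on hypothetical toy data (the values are made up for illustration). It selects the column before counting in one case and after counting in the other, then compares the results.

```python
import pandas as pd

# Hypothetical toy data, not taken from the surveys file
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["DM", "DM", "NL", "PF"],
})

# Select the column first, then count
counts_a = toy_df.groupby("species_id")["record_id"].count()
# Count every column first, then select one
counts_b = toy_df.groupby("species_id").count()["record_id"]

print(counts_a.equals(counts_b))
print(counts_a)
```

Selecting the column first avoids counting columns that are then discarded, which can matter on larger DataFrames.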
@@ -587,6 +738,18 @@ total_count.plot(kind='bar');
5877381 . Create a plot of average weight across all species per site.
5887392 . Create a plot of total males versus total females for the entire dataset.
589740
741+ ::::::::::::::::::::::: solution
742+
743+ 1. `surveys_df.groupby('plot_id')["weight"].mean().plot(kind='bar')`
744+
745+ ![](fig/01_chall_bar_meanweight.png){alt='average weight across all species for each plot'}
746+
747+ 2. `surveys_df.groupby('sex').count()["record_id"].plot(kind='bar')`
748+
749+ ![](fig/01_chall_bar_totalsex.png){alt='total males versus total females for the entire dataset'}
750+
751+ ::::::::::::::::::::::::::::::::
752+
590753
591754::::::::::::::::::::::::::::::::::::::::::::::::::
592755