Commit 8a12dee

Merge branch 'main' into episode5-solutions
2 parents 40a90f4 + 87ecab0 commit 8a12dee

11 files changed

Lines changed: 9179 additions & 727 deletions

episodes/01-short-introduction-to-Python.md

Lines changed: 115 additions & 1 deletion
@@ -176,6 +176,12 @@ Notice that "Data Carpentry" is printed only once.
176176
lesson, we will introduce methods and user-defined functions. The Python
177177
documentation is excellent for reference on the differences between them.
178178

179+
**Tip**: When editing scripts like *example.py*, be careful not to use word
180+
processors such as MS Word, as they may introduce extra information that
181+
confuses Python. In this lesson we will be using either Jupyter notebooks or
182+
the Spyder IDE, and for your everyday work you may also choose any text editor
183+
such as Notepad++, VSCode, Vim, or Emacs.
184+
179185
### Operators
180186

181187
We can perform mathematical calculations in Python using the basic operators
@@ -340,7 +346,72 @@ a_list = [1, 2, 3]
340346
4. What information does the built-in function `len()` provide?
341347
Does it provide the same information on both tuples and lists?
342348
Does the `help()` function confirm this?
343-
349+
350+
::::::::::::::::::::::::::: solution
351+
352+
1. What happens when you execute `a_list[1] = 5`?
353+
354+
The second value in `a_list` is replaced with `5`.
355+
356+
2. What happens when you execute `a_tuple[2] = 5`?
357+
358+
```error
359+
TypeError: 'tuple' object does not support item assignment
360+
```
361+
362+
As a tuple is immutable, it does not support item assignment.
363+
Elements in a list can be altered individually.
364+
365+
3. What does `type(a_tuple)` tell you about `a_tuple`?
366+
367+
```output
368+
<class 'tuple'>
369+
```
370+
371+
The function tells you that the variable `a_tuple` is an object of the class `tuple`.
372+
373+
4. What information does the built-in function `len()` provide?
374+
Does it provide the same information on both tuples and lists?
375+
Does the `help()` function confirm this?
376+
377+
```python
378+
len(a_list)
379+
```
380+
381+
```output
382+
3
383+
```
384+
385+
```python
386+
len(a_tuple)
387+
```
388+
389+
```output
390+
3
391+
```
392+
393+
`len()` tells us the length of an object.
394+
It works the same for both lists and tuples,
395+
providing us with the number of entries in each case.
396+
397+
```python
398+
help(len)
399+
```
400+
401+
```output
402+
Help on built-in function len in module builtins:
403+
404+
len(obj, /)
405+
Return the number of items in a container.
406+
```
407+
408+
Lists and tuples are both types of container
409+
i.e. objects that can contain multiple items,
410+
the key difference being that lists are mutable i.e.
411+
they can be modified after they have been created,
412+
while tuples are not: their contents cannot be modified in place, only replaced by assigning a new tuple to the variable.
413+
414+
::::::::::::::::::::::::::::::::::::
344415

345416
::::::::::::::::::::::::::::::::::::::::::::::::::
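
The mutability difference described in the solution above can be sketched as follows. The names `a_list` and `a_tuple` follow the lesson, but the contents of `a_tuple` are an assumption, since its definition is not shown in this diff.

```python
# a_list matches the lesson; a_tuple's contents are assumed for illustration.
a_list = [1, 2, 3]
a_tuple = (1, 2, 3)

# Lists are mutable: item assignment changes the list in place.
a_list[1] = 5

# Tuples are immutable: item assignment raises a TypeError.
try:
    a_tuple[2] = 5
    tuple_error = None
except TypeError as err:
    tuple_error = str(err)

# "Overwriting" a tuple means binding the name to a new tuple object.
a_tuple = a_tuple[:2] + (5,)

print(a_list)       # [1, 5, 3]
print(tuple_error)  # 'tuple' object does not support item assignment
print(a_tuple)      # (1, 2, 5)
```

The last assignment does not change the original tuple; it builds a new one and rebinds the name to it.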
346417

@@ -416,9 +487,52 @@ for key in rev.keys():
416487
reads "two" but instead `2`.
417488
3. Print the value of `rev` to the screen again to see if the value has changed.
418489

490+
::::::::::::::::::::::::::: solution
491+
492+
1.
493+
494+
```python
495+
print(rev)
496+
```
497+
498+
```output
499+
{'first': 'one', 'second': 'two', 'third': 'three'}
500+
```
501+
502+
2. and 3.
503+
504+
```python
505+
rev['second'] = 2
506+
print(rev)
507+
```
508+
509+
```output
510+
{'first': 'one', 'second': 2, 'third': 'three'}
511+
```
512+
513+
::::::::::::::::::::::::::::::::::::
419514

420515
::::::::::::::::::::::::::::::::::::::::::::::::::
421516

517+
:::::::::::::::::::::::: instructor
518+
519+
## Assigning to Dictionaries
520+
521+
It can help to further demonstrate the freedom the user has to assign
522+
values to keys in a dictionary, by showing another example with a value
523+
completely unrelated to the current contents of the dictionary, e.g.
524+
525+
```python
526+
rev[2] = "apple-sauce"
527+
print(rev)
528+
```
529+
530+
```output
531+
{1: 'one', 2: 'apple-sauce', 3: 'three'}
532+
```
533+
534+
:::::::::::::::::::::::::::::::::::
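
A minimal sketch of the same idea, using a hypothetical dictionary named `words` rather than the lesson's `rev` (whose full definition is not shown in this diff): assigning to an existing key replaces its value, while assigning to a new key adds an entry.

```python
# Hypothetical dictionary for illustration; not the lesson's `rev`.
words = {"first": "one", "second": "two"}

words["second"] = 2        # existing key: the value is replaced
words["third"] = "three"   # new key: a new entry is added
words[4] = "apple-sauce"   # keys and values need not share a type

print(words)
# {'first': 'one', 'second': 2, 'third': 'three', 4: 'apple-sauce'}
```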
535+
422536
## Functions
423537

424538
Defining a section of code as a **function** in Python is done using the `def`

episodes/02-starting-with-data.md

Lines changed: 178 additions & 15 deletions
@@ -132,7 +132,7 @@ We can use Pandas' `read_csv` function to pull the file directly into a [DataFra
132132
### So What's a DataFrame?
133133

134134
A DataFrame is a 2-dimensional data structure that can store data of different
135-
types (including characters, integers, floating point values, factors and more)
135+
types (including strings, numbers, categories and more)
136136
in columns. It is similar to a spreadsheet or an SQL table or the `data.frame` in
137137
R. A DataFrame always has an index (0-based). An index refers to the position of
138138
an element in the data structure.
@@ -347,9 +347,43 @@ what they return.
347347

348348
4. `surveys_df.tail()`
349349

350+
::::::::::::::::::::::: solution
351+
352+
1. `surveys_df.columns` provides the names of the columns in the DataFrame.
353+
2. `surveys_df.shape` provides the dimensions of the DataFrame as a tuple
354+
in `(r,c)` format,
355+
where `r` is the number of rows and `c` the number of columns.
356+
3. `surveys_df.head()` returns the first 5 lines of the DataFrame,
357+
annotated with column and row labels.
358+
Adding an integer as an argument to the function
359+
specifies the number of lines to display from the top of the DataFrame,
360+
e.g. `surveys_df.head(15)` will return the first 15 lines.
361+
4. `surveys_df.tail()` will display the last 5 lines,
362+
and behaves similarly to the `head()` method.
363+
364+
::::::::::::::::::::::::::::::::
350365

351366
::::::::::::::::::::::::::::::::::::::::::::::::::
352367

368+
::::::::::::::::::::::: instructor
369+
370+
## Recapping object (im)mutability
371+
372+
Working through solutions to the challenge above
373+
can provide a good opportunity to recap about mutability and immutability
374+
of different objects.
375+
Show that the DataFrame index
376+
(the `columns` attribute) is immutable, e.g.
377+
`surveys_df.columns[4] = "plotid"` raises a `TypeError`.
378+
379+
Renaming a column is done with the `rename` method, which returns a new DataFrame:
380+
381+
```python
382+
surveys_df.rename(columns={"plot_id": "plotid"})
383+
```
384+
385+
::::::::::::::::::::::::::::::::::
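
The instructor note above could be demonstrated roughly as follows. This is a sketch on a tiny made-up DataFrame named `df`, standing in for `surveys_df`; only the column names are borrowed from the lesson.

```python
import pandas as pd

# Tiny stand-in for surveys_df; the values are made up for illustration.
df = pd.DataFrame({"record_id": [1, 2], "plot_id": [2, 3]})

# The columns Index does not support item assignment.
try:
    df.columns[1] = "plotid"
    index_error = None
except TypeError as err:
    index_error = type(err).__name__

# rename() returns a new DataFrame and leaves the original unchanged.
renamed = df.rename(columns={"plot_id": "plotid"})

print(index_error)            # TypeError
print(list(df.columns))       # ['record_id', 'plot_id']
print(list(renamed.columns))  # ['record_id', 'plotid']
```

To keep the new name, the result has to be assigned back to a variable, e.g. `df = df.rename(...)`.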
386+
353387
### Calculating Statistics From Data In A Pandas DataFrame
354388

355389
We've read our data into Python. Next, let's perform some quick summary
@@ -394,12 +428,26 @@ array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
394428

395429
### Challenge - Statistics
396430

397-
1. Create a list of unique site ID's ("plot\_id") found in the surveys data. Call it
431+
1. Create a list of unique site IDs ("plot\_id") found in the surveys data. Call it
398432
`site_names`. How many unique sites are there in the data? How many unique
399433
species are in the data?
400434

401435
2. What is the difference between `len(site_names)` and `surveys_df['plot_id'].nunique()`?
402-
436+
437+
::::::::::::::::::::::: solution
438+
439+
1. `site_names = pd.unique(surveys_df["plot_id"])`
440+
- How many unique sites are in the data?
441+
`site_names.size` or `len(site_names)` provide the answer: 24
442+
- How many unique species are in the data?
443+
`len(pd.unique(surveys_df["species_id"]))` tells us there are 49 species
444+
2. `len(site_names)` and `surveys_df['plot_id'].nunique()`
445+
both provide the same output:
446+
they are alternative ways of getting the number of unique values.
447+
The `nunique` method combines the count and unique value extraction,
448+
and can help avoid the creation of intermediate variables like `site_names`.
449+
450+
::::::::::::::::::::::::::::::::
403451

404452
::::::::::::::::::::::::::::::::::::::::::::::::::
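
One caveat may be worth showing alongside this challenge: the two approaches agree only when the column has no missing values, because `pd.unique` keeps a missing value as one of the unique entries while `nunique()` drops it by default. The sketch below uses a small made-up DataFrame named `df`, not the real survey data.

```python
import pandas as pd

# Made-up data; one species_id is deliberately missing.
df = pd.DataFrame({
    "plot_id": [1, 1, 2, 3],
    "species_id": ["DM", None, "DM", "PF"],
})

# No missing values: both approaches give the same number.
site_names = pd.unique(df["plot_id"])
n_sites_len = len(site_names)
n_sites_nunique = df["plot_id"].nunique()

# With a missing value: pd.unique includes it, nunique() ignores it.
n_species_len = len(pd.unique(df["species_id"]))
n_species_nunique = df["species_id"].nunique()

print(n_sites_len, n_sites_nunique)      # 3 3
print(n_species_len, n_species_nunique)  # 3 2
```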
405453

@@ -496,27 +544,66 @@ summary stats.
496544

497545
::::::::::::::: solution
498546

499-
### Did you get #3 right?
547+
1. The first column of output from `grouped_data.describe()` (count)
548+
tells us that the data contains 15690 records for female individuals
549+
and 17348 records for male individuals.
550+
- Note that these two numbers do not sum to 35549,
551+
the total number of rows we know to be in the `surveys_df` DataFrame.
552+
Why do you think some records were excluded from the grouping?
553+
2. Calling the `mean()` method on data grouped by these two columns
554+
calculates and returns
555+
the mean value for each combination of plot and sex.
556+
- Note that the mean is not meaningful for some variables,
557+
e.g. day, month, and year.
558+
You can specify particular columns and particular summary statistics
559+
using the `agg()` method (short for _aggregate_),
560+
e.g. to obtain
561+
the last survey year,
562+
median foot-length
563+
and mean weight for each plot/sex combination:
564+
565+
```python
566+
surveys_df.groupby(['plot_id', 'sex']).agg({"year": 'max',
567+
"hindfoot_length": 'median',
568+
"weight": 'mean'})
569+
```
500570

501-
**A Snippet of the Output from challenge 3 looks like:**
571+
3. `surveys_df.groupby(['plot_id'])['weight'].describe()`
502572

503573
```output
504-
site
505-
1 count 1903.000000
506-
mean 51.822911
507-
std 38.176670
508-
min 4.000000
509-
25% 30.000000
510-
50% 44.000000
511-
75% 53.000000
512-
max 231.000000
513-
...
574+
          count       mean        std  min   25%   50%   75%    max
575+
plot_id
576+
1        1903.0  51.822911  38.176670  4.0  30.0  44.0  53.0  231.0
577+
2        2074.0  52.251688  46.503602  5.0  24.0  41.0  50.0  278.0
578+
3        1710.0  32.654386  35.641630  4.0  14.0  23.0  36.0  250.0
579+
4        1866.0  47.928189  32.886598  4.0  30.0  43.0  50.0  200.0
580+
5        1092.0  40.947802  34.086616  5.0  21.0  37.0  48.0  248.0
581+
6        1463.0  36.738893  30.648310  5.0  18.0  30.0  45.0  243.0
582+
7         638.0  20.663009  21.315325  4.0  11.0  17.0  23.0  235.0
583+
8        1781.0  47.758001  33.192194  5.0  26.0  44.0  51.0  178.0
584+
9        1811.0  51.432358  33.724726  6.0  36.0  45.0  50.0  275.0
585+
10        279.0  18.541219  20.290806  4.0  10.0  12.0  21.0  237.0
586+
11       1793.0  43.451757  28.975514  5.0  26.0  42.0  48.0  212.0
587+
12       2219.0  49.496169  41.630035  6.0  26.0  42.0  50.0  280.0
588+
13       1371.0  40.445660  34.042767  5.0  20.5  33.0  45.0  241.0
589+
14       1728.0  46.277199  27.570389  5.0  36.0  44.0  49.0  222.0
590+
15        869.0  27.042578  35.178142  4.0  11.0  18.0  26.0  259.0
591+
16        480.0  24.585417  17.682334  4.0  12.0  20.0  34.0  158.0
592+
17       1893.0  47.889593  35.802399  4.0  27.0  42.0  50.0  216.0
593+
18       1351.0  40.005922  38.480856  5.0  17.5  30.0  44.0  256.0
594+
19       1084.0  21.105166  13.269840  4.0  11.0  19.0  27.0  139.0
595+
20       1222.0  48.665303  50.111539  5.0  17.0  31.0  47.0  223.0
596+
21       1029.0  24.627794  21.199819  4.0  10.0  22.0  31.0  190.0
597+
22       1298.0  54.146379  38.743967  5.0  29.0  42.0  54.0  212.0
598+
23        369.0  19.634146  18.382678  4.0  10.0  14.0  23.0  199.0
599+
24        960.0  43.679167  45.936588  4.0  19.0  27.5  45.0  251.0
514600
```
515601

516602
:::::::::::::::::::::::::
517603

518604
::::::::::::::::::::::::::::::::::::::::::::::::::
519605

606+
520607
### Quickly Creating Summary Counts in Pandas
521608

522609
Let's next count the number of samples for each species. We can do this in a few
@@ -542,6 +629,70 @@ What's another way to create a list of species and associated `count` of the
542629
records in the data? Hint: you can perform `count`, `min`, etc. functions on
543630
groupby DataFrames in the same way you can perform them on regular DataFrames.
544631

632+
::::::::::::::::::::::: solution
633+
634+
As well as calling `count()` on the `record_id` column of the grouped
635+
DataFrame as above,
636+
an equivalent result can be obtained by extracting `record_id` from the
637+
result of `count()` called directly on the grouped DataFrame:
638+
639+
```python
640+
surveys_df.groupby('species_id').count()['record_id']
641+
```
642+
643+
```output
644+
species_id
645+
AB 303
646+
AH 437
647+
AS 2
648+
BA 46
649+
CB 50
650+
CM 13
651+
CQ 16
652+
CS 1
653+
CT 1
654+
CU 1
655+
CV 1
656+
DM 10596
657+
DO 3027
658+
DS 2504
659+
DX 40
660+
NL 1252
661+
OL 1006
662+
OT 2249
663+
OX 12
664+
PB 2891
665+
PC 39
666+
PE 1299
667+
PF 1597
668+
PG 8
669+
PH 32
670+
PI 9
671+
PL 36
672+
PM 899
673+
PP 3123
674+
PU 5
675+
PX 6
676+
RF 75
677+
RM 2609
678+
RO 8
679+
RX 2
680+
SA 75
681+
SC 1
682+
SF 43
683+
SH 147
684+
SO 43
685+
SS 248
686+
ST 1
687+
SU 5
688+
UL 4
689+
UP 8
690+
UR 10
691+
US 4
692+
ZL 2
693+
```
694+
695+
::::::::::::::::::::::::::::::::
545696

546697
::::::::::::::::::::::::::::::::::::::::::::::::::
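
The equivalence discussed in this challenge can be checked on a small example. The DataFrame `df` below is made up for illustration; `value_counts()` is included as a further option that is not part of the committed solution.

```python
import pandas as pd

# Made-up records; only the column names follow the lesson.
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["DM", "DM", "PF", "NL"],
})

# Select the column first, then count within each group.
by_column = df.groupby("species_id")["record_id"].count()

# Count every column within each group, then select one column.
by_frame = df.groupby("species_id").count()["record_id"]

# Count occurrences of each species_id directly (sorted to match groupby order).
by_value_counts = df["species_id"].value_counts().sort_index()

print(by_column.to_dict())  # {'DM': 2, 'NL': 1, 'PF': 1}
```

Selecting the column before calling `count()` avoids counting columns that are then discarded.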
547698

@@ -587,6 +738,18 @@ total_count.plot(kind='bar');
587738
1. Create a plot of average weight across all species per site.
588739
2. Create a plot of total males versus total females for the entire dataset.
589740

741+
::::::::::::::::::::::: solution
742+
743+
1. `surveys_df.groupby('plot_id')["weight"].mean().plot(kind='bar')`
744+
745+
![](fig/01_chall_bar_meanweight.png){alt='average weight across all species for each plot'}
746+
747+
2. `surveys_df.groupby('sex').count()["record_id"].plot(kind='bar')`
748+
749+
![](fig/01_chall_bar_totalsex.png){alt='total males versus total females for the entire dataset'}
750+
751+
::::::::::::::::::::::::::::::::
752+
590753

591754
::::::::::::::::::::::::::::::::::::::::::::::::::
592755
