diff --git a/docs/javascripts/mathjax.js b/docs/javascripts/mathjax.js new file mode 100644 index 0000000..0be88e0 --- /dev/null +++ b/docs/javascripts/mathjax.js @@ -0,0 +1,19 @@ +window.MathJax = { + tex: { + inlineMath: [["\\(", "\\)"]], + displayMath: [["\\[", "\\]"]], + processEscapes: true, + processEnvironments: true + }, + options: { + ignoreHtmlClass: ".*|", + processHtmlClass: "arithmatex" + } +}; + +document$.subscribe(() => { + MathJax.startup.output.clearCache() + MathJax.typesetClear() + MathJax.texReset() + MathJax.typesetPromise() +}) diff --git a/docs/wiki-guide/ABC-Glossary.md b/docs/wiki-guide/ABC-Glossary.md index dd32cfd..1a51224 100644 --- a/docs/wiki-guide/ABC-Glossary.md +++ b/docs/wiki-guide/ABC-Glossary.md @@ -16,6 +16,35 @@ It is meant to be a collaborative effort, so please [contribute](https://github. ## B +### Biodiversity + +Biodiversity has the dictionary definition of the “variety of life on earth”. In practice, this word can be broad to the point of almost being useless (“sure fine, but the devil is in the details”). It is often used as a generic catch-all phrase for “nature” or “life” or even “wildlife” in common colloquial use, like “protect biodiversity”, even though none of these are quite the same thing. + +Ecologists tend to think about biodiversity explicitly in terms of scales of organization. Probably the two most common ways of thinking about scale are spatial scale and levels of organization, which are sometimes correlated but are not the same thing. + +**Spatial scales** and how to think about them: + +- *Global* is the entire planet. +- *Continental*, think the Americas. +- *Regional*, think the East Coast of the US. +- [*Landscape*](#landscape): Approximately the scale at which spatial processes (such as dispersal and migration) operate for many charismatic plant and wildlife species. 
Note that landscape ecologists define landscape not as a spatial scale but as a field concerned with the operation of ecological processes across a heterogeneous backdrop. +- *Site*, as in one field site or study area (the term may also be applied to a particular sampling location within the study area). The sites themselves might differ vastly in terms of area–from $1m^2$ (or smaller) for things like experimental plots to tens of km (e.g., the Breeding Bird Survey’s “site” is ~40km long). + +**Levels of organization**: + +- [*Ecosystem*](#ecosystem): All of the organisms present along with their abiotic conditions. Focused on flows of energy or nutrients, or webs of interactions, between living members of a community and abiotic stocks. Ecosystem ecology has the most overlap with environmental science, and often thinks in box models or similar. Most overlap with climate modeling and climate science. Much canonical work was and is done in wetlands and aquatic systems. +- *Community*: The various populations of many species present at the same location and interacting with each other. Canonical topics are interactions (i.e., consumer-resource interactions, competition, mutualisms), and coexistence (e.g., why are there so many species?). This is what most outside the field think of when they think of “ecology” in general. +- [*Population*](#population): A single interacting group of organisms of a single species. Canonical topics include how populations are regulated (i.e., how their size is determined), what factors cause them to grow and shrink. Closely related to single-species conservation as trends in population size are often a quantity that is monitored and used to make conservation/management decisions. +- *Organismal*: At the scale of a single organism or below. Included in, but not a main focus of, ABC. 
Note that just because something is at this scale or smaller does not mean it’s organismal or sub-organismal biology. In the case of eDNA, for example, the DNA is used as a sign to indicate the presence of an organism, with questions and research usually conducted at the population or community level. + +#### Biodiversity Science + +This is not yet a discipline that is officially defined; it means different things to different people. + +### Biology + +The study of living things, things that reproduce themselves (including bacteria). + ## C ### CARE Principles for Indigenous Data Governance @@ -32,12 +61,27 @@ It is meant to be a collaborative effort, so please [contribute](https://github. For more information, see [CARE Principles for Indigenous Data Governance](https://www.gida-global.org/care). +### Checklist + +- List of species that are known to occur in a region (e.g., birds of Vermont). +- List of species that were seen and not seen during a fixed survey. + ### Contrastive Language-Image Pre-training (CLIP) +Contrastive language-image pretraining (CLIP) is a training objective that popularized the idea of training strong vision models from language supervision, rather than class supervision. ImageNet is the classic labeled image dataset: 1.2M images where each image is labeled with one class out of one thousand possible classes. + +CLIP enables learning strong vision representations from image-caption pairs scraped from the internet. + +CLIP also enables zero-shot image classification. + ## D ### Decoder +### Diffusion Models + +Diffusion models are a class of models used to generate images. They are the architecture for models like Stable Diffusion and DALL-E 3. They learn to reduce the noise in an image by just a little bit, conditioned on some text representation. Then, starting with pure random noise, you can iteratively apply a diffusion model to produce a new image. 
+ ### Dimensionality Reduction Used in machine learning and data analysis to refer to a set of methods used to reduce the number of variables or features under consideration to a smaller subset with the greatest explanatory power without drastically reducing the accuracy of the model or analysis. The purpose is to exclude irrelevant, redundant, and noisy information, thereby improving computational complexity and model interpretability. @@ -55,10 +99,36 @@ Dimensionality reduction techniques can be subdivided into two main categories: ### Ecology -### Epoch (in machine learning) +The study of ecosystems: how organisms interact with each other and their environments. + +### Ecosystem + +There are many definitions: + +- A set of potentially interacting organisms within their natural environment (including both biotic and abiotic environment). +- Some people distinguish ecosystems as being more often described by process (e.g., nutrient cycling), compared to biodiversity. ### Encoder +### Endemic + +- In general, describes a species that is found in one area and nowhere else (e.g., how many species are found exclusively in one region). +- Weighted endemism is more commonly used now for mapping; it is the percent of the distribution of one (or many) species in a grid cell (or other defined area). It’s the basis for spatial conservation planning. + +### Entomology + +The study of insects. + +### Epoch (in machine learning) + +### Evolution + +The change in proportion of genes or traits within a population over the course of generations (many disagreements about this definition). +Sometimes short and long time scales are distinguished: + +- **Short term**: over a few generations, the prevalence of a gene changes. +- **Long term**: new species with new morphologies (a.k.a. body parts) emerge. 
+ ### Experiment (in machine learning) ## F @@ -93,6 +163,33 @@ The key difference from feature extraction is that feature selection does not ge #### Feature Space +### Fit (AI/ML Version) + +A model may be "fit" to a particular training set, in that it is optimized for that data and/or training objective. A well fit model will perform well on its training data, but—more importantly—it will be *generalizable* to new data under the same objective (e.g., classification of animals in images that it has not seen). + +- A model is considered to be *overfit* if it is too highly specialized to its training data, e.g., the model may perform near perfectly on museum specimen images it saw in training, but it cannot recognize the same species when photographed on a different background without a ruler. +- Conversely, a model is *underfit* if it does not capture the underlying structure of even its training data and thus does not perform well on either its training or unseen data. From the previous example, the model wouldn't recognize species in any of the images. +- IBM has a [nice general summary of these concepts](https://www.ibm.com/think/topics/underfitting). + +### Fitness (Ecology Version) + +There are many definitions; one would be the number of offspring (that survive to reproduce) of an individual animal. +People often talk about how certain traits contribute to fitness, e.g., longer giraffe necks = more food, better survival = more chances to reproduce. + +### Foundation Model + +A “foundation model” is a term coined by Stanford researchers in 2021 to describe large models trained on very general data that can adapt to a wide range of downstream tasks. It is another phrase to describe the pre-train/fine-tune paradigm introduced by computer vision researchers in the late 2010s. + +Typically, a foundation model: + +- Has many, many parameters. +- Is pre-trained (normally in a self-supervised fashion) on a huge quantity of data at great cost. 
+- Can be adapted to many different tasks **with significantly less data**, either through prompting (language models), fine-tuning (smaller foundation models) or linear probing ([vision models](#vision-models)). + +### Functional Diversity + +The variety of different forms and functions in a community/assemblage. [Traits](#trait) (such as body mass, diet, morphological characteristics) are often proxies of an organism's functions in the ecosystem (e.g., long and pointy beaks in birds might imply a nectarivorous diet and therefore pollination), and we use traits of species to quantify functional diversity. + ## G ### Genome-Wide Association Study (GWAS) @@ -111,14 +208,37 @@ i-'mi-j**ə**-'**ō**-miks A new scientific field in which computational (machine learning) tools built around biological knowledge bases are used by biologists to analyze image data in order to characterize patterns and gain insights into traits and relationships at individual, population and species scales—insights that then get incorporated into the algorithms that run the tools. +### Inference + +#### Inference (Ecology Version) + +- Often used in reference to the interpretations/conclusions that we derive from our results/data analysis. +**Examples**: + - “How does this covariate affect whether or not a species is present?” e.g., “When there are at least 50 trees with a diameter over XYZ within a 200m radius of this spot, you have a 90% chance of finding a Cerulean Warbler there”. + - These sorts of covariate-presence associations are often used for species distribution modeling. +- Debates: Many ecologists make the distinction between inference and prediction. Inference is more about ‘inferring’ relationships between variables and linking them to theory than about trying to predict new situations; the emphasis is on interpretation rather than on making predictions. 
+ +### Integrated Development Environment (IDE) + +Software or application designed with features to aid in software development, such as code editing, build automation, and debugging. Common examples include [VSCode](https://code.visualstudio.com/) and [RStudio](https://posit.co/products/open-source/rstudio), both of which include version control/tracking, with added options for git-based tracking integration (e.g., through [GitHub](https://github.com/) or [GitLab](https://about.gitlab.com/)). Additional options include formatting settings, compilers, and plugins for various language options. VSCode is often used for Python development and Jupyter Notebooks, while RStudio is generally recommended for R Code and R Markdown Notebooks. + ## J ## K +### Keystone Species + +A species whose removal will cause a large shift in the community (the list and abundance of species). There are a handful of great examples where this is real, but it’s not expected that every ecosystem has “a keystone species”. + ### K-Means Clustering ## L +### Landscape + +On the order of 10s-100s of kilometers of land. +Important because it’s the scale at which individual organisms do stuff, like disperse from where they were born to where they’ll eventually breed. + ### Latent Space ### Learning Rate @@ -127,10 +247,35 @@ A new scientific field in which computational (machine learning) tools built aro ## M +### Machine Learning (ML) + +Machine learning is a way to make predictions about new data based on old, seen data. Fitting a regression to $1$-D data in Excel is the most obvious example of “machine learning”. But you can imagine using many more input variables $(x_1, x_2, \ldots, x_n)$ and also predicting more output variables $(y_1, y_2, \ldots, y_n)$. + +As your data gets more complex, you probably want to choose a “line of best fit” that is more complex than just a line. Unfortunately, while fitting lines is very easy (it’s a convex optimization problem), fitting more complicated stuff is harder. 
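As a minimal sketch of the “line of best fit” idea, the following fits a straight line to a handful of made-up 1-D points using ordinary least squares and only the Python standard library. The data values and the helper name `fit_line` are hypothetical, chosen purely for illustration; real projects would typically reach for a library such as `scikit-learn`.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Old, seen" data (hypothetical): these points lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)

# Prediction for a new, unseen input.
prediction = slope * 4.0 + intercept
```

Because this problem has a closed-form solution, no iterative training is needed; more flexible models generally give up that convenience.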
+ +Part of the field of machine learning is developing new methods to efficiently and effectively fit complicated functions to complicated data. + +### Model (AI/ML Version) + +1. A specific set of parameters (also known as weights; they are just lots of numbers) optimized through training. [meta-llama/Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat) is a set of weights for the Llama2 7B variant. +2. A family of weights: Llama2 is the name for all Llama2 models, including different sizes (7B, 13B, 70B) and base/instruction-tuned versions. +3. An architecture: a transformer model refers to the general class of models that use self-attention (discussed later). + +### Model (Ecology Version) + +- Sometimes used in reference to mathematical models that arise from theory and describe relationships among components of an ecosystem. + - **Example**: predator-prey model of population dynamics. +- Sometimes used in reference to statistical models that describe relationships among variables, relationship between predictor/explanatory variables (for example, environment or climate measures) and response variables (for example, species richness, or presence). + - **Example**: occupancy models statistically associate covariates with species presence. + ### Multidimensional Scaling (MDS) ## N +### Neural Network + +A neural network is a type of model (function) with near-unlimited complexity. Because of this complexity, neural networks need lots of data to effectively “learn”, but they can fit very complicated datasets (for example, OpenAI’s GPT models have “fit” the English language). + ### Nucleotide The fundamental building blocks of DNA and RNA. A nucleotide is composed of a base and a sugar-phosphate backbone. 
@@ -151,12 +296,22 @@ A DNA or RNA molecule consists of a chain of the four relevant nucleotides in a ### Ontology +### Operational taxonomic units (OTU) + +Usually used in place of [“species”](#species) to describe and group single-celled organisms (e.g., bacteria). An OTU can be defined using metrics like genetic distance. + ## P ### Phenotype ### Phylogeny +A tree of life that depicts evolutionary relationships among species, usually within one taxonomic group (birds, mammals, etc.). It is often used to quantify the phylogenetic diversity of an assemblage (another aspect of biodiversity). + +### Population + +Population is often used in reference to a set of interacting or potentially interbreeding individuals of a single species, e.g., “all the frogs in this series of 3 ponds near each other form a population. It’s a distinct population from the ones in the other ponds that are too far for the frogs to hop to”. + ### Pre-training ### Principal Component Analysis (PCA) @@ -165,14 +320,48 @@ A DNA or RNA molecule consists of a chain of the four relevant nucleotides in a ## R +### Range Shift + +Change in the places a species is found over time, e.g., due to changes in temperature, vegetation, weather, etc. + ## S +### Self-supervised Learning + +Self-supervised learning is very popular in both language and vision right now. The dominant self-supervised learning strategies in both modalities are: + +**Causal language modeling**: given a token sequence $T_1, T_2, T_3, \ldots, T_N$, learn to predict the next token $T_{N+1}$. All text on the internet can be used for this task, and each sequence of $N$ tokens makes $N-1$ training examples. + +**Masked language modeling**: given a token sequence $T_1, T_2, MASK, T_4, \ldots, T_N$, learn to predict $T_3$. Again, all text on the internet can be used for this task because we can replace any token with $MASK$ to make a training example. 
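To make the two language-modeling objectives concrete, here is a small sketch of how training examples could be derived from a single token sequence. The function names `causal_examples` and `masked_examples` and the toy sentence are hypothetical; real pipelines work on integer token IDs from a tokenizer and usually mask a random subset of positions rather than every position in turn.

```python
MASK = "MASK"


def causal_examples(tokens):
    """Pair each prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_examples(tokens):
    """Replace each position in turn with MASK; the hidden token is the target."""
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        examples.append((masked, target))
    return examples


tokens = ["the", "frog", "hops", "away"]  # N = 4 toy tokens

causal = causal_examples(tokens)  # N - 1 = 3 (prefix, next token) pairs
masked = masked_examples(tokens)  # one (masked sequence, target) pair per position
```

Neither function needs human-written labels: the targets come from the text itself, which is what makes these objectives self-supervised.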
+ +**Vision SSL**: Given an image $IMG$, apply some augmentation to $IMG$ like color shifts, rotation, distortion, blur, crop, etc. to make $IMG'$. Then minimize the distance between $f(IMG)$ and $f(IMG')$ while maximizing distance between $f(IMG)$ and all other images. + +**Contrastive Language Image Pre-Training (CLIP)**: Given a large dataset of (image, text) pairs, learn two models: one for images $f_i(x)$, one for text $f_t(x)$. Minimize the distance between $f_i(image)$ and $f_t(text)$ for true (image, text) pairs and maximize the distance for random (image, text) pairs not found in the data. + +Self-supervised learning is all about finding a trick that enables learning useful representations without doing large-scale labeling. + ### Single Nucleotide Polymorphism (SNP) A SNP (pronounced "snip") is a variation in the [nucleotide](#nucleotide) present at a single position in a DNA sequence among individuals in a species. For example, a SNP may be the replacement of a cytosine (C) by a thymine (T) at the same location in a stretch of DNA, where C is observed in a subset of individuals and T is observed in the others. ### Snakemake +### Species + +- A simplistic and debated definition: A group of organisms that regularly reproduce with each other. +- There are *many* different ‘species concepts’ and it really depends on your study question. It has a lot of relevance for conservation as we are often concerned with identifying (and protecting) 'endangered species'. +- **Longstanding debates**: + - The ‘biological’ species concept assumes that different species should not be able to interbreed. *However*, that’s not true for many species, especially plants, invertebrates, and microbes. There are also ‘lumpers’ and ‘splitters’: + - Lumpers tend to aggregate into fewer species. + - Splitters are more likely to formally define many species. + - With molecular markers etc., there are now ‘operational taxonomic units’ where we don’t need a formally defined species. 
+ - Definitions are influenced by bias: different parts of the world have different numbers of described species, so better known areas (e.g., US and Europe) have inflated diversity relative to understudied areas. + +### Species distribution model (SDM) + +An [ecological model](#model-ecology-version) that predicts where and when a species will be present based on covariates in the environment. +Also known as ecological niche models (ENMs), as they attempt to capture the ‘niche’ of a species, a term debated for the entire history of ecology. It is generally acknowledged that SDMs capture some version of the ‘realized’ niche rather than the fundamental/potential niche. + ### Subspecies ### Supervised Learning @@ -181,14 +370,58 @@ As opposed to [unsupervised learning](#unsupervised-learning), supervised learni ## T +### Taxa (s. taxon) + +*Taxa* is the plural of *taxon*, which is a group at some rank of the hierarchical classification system of organisms, most commonly Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species, each representing a decreasingly large group of organisms. A “mountain bluebird” (*Sialia currucoides*) is a taxon; “amphibians” (members of class *Amphibia*) are also a taxon. + ### Taxonomy ### t-Distributed Stochastic Neighbor Embedding (t-SNE) +### Training Paradigms + +- [**Supervised learning**](#supervised-learning): Given a large dataset of (input, output) pairs, learn to predict the output given the input. +- [**Unsupervised learning**](#unsupervised-learning): Given a large dataset of (input) examples, learn something about the structure of the data (clustering, dimensionality reduction, etc.). +- [**Self-supervised learning**](#self-supervised-learning): Given a large dataset of (input) examples, learn useful representations that can then be leveraged to do well on a supervised task. 
+ ### Trait +A characteristic of an individual organism, such as its body mass, diet, when it is active (nocturnal, diurnal), clutch size (number of eggs it lays), color of wings, wing span, beak shape, etc. Traits are only traits if they can be measured on an individual, though there is still some debate over it. For example, some ecologists consider ‘species range’ a trait, but because it cannot be measured on an individual, it is not actually a trait. + +- A trait is “functional” if it plays some function (most traits do, though we cannot often map traits to functions yet). +- A trait can be a ‘response trait’ or an ‘effect trait’: response traits describe an organism’s response to the environment, while effect traits describe the impact of this organism on the ecosystem's functioning. + ### Transfer Learning +### Transformer + +A transformer is a model architecture based on self-attention and feed-forward neural networks. They operate on sequences and can predict sequences or labels. There are three main variants: encoder-only, decoder-only and encoder-decoder. The examples of famous transformers below illustrate their strengths, weaknesses, and differences. + +*As a rule of thumb, a token is about 3/4 of a word.* + +#### Transformer Examples + +**BERT** is an encoder-only language model transformer. Given a sequence of tokens, it produces a dense vector representation for each token and a representation for the entire sequence. + +- Small model; can be trained on an academic budget in 3 days. +- Used for sequence classification (is this sentence about animals or humans?) and token classification (predict the subsequence of tokens that is about animals). +- Does not work with long contexts; limited to sequences of 512 tokens. +- Encoder-only: looks at the entire sequence at once. Cannot generate new text. + +**Llama2** (from Meta/Facebook) is a (family of) decoder-only language model transformers. 
Given a sequence of tokens, it produces a dense vector representation for each token and can sample new tokens that continue the sequence. + +- 7B, 13B and 70B variants. Cannot be trained on an academic budget. 7B and 13B can do inference on academic budgets. +- Decoder-only: learns to maximize $p(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_2, x_1)$ for real sequences of $x$’s. Uses a causal attention mask. +- Given some text, you can sample from $p$ to continue generating realistic text. +- 4096-token context (8x BERT) by default; has 16K variants. + +**Vision Transformers (ViTs)** are an encoder-only transformer architecture for computer vision. They split an image up into 16x16 pixel “patches”, which are then treated as a sequence. They produce dense vector representations for each patch and also a representation for the entire image. + +- Many, many pre-trained weights available in many different sizes. Imageomics trained a ViT-B/16 (base, 16x16 pixel patch) for BioCLIP on the Ohio Supercomputer Center. ViT-L/14 (large, 14x14 pixel patch) is likely out of reach for most academic labs. Inference is very cheap. +- Can be used for image classification, object detection, pose estimation, etc. + +**Whisper** (from OpenAI) is a family of encoder-decoder transformer models for speech-to-text. The decoder component can be used to sample new tokens conditioned on both previously sampled tokens and the encoder’s representations. Encoder-decoder models are most often used where the output is variable length and is a different modality from the input. This includes speech-to-text, text-to-speech, image-to-text, language translation, and others. + ## U ### Unsupervised Learning @@ -197,7 +430,13 @@ As opposed to [supervised learning](#supervised-learning), unsupervised learning ## V -VLMs (Vision-Language Models) +### Vision Models + +Vision encoder, vision model, image model, and vision backbone are all synonyms used to describe a model that produces dense vector representations for images. 
Semantically similar images should be close together in this vector embedding space. + +### Vision Language Model (VLM) + +A model that incorporates both text and images; it may take both as input or output. Examples include [ViT-based](#transformer-examples) [CLIP](#contrastive-language-image-pre-training-clip) models which have both a vision and a text encoder to align images and text in the same embedding space. ## W @@ -208,3 +447,5 @@ VLMs (Vision-Language Models) ## Z ### Zero-Shot Prediction + +Predicting something on which the model was not explicitly trained. For instance, asking [BioCLIP](https://huggingface.co/spaces/imageomics/bioclip-2-demo) to classify a picture of a Pokemon, giving it a list ["pikachu", "ninetales", "eevee"]; it was not trained with these labels, nor with images of the Pokemon, but it will still provide an answer. A more practical example would be recognizing species that it did not see in training. diff --git a/docs/wiki-guide/Python-Eco-Quick-start.md b/docs/wiki-guide/Python-Eco-Quick-start.md new file mode 100644 index 0000000..285ac12 --- /dev/null +++ b/docs/wiki-guide/Python-Eco-Quick-start.md @@ -0,0 +1,124 @@ +# Quick-start to Python for Ecologists + +Looking to integrate AI/ML tools into your pipeline? Here's a quick-start guide with a collection of resources to help you get started. + +## Getting Started + +Let's start with a brief introduction to the basic coding scaffolding you'll need: + +- **Python** is an [**object-oriented programming language**](https://en.wikipedia.org/wiki/Object-oriented_programming) that works similarly to **R**: it has basic built-in functions and gains enhanced functionality by installing *packages*. +- Programming languages like R and Python can be run in [Integrated Development Environments (IDE)](ABC-Glossary.md#integrated-development-environment-ide) (e.g., **RStudio** for R, and **Visual Studio Code** for Python). 
+ - [VSCode](https://code.visualstudio.com/) is a convenient, customizable Integrated Development Environment (IDE) for writing and editing code files in multiple languages, especially Python. + - [RStudio](https://posit.co/products/open-source/rstudio) supports other languages as well, but is most commonly used for R. + - [Spyder](https://www.spyder-ide.org/) is an open-source IDE with a similar visual appearance and functionality to RStudio, so it offers an easy transition for users familiar with the R environment. More advanced users who want added functionality will appreciate VSCode, which can be used for R, Python, and many other coding languages. +- Many programmers find it useful to write scripts in “lab notebook”-style documents that integrate comments, code, and printing and plotting results (e.g., **R Markdown Notebooks** for R and **Jupyter Notebooks** for Python). + - Notebooks provide space for exploration and testing, with more immediate feedback and nicely rendered documentation of design decisions alongside the code. + - Jupyter Notebooks can be run and edited in [VSCode](https://code.visualstudio.com/docs/datascience/jupyter-notebooks). +- Programs and workflows can also be run ***on the command line***, specifically this is running a program through the **C**ommand **L**ine **I**nterface (CLI), a.k.a the **terminal**, **shell**, **console**, or "bash shell". + - On Mac or Linux, open "Terminal" to get started. + - On Windows, you'll need to install one first; we recommend [Windows Terminal](https://learn.microsoft.com/en-us/windows/terminal/). + - Some useful common commands can be found in the [Command Line Cheat Sheet](Command-Line-Cheat-Sheet.md). + - To run your program through the CLI, save your code in a file with the correct file extension (e.g. 
`myfile.py`), and type into the command line something like + + ```console + python myfile.py + ``` + +- As your code evolves beyond ["Hello World!"](https://en.wikipedia.org/wiki/Hello,_world), it will likely require you to use some **packages** or **libraries**: code written by developers, which you can download and use to make your life easier. For Python packages, you’ll want to use an environment manager like [**conda**](https://www.anaconda.com/docs/getting-started/main) or [**venv**](https://docs.python.org/3/library/venv.html). + - Generally in Python, you want to scope a single environment for a particular project or task. Similar to R, specific libraries are loaded for a specific script or project. A key difference in Python is that the environment will also fix a particular version of Python and consist of all packages installed in that specific environment. Python can also selectively import set libraries (similar to R's "library" function) but python can also load single modules (or functions) within libraries used for a particular script. + - Environment management can also be accomplished for R projects, e.g., with [renv](https://rstudio.github.io/renv/). + - Learn more about different Python environment managers on the [Virtual Environments Page](Virtual-Environments.md). +- For effective collaboration—with yourself and others—use **Git** (`git`) for version control and sync it to a remote, such as [GitHub](https://github.com/). + - Learn more about the basics of, and motivations for, version control in [The Turing Way](https://book.the-turing-way.org/reproducible-research/vcs/). + +Two great resources for lessons covering these topics: + +1. The [Missing Semester of Your CS Education](https://missing.csail.mit.edu/): a collection of computer science-themed lessons from MIT. + +2. The [Software Carpentry Lessons](https://software-carpentry.org/lessons/): hands-on, guided introductions to the topics introduced above. 
Each of the Core Software Carpentry lessons is described and linked to below. + +### Software Carpentries Lessons + +!!! note + You can work through these lessons on your own or check The Carpentries site for [upcoming workshops](https://software-carpentry.org/workshops/workshops-upcoming/) being offered virtually or at a location near you. + +#### Working on the Command Line + +Lesson: [The Unix Shell](https://swcarpentry.github.io/shell-novice/) + +Work through each of the episodes in this lesson to gain a familiarity with Unix-based operating system basics. This lesson will prepare you for navigating the shell, i.e., working from the command line. We recommend you complete this before the GitHub lesson, since the *Git* lesson uses the command line. + +#### Introduction to GitHub + +Lesson: [Version Control with Git](https://swcarpentry.github.io/git-novice/) + +This lesson introduces users to local version control with `git` through the command line, then builds to interacting with the remote (e.g., ). It provides a comparison of tools and features introduced in the command line with their analogous UI (user interface) options in the remote (online). It also covers some common conventions and includes discussions of open science (see also [The Turing Way's discussion](https://book.the-turing-way.org/reproducible-research/open/)) and some core repository files, such as `.gitignore`, license, and citation files (also covered in our [GitHub Repo Guide](Github-Repo-Guide.md)). + +For those using R, this lesson includes a supplemental section on [using Git from RStudio](https://swcarpentry.github.io/git-novice/14-supplemental-rstudio.html). VSCode also has a [Git integration](https://code.visualstudio.com/docs/sourcecontrol/github). + +!!! tip "Pro tip" + Follow the [GitHub Workflow Guide](The-GitHub-Workflow.md) to improve collaboration and help avoid conflicts. 
[GitHub Projects](Guide-to-GitHub-Projects.md) are also a particularly powerful tool for collaborative project management. + +#### Basic Python + +Many machine learning algorithms and workflows are run using Python. If you're new to Python, there are many resources to help you get familiar with it. Below are two Carpentries lessons to get you started: + +- [Programming with Python](http://swcarpentry.github.io/python-novice-inflammation) +- [Plotting and Programming in Python](http://swcarpentry.github.io/python-novice-gapminder) + +!!! warning + Indexing and indentation work differently in Python than in R: Python indexing starts at 0 while R's starts at 1, and Python uses indentation to define code blocks. For a more comprehensive comparison, check out this [translation of common commands and syntax](https://aeturrell.github.io/coding-for-economists/coming-from-r.html#r-python). + +### Introduction to Data Analysis with Python + +This [data workshop training](https://youtu.be/71Ww42ddz9s) was first presented at the Imageomics All-Hands in 2024. It runs through an initial analysis of a simplified dataset, filling in a [dataset card](HF_DatasetCard_Template_mkdocs.md) as the data is explored, cleaned, and prepared for training. Notebooks and more information can be found in the [data workshop repo](https://github.com/Imageomics/data-workshop-AH-2024). To complete the training, follow the instructions below. + +#### Key Packages + +The key packages used in this workshop are described below, organized by their use-cases. + +- Data wrangling: `pandas` (provides the DataFrame data structure), `datasets` (for accessing the data from Hugging Face). +- Notebooks (where the work gets done): `jupyterlab`, `ipywidgets`. + - The notebooks can be run in VSCode or by launching Jupyter from the command line (as done in the tutorial itself). +- Image handling and visualizations: `pillow`, `seaborn`. +- Machine learning tools: `scikit-learn`, `opencv`. + +#### Tutorial step-by-step instructions + +1.
Clone the [workshop repository](https://github.com/Imageomics/data-workshop-AH-2024) and follow the instructions to set up your local environment. +2. Read the [Key Learning Objectives](https://github.com/Imageomics/data-workshop-AH-2024/#key-learning-objectives). +3. Read the [Story of the Workshop](https://github.com/Imageomics/data-workshop-AH-2024/#story-of-the-workshop). +4. Follow along with the [workshop lesson](https://youtu.be/71Ww42ddz9s). +5. Review extra notes in the [Further Reading](https://github.com/Imageomics/data-workshop-AH-2024/tree/main/further_reading) section, which contains pointers and links to other resources. + +### Modeling Overview + +For more on various training paradigms, see the [training paradigms section of the ABC Glossary](ABC-Glossary.md#training-paradigms). +The glossary also covers [models](ABC-Glossary.md#model-aiml-version) such as [transformers](ABC-Glossary.md#transformer), [CLIP](ABC-Glossary.md#contrastive-language-image-pre-training-clip), and [diffusion models](ABC-Glossary.md#diffusion-models). + +For a more general discussion of machine learning topics, [IBM has a detailed guide](https://www.ibm.com/think/machine-learning#605511093). + +#### General hands-on practice + +- Introductory [PyTorch tutorials](https://pytorch.org/tutorials/beginner/basics/intro.html), which include a full PyTorch machine learning workflow example. +- A [GitHub Repo](https://github.com/davidbau/how-to-read-pytorch) on "How to Read Pytorch", which may help with some foundational concepts. +- A [Medium article](https://medium.com/fullstackai/how-to-train-an-object-detector-with-your-own-coco-dataset-in-pytorch-319e7090da5) on training an object detector with PyTorch, if you'd rather read about it first (be warned: it includes large code blocks). +- [Climate Change AI](https://www.climatechange.ai/tutorials?) has a number of tutorials across a wide variety of topics and skill levels.
+ +#### Camera traps + +[MegaDetector](https://github.com/agentmorris/MegaDetector/), fine-tuned for your particular setup, is often the go-to when dealing with camera trap data. Check out this [YouTube video](https://www.youtube.com/watch?v=LUkQVARAVFI) by Siyu Yang to help you get started. + +#### Bioacoustics + +See the [OpenSoundScape tutorial](https://opensoundscape.org/en/stable/classifier_guide/guide.html) by Lauren Chronister, Tessa Rhinehart, Sam Lapp, and Santiago Ruiz Guzman for a conceptual introduction to the classifier training workflow. This will prepare you to dive into the [OpenSoundScape Documentation](https://opensoundscape.org/en/latest/) and build on the basic tutorials, expanding to your own data and use cases. + +#### Are there more options, you ask? + +In addition to the resources described on this page, you may also want to check out the following resources from the broader ABC Community: + +- [Data Science & Computing Cheat Sheet](https://docs.google.com/document/d/1YbOYnDZpRu6Jo1mpfg8m_zGeyY34NGECyxk2rNkH5eo/edit?usp=sharing) compiled by Tessa Rhinehart, Lauren Chronister, and Sara Beery with resources from both the [Kitzes](https://kitzeslab.org) and [Beery](https://beerys.github.io/) Labs. Some content links back to or is described in this guide, but there are other tutorials and resources that are not covered here. + +- [Ecological Modeling with AI and Python](https://ecoforecast.org/workshops/statistical-methods-seminar-series/#ai-python) tutorial by Sara Beery and Timm Haucke as part of the [Ecological Forecasting Initiative and the ESA Statistical Ecology Section Statistical Methods Seminar Series](https://ecoforecast.org/workshops/statistical-methods-seminar-series).
+ +- [Coding Club](https://ourcodingclub.github.io/) offers a number of open-source [tutorials](https://ourcodingclub.github.io/tutorials.html) covering topics in data analysis, reproducible research, and modeling, all in different languages (including R and Python), as well as the basics of R and Python. Be aware that some content might be out of date (check the last modified date at the top of lessons), as the site does not appear to be actively maintained anymore. diff --git a/mkdocs.yaml b/mkdocs.yaml index afa8e60..af238af 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -86,6 +86,8 @@ markdown_extensions: - attr_list - footnotes - md_in_html + - pymdownx.arithmatex: + generic: true - pymdownx.betterem - pymdownx.blocks.caption - pymdownx.details @@ -105,6 +107,11 @@ markdown_extensions: permalink: true title: 📖 On This Page +# MathJax configuration to support LaTeX math rendering +extra_javascript: + - javascripts/mathjax.js + - https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js + nav: # These are all relative links within the repo, only update if adding or deleting pages - Home: index.md @@ -136,6 +143,7 @@ nav: - Quick References: - "Command Line Cheat Sheet": wiki-guide/Command-Line-Cheat-Sheet.md - "PyPI & Zenodo Release Automation": wiki-guide/GitHub-PyPI-Zenodo-Integration.md + - "Quick-start to Python for Ecologists": wiki-guide/Python-Eco-Quick-start.md - Code of Conduct: CODE_OF_CONDUCT.md - Digital Product Policy: - "About Digital Product Policies": wiki-guide/About-Digital-Product-Policies.md