|
22 | 22 | "cell_type": "markdown", |
23 | 23 | "metadata": {}, |
24 | 24 | "source": [ |
25 | | - "### Enforce a reproducible result across runs" |
| 25 | + "To enforce a reproducible result across runs, we set a random seed." |
26 | 26 | ] |
27 | 27 | }, |
28 | 28 | { |
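The seeding step added above could be sketched as follows; the seed value `123` is an assumption, not taken from the notebook:

```python
import numpy as np

# Fix the global NumPy seed so that random draws (e.g. the random
# training indices chosen later) are identical across runs.
RANDOM_STATE_SEED = 123  # hypothetical value; the notebook's actual seed may differ
np.random.seed(RANDOM_STATE_SEED)
```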
|
42 | 42 | "cell_type": "markdown", |
43 | 43 | "metadata": {}, |
44 | 44 | "source": [ |
45 | | - "### Load our `iris` dataset\n", |
| 45 | + "## The dataset\n", |
46 | 46 | "\n", |
47 | | - "For more information on the iris dataset, see:\n", |
| 47 | + "Now we load the dataset. In this example, we are going to use the famous Iris dataset. For more information on the iris dataset, see:\n", |
48 | 48 | " - [The dataset documentation on Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)\n", |
49 | 49 | " - [The scikit-learn interface](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html)" |
50 | 50 | ] |
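Loading the dataset through the scikit-learn interface linked above might look like this (variable names are illustrative):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X_raw = iris['data']    # 150 samples, 4 features
y_raw = iris['target']  # 3 classes, 50 samples each
```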
|
66 | 66 | "cell_type": "markdown", |
67 | 67 | "metadata": {}, |
68 | 68 | "source": [ |
69 | | - "### Apply PCA onto our features and extract the first 2 principle components" |
| 69 | + "For visualization purposes, we apply PCA to the original dataset." |
70 | 70 | ] |
71 | 71 | }, |
72 | 72 | { |
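Extracting the first two principal components for the visualization could be done as below (a sketch; variable names are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_raw = load_iris()['data']

# Project the 4-dimensional feature space onto its first 2 principal components.
pca = PCA(n_components=2)
transformed_iris = pca.fit_transform(X_raw)
```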
|
86 | 86 | "cell_type": "markdown", |
87 | 87 | "metadata": {}, |
88 | 88 | "source": [ |
89 | | - "### Visualize the principle components" |
| 89 | + "This is what the dataset looks like." |
90 | 90 | ] |
91 | 91 | }, |
92 | 92 | { |
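The picture the cell refers to is a scatter plot of the two components, colored by class; a minimal sketch using a non-interactive backend:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
components = PCA(n_components=2).fit_transform(iris['data'])

fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(components[:, 0], components[:, 1], c=iris['target'])
ax.set_title('Iris classes after PCA transformation')
```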
|
124 | 124 | "cell_type": "markdown", |
125 | 125 | "metadata": {}, |
126 | 126 | "source": [ |
127 | | - "### Partition our `iris` dataset\n", |
128 | | - "\n", |
129 | | - "We first specify our training set $\\mathcal{L}$ consisting of 3 random examples. The remaining examples go to our \"unlabeled\" pool $\\mathcal{U}$." |
| 127 | + "Now we partition our `iris` dataset into a training set $\mathcal{L}$ and an unlabeled pool $\mathcal{U}$. We first specify our training set $\mathcal{L}$, consisting of 3 random examples. The remaining examples go to our \"unlabeled\" pool $\mathcal{U}$." |
130 | 128 | ] |
131 | 129 | }, |
132 | 130 | { |
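One way to carve out the 3-example training set described above; the seed and the use of `np.random.choice` (which guarantees 3 distinct examples) are assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X_raw, y_raw = iris['data'], iris['target']

np.random.seed(123)  # assumed seed for reproducibility

# Pick 3 distinct random examples for the labeled training set L.
training_indices = np.random.choice(len(X_raw), size=3, replace=False)
X_train, y_train = X_raw[training_indices], y_raw[training_indices]

# Everything else goes to the "unlabeled" pool U.
X_pool = np.delete(X_raw, training_indices, axis=0)
y_pool = np.delete(y_raw, training_indices, axis=0)
```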
|
151 | 149 | "cell_type": "markdown", |
152 | 150 | "metadata": {}, |
153 | 151 | "source": [ |
154 | | - "## Define our models" |
| 152 | + "## Active learning with pool-based sampling\n", |
| 153 | + "\n", |
| 154 | + "For classification, we are going to use a simple k-nearest neighbors classifier. In this step, we also initialize the `ActiveLearner`." |
155 | 155 | ] |
156 | 156 | }, |
157 | 157 | { |
|
172 | 172 | "cell_type": "markdown", |
173 | 173 | "metadata": {}, |
174 | 174 | "source": [ |
175 | | - "## Predict class labels based on our limited dataset $\\mathcal{L}$" |
| 175 | + "Let's see how our classifier performs on the initial training set!" |
176 | 176 | ] |
177 | 177 | }, |
178 | 178 | { |
|
242 | 242 | "\n", |
243 | 243 | "As we can see, our model is unable to properly learn the underlying data distribution. All of its predictions are for the third class label, and as such it is only as competitive as defaulting its predictions to a single class – if only we had more data!\n", |
244 | 244 | "\n", |
245 | | - "Below, we tune our classifier by allowing it to query 20 instances it hasn't seen before. Using uncertainty sampling, our classifier aims to reduce the amount of uncertainty in its predictions using a variety of measures — see the documentation for more on specific [classification uncertainty measures](https://cosmic-cortex.github.io/modAL/Uncertainty-sampling#uncertainty). With each requested query, we remove that record from our pool $\\mathcal{U}$ and record our model's accuracy on the raw dataset." |
| 245 | + "Below, we tune our classifier by allowing it to query 20 instances it hasn't seen before. Using uncertainty sampling, our classifier aims to reduce the amount of uncertainty in its predictions using a variety of measures — see the documentation for more on specific [classification uncertainty measures](https://modal-python.readthedocs.io/en/latest/content/query_strategies/Uncertainty-sampling.html). With each requested query, we remove that record from our pool $\\mathcal{U}$ and record our model's accuracy on the raw dataset." |
246 | 246 | ] |
247 | 247 | }, |
248 | 248 | { |
|
388 | 388 | "name": "python", |
389 | 389 | "nbconvert_exporter": "python", |
390 | 390 | "pygments_lexer": "ipython3", |
391 | | - "version": "3.6.6" |
| 391 | + "version": "3.6.5" |
392 | 392 | } |
393 | 393 | }, |
394 | 394 | "nbformat": 4, |
|