Merge pull request #95 from terraref/image_tutorial

Chris-Schnaufer · web-flow · commit 0382761c42ef · 2019-01-29T16:04:13.000-07:00
Image tutorial
diff --git a/sensors/06-list-datasets-by-plot.Rmd b/sensors/06-list-datasets-by-plot.Rmd
@@ -1,21 +1,30 @@
 # Generating file lists by plot
 
-The terrautils python package has a new products module that aid in connecting
-plot boundaries stored within betydb with file-based data products available
+## Pre-requisites: 
+
+* if you have not already done so, you will need to 1) sign up for the [beta user program](terraref.org/beta) and 2) 
+sign up and be approved for access to the the [sensor data portal](terraref.ncsa.illinois.edu/clowder) in order to get 
+the API key that will be used in this tutorial. 
+
+The terrautils python package has a new `products` module that aids in connecting
+plot boundaries stored within betydb with the file-based data products available
 from the workbench or Globus.
 
+* if are using Rstudio and want to run the Python code chunks, the R package "reticulate" is required
+* use `pip3 install terrautils` to install the terrautils Python library
+
 ## Getting started
 
-After installing terrautils, you should be able to import the *product* module.
+After installing terrautils, you should be able to import the `products` module.
 ```{python}
 from terrautils.products import get_sensor_list, unique_sensor_names
 from terrautils.products import get_file_listing, extract_file_paths
 ```
 
-The `get_sensor_list` and `get_file_listing` functions both require connection, url,
-and key parameters. *Connection* can be 'None', the *url* (called host in the
-code) should be something like https://terraref.ncsa.illinois.edu/clowder/.
-The *key* is a unique access key for the Clowder api.
+The `get_sensor_list` and `get_file_listing` functions both require the *connection*,
+*url*, and *key* parameters. The *connection* can be 'None'. The *url* (called host in the
+code) should be something like `https://terraref.ncsa.illinois.edu/clowder/`.
+The *key* is a unique access key for the Clowder API.
 
 ## Getting the sensor list
 
@@ -26,116 +35,160 @@ a plot id number. The utility function `unique_sensor_names` accpets the
 sensor list and provides a list of names suitable for use in the 
 `get_file_listing` function.
 
+To use this tutorial you will need to sign up for Clowder, have your 
+account approved, and then get an API key from the [Clowder web interface](https://terraref.ncsa.illinois.edu/clowder).
+
+```{python}
+url = 'https://terraref.ncsa.illinois.edu/clowder/'
+key = 'ENTER YOUR KEY HERE'
+```
+
 ```{python}
 sensors = get_sensor_list(None, url, key)
 names = unique_sensor_names(sensors)
+print(names)
 ```
 
+
 Names will now contain a list of sensor names available in the Clowder
-geostreams API. The currently available sensors are:
+geostreams API. The list of returned sensor names could be something like the 
+following:
 
-* IR Surface Temperature
-* Thermal IR GeoTIFFs Datasets
 * flirIrCamera Datasets
-* (EL) sensor_weather_station
-* Irrigation Observations
-* Canopy Cover
-* Energy Farm Observations SE
-* (EL) sensor_par
-* scanner3DTop Datasets
-* Weather Observations
-* Energy Farm Observations NE
+* IR Surface Temperature
 * RGB GeoTIFFs Datasets
-* (EL) sensor_co2
 * stereoTop Datasets
-* Energy Farm Observations CEN
+* scanner3DTop Datasets
+* Thermal IR GeoTIFFs Datasets
+* ...
 
 ## Getting a list of files
 
 The geostreams API can be used to get a list of datasets that overlap a
 specific plot boundary and, optionally, limited by a time range. Iterating 
 over the datasets allows the paths to all the files to be extracted.
 
-```{python}
+```{python eval = FALSE}
 sensor = 'Thermal IR GeoTIFFs Datasets'
 sitename = 'MAC Field Scanner Season 1 Field Plot 101 W'
+key = 'INSERT YOUR KEY HERE'
 datasets = get_file_listing(None, url, key, sensor, sitename)
 files = extract_file_paths(datasets)
 ```
 
 Datasets can be further filtered using the *since* and *until* parameters
 of `get_file_listing` with a date string.
 
-```{python}
+```{python eval=FALSE}
 dataset = get_file_listing(None, url, key, sensor, sitename, 
         since='2016-06-01', until='2016-06-10')
 ```
 
 
-# Alternative method
+# Querying the API
+
+<!-- 
+TODO: move this to a separate tutorial page focused on using curl
+-->
+
+The source files behind the data are available for downloading through the API. By executing a series
+of requests against the API it's possible to determine the files of interest and then download them.
+
+Each of the API URL's have the same beginning (https://terraref.ncsa.illinois.edu/clowder/api), 
+followed by the data needed for a specific request. As we step through the process you will be able
+to see how then end of the URL changes depending upon the reuqest.
+
+Below is what the API looks like as a URL. Try pasting it into your browser.
+
+[https://terraref.ncsa.illinois.edu/clowder/api/geostreams/sensors?sensor_name=MAC Field Scanner Season 1 Field Plot 101 W](https://terraref.ncsa.illinois.edu/clowder/api/geostreams/sensors?sensor_name=MAC Field Scanner Season 1 Field Plot 101 W)
+
+This will return data for the requested plot including its id. This id (or identifier) can then be used for 
+additional queries against the API.
 
-The following method demonstrates the same approach using the Clowder API. This
-approach is useful for understanding the data layout and when the Python
-terrautils package is not available.
+In the examples below we will be using **curl** on the command line to make our API calls. Since the
+API is accessed through URLs, it's possible to use the URLs in software programs or with a programming language
+to retrieve its data. 
+
+## A Word of Caution
+
+We are no longer using the python terrautils package, which is a python library that provides helper functions that simplify interactions with the Clowder API. One of the ways it makes the interface easier is by using function names that make sense in the scope of the project. The API and the Clowder database have different names and _this is confusing_ since the same names are used for different parts of the database.
+
+The names and meanings of variables in this section don't necessarily match the ones in the section 
+above and it may be easy to get them confused. The API queries the database directly and thereby reflects 
+the database structure. This is the main reason for the naming differences between the API and the terraref
+client.
+
+For example, the Clowder API's use of the term *SENSOR_NAME* is equivalent to *site_name* above.
 
 ## Finding plot ID
 
-```
-SENSOR_NAME = "MAC Field Scanner Season 1 Field Plot 101 W"
-GET https://terraref.ncsa.illinois.edu/clowder/api/geostreams/sensors?sensor_name={SENSOR_NAME}
+We can query the API to find the identifier associated with the name of a plot. For this example
+we use the variable name of SENSOR_DATA to indicate the name of the plot.
+
+``` {sh eval=FALSE}
+SENSOR_NAME="MAC Field Scanner Season 1 Field Plot 101 W"
+curl -o plot.json -X GET "https://terraref.ncsa.illinois.edu/clowder/api/geostreams/sensors?sensor_name=${SENSOR_NAME}"
 ```
 
-This returns a JSON object with an 'id' parameter. You can use this ID parameter to specify the right data stream.
+This creates a file named *plot.json* containing the JSON object returned by the API. The JSON object has an 
+'id' parameter. This ID parameter can be used to specify the correct data stream.
 
 ## Finding stream ID within a plot
 
-The names are formatted as "<Sensor Group> Datasets (<Sensor ID>)".
+Using the sensor ID returned in the JSON from the previous call and the id of a sensor returned previously to get
+the stream id. The names of streams are are formatted as "<Sensor Group> Datasets (<Sensor ID>)".
 
-```
-SENSOR_ID = 3355
-STREAM_NAME = "Thermal IR GeoTIFFs Datasets ({SENSOR_ID})"
-GET https://terraref.ncsa.illinois.edu/clowder/api/geostreams/streams?stream_name={STREAM_NAME}
+``` {sh eval=FALSE}
+SENSOR_ID=3355
+STREAM_NAME="Thermal IR GeoTIFFs Datasets (${SENSOR_ID})"
+curl -o stream.json -X GET "https://terraref.ncsa.illinois.edu/clowder/api/geostreams/streams?stream_name=${STREAM_NAME}"
 ```
 
-This returns a JSON object with an 'id' parameter. You can use this ID parameter to get the right datapoints.
+A file named *stream.json* will be created containing the returned JSON object. This JSON object has an 'id' parameter that
+contains the stream ID. You can use this ID parameter to get the datasets, and then datapoints, of interest.
 
-## Listing Clowder file IDs for that plot & sensor stream
+## Listing Clowder dataset IDs for that plot & sensor stream
 
-```
-STREAM_ID = "11586"
-GET https://terraref.ncsa.illinois.edu/clowder/api/geostreams/datapoints?stream_id={STREAM_ID}
+We now have a stream ID that we can use to list our datasets. The datasets in turn contain files of interest.
+
+``` {sh eval=FALSE}
+STREAM_ID=11586
+curl -o datasets.json -X GET "https://terraref.ncsa.illinois.edu/clowder/api/geostreams/datapoints?stream_id=${STREAM_ID}"
 ```
 
-This returns a list of datapoint JSON objects, each with a 'properties' parameter that looks like:
+After the call succeeds, a file named *datasets.json* is created containing the returned JSON onject. As part of the
+JSON object there are one or more `properties` fields containing *source_dataset* parameters.
 
-```{python}
+```{javascript eval=FALSE}
 properties: {
     dataset_name: "Thermal IR GeoTIFFs - 2016-05-09__12-07-57-990",
     source_dataset: "https://terraref.ncsa.illinois.edu/clowder/datasets/59fc9e7d4f0c3383c73d2905"
 },
 ```
 
-The source_dataset URL can be used to view the dataset in Clowder.
+The URL of each **source_dataset** can be used to view the dataset in Clowder.
 
-You can also filter the datapoints by date:
+The datasets can also be filtered by date. The following filters out datasets that are outside of the range of January 2, 2017 through June 20, 2017.
 
-```
-GET https://terraref.ncsa.illinois.edu/clowder/api/geostreams/datapoints?stream_id={STREAM_ID}&since=2017-01-02&until=2017-06-10
+``` {sh eval=FALSE}
+curl -o datasets.json -X GET "https://terraref.ncsa.illinois.edu/clowder/api/geostreams/datapoints?stream_id=${STREAM_ID}&since=2017-01-02&until=2017-06-10"
 ```
 
-## Getting ROGER file path from dataset
+## Getting file paths from dataset
 
-Given a source dataset URL, we can call the API to get the files and their paths.
+Now that we know what the dataset URLs are, we can use the URLs to query the API for file IDs in addition to their names and paths.
 
-```
-SOURCE_DATASET = "https://terraref.ncsa.illinois.edu/clowder/datasets/59fc9e7d4f0c3383c73d2905"
-# Add /api after /clowder, and add /files at the end of the URL
-GET "https://terraref.ncsa.illinois.edu/clowder/api/datasets/59fc9e7d4f0c3383c73d2905/files"
+Note the the URL has changed from our previous examples now that we're using the URLs returned by the previous call.
+
+``` {sh eval=FALSE}
+SOURCE_DATASET="https://terraref.ncsa.illinois.edu/clowder/datasets/59fc9e7d4f0c3383c73d2905"
+curl -o files.json -X GET "${SOURCE_DATASET}/files"
 ```
 
-This returns a list of files in the dataset and their paths if available:
+As before, we will have a file containing the returned JSON, named *files.json* in this case. The returned JSON consists of a list 
+of the files in the dataset with their IDs, and other data if available:
 
-```
+``` {javascript eval=FALSE}
 [
     {
         size: "346069",
@@ -156,4 +209,19 @@ This returns a list of files in the dataset and their paths if available:
 ]
 ```
 
-Depending on permissions you may need to provide authentication to get this list.
+## Retrieving the files
+
+Given that a large number of files may be contained in a dataset, it may be desireable to automate the process of pulling down files
+to the local system.
+
+For each file to be retrieved, the unique file ID is needed on the URL.
+
+``` {sh eval=FALSE}
+FILE_NAME="ir_geotiff_L1_ua-mac_2016-05-09__12-07-57-990.tif"
+FILE_ID=59fc9e844f0c3383c73d2980
+curl -o "${FILE_NAME}" -X GET "https://terraref.ncsa.illinois.edu/clowder/api/files/${FILE_ID}"
+```
+
+This call will cause the server to return the contents of the file identified in the URL. This file is then stored locally in *ir_geotiff_L1_ua-mac_2016-05-09__12-07-57-990.tif*.
+
+