|
7 | 7 | "source": [ |
8 | 8 | "# AWS Cloud: getting started and retrieving ECCO datasets\n", |
9 | 9 | "\n", |
| 10 | + "Andrew Delman, updated 2024-03-15.\n", |
| 11 | + "\n", |
10 | 12 | "## Introduction\n", |
11 | 13 | "Previous tutorials have discussed how to download ECCO datasets from PO.DAAC to your local machine. However, during 2021-2022 PO.DAAC datasets (including ECCO) migrated to the NASA Earthdata Cloud hosted by Amazon Web Services (AWS). While data downloads from the cloud (using wget, curl, Python requests, etc.) function like downloads from any other website, there are definite advantages to working with datasets within the cloud environment. Data can be opened in an S3 bucket and viewed without downloading, or can be quickly downloaded to a user's cloud instance for computations. For more information on PO.DAAC datasets in the cloud, there are [a number of infographics here](https://podaac.jpl.nasa.gov/cloud-datasets/about).\n", |
12 | 14 | "\n", |
|
37 | 39 | "\n", |
38 | 40 | "*Key pair (login)*: Click on **Create new key pair**. In the pop-up window, make the name whatever you want (e.g., aws_ec2_jupyter), select *Key pair type*: **RSA** and *Private key file format*: **.pem**, then **Create key pair**. This downloads the key file to your Downloads folder, and you should move it to your `.ssh` folder: `mv ~/Downloads/aws_ec2_jupyter.pem ~/.ssh/`. Then change the permissions to read-only for the file owner `chmod 400 ~/.ssh/aws_ec2_jupyter.pem`.\n", |
39 | 41 | "\n", |
40 | | - "*Network settings*: Your institution may have existing security groups that you should use, so click the **Select existing security group** and check with your IT or cloud support to see if there are recommended security groups/VPCs to use. If not or you are doing this on your own, then click **Create security group**, which will create a new security group with a name like *launch-wizard-1*. Make sure that the boxes to allow HTTPS and HTTP traffic from the internet are checked.\n", |
| 42 | + "*Network settings*: Your institution may have existing security groups that you should use, so click the **Select existing security group** and check with your IT or cloud support to see if there are recommended security groups/VPCs to use. If not or you are doing this on your own, then click **Create security group**, which will create a new security group with a name like *launch-wizard-1*. Make sure that the boxes to allow SSH, HTTPS, and HTTP traffic are checked.\n", |
41 | 43 | "\n", |
42 | 44 | "*Configure storage*: Specify a storage volume with at least **15 GiB gp3** as your root volume. This is important, since the python/conda installation with the packages we need will occupy ~7.5 GB, and we need some workspace as a buffer. If you are in Free tier then you can request up to 30 GB across all your instances, so you can use up the full amount in a single instance or split it across two instances with 15 GB each.\n", |
43 | 45 | "\n", |
|
95 | 97 | "~/jupyter_lab_start.sh\n", |
96 | 98 | "```\n", |
97 | 99 | "\n", |
98 | | - "You will get a prompt for a password (optional), or you can leave it blank and press enter. After this is done (and while still connected to your instance through port 9889), open up a window in your local machine's web browser and put ``http://127.0.0.1:9889/`` or ``http://localhost:9889/`` in the URL field. If you set a password for your session, enter it when prompted. A Jupyter lab should open up in the ECCOv4 tutorial Github repository on your instance. Go to the **Tutorials_as_Jupyter_Notebooks** directory, and you will see a number of notebooks ready to run! For example, you can access this one at *AWS_Cloud_getting_started.ipynb*.\n", |
| 100 | + "You will get a prompt for a password (optional), or you can leave it blank and press enter. After this is done (and while still connected to your instance through port 9889), open up a window in your local machine's web browser and put ``http://127.0.0.1:9889/`` or ``http://localhost:9889/`` in the URL field. If you set a password for your session enter it when prompted, or if there is no password just click **Log in**. A Jupyter lab should open up in the ECCOv4 tutorial Github repository on your instance. Go to the **Tutorials_as_Jupyter_Notebooks** directory, and you will see a number of notebooks ready to run! For example, you can access this one at *AWS_Cloud_getting_started.ipynb*.\n", |
99 | 101 | "\n", |
100 | 102 | "## Reconnecting to your instance and Jupyter lab\n", |
101 | 103 | "\n", |
|
119 | 121 | "ssh -i \"~/.ssh/aws_ec2_jupyter.pem\" ec2-user@instance_ip_address -L 9889:localhost:9889\n", |
120 | 122 | "```\n", |
121 | 123 | "\n", |
122 | | - "Once connected to your instance you will need to start a new Jupyter lab session by running:\n", |
| 124 | + "Note that the instance's public IP address may have changed when the instance was stopped and started again. Once connected to your instance you will need to start a new Jupyter lab session by running:\n", |
123 | 125 | "\n", |
124 | 126 | "```\n", |
125 | 127 | "~/jupyter_lab_start.sh\n", |
|
262 | 264 | "source": [ |
263 | 265 | "%%time\n", |
264 | 266 | "\n", |
265 | | - "# Open 12 monthly files (temp/salinity)\n", |
| 267 | + "# Open 12 monthly files (temp/salinity\n", |
| 268 | + "\n", |
| 269 | + "# suppress warnings\n", |
| 270 | + "import warnings\n", |
| 271 | + "warnings.filterwarnings(\"ignore\")\n", |
266 | 272 | "\n", |
267 | 273 | "time_log = [time.time()]\n", |
268 | 274 | "\n", |
|
274 | 280 | "ds = xr.open_mfdataset(file_list,\\\n", |
275 | 281 | " data_vars='minimal',coords='minimal',\\\n", |
276 | 282 | " compat='override',\\\n", |
277 | | - " chunks={'time':1,'k':50,'tile':13,'j':90,'i':90})\n", |
| 283 | + " chunks={'time':1,'k':10,'tile':13,'j':90,'i':90})\n", |
278 | 284 | "\n", |
279 | 285 | "## repeat above with the grid file\n", |
280 | 286 | "grid_file = ecco_podaac_s3_open(ShortName=\"ECCO_L4_GEOMETRY_LLC0090GRID_V4R4\",\\\n", |
281 | 287 | " StartDate=\"1992-01\",EndDate=\"2017-12\")\n", |
282 | | - "ds_grid = xr.open_dataset(grid_file)\n", |
| 288 | + "ds_grid = xr.open_mfdataset([grid_file],chunks={'k':10,'tile':13,'j':90,'i':90})\n", |
283 | 289 | "\n", |
284 | 290 | "time_log.append(time.time())\n", |
285 | 291 | "time_to_open_files = np.diff(np.asarray(time_log)[-2:])[0]\n", |
|
290 | 296 | "\n", |
291 | 297 | "## compute volumes of each cell\n", |
292 | 298 | "cell_vol = ds_grid.hFacC*ds_grid.rA*ds_grid.drF\n", |
293 | | - "cell_vol = cell_vol.compute()\n", |
294 | 299 | "\n", |
295 | 300 | "## mean temperature weighted by the volume of each cell\n", |
296 | 301 | "total_vol = cell_vol.sum().compute()\n", |
|
2446 | 2451 | "source": [ |
2447 | 2452 | "#### Method 1b: Open using 2 processes and threads\n", |
2448 | 2453 | "\n", |
2449 | | - "We can use the [dask.distributed](https://distributed.dask.org) library to parallelize the opening of files from S3 and the subsequent computation. However, because of the overhead introduced by the `distributed` scheduler, this will not always speed up your code. This method works best when it is used to open files into xarray datasets with *open_mfdataset* (which by default does not immediately load the data into memory), and then a subset of the data can be loaded into memory at the earliest reasonable opportunity." |
| 2454 | + "We can use the [dask.distributed](https://distributed.dask.org) library to parallelize the opening of files from S3 and the subsequent computation. However, because of the overhead introduced by the `distributed` scheduler, this will not always speed up your code. This method works best when it is used to open files into xarray datasets with *open_mfdataset* (which by default does not immediately load the data into memory), and then a subset of the data can be loaded into memory at the earliest reasonable opportunity.\n", |
| 2455 | + "\n", |
| 2456 | + "> Note: If you are running a an instance with <2 GB memory (including the t2.micro free-tier) you will likely need to restart the kernel to clear your workspace before running the following cells. Then you can uncomment the cells below that reload the Python packages you need. You will also probably need to do the same before running Method 2." |
2450 | 2457 | ] |
2451 | 2458 | }, |
2452 | 2459 | { |
|
2464 | 2471 | } |
2465 | 2472 | ], |
2466 | 2473 | "source": [ |
| 2474 | + "# # un-comment and run this block if you just restarted the kernel\n", |
| 2475 | + "# import numpy as np\n", |
| 2476 | + "# import xarray as xr\n", |
| 2477 | + "# import matplotlib.pyplot as plt# \n", |
| 2478 | + "# from ecco_s3_retrieve import *\n", |
| 2479 | + "# import time\n", |
| 2480 | + "\n", |
2467 | 2481 | "from distributed import Client\n", |
2468 | 2482 | "client = Client()\n", |
2469 | 2483 | "print(client)" |
|
2544 | 2558 | "ds = xr.open_mfdataset(file_list,\\\n", |
2545 | 2559 | " data_vars='minimal',coords='minimal',\\\n", |
2546 | 2560 | " compat='override',\\\n", |
2547 | | - " chunks={'time':1,'k':50,'tile':13,'j':90,'i':90})\n", |
| 2561 | + " chunks={'time':1,'k':10,'tile':13,'j':90,'i':90})\n", |
2548 | 2562 | "\n", |
2549 | 2563 | "## repeat above with the grid file\n", |
2550 | 2564 | "grid_file = ecco_podaac_s3_open(ShortName=\"ECCO_L4_GEOMETRY_LLC0090GRID_V4R4\",\\\n", |
2551 | 2565 | " StartDate=\"1992-01\",EndDate=\"2017-12\")\n", |
2552 | | - "ds_grid = xr.open_dataset(grid_file)\n", |
| 2566 | + "ds_grid = xr.open_mfdataset([grid_file],chunks={'k':10,'tile':13,'j':90,'i':90})\n", |
2553 | 2567 | "\n", |
2554 | 2568 | "time_log.append(time.time())\n", |
2555 | 2569 | "time_to_open_files = np.diff(np.asarray(time_log)[-2:])[0]\n", |
|
2560 | 2574 | "\n", |
2561 | 2575 | "## compute volumes of each cell\n", |
2562 | 2576 | "cell_vol = ds_grid.hFacC*ds_grid.rA*ds_grid.drF\n", |
2563 | | - "cell_vol = cell_vol.compute()\n", |
2564 | 2577 | "\n", |
2565 | 2578 | "## mean temperature weighted by the volume of each cell\n", |
2566 | 2579 | "total_vol = cell_vol.sum().compute()\n", |
|
2722 | 2735 | "source": [ |
2723 | 2736 | "%%time\n", |
2724 | 2737 | "\n", |
| 2738 | + "# # un-comment and run this block if you just restarted the kernel\n", |
| 2739 | + "# import numpy as np\n", |
| 2740 | + "# import xarray as xr\n", |
| 2741 | + "# import matplotlib.pyplot as plt# \n", |
| 2742 | + "# from ecco_s3_retrieve import *\n", |
| 2743 | + "# import time\n", |
| 2744 | + "\n", |
| 2745 | + "\n", |
2725 | 2746 | "# Get/download 12 monthly files (temp/salinity) and grid parameters file to user's instance\n", |
2726 | 2747 | "\n", |
2727 | 2748 | "time_log = [time.time()]\n", |
|
2735 | 2756 | "ds = xr.open_mfdataset(file_list,\\\n", |
2736 | 2757 | " data_vars='minimal',coords='minimal',\\\n", |
2737 | 2758 | " compat='override',\\\n", |
2738 | | - " chunks={'time':1,'k':50,'tile':13,'j':90,'i':90})\n", |
| 2759 | + " chunks={'time':1,'k':10,'tile':13,'j':90,'i':90})\n", |
2739 | 2760 | "\n", |
2740 | 2761 | "## repeat above with the grid file\n", |
2741 | 2762 | "grid_file = ecco_podaac_s3_get(ShortName=\"ECCO_L4_GEOMETRY_LLC0090GRID_V4R4\",\\\n", |
2742 | 2763 | " StartDate=\"1992-01\",EndDate=\"2017-12\",\\\n", |
2743 | 2764 | " n_workers=2,return_downloaded_files=True)\n", |
2744 | | - "ds_grid = xr.open_dataset(grid_file)\n", |
| 2765 | + "ds_grid = xr.open_mfdataset([grid_file],chunks{'k':10,'tile':13,'j':90,'i':90})\n", |
2745 | 2766 | "\n", |
2746 | 2767 | "time_log.append(time.time())\n", |
2747 | 2768 | "time_to_get_files = np.diff(np.asarray(time_log)[-2:])[0]\n", |
|
2753 | 2774 | "\n", |
2754 | 2775 | "## compute volumes of each cell\n", |
2755 | 2776 | "cell_vol = ds_grid.hFacC*ds_grid.rA*ds_grid.drF\n", |
2756 | | - "cell_vol = cell_vol.compute()\n", |
2757 | 2777 | "\n", |
2758 | 2778 | "## mean temperature weighted by the volume of each cell\n", |
2759 | 2779 | "total_vol = cell_vol.sum().compute()\n", |
|