|
8 | 8 | "# Using the `waterdata` module to pull data from the USGS Water Data APIs\n", |
9 | 9 | "The `waterdata` module will eventually replace the `nwis` module for accessing USGS water data. It leverages the [Water Data APIs](https://api.waterdata.usgs.gov/) to download metadata, daily values, and instantaneous values. \n", |
10 | 10 | "\n", |
11 | | - "While the specifics of this transition timeline are hazy, it is advised to switch to the new functions as soon as possible to reduce unexpected interruptions in your workflow.\n", |
| 11 | + "While the specifics of this transition timeline are opaque, it is advised to switch to the new functions as soon as possible to reduce unexpected interruptions in your workflow.\n", |
12 | 12 | "\n", |
13 | 13 | "As always, please report any issues you encounter on our [Issues](https://github.com/DOI-USGS/dataretrieval-python/issues) page. If you have questions or need help, please reach out to us at comptools@usgs.gov." |
14 | 14 | ] |
|
29 | 29 | "``` \n", |
30 | 30 | "Note that the environment variable name is `API_USGS_PAT`, which stands for \"API USGS Personal Access Token\".\n", |
31 | 31 | "\n", |
32 | | - "If you'd like a more permanent repository-specific solution, you can use the `python-dotenv` package to read your API key from a `.env` file in your repository root directory, like this:\n", |
| 32 | + "If you'd like a more permanent, repository-specific solution, you can use the `python-dotenv` package to read your API key from a `.env` file in your repository root directory, like this:\n", |
33 | 33 | "\n", |
34 | 34 | "```python\n", |
35 | 35 | "!pip install python-dotenv # only run this line once to install the package in your environment\n", |
|
55 | 55 | "These functions retrieve metadata tables that can be used to refine your data requests.\n", |
56 | 56 | "\n", |
57 | 57 | "- `get_reference_table()` - Not sure which parameter code you're looking for, or which hydrologic unit your study area is in? This function will help you find the right input values for the data endpoints to retrieve the information you want.\n", |
58 | | - "- `get_codes()` - Similar to `get_reference_table()`, this function retrieves dataframes containing available input values that correspond to the Samples database.\n", |
| 58 | + "- `get_codes()` - Similar to `get_reference_table()`, this function retrieves dataframes containing available input values that correspond to the Samples water quality database.\n", |
59 | 59 | "\n", |
60 | 60 | "### Data endpoints\n", |
61 | 61 | "- `get_daily()` - Daily values for monitoring locations, parameters, stat codes, and more.\n", |
|
68 | 68 | "- `get_samples()` - Discrete water quality sample results for monitoring locations, observed properties, and more." |
69 | 69 | ] |
70 | 70 | }, |
| 71 | + { |
| 72 | + "cell_type": "markdown", |
| 73 | + "id": "19b5aebf", |
| 74 | + "metadata": {}, |
| 75 | + "source": [ |
| 76 | + "### A few key tips\n", |
| 77 | + "- You'll notice that each of the data functions has many unique inputs you can specify. **DO NOT** specify too many! Provide *just enough* inputs to return what you need; redundant geographical or parameter information may slow down your query or lead to errors.\n", |
| 78 | + "- Each function returns a tuple containing a dataframe and a metadata object. If you have `geopandas` installed in your environment, the dataframe will be a `GeoDataFrame` with a `geometry` column included. If you do not have `geopandas`, the dataframe will be a `pandas` dataframe with the geometry contained in a coordinates column. The metadata object contains information about your query, like the query URL.\n", |
| 79 | + "- If you do not want to return the `geometry` column, use the input `skip_geometry=True`." |
| 80 | + ] |
| 81 | + }, |
71 | 82 | { |
72 | 83 | "cell_type": "markdown", |
73 | 84 | "id": "68591b52", |
|
84 | 95 | "metadata": {}, |
85 | 96 | "outputs": [], |
86 | 97 | "source": [ |
| 98 | + "import pandas as pd\n", |
87 | 99 | "from IPython.display import display\n", |
| 100 | + "from datetime import date, datetime, timedelta\n", |
| 102 | + "from dateutil.relativedelta import relativedelta\n", |
88 | 103 | "from dataretrieval import waterdata" |
89 | 104 | ] |
90 | 105 | }, |
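| | + { |
| | + "cell_type": "markdown", |
| | + "id": "4f2a9c3e", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As noted in the tips above, each data function returns a tuple of a dataframe and a metadata object. Here is a quick sketch of unpacking that tuple; the site number is purely illustrative, and we are assuming `get_daily()` accepts the same `monitoring_location_id` and `parameter_code` inputs used elsewhere in this notebook, and that the metadata object exposes the request URL as `url`." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "8c7d1b5a", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Illustrative only: swap in any active USGS gage ID\n", |
| | + "df, md = waterdata.get_daily(\n", |
| | + "    monitoring_location_id='USGS-06805500',\n", |
| | + "    parameter_code='00060'\n", |
| | + ")\n", |
| | + "print(md.url)  # the request URL that produced this dataframe" |
| | + ] |
| | + }, |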
|
101 | 116 | }, |
102 | 117 | { |
103 | 118 | "cell_type": "markdown", |
104 | | - "id": "176c665b", |
| 119 | + "id": "1e0eab77", |
105 | 120 | "metadata": {}, |
106 | 121 | "source": [ |
107 | | - "What is this `metadata` element? Let's take a look:" |
| 122 | + "Let's say we want to find all parameter codes relating to streamflow discharge. We can use some string matching to find applicable codes." |
108 | 123 | ] |
109 | 124 | }, |
110 | 125 | { |
111 | 126 | "cell_type": "code", |
112 | 127 | "execution_count": null, |
113 | | - "id": "30b1b052", |
| 128 | + "id": "665ccb23", |
114 | 129 | "metadata": {}, |
115 | 130 | "outputs": [], |
116 | 131 | "source": [ |
117 | | - "metadata" |
| 132 | + "streamflow_pcodes = pcodes[pcodes['parameter_name'].str.contains('streamflow|discharge', case=False, na=False)]\n", |
| 133 | + "display(streamflow_pcodes[['parameter_code_id', 'parameter_name']])" |
118 | 134 | ] |
119 | 135 | }, |
120 | 136 | { |
121 | 137 | "cell_type": "markdown", |
122 | | - "id": "1e0eab77", |
| 138 | + "id": "d9487ee4", |
123 | 139 | "metadata": {}, |
124 | 140 | "source": [ |
125 | | - "All of these functions return Tuples containing a dataframe and a metadata element containing descriptors about the request made. This `BaseMetadata` object contains the request URL.\n", |
| 141 | + "Interesting that there are so many different streamflow-related parameter codes! Going on experience, let's use the most common one, `00060`, which is \"Discharge, cubic feet per second\".\n", |
126 | 142 | "\n", |
127 | | - "Let's say we want to find all parameter codes relating to streamflow discharge. We can use some string matching to find applicable codes." |
| 143 | + "Now that we know which parameter code we want to use, let's find all the stream monitoring locations that have recent discharge data and at least 10 years of daily values in the state of Nebraska. We will use the `waterdata.get_time_series_metadata()` function to suss out which sites fit the bill. This function will return a row for each *timeseries* that matches our inputs. It doesn't contain the daily discharge values themselves, just information *about* each timeseries." |
| 144 | + ] |
| 145 | + }, |
| 146 | + { |
| 147 | + "cell_type": "markdown", |
| 148 | + "id": "70ee1da9", |
| 149 | + "metadata": {}, |
| 150 | + "source": [ |
| 151 | + "First, let's get our expected date range in order. Note that the `waterdata` functions can take bounded and unbounded date and datetime ranges. In this case, we want the start date of the discharge timeseries to be no more recent than 10 years ago, and we want the end date of the timeseries to be no more than a week old." |
128 | 152 | ] |
129 | 153 | }, |
130 | 154 | { |
131 | 155 | "cell_type": "code", |
132 | 156 | "execution_count": null, |
133 | | - "id": "665ccb23", |
| 157 | + "id": "57e2c93a", |
134 | 158 | "metadata": {}, |
135 | 159 | "outputs": [], |
136 | 160 | "source": [ |
137 | | - "streamflow_pcodes = pcodes[pcodes['parameter_name'].str.contains('streamflow|discharge', case=False, na=False)]\n", |
138 | | - "display(streamflow_pcodes[['parameter_code_id', 'parameter_name']])" |
| 161 | + "ten_years_ago = date.today() - relativedelta(years=10)\n", |
| 162 | + "one_week_ago = (datetime.now() - timedelta(days=7)).date()" |
139 | 163 | ] |
140 | 164 | }, |
141 | 165 | { |
142 | 166 | "cell_type": "markdown", |
143 | | - "id": "d9487ee4", |
| 167 | + "id": "2cd98164", |
| 168 | + "metadata": {}, |
| 169 | + "source": [] |
| 170 | + }, |
| 171 | + { |
| 172 | + "cell_type": "code", |
| 173 | + "execution_count": null, |
| 174 | + "id": "a901f5fa", |
144 | 175 | "metadata": {}, |
| 176 | + "outputs": [], |
145 | 177 | "source": [ |
146 | | - "Interesting that there are so many different streamflow-related parameter codes! Going on experience, let's use the most common one, `00060`, which is \"Discharge, cubic feet per second\".\n", |
147 | | - "\n", |
148 | | - "Now that we know which parameter code we want to use, let's find all the stream monitoring locations that have recent discharge data and at least 10 years of daily values in the state of Nebraska. " |
| 178 | + "NE_discharge, _ = waterdata.get_time_series_metadata(\n", |
| 179 | + " state_name=\"Nebraska\",\n", |
| 180 | + " parameter_code='00060',\n", |
| 181 | + " begin=f\"1700-01-01/{ten_years_ago}\",\n", |
| 182 | + " end=f\"{one_week_ago}/..\",\n", |
| 183 | + " skip_geometry=True\n", |
| 184 | + ")" |
| 185 | + ] |
| 186 | + }, |
| 187 | + { |
| 188 | + "cell_type": "code", |
| 189 | + "execution_count": null, |
| 190 | + "id": "8809a98d", |
| 191 | + "metadata": {}, |
| 192 | + "outputs": [], |
| 193 | + "source": [ |
| 194 | + "display(NE_discharge.sort_values(\"monitoring_location_id\").head())\n", |
| 195 | + "print(f\"There are {len(NE_discharge['monitoring_location_id'].unique())} sites with recent discharge data available in the state of Nebraska\")" |
| 196 | + ] |
| 197 | + }, |
| 198 | + { |
| 199 | + "cell_type": "markdown", |
| 200 | + "id": "8f464470", |
| 201 | + "metadata": {}, |
| 202 | + "source": [ |
| 203 | + "In the dataframe above, we are looking at 5 timeseries returned, ordered by monitoring location. You can also see that the first two rows show two different kinds of discharge for the same monitoring location: a mean daily discharge timeseries (with statistic id 00003, which represents \"mean\") and an instantaneous discharge timeseries (with statistic id 00011, which represents \"points\" or \"instantaneous\" values). Look closely and you may also notice that the `parent_timeseries_id` column for daily mean discharge matches the `time_series_id` for the instantaneous discharge. This is because once instantaneous measurements began at the site, they were used to calculate the daily mean." |
| 204 | + ] |
| 205 | + }, |
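| | + { |
| | + "cell_type": "markdown", |
| | + "id": "b3e6f0a2", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "To see those parent/child links for yourself, you can join the table to itself on the two ID columns. This is just a sketch, assuming the `time_series_id` and `parent_timeseries_id` column names shown above." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "c9a4d7e1", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Match each daily timeseries to the timeseries it was computed from\n", |
| | + "linked = NE_discharge.merge(\n", |
| | + "    NE_discharge,\n", |
| | + "    left_on='parent_timeseries_id',\n", |
| | + "    right_on='time_series_id',\n", |
| | + "    suffixes=('_daily', '_inst')\n", |
| | + ")\n", |
| | + "display(linked[['monitoring_location_id_daily', 'time_series_id_daily', 'time_series_id_inst']].head())" |
| | + ] |
| | + }, |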
| 206 | + { |
| 207 | + "cell_type": "markdown", |
| 208 | + "id": "452b8830", |
| 209 | + "metadata": {}, |
| 210 | + "source": [ |
| 211 | + "Now that we know which sites have recent discharge data, let's find stream sites and plot them on a map. We will use the `waterdata.get_monitoring_locations()` function to grab more metadata about these sites. Even though we have a list of monitoring location IDs from the timeseries function, it's faster to use the `state_name` argument to return all stream sites, and then filter down to the ones we're interested in." |
149 | 212 | ] |
150 | 213 | }, |
151 | 214 | { |
|
155 | 218 | "metadata": {}, |
156 | 219 | "outputs": [], |
157 | 220 | "source": [ |
158 | | - "NE_locations,_ = waterdata.get_monitoring_locations(state_name=\"Nebraska\", site_type_code=\"ST\")\n", |
159 | | - "display(NE_locations.head())" |
| 221 | + "NE_locations, _ = waterdata.get_monitoring_locations(\n", |
| 222 | + " state_name=\"Nebraska\",\n", |
| 223 | + " site_type_code=\"ST\"\n", |
| 224 | + ")\n", |
| 225 | + "\n", |
| 226 | + "NE_locations_discharge = NE_locations.loc[NE_locations['monitoring_location_id'].isin(NE_discharge['monitoring_location_id'].unique())]\n", |
| 227 | + "display(NE_locations_discharge[[\"monitoring_location_id\", \"monitoring_location_name\", \"hydrologic_unit_code\"]].head())" |
| 228 | + ] |
| 229 | + }, |
| 230 | + { |
| 231 | + "cell_type": "markdown", |
| 232 | + "id": "f0fe5c4e", |
| 233 | + "metadata": {}, |
| 234 | + "source": [ |
| 235 | + "If you have `geopandas` installed, the function will return a `GeoDataFrame` with a `geometry` column containing the monitoring locations' coordinates. If you don't have `geopandas` installed, it will return a regular `pandas` DataFrame with coordinate columns instead. Let's take a look at the site locations using the `GeoDataFrame.explore()` method. Hover over the site points to see all the columns returned from `waterdata.get_monitoring_locations()`." |
| 236 | + ] |
| 237 | + }, |
| 238 | + { |
| 239 | + "cell_type": "code", |
| 240 | + "execution_count": null, |
| 241 | + "id": "659b19a5", |
| 242 | + "metadata": {}, |
| 243 | + "outputs": [], |
| 244 | + "source": [ |
| 245 | + "NE_locations_discharge.set_crs(crs=\"WGS84\").explore()" |
| 246 | + ] |
| 247 | + }, |
| 248 | + { |
| 249 | + "cell_type": "code", |
| 250 | + "execution_count": null, |
| 251 | + "id": "f1d92784", |
| 252 | + "metadata": {}, |
| 253 | + "outputs": [], |
| 254 | + "source": [ |
| 255 | + "ne_sites = NE_locations['monitoring_location_id'].to_list()\n", |
| 256 | + "print(len(ne_sites))" |
| 257 | + ] |
| 258 | + }, |
| 259 | + { |
| 260 | + "cell_type": "markdown", |
| 261 | + "id": "897ca5e1", |
| 262 | + "metadata": {}, |
| 263 | + "source": [ |
| 264 | + "We cannot feed 1700+ monitoring location IDs into the time series metadata function at once, so we will need to break this list up into smaller chunks and loop through them." |
| 265 | + ] |
| 266 | + }, |
| 267 | + { |
| 268 | + "cell_type": "code", |
| 269 | + "execution_count": null, |
| 270 | + "id": "58307318", |
| 271 | + "metadata": {}, |
| 272 | + "outputs": [], |
| 273 | + "source": [ |
| 274 | + "chunk_size = 50\n", |
| 275 | + "chunks = [ne_sites[i:i + chunk_size] for i in range(0, len(ne_sites), chunk_size)]\n", |
| 276 | + "len(chunks)" |
| 277 | + ] |
| 278 | + }, |
| 279 | + { |
| 280 | + "cell_type": "markdown", |
| 281 | + "id": "89379400", |
| 282 | + "metadata": {}, |
| 283 | + "source": [ |
| 284 | + "Now, we will loop through each chunk of sites and pull timeseries information for sites that have discharge data from the past week and a record at least 10 years long." |
| 285 | + ] |
| 286 | + }, |
| 287 | + { |
| 288 | + "cell_type": "code", |
| 289 | + "execution_count": null, |
| 290 | + "id": "52a67e9e", |
| 291 | + "metadata": {}, |
| 292 | + "outputs": [], |
| 293 | + "source": [ |
| 294 | + "dfs_list = []\n", |
| 295 | + "for site_group in chunks:\n", |
| 296 | + "    try:\n", |
| 297 | + "        timeseries_info, _ = waterdata.get_time_series_metadata(\n", |
| 298 | + "            monitoring_location_id=site_group,\n", |
| 299 | + "            parameter_code='00060',\n", |
| 300 | + "            begin=f\"1700-01-01/{ten_years_ago}\",\n", |
| 301 | + "            end=f\"{one_week_ago}/..\",\n", |
| 302 | + "            skip_geometry=True\n", |
| 303 | + "        )\n", |
| 304 | + "        if not timeseries_info.empty:\n", |
| 305 | + "            dfs_list.append(timeseries_info)\n", |
| 306 | + "    except Exception as e:\n", |
| 307 | + "        # Log and continue; you could also implement retries here\n", |
| 308 | + "        print(f\"Batch failed (size={len(site_group)}): {e}\")\n", |
| 309 | + "\n", |
| 310 | + "# Concatenate once at the end rather than growing a dataframe in the loop\n", |
| 311 | + "dfs = pd.concat(dfs_list) if dfs_list else pd.DataFrame()\n", |
| 312 | + "display(dfs.head())" |
| 311 | + ] |
| 312 | + }, |
| 313 | + { |
| 314 | + "cell_type": "markdown", |
| 315 | + "id": "d1b01ca3", |
| 316 | + "metadata": {}, |
| 317 | + "source": [ |
| 318 | + "One thing you might notice is that this dataframe includes a `state_name` column!" |
160 | 319 | ] |
161 | 320 | } |
162 | 321 | ], |
|