Commit aa48ad6
Rename, fix, and extend NAWQA (NWQN) demo (#153)
* Rename, fix, and extend NAWQA (NWQN) demo
1 parent 2293a51 commit aa48ad6

8 files changed

Lines changed: 345 additions & 138 deletions

File tree

demos/nawqa_data_pull/lithops.yaml

Lines changed: 0 additions & 14 deletions
This file was deleted.

demos/nawqa_data_pull/retrieve_nawqa_with_lithops.py

Lines changed: 0 additions & 112 deletions
This file was deleted.
File renamed without changes.
Lines changed: 19 additions & 12 deletions
@@ -1,9 +1,14 @@
-# Retrieva data from the National Water Quality Assessment Program (NAWQA)
+# Retrieve data from the National Water Quality Network (NWQN)
 
-This examples walks through using lithops to retrieve data from every NAWQA
-monitoring site, then writes the results to a parquet files on s3. Each
-retrieval also searches the NLDI for neighboring sites with NAWQA data and
-merges those data assuming the monitoring site was relocated.
+> This usage example is for demonstration and not for research or
+> operational use.
+
+This example uses Lithops to retrieve data from every NWQN
+monitoring site, then writes the results to Parquet files on S3. Each
+retrieval also searches the NLDI for neighboring sites with NWQN data and
+merges those data. In the streamflow retrieval, the neighborhood search
+progressively fills in gaps in the record by taking data from the
+nearest streamgage and rescaling it by the drainage area ratio.
 
 1. Set up a Python environment
 ```bash
@@ -12,33 +17,35 @@ conda activate dataretrieval-lithops
 pip install -r requirements.txt
 ```
 
-1. Configure compute and storage backends for [lithops](https://lithops-cloud.github.io/docs/source/configuration.html).
+2. Configure compute and storage backends for [lithops](https://lithops-cloud.github.io/docs/source/configuration.html).
 The configuration in `lithops.yaml` uses AWS Lambda for [compute](https://lithops-cloud.github.io/docs/source/compute_config/aws_lambda.html) and AWS S3 for [storage](https://lithops-cloud.github.io/docs/source/storage_config/aws_s3.html).
 To use those backends, simply edit `lithops.yaml` with your `bucket` and `execution_role`.
 
-1. Build a runtime image for Cubed
+3. Build a runtime image for Cubed
 ```bash
 export LITHOPS_CONFIG_FILE=$(pwd)/lithops.yaml
 lithops runtime build -b aws_lambda -f Dockerfile_dataretrieval dataretrieval-runtime
 ```
 
-1. Download site list
+4. Download the site list from ScienceBase using `wget`, or navigate to the URL and copy the CSV into `nwqn_data_pull/`.
 ```bash
 wget https://www.sciencebase.gov/catalog/file/get/655d2063d34ee4b6e05cc9e6?f=__disk__b3%2F3e%2F5b%2Fb33e5b0038f004c2a48818d0fcc88a0921f3f689 -O NWQN_sites.csv
 ```
 
-1. Create a s3 bucket for the output, then set it as an environmental variable
+5. Create an S3 bucket for the output, then set it as an environment variable
 ```bash
 export DESTINATION_BUCKET=<path/to/bucket>
 ```
 
-1. Run the script
+6. Run the scripts
 ```bash
-python retrieve_nawqa_with_lithops.py
+python retrieve_nwqn_samples.py
+
+python retrieve_nwqn_streamflow.py
 ```
 
 ## Cleaning up
-To rebuild the Litops image, delete the existing one by running
+To rebuild the Lithops image, delete the existing one by running
 ```bash
 lithops runtime delete -b aws_lambda -d dataretrieval-runtime
 ```
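The drainage-area-ratio rescaling mentioned in the README's intro can be sketched in a few lines. This is an illustration only; `scale_by_drainage_area` is a hypothetical helper, not a function in this demo:

```python
def scale_by_drainage_area(neighbor_flow, neighbor_area_sq_mi, target_area_sq_mi):
    """Estimate flow at a target site from a neighboring streamgage by
    scaling with the ratio of drainage areas (hypothetical helper)."""
    return neighbor_flow * (target_area_sq_mi / neighbor_area_sq_mi)

# A gage draining 500 sq mi borrows from a neighbor draining 1000 sq mi:
print(scale_by_drainage_area(2000.0, 1000.0, 500.0))  # -> 1000.0
```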

demos/nwqn_data_pull/lithops.yaml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+lithops:
+  backend: aws_lambda
+  storage: aws_s3
+
+aws:
+  region: us-west-2
+
+aws_lambda:
+  execution_role: arn:aws:iam::account-id:role/lambdaLithopsExecutionRole
+  runtime: dataretrieval-runtime
+  runtime_memory: 1024
+  runtime_timeout: 900
+
+aws_s3:
+  bucket: arn:aws:s3:::the-name-of-your-bucket
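For reference, the same settings can be passed to Lithops in code rather than via `LITHOPS_CONFIG_FILE`; this dict mirrors `lithops.yaml` above (the role ARN and bucket name are the same placeholders):

```python
# Mirrors lithops.yaml above; can be passed as
# lithops.FunctionExecutor(config=config) instead of reading a config file.
config = {
    "lithops": {"backend": "aws_lambda", "storage": "aws_s3"},
    "aws": {"region": "us-west-2"},
    "aws_lambda": {
        "execution_role": "arn:aws:iam::account-id:role/lambdaLithopsExecutionRole",
        "runtime": "dataretrieval-runtime",
        "runtime_memory": 1024,  # MB
        "runtime_timeout": 900,  # seconds; the AWS Lambda maximum
    },
    "aws_s3": {"bucket": "arn:aws:s3:::the-name-of-your-bucket"},
}
```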
demos/nwqn_data_pull/retrieve_nwqn_samples.py

Lines changed: 174 additions & 0 deletions

@@ -0,0 +1,174 @@
+# Retrieve data from the National Water Quality Assessment Program (NAWQA)
+
+import lithops
+import math
+import os
+import pandas as pd
+
+from random import randint
+from time import sleep
+from dataretrieval import nldi, nwis, wqp
+
+DESTINATION_BUCKET = os.environ.get('DESTINATION_BUCKET')
+PROJECT = "National Water Quality Assessment Program (NAWQA)"
+# some sites are not found in NLDI; avoid them for now
+NOT_FOUND_SITES = [
+    "15565447",  # "USGS-"
+    "15292700",
+]
+BAD_GEOMETRY_SITES = [
+    "06805500",
+    "09306200",
+]
+
+BAD_NLDI_SITES = NOT_FOUND_SITES + BAD_GEOMETRY_SITES
+
+
+def map_retrieval(site):
+    """Map function to pull data from NWIS and WQP"""
+    print(f"Retrieving samples from site {site}")
+    # skip bad sites
+    if site in BAD_NLDI_SITES:
+        site_list = [site]
+    # else query slowly
+    else:
+        sleep(randint(0, 5))
+        site_list = find_neighboring_sites(site)
+
+    # reformat for wqp
+    site_list = [f"USGS-{site}" for site in site_list]
+
+    df, _ = wqp_get_results(siteid=site_list,
+                            project=PROJECT,
+                            )
+
+    try:
+        # merge sites
+        df['MonitoringLocationIdentifier'] = f"USGS-{site}"
+        df.astype(str).to_parquet(f's3://{DESTINATION_BUCKET}/nwqn-samples.parquet',
+                                  engine='pyarrow',
+                                  partition_cols=['MonitoringLocationIdentifier'],
+                                  compression='zstd')
+        # optionally, `return df` for further processing
+
+    except Exception as e:
+        print(f"No samples returned from site {site}: {e}")
+
+
+def exponential_backoff(max_retries=5, base_delay=1):
+    """Exponential backoff decorator with configurable retries and base delay"""
+    def decorator(func):
+        def wrapper(*args, **kwargs):
+            attempts = 0
+            while True:
+                try:
+                    return func(*args, **kwargs)
+                except Exception as e:
+                    attempts += 1
+                    if attempts > max_retries:
+                        raise e
+                    wait_time = base_delay * (2 ** attempts)
+                    print(f"Retrying in {wait_time} seconds...")
+                    sleep(wait_time)
+        return wrapper
+    return decorator
+
+
+@exponential_backoff(max_retries=5, base_delay=1)
+def nwis_get_info(*args, **kwargs):
+    return nwis.get_info(*args, **kwargs)
+
+
+@exponential_backoff(max_retries=5, base_delay=1)
+def wqp_get_results(*args, **kwargs):
+    return wqp.get_results(*args, **kwargs)
+
+
+@exponential_backoff(max_retries=3, base_delay=1)
+def find_neighboring_sites(site, search_factor=0.1, fudge_factor=3.0):
+    """Find sites upstream and downstream of the given site within a certain distance.
+
+    TODO Use geoconnex to determine mainstem length
+
+    Parameters
+    ----------
+    site : str
+        8-digit site number.
+    search_factor : float, optional
+        The factor by which to multiply the watershed length to determine the
+        search distance.
+    fudge_factor : float, optional
+        An additional fudge factor to apply to the search distance, because
+        watersheds are not circular.
+    """
+    site_df, _ = nwis_get_info(sites=site)
+    drain_area_sq_mi = site_df["drain_area_va"].values[0]
+    length = _estimate_watershed_length_km(drain_area_sq_mi)
+    search_distance = length * search_factor * fudge_factor
+    # clip between 1 and 9999 km
+    search_distance = max(1.0, min(9999.0, search_distance))
+
+    # get upstream and downstream sites
+    gdfs = [
+        nldi.get_features(
+            feature_source="WQP",
+            feature_id=f"USGS-{site}",
+            navigation_mode=mode,
+            distance=search_distance,
+            data_source="nwissite",
+        )
+        for mode in ["UM", "DM"]  # upstream and downstream
+    ]
+
+    features = pd.concat(gdfs, ignore_index=True)
+
+    df, _ = nwis_get_info(sites=list(features.identifier.str.strip('USGS-')))
+    # drop sites with dissimilar drainage areas
+    df = df.where(
+        (df["drain_area_va"] / drain_area_sq_mi) > search_factor,
+    ).dropna(how="all")
+
+    site_list = df["site_no"].to_list()
+
+    # include the original search site among the neighbors
+    if site not in site_list:
+        site_list.append(site)
+
+    return site_list
+
+
+def _estimate_watershed_length_km(drain_area_sq_mi):
+    """Estimate the diameter assuming a circular watershed.
+
+    Parameters
+    ----------
+    drain_area_sq_mi : float
+        The drainage area in square miles.
+
+    Returns
+    -------
+    float
+        The diameter of the watershed in kilometers.
+    """
+    # assume a circular watershed
+    length_miles = 2 * (drain_area_sq_mi / math.pi) ** 0.5
+    # convert from miles to km
+    return length_miles * 1.60934
+
+
+if __name__ == "__main__":
+    project = "National Water Quality Assessment Program (NAWQA)"
+
+    site_df = pd.read_csv(
+        'NWQN_sites.csv',
+        comment='#',
+        dtype={'SITE_QW_ID': str, 'SITE_FLOW_ID': str},
+    )
+
+    site_list = site_df['SITE_QW_ID'].to_list()
+    # site_list = site_list[:2]  # prune for testing
+
+    fexec = lithops.FunctionExecutor(config_file="lithops.yaml")
+    futures = fexec.map(map_retrieval, site_list)
+
+    futures.get_result()
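The `exponential_backoff` decorator in the file above is self-contained and can be exercised locally with a flaky stub. This is a sketch: `flaky` is hypothetical, and `base_delay=0` keeps the demo instant instead of sleeping:

```python
from time import sleep


def exponential_backoff(max_retries=5, base_delay=1):
    """Same retry pattern as in retrieve_nwqn_samples.py."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            attempts = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception:
                    attempts += 1
                    if attempts > max_retries:
                        raise
                    sleep(base_delay * (2 ** attempts))
        return wrapper
    return decorator


calls = {"n": 0}


@exponential_backoff(max_retries=3, base_delay=0)  # base_delay=0: no real waiting
def flaky():
    """Hypothetical stub that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = flaky()
print(result, calls["n"])  # -> ok 3 (succeeds on the third attempt)
```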
