|
77 | 77 | "- You'll notice that each of the data functions have many unique inputs you can specify. **DO NOT** specify too many! Specify *just enough* inputs to return what you need. But do not provide redundant geographical or parameter information as this may slow down your query and lead to errors.\n", |
78 | 78 | "- Each function returns a Tuple, containing a dataframe and a Metadata class. If you have `geopandas` installed in your environment, the dataframe will be a `GeoDataFrame` with a geometry included. If you do not have `geopandas`, the dataframe will be a `pandas` dataframe with the geometry contained in a coordinates column. The Metadata object contains information about your query, like the query url.\n", |
79 | 79 | "- If you do not want to return the `geometry` column, use the input `skip_geometry=True`.\n", |
80 | | - "- All of these functions (except `get_samples()`) have a `limit` argument, which signifies the number of rows returned with each \"page\" of data. The Water Data APIs use paging to chunk up large responses and send data most efficiently to the requester. The `waterdata` functions collect the rows of data from each page and combine them into one final dataframe at the end. The default and maximum limit per page is 50,000 rows. In other words, if you request 100,000 rows of data from the database, it will return all the data in 2 pages, and each page counts as a \"request\" using your API key. If you were to change the argument to `limit=10000`, then each page returned would contain 10,000 rows, and it would take 10 requests/pages to return the total 100,000 rows. In general, there is no need to adjust the `limit` argument. However, if you are working with slow internet speeds, adjusting the `limit` argument may reduce chances of failures due to bandwidth." |
| 80 | + "- All of these functions (except `get_samples()`) have a `limit` argument, which signifies the number of rows returned with each \"page\" of data. The Water Data APIs use paging to chunk up large responses and send data most efficiently to the requester. The `waterdata` functions collect the rows of data from each page and combine them into one final dataframe at the end. The default and maximum limit per page is 50,000 rows. In other words, if you request 100,000 rows of data from the database, it will return all the data in 2 pages, and each page counts as a \"request\" using your API key. If you were to change the argument to `limit=10000`, then each page returned would contain 10,000 rows, and it would take 10 requests/pages to return the total 100,000 rows. In general, there is no need to adjust the `limit` argument. However, if you are working with slow internet speeds, adjusting the `limit` argument may reduce chances of failures due to bandwidth.\n", |
| 81 | + "- You can find some other helpful tips in the [Water Data API documentation](https://api.waterdata.usgs.gov/docs/ogcapi/)." |
81 | 82 | ] |
82 | 83 | }, |
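| 84 | + { |
| 85 | + "cell_type": "markdown", |
| 86 | + "id": "4b7f2c19", |
| 87 | + "metadata": {}, |
| 88 | + "source": [ |
| 89 | + "As a quick, illustrative sketch of the points above (not run here), this is what unpacking the returned tuple and adjusting `limit` can look like. The attribute holding the query URL on the Metadata object is an assumption in this sketch; inspect the object in your own session to see exactly what it contains.\n", |
| 90 | + "\n", |
| 91 | + "```python\n", |
| 92 | + "# Illustrative only: unpack the (dataframe, Metadata) tuple and lower the page size\n", |
| 93 | + "sites, md = waterdata.get_monitoring_locations(\n", |
| 94 | + "    state_name=\"Nebraska\",  # one geographic filter is enough; avoid redundant inputs\n", |
| 95 | + "    skip_geometry=True,  # drop the geometry column if you don't need it\n", |
| 96 | + "    limit=10000,  # smaller pages can help on slow connections\n", |
| 97 | + ")\n", |
| 98 | + "print(md.url)  # assumed attribute name for the query URL\n", |
| 99 | + "```" |
| 100 | + ] |
| 101 | + }, |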
83 | 84 | { |
|
86 | 87 | "metadata": {}, |
87 | 88 | "source": [ |
88 | 89 | "## Examples\n", |
89 | | - "Let's get into some examples using the functions listed above. First, we need to load the `waterdata` module and a few other packages and functions to go through the examples." |
| 90 | + "Let's get into some examples using the functions listed above. First, we need to load the `waterdata` module and a few other packages and functions to go through the examples. To run the entirety of this notebook, you will need to install `dataretrieval`, `matplotlib`, and `geopandas` packages. `matplotlib` is needed to create the plots, and `geopandas` is needed to create the interactive maps." |
90 | 91 | ] |
91 | 92 | }, |
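| 93 | + { |
| 94 | + "cell_type": "markdown", |
| 95 | + "id": "7d0c55e2", |
| 96 | + "metadata": {}, |
| 97 | + "source": [ |
| 98 | + "If you don't already have these packages, one way to install them from a notebook cell (a sketch; adapt to your own environment or package manager) is:\n", |
| 99 | + "\n", |
| 100 | + "```python\n", |
| 101 | + "%pip install dataretrieval matplotlib geopandas\n", |
| 102 | + "```" |
| 103 | + ] |
| 104 | + }, |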
92 | 93 | { |
|
104 | 105 | "from datetime import datetime, timedelta\n", |
105 | 106 | "from datetime import date\n", |
106 | 107 | "from dateutil.relativedelta import relativedelta\n", |
107 | | - "from dataretrieval import waterdata" |
| 108 | + "from dataretrieval import waterdata\n", |
| 109 | + "\n", |
| 110 | + "# Check if geopandas is installed\n", |
| 111 | + "import importlib.util\n", |
| 112 | + "if importlib.util.find_spec(\"geopandas\") is None:\n", |
| 113 | + " GEOPANDAS=False\n", |
| 114 | + "else:\n", |
| 115 | + " GEOPANDAS=True" |
108 | 116 | ] |
109 | 117 | }, |
110 | 118 | { |
|
176 | 184 | "one_week_ago = (datetime.now() - timedelta(days=7)).date().strftime(\"%Y-%m-%d\")" |
177 | 185 | ] |
178 | 186 | }, |
| 187 | + { |
| 188 | + "cell_type": "markdown", |
| 189 | + "id": "261b5a32", |
| 190 | + "metadata": {}, |
| 191 | + "source": [ |
| 192 | + "We will also use the `skip_geometry` argument in our timeseries metadata call. By default, most `waterdata` functions return a geometry column containing the monitoring location's coordinates. This is a really cool feature that we will use later, but for this particular data pull, we don't need it. Setting `skip_geometry=True` makes the returned dataframe smaller and more efficient." |
| 193 | + ] |
| 194 | + }, |
179 | 195 | { |
180 | 196 | "cell_type": "code", |
181 | 197 | "execution_count": null, |
|
208 | 224 | "id": "8f464470", |
209 | 225 | "metadata": {}, |
210 | 226 | "source": [ |
211 | | - "In the dataframe above, we are looking at 5 timeseries returned, ordered by monitoring location. You can also see that the first two rows show two different kinds of discharge for the same monitoring location: a mean daily discharge timeseries (with statistic id 00003, which represents \"mean\") and an instantaneous discharge timeseries (with statistic id 00011, which represents \"points\" or \"instantaneous\" values). Look closely and you may also notice that the `parent_timeseries_id` column for daily mean discharge matches the `time_series_id` for the instantaneous discharge. This is because once instantaneous measurements began at the site, they were used to calculate the daily mean." |
| 227 | + "In the dataframe above, we are looking at 5 timeseries returned, ordered by monitoring location. You can also see that the first two rows show two different kinds of discharge for the same monitoring location: a mean daily discharge timeseries (with [statistic id](https://api.waterdata.usgs.gov/docs/ogcapi/) 00003, which represents \"mean\") and an instantaneous discharge timeseries (with statistic id 00011, which represents \"points\" or \"instantaneous\" values). Look closely and you may also notice that the `parent_timeseries_id` column for daily mean discharge matches the `time_series_id` for the instantaneous discharge. This is because once instantaneous measurements began at the site, they were used to calculate the daily mean." |
212 | 228 | ] |
213 | 229 | }, |
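| 230 | + { |
| 231 | + "cell_type": "markdown", |
| 232 | + "id": "9e5a40c7", |
| 233 | + "metadata": {}, |
| 234 | + "source": [ |
| 235 | + "If you'd like to verify that parent/child relationship programmatically, here is a quick, optional check. It assumes the dataframe shown above is `NE_discharge` and uses only the `time_series_id` and `parent_timeseries_id` columns discussed above; nothing here is required for the rest of the notebook." |
| 236 | + ] |
| 237 | + }, |
| 238 | + { |
| 239 | + "cell_type": "code", |
| 240 | + "execution_count": null, |
| 241 | + "id": "c4d81f6b", |
| 242 | + "metadata": {}, |
| 243 | + "outputs": [], |
| 244 | + "source": [ |
| 245 | + "# Optional check (column names from the discussion above): which timeseries\n", |
| 246 | + "# have their parent series present in this same pull?\n", |
| 247 | + "has_parent = NE_discharge[\"parent_timeseries_id\"].isin(NE_discharge[\"time_series_id\"])\n", |
| 248 | + "display(NE_discharge.loc[has_parent, [\"monitoring_location_id\", \"time_series_id\", \"parent_timeseries_id\"]].head())" |
| 249 | + ] |
| 250 | + }, |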
214 | 230 | { |
|
217 | 233 | "metadata": {}, |
218 | 234 | "source": [ |
219 | 235 | "### Monitoring locations\n", |
220 | | - "Now that we know which sites have recent discharge data, let's find stream sites and plot them on a map. We will use the `waterdata.get_monitoring_locations()` function to grab more metadata about these sites. Even though we have a list of monitoring location IDs from the timeseries function, it's faster to use the `state_name` argument to return all stream sites, and then filter down to the ones we're interested in." |
| 236 | + "Now that we know which sites have recent discharge data, let's find stream sites and plot them on a map. We will use the `waterdata.get_monitoring_locations()` function to grab more metadata about these sites.\n", |
| 237 | + "\n", |
| 238 | + "We can feed the unique monitoring location IDs from `NE_discharge` into the `get_monitoring_locations()` function to get the metadata for just those sites. However, there is a limit to the number of IDs that can be passed in one call to the API. The function may be able to handle the ~100 sites in one go, but for demonstration purposes, we will split the list of monitoring location IDs into a few chunks of 50 sent to the API and stitch the resulting dataframes together. Further down in this notebook, you'll see an example where we successfully feed all ~100 IDs in one call to the API. A loose rule of thumb is to keep the number of IDs below 200, but this exact number will depend on the typical length of each monitoring location ID (i.e. if your monitoring location IDs are > 13 characters long: \"USGS-XXXXXXXX\"+, you will need to feed in less than 200 at a time)." |
| 239 | + ] |
| 240 | + }, |
| 241 | + { |
| 242 | + "cell_type": "code", |
| 243 | + "execution_count": null, |
| 244 | + "id": "3c3eeac3", |
| 245 | + "metadata": {}, |
| 246 | + "outputs": [], |
| 247 | + "source": [ |
| 248 | + "chunk_size=50\n", |
| 249 | + "site_list = NE_discharge['monitoring_location_id'].unique().tolist()\n", |
| 250 | + "chunks = [site_list[i:i + chunk_size] for i in range(0, len(site_list), chunk_size)]\n", |
| 251 | + "NE_locations = pd.DataFrame()\n", |
| 252 | + "for site_group in chunks:\n", |
| 253 | + " try:\n", |
| 254 | + " chunk_data,_ = waterdata.get_monitoring_locations(\n", |
| 255 | + " monitoring_location_id=site_group,\n", |
| 256 | + " site_type_code=\"ST\"\n", |
| 257 | + " )\n", |
| 258 | + " if not chunk_data.empty:\n", |
| 259 | + " NE_locations = pd.concat([NE_locations, chunk_data])\n", |
| 260 | + " except Exception as e:\n", |
| 261 | + " print(f\"Chunk failed: {e}\")\n", |
| 262 | + "\n", |
| 263 | + "display(NE_locations[[\"monitoring_location_id\", \"monitoring_location_name\", \"hydrologic_unit_code\"]].head())" |
| 264 | + ] |
| 265 | + }, |
| 266 | + { |
| 267 | + "cell_type": "markdown", |
| 268 | + "id": "21a0f28f", |
| 269 | + "metadata": {}, |
| 270 | + "source": [ |
| 271 | + "That took a little bit of work to loop through the site chunks and bind the data back together. Admittedly, there may be times where chunking and iterating might be the most efficient workflow. But in this particular case, we have a less onerous option available: `get_monitoring_locations()` has a `state_name` argument. It will likely be faster to pull all stream sites for Nebraska and then filter down to the sites present in the timeseries dataframe: no iteration needed. Let's try this too." |
221 | 272 | ] |
222 | 273 | }, |
223 | 274 | { |
|
241 | 292 | "id": "f0fe5c4e", |
242 | 293 | "metadata": {}, |
243 | 294 | "source": [ |
244 | | - "If you have `geopandas` installed, the function will return a `GeoDataFrame` with a `geometry` column containing the monitoring locations' coordinates. If you don't have `geopandas` installed, it will return a regular `pandas` DataFrame with coordinate columns instead. Let's take a look at the site locations using `gpd.explore()`. Hover over the site points to see all the columns returned from `waterdata.get_monitoring_locations()`." |
| 295 | + "If you have `geopandas` installed, the function will return a `GeoDataFrame` with a `geometry` column containing the monitoring locations' coordinates. You can use `gpd.explore()` to view your geometry coordinates on an interactive map. We will demo this functionality below (Hover over the site points to see all the columns returned from `waterdata.get_monitoring_locations()`). If you don't have `geopandas` installed, `dataretrieval` will return a regular `pandas` DataFrame with coordinate columns instead." |
245 | 296 | ] |
246 | 297 | }, |
247 | 298 | { |
|
251 | 302 | "metadata": {}, |
252 | 303 | "outputs": [], |
253 | 304 | "source": [ |
254 | | - "NE_locations_discharge.set_crs(crs=\"WGS84\").explore()" |
| 305 | + "if GEOPANDAS:\n", |
| 306 | + " NE_locations_discharge.set_crs(crs=\"WGS84\").explore()" |
255 | 307 | ] |
256 | 308 | }, |
257 | 309 | { |
|
295 | 347 | "metadata": {}, |
296 | 348 | "outputs": [], |
297 | 349 | "source": [ |
298 | | - "latest_dv['date'] = latest_dv['time'].astype(str)\n", |
299 | | - "\n", |
300 | | - "latest_dv[['geometry', 'monitoring_location_id', 'date', 'value', 'unit_of_measure']].set_crs(crs=\"WGS84\").explore(column='value', tiles='CartoDB dark matter', cmap='YlOrRd', scheme=None, legend=True)" |
| 350 | + "if GEOPANDAS:\n", |
| 351 | + " latest_dv['date'] = latest_dv['time'].astype(str)\n", |
| 352 | + " latest_dv[['geometry', 'monitoring_location_id', 'date', 'value', 'unit_of_measure']].set_crs(crs=\"WGS84\").explore(column='value', tiles='CartoDB dark matter', cmap='YlOrRd', scheme=None, legend=True)" |
301 | 353 | ] |
302 | 354 | }, |
303 | 355 | { |
|
320 | 372 | " parameter_code=\"00060\",\n", |
321 | 373 | " statistic_id=\"00011\"\n", |
322 | 374 | ")\n", |
323 | | - "latest_instantaneous['datetime'] = latest_instantaneous['time'].astype(str)\n", |
324 | 375 | "\n", |
325 | | - "latest_instantaneous[['geometry', 'monitoring_location_id', 'datetime', 'value', 'unit_of_measure']].set_crs(crs=\"WGS84\").explore(column='value', cmap='YlOrRd', scheme=None, legend=True)" |
| 376 | + "if GEOPANDAS:\n", |
| 377 | + " latest_instantaneous['datetime'] = latest_instantaneous['time'].astype(str)\n", |
| 378 | + "\n", |
| 379 | + " latest_instantaneous[['geometry', 'monitoring_location_id', 'datetime', 'value', 'unit_of_measure']].set_crs(crs=\"WGS84\").explore(column='value', cmap='YlOrRd', scheme=None, legend=True)" |
326 | 380 | ] |
327 | 381 | }, |
328 | 382 | { |
|
546 | 600 | "fig.suptitle(f\"Missouri River sites - Daily Mean, Instantaneous, and Field Measurement Discharge\")\n", |
547 | 601 | "fig.autofmt_xdate()\n" |
548 | 602 | ] |
| 603 | + }, |
| 604 | + { |
| 605 | + "cell_type": "markdown", |
| 606 | + "id": "60a1b100", |
| 607 | + "metadata": {}, |
| 608 | + "source": [ |
| 609 | + "## Additional Resources\n", |
| 610 | + "Check out the links below for more information on the Water Data APIs and other ways to download USGS water data:\n" |
| 611 | + ] |
549 | 612 | } |
550 | 613 | ], |
551 | 614 | "metadata": { |
552 | 615 | "kernelspec": { |
553 | | - "display_name": "dr-test", |
| 616 | + "display_name": "drpy-no-geopandas", |
554 | 617 | "language": "python", |
555 | 618 | "name": "python3" |
556 | 619 | }, |
|
564 | 627 | "name": "python", |
565 | 628 | "nbconvert_exporter": "python", |
566 | 629 | "pygments_lexer": "ipython3", |
567 | | - "version": "3.14.0" |
| 630 | + "version": "3.14.2" |
568 | 631 | } |
569 | 632 | }, |
570 | 633 | "nbformat": 4, |
|