Commit a1a5ed4

Apply feedback from @vandesai1
1 parent d347b20 commit a1a5ed4

1 file changed

Lines changed: 74 additions & 70 deletions

File tree

tutorials/parquet-catalog-demos/euclid-hats-parquet.md

@@ -13,6 +13,10 @@ kernelspec:
 
 # Euclid Q1 Catalogs in HATS Parquet
 
+This notebook introduces the [Euclid Q1](https://irsa.ipac.caltech.edu/data/Euclid/docs/overview_q1.html) HATS Collection served by IPAC/IRSA and demonstrates access with python.
+
++++
+
 ## Learning Goals
 
 By the end of this tutorial, you will:
@@ -23,9 +27,8 @@ By the end of this tutorial, you will:
 
 +++
 
-## Introduction
+## 1. Introduction
 
-This notebook introduces the [Euclid Q1](https://irsa.ipac.caltech.edu/data/Euclid/docs/overview_q1.html) HATS Collection served by IPAC/IRSA and demonstrates access with python.
 The Collection includes a HATS Catalog (main data product), Margin Cache (10 arcsec), and Index Table (OBJECT_ID).
 The Catalog includes the twelve Euclid Q1 tables listed below, joined on the column 'OBJECT_ID' into a single Parquet dataset with 1,329 columns (one row per Euclid MER Object).
 Among them, Euclid has provided several different redshift measurements, several flux measurements for each Euclid band, and flux measurements for bands from several ground-based observatories -- in addition to morphological and other measurements.
@@ -34,12 +37,12 @@ These were produced for different science goals using different algorithms and/o
 Having all columns in the same dataset makes access convenient because the user doesn't have to make separate calls for data from different tables and/or join the results.
 However, figuring out which, e.g., flux measurements to use amongst so many can be challenging.
 In the sections below, we look at some of their distributions and reproduce figures from several papers in order to highlight some of the options and point out their differences.
-The Appendix contains important information about the schema of this Parquet dataset, especially the syntax of the column names.
+The Appendix explains how the columns in this Parquet dataset are named and organized.
 For more information about the meaning and provenance of a column, refer to the links provided with the list of tables below.
 
-### Euclid Q1 tables and docs
+### 1.1 Euclid Q1 tables and docs
 
-The Euclid Q1 HATS Catalog includes the following twelve Q1 tables[*], which are organized underneath the Euclid processing function (MER, PHZ, or SPE) that created it.
+The Euclid Q1 HATS Catalog includes the following twelve Q1 tables, which are organized underneath the Euclid processing function (MER, PHZ, or SPE) that created it.
 Links to the Euclid papers describing the processing functions are provided, as well as pointers for each table.
 Table names are linked to their original schemas.
 
@@ -65,9 +68,7 @@ See also:
 - [Frequently Asked Questions About Euclid Q1 data](https://euclid.caltech.edu/page/euclid-q1-data-faq) (hereafter, FAQ)
 - [Q1 Explanatory Supplement](https://euclid.esac.esa.int/dr/q1/expsup/)
 
-[*] Euclid typically calls these "catalogs", but this notebook uses "tables" to avoid any confusion with the HATS Catalog product.
-
-### Parquet, HEALPix, and HATS
+### 1.2 Parquet, HEALPix, and HATS
 
 Parquet, HEALPix, and HATS are described in more detail at [https://irsadev.ipac.caltech.edu:9051/cloud_access/parquet/](https://irsadev.ipac.caltech.edu:9051/cloud_access/parquet/).
 ([FIXME] Currently requires IPAC VPN. Update url when the page is published to ops.)
@@ -93,7 +94,7 @@ In brief:
 
 +++
 
-## Installs and imports
+## 2. Installs and imports
 
 +++
 
@@ -109,18 +110,21 @@ We rely on ``lsdb>=0.5.2``, ``hpgeom>=1.4``, ``numpy>=2.0``, and ``pyerfa>=2.0.1
 ```
 
 ```{code-cell}
-import os # Determine number of CPUs (for parallelization)
-import dask.distributed # Parallelize catalog queries
+# import os # Determine number of CPUs (for parallelization)
+# import dask.distributed # Parallelize catalog queries
+# import lsdb # Query the catalog
+# import matplotlib.colors # Make figures look nice
 import hpgeom
-import lsdb # Query the catalog
-import matplotlib.colors # Make figures look nice
 import matplotlib.pyplot as plt # Create figures
 import numpy as np # Math
 import pandas as pd # Manipulate query results
 import pyarrow.compute as pc # Filter dataset
 import pyarrow.dataset # Load the dataset
 import pyarrow.parquet # Load the schema
 import pyarrow.fs # Simple S3 filesystem pointer
+
+# Copy-on-write will become the default in pandas 3.0 and is generally more performant
+pd.options.mode.copy_on_write = True
 ```
 
 ```{tip}
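The `pd.options.mode.copy_on_write` line added in this hunk changes how pandas propagates writes through derived frames. A minimal, self-contained sketch of the behavior it enables, using toy data rather than the catalog:

```python
import pandas as pd

# Opt in to copy-on-write (the default from pandas 3.0 onward).
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"flux": [1.0, 2.0, 3.0]})
subset = df[["flux"]]  # behaves like a copy, but the data is shared lazily

# Writing to the derived frame triggers a copy instead of mutating the parent.
subset.loc[0, "flux"] = -99.0

print(df.loc[0, "flux"])      # 1.0 -- the parent is untouched
print(subset.loc[0, "flux"])  # -99.0
```

Besides avoiding the classic `SettingWithCopyWarning` pitfalls, copy-on-write defers actual copying until a write happens, which is why it is generally more performant for read-heavy workflows like this notebook.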
@@ -137,11 +141,11 @@ make sure you have restarted the kernel since doing `pip install`. Then re-run t
 
 +++
 
-## 1. Setup
+## 3. Setup
 
 +++
 
-## 1.1 AWS S3 paths
+### 3.1 AWS S3 paths
 
 ```{code-cell}
 s3_bucket = "irsa-fornax-testdata"
@@ -151,19 +155,19 @@ euclid_hats_collection_uri = f"s3://{s3_bucket}/{euclid_prefix}" # for lsdb
 euclid_parquet_metadata_path = f"{s3_bucket}/{euclid_prefix}/hats/dataset/_metadata" # for pyarrow
 euclid_parquet_schema_path = f"{s3_bucket}/{euclid_prefix}/hats/dataset/_common_metadata" # for pyarrow
 
-# Temporary try/except to handle credentials in different environments before public release.
-try:
-    # If running from within IPAC's network, your IP address acts as your credentials so this should work.
-    lsdb.read_hats(euclid_hats_collection_uri)
-except PermissionError:
-    # If running from Fornax, credentials are provided automatically under the hood but
-    # lsdb ignores them in the call above. Construct a UPath which will pick up the credentials.
-    from upath import UPath
+# # Temporary try/except to handle credentials in different environments before public release.
+# try:
+#     # If running from within IPAC's network, your IP address acts as your credentials so this should work.
+#     lsdb.read_hats(euclid_hats_collection_uri)
+# except PermissionError:
+#     # If running from Fornax, credentials are provided automatically under the hood but
+#     # lsdb ignores them in the call above. Construct a UPath which will pick up the credentials.
+#     from upath import UPath
 
-    euclid_hats_collection_uri = UPath(euclid_hats_collection_uri)
+# euclid_hats_collection_uri = UPath(euclid_hats_collection_uri)
 ```
 
-### 1.2 Helper functions
+### 3.2 Helper functions
 
 ```{code-cell}
 def magnitude_to_flux(magnitude: float) -> float:
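The diff shows only the signatures of the notebook's helpers. Assuming the catalog fluxes are in microjansky on the AB system (where magnitude 23.9 corresponds to 1 µJy), simplified scalar stand-ins for the conversions would look like the following; note that the notebook's actual `flux_to_magnitude` takes column names and returns a pyarrow expression, so these are illustrations of the arithmetic only:

```python
import math

# Hypothetical scalar stand-ins for the notebook's helpers, assuming
# fluxes in microjansky on the AB system (zero point 23.9).
def magnitude_to_flux(magnitude: float) -> float:
    """AB magnitude -> flux in uJy."""
    return 10 ** (-0.4 * (magnitude - 23.9))

def flux_to_magnitude(flux: float) -> float:
    """Flux in uJy -> AB magnitude."""
    return 23.9 - 2.5 * math.log10(flux)

print(magnitude_to_flux(23.9))  # 1.0 uJy at the zero point
print(round(flux_to_magnitude(magnitude_to_flux(24.5)), 9))  # round-trips to 24.5
```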
@@ -201,23 +205,29 @@ def flux_to_magnitude(flux_col_name: str, color_col_names: tuple[str, str] | Non
     return mag_expression
 ```
 
-### 1.3 PyArrow dataset
+### 3.3 Load the catalog as a PyArrow dataset
 
 ```{code-cell}
 # Load the catalog as a PyArrow dataset. This is used in many examples below.
 dataset = pyarrow.dataset.parquet_dataset(euclid_parquet_metadata_path, partitioning="hive", filesystem=pyarrow.fs.S3FileSystem())
 ```
 
-### 1.4 Frequently used columns
+### 3.4 Frequently used columns
 
 +++
 
+The following columns will be used throughout this notebook.
+Many other columns are defined in the sections below where they are used.
 Descriptors generally come from the respective paper (Romelli, Tucci, or Le Brun) unless noted.
 
 ```{code-cell}
-# MER Object ID
+# Object ID set by the MER pipeline.
 OBJECT_ID = "OBJECT_ID"
+```
+
+Flux and source detection columns.
 
+```{code-cell}
 # Whether the source was detected in the VIS mosaic (1) or only in the NIR-stack mosaic (0).
 VIS_DET = "MER_VIS_DET"
 
@@ -226,16 +236,26 @@ VIS_DET = "MER_VIS_DET"
 # Otherwise, this is a non-physical NIR-stack flux and there was no VIS detection (aka, NIR-only).
 FLUX_TOTAL = "MER_FLUX_DETECTION_TOTAL"
 FLUXERR_TOTAL = "MER_FLUXERR_DETECTION_TOTAL"
+```
+
+Point-like and spurious indicators.
+
+```{code-cell}
+# Peak surface brightness minus the magnitude used for MER_POINT_LIKE_PROB.
+# Point-like: <-2.5. Compact: <-2.6. (Tucci)
+MUMAX_MINUS_MAG = "MER_MUMAX_MINUS_MAG"
+
+# Probability from the star-galaxy classifier. Heavily biased toward high purity.
+# This is always NaN for NIR-only objects (use MER_MUMAX_MINUS_MAG instead).
+POINTLIKE_PROB = "MER_POINT_LIKE_PROB"
 
 # Whether the detection has a >50% probability of being spurious (1=Yes, 0=No).
 SPURIOUS_FLAG = "MER_SPURIOUS_FLAG"
+```
 
-# Point-like morphology indicators.
-POINTLIKE_PROB = "MER_POINT_LIKE_PROB" # Always NaN for NIR-only (use MER_MUMAX_MINUS_MAG instead)
-# Peak surface brightness minus magnitude in detection band.
-MUMAX_MINUS_MAG = "MER_MUMAX_MINUS_MAG" # <-2.5 => point-like. <-2.6 => compact. (Tucci)
+PHZ classifications. These were generated by a probabilistic random forest supervised ML algorithm.
 
-# PHZ classifications were generated by a probabilistic random forest supervised ML algorithm.
+```{code-cell}
 PHZ_CLASS = "PHZ_PHZ_CLASSIFICATION"
 PHZ_CLASS_MAP = {
     1: "Star",
@@ -251,28 +271,19 @@ PHZ_CLASS_MAP = {
 }
 ```
 
-### 1.5 Euclid Deep Fields
+### 3.5 Euclid Deep Fields
 
 +++
 
+[FIXME] The notebook does not currently use these. Should either use them or remove them.
+
 Euclid Q1 includes data from three Euclid Deep Fields: EDF-N (North), EDF-S (South), EDF-F (Fornax; also in the southern hemisphere).
 There is also a small amount of data from a fourth field: LDN1641 (Lynds' Dark Nebula 1641), which was observed for technical reasons during Euclid's verification phase and mostly ignored here.
-There are two notable differences between regions:
-
-- EDF-N is closest to the galactic plane and thus contains a larger fraction of stars.
-- Different external data was available in EDF-N (DES with g, r, i, and z bands) vs EDF-S+F (UNIONS with u, g, r, i, and z bands -- UNIONS is a collaboration between CFIS, Pan-STARRS, HSC, WHIGS, and WISHES).
-The Euclid processing pipelines used the external data to supplement Euclid data to, for example, measure colors that were then used for PHZ classifications.
-Differences between the available data is the cause of various differences in pipeline handling and results.
-
-The EDF regions are well separated, so we can distinguish them using a simple cone search without having to be too picky about the radius.
+The regions are well separated, so we can distinguish them using a simple cone search without having to be too picky about the radius.
 Rather than using the RA and Dec values directly, we'll find a set of HEALPix order 9 pixels that cover each area.
 A column ('_healpix_9') of order 9 indexes was added to the catalog for this purpose.
 These will suffice for a simple and efficient cone search.
 
-[FIXME] The notebook does not currently use these but it might be good to do so.
-Maybe in the Magnitudes section to show the differences as a function of class.
-Anyway, either use it or remove it.
-
 ```{code-cell}
 # Column name of HEALPix order 9 pixel indexes.
 HEALPIX_9 = "_healpix_9"
@@ -294,7 +305,7 @@ ra, dec, radius = 52.932, -28.088, 3 # need ~10 sq deg
 edff_k9_pixels = hpgeom.query_circle(hpgeom.order_to_nside(9), ra, dec, radius)
 ```
 
-## 2. Redshifts for cosmology
+## 4. Redshifts for cosmology
 
 +++
 
@@ -334,13 +345,13 @@ Load a quality PHZ sample. Cuts are from Tucci sec. 5.3.
 PHZ_FLAG = "PHZ_PHZ_FLAGS"
 
 # Columns we actually want to load.
-phz_columns = [OBJECT_ID, PHZ_Z, HEALPIX_9]
+phz_columns = [OBJECT_ID, PHZ_Z]
 
 # Filter for quality PHZ redshifts.
 phz_filter = (
     (pc.field(VIS_DET) == 1) # No NIR-only objects.
     & (pc.field(FLUX_TOTAL) > magnitude_to_flux(24.5)) # I < 24.5
-    & (pc.divide(pc.field(FLUX_TOTAL), pc.field(FLUXERR_TOTAL)) > 5) # I S/N > 5 # [CHECKME] Is this correct def of S/N?
+    & (pc.divide(pc.field(FLUX_TOTAL), pc.field(FLUXERR_TOTAL)) > 5) # I band S/N > 5
     & ~pc.field(PHZ_CLASS).isin([1, 3, 5, 7]) # Exclude objects classified as star.
     & (pc.field(SPURIOUS_FLAG) == 0) # MER quality
 )
@@ -361,12 +372,7 @@ PHYSPARAM_GAL_MSTAR = "PHYSPARAM_PHZ_PP_MEDIAN_STELLARMASS" # log10(Stellar Mas
 # Columns we actually want to load.
 # We'll have pyarrow construct and return the sSFR, so we must pass a dict mapping column names to expressions.
 log10_ssfr = pc.subtract(pc.field(PHYSPARAM_GAL_SFR), pc.field(PHYSPARAM_GAL_MSTAR))
-pp_columns = {
-    PHYSPARAM_GAL_Z: pc.field(PHYSPARAM_GAL_Z),
-    "log10_ssfr": log10_ssfr,
-    OBJECT_ID: pc.field(OBJECT_ID),
-    HEALPIX_9: pc.field(HEALPIX_9),
-}
+pp_columns = {PHYSPARAM_GAL_Z: pc.field(PHYSPARAM_GAL_Z), "log10_ssfr": log10_ssfr, OBJECT_ID: pc.field(OBJECT_ID)}
 
 # Partial filter for quality PHYSPARAM redshifts.
 pp_galaxy_filter = (
@@ -485,7 +491,6 @@ Here, we reproduce Tucci Fig. 17 (left panel) except that we don't consider the
 ```{code-cell}
 # Get the common objects and set axes data x (PHZ) and y (PHYSPARAM).
 phz_pp_df = phz_df.join(pp_df.loc[pp_final_filter], how="inner", lsuffix="phz", rsuffix="pp")
-# phz_pp_df = phz_df.join(pp_df.loc[pp_final_filter & pp_df[HEALPIX_9].isin(edfn_k9_pixels)], how="inner", lsuffix="phz", rsuffix="pp")
 x, y = phz_pp_df[PHZ_Z], phz_pp_df[PHYSPARAM_GAL_Z]
 one_to_one_linspace = np.linspace(-0.01, 6, 100)
 
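The join in this hunk matches the two samples on their index and disambiguates identically named columns with suffixes. A minimal sketch with toy frames indexed by a hypothetical OBJECT_ID (suffixes written with underscores here for readability):

```python
import pandas as pd

# Toy stand-ins for the PHZ and PHYSPARAM frames, indexed by object ID.
phz_df = pd.DataFrame({"z": [0.5, 1.2, 2.0]}, index=[1, 2, 3])
pp_df = pd.DataFrame({"z": [0.6, 2.1]}, index=[1, 3])

# how="inner" keeps only IDs present in both frames; the suffixes
# disambiguate the identically named redshift columns.
joined = phz_df.join(pp_df, how="inner", lsuffix="_phz", rsuffix="_pp")
print(list(joined.columns))  # ['z_phz', 'z_pp']
print(list(joined.index))    # [1, 3]
```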
@@ -535,7 +540,7 @@ plt.colorbar(cb)
 
 +++
 
-## 3. Classification thresholds - purity vs completeness
+## 5. Classification thresholds - purity vs completeness
 
 +++
 
@@ -559,12 +564,11 @@ class_columns = {"I magnitude": flux_to_magnitude(FLUX_TOTAL), MUMAX_MINUS_MAG: 
 ```{code-cell}
 # Load data.
 classes_df = dataset.to_table(columns=class_columns, filter=class_filter).to_pandas()
-# 30s
-
-# Plot point-like morphology vs brightness as a function of class.
-# Here, we reproduce the first three panels of Tucci Fig. 6, combining top and bottom.
 ```
 
+Plot point-like morphology vs brightness as a function of class.
+Here, we reproduce the first three panels of Tucci Fig. 6, combining top and bottom.
+
 ```{code-cell}
 fig, axes = plt.subplots(1, 3, figsize=(20, 6))
 for ax, (class_name, class_df) in zip(axes, classes_df.groupby(PHZ_CLASS)):
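The plotting loop pairs one axis with each `groupby` group, iterating `(class_name, class_df)` tuples. A tiny sketch of that iteration pattern with toy data:

```python
import pandas as pd

# Toy stand-in for classes_df: a class code column plus a measurement.
df = pd.DataFrame({"PHZ_CLASSIFICATION": [1, 2, 2, 4], "mag": [20.0, 21.0, 22.0, 23.0]})

# groupby yields one (name, sub-frame) pair per distinct class code,
# which is exactly what the loop zips against the subplot axes.
sizes = {name: len(group) for name, group in df.groupby("PHZ_CLASSIFICATION")}
print(sizes)  # {1: 1, 2: 2, 4: 1}
```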
@@ -579,7 +583,8 @@ for ax, (class_name, class_df) in zip(axes, classes_df.groupby(PHZ_CLASS)):
     ax.set_ylim(15, 27)
 ```
 
-Objects to the left of the vertical line are point-like.
+MER_MUMAX_MINUS_MAG is the peak surface brightness above the background minus the magnitude that was used to compute MER_POINT_LIKE_PROB.
+Objects to the left of the vertical line (<-2.5) are point-like.
 Stars are highly concentrated there, especially those that are not faint (I < 24.5), which we should expect given Euclid's requirement for a pure sample.
 Also as we should expect, most galaxies appear to the right of this line.
 However, notice the strip of bright (e.g., I < 23) "galaxies" that are point-like.
@@ -591,7 +596,7 @@ Many QSOs are likely to be missing from the expected region due to the overlap o
 
 +++
 
-## 4. Magnitudes
+## 6. Magnitudes
 
 +++
 
@@ -635,7 +640,6 @@ Load data.
 
 ```{code-cell}
 mags_df = dataset.to_table(columns=mag_columns, filter=mag_filter).to_pandas()
-# 30s
 ```
 
 Given Euclid's core science goals, we'll take the template fluxes as our baseline in this section.
@@ -661,7 +665,7 @@ for (class_name, class_ids), class_color in zip(classes.items(), class_colors):
     for ax, band in zip(axs, bands):
         ax.hist(class_df[band], label=class_name, color=class_color, **hist_kwargs)
 
-    # Get the objects that are in this class and possibly others.
+    # Get the objects that were accepted as multiple classes.
     class_df = mags_df.loc[mags_df[PHZ_CLASS].isin(class_ids)]
     label = "+Galaxy" if class_name != "Galaxy" else "+any"
     # Of those objects, restrict to the ones that are point-like.
@@ -751,7 +755,7 @@ The offset is more pronounced for point-like objects, likely due to the PSF hand
 
 +++
 
-## 5. Galaxy morphology
+## 7. Galaxy morphology
 
 +++
 
@@ -771,7 +775,7 @@ morph_filter = (
     & (pc.field(VIS_DET) == 1)
     & (pc.field(FLUX_TOTAL) > magnitude_to_flux(23)) # I<23 recommended for reliable Sérsic fits.
     & (pc.field(SPURIOUS_FLAG) == 0)
-    & (pc.field("MER_POINT_LIKE_PROB") <= 0.1)
+    & (pc.field(POINTLIKE_PROB) <= 0.1)
     # Sec. 4. Remove an artificial peak at the limit of the param space. Recommended for any Sérsic-based analysis.
     & (pc.field("MORPH_SERSIC_SERSIC_VIS_INDEX") <= 5.45)
     # Secs. 4 & 5 make additional quality cuts that we skip for simplicity.
@@ -841,7 +845,7 @@ The right panel also largely agrees with expectations.
 
 +++
 
-## 6. NIR-only detections: high-redshift galaxy or nearby brown dwarf?
+## 8. NIR-only detections: high-redshift galaxy or nearby brown dwarf?
 
 +++
 
@@ -971,7 +975,7 @@ targets_columns = [
 
 # Load data.
 targets_filter = pc.field(OBJECT_ID).isin(targets.keys())
-targets_df = dataset.to_table(columns=targets_columns, filter=targets_filter).to_pandas() # 1m 7s
+targets_df = dataset.to_table(columns=targets_columns, filter=targets_filter).to_pandas()
 ```
 
 ```{code-cell}
@@ -1121,6 +1125,6 @@ schema.names[-5:]
 
 **Authors:** Troy Raen (Developer; Caltech/IPAC-IRSA) and the IRSA Data Science Team.
 
-**Updated:** 2025-06-11
+**Updated:** 2025-06-16
 
 **Contact:** [IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or problems.
