Skip to content

ScienceFilePath release version format#324

Merged
lacoak21 merged 12 commits into
IMAP-Science-Operations-Center:mainfrom
lacoak21:update_sciencefilepath_to_supportvrrr.mmm
Jun 18, 2026
Merged

ScienceFilePath release version format#324
lacoak21 merged 12 commits into
IMAP-Science-Operations-Center:mainfrom
lacoak21:update_sciencefilepath_to_supportvrrr.mmm

Conversation

@lacoak21

@lacoak21 lacoak21 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Change Summary

closes #320

Overview

Update the ScienceFilePath version to contain release number information. Add backwards compatibility for old version format while we transition.

File changes

  • imap_data_access/file_validation.py
    • Update version

Testing

Fix tests and add new ones for new version format.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request updates the ScienceFilePath filename/version handling to support the new vRRR.MMM schema (release + data version) while maintaining backward compatibility with the deprecated vXXX format, and adapts query “latest” resolution and tests accordingly.

Changes:

  • Extend science filename parsing/validation to accept vRRR.MMM and extract release_number / data_version components.
  • Update query parameter validation and “latest version” selection logic to account for the new version format.
  • Adjust and add tests to cover the new version format and backward compatibility behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
imap_data_access/file_validation.py Adds parsing/validation support for vRRR.MMM and exposes release_number / data_version.
imap_data_access/io.py Validates version differently for science queries and updates “latest” selection to use a version sort key.
tests/test_io.py Updates version-related expectations and adds coverage for version=latest with new format.
tests/test_file_validation.py Updates filename parsing tests for vRRR.MMM and adds deprecated-format coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread imap_data_access/file_validation.py Outdated
Comment thread imap_data_access/file_validation.py Outdated
Comment thread imap_data_access/file_validation.py Outdated
Comment thread imap_data_access/io.py Outdated
Comment thread imap_data_access/file_validation.py
Comment thread imap_data_access/io.py Outdated

@tech3371 tech3371 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had couple comments but overall looks good to me!

@lacoak21 lacoak21 force-pushed the update_sciencefilepath_to_supportvrrr.mmm branch from 52c0b22 to 2561259 Compare June 10, 2026 13:37
@lacoak21 lacoak21 requested review from Copilot and tech3371 June 10, 2026 13:39

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread tests/test_io.py Outdated
Comment thread imap_data_access/io.py Outdated
Comment thread imap_data_access/webpoda.py Outdated

@tech3371 tech3371 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me. Thank you for looking at all places and finding holes to fill! I forgot about webpoda and nice find!

Comment thread imap_data_access/io.py Outdated
Comment thread imap_data_access/io.py Outdated
@lacoak21 lacoak21 force-pushed the update_sciencefilepath_to_supportvrrr.mmm branch from 205c67f to 8ef22a0 Compare June 15, 2026 18:17

@tech3371 tech3371 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made minor suggestions for new major and minor outcomes. This PR still applies nicely.

Comment thread imap_data_access/file_validation.py Outdated
Comment thread imap_data_access/file_validation.py Outdated
Comment thread imap_data_access/webpoda.py Outdated
@vineetbansal

Copy link
Copy Markdown
Collaborator

@lacoak21 - I'm enumerating all the places in imap_processing where a specific format of the version string is assumed, so that going forward these call sites can instead call some convenient functions/methods in imap-data-access. This is just to inform us on what additional functionality might be useful to expose as part of this PR.

  1. cli.py, lines 570-572 in post_processing:
        r = re.compile(r"v\d{3}")
        if not isinstance(self.version, str) or r.match(self.version) is None:
            self.version = f"v{int(self.version):03d}"  # vXXX
  1. cli.py line 581 in post_processing:
        ds.attrs["Data_version"] = self.version[1:]  # Strip 'v' from version
  1. cdf/utils.py, lines 121-132, in write_cdf:
    version = dataset.attrs.get("Data_version", None)
    if version is None:
        warnings.warn(
            "No Data_version attribute found in dataset. Using default v999.",
            stacklevel=2,
        )
        version = "999"
        dataset.attrs["Data_version"] = version
    elif not re.match(r"\d{3}", version):
        raise ValueError(
            f"The Data_version attribute {version} does not match expected format XXX."
        )
  1. cdf/utils.py, lines 142, in write_cdf:
    science_file = imap_data_access.ScienceFilePath.generate_from_inputs(
        instrument=instrument,
        data_level=data_level,
        descriptor=descriptor,
        start_time=start_date,
        version=f"v{version}",  # Ensure version is prefixed with 'v'
        repointing=repointing_int,
    )
  1. cdf/utils.py, lines 221, in parse_filename_like:
        r"(?:_(?P<version>v\d{3}))?"  # Optional version
  1. ancillary/ancillary_dataset_combiner.py, line 234:
        sorted_data_list = sorted(
            self.timestamped_data, key=lambda x: int(x.version[-3:])
        )
  1. ancillary/ancillary_dataset_combiner.py, line 281:
                        output_dataset["input_file_version"].data[index] = int(
                            data_input.version[-3:]
                        )

Given these (assuming I haven't missed any), I think it would be nice for this PR to provide functionality for:

  1. parsing - some way of taking an int/prefixed_string/unprefixed_string, and return a canonical version string (with leading zeros).
  2. removing prefixes - something that can be used in CDFs, so functionality that removes the v prefix given a version.
  3. get separate major and minor versions as ints.
  4. ordering - so that callers are not doing their own ordering. (this may not be needed since only ancillary files are doing this ordering, which I understand will not be renamed).
  5. validation - is_valid() or something similar.
  6. A named regex group that callers can use (like in 5 above).
  7. Some concept of "unknown" or "missing" version - so imap_processing is not doing something like version = "999" above.

Perhaps introducing a Version class that provides all of these things might be a clean way to do it, which this PR can then also use internally.

Of course this functionality is not a must-have, but then we'll be duplicating a lot of this versioning logic in imap_processing.

@lacoak21

lacoak21 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

I made minor suggestions for new major and minor outcomes. This PR still applies nicely.

Thanks! Yep I outlined what needed to be changed during the meeting.

@lacoak21

lacoak21 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@lacoak21 - I'm enumerating all the places in imap_processing where a specific format of the version string is assumed, so that going forward these call sites can instead call some convenient functions/methods in imap-data-access. This is just to inform us on what additional functionality might be useful to expose as part of this PR.

  1. cli.py, lines 570-572 in post_processing:
        r = re.compile(r"v\d{3}")
        if not isinstance(self.version, str) or r.match(self.version) is None:
            self.version = f"v{int(self.version):03d}"  # vXXX
  1. cli.py line 581 in post_processing:
        ds.attrs["Data_version"] = self.version[1:]  # Strip 'v' from version
  1. cdf/utils.py, lines 121-132, in write_cdf:
    version = dataset.attrs.get("Data_version", None)
    if version is None:
        warnings.warn(
            "No Data_version attribute found in dataset. Using default v999.",
            stacklevel=2,
        )
        version = "999"
        dataset.attrs["Data_version"] = version
    elif not re.match(r"\d{3}", version):
        raise ValueError(
            f"The Data_version attribute {version} does not match expected format XXX."
        )
  1. cdf/utils.py, lines 142, in write_cdf:
    science_file = imap_data_access.ScienceFilePath.generate_from_inputs(
        instrument=instrument,
        data_level=data_level,
        descriptor=descriptor,
        start_time=start_date,
        version=f"v{version}",  # Ensure version is prefixed with 'v'
        repointing=repointing_int,
    )
  1. cdf/utils.py, lines 221, in parse_filename_like:
        r"(?:_(?P<version>v\d{3}))?"  # Optional version
  1. ancillary/ancillary_dataset_combiner.py, line 234:
        sorted_data_list = sorted(
            self.timestamped_data, key=lambda x: int(x.version[-3:])
        )
  1. ancillary/ancillary_dataset_combiner.py, line 281:
                        output_dataset["input_file_version"].data[index] = int(
                            data_input.version[-3:]
                        )

Given these (assuming I haven't missed any), I think it would be nice for this PR to provide functionality for:

  1. parsing - some way of taking an int/prefixed_string/unprefixed_string, and return a canonical version string (with leading zeros).
  2. removing prefixes - something that can be used in CDFs, so functionality that removes the v prefix given a version.
  3. get separate major and minor versions as ints.
  4. ordering - so that callers are not doing their own ordering. (this may not be needed since only ancillary files are doing this ordering, which I understand will not be renamed).
  5. validation - is_valid() or something similar.
  6. A named regex group that callers can use (like in 5 above).
  7. Some concept of "unknown" or "missing" version - so imap_processing is not doing something like version = "999" above.

Perhaps introducing a Version class that provides all of these things might be a clean way to do it, which this PR can then also use internally.

Of course this functionality is not a must-have, but then we'll be duplicating a lot of this versioning logic in imap_processing.

Thanks Vineet for those thoughts! That is a great idea. I will go through these and see where they fit within this PR.

@lacoak21

lacoak21 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@vineetbansal looks like a lot of those could use ScienceFilePath.is_valid_version() which is already implemented in this PR. It uses the VALID_VERSION_PATTERN so seems like 5 and 6 are taken care of unless i am mistaken. I will try to capture the rest in more functions. What gets tricky is that this new version format is only used for science files and NOT ancillary or spice files so we have to be aware of that in imap_processing.

@vineetbansal

vineetbansal commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

@lacoak21 - yes - is_valid_version(version: str, raise: bool = True) -> bool should capture a lot of these. Other than that:

  • to_version(major: str|int|None, minor: str|int, raise: bool = True) -> str | None
  • from_version(version: str, raise: bool = True) -> tuple[str|None, str] | None
  • version_regex() -> str

should cover most use cases.

raise argument suggested here so callers are not creating their own exception messages, if False, then functions can return None.

The whole concept of major/minor can be hidden from callers by using a new Version class and only letting its constructor be aware of these, and let the functions work exclusively with this class, but that might complicate this PR too much at this point, so probably an overkill.

EDIT: The functions should probably just return well-formatted-and-padded strs, while what they take in (other than the is_valid_version case) can be a bit more relaxed, like ints or strings without zero-padding.

@lacoak21

Copy link
Copy Markdown
Contributor Author

The whole concept of major/minor can be hidden from callers by using a new Version class and only letting its constructor be aware of these, and let the functions work exclusively with this class, but that might complicate this PR too much at this point, so probably an overkill.

OK this sounds good! I dont think its so overkill here.

Comment thread imap_data_access/file_validation.py Outdated
@lacoak21 lacoak21 force-pushed the update_sciencefilepath_to_supportvrrr.mmm branch from ba8e747 to 7d00569 Compare June 17, 2026 18:29
)


class Version:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vineetbansal can you take a look at this class? Is there anything else its missing?

@vineetbansal

Copy link
Copy Markdown
Collaborator

Thanks @lacoak21. Let me do a deep-dive into this PR and get back to you by tomorrow morning.

@vineetbansal

vineetbansal commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

This is cool! I'm wondering what you think about the following changes, which are incremental over what you've already done here:

Right now Version is stateless. If it stored major/minor versions as instance variables then all callers could use it super cleanly, and one could do:

legacy_version = Version(None, 42)
assert legacy_version.major is None
assert legacy_version.minor == 42

version2 = Version(2, 1)
assert str(version1) == "v002.0001"

version3 = Version.from_version("v003.0002")

assert legacy_version < version2 < version3 

assert Version.from_version("xyz", raise_error=False) is None

So .to_version() can just be __str__, and from_version can remain a staticmethod that returns a Version object. A __lt__ and __eq__ along with functools.total_ordering decorator will allow comparison/sorting of versions.

Everything else like is_valid_version, is_valid_version_minor_only, _validate_range, version_regex can remain static.

Sorry to keep suggesting more changes, but I think this will make it super useful for callers. If you're understandably busy with other stuff but generally agree that this is useful, we can merge this (its usable already) and I can send another PR tomorrow on top of this (since then my work with imap-processing will be near nil anyway :))

@lacoak21

Copy link
Copy Markdown
Contributor Author

Sorry to keep suggesting more changes, but I think this will make it super useful for callers. If you're understandably busy with other stuff but generally agree that this is useful, we can merge this (its usable already) and I can send another PR tomorrow on top of this (since then my work with imap-processing will be near nil anyway :))

Ah I see. I was wondering about this! I had assumed from your earlier message you wanted it stateless but I can make these updates!

@vineetbansal

vineetbansal commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

This is great and looks very clean. Thanks!

I think the last thing to think about is that Jon mentioned in a meeting (and he has also outlined this in his detailed google doc), that existing files should be interpreted as having version 1, or "v001." (He mentions "SPDF requires start at 1").

I don't know what the plan was to regenerate version 1 files - i.e. does the version reported by sds-data-manager unconditionally return a major version 1 for now, or if they plan to use this API somehow. If its the former (which I think is the plan?), then everything here looks good. If the latter, then a helper method to go from legacy to new format might be needed here that they can use for the migration.

@vineetbansal

vineetbansal commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

@lacoak21 - I noticed that ReleaseFilePath inherits from AncillaryFilePath whose extract_filename_components uses rf"_(?P<version>{Version.minor_only_version_pattern})" for the version component. Does something need to change for ReleaseFilePath ? I understand that AncillaryFilePaths behavior is correct since they will never get major versions, but I'm not sure if/where ReleaseFilePath objects are generated.

Also, version_regex and is_valid_version are referring to valid_imap_version_pattern instead of VALID_VERSION_PATTERN. If the intention was to have a single line to change in the future, maybe they should point to VALID_VERSION_PATTERN?

@lacoak21

Copy link
Copy Markdown
Contributor Author

I don't know what the plan was to regenerate legacy files - i.e. does the version reported by sds-data-manager unconditionally return a major version 1 for now, or if they plan to use this API somehow. If its the former (which I think is the plan?), then everything here looks good. If the latter, then a helper method to go from legacy to new format might be needed here that they can use for the migration.

I think we are good here? @tech3371 please let me know if that is not true in this case. Do we have a definite plan for this?

@lacoak21

Copy link
Copy Markdown
Contributor Author

I think we are good here? @tech3371 please let me know if that is not true in this case. Do we have a definite plan for this?

I talked to tenzin and nothing needs to change here.

@lacoak21

Copy link
Copy Markdown
Contributor Author

@lacoak21 - I noticed that ReleaseFilePath inherits from AncillaryFilePath whose extract_filename_components uses rf"_(?P<version>{Version.minor_only_version_pattern})" for the version component. Does something need to change for ReleaseFilePath ? I understand that AncillaryFilePaths behavior is correct since they will never get major versions, but I'm not sure if/where ReleaseFilePath objects are generated.

Also, version_regex and is_valid_version are referring to valid_imap_version_pattern instead of VALID_VERSION_PATTERN. If the intention was to have a single line to change in the future, maybe they should point to VALID_VERSION_PATTERN?

  1. ReleaseFilePath should never have a major version and should inherit Version.minor_only_version_pattern
  2. VALID_VERSION_PATTERN is part of the ScienceFilePath class and is more specific than the Version class. I think the version class should have the more general version pattern (including vXXX and vMMM.mmmm) and then the scienceFilePath can make it more specific. Did I understand that question correctly?

@lacoak21 lacoak21 merged commit 40e6e6a into IMAP-Science-Operations-Center:main Jun 18, 2026
16 checks passed
@lacoak21 lacoak21 deleted the update_sciencefilepath_to_supportvrrr.mmm branch June 18, 2026 15:50
@github-project-automation github-project-automation Bot moved this to Done in IMAP Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

ENH: update science filename convention to support vRRR.MMM

5 participants