Skip to content

Update differ to ouptut MCF files#1998

Open
vish-cs wants to merge 1 commit into
datacommonsorg:masterfrom
vish-cs:differ
Open

Update differ to ouptut MCF files#1998
vish-cs wants to merge 1 commit into
datacommonsorg:masterfrom
vish-cs:differ

Conversation

@vish-cs
Copy link
Copy Markdown
Contributor

@vish-cs vish-cs commented May 11, 2026

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the import_differ tool to generate diffs in MCF format and a consolidated JSON summary, replacing the previous CSV-based outputs. It introduces a direct runner mode for executing a Java-based differ via subprocess and updates the validation logic to consume these new formats. Feedback from the review highlights a mismatch in the glob pattern used to locate MCF diff files, a regression in defensive error handling during JSON parsing, and confusing logic regarding the runner_mode flag mapping where the local mode triggers the Java runner instead of the native Python implementation.

Comment thread tools/import_validation/runner.py Outdated
Comment thread tools/import_validation/runner.py Outdated
Comment thread tools/import_differ/import_differ.py Outdated
@vish-cs vish-cs requested a review from ajaits May 11, 2026 11:39
@vish-cs vish-cs force-pushed the differ branch 2 times, most recently from d6cc92a to 866f95e Compare May 14, 2026 05:15
path = os.path.join(tmp_dir, file)
else:
path = os.path.join(dest, file)
with open(path, mode='w', encoding='utf-8') as out_file:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use util/file_util.py:FileIO() to support GCS or local files transparently without needing extra GCS file handling code here:

with FileIO(path, mode='w', encoding='utf-8') as out_file:
...

flags.DEFINE_string('file_format', 'mcf',
'Format of the input data (mcf,tfrecord)')
flags.DEFINE_string('runner_mode', 'local', 'Runner mode (local/cloud)')
flags.DEFINE_string('runner_mode', 'native',
Copy link
Copy Markdown
Contributor

@ajaits ajaits May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a brief description of the different modes in the file? The difference between native and direct is not clear by the term.

for diff_type in [
Diff.ADDED.name, Diff.DELETED.name, Diff.MODIFIED.name
]:
df_type = diff_df[diff_df[Column.diff_type.name] == diff_type]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the obs nodes are being filtered separately by type, can we write them into separate mcf files too?
That may help with other analysis and also skip the diffType property in the node.

def get_val(base_name):
col_name = base_name + suffix
if col_name in row:
return str(row[col_name])
Copy link
Copy Markdown
Contributor

@ajaits ajaits May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of str(), can we have a utility to convert it to a string?
for example, for lists, instead of [... ], MCFs use a,b,c.
Also if the value has spaces, it needs to be enclosed in double quotes.

def val_str(value) -> str:
    if isinstance(value, list):
        return ",".join([val_str(v) for v in value])
    if value and isinstance(value, str) and " " in value and value[0].isalpha():
        return '"' + value + '"'
    return str(value)

if 'StatVarObservation' in node.get(Column.typeOf.name):
values_to_combine = []
keys_to_combine = []
groupby_keys = [
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a constant since it is used in multiple places?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants