HDFS Asset URI: allow empty netloc for Hadoop fs.defaultFS #68022

Open
stegololz wants to merge 1 commit into apache:main from stegololz:fix/hdfs-asset-uri-allow-default-fs

@stegololz
Contributor

Summary

Relax airflow.providers.apache.hdfs.assets.hdfs.sanitize_uri to accept the canonical hdfs:///path form (empty netloc). Previously, this form was rejected with ValueError: URI format hdfs:// must contain a namenode host.

Why

  • RFC 3986: the authority component of a URI is optional. hdfs:///path is well-formed.
  • Hadoop semantics: an empty authority means "resolve via fs.defaultFS from core-site.xml". This is the standard idiom for portable Spark/Hive/MapReduce jobs that must not hard-code a namenode — same shape as file:///etc/hosts.
  • The strict check was introduced in #66426 ("feat: Add uri sanitizers and asset factories for new schemes"), alongside other new-scheme sanitizers. It is more restrictive than the Hadoop convention and causes any DAG that uses Asset("hdfs:///apps/x/file.parquet") to fail at parse time.
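
To make the URI shapes above concrete, here is a small sketch using only Python's standard `urllib.parse.urlsplit`. It is not taken from the provider code; the host `namenode:8020` and the paths are just the illustrative values used in this description. It shows that the default-FS form parses cleanly, with an empty netloc and a non-empty path, exactly as `file:///etc/hosts` does.

```python
from urllib.parse import urlsplit

# Empty authority: Hadoop would resolve the namenode from fs.defaultFS.
default_fs = urlsplit("hdfs:///apps/x/file.parquet")
# Explicit authority: the namenode is hard-coded in the URI.
explicit = urlsplit("hdfs://namenode:8020/apps/x/file.parquet")
# The familiar local-file form has the same shape as the default-FS form.
local_file = urlsplit("file:///etc/hosts")

for parts in (default_fs, explicit, local_file):
    print(f"scheme={parts.scheme!r} netloc={parts.netloc!r} path={parts.path!r}")
```

Only the explicit form carries a netloc; the other two differ from it in nothing but the empty authority.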

Change

  • providers/apache/hdfs/.../assets/hdfs.py: drop the "must contain a namenode host" check; keep the path-required check.
  • providers/apache/hdfs/.../tests/.../test_hdfs.py:
    • Add positive cases for hdfs:///apps/myapp/... (empty netloc), which must pass.
    • Add a negative case, hdfs://namenode:8020 (no path), which must fail.
    • Add test_convert_asset_to_openlineage_default_fs covering OpenLineage emission with empty netloc.
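
For readers who want the intended behavior at a glance, the following is a minimal sketch of the relaxed check, not the actual contents of hdfs.py. The function name `sanitize_hdfs_uri_sketch`, the helper `is_accepted`, the error message, and the decision to treat a bare `/` as "no path" are all assumptions made for illustration.

```python
from urllib.parse import SplitResult, urlsplit


def sanitize_hdfs_uri_sketch(uri: SplitResult) -> SplitResult:
    """Hypothetical stand-in for the relaxed sanitizer, not the provider's code."""
    # No namenode-host check any more: an empty netloc is accepted.
    # Assumption: a bare "/" is treated the same as a missing path.
    if not uri.path.strip("/"):
        raise ValueError("hdfs:// URI must contain a path")  # illustrative message
    return uri


def is_accepted(raw: str) -> bool:
    """Return True when the sketch accepts the URI, False when it raises."""
    try:
        sanitize_hdfs_uri_sketch(urlsplit(raw))
    except ValueError:
        return False
    return True


for raw in (
    "hdfs:///apps/x/file.parquet",               # empty netloc, has path
    "hdfs://namenode:8020/apps/x/file.parquet",  # explicit namenode, has path
    "hdfs://namenode:8020",                      # namenode but no path
):
    print(raw, "accepted" if is_accepted(raw) else "rejected")
```

Under these assumptions, both forms with a path are accepted and only the path-less URI is rejected, which matches the positive and negative cases listed above.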

convert_asset_to_openlineage already tolerates an empty netloc (f"hdfs://{parsed.netloc}" yields hdfs:// namespace), so no functional change there.
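
The namespace behavior can be checked in isolation. The helper below is hypothetical (`hdfs_namespace_sketch` does not exist in the provider); it only reproduces the f-string quoted above so the empty-netloc result is easy to see.

```python
from urllib.parse import urlsplit


def hdfs_namespace_sketch(uri: str) -> str:
    """Hypothetical helper mirroring the f-string quoted above."""
    parsed = urlsplit(uri)
    # With an empty netloc, the interpolation adds nothing after "hdfs://".
    return f"hdfs://{parsed.netloc}"


print(hdfs_namespace_sketch("hdfs:///apps/x/file.parquet"))
print(hdfs_namespace_sketch("hdfs://namenode:8020/apps/x/file.parquet"))
```

The first call produces the bare `hdfs://` namespace and the second keeps the explicit `namenode:8020` authority.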

Related

Gen-AI disclosure

This PR was prepared with Gen-AI assistance (Claude). I reviewed all generated code.

The hdfs asset URI sanitizer rejected hdfs:///path as missing a
namenode host. Per RFC 3986 the authority component is optional;
per Hadoop semantics an empty authority means 'resolve via
fs.defaultFS from core-site.xml' — i.e. hdfs:///apps/x is the
canonical form for jobs that must not hard-code a namenode.

Relax sanitize_uri to require only a non-empty path, and add
positive + negative parametrized tests covering the default-fs
form and the corresponding OpenLineage conversion.