feat: built-in databricks unity catalog sync#802
Conversation
not sure if unity catalog refresh by itself the table metadata. will have to investigate for now skip refresh table
- alter column stmt is meant for delta, not uc so it tries to upgrade delta logs - refresh table stmt does not refresh the columns - create or replace stmt is also meant for delta, not uc so tries to upgrade delta logs - drop table + create new table: does the work, also sounds like uc does keep track of the table history so we don't loose associated stuff (top users, related assets, creation date...)
|
|
||
| DatabricksUnityCatalogSyncClient( | ||
| ExternalCatalogConfig catalogConfig, | ||
| String tableFormat, |
There was a problem hiding this comment.
SQL injection risk: fullName is derived from user-provided tableIdentifier and interpolated directly into the SQL string. While escapeSqlString is used for location, the table name itself is not sanitized.
If hierarchicalId contains ; DROP TABLE ..., this would be injected. Consider validating that catalog/schema/table names match [a-zA-Z0-9_]+ before interpolating them into DDL statements.
| if (table == null) { | ||
| return null; | ||
| } | ||
| return table.getStorageLocation(); |
There was a problem hiding this comment.
MSCK REPAIR TABLE ... SYNC METADATA is a Hive-specific command. Does Databricks UC's SQL Warehouse support this syntax? The standard Databricks approach for refreshing external table metadata is ALTER TABLE <table> SET TBLPROPERTIES or reading from the Delta log directly.
If this doesn't actually work on Databricks, schema drift would silently be ignored. Has this been tested against a real SQL Warehouse?
There was a problem hiding this comment.
Has this been tested against a real SQL Warehouse?
yes we are running this code in production. and yes i did validate schema evolution is supported (columns changes, comments changes, properties...). The drawback of the approach is it needs a sql warehouse running (it reads the delta logs to update the UC)
ALso the UC does not support modify tables columns or anything so this is today the best approach i have in mind.
see related doc https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-repair-table:
SYNC METADATA
Delta Lake only.
If the configuration spark.databricks.delta.catalog.update.enabled is set to true, target table metadata updates are automatically synced to the catalog service. Otherwise, a manual sync using REPAIR TABLE table_name SYNC METADATA may be required.
Reads the transaction log of the target table and updates the metadata info in the catalog service. To run this command, you must have MODIFY and SELECT privileges on the target table and USE SCHEMA and USE CATALOG privileges on the parent schema and catalog.
This argument works with Hive metastore in Databricks Runtime 16.1 and above. For Hive metastore tables, you must have USAGE and MODIFY privileges.
If Delta UniForm is enabled (requires Unity Catalog), SYNC METADATA triggers manual conversion of current Delta metadata to Iceberg metadata and syncs the latest Iceberg version for the Unity Catalog Iceberg endpoint. See [Read Delta tables with Iceberg clients](https://docs.databricks.com/aws/en/delta/uniform).
You can use REPAIR TABLE table_name SYNC METADATA to apply the Unity Catalog permissions model to shallow clones that you are reading from a foreign catalog that was created using Hive metastore federation. See[ Working with shallow clones](https://docs.databricks.com/aws/en/query-federation/hms-federation-concepts#shallow).
| String catalog = hierarchical.getCatalogName(); | ||
| if (StringUtils.isBlank(catalog)) { | ||
| throw new CatalogSyncException( | ||
| "Databricks UC requires a catalog name (expected catalog.schema.table)"); |
There was a problem hiding this comment.
createOrReplaceTable drops then recreates the table. This is not atomic — if createTable fails after dropTable succeeds, the table is gone. Consider using CREATE OR REPLACE TABLE DDL instead (if supported for external tables), or at minimum catch the create failure and log a clear error that the table was dropped but not recreated.
There was a problem hiding this comment.
Well sadly on databricks createOrReplaceTable is meant to modify the table layer (creates a commit) it's not only on UC metadata side. So we can't use this. I assume xtable createOrReplaceTable is meant for catalog only table recreation, not data themself. Or I am missing something ?
Added catch failure
| } catch (Exception e) { | ||
| throw new CatalogSyncException("Failed to get schema: " + fullName, e); | ||
| } | ||
| } |
There was a problem hiding this comment.
The init method can create up to 3 separate WorkspaceClient instances (lines 153, 158, 164) if statementExecution, tablesApi, and schemasApi are all null. Create the WorkspaceClient once at the top:
if (this.workspaceClient == null) {
this.workspaceClient = new WorkspaceClient(buildConfig(databricksConfig));
}
if (this.statementExecution == null) {
this.statementExecution = workspaceClient.statementExecution();
}
...|
|
||
| private ExternalCatalogConfig catalogConfig; | ||
| private DatabricksUnityCatalogConfig databricksConfig; | ||
| private Configuration hadoopConf; |
There was a problem hiding this comment.
All fields (catalogConfig, databricksConfig, hadoopConf, tableFormat, workspaceClient, etc.) are mutable and set via init(). This pattern is fragile — calling any method before init() would NPE.
Consider making these final and removing the no-arg constructor, or at least adding a null-check guard in methods that use these fields. The ServiceLoader no-arg constructor + init() pattern should be documented with a warning.
There was a problem hiding this comment.
this class is loaded via ServiceLoader, and that requires a public no-arg constructor. Removing it would break runtime plugin loading.
added warning and guards
| props.get(CLIENT_SECRET), | ||
| props.get(TOKEN)); | ||
| } | ||
| } |
There was a problem hiding this comment.
The token field is defined and parsed but the docs say PAT auth is "intentionally not wired" and "not supported yet". If it's not supported, don't add the field — it creates confusion. Remove TOKEN constant and token field until PAT auth is actually implemented.
|
|
||
| <dependencies> | ||
| <dependency> | ||
| <groupId>org.apache.xtable</groupId> |
There was a problem hiding this comment.
Pin the Databricks SDK version to a property in the parent pom.xml instead of hardcoding 0.85.0 here. Follow the pattern used for other dependency versions in the project.
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance |
There was a problem hiding this comment.
Tests have massive boilerplate — every test method repeats the same 15-line client construction block. Extract a createClient(String tableFormat) helper method to reduce duplication and improve readability.
| .thenReturn( | ||
| new StatementResponse() | ||
| .setStatus(new StatementStatus().setState(StatementState.SUCCEEDED))); | ||
|
|
There was a problem hiding this comment.
Missing test: refreshTable with matching schema (no-op path). You test the schema-mismatch path but not the case where schemas match and no MSCK REPAIR is issued. Also missing: refreshTable with null catalogTable and refreshTable with null/empty schema.
| 'hms', | ||
| 'glue-catalog', | ||
| 'unity-catalog', | ||
| 'databricks-unity-catalog', |
There was a problem hiding this comment.
The sidebar reference was changed from unity-catalog to databricks-unity-catalog, but the docs file is still named unity-catalog.md (not renamed). This will break the sidebar link. Either rename the md file or keep the sidebar reference as unity-catalog.
…client against uninitialized use
|
Hi @vinishjail97 thanks for your detailled review, i did update accordingly |
| String host; | ||
| String warehouseId; | ||
| String authType; | ||
| String clientId; |
There was a problem hiding this comment.
Lombok @Value generates toString() that includes all fields — clientSecret will be printed in any log statement or exception message that calls .toString() on this object, which is a credential leak.
Add @ToString.Exclude on the clientSecret field:
@ToString.Exclude
String clientSecret;| .setOnWaitTimeout(ExecuteStatementRequestOnWaitTimeout.CANCEL); | ||
|
|
||
| StatementResponse response = statementExecution.executeStatement(request); | ||
| if (response.getStatus() != null && response.getStatus().getState() == StatementState.FAILED) { |
There was a problem hiding this comment.
executeStatement only checks StatementState.FAILED. When the 30s wait timeout fires, the SDK cancels the statement and returns StatementState.CANCELLED — not FAILED. This means a timed-out DDL (CREATE TABLE / DROP / MSCK REPAIR) silently returns without throwing, leaving the caller believing the operation succeeded.
Change the check to cover all non-success states:
StatementState state = response.getStatus() == null ? null : response.getStatus().getState();
if (state != null && state != StatementState.SUCCEEDED) {
String errorMsg = (response.getStatus().getError() != null)
? response.getStatus().getError().getMessage() : null;
throw new CatalogSyncException("Databricks UC statement " + state + ": " + statement
+ (StringUtils.isBlank(errorMsg) ? "" : " (" + errorMsg + ")"));
}| } | ||
|
|
||
| @Override | ||
| public boolean hasDatabase(CatalogTableIdentifier tableIdentifier) { |
There was a problem hiding this comment.
The catalog+schema validation block (parse HierarchicalTableIdentifier, blank-check catalog, blank-check schema, call validateIdentifierComponent twice) is duplicated verbatim in hasDatabase (lines 129–141) and createDatabase (lines 157–169), and partially again inside getFullName. Consider extracting a private validateAndGetHierarchical(CatalogTableIdentifier) helper that returns the parsed identifier and throws on any validation failure, then have all three methods delegate to it.
| .setWarehouseId(databricksConfig.getWarehouseId()) | ||
| .setFormat(Format.JSON_ARRAY) | ||
| .setDisposition(Disposition.INLINE) | ||
| .setWaitTimeout("30s") |
There was a problem hiding this comment.
"30s" is a magic string with no ability to tune for slower warehouses or large DDL operations. Define it as a constant:
private static final String STATEMENT_WAIT_TIMEOUT = "30s";Ideally, expose it as an optional config key in DatabricksUnityCatalogConfig so users can override it.
| return result; | ||
| } | ||
|
|
||
| private static final class ColumnSignature { |
There was a problem hiding this comment.
ColumnSignature manually implements constructor, equals, and hashCode (~28 lines). Annotate with Lombok @Value to eliminate the boilerplate — consistent with the rest of the codebase.
| } | ||
| } | ||
|
|
||
| private static String toStructType(InternalSchema schema) { |
There was a problem hiding this comment.
Nit: toStructType uses a manual StringBuilder + first flag. Prefer:
return schema.getFields().stream()
.map(f -> f.getName() + ":" + toSparkSqlType(f))
.collect(Collectors.joining(",", "struct<", ">"));| ThreePartHierarchicalTableIdentifier tableIdentifier = | ||
| new ThreePartHierarchicalTableIdentifier("main", "default", "people"); | ||
| boolean exists = client.hasDatabase(tableIdentifier); | ||
| assertEquals(true, exists); |
There was a problem hiding this comment.
Nit: assertEquals(true, exists) → assertTrue(exists).
| </properties> | ||
|
|
||
| <dependencies> | ||
| <dependency> |
There was a problem hiding this comment.
databricks.sdk.version is declared in <properties> here AND in the root pom.xml's <properties> block. The child module's declaration is redundant since the version is already managed by the parent. Remove the <properties> block from this module's pom.
Important Read
What is the purpose of the pull request
This pull request adds Databricks Unity Catalog (DATABRICKS_UC) catalog sync support to XTable, with a new xtable-databricks module, UC config wiring, UC sync client
implementation, and documentation updates.
Brief change log
Verify this pull request
This change added tests and can be verified as follows: