Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-2157] Copy table properties in iceberg distcp #4056

Merged
merged 5 commits into from
Sep 19, 2024

Conversation

Will-Lo
Copy link
Contributor

@Will-Lo Will-Lo commented Sep 13, 2024

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    The existing iceberg registration step has the following snippet:
    if (dstMetadata != null) {
      // use current destination metadata as 'base metadata' and source as 'updated metadata' while committing
      this.tableOps.commit(dstMetadata, srcMetadata.replaceProperties(dstMetadata.properties())); <<<<<------- BUG
    }

Since Iceberg's replaceProperties https://github.com/apache/iceberg/blob/e449d3405cfdb304c94835845bd8f34a73b4a517/core/src/main/java/org/apache/iceberg/TableMetadata.java#L583 just replaces the full set of properties, not just any existing properties, the result is that all of the properties are equivalent to the destination table metadata properties. Which effectively copies no properties.

This PR modifies this behavior so that the destination table properties can be overwritten by the source table properties if the source table updates existing properties, as well as maintaining any existing properties in the destination that is not defined on the source.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Unit tests

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Comment on lines 219 to 220
combinedMetadataProperties.putAll(dstMetadata.properties());
combinedMetadataProperties.putAll(srcMetadata.properties());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment that captures the reasoning behind overshadowing dest with src

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 41.15%. Comparing base (45ad13e) to head (aa4e35e).
Report is 7 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4056      +/-   ##
============================================
- Coverage     45.12%   41.15%   -3.97%     
+ Complexity     3199     2220     -979     
============================================
  Files           705      483     -222     
  Lines         26949    20490    -6459     
  Branches       2680     2373     -307     
============================================
- Hits          12160     8433    -3727     
+ Misses        13781    11157    -2624     
+ Partials       1008      900     -108     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Blazer-007
Copy link
Contributor

Blazer-007 commented Sep 14, 2024

@Will-Lo I believe if we want to copy some source side table properties then we should copy only selective property not all,

As lets say we change some property like clusterId or tableId or any other and that property is copied over to destination but destination requires this doesn't to be changed then in this case we will eventually be corrupting the properties.

Update - Saw the thread, if catalog is handling which properties to accept and which to not then it is good.

@Will-Lo Will-Lo merged commit 4fa6c51 into apache:master Sep 19, 2024
6 checks passed
@Will-Lo Will-Lo deleted the copy-table-properties-iceberg-distcp branch September 19, 2024 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants