Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Hudi MOR snapshot #14786

Closed
wants to merge 7 commits into from
Closed

Support Hudi MOR snapshot #14786

wants to merge 7 commits into from

Conversation

willzgw
Copy link
Contributor

@willzgw willzgw commented Oct 27, 2022

Description

Non-technical explanation

  • Support Hudi MOR Tables
  • Support MOR snapshot querying for Hudi tables

Release notes

( ) Support Hudi MOR Tables
( ) Support MOR snapshot querying for Hudi tables

# Section
* Support Hudi MOR Tables. ({issue}`5132`)
* Support MOR snapshot querying for Hudi tables. ({issue}`14432`)
* HUDI connector error with loading compaction file. ({issue}`20423`)

@cla-bot cla-bot bot added the cla-signed label Oct 27, 2022
@willzgw willzgw added the hudi Hudi connector label Oct 27, 2022
@willzgw willzgw requested review from codope, findepi and 7c00 October 27, 2022 07:02
Copy link
Contributor

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willzgw Thanks for adding the MOR snapshot query support. This would be very helpful for the community. Please edit the PR description accordingly and point to issue#14432.
Also, if you could add integ tests for MOR snapshot query that would be great. You can extend the existing setup. Note that you will have to add an upsert operation (folowwing after insert) in TpchHudiTablesInitializer#load.

@willzgw
Copy link
Contributor Author

willzgw commented Oct 28, 2022

I am trying to add some unit test cases for this PR.

Copy link
Contributor

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willzgw This is looking more and more in shape. It would be great if you could add a few tests as well.

plugin/trino-hudi/pom.xml Outdated Show resolved Hide resolved
Path path = new Path(baseFile.getPath());

HdfsContext context = new HdfsContext(session);
Configuration configuration = hdfsEnvironment.getConfiguration(context, path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HudiSplit already carry configuration. See if we can pass from the split in page source provider.

jobConf.set(READ_COLUMN_NAMES_CONF_STR, join(dataColumns, HiveColumnHandle::getBaseColumnName));
schema.stringPropertyNames()
.forEach(name -> jobConf.set(name, schema.getProperty(name)));
refineCompressionCodecs(jobConf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding, what's the intention of setting compression codecs here?

@ebyhr
Copy link
Member

ebyhr commented Dec 7, 2022

Could you rebase on upstream to resolve conflicts and add product tests?

@codope
Copy link
Contributor

codope commented Mar 4, 2023

@willzgw I have a PR for async split processing #16034 that improves the Hudi connector performance. On top of it, I have MoR support in my fork - https://github.com/codope/trino/tree/mor-snapshot
Can you please review the code changes in both? Eventually, if we go with this PR, we would want it to be rebased with #16034

@willzgw
Copy link
Contributor Author

willzgw commented Mar 4, 2023

@willzgw I have a PR for async split processing #16034 that improves the Hudi connector performance. On top of it, I have MoR support in my fork - https://github.com/codope/trino/tree/mor-snapshot Can you please review the code changes in both? Eventually, if we go with this PR, we would want it to be rebased with #16034

Hi @codope Let's continue with this PR, I will rebase this branch with #16034.
And, I recently did something similar to your PR on shopee's Trino to optimize the performance of hudi-connector, yours is much better, I just modified HudiReadOptimizedDirectoryLister and the batch-size of the load partition is a fixed value and not configurable.

@shan-shanmugam
Copy link

Hi @codope / @electrum, When can we anticipate the merge of this request?

@codope
Copy link
Contributor

codope commented Aug 18, 2023

@shan-shanmugam This PR is currently held back due to an ongoing effort to decouple Trino from Hadoop APIs - #15921
In Hudi 0.14.0, we have added a MOR snapshot reader which will help in doing away with some Hadoop/Hive imports in Trino-Hudi code. This PR needs to be reworked with the new reader.

@shan-shanmugam
Copy link

@codope, Thanks for the update. Any expected timeline for this?

@codope
Copy link
Contributor

codope commented Sep 4, 2023

@shan-shanmugam Hudi 0.14.0 RC vote is ongoing and the community is working on RC2. It should be out in a couple of weeks and then the remaining work on this PR can be picked up.

@electrum
Copy link
Member

electrum commented Oct 9, 2023

@codope It looks like even the latest version of HoodieMergeOnReadSnapshotReader still depends on Hadoop FileSystem. Do you have plans to address that, or will need a custom version in Trino?

@codope
Copy link
Contributor

codope commented Oct 10, 2023

@codope It looks like even the latest version of HoodieMergeOnReadSnapshotReader still depends on Hadoop FileSystem. Do you have plans to address that, or will need a custom version in Trino?

@electrum The current version of HoodieMergeOnReadSnapshotReader only depends on Configuration of Hadoop. Client code need not create FileSystem. There shouldn't be any FileSystem imports in trino-hudi code. The next step is to create HudiFileSystem and HudiPath abstraction to remove Hadoop from hudi - https://issues.apache.org/jira/browse/HUDI-6497. This is planned for 1.0.0.

@zhuzhengjun01
Copy link

This pr is very helpful to me, but when we use 433 version and I tried to cherry-pick this pr found that there are many conflicts after this pr 【#18241 】,especially the RecordCursor is not used after this pr.

@mosabua
Copy link
Member

mosabua commented Jan 11, 2024

👋 @willzgw - this PR has become inactive. We hope you are still interested in working on it. Please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

Also cc @codope @brandyml

@willzgw
Copy link
Contributor Author

willzgw commented Jan 19, 2024

👋 @willzgw - this PR has become inactive. We hope you are still interested in working on it. Please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

Also cc @codope @brandyml

Hi @mosabua
I am still working on this. I found some bugs of the HudiTableFileSystemView and this PR.
I will raise an issue to follow these problems and fix them in this PR.

@mosabua
Copy link
Member

mosabua commented Jan 19, 2024

Great .. feel free to reach out to us if you need help. @codope can help on the Hudi side..

Copy link

cla-bot bot commented Jan 19, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Guowei Zhang.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot cla-bot bot removed the cla-signed label Jan 19, 2024
@mosabua
Copy link
Member

mosabua commented Jan 19, 2024

Looks like github still thinks you have a lot of conflicts to resolve @willzgw

@willzgw
Copy link
Contributor Author

willzgw commented Jan 20, 2024

Looks like github still thinks you have a lot of conflicts to resolve @willzgw

Yeah, still in progress.

@pravin1406
Copy link

@willzgw @codope Any timeline for this support ?

@mosabua
Copy link
Member

mosabua commented May 27, 2024

I am not sure if it is worth refactoring this PR. Currently @yihua and @codope are working towards a new Hudi release that removes the Hadoop dependencies. Once that is out they will work on Hudi connector updates that will probably change the codebase quite a bit. So maybe its worth waiting to see if MOR snapshot support will come out of these PRs or just be easier to implement from scratch then.

@willzgw
Copy link
Contributor Author

willzgw commented May 30, 2024

I am not sure if it is worth refactoring this PR. Currently @yihua and @codope are working towards a new Hudi release that removes the Hadoop dependencies. Once that is out they will work on Hudi connector updates that will probably change the codebase quite a bit. So maybe its worth waiting to see if MOR snapshot support will come out of these PRs or just be easier to implement from scratch then.

@mosabua
Thanks for the heads up. I will evaluate and follow up once HUDI is released.

Copy link

github-actions bot commented Sep 4, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Sep 4, 2024
Copy link

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Sep 26, 2024
@ebyhr
Copy link
Member

ebyhr commented Oct 22, 2024

Currently @yihua and @codope are working towards a new Hudi release that removes the Hadoop dependencies

@yihua @codope Did Hudi 0.15.0 remove all Hadoop dependencies? I see https://issues.apache.org/jira/browse/HUDI-6497 contains 0.15.0 as a fix version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector hudi Hudi connector stale
Development

Successfully merging this pull request may close these issues.