Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use default fpp for writing in Hive connector if not provided in table metadata #16589

Merged

Conversation

Chaho12
Copy link
Member

@Chaho12 Chaho12 commented Mar 16, 2023

Description

I think that if table metadata does not have FPP value, default fpp value should be used for bloom filter for writing ORC files.
As of now, it returns error saying that FPP for bloom filter is missing, which is ironic as

  1. Hive uses default bloom filter value (0.05%) if value is missing
  2. Trino doc saids that default value of fpp is 0.05

Additional context and related issues

Hive has default bloom filter of 0.05 (5%)

Trino doc saids that default value of fpp is 0.05

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive Connector
* Use default FPP for bloom filters in ORC when not specified in table metadata.

# Iceberg Connector
* Use default FPP for bloom filters in ORC when not specified in table metadata

@cla-bot cla-bot bot added the cla-signed label Mar 16, 2023
@Chaho12 Chaho12 requested a review from Praveen2112 March 16, 2023 12:24
@Chaho12
Copy link
Member Author

Chaho12 commented Mar 16, 2023

@Praveen2112 Hey, I was not sure who to tag so I tagged you as a reviewer because I noticed that you made changes to orc bloom filter previously. I hope you don't mind :)

@github-actions github-actions bot added hive Hive connector tests:hive labels Mar 16, 2023
@Chaho12 Chaho12 force-pushed the feature/jaeho.yoo/use_default_fpp branch from fa34f4f to 3b1b292 Compare June 13, 2023 02:39
@Chaho12
Copy link
Member Author

Chaho12 commented Sep 1, 2023

Hi ✋ PTAL whenever you have time. Thx :)

@Praveen2112 Praveen2112 self-assigned this Sep 1, 2023
Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me but can we add some test coverage here ? But even the table properties are default - doesn't hive specify them when creating a table ?

@Chaho12
Copy link
Member Author

Chaho12 commented Sep 1, 2023

can we add some test coverage here

I will add it soon.

doesn't hive specify them when creating a table ?

Yes, it does not specify when creating a table.

when I do show create tables after creating in hive, it does not add orc.bloom.filter.fpp in tblproperties (only adds 'bucketing_version'='2', 'transient_lastDdlTime'='1693569887') and neither can I find the propertiy in metastore db (at TABLE_PARAMS#PARAM_KEY COLUMN). Despite the document from hive, those properties are not added saved in metastore unless I specify them so.

When you look up new BloomFilter code in Hive and you can find it at 3 places

So one can wonder where does this 0.05 default value come from that is mentioned in the docs? Well it comes from map reducing job for ORC writers. The writer checks it the value is set in table properties and if it doesn't have one, default value is used which is set as 0.05 in orc conf.

  • This is very similar to Trino as Trino writer checks if it has fpp value in table property, and uses default value if there isn't one (this feature is what I added)

@Chaho12 Chaho12 force-pushed the feature/jaeho.yoo/use_default_fpp branch from 3b1b292 to e565c47 Compare September 1, 2023 17:24
@github-actions github-actions bot added the iceberg Iceberg connector label Sep 1, 2023
@Chaho12
Copy link
Member Author

Chaho12 commented Sep 1, 2023

Added test coverage when there isn't fpp value

@Chaho12 Chaho12 force-pushed the feature/jaeho.yoo/use_default_fpp branch 3 times, most recently from e53674e to 9907887 Compare January 3, 2024 03:14
@Chaho12 Chaho12 requested review from Praveen2112 and dain January 3, 2024 04:50
@Chaho12
Copy link
Member Author

Chaho12 commented Jan 3, 2024

PTAL rebased and updated some minor changes :)

Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this issue.

@Chaho12 Chaho12 force-pushed the feature/jaeho.yoo/use_default_fpp branch from 9907887 to c88edf0 Compare January 5, 2024 00:34
@Chaho12 Chaho12 force-pushed the feature/jaeho.yoo/use_default_fpp branch from 607e0f5 to 3ef12a1 Compare January 11, 2024 00:20
Copy link

github-actions bot commented Feb 1, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 1, 2024
@mosabua
Copy link
Member

mosabua commented Feb 1, 2024

@Praveen2112 is this still good to go and we can merge?

@Praveen2112 Praveen2112 merged commit 8f6e574 into trinodb:master Feb 2, 2024
56 checks passed
@Praveen2112
Copy link
Member

@Chaho12 Thanks for working on this.

@github-actions github-actions bot added this to the 439 milestone Feb 2, 2024
@Chaho12 Chaho12 deleted the feature/jaeho.yoo/use_default_fpp branch February 2, 2024 06:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector iceberg Iceberg connector stale
Development

Successfully merging this pull request may close these issues.

3 participants