Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parquet_ignore_statistics session property to iceberg and delta #20228

Merged
merged 4 commits into from
Jan 2, 2024

Conversation

raunaqmorarka
Copy link
Member

@raunaqmorarka raunaqmorarka commented Dec 27, 2023

Description

Add parquet_ignore_statistics session property to iceberg and delta

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg, Delta Lake
* Add `parquet_ignore_statistics` catalog session property. ({issue}`20228`)

@cla-bot cla-bot bot added the cla-signed label Dec 27, 2023
@github-actions github-actions bot added docs tests:hive iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Dec 27, 2023
@@ -37,13 +37,11 @@ public class ParquetReaderConfig

private ParquetReaderOptions options = new ParquetReaderOptions();

@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it intentional that the session property parquet_ignore_statistics was introduced only for Hive, but not for Iceberg (Delta Lake connector didn't exist at the time) in 1981cef ?

cc @findepi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was originally added to Trino due to a bug in Trino which caused us to ingest corrupted statistics.

@findinpath
Copy link
Contributor

LGTM overall, but the PR is missing corresponding tests for the new functionality.

@@ -37,13 +37,11 @@ public class ParquetReaderConfig

private ParquetReaderOptions options = new ParquetReaderOptions();

@Deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was originally added to Trino due to a bug in Trino which caused us to ingest corrupted statistics.

@findepi
Copy link
Member

findepi commented Dec 29, 2023

@raunaqmorarka do you have examples of corrupted Parquet stats for Iceberg or Delta?
would it be possible to automatically skip them, instead of relying on a user/administrator to turn the config?

@raunaqmorarka
Copy link
Member Author

@raunaqmorarka do you have examples of corrupted Parquet stats for Iceberg or Delta? would it be possible to automatically skip them, instead of relying on a user/administrator to turn the config?

We have a scenario reported in #20212 (comment) where the user through some other tools managed to create a parquet file in iceberg where the partitioned column stats say it is all nulls but that's not the case in actual data. There are now many libraries of varying quality for writing parquet files. The ability to ignore stats through session property is a useful debugging tool when tackling issues with wrong results.
We can skip stats automatically only we have some known pattern of incorrectness from a specific library/tool. We don't have enough details about how the file was produced at the moment to know whether it's a user mistake or a library bug.

Deprecation was originally intended for parquet.fail-on-corrupted-statistics
which is no longer in the code base.
Reading files with incorrect/corrupted statistics by ignoring statistics
is a valid use case and safe to use.
@raunaqmorarka raunaqmorarka merged commit b087a3c into trinodb:master Jan 2, 2024
62 checks passed
@raunaqmorarka raunaqmorarka deleted the pqr-ignore branch January 2, 2024 17:25
@github-actions github-actions bot added this to the 436 milestone Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector docs hive Hive connector iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

3 participants