-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parquet_ignore_statistics session property to iceberg and delta #20228
Conversation
@@ -37,13 +37,11 @@ public class ParquetReaderConfig | |||
|
|||
private ParquetReaderOptions options = new ParquetReaderOptions(); | |||
|
|||
@Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was originally added to Trino due to a bug in Trino which caused us to ingest corrupted statistics.
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSessionProperties.java
Show resolved
Hide resolved
LGTM overall, but the PR is missing corresponding tests for the new functionality. |
@@ -37,13 +37,11 @@ public class ParquetReaderConfig | |||
|
|||
private ParquetReaderOptions options = new ParquetReaderOptions(); | |||
|
|||
@Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was originally added to Trino due to a bug in Trino which caused us to ingest corrupted statistics.
@raunaqmorarka do you have examples of corrupted Parquet stats for Iceberg or Delta? |
We have a scenario reported in #20212 (comment) where the user through some other tools managed to create a parquet file in iceberg where the partitioned column stats say it is all nulls but that's not the case in actual data. There are now many libraries of varying quality for writing parquet files. The ability to ignore stats through session property is a useful debugging tool when tackling issues with wrong results. |
Deprecation was originally intended for parquet.fail-on-corrupted-statistics which is no longer in the code base. Reading files with incorrect/corrupted statistics by ignoring statistics is a valid use case and safe to use.
b894a93
to
6c05d8b
Compare
Description
Add parquet_ignore_statistics session property to iceberg and delta
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: