Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude files that do not satisfy the metadata predicate #17693

Merged
merged 2 commits into from
Apr 9, 2024
Merged

Exclude files that do not satisfy the metadata predicate #17693

merged 2 commits into from
Apr 9, 2024

Conversation

chenjian2664
Copy link
Contributor

@chenjian2664 chenjian2664 commented May 30, 2023

Description

Exclude files that do not satisfy the metadata predicate while computing the table statistics in Iceberg.
While building the tableScan metadata predicates are ignored, it results scan files more than expectation.

Additional context and related issues

Release notes

(x) This is not user-visible or docs only and no release notes are required.

@cla-bot cla-bot bot added the cla-signed label May 30, 2023
@chenjian2664 chenjian2664 self-assigned this May 30, 2023
@chenjian2664 chenjian2664 requested review from ebyhr, findepi and homar May 30, 2023 01:54
@github-actions github-actions bot added the iceberg Iceberg connector label May 30, 2023
@chenjian2664 chenjian2664 requested review from findepi and ebyhr and removed request for findepi May 30, 2023 12:51
@chenjian2664 chenjian2664 changed the title Fix Iceberg table statistics collecting error Fix statistics computing error with path column predicate Jun 12, 2023
@findinpath
Copy link
Contributor

trino> create table iceberg.default.ptest1(c1 integer, c2 integer, p integer) with (partitioning = array['p'], format='parquet');
CREATE TABLE
trino> insert into iceberg.default.ptest1 values (4,5, 10), (11,16,20);
INSERT: 2 rows

trino> insert into iceberg.default.ptest1 values (1,2, 10), (2,3,10), (12,14,20);


rino> show stats for (select * from iceberg.default.ptest1 where "$path" = 'hdfs://hadoop-master:9000/user/hive/warehouse/ptest1-5f9f47e5f84b4c66a25017e3952ecb0d/data/p=10/20230703_114010_00002_wmerg-de99aaa0-fa75-441a-ad87-174e6a86e50a.parquet');
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value 
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 c1          |      NULL |                   2.0 |            0.0 |      NULL | 1         | 2          
 c2          |      NULL |                   2.0 |            0.0 |      NULL | 2         | 3          
 p           |      NULL |                   1.0 |            0.0 |      NULL | 10        | 10         
 NULL        |      NULL |                  NULL |           NULL |       2.0 | NULL      | NULL       
(4 rows)

Query 20230703_121334_00025_wmerg, FINISHED, 1 node
http://localhost:8080/ui/query.html?20230703_121334_00025_wmerg
Splits: 1 total, 1 done (100.00%)
CPU Time: 0.0s total,     0 rows/s,     0B/s, 0% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.0
Peak Memory: 546B
25.82 [0 rows, 0B] [0 rows/s, 0B/s]

trino> show stats for (select * from iceberg.default.ptest1 where "$path" LIKE 'hdfs://hadoop-master:9000/user/hive/warehouse/ptest1-5f9f47e5f84b4c66a25017e3952ecb0d/data/p=10/%');
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value 
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 c1          |      NULL |                  NULL |           NULL |      NULL | NULL      | NULL       
 c2          |      NULL |                  NULL |           NULL |      NULL | NULL      | NULL       
 p           |      NULL |                  NULL |           NULL |      NULL | NULL      | NULL       
 NULL        |      NULL |                  NULL |           NULL |      NULL | NULL      | NULL       
(4 rows)

Retrieving stats for a single file works.
Retrieving stats for the entire partition doesn't seem to work.

Iceberg connector provides the stats as expected through the changes of @chenjian2664 .
However io.trino.cost.FilterStatsCalculator.FilterExpressionStatsCalculatingVisitor#process dismisses them.

@raunaqmorarka do you happen to know why this is happening?

@findepi
Copy link
Member

findepi commented Jul 4, 2023

Iceberg connector provides the stats as expected through the changes of @chenjian2664 . However io.trino.cost.FilterStatsCalculator.FilterExpressionStatsCalculatingVisitor#process dismisses them.

do you happen to know why this is happening?

Iceberg doesn't produce stats for $path column, co the engine cannot determine stats for a predicate involving the $path. The way to solve this would be to subsume the $path predicate into Iceberg connector, which can sometimes be done.

@chenjian2664 chenjian2664 changed the title Fix statistics computing error with path column predicate Fix statistics computing error with metadata column predicate Aug 2, 2023
@chenjian2664 chenjian2664 requested a review from findepi August 2, 2023 11:57
@findepi findepi requested a review from findinpath November 7, 2023 08:00
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jan 15, 2024
@mosabua
Copy link
Member

mosabua commented Jan 15, 2024

@findinpath and @chenjian2664 could you connect and figure out next steps here?

@findinpath
Copy link
Contributor

@chenjian2664 pls rebase and I'll make another round of review. Appologies for not following up with the review on the last request.

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 21, 2024
@chenjian2664
Copy link
Contributor Author

@ebyhr @findinpath Need your help to review on this pr

@github-actions github-actions bot removed the stale label Feb 22, 2024
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Mar 14, 2024
Copy link

github-actions bot commented Apr 5, 2024

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Apr 5, 2024
@mosabua mosabua reopened this Apr 5, 2024
@mosabua
Copy link
Member

mosabua commented Apr 5, 2024

I think this is still important to pursue. Hopefully @chenjian2664 can rebase and then @findinpath and others can follow up.

@chenjian2664
Copy link
Contributor Author

Thanks @mosabua
I believe it is ready to be reviewed @findinpath @ebyhr

@findepi
Copy link
Member

findepi commented Apr 8, 2024

from PR description:

Fix #17451

i am not sure this is true. I failed to reproduce #17451 on current master.

Can you please update the PR title and description to reflect current state?

Move getPathDomain, getFileModifiedTimePathDomain into IcebergUtil.
Refactor the getModificationTime and move it into IcebergUtil.
@chenjian2664 chenjian2664 changed the title Fix statistics computing error with metadata column predicate Fix statistics computing error while involving metadata predicate Apr 8, 2024
@chenjian2664
Copy link
Contributor Author

from PR description:

Fix #17451

i am not sure this is true. I failed to reproduce #17451 on current master.

Can you please update the PR title and description to reflect current state?

Yes, just notice the issue already has been fixed. 👍🏻
Updated the description, appreciate it if could take time to look at it. thanks

.row(null, null, null, null, 4.0, null, null)
.build();
if (format == AVRO) {
expectedTableStatistics =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a comment why we have different stats when working with avro

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t know all the reasons, but what I understand is that Avro is a row-based storage format and does not inherently support storing column-level statistics. Therefore, it's somewhat different from columnar storage formats like Parquet and ORC
Could you help to provide a concise comment

@github-actions github-actions bot removed the stale label Apr 8, 2024
@chenjian2664 chenjian2664 changed the title Fix statistics computing error while involving metadata predicate Exclude files that do not satisfy the metadata predicate Apr 9, 2024
Co-Authored-By: Marius Grama <findinpath@gmail.com>
@findepi findepi merged commit 191e778 into trinodb:master Apr 9, 2024
43 checks passed
@chenjian2664 chenjian2664 deleted the iceberg_stats branch April 9, 2024 08:20
@chenjian2664
Copy link
Contributor Author

@findinpath @mosabua @ebyhr @findepi Thank you all for the help

@github-actions github-actions bot added this to the 445 milestone Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

5 participants