Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add view text and reference chain information to TableInfo #18871

Merged
merged 2 commits into from
Feb 5, 2024

Conversation

xkrogen
Copy link
Member

@xkrogen xkrogen commented Aug 30, 2023

In #7330, a flag directlyReferenced was added to TableInfo to indicate whether a given table or view was directly referenced, or indirectly referenced via a view/filter/mask. This is useful, but lacks the information necessary to understand what caused the indirect reference. Understanding the context can be useful for auditing purposes, as well as general lineage use cases.

As one motivating example, let us say we have have the following:

  • table foo contains user data (PII)
  • view foo_sanitized provides a view over foo with PII columns removed
  • view foo_dirty defined as SELECT * FROM foo
  • legal requirement that when reading user data for a certain use case bar, PII cannot be used

Auditing is one way to enforce that queries run for bar only make use of foo_sanitized rather than accessing foo directly. However currently we can only tell that foo was indirectly referenced, e.g. we don't know if it was accessed via foo_dirty or foo_sanitized, and thus cannot determine if the access met the requirement.

Similarly, let us envision that we want to notify all consumers of a table foo of some impending migration/deprecation. It's not sufficient to just know that foo was accessed by way of a view; we need to know which view caused that access, so that we can identify the view owner to notify.

To achieve this, a new field referenceChain is added to TableInfo. This shows the chain of references leading from the direct reference to the view/table. For example assume we have the following:

CREATE TABLE foo;
CREATE VIEW foo1 AS SELECT * FROM foo;
CREATE VIEW foo2 AS SELECT * FROM foo1;

SELECT * FROM foo2;

For the SELECT query we will see the following values for TableInfo#referenceChain:

foo -> [foo2, foo1]
foo1 -> [foo2]
foo2 -> []

In this way we can clearly identify where the access originated from. This PR also proposes including the text of a view in the TableInfo entry for views, to make it easier to understand which transformations were applied.

This mainly focuses on views, but also accommodates column masks and row filters as part of the reference chain. Assuming the same setup as above, but also with a row filter applied on table foo like userid IN (SELECT userid FROM allowed_users), we would see:

allowed_users -> [foo2, foo1, <foo, "userid IN (SELECT userid FROM allowed_users)">]

Since row filters / column masks aren't named, they are identified by their target table and the expression.

This creates a new interface TableReferenceInfo in the SPI to represent these entities that can reference other tables. There are concrete subclasses for views, column masks, and row filters.

@hashhar
Copy link
Member

hashhar commented Aug 31, 2023

FYI @Praveen2112

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 2dee6b3 to 88aa2df Compare August 31, 2023 16:55
@xkrogen
Copy link
Member Author

xkrogen commented Aug 31, 2023

cc a few folks from #7330: @martint @skrzypo987 @kokosing @kasiafi

@electrum
Copy link
Member

electrum commented Sep 2, 2023

Thanks for the detailed write up explaining the motivation. This is a model that we should emulate in other PRs.

@xkrogen
Copy link
Member Author

xkrogen commented Sep 8, 2023

Hi @Praveen2112 @hashhar @skrzypo987 @kokosing @kasiafi , friendly ping, anyone interested in taking a look at this? Thanks! :)

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 88aa2df to dedf808 Compare September 11, 2023 20:14
@xkrogen
Copy link
Member Author

xkrogen commented Sep 18, 2023

Hi @Praveen2112 any more comments on the current version?

@Praveen2112
Copy link
Member

I'll take a look

@xkrogen
Copy link
Member Author

xkrogen commented Oct 2, 2023

Friendly ping @Praveen2112 in case you are able to take a look

cc also @phd3 in case you have a chance to review

@Praveen2112
Copy link
Member

Sorry for the delay !! Will take a look by this week

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch 2 times, most recently from 272e727 to e8d0b44 Compare October 11, 2023 19:29
@xkrogen
Copy link
Member Author

xkrogen commented Oct 20, 2023

Hi @Praveen2112 friendly reminder :)

Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this and apologies for the delay. I personally like adding additional information on views or MVs - I'm not sure if we need to capture the row filter and column masks along with this - as we already have them in TableInfo#getFilters and TableInfo#getColumns where we maintain each information about the column.

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from e8d0b44 to 5d483b8 Compare October 24, 2023 17:24
@xkrogen
Copy link
Member Author

xkrogen commented Oct 24, 2023

Just updated based on last round of comments @Praveen2112 . Thanks, they were super helpful! I've left in the row-filter/column-mask info for now, LMK what you think based on my latest comment. I don't have a strong opinion here -- I am also primarily interested in the view info rather than mask/filter -- so if you still feel it should be removed after this, no concerns from my side.

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch 2 times, most recently from 49ba984 to d574c4d Compare October 25, 2023 20:11
@xkrogen
Copy link
Member Author

xkrogen commented Oct 26, 2023

The build says failed on the trino-main test step but I can't find any failure and I see in the logs

[INFO] Tests run: 1718, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 341.5 s -- in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 1718, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

Going to re-push to try again ...

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from d574c4d to 309d875 Compare October 26, 2023 15:34
@xkrogen
Copy link
Member Author

xkrogen commented Oct 26, 2023

🤔 This time the trino-main test step succeeded, but the artifact-checks step failed with:

Error: The operation was canceled.

Trying one more time ...

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 309d875 to 47f6a01 Compare October 26, 2023 21:44
@xkrogen
Copy link
Member Author

xkrogen commented Oct 27, 2023

okay all checks passing this time 😌

@xkrogen
Copy link
Member Author

xkrogen commented Oct 30, 2023

Hey @Praveen2112 any chance you can take another look? I'm going to be OOO for a couple of weeks starting this Fri, would love to get this merged before then 🤞🏻

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 47f6a01 to 8e544ac Compare November 1, 2023 20:40
@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 8e544ac to e9c2b4d Compare November 29, 2023 21:26
@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 21ed66d to b47a921 Compare December 18, 2023 19:01
@xkrogen
Copy link
Member Author

xkrogen commented Jan 3, 2024

Hi @Praveen2112 happy new year! Friendly reminder on this PR, let me know what you think on the discussion re: MVs. Thanks!

Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I think the PR looks great !! Trino has a concept of redirected tables and it would be nice if we could add a test case for those kind of tables. LMK if you need additional help wrt to re-directed tables.

Comment on lines 24 to 31
@JsonTypeInfo(
use = JsonTypeInfo.Id.NAME,
property = "@type")
@JsonSubTypes({
@JsonSubTypes.Type(value = ViewReferenceInfo.class, name = "view"),
@JsonSubTypes.Type(value = RowFilterReferenceInfo.class, name = "rowFilter"),
@JsonSubTypes.Type(value = ColumnMaskReferenceInfo.class, name = "columnMask")})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No changes requested for my reference

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool!

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch 3 times, most recently from 4677de5 to 03746dc Compare January 5, 2024 16:55
@xkrogen
Copy link
Member Author

xkrogen commented Jan 17, 2024

Quick reminder on this @Praveen2112

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 03746dc to ee31c0f Compare January 17, 2024 22:56
@Praveen2112
Copy link
Member

Will take a look by early next week !!

Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup looks great !! Thanks for working on this

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch 2 times, most recently from b4d7238 to 2582208 Compare January 30, 2024 23:17
@Praveen2112
Copy link
Member

Can we fix the maven failures (related to codestyle)

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 2582208 to 432c4e9 Compare February 1, 2024 18:42
@xkrogen
Copy link
Member Author

xkrogen commented Feb 1, 2024

Yup fixed and made sure to run mvnw validate this time :)

@xkrogen
Copy link
Member Author

xkrogen commented Feb 1, 2024

Failures don't look related:

> grep "pt (default, suite-hive-transactional, )/7_Product Tests.txt" -e "FAILURE" -C5
2024-02-01T20:29:33.6042186Z presto-master       |
2024-02-01T20:29:33.6042660Z presto-master       |
2024-02-01T20:29:33.6045595Z presto-master       | 2024-02-02T02:14:33.601+0545 INFO    dispatcher-query-16     io.trino.event.QueryMonitor     TIMELINE: Query 20240201_202933_00760_vexr8 :: FAILED (HIVE_INVALID_BUCKET_FILES) :: elapsed 51ms :: planning 10ms :: waiting 11ms :: scheduling 41ms :: running 0ms :: finishing 41ms :: begin 2024-02-02T02:14:33.547+05:45 :: end 2024-02-02T02:14:33.598+05:45
2024-02-01T20:29:33.6760368Z tests               | 2024-02-02 02:14:33 INFO: not retrying; @Flaky annotation not present
2024-02-01T20:29:33.6768286Z tests               | 2024-02-02 02:14:33 INFO: Test io.trino.tests.product.hive.TestHiveTransactionalTable.testReadFullAcid took 1.23m
2024-02-01T20:29:33.6771586Z tests               | 2024-02-02 02:14:33 INFO: FAILURE     /    io.trino.tests.product.hive.TestHiveTransactionalTable.testReadFullAcid (Groups: profile_specific_tests, hive_transactional) took 1 minutes and 14 seconds
2024-02-01T20:29:33.6868905Z tests               | 2024-02-02 02:14:33 SEVERE: Failure cause:
2024-02-01T20:29:33.6872278Z tests               | io.trino.tempto.query.QueryExecutionException: java.sql.SQLException: Query failed (#20240201_202933_00760_vexr8): Found file in sub-directory of ACID directory: hdfs://hadoop-master:9000/user/hive/warehouse/test_read_full_acid_false_none_decljdey5j/delete_delta_0000002_0000005/delete_delta_0000002_0000005/bucket_00000
2024-02-01T20:29:33.6875445Z tests               |      at io.trino.tempto.query.JdbcQueryExecutor.execute(JdbcQueryExecutor.java:119)
2024-02-01T20:29:33.6876958Z tests               |      at io.trino.tempto.query.JdbcQueryExecutor.executeQuery(JdbcQueryExecutor.java:84)
2024-02-01T20:29:33.6879027Z tests               |      at io.trino.tests.product.utils.QueryExecutors$1.lambda$executeQuery$0(QueryExecutors.java:54)


> grep test\ \(coretrino-main\)/* -i -e "Failures: 1" -C3
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:09:50.6842294Z 2024-02-01T14:09:50.605-0600       INFO    ForkJoinPool-1-worker-6 org.eclipse.jetty.server.Server Stopped Server@7f5c8267{STOPPING}[11.0.20,sto=0]
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:09:50.6849865Z 2024-02-01T14:09:50.606-0600       INFO    ForkJoinPool-1-worker-6 org.eclipse.jetty.server.AbstractConnector      Stopped http@4413ac74{HTTP/1.1, (http/1.1, h2c)}{0.0.0.0:37279}
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:09:50.6853741Z 2024-02-01T14:09:50.606-0600       INFO    ForkJoinPool-1-worker-6 org.eclipse.jetty.server.handler.ContextHandler Stopped o.e.j.s.ServletContextHandler@53e8be44{/,null,STOPPED,@http}
test (coretrino-main)/5_Maven Tests.txt:2024-02-01T20:09:50.7115114Z [ERROR] Tests run: 31, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 52.02 s <<< FAILURE! -- in io.trino.server.security.oauth2.TestOidcDiscovery
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:09:50.7121399Z [ERROR] io.trino.server.security.oauth2.TestOidcDiscovery.testOidcDiscoveryTimesOut -- Time elapsed: 10.85 s <<< FAILURE!
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:09:50.7128384Z java.lang.AssertionError:
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:09:50.7128743Z
--
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:27:58.9326713Z [ERROR]   TestOidcDiscovery.testOidcDiscoveryTimesOut:169
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:27:58.9343975Z Expecting code to raise a throwable.
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:27:58.9344771Z [INFO]
test (coretrino-main)/5_Maven Tests.txt:2024-02-01T20:27:58.9345557Z [ERROR] Tests run: 8346, Failures: 1, Errors: 0, Skipped: 5
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:27:58.9346271Z [INFO]
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:27:58.9347200Z [INFO] ------------------------------------------------------------------------
test (coretrino-main)/5_Maven Tests.txt-2024-02-01T20:27:58.9347978Z [INFO] BUILD FAILURE

Going to re-trigger ...

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 432c4e9 to 11166fa Compare February 1, 2024 22:28
Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. Minor nits

Comment on lines +22 to +28
public static TableInfoAssert assertThat(TableInfo actual)
{
return new TableInfoAssert(actual);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we move this assertion method to TableInfoAssert.java as a public static method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting this only for the TableInfoAssert, for for all 4 of the methods in this class? (Note that the second commit adds 3 more methods here.)

I pulled everything into this one class to allow for a single static import, that then selects the correct assert based on the static type of the object:

import io.trino.common.assertions.TrinoAssertions.assertThat;

As opposed to:

import io.trino.common.assertions.TableInfoAssert.assertThat;
import io.trino.common.assertions.TableReferenceInfoAssert.assertThat;
import io.trino.common.assertions.BaseViewReferenceInfoAssert.assertThat;
import io.trino.common.assertions.FilterMaskReferenceInfoAssert.assertThat;

Especially if more fluent assertions are added in the future, this will provide a nifty entrypoint to discover all of the existing ones.

This is in line with how AssertJ itself works (see Assertions), and how they show adding custom assertions in their examples, e.g.: https://github.com/assertj/assertj-examples/blob/main/assertions-examples/src/test/java/org/assertj/examples/custom/UberSoftAssertions.java

@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 11166fa to 346908b Compare February 2, 2024 17:47
@xkrogen xkrogen force-pushed the xkrogen/genericreferencechain branch from 346908b to ffd971a Compare February 2, 2024 20:26
@Praveen2112 Praveen2112 merged commit 467dd8b into trinodb:master Feb 5, 2024
92 checks passed
@Praveen2112
Copy link
Member

@xkrogen Thanks for working on this

@github-actions github-actions bot added this to the 439 milestone Feb 5, 2024
@xkrogen
Copy link
Member Author

xkrogen commented Feb 5, 2024

Woohoo! Thanks a lot for all of the help @Praveen2112 , very happy with the end state of this. And thanks for putting up with my many nagging reminders :)

@colebow
Copy link
Member

colebow commented Feb 6, 2024

Does this need a release note? @xkrogen @Praveen2112

@Praveen2112
Copy link
Member

Yeah !! We could have something like

 - Expose additional table information to EventListener

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants