[WIP][POC][DeltaLake] Prototype PR to add support reading tables using Delta Kernel library #23119

Open
wants to merge 1 commit into base: master

vkorukanti
Contributor

Description

The Delta Kernel project is a set of Java libraries for building Delta connectors that can read from and write into Delta tables using a narrow set of APIs without understanding the Delta protocol details.

There are two sets of public APIs to build connectors.

  • Table APIs - Interfaces like Table and Snapshot that allow you to read and write Delta tables
  • Engine APIs - The Engine interface allows you to plug in connector-specific optimizations to compute-intensive components in the Kernel. For example, Delta Kernel provides a default Parquet file reader via the DefaultEngine, but you may choose to replace that default with a custom Engine implementation that has a faster Parquet reader for your connector/processing engine.
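
The split between table-level logic and a pluggable engine can be pictured with a small, self-contained model. This is only a sketch of the idea: every type below (`ParquetReader`, `Engine`, `SketchDefaultEngine`, `ConnectorEngine`) is a hypothetical stand-in invented for illustration, not one of the Delta Kernel interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a pluggable engine. None of these types are real Delta Kernel APIs.
class EngineSketch {
    /** Hypothetical stand-in for the component that reads Parquet data files. */
    interface ParquetReader {
        List<String> readRows(String path);
    }

    /** Hypothetical stand-in for an engine that hands out compute-intensive components. */
    interface Engine {
        ParquetReader parquetReader();
    }

    /** Engine with a generic reader, playing the role of a library-provided default. */
    static final class SketchDefaultEngine implements Engine {
        @Override
        public ParquetReader parquetReader() {
            return path -> List.of("default-reader:" + path);
        }
    }

    /** Connector-specific engine that swaps in its own reader. */
    static final class ConnectorEngine implements Engine {
        @Override
        public ParquetReader parquetReader() {
            return path -> List.of("connector-reader:" + path);
        }
    }

    /** Table-level logic depends only on the Engine interface, never on a concrete reader. */
    static List<String> scan(Engine engine, List<String> dataFiles) {
        List<String> rows = new ArrayList<>();
        for (String dataFile : dataFiles) {
            rows.addAll(engine.parquetReader().readRows(dataFile));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> dataFiles = List.of("part-0.parquet", "part-1.parquet");
        System.out.println(scan(new SketchDefaultEngine(), dataFiles));
        System.out.println(scan(new ConnectorEngine(), dataFiles));
    }
}
```

Because `scan` only sees the `Engine` interface, replacing the reader does not touch the table-level code, which is the property the description attributes to the Engine APIs.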

More information about Delta Kernel:

  • Delta Kernel source code
  • Talk explaining the rationale behind Kernel and the API design (the slides are also available and are kept up to date with changes).
  • User guide on the step-by-step process of using Kernel in a standalone Java program or in a distributed processing connector for reading and writing to Delta tables.
  • Example Java programs that illustrate how to read and write Delta tables using the Kernel APIs.
  • Table and default Engine API Java documentation
  • Migration guide

Currently, the Trino Delta connector has its own implementation of the Delta Log. We want to explore whether the connector can use Delta Kernel instead, so that it does not need to reimplement each Delta protocol update itself.

This PR is just an attempt to use the Kernel for the read path, enabled by a session option. The prototype is at a very early stage, and many details still need to be implemented. Currently, it allows reading Delta tables, including those with deletion vectors. It uses Trino's own Parquet reader. We are looking for early feedback on how Kernel can reduce the development burden on the Trino Delta connector of keeping up with Delta protocol updates.
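
To make the "read path behind a session option" idea concrete, here is a minimal sketch. It is not the code in this PR: the property name `kernel_enabled` and the types `TableReader`, `ConnectorLogReader`, and `KernelReader` are hypothetical, and session properties are modeled as a plain map.

```java
import java.util.Map;

// Sketch of gating a new read path behind a session option. All names are hypothetical.
class ReadPathToggleSketch {
    /** Hypothetical abstraction over "something that can read a Delta table". */
    interface TableReader {
        String describe(String tableLocation);
    }

    /** Stands in for the connector's existing Delta Log implementation. */
    static final class ConnectorLogReader implements TableReader {
        @Override
        public String describe(String tableLocation) {
            return "connector-log:" + tableLocation;
        }
    }

    /** Stands in for a Kernel-backed read path. */
    static final class KernelReader implements TableReader {
        @Override
        public String describe(String tableLocation) {
            return "kernel:" + tableLocation;
        }
    }

    // Hypothetical session property name; the PR does not state the real one here.
    static final String KERNEL_ENABLED = "kernel_enabled";

    /** Picks the read path per session, defaulting to the existing implementation. */
    static TableReader select(Map<String, String> sessionProperties) {
        boolean useKernel = Boolean.parseBoolean(sessionProperties.getOrDefault(KERNEL_ENABLED, "false"));
        return useKernel ? new KernelReader() : new ConnectorLogReader();
    }

    public static void main(String[] args) {
        System.out.println(select(Map.of()).describe("s3://bucket/table"));
        System.out.println(select(Map.of(KERNEL_ENABLED, "true")).describe("s3://bucket/table"));
    }
}
```

Defaulting to the existing path keeps the prototype opt-in, so sessions that do not set the option are unaffected.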

@cla-bot cla-bot bot added the cla-signed label Aug 23, 2024
@vkorukanti vkorukanti requested review from ebyhr and findinpath August 23, 2024 17:50
@github-actions github-actions bot added the delta-lake Delta Lake connector label Aug 23, 2024
Contributor

@findinpath findinpath left a comment

I skimmed the io.trino.plugin.deltalake.kernel.clients code, and the use of Hadoop's Configuration class is a no-go for trinodb/trino. Please see #15921 for details.

Removing unrelated property files is likely unintended.

private final TypeManager typeManager;

public KernelTableClient(
Configuration configuration,
trinodb/trino has migrated off Hadoop-related APIs in favor of using the file system clients natively.
We would likely not accept reintroducing this compile dependency in trino-delta-lake.

}

Utils.closeCloseables(currentFileReader);
if (!fileIter.hasNext()) {
nit: fileIter -> fileIterator

Trino code conventions state that abbreviated variable names are discouraged.
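
For context, the pattern in the commented lines (close the current file reader, then check whether more files remain) can be sketched as follows, using the spelled-out name `fileIterator`. This is an illustrative reconstruction under assumptions, not the PR's code: `RowReader` and `ChainedRows` are hypothetical types, and rows are modeled as strings.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch of iterating rows across several files, closing each reader before moving on.
class FileChainSketch {
    /** Hypothetical per-file reader that records its name in a log when closed. */
    static final class RowReader implements Iterator<String>, AutoCloseable {
        private final String name;
        private final Iterator<String> rows;
        private final List<String> closeLog;

        RowReader(String name, List<String> rows, List<String> closeLog) {
            this.name = name;
            this.rows = rows.iterator();
            this.closeLog = closeLog;
        }

        @Override
        public boolean hasNext() {
            return rows.hasNext();
        }

        @Override
        public String next() {
            return rows.next();
        }

        @Override
        public void close() {
            closeLog.add(name);
        }
    }

    /** Chains readers: an exhausted reader is closed before the next file is opened. */
    static final class ChainedRows implements Iterator<String> {
        private final Iterator<RowReader> fileIterator;
        private RowReader currentFileReader;

        ChainedRows(Iterator<RowReader> fileIterator) {
            this.fileIterator = fileIterator;
        }

        @Override
        public boolean hasNext() {
            while (currentFileReader == null || !currentFileReader.hasNext()) {
                if (currentFileReader != null) {
                    currentFileReader.close();
                    currentFileReader = null;
                }
                if (!fileIterator.hasNext()) {
                    return false;
                }
                currentFileReader = fileIterator.next();
            }
            return true;
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return currentFileReader.next();
        }
    }
}
```

Closing inside `hasNext` means a reader is released as soon as it is exhausted, including empty files, rather than being held until the whole scan finishes.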

add config to enable kernel, create KernelDeltaLakeMetadata override

get table handle using Kernel APIs, stubs for TableClient custom impls, build changes

delegators, split manager, page source provider (not yet tested)

end-2-end working!

support for partition column
@github-actions github-actions bot added the stale label Oct 15, 2024
@mosabua mosabua added the stale-ignore label and removed the stale label Oct 15, 2024
@mosabua
Member

mosabua commented Oct 15, 2024

Switched to stale-ignore label since this is an ongoing effort.

@trinodb trinodb deleted a comment from github-actions bot Oct 15, 2024
@trinodb trinodb deleted a comment from github-actions bot Oct 15, 2024
Labels
cla-signed, delta-lake (Delta Lake connector), stale-ignore
3 participants