Decouple Trino from Hadoop and Hive codebases #15921

Closed
electrum opened this issue Feb 1, 2023 · 2 comments
Labels
roadmap Top level issues for major efforts in the project

Comments

electrum (Member) commented Feb 1, 2023

Background

The Trino Hive connector, along with the Iceberg, Delta Lake, and Hudi connectors, was built on top of the Hadoop and Hive libraries for features such as the Hive Metastore client for catalog metadata, clients for HDFS and cloud object stores, and reading and writing data files. Over the past 10 years, we have slowly been replacing or extending many of these core features to fix bugs, improve performance, add monitoring, and improve reliability. At the same time, maintenance and feature development of the Hive and Hadoop libraries have slowed, which has made maintaining the Trino Hive connector more difficult (e.g., fixing CVEs).

Proposal

Replace all uses of Hadoop and Hive libraries with native Trino code. Trino will no longer depend on the Hive library at all, and Hadoop will be an optional dependency that is only used for accessing HDFS. The major areas that will require rewrites, porting, and new code are described below.

Metadata

Metastore access has been abstracted for a long time now, with direct implementations for Glue and File, and a bridge to the Hive Thrift Metastore. Recently, the HMS Thrift client implementation was replaced with a Trino-specific client generated using the newest Thrift code (the old version used by Hive had CVEs).
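
For illustration, here is a minimal sketch of what such a metastore abstraction might look like. The names below are hypothetical and far smaller than Trino's actual HiveMetastore interface; the point is only that a single interface can sit in front of Thrift, Glue, and file-based implementations.

```java
// Hypothetical sketch of a metastore abstraction; the real Trino
// interface is much larger. Names here are illustrative assumptions.
import java.util.List;
import java.util.Optional;

interface MetastoreClient
{
    List<String> getAllDatabases();

    List<String> getTables(String databaseName);

    Optional<Table> getTable(String databaseName, String tableName);

    void createTable(Table table);

    void dropTable(String databaseName, String tableName, boolean deleteData);
}

// Implementations would include a Thrift client for the Hive Metastore,
// an AWS Glue client, and a file-based metastore for testing.
record Table(String databaseName, String tableName, String owner) {}
```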

File Systems

The Hive plugin uses the Hadoop file system API for accessing files, but the usage has been highly controlled since the custom S3 file system was introduced. Recently, we created the TrinoFileSystem API to abstract away the Hadoop APIs, and this abstraction is already integrated into the Iceberg and Delta Lake connectors. For this project, we will add Trino file system implementations for S3, GCS, and Azure Storage. HDFS will continue to be supported, of course, but via an implementation of the Trino file system API.
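
As a rough illustration, a storage-agnostic file system interface might look like the sketch below. The method names are assumptions for this example, not the exact TrinoFileSystem signatures; the point is that connector code written against the interface does not care whether the backing store is S3, GCS, Azure Storage, or HDFS.

```java
// Illustrative sketch of a storage-agnostic file system abstraction,
// loosely modeled on the idea behind TrinoFileSystem. Method names
// are assumptions, not the actual Trino API.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

interface FileSystemAbstraction
{
    InputStream newInputStream(String location) throws IOException;

    OutputStream newOutputStream(String location) throws IOException;

    void deleteFile(String location) throws IOException;
}

// A connector written against the interface works unchanged regardless
// of which storage implementation is plugged in.
class ExampleReader
{
    private final FileSystemAbstraction fileSystem;

    ExampleReader(FileSystemAbstraction fileSystem)
    {
        this.fileSystem = fileSystem;
    }

    byte[] readAll(String location) throws IOException
    {
        try (InputStream in = fileSystem.newInputStream(location)) {
            return in.readAllBytes();
        }
    }
}
```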

File Formats

There are Trino-native implementations of ORC, Parquet, and RCFile. For this project, we will write native implementations for the other formats supported by Trino (CSV, JSON, Regex, TextFile, SequenceFile, and Avro). The old implementations will remain available as a fallback during a transition period. Eventually, we will remove support for Hive input formats and serdes. This means that custom formats, which were never officially supported in Trino, will no longer work.
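
To give a feel for the scope, the sketch below shows the core loop of a TextFile reader written in plain Java, with no Hadoop input formats or serdes involved. It assumes Hive's default field delimiter (Ctrl-A) and omits escapes, null markers, and complex types, all of which a real implementation must handle.

```java
// Minimal sketch of reading Hive TextFile data without Hadoop serdes:
// newline-delimited rows, fields separated by Hive's default delimiter
// (Ctrl-A, '\u0001'). Escapes, null markers, and nested types omitted.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

class TextFileLineReader
{
    private static final char FIELD_DELIMITER = '\u0001';

    static List<List<String>> readRows(Reader source) throws IOException
    {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(splitFields(line));
            }
        }
        return rows;
    }

    private static List<String> splitFields(String line)
    {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == FIELD_DELIMITER) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start));
        return fields;
    }
}
```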

Hive Type

The Hive plugin uses the internal Hive type system in several places for convenience. The single biggest use of Hive types is during interactions with Hive file formats, and the transition to native implementations will eliminate those usages.
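
As a hypothetical example of the kind of code this replaces, the sketch below maps Hive type names to engine types directly, instead of going through Hive's TypeInfo classes. The EngineType enum is a placeholder for illustration, not Trino's actual type system, and only simple scalar types are shown.

```java
// Hedged sketch of translating Hive type names to engine types without
// the Hive TypeInfo classes. EngineType is a placeholder; decimal,
// varchar(n), and nested types are deliberately left out.
import java.util.Locale;
import java.util.Map;

enum EngineType { BOOLEAN, BIGINT, INTEGER, DOUBLE, VARCHAR, DATE, TIMESTAMP }

class HiveTypeTranslator
{
    private static final Map<String, EngineType> SIMPLE_TYPES = Map.of(
            "boolean", EngineType.BOOLEAN,
            "bigint", EngineType.BIGINT,
            "int", EngineType.INTEGER,
            "double", EngineType.DOUBLE,
            "string", EngineType.VARCHAR,
            "date", EngineType.DATE,
            "timestamp", EngineType.TIMESTAMP);

    static EngineType fromHiveTypeName(String hiveType)
    {
        EngineType type = SIMPLE_TYPES.get(hiveType.toLowerCase(Locale.ROOT));
        if (type == null) {
            throw new IllegalArgumentException("Unsupported Hive type: " + hiveType);
        }
        return type;
    }
}
```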

Tasks

jordandakota commented:

Does this remove the need to run a standalone Hive Metastore alongside Trino? Otherwise, to use dynamic catalog creation with connectors like Iceberg and Delta Lake, you still need to also add configuration to Hive and restart it.

electrum (Member, Author) commented:

@jordandakota No, that's unrelated to this project. Please feel free to file an issue explaining your interest in detail.
