Decouple Trino from Hadoop and Hive codebases #15921
Labels: roadmap (Top level issues for major efforts in the project)
Comments
Does this decouple the need to run a standalone Hive metastore along with Trino? Otherwise, to use dynamic catalog creation with connectors like Iceberg and Delta, you still also need to add configuration to Hive and restart it.

@jordandakota No, that's unrelated to this project. Please feel free to file an issue explaining your interest in detail.
Background
The Trino Hive connector, along with the Iceberg, Delta Lake, and Hudi connectors, was built on top of the Hadoop and Hive libraries for core features: the Hive Metastore client for catalog metadata, clients for HDFS and cloud object stores, reading and writing data files, and so on. Over the past 10 years, we have slowly replaced or extended many of these core features to fix bugs, improve performance, add monitoring, and improve reliability. At the same time, maintenance and feature development of the Hive and Hadoop libraries has slowed, which has made maintaining the Trino Hive connector more difficult (e.g., fixing CVEs).
Proposal
Replace all uses of Hadoop and Hive libraries with native Trino code. Trino will no longer depend on the Hive library at all, and Hadoop will be an optional dependency that is only used for accessing HDFS. The major areas that will require rewrites, porting, and new code are described below.
Metadata
Metastore access has been abstracted for a long time now, with direct implementations for Glue and File metastores and a bridge to the Hive Thrift Metastore. Recently, the HMS Thrift client implementation was replaced with a Trino-specific client generated using the newest Thrift code (the old version used by Hive had CVEs).
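As a rough illustration of what this abstraction looks like, here is a minimal sketch; the type and method names below are hypothetical placeholders, not Trino's actual metastore interface, which is considerably larger.

```java
// Hypothetical sketch of a metastore abstraction; the real Trino
// interface is larger. Database and Table are placeholder records.
import java.util.List;
import java.util.Optional;

record Database(String name) {}

record Table(String databaseName, String tableName) {}

interface MetastoreClient
{
    Optional<Database> getDatabase(String databaseName);

    Optional<Table> getTable(String databaseName, String tableName);

    List<String> listTables(String databaseName);

    void createTable(Table table);

    void dropTable(String databaseName, String tableName);
}

// Each backend implements the same interface:
// - a Glue client backed by the AWS Glue Data Catalog
// - a "file" metastore that stores metadata as files on storage
// - a Thrift client bridging to a Hive Metastore (HMS)
```

The point of the design is that connectors code against the interface only, so swapping the Thrift client for a Trino-generated one did not require changes elsewhere.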
File Systems
The Hive plugin uses the Hadoop file system API for accessing files, but that usage has been tightly controlled since the custom S3 file system was introduced. Recently, we created the TrinoFileSystem API to abstract away from the Hadoop APIs, and this abstraction is already integrated into the Iceberg and Delta Lake connectors. For this project, we will add Trino file system implementations for the S3, GCS, and Azure storage systems. HDFS will continue to be supported, of course, but via an implementation of the Trino file system API.
File Formats
There are Trino native implementations of ORC, Parquet, and RCFile. For this project, we will write native implementations for the other formats supported by Trino (CSV, JSON, Regex, TextFile, SequenceFile, and Avro). The old implementations will be available as a fallback during a transition period. Eventually, we will remove support for Hive input formats and serdes. This means that custom formats, which were never officially supported in Trino, will no longer work.
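As a hedged sketch of what a native text-format reader involves (hypothetical types; Trino's actual readers are columnar and far more involved), a line-oriented format like CSV reduces to reading lines, splitting them into columns, and converting the values to engine types:

```java
// Hypothetical sketch of a native line-oriented reader (e.g. CSV).
// This only illustrates the core loop of a text-format decoder.
import java.io.BufferedReader;
import java.io.IOException;
import java.util.List;
import java.util.regex.Pattern;

final class CsvLineReader
{
    private final BufferedReader reader;
    private final Pattern delimiter;

    CsvLineReader(BufferedReader reader, char delimiterChar)
    {
        this.reader = reader;
        this.delimiter = Pattern.compile(Pattern.quote(String.valueOf(delimiterChar)));
    }

    // Returns the next row as raw column strings, or null at end of input.
    // The -1 limit preserves trailing empty columns.
    List<String> nextRow() throws IOException
    {
        String line = reader.readLine();
        if (line == null) {
            return null;
        }
        return List.of(delimiter.split(line, -1));
    }
}
```

Owning this code path end to end is what lets the project drop the Hive serde classes that currently do the equivalent work.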
Hive Type
The Hive plugin uses the internal Hive type system in several places for convenience. The single biggest use of Hive types is in interactions with the Hive file formats, and the transition to native format implementations will eliminate those usages.
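For context, the coupling amounts to translating Hive type names such as `int`, `string`, or `array<string>` into the engine's own types. Here is a minimal, hypothetical sketch of that mapping for primitives; Trino's actual translation handles parameterized and nested types as well.

```java
// Hypothetical sketch of mapping Hive primitive type names onto
// engine type names; the real translation covers many more cases.
import java.util.Map;

final class HiveTypeMapping
{
    private static final Map<String, String> PRIMITIVES = Map.of(
            "tinyint", "TINYINT",
            "smallint", "SMALLINT",
            "int", "INTEGER",
            "bigint", "BIGINT",
            "float", "REAL",
            "double", "DOUBLE",
            "string", "VARCHAR",
            "binary", "VARBINARY",
            "boolean", "BOOLEAN");

    static String toEngineType(String hiveType)
    {
        String engineType = PRIMITIVES.get(hiveType);
        if (engineType == null) {
            throw new IllegalArgumentException("Unsupported Hive type: " + hiveType);
        }
        return engineType;
    }
}
```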
Tasks
- TrinoFileSystem in Delta Lake #15071
- Path, Configuration in Hudi #17291
- hadoop.mapred usages in Hudi #17326
- s3://bucket location (without trailing slash after bucket name) #17921
- Location.appendPath result when location has no authority and single slash after scheme #17931