Decouple Trino from Hadoop and Hive codebases #15921

Closed
electrum opened this issue Feb 1, 2023 · 2 comments
Labels
roadmap Top level issues for major efforts in the project

Comments

electrum (Member) commented Feb 1, 2023

Background

The Trino Hive connector, along with the Iceberg, Delta Lake, and Hudi connectors, was built on top of the Hadoop and Hive libraries for features such as the Hive Metastore client for catalog metadata, clients for HDFS and cloud object stores, and reading and writing data files. Over the past 10 years, we have slowly been replacing or extending many of these core features to fix bugs, improve performance, add monitoring, and improve reliability. At the same time, maintenance and feature development of the Hive and Hadoop libraries have slowed, which has made maintaining the Trino Hive connector more difficult (e.g., fixing CVEs).

Proposal

Replace all uses of Hadoop and Hive libraries with native Trino code. Trino will no longer depend on the Hive library at all, and Hadoop will be an optional dependency that is only used for accessing HDFS. The major areas that will require rewrites, porting, and new code are described below.

Metadata

Metastore access has been abstracted for a long time now, with direct implementations for Glue and File, and a bridge to the Hive Thrift Metastore. Recently, the HMS Thrift client implementation was replaced with a Trino-specific client generated using the newest Thrift code (the old version used by Hive had CVEs).
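
For illustration, here is a minimal sketch of what such a metastore abstraction might look like. The names below are hypothetical and far smaller than Trino's actual HiveMetastore interface; the point is only that a single interface can sit in front of Thrift, Glue, and file-based implementations.

```java
// Hypothetical sketch of a metastore abstraction; the real Trino
// interface is much larger. Names here are illustrative assumptions.
import java.util.List;
import java.util.Optional;

interface MetastoreClient
{
    List<String> getAllDatabases();

    List<String> getTables(String databaseName);

    Optional<Table> getTable(String databaseName, String tableName);

    void createTable(Table table);

    void dropTable(String databaseName, String tableName, boolean deleteData);
}

// Implementations would include a Thrift client for the Hive Metastore,
// an AWS Glue client, and a file-based metastore for testing.
record Table(String databaseName, String tableName, String owner) {}
```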

File Systems

The Hive plugin uses the Hadoop file system API for accessing files, but the usage has been highly controlled since the custom S3 file system was introduced. Recently, we created the TrinoFileSystem API to abstract away the Hadoop APIs, and this abstraction is already integrated into the Iceberg and Delta Lake connectors. For this project, we will add Trino file system implementations for S3, GCS, and Azure Storage. HDFS will continue to be supported, of course, but via an implementation of the Trino file system API.
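
As a rough illustration, a storage-agnostic file system interface might look like the sketch below. The method names are assumptions for this example, not the exact TrinoFileSystem signatures; the point is that connector code written against the interface does not care whether the backing store is S3, GCS, Azure Storage, or HDFS.

```java
// Illustrative sketch of a storage-agnostic file system abstraction,
// loosely modeled on the idea behind TrinoFileSystem. Method names
// are assumptions, not the actual Trino API.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

interface FileSystemAbstraction
{
    InputStream newInputStream(String location) throws IOException;

    OutputStream newOutputStream(String location) throws IOException;

    void deleteFile(String location) throws IOException;
}

// A connector written against the interface works unchanged regardless
// of which storage implementation is plugged in.
class ExampleReader
{
    private final FileSystemAbstraction fileSystem;

    ExampleReader(FileSystemAbstraction fileSystem)
    {
        this.fileSystem = fileSystem;
    }

    byte[] readAll(String location) throws IOException
    {
        try (InputStream in = fileSystem.newInputStream(location)) {
            return in.readAllBytes();
        }
    }
}
```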

File Formats

There are Trino-native implementations of ORC, Parquet, and RCFile. For this project, we will write native implementations for the other formats supported by Trino (CSV, JSON, Regex, TextFile, SequenceFile, and Avro). The old implementations will remain available as a fallback during a transition period. Eventually, we will remove support for Hive input formats and serdes. This means that custom formats, which were never officially supported in Trino, will no longer work.
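
To give a feel for the scope, the sketch below shows the core loop of a TextFile reader written in plain Java, with no Hadoop input formats or serdes involved. It assumes Hive's default field delimiter (Ctrl-A) and omits escapes, null markers, and complex types, all of which a real implementation must handle.

```java
// Minimal sketch of reading Hive TextFile data without Hadoop serdes:
// newline-delimited rows, fields separated by Hive's default delimiter
// (Ctrl-A, '\u0001'). Escapes, null markers, and nested types omitted.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

class TextFileLineReader
{
    private static final char FIELD_DELIMITER = '\u0001';

    static List<List<String>> readRows(Reader source) throws IOException
    {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(splitFields(line));
            }
        }
        return rows;
    }

    private static List<String> splitFields(String line)
    {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == FIELD_DELIMITER) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start));
        return fields;
    }
}
```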

Hive Type

The Hive plugin uses the internal Hive type system in several places for convenience. The single biggest use of Hive types is during interactions with Hive file formats, and the transition to native implementations will eliminate those usages.
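
As a hypothetical example of the kind of code this replaces, the sketch below maps Hive type names to engine types directly, instead of going through Hive's TypeInfo classes. The EngineType enum is a placeholder for illustration, not Trino's actual type system, and only simple scalar types are shown.

```java
// Hedged sketch of translating Hive type names to engine types without
// the Hive TypeInfo classes. EngineType is a placeholder; decimal,
// varchar(n), and nested types are deliberately left out.
import java.util.Locale;
import java.util.Map;

enum EngineType { BOOLEAN, BIGINT, INTEGER, DOUBLE, VARCHAR, DATE, TIMESTAMP }

class HiveTypeTranslator
{
    private static final Map<String, EngineType> SIMPLE_TYPES = Map.of(
            "boolean", EngineType.BOOLEAN,
            "bigint", EngineType.BIGINT,
            "int", EngineType.INTEGER,
            "double", EngineType.DOUBLE,
            "string", EngineType.VARCHAR,
            "date", EngineType.DATE,
            "timestamp", EngineType.TIMESTAMP);

    static EngineType fromHiveTypeName(String hiveType)
    {
        EngineType type = SIMPLE_TYPES.get(hiveType.toLowerCase(Locale.ROOT));
        if (type == null) {
            throw new IllegalArgumentException("Unsupported Hive type: " + hiveType);
        }
        return type;
    }
}
```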

Tasks

jordandakota commented:

Does this remove the need to run a standalone Hive Metastore alongside Trino? Otherwise, to use dynamic catalog creation with connectors like Iceberg and Delta Lake, you still need to also add configuration to Hive and restart it.

electrum (Member, Author) commented:

@jordandakota No, that's unrelated to this project. Please feel free to file an issue explaining your interest in detail.
