hive connector io.trino.spi.TrinoException: Unsupported storage format #19018

Open
snowangles opened this issue Sep 12, 2023 · 15 comments

@snowangles

snowangles commented Sep 12, 2023

Hello,

We have Hive tables that use custom input formats and serdes. We noticed that, starting with Trino 423, we're no longer able to query these tables.

Query 20230907_171018_00016_mrt64 failed: Unsupported storage format: foobar StorageFormat{serde=CUSTOM SERDE HERE, inputFormat=org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat=org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}
io.trino.spi.TrinoException: Unsupported storage format: foobar StorageFormat{serde=CUSTOM SERDE HERE, inputFormat=org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat=org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$4(BackgroundHiveSplitLoader.java:497)
 at java.base/java.util.Optional.orElseThrow(Optional.java:403)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:497)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:400)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:314)
 at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
 at io.trino.$gen.Trino_426____20230907_160032_2.run(Unknown Source)
 at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:79)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
 at java.base/java.lang.Thread.run(Thread.java:833)

The issue seems to be a recent change to BackgroundHiveSplitLoader.java, where a call to getHiveStorageFormat was introduced that fails when querying a table with a format not defined in HiveStorageFormat.

We had to make changes to HiveStorageFormat.java to add our custom serde definitions. This is a really concerning change for us. Why is the Hive connector all of a sudden limited to only those formats defined in HiveStorageFormat?
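
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern the stack trace points at. Class and method names are simplified and the serde strings are only illustrative; the real logic lives in BackgroundHiveSplitLoader and the HiveStorageFormat enum:

import java.util.Optional;

// Simplified illustration (not the actual Trino source): split loading now
// resolves a table's serde against a fixed enum of known formats and throws
// when nothing matches, which is the orElseThrow in the trace above.
public class StorageFormatLookupSketch
{
    enum HiveStorageFormat
    {
        SEQUENCEFILE("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"),
        PARQUET("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe");

        private final String serde;

        HiveStorageFormat(String serde)
        {
            this.serde = serde;
        }

        static Optional<HiveStorageFormat> fromSerde(String serde)
        {
            for (HiveStorageFormat format : values()) {
                if (format.serde.equals(serde)) {
                    return Optional.of(format);
                }
            }
            return Optional.empty(); // custom serdes land here
        }
    }

    public static void main(String[] args)
    {
        String customSerde = "com.example.CustomSerDe"; // hypothetical custom serde
        HiveStorageFormat format = HiveStorageFormat.fromSerde(customSerde)
                .orElseThrow(() -> new IllegalStateException(
                        "Unsupported storage format: " + customSerde));
    }
}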

The documentation page does not reflect that only certain SequenceFile serdes are supported: https://trino.io/docs/current/connector/hive.html

Assuming this change was done by design, what does the roadmap look like for Hive support in Trino?

@pangyifish

+1, observed the same issue

@s905060

s905060 commented Sep 14, 2023

+1 We are having the same issue.

@shortland

+1. Without major custom changes, this completely breaks Trino for us and blocks any further upgrades.

@electrum
Member

electrum commented Oct 9, 2023

This is part of the project to decouple Trino from the Hadoop and Hive codebases. Can you tell us more about the motivation for using custom input formats or serdes? Would it be feasible for you to convert to a standard format?

@realknorke
Member

realknorke commented Oct 10, 2023

We're using a custom SerDe that is just a wrapper around our own CSV parser. The parser (SFM) is much faster than the default parser shipped with Trino (OpenCSV). This worked well until v423.

@snowangles
Author

We have custom protobuf and Parquet serdes. We are heavily invested in protobufs. For protobufs, we have implemented some custom types for performance reasons. And since our schemas are encoded in protobufs, we have written a Parquet serde that can infer the schema from a protobuf.

It'll be a heavy lift to move our infrastructure off of this.

@electrum
Member

@realknorke Is the CSV input format compatible with Hive OpenCSV? If so, maybe we could replace Trino's CsvDeserializerFactory implementation with the faster version.

@electrum
Member

@snowangles Thanks for explaining. I'll need to think about this. At the moment, I don't have a good answer for you. You should be able to implement your custom reader in a fork of Trino (or a fork of the Hive connector) by adding your format to HiveStorageFormat and implementing it in HivePageSourceFactory.
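
For anyone attempting that route, a rough sketch of the enum half of such a fork change. The CUSTOM_PROTOBUF constant and the com.example class below are invented for illustration, under the assumption that the enum still pairs each format with serde, input format, and output format class names:

// Hypothetical fork-only addition, following the suggestion above. The real
// enum is io.trino.plugin.hive.HiveStorageFormat; a matching reader must
// also be implemented and wired in via HivePageSourceFactory.
public enum HiveStorageFormatSketch
{
    SEQUENCEFILE(
            "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "org.apache.hadoop.mapred.SequenceFileInputFormat",
            "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat"),
    // The custom format a fork would add (names are hypothetical):
    CUSTOM_PROTOBUF(
            "com.example.hive.ProtobufSerDe",
            "org.apache.hadoop.mapred.SequenceFileInputFormat",
            "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat");

    private final String serde;
    private final String inputFormat;
    private final String outputFormat;

    HiveStorageFormatSketch(String serde, String inputFormat, String outputFormat)
    {
        this.serde = serde;
        this.inputFormat = inputFormat;
        this.outputFormat = outputFormat;
    }
}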

@realknorke
Member

realknorke commented Oct 19, 2023

@electrum I don't know for sure whether the SFM CSV parser is 100% compatible with OpenCSV for every edge case (CSV is a format from hell), so it is probably safer for you Trino folks to stick with Hadoop's OpenCSV as the default.
BUT
It would also be good to be able to set a (custom) SerDe implementation as a parameter for Hive (HMS-backed) table creations. This would allow everyone to add custom formats and/or parsers/readers without the Trino maintainers having to worry about them (much).

For the matter at hand, the change in v423 is not a showstopper for us, since we can always switch back to the OpenCSV (default) parser by modifying the SerDe information in the HMS. But that would not be ideal.

Is there a good reason not to allow a custom SerDe as a parameter for CREATE TABLE (apart from the work necessary to implement it)?

Just FYI:
Here is how you do that for HMS-based tables in Spark:

CREATE EXTERNAL TABLE family (id INT, name STRING)
   ROW FORMAT SERDE 'com.ly.spark.serde.SerDeExample'
   STORED AS INPUTFORMAT 'com.ly.spark.example.serde.io.SerDeExampleInputFormat'
       OUTPUTFORMAT 'com.ly.spark.example.serde.io.SerDeExampleOutputFormat'
   LOCATION '/tmp/family/';
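
For comparison, Trino's Hive connector only exposes its built-in formats through the format table property, with no serde escape hatch. The closest equivalent today is roughly the following (the location is hypothetical, and note that Trino's CSV format only supports VARCHAR columns):

CREATE TABLE hive.default.family (id VARCHAR, name VARCHAR)
WITH (
    format = 'CSV',
    external_location = '/tmp/family/'
);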

@hashhar
Member

hashhar commented Oct 19, 2023

"not allow a custom SerDe as parameter"

Being able to know whether things will or will not work. For example, a serde that uses Hadoop classes might stop working in the future.

@realknorke
Member

@hashhar can you please explain? How does allowing a user to specify a custom SerDe (not maintained by the Trino team) affect whether things work or not?

@electrum
Member

The Hive connector no longer depends on Hive classes (for reasons explained here), so it's not possible to support custom Hive serdes. We also took advantage of that to clean up the code to use the HiveStorageFormat enum in more places, and enums are not extensible, so supporting custom formats would require undoing those changes.
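
To illustrate why that matters: an enum is a closed set, so a lookup against it cannot be extended from outside the connector, whereas custom serdes would need something like an open registry. Both halves of this sketch are illustrative, not Trino code:

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative contrast between a closed enum and an open registry.
public class ClosedVersusOpen
{
    // Closed: every supported format must be a constant in this file.
    enum Format { ORC, PARQUET, SEQUENCEFILE }

    // Open: third-party readers could be registered without changing this
    // file, which is roughly what supporting custom serdes would require.
    interface FormatReader
    {
        String serde();
    }

    private final Map<String, FormatReader> registry = new HashMap<>();

    public void register(FormatReader reader)
    {
        registry.put(reader.serde(), reader);
    }

    public Optional<FormatReader> lookup(String serde)
    {
        return Optional.ofNullable(registry.get(serde));
    }
}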

@realknorke
Member

realknorke commented Oct 30, 2023

@electrum thank you for the clarification. This means that Trino, on the one hand, aims to bring together multiple data sources while, on the other hand, restricting access to multiple data formats.
A wide variety of connectors, but, compared to Hive, limited functionality when used as a replacement for Hive/Hadoop. :(

That is an unfortunate design decision.

@dain
Member

dain commented Nov 6, 2023

I understand you are upset. The decision to drop support for Hive SerDes and Hadoop compression codecs was not made lightly. The Hadoop and Hive codebases are difficult to work with and not well maintained. Additionally, the community has swiftly moved away from these Hadoop and Hive formats to Parquet and ORC, and it is pushing further with the switch to Iceberg, Delta Lake, and Hudi. I believe this is a negative, self-reinforcing cycle that is unlikely to change.

Maintaining support for the full breadth of Hadoop/Hive features has been a herculean effort for the past ten years, which we happily made because of the vast usage of these systems. However, usage of these systems has been declining for years, and the effort to maintain support for them has not shrunk to match; it is actually growing as the Hadoop/Hive codebases become more difficult to work with.

This came to a head as we attempted to add new features like Dynamic Catalogs (#12709). The Hadoop/Hive codebases have critical design flaws that make them incompatible with these new features. The only reasonable way to add these features was to decouple from the Hadoop/Hive codebases. This was a massive effort, and again we happily did it because we could finally reduce the effort required to maintain Hadoop/Hive support and actually add these amazing new features.

So where do we go from here? For open-source, popular, well-maintained formats, we will consider adding official support. We may be able to add interfaces to extend the Hive plugin with new file formats and compression codecs. We have never supported extending the Hive plugin by adding jars to the plugin directory, but a few folks did, with varying degrees of success. If we do add extension points for this, they will be specific to Trino and will not use Hadoop/Hive APIs (or have them available on the classpath). This means you would need to adapt your custom format to the Trino APIs (I assume that if you have a custom format, you have programmers). That said, we would need to see a broad community need for this before we would consider adding it (as, again, this is not something we have ever supported).

@realknorke
Member

@dain Thank you very much for your thoughts and explanation!
