hive connector io.trino.spi.TrinoException: Unsupported storage format #19018

Open
snowangles opened this issue Sep 12, 2023 · 15 comments

@snowangles

snowangles commented Sep 12, 2023

Hello,

We have Hive tables that use custom input formats and serdes. We noticed that, starting with Trino 423, we're no longer able to query these tables.

Query 20230907_171018_00016_mrt64 failed: Unsupported storage format: foobar StorageFormat{serde=CUSTOM SERDE HERE, inputFormat=org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat=org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}
io.trino.spi.TrinoException: Unsupported storage format: foobar StorageFormat{serde=CUSTOM SERDE HERE, inputFormat=org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat=org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$4(BackgroundHiveSplitLoader.java:497)
 at java.base/java.util.Optional.orElseThrow(Optional.java:403)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:497)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:400)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:314)
 at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
 at io.trino.$gen.Trino_426____20230907_160032_2.run(Unknown Source)
 at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:79)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
 at java.base/java.lang.Thread.run(Thread.java:833)

The issue seems to be a recent change to BackgroundHiveSplitLoader.java, where a call to getHiveStorageFormat was introduced that fails when querying a table with a format not defined in HiveStorageFormat.

We had to make changes to HiveStorageFormat.java to add our custom serde definitions. This is a really concerning change for us. Why is the Hive connector all of a sudden limited to only those formats defined in HiveStorageFormat?
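
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern the stack trace points at. Class and method names are simplified and the serde strings are only illustrative; the real logic lives in BackgroundHiveSplitLoader and the HiveStorageFormat enum:

import java.util.Optional;

// Simplified illustration (not the actual Trino source): split loading now
// resolves a table's serde against a fixed enum of known formats and throws
// when nothing matches, which is the orElseThrow in the trace above.
public class StorageFormatLookupSketch
{
    enum HiveStorageFormat
    {
        SEQUENCEFILE("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"),
        PARQUET("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe");

        private final String serde;

        HiveStorageFormat(String serde)
        {
            this.serde = serde;
        }

        static Optional<HiveStorageFormat> fromSerde(String serde)
        {
            for (HiveStorageFormat format : values()) {
                if (format.serde.equals(serde)) {
                    return Optional.of(format);
                }
            }
            return Optional.empty(); // custom serdes land here
        }
    }

    public static void main(String[] args)
    {
        String customSerde = "com.example.CustomSerDe"; // hypothetical custom serde
        HiveStorageFormat format = HiveStorageFormat.fromSerde(customSerde)
                .orElseThrow(() -> new IllegalStateException(
                        "Unsupported storage format: " + customSerde));
    }
}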

The documentation page does not reflect that only certain SequenceFile serdes are supported: https://trino.io/docs/current/connector/hive.html

Assuming this change was done by design, what does the roadmap look like for Hive support in Trino?

@pangyifish

+1, observed the same issue

@s905060

s905060 commented Sep 14, 2023

+1 We are having the same issue.

@shortland

+1. Without major custom changes, this completely breaks Trino for us and blocks any further upgrades.

@electrum
Member

electrum commented Oct 9, 2023

This is part of the project to decouple Trino from the Hadoop and Hive codebases. Can you tell us more about the motivation for using custom input formats or serdes? Would it be feasible for you to convert to a standard format?

@realknorke
Member

realknorke commented Oct 10, 2023

We're using a custom SerDe that is just a wrapper around our own CSV parser. The parser (SFM) is much faster than the default parser shipped with Trino (OpenCSV). This worked well until v423.

@snowangles
Author

We have custom protobuf and Parquet serdes. We are heavily invested in protobufs. For protobufs, we have implemented some custom types for performance reasons. And since our schemas are encoded in protobufs, we have written a Parquet serde that can infer the schema from a protobuf.

It'll be a heavy lift to move our infrastructure off of this.

@electrum
Member

@realknorke Is the CSV input format compatible with Hive OpenCSV? If so, maybe we could replace Trino's CsvDeserializerFactory implementation with the faster version.

@electrum
Member

@snowangles Thanks for explaining. I'll need to think about this. At the moment, I don't have a good answer for you. You should be able to implement your custom reader in a fork of Trino (or a fork of the Hive connector) by adding your format to HiveStorageFormat and implementing it in HivePageSourceFactory.
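
For anyone attempting that route, a rough sketch of the enum half of such a fork change. The CUSTOM_PROTOBUF constant and the com.example class below are invented for illustration, under the assumption that the enum still pairs each format with serde, input format, and output format class names:

// Hypothetical fork-only addition, following the suggestion above. The real
// enum is io.trino.plugin.hive.HiveStorageFormat; a matching reader must
// also be implemented and wired in via HivePageSourceFactory.
public enum HiveStorageFormatSketch
{
    SEQUENCEFILE(
            "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "org.apache.hadoop.mapred.SequenceFileInputFormat",
            "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat"),
    // The custom format a fork would add (names are hypothetical):
    CUSTOM_PROTOBUF(
            "com.example.hive.ProtobufSerDe",
            "org.apache.hadoop.mapred.SequenceFileInputFormat",
            "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat");

    private final String serde;
    private final String inputFormat;
    private final String outputFormat;

    HiveStorageFormatSketch(String serde, String inputFormat, String outputFormat)
    {
        this.serde = serde;
        this.inputFormat = inputFormat;
        this.outputFormat = outputFormat;
    }
}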

@realknorke
Member

realknorke commented Oct 19, 2023

@electrum I don't know for sure whether the SFM CSV parser is 100% compatible with OpenCSV for every edge case (CSV is a format from hell), so it is probably safer for you Trino folks to stick with Hadoop's OpenCSV as the default.
BUT
It would also be good to be able to set a (custom) SerDe implementation as a parameter for Hive (HMS-backed) table creations. This would allow everyone to add custom formats and/or parsers/readers without the Trino maintainers having to worry about them (much).

For the matter at hand, the change in v423 is not a showstopper for us, since we can always switch back to the OpenCSV (default) parser by modifying the SerDe information in the HMS. But that would not be ideal.

Is there a good reason not to allow a custom SerDe as a parameter for CREATE TABLE (apart from the work necessary to implement it)?

Just FYI:
Here is how you do that for HMS-based tables in Spark:

CREATE EXTERNAL TABLE family (id INT, name STRING)
   ROW FORMAT SERDE 'com.ly.spark.serde.SerDeExample'
   STORED AS INPUTFORMAT 'com.ly.spark.example.serde.io.SerDeExampleInputFormat'
       OUTPUTFORMAT 'com.ly.spark.example.serde.io.SerDeExampleOutputFormat'
   LOCATION '/tmp/family/';
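
For comparison, Trino's Hive connector only exposes its built-in formats through the format table property, with no serde escape hatch. The closest equivalent today is roughly the following (the location is hypothetical, and note that Trino's CSV format only supports VARCHAR columns):

CREATE TABLE hive.default.family (id VARCHAR, name VARCHAR)
WITH (
    format = 'CSV',
    external_location = '/tmp/family/'
);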

@hashhar
Member

hashhar commented Oct 19, 2023

"not allow a custom SerDe as parameter"

Being able to know whether things will or will not work. For example, a serde that uses Hadoop classes might stop working in the future.

@realknorke
Member

@hashhar can you please explain? How does allowing a user to specify a custom SerDe (not maintained by the Trino team) affect whether things work or not?

@electrum
Member

The Hive connector no longer depends on Hive classes (for reasons explained here), so it's not possible to support custom Hive serdes. We also took advantage of that to clean up the code to use the HiveStorageFormat enum in more places, and enums are not extensible, so supporting custom formats would require undoing those changes.
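
To illustrate why that matters: an enum is a closed set, so a lookup against it cannot be extended from outside the connector, whereas custom serdes would need something like an open registry. Both halves of this sketch are illustrative, not Trino code:

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative contrast between a closed enum and an open registry.
public class ClosedVersusOpen
{
    // Closed: every supported format must be a constant in this file.
    enum Format { ORC, PARQUET, SEQUENCEFILE }

    // Open: third-party readers could be registered without changing this
    // file, which is roughly what supporting custom serdes would require.
    interface FormatReader
    {
        String serde();
    }

    private final Map<String, FormatReader> registry = new HashMap<>();

    public void register(FormatReader reader)
    {
        registry.put(reader.serde(), reader);
    }

    public Optional<FormatReader> lookup(String serde)
    {
        return Optional.ofNullable(registry.get(serde));
    }
}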

@realknorke
Member

realknorke commented Oct 30, 2023

@electrum thank you for the clarification. This means that Trino, on the one hand, aims to bring together multiple data sources while, on the other hand, restricting access to multiple data formats.
A wide variety of connectors, but, compared to Hive, limited functionality when used as a replacement for Hive/Hadoop. :(

That is an unfortunate design decision.

@dain
Member

dain commented Nov 6, 2023

I understand you are upset. The decision to drop support for Hive SerDes and Hadoop compression codecs was not made lightly. The Hadoop and Hive codebases are difficult to work with and not well maintained. Additionally, the community has swiftly moved away from these Hadoop and Hive formats to Parquet and ORC, and it is pushing further with the switch to Iceberg, Delta Lake, and Hudi. I believe this is a negative, self-reinforcing cycle that is unlikely to change.

Maintaining support for the full breadth of Hadoop/Hive features has been a herculean effort for the past ten years, which we happily made because of the vast usage of these systems. However, usage of these systems has been declining for years, and the effort to maintain support for them has not shrunk to match; it is actually growing as the Hadoop/Hive codebases become more difficult to work with.

This came to a head as we attempted to add new features like Dynamic Catalogs (#12709). The Hadoop/Hive codebases have critical design flaws that make them incompatible with these new features. The only reasonable way to add these features was to decouple from the Hadoop/Hive codebases. This was a massive effort, and again we happily did it because we could finally reduce the effort required to maintain Hadoop/Hive support and actually add these amazing new features.

So where do we go from here? For open-source, popular, well-maintained formats, we will consider adding official support. We may be able to add interfaces to extend the Hive plugin with new file formats and compression codecs. We have never supported extending the Hive plugin by adding jars to the plugin directory, but a few folks did, with varying degrees of success. If we do add extension points for this, they will be specific to Trino and will not use Hadoop/Hive APIs (or have them available on the classpath). This means you would need to adapt your custom format to the Trino APIs (I assume that if you have a custom format, you have programmers). That said, we would need to see a broad community need for this before we would consider adding it (as, again, this is not something we have ever supported).

@realknorke
Member

@dain Thank you very much for your thoughts and explanation!
