hive connector io.trino.spi.TrinoException: Unsupported storage format #19018
Comments
+1, observed the same issue
+1 We are having the same issue.
+1, without major custom changes - this is completely breaking & blocking any further upgrades to Trino for us.
This is part of the project to decouple Trino from the Hadoop and Hive codebases. Can you tell us more about the motivation for using custom input formats or serdes? Would it be feasible for you to convert to a standard format?
We're using a custom SerDe which is just a wrapper around our own CSV parser. The parser (SFM) is much faster than the default parser shipped with Trino (OpenCSV). Worked well until v423.
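For readers unfamiliar with this pattern, below is a minimal sketch of what such a wrapper SerDe can look like. This is an assumption about the setup, not the actual code: `FastCsvSerDe` and its naive `parseLine` stand in for the real SFM-based parser, and the API shown is the Hive 2/3 `AbstractSerDe` shape, which varies between Hive versions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical read-only SerDe that delegates CSV parsing to a custom parser
public class FastCsvSerDe extends AbstractSerDe {
    private ObjectInspector inspector;

    @Override
    public void initialize(Configuration conf, Properties tableProperties) throws SerDeException {
        // Column names come from the table definition stored in the metastore
        List<String> columnNames = Arrays.asList(tableProperties.getProperty("columns").split(","));
        // Expose every column as a string, mirroring OpenCSVSerde's behavior
        List<ObjectInspector> fieldInspectors = Collections.nCopies(
                columnNames.size(),
                (ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        inspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, fieldInspectors);
    }

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        // A struct row can be returned as a List of field values
        return parseLine(blob.toString());
    }

    // Stand-in for the real SFM-based parser; a real implementation would
    // handle quoting, escapes, and embedded delimiters
    private static List<String> parseLine(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    @Override
    public ObjectInspector getObjectInspector() {
        return inspector;
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
        throw new SerDeException("read-only SerDe");
    }

    @Override
    public SerDeStats getSerDeStats() {
        return new SerDeStats();
    }
}
```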
We have custom protobuf and parquet serdes. We are heavily invested in protobufs. For protobufs, we have implemented some custom types for performance reasons. And since our schemas are encoded in protobufs, we have written a parquet serde that can infer the schema from a protobuf. It'll be a heavy lift to move our infrastructure off of this.
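As an illustration of the schema-inference idea described here (an assumption about the setup, not their actual code): the parquet-protobuf module's `ProtoSchemaConverter` can derive a Parquet schema from a generated protobuf class, so the Parquet schema never has to be maintained by hand. `MyRecord` in the usage comment is a hypothetical generated class.

```java
import com.google.protobuf.Message;
import org.apache.parquet.proto.ProtoSchemaConverter;
import org.apache.parquet.schema.MessageType;

public final class ProtoParquetSchema {
    private ProtoParquetSchema() {}

    // Walks the protobuf descriptor of the generated class and produces
    // the equivalent Parquet message type
    public static MessageType schemaFor(Class<? extends Message> protoClass) {
        return new ProtoSchemaConverter().convert(protoClass);
    }

    // Usage (MyRecord is a hypothetical generated protobuf class):
    //   MessageType schema = ProtoParquetSchema.schemaFor(MyRecord.class);
}
```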
@realknorke Is the CSV input format compatible with Hive OpenCSV? If so, maybe we could replace Trino's OpenCSV-based reader with it.
@snowangles Thanks for explaining. I'll need to think about this. At the moment, I don't have a good answer for you. You should be able to implement your custom reader in a fork of Trino (or a fork of the Hive connector) by adding your format to HiveStorageFormat.
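For readers following along, a fork-local change of the kind suggested might look roughly like the sketch below. This is illustrative only: `SFM_CSV` and the serde class name are placeholders, and the exact constructor arguments of Trino's `HiveStorageFormat` enum differ between versions.

```java
// Sketch of registering a custom format in a fork's HiveStorageFormat enum
public enum HiveStorageFormat {
    // ... existing entries such as ORC, PARQUET, CSV ...

    SFM_CSV(
            "com.example.hive.serde.FastCsvSerDe",                         // custom serde (placeholder)
            "org.apache.hadoop.mapred.TextInputFormat",                    // input format
            "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"); // output format

    private final String serde;
    private final String inputFormat;
    private final String outputFormat;

    HiveStorageFormat(String serde, String inputFormat, String outputFormat) {
        this.serde = serde;
        this.inputFormat = inputFormat;
        this.outputFormat = outputFormat;
    }
}
```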
@electrum I don't know for sure whether or not the SFM CSV parser is 100% compatible with OpenCSV for every edge case (as CSV is a format from hell). For the matter at hand, the change in v423 is not a show stopper for us, as we can always switch back to the OpenCSV (default) parser by modifying the SerDe information in the HMS. But this would not be ideal. Is there a good reason to not allow a custom SerDe as a parameter?
Being able to know whether things will or will not work. For example, if the serde uses Hadoop classes it might stop working in the future.
@hashhar can you please explain? How does allowing a user to specify a custom SerDe (not maintained by the Trino team) affect whether things work or not?
The Hive connector no longer depends on Hive classes (for reasons explained here), so it's not possible to support custom Hive serdes. We also took advantage of that to clean up the code.
@electrum thank you for the clarification. This means that Trino, on one hand, aims to bring together multiple data sources while, on the other hand, restricting access to multiple data formats. That is an unfortunate design decision.
I understand you are upset. The decision to drop support for Hive SerDes and Hadoop compression codecs was not made lightly. The Hadoop and Hive codebases are difficult to work with, and not well maintained. Additionally, the community has swiftly moved away from these Hadoop and Hive formats to Parquet and ORC, and is pushing farther with the switch to Iceberg and Delta Lake.

Maintaining support for the full breadth of Hadoop/Hive features has been a herculean effort for the past ten years, which we happily undertook because of the vast usage of these systems. However, the usage of these systems has been in decline for years, and the effort to maintain support for them has not been shrinking to match; it is actually growing as the Hadoop/Hive codebases become more difficult to work with. This came to a head as we attempted to add new features like Dynamic Catalogs #12709. The Hadoop/Hive codebases have critical design flaws that make them incompatible with these new features. The only reasonable way to add these features was to decouple from the Hadoop/Hive codebases. This was a massive effort, and again we happily did it because we could finally reduce the effort required to maintain support for Hadoop/Hive, and actually add these amazing new features.

So where do we go from here? For open source, popular, well-maintained formats we will consider adding official support. We may be able to add interfaces to extend the Hive plugin with new file formats and compression codecs. We have never supported extending the Hive plugin by adding jars to the plugin directory, but a few folks did and had varying degrees of success.
@dain Thank you very much for your thoughts and explanation! |
Hello,
We have hive tables that use custom input formats and serdes. We noticed that starting with Trino 423 we're no longer able to query these tables.
The issue seems to be a recent change made to BackgroundHiveSplitLoader.java, where a call to getHiveStorageFormat was introduced; it fails when querying a table with a format not defined in HiveStorageFormat.
We had to make changes to HiveStorageFormat.java to add our custom serde definitions. This is a really concerning change for us. Why is the Hive connector all of a sudden limited to only the formats defined in HiveStorageFormat?
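For illustration, the reported behavior is consistent with a lookup of roughly the following shape. This is a paraphrase of the described check, not Trino's actual code, and `HIVE_UNSUPPORTED_FORMAT` is assumed to be the relevant error code from `HiveErrorCode`.

```java
// Resolve the table's storage format against the fixed HiveStorageFormat
// enum, throwing for anything unknown
static HiveStorageFormat getHiveStorageFormat(StorageFormat format) {
    for (HiveStorageFormat candidate : HiveStorageFormat.values()) {
        if (candidate.getSerde().equals(format.getSerde())
                && candidate.getInputFormat().equals(format.getInputFormat())) {
            return candidate;
        }
    }
    // Custom serdes and input formats fall through to here since Trino 423
    throw new TrinoException(HIVE_UNSUPPORTED_FORMAT,
            "Unsupported storage format: " + format);
}
```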
The documentation page does not reflect that only certain SequenceFile serdes are supported: https://trino.io/docs/current/connector/hive.html
Assuming this change was done by design, what does the roadmap look like for Hive support in Trino?