
Support for TIMESTAMP WITH TIME ZONE with nanosecond precision in Parquet #13599

Open · wants to merge 1 commit into master

Conversation

@zielmicha (Member) commented Aug 10, 2022

Description

Adds support for reading a Parquet nanosecond timestamp column into the TIMESTAMP WITH TIME ZONE Trino type.

Is this change a fix, improvement, new feature, refactoring, or other?

This is a fix or a new feature, depending on how you look at it: support for TIMESTAMP WITH TIME ZONE already existed for millisecond and microsecond precision.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

This is a change to a library used by several connectors (Hive, Delta Lake, Iceberg), but in practice it only changes behavior for the Hive connector (the other connectors don't support nanosecond TIMESTAMP WITH TIME ZONE).

How would you describe this change to a non-technical end user or system administrator?

For the Hive connector, the change adds support for reading TIMESTAMP WITH TIME ZONE columns from Parquet files whose timestamp columns are stored as a number of nanoseconds since the epoch.

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Support for TIMESTAMP WITH TIME ZONE with nanosecond precision in Parquet

@cla-bot cla-bot bot added the cla-signed label Aug 10, 2022
@findepi (Member) commented Aug 10, 2022

What is the end-user visible effect of these changes?

@zielmicha (Member, Author) replied:

> What is the end-user visible effect of these changes?

It's now possible to read Parquet files that use the NANOS timestamp format (link to spec) into columns of type TIMESTAMP WITH TIME ZONE. Previously, Trino simply raised an exception when attempting to query them.

I have only tested this on the Hive connector (with #13595), but this presumably applies to Iceberg and Delta Lake, which already have support for TIMESTAMP WITH TIME ZONE.
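For context, the Parquet NANOS annotation stores each value as a signed int64 count of nanoseconds since the Unix epoch in UTC. A minimal Python sketch of decoding such a value with only the standard library (the literal below is illustrative, matching the timestamp in the attached test file; Python's `datetime` truncates to microseconds):

```python
from datetime import datetime, timedelta, timezone

def nanos_to_datetime(epoch_nanos: int) -> datetime:
    """Convert an int64 Parquet TIMESTAMP(NANOS) value to a UTC datetime.

    Python's datetime only carries microsecond precision, so the last
    three nanosecond digits are truncated here.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=epoch_nanos // 1_000)

# Roughly the value stored in the single row of the attached test file:
ts = nanos_to_datetime(1_602_519_962_906_668_000)
print(ts.isoformat())  # 2020-10-12T16:26:02.906668+00:00
```

This matches the min/max value shown in the `parquet meta` output further down in the thread.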

@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch from cad9528 to 76fee2c Compare August 10, 2022 14:53
@findepi (Member) commented Aug 10, 2022

@zielmicha thanks for explaining. What's the schema of the attached parquet file?

@zielmicha (Member, Author) commented:

I'm not sure what the best way to get a Parquet schema is, but:

import pyarrow.parquet as pq
pq.read_table('issue-5483-nanos.parquet').schema

prints

created: timestamp[ns]

@findepi (Member) commented Aug 10, 2022

$ parquet meta plugin/trino-hive/src/test/resources/issue-5483-nanos.parquet

File path:  plugin/trino-hive/src/test/resources/issue-5483-nanos.parquet
Created by: parquet-cpp-arrow version 5.0.0
Properties: (none)
Schema:
message schema {
  optional int64 created (TIMESTAMP(NANOS,false));
}


Row group 0:  count: 1  96.00 B records  start: 4  total(compressed): 96 B total(uncompressed):92 B
--------------------------------------------------------------------------------
         type      encodings count     avg size   nulls   min / max
created  INT64     S _ R     1         96.00 B    0       "2020-10-12T16:26:02.90666..." / "2020-10-12T16:26:02.90666..."

@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch from 76fee2c to 4041a7c Compare August 10, 2022 15:27
@skrzypo987 (Member) left a comment

The PR name is somewhat misleading, as TIMESTAMP WITH TIME ZONE is the Trino type, not a Parquet type.
Try something like "Support read of Parquet nanosecond timestamp column into TIMESTAMP WITH TIME ZONE Trino type"

@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch from 4041a7c to 134a9df Compare August 17, 2022 18:25
@github-actions github-actions bot added the docs label Aug 17, 2022
@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch 3 times, most recently from 53588d0 to 05a0efa Compare August 22, 2022 14:27
@zielmicha (Member, Author) commented:

@raunaqmorarka can you take a look? If I'm not mistaken @skrzypo987 has requested your review.

@zielmicha (Member, Author) commented:

@raunaqmorarka will you have time to review this PR? Let me know if I should find another reviewer.

@raunaqmorarka (Member) replied:

> @raunaqmorarka will you have time to review this PR? Let me know if I should find another reviewer.

I'll take a look at it early next week

@@ -52,16 +52,16 @@
import static org.apache.hadoop.hive.serde.serdeConstants.SERIALIZATION_LIB;
import static org.assertj.core.api.Assertions.assertThat;

Reviewer (Member) commented on the diff:

It looks like we don't have a way to write this data via trino-parquet; we should add support for that.
There should be a test in AbstractTestParquetReader similar to testTimestamp but with time zone. It'll need an update to ParquetTester#writeValue to add support for the timestamp with time zone type.
There should also be a test in TestHiveCompatibility once the write path is working.

@zielmicha (Member, Author) replied:

Writing TIMESTAMP WITH TIME ZONE isn't supported by the Hive connector yet anyway; I'm planning to implement that in the next PR.

@arhimondr (Contributor) commented:

I may not be the best person to review this. @raunaqmorarka could you please take another look?

@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch from 05a0efa to 43fbecd Compare May 16, 2023 08:25
@github-actions github-actions bot added the hive Hive connector label May 16, 2023
@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch 2 times, most recently from 761f519 to 2f2dd5d Compare May 17, 2023 18:27
@mosabua (Member) commented Jul 10, 2023

@zielmicha could you rebase?

@martint @findepi and @raunaqmorarka .. could you help move this forward?

@@ -16,16 +16,23 @@
import io.trino.parquet.PrimitiveField;
import io.trino.spi.TrinoException;
import io.trino.spi.block.BlockBuilder;
import io.trino.spi.type.DateTimeEncoding;
Reviewer (Member) commented on the diff:

You can drop the changes in this file now; the legacy Parquet reader code has been removed.

@@ -194,7 +194,11 @@ public ColumnReader create(PrimitiveField field, AggregatedMemoryContext aggrega
if (timestampWithTimeZoneType.isShort()) {
return createColumnReader(field, valueDecoders::getInt96ToShortTimestampWithTimeZoneDecoder, LONG_ADAPTER, memoryContext);
}
throw unsupportedException(type, field);
return createColumnReader(
Reviewer (Member) commented on the diff:

In the current Parquet spec, timestamps are supposed to be stored as INT64: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp
We have INT96 support only for compatibility with Apache Hive. Does your use case require INT96 support?
How is the data intended to be produced? The example Parquet file in this PR, produced by parquet-cpp, uses INT64.

@zielmicha (Member, Author) replied:

I don't myself have a need to process INT96; I implemented this mostly for completeness. In particular, the default mode of the Hive connector seems to write timestamps in this format.

Reviewer (Member) replied:

I would defer adding INT96 support until the Hive connector gains the ability to write data for this type. I think we also need product tests for checking compatibility with Apache Hive when we add something specifically for Hive.

@zielmicha (Member, Author) replied:

We already have support for writing TIMESTAMP (without time zone) as INT96 - in fact, this is what the test in BaseHiveConnectorTest exercises. If you think we need write support for TIMESTAMP WITH TIME ZONE, merging this can wait until I'm done with it (I'm planning to add write support).

What exactly would you like to test in the product tests?

return new InlineTransformDecoder<>(
getLongDecoder(encoding),
(values, offset, length) -> {
// decoded values are epochMicros, round to lower precision and convert to packed millis utc value
Reviewer (Member) commented on the diff:
decoded values are epochNanos

{
long[] buffer = new long[length];
delegate.read(buffer, 0, length);
// decoded values are epochNanos, convert to (packed epochMillisUtc, picosOfMilli)
Reviewer (Member) commented on the diff:

decoded values are epochNanos, round to lower precision and convert to (packed epochMillisUtc, picosOfMilli)
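The conversion these comments describe boils down to splitting epoch nanoseconds into Trino's long TIMESTAMP WITH TIME ZONE representation: epoch milliseconds (later packed together with a UTC zone key) plus picoseconds within that millisecond. A sketch of the arithmetic in Python; the function name is made up for illustration, and the actual implementation is Java:

```python
NANOS_PER_MILLI = 1_000_000
PICOS_PER_NANO = 1_000

def split_epoch_nanos(epoch_nanos: int) -> tuple[int, int]:
    """Split epoch nanoseconds into (epochMillisUtc, picosOfMilli).

    Floor division keeps the fractional part non-negative, so
    pre-1970 (negative) values stay consistent."""
    epoch_millis = epoch_nanos // NANOS_PER_MILLI
    picos_of_milli = (epoch_nanos % NANOS_PER_MILLI) * PICOS_PER_NANO
    return epoch_millis, picos_of_milli

# 2020-10-12T16:26:02.906668 UTC as epoch nanoseconds:
millis, picos = split_epoch_nanos(1_602_519_962_906_668_000)
print(millis, picos)  # 1602519962906 668000000
```

The short-timestamp path discussed above instead rounds the nanoseconds down to millisecond precision and packs only the millis value, which is why the reviewer asks for "round to lower precision" in the comment.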

{HiveTimestampPrecision.MILLISECONDS, LocalDateTime.parse("2020-10-12T16:26:02.907"), "issue-5483.parquet"},
{HiveTimestampPrecision.MICROSECONDS, LocalDateTime.parse("2020-10-12T16:26:02.906668"), "issue-5483.parquet"},
{HiveTimestampPrecision.NANOSECONDS, LocalDateTime.parse("2020-10-12T16:26:02.906668"), "issue-5483.parquet"},
{HiveTimestampPrecision.MILLISECONDS, LocalDateTime.parse("2020-10-12T16:26:02.907"), "issue-5483-nanos.parquet"},
Reviewer (Member) commented on the diff:

Why is the new file named issue-5483-nanos.parquet? It's not clear that it's strictly related to issue 5483.
Maybe name it timestamp-nanos.parquet

@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch 2 times, most recently from 9120e07 to 2e9a4b1 Compare August 16, 2023 22:28
@raunaqmorarka (Member) commented:

Please rebase to latest master

@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch 3 times, most recently from b41dc13 to 47990db Compare August 23, 2023 19:42
@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch 3 times, most recently from eb5899f to 3cb2abc Compare August 28, 2023 15:12
@zielmicha zielmicha force-pushed the parquet-nanos-zoned-timestamp-fix branch from 3cb2abc to b019ac2 Compare August 28, 2023 15:19
@Fokko (Contributor) commented Nov 13, 2023

@raunaqmorarka @zielmicha Any progress on this? At Iceberg we're adding support for nanosecond precision: apache/iceberg#8683

@mosabua (Member) commented Jan 12, 2024

@zielmicha @martint @raunaqmorarka could we rebase this and move towards merge?

github-actions bot commented Sep 4, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Sep 4, 2024
@mosabua (Member) commented Sep 18, 2024

@zielmicha @martint @raunaqmorarka I know this PR has been around a long time. From what I understand, it would still be great to get this feature in for @zielmicha. What are the next steps to enable that? Rebase, probably; anything else?

@github-actions github-actions bot removed the stale label Sep 19, 2024
github-actions bot commented:

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Oct 11, 2024
github-actions bot commented Nov 4, 2024

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Nov 4, 2024
@mosabua mosabua added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Nov 4, 2024
@mosabua mosabua reopened this Nov 4, 2024
@mosabua (Member) commented Nov 4, 2024

Reopening since I believe we want this feature added based on past discussions. I will leave @martint to decide on next steps.

Labels: cla-signed · delta-lake (Delta Lake connector) · docs · hive (Hive connector) · stale-ignore
7 participants