Support for TIMESTAMP WITH TIME ZONE with nanosecond precision in Parquet #13599
Conversation
What is the end-user visible effect of these changes?
It's now possible to read Parquet files that use the NANOS format (link to spec) with columns of type TIMESTAMP WITH TIME ZONE. Previously Trino simply raised an exception when attempting to query them. I have only tested this on the Hive connector (with #13595), but this presumably applies to Iceberg and Delta Lake, which already have support for TIMESTAMP WITH TIME ZONE.
Force-pushed from cad9528 to 76fee2c
@zielmicha thanks for explaining. What's the schema of the attached parquet file?
I'm not sure what's the best way to get a Parquet schema, but inspecting the attached file prints:
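(The original command and its output were attachments and aren't reproduced here. For reference, a minimal sketch of one way to print a Parquet file's schema with parquet-mr; the file name and the expected output shown in the comment are illustrative, not the original attachment.)

```java
// Sketch only: dump a Parquet file's schema using parquet-mr.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintParquetSchema
{
    public static void main(String[] args)
            throws Exception
    {
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path("issue-5483-nanos.parquet"), new Configuration()))) {
            // For a nanosecond timestamp column this prints a line like:
            //   optional int64 created (TIMESTAMP(NANOS,true));
            System.out.println(reader.getFooter().getFileMetaData().getSchema());
        }
    }
}
```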
Force-pushed from 76fee2c to 4041a7c
The PR name is sort of misleading, as TIMESTAMP WITH TIME ZONE is the Trino type, not Parquet. Try something like "Support read of Parquet nanosecond timestamp column into TIMESTAMP WITH TIME ZONE Trino type".
plugin/trino-hive/src/test/java/io/trino/plugin/hive/parquet/TestTimestampPrecision.java (review comments resolved)
Force-pushed from 4041a7c to 134a9df
Force-pushed from 53588d0 to 05a0efa
@raunaqmorarka can you take a look? If I'm not mistaken @skrzypo987 has requested your review.
@raunaqmorarka will you have time to review this PR? Let me know if I should find another reviewer.
I'll take a look at it early next week
@@ -52,16 +52,16 @@
import static org.apache.hadoop.hive.serde.serdeConstants.SERIALIZATION_LIB;
import static org.assertj.core.api.Assertions.assertThat;
It looks like we don't have a way to write this data via trino-parquet; we should add support for that.
There should be a test in AbstractTestParquetReader similar to testTimestamp but with time zone. It'll need an update to ParquetTester#writeValue to add support for the timestamp with time zone type.
There should also be a test in TestHiveCompatibility once the write path is working.
Writing TIMESTAMP WITH TIME ZONE isn't supported by the Hive connector yet anyway - I'm planning to implement it in the next PR.
lib/trino-parquet/src/main/java/io/trino/parquet/reader/Int64TimestampNanosColumnReader.java (outdated; review comments resolved)
I may not be the best person to review this. @raunaqmorarka could you please take another look?
Force-pushed from 05a0efa to 43fbecd
Force-pushed from 761f519 to 2f2dd5d
@zielmicha could you rebase? @martint @findepi and @raunaqmorarka, could you help move this forward?
@@ -16,16 +16,23 @@
import io.trino.parquet.PrimitiveField;
import io.trino.spi.TrinoException;
import io.trino.spi.block.BlockBuilder;
import io.trino.spi.type.DateTimeEncoding;
You can drop the changes in this file now; the legacy parquet reader code has been removed.
@@ -194,7 +194,11 @@ public ColumnReader create(PrimitiveField field, AggregatedMemoryContext aggrega
if (timestampWithTimeZoneType.isShort()) {
    return createColumnReader(field, valueDecoders::getInt96ToShortTimestampWithTimeZoneDecoder, LONG_ADAPTER, memoryContext);
}
throw unsupportedException(type, field);
return createColumnReader(
In the current Parquet spec, timestamps are supposed to be stored as INT64: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp
We have INT96 support only for compatibility with Apache Hive. Does your use case require INT96 support?
How is the data intended to be produced? The example parquet file in this PR, produced by parquet-cpp, uses INT64.
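For illustration, here's a sketch (not code from this PR) of the spec-recommended INT64 representation, built with parquet-mr's schema Types API; the column name ts is made up.

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class NanosTimestampSchema
{
    public static void main(String[] args)
    {
        MessageType schema = Types.buildMessage()
                // INT64 physical type annotated as a nanosecond-precision instant
                // (isAdjustedToUTC = true), per the Parquet LogicalTypes spec
                .optional(PrimitiveTypeName.INT64)
                .as(LogicalTypeAnnotation.timestampType(true, TimeUnit.NANOS))
                .named("ts")
                .named("schema");
        // Prints the schema, with the column rendered as:
        //   optional int64 ts (TIMESTAMP(NANOS,true));
        System.out.println(schema);
    }
}
```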
I don't myself have a need to process INT96; I implemented this mostly for completeness. In particular, it seems that the Hive connector's default mode is to write timestamps in this format.
I would defer adding INT96 support until the Hive connector gains the ability to write data for this type. I think we also need product tests for checking compatibility with Apache Hive when we add something specifically for Hive.
We already have support for writing TIMESTAMP (without time zone) with INT96 - in fact that is what the test in BaseHiveConnectorTest exercises. If you think we need support for writing TIMESTAMP WITH TIME ZONE, merging this can wait until I'm done with it (I'm planning to add write support).
What exactly would you like to test in the product tests?
lib/trino-parquet/src/main/java/io/trino/parquet/reader/decoders/ValueDecoders.java (outdated; review comments resolved)
return new InlineTransformDecoder<>(
        getLongDecoder(encoding),
        (values, offset, length) -> {
            // decoded values are epochMicros, round to lower precision and convert to packed millis utc value
decoded values are epochNanos
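(In other words, each decoded epochNanos value gets rounded to millisecond precision and packed with the UTC zone key. A minimal sketch of that arithmetic, assuming Trino SPI's DateTimeEncoding.packDateTimeWithZone and UTC_KEY; the PR's exact rounding may differ, e.g. half-up rather than floor.)

```java
import static io.trino.spi.type.DateTimeEncoding.packDateTimeWithZone;
import static io.trino.spi.type.TimeZoneKey.UTC_KEY;

public class ShortTimestampTzSketch
{
    public static void main(String[] args)
    {
        long epochNanos = 1_602_519_962_906_668_123L; // 2020-10-12T16:26:02.906668123Z
        // drop sub-millisecond digits to fit the short TIMESTAMP WITH TIME ZONE encoding
        long epochMillis = Math.floorDiv(epochNanos, 1_000_000);
        long packedMillisUtc = packDateTimeWithZone(epochMillis, UTC_KEY);
        System.out.println(packedMillisUtc);
    }
}
```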
{
    long[] buffer = new long[length];
    delegate.read(buffer, 0, length);
    // decoded values are epochNanos, convert to (packed epochMillisUtc, picosOfMilli)
decoded values are epochNanos, round to lower precision and convert to (packed epochMillisUtc, picosOfMilli)
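(For the long, nanosecond-capable reader, each value splits into packed epoch millis plus picoseconds of the millisecond. A sketch of that conversion, assuming Trino SPI's LongTimestampWithTimeZone.fromEpochMillisAndFraction; again illustrative, not the PR's exact code.)

```java
import io.trino.spi.type.LongTimestampWithTimeZone;
import static io.trino.spi.type.TimeZoneKey.UTC_KEY;

public class LongTimestampTzSketch
{
    public static void main(String[] args)
    {
        long epochNanos = 1_602_519_962_906_668_123L; // 2020-10-12T16:26:02.906668123Z
        long epochMillis = Math.floorDiv(epochNanos, 1_000_000);
        // 668123 nanos remaining within the millisecond -> 668123000 picos
        int picosOfMilli = (int) (Math.floorMod(epochNanos, 1_000_000) * 1_000);
        LongTimestampWithTimeZone value = LongTimestampWithTimeZone.fromEpochMillisAndFraction(
                epochMillis, picosOfMilli, UTC_KEY.getKey());
        System.out.println(value);
    }
}
```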
{HiveTimestampPrecision.MILLISECONDS, LocalDateTime.parse("2020-10-12T16:26:02.907"), "issue-5483.parquet"},
{HiveTimestampPrecision.MICROSECONDS, LocalDateTime.parse("2020-10-12T16:26:02.906668"), "issue-5483.parquet"},
{HiveTimestampPrecision.NANOSECONDS, LocalDateTime.parse("2020-10-12T16:26:02.906668"), "issue-5483.parquet"},
{HiveTimestampPrecision.MILLISECONDS, LocalDateTime.parse("2020-10-12T16:26:02.907"), "issue-5483-nanos.parquet"},
Why is the new file named issue-5483-nanos.parquet? Not sure if it's strictly related to issue 5483. Maybe name it timestamp-nanos.parquet.
Force-pushed from 9120e07 to 2e9a4b1
Please rebase to latest master
Force-pushed from b41dc13 to 47990db
Force-pushed from eb5899f to 3cb2abc
Force-pushed from 3cb2abc to b019ac2
@raunaqmorarka @zielmicha Any progress on this? At Iceberg we're adding support for nanosecond precision: apache/iceberg#8683
@zielmicha @martint @raunaqmorarka could we rebase this and move towards merge?
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua
@zielmicha @martint @raunaqmorarka I know this PR has been around a long time. From what I understand it would still be great to get this feature in for @zielmicha .. what are the next steps to enable that? Rebase probably.. anything else?
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.
Reopening since I believe we want this feature added based on past discussions. I will leave @martint to decide on next steps.
Description
Adds support for reading a Parquet nanosecond timestamp column into the TIMESTAMP WITH TIME ZONE Trino type.
This is a fix or a new feature, depending on how you look at it - support for TIMESTAMP WITH TIME ZONE already existed for millis and micros precision.
This is a change to a library used by some connectors (Hive, Delta Lake, Iceberg), but in practice it only changes anything for the Hive connector (other connectors don't support nanosecond TIMESTAMP WITH TIME ZONE).
For the Hive connector, the change adds support for reading into TIMESTAMP WITH TIME ZONE columns from Parquet files that contain timestamp columns formatted in a certain way (storing the timestamp as a number of nanoseconds).
Related issues, pull requests, and links
Documentation
(I think no documentation is needed)
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: