Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for fetching Redshift query results using Redshift unload command #24117

Merged
merged 3 commits into from
Dec 31, 2024

Conversation

mayankvadariya
Copy link
Contributor

@mayankvadariya mayankvadariya commented Nov 13, 2024

Description

Redshift supports unloading select query results to S3 using UNLOAD command. PR aims to use unload command to unload Redshift query results on S3 and to read generated redshift query results files in parallel.

Official documentation on Redshift unload command:
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD_command_examples.html

Connector internally converts Trino SQL to Redshift's unload SQL and runs it. Resultant Parquet files containing query results will be read in parallel to generate Trino query result. Only Parquet file format is supported in Redshift unload command.

Below are certain limitations for which Trino internally force fallback to traditional JDBC even if connector is configured to use unload.

  • select query containing TimeType and VarbinaryType column type is unsupported as unload doesn't support unloading to Parquet format.
  • select query ending with limit clause is unsupported as unload command doesn't support limit clause.
  • select query containing filter condition on DecimalType column is unsupported with unload as unload query is generated by Trino doesn't add precision to numeric cast function.
  • select query projecting static literals.

Benchmarking

Setup information:

Table Schema Records Data size(agg. parquet file size)
nation tpch.sf1000 25 3.5K
supplier tpch.sf1000 10000000 1.4G
customer tpch.sf1000 150000000 22.8G

Coordinator size: m6g.2xlarge
Worker size: 8 X m7g.8xlarge

Redshift cluster size: 2(nodes) X dc2.large

Benchmarking stats

UNLOAD performs better as the volume of query results increases .

Query Query text Average JDBC flow execution time(ms) Average UNLOAD flow execution time(ms)
q01 SELECT checksum(custkey), checksum(name), checksum(address), checksum(nationkey), checksum(phone), checksum(acctbal), checksum(mktsegment), checksum(comment) FROM "customer"; 622,571 240,893
q02 SELECT checksum(suppkey), checksum(name), checksum(address), checksum(nationkey), checksum(phone), checksum(acctbal), checksum(comment) FROM "supplier"; 35,618 17,327
q03 SELECT checksum(nationkey), checksum(name), checksum(regionkey), checksum(comment) FROM "nation"; 331 398

I'd like to thank @raunaqmorarka for his constant support in helping me land this PR.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Redshift
* Improve performance of reading from Redshift tables using `UNLOAD` command on Redshift. ({issue}`24117`)

@cla-bot cla-bot bot added the cla-signed label Nov 13, 2024
@github-actions github-actions bot added the docs label Nov 14, 2024
@mayankvadariya mayankvadariya force-pushed the mayank/redshift-unload branch 4 times, most recently from f5be0a8 to 7b7d095 Compare November 18, 2024 22:00
@mayankvadariya mayankvadariya marked this pull request as ready for review November 19, 2024 14:20
@mayankvadariya mayankvadariya force-pushed the mayank/redshift-unload branch 2 times, most recently from a4868fc to a52f29b Compare November 26, 2024 22:35
@github-actions github-actions bot added hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Nov 26, 2024
@mayankvadariya mayankvadariya force-pushed the mayank/redshift-unload branch 2 times, most recently from 4be8ea2 to e33c9a4 Compare November 27, 2024 03:25
@mayankvadariya
Copy link
Contributor Author

https://github.com/trinodb/trino/actions/runs/12462869117/job/34784444202?pr=24359 Redshift tests are passing.
Additionally, added benchmarking stats in PR desc

@@ -64,6 +64,32 @@ documentation](https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-configura
```{include} jdbc-authentication.fragment
```

### UNLOAD configuration

This feature enables using Amazon S3 to efficiently transfer data out of Redshift
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe what this actually does for user .. seems like it is only a performance improvement .. right? And if so .. why is it called unload?

Copy link
Contributor Author

@mayankvadariya mayankvadariya Dec 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature uses UNLOAD(https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html) approach for performance improvements. Additionally, it requires additional configs to enable this feature. Please suggest text improvements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So lets move that into the performance section and explain more there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we need to explain that they need a S3 account and whatever else .. and when this should and should not be used.. I am not aware of this info so I cant really reword appropriately without more details

Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

mayankvadariya and others added 2 commits December 27, 2024 14:20
Previously SplitOperatorInfo wasn't Mergeable and hence base
OperatorInfo(OperatorStats#getMergeableInfoOrNull) was null.
Co-authored-by: Mayank Vadariya <mayank.vadariya@starburstdata.com>
@mayankvadariya mayankvadariya force-pushed the mayank/redshift-unload branch 2 times, most recently from d0a5473 to f8fdfb1 Compare December 31, 2024 15:54
@raunaqmorarka raunaqmorarka merged commit b382e03 into trinodb:master Dec 31, 2024
97 of 98 checks passed
@github-actions github-actions bot added this to the 469 milestone Dec 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector docs hive Hive connector hudi Hudi connector iceberg Iceberg connector performance
Development

Successfully merging this pull request may close these issues.

9 participants