Fail query when the symlink file contains inexistent paths #19364

Merged 1 commit on Oct 12, 2023

findinpath
Contributor

Description

When a symlink-based Hive table has a symlink.txt file that lists a nonexistent path, fail early with a meaningful exception (similar to what Hive does) instead of failing with this misleading error:

Cannot invoke "io.trino.plugin.hive.fs.TrinoFileStatus.getLength()" because "status" is null
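
For illustration only, the fail-early idea can be sketched in plain Java. This is not the code from this PR: it uses `java.nio.file` on a local file system as a stand-in for the Trino file system API, and the names `SymlinkTargetCheck` and `resolveTargets` as well as the error message wording are hypothetical.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SymlinkTargetCheck {
    // Hypothetical helper: reads the paths listed in a symlink file and
    // verifies each one exists before any further processing happens.
    static List<Path> resolveTargets(Path symlinkFile) throws IOException {
        List<Path> targets = new ArrayList<>();
        for (String line : Files.readAllLines(symlinkFile)) {
            if (line.isBlank()) {
                continue; // ignore empty lines in the symlink file
            }
            Path target = Path.of(line.trim());
            if (!Files.exists(target)) {
                // Fail early and name the offending path, instead of letting
                // a null file status surface later as a NullPointerException.
                throw new FileNotFoundException(
                        "Symlink file " + symlinkFile + " references a path that does not exist: " + target);
            }
            targets.add(target);
        }
        return targets;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("symlink-demo");
        Path good = Files.writeString(dir.resolve("000000_0"), "data");
        Path bad = dir.resolve("000000_0_copy_1_bad_file"); // never created
        Path symlinkFile = Files.write(
                dir.resolve("symlink.txt"), List.of(good.toString(), bad.toString()));
        try {
            resolveTargets(symlinkFile);
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is only that the missing path is reported by name at the moment the symlink file is read, which is what makes the failure actionable.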

Reproduction scenario

Spin up the product test environment:

testing/bin/ptl env up --environment multinode --config config-default --without-trino

Create the tables in Hive:

CREATE TABLE testsimpleparquet (col integer)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

INSERT INTO testsimpleparquet VALUES (1);
INSERT INTO testsimpleparquet VALUES (2);

CREATE TABLE testsymlinkparquet (col integer)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Create a symlink.txt file with the following content:

hdfs://hadoop-master:9000/user/hive/warehouse/testsimpleparquet/000000_0
hdfs://hadoop-master:9000/user/hive/warehouse/testsimpleparquet/000000_0_copy_1_bad_file

The second path, 000000_0_copy_1_bad_file, does not exist.

Copy symlink.txt into the storage location of the testsymlinkparquet table:

[hive@hadoop-master tmp]$ hdfs dfs -copyFromLocal symlink.txt /user/hive/warehouse/testsymlinkparquet

Query in Hive:

0: jdbc:hive2://localhost:10000/default> select * from testsymlinkparquet;

error: java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hadoop-master:9000/user/hive/warehouse/testsimpleparquet/000000_0_copy_1_bad_file (state=,code=0)

Query in Trino:

trino> select * from hive.default.testsymlinkparquet;
Query 20231011_205902_00037_i66g8 failed: Cannot invoke "io.trino.plugin.hive.fs.TrinoFileStatus.getLength()" because "status" is null
io.trino.spi.TrinoException: Cannot invoke "io.trino.plugin.hive.fs.TrinoFileStatus.getLength()" because "status" is null
	at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:294)

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@findinpath findinpath force-pushed the findinpath/hive-symlink-invalid branch from 18038c2 to 98a90ac Compare October 12, 2023 10:49
@findinpath findinpath force-pushed the findinpath/hive-symlink-invalid branch from 98a90ac to 181a536 Compare October 12, 2023 10:50
@findinpath findinpath self-assigned this Oct 12, 2023
@findepi findepi merged commit c7e96a6 into trinodb:master Oct 12, 2023
57 checks passed
@github-actions github-actions bot added this to the 430 milestone Oct 12, 2023
Labels
cla-signed, hive (Hive connector)

4 participants