[CARBONDATA-4263]support query with latestSegment #4189
base: master
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/211/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5806/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4063/
6a00ec2 to c7d4143
retest this please
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4079/
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/227/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5824/
c7d4143 to 81b25c3
retest this please
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5825/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4080/
retest this please
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5826/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4081/
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/229/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4086/
litao, please check whether this solution can be made generic by using the UDF 'insegment' that is already available in the code and exposing it to the user in the query statement, rather than adding a config property. @kunal642 @jackylk @ajantha-bhat @akashrn5 @QiangCai
81b25c3 to abc5876
retest this please
Hi brijoo, I checked the SEGMENT MANAGEMENT doc. That capability does not meet our needs, and I see no way around adding a table configuration. Segment management is configured with SET, but not all tables need to query the latest segment, and the business users do not know whether a query should use the latest segment or all segments. So I can't think of any other method except specifying the configuration when creating a table.
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5833/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4089/
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/236/
* @param validSegments the input segment for search
* @return the latest segment for query
*/
public List<Segment> getLatestSegment(List<Segment> validSegments) {
If we need a single segment, why is the return type List?
To be consistent with the external interfaces. In addition, if several latest segments are required later, the return type stays consistent.
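For readers following this thread, here is a minimal sketch of a method that returns the single latest segment wrapped in a List, as the reply describes. It is not the code from this PR: `SegmentStub` is a hypothetical stand-in for CarbonData's `Segment`, and choosing "latest" by the largest `loadStartTime` is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for CarbonData's Segment; only the fields needed here.
final class SegmentStub {
  final String segmentNo;
  final long loadStartTime;

  SegmentStub(String segmentNo, long loadStartTime) {
    this.segmentNo = segmentNo;
    this.loadStartTime = loadStartTime;
  }
}

final class LatestSegmentSketch {
  // Returns a list holding at most one element so that callers which
  // already work with List<Segment> need no changes.
  static List<SegmentStub> getLatestSegment(List<SegmentStub> validSegments) {
    if (validSegments == null || validSegments.isEmpty()) {
      return Collections.emptyList();
    }
    // Assumption: the latest segment is the one with the largest load start time.
    SegmentStub latest = Collections.max(
        validSegments,
        Comparator.comparingLong((SegmentStub s) -> s.loadStartTime));
    return Collections.singletonList(latest);
  }

  public static void main(String[] args) {
    List<SegmentStub> segments = new ArrayList<>();
    segments.add(new SegmentStub("0", 100L));
    segments.add(new SegmentStub("1", 300L));
    segments.add(new SegmentStub("2", 200L));
    System.out.println(getLatestSegment(segments).get(0).segmentNo);
  }
}
```

Returning a list keeps the call sites uniform, at the cost of a less precise signature, which is the trade-off the reviewer is asking about.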
*/
public Segment[] getSegmentsToAccess(JobContext job, ReadCommittedScope readCommittedScope,
List<Segment> validSegments) {
String segmentString = job.getConfiguration().get(INPUT_SEGMENT_NUMBERS, "");
Call getSegmentsToAccess(JobContext job, ReadCommittedScope readCommittedScope) to get the segments set in the configuration, instead of writing the code again.
The old getSegmentsToAccess function only uses INPUT_SEGMENT_NUMBERS as input to get the segment list.
But now the segments depend not only on INPUT_SEGMENT_NUMBERS but also on the latest-segment setting, so validSegments is needed.
If we used getSegmentsToAccess(JobContext job, ReadCommittedScope readCommittedScope), we would have to resolve readCommittedScope into validSegments again, which the calling functions have already done.
So I chose a function overload to implement this.
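One way the two positions in this thread could be reconciled is an overload that delegates to the existing method for the configured IDs and only adds the latest-segment step. The sketch below assumes this and is not the PR's code: segments are plain String IDs, a `Map` stands in for the job configuration, the key strings are hypothetical, and "latest" is assumed to be the last element of `validSegments`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SegmentsToAccessSketch {
  // Hypothetical configuration keys, for illustration only.
  static final String INPUT_SEGMENT_NUMBERS = "input.segment.numbers";
  static final String QUERY_LATEST_SEGMENT = "query_latest_segment";

  // Existing-style method: returns only the segment IDs named in the configuration.
  static List<String> getSegmentsToAccess(Map<String, String> conf) {
    String segmentString = conf.getOrDefault(INPUT_SEGMENT_NUMBERS, "");
    List<String> ids = new ArrayList<>();
    if (segmentString.trim().isEmpty()) {
      return ids;
    }
    for (String id : segmentString.split(",")) {
      ids.add(id.trim());
    }
    return ids;
  }

  // Overload: reuses the method above, then narrows the already-resolved
  // validSegments and optionally keeps only the latest one.
  static List<String> getSegmentsToAccess(Map<String, String> conf,
      List<String> validSegments) {
    List<String> configured = getSegmentsToAccess(conf);
    List<String> candidates = new ArrayList<>();
    for (String id : validSegments) {
      // An empty configured list means "no restriction".
      if (configured.isEmpty() || configured.contains(id)) {
        candidates.add(id);
      }
    }
    boolean latestOnly = Boolean.parseBoolean(conf.get(QUERY_LATEST_SEGMENT));
    if (latestOnly && !candidates.isEmpty()) {
      // Assumption: validSegments is ordered oldest to newest.
      return Collections.singletonList(candidates.get(candidates.size() - 1));
    }
    return candidates;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(QUERY_LATEST_SEGMENT, "true");
    List<String> valid = new ArrayList<>();
    valid.add("0");
    valid.add("1");
    valid.add("2");
    System.out.println(getSegmentsToAccess(conf, valid));
  }
}
```

The overload avoids re-deriving the valid segments from the read-committed scope, which is the author's stated reason, while still not duplicating the configuration parsing, which is the reviewer's concern.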
Can we check why "overwrite data" is much slower than "load data"?
Obviously, overwrite needs to check the data in the segments and load does not, so the two differ greatly.
If we can fix the performance issue of load overwrite, does it satisfy your requirement?
Yes, if we can make the "insert overwrite" command as quick as the "load" command, it solves the problem.
I suggest we locate the performance issue in INSERT OVERWRITE and fix it first, instead of creating a patch solution that we may remove later and that may create a compatibility problem.
I agree with jacky
I had a discussion with @MarvinLitt and it seems that the performance issue in OVERWRITE is related to the environment and after the environment is fixed, the performance issue/degradation is not observed. @MarvinLitt to discuss in community if the requirement is needed.
The load overwrite command produces the same result as this PR. I just tested it, and the performance of load and load overwrite is the same.
@MarvinLitt You can raise a JIRA for the insert overwrite performance issue so that someone in the community can pick it up. Please close this PR, as your scenario can be handled through Load overwrite for now.
Support querying only the latest segment when the table's TBLPROPERTIES include "query_latest_segment".
Why is this PR needed?
Some scenarios:
The number of data rows does not change.
The data in each column keeps increasing and changing.
In this scenario, it is faster to load the full data each time using the load command.
In this case, a query only needs to read the latest segment.
We need a way to make a table behave like this.
What changes were proposed in this PR?
Add a new table property: query_latest_segment
When set to 'true', the query reads only the latest segment.
When set to 'false' or not set, there is no impact.
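The three cases above ('true', 'false', not set) can be sketched as a small property check. This is an illustration under assumptions, not the PR's implementation: the table properties are modelled as a plain `Map<String, String>`, and the class and method names are hypothetical. Only the property name `query_latest_segment` comes from this PR.

```java
import java.util.HashMap;
import java.util.Map;

final class QueryLatestSegmentProperty {
  // Property name proposed in this PR.
  static final String QUERY_LATEST_SEGMENT = "query_latest_segment";

  // Returns true only when the property is present and equals "true"
  // (ignoring case). A missing property or any other value means
  // "no impact": the query reads all valid segments as before.
  static boolean isEnabled(Map<String, String> tableProperties) {
    return Boolean.parseBoolean(tableProperties.get(QUERY_LATEST_SEGMENT));
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    System.out.println(isEnabled(props)); // property not set

    props.put(QUERY_LATEST_SEGMENT, "true");
    System.out.println(isEnabled(props));

    props.put(QUERY_LATEST_SEGMENT, "false");
    System.out.println(isEnabled(props));
  }
}
```

`Boolean.parseBoolean` returns false for a null argument, so the unset case needs no separate branch.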
Does this PR introduce any user interface change?
Is any new testcase added?