The new multi-stage query engine (a.k.a V2 query engine) is designed to support more complex SQL semantics such as JOIN
, OVER
window, MATCH_RECOGNIZE
and eventually, make Pinot support closer to full ANSI SQL semantics.
Scatter-Gather Query Engine
Multi-Stage Query Engine
It also resolves the bottleneck effect for the broker reduce stage where only a single machine is dedicated to perform heavy lifting such as high cardinality GROUP BY
result merging; ORDER BY
sorting, etc.
To enable the V2 engine,
-
please make sure to either
- Building Apache Pinot using the latest master commit.
- Download the latest Apache Pinot docker image using the official guide.
-
Please add the following configurations to your cluster config:
-
"pinot.multistage.engine.enabled": "true", "pinot.server.instance.currentDataTableVersion": "4", "pinot.query.server.port": "8421", "pinot.query.runner.port": "8442"
-
-
Start the cluster normally.
NOTE: If the cluster has already been started, please restart all the controller/broker/server components so that they pick up the new cluster config.
-
You should now see the following window in the controller query page:
Sample Query Screenshot
The V2 query engine is still in the beta phase, there might be various performance or feature gaps from the current query engine.
Here are the general troubleshooting steps:
- Try downloading the latest docker image or building from the latest master commit
- We continuously pushes bug fixes for the V2 engine so bugs you encountered might have already been fixed in the latest master build
- Try rewriting your query
- Some of the functions previously supported in the V1 engine might have a new way to express in the new engine. Please check and see if you are using any non-standard SQL functions or semantics.
- Try reducing the size of the table(s) used:
- Adding higher selectivity filters to the tables
- Try executing part of the subquery or a simplified version of the query first.
- This helps to determine the selectivity and scale of the query being executed.
- Try adding more servers
- The new V2 engine runs distributedly across the entire cluster, adding more servers to partitioned queries such as GROUP BY aggregates, equality JOINs help speed up the query runtime.
please report any bugs in Apache Pinot Slack V2 engine feedback channel. Please include:
- the table/schema config(s)
- the cluster config (zookeeper config, and each components config and scale)
- the problematic SQL query string and corresponding ERROR messages.
We are continuously improving the V2 engine. However, since the V2 engine is still in beta-testing phase, there are some limitations to call out:
- Incomplete data type support: multi-value columns and some other non-primitive data types are not supported. For example
SELECT *
with multi-value columns will fail. - The intermediate stages of the V2 engine are running purely on heap memory, thus executing a large table join will cause potential out-of-memory errors
- Because of this, the table scan phase for join queries is limited to 10 million rows.
- Currently, it doesn't incorporate table statistics into plan optimization.
- Currently, it doesn't support complex aggregation functions, such as
COVAR_POP
. - Some functions that lack implementation with
@ScalarFunction
annotation will not be supported in intermediate stages.
For more up-to-date tracking of feature and performance support please follow the Github tracking issues:
- SQL feature tracker: apache/pinot#9223
- Performance and stability tracker: apache/pinot#9273
The overall PEP design doc and discussion can be found in the following links