diff --git a/docs/src/main/sphinx/connector.md b/docs/src/main/sphinx/connector.md
index 3b86e28f3d56..c741fdedb547 100644
--- a/docs/src/main/sphinx/connector.md
+++ b/docs/src/main/sphinx/connector.md
@@ -29,6 +29,7 @@ MariaDB
 Memory
 MongoDB
 MySQL
+OpenSearch
 Oracle
 Phoenix
 Pinot
diff --git a/docs/src/main/sphinx/connector/opensearch.md b/docs/src/main/sphinx/connector/opensearch.md
new file mode 100644
index 000000000000..fc81680fb3b5
--- /dev/null
+++ b/docs/src/main/sphinx/connector/opensearch.md
@@ -0,0 +1,446 @@
+# OpenSearch connector
+
+```{raw} html
+
+```
+
+The OpenSearch connector allows access to [OpenSearch](https://opensearch.org/) data from Trino.
+This document describes how to set up the OpenSearch connector to run SQL queries against OpenSearch.
+
+:::{note}
+OpenSearch (1.1.0 or later) is required.
+:::
+
+## Configuration
+
+To configure the OpenSearch connector, create a catalog properties file
+`etc/catalog/example.properties` with the following contents, replacing the
+properties as appropriate for your setup:
+
+```text
+connector.name=opensearch
+opensearch.host=localhost
+opensearch.port=9200
+opensearch.default-schema-name=default
+```
+
+### Configuration properties
+
+:::{list-table} OpenSearch configuration properties
+:widths: 35, 55, 10
+:header-rows: 1
+
+* - Property name
+  - Description
+  - Default
+* - `opensearch.host`
+  - The comma-separated list of host names for the OpenSearch node to connect
+    to. This property is required.
+  -
+* - `opensearch.port`
+  - Port of the OpenSearch node to connect to.
+  - `9200`
+* - `opensearch.default-schema-name`
+  - The schema that contains all tables defined without a qualifying schema
+    name.
+  - `default`
+* - `opensearch.scroll-size`
+  - Sets the maximum number of hits that can be returned with each OpenSearch
+    scroll request.
+  - `1000`
+* - `opensearch.scroll-timeout`
+  - Amount of time OpenSearch keeps the
+    [search context](https://opensearch.org/docs/latest/api-reference/scroll/)
+    alive for scroll requests.
+  - `1m`
+* - `opensearch.request-timeout`
+  - Timeout value for all OpenSearch requests.
+  - `10s`
+* - `opensearch.connect-timeout`
+  - Timeout value for all OpenSearch connection attempts.
+  - `1s`
+* - `opensearch.backoff-init-delay`
+  - The minimum duration between backpressure retry attempts for a single
+    request to OpenSearch. Setting it too low might overwhelm an already
+    struggling OpenSearch cluster.
+  - `500ms`
+* - `opensearch.backoff-max-delay`
+  - The maximum duration between backpressure retry attempts for a single
+    request to OpenSearch.
+  - `20s`
+* - `opensearch.max-retry-time`
+  - The maximum duration across all retry attempts for a single request to
+    OpenSearch.
+  - `20s`
+* - `opensearch.node-refresh-interval`
+  - How often the list of available OpenSearch nodes is refreshed.
+  - `1m`
+* - `opensearch.ignore-publish-address`
+  - Disables using the address published by OpenSearch to connect for
+    queries.
+  -
+:::
+
+## TLS security
+
+The OpenSearch connector provides additional security options to support
+OpenSearch clusters that have been configured to use TLS.
+
+If your cluster has globally-trusted certificates, you should only need to
+enable TLS. If you require custom configuration for certificates, the connector
+supports key stores and trust stores in PEM or Java Key Store (JKS) format.
+
+The allowed configuration values are:
+
+:::{list-table} TLS Security Properties
+:widths: 40, 60
+:header-rows: 1
+
+* - Property name
+  - Description
+* - `opensearch.tls.enabled`
+  - Enables TLS security.
+* - `opensearch.tls.keystore-path`
+  - The path to the [PEM](/security/inspect-pem) or [JKS](/security/inspect-jks)
+    key store.
+* - `opensearch.tls.truststore-path`
+  - The path to the [PEM](/security/inspect-pem) or [JKS](/security/inspect-jks)
+    trust store.
+* - `opensearch.tls.keystore-password`
+  - The key password for the key store specified by
+    `opensearch.tls.keystore-path`.
+* - `opensearch.tls.truststore-password`
+  - The key password for the trust store specified by
+    `opensearch.tls.truststore-path`.
+* - `opensearch.tls.verify-hostnames`
+  - Flag to determine if the hostnames in the certificates must be verified. Defaults
+    to `true`.
+:::
+
+(opensearch-type-mapping)=
+
+## Type mapping
+
+Because Trino and OpenSearch each support types that the other does not, this
+connector {ref}`maps some types ` when reading data.
+
+### OpenSearch type to Trino type mapping
+
+The connector maps OpenSearch types to the corresponding Trino types
+according to the following table:
+
+:::{list-table} OpenSearch type to Trino type mapping
+:widths: 30, 30, 50
+:header-rows: 1
+
+* - OpenSearch type
+  - Trino type
+  - Notes
+* - `BOOLEAN`
+  - `BOOLEAN`
+  -
+* - `DOUBLE`
+  - `DOUBLE`
+  -
+* - `FLOAT`
+  - `REAL`
+  -
+* - `BYTE`
+  - `TINYINT`
+  -
+* - `SHORT`
+  - `SMALLINT`
+  -
+* - `INTEGER`
+  - `INTEGER`
+  -
+* - `LONG`
+  - `BIGINT`
+  -
+* - `KEYWORD`
+  - `VARCHAR`
+  -
+* - `TEXT`
+  - `VARCHAR`
+  -
+* - `DATE`
+  - `TIMESTAMP`
+  - For more information, see [](opensearch-date-types).
+* - `IP`
+  - `IPADDRESS`
+  -
+:::
+
+No other types are supported.
+
+(opensearch-array-types)=
+
+### Array types
+
+Fields in OpenSearch can contain [zero or more values](https://opensearch.org/docs/latest/field-types/supported-field-types/index/#arrays),
+but there is no dedicated array type. To indicate that a field contains an array, annotate it in a Trino-specific structure in
+the [\_meta](https://opensearch.org/docs/latest/field-types/index/#get-a-mapping) section of the index mapping.
+
+For example, you can have an OpenSearch index that contains documents with the following structure:
+
+```json
+{
+    "array_string_field": ["trino","the","lean","machine-ohs"],
+    "long_field": 314159265359,
+    "id_field": "564e6982-88ee-4498-aa98-df9e3f6b6109",
+    "timestamp_field": "1987-09-17T06:22:48.000Z",
+    "object_field": {
+        "array_int_field": [86,75,309],
+        "int_field": 2
+    }
+}
+```
+
+The array fields of this structure can be defined by using the following command to add the field
+property definition to the `_meta.trino` property of the target index mapping.
+
+```shell
+curl --request PUT \
+    --url localhost:9200/doc/_mapping \
+    --header 'content-type: application/json' \
+    --data '
+{
+    "_meta": {
+        "trino":{
+            "array_string_field":{
+                "isArray":true
+            },
+            "object_field":{
+                "array_int_field":{
+                    "isArray":true
+                }
+            }
+        }
+    }
+}'
+```
+
+:::{note}
+You cannot use the `asRawJson` and `isArray` flags simultaneously for the same column.
+:::
+
+(opensearch-date-types)=
+
+### Date types
+
+OpenSearch supports a wide array of [date] formats, including
+[built-in date formats] and [custom date formats].
+The OpenSearch connector supports only the default `date` type. All other
+date formats, including [built-in date formats] and [custom date formats], are
+not supported. Dates with the [format] property are ignored.
+
+### Raw JSON transform
+
+Documents in OpenSearch often have more complex
+structures that are not represented in the mapping. For example, a single
+`keyword` field can have widely different content including a single
+`keyword` value, an array, or a multidimensional `keyword` array with any
+level of nesting.
+
+```shell
+curl --request PUT \
+    --url localhost:9200/doc/_mapping \
+    --header 'content-type: application/json' \
+    --data '
+{
+    "properties": {
+        "array_string_field":{
+            "type": "keyword"
+        }
+    }
+}'
+```
+
+Notice for the `array_string_field` that all the following documents are legal
+for OpenSearch. See the [OpenSearch array documentation](https://opensearch.org/docs/latest/field-types/supported-field-types/index/#arrays)
+for more details.
+
+```json
+[
+    {
+        "array_string_field": "trino"
+    },
+    {
+        "array_string_field": ["trino","is","the","besto"]
+    },
+    {
+        "array_string_field": ["trino",["is","the","besto"]]
+    },
+    {
+        "array_string_field": ["trino",["is",["the","besto"]]]
+    }
+]
+```
+
+Further, OpenSearch supports types, such as
+[k-NN vector](https://opensearch.org/docs/latest/field-types/supported-field-types/knn-vector/),
+that are not supported in Trino. New types are constantly emerging, which can
+cause parsing exceptions for users of these types in OpenSearch. To
+manage all of these scenarios, you can transform fields to raw JSON by
+annotating them in a Trino-specific structure in the [\_meta](https://opensearch.org/docs/latest/field-types/index/)
+section of the index mapping. This indicates to Trino that the field, and all
+nested fields beneath, must be cast to a `VARCHAR` field that contains
+the raw JSON content. These fields can be defined by using the following command
+to add the field property definition to the `_meta.trino` property of the
+target index mapping.
+
+```shell
+curl --request PUT \
+    --url localhost:9200/doc/_mapping \
+    --header 'content-type: application/json' \
+    --data '
+{
+    "_meta": {
+        "trino":{
+            "array_string_field":{
+                "asRawJson":true
+            }
+        }
+    }
+}'
+```
+
+The preceding configuration causes Trino to return the `array_string_field`
+field as a `VARCHAR` containing raw JSON. You can parse these fields with the
+{doc}`built-in JSON functions `.
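+
+As a minimal sketch of that parsing step, assuming the index is exposed as the
+table `doc` in the `default` schema of a catalog named `example`, the following
+query reads the first element of each raw JSON array with
+`json_extract_scalar`. The catalog, schema, and table names are illustrative
+assumptions and are not defined by the connector:
+
+```sql
+-- Assumes array_string_field is returned as a VARCHAR holding raw JSON
+-- because asRawJson is set for it in the index mapping.
+-- The names example, default, and doc are placeholders for this sketch.
+SELECT
+  array_string_field,
+  json_extract_scalar(array_string_field, '$[0]') AS first_value
+FROM example.default.doc
+```
+
+Documents that hold a single value or a nested array need a different JSON
+path, so adjust the path to the shapes that are present in your index.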
+
+:::{note}
+You cannot use the `asRawJson` and `isArray` flags simultaneously for the same column.
+:::
+
+## Special columns
+
+The following hidden columns are available:
+
+| Column   | Description                                         |
+|----------|-----------------------------------------------------|
+| \_id     | The OpenSearch document ID                          |
+| \_score  | The document score returned by the OpenSearch query |
+| \_source | The source of the original document                 |
+
+(opensearch-full-text-queries)=
+
+## Full text queries
+
+Trino SQL queries can be combined with OpenSearch queries by providing the [full text query]
+as part of the table name, separated by a colon. For example:
+
+```sql
+SELECT * FROM "tweets: +trino SQL^2"
+```
+
+## Predicate push down
+
+The connector supports predicate push down for the following data types:
+
+| OpenSearch   | Trino         | Supports      |
+|--------------|---------------|---------------|
+| `binary`     | `VARBINARY`   | `NO`          |
+| `boolean`    | `BOOLEAN`     | `YES`         |
+| `double`     | `DOUBLE`      | `YES`         |
+| `float`      | `REAL`        | `YES`         |
+| `byte`       | `TINYINT`     | `YES`         |
+| `short`      | `SMALLINT`    | `YES`         |
+| `integer`    | `INTEGER`     | `YES`         |
+| `long`       | `BIGINT`      | `YES`         |
+| `keyword`    | `VARCHAR`     | `YES`         |
+| `text`       | `VARCHAR`     | `NO`          |
+| `date`       | `TIMESTAMP`   | `YES`         |
+| `ip`         | `IPADDRESS`   | `NO`          |
+| (all others) | (unsupported) | (unsupported) |
+
+## AWS authorization
+
+To enable AWS authorization using IAM policies, the `opensearch.security` option needs to be set to `AWS`.
+Additionally, the following options need to be configured appropriately:
+
+| Property name | Description |
+|------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
+| `opensearch.aws.region` | AWS region of the OpenSearch endpoint. This option is required. |
+| `opensearch.aws.access-key` | AWS access key to use to connect to the OpenSearch domain. If not set, the Default AWS Credentials Provider chain will be used. |
+| `opensearch.aws.secret-key` | AWS secret key to use to connect to the OpenSearch domain. If not set, the Default AWS Credentials Provider chain will be used. |
+| `opensearch.aws.iam-role` | Optional ARN of an IAM Role to assume to connect to the OpenSearch domain. Note: the configured IAM user has to be able to assume this role. |
+| `opensearch.aws.external-id` | Optional external ID to pass while assuming an AWS IAM Role. |
+
+## Password authentication
+
+To enable password authentication, the `opensearch.security` option needs to be set to `PASSWORD`.
+Additionally, the following options need to be configured appropriately:
+
+| Property name              | Description                                |
+|----------------------------|--------------------------------------------|
+| `opensearch.auth.user`     | User name to use to connect to OpenSearch. |
+| `opensearch.auth.password` | Password to use to connect to OpenSearch.  |
+
+(opensearch-sql-support)=
+
+## SQL support
+
+The connector provides {ref}`globally available ` and
+{ref}`read operation ` statements to access data and
+metadata in the OpenSearch catalog.
+
+## Table functions
+
+The connector provides specific {doc}`table functions ` to
+access OpenSearch.
+
+(opensearch-raw-query-function)=
+
+### `raw_query(varchar) -> table`
+
+The `raw_query` function allows you to query the underlying database directly.
+This function requires [OpenSearch Query DSL](https://opensearch.org/docs/latest/query-dsl/index/)
+syntax, because the full query is pushed down and processed in OpenSearch.
+This can be useful for accessing native features which are not available in
+Trino or for improving query performance in situations where running a query
+natively may be faster.
+
+```{include} query-passthrough-warning.fragment
+```
+
+The `raw_query` function requires three parameters:
+
+- `schema`: The schema in the catalog that the query is to be executed on.
+- `index`: The index in OpenSearch to be searched.
+- `query`: The query to be executed, written in OpenSearch Query DSL.
+
+Once executed, the query returns a single row containing the resulting JSON
+payload returned by OpenSearch.
+
+For example, query the `example` catalog and use the `raw_query` table
+function to search for documents in the `orders` index where the country name
+is `ALGERIA`:
+
+```sql
+SELECT
+  *
+FROM
+  TABLE(
+    example.system.raw_query(
+      schema => 'sales',
+      index => 'orders',
+      query => '{
+        "query": {
+          "match": {
+            "name": "ALGERIA"
+          }
+        }
+      }'
+    )
+  );
+```
+
+```{include} query-table-function-ordering.fragment
+```
+
+[built-in date formats]: https://opensearch.org/docs/latest/field-types/supported-field-types/date/#custom-formats
+[custom date formats]: https://opensearch.org/docs/latest/field-types/supported-field-types/date/#custom-formats
+[date]: https://opensearch.org/docs/latest/field-types/supported-field-types/date/
+[format]: https://opensearch.org/docs/latest/query-dsl/term/range/#format
+[full text query]: https://opensearch.org/docs/latest/query-dsl/full-text/query-string/
diff --git a/docs/src/main/sphinx/static/img/opensearch.png b/docs/src/main/sphinx/static/img/opensearch.png
new file mode 100644
index 000000000000..113d451b3467
Binary files /dev/null and b/docs/src/main/sphinx/static/img/opensearch.png differ