Update README
jonavellecuerdo committed Feb 9, 2024
1 parent eb04dcc commit dbb401b
Showing 2 changed files with 134 additions and 38 deletions.
167 changes: 133 additions & 34 deletions README.md
@@ -1,59 +1,158 @@
# timdex-index-manager (tim)

TIMDEX! Index Manager (TIM) is a Python cli application for managing TIMDEX indexes in OpenSearch.

## Required ENV

- `WORKSPACE` = Set to `dev` for local development. Terraform sets this to `stage` and `prod` in those environments.

## Optional ENV

- `AWS_REGION` = Only needed if AWS region changes from the default of us-east-1.
- `OPENSEARCH_BULK_MAX_CHUNK_BYTES` = Chunk size limit for sending requests to the bulk indexing endpoint, in bytes. Defaults to 100 MB (the opensearchpy default) if not set.
- `OPENSEARCH_BULK_MAX_RETRIES` = Maximum number of retries when sending requests to the bulk indexing endpoint. Defaults to 8 if not set.
- `OPENSEARCH_REQUEST_TIMEOUT` = Only used for OpenSearch requests that tend to take longer than the default timeout of 10 seconds, such as bulk or index refresh requests. Defaults to 120 seconds if not set.
- `SENTRY_DSN` = If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
- `STATUS_UPDATE_INTERVAL` = The ingest process logs the # of records indexed every nth record (1000 by default). Set this env variable to any integer to change the frequency of logging status updates. Can be useful for development/debugging.
- `TIMDEX_OPENSEARCH_ENDPOINT` = If using a local Docker OpenSearch instance, this isn't needed. Otherwise set to OpenSearch instance endpoint _without_ the http scheme, e.g. `search-timdex-env-1234567890.us-east-1.es.amazonaws.com`. Can also be passed directly to the CLI via the `--url` option.
TIMDEX! Index Manager (TIM) is a Python CLI application for managing TIMDEX indices in OpenSearch.

## Development

- To preview a list of available Makefile commands: `make help`
- To install with dev dependencies: `make install`
- To update dependencies: `make update`
- To run unit tests: `make test`
- To lint the repo: `make lint`
- To run the app: `pipenv run tim --help`

### Local OpenSearch with Docker
**Important note:** The sections that follow provide instructions for running OpenSearch **locally with Docker**. These instructions are useful for testing. Please make sure the environment variable `TIMDEX_OPENSEARCH_ENDPOINT` is **not** set before proceeding.

A local OpenSearch instance can be started for development purposes by running:
### Running OpenSearch locally with Docker

``` bash
$ docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" \
-e "plugins.security.disabled=true" \
opensearchproject/opensearch:2.11.1
```
1. Run the following command:

To confirm the instance is up, run `pipenv run tim -u localhost ping`.
``` bash
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" \
-e "plugins.security.disabled=true" \
opensearchproject/opensearch:2.11.1
```

Alternately, you can use the included Docker Compose file to start an OpenSearch node along with an OpenSearch Dashboard. This should leave you with the same
2. To confirm the instance is up, run `pipenv run tim -u localhost ping` or visit http://localhost:9200/. This should produce a log that looks like the following:
```
2024-02-08 13:22:16,826 INFO tim.cli.main(): OpenSearch client configured for endpoint 'localhost'

```bash
docker pull opensearchproject/opensearch:latest
docker pull opensearchproject/opensearch-dashboards:latest
docker compose up
```
Name: docker-cluster
UUID: RVCmwQ_LQEuh1GrtwGnRMw
OpenSearch version: 2.11.1
Lucene version: 9.7.0

2024-02-08 13:22:16,930 INFO tim.cli.log_process_time(): Total time to complete process: 0:00:00.105506
```

### Running OpenSearch and OpenSearch Dashboards locally with Docker

You can use the included Docker Compose file ([compose.yaml](compose.yaml)) to start an OpenSearch instance along with OpenSearch Dashboards, "[the user interface that lets you visualize your OpenSearch data and run and scale your OpenSearch clusters](https://opensearch.org/docs/latest/dashboards/)". Two tools that are useful for exploring indices are [DevTools](https://opensearch.org/docs/latest/dashboards/dev-tools/index-dev/) and [Discover](https://opensearch.org/docs/latest/dashboards/discover/index-discover/).

**Note:** To use Discover, you'll need to create an index pattern. When creating the index pattern, decline the option to set a date field: Discover detects a date field in our indices but crashes when it tries to use that field. When prompted, enter an index or alias to pull patterns from, and it will automatically be configured to work well enough for initial data exploration.
To confirm the instance is up, run `pipenv run tim -u localhost ping`.
1. Run the following command:
```bash
docker pull opensearchproject/opensearch:latest
docker pull opensearchproject/opensearch-dashboards:latest
docker compose up
```
To access the Dashboard, access <http://localhost:5601>.
2. To confirm the instance is up, run `pipenv run tim -u localhost ping` or visit http://localhost:9200/.
DevTools is useful for writing/testing OpenSearch queries.
3. Access OpenSearch Dashboards through <http://localhost:5601>.
Discover is useful for browsing data. An index pattern will be required to use this tool. Note: do not set a date field (choose the option to skip selecting a date field). It detects a date field in our indexes but then crashes trying to use it. Once you skip the date field selection, just enter an index or alias to pull patterns from and it will automatically be configured to work well enough for initial data exploration.
For a more detailed example with test data, please refer to the Confluence document: [How to run and query OpenSearch locally](https://mitlibraries.atlassian.net/wiki/spaces/D/pages/3586129972/How+to+run+and+query+OpenSearch+locally).
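
The browser check at http://localhost:9200/ can also be scripted. The following is a minimal sketch, assuming the instance runs with the security plugin disabled (as in the `docker run` command above) and that the root endpoint returns the usual OpenSearch JSON fields (`cluster_name`, `cluster_uuid`, `version.number`, `version.lucene_version`). `summarize_cluster` and `fetch_cluster_info` are hypothetical helpers for illustration and are not part of TIM.

```python
import json
from urllib.request import urlopen


def summarize_cluster(info: dict) -> str:
    """Format root-endpoint JSON in a layout similar to the `tim ping` log."""
    version = info.get("version", {})
    return "\n".join(
        [
            f"Name: {info.get('cluster_name')}",
            f"UUID: {info.get('cluster_uuid')}",
            f"OpenSearch version: {version.get('number')}",
            f"Lucene version: {version.get('lucene_version')}",
        ]
    )


def fetch_cluster_info(url: str = "http://localhost:9200/") -> dict:
    """Fetch cluster details over plain HTTP (assumes security is disabled)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)
```

With a local instance running, `print(summarize_cluster(fetch_cluster_info()))` should print the cluster name and version details; a connection error suggests the container is not up yet.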
### OpenSearch on AWS
### Indexing a record into OpenSearch locally with Docker
1. Follow the instructions in either [Running OpenSearch locally with Docker](#running-opensearch-locally-with-docker) or [Running OpenSearch and OpenSearch Dashboards locally with Docker](#running-opensearch-and-opensearch-dashboards-locally-with-docker).
2. Open a new terminal, and create a new index. Copy the name of the created index printed to the terminal's output.
```
pipenv run tim create -s <source-name>
```

3. Copy the index name and promote the index to the alias.

```
pipenv run tim promote -a <source-name> -i <index-name>
```

4. Bulk index records from a specified location, which may be a local path or a path in S3.
```
pipenv run tim bulk-index -s <source-name> <filepath-to-records>
```

5. After verifying that the bulk-index was successful, clean up your local OpenSearch instance by deleting the index.
```
pipenv run tim delete -i <index-name>
```
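
The four CLI steps above can be collected in one place for reference. This is a rough sketch that only builds and prints the command lines; it does not run them. It assumes `-s` takes a source name, as in the `bulk-index` step, and that the index name passed to `promote` and `delete` is the one printed by `create`, so in practice step 1 has to run before the later commands can be filled in. `local_indexing_commands` and the example values are hypothetical.

```python
import shlex


def local_indexing_commands(
    source: str, index: str, records_path: str
) -> list[list[str]]:
    """Build the create/promote/bulk-index/delete commands in README order."""
    base = ["pipenv", "run", "tim"]
    return [
        base + ["create", "-s", source],
        # `index` is the name printed by the create step.
        base + ["promote", "-a", source, "-i", index],
        base + ["bulk-index", "-s", source, records_path],
        base + ["delete", "-i", index],
    ]


# Placeholder values for illustration only.
for command in local_indexing_commands(
    "example-source", "example-index", "records.json"
):
    print(shlex.join(command))
```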

### Running OpenSearch on AWS

1. Ensure that you have the correct AWS credentials set for the Dev1 (or desired) account.

2. Set the `TIMDEX_OPENSEARCH_ENDPOINT` variable in your .env to match the Dev1 (or desired) TIMDEX OpenSearch endpoint (note: do not include the http scheme prefix).

3. Run `pipenv run tim ping` to confirm the client is connected to the expected TIMDEX OpenSearch instance.
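
Because the endpoint must be given without the http scheme, a copied URL may need trimming before it goes into `.env`. A minimal sketch of that normalization follows; `strip_http_scheme` is a hypothetical helper, not something TIM provides.

```python
from urllib.parse import urlsplit


def strip_http_scheme(endpoint: str) -> str:
    """Return the host part of an endpoint, dropping any scheme prefix."""
    endpoint = endpoint.strip()
    if "://" in endpoint:
        # urlsplit isolates the host (and port, if any) from scheme and path.
        return urlsplit(endpoint).netloc
    # Already scheme-less: only remove a trailing slash.
    return endpoint.rstrip("/")
```

For example, `strip_http_scheme("https://search-timdex-env-1234567890.us-east-1.es.amazonaws.com/")` yields the bare host name used in the README's example.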


## Environment Variables

### Required ENV

```
# Set to `dev` for local development. Terraform sets this to `stage` and `prod` in those environments.
WORKSPACE=dev
```
### Optional ENV
```
# Only needed if AWS region changes from the default of us-east-1.
AWS_REGION=

# Chunk size limit for sending requests to the bulk indexing endpoint, in bytes. Defaults to 104857600 (100 * 1024 * 1024) if not set.
OPENSEARCH_BULK_MAX_CHUNK_BYTES=

# Maximum number of retries when sending requests to the bulk indexing endpoint. Defaults to 50 if not set.
OPENSEARCH_BULK_MAX_RETRIES=

# Only used for OpenSearch requests that tend to take longer than the default timeout of 10 seconds, such as bulk or index refresh requests. Defaults to 120 seconds if not set.
OPENSEARCH_REQUEST_TIMEOUT=

# The ingest process logs the # of records indexed every nth record. Set this env variable to any integer to change the frequency of logging status updates. Can be useful for development/debugging. Defaults to 1000 if not set.
STATUS_UPDATE_INTERVAL=

# If using a local Docker OpenSearch instance, this isn't needed. Otherwise set to OpenSearch instance endpoint without the http scheme (e.g., "search-timdex-env-1234567890.us-east-1.es.amazonaws.com"). Can also be passed directly to the CLI via the `--url` option.
TIMDEX_OPENSEARCH_ENDPOINT=

# If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
SENTRY_DSN=
```
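
The defaults described in the comments above could be applied roughly as follows. This is an illustrative sketch, not TIM's actual configuration code: `get_int_env` and `load_settings` are hypothetical names, the default values are the ones stated in this README, and the `localhost` fallback is taken from the `--url` help text. An empty value such as `AWS_REGION=` is treated the same as an unset variable.

```python
import os


def get_int_env(name: str, default: int) -> int:
    """Read an integer env variable, falling back when unset or empty."""
    value = os.environ.get(name, "").strip()
    return int(value) if value else default


def load_settings() -> dict:
    """Collect settings, applying the defaults documented in the README."""
    return {
        # Required: raises KeyError if WORKSPACE is missing.
        "workspace": os.environ["WORKSPACE"],
        "aws_region": os.environ.get("AWS_REGION") or "us-east-1",
        "bulk_max_chunk_bytes": get_int_env(
            "OPENSEARCH_BULK_MAX_CHUNK_BYTES", 100 * 1024 * 1024
        ),
        "bulk_max_retries": get_int_env("OPENSEARCH_BULK_MAX_RETRIES", 50),
        "request_timeout": get_int_env("OPENSEARCH_REQUEST_TIMEOUT", 120),
        "status_update_interval": get_int_env("STATUS_UPDATE_INTERVAL", 1000),
        "opensearch_endpoint": os.environ.get("TIMDEX_OPENSEARCH_ENDPOINT")
        or "localhost",
        # None means Sentry monitoring stays disabled.
        "sentry_dsn": os.environ.get("SENTRY_DSN") or None,
    }
```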
## CLI commands
All CLI commands can be run with `pipenv run`.
```
Usage: tim [OPTIONS] COMMAND [ARGS]...

TIM provides commands for interacting with OpenSearch indexes.
For more details on a specific command, run tim COMMAND -h.

╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --url -u TEXT The OpenSearch instance endpoint minus the http scheme, e.g. │
│ 'search-timdex-env-1234567890.us-east-1.es.amazonaws.com'. If not provided, will attempt to get from the │
│ TIMDEX_OPENSEARCH_ENDPOINT environment variable. Defaults to 'localhost'. │
│ --verbose -v Pass to log at debug level instead of info │
│ --help -h Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Get cluster-level information ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ping Ping OpenSearch and display information about the cluster. │
│ indexes Display summary information about all indexes in the cluster. │
│ aliases List OpenSearch aliases and their associated indexes. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Index management commands ─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ create Create a new index in the cluster. │
│ delete Delete an index. │
│ promote Promote index as the primary alias and add it to any additional provided aliases. │
│ demote Demote an index from all its associated aliases. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Bulk record processing commands ───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ bulk-index Bulk index records into an index. │
│ bulk-delete Bulk delete records from an index. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
5 changes: 1 addition & 4 deletions tests/fixtures/sample_records.json
@@ -617,10 +617,7 @@
"literary_form": "nonfiction",
"locations": [
{
"geopoint": [
-77.025955,
38.942142
],
"geoshape": "BBOX (-77.11806895668957,-76.90988990509905, 38.99435963428633, 38.79162154730547)",
"kind": "Place of publication",
"value": "District of Columbia"
}