Update field mapping for 'rights.description' to index values as keyw… #310

Merged

@jonavellecuerdo jonavellecuerdo commented Feb 21, 2024

Purpose and background context

Update the OpenSearch field mapping for rights.description to index values as keywords, in support of the "Access to files" filter. This work is required as a result of updates to the TIMDEX data model.

How can a reviewer manually see the effects of these changes?

Prerequisite step(s)

  1. Ran the following transform command in transmogrifier locally to create a JSON file of TIMDEX records that includes the updates to the MITAardvark.rights field in support of the access filter.
    pipenv run transform -i s3://timdex-extract-dev-222053980223/gismit/gismit-2024-02-21-full-extracted-records-to-index.jsonl -o output/output.json -s gismit -v
    
  2. Copied transmogrifier/output/output.json to timdex-index-manager/output.json.
  3. Followed the instructions in the timdex-index-manager/README.md to run OpenSearch and OpenSearch Dashboards in a local docker container and bulk-index timdex-index-manager/output.json into the OpenSearch instance.

Note: When running this locally, I had to temporarily pin the version of the Docker images in compose.yaml to 2.11.1. The latest tag pulls version 2.12, which now requires an env var called OPENSEARCH_INITIAL_ADMIN_PASSWORD to be set.

Now, the exciting bit! 🥳

  1. Ran the following query on local OpenSearch Dashboard Dev Tools:
GET gismit/_search
{
  "size": 0, 
  "aggs": {
    "rights": {
      "nested": {
        "path": "rights"
      },
      "aggs": {
        "filtered_rights_kind": {
          "filter": {
            "terms": {
              "rights.kind": [
                "Access to files"
              ]
            }
          },
          "aggs": {
            "rights_kind_count": {
              "terms": {
                "field": "rights.description.keyword"
              }
            }
          }
        }
      }
    }
  }
}
  2. Returned output [REDACTED]:
"aggregations": {
    "rights": {
      "doc_count": 4927,
      "filtered_rights_kind": {
        "doc_count": 2043,
        "rights_kind_count": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "MIT authentication",
              "doc_count": 1210
            },
            {
              "key": "Free/open to all",
              "doc_count": 833
            }
          ]
        }
      }
    }
  }

The output shows that we are able to aggregate on rights.description.keyword after filtering to rights.kind = "Access to files". Hooray!
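For anyone who would rather check this from Python than from Dev Tools, here is a minimal sketch of the same aggregation body plus a helper that flattens the buckets. The function names `build_access_aggregation` and `bucket_counts` are hypothetical and not part of timdex-index-manager; the sample response is copied from the output above, so no OpenSearch client is needed to run it.

```python
import json


def build_access_aggregation(kind: str = "Access to files") -> dict:
    """Build the nested aggregation body used in the Dev Tools query above."""
    return {
        "size": 0,
        "aggs": {
            "rights": {
                "nested": {"path": "rights"},
                "aggs": {
                    "filtered_rights_kind": {
                        # Keep only nested rights objects of the requested kind.
                        "filter": {"terms": {"rights.kind": [kind]}},
                        "aggs": {
                            "rights_kind_count": {
                                # Aggregate on the keyword subfield, not the text field.
                                "terms": {"field": "rights.description.keyword"}
                            }
                        },
                    }
                },
            }
        },
    }


def bucket_counts(aggregations: dict) -> dict:
    """Flatten the terms buckets into a {description: doc_count} dict."""
    terms = aggregations["rights"]["filtered_rights_kind"]["rights_kind_count"]
    return {bucket["key"]: bucket["doc_count"] for bucket in terms["buckets"]}


# Sample "aggregations" object copied from the output in this PR description.
sample_aggregations = {
    "rights": {
        "doc_count": 4927,
        "filtered_rights_kind": {
            "doc_count": 2043,
            "rights_kind_count": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {"key": "MIT authentication", "doc_count": 1210},
                    {"key": "Free/open to all", "doc_count": 833},
                ],
            },
        },
    }
}

print(json.dumps(build_access_aggregation(), indent=2))
print(bucket_counts(sample_aggregations))
```

The bucket counts should sum to the doc_count of the filtered aggregation, which is a quick sanity check that no descriptions fell outside the returned buckets.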

Note: If you run the query above against rights.description, without the .keyword suffix, Dev Tools returns the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [rights.description] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "gismit-2024-02-21t21-10-16",
        "node": "biZ_XeuGTJWPEXEp2fsdXw",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [rights.description] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [rights.description] in order to load field data by uninverting the inverted index. Note that this can use significant memory.",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [rights.description] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
      }
    }
  },
  "status": 400
}
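To illustrate why the .keyword suffix matters, here is a rough sketch of a multifield mapping and a helper that picks the aggregatable name for a field. The `ASSUMED_PROPERTIES` dict is an assumption about the shape of the mapping (a nested rights object, a keyword rights.kind, and a text rights.description with a keyword subfield); it is not copied from the timdex-index-manager config. `aggregatable_field` is a hypothetical helper as well.

```python
# Assumed mapping shape, for illustration only (not the actual TIM config).
ASSUMED_PROPERTIES = {
    "rights": {
        "type": "nested",
        "properties": {
            "kind": {"type": "keyword"},
            "description": {
                "type": "text",
                # Multifield: the same value is also indexed as a keyword.
                "fields": {"keyword": {"type": "keyword"}},
            },
        },
    }
}


def aggregatable_field(properties: dict, path: str) -> str:
    """Return a field name usable in a terms aggregation for a dotted path.

    Raises ValueError when the field is neither a keyword nor has a
    keyword subfield, which is the situation behind the error above.
    """
    node = {"properties": properties}
    for part in path.split("."):
        node = node["properties"][part]
    if node.get("type") == "keyword":
        return path
    for name, subfield in node.get("fields", {}).items():
        if subfield.get("type") == "keyword":
            return f"{path}.{name}"
    raise ValueError(f"{path} has no keyword representation to aggregate on")


print(aggregatable_field(ASSUMED_PROPERTIES, "rights.description"))
print(aggregatable_field(ASSUMED_PROPERTIES, "rights.kind"))
```

With a plain text mapping (no fields entry) the helper raises, mirroring the illegal_argument_exception that OpenSearch returns when aggregating on a text field.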

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/GDT-138

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples have been verified
  • New dependencies are appropriate or there were no changes

…ords

Why these changes are being introduced:
* This change is required in order to perform aggregations on the 'description'
property of the 'rights' field. More specifically, this work is to enable
filtering based on "Access to files" and support the updates to the TIMDEX
data model in the PR: MITLibraries/transmogrifier#114

How this addresses that need:
* Update config to convert 'rights.description' to a multifield that indexes
the value as a "keyword" in a similarly named subfield (i.e., 'description.keyword')

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-138
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review February 21, 2024 21:25

@ghukill ghukill left a comment

Looks good! Thanks for the detailed PR.

@jonavellecuerdo jonavellecuerdo merged commit 0012371 into main Feb 22, 2024
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the GDT-138-update-tim-mapping-to-enable-access-filter branch February 22, 2024 14:56