Provider: Solr

Petr Škoda edited this page Mar 24, 2020 · 3 revisions

dcat-ap-viewer can use Apache Solr as one of its main indexing services for searching datasets.

Provider

The Solr provider allows dcat-ap-viewer to use Apache Solr as a data source.

Example

The example below configures the Solr provider to use the dcat-ap-viewer core running on localhost. The core must contain data for the cs and en languages. If a user makes an API request without specifying a language, the default language, cs, is used.

    - type: solr
      url: http://localhost:8983/solr/dcat-ap-viewer
      default-language: cs
      languages:
        - cs
        - en

Configuration

  • url - URL of the Apache Solr endpoint, including the name of the core.
  • default-language - Language used when a user request does not specify one.
  • languages - List of all languages stored in Apache Solr.
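As a minimal sketch of how these three options interact, the Python snippet below mirrors the example configuration as a dictionary and picks the language for a request. The helper name resolve_language is hypothetical, and rejecting a language that is not listed is an assumption; the page only states that the default language is used when none is specified.

```python
from typing import Optional

# Mirrors the example provider configuration as a plain dictionary.
PROVIDER_CONFIG = {
    "type": "solr",
    "url": "http://localhost:8983/solr/dcat-ap-viewer",
    "default-language": "cs",
    "languages": ["cs", "en"],
}


def resolve_language(config: dict, requested: Optional[str]) -> str:
    """Pick the language for a request (hypothetical helper, not the viewer's code)."""
    if not requested:
        # Documented behaviour: fall back to the configured default language.
        return config["default-language"]
    if requested not in config["languages"]:
        # Assumption: unknown languages are rejected; the real provider may differ.
        raise ValueError(f"Language not configured: {requested}")
    return requested
```

Calling resolve_language(PROVIDER_CONFIG, None) returns the default cs, while passing en returns en unchanged.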

Implemented methods

  • v1-info
  • v2-dataset-list
  • v2-dataset-facet
  • v2-dataset-typeahead
  • v2-publisher-list

Solr schema

We decided that every search should use all languages at once but show the results in the language of the UI. There are three main reasons:

  • We do not plan to use machine translation, so most datasets will not have labels/descriptions in all languages.
  • A user should be able to find a dataset by its name even if the name is in a different language than the UI.
  • Major web search services also allow searching in all languages, not only the language of the UI.

A major drawback is the lack of explainability. For example, a user may search for dum and get a dataset about housing. While this result is correct, it may not be clear without knowledge of both languages. As of now, we have no solution for this issue, and we may need to address it in the future.

Multilingual search

As dcat-ap-viewer supports multiple languages, we need to use multilingual search in Apache Solr. Solr in Action mentions three main approaches that can be used:

  • using separate fields
  • using multiple cores
  • using a single field

The 2015 white paper Optimizing Multilingual Search With Solr describes similar possibilities. It offers details on how to use each method and comments that the single-field approach may require a slightly more complicated setup.

It also states that the main disadvantage of the separate-fields solution is that performance degrades as the number of languages increases. We do not consider this to be an issue as of now. The main advantage of separate fields is easy setup and use. For these reasons, we decided to choose the separate-fields solution.
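To illustrate the separate-fields idea, the sketch below builds a Solr select URL that searches one field per language at once and returns the title in the UI language. The field names title_cs, description_en and so on, as well as the helper build_select_url, are hypothetical; this page does not list the actual schema fields. The sketch assumes the Extended DisMax query parser (defType=edismax with the qf parameter) and the standard /select request handler.

```python
from urllib.parse import urlencode


def build_select_url(core_url, text, languages, ui_language, rows=10):
    """Build a query over all per-language fields (hypothetical field names)."""
    # One title and one description field per language, all searched together.
    query_fields = " ".join(
        f"title_{lang} description_{lang}" for lang in languages
    )
    params = {
        "q": text,
        "defType": "edismax",  # Extended DisMax parser, which supports qf
        "qf": query_fields,
        # Return only the label in the language of the UI.
        "fl": f"id,title_{ui_language}",
        "rows": rows,
    }
    return f"{core_url}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr/dcat-ap-viewer", "dum", ["cs", "en"], "en"
)
```

Each additional language adds more fields to qf, which is where the performance cost mentioned in the white paper would come from.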