Provider: Solr
dcat-ap-viewer can use Apache Solr as one of its main indexing services for dataset search. The Solr provider makes Apache Solr available to dcat-ap-viewer as a data source.
The example below configures the Solr provider to use the dcat-ap-viewer core running on localhost. The core must contain data for the cs and en languages. If a user makes an API request without specifying a language, the default language, cs, is used.
- type: solr
  url: http://localhost:8983/solr/dcat-ap-viewer
  default-language: cs
  languages:
    - cs
    - en
- url - URL of the Apache Solr endpoint, including the name of the core.
- default-language - Language to use when a user request does not specify one.
- languages - List of all languages stored in Apache Solr.
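The fallback to the default language described above can be sketched as follows. This is a minimal illustration, not the viewer's actual code: the helper name resolve_language is hypothetical, and falling back to the default for an unsupported language is an assumption (the viewer may reject such requests instead).

```python
from typing import Optional

# Mirrors the provider configuration from the example above.
SOLR_PROVIDER = {
    "type": "solr",
    "url": "http://localhost:8983/solr/dcat-ap-viewer",
    "default-language": "cs",
    "languages": ["cs", "en"],
}


def resolve_language(provider: dict, requested: Optional[str]) -> str:
    """Pick the language for a request (hypothetical helper)."""
    if not requested:
        # No language in the request: use the configured default.
        return provider["default-language"]
    if requested not in provider["languages"]:
        # Assumption: unsupported languages also fall back to the default.
        return provider["default-language"]
    return requested


if __name__ == "__main__":
    print(resolve_language(SOLR_PROVIDER, None))
    print(resolve_language(SOLR_PROVIDER, "en"))
```

With the configuration above, a request without a language resolves to cs, while a request for en is served in en.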
Supported API endpoints:
- v1-info
- v2-dataset-list
- v2-dataset-facet
- v2-dataset-typeahead
- v2-publisher-list
We decided that every search should use all languages at once, while results are shown in the language of the UI. There are three main reasons:
- We do not plan to use machine translation, so most datasets will not have labels and descriptions in all languages.
- A user should be able to find a dataset by its name even if the name is in a different language than the UI.
- Major web search services also allow searching in all languages, not only the language of the UI.
A major drawback is a lack of explainability. For example, a user may search for dum and get a dataset about housing. While this result is correct, it may not be clear without knowledge of both languages. As of now, we have no solution for this issue, and we may need to address it in the future.
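The display side of this decision, showing results in the UI language even though the search spans all languages, might look roughly like the sketch below. The function pick_label, the shape of the labels dictionary, and the sample values are assumptions made for illustration; they are not taken from the viewer's code.

```python
from typing import Dict, List, Optional, Tuple


def pick_label(
    labels: Dict[str, str],
    ui_language: str,
    fallback_order: List[str],
) -> Optional[Tuple[str, str]]:
    """Return (text, language) to display, or None when no label exists.

    Hypothetical helper: prefers the UI language, then tries the other
    languages in the given order.
    """
    if labels.get(ui_language):
        return labels[ui_language], ui_language
    for language in fallback_order:
        if labels.get(language):
            return labels[language], language
    return None


if __name__ == "__main__":
    # A dataset that only has a Czech label, shown in an English UI.
    print(pick_label({"cs": "Bydlení"}, "en", ["cs", "en"]))
```

Returning the language together with the text lets the caller see when a label is not in the UI language. This does not resolve the explainability issue described above.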
As dcat-ap-viewer supports multiple languages, we need to use multilingual search in Apache Solr. Solr in Action mentions three main approaches that can be used:
- using separate fields
- using multiple cores
- using a single field
The white paper Optimizing Multilingual Search With Solr from 2015 describes similar possibilities. It offers details on how to use each method and comments that the single field based approach may require a slightly more complicated setup.
It also states that the main disadvantage of the separate fields based solution is that performance degrades as the number of languages increases. We do not consider this to be an issue as of now. The main advantage of separate fields is easy setup and use. For these reasons, we decided to choose the separate fields based solution.
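To illustrate the separate fields approach, the sketch below builds a query that searches one field per language using Solr's eDisMax query parser and its qf (query fields) parameter. The field names (title_cs, title_en, and so on) and the helper names are assumptions; the actual schema used by dcat-ap-viewer may differ.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlencode


def build_search_params(
    query: str,
    languages: List[str],
    base_fields: Iterable[str] = ("title", "description"),
) -> Dict[str, str]:
    """Build Solr parameters that search every per-language field at once."""
    # One field per (base field, language) pair, e.g. title_cs, title_en.
    # The <field>_<language> naming scheme is hypothetical.
    query_fields = " ".join(
        f"{field}_{language}" for field in base_fields for language in languages
    )
    return {"q": query, "defType": "edismax", "qf": query_fields}


def build_search_url(core_url: str, query: str, languages: List[str]) -> str:
    """Return a URL for the core's /select request handler."""
    return f"{core_url}/select?{urlencode(build_search_params(query, languages))}"


if __name__ == "__main__":
    print(build_search_url(
        "http://localhost:8983/solr/dcat-ap-viewer", "dum", ["cs", "en"]
    ))
```

The number of entries in qf grows with the number of languages, which is the performance concern the white paper raises for this approach.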