Flexible fusion backend #813
Comments
A third possible approach to passing different variants/subsets of a full text to different backends, a kind of fusion of the approaches (1) and (2):
@c-poley et al., how would you identify the different parts of the text? Would you use existing metadata of a document to get e.g. the headline, or would a user enter it separately in your workflow when submitting the text/file? I started to wonder whether there could also be a transform that detects and tags particular parts of a text, e.g. title, abstract, TOC, authors, publishers, etc. (it could be advantageous to deselect the authors and publishers and remove them from the text when performing subject indexing, so there could be a
Well, the third possibility can also fulfil our requirements. It may matter how we connect the text parts we use; we call this "text glue". Otherwise it becomes our own job to add one more character at the end of the headline, or something similar. For our purposes, we have the parts such as "headline", "toc", "blurb" or "fulltext" available separately. But the idea mentioned at the end of your answer could become very interesting. With such a feature, Annif would move from a toolbox for automatic suggestions to a toolbox that makes it possible to identify structures in plain text with the help of algorithms and perhaps AI magic. It may well remain a research feature, because low-error structure extraction requires a lot of knowledge about the plain text (or about what we think is text). One of my colleagues is currently looking more closely at text quality. Maybe that helps to get better suggestions. Maybe we get some side effects.
Maybe I got carried away with the identification of text parts; there are dedicated packages for this, and an Annif workflow could simply use such software via its API to analyse the input documents. Annif itself could then use either approach (1) or (3) to pass different parts of the texts to different backends (or do some cleaning of the text, as for the authors mentioned above). There is a benchmark, OmniDocBench, that compares different document parsing software. Some possibly useful software for such PDF layout analysis:
Anyway, the benefit of passing different variants of a document text to different backends should be evaluated before investing too much time in the implementation. It would be good if someone could experiment with this!
With Annif, it is possible to use several specialised models for prediction in an ensemble. However, all models in an Annif ensemble can only be given one and the same text for prediction, so it is not possible to pass a different kind of text to each individual model.
Currently, the only way to adapt the text for prediction is the transform parameter. We can use that parameter either to read a limited number of characters from the beginning or to read all of the text. A parameter that lets us set a specific range (from character x to character y) to be read from the text would give us an additional way to specify or cut down the text for specific models.
Could the Annif ensemble functionality be extended so that the individual models of an ensemble can be given different kinds of text (expressions of a document) for processing?
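To make the request more concrete, here is a minimal sketch of what "a different text per ensemble member" could look like. It is not Annif code: the two model functions are hypothetical stand-ins, the member list and `ensemble_suggest` are invented for illustration, and merging by weighted average is an assumption about how the ensemble would combine the scores.

```python
from collections import defaultdict


def headline_model(text):
    # Hypothetical stand-in for a model trained on short titles.
    return {"music": 0.9} if "wonderful" in text.lower() else {}


def fulltext_model(text):
    # Hypothetical stand-in for a model trained on full texts.
    return {"music": 0.5, "nature": 0.4} if "world" in text.lower() else {}


def ensemble_suggest(parts, members):
    """Weighted average of member scores, each computed on its own text part.

    `parts` maps part names (e.g. "headline") to text; `members` is a list of
    (model, part_name, weight) tuples. Both structures are assumptions.
    """
    totals = defaultdict(float)
    weight_sum = sum(weight for _, _, weight in members)
    for model, part_name, weight in members:
        # Each member only sees the part of the document it was configured for.
        for subject, score in model(parts.get(part_name, "")).items():
            totals[subject] += weight * score
    return {subject: total / weight_sum for subject, total in totals.items()}


members = [
    (headline_model, "headline", 1.0),
    (fulltext_model, "fulltext", 1.0),
]
parts = {"headline": "Wonderful", "fulltext": "Oh, what a wonderful world"}
print(ensemble_suggest(parts, members))
```

The point of the sketch is only the dispatch step: the choice of text part is made per member, before the scores are merged.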
A further kind of flexibility for the ensemble functionality would be the use of subsets of vocabularies for individual models in ensembles, as discussed in issue #596.
In any case, the interface for predictions needs enhancements. Below we describe some ideas we discussed a few weeks ago:
(1) Allow text to be submitted as structured data:
Use JSON:
-d '%7B%22headline%22%3A%22Wonderful%22%2C%22fulltext%22%3A%22Oh%2C%20what%20a%20wonderful%20world%22%7D'
or
Use XML tags:
-d '%3Cheadline%3EWonderful%3C%2Fheadline%3E%3Cfulltext%3EOh%2C%20what%20a%20wonderful%20world%3C%2Ffulltext%3E'
... with the possibility to define the tags at the appropriate places in projects.conf, like:
submitted_text=headline,' ',fulltext
The space between "headline" and "fulltext" defines the character(s) used to glue the parts of the submitted text together.
submitted_text=headline,'.',toc
Here a headline and a TOC are submitted, connected with a ".".
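A minimal sketch of how such a `submitted_text` specification could be interpreted, assuming the request body is the URL-encoded JSON from the example above. The helpers `parse_spec` and `glue_text` are hypothetical names and not part of Annif; treating a missing part as an empty string is also an assumption.

```python
import json
import re
from urllib.parse import unquote

# One token is either a quoted glue string ('...') or a bare field name,
# followed by a comma or the end of the spec.
_TOKEN = re.compile(r"\s*(?:'([^']*)'|([A-Za-z_]\w*))\s*(?:,|$)")


def parse_spec(spec):
    """Split a spec such as "headline,' ',fulltext" into (kind, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(spec):
        match = _TOKEN.match(spec, pos)
        if match is None:
            raise ValueError(f"cannot parse spec at position {pos}: {spec!r}")
        if match.group(1) is not None:
            tokens.append(("glue", match.group(1)))
        else:
            tokens.append(("field", match.group(2)))
        pos = match.end()
    return tokens


def glue_text(parts, spec):
    """Concatenate the named parts of a document, inserting the glue strings."""
    pieces = []
    for kind, value in parse_spec(spec):
        if kind == "glue":
            pieces.append(value)
        else:
            pieces.append(parts.get(value, ""))  # assumption: missing part -> ""
    return "".join(pieces)


payload = (
    "%7B%22headline%22%3A%22Wonderful%22%2C%22fulltext%22%3A%22"
    "Oh%2C%20what%20a%20wonderful%20world%22%7D"
)
parts = json.loads(unquote(payload))
print(glue_text(parts, "headline,' ',fulltext"))
# prints: Wonderful Oh, what a wonderful world
```

Keeping the glue inside the spec, rather than hard-coding a separator, means each project can decide for itself how its parts are joined.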
(2) A step towards such a fusion would be an enhancement of the limit parameter, which would allow us to select a specific part of the submitted text.
In projects.conf this needs only a small extension that defines the starting point and the number of characters to process:
transform=limit(500,2000)
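As a sketch of the intended semantics, the two-argument form could behave like the function below. It assumes, following the description above, that the first argument is the starting point and the second the number of characters. Counting the starting point from zero, and the name `limit_range`, are assumptions for illustration, not existing Annif behaviour.

```python
def limit_range(text, start, length):
    """Return up to `length` characters of `text`, beginning at offset `start`."""
    if start < 0 or length < 0:
        raise ValueError("start and length must be non-negative")
    # Slicing never raises for out-of-range ends; short texts yield less.
    return text[start:start + length]


sample = "abcdefghij"
print(limit_range(sample, 2, 3))   # prints: cde
print(limit_range(sample, 8, 5))   # prints: ij
```

Under this reading, limit(500,2000) would skip the first 500 characters and pass on the following 2000.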
We (and, we think, the whole community) would really benefit from the implementation of a fusion with freely configurable structured data as in (1). We have to admit that the use of structured data would be our preferred and the cleanest implementation.
Best regards,
Christoph, Frank, Jan-Helge and Sandro from the German National Library