evaluate multiNER performance #12
Ok, I tried to use the parser from #22 to parse the KB corpus to plain txt, but I stumbled upon two problems:
Check the offending character in the file in question. It's certainly not out of the question that the file is corrupted. I had to hand-correct a couple of files in the Times corpus, too.
Ok, I finally made it through all the files and have parsed the Golden Standard corpus from XML to TXT files. Pfff... About 1/4th of the 99 files contained the problem with non-decodable bytes, and I handled them all manually. Does this have to do with OCR quality (e.g. extremely exotic characters) or was a different encoding (than utf-8) used? If we ever establish contact again, ask Willem Jan from the KB about this. Anyhow, the cleaned ... @jgonggrijp: is there a trick you use to handle this type of error in scripts dealing with large amounts of data? I am thinking along the lines of 1) ignoring the file and storing it somewhere, or 2) making a copy of the file without the byte at position ...
@alexhebing I haven't tried fixing such problems automatically. I think it requires strong intelligence. If it can be done with weak intelligence, I don't know how.
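For reference, a minimal sketch of the two options floated above, assuming the files are meant to be UTF-8 (the `clean_copy` helper and the quarantine directory are hypothetical, not part of the existing scripts): set the original aside for manual inspection and write a copy with the undecodable bytes dropped.

```python
# Hypothetical sketch: quarantine files with non-decodable bytes and
# write a cleaned copy. Assumes the corpus is meant to be UTF-8.
import shutil
from pathlib import Path

def clean_copy(src: Path, dst: Path, quarantine: Path) -> None:
    raw = src.read_bytes()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Option 1: keep the untouched original around for inspection.
        quarantine.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, quarantine / src.name)
        print(f"{src.name}: invalid byte {raw[err.start]:#04x} at offset {err.start}")
        # Option 2: write a copy with the offending bytes dropped.
        dst.write_text(raw.decode("utf-8", errors="ignore"), encoding="utf-8")
    else:
        shutil.copy(src, dst)
```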
MultiNER, to my great frustration, performed awfully against the Italian Golden Standard provided by Lorella. The best score was with a configuration with stanford as leading package and 2 as ... Something is wrong with either multiNER, the bio_converter, or the evaluation script, or all of them. Some issues that I found already are in #29 (this also includes a link to a multiNER issue). Fix these, test some more, look for other issues. For reference, here is the score that Spacy (run in isolation, i.e. separate from multiNER) got on the same Golden Standard (due to issues with the way Spacy processes the text you feed it, it only used 163 of the 190 files):
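For anyone debugging the pipeline: a minimal sketch of the kind of entity-level scoring involved, assuming the bio_converter emits standard IOB2 tags per token. seqeval is one library that computes precision/recall/F1 this way; the tag sequences below are made-up examples, not corpus data.

```python
# Hypothetical sanity check for the evaluation step, assuming standard
# IOB2 tags. seqeval scores at the entity level, so a boundary or type
# mismatch counts the whole entity as wrong.
from seqeval.metrics import classification_report

gold = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
pred = [["B-PER", "I-PER", "O", "O", "O"]]  # missed the LOC entity

print(classification_report(gold, pred))
```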
In addition, I ran a test with Stanford on the same Golden Standard. @BeritJanssen, look at the amazing score it gets (on 171 of the 190 files): If only multiNER didn't contain bugs, surely it would score similarly, and probably better 😢
Wow, that's a different story. Well, then use Stanford for this experiment, I would say... I see there is some occasional activity on the KB repo. Maybe you can make an issue with your screenshots there? It's good for them (and others) to know that there are still some issues with the output.
I forgot to mention (and think of) the fact that the model used for Italian was actually trained on this dataset. In that regard, the results are not that surprising. I am in contact with the people at KB about a PR, but Willem Jan (if I remember his name correctly) is currently absent due to illness. The issues that we experience, however, are probably due to my extensive rewriting of the code.
After #17, #19 and #25, test the performance of multiNER with different configurations (e.g. type_preference, leading packages); maybe adjust how these work together (see also #15).
In any case, evaluate using the Dutch Historical corpus from the KB. If time permits (i.e. things go easy), also run some tests against the Italian I-CAB corpus.
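A hypothetical sketch of how such a configuration sweep could be scripted; `run_multiner` and `evaluate` stand in for the real pipeline calls, whose actual signatures may differ, and the option values are assumptions rather than MultiNER's real defaults.

```python
# Hypothetical configuration sweep over the parameters mentioned above.
# run_multiner and evaluate are placeholders for the actual pipeline.
from itertools import product
from typing import Callable, Iterable

def sweep(run_multiner: Callable, evaluate: Callable,
          corpus: Iterable[str], gold) -> tuple:
    leading_packages = ["stanford", "spacy"]  # assumed option values
    type_preferences = [1, 2]                 # assumed option values
    scores = {}
    for leading, pref in product(leading_packages, type_preferences):
        predictions = run_multiner(corpus, leading=leading,
                                   type_preference=pref)
        scores[(leading, pref)] = evaluate(predictions, gold)
    # Return the best-scoring configuration and its score.
    return max(scores.items(), key=lambda kv: kv[1])
```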