evaluate multiNER performance #12
Ok, I tried to use the parser from #22 to parse the KB corpus to plain txt, but I stumbled upon two problems:
Check the offending character in the file in question. It's certainly not out of the question that the file is corrupted. I had to hand-correct a couple of files in the Times corpus, too.
Ok, I finally made it through all the files and have parsed the Golden Standard corpus from XML to TXT files. Pfff... About 1/4th of the 99 files contained the problem with non-decodable bytes, and I handled them all manually. Does this have to do with OCR quality (e.g. extremely exotic characters) or was a different encoding (than utf-8) used? If we ever establish contact again, ask Willem Jan from the KB about this. Anyhow, the cleaned ... @jgonggrijp: is there a trick you use to handle this type of error in scripts dealing with large amounts of data? I am thinking along the lines of 1) ignoring the file and storing it somewhere, or 2) making a copy of the file without the byte at position ...
@alexhebing I haven't tried fixing such problems automatically. I think it requires strong intelligence. If it can be done with weak intelligence, I don't know how.
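For reference, a minimal sketch of the two options floated above, assuming the files are meant to be UTF-8 (the `clean_copy` helper and the quarantine directory are hypothetical, not part of the existing scripts): set the original aside for manual inspection and write a copy with the undecodable bytes dropped.

```python
# Hypothetical sketch: quarantine files with non-decodable bytes and
# write a cleaned copy. Assumes the corpus is meant to be UTF-8.
import shutil
from pathlib import Path

def clean_copy(src: Path, dst: Path, quarantine: Path) -> None:
    raw = src.read_bytes()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Option 1: keep the untouched original around for inspection.
        quarantine.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, quarantine / src.name)
        print(f"{src.name}: invalid byte {raw[err.start]:#04x} at offset {err.start}")
        # Option 2: write a copy with the offending bytes dropped.
        dst.write_text(raw.decode("utf-8", errors="ignore"), encoding="utf-8")
    else:
        shutil.copy(src, dst)
```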
MultiNER, to my great frustration, performed awfully against the Italian Golden Standard provided by Lorella. The best score was with a configuration with stanford as leading package and 2 as ... Something is wrong with either multiNER, the bio_converter, or the evaluation script, or all of them. Some issues that I found already are in #29 (this also includes a link to a multiNER issue). Fix these, test some more, look for other issues. For reference, here is the score that Spacy (run in isolation, i.e. separate from multiNER) got on the same Golden Standard (due to issues with the way Spacy processes the text you feed it, it only used 163 of the 190 files):
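For anyone debugging the pipeline: a minimal sketch of the kind of entity-level scoring involved, assuming the bio_converter emits standard IOB2 tags per token. seqeval is one library that computes precision/recall/F1 this way; the tag sequences below are made-up examples, not corpus data.

```python
# Hypothetical sanity check for the evaluation step, assuming standard
# IOB2 tags. seqeval scores at the entity level, so a boundary or type
# mismatch counts the whole entity as wrong.
from seqeval.metrics import classification_report

gold = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
pred = [["B-PER", "I-PER", "O", "O", "O"]]  # missed the LOC entity

print(classification_report(gold, pred))
```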
In addition, I ran a test with Stanford on the same Golden Standard. @BeritJanssen, look at the amazing score it gets (on 171 of the 190 files): If only multiNER didn't contain bugs, surely it would score similarly, and probably better 😢
Wow, that's a different story. Well, then use Stanford for this experiment, I would say... I see there is some occasional activity on the KB repo. Maybe you can make an issue with your screenshots there? It's good for them (and others) to know that there are still some issues with the output.
I forgot to mention (and think of) the fact that the model used for Italian was actually trained on this dataset. In that regard, the results are not that surprising. I am in contact with the people at KB about a PR, but Willem Jan (if I remember his name correctly) is currently absent due to illness. The issues that we experience, however, are probably due to my extensive rewriting of the code.
After #17, #19 and #25, test the performance of multiNER with different configurations (e.g. type_preference, leading packages); maybe adjust how these work together (see also #15).
In any case, evaluate using the Dutch Historical corpus from the KB. If time permits (i.e. things go easy), also run some tests against the Italian I-CAB corpus.
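A hypothetical sketch of how such a configuration sweep could be scripted; `run_multiner` and `evaluate` stand in for the real pipeline calls, whose actual signatures may differ, and the option values are assumptions rather than MultiNER's real defaults.

```python
# Hypothetical configuration sweep over the parameters mentioned above.
# run_multiner and evaluate are placeholders for the actual pipeline.
from itertools import product
from typing import Callable, Iterable

def sweep(run_multiner: Callable, evaluate: Callable,
          corpus: Iterable[str], gold) -> tuple:
    leading_packages = ["stanford", "spacy"]  # assumed option values
    type_preferences = [1, 2]                 # assumed option values
    scores = {}
    for leading, pref in product(leading_packages, type_preferences):
        predictions = run_multiner(corpus, leading=leading,
                                   type_preference=pref)
        scores[(leading, pref)] = evaluate(predictions, gold)
    # Return the best-scoring configuration and its score.
    return max(scores.items(), key=lambda kv: kv[1])
```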