This repository has been archived by the owner on Jul 25, 2024. It is now read-only.

evaluate multiNER performance #12

Open
alexhebing opened this issue Apr 9, 2019 · 8 comments
Labels
question Further information is requested

Comments

@alexhebing

alexhebing commented Apr 9, 2019

After #17, #19 and #25, test the performance of multiNER with different configurations (e.g. type_preference, leading packages), and maybe adjust how these work together (see also #15).

In any case, evaluate using the Dutch Historical corpus from the KB. If time permits (i.e. things go smoothly), also run some tests against the Italian I-CAB corpus.

@alexhebing alexhebing self-assigned this Apr 9, 2019
@alexhebing alexhebing added the question Further information is requested label Apr 9, 2019
@alexhebing alexhebing changed the title from "test multiNER" to "evaluate multiNER performance" May 27, 2019
@alexhebing
Author

alexhebing commented Jun 4, 2019

Ok, I tried to use the parser from #22 to parse the KB corpus to plain txt, but I stumbled upon two problems:

  • Even after html.unescape() some HTML entities remain in the XMLs. ElementTree breaks because of them. So far, I have identified ><&". These need to be removed somehow. @BeritJanssen : is there a trick for this that is utilized in I-Analyzer (or elsewhere that you know of)?

  • There appears to be a problem with the encoding of the file in some cases. Thus far, Python cannot read the files that have 000010472 in their title. It crashes with UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 113948: invalid start byte. Strangely enough, it does work for the other files I tested with (urn=ddd_000010470_mpeg21_p002_alto.alto, urn=ddd_000010474_mpeg21_p002_alto.alto, urn=ddd_000011329_mpeg21_p002_alto.alto, urn=ddd_000014128_mpeg21_p001_alto.alto). What is happening here? Do some files have a different encoding...? That would be totally weird...
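On the first bullet, one thing that might be worth ruling out is the order of operations. The sketch below assumes (this is a guess, not something confirmed in this thread) that html.unescape() is applied to the raw XML before parsing, and that some values are double-escaped. The ALTO-like fragment and both function names are made up for illustration. Parsing first and unescaping the extracted text afterwards keeps the markup well-formed:

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical ALTO-like fragment; the attribute value is double-escaped.
RAW = '<String CONTENT="Jan &amp;amp; Co &lt;1850&gt;"/>'


def content_parse_first(raw_xml):
    """Parse the XML, then unescape whatever entities are left in the value."""
    element = ET.fromstring(raw_xml)
    return html.unescape(element.get("CONTENT"))


def content_unescape_first(raw_xml):
    """Unescape the raw markup, then parse. Literal '<' breaks the parser."""
    return ET.fromstring(html.unescape(raw_xml)).get("CONTENT")


print(content_parse_first(RAW))

try:
    content_unescape_first(RAW)
except ET.ParseError as error:
    print("unescape-first failed:", error)
```

If the entities in the KB files end up in the document some other way, this will not help, but it is a cheap check.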

@jgonggrijp
Member

There appears to be a problem with the encoding of the file in some cases. Thus far, Python cannot read the files that have 000010472 in their title. It crashes with UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 113948: invalid start byte. Strangely enough, it does work for the other files I tested with (urn=ddd_000010470_mpeg21_p002_alto.alto, urn=ddd_000010474_mpeg21_p002_alto.alto, urn=ddd_000011329_mpeg21_p002_alto.alto, urn=ddd_000014128_mpeg21_p001_alto.alto). What is happening here? Do some files have a different encoding...? That would be totally weird...

Check the offending character in the file in question. It's certainly not out of the question that the file is corrupted. I had to hand-correct a couple of files in the Times corpus, too.
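A minimal sketch of how the offending bytes could be located, assuming the file is read in binary mode first. The helper name show_bad_bytes and the sample bytes are invented for illustration. It reports the position of every byte sequence that the strict UTF-8 decoder rejects, together with some surrounding context:

```python
def show_bad_bytes(data, context=20):
    """Return (position, offending byte, surrounding bytes) per undecodable spot."""
    findings = []
    start = 0
    while start < len(data):
        try:
            data[start:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as error:
            position = start + error.start
            lower = max(0, position - context)
            findings.append(
                (position, data[position:position + 1], data[lower:position + context])
            )
            # Resume right after the rejected sequence.
            start += error.end
    return findings


# Made-up sample: UTF-8 text with two stray single-byte quotation marks.
sample = b"de heer \x84Van Dam\x94 zei"
for position, byte, around in show_bad_bytes(sample):
    print(position, byte, around)

# Purely a hypothesis to test: in Windows-1252, 0x84 is a low double quote.
print(b"\x84".decode("cp1252"))
```

Whether the KB files actually contain stray Windows-1252 bytes is not established here; looking at the context around each reported position should show whether that reading is plausible or whether the file is simply corrupted.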

@alexhebing
Author

alexhebing commented Jun 14, 2019

Ok, I finally made it through all the files and have parsed the Golden Standard corpus from XML to TXT files. Pfff... About a quarter of the 99 files had the problem with non-decodable bytes, and I handled them all manually. Does this have to do with OCR quality (e.g. extremely exotic characters?), or was a different encoding (than utf-8) used? If we ever establish contact again, ask Willem Jan from the KB about this. Anyhow, the cleaned .alto files are now in SurfDrive.

@jgonggrijp : is there a trick you use to handle this type of error in scripts dealing with large amounts of data? I am thinking along the lines of 1) ignoring the file and storing it somewhere, or 2) making a copy of the file without the byte at position X and trying to decode it again, repeating until it is decodable. But that is probably going to make things even messier. Anything that you do in your scripts that I can take a look at?
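For reference, the two options above could be sketched roughly as follows. The function names and the quarantine directory are hypothetical. Instead of deleting bytes one at a time, the second function uses the decoder's built-in errors="replace" mode, which swaps each undecodable sequence for U+FFFD. That is lossy, so it does not replace a manual check:

```python
import shutil
from pathlib import Path


def read_or_quarantine(path, quarantine_dir):
    """Option 1: return the text, or None after copying a bad file aside."""
    path = Path(path)
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        quarantine_dir = Path(quarantine_dir)
        quarantine_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, quarantine_dir / path.name)
        return None


def read_lossy(path):
    """Option 2 (variant): decode anyway, marking bad bytes with U+FFFD.

    The count also includes any U+FFFD that was already in the file.
    """
    text = Path(path).read_bytes().decode("utf-8", errors="replace")
    return text, text.count("\ufffd")
```

A script could call read_or_quarantine() in its main loop and leave the quarantined files for inspection later, so that one bad file does not stop the whole run.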

@jgonggrijp
Member

@alexhebing I haven't tried fixing such problems automatically. I think it requires strong intelligence. If it can be done with weak intelligence, I don't know how.

@alexhebing
Author

alexhebing commented Jul 3, 2019

MultiNER, to my great frustration, performed awfully against the Italian Golden Standard provided by Lorella. The best score was with a configuration with stanford as leading package and 2 as other packages_min:

[screenshot: stanford_2]

Something is wrong with either multiNER, the bio_converter or the evaluation script, or all of them. Some issues that I found already are in #29 (this also includes a link to a multiNER issue). Fix these, test some more, look for other issues.

For reference, here is the score that Spacy (run in isolation, i.e. separate from multiNER) got on the same Golden Standard (due to issues with the way Spacy processes the text you feed it, it only used 163 of the 190 files):
[screenshot: quick_spacy]

@alexhebing
Author

In addition, I ran a test with Stanford on the same Golden Standard. @BeritJanssen, look at the amazing score it gets (on 171 of the 190 files):

[screenshot: quick_stanford]

If only multiNER didn't contain bugs, surely it would score similarly, and probably better 😢

@BeritJanssen
Member

BeritJanssen commented Jul 4, 2019

Wow, that's a different story. Well, then use Stanford for this experiment, I would say... I see there is some occasional activity on the KB repo. Maybe you can make an issue with your screenshots there? It's good for them (and others) to know that there are some issues still with the output.

@alexhebing
Author

I forgot to mention (and think of) the fact that the model used for Italian was actually trained on this dataset. In that regard, the results are not that surprising.

I am in contact with the people at KB about a PR, but Willem Jan (if I remember his name correctly) is currently absent due to illness. The issues that we experience, however, are probably due to my extensive rewriting of the code.

3 participants