-
Notifications
You must be signed in to change notification settings - Fork 249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve quote detection by finding start and end quotation marks #380
base: develop
Are you sure you want to change the base?
Conversation
apparently they don't support PY3.9 ... note to self: figure out the readthedocs situation, this seems... not great
Hi @afriedman412 , thanks for submitting these changes! I'm pretty sure I follow the updated logic, but it would be particularly helpful to have new tests (or just new test cases for an existing test) that confirm what we should expect to be handled by it. Would you be able to add some in here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've suggested a few changes for performance and stylistic consistency reasons, but nothing major. my big ask continues to be additional testing to make sure that these changes do what we want them to, and to illustrate the improved extraction cases. i can certainly come up with some, but you surely know better than i what to test here.
src/textacy/constants.py
Outdated
""" | ||
Ordinal points of the token.is_quote characters, matched up by start and end. | ||
|
||
source: | ||
switch = "\"\'" | ||
start = "“‘```“‘«‹「『„‚" | ||
end = "”’’’’”’»›」』”’" | ||
|
||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you move this docstring to directly below (rather than above) the object's assignment?
src/textacy/constants.py
Outdated
|
||
""" | ||
QUOTATION_MARK_PAIRS = { | ||
(34, 34), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not critical, but helpful: could you add the corresponding marks in a comment on each line?
src/textacy/extract/triples.py
Outdated
for n, q in enumerate(qtok): | ||
if q.i not in [q_[1] for q_ in qtok_pair_idxs]: | ||
for q_ in qtok[n+1:]: | ||
if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS: | ||
qtok_pair_idxs.append((q.i, q_.i)) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i haven't profiled this code, but this series of nested loops feels like it could be a performance bottleneck. is there a still-clear but more performant alternative?
src/textacy/extract/triples.py
Outdated
# def get_cue_candidates(window_sents, verbs): | ||
# return [ | ||
# tok | ||
# for sent in window_sents | ||
# for tok in sent | ||
# if filter_cue_candidates(tok, verbs) | ||
# ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this still be here?
src/textacy/extract/triples.py
Outdated
return all([ | ||
tok.pos == VERB, | ||
tok.lemma_ in verbs | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return all([ | |
tok.pos == VERB, | |
tok.lemma_ in verbs | |
]) | |
return (tok.pos == VERB and tok.lemma_ in verbs) |
src/textacy/extract/triples.py
Outdated
return all([ | ||
candidate.pos!=PUNCT, | ||
((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j)), | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return all([ | |
candidate.pos!=PUNCT, | |
((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j)), | |
]) | |
return ( | |
candidate.pos != PUNCT | |
and ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j)) | |
) |
src/textacy/extract/triples.py
Outdated
for i_, j_ in list(zip( | ||
[tok.i for tok in doc if tok.text=="\n"], | ||
[tok.i for tok in doc if tok.text=="\n"][1:]) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to improve efficiency/performance here
for i_, j_ in list(zip( | |
[tok.i for tok in doc if tok.text=="\n"], | |
[tok.i for tok in doc if tok.text=="\n"][1:]) | |
): | |
lb_tok_idxs = [tok.i for tok in doc if tok.text == "\n"] | |
for i_, j_ in zip(lb_tok_idxs, lb_tok_idxs[1:]): |
src/textacy/extract/triples.py
Outdated
""" | ||
if para: | ||
i_, j_ = line_break_window(i, j, doc) | ||
if i_: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i_: | |
if i_ is not None: |
src/textacy/extract/triples.py
Outdated
return ( | ||
sent | ||
for sent in doc.sents | ||
# these boundary cases are a subtle bit of work... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙇♂️
src/textacy/extract/triples.py
Outdated
windower(qtok_start_idx, qtok_end_idx, doc, True), | ||
windower(qtok_start_idx, qtok_end_idx, doc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible for these two approaches to ever produce identical windows? does that matter?
hey @afriedman412 , does the newer PR (#382) supersede this one? |
…x` package, min_quote_length is now a `direct_quotations` parameter (not a constant), added a better example for testing linebreaks that function as closing quotes
The current code for detecting quotes is pretty unsophisticated. It just sequentially pairs anything the token.is_quote deems a quotation mark and assumes the indexes to be the quote boundaries. If there are an odd number of quotation marks, it throws an error. I've been doing quote detection in some of unreliably formatted text lately which has things like "»" used as bullet points and lots of unpredictable stray characters, so I came up with a workaround.
My version only matches specific types of quotation marks with their accepted counterparts. For example, if there is
“
it will pair it with the next”
. If it never finds a match, it ignores the singleton quotation mark. I did this by making a list of 11 approved tuples of unicode codes (which live in .constants).For example:
Bill told me I "shouldn‘t wear those pants" but I will.
In the current version, running quote detection here would raise an error because there are three quotation mark-like tokens in the sentence. Even if it didn't, it would return "shouldn" as a quote because textacy assumes sequential quotation marks are quote boundaries.
My version takes the first quotation mark (
q
) and iterates through all the later quotation marks until it finds one (q_
) where(ord(q.text), ord(q_.text))
is in the list of acceptable pairs. Here, that pair would be(34, 34)
because a double quotation mark is unicode number 34.