Custom space symbol #83
base: master
specify the blank symbol with. The default blank symbol is " ". This was previously hard-coded in the decoder, causing problems when using it for handwriting recognition where the blank symbol may be different, for example "|".
modified: ctcdecode/__init__.py
modified: ctcdecode/src/binding.cpp
modified: ctcdecode/src/binding.h
modified: ctcdecode/src/ctc_beam_search_decoder.cpp
modified: ctcdecode/src/ctc_beam_search_decoder.h
modified: ctcdecode/src/decoder_utils.cpp
modified: ctcdecode/src/scorer.cpp
modified: ctcdecode/src/scorer.h
…be called on the space symbol before passing it as an argument to "ctc_decode.paddle_get_scorer", "ctc_decode.paddle_beam_decode" and "ctc_decode.paddle_beam_decode_lm". Otherwise it will not be in the const char* format expected by the C++ interface for these parameters.
modified: ctcdecode/__init__.py
…ustom_blank_symbol
Thanks for the contrib, Gideon. A few minor nits for you to correct and we'll get it merged in. Thanks!
@@ -35,7 +36,9 @@ std::vector<std::pair<double, Output>> ctc_beam_search_decoder(
// size_t blank_id = vocabulary.size();

// assign space id
auto it = std::find(vocabulary.begin(), vocabulary.end(), " ");
// Changed by Gideon from the blank symbol " " to a custom symbol specified as argument
Dead code tells no lies. Please remove this comment and the line of code you commented out below (line 41). That's what git history is for.
@@ -153,7 +153,8 @@ bool add_word_to_dictionary(
std::vector<int> int_word;

for (auto &c : characters) {
if (c == " ") {
// if (c == " ") {
if (c == "|") { // Gideon: replaced the space symbol " " => "|"
This can't be hardcoded. You'll have to parameterize the space character. Though, looking at it more closely, it looks like you could probably just do a lookup based on the SPACE_ID param...
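A minimal sketch of the lookup idea, not the code that was merged: resolve the separator to its vocabulary index once, then compare indices instead of string literals. The names `find_space_id` and `contains_space` are hypothetical helpers introduced only for illustration and do not exist in ctcdecode.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (not part of ctcdecode): returns the index of
// `space_symbol` in `vocabulary`, or -1 if the symbol is not present.
int find_space_id(const std::vector<std::string>& vocabulary,
                  const std::string& space_symbol) {
  auto it = std::find(vocabulary.begin(), vocabulary.end(), space_symbol);
  if (it == vocabulary.end()) return -1;
  return static_cast<int>(std::distance(vocabulary.begin(), it));
}

// Hypothetical helper: checks for the word separator by id, so no
// separator literal such as " " or "|" appears in the comparison.
bool contains_space(const std::vector<int>& ids, int space_id) {
  assert(space_id >= 0);  // the caller must have resolved the separator first
  return std::find(ids.begin(), ids.end(), space_id) != ids.end();
}
```

With a vocabulary of `{"a", "b", "|", "_"}`, looking up `"|"` yields index 2, while looking up `" "` yields -1 because that vocabulary has no plain space.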
@@ -16,7 +16,8 @@ using namespace lm::ngram;
Scorer::Scorer(double alpha,
double beta,
const std::string& lm_path,
const std::vector<std::string>& vocab_list) {
const std::vector<std::string>& vocab_list,
const std::string &space_symbol) {
Can you please set a default arg here.
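For reference, a sketch of what a defaulted parameter could look like, assuming the declaration lives in a header and the definition in a separate source file. `ScorerSketch` is a simplified, hypothetical stand-in for the reviewed class (it omits `lm_path` and the language-model loading); only the placement of the default is the point here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the reviewed Scorer class.
class ScorerSketch {
 public:
  // The default value goes on the declaration, so callers that do not
  // pass a separator keep compiling and keep getting " ".
  ScorerSketch(double alpha,
               double beta,
               const std::vector<std::string>& vocab_list,
               const std::string& space_symbol = " ");

  double alpha() const { return alpha_; }
  double beta() const { return beta_; }
  const std::string& space_symbol() const { return space_symbol_; }

 private:
  double alpha_;
  double beta_;
  std::vector<std::string> vocab_list_;
  std::string space_symbol_;
};

// The out-of-line definition must not repeat the default value.
ScorerSketch::ScorerSketch(double alpha,
                           double beta,
                           const std::vector<std::string>& vocab_list,
                           const std::string& space_symbol)
    : alpha_(alpha),
      beta_(beta),
      vocab_list_(vocab_list),
      space_symbol_(space_symbol) {
  assert(!space_symbol_.empty());  // an empty separator is not meaningful
}
```

Keeping `" "` as the default means the extra parameter is backward compatible, and only callers that need a different separator, such as `"|"`, have to pass it.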
if (labels.empty()) return {};

std::string s = vec2str(labels);
std::vector<std::string> words;
if (is_character_based_) {
words = split_utf8_str(s);
} else {
words = split_str(s, " ");
// words = split_str(s, " ");
words = split_str(s, space_symbol); //Gideon: replaced the space character from " " to a custom string
Delete dead code and comment
if (char_list_[i] == " ") {
SPACE_ID_ = i;
//if (char_list_[i] == " ") {
if (char_list_[i] == space_symbol) { //Gideon: replaced the space character from " " to a custom string
Delete dead code and comment
…h, which facilitates usage with pytorch 1.0.
Merge branch 'master' of https://github.com/parlance/ctcdecode into custom_space_symbol
ctcdecode/__init__.py
ctcdecode/src/binding.cpp
ctcdecode/src/binding.h
ctcdecode/src/ctc_beam_search_decoder.cpp
ctcdecode/src/ctc_beam_search_decoder.h
Changes to be committed:
modified: README.md
modified: build.py
modified: ctcdecode/__init__.py
modified: ctcdecode/src/binding.cpp
modified: ctcdecode/src/binding.h
modified: ctcdecode/src/ctc_beam_search_decoder.cpp
modified: ctcdecode/src/ctc_beam_search_decoder.h
modified: ctcdecode/src/decoder_utils.cpp
modified: ctcdecode/src/decoder_utils.h
modified: ctcdecode/src/path_trie.cpp
modified: ctcdecode/src/path_trie.h
modified: requirements.txt
modified: setup.py
modified: tests/test.py
The position of "self._space_symbol.encode()" in the argument list was swapped during merging of the code.
- self._cutoff_prob, self.cutoff_top_n, self._blank_id,self._log_probs, self._space_symbol.encode(), output, timesteps,
- scores, out_seq_len)
+ self._cutoff_prob, self.cutoff_top_n, self._blank_id, self._space_symbol.encode(), self._log_probs,
+ output, timesteps, scores, out_seq_len)
modified: ctcdecode/__init__.py
I added an extra argument to the decoders to allow specification of a custom space symbol. Currently, the space symbol used by the decoder is hard-coded as " ". This is probably fine in most cases, but it does not work, for example, in my problem domain of handwriting recognition, where the word separator can be a special symbol such as "|" and the normal space symbol " " may not be used at all.
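To illustrate why the separator matters, here is a small self-contained sketch of splitting decoded text into words on a configurable symbol. `split_on_symbol` is a hypothetical function written for this example; it is not the library's own `split_str`, whose exact behavior is not shown in this conversation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` into words on `space_symbol`,
// skipping empty pieces produced by repeated separators.
std::vector<std::string> split_on_symbol(const std::string& text,
                                         const std::string& space_symbol) {
  assert(!space_symbol.empty());  // an empty separator would never advance
  std::vector<std::string> words;
  std::size_t start = 0;
  while (start <= text.size()) {
    std::size_t end = text.find(space_symbol, start);
    if (end == std::string::npos) end = text.size();
    if (end > start) words.push_back(text.substr(start, end - start));
    start = end + space_symbol.size();
  }
  return words;
}
```

With `"|"` as the separator, `"the|quick|fox"` splits into three words. Splitting the same string on `" "` returns it as a single word, which is the failure mode described above for handwriting-recognition vocabularies.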