Unicode character classes #235
Comments
Hi @terpstra , your grep was correct: re2c doesn't support syntactic aliases for Unicode character classes yet. There is no technical reason it can't do that, but you are the first to ask. As a temporary quick workaround, I can generate and distribute together with re2c source code an "official" file with re2c definitions of Unicode categories: unicode_categories.re.txt. This is to be included verbatim into your .re files; the name |
Thanks a lot for this! Does re2c support some form of 'include'? Dumping tables this large into a source file whose main focus is parsing distracts the reader. Ultimately, I think users will want all the classes and subclasses in Unicode. For example, also the Lu class for upper-case letters / etc. Do you think this is a good candidate for future inclusion? |
No, but it would be useful. An initial implementation might only allow including files from the current directory (the one re2c is run from); otherwise we'd also need to support include paths.
Agreed.
Yes. Don't close the issue. :) |
I've noticed that "L \ Lu" in re2c v1.1.1 reports: It seems that the inclusion of any value above 0x80 in a character class renders it no longer a character class. |
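Editorial note: the error message itself did not survive in this thread. One plausible reading (an assumption, based on the 1.2 changelog entry for #236, "Support range difference with variable-length encodings") is that with UTF-8 enabled a code point at or above 0x80 is no longer a single code unit, so a class containing it cannot be treated as a plain set of bytes. The sketch below is ordinary C, not re2c internals; `utf8_encode` is a hypothetical helper name used only to show the byte lengths involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 for surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For instance, U+0041 stays one byte, while U+00C0 (the first non-ASCII member of `Lu`) becomes the two bytes C3 80. A difference such as `L \ Lu` therefore has to be computed on code points before they are expanded into byte sequences.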
@terpstra Meanwhile, re2c learnt to handle include files b94c5af:
|
Nice! Do you plan to put unicode_categories.re somewhere in the include path? For now I'm just copy-pasting it into my own symbol.re as you suggested. |
For now I think the best option is to copy Perhaps later re2c will install these definition files in some default locations, or at least default relative to re2c root directory, and we'll have a "standard library" of useful regular expressions. |
FYI -- this precompiled set of unicode definitions is fantastic -- I needed to add support for unicode strings to a project I started today, and found this. Made short work of an otherwise complicated problem. Thanks! (PS-- Thanks for asking about this Brett!) (PPS -- It goes without saying, but also to second Brett's thanks for re2c. I've been using it for a few years now and am always impressed with how easy it is to use!) |
Glad to hear that it works for you! |
Who is Brett? From the context, it sounds like you meant me. |
@terpstra My apologies, you are right. I saw terpstra and immediately thought of Brett Terpstra since our software projects intersect at times. But that isn't you, so while my comment stands in its intent (I appreciate your asking about this!) it doesn't mean quite as much since we've never met and your name is not Brett..... Move along.... Nothing to see here.... Just another person making an idiot of themselves on the internet... ;) |
Can someone give an example of a character class that handles Unicode? Here is what I have now; I want to allow IDENTIFIER to contain Unicode (UTF-8) characters and WS to contain Unicode white space. I can see that there is now (1.3) an include "unicode_categories.re", but there is no usage example and it's not clear to me how to use it.
|
You are right, the documentation is lacking. Here is a working example:

#include <assert.h>
#include <stdio.h>

int lex(const char *YYCURSOR)
{
    const char *YYMARKER, *s = YYCURSOR;

    /*!include:re2c "re2c-1.3/include/unicode_categories.re" */

    /*!re2c
    re2c:define:YYCTYPE = 'unsigned char';
    re2c:flags:utf-8 = 1;
    re2c:yyfill:enable = 0;

    D = [0-9] ;
    E = [Ee] [+-]? D+ ;
    INTSUFFIX = ( "LL" | "ULL" | "ll" | "ull") ;
    INTNUMBER = ( D+ ) INTSUFFIX? ;
    FLOATNUMBER = ( D+ | D* "." D+ | D+ "." D* ) E? ;
    CPLXNUMBER = ( D+ "." D+ ) "i" ;
    HEX_P = [Pp] [+-]? D+ ;
    HEXNUMBER = ('0' [xX] [0-9a-fA-F]+) (HEX_P | INTSUFFIX)? ;
    WS = [ \t\r\v\f] ;
    LF = [\n] ;
    END = [\000] ;
    ANY = [^] \ END ;
    ESC = [\\] ;
    SQ = ['] ;
    DQ = ["] ;
    STRING1 = SQ ( ANY \ SQ \ ESC | ESC ANY )* SQ ;
    STRING2 = DQ ( ANY \ DQ \ ESC | ESC ANY )* DQ ;
    IDENTIFIER = L ( L | D )* ;

    "ХЫ!" { printf("special: %.*s\n", (int)(YYCURSOR - s), s); return 0; }
    IDENTIFIER { printf("identifier: %.*s\n", (int)(YYCURSOR - s), s); return 1; }
    STRING1 { printf("string-1: %.*s\n", (int)(YYCURSOR - s), s); return 2; }
    STRING2 { printf("string-2: %.*s\n", (int)(YYCURSOR - s), s); return 3; }
    HEXNUMBER { printf("hex: %.*s\n", (int)(YYCURSOR - s), s); return 4; }
    INTNUMBER { printf("integer: %.*s\n", (int)(YYCURSOR - s), s); return 5; }
    FLOATNUMBER { printf("floating: %.*s\n", (int)(YYCURSOR - s), s); return 6; }
    CPLXNUMBER { printf("complex: %.*s\n", (int)(YYCURSOR - s), s); return 7; }
    * { printf("error\n"); return -1; }
    */
}

int main()
{
    assert(lex("ХЫ!") == 0);
    assert(lex("хыхы") == 1);
    assert(lex("'хыхы'") == 2);
    assert(lex("\"хыхы\"") == 3);
    assert(lex("0x3ff") == 4);
    assert(lex("123") == 5);
    assert(lex("123.45e-6") == 6);
    assert(lex("123.45i") == 7);
    return 0;
}

I assumed that

$ re2c unicode_example.re -W \
    --input-encoding utf8 \
    -ounicode_example.c \
    && cc unicode_example.c -ounicode_example

Here
|
Thank you for the example!
Looking at https://www.fileformat.info/info/unicode/category/index.htm, I found the descriptions of the character classes, and looking at unicode_categories.re, I could see literal repetitions of several characters, like:
Is there any disadvantage in using something like the rewrite below?
Cheers! |
@mingodad Thanks for the extended program!
No, absolutely not, and I would write it that way if I wrote it by hand. As it happens though, the file is autogenerated by a script https://github.com/skvadrik/re2c/blob/master/test/encodings/unicode_groups.hs#L149. The script can be fixed to generate shorter output. That would probably not affect the time spent by re2c on compilation by much (the bottleneck is usually large size of the DFA caused by the complexity of Unicode character classes). It certainly shouldn't affect the generated DFA. |
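Editorial note: the proposed rewrite did not survive in this thread, so the following is only a sketch of how a generator could emit shorter classes, assuming the repetition consists of single-code-point or adjacent ranges. It is plain C rather than the Haskell script linked above; `range_t`, `merge_ranges` and `format_class` are hypothetical names. The output uses the `\x`, `\u` and `\U` escapes of re2c character classes.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { uint32_t lo, hi; } range_t; /* inclusive code point range */

static int cmp_range(const void *a, const void *b)
{
    const range_t *x = a, *y = b;
    if (x->lo != y->lo) return x->lo < y->lo ? -1 : 1;
    if (x->hi != y->hi) return x->hi < y->hi ? -1 : 1;
    return 0;
}

/* Sort ranges, then merge overlapping or adjacent ones in place.
 * Returns the number of ranges left. */
size_t merge_ranges(range_t *r, size_t n)
{
    size_t out = 0;
    if (n == 0) return 0;
    qsort(r, n, sizeof *r, cmp_range);
    for (size_t i = 1; i < n; ++i) {
        if (r[i].lo <= r[out].hi || r[i].lo - r[out].hi == 1) {
            if (r[i].hi > r[out].hi) r[out].hi = r[i].hi;
        } else {
            r[++out] = r[i];
        }
    }
    return out + 1;
}

/* Append one escaped code point at buf[pos]; returns the new position. */
static size_t put_cp(char *buf, size_t size, size_t pos, uint32_t cp)
{
    const char *fmt = cp <= 0xFF ? "\\x%02x" : cp <= 0xFFFF ? "\\u%04x" : "\\U%08x";
    int k = snprintf(pos < size ? buf + pos : NULL,
                     pos < size ? size - pos : 0, fmt, (unsigned)cp);
    return pos + (k > 0 ? (size_t)k : 0);
}

/* Append one character, keeping the buffer NUL-terminated. */
static size_t put_ch(char *buf, size_t size, size_t pos, char c)
{
    if (pos + 1 < size) { buf[pos] = c; buf[pos + 1] = '\0'; }
    return pos + 1;
}

/* Write "[lo-hi...]" for n merged ranges; returns the length needed. */
size_t format_class(char *buf, size_t size, const range_t *r, size_t n)
{
    size_t pos = put_ch(buf, size, 0, '[');
    for (size_t i = 0; i < n; ++i) {
        pos = put_cp(buf, size, pos, r[i].lo);
        if (r[i].hi != r[i].lo) {
            pos = put_ch(buf, size, pos, '-');
            pos = put_cp(buf, size, pos, r[i].hi);
        }
    }
    return put_ch(buf, size, pos, ']');
}
```

Given the six input ranges U+0041, U+0042, U+0043..U+005A, U+0100, U+0101 and U+0102, `merge_ranges` leaves two ranges and `format_class` produces `[\x41-\x5a\u0100-\u0102]`. As noted above, this only shortens the definitions file; the generated DFA should be the same.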
Hi folks, Just an FYI, I wrote a small C++ program to generate the Unicode 13.0 category definitions for re2c. https://github.com/NickStrupat/re2c-unicode-categories Thank you for your hard work building and maintaining re2c! |
@NickStrupat Awesome, thank you! Do you mind if I add your repo as a submodule and update |
Don't mind at all :) I'm just giving it a test now, so maybe hold off until I make sure it's all working. |
Sure, just give me a shout when you are done. |
Hi @NickStrupat, any update on this? Did you have the time to test your program? |
Not definitively. I think it works, but I'm not sure how to test it well, given my current time allowance. |
Ok, thanks. |
2.0.3 (2020-08-22)
~~~~~~~~~~~~~~~~~~

- Fix issues when building re2c as a CMake subproject (`#302 <https://github.com/skvadrik/re2c/pull/302>`_:

- Final corrections in the SIMPA article "RE2C: A lexer generator based on lookahead-TDFA", https://doi.org/10.1016/j.simpa.2020.100027

2.0.2 (2020-08-08)
~~~~~~~~~~~~~~~~~~

- Enable re2go building by default.

- Package CMake files into release tarball.

2.0.1 (2020-07-29)
~~~~~~~~~~~~~~~~~~

- Updated version for CMake build system (forgotten in release 2.0).

- Added a short article about re2c for the Software Impacts journal.

2.0 (2020-07-20)
~~~~~~~~~~~~~~~~

- Added new code generation backend for Go and a new ``re2go`` program (`#272 <https://github.com/skvadrik/re2c/issues/272>`_: Go support). Added option ``--lang <c | go>``.

- Added CMake build system as an alternative to Autotools (`#275 <https://github.com/skvadrik/re2c/pull/275>`_: Add a CMake build system (thanks to ligfx), `#244 <https://github.com/skvadrik/re2c/issues/244>`_: Switching to CMake).

- Changes in generic API:

  + Removed primitives ``YYSTAGPD`` and ``YYMTAGPD``.
  + Added primitives ``YYSHIFT``, ``YYSHIFTSTAG``, ``YYSHIFTMTAG`` that allow to express fixed tags in terms of generic API.
  + Added configurations ``re2c:api:style`` and ``re2c:api:sigil``.
  + Added named placeholders in interpolated configuration strings.

- Changes in reuse mode (``-r, --reuse`` option):

  + Do not reset API-related configurations in each `use:re2c` block (`#291 <https://github.com/skvadrik/re2c/issues/291>`_: Defines in rules block are not propagated to use blocks).
  + Use block-local options instead of last block options.
  + Do not accumulate options from rules/reuse blocks in whole-program options.
  + Generate non-overlapping YYFILL labels for reuse blocks.
  + Generate start label for each reuse block in storable state mode.

- Changes in start-conditions mode (``-c, --start-conditions`` option):

  + Allow to use normal (non-conditional) blocks in `-c` mode (`#263 <https://github.com/skvadrik/re2c/issues/263>`_: allow mixing conditional and non-conditional blocks with -c, `#296 <https://github.com/skvadrik/re2c/issues/296>`_: Conditions required for all lexers when using '-c' option).
  + Generate condition switch in every re2c block (`#295 <https://github.com/skvadrik/re2c/issues/295>`_: Condition switch generated for only one lexer per file).

- Changes in the generated labels:

  + Use ``yyeof`` label prefix instead of ``yyeofrule``.
  + Use ``yyfill`` label prefix instead of ``yyFillLabel``.
  + Decouple start label and initial label (affects label numbering).

- Removed undocumented configuration ``re2c:flags:o``, ``re2c:flags:output``.

- Changes in ``re2c:flags:t``, ``re2c:flags:type-header`` configuration: filename is now relative to the output file directory.

- Added option ``--case-ranges`` and configuration ``re2c:flags:case-ranges``.

- Extended fixed tags optimization for the case of fixed-counter repetition.

- Fixed bugs related to EOF rule:

  + `#276 <https://github.com/skvadrik/re2c/issues/276>`_: Example 01_fill.re in docs is broken
  + `#280 <https://github.com/skvadrik/re2c/issues/280>`_: EOF rules with multiple blocks
  + `#284 <https://github.com/skvadrik/re2c/issues/284>`_: mismatched YYBACKUP and YYRESTORE (Add missing fallback states with EOF rule)

- Fixed miscellaneous bugs:

  + `#286 <https://github.com/skvadrik/re2c/issues/286>`_: Incorrect submatch values with fixed-length trailing context.
  + `#297 <https://github.com/skvadrik/re2c/issues/297>`_: configure error on ubuntu 18.04 / cmake 3.10

- Changed bootstrap process (require explicit configuration flags and a path to re2c executable to regenerate the lexers).

- Added internal options ``--posix-prectable <naive | complex>``.

- Added debug option ``--dump-dfa-tree``.

- Major revision of the paper "Efficient POSIX submatch extraction on NFA".

----
1.3x
----

1.3 (2019-12-14)
~~~~~~~~~~~~~~~~

- Added option: ``--stadfa``.

- Added warning: ``-Wsentinel-in-midrule``.

- Added generic API primitives:

  + ``YYSTAGPD``
  + ``YYMTAGPD``

- Added configurations:

  + ``re2c:sentinel = 0;``
  + ``re2c:define:YYSTAGPD = "YYSTAGPD";``
  + ``re2c:define:YYMTAGPD = "YYMTAGPD";``

- Worked on reproducible builds (`#258 <https://github.com/skvadrik/re2c/pull/258>`_: Make the build reproducible).

----
1.2x
----

1.2.1 (2019-08-11)
~~~~~~~~~~~~~~~~~~

- Fixed bug `#253 <https://github.com/skvadrik/re2c/issues/253>`_: re2c should install unicode_categories.re somewhere.

- Fixed bug `#254 <https://github.com/skvadrik/re2c/issues/254>`_: Turn off re2c:eof = 0.

1.2 (2019-08-02)
~~~~~~~~~~~~~~~~

- Added EOF rule ``$`` and configuration ``re2c:eof``.

- Added ``/*!include:re2c ... */`` directive and ``-I`` option.

- Added ``/*!header:re2c:on*/`` and ``/*!header:re2c:off*/`` directives.

- Added ``--input-encoding <ascii | utf8>`` option.

  + `#237 <https://github.com/skvadrik/re2c/issues/237>`_: Handle non-ASCII encoded characters in regular expressions
  + `#250 <https://github.com/skvadrik/re2c/issues/250>`_ UTF8 enoding

- Added include file with a list of definitions for Unicode character classes.

  + `#235 <https://github.com/skvadrik/re2c/issues/235>`_: Unicode character classes

- Added ``--location-format <gnu | msvc>`` option.

  + `#195 <https://github.com/skvadrik/re2c/issues/195>`_: Please consider using Gnu format for error messages

- Added ``--verbose`` option that prints "success" message if re2c exits without errors.

- Added configurations for options:

  + ``-o --output`` (specify output file)
  + ``-t --type-header`` (specify header file)

- Removed configurations for internal/debug options.

- Extended ``-r`` option: allow to mix multiple ``/*!rules:re2c*/``, ``/*!use:re2c*/`` and ``/*!re2c*/`` blocks.

  + `#55 <https://github.com/skvadrik/re2c/issues/55>`_: allow standard re2c blocks in reuse mode

- Fixed ``-F --flex-support`` option: parsing and operator precedence.

  + `#229 <https://github.com/skvadrik/re2c/issues/229>`_: re2c option -F (flex syntax) broken
  + `#242 <https://github.com/skvadrik/re2c/issues/242>`_: Operator precedence with --flex-syntax is broken

- Changed difference operator ``/`` to apply before encoding expansion of operands.

  + `#236 <https://github.com/skvadrik/re2c/issues/236>`_: Support range difference with variable-length encodings

- Changed output generation of output file to be atomic.

  + `#245 <https://github.com/skvadrik/re2c/issues/245>`_: re2c output is not atomic

- Authored research paper "Efficient POSIX Submatch Extraction on NFA" together with Dr Angelo Borsotti.

- Added experimental libre2c library (``--enable-libs`` configure option) with the following algorithms:

  + TDFA with leftmost-greedy disambiguation
  + TDFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with leftmost-greedy disambiguation
  + TNFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with lazy POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with POSIX disambiguation (Kuklewicz algorithm)
  + TNFA with POSIX disambiguation (Cox algorithm)

- Added debug subsystem (``--enable-debug`` configure option) and new debug options:

  + ``-dump-cfg`` (dump control flow graph of tag variables)
  + ``-dump-interf`` (dump interference table of tag variables)
  + ``-dump-closure-stats`` (dump epsilon-closure statistics)

- Added internal options:

  + ``--posix-closure <gor1 | gtop>`` (switch between shortest-path algorithms used for the construction of POSIX closure)

- Fixed a number of crashes found by American Fuzzy Lop fuzzer:

  + `#226 <https://github.com/skvadrik/re2c/issues/226>`_,
    `#227 <https://github.com/skvadrik/re2c/issues/227>`_,
    `#228 <https://github.com/skvadrik/re2c/issues/228>`_,
    `#231 <https://github.com/skvadrik/re2c/issues/231>`_,
    `#232 <https://github.com/skvadrik/re2c/issues/232>`_,
    `#233 <https://github.com/skvadrik/re2c/issues/233>`_,
    `#234 <https://github.com/skvadrik/re2c/issues/234>`_,
    `#238 <https://github.com/skvadrik/re2c/issues/238>`_

- Fixed handling of newlines:

  + correctly parse multi-character newlines CR LF in ``#line`` directives
  + consistently convert all newlines in the generated file to Unix-style LF

- Changed default tarball format from .gz to .xz.

  + `#221 <https://github.com/skvadrik/re2c/issues/221>`_: big source tarball

- Fixed a number of other bugs and resolved issues:

  + `#2 <https://github.com/skvadrik/re2c/issues/2>`_: abort
  + `#6 <https://github.com/skvadrik/re2c/issues/6>`_: segfault
  + `#10 <https://github.com/skvadrik/re2c/issues/10>`_: lessons/002_upn_calculator/calc_002 doesn't produce a useful example program
  + `#44 <https://github.com/skvadrik/re2c/issues/44>`_: Access violation when translating the attached file
  + `#49 <https://github.com/skvadrik/re2c/issues/49>`_: wildcard state \000 rules makes lexer behave weard
  + `#98 <https://github.com/skvadrik/re2c/issues/98>`_: Transparent handling of #line directives in input files
  + `#104 <https://github.com/skvadrik/re2c/issues/104>`_: Improve const-correctness
  + `#105 <https://github.com/skvadrik/re2c/issues/105>`_: Conversion of pointer parameters into references
  + `#114 <https://github.com/skvadrik/re2c/issues/114>`_: Possibility of fixing bug 2535084
  + `#120 <https://github.com/skvadrik/re2c/issues/120>`_: condition consisting of default rule only is ignored
  + `#167 <https://github.com/skvadrik/re2c/issues/167>`_: Add word boundary support
  + `#168 <https://github.com/skvadrik/re2c/issues/168>`_: Wikipedia's article on re2c
  + `#180 <https://github.com/skvadrik/re2c/issues/180>`_: Comment syntax?
  + `#182 <https://github.com/skvadrik/re2c/issues/182>`_: yych being set by YYPEEK () and then not used
  + `#196 <https://github.com/skvadrik/re2c/issues/196>`_: Implicit type conversion warnings
  + `#198 <https://github.com/skvadrik/re2c/issues/198>`_: no match for ‘operator!=’ in ‘i != std::vector<_Tp, _Alloc>::rend() [with _Tp = re2c::bitmap_t, _Alloc = std::allocator<re2c::bitmap_t>]()’
  + `#210 <https://github.com/skvadrik/re2c/issues/210>`_: How to build re2c in windows?
  + `#215 <https://github.com/skvadrik/re2c/issues/215>`_: A memory read overrun issue in s_to_n32_unsafe.cc
  + `#220 <https://github.com/skvadrik/re2c/issues/220>`_: src/dfa/dfa.h: simplify constructor to avoid g++-3.4 bug
  + `#223 <https://github.com/skvadrik/re2c/issues/223>`_: Fix typo
  + `#224 <https://github.com/skvadrik/re2c/issues/224>`_: src/dfa/closure_posix.cc: pack() tweaks
  + `#225 <https://github.com/skvadrik/re2c/issues/225>`_: Documentation link is broken in libre2c/README
  + `#230 <https://github.com/skvadrik/re2c/issues/230>`_: Changes for upcoming Travis' infra migration
  + `#239 <https://github.com/skvadrik/re2c/issues/239>`_: Push model example has wrong re2c invocation, breaks guide
  + `#241 <https://github.com/skvadrik/re2c/issues/241>`_: Guidance on how to use re2c for full-duplex command & response protocol
  + `#243 <https://github.com/skvadrik/re2c/issues/243>`_: A code generated for period (.) requires 4 bytes
  + `#246 <https://github.com/skvadrik/re2c/issues/246>`_: Please add a license to this repo
  + `#247 <https://github.com/skvadrik/re2c/issues/247>`_: Build failure on current Cygwin, probably caused by force-fed c++98 mode
  + `#248 <https://github.com/skvadrik/re2c/issues/248>`_: distcheck still looks for README
  + `#251 <https://github.com/skvadrik/re2c/issues/251>`_: Including what you use is find, but not without inclusion guards

- Updated documentation and website.
We should probably parse https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt directly rather than rely on other language libraries. |
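Editorial note: as a rough sketch of what parsing that file directly involves, each line of UnicodeData.txt is a list of `;`-separated fields, where the first field is the code point in hexadecimal and the third is the General_Category (`Lu`, `Ll`, `Nd`, ...). The function name `parse_ucd_line` is hypothetical, and this is not how re2c currently generates its definitions.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse one line of UnicodeData.txt.
 * Field 0 is the code point in hex, field 1 the name, field 2 the
 * two-letter General_Category. Stores the code point in *cp and the
 * category (NUL-terminated) in cat. Returns 1 on success, 0 otherwise. */
int parse_ucd_line(const char *line, uint32_t *cp, char cat[3])
{
    char *end;
    unsigned long v = strtoul(line, &end, 16);
    if (end == line || *end != ';' || v > 0x10FFFF) return 0;

    /* Skip the name field. */
    const char *name_end = strchr(end + 1, ';');
    if (!name_end) return 0;

    /* The category field must be exactly two characters long. */
    const char *c = name_end + 1;
    const char *cat_end = strchr(c, ';');
    if (!cat_end || cat_end - c != 2) return 0;

    cat[0] = c[0];
    cat[1] = c[1];
    cat[2] = '\0';
    *cp = (uint32_t)v;
    return 1;
}
```

A real generator would also need to handle the large blocks that the file lists only as a pair of lines whose names end in `First>` and `Last>` (CJK ideographs, for example). The sketch above treats those two lines as ordinary single code points.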
Howdy! I find myself wanting to build a lexer for a variant on C23, for which identifiers are now defined as (using unicode character class jargon) |
Not much has changed in this area: we still have unicode_categories.re with character categories. What would you like to have? |
Maybe as a start add XID_Start and XID_Continue to the unicode categories in that file because so many programming languages now define identifiers using that notation? I think at some point it would be nice to be explicitly generating all the categories (including the ones defined by binary properties) that Unicode has; there are a lot of them now. I'm not quite sure what script is being used to generate these right now, but I might be able to help if I knew; I did a quick look and couldn't figure it out quickly. (It should probably be generated as part of a full build; I think right now it's being checked in from an external process?) |
Firstly, thanks a lot for this tool. It saved me a lot of time! I am using re2c to create a parser for an as-yet unpublished build tool. The input files are utf-8 encoded. Everything works fine for the ascii character set.
However, I'd like to expand my identifier space to include/allow unicode letters in addition to [a-zA-Z]. Currently the only way to do this that I can see is to write a parser for UnicodeData.txt that grabs all of the letter category code points and dumps them into a giant character class. That's fine, but now I have a generator for a generator for C++. It seems like this sort of Unicode character class functionality would be more naturally supported directly in re2c itself.
I was somewhat surprised this was not already supported, so I went looking for these classes in re2c and could not find them. Apologies if this is already supported and my grep-powers were insufficient.
Thanks!