Unicode character classes #235

Open
terpstra opened this issue Dec 22, 2018 · 27 comments

@terpstra

Firstly, thanks a lot for this tool. It saved me a lot of time! I am using re2c to create a parser for an as-yet unpublished build tool. The input files are utf-8 encoded. Everything works fine for the ascii character set.

However, I'd like to expand my identifier space to include/allow unicode letters in addition to [a-zA-Z]. Currently the only way to do this that I can see is to write a parser for UnicodeData.txt that grabs all of the letter category code points and dumps them into a giant character class. That's fine, but now I have a generator for a generator for C++. It seems like this sort of Unicode character class functionality would be more naturally supported directly in re2c itself.

I was somewhat surprised this was not already supported, so I went looking for these classes in re2c and could not find them. Apologies if this is already supported and my grep-powers were insufficient.

Thanks!
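For illustration, the "generator for a generator" described above could look roughly like the sketch below. It assumes the standard UnicodeData.txt record layout (semicolon-separated fields: code point, name, general category, ...) and that records arrive sorted by code point. The function names `parse_record`, `append_range` and `emit_class` are hypothetical, and the sketch does not expand the `<..., First>` / `<..., Last>` pairs that UnicodeData.txt uses for large blocks.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one UnicodeData.txt record of the form
 * "<hex code point>;<name>;<general category>;...".
 * Returns 1 on success, 0 if the line is malformed. */
static int parse_record(const char *line, unsigned long *cp, char cat[3])
{
    char *end;
    const char *p;

    *cp = strtoul(line, &end, 16);
    if (end == line || *end != ';') return 0;
    p = strchr(end + 1, ';');               /* skip the name field */
    if (p == NULL || p[1] == '\0' || p[2] == '\0') return 0;
    cat[0] = p[1];
    cat[1] = p[2];
    cat[2] = '\0';
    return 1;
}

/* Append one range in re2c escape syntax to the string in out. */
static void append_range(char *out, size_t cap,
                         unsigned long lo, unsigned long hi)
{
    size_t len = strlen(out);

    assert(len < cap);
    if (hi <= 0xFFFFul)
        snprintf(out + len, cap - len, "\\u%04lx-\\u%04lx", lo, hi);
    else
        snprintf(out + len, cap - len, "\\U%08lx-\\U%08lx", lo, hi);
}

/* Build a definition such as "L = [\u0041-\u005a];" from every record whose
 * general category starts with `major`. Consecutive code points are merged
 * into one range. Aborts (via assert) if the buffer is too small. */
static void emit_class(const char *const *lines, size_t n, char major,
                       char *out, size_t cap)
{
    unsigned long cp, lo = 0, hi = 0;
    char cat[3];
    int open = 0;                           /* is a range being accumulated? */
    size_t i;

    snprintf(out, cap, "%c = [", major);
    for (i = 0; i < n; ++i) {
        if (!parse_record(lines[i], &cp, cat) || cat[0] != major) continue;
        if (open && cp == hi + 1) { hi = cp; continue; }  /* extend range */
        if (open) append_range(out, cap, lo, hi);
        lo = hi = cp;
        open = 1;
    }
    if (open) append_range(out, cap, lo, hi);
    assert(strlen(out) + 3 <= cap);
    strcat(out, "];");
}
```

Calling `emit_class(lines, n, 'L', buf, sizeof buf)` on the records for U+0041, U+0042 and U+0061 would produce a definition covering the ranges 0041-0042 and 0061-0061. Matching on the first category letter only is a simplification; a subcategory such as Lu would need a two-character comparison.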

@skvadrik
Owner

Hi @terpstra , your grep was correct: re2c doesn't support syntactic aliases for Unicode character classes yet. There is no technical reason it can't do that, but you are the first to ask.

As a quick temporary workaround, I can generate an "official" file with re2c definitions of Unicode categories and distribute it together with the re2c source code: unicode_categories.re.txt. It is meant to be included verbatim in your .re files; the name L can then be used in subsequent re2c blocks to denote Unicode letters. The definitions are generated by the same scripts that generate the re2c tests, so they are consistent with what re2c is able to handle at the moment. The generator doesn't use UnicodeData.txt directly (though it should); it uses the Haskell Data.CharSet library.

@terpstra
Author

Thanks a lot for this! Does re2c support some form of 'include'? Dumping tables this large into a source file whose main focus is parsing distracts the reader.

Ultimately, I think users will want all the classes and subclasses in Unicode, for example the Lu class for upper-case letters. Do you think this is a good candidate for future inclusion?

@skvadrik
Owner

> Does re2c support some form of 'include'?

No, but it would be useful. An initial implementation might only allow including files from the current directory (the one re2c is run from); otherwise we'd also need to support include paths.

> Ultimately, I think users will want all the classes and subclasses in Unicode.

Agreed.

> Do you think this is a good candidate for future inclusion?

Yes. Don't close the issue. :)

@terpstra
Author

I've noticed that "L \ Lu" in re2c v1.1.1 reports:
re2c: error: line 359, column 12: can only difference char sets

It seems that the inclusion of any value above 0x80 in a character class renders it no longer a character class.

@skvadrik
Owner

@terpstra I opened #236: this is a known limitation, but worth a separate issue.
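One way to see where this limitation comes from, assuming re2c 1.1.1 expands a class into encoded byte sequences before applying the difference operator: under UTF-8 a code point at or above 0x80 is no longer a single code unit, so a class that contains one becomes an alternation of byte sequences of different lengths rather than a plain set of symbols. The sketch below is a generic UTF-8 encoder (the name `utf8_encode` is hypothetical, not part of re2c) that makes the varying lengths visible.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode scalar value as UTF-8. Returns the number of bytes
 * written to out (1 to 4), or 0 if cp is a surrogate or out of range. */
static size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates */
    if (cp < 0x10000) {                     /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, `utf8_encode('A', buf)` yields one byte, while the Cyrillic letter U+0445 yields the two bytes D1 85. Taking the difference on code point sets first and encoding afterwards avoids the problem, because code point ranges are still ordinary sets at that stage.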

@skvadrik
Owner

@terpstra Meanwhile, re2c has learnt to handle include files (commit b94c5af):

  • /*!include:re2c "x.re" */ works the same way as #include "x.re" in C/C++, as if x.re were pasted verbatim in place of the directive.
  • The -I <path> option specifies search paths for included files. The default search path is the directory of the source file, e.g. if you run re2c x/y/z.re, then the default include path will be x/y/.

@terpstra
Author

Nice!

Do you plan to put unicode_categories.re somewhere in the include path? For now I'm just copy-pasting it into my own symbol.re as you suggested.

@skvadrik
Owner

For now I think the best option is to copy unicode_categories.re into your source tree and then put /*!include:re2c "path/unicode_categories.re" */ in your .re file. If unicode_categories.re gets updated, at least you won't have to modify the including .re file and glue it together from pieces.

Perhaps later re2c will install these definition files in some default locations, or at least default relative to re2c root directory, and we'll have a "standard library" of useful regular expressions.

@fletcher

fletcher commented Jan 17, 2019

FYI -- this precompiled set of unicode definitions is fantastic -- I needed to add support for unicode strings to a project I started today, and found this. Made short work of an otherwise complicated problem. Thanks!

(PS-- Thanks for asking about this Brett!)

(PPS -- It goes without saying, but also to second Brett's thanks for re2c. I've been using it for a few years now and am always impressed with how easy it is to use!)

@skvadrik
Owner

Glad to hear that it works for you!
I wish the original re2c author and long-time contributors like @dnuffer read the above comment.

@terpstra
Author

Who is Brett? From the context, it sounds like you meant me.

@fletcher

@terpstra My apologies, you are right. I saw terpstra and immediately thought of Brett Terpstra since our software projects intersect at times. But that isn't you, so while my comment stands in its intent (I appreciate your asking about this!) it doesn't mean quite as much since we've never met and your name is not Brett.....

Move along.... Nothing to see here.... Just another person making an idiot of themselves on the internet... ;)

@skvadrik skvadrik mentioned this issue May 23, 2019
@mingodad

mingodad commented Jan 30, 2020

Can someone give an example of a character class that handles Unicode?

Here is what I have now. I want to allow IDENTIFIER to contain Unicode (UTF-8) characters, and WS to contain Unicode white space.

I can see that there is now (1.3) an include file "unicode_categories.re", but there is no usage example and it's not clear to me how to use it.

/*!re2c
  //re2c:flags:utf-8 = 1;
  re2c:yyfill:enable = 0;

  D        = [0-9] ;
  E        = [Ee] [+-]? D+ ;
  L        = [a-zA-Z_] ;

  INTSUFFIX   = ( "LL" | "ULL" | "ll" | "ull") ;

  INTNUMBER   = ( D+ ) INTSUFFIX? ;
  FLOATNUMBER   = ( D+ | D* "." D+ | D+ "." D* ) E? ;
  CPLXNUMBER   = ( D+ "." D+ ) "i" ;

  HEX_P    = [Pp] [+-]? D+ ;
  HEXNUM = ('0' [xX] [0-9a-fA-F]+) (HEX_P | INTSUFFIX)? ;

  WS       = [ \t\r\v\f] ;
  LF       = [\n] ;
  END      = [\000] ;
  ANY      = [\000-\377] \ END ;

  ESC      = [\\] ;
  SQ       = ['] ;
  DQ       = ["] ;

  STRING1  = SQ ( ANY \ SQ \ ESC | ESC ANY )* SQ ;
  STRING2  = DQ ( ANY \ DQ \ ESC | ESC ANY )* DQ ;

  IDENTIFIER = L ( L | D )* ;

*/

@skvadrik
Owner

You are right, the documentation is lacking. Here is a working example:

#include <assert.h>
#include <stdio.h>

int lex(const char *YYCURSOR)
{
    const char *YYMARKER, *s = YYCURSOR;
    /*!include:re2c "re2c-1.3/include/unicode_categories.re" */
    /*!re2c

    re2c:define:YYCTYPE = 'unsigned char';
    re2c:flags:utf-8 = 1;
    re2c:yyfill:enable = 0;

    D = [0-9] ;
    E = [Ee] [+-]? D+ ;

    INTSUFFIX   = ( "LL" | "ULL" | "ll" | "ull") ;

    INTNUMBER   = ( D+ ) INTSUFFIX? ;
    FLOATNUMBER = ( D+ | D* "." D+ | D+ "." D* ) E? ;
    CPLXNUMBER  = ( D+ "." D+ ) "i" ;

    HEX_P       = [Pp] [+-]? D+ ;
    HEXNUMBER   = ('0' [xX] [0-9a-fA-F]+) (HEX_P | INTSUFFIX)? ;

    WS       = [ \t\r\v\f] ;
    LF       = [\n] ;
    END      = [\000] ;
    ANY      = [^] \ END ;

    ESC      = [\\] ;
    SQ       = ['] ;
    DQ       = ["] ;

    STRING1  = SQ ( ANY \ SQ \ ESC | ESC ANY )* SQ ;
    STRING2  = DQ ( ANY \ DQ \ ESC | ESC ANY )* DQ ;

    IDENTIFIER = L ( L | D )* ;

    "ХЫ!"       { printf("special:    %.*s\n", (int)(YYCURSOR - s), s); return 0; }
    IDENTIFIER  { printf("identifier: %.*s\n", (int)(YYCURSOR - s), s); return 1; }
    STRING1     { printf("string-1:   %.*s\n", (int)(YYCURSOR - s), s); return 2; }
    STRING2     { printf("string-2:   %.*s\n", (int)(YYCURSOR - s), s); return 3; }
    HEXNUMBER   { printf("hex:        %.*s\n", (int)(YYCURSOR - s), s); return 4; }
    INTNUMBER   { printf("integer:    %.*s\n", (int)(YYCURSOR - s), s); return 5; }
    FLOATNUMBER { printf("floating:   %.*s\n", (int)(YYCURSOR - s), s); return 6; }
    CPLXNUMBER  { printf("complex:    %.*s\n", (int)(YYCURSOR - s), s); return 7; }
    *           { printf("error\n"); return -1; }

    */
}

int main()
{
    assert(lex("ХЫ!") == 0);
    assert(lex("хыхы") == 1);
    assert(lex("'хыхы'") == 2);
    assert(lex("\"хыхы\"") == 3);
    assert(lex("0x3ff") == 4);
    assert(lex("123") == 5);
    assert(lex("123.45e-6") == 6);
    assert(lex("123.45i") == 7);
    return 0;
}

I assumed that unicode_categories.re is in a subdirectory re2c-1.3/include, but it may be in a different place depending on your system and re2c installation (you can always use -I). Build:

$ re2c unicode_example.re -W \
     --input-encoding utf8 \
     -ounicode_example.c \
   && cc unicode_example.c -ounicode_example

Here --input-encoding utf8 is only needed if you plan to use Unicode literals like ХЫ! in this example (it's a feature orthogonal to unicode_categories.re). Output:

$ ./unicode_example
special:    ХЫ!
identifier: хыхы
string-1:   'хыхы'
string-2:   "хыхы"
hex:        0x3ff
integer:    123
floating:   123.45e-6
complex:    123.45i

@mingodad

Thank you for the example!
Here is the same example, slightly modified to handle Unicode white space and underscores in identifiers:

#include <assert.h>
#include <stdio.h>

enum {
	TK_LEXERROR=-1,
	TK_SPECIAL,
	TK_WS,
	TK_IDENT,
	TK_STR_SQ,
	TK_STR_DQ,
	TK_HEXNUM,
	TK_INTNUM,
	TK_FLOATNUM,
	TK_COMPLEXNUM,
};

int lex(const char *YYCURSOR)
{
    const char *YYMARKER, *s = YYCURSOR;
    /*!include:re2c "re2c-1.3/include/unicode_categories.re" */
    /*!re2c

    re2c:define:YYCTYPE = 'unsigned char';
    re2c:flags:utf-8 = 1;
    re2c:yyfill:enable = 0;

    D = [0-9] ;
    E = [Ee] [+-]? D+ ;

    INTSUFFIX   = ( "LL" | "ULL" | "ll" | "ull") ;

    INTNUMBER   = ( D+ ) INTSUFFIX? ;
    FLOATNUMBER = ( D+ | D* "." D+ | D+ "." D* ) E? ;
    CPLXNUMBER  = ( D+ "." D+ ) "i" ;

    HEX_P       = [Pp] [+-]? D+ ;
    HEXNUMBER   = ('0' [xX] [0-9a-fA-F]+) (HEX_P | INTSUFFIX)? ;

    WS       = ([ \t\r\v\f] | Zs | Zp);
    LF       = [\n] ;
    END      = [\000] ;
    ANY      = [^] \ END ;

    ESC      = [\\] ;
    SQ       = ['] ;
    DQ       = ["] ;

    STRING1  = SQ ( ANY \ SQ \ ESC | ESC ANY )* SQ ;
    STRING2  = DQ ( ANY \ DQ \ ESC | ESC ANY )* DQ ;

    IDENTIFIER = ('_' | L) ( '_' | L | D )* ;

    "ХЫ!"       { printf("special:    %.*s\n", (int)(YYCURSOR - s), s); return TK_SPECIAL; }
    WS          { printf("white space: >%.*s<\n", (int)(YYCURSOR - s), s); return TK_WS; }
    IDENTIFIER  { printf("identifier: %.*s\n", (int)(YYCURSOR - s), s); return TK_IDENT; }
    STRING1     { printf("string-1:   %.*s\n", (int)(YYCURSOR - s), s); return TK_STR_SQ; }
    STRING2     { printf("string-2:   %.*s\n", (int)(YYCURSOR - s), s); return TK_STR_DQ; }
    HEXNUMBER   { printf("hex:        %.*s\n", (int)(YYCURSOR - s), s); return TK_HEXNUM; }
    INTNUMBER   { printf("integer:    %.*s\n", (int)(YYCURSOR - s), s); return TK_INTNUM; }
    FLOATNUMBER { printf("floating:   %.*s\n", (int)(YYCURSOR - s), s); return TK_FLOATNUM; }
    CPLXNUMBER  { printf("complex:    %.*s\n", (int)(YYCURSOR - s), s); return TK_COMPLEXNUM; }
    *           { printf("error\n"); return TK_LEXERROR; }

    */
}

int main()
{
    assert(lex("ХЫ!") == TK_SPECIAL);
    assert(lex("хыхы") == TK_IDENT);
    assert(lex("見る") == TK_IDENT);
    assert(lex("_見_る") == TK_IDENT);
    assert(lex("見_る") == TK_IDENT);
    assert(lex("_見_る_") == TK_IDENT);
    assert(lex(" ") == TK_WS);
    assert(lex("	") == TK_WS);
    assert(lex("\r") == TK_WS);
    assert(lex("\v") == TK_WS);
    assert(lex("\f") == TK_WS);
    assert(lex(" ") == TK_WS);
    assert(lex("'хыхы'") == TK_STR_SQ);
    assert(lex("'見る'") == TK_STR_SQ);
    assert(lex("\"хыхы\"") == TK_STR_DQ);
    assert(lex("\"見る\"") == TK_STR_DQ);
    assert(lex("0x3ff") == TK_HEXNUM);
    assert(lex("123") == TK_INTNUM);
    assert(lex("123.45e-6") == TK_FLOATNUM);
    assert(lex("123.45i") == TK_COMPLEXNUM);
    return 0;
}

Also, looking at https://www.fileformat.info/info/unicode/category/index.htm I could see the descriptions of the character classes, and looking at unicode_categories.re I could see that several characters are repeated literally, like:

Z = [\x20-\x20\xa0-\xa0\u1680-\u1680\u2000-\u200a\u2028-\u2029\u202f-\u202f\u205f-\u205f\u3000-\u3000];
Zs = [\x20-\x20\xa0-\xa0\u1680-\u1680\u2000-\u200a\u202f-\u202f\u205f-\u205f\u3000-\u3000];
Zl = [\u2028-\u2028];
Zp = [\u2029-\u2029];

Is there any disadvantage in using something like the rewrite below?

/*Separator, Space*/
Zs = [\x20-\x20\xa0-\xa0\u1680-\u1680\u2000-\u200a\u202f-\u202f\u205f-\u205f\u3000-\u3000];
/*Separator, Line*/
Zl = [\u2028-\u2028];
/*Separator, Paragraph*/
Zp = [\u2029-\u2029];
/*Separators*/
Z = (Zs | Zl | Zp) ;

Cheers !

@skvadrik
Owner

@mingodad Thanks for the extended program!

> Is there any disadvantage in using something like the rewrite below?

No, absolutely not, and I would write it that way if I wrote it by hand. As it happens though, the file is autogenerated by a script https://github.com/skvadrik/re2c/blob/master/test/encodings/unicode_groups.hs#L149.

The script can be fixed to generate shorter output. That would probably not affect the time re2c spends on compilation by much (the bottleneck is usually the large size of the DFA caused by the complexity of Unicode character classes). It certainly shouldn't affect the generated DFA.
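As a sanity check on the claim that both spellings denote the same class, the sketch below treats each class as a list of code point ranges, merges overlapping or adjacent ranges, and compares the results. It only demonstrates set equality of Z and Zs | Zl | Zp using the ranges quoted in this thread; it says nothing about re2c internals, and the names `range_t`, `normalize` and `same_class` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { unsigned long lo, hi; } range_t;

/* qsort comparator: order ranges by their lower bound. */
static int cmp_range(const void *a, const void *b)
{
    const range_t *x = a, *y = b;
    return (x->lo > y->lo) - (x->lo < y->lo);
}

/* Sort ranges in place and merge overlapping or adjacent ones.
 * Returns the number of ranges left at the front of the array. */
static size_t normalize(range_t *r, size_t n)
{
    size_t i, out = 0;

    assert(r != NULL || n == 0);
    if (n == 0) return 0;
    qsort(r, n, sizeof *r, cmp_range);
    for (i = 1; i < n; ++i) {
        if (r[i].lo <= r[out].hi + 1) {     /* touches the current range */
            if (r[i].hi > r[out].hi) r[out].hi = r[i].hi;
        } else {
            r[++out] = r[i];                /* start a new range */
        }
    }
    return out + 1;
}

/* Two classes are the same set if their normalized range lists match.
 * Both arrays are normalized in place. */
static int same_class(range_t *a, size_t na, range_t *b, size_t nb)
{
    size_t i;

    na = normalize(a, na);
    nb = normalize(b, nb);
    if (na != nb) return 0;
    for (i = 0; i < na; ++i)
        if (a[i].lo != b[i].lo || a[i].hi != b[i].hi) return 0;
    return 1;
}
```

Feeding in the seven Zs ranges plus U+2028 and U+2029 gives eight ranges after normalization, with 2028-2029 merged into one, which matches the literal definition of Z.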

@NickStrupat

Hi folks,

Just an FYI, I wrote a small C++ program to generate the Unicode 13.0 category definitions for re2c.

https://github.com/NickStrupat/re2c-unicode-categories

Thank you for your hard work building and maintaining re2c!

@skvadrik
Owner

@NickStrupat Awesome, thank you! Do you mind if I add your repo as a submodule and update include/unicode_categories.txt with the output of your program?

@NickStrupat

NickStrupat commented Mar 29, 2020

Don't mind at all :)

I'm just giving it a test now, so maybe hold off until I make sure it's all working.

@skvadrik
Owner

Sure, just give me a shout when you are done.

@skvadrik
Owner

Hi @NickStrupat, any update on this? Did you have the time to test your program?

@NickStrupat

Not definitively. I think it works, but I'm not sure how to test it well, given my current time allowance.

@skvadrik
Owner

Ok, thanks.

netbsd-srcmastr referenced this issue in NetBSD/pkgsrc Sep 20, 2020
2.0.3 (2020-08-22)
~~~~~~~~~~~~~~~~~~

- Fix issues when building re2c as a CMake subproject
  (`#302 <https://github.com/skvadrik/re2c/pull/302>`_:

- Final corrections in the SIMPA article "RE2C: A lexer generator based on
  lookahead-TDFA", https://doi.org/10.1016/j.simpa.2020.100027

2.0.2 (2020-08-08)
~~~~~~~~~~~~~~~~~~

- Enable re2go building by default.

- Package CMake files into release tarball.

2.0.1 (2020-07-29)
~~~~~~~~~~~~~~~~~~

- Updated version for CMake build system (forgotten in release 2.0).

- Added a short article about re2c for the Software Impacts journal.

2.0 (2020-07-20)
~~~~~~~~~~~~~~~~

- Added new code generation backend for Go and a new ``re2go`` program
  (`#272 <https://github.com/skvadrik/re2c/issues/272>`_: Go support).
  Added option ``--lang <c | go>``.

- Added CMake build system as an alternative to Autotools
  (`#275 <https://github.com/skvadrik/re2c/pull/275>`_:
  Add a CMake build system (thanks to ligfx),
  `#244 <https://github.com/skvadrik/re2c/issues/244>`_: Switching to CMake).

- Changes in generic API:

  + Removed primitives ``YYSTAGPD`` and ``YYMTAGPD``.
  + Added primitives ``YYSHIFT``, ``YYSHIFTSTAG``, ``YYSHIFTMTAG``
    that allow to express fixed tags in terms of generic API.
  + Added configurations ``re2c:api:style`` and ``re2c:api:sigil``.
  + Added named placeholders in interpolated configuration strings.

- Changes in reuse mode (``-r, --reuse`` option):

  + Do not reset API-related configurations in each `use:re2c` block
    (`#291 <https://github.com/skvadrik/re2c/issues/291>`_:
    Defines in rules block are not propagated to use blocks).
  + Use block-local options instead of last block options.
  + Do not accumulate options from rules/reuse blocks in whole-program options.
  + Generate non-overlapping YYFILL labels for reuse blocks.
  + Generate start label for each reuse block in storable state mode.

- Changes in start-conditions mode (``-c, --start-conditions`` option):

  + Allow to use normal (non-conditional) blocks in `-c` mode
    (`#263 <https://github.com/skvadrik/re2c/issues/263>`_:
    allow mixing conditional and non-conditional blocks with -c,
    `#296 <https://github.com/skvadrik/re2c/issues/296>`_:
    Conditions required for all lexers when using '-c' option).
  + Generate condition switch in every re2c block
    (`#295 <https://github.com/skvadrik/re2c/issues/295>`_:
    Condition switch generated for only one lexer per file).

- Changes in the generated labels:

  + Use ``yyeof`` label prefix instead of ``yyeofrule``.
  + Use ``yyfill`` label prefix instead of ``yyFillLabel``.
  + Decouple start label and initial label (affects label numbering).

- Removed undocumented configuration ``re2c:flags:o``, ``re2c:flags:output``.

- Changes in ``re2c:flags:t``, ``re2c:flags:type-header`` configuration:
  filename is now relative to the output file directory.

- Added option ``--case-ranges`` and configuration ``re2c:flags:case-ranges``.

- Extended fixed tags optimization for the case of fixed-counter repetition.

- Fixed bugs related to EOF rule:

  + `#276 <https://github.com/skvadrik/re2c/issues/276>`_:
    Example 01_fill.re in docs is broken
  + `#280 <https://github.com/skvadrik/re2c/issues/280>`_:
    EOF rules with multiple blocks
  + `#284 <https://github.com/skvadrik/re2c/issues/284>`_:
    mismatched YYBACKUP and YYRESTORE
    (Add missing fallback states with EOF rule)

- Fixed miscellaneous bugs:

  + `#286 <https://github.com/skvadrik/re2c/issues/286>`_:
    Incorrect submatch values with fixed-length trailing context.
  + `#297 <https://github.com/skvadrik/re2c/issues/297>`_:
    configure error on ubuntu 18.04 / cmake 3.10

- Changed bootstrap process (require explicit configuration flags and a path to
  re2c executable to regenerate the lexers).

- Added internal options ``--posix-prectable <naive | complex>``.

- Added debug option ``--dump-dfa-tree``.

- Major revision of the paper "Efficient POSIX submatch extraction on NFA".

----
1.3x
----

1.3 (2019-12-14)
~~~~~~~~~~~~~~~~

- Added option: ``--stadfa``.

- Added warning: ``-Wsentinel-in-midrule``.

- Added generic API primitives:

  + ``YYSTAGPD``
  + ``YYMTAGPD``

- Added configurations:

  + ``re2c:sentinel = 0;``
  + ``re2c:define:YYSTAGPD = "YYSTAGPD";``
  + ``re2c:define:YYMTAGPD = "YYMTAGPD";``

- Worked on reproducible builds
  (`#258 <https://github.com/skvadrik/re2c/pull/258>`_:
  Make the build reproducible).

----
1.2x
----

1.2.1 (2019-08-11)
~~~~~~~~~~~~~~~~~~

- Fixed bug `#253 <https://github.com/skvadrik/re2c/issues/253>`_:
  re2c should install unicode_categories.re somewhere.

- Fixed bug `#254 <https://github.com/skvadrik/re2c/issues/254>`_:
  Turn off re2c:eof = 0.

1.2 (2019-08-02)
~~~~~~~~~~~~~~~~

- Added EOF rule ``$`` and configuration ``re2c:eof``.

- Added ``/*!include:re2c ... */`` directive and ``-I`` option.

- Added ``/*!header:re2c:on*/`` and ``/*!header:re2c:off*/`` directives.

- Added ``--input-encoding <ascii | utf8>`` option.

  + `#237 <https://github.com/skvadrik/re2c/issues/237>`_:
    Handle non-ASCII encoded characters in regular expressions
  + `#250 <https://github.com/skvadrik/re2c/issues/250>`_
    UTF8 enoding

- Added include file with a list of definitions for Unicode character classes.

  + `#235 <https://github.com/skvadrik/re2c/issues/235>`_:
    Unicode character classes

- Added ``--location-format <gnu | msvc>`` option.

  + `#195 <https://github.com/skvadrik/re2c/issues/195>`_:
    Please consider using Gnu format for error messages

- Added ``--verbose`` option that prints "success" message if re2c exits
  without errors.

- Added configurations for options:

  + ``-o --output`` (specify output file)
  + ``-t --type-header`` (specify header file)

- Removed configurations for internal/debug options.

- Extended ``-r`` option: allow to mix multiple ``/*!rules:re2c*/``,
  ``/*!use:re2c*/`` and ``/*!re2c*/`` blocks.

  + `#55 <https://github.com/skvadrik/re2c/issues/55>`_:
    allow standard re2c blocks in reuse mode

- Fixed ``-F --flex-support`` option: parsing and operator precedence.

  + `#229 <https://github.com/skvadrik/re2c/issues/229>`_:
    re2c option -F (flex syntax) broken
  + `#242 <https://github.com/skvadrik/re2c/issues/242>`_:
    Operator precedence with --flex-syntax is broken

- Changed difference operator ``\`` to apply before encoding expansion of
  operands.

  + `#236 <https://github.com/skvadrik/re2c/issues/236>`_:
    Support range difference with variable-length encodings

- Changed output generation of output file to be atomic.

  + `#245 <https://github.com/skvadrik/re2c/issues/245>`_:
    re2c output is not atomic

- Authored research paper "Efficient POSIX Submatch Extraction on NFA"
  together with Dr Angelo Borsotti.

- Added experimental libre2c library (``--enable-libs`` configure option) with
  the following algorithms:

  + TDFA with leftmost-greedy disambiguation
  + TDFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with leftmost-greedy disambiguation
  + TNFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with lazy POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with POSIX disambiguation (Kuklewicz algorithm)
  + TNFA with POSIX disambiguation (Cox algorithm)

- Added debug subsystem (``--enable-debug`` configure option) and new debug
  options:

  + ``-dump-cfg`` (dump control flow graph of tag variables)
  + ``-dump-interf`` (dump interference table of tag variables)
  + ``-dump-closure-stats`` (dump epsilon-closure statistics)

- Added internal options:

  + ``--posix-closure <gor1 | gtop>`` (switch between shortest-path algorithms
    used for the construction of POSIX closure)

- Fixed a number of crashes found by American Fuzzy Lop fuzzer:

  + `#226 <https://github.com/skvadrik/re2c/issues/226>`_,
    `#227 <https://github.com/skvadrik/re2c/issues/227>`_,
    `#228 <https://github.com/skvadrik/re2c/issues/228>`_,
    `#231 <https://github.com/skvadrik/re2c/issues/231>`_,
    `#232 <https://github.com/skvadrik/re2c/issues/232>`_,
    `#233 <https://github.com/skvadrik/re2c/issues/233>`_,
    `#234 <https://github.com/skvadrik/re2c/issues/234>`_,
    `#238 <https://github.com/skvadrik/re2c/issues/238>`_

- Fixed handling of newlines:

  + correctly parse multi-character newlines CR LF in ``#line`` directives
  + consistently convert all newlines in the generated file to Unix-style LF

- Changed default tarball format from .gz to .xz.

  + `#221 <https://github.com/skvadrik/re2c/issues/221>`_:
    big source tarball

- Fixed a number of other bugs and resolved issues:

  + `#2 <https://github.com/skvadrik/re2c/issues/2>`_: abort
  + `#6 <https://github.com/skvadrik/re2c/issues/6>`_: segfault
  + `#10 <https://github.com/skvadrik/re2c/issues/10>`_:
    lessons/002_upn_calculator/calc_002 doesn't produce a useful example program
  + `#44 <https://github.com/skvadrik/re2c/issues/44>`_:
    Access violation when translating the attached file
  + `#49 <https://github.com/skvadrik/re2c/issues/49>`_:
    wildcard state \000 rules makes lexer behave weard
  + `#98 <https://github.com/skvadrik/re2c/issues/98>`_:
    Transparent handling of #line directives in input files
  + `#104 <https://github.com/skvadrik/re2c/issues/104>`_:
    Improve const-correctness
  + `#105 <https://github.com/skvadrik/re2c/issues/105>`_:
    Conversion of pointer parameters into references
  + `#114 <https://github.com/skvadrik/re2c/issues/114>`_:
    Possibility of fixing bug 2535084
  + `#120 <https://github.com/skvadrik/re2c/issues/120>`_:
    condition consisting of default rule only is ignored
  + `#167 <https://github.com/skvadrik/re2c/issues/167>`_:
    Add word boundary support
  + `#168 <https://github.com/skvadrik/re2c/issues/168>`_:
    Wikipedia's article on re2c
  + `#180 <https://github.com/skvadrik/re2c/issues/180>`_:
    Comment syntax?
  + `#182 <https://github.com/skvadrik/re2c/issues/182>`_:
    yych being set by YYPEEK () and then not used
  + `#196 <https://github.com/skvadrik/re2c/issues/196>`_:
    Implicit type conversion warnings
  + `#198 <https://github.com/skvadrik/re2c/issues/198>`_:
    no match for ‘operator!=’ in ‘i != std::vector<_Tp, _Alloc>::rend() [with _Tp = re2c::bitmap_t, _Alloc = std::allocator<re2c::bitmap_t>]()’
  + `#210 <https://github.com/skvadrik/re2c/issues/210>`_:
    How to build re2c in windows?
  + `#215 <https://github.com/skvadrik/re2c/issues/215>`_:
    A memory read overrun issue in s_to_n32_unsafe.cc
  + `#220 <https://github.com/skvadrik/re2c/issues/220>`_:
    src/dfa/dfa.h: simplify constructor to avoid g++-3.4 bug
  + `#223 <https://github.com/skvadrik/re2c/issues/223>`_:
    Fix typo
  + `#224 <https://github.com/skvadrik/re2c/issues/224>`_:
    src/dfa/closure_posix.cc: pack() tweaks
  + `#225 <https://github.com/skvadrik/re2c/issues/225>`_:
    Documentation link is broken in libre2c/README
  + `#230 <https://github.com/skvadrik/re2c/issues/230>`_:
    Changes for upcoming Travis' infra migration
  + `#239 <https://github.com/skvadrik/re2c/issues/239>`_:
    Push model example has wrong re2c invocation, breaks guide
  + `#241 <https://github.com/skvadrik/re2c/issues/241>`_:
    Guidance on how to use re2c for full-duplex command & response protocol
  + `#243 <https://github.com/skvadrik/re2c/issues/243>`_:
    A code generated for period (.) requires 4 bytes
  + `#246 <https://github.com/skvadrik/re2c/issues/246>`_:
    Please add a license to this repo
  + `#247 <https://github.com/skvadrik/re2c/issues/247>`_:
    Build failure on current Cygwin, probably caused by force-fed c++98 mode
  + `#248 <https://github.com/skvadrik/re2c/issues/248>`_:
    distcheck still looks for README
  + `#251 <https://github.com/skvadrik/re2c/issues/251>`_:
    Including what you use is find, but not without inclusion guards

- Updated documentation and website.

@skvadrik
Owner

We should probably parse https://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt directly rather than rely on other language libraries.

@pmetzger
Contributor

pmetzger commented Jan 6, 2025

Howdy! I find myself wanting to build a lexer for a variant on C23, for which identifiers are now defined as (using unicode character class jargon) XID_Start XID_Continue*; what would be the easiest way to deal with that in a modern re2c context?

@skvadrik
Owner

skvadrik commented Jan 7, 2025

Not much change in this area, we still have unicode_categories.re with character categories. What would you like to have?

@pmetzger
Contributor

pmetzger commented Jan 7, 2025

Maybe as a start add XID_Start and XID_Continue to the unicode categories in that file because so many programming languages now define identifiers using that notation?

I think at some point it would be nice to be explicitly generating all the categories (including the ones defined by binary properties) that Unicode has; there are a lot of them now.

I'm not quite sure what script is being used to generate these right now, but I might be able to help if I knew; I did a quick look and couldn't figure it out quickly. (It should probably be generated as part of a full build; I think right now it's being checked in from an external process?)
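As a possible starting point for generating property-based classes such as XID_Start, here is a minimal sketch of a line parser. It assumes the input is the Unicode Character Database file DerivedCoreProperties.txt, which is not mentioned in this thread but is where XID_Start and XID_Continue are listed, with data lines of the form `0041..005A    ; XID_Start # ...` or `00AA          ; XID_Start # ...`. The function name `parse_prop_line` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse one line of DerivedCoreProperties.txt.
 * Returns 1 and fills lo/hi if the line assigns property `prop` to a code
 * point or a "lo..hi" range; returns 0 for comments, blank lines, malformed
 * lines and other properties. lo/hi are only meaningful when 1 is returned. */
static int parse_prop_line(const char *line, const char *prop,
                           unsigned long *lo, unsigned long *hi)
{
    char *end;
    const char *p;
    size_t len;

    assert(line != NULL && prop != NULL);
    *lo = strtoul(line, &end, 16);
    if (end == line) return 0;              /* comment or blank line */
    *hi = *lo;
    p = end;
    if (p[0] == '.' && p[1] == '.') {       /* "lo..hi" range form */
        *hi = strtoul(p + 2, &end, 16);
        if (end == p + 2) return 0;
        p = end;
    }
    p += strspn(p, " \t");
    if (*p != ';') return 0;
    p += 1;
    p += strspn(p, " \t");
    len = strcspn(p, " \t#;\r\n");          /* length of the property name */
    return len == strlen(prop) && strncmp(p, prop, len) == 0;
}
```

Collecting the ranges returned for "XID_Start" and for "XID_Continue" and printing them in the same range syntax that unicode_categories.re already uses would give two definitions from which an identifier rule of the form XID_Start XID_Continue* can be written.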
