UTF-8 friendly parser/lexer

Per,

Could you please show us how to make a parser/lexer that supports UTF-8 encoding instead of plain ASCII (i.e., input strings with characters > 127)? I know that is not required for your project, but... what if we would like to have UTF-8 variable/function names?
I've searched the web, but no one has shown how to do a UTF-8-compliant solution.
Best regards,
PanKracy

Edited by PanKracy on Reason: Initial post
It is actually pretty easy; there is nothing special about it.

The only place you will need to change is the lexer, where you are lexing a name that can contain UTF-8 characters. Instead of doing ++ you will need to increment by the length of the UTF-8-encoded codepoint, which can be calculated pretty simply: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html You don't even need to decode the UTF-8 codepoint, just its length. As long as you keep the whole UTF-8 string, you can compare/copy/hash it as a regular ASCII string.
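
For example, here is a minimal sketch of that length calculation (just an illustration, not taken from the stream's code; it assumes well-formed UTF-8 and simply steps one byte on anything malformed):

    #include <stddef.h>

    // Number of bytes in the UTF-8 sequence whose lead byte is c.
    // Derived from the lead byte alone; the code point itself is never decoded.
    static size_t utf8_len(unsigned char c) {
        if (c < 0x80)           return 1; // 0xxxxxxx: plain ASCII
        if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
        return 1; // continuation or invalid byte: still make forward progress
    }

So in the name-lexing loop, instead of stream++ you would do something like stream += utf8_len((unsigned char)*stream), where stream stands in for whatever pointer your lexer advances.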

The rest of the token parsing/processing code won't change, unless you want to introduce new tokens which are written in Unicode...

Edited by Mārtiņš Možeiko on
You'll first want to change the stream step size and character decoding as Martins explained, which is the only UTF-8-specific change. But you'll also need to change what constitutes valid characters in a name if you want to support Unicode properly, which would be required regardless of whether you're using UTF-8 or another encoding.

Right now we just have [A-Za-z_] for the initial character in a name and [A-Za-z0-9_] for subsequent characters. You'll want to change that. There is an official recommendation from the Unicode Consortium: http://unicode.org/reports/tr31/#Default_Identifier_Syntax. If you want to support this specification, you'll need a way to identify Unicode lexical categories as defined in the Unicode Character Database.
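
As a sketch of what the lexer-side check might look like (illustrative only; it operates on decoded code points, and unicode_is_xid_start/unicode_is_xid_continue are hypothetical classifiers whose implementation is sketched after the next paragraph):

    #include <stdbool.h>
    #include <stdint.h>

    // Hypothetical classifiers over decoded code points; implementing them
    // efficiently is exactly the problem discussed below.
    bool unicode_is_xid_start(uint32_t cp);
    bool unicode_is_xid_continue(uint32_t cp);

    static bool is_ident_start(uint32_t cp) {
        // Fast path: the existing [A-Za-z_] rule.
        if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') || cp == '_')
            return true;
        // Non-ASCII: defer to the TR31 XID_Start category.
        return cp > 0x7F && unicode_is_xid_start(cp);
    }

    static bool is_ident_continue(uint32_t cp) {
        // Fast path: the existing [A-Za-z0-9_] rule.
        if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') ||
            (cp >= '0' && cp <= '9') || cp == '_')
            return true;
        return cp > 0x7F && unicode_is_xid_continue(cp);
    }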

Unlike with ASCII, you can't recognize characters in such categories with a simple range check and a few equality checks--there are too many disjoint cases. There are various data structures for doing this kind of thing in a way that's efficient in both time and space. If you're interested in this general area, I recommend the book Unicode Demystified, which aside from giving a general introduction to Unicode is the only book I know which has coverage of the lower-level data structures that are typically used, like sparse two-level arrays. The simplest practical thing I can think of would be to store markers in a sorted array which delimit characters which are inside/outside the lexical category and then use bsearch to classify a character based on whether the index returned from bsearch is even or odd. But you'll want to have a simple fast path like our current code for recognizing ASCII identifiers, since that's the common case.
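
Here is a sketch of that sorted-boundary approach (the table below is a tiny made-up excerpt just to show the mechanics; a real table would be generated from the Unicode Character Database):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    // Sorted boundaries of half-open [start, end) ranges that are inside the
    // category. Tiny illustrative excerpt only.
    static const uint32_t xid_start_bounds[] = {
        0x0041, 0x005B, // A-Z
        0x0061, 0x007B, // a-z
        0x0391, 0x03AA, // some Greek capitals
        0x03B1, 0x03CA, // some Greek lowercase
        0x0410, 0x0450, // some Cyrillic letters
    };

    // Count how many boundaries are <= cp with a binary search (an upper bound);
    // the code point is inside a range exactly when that count is odd.
    static bool in_category(const uint32_t *bounds, size_t n, uint32_t cp) {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (bounds[mid] <= cp)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo & 1;
    }

    bool unicode_is_xid_start(uint32_t cp) {
        return in_category(xid_start_bounds,
                           sizeof(xid_start_bounds) / sizeof(xid_start_bounds[0]),
                           cp);
    }

An XID_Continue table would work the same way, and as noted above you would only fall through to this search after the ASCII fast path fails.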

Oh, another thing you'll need for identifiers is Unicode normalization, so that equivalent identifiers (e.g. written with and without combining characters) will be interned to the same string pointer and hence won't be considered distinct by the compiler's symbol table. There are various choices for Unicode normal forms. Python uses NFKC normalization: https://docs.python.org/3.6/refer...lexical_analysis.html#identifiers. It might be worth looking at CPython's lexer source code to see how it handles all of this stuff, although it mostly just leans on Python's functionality for dealing with the Unicode character database, accessible via the built-in unicodedata module.
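
To make the plumbing concrete, here is a rough sketch of where normalization would hook in (nfkc_normalize_utf8 is a hypothetical helper you would get from a Unicode library or generated tables, and str_intern stands in for whatever interning function the compiler already uses):

    // Hypothetical: return an NFKC-normalized, null-terminated copy of the
    // UTF-8 bytes in [start, end).
    char *nfkc_normalize_utf8(const char *start, const char *end);

    // Assumed interning function along the lines of the compiler's str_intern.
    const char *str_intern(const char *str);

    // When the lexer has scanned an identifier spanning [start, end), normalize
    // before interning so that equivalent spellings (precomposed vs. combining
    // characters, compatibility forms, ...) map to the same pointer.
    const char *intern_ident(const char *start, const char *end) {
        char *normalized = nfkc_normalize_utf8(start, end);
        return str_intern(normalized);
    }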

You're probably starting to see why I didn't want to cover all of this. :)

Hopefully this gives you some pointers to get started, if I didn't scare you off. In particular, I really recommend Unicode Demystified. Doing a proper job with localization isn't just a matter of knowing a few technical tricks but requires an understanding of the broader context. Good luck!

Edited by Per Vognsen on
Basically, you can start with the decoder. In your lexer, you interpret not the next char but an int (a code point) that is decoded from 1-4 bytes (more often 1-2 bytes) of the UTF-8 stream. Interestingly, the maximum code point occupies only 21 bits, so you can pack 11 extra bits of information alongside a code point in a 32-bit int. Even if you only want to interpret code points that fit into 2 bytes, you can pack 5 extra bits of information alongside every such 2-byte code point.
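
For instance, a minimal decoder sketch (illustrative only, assuming well-formed input; a real one should also reject overlong encodings, surrogates and values above 0x10FFFF):

    #include <stddef.h>
    #include <stdint.h>

    // Decode one code point from a UTF-8 stream. Returns the code point and
    // stores the number of bytes consumed in *len.
    static uint32_t utf8_decode(const unsigned char *s, size_t *len) {
        if (s[0] < 0x80) {                   // 1 byte: 0xxxxxxx
            *len = 1;
            return s[0];
        } else if ((s[0] & 0xE0) == 0xC0) {  // 2 bytes: 110xxxxx 10xxxxxx
            *len = 2;
            return ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        } else if ((s[0] & 0xF0) == 0xE0) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            *len = 3;
            return ((uint32_t)(s[0] & 0x0F) << 12) |
                   ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        } else {                             // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            *len = 4;
            return ((uint32_t)(s[0] & 0x07) << 18) |
                   ((uint32_t)(s[1] & 0x3F) << 12) |
                   ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        }
    }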

You don't really need to decode Unicode codepoints unless you want to handle all the normalization stuff Per talks about. You only need the length. After that it's regular strcmp or hash table lookups for keywords or symbol names.

mmozeiko, pervognsen, osiyuk

Thank you all for your helpful answers++.

You're probably starting to see why I didn't want to cover all of this. :)

Yes, I know, and that's why I asked. :)

Per:
Thank you for your project; it's very, very interesting.

Best regards,

PanKracy

Edited by PanKracy on