You'll first want to change the stream step size and character decoding as Martins explained, which is the only UTF-8-specific change. But you'll also need to change what constitutes valid characters in a name if you want to support Unicode properly, which would be required regardless of whether you're using UTF-8 or another encoding.
Right now we just have [A-Za-z_] for the initial character in a name and [A-Za-z0-9_] for subsequent characters. You'll want to change that. There is an official recommendation from the Unicode consortium: http://unicode.org/reports/tr31/#Default_Identifier_Syntax. If you want to support this specification, you'll need a way to identify Unicode lexical categories as defined in the Unicode Character Database.
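In code, the UAX #31 default syntax boils down to two predicates, roughly like this. The predicate bodies below only handle the ASCII subset so the sketch is self-contained; a real implementation would derive XID_Start/XID_Continue membership from the Unicode Character Database:

```c
#include <stdbool.h>
#include <stdint.h>

// Placeholder predicates: a real implementation would consult the Unicode
// Character Database for the XID_Start and XID_Continue properties. Here we
// only handle the ASCII subset to keep the sketch self-contained.
static bool is_xid_start(uint32_t c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

static bool is_xid_continue(uint32_t c) {
    return is_xid_start(c) || (c >= '0' && c <= '9');
}

// UAX #31 default identifier syntax, with '_' additionally allowed as a
// start character, as most programming languages do.
static bool is_ident_start(uint32_t c) {
    return c == '_' || is_xid_start(c);
}

static bool is_ident_continue(uint32_t c) {
    return c == '_' || is_xid_continue(c);
}
```

The lexer's name-scanning loop then checks is_ident_start for the first code point and is_ident_continue for the rest, exactly as it does today with the ASCII character classes.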
Unlike with ASCII, you can't recognize characters in such categories with a simple range check and a few equality checks; there are too many disjoint cases. There are various data structures for doing this kind of thing in a way that's efficient in both time and space. If you're interested in this general area, I recommend the book Unicode Demystified, which aside from giving a general introduction to Unicode is the only book I know of that covers the lower-level data structures typically used, like sparse two-level arrays. The simplest practical thing I can think of would be to store markers in a sorted array which delimit where characters enter and leave the lexical category, then binary search for a character's position and classify it based on whether the resulting index is even or odd. But you'll want to keep a simple fast path like our current code for recognizing ASCII identifiers, since that's the common case.
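Here's a minimal sketch of that boundary-array approach, fast path included. The ranges in the table are a tiny illustrative subset of XID_Start, not the real thing; you'd generate the full table from the Unicode Character Database:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Sorted array of code point boundaries. Each even-indexed boundary starts
// a run of code points *inside* the category, each odd-indexed boundary
// starts a run *outside* it. These ranges are a tiny illustrative subset of
// XID_Start; the real table would be generated from the UCD.
static const uint32_t xid_start_bounds[] = {
    'A', 'Z' + 1,       // A-Z
    'a', 'z' + 1,       // a-z
    0x00C0, 0x00D7,     // a slice of Latin-1 letters
    0x0391, 0x03A2,     // a slice of Greek capital letters
};

static bool in_category(uint32_t c, const uint32_t *bounds, size_t n) {
    // Binary search for the number of boundaries <= c. If that count is
    // odd, c falls inside an in-category run.
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bounds[mid] <= c) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo & 1;
}

static bool is_xid_start(uint32_t c) {
    // Fast path for ASCII, the common case.
    if (c < 128) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }
    return in_category(c, xid_start_bounds,
                       sizeof(xid_start_bounds) / sizeof(xid_start_bounds[0]));
}
```

Note that stdlib bsearch isn't quite the right tool here since it returns NULL on a miss; what you want is the lower-bound position, as in the hand-rolled loop above.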
Oh, another thing you'll need for identifiers is Unicode normalization, so that equivalent identifiers (e.g. written with and without combining characters) will be interned to the same string pointer and hence won't be considered distinct by the compiler's symbol table. There are various choices of Unicode normal form. Python uses NFKC normalization: https://docs.python.org/3.6/refer...lexical_analysis.html#identifiers. It might be worth looking at CPython's lexer source code to see how it handles all of this, although it mostly just leans on Python's own facilities for dealing with the Unicode Character Database, accessible via the built-in unicodedata module.
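To illustrate the interning side of this, here's a sketch of normalize-then-intern. The normalize_nfkc function below is a stand-in stub that just copies its input, to keep the sketch self-contained; a real compiler would call a normalization library there, such as utf8proc or ICU:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Stand-in for a real NFKC normalizer (e.g. from utf8proc or ICU). Here it
// just returns a malloc'd copy, to keep the sketch self-contained.
static char *normalize_nfkc(const char *str) {
    size_t size = strlen(str) + 1;
    char *copy = malloc(size);
    memcpy(copy, str, size);
    return copy;
}

typedef struct Intern {
    struct Intern *next;
    char str[];
} Intern;

static Intern *interns;

// Intern the NFKC normalization of str, so that identifiers which are
// Unicode-equivalent end up as the same pointer and can be compared with a
// plain == in the symbol table.
const char *str_intern(const char *str) {
    char *norm = normalize_nfkc(str);
    for (Intern *it = interns; it; it = it->next) {
        if (strcmp(it->str, norm) == 0) {
            free(norm);
            return it->str;
        }
    }
    size_t len = strlen(norm);
    Intern *new_intern = malloc(offsetof(Intern, str) + len + 1);
    memcpy(new_intern->str, norm, len + 1);
    new_intern->next = interns;
    interns = new_intern;
    free(norm);
    return new_intern->str;
}
```

The linked list is just for brevity; the point is that normalization happens before the lookup, so two equivalent spellings of the same identifier hash and compare as one string.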
You're probably starting to see why I didn't want to cover all of this. :)
Hopefully this gives you some pointers to get started, if I didn't scare you off. In particular, I really recommend Unicode Demystified. Doing a proper job with localization isn't just a matter of knowing a few technical tricks but requires an understanding of the broader context. Good luck!