Regarding single-character tokens having their ASCII character value as their token kind: I think that idea could've been salvaged by using a lookup table.

// Bit flags for operator classes; one token kind can belong to several classes.
enum TokenClass {
    TOKEN_CLASS_CMP = 1 << 0,
    TOKEN_CLASS_ADD = 1 << 1,
    TOKEN_CLASS_MUL = 1 << 2,
    TOKEN_CLASS_ASSIGN = 1 << 3,
    TOKEN_CLASS_UNARY = 1 << 4,
};

// Maps a token kind to its class flags via designated initializers. Single-character
// tokens index by their ASCII value; the multi-character kinds (TOKEN_EQ, etc.) are
// assumed to fit below 256.
const int token_class_bits[256] = {
    [TOKEN_EQ] = TOKEN_CLASS_CMP,
    [TOKEN_NOTEQ] = TOKEN_CLASS_CMP,
    ['<'] = TOKEN_CLASS_CMP,
    ['>'] = TOKEN_CLASS_CMP,
    [TOKEN_LTEQ] = TOKEN_CLASS_CMP,
    [TOKEN_GTEQ] = TOKEN_CLASS_CMP,

    ['+'] = TOKEN_CLASS_ADD | TOKEN_CLASS_UNARY,
    ['-'] = TOKEN_CLASS_ADD | TOKEN_CLASS_UNARY,
    ['^'] = TOKEN_CLASS_ADD,
    ['|'] = TOKEN_CLASS_ADD,

    ['*'] = TOKEN_CLASS_MUL | TOKEN_CLASS_UNARY,
    ['/'] = TOKEN_CLASS_MUL,
    ['%'] = TOKEN_CLASS_MUL,
    ['&'] = TOKEN_CLASS_MUL | TOKEN_CLASS_UNARY,
    [TOKEN_LSHIFT] = TOKEN_CLASS_MUL,
    [TOKEN_RSHIFT] = TOKEN_CLASS_MUL,

    ['='] = TOKEN_CLASS_ASSIGN,
    [TOKEN_ADD_ASSIGN] = TOKEN_CLASS_ASSIGN,
    [TOKEN_SUB_ASSIGN] = TOKEN_CLASS_ASSIGN,
    [TOKEN_OR_ASSIGN] = TOKEN_CLASS_ASSIGN,
    [TOKEN_AND_ASSIGN] = TOKEN_CLASS_ASSIGN,
    // ...
};

// These test the current token's kind against a class bitmask.
bool is_cmp_op() {
    return (token_class_bits[token.kind] & TOKEN_CLASS_CMP) != 0;
}

bool is_add_op() {
    return (token_class_bits[token.kind] & TOKEN_CLASS_ADD) != 0;
}

bool is_mul_op() {
    return (token_class_bits[token.kind] & TOKEN_CLASS_MUL) != 0;
}

bool is_assign_op() {
    return (token_class_bits[token.kind] & TOKEN_CLASS_ASSIGN) != 0;
}

bool is_unary_op() {
    return (token_class_bits[token.kind] & TOKEN_CLASS_UNARY) != 0;
}


This is pretty much the same as LuaJIT's way of implementing isdigit, isxdigit, etc: https://github.com/LuaJIT/LuaJIT/blob/master/src/lj_char.h
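
For comparison, here's a minimal sketch of that kind of character class table. The names (CHAR_DIGIT, char_class_bits, init_char_classes, and so on) are made up for illustration; the real lj_char.h uses different flag names and a precomputed constant table.

#include <stdbool.h>

// One byte of class flags per character, so classification is a load plus a mask.
enum CharClass {
    CHAR_DIGIT = 1 << 0,
    CHAR_HEX   = 1 << 1,
    CHAR_ALPHA = 1 << 2,
    CHAR_IDENT = 1 << 3,  // letters, digits, '_'
    CHAR_SPACE = 1 << 4,
};

static unsigned char char_class_bits[256];

// Filled once at startup; a hand-written or generated const table works just as well.
void init_char_classes(void) {
    for (int c = '0'; c <= '9'; c++) char_class_bits[c] = CHAR_DIGIT | CHAR_HEX | CHAR_IDENT;
    for (int c = 'a'; c <= 'z'; c++) char_class_bits[c] = CHAR_ALPHA | CHAR_IDENT;
    for (int c = 'A'; c <= 'Z'; c++) char_class_bits[c] = CHAR_ALPHA | CHAR_IDENT;
    for (int c = 'a'; c <= 'f'; c++) char_class_bits[c] |= CHAR_HEX;
    for (int c = 'A'; c <= 'F'; c++) char_class_bits[c] |= CHAR_HEX;
    char_class_bits['_'] = CHAR_IDENT;
    char_class_bits[' '] = char_class_bits['\t'] = CHAR_SPACE;
    char_class_bits['\n'] = char_class_bits['\r'] = CHAR_SPACE;
}

bool is_digit_char(char c) { return (char_class_bits[(unsigned char)c] & CHAR_DIGIT) != 0; }
bool is_hex_char(char c)   { return (char_class_bits[(unsigned char)c] & CHAR_HEX) != 0; }
bool is_ident_char(char c) { return (char_class_bits[(unsigned char)c] & CHAR_IDENT) != 0; }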


Somewhere near the end of the Day 6 video, pervognsen was talking about error handling in the context of "push oriented" APIs. The idea sounded familiar, and then I remembered that Rob Pike has a post about it: https://blog.golang.org/errors-are-values
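
The gist of that post is that the error can live in a value or piece of state you're already passing around, and get checked once at the end instead of after every call. A rough C transliteration of its sticky-error writer idea (ErrWriter and ew_write are made-up names):

#include <stdio.h>

// Sticky-error writer: the first failure is recorded and every later write becomes
// a no-op, so the caller pushes all its writes and checks err once at the end.
typedef struct ErrWriter {
    FILE *file;
    int err;   // 0 = ok, nonzero once a write has failed
} ErrWriter;

void ew_write(ErrWriter *w, const void *buf, size_t size) {
    if (w->err) {
        return;
    }
    if (fwrite(buf, 1, size, w->file) != size) {
        w->err = 1;
    }
}

The caller just calls ew_write as many times as it likes and looks at w.err once when it's done.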

Edited by Anikki. Reason: Initial post
Yeah, the token class approach is what I was originally planning, and maybe I'll revisit it when I start optimizing the parser, but for now the range-based checks do most of what I want.
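
(Range-based checks here meaning: the operator kinds for a class are laid out contiguously in the TokenKind enum, so class membership is an interval test. A sketch with illustrative names, not the actual ion enum:)

enum TokenKind {
    // ...
    TOKEN_FIRST_CMP,
    TOKEN_EQ = TOKEN_FIRST_CMP,
    TOKEN_NOTEQ,
    TOKEN_LT,
    TOKEN_GT,
    TOKEN_LTEQ,
    TOKEN_GTEQ,
    TOKEN_LAST_CMP = TOKEN_GTEQ,
    // ...
};

bool is_cmp_op() {
    return TOKEN_FIRST_CMP <= token.kind && token.kind <= TOKEN_LAST_CMP;
}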

By the way, instead of looking up the token class table on every query, you can assign the token class flags directly in the lexer, since it's already discriminating the different tokens there. You don't even need a table for that, not even in the lexer, if you move away from eponymous tokens in the default clause; then there's only one per-token discrimination in the whole pipeline, namely the relevant switch/if branch in the lexer for that token. The parser just does is_token_class(TOKEN_CLASS_MUL), etc., which checks directly against the class field stored in Token without any table lookup.

For precedence parsing specifically, you can also assign a precedence field to Token in the lexer and do iterative rather than recursive precedence climbing in the parser, which collapses the different functions in the expression parser (cf. the extra credit exercise in homework 1). We'll revisit all of this later, before version 0, but thanks for bringing it up!
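
Roughly the shape I mean (a sketch, not the actual ion code; Expr, parse_expr_unary, next_token, and expr_binary stand in for whatever the real parser has, and the precedence numbers are arbitrary):

typedef struct Token {
    int kind;
    int class_flags;   // TOKEN_CLASS_* bits, assigned by the lexer
    int precedence;    // binary operator precedence, 0 for anything else
    // ... value, name, etc.
} Token;

Token token;

// is_mul_op(), is_cmp_op(), etc. all collapse into this one check.
bool is_token_class(int flags) {
    return (token.class_flags & flags) != 0;
}

// In the lexer, each operator case sets the flags and precedence once, e.g.:
//
//     case '*':
//         token.kind = '*';
//         token.class_flags = TOKEN_CLASS_MUL | TOKEN_CLASS_UNARY;
//         token.precedence = 3;
//         break;

// One loop replaces the parse_expr_mul/parse_expr_add/parse_expr_cmp cascade.
// Call with min_prec = 1; tokens that aren't binary operators have precedence 0
// and end the loop.
Expr *parse_expr_binary(int min_prec) {
    Expr *left = parse_expr_unary();
    while (token.precedence >= min_prec) {
        int op = token.kind;
        int prec = token.precedence;
        next_token();
        // Left associativity: the right operand only absorbs strictly higher precedence.
        Expr *right = parse_expr_binary(prec + 1);
        left = expr_binary(op, left, right);
    }
    return left;
}

A fully iterative version would keep an explicit operator/operand stack instead of the recursive call for the right operand, but the loop already collapses the per-precedence-level functions.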

As your LuaJIT link shows, the same idea works for character class discrimination. In some of the week 1 streams I alluded to getting rid of isalnum(x) || x == '_' and so on with "char classes"; that's exactly the kind of thing I had in mind, and there you really do want the table-based approach.
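
Using the hypothetical char_class_bits/CHAR_IDENT from the sketch above, and assuming the lexer's cursor is a plain char pointer, the identifier scan becomes one table check per character:

// Before: while (isalnum(*stream) || *stream == '_') stream++;
const char *start = stream;
while (char_class_bits[(unsigned char)*stream] & CHAR_IDENT) {
    stream++;
}
// [start, stream) now spans the identifier's text.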

I hadn't seen that Rob Pike post before. I'm not surprised others have made the observation (it's pretty obvious), but just thought it was worth mentioning on stream since I see a lot of people assuming that fine-grained error handling is just a necessary evil in all APIs.

Edited by Per Vognsen.