C unbounded lookahead

Hi,

I'm currently writing a C parser and so far it's going quite well. I've implemented a fairly complete lexer + expression parser in 300 + 370 LOC (shunting yard parser). But I don't have a plan yet how to go about statements and other global objects.

In https://youtu.be/T6TyvsKo_KI?t=17m52s Per mentions that C requires "unbounded/large lookahead" in some cases. So I was wondering what were those edge cases? Some insights here might well inform my next decisions.

Edited by jstimpfle on Reason: Initial post
jstimpfle

In https://youtu.be/T6TyvsKo_KI?t=17m52s Per mentions that C requires "unbounded/large lookahead" in some cases. So I was wondering what were those edge cases? Some insights here might well inform my next decisions.


tl;Dr for c++: int(x), y, *const z; vs int(x), y, new int;

Edited by Maykeye on
Thanks, always good to have arguments for future C++ rants :-). The syntax used there is C++ only though. My question is regarding C syntax.

(Maybe C++ is what Per actually means? I've noticed at least one other instance in that recording where he said "C" but definitely means "C++".)

Edited by jstimpfle on
In general it's hard to prevent saying silly things sometimes when I'm working without a script! But in this case I think I said (or at the very least meant) that you either need symbol table dependent parsing or unbounded lookahead. The answer is that when you have something like
1
(x******** ...
you don't know whether you're parsing a type cast to a pointer ... to a pointer to the type x, or the variable x multiplied by the nested deref of something yet to be seen, unless you know from your symbol table whether or not x has been previously typedefed. But with a symbol table, as soon as you see the x you know whether to parse the remainder as a cast-style type specifier or as a normal expression.

Edited by Per Vognsen on
I didn't mean to criticize. You have helped me a ton, and given me the courage to try it in the first place! Thank you for your answer and all the recordings.

Regarding parsing C variable declarations / typedefs, I do use a symbol table and would not want to do it in any other way. The code becomes extremely simple, I just store the typename and continue by parsing an expression statement. I then walk the resulting parse tree and make it into a type, and finally extract the identifier used in the expression and make it into a variable declaration of that type.

There is no additional parsing code for declarations - just trivial type-builder code. https://github.com/jstimpfle/lear...2c5306b771437d5239e3b9/stmt.c#L64

That is - unless function prototypes join the game. In K&R C, they had expression syntax, but since C89 the types go into the function parameter list. I don't know yet how to hack the parser to recognize these parameter types without writing extra code... :-(

EDIT: I realize your comment was about type casts. Seems like I will need another hack for those... It might be enough to insert an artifical missing-name expression when a ')' is seen where an expression is expected.

Edited by jstimpfle on
Yes, parsing C isn't terribly difficult in the grand scheme of things. If you want to see a simple parser (and an entire compiler) for C89, you can look at lcc's code as described in detail in the book A Retargetable C Compiler by Hanson and Fraser, which is one of my all-time favorites. The problem with symbol table dependent parsing is that it means you can't correctly parse a 100-line file without potentially parsing all 100,000+ lines of its transitive header dependencies. It's a small, easily avoidable language design flaw with serious implications, but it's a common mistake to fall into if you're writing a one-pass compiler since in that case you're building the symbol table as you're parsing and hence using the symbol table to assist parsing seems "free".

Edited by Per Vognsen on