Understanding grammar and syntax in EBNF

Help! I am stuck on Day 2 homework. I have read chapters 2 through 4.1 of Wirth book.

And I have issues implementing parser outputting s-expression.

I believe understand method of recursive descent. I do not understand construction of grammar. I have done that:

1
2
3
4
5
EBNF:

unary = ['-' | '~'] integer | term
factor = unary {('*' | '/' | '%' | '<<' | '>>' | '&') unary}
term = factor {('+' | '-' | '|' | '^') factor}


I think, there some ambiguity issues. How to resolve those issues?
How to check grammar being deterministic?
Are this grammar is regular?

Edited by Kiril Osiyuk on Reason: Initial post
You say you can't figure out how to output S-expressions. As a first step, I would recommend writing a recognizer that can tell whether a given string matches a valid expression. Did you do that?

Feel free to look at the parser code I wrote on stream if you get stuck or confused. All the theory can be easily ignored if you just focus on what the code is doing. I'm going to focus on theory in this post because you asked about theory. But I want to make it clear that I don't think understanding this stuff formally is necessary or even that helpful if you can just think in terms of code.

As for your other questions:

The binary operator part of the grammar looks good to me. Your rule for unary only needs to change slightly:

unary = INT | ('-' | '~') unary

I'm using INT in all caps to signify that it's a primitive token.

To check that the grammar is deterministic, for each alternative you need to compute the FIRST sets. Starting with unary, the FIRST set of the first alternative is {INT}, and the FIRST set of the second alternative is {'-', '~'}. These are disjoint, hence unary is deterministic.

Any repetition pattern implicitly presents an alternative between matching the pattern vs matching a follow-up pattern. I can't remember how exactly Wirth chooses to define it, but the idea here is that since the pattern enclosed by {} in term and factor starts with a certain set of operator tokens, and even for a symbol like - that occurs as both a binary operator and a unary operator, in the context where you're trying to match an infix - there's no alternative that could match a prefix -, so it's deterministic.

This stuff is actually a lot clearer when you just write out the code, following Wirth's template. In code terms, determinism just means you don't have multiple if-elseif cases that could be simultaneously true (if their FIRST sets overlap, the first case in the if-elseif chain to match would arbitrarily be selected by the parser as the right one).

The grammar is not regular in its current form, according to Wirth's definition. A rule (the syntactic equation for a nonterminal) is regular if all nonterminal references occur solely in tail position. And a grammar is regular if all its rules are regular. That is true for unary, but it is not true for term and factor. E.g. in the rule term = factor {('+' | '-' | '|' | '^') factor}, factor occurs in the head position.

That said, you can actually rewrite the grammar to be regular. You just have to "copy and paste" right-hand side definitions for every occurrence of the left-hand side. Let me show you how the transformation goes:

1
term = {'-' | '~'} INT {('*' | '/' | '%' | '<<' | '>>' | '&') {'-' | '~'} INT} {('+' | '-' | '|' | '^') {'-' | '~'} INT {('*' | '/' | '%' | '<<' | '>>' | '&') {'-' | '~'} INT}}


Ugly, right? I do not recommend doing this! And you wouldn't be able to regularize the grammar as soon as you added support for parenthesized expressions, so it's not really useful. But it shows that it's possible for a non-regular grammar to be equivalent to a regular grammar after appropriate transformations. A formal language (a set of strings) is said to be regular if there exists a regular grammar which matches the language. But like this example shows, it's possible for a regular language to also have corresponding non-regular grammars.

Edited by Per Vognsen on
Finally, I have done parser. Results:

1
2
3
4
PARSING ~1 --2 * 3 / 2 % 3 << 1 >> 2 + 13 >> 2 | 5 ^ 7 & 3
(^ (| (+ (- (~ 1) (>> (<< (% (/ (* (- 2) 3) 2) 3) 1) 2)) (>> 13 2)) 5) (& 7 3))
PARSING 12*34 + 45/56 + ~25
(+ (+ (* 12 34) (/ 45 56)) (~ 25))


I have used classic approach to recursive-descent parsing, based on the idea of grammar with nonterminals for each level of precedence.

1
2
3
4
int = INT
unary = int | ('-' | '~') int
factor = unary {('*' | '/' | '%' | '<<' | '>>' | '&') unary}
term = factor {('+' | '-' | '|' | '^') factor}


Now looking into shunting yard algorithm and precedence climbing.
I have published my code: https://github.com/osiyuk/Ion/blob/master/sexpr.c

Thank you Per Vognsen for challenging homework.

References

http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm