So, how everyone is doing with their parsers/calculators?

One day has passed since last stream. Probably time to start discussing the findings.

Here're fun test expressions for parsers/calculators:
1
(3
and
1
-2147483648%-1
.

First example has no closing parenthesis. Reference implementation will print something like "Expected ), got )" instead of "Expected ), got ASCII 0" as it calls token_kind_name twice in the same printf, and token_kind_name has only one buffer, so one call gets overwritten by another(just like the ancestors have foretold)

Second one may trigger Floating point exception(on linux), as division and modulo are UB in theory and terminate the program in practice not only when dividing something by zero, but also in the case of (INT_MIN/-1) as it can't fit into int. Windows behaves similarly I believe.

SEI CERT C Coding Standard ("standard provides rules for secure coding in the C programming language") has a lot of rules on what to do to not get into UB and overflows, which may be useful to know for writing VM. Not necessary adhering to them, but good reference anyway.

And speaking of VM. I've found stretchy buffers to be a nice way to cheat around stack depth verification - since there are no jumps now, stack can only be of constant depth, so as long as bytecode is reasonable size, pushing and popping data to stretchy buffer will handle stack resizing so no overflow will occur.

Edited by Maykeye on Reason: Initial post
That's a good catch with token_kind_name. It's funny I made a big note and then proceeded to use it wrong. In response to your bug report, I just factored it into copy_token_kind_str and token_kind_str, where the former is the workhorse and copies into a user-provided buffer with a given size, using snprintf, and the latter calls it with a static buffer to provide the existing semantics. That way you still have a convenient expression-oriented accessor, but in cases like expect_token you can call copy_token_kind_str directly. I could also just have left it as is and made expect_token do a strcpy to a local buffer, but I like the idea of making the convenient but bug-prone function be a leaf-level wrapper rather than the workhorse.

(An even more convenient function would provide the same semantics but with stable pointers using a shared append-only scratch buffer that's freed when the compiler exits. I want to defer covering larger-scale memory management strategies like that for now, although they both simplify and speed up compilers by a significant margin. Most of our memory allocations will eventually just be out of a few large linear buffers. I want people to see the "before and after" effect of this and other optimizations, so we can get into the habit of making empirically-driven decisions. So for now I'm sticking to a more standard style of C memory management that will be familiar to everyone.)

I'll cover the bug and the fix in next stream.

Edited by Per Vognsen on
Btw, I recommend to use fuzzers to catch these kind of errors in any kind of parser - source code or binary format ones. Modern fuzzers are really fast and really good. They will find these and many more similar errors. One very good fuzzer is libFuzzer from LLVM project, its super simple to use it. For example, I have used it to find & fix many issues in stb_image.h. Another good one is AFL.
Yup, I had fuzzing planned starting as soon as we have a bit more code in place, but maybe I should just introduce it next week. ASan, etc, as well. It's a great suggestion.

Edited by Per Vognsen on
I've started working on the S-expression printer. Here's my WIP https://github.com/dannas/bitwise/blob/master/homework/main.c

I'm allocating and concatenating strings all over the place, so it's very ineffective. I guess I should create a tree of nodes instead. How have you done it?

Here's a list of things I've learned from the first couple of streams.

* Ctrl-F10 for running up to a line. I can't believe I've never used that before! Super-useful!
* C99 compound literals and C99 designated initializers. Up until now, I've mostly coded in C++ and C89.
* That expression precedence is often expressed as terms, factors and powers.
* This is silly, but using the word 'factor' for describing how to organize code. I've always heard people talked about 'refactoring' and never really thought about why it was named that way. When I heard Per say 'factor code' I realized that it's analogues to factoring mathematical expressions.
* Stretchy buffers. I hadn't expected that the C code would be this succinct.
* Including c-files for fast factoring of code while developing. Is this something that scales to bigger projects, or is it just used during exploration?
* The general workflow of using the debugger for shaking out bugs early. I've tended to rely on printf statements and only use the debugger when I've hit a bug.
* Using a none-value for enum types to ensure they're initialized to something well defined.

* On a tangent: After learning about stretchy buffers I looked up Sean Barretts original code and found a comment about strict aliasing: https://github.com/nothings/stb/blob/master/stretchy_buffer.h#L12. So I read up on strict aliasing and learned that there's something called 'effective types' in the C standard and that an aliasing violating really is when we're trying to read through a pointer whos type differs from the effective type of the storage we're reading from. And the effective type can change(!) when we write to the storage. That's what's happening when we get a buffer from malloc that has been previously freed.

I'll stop blabbering now. Thank you Per, for what I've learned (and possibly misunderstood so far).
dannas

* Including c-files for fast factoring of code while developing. Is this something that scales to bigger projects, or is it just used during exploration?

It's called unity build. Some build systems like meson have built-in support for it, CMake has cotire plugin. Dunno about scaling, but it allows some nifty tricks: as compiler sees the entire code, it theoretically can optimize more things (e.g. inline buf__grow where it decides to do so). Also compile times might improve as you don't have to parse huge standard headers many times.
There are some minuses: build times not always improve(as only one core is used, though you can create one file for each core), name of static functions can collide (e.g. if you make static void initTest() in both lex.c and parser.c)


dannas

* The general workflow of using the debugger for shaking out bugs early. I've tended to rely on printf statements and only use the debugger when I've hit a bug.

If you have linux, valgrind is another magnificent beast as it can catch bugs that manual debugging can miss.
For example it can tell you where `if(x)` depends on uninitialized `x` AND where `x` actually came from.


Also personally right now I'm trying to catch up with the stream. I caught up to base expression. I really want to try fuzzers after finishing parsing.

Hopefully there'll be less marathons in the future.

Edited by Maykeye on
Mine works under the debugger but segfaults when run without, which is very frustrating!

I'm finding it difficult to make enough time to keep up with the project already, I suspect I'm going to be way behind very shortly, but c'est la vie.
llimllib
Mine works under the debugger but segfaults when run without, which is very frustrating!

I'm finding it difficult to make enough time to keep up with the project already, I suspect I'm going to be way behind very shortly, but c'est la vie.


That often means that something is initialized in debug which isn't in release.

Make sure you initialize every variable that you declare and zero out every allocation that you make.
So I worked on this for a while over the weekend without having watched day three of streaming. I'm taking the aproach that Per mentioned in that I try to come up with a solution by myself first to more properly understand the problem. What I came up with was based on the readings that were assigned but also on some articles on recursive descent parsers I found myself. This is what I managed: https://github.com/Alphapork/bitwise_stuff/blob/master/bitwise_hw_1.c

It's only the absolute basics, none of the extra credits, but it does sorta work which I'm happy about. I'm using a fixed sized buffer for the s-expression which I know I shouldn't do, but I just wanted to get the parsing and output correct before trying to allocate strings of the proper size in realtime. I'm guessing there's a more clever way to create the expression that doesn't look so ugly, so I am looking forward to see how other people have solved it!

Looking forward I'm not sure will be able to keep up with the homework, since even this very basic assignment took me quite a while to complete, but I will definitly try my best! (Just keeping up with the videos can be challenge in and of itself!)

And as a sidenote, thank you so much for creating this project Per! As a first year student who sometimes find my course work a bit uninspiring this project really is a reminder how exciting this stuff can be!

edit*: Just completed the interpreter which was quite easy in comparison, but that probably has a lot to do with getting comfortable with the concepts. https://github.com/Alphapork/bitwise_stuff/blob/master/bitwise_hw_2.c

edit**: Was on a roll after completing the interpreter, so I finished the vm (just copied Pers code and added all the operations) and compiler as well. Guess I can take back what I said about keeping up with the homework! https://github.com/Alphapork/bitwise_stuff/tree/master/stack_machine

Edited by Alphapork on