Hacker News
3 years ago by jjice

This is just an anecdote, but I had to write a SQL DDL and DML parser during my last semester. I wrote the DDL parser by hand, and it wasn't as bad as I expected, but it was time consuming. I managed to convince the professor to give us the option of using a parser generator for the next phase (DML) since the point of the class wasn't parsing context free grammars and more focused on executing the SQL.

I used Flex and Bison since the project was in C. Getting up and running and understanding how the tools have to be set up with different options was a bit tricky, but after that, my parser was up and running in about two hours, compared to probably four times that for the hand written DDL. Our DML subset was also much larger and more complex than our DDL, so I was very happy with the development speed increase.

I had this idea that using a parser generator was slow and wasteful since many modern tutorials online write them by hand and speak against parser generators (possibly because there isn't a catch all for all languages). Turns out dev speed is way more important to me up front, because in the case that I notice parsing speed actually being an issue I should be happy that my MVP has gotten enough use.

It's also nice because a lexer and parser can be pretty easily black-boxed and swapped out for your hand written, just keep the AST and API the same and you should be good.

All that said, that's personal preferences and writing the parser by hand is definitely good experience and more extensible, especially for error handling. Nice work!

3 years ago by brundolf

To add an anecdote, I've gotten into a hobby of writing parsers and parser-related things, and I've developed a pattern and a handful of utility functions (which are small enough to be re-written from memory) that together make hand-writing a pretty breezy task. Just the other week at work I hand-wrote a GraphQL schema parser from scratch in about 2 hours using this technique (GraphQL's schemas have a relatively simple syntax, but I still think it's a decent data point)

3 years ago by qorrect

Is it hosted somewhere we can see it?

3 years ago by brundolf

It's not, though I've thought about open-sourcing or doing a write-up on my pattern, but I've just assumed there are tons of libraries out there for this that are probably better than my homegrown version. Could still make an interesting blog post though, I suppose

3 years ago by WalterBright

> dev speed is way more important to me up front

Having written many parsers, I can attest that the time spent writing and debugging the parser is about 0.0000001% of the time you'll invest in the compiler.

Heck, I spent more time trying to figure out how to integrate the tag name space from ImportC into the host D compiler's symbol table mechanism, than I did writing the C parser.

3 years ago by neilv

In a university compilers class, I suppose that implementing the parser (or the tokenizer part of it) might be used as a weedout assignment. Or to tell students without much programming experience that they are going to have to put a lot of energy into the class (so don't, say, take compilers, operating systems, organic chemistry, and epic novel-writing all in the same term).

(When I took compilers, the initial weedout/attention-getting assignment was artificial: write a string/symbol table manager module that could handle strings of arbitrary size, up to megabytes or larger for each entry. Though I don't know how well this first assignment worked for setting expectations, or if it could've been improved. After that school term, I heard from someone else in the class that he and I had been the only two people in the class to actually "finish" our compilers. Not that we were smarter or harder-working, but we both had a lot of prior programming experience, so a big head start. Setting expectations for a project-heavy systems CS class seems hard to do, without discouraging people who could do it if they invest the time, and I suppose just noticing a couple nerds who had a big head start could be discouraging to everyone else. One of the secret keys to learning/accomplishing something hard is to feel you're doing well enough to stick with it long enough to do well.)

3 years ago by WalterBright

P.S. Having a parallel tag symbol table means doubling the work in creating and tearing down symbol tables. Adding another member to the Symbol struct means lots of extra memory allocated.

Tag symbols are relatively rare, so I was willing to trade lookup times for reduced memory consumption. Wound up adding a pointer to a hash table in each symbol table, where the tag symbols would be stored. Only when tag symbols are actually used is it ever accessed or even initialized. It works a treat.

3 years ago by codr7

Second that, I think the divide is simply experience writing parsers.

3 years ago by dboreham

Your post came through a wormhole from 1979.

3 years ago by MisterTea

> Programming language that compiles into a x86 ELF executable.

> a compiler somewhat resembling C

> Add new or borrow from other language(s) features ontop of C, deviate away from just C

  function strlen(const char *s)
This does not appear to be a C compiler. The title should reflect this.
3 years ago by r1chardnl

The main goal at the moment is to create a C compiler, the function keyword there is just a placeholder for the time being. At the moment anything returned is just mov'd into the EAX register and then stored in a local variable if called that way.

3 years ago by MisterTea

If you're looking for C-like languages to study look up the plan 9 Alef language. It was a concurrent C-like that was designed to replace C until they realized its better off as a C library which is what plan 9's thread(2) is. It's mostly extinct but survives:

http://doc.cat-v.org/plan_9/2nd_edition/papers/alef/

https://seh.dev/go-legacy/

3 years ago by spijdar

Worth pointing out Plan9's C was itself sort of its own language. I think it was a strict subset, but I can't remember all the details. The preprocessor was much simpler, that much I remember.

3 years ago by r1chardnl

Looks really interesting, thanks for sharing this.

I'm not sure whether I'll incorporate everything from C or just make it compatible with C and behind the scenes do things differently e.g instead of carrying const char */string pointers everywhere always carry the length of said data structure along with it. Or for instance push an additional pointer on the stack for memory zones, where an allocator is always present and knows it's context, sort of like __thiscall with class methods for this->.

3 years ago by kubb

Oh, is _this_ where the inspiration for Go came from? It would explain a lot.

3 years ago by lerno

And if the idea is to extend C in some direction, that is what the language I’m working on does: C3 (http://www.c3-lang.org)

3 years ago by jeffrallen

If you are exploring alternatives to the stdlib, you might commit yourself to no zero terminated strings and see where it takes you. A fundamental data type with a pointer, a length and a capacity does a lot to stop buffer overflows.

3 years ago by guntars

To replace C-style strings you only need the length! You also get cheap substrings which is such a common operation that it alone IMO makes it worth taking this route.

3 years ago by r1chardnl

Can't edit my original reply anymore, but fair point, I've changed the keyword 'function' in the AST to a function return type instead. There's no error handling done yet to check whether the variable receiving the function return data matches the function return type, but it's on my todo list.

https://github.com/riicchhaarrd/ocean/commit/0618e0810c8d437...

Then again in C it would work aswell if you type casted the types.

  const char *f()
  {
      return (const char*)123;
  }

  int main()
  {
      int i = (int)f();
      return 0;
  }
3 years ago by tills13

Yeesh. This is probably the most pedantic comment I've read on HN.

Contributes absolutely nothing to the conversation, here.

3 years ago by thamer

Just a few small suggestions:

1. Some code seems to be duplicated (although with slightly different formatting) across multiple files, like dd/dw/db in elf.c and x86.c for example. I'd suggest consolidating those.

2. It helps to define types when you have variables that can take a limited set of possible values; typedef works but enums are better. I'm thinking of variables like the `type` parameter to `primitive_data_type_size`, for example. The compiler can help you detect a missing case statement if you use an enum, but not an int.

3. It's a matter of preference, but typedef'ing your structs can make your code more concise and not have it littered with struct keywords.

4. You seem to have a mix of tabs and spaces in some files (e.g. in `elf.c`). I would recommend configuring your editor to use just one of the two or it'll start to be an issue as your code base grows.

5. On a related point, I'd suggest picking a formatting style and sticking to it. You have functions like `accept` in `ast.c` declared without spaces after the opening `(` and others like `read_file` in the same file that has spaces.

Good luck! This is a great way to learn.

3 years ago by muth02446

Shameless plug: if you are having a lot of fun with the "front-end" part of language design but neither like the idea of writing your own backend nor integrating with LLVM, maybe Cwerg is right for you: https://github.com/robertmuth/Cwerg

3 years ago by justinlloyd

Great job! Writing a compiler from scratch is a massively valuable learning experience. I wrote one in BASIC for the 6502, that became self-hosting on April 24th, 1982. I'm going to watch your project with interest and relive old memories.

3 years ago by SavantIdiot

Every time I see a compiler, the first thing a zero in on is the parser.

I don't mean this in a bad way, but that tokenizer looks a bit ... simple. ;-)

I say that because ~26 years ago I had a go at writing a C preprocessor with the goal of creating a ToC and Index for my C projects. I sat down with a BNF of C and lex/yacc and got ground up in the details, then found some software that did exactly what I wanted and moved on. I just remember a few parts of the C syntax that don't fit nicely into BNF and hadn't studied parsers enough to dig myself out of the hole.

Good luck doing it manually, that's gonna be tricky as heck.

3 years ago by fao_

Parsing it manually is actually the preferred choice for most compilers, both GCC and Clang do it manually, for example.

3 years ago by SavantIdiot

Really? That's probably why I failed! Ironically (given my post) I've never looked at Clang's parser. I have looked at GCC's and, well, sheepishly: I don't get it, it is a beast of a project and I wasn't able to crack the architecture. I should try again tho.

3 years ago by rot13xor

C is almost context-free with the exception of typedef. A custom parser helps with error diagnostics, e.g. missed semicolon or a "." swapped with "->".

https://eli.thegreenplace.net/2007/11/24/the-context-sensiti...

3 years ago by quantumOctopus

Incredibile timing as I'm working on my own language as well (just started a week ago) and I'm literally starting from 0. So happy that in just a week I can understand at least half of this thread. Your project looks a bit too wild for me but I'm sure there's plenty I can learn there so thank you so much for sharing!

3 years ago by kinga254

Any specific resources you've used that you could recommend in line with compiler construction?

3 years ago by optymizer

I wrote a C compiler* using flex [1] and bison [2] that generated MIPS code. The biggest downside was the rather complex glue code.

At some point ANTLR [3] looked promising, but these days I'd probably write a lexer and recursive descent parser by hand, then generate LLVM IR [4].

[1] https://github.com/westes/flex

[2] https://www.gnu.org/software/bison/

[3] https://www.antlr.org/

[4] https://llvm.org/

* for a subset of the full C89 spec

3 years ago by r1chardnl

You could use https://en.wikipedia.org/wiki/Yacc

I decided to write my own lexer/parser/compiler it's pretty straightforward.

Sources I use(d) to program this, e.g look at resulting assembly from other compilers or look at how the AST is generated for other languages.

https://en.wikipedia.org/wiki/Recursive_descent_parser

https://en.cppreference.com/w/c/language/operator_precedence

https://godbolt.org/

https://astexplorer.net/

I didn't start writing C yesterday, and most of the things are just stuff I've learned over the years and seeing how other languages work and then just use what I know to try to program in a logical manner.

Also I've written a implementation of a scripting language (compiler and VM) before in similar fashion.

https://github.com/riicchhaarrd/gsc

3 years ago by blcArmadillo

This is specific to interpreters but the beginning part related to parsing and AST is common: http://craftinginterpreters.com/contents.html

3 years ago by MH15

I've had good luck with ANTLR4 and ObjWeb ASM for my JVM language Imp https://github.com/mh15/imp

Daily Digest

Get a daily email with the the top stories from Hacker News. No spam, unsubscribe at any time.