As an ambitious undergrad, I took the entire language theory/compiler series offered, which ended up including some graduate-level courses. I still can't believe the amount of work and hassle involved in creating the very simple compiler for a C-like language we implemented on a 32-bit target architecture - when I got to the end of the series, the last course's syllabus was one sentence: "Reverse engineer the JVM."
At that point I realized I was in over my head and walked out. Fun read, thanks for sharing.
First of all, this is certainly impressive and inspirational!
I've been toying with the idea of writing a compiler from scratch for years; the problem is I worry too much about optimization. With a simple code generator that isn't at all competitive with hand-written assembly, I'd want it to at least run on the kind of machine that such simple compilers were historically written on. Like, say, a PC with at most 640K of RAM.
>So how much RAM does it use, exactly? Well… I didn’t realize this until after finishing the project, but in fact around 40 GB. I commend the brilliant engineers of Linux for designing an operating system kernel that can deal with people like me.
Wow. The latest Linux kernel source tree is 1.5 GB - multiply that by 16 if you stored it all in memory as a linked list of 64-bit chars, and that would still account for only about half of it! It shows the cost of abstraction; if the compiler were written in an imperative language you could easily get away with never deallocating anything.
>Name mangling
This shouldn't be required at all when you don't need to use an external linker? Maybe for debugging, but I suspect there is a lot of unnecessary use of strings where a pointer to some AST node would do.
>Assembler
AT&T syntax: just say no! I'd also recommend reading the Intel/AMD docs instead of a summary on some random website; at least enough to understand when the SIB byte is actually required. The best introduction would probably be the 80386 manual, since it's the first version of 32-bit x86 and the instruction set is a lot smaller. Then move on to a recent version to see what's different in 64-bit mode.
>Difference between syscalls and libc, and ELF32/64
Here's where UNIX culture is actively hostile towards any language not based on C and the "standard" toolchain. Really, it's hard to find any explanation other than that they want to make it as difficult as possible to escape from that ecosystem.
Linux is still better than the others, in that it at least has a stable kernel ABI, but the documentation is terrible. The assumption is that you only call the kernel through the wrapper functions in libc, which may do extra things you wouldn't want unless your goal were C compatibility, and the differences are barely (and incompletely) mentioned in the manpages.
A lot of it isn't documented at all; it lives in include files that are themselves included from somewhere else, sometimes multiple levels deep. And /usr/include/asm-generic/unistd.h is an outright trap: the comment at the top (at least on my Debian system) states that it is based on x86-64, but the syscall numbers don't match the real x86-64 table.
> Really, it's hard to find any other explanation other than that they want to make it as difficult as possible to escape from that ecosystem.
I find this extraordinarily bad faith, and also just plain unimaginative.
Your position is that the C/C++ side of the linux-y/posix-y community won't change anything because they don't want you to use Rust or Haskell or whatever?