I just finished writing 2 (well, 2.5) toy compilers, and I think a lot of the material in the compiler-teaching space lacks some subtle developmental aspects.
I wish there was a course somewhere that talked about the more ingrained issues: how to structure/design the AST[0], buffer-based vs. non-contextual tokenization/parser design, index handling and error sync in the parser, supporting multiple codegen architectures, handling FFI, exposing the compiler as an API for external tooling, namespaces and linkage handling, and so on.
It is refreshing to see how Carbon designed some of its components (mainly the frontend; I have yet to take a look at the backend), as it touches on some of the subtleties I mentioned. If someone is starting out on writing a compiler, I would recommend taking a look at it or any of the talks.
Always nice to see new material coming up. A few resources I would like to mention: dabeaz's compiler course, Khoury College's compiler course (in Rust, previously I think in OCaml), Nora Sandler's book, as well as http://compilerbook.org, which I consider to be the best guide out there for writing small learning compilers; the videos are good as well.
[0]: Some related content that I enjoyed reading: https://lesleylai.info/en/ast-in-cpp-part-1-variant/
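To give a flavor of the AST trade-off [0] touches on: a minimal sketch (names are mine, and it's OCaml-flavoured since ADTs come up repeatedly in this thread) of the pointer-style layout where every node carries a source span, so later phases can report errors without re-parsing. The main alternative, which as I understand it Carbon leans toward, is a flat array of nodes addressed by indices rather than pointers.

    (* Sketch only: one common AST layout. Each node pairs its kind with a span. *)
    type span = { file : string; start_ofs : int; end_ofs : int }

    type expr =
      | Int   of int
      | Var   of string
      | Binop of binop * node * node
      | Call  of string * node list

    and binop = Add | Sub | Mul

    and node = { expr : expr; span : span }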
You can learn all that stuff by doing and by looking at other people’s designs; it’s a bit niche to package that up into an advanced language design/compiler class, since everyone is making different trade-offs and so focusing on different things. There also isn’t really a market for more than a few highly trained language designers and implementers, and most wind up doing different things than what they were trained for.
> supporting multiple codegen architectures
Shouldn't it be a result of modular software design?
> exposing the compiler as an API for external tooling
Isn't it just generating libraries instead of executables?
I wrote 'multiple codegen architectures' when I meant 'multiple architectures for codegen'.
From what I have done in my toy compilers and seen in actual production-ready compilers, the codegen is still very much tied to one thing or another, and the rest just use LLVM.
Making the _compiler itself_ provide an API enables things like LSP, which don't want to generate machine code at all. A traditional single-pass compiler usually can't accommodate this without re-plumbing.
The Eclipse Compiler for Java [1] is a notable exception, architected around incremental compilation, an API for “live” AST manipulation, and a layered non-batch approach to when to invoke various analysis steps.
The LSP for Java [2], used e.g. in VSCode’s Java plugins, builds on this API.
But, no, I haven’t seen a generalized approach to this architecture discussed in literature.
[1] https://github.com/eclipse-jdt/eclipse.jdt.core
[2] https://github.com/eclipse-jdtls/eclipse.jdt.ls
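To sketch the difference (illustrative signatures only, not Eclipse's or any real compiler's interface): a compiler that exposes its phases as a library lets a language server stop after analysis and never touch codegen, instead of forcing a single batch entry point.

    (* Sketch only; names and signatures are made up for illustration. *)
    type span = { start_ofs : int; end_ofs : int }
    type diagnostic = { at : span; message : string }

    module type COMPILER_API = sig
      type ast
      type typed_ast

      (* Each phase is callable on its own; nothing forces a full batch run. *)
      val parse     : string -> (ast, diagnostic list) result
      val typecheck : ast -> (typed_ast, diagnostic list) result
      val emit_asm  : typed_ast -> string   (* only the batch driver calls this *)

      (* Queries a language server would issue against the analyzed program. *)
      val hover         : typed_ast -> span -> string option
      val definition_of : typed_ast -> span -> span option
    end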
The notes reference:
Andy Keep - Writing a Nanopass Compiler https://www.youtube.com/watch?v=Os7FE3J-U5Q
https://github.com/akeep/scheme-to-c
Can't recommend it enough.
And as a github search[1], I recall it turning up fun work[2] amidst clutter.
Hmm, I wonder if an LLM could sift an "Essentials of Compilation" search[3] for interesting repos?
[1] https://github.com/search?q=Ghuloum&type=repositories https://github.com/search?q=incremental%20approach%20to%20co...
[2] just some old bookmarks: https://github.com/namin/inc https://github.com/iambrj/imin https://github.com/jaseemabid/inc
[3] https://github.com/search?q=Essentials%20of%20Compilation&ty...
Another text using this approach:
https://mitpress.mit.edu/9780262047760/essentials-of-compila... - Jeremy Siek's Essentials of Compilation (there's also a Python version)
Admittedly I didn't finish working through it (life got in the way), but I worked through a good chunk of it with a colleague from a non-CS background. He wanted to understand compilers and it seemed to be much more effective than other texts he'd tried to work through. I'll have to look at this one at some point, probably this summer when things are usually calmer for me.
This is such a wonderful book.
This book is open access (https://github.com/IUCompilerCourse/Essentials-of-Compilatio...), in both the Python and the Racket versions.
If you are the kind of nerd that has a nerd book club with friends, this book is perfect for it.
Learning and working on this stuff almost always felt "all or nothing" (either it worked or it didn't), which was very frustrating.
This looks interesting:
> Like Ghuloum [2006], this course is different. Our development is by extending the complete, working compiler one small step at a time. At each step we end up with the working compiler, for a subset of the source language. Specifically, the methodology ([Ghuloum, 2006, §2.6]):
> 1. choose a small subset of the source language that is easy to directly compile to assembly;
> 2. Write the extensive test cases;
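That first step is genuinely tiny in practice. A minimal sketch (in OCaml, the course's implementation language; everything here is illustrative) where the "subset" is just an integer literal, compiled straight to x86-64 assembly:

    (* Sketch of the very first increment: the source program is a single
       integer literal, compiled to assembly that returns it from main. *)
    let compile_program (src : string) : string =
      let n = int_of_string (String.trim src) in
      String.concat "\n"
        [ "    .globl main";
          "main:";
          Printf.sprintf "    movl $%d, %%eax" n;
          "    ret" ]

    (* Step 2's tests can assemble this output, run it, and compare the
       process exit code against the literal. *)
    let () = print_endline (compile_program " 42 ")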
I didn't make much progress on my own languages until I discovered Forth; I always seemed to get stuck and lose motivation in the parser.
Writing the parser is the least interesting part of building a language imo.
That varies by person. While it may or may not interest you, I'll say that learning how to write or generate parsers is a useful skill that might keep paying off.
I keep running into situations where I'd like to describe data at a high level. BNF grammars often fit those situations, are more readable than regexes, and can make for nice parsers. One must know how to parse, though. :)
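For instance (everything below is made up for illustration; OCaml because it keeps coming up in this thread), a small BNF maps almost line-for-line onto a recursive-descent parser, with one function per nonterminal:

    (* Grammar being parsed (illustrative):
         expr   ::= term   (('+' | '-') term)*
         term   ::= factor ('*' factor)*
         factor ::= NUMBER | '(' expr ')'  *)

    type token = Num of int | Plus | Minus | Star | LParen | RParen

    exception Parse_error of string

    (* Each nonterminal consumes tokens and returns a value plus the rest. *)
    let rec expr ts =
      let v, ts = term ts in
      let rec loop v ts =
        match ts with
        | Plus :: rest  -> let v2, rest = term rest in loop (v + v2) rest
        | Minus :: rest -> let v2, rest = term rest in loop (v - v2) rest
        | _ -> (v, ts)
      in
      loop v ts

    and term ts =
      let v, ts = factor ts in
      let rec loop v ts =
        match ts with
        | Star :: rest -> let v2, rest = factor rest in loop (v * v2) rest
        | _ -> (v, ts)
      in
      loop v ts

    and factor = function
      | Num n :: rest -> (n, rest)
      | LParen :: rest ->
          let v, rest = expr rest in
          (match rest with
           | RParen :: rest -> (v, rest)
           | _ -> raise (Parse_error "expected ')'"))
      | _ -> raise (Parse_error "expected a number or '('")

    let () =
      (* (1 + 2) * 3 *)
      let v, _ = expr [ LParen; Num 1; Plus; Num 2; RParen; Star; Num 3 ] in
      Printf.printf "%d\n" v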
The hardest part of writing a compiler is designing a language. In my opinion, most compiler courses seem too fixated on the implementation details. The fact that this course targets x86 already misses the mark for me. Computers are very fast; there is no reason to start with a language that translates to asm. This is incidental complexity that distracts from language design.
Hard disagree. There are a lot of implementation "details" that, if you want to do them properly, are a lot of hard work and very much nontrivial. For example, try writing a compiler with efficient incremental compilation, especially one that also has optimization passes. And that's just one example; most things in compiler implementations actually turn out to be fairly complex. And lots of features that modern languages support, e.g. more powerful type systems and trait/typeclass systems, are also very tricky.
While designing a language is by no means trivial, it generally really occupies just a very small fraction of the language/compiler developer's time. And, in most cases, the two things (language design + implementation details) have to walk hand-in-hand, since small changes to the language design can vastly improve the implementation end.
> The hardest part of writing a compiler is designing a language.
That's a completely separate field. If you want to learn about language design you should study that, though it helps if you have some background on both to get started with either one.
I disagree. First of all, you can write a compiler for any existing language. But second, I have experience writing compilers and interpreters. For me the hardest part was always giving good information/feedback on errors. When the compiler cannot go ahead, it means it is “confused”, and eliminating that confusion to help the user is very much non-trivial.
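One trick that helps with exactly that confusion (sketch only; the token names are made up, not from any real compiler) is panic-mode recovery: when the parser gets lost, it records the diagnostic, skips forward to a synchronization token such as ';' or a statement keyword, and resumes, so the user sees more than just the first error.

    (* Illustrative sketch of the skip-to-sync-point part of panic-mode recovery. *)
    type token = Semi | KwLet | KwIf | Ident of string | Eof

    let starts_statement = function
      | Semi | KwLet | KwIf | Eof -> true
      | Ident _ -> false

    (* Drop tokens until one that can plausibly begin the next statement. *)
    let rec synchronize tokens =
      match tokens with
      | [] -> []
      | t :: _ when starts_statement t -> tokens
      | _ :: rest -> synchronize rest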
Regarding the language design part, many universities I know of offer an (Advanced) Programming Languages class after a Compilers 101 class, and that is where many of the topics like EBNF and grammar design, along with different language constructs, are covered.
Besides, for someone learning this for the first time, I think designing a new language would be a bit difficult.
I’ve been modifying the MIR c2mir JIT compiler to extend its C11 compiler to support simple classes and boxed strings (immutable, non-nullable), with AOT support.
Imagine if Java and C had a love child, basically.
MIR is a fantastic piece of engineering.
Honestly, the hardest part is representing types. Having played around with other compilers, it seems to be a typical problem.
I’m stuck in the minutiae of representing highly structured complexity and defining behavior in C. I can understand why many languages have an intermediate compiler: it’s too much work, and it will probably change over time.
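For what it's worth, this is exactly the spot where languages with algebraic data types feel nicer than plain C. Purely as an illustrative sketch (not MIR's or c2mir's actual data structures), a recursive type representation in OCaml might look like:

    (* Illustrative only. In C this becomes a tagged union plus manual
       ownership rules, which is where the minutiae pile up. *)
    type ty =
      | TInt of int                             (* bit width *)
      | TBool
      | TString                                 (* boxed, immutable *)
      | TPtr of ty
      | TClass of string * (string * ty) list   (* class name, field names/types *)
      | TFun of ty list * ty                    (* parameter types and return type *)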
> The compiler itself is to be developed in OCaml.
This seems like a misstep that I've seen in a few other compiler implementation courses. For some reason, these programming language professors always insist on completing the project in their personal favorite language (Haskell, OCaml, Standard ML, etc.).
As a student, this makes the project significantly more difficult. Instead of just learning how to implement a compiler, you're also learning a new (and likely difficult) programming language.
Learning a new programming language isn't that hard of a task; every decent programmer will learn a dozen, maybe even dozens, over their career.
Also, neither OCaml nor SML is hard to learn. Haskell is more challenging, but that's because it has become, in a sense, multiple languages. The core of Haskell is no harder than OCaml or SML to learn, except for reasoning about lazy evaluation and some of its consequences. All the things people use on top of Haskell do add more to learn, but what you'd need to reach utility equivalent to SML or OCaml for a compilers course is not that hard to pick up.
Many universities that use OCaml in upper-division courses also use it in lower-division; my university requires all CS majors to take a course that is taught in OCaml, and covers higher-order programming, "advanced" (Hindley-Milner) type systems, equational reasoning, etc., typically in their sophomore year.
The compilers class can then be taught in it without worrying about that problem much.
You have to pick some language, and some number of students are likely to be new to it, so you might as well pick one that excels at the task of writing compilers, as ML-family languages do.
Algebraic data types are superb for compilers. Case exhaustiveness checks are really helpful in this domain, especially when doing any sort of semantic analysis.
Frankly, I try to avoid languages that don’t have ADTs as much as possible. They are incredibly useful for specifying invariants via your design, and their constraints on inputs lend themselves to easier implementation and maintenance.
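A small example of what the exhaustiveness check buys you (toy language, everything illustrative): add a new constructor to the expression type later, and every analysis that forgets to handle it gets flagged at compile time rather than failing silently at runtime.

    (* If a constructor such as `Float of float` is added to expr later, this
       match becomes non-exhaustive and the OCaml compiler warns (or errors,
       with warnings-as-errors) right at the spot that needs updating. *)
    type expr =
      | Int of int
      | Neg of expr
      | Add of expr * expr
      | Mul of expr * expr

    let rec eval = function
      | Int n      -> n
      | Neg e      -> -(eval e)
      | Add (a, b) -> eval a + eval b
      | Mul (a, b) -> eval a * eval b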
I grew up with "easy" languages. Pascal, VB, C, C++, Java, C#. And frankly I'm getting real sick of them.
I'm porting Dijkstra's algorithm over to C# at the moment, and in the last several hours here's the two most clownish things that have happened:
1) I have:
My IDE is telling me "Expression is always false according to nullable reference types' annotations". Nevertheless, it enters that section every time.
2) I have:
You know what this prints? "The minimal element of a set which has no elements is 0."
Yes, every language has its warts, and anecdotes like this aren't going to change anyone's mind. I only wrote these up because they're hitting me right now in one of those languages that companies use "because anyone can learn them". I like harder languages better than easy languages, because they're easy.
2)
In general I'd recommend using Min() (from LINQ), which works as expected.
But this property has this remark: "If the SortedSet<T> has no elements, then the Min property returns the default value of T."
1)
Feels like you used compiler hints incorrectly.
This is possible if you disregarded compiler warnings and assigned null to a non-nullable location.
In defense of such compilers professors, part of the purpose of a good undergraduate computer science program is to expose students to different programming language paradigms. Computing professionals are expected to grasp new languages, APIs, and tools quickly. Additionally, certain problems are easier to express in certain paradigms. For example, I would use a language like C or perhaps Rust in an operating systems class since we need to operate at a level that is much closer to the hardware. If I were teaching a course on machine learning, I’d use Python for its excellent ecosystem of libraries, though R and Julia are good alternatives. A course on relational databases should teach basic SQL.
Back in the 2000s there were some CS undergraduate programs that attempted to use Java in the entire curriculum, from introductory courses all the way to senior-level courses such as compilers. There was even an operating systems textbook that had Java examples throughout the text (https://www.amazon.com/Operating-System-Concepts-Abraham-Sil...).
I think using only one language for the entire undergraduate CS curriculum is a mistake. Sure, students don’t have to spend time learning additional languages. However, everything has to fit into that one language, depriving students the opportunity to see how languages that are better suited to specific types of problems could actually enhance their understanding of the concepts they are learning. In the case of Java, it’s a good general-purpose programming language, but there are classes such as computer organization and operating systems where it’s important to discuss low-level memory management, which conflicts with Java’s automatic memory management.
When it comes to writing compilers, it turns out that functional programming languages with algebraic data types and pattern matching make working with abstract syntax trees much easier. I learned this the hard way when I took compilers in 2009 at Cal Poly. At the beginning of the course, we were given two weeks to write an AST interpreter of a subset of Scheme. My lab partner and I didn’t like Dr. Scheme (now known as Racket), which we “endured” the previous quarter in a class on implementing programming language interpreters, and so we set about writing our interpreter in C++. It turned out to be a big mistake. We got it done, but it took us 40 hours to implement, and we had a complex class hierarchy to implement the AST. We realized that Dr. Scheme’s features were well-suited for manipulating and traversing ADTs. We never complained about Dr. Scheme or functional programming again, and we gladly did our other compiler assignments in that language.
16 years later, I teach Haskell to my sophomore-level students in my discrete mathematics class at a Bay Area community college that uses C++ for the introductory programming course.