The naked truth about the joys, frustrations, and hard work of writing your own programming language
My career has been all about designing programming languages and writing compilers for them. This has been a great joy and source of satisfaction to me, and perhaps I can offer some observations about what you’re in for if you decide to design and implement a professional programming language. This is actually a book-length topic, so I’ll just hit on a few highlights here and avoid topics well covered elsewhere.
First off, you’re in for a lot of work…years of work…most of which will be wandering in the desert. The odds of success are heavily stacked against you. If you are not strongly self-motivated to do this, it isn’t going to happen. If you need validation and encouragement from others, it isn’t going to happen.
Fortunately, embarking on such a project is not a major dollar investment; it won’t break you if you fail. Even if you do fail, depending on how far the project got, it can look pretty good on your résumé and be good for your career.
One thing abundantly clear is that syntax matters. It matters an awful lot. It’s like the styling on a car — if the styling is not appealing, it simply doesn’t matter how hot the performance is. The syntax needs to be something your target audience will like.
Trying to go with something they’ve not seen before will make language adoption a much tougher sell.
I like to go with a mix of familiar syntax and aesthetic beauty. It’s got to look good on the screen. After all, you’re going to spend plenty of time looking at it. If it looks awkward, clumsy, or ugly, it will taint the language.
There are a few things I (perhaps surprisingly) suggest should not be considerations. These are false gods:
- Minimizing keystrokes. Maybe this mattered when programmers used paper tape, and it matters for small languages like bash or awk. For larger applications, much more programming time is spent reading than writing, so reducing keystrokes shouldn’t be a goal in itself. Of course, I’m not suggesting that large amounts of boilerplate are a good idea.
- Easy parsing. It isn’t hard to write parsers with arbitrary lookahead. The looks of the language shouldn’t be compromised to save a few lines of code in the parser. Remember, you’ll spend a lot of time staring at the code. That comes first. As mentioned below, it still should be a context-free grammar.
- Minimizing the number of keywords. This metric is just silly, but I see it cropping up repeatedly. There are a million words in the English language; I don’t think there is any looming shortage. Just use your good judgment.
Things that are true gods:
- Context-free grammars. What this really means is the code should be parsable without having to look things up in a symbol table. C++ is famously not a context-free grammar. A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating most of a compiler front end. As a result, third-party tools become much more likely to exist.
- Redundancy. Yes, the grammar should be redundant. You’ve all heard people say that statement-terminating semicolons are not necessary because the compiler can figure it out. That’s true, but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: any random sequence of characters would then be a valid program. No error messages are even possible. A good syntax needs redundancy in order to diagnose errors and give good error messages.
- Tried and true. Absent a very strong reason, it’s best to stick with tried and true grammatical forms for familiar constructs. It really cuts the learning curve for the language and will increase adoption rates. Think of how people will hate the language if it swaps the operator precedence of `+` and `*`. Save the divergence for features not generally seen before, which also signals the user that this is new.
As always, these principles should not be taken as dicta. Use good judgment. Any language design principle blindly followed leads to disaster. The principles are rarely orthogonal and frequently conflict. It’s a lot like designing a house — making the master closet bigger means the master bedroom gets smaller. It’s all about finding the right balance.
Getting past the syntax, the meat of the language will be the semantic processing, which is where meaning is assigned to the syntactical constructs. This is where you’ll be spending the vast bulk of your design and implementation effort. It’s much like the organs in your body — they are unseen and we don’t think about them unless they are going wrong. There won’t be a lot of glory in the semantic work, but it is where the whole point of the language resides.
Once through the semantic phase, the compiler does optimizations and then code generation — collectively called the “back end.” These two passes are very challenging and complicated. Personally, I love working with this stuff, and grumble that I’ve got to spend time on other issues. But unless you really like it, and it takes a fairly unhinged programmer to delight in the arcana of such things, I recommend taking the common sense approach and using an existing back end, such as the JVM, CLR, gcc, or LLVM. (Of course, I can always set you up with the glorious Digital Mars back end!)
How best to implement it? I hope I can at least set you off in the right direction. The first tool that beginning compiler writers often reach for is regex. Regex is just the wrong tool for lexing and parsing. Rob Pike explains why reasonably well. I’ll close this topic with the famous quote from Jamie Zawinski:
“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”
Somewhat more controversially, I wouldn’t bother wasting time with lexer or parser generators and other so-called “compiler compilers.” They’re a waste of time. Writing a lexer and parser is a tiny percentage of the job of writing a compiler. Using a generator will take up about as much time as writing one by hand, and it will marry you to the generator (which matters when porting the compiler to a new platform). Generators also have the unfortunate reputation of emitting lousy error messages.
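To make the hand-written approach concrete, here is a minimal sketch in Python of a lexer and recursive-descent parser for an invented toy expression grammar. The token names and grammar are illustrative only, but the structure is the same one a real hand-written front end scales up:

```python
# Hand-written lexer and recursive-descent parser for a toy grammar
# (invented for illustration):
#
#   expr   ::= term   (('+' | '-') term)*
#   term   ::= factor (('*' | '/') factor)*
#   factor ::= NUMBER | '(' expr ')'

def lex(src):
    """Turn source text into (kind, value) tokens, ending with EOF."""
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c.isdigit():
            j = i
            while j < len(src) and (src[j].isdigit() or src[j] == '.'):
                j += 1
            tokens.append(('NUMBER', float(src[i:j])))
            i = j
        elif c in '+-*/()':
            tokens.append((c, c))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r} at offset {i}")
    tokens.append(('EOF', None))
    return tokens

class Parser:
    """One method per grammar rule; evaluates as it parses."""
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0]

    def take(self, kind):
        tag, value = self.tokens[self.pos]
        if tag != kind:
            raise SyntaxError(f"expected {kind}, found {tag}")
        self.pos += 1
        return value

    def expr(self):
        value = self.term()
        while self.peek() in ('+', '-'):
            op = self.take(self.peek())
            value = value + self.term() if op == '+' else value - self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ('*', '/'):
            op = self.take(self.peek())
            value = value * self.factor() if op == '*' else value / self.factor()
        return value

    def factor(self):
        if self.peek() == '(':
            self.take('(')
            value = self.expr()
            self.take(')')
            return value
        return self.take('NUMBER')

def evaluate(src):
    return Parser(lex(src)).expr()

print(evaluate("1 + 2 * (3 - 1)"))   # prints 5.0
```

Note how each grammar rule becomes one method. Adding lookahead or a more precise diagnostic is a purely local change, which is a large part of why a generator buys so little here.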
Now that I mention it, error messages are a big factor in the quality of implementation of the language. It’s what the user sees, after all. If you’re tempted to put out error messages like “bad syntax,” perhaps you should consider taking up a career as a chartered accountant instead of writing a language. Good error messages are surprisingly hard to write, and often, you won’t discover how bad the error messages are until you work the tech support emails.
The philosophies of error message handling are:
- Print the first message and quit. This is, of course, the simplest approach, and it works surprisingly well. Most compilers’ follow-on messages are so bad that the practical programmer ignores all but the first one anyway. The holy grail is to find all the actual errors in one compile pass, leading to:
- Guess what the programmer intended, repair the syntax trees, and continue. This is an ever-popular approach. I’ve tried it indefatigably for decades, and it’s just been a miserable failure. The compiler seems to always guess wrong, and subsequent messages with the “fixed” syntax trees are just ludicrously wrong.
- The poisoning approach. This is much like how floating-point NaNs are handled. Any operation with a NaN operand silently results in a NaN. Applying this to error recovery: any construct with a leaf for which an error occurred is itself considered erroneous (but no additional error messages are emitted for it). Hence, the compiler is able to detect multiple errors as long as the errors are in sections of code with no dependency between them. This is the approach we’ve been using in the D compiler, and we are very pleased with the results.
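A minimal sketch of the poisoning idea (the node shapes and names here are invented for illustration): semantic analysis reports an error once, at the leaf, and every ancestor of an erroneous node is silently marked erroneous instead of producing cascading diagnostics:

```python
# NaN-style "poisoning" error recovery, sketched on a toy AST.
# Only the leaf that actually failed produces a message; ancestors
# are poisoned silently, and independent subtrees are unaffected.

class Node:
    def __init__(self, kind, children=(), name=None):
        self.kind, self.children, self.name = kind, list(children), name
        self.err = False

errors = []

def analyze(node, symbols):
    """Analyze bottom-up; propagate errors silently past the first report."""
    for child in node.children:
        analyze(child, symbols)
    if any(c.err for c in node.children):
        node.err = True          # poisoned: no additional message
        return
    if node.kind == 'ident' and node.name not in symbols:
        errors.append(f"undefined identifier '{node.name}'")
        node.err = True          # the one real diagnostic

# Two independent statements: only the genuinely bad one is reported,
# and the error in one does not cascade into the other.
bad  = Node('add', [Node('ident', name='x'), Node('ident', name='oops')])
good = Node('add', [Node('ident', name='x'), Node('ident', name='y')])
prog = Node('block', [bad, good])
analyze(prog, symbols={'x', 'y'})
print(errors)   # exactly one message, for 'oops'
```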
What else does the user care about in the hidden part of the compiler? Speed. I hear it over and over — compiler speed matters a lot. In fact, compile speed is often the first thing I hear when I ask a company what tipped the balance for choosing D. The reality is, most compilers are pigs. To blow people away with your language, show them that it compiles as fast as hitting the return key on the compile command.
Wanna know the secret of making your compiler fast? Use a profiler.
Sounds too easy, right? Trite, even. But raise your hand if you routinely use a profiler. Be honest; everyone says they do, but that profiler manual remains in its pristine shrink wrap. I’m just astonished at the number of programmers who never use one. For me, that neglect is a competitive advantage that never ceases to pay dividends.
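For instance, a profiling session with Python’s built-in cProfile takes only a few lines. The “compiler pass” below is a deliberately slow stand-in (a linear symbol lookup, invented for illustration), and the hot spot surfaces in the report by name:

```python
# A minimal cProfile session. The deliberately slow stand-in pass uses a
# linear scan over a list for symbol lookup -- exactly the kind of hot
# spot a profiler surfaces immediately.
import cProfile
import io
import pstats

def slow_lookup(names, key):
    return key in names                    # O(n) scan over a list

def compile_unit():
    names = [f"sym{i}" for i in range(2000)]
    hits = 0
    for i in range(2000):
        if slow_lookup(names, f"sym{i}"):  # quadratic overall
            hits += 1
    return hits

profiler = cProfile.Profile()
profiler.enable()
compile_unit()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(10)
report = out.getvalue()
print('slow_lookup' in report)             # the hot spot shows up by name
```

Once the report names the culprit, the fix (here, a dict instead of a list) is usually obvious; the hard part was never the fix, it was knowing where to look.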
Some other tools you simply must be using:
- valgrind. I suspect valgrind has almost single-handedly saved C and C++ from oblivion. I can’t heap enough praise on this tool. It has saved my error-prone sorry self untold numbers of frustrating hours.
- Git and GitHub. Not many tools are transformative, but these are. Not only do they provide an automated backup, but they enable collaborative work on the project by people all over the world. They also provide a complete history of where every line of code came from and who wrote it, in case there’s a legal issue.
- Automated testing framework. Compilers are enormously complicated beasts. Without constant testing of revisions, the project will reach a point where it cannot advance, as more bugs than improvements will be added. Add to this a coverage analyzer, which will show if the test suite is exercising all the code or not.
- Automated documentation generator. The D project participants, of course, built our own (Ddoc), and it, too, was transformative. Before Ddoc, the documentation had only a random correlation with the code, and too often, they had nothing to do with each other. After Ddoc, the two were brought in sync.
- Bugzilla. This is an automated bug tracking tool. Bugzilla represented a great leap forward from my pathetic older scheme of emails and folders, a system that simply cannot scale. Programmers are far less tolerant of buggy compilers than they used to be; this has to be addressed aggressively head on.
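The testing point bears a sketch. A harness need not be elaborate: pair each source snippet with its expected result or expected diagnostic, and run the whole table on every change. The toy compiler and test cases here are invented for illustration:

```python
# Minimal sketch of an automated compiler test harness: a table of
# (source, expected-outcome) pairs run on every revision. The toy
# "compiler" only checks for a terminating semicolon.

def toy_compile(src):
    """Stand-in for the compiler: returns ('ok', None) or an error pair."""
    if src.strip().endswith(';'):
        return ('ok', None)
    return ('error', 'missing terminating ;')

CASES = [
    ('int x;', ('ok', None)),
    ('int x',  ('error', 'missing terminating ;')),
]

def run_suite():
    """Return the list of failing cases; an empty list means all pass."""
    failures = []
    for src, want in CASES:
        got = toy_compile(src)
        if got != want:
            failures.append((src, got, want))
    return failures

print("failures:", run_suite())
```

The real value comes from running this automatically on every commit and from growing the table with a regression case for every bug fixed, so old bugs stay fixed.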
One semantic technique that is obvious in hindsight (but took Andrei Alexandrescu to point out to me) is called “lowering.” It consists of, internally, rewriting more complex semantic constructs in terms of simpler ones. For example, `while` loops and `foreach` loops can be rewritten in terms of `for` loops. Then, the rest of the code only has to deal with `for` loops. This turned out to uncover a couple of latent bugs in how `while` loops were implemented in D, and so was a nice win. It’s also used to rewrite `scope` guard statements in terms of `try-finally` statements, etc. Every case where this can be found in the semantic processing will be a win for the implementation.
If it turns out that there are some special-case rules in the language that prevent this “lowering” rewriting, it might be a good idea to go back and revisit the language design.
Any time you can find commonality in the handling of semantic constructs, it’s an opportunity to reduce implementation effort and bugs.
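As a sketch of lowering (the AST node shapes here are invented for illustration), a `while` node can be rewritten as a `for` node with empty init and increment slots, after which later passes need only handle one kind of loop:

```python
# Sketch of "lowering": rewrite While nodes into For nodes so that every
# later pass (type checking, optimization, codegen) handles only For.
# Conditions and bodies are plain strings here purely for brevity.

class For:
    def __init__(self, init, cond, step, body):
        self.init, self.cond, self.step, self.body = init, cond, step, body

class While:
    def __init__(self, cond, body):
        self.cond, self.body = cond, body

def lower(node):
    """Rewrite While into For; other nodes pass through unchanged.

    while (cond) body   ==>   for ( ; cond ; ) body
    A foreach construct could be lowered the same way, into a For with
    a synthesized init and step.
    """
    if isinstance(node, While):
        return For(init=None, cond=node.cond, step=None, body=node.body)
    return node

loop = While(cond='i < 10', body='i = i + 1;')
lowered = lower(loop)
print(type(lowered).__name__)   # prints For
```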
Rarely mentioned, but critical, is the need to write a runtime library. This is a major project. It will serve as a demonstration of how the language features work, so it had better be good. Some critical things to get right include:
- I/O performance. Most programs spend a lot of time in I/O. Slow I/O will make the whole language look bad. The benchmark is C stdio. If the language has elegant, lovely I/O APIs, but runs at only half the speed of C I/O, then it just isn’t going to be attractive.
- Memory allocation. A high percentage of time in most programs is spent doing mundane memory allocation. Get this wrong at your peril.
- Transcendental functions. OK, I lied. Nobody cares about the accuracy of transcendental functions; they only care about their speed. My proof comes from trying to port the D runtime library to different platforms, and discovering that the underlying C transcendental functions often fail the accuracy tests in the D library test suite. C library functions also often do a poor job handling the arcana of the IEEE floating-point bestiary — NaNs, infinities, subnormals, negative 0, etc. In D, we compensated by implementing the transcendental functions ourselves. Transcendental floating-point code is pretty tricky and arcane to write, so I’d recommend finding an existing library you can license and adapting it.
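A few of the IEEE-754 edge cases such a test suite should probe can be expressed with Python’s standard math module; each line checks behavior an IEEE-conformant math library is expected to provide, and it is exactly the sort of check the underlying C functions sometimes fail:

```python
# Spot checks of IEEE-754 special-case behavior in the math library:
# NaN propagation, the sign of negative zero, and infinities.
import math

nan, inf = float('nan'), float('inf')

checks = [
    math.isnan(math.sin(nan)),                       # NaN poisons the result
    math.copysign(1.0, math.atan2(-0.0, -1.0)) < 0,  # atan2 honors -0.0: result is -pi
    math.atan(inf) == math.pi / 2,                   # well-defined at infinity
    math.exp(-inf) == 0.0,                           # underflows to exact zero
]
print(all(checks))   # prints True on an IEEE-conformant platform
```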
A common trap people fall into with standard libraries is filling them up with trivia. Trivia is sand clogging the gears and just dead weight that has to be carried around forever. My general rule is if the explanation for what the function does is more lines than the implementation code, then the function is likely trivia and should be booted out.
After The Prototype
You’ve done it, you’ve got a great prototype of a new language. Now what? Next comes the hardest part. This is where most new languages fail. You’ll be doing what every nascent rock band does — play shopping malls, high school dances, dive bars, and so on, slowly building up an audience. For languages, this means preparing presentations, articles, tutorials, and books on the language. Then, going to programmer meetings, conferences, companies, anywhere they’ll have you, and showing it off. You’ll get used to public speaking, and even find you enjoy it. (I enjoy it a lot.)
There’s one huge thing working in your favor: With the global reach of the Internet, there’s an instantly reachable global audience. Another favorable fact is that programmer meetings, conferences, etc., all are looking for great content. They love talks about new languages and new programming ideas. My experience with such audiences is that they are friendly and will give you lots of constructive feedback.
Of course, then you’ll almost certainly be forced to reevaluate some cherished features of the language and reengineer them.
But hey, you went into this with your eyes open!