Designing a Language Without a Parser (thunderseethe.dev)
114 points by thunderseethe on July 4, 2023 | 127 comments



> However, once I start constructing a parser, progress slows to a crawl

I can't really relate here. The parser is the easiest part of a compiler; the work only increases from there. I feel like if you ran out of steam at the parser, you never had enough steam to write a whole compiler. I don't think removing the parser will take you across the finish line if you otherwise were running out of steam.

My advice is to write your language in vertical slices. Write the parsing, semantic checking, and code generation for the simplest features first and progressively add feature slices, rather than trying to write the entire parser for a fully-baked language before proceeding. Consider including "print" as a built-in statement so you can print things out (and thus write tests) before you have working expressions and function calls.
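
For a feel of how small that first slice can be, here is a rough sketch in Common Lisp (EVAL-SLICE is a made-up name) of an interpreter whose only features are integer literals and a built-in print:

  ;; hypothetical first slice: integer literals plus a built-in PRINT,
  ;; interpreted straight off s-expressions
  (defun eval-slice (form)
    (cond ((integerp form) form)                        ; literal
          ((and (consp form) (eq (first form) 'print))  ; built-in print
           (print (eval-slice (second form))))
          (t (error "unsupported form: ~S" form))))

  (eval-slice '(print 42))   ; prints 42

Each later slice (expressions, function calls, ...) is then just another clause plus its tests.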


> I don't think removing the parser will take you across the finish line if you otherwise were running out of steam.

Everyone's different. The problem with parsing isn't the difficulty, but rather the potential for endless bikeshedding. You're having to make a ton of opinionated decisions that in turn produce more questions about your syntax. If your personality is like mine, in that it's a bit obsessive about "completing" a phase, then parsing feels like an endless quagmire. In comparison, AST -> type inference -> codegen feels more structured and straightforward.


> The problem with parsing isn't the difficulty, but rather the potential for endless bikeshedding. You're having to make a ton of opinionated decisions that in turn produce more questions about your syntax.

Considering the author is talking about an endless graveyard of abandoned projects, it kind of sounds like ADHD. I have ADHD and have a similar problem.

I still do endless bikeshedding where I'll do things like insist on zero memory allocations or streams/iterators over slices, and those arbitrary limitations will completely compromise the project. Either I never finish it because I can't get it to work, or I get it to work but it sucks because of stuff like "the amount of work to avoid memory allocation might be much larger than the work of just making a damn memory allocation".


That explains why Lisp is so convenient for implementing DSLs. Homoiconicity skips all the bike shedding around syntax and parsing.


Lisp: "We got around parsing by getting rid of the syntax!" :)


Lisp has syntax. It's just usually fairly straightforward (for the most part).

Forth also has an interesting simple syntax.


"Lisp has no syntax" really means "Lisp has no specific syntax". Any well-parenthesized S-expr is a syntactically valid Lisp program, even if it is nonsensical. Most other languages (including Forth) have keyword and syntax-dependent structuring.


> Any well-parenthesized S-expr is a syntactically valid Lisp program

It isn't. It's just a valid S-expr.

For example

((a b c) (d e f)) is not a valid Common Lisp program.

(let foo bar baz) is also not a valid Common Lisp program.

The syntax for LET is:

   let ({var | (var [init-form])}*) declaration* form*
SBCL:

  ; in: LET A
  ;     (LET A
  ;       B
  ;       C)
  ; 
  ; caught ERROR:
  ;   Malformed LET bindings: A.
  ; 
  ; compilation unit finished
  ;   caught 1 ERROR condition
LET is a built-in operator. It's not a function and not a macro.


> Any well-parenthesized S-expr is a syntactically valid Lisp program, even if it is nonsensical.

Which version of Lisp are you talking about? A very brief web search showed eg https://stackoverflow.com/questions/40210896/syntax-error-in...

For something really simple, try `(define define)` in Scheme.

I'd say Forth has less syntax than Lisp. But that's perhaps a reflection of most Forth implementations using some clever tricks to offload much of what's handled by syntax in other languages.

Eg defining new functions (ie 'words' in Forth terms) is not done via a special syntactic construct in Forth, but via 'immediate words'. The lines are a bit blurry, though.


This is not a Lisp syntax error, this is a 'cond' macro syntax error.

The language has no syntax; the constructs within the language can and do impose additional restrictions (not extensions - restrictions) when evaluated. You may not find the distinction useful, but it is there.


Check the Common Lisp specification some time.

Every language construct has a syntax definition: function calls, macro calls, lambda expressions, special operator calls, ..., every macro.


Racket calls it a syntax error. I defer to them on the classification of their errors.


Neither Racket nor any Scheme implementation calls (define define) a syntax error in general. It is only an error when define is defined to define, as it is in the base environment. If you have an environment where define does not define, (define define) may work.

  Welcome to Racket v8.9 [bc].
  > (define (define x) (display "Look ma, no error!\n"))
  > (define define)
  Look ma, no error!


In Common Lisp the identifiers in package Common Lisp are not allowed to be redefined.


It may be semantically invalid in Common Lisp, but it is syntactically valid in Lisp (as seen by observing different dialects).


The term "Lisp" is undefined. Common Lisp is an actual standard for a programming language with a specification. Its core symbols are reserved identifiers.

Anyway, if you were allowed to reimplement a built-in operator, then from then on you would be using the new operator and you would be limited to its syntax.

Common Lisp OTOH provides namespaces (called "packages"), and there is a pre-defined namespace "COMMON-LISP" for the core language.


> The term "Lisp" is undefined

This hasn't been a problem in this thread thus far, but to be explicit, this seems like a good definition: https://en.wikipedia.org/wiki/Lisp_%28programming_language%2...

Given that definition, we can see that there are many actual standards for programming languages with specifications under the name "Lisp". Scheme, as was mentioned in this thread, is one such standard, along with Common Lisp.

> Anyway, if you would be allowed to reimplement a built-in operator, then from then on you would be using the new operator and you would be limited to its syntax.

I think you're arguing the merits of doing something like redefining define here, which is not the subject of this comment chain. Of course such behavior is questionable at best. The comment chain is discussing whether such a thing would be valid in Lisp - and, in some dialects (such as Scheme), it is.

Scheme also has a feature called "libraries" which makes such behavior more feasible, as you could still have access to the shadowed primitive. Consider:

  gosh[r7rs.user]$ (import (prefix (scheme base) base:))
  gosh[r7rs.user]$ (define (define x) 5)
  define
  gosh[r7rs.user]$ (base:define x (+ (define define) (define define)))
  x
  gosh[r7rs.user]$ x
  10


> Scheme - as was mentioned in this thread, is one such standard, along with Common Lisp.

Then see the Scheme standards. They include a syntax definition.

> The comment chain is discussing if such a thing would be valid in Lisp - and, in some dialects (such as Scheme) it is.

Lisp is not a defined language, thus you can assume anything.

My point is: Lisp dialects have syntax. This syntax is usually defined on top of s-expressions. S-expressions also have a syntax, but this syntax usually (there are exceptions) describes only data types (lists, conses, numbers, strings, symbols, arrays, records/structures, bit vectors, pathnames, ...) and not the actual programming language with its operators (functions, macros, special operators) & their way to define and invoke them.

Example: the LET operator in various Lisp dialects has a specific syntax. This syntax is not the syntax of simple function calls like (operator arg0 ... argn).

R7RS small defines this syntax (and also calls it syntax):

  (let ⟨bindings⟩ ⟨body⟩)
  (let ⟨variable⟩ (⟨binding spec⟩*) ⟨tail body⟩)

  ⟨Bindings⟩ has the form ((⟨variable1⟩ ⟨init1⟩) ...)
If you write let forms which don't use this syntax, then they are not valid R7RS Scheme let forms, even though they are valid s-expressions.


I forgot about Forth. Surely the king of non-syntax.


There's also Unlambda and Brainfuck and various other toy languages.


Smalltalk also has a quite minimal syntax... But Forth is just words.


I used to like Lisp's homoiconicity. These days, I'm not even sure the concept makes any sense? What is it even supposed to mean?

Lisp has (simple) syntax for creating literals and writing code. But when you inspect your data from the inside of Lisp, they don't have any more (or less) relation to that syntax than eg Python does.

Eg no modern Lisp stores a function as an S-Expr of its syntax. You get an opaque piece of data that you can do certain operations on, just like in eg Python or Haskell.

Lisp using S-Expressions is an interesting syntax choice, but that's it. It doesn't by itself have any deeper impact on the language. (Of course, socially it does have a deeper impact, because it makes it simple for language users to write macros that look exactly like built-in parts of the language, instead of bolted on.

But that's all in human brains and social interactions, nothing fancy going on here on a technical level.

If you wouldn't mind much more tediousness, you could implement seamless macros in almost any language as a pre-processing step.)


> Eg no modern Lisp stores a function as an S-Expr of its syntax

Many Common Lisp implementations do that, especially those with an interpreter: SBCL (yes, SBCL has an interpreter, too -> one can tell SBCL to use an interpreter), LispWorks, a Lisp Machine, Allegro CL, CLISP, ...

LispWorks:

  CL-USER 18 > (defun foo (a b) (* a (let ((a (* a a))) (break) (+ b (expt a b)))))
  FOO

  CL-USER 19 > (foo 1 2)

  Break.
    1 (continue) Return from break.
    2 (abort) Return to top loop level 0.

  Type :b for backtrace or :c <option number> to proceed.
  Type :bug-form "<subject>" for a bug report template or :? for other options.

  CL-USER 20 : 1 > :n
  Call to INVOKE-DEBUGGER

  CL-USER 21 : 1 > :n
  Call to BREAK

  CL-USER 22 : 1 > :n
  Interpreted call to FOO
You'll see that LispWorks says that FOO is running using the s-expression interpreter. Next we'll have a look at this s-expression:

  CL-USER 23 : 1 > :lambda
  (LAMBDA (A B) (DECLARE (SYSTEM::SOURCE-LEVEL #<EQ Hash Table{0} 82200D6A93>)) (DECLARE (LAMBDA-NAME FOO)) (* A (LET (#) (BREAK) (+ B #))))
The list is a value, which we can pretty print:

  CL-USER 24 : 1 > (pprint *)

   (LAMBDA (A B)
    (DECLARE (SYSTEM::SOURCE-LEVEL #<EQ Hash Table{0} 82200D6A93>))
    (DECLARE (LAMBDA-NAME FOO))
    (* A (LET (#) (BREAK) (+ B #))))
If you modify this s-expression in the debugger, the program is directly changed.


Of course, compilers exist, so in a sense all code is data.

Besides the social stuff, S-expressions make it orders of magnitude simpler to build the infrastructure required to support seamless macros (just look at procedural macros in Rust for a recent example)


Well sure, if you hand wave away the main benefit of sexps then there aren’t many benefits.


Oh, S-Expressions are great for certain kinds of things.

I'm just having a problem with the concept of homoiconicity being touted as both coherent, and something that goes deeper than the very surface syntax level.


Have you ever written deep macros or only surface level ones?


What do you mean by 'deep' vs 'surface level' macros?

Do you mean something like implementing Prolog with macros in Lisp as a 'deep' macro? Or something else?


Yes


I have worked with overly complicated macros. What about them?

They aren't magic.


Maybe you can write a detailed blog post that compares making a DSL in lisp and a language of your choice. I would love to read it


> I used to like Lisp's homoiconicity. These days, I'm not even sure the concept makes any sense? What is it even supposed to mean?

You are not alone:

http://calculist.org/blog/2012/04/17/homoiconicity-isnt-the-...


Thanks! That blog post expresses much more clearly what I am trying to say.

Btw, it seems you could implement a 'read' that also recognises Python/Haskell-style significant indentation in addition to parens only. That would be an interesting elaboration of the point in the post.


> Parentheses make it unambiguous for the expander to understand what the arguments to a macro are, because it’s always clear where the arguments begin and end. It knows this without needing to understand anything about what the macro definition is going to do. Imagine trying to define a macro expander for a language with syntax like JavaScript’s.

Are you sure this blog post is making the point you have been trying to make?

Here is a blog post that explores writing C in s-expressions.

https://itnext.io/what-makes-lisp-unique-8a0576b42293


I can't imagine someone who has ever written a Lisp macro coming away thinking "there's nothing special about s-expressions". Honestly, sounds like you're commenting on something you have absolutely no knowledge of.


Great ad hominem!

I have used Scheme, Racket and Common Lisp (SBCL) since the 1990s (before mostly switching to the ML family of languages). I read 'On Lisp' cover to cover. I know what I am talking about. You can still disagree with me, of course. But don't do it on the basis of me not having used enough Lisp.

S-Expressions make it easier to write macros that blend seamlessly into the host language. But there's no magic to them (compared to macros you could do as pre-processing in any other language) beyond that.

S-Expressions are a neat syntax. But homoiconicity is a weird concept; if it is coherent at all, it only pertains to the surface syntax level of the language.


> I know what I am talking about.

Sorry, but you manifestly don't:

> there's no magic to them (compared to macros you could do as pre-processing in any other language)

The "magic" of s-expressions is that they make it easy to operate on the source code of a program as a hierarchical data structure (i.e. as an AST) rather than as text. That turns out to be an extremely powerful lever. It is is one of the reasons Lisp has lasted as long and been as influential as it has. I'm sorry if this sounds like an ad-hominem, but if you think there is "no magic" to S-expressions the most likely explanation is that you don't really understand them. There is nothing comparable in any other language. There's a reason that they are still in use today. Indeed, there is a reason that they keep getting re-invented again and again. HTML, XML, and JSON are all (badly) reinvented s-expressions.

> homoiconicity is a weird concept

No, it isn't. It is simple and straightforward: homoiconicity is where the primary representation of programs is a data structure in a primitive type of the language itself. It isn't any weirder than (say) recursion. If you think it's weird that is more evidence that you don't actually understand it.


But Rust macros also do this: procedural macros operate on the AST of the macro input.

https://doc.rust-lang.org/reference/procedural-macros.html


Plenty of languages give you the option of writing macros, however the difference between writing macros in a lisp and those languages is like the difference between using a general purpose programming language and an accidentally Turing-complete one.


C's preprocessor is such an 'accident'. But eg Rust's macros and template Haskell are very deliberately designed.


No. You can say that Common Lisp and Racket are deliberately designed around macros. In Haskell, Rust, and C++, they have been bolted on to the language as an afterthought. Unless I am mistaken, those languages do not have the ability to write DSLs interactively, unlike Common Lisp.


What do you mean by 'write DSLs interactively'?


That's not how Lisp works. Lisp macros don't operate on an AST.


I don't know how they work, how do they work?


Macros get and produce unparsed Lisp code, as nested lists.


>The "magic" of s-expressions is that they make it easy to operate on the source code of a program as a hierarchical data structure

where exactly is the magic? it's just tuple unpacking? like congrats you've constrained yourself to the absolute bare minimum and you've made it work. i mean i guess congrats for getting it to work at all but it's not magical, it's tedious. if you showed me a homoiconic language that did somehow pull off the magic trick of being more expressive than tuples then i would be indeed enchanted (mathematica at least manages to put lipstick on the whole exercise).


> i mean i guess congrats for getting it to work at all but it's not magical, it's tedious.

Magic lies in the eyes of the beholder. So you may not find it magical while I do. But I don't understand how you say that it is tedious.

With the homoiconicity of Lisp, you use the Lisp language itself to write macros and you manipulate the lists that make your Lisp programs directly.

Mainstream languages these days either need you to learn a special syntax for macros or they need you to learn their AST structures/classes and manipulate them or sometimes both. Isn't this more tedious? Isn't the Lisp way simpler and less tedious?
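
To make that concrete, a macro in Common Lisp can be nothing but ordinary list construction (MY-UNLESS is a made-up name; CL of course ships UNLESS already):

  ;; the expansion is built with LIST and CONS, nothing more
  (defmacro my-unless (test &body body)
    (list 'if test nil (cons 'progn body)))

  (macroexpand-1 '(my-unless (zerop n) (print n)))
  ;; => (IF (ZEROP N) NIL (PROGN (PRINT N)))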


> where exactly is the magic?

There is a reason I put the word "magic" in scare quotes.

> it's just tuple unpacking?

It's not even that. S-expressions are just a serialization of linked lists. The "magic" happens because the details of the serialization happen to make them particularly good for writing code.

See my reply to /u/eru for more details. https://news.ycombinator.com/item?id=36599470


> It's not even that. S-expressions are just a serialization of linked lists.

Well, it's more like serialization of a binary tree. Cons-cells can branch in both the car and the cdr direction.


The distinction between a linked list and a binary tree is in the eye of the beholder. One of the brilliant things about s-expression syntax is that (a b c) is equivalent to (a . (b . (c . nil))) so there is no distinction between binary trees and linked lists (whose elements can be other linked lists) other than that to be a "proper" list it has to be null-terminated.
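
A quick REPL check of that equivalence:

  (equal '(a b c) '(a . (b . (c . nil))))   ; => T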


> The "magic" of s-expressions is that they make it easy to operate on the source code of a program as a hierarchical data structure (i.e. as an AST) rather than as text.

As https://news.ycombinator.com/item?id=36597550 points out, you can do this with more complicated syntax as well. It's just a bit more annoying.

I agree that ergonomics are a big deal when using programming languages, and it shapes their culture.

Just like it's a pain to re-use code in C, so everyone always re-implements linked lists from scratch every time they need one.

> It is simple and straightforward: homoiconicity is where the primary representation of programs is a data structure in a primitive type of the language itself.

Why does it have to be a 'primitive' type?

Eg Haskell supports representing ASTs just fine, but you would use user-defined types for that.

Imagine a variant of Haskell that used S-Expression syntax for the sake of argument. That version of Haskell would still represent S-Expressions internally with a user-defined type. See https://hackage.haskell.org/package/s-expression and mentally translate all the code into S-Expressions (but leave the semantics the same).

Or just imagine a variant of Lisp where cons-cells are a user-defined data-structure. That wouldn't make their S-Expressions any worse, would it?

As long as whatever data structure you use to represent your AST is easy to work with, that's surely good enough?

And what do you mean by 'primary' representation? A Lisp compiler (or interpreter) has many different layers, and the AST is but one representation used in one of the layers.

For many purposes, S-Expressions aren't particularly useful, because eg they don't keep track of which variables are bound where. And, of course, an interpreter that works by walking S-Expressions is really, really slow.

So in what sense are S-Expression a primary representation?

You could imagine a re-skin of C that used S-Expression syntax. It's actually relatively easy to write such a front-end as a pre-processor that translates S-Expression syntax into classic C-syntax before feeding the result to a C compiler. (And allows you to run C programs on your S-Expressions as macros.)

But that change by itself wouldn't change too much about the language. And wouldn't make S-Expression-C as pleasant a language as any old Lisp.


> It's just a bit more annoying.

Well, yeah, but that is missing the point rather badly. There is nothing you can do in any high level language that you can't do in assembler. The only reason high level languages exist at all is to make programming less annoying.

> For many purposes, S-Expressions aren't particularly useful, because eg they don't keep track of which variables are bound where.

S-expressions are nothing more than a serialization of binary trees. There is a long, long list of features that they do not provide. Again, if you think that is relevant, you have completely missed the point.

For the record: the point, the thing that makes s-expressions cool, is that they provide a serialization of linked lists, and this serialization is super-simple, to the point where writing code in it is actually practical. They weren't designed for this. The fact that you can actually write practical code in s-expressions was a surprise, a discovery, not an invention. You can serialize binary trees as XML or JSON, and you can even use those to write code if you want to, but no one in their right mind would actually do that. You would go nuts typing all those angle brackets, double-quotes, and commas. The reason s-expressions are cool is that they are parsimonious. You don't need all the angle-brackets and quotes and commas. S-expressions are actually a reasonable syntax for writing code whereas XML and JSON are not, despite being fundamentally the same thing under the hood. Indeed, once you get accustomed to s-expressions, they are a superior syntax for writing code to conventional languages because they are easy to parse (no precedence rules to deal with).


> Well, yeah, but that is missing the point rather badly. There is nothing you can do in any high level language that you can't do in assembler. The only reason high level languages exist at all is to make programming less annoying.

Oh, there's plenty of stuff you deliberately _can't_ do in high level languages (compared to assembly or low level languages).

Eg Python doesn't let you use naked jumps (also called GOTO). Haskell doesn't let you mutate state willy-nilly.

> You don't need all the angle-brackets and quotes and commas.

Well, you have quoting and quasi-quoting. Very useful, and they use some more interesting syntax.

Yes, S-Expressions are a reasonable syntax. I used to like them a lot. But I also came to appreciate more traditional syntax when switching from Scheme to Haskell.

I was not arguing against S-Expressions at all. Only against 'homoiconicity' being a coherent or useful concept, or saying anything deeper about a language.

As I said earlier, if you start from C, change its syntax to use S-Expressions (rather simple to do), and allow S-Expression based macros (that use the same C-as-S-Expression syntax), you still don't have a language that's as pleasant to work with nor makes macros as easy as your favourite Lisp.


> Python doesn't let you use naked jumps

Not natively, but you can do it. It's just "a bit more annoying".

> you have quoting and quasi-quoting

That is obviously not what I meant. What I meant was that you can write (x y z) instead of ("x", "y", "z"). The latter is just "a bit more annoying" but that "bit" makes all the difference.

Also, ' and ` are not really a core part of s-expression syntax, they are just a convenient shorthand. Any s-expression written using ' and ` can also be written without them. The core syntax includes only parens and dots, and of course the constituents of atoms. Modern implementations also include native support for double-quoted strings, but that was not part of the original spec. Everything else is just syntactic sugar.
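
For instance, at a Common Lisp REPL you can watch the reader expand the shorthand as it reads:

  (read-from-string "'x")   ; => (QUOTE X), 2
  (equal ''x '(quote x))    ; => T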


s-expressions can provide a hierarchical structure, but they do not represent an AST. They're just another form of source code.

> And, of course, an interpreter that works by walking S-Expressions is really, really slow.

Common Lisp interpreters work by walking S-expressions. Typically you'll see the s-expressions directly in a debugger -> modifying the s-expression will modify the running program.


> Eg no modern Lisp stores a function as an S-Expr of its syntax. You get an opaque piece of data that you can do certain operations on, just like in eg Python or Haskell.

In Common Lisp

   (defun times-two (x) (* x 2))

   (print (function-lambda-expression #'times-two))

   > (LAMBDA (X) (BLOCK TIMES-TWO (* X 2))) 
What would you do in Python or Haskell?


Sorry, I should have been more precise.

Yes, Common Lisp stores the s-expression. And you can get at it by applying an operation (in your case, `function-lambda-expression`) to some opaque handle.

The function itself isn't stored as its s-expression. Neither does any reasonable interpreter work by walking the s-expression. (Doing so would be very slow.)


Yes, I'm having trouble getting precisely to what you are talking about. Do you mean something like SBCL compiling to machine code, for example? If so, how does that affect using Common Lisp to implement some DSL, in the sense that it takes away the power of s-expressions for achieving this?


I am saying that S-Expressions aren't any kind of 'primary' representation of your program in a modern Lisp. Especially a compiled one.

S-Expressions (or, more precisely, the AST) are just one representation amongst many in use during the different stages of the compilation process.

They are still a great syntax choice for the kind of languages that Lisps are. Totally agreed.

My point was that homoiconicity is a surface level property that you could bolt on to eg C or Cobol, via a simple pre-processor, and it wouldn't change those languages much.

Just like you could bolt on using Python-style significant indentation to C or Fortran or Lisp, and it wouldn't change those languages much. (Apart from changing their syntax, of course.)


> I am saying that S-Expressions aren't any kind of 'primary' representation of your program in a modern Lisp. Especially a compiled one.

Besides pointing out like others have that pretty much all Lisps have an interpreter that stores programs as s-expressions, why do you think compilation makes making a DSL in Lisp harder? Do you think you somehow need a parser all of a sudden?

> My point was that homoiconicity is a surface level property that you could bolt on to eg C or Cobol, via a simple pre-processor, and it wouldn't change those languages much.

Writing in C instead of assembly does not make your processor more powerful. Similarly, when designing languages, homoiconicity allows you to work directly with the AST in a way that is most ergonomically simple. Saying this is a surface level thing is an understatement


The nested list objects, whose printed representation is called S-expression, are the representation of the source code of Lisp. This is important and primary in multiple senses of the word.

In version control jargon, only primary objects should (ideally) be tracked in a repository, not secondary/derived ones. The most common example of primary objects are source code files; and of secondary/derived objects, compiled files.

Programs first exist as source code; primary means first. The semantics of the language is defined with reference to the syntax, whose description refers to the source code form. Everything begins with the semantics: it is primary. Other forms, like compiled, have to produce the same results like the reference evaluation model over the source code. Expansion of macros is defined over the source code. Other tooling is based on source code, like text editing, tags generation and whatnot.

Compiled code is not primary in any way. It is non-portable, sometimes not even fully documented, and supports almost no useful operations beyond invoking it (calling a compiled function or loading a compiled file) and disassembling compiled code.

The representation of compiled code can change between releases of a Lisp; the worst that happens is that users have to recompile their code since old compiled files do not load any more. If you break the source language, there will be gnashing of teeth.

The basis of portability among implementations of the same dialect is always the source code.

Lisp programs can quote pieces of themselves as literal constants. Atoms can be quoted, as well as symbols and entire lists. These quoted lists have to survive into the compiled form. Thus, the syntax is important because any piece of it can potentially be a literal which has to retain its shape and content no matter how the program is translated.

Textual source code contains valuable information not found elsewhere: formatting and comments.

In Lisps that have programmable readers, like Common Lisp, only the source code contains instances of the original read syntax. For instance, in some implementations of the backquote syntax, the reader expands it, and this may be done in such a way that the resulting object cannot be printed as backquote syntax any more. Application-defined read syntaxes usually do not support printing back in the original form. Thus they appear only in the textual source code.

Primary information is already lost when we have merely read an expression into memory without doing anything else, like expanding and compiling it.


actually: Common Lisp interpreters do walk s-expressions


Similarly Forth doesn't have a parser (though some implementations call their scanner a parser) and is great for writing DSLs


> The problem with parsing isn't the difficulty, but rather the potential for endless bikeshedding. You're having to make a ton of opinionated decisions that in turn produce more questions about your syntax.

I fail to see your point. OP was commenting on the technical difficulty of actually rolling out a parser, but you're not arguing the technical side. You're arguing the project management side, where "endless bike shedding" results in code churn. That's not the parser's fault, but the person's fault. You're bound to get hung up on details regardless of what part of the project you're working on, because it's not the parser compelling you to get hung up on details.


> That's not the parser's fault, but the person's fault. You're bound to get hung up on details regardless of what part of the project you're working on, because it's not the parser compelling you to get hung up on details.

Disagree. The fact that you're working on "a parser" creates the bad project structure that leads you to get hung up on details: a parser whose backend doesn't exist yet inherently has unclear requirements and a lot of bikesheddy decisions to make.


but it is. language syntax design is a total sinkhole. you give me a spec - I'll crap out a parser for you. tell me to design a textual language and chances are pretty good I won't get back to you. and if I do, chances are really high that you're gonna want me to do something different.


> language syntax design is a total sinkhole.

Syntax design, not implementing a parser.

You're arguing that you might not make up your mind and feel compelled to churn parts of your implementation due to your indecisiveness.

That's not the parser's fault, and I fail to see how anyone avoids this indecisiveness by replacing a proper parser with a bunch of ad-hoc hacks.

> chances are really high that you're gonna want me to do something different.

So what? Just work on it. How many parser generator tools are there? It's as if people have been doing this for decades.


This captures exactly the sentiment I was trying to express.

Thanks for chiming in!


You need to better define the goal of your PL. If you have an exact vision of your end user and what she or he is going to use it for, it becomes easier to navigate any trade-offs.


I disagree. Parsers are tedious to write and obscure corner cases crop up over the course of development that make us completely change our approach to parsing or the grammar. You have to be pretty expert to get the grammar and parsing to work right the first time. Type inference is tough, but it’s not difficult the way writing a parser is. Codegen (suboptimal but functional) is even easier.

But yeah, write the vertical slice and implement a print builtin are good tips.


> rather than trying to write the entire parser for a fully-baked language before proceeding.

I see programmers, even experienced ones, making this mistake all the time when implementing larger changes. They try to get the whole thing implemented at once even when it's obvious from the outset that there are multiple sub-features that can be more easily implemented separately, so one can focus all their efforts on it, make sure it's well tested and clean, before moving on to the next sub-feature.


My first language, I spent months on the parser and then gave up. As I learned more I got better; now I can write a hand-rolled parser for a simple language in under a day. There's a technique to making syntax which is easy to read/write and easy to parse with a look-ahead parser.

For more complicated languages, there are tools like tree-sitter and ANTLR4. You can even extend an existing tree-sitter grammar to augment a language without having to re-write the base language parser.


I have a better idea. Do the code gen while writing out the parser. It won't be the fastest language, the codebase will be a mess, and you will most likely have correctness problems, but it's the fastest way to production. The classic 3-stage compiler design (tokenizing/lexing/parsing -> AST -> optional lowering/bytecode etc. -> JIT/code gen) is only important if you want to do semantic analysis and optimization. They are kinda redundant for prototyping (plus Cranelift and LLVM exist; you can focus on the backend after you get your language design in order). Do a quick peephole optimization pass on the generated instructions if things get too slow.

If we are being honest here, most people just want their own better Python/Go/Rust. I doubt your average engineer is truly interested in the subtleties of compiler design and memory layout optimization. A single-pass compiler, built like in the olden days of yore, is the best option for most.

https://en.m.wikipedia.org/wiki/One-pass_compiler

Leave the 3 stage design for homework (or if you are being paid to do so).


Parser for "4-lulz language" or something that will also be used in IDEs, so has robust error recovery, can perform partial updates, etc?

I feel it's just that it's possible to say that work on the parser side is completed.

Meanwhile, optimizations? You can probably endlessly improve stuff.


Talking about the same thing as the article: hobby/learning languages. If you click the links to the author's projects you can see they have yet to finish a compiler or really even come close. Definitely not talking about sophisticated production-grade languages here; OP is trying to complete their first compiler at all. I think writing compilers is a really neat and useful learning project but you have to be smart about not biting off more than you can chew.

I would absolutely recommend not supporting incremental compilation or error recovery in your first compiler. Just stop everything at the first error. Save that for your second hobby compiler, or better yet, the first commercial compiler that you get paid to work on.


> writing compilers is a really neat and useful learning project but you have to be smart about not biting off more than you can chew.

Writing compilers is an especially useful project when you learn roughly how much you can chew. (Algol 68 is a nice example of a bunch of smart people, with relevant domain expertise, biting off way too much for the machines of the time.)


I think you're reaching higher


Yep, it was tremendously correct to start with (iirc) 'assert_eq' and 'print' as the two first statements in my language when building out the compiler. It meant I could start writing a test suite and assemble a larger and larger set of functionality as I went.

On the other hand I do really hate the task of writing parsers, so I can relate to people who think it's the worst/most difficult part. But I think other parts are probably harder, like getting type system stuff right.


> Consider including "print" as a built-in statement

I really wish Python 3 had taken that advice. ;-)


Ooh, I like this. Too many people start projects at the logical beginning. But what you really want early on in a project is to maximize speed of exploration of the interesting parts.

To me there's a clear analogy with startups. The naive conception of starting a company is that you get a pile of money so you can hire a bunch of people and create important infrastructure. But with startups, you're trying something new, so the most efficient use of time is to find the riskiest hypotheses and test them as directly as possible. That often involves doing things that seem wrong if you proceed in the "logical" way. E.g., I knew a successful UGC company that didn't implement accounts and logins until like 6 months in. But that was fine, because actual accounts were not needed to figure out whether the business worked.


Start in the middle. It's the most interesting part, too. After all, that's the core of your idea.

I don't know where I heard this, but the idea is so important to me that I saved it on my blog: https://dorotac.eu/posts/in_the_middle/


That is also a good way to start an improv scene. Figure out later, if at all, what is really going on.


> Start in the middle. It's the most interesting part, too. After all, that's the core of your idea.

For a programming language? Maybe if you are designing your language by feature list.

What if you are designing a programming language for ergonomics instead?

Let's be real - the differentiating factor in any modern language design is the syntax, not the features. They all mostly support a similar cross-section of features, in terms of "getting things done".

What you are really designing is a competitor to the existing languages, in which case it is beyond the scope and effort of a lone developer to match modern languages feature-for-feature.

My experience of lone-wolf programming languages is all the same ... namely ...

Even if you do have one, single, differentiating feature, people aren't going to adopt it unless you have all the other features they want. Doesn't matter how good your feature is if your language is missing some feature that people like in current mainstream languages.

You should also be careful of thinking that a single good feature will cause a little adoption; if it's any good the existing languages will simply adopt it!

Another path into darkness is thinking that the batteries included is so different to current offerings that people will adopt it, such as that recent post on HN about a language developed for cloud by the author of CDK. There's nothing in that language that can't be implemented as a library for existing languages.

For programming languages alone, going feature-first is a good way to produce an obscure language that no one is interested in. Without even a small community, the original dev themselves won't use it.

Where a new language makes sense is in ergonomics, not in features.

Can you make the syntax such that people onboard quickly? Can your syntax support something complicated in a manner that the most simple-minded developer can understand? Is your syntax amenable to collaboration? Can it be easily parsed in pieces for IDEs? Will the output be package-distributable or module-distributable? Can you ensure easy GDB integration? What build mechanism can be used (for reading the sources and figuring out dependencies).

Syntax is the major difference between writing in Kotlin and writing in Java.

My new language project, I'm still trying to nail down what the syntax should look like. I have no problem documenting the tree to support features I want, but I find that settling on what good syntax looks like to the majority of corporate developers is really really difficult.

Compare with, from an AST, emitting code for some advanced language constructs. That's almost a mechanical effort that I think I can delegate a lot of to ChatGPT.

Designing universally acceptable syntax, on the other hand, is a lot more complex and requires actual human decision-making.


As far as I can tell, this author is writing a language just for the experience of writing a language, without much regard to getting it used, so it sounds like their success criteria may be a fair bit different than yours.


> As far as I can tell, this author is writing a language just for the experience of writing a language, without much regard to getting it used, so it sounds like their success criteria may be a fair bit different than yours.

Actually, his criteria was exactly the same as mine when I first started writing my own languages - learning experience, bragging rights, whatever.[1]

At some point, though, even I didn't use my own languages, because without a community of users, it soon became pointless.

That's the thing about some projects - they take off even when the objective is not to build something that takes off. It's why I admire (and maybe slightly jealous of, as well) Andreas Kling's SerenityOS.

It's the only project I can think of in recent years that was totally superfluous and unnecessary (any solo OS or PL project is, at this point in time), and yet it's gathering steam and shaping up to be a real contender.

But that type of thing is like winning the lottery - the odds are against it.

[1] One of my comments on HN last week mentioned that I want to write a Forth-type language; I told one of the responders to my comment that it isn't because there doesn't exist a quality Forth implementation.


That is sort of the idea of the MVP (minimum viable product).

I prefer the analogy of painting. You start by collecting references and exploring ideas in a sketchbook, make color tests, draw outlines on canvas, use big brushes for colors, and refine with smaller and smaller details.

The problem is that programming is all details / only details. There is no easy way to use big brushstrokes, so you have to improvise and not lose the overview. It doesn't help that engineers love details.


While I agree designing syntax before you know your semantics is unwise, please consider also that there are several actually parserless languages extant: lisps, forths, APLs, smalltalks, various Edinburgh languages, etc.

(Esterel —IIRC— is the only language of which I'm aware that explicitly has two syntaxes, one traditionally parser based and one that, in principle, could be parserless)


I don't like the term parserless in this context. It's not like you can just mmap a Lisp source file and cast it to the Ast type from the article.

Parsing something might be trivial, but it's still parsing.


"Not parsed ahead of time" might be the better qualification. At least in some forths you can cease parsing the file (in the "outer interpreter") at any point and perform any kind of computation or IO that you've previously defined, and based on that do whatever you want with the rest of the text of the source file (or just the next couple of tokens if you want), including parsing it manually in some other manner than what the outer interpreter would do by default. I haven't gone that deep in lisp, but I hear reader macros allow something similar, though perhaps they might be more restrictive by requiring a transformation into valid lisp trees / values, whereas forth allows you to just do whatever, and if that happens to have the side effect of adding new functions to the dictionary, so be it.


in that case, pretend I wrote "grammarless"

(in the examples above —which missed the prologs— I'm pretty sure the parsing is trivial on the order of "you can see everything that handles 'parsing' without needing to scroll a window", and in several of those examples it'd still be true even if your windows were only 25 lines long)


  > in that case, pretend I wrote "grammarless"
Gonna call you out on that as well. Forth, sure. Forth is...Forth.

But Lisp is not grammarless. Smalltalk is not either. The Lisp reader, notably in Common Lisp, is a non-trivial piece of code. The Common Lisp lambda list, as a language construct, is not trivial either. (Dare I mention the CL LOOP macro?) Just because everything is "a list, symbol, or constant" does not necessarily make the parsing problem trivial.

It SEEMS simple, but then you get into it, and you find you fall into the weeds.


It seems those in this discussion arguing that they don't need a parser or a grammar simply aren't aware they're implementing ad-hoc parsers and grammars but doing it poorly.


Not only am I exactly aware of that, I am even advocating it.

OP wants to write a hobby language, not reimplement Common Lisp.

If OP wants sexprs, they can get quick-and-dirty sexprs by:

  (1) substituting spaces around each paren in the input
  (2) breaking input up into a list of whitespace separated words
  (3) folding over the resulting list (by looking at its parens) into a tree
Sure, that's done poorly. However, it takes under 15 minutes (and under 15 lines?) to get them past their sticking point, and if the rest of their language (especially the part that differs from all other languages) actually turns out to be worth pursuing, they can always implement (or import) a real —or at least a better— reader later.

See https://news.ycombinator.com/item?id=36593060
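
A rough sketch of steps (1)-(3) in Common Lisp, assuming UIOP (bundled with ASDF) for the string split; TOKENIZE and PARSE-ALL are made-up names:

  ;; (1) pad every paren with spaces, (2) split on spaces
  (defun tokenize (s)
    (let ((spaced (with-output-to-string (out)
                    (loop for c across s
                          do (if (find c "()")
                                 (format out " ~C " c)
                                 (write-char c out))))))
      (remove "" (uiop:split-string spaced :separator " ") :test #'string=)))

  ;; (3) fold the tokens into a tree by matching parens
  (defun parse-all (tokens)
    (let ((stack (list nil)))
      (dolist (tok tokens (nreverse (first stack)))
        (cond ((string= tok "(") (push nil stack))
              ((string= tok ")") (push (nreverse (pop stack)) (first stack)))
              (t (push tok (first stack)))))))

  (parse-all (tokenize "(add 1 (mul 2 3))"))
  ;; => (("add" "1" ("mul" "2" "3")))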


> If OP wants sexprs, they can get quick-and-dirty sexprs by:

You know what it takes to implement a parser for s expressions?

E := atom | '(' *E ')'

With atom expanding to all data types you wish to support.

Let's presume using a parser generator is an insurmountable task for someone developing a parser. In a hand-rolled packrat parser, this whole grammar is implemented with a couple of functions.

Is writing two functions too much work for someone wanting to write their own parser?

Your ad-hoc trick is much more work than actually implementing a proper, maintainable parser.
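
For the record, those couple of functions can look like this in Common Lisp: a plain recursive-descent rendering of the grammar above, over a list of string tokens (memoization, the packrat part, omitted; the names are illustrative):

  ;; E := atom | '(' E* ')'
  (defun parse-expr (tokens)
    "Parse one E; return (values expr remaining-tokens)."
    (let ((tok (first tokens)))
      (cond ((null tok) (error "unexpected end of input"))
            ((string= tok "(") (parse-expr-list (rest tokens)))
            ((string= tok ")") (error "unexpected )"))
            (t (values tok (rest tokens))))))

  (defun parse-expr-list (tokens)
    (if (and tokens (string= (first tokens) ")"))
        (values nil (rest tokens))
        (multiple-value-bind (head rest) (parse-expr tokens)
          (multiple-value-bind (tail remaining) (parse-expr-list rest)
            (values (cons head tail) remaining)))))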


FWIW, I've done it both ways. The ad-hoc trick is cromulent.


> in that case, pretend I wrote "grammarless"

You are always parsing a grammar. It seems you simply aren't aware of it.

Every single time you write code that expects to find a certain type of token after another type of token, you have a grammar.


It wasn't compiled, but the way Smalltalk-72 did parsing (integrated into the rest of the execution) is worth understanding.

http://worrydream.com/EarlyHistoryOfSmalltalk/


I had the same thought as the OP about a language I wanted to design to explore some concepts, and thought of using JSON as the syntax, with the grammar defined as a JSON schema.


This is one thing I designed Jevko[0][1] for.

If you have an idea for a format or a language and would like to quickly start hacking on the layer above the syntax, Jevko is an option.

It's meant to be even simpler and hackable than S-expressions.

It gets you from a string to a tree in the least amount of steps.

See here[2] if interested.

Happy hacking!

[0] https://jevko.org/ [1] https://djedr.github.io/posts/jevko-2022-02-22.html [2] https://gist.github.com/djedr/151241f1a9a5bc627059dd9b23fc74...


With the square brackets it has a bit of a Rebol feel to it. Was that intentional or coincidental?


I suppose a bit of both.

I was more directly inspired by Lisps, but I do prefer the original M-expressions and the syntactic choices that REBOL and Red make.

I think placing the operator before the opening bracket better emphasizes its special significance and can reduce nesting for constructs like `f[x][y]` (vs. `((f x) y)` in Lisps). Square brackets somehow seem more aesthetically pleasing to me. And there is a practical reason to prefer them, especially if your syntax uses only one kind of brackets -- square brackets are the easiest to type on an average keyboard.

So REBOL-like syntax is nicer. As were M-expressions. They probably didn't catch on, because they were not minimal enough, compared to S-expressions. And maybe because S-expressions were fully implemented first.


Nice project. Thank you for sharing.


Thanks! :)


One great resource for making a language without a parser, not just delaying it but never writing it at all, is JetBrains' MPS[0]. It lets you make languages in a projectional editor: rather than a syntax parsed into an AST, you edit the AST directly in a series of GUI cells pretending to be an editor, much as in MS Word, so rather than writing a parser from syntax to AST, you write a renderer from AST to syntax.

Benefits of this approach include there being no such thing as a syntax ambiguity, language extensions being as easy to make as libraries, the language being easy to write by non-programmers, and a JetBrains IDE for your language - at the cost of not being able to use any other editor.

A great example of such a language is mbeddr[1].

[0]: https://www.jetbrains.com/mps/

[1]: http://mbeddr.com/


Just last week I went through the calculator tutorial and I gotta say nah this ain't it.

I had hope even though I knew what the deal with projectional was going in, and indeed I was disappointed. The number of proprietary "dsl"s you have to grok to be proficient at using MPS is just mind-boggling, and that's coming from someone who designs dsls for a living. Like by the time you get to codegen you're already 3 or 4 deep. And it's only for codegen, where you're generating Java, that you get the full projectional editor treatment. Everything else is just a gui form for some small "dsl" with effectively dropdowns, so you might as well just call it an API for a configuration system rather than a language. Like there's zero composition for the structure "language" and the editor "language". You're literally just toggling forms.

It's just a complete turn-off because even if it is "powerful" it's completely non-portable - you cannot ship anything without shipping MPS to orchestrate the cascade of dsls.

I tried to get mbeddr to work but could not even though I can drive gradle and etc fine.

Overall really disappointing.


Both comments are right. Projectional editors are the way to avoid parsers (instead of parsing, you spend time in the editor, though). And MPS is indeed a beast.


> I can’t tell you why I keep returning to this venture when I’ve failed at it so many times.

Sorry if this sounds stupid or obvious, but with this kind of thing I find that it's easier to cross the finish line if you maintain humble goals. Focus on just getting a working end-to-end MVP. Refine and enhance it down the road; don't get stuck trying to make version 0 an awesomely praiseworthy effort.


Absolutely! That's what I'm attempting this time. I'm hoping if I start with literally just the lambda calc + integer literals I can get something working e2e and layer stuff on top from there
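
For the curious, that slice really is tiny if the surface syntax is s-expressions. A rough sketch in Common Lisp (LAM is a made-up binder tag, not anything from the article):

  ;; lambda calculus + integer literals, evaluated off raw s-expressions
  (defun evaluate (term &optional env)
    (cond ((integerp term) term)                    ; literal
          ((symbolp term) (cdr (assoc term env)))   ; variable
          ((eq (first term) 'lam)                   ; (lam x body)
           (let ((var (second term)) (body (third term)))
             (lambda (arg) (evaluate body (acons var arg env)))))
          (t (funcall (evaluate (first term) env)   ; application (f x)
                      (evaluate (second term) env)))))

  (evaluate '((lam x x) 42))   ; => 42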


If this was reddit, I would invoke a !remindme to check in on you in a month.

Hoping to hear a positive update from you in the near future!


This is one of the cases where you actually do want Lisp-style s-expressions, because they don't need any real parsing; functionally speaking, they are already the AST. (This is why you sometimes hear people saying that Lisp "doesn't have syntax".)


> This is why you sometimes hear people saying that Lisp "doesn't have syntax".

The other reason is that a language like Common Lisp is defined in terms of the data-structures used by the language and the language has no unique text representation: the “default” reader uses a slightly extended version of s-expressions, but any data structure in the language can be evaluated and any transform from text to data structures can be a textual syntax for Common Lisp.


Exactly. And if you are paren-phobic, you could also have a look at Postscript, Forth or even Smalltalk/Self -- the last two are about the most minimal you can get with infix operators (but no precedence) and "keyword-arguments" (sort of), well, unless you want to go all-in on infix, and do APL.


> Without fail by the time I’ve managed to produce an Abstract Syntax Tree (AST) I’ve lost all steam.

Lisp helps with this. You have the syntax settled, and can concentrate on the language from the get-go with all your steam.


One option you could also try is to use an existing language's syntax. Plenty of languages have high-quality parser libraries by now, like swc/rome/acorn for JavaScript, rustc_parse for Rust, etc. Of course syntax is influenced by semantics, so you'll end up wanting to remove or add syntax, but you could probably get decently far before that becomes a problem.


> Of course syntax is influenced by semantics

Only to the extent that AST structure depends on syntax. For something like s-expressions defining new semantics never requires new syntax since arbitrary trees suffice to syntactically express any AST.


One of my exploratory projects, back when I still thought an academic career made sense, involved replacing Smalltalk's stack-based bytecode with incrementally transformed S-expression trees. Naturally the first thing I set out to do was write an S-expression reader and writer in Smalltalk; my advisor at the time told me that it was pointless busywork and I should just use ST literal syntax. It looked somewhat ugly with all the # there, but saved a lot of time.

Well, 12 years later (ie. early 2023) I realized that I don't really have any kind of cool sideproject and started implementing the same idea in C (with the added goal of the VM being natively multithreaded with fine-grained locking along the lines of JikesRVM and WebKit). Well, I have stub implementations of classes needed for the AST representation and S-expression reader and writer…


Yeah I was working on something similar with Go. We weren’t exactly trying to turn it into a Lisp, but for various reasons outputting the Go AST as s-exps was interesting.

That got me thinking that if I ever do any language design again I’m almost certainly going to just start by writing a Lisp and then layering any additional syntax that appears desirable on top of that.


Sure, but if you're using an existing language grammar, it will almost certainly be influenced by the language's semantics, unless that language happens to be lisp (and even then most lisps are not pure s-expressions)


Here's an interesting anecdote in the same spirit of skipping uninteresting work. Back when I worked with a rather experienced Haskell developer, he also wanted to design a language but dreaded the part where a parser was needed. Besides that, he took the further shortcut of representing functions in his language using functions in the host language Haskell. That is to say, the AST for functions contains Haskell functions. This technique is called higher-order abstract syntax.

Naturally there are limitations here. There's not much you can do to poke into the structure of Haskell functions other than evaluating them, so any sort of optimization and code gen work must happen by evaluating the function in Haskell, which gives you more of a challenge in designing the AST.

That said, so far the author's language resembles the simply typed lambda calculus, and this is generally too simple. The meat of the language hasn't actually been designed yet in this post.
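
A tiny sketch of the idea, transposed to Common Lisp for concreteness (the Haskell version is analogous; the node names are made up):

  ;; HOAS: the LAM node stores a host-language function instead of a
  ;; variable name + body, so substitution comes for free
  (defstruct lam fn)
  (defstruct app fun arg)

  (defun ev (term)
    (typecase term
      (lam term)                                  ; closures are values
      (app (funcall (lam-fn (ev (app-fun term)))
                    (ev (app-arg term))))
      (t term)))                                  ; literals

  (ev (make-app :fun (make-lam :fn (lambda (x) (+ x 1))) :arg 41))
  ;; => 42

Note the limitation mentioned above: you can FUNCALL the wrapped function, but you can't look inside it.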


I like this. The focus on lexing/parsing for language implementation often overshadows the fact that the bulk of the work in any language effort is in the semantics and analysis (e.g., everything after one has a parse tree). On the production compiler I get paid to work on, front end work makes up at most 10% of the team time and effort. Even on experimental language projects that we occasionally play with, the front end is usually given minimal attention - just enough to have a syntax that we can use to start instantiating ASTs and doing the interesting work. More often than not, I punt on a front end altogether and just piggyback on some existing language (e.g., Haskell, ML, etc) and basically explore the semantics or analysis questions I'm interested in via a DSL-style embedding and ignore syntax altogether.


The parser design space is really interesting, but recursive descent seems to be the best go-to, and you don't have to think about parsing theory much. Or you could go with no syntax (Lisp or Scheme style), or a structured language (but you'll need to do your own editor for that).

You could also try doing an embedded DSL in Haskell or even Kotlin if you just want to get your feet wet.


I always start with the AST and just work forward and backward from there.




I have been thinking for quite some time about using XML for this. It is a pure AST that you can relatively conveniently author as text.

Having nice syntax is a good thing, yet no matter what the syntax is, it ends up as an AST. And the structure of AST records and the ways they can be combined matters quite a lot, I would say more than text syntax. XML makes it visible and easy to experiment with without getting distracted by text syntax niceties.


I started something similar, but my goal is to create a projectional editor for the language, where syntax is more of a presentation problem than a parsing problem.

https://github.com/hardliner66/vormbaar/


Honestly, designing a parser is easy: just start using ANTLR and perhaps add an AST layer later. However, if you do not want to go that route, I suggest looking into projectional editing, for example JetBrains MPS or Freon by Jos Warmer.



