“Screw it, I'll make my own” – The story of a new programming language (breuleux.net)
139 points by chrismonsanto on Feb 6, 2016 | 68 comments



> There is a bit of a catch-22 in language design where the more a language is used, the clearer it becomes which parts of it are problematic and should change, but the harder it gets to actually change them. To hone a language you must use it, but applications require that a language's features remain stable, robust, set in stone, and therefore as imperfect as they were at that moment. Furthermore, the more delays are suffered in fixing flaws or plugging holes, the more likely it gets that a dependency is fostered upon them. The options you do not explicitly and insistently leave open, tend to close and seal themselves before you know it.

This rings very true. I look at Ceylon and the features I'm most jealous of are things that Scala would and should just do, except for the need to retain compatibility with 10 years of existing applications and libraries, so instead they're enhancement proposals for the version after the version after the next version, and in the meantime we get by with macros that more-or-less cover the most important use cases. And Scala is relatively young as languages go - its warts are nothing to those of, say, C++. Sometimes I wonder if language design can only advance by new languages killing existing languages - past a certain point it seems impossible to change a language in the ways that it needs to be changed to become good.


A nice solution was used during development of the Go language: a tool for code rewriting (it was called "go fix"). It allowed the language authors to quickly and easily transform large swaths of code in the standard library, and also let everybody execute the same transformations in any third-party code. Thus enabling rapid and extremely painless iterations. [I'm consciously simplifying the story a bit to make it shorter and more concise, but I'm not hurting the core message I believe.]


I think that's a neat idea. I've thought about it, but I've never used Go personally and I wasn't sure just how much it would help. I imagine it works best for languages with good static properties/tools for code analysis.


I think the important cases are the ones that you can't fix by hand - the ones where you're removing a dangerous idiom.


Yeah, Python tried this as well with their 2to3 tool (https://docs.python.org/3.5/library/2to3.html), but their release bump was a little bigger than the Go prerelease changes (and the users not as flexible) so it wasn't much of a magic bullet.
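
For anyone who hasn't run it, here's a rough sketch (in plain Python, simplified for illustration) of the kind of mechanical rewrites 2to3 applies; the Python 2 spelling is shown in each comment, with the rewritten form below it:

    # Simplified illustration of 2to3-style fixers; the real tool handles
    # many more cases and edge conditions.
    n = 3
    d = {"a": 1}
    k = "a"

    # print "total:", n        (print statement -> print function)
    print("total:", n)

    # for i in xrange(n):      (xrange -> range)
    for i in range(n):
        pass

    # if d.has_key(k):         (has_key -> "in" test)
    if k in d:
        print("found", k)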


I find it very useful but it does help to write forward compatible 2.x code and test frequently with 2to3 to catch what it misses.


go fix didn't work all the time; there were some edge cases where it failed. Also, the syntax of Go didn't change much.


Xcode/LLVM has this with fix-its.


I've been trying to break this catch-22 by experimenting with a language/VM/codebase without any backwards-compatibility guarantees. Instead of freezing interfaces for others to use, I'm going to guarantee instead that I'll have tests showing how to use my mechanisms, and you need to have tests for the programs you write using them. I'll try to make breakage happen in obvious rather than subtle ways, but if you don't write tests you'll be in a world of hurt. If you write thorough tests, however, it might be a grand adventure as we hopefully keep the language supple over time. Who's with me?

Probably nobody :) Oh well, I'll be here by myself getting muddy in this foot path next to the highway. But it might be entertaining to read about what I've been up to.

http://akkartik.name/post/libraries2

http://akkartik.name/post/readable-bad

http://akkartik.name/about

http://akkartik.name/post/mu


The formal definition of a version of Perl 6 corresponds to (a frozen version of) the Perl 6 test suite (currently around 120K unit tests).

A compiler declares its claimed compliance by reference to these test suites:

  > perl6 -v
  This is Rakudo version 2016.01.1
  built on MoarVM version 2016.01
  implementing Perl 6.c.
6.c labels a frozen version of the Perl 6 test suite and thus a version of the Perl 6 language.

All serious users of the language are encouraged to check/improve the language test suite to ensure it covers what they want covered in either an existing version or an upcoming one.


This is very much what Lua has done.

Major version switches not only don't retain backwards compatibility, but have in the past drastically changed the language. With v5.0 they finally seemed to hit on a good combination of features, syntax, and semantics, and the pace of new major releases has slowed noticeably.


Rich Hickey had about four lispy prototype languages before he settled on the design of Clojure; it didn't spring fully formed from his head. I think that's a sobering thought.


> past a certain point it seems impossible to change a language in the ways that it needs to be changed to become good

I think it'll be a few years before Perl 6 matures enough for most folk to see why it matters, but its design incorporates Larry Wall's response to pg's http://www.paulgraham.com/hundred.html


Nice story. I have similar feelings about my own attempt, Runa (https://github.com/djc/runa), in particular about just how large a language (ecosystem) has to be in order to be viable to a few more people.

Still, just the journey is quite interesting. It makes you understand other programming languages and their tradeoffs better, and I'm now much less afraid of parsers and lexers and so on. :)


> it is still unsettling when you feel like you can't participate in discussions about programming languages without indulging into self-promotion

Ahem, cough, cough, er, ...


Curious. Are there examples of programming languages that allow spaces in identifiers? Obviously, it would need to be designed for that.

I'm against the case-sensitive nature of some programming languages, and file systems for that matter. While it makes sense in a computing context (faster), you invent a new mode, just for the computer.

(Fortran was probably case-insensitive because upper-case letters were used first, and when lower-case letters were introduced, care was taken to neatly map them into the character set bit-wise, so one could simply clear the sixth binary digit to turn a lower-case letter into its upper-case equivalent, then continue the string comparison.)
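
For what it's worth, here is a minimal sketch of that bit trick in modern terms, assuming plain ASCII rather than the original Fortran character sets:

    # "Clear the sixth bit" in ASCII: 'a' is 0x61 and 'A' is 0x41, so the
    # cases differ only in bit value 0x20 (the sixth bit, counting from one).
    def ascii_upper(ch):
        code = ord(ch)
        if 0x61 <= code <= 0x7A:      # only touch lower-case ASCII letters
            code &= ~0x20             # clear the sixth bit -> upper case
        return chr(code)

    assert ascii_upper("q") == "Q"
    assert ascii_upper("Q") == "Q"    # already upper case, unchanged
    assert ascii_upper("7") == "7"    # non-letters pass through untouched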

Of course, with Unicode, you need a more complicated check, or you could just ignore Unicode in the language itself for identifiers, yet allow it in literals, data types, etc.


> Curious. Are there examples of programming languages that allow spaces in identifiers? Obviously, it would need to be designed for that.

At least Tcl allows spaces in identifiers, but to 'use' such identifiers one does have to add a bit of extra 'sugar' to prevent the code parser from interpreting the spaces as token separators:

Example of spaces in a variable name:

    $ rlwrap tclsh 
    % set "var name with spaces" "contents of the variable"
    contents of the variable
    % puts ${var name with spaces}
    contents of the variable
    % set "var name with spaces"
    contents of the variable
Example of spaces in a procedure (function) name (the first line defines the procedure):

    % proc {my space proc} {string} {puts "'my space proc' called with string='$string'"}

    % {my space proc} "hello how are you"
    'my space proc' called with string='hello how are you'
    % "my space proc" "the quick brown fox"
    'my space proc' called with string='the quick brown fox'
    % set pn "my space proc"
    my space proc
    % $pn "this that and the other"
    'my space proc' called with string='this that and the other'
So there you have at least one example.


Algol 68 allows spaces in identifiers. That means one has to use one of the "stropping" techniques to distinguish keywords from identifiers--case stropping (IF p THEN foo ELSE bar FI), quote stropping a la the old IBM Algol F Algol 60 compiler ('if' p 'then' foo 'else' bar 'fi'), and at least one other I don't remember offhand.


Case-sensitivity makes sense because it enforces uniformity of code. Case insensitivity only matters when you want to write identifiers differently at different locations. The only place where I see this could be useful is when you use a library that uses a different convention.

The IMO better way to solve this is to set a convention for your programming language and enforce it with the compiler (at least with warnings).


As you say, it's about conventions, but with a case-Sensitive system you are more likely to HAVE TO enforce naming conventions, because there's a distinction now. Otherwise you'd write "MidiPort"/"MIDIPort"/"midiPort" or what have you.

Keep in mind we wouldn't need to care about enforcing case in a style guide, if case didn't matter, because there would be no distinctions, and you'd be more inclined to write it the natural way; no camelCase needed to disambiguate "MIDI Port" as midiPort, or MidiPort, MIDIport, etc.

Case-sensitivity only creates unnecessary dissonance, and leads to clever uses of that system, adding even more choice; and as we know from The Matrix, the problem is choice. ;)

So if we keep it closer to how we would normally read and write words, I think there would be less dissonance about that aspect of programming, or naming files for that matter.


> Keep in mind we wouldn't need to care about enforcing case in a style guide, if case didn't matter, because there are no distinctions

You would still need it to get uniform code. Spaces vs. tabs also doesn't matter but it's still in almost all style guides.


I approve of case-sensitivity where an initial Capital letter indicates that Something is Publicly accessible and it really doesn't matter what happens after that. Hence, we could have:

  Newton-Raphson Runge-Kutta Fast-Fourier-Transform

  Class-ID IO-Channel MIDI-port 

  Freudian-Id Io-Channel
In contrast to this, I feel it makes sense to have lowercase mean that something is private. Hence, no camelCase:

  x y z variable longer-variable-name
I've never been that comfortable about appending numerals to the end of identifiers to disambiguate them, as I feel that this is a sign that they ideally ought to be subscripted and implemented as arrays. I much prefer hyphens to underscores, but would ideally like to use individual words separated by spaces. This can only work if you have an IDE that hides all the underscores (which are incredibly ugly and serve no useful purpose in printed material these days) as you input them, outputs NBSPs instead, and then uses similarly suppressed prefix sigils to style your raw input text into an output that conforms to traditional mathematical notation. Hence, we could have:

  /foo_bar + /bar_qux
become:

foo bar + bar qux

similarly, the following is not a problem if you take advantage of the syntax rule that requires at least one space either side of an operator. Hence, we could have:

  /foo_bar / /bar-qux
become:

foo bar / bar-qux

i.e. the / sign isn't echoed when you initially type it, since the IDE is expecting a letter; but when the IDE receives whitespace it belatedly echoes it as the operator symbol, since it is now sure that it isn't a suppressed sigil.


> Case-sensitivity makes sense because it enforces uniformity of code.

I don't see that case-sensitivity helps to achieve uniformity of code that much. Factors like code structure, common design patterns, and source code formatting are more important. The approach to the structure and design of an application or library is something that each individual development group decides for themselves. Source code formatting can (and should) be enforced by formatting tools.

Having used a case-insensitive language for a while (Object Pascal), I find that developers tend to follow the case convention of a given software project anyway, and if they don't, the case-related typos aren't an issue. They don't make the code harder to understand and it all compiles.


Reading Erlang is really nice partially because all variables are capitalized (enforced by the compiler). You know immediately which parts of the code are what.

It actually drives me a little nuts that I can't do the same thing in Elixir (compiler enforced lowercase) because so much of the code looks the same.


Common Lisp allows spaces in identifier and symbol names, but you have to quote them with vertical bars:

    (let ((|This variable has spaces| 1))
      (print |This variable has spaces|))


Similarly, R (which is basically a Lisp with C-like syntax) allows spaces in identifiers, though you'll have to construct usages of such identifiers using quote() or backticks.


Fortran was case-insensitive because it used early character encodings that had no lower-case letters, so there was nothing to be sensitive to.


Yes, I alluded to that by way of the sixth bit; I just had no reference to whether it was something they chose to do, or had no other choice at the time.



Thanks for the suggestions!

I found some more tidbits in these StackOverflow answers:

http://programmers.stackexchange.com/questions/145751/has-wh...


Agda allows for "mixfix" function names with spaces. It's very interesting.


VHDL allows arbitrary text in its extended identifiers. The feature was added for easier interop with other tools that have less restrictive rules than VHDL's normal identifiers.


In Racket, you can use "-" and "?" in variable names and I love it.


And "/" and ":" and all manner of others. It's a Lisp thing, where the syntax of S-expressions removes any ambiguity.


x-y and x - y meaning different things seems like a great source of errors to me


That's not a problem because subtraction in Racket, as in any Lisp-like, looks like this:

    (- x y)


It depends on a couple of factors. Personally I always write subtraction as "x - y", so I never have any issues. Also, if undeclared variables are a compile-time error, the compiler will (probably) complain loudly about "x-y". So you can fix that quickly and easily; it's not too bad.

The one error that might be problematic depending on the language's semantics is if you write something like "x.y-z" intending to subtract "z" from "x.y", but in fact you are going to get the "y-z" property of "x". I must say that in Earl Grey you'll unfortunately end up with "undefined" as the result of that expression (saner languages would raise an exception on a missing property). I have never had that issue in practice, but then again, I always space subtraction.


I don't have a problem with "x - y" and approve of hyphenated names if you can't support spaces within names.

I had initially jumped to the conclusion that your language would use a postfix notation and be more influenced by Forth than LISP. This is because I assumed its name alluded to the way that Captain Jean-Luc Picard would program his replicator.


> I chose to compile Earl Grey to JavaScript

Some programming languages generate code much deeper down the abstraction stack, e.g. Haskell generating machine code, whereas others generate code to a much shallower depth, e.g. most languages generating JavaScript. A language generating lisp code from some syntax could even be shunting code up the abstraction stack.

How deep does the generated code need to be down the stack for some syntax to earn the name "programming language"?


If you look at the code, and can't identify it as language X, then it probably deserves a new name.


This sounds pretty similar to the definition of a species.

If two organisms of appropriate gender can't reproduce with each other, then they probably aren't the same species.

If code from two samples can't be interspersed, then it's probably not the same language.


Under that definition, Perl and Python have finally merged, given Perl 6's Inline::Python. Peace at last!

I'd probably go with a definition more like "mutual intelligibility" like for natural languages, but that doesn't seem right either. Sometimes I have to turn on the subtitles for British TV.


Yeah, under that definition everything that supports FFI has merged into a massive unholy mess with C. I guess if it can't be "natively" compiled/interpreted as the same language without inlining or including, I'd say it's a different language.


None of the languages I typically use allow hyphens in identifiers, but I find kebab-case to be the easiest on the eyes.


heck, even COBOL '75 (maybe even '68) allowed hyphens in identifiers.


It's not surprising that changing a language would be difficult. It does, after all, include a software component, and pretty much all large software projects suffer the same. It can be easy to change minor technical details, but a deep, far-reaching design decision is very often fixed in place. Even with a good test-suite, that kind of change is rarely undertaken. At least in my experience.

I suspect we've all worked on software where we would desperately love to change something fundamental in the design, but no matter how painful leaving it alone is, it's preferable to live with the pain rather than tear it all down.

I've known a few writers for example that say they wish they could change their characters, but it's too late. I wonder if that's analogous.


Very interesting account of your experience. I've been toying with my ideal programming language for some time and am interested. You bring up a good point about people favoring the styles of their most recently used language. For that reason I worry that any language I'd create would merely be a combination of languages I've used previously and not something original. I'm excited to take a look at your work.


Well that's the problem that needs solving all right.

Yet again today I have a Python project that is already (at 200 lines!) obviously massively compressible if I had macros.

Maybe I can get there by returning functions & building up the functionality that way, but it feels like I'm trying to scratch that spot in the middle of my back, where I can't quite reach.
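
To sketch what I mean by returning functions (a plain-Python closure factory; the names are made up for illustration), something like this gets part of the way there:

    # Stamp out near-identical functions from one template instead of
    # copy-pasting them. All names here are hypothetical.
    def make_validator(field, predicate, message):
        def validate(record):
            if not predicate(record.get(field)):
                raise ValueError("%s: %s" % (field, message))
            return record
        return validate

    check_age = make_validator("age", lambda v: isinstance(v, int) and v >= 0,
                               "must be a non-negative integer")
    check_name = make_validator("name", bool, "must be non-empty")

    record = {"name": "Ada", "age": 36}
    for check in (check_age, check_name):
        check(record)

It removes some of the duplication, but it can't introduce new syntax or control evaluation the way a macro could, which is exactly the spot I can't quite reach.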


Rapid prototyping is this alternating process of building stuff up and then boiling it down. The building up is easier in Python, but the boiling down is easier in Lisp.

Sure would be nice to have both on tap.


Unfortunately, I can't say this is unexpected: http://breuleux.github.io/earl-grey/repl/



Ah, yeah, the link to the REPL in the GitHub repository was broken (I figure that's where you got it). I fixed it. As the other commenter pointed out, the real link is http://breuleux.github.io/earl-grey/repl.html


I'm curious, would it have been easier to change design decisions if you hadn't bootstrapped it?


Probably a little, but I don't think it would have made a big difference. It takes a certain amount of code before it becomes clear that the design has a flaw, and it's going to be the same amount of effort to adapt it, whether that code is in the compiler or elsewhere. A bootstrapped compiler just requires a bit more boilerplate in the transition.


[deleted]


> I don't understand why people would use a dash "-" inside a symbol

No need to use the shift key. At the levels of efficiency that a good language can reach, this matters.

Also, dashes match established lexicographical conventions better than underscores because dashes look like hyphens.


I'm all for efficiency but based on my experience reading the code people write, the last thing we need is getting them to type faster. I'd prefer if some people slowed down a whole heck of a lot to think more about what it is they're doing.


I always thought autocomplete suggestions and snippets preserved brain cycles. IMHO if you prevent developer exhaustion your code gets better-looking and more maintainable. But then again I'm a single developer, working on personal projects...


In COBOL, dashes are the way of joining words. Very readable actually.


Well camel-case does reduce the length of the identifier without sacrificing clarity. This can help reduce line length, which, in turn, allows more files to be viewed on a single screen in splits, without wrapping or scrolling. I think this solidifies camel-case as the most practical style, if you ignore aesthetics.

The underscore was invented to allow people using typewriters to underline text.

I think that the hyphen is more pleasing, aesthetically, but overall I think that it's a poor tradeoff.


The underscore in computing was developed to be able to separate words used as part of a variable name on computers with only upper-case. Quoting https://en.wikipedia.org/wiki/Underscore#History :

> IBM's report on NPL (the early name of what is now called PL/I) leaves the character set undefined, but specifically mentions the break character, and gives RATE_OF_PAY as an example identifier

It links to the 1964 report at http://bitsavers.informatik.uni-stuttgart.de/pdf/ibm/npl/320... which defines the "_" as the "break" character, on page 22, and on p23 says "[A]n identifier is a string of alphabetic characters, digits, and break characters with the initial character always alphabetic. Any number of break characters are allowed within an identifier; however, consecutive break characters are not permitted. Also, a break character cannot be the final character of an identifier."

(To verify the timing, "_" was not in X3.4 1963 (see http://worldpowersystems.com/archives/codes/X3.4-1963/page6.... ) and that ASCII code point was instead "left arrow".)

Your argument regarding "more files to be viewed on a single screen" is valid, but incomplete. It's more a question of total program comprehension than a single metric.

This is hard to measure. We can look to related metrics of speed-of-identification and accuracy to see how messy the subject is. The report at http://www.cs.kent.edu/~jmaletic/papers/ICPC2010-CamelCaseUn... says that programmers who are trained in underscore style can recognize underscore style more quickly than camel case, while https://www.researchgate.net/publication/221219628_To_camelc... says that camel case is all around better.

At the very least it suggests that "most practical style" is hard to determine.


Meta characters were also added for file, unit, record, and group separators. Some may correctly argue that using these chars to structure data in flat files is simpler and technically superior to the alternatives.
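
As a minimal sketch of what that could look like (in Python, using the ASCII record separator 0x1E and unit separator 0x1F, and assuming fields are ordinary text that never contains those control characters):

    # Structure a flat file with ASCII separator controls instead of CSV.
    RS, US = "\x1e", "\x1f"   # record separator, unit separator

    rows = [["Ada Lovelace", "1815"], ["Alan Turing", "1912"]]

    # No quoting or escaping rules are needed, since these control
    # characters don't appear in ordinary text fields.
    encoded = RS.join(US.join(fields) for fields in rows)
    decoded = [record.split(US) for record in encoded.split(RS)]
    assert decoded == rows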

Still, people will stick to what they're familiar with despite the technical benefits. That's why we have CSV, TSV, JSON, etc.

I'd argue that 'most practical' is whatever format most people immediately understand at first glance. That literally means everybody, not just a small subset of programmers who already use a particular styling standard.


"whatever format most people immediately understand at first glance"

Certainly that's a useful starting point. The problem is in figuring that out when there are multiple, roughly similar representations.

But it also depends on the goal. Sometimes it's better to learn a new format (Einstein notation, bra-ket notation, copy editing and proofreading symbols, modern staff notation for music, shorthand, etc.) than to use a system that a larger subset of people will understand immediately.

Forth is an example of a programming language which is developed for programmer productivity, on the assumption that the programmer will put in the effort to be proficient in the language.


I was speaking in terms of the 'eventual' case, not a starting point.

While optimizing syntax/form makes sense in highly specialized domains where no useful alternative exists, I'd argue that the opposite holds true in domains where more 'natural' alternatives are abundant.

Can't say I'm familiar with all of those. For proofreading, meta chars are necessary to indicate edits without the ability to mutate the original text. Musical notation has widely been replaced by tabs for guitar. Shorthand may be useful for writing that isn't consumed by others.

Cursive is a perfect example of a form of writing that was created for efficiency, which, arguably, held true for handwriting. But it didn't add enough of a benefit above and beyond plain handwritten text, and it was very difficult to duplicate digitally.

Not to rag on Forth; I'm sure it's probably a very good language, but how widely is it used today?

Like I said, no amount of research proving that programmers choose languages based purely on their technical merits can disprove the writing on the wall.

People choose what feels natural to them based on previous experience and/or common convention. Whatever choice requires the least amount of context switching overhead and allows the lowest barrier of communication between devs will win in the end.

That's why TypeScript is immensely popular with developers who have a strong OOP background and prefer writing code in an IDE.

For C, the low-level support for types and memory access makes it a natural fit for systems development. I have written low-level network code in C#; it's an extremely awkward and verbose mess.

Python wins when it comes to simplicity and the ability to write really powerful functionality with a minimum amount of code. The list slicing as well as comprehensions are easy to understand and increase productivity dramatically.


"Well camel-case does reduce the length of the identifier without sacrificing clarity"

I think camelCase easily beats underscores, but I also think it can reduce clarity a bit as soon as one uses abbreviations or mnemonics that commonly are written in all caps in identifiers.

Do you name your class IoChannel or IOChannel? For some, the former is a channel on a moon of Jupiter.

Do you name your variable classId or classID? For some, the former is related to Freud, so one could expect to see classEgo and classSuperego, too.

Made up examples? Yes, but I don't think you can fully ignore aesthetics; I find that camelCasing such terms as ID, IO, XML and HTML in identifiers sacrifices clarity. 'Id' in particular makes me cringe whenever I see it (yes, that makes me a bit of a snob, but I simply cannot get used to it). That certainly applies to cases where a common abbreviation also is a word or easily read as such.

On the other hand, I also think keeping such abbreviations all uppercase in CamelCase identifiers sometimes "doesn't look right", and "doesn't look right" aka "aesthetically ugly" distracts me from understanding code.

Also, the argument that "showing more" implies "more practical" isn't that strong. If it did, we could take an idea from colorForth and remove spaces, replacing them by color or font changes. We also could use multiple statements on a line.

I think I would prefer a language that allowed hyphens and punctuation in identifiers (the latter are really useful for such conventions as using a trailing '?' to indicate a method that tests a Boolean condition, and a trailing '!' to indicate a method that mutates its arguments). I also like the Dylan convention of using asterisks to indicate class names.

That remains 'I think', though, because I don't actually use such a language, for practical reasons such as the availability of libraries.


> I think camelCase easily beats underscores, but I also think it can reduce clarity a bit as soon as one uses abbreviations or mnemonics that commonly are written in all caps in identifiers.

For my money, I'll take Python conventions of CamelCase for a few things and underscores for the rest. I find underscores a lot more readable (the omnipresence of CamelCase is one of these things that irk me about C#, though it's not as bad as the mutant mix that is Capitalized_underscore which you see in some OCaml codebases).


why would you ignore aesthetics? I'd much rather spend my time poring over pleasant looking code than ugly looking code.


Beauty is in the eye of the beholder?

Look at the never-ending discussions about s-expressions in Lisp and all its variants.


Hyphenated words are a common/familiar feature of English; underscores are not.

https://en.wikipedia.org/wiki/Hyphen#Joining



