GCC tiny (thinkingeek.com)
111 points by ingve on Aug 24, 2017 | 16 comments



I appreciate the series of articles on Tiny and think they are fine, but I have a few minor points:

1. Missing “string” types in Tiny?:

In Part 8, both a “bool” and a “string” type were first introduced. Tiny’s type system was then extended with “bool” in Part 9.

But the “string” type (not to be confused with a string literal) was never added to Tiny’s type system.

I assume an array of “char” would cover that “string” type if “char” were added as an extension of the type system.

2. Expressions in Tiny:

The rule "unary-op" shows that unary "plus", unary "minus" and "not" are all unary operators in Tiny. In Part 4, the rule is confirmed with a statement that "not" is a unary operator.

But in the table of operator priorities in Part 1, "not" is grouped with "and" and "or" instead of with unary "plus" and unary "minus", so it has the incorrect priority there.

3. In the GitHub file for Tiny’s grammar, the rule for expression is:

expression -> primary | unop op expression | expression binop expression

I believe this should be changed to:

expression -> primary | unop expression | expression binop expression
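
For example (assuming "primary" can derive a plain identifier), the corrected rule lets a unary expression such as "not b" derive directly:

expression -> unop expression -> not expression -> not primary -> not b

whereas the rule as written would, read literally, require an extra "op" token between the unary operator and its operand.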


I'm not really sure what a string type is because, mathematically, a string consists of letters which are part of an alphabet which are accepted or rejected by a state machine. The entire alphabet might consist of 0 and 1 (and in computing practice that's what everything boils down to eventually). In a language like C, a char is just a convenient abstraction over a byte and a byte is just an abstraction over bits...sometimes 8, sometimes not.

A difficulty arises when, as is often the case, "string" is used to denote human readable text. That this happens is understandable: char is used as an abstraction for a byte, and arrays of char could consequently represent human readable text in the several spoken languages common among programming communities of yore.

The high level abstraction missing from most languages is a text type that expresses the idea of human readability and is free from the ambiguity of string types. Or to put it another way, Unicode strings will always be problematic because the organization of characters of human alphabets is not always entirely logical.


""I'm not really sure what a string type is because, mathematically, a string consists of letters which are part of an alphabet which are accepted or rejected by a state machine.""

A C string is a chunk of memory containing either bytes or words, terminated with a null (byte|word) at the end.

A UTF-8 string uses the same transport, but it is blocked into "codepoints" (groups of 1-4 bytes, with the length dictated by the high bits of the lead byte being set to a specific configuration).
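
A rough sketch of that lead-byte rule in C (the helper name and bit-pattern table here are mine, just for illustration):

    #include <stdio.h>

    /* Length of a UTF-8 sequence, decided by the high bits of its lead byte:
       0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4. */
    static int utf8_len(unsigned char lead)
    {
        if ((lead & 0x80) == 0x00) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return -1;  /* continuation byte or invalid lead byte */
    }

    int main(void)
    {
        const unsigned char s[] = "a\xC3\xA9";  /* "aé": a 1-byte and a 2-byte codepoint */
        for (size_t i = 0; s[i] != 0; ) {
            int n = utf8_len(s[i]);
            if (n < 0) break;                   /* not a valid lead byte */
            printf("codepoint of %d byte(s)\n", n);
            i += (size_t)n;
        }
        return 0;
    }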

A UTF-8 "Grapheme" (The actual character that gets printed) is a number of Codepoints that are interpreted as being grouped together, as per the NBNF given in the Unicode spec.

All of this takes place in a block of memory of a number of words or bytes.

That's all.

A string type that conforms to C strings is basically just a block of memory of a certain size that happens to contain things that conform to what we would consider graphemes. Some string types keep the block of memory in a struct together with a length, sans the trailing null byte; some string types use lists of blocks, or a list of characters. But the most prevalent is the C string.
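
For concreteness, a tiny sketch of that "block of memory in a struct with a length" flavour next to a plain C string (the struct and function names are made up here, and error handling is omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A length-prefixed string: the size is stored alongside the bytes,
       so no trailing null byte is needed. */
    struct lstring {
        size_t len;
        char  *data;
    };

    static struct lstring lstring_from(const char *s)
    {
        struct lstring r;
        r.len  = strlen(s);       /* for a C string the length is implicit...   */
        r.data = malloc(r.len);   /* ...for an lstring it is stored explicitly  */
        memcpy(r.data, s, r.len);
        return r;
    }

    int main(void)
    {
        const char *cs = "hello";              /* C string: null-terminated block */
        struct lstring ls = lstring_from(cs);  /* struct with a length, no null   */
        printf("%zu bytes: %.*s\n", ls.len, (int)ls.len, ls.data);
        free(ls.data);
        return 0;
    }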


I agree and apologize for not being clearer.

What I was getting at is that, as a type within a type system and independent of a particular language, it is not clear what the words "string type" mean. As you point out, in C, the string type is an abstraction over a contiguous memory block and allows addressing by bytes. Addressing blocks of memory by bytes is often convenient when processing the values stored in the block with a state machine.

UTF-8 strings exist because the string type as an abstraction over bytes fed into a state machine became conflated with the idea that strings had an intrinsic relationship to human readable text.

Erlang is a programming language that does not conflate strings with human readable text. Perl6 is a language that handles human readable text better than average.


Well, if you want a completely general definition of what a string type is, you could get one by following an axiomatic approach and considering the set of (abstract) operations, or functions, that are allowed to be done on strings: concatenation (a binary operation with the neutral element called "the empty string"), taking a substring, having a string-to-integer map "length", etc. From this general perspective, it would be incorrect to say, for example, that a string consists of "individual characters" - instead you might say that the smallest substrings of any (non-empty) string all have the length 1; in this sense the notion of a "character" does not even appear in the abstract definition of the string...
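
If it helps, here is roughly what that abstract view could look like as a C interface; every name below is invented for illustration, and nothing about the underlying representation is pinned down:

    #include <stddef.h>

    typedef struct str str;  /* opaque: the representation is not part of the axioms */

    str   *str_empty(void);                           /* the neutral element              */
    str   *str_concat(const str *a, const str *b);    /* the associative binary operation */
    str   *str_substr(const str *s, size_t pos, size_t len);
    size_t str_length(const str *s);                  /* the string-to-integer map "length" */

    /* Expected laws, informally:
         str_length(str_empty()) == 0
         str_concat(s, str_empty()) behaves like s (and likewise on the left)
         str_length(str_concat(a, b)) == str_length(a) + str_length(b) */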


I don't disagree, and I should have said 'symbols.'

After I wrote what I wrote I went for a run, and I suppose another way of making the distinction I am making between strings and text is that a machine can necessarily decide what is and is not a valid string, and a machine cannot decide what is and is not a valid text (unless we conflate strings and text).

Or to put it another way, the recursively enumerable languages are closed under union, intersection, concatenation and Kleene star. That's the realm of strings. Text is not closed under those operations, e.g. 'hello*' is not guaranteed to produce a valid text.


You are right: a text is not a string - because it is subject to completely different axioms (the "grammar"). I guess this is similar to, say, groups and sets in mathematics being different categories, and even though the usual naive definition of groups is based on sets, this is only possible because of the existence of the "forgetful functor"... In fact, in computing, a string is not considered a (valid representation of a) text until it has been successfully parsed and, often, translated into its "true" representation which is not a string at all!


> a chunk of memory ... basically just a block of memory

Or, more accurately, a string in C is an array (of integers of some size, which depends on the chosen encoding of characters).

(What is an "array" in C is something of a separate topic. Interestingly, due to alignment, it may or may not be considered "just a block of memory", because it could have gaps inside it, i.e. pieces of memory that do not contain data that belong to the array.)


Does it provide interesting advantages over "transpiling" to C?


This is my question as well; it seems like a boatload more work than writing C to a file and calling gcc on it.

One thing I can think of: if the language you're implementing has exceptions, you'd have to implement these the naive way with setjmp/longjmp if you target C. Supposedly, if you used gcc in the way this article shows, you could use gcc to generate "real" exceptions.
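
A minimal sketch of that naive setjmp/longjmp style, with a single global handler and no nesting or cleanup (all names here are made up):

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf handler;          /* where the "catch" lives */

    static void throw_error(int code)
    {
        longjmp(handler, code);      /* unwind straight back to the setjmp */
    }

    static void might_fail(int x)
    {
        if (x < 0)
            throw_error(42);
        printf("ok: %d\n", x);
    }

    int main(void)
    {
        int code = setjmp(handler);  /* returns 0 first, the thrown code later */
        if (code != 0) {
            printf("caught %d\n", code);
            return 1;
        }
        might_fail(1);
        might_fail(-1);              /* jumps back to the setjmp above */
        return 0;
    }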


If the author wants to show how to write a front end targeting GCC, then writing a translator to C would not do that.

It is an educational exercise.


This is so ugly.

> The next step is telling GCC configure that we are going to build GCC with tiny support. This will fail. Do not worry, this is expected.

Back in the 2.9 - 3.8 LLVM days there was a lot of API breakage between releases. A lot of people I knew who worked with LLVM detested this, but would you rather deal with this kind of cruft?


Wait, so has it stopped breaking? Because their breakage essentially killed the LLVM-based JIT compiler project for GNU Octave. Someone wrote it based on the C++ "API" and we spent time trying to keep up with it and then we gave up. Nobody really understood it well enough to rewrite it for the C API or knew if it was even possible, so we abandoned it.


The changes in the LLVM JIT were definitely a disappointment. IIRC it used to be a lot more lightweight and the API was simpler to use.

But in hindsight it's hard to argue that the LLVM devs shouldn't have messed with it. Before, you couldn't get symbolized stack traces or use C++ exceptions. LLVM's primary goal has always been to produce the best code, and not necessarily to try to be the fastest at doing it [1].

In the 3.x days, it was a fast-evolving project. I remember projects I managed that used LLVM would no longer compile between releases because, e.g., they would rename a header file.

"Just to rename a header file??" one might say; it's a subjective question whether this philosophy hurt or helped the project in the end. It might take years until that answer is forthcoming.

I'm sorry to hear about what happened with GNU Octave. Maybe after LLVM proves to have stabilized, someone new will write up the code :)

[1] Tangentially related: https://bellard.org/tcc/

    Compiler         Time(s)   lines/second   MBytes/second
    TinyCC 0.9.22    2.27      859000         29.6
    GCC 3.2 -O0      20.0      98000          3.4


> But in hindsight it's hard to argue that the LLVM devs shouldn't have messed with it.

Not that hard; software stability is a very desirable property:

http://stevelosh.com/blog/2012/04/volatile-software/


The LLVM C API is even worse. First, it is tied to the changing 3.x C++ API, and it's incomplete. You can only do symbol search via C++.



