I appreciate the series of articles on Tiny and think they are fine, but I have ...

brudgers · on Aug 24, 2017

I'm not really sure what a string type is because, mathematically, a string consists of letters which are part of an alphabet which are accepted or rejected by a state machine. The entire alphabet might consist of 0 and 1 (and in computing practice that's what everything boils down to eventually). In a language like C, a char is just a convenient abstraction over a byte and a byte is just an abstraction over bits...sometimes 8, sometimes not.
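
To make the "alphabet plus state machine" view concrete, here is a minimal sketch. The particular machine (it accepts binary strings with an even number of 1s) and all names in it are made up for illustration.

```python
# Hypothetical example: a deterministic finite automaton over the alphabet
# {"0", "1"} that accepts exactly the strings with an even number of "1"s.
ALPHABET = {"0", "1"}
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START = "even"
ACCEPTING = {"even"}


def accepts(string: str) -> bool:
    """Run the machine over the string; reject symbols outside the alphabet."""
    state = START
    for symbol in string:
        if symbol not in ALPHABET:
            return False
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


print(accepts("1001"))  # True: two 1s
print(accepts("1011"))  # False: three 1s
print(accepts(""))      # True: the empty string has zero 1s
```

Nothing in the machine knows or cares whether the symbols mean anything to a human reader.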

A difficulty arises when, as is often the case, "string" is used to denote human readable text. That this happens is understandable due to the use of char as an abstraction for byte and consequently arrays of char being able to represent human readable text in the several spoken languages common among programming communities of yore.

The high level abstraction missing from most languages is a text type that expresses the idea of human readability and is free from the ambiguity of string types. Or to put it another way, Unicode strings will always be problematic because the organization of characters of human alphabets is not always entirely logical.

fao_ · on Aug 24, 2017

""I'm not really sure what a string type is because, mathematically, a string consists of letters which are part of an alphabet which are accepted or rejected by a state machine.""

A C string is a chunk of memory containing either bytes or words, that is terminated with a null (byte|word) at the end.

A UTF-8 string uses the same transport, but it is blocked into "Codepoints" (groups of 1 to 4 bytes, with the length of each group dictated by the high bits of its first byte being set to a specific configuration).
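
A rough sketch of that blocking, assuming well-formed input. The function names are hypothetical, and the code only inspects lead bytes; it does not validate continuation bytes or reject overlong encodings.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting at lead_byte occupies."""
    if lead_byte & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: ASCII range
    if lead_byte & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if lead_byte & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx
    if lead_byte & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")


def split_code_points(data: bytes) -> list:
    """Cut a UTF-8 byte string into one chunk of bytes per code point."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# 'a', U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 bytes respectively.
encoded = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print([len(chunk) for chunk in split_code_points(encoded)])  # [1, 2, 3, 4]
```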

A Unicode "Grapheme" (the actual character that gets printed) is a number of Codepoints that are interpreted as being grouped together, as per the segmentation rules given in the Unicode spec (UAX #29).
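
The three levels (bytes, code points, graphemes) can give three different lengths for the same string. The sketch below is a deliberate simplification: `rough_clusters` is a hypothetical helper that only attaches combining marks to the preceding base character, which is far less than the full Unicode segmentation rules cover.

```python
import unicodedata


def rough_clusters(text: str) -> list:
    """Group each base code point with the combining marks that follow it.

    Simplified approximation only; it ignores emoji sequences, Hangul
    syllables and the other cases the Unicode rules handle.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


s = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(s.encode("utf-8")))  # 4 bytes
print(len(s))                  # 3 code points
print(len(rough_clusters(s)))  # 2 printed characters
```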

All of this takes place in a block of memory of a number of words or bytes.

That's all.

A string type that conforms to C Strings is basically just a block of memory of a certain size that happens to have things in it that conform to what we would consider graphemes to be. Some String types have the block of memory in a struct with a number and sans the last null byte, some string types use blocks of lists, or a list of characters. But the most prevalent is the C String.
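
As a sketch of the first two layouts (null-terminated versus a length stored next to the bytes), written in Python over raw bytes rather than C. The function names and the 4-byte little-endian length field are assumptions chosen for illustration, not any particular language's actual representation.

```python
import struct


def to_c_string(text: str) -> bytes:
    """Null-terminated layout: the payload bytes followed by a single 0 byte."""
    payload = text.encode("utf-8")
    if 0 in payload:
        raise ValueError("an embedded NUL cannot be represented in this layout")
    return payload + b"\x00"


def c_strlen(buf: bytes) -> int:
    """Like C's strlen: count the bytes before the first 0 byte."""
    return buf.index(0)


def to_length_prefixed(text: str) -> bytes:
    """Length-prefixed layout: a 4-byte count, then the payload, no terminator."""
    payload = text.encode("utf-8")
    return struct.pack("<I", len(payload)) + payload


def from_length_prefixed(buf: bytes) -> str:
    """Read the count first, then exactly that many payload bytes."""
    (n,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + n].decode("utf-8")


print(to_c_string("hi"))            # b'hi\x00'
print(c_strlen(to_c_string("hi")))  # 2
print(to_length_prefixed("hi"))     # b'\x02\x00\x00\x00hi'
```

The trade-off is the usual one: the terminated form has to scan to find its length and cannot hold a 0 byte, while the prefixed form knows its length up front.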

brudgers · on Aug 24, 2017

I agree and apologize for not being clearer.

What I was getting at is that as a Type within a type system and independent of a particular language, it is not clear what the words "string type" mean. As you point out, in C, the string type is an abstraction over a contiguous memory block and allows addressing by bytes. Addressing blocks of memory by bytes is often convenient when processing the values stored in the block with a state machine.

UTF-8 strings exist because the string type as an abstraction over bytes fed into a state machine became conflated with the idea that strings had an intrinsic relationship to human readable text.

Erlang is a programming language that does not conflate strings with human readable text. Perl6 is a language that handles human readable text better than average.

Koshkin · on Aug 24, 2017

Well, if you want a completely general definition of what a string type is, you could get one by following an axiomatic approach and considering the set of (abstract) operations, or functions, that are allowed to be done on strings: concatenation (a binary operation with the neutral element called "the empty string"), taking a substring, having a string-to-integer map "length", etc. From this general perspective, it would be incorrect to say, for example, that a string consists of "individual characters" - instead you might say that the smallest substrings of any (non-empty) string all have the length 1; in this sense the notion of a "character" does not even appear in the abstract definition of the string...
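
A small sketch of those axioms, checked against Python's built-in `str` as one possible model. The function name is hypothetical, and passing for a few inputs is of course a spot check, not a proof.

```python
def check_string_axioms(a: str, b: str, c: str) -> bool:
    """Spot-check the abstract string operations on three sample values."""
    empty = ""
    # Concatenation is associative.
    associative = (a + b) + c == a + (b + c)
    # The empty string is the neutral element of concatenation.
    identity = empty + a == a == a + empty
    # "length" maps concatenation to integer addition.
    length_adds = len(a + b) == len(a) + len(b) and len(empty) == 0
    # The smallest non-empty substrings all have length 1.
    smallest = all(len(a[i:i + 1]) == 1 for i in range(len(a)))
    return associative and identity and length_adds and smallest


print(check_string_axioms("ab", "", "cde"))  # True
```

Note that the check never mentions a "character" type; it only uses concatenation, substring and length.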

brudgers · on Aug 25, 2017

I don't disagree, and I should have said 'symbols.'

After I wrote what I wrote I went for a run and I suppose another way of making the distinction I am making between strings and text is that a machine can necessarily decide what is and is not a valid string and a machine cannot decide what is and is not a valid text (unless we conflate strings and text).

Or to put it another way, the recursively enumerable languages are closed under union, intersection, concatenation and Kleene star. That's the realm of strings. Text is not closed under those operations, e.g. 'hello*' is not guaranteed to produce a valid text.
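
A quick illustration using Python's `re` module, reading 'hello*' as a regular expression ('hell' followed by zero or more 'o'). The helper name is made up.

```python
import re

PATTERN = re.compile(r"hello*")


def in_language(candidate: str) -> bool:
    """Decide membership in the language generated by 'hello*'."""
    return PATTERN.fullmatch(candidate) is not None


for candidate in ["hell", "hello", "hellooooo", "help"]:
    print(candidate, in_language(candidate))
# hell True
# hello True
# hellooooo True
# help False
```

The machine decides membership without trouble, but it has no opinion on whether "hellooooo" counts as text.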

Koshkin · on Aug 25, 2017

You are right: a text is not a string, because it is subject to completely different axioms (the "grammar"). I guess this is similar to, say, groups and sets being different categories in mathematics: even though the usual naive definition of groups is based on sets, this is only possible because of the existence of the "forgetful functor"... In fact, in computing, a string is not considered a (valid representation of a) text until it has been successfully parsed and, often, translated into its "true" representation which is not a string at all!
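
One everyday instance of that last point, sketched with JSON as the grammar (my choice of example; `parse_text` is a hypothetical wrapper around the standard `json.loads`).

```python
import json


def parse_text(raw: str):
    """Return (True, value) if raw is valid JSON, otherwise (False, None)."""
    try:
        return True, json.loads(raw)
    except json.JSONDecodeError:
        return False, None


# Both inputs are perfectly good strings; only one is a valid JSON text.
print(parse_text('{"a": [1, 2]}'))  # (True, {'a': [1, 2]})
print(parse_text('{"a": [1, 2'))    # (False, None)
```

After parsing, the "true" representation is a dict holding a list of integers, not a string.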

Koshkin · on Aug 24, 2017

> a chunk of memory ... basically just a block of memory

Or, more accurately, a string in C is an array (of integers of some size which depends on the chosen encoding of characters).

(What is an "array" in C is something of a separate topic. Interestingly, due to alignment, it may or may not be considered "just a block of memory": the elements themselves are laid out contiguously, but when the elements are structs, padding can leave gaps inside it, i.e. pieces of memory that do not contain data that belong to the array.)
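
One way to see where such gaps can come from, using Python's `ctypes` to mirror C layout rules. In C the elements of an array sit back to back, but a struct element can carry padding of its own. The `Item` type is hypothetical and the exact sizes are platform dependent.

```python
import ctypes


class Item(ctypes.Structure):
    # Hypothetical element type: one char followed by a 32-bit integer.
    _fields_ = [("tag", ctypes.c_char), ("value", ctypes.c_int32)]


ItemArray = Item * 3    # mirrors `struct Item items[3];` in C
CharArray = ctypes.c_char * 6  # mirrors `char s[6];` in C

# Bytes of actual data in one element: 1 for the char plus 4 for the integer.
payload = ctypes.sizeof(ctypes.c_char) + ctypes.sizeof(ctypes.c_int32)

print(payload)                # 5
print(ctypes.sizeof(Item))    # platform dependent; larger than 5 where padding is added
print(ctypes.sizeof(ItemArray) == 3 * ctypes.sizeof(Item))  # True
print(ctypes.sizeof(CharArray))  # 6
```

So an array of char, which is what a C string is, has no gaps; the padding only appears for element types that need it.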