Show HN: WSL, a clean text format for relational data

fiatjaf · on Sept 9, 2016

I like structured data that people (and software) can understand.

Better yet if we could find a way for (non-tech) people to write structured data in a clean way. My attempt, so far: https://github.com/fiatjaf/lsd

jstimpfle · on Sept 9, 2016

That syntax is too loose for my taste. WSL's approach is different; it tries to be as strict as possible. Preferably there should be one and only one representation for any given value.

To also be syntactically efficient, it has a per-domain lexical syntax. This would not be possible without a schema.

robochat · on Sept 9, 2016

Looks interesting. It reminds me a bit of the ARFF format. I can't tell whether WSL allows comments, though, which I think would be useful. Also, insisting that no null entries are possible will probably lead to incompatible versions of the format when people are unable to avoid nulls.

One thing that could be nice would be a field to indicate the physical unit of each data type.

jstimpfle · on Sept 9, 2016

Thanks. From a cursory glance it looks like ARFF is more like CSV than WSL. It is still far away from the relational model: only a single table, a fixed set of available data types, and no distinction between domains and representation. It seems more like a language-independent struct description language (it must be like protocol buffers, though I haven't used that either).

Re: comments: they are possible in the schema. For the relational data, comments as second-class citizens don't really make sense IMO, since associated data is typically stored in a rather scattered way in the database (this is a disadvantage compared to hierarchical representations).

The much better approach is to store comments as first-class citizens, like

  % DOMAIN Comment String
  % TABLE Person PersonID PersonName Comment
  Person michael [Michael Jordan] [nicknamed "Air Jordan"]

or if you want to allow multiple comments, or comments are very sparingly used, use a separate PersonComment table.
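To make the row syntax above concrete, here is a minimal sketch of how such a line could be split into a table name and values. It is not the actual WSL library code: `split_row` is a hypothetical helper, and it assumes bare tokens contain no spaces and bracketed strings contain no `]` (whatever escaping rules the real format has are ignored here).

```python
def split_row(line):
    """Split one row into tokens: bare words or [bracketed strings].

    Simplifying assumption: no escape sequences inside brackets.
    """
    tokens = []
    i, n = 0, len(line)
    while i < n:
        if line[i] == " ":
            i += 1  # skip separators
        elif line[i] == "[":
            end = line.index("]", i)  # raises ValueError if unterminated
            tokens.append(line[i + 1:end])
            i = end + 1
        else:
            end = line.find(" ", i)
            if end == -1:
                end = n
            tokens.append(line[i:end])
            i = end
    return tokens


row = split_row('Person michael [Michael Jordan] [nicknamed "Air Jordan"]')
table, values = row[0], row[1:]
```

Here `table` names the relation and `values` holds one token per column of the `% TABLE Person ...` declaration, with the comment as an ordinary third column.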

Re: NULL values: these can relatively easily be modelled by making an auxiliary table as described on the webpage. This leads to better normalization. The drawback is that in this way conceptually associated data is logically separated, and thus harder to edit and housekeep (that could be remedied by relation editors that can edit "views").

Another possibility is to model missing values in each datatype separately as needed. But that's probably a bad idea since the database wouldn't be able to discern such sentinel values from "present" values.
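The auxiliary-table approach can be sketched with plain Python dictionaries. The table names and rows below are made up for illustration, and this is not the WSL API: the main table never has a missing value, and a missing nickname is simply the absence of a row in the auxiliary table.

```python
# Main table: every person has a name, so no NULL is ever needed here.
person = {"michael": "Michael Jordan", "john": "John Doe"}

# Auxiliary table: a row exists only when the nickname is actually known.
person_nickname = {"michael": "Air Jordan"}


def nickname_of(person_id):
    """Return the nickname, or None if no auxiliary row exists."""
    return person_nickname.get(person_id)


# A "view" that rejoins both tables for display or editing.
view = [(pid, name, nickname_of(pid)) for pid, name in sorted(person.items())]
```

The `None` only shows up in the joined view; the stored tables themselves contain no sentinel values, which is the point of the normalization.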

> One thing that could be nice would be a field to indicate the physical unit of each data type.

I'm not sure what you mean by "physical unit". Maybe things like "varchar(3)" in SQL? That would be easily feasible with domain parameterization. Something like

  % DOMAIN CountryCode String length=3

Presently the WSL spec demands that "domain parsers" ("String" in the above line) always return domains of the same internal representation though. Parameterization should only add "value constraints". Depending on interpretation length=3 and length=4 might mean distinct internal representations, so that might be a conflicting idea.
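One way to read "parameterization should only add value constraints" is sketched below. `make_string_domain` is a hypothetical factory, not part of the WSL Python API: whatever the `length` parameter is, the decoded value is always a plain `str`, and the parameter only rejects tokens of the wrong length.

```python
def make_string_domain(length=None):
    """Build a decoder for a String domain with an optional length constraint."""
    def decode(token):
        if length is not None and len(token) != length:
            raise ValueError(f"expected {length} characters, got {token!r}")
        return token  # same internal representation for every parameterization
    return decode


# Roughly what "% DOMAIN CountryCode String length=3" could translate to.
country_code = make_string_domain(length=3)
```

Under this interpretation `length=3` and `length=4` produce values of the same type, so they would not count as distinct internal representations.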

robochat · on Sept 10, 2016

Thanks for your answers. By physical unit, I meant metres, seconds, nanoseconds, kilograms. Since I'm more in the physical sciences, units are important to me. Keeping track of physical units is a kind of provenance. I see now though that WSL is more like a textual relational database rather than improved CSV.

I still stand by my NULL comment, though. I totally understand why you want things to be as you've described (along with the emphasis on not having too many columns per table), but the problem will be other people and how they will inevitably use the format. Null is always problematic, though, and a source of arguments and bugs.

Have you ever seen recutils? It's a similar concept although I have no experience of it, at a glance I prefer WSL.

jstimpfle · on Sept 10, 2016

I would say it's a relational database rather than only an improved CSV, but you can absolutely use it as the latter. While it was meant to model whole databases, I don't think there are any disadvantages if you only have one table. You can even easily make the library parse lines without the table prefix (which is unnecessary if there is only one table). However, that use case is more trivial and might not justify depending on a library.

Physical units are absolutely in scope. Thanks, in fact; I hadn't thought of that. If we can find a reasonable default implementation and syntax, I might add them to the built-ins. (But you can also add them yourself as a user of the Python API.)

Depending on your taste, you could let them have a unit suffix.

  % DOMAIN kilograms Kilograms precision=3 suffix
  % DOMAIN seconds Seconds precision=3 suffix
  % DOMAIN metres Metres precision=2 suffix
  % TABLE Example kilograms seconds metres
  Example 1.0kg 5.042s 4.42m

Or

  % DOMAIN laptime Time infixunits
  % TABLE Laptime laptime
  Laptime 4h05m03s
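As a rough sketch of what decoders for these two notations might look like (hypothetical helpers, not existing WSL built-ins), the suffix form strips a fixed unit and the infix form is matched with a regular expression. The two-digit minute and second fields are an assumption taken from the single example above.

```python
import re
from decimal import Decimal


def parse_suffixed(token, unit):
    """Parse a value like '5.042s' given its expected unit suffix."""
    if not token.endswith(unit):
        raise ValueError(f"{token!r} does not end with unit {unit!r}")
    return Decimal(token[:-len(unit)])


# Assumed layout: hours, then two-digit minutes and two-digit seconds.
LAPTIME = re.compile(r"(\d+)h(\d{2})m(\d{2})s")


def parse_laptime(token):
    """Parse a value like '4h05m03s' into a total number of seconds."""
    match = LAPTIME.fullmatch(token)
    if match is None:
        raise ValueError(f"not a lap time: {token!r}")
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


mass = parse_suffixed("1.0kg", "kg")
lap_seconds = parse_laptime("4h05m03s")
```

`Decimal` is used instead of `float` so that the written precision (for example `5.042`) survives a round trip back to text.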

> Have you ever seen recutils? It's a similar concept although I have no experience of it, at a glance I prefer WSL.

Yes, in fact I gave it a closer look. The scope is different; as the name says it's more record or even hierarchy oriented:

  - Records are written on paragraphs instead of lines; each member on its own line
  - Multi-valued members
  - Field names are explicit in each record

It doesn't really try to be a clean interpretation of the relational model. While it is also a query language and there are some sort of joins, the multi-valued member notation combined with joins seems to quickly lead to cases where it's not clear how to interpret the resulting data. Also, due to the more complex multi-line data format, it's much better suited for consumption by humans than by machines.

WSL instead opts for unnamed records on single lines so that it is very easily consumable by machines as well, and it is designed to enable very easy-to-use reader and writer APIs.

Note that the Python library is rather slow; on my old machine, the 220KB example database needs about 200ms to parse. With a C implementation, I reckon there could be a 10x speedup. However, if you care very much about speed and do not always read in the data completely, sqlite3 or one of the big iron databases is a better fit anyway. WSL trades speed for semantics and plain text representation.