Dasel: Select, put and delete data from JSON, TOML, YAML, XML and CSV (github.com/tomwright)
492 points by edward 5 months ago | 84 comments



I like the idea of using select/put/delete (SQL-style syntax) to query non-RDB data storage. It sort of raises the question: could there be one universal language to query relational databases, text file storage (JSON, CSV, etc.), and anything else?

Or put another way, is there any data storage format that couldn’t be queried by SQL?


That's basically SQL. Many SQL systems have built-in connectivity to a wide variety of data sources.

DuckDB is a good example of a (literally) serverless SQL-based tool for data processing. It is designed to treat the common data serialization formats as though they are tables in a schema [1], and you can export to many of the same formats [2]. With extensions, you can also connect to relational databases as foreign tables [3].

This connectivity is a big reason it has built a pretty avid following in the data science world.
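
For a flavour of it, a minimal sketch (invented file names; recent DuckDB builds bundle the JSON reader, and COPY infers the output format from the extension):

  duckdb -c "SELECT count(*) FROM 'events.csv' WHERE status = 'error'"
  duckdb -c "COPY (SELECT * FROM 'events.json') TO 'events.parquet'"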

[1] https://duckdb.org/docs/data/overview

[2] https://duckdb.org/docs/extensions/json#json-importexport

[3] https://duckdb.org/docs/extensions/postgres


> Or put another way, is there any data storage format that couldn’t be queried by SQL?

Is your SQL Turing-complete? If yes, then it could query anything. Whether or not you'd like the experience is another thing.

Queries are programs. Querying data from a fixed schema is easy. Hell, you could make a "universal query language" by just concatenating this dasel together with SQL and Cypher, so you'd use the relevant facet when querying a specific data source. The real problem starts when your query structure isn't fixed - when what data you need depends on what the data says. When you're dealing with indirection. Once you start doing joins or conditionals or `foo[bar['baz']] if bar.hasProperty('baz') else 42` kinds of indirection, you quickly land in the Turing tarpit[0] - whatever your query language is, some shapes of data will be super painful for it to deal with. Painful, but still possible.

--

[0] - https://en.wikipedia.org/wiki/Turing_tarpit


> It sort of raises the question: could there be one universal language to query relational databases, text file storage (JSON, CSV, etc.), and anything else?

Sure there could be -- any Turing-complete language (which SQL is) can query anything.

But the reason we have different programming languages* is that they have different affordances and make it easy to express certain things at the cost of being less convenient for others. Thus APL/Prolog/Lisp/C/Python can all coexist.

SQL is great for relational databases, but it's like commuting to work in a tank when it comes to key-value stores.

* and of course because programmers love building tools, and a language is the ultimate tool.


sounds like a nightmare logistically. it would be cool though.


> could there be one universal language to query relational databases, text file storage (JSON, CSV, etc.), and anything else?

> Or put another way, is there any data storage format that couldn’t be queried by SQL?

We created PLDB.io (a Programming Language DataBase) and have studied nearly every language ever created and thought about this question a lot.

Yes, there could be one language to query everything, but there will always be a DSL better suited to particular kinds of data. It's sort of like how with a magnifying glass you can magnify anything, but if you want to look at bacteria you're going to want a microscope (and you wouldn't want a microscope to study an elephant).

Now it may turn out that there is 1 universal syntax that works best for everything (I'm sure people can guess what I would say), but I can't think of a case where you wouldn't want to have a DSL with semantics evolved to match a particular domain.


There are a lot of differences between storage formats. It would be incredibly difficult to create a universal query language. It would need to either a) change the storage formats so much that they're not really following their original standard, or b) create so many different versions of the query language that it's not really one standard.

Off the top of my head: SQL can't do lists as values and doesn't have simple key-value storage. JSON doesn't have tables or primary keys / foreign keys, and can have nested data.


SQL has both standard JSON and Array functions. What's the "list as value" feature you think is missing?
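
For instance, in Postgres (a sketch; assumes a reachable database, and the data is made up):

  psql -c "SELECT unnest(ARRAY['123','456']) AS phone"
  psql -c "SELECT json_array_length('[1,2,3]')"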


Yes, but with more awkward syntax than a dedicated tool, and a disparity between JSON and non-JSON types.


> Or put another way, is there any data storage format that couldn’t be queried by SQL?

Depends on how keen you are on pure SQL. For example, Postgres and SQLite have JSON extensions, but they also extend the syntax for it. Similar can be done for all other formats too, but this means you need to learn special syntax and be aware of the storage format for every query. That is far off from a real universal language.
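
To illustrate the per-engine syntax (a sketch; the `doc` column and `events` table are invented):

  # SQLite's JSON1 functions
  sqlite3 app.db "SELECT json_extract(doc, '$.user.name') FROM events"
  # the same lookup with Postgres operators
  psql -c "SELECT doc->'user'->>'name' FROM events"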


From what I understand, SQL is - or at least can be made - Turing-complete, so in that sense you should be able to query any data store using it. However, that doesn't mean it will be efficient to do so.

I suspect for most data structures you could construct an index to make querying faster. But think about querying something like a linked list: it is not going to be too efficient without an index but you should still be able to write an engine that will do so.
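
A sketch of that linked-list case in plain SQL, using a recursive CTE (the `nodes` table layout is invented):

  sqlite3 list.db <<'SQL'
  -- nodes(id, next_id, value): a singly linked list stored as rows
  WITH RECURSIVE walk(id, next_id, value) AS (
    SELECT id, next_id, value FROM nodes WHERE id = 1  -- the head
    UNION ALL
    SELECT n.id, n.next_id, n.value
    FROM nodes n JOIN walk w ON n.id = w.next_id
  )
  SELECT value FROM walk;
  SQL

Without an index on `id`, each hop is a scan - exactly the efficiency caveat above.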

If you have something like a collection of arbitrary JSON objects without a set structure you should still be able to express what you are trying to do with SQL because Turing completeness means it can examine the object structure as well as contents before deciding what to do with it. But your SQL would look more like procedural code than you might be used to.


> It sort of raises the question: could there be one universal language to query relational databases...

Even if SQL and/or another query language could be Turing-complete, that doesn't mean you can have one universal language that performs all possible queries efficiently. In basic computer-science terms, your data structures are tied to the queries you run and the efficiency you want to achieve, so ad-hoc solutions have to be created for specific problems.


Tree and graph structures can be queried using SQL (with more or less difficulty depending on how you have chosen to encode and index them), but it's not a particularly simple and straightforward language to use for such a task.


All data is a graph, and so graph query languages work well for this. SPARQL is my tool of choice, but there's also Cypher and maybe GQL (though that's new).


interesting! is that a mathematically provable statement? or do you mean most data can be represented as a graph? (I'm not asking antagonistically; I'm genuinely interested in the statement you made)


I will not address your question directly, but if you are interested in the mathematical treatment of equivalence between schemas, category theory and the categorical query language (Spivak from MIT, if I remember correctly) may be of interest.


Perhaps the limit is not around formats but around the type system. You may be able to dump data, but can you actually reliably use it for anything?


SAS is good at reading pretty much anything.


XML attributes come to mind


Perz1val, it's me, your grandchild from the distant future. Don't do this. XML goes rogue and destroys humanity.


If entries can be relations themselves, it is not possible afaik. For example:

  User | Telephone Numbers
  -----+------------------
  A    | 123, 456           <- not atomic; more than 1 number (i.e. a set)
  B    | 789
Now there are academic operators to convert to and from a purely relational system, but I don't think they are implemented/in the standard. I forgot what they are called, however.

In general you don't want a universal query language. Depending on the shape of the data, you want different things to be easily expressible. You can, for example, express queries on tree-shaped data with SQL (see the XPath Accelerator), but it is quite cumbersome and its meaning is lost on the reader. That is: it's fine when computer-generated, but there is too much noise for a human to read/write themselves. I'd be glad to be proven wrong here, but as time has shown, there is no one-size-fits-all for programming languages. The requirements of different applications just vary too much.


I like using Nushell for this. It has a `from` builtin for all sorts of formats https://www.nushell.sh/commands/categories/formats.html and after that the data is just tables, which you can query with other builtins and syntax https://www.nushell.sh/book/navigating_structured_data.html
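
For example (a sketch in Nushell, with invented file names):

  # read JSON and inspect it as structured data
  open package.json | get dependencies | columns
  # round-trip a YAML file: change one value and write it back
  open config.yaml | upsert server.port 8080 | save -f config.yaml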


I got really excited when I saw `from_ini` but disappointed when I saw no `to_ini`.

I would really like to find a good workflow for idempotent modifications to INI files, but haven't stumbled across one yet.


I'm abusing `git config -f $PATH` as a poor man's portable INI parser/modifier in shell scripts; it works pretty well but imposes some Git-specific syntax limitations.
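
For anyone curious, that looks like this (real git flags; the file and key are made up):

  git config -f app.ini server.port 8080      # set (idempotent)
  git config -f app.ini --get server.port     # read
  git config -f app.ini --unset server.port   # delete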


Funny enough, the thing that brought this up for me most recently was that some value in `~/.gitconfig` is not idempotent when set through the command line, and I was trying to use it in a shell startup script.


Shameless plug, but if you're a fan of jq-style querying rather than SQL for some reason, you can also use qq[0] for these and a few other formats.

[0] https://github.com/JFryy/qq


jq is such an underrated tool; it is a fully Turing-complete functional language in and of itself. It opens up an ocean of possibilities in combination with scripting languages (Python has bindings), and it is ungodly fast, chewing through tons of JSON like it's nothing.

It is a bit Perl-ish, but being pure and functional, it is a little easier to reason about when you have to revisit your queries.

PS I am certainly bookmarking your tool as well =]

There is also the amusing project jqjq, which implements jq in jq itself, that I love to point folks at to show how expressive the language is: https://github.com/wader/jqjq


Hey! thanks for the jqjq shoutout!


jq is one of those tools that I require an LLM to use


Neat; seems about every quarter or so one of these types of tools is highlighted here.

Awaiting all the responses from people showing off or listing what tool they've landed on for their specific use cases; I always learn a lot from these.


Personally I think this is a problem better solved by fixing the shell. There are a few alt shells out there now: Nushell, Elvish, plus the one I help maintain, Murex (https://murex.rocks).

I'm obviously going to be biased here, but it's definitely worth your time checking out some alt shells.


I did not know about Murex. This actually looks pretty cool! Or at least a definite improvement. How is it as a scripting language in lieu of Bash?

The last shell I was intrigued by was es-shell, which despite being old is still being updated, and which uses functional semantics while still looking like a shell language. I had ChatGPT generate a comparison of all of these with Bash (take with a grain of salt; I already had to correct at least one thing):

https://gist.github.com/pmarreck/b7bd1c270cb77005205bf91f80c...


I'm a bit confused as to the use case. Is it just a way to interact with JSON/YAML-style documents as if they were a structured database, but from the command line? Kind of an in-between for those moments you don't want to write a quick script to batch-modify files?

It looks really well done; I think I'm just failing to see how this is more beneficial than just opening a single file in the editor and making changes, or writing a quick functional script so you have a history of the changes that were made to a batch of files.

If someone could explain how I could (and why I should) add a new tool to my digital toolbelt, I'd greatly appreciate it.


I use jq for this kind of thing several times a week. It's great for piped data - things like running curl to fetch JSON, then piping it through to reformat it in different ways.

Here’s a jq expression I used recently to turn a complete GitHub Issues thread into a single Markdown document:

    curl -s "https://api.github.com/repos/simonw/shot-scraper/issues/1/comments" \
      | jq -r '.[] | "## Comment by \(.user.login) on \(.created_at)\n\n\(.body)\n"'
I use this pattern a lot. Data often comes in slightly the wrong shape - being able to fix that with a one-liner terminal command is really useful.


FYI, if you pipe that to glow by adding `| glow -` at the end, it will pretty-print it in the terminal, if that's something you like.

https://github.com/charmbracelet/glow


PowerShell[0]'s built-in Microsoft.PowerShell.Utility[1] module has `ConvertFrom-*` and `ConvertTo-*` commands for JSON, CSV, a version of XML (CliXML), and custom key/value pairs (StringData), turning them into objects which can then be manipulated. Combined with IO cmdlets from the built-in Microsoft.PowerShell.Management[2] module, such as `Get-Content` and `Set-Content`, a fair chunk of flat-file storage can be made mutable with PowerShell.
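
A minimal round-trip sketch (file and property names invented):

  # parse JSON into objects, tweak a property, write it back
  $cfg = Get-Content config.json -Raw | ConvertFrom-Json
  $cfg.server.port = 8080
  $cfg | ConvertTo-Json -Depth 10 | Set-Content config.json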

---

[0]: https://microsoft.com/powershell

[1]: https://learn.microsoft.com/powershell/module/microsoft.powe...

[2]: https://learn.microsoft.com/powershell/module/microsoft.powe...


Weirdly enough, INI is not that well-supported; but yeah, it's loads better than stuff like curl/jq for the very common JSON, YAML and other structured data cases (which includes near-100% native support for what you are commonly running shell utilities for: ls, ps, and so forth). The best part is that it's integrated into the shell language at the top-level so you can leverage whatever you learn to sort, filter and process all kinds of structured data, and explore and manipulate in a straightforward way. It's very powerful to freely intermix gci, gi, irm output and script together these various sources.

And it's full of delightful tricks, like when you discover that properly-throttled parallelism is easy (`% -Parallel {}`) and you don't need to dive into yet another tool-specific abstruse sublanguage.


I used yq last week to scan through all the Java projects (i.e. Maven pom.xml files) within our org to check which ones inherit from the corporate POM.

  yq eval --input-format xml --output-format csv '[file_index, file_name, .project.parent.groupId, .project.parent.artifactId, .project.parent.version]' **/pom.xml


Which yq? Last time I checked, there seemed to be a few tools with the same name.


The "Go" yq:

https://github.com/mikefarah/yq

| yq is a portable command-line YAML, JSON, XML, CSV, TOML and properties processor


For things that are mostly shell scripts or in a similar family (Ansible playbooks, deployment pipelines, etc.), where you need to modify a structured file quickly, it's usually much faster to use the DSL provided by the tool than to call out to various scripts to extract or modify a single JSON key.

People often say they'd prefer to write their shell scripts in Python or even Go these days, but the problem there is that the elements of structured programming make the overall steps difficult to follow. Typically, for use cases adjacent to shell scripts, the point is being able to see what a script is doing without any sort of abstractions.


This could be useful for CI/CD, where you need to bump a version number in a file based on the build number.
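
With the Go yq (v4) mentioned elsewhere in the thread, that's a one-liner (a sketch; the key path is invented):

  # bump the version in place from a CI-provided build number
  yq -i ".app.version = \"1.2.${BUILD_NUMBER}\"" app.yaml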


For this specific use case? sed would work just as well and probably already exists in your environment.


Having done similar things to this in the past, I rebut: no it wouldn't. :)

It's not trivial to change a particular

version: 2.33

line in a JSON file, when it's possible that string is present in many places with different meanings.

You can just wing it and be right 90% of the time; but that thoughtlessness will bite you in time.

Multiply that by the number of different contexts you might want to make "the same" change in within a project, and sed/awk get to be a poor fit. YAML and JSON are just plain not line-oriented.

You really need something with a feature set like XPath to set the semantically correct node to the right value - and every few years the kids decide they need YET ANOTHER thing that's not XML, or YAML, or JSON, or TOML, or...


Maybe I am being too prescriptive about your setup, but I don't really see "version: 2.33" making sense. Why not check something akin to "version: __BUILD_NUMBER__" into your tree, and have your build pipeline fill it in at build time?

Keeping a copy of the build number checked into your source tree seems like asking for trouble when the build number is so intrinsically tied to a single build pipeline run.


the in-between mode that you mention but seem to dismiss is the way most traditional unixheads work with data most of the time: from the command line

editor? when i pull up emacs, 50% of the time it's to write emacs macros, and i do that because shell scripts don't easily go backward in the stream. (something rarely mentioned about teco was that it was a stream editor that would chew its way forward through files; you didn't need the memory to keep it all in core, and it could go backward within understandable limits)

writing an actual shellscript is only for when it's really hairy, you are going to be repeating it, and/or you need the types of error handling that cloud up the clarity of the commandline

the commandline does provide rudimentary "records" in the saved history


I see the appeal of having a declarative syntax rather than writing a bunch of code to make the change reliably and safely.


one benefit (idk if it applies here) is if the select/put/delete statements didn't require loading the data into memory; you could then query massive data files with limited RAM and not have to solve that problem yourself for each data storage format you're working with


> or writing a quick functional script

It's exactly a quick functional script.


I maintain dasel on Debian [1], and it's such a nice tool! I don't use most of its advanced capabilities, but being able to convert from/to any of those formats with a single tool is very useful.

[1]: https://tracker.debian.org/pkg/dasel


Cool project -- but we need a standardized/spec'd query language in order to realize the goals in the "one tool to rule them all" section of this readme.

I have a hard time internalizing the jq query syntax, and am not overly excited to invest in learning all the quirks when it's not based on a widely-adopted open standard. Maybe `JMESPath` could be the way forward.

Sometimes `gron` can be a pretty great alternative approach, depending on your use case. At least it is very intuitive and plays nicely with other tools.


Ultimately JSON, TOML, YAML, XML, and properties files are tree structures, and an XPath-type syntax should roughly apply to them all, along with about a hundred "path expression" languages (Java had SpEL, Velocity, JSP EL, OGNL, and probably dozens of others).

XPath, although it had some clunky artifacts of XML (namespaces being among the reasons we moved away from XML... ugh), was basically the apex of expression/path/navigation capabilities. It would be really nice to see XPath ported to a general nav language that is supported by all programming environments and handles all the relevant formats.
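
Classic XPath still works well on the command line (a sketch; the document and path are invented):

  xmllint --xpath '//server/port/text()' config.xml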


I still like Xidel[0] for this reason; it may be a little older, but for a CLI scraper, a lot of data transformation needs can be satisfied with XPath/XQuery.

[0]: https://github.com/benibela/xidel


Speaking of trees, gron/ungron is an amazing transformer that lets one use any query tool on the leaves of the tree and then turn the flattened structure back into a document (JSON).
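
e.g. (a sketch with an invented file):

  # flatten to greppable assignments, filter, then rebuild the JSON
  gron api-response.json | grep "author" | gron --ungron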

I'd love to see gron/ungron implemented for all tree structures.

https://github.com/tomnomnom/gron


Jq is far more useful/capable than JMESPath.


Another one for the big list:

https://github.com/dbohdan/structured-text-tools

In fact it's already on it 6 times.


I can't tell you how many times I've cobbled together a tool like this to use in Go. I will be converting to this.

Sometimes we don't actually want to parse YAML; we just want to mutate it without needing to model the underlying objects.

Being able to select, replace, and add data in an existing YAML document is a huge win for automation.


Yes! This is especially powerful when combined with git-xargs to auto-open PRs with the results of the mutation.

I wrote about this a little in https://www.bbkane.com/blog/go-project-notes/#scripting-chan... and it's really helped me keep GitHub workflows and various config files in sync across project repos.


Cool tool that I will have to try to fit into my belt. Probably my loudest "old man" gripe about dealing with data is that XPath was actually quite nice. I always reach for whatever equivalent I can get when dealing with data in any project I'm working on.


I agree - when the data structure is a tree, there is nothing better than XPath.

That's why, when working with JSON, I love jsonata.org.

And it's no surprise it's that good: the creator of JSONata is on the XPath and XQuery committees.


Having recently messed with JMESPath in AWS, I wonder which of these structured data tools:

  - Is easier to learn
  - Has most/best documentation 
  - Is faster to write in
Does anyone know of a good comparison article?

(I still default to jq, I guess it has the momentum)


Is there a JMESPath tool that works on json, yaml, toml, etc?


I don't think so, that's a good point against it.

It's heavily used by AWS and Azure though.


JMESPath works on anything you can parse into a dictionary with Python, right?

It is missing a good "search" ability, though. If you don't know the full path down to the data, good luck.


Try jsonata.org

I have tried jq, JMESPath, and yq, and JSONata was the best for me.



This was really cool to read:

Please open a discussion if:

    You have a question.
    You're not sure how to achieve something with dasel.
    You have an idea but don't quite know how you would like it to work.
    You have achieved something cool with dasel and want to show it off.
    Anything else!

---

I really like dasel.

Can I pipe a .csv to dasel and have it spit it out as JSON? And is that the best way to do that? (Aren't there like a ton of ways to achieve this, or would dasel make it super simple?)
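
(For what it's worth, conversion looks roughly like this - an untested sketch; the read/write flags are from dasel v2 and may differ between versions:)

  printf 'id,name\n1,Tom\n' | dasel -r csv -w json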

Also, what would be interesting would be to be able to pull and scrape text, to put into a structured JSON.

For example - I was talking about using a Discernment Lattice to construct a profile for a PERSON, PLACE, or THING that one was doing research on, such that you can pull multiple sources/data-types for information on [SUBJECT] and have the knowledge dossier updated. Where, for example, one could pull a lot of results that can be summarized by a GPT - then use dasel to grab the relevant component data points, dasel-ize them, and feed them into the Discernment Lattice JSON file, such as I described here:

https://i.imgur.com/vuuAtAL.png

So building out a structured lattice file for a senator would look like:

https://i.imgur.com/68WFiGA.png

So, using a crawlee txtai workflow --> dasel parse --> into lattice file.

Then the lattice file can be used to compare similar slices across all the different [SUBJECTS] -- such that further ties can be made.

So, in this example - we have the data being organized for all the various entanglements a congress person has - and we can use that as a constraint for searching for relations between [subjects] which share elements across ordinarily opaque threads.

The cool thing is that one could then easily scrub and manipulate the data you want to fine-tune the model with/on into a more trainable form - thus creating a hyper-contextually-focused lens - https://i.imgur.com/yngUwpr.png


I hope the preliminary benchmarks listed in the README.md could be implemented in a proof-of-concept scheme rewarding functions that return the correct variable within a blockchain.

https://github.com/sharkdp/hyperfine

Dasel MIT License Copyright (c) 2020 Tom Wright


How often do you have to add single entries to some data file you're working with? For all other cases, Miller and xsv look more powerful.


Often you're adding multiple lines programmatically from a script or cron or whatever.


That's very neat, thanks. I would love to use that instead of dealing with XPath and JSONPath. But are there any client libraries for different languages besides Go? I think having some Python, Java, and C# client libs would help it be used more commonly, as if it were its own standard.


Why are terminal "movies" always GIFs with no video controls that move at the speed of light?


They aren't always. There are sites like asciinema (I hope I've spelt that right). But the problem is that GitHub READMEs are pretty limited in what you can embed. So you either have to link out to another site or embed an animated GIF.


With the syntax being a selling point, I think having a comparison between jq, yq, and dasel would be really useful. Like a few examples of how to accomplish X in each of them, so we can judge for ourselves which syntax is preferable.


Comparison of last example from https://axel.leroy.sh/quick-refs#dasel

  dasel: .data.all().filterOr(moreThan(.quantity,3),equal(.quantity,3)).mapOf(key,key,quantity,quantity)

  yq:    .data[] | select(.quantity >= 3) | {"key": .key, "quantity": .quantity}

  jq:    .data[] | select(.quantity >= 3) | {key, quantity}


I occasionally need to remove PII from JSON payloads before sharing them on Slack. It would be nice if there were some kind of obfuscate option, for either all JSON values or just some, that would replace values with **.
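
jq can get most of the way there today with `walk` (requires jq 1.6+; a sketch with an invented file):

  # replace every leaf value with "**", keeping the structure
  jq 'walk(if type == "object" or type == "array" then . else "**" end)' payload.json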


A similar tool for ini files https://www.pixelbeat.org/programs/crudini/


No HCL support [0] though.

[0]: https://github.com/TomWright/dasel/issues/98


It would be awfully nice to have - or even just a functional hcl2json that you can actually use and that doesn't barf on very common cases. Introspecting Terraform (or another HCL language) is something I've often wanted to do, and support is very poor. Ideally Terraform would actually do this (produce an intermediate parse in JSON or something), but it doesn't. There are a lot of cases where you want to examine such a manifest and derive information from it.


I have 6 petabytes of parquet where I would need to replace 1 value in each line. Could it handle that?


I also have a use case for tweaking bits of a parquet file.


Interesting, but I guess yq already does that, albeit slower according to the README.


I think you need some kind of autocomplete here to make it worthwhile


Nice tool! TSV would be a nice add for logs.



