Cue, an open-source data validation language (cuelang.org)
190 points by benhoyt on June 15, 2021 | 71 comments



Some context:

Cue is a project originally started by Marcel van Lohuizen who previously was part of BCL (Borg Config Lang) at Google. The main use is to generate config files.

See the Kubernetes examples at: https://cuelang.org/docs/tutorials/

Here are two posts discussing the motivations for Cue over BCL/Jsonnet:

- https://github.com/cuelang/cue/issues/33#issuecomment-483615...

- https://github.com/cuelang/cue/discussions/669

A very interesting development is that Grafana appears to be adopting Cue as a first-class configuration option. See: "Bring new CUE-based config schema system to release-readiness" https://github.com/grafana/grafana/issues/33139

This could mean that Grafana dashboards will eventually be able to two-way sync with a git repo.

----

Other tools with some industry adoption in the "Infrastructure as Code" space include

- Dhall

- Jsonnet (from BCL)

- kustomize

- Helm

- kubecfg

- Tanka

- SkyCfg

- jkcfg

- Krane

- HCL (Terraform)

And two tools that fall into a separate class of enabling "Infrastructure as Software"

- Pulumi (TypeScript/Go/Python/.NET)

- CDK


re: Grafana (i'm the author of the linked issue) - i'm quite excited, i do think there's a world of possibilities here.

Two-way sync with a git repo is one possible path, and we've talked a lot internally at GL about how to best support it. My sense is that we can do it with relatively little friction and likely will - but if you're just syncing with a git repo, there are still a lot of arbitrary, opaque repo layout decisions that have to be made (how do you map a filesystem position for a dashboard to a position in Grafana? In a way that places the dashboards next to the systems they're intended to observe? With many teams? With many Grafana instances?) which induce new kinds of friction at scale.

Fortunately - and not mutually exclusively with the above - by building the system for schema in CUE, we've made a composable thing that we can make into larger systems. That's what we're starting to do with Polly: https://github.com/pollypkg/polly

Conveniently, my part of a GrafanaCONline talk tomorrow discusses both of these: https://grafana.com/go/grafanaconline/2021/dashboards-as-cod... :D


This seems really exciting. I haven't had the chance to use Grafana yet; from the linked issue, am I understanding correctly that you'll be able to serialize dashboards to Cue schema, and hence get all the niceties of a structured representation - versioning, non-visual editing, and reproducibility?

I recall seeing another project on HN which created dashboards out of a yaml description. This seems like a fantastic idea, given that a lot of business panels and dashboard apps can be implemented with a limited set of UI interactions.


We've been building dashboards in YAML for a while now using Lowdefy - https://github.com/lowdefy/lowdefy

It is very flexible. Will publish more examples soon!


Yep, this was the one! It looks great though I hadn't got around to trying it. Do you think something like Cue makes for a better representation for lowdefy apps than YAML, since it seems to offer better abstraction ability and hence easier to compose?


It's an interesting thought. I'll definitely spend some time to consider how this could work.

Although, writing apps in yaml or json works really well currently. We mostly express logic through Lowdefy operators. We like supporting yaml / json since it is easy to write code to create or update such apps in any language.


> you'll be able to serialize dashboards to Cue schema

It'll be possible to serialize/represent the dashboards in CUE. Here's a handwavy, pseudocode-y example i use in the talk: https://gist.github.com/sdboyer/d76196f94ca78d1e84c739e95e64...

That said, it's not like we're planning on replacing all the "export JSON" buttons in the Grafana UI with "export CUE." One of the interesting properties of defining schemas in CUE is how it allows us to remove schema-defined default values from a dashboard's JSON. The JSON representation can actually look a lot more like a concise CUE representation.
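
The default-stripping idea can be illustrated with a minimal Python sketch. This is not Grafana's implementation: the `DEFAULTS` dict and every field name are hypothetical stand-ins for defaults that would really come from the CUE schema.

```python
# Hypothetical defaults standing in for values a schema would define.
DEFAULTS = {"editable": True, "refresh": "5s", "panel": {"type": "graph", "span": 12}}


def strip_defaults(doc, defaults):
    """Return a copy of doc without the fields that equal their schema default."""
    out = {}
    for key, value in doc.items():
        default = defaults.get(key)
        if isinstance(value, dict) and isinstance(default, dict):
            nested = strip_defaults(value, default)
            if nested:  # keep the nested object only if something non-default remains
                out[key] = nested
        elif key not in defaults or value != default:
            out[key] = value
    return out


def apply_defaults(doc, defaults):
    """Inverse operation: fill in every field the concise document leaves out."""
    out = dict(doc)
    for key, default in defaults.items():
        if isinstance(default, dict):
            out[key] = apply_defaults(out.get(key, {}), default)
        else:
            out.setdefault(key, default)
    return out


full = {"title": "CPU", "editable": True, "refresh": "30s",
        "panel": {"type": "graph", "span": 6}}
concise = strip_defaults(full, DEFAULTS)
restored = apply_defaults(concise, DEFAULTS)
```

Here `concise` keeps only `title`, `refresh` and `panel.span`, and `restored` equals `full` again, which is the property that lets the stored JSON stay short without losing information.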

> versioning

Versioning of the Grafana schema is the essential design goal of the "scuemata" system that is under discussion in that epic issue

Versioning of artifacts that are instances of the schema is a key goal with Polly https://docs.google.com/document/d/1GU0DGy-X6z4FVwbJYPsBKRdq...

> non-visual editing

Like, editing something other than raw code in an editor? Yes, this is also something directly enabled by the schema (again, see the Polly doc, the "Produce" heading). For data-intensive tasks, such editing experiences are the only way to see your logic in the context of data, and therefore IMO a prerequisite for confidence.

> reproducibility

Yup. This already isn't "hard" to do today, but reproducibility gets more complicated at scale - those questions about how to map what's on disk to what's in your Grafana (or whatever app) in my parent comment become more complicated, leading to friction, leading to staleness.


Thank you for the answer! Think I missed the grafana related context, but the design document is really instructive.


It might sound a bit pedantic, but kustomize strictly avoids the "Infrastructure as Code" space and stays in the "Infrastructure as Data" space. The main difference is that since it just deals with "data", you can build any higher level tooling on this. One of the major proponents of this idea is Brian Grant from Google. He tweets about this from time to time. Here is a recent one: https://twitter.com/bgrant0607/status/1404461906186833927


Is this distinction really about whether the customisation language is declarative? It seems to me that Dhall has the advantages Brian Grant attributes to "Infrastructure as Data", although it is an executable specification.


I also think “executable” is more important. Our configs are so large and complex that we need to DRY them up. A type system alone is insufficient.


Thanks for the "why cue" posts. The two key points appear to be inheritance vs. unification and nothing vs. typed. Somehow I'm unable to grok why unification is better than inheritance. Going a bit deeper:

* "Inheritance, is not commutative and idempotent in the general case"

* "A value is always final in CUE, it can only be made more specific."

From an engineering perspective, the latter is definitely more appealing. But I lack well-articulated stories to understand how inheritance falls short, and how graph unification fares better. I wonder if there is a simple concrete example somewhere that contrasts the non-idempotent inheritance approach with the graph unification approach.


I believe they're discussing commutation and idempotency in the sense of types, rather than the sense of values.

Inheritance allows you to override properties/attributes. If you inherit from 2 classes that both specify the same attribute/property, but with different types for the same attribute, one of them takes precedence and overrides the other. A inherits from B inherits from C is not the same as A inherits from C inherits from B if C says attribute X is a string and B says attribute X is an int.

From my understanding, the equivalent graph unification is invalid. If type A is a unification of type B and C, then B and C cannot have any overlap. Each property is either a member of B or a member of C, but never both. It's commutative because A = B | C (A is the unification of B and C) is the same as A = C | B (A is the unification of C and B). If x is a member of B, and I access A.x, I will always end up accessing B. With inheritance, there can be a B.x and a C.x. Which one I end up accessing depends on which one is A's parent.

Inheritance is not idempotent because if A inherits from B inherits from C, then A is implicitly also B and C. However, A can override B's and C's behavior, so I can't trust that calling C.x will always return the same value. It might return the type C has for that attribute, it might return the type B has for that attribute or it might return the type A has for that attribute. You can prevent overriding the types in children, but at that point you've basically built graph unification.

To give a concrete example, Python allows inheritance. If we are provided with this:

    import datetime

    class MyCar:
        # Epoch time for when the car was made
        created_at: int

    class MyCarV2(MyCar):
        # Time it was created in RFC3339 format
        created_at: str

    class MyCarV3(MyCarV2):
        # Using an actual datetime object
        created_at: datetime.datetime

And we have a function like this:

    def time_since_created(car: MyCar) -> datetime.timedelta:
        ...

That function has no idea what the type of car.created_at will be. Mypy will complain at you because it's bad practice, but it's valid inheritance. Even if they all start with the same conceptual time, MyCar.created_at, MyCarV2.created_at and MyCarV3.created_at return different types, despite all supposedly being valid instances of MyCar.

Graph unification forces you to pick a single type for each attribute of a single type. Rather than having 3 types that behave differently, graph unification forces you to condense them into one:

    class MyCar:
        created_at: typing.Union[int, str, datetime.datetime]

That time_since_created function now knows exactly what type created_at is. Nothing else can change the type of created_at. If you need to add another possible type you have to either add it to the typing.Union, or create a new class. You can't create a subclass of MyCar with a different type for created_at.
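
The commutativity and idempotency contrast can also be sketched with a toy model. This is only an illustration over flat dicts whose values are type names; `override` and `unify` are hypothetical helpers, not CUE's implementation, and real CUE unification narrows constraints rather than demanding equal values.

```python
def override(parent, child):
    """Inheritance-style merge: the child silently wins on conflicts."""
    return {**parent, **child}


def unify(a, b):
    """Unification-style merge: equal values combine, conflicting values are an error."""
    out = dict(a)
    for key, value in b.items():
        if key in out and out[key] != value:
            raise ValueError(f"conflict on {key!r}: {out[key]!r} vs {value!r}")
        out[key] = value
    return out


b = {"x": "int", "y": "string"}
c = {"x": "string"}
d = {"z": "bool"}

# Overriding depends on the order of the operands, so it is not commutative.
order_matters = override(b, c) != override(c, b)

# Unification gives the same result in either order, and unifying a value
# with itself changes nothing (idempotent).
same_either_way = unify(b, d) == unify(d, b)
idempotent = unify(b, b) == b

# Conflicting definitions are rejected instead of being resolved by precedence.
try:
    unify(b, c)
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```

All four flags end up `True`: the only way to get two different answers for `x` is to go through `override`, where the result depends on which side is treated as the parent.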


Interesting. Vaguely related, I wonder if cuelang has a mechanism to declare disjoint unions, aka. sum types.


cough how can you leave out https://github.com/purpleidea/mgmt/ =D It's in golang, and is the only reactive DSL.


The language is named Go.


This is very interesting! Working through the docs now and I'm enjoying the schema, and I've come up with similar ideas regarding data validation / generation in the past. It's nice to find a project like this! Thanks!

In most projects data validation becomes problematic. In most cases the schema could be a lot more defined than what a type definition offers. This allows for test cases to make sure data fits the model.

We've also been creating a DSL to build web apps. Check out Lowdefy [0] - I'm trying to come up with an "Infrastructure as Code" word for Lowdefy. "UI as config" is the closest fit, but not sure...

[0] - https://github.com/lowdefy/lowdefy


If you’ve ever had to wrangle yaml configuration files… do yourself a favor and learn Cue. It’s still young and the website can seem intimidating; but it’s simpler than it looks, and the language is unbelievably powerful. There simply isn’t anything else like it. In my opinion it’s in a league of its own compared to other configuration languages like HCL, Jsonnet, Dhall, Starlark etc. Marcel, the creator of Cue, is basically the godfather of configuration languages - most of the state of the art can be traced back to his work at Google. Despite his deep knowledge of the subject and unparalleled experience, he is modest, pragmatic and responsive to questions and feedback. The momentum behind Cue reminds me of Go in its early days.

I’ve been using Cue for over a year now, using it as the foundation for a new project, and will gladly answer questions about our experience.


I tried dabbling with Cue, but it doesn't seem to solve the problem that I care about, which is that I have a whole bunch of configs that vary only slightly and I want to DRY them up.

For example, for any given application we have several fixed environments--dev, staging, prod--as well as "on demand" environments for things like pull requests or individual developer environments. The configs for these environments are almost the same, but they vary based on a handful of parameters. I want to be able to write a "generic environment" module for each application and then parameterize it accordingly for each environment.

Cue doesn't seem to care much about this problem, but rather it's just trying to make sure your data is type checked. It seems more like an advanced JSONSchema rather than a typed Starlark. I think the latter would be more powerful (albeit Cue's type system is more powerful than an ordinary generic type system with things like range types).

Cue almost has an answer to the DRY problem, but you can't quite emulate functions as far as I can tell (due, I think, to shadowing problems). I wonder what people who are convinced that Cue is the future would say to this? Am I just thinking about the problem wrong?


I'd say all of these problems have answers in CUE.

> I want to be able to write a "generic environment" module for each application and then parameterize it accordingly for each environment.

This is pretty solidly in the target use case range, i'd say - managing variations of the same "object type" over some dimension is a lot of what's targeted by the way that CUE treats directory hierarchies when loading files: https://cuelang.org/docs/concepts/packages/#instances

The main thing you have to consider in designing a layout is that you have to take a compositional approach to how you define individual config instances. That is, you can't start from prod's config, then override a value or two for staging.

If i were to do it - i have not, this is not how i currently use CUE - my first approach would probably be by defining defaults at the "policy" level (per the above link), which effectively allows you to get exactly one "override"-ish behavior.

Lots of possible approaches to this, though.

> but you can't quite emulate functions as far as I can tell

Function-like capability is present, just in a form that's less familiar. I think of them as "function structs." This post has a bunch of examples https://github.com/cuelang/cue/issues/139#issuecomment-55677.... It seems there's a plan to add a more comfortable notation (https://github.com/cuelang/cue/issues/943), but it's fundamentally possible now.


Consider this example: https://github.com/cuelang/cue/discussions/967

How would you solve this with directory structure and "function structs" respectively? I'm having trouble wrapping my head around the former and ran into shadowing problems with the latter.


I would do it like this:

test.cue:

    #Job: {
        command: string
        args: string
        cli: "\(command) \(args)"
    }

    #GoJob: #Job & {
        command: "go"
    }

    #GoJobV: #GoJob & {
        args: "-v ./..."
    }

    job: #GoJobV.cli

Results in:

  $ cue export test.cue 
  {
      "job": "go -v ./..."
  }


Cue solves the DRY problem using a lattice data structure instead of inheritance. This is precisely why cue is better than everything else.


> I want to be able to write a "generic environment" module for each application and then parameterize it accordingly for each environment.

This specifically looks like you want inheritance, which Cue eschews.

You can set up a generic config with default values for everything, and then have more specific configs that override the defaults.

A concrete example would help figure out if Cue can do what you want.


Here's a simplified example for configuring GitHub Actions workflows: https://github.com/cuelang/cue/discussions/967

FWIW, I'm not looking for "inheritance".


> You can set up a generic config with default values for everything, and then have more specific configs that override the defaults.

I was going to say the same thing - this pattern of having an "override" file in addition to defaults is something I've seen in multiple systems and liked. For example, it's used for JSON configuration in .NET, and with Docker Compose's YAML service configuration files.


A quick overview of CUE for those who are wondering what all the fuss is about: https://bitfieldconsulting.com/golang/cuelang-exciting


This is a fantastic read - well done.

I'd recommend starting with this article as it nicely lays out the motivations for Cue.


I met Marcel van Lohuizen when he was on the board of my previous company. A passionate techie and a down-to-earth guy. He was actually working on cuelang and had not released it yet. After he gave us a presentation on Cue, one of my thoughts was that it is not easy for beginners to grasp, but then the language is not meant for beginners. My second thought, which was completely whack, was that maybe you could use it as an add-on for Protobufs, as the schema definitions in Cuelang have validations built into them, which might remove boilerplate validation code in gRPC services.


If you are looking to do data validation from the JVM, you may try Baleen (written in Kotlin): https://github.com/ShopRunner/baleen/

I'm one of the contributors. We created a DSL in the language to describe the data and create tests. You can then use that data description to validate against json, csv, avro... One of the neat things we came up with was the concept of a data trace which is like a stack trace but is a path through the data to a particular error.
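
To make the "data trace" idea concrete, here is a small Python sketch. It is not Baleen's API (Baleen is Kotlin); the `validate` function and the schema shape are hypothetical, and the only point is that each error carries the path through the data that led to it.

```python
def validate(data, schema, path=()):
    """Yield (path, message) pairs; the path plays the role of a data trace."""
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            yield path, f"expected object, got {type(data).__name__}"
            return
        for key, sub_schema in schema.items():
            if key not in data:
                yield path + (key,), "missing field"
            else:
                yield from validate(data[key], sub_schema, path + (key,))
    elif isinstance(schema, list):
        if not isinstance(data, list):
            yield path, f"expected list, got {type(data).__name__}"
            return
        # A one-element list schema describes every item of the list.
        for index, item in enumerate(data):
            yield from validate(item, schema[0], path + (index,))
    elif not isinstance(data, schema):
        yield path, f"expected {schema.__name__}, got {type(data).__name__}"


schema = {"order": {"id": int, "items": [{"sku": str, "qty": int}]}}
data = {"order": {"id": 7, "items": [{"sku": "A1", "qty": 2},
                                      {"sku": "B2", "qty": "three"}]}}

errors = list(validate(data, schema))
traces = [" -> ".join(str(step) for step in path) for path, _ in errors]
```

The single error found here has the trace `order -> items -> 1 -> qty`, which reads like a stack trace but walks the data instead of the call stack.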


At this point one might consider using a real language and common software practices for type checking, extending, modularization, testing, etc... Instead of building an ecosystem just to keep Infrastructure as Yaml sane.

My experience with Pulumi and AWS CDK is absolutely brilliant in this regard. Hopefully good DevOps/SRE/WhateverNewTerm practices and patterns will resemble good software development practices in the future.


Cue has a unique lattice type system that allows you to refine a property from type->constraint->value, but does not allow you to override an existing value (or change it in any way that conflicts with the existing type/constraints).

In my view, this is the insight and value proposition that sets cue apart from everything else, including general programming languages.

Inheritance + property overriding is the source of most problems in configuration because you can never know if a value is the source of truth.
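
A rough Python sketch of the type->constraint->value refinement, under heavy simplification: `IntField` is a hypothetical class that models one integer property as a range plus an optional concrete value. It is not how CUE is implemented, only a picture of "can be made more specific, can never be overridden".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntField:
    """An integer property: a range constraint plus an optional concrete value."""
    lo: float = float("-inf")
    hi: float = float("inf")
    value: Optional[int] = None

    def unify(self, other: "IntField") -> "IntField":
        # Ranges can only shrink: take the intersection of both constraints.
        lo, hi = max(self.lo, other.lo), min(self.hi, other.hi)
        if lo > hi:
            raise ValueError("constraints do not overlap")
        # Two concrete values must agree; neither side may override the other.
        if self.value is not None and other.value is not None and self.value != other.value:
            raise ValueError(f"conflicting values: {self.value} vs {other.value}")
        value = self.value if self.value is not None else other.value
        if value is not None and not (lo <= value <= hi):
            raise ValueError(f"{value} is outside [{lo}, {hi}]")
        return IntField(lo, hi, value)


port = IntField()                                # type only: any integer
port = port.unify(IntField(lo=1024, hi=65535))   # constraint: narrow the range
port = port.unify(IntField(value=8080))          # value: pick a concrete number

try:
    port.unify(IntField(value=9090))             # attempt to "override" the value
    override_rejected = False
except ValueError:
    override_rejected = True
```

Because a later definition can only narrow what is already there, any value you read is consistent with every other definition of that property, which is the "source of truth" guarantee described above.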


Cannot +1 this hard enough. It is the kernel from which all other useful things flow.


> real language

In what sense is CUE not a real language?


I think parent means a General Purpose language, i.e. capable of computations.

Personally, on one hand I know allowing computations into configuration immediately destroys any hope of having a tidy, rational schema in real-world projects.

On the other hand though, i do believe configuration and code should be built with related tools, possibly the same tool, or at least tools using the same syntax!

(a bit like the json syntax is the same as a Python dict syntax, except this is a terrible example that is so poorly thought out that it does more harm than good)

This unlocks a much greater degree of freedom and power than all the gluing together technologies that we have to do...


If this is the sense in which the comment is intended - CUE is capable of some kinds of computation. It's just not Turing complete.

My 2c - thus far, i've found the language features enabled by this constraint much more useful than the expressiveness lost - at least, for the purposes i've chosen.


I think he means one language that includes config, instead of yaml plus yaml-taming ecosystem.


It looks quite cool. I think it would be really useful if you have a lot of integrations into different programming languages, frameworks, and maybe even SQL servers.

So you could do data validation on the frontend, backend and the database server based on the same definitions.

It would save us a lot of bugs caused by different opinions of valid data in different layers of software.


I really like the language and look forward to seeing more adoption.

Is there a second implementation in another programming language?

As I understand it, it would make sense to have libraries for processing Cue data in every language.

I'm a bit concerned about Cue relying on Go too much. A data validation language should be independent of the implementation language.


Started working on a JS version a few weeks ago [1]. Even with 20% of the features it’s already so useful we’re building systems with it. And not just config - model all the things!

Overrides and inheritance are a world of pain. Unification and commutative operations restore sanity to the actual work of coding with a domain representation language because WYSIWYG. And you get type safety for your domain model.

The project is still at the “Read the Source, Luke” stage so caveat emptor until we get a respectable release out.

[1] https://github.com/rjrodger/aontu


Very excited to see someone doing this! Right now, Grafana is [planned to] relying on an anemic CUE->Typescript translator for getting its schema to the frontend - https://github.com/sdboyer/cuetsy. (Somebody also pointed me to Project Cambria recently, which could be an interesting compilation target for what we have https://www.inkandswitch.com/cambria.html)

Being able to work with CUE natively in TS, though, would be a huge gamechanger for what we can do with CUE in Grafana


Our implementation is in TS. However that's just an API. Do you have some ideas on how you'd like the TS type system to work with the Cue type system? It's an open question for us.


Nice, I was looking for a Javascript version. I will check it out. Regarding "Read the Source, Luke" you are in good company with Apple and its Swift ABI :)


And we kinda leverage the docs on cuelang.org since the syntax and semantics are mostly the same!


Does this make it possible to add syntax highlight and validation in an editor like monaco?


Yeah - we're just cheating and using YAML syntax modes for the moment which mostly works. The plan is to support LSP for much better DX of course.


It's not necessarily easy to see, just from general descriptions of CUE, what's possible.

Dagger was demo'd at the recent Dockercon - gives a bit more of a sense of the possible with CUE: https://docker.events.cube365.net/dockercon-live/2021/conten...


Vector [0] leverages CUE for their documentation [1].

0. https://vector.dev/

1. https://github.com/timberio/vector/blob/master/CONTRIBUTING....


Related hobby project - we've been building cueblox (cueblox.com) as a way to create, consume, and validate data in YAML and Markdown using cue. Cue is very powerful, and it's been fun working on this project.


Can a useful comparison be made between Cue and Clojure.spec?


Philosophically they seem very similar, but at a glance[0] it seems like clojure.spec is quite a bit more expressive. Also spec lives in your REPL session and in your source code, while Cue is meant to be used as a CLI, so there is a completely different approach.

[0] https://cuelang.org/docs/references/spec/


CUE is currently centered around its CLI, but AIUI, that's not the long-term goal. i read in some CUE issue somewhere that the goal is shifting towards enabling frameworks rather than driving people to the CLI, though i don't have a link.

Our use of CUE in Grafana is an example of framework-style usage. It is a hard requirement that users never have to install the CUE CLI to perform any of our planned CUE-related tasks; rather, the needed tooling is baked into Go packages we export, and things like grafana-cli. (Avoiding a dependency on the CUE CLI also gives us a defense mechanism against breaking changes)


CUE is great - it's the first config language I've used that feels like it's getting things right.

I've been playing with some different ways of using CUE:

- https://github.com/cuebernetes/cuebectl (declarative, continuous syncing with a kube cluster)

- https://github.com/ecordell/cuezel (bazel/make but with CUE)

both have some overlap philosophically with the "command" subsystem in CUE, but it's been fun to play with different approaches to the same problems.


The Cue integrated Kubernetes project I'm most excited about is KubeVela[0]. Effectively, you can create an "operator" for just the YAML bits to narrow your Kubernetes API and provide best practices via the Components and Trait overrides, and it should allow platform teams to standardize how their teams are deploying software on large Kubernetes installations.

[0] https://kubevela.io/


Just dropping this here https://www.w3.org/TR/shacl/ Shacl is a language for defining constraints on data. While it is focused on RDF data, it is also possible to use with JSON/CSV data using an RML mapper (so far only PoC on this side).


We use cuelang in big corp in prod, beautiful tool.


This looks interesting, and I’d love to know more, for example what is Cue’s approach to validating sequences?

I’d spend more time trying to answer that question myself but this site wants me to read an awful lot of philosophy before showing me any code. I dislike homepages like this. I’d rather you assume I buy into your philosophy, otherwise I’d leave, so you’re free to just show me what the language is like. When you do that, I’m more interested in examples than EBNF.


https://cuetorials.com/ is probably more hands-on than the homepage which is slightly more theoretical. For validating sequences see https://cuelang.org/docs/tutorials/tour/types/lists/ .


Brilliant, ta! I guess I was hoping for something more like linear temporal logic perhaps. What we’ve found with lots of validation libraries is that doing “if this then expect that” kind of rules on sequences is quite difficult.
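
For illustration, one "if this then expect that" rule over a sequence can be written in plain Python as below. This is neither CUE nor full linear temporal logic; `check_followed_by` and the event names are hypothetical, and the rule checked is only "every trigger is eventually followed by its expected event".

```python
def check_followed_by(events, trigger, expected):
    """Return indices of `trigger` events that no later `expected` event answers."""
    pending = []
    for index, event in enumerate(events):
        if event == trigger:
            pending.append(index)
        elif event == expected and pending:
            pending.pop(0)  # the oldest open trigger is now satisfied
    return pending


ok_log = ["open", "write", "close", "open", "close"]
bad_log = ["open", "write", "close", "open", "write"]

ok_violations = check_followed_by(ok_log, "open", "close")
bad_violations = check_followed_by(bad_log, "open", "close")
```

`ok_violations` is empty, while `bad_violations` points at index 3, the second `open` that is never closed. Rules like this depend on order and position, which is what makes them awkward to express as per-element constraints.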


When I hear data validation I think “csv or excel file generated by other department that I need to import”, how is cue with that sort of thing?

E.g. I want to import a data file and get a report that tells me 20 values in column A are negative, column B contains floats etc.

And can it munge stuff? Column C is stored as floats, but should have been of type int. If all values are whole numbers, accept it.
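
As a point of comparison, the kind of report described here looks roughly like this in plain Python with the standard `csv` module. This says nothing about what CUE can or cannot do; the column names, sample data and helper names are all made up for the sketch.

```python
import csv
import io

# Hypothetical sample file; line 1 is the header.
SAMPLE = """A,B,C
1,2,3.0
-4,5.5,6.0
-7,8,9.5
"""


def report(text):
    """Collect, per check, the file line numbers of the offending rows."""
    negatives, float_rows, non_whole = [], [], []
    for line_number, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        if float(row["A"]) < 0:
            negatives.append(line_number)
        if not float(row["B"]).is_integer():
            float_rows.append(line_number)
        if not float(row["C"]).is_integer():
            non_whole.append(line_number)
    return {"A_negative": negatives, "B_float": float_rows, "C_not_whole": non_whole}


def coerce_to_int(values):
    """Return ints if every value is a whole number, otherwise None."""
    numbers = [float(v) for v in values]
    if all(n.is_integer() for n in numbers):
        return [int(n) for n in numbers]
    return None


findings = report(SAMPLE)
accepted = coerce_to_int(["3.0", "6.0"])
rejected = coerce_to_int(["3.0", "9.5"])
```

`findings` lists lines 3 and 4 for negative values in A, line 3 for a float in B and line 4 for a non-whole value in C. `coerce_to_int` covers the munging case: it accepts a float-typed column as integers only when every value is whole.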


Are there any plans to support model generation in the future, as can be done with JSON Schema through something like https://github.com/quicktype/quicktype ?


Cue definitions export to json schema.


Can't see how, and this seems to suggest this functionality is not implemented yet: https://github.com/cuelang/cue/discussions/663

Mind sharing a reference?


I do it the way they discuss in that issue. Converting from openapi to json schema.


Thanks for the tip.


It looks like it would be great to use locally within a browser, since so much data entry happens there.

But it doesn't seem like that's currently possible. I wonder if they have a WASM port (or similar) on a roadmap.


Try cuelang.org/play - it's exactly that. Seems to be just a demo though.


Cue definitions export to json schema which you can use to validate form input.


Wonder if https://hydra.cc/ would be a better choice


Goldman had a library like this when I was there. I wonder if it ended up catching on.


Haven't taken a detailed look, but can it work on binary data?



