Unix system programming in OCaml (2014)

themckman · on July 12, 2018

For those interested, some of the underlying libraries that make up Docker for Mac (and, I think, Windows) are written in OCaml (or have components written in OCaml): VPNKit[0], DataKit[1] and HyperKit[2] (qcow2 support is implemented in OCaml).

0: https://github.com/moby/vpnkit

1: https://github.com/moby/datakit

2: https://github.com/moby/hyperkit

pjmlp · on July 12, 2018

They are taken from MirageOS.

https://blog.docker.com/2016/05/docker-unikernels-open-sourc...

dfee · on July 12, 2018

To be fair, GitHub attributes 1.5% of hyperkit’s code to ocaml

djs55 · on July 12, 2018

The 1.5% that's in the hyperkit repo is the shim to the Mirage block layer. The qcow2 implementation that's normally linked in is from https://github.com/mirage/ocaml-qcow .

emmelaich · on July 12, 2018

Ocaml, in comparison to other functional statically typed languages has had many successful Unix applications.

For a great tour of going to Ocaml (from Python), see Thomas Leonard's blog.

A retrospective is here but read the whole lot.

http://roscidus.com/blog/blog/2014/06/06/python-to-ocaml-ret...

cbcoutinho · on July 12, 2018

And related hackernews discussion

https://news.ycombinator.com/item?id=7858276

a0 · on July 12, 2018

This is such a great book. OCaml’s type system feels like a superpower specially in the context of Unix development which is traditionally done in C.

emersion · on July 12, 2018

I'm contributing to a linker written in ~OCaml [1], and now I understand writing systems code in OCaml is really a bad idea. It makes things more complicated when you really don't want them to be. As James Mickens says [2]:

>You can’t just place a LISP book on top of an x86 chip and hope that the hardware learns about lambda calculus by osmosis.

[1]: https://github.com/rems-project/linksem [2]: http://scholar.harvard.edu/files/mickens/files/thenightwatch...

phaer · on July 12, 2018

Can you provide an example of the things, OCaml makes more complicated and explain which languages you compare it to? I guess it's quite different in comparison to C, C++ or maybe Rust than to other garbage-collected 'system programming' languages like Go?

pasabagi · on July 12, 2018

I think he's getting at the distinction between operational and denotational semantics. Presumably, operational semantics are a better fit for systems programming, because if you need to imagine what the computer is actually doing (Linus's comments about why he likes C comes to mind). Ocaml is more in a denotational direction than C - although to be honest, I don't ultimately understand the distinction, given that C relates to an abstract machine that is not all that similar to most machines today.

ernst_klim · on July 12, 2018

But monads are a version of small-stem semantics. And the write linker, so something that operates low-level data, not an actual low-level code, OCaml is a great language for that.

Take a look at BAP, it heavily uses OCaml features and implements a beautiful design:

https://github.com/BinaryAnalysisPlatform/bap

https://binaryanalysisplatform.github.io/bap/api/master/argo...

I'm not convinced that OCaml is unsuitable for such tasks.

pasabagi · on July 12, 2018

I don't know enough to judge. I do feel that people generally (especially functional people) overestimate the overlap between maths and programming - I think maths is just the most useful toolset in programming, but isn't really what programming is about. So I can see why people feel C, with most of its metaphors derived from how (old) machines work, is perhaps a better fit for fundamental stuff than functional languages, that tend to derive their metaphors from maths.

That said, I don't know if this makes sense in a out-of-order, parallel world. I think as an organizational schema, imagining a machine that's doing things works better when that machine is relatively intuitive and simple.

ernst_klim · on July 12, 2018

> maths is just the most useful toolset in programming, but isn't really what programming is about.

It's not about maths, actually. You don't need to understand much math to comprehend monads, you need math only to reason about some properties and formal verification.

So think of monads as of codification of imperative effectful computations in a more general way.

I really like that example:

https://binaryanalysisplatform.github.io/bap/api/master/Mona...

What monad's bind does is just binding a step of computation to the rest of computation, so it is nothing new but a small step (operational) semantics written explicitly. That's why haskell has this do sugar that looks like as an imperative computation. Because it is a first class imperative computation.

Do you want first class features (functions, computations, effects) or features baked into the language is a subjective matter of course.

pasabagi · on July 12, 2018

That's fascinating, actually! Thanks for sharing. I'm not particularly partial to either side of the discussion - I like engineering considerations, and I also like really abstract computer science. I think people tend to specialize and polarize in a way that's more reflective of social structure than technical possibilities, which is a shame.

emersion · on July 12, 2018

Here are a few examples:

- Inferred types: you're constantly wondering "what is the type of this variable?". Type errors occur far from where the real type error is. Annotating is complicated because you're often dealing with complicated types (with nested tuples for instance).

- Using higher-order functions: often you'll see a function with nested functions in it [1], or some functional list processing function like "unzip" being used [2]. This increases the code complexity a lot ("what does unzip do again?").

- Using monads: yet another distraction from what's really happening. It forces you to write nested functions and doesn't play well with conditions [3]. Also various people use monadic operators way weirder than >>= and all of this becomes really hard to read.

- Using functional-friendly data structures: the linker is using immutable lists of optional bytes to represent memory images, because this is idiomatic in OCaml. However this is slow and inefficient as hell. Really, I can't load a 60MiB binary because the linker runs out of memory and is killed by the kernel on machine with 6GiB of RAM. Also, just linking hello-world takes ages.

- Designing functional loops: when the loop you want isn't in your stdlib, you first need to create a recursive function, and then make sure all recursive calls are tail calls otherwise you blow up your stack. The result is usually not very fun to read.

tl;dr: the C language is small, the OCaml language is not, and when you're dealing with complicated stuff you really don't _need_ more complexity from the language itself. James Mickens explains this (among other things) in his essay.

[1]: https://github.com/rems-project/linksem/blob/master/src/link... [2]: https://github.com/rems-project/linksem/blob/master/src/link... [3]: https://github.com/emersion/linksem/blob/54c3a8430e621198653...

(Note: set the syntax highlighting to "OCaml" when trying to read this source code)

EDIT: list formatting EDIT 2: grammar

ernst_klim · on July 12, 2018

>the linker is using immutable lists of optional bytes to represent memory images

Why would you use linked list instead of Bytes, Bigarrays or Arrays? You are doing it wrong, take a look at BAP's memory representation:

https://github.com/BinaryAnalysisPlatform/bap/tree/master/li...

https://binaryanalysisplatform.github.io/bap/api/master/argo...

Looked at your code, damn it's bad, I'd really suggest to dive into BAP sources and documentation a bit to understand how to write OCaml. It feels like you are trying to write in OCaml as in some different language, Haskell maybe, not using modules and functors. Reminds me how J people write C.

You could write `fun x y z -> ...` instead of `fun x -> fun y -> fun z -> ...`, it's also better to write signatures (with docstrings) in separate .mli files (much easier to read and generate docs), if you use monads like maybe (why not result) and option, it would be much more convenient to use bind operators instead of `match ... | None -> None`

emersion · on July 12, 2018

We need optional values for bytes, as a way to say "this byte is undefined", so we can't use Bytes directly.

But yes, we're doing it wrong for sure.

EDIT: seems like you edited your reply, so let me add this to mine :)

>Looked at your code, damn it's bad, I'd really suggest to dive into BAP sources and documentation a bit to understand how to write OCaml. It feels like you are trying to write in OCaml as in some different language, Haskell maybe, not using modules and functors. Reminds me how J people write C.

Oh, I also think it's bad don't worry. However we're using a subset of OCaml (Lem) so we don't have things like modules for instance. And no, I really don't want to start using complicated module features and functors, for the reasons I explained above.

ernst_klim · on July 12, 2018

Well, you could use arrays of optional bytes, but are use sure it's clever?

Maybe it's much better to define an abstract module of type

    module type Mem = sig
      type t
      val get_byte : t -> ~offset:int -> char option  
    end

And invent some clever (maybe lazy) implementation using just Bigarray?

BAP also has something like this in

https://github.com/BinaryAnalysisPlatform/bap/blob/master/li...

You could ask about BAP in general on

https://discuss.ocaml.org/

especially

https://discuss.ocaml.org/u/ivg/summary

(one of the guys behind BAP, very clever and talented in explaining stuff)

dpwm · on July 12, 2018

You can always do something like:

  type optional_bytes = (Bytes.t * Bytes.t)

  let get i (data, mask) =
    match mask.[i] with
    | '\x00' -> None
    | _ -> Some data.[i]

If you need immutability and you're not changing often, you can use strings.

If you're using lists because you're doing all manner of comprehensions, you'd still be better iterating over the Bytes and rebuilding using a Buffer.

ernst_klim · on July 12, 2018

>I really don't want to start using complicated module features and functors, for the reasons I explained above.

And why is that? Because you really don't want to write ML without modules (and you are using them anyway).

Edit: Ah, I see, you have an auto-generated ocaml code, which is used along with the Isabelle. That makes sense (although your code generator is rather strange and there is still no need to use lists for memory representations since Isabelle reasoning about memory chunks shouldn't be that hard). Anyway I wouldn't claim that OCaml is that bad for low level programming basing on such a subset of a language.

dpwm · on July 12, 2018

The code you link to is pretty bad and I really cannot understand why.

[1] OCaml does the currying for you. Why would this ever need to be done?

[2] I have never even thought of putting a let statement in a list before let alone seen one in the wild. Also that looks like a list comprehension, so there is some kind of non-standard syntax going on there.

[3] Monads are a pretty good fit for some things, like parsing.

> the linker is using immutable lists of optional bytes to represent memory images

OCaml has support for bytes in the Bytes module. How do you deal with optional bytes in C? Well, one way is to have a mask array. You could also have a bitmask. You can do the same in OCaml and memory usage will be similar.

I've never worked on anything OCaml that I didn't start myself. However I've dived into mature code-bases including the OCaml compiler and they are generally readable and understandable. The examples you highlight, even if unrepresentative of the code-base, are very difficult to comprehend.

I think a lot of your problems do seem to stem from this code-base and some of the mysterious things look to me like Haskellisms and attempts to make OCaml into something it's not. Sadly there is a lot of OCaml like this out there. Batteries and Core really haven't helped in this regard.

There is definitely an element of understanding OCaml. There isn't much of it to learn, but you need to approach it from the right direction and that can take time. Learn it without Core or Batteries.

Even so, I would be very strongly tempted to clean up that code if at all possible.

emersion · on July 12, 2018

I didn't link the code to say "hey look this code is bad", my intent was just to illustrate that functions inside functions indeed do happen, and unzip is also used, etc.

>OCaml has support for bytes in the Bytes module. How do you deal with optional bytes in C? Well, one way is to have a mask array. You could also have a bitmask. You can do the same in OCaml and memory usage will be similar.

Yeah, that one is probably more because the original authors have written this code in their perfect mathematical world. But this is also one thing I don't like about systems functional programming: it forces you to use abstractions that don't always make sense.

dpwm · on July 12, 2018

> I didn't link the code to say "hey look this code is bad", my intent was just to illustrate that functions inside functions indeed do happen, and unzip is also used, etc.

I appreciate that. Perhaps I was a little harsh on the code. It's using a lot of abstractions and it appears to be part of an interesting project. I'm not sure how necessary those abstractions are to making the project work.

That said, I haven't a clue what unzip is. I'm quite sure it's not standard library. Nor is unzip3. I could define a function foo in my code or find a library that implements it, and have it buried somewhere beneath layers of module imports. It would be unfair to criticise the language for that given it's a problem in pretty much every language.

> But this is also one thing I don't like about systems functional programming: it forces you to use abstractions that don't always make sense.

You're free to decide the abstractions you want. You could do all IO through a monad if you really wanted to. It would be strange but maybe not to a Haskeller. Some things make more sense with mutable state. OCaml gives you mutable state if you need it.

lmm · on July 12, 2018

> That said, I haven't a clue what unzip is. I'm quite sure it's not standard library. Nor is unzip3.

Unzip and unzip3 are pretty well-known and standard - unzip turn a list of pairs into a pair of lists (the first elements of all the pairs and the second elements of all the pairs), unzip3 does the corresponding thing for a list of triples.

I know that and I've never looked at the codebase you linked to or the library those functions were defined in, because those function names are standard across libraries and across languages even. Any helper framework/library will define its own functions, but anything one knows about, say, gevent's functions will only be useful when working with gevent. The great thing about functional programming is that many more functions end up being general-purpose things that are used in every functional codebase.

dpwm · on July 12, 2018

In Haskell, yes, zip and unzip are defined in Prelude. Python has zip(xs) and zip(*xs) as inverses of each other.

In OCaml we have List.combine and List.split that do the same sort of thing as Haskell's zip and unzip.

I haven't a clue what unzip is in this context, because it isn't standard library and the definitions are buried. If you haven't found its definition you don't know either, you're just guessing based on the name and assuming.

The most likely-looking definition was commented out, so I can only imagine it's coming from somewhere else. I have little confidence that it would simply be aliased to List.split. I would also guess it's not simply returning ([], []).

lmm · on July 12, 2018

> I haven't a clue what unzip is in this context, because it isn't standard library and the definitions are buried.

When all we have is a GitHub link, sure. If I was actually working on this code I'd have it in my IDE and click through to the definition.

> If you haven't found its definition you don't know either, you're just guessing based on the name and assuming.

Sure, but I've found that to be a reliable strategy when I see standard function names when working on functional codebases. No-one is going to deliberately give a function a misleading name (I would hope). I expect unzip to behave like standard unzip just as I'd expect an addUser method to add a user.

> The most likely-looking definition was commented out, so I can only imagine it's coming from somewhere else. I have little confidence that it would simply be aliased to List.split. I would also guess it's not simply returning ([], []).

I'd lay money that it behaves like the commented-out one (it may be a generalisation of it, but I'd certainly expect the same behaviour on lists). Of course it's possible for someone to make a mistake or do something misleading, but one advantage of the "weird" functional names for function is that it's unlikely anyone has used them by accident: if someone has called their function "unzip" that's almost certainly because they think it is the standard unzip function.

wtetzner · on July 12, 2018

> If I was actually working on this code I'd have it in my IDE and click through to the definition.

What IDE do you use for OCaml?

ernst_klim · on July 12, 2018

It was more like they are from a strange haskell realm.

Take a look at the ideal mathematical (verified) C compiler written in Coq. I find it neat:

https://github.com/AbsInt/CompCert

lmm · on July 12, 2018

> Inferred types: you're constantly wondering "what is the type of this variable?".

Your tools should be able to tell you that. Can't you just ask your IDE? In cases where it really is confusing you should probably annotate it.

> Annotating is complicated because you're often dealing with complicated types (with nested tuples for instance).

Having types for nested tuples doesn't make them any more complicated than they were already. Sometimes people overuse complex tuple structures when they should just declare a record, but that can happen in any language.

> Using higher-order functions: often you'll see a function with nested functions in it [1], or some functional list processing function like "unzip" being used [2]. This increases the code complexity a lot ("what does unzip do again?").

In any language you would need to do what unzip does. Having a common version of the function that can be reused in a lot of different places is much better for understandability than writing it out longhand every time.

> Using monads: yet another distraction from what's really happening.

They should do just the opposite: let you shift the distractions in the monad and keep the main code expressing the essence of the problem. Some things are better expressed in straight-through code that just does the things, but you can do that in OCaml too.

> It forces you to write nested functions and doesn't play well with conditions [3].

It plays well with conditions that are already expressed as functions/expressions, which are easier to think about in the first place. In good functional style you can completely forget about "control flow" because things like "if" become just an ordinary case of ordinary functions.

> Also various people use monadic operators way weirder than >>= and all of this becomes really hard to read.

Naming things is hard, but at least an OCaml "operator" is always just a normal function, so you can always just click through to the definition of it and read exactly what it does.

> Using functional-friendly data structures: the linker is using immutable lists of optional bytes to represent memory images, because this is idiomatic in OCaml. However this is slow and inefficient as hell.

This sounds like a specific bug rather than a general problem. Certainly there are plenty of datastructures that manage to be both functional-friendly and efficient.

> Designing functional loops: when the loop you want isn't in your stdlib, you first need to create a recursive function, and then make sure all recursive calls are tail calls otherwise you blow up your stack. The result is usually not very fun to read.

Does OCaml not have a recursion-schemes-like library? Then you can just fold down your datastructure (because a "loop" is never really about looping, it's about handling a datastructure) and you already have your intermediate representation and you can just write the essence of what to do at each step and it's really clean and lovely.

> tl;dr: the C language is small, the OCaml language is not

Disagree with that. C has all these different kinds of control flow and all these built-in operators, whereas everything you would want to do in OCaml is just plain old functions and expressions.

kbp · on July 12, 2018

> C has all these different kinds of control flow and all these built-in operators, whereas everything you would want to do in OCaml is just plain old functions and expressions.

OCaml does have for and while loops, to be fair, although they don't support break or continue (you can simulate break by throwing in the loop body).

rezeroed · on July 12, 2018

Aargh. Downvoting is so lazy. I'd be interested to hear both sides flesh this out. (Yes, I know the rule - don't complain about downvoting -- why do you think there has to be a rule about it‽)

dvfjsdhgfv · on July 12, 2018

The rule is ridiculous, but criticizing the rule is also forbidden.

tntn · on July 12, 2018

Only somewhat related,but I've tried to learn ocaml before but have never been able to figure out when to put what kind of delimiters where. Can anyone recommend a resource that explains the usage of the double comma and the like?

laylomo2 · on July 12, 2018

The double semicolon only matters in the REPL (aka the "toplevel").

Off the top of my head, I can think of at least two other situations where punctuation matters:

1) Whenever you have nested match cases:

    match left with
    | Some left ->
      match right with
      | Some right -> Some (left, right)
      | None -> None
    | None -> None

The compiler does not use indentation to figure out scoping, so the above gets treated as:

    match left with
    | Some left ->
      match right with
      | Some right -> Some (left, right)
      | None -> None
      | None -> None

So you need to surround the inner match with parens

    match left with
    | Some left ->
      (
        match right with
        | Some right -> Some (left, right)
        | None -> None
      )
    | None -> None

2) If you are doing imperative things inside an if statement:

Consider the following function:

    let cache_file file =
      if not (file_exists file) then
        printf "downloading file\n";
        download file
      else
        printf "file already exists\n"

The problem here is that OCaml allows one to create an if statement with no else clause. And since the compiler doesn't use indentation to figure out scoping, then the code is essentially treated like the following:

    let cache_file file =
      (if not (file_exists file) then printf "downloading file\n");
      download file;
      else (printf "file already exists\n");

Which is just totally wrong. So in order to do multiple imperative actions within an if block, you should use either parens, or begin/end delimiters (which are treated exactly the same as parens)

    let cache_file file =
      if not (file_exists file) then begin
        printf "downloading file\n";
        download file
      end else
        printf "file already exists\n"

dpwm · on July 12, 2018

There's an interesting thing with the semicolon. I've heard people advocating thinking of the semicolon as a binary operator:

(;) : unit -> unit -> unit

Of course, the reality is more complicated than this, because it can be used at the end of an expression, where it behaves like a postfix operator of type 'a -> 'a. Let's take your example:

    match left with
    | Some left ->
      match right with
      | Some right -> Some (left, right)
      | None -> None
    | None -> None

you can actually do

    match left with
    | Some left ->
      match right with
      | Some right -> Some (left, right)
      | None -> None ; ; (* <-- two separate semicolons *)
    | None -> None

This doesn't just work for imperative stuff, like the operator theory of the semicolon, it will work with this example.

My understanding is that this is due to the fact that optional semicolons are allowed at the end of statements and close the current expression.

So we're closing the first expression (None) and then closing the second expression (the inner match). The double semicolon is something else altogether.

Though it is worth noting that this example is straying from idiomatic OCaml and we can avoid this altogether with better matching:

    match (left, right) with
    | (Some left, Some right) -> Some (left, right)
    | _ -> None

Both the native optimizing compiler and BuckleScript are able to do what you would expect with this: the matched tuple is optimized away.

Your second unfixed example actually causes a syntax error. There is an if expression in OCaml and an if statement. This is particularly unfortunate, because the let cache_file = function | file when file_exists file -> printf "downloading file\n"; download file | _ -> printf "file already exists\n" first semicolon within an if expression turns the whole thing into an if statement, and an if statement can have no else clause. The else clause is then without an if statement, because we've triggered the imperative if statement. This is a syntax error.

were we to define an operator

   let (>>>) a b = a; b

then we could replace that semicolon with an >>> and things behave as expected. But it's really that if can be both a statement and an expression. For this reason I prefer using matches in general.

    let cache_file = function
      | file when not (file_exists file) ->
        printf "downloading file\n";
        download file
      | _ -> printf "file already exists\n"

theaeolist · on July 12, 2018

OCaml has no guaranteed order of evaluation, which is not what you want with (;). It needs to be part of the language as special syntax.

LeonidasXIV · on July 12, 2018

The signature of `;` where it an operator is more like

(;) : unit -> 'a -> 'a

And you can see it as a kind of sequencing operator, since it returns the value of the right hand side unchanged.

thedufer · on July 12, 2018

I can't think of any time a double-comma would be syntactically correct. Double-semicolons, which may be what you're referring to, are never required (although in some situations they will cause syntax errors to appear closer to the actual problem). They can be placed after all top-level value bindings.

Probably your best bet is to follow an intro series so you see some examples (https://dev.realworldocaml.org/ is the best at the moment, I believe), although if you're so inclined you could just read the BNF for the language (https://caml.inria.fr/pub/docs/manual-ocaml/language.html).

richeyryan · on July 12, 2018

Not a direct answer to your question but you could look at Reason. It is a C style syntax that layers on top of the OCaml semantics. Writing in Reason is writing OCaml. However, the new syntax makes it more approachable for people coming from C style languages and removes idiosyncrasies like the double semicolon.

Its still progressing and more focus has been given to getting it up and running in a compile to JS context but it should be equally as capable of systems work, especially as time goes on.

ofrzeta · on July 12, 2018

For some real world applications take a look at libguestfs http://libguestfs.org, a library that can inspect, mount, edit VM filesystem images. Also includes tools for p2v and v2v migration.

rpcope1 · on July 12, 2018

I think one major turn off, especially for systems programming, was the lack of multicore threading support (i.e. it had no way to get parallelism using threads). Does anyone know if this has changed?

_0w8t · on July 12, 2018

One of the fastest web servers is nginx that uses child processes (typically one child per core) and asynchronous IO within the child without any threads. The same model with a parent process that monitor threadless child workers is often used in embedded system. One can and people do exactly the same in OCaml.

Surely it requires more code to setup things especially if the child processes need to communicate over shared memory to max the performance, but the big plus is that the resulting system is much more robust. In case of troubles one can just let a child die or even use kill -9 and start a new child. This is not possible with most threading implementations. For example, try to recover from out-of-memory in multi threaded C++/Java/Go etc. application. It is very hard. With threadless children it is almost trivial.

pjmlp · on July 12, 2018

Plus we have learned the hard way that concurrency and processes are much saner from safety point of view than threads.

atombender · on July 12, 2018

Posted a week ago: http://kcsrk.info/slides/mcocaml_gallium.pdf

Discussion: https://news.ycombinator.com/item?id=17416797

In short, expect it merged in mid-to-late 2019, maybe.

ZirconiumX · on July 12, 2018

While this is an issue, there are cooperative threading libraries (mainly Lwt and Async, though the community prefers Lwt) which can mitigate most of the reasons you'd need multithreading. Hell, because it's only single threaded, you can mostly ignore locking.

StreamBright · on July 12, 2018

Isn't this project aiming to solve that?

http://ocamllabs.io/doc/multicore.html

toolslive · on July 12, 2018

You really don't want to use the Unix module directly. If you're serious about system programming in OCaml, use Lwt or Async to allow for concurrency.

testaccount7 · on July 12, 2018

IIRC in Coders at Work, Brendan Eich talks about hiring a programmer who wrote an OS in OCaml.