ManagedC: Memory safe execution of C on a JVM [pdf] (chrisseaton.com)
59 points by mike_hearn on Sept 13, 2015 | 22 comments



It is an interesting implementation. Tl;dr: they took a C interpreter for Java (with unsafe memory management) and implemented fat pointers (Java object + offset) in the interpreter.
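To make "fat pointer" concrete, here is a minimal sketch of the idea in plain C. It is not the paper's implementation (which lives inside a Java interpreter); the names `fat_ptr`, `fat_add` and `fat_load` are made up for illustration. The point is only that a pointer becomes a (whole object, offset) pair and that the bounds check happens at dereference time.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical "fat pointer": a reference to a whole allocation plus an
   offset into it. All names here are illustrative, not from the paper. */
typedef struct {
    unsigned char *base;   /* start of the managed object */
    size_t size;           /* size of the object in bytes */
    size_t offset;         /* current position inside the object */
} fat_ptr;

/* Pointer arithmetic only moves the offset; nothing is checked yet. */
static fat_ptr fat_add(fat_ptr p, size_t n) {
    p.offset += n;
    return p;
}

/* Checked load: reports failure instead of reading out of bounds. */
static bool fat_load(fat_ptr p, unsigned char *out) {
    if (p.base == NULL || p.offset >= p.size) {
        return false;
    }
    *out = p.base[p.offset];
    return true;
}
```

With a 4-byte buffer, `fat_load(fat_add(p, 2), &v)` succeeds while `fat_load(fat_add(p, 4), &v)` is rejected, which is roughly the kind of error a managed implementation would turn into a runtime exception.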

The paper claims the implementation obeys C99, but there seems to be a violation around pointer round-tripping. Specifically, they forbid all casts from integers to pointers.

  char a, *p;
  
  p = (char*)(uintptr_t)&a;
In most C implementations, this is perfectly valid. In the paper's implementation, it breaks.

Anyway, very interesting.


They discuss this in section 3.2 of the paper. The C99 standard (in section 6.3.2.3) says that integers can be cast to pointers, but, except in the special case of the integer 0, "the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation". This implementation chooses the last option: if you cast an integer to a pointer, you get a pointer as a result, but one that you can't successfully dereference.


Of course I read that, or else I would have no idea they broke pointer round-tripping.

I believe this violates C99 §7.18.1.4 (if the TruffleC language defines the uintptr_t type in stdint.h):

  The following type designates an unsigned integer type
  with the property that any valid pointer to void can be
  converted to this type, then converted back to pointer to
  void, and the result will compare equal to the original
  pointer:

    uintptr_t
If the TruffleC language does not define uintptr_t ("these types are optional"), then hey, that's fine. A lot of valid code won't compile, though.
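For reference, the guarantee quoted above can be written as a small check. This is just a sketch of what the standard text promises on an implementation that provides `uintptr_t`; the helper name `roundtrip_ok` is made up.

```c
#include <stdint.h>
#include <stdbool.h>

/* Round-trips a pointer to void through uintptr_t, the conversion that
   C99 7.18.1.4 talks about. Returns true when the pointer that comes
   back compares equal to the original. */
static bool roundtrip_ok(void *original) {
    uintptr_t bits = (uintptr_t)original;  /* pointer -> integer */
    void *back = (void *)bits;             /* integer -> pointer */
    return back == original;
}
```

On a conventional compiler `roundtrip_ok(&a)` holds for any object `a`. Whether an implementation that rejects integer-to-pointer casts can still offer `uintptr_t` is exactly the question here.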


Is uintptr_t widely used? I don't recall ever seeing it before.


It's really useful if you want a type to which you can cast any (non-float) primitive value and know that you can cast it back to the same value you had before. I've used it as the cell type for a Forth interpreter, for example (where cells have to be able to represent addresses as well as integers). It's probably not particularly useful for general applications programming.
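A rough sketch of what that looks like, assuming a tiny fixed-size data stack. The names (`cell`, `push`, `pop`, `fetch`) are illustrative and not taken from any particular Forth; there are no overflow or underflow checks.

```c
#include <stdint.h>

/* One stack slot: wide enough for an integer or an address. */
typedef uintptr_t cell;

#define STACK_DEPTH 16
static cell stack[STACK_DEPTH];
static int sp = 0;   /* index of the next free slot */

/* No bounds checks in this sketch. */
static void push(cell c) { stack[sp++] = c; }
static cell pop(void)    { return stack[--sp]; }

/* Forth-style "@" (fetch): pop an address, push the cell stored there.
   This is where the integer is converted back into a pointer. */
static void fetch(void) {
    cell *addr = (cell *)pop();
    push(*addr);
}
```

Pushing `(cell)&var` and calling `fetch()` leaves the value of `var` on the stack, so the same slot type carries both addresses and plain numbers.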


I have only skimmed the paper so far, but this raises some thoughts on C in general: why, where and how we use it. I can distinguish 5 distinct use cases for C:

  1. Legacy code
  2. Shared (closed) code
  3. Zero runtime dependency code
  4. Hardware control
  5. Resource limitations

This could be a good thing for (1.) - rewriting most projects in a safe language is just infeasible, yet this offers some guarantees/protection basically for free. Even if a tool (any tool) catches fire all over the place on first run because of a bug (in app code), it's still a good thing.

Sometimes we want to share some functionality (a library), yet stay closed source. Or have the ability to take a file, drop it on a remote machine and expect it to run. This is what native binaries with a stable API are for. Yet calling foreign functions in VM'd languages is often so awkward that we just end up reimplementing a lot of software instead of calling foreign functions possibly shipped/managed at an OS level.

By hardware control I mean something like writing to 0xABAD1DEA and having data fly out of serial port or NIC change modes or something along those lines. I guess this is doable in managed languages by some built-in magic proxy object, but I'm not entirely sure if this does not start with "write hardware definition file and rebuild the VM". Just a thought.
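In C the "write to a magic address" part is usually just a volatile pointer access. A minimal sketch, with made-up helper names and the address from the comment used purely as a placeholder (it does not refer to any real device):

```c
#include <stdint.h>

/* volatile tells the compiler that every access is an observable side
   effect, so it may not be cached, reordered away or dropped. */
static void reg_write8(volatile uint8_t *reg, uint8_t value) {
    *reg = value;
}

static uint8_t reg_read8(volatile uint8_t *reg) {
    return *reg;
}

/* Hypothetical register address, borrowed from the example above. */
#define SERIAL_TX_ADDR ((uintptr_t)0xABAD1DEAu)

/* On real hardware one would write something like:
     reg_write8((volatile uint8_t *)SERIAL_TX_ADDR, 'A');
   That integer-to-pointer cast is the construct a memory-safe C
   implementation has to refuse or mediate. */
```

The helpers can be exercised on a hosted system by pointing them at an ordinary `volatile uint8_t` variable instead of a device register.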

(5.) is the main reason I'm raising this. Small MCUs are still general purpose computers, just very limited, and are rather good litmus tests - is it possible to cram a Hello World into an ATtiny with 512 bytes of memory? Is it possible to run something on a bare-metal ARM Cortex M3/4 (no FPU, MPU/MMU)? No? Then it is by no means "general purpose" and we should thoroughly discuss the limitations imposed by the technology.

Programming is a discipline just too diverse for a single individual to grasp and too rarely we step outside our boxen to see the whole world.


An example of where this might be useful is jRuby (Chris, one of the authors of this paper, whose site this links to, worked on the Truffle backend for jRuby) - there are tons of Ruby gems out there with extensions written in C. I suppose this fits in the "legacy" category, of sorts, though for MRI this is not "legacy" but an artefact of the large number of C libraries people want to interface with, coupled with performance considerations vs. MRI. In general the large number of gems with C extensions is a bit of a pain point for alternative Ruby implementations.

I'd imagine there are quite a few other similar situations where you're running code on the JVM but it would be convenient to be able to pull in some C code without having to deal with JNI etc.


6. As an intermediate/output language for a compiler.


For legacy code a converter might be the better option. E.g.

http://www.tangiblesoftwaresolutions.com/Product_Details/CPl...

Yes, it's not perfect but at least debugger, instrumentation, ... will work against the module.


Well... If you just want to take old abandoned source and somehow run it - anything goes (IIRC, NumPy requires Fortran for MKL). I meant old projects that are still maintained at some level yet are large in scope and cannot be reimplemented incrementally, think OpenSSL.

If we are talking about e.g. servers, we can safely assume pretty beefy x86-compatible hardware and discuss in the context of that. In my book C is the ultimate at general-purposeness and anything we attempt to do with C must be discussed in that light.


I think that this work is a lot more thorough: http://www.cl.cam.ac.uk/~dc552/papers/asplos15-memory-safe-c...

They go through and identify C programming idioms as they are reflected in real code, and design their new memory model in part around that. The rest of the work on CHERI is also very interesting.


I don't get why they went with ManagedC when Managed C++ aka C++/CLI (which runs on CLR rather than JVM) existed for a decade.


If you read the paper, ManagedC is doing something very different to Managed C++, despite the similarities in name. C++/CLI is a different language where you have to define garbage collected pointers manually. ManagedC is basically the same as C99, albeit a whole lot more strict about things that might work on other compilers whilst being technically undefined.

To be more specific, in ManagedC the C code is interpreted, profiled and then JIT compiled just like in Java, where the JIT compiler (Graal) can then do very aggressive profile guided optimisations like inlining huge amounts of code, and using the resulting compile graphs to eliminate the overheads introduced by the sanity checking. It can also do things like eliminate dispatch costs when using function pointers and the like, in the same way it's done for Java.

What's also super interesting about this approach is the language interop you get. ManagedC is based on a project called TruffleC, which doesn't have any of the security/safety checks. TruffleC was built to enable Ruby code that's also being compiled by the same compiler to call into Ruby C extensions intended for the MRI interpreter. What's really crazy about this work is, the compiler can merge the C and Ruby code together incredibly tightly, to the extent that running a mix of Ruby and C on this experimental JVM can be much faster than running the Ruby and C together on the original implementation. The JVM can actually optimise out all the interop costs of moving between the Ruby and C worlds, just using compiler optimisations.


C++/CLI uses IJW, which includes a native module in the library[1]. It's a different solution than the one presented here. It also only works on Windows.

[1]: http://blogs.msdn.com/b/abhinaba/archive/2012/11/14/c-cli-an...


At first glance I assumed they used an approach similar to how emscripten compiles C/C++ to JS, where the entire C-accessible memory heap is one big Javascript array object (which has the nice side-effect of basically switching off the garbage collector, unless you need to cross over to the JS side). But it looks like they are actually mapping granular C structs to JVM 'objects'.

There is (or was?) an emscripten alternative called Duetto which had a 'granular' approach somewhat similar to the C-on-JVM one described here, but it couldn't compete on performance and also needed a customized C/C++ dialect.


I've done this --- see:

http://cluecc.sourceforge.net/

It compiles C89 C into Java, Javascript, Lua, Perl and Common Lisp. Pointers are represented as an array pointing at the object and an offset into the array, so sizeof(void * ) == 2 and sizeof(everything else) == 1. This allows efficient pointer arithmetic while still using one native allocation per object.

It relies heavily on C89 undefined behaviour. Alas, C99 adds (IIRC) a defined mapping from an object to bytes and vice versa, which means this approach won't work. Also, the compiler frontend I was using, sparse, had bugs where it would try to convert a pointer to an int, do arithmetic, and then convert back again. So it's not suitable for real work.

Performance was better than I was expecting: it would run C on Java at 1/3 the speed of native. That's pretty good for a naive toy. Running C on LuaJIT was amazing; very nearly the same speed as native. (For a set of artificial, non-representative benchmarks.)

Java, Javascript and Lisp are all crippled by not having goto. goto allows you to express arbitrary basic block graphs, which C supports. Without it, I have to use a big switch statement inside a while loop --- JITs hate this. Emscripten gets around this by using algorithms to try and represent the basic block graph as much as possible with structured control flow statements, but this can't work in all situations, so it has to fall back to the explicit state machine if that doesn't work. Which is, of course, slow.
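For anyone who hasn't seen the pattern, this is roughly what "a big switch statement inside a while loop" looks like. The sketch is written in C for readability (the real output would be Java or Javascript), and the function and block names are made up. Each case is one basic block and `next` plays the role of the goto target.

```c
/* Sum 1..n, written as an explicit state machine over basic blocks
   instead of structured control flow or goto. */
enum block { BB_ENTRY, BB_TEST, BB_BODY, BB_EXIT };

static int sum_to(int n) {
    int i = 0;
    int total = 0;
    enum block next = BB_ENTRY;

    while (1) {
        switch (next) {
        case BB_ENTRY:            /* initialise, fall into the loop test */
            i = 1;
            total = 0;
            next = BB_TEST;
            break;
        case BB_TEST:             /* conditional branch */
            next = (i <= n) ? BB_BODY : BB_EXIT;
            break;
        case BB_BODY:             /* loop body, then jump back to the test */
            total += i;
            i++;
            next = BB_TEST;
            break;
        case BB_EXIT:
            return total;
        }
    }
}
```

Every jump goes through the dispatch at the top of the loop, so the control flow a JIT sees has little to do with the original program's structure.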

LuaJIT does support goto. When I converted from Lua 5.1 (with no goto) to LuaJIT, I estimated about a 30% speed improvement.

Languages without goto statements are toys, dammit...


Common lisp does have a form of goto in the "go" operator within a tagbody. Was this not usable for your purposes? You needed a global goto?

http://clhs.lisp.se/Body/s_go.htm#go


TBH I don't know Lisp - the code generator was donated. There isn't enough libc to run the benchmarks and I haven't really examined the code much. Maybe it does use go. If you're interested in having a look, feel free...


> Maybe it does use go.

I just took a look at your repository. In the lisp implementation each generated function is wrapped with a prog which provides a tagbody and then a label is emitted for each basic block and go is used to jump to the labels. So be sure to leave out lisp in your future lists of toy languages... :D


Excellent news!

Normally at this point I'd ask if you were interested in fixing the libc for me, but frankly, it's not worth it --- clue's technology is pretty crappy. One day, in my copious free time, I'd like to retry the whole idea using a better compiler.


Do they allow the commonplace idiom of casting objects to void* and then back to the original type? (That is allowed by the standard)

And do they allow "overlay" casting, where multiple structs are used to access the same memory region? This is used in the POSIX networking API, for instance (and isn't that rare besides -- the alternative is unions, but that has the disadvantage of closedness -- you can't add new variants).
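To be concrete about both idioms, here is a small sketch loosely modelled on the sockaddr pattern. All types and names are hypothetical. It uses the variant that the standard clearly permits: each struct starts with the same header member, and generic code reaches that header through a void*.

```c
#include <stddef.h>

/* Hypothetical types, loosely modelled on sockaddr / sockaddr_in. */
struct addr_header { int family; };

struct addr_v4   { struct addr_header hdr; unsigned char bytes[4]; };
struct addr_name { struct addr_header hdr; char name[16]; };

enum { FAMILY_V4 = 1, FAMILY_NAME = 2 };

/* Generic code only knows the header and dispatches on the tag.
   New variants can be added without touching the existing structs. */
static size_t addr_size(const void *generic) {
    /* A pointer to a struct, suitably converted, points to its first member. */
    const struct addr_header *h = generic;

    switch (h->family) {
    case FAMILY_V4:   return sizeof(struct addr_v4);
    case FAMILY_NAME: return sizeof(struct addr_name);
    default:          return 0;   /* unknown variant */
    }
}
```

Passing `&a` for a `struct addr_v4 a` tagged `FAMILY_V4` returns `sizeof a`. A managed implementation has to keep enough type and layout information on the object to allow this view of it.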


This uses garbage collection, which requires a stop-the-world pause unless you use something like the Zing JVM. In many (real-time) applications, avoiding STW pauses is why you would choose C/C++ in the first place.



