BareMetal is a 64-bit OS for x86-64 based computers (returninfinity.com)
171 points by fogus on May 26, 2011 | hide | past | favorite | 79 comments



> Return Infinity goes back to the roots of computer programming with pure Assembly code. As we are programming at the hardware level, we can achieve a runtime speed that is not possible with higher-level languages like C/C++, VB, and Java.

When will this reasoning finally die?


It won't die because it's true, at least in some cases.

The LuaJIT 2.0 interpreter, written in x86-64 assembly language, is 2-5x the speed of the plain Lua interpreter, written in C. Note that this is with the JIT disabled -- it is an apples-to-apples comparison of interpreter-vs-interpreter: http://luajit.org/performance_x86.html

I recently wrote a protobuf-decoding assembly code generator that is 2-3x the speed of C++ generated code: http://blog.reverberate.org/2011/04/25/upb-status-and-prelim...

What is your evidence in support of the idea that assembly cannot be faster?


> What is your evidence in support of the idea that assembly cannot be faster?

I made no such claim.

My gripe was with using it as a feature, claiming that since it's in assembly, it's certainly faster, which is simply not true. In most cases, it's the algorithm that determines performance, as opposed to the details of its implementation. Assembly certainly has its place, but arguing that a kernel completely implemented in assembly is faster simply due to the abstraction level they're working at does not carry much weight. Of course you'll be able to find hand-tuned algorithms that are much faster in assembly than in a higher-level language, but it does not follow that "complex software written in X is generally slower than complex software written in assembly".

Also, Lua is a poor example. Its performance was much more heavily influenced by portability and embeddability.


Of course, if you take two different pieces of software and implement one in assembly language and the other in a high-level language, you can't claim a valid comparison.

But if you take one algorithm and implement it in both languages, the assembly implementation will always be faster, thus the basis for their claim.


That's equivalent to claiming Assembly is always faster than other languages, because every program is just an algorithm. It's completely incorrect - I guarantee you I can write an implementation of an algorithm in Assembly that is slower than the same algorithm implemented in Ruby.


I think the person is trying to point out that you need to be a good programmer to write good assembly, and the assumption that you will always have a good programmer can be broken sometimes. An algorithm may be an algorithm, but an inexperienced programmer could easily make it slower.


> assembly implementation will always be faster

only if you don't suck at assembly.


I don't know, I always found the notion that humans will always be able to optimize better than machines to be somewhat... naïve. Is it an NP-complete problem that our own heuristics are currently better at estimating?

No one has performed controlled studies on these things - maybe they had some crummy bottlenecks, used some language feature that their compiler couldn't optimize away, maybe the benchmarks they use to determine performance are trivial (which is very often the case), etc.

Not to mention that most of an operating system's time, post-boot, is spent doing... what? Having the scheduler swap processes in and out? If you're running a single program that fits inside RAM... it's totally fucking pointless, there's nothing left to optimize.

I have a better question. These guys are clearly smart. What the hell are they still doing in Atwood, Ontario?


> Is it an NP complete problem our own heuristics are currently better at estimating?

Some of the important problems are NP-complete (like register allocation). Another problem is that compilers aren't that good at telling fast-paths from slow-paths (and keeping everything in registers for the fast paths). For more info see this message from the author of LuaJIT: http://article.gmane.org/gmane.comp.lang.lua.general/75426


There's a new formulation of register allocation that's computationally tractable: http://compilers.cs.ucla.edu/fernando/projects/puzzles/exper...


Thanks for the link. That email was informative enough that I thought it deserved its own submission: http://news.ycombinator.com/item?id=2588696


I think the problem is that optimization is AI-complete. Without a lot of context about what your program is doing under what circumstances the problem is not solvable. You need to know when and how a specific code path is run.


Yes, though that doesn't mean that humans will be better at it.


Agreed, I guess that when we have built smarter compilers, a centaur approach would work best (like in chess). The computer can do a whole lot by brute force and smart algorithms, and the human uses his knowledge of the context to steer it in the right direction.

Unfortunately we're not there yet.


Does the plain Lua interpreter also JIT? PyPy is written in Python and it's a lot faster than CPython in many cases, thanks in large part to the JIT.


The numbers I quoted are when LuaJIT has the JIT disabled. It's an interpreter-to-interpreter comparison.


Oh, cool, thanks for clarifying.


> What is your evidence in support of the idea that assembly cannot be faster?

I don't think that is the key; the key is the cost of that speed improvement. Say you spend a week writing the protobuf decoder in assembly so now it can decode in 30usec instead of 60usec. So you have an impressive 2x speed gain.

But then say you are writing the data to a disk. Well, maybe it doesn't really matter how fast you are decoding the protobuf if next you are sitting there for ages waiting for that data to be written out. That 30usec gain is nothing on top of the 10msec wait time that is coming next, so was that week a good investment if you did it purely for the speed improvement? (Well, you might have done it as a learning exercise, but then speed doesn't really matter.)


Although this is a valid point in many cases, I don't think this is one of those cases. It's in these "infrastructure" type projects like kernels, compilers, interpreters, and parsers where "micro" optimizations are actually really important.

> But then say you are writing the data to a disk. Well, maybe it doesn't really matter how fast you are decoding the protobuf if next you are sitting there for ages waiting for that data to be written out. That 30usec gain is nothing on top of the 10msec wait time that is coming next, so was that week a good investment if you did it purely for the speed improvement? (Well, you might have done it as a learning exercise, but then speed doesn't really matter.)

haberman's parser (1460 MB/s) outperforms Google's C++ parser (260 MB/s) by more than 5x. Note that even in the disk example, a fast SSD has enough bandwidth to saturate the CPU running Google's parser. On top of that, this is FOSS, which means his weeks of investment are multiplied every time someone downloads and uses his code.


> On top of that, this is FOSS, which means his weeks of investment are multiplied every time someone downloads and uses his code.

Excellent point.

Also, I didn't mean to talk specifically about his parser, it was just used as a general example.

It is just that, in my experience, engineers (I am guilty too) have a tendency to spend time micro-optimizing without, in the end, making a difference in the overall user experience. For example, stuff like choosing to write a GUI app in C++ when it could have been whipped up in Python in a fraction of the time and lines of code. With Python the menus will open in 10ms instead of 3ms, but maybe it doesn't really matter from the user's point of view.

The same holds for most data that ends up at I/O choke points. Even memory in today's SMP architectures is a choke point. Spend time hand-optimizing CPU-bound code only to find out that it ends up waiting on a lock, a disk, a network buffer, or some user input.

Also, micro-optimizations are often not future-proof. Many cache-friendly data structures and algorithms, for example, assume a particular cache line size or particular characteristics of hardware that just happen to change. Even in the assembly case, today we have 32-bit, 64-bit, and ARM as common target architectures, each with various levels of SSE extension support and other features, so one can spend a lot of time maintaining and tweaking all of them.


In this case though, they say their market is for HPC clusters and embedded computing, which are two areas where most processes are likely to be CPU-intensive.


An interpreter isn't a fair comparison though - in assembly you can use a few tricks like threaded code (http://en.wikipedia.org/wiki/Threaded_code) to get a big speed boost, but these techniques aren't broadly applicable to programs in general.
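
To make the threaded-code trick concrete, here's a minimal sketch using GCC's computed-goto extension (a hypothetical toy bytecode, nothing to do with LuaJIT's actual sources) -- each handler jumps straight to the next one instead of going back through a central switch:

    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    static int run(const unsigned char *code) {
        /* one label address per opcode; dispatch is a single indirect jump */
        static void *dispatch[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
        int stack[32], sp = 0;

    #define NEXT() goto *dispatch[*code++]
        NEXT();

    op_push1: stack[sp++] = 1;                  NEXT();
    op_add:   sp--; stack[sp - 1] += stack[sp]; NEXT();
    op_print: printf("%d\n", stack[sp - 1]);    NEXT();
    op_halt:  return stack[sp - 1];
    #undef NEXT
    }

    int main(void) {
        const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        return run(prog) == 2 ? 0 : 1;   /* prints 2 */
    }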

And in any case, it's still not an argument for writing an entire OS in assembly, but rather only a few important segments of the code.


OK, but if it's a JIT, then how much time is the interpreter actually running?

Wouldn't we expect it be executing the JIT-compiled code (i.e., doing useful work) most of the time?

If so, doesn't that really make the opposite point, that compiler (JIT or no) generated code is plenty fast?


LuaJIT has both an interpreter (written in assembly) and a JIT. The 2-5x I quoted is only for the interpreter (ie. with the JIT disabled). The speedup for the actual JIT is 2-130x vs. the interpreter written in C.


The Lua C implementation seems very conservatively written for portability and maintainability, but it's not slow either.

Handwritten assembly really can be faster than compiler generated code. The proof is that we can always look at the output of the compiler and invest more time improving on it by hand, whereas the compiler is required to complete in a short amount of time and usually without actually timing its code on the target machine.

Now if you take someone experienced in hand-tuning assembly like that and ask them to write the fastest possible code using a compiler, they're going to beat the pants off an ordinary coder who hasn't been benchmarking everything he writes all along.

But the real lesson here is that Lua is just freaking awesome.


Actually, porting LuaJIT to BareMetal might be a neat idea. :-)


When the Singularity happens and computers are at least as smart as humans.

Until then, compilers will be mindbogglingly retarded piles of crap that produce code 10, 20, or more percent slower than a human. Doubly so on anything other than x86. Add a factor of 10 if SIMD is involved.


And the downvote brigade arrives, consisting entirely of programmers who don't read the assembly outputted by their compiler.

Part of the problem is simply that compilers typically cannot know the same information the programmer knows: assumptions about alignment and aliasing, for example, that the programmer knows, but the compiler doesn't.
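
A concrete (hypothetical) C illustration of the aliasing part: without 'restrict' the compiler must assume dst and src could overlap, so it hesitates to vectorize or keep loads in registers across the stores; the programmer often knows they never overlap and can say so:

    /* the 'restrict' qualifiers hand the compiler the no-overlap guarantee
       the programmer already had in his head */
    void scale(float *restrict dst, const float *restrict src, float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;   /* now freely vectorizable */
    }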

But even if they did, there are plenty of cases where "producing good assembly code for a given algorithm" is infeasible with a brute-force approach, requiring the imprecise-but-effective pattern-matching of a human brain -- or something similarly powerful.


I don't understand why this reasoning is inappropriate for the application (HPC). If they were suggesting this as a way to develop massive enterprise applications with vast compatibility requirements and rapid application development needs that would be one thing, but for specialized supercomputing applications do you disagree with this approach?


Modern compilers are generally considered smart enough to do a better job at optimizing your code than you. Maybe if you're really, really smart you can do a better job than the compiler. But I suspect most people who say this are not, in fact, smarter than the compiler (but are simply suffering from self-serving bias). And by saying it, they're fooling a bunch more people into thinking that modern compilers are stupid, leading them into incorrect decisions like actually writing applications in assembly "for performance reasons".

So while yes, there are probably a small number of cases where it's still worthwhile to write stuff in assembly, it's not worthwhile to talk about it as though it's a good thing. (I would guess you should only do it when it becomes a necessity, and complain about it a lot, rather than presenting it as a feature.)


Compilers are great at some things and really bad at other things. The C and Fortran ABIs are too permissive in some cases, making certain optimizations impossible for the compiler to do (without combinatorial growth in generated code size). On x86-64, you can go a long way using SSE intrinsics, but that is pretty close to the assembly level. IBM did a bad job designing the PowerPC intrinsics, so they are nearly useless. There are a few computational kernels that I have sped up by a factor of two relative to what the compiler could produce or the best published result. The x264 project writes a huge amount of assembly and provides consistently better performance than other implementations. There is still a place for assembly, although it should be kept localized.
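
To show what "pretty close to the assembly level" means, here is a minimal sketch of an SSE-intrinsics kernel (hypothetical, not taken from x264 or anything else mentioned above); each intrinsic maps more or less one-to-one onto an instruction:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* add two float arrays four lanes at a time; assumes n is a multiple of 4 */
    void add4(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);             /* unaligned 128-bit load (movups) */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));  /* addps, then movups store */
        }
    }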


Take a piece of assembly code generated by a compiler and try to optimize it. You'll see that you don't actually need to be that smart to obtain a significant perf improvement.

Compilers are good, but they must ensure correctness for any source code. On the other hand, you know exactly what you need, so you can drastically simplify / optimize the assembly code.


As someone who writes assembly, I think this is overstating the case. There are a lot of algorithms (particularly short ones) where good C compilers generate nearly optimal code that you would be hard-pressed to improve on.

In my experience the benefit you get from writing assembly comes largely from your ability to do better register allocation for your fast-paths, in cases where your compiler would spill registers to the stack.

There are cases where the compiler does something that is genuinely stupid (http://blog.reverberate.org/2011/03/19/when-a-compilers-slow...) but in modern compilers these are pretty rare.


I don't think they are saying that you and I (application developers) should be writing assembly (they mention C/C++, and refer to C++ libraries elsewhere), unless we want to.

What they do emphasize is that the operating system was written in assembly, and while I don't know the members of the team personally, I'd guess anyone who has completed a project such as this (with this level of polish and utility) is at least potentially capable of being smarter than your average compiler.

I'd also like to state that I do agree that, for a vast amount of software development, the convenience of higher-level languages and APIs outweighs the associated performance disadvantages. But for some programs, the ones that talk directly to hardware and whose library functions are called billions of times a second by application programs (and, let's add, that are written far less frequently than application-level code), the assembly approach is justified.

Of course you're welcome to build something similar in a compiled language and prove us all wrong :)


I'd also like to say that while you may not enjoy programming in assembly that doesn't make it a bad thing.

Programming assembly is actually fun (for some of us) and gives you a level of intimacy and insight into the machine that no other language can provide.

OK maybe FORTH


Great! Go right ahead and program everything you want in assembly. If it's for fun, or for educational purposes, or whatever, awesome. But don't go around saying it's going to perform better than Haskell code, or whatever higher-level language you want, that produces the same output, without testing it.


> languages like C/C++, VB, and Java.

That's kind of funny if you think about it. Are people really writing OS's in VB these days?


HN user daeken has been writing an OS in .NET http://daeken.com/renraku-future-os


When people fab chips that run Java / VB / C++ natively. I saw a demo once using Java on a chip, but it was far from consumer-level.


Chips will never run java/vb/C++ natively. Chips execute instructions, and those languages contain constructs which are not instructions.

See a great discussion at http://electronics.stackexchange.com/questions/14527/any-pro...

(The Java demo you saw was probably Jazelle, which is a processor module on ARM chips that runs (some) Java bytecode instructions natively, instead of using a virtualized processor. That's possible for a lot of VM-based languages, but it's not running Java.)


Modern chips don't run x86_64 assembly language either. That's just a compatibility layer that is translated away as soon as possible.


You mean like Jazelle http://en.wikipedia.org/wiki/Jazelle that hardly anyone used?


Gosling came to my school and did a demo of an RV using something like that -- I wish I remembered the chip name.

"hardly anyone used" <-- this is why assembly still survives. I still use a fair bit of it for commands that glibc doesnt wrap (i.e. RDTSC)


I'd be very much surprised if there were any significant workloads that would work faster on this thing than on a decently set up Linux or BSD install on the same hardware.

It's not my impression that modern OSes have a habit of getting in the way of pure computation - and when the computation is done, I'd much prefer a solid filesystem/network stack to get the results out of the door.

But as an academic/tinkering/hacking project, it's awesome. If assembly was in my backlog of stuff I want to learn/play with, this would be an obvious thing to get started on.


Cray runs a stripped-down kernel they call Compute Node Linux. It still has virtual memory, which, combined with the frequency of getting poorly mapped physical pages, causes difficult-to-predict performance. It is just accepted that performance results are not reproducible on the Cray and most modern clusters, especially those with "fat" nodes. For large runs on Jaguar, the standard deviation is often 20% to 30%, so people who are doing scalability studies run the same model several times and plot the best result. It can be worse on clusters like Ranger (4-socket quad-core Opteron nodes, connected by InfiniBand). Of course, running the same 100k-core job repeatedly to get a stable timing is a waste of resources. The problem is a combination of VM, multi-core interference, daemon noise, and network topology variability between runs.

In contrast, IBM's Blue Gene series runs Compute Node Kernel which is not Linux and uses offset-mapped memory. This obviates the need for a TLB. The rest of the OS is also stripped down compared to Cray's already lean CNL. Performance variability on Blue Gene is usually reliably less than 1%.

I think BareMetal looks rather silly and will probably not be used for anything serious, but ordinary Linux or BSD is a dubious choice for HPC.


It could make sense for a computing cluster; look at the demo of the BareMetal node - not even a local disk is present.


This has been around for at least two years (if not longer?) see http://forum.osdev.org/viewtopic.php?f=2&t=20946 for example.

Considering it is such a "from scratch" kind of project and given the progress so far, it seems to me like it might be more of a "let's see if we can" curiosity type thing rather than a project that an end user might actually want to use for anything practical.



Genuine Question: Can anyone give an example of something useful that has been done with this OS, or is planned for the very near future?


you can run Redis on it or any other simple single-threaded server


I doubt that, since the OS doesn't even support TCP/IP, any kind of filesystem, or POSIX APIs. It'd be hard to even port Redis to it, let alone get anything useful done in the absence of TCP/IP.


I wonder how hard it would be to port Redis and how its performance would compare.


I don't have a link, but something similar was done with custom kernels, with Redis compiled to run natively on top of Xen - the performance gain was only ~13%, so in that case it wasn't worth the trouble.

But if you run a large HPC cluster, getting 13% more out of each compute node is definitely worth the trouble.

EDITED: see link in child's post


Not EC2, but is this what you are thinking about?… http://openfoo.org/blog/redis-native-xen.html


That's interesting, though I think my takeaway was that the "performance tax" of the operating system layer is pretty minor, all things considered.


Thanks for the replies!



A great idea especially if they try to attack one market at a time.

They should mandate a very small (but popular) set of supported hardware - if you want to use the OS, that's what you run it on - which reduces their support issues (they could even sell pre-installed boxes). Possibly create some VirtualBox drivers to allow people to dabble with it prior to building their own compatible hardware.

I'd like to see them include much needed secondary features in Intel optimized C with a roadmap for them to be reimplemented in ASM as time permits.

If they could develop/get a static web server with the speed of Nginx (or better), I'm sure this thing would explode in popularity; CDNs etc. would see the benefits.


The part of this that stands out to me: the OS claims to be open source. But the bootloader is proprietary. Why? Does the source depend on proprietary specifications that have been embedded in parts of it or something? The documentation doesn't obviously preclude writing a replacement, but nor does it seem to be designed to encourage such a thing. On the surface it's not complex enough for this to be a huge task, but I'm suspecting there's at least one strange grinding obstacle in the way…


I understand that this is an experimental project, but it would seem that to target high-performance computing, they should allow Fortran as one of the languages also.


I'd be interested to see some benchmarks of some computationally expensive applications vs. running them on Linux or another OS.

How much do you gain by optimizing the OS?


The thing that strikes me the most is how the author seems to consider C and C++ to be a single language called "C/C++".


C++ is an almost perfect superset of C (using the term "perfect" about the "superset"-ness of C++, not its design quality). From this perspective, it is appropriate to lump them together.
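
The "almost" is small but real. For example (a hypothetical snippet), this is valid C that a C++ compiler rejects:

    #include <stdlib.h>

    int main(void)
    {
        int *new = malloc(10 * sizeof *new);  /* 'new' is a C++ keyword, and C++ forbids
                                                 the implicit void* conversion from malloc */
        free(new);
        return 0;
    }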


I see plenty of people who claim they write "C++" but end up writing some mutant "C with classes". C++ is not just "C with add-ons"; it's a different language that happens to share its syntax and part of its standard library. Lumping them together leads to ugly C++ code.


Having just seen a billboard ad for http://www.mokafive.com/baremetal (enterprisey desktop virtualization), I was briefly expecting a legal dispute, but "BareMetal" isn't actually on their trademark list.


If it doesn't support MPI or a functional threading system, then it will never be used for HPC.


That's my biggest question: can it support threading? I work on HPC tasks that don't need MPI, but that are I/O bound. This means that it is more efficient to use multiple threads to process data while another is waiting for data to be loaded from the disk. I'm all for getting as close to the bare-metal as possible, but you're right. Without MPI or threading, this doesn't have much of a chance to be adopted.


Any performance metrics? We can speculate about the effectiveness of such a solution, but it should be fairly easy to validate by running some common computational tasks this OS was designed to excel at vs. other popular OSes.


Why would they use FAT16 for the file system? Seems limiting to me...


Because it's really trivial to implement. Other file systems are a lot more complex, especially to write.
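
To give a feel for the "trivial" part: the core of FAT16 is literally a lookup table of 16-bit cluster numbers. A sketch of following a file's cluster chain (simplified; values >= 0xFFF8 mark end-of-chain):

    #include <stdint.h>

    /* the File Allocation Table is an array of 16-bit entries, one per
       cluster; entry i holds the number of the next cluster of the file */
    static uint16_t next_cluster(const uint16_t *fat, uint16_t cluster)
    {
        uint16_t next = fat[cluster];
        return (next >= 0xFFF8) ? 0 : next;   /* 0 here meaning "end of file" */
    }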


It's also well supported by boot media like .isos, USB, network, etc.


There's probably little need for the overhead associated with other filesystems (security, multi-user support, crash recovery, etc.).

Personally I would have preferred a custom filesystem designed for the application but I can see the convenience of being FAT-compatible. I wouldn't expect these nodes to keep much data locally (if any).


The bootloader for the kernel (called Pure64) requires this. They chose FAT16 because it's compatible with most operating systems and also, not unimportantly, because it's relatively easy to implement.


And then an FPGA guy walks in a bar and...


There I upvoted you. Perfectly valid comment.

FPGAs can be programmed to give the answer in the time it takes the gates to propagate which is usually damn quick. None of these "cycles" things that CPUs use up.


The popularity of virtualization technology and the new trend of selling instances could make operating system development interesting again.

The requirements for an operating system have changed drastically with this new way of thinking about what it means to run an operating system. The requirements can be as low as supporting a single process that can talk TCP and (maybe) to a disk. Look at the Haskell Network Stack: it provides network support to an application, and you don't need an OS proper, just Xen.

I'm very excited to see where highly lightweight OSes end up.


It's exokernels all over again.


It's ________ all over again. Love these types of comments on HN.


Oh, Xen and other baremetal hypervisors are actually very closely related to exokernels. I work with Xen for a living.



