> Return Infinity goes back to the roots of computer programming with pure Assembly code. As we are programming at the hardware level, we can achieve a runtime speed that is not possible with higher-level languages like C/C++, VB, and Java.
It won't die because it's true, at least in some cases.
The LuaJIT 2.0 interpreter, written in x86-64 assembly language, is 2-5x the speed of the plain Lua interpreter, written in C. Note that this is with the JIT disabled -- it is an apples-to-apples comparison of interpreter-vs-interpreter: http://luajit.org/performance_x86.html
> What is your evidence in support of the idea that assembly cannot be faster?
I made no such claim.
My gripe was with touting it as a feature, claiming that since it's in assembly, it's certainly faster, which is simply not true. In most cases, it's the algorithm that determines performance, not the details of its implementation. Assembly certainly has its place, but arguing that a kernel completely implemented in assembly is faster simply due to the abstraction level does not carry much weight. Of course you'll be able to find hand-tuned algorithms that are much faster in assembly than in a higher-level language, but it does not follow that "complex software written in X is generally slower than complex software written in assembly."
Also, Lua is a poor example. Its performance was much more heavily influenced by portability and embeddability.
Of course, if you take two different pieces of software and implement one in assembly language and the other in a high-level language, you can't claim a valid comparison.
But if you take one algorithm and implement it in both languages, the assembly implementation will always be faster — hence their claim.
That's equivalent to claiming Assembly is always faster than other languages, because every program is just an algorithm. It's completely incorrect - I guarantee you I can write an implementation of an algorithm in Assembly that is slower than the same algorithm implemented in Ruby.
I think the person is trying to point out that you need to be a good programmer to write good assembly, and the assumption that you will always have a good programmer doesn't always hold. An algorithm may be an algorithm, but an inexperienced programmer could easily make it slower.
I don't know, I always found the notion that humans will always be able to optimize better than machines to be somewhat... naïve. Is it an NP-complete problem our own heuristics are currently better at estimating?
No one has performed controlled studies on these things - maybe they had some crummy bottlenecks, used some language feature that their compiler couldn't optimize away, maybe the benchmarks they use to determine performance are trivial (which is very often the case), etc.
Not to mention that most of an operating system's time, post-boot, is spent doing... what? Having the scheduler swap processes in and out? If you're running a single program that fits inside RAM... it's totally fucking pointless; there's nothing left to optimize.
I have a better question. These guys are clearly smart. What the hell are they still doing in Atwood, Ontario?
> Is it an NP-complete problem our own heuristics are currently better at estimating?
Some of the important problems are NP-complete (like register allocation). Another problem is that compilers aren't that good at telling fast-paths from slow-paths (and keeping everything in registers for the fast paths). For more info see this message from the author of LuaJIT: http://article.gmane.org/gmane.comp.lang.lua.general/75426
I think the problem is that optimization is AI-complete. Without a lot of context about what your program is doing under what circumstances the problem is not solvable. You need to know when and how a specific code path is run.
Agreed, I guess that when we have build smarter compilers a Centaur approach would work best (like in chess). The computer can do a whole lot by bruteforce and smart algorithms and the human uses his knowledge of the context to steer it in the right direction.
> What is your evidence in support of the idea that assembly cannot be faster?
I don't think that is the key; the key is the cost of that speed improvement. Say you spend a week writing the protobuf decoder in assembly so now it can decode in 30usec instead of 60usec. So you have an impressive 2x speed gain.
But then say you are writing the data to disk. Well, maybe it doesn't really matter how fast you are decoding the protobuf if next you are sitting there for ages waiting for that data to be written out. That 30usec gain is nothing on top of the 10msec wait that comes next, so was that week a good investment if you did it purely for the speed improvement? (You might have done it as a learning exercise, in which case speed doesn't really matter.)
Although this is a valid point in many cases, I don't think this is one of those cases. It's in these "infrastructure" type projects like kernels, compilers, interpreters, and parsers where "micro" optimizations are actually really important.
> But then say you are writing the data to disk. Well, maybe it doesn't really matter how fast you are decoding the protobuf if next you are sitting there for ages waiting for that data to be written out. That 30usec gain is nothing on top of the 10msec wait that comes next, so was that week a good investment if you did it purely for the speed improvement?
haberman's parser (1460 MB/s) outperforms Google's C++ parser (260 MB/s) by more than 5x. Note that even in the disk example, a fast SSD has enough bandwidth that the CPU becomes the bottleneck with Google's parser.
On top of that, this is FOSS, which means his weeks of investment is multiplied every time someone downloads and uses his code.
> On top of that, this is FOSS, which means his weeks of investment is multiplied every time someone downloads and uses his code.
Excellent point.
Also, I didn't mean to talk specifically about his parser, it was just used as a general example.
It is just that in my experience, engineers (I am guilty too) have a tendency to spend time micro-optimizing without, in the end, making a difference in the overall user experience. For example, stuff like choosing to write a GUI app in C++ when it could have been whipped up in Python in a fraction of the time and lines of code. The menus will open in 10ms instead of 3ms, but maybe it doesn't really matter from the user's point of view.
The same holds for most data that ends up in I/O choke points. Even memory in today's SMP architectures is a choke point. Spend time hand-optimizing CPU-bound code only to find out that it ends up waiting on a lock, a disk, a network buffer, or some user input.
Also, micro-optimizations are often not future-proof. Many cache-friendly data structures and algorithms, for example, assume a particular cache line size, or particular characteristics of hardware that just happen to change. Even in the assembly case, today we have 32-bit, 64-bit, and ARM as common target architectures, each with various levels of SSE extension support and other features, so one can spend a lot of time maintaining and tweaking all of them.
In this case though, they say their market is for HPC clusters and embedded computing, which are two areas where most processes are likely to be CPU-intensive.
An interpreter isn't a fair comparison though - in assembly you can use a few tricks like threaded code (http://en.wikipedia.org/wiki/Threaded_code) to get a big speed boost, but these techniques aren't really broadly applicable to programs generally.
And in any case, it's still not an argument for writing an entire OS in assembly, but rather only a few important segments of the code.
LuaJIT has both an interpreter (written in assembly) and a JIT. The 2-5x I quoted is only for the interpreter (ie. with the JIT disabled). The speedup for the actual JIT is 2-130x vs. the interpreter written in C.
The Lua C implementation seems very conservatively written for portability and maintainability, but it's not slow either.
Handwritten assembly really can be faster than compiler generated code. The proof is that we can always look at the output of the compiler and invest more time improving on it by hand, whereas the compiler is required to complete in a short amount of time and usually without actually timing its code on the target machine.
Now if you take someone experienced in hand-tuning assembly like that and ask them to write the fastest possible code using a compiler, they're going to beat the pants off an ordinary coder who hasn't been benchmarking everything he writes all along.
But the real lesson here is that Lua is just freaking awesome.
When the Singularity happens and computers are at least as smart as humans.
Until then, compilers will be mindbogglingly stupid piles of crap that produce code 10, 20, or more percent slower than a human's. Doubly so on anything other than x86. Add a factor of 10 if SIMD is involved.
And the downvote brigade arrives, consisting entirely of programmers who don't read the assembly output by their compiler.
Part of the problem is simply that compilers typically cannot know the same information the programmer knows: assumptions about alignment and aliasing, for example, that the programmer knows, but the compiler doesn't.
But even if they did, there are plenty of cases where "producing good assembly code for a given algorithm" is infeasible with a brute-force approach, requiring the imprecise-but-effective pattern-matching of a human brain -- or something similarly powerful.
I don't understand why this reasoning is inappropriate for the application (HPC). If they were suggesting this as a way to develop massive enterprise applications with vast compatibility requirements and rapid application development needs that would be one thing, but for specialized supercomputing applications do you disagree with this approach?
Modern compilers are generally considered smart enough to do a better job at optimizing your code than you. Maybe if you're really, really smart you can do a better job than the compiler. But I suspect most people who say this are not, in fact, smarter than the compiler (but are simply suffering from self-serving bias). And by saying it, they're fooling a bunch more people into thinking that modern compilers are stupid, leading them into incorrect decisions like actually writing applications in assembly "for performance reasons".
So while yes, there are probably a small number of cases where it's still worthwhile to write stuff in assembly, it's not worthwhile to talk about it as though it's a good thing. (I would guess you should only do it when it becomes a necessity, and complain about it a lot, rather than presenting it as a feature.)
Compilers are great at some things and really bad at other things. The C and Fortran ABIs are too permissive in some cases, making certain optimizations impossible for the compiler to do (without combinatorial growth in generated code size). On x86-64, you can go a long way using SSE intrinsics, but that is pretty close to the assembly level. IBM did a bad job designing the PowerPC intrinsics, so they are nearly useless. There are a few computational kernels that I have sped up by a factor of two relative to what the compiler could produce or the best published result. The x264 project writes a huge amount of assembly and provides consistently better performance than other implementations. There is still a place for assembly, although it should be kept localized.
Take a piece of assembly code generated by a compiler and try to optimize it. You'll see that you don't actually need to be that smart to obtain a significant perf improvement.
Compilers are good, but they must ensure correctness for any source code. You, on the other hand, know exactly what you need, so you can drastically simplify and optimize the assembly code.
As someone who writes assembly, I think this is overstating the case. There are a lot of algorithms (particularly short ones) where good C compilers generate nearly optimal code that you would be hard-pressed to improve on.
In my experience the benefit you get from writing assembly comes largely from your ability to do better register allocation for your fast-paths, in cases where your compiler would spill registers to the stack.
I don't think they are saying that you and I (application developers) should be writing assembly (they mention C/C++, and refer to C++ libraries elsewhere), unless we want to.
What they do emphasize is that the operating system was written in assembly, and while I don't know the members of the team personally, I'd guess anyone who has completed a project such as this (with this level of polish and utility) is at least potentially capable of being smarter than your average compiler.
I'd also like to state that I do agree: for a vast amount of software development, the convenience of higher-level languages and APIs outweighs the associated performance disadvantages. But for some programs, the ones that talk directly to hardware and whose library functions are called billions of times a second by application programs (and, let's add, that are written far less frequently than application-level code), the assembly approach is justified.
Of course you're welcome to build something similar in a compiled language and prove us all wrong :)
I'd also like to say that while you may not enjoy programming in assembly that doesn't make it a bad thing.
Programming assembly is actually fun (for some of us) and gives you a level of intimacy and insight into the machine that no other language can provide.
Great! Go right ahead and program everything you want in assembly. If it's for fun, or for educational purposes, or whatever, awesome. But don't go around saying it's going to perform better than Haskell code, or whatever higher-level language you want, that produces the same output, without testing it.
(The Java demo you saw was probably Jazelle, a processor module on ARM chips that runs (some) Java bytecode instructions natively, instead of using a virtualized processor. That's possible for a lot of VM-based languages, but it's not running Java.)
I'd be very much surprised if there were any significant workloads that would work faster on this thing than on a decently set up Linux or BSD install on the same hardware.
It's not my impression that modern OSes have a habit of getting in the way of pure computation, and when the computation is done, I'd much prefer a solid filesystem/network stack to get the results out the door.
But as an academic/tinkering/hacking project, it's awesome. If assembly was in my backlog of stuff I want to learn/play with, this would be an obvious thing to get started on.
Cray runs a stripped-down kernel they call Compute Node Linux. It still has virtual memory, which, combined with the frequency of getting poorly mapped physical pages, causes difficult-to-predict performance. It is just accepted that performance results are not reproducible on the Cray and most modern clusters, especially those with "fat" nodes. For large runs on Jaguar, the standard deviation is often 20% to 30%, so people doing scalability studies run the same model several times and plot the best result. It can be worse on clusters like Ranger (4-socket quad-core Opteron nodes, connected by InfiniBand). Of course, running the same 100k-core job repeatedly to get a stable timing is a waste of resources. The problem is a combination of VM, multi-core interference, daemon noise, and network topology variability between runs.
In contrast, IBM's Blue Gene series runs Compute Node Kernel which is not Linux and uses offset-mapped memory. This obviates the need for a TLB. The rest of the OS is also stripped down compared to Cray's already lean CNL. Performance variability on Blue Gene is usually reliably less than 1%.
I think BareMetal looks rather silly and will probably not be used for anything serious, but ordinary Linux or BSD is a dubious choice for HPC.
Considering it is such a "from scratch" kind of project and given the progress so far, it seems to me like it might be more of a "let's see if we can" curiosity type thing rather than a project that an end user might actually want to use for anything practical.
I doubt that, since the OS doesn't support even TCP/IP, any kind of filesystem, or POSIX APIs. It'd be hard to even port Redis to it, let alone get anything useful done in the absence of TCP/IP.
I don't have a link, but something similar was done with custom kernels running Redis compiled native on top of Xen — the performance gain was only ~13%, so in that case it wasn't worth the trouble.
But if you run a large HPC cluster, getting 13% more out of each compute node is definitely worth the trouble.
A great idea especially if they try to attack one market at a time.
They should mandate a very small (but popular) set of supported hardware — if you want to use it, that's it — which reduces their support issues (they could even sell pre-installed boxes). Possibly also create some VirtualBox drivers to allow people to dabble with it prior to building their own compatible hardware.
I'd like to see them include much needed secondary features in Intel optimized C with a roadmap for them to be reimplemented in ASM as time permits.
If they could develop or get a static web server with the speed of Nginx (or better), I'm sure this thing would explode in popularity; I'm sure CDNs etc. would see the benefits.
The part of this that stands out to me: the OS claims to be open source. But the bootloader is proprietary. Why? Does the source depend on proprietary specifications that have been embedded in parts of it or something? The documentation doesn't obviously preclude writing a replacement, but nor does it seem to be designed to encourage such a thing. On the surface it's not complex enough for this to be a huge task, but I'm suspecting there's at least one strange grinding obstacle in the way…
I understand that this is an experimental project, but it would seem that to target high-performance computing, they should allow Fortran as one of the languages also.
C++ is an almost perfect superset of C. (using the term "perfect" in the "superset"-ness of c++, not its design quality.) From this perspective, it is appropriate to lump them together.
I see plenty of people who claim they write "C++" but end up writing some mutant "C with classes". C++ is not just "C with add-ons"; it's a different language that happens to share its syntax and part of its standard library. Lumping them together leads to ugly C++ code.
Having just seen a billboard ad for http://www.mokafive.com/baremetal (enterprisey desktop virtualization), I was briefly expecting a legal dispute, but "BareMetal" isn't actually on their trademark list.
That's my biggest question: can it support threading? I work on HPC tasks that don't need MPI, but that are I/O bound. This means that it is more efficient to use multiple threads to process data while another is waiting for data to be loaded from the disk. I'm all for getting as close to the bare-metal as possible, but you're right. Without MPI or threading, this doesn't have much of a chance to be adopted.
any performance metrics?
we can speculate about the effectiveness of such a solution, but it should be fairly easy to validate by running some common computational tasks this OS was designed to excel at vs. other popular OSes.
There's probably little need for the overhead associated with other filesystems (security, multi-user support, crash recovery, etc.).
Personally I would have preferred a custom filesystem designed for the application but I can see the convenience of being FAT-compatible. I wouldn't expect these nodes to keep much data locally (if any).
The bootloader for the kernel (called Pure64) requires this. They chose FAT16 because it's compatible with most operating systems and, not unimportantly, it's relatively easy to implement.
FPGAs can be programmed to give the answer in the time it takes the gates to propagate which is usually damn quick. None of these "cycles" things that CPUs use up.
The popularity of virtualization technology and the new trend of selling instances could make operating system development interesting again.
The requirements for an operating system have changed drastically with this new way of thinking about what it means to run one. The requirements can be as low as supporting a single process that can talk TCP and (maybe) to disk. Look at the Haskell network stack: it provides network support to an application, and you don't need an OS proper, just Xen.
I'm very excited to see where highly lightweight OSes end up.
When will this reasoning finally die?