The Soul of a New Machine: Rethinking the Computer [video] (youtube.com)
125 points by tosh on July 28, 2020 | 36 comments



The title is a reference to Tracy Kidder’s book: https://en.m.wikipedia.org/wiki/The_Soul_of_a_New_Machine


It's a bit of a shame Cantrill jumps from the PDP11/70 to the Sun blob without even a brief mention of DG's finest (the namesake company of the presentation's title, after all), such as the Nova and Eclipse ranges of their day. We should all feel cheated by this :) But anyway...

I was once upon a time a Data General field engineer back in the 80's and bumped into The Soul of a New Machine around '87. It's a great read and a nice insight into DG as a company that was still considered a wee bit of a rogue outlier compared to DEC and IBM.

I'm a member of a closed group of ex-DG employees on Facebook (I was permitted membership despite never being an employee, as I was in the broker game back then). They're a really nice bunch of folks, though growing older by the day.

I had the pleasure and luck to work on older Novas (like the 800, 1200 and 3s), the Nova 4, and the Eclipse 16-bit range such as the S/130s, plus all their associated peripherals (Phoenix and Gemini hard disks, Zebra disk drives the size of two washing machines, etc). Fun fact: you could upgrade the Nova 4 to an S/140 by "obtaining" the correct microcode PROMs and performing some other minor patching; we did this, and while it wasn't always considered legal, every other broker out there was also up to this game. DG didn't seem to mind, because by then their mainline products were the MVs.

DG was a very leaky company with regard to getting hold of "stuff"; I don't recall anyone being sued for unlicensed and pirated copies of RDOS or AOS etc. One thing that was an expensive item from DG but treated almost like a consumable was the paddle board: basically a passive PCB that interfaces between the inside of the machine and the outside world. We never bought these from DG; they always came from "some guy", and a box of 20 cost less than a single bona fide DG part. DG knew this but never complained.

The diagnostic tools (DTOS and ADES) were tremendous and, coupled with a portable fiche reader, allowed you to diagnose and fix most problems on site. These were good times, and I learned a huge amount about problem solving as a young broth-of-a-boy engineer. I still have a copy of "How to Microprogram Your Eclipse Computer", where I learned about microcode and that assembler wasn't really the true bare metal of a CPU :) I have other war stories I should write down some time.


Sorry to sell DG short! (I did give it a brief mention at the top, it was just very, very brief.) For whatever it's worth, I went into DG in more depth in my blog entry on re-reading Soul last year[1] -- one that attracted some comments from people very closely associated with the company and the book!

[1] http://dtrace.org/blogs/bmc/2019/02/10/reflecting-on-the-sou...


Hey no worries Bryan. Reading your article there just triggered a re-read of Soul for myself :)


Yep, and I was hoping the YouTube link would be a rock video... Question Stanford's position on this -- DEC had a big office in Palo Alto, but that's where we kept the feral Unix (er, Ultrix) monsters... and DEC won the 32-bit war, baby (at least until the PRISM microprocessor architecture was misappropriated into the 8080, er, Pentium "Pro").

TL;DR: Data General (a spin-off of DEC from before my time) with the Eclipse team was battling DEC (a 1950s tech giant that made its name subverting IBM) with the VAX team for "first 32-bit" bragging rights. (To make things interesting, for me at least, my Dad was an expert in Data General technology and I was a teenage mutant DEC nerd.) When I learned VAX MACRO32 they were drawing comparisons between it and the (even then, older than dirt) IBM 360 assembler, and it was totally mind-blowing. Getting rid of the memory limit on the (preceding) PDP11's separate code and data segments and introducing a 4.3GB virtual address space (the "Virtual Address eXtension") changed everything.

Prior to that, computer scientists had to rely on complex "overlay" techniques to swap program and data segments in and out of memory at the application level (under RT11 and RSX11, the preceding operating systems to DEC's 32-bit VAX/VMS, and under RDOS, the preceding operating system to DG's 32-bit AOS). Too hard for non-specialists, so someone wrote an entire operating system (RSTS/E) in BASIC, which was quickly starting to dominate in the run-up to VMS (and probably also inspired Bill Gates and his BASIC interpreter ROMs for competing microprocessors).
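For anyone who never had to do it, here's a toy sketch of the overlay idea (modern C with invented file names, not period RSX-11 or RDOS code): only one segment fits in memory at a time, so the program keeps a single fixed region and explicitly swaps segments in from disk as each phase needs them.

    /* Toy overlay sketch: one fixed region shared by all segments. */
    #include <stdio.h>
    #include <string.h>

    #define OVERLAY_BYTES (16 * 1024)     /* all we can afford to keep resident */

    static char overlay[OVERLAY_BYTES];   /* the single shared region */
    static const char *resident = "";     /* which segment is loaded right now */

    /* Load a named segment file into the shared region, evicting whatever
       was there before. Real overlay linkers generated this bookkeeping
       for code as well as data. */
    static int load_segment(const char *name) {
        if (strcmp(resident, name) == 0)
            return 0;                     /* already in core */
        FILE *f = fopen(name, "rb");
        if (!f)
            return -1;
        size_t n = fread(overlay, 1, sizeof overlay, f);
        fclose(f);
        memset(overlay + n, 0, sizeof overlay - n);
        resident = name;
        return 0;
    }

    int main(void) {
        load_segment("phase1.ovl");       /* work on phase 1's data... */
        load_segment("phase2.ovl");       /* ...then evict it for phase 2 */
        printf("resident segment: %s\n", resident);
        return 0;
    }

Getting the tree of "what may be resident alongside what" right was exactly the kind of specialist planning that a flat 32-bit virtual address space made unnecessary.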

iPhones/Macs and even the Raspberry Pi are all dumping 32-bit now, but cost-effective 32-bit lives on in nRF- and ESP-class microcontrollers that will surely "eat the planet" (who needs to carry a phone, don AR glasses/earbuds, or sit at a desk/tablet when all planar surfaces as far as you can see are I/O devices run by $0.10 microcontrollers that network to edge nodes for any memory/compute heavy lifting).


And we'll deal with all those old standards by having our own new standard!

He discusses "open firmware", which is not Open Firmware.[1] That was a boot ROM system from the 1990s. It was written in Forth, and was intended for use with a console interface. That's not what you want today. A good question to ask is, what do you want today at that level, and what do you not want. For example, a security oriented "cloud" company might want to load the machine, restart the machine, and freeze and dump the machine in an emergency, but not have the ability to examine or alter memory while running. Who patches a running production machine any more? Today's server firmware, with an administrative CPU that phones home and listens for commands to do who knows what, tends to have way too much capability for making small changes quietly and listening to what's going on.

[1] https://en.wikipedia.org/wiki/Open_Firmware
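To make that concrete, the kind of deliberately narrow management surface being described might look something like this (a hypothetical sketch in C with invented names, not any real BMC or Open Firmware interface): you can load an image, restart, and freeze-and-dump in an emergency, and that's it.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical, deliberately narrow management interface: only the
       operations a security-oriented operator needs. Note what is absent:
       no read_memory(), no write_memory(), no "patch the running image". */
    struct mgmt_ops {
        /* Install a verified system image to be used on the next boot. */
        int (*load_image)(const uint8_t *image, size_t len);

        /* Hard-reset the machine into the installed image. */
        int (*restart)(void);

        /* Emergency path: halt the machine and stream a memory dump to the
           given sink for offline analysis; the machine stays frozen. */
        int (*freeze_and_dump)(int (*sink)(const uint8_t *buf, size_t len));
    };

The interesting design question is how small you can keep that table and still operate a fleet; every extra entry is one more thing that can be used quietly on a running machine.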


I'm a big fan of Bryan's talks and this was another great one. It's easy to forget that once you move outside of the hyperscalers, what's available to you is really showing its age, or is filled with a whole bunch of honestly useless features. As such, I'm very keen to see what Oxide cooks up!


> While our software systems have become increasingly elastic, the physical substrate available to run that software (that is, the computer!) has remained stuck in a bygone era of PC architecture. Hyperscale infrastructure providers have long since figured this out, building machines that are fit to purpose -- but those advances have been denied to the mass market. In this talk, we will talk about our vision for a new, rack-scale, server-side machine -- and how we anticipate advances like open firmware, RISC-V, and Rust will play a central role in realizing that vision.


This is a bland, PR-oriented statement. I was roughly expecting this level of detail from the speaker.

This statement in particular feels rather bland: "those advances have been denied to the mass market"

What does this mean?

It was not denied; these things were just too complex for the mass market. People are happy to pay AWS so that they don't have to worry about the machines and can write JS code from day one.


No, they've been denied: I elaborate on this in the talk, but if you look at (say) an OCP-based system (e.g., Facebook's Tioga Pass[1]), the innovations in that system are simply not available for any price to the enterprise buyer. And yes, those buyers emphatically do exist -- and no, they are certainly not everyone deploying on elastic infrastructure.

[1] https://www.opencompute.org/documents/facebook-2s-server-tio...


For me as a systems engineer and systems administrator, appliances like VMware VxRail are totally infuriating at times. Especially the deeply object-oriented design of their APIs, which really hinders you from implementing anything not already present in Ansible or Terraform yourself in a reasonable amount of time. I could fill a talk ranting. They really should take a hint from Rich Hickey and stuff like "Simple Made Easy", even if they don't write any Clojure at all.

In the end, the less sophisticated Citrix XenServer we have now used for about 10 years seems to be more hackable in some ways.


Don't forget their convoluted documentation for those APIs, or how their libraries are poorly maintained and somehow have even worse documentation.


That's a 109-page PDF. If the innovations are listed in that PDF, they are not leaping out at me while skimming it.

Googling "tioga pass server" brings up https://engineering.fb.com/data-center-engineering/the-end-t... which says nothing and https://www.mitacmct.com/OCPserver_E7278_E7278-S who seem to be selling them.

Tioga Pass appears to be a small dual-socket server. How is it different from a typical dual-socket blade server?


The only advantage these hyperscale purpose-built computing platforms provide is elimination of the profit taken by HP, IBM, Dell, etc.

They are complaining about the PC heritage in the server world -- this is actually a huge convenience, in that it is standardized hardware (so, for example, it's easy to install any software made for PCs on them, including Linux). The cost of this compatibility is not very much these days (in terms of silicon area).

Blade servers have had centralized power supplies since forever ago.

Also, you can certainly get servers without CD drives :-) Rack front and back panel area is actually a limited commodity, so for example many modern servers are just packed with 2.5-inch drives.


I thought a number of vendors sell "OCP Accepted" products that use the OCP designs?

I've not yet watched the presentation, and I'm not familiar with this stuff, so apologies if I'm missing something, but what is the difference between buying a server from Oxide and buying e.g. https://www.opencompute.org/products/109/wiwynn-tioga-pass-a... (from one of the vendors on the right)?


There was always a need to make something commercially successful. But there needs to be proportional demand to justify it.

OCP cannot produce their products for the mass market unless there is strong demand. Certainly it looks like the market mainstream is not too passionate about building or managing their own machines.

I don't deny that some people, in certain circumstances, would demand offerings different from the market mainstream.

And I totally understand why a statement like "a was denied to b" was used here.

I was merely stating that, for the mass market, there is no serious demand for what's claimed to have been denied to them. And I am stating that from a technical perspective rather than a marketing or PR one. (And I am very positive about the necessity of marketing and PR.)


There are so many old (and frankly even new) line-of-business applications where the developers haven't considered, among other things, the laws of physics, like the speed of light in optical fiber. These systems (client+server applications) tend to run much better on premise. The applications are often not automated much, aren't really secured that well (so you would probably need a VPN to the cloud to run them safely), and the bandwidth of internet connections at some of these companies isn't really suitable for clients on premise and servers in the cloud anyway. You are lucky if the synchronization to a different location works well enough.

Also, cloud is very costly if you don't use the up- and especially down-scaling, because your application/infrastructure wasn't really designed for that. And if you buy some new machine for the factory, it usually comes with software (usually MS Windows Server + MS SQL Server + some machine control software) that has hardware requirements that don't really fit well with cloud pricing. Such machines tend to run for decades, and the company certainly hasn't thought about being efficient with computing resources on the server. On-premise hardware isn't that costly if you consider these factors. If the supplier cannot secure the machine properly, you slap it into its own VLAN, write an ACL for the RDP access (because that is how it is), and are done with it. Basically dedicated Gigabit speed with very little latency for any communication between the clients and the server. Remember, you are almost lucky if a Windows Update doesn't break the software or the software license on the server or the client...


In the current modular[1] structure of the industry, where the server is a product and the network is a separate product and the hypervisor is yet another product, etc., there's no demand for components that aren't compatible with the morass of existing standards. So yeah, there isn't enough demand for OCP servers and such.

It sounds like Oxide is trying to break out of that by providing the whole stack.

[1] https://stratechery.com/2013/clayton-christensen-got-wrong/


> People are happy to pay AWS

Many of them are not.

We have serious vendor lock-in now, where a very few companies are gatekeepers to almost any business that runs on the internet.

And their margins are _enormous_ on this business. It ends up costing much, much more to pay them to run our machines for us.

And increasingly, the expertise to do this is being consolidated in these companies, so the talent available to pursue any other way is diminishing as new grads never learn about the magic places their code runs.

The reliability outcomes are nearly the same, despite the deferral to their expertise.

Labor savings b/c you don't have to learn about provisioning your own machines? Not much. AWS is so sophisticated you need to develop a nearly equivalent amount of (non-portable) expertise to actually operate it well. Remember, the alternative isn't just rack-your-own, it's... dedicated hosting! And lots of other options with less lock-in and more standards.

It's sort of frightening how complicit the broader technology industry is in this power consolidation.


You beat vendor lock-in with standardization.

Vendors refuse to take part in standardization if they absolutely have the leverage.

Remember Amazon's reluctance to join the CNCF and container groups?

If you state that you are not happy, Amazon is perfectly ready to do whatever they can to please you, as stated in their "customer obsession" (and I assure you that that statement is as sincere as any human stating any commitment).

But back to the point: people in the mass market are, for the most part, no longer interested in managing machines, let alone building them themselves.


This is probably the most thorough public explanation of what we're doing over at Oxide.


When you posted about joining Oxide I couldn't quite figure out what they (now you) do by looking at the homepage. I stumbled across this other lecture (https://youtu.be/3LVeEjsn8Ts?t=2189) that is along the same line of thinking, and it started to make sense.


Their podcast, "On the Metal" provides more context at the work they're aiming to accomplish.


Thanks, I'll have to check this out!


Got a TL;DW for us? Video is 86 minutes long.


Basically, they want to build rack-scale computers with open/auditable firmware all the way down and really design the hardware for "hyperscale"-like computing. That means no VGA/USB/DVD on the server, power and networking will probably be handled for many servers at once, and there will be APIs for all of the low-level stuff that is probably inconsistent across your typical Dells, HPEs, Lenovos, and SuperMicros.

I find Bryan Cantrill's talks are generally worth watching, even if just for the entertainment if nothing else.


I used to love listening to him, really (I still remember his DTrace talk fondly), but this one was hard to focus on. Lots of "uh", "ah", "hum". Surprising.


The sibling comment is good.

The problem that we're trying to solve is basically laid out on this slide: https://youtu.be/vvZA9n3e5pc?list=PLoROMvodv4rMWw6rRoeSpkise...

The business is "we will be selling servers." You can't buy any yet, but in the future, you'll be able to.

The talk lays out a history of servers, describes the problems with the servers that you can buy from vendors today, and lays out why we think we can build better ones.


Good luck, you will need lots of it.


To add to the sibling comments, I think this endeavor involves RISC-V-based hardware supported by Rust software. That info is buried deep within the talk; I scrubbed through, so YMMV.

The specific complaint I (and the GP) have is: synopsis, synopsis, synopsis. Give me a few-paragraph summary before asking me to invest an hour and a half of my time.


Good luck/ "kick ass and have fun" while pushing computing forward. I applaud the effort to make the very foundations of computing more robust and introspectable. The most laudable goal seems to me to be especially the general accessibility of some of these achievements in the long run even to non-customers. Maybe, open and robust firmware will become the standard. Please also embrace IPv6 to prevent taking all the brokenness in that area (e.g. network boot, remote management) for the ride into the 21. century.


Thank you! It has been a lot of fun so far. My colleagues are some of the smartest, most helpful people I've ever worked with. I look forward to that future too :)


We will evaluate the next hardware generation at our company probably sometime in 2022-2023. I sure would love to get my hands on an Oxide computer :-) though I fear we are more in the 3-10x 2U rack computer range for most locations. This is probably the sizing for most businesses in central Europe, e.g. Germany.


This is a great talk! I have to admit my first thought wasn't of the data center, which is obviously the predominant global energy waste, but of the "local" problem of green compute for IoT / drones / autonomous systems. What's the state of the art in OS development for high-efficiency embedded hardware, such as Contiki, TinyOS, RIOT, Zephyr, Mbed and Brillo? And what are the major insights that are missing?


I don't know! Since we're focused on the data center, that's where I have been too. I joined partially for personal growth; this is an area that I don't know as much about as I'd like to. It's only been a few weeks, but I've learned a ton. And there's enough of it in that space that I haven't had as much time or energy to look into other spaces. I do agree that there are a ton of IoT-ish things, and that they matter.


A 1.5-hour video with no context and no TL;DR top comment did NOT make it organically to the top of HN; it was upvoted strategically by members of Oxide. Could you at least do the rest of us a solid and post the TL;DR?



