Cray-1 vs Raspberry Pi (roylongbottom.org.uk)
360 points by stare_spb on Dec 25, 2023 | 180 comments



When I see comparisons like this, the first thought I have is not the benchmarks, but rather what the most “heroic” real-world calculation of the day would have been on something like the Cray-1, and how to replicate those calculations today on something like a RPi. Weather/climate models? Rad-hydro?

The fidelity would almost certainly be super low compared to modern FEA software, but it would be a fun exercise to try.


nuclear weapons simulations

the first machine went to Los Alamos

https://www.theatlantic.com/technology/archive/2014/01/these...


The demand for huge calculations for the design of nuclear weapons started back in WWII:

https://ahf.nuclearmuseum.org/ahf/history/human-computers-lo...

"The staff in the T-5 group included recruited women who had degrees in mathematics or physics, as well as, wives of scientists and other workers at Los Alamos. According to Their Day in the Sun: Women of the Manhattan Project, some of the human computers were Mary Frankel, Josephine Elliot, Beatrice “Bea” Langer, Augusta “Mici” Teller, Jean Bacher, and Kay Manley. While some of the computers worked full time, others, especially those who had young children, only worked part time.

General Leslie R. Groves, the Director of the Manhattan Project, pressured the wives of Los Alamos to work because he felt that it was a waste of resources to accommodate civilians. As told by Kay Manley, the wife of Los Alamos physicist John Manley, the recruitment of wives can also be traced to a desire to limit the housing of “any more people than was absolutely necessary.” This reason makes sense given the secretive nature of Los Alamos and the Manhattan Project. SEDs, a group of drafted men who were to serve domestically using their scientific and engineering backgrounds, also worked in the T division."


> rad-hydro

These are incredibly expensive even on today’s hardware. If you look through some of the unclassified ASCI reports from the early 2000s, 3D calculations of this equation set were implied to be leadership-class computations. At the time of the Cray, it must’ve been coarse-grid 1D as the standard, with 2D as the dream.


I've always been interested in this. I wonder how optimised the code was, and whether they used LUTs (look-up tables) as they did in the 80s for 3D calculations on home computers.

Oh cool they got CrayOS working.

But still only 1 MB of RAM; I remember getting the slow-RAM 512 KB upgrade for my Amiga 500 in the early 90s.


One of the early customers was the European Centre for Medium-Range Weather Forecasts, so, wild guess, they probably used it for medium-range weather forecasts.


> they probably used it for medium-range weather forecasts

in europe


FWIW, Australia used a CDC Cyber 205 for occasional weather modelling and other mathematical work in the early 1980s.

(There was a separate dedicated weather computer; this one was used for 'other' jobs like speculative weather modelling, monster group algebraic fun, et al.)

https://en.wikipedia.org/wiki/CDC_Cyber

The UK was the first customer:

    In 1980, the successor to the Cyber 203, the Cyber 205 was announced. The UK Meteorological Office at Bracknell, England was the first customer and they received their Cyber 205 in 1981.


I thought the ECMWF models were (and always have been) global?


Only centred on Europe.


FWIW I meant "made in Europe" (as opposed to models of Europe).


> European Centre for Medium-Range Weather Forecasts


Numerically, I’m currently what this would have looked like. I’m talking about the governing equation set, discretization methods, data, etc. It would be a fun project to try and implement a toy model like that.


> It would be a fun project to try and implement a toy model like that.

If you really want a challenge, do it using pen, paper and a slide rule, like in the old days[1]. Just make sure to apply appropriate smoothing of the input data first[2].

[1]: https://www.smithsonianmag.com/history/how-world-war-i-chang...

[2]: https://arxiv.org/abs/2210.01674


Not the Cray-1, but the Navy used a Cray 90 a few years later for CFD calculations modeling flow around ship hulls (code written in Fortran).

I wish I had access to the code I wrote back then - what took minutes or hours on the Cray could probably run in seconds on a RPi now…


You could always start with loading up Spec ‘06, which contains micro kernels of such “heroic” workloads.


I toured an NCAR (National Center for Atmospheric Research) facility in Boulder around 1979; got to sit on a seat on their Cray-1. So yes, weather and climate calculations.


A Cray-1 could execute an infinite loop in 7.5 seconds!


In a similar way to how Chuck Norris counted to infinity... twice?




Touch grass.


Quite impressive, but I can't avoid noticing you did not go near a higher challenge, like compiling a C++ program in under 4 weeks... /s


You could get some vintage matrices from SuiteSparse (formerly the University of Florida Sparse Matrix Collection).


3-D rendering? We had a supercomputing club in early-90s high school. I remember creating wireframe images, uploading them to a Cray X-MP at Lawrence Livermore for the computation, and then downloading the finished results.


I knew a guy who worked at one of the national labs that had its own Cray-1 supercomputer, in a machine room with a big observation window that visitors could admire it through.

Just before a big tour group of VIPs he knew would come by, he hid inside the Cray-1, and waited patiently for them to arrive.

Then he casually strolled out from inside the Cray-1, pulling up the zipper of his jeans, with a sheepish relieved expression on his face, looked up startled at the tour group gaping at him through the window, and scurried off!


To which they said, "Urine; A lot of trouble."


Is there any software, apart from benchmarks, that will make it feel that fast? All the software I use feels like a more advanced version of things I ran on my 386: GUI, IDE, compiler, office...

I understand that the exercise may still be theoretical, as any software the Met uses now will be designed for computers a million times faster. But there should exist some software that would have required a Cray to run back then.


No. The Cray-1 ran batch processing jobs, mostly scientific simulations which took hours or days to compute. It had a terminal interface and wasn't used for real time interactive applications.


IIRC, in real world installations the Cray would have been paired with a minicomputer or mainframe.


The Wiki article on the Cray-1 indicates that it had a Data General Eclipse as its front-end processor.


So should we pair with a PiPDP-11?:)


It was a later machine than the Cray-1, but I remember seeing a post on the jsoftware forums from someone that ran interactive APL on a Cray. Having a REPL to access a machine like that in that era must have been quite something.


Crays were not good interactive machines for a few reasons, but lacking virtual memory was the most notable.


What do you mean by "virtual memory"?

The Cray-1 had memory management hardware similar to that on DEC PDP-10s with KA-10 processors and on numerous 68K Unix workstations in the '80s, all of which served as interactive time-sharing systems just fine.

The memory management hardware in those systems would take the addresses generated by the instructions, compare them to a limit register, and if below the limit would then add them to a base register to get the final memory address. Some systems would have more than one pair of base/limit registers, such as a pair for code and pair for data.
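In code form, that relocation step is just this (a minimal C sketch; the register widths and trap behaviour here are assumptions for illustration, not the actual Cray or KA-10 details):

    #include <stdbool.h>
    #include <stdint.h>

    struct segment { uint32_t base, limit; };   /* one base/limit pair */

    /* Translate a program-generated address; returning false models a trap. */
    static bool relocate(struct segment s, uint32_t vaddr, uint32_t *paddr)
    {
        if (vaddr >= s.limit)       /* above the limit: illegal access, trap */
            return false;
        *paddr = s.base + vaddr;    /* otherwise just add the base */
        return true;
    }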

Some people do call such systems virtual memory, because the addresses specified in the code are not the addresses that the memory sees (unless the base register contains 0). But I think systems where all the MMU did was relocate segments, and where all of the program's code and data had to be in memory for the program to run, were more commonly not considered to be virtual memory. Virtual memory usually meant systems that allowed you to run programs that used address spaces larger than the available memory.


Back in the day, the KA10 under Tops10 could “go virtual” which meant that some of the processes would overflow to spinning metal, either disk or IIRC drum. Otherwise the various processes would all be in RAM.


Yeah, the total size of all processes could be more than the available RAM. The key difference between systems like TOPS-10 on a KA10 and the systems people usually reserved the word "virtual memory" for is that to run a given process on TOPS-10 on a KA10, the process had to be entirely in RAM. If a process had, say, 100k words of code, you needed to give it 100k words of contiguous RAM for its code segment.

If you had two processes that each needed 100% of the RAM that was available for user programs, the operating system would have to keep one in RAM and one swapped entirely out to disk or drum. On a task switch it would have to entirely swap out the current process, and entirely swap in the other process.

Even if each process was actually spending most of its time in just 10% of its code and most of the time was just actively using 10% of its data, it had to have 100% of its code and 100% of its data present.

The systems people usually gave the name virtual memory to had more sophisticated memory management hardware that would let them map a contiguous process address space to a discontiguous physical address space, and you could have gaps in the mapping. Attempts to access gaps would interrupt the process in a resumable way, so the OS could handle that interrupt, map memory into that gap, and resume the process. With these systems they could run both of those 100%-memory-using programs at the same time while only having to actually allocate enough RAM to cover the 10% of the memory that the two programs were actively using. Context switches between the two did not have to touch drum or disk and so were much faster.

When a program switched to using a different 10% of its code or data, the OS could then map that to real RAM. Since both programs do, over time, use 100% of their logical RAM, eventually the OS will have to start moving data back and forth between RAM and the drum or disk. But it is not having to do that on every context switch, like those systems that needed programs to be 100% in RAM did. Also, since it only has to load regions of the program that are actively being used instead of the whole thing, when it does have to use the disk or drum it is for a smaller amount of data and so is faster.
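A toy sketch of that demand-paged lookup, with assumed page size and table layout (nothing here is specific to any real machine):

    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_WORDS 512u            /* assumed page size in words */
    #define NPAGES     256u            /* assumed address-space size in pages */

    struct pte { bool present; size_t frame; };

    /* Translate a virtual word address; a missing page "faults" to the OS,
       which maps a frame and lets the access resume. */
    static size_t translate(struct pte table[NPAGES], size_t vaddr,
                            size_t (*handle_fault)(size_t page))
    {
        size_t page = vaddr / PAGE_WORDS, offset = vaddr % PAGE_WORDS;
        if (!table[page].present) {                /* resumable page fault */
            table[page].frame   = handle_fault(page);
            table[page].present = true;
        }
        return table[page].frame * PAGE_WORDS + offset;
    }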

Some TOPS-10 memory trivia. The "core" command allocates memory. On the KA-10 memory was allocated in multiples of 1k words, so "core 1k" was the minimal amount you could allocate. On the later KL-10 model (and I think the KI-10 model but am not sure) the hardware supported allocations in pages, which were half that size. On those you could use "core 1p" to allocate one page.

I saw that in the manual, and was curious what the error message would be if you tried to allocate 1 page on a KA-10. So I typed that command to see what it would say. The command never returned.

I then noticed there was silence in the room and some cursing, and it was apparent the system had crashed. I didn't think much of it because the system crashed fairly often. This was at Caltech and that PDP-10 was the computer that most undergraduate accounts were on, and people were always trying to push limits and explore interesting edge cases. So I figured this was another one of those.

I waited for the admins and operators to bring it back up, and then typed "core 1p" again hoping to this time get to see the error message. It crashed again!

When it came back up, and before I could try a third time, a broadcast message from the admins was sent to all terminals. It said something like "If you want to know why the system keeps crashing, ask tzs who is currently sitting at terminal 7".

Oops. Oh well, at least now I knew that there was not an error message for "tried to allocate 1 page on a system that does not support pages".


From what I remember, the machine they were using had a Unix workstation that acted as the front end, so the Cray was used just as an accelerator. APL uses a stack-based approach to memory management that maps well to bare hardware with minimal OS features.


The best bet might be an old physics code written in Fortran. Maybe calculation of scattering cross sections from matrix elements, or something with a lot of vectorizable linear algebra.


A lot of the stuff that took hours back then can now be done in sub-seconds.

Think about mechanical engineering: Back then they might have simulated how cars deform in a crash. Now we can perform similar simulations in real-time for fun in our video games. Afaik it's hardly ever done because no one actually needs physically accurate models in games, but it could be done.

Same goes for rendering: back then they rendered each frame of Toy Story for a good few hours; now we achieve arguably better graphics in real time.


> Think about mechanical engineering: Back then they might have simulated how cars deform in a crash. Now we can perform similar simulations in real-time for fun in our video games. Afaik it's hardly ever done because no one actually needs physically accurate models in games, but it could be done.

BeamNG is basically that.


You beat me to it! Such a fun game


I think there's something to be said for the wait period. The time to anticipate the result makes it feel very worthwhile, especially when only a handful of machines on the planet can do the calculation.

It all feels very mundane when I can do it on my slow commodity laptop in under a second.


I'm waiting for the follow-up from Jeff Geerling where he fits a Raspberry Pi into a Cray-1 enclosure.


I would totally love a Cray-1-shaped PC enclosure. An even smaller one for Raspberry Pis would also be cool.



Or as many as he can with networking


It'd make more sense to compare with a RISC-V that has Vector 1.0.

Because a vector machine is what it was.


Has anyone taped out a RISC-V CPU with hardware vector support yet?


Yes, several, with the Vector 1.0 specification.

Some of them (the Kendryte K230, an MCU) have already shipped to people.

Years ago, some chips shipped with 0.7.1 (incompatible, pre-ratification). One of them is the TH1520, the SoC in some SBCs released earlier this year.


Well, or compare to a GPU or a TPU


Those are largely simd but not vector.


That's fair. I haven't spent any time looking through the RISC-V vector extensions yet. I look forward to it, though.


They are said to be "inspired" by classic Cray vector instructions, although I have of course never used a Cray :-( so I can't comment on how true that is. I did use a Convex C2[0] for a while which also had real vector instructions, but it was all hidden behind a compiler option.

[0] https://en.wikipedia.org/wiki/Convex_Computer


How is vector different from simd?


The RISC-V Vector extension allows the vector length to vary at runtime whereas with SIMD the vector length is fixed at compile time (128 bit, 256 bit etc.). It means the code is more portable basically.

With x86 SIMD the standard solution is to compile the same code multiple times for different SIMD widths (using different instructions) and then detect the CPU at runtime. Though that is such a pain that it's only really done in explicitly numerical libraries (Numpy, Eigen, etc.). In theory with Vector you can compile once, run anywhere.
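For a sense of what that dispatch looks like in practice, here's a hedged sketch using GCC/Clang target attributes and __builtin_cpu_supports; the kernels are just placeholder loops left to the auto-vectorizer, not anything from a real library:

    #include <stddef.h>

    __attribute__((target("avx2")))
    static void scale_avx2(float *x, size_t n, float a)
    {
        for (size_t i = 0; i < n; i++)   /* compiled with 256-bit instructions */
            x[i] *= a;
    }

    __attribute__((target("sse2")))
    static void scale_sse2(float *x, size_t n, float a)
    {
        for (size_t i = 0; i < n; i++)   /* compiled with 128-bit instructions */
            x[i] *= a;
    }

    void scale(float *x, size_t n, float a)
    {
        if (__builtin_cpu_supports("avx2"))   /* pick the widest supported kernel */
            scale_avx2(x, n, a);
        else
            scale_sse2(x, n, a);
    }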


That design seems like a reasonable thing for a high level language that could then be converted to different architectures’ simd widths. But I’m kinda surprised it’s good at the ISA level. Eg for something like a vectorized strlen, mightn’t one worry that the cpu would choose vlen[1] too large, causing you to load from cache lines (or pages!) that turn out to be unnecessary for finding the length of the string? With the various simd extensions on x86 or arm, such a routine could be carefully written to align with cache lines and so avoid depending on reading the next line when the string ends before it. I also worry about various simd tricks that seem to rely on the width. Eg I think there’s some instruction to interpret each half-byte of one vector as an index into some vector of 16 things. How could these be ported to risc-v? Or maybe that’s not the sort of thing their vector extensions are meant for.

I guess part of my thinking here is that the ISA designers at intel, arm, aren’t stupid, but they ended up with fixed widths for sse, neon, knights landing, avx, avx-512. Presumably they had reasons to prefer that to the dynamic risc-v style thing. So I wonder: are there some risc-v constraints that push this design (eg maybe low-power environments pushed neon to have a small width and this made higher-power environments suffer; having a dynamic length might allow both to use the same machine code), or were there some reasons intel preferred to stick with fixed widths, eg making something that could only work on more expensive chips and thereby having something people can pay more for? Is there something reasonable written about why risc-v went with this design?

[1] what do you even pass to vsetvl in this case as you don’t know your string length.


I'm not sure it is difficult to see why variable length SIMD makes sense. If you want to process 15 elements with a width of 8, you will need the function twice, once with SIMD processing whole batches of 8 elements and a scalar version of the same function to process the last 7 elements. This makes it inherently difficult to write SIMD code even in the simple and happy case of data parallelism. With RISC-V all you do is set vlen to 7 in the last iteration.
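In plain C, the strip-mined loop looks roughly like this (a conceptual sketch only: setvl just models vsetvl returning min(remaining, VLMAX), and VLMAX stands in for whatever the hardware reports; no real intrinsics are used):

    #include <stddef.h>

    enum { VLMAX = 8 };   /* assumed hardware vector length in elements */

    static size_t setvl(size_t n) { return n < VLMAX ? n : VLMAX; }

    void saxpy(size_t n, float a, const float *x, float *y)
    {
        while (n > 0) {
            size_t vl = setvl(n);            /* last pass just gets a shorter vl */
            for (size_t i = 0; i < vl; i++)  /* stands in for one vector op */
                y[i] += a * x[i];
            x += vl; y += vl; n -= vl;
        }
    }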

>what do you even pass to vsetvl in this case as you don’t know your string length.

I'm not sure what you are trying to say here. You must know the length of the buffer, if you don't know the length of the buffer, then processing the string is inherently sequential, just like reading from a linked list, since accessing even a single byte beyond the null terminator risks a buffer overflow. Why pick an example that can't be vectorized by definition?


I wonder if you’re using a different definition of ‘vectorized’ from the one I would use. For example glibc provides a vectorized strlen. Here is the sse version: https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/m...

It’s pretty simple to imagine how to write an unoptimized version: read a vector from the start of the string, compare it to 0, convert that to a bitvector, test for equal to zero, then loop or clz and finish.

I would call this vectorized because it operates on 16 bytes (sse) at a time.
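Something like this rough SSE2 sketch (using GCC/Clang builtins; it deliberately keeps the naive unaligned loads, so it has exactly the issues listed next):

    #include <emmintrin.h>
    #include <stddef.h>

    size_t strlen_sse_naive(const char *s)
    {
        const __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; ; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
            if (mask)                            /* some byte in the chunk was 0 */
                return i + __builtin_ctz(mask);  /* offset of the first zero byte */
        }
    }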

There are a few issues:

1. You’re still spending a lot of time in the scalar code checking loop conditions.

2. You’re doing unaligned reads which are slower on old processors

3. You may read across a cache line forcing you to pull a second line into cache even if the string ends before then.

4. You may read across a page boundary which could cause a segfault if the next page is not accessible

So the fixes are to do 64-byte (ie cache line) aligned accesses which also means page-aligned (so you won’t read from a page until you know the string doesn’t end in the previous page). That deals with alignment problems. You read four vector registers at a time but this doesn’t really cost much more if the string is shorter as it all comes from one cache line. Another trick in the linked code is that it first finds the cache line by reading the first 16 bytes then merging in the next 3 groups with unsigned-min, so it only requires one test against a zero vector instead of 4. Then it finds the zero in the cache line. You need to do a bit of work in the first iteration to become aligned. With AVX, you can use mask registers on reads to handle that first step instead.


Another point is that the CPU can sequence the multiple calls to its internal SIMD unit internally, without that having to be done by user code. In the extreme case this degrades to a Cray-1-like vector unit, which still has measurable performance impact and can be implemented even in very resource-constrained environments.


> what do you even pass to vsetvl in this case as you don’t know your string length.

To the maximum, of course, `vsetvl x<not 0>, x0, ...` will do that for you.

You might read over a page boundary, but there is an instruction for that, `vle8ff.v`: a fault-only-first unit-stride load. That is, it doesn't fault when one of the later elements goes outside our page, and adjusts the vector length accordingly. [0]

> With the various simd extensions on x86 or arm, such a routine could be carefully written to align with cache lines and so avoid depending on reading the next line when the string ends before it.

In practice, on current very early hardware, doing just that is faster, and definitely possible. [1]

> I also worry about various simd tricks that seem to rely on the width. Eg I think there’s some instruction to interpret each half-byte of one vector as an index into some vector of 16 things

I'm not aware of such an instruction, but rvv has vrgather.vv and vrgatherei16.vv to do something similar. Note that you can always return to "fixed-size" implementations if you really need to, by just setting the vl accordingly. But for the most part I don't think this will be necessary. Do you have any specific problem in mind that may seem hard to do without fixed-size SIMD?

> I guess part of my thinking here is that the ISA designers at intel, arm, aren’t stupid, but they ended up with fixed widths for sse, neon, knights landing, avx, avx-512.

I mean, ARM now has something similar with SVE, and x86 has a metric ton of legacy to work with and the market size to get adoption of whatever new instruction prefix they add. Edit: Also, didn't AVX10/AVX512vl go into a similar direction as SVE?

> Is there something reasonable written about why risc-v went with this design.

I'm not quite sure, but I remember that it was one thing from very early in the design. I think it's to have a mostly unified ecosystem for the binary app market. It also makes working with mixed precision easier, because it allows for the LMUL model.

But RISC-V is extendable; the P (Packed SIMD) extension is currently in the works and aimed at the embedded market, for in-GPR SIMD operations for DSP-type applications. [2]

[0] https://github.com/riscv/riscv-v-spec/blob/master/example/st...

[1] https://camel-cdr.github.io/rvv-bench-results/canmv_k230/str...

[2] https://lists.riscv.org/g/tech-p-ext/topics


Thanks for all the detailed information! That answers a bunch of my questions and the implementation of strlen is nice.

The instruction I was thinking of is pshufb. An example ‘weird’ use can be found for detecting white space in simdjson: https://github.com/simdjson/simdjson/blob/24b44309fb52c3e2c5...

This works as follows (a rough sketch follows the list):

1. Observe that each ascii whitespace character ends with a different nibble.

2. Make some vector of 16 bytes which has the whitespace character whose final nibble is the index of the byte, or some other character with a final nibble different from that index (eg the first element is space = 0x20; the next could be eg 0xff, but not 0xf1, as that ends in the same nibble as its index).

3. For each block where you want to find white space, compute pcmpeqb(pshufb(whitespace, input), input). The rules of pshufb mean (a) non-ascii (ie bit 7 set) characters go to 0 so will compare false, (b) other characters are replaced with an element of whitespace according to their last nibble so will compare equal only if they are that whitespace character.
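Putting those three steps together, a rough SSSE3 sketch (my own table values, not simdjson's; any filler >= 0x80 works because it can never equal an ASCII input byte, and high-bit inputs are zeroed by pshufb anyway):

    #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */

    /* Returns 0xFF in each lane holding a JSON-style whitespace byte
       (space, \t, \n, \r), 0x00 elsewhere. */
    static __m128i whitespace_mask(__m128i input)
    {
        /* table[k] = the whitespace char whose low nibble is k, else -1 (0xFF) */
        const __m128i table = _mm_setr_epi8(
            ' ',  -1,  -1,  -1,  -1,   -1,   -1,  -1,
             -1, '\t', '\n', -1,  -1, '\r',  -1,  -1);
        __m128i shuffled = _mm_shuffle_epi8(table, input);  /* pshufb */
        return _mm_cmpeq_epi8(shuffled, input);             /* pcmpeqb */
    }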

I’m not sure how easy it would be to do such tricks with vgather.vv. In particular, the length of the input doesn’t matter (could be longer) but the length of white space must be 16 bytes. I’m not sure how the whole vlen stuff interacts with tricks like this where you (a) require certain fixed lengths and (b) may have different lengths for tables and input vectors. (and indeed there might just be better ways, eg you could imagine an operation with a 256-bit register where you permute some vector of bytes by sign-extending the nth bit of the 256-bit register into the result where the input byte is n).


I'm actually doing something quite similar in my in-progress Unicode conversion routines.

For utf8 validation there is a clever algorithm that uses three 4-bit look-ups to detect utf8 errors: https://github.com/simdutf/simdutf/blob/master/src/icelake/i...

Aside on LMUL, if you haven't encountered it yet: rvv allows you to group vector registers when setting the vector configuration with vsetvl, such that vector instructions operate on multiple vector registers at once. That is, with LMUL=1 you have v0,v1...v31. With LMUL=2 you effectively have v0,v2,...v30, where each vector register is twice as large. With LMUL=4 v0,v4,...v28, with LMUL=8 v0,v8,...v24.

In my code, I happen to read the data with LMUL=2. The trivial implementation would just call vrgather.vv with LMUL=2, but since we only need a lookup table with 128 bits, LMUL=1 would be enough to store the lookup table (V requires a minimum VLEN of 128 bits).

So I do six LMUL=1 vrgather.vv's instead of three LMUL=2 vrgather.vv's, because there is no lane crossing required and this will run faster in hardware (see [0] for a relevant micro benchmark):

        # codegen for equivalent of that function
        vsetvli a1, zero, e16, m2, ta, ma
        vsrl.vi v16, v10, 4
        vsrl.vi v12, v12, 4
        vsetvli zero, a0, e8, m2, ta, ma
        vand.vi v16, v16, 15
        vand.vi v10, v10, 15
        vand.vi v12, v12, 15
        vsetvli a1, zero, e8, m1, ta, ma
        vrgather.vv     v18, v8, v16
        vrgather.vv     v19, v8, v17
        vrgather.vv     v16, v9, v10
        vrgather.vv     v17, v9, v11
        vrgather.vv     v8, v14, v12
        vrgather.vv     v9, v14, v13
        vsetvli zero, a0, e8, m2, ta, ma
        vand.vv v10, v18, v16
        vand.vv v8, v10, v8



This works for every VLEN greater than 128 bits, but implementations with larger VLENs do have to do a theoretically more complex operation.

I don't think this will be much of a problem in practice though, as I predict most implementations with a smaller VLEN (128,256,512 bits) will have a fast LMUL=1 vrgather.vv. Implementations with very long VLENs (e.g. 4096 bits, like ara) could have a special fast path optimizations for smaller lookup ranges, although it remains to be seen what the hardware ecosystem will converge to.

I'm still contemplating whether or not to add a non-vrgather version and runtime dispatch based on large VLENs or quick performance measurements. In my case this would require more than 30 instructions when done trivially. Your example would require about 8 eq + 8 and vs 4 shuffle + 4 eq; that isn't that bad.

vrgather.vv is probably the most decisive instruction when it comes to scaling to larger vector lengths.

[0] https://camel-cdr.github.io/rvv-bench-results/canmv_k230/byt...

PS: I just looked over my optimized strlen implementation and realized it had a bug. That's fixed now, and the hot path didn't change, just the setup didn't work correctly.


Perhaps another way to do gather with mixed sizes would be with a combination of the vector and packed simd extensions, using a simd register for the lookup table.


I think I'll propose a vrgatherei4, analogous to vrgatherei16, but with 4 bit indices, for LUTs, when I eventually finish my project with a blog post.


TPUs tend to be specialised for matrix multiplications, often at low precision.


The RPi, unlike the Cray-1, does not offer ample sitting space.


Somebody made a Cray-themed Pi Zero cluster which perhaps fits the bill (if you’re a mouse, that is): https://www.clustered-pi.com/blog/clustered-pi-zero.html



But for the price of the Cray, even without adjusting for inflation, you could buy a useful number of chairs. And just think of the electricity cost savings!


yeah, the pi cases have been disappointing.

Few have adequate cooling. (flirc is good, the ones with fans are just annoying)

I'd love to have a pi case that had a built-in breadboard.

...or a case with comfortable seating.


I like the Flirc aluminum case. My Pi 5 case arrived last week. Now waiting for my new Pi.


It's not just for sitting. The ultimate hacker fantasy is to get laid on a Cray-1 couch!


Nor does it have quite the panache in its spartan design.


Why not?


I'd just bought the latest and most expensive Intel x86 CPU in 2013 and built myself a new rig. My wife walked into the office. "You're not working, I can tell that, but I'm not sure what you're doing," she said, looking at the graphs on my screen.

"I'm calculating to see when my PC would have been the fastest on Earth. It looks like in 1992 it would be able to out-compute the latest Dept of Defense $90m supercomputer that filled an entire room, would you believe?"

"That's lovely. How will that help us pay our credit bills?"

Jesting aside, there is a bunch of data for this, like this set here:

https://en.wikipedia.org/wiki/TOP500

And if you extrapolate backwards or find older data, like I did, I came to the conclusion that if I took my PC back to 1981 it would actually be faster than every computer on Earth combined, or some insane statistic like that.


One of my favorite machines from Top500 is SystemX.

https://www.top500.org/system/173736/

When it was commissioned in 2004, this array of 1100x Apple PowerPC 970 systems was the 7th most powerful computer on the list.

Its Linpack performance was 12,250.00 GFlop/s.


My favorite was the 33rd in line at the time, which was made up of 1700 Sony PS3s.

https://www.google.com/amp/s/phys.org/news/2010-12-air-plays...


They were more or less cousins, as they were both based on the PowerPC CPU architecture.


However, the Sony Cell had just a PowerPC controlling core. The real magic, and why it was used in supercomputers at the time, is in its SPE stream cores; they were highly tailored for vector and floating-point maths.


When the DoE claimed the PS3 could be a dual purpose munition, they weren’t kidding.


Saddam Hussein did try to buy a load of PlayStations at some point.


He also had WMDs. /s


And Anna Nicole married for love

(Great line in the movie Shooter.)


About the same headline number as a $350 Xbox Series X! Although fp64 vs fp32 and Linpack vs peak.


You forgot the best part: It was colloquially referred to as the “Big Mac”.



The other fun thing is to find out the most recent year your phone would have made the bottom of the top 500 list.


Looks like June 2002 for Pixel 8


My phone is so dumbed down that any comparison is useless. It is like driving a Ferrari through a corn field.


The Cray-1 has a much better couch.


The Cray-1's padded seats cover the power supplies. I was sad to see the Cray-1 at the Computer Museum in Mountain View being used as storage for catering supplies for some event in the lobby.


We need to bring back computer seating


I've sat on my raspi 3b a couple of times. Can't honestly recommend it.


Make seats that are 19" wide and meant to go in front of racks. Or ones that are rack-mounted and pull out.


TL;DR: ten years ago, Raspberry Pis and Android phones were a handful of times faster than the Cray-1. Nowadays, they are around 100 times faster. Pretty impressive, considering they fit in our pockets.


Thank god they have a modern OS to slow them down. /s


No /s required; Wirth's law holds far more solidly than Moore's law. You may live to see man-made software bloat beyond your comprehension.


I ran some comparisons a few years ago between a SPARCstation 20 Model 60 (the system for which the BYTE UNIX Benchmark is calibrated) and Raspbian on a Raspberry Pi: https://news.ycombinator.com/item?id=10795324

An original RPi is about 6-7x the performance of a SPARCstation 20, according to the benchmarks.


I remember watching a TV program about wave/fluid simulation (real time?) as a wow-wee demonstration of the power of a Cray-1.

In the meantime plenty of colleges and companies were running entire departments on a PDP-11 that had a fraction of the power.

A Raspberry Pi faster than a Cray-1 is a cool benchmark of how far we have come! The Cray had built-in seating though, which the Pi doesn't! :-)


It's wild to imagine that 40 or so years from now, someone will have a drawer full of cheap plastic boxes, each with more power than the fastest computing cluster of 2023... promising to themselves that one day they're finally going to build that hobby project with one of them.


Didn’t the Apollo guidance computer, which took people to the moon, have 4K of RAM? Today, 1 million times that barely runs the OS and a few Chrome tabs.


I hope the following from Wikipedia is helpful:

The computer had 2048 words of erasable magnetic-core memory and 36,864 words of read-only core rope memory. Both had cycle times of 11.72 microseconds. The memory word length was 16 bits: 15 bits of data and one odd-parity bit. The CPU-internal 16-bit word format was 14 bits of data, one overflow bit, and one sign bit (ones' complement representation). [1]

1. https://en.wikipedia.org/wiki/Apollo_Guidance_Computer


The Russians ran computers with ferrite-plate RAM of similar size on submarines well into the 90s, or maybe even the 00s. They are still using software written for them on Kilo submarines, in some sort of VMs.


I’ve looked inside that capsule. I wouldn’t ride in it to the grocery store!


such a dystopian future... I hope we are not still using plastic then :)


How is this dystopian? Plastic is a really great material -- and way more eco-friendly than metal for building computers. It's only mass-produced single-use plastics like water bottles that are bad for the environment.


How is it more eco-friendly? AFAIK Most plastics used in such applications are not practically recyclable, whereas metals are.


I wish we could come up with a plastic that would biodegrade after a fixed amount of time, say 200 years.


You want the stored carbon in plastics to escape??

The best outcome for plastics would be to bury them very deep (like nuclear waste), where they could eventually become some new oil-like substance. No carbon escape.


Yeah, but no one is burying the plastic; it's too expensive. So realistically I'd much rather have the plastic break down so it's not everywhere for 10k years.


A lot of plastic actually gets buried. For example, in Washington’s King County non-recycled waste is buried. Perhaps this needs to be done world-wide.


There's no shortage of those (e.g. https://www.youtube.com/watch?v=h1zDJ1qZTlg and https://youtu.be/F2zm87p8f7M?t=1857, which I just happened to have handy); what there is a shortage of is economical ones (or scalable ones, the other major obstacle), compared to the current situation where the negative externalities are not priced into the sales, leading to perverse incentives.


That’s just biodegradable plastic. It will last a couple of days out in the open. I actually mean plastic that’s lasts for multiple lifetimes but still starts to break down only after ~200 years.


You really think so? Aren't we at the end of Moore's law? I'm really doubtful that we'll see massive leaps like that.


Moore's law will die when we have 3D stacks of silicon thick enough that even integrated liquid cooling can't keep it cool. With feature sizes measured in a few atoms.

Or when economics of fabricating such structures just aren't worth it.


Yeah, made out of something other than silicon.

Moore's 'law' is a human-driven law.

Computing is basically the absolute center of our society.

As long as our civilization exists we will spend massive resources on this.

Thus as long as it’s physically possible we’ll have progress.


> Aren’t we at the end of Moore’s law

Three semi manufacturers are telling their investors they'll be at 2nm (or something) in late 2024 or 2025. So no, Moore's law has not seen its end, despite ~40 years of predictions to the contrary.


People have been saying that for at least twenty years


I remember people saying this in the 80s!


The pi Pico would be an interesting comparison.

It doesn't have an FPU, and not much RAM, so it might actually be a close race for some of these tests.


Well I'm anxiously anticipating the first Micropython build for the Cray-1.


Recognizing the domain, you can read about the early history of benchmarks: http://www.roylongbottom.org.uk/whetstone.htm


Years ago when my daughter was around 5 I was showing her a raspberry Pi zero I had just picked up. I told her - years ago before Daddy was your age a computer like this used to be as big as a house. Her response was - “houses were that small?”


In your household, the children tell the dad jokes to dad.


I like telling dad jokes...he usually laughs


Your dad is named he?


He for short, Hehehe is the full name ;P


Have you seen houses from the 1920s -- she's not that far off


Have you seen (tiny) houses in large cities in the 2020s -- she's not that far off.


I’ve lived in them for most of my life. Quite roomy.


they say the 2020's is the new 1920's in that regard


:) Did showing the Raspberry Pi to your daughter have any result (like getting her interested in tech or anything)?


Honestly she loves all her subjects in school. I'd say the "engineering tendency" that she picked up from her mom and me (both eng) is the desire to go deep on learning something. I see it when it comes to math, but equally see it when it comes to music or history.


Smart kid, thinking out of the box!


It comes naturally when the box is so small!


Maybe not a house, but depending on your age a large room for sure


There's a line in the Jurassic Park book where a character is made suspicious by an offhand assertion (by Nedry) that he is using a multi-X-MP system.

Rpi4s are nice, in a sense, because you can only rarely honestly claim that the speed of the system is holding you back. Most times, presumably, it's the efficiency of the operations you are telling it to execute.


> Rpi4s are nice, in a sense, because you can only rarely honestly claim that the speed of the system is holding you back.

As someone who uses them for a variety of purposes, I gotta note that they have pretty huge limitations. Like, the moment graphics enter the picture (no pun intended) you're moving an order of magnitude slower than most desktops or laptops. Not to mention that support for hardware video encode/decode (which, especially decode, we generally take for granted) isn't always available, depending on the library or tool you're working with.

Like yes, you can totally run a serviceable web server on a Pi and serve a blog or a small web app, but let’s not get carried away here.


Requiring a small nuclear reactor to power it aside, the Pi 5 feels far more like it's up to the task of a full desktop machine. Admittedly I haven't tried anything but headless workloads on mine so far, but it's so much snappier it's genuinely unreal. I'm really looking forward to seeing how much faster lidar localization and just SLAM in general runs on it once ROS support is sorted.

Although they did remove the h264 decoder and encoder, which is a bummer; like you say, it's hard to get working support for it anyway. Vulkan + regular GPU acceleration might be easier. And it still only has 4 cores, which is crap for desktop multitasking.


The lack of I/O to the cpu might be the counterpoint to this. A single (exposed) PCIe lane might be enough for any singular task, but you're likely to start bogging down your bandwidth if you need to do any serious simultaneous tasks like network above a gbit, nvme disk IO, additional display or additional parallel computing like a GPU.


it’s easy to forget how many bits needs to be pushed down a graphics pipeline, per pixel, per frame. in fact forget the pipeline, just pushing the bits down the wire fast enough is a non-trivial task.


I agree. I was a bit too flip in my original comment; there are regular tasks now that simply require multiples of the data throughput that the 1980s Cray machines were capable of.

The simpler point I was aiming at was just that: the amount of computational power at the fingertips of so many of us is huge, and it's important to appreciate that.


From the novelization of the movie "War Games", starting at https://archive.org/details/wargames00davi/page/n117/mode/2u... :

> "Jesus," David said. "That's a Cray 2!"

> "Ten of them." McKittrick said.

> "I didn't know they were out yet."

> McKittrick almost preened. "Only ten. Come on, I want to show you something."


I wonder how Moore's law figured into the pricing. For a nuclear simulation it makes sense you'd want to pay a lot, likewise for basic science or political things like the Apollo moon missions; but for weather or commercial applications, if you wait a few years it might not be worth paying for a Cray right away, though there's the marketing aspect of how advanced your product is... maybe this was before Moore's law though.


> if you wait a few years might not be worth to pay for cray right away

And then you don’t get anything done because there is always a better computer just around the corner. Most of the time, proposals are written for hardware that already exist and don’t need the absolute best. If you have some CFD or MHD calculations to do for a rocket engine or a nuclear reactor, you don’t care about the computer on which it ran, just that it ran on time and did not hold the whole project back. Even cutting edge science does not require cutting edge hardware most of the time.

Just like buying a desktop next year won’t help you play games today, at some point you have to settle and accept that your hardware will be outdated by the time it comes online (it’s a bit better now, but leading HPC clusters still get obsolesced quite quickly).

> maybe this was before moores law though

The exponential character of available CPU time on larger computers was apparent before Moore’s law.


The author's CV is as interesting as the benchmarks


The author has the most British name ever.


The most British name belongs to Lord British.


link?


I'm referring to the same page, under the "Background Activities" heading


www.roylongbottom.org.uk/Cray 1 Supercomputer Performance Comparisons With Home Computers Phones and Tablets.htm#anchor1


I wonder where a Raspberry Pi Pico, based on the RP2040, fits in all of that.


A lot slower, roughly 100x, because most of these benchmarks measure FLOPS and the RP2040 doesn't have native floating point, so it has to emulate it, taking it down to 1-3 MFLOPS vs the 160 MFLOPS of the Cray-1.

If you compared integer operations it would be a lot closer, but that's not really what the Cray was designed for. (The RP2040 at 125 MHz * 2 cores is in a pretty similar range.)


For a second there I thought there was a new SBC called Cray. Well played.


The article has Android phone comparisons. Any idea how the iPhones stack up against the Cray-1?


My Android phone cannot do any batch processing out of the box. You need 3 layers of emulation to be able to run a shell.


As interesting as this article is as a comparison to a 40-year-old supercomputer, the reality is that computers really are artifacts of an era, and their place in the progress or regress of technology is arguably only valid within that era. Today's world's fastest computer is a Cray:

https://www.top500.org/news/frontier-remains-no-1-in-the-top...

So, how does the Raspberry Pi stack-up against today's computers?


Dude fix your website certificate, come on.


This site is served over plain HTTP. Not only does he not need to fix anything, you should fix your broken web browser that's "upgrading" to HTTPS when no one, especially the site owner, asked it to.


"In 1978, the Cray 1 supercomputer cost $7 Million, weighed 10,500 pounds and had a 115 kilowatt power supply. It was, by far, the fastest computer in the world. The Raspberry Pi costs around $70 (CPU board, case, power supply, SD card), weighs a few ounces, uses a 5 watt power supply and is more than 4.5 times faster than the Cray 1"

edit: thank you for the Christmas present, yc algorithm. God bless us every one.


"The comment above was for the 2012 Pi 1. In 2020, the Pi 400 average Livermore Loops, Linpack and Whetstone MFLOPS reached 78.8, 49.5 and 95.5 times faster than the Cray 1."

Apart from the Cray-1, that whole section is also worth reading for some interesting insights into relative speed differences between various modern CPUs as well. (Though I do wish it was presented in table rather than narrative form, it’d be a lot easier to follow that way; there are also more detailed tables further down the page.)


Ok, do Pi 5 now


Multiply Pi 4 results by 3 and you have the rough ballpark.


I’m more impressed that the cray was that fast such a long time ago tbh


That all began with the CDC 6600.

https://en.m.wikipedia.org/wiki/CDC_6600


> CDC's first products were based on the machines designed at Engineering Research Associates (ERA), which Seymour Cray had been asked to update after moving to CDC.

> Cray has been credited with creating the supercomputer industry. Joel S. Birnbaum, then chief technology officer of Hewlett-Packard, said of him: "It seems impossible to exaggerate the effect he had on the industry; many of the things that high performance computers now do routinely were at the farthest edge of credibility when Seymour envisioned them.

> One story has it that when Cray was asked by management to provide detailed one-year and five-year plans for his next machine, he simply wrote, "Five-year goal: Build the biggest computer in the world. One year goal: One-fifth of the above." And another time, when expected to write a multi-page detailed status report for the company executives, Cray's two sentence report read: "Activity is progressing satisfactorily as outlined under the June plan. There have been no significant changes or deviations from the June plan."

> Cray avoided publicity, and there are a number of unusual tales about his life away from work, termed "Rollwagenisms", from then-CEO of Cray Research, John A. Rollwagen. He enjoyed skiing, windsurfing, tennis, and other sports. Another favorite pastime was digging a tunnel under his home; he attributed the secret of his success to "visits by elves" while he worked in the tunnel: "While I'm digging in the tunnel, the elves will often come to me with solutions to my problem."

Well Seymour you are an odd fellow, but I must say you design a good mainframe.


It was even faster in practice because the software was less bloated.


The primary operating system for the Cray-1 was the Cray Operating System (COS), which was a batch processing system. COS was specifically designed to exploit the hardware capabilities of the Cray-1, focusing on high-speed computation rather than on features like multi-user support. Given its focus on scientific and mathematical computations, COS supported compilers for languages like FORTRAN, which was the dominant language for scientific computing at the time. The Cray FORTRAN Compiler was highly optimized to take advantage of the Cray-1's vector processing capabilities. There was also a set of mathematical libraries optimized for its architecture. These libraries included routines for linear algebra, Fourier transforms, and other mathematical operations critical in scientific computing.


COS supported time-sharing and multiple users quite well, actually, including interactive sessions.

Trivia: Seymour was user U0100 on our in-house systems.


This is actually quite depressing considering the number of Raspberry Pis doing tasks like watering a plant. Reminds me of Marvin the Paranoid Android.


I've come to the conclusion that the value of the RPi is not so much in the hardware platform but rather in being the cheapest Linux-running computer you can buy. It's much easier to program for than any embedded board.


> Its much easier to program for than any embedded board.

Indeed. I work in both MCUs and full-featured Linux environments and there is zero value to using the former for non-safety critical, non-power constrained, low precision applications. Running a sprinkler system on an RPi is an entirely reasonable choice: you have ample storage for history, trivially simple remote control using a variety of protocols and media and ample compute to operate high level languages and run easily maintained programs, including nice-to-haves like continuous integration of public weather data to optimize your schedule against prevailing rainfall.

Can you shoehorn all that into an ESP32 + MicroPython or whatever? Sure. I'll bet someone already has. And they spent 10x the time it would have taken otherwise. At least.


Esp32 ( https://www.espressif.com/en/products/socs/esp32 ) are what most folks should be using to do that stuff. It’s a great platform to build upon.


ESPHome even comes with a well thought out sprinkler controller module: https://esphome.io/components/sprinkler.html

It supports valves, pumps, schedules, etc. I programmed mine once a few years ago with YAML (no code!) and now I just power cycle them once every month or so. Been running great.


You can always replace them with an RP2040, which has integer performance similar to the Cray-1, if that makes you feel better.


...freeing up a human to come inside and read hn


I've gone through the pi rabbit hole to the kubernetes/swarm level. Just skip that stuff at this point and get a mini pc. The money comes out to be the same and I promise it is way less of a hassle.


What make and model would you recommend for a mini PC? I was thinking of getting one lately.


I got a used Lenovo ThinkCentre M910q i5-6500T 4x2.5GHz 8GB RAM 240GB SSD for 100 €.

Plenty fast for a couple of VMs with web servers. It's also a lot better for 24/7 than RPi, as I was always struggling with SD card wear


See also Dell OptiPlex 9020 Mini (i5-4590T 4x2GHz 16GB) - that class of “One Litre PC” make an excellent VPS / multi-VM setup.


Thanks.



