A "Snitch" cluster is eight small single-stage integer CPUs plus one DMA core. Each core has its own "decoupled, heavily pipelined" FPU. The cluster shares an L1 I-cache, has 32 banks of 4kB scratchpads, a 64-bit config/peripheral AXI bus, and a 512-bit "wide" AXI bus that the DMA core can talk to, with its own connection to all the scratchpads.
Clusters are grouped together: on this "Occamy" chip there are 4 Snitch clusters per group. Each group shares a 64-bit/512-bit AXI group bus, which connects up to the top-level bus, plus an L2 constants cache. There are 6 such groups.
The chiplet also has 8 HBM2e stacks bringing ~400GBps, and some other goodies, like an 8GBps off-die serial link, a system DMA controller, 8GBps and 64GBps die-to-die serial links to connect to other chiplets, and a CVA6 management core.
There are some interesting iterations done to make the FPU more decoupleable, such as native loop instructions and some interesting memory-access instructions with address generation built in: "streaming semantic registers", a neat RISC-V ISA extension. https://arxiv.org/abs/1911.08356
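To make the idea concrete, here is a Python sketch (not actual Snitch code; all names are illustrative) of what a streaming semantic register does: reads of a designated FP register are mapped onto a hardware-generated address stream, so the inner loop of a kernel like a dot product needs no explicit load instructions from the integer core.

```python
# Illustrative sketch of "streaming semantic registers": reads of a
# stream register implicitly walk a strided memory region, so the FPU
# can chain multiply-accumulates without the integer core issuing
# explicit loads each iteration.

class StreamRegister:
    """Reads from this 'register' walk a strided memory region."""
    def __init__(self, memory, base, stride, length):
        self.memory = memory
        self.addr = base
        self.stride = stride
        self.remaining = length

    def read(self):
        value = self.memory[self.addr]
        self.addr += self.stride   # address generation is implicit
        self.remaining -= 1
        return value

def dot_product(memory, a_base, b_base, n):
    # Configure two stream registers, then run a load-free inner loop,
    # roughly what Snitch's hardware loop plus SSRs achieve.
    ft0 = StreamRegister(memory, a_base, stride=1, length=n)
    ft1 = StreamRegister(memory, b_base, stride=1, length=n)
    acc = 0.0
    for _ in range(n):             # hardware loop: no explicit loads
        acc += ft0.read() * ft1.read()
    return acc

mem = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
print(dot_product(mem, a_base=0, b_base=4, n=4))  # 1*10+2*20+3*30+4*40 = 300.0
```

The paper linked above describes the real mechanism; this just models the key point that address generation moves out of the instruction stream.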
I love love love the non-cache-coherent design. Even within a cluster there's still not really cache coherence, is my feel; there's just some shared memory one can use. The early slides in this deck show so much promise. Notably, there's a huge weakness in the ecosystem in general around hard IP for talking to memory, Ethernet, and PCIe, but otherwise, so much glorious stuff and rapidly improving tools for doing this kind of work.
At a quick glance, the presentation mentions DMA and local scratchpad memory. Is this similar in approach to the IBM/Sony/Toshiba Cell processor which also had one general purpose core in charge of a fleet of compute-oriented cores?
Which is nice, but doesn't really answer my question?
There's a tremendous usability difference between "X flops and all the different types of cores can read and write to unified memory with cache coherence" (as on Apple Silicon) vs. "X flops but you need to carefully orchestrate chunks of data in and out of local scratchpad buffers" (as on the Cell).
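A minimal sketch of the kind of orchestration a scratchpad design like the Cell requires: explicitly transferring tiles into a local buffer, computing, and writing results back, ping-ponging between two buffers so that transfer and compute can overlap. This is purely illustrative; the "DMA" here is a synchronous list slice, where real hardware would issue an asynchronous copy.

```python
# Illustrative ping-pong (double-buffered) scratchpad pipeline: while
# the core computes on one buffer, the DMA engine fills the other.

TILE = 4

def dma_in(src, offset, n):
    return src[offset:offset + n]          # stands in for an async DMA

def process(tile):
    return [2 * x for x in tile]           # stand-in compute kernel

def run_pipeline(data):
    out = []
    buffers = [None, None]
    buffers[0] = dma_in(data, 0, TILE)     # prefetch the first tile
    for i in range(0, len(data), TILE):
        cur = (i // TILE) % 2
        nxt = 1 - cur
        if i + TILE < len(data):           # kick off the next transfer "early"
            buffers[nxt] = dma_in(data, i + TILE, TILE)
        out.extend(process(buffers[cur]))  # compute overlaps the transfer
    return out

print(run_pipeline(list(range(8))))  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The usability gap the comment describes is exactly that none of this bookkeeping exists on a coherent unified-memory machine.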
This is a somewhat misleading metric, seeing how FP64 compute isn't a priority for consumer GPUs. Still, it's great to see that at least some facet of modern GPUs can be replicated independently without using a ridiculously oversized die. There may yet be hope for some new GPU startup.
The FPU/memory interleaving reminds me of ARM's Helium instructions. Which makes sense given the flock of chickens approach. It's great to see people developing practical architectural innovations for their own use cases like this guy.
They wouldn't use a chip like this for anything safety-critical, which is the main reason for using well-established tools and techniques. Fielding new technology in a less critical role is an opportunity to establish new tools as well.
I would come to the same conclusion if they had used something like Chisel or another of the new HDLs that are based on higher level languages and metaprogramming, but Verilog is arguably the older language when compared with VHDL.
I probably shouldn't have said "new" without explaining myself more.
For safety- and mission-critical applications (SC and MC), Verilog is more established in the US, and VHDL is more established in Europe.
For SC, a tool doesn't get any points unless you've used it yourself, so the fact that they occupy the same role and have similar pedigree elsewhere is nearly irrelevant. Each tool permeates all the way down to the less critical roles in order to establish it for more critical use by those specific companies and government agencies. That's how these tools get entrenched in different places.
For MC, you can bring things into your company or agency that are relatively new to your organization or less used in your organization. This processor would at most be used for MC and it's a significant new capability, so it's not all that weird.
That argument essentially boils down to "they wanted to try it", but I don't think it explains why they would want to try it. Verilog is arguably worse from a verification standpoint, which is AFAIK the main reason VHDL won out in Europe. Pretty much every academic hardware working group I've encountered is also a formal verification working group.
The lower the application criticality and the bigger the project impact, the more leeway you can give if there are legitimate project-killing limitations. It's safe to say that the project is impactful because it adds an important sovereign capability, and low criticality because it won't be used for safety and it's not already slated for use in a flagship mission as a mission-killing component.
The only assumption you have to make is why VHDL would be bad. Could be a technical limitation that would be prohibitively expensive to fix. Could be that this group is coming from a non-space background in Verilog and switching to VHDL would be too long and expensive. Or maybe it's a long-term strategic decision to advance the use of Verilog in space because the ESA feels like VHDL will become a liability in the future.
> Ingenuity, also called Ginny, is a small robotic helicopter operating on Mars.
> The helicopter uses a Qualcomm Snapdragon 801 processor
Edit: There is a way to detect and recover from these types of errors. They appear to be able to run the computation on a number of cores with a voter to check for consistency.
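The voting scheme described above can be sketched in a few lines: run the same computation on N redundant cores and take a majority vote over the results, so a single radiation-induced upset in one copy is outvoted. A minimal Python illustration:

```python
# Majority voting over redundant computations: a single corrupted
# result is outvoted; with no majority the fault is at least detected.

from collections import Counter

def vote(results):
    """Majority vote over redundant results; raises if no majority."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: unrecoverable disagreement")
    return value

# Three redundant runs, one corrupted by an upset:
print(vote([42, 42, 43]))   # 42: the faulty copy is outvoted
```

With triple redundancy this corrects any single fault; pairs (as in lockstep cores) can only detect a disagreement, not decide which copy is right.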
Ingenuity had a Snapdragon, but the flight controllers were a redundant pair of Arm Cortex-R5F microcontrollers (which are themselves lockstepped dual-core processors, so quadruple redundancy), plus a rad-tolerant FPGA (ProASIC) that watchdogged everything else. The Snapdragon was used for less critical code such as imaging and comms.
Also, thanks to the thin atmosphere and lack of a magnetic field, Mars's surface receives about 50 times more radiation than Earth's, but it's still not as bad as space out past the Van Allen belts.
Do the AI cores use a standardized API? Is it an extension to the RISC-V set? Is the API similar to normal CPU multithreading, or more like GPU kernels or SIMD?
It's RISC-V, both the control core and the compute cores. Has a few extensions on the compute side for loading the register file from DMA and chaining FPU ops with no core front end babysitting. Honestly reminds me a lot of GPU architectures, though not quite as wide per compute core (so not as wide compared to a SM or wavefront). Probably legitimately makes for a very nice HPC chip all else being equal.
I hope someone picks up the architecture and brings it to the wider market. Would be great to see some opensource hardware AI accelerator options.
Looking more closely at Snitch, unfortunately it looks like the two largest contributors have already left the project. Perhaps they can keep up the momentum, but industry poaching remains an issue for ESA.
> Occamy has a lightweight 32-bit CPU core that acts more as a control chip, and it is responsible for rerouting tasks to the AI cores, which are extensions to the instruction set architecture.
My understanding was that RISC-V intends to support vector processing rather than packed SIMD, to avoid instruction bloat and for portable binaries. There is already a letter reserved for packed SIMD in the ISA naming convention though (P). Are there actually boards I can buy that support either? Will LLVM generate either? I look forward to seeing how it works.
There are no chips/boards implementing the standard P extension as it is nowhere near finalised. The base for the proposed P extension came from Andes who have been shipping their custom packed SIMD ISA on their own NDS32 and then RISC-V cores for a number of years.
There are currently no chips or boards implementing RVV 1.0. There seems to be a prospect of a board with a single 1.6 GHz dual-issue in-order core (C908) implementing RVV 1.0 for $30-$40, perhaps towards the end of the year. https://twitter.com/SipeedIO/status/1654055669774036993
There have been chips and boards implementing the mid-2019 RVV draft 0.7.1 since the first Allwinner D1 board with single 1.0 GHz 64 bit C906 core in April 2021 (and a lot more since). In the last few months we have seen three new chips implementing this vector ISA:
- Bouffalo Lab BL808 with a 480 MHz C906 (plus a couple of 32-bit cores) and 64 MB of PSRAM: Pine64 Ox64, Sipeed M1s
- T-Head TH1520 with 4x 2.0 GHz OoO C910 cores, similar to the Arm A72 in e.g. the Pi 4: Lichee Pi 4A, Roma laptop (supposedly), other as-yet-unannounced boards coming soon
- Sophon SG2042 with 64x 2.0 GHz OoO C910 cores, 64 MB L3 cache, 32x PCIe gen 4, four DDR channels. I've been using one of these remotely via ssh. The first retail machine with it, the Milk-V "Pioneer" is available for preorder in China now, reportedly Crowdsupply soon.
RVV 0.7.1 is very similar to 1.0, but there were a few changes in between that make them incompatible (which is expected, because draft specs were never expected to be widely implemented... but here we are). Some carefully-written code is binary compatible between them, e.g. memcpy(), but in general a little, usually trivial, conversion is necessary.
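That binary portability comes from RVV's vector-length-agnostic model: instead of baking an element count into each instruction the way packed SIMD does, the loop asks the hardware each trip how many elements it may process (vsetvli). A Python simulation of that strip-mining pattern, with `vlmax` standing in for the implementation's hardware vector length:

```python
# Simulating RVV-style vector-length-agnostic strip-mining: the loop
# requests a vector length each iteration (like vsetvli), so the same
# code works whatever the implementation's VLMAX is, unlike packed
# SIMD, where the element count is fixed in the instruction encoding.

def vsetvl(avl, vlmax):
    """Grant at most vlmax of the avl remaining elements (simplified)."""
    return min(avl, vlmax)

def vec_add(a, b, vlmax):
    out = []
    i = 0
    while i < len(a):
        vl = vsetvl(len(a) - i, vlmax)   # hardware picks the chunk size
        out.extend(x + y for x, y in zip(a[i:i+vl], b[i:i+vl]))
        i += vl
    return out

a, b = list(range(10)), list(range(10))
# Same result on a "narrow" and a "wide" machine:
print(vec_add(a, b, vlmax=4) == vec_add(a, b, vlmax=16))  # True
```

The real vsetvl rules (and the 0.7.1 vs 1.0 differences in them) are more subtle than this sketch, but the portability idea is the same.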
I've got my hands on an Allwinner D1; is it possible to use intrinsics for RVV 0.7.1? I'd like to test a few things in my C libraries already. Are there any resources, other than carefully reading the spec, that document the differences between 0.7.1 and 1.0?
There was no work on C intrinsics yet when 0.7.1 was the current draft. That started with 0.8, if I recall correctly.
I'd rather program directly in assembly language, or in inline asm, than in intrinsics anyway. They are verbose, ugly, hard to write, hard to read, and even hard to modify because things such as the LMUL are baked into every single function name.
I don't know of any document that carefully shows the differences. I'd like to create one, if I had someone to pay for my time to do it.
"Traditional" vendors provide ** in the realm of intelligence agencies..., so these "traditional" vendors are very hard to replace with open-source variants.
Just look at what Apple had to do, for example, in terms of scanning all your pictures for ** or **, just to avoid being ordered by the federal government to do things it would be required to disclose to shareholders...
> The chip is being manufactured by Globalfoundries on the 12-nanometer low-power process.
As far as I can tell it is not radiation hardened yet. I think what the article is trying to say is that since it is an open source hardware, ESA is not dependent on the vendor for radiation hardening.
Glofo 12nm is an FD-SOI process, which is inherently less sensitive to radiation than normal bulk CMOS. Since the substrate is an insulator, SOI devices are also generally immune to latch-up.
Oh yeah, great - put the AI overlords in space where we can't easily reach them and shut them down when they go crazy and start dropping satellites on us.
A lot of talk about AI cores. Is that just a fancy way of saying the chip is good at parallel computing?
Also, is this chip only really suitable for missions, or is it also suitable for something like an orbiting datacenter? Or would the Earth's magnetosphere be able to protect terrestrial chips in orbit from radiation?
The problem is that if you are a small-time device maker, the FPGA you'd need to replicate this design would cost so much that the product becomes unaffordable. And you can't run it fast enough unless you turn it into an ASIC?
I have been looking for ways to create an 8-core RISC-V CPU in a cheap FPGA (sub-$10) while making it run fast enough to compete with ARM Cortex-A chips. As far as I can tell this still can't be done.
The people who design the chips often make poor choices regarding peripherals and features, because they are not the end users: the software engineers. So what would be really interesting is to no longer need to create ASICs and, as a software developer, just design the CPU you want and program it into an FPGA while being cost-competitive.
For satellites doing high-res earth observation or astronomical observation, detecting noteworthy frames on-satellite to prioritize for downlink can save a lot of bandwidth. Downlink is often only possible during certain parts of the orbit (unless geostationary), with long blackout periods. Being able to run a moderately big CNN or transformer to spot ships, troop movements, methane leaks, etc. on the satellite would be worth some extra power and weight.
I think that for surveillance and defense applications, having more sophisticated algorithms running on board can be very advantageous by reducing the bandwidth needs on the downlink.
In the event of a generalized European war with Russia, having smaller, nimbler data payloads sent on the downlink is definitely a tactical advantage, as we can certainly expect that among the first Russian targets would be the giant, fixed tracking stations.
A surveillance satellite that could interpret images and just send something like "300 tank movement detected in the last hours starting from coordinates X,Y, direction Z, average speed W" in a slow link, able to be received by mobile stations would be great for European defense.
Comms and control; signal corruption and delay will increase as the distance increases.
For comms, that means either more transmission power from a spacecraft that is likely to have been travelling for years, if not decades; bigger ground or space-based receivers; or, much more cost-effectively, reducing bandwidth by spending more bits on error correction. Hence using AI to transmit only interesting information becomes attractive.
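The "more bits in error correction" trade can be made concrete with the simplest classic code. A Hamming(7,4) encoder sends 7 bits for every 4 data bits, and in exchange the receiver can correct any single flipped bit without asking for a retransmission, which is exactly the power-for-bandwidth trade the comment describes:

```python
# Hamming(7,4): 3 parity bits per 4 data bits buy single-bit error
# correction, trading bandwidth for tolerance of a weak, noisy link.

def encode(d):                      # d = [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def decode(c):                      # returns corrected data bits
    c = c[:]                        # don't mutate the caller's codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4,5,6,7
    err = s1 + 2 * s2 + 4 * s3      # syndrome = 1-based error position
    if err:
        c[err - 1] ^= 1             # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
sent = encode(word)
sent[4] ^= 1                        # a bit gets flipped in transit
print(decode(sent) == word)         # True: the flip is corrected
```

Real deep-space links use far stronger codes (convolutional, Reed-Solomon, turbo, LDPC), but the principle is the same: redundancy in the bitstream substitutes for transmit power.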
While for control: it takes between five and twenty minutes for a signal to travel between Earth and Mars, and between Earth and Voyager 1 it's currently twenty-one hours. Adding better control and the ability to cope with more eventualities to a craft that will fly itself in poorly understood situations reduces the cost of achieving a reasonable likelihood of mission success [0].
[0] The alternative being to send one - or more - simpler but still massively expensive craft to determine the missing data ahead of time (and hope it doesn't change).
Here's the source for the compute CPU core: https://github.com/pulp-platform/snitch