Thanks! Yup, private retrieval is interesting as a product because it's a fundamentally new capability; there aren't really competitors we can show incremental improvements against. If you're still interested in the space, we'd be happy to compare notes! Feel free to email us: founders AT blyss.dev
Our FHE scheme uses lots of Number Theoretic Transforms (NTTs), which are pretty computationally expensive. NTT is a good candidate for acceleration, and there is quite a bit of interest from the zk community in doing so (https://www.zprize.io/prizes/accelerating-ntt-operations-on-...).
From a hardware perspective, NTT can be done in parallel, but has a fairly large working set of data (~512 MB) with lots of unstructured accesses. This is too big to fit in even the largest CPU L3 caches, so DRAM bandwidth is still relevant. It may eventually be feasible to build an ASIC with this much on-chip memory, but in the meantime, GPUs do a pretty decent job with their massive HBM bandwidth.
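For a sense of what's being accelerated, here's a minimal radix-2 NTT sketch in Python. This is not our actual implementation; the modulus 998244353 and primitive root 3 are just common textbook choices, and real FHE parameters are much larger:

    # Iterative radix-2 NTT over Z_p. Assumptions: len(a) is a power of two
    # and p is an NTT-friendly prime (len(a) divides p - 1).
    def ntt(a, p=998244353, g=3):
        n = len(a)
        # bit-reversal permutation
        j = 0
        for i in range(1, n):
            bit = n >> 1
            while j & bit:
                j ^= bit
                bit >>= 1
            j |= bit
            if i < j:
                a[i], a[j] = a[j], a[i]
        # Cooley-Tukey butterflies
        length = 2
        while length <= n:
            w_len = pow(g, (p - 1) // length, p)  # primitive length-th root of unity
            half = length // 2
            for start in range(0, n, length):
                w = 1
                for k in range(start, start + half):
                    u, v = a[k], a[k + half] * w % p
                    a[k], a[k + half] = (u + v) % p, (u - v) % p
                    w = w * w_len % p
            length <<= 1
        return a

    print(ntt([1, 2, 3, 4]))

Each stage's butterflies are independent (hence the parallelism), but the strides sweep across the whole array, which is where the working-set and memory-bandwidth pressure comes from.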
Interesting prize. I wonder why they require it to be a radix-2 NTT; using a higher radix speeds things up by an order of magnitude on GPU (granted, I'm using a 256-bit field, so my case might be more memory-bound).
While OPT-175B is great to have publicly available, it needs a lot more training to achieve good results. Meta trained OPT on 180B tokens, compared to the 300B that GPT-3 saw. And the Chinchilla scaling laws (roughly 20 training tokens per parameter) suggest that almost 4T tokens would be required to get the most bang for the compute buck.
And on top of that, there are some questions on the quality of open source data (The Pile) vs OpenAI’s proprietary dataset, which they seem to have spent a lot of effort cleaning. So: open source models are probably data-constrained, in both quantity and quality.
OPT-175B isn't publicly available, sadly. It's available to research institutions, which is much better than "Open"AI, but it doesn't help us hobbyists/indie researchers much.
I wonder when we'll start putting these models on the pirate bay or similar. Seems like an excellent use for the tech. Has no one tried to upload OPT-175B anywhere like that yet?
I was curious about exactly how much burst heat could be absorbed, so I asked WolframAlpha [0]. In a 15" workstation laptop, I think the CPU could quite reasonably pull +100 watts over steady-state TDP for 30 seconds. (100-gram aluminum heatsink absorbing 100W * 30sec -> ∆T = +33°C)
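For anyone who wants to plug in their own numbers, the arithmetic is just heat capacity; this assumes aluminum's specific heat of ~0.9 J/(g*K) and that all of the excess power goes into the heatsink with no losses:

    energy_j = 100 * 30        # 100 W over steady-state TDP, for 30 seconds
    mass_g = 100               # heatsink mass in grams
    c_aluminum = 0.90          # specific heat of aluminum, J/(g*K)
    delta_t = energy_j / (mass_g * c_aluminum)
    print(round(delta_t, 1))   # ~33.3 K rise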
It's interesting to note that none of the merged companies are fabless.
Analog and low-to-mid complexity digital designs don't usually use the smallest, newest, most expensive silicon processes that you need for processors, GPUs, and FPGAs. You generally need capacitors, precision resistors, and wider voltage ranges more than you do billions of transistors.
Maybe now that these older fabs are being forced to run as actual businesses rather than as bleeding-edge science projects, semiconductor design companies are able to bring them back into the fold to avoid dealing with the headaches of being fabless.
It's too bad though, because this adds a huge capital cost to what would otherwise be a really ripe opportunity for a new competitor. This consolidation has definitely brought higher prices and reduced the diversity of available parts.
The unit economics of analog ICs should be very good -- a product that needs 1/100th the silicon surface area and sells for 1/10th the price, using a much cheaper node than a modern digital IC. There should be plenty of room for a company to compete with Analog Devices on price while still making a killing.
None of these companies are pushing the process side, and they're able to run on old, well-depreciated fabs. Their business is uninteresting to the foundry giants. Why? After all, the foundries have old fabs too. But most of the embedded suppliers are forced to be low-margin providers (often commodity or quasi-commodity parts only a couple of steps up the food chain from passives). They don't have any margin to give away to a foundry.
Analog is a bit different in that "node" doesn't really apply, but analog vendors are mostly not in the high-value part of the value chain either.
If you're fabless, the design cycle time is longer. Every prototype requires a formal agreement with one of the fabs you work with, and usually involves a few million dollars changing hands. A few months is usually the minimum.
Even if it's expensive, owning a fab means you have the option to make prototypes of a design, or of parts of a design. You will never, ever do this if you're fabless.
We're good at simulating digital logic, but simulating analog designs is more difficult, and each process tends to have unique quirks. You want your designers to be familiar with these quirks, which is easier to do when everybody designing and using a process is under the same roof.
If you're making a chip that has exotic needs (voltage ranges, threshold voltages, RF performance, noise, thermal properties, bipolar + CMOS, etc.) you will have more ability to tweak the process. Foundry-type fabs typically offer a smaller "menu" of options that they're comfortable they can support. For example, I think you might have a hard time competing with some of AD's more expensive ADCs as a fabless semiconductor company.
Don't get me wrong... there are plenty of headaches to owning and operating a fab too.
Did everyone with an interest here see the post [0] a week ago about free fabbing for 130nm open-source chips?
There's certainly the possibility to do analog chips here, but it would take a big team effort.
(I've only dabbled a bit with FPGAs and soft cores; the last project I used one for was ~2009: running Linux on an Altera NIOS core, where we sampled an input at 100 Msps until we filled up the RAM, then more slowly dumped it over Ethernet to a PC)
Being fabless has quite a few advantages too: the cost of building and running a fab is huge, so it can only be afforded by major players, or by folks who can live on very old process nodes. Most startups, and even ADI themselves, use newer nodes like 16nm with external fabs like TSMC.
The early age of the semiconductor industry was definitely much more interesting than today's world of oligopolies and company consolidation. It's quite disappointing to see the big names in the industry have all vanished.
> today's world of oligopolies and company consolidation
How long until we have to sign a license agreement before we can use an OpAmp? Oh, and the license is only valid for consumer applications. Want to use the OpAmp for enterprise applications? That'll cost you more.
That's if we're lucky. In a darker scenario, all OpAmp designs have been bought by Apple, and you can't even use one if you opened your iPhone because the function has been integrated into the CPU.
Fortunately, we don't need to sign a license agreement to use an Intel CPU yet, and an OpAmp EULA doesn't look like something currently on the horizon. But if an EULAed CPU ever becomes a reality, an EULAed OpAmp will follow...
> In a darker scenario, all OpAmp designs have been bought by Apple, and you can't even use one if you opened your iPhone because the function has been integrated into the CPU.
Scary, because many microcontrollers already have OpAmps built into them...
Making hardware hasn't gotten any easier except at the edges where open source has taken hold, and it seems like the only reason you'd want to design hardware anymore is to have more vertical control. Tesla, Apple, Google are all prime examples.
The margins on an OK product just aren't worth it; to even break even you have to build the whole thing and provide a reference design that's within 20% of the best out there. No surprise the market is all oligopolies. If there weren't open source and affordable fab services coming up, there would really be no hope.
+1. This can be terrible. When Avago acquired Broadcom (and kept Broadcom's name), they initially continued selling existing components for a while, then suddenly discontinued hundreds of discrete RF/microwave parts (some inherited from the Agilent and even Hewlett-Packard days) because they were "legacy parts". No! It was a huge pain. Many of those, despite their age, are still good and useful; it's just that the demand and profit from discrete parts are low, and the new management decided it wasn't worth keeping them.
On the other hand, after Analog purchased Linear, many of the high-performance Linear parts are still sold side-by-side with competing Analog parts today; Analog even created a "Powered by Linear" product line for selling Linear power converter chips. It was a wise decision; apparently the management knew those Linear parts are of great value. I hope Analog will adopt a similar approach for these Maxim parts.
Yeah, I was using a 30 dBm, L-band part that they discontinued, and had to redesign a PA section. Didn't HP acquire Avantek back in the 90's? I used to use a bunch of ATF-xxx parts.
Discontinued parts also included passive parts originally made by the HP Components subsidiary in the 1980s, such as special Schottky diodes and PIN diodes for RF/microwave applications up to 10 GHz, still perfectly working today. For example, HP's jelly-bean HSMS‑282x series of 6 GHz Schottky diodes was the go-to choice in RF circuits (even at lower frequencies like VHF and UHF) for three decades, and was still in production as of 2016 - you can find one datasheet with an HP logo, another with an Agilent logo, another with an Avago logo, and the last with a Broadcom logo - until Broadcom killed them in 2017 after the Avago acquisition.
I was hit by this. On a recent weekend I was tinkering with a DIY software-defined amateur radio receiver design and needed some RF diodes, only to find that all of them had been killed, and similar parts from NXP weren't stocked by the local distributor, with 10-day shipping... A friend told me they've switched to Skyworks diodes since then. The legendary life and unfortunate death of the HP diodes.
That being said, I was disappointed with EE Times's vacuous take on this merger this morning. They've changed hands and managed to survive, but like most of the vertical trade press, they're a shadow of their former selves.
The author's comments on cache sizes are a bit reductive. Not all "L3" is created equal, and designers always make tradeoffs between capacity and latency.
In particular, the EPYC processors achieve such high cache capacities by splitting L3 into slices across multiple silicon dies, and accessing non-local L3 incurs huge interconnect latency - 132ns on latest EPYC vs 37ns on current Xeon [1]. Even DDR4 on Intel (90ns) is faster than much of an EPYC chip's L3 cache.
Intel's monolithic die strategy keeps worst case latency low, but increases costs significantly and totally precludes caches in the hundreds of MB. Depending on workload, that may or may not be the right choice.
In practice the large AMD L3s result in very good performance. The new Ryzen CPUs, for instance, absolutely crush Intel CPUs at GCC compile times because of them ( https://www.youtube.com/watch?v=CVAt4fz--bQ )
Are there workloads where the AMD suffers due to its L3 design? Maybe, but I've not seen one yet. For something special like that, I'd imagine you could try to arrange thread affinity to avoid non-local L3 accesses.
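Something like this is what I have in mind; a Linux-only sketch, where os.sched_setaffinity does the pinning and the core IDs are hypothetical (you'd read the real CCX-to-core mapping from lscpu or sysfs first):

    import multiprocessing as mp
    import os

    CCX0_CORES = {0, 1, 2, 3}   # hypothetical: logical CPUs sharing one L3 slice

    def worker(chunk):
        # Pin this worker to a single CCX so its working set stays in the local L3.
        os.sched_setaffinity(0, CCX0_CORES)
        return sum(x * x for x in chunk)   # stand-in for cache-sensitive work

    if __name__ == "__main__":
        with mp.Pool(processes=len(CCX0_CORES)) as pool:
            print(pool.map(worker, [range(100_000)] * 4))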
> Are there workloads where the AMD suffers due to its L3 design?
Databases, particularly any database which benefits from more than 16MB of L3 cache.
> On my 3900x L3 latency is 10.4ns when local.
And L3 latency is >100ns when off-die. Remember, to keep memory coherent, only one L3 cache can "own" data. You gotta wait for the "other core" to give up the data before you can load it into YOUR L3 cache and start writing to it.
It's clear that AMD has a very good cache-coherence system to mitigate the problem (aka Infinity Fabric), but you can't get around the fundamental fact that a core only really has 16MB of L3 cache.
Intel systems make all of their L3 cache available to all of their cores, which greatly benefits database applications.
---------
AMD Zen (and Zen2) is designed for cloud servers, where those "independent" bits of L3 cache are not really a big problem. Intel Xeons are designed for big servers that need to scale up.
With that being said, cloud-server VMs are the dominant architecture today, so AMD really did innovate here. But it doesn't change the fact that their systems have the "split L3" problem which affects databases and some other applications.
> Databases, particularly any database which benefits from more than 16MB of L3 cache.
Yes, but have you seen this actually measured as a net performance problem for AMD compared to Intel? I understand the theoretical concern.
It's older (Zen 1), but you can see how even an AMD EPYC 7601 (32-core) is far slower than an Intel Xeon Gold 6138 (20-core) in Postgres.
Apparently Java benchmarks are also L3-cache-heavy or something, because the Xeon Gold is faster in Java as well (at least in whatever Java benchmark Phoronix was running).
What I see there is that the EPYC 7601 (first graph, second from the bottom) is much faster than the Xeon 6138 -- it's only slower than /two/ Xeons ("the much more expensive dual Xeon Gold 6138 configuration"). The 32-core EPYC scores 30% more than the 20-core Xeon.
Look at PostgreSQL, where the split-L3 cache hampers the EPYC 7601's design.
As I stated earlier: in many workloads, the split cache of EPYC seems to be a benefit. But in DATABASES, which are a major workload for any modern business, EPYC loses to a much weaker system.
Are their L3 slices MOESI like their L2s are (or at least were)? That'd let you have multiple copies in different slices as long as you weren't mutating them.
Any information I look up runs into an NDA firewall pretty quickly (be it performance counters or hardware-level documentation). It seems like AMD is highly protective of their coherency algorithm.
> That'd let you have multiple copies in different slices as long as you weren't mutating them.
Seems like the O(wned, i.e. shared-dirty) state does let a modified line be shared across slices without writing it back first, though a write still has to invalidate the other copies. But it's still a "multiple copies" methodology. Once any particular core comes up against the 8MB (Zen) or 16MB (Zen2) limit, that's all it gets. There's no way to give a single dataset 32MB of cache on Zen or Zen2.
In general, all Zen generations share two characteristics: cores are bound into 4-core clusters called CCXs, and two of those are bound into a group called a CCD. Chips (Zen 1 and 1+) and chiplets (Zen 2) have only ever had one CCD per chip(-let), and 1, 2, or 4 chip(-lets) have been put on a socket.
In Zen 1 and 1+, each chip had a micro IO die, which contained the L3, making for a quasi-NUMA system. Example: a dual-processor Epyc of that generation would have one of 8 memory controllers reply to a fetch/write request (whoever had it closest: either somebody already had it in L3, or somebody owned that memory channel).
L3 latency on such systems should be quoted as an average, or as a best case and a worst case. Quoting only the worst case ignores cache optimizations: prefetchers can grab from non-local L3, and fetches from L3 don't compete with the finite RAM bandwidth but add to it, which can mean a 2-4x increase in performance if multiple L3 caches are responding to your core. In addition, Intel has similar performance issues: RAM on another socket also carries a latency penalty (the nature of all NUMA systems, no matter who manufactured them).
Where Zen 1 and 1+-based systems performed badly was when the prefetcher (or a NUMA-aware program) did not get pages into L2 or local L3 fast enough to hide the latency (Epyc had the problem of too many IO dies communicating with each other; Ryzen had the problem of its single IO die not being enough to keep the system performing smoothly).
Zen 2 (the generation I personally adopted, wonderful architecture) switched to a chiplet design: it still retains dual 4-core CCXs per CCD (and thus per chiplet), but the IO die now lives in its own chiplet, thus one monolithic L3 per socket. The IO die is scaled to the needs of the system, instead of growing statically with additional CCDs. Ryzen now performs ridiculously fast: it meets or beats Coffee Lake Refresh performance (single- and multi-threaded) for the same price, while drawing less power and putting out less heat. Epyc now scales up to ridiculously huge sizes without losing performance in non-optimal cases or getting into weird NUMA latency games (everyone's early tests of multi-socket Epyc 2 systems on intentionally bad-for-NUMA workloads illustrate a very favorable worst case, meeting or beating Intel's current gargantuan Xeons in workloads sensitive to memory latency).
So, your statement of "a tiny NUMA system within the package" is correct for older Zens, but not correct (and, thankfully, vastly improved) for Zen 2.
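If anyone wants to see how their own box is carved up, the kernel exposes the L3 sharing groups in sysfs; a small Linux-only sketch (index3 is usually the L3, but check the level file if unsure):

    import glob

    groups = set()
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"):
        with open(path) as f:
            groups.add(f.read().strip())

    for cpus in sorted(groups):
        print("L3 slice shared by CPUs:", cpus)

On my understanding, a Zen 1/Zen 2 part should print one group per CCX, while a monolithic Intel die prints a single group covering every core.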
Yeah. I bet part of why there's so much L3 per core group is that it's really expensive to go further away.
Seems like there are at least two approaches for future generations: widen the scope across which you can share L3 without a slow trip across the I/O die, or speed up the hop through the I/O die. Unsure what's actually a cost-effective change vs. just a pipe dream, though.
I'm very confused; there appear to be several conflicting reports on L3 cache latency for EPYC chips [1] [2]. Is it the larger random cache writes that are causing the additional latency?
Regardless, I wouldn't be particularly concerned; cache seems like the easier issue to address vs. power density.
> Is it the larger random cache writes that are causing the additional latency?
Think of the MESI model.
If Core#0 controls memory location #500 (Exclusive state), and then Core#32 wants to write to memory location #500 (also requires Exclusive state), how do you coordinate this?
The steps are as follows:
#1: Core#0 flushes the write buffer, L1 cache, and L2 cache so that the L3 cache and memory location #500 are fully updated.
#2: Memory location #500 is pushed out of Core#0's L3 cache and into Core#32's L3 cache. (Core#0 sets location #500 to "Invalid", which allows Core#32 to set location #500 to Exclusive.)
#3: Core#32's L3 cache then transfers the data to L2, then L1, and finally it can be read by Core#32.
--------
EDIT: Step #1 is skipped when you read from DDR4 RAM. So DDR4 RAM reads under the Zen and Zen2 architectures are faster than remote L3 reads. An interesting quirk for sure.
In practice, Zen / Zen2's quirk doesn't seem to be a big deal for a large number of workloads (especially cloud servers / VMs). Databases are the only major workload I'm aware of where this really becomes a huge issue.
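To make the handoff above concrete, here's a toy sketch of the bookkeeping; deliberately simplified, with generic MESI-style states rather than AMD's actual protocol:

    MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

    class L3Slice:
        def __init__(self, name):
            self.name, self.lines = name, {}   # addr -> (state, value)

        def write(self, addr, value, other_slices):
            # Steps #1/#2: every other slice gives up its copy before this
            # slice can hold the line exclusively.
            for other in other_slices:
                state, old = other.lines.get(addr, (INVALID, None))
                if state != INVALID:
                    print(f"{other.name}: flush addr {addr} and invalidate (was {state})")
                    other.lines[addr] = (INVALID, old)
            # Step #3: this slice now owns the line and can modify it.
            self.lines[addr] = (MODIFIED, value)
            print(f"{self.name}: addr {addr} -> Modified")

    ccx0, ccx1 = L3Slice("CCX0"), L3Slice("CCX1")
    ccx0.lines[500] = (EXCLUSIVE, 42)   # Core#0's slice holds location #500 Exclusive
    ccx1.write(500, 99, [ccx0])         # Core#32's write forces the handoff

The round trip for that flush/invalidate is what shows up as the >100ns remote-L3 numbers quoted elsewhere in the thread.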
Actually, the L3 cache is also sharded across chiplets, so there's a small (~8MB) local portion of L3 that is fast, while remote slices will have to go over AMD's interdie connection fabric and incur a serious latency penalty. On first gen Epyc/Threadripper, nonlocal L3 hits were almost as slow as DRAM at ~100ns (!).
That still means multiple silicon dies, which we have known how to do for a while (see: Intel Core 2 Quad from 2006, and more recently AMD Epyc).
Having more dies lets you dissipate more heat, but then it's kinda hard to build low-latency / high-bandwidth interconnects between the dies. Inter-die buses go over a PCB or interposer, which imposes higher parasitic capacitance and makes it difficult/expensive to run wide interfaces. That's why techniques like "dark silicon" allocation are important: they let us get more performance out of a single die.
It matters when defining parallel work distribution. Unless memory bandwidth is homogeneous across the whole board (i.e. each TPU on a board gets 600 GB/s to its peers), we can't do model parallelism across ASICs efficiently, and must fall back to data parallelism. Which is fine, until you run into limits on maximum batch size (e.g. up to 8192, as FAIR was able to manage [1] with some tweaks to SGD).
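A minimal numpy sketch of that data-parallel fallback, just to show where the batch-size ceiling comes from; the "model" here is a toy least-squares problem and the "devices" are simulated by array shards:

    import numpy as np

    def local_gradient(w, x, y):
        # gradient of 0.5 * ||x @ w - y||^2 on this device's shard
        return x.T @ (x @ w - y) / len(x)

    rng = np.random.default_rng(0)
    n_devices, per_device_batch, dim = 4, 8, 16
    w = np.zeros(dim)
    x = rng.normal(size=(n_devices * per_device_batch, dim))
    y = rng.normal(size=n_devices * per_device_batch)

    # shard the global batch, compute per-device gradients, then average (the all-reduce)
    grads = [local_gradient(w, xs, ys)
             for xs, ys in zip(np.split(x, n_devices), np.split(y, n_devices))]
    w -= 0.1 * np.mean(grads, axis=0)
    print(w[:4])

Adding devices only grows the effective global batch (per-device batch times device count), which is why pure data parallelism eventually runs into the ~8192 batch-size limit mentioned above unless the optimizer is tweaked.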