Thanks! Yup, private retrieval is interesting as a product because it's a fundamentally new capability; there aren't really competitors we can show incremental improvements against. If you're still interested in the space, we'd be happy to compare notes! Feel free to email us: founders AT blyss.dev
Our FHE scheme uses lots of Number Theoretic Transforms (NTTs), which are pretty computationally expensive. NTT is a good candidate for acceleration, and there is quite a bit of interest from the zk community in doing so (https://www.zprize.io/prizes/accelerating-ntt-operations-on-...).
From a hardware perspective, NTT can be done in parallel, but has a fairly large working set of data (~512 MB) with lots of unstructured accesses. This is too big to fit in even the largest CPU L3 caches, so DRAM bandwidth is still relevant. It may eventually be feasible to build an ASIC with this much on-chip memory, but in the meantime, GPUs do a pretty decent job with their massive HBM bandwidth.
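For a sense of what's being accelerated, here's a minimal radix-2 NTT sketch in Python. This is not our actual implementation; the modulus 998244353 and primitive root 3 are just common textbook choices, and real FHE parameters are much larger:

    # Iterative radix-2 NTT over Z_p. Assumptions: len(a) is a power of two
    # and p is an NTT-friendly prime (len(a) divides p - 1).
    def ntt(a, p=998244353, g=3):
        n = len(a)
        # bit-reversal permutation
        j = 0
        for i in range(1, n):
            bit = n >> 1
            while j & bit:
                j ^= bit
                bit >>= 1
            j |= bit
            if i < j:
                a[i], a[j] = a[j], a[i]
        # Cooley-Tukey butterflies
        length = 2
        while length <= n:
            w_len = pow(g, (p - 1) // length, p)  # primitive length-th root of unity
            half = length // 2
            for start in range(0, n, length):
                w = 1
                for k in range(start, start + half):
                    u, v = a[k], a[k + half] * w % p
                    a[k], a[k + half] = (u + v) % p, (u - v) % p
                    w = w * w_len % p
            length <<= 1
        return a

    print(ntt([1, 2, 3, 4]))

Each stage's butterflies are independent (hence the parallelism), but the strides sweep across the whole array, which is where the working-set and memory-bandwidth pressure comes from.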
Interesting prize. I wonder why they require it to be a radix-2 NTT; using a higher radix speeds things up by an order of magnitude on GPU (granted, I'm using a 256-bit field, so my case might be more memory-bound).
While OPT-175B is great to have publicly available, it needs a lot more training to achieve good results. Meta trained OPT on 180B tokens, compared to the 300B that GPT-3 saw. And the Chinchilla scaling laws (roughly 20 training tokens per parameter) suggest that almost 4T tokens would be required to get the most bang for the compute buck.
And on top of that, there are some questions on the quality of open source data (The Pile) vs OpenAI’s proprietary dataset, which they seem to have spent a lot of effort cleaning. So: open source models are probably data-constrained, in both quantity and quality.
OPT-175B isn't publicly available, sadly. It's available to research institutions, which is much better than "Open"AI, but it doesn't help us hobbyists/indie researchers much.
I wonder when we'll start putting these models on the pirate bay or similar. Seems like an excellent use for the tech. Has no one tried to upload OPT-175B anywhere like that yet?
I was curious about exactly how much burst heat could be absorbed, so I asked WolframAlpha [0]. In a 15" workstation laptop, I think the CPU could quite reasonably pull +100 watts over steady-state TDP for 30 seconds. (100-gram aluminum heatsink absorbing 100W * 30sec -> ∆T = +33°C)
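For anyone who wants to plug in their own numbers, the arithmetic is just heat capacity; this assumes aluminum's specific heat of ~0.9 J/(g*K) and that all of the excess power goes into the heatsink with no losses:

    energy_j = 100 * 30        # 100 W over steady-state TDP, for 30 seconds
    mass_g = 100               # heatsink mass in grams
    c_aluminum = 0.90          # specific heat of aluminum, J/(g*K)
    delta_t = energy_j / (mass_g * c_aluminum)
    print(round(delta_t, 1))   # ~33.3 K rise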
It's interesting to note that none of the merged companies are fabless.
Analog and low-to-mid complexity digital designs don't usually use the smallest, newest, most expensive silicon processes that you need for processors, GPUs, and FPGAs. You generally need capacitors, precision resistors, and wider voltage ranges more than you do billions of transistors.
Maybe now that these older fabs are being forced to run as actual businesses rather than as bleeding-edge science projects, semiconductor design companies are able to bring them back into the fold to avoid dealing with the headaches of being fabless.
It's too bad though, because this adds a huge capital cost to what would otherwise be a really ripe opportunity for a new competitor. This consolidation has definitely brought higher prices and reduced the diversity of available parts.
The unit economics of analog ICs should be very good -- a product that needs 1/100th the silicon surface area and sells for 1/10th the price, using a much cheaper node than a modern digital IC. There should be plenty of room for a company to compete with Analog Devices on price while still making a killing.
None of these companies are pushing the process side, and they're able to run on old, well-depreciated fabs. Their business is uninteresting to the foundry giants. Why? After all, the foundries have old fabs too. But most of the embedded suppliers are forced to be low-margin providers (often commodity or quasi-commodity parts only a couple of steps up the food chain from passives). They don't have any margin to give away to a foundry.
Analog is a bit different in that "node" doesn't really apply, but analog vendors are mostly not in the high-value part of the value chain either.
If you're fabless, the design cycle time is longer. Every prototype requires a formal agreement with one of the fabs you work with, and usually involves a few million dollars changing hands. A few months is usually the minimum.
Even if it's expensive, owning a fab means you have the option to make prototypes of a design, or of parts of a design. You will never, ever do this if you're fabless.
We're good at simulating digital logic, but simulating analog designs is more difficult, and each process tends to have unique quirks. You want your designers to be familiar with these quirks, which is easier to do when everybody designing and using a process is under the same roof.
If you're making a chip that has exotic needs (voltage ranges, threshold voltages, RF performance, noise, thermal properties, bipolar + CMOS, etc.) you will have more ability to tweak the process. Foundry-type fabs typically offer a smaller "menu" of options that they're comfortable they can support. For example, I think you might have a hard time competing with some of AD's more expensive ADCs as a fabless semiconductor company.
Don't get me wrong... there are plenty of headaches to owning and operating a fab too.
Did everyone with an interest here see the post [0] a week ago about free fabbing for 130nm open-source chips?
There's certainly the possibility to do analog chips here, but it would take a big team effort.
(I've only dabbled a bit with FPGAs and soft cores; the last project I used one for was ~2009: running Linux on an Altera NIOS core, where we sampled an input at 100 Msps until we filled up the RAM, then more slowly dumped it over Ethernet to a PC)
Being fabless has quite a few advantages too: the cost of building and running a fab is huge, so it can only be afforded by major players, or by folks who can live on very old process nodes. Most startups, and even ADI themselves, use newer nodes like 16nm with external fabs like TSMC.
The early age of the semiconductor industry was definitely much more interesting than today's world of oligopolies and company consolidation. It's quite disappointing to see the big names in the industry have all vanished.
> today's world of oligopolies and company consolidation
How long until we have to sign a license agreement before we can use an OpAmp? Oh, and the license is only valid for consumer applications. Want to use the OpAmp for enterprise applications? That'll cost you more.
That's if we're lucky. In a darker scenario, all OpAmp designs have been bought by Apple, and you can't even use one if you opened your iPhone because the function has been integrated into the CPU.
Fortunately, we don't need to sign a license agreement to use an Intel CPU yet, and an OpAmp EULA doesn't look like something currently on the horizon. But if an EULAed CPU ever becomes a reality, an EULAed OpAmp will follow...
> In a darker scenario, all OpAmp designs have been bought by Apple, and you can't even use one if you opened your iPhone because the function has been integrated into the CPU.
Scary, because many microcontrollers already have OpAmps built into them...
Making hardware hasn't gotten any easier except at the edges where open source has taken hold, and it seems like the only reason you'd want to design hardware anymore is to have more vertical control. Tesla, Apple, Google are all prime examples.
The margins on an OK product just aren't worth it; to even break even you have to build the whole thing and provide a reference design that's within 20% of the best out there. No surprise the market is all oligopolies. If there weren't open source and affordable fab services coming up, there would really be no hope.
+1. This can be terrible. When Avago acquired Broadcom (and kept Broadcom's name), they initially continued selling existing components for a while, then suddenly discontinued hundreds of discrete RF/microwave parts (some inherited from the Agilent and even Hewlett-Packard days) because they were "legacy parts". No! It was a huge pain. Many of those, despite their age, are still good and useful; it's just that the demand and profit from discrete parts are low, and the new management decided it wasn't worth keeping them.
On the other hand, after Analog purchased Linear, many of the high-performance Linear parts are still sold side-by-side with competing Analog parts today; Analog even created a "Powered by Linear" product line for selling Linear power converter chips. It was a wise decision; apparently the management knew those Linear parts are of great value. I hope Analog will adopt a similar approach for these Maxim parts.
Yeah, I was using a 30 dBm, L-band part that they discontinued, and had to redesign a PA section. Didn't HP acquire Avantek back in the 90's? I used to use a bunch of ATF-xxx parts.
Discontinued parts also included passive parts originally made by the HP Components subsidiary in the 1980s, such as special Schottky diodes and PIN diodes for RF/microwave applications up to 10 GHz, still perfectly working today. For example, HP's jelly-bean HSMS‑282x series of 6 GHz Schottky diodes was the go-to choice in RF circuits (even at lower frequencies like VHF and UHF) for three decades, and was still in production as of 2016 - you can find one datasheet with an HP logo, another with an Agilent logo, another with an Avago logo, and the last with a Broadcom logo - until Broadcom killed them in 2017 after the Avago acquisition.
I was hit by this. On a recent weekend I was tinkering with a DIY software-defined amateur radio receiver design and needed some RF diodes, only to find that all of them had been killed, and similar parts from NXP weren't stocked by the local distributor, with 10-day shipping... A friend told me they've switched to Skyworks diodes since then. The legendary life and unfortunate death of the HP diodes.
That being said, I was disappointed with EE Times's vacuous take on this merger this morning. They've changed hands and managed to survive, but like most of the vertical trade press, they're a shadow of their former selves.
The author's comments on cache sizes are a bit reductive. Not all "L3" is created equal, and designers always make tradeoffs between capacity and latency.
In particular, the EPYC processors achieve such high cache capacities by splitting L3 into slices across multiple silicon dies, and accessing non-local L3 incurs huge interconnect latency - 132ns on latest EPYC vs 37ns on current Xeon [1]. Even DDR4 on Intel (90ns) is faster than much of an EPYC chip's L3 cache.
Intel's monolithic die strategy keeps worst case latency low, but increases costs significantly and totally precludes caches in the hundreds of MB. Depending on workload, that may or may not be the right choice.
In practice the large AMD L3s result in very good performance. The new Ryzen CPUs, for instance, absolutely crush Intel CPUs at GCC compile times because of them ( https://www.youtube.com/watch?v=CVAt4fz--bQ )
Are there workloads where the AMD suffers due to its L3 design? Maybe, but I've not seen one yet. For something special like that, I'd imagine you could try to arrange thread affinity to avoid non-local L3 accesses.
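Something like this is what I have in mind; a Linux-only sketch, where os.sched_setaffinity does the pinning and the core IDs are hypothetical (you'd read the real CCX-to-core mapping from lscpu or sysfs first):

    import multiprocessing as mp
    import os

    CCX0_CORES = {0, 1, 2, 3}   # hypothetical: logical CPUs sharing one L3 slice

    def worker(chunk):
        # Pin this worker to a single CCX so its working set stays in the local L3.
        os.sched_setaffinity(0, CCX0_CORES)
        return sum(x * x for x in chunk)   # stand-in for cache-sensitive work

    if __name__ == "__main__":
        with mp.Pool(processes=len(CCX0_CORES)) as pool:
            print(pool.map(worker, [range(100_000)] * 4))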
> Are there workloads where the AMD suffers due to its L3 design?
Databases, particularly any database which benefits from more than 16MB of L3 cache.
> On my 3900x L3 latency is 10.4ns when local.
And L3 latency is >100ns when off-die. Remember, to keep memory coherent, only one L3 cache can "own" data. You gotta wait for the "other core" to give up the data before you can load it into YOUR L3 cache and start writing to it.
It's clear that AMD has a very good cache-coherence system to mitigate the problem (aka Infinity Fabric), but you can't get around the fundamental fact that a core only really has 16MB of L3 cache.
Intel systems make all of their L3 cache available to all of their cores, which greatly benefits database applications.
---------
AMD Zen (and Zen2) is designed for cloud servers, where those "independent" bits of L3 cache are not really a big problem. Intel Xeons are designed for big servers that need to scale up.
With that being said, cloud-server VMs are the dominant architecture today, so AMD really did innovate here. But it doesn't change the fact that their systems have the "split L3" problem which affects databases and some other applications.
> Databases, particularly any database which benefits from more than 16MB of L3 cache.
Yes, but have you seen this actually measured as a net performance problem for AMD compared to Intel? I understand the theoretical concern.
It's older (Zen 1), but you can see how even an AMD EPYC 7601 (32-core) is far slower than an Intel Xeon Gold 6138 (20-core) in Postgres.
Apparently Java benchmarks are also L3-cache-heavy or something, because the Xeon Gold is faster in Java as well (at least in whatever Java benchmark Phoronix was running).
What I see there is that the EPYC 7601 (first graph, second from the bottom) is much faster than the Xeon 6138 -- it's only slower than /two/ Xeons ("the much more expensive dual Xeon Gold 6138 configuration"). The 32-core EPYC scores 30% more than the 20-core Xeon.
Look at PostgreSQL, where the split-L3 cache hampers the EPYC 7601's design.
As I stated earlier: in many workloads, the split cache of EPYC seems to be a benefit. But in DATABASES, which are a major workload for any modern business, EPYC loses to a much weaker system.
Are their L3 slices MOESI like their L2s are (or at least were)? That'd let you have multiple copies in different slices as long as you weren't mutating them.
Any information I look up runs into an NDA firewall pretty quickly (be it performance counters or hardware-level documentation). It seems like AMD is highly protective of their coherency algorithm.
> That'd let you have multiple copies in different slices as long as you weren't mutating them.
Seems like the O(wned, i.e. shared-dirty) state does let a modified line be shared across slices without writing it back first, though a write still has to invalidate the other copies. But it's still a "multiple copies" methodology. Once any particular core comes up against the 8MB (Zen) or 16MB (Zen2) limit, that's all it gets. There's no way to give a single dataset 32MB of cache on Zen or Zen2.
In general, all Zen generations share two characteristics: cores are bound into 4-core clusters called CCXs, and two of those are bound into a group called a CCD. Chips (Zen 1 and 1+) and chiplets (Zen 2) have only ever had one CCD per chip(-let), and 1, 2, or 4 chip(-lets) have been put on a socket.
In Zen 1 and 1+, each chip had a micro IO die, which contained the L3, making for a quasi-NUMA system. Example: a dual-processor Epyc of that generation would have one of 8 memory controllers reply to a fetch/write request (whoever had it closest: either somebody already had it in L3, or somebody owned that memory channel).
L3 latency on such systems should be quoted as an average, or as a best case and a worst case. Quoting only the worst case ignores cache optimizations: prefetchers can grab from non-local L3, and fetches from L3 don't compete with the finite RAM bandwidth but add to it, which can mean a 2-4x increase in performance if multiple L3 caches are responding to your core. In addition, Intel has similar performance issues: RAM on another socket also carries a latency penalty (the nature of all NUMA systems, no matter who manufactured them).
Where Zen 1 and 1+-based systems performed badly was when the prefetcher (or a NUMA-aware program) did not get pages into L2 or local L3 fast enough to hide the latency (Epyc had the problem of too many IO dies communicating with each other; Ryzen had the problem of its single IO die not being enough to keep the system performing smoothly).
Zen 2 (the generation I personally adopted, wonderful architecture) switched to a chiplet design: it still retains dual 4-core CCXs per CCD (and thus per chiplet), but the IO die now lives in its own chiplet, thus one monolithic L3 per socket. The IO die is scaled to the needs of the system, instead of growing statically with additional CCDs. Ryzen now performs ridiculously fast: it meets or beats Coffee Lake Refresh performance (single- and multi-threaded) for the same price, while drawing less power and putting out less heat. Epyc now scales up to ridiculously huge sizes without losing performance in non-optimal cases or getting into weird NUMA latency games (everyone's early tests of multi-socket Epyc 2 systems on intentionally bad-for-NUMA workloads illustrate a very favorable worst case, meeting or beating Intel's current gargantuan Xeons in workloads sensitive to memory latency).
So, your statement of "a tiny NUMA system within the package" is correct for older Zens, but not correct (and, thankfully, vastly improved) for Zen 2.
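If anyone wants to see how their own box is carved up, the kernel exposes the L3 sharing groups in sysfs; a small Linux-only sketch (index3 is usually the L3, but check the level file if unsure):

    import glob

    groups = set()
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"):
        with open(path) as f:
            groups.add(f.read().strip())

    for cpus in sorted(groups):
        print("L3 slice shared by CPUs:", cpus)

On my understanding, a Zen 1/Zen 2 part should print one group per CCX, while a monolithic Intel die prints a single group covering every core.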
Yeah. I bet part of why there's so much L3 per core group is that it's really expensive to go further away.
Seems like there are at least two approaches for future generations: widen the scope across which you can share L3 without a slow trip across the I/O die, or speed up the hop through the I/O die. Unsure what's actually a cost-effective change vs. just a pipe dream, though.
I'm very confused; there appear to be several conflicting reports on L3 cache latency for EPYC chips [1] [2]. Is it the larger random cache writes that are causing the additional latency?
Regardless, I wouldn't be particularly concerned; cache seems like the easier issue to address vs. power density.
> Is it the larger random cache writes that are causing the additional latency?
Think of the MESI model.
If Core#0 controls memory location #500 (Exclusive state), and then Core#32 wants to write to memory location #500 (also requires Exclusive state), how do you coordinate this?
The steps are as follows:
#1: Core#0 flushes the write buffer, L1 cache, and L2 cache so that the L3 cache and memory location #500 are fully updated.
#2: Memory location #500 is pushed out of Core#0's L3 cache and into Core#32's L3 cache. (Core#0 sets location #500 to "Invalid", which allows Core#32 to set location #500 to Exclusive.)
#3: Core#32's L3 cache then transfers the data to L2, then L1, and finally it can be read by Core#32.
--------
EDIT: Step #1 is skipped when you read from DDR4 RAM. So DDR4 RAM reads under the Zen and Zen2 architectures are faster than remote L3 reads. An interesting quirk for sure.
In practice, Zen / Zen2's quirk doesn't seem to be a big deal for a large number of workloads (especially cloud servers / VMs). Databases are the only major workload I'm aware of where this really becomes a huge issue.
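To make the handoff above concrete, here's a toy sketch of the bookkeeping; deliberately simplified, with generic MESI-style states rather than AMD's actual protocol:

    MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

    class L3Slice:
        def __init__(self, name):
            self.name, self.lines = name, {}   # addr -> (state, value)

        def write(self, addr, value, other_slices):
            # Steps #1/#2: every other slice gives up its copy before this
            # slice can hold the line exclusively.
            for other in other_slices:
                state, old = other.lines.get(addr, (INVALID, None))
                if state != INVALID:
                    print(f"{other.name}: flush addr {addr} and invalidate (was {state})")
                    other.lines[addr] = (INVALID, old)
            # Step #3: this slice now owns the line and can modify it.
            self.lines[addr] = (MODIFIED, value)
            print(f"{self.name}: addr {addr} -> Modified")

    ccx0, ccx1 = L3Slice("CCX0"), L3Slice("CCX1")
    ccx0.lines[500] = (EXCLUSIVE, 42)   # Core#0's slice holds location #500 Exclusive
    ccx1.write(500, 99, [ccx0])         # Core#32's write forces the handoff

The round trip for that flush/invalidate is what shows up as the >100ns remote-L3 numbers quoted elsewhere in the thread.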
Actually, the L3 cache is also sharded across chiplets, so there's a small (~8MB) local portion of L3 that is fast, while remote slices will have to go over AMD's interdie connection fabric and incur a serious latency penalty. On first gen Epyc/Threadripper, nonlocal L3 hits were almost as slow as DRAM at ~100ns (!).
That still means multiple silicon dies, which we have known how to do for a while (see: Intel Core 2 Quad from 2006, and more recently AMD Epyc).
Having more dies lets you dissipate more heat, but then it's kinda hard to build low-latency / high-bandwidth interconnects between the dies. Inter-die buses go over a PCB or interposer, which imposes higher parasitic capacitance and makes it difficult/expensive to run wide interfaces. That's why techniques like "dark silicon" allocation are important: they let us get more performance out of a single die.
It matters when defining parallel work distribution. Unless memory bandwidth is homogeneous across the whole board (i.e. each TPU on a board gets 600 GB/s to its peers), we can't do model parallelism across ASICs efficiently, and must fall back to data parallelism. Which is fine, until you run into limits on maximum batch size (e.g. up to 8192, as FAIR was able to manage [1] with some tweaks to SGD).
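A minimal numpy sketch of that data-parallel fallback, just to show where the batch-size ceiling comes from; the "model" here is a toy least-squares problem and the "devices" are simulated by array shards:

    import numpy as np

    def local_gradient(w, x, y):
        # gradient of 0.5 * ||x @ w - y||^2 on this device's shard
        return x.T @ (x @ w - y) / len(x)

    rng = np.random.default_rng(0)
    n_devices, per_device_batch, dim = 4, 8, 16
    w = np.zeros(dim)
    x = rng.normal(size=(n_devices * per_device_batch, dim))
    y = rng.normal(size=n_devices * per_device_batch)

    # shard the global batch, compute per-device gradients, then average (the all-reduce)
    grads = [local_gradient(w, xs, ys)
             for xs, ys in zip(np.split(x, n_devices), np.split(y, n_devices))]
    w -= 0.1 * np.mean(grads, axis=0)
    print(w[:4])

Adding devices only grows the effective global batch (per-device batch times device count), which is why pure data parallelism eventually runs into the ~8192 batch-size limit mentioned above unless the optimizer is tweaked.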