abstractcontrol's comments

I've thought about adding record row polymorphism to Spiral, but I am not familiar with it and couldn't figure out how to make it work well in the presence of generics.

Why are generics the tricky bit? Isn't that the bread and butter of this type system? You should just be able to read the article's 'type variable' as 'generics'.

Staged FP in Spiral: https://www.youtube.com/playlist?list=PL04PGV4cTuIVP50-B_1sc...

Some of the stuff in this playlist might be relevant to you, though it is mostly about programming GPUs in a functional language that compiles to Cuda. The author (me) sometimes works on the language during the video, either fixing bugs or adding new features.


What's a Net/60 basis? I am having trouble understanding how often you were paid. Every month or so?

Edit: Nvm, I saw you worked for 90 days without pay. Ack.


Net/30 is 30 days after the contract is signed in this case; Net/60 is 60 days. Sometimes instead of "after the contract is signed" it is "after work is delivered", but my case was the former.


https://www.youtube.com/playlist?list=PL04PGV4cTuIVP50-B_1sc...

Staged Functional Programming In Spiral

I am building a fully fused ML GPU library, along with a poker game to run it on, in my own programming language that I've worked on for many years. Right at this very moment I am trying to optimize compilation times and register usage by doing more on the heap, so I am creating a reference-counting Cuda backend for Spiral.
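For a rough idea of what reference counting on the device could look like, here is a minimal standalone CUDA sketch; the RefCell struct and the cell_* helpers are made-up names for illustration, not Spiral's actual backend:

    #include <cstdio>

    // Hypothetical refcounted heap cell: an atomic counter plus a payload.
    struct RefCell {
        int refcount;
        float payload;
    };

    __device__ RefCell* cell_alloc(float x) {
        // Device-side malloc/free have been available since compute capability 2.0.
        RefCell* c = static_cast<RefCell*>(malloc(sizeof(RefCell)));
        c->refcount = 1;
        c->payload = x;
        return c;
    }

    __device__ void cell_incref(RefCell* c) { atomicAdd(&c->refcount, 1); }

    __device__ void cell_decref(RefCell* c) {
        // atomicSub returns the old value; the thread that drops it to zero frees.
        if (atomicSub(&c->refcount, 1) == 1) free(c);
    }

    __global__ void demo() {
        RefCell* c = cell_alloc(42.0f);
        cell_incref(c);
        printf("payload = %f\n", c->payload);
        cell_decref(c);
        cell_decref(c); // count reaches zero here, the cell is freed
    }

    int main() {
        demo<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }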

Both the ML library and the poker game are designed to run completely on GPU for the sake of getting large speedups.

Once I am done with this and have trained the agent, I'll test it out on play money sites, and if that doesn't get it eaten by the rake, with real money.

I am doing fairly sophisticated functional programming in the videos, the kind you could only do in the Spiral language. Many parts of the series involve me working on and improving the language itself in F#.


For a deep dive, maybe take a look at the Spiral matrix multiplication playlist: https://www.youtube.com/playlist?list=PL04PGV4cTuIWT_NXvvZsn...

I spent 2 months implementing a matmult kernel in Spiral and optimizing it.


Are Winograd’s algorithms useful to implement as a learning exercise?


Never tried those, so I couldn't say. I guess they would be.

Even so, creating all the abstractions needed to implement even regular matrix multiplication in Spiral in a generic fashion took me two months, so I'd consider that a good enough exercise.

You could do it a lot faster by specializing for specific matrix sizes, like in the Cuda examples repo by Nvidia, but then you'd miss the opportunity to do the tensor magic that I did in the playlist.


You are the author of the playlist/maker of the videos?


Yes.


Sorry for the noob question, but how is GPU programming helpful?


NNs, for example, are (mostly) a sequence of matrix multiplication operations, and GPUs are very good at those. Much better than CPUs. AI is hot at the moment, and Nvidia is producing the kind of hardware that can run large models efficiently, which is why it's a 2-trillion-dollar company right now.

However, in the Spiral series, I aim to go beyond just making an ML library for running NN models and break new ground.

Newer GPUs actually support dynamic memory allocation and recursion, and the GPU threads have their own stacks, so you could in fact treat them as sequential devices and write games and simulators directly on them. I think once I finish the NL Holdem game, I'll be able to get over 100-fold improvements by running the whole program on the GPU versus the old approach of writing the sequential part on a CPU and only using the GPU to accelerate a NN model powering the computer agents.
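As a minimal illustration of that point (not code from the series, just a standalone CUDA sketch), a kernel can recurse and print like ordinary sequential code, provided you give it enough per-thread stack:

    #include <cstdio>

    // Plain recursion running on a single GPU thread, written as you would on a CPU.
    __device__ long long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    __global__ void run() {
        printf("fib(20) = %lld\n", fib(20));
    }

    int main() {
        // Each GPU thread has its own call stack; raise the limit if recursion goes deep.
        cudaDeviceSetLimit(cudaLimitStackSize, 16 * 1024);
        run<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }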

I am not sure if this is a good answer, but this is how GPU programming would be helpful to me. It all comes down to performance.

The problem with programming them is that the program you are trying to speed up needs to be specially structured so that it utilizes the full capacity of the device.
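To make "specially structured" concrete, here is a generic CUDA sketch (not code from Spiral or the playlist) of the classic restructuring for matrix multiplication: instead of each thread streaming its whole row and column from global memory, the block stages tiles in shared memory so every load gets reused across the block.

    #include <cstdio>
    #include <vector>

    #define TILE 16

    // C = A * B for square n x n matrices; n is assumed to be a multiple of TILE.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            // Stage one tile of A and one of B in shared memory; each load is then
            // reused TILE times by the block instead of hitting global memory again.
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }

    int main() {
        int n = 256;
        size_t bytes = n * n * sizeof(float);
        std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n);
        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

        dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
        matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
        cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[0] = %f (all-ones inputs should give %d)\n", hC[0], n);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }

A production kernel goes a lot further (vectorized loads, register tiling, tensor cores), but the shared-memory staging above is the basic shape of the restructuring.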


> I have quite a lot of concurrency so I think my ideal hardware is a whole lot of little CPU cores with decent cache and matmul intrinsics

Back in 2015 I thought this would be the dominant model in 2022. I thought that the AI startups challenging Nvidia would be about that. Instead, they all targeted inference instead of programmability. I thought that Tenstorrent's hardware would be about what you are talking about - lots of tiny cores, local memory, message passing between them, AI/matmult intrinsics.

I've been hyped about Tenstorrent for a long time, but now that it is finally coming out with something, I can see that the Grayskulls are very overpriced. And if you look at the docs for their low-level kernel programming, you will see that Tensix cores can only have four registers, have no register spilling, and also don't support function calls. What would one be able to program with that?

It would have been interesting had the Grayskull cards been released in 2018. But in 2024 I have no idea what the company wants to do with them. It's over five years behind what I was expecting.

My expectations for how the AI hardware wave would unfold were fit for another world entirely. If this is the best the challengers can do, the most we can hope for is that they depress Nvidia's margins somewhat so we can buy its cards cheaper in the future. As we go towards the Singularity, I've gone from expecting revolutionary new hardware from AI startups to hoping Nvidia can keep making GPUs faster and more programmable.

Ironically, that latter thing is one trend that I missed: going from the Maxwell cards to the latest generation, the GPUs have gained a lot in terms of how general purpose they are. The range of domains they can be used for is definitely going up as time goes on. I thought that AI chips would be necessary for this, and that GPUs would remain as toys, but it has been the other way around.


I wasn't as optimistic that there would be broad adoption of some of the more advanced techniques I was working on, so I did figure back in 2013 that most people would stick to GEMMs and Convs with rather simple loss functions - I had a hard enough time explaining BPR triplet loss to people. Now with LLMs, people will be doubling down on GEMMs for the foreseeable future.

My customers won't touch non-commodity hardware as they see it as a potential vector for vendors to screw them over, and they're not wrong about that. In a post-apocalyptic scenario they could just pull a graphics card out of a gaming computer to get things working again, which gives them a strong feeling of security. Having very capable GPU cards as a commodity means I can re-use the same ops for my training and inference, which roughly halves my workload.

My approach to hardware companies is that I'll believe it when I see it: I'll wait until something is publicly available that I can buy off the shelf before looking too closely at its architecture. Nvidia with their Tensor Cores got so good so quickly that I never really looked too closely at alternatives. I'm kind of hopeful that an AMD SoC would provide a good edge-compute option, so I might give that a go.

I had a look at Tenstorrent given this article, and the Grendel architecture seems interesting.


Grayskull shipped in 2020, and each Tensix core has five RISC-V cores. Get your basic facts right before you complain. The dev kit is just that, a dev kit. Groq sells their dev kit for $20k even though a single LPU is useless.


> Groq sells their dev kit for $20k even though a single LPU is useless.

I find this a very questionable business decision.


Considering the system only has a single H100, why would it be that performant?


Yeah this page is full of straight up lies?

“ Its performance in every regard is almost unreal (up to 284 times faster than x86).”

Like, there are at least 3 things wrong with that statement!


benchmark:

"NVIDIA GH200 CPU Performance Benchmarks Against AMD EPYC Zen 4 & Intel Xeon Emerald Rapids"

* https://www.phoronix.com/review/nvidia-gh200-gptshop-benchma...


BoM?


Bill of Materials.

Also known as the cost of the chip.


"Bill of Materials" should really only be used for things that require a list of items/labor and aren't sold individually, such as a datacenter, a building, a hardware integration project, a wedding, etc. For things that are sold individually, "cost" will suffice, and BoM is an example of incorrectly using a more complicated term for the sake of seeming smart.


> Also know as cost of the chip.

As a shorthand for expected retail cost, that is terribly misapplied. Or even for internal cost.

The chip would be one line item on the BOM for the entire thing you plan on shipping (but not packaged yet). Even in the case where you are "just" selling a chip, the BOM is likely more complicated, and in this context primarily the manufacturing side cares about that. The COGS (cost of goods sold) is something the company as a whole will care more about - this is what it actually costs you to get it out the door. You will hear "BOM cost" referring to the elemental cost of one item on the BOM, but that's not the BOM itself.

None of these are related to the retail (or wholesale) cost in a simple way, other than forming a floor on the long-term sustainable price.

The GGG-whatever post is using this sloppily to suggest that the chips are going to be very expensive to produce, therefore the product is going to be expensive.

You'll also see BOM in a materials and labor type invoice, (like when you get your car serviced) but that's not relevant here.


Yeah fair, my comment was a terrible comment.

What I really meant was the total production cost plus a reasonable amortization of the development/tape-out cost, which of course is huge even if this chip were mass produced.

The point kinda stands though: this thing is collectively way too hard to fab and assemble to sell to consumers at any reasonable cost, especially on top of the massive R&D that AMD, TSMC, and everyone along the huge chain put into it.

It's not like a 4090 or Apple Silicon, where production is reasonably cheap but margins are super high because they can be super high.


Bill of Materials


> Yet many startups and existing designers anticipated this demand correctly, years in advance, and they are all still kinda struggling. Nvidia is massively supply constrained. AI customers would be buying up MI250s, CS-2s, IPUs, Tenstorrent accelerators, Gaudi 2s and so on en masse if they wanted to... But they are not, and its not going to get any easier once the supply catches up.

Can you order any of these devices online as a regular person? Anybody can order a $300 Nvidia GPU and program it. This is the reason why deep learning originated on GPUs. Forget those other AI accelerators: even if you bought something like a consumer-grade AMD GPU, you couldn't program it because it's restricted. Nvidia's competitors are struggling because their hardware is either too expensive or too hard to buy.


Thanks for posting this.

Right now, I am grappling with RSI and I haven't been able to program for more than a few days in the past month. I am not even typing this, but using the Voice Access feature of Windows 11 to input it. I ordered an ergonomic keyboard (Glove80) and I am waiting for it. I also have an ergonomic mouse and even got an ergonomic chair. But I know that regardless, I won't be able to program for at least a few months, until my hand recovers. If it does at all.

I am definitely going to check out this extension. Quick question: does it work on any language or just something like Javascript?


It works better in some languages than others. Python and JavaScript are the best supported. I have been working on making the situation better with Go. But I demoed it with JavaScript because JavaScript is the most understandable to the largest part of my audience.

Realistically, half of programming by voice is syntax. One of the really cool parts of the voice bindings is that the binding for making a function is the same in every language. I am honestly quite terrible at Python. However, since I know the default command to make a function is funky, I can just step right in there and make it work.

You must crawl before you can walk. You must walk before you can run. You must run before you can fly. Welcome to the Crawling Phase.

