
I don't want to dump too many of these, but I found

   chess - checkers = wormseed mustard (63%)
pretty funny and very hard to understand. All the other options are hyperspecific grasslike plants like meadow salsify.


My philosophical take on it is that natural language has many, many more dimensions than we could hope to represent. Whenever you do dimension reduction, you lose information.


A similar result was found by J. Gagné [0] five years ago, though I don't think he publishes his results outside of his blog (which is well worth a read by the way).

[0] https://coffeeadastra.com/2020/05/23/the-physics-of-kettle-s...


How did you get started? Was it mostly plug-and-play or was some nontrivial hacking involved? I use emacs and I normally wouldn't mind shaving a yak or two, but right now I'm swamped with work and I'm kinda scared of getting sucked into a rabbit hole.


gptel is mostly plug-and-play. The docs offer a comprehensive overview: https://github.com/karthink/gptel


> Also, don't forget the Jacobian and gradient aren't the same thing!

Every gradient is a Jacobian but not every Jacobian is a gradient.

If you have a map f from R^n to R^m then the Jacobian at a point x is an m x n matrix which linearly approximates f at x. If m = 1 (namely if f is a scalar function) then the Jacobian is exactly the gradient.

If you already know about gradients (e.g. from physics or ML) and can't quite wrap your head around the Jacobian, the following might help (it's how I first got to understand Jacobians better):

1. write your function f from R^n to R^m as m scalar functions f_1, ..., f_m, namely f(x) = (f_1(x), ..., f_m(x))

2. take the gradient of f_i for each i

3. make an m x n matrix where the i-th row is the gradient of f_i

The matrix you build in step 3 is precisely the Jacobian. This is obvious if you know the definition, and it's not a mathematically remarkable fact, but for me at least it was useful for demystifying the whole thing.
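
If it helps, here's a small sketch of steps 1-3 in sympy (the particular map f: R^2 -> R^3 is made up purely for illustration):

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')

    # Step 1: a map f: R^2 -> R^3 written as three scalar functions
    # (arbitrary example functions).
    f1, f2, f3 = x1 * x2, sp.sin(x1), x1 + x2**2

    # Step 2: the gradient of each f_i.
    grads = [[sp.diff(fi, v) for v in (x1, x2)] for fi in (f1, f2, f3)]

    # Step 3: stack the gradients as the rows of a 3 x 2 matrix.
    J = sp.Matrix(grads)
    print(J)  # Matrix([[x2, x1], [cos(x1), 0], [1, 2*x2]])

    # Sanity check against sympy's built-in Jacobian.
    assert J == sp.Matrix([f1, f2, f3]).jacobian([x1, x2])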


For m = 1, the gradient is a "vector" (a column vector). The Jacobian is a functional/a linear map (a row vector, dual to a column vector). They're transposes of one another. For m > 1, I would normally just define the Jacobian as a linear map in the usual way and define the gradient to be its transpose. Remember that these are all just definitions at the end of the day and a little bit arbitrary.
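
Writing out the m = 1 case in that convention (a sketch, notation mine):

    \nabla f(x) \;=\;
    \begin{pmatrix} \partial f/\partial x_1 \\ \vdots \\ \partial f/\partial x_n \end{pmatrix},
    \qquad
    J_f(x) \;=\; \nabla f(x)^{\top} \;=\;
    \begin{pmatrix} \partial f/\partial x_1 & \cdots & \partial f/\partial x_n \end{pmatrix}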


I'd say a gradient is usually a covector / one-form. It's a map from vector directions to a scalar change, i.e. df = f_x dx + f_y dy is what you can actually compute without a metric; it's in T*M, not TM. If you have a direction vector (e.g. 2 d/dx), you can get from there to a scalar.


I'm not a big Riemannian geometry buff, but I took a look at the definition in Do Carmo's book and it appears that "grad f" actually lies in TM, consistent with what I said above. Would love to learn more if I've got this mixed up.

This would be nice, because it would generalize the "gradient" from vector calculus, which is clearly and unambiguously a vector.


It's probably just a notation/definition issue. I'm not sure if "grad f" is 100% consistently defined.

I'm a simple-minded physicist. I just know that if you apply the same coordinate transformation to the gradient and to the displacement vector, you get the wrong answer.

My usual reference is Schutz's Geometrical Methods of Mathematical Physics, and he defines the gradient as df, but other sources call that the "differential" and say the gradient is what you get if you use the metric to raise the indices of df.

But that raised-index gradient (i.e. g(df)) is weird and non-physical. It doesn't behave properly under coordinate transformations. So I'm not sure why folks use that definition.

You can see the difference by looking at the differential in polar coordinates. If you have f=x+y, then df=dx+dy=(cos th + sin th)dr + r(cos th - sin th)d th. If you pretend this is instead a vector and transform it, you'd get "df"=(cos th + sin th)dr + (1/r)(cos th - sin th)d th, which just gives the wrong answer.

To be specific, if v=(1,1) in cartesian (ex,ey), then df(v)=2. But (1,1) in cartesian is (1,1/r) in polar (er, etheta). The "proper" df still gives 2, but the "weird metric one" gives 1+1/r^2, since you get the 1/r factor twice, instead of a 1/r and a balancing r.
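
For what it's worth, here's a small sympy sketch checking that computation, with the Cartesian vector v = e_x + e_y rewritten in the polar coordinate basis via the chain rule:

    import sympy as sp

    r, th = sp.symbols('r theta', positive=True)
    f = r * sp.cos(th) + r * sp.sin(th)     # f = x + y in polar coordinates

    # Components of df in the (dr, dtheta) basis:
    # df = (cos th + sin th) dr + r (cos th - sin th) dtheta, as above.
    df_r, df_th = sp.diff(f, r), sp.diff(f, th)

    # The vector v = e_x + e_y in the (d/dr, d/dtheta) coordinate basis.
    v_r = sp.cos(th) + sp.sin(th)
    v_th = (sp.cos(th) - sp.sin(th)) / r

    # df(v) is coordinate-independent and equals 2 everywhere.
    print(sp.simplify(df_r * v_r + df_th * v_th))   # 2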


And I'm just a simple applied mathematician. For me, the gradient is the vector that points in the direction of steepest increase of a scalar field, and the Jacobian (or indeed, "differential") is the linear map in the Taylor expansion. I'll be curious to take a look at your reference: looks like a good one, and I'm definitely interested in seeing what the physicist's perspective is. Thanks!


What do you mean by well-connected topology? If you mean that you can reach every neuron from any neuron, then the number of connections you need is asymptotically n log n / 2 (not up to a constant factor or anything, just n log n / 2 on the nose; it's a sharp threshold), see [0]. In general, when percolation is done on just n nodes without extra structure, it's called the Erdős–Rényi model [0], and most mathematicians even just call this "the" random graph model.

[0] https://en.wikipedia.org/wiki/Erd%C5%91s%E2%80%93R%C3%A9nyi_....
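
If it helps to see the threshold concretely, here's a rough simulation sketch (assuming networkx is available; n, c, and the trial count are arbitrary choices):

    import math
    import networkx as nx

    def connected_fraction(n, c, trials=100):
        """Fraction of G(n, m) samples that are connected, with m = c * n * log(n) / 2 edges."""
        m = int(c * n * math.log(n) / 2)
        return sum(nx.is_connected(nx.gnm_random_graph(n, m)) for _ in range(trials)) / trials

    # Below c = 1 the graph is almost never connected, above c = 1 it almost
    # always is, and the transition sharpens as n grows.
    for c in (0.8, 1.0, 1.2):
        print(c, connected_fraction(2000, c))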



Ah but that's a bit different. The giant component doesn't connect all the neurons, only a fraction. The wiki page doesn't say this but if you have c * n / 2 edges then the fraction of neurons in the giant component is 1 + W(-c * exp(-c))/c where W is the Lambert W function [0], also called the product logarithm. As c tends to infinity this fraction tends to 1 but it's never 1.

[0] https://en.wikipedia.org/wiki/Lambert_W_function
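
A quick numerical sketch of that formula, assuming scipy is available (the values of c are arbitrary):

    import numpy as np
    from scipy.special import lambertw

    def giant_fraction(c):
        """Limiting fraction of nodes in the giant component at average degree c (for c > 1)."""
        return float(1 + lambertw(-c * np.exp(-c)).real / c)

    for c in (1.5, 2.0, 4.0, 8.0):
        print(c, round(giant_fraction(c), 4))
    # The fraction tends to 1 as c grows, but never reaches it.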


I've mentioned this in another comment but the FreeTube [0] client for YouTube has settings that let you hide pretty much every distraction apart from the video. You can even exclude videos from your search results if they contain a set phrase (great for avoiding political bait). I know there are extensions that do all these things but I find this to be a nice all-in-one solution, and the UI is more responsive too. (It does suffer from expiring sessions and the like, though.)

[0] https://freetubeapp.io/


In the spirit of the video (and the article), I'm watching this on FreeTube [0] while hiding the comments, the recommended videos tab, and pretty much everything else apart from the video description.

[0] https://freetubeapp.io/


… and you are watching it because the HackerNews algorithm put this link on the top of its feed.


No algo, but upvotes plus some curation from @dang.


No need to introduce the concept of energy. It's a "natural" probability measure on any space where the outcomes have some weight. In particular, it's the measure that maximizes entropy while fixing the average weight. Of course it's contentious whether this is really "natural," and what that even means. Some hardcore proponents like Jaynes argue along the lines of epistemic humility, but for applications it really just boils down to it being a simple and effective choice.
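
Concretely, the constrained maximization I mean is (a sketch in my own notation, with weights w_i):

    \max_{p}\ \Bigl(-\sum_i p_i \log p_i\Bigr)
    \quad\text{subject to}\quad
    \sum_i p_i = 1,\qquad \sum_i p_i w_i = \bar{w}

    \Longrightarrow\qquad
    p_i \;=\; \frac{e^{\beta w_i}}{\sum_j e^{\beta w_j}}

where β is the Lagrange multiplier for the average-weight constraint (with energies as weights and the usual statistical-mechanics sign convention, it comes out as minus the inverse temperature).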


In statistical mechanics, fixing the average weight has significance, since the average weight, i.e. the average energy, determines the total energy of a large collection of identical systems and hence is macroscopically observable.

But in machine learning, it has no significance at all. In particular, to fix the average weight, you need to vary the temperature depending on the individual weights, but machine learning practitioners typically fix the temperature instead, so that the average weight varies wildly.

So softmax weights (logits) are just one particular way to parameterize a categorical distribution, and there's nothing precluding another parameterization from working just as well or better.
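
For concreteness, a minimal numpy sketch of that fixed-temperature parameterization (the logits and temperatures are made up):

    import numpy as np

    def softmax(logits, temperature=1.0):
        """Categorical distribution parameterized by logits at a fixed temperature."""
        z = np.asarray(logits, dtype=float) / temperature
        z -= z.max()                 # shift for numerical stability
        p = np.exp(z)
        return p / p.sum()

    logits = np.array([2.0, 1.0, 0.1])
    for T in (0.5, 1.0, 2.0):
        p = softmax(logits, T)
        # With T held fixed, the average logit under p drifts freely as the logits change,
        # unlike the statistical-mechanics setup where the average energy is pinned.
        print(T, p.round(3), round(float(p @ logits), 3))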


I agree that the choice of softmax is arbitrary; but if I may be nitpicky, the average weight and the temperature determine one another (the average weight is the derivative of the log of the partition function with respect to the inverse temperature). I think the arbitrariness comes more from choosing logits as a weight in the first place.
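
In symbols, with the logit sign convention (so β here is minus the statistical-mechanics inverse temperature):

    Z(\beta) \;=\; \sum_i e^{\beta w_i},
    \qquad
    \langle w \rangle \;=\; \sum_i w_i\,\frac{e^{\beta w_i}}{Z(\beta)} \;=\; \frac{\partial \log Z}{\partial \beta}

and ∂⟨w⟩/∂β = Var(w) ≥ 0, with equality only when all weights are equal, so the average weight is a monotone function of β; that's the sense in which the two determine one another.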


Ignorant question—how's this different from qemu-virgl? I've been using the latter (installed from homebrew) for the last few years passing --device virtio-vga.


Virtio-GPU Venus is similar to Virgl, except that it passes through Vulkan commands rather than OpenGL.


You forgot GPay! I think it's a different app (I mean, it has a different name...), though I can't say I'm sure (Wikipedia says they're the same, but the source they use directly contradicts them). It's really comical at this point.


The previous Google Pay app was definitely killed off and you had to install "GPay" instead. I /think/ Google Pay may have been a rename of the original Android Pay app, but I'm not sure.


In India, it was Tez, then Google Pay (IIRC two apps with this name existed at the same time), then GPay.

