More

ersiees · 2024-11-16T22:47:31 1731797251

What is the alternative to AI twitter? Is there some incumbent where it moves?

minimaxir · 2024-11-16T22:58:31 1731797911

It depends on if you define AI Twitter as AI researchers or AI hypesters.

Many of the former have indeed migrated.

ersiees · 2024-11-08T17:12:47 1731085967

And there is scratch right?

atlintots · 2024-11-08T18:22:25 1731090145

Scratch isn't non-verbal (or is it?)

ersiees · 2024-11-03T17:54:03 1730656443

Why is it only for 7 days available?

millhouse1112 · 2024-11-03T17:56:42 1730656602

It's a free app. I don't want to keep all the data for a long time. Should I increase this? What duration makes sense for you?

bambax · 2024-11-03T18:00:18 1730656818

If it was purely local (saved on localStorage) it could be stored indefinitely? But then of course it would be hard to share.

foreigner · 2024-11-03T18:01:27 1730656887

Store the state in the URL

echoangle · 2024-11-04T00:16:27 1730679387

But then you have to reshare every modification

millhouse1112 · 2024-11-03T18:04:01 1730657041

Yep. It's a share-first app in a sense.

hanniabu · 2024-11-03T18:06:32 1730657192

30 days makes more sense

Also why not make it all client side? When they grab a share link you can convert the text to image, image to base64 to spend to the URL, and when somebody opens the link you ocr it and recreate the list.

millhouse1112 · 2024-11-03T18:12:29 1730657549

Me and my wife make some changes together. Add, remove, check, uncheck. Groceries list, etc.

With this approach we should share the new URL again after each change.

type_Ben_struct · 2024-11-03T18:09:26 1730657366

base64 isn’t just for images. You can encode text so unsure why you’d need OCR?

joegahona · 2024-11-03T18:23:06 1730658186

I would try to find data like "tasks over 30 days old never get completed" and tie the decision (and marketing) to something like that.

ersiees · 2024-10-28T11:12:08 1730113928

I think this is also interesting to keep in mind when thinking about the changes to the job market that are possible due to technology (e.g. AI).

Most of care will not be automated any time soon. And it is a huge part of the economy.

johanneskanybal · 2024-10-28T11:43:02 1730115782

Yes this was my first thought too.

Nice to see this post on HN.

ersiees · 2024-10-03T11:06:45 1727953605

Someone having a non paywall link?

gpvos · 2024-10-03T11:36:57 1727955417

The actual paper by Michael Inzlicht, linked in the first part of the article and also in a comment here, seems to be quite readable.

ersiees · on July 16, 2024

I think I was not very specific, but I think there is a lot of video on YouTube that does not make any money for the producers and in the past YouTube also did not show ads for these videos, but now they show Ads even if the producers of the content don't receive any money. I watch mostly lectures and niche videos on YouTube, I am pretty sure for most of these videos, the only entity getting Money out of it is YouTube.

ersiees · on July 11, 2024

Check out Lektor.lol an open source wrapper around ChatGPT I created just for that :)

ersiees · on Jan 31, 2024

more of a rural myth, I guess ;D

ersiees · on July 24, 2023

This trick “they found” is part of the standard torch implementation of multi head attention, namely it is called, add_zero_attention. They add a zero to the logits, resulting in a one in the denominator as e^0=1 https://pytorch.org/docs/stable/generated/torch.nn.Multihead...

lovelearning · on July 25, 2023

I find its documentation quite poor though: "If specified, adds a new batch of zeros to the key and value sequences at dim=1."

Doesn't describe the implications even briefly. If they add just your second sentence to that description, it'll immediately become so much more useful.

civilized · on July 24, 2023

It's an option which is set to false by default. Does that mean people have tried it and it's not usually helpful...?

blackkettle · on July 25, 2023

It probably means they have tried it for _some_ purpose, but not necessarily the one described in OP's post here. The claim is that this is specifically useful for quantization. It's seems reasonable to assume that this would have initially been tried and potentially discarded for having little or impact on general accuracy. But that's a different issue. I suppose we'll here something definitive in a month or so.

mlyle · on July 24, 2023

quickthrower2 · on July 24, 2023

Can you elaborate? (It wouldn't be the first time there was an extraneous feature that no one has every used in some code!)

thomasahle · on July 25, 2023

If you take the inner product between a lot of more or less random vectors (the key and query vectors in attention) most values are going to be close to 0. This means they contribute by e^0 to the denominator. Now, if you have a context length of say 2000, your denominator is already ~ 2000. Increasing it to 2001 doesn't really make a difference.

Adding 1 to the denominator can be useful if you have softmax with just a few options. Not in self-attention where you have thousands.

quickthrower2 · on July 25, 2023

That simple comment is a strong counterpoint to the entire blog post?

Except with the +1 denominator, it might be that the model trains all of the inputs to become very negative so softmax chucks out close to zeros, whereas it wouldn't bother before because making one prob bigger makes another smaller.

thomasahle · on July 25, 2023

> it might be that the model trains all of the inputs to become very negative

It still can't do this because of L2 regularization / weight decay. If two vectors are norm 1, their inner product is at least -1, so with 2000 vectors that's still 2000 * e^(-1) =~ 735.

Not saying it's theoretically impossible that it could happen. But you would have to try _really_ hard to make it happen.

redox99 · on July 25, 2023

I guess you could add a sort of gating operation with a learnable parameter that sends the value to -inf if doesn't reach the threshold.

Of course it might have some other serious repercussions.

Q6T46nT668w6i3m · on July 24, 2023

It’s useful but it’s less used than dummy tokens.

thomasahle · on July 25, 2023

Are dummy tokens just tokens that don't have an associated input/output token? Like, a way to give more computational power to the model without splitting the text into more actual tokens?

dijksterhuis · on July 25, 2023

TL;DR sort of yes. But they're also useful for reasons not related to computational "power".

An example here with an actual algorithm, although it's been a couple of years so my explanation might be a bit wrong in places. and/or i might have gotten the completely wrong end of the stick with the current thread.

--

The CTC (Connectionist Temporal Classification [0]) algorithm maps a sequence x with length X -> sequence y with length Y.

i.e. in speech to text we might have some audio features that correspond to the following class predictions (post softmax classification)

    x -> hellllloooooooooo wwwooorrrllld

we want to get this as the output

    y -> hello world

we have the alphabet as classes we try to predict for each sequence item in x.

we could just removed all the duplicate in the first long sequence, but we would end up with `helo world` ... we need to preserve one of the early `l` characters in `hello` somehow

CTC uses a blank token (aka dummy) token to handle potentially deliberately repeated items in sequence x.

By adding the blank token to the classes predictions, we can get the model to predict something like this (post softmax classification)

    y* -> hel~l~~oooo~~~~~~ w~~o~~r~~l~~d

The CTC decoder (non-ML decoding algo) heuristically removes repeated tokens. Turning the above into ...

    y -> hello world

... the duplicate `o` and `~` characters are removed.

It was a decent enough algorithm for speech-to-text prior to attention/transformers etc.

However, it makes CTC vulnerable to well designed adversarial example attacks because there is a massive bias within models to predict the blank token -- meaning it's very easy to modify input sequence x to switch the output sequence y to include blank tokens for nefarious purposes (the subject of my unfinished phd).

[0]: www.cs.toronto.edu/~graves/preprint.pdf

thomasahle · on July 25, 2023

> By adding the blank token to the classes predictions, we can get the model to predict something like this (post softmax classification) > y* -> hel~l~~oooo~~~~~~ w~~o~~r~~l~~d

This is a great solution. Though that's a dummy token in the output rather than the input. I guess you could do something inverse to do text to speech, but it might be hard to say where to insert the dummy tokens in that case.

janalsncm · on July 25, 2023

Nice catch! Hopefully OP will see this.

fstokesman · on July 25, 2023

https://en.wikipedia.org/wiki/Multiple_discovery

ersiees · on April 8, 2023

Very interesting that someone finally tries out muP in the real world. Do I understand the usage correctly:

MuP is only used to get around choosing an lr for each size? Here I wonder how it compares to standard heuristics like the one in the OG scaling laws paper by OAI and tricks like back winding a few steps after loss explosion.

For some reason muP was not trusted with the largest trainings? Why is that?