More

daemonologist · 2025-04-15T22:24:47 1744755887

AI companies high on their own supply, that's who. Ultralytics is (in)famous for it.

itissid · 2025-04-16T02:22:50 1744770170

Why is Ultralytics yolo famous for it?

daemonologist · 2025-04-16T02:52:20 1744771940

They had a bot, for a long time, that responded to every github issue in the persona of the founder and tried to solve your problem. It was bad at this, and thus a huge proportion of people who had a question about one of their yolo models received worse-than-useless advice "directly from the CEO," with no disclosure that it was actually a bot.

The bot is now called "UltralyticsAssistant" and discloses that it's automated, which is welcome. The bad advice is all still there though.

(I don't know if they're really _famous_ for this, but among friends and colleagues I have talked to multiple people who independently found and were frustrated by the useless github issues.)

ericye16 · 2025-04-16T05:47:48 1744782468

I was hit by this while working on a project for class and it was the most frustrating thing ever. The bot would completely hallucinate functions and docs and it confused everyone. I found one post where someone did the simple prompt injection of "ignore previous instructions and x" and it worked but I think it's delted now. Swore off ultralytics after that.

daemonologist · 2025-04-15T03:58:05 1744689485

Change in IP location? If you suddenly jump across the country but your traffic is still coming from a cell provider (so unlikely to be a VPN) Lyft could infer that you've flown, and to which airport.

Alternatively maybe barometer? All modern iPhones have one, but I don't know if apps are allowed to access it in the background (or without location services), and it wouldn't let Lyft tell which airport you were at.

daemonologist · 2025-04-14T17:49:58 1744652998

I ran NoLiMa on Quasar Alpha (GPT-4.1's stealth mode): https://news.ycombinator.com/item?id=43640166#43640790

Updated results from the authors: https://github.com/adobe-research/NoLiMa

It's the best known performer on this benchmark, but still falls off quickly at even relatively modest context lengths (85% perf at 16K). (Cutting edge reasoning models like Gemini 2.5 Pro haven't been evaluated due to their cost and might outperform it.)

daemonologist · 2025-04-10T05:18:16 1744262296

I'd be pleasantly surprised - GPT-4o is their bread and butter (it powers paid ChatGPT) and QA seems to be slightly ahead on benchmarks at similar or lower latency (so very roughly, it might be cheaper to run).

krackers · 2025-04-10T07:03:50 1744268630

DeepSeek V3 and R1 are about as good (or even slightly better than) 4o already though, so OpenAI wouldn't really lose much by such a release.

daemonologist · 2025-04-10T05:04:50 1744261490

To some extent the "mystery" (and temporary free-as-in-beer-ness) of this model might be getting to me, but I think it's pretty interesting. Given the token throughput (250B this week) it's obvious there's a pretty major player behind the model, but why is it stealthed? Maybe there's something about the architecture or training that would put people off if it was public right off the bat? Maybe they're purely collecting usage/acceptance data and want unbiased users?

On the Aider Polyglot leaderboard it's ~middle of the leading pack, comparable to DeepSeek V3 and 3.5 Sonnet. I ran NoLi(teral)Ma(tching), an unsaturated long-context benchmark, on it and was impressed though:

  = Model =========== Base Score = 8K Context = 16K Context =
  Quasar Alpha:       >=97.8%      89.2%        85.1%
  GPT-4o:             99.3%        89.2%        81.6%
  Llama 3.3 70B:      97.3%        72.1%        59.5%
  Gemini 1.5 Pro:     92.6%        63.9%        55.5%
  Claude 3.5 Sonnet:  87.6%        61.7%        45.7%
  Gemini 1.5 Flash:   84.7%        44.4%        35.5%
  GPT-4o mini:        84.9%        32.6%        20.6%
  Llama 3.1 8B:       76.7%        31.9%        22.6%

It also performs well - slightly better than GPT-o1 - on the "hard" subset at 16K context with 62.8%. Latency is quite good as well.

More details: https://old.reddit.com/r/LocalLLaMA/comments/1ju1czn/quasar_...

uep · 2025-04-10T10:56:14 1744282574

What is the reason you included Claude 3.5 instead of 3.7 in this?

daemonologist · 2025-04-10T14:53:39 1744296819

I only ran the benchmark on Quasar Alpha*; the rest of the scores come from the original paper [0] which was published before 3.7 was available. This is a pretty expensive benchmark to run if you're paying for API usage - I'd actually originally set out to run it on Llama 4 but abandoned that after estimating the cost.

* - I also reproduced the Llama 3.1 8B result to check my setup.

[0] - https://arxiv.org/abs/2502.05167 / https://github.com/adobe-research/NoLiMa

daemonologist · 2025-04-08T21:52:37 1744149157

Depends on what you consider vaguely useful, but any modern GPU which can fit the weights at a reasonable quantization (so, 16+ GB, maybe 12 in a pinch) will probably be tolerable. Bare minimum is probably ~3060 12GB.

apothegm · 2025-04-08T23:00:34 1744153234

Thanks! So this could actually be worth running on a mid-specced recentish Mac laptop?

daemonologist · 2025-04-09T21:08:59 1744232939

Macs are a bit odd because you're using system memory (so, more but slower) and Apple's APIs, but it's definitely worth a try. I'd guess >= M2 Pro + >= 24 GB RAM would do okay. Also maybe try an MLX version? (e.g. https://huggingface.co/justinmeans/DeepCoder-14B-Preview-mlx...)

daemonologist · 2025-04-06T14:53:47 1743951227

Yes, they removed radar from their vehicles in 2021: https://www.tesla.com/support/transitioning-tesla-vision

(Also I wouldn't say it's _irrelevant_ that they don't have lidar, as if they did it would cover some of the same weaknesses as radar.)

daemonologist · 2025-04-06T04:58:13 1743915493

Well, Wikipedia, but I take your point.

daemonologist · 2025-04-04T15:36:02 1743780962

Not for me - CPU usage remains indistinguishable from idle and any blocking of the main thread is imperceptible. (Firefox 137 on Fedora Gnome 41)

daemonologist · 2025-04-04T02:18:06 1743733086

That's probably more or less what's happening. If the site is doing full-text search, and there are no exact matches, adding more terms is going to generate more and more partial matches. Some kind of re-ranker might be able to make sense of them but those are difficult to get good results out of, probably rarely employed by a run-of-the-mill commerce site, and might be drowned out by competing objectives (sponsored results, ranking by likelihood of purchase, etc.)