marv1nnnnn's comments | Hacker News

Never heard of this one! Sounds really interesting.


I just fixed the local version, my bad, totally missed it. If you're still interested, you could try it again. About the scraping step: this project uses crawl4ai to scrape. Suppose the url is https://xxx/yy/, it will only scrape https://xxx/yy/*. You could post it as a GitHub issue and I'll try to fix it.


I think it's really a reasoning-model thing. Non-reasoning models struggle at math too. It's more like a protocol between two math geniuses: they can communicate in really abstract terms.


Thanks for your comments! Really helpful. I have seen some llms.txt files like this: https://docs.agno.com/llms.txt, which I don't think will help an LLM in real tasks. Some background on this project: I think LLMs perform better at abstraction than humans do. Those AIME test scores are not a joke. So maybe a smart LLM doesn't have to communicate with another LLM in plain text; they have a more efficient way to communicate. About the excessive tokens from LLM reasoning, I find it varies. Gemini 2.5 Pro is really an overthinker, but Claude 3.7 isn't. Finally, I think most vibe-coding tasks don't require a deep understanding of how a package works. It's more like an information retrieval task, so a lot can be compressed.



Agree, actually this approach isn't even possible without the birth of reasoning LLMs. In my tests, reasoning LLMs perform much better than non-reasoning LLMs at interpreting the compressed file. Those LLMs are really good at understanding abstraction.


My point still stands --- the reasoning tokens being consumed to interpret the abstracted llms.txt could have been used for solving the problem at hand.

Again, I'm not saying the solution doesn't work well (my intuition on LLMs has been wrong enough times), but it would be really helpful/assuring to see some hard data.


Oof, you nailed it. Thanks for the sharp eyes on llm_min_guideline.md. That's a clear sign of me pushing this out too quickly to get feedback on the core concept, and I didn't give the supporting docs the attention they deserve. My bad. Cleaning that up, and generally adding more polish, is a top priority. Really appreciate you taking the time to look deeper and for the encouragement to keep going. It's very helpful!


Wait, are you also using an LLM to respond on Hacker News?


Haha, is it that obvious? I only let an LLM polish this one. I'm not a native speaker and I was trying to be polite ^-^


Damn... I saw your sentence starting with "Wait", and immediately thought "reasoning LLM?"


Honestly it's really funny. I had the initial idea, then brainstormed with Gemini 2.5 Pro a lot and let it design the system. (And in the prompt I told it to think like Jeff Dean and John Carmack.) But most versions failed. Then I somehow realized I can't let it design from scratch, so I gave Gemini a structure I thought was reasonable and efficient after seeing all those versions, and let it polish based on that. It works much better.


That's a pretty cool approach!


I totally agree with your critique. To be honest, it's even hard for me to evaluate. What I did was select several packages that current LLMs fail to handle, which are in the sample folder: `crawl4ai`, `google-genai` and `svelte`, and then try some tricky prompts to see if it works. But even that evaluation is hard: LLMs can hallucinate. I would say it works most of the time, but there are always a few runs that fail to deliver. I actually prepared a comparison, cursor vs cursor + internet vs cursor + context7 vs cursor + llm-min.txt, but I thought it was stochastic, so I didn't put it here. Will consider adding it to the repo as well.


> But even that evaluation is hard: LLMs can hallucinate. I would say it works most of the time, but there are always a few runs that fail to deliver

You can use success rate % over N runs for a set of problems, which is something you can compare to other systems. A separate model does the evaluation. There are existing frameworks like DeepEval that facilitate this.
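
A rough sketch of what that harness could look like (run_agent and judge here are hypothetical stand-ins for your agent invocation and a separate grading model, not real APIs):

    from typing import Callable

    def success_rate(problems: list[str], context_file: str, n_runs: int,
                     run_agent: Callable[[str, str], str],
                     judge: Callable[[str, str], bool]) -> float:
        # Fraction of (problem, run) pairs that the judge model marks as passed.
        passed = 0
        for problem in problems:
            for _ in range(n_runs):
                answer = run_agent(problem, context_file)  # your agent call, with the context file attached
                if judge(problem, answer):                 # a separate model grades pass/fail
                    passed += 1
        return passed / (len(problems) * n_runs)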


Dual run.

Run the same questions against a model with the unminified docs and with the minified ones, show the results side by side, and see how, in your subjective opinion, they hold up.
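
Something as small as this would do (ask is a placeholder for whatever call you already use to query the model):

    def dual_run(questions, full_docs: str, min_docs: str, ask):
        # Ask each question once with the full docs and once with the minified
        # docs in context, and print the two answers side by side.
        for q in questions:
            full_answer = ask(f"{full_docs}\n\n{q}")
            min_answer = ask(f"{min_docs}\n\n{q}")
            print(f"Q: {q}\n[full docs]\n{full_answer}\n[llm-min]\n{min_answer}\n")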


Why don't you ask the model about the shrunken system prompt and the original system prompt? That way you can infer whether the same relevant information is "stored" in the hidden state of the model.

Or better yet, directly check the hidden-state difference between a model fed the original prompt and one fed the shrunken prompt.

This should remove the randomness from the results.
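
A rough sketch with Hugging Face transformers (gpt2 is just a stand-in model, the file names are placeholders, and cosine similarity of the mean-pooled last layer is only one of several reasonable ways to compare the two representations):

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "gpt2"  # stand-in; any model whose hidden states you can read works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

    def pooled_state(prompt: str) -> torch.Tensor:
        # Mean-pool the last hidden layer over tokens to get one vector per prompt.
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model(**inputs)
        return out.hidden_states[-1].mean(dim=1).squeeze(0)

    original = pooled_state(open("llms.txt").read())
    shrunk = pooled_state(open("llm_min.txt").read())
    print(torch.nn.functional.cosine_similarity(original, shrunk, dim=0).item())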


To be honest with you, it being stochastic is exactly why you should post it.

Having data is how we learn and build intuition. If your experiments showed that modern LLMs were able to succeed more often when given the llm-min file, then that’s an interesting result even if all that was measured was “did the LLM do the task”.

Such a result would raise a lot of interesting questions and ideas, such as the possibility of SKF increasing the model's ability to apply new information.


> LLMs can hallucinate

The job of any context retrieval system is to retrieve the relevant info for the task so the LLM doesn't hallucinate. Maybe build a benchmark based on less-known external libraries, with test cases that check the output is correct (or with a mocking layer to verify that the LLM-generated code calls roughly the correct functions).
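
For the mocking idea, something like this could work (generated_code stands in for whatever the model produced, and patching crawl4ai's AsyncWebCrawler is just one example entry point):

    from unittest import mock

    # Stand-in for whatever code the model produced for the task.
    generated_code = "import crawl4ai\ncrawler = crawl4ai.AsyncWebCrawler()"

    # Patch the library entry point, exec the generated snippet, and check that
    # it at least reached for roughly the right function.
    with mock.patch("crawl4ai.AsyncWebCrawler") as fake_crawler:
        exec(compile(generated_code, "<llm-output>", "exec"), {})
        assert fake_crawler.called, "generated code never constructed the crawler"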


Thanks for the feedback. This will be my next step. Personally, I feel it's hard to design those test cases by myself.

