Llama.cpp Now Supports Qwen2-VL (Vision Language Model) (github.com/ggerganov)
155 points by BUFU 11 months ago | 50 comments


The Qwen family of models is REALLY impressive. I would encourage anyone who hasn't paid them any attention to at least add them to their mental list of LLMs worth knowing about.

Qwen2-VL is a decent vision model. You can try it out online here: https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B - I got great results from it for OCR against handwritten text: https://simonwillison.net/2024/Sep/4/qwen2-vl/
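
If you'd rather script it than use the Space, here's a rough sketch of the Hugging Face transformers route (needs transformers >= 4.45 and the qwen-vl-utils package; "page.jpg" and the prompt are just placeholders):

    # Rough sketch: OCR a handwritten page with Qwen2-VL via transformers.
    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2-VL-7B-Instruct"
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{"role": "user", "content": [
        {"type": "image", "image": "page.jpg"},  # placeholder path
        {"type": "text", "text": "Transcribe all handwritten text in this image."},
    ]}]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    # Drop the prompt tokens before decoding the transcription
    print(processor.batch_decode(output[:, inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)[0])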

Qwen2.5-Coder-32B is an excellent (I'd say even GPT-4 class) model for generating code, which I can run on a 64GB M2 MacBook Pro: https://simonwillison.net/2024/Nov/12/qwen25-coder/

QwQ is the Qwen team's exploration of the o1 style of model with built-in chain-of-thought. It's absolutely fascinating, partly because if you ask it a question in English it will often think in Chinese before spitting out an answer in English. My notes on that one here: https://simonwillison.net/2024/Nov/27/qwq/

Most of the Qwen models are Apache 2 licensed, which makes them more open than many of the other open weights models (Llama etc).

(Unsurprisingly they all get quite stubborn if you ask them about topics like Tiananmen Square)


Thanks for the summary. I have been testing QwQ on my M1 (via ollama). I tried a couple of double-slit quantum thought experiments, and also found the reasoning mode absolutely fascinating. Occasionally a few logographs appear, but so far they were not in the way.

The funniest was asking for an ASCII graphics depiction of a Minecraft watch recipe. I was actually feeling quite sorry for it: 'wait that can't be right', 'let me try', 'still not right', round and round it went for at least a few pages, at which point it decided to try the second recipe I'd asked about to see if that helped with the first.

I didn't know about the other models; 'coder' is downloading now, and fingers crossed it fits in 32GB and knows a bit about Zig.

It sounds like you got the vision one running locally on your M2, nice. I'm running Asahi Linux and haven't tried anything AI/SD/graphics-oriented yet. But nice that you got some SVG out of coder; I never thought of using a coding model in that way.


QwQ often spits out Chinese characters smack dab in the middle of a sentence. Weirdly, it doesn’t break up the coherence or logic; there are just symbols added.


Do you think training in multiple languages could act as a form of regularization? Just as polyglots are smarter in real life?


I haven’t seen the architecture of QwQ, but I just assumed it learns languages only insofar as it picks up relationships between words. That must mean it picks up logic across languages. Huh.


I thought so too. But then o1 thinks in English and Qwen thinks in Chinese. Is there an advantage to thinking in different languages?


> Unsurprisingly they all get quite stubborn if you ask them about topics like Tiananmen Square

Has anyone made a political censorship eval yet?


This would be a great one. Censorship / political alignment compass.


Unfortunately, real political alignment doesn't exist. Most people don't use ideologies to determine alignment.


Alignment might be the wrong goal. Labeling and scoring, something like a multi-dimensional Ground News.


Is it possible to build something similar to Anthropic's computer use feature with the Qwen vision model?

Someone open-sourced it with LangChain:

https://x.com/1littlecoder/status/1856397375704576399


Browser use is very easy. You can even do it headless. That way, you can also do bulk processing. For a client, I did some 16k websites with a simple LLM agent. With "computer use", how long would that take, and what would it cost? For me, it was ~$20 (I used Gemini for this task).
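
The shape of that kind of bulk job is simple. A rough sketch (llm_complete is a hypothetical stand-in for whatever model client you use, and the extraction prompt is made up):

    # Minimal sketch of bulk headless "browser use": fetch each page,
    # strip it to text, and ask an LLM to extract what you need.
    import re
    import requests

    def page_to_text(url: str) -> str:
        html = requests.get(url, timeout=15).text
        html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)  # drop scripts/styles
        return re.sub(r"<[^>]+>", " ", html)[:20_000]              # strip tags, cap length

    def llm_complete(prompt: str) -> str:
        # Hypothetical stand-in: plug in Gemini, an OpenRouter model,
        # a local llama.cpp server, etc.
        raise NotImplementedError("plug in your model API here")

    for url in open("sites.txt"):
        text = page_to_text(url.strip())
        answer = llm_complete(
            "Extract the company name and contact email from this page "
            "as JSON:\n\n" + text)
        print(url.strip(), answer)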


Agree. It is amazing that you can run an o1-style model on a Mac. I was able to run QwQ on my 24GB M3 MacBook Air, though results on complex domain-specific reasoning tasks were not good, and I saw the Chinese 'thinking' too (those tasks don't work well in o1 either). It opens up experimentation, which is great, and the reasoning traces for domain-specific tasks for RL are where the next improvements are going to come from.


I recently fine-tuned the Qwen-Coder-7B-Instruct model to generate Milkdrop presets. Pretty amazing to see what can be done locally. https://huggingface.co/InferenceIllusionist/MilkDropLM-7b-v0...


Why does this model think in Chinese and o1 think in English? Is this because chain-of-thought is achieved by training these models on examples of what “thinking” looks like, which have been constructed by their respective model developers, as opposed to being a more generic feature?


We don’t know that o1 thinks in English. What we see is a summary of the thinking process.

My guess is it “thinks” in non-eligible tokens


The o1 release blog post contains 8 full examples of o1 chains of thought (not the summarized versions visible to users). They're in English.

https://openai.com/index/learning-to-reason-with-llms/#chain...

I have seen the summaries dip into completely random languages like Thai, so it might switch between languages occasionally.


Did you mean to write "non-legible" ?


> often think in Chinese

I noticed that too, but I haven't seen it think in Chinese numbers, the way most bilingual Chinese speakers prefer. Or at least I haven't been able to trigger it.


Recently I was scrolling through HF to try a very small model. I fired up Qwen 0.5B and, for my purposes, it did better than even Llama 2 7B. That was very surprising to me.


Have you tried query extraction over tabular data? Are there any free models comparable to Amazon Textract for that?


It's pretty good for handwritten maths too - I just tried that demo. Do you know of any other open models good at maths notation?


https://huggingface.co/datasets/TIGER-Lab/MathInstruct

Works with 700+ year old books with some tweaks. Took like $400 to train. Can't share more because I don't know more.


That seems to be just for LLMs, not vision. I want to go from images of maths notation (photos, scans, digital handwriting) to formulas in LaTeX or MathML or something. Qwen2-VL can do it, but it's pretty heavyweight for just that.


Are we at a point where we could run non-quantized models from the QwQ/Qwen series on a 128GB MacBook Pro?


I think so. Are bf16 models non-quantized? There's an MLX one here that should fit on that machine: https://huggingface.co/mlx-community/QwQ-32B-Preview-bf16
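
If you want to try it, mlx-lm makes that a few lines. A sketch (the bf16 weights are roughly 65GB, so 128GB of unified memory is the realistic target; exact flags may differ between mlx-lm versions):

    # Sketch: run the bf16 (non-quantized) QwQ build with mlx-lm on Apple silicon.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/QwQ-32B-Preview-bf16")
    print(generate(model, tokenizer,
                   prompt="How many 'r's are in strawberry?",
                   max_tokens=512, verbose=True))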


Do you have an opinion on Mini CPM 2.6 in comparison to Qwen2-VL?


I haven't tried that Mini CPM model yet.


Just tested it with a meme, and it nailed it.


> (Unsurprisingly they all get quite stubborn if you ask them about topics like Tiananmen Square)

I wonder how the abliterated variants respond to this query.


IMHO Qwen are shipping the best OSS models you can run locally on consumer GPUs right now. I'm getting great results from both qwen2.5-coder:32b and qwq:32b running at ~18 tok/s on older NVIDIA A4000 GPUs. Definitely my first choices for local workloads.

It's also great to see qwq's open chain of thought baked into an OSS LLM so you can see it reason with itself in real time; it's the kind of secret sauce that proprietary LLMs like o1 would prefer to keep hidden to try to build a moat around.
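
If you're on ollama already, watching that chain of thought stream in is a few lines with the Python client (a sketch; assumes you've pulled the model):

    # Sketch: stream qwq's chain of thought in real time via the
    # ollama Python client (assumes `ollama pull qwq:32b` was run first).
    import ollama

    stream = ollama.chat(
        model="qwq:32b",
        messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}],
        stream=True,
    )
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)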

We've got a lot to thank Meta and Qwen for in continually releasing improved high-quality OSS models, which also encourages others to follow. High-quality OSS models are the best thing keeping the cost of LLMs down: you can get unbelievable value on OpenRouter, with qwen2.5-coder:32b at $0.08/$0.18 per M tokens and qwq:32b available at $0.15/$0.60 per M tokens, which is more than 18x cheaper than Anthropic's latest budget Haiku 3.5 model at $0.80/$4 per M tokens (a 4x price hike over Haiku 3.0).
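
To make those ratios concrete, a quick back-of-the-envelope in Python (prices per million input/output tokens as quoted above):

    # Price ratios vs Haiku 3.5, per million input/output tokens (as quoted).
    prices = {
        "qwen2.5-coder:32b": (0.08, 0.18),
        "qwq:32b":           (0.15, 0.60),
    }
    haiku_in, haiku_out = 0.80, 4.00
    for name, (i, o) in prices.items():
        print(f"{name}: {haiku_in / i:.0f}x cheaper input, "
              f"{haiku_out / o:.0f}x cheaper output")
    # qwen2.5-coder:32b: 10x cheaper input, 22x cheaper output
    # qwq:32b: 5x cheaper input, 7x cheaper output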


Previously I've always been very skeptical of rosy pictures of a possible future where "everyone has an AI that's there to accomplish tasks for them" - given that I imagined such an AI (if it ever came to exist) being run by the usual big tech companies, whose incentives are not so cleanly aligned with our own.

Right now, with the availability of open weights for cutting-edge models, this wave of technological advance feels pleasantly decentralised, however. I can download and run a model and tinker with things which at least feel like the seeds of such a future, where I _might_ be able to build things with my own interests at heart.

But what happens if these models stop being shared, and how likely is that? Reading about the vast quantities of compute deployed to train them, replicating the successes of the main players with a community of volunteers just seems an order of magnitude less achievable than traditional OSS efforts like Linux. This wave feels so tied to massive scale for its success; what do we do if big tech stops handing out models?


I think we're all fortunate in that the companies behind the best OSS models, i.e. Meta/Llama and Alibaba/Qwen, are funding their compute and R&D from secondary business models, rather than from VC capital or an AI company whose primary business model is direct revenue from their models and who will be seeking ROI. That's why I don't expect we can rely on Mistral AI to open-source their best models in the long run, since that is their primary business model. This is reflected in their hosting costs, which charge a healthy premium that's always more expensive than OpenRouter providers hosting their OSS models.

But I don't see why Meta and Alibaba would stop releasing their best models as OSS, since they benefit from the tooling, optimizations and software ecosystems being developed around their OSS models, and they don't benefit from a future where the best AI models are centralized behind the big tech corps. As long as their core businesses remain profitable I don't expect them to stop improving and sharing their OSS models.


I read an article about humanoid robots yesterday and it scared me that the expectation still seems to be that the robot will be online 24/7 and "thinking" using some cloud brain. The current models described in more detail all used OpenAI as a brain.

Having a personal robot would be great, but they have to invent a fully offline real positronic brain before I will consider allowing one in my house.

Fully open source might be too much to hope for, but that would obviously be the ideal. If it is closed source it definitely should be offline. I can have another, carefully sandboxed, AI in my computer that can help out with tasks that require online access. No need for the two types to be built into the same device.


Prediction 1: The value isn't in the foundation model, it's in fine tuning and in tightly integrated products.

Prediction 2: The ecosystem around open source models will grow to be much larger, richer, and deeper than closed source models.

If these are true, then OpenAI and Anthropic are in a precarious place. They basically burned a lot of capital to show the open source second movers what to build.


Yeah, Haiku 3.5 doesn't really count as a budget model - Anthropic have kept the original Claude 3 Haiku around for their budget entry.

I collected notes on the lowest cost hosted LLMs from the major vendors when I wrote up Amazon Nova last week: https://simonwillison.net/2024/Dec/4/amazon-nova/

Nova Micro is $0.035/$0.14 and Google's Gemini 1.5 Flash 8B is $0.0375/$0.15 - just beating those OpenRouter prices, but it may well be that the Qwen models provide better results.


Yeah, I'm currently using Gemini 2 Flash (exp)'s free quota as my premium hosted model; it's a surprisingly great model, and IMO Google has caught up with the leaders with their latest experimental models. I've also tested Nova's models, which are pretty high quality and exceptional value (lite/micro) for their performance.

Also worth shouting out that you can get Meta's latest llama-3.3:70b (comparable to llama3.1:405b but much faster and cheaper) within GroqCloud's free quotas, running at an impressive 276 tok/s.


Groq limits the context window to 8192. Is that your experience too?


Have you tested tool calling capabilities of these cheaper models?


Not yet. A lot of them make bold claims, and there are benchmarks like https://gorilla.cs.berkeley.edu/leaderboard.html but I don't have my own function-calling test harness set up yet.


You know that you can make any model call and use tools simply by giving it few-shot examples and writing your own parsing logic. I've done it many times for clients, both at the prompt level and at the fine-tune level.
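
Roughly this shape - the TOOL:/ARGS: format and the get_weather tool are illustrative, not any particular framework:

    # Sketch: "tool calling" for any model via few-shot prompting plus
    # your own parsing logic.
    import json
    import re

    TOOLS = {"get_weather": lambda city: f"22C and sunny in {city}"}

    SYSTEM = """You can call tools by replying exactly in this format:
    TOOL: <name>
    ARGS: <json>

    Example:
    User: What's the weather in Paris?
    TOOL: get_weather
    ARGS: {"city": "Paris"}
    """

    def handle(reply: str) -> str:
        m = re.search(r"TOOL:\s*(\w+)\s*ARGS:\s*(\{.*\})", reply, re.S)
        if not m:
            return reply                   # plain answer, no tool call
        name, args = m.group(1), json.loads(m.group(2))
        return TOOLS[name](**args)         # run the tool, feed result back

    # With any model: send SYSTEM plus the user question, pass the reply
    # to handle(), then append the tool result and ask the model to finish.
    print(handle('TOOL: get_weather\nARGS: {"city": "Paris"}'))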


So far I have been pretty courageous in giving my language models full access to the console so they can perform terminal commands whenever necessary.

Thinking that the Chinese government might have built in a backdoor gives me a little pause though.


How would a bunch of weights make a backdoor? The worst it could do is detect it's accessing an actual console and run a logged, visible command that tries to mess with your config or phone home, which is more of a front door with flashing lights saying "here I am!", so why would they bother?

Letting an LLM run arbitrary commands in your main user account seems risky even without worrying about conspiracies.
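
If you do hand a model the console, a cheap mitigation is to gate every command. A minimal sketch (the allowlist prefixes are illustrative):

    # Minimal sketch: never exec model output directly; require an
    # allowlist match plus a human yes/no per command.
    import shlex
    import subprocess

    ALLOWED = ("ls", "cat", "grep", "git status", "git diff")

    def run_model_command(cmd: str) -> None:
        if not cmd.strip().startswith(ALLOWED):
            print(f"blocked (not on allowlist): {cmd}")
            return
        if input(f"run `{cmd}`? [y/N] ").lower() != "y":
            return
        # No shell=True, so ';', '&&' and redirects are not interpreted
        subprocess.run(shlex.split(cmd), check=False)

    run_model_command("git status")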


It’s absolutely possible to put a backdoor into an LLM.

https://arxiv.org/abs/2408.12798


Just to wear my tin foil hat for fun, it's not that the model would attempt to phone home itself (what would it have to say, anyway?) but that given the opportunity it would go around kicking doors open for later infiltration by an outside party. Subtle bugs being introduced to your Django app, invisible characters that break your ssh configs, that sort of thing.


Yes, deliberately introducing vulnerabilities when generating code is a good one, and could be quite subtle. For running console commands though, anything touching configuration for ssh, gpg, bash aliases, ~/bin, cron, etc., should be immediately obvious.

I was thinking "here's an IP address and ssh key" would be what to phone home with, and that could be encrypted/hidden pretty well, but any network access should be pretty suspicious right away.


That would be an extremely expensive way to install malware.


Why do you assume backdoors are limited to the Chinese government?


There is no backdoor, but the model is heavily censored and biased towards China. It refuses to discuss Chinese or North Korean politicians, Tiananmen Square, Uyghurs or anything sensitive to China. It's quite positive about Putin; it doesn't mind trashing Western leaders, though. It may write clever code, and I understand that Chinese researchers have to abide by local laws, but it certainly has opinions that are incompatible with mine.


Given how Western models have their own biases, it occurs to me that we might be better off with a panel of models playing mock UN to cover everything.




