> MobileLLM-125M/350M attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M SoTA models on zero-shot commonsense reasoning tasks
Small models, slightly improved, probably still not good enough for the same use as online models. Nothing wrong with incremental progress, however.
1.5B parameter model does seem to be a pretty decent step up, even beating larger models by a wide margin. I'm not sure why they didn't go larger -- having a more efficient model that fits on hardware the size of the RPi could be a gamechanger (IIRC TinyLlama 7B does run, barely).
>> Small models, slightly improved, probably still not good enough for the same use as online models. Nothing wrong with incremental progress, however.
An even smaller language model should still be useful as part of a speech-to-text system. Such systems benefit from using a language model to narrow down which word was spoken when the audio is ambiguous or noisy.
ASR systems already use language models during decoding, though mostly not large decoder-only LLMs. However, incorporating LLMs into ASR is currently at the center of a lot of research, e.g. pairing a speech encoder like wav2vec 2.0 or the Whisper encoder with a Q-Former and a LoRA adapter on an LLM fine-tuned for ASR.
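The classic, pre-LLM version of this is shallow fusion during beam search: each hypothesis gets the acoustic score plus a weighted LM score, so the LM breaks ties between acoustically similar words. A minimal sketch, with made-up scores and weights:

```python
import math

def shallow_fusion_score(acoustic_logprob, lm_logprob, lm_weight=0.3, length_bonus=0.5):
    """Combine acoustic-model and language-model scores for one beam hypothesis.
    lm_weight and length_bonus are typical tuning knobs; the values here are illustrative."""
    return acoustic_logprob + lm_weight * lm_logprob + length_bonus

# Toy example: two hypotheses that sound alike; the LM breaks the tie.
hyps = {
    "recognize speech":   {"am": -4.1, "lm": math.log(1e-3)},
    "wreck a nice beach": {"am": -4.0, "lm": math.log(1e-6)},
}
best = max(hyps, key=lambda h: shallow_fusion_score(hyps[h]["am"], hyps[h]["lm"]))
print(best)  # "recognize speech"
```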
But imagine if these models were baked into your Instagram app and then used for ad targeting with your own compute. Facebook would get to mine far more data, at lower cost (and with much less litigation risk) to them.
In this application it’s unfair to compare tiny models to cloud models. Moreover any incremental precision boosts to tiny models would be notable (and directly translate to revenue).
> I'm not sure why they didn't go larger -- having a more efficient model that fits on hardware the size of the RPi could be a gamechanger (IIRC TinyLlama 7B does run, barely).
I'm not sure that RPi is the right target for the next step of local LLMs, and I think that it's worth considering web-deployment on engines like WebLLM [1].
A 7B model may "run fine" on a Raspberry Pi, but I've (personally) found 7B models to be a bit larger than I want to download / run for web-based interfaces.
However, a solid 125M model is the sort of thing that I can run on a webpage, and the time it takes to download to the user's browser (combined with my bandwidth costs) isn't exorbitant.
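Back-of-envelope on why the 125M size matters for the web case; parameter counts are nominal and real checkpoints add some overhead:

```python
# Rough download size for serving a model in the browser, by quantization level.
def download_mb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e6

for n_params, name in [(125e6, "125M"), (7e9, "7B")]:
    for bits in (16, 4):
        print(f"{name} @ {bits}-bit: ~{download_mb(n_params, bits):,.0f} MB")
# 125M @ 4-bit is ~62 MB -- roughly a heavy webpage asset;
# 7B @ 4-bit is ~3,500 MB -- clearly too much for casual web use.
```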
Does it have to stay on mobile devices? Bit of a niche, but if it's not a resource hog it could be handy for giving NPCs in games more interesting dialogue without having to use an online model.
Even better if it could be tuned in some way to allow dialogue to influence NPC behavior or actions.
Would it be interesting dialogue? You could generate more dialogue, but would it have anything underpinning it of interest to the player? i.e. you could suddenly have townspeople that would talk about local scenery or their relationships with other NPCs, but none of that stuff they describe would actually exist in the game. I would personally be weirded out if NPCs started making stuff up.
I can imagine training some sort of LLM on your game data such that NPCs are able to actually describe the game world, but I can't imagine what kind of scale you'd need to operate at for that to be cheaper than just paying someone to write the dialogue. Maybe at Ubisoft's scale where your team sizes are in the thousands (AFAIK, they have been investigating using AI for writing, but it's mostly for things like combat barks which are very repetitive and basically noise.)
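One way to avoid the made-up-scenery problem is to only let the model talk about facts the game actually tracks, i.e. build the prompt from structured game state rather than fine-tuning on it. A hypothetical sketch (names and fields invented for illustration):

```python
# Hypothetical: construct an NPC prompt only from facts that exist in the game's
# data, so the model has nothing it needs to invent.
game_state = {
    "npc": {"name": "Mira", "job": "innkeeper", "knows": ["bridge_out", "festival"]},
    "facts": {
        "bridge_out": "The east bridge washed out last week.",
        "festival": "The harvest festival starts tomorrow in the square.",
    },
}

def npc_prompt(state, player_line):
    known = "\n".join(f"- {state['facts'][k]}" for k in state["npc"]["knows"])
    return (
        f"You are {state['npc']['name']}, the {state['npc']['job']}.\n"
        f"Only mention facts from this list, nothing else:\n{known}\n"
        f"Player says: {player_line}\nReply in one short sentence."
    )

print(npc_prompt(game_state, "Anything going on in town?"))
```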
It would definitely depend a lot on the implementation. I think it could work great for some indie devs. Not all, of course; devs who like writing understandably won't like it.
It would be fascinating if NPCs had more backstory to them and more complex behaviors. Although I would imagine it would be near impossible to test since anything could influence their behavior.
I'm definitely interested in exploring this sort of thing. How much can we do with creating interesting characters and interesting circumstances?
Makes me think of the way that characters are set up in AI Alibis -- each with their own secrets, but also with clues about other NPCs' secrets. That feels like clever design, and it's the first use case of LLMs for NPC dialogue that feels interesting to me: https://news.ycombinator.com/item?id=40921990
The Android APK for MLC is updated frequently with recent models built in, and a Samsung S24+ can comfortably run 7-8B models at reasonable speeds (10ish tokens/sec).
I wonder how much you can push the "deeper and thinner" part. At some point your entire FFN fits into your L2 cache, and you're bound to get some performance jumps.
Other research from Meta FAIR actually suggests that you should prune deeper layers if you want to improve performance while maintaining accuracy [1]. So there must be a cutoff point for smaller networks where this approach still works, otherwise the results are contradictory. Or we could drastically improve these new models even further.
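Rough numbers on the cache point, assuming the standard two-matrix FFN of width 4*d_model and int8 weights (both assumptions; sizes are illustrative):

```python
# Rough check of when one transformer FFN block fits in on-chip cache.
# Assumes two matrices of shape d_model x 4*d_model (SwiGLU variants use three
# smaller ones) and 1 byte per weight (int8 quantization).
def ffn_bytes(d_model, bytes_per_weight=1):
    return 2 * d_model * 4 * d_model * bytes_per_weight

for d_model in (512, 768, 1024, 2048):
    print(f"d_model={d_model}: FFN ~{ffn_bytes(d_model) / 1e6:.1f} MB per layer")
# With a few MB of L2/system cache per core cluster, only the thinnest models
# keep a whole FFN resident -- which is where the hoped-for jump would come from.
```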
"So far, we trained compact models from scratch using next tokens as hard labels. We explored Knowledge Distillation (KD)... Unfortunately KD increases training time (slowdown of 2.6−3.2×) and exhibits comparable or inferior accuracy to label-based training (details in appendix)."
Hey HN. I actually have a current need for on-device wake-word-like STT. Which model(s) have the lowest WER and can run on an RPi 4B? I've been looking at openWakeWord. It's for a DIY inventory system.
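If openWakeWord ends up being the pick, the Python side is roughly the following; the model name, threshold, and audio plumbing are placeholders, so check the project's README for the exact API:

```python
# Hypothetical usage sketch of openWakeWord on a Pi; the bundled model name and
# the microphone loop are placeholders, not a verified recipe.
import numpy as np
from openwakeword.model import Model

oww = Model(wakeword_models=["hey_jarvis"])   # one of the project's example models

def on_audio_chunk(chunk_int16: np.ndarray):
    """chunk_int16: ~80 ms of 16 kHz mono int16 samples from your mic loop."""
    scores = oww.predict(chunk_int16)          # dict: model name -> score in [0, 1]
    if max(scores.values(), default=0.0) > 0.5:
        print("wake word detected -> start STT for the inventory command")
```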
It seems like the smaller models get the largest size decrease from embedding sharing/weight tying between the linear head and the token embeddings. Is there any research going into how to further reduce size from there?
If you mean that the LM head is just the (transposed) embedding matrix, then this was already done in GPT-2.
Unfortunately, the only thing I found out about this is that bigger models benefit from a separate layer. But this was only mentioned somewhere on Discord, so there's no paper to read, and my personal hunch is that tying should work for bigger models too. After all, GPT-3 was just a scaled-up GPT-2.
From my personal experiments, models learn better if you give them a harder task, and tied weights could be one such thing. Multi-token prediction could be another, and BitNet could also be considered one (and dropout too).
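For reference, the weight tying discussed above is a one-line change in most implementations; a minimal sketch with illustrative dimensions:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal sketch of input/output embedding tying (as in GPT-2).
    The LM head reuses the token-embedding matrix instead of keeping a separate
    vocab_size x d_model layer, which is a meaningful share of a ~125M model.
    vocab_size and d_model here are illustrative, not MobileLLM's."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight   # tie: one matrix, used twice

    def forward(self, token_ids):
        h = self.embed(token_ids)   # ... transformer blocks would go here ...
        return self.lm_head(h)      # logits via the same (transposed) matrix
```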
How about, instead of Gen AI on the desktop, just AI on the desktop? It could organize all my files, emails, and notes and let me search for information in my own data.
Training models is not OS dependent. RAM is dependent on the model size, and I would argue a model like this should be a lot easier to fine-tune with less GPU RAM.
Nonetheless, the end goal will probably be downloading a model like this, or paying for fine-tuning and then downloading it, and running it through an optimized neural chip.
It's currently more a question of when this will happen. The newest Windows certification already requires a neural chip, and even my Google Pixel 8 Pro can host small models (I know the Pixel is not a cheap phone, but the coprocessor should still be much more affordable than a big GPU).
I like the approach that Apple seems to be taking: fine-tuned small models that handle routine tasks and defer to larger off-device models for things they can't confidently do. I imagine you could construct a training set containing examples that should produce low-confidence answers, add an output that is essentially a “call for help” option, and train the model to choose it in those cases. Smaller models also mean you could have more running in parallel and use another one to route requests to the appropriate expert.
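A hypothetical sketch of that routing idea; the threshold, the “call for help” marker, and the model interfaces are all invented for illustration, not Apple's actual setup:

```python
# Hypothetical on-device/cloud router around a small model's confidence signal.
def answer(query, run_on_device, run_in_cloud, threshold=0.7):
    """run_on_device(query) -> (text, confidence in [0, 1]); run_in_cloud(query) -> text.
    Both are placeholders for whatever local and remote models you actually use."""
    text, confidence = run_on_device(query)
    if confidence >= threshold and text != "<call_for_help>":
        return text                     # handled entirely on device
    return run_in_cloud(query)          # low confidence or explicit deferral

# Toy demo with stand-in "models":
print(answer("set a 5 minute timer",
             lambda q: ("timer set for 5 minutes", 0.95),
             lambda q: "cloud answer"))
print(answer("summarize this legal contract",
             lambda q: ("<call_for_help>", 0.2),
             lambda q: "cloud answer"))
```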
Reading emails, replying to emails, scheduling tasks, using APIs for services.
Basically everything that needs actions rather than knowledge.
"Tell my wife I'm late" and it will use some configured magic to talk to service XY and just do it.
Siri is very good at doing home automation without the internet; the old Google Assistant and Alexa were absolutely not, and I don't think they were ever available offline.
This basically gives you a good, working local (local-first!) assistant.
Would be very nice to have my schedule automatically managed by Siri. It already has a few nice things, but I genuinely have trust issues, especially with AI.
You can get very far with the Shortcuts app by the way. Some examples: using your current location to estimate when you should leave to get to your next meeting on your calendar,
or letting those included in the calendar event know you're running late. Highly recommend it; the learning curve isn't much, just a bunch of drag and drop!
It can be fine-tuned for device-related actions. In other words, with all the capabilities of your device's applications or services, the small model can virtually have the same capabilities. It can dispatch a user request phrased in natural language to those applications and orchestrate them, and it can forward requests beyond the device's capabilities to a cloud model. This is powerful since it changes how you interact with your devices.
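A hypothetical sketch of that dispatch pattern: the small model emits a structured action, and a thin dispatcher decides whether a local app handles it or it gets forwarded to a cloud model (action names and schema invented for illustration):

```python
import json

# Hypothetical dispatcher: the on-device model turns a natural-language request
# into a JSON action; the device routes it locally when possible.
LOCAL_ACTIONS = {"send_message", "set_reminder", "open_app"}

def dispatch(model_output: str) -> str:
    action = json.loads(model_output)   # e.g. produced by the on-device LLM
    if action["action"] in LOCAL_ACTIONS:
        return f"handled locally: {action['action']}({action.get('args', {})})"
    return "forwarded to cloud model"

print(dispatch('{"action": "send_message", "args": {"to": "wife", "text": "running late"}}'))
print(dispatch('{"action": "plan_trip", "args": {"to": "Lisbon"}}'))
```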
I tested the Google AI on my phone: I had the browser open and asked it to read the page to me, and it responded that it does not have access to the internet.
So I would like an AI assistant that:
1 can understand English and my native language
2 is aware that it runs on Android (or KDE/Linux) and can understand commands like "open the Android Settings, Application section", "read the page that is open in the browser", or "read the text in the popup that just opened". Basically, it should be integrated with the OS via public and open APIs. Big AI companies could compete on selling us better assistants, especially for multilingual people.
3 is small; it should not know geography, history, music bands, etc. For tasks where the user asks a question, there should be an option for the model to forward the question to a search engine or even an online LLM.
It could power simple agents like Siri under the hood, helping with natural language understanding, intent classification, retrieval, and other agent tasks.
Optimizing and loading in your own voice, selecting your primary language and adding a little bit of personal knowledge like nicknames, location and stuff?
My Pixel 8 apparently can load and run local models, but I don't have the time right now to follow that rabbit hole.