Interesting, I've also been running an IRC bot with multimodal capability for months now. It's not a real LMM - rather, a combination of three models. It uses Llava for images and Whisper for audio. The pipeline is simple: if it finds a URL that looks like an image, it feeds it to Llava (and likewise for audio with Whisper). Llava's response is injected back into the main LLM (a round robin of Solar 10.7B and Llama 13B), which produces the reply in the style of the bot's character (persona) and in the context of the conversation. I run it locally on my RTX 3060 using llama.cpp. It can also search Wikipedia and the news (provided by Yahoo RSS), and it can open HTML pages (if it sees a URL that is neither an image nor audio).
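Roughly, the routing looks like this (a minimal Python sketch; the describe_image / transcribe_audio / fetch_page helpers stand in for the Llava, Whisper and HTML-fetching steps, and the real code differs in the details):

```python
# Sketch of the dispatch: pull URLs out of a message, guess what each one is,
# get a description/transcript/page text for it, and hand that to the main
# LLM as extra context. The three helpers are placeholders for the Llava,
# Whisper and HTML steps.
import mimetypes
import re

URL_RE = re.compile(r"https?://\S+")

def describe_image(url: str) -> str:    # placeholder for the Llava call
    return f"[image description of {url}]"

def transcribe_audio(url: str) -> str:  # placeholder for the Whisper call
    return f"[transcript of {url}]"

def fetch_page(url: str) -> str:        # placeholder for the HTML fetch
    return f"[text content of {url}]"

def context_for(message: str) -> str:
    """Build the extra context injected into the main LLM's prompt."""
    notes = []
    for url in URL_RE.findall(message):
        kind, _ = mimetypes.guess_type(url)
        if kind and kind.startswith("image/"):
            notes.append(describe_image(url))
        elif kind and kind.startswith("audio/"):
            notes.append(transcribe_audio(url))
        else:
            notes.append(fetch_page(url))
    return "\n".join(notes)

print(context_for("check this out https://example.com/cat.jpg"))
```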
Llava is a surprisingly good model for its size. However, what I found is that it often hallucinates "2 people in the background" for many images.
I made the bot just to explore how far I can go with local off-the-shelf LLMs; I never thought it could be useful for blind people, interesting. A practical idea I've had in mind is to hook it up to a webcam so that, for example, the bot can notify me if something interesting happens in front of my house. I guess it could also be useful for blind people if the camera is mounted on the body.
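A rough sketch of what I have in mind, assuming OpenCV for capture; describe() stands in for the Llava call, and a crude frame-difference threshold stands in for "something interesting happened":

```python
# Watch the webcam, and when the picture changes a lot, ask the vision model
# what it sees. describe() is a placeholder for the actual Llava call.
import time
import cv2

def describe(jpg_bytes: bytes) -> str:
    # Placeholder: send the JPEG to Llava (via llama.cpp, ollama, etc.)
    # and return its description.
    return "[description from the vision model]"

cam = cv2.VideoCapture(0)  # first webcam
previous = None

while True:
    ok, frame = cam.read()
    if ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (21, 21), 0)
        if previous is not None:
            diff = cv2.absdiff(previous, gray)
            if diff.mean() > 10:  # crude "something changed" threshold
                ok_jpg, jpg = cv2.imencode(".jpg", frame)
                if ok_jpg:
                    print("Alert:", describe(jpg.tobytes()))
        previous = gray
    time.sleep(2)
```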
> "Llava is a surprisingly good model for its size. However, what I found is that it often hallucinates "2 people in the background" for many images."
Llamafile.exe[1] is Llava-based, and I find it hallucinates handbags in images a lot. Asked to describe a random photo with "I cannot see. Please accurately and thoroughly describe this scene with 'just-the-facts' descriptions and without editorialising.", it comes out with text that feels like an estate agent wrote it, often picking an imaginary handbag or two as a detail worth mentioning:
"The image depicts a street scene where three people are gathered around a white car parked near a building. One man is standing next to the vehicle, while another is holding a cell phone and talking to a woman in front of him. The third person is also nearby, participating in the conversation or observing the situation. The background showcases various elements such as a couple of handbags on the ground close to one of the individuals, as well as multiple chairs placed at different distances from each other. These objects further emphasize the social aspect of this outdoor gathering."
(Note there are "three people", made of a man, another man, a woman, and a third person). There were no handbags in that street scene, which was actually two tourist-looking people with a phone next to a car with a driver in it. Or:
"The scene depicts a group of people on the back of a boat, with a beautiful young woman riding in front. Several individuals are holding umbrellas above their heads as they enjoy the outing. The boat is located in shallow water close to shore, near brick buildings, possibly a hotel. A few chairs can be seen onboard along with several handbags carried by the passengers. Additionally, a couple of bottles and an orange are present in the scene, suggesting refreshments during the boating trip."
No handbags, chairs, bottles or orange were visible on the pleasure-trip boat going past. Or:
"Various vehicles can be spotted nearby, including cars parked or driving along the road, and a truck located further back in the scene. A handbag is also visible, possibly belonging to one of the shoppers at the market."
One woman off to the side was carrying a handbag with the strap diagonally across her body and the bag on her front. Possibly it belonged to her ... or possibly she nicked it?
"One person appears to be holding a backpack while standing with the rest of the group. A handbag can also be seen resting near another individual among the group."
Nope.
"a large number of pedestrians are walking up and down between shops and stores, likely engaging in various activities or running errands. Some people have handbags, which can be seen as they walk along the sidewalk."
Nobody visibly had a handbag.
It seems odd that it picks out handbags as one of the few things worth describing, repeatedly. As if the training data contained lots of images tagged 'handbag' and that concept has survived into the small model.
See also the article at [2] and the top comment in the discussion about fake photos in 1917; running this query over and over on random pictures from my photo collection, I recognise the output style, the template-feeling elements of it, much more now.
I'm also totally blind and, somewhat relatedly, I've built Gptcmd, a small console app to ease GPT conversation and experimentation (see the readme for more on what it does, with inline demo). Version 2.0 will get GPT vision (image) support:
I had an interesting conversation the other day about how best to make ChatGPT style "streaming" interfaces accessible to screenreaders, where text updates as it streams in.
I'm totally blind and built Gptcmd, a small console app to make interacting with GPT, manipulating context/conversations, etc. easier. Since it's just a console app, when streaming is enabled, output is seamlessly reported.
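For the curious, a stripped-down sketch of why a console stream is easy to follow (this is not Gptcmd's actual code; it just uses the OpenAI Python client and prints tokens as they arrive, so the screen reader reads plain appended text rather than a constantly mutating page):

```python
# Minimal console streaming with the OpenAI Python client (v1.x).
# Tokens are simply appended to stdout as they arrive.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[{"role": "user", "content": "Describe a sunset in one sentence."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # append-only output
print()
```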
Hey! I don’t understand too much about AI/ML/LLMs (and now LMMs!), so I’m hoping someone could explain a little further for me.
What I gather is this is an IRC bot/plugin/add-on that will allow a user to prompt an ‘LMM’ which is essentially an LLM with multiple output capabilities (text, audio, images etc) which on the surface sounds awesome.
How does an LMM benefit blind users over an LLM with voice capability? Is the addition of image/video just for accessibility to non-blind people?
What’s the difference between this and integrating an LLM with voice/image/video capability?
Is there any reason that this has been made over other available uncensored/free/local LLMs (aside from this being an LMM)?
It's the multimodal input capability that seems to be of value here – see the transcript at https://2mb.codes/~cmb/ollama-bot/#chat-transcript. Namely, being able to interrogate images in a verbal fashion, such that someone without sight (or perhaps even someone who just doesn't want to see an image) can get an appreciation for their contents.
Yes, the image interrogation is exactly the point.
This all started out when my friend said that it would be cool to be
able to chat on IRC with an LLM running on his own hardware. And then we were
like, oh hey, we can get this thing to describe images for us
if we use an LMM.
The next thing we want to do is obtain some glasses with cameras and wi-fi and
send images to ollama from them for real-time description. The benefits
are obvious, especially for mobility purposes.
This is so cool. I’d ask how it works, however I feel like I wouldn’t understand at a fundamental level, even if I read through your codebase. Interpreting an image in the concept of a machine baffles me, it doesn’t have eyes. It surely can’t sense light like humans can. It can’t possibly understand depth (the sofa is in the far left background?!). It can’t know what a goatee is, based on some pixels that are mildly different colours than the skin or background. These are all assumptions I’ve made coming into this, and I am relatively sure I’m wrong at this stage.
If you’d like to post a brief explanation, though, I’m sure a lot of HN denizens would appreciate it. I’ll just stand on the sidelines, post this, spectate the commentary, and try it myself with a small group.
To be completely honest, I don't really know what I'm doing. The IRC
bot I wrote isn't complicated at all; it basically just acts as a bridge
between IRC and a program that has an HTTP API.
FWIW I've never written an IRC bot before, so this is "baby's first bot".
I also wrote it in Go, even though I'm not a Go programmer. Probably all of
that shines through in the code.
The real magic happens in [ollama](https://ollama.ai/), which lets you run
LMMs locally.
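To give an idea of what the bridge asks ollama to do, here is a rough Python sketch of the HTTP call (the bot itself is in Go, so this is not its actual code; the model name and prompt are just examples):

```python
# Ask a local ollama instance (default port 11434) to describe an image
# using a llava-style model, via the /api/generate endpoint.
import base64
import requests

def describe_image(path: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llava",
            "prompt": "Describe this image for someone who cannot see it.",
            "images": [image_b64],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(describe_image("snapshot.jpg"))
```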
> Interpreting an image in the concept of a machine baffles me, it doesn’t have eyes
Your mistake here is thinking that the machine has an understanding of anything. It doesn't. But if you know how human learning works and what compression (especially lossy compression) is, then it is quite easy to understand.
The machine is fed tons of images along with words describing what is in each image. Then it finds what is similar across images of similar objects, i.e. it works just like a compression algorithm, except it doesn't store exact matches but relationships between markers it finds in the images. That's why it doesn't, and doesn't need to, understand where the sofa is or what a sofa is; it just has a relationship between something that relates to the word 'sofa' and something that we humans describe as 'position'.
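A toy illustration of that "relationships, not understanding" point, with made-up numbers in the spirit of how image/text embedding models work: the model never knows what a sofa is, it just finds which word vector lies closest to the image's vector.

```python
# Toy example (made-up vectors): "recognising a sofa" is just picking the
# word whose learned vector is closest to the image's learned vector.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend these were learned from training data.
word_vectors = {
    "sofa":    [0.9, 0.1, 0.0],
    "handbag": [0.1, 0.8, 0.2],
    "boat":    [0.0, 0.2, 0.9],
}

image_vector = [0.85, 0.15, 0.05]  # "markers" extracted from some photo

best = max(word_vectors, key=lambda w: cosine(image_vector, word_vectors[w]))
print(best)  # -> "sofa": the closest relationship, not comprehension
```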
As a follow-up to this, I’d like to ask any partially sighted or blind people what issues they currently experience using an LLM such as ChatGPT, Bard, Llama or otherwise - both from a UI perspective and an API perspective.
Blind person here. GPT Vision is extremely sensitive and refuses to describe any content that it finds obscene. It's not just porn, its false positive rate is extremely high and it's not unusual for completely innocent images to be blocked.
Have you tried telling it that you are blind and need an accurate description? I've run into similar issues and this has helped (and it wouldn't even be a lie in your case). Also keep the prompt as vague as possible, no specific mention of people etc.
Thanks for your input! It’s interesting it’s oversensitive to this, I wonder what the safeguards are, and how strict they can be. This is a game changer I would imagine for anyone not able to see, so hopefully you get a better bit of kit very soon.
If there’s anything you need additional support with, check out the ‘Be My Eyes’ app; hopefully someone there will be useful for you!
I'm blind, but I wouldn't know. I haven't used bard or gpt or any of that,
and I don't plan on doing so. I thought about playing with GPT back in
early 2023, when the hype really started picking up. But when I realized
that they wanted my phone number, I noped out.
An additional question, if you don’t mind answering (and there’s zero obligation to). How have you found accessibility has changed on the web over the years? We have many tools these days to assist, but do you feel there’s been a notable improvement compared to what used to be in place?
I've been online in some form or other since 1993. Back in 1993, everything
was basically accessible by default, because it was plain text.
No, there has not been an improvement, and in fact, things have gotten worse
in a lot of ways. Most of that is due to SPAs, and people who decide
to use JavaScript when HTML widgets would suffice.
I'm also involved with a text mode web browser project, [edbrowse](https://edbrowse.org/).
Ten years ago, it was feasible to use edbrowse for a great deal of online
activity. For instance, I used it to make purchases from Amazon and other
online stores. I could log into Paypal and send money with it.
Then, in the mid 2010s or so, SPAs started becoming a thing and edbrowse
broke on more and more sites. At this point, in 2024, I can't even use it
to read READMEs on Github.
And yes, I have accessibility trouble when using mainstream browsers too,
all the time.
The DOJ takes the position that the ADA applies to websites and they've done something about it like six times. If only they would increase enforcement efforts about 10000 fold.
This is shocking, I can’t do much about the wider industry but I’m a huge advocate for accessibility on the web, I’ll continue being a pain in the arse to POs, PMs, devs and general management pushing for this kind of change.
If everyone put their own loved ones in the situation of others, life would be different.
Do you also nope out on other services (bank accounts, Uber, etc.) that require a phone number? Somewhat genuinely curious, because I don't understand what the issue is with giving out your phone number.
As a rule, I tend to nope out on signing up for anything at all. I do have a Lyft account and, of course, a bank account; both of those entities have a legitimate reason for contacting me, so I don’t mind them having my phone number. But it’s really terrible how we’ve normalized giving phone numbers away. It has become so normalized that I’m on a nerd site trying to explain why I feel so strongly about giving away my personal data. Also, nota bene: PII is a liability, not an asset.
I answered a lot of this in a post on the OpenAI forum [1], but basically, thank goodness for APIs and the screen reader addons that make use of them, because the only two AIs that are even remotely easy to use and accessible are Bard and Bing.
Since there's no way to truly objectively tell if LLM output is correct, this seems like it would have its limits, even if it seems subjectively good, but I have that problem with all of the LLM stuff I guess.