I narrate notes to myself on my morning walks[1] and then run whisper locally to turn the audio into text... before having an LLM clean up my ramblings into organized notes and todo lists. I have it pretty much all local now, but I don't mind waiting a few extra seconds for it to process since it's once a day. I like the privacy because I was never comfortable telling my entire life to a remote AI company.
[1] It feels super strange to talk to yourself, but luckily I'm out early enough that I'm often alone. Worst case, I pretend I'm talking to someone on the phone.
Same. My husky/pyr mix needs a lot of exercise, so I'm outside a minimum of a few hours a day. As a result I do a lot of dictation on my phone.
I put together a script that takes any audio file (mp3, wav), normalizes it, runs it through ggerganov's whisper, and then cleans it up using a local LLM. This has saved me a tremendous amount of time. Even modestly sized 7B-parameter models can handle syntactic/grammatical work relatively easily.
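For anyone who wants to roll their own, here's a minimal sketch of that kind of pipeline in Python. The binary name, model file, and paths are placeholders for whatever your setup uses (whisper.cpp's CLI binary has been renamed across versions), so treat it as a starting point, not a drop-in script:

```python
#!/usr/bin/env python3
# Minimal sketch: normalize audio -> whisper.cpp -> local LLM cleanup.
# Binary names, model files, and paths below are placeholders.
import json
import subprocess
import sys
import urllib.request

audio_in = sys.argv[1]          # any format ffmpeg understands (mp3, wav, ...)
wav = "/tmp/normalized.wav"

# 1. Convert to 16 kHz mono WAV with loudness normalization
#    (whisper.cpp expects 16 kHz mono input).
subprocess.run(["ffmpeg", "-y", "-i", audio_in, "-ar", "16000",
                "-ac", "1", "-af", "loudnorm", wav], check=True)

# 2. Transcribe with whisper.cpp (the binary was ./main in older
#    releases, whisper-cli in newer ones).
subprocess.run(["./whisper-cli", "-m", "models/ggml-base.en.bin",
                "-f", wav, "-otxt", "-of", "/tmp/transcript"], check=True)
transcript = open("/tmp/transcript.txt").read()

# 3. Clean up the transcript with a local LLM via Ollama's HTTP API.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3.1",
        "prompt": "Fix the grammar and punctuation in this transcript, "
                  "keeping the original wording:\n\n" + transcript,
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["response"])
```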
Button-toggled voice notes in the iPhone Notes app are a godsend for taking measurements. Rather than switching your hands between probe/equipment and notes repeatedly, which sucks badly, you can just dictate your readings and maaaaybe clean out something someone said in the background. Over the last decade, the microphones + speech recognition became Good Enough for this. Wake-word/endpoint models still aren't there yet, and they aren't really close, but the stupid on/off button in the Notes app 100% solves this problem and the workflow is now viable.
I love it and I sincerely hope that "Apple Intelligence" won't kill the button and replace it with a sub-viable conversational model, but I probably ought to figure out local whisper sooner rather than later because it's probably inevitable.
Some dubious marketing choices on their landing page:
> Finding the Truth – Surprisingly, my iZYREC revealed more than I anticipated. I had placed it in my husband's car, aiming to capture some fun moments, but it instead recorded intimate encounters between my husband and my close friend. Heartbreaking yet crucial, it unveiled a hidden truth, helping me confront reality.
> A Voice for the Voiceless – We suspected that a relative's child was living in an abusive home. I slipped the device into the child's backpack, and it recorded the entire day. The sound quality was excellent, and unfortunately, the results confirmed our suspicions. Thanks iZYREC, giving a voice to those who need it most.
It's an on-screen toggle button in the Notes app. Press it to start recording, take your time, and don't worry about a 10-second turnaround if you pause slightly too long between words; just speak and toggle the button back off when you're done. If someone walks up and has a conversation halfway through, just delete those words.
I do a lot of stargazing and have experimented with voice memos for recording my observations. The problem, of course, is later going back, listening to the voice memo, and getting organized information out of what essentially turns into me rambling to myself.
I'm going to try to use whisper + AI to transcribe my voice memos into structured notes.
You can use it for everything. Just make sure that you have an input method set up on your computer and phone that allows you to use whisper.
That's how I'm writing this message to you.
Learning to use these speech-to-text systems will be a new kind of literacy.
I think pushing the transcription through language models is a fantastic way to deal with the complexity and frankly, disorganization of directly going from speech to text.
By doing this we can all basically type at 150-200 words a minute.
My faith in humanity has notched up a little today after seeing how easy Aiko and Ollama make it to use this incredible tech.
As a daily user of all the latest OpenAI/Claude models for the last few years, I'm amazed at how good llama3.1 is - all running locally, privately, with zero connection to the web. How did I not know about this?!
I just have a local script that pulls the audio file from Voice Memos (after it syncs from my iPhone), runs it through OpenAI's whisper (really the best at speech-to-text; excellent results), and then makes sense of it all with a prompt that asks for organized summary notes and todos in GH-flavored markdown. That final output goes into my Obsidian vault. The model I use is llama3.1, but I haven't spent much time testing others. I find you don't really need the largest models, since the task is to organize text rather than augment it with a lot of external knowledge.
Humorously, the harder part of the process was finding where the hell Voice Memos actually stores these audio files. I wish you could set the location yourself! They live deep inside ~/Library/Containers. Voice Memos has no export feature, but I found you can drag any audio recording out of the left sidebar to the desktop or a folder. So I just drag the voice memo into a folder my script watches, and then it runs the automation (sketched below).
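In case it helps anyone, here's a simplified sketch of what that watch-folder automation can look like. The folder paths, model choice, and prompt are just examples, and it uses crude polling rather than a real filesystem watcher; the real thing would also want error handling:

```python
#!/usr/bin/env python3
# Simplified sketch of a watch-folder automation: new audio files get
# transcribed with whisper, cleaned up by a local LLM via Ollama, and
# written into an Obsidian vault. All paths and names are examples.
import json
import os
import time
import urllib.request

import whisper  # pip install openai-whisper

WATCH_DIR = os.path.expanduser("~/VoiceMemoDrop")        # drag memos here
VAULT_DIR = os.path.expanduser("~/Obsidian/DailyNotes")  # Obsidian folder
PROMPT = ("Turn this rambling transcript into organized summary notes "
          "and a todo list in GitHub-flavored Markdown:\n\n")

model = whisper.load_model("base")
seen = set(os.listdir(WATCH_DIR))

while True:
    for name in sorted(set(os.listdir(WATCH_DIR)) - seen):
        seen.add(name)
        if not name.lower().endswith((".m4a", ".mp3", ".wav")):
            continue
        transcript = model.transcribe(os.path.join(WATCH_DIR, name))["text"]
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=json.dumps({"model": "llama3.1",
                             "prompt": PROMPT + transcript,
                             "stream": False}).encode(),
            headers={"Content-Type": "application/json"})
        notes = json.loads(urllib.request.urlopen(req).read())["response"]
        out_path = os.path.join(VAULT_DIR, os.path.splitext(name)[0] + ".md")
        with open(out_path, "w") as f:
            f.write(notes)
    time.sleep(5)  # crude polling; a filesystem watcher would be nicer
```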
If anyone has another, better option for recording your voice on an iPhone, let me know! The nice thing about all this is you don't even have to start / stop the recording ever on your walk... just leave it going. Dead space and side conversations and commands to your dog are all well handled and never seem to pollute my notes.
Have you tried the Shortcuts app? It's on both iPhone and Mac. You should be able to make one that finds and moves a voice memo when run. You can trigger shortcuts with a button press or via automation.
Also, what kind of local machine do you need? I have an iMac Pro and am wondering whether it will run the models, or if I ought to be on an Apple Silicon machine. I have an M1 MacBook Air as well.
You can record your voice messages and send them to yourself in Telegram. They're saved on-device. You can then create a bot that processes them as they come in, e.g. "transcribe new ogg files and write the text back as a reply to the voice memo".
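A hypothetical sketch of that bot using the python-telegram-bot library (v20-style async API) plus local whisper; the token and temp path are placeholders:

```python
#!/usr/bin/env python3
# Hypothetical sketch of the Telegram idea above: transcribe incoming
# voice messages and reply with the text. Uses python-telegram-bot
# (v20+ async API); token and temp path are placeholders.
import whisper  # pip install openai-whisper
from telegram import Update
from telegram.ext import Application, ContextTypes, MessageHandler, filters

model = whisper.load_model("base")

async def handle_voice(update: Update, context: ContextTypes.DEFAULT_TYPE):
    # Telegram delivers voice memos as ogg/opus files.
    tg_file = await update.message.voice.get_file()
    path = f"/tmp/{tg_file.file_unique_id}.ogg"
    await tg_file.download_to_drive(path)
    text = model.transcribe(path)["text"].strip()
    await update.message.reply_text(text or "(no speech detected)")

app = Application.builder().token("YOUR_BOT_TOKEN").build()
app.add_handler(MessageHandler(filters.VOICE, handle_voice))
app.run_polling()
```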
This is exactly why I think the AI pins are a good idea. The Humane pin seems too big/too expensive/not quite there yet, but for what you're doing, I would like some type of brooch.
I wish I were better at holding a thought on my long thinking walks; sometimes even just a screen is enough to kill the thought I want to flesh out. Typically I just take a notepad, but the pin thing would be better!
I remember the first lecture in the Theory of Communication class where the professor introduced the idea that communication by definition requires at least two different participants. We objected that it can perfectly well be one and the same participant (communication spans not just space but also time), and what you describe is a perfect example of that.
GP is making a joke about speaking to oneself really just being the human version of Chain of Thought, which in my understanding is a prompting technique that has an LLM write out the intermediate steps in problem solving and evaluate their validity as it goes.
You can use llama.cpp; it runs on almost all hardware. whisper.cpp is similar, but unless you have a mid- or high-end Nvidia card it will be a bit slower.