
I narrate notes to myself on my morning walks[1] and then run whisper locally to turn the audio into text... before having an LLM clean up my ramblings into organized notes and todo lists. I have it pretty much all local now, but I don't mind waiting a few extra seconds for it to process since it's once a day. I like the privacy because I was never comfortable telling my entire life to a remote AI company.

[1] It feels super strange to talk to yourself, but luckily I'm out early enough that I'm often alone. Worst case, I pretend I'm talking to someone on the phone.
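
For anyone wondering what that looks like in practice, here's a minimal sketch (not my exact script); it assumes the openai-whisper and ollama Python packages, a local Ollama server, and a hypothetical morning-walk.m4a recording:

    import whisper
    import ollama

    # Transcribe locally; bigger whisper models are more accurate but slower.
    model = whisper.load_model("base")
    transcript = model.transcribe("morning-walk.m4a")["text"]

    # Have a local LLM tidy the rambling into notes + todos.
    resp = ollama.chat(
        model="llama3.1",
        messages=[{
            "role": "user",
            "content": "Clean up this voice memo transcript into organized "
                       "notes and a todo list:\n\n" + transcript,
        }],
    )
    print(resp["message"]["content"])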




Same. My husky/pyr mix needs a lot of exercise, so I'm outside a minimum of a few hours a day. As a result I do a lot of dictation on my phone.

I put together a script that takes any audio file (mp3, wav), normalizes it, runs it through ggerganov's whisper, and then cleans it up using a local LLM. This has saved me a tremendous amount of time. Even modestly sized 7b parameter models can handle syntactical/grammatical work relatively easily.

Here's the gist:

https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
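
If you don't want to click through, the shape of it is roughly this (just a sketch, not the gist itself; the binary and model paths are placeholders):

    import subprocess

    def transcribe(audio_path: str) -> str:
        wav = "normalized.wav"
        # whisper.cpp wants 16 kHz mono WAV; loudnorm evens out the levels
        subprocess.run(
            ["ffmpeg", "-y", "-i", audio_path, "-af", "loudnorm",
             "-ar", "16000", "-ac", "1", wav],
            check=True,
        )
        # whisper.cpp CLI ("main" in older builds, "whisper-cli" in newer ones)
        subprocess.run(
            ["./main", "-m", "models/ggml-base.en.bin",
             "-f", wav, "-otxt", "-of", "transcript"],
            check=True,
        )
        with open("transcript.txt") as f:
            return f.read()

The cleanup pass is then just the transcript plus a "fix grammar and punctuation, change nothing else" style prompt to whatever 7b model you have locally.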

EDIT: I've always talked out loud through problems anyway, throw a BT earbud on and you'll look slightly less deranged.


If it’s helpful here is the prompt I use to clean up voice-memo transcripts: https://gist.github.com/adamsmith/2a22b08d3d4a11fb9fe06531ae...


Button-toggled voice notes in the iPhone Notes app are a godsend for taking measurements. Rather than switching your hands between probe/equipment and notes repeatedly, which sucks badly, you can just dictate your readings and maaaaybe clean out something someone said in the background. Over the last decade, the microphones + speech recognition became Good Enough for this. Wake-word/endpoint models still aren't there yet, and they aren't really close, but the stupid on/off button in the Notes app 100% solves this problem and the workflow is now viable.

I love it and I sincerely hope that "Apple Intelligence" won't kill the button and replace it with a sub-viable conversational model, but I probably ought to figure out local whisper sooner rather than later because it's probably inevitable.


I bought an iZYREC (?) and leave the phone at home. MacWhisper and some regex (I use verbal tags) and done
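
By verbal tags I mean saying something like "tag todo ..." out loud while recording, then pulling those spans out of the transcript afterwards. Roughly (an illustration, not my literal tags or pattern):

    import re

    transcript = "picked a good spot tag todo order new batteries tag idea try a longer exposure"
    todos = [m.strip() for m in re.findall(r"tag todo (.+?)(?=tag \w+|$)", transcript)]
    # -> ['order new batteries']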


Some dubious marketing choices on their landing page:

> Finding the Truth – Surprisingly, my iZYREC revealed more than I anticipated. I had placed it in my husband's car, aiming to capture some fun moments, but it instead recorded intimate encounters between my husband and my close friend. Heartbreaking yet crucial, it unveiled a hidden truth, helping me confront reality.

> A Voice for the Voiceless – We suspected that a relative's child was living in an abusive home. I slipped the device into the child's backpack, and it recorded the entire day. The sound quality was excellent, and unfortunately, the results confirmed our suspicions. Thanks iZYREC, giving a voice to those who need it most.


Hahaha yes, a Chinese idea of marketing. However, it has a good battery and a good mic, and you can use it offline as mass storage, so nothing is lost.


> Button-toggled voice notes in the iPhone Notes app

Is this a physical button or on-screen? I’ve been rewatching Twin Peaks recently and would love a high-tech implementation of Cooper’s tape recorder.


It's an on-screen toggle button in the Notes app. Press to start recording, take your time, and don't worry about a 10-second turnaround if you pause slightly too long between words; just speak and toggle the button back off when you're done. If someone walks up and has a conversation halfway through, just delete those words.


They could be referring to the "action button" on the newer Pro phones, which replaced the silent switch.


They might be referring to an on-screen button.

Check out the new voice transcription feature in iOS 18. On my SE 2022 (very much not high-end), I have a good-size record/pause button on screen.

After you’re done with the voice recording, it gives you a transcription of what you spoke.


This has inspired me.

I do a lot of stargazing and have experimented with voice memos for recording my observations. The problem of course is later going back and listening to the voice memo and getting organized information out of what essentially turns into me rambling to myself.

I'm going to try to use whisper + AI to transcribe my voice memos into structured notes.


You can use it for everything. Just make sure that you have an input method set up on your computer and phone that allows you to use whisper.

That's how I'm writing this message to you.

Learning to use these speech-to-text systems will be a new kind of literacy.

I think pushing the transcription through language models is a fantastic way to deal with the complexity and, frankly, the disorganization of going directly from speech to text.

By doing this we can all basically type at 150-200 words a minute.


It would be really amazing if you could expand on your workflow a little, especially how you have stitched everything together.


> That's how I'm writing this message to you.

Neat. Can you explain your setup a little? How do you go from voice to whisper to writing in this reply input form on a webpage?


I'm a noob and not a dev either. Can you please explain how to set this all up?

If it matters, I managed to install Jan AI on my Linux Mint machine and am able to use the Mistral model. I use an Android phone, if it helps. Thanks.


Cool. Not sure why downvoted.


For people on macOS, the free app Aiko on the App Store makes it easy to use Whisper, if you want a GUI: https://sindresorhus.com/aiko


My faith in humanity has notched up a little today after seeing how easy Aiko and Ollama make it to use this incredible tech.

As a daily user of all the latest OpenAI/Claude models for the last few years, I'm amazed at how good llama3.1 is - all running locally, privately, with zero connection to the web. How did I not know about this?!


I would be greatly interested in knowing how you set all that up if you felt like sharing the specifics.


My hope is to make this easy with a GH repo or at least detailed instructions.

I'm on a Mac and I found the easiest way to run & use local models is Ollama, as it has a REST interface: https://github.com/ollama/ollama/blob/main/docs/api.md

I just have a local script that pulls the audio file from Voice Memos (after it syncs from my iPhone), runs it through OpenAI's whisper (really the best at speech-to-text; excellent results) and then makes sense of it all with a prompt that asks for organized summary notes and todos in GH-flavored markdown. That final output goes into my Obsidian vault. The model I use is llama3.1, but I haven't spent much time testing others. I find you don't really need the largest models since the task is to organize text rather than augment it with a lot of external knowledge.
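
The LLM step against that REST interface is basically just this (a sketch; the exact prompt wording and model are whatever you like):

    import requests

    def organize(transcript: str) -> str:
        prompt = (
            "Turn this voice memo transcript into organized summary notes "
            "and a todo list, in GitHub-flavored markdown:\n\n" + transcript
        )
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.1", "prompt": prompt, "stream": False},
            timeout=300,
        )
        r.raise_for_status()
        return r.json()["response"]  # markdown, ready for the Obsidian vault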

Humorously, the harder part of the process was finding where the hell Voice Memos actually stores these audio files. I wish you could set the location yourself! They live deep inside ~/Library/Containers. Voice Memos has no export feature, but I found you can drag any audio recording out of the left sidebar to the desktop or a folder. So I just drag the voice memo into a folder my script watches, and then it runs the automation.

If anyone has another, better option for recording your voice on an iPhone, let me know! The nice thing about all this is you don't even have to start / stop the recording ever on your walk... just leave it going. Dead space and side conversations and commands to your dog are all well handled and never seem to pollute my notes.
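
The folder-watching part is nothing fancy, by the way; a dumb polling loop is plenty (the folder name and the process_memo() handoff are placeholders for the steps above):

    import time
    from pathlib import Path

    WATCHED = Path.home() / "VoiceMemosInbox"
    seen = set(WATCHED.glob("*.m4a"))

    while True:
        for memo in WATCHED.glob("*.m4a"):
            if memo not in seen:
                seen.add(memo)
                process_memo(memo)  # transcribe, organize, write to the vault
        time.sleep(5)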


Have you tried the Shortcuts app? It's on the phone and the Mac. You should be able to make one that finds and moves a voice memo when run. You can run them on a button press or via automation.

Also, what kind of local machine do you need? I have an iMac Pro and am wondering if it will run the models, or if I ought to be on an Apple Silicon machine? I have an M1 MacBook Air as well.


You could also use the "share" menu and AirDrop the audio from your iPhone to your Mac. Files end up in Downloads by default.


Amazing, thank you for this!


You can record your voice messages and send them to yourself in Telegram. They're saved on-device. You can then create a bot that acts on them as they come in, like "transcribe new ogg files and write back the text as a message after the voice memo".
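
A minimal sketch of that kind of bot, assuming the python-telegram-bot (v20+) and openai-whisper packages; the token is obviously a placeholder:

    import whisper
    from telegram import Update
    from telegram.ext import Application, ContextTypes, MessageHandler, filters

    model = whisper.load_model("base")

    async def transcribe_voice(update: Update, context: ContextTypes.DEFAULT_TYPE):
        # Download the incoming voice note (ogg) and reply with the transcript.
        f = await update.message.voice.get_file()
        await f.download_to_drive("memo.ogg")
        text = model.transcribe("memo.ogg")["text"]
        await update.message.reply_text(text)

    app = Application.builder().token("YOUR_BOT_TOKEN").build()
    app.add_handler(MessageHandler(filters.VOICE, transcribe_voice))
    app.run_polling()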


Thanks for this!


This is exactly why I think the AI pins are a good idea. The Humane pin seems too big/too expensive/not quite there yet, but for exactly what you're doing, I would like some type of brooch.


How is that preferable to an AirPod with Siri or Google or whatever app one has on a phone?


Wish I were better at holding a thought on my long thinking walks; sometimes even just a screen is enough to kill the thought I want to flesh out. Typically I just take a notepad, but the pin thing would be better!


> It feels super strange to talk to yourself

I remember the first lecture of the Theory of Communication class, where the professor introduced the idea that communication by definition requires at least two different participants. We objected that it can perfectly well be one and the same participant (communication spans not just space but also time), and what you describe is a perfect example of that.


"before having an LLM clean up my ramblings into organized notes and todo lists."

Which local LLM do you use?

Edit:

And self-talk is quite a healthy and useful thing in itself, but avoiding it in public is indeed kind of necessary because of the stigma:

https://en.m.wikipedia.org/wiki/Intrapersonal_communication


That's just meat CoT (chain of thought) - right?


I do not understand?


GP is making a joke about speaking to oneself really just being the human version of Chain of Thought, which in my understanding is an architectural decision in LLMs to have the model write out intermediate steps in problem solving and evaluate their validity as it goes.


Just put in some earbuds and everyone will assume you're on the phone.


YES. I discovered this exact use case myself a few months ago. Tweeted about it even:

https://x.com/adpirz/status/1823727814191014323?s=46


I was thinking about making this the other day. Would you mind sharing what you used?


What software did you use to set all this up? Kind of interested in giving this a shot myself.


You can use llama.cpp; it runs on almost all hardware. Whisper.cpp is similar, but unless you have a mid- or high-end Nvidia card it will be a bit slower.

Still very reasonable on modern hardware.
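
If you'd rather stay in Python, the llama-cpp-python bindings wrap llama.cpp directly; a rough sketch (the gguf filename is a placeholder):

    from llama_cpp import Llama

    llm = Llama(model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf")
    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Organize this transcript into notes and todos: ..."}]
    )
    print(out["choices"][0]["message"]["content"])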


If you build locally for Apple hardware (instructions in the whisper.cpp readme) then it performs quite admirably on Apple computers as well.


Definitely try it with Ollama; it is by far the simplest way to get a local LLM up and running with minimal fuss!


What do you use to run whisper locally? I don't think ollama can do it.


Can you give any more details on your setup? This sounds great


I found one that can isolate speakers; it's just okay at that.



