Does anyone have an idea why they are so open about Whisper? Is it the poster child project for OAI people scratching their open source itch? Is there just no commercial value in speech to text?



I personally use Whisper to transcribe painfully long meetings (2+ hours). The transcripts are then segmented and, you guessed it, entered right into GPT-4 for clean up, summarisation, minutes, etc. So in a sense it's a great way to get more people to use their other products?
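For anyone wondering what the "segmented" step might look like, here is a minimal sketch. It is an assumption on my part, not the commenter's actual code: the function name `split_transcript` and the `max_chars` default are hypothetical, and a real setup would more likely budget by tokens than by characters.

```python
from typing import List


def split_transcript(text: str, max_chars: int = 8000) -> List[str]:
    """Greedily pack whole transcript lines into chunks of at most max_chars.

    Hypothetical helper: the 8000-character default is an arbitrary
    illustration, not a documented limit of any model.
    """
    chunks: List[str] = []
    current = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between transcript segments
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) <= max_chars or not current:
            # Either the line fits, or it is the first line of a new chunk.
            # A single line longer than max_chars is kept whole, not cut.
            current = candidate
        else:
            chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the summarisation model separately, and the per-chunk summaries merged in a final pass.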


This sounds amazing. Would you be willing to share your code? Thanks!


I run this[0] on Google Colab. The way I have it set up is to encode the meeting recordings to .ogg, push them to Google Drive, then adjust the script to tell it how many speakers there were and the topic of conversation. The `initial_prompt` really helps the model, especially if you are talking about brand names, etc. that it may not know how to transcribe correctly. I've added a comment at the bottom of the Gist with some of the prompts I've used in the past. I've successfully managed to produce reports on week-long meetings (~18 hours) that were essential to get the team up to speed.

As a company we are currently shifting to Otter.ai[1] which gives good enough results for everyday meetings.

[0]: https://gist.github.com/StanAngeloff/91480fac18a74d8aff3e4cf...

[1]: https://otter.ai/
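As a rough illustration of the `initial_prompt` idea described above, here is a minimal sketch using the open-source `openai-whisper` package. This is not the code from the Gist: the helper names `build_initial_prompt` and `transcribe_meeting`, the prompt wording, and the `"base"` model choice are all assumptions for illustration.

```python
from typing import List


def build_initial_prompt(topic: str, speakers: int, terms: List[str]) -> str:
    """Compose a short context string to pass as Whisper's initial_prompt.

    Hypothetical helper: the wording is an illustration, not a format
    Whisper requires. Listing brand names gives the model a spelling hint.
    """
    parts = [f"Meeting about {topic} with {speakers} speakers."]
    if terms:
        parts.append("Terms that may come up: " + ", ".join(terms) + ".")
    return " ".join(parts)


def transcribe_meeting(audio_path: str, prompt: str, model_name: str = "base") -> str:
    """Transcribe one audio file (e.g. an .ogg recording) and return the text."""
    # Imported lazily so the prompt helper works without the package installed.
    import whisper  # pip install openai-whisper (ffmpeg must be on PATH)

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, initial_prompt=prompt)
    return result["text"]
```

Usage would be something like `transcribe_meeting("meeting.ogg", build_initial_prompt("Q3 roadmap", 4, ["Otter.ai"]))`, where the file name and topic are placeholders.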


Wow, thanks so much for the in-depth answer. This looks really great, I can’t wait to give it a try.


Speech to text is a relatively crowded area with a lot of other companies in the space. It's also really hard to get "wow" performance, since a transcript is either correct (like most other people's models) or it's wrong.


I've been wondering this as well. I'm super glad, but it seems so different than every other thing they do. There's definitely commercial value, so I find it surprising.


I think it makes more sense to just consider why they're even building it. Their goal is to build an AGI, for which they think they need data and compute. They need market reach and revenue to make data access feasible and open investors' wallets for compute, and anything that makes the data easier to get and isn't too hard to do is going to help them on their main goal. Whisper being as widely available as possible is going to result in a lot more human-origin language, not just in their services that are trainable, but on the web as a whole. Releasing Whisper does basically nothing to increase output of machine-generated text, and increases the amount of human text on the internet, so it's a net win. The actual calculation is then going to be on how hard it is to make, and my guess is that for the top AI research team in the world with Microsoft resources, it turned out to be a pretty easy problem to comprehensively solve.


Everyone’s got a loss leader



