OpenAI quietly launched Whisper V2 in a GitHub commit (github.com/openai)
335 points by fudged71 on Dec 6, 2022 | 126 comments



"Whisper V2" is a bit of an exaggeration. It's

* only an update to the `large` model (and therefore irrelevant for hobbyist, low CPU use cases)

* architecturally identical to the original whisper model

* just trained on more epochs


Agreed. It's maybe like the Linux versioning scheme, where 6.0 is just the one that comes after 5.19:

> The major version number is incremented when the number after the dot starts looking "too big." There is literally no other reason. [1]

[1] https://www.kernel.org/category/releases.html


The large model can run on consumer GPUs, and I'd argue it's more relevant to hobbyist use cases: I'm not working out the efficiency of serving thousands of users, just transcribing my own videos.


Yeah, the commit is literally "add large-v2 model", nothing about the software proper.


> * only an update to the `large` model (and therefore irrelevant for hobbyist, low CPU use cases)

This is maybe the most important thing there is for a lot of use cases, since additional training is just about impossible for most people to do themselves.


Is there an easy-to-install speech engine (training and synthesizing) for Windows? So the reverse of recognition?


I recently swapped out the AI model for voice transcription on revoldiv.com and replaced it with Whisper. The results have been truly impressive - even the smaller models outperform and generalize better than any other options on the market. If you want to give it a try, our model is capable of faster transcription by utilizing multiple GPUs and some other enhancements, and it is all free


Your app is very similar to our demo app! https://modal-labs-whisper-pod-transcriber-fastapi-app.modal...

How come you don't support audio files longer than 1hr? Is it because of $$ cost?

The above demo app gets faster transcription by chunking audio and parallelizing over dozens of CPUs, so you can transcribe about 1 hr of audio for $0.10.


> https://modal-labs-whisper-pod-transcriber-fastapi-app.modal...

Interesting, which model are you using? We use the medium model, which is the sweet spot for the time/performance ratio. We also chunk: we try to detect words and silences to do better chunking at word boundaries, but if you do more chunking and don't get the word boundaries right, Whisper seems to lose some context and accuracy suffers. We will soon support longer audio; we just want to make sure transcription wait times don't suffer for most users. Great demo, reach out to me if you want to collaborate.
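The general idea of silence-aware splitting can be sketched with ffmpeg's silencedetect filter (a rough illustration, not necessarily how revoldiv does it; the noise threshold, minimum silence and target chunk length are made up):

    # Rough illustration: find silences with ffmpeg's silencedetect filter and
    # use them as candidate split points so cuts don't land mid-word.
    import re
    import subprocess

    def find_silences(path, noise_db=-35, min_silence=0.4):
        """Return (silence_start, silence_end) pairs, in seconds."""
        cmd = ["ffmpeg", "-i", path,
               "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
               "-f", "null", "-"]
        err = subprocess.run(cmd, capture_output=True, text=True).stderr
        starts = [float(x) for x in re.findall(r"silence_start: ([\d.]+)", err)]
        ends = [float(x) for x in re.findall(r"silence_end: ([\d.]+)", err)]
        return list(zip(starts, ends))

    def split_points(silences, target_s=300.0):
        """Aim for a cut roughly every target_s seconds, snapped to a silence."""
        points, next_cut = [], target_s
        for start, end in silences:
            if start >= next_cut:
                points.append((start + end) / 2)  # cut in the middle of the silence
                next_cut = points[-1] + target_s
        return points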


We’re using base. The code is open source at modal-labs/model-examples repo if you want to see anything we’re doing


I'd love to try it out, who should I ping to run this locally on my GPU server?


It's not a model you can run on your own server but a free service on revoldiv.com. You can expect 40 to 50 second wait time to transcribe an hour long video/audio. We combine whisper with our model to get word level timestamps, paragraph separation and sound detections like laughter, music etc... We recently added very basic podcast search and transcription.


I'm getting an unknown error, error code 551.


Has anyone combined Whisper with a model that can assign owners to voices yet?

For example, instead of:

    Hello there!
    Hi 
    How are you?
    Good, and you?
You get something like:

    Voice A: Hello there!
    Voice B: Hi 
    Voice A: How are you?
    Voice B: Good, and you?


This is called speaker diarization, basically one of the 3 components of speaker recognition (verification, identification, diarization).

You can do this pretty conveniently using pyannote-audio[0].

Coincidentally I did a small presentation on this at a university seminar yesterday :). I could post a Jupyter notebook if you're interested.

PS: Bai & Zhang (2020) is a great review on the literature [1]

[0] https://github.com/pyannote/pyannote-audio

[1] https://arxiv.org/abs/2012.00931
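If you don't want to wait for the notebook, a minimal sketch of the pyannote.audio route looks roughly like this (2.x API; "audio.wav" and the Hugging Face token are placeholders, and the exact pretrained pipeline name may differ by version):

    # Minimal diarization sketch with pyannote.audio.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                        use_auth_token="hf_...")  # placeholder token
    diarization = pipeline("audio.wav")

    # Each item is a speech turn with start/end times and a speaker label.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:6.1f}s  {turn.end:6.1f}s  {speaker}")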


Yes please, posting a Jupyter notebook would be a great help.


yes please!


I modified a Jupyter notebook that’s been going around to do this and have nice Markdown output.

This is the best AI diarization and transcription I’ve been able to get so far: https://github.com/zachlatta/openai-whisper-speaker-identifi...


Not whisper, but:

> The Google Recorder app (...) transcribes meetings and interviews to text, instantly giving you a searchable transcription that is synced to the recorded audio (...) With this new update, the recorder can now identify and label each speaker automatically—an impressive feat. Google Recorder is exclusive to the Pixel 6 and newer Pixel devices.

https://arstechnica.com/gadgets/2022/12/pixel-7-update-adds-...


That's awesome. I wonder if there's a way to upload existing recordings to Google Recorder to get the transcriptions.


I'd be down to code this up if anyone's interested. Contact is in my profile.


This is called "diarization"


Not that I know of, but I use Descript for content creation and it offers something like what you're describing: they have automatic speaker labels built into their transcription service.

Would be lovely to see this feature open sourced.


There is a discussion on the Whisper github page called something like “diarization” which details a few attempts to attain this functionality with additional tools.


I've had limited success combining the results of pyannote with Whisper transcripts (based on timestamps).

It’s ok, but the quality of speaker identification is nowhere near as good as the transcription itself.

I'd love to see models which try to use stereo information in recordings to solve the problem. Or, given a fixed camera and static speakers, it should even be possible to use video to add information about who is speaking. There doesn't seem to be anything like that right now, though.
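The timestamp matching itself can be as simple as giving each Whisper segment the speaker whose diarization turn overlaps it most. A rough sketch (assumes result["segments"] from model.transcribe() and an annotation from a pyannote pipeline; it won't handle overlapping speech well):

    # Sketch: assign each Whisper segment the speaker whose diarization turn
    # overlaps it the most.
    def assign_speakers(segments, diarization):
        labelled = []
        for seg in segments:
            best, best_overlap = "UNKNOWN", 0.0
            for turn, _, speaker in diarization.itertracks(yield_label=True):
                overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
                if overlap > best_overlap:
                    best, best_overlap = speaker, overlap
            labelled.append((best, seg["text"].strip()))
        return labelled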


Some attempts have been made but it still sucks.


I've been using Whisper to get transcripts from my local radio stations. I know it's out of scope for the original project, but I hope someone builds a streaming input around it in the future. I currently pipe in and save 10-minute chunks that get sent off for processing.
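In the meantime, a rough sketch of a "poll the chunk directory" loop; newer versions of the whisper package accept an initial_prompt argument, which helps carry a little context across the cuts (paths, chunk length and model size here are illustrative):

    # Sketch of poor man's streaming: ffmpeg writes the radio stream to chunks/
    # as short mp3 segments; this loop transcribes each new chunk and feeds the
    # tail of the previous text back in as context. In practice you'd also skip
    # the newest file, which ffmpeg may still be writing.
    import glob
    import time
    import whisper

    model = whisper.load_model("small")
    done, prev_text = set(), ""

    while True:
        for chunk in sorted(glob.glob("chunks/*.mp3")):
            if chunk in done:
                continue
            result = model.transcribe(chunk, initial_prompt=prev_text[-200:] or None)
            prev_text = result["text"]
            print(chunk, prev_text)
            done.add(chunk)
        time.sleep(30)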



The streaming I was speaking of is from a URL or network resource. I keep a directory of m3u files (each is just a URL inside that you can open in VLC etc.), so `ffmpeg -i [url] -c copy [file-name].mp3` does the trick for now. `mpv` can do this while saving to a file, but it's a nice point to start from, as I'm already getting some ideas for how I could use this or roll my own. Thanks!


whisper.cpp provides similar functionality via the `livestream.sh` script that performs transcription of a remote stream [0]. For example, you can transcribe BBC radio in 10s chunks like this:

  $ make base.en
  $ ./examples/livestream.sh http://a.files.bbci.co.uk/media/live/manifesto/audio/simulcast/hls/nonuk/sbr_low/ak/bbc_world_service.m3u8 10

  [+] Transcribing stream with model 'base.en', step_s 10 (press Ctrl+C to stop):

  Buffering audio. Please wait...

   here at the BBC in London. This is Gordon Brown, a former British Prime Minister, who since 2012 has been UN Special Envoy-
   Lemboy for global education. We were speaking just after he'd issued a rallying cry on the eve of the 2022 Football World Cup. For
   Governments around the world to pressure Afghanistan to let girls go to school. What human rights abuses are what are being discussed as we...
   run up and start and have happened and have seen the World Cup matches begin. And it's important to draw attention to one human rights abuse.
   that everyone and that includes Qatar, the UAE, the Islamic organization of countries, the Gulf Corps...
[0] https://github.com/ggerganov/whisper.cpp/blob/master/example...


Wow! Thanks for sharing. I hadn't explored the repo much beyond that link; this looks very promising. I was going to tomorrow, but I'm checking out the src and building now! The confidence color coding, by the way, is chef's kiss. Great job with this.


Is this English only, or does it support other languages? Thank you.


It's a port of Whisper and uses the original models in a different format so it supports all the languages that the original does.


I would love to do this for my local police scanner.


The codecs and narrow frequencies used on most public safety trunked radio networks are truly terrible. IMBE and AMBE+2 should never have made it past Y2K, yet Motorola and Project 25 have ensured they remain in widespread use on these networks: https://en.wikipedia.org/wiki/Project_25

If Whisper achieves 85% or higher accuracy on this audio, it would be a miracle. Garbage in, garbage out, tbh. Project 25 needs to move to a modern codec, and ideally not one maintained by a single small company with little ongoing development.


It does pretty well* on ATC. E.g. https://twitter.com/lemonodor/status/1578516727549153280 and https://twitter.com/lemonodor/status/1581354245181218816.

* Better than anything else I've seen, but certainly far from perfect.


Being a very constrained domain with its own speech rules, transcribing ATC conversation would probably benefit a lot from fine-tuning on ATC speech data.


Yeah, the tech is there for radio, and the same goes for POTS (still 8-bit/8000 Hz at its core). I do listen on the scanner from time to time, and assuming a clear signal, traditional NFM always beats digital in terms of intelligibility.

Speaking of the telephone, they definitely could put improved audio quality into the 4G and 5G specs, but they don't. A modern telephone network is all IP anyway, and backwards compatibility can be maintained.


I've done this with Google's video transcription model, which works surprisingly well.


I wonder how much training on degraded/radio encoded samples would be needed to improve performance in this area. You’re probably not the only person who wants to monitor police radios in their city.


whisper.cpp (https://github.com/ggerganov/whisper.cpp) supports streaming!


Out of curiosity, when you chunk it, what do you do about words (and maybe even sentences) that might get caught in the middle of the split?


Great question. I've not encountered it too much: I'm looking for specific stuff in the transcript based on keywords, so some loss is acceptable, and I don't always need the data immediately, so it hasn't been an issue so far. What I do to alleviate it somewhat is run "stitched passes" at set intervals.

News, betting games and some shows HAVE to happen exactly at a certain time and are very rarely late, so you can use these known times as checkpoints with some bias. Say I run `ffmpeg -i http://someexamplesite.fm/8b0hqm93yceuv -c copy -f segment -segment_time 60 -reset_timestamps 1 zip-%03d.mp3`, which automatically chunks the stream into 1-minute files, and I know the news gets read at 1pm: I can merge everything from 10am to 1pm into, say, "Segment B" and then process that. You get the idea.

I also have a later step in the pipeline that tries to summarize the transcript, so any small gaps would likely be inferred; I haven't noticed anything too wonky, and the summaries and text I get so far have been pretty clean and good enough for my needs. Once I bench this some more I'm sure I'll hit this and other interesting problems; the one I'm fighting a bit right now is ads with music and fast talking.
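The merge itself can be done losslessly with ffmpeg's concat demuxer before handing the result to Whisper; a sketch (file names and the chunk range are illustrative):

    # Sketch: merge the 1-minute chunks between two checkpoints with ffmpeg's
    # concat demuxer (no re-encode), then transcribe the merged file.
    import subprocess

    chunks = [f"zip-{i:03d}.mp3" for i in range(0, 180)]  # e.g. 10am to 1pm
    with open("list.txt", "w") as f:
        f.writelines(f"file '{c}'\n" for c in chunks)

    subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                    "-i", "list.txt", "-c", "copy", "segment-b.mp3"], check=True)
    subprocess.run(["whisper", "segment-b.mp3", "--model", "medium"], check=True)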


Wow thank you for the in depth reply! That makes a lot of sense about things happening on a cadence.


Oh geez, this is one of those technical problems that have so many possible good answers that it'd make a great interview question.


Have you tried https://webcaptioner.com/ ?


I came up with this specifically because of Whisper. This looks nice but doesn't suit my use case, as I'm "listening" to over a dozen stations at once, and I also stitch the chunks together at the end of the day into one fat wav file.


This but for automation of calling in for prizes or entering keywords on station websites.


What processing / server / backend are you using to run the whisper model?


Stupid simple setup until I streamline the process some more: just running on my PC (i7 8700K, GTX 1080), using ffmpeg to copy the stream to disk and a simple Python script that creates the chunks and transcripts.


Does anyone know if this new model handles silence better? I was trying to use whisper for transcribing bursts of talking amid large spans of silence, but the frequency of hallucinations was too high.


I suspect a simple solution is to remove the silence as a pre-processing step in the pipeline.


In large-scale tests, I observed hallucinations from Whisper even in speech regions of audio.


Sure, but that should be considered an accuracy problem. Telling a system to do its best to extract words from background sounds, and then getting words from it, is a different type of problem.

-------

I can't reply to the below, but you have to consider the difference in the signal to noise ratio for why it should be considered a different problem.

If I told a binary image classifier to classify a clear image of a cat as either a "cat" or a "dog", and it said "dog", then that would be an accuracy problem.

If I gave the same classifier an image of a black cat standing in a very dark room, where even a human would have trouble identifying it, and it says "dog" it's not an accuracy problem as much as a signal to noise ratio problem.

It seems like you're making the assumption that all of the issues you describe have the same root cause. I don't think that's a sound assumption... tehe.


Still important for future use to not have invalid results. This is a workaround for now


You don't need ML to trim out silence


Silence is often problem dependent... You may want ML to differentiate between noisy audio with speech and noisy audio without speech.


"Silence" is a problematic term. For me, that word encompasses: squeaky chairs, typing on a loud keyboard, moving objects around on my table, etc. In a perfect world, Whisper —like a human— can easily distinguish a human voice from the din of my office, and only try and transcribe my voice.

Does anyone have solutions for clearing out "silence" from an audio file that works off something a bit more accurate than just "<= decibel x"?

Edited for grammar.
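One option that goes beyond a plain decibel threshold is a dedicated voice activity detection model such as Silero VAD, which is trained to separate speech from exactly that kind of background din. A rough sketch of how it can slot in front of Whisper (16 kHz mono assumed, thresholds left at the defaults):

    # Sketch: keep only the regions Silero VAD flags as speech before Whisper.
    import torch

    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, save_audio, read_audio, _, collect_chunks = utils

    wav = read_audio("input.wav", sampling_rate=16000)
    speech = get_speech_timestamps(wav, model, sampling_rate=16000)
    # `speech` is a list of {"start": sample, "end": sample} dicts; either use
    # them as clip boundaries or write out the concatenated speech-only audio:
    save_audio("speech_only.wav", collect_chunks(speech, wav), sampling_rate=16000)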


There's a 2-week Whisper Fine-Tuning event put on by Hugging Face going on right now [1]. The new "large-v2" model was announced as part of the kickoff [2]. It's supposed to offer 5% - 10% improvements over the previous large model.

[1] https://discuss.huggingface.co/t/open-to-the-community-whisp... [2] https://youtu.be/fZMiD8sDzzg?t=1226


Nice catch. I'll run my test suite [1] on this and report back.

[1] https://twitter.com/lunixbochs/status/1574848899897884672


Would be cool if you could run these speech models in tandem, and compute a new error-rate for the consensus of them.


Can you elaborate on how you see this working?


Run the audio through 3 models; when one model's output differs from the other two, throw it away.

Problems abound with that approach, but I could envision it.
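A crude sketch of that idea, using WER between the hypotheses themselves as the disagreement measure (jiwer is just one convenient way to compute WER; model sizes and file name are placeholders):

    # Sketch: transcribe with three model sizes, then drop whichever hypothesis
    # disagrees most with the other two (highest mean WER against them).
    import jiwer
    import whisper

    hyps = {size: whisper.load_model(size).transcribe("audio.wav")["text"]
            for size in ("tiny", "base", "small")}

    def mean_disagreement(name):
        others = [text for n, text in hyps.items() if n != name]
        return sum(jiwer.wer(o, hyps[name]) for o in others) / len(others)

    outlier = max(hyps, key=mean_disagreement)
    kept = {n: t for n, t in hyps.items() if n != outlier}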


What would you use in production?


I've just tested it quickly, large-v2 is actually slightly less accurate than v1. So pick the model carefully.


I don’t understand. Why is v2 worse than v1?


Other people are reporting accuracy issues too:

https://github.com/openai/whisper/discussions/657


Hard to guess. We test on a variety of datasets. The new version has more hallucinations at the end of the audio.


I have been using Whisper to transcribe my audio notes. I just save my voice memo from my phone to my NAS and my little script does the rest on a loop.
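For anyone wanting to replicate it, such a loop can be as small as this sketch (paths, model size and polling interval are placeholders, not necessarily what the parent runs):

    # Sketch: watch a NAS folder for new voice memos and write a .txt next to each.
    import pathlib
    import time
    import whisper

    memos = pathlib.Path("/nas/voice-memos")
    model = whisper.load_model("small")

    while True:
        for audio in memos.glob("*.m4a"):
            txt = audio.with_suffix(".txt")
            if not txt.exists():
                txt.write_text(model.transcribe(str(audio))["text"])
        time.sleep(60)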


How do you handle connectivity from outside your home network (when you're at the grocery store for example)? Do you have a VPN running?


Not OP but regularly access my home network remotely. I have an Asus RT-AX86U which has a lot of features in the stock firmware, including OpenVPN configuration. If I need "LAN" access to my home network, I will open the OpenVPN client and then it will just work, this is useful to access things like local samba shares or SSH into my Linux machines (I was dissatisfied with the log spam of having an open SSH port even with fail2ban and router level protections).

If you have an always-running server and only need access via a browser, forwarding port 80/443 to a reverse proxy works well. Dynamic DNS can be used if you don't have a static IP address. Alternatively, if you are more security conscious and don't mind the reliance on Cloudflare, you can use Cloudflare Tunnel, which won't expose ports from your local network at all. It can also be combined with a third-party authentication mechanism.


Not OP but I use syncthing for such things.


Also not OP but I use tailscale for this. It's fantastically easy.


Tailscale + syncthing, works like a charm


Hey there, can you share your script? I'm a little lazy haha


It seems like Large-V2 is a huge improvement when transcribing Japanese. I tested it on "Macross Frontier - the Movie", and it no longer breaks after 8 minutes as before:

Large-V1 (transcribed at 2022-10-02):

* SRT: https://pastes.io/yrggofqhof

* Transcript: https://pastes.io/rtp9buhsm0

Large-V2 (latest version at 2022-12-07):

* SRT - https://pastes.io/uiqblpw1qk

* Transcript - https://pastes.io/rheqgnftzl

There's still some timing issues after a period of silence, but using a VAD as a workaround (like I do in my WebUI) may no longer be strictly necessary:

* https://github.com/openai/whisper/discussions/397


So, in general and from the uninformed outside, it seems like OpenAI (120 employees?) is outperforming Alphabet (187k employees). Is there any truth to that notion?


It's apples and oranges imo. Google owns Waymo, which is hitting the streets in SF this year with autonomous taxis. That's huge if it goes well. There are of course other non-public models Google has developed, like Imagen (a diffusion model) and LaMDA (the one a Google engineer became convinced was sentient). So I wouldn't discount Google at all.

In terms of granting direct access to machine learning models, OpenAI has Google beat. But that’s not Google’s business model. And it remains to be seen if it’s a viable business model at all.


Google has models that are at parity with or outperform OpenAIs models at most tasks, they just don’t bring them to market quickly or build their brand around them as strongly as OpenAI does.

See Imagen for example - https://imagen.research.google/

If Google is to be believed, this outperforms Dall-E - and I’ve heard from people that use it that in general, it does perform better than Dall-E.


Doesn't do them any good to do nothing with it.

I don't get Google's brand or leadership anymore. They're staring disruption to their core revenue stream in the face and acting like the "everything is fine" dog.

Meanwhile Microsoft is playing dimensional chess across multiple industries and key developing areas of research.


It’s quite simple, actually. State of the art research gets attention and attracts top tier researchers. As long as they’re working for Google they’re not working for Bing. The research is sometimes useful but even if it’s not it is red meat for their researchers.

However although the models of today are interesting, they’re not consistent enough for many business applications. It’s not good enough to generate realistic cartoon characters most of the time but sometimes they randomly have 4 arms. Or you have a language model that can say interesting things but can’t be relied upon for basic facts. Maybe some day we will be there but not today.


It's very possible they are doing something with them, just not the sort of thing OpenAI is doing. Google could have a model just as capable as Whisper running in their Google Cloud or Assistant transcription service - but they might have upgraded from their previous model without any announcement or blog post.

Are you saying they should have more bombast in their product announcements?


> If Google is to be believed, this outperforms Dall-E - and I’ve heard from people that use it that in general, it does perform better than Dall-E.

Google actually surpassed its own model[0], first with Parti[1], which unlike DALL-E could even correctly insert text in the image, then with Imagen Video[2], which does as the name implies.

[0]: https://paperswithcode.com/sota/text-to-image-generation-on-...

[1]: https://parti.research.google/

[2]: https://imagen.research.google/video/


Google makes more money than OpenAI, so I wouldn't say that.

Also, OpenAI is basically a subsidiary of Microsoft at this point.


Keep in mind that Google also works on products other than AI (e.g. Android), and that the best model is not necessarily the best for the business. E.g. it might be too costly to operate certain models at Google's scale, so they would need to pick a compromise between quality of results, runtime cost and scalability.


Outperforming in what sense? AI model capabilities?


It could well be that 187k employees is a disadvantage, not just economically, but because every project has too many cooks in the kitchen, so to speak, and too much politics.


A tiny fraction of that 187k employees is working on AI. Total employee count is a silly metric.


Has there been any work on a model that takes in speech and outputs phonemes (e.g. in IPA or an English respelling), similar to what Google's speech pronunciation tool does when it shows "Sounds like you said ...."?


If I just want to play with this as a toy, without any performance considerations, is it possible to run it on a CPU only? Or do I need to faff about with getting CUDA and that nonsense running?


You probably want to check this port : https://github.com/ggerganov/whisper.cpp


That indeed seems perfect.


Yes, it runs well enough on CPU. https://modal-labs-whisper-pod-transcriber-fastapi-app.modal... runs on CPU, and chunks up podcast audio to parallelize over dozens of CPUs.


It'll run on CPU in decent time; it also depends on the kind of CPU you have and the model you pick. On my M1, the base model can transcribe 5-10 s of audio in about as long.

I uploaded it to a hugging face space if you just want to fiddle with it without installing

https://huggingface.co/spaces/jerpint/babelfish


Yes, it runs fine on your computer.
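For reference, a minimal CPU-only invocation with the reference openai-whisper pip package looks roughly like this (clip.mp3 is a placeholder; fp16=False just silences the half-precision warning on CPU):

    # Minimal CPU-only run; requires ffmpeg on PATH.
    import whisper

    model = whisper.load_model("base", device="cpu")
    result = model.transcribe("clip.mp3", fp16=False)
    print(result["text"])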


How can the Word Error Rate (WER) be larger than 100% for some languages?


If the target is the words "A A A" and you produce "B B B B", you have more errors than there were words in the target. 3 replacements and 1 insertion.
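Concretely, WER = (substitutions + deletions + insertions) / number of reference words, so this example is (3 + 0 + 1) / 3 ≈ 133%. A quick sanity check (jiwer used purely for illustration):

    import jiwer
    print(jiwer.wer("A A A", "B B B B"))  # 1.33..., i.e. 133% WER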


I wonder what the correlation between consistency of written/spoken language is for the breakdown per language here: https://github.com/openai/whisper/blob/main/language-breakdo...

For instance, I know Turkish is very consistent: its writing system was overhauled in 1928, a few years after the founding of the Turkish Republic. Turkish is quite high in the rankings, and I don't think that's because there's loads of data available, but because of its consistency. Conversely, English has loads of data, which should compensate for its inconsistency.


How does this benchmark against AWS [0] offering?

[0] https://aws.amazon.com/transcribe/


I have done some tests for a hobby project (transcribing local fire station radio dispatch), and Whisper is incomparably better than AWS transcribe.

Radio dispatch is very hard to understand, low quality audio. AWS transcribe was essentially useless, incomprehensible transcriptions. Whisper was 95+% accurate, with maybe 1 incorrect word per few sentences, and it was often easy to tell what the correct word would be.


Interesting, that's good to know. My personal experience was that Transcribe handled accents very poorly, but if Whisper can handle radio static, maybe it can also decipher a thick Scottish accent.


On the original blog post announcing Whisper there was a demo ("Accent" under the drop-down) of it transcribing a Scottish accent. https://openai.com/blog/whisper/


Whisper has done extremely well for me with a number of challenging audio transcriptions: background noise, heavy accents, music playing... you name it.


Is there a reason why Spanish and Italian have a lower WER than English? OpenAI is based in America right? They probably focused on English more than anything.


English is the worst. “We don't just borrow words; on occasion, English has pursued other languages down alleyways to beat them unconscious and rifle their pockets for new vocabulary.”

I could believe that other languages with more regular grammar might be easier to parse. Or maybe Romance languages lean on more easily distinguished phonemes.


I like that quote.

English also tends to take delight in stealing words from other languages, and then using them in the wrong way. Seemingly in an effort to drive speakers of the other language up the wall.


Wonderful quote.


Spanish and Italian are very easy to recognize due to very simple phonetic structure.


WER is really dataset-dependent, you can’t really compare across different test sets unless they’re almost identical (acoustic environment, microphone quality, speaker profiles, linguistic domain, etc).

WER for the exact same model can vary wildly between datasets, it’s really only useful for comparing model performance on a single test set.


Interesting to see there's work going into M{1,2} support rather than just CPU. Looks like there's an issue with the buffer lengths with pytorch M1 support atm but looks fixable in future pytorch releases.


Off topic: what’s OpenAI revenue model?

(I’m under the impression they don’t charge)


https://openai.com/api/pricing/ (ChatGPT is a free preview right now.)


They charge


Have there been any efforts to integrate the Whisper model into a more advanced engine like Kaldi, to get the most out of this wonderful model?


I've only run this on my CPU, using the tiny.en model.

Has anyone run tiny.en on a T4 GPU (AWS g4dn, iirc)? What's the speed-up?


For revoldiv.com we have profiled many GPUs; the best one is the 4090. We do a lot of intelligent chunking, detect word boundaries, and run the model in parallel on multiple GPUs, which gets us about 40 to 50 seconds for an hour-long audio. Without that, expect about 7 minutes for an hour of audio on a Tesla T4:

  Tesla T4 (30 GB memory, 8 vCPU, Google Cloud)
   tiny / tiny.en   10 min -> 30 s
   medium           10 min -> 1 min 30 s    60 min -> 7 min
   large                                    60 min -> 13 min
  NVIDIA GeForce RTX 4090
   tiny             10 min -> 5.5 s         60 min -> 35 s
   base             10 min -> 7 s           60 min -> 50 s
   small            10 min -> 14 s          60 min -> 1 min 35 s
   medium           10 min -> 26 s          60 min -> 3 min
   large            10 min -> 40 s          60 min -> 3 min 54 s
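For anyone wanting to reproduce this kind of multi-GPU parallelism, the basic shape is one worker process per GPU, each loading its own model and pulling pre-split chunks from a queue (a rough sketch, not revoldiv's actual pipeline; chunk names and GPU count are placeholders):

    # Sketch: one Whisper instance per GPU, chunks distributed via a queue.
    import multiprocessing as mp
    import whisper

    def worker(gpu_id, jobs, results):
        model = whisper.load_model("medium", device=f"cuda:{gpu_id}")
        while True:
            chunk = jobs.get()
            if chunk is None:          # sentinel: no more work
                break
            results.put((chunk, model.transcribe(chunk)["text"]))

    if __name__ == "__main__":
        mp.set_start_method("spawn")   # avoids CUDA-with-fork issues
        chunks = [f"chunk-{i:03d}.wav" for i in range(24)]  # pre-split audio
        n_gpus = 2
        jobs, results = mp.Queue(), mp.Queue()
        for c in chunks:
            jobs.put(c)
        for _ in range(n_gpus):
            jobs.put(None)
        procs = [mp.Process(target=worker, args=(g, jobs, results)) for g in range(n_gpus)]
        for p in procs:
            p.start()
        transcripts = dict(results.get() for _ in chunks)  # drain before joining
        for p in procs:
            p.join()
        for name in sorted(transcripts):
            print(name, transcripts[name][:80])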


That's crazy - 35 seconds for 60 min on base with a 4090, wow! Thanks for the info! Also, btw, I mentioned on another thread that I was getting an error. Are you planning on offering this as a paid API?


Can you send me the audio that caused it? You can email me at team AT revoldiv .com. If there turns out to be a lot of interest, yes, we can provide it as an API service. Our service has some niceties like word-level timestamps, paragraph separation, sound detection, etc. For now it is a free service you can use as much as you want.


The new large model is garbage at translation (ja) compared to most of the other models.


Did they just commit straight to the main branch?


They probably have an internal repo which gets the commits first, then synced to github.


Why not? Should they have committed to, like, a dev branch?


Looks like they plugged GPT-4 AI into speech recognition research and now they are going to release huge updates every month.


What is the basis for this claim? How does their GPT-4 work intersect with the work on their ASR model?


It's a joke... They're saying GPT-4 is doing the research now.


Easily: you can ask it questions about speech recognition system design. For example, you can ask ChatGPT about upsampling vs. downsampling and it gives a correct answer. It can also correctly recommend an architecture for streaming speech recognition.



