OpenAI quietly launched Whisper V2 in a GitHub commit (github.com/openai)
335 points by fudged71 on Dec 6, 2022 | 126 comments



"Whisper V2" is a bit of an exaggeration. It's

* only an update to the `large` model (and therefore irrelevant for hobbyist, low CPU use cases)

* architecturally identical to the original whisper model

* just trained on more epochs


Agreed. It's maybe like the Linux versioning scheme, where 6.0 is just the one that comes after 5.19:

> The major version number is incremented when the number after the dot starts looking "too big." There is literally no other reason. [1]

[1] https://www.kernel.org/category/releases.html


The large model can run on consumer GPUs, and I'd argue it's more relevant to hobbyist use cases: I'm not working out the efficiency of serving thousands of users, just transcribing my own videos.


Yeah, the commit is literally "add large-v2 model", nothing about the software proper.


> * only an update to the `large` model (and therefore irrelevant for hobbyist, low CPU use cases)

This is maybe the most important thing there is for a lot of use cases, since additional training is just about impossible for most people to do themselves.


Is there an easy-to-install speech engine (training and synthesizing) for Windows? So the reverse of recognition?


I recently swapped out the AI model for voice transcription on revoldiv.com and replaced it with Whisper. The results have been truly impressive - even the smaller models outperform and generalize better than any other options on the market. If you want to give it a try, our model is capable of faster transcription by utilizing multiple GPUs and some other enhancements, and it is all free


Your app is very similar to our demo app! https://modal-labs-whisper-pod-transcriber-fastapi-app.modal...

How come you don't support audio files longer than 1hr? Is it because of $$ cost?

The above demo app gets faster transcription by chunking audio and parallelizing over dozens of CPUs, so you can transcribe about 1 hr of audio for $0.10.


> https://modal-labs-whisper-pod-transcriber-fastapi-app.modal...

Interesting, which model are you using? We use the medium model, which is the sweet spot for the time/performance ratio. We also chunk: we try to detect words and silences to do better chunking at word boundaries, but if you do more chunking and don't get the word boundaries right, Whisper seems to lose some context and accuracy suffers. We will soon support longer audio; we just want to make sure transcription wait times don't suffer for most users. Great demo, reach out to me if you want to collaborate.
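The general idea of silence-aware splitting can be sketched with ffmpeg's silencedetect filter (a rough illustration, not necessarily how revoldiv does it; the noise threshold, minimum silence and target chunk length are made up):

    # Rough illustration: find silences with ffmpeg's silencedetect filter and
    # use them as candidate split points so cuts don't land mid-word.
    import re
    import subprocess

    def find_silences(path, noise_db=-35, min_silence=0.4):
        """Return (silence_start, silence_end) pairs, in seconds."""
        cmd = ["ffmpeg", "-i", path,
               "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
               "-f", "null", "-"]
        err = subprocess.run(cmd, capture_output=True, text=True).stderr
        starts = [float(x) for x in re.findall(r"silence_start: ([\d.]+)", err)]
        ends = [float(x) for x in re.findall(r"silence_end: ([\d.]+)", err)]
        return list(zip(starts, ends))

    def split_points(silences, target_s=300.0):
        """Aim for a cut roughly every target_s seconds, snapped to a silence."""
        points, next_cut = [], target_s
        for start, end in silences:
            if start >= next_cut:
                points.append((start + end) / 2)  # cut in the middle of the silence
                next_cut = points[-1] + target_s
        return points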


We’re using base. The code is open source at modal-labs/model-examples repo if you want to see anything we’re doing


I'd love to try it out, who should I ping to run this locally on my GPU server?


It's not a model you can run on your own server but a free service on revoldiv.com. You can expect 40 to 50 second wait time to transcribe an hour long video/audio. We combine whisper with our model to get word level timestamps, paragraph separation and sound detections like laughter, music etc... We recently added very basic podcast search and transcription.


I'm getting an unknown error, error code 551.


Has anyone combined Whisper with a model that can assign owners to voices yet?

For example, instead of:

    Hello there!
    Hi 
    How are you?
    Good, and you?
You get something like:

    Voice A: Hello there!
    Voice B: Hi 
    Voice A: How are you?
    Voice B: Good, and you?


This is called speaker diarization, basically one of the 3 components of speaker recognition (verification, identification, diarization).

You can do this pretty conveniently using pyannote-audio[0].

Coincidentally I did a small presentation on this at a university seminar yesterday :). I could post a Jupyter notebook if you're interested.

PS: Bai & Zhang (2020) is a great review on the literature [1]

[0] https://github.com/pyannote/pyannote-audio

[1] https://arxiv.org/abs/2012.00931
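If you don't want to wait for the notebook, a minimal sketch of the pyannote.audio route looks roughly like this (2.x API; "audio.wav" and the Hugging Face token are placeholders, and the exact pretrained pipeline name may differ by version):

    # Minimal diarization sketch with pyannote.audio.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                        use_auth_token="hf_...")  # placeholder token
    diarization = pipeline("audio.wav")

    # Each item is a speech turn with start/end times and a speaker label.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:6.1f}s  {turn.end:6.1f}s  {speaker}")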


Yes please, posting a Jupyter notebook would be a great help.


yes please!


I modified a Jupyter notebook that’s been going around to do this and have nice Markdown output.

This is the best AI diarization and transcription I’ve been able to get so far: https://github.com/zachlatta/openai-whisper-speaker-identifi...


Not whisper, but:

> The Google Recorder app (...) transcribes meetings and interviews to text, instantly giving you a searchable transcription that is synced to the recorded audio (...) With this new update, the recorder can now identify and label each speaker automatically—an impressive feat. Google Recorder is exclusive to the Pixel 6 and newer Pixel devices.

https://arstechnica.com/gadgets/2022/12/pixel-7-update-adds-...


That's awesome. I wonder if there's a way to upload existing recordings to Google Recorder to get the transcriptions.


I'd be down to code this up if anyone's interested. Contact is in my profile.


This is called "diarization"


Not that I know of, but I use Descript for content creation and it offers something like what you're describing: they have automatic speaker labels built into their transcription service.

Would be lovely to see this feature open sourced.


There is a discussion on the Whisper github page called something like “diarization” which details a few attempts to attain this functionality with additional tools.


I've had limited success combining the results of pyannote with Whisper transcripts (based on timestamps).

It’s ok, but the quality of speaker identification is nowhere near as good as the transcription itself.

I'd love to see models which try to use stereo information in recordings to solve the problem. Or, given a fixed camera and static speakers, it should even be possible to use video to add information about who is speaking. There doesn't seem to be anything like that right now, though.
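The timestamp matching itself can be as simple as giving each Whisper segment the speaker whose diarization turn overlaps it most. A rough sketch (assumes result["segments"] from model.transcribe() and an annotation from a pyannote pipeline; it won't handle overlapping speech well):

    # Sketch: assign each Whisper segment the speaker whose diarization turn
    # overlaps it the most.
    def assign_speakers(segments, diarization):
        labelled = []
        for seg in segments:
            best, best_overlap = "UNKNOWN", 0.0
            for turn, _, speaker in diarization.itertracks(yield_label=True):
                overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
                if overlap > best_overlap:
                    best, best_overlap = speaker, overlap
            labelled.append((best, seg["text"].strip()))
        return labelled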


Some attempts have been made but it still sucks.


I've been using Whisper to get transcripts from my local radio stations. I know it's out of scope for the original project, but I hope someone builds a streaming input around it in the future. I currently pipe in and save 10-minute chunks that get sent off for processing.
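In the meantime, a rough sketch of a "poll the chunk directory" loop; newer versions of the whisper package accept an initial_prompt argument, which helps carry a little context across the cuts (paths, chunk length and model size here are illustrative):

    # Sketch of poor man's streaming: ffmpeg writes the radio stream to chunks/
    # as short mp3 segments; this loop transcribes each new chunk and feeds the
    # tail of the previous text back in as context. In practice you'd also skip
    # the newest file, which ffmpeg may still be writing.
    import glob
    import time
    import whisper

    model = whisper.load_model("small")
    done, prev_text = set(), ""

    while True:
        for chunk in sorted(glob.glob("chunks/*.mp3")):
            if chunk in done:
                continue
            result = model.transcribe(chunk, initial_prompt=prev_text[-200:] or None)
            prev_text = result["text"]
            print(chunk, prev_text)
            done.add(chunk)
        time.sleep(30)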



The streaming I was speaking of is from a URL or network resource. I keep a directory of m3u files (each is just a URL inside that you can open in VLC etc.), so `ffmpeg -i [url] -c copy [file-name].mp3` does the trick for now. `mpv` can do this while saving to a file, but it's a nice point to start from, as I'm already getting some ideas for how I could use this or roll my own. Thanks!


whisper.cpp provides similar functionality via the `livestream.sh` script that performs transcription of a remote stream [0]. For example, you can transcribe BBC radio in 10s chunks like this:

  $ make base.en
  $ ./examples/livestream.sh http://a.files.bbci.co.uk/media/live/manifesto/audio/simulcast/hls/nonuk/sbr_low/ak/bbc_world_service.m3u8 10

  [+] Transcribing stream with model 'base.en', step_s 10 (press Ctrl+C to stop):

  Buffering audio. Please wait...

   here at the BBC in London. This is Gordon Brown, a former British Prime Minister, who since 2012 has been UN Special Envoy-
   Lemboy for global education. We were speaking just after he'd issued a rallying cry on the eve of the 2022 Football World Cup. For
   Governments around the world to pressure Afghanistan to let girls go to school. What human rights abuses are what are being discussed as we...
   run up and start and have happened and have seen the World Cup matches begin. And it's important to draw attention to one human rights abuse.
   that everyone and that includes Qatar, the UAE, the Islamic organization of countries, the Gulf Corps...
[0] https://github.com/ggerganov/whisper.cpp/blob/master/example...


Wow! Thanks for sharing. I hadn't explored the repo much beyond that link; this looks very promising. I was going to tomorrow, but I'm checking out the src and building now! The confidence color coding, by the way, is chef's kiss. Great job with this.


Is this English only, or does it support other languages? Thank you.


It's a port of Whisper and uses the original models in a different format so it supports all the languages that the original does.


I would love to do this for my local police scanner.


The codecs and narrow frequencies used on most public safety trunked radio networks are truly terrible. IMBE and AMBE+2 should never have made it past Y2K, yet Motorola and Project 25 have ensured they remain in widespread use on these networks: https://en.wikipedia.org/wiki/Project_25

If Whisper achieves 85% or higher accuracy on this audio, it would be a miracle. Garbage in, garbage out, tbh. Project 25 needs to move to a modern codec, and ideally not one maintained by a single small company with little ongoing development.


It does pretty well* on ATC. E.g. https://twitter.com/lemonodor/status/1578516727549153280 and https://twitter.com/lemonodor/status/1581354245181218816.

* Better than anything else I've seen, but certainly far from perfect.


Being a very constrained domain with its own speech rules, transcribing ATC conversation would probably benefit a lot from fine-tuning on ATC speech data.


Yeah, the tech is there for radio, and the same goes for POTS (still 8-bit/8000 Hz at its core). I do listen on the scanner from time to time, and assuming a clear signal, traditional NFM always beats digital in terms of intelligibility.

Speaking of the telephone, they definitely could put improved audio quality into the 4G and 5G specs, but they don't. A modern telephone network is all IP anyway, and backwards compatibility can be maintained.


I've done this with Google's video transcription model, which works surprisingly well.


I wonder how much training on degraded/radio encoded samples would be needed to improve performance in this area. You’re probably not the only person who wants to monitor police radios in their city.


whisper.cpp (https://github.com/ggerganov/whisper.cpp) supports streaming!


Out of curiosity, when you chunk it, what do you do about words (and maybe even sentences) that might get caught in the middle of the split?


Great question. I've not encountered it too much: I'm looking for specific stuff in the transcript based on keywords, so some loss is acceptable, and I don't always need the data immediately, so it hasn't been an issue so far. What I do to alleviate it somewhat is run "stitched passes" at set intervals.

News, betting games and some shows HAVE to happen exactly at a certain time and are very rarely late, so you can use these known times as checkpoints with some bias. Say I run `ffmpeg -i http://someexamplesite.fm/8b0hqm93yceuv -c copy -f segment -segment_time 60 -reset_timestamps 1 zip-%03d.mp3`, which automatically chunks the stream into 1-minute files, and I know the news gets read at 1pm: I can merge everything from 10am to 1pm into, say, "Segment B" and then process that. You get the idea.

I also have a later step in the pipeline that tries to summarize the transcript, so any small gaps would likely be inferred; I haven't noticed anything too wonky, and the summaries and text I get so far have been pretty clean and good enough for my needs. Once I bench this some more I'm sure I'll hit this and other interesting problems; the one I'm fighting a bit right now is ads with music and fast talking.
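The merge itself can be done losslessly with ffmpeg's concat demuxer before handing the result to Whisper; a sketch (file names and the chunk range are illustrative):

    # Sketch: merge the 1-minute chunks between two checkpoints with ffmpeg's
    # concat demuxer (no re-encode), then transcribe the merged file.
    import subprocess

    chunks = [f"zip-{i:03d}.mp3" for i in range(0, 180)]  # e.g. 10am to 1pm
    with open("list.txt", "w") as f:
        f.writelines(f"file '{c}'\n" for c in chunks)

    subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                    "-i", "list.txt", "-c", "copy", "segment-b.mp3"], check=True)
    subprocess.run(["whisper", "segment-b.mp3", "--model", "medium"], check=True)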


Wow thank you for the in depth reply! That makes a lot of sense about things happening on a cadence.


Oh geez, this is one of those technical problems that have so many possible good answers that it'd make a great interview question.


Have you tried https://webcaptioner.com/ ?


I came up with this specifically because of Whisper. This looks nice but doesn't suit my use case, as I'm "listening" to over a dozen stations at once, and I also stitch the chunks together at the end of the day into one fat wav file.


This but for automation of calling in for prizes or entering keywords on station websites.


What processing / server / backend are you using to run the whisper model?


Stupid simple setup until I streamline the process some more: just running on my PC (i7 8700K, GTX 1080), using ffmpeg to copy the stream to disk and a simple Python script that creates the chunks and transcripts.


Does anyone know if this new model handles silence better? I was trying to use whisper for transcribing bursts of talking amid large spans of silence, but the frequency of hallucinations was too high.


I suspect a simple solution is to remove the silence as a pre-processing step in the pipeline.


In large-scale tests, I observed hallucinations from Whisper even in speech regions of audio.


Sure, but that should be considered an accuracy problem. Telling a system to do its best to extract words from background sounds, and then getting words from it, is a different type of problem.

-------

I can't reply to the below, but you have to consider the difference in the signal to noise ratio for why it should be considered a different problem.

If I told a binary image classifier to classify a clear image of a cat as either a "cat" or a "dog", and it said "dog", then that would be an accuracy problem.

If I gave the same classifier an image of a black cat standing in a very dark room, where even a human would have trouble identifying it, and it says "dog" it's not an accuracy problem as much as a signal to noise ratio problem.

It seems like you're making the assumption that all of the issues you describe have the same root cause. I don't think that's a sound assumption... tehe.


Still important for future use to not have invalid results. This is a workaround for now


You don't need ML to trim out silence


Silence is often problem dependent... You may want ML to differentiate between noisy audio with speech and noisy audio without speech.


"Silence" is a problematic term. For me, that word encompasses: squeaky chairs, typing on a loud keyboard, moving objects around on my table, etc. In a perfect world, Whisper —like a human— can easily distinguish a human voice from the din of my office, and only try and transcribe my voice.

Does anyone have solutions for clearing out "silence" from an audio file that works off something a bit more accurate than just "<= decibel x"?

Edited for grammar.
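One option that goes beyond a plain decibel threshold is a dedicated voice activity detection model such as Silero VAD, which is trained to separate speech from exactly that kind of background din. A rough sketch of how it can slot in front of Whisper (16 kHz mono assumed, thresholds left at the defaults):

    # Sketch: keep only the regions Silero VAD flags as speech before Whisper.
    import torch

    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, save_audio, read_audio, _, collect_chunks = utils

    wav = read_audio("input.wav", sampling_rate=16000)
    speech = get_speech_timestamps(wav, model, sampling_rate=16000)
    # `speech` is a list of {"start": sample, "end": sample} dicts; either use
    # them as clip boundaries or write out the concatenated speech-only audio:
    save_audio("speech_only.wav", collect_chunks(speech, wav), sampling_rate=16000)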


There's a 2-week Whisper Fine-Tuning event put on by Hugging Face going on right now [1]. The new "large-v2" model was announced as part of the kickoff [2]. It's supposed to offer 5% - 10% improvements over the previous large model.

[1] https://discuss.huggingface.co/t/open-to-the-community-whisp... [2] https://youtu.be/fZMiD8sDzzg?t=1226


Nice catch. I'll run my test suite [1] on this and report back.

[1] https://twitter.com/lunixbochs/status/1574848899897884672


Would be cool if you could run these speech models in tandem, and compute a new error-rate for the consensus of them.


Can you elaborate on how you see this working?


Run the audio through 3 models; when one model's output differs from the other two, throw it away.

Problems abound with that approach, but I could envision it.
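A crude sketch of that idea, using WER between the hypotheses themselves as the disagreement measure (jiwer is just one convenient way to compute WER; model sizes and file name are placeholders):

    # Sketch: transcribe with three model sizes, then drop whichever hypothesis
    # disagrees most with the other two (highest mean WER against them).
    import jiwer
    import whisper

    hyps = {size: whisper.load_model(size).transcribe("audio.wav")["text"]
            for size in ("tiny", "base", "small")}

    def mean_disagreement(name):
        others = [text for n, text in hyps.items() if n != name]
        return sum(jiwer.wer(o, hyps[name]) for o in others) / len(others)

    outlier = max(hyps, key=mean_disagreement)
    kept = {n: t for n, t in hyps.items() if n != outlier}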


What would you use in production?


I've just tested it quickly, large-v2 is actually slightly less accurate than v1. So pick the model carefully.


I don’t understand. Why is v2 worse than v1?


Other people are reporting accuracy issues too:

https://github.com/openai/whisper/discussions/657


Hard to guess. We test on a variety of datasets. The new version has more hallucinations at the end of the audio.


I have been using Whisper to transcribe my audio notes. I just save my voice memo from my phone to my NAS and my little script does the rest on a loop.
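For anyone wanting to replicate it, such a loop can be as small as this sketch (paths, model size and polling interval are placeholders, not necessarily what the parent runs):

    # Sketch: watch a NAS folder for new voice memos and write a .txt next to each.
    import pathlib
    import time
    import whisper

    memos = pathlib.Path("/nas/voice-memos")
    model = whisper.load_model("small")

    while True:
        for audio in memos.glob("*.m4a"):
            txt = audio.with_suffix(".txt")
            if not txt.exists():
                txt.write_text(model.transcribe(str(audio))["text"])
        time.sleep(60)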


How do you handle connectivity from outside your home network (when you're at the grocery store for example)? Do you have a VPN running?


Not OP but regularly access my home network remotely. I have an Asus RT-AX86U which has a lot of features in the stock firmware, including OpenVPN configuration. If I need "LAN" access to my home network, I will open the OpenVPN client and then it will just work, this is useful to access things like local samba shares or SSH into my Linux machines (I was dissatisfied with the log spam of having an open SSH port even with fail2ban and router level protections).

If you have an always-running server and only need access via a browser, forwarding port 80/443 to a reverse proxy works well. Dynamic DNS can be used if you don't have a static IP address. Alternatively, if you are more security conscious and don't mind the reliance on Cloudflare, you can use Cloudflare Tunnel, which won't expose ports from your local network at all. It can also be combined with a third-party authentication mechanism.


Not OP but I use syncthing for such things.


Also not OP but I use tailscale for this. It's fantastically easy.


Tailscale + syncthing, works like a charm


Hey there, can you share your script? I'm a little lazy haha


It seems like Large-V2 is a huge improvement when transcribing Japanese. I tested it on "Macross Frontier - the Movie", and it no longer breaks after 8 minutes as before:

Large-V1 (transcribed at 2022-10-02):

* SRT: https://pastes.io/yrggofqhof

* Transcript: https://pastes.io/rtp9buhsm0

Large-V2 (latest version at 2022-12-07):

* SRT - https://pastes.io/uiqblpw1qk

* Transcript - https://pastes.io/rheqgnftzl

There's still some timing issues after a period of silence, but using a VAD as a workaround (like I do in my WebUI) may no longer be strictly necessary:

* https://github.com/openai/whisper/discussions/397


So, in general and from the uninformed outside, it seems like OpenAI (120 employees?) is outperforming Alphabet (187k employees). Is there any truth to that notion?


It's apples and oranges imo. Google owns Waymo, which is hitting the streets in SF this year with autonomous taxis. That's huge if it goes well. There are of course other non-public models Google has developed, like Imagen (a diffusion model) and LaMDA (the one a Google engineer became convinced was sentient). So I wouldn't discount Google at all.

In terms of granting direct access to machine learning models, OpenAI has Google beat. But that’s not Google’s business model. And it remains to be seen if it’s a viable business model at all.


Google has models that are at parity with or outperform OpenAIs models at most tasks, they just don’t bring them to market quickly or build their brand around them as strongly as OpenAI does.

See Imagen for example - https://imagen.research.google/

If Google is to be believed, this outperforms Dall-E - and I’ve heard from people that use it that in general, it does perform better than Dall-E.


Doesn't do them any good to do nothing with it.

I don't get Google's brand or leadership anymore. They're staring disruption to their core revenue stream in the face and acting like the "everything is fine" dog.

Meanwhile Microsoft is playing dimensional chess across multiple industries and key developing areas of research.


It’s quite simple, actually. State of the art research gets attention and attracts top tier researchers. As long as they’re working for Google they’re not working for Bing. The research is sometimes useful but even if it’s not it is red meat for their researchers.

However although the models of today are interesting, they’re not consistent enough for many business applications. It’s not good enough to generate realistic cartoon characters most of the time but sometimes they randomly have 4 arms. Or you have a language model that can say interesting things but can’t be relied upon for basic facts. Maybe some day we will be there but not today.


It's very possible they are doing something with them, just not the sort of thing OpenAI is doing. Google could have a model just as capable as Whisper running in their Google Cloud or Assistant transcription service - but they might have upgraded from their previous model without any announcement or blog post.

Are you saying they should have more bombast in their product announcements?


> If Google is to be believed, this outperforms Dall-E - and I’ve heard from people that use it that in general, it does perform better than Dall-E.

Google actually surpassed its own model[0], first with Parti[1], which unlike DALL-E could even correctly insert text in the image, then with Imagen Video[2], which does as the name implies.

[0]: https://paperswithcode.com/sota/text-to-image-generation-on-...

[1]: https://parti.research.google/

[2]: https://imagen.research.google/video/


Google makes more money than OpenAI, so I wouldn't say that.

Also, OpenAI is basically a subsidiary of Microsoft at this point.


Keep in mind that Google also works on products other than AI (e.g. Android), and that the best model is not necessarily the best for the business. E.g. it might be too costly to operate certain models at Google's scale, so they would need to pick a compromise between quality of results, runtime cost and scalability.


Outperforming in what sense? AI model capabilities?


It could well be that 187k employees is a disadvantage, not just economically, but because every project has too many cooks in the kitchen, so to speak, and too much politics.


A tiny fraction of that 187k employees is working on AI. Total employee count is a silly metric.


Has there been any work on a model that takes in speech and outputs phonemes (e.g. in IPA or an English respelling), similar to what Google's speech pronunciation tool does when it shows "Sounds like you said ...."?


If I just want to play with this as a toy, without any performance considerations, is it possible to run it on a CPU only? Or do I need to faff about with getting CUDA and that nonsense running?


You probably want to check this port : https://github.com/ggerganov/whisper.cpp


That indeed seems perfect.


Yes, it runs well enough on CPU. https://modal-labs-whisper-pod-transcriber-fastapi-app.modal... runs on CPU, and chunks up podcast audio to parallelize over dozens of CPUs.


It'll run on CPU in decent time; it also depends on the kind of CPU you have and the model you pick. On my M1, the base model can transcribe 5-10 s of audio in about as long.

I uploaded it to a hugging face space if you just want to fiddle with it without installing

https://huggingface.co/spaces/jerpint/babelfish


Yes, it runs fine on your computer.
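For reference, a minimal CPU-only invocation with the reference openai-whisper pip package looks roughly like this (clip.mp3 is a placeholder; fp16=False just silences the half-precision warning on CPU):

    # Minimal CPU-only run; requires ffmpeg on PATH.
    import whisper

    model = whisper.load_model("base", device="cpu")
    result = model.transcribe("clip.mp3", fp16=False)
    print(result["text"])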


How can the Word Error Rate (WER) be larger than 100% for some languages?


If the target is the words "A A A" and you produce "B B B B", you have more errors than there were words in the target. 3 replacements and 1 insertion.
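Concretely, WER = (substitutions + deletions + insertions) / number of reference words, so this example is (3 + 0 + 1) / 3 ≈ 133%. A quick sanity check (jiwer used purely for illustration):

    import jiwer
    print(jiwer.wer("A A A", "B B B B"))  # 1.33..., i.e. 133% WER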


I wonder what the correlation between consistency of written/spoken language is for the breakdown per language here: https://github.com/openai/whisper/blob/main/language-breakdo...

For instance, I know Turkish is very consistent: its writing system was overhauled in 1928, a few years after the founding of the Turkish Republic. Turkish is quite high in the rankings, and I don't think that's because there's loads of data available, but because of its consistency. Conversely, English has loads of data, which should compensate for its inconsistency.


How does this benchmark against AWS [0] offering?

[0] https://aws.amazon.com/transcribe/


I have done some tests for a hobby project (transcribing local fire station radio dispatch), and Whisper is incomparably better than AWS transcribe.

Radio dispatch is very hard to understand, low quality audio. AWS transcribe was essentially useless, incomprehensible transcriptions. Whisper was 95+% accurate, with maybe 1 incorrect word per few sentences, and it was often easy to tell what the correct word would be.


Interesting, that's good to know. My personal experience was that Transcribe handled accents very poorly, but if Whisper can handle radio static, maybe it can also decipher a thick Scottish accent.


On the original blog post announcing Whisper there was a demo ("Accent" under the drop-down) of it transcribing a Scottish accent. https://openai.com/blog/whisper/


Whisper has done extremely well for me with a number of challenging audio transcriptions: background noise, heavy accents, music playing... you name it.


Is there a reason why Spanish and Italian have a lower WER than English? OpenAI is based in America right? They probably focused on English more than anything.


English is the worst. “We don't just borrow words; on occasion, English has pursued other languages down alleyways to beat them unconscious and rifle their pockets for new vocabulary.”

I could believe that other languages with more regular grammar might be easier to parse. Or maybe Romance languages lean on more easily distinguished phonemes.


I like that quote.

English also tends to take delight in stealing words from other languages, and then using them in the wrong way. Seemingly in an effort to drive speakers of the other language up the wall.


Wonderful quote.


Spanish and Italian are very easy to recognize due to very simple phonetic structure.


WER is really dataset-dependent, you can’t really compare across different test sets unless they’re almost identical (acoustic environment, microphone quality, speaker profiles, linguistic domain, etc).

WER for the exact same model can vary wildly between datasets, it’s really only useful for comparing model performance on a single test set.


Interesting to see there's work going into M{1,2} support rather than just CPU. Looks like there's an issue with the buffer lengths with pytorch M1 support atm but looks fixable in future pytorch releases.


Off topic: what’s OpenAI revenue model?

(I’m under the impression they don’t charge)


https://openai.com/api/pricing/ (ChatGPT is a free preview right now.)


They charge


Have there been any efforts to integrate the Whisper model into a more advanced engine like Kaldi, to get the most out of this wonderful model?


I've only run this on my CPU, using the tiny.en model.

Has anyone run tiny.en on a T4 GPU (AWS g4dn, iirc)? What's the speed-up?


For revoldiv.com we have profiled many GPUs; the best one is the 4090. We do a lot of intelligent chunking, detect word boundaries, and run the model in parallel on multiple GPUs, which gets us about 40 to 50 seconds for an hour-long audio. Without that, expect about 7 minutes for an hour of audio on a Tesla T4:

  Tesla T4 (30 GB memory, 8 vCPU, Google Cloud)
   tiny / tiny.en   10 min -> 30 s
   medium           10 min -> 1 min 30 s    60 min -> 7 min
   large                                    60 min -> 13 min
  NVIDIA GeForce RTX 4090
   tiny             10 min -> 5.5 s         60 min -> 35 s
   base             10 min -> 7 s           60 min -> 50 s
   small            10 min -> 14 s          60 min -> 1 min 35 s
   medium           10 min -> 26 s          60 min -> 3 min
   large            10 min -> 40 s          60 min -> 3 min 54 s
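For anyone wanting to reproduce this kind of multi-GPU parallelism, the basic shape is one worker process per GPU, each loading its own model and pulling pre-split chunks from a queue (a rough sketch, not revoldiv's actual pipeline; chunk names and GPU count are placeholders):

    # Sketch: one Whisper instance per GPU, chunks distributed via a queue.
    import multiprocessing as mp
    import whisper

    def worker(gpu_id, jobs, results):
        model = whisper.load_model("medium", device=f"cuda:{gpu_id}")
        while True:
            chunk = jobs.get()
            if chunk is None:          # sentinel: no more work
                break
            results.put((chunk, model.transcribe(chunk)["text"]))

    if __name__ == "__main__":
        mp.set_start_method("spawn")   # avoids CUDA-with-fork issues
        chunks = [f"chunk-{i:03d}.wav" for i in range(24)]  # pre-split audio
        n_gpus = 2
        jobs, results = mp.Queue(), mp.Queue()
        for c in chunks:
            jobs.put(c)
        for _ in range(n_gpus):
            jobs.put(None)
        procs = [mp.Process(target=worker, args=(g, jobs, results)) for g in range(n_gpus)]
        for p in procs:
            p.start()
        transcripts = dict(results.get() for _ in chunks)  # drain before joining
        for p in procs:
            p.join()
        for name in sorted(transcripts):
            print(name, transcripts[name][:80])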


That's crazy - 35 seconds for 60 min on base with a 4090, wow! Thanks for the info! Also, btw, I mentioned on another thread that I was getting an error. Are you planning on offering this as a paid API?


Can you send me the audio that caused it? You can email me at team AT revoldiv .com. If there turns out to be a lot of interest, yes, we can provide it as an API service. Our service has some niceties like word-level timestamps, paragraph separation, sound detection, etc. For now it is a free service you can use as much as you want.


The new large model is garbage at translation (ja) compared to most of the other models.


Did they just commit straight to the main branch?


They probably have an internal repo which gets the commits first, then synced to github.


Why not? Should they have committed to, like, a dev branch?


Looks like they plugged GPT-4 AI into speech recognition research and now they are going to release huge updates every month.


What is the basis for this claim? How does their GPT-4 work intersect with the work on their ASR model?


It's a joke... They're saying GPT-4 is doing the research now.


Easily: you can ask it questions about speech recognition system design. For example, you can ask ChatGPT about upsampling vs. downsampling and it gives a correct answer. It can also correctly recommend an architecture for streaming speech recognition.



