The large model can run on consumer GPUs, and I'd argue it's more relevant to hobbyist use cases, because I'm not worrying about the efficiency of serving thousands of users - just my own videos.
> * only an update to the `large` model (and therefore irrelevant for hobbyist, low CPU use cases)
This is maybe the most important thing there is for a lot of use cases, since additional training is the one thing that is just about impossible for most people to do.
I recently swapped out the AI model for voice transcription on revoldiv.com and replaced it with Whisper. The results have been truly impressive - even the smaller models outperform and generalize better than any other option on the market. If you want to give it a try, our model is capable of faster transcription by utilizing multiple GPUs and some other enhancements, and it is all free.
How come you don't support audio files longer than 1hr? Is it because of $$ cost?
The above demo app gets faster transcription by chunking audio and parallelizing over dozens of CPUs, so you can transcribe about 1 hr of audio for $0.10.
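Not the demo app's actual code, but a minimal sketch of the chunk-and-parallelize idea, assuming ffmpeg and the openai-whisper package are installed (file names, chunk length, and worker count are made up):

```python
# Sketch only: split the input into fixed-length chunks with ffmpeg, then
# transcribe the chunks in parallel on CPU workers and join the text.
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor

import whisper

def split_audio(path, chunk_s=120):
    # cut the file into 2-minute pieces without re-encoding
    subprocess.run(
        ["ffmpeg", "-y", "-i", path, "-f", "segment",
         "-segment_time", str(chunk_s), "-c", "copy", "chunk-%03d.mp3"],
        check=True,
    )

def transcribe_chunk(chunk_path):
    # each worker process loads its own small model
    model = whisper.load_model("base", device="cpu")
    return model.transcribe(chunk_path, fp16=False)["text"]

if __name__ == "__main__":
    split_audio("episode.mp3")
    chunks = sorted(glob.glob("chunk-*.mp3"))
    with ProcessPoolExecutor(max_workers=8) as pool:
        print(" ".join(pool.map(transcribe_chunk, chunks)))
```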
Interesting - which model are you using? We use the medium model, which is the sweet spot for the time/performance ratio. We also chunk. We try to detect words and silences to do better chunking at word boundaries, but if you do more chunking and you don't get the word boundaries right, it seems like Whisper loses some context and the accuracy suffers. We will soon support longer hours; we just want to make sure the wait time for transcription doesn't suffer for most users. But great demo - reach out to me if you want to collaborate.
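Not our actual pipeline, just a sketch of cutting at detected silences instead of fixed offsets so chunks end between words; it assumes pydub (which needs ffmpeg), and the thresholds are guesses:

```python
# Sketch only: find silent stretches, then cut near each ~2-minute mark at a silence.
from pydub import AudioSegment, silence

audio = AudioSegment.from_file("episode.mp3")
silences = silence.detect_silence(
    audio, min_silence_len=500, silence_thresh=audio.dBFS - 16
)  # list of [start_ms, end_ms] pairs

target_ms, cuts, last = 120_000, [], 0
for start_ms, end_ms in silences:
    if start_ms - last >= target_ms:
        # close the current chunk at the start of this silence
        cuts.append((last, start_ms))
        last = end_ms
cuts.append((last, len(audio)))

for i, (a, b) in enumerate(cuts):
    audio[a:b].export(f"chunk-{i:03d}.mp3", format="mp3")
```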
It's not a model you can run on your own server but a free service on revoldiv.com. You can expect a 40 to 50 second wait to transcribe an hour-long video/audio. We combine Whisper with our own model to get word-level timestamps, paragraph separation, and sound detection like laughter, music, etc. We recently added very basic podcast search and transcription.
> The Google Recorder app (...) transcribes meetings and interviews to text, instantly giving you a searchable transcription that is synced to the recorded audio (...) With this new update, the recorder can now identify and label each speaker automatically—an impressive feat. Google Recorder is exclusive to the Pixel 6 and newer Pixel devices.
Not that I know of, but I use Descript for content creation and it offers something like what you're describing. They have automatic speaker labels built into their transcription service.
There is a discussion on the Whisper github page called something like “diarization” which details a few attempts to attain this functionality with additional tools.
I've had limited success combining the results of pyannote with Whisper transcripts (matching them up based on timestamps).
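Roughly, the matching looks like this (a sketch, not polished code; it assumes pyannote.audio's pretrained diarization pipeline, which needs a Hugging Face access token, plus the openai-whisper package):

```python
# Sketch only: diarize with pyannote, transcribe with Whisper, then label each
# Whisper segment with the speaker whose turn overlaps it the most in time.
import whisper
from pyannote.audio import Pipeline

AUDIO = "meeting.wav"

diarization = Pipeline.from_pretrained("pyannote/speaker-diarization")(AUDIO)
segments = whisper.load_model("medium").transcribe(AUDIO)["segments"]

for seg in segments:
    best_speaker, best_overlap = "UNKNOWN", 0.0
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    print(f"[{best_speaker}] {seg['text'].strip()}")
```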
It’s ok, but the quality of speaker identification is nowhere near as good as the transcription itself.
I'd love to see models which try to use stereo information in recordings to solve the problem. Or, given a fixed camera and static speakers, I thought it should even be possible to use video to add information about who is speaking. There doesn't seem to be anything like that right now, though.
I've been using whisper to get transcripts from my local radio stations. I know it's out of scope for the original project but I hope someone can build a streaming input around it in the future. I currently pipe in and save 10 minute chunks that get sent off for processing.
The streaming I was referring to is from a URL or network resource. I keep a directory of m3u files, each of which is just a URL inside that you can open in VLC etc. So `ffmpeg -i [url] -c copy [file-name].mp3` does the trick for now. `mpv` can do this while saving to a file, but it's a nice point to start from, as I'm already getting some ideas for how I could use this or roll my own, thanks!
whisper.cpp provides similar functionality via the `livestream.sh` script that performs transcription of a remote stream [0]. For example, you can transcribe BBC radio in 10s chunks like this:
$ make base.en
$ ./examples/livestream.sh http://a.files.bbci.co.uk/media/live/manifesto/audio/simulcast/hls/nonuk/sbr_low/ak/bbc_world_service.m3u8 10
[+] Transcribing stream with model 'base.en', step_s 10 (press Ctrl+C to stop):
Buffering audio. Please wait...
here at the BBC in London. This is Gordon Brown, a former British Prime Minister, who since 2012 has been UN Special Envoy-
Lemboy for global education. We were speaking just after he'd issued a rallying cry on the eve of the 2022 Football World Cup. For
Governments around the world to pressure Afghanistan to let girls go to school. What human rights abuses are what are being discussed as we...
run up and start and have happened and have seen the World Cup matches begin. And it's important to draw attention to one human rights abuse.
that everyone and that includes Qatar, the UAE, the Islamic organization of countries, the Gulf Corps...
Wow! Thanks for sharing - I didn't explore the repo much beyond that link, and this looks very promising. I was going to tomorrow, but I'm checking out the src and building now! The confidence color coding, btw, is chef's kiss. Great job with this.
The codecs and narrow frequencies used on most public safety trunked radio networks are truly terrible. IMBE and AMBE+2 should never have made it past Y2K, yet Motorola and Project 25 have ensured these remain in widespread use on these networks: https://en.wikipedia.org/wiki/Project_25
If Whisper achieves 85% or higher accuracy on this audio, it would be a miracle. Garbage in, garbage out, tbh. Project 25 needs to move to a modern codec, ideally not one that sees little development and is controlled by one small company.
Being a very constrained domain with its own speech rules, transcribing ATC conversations would probably benefit a lot from fine-tuning on ATC speech data.
Yeah, the tech is there for radio, and the same goes for POTS (still 8-bit/8000 Hz at its core). I do listen on the scanner from time to time, and assuming a clear signal, traditional NFM always beats digital in terms of intelligibility.
Speaking of the telephone, they definitely could put improved audio quality as part of the 4G and 5G specs, but they don't. A modern telephone network is all IP anyways and backwards compatibility can be maintained.
I wonder how much training on degraded/radio encoded samples would be needed to improve performance in this area. You’re probably not the only person who wants to monitor police radios in their city.
Great question. I've not encountered it too much, as I'm looking for specific stuff in the transcript based on keywords, so some loss is acceptable, and I don't always need the data immediately, so it has not been an issue so far. What I do to alleviate this somewhat is have "stitched passes" at set intervals.
News, betting games, and some shows HAVE to happen exactly at a certain time and are very rarely late, so you can use these known times as checkpoints with some bias. Say I run this command, `ffmpeg -i http://someexamplesite.fm/8b0hqm93yceuv -c copy -f segment -segment_time 60 -reset_timestamps 1 zip-%03d.mp3`, which automatically chunks the stream into 1-minute files, and I know the news gets read at 1pm: I can merge everything from 10am to 1pm into, say, "Segment B" and then process that - you get the idea.
I also have a later step in the pipeline that runs on the transcript and tries to summarize what it can, so any small gaps are likely inferred; I have not noticed anything too wonky, and the summaries and text I get so far have been pretty clean and good enough for my needs. Once I bench this some more, however, I'm sure I will run into this and other interesting problems to solve - the one I'm fighting a bit now is ads that have music and fast talking.
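For reference, a rough sketch of what one of those stitched passes could look like, assuming the 1-minute `zip-%03d.mp3` chunks from the ffmpeg command above and the openai-whisper package (the chunk indices for the 10am-1pm window are hypothetical):

```python
# Sketch only: merge the chunk files between two known checkpoints into one
# file with ffmpeg's concat demuxer, then re-transcribe the merged segment so
# Whisper gets the full context across chunk boundaries.
import glob
import subprocess

import whisper

chunks = sorted(glob.glob("zip-*.mp3"))[180:360]  # e.g. 10am-1pm at one chunk per minute
with open("segment_b.txt", "w") as f:
    f.writelines(f"file '{c}'\n" for c in chunks)

subprocess.run(
    ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
     "-i", "segment_b.txt", "-c", "copy", "segment_b.mp3"],
    check=True,
)

model = whisper.load_model("medium")
print(model.transcribe("segment_b.mp3")["text"])
```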
I came up with this specifically because of Whisper. This looks nice but doesn't suit my use case, as I'm "listening" to over a dozen stations at once and I also stitch the chunks together at the end of the day into one fat wav file.
Stupid simple setup until I streamline the process some more. Just running on my PC (i7-8700K, GTX 1080), using ffmpeg to copy the stream to disk and a simple Python script that creates the chunks and transcripts.
Does anyone know if this new model handles silence better? I was trying to use whisper for transcribing bursts of talking amid large spans of silence, but the frequency of hallucinations was too high.
Sure, but that should be considered an accuracy problem. Telling a system to do its best to extract words from background sounds, and then getting words from it, is a different type of problem.
-------
I can't reply to the below, but you have to consider the difference in signal-to-noise ratio to see why it should be considered a different problem.
If I told a binary image classifier to classify a clear image of a cat as either a "cat" or a "dog", and it said "dog", then that would be an accuracy problem.
If I gave the same classifier an image of a black cat standing in a very dark room, where even a human would have trouble identifying it, and it said "dog", that's not an accuracy problem as much as a signal-to-noise problem.
It seems like you're making the assumption that all of the issues you describe have the same root cause. I don't think that's a sound assumption... tehe.
"Silence" is a problematic term. For me, that word encompasses: squeaky chairs, typing on a loud keyboard, moving objects around on my table, etc. In a perfect world, Whisper —like a human— can easily distinguish a human voice from the din of my office, and only try and transcribe my voice.
Does anyone have solutions for clearing out "silence" from an audio file that work off something a bit more accurate than just "<= decibel x"?
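One option that goes beyond a plain decibel threshold is a small neural VAD. A sketch using Silero VAD via torch.hub (repo and model names as documented upstream; the file name is a placeholder):

```python
# Sketch only: detect speech regions with Silero VAD and save just the speech,
# which can then be fed to Whisper instead of the raw recording.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, _, collect_chunks = utils

wav = read_audio("office_recording.wav", sampling_rate=16000)
speech_ts = get_speech_timestamps(wav, model, sampling_rate=16000)

save_audio("speech_only.wav", collect_chunks(speech_ts, wav), sampling_rate=16000)
```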
There's a two-week Whisper fine-tuning event put on by Hugging Face going on right now [1]. The new "large-v2" model was announced as part of the kickoff [2]. It's supposed to offer 5% - 10% improvements over the previous large model.
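If you want to try it from Python, it should be as simple as the following, assuming a recent enough openai-whisper release that exposes the new checkpoint under that name (otherwise `large` now points to it):

```python
# Sketch only: load the updated large checkpoint and transcribe a file.
import whisper

model = whisper.load_model("large-v2")
print(model.transcribe("sample.mp3")["text"])
```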
I have been using Whisper to transcribe my audio notes. I just save my voice memo from my phone to my NAS and my little script does the rest on a loop.
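Roughly what such a script could look like - a sketch under my own assumptions (openai-whisper installed, memos landing in a synced NAS folder), not necessarily the parent's setup:

```python
# Sketch only: poll a folder for new voice memos and write a .txt next to each.
import time
from pathlib import Path

import whisper

MEMO_DIR = Path("/mnt/nas/voice-memos")  # hypothetical mount point
model = whisper.load_model("small")

while True:
    for audio in MEMO_DIR.glob("*.m4a"):
        out = audio.with_suffix(".txt")
        if out.exists():
            continue  # already transcribed
        out.write_text(model.transcribe(str(audio))["text"])
    time.sleep(60)  # check once a minute
```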
Not OP, but I regularly access my home network remotely. I have an Asus RT-AX86U, which has a lot of features in the stock firmware, including OpenVPN configuration. If I need "LAN" access to my home network, I open the OpenVPN client and then it just works; this is useful for accessing things like local Samba shares or SSHing into my Linux machines (I was dissatisfied with the log spam of having an open SSH port, even with fail2ban and router-level protections).
If you have an always-running server and only need access via a browser, forwarding port 80/443 to a reverse proxy works well. Dynamic DNS can be used if you don't have a static IP address. Alternatively, if you are more security conscious and don't mind the reliance on Cloudflare, you can use Cloudflare Tunnel, which won't expose ports from your local network at all. It can also be combined with a third-party authentication mechanism.
It seems like Large-V2 is a huge improvement when transcribing Japanese. I tested it on "Macross Frontier - the Movie", and it no longer breaks after 8 minutes as before:
There are still some timing issues after a period of silence, but using a VAD as a workaround (like I do in my WebUI) may no longer be strictly necessary:
So, in general and from the uninformed outside, it seems like OpenAI (120 employees?) is outperforming Alphabet (187k employees). Is there any truth to that notion?
It's apples and oranges, imo. Google owns Waymo, which is hitting the streets in SF this year with autonomous taxis. That's huge if it goes well. There are of course other non-public models Google has developed, like Imagen (a diffusion model) and LaMDA (which a Google engineer was convinced was sentient). So I wouldn't discount Google at all.
In terms of granting direct access to machine learning models, OpenAI has Google beat. But that’s not Google’s business model. And it remains to be seen if it’s a viable business model at all.
Google has models that are at parity with or outperform OpenAI's models at most tasks; they just don't bring them to market quickly or build their brand around them as strongly as OpenAI does.
I don't get Google's brand or leadership anymore. They're staring disruption to their core revenue stream in the face and acting like the "everything is fine" dog.
Meanwhile Microsoft is playing dimensional chess across multiple industries and key developing areas of research.
It’s quite simple, actually. State of the art research gets attention and attracts top tier researchers. As long as they’re working for Google they’re not working for Bing. The research is sometimes useful but even if it’s not it is red meat for their researchers.
However, although the models of today are interesting, they're not consistent enough for many business applications. It's not good enough to generate realistic cartoon characters most of the time if sometimes they randomly have 4 arms. Or you have a language model that can say interesting things but can't be relied upon for basic facts. Maybe some day we will be there, but not today.
It's very possible they are doing something with them, just not the sort of thing OpenAI is doing. Google could have a model just as capable as Whisper running in their Google Cloud or Assistant transcription service - but they might have upgraded from their previous model without any announcement or blog post.
Are you saying they should have more bombast in their product announcements?
> If Google is to be believed, this outperforms Dall-E - and I’ve heard from people that use it that in general, it does perform better than Dall-E.
Google actually surpassed its own model[0], first with Parti[1], which unlike DALL-E could even correctly insert text in the image, then with Imagen Video[2], which does as the name implies.
Keep in mind that Google also works on products other than AI (e.g. Android), and that the best model is not necessarily the best for the business. E.g. it might be too costly to operate certain models at Google's scale, so they would need to pick a compromise between quality of results, runtime cost, and scalability.
It could well be that 187k employees is a disadvantage, not just economically, but because every project has too many cooks in the kitchen, so to speak, and too much politics.
Has there been any work on a model that takes in speech and outputs the phonemes (e.g. in IPA or an English format), similar to what Google's speech pronunciation tool does, where it shows a "Sounds like you said ...."?
If I just want to play with this as a toy without any performance considerations, is it possible to run this on a CPU only? Or do I need to faff about with getting CUDA and that nonsense running?
It'll run on CPU in decent time; it also depends on the kind of CPU you have and the model you pick. On my M1, the base model can transcribe 5-10 s of audio in about that long.
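CPU-only works without any CUDA setup; a minimal example with the openai-whisper package (`fp16=False` just silences the FP16-on-CPU warning, and the file name is a placeholder):

```python
# Sketch only: force CPU and use a small model for reasonable speed.
import whisper

model = whisper.load_model("base", device="cpu")
print(model.transcribe("note.mp3", fp16=False)["text"])
```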
I uploaded it to a Hugging Face Space if you just want to fiddle with it without installing anything.
If the target is the words "A A A" and you produce "B B B B", you have more errors than there were words in the target: 3 substitutions and 1 insertion, i.e. 4 errors against 3 reference words.
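Spelled out with the usual formula, WER = (substitutions + deletions + insertions) / reference words = (3 + 0 + 1) / 3 ≈ 133%. A quick check, e.g. with the jiwer package:

```python
import jiwer

# reference vs. hypothesis from the example above
print(jiwer.wer("A A A", "B B B B"))  # ~1.33, i.e. above 100%
```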
For instance, I know Turkish is very consistent: its writing system was reformed in 1928, shortly after the birth of the Turkish Republic. Turkish is quite high in the rankings - I don't think because there's loads of data available, but because of its consistency. Conversely, English has loads of data, which should compensate for its inconsistency.
I have done some tests for a hobby project (transcribing local fire station radio dispatch), and Whisper is incomparably better than AWS transcribe.
Radio dispatch is very hard to understand, low quality audio. AWS transcribe was essentially useless, incomprehensible transcriptions. Whisper was 95+% accurate, with maybe 1 incorrect word per few sentences, and it was often easy to tell what the correct word would be.
Interesting, that's good to know. My personal experience was that Transcribe handled accents very poorly, but if Whisper can handle radio static, maybe it can also decipher a thick Scottish accent.
On the original blog post announcing Whisper there was a demo ("Accent" under the drop-down) of it transcribing a Scottish accent. https://openai.com/blog/whisper/
Whisper has done extremely well for me with a number of challenging audio transcriptions: background noise, heavy accents, music playing... you name it.
Is there a reason why Spanish and Italian have a lower WER than English? OpenAI is based in America right? They probably focused on English more than anything.
English is the worst. “We don't just borrow words; on occasion, English has pursued other languages down alleyways to beat them unconscious and rifle their pockets for new vocabulary.”
I could believe that other languages with more regular grammar might be easier to parse. Or maybe Romance languages lean on more easily distinguished phonemes.
English also tends to take delight in stealing words from other languages, and then using them in the wrong way. Seemingly in an effort to drive speakers of the other language up the wall.
WER is really dataset-dependent; you can't really compare across different test sets unless they're almost identical (acoustic environment, microphone quality, speaker profiles, linguistic domain, etc.).
WER for the exact same model can vary wildly between datasets, it’s really only useful for comparing model performance on a single test set.
Interesting to see there's work going into M{1,2} support rather than just CPU. Looks like there's an issue with the buffer lengths with pytorch M1 support atm but looks fixable in future pytorch releases.
For revoldiv.com we have profiled many GPUs, and the best one is the 4090. We do a lot of intelligent chunking, detect word boundaries, and run the model in parallel on multiple GPUs, which gets us about 40 to 50 seconds for an hour-long audio; without that, expect about 7 minutes for an hour-long audio on a Tesla T4. Rough numbers:
On a Tesla T4 (30 GB memory, 8 vCPU, Google Cloud):

* tiny and tiny.en: 10 minutes of audio in 30 seconds
* medium: 10 minutes in 1 min 30 s; 60 minutes in 7 min
* large: 60 minutes in 13 min

On an NVIDIA GeForce RTX 4090:

* tiny: 10 minutes in 5.5 seconds; 60 minutes in 35 seconds
* base: 10 minutes in 7 seconds; 60 minutes in 50 seconds
* small: 10 minutes in 14 seconds; 60 minutes in 1 min 35 s
* medium: 10 minutes in 26 seconds; 60 minutes in 3 min
* large: 10 minutes in 40 seconds; 60 minutes in 3 min 54 s
That's crazy. 35 seconds for 60 min on base with a 4090. Wow!
Thanks for the info! Also, btw, I mentioned on another thread that I was getting an error. But are you planning on offering this as a paid API?
Can you send me the audio that caused it? You can email me at team AT revoldiv .com. If there is going to be a lot of interest, then yes, we can provide it as an API service. Our service has some niceties like word-level timestamps, paragraph separation, sound detection, etc. For now it is a free service you can use as much as you want.
Easily: you can ask it questions about speech recognition system design. For example, you can ask ChatGPT about upsampling vs. downsampling, and it gives the correct answer. It can also correctly recommend an architecture for streaming speech recognition.
* only an update to the `large` model (and therefore irrelevant for hobbyist, low CPU use cases)
* architecturally identical to the original whisper model
* just trained on more epochs