Will try to push a fix in the next few hours. We are instantiating the monaco editor with a custom font (Source Code Pro) before we're sure the font has loaded, which throws off the char box measurements in monaco. We did have a fix for this in the old (non-backend) IDE, so I'll port that over ASAP. Thanks for notifying us :)
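For anyone curious, the fix is roughly along these lines (a sketch, not our exact code; container/source are placeholders):

    import * as monaco from 'monaco-editor'

    async function createEditor(container: HTMLElement, source: string) {
      // Wait until all declared @font-face fonts (incl. Source Code Pro)
      // have finished loading, so monaco measures char boxes correctly.
      await document.fonts.ready

      const editor = monaco.editor.create(container, {
        value: source,
        language: 'javascript',
        fontFamily: 'Source Code Pro'
      })

      // And if an editor was created before the font arrived, monaco
      // can re-measure font metrics after the fact:
      monaco.editor.remeasureFonts()
      return editor
    }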
Right now, only JS is supported out of the box, but I guess any language that can run via WebAssembly or similar techniques could work. WebContainers has experimental Python support, but it won't work with a lot of the dependencies you'd usually rely on in Python.
We've been researching different speech models at Scrimba, and went with Whisper on our own infrastructure. A few days ago I stumbled onto Deepgram, which blows Whisper out of the water in terms of speed and accuracy (we need high-precision word-level timestamps). I thought their claim of being 80x faster than Whisper had to be hyperbole, but it turned out to be true for us. Would recommend checking it out for anyone who needs performant speech-to-text.
In my experience the accuracy is at least a bit better than whisper-small on their enhanced models. But we've just started using it so haven't had time to do many direct comparisons with whisper. Their word-timestamps are _much_ better, which is important if you want to be able to edit the audio based on the transcription.
As for speed, I have no idea how they make it so fast, but I'm sure they've written about it somewhere. My guess is that, at the very least, they're slicing the audio and parallelising it. Will look into Conformer-1 as well!
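Something like this, conceptually (purely speculative; sliceAudio and transcribeSlice are made-up names, not Deepgram's API):

    // Speculative sketch of chunk-level parallelism.
    declare function sliceAudio(audio: ArrayBuffer, seconds: number): ArrayBuffer[]
    declare function transcribeSlice(slice: ArrayBuffer): Promise<string>

    async function fastTranscribe(audio: ArrayBuffer): Promise<string> {
      // Transcribe all slices concurrently, then stitch the text together.
      const slices = sliceAudio(audio, 30)
      const parts = await Promise.all(slices.map(transcribeSlice))
      return parts.join(' ')
    }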
I saw Deepgram's claims as well and believed them too. Then I tried it, and it was TERRIBLE. Don't believe them. It only does well on the benchmarks it was trained on. It is faster, but the quality is terrible.
Did you try their enhanced models? We're using it for relatively high-quality audio files and their accuracy is better than the whisper small.en model. More importantly, their word-level timestamps are worlds better than Whisper's.
Yeah, I'm not sure why people get so hyped up about Whisper. In production use it's middling at best, and there are commercial offerings that handily beat it in both accuracy and speed.
In most real-world settings, at least in my personal use, latency to a remote AI accounts for most of the usability difficulty with automated speech recognition. The larger Whisper models can run directly on a laptop using multithreading and produce speech-to-text transcription that is fully sufficient for writing whole emails, papers, and documents. In fact, I've written most of this comment using an ASR system on my phone that uses Whisper. While the smaller models (like the one used here) can need some correction, the bigger ones are almost perfect. They're more than sufficient, and for realtime interactive use I see no future market for paid APIs.
Yesterday I wrote virtually all the prose in the manuscript while walking around with a friend and discussing it. We didn't even look at the phone.
Obviously there's an academic element here because I'm saying I'm using it for writing. But it's more of a human-centric computing thing. I'm replacing a lot of the time my thumbs spend tapping on keys, my fingers spend tapping on a keyboard, and my eyes spend staring at the words as they appear, looking for typographical errors to correct, with time spent organizing my thoughts in a coherent way that can be spoken and read easily. I'm basically using Whisper to create a new way to write that's more fluid, direct, and flows exactly as my speech does. I've tried this for years with all of the various ASR models on all the phones I've had and was never satisfied in the same way.
Whisper democratises high-quality transcription of the languages I personally care about, whether using a CPU or a GPU. It's FOSS, self-hostable, and very easy to use.
That's a big deal for someone who wants to build something using speech recognition (voice assistant, transcription of audio or video) without resorting to APIs.
Is this in the realm of aspiration or something you've actually worked on? Because Whisper is incredibly difficult (I'd say impossible) to use in a real-time conversational setting. The transcription speed is too slow for interactive use even on a GPU once you step up above tiny or base. And when you step down that low, the accuracy is atrocious (especially in noisy settings or with accented voices), and then you have to post-process the output with a good NLP model to make it usable in whatever actions you're driving.
Look, it's nice that it's out there and free to use. For the sake of my wallet I hope it gets really good. But it isn't competitive with top of the line commercial offerings if you need to ship something today.
At the company I work for, we're currently transitioning from a German-only to an international company. Recently, we started using Whisper to live-transcribe and translate all our (German) all-hands meetings into English. Yes, it required some fine-tuning (e.g. choosing the size of the audio chunks you pass to Whisper), but overall it's been working incredibly well – almost flawlessly. I don't recall which model size we use, but it does run on a beefy GPU.
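The shape of it is roughly this (a simplified sketch; the names and the 15s default are placeholders, not our actual values, and translateChunk stands in for running Whisper with its translate task):

    const CHUNK_SECONDS = 15

    declare function nextAudioChunk(seconds: number): Promise<ArrayBuffer>
    declare function translateChunk(chunk: ArrayBuffer): Promise<string>

    async function liveTranslate(onText: (text: string) => void) {
      while (true) {
        const chunk = await nextAudioChunk(CHUNK_SECONDS)
        // Too-short chunks cut sentences apart and lose context;
        // too-long chunks add latency. That's the tuning mentioned above.
        onText(await translateChunk(chunk))
      }
    }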
What if you are operating in an internet-denied application? A remote radio sensing system with a low bandwidth back channel or an airplane cabin, just to name two.
Actively working on it. I've not noticed any performance problems even with the large model (though the plan was always to run the speech recognition on a GPU - your use case may differ). It seems to be doing fairly well even with slightly noisy inputs, and certainly has better bang/$ than other non-API solutions that service my native language.
While true real-time would definitely be nice, I can approximate it well enough with various audio slicing techniques.
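For example, repeatedly re-transcribing a sliding window over the most recent audio and treating the result as provisional text (a hypothetical sketch; transcribe stands in for whatever invokes Whisper):

    // Hypothetical sliding-window sketch: keep re-transcribing the tail
    // of the recording and overwrite the provisional text on screen.
    declare function transcribe(window: Float32Array): Promise<string>

    async function pseudoRealtime(
      latestAudio: () => Float32Array,   // grows as the mic records
      show: (provisional: string) => void
    ) {
      const windowSamples = 16_000 * 10  // last ~10s at 16 kHz
      while (true) {
        const tail = latestAudio().slice(-windowSamples)
        show(await transcribe(tail))
      }
    }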
That's very similar to CPU-based performance with modern CPUs and parallelization! Frankly, with whisper.cpp it tends to transcribe a bit faster than real time with the "small" model, and much faster with "base" and "tiny".
Doesn't even have to be that modern, my Ivy Bridge CPU already achieves faster-than-realtime performance - which makes me wonder if there's some startup cost to the GPU-based solution such that it would only outperform the CPU on longer clips.
If you use the large_v2 version of whisper, and give it a prompt to indicate what it's transcribing, it can do extremely well. But do use the prompt feature.
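If you're calling it through OpenAI's hosted API (whisper-1, which OpenAI says runs large-v2), the knob is called prompt; in the local openai-whisper package the equivalent is initial_prompt. A minimal sketch (file name and prompt text made up):

    import fs from 'node:fs'
    import OpenAI from 'openai'

    const client = new OpenAI()

    // The prompt biases the model toward the right vocabulary and
    // spelling, e.g. jargon and product names it would otherwise mangle.
    const result = await client.audio.transcriptions.create({
      file: fs.createReadStream('standup.mp3'),
      model: 'whisper-1',
      prompt: 'Engineering standup about Kubernetes, gRPC and Postgres.'
    })

    console.log(result.text)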
Yes, I've also had this experience with React. I've created several Imba apps with more complex UI than an average web store and still rerender from the root without coming close to spending 16ms on a full rerender. It does work. https://www.freecodecamp.org/news/the-virtual-dom-is-slow-me...
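The scheduling half of "just rerender from root" is tiny. A hand-wavy sketch, not Imba's actual internals (renderRoot stands in for the framework's diff/patch pass):

    declare function renderRoot(): void

    let scheduled = false

    function invalidate() {
      if (scheduled) return
      scheduled = true
      // Coalesce every state change in a frame into one full rerender;
      // as long as renderRoot stays under ~16ms you hold 60fps.
      requestAnimationFrame(() => {
        scheduled = false
        renderRoot()
      })
    }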
What a coincidence seeing this on HN. Great lib! I actually experimented with replacing our use of node-postgres with this today. Your lib was the reason I finally got around to shipping support for tagged template literals in Imba today (https://github.com/imba/imba/commit/ae8c329d1bb72eec6720108d...) :) Are you open to a PR exposing an option to return rows as arrays? It's pretty crucial for queries joining multiple tables with duplicate column names.
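For anyone unfamiliar with tagged template literals: the tag function receives the literal's static strings separately from the interpolated values, which is what lets a query builder parameterize them safely. A generic sketch (not this lib's actual internals):

    // The tag sees static strings and interpolated values separately,
    // enabling safe parameterization instead of string concatenation.
    function sql(strings: TemplateStringsArray, ...values: unknown[]) {
      const text = strings.reduce(
        (q, part, i) => q + part + (i < values.length ? `$${i + 1}` : ''),
        ''
      )
      return { text, params: values }
    }

    const id = 42
    const query = sql`select * from users where id = ${id}`
    // query.text   === 'select * from users where id = $1'
    // query.params === [42]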
Brilliant! We're well on our way to migrating. The experience has been buttery smooth so far. And the codebase itself is really well organized. Huge thumbs up!
I played through it myself and tbh I think the video doesn't do it justice. I haven't been this blown away in years. Until I got to take over the controls I was utterly convinced there had to be pre-rendered video thrown into the mix.
If you have a compatible console I'd really recommend checking it out!
Yeah, it's not really optimized for that. The floating labels over the code examples are offset in 3d space (for a subtle parallax effect while scrolling), so when HW acceleration is off it may end up repainting the page on every scroll. The effect is probably not worth the tradeoffs :)