Hacker News | somebee's comments

Will try to push a fix in the next few hours. We are instantiating the Monaco editor with a custom font (Source Code Pro) before we're sure the font has loaded, which throws off the character box measurements in Monaco. We did have a fix for this in the old (non-backend) IDE, so I'll port that over ASAP. Thanks for notifying us :)

Right now, only JS is supported out of the box, but I guess any language that can run via WebAssembly or other techniques could work. WebContainers has experimental Python support, but it won't work with a lot of the dependencies you would usually rely on in Python.

We should be able to use this as a VS Code extension to solve this issue. Is there an SDK to integrate this into Electron apps?

We are finalizing an Electron app as we speak. That will allow recording anything that runs on your own system.

We've been researching different speech models at Scrimba, and went for Whisper on our own infrastructure. A few days ago I stumbled onto Deepgram, which blows Whisper out of the water in terms of speed and accuracy (we need high-precision word-level timestamps). I thought their claim of being 80x faster than Whisper had to be hyperbole, but it turned out to be true for us. Would recommend checking it out for anyone who needs performant speech-to-text.


80x faster than Whisper is an incredible feat. How is Deepgram's transcription accuracy?

Also, have you heard of Conformer-1 by AssemblyAI [1]? It was released a few days ago and supposedly scored higher than Whisper on various benchmarks.

[1]: https://www.assemblyai.com/blog/conformer-1/


In my experience the accuracy is at least a bit better than whisper-small on their enhanced models. But we've just started using it so haven't had time to do many direct comparisons with whisper. Their word-timestamps are _much_ better, which is important if you want to be able to edit the audio based on the transcription.

As for speed I have no idea how they make it so fast, but I'm sure they've written about it somewhere. My guess is at least that they are slicing the audio and parallelising it. Will look into Conformer-1 as well!
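To be clear, this is just my guess at how they get the speedup, but a slice-and-parallelise approach could look something like this sketch (the `transcribe_chunk` worker and the chunk/overlap sizes are hypothetical):

```python
from concurrent.futures import ProcessPoolExecutor

def chunk_spans(total_s: float, chunk_s: float = 30.0, overlap_s: float = 1.0):
    """Split an audio duration into overlapping (start, end) spans in seconds."""
    spans, start = [], 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start = end - overlap_s  # overlap so words on a boundary aren't cut in half
    return spans

def transcribe_parallel(path: str, total_s: float, transcribe_chunk) -> str:
    """Transcribe chunks concurrently and stitch the results back in order.
    `transcribe_chunk(path, start, end)` is a hypothetical worker that runs
    the model on one slice of the audio and returns its text."""
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(transcribe_chunk, path, s, e)
                   for s, e in chunk_spans(total_s)]
        return " ".join(f.result() for f in futures)
```

The overlap means some words get transcribed twice, so a real implementation would also need to deduplicate at the seams.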


The nice thing about whisper is that it runs locally.


I saw Deepgram's claims as well and believed them too, then I tried it, and it was TERRIBLE. Don't believe them. It only does well on the benchmarks it was trained on. It is faster, though, but the quality is terrible.


Did you try their enhanced models? We're using it for relatively high-quality audio files and their accuracy is better than the Whisper small.en model. More importantly, their word-level timestamps are worlds better than Whisper's.


Yeah, I'm not sure why people get so hyped up about Whisper. In production use it's middling at best, and there are commercial offerings that handily beat it in both accuracy and speed.

Whisper is mostly an academic toy.


In most real-world settings, at least in my personal use, latency to a remote AI accounts for most of the usability difficulty with automated speech recognition. The larger Whisper models can be run directly on a laptop using multithreading and achieve speech-to-text transcription that is fully sufficient to write whole emails, papers, and documents. In fact, I've written most of this comment using an ASR system on my phone that uses Whisper. While the smaller models (like the one used here) can need some correction, the bigger ones are almost perfect. They are very sufficient, and for real-time interactive use I see no future market for paid APIs.
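For reference, running Whisper locally with the openai-whisper package really is only a few lines; a sketch (the model name and audio path are placeholders):

```python
def transcribe_locally(path: str, model_name: str = "base") -> str:
    """Run Whisper entirely on the local machine.
    Requires `pip install openai-whisper` and ffmpeg on the PATH."""
    import whisper  # imported lazily so the sketch itself has no hard dependency
    model = whisper.load_model(model_name)  # "small", "medium", "large", ...
    result = model.transcribe(path)
    return result["text"].strip()
```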

Yesterday I wrote virtually all the prose in the manuscript while walking around with a friend and discussing it. We didn't even look at the phone.

Obviously there's an academic element here, because I'm saying I'm using it for writing. But it's more of a human-centric computing thing. I'm replacing a lot of the time my thumbs spend tapping on keys, my fingers spend tapping on a keyboard, and my eyes spend staring at the words appearing, looking for typographical errors to correct, with time organizing my thoughts in a coherent way that can be spoken and read easily. I'm basically using Whisper to create a new way to write that's more fluid, direct, and flows exactly as my speech does. I've tried this for years with all of the various ASR models on all the phones I've had and never been satisfied in the same way.


Sounds great! Which app are you using for this?


"Openai Whisper Keyboard" is good. It doesn't use whisper.cpp but rather a pytorch implementation that runs on Android.

I also use whisper.el in Emacs. It's amazing. Much more powerful, but computer-based of course.


Thanks!


Whisper democratises high-quality transcription of the languages I personally care about, whether using a CPU or a GPU. It's FOSS, self-hostable, and very easy to use.

That's a big deal for someone who wants to build something using speech recognition (voice assistant, transcription of audio or video) without resorting to APIs.


Is this in the realm of aspiration, or something you've actually worked on? Because Whisper is incredibly difficult (I'd say impossible) to use in a real-time conversational setting. The transcription speed is too slow for interactive use, even on a GPU, once you step up above tiny or base. And when you step down that low, the accuracy is atrocious (especially in noisy settings or with accented voices), and then you have to post-process the output with a good NLP model to make it usable in whatever actions you're driving.

Look, it's nice that it's out there and free to use. For the sake of my wallet I hope it gets really good. But it isn't competitive with top of the line commercial offerings if you need to ship something today.


At the company I work for, we are currently transitioning from a German-only to an international company. Recently, we started using Whisper to live-transcribe and translate all our (German) all-hands meetings to English. Yes, it required some fine-tuning (e.g. choosing the size of the audio chunks you pass to Whisper), but overall it's been working incredibly well, almost flawlessly so. I don't recall which model size we use, but it does run on a beefy GPU.


What if you are operating in an internet-denied application? A remote radio sensing system with a low bandwidth back channel or an airplane cabin, just to name two.


Actively working on it. I've not noticed any performance problems even with the large model (though the plan was always to run the speech recognition on a GPU - your use case may differ). It seems to be doing fairly well even with slightly noisy inputs, and certainly has better bang/$ than other non-API solutions that service my native language.

While true real-time would definitely be nice, I can approximate it well enough with various audio slicing techniques.
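One such slicing technique (a hypothetical sketch; the window and step sizes are illustrative) is to re-transcribe a rolling window over the incoming stream and keep only the newly confirmed words:

```python
def rolling_windows(n_samples: int, rate: int = 16000,
                    window_s: float = 5.0, step_s: float = 1.0):
    """Yield (start, end) sample indices for a sliding window over a stream.
    Each window is re-transcribed in full; the caller keeps only the words
    that have stabilised across consecutive windows."""
    window, step = int(window_s * rate), int(step_s * rate)
    end = window
    while end <= n_samples:
        yield (max(0, end - window), end)
        end += step
```

The trade-off is redundant compute (every second of audio is transcribed several times) in exchange for transcripts that lag the speaker by roughly one step interval.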


Faster Whisper is 8x faster than real time on CPU and even faster on GPU. https://github.com/guillaumekln/faster-whisper
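For anyone wanting to try it, a minimal faster-whisper sketch with word-level timestamps might look like this (the model size, int8 quantisation, and file name are assumptions on my part):

```python
def words_with_timestamps(path: str):
    """Transcribe with faster-whisper's quantised model and return
    (start, end, word) tuples. Requires `pip install faster-whisper`."""
    from faster_whisper import WhisperModel  # lazy import: no hard dependency
    model = WhisperModel("small", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, word_timestamps=True)
    return [(w.start, w.end, w.word) for seg in segments for w in seg.words]

def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm, handy when displaying word timings."""
    minutes, secs = divmod(seconds, 60.0)
    return f"{int(minutes):02d}:{secs:06.3f}"
```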

Vocode uses Whisper for real-time, zero-latency voice chat with ChatGPT. Give their demo line a call to see how well it works: +1-650-729-9536


On a 1080Ti (6 year old GPU) I found Whisper large models to take around as much time to transcribe as the length of the audio.


That's very similar to CPU-based performance with modern CPUs and parallelization! Frankly, with whisper.cpp it tends to be a little faster than the length of the audio for the "small" model, and much faster for "base" and "tiny".


Doesn't even have to be that modern; my Ivy Bridge CPU already achieves faster-than-realtime performance, which makes me wonder if there is maybe some startup cost for the GPU-based solution and whether it would outperform the CPU only with longer clips.


Try quantised models. They perform reasonably well, although you probably want to run some benchmarks if you really want to get it done properly.


If there is a better model I can easily integrate using Python, I'm all ears. For what I'm building, quality is most important.


Oh, Azure's speech recognition API beats it handily on English language. Both in accuracy and speed.

Another is Deepgram. Even this obscure vendor seems to be able to handle the samples I tried better than Whisper: https://picovoice.ai/platform/cat/

But yeah, go with Azure as your starting point. It is good and the price is likely acceptable unless you're transcribing all of youtube.


Umm I want to pay zero and run locally


If you use the large_v2 version of whisper, and give it a prompt to indicate what it's transcribing, it can do extremely well. But do use the prompt feature.
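For anyone who hasn't used it, `initial_prompt` is the relevant option in openai-whisper's `transcribe`; a sketch (the prompt-building helper and the vocabulary are my own illustration):

```python
def build_prompt(terms: list[str]) -> str:
    """Fold expected vocabulary into a short natural-language prompt
    that biases Whisper toward the right spellings."""
    return "A technical discussion mentioning " + ", ".join(terms) + "."

def transcribe_with_prompt(path: str, domain_terms: list[str]) -> str:
    """Transcribe with large-v2 and a domain-specific prompt.
    Requires `pip install openai-whisper`."""
    import whisper  # lazy import: no hard dependency for the sketch
    model = whisper.load_model("large-v2")
    result = model.transcribe(path, initial_prompt=build_prompt(domain_terms))
    return result["text"]
```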


Yeah, exactly, this is why there's hype. It's the best model that you can use for free easily.


I work at Rev.AI, check us out

Lowest WER in the industry, cheaper than Whisper API, and we have an on-prem solution


Yes, I've also had this experience with React. I've created several imba apps with more complex UI than an average webstore and still rerender from root without even being close to spending 16ms on a full rerender. It does work. https://www.freecodecamp.org/news/the-virtual-dom-is-slow-me...


My experience was not with React and I’m taking

> rerender from root under 16ms

with a huge grain of salt.


Everything rerenders from the root of the editor (that is - not the syntax highlighting of the code itself - as that is handled by monaco).


The ltree extension is fantastic if you have data like comments or any other hierarchical structure.


The ltree extension (https://www.postgresql.org/docs/current/ltree.html) is perfect for this usecase.
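As a sketch of the pattern (table and column names are illustrative; the query uses psycopg-style `%s` placeholders): each comment stores its materialised path as an ltree label chain, and subtree queries use the `<@` ("is a descendant of") operator:

```python
def descendants_query(root_path: str):
    """Build a parameterised query for all comments under `root_path`,
    e.g. '1.4.7' for the 7th reply under thread 1, comment 4."""
    sql = """
        SELECT id, path, body
        FROM comments
        WHERE path <@ %s::ltree   -- '<@' matches the node and all descendants
        ORDER BY path
    """
    return sql, (root_path,)

def child_path(parent_path: str, child_id: int) -> str:
    """Extend an ltree path with a new label (labels are dot-separated)."""
    return f"{parent_path}.{child_id}" if parent_path else str(child_id)
```

A GiST index on the path column keeps these subtree lookups fast even for deep trees.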


How does ltree compare with jsonb?


What a coincidence seeing this on HN. Great lib! I actually experimented with replacing our use of node-postgres with this today. Your lib was the reason I finally got around to shipping support for tagged template literals in Imba today (https://github.com/imba/imba/commit/ae8c329d1bb72eec6720108d...) :) Are you open for a PR exposing the option to return rows as arrays? It's pretty crucial for queries joining multiple tables with duplicate column names.


Thanks a lot! What a coincidence - and perfect timing since v3 supports .raw()[1] to receive results as arrays.

I'd also be very curious to hear how replacing pg goes :)

And also good job on Imba! I'm a really big fan of stripping syntax down to the bare essentials, and what you've done with Imba is really impressive!

[1] https://github.com/porsager/postgres#raw


Brilliant! We're well on our way to migrating. The experience has been buttery smooth so far. And the codebase itself is really well organized. Huge thumbs up!


Wow - thats awesome! Thank you!


I played through it myself and tbh I think the video doesn't do it justice. I haven't been this blown away in years. Until I got to take over the controls I was utterly convinced there had to be pre-rendered video thrown into the mix.

If you have a compatible console I'd really recommend checking it out!


Yeah, it's not really optimized for that. The floating labels over the code examples are offset in 3d space (for a subtle parallax effect while scrolling), so when HW acceleration is off it may end up repainting the page on every scroll. The effect is probably not worth the tradeoffs :)

