There's going to be a big update this week with some new stuff I haven't talked about. And a bunch of amazing, clear voices, with a huge variety of styles, that blow the default Suno voices out of the water. Arguably even better than Eleven in some ways. I'm excited even though I have nothing to DO with the voices!
Don't get too attached though. I was just playing around and made a Bark fork and it got more popular than expected. And now I'm dreading a future full of hours of unpaid support and maintenance that I definitely can NOT afford, for a software product I don't even really have a personal use case for. I'm not generating my own audiobooks or anything, I won’t be using it long term myself, I was just curious what Bark could do. (Turns out a LOT more than you might think at first glance, as you'll see this week.) So I'm already trying to work out how I can elegantly wind this thing down and transition people somewhere else. But I'll keep it updated for at least a little while.
Thanks for your fork, I know a lot of folks love it. This might not be the HN style, but I'd suggest a freemium micro-SaaS approach. A lot of people I know want a 1-click solution for audio, and you might have great potential for that. It could add some $$$$ to your bank.
Thanks. I'll consider it, but I haven't deployed a model or service like that before, so it'd probably be more of a second project in itself than a funding mechanism. I was just realizing this morning how far behind I am on paying work from getting overly distracted by Bark lately. And it's a lot, so it was kind of a wake-up call. Though some of that time went to adding new features and trying new ideas (some to be seen later on the public fork); not all of it is support and such. Bark is a wild model, a lot of silly ideas kind of work.
I wish there were an easy way to fine-tune Bark, so we could truly clone our own voices for Bark inference.
Sadly the bark-voice-clone fork doesn't do it. The voices sound nothing like you.
Your Gradio GUI is great, but I don't understand where to copy the cloned .npz files to. Even after refreshing the Gradio GUI, the ClonedVoices don't appear in the Speaker or Generated Speaker dropdowns.
I barely touched that, it's just from the Serp cloning GitHub, but people kept asking so I put it in. Their clone just isn't really doing much, though: it loads up the coarse model with the encoded wav file as a fake last-generation history for the current segment, and that's basically it. But a lot of the voice is in the semantic model, so just doing the coarse doesn't get you much. And even if the semantic model didn't matter as much, the coarse model takes both semantic and coarse tokens as inputs, and your injected coarse tokens aren't going to line up just right with what a true Bark-generated pair of tokens would look like. So what you get is like a robot clone that has the most superficial similarity and lacks the depth of cadence that makes Bark awesome. That's the best case; more often you get voices full of static, or ones that don't even read the text you give them. (To be fair, any Bark voice can occasionally not read the text, it's a risk.)
I'm sure somebody will train a model that actually maps an input text to the Bark semantic representation; it shouldn't be that hard, it's just been a few weeks. But the existing clone there is just so primitive I don't know how it got so popular.
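For the curious, the whole thing boils down to roughly this (a hedged sketch, not their exact code; it assumes the encodec package and Bark's .npz prompt format with semantic_prompt/coarse_prompt/fine_prompt keys, and "reference.wav" plus its transcript are placeholders):

    import numpy as np
    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio
    from bark.generation import generate_text_semantic, preload_models

    preload_models()

    # 1. Encode the reference clip with the codec Bark uses (EnCodec, 24 kHz).
    codec = EncodecModel.encodec_model_24khz()
    codec.set_target_bandwidth(6.0)  # 8 codebooks, matching Bark's prompts
    wav, sr = torchaudio.load("reference.wav")  # placeholder clip
    wav = convert_audio(wav, sr, codec.sample_rate, codec.channels).unsqueeze(0)
    with torch.no_grad():
        frames = codec.encode(wav)
    codes = torch.cat([f[0] for f in frames], dim=-1).squeeze(0).numpy()  # (8, T)

    # 2. The semantic half is just Bark's own text-to-semantic stage run on a
    #    transcript of the clip -- it carries the words, not the speaker.
    semantic = generate_text_semantic("Transcript of reference.wav goes here.",
                                      history_prompt=None, temp=0.7)

    # 3. Save the pair as a fake "last generation". The EnCodec codes were
    #    never produced by Bark's coarse model, so they don't line up with the
    #    semantic tokens -- hence the static and the robot clones.
    np.savez("cloned_voice.npz",
             semantic_prompt=semantic,
             coarse_prompt=codes[:2, :],
             fine_prompt=codes)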
> I’m sure somebody will train a model that actually maps an input text to the Bark semantic representation,
But, clearly, that’s already part of the Bark model, so in the abstract one should be able to leverage the existing model with appropriate code to do that, rather than developing a new model. This seems too simple, but…isn’t that just “generate_text_semantic” from the existing source (with None as the history_prompt, since you don’t want it in the context of some pre-existing speaker)?
EDIT: Looking at the SERP voice clone, that’s what they are doing. The one thing that I’m intuitively skeptical about (and this is way out of the kind of programming I do normally, so I could be way off) is that the temp they use is kept at the level normally used for synthesis (0.7). I’d think you’d want the temp low, since you’d want generating a baseline for a new speaker to be more deterministic than generating content from an existing speaker.
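In code terms, the experiment I’m imagining is just this (same generate_text_semantic call from the Bark source; the transcript string is a placeholder and 0.3 is an arbitrary stand-in for “low”):

    from bark.generation import generate_text_semantic, preload_models

    preload_models()

    transcript = "Text matching the reference audio."
    semantic_prompt = generate_text_semantic(
        transcript,
        history_prompt=None,  # no pre-existing speaker in the context
        temp=0.3,             # vs. the 0.7 synthesis default
    )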
Leverage it, sure, but you still basically need a model that does the opposite. You can't, say, take OpenAI Whisper, which turns speech into text, and just run the model backwards to generate audio. I mean, you probably could with some work, but not out of the box.
If it’s as good as you say, set up a sponsorship goal and we’ll contribute. 11 is massively expensive to embed in an application, so anything that’s lightweight enough to be self-hosted and can produce 11 Labs-like output has my dollars.
Suno is kind of underselling this with their default voices. With just a little effort you get great stuff: very emotive, fantastic cadences. Here's my female David Attenborough I was playing around with yesterday. Not the clearest, but charming.
And Bark is more than TTS. While I haven't had much success with one-shotting full songs with Bark, you can build a decent sample library with it. A few recent ones I saved, just randomly concatted here:
Yeah, I was trying to figure out how good it was in Korean. The cadence and flow were pretty good, but there were some artifacts in the audio. Then I checked the samples of the default audio prompts for Korean and my god, they were godawful. Switching it up made a world of difference.
>Yeah, I was trying to figure out how good it was in Korean. The cadence and flow were pretty good, but there were some artifacts in the audio. Then I checked the samples of the default audio prompts for Korean and my god, they were godawful. Switching it up made a world of difference.
I have a decent number of really clear Korean voices, BTW. That and French, from people asking on Discord. But I can't judge the accents, only that they are clear speakers.
Interestingly, Korean was the lone somewhat-coherent one-shot long-form music sample I ever managed to get out of Bark.
The second music bit in this YouTube video was one continuous generation where the last generation was used as the history for the next, with no cherry-picking or assembling of clips, just one solid segment. And it sort of holds together for almost a minute!
I was excited because I thought maybe Bark could be like a real-time OpenAI Jukebox. But that was literally the only time so far where full feedback held together like that. You can kind of 'cheat' it by using a very popular song as the input text, and sometimes Bark will produce the appropriate melody. But of course that's not really the point of using your own text. I have some ideas for making it more coherent, but nothing easy. Too bad, Jukebox is just SO SLOW.
Actually, with what I know now I should re-render this and clean up the distortion. At the time I couldn't do it. Though I only have the first segment's prompt.
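For anyone who wants to try the full-feedback trick, the loop is basically this (a sketch; it assumes a Bark version where generate_audio accepts output_full=True and a full-generation dict as the history prompt, and the lyric chunks are placeholders):

    import numpy as np
    from bark import generate_audio, preload_models

    preload_models()

    # Each generation's full token history seeds the next one.
    chunks = ["♪ first chunk of lyrics ♪",
              "♪ second chunk of lyrics ♪",
              "♪ third chunk of lyrics ♪"]

    history = None
    pieces = []
    for text in chunks:
        history, audio = generate_audio(text, history_prompt=history,
                                        output_full=True)
        pieces.append(audio)

    song = np.concatenate(pieces)  # one continuous, never-cherry-picked take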
>I'm curious, how did you generate the David Attenborough voice? The repo says:
>> Bark tries to match the tone, pitch, emotion and prosody of a given preset, but does not currently support custom voice cloning
Check back later in the week; I'll have a bit more on that after I catch up on actual work and can write a bit.
Yeah, it's kind of hand crafted. There's more to the story and more results. I would normally just tweet, but I think it's actually so interesting that it deserves more than a tweet, at least a thoughtful writeup or a YouTube video. (And I need to catch up on real work this week first, so end of week at best.)
Replying to myself in an old thread as a little easter egg. Bark is just so fun I can't resist a teaser.
Suno is seriously underselling the power of the fully operational Bark model. I'm already cranking out "French Obamas" and I didn't know anything about TTS a month ago. Heck, I still barely know anything. (Obama has been annoyingly resistant to gender flipping though.)
The hardest part still seems to be getting rid of the synthetic high-pitched background noise. I've tried most of the text-to-voice synthesizers, and ElevenLabs is still my benchmark.
I think Bark actually beats Eleven right now. Bark tends to add a bit more of a metallic twinge toward the end of longer audio segments (I wonder if this is a bug and not a limitation, actually...), but Bark is more expressive.
The trick to exceeding Eleven quality is using multiple speaker prompts and swapping them in and out over the course of longer texts. Each is a slight variation on the original. If you hand-tune the swaps it's super good, but even if you just swap in a 'beginning of new paragraph' or 'continuation' speaker automatically, that alone is a huge boost (rough sketch below).
I don't have a clip handy, but I will make a YouTube video or something this week, because the quality is wild if you do this. There are some passable audio clips on my README here https://github.com/JonathanFly/bark but all of those just use the same speaker for every line. That was all just the first samples I tried, from weeks ago, without really putting any effort into it. You can do a lot better.
So for a production or real-time use case, Eleven makes more sense. For example, even my best speakers will switch to a new voice mid-prompt once in a while. (As if the audio clip came from an interview segment.)
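The automatic version of the swap is roughly this (a sketch; the two .npz files are hypothetical hand-made variants of the same base speaker, and the sentence splitter is just a regex):

    import re
    import numpy as np
    from bark import generate_audio, preload_models

    preload_models()

    FRESH = "narrator_fresh.npz"        # hypothetical: variant for paragraph starts
    CONTINUATION = "narrator_cont.npz"  # hypothetical: variant for everything else

    def read_long_text(long_text):
        pieces = []
        for para in long_text.split("\n\n"):
            sentences = re.split(r"(?<=[.!?])\s+", para.strip())
            for i, sentence in enumerate(sentences):
                voice = FRESH if i == 0 else CONTINUATION
                pieces.append(generate_audio(sentence, history_prompt=voice))
        return np.concatenate(pieces)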
Thanks, I checked more of the Bark examples and they definitely add a "personal" touch; ElevenLabs sounds more stale, so I definitely see what you mean. I also listened a few more times to your sample, and it's one of the hardest types of voices to get to sound natural. When I tried a similar one on ElevenLabs, it sounded a lot worse in terms of the metallic s-s at the end.
I'll set it up this weekend and play around with it :)
>When I tried a similar one on ElevenLabs, it sounded a lot worse in terms of the metallic s-s at the end.
That's interesting. When I'm judging Bark I'm looking at my own random samples, but for Eleven I'm seeing stuff people post on Twitter or YouTube, which I suppose must be cherry-picked. I didn't even realize Eleven did the same thing!
I guess it's similar to what you mentioned with Bark: the more training and custom adaptation that can be done, the less metallic it will sound. I tested with random voices until I got one that was similar; the existing Bella voice is almost like your sample but has very little metallic s-s.
With some tinkering you can create really interesting stuff with Bark. I managed to generate a couple of song snippets / intros using free-form text [1]
Haven't tested it personally yet, but if you're interested in voice cloning, you might want to check out this fork of Bark [2]
Actually, I was just checking, and Bark isn't that close to maxing out GPU utilization. Running two instances on a 3090 seems to give a throughput increase, and the models fit. Update: and now I'm getting weird CUDA issues. Hmm...
On a laptop with a GTX 1650 Ti (4GB), I’m seeing generation take very roughly twice the duration of the audio with “SUNO_USE_SMALL_MODELS” turned on, and about 4 times the duration with that flag turned off.
Given what people are reporting for high-end cards, that’s a much smaller gap than I expected, which seems to underline the reports that it’s not fully utilizing higher-end cards.
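(For reference, the flag is an environment variable, and the usual pattern is to set it before importing bark, since the module reads it at import time:)

    import os
    os.environ["SUNO_USE_SMALL_MODELS"] = "True"  # read when bark is imported

    from bark import generate_audio, preload_models
    preload_models()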
The way it tends toward context-sensitive shifts in delivery is kind of amazing.
E.g., using this text prompt:
You come to my home and ask this?
Who am I? WHO [laughs] AM [laughs] I?!
I am the Artificial Intelligence!
It seems quite prone to shift from a more natural human-sounding voice to a similar-tone but over-the-top artificial one for the last sentence (smoothly transitioning, usually, too), as if the speaker were an AI dramatically breaking cover.
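(If you want to reproduce it, this is just the standard README usage with the prompt above; the preset voice is an arbitrary pick:)

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()

    prompt = """You come to my home and ask this?
    Who am I? WHO [laughs] AM [laughs] I?!
    I am the Artificial Intelligence!"""

    audio = generate_audio(prompt, history_prompt="v2/en_speaker_6")
    write_wav("who_am_i.wav", SAMPLE_RATE, audio)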
Oh, that's interesting - I played the Polish bit from the examples page [1] and it's quite heavily accented, like a foreign national learning the language. Do the other non-English examples also sound somewhat off to native speakers?
It's just a characteristic of some of the default voices. I made some perfect French voices by having somebody on Discord check for native accents, because otherwise I can't tell. Once somebody does this for every language it should be all set. (Somebody who isn't me, though; it takes a while, and I'm just playing around a bit, I don't even have a use case for TTS honestly.)
The French example sounds off in my opinion; it seems to mix up its consonants/vowels.
The "OU" sound from "nous" sounds like "EU", or like if the sound was skipped.
The "OH" sound from "sommes" sounds like "EU" (again, but less pronounced).
The "T" sound from "trop" becames "P" ("pro").
I've been following Bark and its forks looking for a stable, working cloning solution (for now, straightforward approaches to generating and using the npz files do not work), and oh boy. I have never seen so much low-quality git activity as in these repos. It really got me worried.
Incredible. To get that level of quality and realism and character in the voices, and to be able to clone the damn thing and play with it. What a time!
Another repost of a safeguarded model. It's just obnoxious how rote the "do no evil" write-up has become, like how Stability doesn't want you to do illegal things with their outputs, or these guys not letting you build a custom voice model.
This is a fair point that should not have been downvoted. I was looking for the training code to train a voice for Arabic text after they removed the language from their list.
Hopefully someone will defuse this model so that we can train the models that others aren't willing to train.
You don't need a court order to be banned from GitHub.
"L.3. GitHub May Terminate
GitHub has the right to suspend or terminate your access to all or any part of the Website at any time, with or without cause, with or without notice, effective immediately. GitHub reserves the right to refuse service to anyone for any reason at any time."
The voices aren’t the model; while the model takes conventional training for which code is not provided, voices are, or at least can be, built by what could be described as “accumulated in-context learning”. Every time you run text with a voice (which can be null) through the inference process, the result is an audio waveform and an updated history prompt.
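Concretely, something like this (a sketch; it assumes the Bark build where generate_audio supports output_full=True, as in the current repo):

    import numpy as np
    from bark import generate_audio, preload_models

    preload_models()

    # Null history in; audio plus an updated token history out.
    voice, audio = generate_audio("This sentence births a brand-new speaker.",
                                  history_prompt=None, output_full=True)

    # That history *is* the voice. Save it and keep reusing (and updating) it.
    np.savez("new_voice.npz",
             semantic_prompt=voice["semantic_prompt"],
             coarse_prompt=voice["coarse_prompt"],
             fine_prompt=voice["fine_prompt"])

    voice, audio = generate_audio("Same speaker, next line.",
                                  history_prompt="new_voice.npz",
                                  output_full=True)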