There's going to be a big update this week with some new stuff I haven't talked about. And a bunch of amazing, clear voices, with a huge variety of styles, that blow the default Suno voices out of the water. Arguably even better than Eleven in some ways. I'm excited even though I have nothing to DO with the voices!
Don't get too attached though. I was just playing around and made a Bark fork and it got more popular than expected. And now I'm dreading a future full of hours of unpaid support and maintenance that I definitely can NOT afford, for a software product I don't even really have a personal use case for. I'm not generating my own audiobooks or anything, I won’t be using it long term myself, I was just curious what Bark could do. (Turns out a LOT more than you might think at first glance, as you'll see this week.) So I'm already trying to work out how I can elegantly wind this thing down and transition people somewhere else. But I'll keep it updated for at least a little while.
Thanks for your fork, I know a lot of folks love it. This might not be the HN style, but I'd suggest a freemium micro-SaaS approach. A lot of people I know want a 1-click solution for audio, and you might have great potential for that. It could add some $$$$ to your bank.
Thanks. I'll consider it, but I haven't deployed a model or service like that before, so it'd probably be more of a second project in itself than a funding mechanism. I was just realizing this morning how far behind I am on paying work from getting overly distracted by Bark lately. And it's a lot, so it was kind of a wake-up call. Though some of that time went to adding new features and trying new ideas (some to be seen later on the public fork); not all of it is support and such. Bark is a wild model, a lot of silly ideas kind of work.
I wish there were an easy way to fine-tune Bark, so we could truly clone our own voices for Bark inference.
Sadly the bark-voice-clone fork doesn't do it. The voices sound nothing like you.
Your Gradio GUI is great, but I don't understand where to copy the cloned .npz files to. Even after refreshing the Gradio GUI, the ClonedVoices don't appear in the Speaker or Generated Speaker dropdowns.
I barely touched that, it's just from the Serp cloning GitHub, but people kept asking so I put it in. Their clone just isn't really doing much, though: it loads up the coarse model with the encoded wav file as a fake last-generation history for the current segment, and that's basically it. But a lot of the voice is in the semantic model, so just doing the coarse doesn't get you much. And even if the semantic model didn't matter as much, the coarse model takes both semantic and coarse tokens as inputs, and your injected coarse tokens aren't going to line up just right with what a true Bark-generated pair of tokens would look like. So what you get is like a robot clone that has the most superficial similarity and lacks the depth of cadence that makes Bark awesome. That's the best case; more often you get voices full of static, or ones that don't even read the text you give them. (To be fair, any Bark voice can occasionally not read the text, it's a risk.)
I'm sure somebody will train a model that actually maps an input text to the Bark semantic representation; it shouldn't be that hard, it's just been a few weeks. But the existing clone there is just so primitive I don't know how it got so popular.
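For the curious, the whole thing boils down to roughly this (a hedged sketch, not their exact code; it assumes the encodec package and Bark's .npz prompt format with semantic_prompt/coarse_prompt/fine_prompt keys, and "reference.wav" plus its transcript are placeholders):

    import numpy as np
    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio
    from bark.generation import generate_text_semantic, preload_models

    preload_models()

    # 1. Encode the reference clip with the codec Bark uses (EnCodec, 24 kHz).
    codec = EncodecModel.encodec_model_24khz()
    codec.set_target_bandwidth(6.0)  # 8 codebooks, matching Bark's prompts
    wav, sr = torchaudio.load("reference.wav")  # placeholder clip
    wav = convert_audio(wav, sr, codec.sample_rate, codec.channels).unsqueeze(0)
    with torch.no_grad():
        frames = codec.encode(wav)
    codes = torch.cat([f[0] for f in frames], dim=-1).squeeze(0).numpy()  # (8, T)

    # 2. The semantic half is just Bark's own text-to-semantic stage run on a
    #    transcript of the clip -- it carries the words, not the speaker.
    semantic = generate_text_semantic("Transcript of reference.wav goes here.",
                                      history_prompt=None, temp=0.7)

    # 3. Save the pair as a fake "last generation". The EnCodec codes were
    #    never produced by Bark's coarse model, so they don't line up with the
    #    semantic tokens -- hence the static and the robot clones.
    np.savez("cloned_voice.npz",
             semantic_prompt=semantic,
             coarse_prompt=codes[:2, :],
             fine_prompt=codes)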
> I’m sure somebody will train a model that actually maps an input text to the Bark semantic representation,
But, clearly, that’s already part of the Bark model, so in the abstract one should be able to leverage the existing model with appropriate code to do that, rather than developing a new model. This seems too simple, but…isn’t that just “generate_text_semantic” from the existing source (with None as the history_prompt, since you don’t want it in the context of some pre-existing speaker)?
EDIT: Looking at the SERP voice clone, that’s what they are doing. The one thing that I’m intuitively skeptical about (and this is way out of the kind of programming I do normally, so I could be way off) is that the temp they use is kept at the level normally used for synthesis (0.7). I’d think you’d want the temp low, since you’d want generating a baseline for a new speaker to be more deterministic than generating content from an existing speaker.
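In code terms, the experiment I’m imagining is just this (same generate_text_semantic call from the Bark source; the transcript string is a placeholder and 0.3 is an arbitrary stand-in for “low”):

    from bark.generation import generate_text_semantic, preload_models

    preload_models()

    transcript = "Text matching the reference audio."
    semantic_prompt = generate_text_semantic(
        transcript,
        history_prompt=None,  # no pre-existing speaker in the context
        temp=0.3,             # vs. the 0.7 synthesis default
    )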
Leverage it, sure, but you still basically need a model that does the opposite. You can't, say, take OpenAI Whisper, which turns speech into text, and just run the model backwards to generate audio. I mean, you probably could with some work, but not out of the box.
If it’s as good as you say, set up a sponsorship goal and we’ll contribute. 11 is massively expensive to embed in an application, so anything that’s lightweight enough to be self-hosted and can produce 11 Labs-like output has my dollars.
Suno is kind of underselling this with their default voices. With just a little effort you get great stuff: very emotive, fantastic cadences. Here's my female David Attenborough I was playing around with yesterday. Not the clearest, but charming.
And Bark is more than TTS. While I haven't had much success with one-shotting full songs with Bark, you can build a decent sample library with it. A few recent ones I saved, just randomly concatted here:
Yeah, I was trying to figure out how good it was in Korean. The cadence and flow were pretty good, but there were some artifacts in the audio. Then I checked the samples of the default audio prompts for Korean and my god, they were godawful. Switching it up made a world of difference.
>Yeah, I was trying to figure out how good it was in Korean. The cadence and flow were pretty good, but there were some artifacts in the audio. Then I checked the samples of the default audio prompts for Korean and my god, they were godawful. Switching it up made a world of difference.
I have a decent number of really clear Korean voices, BTW. That and French, from people asking on Discord. But I can't judge the accents, only that they are clear speakers.
Interestingly, Korean was the lone somewhat-coherent one-shot long-form music sample I ever managed to get out of Bark.
The second music bit in this YouTube video was one continuous generation where the last generation was used as the history for the next, with no cherry-picking or assembling of clips, just one solid segment. And it sort of holds together for almost a minute!
I was excited because I thought maybe Bark could be like a real-time OpenAI Jukebox. But that was literally the only time so far where full feedback held together like that. You can kind of 'cheat' it by using a very popular song as the input text, and sometimes Bark will produce the appropriate melody. But of course that's not really the point of using your own text. I have some ideas for making it more coherent, but nothing easy. Too bad, Jukebox is just SO SLOW.
Actually, with what I know now I should re-render this and clean up the distortion. At the time I couldn't do it. Though I only have the first segment's prompt.
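For anyone who wants to try the full-feedback trick, the loop is basically this (a sketch; it assumes a Bark version where generate_audio accepts output_full=True and a full-generation dict as the history prompt, and the lyric chunks are placeholders):

    import numpy as np
    from bark import generate_audio, preload_models

    preload_models()

    # Each generation's full token history seeds the next one.
    chunks = ["♪ first chunk of lyrics ♪",
              "♪ second chunk of lyrics ♪",
              "♪ third chunk of lyrics ♪"]

    history = None
    pieces = []
    for text in chunks:
        history, audio = generate_audio(text, history_prompt=history,
                                        output_full=True)
        pieces.append(audio)

    song = np.concatenate(pieces)  # one continuous, never-cherry-picked take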
>I'm curious, how did you generate the David Attenborough voice? The repo says:
>> Bark tries to match the tone, pitch, emotion and prosody of a given preset, but does not currently support custom voice cloning
Check back later in the week; I'll have a bit more on that after I catch up on actual work and can write a bit.
Yeah, it's kind of hand crafted. There's more to the story and more results. I would normally just tweet, but I think it's actually so interesting that it deserves more than a tweet, at least a thoughtful writeup or a YouTube video. (And I need to catch up on real work this week first, so end of week at best.)
Replying to myself in an old thread as a little easter egg. Bark is just so fun I can't resist a teaser.
Suno is seriously underselling the power of the fully operational Bark model. I'm already cranking out "French Obamas" and I didn't know anything about TTS a month ago. Heck, I still barely know anything. (Obama has been annoyingly resistant to gender flipping though.)
The hardest part still seems to be getting rid of the synthetic high-pitched background noise. I've tried most of the text-to-voice synthesizers, and ElevenLabs is still my benchmark.
I think Bark actually beats Eleven right now. Bark tends to add a bit more of a metallic twinge toward the end of longer audio segments (I wonder if this is a bug and not a limitation, actually...), but Bark is more expressive.
The trick to exceeding Eleven quality is using multiple speaker prompts and swapping them in and out over the course of longer texts. Each is a slight variation on the original. If you hand-tune the swaps it's super good, but even if you just swap in a 'beginning of new paragraph' or 'continuation' speaker automatically, that alone is a huge boost (rough sketch below).
I don't have a clip handy, but I will make a YouTube video or something this week, because the quality is wild if you do this. There are some passable audio clips on my README here https://github.com/JonathanFly/bark but all of those just use the same speaker for every line. That was all just the first samples I tried, from weeks ago, without really putting any effort into it. You can do a lot better.
So for a production or real-time use case, Eleven makes more sense. For example, even my best speakers will switch to a new voice mid-prompt once in a while. (As if the audio clip came from an interview segment.)
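The automatic version of the swap is roughly this (a sketch; the two .npz files are hypothetical hand-made variants of the same base speaker, and the sentence splitter is just a regex):

    import re
    import numpy as np
    from bark import generate_audio, preload_models

    preload_models()

    FRESH = "narrator_fresh.npz"        # hypothetical: variant for paragraph starts
    CONTINUATION = "narrator_cont.npz"  # hypothetical: variant for everything else

    def read_long_text(long_text):
        pieces = []
        for para in long_text.split("\n\n"):
            sentences = re.split(r"(?<=[.!?])\s+", para.strip())
            for i, sentence in enumerate(sentences):
                voice = FRESH if i == 0 else CONTINUATION
                pieces.append(generate_audio(sentence, history_prompt=voice))
        return np.concatenate(pieces)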
Thanks, I checked more of the Bark examples and they definitely add a "personal" touch; ElevenLabs sounds more stale, so I definitely see what you mean. I also listened a few more times to your sample, and it's one of the hardest types of voices to get to sound natural. When I tried a similar one on ElevenLabs, it sounded a lot worse in terms of the metallic s-s at the end.
I'll set it up this weekend and play around with it :)
>When I tried a similar one on ElevenLabs, it sounded a lot worse in terms of the metallic s-s at the end.
That's interesting. When I'm judging Bark I'm looking at my own random samples, but for Eleven I'm seeing stuff people post on Twitter or YouTube, which I suppose must be cherry-picked. I didn't even realize Eleven did the same thing!
I guess it's similar to what you mentioned with Bark: the more training and custom adaptation that can be done, the less metallic it will sound. I tested with random voices until I got one that was similar; the existing Bella voice is almost like your sample but has very little metallic s-s.
With some tinkering you can create really interesting stuff with Bark. I managed to generate a couple of song snippets / intros using free-form text [1]
Haven't tested it personally yet, but if you're interested in voice cloning, you might want to check out this fork of Bark [2]
Actually, I was just checking, and Bark isn't that close to maxing out GPU utilization. Running two instances on a 3090 seems to give a throughput increase, and the models fit. Update: and now I'm getting weird CUDA issues. Hmm...
On a laptop with a GTX 1650 Ti (4GB), I’m seeing generation take very roughly twice the duration of the audio with “SUNO_USE_SMALL_MODELS” turned on, and about 4 times the duration with that flag turned off.
Given what people are reporting for high-end cards, that’s a much smaller gap than I expected, which seems to underline the reports that it’s not fully utilizing higher-end cards.
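(For reference, the flag is an environment variable, and the usual pattern is to set it before importing bark, since the module reads it at import time:)

    import os
    os.environ["SUNO_USE_SMALL_MODELS"] = "True"  # read when bark is imported

    from bark import generate_audio, preload_models
    preload_models()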
The way it tends toward context-sensitive shifts in delivery is kind of amazing.
E.g., using this text prompt:
You come to my home and ask this?
Who am I? WHO [laughs] AM [laughs] I?!
I am the Artificial Intelligence!
It seems quite prone to shift from a more natural human-sounding voice to a similar-tone but over-the-top artificial one for the last sentence (smoothly transitioning, usually, too), as if the speaker were an AI dramatically breaking cover.
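(If you want to reproduce it, this is just the standard README usage with the prompt above; the preset voice is an arbitrary pick:)

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()

    prompt = """You come to my home and ask this?
    Who am I? WHO [laughs] AM [laughs] I?!
    I am the Artificial Intelligence!"""

    audio = generate_audio(prompt, history_prompt="v2/en_speaker_6")
    write_wav("who_am_i.wav", SAMPLE_RATE, audio)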
Oh, that's interesting - I played the Polish bit from the examples page [1] and it's quite heavily accented, like a foreign national learning the language. Do the other non-English examples also sound somewhat off to native speakers?
It's just a characteristic of some of the default voices. I made some perfect French voices by having somebody on Discord check for native accents, because otherwise I can't tell. Once somebody does this for every language it should be all set. (Somebody who isn't me, though; it takes a while, and I'm just playing around a bit, I don't even have a use case for TTS honestly.)
The French example sounds off in my opinion; it seems to mix up its consonants/vowels.
The "OU" sound from "nous" sounds like "EU", or like if the sound was skipped.
The "OH" sound from "sommes" sounds like "EU" (again, but less pronounced).
The "T" sound from "trop" becames "P" ("pro").
I've been following Bark and its forks looking for a stable, working cloning solution (for now, straightforward approaches to generating and using the npz files do not work), and oh boy. I have never seen so much low-quality git activity as in these repos. It really got me worried.
Incredible. To get that level of quality and realism and character in the voices, and to be able to clone the damn thing and play with it. What a time!
Another repost of a safeguarded model. It's just obnoxious how rote the "do no evil" write-up has become, like how Stability doesn't want you to do illegal things with their outputs, or these guys not letting you build a custom voice model.
This is a fair point that should not have been downvoted. I was looking for the training code to train a voice for Arabic text after they removed the language from their list.
Hopefully someone will defuse this model so that we can train the models that others aren't willing to train.
You don't need a court order to be banned from GitHub.
"L.3. GitHub May Terminate
GitHub has the right to suspend or terminate your access to all or any part of the Website at any time, with or without cause, with or without notice, effective immediately. GitHub reserves the right to refuse service to anyone for any reason at any time."
The voices aren’t the model; while the model takes conventional training for which code is not provided, voices are, or at least can be, built by what could be described as “accumulated in-context learning”. Every time you run text with a voice (which can be null) through the inference process, the result is an audio waveform and an updated history prompt.
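Concretely, something like this (a sketch; it assumes the Bark build where generate_audio supports output_full=True, as in the current repo):

    import numpy as np
    from bark import generate_audio, preload_models

    preload_models()

    # Null history in; audio plus an updated token history out.
    voice, audio = generate_audio("This sentence births a brand-new speaker.",
                                  history_prompt=None, output_full=True)

    # That history *is* the voice. Save it and keep reusing (and updating) it.
    np.savez("new_voice.npz",
             semantic_prompt=voice["semantic_prompt"],
             coarse_prompt=voice["coarse_prompt"],
             fine_prompt=voice["fine_prompt"])

    voice, audio = generate_audio("Same speaker, next line.",
                                  history_prompt="new_voice.npz",
                                  output_full=True)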