Bark – Text-prompted generative audio model (github.com/suno-ai)
416 points by obi1kenobi on April 20, 2023 | 89 comments



Hey, one of the Suno founders/creators of Bark here. Thanks for all the comments, we love seeing how we can improve things in the future. At Suno we work on audio foundation models, creating speech, music, sound effects etc.

Text to speech was a natural playground for us to share with the community and get some feedback. Given that this model is a full GPT model, the text input is merely guidance, and the model can technically create any audio from scratch even without input text, aka hallucinations or audio continuation.

When used as a TTS model, it’s very different from the awesome high quality TTS models already available. It produces a wider range of audio – that could be a high quality studio recording of an actor or the same text leading to two people shouting in an argument at a noisy bar. Excited to see what the community can build and what we can learn for future products.

Please reach out with any feedback, or if you're interested in working on this: bark@suno.ai


This tech will be used by crooks to automate attacks. Generate the language using GPT-4 and the audio using Bark, and then start making phone calls. Because it’s open source, all you need is GPUs. This is not a criticism. I’m impressed and grateful for the openness. Everyone needs to wake up and recognize that these attacks are coming at us essentially right now.


yawn who cares? If it’s an issue, let law enforcement handle it. Everything has nefarious uses. Humanity marches on.


As technology gets more powerful more quickly, and the rule of law becomes more and more unable to prevent societal damage, this response becomes woefully inadequate.

For reference, look at how the societal damage of social networks has been handled: too little, too late. Same goes for RentTech.

But, I don't know the solution. The common computer has become so powerful that we cannot simply rely on inaccessible materials to prevent the danger of overly-powerful tech spreading too fast, as we do with bioweapons or traditional WMDs.

Fight tech with more tech, I suppose.


You'll care when it affects you.

We want to go back to New York in the 90s? Petty theft everywhere?


The technology already exists; shutting down a company and pretending that it doesn't exist doesn't solve the problem.


Like kitchen knives, which are used to end unhappy relationships. Is this really an argument we should be having? Wouldn't you feel silly presenting such an argument to a household knife maker?


That comparison holds up better if you imagine a world without knives or sharp objects of any kind. Now you can suddenly do tremendous harm by wielding a pointy stick. I don't think it's reaching to point out the dangers you just introduced.

With great power comes .... ? Profit?


Except that world used to exist, back in the stone age, and we're all far better off now because we didn't choose to live in fear of the misuse of powerful tools.


I wouldn’t want to be around the first few guys with pointy weapons, but I am sure you would be fine.


Recommend encoding sub-audible tracking symbols in synthesized speech patterns that include GPS, IP, timestamp, country of origin. We do that already in bootleg movies, so we can apply similar methods to synthesized speech.


It's open source, and even if it wasn't, it would likely be pretty easy to remove those (not that you would get GPS) from an audio file.


Hmm, I don't know about that. The data rate used by civilian GPS (L1 C/A) is only 50 bps. The symbols are normally spread over a couple MHz of bandwidth to make it possible to recover at levels below the thermal noise floor. I see no reason why the same thing couldn't be done at baseband, adding an imperceptible bit of extra noise to an audio signal.

Of course, you wouldn't encode real-time navigation data, but a small block of identifying text. Either way, though, someone without a copy of the spreading code isn't going to notice it or decode it. Given enough redundancy in both the time and frequency domains, removing it wouldn't be easy either.
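To make the idea concrete, here's a toy numpy version of the embed/extract path (chip count, level and seed are all illustrative; a real scheme would also whiten the host signal and add error-correction/redundancy):

    import numpy as np

    CHIPS_PER_BIT = 65536  # roughly 2.7 s of 24 kHz audio per payload bit

    def spreading_code(n_bits, seed=1234):
        # the seed is the shared secret; without it you can't despread the mark
        rng = np.random.default_rng(seed)
        return rng.choice([-1.0, 1.0], size=(n_bits, CHIPS_PER_BIT))

    def embed(audio, bits, level=3e-3):
        chips = spreading_code(len(bits))
        mark = ((2 * np.asarray(bits)[:, None] - 1) * chips).ravel()  # BPSK, spread wide
        out = audio.copy()
        n = min(len(out), len(mark))
        out[:n] += level * mark[:n]  # near-inaudible additive noise
        return out

    def extract(audio, n_bits):
        # assumes the clip is long enough to carry all n_bits
        chips = spreading_code(n_bits)
        frames = audio[: n_bits * CHIPS_PER_BIT].reshape(n_bits, CHIPS_PER_BIT)
        corr = (frames * chips).sum(axis=1)  # correlate with the secret code to despread
        return (corr > 0).astype(int)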


The real problem is that bad actors would simply encode some other person's coordinates/metadata into the recordings they produce, and we'll have been trained by then to blindly accept the presence of these markers as strong evidence of guilt.


Nothing stops you from using closed source solutions for that.


How are the voices determined? Is there an option or is it just random/based on the prompts like "WOMAN"?


Amazing work so far! Do you have any sense about how difficult it would be to enable M1/M2 or CoreML support?


thanks, the model itself is a pretty vanilla gpt model based heavily on karpathy's nanogpt, so should not need too many bells and whistles to get it running on specific architectures. that said i have very little experience with platform specific development, so would looove some help from the community :)


Could you stick a torch.compile in the inference and training code, maybe gated behind a flag? This should help AMD/Nvidia performance (and probably other vendors soon) significantly.

PyTorch themselves used nanoGPT training as demo for this: https://pytorch.org/blog/accelerating-large-language-models/
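FWIW the gating itself could be tiny; something along these lines (the flag name is made up, and torch.compile needs PyTorch >= 2.0) would leave default behavior untouched:

    import os
    import torch

    def maybe_compile(model: torch.nn.Module) -> torch.nn.Module:
        # opt-in only, so nothing changes unless the (hypothetical) env var is set
        if os.environ.get("SUNO_TORCH_COMPILE", "0") == "1":
            return torch.compile(model)
        return model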


A serious nod to Karpathy here. They could have chosen any other Transformer architecture, but chose perhaps the most reachable one - in the literal sense.


Would the same apply to a GGML port or are the architectures too different?


I like the emphasis tags, it's something that is not seen with a lot of these transformer models.... things like [laughs] make a lot of sense... I could see where hundreds or possibly thousands of emphasis-style tags could be added to support a vast array of intonations in human speech, e.g. [yells], [shouts], [cries], [crying], [whispers], [sarcasm], etc.
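For example, a minimal sketch following the README's usage (which tags beyond [laughs] the model actually honors is worth double-checking against the repo):

    from bark import generate_audio, preload_models

    preload_models()
    text_prompt = "Well... [laughs] I did not expect that to work. [sighs] But here we are."
    audio_array = generate_audio(text_prompt)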


Very cool. Side note: bark-gpt.com is already taken for a dog translator: "The world’s first AI powered, real-time communications tool between humans and their furry best friends."[0] I only know this because my law firm partner's name is Bark, and I wanted to automate some legal work and name the software "Bark GPT" after him.

[0] https://www.bark-gpt.com/


This is an April Fools' joke[1] and a really good one! We're doing this for real: https://sarama.app.

Reference:

1. https://www.laika.berlin/en/blog/new-client-barkgpt-ai-dog-b...


I want to see the training set for this


Honestly, you need annotated datasets for barks, and we're building this up with our users. Otherwise it's unclear what's happening even with video (we tried to build a YouTube/Instagram scraper and it was very unreliable, even with vet tech annotations).


Woof!


Usually I can tell whether something is a parody/joke website, but here I am struggling.


It's a genius idea. As long as it tells owners what they want to believe their pet says, those guys will make a fortune.


I already know when my dog needs to eat, drink, shit or pee, or wants to go for a walk or play, because he usually will tell me.

Choosing to ignore your dog won't change because some magical AI can now translate it to "I'm fine, I'm only barking because you're an awesome being, keep your subscription humaaan".


But can your dog (translator) say "I love you" in a doggy voice? Replika proves people will pay for this and convince themselves it's real, because they want it to be.

Bark-GPT's VC pitch: "Replika for real dogs"


Your app is in a bad place when fraud is part of your pitch, don't you think? …


Am I hallucinating, or did several of the examples have background audio artifacts, like it's been trained on speech with noisy backgrounds? I'm guessing audio from movies paired with subtitles. Having random background audio can make it quite hard to use in production.


>Am I hallucinating, or did several of the examples have background audio artifacts, like it's been trained on speech with noisy backgrounds? I'm guessing audio from movies paired with subtitles. Having random background audio can make it quite hard to use in production.

The other side of that problem is an opportunity. It's why the same model can also generate music, background noise and sound effects, just by having the prompt specify those things explicitly. The input is truly semantic, so the output is rich and reflects that context. If your input text sounds like it came from a speech, there's a high chance your output audio will sound like a megaphone in a public space, with crowd reactions and maybe even applause.


I hear it too. I don't know if it's just background noise though. May be quality issues with the audio synthesis.


yeah sometimes there are definitely artifacts. technically they can be removed pretty easily with another model (like the denoiser from FB), but for now we wanted to keep it simple and learn to control these things better through prompt engineering. when using a high quality input prompt it generally continues with high quality


At least in the last example, with the man and woman and the expensive oat milk, the background noise seemed to fit a likely public conversation scenario. I wasn't sure if it was accidental or not.


Man I know this is HN, and I know we have a certain decorum we should be maintaining, but with the recent activity in this field the most appropriate response to these posts is "4bit when?" or "f16 when?". Not sure which one is applicable. I am having no luck running it on a 6GB VRAM GPU, so I guess it's the 16-bit floating point one.


related to this - to those releasing models, it would be great if you could share how much VRAM is required (it seems very common for this key piece of info to be missing).


I'm successfully running it on a 12GB GPU (while it downloads some 12.1GB of model data on first run, the highest GPU memory usage was ~6.5GB, settling back down to around 5GB), however the results are nothing like the samples given on the github page. Using the exact code given and in the runs I've tried the results are rather terrible.

I'm not being negative -- some of the samples are really neat on their page -- and I know there is some idiosyncrasy of my setup that is causing issues, though it is a pretty typical conda + pytorch with CUDA 11.8.

Playing with the text and waveform temp from their defaults 0.7 is yielding some semi-decent results, but it feels essentially random.


A few years back someone had the genius idea of making a robo-answering phone program that would give vague but encouraging replies when it received an unsolicited sales pitch. Although it was a fixed sequence of responses, it fooled some callers for a surprisingly long time.

https://www.youtube.com/watch?v=XSoOrlh5i1k

Someone needs to create the plumbing to capture speech to text, feed it to a GPT script that has been told how to reply to such call-center calls, then send that back through a TTS generator like this one.

To overcome any latency issues, it could build in a ploy to buy time like the old script did, e.g. make the robo-answerer sound like a somewhat addled old man who has to think before each reply, perhaps prefixing responses with "hmm, ahh, ..." while the real reply is being generated.
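A rough sketch of that loop, with get_caller_audio(), play(), transcribe() and chat_reply() as hypothetical stand-ins for the telephony plumbing, STT engine and LLM call; generate_audio is Bark's own entry point:

    from bark import generate_audio, preload_models

    preload_models()
    STALL = "Hmm... ahh, let me think about that for a second."

    def answer_call(get_caller_audio, play, transcribe, chat_reply):
        history = []  # running conversation passed to the LLM each turn
        while True:
            caller_audio = get_caller_audio()
            if caller_audio is None:  # caller hung up
                break
            play(generate_audio(STALL))  # addled-old-man delay while we think
            history.append({"role": "user", "content": transcribe(caller_audio)})
            reply = chat_reply(history)  # LLM prompted to string the caller along
            history.append({"role": "assistant", "content": reply})
            play(generate_audio(reply))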


It seems like a lot of the entries in TTS are either closed-source SaaS apps or something like this with limitations on customizing it. It seems clearly inevitable, and likely only months away, that a high quality unrestricted open source option for things like voice cloning will emerge, so I'm not sure why these projects are even really bothering trying to stop it. I think in order for TTS to have its Stable Diffusion moment it will just be a matter of an unrestricted, easily trainable open source model.


>> i'm not sure why these projects are even really bothering trying to stop it.

CYA aka https://en.wikipedia.org/wiki/Cover_your_ass

also it still requires tons of money to run, so it's likely only businesses will do it


It takes about the same resources to run as Stable Diffusion or LLaMA. You don't even need a GPU.


No, I mean the power and money to run high quality servers like ChatGPT-4, Midjourney or DALL-E. Sure, there are local alternatives, but they're lower quality and lower bandwidth and not being used as a business proposition.

Just like a private individual can own locksmith tools and play around with locks... but don't go basing your business on supplying them wholesale to the general public worldwide for free.


Local LLMs are in production as the backbone of highly valued businesses. That will increasingly be true with the advent of commercially licensed foundation models currently in training such as StableLM and RedPajamas.


> Local LLMs are in production as the backbone of highly valued businesses.

That surprises me. Would you be able to point some out to me?

From what I've used of local LLMs they seem nearly unusable.


Is there likely to be a way to stream this audio in the future? As in, here's an incoming stream of text, generate the audio on-the-fly instead of all at once.
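In the meantime a naive workaround is chunking by sentence and generating each piece as it arrives; a rough sketch, with play() standing in for whatever audio sink you have:

    from bark import SAMPLE_RATE, generate_audio, preload_models

    preload_models()

    def speak_incrementally(sentences, play):
        # generate and hand off one sentence at a time instead of the whole text at once
        for sentence in sentences:
            play(generate_audio(sentence), SAMPLE_RATE)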


Great news! It's astounding how quickly technology is advancing. Only yesterday, I was wondering about when a new model for text-to-speech would be developed, and today a game-changing model has been released! This new model is simply incredible!


Any idea what the training data for this is? Looking at the model, it looks like it is literally just copy-paste from Karpathy's nanoGPT, so the training data is what's most interesting. Pretty amazing anyway.


I found a secret demo page that shows in real time how they assess any sound file's mood swings along with number of detected laughs, coughs, etc. Guessing that ability is involved somehow.



Imagine Torvalds saying the same in the context of linux- 'to mitigate misuse of this technology, we limit the audio history prompts to a limited set of Suno-provided...'


Ok, the German example caught me -- it's too real. "But maybe it would be faster if..."


On the other hand, Russian was disappointing. It put the stress in one word incorrectly (it confused the grammatical form, using the genitive case instead of the accusative) and in general sounded strange.


The fact that this is open source and can generate more than just speech is really nice, but for speech itself, it's much lower quality than what Eleven Labs provides.

All the open source models I've seen so far have this weird kind of neural fuzziness to them. I don't know what Eleven does better, but there's definitely a big difference.


Bark's readme points out that to access the "larger model" you'd have to email them.

I guess the "open" part of it is mostly for marketing.


Seems like it's doable to fix it in post, but I guess nowadays we're all about just shoving everything into the model


Good naming choice after OpenAI's Whisper for speech-to-text. Bark is a fitting name for text-to-speech.


"However, to mitigate misuse of this technology, we limit the audio history prompts to a limited set of Suno-provided, fully synthetic options to choose from for each language."

Isn't this open source and can be easily removed or am I missing something?


Yes, it seems to be enforced by a few assert statements in the code.


Have you seen any information on how to create the .npy files for custom voices, though? That's where I hit a wall. There must be some process that takes a .wav or similar audio file and creates the (relatively small) history prompt data.

Edit: looking at https://www.reddit.com/r/singularity/comments/12udgzh/bark_t... now.


Some of it is very impressive although some of it seems about equal to the TTS built into my phone. How long until someone can package this up and make a program that takes in epubs and spits out mp3s?


I tried it. It seems to hallucinate easily. The generated audio doesn't match the text I provided.

It seems to be easily reproducible if I specify a non-existing speaker?

audio_array = generate_audio(text_prompt, 'en_speaker_3')
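For context, the rest of my snippet follows the README (imports and preload as documented there); only the text and the speaker string are mine:

    from bark import SAMPLE_RATE, generate_audio, preload_models

    preload_models()  # downloads and caches the models on first run
    text_prompt = "Hello, this is a quick test."  # placeholder text
    audio_array = generate_audio(text_prompt, "en_speaker_3")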


I hope you reconsider the misuse mitigation, I'm trying to clone my voice, but so far the other tools weren't that great ...


The misuse mitigation is 3 assert statements in one file. It's pretty easy to comment out.


Slightly trickier is how to generate a "history" file that will allow the model to use your voice.


“Last nights Phish performance is proof that God loves us.” But in Portuguese.

https://suno-ai.notion.site/Bark-Examples-5edae8b02a604b54a4...


Does it sound fairly robotic/static-y to anyone else or just me? Doesn't sound any better than any other TTS software I've tried and in fact sounds a bit worse, like it's noisy.


The Spanish example (Miguel) is really bad.


it's more meant to show code switching. more examples here: https://suno-ai.notion.site/Bark-Examples-5edae8b02a604b54a4...


The Polish one is very realistic, on the other hand.


Just about the only unrealistic thing is it recommending Szczecin's old town.


Well, I can see it becoming sexy soon.


Soon? Is the model filtered/censored?


From the readme:

>Bark has the capability to fully clone voices - including tone, pitch, emotion and prosody. The model also attempts to preserve music, ambient noise, etc. from input audio. However, to mitigate misuse of this technology, we limit the audio history prompts to a limited set of Suno-provided, fully synthetic options to choose from for each language.

It's not immediately clear how the audio history prompts are created.


I don't know how they're made exactly, but one can just edit the code a bit and delete the restriction to just the given audio history prompts. It's literally just enforced, in effect, with a simple "assert" statement.


history prompts are just unconditionally generated TTS from the same model. any of those can be used as history, but for convenience 10 are provided for each language (to generate things with consistent voices)


So the history prompts are collections of text/audio pairs?


history is semantic, coarse and fine tokens. so it's essentially the same thing that's getting generated, just used as an input before the generation
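concretely, each provided prompt file seems to bundle those three token arrays; a quick way to peek, assuming the prompts ship as numpy archives in the repo's assets (the path and file name here are guesses):

    import numpy as np

    prompt = np.load("bark/assets/prompts/en_speaker_3.npz")
    for key in prompt.files:  # expecting the semantic, coarse and fine token arrays
        print(key, prompt[key].shape)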


So how do you clone an existing speaker's voice? That's the part I don't get.


Can someone briefly explain how a model for a specific language is made for this tool?


How do I save the audio array to a file on the file system?


There appear to be several ways to do it, but what worked for me was:

   from scipy.io.wavfile import write as write_wav
   write_wav("c:\\wherever\\whatever.wav", SAMPLE_RATE, audio_array)
If you want to play it right away, you can use:

   from playsound import playsound
   playsound("c:\\wherever\\whatever.wav")
What I wasn't able to do was make torchaudio play it directly, without creating an intermediate .wav file. Apparently there is a live-playback backend on the Mac, but not on any other OS.
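One cross-platform route that skips the temp file, assuming the third-party sounddevice package (PortAudio bindings) is an acceptable dependency:

    import sounddevice as sd

    sd.play(audio_array, SAMPLE_RATE)  # audio_array / SAMPLE_RATE as in the snippet above
    sd.wait()  # block until playback finishes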


this is incredible


Hello how are you doing


Excellent work! Hope to see a version that has a friendlier commercial license in the future (the current version is CC BY-NC 4.0).


WTH!!!


Awesome stuff! Can’t wait to see where this company goes



