Hacker News
OpenVoice: Versatile Instant Voice Cloning (arxiv.org)
399 points by saeedesmaili on Jan 1, 2024 | 192 comments



I commend the authors on making this easy to try! However, it doesn't work very well for me for general voice cloning. I read the first paragraph of the Wikipedia page on books and had it generate the next sentence. It's obviously computer generated to my ear.

Audio sample: https://storage.googleapis.com/dalle-party/sample.mp3

Cloned voice (converted to mp3): https://storage.googleapis.com/dalle-party/output_en_default...

All I did was install the packages with pip and then run "demo_part1.ipynb" with my audio sample plugged in. Ran almost instantly on my laptop 3070 Ti / 8GB. (Also, I admit to not reading the paper, I just ran the code)


> It's obviously computer generated to my ear.

From the README

    Disclaimer

    This is an open-source implementation that approximates the performance of the internal voice clone technology of myshell.ai. The online version in myshell.ai has better 1) audio quality, 2) voice cloning similarity, 3) speech naturalness and 4) computational efficiency.


So this paper is a thinly veiled ad of myshell.ai's services?


Yes. And I used myshell.ai out of interest. It’s also absolutely terrible.


I went through downloading the open source version yesterday and tried it with my voice in the microphone, and a few other saved wav files.

It was terrible. Absolutely terrible. Like, given how much hype I saw about this, I expected something half decent. It was not. It was bad, so bad bad bad.

I was thinking maybe I did something wrong, but then I watched some of the youtube reviews - these guys were SO excited at the start of the video and then at the end, they all literally said, "Uh, well, you be the judge"

I still can't help but feel there's some kind of trick to it - get the right input sample, done in the right intonation, and maybe you can generate anything


I came here just for your comment. Thank you for doing this work so the rest of us don't have to.


Like 50% of arxiv. SV figured out that people read papers in 202x, not PRNewsWire and have adjusted accordingly.


Not totally unexpected unfortunately. Any other OSS players on the market?


RVC


Thanks for the real example. Sounded quite generated to my ear as well. Wonder if it can do any better with more source material.


Looking at the website and the examples, it's pretty clearly set up to make stylized anime voices.


This is the driver for a lot of things. Anime. x264 was to enable better compression of weeb videos. This tech will allow fan dubs to better represent the animes in the videos.


Anime also drove the development of a lot of subtitling technology if I remember correctly.


My experience with other tools like xtts is you really need to have a studio-quality voice sample to get the best results.


The most obvious problem to my ears is the syllable timing and inflection of the generated speech, and, intuitively, this doesn’t seem like a recording quality issue. It’s as if it did a mostly credible job of emulating the speaker trying to talk like a robot.


The biggest trip-up is the pronunciation of "prototypically", and you had "typically" in your original. Maybe it's overfitting to a stilted proto-typically? Could try with a different, less similar sentence


That might be the next big contribution – performance in perceptually catching the features of a not-so-good recording – for example, with a webcam style microphone.


Is it possible to use this (or Eleven Labs) to generate a voice model to plug into an Android phone's TTS?

I have a friend with a paralysed larynx who is often using his phone or a small laptop to type in order to communicate. I know he would love it if it was possible to take old recordings of him speaking and use that to give him back "his" voice, at least in some small measure.


Your friend can take a look at solutions from Acapela [0], SpeakUnique [1] or VOCALiD [2]. Not sure whether they have a solution for Android though.

I recently saw a video from Google about a custom voice they created for somebody with ALS but I can't seem to find it online (Does anybody have a link?). Creating custom voices is not yet available on Android though. The latest iOS release (iOS 17) does support creating personalized voices.

ModelTalker [3] is a long-term (research?) project to create custom voices for people with speech disabilities. Their TTS seems to support Android so that might be another option.

[0] https://www.acapela-group.com/ [1] https://www.speakunique.co.uk/ [2] https://vocalid.ai/ [3] https://www.modeltalker.org/


Hi and thanks for the suggestions. Looking through them, it looks like you need to do what they call "voice banking" before you lose your voice. Basically reading a script they provide.

Unfortunately my friend's voice is too far gone for that to be possible. Hoping for something where they can use old recordings to generate a voice.


From my tests I think Audiobox from Meta is the most promising (even better than Eleven Labs) - too bad it's closed source and they force you to read some randomly generated sentences (to prevent the case of someone generating a cloned voice without consent).

Right now Eleven Labs is your best bet.

xTTS is just not there quality wise. The version available in the studio is marginally better than the OSS version but it's still pretty far from being believable.

The non-nerfed version of Tortoise (the author decided to ruin their own project but forks exist) was decent at voice cloning but it takes a lot of tries.

I'm pretty sure we already have the technology to do what you want and help your friend, it's just a matter of time until it gets better and more software comes out.



Some of the recent transformer models can work with audio clips just a few seconds long. I'm sure the final output is less good, but perhaps your friend has audio clips that would work for that from e.g. home movies.


Alas, no (made some contributions to TTS at G, and worked on Android).

iOS has this built in :/ which may bode well, there's no greater Google product manager than "whatever apple just shipped."

I'm doing some xplatform on device inference stuff (see FONNX on GitHub) and it'll be one of 100 items that'll stick on my mind for a while, I hope I find time and I'll try to ping you

Edit: is an Android app with a keyboard and "speak" button that does API calls to eleven labs sufficient for something worth trying?


> I'll try to ping you

Thanks. Use the email address in my profile if anything eventuates.

> is an Android app with a keyboard and "speak" button that does API calls to eleven labs sufficient for something worth trying?

Maybe. Obviously something with local processing would be preferred, but it might be an option when internet connectivity is good. Is there such an app?


There isn't an ElevenLabs app like that, but I think that's the most expedient method, by far. (i.e. O(days) instead of O(months))

(warning: detailed opinionated take, I suggest skimming)

Why? Local inference is hard. You need two things: the clips to voice model (which we have here, but bleeding edge), and text + voice -> speech model.

Text to voice to speech, locally, has excellent prior art for me, in the form of a Raspberry Pi-based ONNX inference library called [Piper](https://github.com/rhasspy/piper). I should just be able to copy that, about an afternoon of work! :P

Except...when these models are trained, they encode plaintext to model input using a library called eSpeak.

eSpeak is basically f(plaintext) => ints representing phonemes.

eSpeak is a C library and written in a style I haven't seen in a while and depends on other C libraries. So I end up needing to port like 20K lines of C to Dart...or I could use WASM, but over the last year, I lost the ability to be able to reason through how to get WASM running in Dart, both native and web.
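
To make the f(plaintext) => ints idea concrete, here is a toy sketch in Python. It does not call eSpeak or Piper: the two-word lexicon, the phoneme symbols, and the ID table are all made up for illustration, and a real phonemizer handles arbitrary text with language-specific rules.

```python
# Toy illustration only: a hypothetical lexicon and ID table, not eSpeak's data.
TOY_LEXICON = {
    "hello": ["h", "ə", "l", "oʊ"],
    "world": ["w", "ɜː", "l", "d"],
}

# Each phoneme symbol gets an integer ID by its position in this list.
PHONEME_IDS = {p: i for i, p in enumerate(["_", " ", "h", "ə", "l", "oʊ", "w", "ɜː", "d"])}


def text_to_ids(text: str) -> list[int]:
    """Map plaintext to a flat list of phoneme IDs, with a space ID between words."""
    ids: list[int] = []
    for n, word in enumerate(text.lower().split()):
        if n > 0:
            ids.append(PHONEME_IDS[" "])
        # Words missing from the toy lexicon raise KeyError; a real
        # phonemizer falls back to letter-to-sound rules instead.
        for phoneme in TOY_LEXICON[word]:
            ids.append(PHONEME_IDS[phoneme])
    return ids


print(text_to_ids("Hello world"))
```

The point of the sketch is the shape of the function: the model never sees characters, only the integer sequence, which is why the same phonemizer has to be available wherever inference runs.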

Re: ElevenLabs

I had looked into the API months ago and vaguely remembered it was _very_ complete.

I spent the last hour or two playing with it, and reconfirmed that. They have enough API surface that you could build an app that took voice recordings, created a voice, and then did POSTs / socket connection to get audio data from that voice at will.

Only issue is pricing IMHO, $0.18 for 1000 characters. :/ But this is something I feel very comfortable saying wouldn't be _that_ much work to build and open source with a "bring your own API key" type thing.

I had forgotten about Eleven Labs till your post, which made me realize there was an actually meaningful and quite moving use case for it. All of Eleven Labs' advantages (cloning, peak quality by a mile) come into play, and the disadvantages are blunted: local voice cloning isn't there yet, and $0.18 / 1000 characters doesn't matter as much when it's interpersonal exchanges instead of long AI responses.
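
As a rough back-of-the-envelope check on the pricing point, here is a small Python sketch. Only the $0.18 per 1000 characters figure comes from the comment above; the utterance length and the number of utterances per day are assumptions picked purely for illustration.

```python
PRICE_PER_1000_CHARS = 0.18  # USD, the figure quoted in the comment


def estimate_cost(chars: int, price_per_1000: float = PRICE_PER_1000_CHARS) -> float:
    """Return the estimated cost in USD for synthesizing `chars` characters."""
    return chars / 1000 * price_per_1000


# Assumed usage (hypothetical): 200 spoken utterances a day, about 80 characters each.
daily_chars = 200 * 80
daily_cost = estimate_cost(daily_chars)

# 16,000 characters at $0.18 per 1000 comes to about $2.88 per day.
print(f"{daily_chars} chars/day costs about ${daily_cost:.2f}")
```

Under those assumed numbers the daily cost stays in the low single digits of dollars, which fits the claim that the price matters less for conversational use than for long generated responses.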


Wouldn't it be better to use FFI to build an idiomatic interface to use in Dart instead?


It's a good point but I'm a perfectionist and can't abide without a web version.

though, now that I write that...

Native: FFI.

Web: Dart calling simple JS function, and the JS handles WASM.

...is an excellent sweet spot. Matches exactly what I do with FONNX. The trouble with WASM is Dart-bounded.

(n.b. re: local cloning for anyone this deep, this would allow local inference of the existing voices in the Raspberry Pi x ONNX voicer project above. It won't _necessarily_ help with doing voice cloning locally, you'll need to prove out that you can get a voice cloning model in ONNX to confirm.)

(n.b. re: translating to Dart, I think the only advantage of a pure Dart port would be memory safety stuff but I also don't think a pure Dart port is feasible without O(months) of time. The C is...very very very 2000s C. globals in one file representing current state that 3 other files need to access. array of structs formed by just reading bytes from a file at runtime that matches the struct layout)


That would be awesome


GitHub: https://github.com/myshell-ai/OpenVoice Checkpoint: hxxps://myshell-public-repo-hosting.s3.amazonaws.com/checkpoints_1226.zip

(Checkpoint link defanged because I’m allergic to direct links to zip files hosted on Amazon. Nor have I reviewed what the file contains.)
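
For anyone unfamiliar with the convention, "defanging" just rewrites the URL scheme so the text is no longer auto-linked. A minimal Python sketch of the idea follows; the function names and the example.com URL are made up for illustration, and this is only the scheme swap used in the comment above, not a complete defanging tool.

```python
def defang(url: str) -> str:
    """Replace http/https with hxxp/hxxps so the URL is not clickable."""
    if url.startswith("https://"):
        return "hxxps://" + url[len("https://"):]
    if url.startswith("http://"):
        return "hxxp://" + url[len("http://"):]
    return url


def refang(url: str) -> str:
    """Undo defang(): restore the original scheme."""
    if url.startswith("hxxps://"):
        return "https://" + url[len("hxxps://"):]
    if url.startswith("hxxp://"):
        return "http://" + url[len("hxxp://"):]
    return url


example = "https://example.com/archive.zip"  # hypothetical URL
print(defang(example))
print(refang(defang(example)) == example)
```

Note that the transformation is trivially reversible by hand, so it only changes whether a reader has to make a deliberate choice to visit the link.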


Thanks for the link to the repo. It's very useful.

As for the checkpoint, I'm not allergic and I don't do security theater:

https://github.com/myshell-ai/OpenVoice?tab=readme-ov-file#i... links to

https://myshell-public-repo-hosting.s3.amazonaws.com/checkpo...


Why do you call that security theater? I found and provided the information, but didn’t make it clickable. Anyone can decide for themselves to navigate there.

Your comment comes off as passive aggressive.


What is the threat vector of the functional https link that hxxps solves?


I think he's referring to your "defanging" which you implied was security related but doesn't actually achieve anything at all.


They are making you think about what you are doing before you click the link. That’s not theatre; that’s keeping people from clicking arbitrary links to zip files which can auto-execute code once downloaded.

I’d suggest that those who think it is theatre probably don’t understand the implications of that action.


We understand exactly the implications of that action. There are no implications.

Simply downloading a zip from Amazon has zero risk. Even opening an arbitrary zip has essentially zero risk. RCE from opening a zip is obviously a really critical and valuable vulnerability and would not be wasted with a public link.

Combine that with the fact that this comes from a voice cloning GitHub repo and the chance of this having some 0-day is infinitesimal.

Finally just making the link non-clickable does not add security. Nobody can take any action to increase their security just because they have to slightly edit a link (not that they would have to, since it's sensibly a clickable link in the GitHub README).

So yes, I fully understand the implications and it is definitely security theatre.

I suggest that those who think that it isn't probably haven't really thought about the threat model.


I'll be honest. You've put way more thought into this than I did.

But in the spirit of hacker news, I'll continue the argument.

> There are no implications.

Untrue and absolutist.

> Simply downloading a zip from Amazon has zero risk.

Agreed.

> Even opening an arbitrary zip has essentially zero risk. RCE from opening a zip is obviously a really critical and valuable vulnerability and would not be wasted with a public link.

Broadly agreed. History is full of unzip vulns, but I agree that this doesn't seem likely. Much easier to persuade folks to deliberately run their malware by using the latest fad as a hook. I'm not claiming that happened here.

> Combine that with the fact that this comes from a voice cloning GitHub repo and the chance of this having some 0-day is infinitesimal.

Maybe you know these authors and this repo and trust them. I don't. I'm sure they are lovely, I have no idea, I've done no research, and I've never heard of them before. That being said, if I wanted to distribute a backdoor or cryptominer to a bunch of people with powerful computers, I'd definitely hop on the AI bandwagon.

> Finally just making the link non-clickable does not add security.

I disagree. Some of the commenters here are rather savvy and will properly evaluate what they are downloading. Some are not. Making a link unclickable will prevent a percentage of people from downloading. If shenanigans are discovered, someone will make a very loud comment warning folks to avoid the download. In that case some of those non-downloaders may have been saved from themselves.

Again, this wasn't a well thought out decision, but it was also a rather low impact decision, and I stand by it.


People like the parent routinely download all of the random zip files off the web that they can get their mouse cursors on. Nothing is going to stop them.


Yep. I don't worry about non-existent threats. Nothing is going to stop me because there is no risk. Have you ever been owned by downloading a zip? Me neither.


> That being said, if I wanted to distribute a backdoor or cryptominer to a bunch of people with powerful computers, I'd definitely hop on the AI bandwagon.

And write an entire novel research paper and open source the code and put it on GitHub? No you wouldn't. Don't be ridiculous.


> And write an entire novel research paper and open source the code and put it on GitHub? No you wouldn't. Don't be ridiculous.

You are moving the goalposts, why?

Regardless, generating a plausible sounding paper with source code is trivial with gpt4. Obviously it wouldn’t withstand scrutiny, but neither would my coinminer.


just downloading a zip file won't auto execute anything. and you can't meaningfully review it without downloading it, so it pretty much is security theatre


On which operating systems can Zip files automatically self-execute? Android .APKs come to mind, although in this case, Android asks you whether you want to install the application and thus gives you a chance to prevent the execution.


What does allergic mean in this context?


That file could contain anything. I don't know the authors or have any idea of their reputation.

I wanted to expose it so people didn't have to comb through the github, but decided to make it unclickable out of an abundance of caution. This appears to have offended people.

I would not have hesitated to link to hugging face. That is a known quantity.


FWIW I appreciate the courtesy and context; agreed that it's not the best idea to link directly to zip files (let alone those of questionable provenance).


I love this paper. It reads very much like "this is what we did, and we want to help others do it too." Also, the section "Remark on Novelty" is golden: "OpenVoice does not intend to invent the submodules in the model structure ... The contribution of OpenVoice is the decoupled framework that seperates the voice style and language control from the tone color cloning." They don't try to hype up their contribution.


Examples: https://research.myshell.ai/open-voice

Seems impressive!


> This repository is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which prohibits commercial usage. MyShell reserves the ability to detect whether an audio is generated by OpenVoice, no matter whether the watermark is added or not.

So it is not 'open' then and you cannot make money out of this?


It is open, just not by your definition. You can view, use and modify the code to your heart's content. Sounds pretty open to me!


By the commonly held definition of open, in the context of "open source", it is not open.

> You can view, use and modify the code to your heart's content.

The non-commercial clause of their license specifically prohibits commercial use, so we cannot use this source, and presumably the data that the source uses, to our heart's content.

The OSI has a definition of open source that clearly states commercial use is required [0].

Wikipedia's entry on Open Source Licensing also stipulates that commercial re-use is required [1].

There is a term called "source available" which is more in line with your intent.

[0] https://opensource.org/osd/

[1] https://en.wikipedia.org/wiki/Open-source_license


> commonly held definition of open, in the context of "open source", it is not open

While this is very true, the context of "open source" can't be assumed.


Where do they claim to be “open source”?


In their README in the GitHub repository as well as the paper. I opened an issue [0] and it looks like they've updated their README, at least, to reflect that it's not open source.

[0] https://github.com/myshell-ai/OpenVoice/issues/16


To be specific, while it is not a bad license, it does not qualify for the Free Cultural Works mark as defined by the Creative Commons and Freedom Defined: https://creativecommons.org/public-domain/freeworks/


And not by opensource.org's definition, which prohibits use restrictions. It's not reasonable to act like OP is being idiosyncratic when this fails to meet the protected definition of "open source".


The term "open source" is not protected, the OSI (opensource.org) attempted and failed to acquire a trademark on that term.


Fair enough. Is there any shared definition of "open source" which permits use restrictions, then?


I admit not to have read the whole paper, but in the intro nowhere do they mention “open source”, so it seems unfair to measure them by that definition


Well… “use” isn’t exactly free, thus the complaint. On a scale of free to not free, “cannot use this for my work” is a pretty big jump to the latter end IMO


Careful, you’re saying the quiet part out loud; freedom is about profiting off the uncompensated work of others.


Well ultimately we all need to eat. If someone wants to be compensated in today’s society, they either need to join a gift-based sub-society (see: OSS foundations, NGOs in general) or sell something. Trust me, I totally agree that freedom of information should be a completely separate concern from resource allocation

EDIT: I guess there’s a third option, “work another job and use OSS on your off hours”. Which feels… idk, disrespectful of the whole enterprise. OSS software development is important enough to deserve a wage IMO, to say the least


To your last point, people pay what the market will bear. In this case, it's free, so don't be surprised if you give something away for free and people, well, take it for free. Importance has nothing to do with it.


As long as your heart's content isn't commercial


Open for business!

No wait…


You can’t. Scammers who don’t care about noncommercial licenses sure can!


This is the most insightful take. Licenses like this prevent certain businesses in certain countries, but it is quite harmful as it adds a powerful tool for propaganda/scammers/etc who don’t care about the laws.

Additionally, it only really hurts small businesses & startups, as the big companies all have teams that can make their own version or can easily pay for 3rd-party APIs. So yeah, us startup folks won’t like this license much as it basically is aimed at us the most.

Either way, congrats on the tech. It does look very impressive!


erm, its existence provides for scammers.

unless you're proposing its use in detecting itself is somehow symmetrical, which I really don't think is anything but unproven conjecture.


Yep, this is one of those "only bad actors" licenses, probably as a cash grab.

It will definitely stop those bad actors from scamming people this time, right? Right?


It's not really well advertised and I'm not sure Apple is continuing development, but iOS has a voice clone feature called "Personal Voice" - it takes about 15 mins to train it with your own voice (and then takes a few hours to process on-device when locked). You can use it in phone calls and FaceTime (maybe other places?). It would be nice to use it for general TTS.


It's an accessibility feature for people losing their voice or on the verge of. And it is TTS only, not speech-to-speech as your mention of "can use it in phone calls and facetime" implies. Not being s2s means it doesn't retain vocal disfluencies, prosody etc signals that make a voice feel real


I had to phone my bank, which is one of the bigger players in the UK high street market, a couple of days ago. They're still encouraging me to enroll in their idiotic "my voice is my password" programme. At this stage in the evolution of AI, that feels simply negligent.


Fidelity Investments just did something even worse ~a week ago - It asked me to reply to a few questions, then announced that I'd just been enrolled in its voice identification program (or whatever they call it).

Now I've got Just Another Item on my ToDo list, to get that undone. Gawd, does every company promote its stupidest people to management?


> Gawd, does every company promote its stupidest people to management?

Yes: https://en.wikipedia.org/wiki/Dilbert_principle

Ironically, this is the place where they can do the least damage.


Clone management's voices and post it to their social media/etc. Super undo it!


They promote their best schmoozers to management.

They have so much money that competence no longer matters and bootlicking will get you much farther.


I don't know if GDPR (or any of its cousins) applies to you, but this kind of thing sounds exactly like the sort of thing it's supposed to outlaw.


How? Your bank stores personal data covered by GDPR, but enabling crappy secure systems is not the domain of GDPR.

Most likely this is caused by SCA, another European directive that ruined our lives with extra security hoops (for payment providers) for little extra security, or even worse security in the case of voice passwords or security questions.


Sadly, the US, where I currently live, is quite behind in this. Considering going expat (not for this specific reason, but it doesn't help); any expats have recommendations of what countries have worked well for them?


Heavily depends on money available / crime level tolerance - albeit crime level in the US is pretty crazy.

Europe is deteriorating incredibly rapidly in terms of crime (due to a combination of economic poverty and uncontrolled immigration from third world countries) - but I think some of the low tax EU countries (Malta, Cyprus, Gibraltar, etc) are a good bet for a few more years.

My top choice if I had family (or friends I want to be close with) in the US would be Cayman.

I think long term, either South America drops the level of crime considerably and becomes the new place to be or China starts building futuristic cities attracting wealthy western talent to offset their declining population rate.


> Gibraltar

That's a British Overseas Territory, it isn't in the EU.


I recommend you first narrow it down to somewhere whose main language you can speak. I picked Germany because I already had some experience with the language and slightly Dunning Kruger'd myself. I like it, but… well, even native German speakers say „Deutsche Sprache, schwere Sprache“ ("German is hard").

Cyprus has a lot of English speakers (and indeed a lot of street furniture that looks just like the UK, plus two UK airforce bases[0]), but the national language is Greek… I don't know if I'd risk that, given the one time I tried to ask for «Ένα σάντουιτς και ένα τσάι παρακαλώ»[1] in Athens[2], the person behind the counter replied in English to correct my pronunciation.

[0] https://en.wikipedia.org/wiki/Akrotiri_and_Dhekelia

[1] https://translate.google.com/?sl=el&tl=en&text=Ένα%20σάντουι...

[2] I know that's not in Cyprus, but it is, as you may guess, another place where Greek is the national language.


A person's voice is, I believe, personal data.

> Processing personal data is generally prohibited, unless it is expressly allowed by law, or the data subject has consented to the processing

- https://gdpr-info.eu/issues/consent/


I see, good point!

Given that the same voice gets processed and recorded during a normal phone call to the bank, you would need to give consent just to talk on the phone (and they do have a disclaimer when you are calling in Europe).

Most likely this is buried deep in some massive EULA you accept when you open an account.


All processing is supposed to require explicit and meaningfully informed consent for each separate use; one of the GDPR training lessons we get over here is basically "Bob has a bunch of customers' emails he got from the sign up process, is he allowed to use them to send adverts for a new product?" and the answer is "No, that's only allowed when the customers explicitly consented to that, you can't just use any data they happen to have given you for whatever new purpose you want".

EULAs are a bit more of a mess, as all the advice I've been given says "don't hide stuff like that" while all the websites I visit are "we're going to do this anyway because we think we can get away with it".


Importantly, you can also revoke consent at any time under the GDPR. Unlimited consent isn't possible, so the bank would have to make the (dubious) claim that such processing did not require permission at all.


Investec? Yeah thinking I need to phone them to disable mine


The tide is high, a powerful moment reminding us that, even in the face of vastness, we have the strength to overcome obstacles. Each wave is a lesson in perseverance, showing us that no matter how hard we are pushed back, we can always come back stronger. This tide teaches us resilience, adaptability, and the courage to seize new opportunities that come our way. Let’s embrace change as a chance to grow, renew, and explore new horizons.


My first and ongoing thought is that immoral/criminal uses of voice cloning vastly exceed any legitimate ones.


My first thought is anonymity. I can make YouTube videos without needing to use my real voice... while being able to keep my personal inflections and emphasis, something TTS (AI) voices can't do.

Or...! Indie game development. I can learn basic voice acting (to get rid of the cringe), and act out all of my characters using different voices.


The indie game development and animated short content are the primary uses for this type of NN for me. I’m working (not very successfully) at putting together a one-source-voice to many-result-voices ‘style transfer’ solution using standard PyTorch components. Realistically, I can pay the target sample voice to record some amount of varied vocal performance. Then, if the net is trained specifically on my voice as the source, the hope is that the transfer can capture the ‘performance’ qualities in my original.

And in case anyone is concerned, I intend to make the purpose of the vocal samples clear to the provider and then arrange appropriate credit and compensation to those whose voices I used. I also don’t intend to train with anything but public domain and purchased data.


Out of curiosity, what/how many legitimate use cases have you considered?


Potential legitimate uses I can think of -

1. licensing voice to other uses - people with recognizable trademarkable voices (actors, singers) have another potential revenue stream. yay!

2. use of past voices - voices from the past that are not 'owned', let's say Humphrey Bogart's voice, can be used in projects without having to pay for an imitator. This would be useful for both marketing and artistic projects. But probably less for marketing because they will want to go with step 1.

3. Teach yourself to talk like X. People who need to learn to talk like a particular person / have a particular accent could learn quicker. Just think - you will be able to supplement your comedy routine with kickass Christopher Walken impersonations any day now!

Variations of 3 and 2 together open up interesting modes of aesthetic impression, but I won't go into that here. But definitely I have some ideas that might benefit from being able to do this.


Surely software that communicates with people using natural language should be topping all these lists. Direct communication through voice with a local LLM is already possible. It won't be long before it's fast enough to approach natural interaction (if we can solve the "when is it my turn to speak" challenge) and then we enter a new phase of digital interaction and AI training.


A legitimate use, in the abstract, is one where a particular individual is willing to have their voice used to say X. The entertainment industry - movies and games - are likely to want this.

But if it's trivial to use somebody's voice to say any arbitrary thing, then it'll be done. Combined with deepfake videos, the result will be the ability to show anyone saying anything, including lies and things they find incredibly objectionable, in a disturbingly realistic way, and more so as time wears on.

The fundamental issue is that we don't live in a rights-respecting world. Making it easy to utter anything in the voice of anyone will lead to many more abuses than legitimate instances.


People will get immune to it if they aren't already. It's already common to fake screenshots of tweets/etc. Not a real problem unless you want to believe falsehoods, and then you will anyway.


What of commercial uses being greater than illegitimate ones? YouTube will give people the ability to hear it in their own localized language in the author's voice.


I disagree, we should just not accept voice as authentication.

I think the most common use case will be making art & content programmatically without voice actors (and most likely without actors at all once we nail video or a 3d model pipeline + frame by frame transformation to make it look realistic)


Talk with your loved ones and agree on a passphrase for when you're stuck in an emergency and need money wired or something.

Some banks have voice authentication when you call in and you have to ask to opt out.


Which just means we need to build protocols around this risk, rather than foolishly trying to shove the genie back in the bottle, lest we be left with only the criminal uses


> This repository is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which prohibits commercial usage. MyShell reserves the ability to detect whether an audio is generated by OpenVoice, no matter whether the watermark is added or not.

Creative Commons Attribution-NonCommercial 4.0 only prevents the use of the licensed code within other projects. It says nothing about output. People can (and will) use output generated by this for commercial projects, and they will be completely within their rights to do so.

So detect away I guess.


The current leader in open-source voice cloning is RVC; I'd like to see how this compares to it.


RVC is voice conversion (audio to audio), and it's typically finetuned.

This is zero shot TTS. Samples create vector encodings that serve as input to inference. There's no retraining the model unless you want it to generalize or perform better.


It isn't, though. People need to read the paper and the comments from the author: they aren't actually doing the voice generation. They pass the text off to VITS, and their secret sauce is the tone mapping they do on that VITS output. So if anything they're a competitor to RVC; it's just that the version they published also includes VITS.
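
A rough sketch of the two-stage shape described here, purely for illustration: a base TTS produces speech in a generic voice, then a separate converter maps its tone onto the reference speaker. Every name below (`clone_voice`, `toy_tts`, `toy_tone`, `toy_convert`) is hypothetical, and the "audio" is just a list of floats; none of this is OpenVoice's or VITS's actual API.

```python
from statistics import mean
from typing import Callable, List

Audio = List[float]  # toy stand-in for a waveform


def clone_voice(
    text: str,
    reference: Audio,
    base_tts: Callable[[str], Audio],
    extract_tone: Callable[[Audio], float],
    convert_tone: Callable[[Audio, float, float], Audio],
) -> Audio:
    """Stage 1: generic-voice TTS. Stage 2: map its tone onto the reference."""
    base_audio = base_tts(text)              # the VITS-like step
    source_tone = extract_tone(base_audio)   # tone of the generic voice
    target_tone = extract_tone(reference)    # tone of the speaker to imitate
    return convert_tone(base_audio, source_tone, target_tone)


# Toy stand-ins so the sketch runs; real models would replace all three.
def toy_tts(text: str) -> Audio:
    return [float(ord(c) % 7) for c in text]


def toy_tone(audio: Audio) -> float:
    return mean(audio)


def toy_convert(audio: Audio, source: float, target: float) -> Audio:
    # Shift every sample so the average "tone" matches the target.
    return [sample - source + target for sample in audio]
```

The point of the split is that the second stage never sees text, only audio, which is why it is comparable to an audio-to-audio converter like RVC.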


Interesting.

Funny enough, a lot of RVC packages are using VITS to do RVC for TTS.


I hope someone can handle Cantonese one day


Is there some similar software that allows you to add, let's say, 40 years to a voice?


If I wanted to do voice cloning on my own hardware can anyone suggest what a good open source project would be to use? What is the state of the art in open source voice cloning?


I use Tortoise TTS. It's slow, a little clunky, and sometimes the output gets downright weird. But it's the best quality-oriented TTS I've found that I can run locally.

It's allegedly the basis of the tech used by Eleven Labs.

https://github.com/neonbjb/tortoise-tts


There are faster implementations of tortoise that allow fine-tuning. You can get close to ElevenLabs quality if you have a perfect dataset. https://git.ecker.tech/mrq/ai-voice-cloning



Now if only YouTube would ban the use of this crap. Or at the very least allow you to filter those videos.


There's genuine uses, look at Apple offering this tech recently as an accessibility feature for people losing the ability to speak to have text to speech in their own voice in lieu of being able to vocalize it themselves.

If you ban it, you're banning genuine uses like that, as well as creators who just want to fix a fumbled or awkward line without completely re-recording.


He's the same camp that bitches about SD, MJ and DALLe, but loves photoshop generative fill lol


Does anyone know, at a deeply practical and technical level, how 11Labs achieves the quality they do?


Fraud becomes easier I guess.


Holy cow! If this works without curated audio...this is amazing!


Any GitHub link?



What's with the crypto thing?


From the github readme:

“MyShell reserves the ability to detect whether an audio is generated by OpenVoice, no matter whether the watermark is added or not.”

Call me skeptical…


And suddenly it becomes a bit weird:

https://docs.myshell.ai/tokenomics

Tokenomics

Disclaimer: MyShell is currently in the testing phase, and the content of the whitepaper may be subject to change in the future.

$SHELL is the token used for user incentive, governance and in-app utility.

The total supply of $SHELL is 1,000,000,000


And luckily, this submission seems to be about the paper/technology OpenVoice, not about the company MyShell (whatever that is).


License[0]: This repository is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which prohibits commercial usage. MyShell reserves the ability to detect whether an audio is generated by OpenVoice, no matter whether the watermark is added or not.

[0] https://github.com/myshell-ai/OpenVoice


Wonderful company, not a scam at all: https://docs.myshell.ai/tokenomics


That watermark detection rights at the end is real sus


At least right now, there is a literal add_watermark function, so probably easy enough to remove that surface level. Unless they added something cute to the training data to poison the well.

https://github.com/myshell-ai/OpenVoice/blob/a33963c3d764bee...


What exactly are you talking about? The paper doesn't mention any watermark at all, as far as I can see/search.


The readme on the linked github reads: “MyShell reserves the ability to detect whether an audio is generated by OpenVoice, no matter whether the watermark is added or not.”


As you say, right at the bottom https://github.com/myshell-ai/OpenVoice


Ah, thank you. Guess that's OK that the company/service do whatever they want, the paper/technique doesn't involve watermarks, so it'd be easy to remove/modify whatever they do in the library/service itself.


Their Tokenomics page says:

$SHELL is the token used for user incentive, governance and in-app utility.

The total supply of $SHELL is 1,000,000,000

Team, Treasury, Advisors & Private Sale = 55% allocation

Community Incentive = 40% allocation

Liquidity = 5%
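
For scale, here is the arithmetic on those percentages against the quoted total supply. The variable names are mine; the numbers are only the ones quoted above.

```python
TOTAL_SUPPLY = 1_000_000_000  # total $SHELL supply as quoted

# Allocation percentages as quoted from the tokenomics page.
allocation_pct = {
    "Team, Treasury, Advisors & Private Sale": 55,
    "Community Incentive": 40,
    "Liquidity": 5,
}

# Integer arithmetic avoids floating-point rounding.
allocation_tokens = {
    name: TOTAL_SUPPLY * pct // 100 for name, pct in allocation_pct.items()
}

for name, tokens in allocation_tokens.items():
    print(f"{name}: {tokens:,}")
# Team, Treasury, Advisors & Private Sale: 550,000,000
# Community Incentive: 400,000,000
# Liquidity: 50,000,000
```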


So I guess we could (legally) now create a voice chatbot using Mickey Mouse audio from Steamboat Willie?


Possibly, except there is no dialogue in it.


Can someone give me a practical use case where this adds a net benefit to society?


From an indie game dev standpoint, I can probably say a sentence or two in a given way using my standard headset microphone.. and something like this would allow for clean voice lines fairly easily, as long as they don't need to stress too much emotion... But for a $0 game, that would still be beneficial. Imagine all the 2D Zelda/FF like games that don't get played today because people would rather listen to dialogue than read.

Of course, there's also the preservation of the voice of a loved one. I would probably pay to hear my father's voice again, but there's probably only one or two VHS tapes with his voice on them.


James Earl Jones, presumably hedging against his eventual demise, has allowed his voice to be used for things like the Star Wars franchise [0].

Small, independent film makers can now use a skeleton crew to voice parts.

I can't imagine it would be anything other than a niche service, but there's also hearing the voice of a passed loved one and, potentially, interacting with a chatbot/LLM that speaks in it.

This is off the top of my head. I would also guess that this technology is a stepping stone for other weird, interesting and profoundly helpful uses.

[0] https://www.theverge.com/2022/9/24/23370097/darth-vader-jame...


If you've ever done voice prompt recordings for a phone system, voice cloning would be super helpful for doing one off tweaks, especially if you have to record a bunch. Instead of rerecording 20 messages, which can sometimes take hours, you can use a clone of your own voice to make the necessary modifications. My friend does a lot of recordings as part of his job and when I showed him the Adobe voice editing preview he got really excited. It has the potential to make tweaks a lot easier, less time consuming, and reduce voice strain.


Unifying a voice in tutorial videos so that the difference in voice does not distract the learner.

Auto non-toxic rephrasing of online chat in video games, let people hear their voice but paraphrase what they said in a manner that doesn't turn the platform into a cesspit.

Cloning your own voice so that you can turn a script into audio without 50 takes and then having to remove a million Ums and errs.


> Auto non-toxic rephrasing of online chat in video games, let people hear their voice but paraphrase what they said in a manner that doesn't turn the platform into a cesspit.

that feels very orwellian


George Orwell — 'If you want a picture of the future, imagine a boot stamping on a human face—for ever.'

I think this is closer to the direction of Huxley in Brave New World, where a deeper understanding of how to manipulate without brute force creates a very different dystopian society than 1984.


"Don’t you see that the whole aim of Newspeak is to narrow the range of thought? In the end we shall make thoughtcrime literally impossible, because there will be no words in which to express it."


Censorship by itself doesn't stop people thinking (or even expressing) forbidden thoughts, it stops a person's words reaching other people.

BNW had a similar effect by conditioning, rather than by applying the strong form of the Sapir–Whorf hypothesis.


It's not perfectly isometric, but neither is it a stretch to call it orwellian.


Real-time translation in the speaker's own voice.


This is an exceptional use case!

Mr Beast talked about translating his videos to other languages to get more reach. This can be done for people with limited budget or just in general so people can watch videos without needing subtitles.

I wouldn't be surprised if we saw this incorporated into YT in the near future.


Another really good one would be for RPGs. Instead of clumsy approaches like "Hey Dragonborn" and whatnot, they could actually say your character's name out loud.


Right, and taking it one step further, LLMs in games with voice actors providing the basis for dynamic dialogue that sounds like it's coming from a person.


While listening to the examples given, I noted the cross-language ones. I’m eager to improve my accents in my nonnative languages by cloning my voice and comparing recordings of how I do sound with how I would sound as a native speaker!


Person A used to be able to speak, but lost their voice in an accident/because of reason Y. Luckily, there is surviving audio/video with their voice on it, so a text-to-voice with their own voice could be created for them to use.


My pastor has an injured vocal cord that makes him sound gritty at times. A technology like this applied to old recordings of his speaking might make him sound like he used to. I don’t know if he’d use something like that since we mostly rely on the Spirit of Christ to open hearts to the truth.

Outside public speakers, there are probably other people who’ve lost their voice or have trouble vocalizing who might want to sound like their old selves. This could help them.

Disclaimer: I think these techs will more often do damage than good. I’m just brainstorming an answer to your question.


Possibly speech therapy.

Certainly entertainment. Movies / TV. It opens a new opportunity for videogames with generative characters.


Well we could just look at the obvious and existing use cases for text to speech stuff.

Alexa, Siri, and similar are all commonplace.

Another huge usecase would be anything to do with voice acting. Either in video games, cartoons, or the like.

This would completely democratize voice acting material, and would empower anyone to be able to do this for cheap.


... and put 99% of voice actors out of business. We'll eventually end up with every TV show, movie, and video game being voiced by Ryan Gosling and Beyonce because market research.


Technology that makes things better, faster, and more democratized does tend to harm those who profit off of things being expensive and gatekept, yes.

Democratization will always be the enemy of those who profit from preventing others from being empowered.


No.

The real answer is yes, I could probably come up with some contrived examples, like I lost my voice in a freak LLM accident and now want to clone my old voice. But this doesn't (you don't?) really need a net benefit reason to figure it out and publish it. Because why? I assume, because "this shouldn't exist!" which is just a more palatable way to phrase "won't someone think of the children".

Society doesn't benefit from ignorance, so given it can exist, what's the problem with it existing? Why does it need a practical reason? Because people will do bad things with it? Duh, but I'd rather everyone know than just the bad guys.


My question wasn't to imply that I don't think a given technology should or shouldn't exist.

I was curious to see if anyone could name off the top of their head some practical use cases that they feel net out the potential harms of cloning and misusing someone else's voice.

There's some nice and certainly practical examples, but I don't feel any of them would net out the harms.

Perhaps there's a use case that we can't even comprehend yet that would though!


By this logic there shouldn’t be regulation on anything, because the bad guys will have it any way.

While you can’t make it go away, you can disincentivize propagation and use which can be the difference between thousands of cases of scams/extortions and millions. Until there’s a stronger argument for voice cloning models (talking to a dead loved one is creepy and not a positive argument) then we shouldn’t encourage tools with overwhelmingly nefarious utility.


That's correct, I believe it shouldn't be illegal to know anything. Nor do I think science needs any kind of regulation.

Hurting people, lying, that's already illegal.

I think maybe you misunderstood my argument. My argument isn't that a good guy with a voice cloner is the only thing that can stop a bad guy with a voice cloner. That's, as you pointed out, stupid. My argument is that no one benefits if how easy it is to make one remains a secret to everyone but the bad guys.


> Why does it need a practical reason?

To at least give us something as a consolation for all the havoc all sorts of deep fakes will wreak on societies. It's like asking what a knife can be used for other than murder. It's a valid question.


It's a valid question, but not a good one. The implication is something needs a reason to exist. This is a new technology, and just like all new technologies, the fear of it is spreading faster than the understanding.

Scams aren't going away. Will this make it easier to scam some people? Absolutely, so did the internet. I'm not claiming this is anything like the internet. My argument is closer to: the reason people get scammed isn't because [thing exists], it's because bad people lie, and kind people trust them. We can all wring our hands in fear over what the new technology might do, or we can solve the problems we care about. Authenticity was hard before this, and it'll be hard after.


> But we live in a concrete society, [and] with concrete social and historical circumstances and political realities in this society, it is perfectly obvious that when something like a computer is invented, then it is going to be adopted for military purposes. It follows from the concrete realities in which we live, it does not follow from pure logic. But we're not living in an abstract society, we're living in the society in which we in fact live.

> If you look at the enormous fruits of human genius that mankind has developed in the last 50 years, atomic energy and rocketry and flying to the moon and coherent light, and it goes on and on and on -- and then it turns out that every one of these triumphs is used primarily in military terms. So it is not reasonable for a scientist or technologist to insist that he or she does not know -- or cannot know -- how it is going to be used.

-- Joseph Weizenbaum

That is not fear. That is being serious and unflinching, if anything.

> We can all wring our hands in fear over what the new technology might do, or we can solve the problems we care about.

I'm doing neither. I said it's a valid question, with which you agree. The rest is a straw man, apropos of nothing anyone actually said here, and it's you who is wringing your hands about it. That's a way bigger waste of time than asking a simple question, letting those who want to answer it do so, and letting those who don't simply not answer, instead of making up this "issue" with the question itself.

> Authenticity was hard before this, and it'll be hard after.

So "nothing changes", but technology is super important? You could say the same about, say, curing cancer. People will live for a while and then die, with or without it. And since it makes no difference, what would be the problem with "fearing" it?


Being able to handle translations live, hearing the person's voice translated as if they were speaking to you in your native language with their own voice, is a big one.


Aside from the fact that it will be easier to scam people, I fail to see benefits. We can already translate everything with the same synthetic voice.


You would be able to translate media into the language of your choice, but also retaining the original voices.


Apple has Personal Voice for accessibility.


Welcome to the new era of fakes and scams beyond our wildest imagination!


Elevenlabs has been around for a while now. Genie has been out of the bottle for a bit, and the sooner the notion that anything digital can be easily faked seeps into the wider consciousness the better. Trust nothing.


I've seen some prank calls (a YouTuber cloned Tucker Carlson's voice and called Alex Jones) but he just had a sound bank with a few pre-generated lines and it fell apart pretty quickly.

At least for now there's too much lag to do a real time conversation with a cloned voice.

Speech to Text > LLM Response > Generate Audio

If that time can shrink to subsecond, I think there'll be madness. (Specifically thinking of romance scammers)
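
To make the budget concrete, a tiny sketch: the round trip is the sum of the three stages, and "subsecond" means that sum is under 1000 ms. The stage timings below are made-up placeholders, not measurements of any real system, and the function names are mine.

```python
from typing import Dict


def round_trip_ms(stage_latency_ms: Dict[str, int]) -> int:
    """Total delay when the stages run strictly one after another."""
    return sum(stage_latency_ms.values())


def is_subsecond(stage_latency_ms: Dict[str, int], budget_ms: int = 1000) -> bool:
    return round_trip_ms(stage_latency_ms) < budget_ms


# Placeholder numbers purely for illustration.
example_pipeline = {
    "speech_to_text": 400,
    "llm_response": 900,
    "generate_audio": 600,
}

print(round_trip_ms(example_pipeline))  # 1900
print(is_subsecond(example_pipeline))   # False
```

Streaming each stage into the next instead of waiting for it to finish would change this math, since the delays would overlap rather than add.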


At last summer's WeAreDevelopers World Congress in Berlin, one of the talks I went to was by someone who did this with their own voice, to better respond to (really long?) WhatsApp messages they kept getting.

It worked a bit too well, as it could parse the sound file and generate a complete response faster than real-time, leading people to ask if he'd actually listened to the messages they sent him.

Also they had trouble believing him when he told them how he'd done it.


Awful, bots on their own having real conversations with people with the voice of a loved one. Scamming on steroids


You don't need an LLM Response


> the notion that anything digital can be easily faked seeps into the wider consciousness the better. Trust nothing.

This is a society-destroying idea.

Most of us, especially younger people, only know how to vote, where there are wars, or even what our parents are doing by using digital media.

If digital media becomes untrustworthy, everyone will live in a warped and fragile alternate reality that no one can agree on.


> Trust nothing

> This is a society-destroying idea.

Believe it or not, this is how much of the population saw The Internet when it first came close to being mainstream. Everyone and their mother said "Don't believe anything you read on the cybernet", which ended up ironic as everyone and their mother ended up being the ones to believe anything on the cybernet anyways.

> everyone will live in a warped and fragile alternate reality that no one can agree on.

How is this any different from today? The various corners of the internet (which is mostly divided by languages: English, Russian, Spanish, Chinese and Portuguese) already have these vastly different realities and ground-truths.

I'm sure we could survive another Internet-Winter where people trust everything a bit less than today.


It's vastly different than today because today (or at least a few years ago), I could trust videos and voices delivered digitally. I can't do that anymore.


How long has society had voice and video delivered digitally? We managed to survive fine before we had it.

If it now becomes impossible to trust a voice received through the internet without being connected to a verified telephone number I don't know how that can be classified as society-changing.


Technology and society will adapt, just as we adapted encryption to verify credentials and secure banking data online, we'll end up with a validation signal for video and audio.
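
A minimal sketch of what such a validation signal could look like, assuming the publisher and verifier share a secret key. It uses an HMAC from Python's standard library as a stand-in; a real scheme would more plausibly use public-key signatures so that anyone can verify without being able to sign. The function names are mine.

```python
import hashlib
import hmac


def sign_audio(audio_bytes: bytes, key: bytes) -> str:
    """Return a hex tag that binds the audio content to the key holder."""
    return hmac.new(key, audio_bytes, hashlib.sha256).hexdigest()


def verify_audio(audio_bytes: bytes, key: bytes, tag: str) -> bool:
    """Check the tag in constant time; any change to the audio invalidates it."""
    expected = sign_audio(audio_bytes, key)
    return hmac.compare_digest(expected, tag)


key = b"shared-secret-for-illustration-only"
clip = b"\x00\x01\x02 raw audio bytes would go here"
tag = sign_audio(clip, key)

print(verify_audio(clip, key, tag))                # True
print(verify_audio(clip + b"tampered", key, tag))  # False
```

Note this only proves who published the bytes; it says nothing about whether the voice in the recording is synthetic.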


It can both be true that people need to adapt/“trust nothing,” and that this is bad.


Maybe this will teach people to raise their awareness and take personal security seriously. Like not trusting anyone who is calling, especially from a legacy line. A phone number and a voice can both be easily cloned.


The problem is that it's not just personal security, and it requires significant expertise / research to identify.

https://www.bbc.co.uk/news/world-africa-66987869


This era is hardly new. Look at how old some of the projects are:

https://github.com/underlines/awesome-ml/blob/master/audio-a...

The thing that changes is the complexity to run it. I was training my wife's voice and my voice for fun and needed 15min of audio and trained on my 3080 for 40 minutes.

Now it's 2 Minutes.


Yes, and the more accessible it is, the more widespread it will be.


True. But the better way forward IMHO is to give access to technology in equal ways, instead of keeping it in the hands of a few who promise they are the good ones. Because then a chilling effect can happen, and the playing field is level.


VALL-E has been on GitHub for over a year already...


Overall, will this be a good thing or a bad thing for society, do you think?

If it is a bad thing, should we cheer it on?


[flagged]


Care to elaborate for someone lacking context?


You don't know what voice cloning is and what it can be used for? These people made it even easier to do. That's their achievement. Facilitating fraud, confusion and distrust.


> You don't know what voice cloning is and what it can be used for?

* Real-time translation of a user's voice, maintaining emotion and intonation

* Professional-quality audio from cheap microphone setups (for video tutorials, indie games, etc.)

* Allowing those with speech impairments to communicate using their natural voice again, or:

* Allowing those uncomfortable with their natural voice to communicate closer to how they wish to be perceived

* Customization of voice assistants, such as to use a native accent/dialect

* Movies, podcasts, audiobooks, news broadcasts, etc. made available in a huge range of languages

* And of course: memes, satire, and parody

I also think it opens the door for entirely new kinds of media, but it's hard to foresee exactly what will take off and what will die a gimmick. Maybe immersive-sims where characters can respond intelligently to anything you throw at them? Maybe personalized movies, like the ability to direct scenes yourself?


Most of that can be (and a lot of it was) resolved with artificial voices. There's no need for voice cloning, and certainly no need to make it widely accessible.


> Most of that can be (and a lot of it was) resolved with artificial voices

Existing artificial voices don't even attempt to tackle many of the points and are very clearly a poor substitute (watching a movie dubbed by Alexa?) even in the scenarios where they could be used - far from "resolved". If taking a binary view that it's already technically possible and therefore isn't relevant, I could claim the same of scams.

Given you initially claimed "no positive" but here only take issue to "most" applications listed, do you admit that there are some positives - even if you believe them to be outweighed by negatives?

> certainly no need to make it widely accessible

If it exists but isn't widely accessible, it's likely in the hands of Musk/Zuck and/or various state actors. To me that seems the worst alternative - to have a public generally unaware that it's possible and receiving few of the benefits listed, yet still having it available as a tool for competent disinformation.


Adobe did the same thing years ago and never released it, because there's nothing productive you can do with it if it's not proven to be one's own voice that you have control over.



