I've been following jpc [0] on the LAION discord since he started building this last year, and it's a very impressive project.
The key here is that the Whisper multilingual ASR model has been trained on a huge amount of data, so its encoder output is a very good representation of the semantic content of speech. This can be used as an open-source, drop-in replacement for the semantic encoder in model architectures like SPEAR-TTS/VALL-E/etc (whose semantic encoders are not publicly available). These semantic representations are then used to predict acoustic tokens (the output of the quantized/low-bandwidth EnCodec audio codec), which are then upsampled/denoised/enhanced with the Vocos vocoder.
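For anyone who wants to see the shape of that pipeline in code, here is a minimal sketch of the three stages described above. The model objects and method names are placeholders for illustration, not the actual WhisperSpeech API:

```python
import torch

def synthesize(text, tokenizer, semantic_lm, s2a_model, vocoder):
    """Sketch of a SPEAR-TTS-style pipeline (placeholder objects/methods)."""
    # 1. Text -> semantic tokens. In WhisperSpeech these tokens are derived by
    #    quantizing Whisper encoder outputs; at inference time a small LM
    #    predicts them from text.
    semantic_tokens = semantic_lm.generate(tokenizer(text))

    # 2. Semantic tokens -> acoustic tokens (low-bandwidth EnCodec codes).
    acoustic_tokens = s2a_model.generate(semantic_tokens)

    # 3. Acoustic tokens -> waveform. A Vocos vocoder decodes/enhances the
    #    EnCodec codes instead of using EnCodec's own decoder.
    with torch.no_grad():
        audio = vocoder.decode(acoustic_tokens)
    return audio
```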
I know someone is working on Hindi but it would be great to see this extended to other languages for a properly open-source [1], multilingual TTS platform. I think the main bottleneck at the moment is finding people who can procure/clean compliant datasets.
[0] https://github.com/jpc
[1] jpc/Collabora went to great lengths to ensure that they are only using properly licensed data to train this. I doubt Whisper itself was that compliant, so it's a bit muddy.
Yeah, Whisper is not clear-cut, but since it is not a generative model I think their data usage is a lot more likely to be considered fair use.
And the part of that which we use for WhisperSpeech is just the phonetic representation so our model is not able to recreate any of the Whisper training data in any way.
The readme says "We are working only with properly licensed speech recordings and all the code is Open Source so the model will be always safe to use for commercial applications."
We are working hard to uphold all the licensing rules but nobody can absolve you from all legal risks.
There may be a court ruling or new law requiring special permission from the original author for any training, and then even a CC-BY license won't cover this.
Laypeople value the aesthetics of statements like these. It's very Discord energy.
Everyone who uses learned weights from other models (especially text and audio encoders released by OpenAI, Stability, and Google) is tainted by training materials that were not expressly licensed for AI model training or licensed without restriction for any use.
That's true but you make it sound like it's totally obvious where the line of fair use should be drawn for AI training.
Until courts or lawmakers make it clearer, I personally believe non-generative models (Whisper, ResNet, DINOv2) should be legally trainable on publicly released data. Generative models (image or video generation, TTS, LLMs?) should be held to much higher scrutiny since their outputs can potentially compete with the creators who put a lot of creativity into their art. That's not true for an ImageNet-trained classification model or Whisper ASR.
I believe your use should be protected. This is not meant to be a takedown, better you hear it from me though, because you'll never hear it from Discord.
> "We are working only with properly licensed"
...versus:
> "fair use"
You're smart enough to know these are very different things - saying you believe you are protected by fair use, versus claiming that the data is "properly licensed." In legal terms there is a colossal difference: you went from saying you were authorized to use something to saying you believe you are unauthorized but still permitted under fair use.
Yeah, thanks. I'd love to try to clarify this (I'll put it into our documentation as well ASAP) for anyone who may be reading this in the future:
Our model is not a derivative of Whisper but we do use the transcripts (and encoder outputs) in our data preprocessing. I am convinced this data cannot be regarded as a derivative of the (proprietary) Whisper training set. Whisper itself is open-source (MIT) and has no special TOS (unlike, say, ChatGPT).
WhisperSpeech itself is trained on mostly Public Domain and some CC-BY-SA recordings. We are working on adding more data and we will provide clearer documentation on all the licenses.
They do mention voice cloning in the README ("We’ve also added an example of voice cloning based on a reference audio file."), do you have something different in mind?
It's only found in the Colab and not in the Readme, though. The examples in the Colab are also better than the ones found in the Readme. Maybe the Readme still needs to be updated?
Hi, thanks a lot for the tip, I'll update the README samples ASAP. :)
I was busy working on inference performance in the last few weeks and totally did not expect to land on Hacker News today. I only noticed it an hour ago because my GitHub stars jumped quite a bit.