I played with style transfer on a bunch of paintings from Wikipedia's featured pictures[0]. The results, when it works well, are absolutely fantastic, but when it doesn't, it basically just applies noise. Even 'obnoxious' styles like Van Gogh need to be cherry-picked for the most 'featureful' images.
That being said, I am pretty hyped to get my 1080 so I can process more than 1 image an hour at 720p. Also, take a look at the featured paintings: they're all public domain and absolutely gorgeous.
I just re-implemented something similar for Dreamscope that gets very similar results. If you want to do it yourself, simply convert the original and style images to HSV and create a third image like this:
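The comment cuts off before the actual recipe, so the channel assignment below is my assumption, not the author's: take hue and saturation from the original image and value (brightness) from the style-transferred result, which preserves the original's colors. A minimal per-pixel sketch using only the standard library:

```python
import colorsys

def combine_hsv(original_px, styled_px):
    """Per-pixel recombination: hue and saturation from the original
    image, value (brightness) from the style-transferred image.
    Both inputs are same-length lists of (r, g, b) floats in [0, 1]."""
    out = []
    for (ro, go, bo), (rs, gs, bs) in zip(original_px, styled_px):
        h, s, _ = colorsys.rgb_to_hsv(ro, go, bo)   # color from original
        _, _, v = colorsys.rgb_to_hsv(rs, gs, bs)   # brightness from styled
        out.append(colorsys.hsv_to_rgb(h, s, v))
    return out
```

Real code would run this over NumPy arrays rather than Python lists, but the channel swap is the whole trick.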
This works for images because images can be decomposed into different scales: in classical image processing via image pyramids, and in this case via deep convolutional neural networks, which capture progressively more complex features at larger scales. So you decompose an image, keep the high-level features fixed, and modify the image (using gradient ascent) until the low-level features match those from a sample of the artistic style.
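The classical pyramid decomposition mentioned above is easy to illustrate. This toy sketch substitutes 2x2 box averaging for a proper Gaussian blur, just to show how each level sheds fine detail while keeping coarse form:

```python
def box_pyramid(img, levels):
    """Build a simple image pyramid from a grayscale image (a list of
    rows of floats). Each level halves the resolution by averaging
    2x2 blocks, so fine detail vanishes while coarse form remains."""
    pyramid = [img]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = len(prev) // 2, len(prev[0]) // 2
        pyramid.append([[(prev[2*y][2*x] + prev[2*y][2*x+1] +
                          prev[2*y+1][2*x] + prev[2*y+1][2*x+1]) / 4.0
                         for x in range(w)] for y in range(h)])
    return pyramid
```

Feed it a 4x4 checkerboard and the next level is uniform gray: the fine-scale pattern (the "style" scale) is gone, while the average intensity (the coarse scale) survives.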
So to use the same algorithm for music you would have to decompose audio in a similar meaningful way. There has also been a lot of success in speech recognition with CNNs lately, but I don't know what the situation is with modelling music.
> So to use the same algorithm for music you would have to decompose audio in a similar meaningful way.
Well, in music, scaling could be compared to increasing/decreasing frequency. We all know that a song which is transposed by e.g. an octave still sounds the same (albeit lower/higher). So I think the concept translates well from images.
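The octave claim is just frequency doubling; in equal temperament, each semitone multiplies frequency by 2^(1/12), so twelve semitones exactly doubles it:

```python
def transpose(freq_hz, semitones):
    """Equal-temperament transposition: each semitone multiplies the
    frequency by 2**(1/12), so 12 semitones (an octave) doubles it."""
    return freq_hz * 2 ** (semitones / 12)
```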
That's not quite what I meant. The point of image pyramids is that they separate fine detail (e.g. style) from coarse detail (form). A note transposed down an octave is still exactly the same kind of object; it's no more abstract, so frequency is not an analogue of scale in images.
What you want is a progression of abstractions, e.g. note, chord, melody... but one where 1) each level is largely orthogonal to the others, so they can be separated and recombined with a lot of flexibility (definitely not true of notes and melodies), 2) the decomposition can be computed straightforwardly, preferably by a differentiable function, and 3) it separates style from form.
Yes, and I'll try to make it more explicit for people saying that sound is also two-dimensional. Before any analysis, sound is one-dimensional, unless you count stereo channels as an additional dimension; but then it makes equal sense to count color channels as an additional dimension in images. Frequencies only appear once you do a Fourier transform, and that transform is equally applicable to images as it is to sound, so images stay at least one dimension ahead of sound.
So it appears that dimensionality is just not a good way of explaining the difficulty here. I'd say the neural style technique was developed specifically for images, and it's becoming apparent that we can't simply apply it to sound and get good results. Getting there will probably mean working out something similar from the ground up.
Sound just has an amplitude at every sample, but that is too low-level to be very useful. Instead you would probably feed the net the same kind of features a human extracts: a list of frequencies at each point in time, each with an amplitude.
(In humans, oscillating hair cells perform an analog frequency extraction, while with computers you'd use a Fourier transform to do it mathematically.)
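That feature extraction (an amplitude per frequency bin, per time window) can be sketched with a naive DFT. Real code would use an FFT library; this is just the math spelled out:

```python
import cmath

def dft_magnitudes(frame):
    """Naive discrete Fourier transform of one frame of samples.
    Returns the magnitude of each frequency bin, i.e. the
    'amplitude per frequency' features described above."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]
```

A pure sine wave that completes two cycles in an 8-sample frame shows up as energy concentrated in bin 2 (and its mirror, bin 6), with the other bins near zero, which is exactly the frequency/amplitude representation a net would consume.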
[0] https://en.wikipedia.org/wiki/Wikipedia:Featured_pictures/Ar...