
If you think song and image generators can't make creative things, the problem is between the seat and the keyboard. It is unfortunately that simple. 4o as an image generator can create out-of-distribution images because it literally works at a conceptual level.

Suno is currently limited architecturally to in-distribution components, so trying to create instruments or vocal styles it has never heard won't work. But the parts you can work within are a vast and rich creative space.


Tell me you don't know the first thing about making pictures or music without telling me you don't know the first thing about making pictures or music.


You're not making art, you're running prompts through a meme picture generator.

Like I said, you know nothing about making art. (Not a big deal, not everybody needs to.)

In an alternate universe we might have gotten actual AI tools for making art - these would be various infilling plugins, basically. ("Make me a Photoshop brush that paints pine-needle-looking spiky things", that sort of thing.)

What we got in this reality is not that, though. Meme making of the sort you posted is not art.

(That said, making memes, especially of the spicy and/or topical kind, is the one niche that generative AI is good at and nails perfectly.)


Arguing about what "art" is is a tale as old as time, relax. Your own perspective is not the only valid one.

I mean I did have a meme or two in there to show the range. You clearly didn't look at it if you think it is all memes. Maybe think hard about what art really is.. a brush is a brush..

I don't have to justify my art to anyone. The magic is.. you can't choose what is art. You can choose what YOU think is art and that's fine. However you can't dictate to the world what is art.


I agree you don’t have to justify your art. But which was not a meme? I clicked many and found only memes.

That's not the AI creating art though, that's you using 4o like a pencil. The ideas / creativity came from you, not the AI.

P.S. Do you mind sharing the prompt you used to do the crocodile?


Sure... I mean it isn't a single prompt.. it is always a process.

https://chatgpt.com/share/681855ce-a9d4-8004-9a3a-deb19994d8...

Edit: I was really just testing to see how well it could do layout and out-of-distribution images. This was right when it came out and I was trying to see what the limitations were.


Nah he's right. Gatekeeping art and "the process" is as old as time.

I'm not a pro, but this seems different on the 4.5 model. It seems much crisper.

Yeah, 4.5 definitely improves a lot, but I can still hear washiness in the high end. Probably not noticeable to most people in isolation or on poor audio devices like low-end phone speakers, but it would be very noticeable on a good sound system in a playlist with professional songs.

It still doesn't capture meter or phrase structure well in classical songs, and there are a lot of weird artifacts. It's better than the old models, but there's still a long way to go.

I'm still experimenting with what works and doesn't work. Currently for style I am trying things like:

-----

STYLE: Earth Circuit Fusion

INSTRUMENTATION:
- Deep analog synth bass with subtle distortion
- Hybrid percussion combining djembe and electronic glitches
- Polytonal synth arpeggios with unpredictable patterns
- Processed field recordings for atmospheric texture
- Circuit-bent toys creating unexpected melodic accents

VOCAL APPROACH:
- Female vocalist with rich mid-range and clear upper register
- Intimate yet confident delivery with controlled vibrato
- Layered whisper-singing technique in verses
- Full-voiced chorus delivery with slight emotional rasp
- Spoken-word elements layered under bridge melodies
- Stacked fifth harmonies creating ethereal chorus quality

PRODUCTION:
- Grainy tape saturation on organic elements
- Juxtaposition of lo-fi and hi-fi within same sections
- Strategic arrangement dropouts for dramatic impact
- Glitch transition effects between sections

-----

One thing I have noticed with the new model is that it listens to direction in the lyrics more now, for example [whispered] or [bass drop], etc.

There are clear limits. I have been unsuccessful with spatial arrangement.

EDIT: I realized I didn't specify: this is when you use Custom mode and specify the lyrics and the style separately.
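
For reference, here's a sketch of what my lyrics box might look like (the section headers and bracketed cues are just the convention I use; only some of them reliably land):

    [Intro]
    [field recording fades in]

    [Verse 1]
    [whispered] Wires hum beneath the floor...

    [Chorus]
    [full voice, emotional rasp] ...

    [Bridge]
    [spoken word under melody] ...
    [bass drop]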


Well, remember that listing/ranking things is structurally hard for these models, because the model has to keep track of what it has already listed and what it hasn't, etc.

So the main thing it seems to be able to do is take very granular song direction. The 4.0 model didn't stick very well to song direction. It "might" put a "bass drop" where you wanted it, etc.

The new 4.5 seems to listen to both more nuanced song style descriptions and in-line direction better. Still not perfect, but it is a huge improvement. Take a look at this example (it does have some errors, like I heard it sing one of the sound directions): https://suno.com/s/xaJNWNZSH17wAano

This is going to open up a lot of creativity: while you are still often at the mercy of the random number generator, you can really shape the song much more.


So my personal belief is that diffusion models will enable higher degrees of accuracy, because unlike an auto-regressive model they can adjust a whole block of tokens when they encounter some kind of disjunction.

Think of the old example where an auto-regressive model would output "There are 2 possibilities..." before it really enumerated them. The model often has trouble overcoming that bias and will hallucinate a response to fit the preceding tokens.

Chain of thought and other approaches help overcome this and other issues by incentivizing validation, etc.

With diffusion, however, it is easier for later denoising steps to revise that set of tokens to match the actual number of possibilities enumerated.

This is why I think you'll see diffusion models be able to do some more advanced problem solving with a smaller number of "thinking" tokens.
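
To make that concrete, here's a toy sketch of the difference (pure illustration, not any real model's sampler; propose() is a random stand-in for the network):

    import random

    VOCAB = ["There", "are", "2", "3", "possibilities", ":", "A", "B", "C", "."]

    def propose(context):
        # Stand-in for a model's next-token / infill head.
        return random.choice(VOCAB)

    # Autoregressive decoding: each token is sampled once, conditioned
    # only on the prefix, then frozen forever.
    def autoregressive(n_tokens):
        seq = []
        for _ in range(n_tokens):
            seq.append(propose(seq))
        return seq

    # Diffusion-style decoding: start fully masked; on every refinement
    # step any position may be (re)written, so an early commitment like
    # "2" can still become "3" once the enumeration takes shape.
    def diffusion(n_tokens, n_steps=5):
        seq = ["<mask>"] * n_tokens
        for _ in range(n_steps):
            for i in range(n_tokens):
                if random.random() < 0.5:  # revisit this position
                    seq[i] = propose(seq)
        return seq

    print(autoregressive(8))
    print(diffusion(8))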


Unfortunately, the intuition and the math proofs so far suggest that autoregressive training learns the joint distribution of probabilistic streams of tokens much better than diffusion models do or ever will. My intuitive take is that the conditional probability distribution of decoder-only autoregressive models is at just the right level of complexity for probabilistic models to learn accurately enough.

Intuitively (and simplifying things at the risk of breaking rigor), diffusion (or masked) models occasionally have to issue tokens with less information, and thus higher variance, than a pure autoregressive model would, so the joint distribution, i.e. the probability of the whole sentence/answer, will be lower, and thus diffusion models will never get precise enough.

Of course, during generation the sampling techniques influence the above simplified picture dramatically, and the typical randomized sampling for next-token prediction is suboptimal and could in principle be beaten by a carefully designed block diffusion sampler in some contexts, though I haven't seen real examples of it yet.

But the key idea of the scribbles above still holds: autoregressive models will always be better (or at least equal) probabilistic models of sequential data than diffusion models. So diffusion models mostly offer a performance-vs-quality tradeoff. Sometimes there is a lot of room for that tradeoff in practice.
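
To put the same scribbles in symbols (schematically, glossing over the exact diffusion parameterization and weights): the autoregressive objective is the exact chain-rule factorization of the joint, while masked-diffusion training optimizes only a variational lower bound on it:

    % exact chain-rule factorization (autoregressive)
    \log p_{\mathrm{AR}}(x_{1:N}) = \sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t})

    % masked diffusion: a weighted masked-prediction bound (schematic ELBO)
    \log p_{\mathrm{diff}}(x_{1:N}) \ge
      \mathbb{E}_{q}\Big[ \sum_{t \in \text{masked}} w_t \,
        \log p_\theta(x_t \mid x_{\text{observed}}) \Big]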


This is tremendously interesting!

Could you point me to some literature? Especially regarding mathematical proofs of your intuition?

I’d like to recalibrate my priors to align better with current research results.


From the mathematical point of view the literature is about the distinction between a "filtering" distribution and a "smoothing" distribution. The smoothing distribution is strictly more powerful.

In theory, the smoothing distribution has access to all the information the filtering distribution has, plus some additional information, and therefore its achievable minimum (of the loss) is at or below the filtering distribution's.

In practice, because the smoothing input space is much bigger, with the same number of parameters we may not reach a better score: with diffusion we are tackling a much harder problem (the whole problem), whereas with autoregressive models we are taking a shortcut, one that humans are probably biased toward as well (communication evolved so that it can be serialized and exchanged orally).
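
In symbols (my notation), for a token sequence x_{1:N}:

    \underbrace{p(x_k \mid x_{1:k-1})}_{\text{filtering}}
    \quad \text{vs.} \quad
    \underbrace{p(x_k \mid x_{1:k-1},\, x_{k+1:N})}_{\text{smoothing}}

    % conditioning on strictly more information never increases entropy:
    H(x_k \mid x_{1:k-1}, x_{k+1:N}) \le H(x_k \mid x_{1:k-1})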


Although what you say about smoothing vs. filtering is true in principle, for conditional generation of the eventual joint distribution, starting from the same condition with an autoregressive vs. a diffusive LLM, it is the smoothing distribution that has less power. In other words, during inference, writing token K given tokens up to J is of course better with diffusion if you are also given some tokens after token K, up to the maximal token N. However, if your input is fixed (tokens up to J) and you have to predict all the additional tokens (from J+1 to N), you are solving a harder problem and end up with a lower joint probability for the full generated sequence from J+1 to N.

I am still jetlagged and not sure what the most helpful reference would be. Maybe start from the block diffusion paper I recommended in a parallel thread and trace your way up/down from there. The logic leading to Eq 6 is a special case of such a math proof.

https://openreview.net/forum?id=tyEyYT267x


What are the barriers to mixed architecture models? Models which could seamlessly pass from autoregressive to diffusion, etc.

Humans can integrate multiple sensory processing centers and multiple modes of thought all at once. It's baked into our training process (life).


Human processing is still autoregressive, but it uses multiple parallel synchronized streams. There is no problem with such an approach, and my best guess is that in the next year we will see many teams training models with such tricks for generating reasoning traces in parallel.

The main concern is taking a single probabilistic stream (e.g. a book) and comparing an autoregressive modelling of it with a diffusive modelling of it.

Regarding mixing diffusion and autoregressive models: I was at ICLR last week, and this work is probably relevant: https://openreview.net/forum?id=tyEyYT267x


Maybe diffusion for "thoughts" and autoregressive for output :S

This suggests an opportunity for hybrids, where the diffusion model might be responsible for the large-scale structure of the response and the next-token model for filling in details. Sort of like a multi-scale model in dynamics simulations.
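
Something like this, as a toy sketch (both "models" here are stand-ins, not a real API):

    # Toy sketch of a coarse-to-fine hybrid; both stages are stand-ins.
    def draft_skeleton(prompt):
        # Diffusion-like stage: propose the whole outline in parallel.
        return ["intro", "argument", "example", "conclusion"]

    def fill_section(prompt, outline, section):
        # Autoregressive stage: expand one slot, left to right.
        return f"<{section} text, conditioned on {prompt!r} and the outline>"

    def hybrid_generate(prompt):
        outline = draft_skeleton(prompt)
        return [fill_section(prompt, outline, s) for s in outline]

    print(hybrid_generate("why the sky is blue"))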


> it can adjust a whole block of tokens when it encounters some kind of disjunction.

This is true in principle for general diffusion models, but I don't think it's true for the noise model they use in Mercury (at least, going by a couple of academic papers authored by the Inception co-founders). Their model generates noise by masking a token, and once it's masked, it stays masked. So the reverse diffusion gets to decide on the contents of a masked token once, and after that it's fixed.
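
A minimal sketch of that absorbing ("mask") forward process as I understand it from their papers (the names here are mine):

    import random

    MASK = "<mask>"

    # Forward (noising) process with an absorbing state: at each step,
    # every still-visible token is independently masked with some
    # probability, and a masked token never becomes unmasked again.
    def forward_mask(tokens, n_steps, p_mask=0.15):
        xs = list(tokens)
        for _ in range(n_steps):
            for i, tok in enumerate(xs):
                if tok != MASK and random.random() < p_mask:
                    xs[i] = MASK  # absorbing: stays masked from here on
        return xs

    # The reverse model then only ever fills in masked positions; once a
    # position is unmasked, it is never revisited.
    print(forward_mask("the cat sat on the mat".split(), n_steps=10))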


Here are two papers linked from Inception's site:

1. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution - https://arxiv.org/abs/2310.16834

2. Simple and Effective Masked Diffusion Language Models - https://arxiv.org/abs/2406.07524


Thanks, yes, I was thinking specifically of "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution". They actually consider two noise distributions: one with uniform sampling for each noised token position, and one with terminal masking (the Q^{uniform} and Q^{absorb}). However, the terminal-masking system is clearly superior in their benchmarks.

https://arxiv.org/pdf/2310.16834#page=6


The exact types of path dependencies in inference on text-diffusion models look like an interesting research project.


Yes, the problem is coming up with a noise model where reverse diffusion is tractable.


Thank you, I'll have to read the papers. I don't think I have read theirs.


Once the auto-regressive model goes deep enough (or uses "reasoning"), it has actually modeled what possibilities exist by the time it has said "There are 2 possibilities..."

We're long past that point of model complexity.


But as everyone knows, computer science has two hard problems: naming things, cache invalidation, and off by one errors.


I think you are underestimating the importance of a "world model" in the process. It is the modeling of how all these details are related to each other that is critical here.

The LLM will have an edge by being able to draw on higher-level abstract concepts.


I think you are overestimating how much knowledge is in o3's world model. Just because it can output something doesn't mean it's likely to substantially affect its future outputs. Even just talking to it about college-level algebra, it seems not to understand these abstract concepts at all. I definitely don't feel the AGI; it feels like a teenager trying to BS its way through an essay with massive amounts of plagiarism.


I think an alternative possible explanation is that it could be "double checking" the metadata. You could test that by providing images with manipulated metadata.


I mean, you say that.. but the counterpoint.. is maybe I don't like your lyrics and I want to make my own.


Then write your own. Maybe you're no good at it, but you're not going to improve by just copying them from the chatgpt console.


Right, I write my own and plug them into a song generation service. With a little trial and error, I get a song I'll listen to.


You are on to the key insight here.. what is emerging is the creative consumer. I.e. I know what I want when I hear it. Or I know what I want when I see it.

This means you can hear something and say.. you know this is nice, but I would like it more if it were different in this way.

With generative tools you can do that. Personally, I really like to listen to music, but I generally dislike the lyrics. I want uplifting songs, maybe about what I am doing right now, to motivate me. Well, with something like Suno.com.. I can just make one. Or I can work with Claude or ChatGPT to quickly iterate on some lyrics and edit them to create an even higher-fidelity song.

The key here is that I couldn't give a rat's ass if anyone in the world likes or cares about my song.. but I can listen to it while I work. It is exactly what I wanted to listen to, or close enough.


My theory is that as the quality of these generative tools increases, we'll see public opinion of them slowly shift. Regardless of philosophy (although discussing it is always fun), it just seems inevitable, since there are so many more consumers than producers. And as you say, consumers are the ones who will primarily benefit from this new technology. As consumers we care primarily (some could argue solely) about our own emotional reaction to the music, or, more generally, the art piece.

In practical terms, I also believe this will give rise to a lot of new consumer behavior, and, as you so aptly put it, "creative consumers" will become normal.

The ability to create more content on demand to fill some very narrow niche is a great example ("Today I want 24 hours of non-stop Mongolian throat-singing neo-industrial Christmas music"). Or maybe to create covers of songs in the voice of your favorite long-dead artist. Anything from a minor tweak of an existing work ("I wish this love song was dedicated specifically to ME") to completely new works (just look at how much the parody-music genre has grown since Suno and the like first appeared). The possibilities are near endless.

