Producing images of spectrograms is a genius idea. Great implementation!

A couple of ideas that come to mind:

- I wonder if you could separate the audio tracks of each instrument, generate each one separately, and then combine them. This could give more control over the generation. Keeping the tracks aligned in time might be tough, though.

- If you could at least separate vocals and instrumentals, you could train a separate model for vocals (an LLM for the lyrics, then text-to-speech, maybe). The current implementation doesn't seem to handle vocals as well as dedicated TTS models do.
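
To make the "combine them" step concrete, here is a rough sketch of how separately generated tracks could be mixed back together. It assumes each track is already decoded to a plain list of float samples at the same sample rate and already aligned in time, which is the hard part mentioned above. `mix_stems` is a hypothetical helper name, not something from the project.

```python
def mix_stems(stems, peak=1.0):
    """Sum per-instrument tracks sample by sample.

    Assumption: every stem is a list of floats at the same sample rate
    and already time-aligned. Shorter stems are treated as ending in silence.
    """
    if not stems:
        return []
    length = max(len(stem) for stem in stems)
    mixed = [0.0] * length
    for stem in stems:
        for i, sample in enumerate(stem):
            mixed[i] += sample
    # Scale the mix down only if the sum exceeds the allowed peak (avoids clipping).
    loudest = max((abs(x) for x in mixed), default=0.0)
    if loudest > peak:
        scale = peak / loudest
        mixed = [x * scale for x in mixed]
    return mixed


if __name__ == "__main__":
    drums = [0.5, 0.5]
    bass = [0.5, 0.5, 0.25]
    print(mix_stems([drums, bass]))
```

Summing plus a peak check is just the simplest possible mix. It does nothing about alignment or about the tracks fitting together musically, so those would still need to be handled at generation time.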