Hacker News

Producing images of spectrograms is a genius idea. Great implementation!

A couple of ideas that come to mind:

- I wonder if you could separate the audio tracks of each instrument, generate separately, and then combine them. This could give more control over the generation. Alignment might be tough, though.

- If you could at least separate vocals and instrumentals, you could train a separate model for vocals (an LLM for lyrics, then text-to-speech, maybe). The current implementation doesn't seem to handle vocals as well as TTS models do.
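As a toy illustration of the vocal/instrumental split (my own sketch, not anything from the project): the classic "karaoke" baseline exploits the fact that vocals are usually panned center, so subtracting one stereo channel from the other cancels center-panned content. Real separators are learned models, but the idea fits in a few lines:

```python
# Crude center-channel cancellation ("karaoke" trick) -- an illustrative
# baseline only; modern systems use learned source separators.
import numpy as np

n = 8
vocal = np.linspace(-1, 1, n)        # center-panned stand-in for vocals
backing_l = np.sin(np.arange(n))     # side content differs per channel
backing_r = np.cos(np.arange(n))

left = vocal + backing_l             # vocal appears equally in both channels
right = vocal + backing_r

instrumental_estimate = left - right # center-panned vocal cancels exactly here
assert np.allclose(instrumental_estimate, backing_l - backing_r)
```

In practice this only removes perfectly center-panned, unprocessed vocals (reverb tails survive), which is why learned separators dominate now.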



I think you'd have to start with separate spectrograms per instrument, then blend the complete track in "post" at the end.
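One wrinkle with blending "in post" (my sketch, assuming the model emits magnitude spectrograms): the Fourier transform is linear, so summing complex spectra is exactly equivalent to mixing waveforms, but magnitude-only spectrograms discard phase and do not add linearly. The safer blend is to invert each instrument's spectrogram to audio first and mix the waveforms:

```python
# Linearity check: complex spectra add like waveforms; magnitudes don't.
import numpy as np

rng = np.random.default_rng(0)
drums = rng.standard_normal(1024)    # stand-ins for separated stems
guitar = rng.standard_normal(1024)

# Summing complex spectra, then inverting, equals mixing the waveforms.
mixed_via_spectra = np.fft.irfft(np.fft.rfft(drums) + np.fft.rfft(guitar))
assert np.allclose(mixed_via_spectra, drums + guitar)

# Summing magnitude spectra is NOT the magnitude of the mix (phase lost).
mag_sum = np.abs(np.fft.rfft(drums)) + np.abs(np.fft.rfft(guitar))
mag_of_mix = np.abs(np.fft.rfft(drums + guitar))
print(np.allclose(mag_sum, mag_of_mix))  # generally False
```

So "blend at the end" likely means invert each generated spectrogram (e.g. with a phase-reconstruction step) and sum the resulting audio, not sum the spectrogram images themselves.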




