Transcribing solo piano to MIDI, even for very complex polyphonic pieces, is fairly reasonable these days. You can try a demo of a not-quite-state-of-the-art system here: https://piano-scribe.glitch.me/ . (Full disclosure: I was one of the researchers who produced this system.)
I would argue that there are a lot more people who can take the MIDI output and turn it into a reasonable score than there are people who can listen to the raw audio and turn it into a reasonable approximation of the notes.
I agree 100% with this essay. I’ve been transcribing music manually for just over 20 years, and I’ve dabbled with automatic transcription products and algorithms. A few months ago, I attended my first ISMIR conference (the mostly academic community of researchers working on these problems).
With that background, I’ve come to believe automatic transcription is still so far from good that we’re better off creating tools that make it easier for humans to transcribe. That’s our philosophy behind Soundslice (https://www.soundslice.com/transcribe/), which combines a music notation editor with transcription tools. For anybody interested in transcribing music, I encourage you to give it a try.
It seems like an interesting machine learning problem rather than an AI-complete one? There are lots of finicky details when transcribing speech as well, but it's apparently not AI-complete. You do need a large corpus so the algorithm knows what's typical.
If I were going to work on this, I would work on generating lead sheets from YouTube videos. Recognizing chords seems like a useful thing to solve?
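For a rough sense of what the naive version of this looks like, here's a minimal sketch (my own illustration) that matches per-frame chroma vectors against major/minor triad templates. It assumes librosa is installed and the audio has already been pulled down to a local file ('song.wav' is a placeholder); real chord recognizers add beat tracking, temporal smoothing, and a much richer chord vocabulary.

```python
# Naive chord recognition: compare per-frame chroma against major/minor
# triad templates. A sketch only; real systems also handle inversions,
# sevenths, and smooth the frame-wise decisions over time.
import numpy as np
import librosa

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def chord_templates():
    """Build 24 binary templates: 12 major and 12 minor triads."""
    templates, labels = [], []
    for root in range(12):
        for quality, third in (('maj', 4), ('min', 3)):
            t = np.zeros(12)
            t[[root, (root + third) % 12, (root + 7) % 12]] = 1.0
            templates.append(t / np.linalg.norm(t))
            labels.append(NOTE_NAMES[root] + ('' if quality == 'maj' else 'm'))
    return np.array(templates), labels

def estimate_chords(audio_path):
    y, sr = librosa.load(audio_path)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)    # shape (12, n_frames)
    templates, labels = chord_templates()
    scores = templates @ chroma                         # similarity per template per frame
    return [labels[i] for i in scores.argmax(axis=0)]   # one chord label per frame

# frame_labels = estimate_chords('song.wav')
```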
>Recognizing chords seems like a useful thing to solve?
It's also pretty easy to teach and learn for a practicing musician but much more difficult to teach a machine, since you run into issues with blind source separation.
Speech transcription is good enough provided we have enough processing power, assume a single speaker, know the language beforehand, and have trained the model on a large number of previous speakers with the same dialect/accent.
In music, you don't know how many speakers there are (instruments playing), the dialect/accent (orchestration/chord voicing) changes on the fly, representations are non-unique and contextual, and artists intentionally subvert expected results to make good music.
Humans are just better at this and easier to train to do it than computers, for the moment.
On the other hand, most instruments are built for repeatability, so you can play the same note twice and have it sound pretty much the same. Humans produce speech with a resonant cavity made of soft tissue, so a sound like "aaah" corresponds to an unsharp range of tongue shapes etc. and saying the same thing twice is pretty much guaranteed to sound quite different.
Being able to separate individual notes of a musical piece into sharply defined buckets (keys of a piano) or one-dimensional subspaces (finger position on stringed instruments like guitars) simplifies the source separation problem a lot.
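To make the "sharply defined buckets" point concrete, here's a small sketch (my own illustration, not from the thread) that snaps a detected frequency to the nearest equal-tempered piano key and reports how far off it was in cents:

```python
import math

def freq_to_piano_key(f_hz):
    """Snap a detected frequency to the nearest equal-tempered key
    (MIDI note number) and report the residual in cents."""
    midi_float = 69 + 12 * math.log2(f_hz / 440.0)   # A4 = 440 Hz = MIDI 69
    midi = round(midi_float)
    cents_off = 100 * (midi_float - midi)
    return midi, cents_off

print(freq_to_piano_key(262.3))   # ~ (60, about +4 cents): lands in the C4 bucket
```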
That representations are contextual and subject to interpretation by the artist is a harder problem (as discussed in TFA), but it should be possible to treat it separately from the pure chord recognition problem. (E.g. it would be easy to take notation and a matching MIDI file and then pretend that it's the output of the recognition step which the original notation should be recovered from.)
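As a sketch of that last idea, assuming the pretty_midi library and placeholder file names: take a clean MIDI file as the ground-truth side, perturb a copy of it to stand in for the recognizer's output, and train the notation-recovery step on those pairs. The jitter amount here is purely illustrative.

```python
# Treat a clean MIDI file as ground-truth notation and a perturbed copy
# as a stand-in for the pitch recognizer's output.
import random
import pretty_midi

def simulate_recognizer_output(midi_in, midi_out, timing_jitter=0.03):
    pm = pretty_midi.PrettyMIDI(midi_in)
    for inst in pm.instruments:
        for note in inst.notes:
            # Smear onsets/offsets the way a real pitch detector would.
            note.start = max(0.0, note.start + random.uniform(-timing_jitter, timing_jitter))
            note.end = max(note.start + 0.01, note.end + random.uniform(-timing_jitter, timing_jitter))
    pm.write(midi_out)

# simulate_recognizer_output('piece.mid', 'piece_noisy.mid')
```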
The buckets are more flat than sharply defined (ha). That's part of the problem. Information is lost going from notation to pitch to recording, and I suspect one could prove that these transforms don't preserve topology and are not homeomorphisms that can be easily reversed.
To use your example, finger position is not a one-dimensional subspace on a guitar. There are anywhere from one to six ways to play a given pitch, even assuming standard tuning on a six-string guitar, which is not a safe assumption to make.
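A quick enumeration makes the point (standard tuning assumed; the fret limit is an assumption about the instrument):

```python
# Enumerate every string/fret position that produces a given MIDI pitch
# on a standard-tuned six-string guitar.
STANDARD_TUNING = {'E2': 40, 'A2': 45, 'D3': 50, 'G3': 55, 'B3': 59, 'E4': 64}

def positions_for_pitch(midi_pitch, max_fret=22):
    return [(string, midi_pitch - open_pitch)
            for string, open_pitch in STANDARD_TUNING.items()
            if 0 <= midi_pitch - open_pitch <= max_fret]

print(positions_for_pitch(64))  # E4: playable on 5 strings here, 6 with a 24-fret neck
```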
But the issue with chord transcription is one at the heart of source separation outside of VC demos, which is the causality problem that no one likes to talk about. To separate into N sources you need to know a priori that there are N sources to separate into. This is not a trivial thing to predict in chord voicings, where N changes and is not predictable. Then you need to make a best guess about which instruments the pitches belong to, and a given pitch may be shared across several of them.
This is something that even humans fall victim to. Untrained listeners are very bad at quantifying sources in an ensemble, and even trained listeners struggle to notate chord voicings with decent accuracy.
It’s so, so much harder than transcription of language. The article attempts to capture this point, but while there’s generally only a few ways to punctuate a sentence, assuming you have all the words right, there are thousands of potential ways to notate a musical phrase, even assuming you can correctly interpret all the notes and rhythms, and only a few of them would be easily sensible to a human performer.
> There are lots of finicky details when transcribing speech as well, but it's apparently not AI complete. You do need a large corpus so the algorithm knows what's typical.
Speech is dramatically easier. The reason is that language is meant to communicate and consequently carries a lot of redundant information that you can bring to bear.
Music often has no such redundancy. It may have themes that differ slightly each time, so nothing to lock onto.
> Recognizing chords seems like a useful thing to solve
Except that a single C note (especially on stringed instruments) may have many harmonics that also look like a C chord. The problem isn't straightforward.
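To see why, here's a small sketch (mine, not the commenter's) that rounds the first six harmonics of a low C to the nearest equal-tempered pitch; they land on C, G, and E, i.e. the notes of a C major triad:

```python
# Why a single low C can "look like" a C chord: its first few harmonics
# fall (almost) on C, G, and E. Pitch names via round-to-nearest
# equal temperament.
import math

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
f0 = 130.81  # C3

for n in range(1, 7):
    f = n * f0
    midi = round(69 + 12 * math.log2(f / 440.0))
    print(n, round(f, 1), NOTE_NAMES[midi % 12])
# 1 C, 2 C, 3 G, 4 C, 5 E, 6 G -- the notes of a C major triad
```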
>Except that a single C note (especially on stringed instruments) may have many harmonics that also look like a C chord. The problem isn't straightforward.
It's not terribly difficult either. For plucked or struck strings, you can rely on the inharmonicity of the overtones to distinguish fundamentals from harmonics. Ensemble bowed strings are more tricky because the overtones are mode-locked, but we can rely on variance between individual players in both the time and frequency domain.
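For anyone curious what "relying on inharmonicity" means in practice: a stiff plucked or struck string has partials stretched sharp of exact integer multiples, roughly f_n = n·f0·√(1 + B·n²). The sketch below uses a made-up inharmonicity coefficient B just to show the stretching; it isn't a claim about any particular instrument.

```python
# Stretched partials of a stiff string vs. ideal integer harmonics.
import math

f0, B = 130.81, 4e-4   # C3-ish fundamental, illustrative inharmonicity coefficient

for n in range(1, 9):
    ideal = n * f0
    stretched = n * f0 * math.sqrt(1 + B * n * n)
    sharp_by = 1200 * math.log2(stretched / ideal)   # in cents
    print(n, round(ideal, 1), round(stretched, 1), f"+{sharp_by:.1f} cents")
```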
Polyphonic pitch detection is a bread-and-butter feature in many commercial music technology products and the industry is many years ahead of academia; I have absolutely no doubt that you'd be able to buy automatic transcription software right now if the market was big enough to justify the R&D spend.
CTO at previous job was responsible for one of the better heuristic algorithms from the era before ML took over and our product dealt with this problem.
I'm inclined to disagree with your assessment. Transcription is to polyphonic pitch detection what machine translation is to OCR. We are okay at the latter, which is what folks like Celemony make their living in. We still suck terribly at the former.
The last 20% will be fiendishly hard, as it depends on a mountain of cultural knowledge. A tool to help amateur transcribers to get ~70% of the way is probably within reach of current tech.
All true, although one of the things that bothers me is that even if you can read music you need to understand the historical context in order to play a piece properly. For example, you would play a Bach piece with a lighter touch and less legato than you would a Beethoven piece with the same notes. And of course a jazz musician looking at music from a certain era is almost certainly going to play what look like eighth notes with swing, which is a completely different feel and should actually be notated differently as well.
"Swing" was known even in the Baroque era, actually. It was called notes inégales (which literally means "non-equal notes"), while playing eighths "straight" would be called notes égales ("equal notes"). Some of these other variations are inherent in how notation has evolved over time - by the time of Beethoven, lots of indications would be written into the piece to try and convey how it should be played. You don't get that kind of thing from Bach-era scores, and sometimes people play them a bit too mechanically as a result. It's very much a matter of finding the right balance for every subgenre or tradition.
The notation is shorthand for the composers' intent, not a computer program to perfectly reconstruct the music every time. Ambiguity is a feature, not a bug.
Not sure why this bothers you. Musical notation is, by necessity, a greatly simplified representation of the intended performance. Not only that, but the composers themselves well understand that the interpretation of their notation is itself an artistic activity that is independent from the musical notation they write. There are thousands of recorded performances of any particular Bach or Beethoven work, every one different, sometimes very very much different. And none of them are “wrong”. We certainly don’t know for sure how Bach’s and Beethoven’s works were performed in their own days—there’s uncertainty over tuning, dynamics, and tempo, to say nothing of stylistic details—but we do know that the instruments they had available were very different, and yet their works are still great today, even though we can never reproduce what they were “supposed” to sound like.
This is why I tend to prefer "auteur" music; music composed, arranged, produced, performed, and ideally even mixed by the same person/small group of people.
It's a shame we'll never get to hear Beethoven or Mozart playing their pieces themselves, in their intended locations and with their preferred instruments and seats and such. I suspect it'd sound much better than any other performer who's interpreted their work.
Rather than targeting formal music notation, I think it makes a lot more sense to use modern digital audio workstation representation (horizontal bars representing individual tracks). I believe this can represent everything music notation can (for practical purposes) and has the advantage of being easily visually readable by people with little musical training (I've spent the time to learn basic music notation and I find it has a very high cognitive overhead for people who didn't learn it during early childhood).
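For reference, that DAW-style view is easy to produce from MIDI. A minimal sketch, assuming pretty_midi and matplotlib are available and using a placeholder file name:

```python
# Render a MIDI file as a piano roll: pitch on the vertical axis,
# time on the horizontal, like a DAW track view.
import pretty_midi
import matplotlib.pyplot as plt

pm = pretty_midi.PrettyMIDI('piece.mid')
roll = pm.get_piano_roll(fs=100)          # (128 pitches, time frames) velocity matrix

plt.imshow(roll, aspect='auto', origin='lower', cmap='gray_r')
plt.xlabel('time (frames at 100 Hz)')
plt.ylabel('MIDI pitch')
plt.show()
```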
It seems likely that, given enough training data (labelled scores), you could train a net that takes raw, complex music in and generates that kind of representation out.
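As a very loose sketch of the general shape such a net could take (this is not the architecture from the Onsets and Frames papers linked elsewhere in the thread; layer sizes are arbitrary): spectrogram frames in, an 88-key activation per frame out.

```python
# Toy frame-wise transcriber: spectrogram frames in, per-frame key logits out.
import torch
import torch.nn as nn

class FrameTranscriber(nn.Module):
    def __init__(self, n_bins=229, n_keys=88, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_keys),        # logits: which keys are sounding
        )

    def forward(self, spec_frames):            # (batch, time, n_bins)
        return self.net(spec_frames)            # (batch, time, n_keys)

model = FrameTranscriber()
dummy = torch.randn(2, 100, 229)                # 2 clips, 100 frames, 229 spectral bins
print(model(dummy).shape)                       # torch.Size([2, 100, 88])
```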
I’d love to work on automatic transcription for drum set. I trained in classical percussion, but I’m self-taught when it comes to rock, so I have trouble figuring out how to reproduce fills that I want to play.
I remember using Transcribe!, created by the author of this article, to try to figure out some of the fills in ZZ Top’s La Grange, which isn’t a complicated song exactly, but I can’t figure out how to make the fills feel right. Transcribe! is great, btw.
Anyways, I’d be very curious to hear if there’s any existing work in this area. I feel like drum transcription could be easier in some ways than “pitched” instruments, but possibly harder in some unique ways too.
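There is academic work on this (usually under the name automatic drum transcription), and a crude starting point is easy to sketch: detect onsets, then guess kick/snare/cymbal from where the spectral energy sits. The band boundaries below are rough assumptions and the file name is a placeholder; real systems use NMF or neural classifiers and do much better.

```python
# Crude drum-fill sketch: onset detection plus coarse spectral band energy.
import numpy as np
import librosa

y, sr = librosa.load('fill.wav')                      # placeholder file name
onsets = librosa.onset.onset_detect(y=y, sr=sr, units='time')
S = np.abs(librosa.stft(y))
freqs = librosa.fft_frequencies(sr=sr)

for t in onsets:
    frame = librosa.time_to_frames(t, sr=sr)
    spectrum = S[:, frame]
    low = spectrum[freqs < 150].sum()                        # kick territory
    mid = spectrum[(freqs >= 150) & (freqs < 1500)].sum()    # snare body
    high = spectrum[freqs >= 5000].sum()                     # cymbals / hi-hat
    label = ['kick', 'snare', 'hat/cymbal'][int(np.argmax([low, mid, high]))]
    print(round(float(t), 3), label)
```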
So, I've been working on just this for the last 5 months. It's getting there. I'm still a few percent behind what 'Onsets and Frames' (see comment by ekelsen) can do, but the gap is shrinking and my stuff is orders of magnitude faster.
That said, I'm not sure yet whether I'll be able to close the gap, and it's a hard problem to solve. And yes, as the author correctly identifies: this is for solo piano, or harpsichord.
Papers / dataset:
https://arxiv.org/pdf/1810.12247.pdf [New dataset and slightly improved network]
https://arxiv.org/pdf/1710.11153.pdf [Original Network]