I've build a custom thumbnail/metadata extraction toolkit based on libavcodec/libavformat that runs the decoding in seccomp's strict mode and communicates the results through a linear RGB stdout stream. Works pretty well and has low overhead and complexity.
Full transcoding would be a bit more complex, but assuming decoding is done in software, I think that should also be possible.
It's layers. Docker is better than nothing, but a VM is better still, and even better is docker on a dedicated VM on dedicated hardware on a dedicated network segment.
To make a bit of a strawman of what you are saying even better still would be an unplugged power cable as a turned off machine is (mostly) unhackable.
To be more serious seurity is often in conflict with simplicity, efficiency, usability, and many other good things.
A baseline level of security (and avoidance of insecurities) should be expected everywhere, docker allows many places to easily reach it and is often a good enough tradeoff for many realities.