The problem with this is that it's exactly the setting for GANs: if you can get your hands on the detector, it's trivial to train your models to defeat it.
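For concreteness, here's a minimal PyTorch sketch of what that attack looks like: freeze the detector you obtained, use it as a fixed discriminator, and backpropagate through it into your generator. `LeakedDetector`, the layer sizes, and the training details are all hypothetical stand-ins, not any real detector's architecture.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=64, out_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),  # toy stand-in for audio frames
        )

    def forward(self, z):
        return self.net(z)

class LeakedDetector(nn.Module):
    """Hypothetical stand-in for the detector you got your hands on."""
    def __init__(self, in_dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x)  # logit: higher = "synthetic"

gen, det = Generator(), LeakedDetector()
for p in det.parameters():
    p.requires_grad_(False)  # the detector is fixed; we only exploit it

opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
for step in range(1_000):
    fake = gen(torch.randn(32, 64))
    # Push the detector's "synthetic" logit down, i.e. toward "human".
    loss = torch.nn.functional.softplus(det(fake)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```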
Couldn't we say the same about CAPTCHAs? Yet somehow they persevere. There may be conditions under which an AI cannot help but be detected, and trying to hide from those conditions would itself set off an alarm.
CAPTCHAs work primarily by means of economics. Most image CAPTCHAs have been broken in general; it's just usually not worth breaking particular ones for any but the biggest sites. Breaking many, if not most, image CAPTCHAs is almost a textbook image-processing exercise; those that are too hard can be outsourced to people (through e.g. "solve the following CAPTCHAs to access pornography"). Google's CAPTCHA has now moved from image recognition to estimating your humanness from God knows how much data they collect on your browsing; it works, but comes with obvious privacy trade-offs.
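For the curious, here is a rough sketch of that textbook exercise for the simplest image CAPTCHAs, assuming dark glyphs on a light background and a hypothetical `templates` dict mapping characters to reference glyph arrays. Real schemes add noise, distortion, and touching characters that need more preprocessing than this.

```python
import numpy as np
from PIL import Image
from scipy import ndimage

def solve_captcha(path, templates):
    img = np.asarray(Image.open(path).convert("L"), dtype=float)
    binary = img < 128                     # dark glyphs on a light background
    labels, _ = ndimage.label(binary)      # connected-component segmentation
    # Sort the detected glyphs left to right by column position.
    glyphs = sorted(ndimage.find_objects(labels), key=lambda s: s[1].start)
    return "".join(
        min(templates, key=lambda ch: dissimilarity(binary[sl], templates[ch]))
        for sl in glyphs
    )

def dissimilarity(patch, template):
    # Nearest-template matching on a common 16x16 grid; a small CNN per
    # glyph would do better, but this is the "textbook" version.
    resize = lambda a: np.asarray(
        Image.fromarray((a * 255).astype(np.uint8)).resize((16, 16)),
        dtype=float,
    ) / 255.0
    return float(np.abs(resize(patch) - resize(template)).sum())
```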
I feel the war against synthesizers will be over as soon as someone open-sources a good enough one, or at least starts selling it relatively cheaply.
This is an arms race that clearly terminates in victory for the synthesizers, so I can't get too upset that there's a way around this particular step in the race.
No, the end state is clear. There is no reason to believe that speech synthesis will not eventually produce speech indistinguishable from human speech, and once it gets there, it's game over for this approach. There is an attainable end goal.
Part of the problem you may have in seeing that is that human speech is not a point in speech space; it's a range. If you are operating in the real world, that range is expanded further by the real-world noise you will encounter.
The fact that the synthesizers have access to the heuristics being used by the detectors merely accelerates an already inevitable process. We have plenty of other reasons to want great speech synthesis.
I'm not "extrapolating" from anything. The direct analysis is easy and obvious.
OK, I'll take that back. You aren't extrapolating; you're outright jumping the gun, by pure force of will.
The direct analysis is easy and obvious.
It is if you choose to believe in things because they seem, well, nifty to believe in.
As for me -- when it comes to anticipated technical innovations (however feasible-seeming), and especially binary predictions that they "will happen" (and not simply "could" or even "probably will" happen) -- I need hard evidence and (specific) lines of reasoning. Not simply "there's no reason to believe it won't happen; therefore it will."
I actually gave you a specific line of reasoning. I suspect you missed it because you are not used to thinking in terms of signal processing or information theory. Strangely, you have failed to convince me to spell it out more slowly, though. I will give you this hint: try to come up with a program that could distinguish between the two types of speech so well that even if you handed that program to the synthesizer writers, they would be unable to fool it in any way. Then iterate that process indefinitely, with the synthesizers getting better each time. That is not the whole of the argument, but it would set you on the correct path of understanding, if you thought it through honestly and did not assume magic functions in the detector code that secretly sneak direct divination of intent in through the back door.
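To make the iteration concrete, here it is as Python-shaped pseudocode. `train_detector`, `optimize_against`, and the corpus objects are hypothetical placeholders; the point is the structure of the game, not any particular implementation.

```python
def arms_race(human_corpus, synthesizer, rounds=10):
    for _ in range(rounds):
        # Step 1: build the best detector you can against the current
        # generation of synthetic speech.
        synthetic_corpus = [synthesizer.generate() for _ in human_corpus]
        detector = train_detector(human_corpus, synthetic_corpus)

        # Step 2: hand the finished detector to the synthesizer authors,
        # who optimize until their output passes as human.
        synthesizer = optimize_against(synthesizer, detector)
    # Any cue the detector keyed on has, by construction, become a training
    # target; only cues genuinely absent from the human range could survive.
    return synthesizer
```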
This isn't a general claim about AI; this is a highly constrained, specialized task that is, frankly, probably perfectly attainable with modern technology even without assuming any further advances in AI.
I suspect you missed it because you are not used to thinking in terms of signal processing or information theory.
And we can end the discussion right there.
Because -- between "I will give you this hint" and offering to "spell it out more slowly" for you -- you really do come off as incredibly condescending with statements like these.
I wouldn't say trivial. First, in order to use GAN training you need access not only to the adversarial discriminator, but also to its derivatives. Second, the derivatives need to be reasonably bounded and reasonably smooth, otherwise your voice generator won't be able to converge to a successful solution.
You don't really need access to the other model. Just train your own NN as a discriminator; your generator will get very good at fooling it, and it should then be nearly as good at fooling other NNs.
GANs can be trained without gradients using reinforcement learning. Even without GANs, just being able to tune hyperparameters against such a discriminator would give the synthesizer a strong signal.
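As a sketch of that gradient-free route: a basic evolution-strategies loop only needs a scalar score out of the discriminator, not its derivatives. `score_as_human` below is a dummy stand-in for that black-box call, chosen so the sketch actually runs.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_as_human(params):
    # Placeholder for the black-box discriminator; a toy quadratic that
    # happens to peak at params == 0.5, standing in for "sounds human".
    return -float(np.sum((params - 0.5) ** 2))

def es_step(params, sigma=0.1, pop=50, lr=0.05):
    # Score a population of random perturbations of the current parameters...
    noise = rng.standard_normal((pop, params.size))
    scores = np.array([score_as_human(params + sigma * n) for n in noise])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    # ...and move toward the perturbations the discriminator scored highest.
    return params + lr / (pop * sigma) * noise.T @ scores

params = np.zeros(16)
for _ in range(200):
    params = es_step(params)
# No gradients, boundedness, or smoothness assumptions about the
# discriminator were needed; a score out of it is enough.
```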
Not to mention easily defeated by the AI hiring a human, overnighting them an earpiece, and having them move in meatspace as a proxy.
The whole AI vs Human question seems totally moot to me. What's the difference between an AI with a human employee and a human with an AI employee? Each pair is only as good as their separation of responsibilities.