There's also the situation where regular people submit "community-contributed" captions and it's obvious that they just ran the text through Google Translate themselves.
On the other hand, I've seen some shockingly good machine captions. I'm fairly sure they were machine-generated, because the videos had just been uploaded or I noticed technical terms being transcribed phonetically, yet they nevertheless transcribed the audio better than I could understand it. My theory is that they prioritize the full-power RNN transcriptions for only some new videos, and haven't gone back over the full historical YT corpus.
The captions for Google meetings are frighteningly good and accurate. We use a lot of acronyms, and made-up words that it turns into acronyms. I sometimes get spooked because even with it off, I know all those meetings and social calls are sitting in some Google DB to be turned into something someday.
I wish they'd care more about people with foreign accents. I still find any type of Google voice recognition unusable unless I put on a really bad fake Texan accent.
I am well aware of that, but when the video comes with no slides and not even a summary, you can guess that they were not paying hundreds of dollars to have it professionally transcribed. You can also guess that humans were not involved anywhere in the process (either to generate or review it) when the captions, say, phonetically transcribe fairly common technical jargon that is written prominently on the slide which has been fullscreened for the past minute of the video.
(Is it really so hard to believe that the usual neural network progress curve has happened for speech transcription and that the future is already here, just unevenly distributed across videos?)
According to the article, captions have to be enabled and then individually approved by the channel owner. So of course you'd be unlikely to see spam or abuse in captions since the channel owner is filtering that spam and abuse out.
Not saying I agree with Google's decision. Just saying it makes sense that you wouldn't see spam and abuse.
(I am hearing, but I like clean English captions because they make it easy to play videos at 2x or beyond.)