
It wasn't some anecdotal fact that Bard got wrong; it happened during their official public demo. It was a "scandal" because it showed Google was indeed unprepared and had no better product, and failing to even fact-check their own demo beforehand was the cherry on top.

Ethics is a false excuse, because rushing it out shows they never cared in the first place. It was just PR, and their bluff was called.

Also, I skimmed the Stochastic Parrots paper and I’m unimpressed. I’m unfamiliar with the subject, but many points seem unproven/political rather than scientific, with a fixation on training data instead of studying emergent properties, and many opinions, notably regarding social activism. But maybe it was already discussed here on HN. Edit: found here: https://news.ycombinator.com/item?id=34382901



Google and ethics, now that’s an oxymoron


> I’m unfamiliar with the subject, but many points seem unproven/political rather than scientific

You're exactly the kind of person Stochastic Parrots was trying to warn us about - you bought into the AI hype.

AI models are extremely sensitive to the statistical makeup of their training data. A good example of this is image regurgitation in diffusion models: if you include the same image n times in the dataset, it gets n gradient updates per epoch instead of one, and is far more likely to be memorized. Stable Diffusion's propensity to draw bad copies of the Getty Images logo is another example; there are so many watermarks and signatures in the training data that learning how to draw them measurably reduces the loss. In my own AI training adventures[0], the image generator I trained loves to draw maps no matter what the prompt is, because Wikimedia Commons hosts an absolutely unconscionable number of them.
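
(The duplication problem is easy to check for yourself. Here's a minimal sketch that counts exact byte-level duplicates in a hypothetical ./training_images folder; real pipelines use perceptual or embedding-based near-duplicate detection, but the idea is the same.)

    # Minimal sketch: count exact duplicates in an image folder by hashing file bytes.
    # The directory path is hypothetical; real dedup uses perceptual/embedding hashes.
    import hashlib
    from collections import Counter
    from pathlib import Path

    def duplicate_counts(image_dir: str) -> Counter:
        counts = Counter()
        for path in Path(image_dir).rglob("*.jpg"):
            counts[hashlib.sha256(path.read_bytes()).hexdigest()] += 1
        return counts

    if __name__ == "__main__":
        counts = duplicate_counts("./training_images")  # hypothetical path
        extra_copies = sum(n - 1 for n in counts.values() if n > 1)
        # An image appearing n times gets n gradient updates per epoch instead of
        # one, which is why duplicates get memorized disproportionately often.
        print(f"{extra_copies} files are redundant copies of another file")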

Stochastic Parrots is arguing that we can't effectively filter five terabytes[1] of training set text for every statistical bias. Since HN is allergic to social justice language, I'll put it in terms that are more politically correct here: gradient descent is vulnerable to Sybil attacks. Because you can only scrape content written by people who are online, the terminally online will decide what the model thinks, filtered through the underpaid moderators who are censoring your political opinions on TwitBook.
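
To make concrete what "filtering" even means at that scale, here's a toy sketch of the keyword-style cleaning typically applied to web-scale text corpora (the blocklist terms and corpus reader are made up). It removes surface-level matches, but it tells you nothing about whose writing dominates whatever survives the filter.

    # Toy sketch of keyword-based corpus filtering; blocklist terms are placeholders.
    from typing import Iterable, Iterator

    BLOCKLIST = {"example_slur", "example_spam_phrase"}

    def keep(document: str) -> bool:
        # Drop any document containing a blocklisted word: crude, but roughly the
        # level of curation that is feasible over terabytes of scraped text.
        words = set(document.lower().split())
        return not (words & BLOCKLIST)

    def filter_corpus(documents: Iterable[str]) -> Iterator[str]:
        return (doc for doc in documents if keep(doc))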

Of course, OpenAI will try anyway[2]. The best they've come up with is to use RLHF to deliberately encode a center-left bias into a language model that otherwise would be about as far-right as your average /pol/ user. This has helped ChatGPT avoid the fate of, say, Microsoft's Tay; but it is just sweeping the problem under the rug.
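
(For what RLHF actually optimizes, here's a rough sketch of the PPO-style per-token score, with stand-in numbers; nothing here is OpenAI's code. The model is pushed toward whatever the reward model, i.e. the human raters and their guidelines, prefers, subject to a KL penalty that keeps it near the pretrained model. Whatever lean the raters reward is the lean the model picks up.)

    # Sketch of the PPO-style RLHF signal: reward-model score minus a KL penalty
    # that discourages drifting too far from the pretrained reference model.
    def rlhf_score(reward: float, logprob_policy: float, logprob_ref: float,
                   kl_coef: float = 0.1) -> float:
        kl_penalty = logprob_policy - logprob_ref  # positive when the tuned model drifts
        return reward - kl_coef * kl_penalty

    # Stand-in numbers just to show the shape of the trade-off:
    print(rlhf_score(reward=1.2, logprob_policy=-2.0, logprob_ref=-2.5))  # ~1.15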

The other main prong of Stochastic Parrots is energy usage. OpenAI hasn't been outcompeted by actual open AI models because it takes shittons of electricity and hardware to train these things. Stable Diffusion and BLOOM are the biggest open competitors to OpenAI, but they're being funded purely by burning venture capital. FOSS is sustainable because software development is cheap enough that people can do it as volunteer work. AI training is almost the opposite: extremely large capital costs that can only be recouped by the worst abuses of proprietary software.
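
A rough back-of-envelope shows the scale. The total-FLOP figure is the one reported in the GPT-3 paper; the GPU throughput, utilization, and hourly price below are my own assumptions, not anyone's actual bill.

    # Back-of-envelope cost of a GPT-3-scale training run (all inputs approximate).
    TRAIN_FLOPS = 3.14e23        # total training compute reported for GPT-3 (175B)
    GPU_PEAK_FLOPS = 312e12      # A100 peak BF16 throughput per the spec sheet
    UTILIZATION = 0.3            # assumed fraction of peak actually sustained
    PRICE_PER_GPU_HOUR = 2.00    # assumed cloud price in USD

    gpu_hours = TRAIN_FLOPS / (GPU_PEAK_FLOPS * UTILIZATION) / 3600
    print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * PRICE_PER_GPU_HOUR:,.0f}")
    # roughly 930,000 GPU-hours and $1.9M, before failed runs, staff, or inference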

[0] I am specifically trying to build a diffusion model trained purely on public domain images, called PD-Diffusion.

[1] No problem. We are Google. Five terabytes is so little that I've forgotten how to count that low.

[2] When filtering the dataset for DALL-E 2, OpenAI found that removing porn from the training set made the image generator's biases far worse: if you asked for a stock photo of a CEO, pre-filter DALL-E would give roughly 60% male / 40% female examples, while post-filter DALL-E would only ever draw male CEOs.



