
Nuking the caches is one thing, but what about services like the Internet Archive whose job it is to hang on to these pages? Pages with leaked data are clearly difficult to identify; removing the leaked data without nuking the document may be impossible, at least in an automated fashion. Are we supposed to erase five months of history from the affected domains?

This CloudFlare breach seems to have put a lot of people in a tough spot, but it feels like it's put archivists in an impossible position.



I agree that nuking entire domains would be bad for the Internet Archive. But I don't think it would be overwhelmingly difficult, nor controversial, to identify and remove the vast majority of "contaminated" documents. This applies to the Internet Archive as well as major search engines.

First, we're talking about raw memory pages, not merely malformed HTML. Those memory pages might contain valid HTML, but most of the sensitive information is in the headers, not HTML markup. It won't be very difficult to write a script to identify documents where random headers and POST data have been inserted where they don't belong, or where the markup is so obviously invalid (even compared to similar documents from the same site) that there is a high probability of contamination. Having a full list of contaminated domains would obviously help a lot, because we'll only have to deal with thousands of domains instead of millions.
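To sketch what I mean (the header names below are just my guesses, not a confirmed signature of the leak, and anything flagged would still need manual review), a first pass in Python could be as simple as:

    import re

    # Illustrative markers only: raw HTTP header names and a multipart POST
    # fragment that normally never appear inside an HTML document body.
    LEAK_MARKERS = [
        re.compile(r"set-cookie:\s*\S+=", re.I),
        re.compile(r"authorization:\s*(?:basic|bearer)\s", re.I),
        re.compile(r"cf-ray:\s*[0-9a-f]+", re.I),               # Cloudflare request id
        re.compile(r"content-disposition:\s*form-data", re.I),  # POST body fragment
    ]

    def looks_contaminated(html):
        """Heuristically flag a document that appears to have raw headers
        or POST data spliced into its body."""
        return any(marker.search(html) for marker in LEAK_MARKERS)

    def triage(documents):
        """Yield (url, html) pairs that deserve human review before removal."""
        for url, html in documents:
            if looks_contaminated(html):
                yield url, html

Combine that with a check for grossly invalid markup relative to other snapshots of the same site, plus the list of affected domains, and the problem narrows down a lot.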

Second, contaminated documents by definition contain information that is NOT what the publisher intended to be crawled, indexed, or archived. So there should be less resistance to removing them.

Finally, most of the contaminated domains used features such as Scrape Shield that were intended to deter archival. It's as if the domain had a robots.txt that said "User-agent: * Disallow: /". I'm not sure whether it's even possible for the Internet Archive to archive such domains. If they can, maybe they've been doing it against the publisher's wishes. If they can't, well, there's no problem to begin with.


Archives don't delete stuff, nor do they have much capacity for computation over their archived data. Whereas if blekko still existed as a search engine, I'd just push code to refuse to show cached pages or snippets containing text that's likely evidence of the CloudFlare problem. 15 minutes' work, and the underlying data would expire fully in a couple of months.

So I would completely disagree with your speculation about what's easy or hard. (Note that I've worked at a search engine and an archive.)


At the very least, that'd cost the Internet Archive a lot of human resources. I think it's reasonable to expect that if Cloudflare wants them to fix a problem Cloudflare itself created, Cloudflare should cover the costs of that operation.


My understanding is that Scrape Shield protects against certain abusive patterns of robot access, but it doesn't ban robots from every URL on a domain. So the Archive may hold many contaminated pages from domains that had Scrape Shield enabled but that didn't block all crawlers.


Good time to shoehorn in some forced DMCA content removal.


Yeah, abusing the DMCA because it's the only tool you can think of, that's a great idea.


Come to think of it, our society grants search engines the privilege of keeping copies of copyrighted material in exchange for the services they provide.

It might not be unreasonable to say that this privilege comes with a certain responsibility to ensure that those copies do not cause excessive harm to others.

So although Cloudflare is the one that fucked up, Google et al. also have a responsibility to do whatever they can to protect the public. They should do what they can, with or without Cloudflare's cooperation.


I don't believe so. Cloudflare had this responsibility and they messed up. They are the only ones liable.


Cloudflare cannot delete documents from other companies' caches. The actual deletion must be performed by Google and Bing, who can then sue Cloudflare for cleanup costs if the latter is unwilling to cooperate.

When there's an oil spill, we don't wait for the oil company to come and clean up their own mess. Others clean it up a.s.a.p. and (ideally) then make the oil company pay the fines and damages. CloudBleed is a virtual oil spill. They literally sprayed other people's private data all over the internet.


Yep, it's probably a Section 230 situation for the search engines -- the offending data came from Cloudflare.


Why can't they just search for pages with data after the closing HTML tag?


That isn't uncommon enough. Searching for one of the CF- HTTP headers in the source code, however, would work.
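Something like this (the CF- header names here are assumed examples; a real cleanup pass would use whatever names actually show up in confirmed leaked pages, and matches would still need review since some pages legitimately document these headers):

    import re

    # Assumed examples of CF- prefixed headers (CF-RAY, CF-Connecting-IP, ...).
    CF_HEADER = re.compile(r"\bcf-(?:ray|connecting-ip|ipcountry|visitor)\s*:", re.I)

    def mentions_cf_header(page_source):
        """True if a CF- header name appears in the cached page source."""
        return CF_HEADER.search(page_source) is not None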



