> Meanwhile when I wrote disparate blog articles, I went from $2/month, to $20, to $200 to $2000, and it’s because Google search sent you to me.
Imagine if Google trained an LLM on its index and started providing answers without ever sending users your way… Actually, we have a suitable example of just that being done. /s
I bookmarked this as a vivid illustration of my gripe with LLMs, and more specifically with their creators' and operators' meticulous avoidance of the topic of attribution.
You’d think it’s fairly feasible: if the training data carried attribution, I can’t see why an LLM couldn’t separate substance from style (and from mere commentary on that substance), so that when most of the topical substance turns out to come from only one or two sources it would simply note those sources for the user. If you provide genuinely niche content, then when ChatGPT answers relevant questions the substance may well come from your body of work alone.
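For what it’s worth, here is a toy sketch of the kind of check I mean, assuming a hypothetical attributed corpus and using crude token overlap as a stand-in for whatever similarity or retrieval machinery an operator would actually have; all names, URLs and thresholds are made up for illustration:

```python
# Toy sketch (hypothetical): if each training document kept its source URL,
# an operator could check whether an answer's substance is dominated by one
# or two sources and surface them to the user. Real systems would use
# embeddings or a retrieval index; token overlap stands in for that here.
from collections import Counter

# Hypothetical attributed corpus: document text -> source URL
CORPUS = {
    "how to tune a widget frobnicator step by step": "https://example-blog.com/frobnicator",
    "widget frobnicator tuning guide with pitfalls": "https://example-blog.com/frobnicator-pitfalls",
    "a recipe for sourdough bread": "https://other-site.com/sourdough",
}

def dominant_sources(answer: str, top_n: int = 2, min_share: float = 0.5) -> list[str]:
    """Return sources whose overlap with the answer accounts for most of it."""
    answer_tokens = set(answer.lower().split())
    overlap = Counter()
    for text, source in CORPUS.items():
        overlap[source] += len(answer_tokens & set(text.lower().split()))
    total = sum(overlap.values()) or 1
    top = overlap.most_common(top_n)
    # Only attribute when a handful of sources explain most of the overlap.
    if sum(count for _, count in top) / total >= min_share:
        return [source for source, count in top if count > 0]
    return []

print(dominant_sources("step by step guide to tune a widget frobnicator"))
# ['https://example-blog.com/frobnicator', 'https://example-blog.com/frobnicator-pitfalls']
```

Obviously a production version would be far more involved, but the basic question ("does this answer lean almost entirely on one or two sources?") doesn’t strike me as unanswerable.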
Yet the asker pays the LLM operator and is never told that you exist. You, in turn, stand to lose not only your ad revenue but also the chance to upsell something, engage with your audience, learn who that audience even is and how your work is sought, and just generally feel useful and valued.
Give LLMs enough time and popularity and no one will even be inclined to believe you did the original work in the first place, since from anyone else's perspective you might as well have asked an LLM yourself.
The cynic in me says operators are particularly keen that the public not ask whether attributing LLM output is feasible, because it could make them liable to pass a share of the profits from running the model to the countless volunteer writers turned involuntary training-data providers. Another reason might be that researching original authorship requires manual human labour proportionate to the size of the training data, and who would want to pay for that? Certainly not a startup with millions in funding from Silicon Valley investors.
Is it just me, or are all incentives to openly share useful information (as opposed to pure entertainment) in jeopardy now?
Your prediction seems depressing, but I can't think of a solid argument against it. There are already sites that pay people to copy content from other sites, or that simply scrape and paste it. They rank higher than the original sites by combining relevant data from multiple sources. AI that can copy and paste articles will make that business infinitely scalable and harder to detect.
What you describe could also be considered copyright laundering, but critically it is understood as such: it is frowned upon and penalized by the likes of Google. Applying the same understanding to LLM design might actually be half the battle.