
In addition to the steps you're already taking, and the ethical suggestions from other commenters, I suggest that you acquaint yourself thoroughly with intellectual property (IP) law. If you eventually decide to publish anything based on what you learn, copyright and possibly trademark law will come into play.

Knowing what rights you have to use material you're scraping early on could guide you towards seeking out alternative sources in some cases, sparing you trouble down the line.



That's a good point! So far I'm not planning on publicly disclosing any of my results, but that may come, I guess.


I'm curious how this would be an issue; factual information isn't copyrightable, and most of the obvious things that I can think to do with a scraper amount to pulling factual information in bulk. That holds even for derived information like, "this is the average price for this item across 13 different stores". (Although I'm not a lawyer and only pay attention to American law, so take all of this with the appropriate amount of salt.)


How much can you quote from a crawled document? Can you republish the entire crawl? What can you do under "fair use" of copyrighted material and what can't you do? Can you articulate a solid defense of your publication that it truly contains only pure factual information? If BigCo dislikes having its name associated with the study, can you protect yourself by limiting your publication to "nominative use" of its trademarks? What is the practical risk of someone raising a stink if the legality of your usage is ambiguous? Who actually holds copyright on the crawled documents?

You have a lot of rights and you can do a lot. Understanding those rights and where they end lets you do more, and with confidence.


So I think I just was being unimaginative on "scraping"; I wouldn't have thought to save quotes/prose, just things like word counts, processed results (sentiment analysis), pricing, etc. In which case most of that shouldn't come up, but yes I can see where other options are less simple.
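To make that concrete, here's a rough sketch of the kind of thing I had in mind: reduce each page to a few derived numbers and throw the prose away. It assumes the page text has already been fetched and stripped of markup; `derived_facts` and `average_price` are hypothetical names, and the price regex only handles simple dollar amounts like "$19.99".

```python
import re
from statistics import mean

# Assumption: prices appear as plain dollar amounts such as "$19.99" or "$5".
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")
WORD_RE = re.compile(r"[A-Za-z']+")


def derived_facts(page_text):
    """Return only derived numbers for one page; the text itself is not kept."""
    prices = [float(p) for p in PRICE_RE.findall(page_text)]
    return {
        "word_count": len(WORD_RE.findall(page_text)),
        "prices": prices,
    }


def average_price(pages):
    """Average every price found across several pages, or None if there are none."""
    all_prices = []
    for text in pages:
        all_prices.extend(derived_facts(text)["prices"])
    return mean(all_prices) if all_prices else None
```

Calling `average_price` on the text from each store's listing gives the "average price across N stores" style of result, with nothing stored except the numbers.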


> factual information isn't copyrightable

Tell that to Aaron Swartz.

Sure, if you think of factual information as an abstract concept. But as soon as you put that abstract concept into a concrete representation, that representation is absolutely copyrightable. And when you scrape data you're not scraping abstract information, you're scraping the representation of that information.

Try publishing PDFs of college textbooks online and see how well your "I'm just publishing factual information" argument works.

I'm not saying I agree with the law on this, and I'm also not saying that the way the law was intended should apply to the situation of scraping.


> Tell that to Aaron Swartz.

He wasn't downloading (purely) factual information, as I understood it.

> college textbooks

Not even remotely raw factual information. Heck, a table of numbers with a descriptive label probably is copyrightable, but you can scrape the table itself, yes?

I think the issue here is that I assumed a very narrow idea of what people would scrape; it hadn't crossed my mind to download prose or such, which I think is why we're arriving at different conclusions.



