Hacker News

I'm gonna guess that Microsoft GitHub (tm) would shut you down pretty quickly if you tried to clone tens or hundreds of thousands of repos in a short window of time, b/c of course that's sketchy/abusive use of their infrastructure, right?

But of course if the data is already sitting in object storage inside your cloud environment and all you have to do is run some MapReduce jobs to get at it...

Hence: unfair, anticompetitive, intellectual-property-right-abusing behavior. Microsoft GitHub (tm) can prevent anyone else from running the kinds of analysis they do by simple "operational security", while running literally any kind of analysis, model training, etc. they want. Don't like it? Buy their commercial services and products so you can run Microsoft GitHub (tm) on your very own Microsoft Azure (tm) infrastructure, using Microsoft Visual Studio Code (tm) and Microsoft GitHub Codespaces (tm) to work on _your_ code privately.

Best of all, you can still take advantage of the huge library of "free" code offered by Microsoft GitHub Copilot (tm) to ensure your private, proprietary codebase still has all of the advantages of Open Source Software, brought to you exclusively by the Microsoft GitHub Platform (tm).



Actually they don’t. I’ve cloned thousands of repos before (tried to archive conda-forge org for a project).

I’ve also built many parallel repo downloaders for CI reasons. You can clone repos pretty much all day with little rate limiting. I haven’t pushed parallelism past 64 per host, though.
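
For anyone curious what such a downloader might look like, here is a minimal sketch, not the tool described above. It assumes `git` is on the PATH and uses shallow clones; the names `clone_one` and `clone_all`, the owner/repo directory layout, and the injectable `run` parameter are all made up for illustration. The default of 64 workers just mirrors the figure mentioned above and is not a documented limit.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def clone_one(url, dest_root, run=subprocess.run):
    """Shallow-clone one repo into dest_root/<owner>/<repo>; return (url, exit_code)."""
    # Use the last two path segments (owner, repo) so same-named repos don't collide.
    parts = url.rstrip("/").removesuffix(".git").split("/")[-2:]
    target = Path(dest_root).joinpath(*parts)
    target.parent.mkdir(parents=True, exist_ok=True)
    result = run(
        ["git", "clone", "--depth", "1", "--quiet", url, str(target)],
        check=False,  # report failures via the exit code instead of raising
    )
    return url, result.returncode


def clone_all(urls, dest_root, max_workers=64, run=subprocess.run):
    """Clone many repos concurrently, with at most max_workers clones in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        return list(pool.map(lambda u: clone_one(u, dest_root, run), urls))
```

Calling `clone_all(urls, "mirror/")` returns one `(url, exit_code)` pair per repo, so failed clones can be retried afterwards. Threads are enough here because each worker just waits on a `git` subprocess.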


I don't understand. Your favourite boba joint can email every one of their customers a coupon. That's "unfair" to the other boba joints without access to their mailing list too, right? You're just describing a regular old competitive advantage.


> I'm gonna guess that Microsoft GitHub (tm) would shut you down pretty quickly if you tried to clone tens or hundreds of thousands of repos in a short window of time, b/c of course that's sketchy/abusive use of their infrastructure, right?

ArchiveTeam has a distributed GitHub archive project[0]. It's unclear what its current status is, but it seems like a worthwhile idea.

[0] https://wiki.archiveteam.org/index.php/GitHub#Archive_Team_p...


There are publicly accessible code datasets that contain massive scrapes of GitHub.



