My question is not how to run it, but what to run. If your scheduled task does a full clone every time and uploads it as a ZIP to S3, that is massively inefficient, even with something like Restic, because each fresh clone produces a Git pack file that will have nothing in common with the previous one.


Clone each repo locally. Periodically run "git fetch -p" in each repo to update the local copy of the upstream content. Then run a periodic task with restic or rclone (depending on whether you want point-in-time snapshots or just a mirror of the latest state) to copy these local repos into your S3 bucket.

The local clones should evolve incrementally due to "git fetch", and then the restic or rclone task should figure out how to make incremental updates to the S3 content.
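
If it helps, here is a rough sketch of the kind of periodic job I have in mind (untested; the ~/mirrors layout, bucket name, and the "s3" rclone remote are placeholders you would substitute):

    #!/bin/sh
    # Assumed layout: one bare mirror clone per upstream repo under ~/mirrors,
    # created once with: git clone --mirror <url> ~/mirrors/<name>.git
    set -eu
    MIRROR_DIR="$HOME/mirrors"
    BUCKET="s3:my-backup-bucket/git-mirrors"   # placeholder rclone remote/path

    for repo in "$MIRROR_DIR"/*.git; do
        # Incremental update of the local copy; -p prunes refs deleted upstream.
        git -C "$repo" fetch -p
    done

    # Mirror the latest state into S3; rclone only transfers changed files.
    rclone sync "$MIRROR_DIR" "$BUCKET"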


But that's... exactly what I described and asked how to avoid...


I had trouble parsing your earliest comment, so I only tried to address the incremental backup concern. I may not have followed the whole conversation, but it seemed like you were claiming that a filesystem-level backup of a clone was not going to produce incremental backup IO in practice.

A periodic fetch into a persistent cloned repo will be incremental unless the upstream is doing something crazy with frequent branch deletions and repacks. In practice, most upstream repos I encounter behave relatively monotonically. They accumulate new commits and branch/tag heads but do not often create garbage or need repacking.

A periodic backup of the cloned repo will also be incremental if you use an appropriate tool like restic or rclone copy. And since the clone only changes during the fetch, you can serialize the fetch and the backup in one periodic job and be confident that you are taking a consistent snapshot of the repo.
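
For the point-in-time-snapshot flavor, a minimal sketch of that serialized job, assuming restic and its S3 credentials are already configured (the repository URL and paths are placeholders):

    #!/bin/sh
    # One serialized job: fetch first, then snapshot, so the clones are quiescent
    # while restic reads them and each snapshot is a consistent point in time.
    set -eu
    MIRROR_DIR="$HOME/mirrors"
    export RESTIC_REPOSITORY="s3:s3.amazonaws.com/my-backup-bucket/restic"  # placeholder
    # RESTIC_PASSWORD and AWS credentials are assumed to come from the environment.

    for repo in "$MIRROR_DIR"/*.git; do
        git -C "$repo" fetch -p
    done

    # restic chunks and deduplicates, so unchanged pack files are not re-uploaded.
    restic backup "$MIRROR_DIR"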

The advantage of this approach is its simplicity. It is easy to reason about, and the backups are easy to work with when you need to restore a repo, without having to learn any other tools. It's the kind of thing I could feel comfortable setting up and running for years on end with little supervision.

A more sophisticated approach that integrates with git hooks, e.g. to do event-driven rather than periodic backups, is plausible, but I think it could quickly get in its own way. And if you are working with a hosted upstream, you would need to integrate with their proprietary hooks, e.g. GitHub Actions, and deal with the other restrictions of the hosting environment. Such a solution likely brings new failure modes and may not be a worthwhile tradeoff...


Again, this requires you to have a persistent clone on a filesystem. I specifically wonder if we can do (and I quote) "direct incremental git-to-S3 backups", and you keep replying "it's easy, do it indirectly with a persistent cloned repo".

I don't understand where you are stuck, tbh.

yencabulator has provided a good tip, I think: you could store the previous set of refs and use it to build an incremental Git bundle (one containing only the objects that were not in the previous bundle). I don't know if you can do that with the existing Git client though.
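
If git bundle's rev-list arguments work the way I expect, the sketch might look roughly like this, assuming there is still some local repo to bundle from (untested; the repo path, state file, and bundle names are made up):

    #!/bin/sh
    # Incremental bundle: include all current refs, but exclude everything already
    # reachable from the ref tips recorded at the previous backup.
    set -eu
    cd "$HOME/mirrors/repo.git"            # placeholder repo
    PREV=refs-at-last-backup.txt           # placeholder state file

    if [ -s "$PREV" ]; then
        # Objects reachable from the old tips become prerequisites, not contents.
        git bundle create "incr-$(date +%Y%m%d).bundle" --all --not $(cat "$PREV")
    else
        # First run: full bundle of everything.
        git bundle create "full-$(date +%Y%m%d).bundle" --all
    fi

    # Record the current tips for the next incremental run, then upload the
    # bundle to S3 with whatever tool you like (aws s3 cp, rclone copyto, ...).
    git for-each-ref --format='%(objectname)' refs/heads refs/tags > "$PREV"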



