The performance improvement that GitLab contributed to Git is slated to be released with v2.50.0:
https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
IME, it has always turned out to be the correct decision to eliminate any n^2 operation in anything I've written.
I don't write exotic algorithms, but it's always astounding how small n needs to be to become observably problematic.
Bruce Dawson says: I like to call this Dawson's first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
I made the "mistake" in an interview of equating two super-quadratic solutions. What I meant was what Dawson meant: it doesn't matter, because they're both too ridiculous to even discuss.
They're too ridiculous… unless a more optimal solution does not exist.
The second law is that O(n * log n) is for practical intents and purposes O(n).
Skiena has a great table in his algorithms book mapping time complexity to hypothetical times for different input sizes.
For n of 10^9, where lg n takes 0.03 μs and n takes 1 s, n lg n takes 29.9 s and n^2 takes 31.7 years.
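A quick back-of-the-envelope check of those figures, assuming roughly one nanosecond per elementary operation (a sketch in C):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n = 1e9, ns = 1e-9;                          /* 10^9 items, ~1 ns per op */
        printf("lg n   : %.2f us\n", log2(n) * ns * 1e6);                  /* ~0.03 us */
        printf("n      : %.1f s\n", n * ns);                               /* 1 s */
        printf("n lg n : %.1f s\n", n * log2(n) * ns);                     /* ~29.9 s */
        printf("n^2    : %.1f years\n", n * n * ns / (3600.0 * 24 * 365)); /* ~31.7 years */
        return 0;
    }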
To be clear though, that isn't his second law, at least as of two months ago, according to https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
But sometimes a big enough C can flip which solution helps you hit your margins.
Good call. O(N^2) is the worst time complexity because it's fast enough to be instantaneous in all your testing, but slow enough to explode in prod.
I've seen it several times before, and it's exactly what happened here.
We just had this exact problem. Tests ran great, production slowed to a crawl.
I was just helping out with the network at an event. Worked great in testing, but it failed in production due to unicast flooding the network core. Turns out that some of the PoE Ethernet switches had an insufficiently sized CAM for the deployment combined with STP topology changes reducing the effective size of the CAM by a factor of 10 on the larger switches. Gotta love when packet forwarding goes from O(1) to O(n) and O(n^2)! Debugging that in production is non-trivial as the needle is in such a large haystack of packets so as to be nearly impossible to find in the output of tcpdump and wireshark. The horror... The horror...
Modern computers are pretty great at scanning small blocks of memory repeatedly, so n^2 can be faster than the alternative using a map for small N.
I spent a lot of time fixing n^2 in blink, but there were some fun surprises:
https://source.chromium.org/chromium/chromium/src/+/main:thi...
For large N, without a cache, :nth-child matching would be very slow, doing n^2 scans of the siblings to compute the index. On the other hand, for small sibling counts it turned out the cache overhead was noticeably worse than just doing the n^2 scan. (See the end of the linked comment.)
This turns out to be true in a lot of surprising places, both where linear search beats constant time maps, and where n^2 is better than fancy algorithms to compensate.
Memory latency and instruction timing is the gotcha of many fun algorithms in the real world.
This is true. But unless you are sure that current and future inputs will always be small I find it is better to start with the algorithm that scales better. Then you can add a special case for small sizes if it turns up in a hot path.
This is because performance typically matters less for the fast/small case, and it is generally acceptable for processing twice as much data to take twice (or slightly more than twice) as long. But users are far more likely to hit, and really get burned by, n^2 algorithms in things you thought would almost always be small, where you never tested sizes large enough to notice.
I wrote more on this topic here https://kevincox.ca/2023/05/09/less-than-quadratic/
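A minimal sketch of what that "special case for small sizes" can look like (hypothetical names; the threshold is a guess and should be measured on real data):

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Membership test over a sorted array of strings: scale by default
       (binary search), special-case tiny inputs with a plain scan. */
    #define SMALL_N 16

    static int cmp_str(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    static bool contains(const char **sorted, size_t n, const char *key)
    {
        if (n <= SMALL_N) {
            for (size_t i = 0; i < n; i++)
                if (strcmp(sorted[i], key) == 0)
                    return true;
            return false;
        }
        return bsearch(&key, sorted, n, sizeof *sorted, cmp_str) != NULL;
    }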
> where linear search beats constant time maps
Can you give an example? You said lots of good things in your post, but I'm struggling to believe this one. Also, it would help to see some wall clock times or real world impact.
You've got to keep in mind that computers aren't 1-instruction-at-a-time purely sequential machines anymore.
Let's say you've got a linear array of bytes, and you want to see if it contains a specific value. What would a modern CPU need to execute? Well, we can actually compare 64 values at a time with _mm512_mask_cmp_epu8_mask! You still need a little bit of setup and a final "did any of them match" check, of course. Want to compare 512 values? You can probably do that in less than 10 clock cycles on modern machines.
Doing the same with a hash set? Better make sure that hash algorithm is fast. Sure it's O(1), but if calculating the hash takes 20 cycles it doesn't matter.
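A minimal sketch of that idea (assumes AVX-512BW and a 64-byte block; uses the simpler _mm512_cmpeq_epu8_mask variant of the compare):

    #include <immintrin.h>
    #include <stdbool.h>

    /* Does this 64-byte block contain `needle`? One load, one broadcast,
       one compare that produces a 64-bit match mask. */
    static bool block64_contains(const unsigned char *block, unsigned char needle)
    {
        __m512i data   = _mm512_loadu_si512(block);
        __m512i probe  = _mm512_set1_epi8((char)needle);
        __mmask64 hits = _mm512_cmpeq_epu8_mask(data, probe);
        return hits != 0;
    }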
Pick any compiled language and test it. Pick an algorithm making heavy use of a small (<10, maybe up to a hundred elements) hashset, and try using a linear structure instead. The difference will be most apparent with complicated keys, but even strings of more than a few characters should work.
Some example workloads include:
1. Tokenization (checking if a word is a keyword)
2. Interpretation (mapping an instruction name to its action)
3. ML (encoding low-cardinality string features in something like catboost)
4. JSON parsers (usually key count is low, so parse into a linear-scan hashmap rather than a general-purpose hashmap)
Details vary in the exact workload, the hardware you're using, what other instructions you're mixing in, etc. It's a well-known phenomenon though, and when you're doing a microoptimization pass it's definitely something to consider. 2x speedups are common. 10x or more happen from time to time.
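For the tokenization case above, the linear version is about as simple as it gets (a sketch; the keyword list is hypothetical):

    #include <stdbool.h>
    #include <string.h>

    /* With a handful of keywords, scanning a flat array is often faster than
       hashing the candidate string and probing a table. */
    static const char *keywords[] = { "if", "else", "while", "for", "return" };

    static bool is_keyword(const char *word)
    {
        for (size_t i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++)
            if (strcmp(word, keywords[i]) == 0)
                return true;
        return false;
    }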
It's similar to (but not _quite_ the same as) the reason real-world binary search uses linear scans for small element counts.
When you go to really optimize the system, you'll also find that the linear scan solution is often more amenable to performance improvements from batching.
As to how much it matters for your composite program? Even at a microoptimization level I think it's much more important to pay attention to memory access patterns. When we wrote our protobuf parser that's all we really had to care about to improve performance (33% less execution time for the entire workload, proto parsing being much better than that). You're much more likely to be able to write sane code that way (contrasted with focusing on instructions and whatnot first), and it's easier to layer CPU improvements on top of a good memory access pattern than to go the other way around.
My rule of thumb for 80%-90% of problems is: if you need a complicated algorithm, your data model isn't right. Sure, you do need complicated algorithms for compilers, db internals, route planning et al., but all things considered, those are a minority of use cases.
This is not a complicated algorithm. A hash map (dictionary) or a hash set is how you would always do deduplication in Python, because it is easiest to write / least keystrokes anyway. That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
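The nested-loop version that comment has in mind looks something like this in C (a sketch; `items` is a hypothetical array of strings, deduplicated in place):

    #include <stddef.h>
    #include <string.h>

    /* Keep the first occurrence of each string; the inner scan over everything
       already kept is what makes this O(n^2). */
    static size_t dedup_quadratic(const char **items, size_t n)
    {
        size_t kept = 0;
        for (size_t i = 0; i < n; i++) {
            int seen = 0;
            for (size_t j = 0; j < kept; j++) {
                if (strcmp(items[i], items[j]) == 0) {
                    seen = 1;
                    break;
                }
            }
            if (!seen)
                items[kept++] = items[i];
        }
        return kept;
    }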
> That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
I am confused. There are plenty of open source, fast hash map impls in C. This isn't knapsack. This is a dict lookup.
I wrote a funny algorithm to group together words that end the same way to write them once in my binary wordlist file, since there is an array that points to the start character and a \0 to end the word. My initial solution was O(n²), but it was too slow on a real wordlist, so I had to come up with something better. In the end I just sort the list with quicksort, but reverse the words, and then the groupable ones end up near each other.
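A sketch of that trick: sort with a comparator that walks both words from the end, so words sharing a suffix land next to each other (names here are hypothetical):

    #include <stdlib.h>
    #include <string.h>

    /* Compare two words from their last characters backwards. */
    static int cmp_by_suffix(const void *a, const void *b)
    {
        const char *x = *(const char *const *)a;
        const char *y = *(const char *const *)b;
        size_t i = strlen(x), j = strlen(y);
        while (i > 0 && j > 0) {
            if (x[i - 1] != y[j - 1])
                return (unsigned char)x[i - 1] - (unsigned char)y[j - 1];
            i--;
            j--;
        }
        return (i > j) - (i < j);   /* shorter word (a pure suffix) sorts first */
    }

    /* Usage: qsort(words, nwords, sizeof *words, cmp_by_suffix); */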
Cool discovery but the article could have been about 1/10 as long and still communicated effectively. At least they didn't post it as a video, so it was easy to skim to the important details.
Interesting that multiple people are noticing this same thing. For me, this could have been:
"We found that a large portion of the 48 hours taken to backup our rails respository was due to a function in `git bundle create` that checks for duplicate references entered as command line arguments. The check contained a nested for loop ( O(N2) ), which we replaced with a map data structure in an upstream fix to Git. The patch was accepted, but we also backported fix without waiting for the next Git version. With this change, backup time dropped to 41 minutes."
Yes. I read the whole article thinking that it must have been generated by an LLM, because the style at least resembles it.
I thought exactly the same. The reading experience of the post would definitely have been improved with less text.
Don't take my bullet points away from me
They came for the em dashes, and I did not speak up. Then they came for the bullet points, and I did not speak up…
Glad I wasn't the only one who thought this. The post is also missing one obvious thing that I expect in any technical post: code snippets. Let me see the code.
ChatGPT has ruined bullet points for the rest of us…
No offense, but writing this blog post couldn't take more than a few minutes; why spoil it with an LLM? Shoot, use one to check grammar and recommend edits, even.
That was also my thought.
For those who haven't read the article yet, scroll down to the flame graph and start reading until it starts talking about backporting the fix. Then stop.
"How we decreased reading 'How we decreased GitLab repo backup times from 48 hours to 41 minutes' times from 4.8 minutes to 41 seconds"
it could have been longer. I still don't know why they were doing backup bundles with two refs :)
They weren't; if you look at the fix [1], the dedupe loop ran in all cases, not just those with known dupes, so the performance hit applied to any bundle with lots of refs.
[1] https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
48 hours is a crazy amount of time to spend just to compress a git folder; it's only a couple GB. 41 minutes still seems like quite a long time.
Why aren't they just snapshotting and archiving the full git repo? Does `git bundle` add something over frequent ZFS backups?
> Be aware that even with these recommendations, syncing in this way has some risk since it bypasses Git's normal integrity checking for repositories, so having backups is advised. You may also wish to do a git fsck to verify the integrity of your data on the destination system after syncing.
https://git-scm.com/docs/gitfaq#_transfers
It doesn't tell you how to make a backup safely though.
On a personal scale, Syncthing and Btrfs snapshots work plenty good enough. It's as fast as the storage/network too.
Syncthing is the only way I've ever corrupted a git repo before
I think that's why they specified the "Btrfs snapshots" part. Yes, directly syncing a .git directory seems like a recipe for disaster with how often I've seen individual files lag during sync, but I guess with Btrfs snapshots one can ensure that only a consistent view of a git directory is being backed up and synced.
If filesystem snapshots weren't safe, wouldn't that also mean git is prone to corrupting your repo in the event of a power loss or crash? That seems like a bad bug.
ZFS snapshots are difficult to replicate offsite to non-ZFS targets, say an S3 bucket.
That said, there's another lesser-known feature that bundles help with: `git clone --bundle-uri`. The client can specify a location for a bundle, or the server can send the client the bundle location in the clone results; the client then fetches the bundle, unpacks it, and updates the delta via the git server. That makes cloning large repos a lot lighter on the server, and a ton faster for the client on initial clones.
I think if you want consistent snapshot backups on non-zfs destinations the safest thing is to clone the snapshot and rsync from the clone. Not a single-step operation but preserves the atomicity of the snapshot.
EDIT: you could also rsync from a .zfs snapshot directory if you have them enabled.
ZFS can send to a file or whatever you want to pipe to; you can have incremental sends, and if you convert to bookmarks on the sender you don't have to keep the historical data after you send it.
Reading the article I thought exactly the same! I'd be curious to know how much time the same would take with ZFS.
15 years also seems like a long time
I'm confused why you wouldn't simply snapshot the block-level device if the protocol of the information on top is going to cause this much headache. Quiescing git operations for block level activity is probably not trivial, but it sounds like an easier problem to solve to me.
This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve. Customer configures the VM for snapshots every X minutes. Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path. But, I still think the overall strategy is much more viable and robust to weird edges within git.
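For reference, turning WAL on is a one-liner at connection setup (a sketch; `app.db` is a hypothetical database file):

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;
        if (sqlite3_open("app.db", &db) != SQLITE_OK)
            return 1;
        /* Switch to write-ahead logging so writers don't rewrite the main
           database file in place, which plays much better with snapshots. */
        if (sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "WAL: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }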
> This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve.
A few months back a better solution was provided by SQLite: sqlite3_rsync
> Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path.
Bingo. One of the worst problems is helping a client piece back together a corrupted repo when they are using snapshots. Check my profile to see how I know. :)
It's usually an OMG down scenario, and then you are adding in the "oh no, now the restore is corrupted."
It's fixable but it's definitely annoying.
Gitlab isn't just a managed service. They release the software for self-hosted instances as well. There is no guarantee or requirement that users all run Gitlab on the same filesystem or even a filesystem that supports block level snapshots at all. Presumably, they want a universal backup system that works for all Gitlabs.
I've never heard of a .git folder that spanned multiple filesystems. It sounds like we are now conflating the git workspace with everything else in the product.
There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
I think GP's point is that the filesystems used by someone self-hosting gitlab may not be the same as what gitlab themselves are using.
File systems can be weird. Sometimes the OS can be weird and fsync type calls may not do what you expect. At least at one point MacOS fsync didn't behave the same way as Linux (i.e. Linux should ensure the write is truly done and not just in cache so long as the drive isn't lying). [0]
> There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
Gitlab has a community edition. Not handling data well would be bad for their public image.
I can't reply to the other reply for some reason, but what they said is indeed what I meant. gitlab.com might be running their Gitaly servers on btrfs or zfs or lvm volumes or whatever, but other customers may be using ext2. Gitlab the company could require customers to only run Gitaly on a specific filesystem, but up to now they never have, and it would be pretty shitty to suddenly change their minds after a decade-plus of establishing one expectation. And whoever the developer is who submitted the patch to upstream Git and got a technical blog post out of it has absolutely no power to dictate contract terms to enterprise customers personally, so instead they did what is actually in their power to do.
"fixed it with an algorithmic change, reducing backup times exponentially"
If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would guess not.
This is not the precise mathematical definition of exponential, but rather the colloquial one, where it just means "a lot".
You shouldn't use a word that can carry a precise mathematical meaning in a sentence that literally uses mathematical notation in order to speak precisely and then expect readers not to interpret the word in the precise mathematical way.
You should if you expect your readers to be normal humans who understand obvious context, and not pedantic HN readers who understand obvious context but delight in nit-picking it anyway.
I somewhat agree, but for lack of a better word, what would you use? Quadratically doesn't have the same punch
We simplify the big O notation in computer science. This is standard practice.
I find the whole article rather poorly written. Most likely it was written with an LLM.
>Ultimately, we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially.
No, not in the exact same sentence as a big-O. That's either author error, or an intentional troll. Either way it's egg on their faces.
The algorithmic complexity went down in the function they patched (a 6x improvement in their benchmark), but in the context of how they were using the function the overall impact was much larger (it now takes about 1% of the original time), which is plausibly exponential (and figuring out the actual complexity is neither relevant nor an economical use of time).
> figuring out the actual complexity is neither relevant nor an economic use of time
The fix was replacing a nested loop with a map. Figuring out that this goes from O(n^2) to O(n) (modulo details like bucket count) is immediate if you know what the words mean and understand enough to identify the problem and make the fix in the first place.
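For comparison, the map-based shape of such a fix looks roughly like this (a sketch with a toy fixed-size open-addressing set standing in for whatever hash set the codebase already provides):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Toy open-addressing string set, just to show the shape: one expected-O(1)
       lookup/insert per item instead of an inner scan over everything kept so far. */
    #define SLOTS 4096   /* power of two; must comfortably exceed the item count */

    static const char *slots[SLOTS];

    static bool set_insert(const char *s)   /* true if newly inserted */
    {
        size_t h = 5381;
        for (const char *p = s; *p; p++)
            h = h * 33 + (unsigned char)*p;
        for (size_t i = h & (SLOTS - 1); ; i = (i + 1) & (SLOTS - 1)) {
            if (!slots[i]) { slots[i] = s; return true; }
            if (strcmp(slots[i], s) == 0) return false;
        }
    }

    static size_t dedup_linear(const char **items, size_t n)
    {
        size_t kept = 0;
        for (size_t i = 0; i < n; i++)
            if (set_insert(items[i]))
                items[kept++] = items[i];
        return kept;
    }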
If you replace an n^2 algorithm with a log(n) lookup you get an exponential speed up. Although a hashmap lookup is usually O(1), which is even faster.
They're still using the map in a loop, so it'd be nlogn for a tree-based map or n for a hash map.
That is not true unless n^C / e^n = log(n) where C is some constant, which it is not. The difference between log(n) and some polynomial is logarithmic, not exponential.
But if you happen to have n=2^c, then an algorithm with logarithmic complexity only needs c time. That's why this is usually referred to as an exponential speedup in complexity theory, just like going from O(2^n) to O(n). More concretely, if the first algorithm needs 1024 seconds, the second one will need only 10 seconds in both cases, so I think it makes sense.
It depends if you consider "speedup" to mean dividing the runtime or applying a function to the runtime.
I.e. you are saying and f(n) speedup means T(n)/f(n), but others would say it means f(T(n)) or some variation of that.
I interpreted that as n->log(n) since log and exp are inverses.
Also because I've often heard that the quantum Fourier transform is an exponential speedup over the discrete Fourier transform, and there the scaling goes n^2 -> n log n.
There seems to be a lesson here about the balance between premature vs. anticipatory optimization. We're generally warned against premature optimization but perhaps, as a rule of thumb, we should look for optimizations in frequently-called functions that are obvious and not onerous to implement.
If a set-of-strings had been trivially available in the source language (at the time of implementation), the original programmer would probably have done this (relatively) trivial optimization... This is a symptom of anemic languages like C.
looks like the commit is here: https://github.com/git/git/commit/a52d459e72b890c192485002ec...
Thanks, this helped me find the submission and related discussion thread - https://lore.kernel.org/git/20250401-488-generating-bundles-...