The performance improvement that GitLab contributed to Git is slated to be released with v2.50.0:
https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
IME, it has always turned out to be the correct decision to eliminate any n^2 operation in anything I've written.
I don't write exotic algorithms, but it's always astounding how small n needs to be to become observably problematic.
Bruce Dawson says: I like to call this Dawson's first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
I made the "mistake" in an interview of equating two super-quadratic solutions. What I meant was what Dawson meant: it doesn't matter, because they're both too ridiculous to even discuss.
They're too ridiculous… unless a more optimal solution does not exist.
The second law is that O(n * log n) is for practical intents and purposes O(n).
Skiena has a great table in his algorithms book mapping time complexity to hypothetical times for different input sizes.
For n of 10^9, where lg n takes 0.03 μs and n takes 1 s, n lg n takes 29.9 s and n^2 takes 31.7 years.
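A quick back-of-the-envelope check of those figures, assuming roughly one nanosecond per elementary operation (a sketch in C):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n = 1e9, ns = 1e-9;                          /* 10^9 items, ~1 ns per op */
        printf("lg n   : %.2f us\n", log2(n) * ns * 1e6);                  /* ~0.03 us */
        printf("n      : %.1f s\n", n * ns);                               /* 1 s */
        printf("n lg n : %.1f s\n", n * log2(n) * ns);                     /* ~29.9 s */
        printf("n^2    : %.1f years\n", n * n * ns / (3600.0 * 24 * 365)); /* ~31.7 years */
        return 0;
    }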
To be clear though, that isn't his second law, at least as of two months ago, according to https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
But sometimes a big enough C can flip which solution helps you hit your margins.
Good call. O(N^2) is the worst time complexity because it's fast enough to be instantaneous in all your testing, but slow enough to explode in prod.
I've seen it several times before, and it's exactly what happened here.
We just had this exact problem. Tests ran great, production slowed to a crawl.
I was just helping out with the network at an event. Worked great in testing, but it failed in production due to unicast flooding the network core. Turns out that some of the PoE Ethernet switches had an insufficiently sized CAM for the deployment combined with STP topology changes reducing the effective size of the CAM by a factor of 10 on the larger switches. Gotta love when packet forwarding goes from O(1) to O(n) and O(n^2)! Debugging that in production is non-trivial as the needle is in such a large haystack of packets so as to be nearly impossible to find in the output of tcpdump and wireshark. The horror... The horror...
Modern computers are pretty great at scanning small blocks of memory repeatedly, so n^2 can be faster than the alternative using a map for small N.
I spent a lot of time fixing n^2 in blink, but there were some fun surprises:
https://source.chromium.org/chromium/chromium/src/+/main:thi...
For large N, without a cache, :nth-child matching would be very slow, doing n^2 scans of the siblings to compute the index. On the other hand, for small sibling counts it turned out the cache overhead was noticeably worse than just doing the n^2 scan. (See the end of the linked comment.)
This turns out to be true in a lot of surprising places, both where linear search beats constant time maps, and where n^2 is better than fancy algorithms to compensate.
Memory latency and instruction timing is the gotcha of many fun algorithms in the real world.
This is true. But unless you are sure that current and future inputs will always be small I find it is better to start with the algorithm that scales better. Then you can add a special case for small sizes if it turns up in a hot path.
This is because performance typically matters less for the fast/small case, and it is generally acceptable for processing twice as much data to take twice (or slightly more than twice) as long. But users are far more likely to hit, and really get burned by, n^2 algorithms in things you thought would almost always be small, where you never tested sizes large enough to notice.
I wrote more on this topic here https://kevincox.ca/2023/05/09/less-than-quadratic/
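A minimal sketch of what that "special case for small sizes" can look like (hypothetical names; the threshold is a guess and should be measured on real data):

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Membership test over a sorted array of strings: scale by default
       (binary search), special-case tiny inputs with a plain scan. */
    #define SMALL_N 16

    static int cmp_str(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    static bool contains(const char **sorted, size_t n, const char *key)
    {
        if (n <= SMALL_N) {
            for (size_t i = 0; i < n; i++)
                if (strcmp(sorted[i], key) == 0)
                    return true;
            return false;
        }
        return bsearch(&key, sorted, n, sizeof *sorted, cmp_str) != NULL;
    }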
> where linear search beats constant time maps
Can you give an example? You said lots of good things in your post, but I'm struggling to believe this one. Also, it would help to see some wall clock times or real world impact.
You've got to keep in mind that computers aren't 1-instruction-at-a-time purely sequential machines anymore.
Let's say you've got a linear array of bytes, and you want to see if it contains a specific value. What would a modern CPU need to execute? Well, we can actually compare 64 values at a time with _mm512_mask_cmp_epu8_mask! You still need a little bit of setup and a final "did any of them match" check, of course. Want to compare 512 values? You can probably do that in less than 10 clock cycles on modern machines.
Doing the same with a hash set? Better make sure that hash algorithm is fast. Sure it's O(1), but if calculating the hash takes 20 cycles it doesn't matter.
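A minimal sketch of that idea (assumes AVX-512BW and a 64-byte block; uses the simpler _mm512_cmpeq_epu8_mask variant of the compare):

    #include <immintrin.h>
    #include <stdbool.h>

    /* Does this 64-byte block contain `needle`? One load, one broadcast,
       one compare that produces a 64-bit match mask. */
    static bool block64_contains(const unsigned char *block, unsigned char needle)
    {
        __m512i data   = _mm512_loadu_si512(block);
        __m512i probe  = _mm512_set1_epi8((char)needle);
        __mmask64 hits = _mm512_cmpeq_epu8_mask(data, probe);
        return hits != 0;
    }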
Pick any compiled language and test it. Pick an algorithm making heavy use of a small (<10, maybe up to a hundred elements) hashset, and try using a linear structure instead. The difference will be most apparent with complicated keys, but even strings of more than a few characters should work.
Some example workloads include:
1. Tokenization (checking if a word is a keyword)
2. Interpretation (mapping an instruction name to its action)
3. ML (encoding low-cardinality string features in something like catboost)
4. JSON parsers (usually key count is low, so parse into a linear-scan hashmap rather than a general-purpose hashmap)
Details vary in the exact workload, the hardware you're using, what other instructions you're mixing in, etc. It's a well-known phenomenon though, and when you're doing a microoptimization pass it's definitely something to consider. 2x speedups are common. 10x or more happen from time to time.
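For the tokenization case above, the linear version is about as simple as it gets (a sketch; the keyword list is hypothetical):

    #include <stdbool.h>
    #include <string.h>

    /* With a handful of keywords, scanning a flat array is often faster than
       hashing the candidate string and probing a table. */
    static const char *keywords[] = { "if", "else", "while", "for", "return" };

    static bool is_keyword(const char *word)
    {
        for (size_t i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++)
            if (strcmp(word, keywords[i]) == 0)
                return true;
        return false;
    }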
It's similar to (but not _quite_ the same as) the reason real-world binary search uses linear scans for small element counts.
When you go to really optimize the system, you'll also find that the linear scan solution is often more amenable to performance improvements from batching.
As to how much it matters for your composite program? Even at a microoptimization level I think it's much more important to pay attention to memory access patterns. When we wrote our protobuf parser that's all we really had to care about to improve performance (33% less execution time for the entire workload, proto parsing being much better than that). You're much more likely to be able to write sane code that way (contrasted with focusing on instructions and whatnot first), and it's easier to layer CPU improvements on top of a good memory access pattern than to go the other way around.
My rule of thumb for 80%-90% of problems is: if you need a complicated algorithm, your data model isn't right. Sure, you do need complicated algorithms for compilers, db internals, route planning et al., but all things considered, those are a minority of use cases.
This is not a complicated algorithm. A hash map (dictionary) or a hash set is how you would always do deduplication in Python, because it is easiest to write / least keystrokes anyway. That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
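The nested-loop version that comment has in mind looks something like this in C (a sketch; `items` is a hypothetical array of strings, deduplicated in place):

    #include <stddef.h>
    #include <string.h>

    /* Keep the first occurrence of each string; the inner scan over everything
       already kept is what makes this O(n^2). */
    static size_t dedup_quadratic(const char **items, size_t n)
    {
        size_t kept = 0;
        for (size_t i = 0; i < n; i++) {
            int seen = 0;
            for (size_t j = 0; j < kept; j++) {
                if (strcmp(items[i], items[j]) == 0) {
                    seen = 1;
                    break;
                }
            }
            if (!seen)
                items[kept++] = items[i];
        }
        return kept;
    }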
> That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
I am confused. There are plenty of open source, fast hash map impls in C. This isn't knapsack. This is a dict lookup.
I wrote a funny algorithm to group together words that end the same way to write them once in my binary wordlist file, since there is an array that points to the start character and a \0 to end the word. My initial solution was O(n²), but it was too slow on a real wordlist, so I had to come up with something better. In the end I just sort the list with quicksort, but reverse the words, and then the groupable ones end up near each other.
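A sketch of that trick: sort with a comparator that walks both words from the end, so words sharing a suffix land next to each other (names here are hypothetical):

    #include <stdlib.h>
    #include <string.h>

    /* Compare two words from their last characters backwards. */
    static int cmp_by_suffix(const void *a, const void *b)
    {
        const char *x = *(const char *const *)a;
        const char *y = *(const char *const *)b;
        size_t i = strlen(x), j = strlen(y);
        while (i > 0 && j > 0) {
            if (x[i - 1] != y[j - 1])
                return (unsigned char)x[i - 1] - (unsigned char)y[j - 1];
            i--;
            j--;
        }
        return (i > j) - (i < j);   /* shorter word (a pure suffix) sorts first */
    }

    /* Usage: qsort(words, nwords, sizeof *words, cmp_by_suffix); */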
Cool discovery but the article could have been about 1/10 as long and still communicated effectively. At least they didn't post it as a video, so it was easy to skim to the important details.
Interesting that multiple people are noticing this same thing. For me, this could have been:
"We found that a large portion of the 48 hours taken to backup our rails respository was due to a function in `git bundle create` that checks for duplicate references entered as command line arguments. The check contained a nested for loop ( O(N2) ), which we replaced with a map data structure in an upstream fix to Git. The patch was accepted, but we also backported fix without waiting for the next Git version. With this change, backup time dropped to 41 minutes."
Yes. I read the whole article thinking that it must have been generated by an LLM, because the style at least resembles it.
I thought exactly the same. The reading experience of the post would definitely have been improved with less text.
Don't take my bullet points away from me
They came for the em dashes, and I did not speak up. Then they came for the bullet points, and I did not speak up…
Glad I wasn't the only one who thought this. The post is also missing one obvious thing that I expect in any technical post: code snippets. Let me see the code.
ChatGPT has ruined bullet points for the rest of us…
No offense, but writing this blog post couldn't take more than a few minutes; why spoil it with an LLM? Shoot, use one to check grammar and recommend edits, even.
That was also my thought.
For those who haven't read the article yet, scroll down to the flame graph and start reading until it starts talking about backporting the fix. Then stop.
"How we decreased reading 'How we decreased GitLab repo backup times from 48 hours to 41 minutes' times from 4.8 minutes to 41 seconds"
it could have been longer. I still don't know why they were doing backup bundles with two refs :)
They weren't; if you look at the fix [1], the dedupe loop ran in all cases, not just those with known dupes, so the performance hit applied to any bundle with lots of refs.
[1] https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
48 hours is a crazy amount of time to spend just to compress a git folder; it's only a couple GB. 41 minutes still seems like quite a long time.
Why aren't they just snapshotting and archiving the full git repo? Does `git bundle` add something over frequent ZFS backups?
> Be aware that even with these recommendations, syncing in this way has some risk since it bypasses Git's normal integrity checking for repositories, so having backups is advised. You may also wish to do a git fsck to verify the integrity of your data on the destination system after syncing.
https://git-scm.com/docs/gitfaq#_transfers
It doesn't tell you how to make a backup safely though.
On a personal scale, Syncthing and Btrfs snapshots work plenty good enough. It's as fast as the storage/network too.
Syncthing is the only way I've ever corrupted a git repo before
I think that's why they specified the "Btrfs snapshots" part. Yes, directly syncing a .git directory seems like a recipe for disaster with how often I've seen individual files lag during sync, but I guess with Btrfs snapshots one can ensure that only a consistent view of a git directory is being backed up and synced.
If filesystem snapshots weren't safe, wouldn't that also mean git is prone to corrupting your repo in the event of a power loss or crash? That seems like a bad bug.
ZFS snapshots are difficult to replicate offsite to non-ZFS targets, say an S3 bucket.
That said, there's another lesser-known feature that bundles help with: `git clone --bundle-uri`. The client can specify a location for a bundle, or the server can send the client the bundle location in the clone results; the client then fetches the bundle, unpacks it, and updates the delta via the git server. That makes cloning large repos a lot lighter on the server, and a ton faster for the client on initial clones.
I think if you want consistent snapshot backups on non-zfs destinations the safest thing is to clone the snapshot and rsync from the clone. Not a single-step operation but preserves the atomicity of the snapshot.
EDIT: you could also rsync from a .zfs snapshot directory if you have them enabled.
ZFS can send to a file or whatever you want to pipe to; you can have incremental sends, and if you convert to bookmarks on the sender you don't have to keep the historical data after you send it.
Reading the article I thought exactly the same! I'd be curious to know how much time the same would take with ZFS.
15 years also seems like a long time
I'm confused why you wouldn't simply snapshot the block-level device if the protocol of the information on top is going to cause this much headache. Quiescing git operations for block level activity is probably not trivial, but it sounds like an easier problem to solve to me.
This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve. Customer configures the VM for snapshots every X minutes. Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path. But, I still think the overall strategy is much more viable and robust to weird edges within git.
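For reference, turning WAL on is a one-liner at connection setup (a sketch; `app.db` is a hypothetical database file):

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;
        if (sqlite3_open("app.db", &db) != SQLITE_OK)
            return 1;
        /* Switch to write-ahead logging so writers don't rewrite the main
           database file in place, which plays much better with snapshots. */
        if (sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "WAL: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }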
> This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve.
A few months back a better solution was provided by SQLite: sqlite3_rsync
> Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path.
Bingo. One of the worst problems is helping a client piece back together a corrupted repo when they are using snapshots. Check my profile to see how I know. :)
It's usually an OMG down scenario, and then you are adding in the "oh no, now the restore is corrupted."
It's fixable but it's definitely annoying.
Gitlab isn't just a managed service. They release the software for self-hosted instances as well. There is no guarantee or requirement that users all run Gitlab on the same filesystem or even a filesystem that supports block level snapshots at all. Presumably, they want a universal backup system that works for all Gitlabs.
I've never heard of a .git folder that spanned multiple filesystems. It sounds like we are now conflating the git workspace with everything else in the product.
There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
I think GP's point is that the filesystems used by someone self-hosting gitlab may not be the same as what gitlab themselves are using.
File systems can be weird. Sometimes the OS can be weird and fsync type calls may not do what you expect. At least at one point MacOS fsync didn't behave the same way as Linux (i.e. Linux should ensure the write is truly done and not just in cache so long as the drive isn't lying). [0]
> There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
Gitlab has a community edition. Not handling data well would be bad for their public image.
I can't reply to the other reply for some reason, but what they said is indeed what I meant. gitlab.com might be running their Gitaly servers on btrfs or zfs or lvm volumes or whatever, but other customers may be using ext2. Gitlab the company could require customers to only run Gitaly on a specific filesystem, but up to now they never have, and it would be pretty shitty to suddenly change their minds after a decade-plus of establishing one expectation. And whoever the developer is who submitted the patch to upstream Git and got a technical blog post out of it has absolutely no power to dictate contract terms to enterprise customers personally, so instead they did what is actually in their power to do.
"fixed it with an algorithmic change, reducing backup times exponentially"
If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would guess not.
This is not the precise mathematical definition of exponential, but rather the colloquial one, where it just means "a lot".
You shouldn't use a word that can carry a precise mathematical meaning in a sentence that literally uses mathematical notation in order to speak precisely and then expect readers not to interpret the word in the precise mathematical way.
You should if you expect your readers to be normal humans who understand obvious context, and not pedantic HN readers who understand obvious context but delight in nit-picking it anyway.
I somewhat agree, but for lack of a better word, what would you use? Quadratically doesn't have the same punch
We simplify the big O notation in computer science. This is standard practice.
I find the whole article rather poorly written. Most likely it was written with an LLM.
>Ultimately, we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially.
No, not in the exact same sentence as a big-O. That's either author error, or an intentional troll. Either way it's egg on their faces.
The algorithmic complexity went down in the function they patched (a 6x improvement in their benchmark), but in the context of how they were using the function the overall impact was much larger (it now takes about 1% of the original time), which is plausibly exponential (and figuring out the actual complexity is neither relevant nor an economical use of time).
> figuring out the actual complexity is neither relevant nor an economic use of time
The fix was replacing a nested loop with a map. Figuring out that this goes from O(n^2) to O(n) (modulo details like bucket count) is immediate if you know what the words mean and understand enough to identify the problem and make the fix in the first place.
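For comparison, the map-based shape of such a fix looks roughly like this (a sketch with a toy fixed-size open-addressing set standing in for whatever hash set the codebase already provides):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Toy open-addressing string set, just to show the shape: one expected-O(1)
       lookup/insert per item instead of an inner scan over everything kept so far. */
    #define SLOTS 4096   /* power of two; must comfortably exceed the item count */

    static const char *slots[SLOTS];

    static bool set_insert(const char *s)   /* true if newly inserted */
    {
        size_t h = 5381;
        for (const char *p = s; *p; p++)
            h = h * 33 + (unsigned char)*p;
        for (size_t i = h & (SLOTS - 1); ; i = (i + 1) & (SLOTS - 1)) {
            if (!slots[i]) { slots[i] = s; return true; }
            if (strcmp(slots[i], s) == 0) return false;
        }
    }

    static size_t dedup_linear(const char **items, size_t n)
    {
        size_t kept = 0;
        for (size_t i = 0; i < n; i++)
            if (set_insert(items[i]))
                items[kept++] = items[i];
        return kept;
    }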
If you replace an n^2 algorithm with a log(n) lookup you get an exponential speed up. Although a hashmap lookup is usually O(1), which is even faster.
They're still using the map in a loop, so it'd be nlogn for a tree-based map or n for a hash map.
That is not true unless n^C / e^n = log(n) where C is some constant, which it is not. The difference between log(n) and some polynomial is logarithmic, not exponential.
But if you happen to have n=2^c, then an algorithm with logarithmic complexity only needs c time. That's why this is usually referred to as an exponential speedup in complexity theory, just like going from O(2^n) to O(n). More concretely, if the first algorithm needs 1024 seconds, the second one will need only 10 seconds in both cases, so I think it makes sense.
It depends if you consider "speedup" to mean dividing the runtime or applying a function to the runtime.
I.e. you are saying and f(n) speedup means T(n)/f(n), but others would say it means f(T(n)) or some variation of that.
I interpreted that as n->log(n) since log and exp are inverses.
Also because I've often heard that the quantum Fourier transform is an exponential speedup over the discrete Fourier transform, and there the scaling goes n^2 -> n log n.
There seems to be a lesson here about the balance between premature vs. anticipatory optimization. We're generally warned against premature optimization but perhaps, as a rule of thumb, we should look for optimizations in frequently-called functions that are obvious and not onerous to implement.
If a set-of-strings had been trivially available in the source language (at the time of implementation), the original programmer would probably have done this (relatively) trivial optimization... This is a symptom of anemic languages like C.
looks like the commit is here: https://github.com/git/git/commit/a52d459e72b890c192485002ec...
Thanks, this helped me find the submission and related discussion thread - https://lore.kernel.org/git/20250401-488-generating-bundles-...