I want "offline" dedupe, or "lazy" dedupe that doesn't require the pool to be fully offline, but doesn't happen immediately.
Because:
> When dedup is enabled [...] every single write and free operation requires a lookup and then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.
To me, this is "obviously" the wrong approach in most cases. When I'm writing data, I want that write to complete as fast as possible, even at the cost of disk space. That's why I don't save files I'm actively working on in 7zip archives.
But later on, when the system is quiet, I would love for ZFS to go back and figure out which data is duplicated, and use the BRT or whatever to reclaim space. This could be part of a normal scrub operation.
Lazy/off-line dedup requires block pointer rewrite, but ZFS _cannot_ and will not ever get true BP rewrite because ZFS is not truly a CAS system. The problem is that physical locations are hashed into the Merkle hash tree, and that makes moving physical locations prohibitively expensive as you have to rewrite all the interior nodes on the way to the nodes you want to rewrite.
A better design would have been to split every node that has block pointers into two sections: one that has only logical block pointers and all of whose contents gets hashed into the tree, and one that has only the physical locations (as if it were a cache) of the corresponding logical block pointers in the first section, with that second section _not_ hashed into the Merkle hash tree. Then BP rewrite would only require rewriting blocks that are not part of the Merkle hash tree.
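A rough sketch of that split, with made-up field names (this is the idea described above, not actual ZFS on-disk structures):

    # Hypothetical sketch of the "split node" idea -- not real ZFS structures.
    # Only the logical half feeds the Merkle tree; the physical half is a
    # relocatable cache that could be rewritten freely (i.e. BP rewrite).
    from dataclasses import dataclass
    from hashlib import sha256

    @dataclass
    class LogicalPointer:        # hashed into the Merkle tree
        child_hash: bytes        # content hash of the referenced block
        logical_size: int

    @dataclass
    class PhysicalHint:          # NOT hashed; just a cache of where the data lives
        vdev: int
        offset: int
        asize: int

    @dataclass
    class IndirectNode:
        logical: list            # list[LogicalPointer] -- covered by the tree hash
        physical: list           # list[PhysicalHint]   -- free to rewrite later

        def merkle_hash(self) -> bytes:
            h = sha256()
            for lp in self.logical:
                h.update(lp.child_hash)
                h.update(lp.logical_size.to_bytes(8, "little"))
            return h.digest()    # moving data only changes `physical`, not this hash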
But as it is you can't get BP rewrite to work on ZFS, so you can't get what you're asking for.
Well... maybe. Perhaps on read hash mismatch ZFS could attempt to locate the pointed-to block in the dedup table using the hash from the pointer. Then ZFS could reallocate the dedup'ed block. The price you'd pay then is one pointless read -- not too bad. The impossibility of BP rewrite generally leads to band-aids like this.
> CAS system
It looks like it means: https://en.wikipedia.org/wiki/Content-addressable_storage
Sorry, yes, CAS really means that pointers are hash values -- maybe with extra metadata, yes, but _not_ including physical locations. The point is that you need some other way to map logical pointers to physical locations. The easiest way to do that is to store the mappings nearby to the references so that they are easy to find, but the mappings must be left out of the Merkle hash tree in order to make it possible to change the physical locations of the referenced blocks.
> I just wish we had "offline" dedupe, or even "lazy" dedupe...
This is the Windows dedupe methodology. I've used it pretty extensively and I'm generally happy with it when the underlying hardware is sufficient. It's very RAM and I/O hungry but you can schedule and throttle the "groveler".
I have had some data-eating corruption from bugs in the Windows Server 2012 R2 timeframe.
You can use any of the offline dupe finders to do this.
Like jdupes or duperemove.
I sent PRs to both the ZFS folks and the duperemove folks to support the syscalls needed.
I actually have to go follow up on the ZFS one; it took a while to review and I realized I completely forgot to finish it up.
The ability to alter existing snapshots, even in ways that fully preserve the data, is extremely limited in ZFS. So yes that would be great, but if I was holding my breath for Block Pointer Rewrite I'd be long dead.
You need block pointer rewrite for this?
You don't need it to dedup writable files. But redundant copies in snapshots are stuck there as far as I'm aware. So if you search for duplicates every once in a while, you're not going to reap the space savings until your snapshots fully rotate.
"And this is the fundamental issue with traditional dedup: these overheads are so outrageous that you are unlikely to ever get them back except on rare and specific workloads."
This struck me as a very odd claim. I've worked with Pure and Dell/EMC arrays and for VMWare workloads they normally got at least 3:1 dedupe/compression savings. Only storing one copy of the base VM image works extremely well. Dedupe/compression works really well on syslog servers where I've seen 6:1 savings.
The effectiveness of dedupe is strongly affected by the size of the blocks being hashed: the smaller, the better. As the blocks get smaller, the odds of having a matching block grow rapidly. In my experience 4KB is my preferred block size.
Couple of comments. Firstly, you are talking about highly redundant information when referencing VM images (e.g. the C drive on all Windows Server images will be virtually identical), whereas he was using his own laptop contents as an example.
Secondly, I think you are conflating two different features: compression & de-duplication. In ZFS you can have compression turned on (almost always worth it) for a pool, but still have de-duplication disabled.
Fair point. My experience is with enterprise storage arrays and I have always used dedupe/compression at the same time. Dedupe is going to be a lot less useful on single computers.
I consider dedupe/compression to be two different forms of the same thing: compression reduces short-range duplication while deduplication reduces long-range duplication of data.
Yeah agreed, very closely related - even more so on ZFS where the compression (AFAIK) is on a block level rather than a file level.
Base VM images would be a rare and specific workload, and one of the few cases where dedupe makes sense. However, you are likely using better strategies like block or filesystem cloning if you are doing VM hosting off a ZFS filesystem. Not doing so would be throwing away one of its primary differentiators as a filesystem in such an environment.
General purpose fileserving or personal desktop/laptop use generally has very few duplicated blocks and is not worth the overhead. Backups are hit or miss depending on both how the backups are implemented, and if they are encrypted prior to the filesystem level.
Compression is a totally different thing, and current ZFS best practice is to enable it by default for pretty much every workload - the CPU used is barely worth mentioning these days, and the I/O savings can be considerable even ignoring any storage space savings. Log storage will likely see a lot better than 6:1 savings if you have typical logging, at least in my experience.
> General purpose fileserving or personal desktop/laptop use generally has very few duplicated blocks and is not worth the overhead.
I would contest this is because we don't have good transparent deduplication right now - just some bad compromises. Hard links? Edit anything and it gets edited everywhere - not what you want. Symlinks? Look different enough that programs treat them differently.
I would argue your regular desktop user actually has an enormous demand for a good deduplicating file system - there's no end of use cases where the first step is "make a separate copy of all the relevant files just in case" and a lot of the time we don't do it because it's just too slow and wasteful of disk space.
If you're working with, say, large video files, then a good dedupe system would make copies basically instant, and have a decent enough splitting algorithm that the edits/cuts/etc. people try to do losslessly or with editing programs are stored efficiently without special effort. How many people are producing video content today? Thanks to TikTok we've dragged that skill right down to "random teenagers" who might hopefully pipeline into working with larger content.
But according to the article the regular desktop already has such a dedup system:
> If you put all this together, you end up in a place where so long as the client program (like /bin/cp) can issue the right copy offload call, and all the layers in between can translate it (eg the Windows application does FSCTL_SRV_COPYCHUNK, which Samba converts to copy_file_range() and ships down to OpenZFS). And again, because there's that clear and unambiguous signal that the data already exists and also it's right there, OpenZFS can just bump the refcount in the BRT.
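For illustration, this is roughly what "issuing the right copy offload call" looks like from userspace on Linux - a minimal sketch using Python's os.copy_file_range(); whether it actually becomes a BRT clone depends on the kernel and OpenZFS versions underneath, and the paths are placeholders:

    # Minimal sketch of a userspace "copy offload": os.copy_file_range() lets
    # the kernel/filesystem service the copy itself (server-side copy, reflink,
    # or -- on OpenZFS 2.2+ -- a BRT clone) instead of bouncing data through
    # userspace.
    import os

    def offload_copy(src_path: str, dst_path: str) -> None:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            remaining = os.fstat(src.fileno()).st_size
            while remaining > 0:
                copied = os.copy_file_range(src.fileno(), dst.fileno(), remaining)
                if copied == 0:      # hit EOF early, nothing more to copy
                    break
                remaining -= copied

    # offload_copy("source.dat", "clone.dat")   # placeholder paths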
I haven't tried it myself, but the widely quoted number for old ZFS dedup is that you need 5GB of RAM for every 1TB of disk space. Considering that 1 TB of disk space currently costs about $15 and 5GB of server RAM about $25, you need a 3:1 dedupe ratio just to break even.
If your data is a good fit you might get away with 1GB per TB, but if you are out of luck the 5GB might not even be enough. That's why the article speaks of ZFS dedup having a small sweet spot that your data has to hit, and why most people don't bother.
Other file systems tend to prefer offline dedupe, which has more favorable economics.
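For what it's worth, the break-even arithmetic above works out roughly like this (a sketch using the quoted figures, ~$15/TB of disk and ~$25 per 5GB of RAM; both are assumptions that move with the market):

    # Back-of-the-envelope for the parent's numbers: classic dedup wants ~5GB
    # of RAM per TB of pool, so each deduped logical TB costs (disk + RAM) /
    # ratio, versus plain disk cost with dedup off.
    DISK_PER_TB = 15.0   # assumed $ per TB of raw disk
    RAM_PER_TB = 25.0    # assumed $ for the ~5GB of RAM backing 1TB of pool

    def cost_per_logical_tb(dedup_ratio: float) -> float:
        # At ratio r, one logical TB occupies 1/r physical TB, and the DDT RAM
        # scales with that physical footprint too.
        return (DISK_PER_TB + RAM_PER_TB) / dedup_ratio

    break_even_ratio = (DISK_PER_TB + RAM_PER_TB) / DISK_PER_TB
    print(round(break_even_ratio, 2))  # ~2.67, i.e. roughly the 3:1 quoted above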
That doesn't account for OpEx, though, such as power...
Assuming something reasonable like 20TB Toshiba MG10 HDDs and 64GB DDR4 ECC RAM, quick googling suggests that 1TB of disk space uses about 0.2-0.4W of power (0.2 idle, 0.4 while writing), and 5GB of RAM about 0.3-0.5W. So your break-even on power comes a bit earlier depending on the access pattern, but it's in the same ballpark.
Why does it need so much RAM? It should only need to store the block hashes which should not need anywhere near that much RAM. Inline dedupe is pretty much standard on high-end storage arrays nowadays.
The linked blog post covers this, and the improvements made to make the new dedup better.
(5GiB / 1TiB) * 4KiB to bits
((5 gibibytes) / (1 tebibyte)) × (4 kibibytes) = 160 bits
VMs are known to benefit from dedupe so yes, you'll see benefits there. ZFS is a general-purpose filesystem not just an enterprise SAN so many ZFS users aren't running VMs.
> Dedupe/compression works really well on syslog
I apologize for the pedantry but dedupe and compression aren't the same thing (although they tend to be bundled in the enterprise storage world). Logs are probably benefiting from compression not dedupe and ZFS had compression all along.
They are not the same thing, but when you boil it down to the raw math, they aren't identical twins, but they're absolutely fraternal twins.
Both are trying to eliminate repeating data, it's just the frame of reference that changes. Compression in this context is operating on a given block or handful of blocks. Deduplication is operating on the entire "volume" of data. "Volume" having a different meaning depending on the filesystem/storage array in question.
Well put. I like to say compression is just short-range dedupe. Hash-based dedupe wouldn't be needed if you could just do real-time LZMA on all of the data on a storage array, but that just isn't feasible and hash-based dedupe is a very effective compromise.
Is "paternal twins" a linguistic borrowing of some sort? It seems a relatively novel form of what I've mostly seen referred to as monozygotic / 'identical' twins. Searching for some kind of semi-canonical confirmation of its widespread use turns up one, maybe two articles where it's treated as an orthodox term, and at least an equal number of discussions admonishing its use.
Compression tends NOT to use a global dictionary, so to me they are vastly different even if they have the same goal of reducing the output size.
Compression with a global dict would likely do better than dedup, yet it would have a lot of other issues.
If we're being pedants, then storing the same information in fewer bits than the input is by definition a form of compression, no?
(Although yes I understand that file-level compression with a standard algorithm is a different thing than dedup)
We used to make extensive use of, and gained huge benefit from, dedup in ZFS. The specific use case was storage for VMWare clusters where we had hundreds of Linux and Windows VMs that were largely the same content. [this was pre-Docker]
I've read multiple comments on using dedup for VMs here. Wouldn't it be a lot more efficient for this to be implemented by the hypervisor rather than the filesystem?
I'm a former VMware certified admin. How do you envision this to work? All the data written to the VM's virtual disk will cause blocks to change and the storage array is the best place to keep track of that.
You do it at the file system layer. Clone the template which creates only metadata referencing the original blocks then you perform copy-on-write as needed.
> VMware certified admin
Not to be rude, but does this have any meaning?
COW is significantly slower and has nesting limits when compared to these deduped clones. Great question!
Can relate. I've recently taken ownership of a new work laptop with Ubuntu (with "experimental" zfs) and using dedupe on my nix store has been an absolute blessing!
Nix already has some builtin deduplication, see `man nix-store-optimise`. Nix's own hardlinking optimization reduces disk usage of the store (for me) by 30-40%.
Update. Turns out PyCharm does not play nice with a plethora of symlinks. :(
Well, TIL. Being relatively new to nix, you've led me down another rabbit hole :)
Isn't it better to use `nix store optimise` for dedup of the nix store? The nix command has more knowledge of the structure of the nix store so should be able to do a better job with fewer resources. Also the store is immutable so you don't actually need reflinks - hard links are enough.
It is, yeah, though you have to turn it on. I'm not actually sure why it's off by default.
I'm so excited about fast dedup. I've been wanting to use ZFS deduping for ArchiveBox data for years, as I think fast dedup may finally make it viable to archive many millions of URLs in one collection and let the filesystem take care of compression across everything. So much of archive data is the same jquery.min.js, bootstrap.min.css, logo images, etc. repeated over and over in thousands of snapshots. Other tools compress within a crawl to create wacz or warc.gz files, but I don't think anyone has tried to do compression across the entire database of all snapshots ever taken by a tool.
Big thank you to all the people that worked on it!
BTW has anyone tried a probabilistic dedup approach using something like a bloom filter, so you don't have to store the entire dedup table of hashes verbatim? Collect groups of ~100 block hashes into a bucket each, and store a hyper-compressed representation in a bloom filter. On write, look up the hash of the block to write in the bloom filter, and if a potential dedup hit is detected, walk the 100 blocks in the matching bucket manually to look for any identical hashes. In theory you could do this with layers of bloom filters at different resolutions and dynamically swap the heavier ones out to disk when memory pressure is too high to keep the high-resolution ones in RAM. Allowing the accuracy of the bloom filter to be changed as a tunable parameter would let people choose their preference around the CPU time/overhead : bytes saved ratio.
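A toy sketch of that layered lookup (hand-rolled bloom filter, hypothetical bucket/exact-lookup helpers - nothing here reflects how the ZFS DDT is actually organized):

    # The bloom filter answers "definitely a new block" cheaply; only probable
    # hits fall through to the exact (and expensive) per-bucket hash walk.
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item: bytes):
            for i in range(self.k):
                h = hashlib.blake2b(item, digest_size=8, salt=i.to_bytes(8, "little"))
                yield int.from_bytes(h.digest(), "little") % self.size

        def add(self, item: bytes):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item: bytes) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    # exact_lookup() stands in for walking the ~100-hash bucket on disk.
    def dedup_check(block_hash: bytes, bloom: BloomFilter, exact_lookup) -> bool:
        if not bloom.might_contain(block_hash):
            return False                    # definitely new: just write the block
        return exact_lookup(block_hash)     # maybe a dup: pay for the precise check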
Even with this change ZFS dedupe is still block-aligned, so it will not match repeated web assets well unless they exist at consistently identical offsets within the warc archives.
dm-vdo has the same behaviour.
You may be better off with long-range solid compression instead, or unpacking the warc files into a directory equivalent, or maybe there is some CDC-based FUSE system out there (Seafile perhaps)
I should clarify I don't use WARCs at all with ArchiveBox; it just stores raw files on the filesystem because I rely on ZFS for all my compression, so there is no offset alignment issue.
The wget extractor within archivebox can produce WARCs as an output but no parts of ArchiveBox are built to rely on those, they are just one of the optional extractors that can be run.
I get the use case, but in most cases (and particularly this one) I'm sure it would be much better to implement that client-side.
You may have seen in the WARC standard that they already do de-duplication based on hashes and use pointers after the first store. So this is exactly a case where FS-level dedup is not all that good.
WARC only does deduping within a single WARC, I'm talking about deduping across millions of WARCs.
That's not true, you commonly have CDX index files which allow for de-duplication across arbitrarily large archives. The internet archive could not reasonably operate without this level of abstraction.
[edit] Should add a link, this is a pretty good overview, but you can also look at implementations such as the new zeno crawler.
https://support.archive-it.org/hc/en-us/articles/208001016-A...
While a slightly different use case, I suspect you'd like zbackup if you don't know about it.
I wonder why they are having so much trouble getting this working properly with smaller RAM footprints. We have been using commercial storage appliances that have been able to do this for about a decade (at least) now, even on systems with "little" RAM (compared to the amount of disk storage attached).
Just store fingerprints in a database and run through that at night and fixup the block pointers...
> and fixup the block pointers
That's why. Due to reasons[1], ZFS does not have the capability to rewrite block pointers. It's been a long requested feature[2] as it would also allow for defragmentation.
I've been thinking this could be solved using block pointer indirection, like virtual memory, at the cost of a bit of speed.
But I'm by no means a ZFS developer, so there's surely something I'm missing.
[1]: http://eworldproblems.mbaynton.com/posts/2014/zfs-block-poin...
It looks like they're playing more with the indirection features now (created for vdev removal) to build other features. One of the recent summit hackathons sketched out using indirect vdevs to perform rebalancing.
Once you get a lot of snapshots, though, the indirection costs start to rise.
Fixup block pointers is the one thing ZFS didn't want to do.
You can also use DragonFlyBSD with Hammer2, which supports both online and offline deduplication. It is very similar to ZFS in many ways. The big drawback though, is lack of file transfer protocols using RDMA.
I've also heard there are some experimental branches that make it possible to run Hammer2 on FreeBSD. But FreeBSD also lacks RDMA support. For FreeBSD 15, Chelsio has sponsored NVMe-oF target and initiator support. I think this is just TCP though.
You should use:
cp --reflink=auto
You get file-level deduplication. The command above performs a lightweight copy (a ZFS clone at the file level), where the data blocks are copied only when modified. It's a copy, not a hard link. The same should work in other copy-on-write transactional filesystems as well, if they have reflink support.
I wanted to use ZFS badly, but of course all data must be encrypted. It was surprising to see how much more complicated usage gets than expected, and how many people just don't encrypt their data because things get wild then.
Look, even Proxmox, which I totally expected to support encryption with the default installation (it has "Enterprise" on the website), does lose important features when you try to use it with encryption.
Also, please study the issue tracker; there are a few surprising things I would not have expected to exist in a production filesystem.
The best way to encrypt ZFS is to run unencrypted ZFS atop encrypted volumes (e.g. LUKS volumes). ZFS "encryption" leaves too much in plaintext for my comfort.
In the Proxmox forum some people tried this method and did not report much success. Cannot recommend it for production.
Still the same picture: encryption does not seem to be a first-class citizen in ZFS land.
I really wish we just had a completely different API as a filesystem. The API surface of filesystem on every OS is a complete disaster that we are locked into via backwards compatibility.
Internally ZFS is essentially an object store. There was some work which tried to expose it through an object store API. Sadly it seems to not have gone anywhere.
Tried to find the talk but failed; was sure I had seen it at a Developer Summit, but alas.
Why is it a disaster and what would you replace it with? Is the AWS S3 style API an improvement?
It's only a "disaster" if you are using it exclusively programmatically and want to do special tuning.
File systems are pretty good if you have a mix of human and programmatic uses, especially when the programmatic cases are not very heavy duty.
The programmatic scenarios are often entirely human hostile, if you try to imagine what would be involved in actually using them. Like direct S3 access, for example.
High-density drives are usually zoned storage, and it's pretty difficult to implement the regular filesystem API on top of that with any kind of reasonable performance (device- vs host- managed SMR). The S3 API can work great on zones, but only because it doesn't let you modify an existing object without rewriting the whole thing, which is an extremely rough tradeoff.
One way it's a disaster is that file names (on Linux at least, haven't used Windows in a long time) are byte strings, and a single path can cross different/multiple file systems.
So if you have non-ASCII characters in your paths, encoding/decoding is guesswork and, at worst, differs from path segment to path segment, with no metadata attached saying which encoding to use.
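A quick illustration of that bytes-vs-text guesswork (just a sketch; it lists the current directory):

    # The kernel hands back raw bytes; whether they decode cleanly depends on
    # whatever encoding each path segment happened to be written with.
    import os

    for raw_name in os.listdir(b"."):            # bytes in, bytes out
        try:
            name = raw_name.decode("utf-8")      # only works if this name is UTF-8
        except UnicodeDecodeError:
            # lossless escape hatch for undecodable bytes
            name = raw_name.decode("utf-8", errors="surrogateescape")
        print(name)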
ZFS actually has settings related to that, which originated from providing filesystems for different OSes: it can enforce canonical UTF-8 with a specific normalization rule. AFAIK the reason it exists was cooperation between Solaris, Linux, Windows, and Mac OS X machines all sharing the same network filesystem hosted from ZFS.
That definitely does not sound like much fun to deal with.