
Small file size and directory order information leakage #3687

Closed
nadalle opened this issue Mar 13, 2018 · 18 comments

Comments

@nadalle

nadalle commented Mar 13, 2018

After talking to Thomas Waldmann on IRC, I'm opening a ticket mostly to document this issue.

Currently, borg's model for storing chunks in the repository leaks information about the size and (potentially) directory structure of small files. Size and directory structure information can often be used to determine information about the content of the files themselves, through various fingerprint / watermark attacks.

Files in borg are chunked, compressed, and then the chunks are stored in the repository using PUT commands. These PUT commands necessarily contain the size of the chunk being stored, and additionally the segment files in the repository contain these sizes in easily parsed form.

For small files, the size of the chunks prior to compression will typically just be the size of the file itself. Additionally, a great many modern file formats are not compressible, and in any event an attacker can simply compress the data too. In this way, the chunk sizes can leak information about the size of small files.

Additionally, chunks will tend to be stored in temporal order in their associated segment files. Since the order borg processed the files in will be some directory ordering, the temporal order in the repository should typically match a directory sort order, which leaks additional information.

As a hypothetical example of how size and directory structure information can be used as a data fingerprint, consider the common method of fingerprinting CDs based on their track lengths. Even if a CD was encoded to MP3, it's typically possible to convert file size back to track length by assuming a standard bitrate.

It should then be evident that a database of CD track lengths is sufficient to show that a directory of encrypted files contains the CD tracks with high confidence, just by comparing the ratios of the file sizes to the ratios of track lengths. This can be done without reading any of the data, just the file sizes.
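For illustration only, here is a toy sketch of that ratio-matching idea (the album database, tolerance value and helper names are hypothetical, not part of borg):

```python
def normalize(values):
    """Scale a list of sizes/lengths so they sum to 1.0, sorted for comparison."""
    total = float(sum(values))
    return sorted(v / total for v in values)

def looks_like_album(observed_sizes, track_lengths, tolerance=0.02):
    """True if the size ratios of the encrypted files resemble the
    track-length ratios of one known album."""
    if len(observed_sizes) != len(track_lengths):
        return False
    return all(abs(a - b) <= tolerance * b
               for a, b in zip(normalize(observed_sizes), normalize(track_lengths)))

def identify(observed_sizes, album_db):
    """album_db: {"album title": [track lengths in seconds], ...} (hypothetical)."""
    return [title for title, lengths in album_db.items()
            if looks_like_album(observed_sizes, lengths)]
```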

Many other more advanced sorts of watermarking attacks are possible.

There are various potential ways to solve this information leak. A few ideas:

  • Pad small files up to the chunk size.
  • Bundle multiple files into a chunk somehow.
  • More generally, differentiate between segment records (known to the remote server) and chunks (known to the client). Make the records all the same size (a rough sketch of this follows the list).
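A rough sketch of the first and third ideas combined - embedding chunks into fixed-size records so the server only ever sees multiples of one record size (RECORD_SIZE and the length-prefix framing are illustrative choices, not borg's storage format):

```python
import os

RECORD_SIZE = 64 * 1024  # hypothetical fixed record size

def pad_to_records(chunk: bytes) -> bytes:
    """Frame a chunk with its length and pad with random bytes up to a
    multiple of RECORD_SIZE, so stored sizes no longer reveal chunk sizes."""
    framed = len(chunk).to_bytes(4, "big") + chunk
    return framed + os.urandom((-len(framed)) % RECORD_SIZE)

def unpad_records(record: bytes) -> bytes:
    """Recover the original chunk from a padded record."""
    length = int.from_bytes(record[:4], "big")
    return record[4:4 + length]
```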

Regardless of whether this is fixed any time soon, it seemed worth documenting better. The current documentation states "[...] all your files/dirs data and metadata are stored in their encrypted form into the repository" so a leak of file size and directory order information seems notable.

@nadalle
Author

nadalle commented Mar 13, 2018

As an additional note, storing chunks post-compression leaks some information on large files too. For example, imagine I store a compressible file such as a VM image. After it gets chunked and compressed, an attacker can see a sequence of blocks, each of which was compressible by a certain amount. If the attacker has a copy of the file, this ought to be sufficient to show that borg stored that large file.

@ThomasWaldmann
Member

ThomasWaldmann commented Mar 13, 2018

The small file fingerprinting is a somewhat known problem.

If one uses compression, the stored sizes no longer match the originals precisely, but of course an attacker who has the original files could also try matching against all compression algorithms / levels.

The large file fingerprinting would not work like that.

The chunker is seeded with a random value that is part of borg's key material and different for each repo.
The attacker's and the victim's chunkers will therefore cut chunks at different places, so the sequence of chunk sizes does not give a fingerprint. Also, the overall size of all chunks of such a large file after compression is influenced by that, so that's not easy to match either.

@enkore
Contributor

enkore commented Mar 22, 2018

The random chunk seed is only useful if an attacker cannot observe how a known (big) file is chunked. Additionally, chunk boundaries are low in entropy due to their small variance.

Broadly speaking, this sort of fingerprinting is difficult to defeat in an online backup system, because there are more channels to consider (like I/O timing of the source read operations leaking into the I/O timing of the network side).

Some other tools (duplicacy and casync, to name two) chunk an aggregate (basically a tar); similar issues regarding compression apply. These are additionally vulnerable to CRIME/BREACH-like attacks between files co-located in the same aggregate, if compression is used (as it is with .tar.gz & friends).

https://en.wikipedia.org/wiki/BREACH
https://en.wikipedia.org/wiki/CRIME

@enkore
Contributor

enkore commented Mar 22, 2018

Regarding directory order - borg traverses in inode order (since 1.1), not native (usually hashed) directory order, nor sorted by name/mtime/whatever. 1.0 uses hashed order, which of course results in poor performance.

Off-hand I'd say neither hashed dent order nor inode order recovery is really that valuable. Inode order tells you something about the allocation order of the inodes (depending on the FS); hashed order tells you... something?

@ThomasWaldmann
Member

See PR ^^^.

@ThomasWaldmann
Member

ThomasWaldmann commented Mar 23, 2018

While thinking about attacks against the random chunker seed: we could move from a static seed stored in the key to a per-backup (or even per-file) generated seed.

Update: no, that does not work. We always need to chunk in the same way (per repo), or new chunks made from the same input file won't dedup against old chunks made from that file.

@ThomasWaldmann ThomasWaldmann self-assigned this Mar 23, 2018
@ThomasWaldmann
Member

ThomasWaldmann commented Mar 24, 2018

Documented by #3710 and #3720.

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Mar 24, 2018
ThomasWaldmann added a commit that referenced this issue Mar 24, 2018
security: describe chunk size / proximity issue, see #3687
@ThomasWaldmann ThomasWaldmann removed their assignment Mar 24, 2018
@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.5, 2.0 - future Mar 24, 2018
@ncleaton
Contributor

ncleaton commented Nov 29, 2019

> As an additional note, storing chunks post-compression leaks some information on large files too. For example, imagine I store a compressible file such as a VM image. After it gets chunked and compressed, an attacker can see a sequence of blocks, each of which was compressible by a certain amount. If the attacker has a copy of the file, this ought to be sufficient to show that borg stored that large file.

Actually, I think it's a bit worse than that for large files. Even without compression, if an attacker has a copy of a large file that may be in your repo then they can test for the presence of that file with a high degree of certainty, and if the file is present they can extract information equivalent to having the chunker seed.

This is because, as commented in src/borg/_chunker.c, "XORing the entire table by a constant is equivalent to XORing the hash output with a different constant", so when we XOR the entire table by the chunker seed, it's the same as having a chunker seed of 0 but XORing the hash value by some other constant each time we test it. As by default we're only testing the lowest 21 bits of the hash value, only the lowest 21 bits of this other constant matter. That's important because although there are 2^32 possible chunker seeds, many of them are equivalent to each other and there are really only 2^21 different chunking patterns.

To verify this, chunk some input using a random chunker seed and record the cutpoints, and then look at the buzhash values at those cutpoints with a chunker seed of 0. You'll see that there's a particular value that the lowest 21 bits of that 0-seed buzhash take at every cutpoint.
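A toy demonstration of that, deliberately naive and with a much smaller window and cut mask than borg's real chunker (window size, mask, cut rule and table generation below are illustrative only):

```python
import os, random

def bshift(v, n):
    """32-bit barrel rotation."""
    n &= 31
    return ((v << n) | (v >> (32 - n))) & 0xFFFFFFFF

def buzhash(window, table):
    h = 0
    for b in window:
        h = bshift(h, 1) ^ table[b]
    return h

random.seed(0)
table0 = [random.getrandbits(32) for _ in range(256)]   # the "seed 0" table
seed = random.getrandbits(32)
table_seeded = [t ^ seed for t in table0]               # table XORed by the seed

W, MASK = 31, (1 << 12) - 1     # tiny window / 12-bit cut mask, for speed
data = os.urandom(100_000)

low_bits_at_cutpoints = set()
for i in range(len(data) - W):
    window = data[i:i + W]
    if buzhash(window, table_seeded) & MASK == 0:        # seeded cutpoint
        low_bits_at_cutpoints.add(buzhash(window, table0) & MASK)

# Every seeded cutpoint maps to ONE constant value under the unseeded table:
print(low_bits_at_cutpoints)    # a single-element set
```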

2^21 is a small enough number that it's feasible to build a database of how a particular file would map to compressed chunk sizes for each of the 2^21 possible values. Armed with that database, it's very quick to inspect a borg repo to determine if that file is present, and to get the lowest 21 bits of the transformed chunker seed if so. I've tested this with a couple of ~10M files; it works quite well.

@ncleaton
Contributor

Given that the small files information leak can't be fixed, maybe there's not much point worrying about this leak too much. Perhaps it isn't worthwhile to improve the situation at the cost of making the chunker slower.

What about jumbling up the buzhash table when applying the seed, using some other key material that's kept alongside the chunker seed to determine which entry moves to which slot? This would add some code complexity but has no chunker speed penalty.

Because buzhash never uses an input byte as anything other than an index into the table, and it never uses anything other than an input byte as an index into the table, shuffling the table is equivalent to applying a byte-to-byte translation to the input before buzhashing. The buzhash is still a buzhash. I can't think of any way this would make any attack easier, even if the attacker somehow determines the mapping.
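A minimal sketch of that keyed shuffle, assuming a separate secret stored next to the chunker seed (the derivation below is illustrative, not a vetted KDF or borg's key layout):

```python
import hashlib, random

def keyed_permutation(secret: bytes):
    """Derive a permutation of 0..255 from secret key material."""
    rng = random.Random(int.from_bytes(hashlib.sha256(secret).digest(), "big"))
    perm = list(range(256))
    rng.shuffle(perm)           # keyed Fisher-Yates shuffle
    return perm

def shuffle_table(table, perm):
    """Move table[i] to slot perm[i] -- the shuffled buzhash table."""
    shuffled = [0] * 256
    for i, entry in enumerate(table):
        shuffled[perm[i]] = entry
    return shuffled

# Looking up byte b in the shuffled table equals looking up perm_inverse[b] in
# the original table, i.e. the byte-to-byte input translation described above.
```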

@ThomasWaldmann
Member

Well, while that would not be a breaking change, it could still blow things up for some people by doubling the repo size when that change in chunking is introduced. And that big size would stay as long as old archives made with the old chunking method are present.

@ncleaton
Contributor

ncleaton commented Dec 2, 2019

I was thinking that the change should only apply to newly created repos, when a new chunker seed is being generated anyway. As this change doesn't address the small files info leak, it's hard to justify the disruption of imposing it on existing repos.

Anyone with an existing repo who's very concerned about information leaks in large files in particular can choose to init a new repo and make future backups to that, and either delete the old repo after a while or bundle it up using tar and gnupg.

@ncleaton
Contributor

I had a go at implementing a permutation of the buzhash table: #4921

However, there's a limit to how resistant buzhash can be, no matter what changes are made to the table constants. The buzhash operations are all linear, so an attacker could treat all 8192 bits of the table (256 entries × 32 bits) as unknowns for Gaussian elimination, with each cutpoint on known data giving 21 equations. They'd need about 390 known cutpoints to be able to solve for the entire table.

Without this change: an attacker needs 1 known cutpoint to recover the chunker parameters.

With this change: the attacker can definitely still recover the chunker parameters with 390 known cutpoints. There may be another attack that needs fewer cutpoints, but if it's binary data that's being chunked then about 80 known cutpoints is a lower bound on what any attack needs, because the chunker permutation holds about 80*21 bits of information. On the other hand, if it's ASCII data using only 64 distinct byte values, then the attacker only needs a quarter of the buzhash table and the bound is more like 25 cutpoints.
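The figures above can be reproduced with a quick back-of-the-envelope calculation (plain arithmetic, nothing borg-specific; the helper name is made up):

```python
import math

def log2_arrangements(n, k):
    """log2(n * (n-1) * ... * (n-k+1)): bits in an ordered choice of k out of n."""
    return (math.lgamma(n + 1) - math.lgamma(n - k + 1)) / math.log(2)

print(math.ceil(256 * 32 / 21))          # ~390 cutpoints to solve for all 8192 table bits
print(log2_arrangements(256, 256) / 21)  # ~80: a full permutation is ~1684 bits, 21 per cutpoint
print(log2_arrangements(256, 64) / 21)   # ~24: the ASCII-only case, about a quarter of the table
```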

@ncleaton
Contributor

There is another change that would help: rather than always cutting the chunk at the start of the 4095-byte buzhash window, compute a keyed digest of the window after every buzhash match, and use the low 12 bits of the digest to determine the offset into the window of the cutpoint.

That effectively takes back 12 of the 21 bits of information that an attacker gets with each known cutpoint. I think it would have fairly low cost because borg already computes a digest of each chunk and it's only an extra digest of 4095 bytes out of every 2M.
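A sketch of that, assuming a dedicated offset key kept with the other chunker key material (window size, digest choice and key handling are illustrative):

```python
import hashlib

WINDOW = 4095
OFFSET_BITS = 12

def keyed_cut_position(data: bytes, window_start: int, offset_key: bytes) -> int:
    """After a buzhash match, derive the actual cutpoint from a keyed digest of
    the matching window instead of always cutting at the same window position."""
    window = data[window_start:window_start + WINDOW]
    digest = hashlib.blake2b(window, key=offset_key, digest_size=8).digest()
    # low 12 bits of the keyed digest, kept inside the window
    offset = (int.from_bytes(digest, "big") & ((1 << OFFSET_BITS) - 1)) % WINDOW
    return window_start + offset
```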

@ThomasWaldmann
Member

ThomasWaldmann commented Nov 17, 2020

Just had an idea to solve this with a size-obfuscating meta-compressor, please review #5510.

note that #5510 (obfuscate all chunk sizes, even for small files if there is only 1 chunk) solves a slightly different issue than #5070 (make chunk sizes less predictable for bigger files resulting in multiple chunks).

Note: older borg could not "decompress" chunks made with this new compressor, but as it is opt-in via -C obfuscate,..., that should be rather obvious. As the change isn't very intrusive, it could even be backported to 1.1-maint.
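A rough sketch of the meta-compressor idea (the padding scheme, amounts and framing below are made up for illustration; see #5510 for the actual format and SPEC values):

```python
import os, zlib

def obfuscated_compress(data: bytes, max_pad_factor: float = 1.0) -> bytes:
    """Compress with an inner compressor, then append random padding so the
    stored size no longer reveals the compressed size."""
    compressed = zlib.compress(data)
    max_pad = max(1, int(len(compressed) * max_pad_factor))
    pad_len = int.from_bytes(os.urandom(4), "big") % max_pad
    return len(compressed).to_bytes(4, "big") + compressed + os.urandom(pad_len)

def obfuscated_decompress(blob: bytes) -> bytes:
    """Strip the padding and decompress the inner payload."""
    clen = int.from_bytes(blob[:4], "big")
    return zlib.decompress(blob[4:4 + clen])
```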

@ncleaton
Contributor

If an attacker could cause the same set of small files to be encoded repeatedly by messing with the repo between runs, then they could narrow down the actual compressed size by doing that a few times. A repeatable pseudorandom padding seeded with something the attacker doesn't know would resist that.

I suggest also processing each directory in a pseudorandom order when in obfuscate mode, if that isn't already done.

Even with this obfuscation, a large set of small files is going to have a fairly distinctive fingerprint. Making the max padding level exponential in a pseudorandom function might help to complicate any attack, for example a padding of up to 50% of file size for 20% of files, up to 100% of file size for 10% of files, up to 200% for 5% of files and so on.
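A minimal sketch of repeatable keyed padding with such an escalating distribution (pad_key, chunk_id, the HMAC construction and the exact percentages are illustrative assumptions, not borg's design):

```python
import hmac, hashlib

def deterministic_pad_len(chunk_id: bytes, pad_key: bytes, size: int) -> int:
    """Same chunk + same key --> same padding, so re-running a backup leaks
    nothing new, while an attacker without pad_key cannot predict the padding."""
    digest = hmac.new(pad_key, chunk_id, hashlib.sha256).digest()
    r = int.from_bytes(digest[:8], "big") / 2**64        # uniform in [0, 1)
    # escalate: ~20% of chunks padded up to 50% of size, ~10% up to 100%,
    # ~5% up to 200%, ... (capped at 8x here for sanity)
    max_fraction, threshold, fraction = 0.0, 0.20, 0.5
    while r < threshold and fraction <= 8.0:
        max_fraction = fraction
        threshold /= 2
        fraction *= 2
    r2 = int.from_bytes(digest[8:16], "big") / 2**64     # second uniform value
    return int(size * max_fraction * r2)
```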

@ThomasWaldmann
Member

ThomasWaldmann commented Nov 26, 2020

I think with #5510 this issue is fixed.

The (optional) obfuscation now has 2 modes:

  • one relative (multiplicative)
  • one absolute (additive)

If used, it can obfuscate to various degrees:

  • a little (space saving, but less secure against size fingerprinting)
  • ... intermediate levels in between (6..14 levels total) ...
  • a lot (using a lot of space, more secure against size fingerprinting)

If there is a need, more modes/levels could be added later via a different SPEC value.

So, concerning the issues found here, I think this PR is enough to solve them IF one is willing to spend quite some space overhead for it (so the original [compressed] chunk size basically disappears in random size noise).

It is clear that:

  • the more space-wasting modes/levels can't be used for already-huge original data, but they can be fine if you only have a little original data, want to be very safe against size fingerprinting, and can afford the overhead.
  • the space-saving modes/levels might still allow size fingerprinting attacks; it just gets harder because the sizes are fuzzier.

Directory order would need a change at a different place in the code (and soon likely even at 2 places, once borg can also be fed a list of filenames to back up instead of just doing its builtin recursion) and would make borg slower. And, as we found above, it is maybe not an issue anyway, because we use inode order (not alphabetical order). Even if it were an issue somehow, the attack would also work via the size leak, and that can be addressed just by adding enough size noise.

There is still #5070 improving the chunker, not sure how to proceed with that.

I guess it would be useful:

  • to have less predictable chunking of bigger files (== which get cut into multiple chunks)
  • to address this issue for bigger files without the obfuscation space overhead

But:

  • it doesn't help with small files (typically: file << 2 MiB --> only 1 chunk), so we need obfuscation anyway - and if we need obfuscation anyway, is the chunker improvement still required?
  • resulting compatibility issues when using a borg 1.2-made repo with borg 1.1 need to be addressed.

@ncleaton I appreciate all the work you've put into this; just thinking about what makes the most sense.

@ThomasWaldmann
Member

@ncleaton @nadalle @enkore any feedback?

@ThomasWaldmann ThomasWaldmann modified the milestones: hydrogen-rc1, hydrogen Apr 16, 2021
@ghost ghost mentioned this issue Aug 26, 2021
@ThomasWaldmann
Member

guess this is fixed, see above.
