This is somewhat tangential to the precise question you asked, but it may nevertheless be interesting. At a high level: while the compression approaches you applied may not be able to compress encrypted data, there are schemes that can in fact compress encrypted data significantly, provided the secret key is available at decompression time (but not during compression, of course).
To my knowledge, the first work proposing compression of encrypted data is by Johnson et al. [1]. The main insight is that while the compression algorithm sees only the encrypted stream, if we assume the decoder has access to the secret key, then it can combine decoding and decryption into one step, enabling compression gains via a reduction to the Wyner-Ziv source coding problem. I'll give a brief account of the results, restricting to the lossless setting for simplicity.
Structure: Suppose the secret key $K^n$ is drawn uniformly from a secret-key codebook $\mathcal{K}_n \subset \{0,1\}^n$ of size $2^{nR'}$, and the data $X^n \in \{0,1\}^n$ is drawn iid from some law $p$ and encrypted to give $Y^n = X^n \oplus K^n$, where the XOR is applied bitwise. An encoder then maps $Y^n$ to a length-$nR$ bitstring $B$, without knowledge of $K^n$ but knowing the secret-key codebook $\mathcal{K}_n$, and a decoder uses $(B, K^n)$ to produce $\hat{X}^n$ (implicitly, we're assuming the decoder receives the secret key over some secure channel). Say that the rate pair $(R', R)$ is achievable with Wyner-sense perfect secrecy if we can find an encoder and decoder ensuring $P(X^n \neq \hat{X}^n) \to 0$ as $n \to \infty$, while an eavesdropper who sees only $B$ learns essentially nothing, i.e. the equivocation satisfies $\tfrac{1}{n} H(X^n \mid B) \to H(X)$.
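To see concretely why your off-the-shelf compressor fails on the ciphertext, here is a small sketch (names and parameters are my own, not from [1]): a highly compressible sparse plaintext is XORed with a uniform one-time-pad key, and `zlib`, which sees only the ciphertext, can no longer find any redundancy.

```python
import os
import zlib

n = 100_000
# Sparse plaintext: mostly 0x00 bytes, so it is highly compressible.
plaintext = bytes(0x01 if b < 3 else 0x00 for b in os.urandom(n))
key = os.urandom(n)                                 # uniform one-time-pad key
ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))

print(len(zlib.compress(plaintext)))   # far below n: plaintext compresses well
print(len(zlib.compress(ciphertext)))  # about n: ciphertext looks uniform
```

The ciphertext is marginally uniform regardless of $p$, so any keyless compressor that only sees $Y^n$ is stuck; the structure of $X^n$ is only recoverable jointly with $K^n$.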
The insight is that, as far as the encoder is concerned, the secret key $K^n$ is just side information available at the decoder. Using this, the paper argues that if you construct $\mathcal{K}_n$ by drawing $2^{nR'}$ iid sequences according to the law $p$ of $X$ itself, and use Slepian-Wolf coding of $Y^n$ with $K^n$ as decoder side information, then the rate pair $(R', R) = (H(X), H(X))$ is achievable with Wyner-sense perfect secrecy. In other words, you can compress down to the optimal rate and retain $H(X)$ bits of secrecy per symbol (which is the best possible, since the source itself has only $H(X)$ bits of entropy per symbol).
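A toy sketch of the syndrome-based Slepian-Wolf idea may help (this is my own illustration, not the construction in [1], which uses capacity-approaching codes and a general iid source). Assume the sparse source puts at most one 1 in each 7-bit block. The encoder, seeing only the ciphertext $Y^n$, sends the 3-bit Hamming(7,4) syndrome of each block, a 7-to-3 compression. The decoder, holding the key, notes that $H(y \oplus k) = Hy \oplus Hk$ is the syndrome of $x = y \oplus k$, and since $x$ has weight at most 1, that syndrome pins $x$ down exactly.

```python
import numpy as np

# Parity-check matrix of the Hamming(7,4) code: its 7 columns are exactly the
# nonzero 3-bit vectors, so any weight-<=1 pattern is identified by its syndrome.
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

def encode(y_block):
    """Encoder sees only ciphertext: emit the 3-bit syndrome of a 7-bit block."""
    return (H @ y_block) % 2

def decode(s, k_block):
    """Decoder knows the key: s XOR H@k is the syndrome of x = y XOR k."""
    t = (s + H @ k_block) % 2
    x_hat = np.zeros(7, dtype=np.uint8)
    if t.any():
        # The syndrome t equals the column of H at the position of the single 1.
        pos = int(np.flatnonzero((H == t[:, None]).all(axis=0))[0])
        x_hat[pos] = 1
    return x_hat

rng = np.random.default_rng(0)
x = np.zeros(7, dtype=np.uint8)
x[rng.integers(7)] = 1                             # sparse source block
k = rng.integers(0, 2, size=7, dtype=np.uint8)     # one-time-pad key block
y = x ^ k                                          # ciphertext block

s = encode(y)                  # 3 bits on the wire instead of 7
assert (decode(s, k) == x).all()
```

The encoder never touches the key; it exploits only the fact that the decoder's side information $K^n$ is close (in Hamming distance) to $Y^n$ whenever $X^n$ is sparse, which is exactly the Slepian-Wolf setup.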
[1]. Johnson, M., Ishwar, P., Prabhakaran, V., Schonberg, D., & Ramchandran, K. (2004). On compressing encrypted data. IEEE Transactions on Signal Processing, 52(10), 2992-3006.
Note that the paper is 20 years old and well cited, so this theory has likely evolved since. I haven't read much on the topic beyond this, but you can explore it to the extent of your interest by chasing down papers that cite it.