English 中文(简体)
What is the probability that the first 4 bytes of MD5 hash computed from file contents will collide?
原标题:

This is a combinatorics question with some theory in hashing algorithms required.

Let s say the input can be any random sequence of bytes 30 kB to 5 MB of size (I guess that makes quite a few combinations of input values :))

What is the probability that the first 4 bytes (or first n bytes) of a MD5 hash computed from the byte sequence will be the same for distinct files?

In case this can not be computed specifically for MD5 hash, then what is the probability that any hash function that generates uniformly distributed m-byte hashes will compute hash with collision on first n bytes for given range of inputs?

最佳回答

In absence of more information about the byte values probability, I would day it is 1 in 2^32.

EDIT. Indeed, 1 in 2^16 if you are taking the hexadecimal characters instead of the pure bytes.

EDIT based on comment:

Can MD5 be considered that uniform that a the computed values are absolutely random?

MD5 hash algorithm is designed so that a small change in the input results in a completely different hash, so I would say that MD5 hash bytes are distributed with equal probability (I would not bet anything on it anyway). Anyway you can apply a post-processing to your hash (for example you can use keyed MD5) to increase its randomness (and to make it more secure, by the way, since plain MD5 hashes have been proved to be insecure).

问题回答

For an ideal hash function, the outputs are evenly distributed, so the chances of two colliding are one in 2^32. The birthday paradox, however, tells us that if we re comparing all pairs of hashes, we should expect to see a collision once we have 2^16 hashes, on average - so don t rely on only 4 bytes on the basis that "I have a lot less than 4 billion values".

MD5 isn t an ideal hash function, as we know, but the weaknesses are somewhat incidental here: finding a collision on 4 bytes is well within the realm of a reasonable brute-force attack, so there s no need to resort to cryptographic weaknesses. If you re only concerned about randomly picked data, you re not going to see a significant statistical deviation from randomness.

If you are interested in the odds of two particular inputs having the same 4-byte hash, then it s just 1/2^32. If you re interested in the odds of two inputs out of a set of X total inputs having the same odds, this stays fairly low until you start approaching 2^16 = 65536 distinct inputs in your set, where it reaches near 50% (this phenomenon is known as the Birthday Paradox).

In general, one of the criteria for a hash function to be cryptographically useful is the uniformity across all of the bits.

The odds of a collision in a n-bit hash are around 1 in 2^(n/2) due to the birthday paradox - so about 1 in 2^16 in this case. If for some reason you were referring to using 32 bits of the hex encoding, of course that would only be the first 16 actual bits, so then the odds of a collision would be about 1 in 2^8.

Given a specific fixed file, the odds that any other file chosen at random will have the same hash as that file is about 2^n. In terms of cryptographic hashes the difference between these is the first is a collision, the other is a preimage.

At this hash size, the weaknesses in MD5 are pretty irrelevant since the best known attacks on MD5 take roughly 2^32 computations, while one can generate a collision in even an ideally secure 32-bit hash in around 2^16 computations (since by merely choosing random inputs, you have 1 in 2^16 chance of a collision, so after roughly 2^16 random guesses you ll probably have found a colliding pair of inputs).

MD5 hashes are typically hexadecimals, so there are 16 possible values for each byte. Therefore for four bytes there are 16*16*16*16 = 65536 possible combinations, making the probability of a hash collision 1:65536.

md5 is hexadecimal, so each character can be any one of 16 alleles. So that would make 16^n

For 4 characters, that makes 65536 different possible combinations.





相关问题
AES 256 in CTR mode [closed]

ctr mode makes it possible to use a block cipher as a stream cipher but how strong will be the encryption in this mode ?

Illegalkeysize exception

I am using the Bouncy Castle Java cryptographic algorithm implementation. I am getting an IllegalKeySizeException. To overcome this I have even changed my java security jars (local_policy.jar and ...

Can two different strings generate the same MD5 hash code?

For each of our binary assets we generate a MD5 hash. This is used to check whether a certain binary asset is already in our application. But is it possible that two different binary assets generate ...

Load RSA keys from files

I used openSSL command to create 2 files: 1 for RSA public key & 1 for RSA private key. How do I recover RSA keys using C? Specifically, I have these functions: RSA_public_encrypt(read_num, ...

RSA cryptosystem

Hi i am trying to set up an RSA cryptosystem i have all the values except d selected prime numbers: p=1889, q=2003, n=3783667, phi=3779776, e= 61 i got stuck finding d could anyone help me to figure ...

热门标签