Question

This is a combinatorics question with some theory in hashing algorithms required.

Let s say the input can be any random sequence of bytes 30 kB to 5 MB of size (I guess that makes quite a few combinations of input values :))

What is the probability that the first 4 bytes (or first n bytes) of a MD5 hash computed from the byte sequence will be the same for distinct files?

In case this can not be computed specifically for MD5 hash, then what is the probability that any hash function that generates uniformly distributed m-byte hashes will compute hash with collision on first n bytes for given range of inputs?

Answer 1

In absence of more information about the byte values probability, I would day it is 1 in 2^32.

EDIT. Indeed, 1 in 2^16 if you are taking the hexadecimal characters instead of the pure bytes.

EDIT based on comment:

Can MD5 be considered that uniform that a the computed values are absolutely random?

MD5 hash algorithm is designed so that a small change in the input results in a completely different hash, so I would say that MD5 hash bytes are distributed with equal probability (I would not bet anything on it anyway). Anyway you can apply a post-processing to your hash (for example you can use keyed MD5) to increase its randomness (and to make it more secure, by the way, since plain MD5 hashes have been proved to be insecure).

Answer 2

For an ideal hash function, the outputs are evenly distributed, so the chances of two colliding are one in 2^32. The birthday paradox, however, tells us that if we re comparing all pairs of hashes, we should expect to see a collision once we have 2^16 hashes, on average - so don t rely on only 4 bytes on the basis that "I have a lot less than 4 billion values".

MD5 isn t an ideal hash function, as we know, but the weaknesses are somewhat incidental here: finding a collision on 4 bytes is well within the realm of a reasonable brute-force attack, so there s no need to resort to cryptographic weaknesses. If you re only concerned about randomly picked data, you re not going to see a significant statistical deviation from randomness.

Answer 3

If you are interested in the odds of two particular inputs having the same 4-byte hash, then it s just 1/2^32. If you re interested in the odds of two inputs out of a set of X total inputs having the same odds, this stays fairly low until you start approaching 2^16 = 65536 distinct inputs in your set, where it reaches near 50% (this phenomenon is known as the Birthday Paradox).

In general, one of the criteria for a hash function to be cryptographically useful is the uniformity across all of the bits.

Answer 4

The odds of a collision in a n-bit hash are around 1 in 2^(n/2) due to the birthday paradox - so about 1 in 2^16 in this case. If for some reason you were referring to using 32 bits of the hex encoding, of course that would only be the first 16 actual bits, so then the odds of a collision would be about 1 in 2^8.

Given a specific fixed file, the odds that any other file chosen at random will have the same hash as that file is about 2^n. In terms of cryptographic hashes the difference between these is the first is a collision, the other is a preimage.

At this hash size, the weaknesses in MD5 are pretty irrelevant since the best known attacks on MD5 take roughly 2^32 computations, while one can generate a collision in even an ideally secure 32-bit hash in around 2^16 computations (since by merely choosing random inputs, you have 1 in 2^16 chance of a collision, so after roughly 2^16 random guesses you ll probably have found a colliding pair of inputs).

Answer 5

MD5 hashes are typically hexadecimals, so there are 16 possible values for each byte. Therefore for four bytes there are 16*16*16*16 = 65536 possible combinations, making the probability of a hash collision 1:65536.

Answer 6

md5 is hexadecimal, so each character can be any one of 16 alleles. So that would make 16^n

For 4 characters, that makes 65536 different possible combinations.

友情链接