Reverse Engineering File Formats using AI Techniques

This is to extend the question: Tools to help reverse engineer binary file formats

Are there any publicly available tools that use clustering and/or data mining techniques to reverse engineer file formats?

For example, you would give the tool a collection of files that all share the same format, and its output would be the generic structure of that format.


If you have a truly efficient binary encoding (the compressed data in a ZIP file is an example), then the information content of each bit is high. Essentially, the data will look like a perfectly random bit stream.

You can't infer anything from that without additional knowledge.
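To make the "looks random" point concrete, here is a minimal sketch that measures the Shannon entropy of a byte string before and after compression. The sample data and the name byte_entropy are made up for illustration, and zlib stands in for whatever compressor a real format might use.

```python
import math
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical, very redundant "record" data: few distinct symbols, obvious structure.
text = b"".join(f"id={i};value={i * i % 977};".encode() for i in range(20000))

# The same information after DEFLATE compression (the method most ZIP files use).
packed = zlib.compress(text, 9)

print(f"plain:      {byte_entropy(text):.2f} bits/byte, {len(text)} bytes")
print(f"compressed: {byte_entropy(packed):.2f} bits/byte, {len(packed)} bytes")
```

An entropy close to 8 bits per byte means the byte histogram alone gives you almost nothing to work with, which is the situation described above.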

If the binary encoding isn't efficient, in theory, you have some faint chance of seeing structure. But this still sounds really hard; how do you even begin guessing where the boundaries of fields are?
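One naive way to start guessing at boundaries, assuming the files are uncompressed and their fields sit at fixed offsets, is to compare the same offset across many files and see which positions never change. The sketch below does that on a made-up toy format; the function names and the format itself are hypothetical, and this is not a description of any existing tool.

```python
import math
import struct
from collections import Counter

def offset_entropy(samples):
    """Entropy (in bits) of the byte found at each offset, taken across all samples."""
    length = min(len(s) for s in samples)
    n = len(samples)
    result = []
    for i in range(length):
        counts = Counter(s[i] for s in samples)
        result.append(-sum((c / n) * math.log2(c / n) for c in counts.values()))
    return result

def constant_runs(entropies):
    """Half-open (start, end) offset ranges where the byte never varies."""
    runs, start = [], None
    for i, h in enumerate(entropies):
        if h == 0 and start is None:
            start = i                      # a constant run begins here
        elif h != 0 and start is not None:
            runs.append((start, i))        # the run ended at the previous offset
            start = None
    if start is not None:
        runs.append((start, len(entropies)))
    return runs

# Toy format (an assumption): 4-byte magic, 2-byte little-endian id, 2 padding bytes.
samples = [b"FMT1" + struct.pack("<H", i * 37 + 5) + b"\x00\x00" for i in range(200)]

runs = constant_runs(offset_entropy(samples))
print(runs)  # constant regions are candidates for magic numbers or padding
```

Constant runs hint at magic numbers and padding, and the varying stretches between them are candidate fields. Variable-length records or any compression break the fixed-offset assumption, so this only helps in the easy case.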

The AI and machine learning types will tell you that you can't learn anything unless you already "almost" know it. Often they succeed by encoding the problem in terms of problem-specific tokens that you can at least reason about.

I don't think you can do this without providing more information. Do you know anything about the file formats? Are field sizes always less than N bits? Are only ASCII strings encoded, or vice versa?
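As an illustration of how that kind of prior knowledge turns into something you can check, here is a small sketch that assumes the format embeds printable ASCII strings and pulls out every run of at least four printable bytes, much like the Unix strings utility. The sample bytes and the minimum length are arbitrary choices made for the example.

```python
import re

def ascii_runs(data: bytes, min_len: int = 4):
    """Return (offset, text) for each run of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]

# Hypothetical file content: binary bytes with two embedded strings.
blob = b"\x00\x01HEAD\x00\x02\x03hello world\xff\xfeab\x00"

for offset, text in ascii_runs(blob):
    print(offset, text)
```

If a guess like this yields plausible strings at consistent offsets across many files, it gives you fixed points from which to work out the surrounding fields.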



