Data is often stored in program-specific binary files for which there is little or no documentation. A typical example in our field is data that comes from an instrument, but I suspect the problem is general. What methods are there for trying to understand and interpret the data?
To set some boundaries: the files are not encrypted and there is no DRM. The type and format of the file are specific to the writer of the program (i.e. it is not a "standard file" - such as *.tar - whose identity has been lost). There is (probably) no deliberate obfuscation, but there may be some amateur efforts to save space. We can assume that we have a general knowledge of what the data is, and we may recognize some, but probably not all, of the fields and arrays.
Assume that the majority of the data is numeric, consisting of scalars and arrays (probably 1- and 2-dimensional, and sometimes irregular or triangular). There will also be some character strings, probably names of people, sites, dates and maybe some keywords. There will be code in the program that reads the binary file, but we do not have access to the source or the assembler. As an example, the file may have been written by a VAX Fortran program, by some early Unix program, or by Windows as OLE objects. The numbers may be big- or little-endian (which is not known at the start), but the byte order is probably consistent. We may have different versions on different machines (e.g. Cray).
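To make the numeric and string assumptions concrete, here is a minimal Python sketch of the kind of first pass I have in mind. It pulls out runs of printable ASCII and guesses the byte order by checking which interpretation yields more plausible-looking 32-bit floats. The function names, the "plausible" range (1e-6 to 1e9) and the assumption of IEEE 754 floats aligned at offset 0 are all my own illustrative choices; VAX floating-point formats, for instance, would need a different decoder.

```python
import math
import re
import struct


def ascii_runs(data: bytes, min_len: int = 4):
    """Return (offset, text) for each run of printable ASCII of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]


def plausible_fraction(data: bytes, fmt: str) -> float:
    """Fraction of values decoded with struct format `fmt` that look like sane measurements.

    "Sane" is an arbitrary assumption here: finite, and either exactly zero
    or with a magnitude between 1e-6 and 1e9.
    """
    size = struct.calcsize(fmt)
    usable = len(data) - len(data) % size  # ignore trailing bytes that do not fill a value
    if usable == 0:
        return 0.0
    values = [v[0] for v in struct.iter_unpack(fmt, data[:usable])]
    good = sum(
        1 for v in values
        if math.isfinite(v) and (v == 0.0 or 1e-6 < abs(v) < 1e9)
    )
    return good / len(values)


def guess_byte_order(data: bytes) -> str:
    """Guess "little" or "big" by comparing plausibility of 32-bit IEEE floats."""
    little = plausible_fraction(data, "<f")
    big = plausible_fraction(data, ">f")
    return "little" if little >= big else "big"
```

In practice one would probably run this over sliding windows and several offsets, since the numeric arrays need not start on a 4-byte boundary.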
We can assume we have a reasonably large corpus of files - some hundreds, say.
We can assume two scenarios:
- We can rerun the program with different inputs so we can do experiments.
- We cannot rerun the program - we have a fixed set of documents. This has a gentle similarity to decoding historical documents in an unknown language (e.g. Linear B).
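In either scenario, the obvious first experiment is to compare files byte by byte: offsets that never change across the corpus are likely magic numbers, padding or fixed header fields, while offsets that vary are candidates for data. The sketch below is only an illustration under the assumption that the files share a fixed-layout prefix; the function names are hypothetical and it takes the file contents as already-loaded byte strings.

```python
def distinct_counts(blobs, length=None):
    """For each offset in the common prefix, count the distinct byte values seen across files."""
    if not blobs:
        return []
    n = min(len(b) for b in blobs)
    if length is not None:
        n = min(n, length)
    return [len({b[i] for b in blobs}) for i in range(n)]


def constant_ranges(counts):
    """Group consecutive offsets with exactly one distinct value into (start, end_exclusive) ranges."""
    ranges = []
    start = None
    for i, c in enumerate(counts):
        if c == 1 and start is None:
            start = i
        elif c != 1 and start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, len(counts)))
    return ranges


# Toy corpus: a 4-byte tag, a counter byte, a zero byte, then varying data.
blobs = [b"MAGC\x01\x00AB", b"MAGC\x02\x00CD", b"MAGC\x03\x00EF"]
counts = distinct_counts(blobs)
print(constant_ranges(counts))
```

When the program can be rerun, the same comparison applied to two outputs that differ in a single input should point at the bytes that encode that input.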
A partial solution may be acceptable - i.e. there may be some fields that no living person now understands, but most of the others are interpretable.
I am only interested in Open Source approaches.
UPDATE There is a related SO question (How to reverse engineer binary file formats for compatibility purposes) but the emphasis is somewhat different.
UPDATE Clever suggestion from @brianegge to address the first scenario (where we can rerun the program): use truss (or possibly strace on Linux) to dump all write() and similar calls made by the program. This should at least allow the collection of the records written to disk.