Update: I have now written a PHP extension called php_ssdeep for the ssdeep C API to facilitate fuzzy hashing and hash comparisons in PHP natively. More information can be found over at my blog. I hope this is helpful to people.
I am involved in writing a custom document management application in PHP on a Linux box that will store various file formats (potentially thousands of files), and we need to be able to check whether a text document has been uploaded before, to prevent duplication in the database.
Essentially when a user uploads a new file we would like to be able to present them with a list of files that are either duplicates or contain similar content. This would then allow them to choose one of the pre-existing documents or continue uploading their own.
Similar documents would be determined by looking through their content for similar sentences and perhaps a dynamically generated list of keywords. We can then display a percentage match to the user to help them find the duplicates.
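To make the "percentage match" idea concrete, here is a minimal sketch that scores two documents by the share of sentences they have in common (Jaccard overlap of normalized sentence sets). The function names `sentence_set` and `sentence_similarity_percent` are hypothetical, and the sketch assumes plain ASCII/English text that has already been extracted from the uploaded file. It only catches sentences that match exactly after normalization, so it is a starting point rather than a full similarity measure.

```php
<?php
// Hypothetical helper: turn a text into a set of normalized sentences.
// Assumes ASCII/English text; other alphabets would need mbstring handling.
function sentence_set(string $text): array
{
    // Split after sentence-ending punctuation followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $set = [];
    foreach ($sentences as $sentence) {
        $normalized = strtolower($sentence);
        // Drop everything except letters, digits and whitespace.
        $normalized = preg_replace('/[^a-z0-9\s]+/', '', $normalized);
        // Collapse runs of whitespace to a single space.
        $normalized = trim(preg_replace('/\s+/', ' ', $normalized));
        if ($normalized !== '') {
            $set[$normalized] = true; // array keys act as a set
        }
    }
    return $set;
}

// Hypothetical helper: percentage of sentences shared by both texts
// (size of intersection divided by size of union, times 100).
function sentence_similarity_percent(string $a, string $b): float
{
    $setA = sentence_set($a);
    $setB = sentence_set($b);
    $union = count($setA + $setB);
    if ($union === 0) {
        return 0.0; // neither text contains a usable sentence
    }
    $shared = count(array_intersect_key($setA, $setB));
    return 100.0 * $shared / $union;
}
```

Comparing a new upload against every stored document this way grows linearly with the number of documents, which may be one reason to move the check into a background job for large collections.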
Can you recommend any packages for this process and any ideas of how you might have done this in the past?
I think direct duplicates can be detected by extracting all the text content and
- Stripping whitespace
- Removing punctuation
- Converting to lower or upper case
then forming an MD5 hash to compare with any new documents. Stripping those items out should help prevent duplicates being missed if the user edits a document to add extra paragraph breaks, for example. Any thoughts?
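The normalize-then-hash steps above could be sketched roughly as follows. The function name `normalized_md5` is hypothetical, and the sketch assumes the text has already been extracted from the file and is plain ASCII; non-ASCII letters would be stripped by this pattern and would need different handling.

```php
<?php
// Hypothetical helper: hash a document's text so that differences in
// whitespace, punctuation and letter case do not change the result.
// Assumes ASCII text that has already been extracted from the file.
function normalized_md5(string $text): string
{
    // Convert to lower case so "Hello" and "hello" hash identically.
    $text = strtolower($text);
    // Keep only letters and digits; this removes whitespace and
    // punctuation in a single pass.
    $text = preg_replace('/[^a-z0-9]+/', '', $text);
    return md5($text);
}

$original = "Hello, World!\n\nNew paragraph.";
$edited   = "hello world new   paragraph";

// Both texts normalize to "helloworldnewparagraph", so the hashes match.
var_dump(normalized_md5($original) === normalized_md5($edited));
```

The resulting hash could be stored alongside each document so that a new upload only needs a single lookup. Note that this only finds exact duplicates after normalization; any change to the words themselves produces a completely different MD5.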
This process could also run as a nightly job, and we could notify the user of any duplicates when they next log in, if the computational requirement is too great to run in real time. Real time would be preferred, however.