Question

My situation involves a directory containing MP3 files, and a database that should contain all the MP3 metadata (i.e. genres, artist names, album names, and track names) from the MP3 files in that directory. The database should always reflect what is in the directory, that is... the algorithm I m searching for should never delete items from the directory, only from database. The directory is the original, the database should be the exact copy of the directory after the program is run, and the directory should remain untouched by the program.

From my understanding, I believe I would have to loop over the contents of the directory inserting records in the database when I come across MP3 metadata that is found in the directory but isn t found in the database. Then, loop over the contents of the database and delete from the database what isn t found in the directory.

Is there a general algorithm for synchronizing two things such as a directory and a database? Is there a better way to do this than what I ve outlined?

I m trying to complete this task using Ruby Sequel and MySQL, in case that matters.

Just to clarify, assume all MP3 tag data is correct. I run all files through an automatic MP3 tagger and make sure everything is accurately tagged before moving it to this directory.

My preferred solution would be by running a Ruby script manually, rather than with Cron for example. My reasons being because it is the language I am already using to implement most of this project, and I must manually tag all MP3 files and verify they are correct, so I don t want any scripts to be ran until I can verify that all files in my MP3 directory are good to go.

Answer 1

In terms of general strategies,

webguydan s strategy has a script looping file and database records and comparing them, backward and forwards. You can do better by leveraging your database more.

Note what will and won t be slow here: Database inserts are the fastest database operation. Your slowest operation will probably be reading file system MP3 metadata. Single calls to the file system or to the database are much more expensive than set operations (get a list of files). In particular, the deletion operation that checks hundreds of rows for existence of a file will be quite slow.

Strategy 2. Creating a staging table in the database to perform comparisons. Your script reads all the mp3s and inserts them into a staging table, which is an empty copy of mp3 table. You then join to the real table to identify moves, deletes, additions.

Strategy 3: Simply replace the entire database table. I suspect that the slow part of this operation will be reading mp3 metadata, not any database operation. Since your process reads all the mp3 metadata anyway to identify changes, just replace the entire sql database at each refresh.

I d start with this approach and optimize if needed. Replacing won t work if you have additional information that you were storing in the mp3 sql database that you couldn t replace from the file system.

Strategy 4: Depending on your OS version, extracting mp3 metadata may be slow (or maybe not?) Skip reading slow parts of the file system (metadata?) by reading path,name,update date into your staging table. Simple sql queries can identify files that need to be updated (read mp3 data), deleted from table, or inserted.

Other approaches: Note that many operating systems and products already allow database style queries to the data you are talking about. MS Indexing service, for example.

Answer 2

You could probably handle this with 2 cron scheduled scripts.

The first script could scan the directory and look for files with a mod date greater than the last time the scan was performed (or based on newest record in DB). When a new file is found, the script could perform an insert.

You can read the MP3 s ID3 tags to get most of the metadata you are looking for, although this assumes that all of your files contain this data. You need to consider fallback mechanisms for what happens when data is missing, such as using the filename instead.

The deletion script could work in reverse. You will need to keep a field in the database that holds the relative path to the file in the directory. You can then loop through the database records and check that each file exists. When one isnt found, you can add the ID to a list. After the scan, you can perform the delete statement on the IDs that were captured.

友情链接