English 中文(简体)
What is a practical way to sync data in a directory to a database?
原标题:

My situation involves a directory containing MP3 files, and a database that should contain all the MP3 metadata (i.e. genres, artist names, album names, and track names) from the MP3 files in that directory. The database should always reflect what is in the directory, that is... the algorithm I m searching for should never delete items from the directory, only from database. The directory is the original, the database should be the exact copy of the directory after the program is run, and the directory should remain untouched by the program.

From my understanding, I believe I would have to loop over the contents of the directory inserting records in the database when I come across MP3 metadata that is found in the directory but isn t found in the database. Then, loop over the contents of the database and delete from the database what isn t found in the directory.

Is there a general algorithm for synchronizing two things such as a directory and a database? Is there a better way to do this than what I ve outlined?

I m trying to complete this task using Ruby Sequel and MySQL, in case that matters.

Just to clarify, assume all MP3 tag data is correct. I run all files through an automatic MP3 tagger and make sure everything is accurately tagged before moving it to this directory.

My preferred solution would be by running a Ruby script manually, rather than with Cron for example. My reasons being because it is the language I am already using to implement most of this project, and I must manually tag all MP3 files and verify they are correct, so I don t want any scripts to be ran until I can verify that all files in my MP3 directory are good to go.

最佳回答

In terms of general strategies,

webguydan s strategy has a script looping file and database records and comparing them, backward and forwards. You can do better by leveraging your database more.

Note what will and won t be slow here: Database inserts are the fastest database operation. Your slowest operation will probably be reading file system MP3 metadata. Single calls to the file system or to the database are much more expensive than set operations (get a list of files). In particular, the deletion operation that checks hundreds of rows for existence of a file will be quite slow.

Strategy 2. Creating a staging table in the database to perform comparisons. Your script reads all the mp3s and inserts them into a staging table, which is an empty copy of mp3 table. You then join to the real table to identify moves, deletes, additions.

Strategy 3: Simply replace the entire database table. I suspect that the slow part of this operation will be reading mp3 metadata, not any database operation. Since your process reads all the mp3 metadata anyway to identify changes, just replace the entire sql database at each refresh.

I d start with this approach and optimize if needed. Replacing won t work if you have additional information that you were storing in the mp3 sql database that you couldn t replace from the file system.

Strategy 4: Depending on your OS version, extracting mp3 metadata may be slow (or maybe not?) Skip reading slow parts of the file system (metadata?) by reading path,name,update date into your staging table. Simple sql queries can identify files that need to be updated (read mp3 data), deleted from table, or inserted.

Other approaches: Note that many operating systems and products already allow database style queries to the data you are talking about. MS Indexing service, for example.

问题回答

You could probably handle this with 2 cron scheduled scripts.

The first script could scan the directory and look for files with a mod date greater than the last time the scan was performed (or based on newest record in DB). When a new file is found, the script could perform an insert.

You can read the MP3 s ID3 tags to get most of the metadata you are looking for, although this assumes that all of your files contain this data. You need to consider fallback mechanisms for what happens when data is missing, such as using the filename instead.

The deletion script could work in reverse. You will need to keep a field in the database that holds the relative path to the file in the directory. You can then loop through the database records and check that each file exists. When one isnt found, you can add the ID to a list. After the scan, you can perform the delete statement on the IDs that were captured.





相关问题
How to add/merge several Big O s into one

If I have an algorithm which is comprised of (let s say) three sub-algorithms, all with different O() characteristics, e.g.: algorithm A: O(n) algorithm B: O(log(n)) algorithm C: O(n log(n)) How do ...

Grokking Timsort

There s a (relatively) new sort on the block called Timsort. It s been used as Python s list.sort, and is now going to be the new Array.sort in Java 7. There s some documentation and a tiny Wikipedia ...

Manually implementing high performance algorithms in .NET

As a learning experience I recently tried implementing Quicksort with 3 way partitioning in C#. Apart from needing to add an extra range check on the left/right variables before the recursive call, ...

Print possible strings created from a Number

Given a 10 digit Telephone Number, we have to print all possible strings created from that. The mapping of the numbers is the one as exactly on a phone s keypad. i.e. for 1,0-> No Letter for 2->...

Enumerating All Minimal Directed Cycles Of A Directed Graph

I have a directed graph and my problem is to enumerate all the minimal (cycles that cannot be constructed as the union of other cycles) directed cycles of this graph. This is different from what the ...

Quick padding of a string in Delphi

I was trying to speed up a certain routine in an application, and my profiler, AQTime, identified one method in particular as a bottleneck. The method has been with us for years, and is part of a "...

热门标签