I am working on a project that involves analyzing a very large amount of data, and I discovered MapReduce fairly recently. Before I dive any further into it, I would like to make sure my expectations are correct.
Interaction with the data will happen through a web interface, so response time is critical; I am thinking of a 10-15 second limit. Assuming the data is already loaded into a distributed file system before I perform any analysis on it, what kind of performance can I expect?
Let's say I need to filter a simple 5GB XML file that is well formed, has a fairly flat data structure, and contains 10,000,000 records, and that the output will be 100,000 records. Is 10 seconds possible?
If so, what kind of hardware am I looking at? If not, why not?
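To make the kind of job I have in mind concrete, here is roughly what I picture the filter looking like: a map-only Hadoop job that keeps matching records and drops everything else. This is only a sketch, assuming one record per line of input (multi-line XML records would need a custom InputFormat), and the matchesCriteria predicate is just a placeholder for my real condition.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RecordFilterJob {

        public static class FilterMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {

            @Override
            protected void map(LongWritable offset, Text record, Context context)
                    throws IOException, InterruptedException {
                // Emit only records that match the filter; everything else is dropped.
                if (matchesCriteria(record.toString())) {
                    context.write(NullWritable.get(), record);
                }
            }

            // Placeholder predicate -- the real filtering criteria would go here.
            private boolean matchesCriteria(String xmlRecord) {
                return xmlRecord.contains("status=\"active\"");
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "record filter");
            job.setJarByClass(RecordFilterJob.class);
            job.setMapperClass(FilterMapper.class);
            job.setNumReduceTasks(0);  // map-only: no shuffle or reduce phase
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

So the question is really about the end-to-end latency of launching and running a job like this, not about whether it can be written.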
I put the example down, but now I wish I hadn't. 5GB was just a sample; in reality I would be dealing with a lot more data. 5GB might be the data for one hour of the day, and I might want to identify all the records that meet certain criteria.
A database is really not an option for me. What I wanted to find out is the fastest performance I can expect out of MapReduce. Is it always measured in minutes or hours? Is it never seconds?