Question

Apart from Amazon Elastic MapReduce, what other options do I have for processing a large amount of data?

Answer 1

Microsoft also has Hadoop/MapReduce running on Windows Azure, but it is in a limited CTP (Community Technology Preview). You can provide your information and request CTP access at the link below: https://www.hadooponazure.com/ The Developer Preview of the Apache Hadoop-based Services for Windows Azure is available by invitation.

Besides that, you can also try Google BigQuery, for which you first have to move your data into Google's proprietary storage and then run BigQuery on it. Note that BigQuery is based on Dremel, which is similar to MapReduce but faster for queries because of its column-based processing.
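To illustrate why column-based processing can be faster for this kind of query, here is a toy sketch. It is not how Dremel or BigQuery is actually implemented; the table, field names, and the "values read" counter are made up for the example. It only shows that an aggregate over one field touches fewer values when the data is laid out by column.

```python
# Toy table (hypothetical data), stored row by row.
rows = [
    {"user": "a", "country": "US", "clicks": 3},
    {"user": "b", "country": "DE", "clicks": 5},
    {"user": "c", "country": "US", "clicks": 7},
]

# The same table stored column by column: one list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_clicks_row_store(table):
    """Sum 'clicks' from a row layout; every field of every row is read."""
    values_read = 0
    total = 0
    for row in table:
        values_read += len(row)  # the whole row is loaded to reach one field
        total += row["clicks"]
    return total, values_read


def sum_clicks_column_store(cols):
    """Sum 'clicks' from a column layout; only that one column is read."""
    clicks = cols["clicks"]
    return sum(clicks), len(clicks)


if __name__ == "__main__":
    print(sum_clicks_row_store(rows))
    print(sum_clicks_column_store(columns))
```

Both functions return the same total, but the column layout reads one value per row instead of one value per field per row. On wide tables that difference is where most of the saving comes from.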

Another option is Mortar Data, which uses Python and Pig intelligently so that you can write jobs easily and visualize the results. I found it very interesting; please have a look: http://mortardata.com/#!/how_it_works

Answer 2

DataStax Brisk is good.

Full-on distributions

Apache Hadoop
Cloudera’s Distribution including Apache Hadoop (that’s the official name)
IBM Distribution of Apache Hadoop
DataStax Brisk
Amazon Elastic MapReduce

HDFS alternatives

MapR
Appistry CloudIQ Storage Hadoop Edition
IBM General Parallel File System (GPFS)
CloudStore

Hadoop MapReduce alternatives

Pervasive DataRush
Cascading
Hive (an Apache subproject, included in Cloudera’s distribution)
Pig (a Yahoo-developed language, included in Cloudera’s distribution)

Reference: http://gigaom.com/cloud/as-big-data-takes-off-the-hadoop-wars-begin/

Answer 3

If you want to process a large amount of data in real time (a Twitter feed, a website click stream, etc.) using a cluster of machines, then check out "Storm", which was recently open-sourced by Twitter.

Standard Apache Hadoop is good for batch processing of petabytes of data where latency is not a problem.

Brisk from DataStax, as mentioned above, is quite unique in that you can run MapReduce parallel processing on live data.

There are other efforts, like Hadoop Online, which allow processing in a pipeline.

Google BigQuery is obviously another option, where you have CSV (delimited records) and can slice and dice without any setup. It's extremely simple to use, but it is a premium service where you pay by the number of bytes processed (the first 100 GB per month is free, though).
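As a rough sketch of how pay-per-bytes-processed billing with a free monthly allowance works out, the arithmetic could look like this. The 100 GB free tier comes from the description above; the price per GB is a hypothetical placeholder and not an actual BigQuery rate, and decimal gigabytes are an assumption.

```python
GB = 10**9  # assumption: decimal gigabytes


def billable_bytes(processed_bytes, free_bytes=100 * GB):
    """Bytes that are charged after subtracting the free monthly allowance."""
    return max(0, processed_bytes - free_bytes)


def monthly_cost(processed_bytes, price_per_gb, free_bytes=100 * GB):
    """Estimated monthly charge; price_per_gb is a placeholder you must supply."""
    return billable_bytes(processed_bytes, free_bytes) / GB * price_per_gb


if __name__ == "__main__":
    hypothetical_price_per_gb = 0.02  # made-up figure, for illustration only
    for processed_gb in (50, 100, 250):
        cost = monthly_cost(processed_gb * GB, hypothetical_price_per_gb)
        print(processed_gb, "GB processed ->", round(cost, 2))
```

Anything at or below the free allowance costs nothing; above it, only the excess is charged. Check the current price list before relying on any number this produces.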

Answer 4

If you want to stay in the cloud, you can also spin up EC2 instances to create a permanent Hadoop cluster. Cloudera has plenty of resources about setting up such a cluster here.

However, this option is less cost effective than Amazon Elastic MapReduce, unless you have lots of jobs to run throughout the day, keeping your cluster fairly busy.
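The trade-off above can be put into numbers with a small sketch. All rates here are hypothetical and are not real EC2 or Elastic MapReduce prices; the model also assumes the on-demand service is billed only for the hours jobs actually run, at a somewhat higher hourly rate per node than a plain instance.

```python
HOURS_PER_DAY = 24


def always_on_daily_cost(nodes, hourly_rate):
    """Daily cost of a permanent cluster that is billed around the clock."""
    return nodes * hourly_rate * HOURS_PER_DAY


def on_demand_daily_cost(nodes, hourly_rate, busy_hours):
    """Daily cost when nodes are billed only while jobs are running."""
    return nodes * hourly_rate * busy_hours


def break_even_busy_hours(always_on_rate, on_demand_rate):
    """Busy hours per day above which the permanent cluster becomes cheaper."""
    return HOURS_PER_DAY * always_on_rate / on_demand_rate


if __name__ == "__main__":
    # Hypothetical per-node hourly rates, for illustration only.
    permanent_rate, on_demand_rate = 0.10, 0.12
    print(break_even_busy_hours(permanent_rate, on_demand_rate))
    print(always_on_daily_cost(10, permanent_rate))
    print(on_demand_daily_cost(10, on_demand_rate, busy_hours=6))
```

With equal hourly rates the break-even point is 24 hours, meaning the permanent cluster never wins on cost alone. It only pays off when the on-demand rate carries a premium and the cluster is busy most of the day.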

The other option is to build your own cluster. One of the nice features of Hadoop is that you can cobble heterogeneous hardware into a cluster with decent computing power, the kind that can live in a rack in your server room. Considering that older hardware that's lying around is already paid for, the only costs of getting such a cluster going are new drives and perhaps enough memory sticks to maximize the capacity of those boxes. The cost effectiveness of such an approach is then much better than Amazon's. The only caveat is whether you have the bandwidth necessary for pulling all the data down into the cluster's HDFS on a regular basis.

Answer 5

Google App Engine does MapReduce as well (at least the map part for now). http://code.google.com/p/appengine-mapreduce/
