MapReduce in the cloud

Except for Amazon MapReduce, what other options do I have to process a large amount of data?

Answers

Microsoft also has Hadoop/MapReduce running on Windows Azure, but it is under a limited CTP (Community Technology Preview). You can submit your information and request CTP access at https://www.hadooponazure.com/. The Developer Preview of the Apache Hadoop-based Services for Windows Azure is available by invitation.

Besides that, you can also try Google BigQuery, for which you first have to move your data into Google's proprietary storage and then run your queries on it. Keep in mind that BigQuery is based on Dremel, which is similar to MapReduce but faster thanks to its column-based processing.
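To make the column-based point a bit more concrete, here is a toy sketch in Python. It is not how Dremel or BigQuery is implemented; the sample records and the function names are made up for illustration. It only shows that summing one field touches far fewer values when the data is laid out by column.

```python
# Toy illustration only: this is NOT how Dremel or BigQuery works internally.
# The records and all function names are hypothetical.
rows = [
    {"user": "a", "country": "US", "clicks": 3},
    {"user": "b", "country": "DE", "clicks": 5},
    {"user": "c", "country": "US", "clicks": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def sum_row_oriented(records, field):
    """Row layout: each whole record is visited just to read one field."""
    total = 0
    values_touched = 0
    for record in records:
        values_touched += len(record)
        total += record[field]
    return total, values_touched


def sum_column_oriented(columns, field):
    """Column layout: only the requested column is read."""
    column = columns[field]
    return sum(column), len(column)


columns = to_columns(rows)
print(sum_row_oriented(rows, "clicks"))        # (15, 9)
print(sum_column_oriented(columns, "clicks"))  # (15, 3)
```

Both calls return the same total, but the column layout reads 3 values instead of 9. On wide tables with many fields that gap grows accordingly.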

Another option is Mortar Data, which combines Python and Pig in a clever way so that jobs are easy to write and the results are easy to visualize. I found it very interesting; please have a look: http://mortardata.com/#!/how_it_works

DataStax Brisk is good.

Full-on distributions

  1. Apache Hadoop
  2. Cloudera’s Distribution including Apache Hadoop (that’s the official name)
  3. IBM Distribution of Apache Hadoop
  4. DataStax Brisk
  5. Amazon Elastic MapReduce

HDFS alternatives

  1. MapR
  2. Appistry CloudIQ Storage Hadoop Edition
  3. IBM General Parallel File System (GPFS)
  4. CloudStore

Hadoop MapReduce alternatives

  1. Pervasive DataRush
  2. Cascading
  3. Hive (an Apache subproject, included in Cloudera’s distribution)
  4. Pig (a Yahoo-developed language, included in Cloudera’s distribution)

Reference: http://gigaom.com/cloud/as-big-data-takes-off-the-hadoop-wars-begin/

If you want to process large amounts of data in real time (a Twitter feed, a click stream from a website, etc.) using a cluster of machines, check out "Storm", which was recently open-sourced by Twitter.

Standard Apache Hadoop is good for batch processing of petabytes of data where latency is not a problem.
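For anyone new to the model, the batch style that Hadoop runs can be sketched in a few lines of plain Python. This is a single-process word count, not Hadoop's API; the function names are my own, and a real cluster would spread the map and reduce phases across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run the whole batch: map every line, shuffle, then reduce each group."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data in the cloud", "map reduce in the cloud"])
print(counts["cloud"])  # 2
```

The result only becomes available once the whole input has been read, which is why this style suits jobs where latency is not a concern.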

Brisk from DataStax, as mentioned above, is quite unique in that you can run MapReduce parallel processing on live data.

There are other efforts, like Hadoop Online, which allow pipelined processing.

Google BigQuery is obviously another option, where you load CSV (delimited records) and can slice and dice the data without any setup. It's extremely simple to use, but it is a premium service where you pay by the number of bytes processed (the first 100 GB per month is free, though).
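As a rough sketch of what pay-per-bytes billing with a free tier means for a monthly bill: the 100 GB free allowance comes from the paragraph above, while the price per GB below is a placeholder I made up, so check the current pricing before relying on any number.

```python
FREE_GB_PER_MONTH = 100  # free allowance mentioned above


def monthly_query_cost(gb_processed, price_per_gb):
    """Charge only for the data processed beyond the monthly free allowance.

    price_per_gb is a hypothetical placeholder, not a real BigQuery price.
    """
    billable_gb = max(0, gb_processed - FREE_GB_PER_MONTH)
    return billable_gb * price_per_gb


# Hypothetical price of 0.5 currency units per GB, for illustration only.
print(monthly_query_cost(80, 0.5))   # 0.0
print(monthly_query_cost(300, 0.5))  # 100.0
```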

If you want to stay in the cloud, you can also spin up EC2 instances to create a permanent Hadoop cluster. Cloudera has plenty of resources about setting up such a cluster here.

However, this option is less cost-effective than Amazon Elastic MapReduce unless you have lots of jobs to run throughout the day, keeping your cluster fairly busy.
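The "fairly busy" threshold can be estimated with some simple arithmetic. The sketch below assumes both options use the same number of nodes, that the permanent cluster is billed around the clock, and that the on-demand service is billed only while jobs run but at a higher hourly rate. All rates are hypothetical placeholders, not real Amazon prices.

```python
def permanent_cluster_cost(nodes, hourly_rate, hours_in_period):
    """An always-on cluster is billed for every hour, busy or idle."""
    return nodes * hourly_rate * hours_in_period


def on_demand_cost(nodes, hourly_rate, busy_hours):
    """An on-demand cluster is billed only for the hours jobs actually run."""
    return nodes * hourly_rate * busy_hours


def break_even_busy_hours(permanent_rate, on_demand_rate, hours_in_period):
    """Busy hours per period above which the permanent cluster becomes cheaper."""
    return permanent_rate * hours_in_period / on_demand_rate


# Hypothetical per-node hourly rates, for illustration only.
PERMANENT_RATE = 0.5
ON_DEMAND_RATE = 0.625

threshold = break_even_busy_hours(PERMANENT_RATE, ON_DEMAND_RATE, hours_in_period=24)
print(threshold)
```

With these made-up rates the permanent cluster only pays off if jobs keep it busy for roughly 19 of every 24 hours; with lighter use, the on-demand option is cheaper.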

The other option is to build your own cluster. One of the nice features of Hadoop is that you can cobble heterogeneous hardware into a cluster with decent computing power, the kind that can live in a rack in your server room. Considering that older hardware that's lying around is already paid for, the only costs of getting such a cluster going are new drives and perhaps enough memory sticks to max out the capacity of those boxes. The cost-effectiveness of such an approach is then much better than Amazon's. The only caveat is whether you have the bandwidth necessary for pulling all the data down into the cluster's HDFS on a regular basis.

Google App Engine does MapReduce as well (at least the map part for now). http://code.google.com/p/appengine-mapreduce/




