Netezza, Teradata, DB2 Parallel/Enterprise, ... versus Hadoop or others?
I'm looking at building some data warehousing/querying infrastructure, right now on top of Map/Reduce solutions like Hadoop.

However, it strikes me that all the M/R work is just repeating what the RDBMS guys have been solving for the last 20 years with parallel SQL databases. Parallel SQL implementations scale reads and writes across nodes, just like M/R, but they also already include the niceties of regular databases (SQL, existing integration libraries, etc.).
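
To make the comparison concrete, here is a minimal sketch of one aggregate computed two ways: as a hand-written map/reduce over per-node shards, and as a single SQL `GROUP BY`. It is not taken from any of the products named above. The shard layout, the `sales` table, and its column names are made up for illustration, and Python's built-in `sqlite3` is only a single-process stand-in for a parallel SQL engine.

```python
import sqlite3
from collections import defaultdict
from functools import reduce

# Hypothetical data: each inner list stands for the rows held by one node.
SHARDS = [
    [("east", 100), ("west", 50)],
    [("east", 25), ("north", 10)],
    [("west", 75)],
]


def map_shard(rows):
    """Map step: aggregate locally on one node, producing partial sums per key."""
    partial = defaultdict(int)
    for region, amount in rows:
        partial[region] += amount
    return dict(partial)


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    merged = dict(left)
    for region, amount in right.items():
        merged[region] = merged.get(region, 0) + amount
    return merged


def totals_map_reduce(shards):
    """Run the map step per shard, then fold the partial results together."""
    return reduce(merge, (map_shard(rows) for rows in shards), {})


def totals_sql(shards):
    """Same answer from one declarative query (sqlite3 is only a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    for rows in shards:
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    result = dict(conn.execute(query))
    conn.close()
    return result
```

Both functions return the same totals per region. The difference is where the work lives: in the M/R version you write the partial aggregation and the merge yourself, while in the SQL version the engine decides how to split and combine the work.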

The problem is that you don't seem to find the customers of those companies posting much online. So, does anyone here have experience with those kinds of solutions, and can you give me some insight and/or links?

Best answer

I have used Netezza and Hadoop, and I have second-hand knowledge of Infobright, a column database.

Netezza is a true database and implements ACID properties, which has both a cost and a benefit. Netezza is moving toward allowing more M/R code to run on its table data with the new TwinFin architecture. In the previous version of the appliance they supported user-defined functions and aggregates. In the new version, which runs Linux on the SPUs and uses Intel processors, the door is opening to running more custom code close to the data. My experience with Netezza has been very positive - both the technology and the company.

Hadoop is pure map-reduce computing. It doesn't incur the cost of ACID database properties, so it's really a different beast from Netezza. Depending on the usage pattern, it may be better than Netezza, and it is certainly cheaper. The Hadoop ecosystem also includes HBase and Hive, which may give you the query convenience you need at a lower cost.

Another developer on our team evaluated Infobright, so this is second-hand: they found the load performance to be poor and some of the aggregations to be slow. It has some parallels with Netezza (e.g. zone maps are used in Netezza to help narrow scan scope). Infobright is open source, with both a community edition and a supported enterprise edition.
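
For readers who haven't met zone maps, here is a rough sketch of the general idea: keep the minimum and maximum of a column for each block of rows, and skip any block whose range cannot contain a match. This is a generic illustration, not how Netezza or Infobright implement it; the `Zone` class, the function names, and the zone size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """A block of values plus the min/max recorded when the block was loaded."""
    rows: list
    lo: int
    hi: int


def build_zones(values, zone_size):
    """Split values into fixed-size zones and record each zone's min and max."""
    zones = []
    for start in range(0, len(values), zone_size):
        chunk = values[start:start + zone_size]
        zones.append(Zone(rows=chunk, lo=min(chunk), hi=max(chunk)))
    return zones


def range_scan(zones, low, high):
    """Return the values in [low, high] and how many zones were actually read."""
    matches, zones_read = [], 0
    for zone in zones:
        if zone.hi < low or zone.lo > high:
            continue  # this zone cannot contain a match, so skip it entirely
        zones_read += 1
        matches.extend(v for v in zone.rows if low <= v <= high)
    return matches, zones_read


zones = build_zones(list(range(100)), zone_size=10)
matches, zones_read = range_scan(zones, 42, 47)
print(matches, zones_read)
```

With values loaded in sorted order, the scan above reads one zone out of ten. The pruning only helps when the data is at least roughly clustered on the filtered column; if every zone spans the full value range, every zone still has to be read.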

There is much more that can be said in the context of your particular problem - probably beyond the scope of this forum. Hope this helps.

Other answers

You haven't specified what questions you are trying to answer with your queries, or how your data is structured. Before you choose a solution, you probably need to think about those two things.

You're correct: the major RDBMS vendors offer clustering solutions, both for parallel processing and for high availability. They've had this technology for a while, and any enterprise with a lot of data is probably using it. When you buy ($$$) the product they will give you lots of documentation and help you set it up (more $$$) if you can afford it.

RDBMSs are good for online transaction processing (OLTP); for answering questions about specific rows (where does Mary live?); and for answering some summary-type questions (how much did we sell in the first quarter?). Although they can be made to answer detailed summary questions (how much did we sell in the first quarter, broken down by product, salesperson, month, and region?), at that point you're usually starting to tax their limits, because any query that needs to visit all of the rows is going to be slow.

For those types of queries, most enterprises have a data warehouse that structures the data into multi-dimensional "cubes" (see Cognos, Hyperion, and others). That may be appropriate for what you're trying to do.
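
As a rough illustration of what a "cube" buys you, the sketch below pre-aggregates a measure for every combination of dimensions, so that a breakdown question becomes a lookup instead of a full scan. It is a toy under stated assumptions, not how Cognos or Hyperion work internally; the `FACTS` rows, the dimension names, and `build_cube` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "region")

# Hypothetical fact rows: (product, region, amount)
FACTS = [
    ("widget", "east", 100),
    ("widget", "west", 50),
    ("gadget", "east", 25),
]


def build_cube(facts, dimensions=DIMENSIONS):
    """Pre-aggregate the amount for every subset of the dimensions."""
    cube = {}
    indexes = range(len(dimensions))
    for size in range(len(dimensions) + 1):
        for subset in combinations(indexes, size):
            totals = defaultdict(int)
            for row in facts:
                key = tuple(row[i] for i in subset)
                totals[key] += row[-1]  # the last field is the measure
            cube[tuple(dimensions[i] for i in subset)] = dict(totals)
    return cube


cube = build_cube(FACTS)
print(cube[("region",)])  # totals by region
print(cube[()])           # grand total
```

The trade-off is the usual one: the number of pre-computed groupings doubles with each added dimension, so real systems choose which roll-ups to materialize rather than building all of them.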

I don't have any experience with MapReduce, but I've read the Wikipedia section on its uses, so if what you're trying to do falls into those categories, I'd continue with it.

If you are in a fast-paced, growing organization, you should consider Teradata. We have had a really good experience with it. In our experience, it gives you scalability that other vendors could not match. Once you get used to its SQL and working style, you will really appreciate the design and architecture of Teradata.




