Transferring data from a DB2 DB to a Greenplum DB

My company has decided to implement a datamart using Greenplum, and I have the task of figuring out how to go about it. A ballpark figure for the amount of data to be transferred from the existing DB2 DB to the Greenplum DB is about 2 TB.

I would like to know:

1) Is the Greenplum DB the same as vanilla PostgreSQL? (I've worked on Postgres AS 8.3.)
2) Are there any (free) tools available for this task (extract and import)?
3) I have some knowledge of Python. Is it feasible, even easy, to do this in a reasonable amount of time?

I have no idea how to do this. Any advice, tips and suggestions will be hugely welcome.

Answers

1) Greenplum is not vanilla Postgres, but it is similar. It has some new syntax, but in general it is highly consistent with PostgreSQL.

2) Greenplum itself provides a utility called "gpfdist" which listens on a port that you specify in order to serve files to the database (but the file has to be split up). You want readable external tables. They are quite fast. The syntax looks like this:

CREATE READABLE EXTERNAL TABLE schema.ext_table
( thing int, thing2 int )
LOCATION (
     'gpfdist://server:port1/path/to/filep1.txt',
     'gpfdist://server:port2/path/to/filep2.txt',
     'gpfdist://server:port3/path/to/filep3.txt'
) FORMAT 'text' (delimiter E'\t' null 'null' escape 'off') ENCODING 'UTF8';

CREATE TEMP TABLE import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;

If you play by their rules and your data is clean, the loading can be blazing fast.

3) You don't need Python to do this, although you could automate it by using Python to kick off the gpfdist processes and then sending a command to psql that creates the external table and loads the data. It depends on what you want to do, though.
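To give a rough idea, a minimal sketch of that kind of automation might look like the following. The host name, port, data directory, and table names are made up for illustration; gpfdist's -d and -p flags pick the directory to serve and the port to listen on.

import subprocess
import time

# Hypothetical settings -- adjust to your environment.
DATA_DIR = "/data/extract"      # directory holding the split flat files
GPFDIST_PORT = "8081"
DB = "datamart"

# 1) Kick off a gpfdist process to serve the files on the chosen port.
gpfdist = subprocess.Popen(["gpfdist", "-d", DATA_DIR, "-p", GPFDIST_PORT])
time.sleep(2)  # give gpfdist a moment to start listening

# 2) Ask psql to create the external table and load it into a regular table.
load_sql = """
CREATE READABLE EXTERNAL TABLE schema.ext_table (thing int, thing2 int)
LOCATION ('gpfdist://etlhost:8081/filep*.txt')
FORMAT 'text' (delimiter E'\\t');
CREATE TABLE schema.import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;
"""
subprocess.check_call(["psql", "-d", DB, "-c", load_sql])

# 3) Shut the gpfdist process down once the load has finished.
gpfdist.terminate()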

Many of Greenplum's utilities are written in Python, and the current DBMS distribution comes with Python 2.6.2 installed, including the PyGreSQL module, which you can use to work inside GPDB.
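For example, a quick PyGreSQL session against GPDB could look roughly like this (connection settings and table names are placeholders):

import pg  # PyGreSQL's "classic" interface, bundled with the Greenplum distribution

# Hypothetical connection settings for the Greenplum master host.
db = pg.DB(dbname="datamart", host="mdw", port=5432, user="gpadmin")

# Ordinary SQL goes through the master; getresult() returns a list of tuples.
rows = db.query("SELECT count(*) FROM schema.import").getresult()
row_count = rows[0][0]

db.close()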

For data transfer into Greenplum, I've written Python scripts that connect to the source (Oracle) DB using cx_Oracle and then dump the output either to flat files or to named pipes. gpfdist can read from either sort of source and load the data into the system.
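A stripped-down sketch of that approach, assuming a cx_Oracle source as described above (the query, pipe path, and credentials are placeholders; a DB2 source would simply use a different driver):

import os
import cx_Oracle

PIPE_PATH = "/data/extract/orders.pipe"   # gpfdist serves this directory

# Create a named pipe for gpfdist to read from (remove any stale one first).
if os.path.exists(PIPE_PATH):
    os.remove(PIPE_PATH)
os.mkfifo(PIPE_PATH)

conn = cx_Oracle.connect("scott", "tiger", "sourcehost/ORCL")
cursor = conn.cursor()
cursor.execute("SELECT order_id, customer_id, amount FROM orders")

# Stream rows out as tab-delimited text; the write blocks until the
# external-table SELECT on the Greenplum side starts consuming the pipe.
pipe = open(PIPE_PATH, "w")
for row in cursor:
    pipe.write("\t".join(str(col) for col in row) + "\n")
pipe.close()

cursor.close()
conn.close()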

Generally, it is really slow if you use SQL INSERT or MERGE statements to import large volumes of data.

The recommended way is to use external tables, which you can define over file-based, web-based, or gpfdist-hosted sources.

Greenplum also has a utility named gpload, which can be used to define your transfer jobs: source, output, and mode (insert, update, or merge).
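gpload is driven by a YAML control file that describes the job (source files, target table, mode) and is then run from the command line; wrapping it from a script could be as simple as the sketch below, where the control-file path is a placeholder:

import subprocess

# Hypothetical control file describing the load job: where the source files
# live, the target table, and the mode (insert, update, or merge).
CONTROL_FILE = "/etl/jobs/orders_load.yml"

# gpload reads the YAML job definition and drives gpfdist and external
# tables under the covers; a non-zero exit code means the load failed.
subprocess.check_call(["gpload", "-f", CONTROL_FILE])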

1) It's not vanilla Postgres.

2) I have used Pentaho Data Integration with good success in various types of data transfer projects. It allows for complex transformations and multi-threaded, multi-step loading of data if you design your steps carefully. I also believe Pentaho supports Greenplum specifically, though I have no experience of this.




