Transferring data from a DB2 DB to a Greenplum DB

My company has decided to implement a datamart using Greenplum, and I have the task of figuring out how to go about it. A ballpark figure for the amount of data to be transferred from the existing DB2 DB to the Greenplum DB is about 2 TB.

I would like to know:

1) Is the Greenplum DB the same as vanilla PostgreSQL? (I've worked on Postgres AS 8.3.)
2) Are there any (free) tools available for this task (extract and import)?
3) I have some knowledge of Python. Is it feasible, even easy, to do this in a reasonable amount of time?

I have no idea how to do this. Any advice, tips and suggestions will be hugely welcome.

Answers

1) Greenplum is not vanilla Postgres, but it is similar. It has some new syntax, but in general it is highly consistent with PostgreSQL.

2) Greenplum itself provides a utility called "gpfdist" which listens on a port that you specify in order to serve files to the database (but the file has to be split up). You want readable external tables. They are quite fast. The syntax looks like this:

CREATE READABLE EXTERNAL TABLE schema.ext_table
( thing int, thing2 int )
LOCATION (
     'gpfdist://server:port1/path/to/filep1.txt',
     'gpfdist://server:port2/path/to/filep2.txt',
     'gpfdist://server:port3/path/to/filep3.txt'
) FORMAT 'text' (delimiter E'\t' null 'null' escape 'off') ENCODING 'UTF8';

CREATE TEMP TABLE import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;

If you play by their rules and your data is clean, the loading can be blazing fast.

3) You don't need Python to do this, although you could automate it by using Python to kick off the gpfdist processes and then sending a command to psql that creates the external table and loads the data. It depends on what you want to do, though.
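To give a rough idea, a minimal sketch of that kind of automation might look like the following. The host name, port, data directory, and table names are made up for illustration; gpfdist's -d and -p flags pick the directory to serve and the port to listen on.

import subprocess
import time

# Hypothetical settings -- adjust to your environment.
DATA_DIR = "/data/extract"      # directory holding the split flat files
GPFDIST_PORT = "8081"
DB = "datamart"

# 1) Kick off a gpfdist process to serve the files on the chosen port.
gpfdist = subprocess.Popen(["gpfdist", "-d", DATA_DIR, "-p", GPFDIST_PORT])
time.sleep(2)  # give gpfdist a moment to start listening

# 2) Ask psql to create the external table and load it into a regular table.
load_sql = """
CREATE READABLE EXTERNAL TABLE schema.ext_table (thing int, thing2 int)
LOCATION ('gpfdist://etlhost:8081/filep*.txt')
FORMAT 'text' (delimiter E'\\t');
CREATE TABLE schema.import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;
"""
subprocess.check_call(["psql", "-d", DB, "-c", load_sql])

# 3) Shut the gpfdist process down once the load has finished.
gpfdist.terminate()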

Many of Greenplum's utilities are written in Python, and the current DBMS distribution comes with Python 2.6.2 installed, including the PyGreSQL module, which you can use to work inside GPDB.
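For example, a quick PyGreSQL session against GPDB could look roughly like this (connection settings and table names are placeholders):

import pg  # PyGreSQL's "classic" interface, bundled with the Greenplum distribution

# Hypothetical connection settings for the Greenplum master host.
db = pg.DB(dbname="datamart", host="mdw", port=5432, user="gpadmin")

# Ordinary SQL goes through the master; getresult() returns a list of tuples.
rows = db.query("SELECT count(*) FROM schema.import").getresult()
row_count = rows[0][0]

db.close()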

For data transfer into Greenplum, I've written Python scripts that connect to the source (Oracle) DB using cx_Oracle and then dump the output either to flat files or to named pipes. gpfdist can read from either sort of source and load the data into the system.
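A stripped-down sketch of that approach, assuming a cx_Oracle source as described above (the query, pipe path, and credentials are placeholders; a DB2 source would simply use a different driver):

import os
import cx_Oracle

PIPE_PATH = "/data/extract/orders.pipe"   # gpfdist serves this directory

# Create a named pipe for gpfdist to read from (remove any stale one first).
if os.path.exists(PIPE_PATH):
    os.remove(PIPE_PATH)
os.mkfifo(PIPE_PATH)

conn = cx_Oracle.connect("scott", "tiger", "sourcehost/ORCL")
cursor = conn.cursor()
cursor.execute("SELECT order_id, customer_id, amount FROM orders")

# Stream rows out as tab-delimited text; the write blocks until the
# external-table SELECT on the Greenplum side starts consuming the pipe.
pipe = open(PIPE_PATH, "w")
for row in cursor:
    pipe.write("\t".join(str(col) for col in row) + "\n")
pipe.close()

cursor.close()
conn.close()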

Generally, it is really slow if you use SQL INSERT or MERGE statements to import large volumes of data.

The recommended way is to use external tables, which you can define over file-based, web-based, or gpfdist-hosted sources.

Greenplum also has a utility named gpload, which can be used to define your transfer jobs: source, output, and mode (insert, update, or merge).
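gpload is driven by a YAML control file that describes the job (source files, target table, mode) and is then run from the command line; wrapping it from a script could be as simple as the sketch below, where the control-file path is a placeholder:

import subprocess

# Hypothetical control file describing the load job: where the source files
# live, the target table, and the mode (insert, update, or merge).
CONTROL_FILE = "/etl/jobs/orders_load.yml"

# gpload reads the YAML job definition and drives gpfdist and external
# tables under the covers; a non-zero exit code means the load failed.
subprocess.check_call(["gpload", "-f", CONTROL_FILE])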

1) It's not vanilla Postgres.

2) I have used Pentaho Data Integration with good success in various types of data transfer projects. It allows for complex transformations and multi-threaded, multi-step loading of data if you design your steps carefully. I also believe Pentaho supports Greenplum specifically, though I have no experience of this.




