Greenplum data distribution among segments

I have a Greenplum database with 10 segments, reflecting the 10 hard disks. My table has a primary partition on date and a secondary partition on hash id. So for a month there are 30 primary partitions, each containing 100 sub-partitions, and data is loaded into the sub-partitions based on hash id. Now the question is: how are these partitions distributed among the segments?

Guess 1:

seg1 (equally distributed by primary partition: 30 date partitions / 10 = 3 date partitions)
date1---0-99 sub partition of hashid
date2---0-99 sub partition of hashid
date3---0-99 sub partition of hashid

seg2 (equally contains 30/10 = 3 date partitions)
date4---0-99 partition of hashid
date5---0-99 partition of hashid
date6---0-99 partition of hashid

...
..

seg10
date27---0-99 partition of hashid
date28---0-99 partition of hashid
date29---0-99 partition of hashid

OR

Guess 2

seg1 (distributed by hash id: 100 hash ids / 10 = 10 hash-id partitions)
date1---0-9 partition of hashid
date2---0-9 partition of hashid
...
date30---0-9 partition of hashid

seg2 (equally contains 100/10 = 10 hash-id partitions)
date1---10-19 partition of hashid
date2---10-19 partition of hashid
...
date30---10-19 partition of hashid

How does this work? Guess 1 or Guess 2? If both are wrong, kindly enlighten me on how the data is distributed at the segment level.

Is it a good design to sub-partition based on hash id? As I am dealing with 6 million records every day and have to store data for a year, I want searches to hit only a very small portion of the data. In other words, based on the query key I will determine a hash-id range, and the search will hit only those specific partitions.
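A two-level layout like the one described above can be sketched as Greenplum DDL; the table name, column names, and date/hash ranges below are illustrative, not taken from an actual schema:

```sql
-- Sketch of the described layout: rows hashed on hash_id across segments,
-- daily range partitions, each day sub-partitioned into 100 hash-id buckets.
CREATE TABLE events (
    event_date date,
    hash_id    int,
    payload    text
)
DISTRIBUTED BY (hash_id)
PARTITION BY RANGE (event_date)
SUBPARTITION BY RANGE (hash_id)
SUBPARTITION TEMPLATE
(
    START (0) END (100) EVERY (1)   -- 100 hash-id sub-partitions per day
)
(
    START (date '2013-06-01') INCLUSIVE
    END   (date '2013-07-01') EXCLUSIVE
    EVERY (INTERVAL '1 day')        -- 30 daily primary partitions
);
```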

Thanks Ganesh.R

Answers

In Greenplum, the distribution key determines how data is scattered across all segments in a cluster. Partitioning breaks down the data inside each segment into smaller chunks just like partitioning in any other DBMS.

You want to pick a distribution key that divides data evenly across the cluster and then use partitioning to subdivide the table. The idea is to set up your tables so that each segment DB in your cluster works on data sets that are roughly the same size. Overall DB response will be as slow as the slowest segment in the cluster.
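One common way to check whether your distribution key spreads rows evenly is to count rows per segment using the `gp_segment_id` system column (the table name here is a placeholder):

```sql
-- Count rows per segment to spot data skew.
-- gp_segment_id is a system column available on every distributed table.
SELECT gp_segment_id, count(*) AS rows_on_segment
FROM my_table
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

If one segment holds far more rows than the others, queries on this table will bottleneck on that segment.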

I'm not 100% sure, but I think the partitions exist on every node; so, in your example, each node will have all 30 partitions, each holding only that node's share of the rows.

If you want to specify which key to shard on, use DISTRIBUTED BY.
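For example, a minimal sketch (table and column names are illustrative):

```sql
-- Rows are hashed on hash_id and scattered across all segments;
-- partitioning then subdivides the data within each segment by day.
CREATE TABLE events (
    event_date date,
    hash_id    int,
    payload    text
)
DISTRIBUTED BY (hash_id)
PARTITION BY RANGE (event_date)
(
    START (date '2013-06-01') INCLUSIVE
    END   (date '2013-07-01') EXCLUSIVE
    EVERY (INTERVAL '1 day')
);
```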

When you create a table with a distribution key (say, event_id) and partition it on a date column (say, event_date), the best approach is to make the partition column part of the distribution key, so that the data is distributed properly and skew is avoided.
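That suggestion, sketched as DDL using the answer's example column names (the table name and ranges are assumptions):

```sql
-- Composite distribution key that includes the partition column,
-- as suggested above.
CREATE TABLE events (
    event_id   bigint,
    event_date date,
    payload    text
)
DISTRIBUTED BY (event_id, event_date)
PARTITION BY RANGE (event_date)
(
    START (date '2013-01-01') INCLUSIVE
    END   (date '2014-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
```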

Thanks




