Question

Are there any best practices, patterns, or general advice for partitioning large amounts of hierarchical data?

Think of, say, a database of all the people in a given country that tracks who has worked with whom. Thinking of the "person" entities in isolation, if a lot of data were to be kept about each person then a natural approach seems to be to divide the population across multiple horizontal partitions. However, the relations (who worked with whom) could (and will) cross partitions. Clustering on these relations (i.e. using employer, for example, as the partition key in order to minimize cross-partition references) won't be viable over time as the data becomes more and more cross-linked. Such clustering would also result in unbalanced partitions, which would hamper scalability.

I'm rather stuck right now, so I would be very grateful for any help offered.

Thanks.

Answer 1

It seems you have three problems:

1. Storing data about an employee (excluding relationships/hierarchy)
2. Employer-to-employee hierarchy (which can change over time)
3. Employee-to-employee work history (again, changing over time)

To tackle each in turn:

1. Employee data: This could be partitioned, with a unique id as the key and an alternate key on surname + given names + date of birth. Partition either by spreading rows evenly by id, or by other info such as area/region (though that will mean some partitions will be hotter than others).
2. Employer/employee hierarchy: This needs a secondary table that allows changes over time, e.g. employee id, employer id, start date, end date, keyed by employee id + employer id and, for lookups the other way, by employer id + employee id. I recommend reading the following: http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back ; it might have ideas that work well for the size of your data.
3. Employee/employee work history: This needs another secondary table, very similar to #2, cross-referencing employees and the time they've worked together, e.g. employee1 id, employee2 id, start date, end date, indexed by each of the ids at a minimum.
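To illustrate "spreading evenly by id" from the first item, here is a minimal Python sketch. It assumes integer ids and a fixed partition count; `NUM_PARTITIONS`, `partition_for` and `partition_counts` are hypothetical names chosen for this example, and a real database would normally use its own hash-partitioning feature instead.

```python
import hashlib

NUM_PARTITIONS = 8  # assumption: a fixed, hypothetical number of partitions


def partition_for(employee_id: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map an employee id to a partition number by hashing the id.

    Hashing (rather than using id ranges or region) spreads rows evenly,
    so no partition becomes much hotter than the others.
    """
    digest = hashlib.sha256(str(employee_id).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_counts(ids, num_partitions: int = NUM_PARTITIONS) -> list:
    """Count how many of the given ids land in each partition."""
    counts = [0] * num_partitions
    for employee_id in ids:
        counts[partition_for(employee_id, num_partitions)] += 1
    return counts


if __name__ == "__main__":
    # Sequential ids still end up roughly balanced across partitions.
    print(partition_counts(range(10_000)))
```

One trade-off of a plain modulo scheme like this is that changing the partition count moves most rows, so the count is best fixed up front or replaced by a scheme designed for resizing.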

The key here is: don't attempt to place the relationships/hierarchy within the employee data table. It will be slow and will limit the linking you need (especially as links change over time).
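As a rough sketch of that separation, the following uses Python's built-in `sqlite3` module to lay out the three tables described above. The table, column and index names are assumptions made for illustration. Adding the start date to each key is also an assumption that goes slightly beyond the keys listed above, so that a person can rejoin the same employer or colleague later.

```python
import sqlite3


def build_schema(conn: sqlite3.Connection) -> None:
    """Create the employee table plus two separate relationship tables."""
    conn.executescript(
        """
        -- Plain employee data: no relationship columns here.
        CREATE TABLE employee (
            id            INTEGER PRIMARY KEY,
            surname       TEXT NOT NULL,
            given_names   TEXT NOT NULL,
            date_of_birth TEXT NOT NULL,
            UNIQUE (surname, given_names, date_of_birth)  -- alternate key
        );

        -- Employer/employee hierarchy, with a validity period.
        CREATE TABLE employment (
            employee_id INTEGER NOT NULL,
            employer_id INTEGER NOT NULL,
            start_date  TEXT NOT NULL,
            end_date    TEXT,  -- NULL while still employed
            PRIMARY KEY (employee_id, employer_id, start_date)
        );
        -- Lookup "back the other way": employer to employees.
        CREATE INDEX employment_by_employer
            ON employment (employer_id, employee_id);

        -- Employee/employee work history, with a validity period.
        CREATE TABLE worked_with (
            employee1_id INTEGER NOT NULL,
            employee2_id INTEGER NOT NULL,
            start_date   TEXT NOT NULL,
            end_date     TEXT,
            PRIMARY KEY (employee1_id, employee2_id, start_date)
        );
        -- Index the second id too, so either side can be searched.
        CREATE INDEX worked_with_by_second
            ON worked_with (employee2_id, employee1_id);
        """
    )


def coworkers_of(conn: sqlite3.Connection, employee_id: int) -> list:
    """Return the sorted ids of everyone who has worked with employee_id."""
    rows = conn.execute(
        """
        SELECT employee2_id FROM worked_with WHERE employee1_id = ?
        UNION
        SELECT employee1_id FROM worked_with WHERE employee2_id = ?
        """,
        (employee_id, employee_id),
    ).fetchall()
    return sorted(row[0] for row in rows)
```

Foreign keys are left out deliberately: once the employee table is split across partitions, the relationship rows may point at employees stored elsewhere, so those references would have to be checked by the application rather than by the database.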
