A lot of the trick to this is finding your exact requirements, and yours still sound pretty vague. Do you just need to support operations like this?
- Update key K to value V.
- Look up a somewhat-recent value of key K.
You mentioned you need eventual consistency. So if you do a single update, it will eventually replicate everywhere. If you do two nearly-simultaneous updates, do you care which one wins? If one replica reports that an update was successfully completed, do you care if the value could be lost if that replica were to temporarily crash shortly afterward? Or if that replica were permanently destroyed?
How precise should somewhat-recent be? If there s a netsplit or something, a lookup might return a very stale result or just fail. Do you care which?
Do you ever need to support fancier operations like...
- Get the absolute latest value of key K?
- Update the value of key K to value V provided the latest value is currently V?
Do you have rigid reliability, latency, and/or bandwidth requirements? How far apart are your replicas / how good is the network between? This impacts if you can have cross-replica communication on every update and even on every lookup; or even if you can/should fail over operations to a remote replica if the local one seems to be down.
Depending on your answers here, I ve worked with a couple different schemes that might meet your requirements. There are several possible variations on them.
- The simplest thing is to just have the application always talk to the local replica. Replicas timestamp values (using NTP-synced clocks) and only talk to each other for asynchronous replication. Highest timestamp wins in replication. Of course, if applications on two different replicas each do a read/modify/write near simultaneously, one of the modifications can easily be lost. (In fact, without a conditional update scheme, the same is even true for near-simultaneous changes on the same replica.) If a replica permanently fails, recent-ish updates can be lost. This is more or less what Bigtable s built-in replication does. In the paper you linked, it d be the "Optimistic - Multimaster" branch but not caring too much about losing some updates makes it simpler than they suggest.
- Some databases use the Paxos algorithm (see for example "Data Management for Internet-Scale Single-Sign-On" here to make fancier things possible. Each replica can know how far behind it might be so you can say "give me a value that s no more than 1 minute old" or "give me the absolute latest value". An update isn t considered complete until a quorum of replicas have accepted it, so "give me the absolute latest value" will definitely always return that value until another update happens. You can do the conditional update operation I mentioned to prevent simultaneous writers from tramping each other. This doesn t seem to fit neatly into either the optimistic or pessimistic category as defined by that author because updates are replicated synchronously to a quorum but replicas which didn t vote in the latest Paxos round may still be able to answer some queries. The scheme can be very complicated, though...