How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale - Adventures in Engineering — LiveJournal
The wanderings of a modern ronin.

Ben Cantrick
  Date: 2008-05-22 14:26
  Subject:   How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale
  Tags:  reddit

How do you structure your database using a distributed hash table like BigTable? The answer isn't what you might expect. If you were thinking of translating relational models directly to BigTable, think again. The best way to implement joins with BigTable is: don't. You - pause for dramatic effect - duplicate data instead of normalizing it.

Flickr anticipated this design in their architecture when they chose to duplicate comments in both the commenter's and the commentee's user shards rather than create a separate comment relation. I don't know how that decision was made, but it must have gone against every fiber in their relational bones. But Flickr's reasoning was genius. To scale you need to partition. User data must be spread across the shards. So where do comments belong in a scalable architecture?

From one worldview, comments logically belong to a relation binding comments and users together. But if your unit of scalability is the user shard, there is no separate relation space. So you go against all your training and decide to duplicate the comments. Nerd heroism at its best. Let inductive rules derived from observation guide you rather than deductions from arbitrarily chosen first principles. Very Enlightenment era thinking. Voltaire would be proud.
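To make the duplication concrete, here is a minimal sketch of writing one comment into both users' shards. It is not Flickr's code: the shard count, the modulo placement in `shard_for`, and the names `post_comment` and `comments_for` are all hypothetical, and plain in-memory dicts stand in for real databases.

```python
NUM_SHARDS = 4  # hypothetical shard count

# One dict per shard, mapping user_id -> list of comment records.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(user_id: int) -> int:
    """Pick a shard for a user (assumed placement rule: simple modulo)."""
    return user_id % NUM_SHARDS


def post_comment(author_id: int, owner_id: int, comment_id: int, text: str) -> None:
    """Store a full copy of the comment with the commenter AND the commentee."""
    record = {"id": comment_id, "author": author_id, "owner": owner_id, "text": text}
    # A set avoids writing twice when someone comments on their own item.
    for user_id in {author_id, owner_id}:
        shard = shards[shard_for(user_id)]
        shard.setdefault(user_id, []).append(dict(record))


def comments_for(user_id: int) -> list:
    """Read a user's comments from that user's shard only; no join needed."""
    return shards[shard_for(user_id)].get(user_id, [])


post_comment(author_id=1, owner_id=6, comment_id=100, text="Nice photo!")
```

Either user's page can now be rendered from a single shard. The price is two writes per comment and twice the disk space.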

In a relational world duplication is removed in order to prevent update anomalies. Error prevention is the driving force in relational modeling. Normalization is a kind of ethical system for data. What happens, for example, if a comment changes? Both copies of the comment must be updated. That leads to errors because who can remember where all the data is stored? A severe ethical violation may happen. Go directly to relational jail :-)
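The update anomaly is easy to reproduce in a few lines. This is only an illustration under assumed names (`author_shard`, `owner_shard`, `naive_edit`, and `edit_everywhere` are hypothetical), with two dicts standing in for the two user shards that each hold a copy of the same comment.

```python
# Two copies of comment 100, one in each user's shard.
author_shard = {100: "Nice photo!"}
owner_shard = {100: "Nice photo!"}


def naive_edit(comment_id: int, new_text: str) -> None:
    """Update only the copy we remembered; the other copy goes stale."""
    author_shard[comment_id] = new_text


def edit_everywhere(comment_id: int, new_text: str) -> None:
    """Update every known copy. The caller must know where all copies live."""
    for shard in (author_shard, owner_shard):
        shard[comment_id] = new_text


naive_edit(100, "Nice photo, great light!")
diverged = author_shard[100] != owner_shard[100]    # copies now disagree

edit_everywhere(100, "Nice photo, great light!")
consistent = author_shard[100] == owner_shard[100]  # copies agree again
```

Note that `edit_everywhere` is not atomic here. If it fails between the two writes, the copies diverge again, which is exactly the class of error normalization is meant to rule out.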


Trevor Stone: mathnet - to cogitate and to solve
  User: flwyd
  Date: 2008-05-23 05:52 (UTC)
  Subject:   (no subject)
Actually, the relational model is not concerned with data storage. The relational model says* you should be able to perform a single update to the comment, but if the implementation wants to store the comment in one chunk of disk or thirty it doesn't matter, so long as the desired semantic and transactional properties are maintained (atomic, isolated, etc.).

* I think normalization may actually fall outside of the scope of the relational model too, but it's at least a kissing cousin.
