Scalable Datastores
Rick Cattell

Last revised 8/20/2013


In recent years, a number of data storage systems have been developed with excellent horizontal scaling properties.  Most are commonly called "NoSQL" systems.  Horizontal scaling allows dozens or hundreds of machines to operate as a single database system, with performance improving approximately linearly with the number of machines. This is interesting because traditional relational database systems failed to scale well when their data is distributed over many servers (with the exception of read-mostly data warehousing). 

Scalable data stores can be categorized into four groups:

Here is a paper I wrote comparing these systems:

    Datastore Comparison

This paper was published in ACM SIGMOD Record in 2010.  Some of the system descriptions are now out of date, so you should check the sources above or talk to me to learn what has changed.  I also wrote a paper on the 8 most important elements of scalable database systems.  This paper is unpublished, but can be viewed here:

    Requirements for Scalability

And finally, I wrote a paper with Mike Stonebraker, weighing the important factors in making a datastore scalable:

    CACM paper

This paper has some discussion of SQL vs NoSQL scalability.  It was published in Communications of the ACM (June 2011).

Any input on these papers or suggestions on this web page are welcomed.  You can contact me at rick(at)cattell.net. 

If you'd like further reading on scalable SQL and NoSQL datastores, you can click on the links at the top of this page to learn more about specific systems, or the links below for some other general references that I like:
You will find lots of claims about the performance and scalability of systems out there, but few apples-to-apples comparisons.  In my opinion, the best scalability benchmark today is the Yahoo Cloud Serving Benchmark.   With Roberto Zicari, I interviewed two authors of the Yahoo benchmark paper:

  YCSB Interview

I'm trying to encourage others to run the Yahoo (YCSB) benchmark as well.  This will hopefully reduce some of the scalability hype  and confusion out there.