Scale Right From the Beginning

We build really big software. Big as in 100,000-people-participating-in-the-same-web-conference big or alert-an-entire-state-about-severe-weather big. But few software systems start out that big. Twitter and Netflix had big ideas for sure, but the technology started small. And they were able to smoothly grow to keep up with demand… well, perhaps not always smoothly! So when’s the right time to start worrying about your scalability strategy?

We get asked that question a lot, and the answer is early and often. In fact, it’s one of the first questions we ask when evaluating the feasibility of a project. While most technology businesses will never need to address Twitter- or Netflix-sized scalability challenges, we believe that if you are planning for success, you need to be planning for scalability.

The geek-speak of scalability usually starts with heaps of technology: “We are building an on-demand, auto-scaling, geo-clustered solution on asynchronous, multi-core, parallel processing architectures deployed to a cloud-based, virtualized, high-IOPS, big-data-clustered, pay-as-you-go infrastructure.” So it’s just about picking the most scalable technology, right? Wrong! While the technology is certainly critical when building for scale, it’s rarely the limiting factor. With a few exceptions, you can build scalable solutions on almost any platform. The important thing is to think about scalability in every decision you make.

The strategy begins by assessing every element of the stack: servers, storage, applications, databases, bandwidth, IO, and especially the services your solution depends on. While doing this, most engineers tend to worry first about performance. Truth is, when it comes to scalability, the first focus needs to be reliability. Building for reliability not only makes the stack more robust, it lays the foundation for scalability.

The easiest way to change this mindset is to ask reliability questions like:

  • What happens to our system when we are slashdotted?
  • What happens when our cloud provider goes down?
  • What happens when our server runs out of disk space?

Designing for reliability in these situations means we probably have at least two of everything, and we are on our way to N of everything, which is the hallmark of scalability. But don’t be tempted by tactical solutions. Fire anyone who suggests, “We’ll just get a bigger one when we need it.” They are not thinking strategically; scaling up is not a strategy!
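
To make the “at least two of everything” idea concrete, here is a minimal sketch of client-side failover across redundant replicas. It is only an illustration under assumed names: the replica URLs are hypothetical, and we happen to use Python’s requests library; the post doesn’t prescribe a particular mechanism.

```python
# Minimal sketch of "at least two of everything": try each replica in
# turn so no single machine is ever the only copy. The replica URLs
# and the use of the requests library are illustrative assumptions.
import requests

REPLICAS = [
    "https://api-a.example.com",
    "https://api-b.example.com",  # the second copy of everything
]

def fetch_with_failover(path: str) -> requests.Response:
    last_error = None
    for base in REPLICAS:
        try:
            resp = requests.get(base + path, timeout=2)
            resp.raise_for_status()
            return resp
        except requests.RequestException as err:
            last_error = err  # this replica is down or slow; try the next one
    raise RuntimeError("all replicas failed") from last_error
```

In practice this job usually falls to a load balancer or service-discovery layer, but the principle is the same at every layer of the stack.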

Now that we are in the right mindset, let’s talk about some of the technical nitty-gritty. There are a number of key architectural themes and design patterns that we rely on to pave the way for scalability. The essentials are pretty common “best practices” that most software teams should understand:

  • Embrace the KISS principle – The most elegant architectures are those where there is nothing left to take away… which means there is less complexity when scaling out.
  • Eliminate machine and environment dependencies – Don’t design applications to depend on local server resources like storage or memory, unless they’re solving local problems.
  • Use the right tool for the job – Now is the time to look into the toolbox and pick the right ones. Trying to hack something in after there is a scalability problem means it’s probably too late.

Here are some specifics that we apply to every big-time system:

  • Fully Stateless Design – It’s easy to say and usually hard to do, but eliminating per-user state is crucial when efficiently scaling out. Most web application frameworks default to enabling server-side session state, and most developers are more than happy to tuck away gobs of resource-stealing data into those little buckets of memory. That’s fine on one machine but breaks the machine-dependency rule from above. To solve this problem, many developers will try to put their session data into a relational database. Brilliant! The problem has now moved to an even-more-expensive place to scale! For high-end scalability, every node in your system must be able to handle every request, and clogging up a relational database doesn’t cut it (see the first sketch after this list).
  • Asynchronous Everything – This is where hands start to get dirty, but nothing scales more efficiently than fully asynchronous code. The applications we build tend to do unique and amazing functional things that all boil down to two basic tasks: moving data and manipulating data. That data is coming from the client, a database, another server, or perhaps any number of other places, but it’s all coming in over a slow network connection that the code will spend a lot of time waiting on. Waiting ties up tons of server resources, kills performance, and costs way more to scale than it should. Asynchronous code never waits… it asks for data and then does something else useful until that data is ready. Asynchronous design is receiving renewed interest due to the success of frameworks like Twisted and node.js, but the truth is that most platforms have supported this concept for decades in the form of IO Completion Ports, epoll, and kqueue, so a whole new platform doesn’t have to be adopted to start scaling efficiently (see the second sketch after this list).
  • Data Partitioning – Understanding the dynamics of the data and how it can be partitioned should happen before there are gazillions of rows of it. This is perhaps one of the hardest challenges to deal with, but the database is usually the first point of contention and tough (or at least expensive) to scale out efficiently if it hasn’t been planned for in advance. Thinking about data in this manner, earlier rather than later, affords the opportunity to optimize when, where, and how to store all those precious bits the super-scale stack is going to manage. The classic RDBMS is a “one size fits all” storage solution that still can’t be beat for a lot of use cases… but scaling isn’t one of them. Today we are picking storage technologies that align much more closely with our use cases. In just one typical project we’ll combine Cassandra for high-performance writes, MongoDB for a balance of speed and queryability, and, yes, SQL for solid transaction integrity (see the third sketch after this list).
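
Here is a minimal sketch of the stateless-design bullet above: session data lives in a shared store keyed by a token the client carries, so every node can rebuild it on demand. Redis and the redis-py client are assumptions made for the example, and the function names are illustrative rather than taken from any particular framework.

```python
# Minimal sketch: keep per-user state out of the web node by parking it
# in a shared store. Assumes a reachable Redis instance and the
# redis-py client; the names below are illustrative only.
import json
import uuid

import redis

store = redis.Redis(host="sessions.internal", port=6379)
SESSION_TTL = 30 * 60  # expire idle sessions after 30 minutes

def save_session(data: dict) -> str:
    """Write session data to the shared store; hand a token back to the client."""
    token = uuid.uuid4().hex
    store.setex(f"session:{token}", SESSION_TTL, json.dumps(data))
    return token  # returned to the client as a cookie or header

def load_session(token: str) -> dict:
    """Any node can rebuild the session from the token alone."""
    raw = store.get(f"session:{token}")
    return json.loads(raw) if raw else {}
```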
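
For the asynchronous bullet, here is a small sketch using Python’s asyncio; the aiohttp client and the endpoint URLs are assumptions made for illustration. The point is simply that an await hands the thread back to the event loop instead of parking it while the network is slow.

```python
# Minimal sketch of "never wait": start both slow calls at once and let
# the event loop run other work while the network is busy. The URLs are
# placeholders and aiohttp is an assumed dependency.
import asyncio

import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    # The awaits yield control to the event loop, so the thread is never
    # parked waiting on the network.
    async with session.get(url) as resp:
        return await resp.read()

async def handle_request() -> list:
    async with aiohttp.ClientSession() as session:
        # Both requests are in flight concurrently; total time is roughly
        # the slower of the two, not the sum.
        return await asyncio.gather(
            fetch(session, "https://profiles.internal/me"),
            fetch(session, "https://prefs.internal/me"),
        )

if __name__ == "__main__":
    asyncio.run(handle_request())
```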
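
And for the data-partitioning bullet, a toy sketch of routing rows by a partition key chosen up front. The shard names and the modulo-hash scheme are illustrative only; real systems (Cassandra included) typically use consistent hashing or token ranges so that adding a shard doesn’t reshuffle every key.

```python
# Toy sketch of hash-based partitioning: the partition key, chosen up
# front, decides where each row lives. Shard names are placeholders.
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id: str) -> str:
    """Map a partition key to a shard; the same key always lands on the same shard."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every read and write for a user is routed the same way, so no single
# database has to hold, or scan, the whole data set.
print(shard_for("user-42"))  # prints one of the shard names
```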

We’ve been practicing this approach for many years now and have yet to be disappointed in the results. As an example, we are operating a fairly complex JSON API that scales well beyond 4000 transactions per second per node on inexpensive virtual machines in the Amazon EC2 cloud. As a frame of reference, we recently consulted with a client whose system required 90 (yes, nine-zero) machines for that throughput in the very same cloud!

So let’s review. Design for scale from the get-go: check. Get the team in the mindset of reliability: check. Use the right technology in the right way: check. But that’s just the tip of the iceberg. The reality is that all of these things can go right, and scalability challenges will still present themselves. There are a myriad of less technical things that have to go right to be successful with large-scale software systems: development and operational processes, testing, monitoring, support systems… the list goes on and on. In a future post, we will hit some of these areas head on and discuss how we approach them in real-world super-scalable solutions.