Monday, March 23, 2009

Evolutionary Architecture

Agile Architecture, Round 2. As I (and others) have said, "Do The Simplest Thing That Could Possibly Work" applies to more than your software. So today, we'll look at the common growth process for a system, and some of the constraints that limit each phase. Our example is a web-based, database-driven application, but the progression applies to any system with persistent storage. (Thanks to mxGraph for the images.)



Step 1 is the absolute basic setup: everything running on one server. One web server, one database backend. I suppose Step Zero would be no database, just files or something. But even that trivial example is enough to get us off the ground. No failover, no backup, no load handling? No problem! Now, I would not advise using this type of system in a live environment, but for prototyping and identifying performance bottlenecks, it's just fine. Get something up and running, then see how it behaves. The ultimate capacity of this configuration is limited by how big a box you can get your hands on. But most likely, one giant enterprise server would not work as well as step 2, for reasons that will be explained.
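As a rough sketch of what Step 1 (or even Step Zero edging into Step 1) can look like, here is a toy version using Python's standard library and SQLite; the file names and table are purely illustrative:

    # step1.py - a single-box setup: web server and database in one process.
    import sqlite3
    from wsgiref.simple_server import make_server

    db = sqlite3.connect("app.db")  # the "database backend" is just a local file
    db.execute("CREATE TABLE IF NOT EXISTS hits (path TEXT)")

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        db.execute("INSERT INTO hits (path) VALUES (?)", (path,))
        db.commit()
        count = db.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [("%d requests served\n" % count).encode("utf-8")]

    if __name__ == "__main__":
        # No failover, no backup, no load handling - exactly the point.
        make_server("", 8000, app).serve_forever()

Nothing here survives a crash or heavy traffic gracefully, but it is enough to get something running and start measuring.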

This is also a good place to mention the value of monitoring and logging, even at this level. In order to identify bottlenecks, you need to be able to see how your system responds to load.
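A minimal sketch of the kind of logging I mean, wrapping a WSGI application (like the toy one above) so that every request's timing ends up in a log you can analyze later; the log file name is arbitrary:

    # timing.py - record how long each request takes, so bottlenecks show up in the data.
    import logging
    import time

    logging.basicConfig(filename="requests.log", level=logging.INFO,
                        format="%(asctime)s %(message)s")

    def timed(app):
        """Wrap a WSGI app and log the elapsed time of every request."""
        def wrapper(environ, start_response):
            start = time.time()
            try:
                return app(environ, start_response)
            finally:
                logging.info("%s %s took %.3fs",
                             environ.get("REQUEST_METHOD", "-"),
                             environ.get("PATH_INFO", "-"),
                             time.time() - start)
        return wrapper

Even something this crude tells you which pages are slow and when, which is what you need to know before deciding where the next machine goes.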



Step 2 is a very common setup. It's not that much different from step 1, except that the database and web server are on physically separate machines. This is the first step toward separating by functionality, though the functionality is as coarse-grained as it can be at this point. You could just as easily add a clone of the first machine, but then you'd have two sets of processes to manage, which we will address in the next iteration. This setup allows us to tune each machine for its specific task: application work or database access.
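The application-side change for Step 2 can be as small as no longer assuming the database lives on localhost. A sketch, assuming PostgreSQL and the psycopg2 driver purely as an example; any networked database works the same way:

    # db.py - for Step 2, the database host becomes configuration, not an assumption.
    import os
    import psycopg2  # example driver; substitute whatever your database uses

    def get_connection():
        # In Step 1, DB_HOST is localhost; in Step 2 it points at the second
        # machine, which can now be tuned purely for database work.
        return psycopg2.connect(
            host=os.environ.get("DB_HOST", "localhost"),
            dbname=os.environ.get("DB_NAME", "app"),
            user=os.environ.get("DB_USER", "app"),
            password=os.environ.get("DB_PASSWORD", ""),
        )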

By introducing a second machine, we've now introduced a coordination issue between them. If one machine goes down, the other needs to be able to deal with that somehow. Fail the request? Queue and retry? The answer is "it depends" on the situation. When serving web pages, it may be fine to let the user resubmit their request. When storing credit card transactions, you need to be more robust. We can handle the broader availability problem with the next step.
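Here is a rough sketch of the queue-and-retry option, assuming a hypothetical save_order function that fails when the database machine is unreachable; failed writes are spooled to local disk and replayed later:

    # retry_queue.py - spool writes that fail (e.g. the database machine is down)
    # and replay them later from a periodic job.
    import json
    import os
    import uuid

    SPOOL_DIR = "spool"  # local directory for writes we could not complete

    def save_order(order):
        """Hypothetical write to the database machine; raises on failure."""
        raise NotImplementedError

    def save_or_spool(order):
        try:
            save_order(order)
        except Exception:
            os.makedirs(SPOOL_DIR, exist_ok=True)
            path = os.path.join(SPOOL_DIR, "%s.json" % uuid.uuid4())
            with open(path, "w") as f:
                json.dump(order, f)

    def replay_spool():
        """Run periodically (e.g. from cron) to retry spooled writes."""
        if not os.path.isdir(SPOOL_DIR):
            return
        for name in os.listdir(SPOOL_DIR):
            path = os.path.join(SPOOL_DIR, name)
            with open(path) as f:
                order = json.load(f)
            try:
                save_order(order)
            except Exception:
                continue  # still down; leave it for the next pass
            os.remove(path)

A local spool protects against a short database outage, not against losing the web machine itself, which is the same "it depends" trade-off.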



Step 3 could theoretically be scaled out to hundreds or thousands of servers, and Step 4 is really a subdivision of Step 3. Notice that the number of servers is doubling at each step: we started with one, added a second, and now we're managing (at least) four. The cluster implementation can be as simple or as complicated as you make it - a J2EE container-managed cluster or simple proxy software (a minimal proxy sketch follows the list below). The important things with this step are:
  • you can now add as many servers as you need, quickly and simply.
  • you've removed almost all single-server failures from impacting availability.
  • you're still constrained by the single database master server.
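As one example of the "simple proxy software" end of that spectrum, here is a deliberately minimal round-robin HTTP proxy using only the standard library; the backend addresses are placeholders, and a real deployment would use battle-tested proxy or load-balancer software:

    # roundrobin.py - a toy round-robin proxy in front of a cluster of web servers.
    import itertools
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
    from urllib.request import urlopen

    # Placeholder backends; in practice these come from configuration.
    BACKENDS = itertools.cycle(["http://10.0.0.11:8000", "http://10.0.0.12:8000"])

    class Proxy(BaseHTTPRequestHandler):
        def do_GET(self):
            backend = next(BACKENDS)  # rotate through the cluster
            with urlopen(backend + self.path) as upstream:
                status = upstream.status
                body = upstream.read()
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), Proxy).serve_forever()

It skips header forwarding, health checks, and retries against a dead backend, but it shows the shape of the thing: the web tier grows behind one address while the database master stays singular.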
This step assumes that you're still dealing with fairly homogeneous data at this point: all the data coming and going from the servers can be handled similarly. This type of layout is pretty common among the Less Than Huge websites, that is, excepting the eBays, Facebooks, and Googles of the world. Once you get that huge, you will probably need a unique approach to handle your traffic anyway. But this will get you 99% of the way there. At this point, you've already addressed pushing updates to your servers and handling inconsistency between them (in the previous step). But let's say your monitoring indicates that you're writing more data than a single write master can keep up with, or that you've identified a major performance bottleneck in the system. That's where step 4 comes into play:



We're now, finally, at a point that could be considered "architecture." Each type of process runs on its own server, connecting to a database that contains just the data it needs to perform its function. This is where denormalization may come into play: by duplicating a little data in each database, you remove the need for the databases to talk to each other at all, and that minor performance and space hit is usually a bargain (sketched below). Also at this point, there may be additional vertical stacking, separating the presentation layer from specific data processing. Now we're into the classic 3 (or n)-tier model, but the internals of those layers can scale as large as we want (within some physical limits).
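To make the denormalization point concrete, here is a sketch in which the order-handling process keeps its own copy of the customer's name at write time, so reading orders never requires reaching into the (now separate) customer database; the schema and names are hypothetical:

    # orders.py - the order process's own database, with a denormalized customer name.
    import sqlite3

    orders_db = sqlite3.connect("orders.db")  # stands in for this process's dedicated database
    orders_db.execute("""CREATE TABLE IF NOT EXISTS orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        customer_name TEXT,     -- duplicated from the customer database on purpose
        total_cents INTEGER)""")

    def record_order(customer_id, customer_name, total_cents):
        # A small space cost, and a renamed customer won't show up in old orders,
        # in exchange for never joining or calling across to the customer database.
        orders_db.execute(
            "INSERT INTO orders (customer_id, customer_name, total_cents) VALUES (?, ?, ?)",
            (customer_id, customer_name, total_cents))
        orders_db.commit()

    def orders_report():
        # Everything needed to render this report lives in this one database.
        return orders_db.execute(
            "SELECT customer_name, total_cents FROM orders ORDER BY id").fetchall()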

So to sum up, your architecture should "grow" along with your application. It should provide a framework that allows your application to handle growth, not restrict it to a specific growth path.
