Monday, March 23, 2009

Evolutionary Architecture

Agile Architecture, Round 2. As I (and others) have said, "Do The Simplest Thing That Could Possibly Work" applies to more than your software. So today, we'll look at the common growth process for a system, and some of the constraints that limit each phase. Our example is a web-based, database-driven application, but the pattern applies to any system with persistent storage. (Thanks to mxGraph for the images.)



Step 1 is the absolute basic setup: everything running on one server. One webserver, one database backend. I suppose Step Zero would be no database, just files or something. But even that trivial example is enough to get us off the ground. No failover, no backup, no load handling? No problem! Now, I would not advise using this type of system in a live environment, but for prototyping and identifying performance bottlenecks, it's just fine. Get something up and running, then see how it behaves. The ultimate capacity of this configuration is limited by how big a box you can get your hands on. But most likely, one giant enterprise server would not work as well as step 2, for reasons that will be explained.
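To make step 1 concrete, here's a minimal sketch, assuming Python 3 and SQLite (the "hits" table and the port are made up): the web server and the database live in the same process on the same box.

```python
# A minimal "step 1" sketch: one box, one process, web server and database together.
# Assumes the Python 3 standard library plus SQLite; the "hits" table is hypothetical.
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

db = sqlite3.connect("app.db")  # the database is just a file on the same machine
db.execute("CREATE TABLE IF NOT EXISTS hits (path TEXT)")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        db.execute("INSERT INTO hits (path) VALUES (?)", (self.path,))
        db.commit()
        (count,) = db.execute("SELECT COUNT(*) FROM hits").fetchone()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(f"hit number {count}\n".encode())

HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```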

This would also be a good place to mention the value of monitoring and logging, even at this level. In order to identify bottlenecks, you need to be able to see how your system responds to load.
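Even crude instrumentation pays for itself here. A minimal sketch, assuming Python and its standard logging module (the labels and sleeps are stand-ins): time the pieces of a request and log them, so the bottleneck shows up before you start buying hardware.

```python
# Minimal monitoring sketch: log how long each piece of a request takes.
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("perf")

@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        log.info("%s took %.1f ms", label, (time.perf_counter() - start) * 1000)

# Wrap the suspect sections of a request handler:
with timed("db query"):
    time.sleep(0.05)   # stand-in for a real query
with timed("render page"):
    time.sleep(0.01)   # stand-in for template rendering
```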



Step 2 is a very common setup. It's not much different from step 1, except that the database and webserver are on physically separate machines. This is the first step toward separating by functionality, though the split is as coarse-grained as it can be at this point. You could just as easily add a clone of the first machine, but then you'd have two identical sets of processes to manage; we'll address that in the next iteration. This setup allows us to tune each machine for its specific task: running the application or serving the database.
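In practice, the application-level change for step 2 is tiny: the database host becomes configuration instead of "localhost". A sketch, assuming Python with psycopg2 and environment-variable configuration (the variable names are made up):

```python
# Step 2 sketch: the only code change is where the database lives.
# Assumes PostgreSQL via psycopg2; DB_HOST etc. are hypothetical settings.
import os
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host=os.environ.get("DB_HOST", "localhost"),  # e.g. "db01.internal" once the DB moves off-box
    dbname=os.environ.get("DB_NAME", "app"),
    user=os.environ.get("DB_USER", "app"),
    password=os.environ.get("DB_PASSWORD", ""),
)
cur = conn.cursor()
cur.execute("SELECT version()")
print(cur.fetchone())
```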

By introducing a second machine, we've also introduced a synchronization issue between the two. If one machine goes down, the other needs to be able to handle that somehow. Fail the request? Queue and retry? The answer is "it depends" on the situation being addressed: when serving web pages, the user can simply submit the request again; when storing credit card transactions, you probably need something more robust. A sketch of the queue-and-retry option follows, and the next step addresses the failure problem more directly:
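Here's one way "queue and retry" might look, as a sketch only: the write is retried with a growing delay, then parked on a local queue if the database machine stays unreachable. `write_to_database` and `store_payment` are hypothetical names, not a real API.

```python
# A sketch of "queue and retry": hold the write and try again rather than
# failing the user's request the moment the database machine is unreachable.
import queue
import time

pending = queue.Queue()   # a background job would drain this later

def write_to_database(record):
    """Stand-in for the real write against the database machine."""
    raise ConnectionError("database host unreachable")

def store_payment(record, attempts=5, delay=2.0):
    for attempt in range(attempts):
        try:
            write_to_database(record)
            return True
        except ConnectionError:
            time.sleep(delay * (attempt + 1))   # back off a little more each time
    pending.put(record)   # give up for now, but don't lose the data
    return False
```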



Step 3 could theoretically be scaled out to hundreds or thousands of servers, and Step 4 is really a subdivision of Step 3. Notice that our number of servers is growing exponentially: we started at 1, then added a second, and now we're managing (at least) 4. The cluster implementation can be as simple or as complicated as you make it: a J2EE container-managed cluster or simple proxy software (a toy proxy sketch follows the list below). The important things with this step are:
  • you can now add as many servers as you need, quickly and simply.
  • you've removed almost all single-server failures from impacting availability.
  • you're still constrained by the database master server.
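For the "simple proxy software" end of the spectrum, here's about the smallest round-robin proxy that could possibly work, as a sketch only (the backend addresses are made up, and a real deployment would use something like HAProxy, mod_proxy_balancer, or a container-managed cluster):

```python
# A toy round-robin proxy: each request is forwarded to the next web server in the pool.
# Backend addresses are hypothetical; this ignores POSTs, headers, and error handling.
import itertools
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen

BACKENDS = itertools.cycle(["http://10.0.0.11:8080", "http://10.0.0.12:8080"])

class Proxy(BaseHTTPRequestHandler):
    def do_GET(self):
        with urlopen(next(BACKENDS) + self.path) as upstream:  # pick the next backend
            body = upstream.read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

ThreadingHTTPServer(("0.0.0.0", 8000), Proxy).serve_forever()
```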
This step assumes you're still dealing with fairly homogeneous data: everything coming and going from the servers can be handled the same way. This type of layout is pretty common among the less-than-huge websites, that is, everything short of the eBays, Facebooks, and Googles of the world. Once you get that huge, you'll probably need a unique approach to handle your traffic anyway, but this will get you 99% of the way there. At this point, you've already addressed pushing updates to your servers and handling inconsistency between them (in the previous step). But let's say your monitoring indicates that you're writing more data than a single write master can keep up with, or that you've identified a major performance bottleneck in the system. That's where step 4 comes into play:



We're now, finally, at a point that could be considered "architecture." Each type of process lives on its own server and connects to a database containing just the data it needs to perform that function. This is the point where denormalization may come into play: the minor space and performance hit of duplicating data is offset by eliminating the interconnections between the databases. Also at this point, there may be additional vertical stacking, separating the presentation layer from specific data processing. Now we're into the classic 3-tier (or n-tier) model, but the internals of those layers can scale as large as we want (within some physical limits).
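As a small illustration of the denormalization trade, a sketch assuming Python and SQLite (table and column names are made up): the reporting database carries its own copy of the customer name so it never has to reach over to the accounts database.

```python
# Denormalization sketch: duplicate a little data into the function's own database
# so it never needs a cross-database join. Names are hypothetical.
import sqlite3

reports = sqlite3.connect("reports.db")   # owned by the reporting servers only
reports.execute("""
    CREATE TABLE IF NOT EXISTS order_report (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER,
        customer_name TEXT,      -- copied from the accounts DB: costs a little space,
        total_cents   INTEGER    -- saves a cross-database join on every report
    )
""")
reports.execute(
    "INSERT OR REPLACE INTO order_report VALUES (?, ?, ?, ?)",
    (1001, 42, "Ada Lovelace", 15900),
)
reports.commit()
```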

So to sum up, your architecture should "grow" along with your application. It should provide a framework that allows your application to handle growth, not restrict it to growing along one specific path.

Tuesday, March 10, 2009

Schedule, LOC, and Budget? You're Lying

Software engineering is an imprecise science. Anyone who tells you differently is either not working very hard or doesn't understand that fact (or both). Actually, calling it engineering is a bit of a misnomer itself, since the engineering work in building two software systems can differ as much as engineering a submarine differs from engineering a bridge. Engineering involves creating repeatable systems of common features. That is, there is a set of rules for building bridges that has little in common with the set of rules for building submarines.

The problem with calling software engineering lies in identifying the dividing line between engineering and manufacturing. Determining the amount of effort (time, code, and people) for software is much less well defined than in traditional "peoplespace" engineering because of the fluidity of the tools involved. Imagine writing a proposal to build the first transcontinental railroad: 50 miles per year, thousands of workers. Now imagine the proposal today: much more production, far fewer workers. Computers allow the creation of these monstrously efficient production mechanisms. Hence the statement "hardware is cheap, people are expensive."

Looking at schedules, we see there are two types of release schedules: time-constrained (release on a certain date, like Ubuntu) or effort-constrained (all of these features are in release X, like, say, Apache httpd). Time-constrained releases ship on a set date with a variable number of features. Effort-constrained releases deliver when the set of features is done. Neither mechanism tries to fix the date, the feature set, and the head count all at once. So either you release on a scheduled date with whatever is done at that time, or you release when everything is done, regardless of the date.

It would be silly to create a release schedule that shipped every 10,000 lines of code, but that's what our snake-oil salesmen are proposing. Here's how it works: a budget is calculated from the estimated lines of code, times the number of man-hours each line takes to create, based on historical estimates of productivity. So, like Earl Scheib, you're saying "I can build that system with 100,000 lines of code!" Budget then becomes a function of the hourly billing rate, the productivity figure, and the estimated lines of code.

Here's an example: the customer says "build me a new SuperWidget." The contractor looks thoughtful and says "We think SuperWidget should take 50,000 lines of code, since it's kind of like the TurboSquirrel we built before. Our normal efficiency is 1 LOC per hour (seriously), and our billing rate is $200 per hour. So the budget is $10 million. There are 2,000 working hours in a person-year, so we need 25 people and you can have it in one year." There's your budget, schedule, and estimate, all rolled into one. If one changes, the others have to change as well. SuperWidget is a large project, obviously.
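For the record, the SuperWidget arithmetic works out exactly as quoted; here it is spelled out, using only the numbers from the example above:

```python
# The SuperWidget back-of-the-envelope budget, spelled out.
estimated_loc  = 50_000   # "kind of like TurboSquirrel"
loc_per_hour   = 1        # the claimed productivity
rate_per_hour  = 200      # dollars
hours_per_year = 2_000    # one person-year

hours  = estimated_loc / loc_per_hour    # 50,000 hours of work
budget = hours * rate_per_hour           # $10,000,000
people = hours / hours_per_year          # 25 people for one year

print(f"budget ${budget:,.0f}, {people:.0f} people for one year")
```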

That's seriously how it works. Oh sure, there's supposed to be more analysis to determine how similar the project will be to previous projects, but the number is still, at best, a guess. It's not like building a bridge, where you can estimate the amount of steel and concrete reasonably well; software isn't bound to physical properties.

So how do you get this model to work? The "fudge factor" is in the productivity. You set your productivity so low that you know, absolutely, you won't miss the estimate. Why do you only claim 1 LOC per hour? StarOffice, considered one of the largest open source projects, is estimated at around 9 million LOC. That's 4,500 man-years, using our previous rate of 1 LOC per hour (the quick check below spells out the arithmetic): 4.5 years with 1,000 developers, or 45 years with 100 developers. Obviously, something else is going on. Estimates show that real productivity can be around 10 times that figure. But how do you get there? That's the topic for next time.
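The StarOffice sanity check, using the same assumption of 2,000 working hours per person-year:

```python
# StarOffice at the claimed 1 LOC per hour.
loc            = 9_000_000
loc_per_hour   = 1
hours_per_year = 2_000

person_years = loc / loc_per_hour / hours_per_year            # 4,500 person-years
print(person_years / 1_000, "years with 1,000 developers")    # 4.5
print(person_years / 100,   "years with 100 developers")      # 45.0
```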

Friday, March 06, 2009

Score Your Work!

Today's topic will be a quiz: how much does your employer suck? Total up the points at the end (JavaScript if I feel nice).

1. Cafeteria - Does your company:
offer tasty, fresh food on request for a reasonable price, including free? (1 point)
offer mediocre food at their convenience? (0 points)
employ LunchLady Doris? (-1 point)

2. Security - Does your company:
treat you like an adult, and keep physical security to a minimum (badge readers, etc)? (1 point)
Have some sort of obvious CYA policy in place involving security guards? (0 points)
check your badge at the gate, again at the building, then again at the entrance to your office? (-1 point)

3. Work Environment - Does your company:
Have modern, clean, up-to-date, large, well-lit work areas? (1 point)
Have old, dirty, out-of-date, small, dark work areas? (0 points)
Have all of the above, plus ongoing construction, so it sounds like you're working on a major street at rush hour? (-1 point)

4. Equipment - Does your company:
provide you with what you need to do your job, and provide an easy mechanism to acquire it? (1 point)
begrudgingly provide the minimum of what you need, but make the process so tedious that it's usually not worth the effort? (0 points)
prefer to maintain a death grip on the infrastructure in order to justify their continued existence? (-1 point)

5. Management - Does your company:
have a reasonable number of managers, who are willing and capable of working with you to effect change in the way the company does business? (1 point)
have too many managers, some of whom have no noticeable effect on productivity? (0 points)
have so many managers that they create a totally separate reporting chain to separate work and labor-related issues? (-1 point)

More to come!