Wednesday, July 30, 2008

The Cost of Big Servers

I decided to compare the cost/benefit of a blade arrangement versus a "big server" setup. For the comparison I've configured a Dell PowerEdge M600 as my blade and a Sun Fire X4600 M2 as my "big server."
Price:
Dell = $2,605
Sun = $15,995

CPU:
Dell = 2 x Quad-Core Intel® Xeon® E5405, 2 x 6MB Cache, 2.0GHz, 1333MHz FSB
Sun = 4 x Dual-Core AMD Opteron Model 8222, 3.0 GHz

RAM:
Dell = 8GB 667MHz (8x1GB), Dual Ranked DIMMs
Sun = 16 GB (8 x 2 GB DIMMs)

Disk:
Dell = 2x73GB 10K RPM Serial-Attach SCSI 3Gbps 2.5-in HotPlug Hard Drive
Sun = 292 GB (2 x 146 GB) 10000 rpm SAS Disks

OS:
Dell = RHEL 4.5ES (although I would put some other Linux variant on if it were up to me).
Sun = Solaris 10 (shocker, I know)

Roughly, the Dell is one-sixth the price of the Sun, which means I can buy six of them for the price of one Sun (duh). Now, splitting those six blades up could create two database servers (those would need more disk, obviously) and four application/web servers. So for the price of one "big server," I've already got a cluster of four app servers and a cluster of two database servers. Arrange as needed for your system.

Throwing disk arrays into the mix continues this thought. Comparing a Dell PowerVault NF600 to a Sun StorageTek 5320 NAS Appliance again makes the case against "one big server," even though this time we're comparing network storage servers.

Price:
Dell = $4,926
Sun = $48,164

Disk capacity:
Dell = 4x400GB 10K RPM Serial-Attach SCSI
Sun = 16 x 500 GB 7200 rpm SATA Disk Drives

CPU:
Dell = Dual Core Intel® Xeon® 5110, 1.60GHz, 4MB Cache
Sun = 1 x 2.6 GHz (assuming AMD)

Now we're in the ballpark of a 10x price difference. So I could have ten Dell NAS appliances serving my six servers (again, I know there's no reason to have ten of them for six hypothetical servers, but it proves the point), versus one big storage appliance serving one big server.

The price of adding "commodity-level" hardware is so (relatively) low that there's almost no reason not to do it. Planning to build on this type of architecture quickly leads to partitioned applications: a new application can get its own host, and any service that needs more power simply gets more hardware rather than a round of performance tuning. Figure $100/hour per developer. If it takes more than half a man-week to profile, re-code, retest, and redeploy the service, you'd be better off throwing hardware at the problem. And as the net load on the system grows, it's much easier to build out horizontally when the increment size is small.
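To put rough numbers on that trade-off (a back-of-the-envelope sketch in Java; the prices and the half-week figure are just the ones quoted above):

    // Back-of-the-envelope: developer time to tune one service vs. buying another blade.
    public class BackOfTheEnvelope {
        public static void main(String[] args) {
            double hourlyRate = 100.0;        // $/hour, per the figure above
            double tuningHours = 0.5 * 40;    // half a man-week
            double tuningCost = hourlyRate * tuningHours;
            double bladeCost = 2605.0;        // Dell PowerEdge M600 price from above

            System.out.println("Tuning effort:  $" + tuningCost);  // prints 2000.0
            System.out.println("Another blade:  $" + bladeCost);   // prints 2605.0
            // Anything much past half a week of profiling, re-coding, and re-testing,
            // and the hardware is the cheaper fix.
        }
    }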

Now I know that you can run multiple virtual servers on one "big server," but the point I'm trying to get across is that the bang-for-the-buck of the "little" servers is much, much greater, and actually building around small servers creates a more modular, scalable software architecture.

Wednesday, July 23, 2008

Agile Architecture

Agile (or eXtreme Programming) development usually focuses on the development of the software itself, but it can (and should) be applied to all parts of the system. If the architecture of the system is determined before the first iteration and not open to refactoring or other redesign, then it falls into the category of BigDesignUpFront. To put the problem another way, how can the domain of the system be determined if what the system will do is not yet determined? And if the system functionality is already determined, then there's no room to DoTheSimplestThingThatCouldPossiblyWork.

One argument against DoTheSimplestThingThatCouldPossiblyWork is that nothing complex (like a compiler) would ever be built. I disagree, simply because DTSTTCPW should only be applied to the domain as it is understood. So TheSimplestThing when you are getting started should be different from TheSimplestThing once your system becomes operational. TheSimplestThing when you are working in isolation on your personal computer is massively different from TheSimplestThing when you are providing a service on the Internet. As a ridiculous example, think about the problem of searching for a given word in your filesystem. Would you build Google to do it? No (or I sincerely hope not). You'd use find or perl or some other scripting language. Now, if you want to let anyone search for any word in any document on the Internet, you've got to build Google, and meet all the system availability and performance requirements that go along with it.

The great thing about Web technologies is that they scale well, and don't have to be replaced to meet greater demand or tighter performance objectives. If one webserver can handle your own requests, then a cluster can handle thousands of users' requests. But how do you refactor an architecture? Most people think of architecture as an inflexible part of the design. Nothing could be further from the truth. Writing POJOs (Plain Old Java Objects) leads quickly into EJBs (Enterprise Java Beans) in a J2EE container, or web services. The same goes for any web-based scripting language: command-line PHP can be easily hosted by Apache and serve the same data over the web.
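As a rough sketch of what I mean (the class is made up for illustration, and ties back to the word-search example above), the business logic can start life as a plain class with no container in sight:

    // A plain old Java object: no EJB interfaces, no deployment descriptor, no container.
    public class WordCounter {
        public int count(String text, String word) {
            if (text == null || word == null || word.length() == 0) {
                return 0;
            }
            int hits = 0;
            int index = text.indexOf(word);
            while (index >= 0) {
                hits++;
                index = text.indexOf(word, index + word.length());
            }
            return hits;
        }
    }

Today it runs from a unit test or a main() on your workstation; when the demand shows up, the same class gets wrapped by a session bean or a web service endpoint, and the logic itself doesn't have to change.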

The key point is that just as your software should be designed iteratively, so should the architecture. As the system requirements grow and change, the architecture should grow and change with them. Will there be additional software changes to handle the new architecture? Probably. Will the overall system be better suited to handle the system parameters? Yes. Otherwise, you will be trying to put a now-square peg into a hole that's still round. It can be done, but it sure isn't TheSimplestThing.

Monday, July 21, 2008

The Network Is The Problem

The first goal of any architecture should be to address two questions: how does the user interact with the system, and what does the system do when the user can't connect? Obviously, the goal of the system is to provide as much connectivity as possible, but there will always be potential problems between the client and the first network connection. The system will need to address how to handle data that never makes it to the server (like Blogger, for example).

How the user interacts with the system will affect how the system handles the data when the network is down. Of primary importance is what to do with any data that would have been transmitted from client or server, and how to recover once connectivity is restored. For a time-critical system, an additional important decision must be addressed: what to do if the request isn't received in time. If the request can't be fulfilled, does it still have value?
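One way to picture that decision (just a sketch; the class and interface names are invented): hold whatever couldn't be sent, and decide at send time whether it's still worth sending.

    import java.util.LinkedList;
    import java.util.Queue;

    // A client-side holding pen for requests that couldn't reach the server.
    public class PendingRequests {
        public interface Transport {
            void send(String payload);  // stand-in for however the client talks to the server
        }

        public static class Request {
            final String payload;
            final long deadlineMillis;  // after this, the request has no value
            public Request(String payload, long deadlineMillis) {
                this.payload = payload;
                this.deadlineMillis = deadlineMillis;
            }
        }

        private final Queue<Request> queue = new LinkedList<Request>();

        public void hold(Request request) {
            queue.add(request);
        }

        // Called when connectivity comes back: send what still matters, drop the rest.
        public void flush(Transport transport) {
            while (!queue.isEmpty()) {
                Request request = queue.poll();
                if (System.currentTimeMillis() <= request.deadlineMillis) {
                    transport.send(request.payload);
                } // else: too late to be useful -- the "does it still have value?" decision
            }
        }
    }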

By focusing on the user input, you can keep your project in scope. If you start from the data instead, you end up looking at all the places that data can go, and all the systems that need to interact to deal with it. By focusing on what the user needs, rather than what the system requires, you can iteratively add more and more features, and rebuild the system as new capabilities are required for those features.

A trivial example: a user needs to know what his work assignments are. So he 1) logs into the system (define the scope of the user environment, address network connectivity - see, I told you it would be important), 2) requests his work (pre-defined data, or a manual query?), and 3) displays the data (what kind of display is available?).
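In code, that first iteration is nothing more than three operations (the interface below is hypothetical, just to show how small the initial scope really is):

    import java.util.List;

    // The whole first iteration, expressed as an interface. Transport, storage,
    // and the network-is-down behavior all hide behind these three calls.
    public interface WorkAssignmentClient {
        Session logIn(String userName, String password);   // step 1
        List<String> fetchAssignments(Session session);    // step 2
        void display(List<String> assignments);            // step 3

        interface Session {
            String userName();
        }
    }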

Now, once those decisions are addressed, they don't need to be addressed again for the user to create a new work assignment. Rather than attempting to figure out all of the system up front, you can provide a needed piece of functionality quickly. Once a decision is made, it can be applied to all other functionality in the system.

Monday, July 14, 2008

Agility 101: Refactor Your Build

One of the problems with using an adjective like "agile" or "extreme" (ugh) is that it implies a matter of degree. As in, "A mouse is more agile than an elephant." However, you could say the elephant is more agile than, say, an aircraft carrier. So, right off the bat, agile seems open to interpretation, whereas Rational Unified Process or Waterfall have well-defined steps that you either are doing or aren't doing. That impression is wrong. Agile has a very well-defined set of processes, but since the term seems to lend itself to statements like "We need to add some agility," it is working at a disadvantage.

The primary goal of agile development is continuous, sustainable development. If this is not possible in your current environment, ask yourself why not? A primary area for improvement is the build/deploy/test environment cycle. If it takes hours to build, deploy, and coordinate testing, then you've reduced the amount of time available to do anything useful. Some important points to consider:

  • How many steps does it take to go from a build to a running system? If it's more than a few, re-examine all the functionality that Ant provides.
  • Do multiple versions of the system potentially interfere with each other? If so, figure out a way to run multiple versions in isolation. As an additional point, can multiple versions of the same software co-exist on the same server or in the same environment? If my build has to care about what yours is doing, that's going to take time to rectify.
  • Can I build and test multiple branches without interfering with a previous build? To reiterate the previous point, if it takes time to reconfigure and set up to test a fix or enhancement, that's time that can't be used for coding.


Given the capabilities of a tool like Ant, there's no reason that a complete, executing system cannot be produced out of every build.
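As a small illustration of the isolation point (this is my own sketch, not anything from a particular build): if the build stamps a version label and a port into the launch command, any number of builds can run side by side, and a post-build check can confirm the right one came up.

    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // A minimal service whose identity and port come from the build, e.g.:
    //   java -Dapp.port=9001 -Dapp.version=build-142 BuildSmokeTest
    public class BuildSmokeTest {
        public static void main(String[] args) throws Exception {
            int port = Integer.parseInt(System.getProperty("app.port", "8080"));
            String version = System.getProperty("app.version", "dev");

            ServerSocket server = new ServerSocket(port);
            System.out.println(version + " listening on port " + port);
            while (true) {
                Socket client = server.accept();
                OutputStream out = client.getOutputStream();
                // Answer every connection with the build identity so a post-build
                // check can confirm that this particular build is the one running.
                out.write(("running " + version + "\n").getBytes("UTF-8"));
                out.close();
                client.close();
            }
        }
    }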