Rent vs. Buy (or EC2 vs. building your own iron)

Over the past months Jeff Atwood (of Coding Horror fame) has been chronicling Stack Overflows quest for new hardware, starting with “Server Hosting – Rent vs. Buy?” and ending with some glamour shots. I’ve recently (along with others) built a setup for a .Net website in the same “to big for shared or low-end VPS hosting and (much) too small to have dedicated sysadmin staff” segment. We ended up going for Amazon EC2 so I thought I’d share our reasoning by comparing with the Stack Overflow setup.

UPDATE1: Atwood just gave another reason as to why EC2 may be  attractive.

UDATE2: Some of the gloomy projections in this post actually came through (for Stackoverflow, not for us): Tuesday Outage: It’s RAID-tastic!

First some notes on pricing: Mr. Atwood’s three servers costs him a total $6,000, on top of which comes rack space rent, bandwidth and licenses (where he gets off very cheaply by taking advantage of Microsoft’s BizSpark program). We rent two large EC2 instances, one of them with a SQL Server Standard license, for $1.6 pr. hour giving a total of $14,000 pr. year (on top of which comes bandwidth and Elastic Block Store usage). Mr. Atwood could buy all his gear (minus rack space) more than two times over for that money. And except for one important parameter, which I shall expand on later, his machines are much faster: The Database server has eight cores and 24GB of memory, while the Web servers have four cores and 8GB of memory. Our EC2 instances have to get by with just two cores and 7.5GB. An interesting aside is that exactly half the $1.6 goes to licenses (compared with getting non-windows large instances), most of it for SQL Server Standard.

Several commenters had some beefs with the disks in the new Stack Overflow database server and I agree they look rather dinky. The server has six 7200 RPM SATA drives in RAID1 and RAID10 arrays for OS/logs and data files respectively. While the drives are “Enterprise” branded, I hazard the guess that they are pretty much the same as desktop ones, except for a slightly higher MTBF promise and better warranty from the manufacturer. At any rate, 7200 RPM drives can only sustain about 125 random IOs pr. second, and because of the RAIDing, the IO-rate of the array is not six times that. On EC2 we have access to formidable Elastic Block Storage volumes, which are capable of sustaining upwards of 1000 IOPS. Should we need more oomph, the volumes can be soft-raided together until the 1GBps link from the EC2 instance to the EBS volume runs out of steam. (For completeness, I should note that the sequential IO performance of EBS volumes is not very good. That is irrelevant for most database workloads however, since users generally don’t have the good manners to request data in the order it is placed on disk). Mr. Atwood mentions that query execution time decreases nicely with CPU speed. This is obviously an important parameter when building a responsive web site, but I’d venture that query throughput volume is mostly related to disk performance and that we would have an edge here.

Another potential problem is the reliability of the drives, the longer warranty-period not withstanding. Let’s assume for a second (and I admit this is a pretty improbable scenario), that one of the drives in the Stack Overflow RAID10 array (holding SQL Server data files) copped it and went to the great disk-array in the sky. Mr. Atwood would probably get a notification of this, and immediately initiate a backup-operation to the good array (also holding OS and logs). Let’s also assume that at that very moment, the God of the datacenter decides to invoke Murphy’s law on the other disk in the mirror-set, killing it and taking the array and the database with it. Stack Overflow stops flowing, blog posts are written (I shall magnaminously refrain), F# buffs recurse indefinitely trying to post a question about why Stack Overflow is down but finding that Stack Overflow is down. Reddit and Slashdot are notified, further swamping the exasperated web servers – unable as they are to get anything out of the database. Mr. Atwood, in the meantime, is cheering on SQL Server Management studio to restore the latest backup as quickly as possible to the still-good array. He manages to bring the site up within a quarter of an hour, minus all activity since the last backup and running at a somewhat slower clip than usual. Having wiped the sweat from his brow, he still has to drive to the datacenter and swap the two bad drives (unless he trusts the datacenter dudes to do so), getting the usual datacenter tinnitus and a sniffle in the process.

If, on the other hand, the EBS volume holding our database were to die (an even more unlikely event), we would merely create a new volume, attach it to our database instance and restore from backup (conveniently located in nearby S3). Reaction time and data loss would be similar, but performance will not be degraded for any period. Also, I don’t have to plod out to some datacenter and fuss around with a server. Instead I can concentrate on adding new features to the site.

Some people stress the “Elastic” part of EC2, claiming that it is mostly relevant if your hardware requirements are extremely variable or you expect them to increase very rapidly. I think the flexibility it affords is relevant in more modest scenarios too though. Some examples: Need more IPs? Click of a button. Need a test server to try out a new version of your site? Click of a button. Need to increase the size of your database drive? Grab a snapshot and use it to create a bigger volume. Plus all the other features such as a CDN, secure backup in S3 and redundant datacenters that Amazon offers without large upfront costs.

EC2 is no panacea for sure and I agree with Mr. Atwood that poring over specs and reviews and putting together your own gear on the cheap is extremely rewarding. If you value your time and need flexibility though, it might be worth it to limit yourself to building desktop systems and use something like EC2 for hosting.

EC2 SQL Server backup strategies and tactics

The many backup modes offered by Microsoft SQL server, combined with the prodigious hardware options offered on Amazon EC2 can make choosing a backup strategy for your setup a little confusing. In this post, I’ll sketch some options and end with a simple PowerShell script usable on both Express and Standard versions, that’ll backup your database to S3.

To start with, you should probably be running your database off an EBS (Elastic Block Storage) volume. They can sustain many more random IOPS than instance disks (good for typical workloads) and they live independently of your instances. While i haven’t had an instance die from under me, if one should cop it, all data on the local disks will be gone-gone.

EBS volumes can fail too however, and will do se at an annualised rate of 0.1% to 0.5% according to Amazon. You may decide this is good enough for your purposes and leave it at that. Bear in mind, however, that this failure rate is compounded by other factors such as Windows or SQL Server malfunctioning and corrupting the volume, you pressing the wrong button in AWS console/Management Studio, a disgruntled employee doing it on purpose or something else entirely. In other words, you should take backups.

A simple approach is to use the snapshotting feature of EBS. This basically saves the (diff of the) contents of your volume to S3, from whence it can be restored to life if something happens to the volume. I’ve used this to muck around with test-environments and such. It works fine and could conceivably be automated using the AWS API. It’s a rather low-level approach though, and you could easily find yourself restoring from a snapshot taken with SQL Server’s pants around its ankles, in the middle of a transaction. While obviously capable of recovering from such an indescretion and rolling back to a safe state, this can be something of a hassle.

Another option is to do normal backups to another EBS volume mounted on the same instance. While I have no knowledge of Amazon datacenter topologies, one could fear that different EBS volumes attached to the same instance end up being hosted on the same EBS-SAN-thingamebob, the death of which would then also be your undoing.

You could also copy backup-files to another instance mounting its own EBS volume, or set up replication — allowing you to recover very quickly. Note that SQL Server Express can subscribe to a Standard instance in a replication setup, although it cannot publish. Your replicated instance could even live in a different availability zone, although you would then incur bandwidth cost on exchanged data on top of the cost of running an extra instance.

The approach we ended up taking uses S3 however. Amazon promises S3 to be very safe (“no single point of failure”) and has the added benefit of being available independently of EC2 instances. To do a backup, we basically do a full database backup to one of the local disks and then move the file to S3. This is handled by a PowerShell script invoked as a scheduled task, making it usable on SQL Server Express instances (where the native SQL Server backup scheduling is not otherwise available). To handle the S3 interaction, we use the free CloudBerry snap-in. A few gotchas:

  1. If you’re running on a X64 system, install the snap-in with that .Net version
  2. You probably have to modify the PowerShell script execution policy on your instance
  3. You need the DotNetZip lib for zipping

Some possible improvements are zipping of files and shrinking of logfile before upload (*both added February 1. 2009*) and perhaps an incremental backup scheme.

Script is included below.

# This Powershell script is used to backup a SQL Server database and move the backup file to S3
# It can be run as a scheduled task like this:
# C:\WINDOWS\system32\WindowsPowerShell\v1.0\powershell.exe &'C:\Path\dbbackup.ps1'
# Written by Michael Friis (http://friism.com)

$key = "yourkey"
$secret = "yoursecret"
$localfolder = "C:\path\tobackupfolder"
$s3folder = "somebucket/backup/"
$name = Get-Date -uformat "backup_%Y_%m_%d"
$filename = $name + ".bak"
$zipfilename = $name + ".zip"
$dbname = "yourdb"
$dblogname = "yourdb_log"
$ziplibloc = "C:\pathto\ziplib\Ionic.Utils.Zip.dll"

# Remove existing db backup file
if(Test-Path -path ($localfolder + "\" + $filename)) { Remove-Item ($localfolder + "\" + $filename) }

$query =
"
USE {2}
GO

DBCC SHRINKFILE({3})

GO

BACKUP DATABASE [{2}] TO  DISK = N'{0}\{1}'
        WITH NOFORMAT, NOINIT,  NAME = N'backup', SKIP, REWIND, NOUNLOAD,  STATS = 10
GO
declare @backupSetId as int
select @backupSetId = position from msdb..backupset
where database_name=N'{2}' and backup_set_id=(select max(backup_set_id)
from msdb..backupset where database_name=N'{2}' )

if @backupSetId is null
begin
        raiserror(N'Verify failed. Backup information for database ''{2}'' not found.', 16, 1)
end
RESTORE VERIFYONLY FROM  DISK = N'{0}\{1}'
        WITH  FILE = @backupSetId,  NOUNLOAD,  NOREWIND" -f $localfolder, $filename, $dbname, $dblogname

sqlcmd -Q $query -S "."

# Remove existing zip file
if(Test-Path -path ($localfolder + "\" + $zipfilename)) { Remove-Item ($localfolder + "\" + $zipfilename) }

#Zip the backup file
[System.Reflection.Assembly]::LoadFrom($ziplibloc);
$zipfile =  new-object Ionic.Utils.Zip.ZipFile($localfolder + "\" + $zipfilename);
$e= $zipfile.AddFile($localfolder + "\" + $filename)
$zipfile.Save()

#Upload to S3
Add-PSSnapin CloudBerryLab.Explorer.PSSnapIn
$s3 = Get-CloudS3Connection -Key $key -Secret $secret
$destination = $s3 | Select-CloudFolder -path $s3folder
$src = Get-CloudFilesystemConnection | Select-CloudFolder $localfolder
$src | Copy-CloudItem $destination –filter $zipfilename