Yesterday I attended a talk by Randy Bias of Cloudscaling. The talk was called "Pets vs. Cattle: The Elastic Cloud."
I found Randy's perspective on infrastructure and system administration really interesting, so I'm going to explain some of the key points he shared with us.
In the olden days, before the "cloud" as we know it, companies favored the "enterprise computing" model of system administration. The enterprise computing model is:
- Smart hardware
Under the old-fashioned "enterprise computing" infrastructure model, servers were given cutesy names like "Cookie," "Dakota," "Reagan," or "Aardvark." Each server was procured individually and configured by hand (often by several different people). Because each server was configured manually, no two servers were exactly alike. Each machine was like a special snowflake.
And if one of the servers were to suddenly fall ill (perhaps it stopped responding), it was all hands on deck to bring the server back to life.
Basically, the servers were treated like pets.
Image credit: Christian Haugen on Flickr
Instead of treating the machines as pets, Randy told the audience, we should be treating them as cattle.
"When one of them gets sick, you shoot 'em in the head and replace 'em with a new one."—Randy Bias, CEO of Cloudscaling
Image credit: thskyt on Flickr
Under the system administration 2.0 model, configuration and deployment are automated and each server is expendable:
- Commodity hardware
- Open source
Configuration is automated. Deployments are automated. If one of the servers starts having issues, just get rid of it and spin up a brand new one in its place.
Best practice, said Randy, is to develop software that automatically detects malfunctioning machines, retires them, and spins up new ones in their place.
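To make that detect, retire, and replace loop concrete, here is a minimal Python sketch. It is purely illustrative and runs entirely in memory: the `Server` class, the `spin_up` helper, and the `web-NNN` naming scheme are all made up for this example and are not from Randy's talk or from any real cloud API.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True


# Numbered names instead of pet names: servers are interchangeable.
_ids = itertools.count(1)


def spin_up() -> Server:
    """Stand-in for an automated provisioning call (hypothetical)."""
    return Server(name=f"web-{next(_ids):03d}")


def reconcile(fleet: list[Server]) -> list[Server]:
    """Keep healthy servers; replace each unhealthy one with a fresh server."""
    return [s if s.healthy else spin_up() for s in fleet]


fleet = [spin_up() for _ in range(3)]
fleet[1].healthy = False          # simulate web-002 falling ill
fleet = reconcile(fleet)
print([s.name for s in fleet])    # ['web-001', 'web-004', 'web-003']
```

Notice that nobody tries to nurse `web-002` back to health. It simply disappears and `web-004` takes its slot.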
Netflix has been on the cutting edge of cloud computing for several years. They have a number of tools, all with a sort of simian theme, designed to identify and remove underperforming machines:
- Conformity Monkey finds instances that fail to conform to best practices and shuts them down.
- Doctor Monkey runs health checks on each instance and removes the sick instances from service.
- Security Monkey finds security violations or vulnerabilities and terminates the offending instances.
Netflix even takes it a step further. They actually designed a service that kills healthy servers in order to test the fault tolerance of their infrastructure.
Netflix developed a system called—I love this name—the Chaos Monkey.
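A toy version of the idea might look like the Python sketch below. This is not Netflix's implementation. The instance names, the `chaos_monkey` and `service_is_up` functions, and the "at least two instances" rule are assumptions invented for illustration, and the "termination" just removes a name from a set.

```python
import random


def chaos_monkey(instances: set[str], rng: random.Random) -> str:
    """Pick one running instance at random and terminate it (simulated)."""
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim


def service_is_up(instances: set[str], minimum: int = 2) -> bool:
    """Hypothetical health rule: the service needs at least `minimum` instances."""
    return len(instances) >= minimum


instances = {"web-001", "web-002", "web-003"}
rng = random.Random(42)  # seeded only so repeated runs behave the same way

victim = chaos_monkey(instances, rng)
print(f"killed {victim}; service still up: {service_is_up(instances)}")
```

The point of the exercise is the check at the end: if losing one randomly chosen instance takes the service down, you want to find that out on an ordinary afternoon rather than during a real outage.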
"The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage."—Netflix Tech Blog
Randy posted the slides from his talk here.
You should also check out Noah Slater's "Pets vs. Cattle" blog post.