Ruminations: Cloud Scale Data Centers

As Microsoft evolves its data center strategy, it is relying more on fault-tolerant software platforms like Azure, which were built to run on large clusters of commodity hardware, rather than continuing to invest in redundant hardware and power systems. The idea is to operationalize the response to failures by recovering automatically and seamlessly when services fail. This can be achieved by replicating state across machines, eliminating dependencies between software and hardware components, adding instrumentation and run-time telemetry, and automating responses to failures. The system must also be simple: it should use well-understood design patterns and stay simple enough to be triaged easily. By adopting these principles, Microsoft is reducing the complexity of its data centers and lowering its TCO considerably. It's also less likely that a meteor strike or the next superstorm will cause an outage!
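To make the idea of replicated state and automated failure response a little more concrete, here is a minimal sketch in Python. It is a toy illustration only, not how Azure or Microsoft's data centers actually work: the `Replica` and `Cluster` classes, the `heal` method, and the `events` list (a stand-in for telemetry) are all hypothetical names invented for this example. It also ignores real-world concerns such as bringing a recovered replica back in sync.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One machine holding a full copy of the state (hypothetical model)."""
    name: str
    healthy: bool = True
    state: dict = field(default_factory=dict)


class Cluster:
    """Toy cluster: every write is copied to every healthy replica."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.primary = self.replicas[0]
        self.events = []  # stand-in for run-time telemetry

    def heal(self):
        """Automated response: promote a healthy replica if the primary failed."""
        if self.primary.healthy:
            return
        for replica in self.replicas:
            if replica.healthy:
                self.events.append(f"failover {self.primary.name} -> {replica.name}")
                self.primary = replica
                return
        raise RuntimeError("no healthy replica left")

    def write(self, key, value):
        self.heal()
        for replica in self.replicas:
            if replica.healthy:
                replica.state[key] = value

    def read(self, key):
        self.heal()
        return self.primary.state.get(key)


if __name__ == "__main__":
    cluster = Cluster(["a", "b", "c"])
    cluster.write("order", 42)
    cluster.replicas[0].healthy = False  # simulate a machine failure
    print(cluster.read("order"))         # the read is served by a surviving replica
    print(cluster.events)                # the failover is recorded, not hidden
```

The point of the sketch is that no redundant hardware protects machine "a"; the software simply notices the failure, records it, and carries on from another copy of the state.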

You can read more about Microsoft's cloud scale data centers at http://www.globalfoundationservices.com/posts/2013/february/11/software-reigns-in-microsofts-cloud-scale-data-centers.aspx.

Enterprises can achieve similar results by adopting some of the aforementioned principles. For instance, when building new applications, developers should work alongside operations to examine the ramifications of their design decisions on the underlying infrastructure, and the two groups should design the application together to be elastic, self-healing, and fault-tolerant. Too often, developers build their applications in isolation, only to throw them over the wall to operations, who then have to manage to an SLA that they didn't define. Shared goals that optimize the whole stack give these groups an incentive to work together, which in turn should lower TCO and improve time to recovery.

As for hardware redundancy, once the software becomes resilient, there's less need for redundant hardware. This is evident in Microsoft's latest data centers, parts of which aren't backed up by generators and where the UPS only lasts long enough to move the load elsewhere. It's this simple design that is helping Microsoft lower its CapEx and OpEx with each generation.


Friday, February 15, 2013
