At Microsoft, we've developed a business model to help manage costs, capacity, and our investments. We also use it to influence behavior, which I'll come to shortly. But first, when we look at costs, we're really looking at the costs to deliver our cloud infrastructure, where the accountability lies, and the components that comprise our cost allocation model. When we look at capacity, we're really looking at the resources that are currently being consumed versus what's available. And finally, when we look at our investments, we're really looking at where we invest and how to optimize those investments to get the greatest return.
In terms of costs, there's the cost of the infrastructure that our online services run on. This includes data center services, bandwidth, networking, storage, and incident management costs. Missing from this list are things like lead-time capacity for future growth, which GFS is accountable for. GFS is also responsible for bringing down our infrastructure costs over time. Then there are the direct costs incurred by our properties or online services, e.g. dedicated servers and services. The properties themselves are accountable for things like rated kW and their online services' direct costs.
Our costs are based on granular rates that have been adjusted by service and location, making the costs as fair and equitable as possible. As I mentioned earlier, GFS is accountable for scaling mixed adjusted rates while consistently driving its rates down year over year.
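To make the idea of rates adjusted by service and location concrete, here is a minimal sketch of how such a chargeback calculation could work. It is not our actual model: the base rate, the adjustment factors, the property names, and the functions `adjusted_rate` and `allocate` are all hypothetical, chosen only to illustrate the mechanics.

```python
from dataclasses import dataclass

# Hypothetical figures for illustration only; none of them come from Microsoft.
BASE_RATE_PER_KW = 100.0                                 # monthly cost per rated kW
SERVICE_FACTOR = {"compute": 1.0, "storage": 1.2}        # adjustment by service
LOCATION_FACTOR = {"us-west": 0.9, "eu-north": 1.1}      # adjustment by location


@dataclass
class Usage:
    """Rated power consumed by one property for one service in one location."""
    prop: str
    service: str
    location: str
    rated_kw: float


def adjusted_rate(service: str, location: str) -> float:
    """Base rate scaled by the service and location adjustment factors."""
    return BASE_RATE_PER_KW * SERVICE_FACTOR[service] * LOCATION_FACTOR[location]


def allocate(usages: list[Usage]) -> dict[str, float]:
    """Sum the adjusted cost of every usage record per property."""
    totals: dict[str, float] = {}
    for u in usages:
        cost = u.rated_kw * adjusted_rate(u.service, u.location)
        totals[u.prop] = totals.get(u.prop, 0.0) + cost
    return totals


if __name__ == "__main__":
    usages = [
        Usage("search", "compute", "us-west", 10),
        Usage("search", "storage", "eu-north", 5),
        Usage("mail", "compute", "eu-north", 2),
    ]
    for prop, cost in allocate(usages).items():
        print(f"{prop}: {cost:.2f}")
```

Because every property is billed from the same rate table, lowering a rate benefits every property that uses that service in that location.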
The idea behind the rate structure and our cost allocation model is to drive convergence of platforms and infrastructure and deliver an SLA that drives reliability into the software instead of the underlying infrastructure.
As Microsoft evolves its data center strategy, it is increasing its use of fault-tolerant software platforms like Azure, which were built to run on large clusters of commodity hardware, rather than continuing to invest in redundant hardware and power systems. The idea is to operationalize the response to failures by automatically and seamlessly recovering when services fail. This can be achieved by replicating state among various machines, eliminating dependencies between software and hardware components, adding instrumentation and run-time telemetry, and automating responses to failures. The system must also be simple: it should use design patterns that are well understood and remain simple enough to be easily triaged. By adopting these principles, Microsoft is reducing the complexity of its data centers and improving its TCO considerably. It's also less likely that a meteor strike or the next superstorm will cause an outage!
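As a rough illustration of those principles, the sketch below replicates state across several machines, records each failure as simple telemetry, and automatically falls back to a healthy replica on reads. It is a toy under stated assumptions, not how Azure works: `Replica`, `ReplicatedStore`, and `ReplicaDown` are hypothetical names, failures are simulated with a flag, and a replica that recovers is not resynchronized.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve requests."""


class Replica:
    """One machine holding a full copy of the state (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True  # flipped to False to simulate a failure

    def put(self, key, value):
        if not self.healthy:
            raise ReplicaDown(self.name)
        self.data[key] = value

    def get(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


class ReplicatedStore:
    """Writes to every healthy replica; reads from the first healthy one."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.failures = []  # telemetry: names of replicas that failed a request

    def put(self, key, value):
        written = 0
        for replica in self.replicas:
            try:
                replica.put(key, value)
                written += 1
            except ReplicaDown:
                self.failures.append(replica.name)
        if written == 0:
            raise RuntimeError("no healthy replica accepted the write")
        return written  # number of copies that were updated

    def get(self, key):
        for replica in self.replicas:
            try:
                return replica.get(key)
            except ReplicaDown:
                # Automated response: note the failure and try the next copy.
                self.failures.append(replica.name)
        raise RuntimeError("no healthy replica could serve the read")


if __name__ == "__main__":
    a, b, c = Replica("a"), Replica("b"), Replica("c")
    store = ReplicatedStore([a, b, c])
    store.put("session", "alice")
    a.healthy = False            # simulate a machine failure
    print(store.get("session"))  # served by replica "b"
    print(store.failures)
```

The caller never sees the failed machine; the store absorbs the failure and leaves a record that operations can use for triage.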
You can read more about Microsoft's cloud scale data centers at http://www.globalfoundationservices.com/posts/2013/february/11/software-reigns-in-microsofts-cloud-scale-data-centers.aspx.
Enterprises can achieve similar results by adopting some of the aforementioned principles. For instance, when building new applications, developers should work alongside operations to examine the ramifications their design decisions have on the underlying infrastructure, and work together to design the application to be elastic, self-healing, and fault-tolerant. Too often, developers build their applications in isolation, only to throw them over the wall to operations, who then have to manage to an SLA that they didn't define. Having shared goals that are designed to optimize the whole stack will incent these different groups to work together, which in turn should lower TCO and improve time to recovery.
As for hardware redundancy, once the software becomes resilient, there's less need for redundant hardware. This is evident in Microsoft's latest data centers, parts of which aren't backed up by generators and where the UPS only lasts long enough to move the load elsewhere. It's this simple design that is helping Microsoft lower its CapEx and OpEx with each generation.
All Things D published an article today about why the data center industry ought to adopt a JIT approach to building data centers. This is effectively what Microsoft is doing in its new Gen 4 data centers, where it has a global supply chain of manufacturers who build different modules (e.g. power modules, IT modules, cooling modules, etc.) that can be assembled in different configurations to deliver different classes of service. Now, rather than building a huge custom-designed facility and gradually filling it to capacity, Microsoft can take a JIT approach to adding capacity, allowing it to respond more quickly to demand signals from its various properties while delivering outstanding PUEs. Moreover, these new Gen 4 data centers are significantly less expensive for Microsoft to build and operate, in part because less of the site is being set aside for electrical and mechanical equipment and because they're being cooled by air-side economizers. In its newest designs, these modules rest outside on a concrete pad.
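For readers unfamiliar with the metric, PUE (power usage effectiveness) is the ratio of the total energy a facility draws to the energy delivered to its IT equipment, so a value closer to 1.0 means less overhead for cooling and power distribution. The small sketch below shows the arithmetic; the function name `pue` and the sample readings are hypothetical and are not measurements from any Microsoft facility.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        raise ValueError("total facility energy cannot be less than IT energy")
    return total_facility_kwh / it_equipment_kwh


if __name__ == "__main__":
    # Hypothetical monthly readings for two facility designs.
    print(f"traditional: {pue(1_800_000, 1_000_000):.2f}")
    print(f"modular:     {pue(1_150_000, 1_000_000):.2f}")
```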
Modularity is not a panacea, though. There has to be uniformity and standards for it to work well. Additionally, you need an application that is highly tolerant of hardware failures because the module that houses the IT equipment is now a fault domain. Consider what would happen if there were a fire inside a module. Also, how easy will it be to retrofit the container with new gear? How much up-front engineering will be required to accommodate a modular design? These are things you'll want to contemplate before moving forward with a modular approach.
There have been many articles written about the pros and cons of modular data centers, including a video of Kevin Brown, the author of the All Things D article, from October 2012 when he was a speaker at the Data Center World Conference.
As David Linthicum wrote in his blog today, DDoS attacks on cloud computing infrastructures are steadily increasing. When people come visit our data centers, I always try to stress how we're able to apply security and privacy resources to an extent that would be cost prohibitive for a lot of organizations to implement themselves. Moreover, all of the principles of the Trustworthy Computing initiative that Bill Gates launched in 2002 now apply to the Microsoft cloud, including the Security Development Lifecycle (SDL), which requires multiple code reviews and threat analyses before a service is released to the web.
If you're interested in learning more about how we secure our cloud infrastructure, I encourage you to read http://www.globalfoundationservices.com/posts/2009/may/27/securing-microsoft’s-cloud-infrastructure.aspx.