Monday, April 21, 2014

There's a fly in my vSoup

A couple of weeks ago, Derrick Harris and Barb Darrow interviewed the VP of Marketing from RightScale about a survey they recently conducted to measure public cloud usage; see https://gigaom.com/2014/04/10/enterprise-cloud-adoption-amazon-is-big-natch-but-watch-out-for-vsoup/. It should come as no surprise that AWS ranked first among all cloud providers. After all, they are far and away the market leader in cloud computing. What happened next was less expected, however. According to RightScale, VMware finished second behind AWS, closely followed by Azure and Google. This result was so surprising that RightScale went back and re-interviewed the respondents who said they were using VMware's public cloud, both because a) vCHS had only been in market since September 2013 and b) there is real confusion surrounding VMware's product portfolio. For instance, a lot of VMware customers running vSphere think they're using a cloud, but upon further inspection their environments lack the characteristics of one, e.g. elastic pools of resources, metered services, broad accessibility, and so on. When the results of the secondary survey were tabulated, RightScale discovered that fewer respondents were actually using vCHS.

Yet despite the confusion, the survey still showed strong interest in vCHS. This is perhaps less surprising when you consider VMware's dominant market share in the virtualization space and the growing interest in hybrid clouds. For organizations that have already standardized on the VMware hypervisor, vCHS is a logical choice. Not only is it compatible with vSphere, but moving a VM between vSphere and vCHS is a relatively trivial operation. This freedom to choose [where workloads run] is especially attractive to enterprises that are fearful of vendor lock-in or are generally apprehensive about cloud computing.

What remains to be seen is whether VMware can convert this interest into paying customers and whether it can offer enough differentiated value to compete against the big 3.  For now though, I think they'll settle for being a fly in the ointment.  

Waiter, can I get a new bowl of soup?

Tuesday, May 28, 2013

TechEd 2013, New Orleans

For those of you going to Microsoft TechEd 2013 in New Orleans this year, come see me at the Server and Tools booth where I will be talking and fielding questions about Microsoft's data centers.  The booth will also offer an opportunity to see images and artifacts from a handful of these impressive facilities.  Hope to see you there.

Thursday, April 18, 2013

Just because you can move workloads to the cloud, doesn't mean you should

Occasionally I encounter enterprises that say they want to move their legacy applications to the cloud. When I do, they usually say it's because they're running out of power and space in their own data centers, or because they don't have the human capital to continue managing the infrastructure supporting those applications. The challenge with legacy applications is that very few of them were designed to scale horizontally. Instead, they scale vertically: you add compute and I/O capacity to an existing VM instance rather than adding new instances. This presents a problem for a lot of IaaS offerings, because VM instances only come in a few sizes and reconfiguring an instance often requires downtime. Moreover, a lot of legacy applications are brittle, with hard dependencies on the underlying infrastructure or on other components of the application, such that when a failure occurs, users experience downtime. Finally, legacy applications may have traffic patterns, like heavy read operations, that adversely affect their total cost of ownership, because many IaaS and PaaS offerings charge for network egress.

The cloud is ideally suited for loosely coupled applications that are designed to scale horizontally. When building new applications, enterprises should design them for resiliency, i.e. they should assume failures will periodically occur and design the application to gracefully recover from those failures. If possible, the application should continue running with a reduced set of capabilities rather than experiencing an outage. A good way to test the resiliency of an application is by injecting faults. Netflix has done a lot of work in this area and has since open-sourced its Chaos Monkey fault-injection system, which tests how a service reacts to different types of failures. For more information, see http://techblog.netflix.com/2011/07/netflix-simian-army.html
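To make this concrete, here's a rough sketch of fault injection plus graceful degradation. This is not Chaos Monkey itself; the service name, failure rate, and fallback list are made up for illustration:

```python
import random

# Hypothetical sketch of fault injection plus graceful degradation; the service
# and item names are invented and this is not Netflix's actual tooling.

def fetch_recommendations(user_id):
    """Primary path: call a downstream recommendation service."""
    if random.random() < 0.2:  # injected fault: fail ~20% of calls during testing
        raise ConnectionError("injected fault: recommendation service unavailable")
    return [f"item-{i}" for i in range(5)]

def fetch_recommendations_with_fallback(user_id):
    """Degrade gracefully: serve a generic list instead of failing outright."""
    try:
        return fetch_recommendations(user_id)
    except ConnectionError:
        # Reduced set of capabilities rather than an outage.
        return ["popular-item-1", "popular-item-2"]

if __name__ == "__main__":
    for uid in range(10):
        print(uid, fetch_recommendations_with_fallback(uid))
```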

So what types of applications are good for the cloud? Good candidates are applications with predictable or unpredictable bursting, where there are intense spikes in activity. Ordinarily, if you were to run these sorts of applications in your own data center, you'd have to plan and provision for these periodic spikes. With cloud computing, you only pay for the added capacity when you need it. Applications that are growing fast, or those with on-off usage patterns, like a performance review application, are also suitable candidates.
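Here's a toy sketch of the pay-for-what-you-use idea: a scale-out/scale-in rule for a bursty workload. The thresholds, instance counts, and utilization numbers are invented and aren't tied to any particular provider:

```python
# Toy scale-out/scale-in rule; thresholds, counts, and utilization are invented.

def desired_instances(current, cpu_utilization, min_instances=2, max_instances=20):
    """Scale out when utilization is high, scale back in when it is low."""
    if cpu_utilization > 0.75:
        return min(current + 2, max_instances)
    if cpu_utilization < 0.30:
        return max(current - 1, min_instances)
    return current

if __name__ == "__main__":
    instances = 2
    # Simulated hourly CPU utilization for a bursty, on-off workload.
    for hour, cpu in enumerate([0.20, 0.45, 0.80, 0.90, 0.85, 0.50, 0.25, 0.10]):
        instances = desired_instances(instances, cpu)
        print(f"hour {hour}: cpu={cpu:.0%} -> {instances} instance(s)")
```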

Unfortunately, a majority of legacy applications did not account for the on-demand computing environment we have today, so when you move them to the cloud you are essentially paying for them to run all the time. That said, you may still be able to derive savings by moving some of these workloads to the cloud, at least from a power standpoint. Let me explain: say you have a server that consumes 453 watts of power on average. If your data center has a PUE of 1.8, which is the industry average, you will pay roughly $491 a year to keep that server powered on at US industrial power rates. Compare that to a medium-size VM in Azure, which is only $115 per year, or an extra large instance, which at $460 per year is still less than it would cost to run in your own data center, and that's for 8 x 1.6 GHz CPUs and 14 GB of RAM.
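For the curious, here's the back-of-the-envelope math behind that $491 figure. The electricity rate is my assumption (roughly the US industrial average at the time); the only inputs given above are the wattage, the PUE, and the result:

```python
# Back-of-the-envelope annual power cost for the example above.
# The electricity rate is an assumption (~US industrial average at the time).

server_watts = 453          # average server draw
pue = 1.8                   # industry-average PUE cited above
rate_per_kwh = 0.069        # assumed industrial rate, USD/kWh
hours_per_year = 24 * 365

total_kwh = server_watts * pue * hours_per_year / 1000.0
annual_cost = total_kwh * rate_per_kwh
print(f"{total_kwh:,.0f} kWh/year -> ${annual_cost:,.0f}/year")  # ~7,143 kWh -> ~$493
```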

A Surround Strategy
When talking about legacy applications, I like to use the following analogy. LOB or legacy applications generate data, which is the equivalent of gold in the ground. It has value to the enterprise, but it may be restricted to a segment of the workforce and it can only be mined by the LOB applications themselves. Rather than moving those applications to the cloud, you can surround them with lightweight, modern applications delivered from the cloud that expose the data, turning the gold into jewelry and ultimately creating greater value for the enterprise. This is essentially what Microsoft has done with its ERP system, where it created a web services layer to expose information that can be consumed by these lightweight, purpose-built apps.
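As a rough illustration of the surround pattern, here's a minimal sketch of a read-only web facade over a legacy system. The endpoint, the order data, and the query_legacy_erp helper are all hypothetical; the point is simply that the LOB system stays put while its data becomes consumable by lightweight apps:

```python
# Minimal sketch of a "surround" facade: a thin, read-only web service over a
# legacy data source. The endpoint, data, and helper names are hypothetical.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def query_legacy_erp(order_id):
    """Stand-in for a call into the legacy LOB system (e.g., a SQL query)."""
    return {"order_id": order_id, "status": "shipped", "total": 125.00}

class OrderFacade(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expose legacy data as JSON at /orders/<id> for lightweight cloud apps.
        if self.path.startswith("/orders/"):
            order = query_legacy_erp(self.path.rsplit("/", 1)[-1])
            body = json.dumps(order).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), OrderFacade).serve_forever()
```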

Tuesday, February 26, 2013

Microsoft News Cycle, week of February 17th

Last week was relatively quiet for Microsoft, news-wise. There was a series of articles written about Microsoft's plans to expand its facility in Cheyenne, Wyoming. There was also a study done by Nasuni showing that Azure storage outperformed comparable offerings from Amazon and Rackspace. This is partly a result of our ongoing efforts to flatten our network and improve east-west throughput within the data center. And finally, there was a story written about an Office 365 win at the State of Texas, where security, privacy, and compliance were important considerations.

Below are the links to the articles that were written:
 
Datacenter Dynamics
Microsoft’s planning Cheyenne data center extension

Wyoming Business Report
Microsoft looking to expand Cheyenne data center

ZDNet
Microsoft Azure pips Amazon as king of cloud storage

Billings Gazette
Microsoft looks to expand Cheyenne data center
 
Wyoming Tribune Eagle
Microsoft already eyeing expansion

Government Security News

Monday, February 18, 2013

Managing Cloud Infrastructure Costs

At Microsoft, we've developed a business model to help manage costs, capacity, and our investments. We also use it to influence behavior, which I'll come to shortly. But first, when we look at costs, we're really looking at the cost to deliver our cloud infrastructure, where the accountability lies, and the components that comprise our cost allocation model. When we look at capacity, we're really looking at the resources that are currently being consumed versus what's available. And finally, when we look at our investments, we're really looking at where we invest and how to optimize those investments to get the greatest return.

In terms of costs, there's the cost of the infrastructure that our online services use to run their services. This includes data center services, bandwidth, networking, storage, and incident management costs. Missing from this are things like lead-time capacity for future growth, which is what GFS is accountable for. GFS is also responsible for bringing down our infrastructure costs over time. Then there are the direct costs incurred by our properties, or online services, e.g. dedicated servers and services. The properties themselves are accountable for things like rated kW and the online services' direct costs.

Our costs are based on granular rates that have been adjusted by service and location, making the costs as fair and equitable as possible. As I alluded to earlier, GFS is accountable for scaling mixed adjusted rates while consistently driving those rates down year over year.
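To make the rate idea concrete, here's a simplified sketch of what a rate-based chargeback calculation might look like. The services, rates, and location adjustments are invented for illustration and don't reflect the actual GFS cost model:

```python
# Simplified chargeback sketch: the rates and location adjustments below are
# invented for illustration and do not reflect Microsoft's actual cost model.

base_rates = {"compute_kwh": 0.07, "storage_tb_month": 22.0, "egress_gb": 0.08}
location_factor = {"quincy": 0.95, "cheyenne": 0.90, "dublin": 1.10}

def monthly_charge(usage, location):
    """usage: metered consumption keyed by the same names as base_rates."""
    factor = location_factor[location]
    return sum(base_rates[item] * qty * factor for item, qty in usage.items())

if __name__ == "__main__":
    usage = {"compute_kwh": 12000, "storage_tb_month": 40, "egress_gb": 5000}
    print(f"${monthly_charge(usage, 'cheyenne'):,.2f}")
```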

The idea behind the rate structure and our cost allocation model is to drive convergence of platforms and infrastructure and deliver an SLA that drives reliability into the software instead of the underlying infrastructure.

Friday, February 15, 2013

Cloud Scale Data Centers

As Microsoft evolves its data center strategy, it is increasing its use of fault-tolerant software platforms like Azure, which were built to run on large clusters of commodity hardware, rather than continuing to invest in redundant hardware and power systems. The idea is to operationalize the response to failures by automatically and seamlessly recovering when services fail. This can be achieved by replicating state among various machines, eliminating dependencies between software and hardware components, adding instrumentation and run-time telemetry, and automating responses to failures. The system must also be simple, built on well-understood design patterns, so that it can be easily triaged. By adopting these principles, Microsoft is reducing the complexity of its data centers and improving its TCO considerably. It's also less likely that a meteor strike or the next superstorm will cause an outage!
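As a toy illustration of operationalizing the response to failures, here's a sketch of routing requests around an unhealthy replica automatically rather than paging a human. The replica names and the health probe are made up:

```python
# Toy illustration of automated failure handling: probe replicas and route
# around the unhealthy ones. Replica names and the probe are made up.
import random

REPLICAS = ["replica-a", "replica-b", "replica-c"]

def is_healthy(replica):
    """Stand-in for a real health probe (HTTP ping, heartbeat, etc.)."""
    return random.random() > 0.3  # simulate occasional failures

def pick_healthy_replica():
    healthy = [r for r in REPLICAS if is_healthy(r)]
    if not healthy:
        raise RuntimeError("no healthy replicas; escalate to a human")
    return random.choice(healthy)

if __name__ == "__main__":
    for request_id in range(5):
        print(f"request {request_id} -> {pick_healthy_replica()}")
```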

You can read more about Microsoft's cloud scale data centers at http://www.globalfoundationservices.com/posts/2013/february/11/software-reigns-in-microsofts-cloud-scale-data-centers.aspx.   

Enterprises can achieve similar results by adopting some of the aforementioned principles. For instance, when building new applications, developers should work alongside operations to examine the ramifications their design decisions have on the underlying infrastructure and work together to design the application to be elastic, self-healing, and fault-tolerant. Too often, developers build their applications in isolation, only to throw them over the wall to operations, who then have to manage to an SLA they didn't define. Having shared goals that are designed to optimize the whole stack will incent these different groups to work together, which in turn should lower TCO and improve time to recovery.

As for hardware redundancy, once the software becomes resilient, there's less need for redundant hardware. This is evident in Microsoft's latest data centers, where parts of the facility aren't backed up by generators and the UPS lasts only long enough to move the load elsewhere. It's this simple design that is helping Microsoft lower its CapEx and OpEx with each generation.
 

Friday, February 8, 2013

Just-in-Time Data Centers

All Things D published an article today about why the data center industry ought to adopt a JIT approach to building data centers. This is effectively what Microsoft is doing in its new Gen 4 data centers, where it has a global supply chain of manufacturers that build different modules, e.g. power modules, IT modules, cooling modules, etc., which can be assembled in different configurations to deliver different classes of service. Now, rather than building a huge custom-designed facility and gradually filling it to capacity, Microsoft can take a JIT approach to adding capacity, allowing it to respond more quickly to demand signals from its various properties while delivering outstanding PUEs. Moreover, these new Gen 4 data centers are significantly less expensive for Microsoft to build and operate, in part because less of the site is being set aside for electrical and mechanical equipment and the modules are being cooled by air-side economizers. In its newest designs, these modules rest outdoors on a concrete pad.

Modularity is not a panacea though. There has to be uniformity and standards for it to work well. Additionally, you need an application that is highly tolerant of hardware failures, because the module that houses the IT equipment is now a fault domain. Consider what would happen if there were a fire inside a module. Also, how easy will it be to retrofit the container with new gear? How much up-front engineering will be required to accommodate a modular design? These are things you'll want to contemplate before moving forward with a modular approach.
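One way to reason about the fault-domain point is to treat each module as a placement constraint: spread replicas across modules so that losing one module (say, to a fire) doesn't take out every copy. A minimal sketch, with hypothetical module names:

```python
# Sketch: spread replicas across modules (fault domains) so one module failure
# doesn't take out every copy. Module names and replica count are hypothetical.

MODULES = ["it-module-1", "it-module-2", "it-module-3"]

def place_replicas(n_replicas, modules=MODULES):
    """Assign each replica to a distinct module; refuse to co-locate copies."""
    if n_replicas > len(modules):
        raise ValueError("more replicas than fault domains; copies would be co-located")
    return modules[:n_replicas]

if __name__ == "__main__":
    print(place_replicas(3))  # ['it-module-1', 'it-module-2', 'it-module-3']
```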

There have been many articles written about the pros and cons of modular data centers, including a video of Kevin Brown, the author of the All Things D article, from October 2012, when he was a speaker at the Data Center World Conference.