Wednesday, January 23, 2013

Microsoft's data center evolution


Microsoft opened its first data center in 1989.  Data centers from this era, considered generation 1, employed little to no airflow management; rooms were kept very cool, which resulted in PUEs (power usage effectiveness, the ratio of total facility power to IT equipment power) of 2.0 or higher.  The original rationale for building these data centers was to consolidate compute resources that were previously distributed across the network.  Today, organizations with these types of data centers are struggling because they're running out of power, space, or cooling capacity.
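To make the PUE numbers concrete, here is a rough illustration (the wattage figures below are made up for the example, not Microsoft's actual numbers):

# Hypothetical illustration of PUE (power usage effectiveness).
# PUE = total facility power / IT equipment power; 1.0 is the ideal.
def pue(it_kw, cooling_kw, distribution_kw):
    total_kw = it_kw + cooling_kw + distribution_kw
    return total_kw / it_kw

# Generation 1: heavy cooling overhead -> PUE around 2.0
print(round(pue(it_kw=1000, cooling_kw=850, distribution_kw=150), 2))   # 2.0

# Generation 3 container: tight airflow control and economizers -> PUE around 1.25
print(round(pue(it_kw=1000, cooling_kw=200, distribution_kw=50), 2))    # 1.25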

In 2007, Microsoft decided to start building, designing, and directly operating its own data centers because the cost of maintaining its generation 1 facilities was rising too fast.  These generation 2 data centers were primarily about increasing density and accelerating deployment.  Unlike the previous generation, where equipment was installed into racks piecemeal and was often non-uniform, racks fully populated with blade servers were now deployed and brought online very quickly.  Moreover, airflow was now being optimized for the rack instead of the individual server, which improved the efficiency of these data centers considerably, achieving PUEs between 1.4 and 1.6.

A lot of today's modern data centers would be considered generation 2 by Microsoft's standards: high-density racks of blade servers aligned into hot and cold aisles, with or without hot-aisle containment systems.

A year later, Microsoft adopted the concept of containment and started deploying servers in ISO standard shipping containers.  These containers allowed Microsoft to deploy large quantities of servers very quickly and with predictable results because of the uniformity of the equipment inside them.  For example, when a container arrives onsite, it can be fully provisioned and operational within 8 hours.  Moreover, by tightly regulating airflow inside the container, raising the set point temperature, and increasing its use of air- and water-side economizers, Microsoft was able to improve its efficiency to the point where these generation 3 containerized data centers now operate with PUEs between 1.2 and 1.5.

In its latest data center designs, considered generation 4, Microsoft is incorporating all the learning from its previous generations and is now deploying modular data centers: it builds an engineering spine, and modules are connected to it in a plug-and-play fashion.  With this design, Microsoft is able to reduce its operating expenses because it's using adiabatic cooling, which works like a swamp cooler, to cool the servers inside its IT pre-assembled components (ITPACs).  This type of cooling is considerably less expensive (and uses less water) than operating chillers because the power is being used to move air rather than chill water.  Microsoft is also reducing its capital expenses with this latest design because less of the data center is set aside for mechanicals like chillers and other supporting equipment.  Additionally, the components to build the data center are supplied by several vendors from around the world.  This gives Microsoft a just-in-time approach to building data centers, where it can quickly add capacity according to the demand signals it receives from the service teams.

Thursday, January 17, 2013

XBOX and the data center

Today I read an article that explains how Microsoft incorporates the learning from operating massive services like Bing, XBOX, and Windows Azure into products like Windows Server and System Center.  No other vendor, with perhaps the exception of OpenStack, which can serve as either a public or a private cloud, has this sort of feedback mechanism.  The difference is that OpenStack only provides IaaS, whereas Microsoft now hosts over 200 services that are available in over 70 countries worldwide.  This experience is helping Microsoft improve its economies of scale, its efficiency, and its service availability, which ultimately helps organizations using the Windows Server ecosystem achieve similar results in their own data centers.

Several things that are now part of Windows Server, boot to VHD for example, were initially developed for Windows Azure.  With the enhancements that Microsoft is now delivering in Windows Server, organizations can build clouds that share many of the attributes of Azure, like pooled resources, elasticity, self-service provisioning, and so on.  And because Azure and Windows Server share a common platform, Microsoft is able to offer a consistent management, identity, and development experience across its public, private, and hybrid cloud offerings.

Wednesday, January 16, 2013

Azure Internals

For my first post, I thought I would try tackling how Windows Azure works and how your applications are deployed in the data center.  Unlike a normal operating system, which is responsible for managing local resources, Windows Azure manages a pool of compute, storage, and networking resources within the data center.  It also exposes a set of building blocks that developers can use to build distributed applications that are both highly available and easily scalable.  Azure is able to offer high availability because multiple instances of your service run on servers in different fault and update domains, which I'll explain here shortly.  The services you develop are assumed to be stateless; if you need to store state, you store it in Windows Azure blob or table storage.
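To make the stateless assumption concrete, here's a minimal sketch of a worker that keeps no durable state in memory between work items.  The storage class below is a hypothetical stand-in, not the actual Azure SDK:

# Hypothetical sketch of a stateless worker role: every piece of durable
# state lives in external storage (think table or blob storage), so any
# instance can be killed and replaced without losing data.

class TableStorageStub:
    """Stand-in for a real table-storage client (not the Azure SDK)."""
    def __init__(self):
        self._rows = {}
    def get(self, key, default=None):
        return self._rows.get(key, default)
    def put(self, key, value):
        self._rows[key] = value

def process_order(storage, order_id):
    # Read whatever state we need from storage, never from instance memory.
    status = storage.get(order_id, "new")
    if status == "new":
        # ... do the actual work here ...
        storage.put(order_id, "processed")   # persist the result immediately
    return storage.get(order_id)

storage = TableStorageStub()
print(process_order(storage, "order-42"))    # "processed"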

When you need to update your service, Windows Azure will connect to each instance of your service in series while it is running and replace it with the newer version.  This is vastly different from IaaS, where you would ordinarily have to do that yourself.  Moreover, because there are no dependencies between the underlying OS and your service, we're able to apply updates and patches without affecting the availability of your service.

An Azure service can be composed of different role types; for example, there's a web role, which is essentially an instance of IIS, and there's a worker role that you can use to execute your logic.  These roles are similar to DLLs that run in a VM instance.  For you to qualify for the SLA, you need to provision at least 2 instances of your role, and those instances have to be deployed across different fault and update domains, which you can specify or have the Azure Fabric Controller do for you.

Having fault and update domains is how we meet our published SLAs for Windows Azure.  Fault domains are for unplanned outages whereas update domains are for maintaining availability during planned service updates. 

When your updates are applied, Azure coordinates them among the various instances of your service in sequence so that the availability of your service is not affected.  There may be reduced capacity because fewer instances of your service are running, but it's still available.  The important thing to realize here is that as you apply updates to your service, there may be mixed versions of a role; for example, some of your worker roles may be running an older version while others are being updated.  This is why Azure allows you to perform manual updates of the roles in different update domains.  Another way to perform updates is to do a VIP swap, which involves creating 2 instances of your service and telling the load balancer to swap VIPs (virtual IPs).
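Here's a minimal sketch of that rolling, one-update-domain-at-a-time pattern.  The instance list and update function are illustrative, not Azure's actual internals:

# Illustrative rolling update: walk the update domains one at a time so
# only a fraction of the instances is offline at any moment.
from collections import defaultdict

instances = [
    # (instance_id, update_domain)
    ("web_0", 0), ("web_1", 1),
    ("worker_0", 0), ("worker_1", 1),
]

def apply_new_version(instance_id, version):
    # Stand-in for stopping the instance, swapping binaries, and restarting it.
    print(f"updating {instance_id} to {version}")

def rolling_update(instances, version):
    by_ud = defaultdict(list)
    for instance_id, ud in instances:
        by_ud[ud].append(instance_id)
    for ud in sorted(by_ud):              # one update domain at a time
        for instance_id in by_ud[ud]:     # old and new versions coexist briefly
            apply_new_version(instance_id, version)

rolling_update(instances, "v2")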

A fault domain is a single point of failure.  For Azure, a rack is a fault domain, with each rack consisting of about 40 servers or blades.  When a fault affects a running instance of your service, Azure will automatically try to replace it with a new instance in another fault domain.  It will also tell the load balancer to stop sending requests to the failed instance.
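Sketching that recovery path (again, a made-up model rather than the Fabric Controller's real code):

# Illustrative failure handling: when a rack (fault domain) fails, pull the
# affected instances out of the load balancer and restart them on another rack.

def handle_rack_failure(failed_rack, placements, load_balancer, healthy_racks):
    """placements maps instance_id -> rack; load_balancer is a set of live instances."""
    for instance_id, rack in list(placements.items()):
        if rack != failed_rack:
            continue
        load_balancer.discard(instance_id)           # stop routing traffic to it
        replacement = next(r for r in healthy_racks
                           if r != failed_rack and r not in placements.values())
        placements[instance_id] = replacement        # respawn in another fault domain
        load_balancer.add(instance_id)               # resume routing once it's back

placements = {"web_0": "rack-1", "web_1": "rack-2"}
lb = {"web_0", "web_1"}
handle_rack_failure("rack-1", placements, lb, healthy_racks=["rack-2", "rack-3"])
print(placements)   # web_0 now lives on rack-3, a different fault domain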

After you're done writing your service, you upload it to Azure through the Windows Azure portal.  From there it is sent to the RDFE (Red Dog Front End), which uses the service configuration to determine which region to deploy your service into.  Your service is then passed to a Windows Azure Fabric Controller (FC), which is responsible for provisioning your service onto servers within a stamp, where a stamp is a large collection of servers in racks, along with TOR switches and the like.

The FC is an Azure service that runs on multiple servers within a stamp.  This service is aware of the hardware and network topology within the data center, and it's responsible for managing the lifecycle of your service.  When a stamp is being provisioned, the FC sends a command to the in-rack PDU to power on the server.  The server then PXE boots Windows PE, which formats the disk and downloads a VHD of the Windows Azure OS.  After completing sysprep, the server tells the FC that it is ready to start accepting new applications.
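Condensed into a pseudocode-style outline (every function name here is a hypothetical placeholder of mine, not a real FC interface), the bring-up of a single server looks roughly like this:

# Rough outline of the server bring-up sequence described above.
# All of these calls are hypothetical placeholders, not real FC APIs.

def provision_server(fc, pdu, server):
    pdu.power_on(server)                          # FC asks the in-rack PDU to power the server on
    server.pxe_boot("WinPE")                      # server network-boots into Windows PE
    server.format_disk()                          # Windows PE formats the local disk
    server.download_image("Windows Azure OS VHD") # lay down the OS image
    server.run_sysprep()                          # generalize the image for this machine
    fc.report_ready(server)                       # server tells the FC it can host applications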

The FC uses an algorithm that looks at resource availability across fault and update domains within the stamp to determine which servers to deploy your roles onto.  The algorithm automatically attempts to put different tiers of your service on the same server, e.g., 1 instance of a worker role and 1 instance of a web role, which simplifies how we do updates and also improves performance because communication between the 2 roles doesn't have to traverse the network.
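A toy version of that placement idea follows; it is greatly simplified and entirely hypothetical, since the real allocator weighs far more constraints than this:

# Toy placement sketch: pick servers so that no two picked servers share a
# fault domain or an update domain, then co-locate one web-role instance and
# one worker-role instance on each picked server.

servers = [
    # (server_id, fault_domain, update_domain)
    ("srv-a", 0, 0), ("srv-b", 0, 1), ("srv-c", 1, 0), ("srv-d", 1, 1),
]

def place(servers, instance_count):
    picks, used_fds, used_uds = [], set(), set()
    for server_id, fd, ud in servers:
        if len(picks) == instance_count:
            break
        # Only take a server whose fault AND update domains are still unused,
        # so a single rack failure or update pass never hits every instance.
        if fd not in used_fds and ud not in used_uds:
            picks.append(server_id)
            used_fds.add(fd)
            used_uds.add(ud)
    return {s: ["web", "worker"] for s in picks}   # co-locate the two tiers

print(place(servers, instance_count=2))
# {'srv-a': ['web', 'worker'], 'srv-d': ['web', 'worker']}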

By now you should have a better sense of how Azure provides high availability and what happens after you upload your service to the Azure portal.  The advantage of using Azure and other PaaS offerings is that rather than having to manage the underlying operating systems and build middleware services like queuing yourself, developers can use the higher-order services that we expose to quickly build distributed applications that scale easily.  And because it's sold under a utility consumption model, it's also a great incubator for new ideas.  This is particularly important nowadays given that roughly 75% of an IT budget still goes toward maintaining existing applications, which leaves very little for experimentation and the risk associated with it.  With PaaS, if your service doesn't attract users or usage diminishes over time, you can de-provision instances of it or de-provision it altogether.