A reading of “CoolProvision: Underprovisioning Datacenter Cooling”

This is a rather interesting paper. Although it was written quite a while ago, it still offers many intriguing perspectives and is well worth a careful read. The paper is available here:

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/CoolProvision-final.pdf 

Part 1: Abstract

This paper, a collaborative effort between Rutgers University, GoDaddy, and Microsoft, brings together academic research, real-world deployment experience, and substantial resources. It tackles a prevalent issue in the data center industry: cooling systems are conservatively designed around the worst-case climate conditions observed over the past 10-20 years. Traditionally, a data center is sized to handle the maximum cooling load at full IT capacity under these extreme conditions, and that figure becomes the baseline for all subsequent design decisions.

While this conservative approach is sound, it undeniably leads to significant upfront and ongoing costs. Oversized cooling equipment is often inefficient when operating at low loads, a common scenario as data centers gradually ramp up to full capacity over several years. The paper proposes a framework to strike a balance between cost and risk by challenging these conventional constraints.

When the underprovisioned cooling plant cannot keep up, the authors consider two primary strategies: (1) throttling IT processing capacity, potentially impacting service quality, or (2) allowing IT equipment temperatures to rise, in exchange for a controlled degradation in reliability. Based on my experience, most data centers have limited capabilities for proactive IT load management, and interactions between infrastructure and IT teams can be challenging, so adjusting IT capacity to cover a cooling deficit is often impractical and risky. Raising IT temperature tolerance to ride through short-term cooling shortfalls, however, is a promising approach. It requires a well-defined Service Level Agreement (SLA) between the infrastructure and IT teams that clearly outlines responsibilities and boundaries. The infrastructure team should negotiate using equipment-specific temperature limits rather than industry guidelines such as ASHRAE's, since hardware can often tolerate higher inlet temperatures than the recommended envelope. The SLA should also cap how long elevated-temperature operation may last, because prolonged high-temperature operation is not energy-efficient and can increase overall energy consumption.
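As a rough illustration of how such an SLA could be made concrete, the sketch below tracks how long server inlet temperatures stay above an agreed limit and compares that against an annual budget. The limit, the time budget, and the telemetry format are hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class ThermalSLA:
    """Hypothetical infra/IT agreement on elevated-temperature operation."""
    inlet_limit_c: float        # agreed equipment-specific inlet limit (not the ASHRAE envelope)
    annual_budget_hours: float  # tolerated hours per year above that limit

def hours_over_limit(inlet_temps_c, sla, sample_interval_h=1.0):
    """Total hours in a series of inlet-temperature samples that exceed the SLA limit."""
    return sum(sample_interval_h for t in inlet_temps_c if t > sla.inlet_limit_c)

def sla_met(inlet_temps_c, sla, sample_interval_h=1.0):
    """True if the year's excursions stay within the agreed time budget."""
    return hours_over_limit(inlet_temps_c, sla, sample_interval_h) <= sla.annual_budget_hours

# Example with invented numbers: a 32 °C limit and a 100-hour annual budget.
sla = ThermalSLA(inlet_limit_c=32.0, annual_budget_hours=100.0)
```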

Part 2: Background Introduction

Building upon the summary, the authors further elaborate on four key objectives for addressing short-term cooling overloads:

  1. Developing cost models for both provisioning and operating cooling systems, as well as for hardware replacement. This involves quantifying the financial implications of various cooling strategies and equipment choices.
  2. Creating cooling and reliability models to characterize the thermal behavior of IT equipment and assess their impact on hardware failures. While previous studies have explored similar models, with a particular focus on hard drive lifespan, this research suggests that moderate temperature increases (below 35°C) may have a limited impact on overall system reliability.
  3. Formulating performance and power models to represent workload scheduling and energy management policies. The goal is to integrate IT workloads and energy consumption more effectively, although this is often a complex challenge.
  4. Developing an optimization and simulation framework that incorporates the aforementioned models, weather data, and workload projections to determine the optimal cooling system provisioning. The framework leverages neural networks to capture the complex interactions between external factors and infrastructure performance (a minimal sketch of the outer optimization loop follows this list).
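To make the structure of such a framework concrete, here is a minimal sketch of the outer loop only: sweep candidate cooling capacities, simulate a year of weather and workload against stand-in thermal, energy, and reliability models, and keep the cheapest option. Every constant and sub-model below is an illustrative placeholder, not the paper's formulation, and the neural-network component is omitted entirely.

```python
def simulate_year_tco(capacity_kw, hourly_outdoor_c, hourly_it_load_kw,
                      capex_per_kw=200.0, price_per_kwh=0.08):
    """Rough TCO of one candidate cooling capacity over one simulated year.
    Every constant and sub-model here is an illustrative placeholder."""
    capex = capex_per_kw * capacity_kw
    energy_cost = 0.0
    replacement_cost = 0.0
    for t_out, load in zip(hourly_outdoor_c, hourly_it_load_kw):
        # Placeholder thermal model: inlet temperature rises when demand exceeds capacity.
        overload = max(0.0, load - capacity_kw)
        inlet_c = t_out + 10.0 + 0.5 * overload
        # Placeholder cooling-energy model: power grows with load served and outdoor temperature.
        cooling_kw = 0.1 * min(load, capacity_kw) * (1.0 + max(0.0, t_out - 20.0) / 20.0)
        energy_cost += cooling_kw * price_per_kwh            # one-hour time step
        # Placeholder reliability model: hotter inlets add a small hourly replacement cost.
        replacement_cost += 0.001 * max(0.0, inlet_c - 30.0)
    return capex + energy_cost + replacement_cost

def best_capacity(candidates_kw, hourly_outdoor_c, hourly_it_load_kw):
    """Sweep the candidate capacities and keep the cheapest over the simulated year."""
    return min(candidates_kw,
               key=lambda c: simulate_year_tco(c, hourly_outdoor_c, hourly_it_load_kw))
```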

Part 3: Data Center Cooling

This chapter introduces the primary cooling methods: cooling towers, water-side economization, and direct evaporative cooling. There isn't much new or noteworthy content in this section.

Part 4: Configuration-Based Data Center Cooling

This section delves into the core topic: configuration-based data center cooling. It begins by examining traditional cooling configurations, which typically follow the worst-case guidance of bodies such as the Uptime Institute. While such an approach is safe, it may not be the most efficient or cost-effective. This is not a criticism of the Uptime Institute or any other standards body; rather, it highlights the limitations of following standards blindly without considering the specific requirements of a project. A truly effective design requires a deep understanding of the assumptions and constraints behind these standards, and the ability to tailor them to the unique needs of each project.

The paper's proposed configuration method offers a more nuanced approach, illustrated with two key points (a short exceedance-hours sketch follows the list):

  1. Extreme conditions: ASHRAE standards often define extremely conservative design conditions that are only exceeded for a small percentage of the year (e.g., 2%).
  2. Flexibility in temperature control: By allowing a slightly higher operating temperature (e.g., 3 degrees Celsius above the ASHRAE maximum), natural cooling systems can be implemented in many locations worldwide, significantly reducing operational costs.
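As a back-of-the-envelope illustration of both points, the sketch below takes a year of hourly outdoor temperatures and reports how many hours exceed a chosen design threshold, which is the quantity behind both the "small percentage of the year" figure and the case for relaxing the limit by a few degrees. The threshold values in the example are placeholders, not ASHRAE design conditions.

```python
def exceedance_hours(hourly_temps_c, threshold_c):
    """Number of hours in a year of hourly dry-bulb readings above a design threshold."""
    return sum(1 for t in hourly_temps_c if t > threshold_c)

def exceedance_fraction(hourly_temps_c, threshold_c):
    """Fraction of the year spent above the threshold."""
    return exceedance_hours(hourly_temps_c, threshold_c) / len(hourly_temps_c)

# Illustrative comparison (27 °C and +3 °C are placeholder numbers, not design conditions):
# strict  = exceedance_fraction(hourly_temps_c, 27.0)
# relaxed = exceedance_fraction(hourly_temps_c, 27.0 + 3.0)
```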

This flexibility in configuration aligns with the concept of utilization factors commonly used in data center design. These factors effectively simplify the design process by assuming a certain level of underutilization. Additionally, from an operational standpoint, it is often impractical to assume that a data center will operate at full capacity immediately upon deployment. The gradual ramp-up of workloads necessitates a cooling system that can efficiently handle periods of low utilization, particularly as server hardware becomes more energy-efficient. By slightly reducing the full-load cooling capacity, data centers can achieve significant cost savings and improve their market competitiveness.

Part 5: Cooling Supply

5.1 Overview

In this section, the authors provide a more concrete illustration of the configuration optimization process described above. They propose slicing the problem in time and space to simplify the calculations. However, I disagree with this approach. For a large-scale data center, the most appropriate unit for total cost of ownership (TCO) calculations is a cooling system module (similar to Tencent's module concept). A finer granularity of analysis may not accurately reflect the overall cost structure and could distort the results, because the power and cooling systems are highly interconnected, and local adjustments propagate upward, affecting overall system performance and configuration costs.

5.2 Optimizer and Problem Quantification

This section is particularly valuable: it introduces a quantitative approach to the optimization problem by proposing candidate input parameters for the optimizer. From another perspective, identifying these parameters is a good example of how to define the inputs and outputs of a Total Cost of Ownership (TCO) calculation. The listed parameters are not exhaustive, and organizations should add parameters that reflect their own business objectives to build a more comprehensive TCO model.
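To show what "defining the inputs" could look like in practice, here is a hypothetical parameter structure for such an optimizer. The field names and groupings are my own, not taken from the paper, and a real deployment would extend them with site-specific and business-specific inputs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CoolingOptimizerInputs:
    """Hypothetical inputs to a cooling-provisioning / TCO optimizer."""
    hourly_outdoor_temp_c: List[float]    # a year of weather data for the site
    hourly_it_load_kw: List[float]        # projected IT load, including the ramp-up period
    candidate_capacities_kw: List[float]  # cooling capacities to evaluate
    capex_per_cooling_kw: float           # capital cost per kW of provisioned cooling
    electricity_price_per_kwh: float      # energy price (could also vary by hour)
    inlet_limit_c: float                  # agreed equipment-specific inlet limit
    elevated_hours_budget: float          # SLA budget for time above that limit
    server_replacement_cost: float        # cost of replacing a failed server or disk
```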

5.3 Cost Model

In this section, the author models the relationship between cooling load and cost as linear. I disagree with this simplification: the relationship is more likely to be non-linear, roughly parabolic, although pinning down its exact shape would require more data.
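To make the distinction concrete, here is a toy comparison of the two assumptions; the coefficients are invented purely to show the difference in shape and are not fitted to any real cost data.

```python
def linear_cooling_capex(capacity_kw, cost_per_kw=200.0):
    """The paper's simplification: cost scales linearly with provisioned capacity."""
    return cost_per_kw * capacity_kw

def parabolic_cooling_capex(capacity_kw, a=0.05, b=150.0, c=10_000.0):
    """An alternative shape: a fixed base cost plus a quadratic term, so marginal cost
    rises at large capacities (coefficients invented for illustration only)."""
    return a * capacity_kw ** 2 + b * capacity_kw + c
```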

Building upon the cost model, the author introduces the relationship between failure rate and cost. The author equates failure rate with hard disk drive failure rate, which I agree is a reasonable simplification. Furthermore, the author correlates operating temperature with hard disk drive failure rate and introduces the concept of Annualized Failure Rate (AFR). This is a valuable metric, and I would like to expand on this concept based on the paper “Impact of Temperature on Hard Disk Drive Reliability in Large Datacenters” (which I plan to review in my next reading response).
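As one concrete way to fold temperature-dependent disk failures into the cost model, the sketch below uses a simple exponential AFR curve. The base rate and growth factor are illustrative assumptions only; the real curve should come from fleet data such as the study cited above.

```python
def disk_afr(inlet_temp_c, base_afr=0.02, ref_temp_c=25.0, growth_per_10c=2.0):
    """Annualized Failure Rate as a function of inlet temperature.
    Assumes the AFR equals base_afr at ref_temp_c and multiplies by growth_per_10c
    for every additional 10 °C (illustrative assumption, not fleet data)."""
    return base_afr * growth_per_10c ** ((inlet_temp_c - ref_temp_c) / 10.0)

def annual_replacement_cost(inlet_temp_c, n_disks, cost_per_disk):
    """Expected yearly replacement spend for a fleet held at a given inlet temperature."""
    return disk_afr(inlet_temp_c) * n_disks * cost_per_disk
```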

5.4 Thermal and Cooling Model

This section provides a high-level overview of potential input parameters for developing a thermal and cooling model:

  1. Server inlet temperature and humidity: The temperature and humidity of the air entering the server.
  2. Server outlet temperature and humidity: The temperature and humidity of the air exiting the server.
  3. IT load (power consumption): The electrical power consumed by the IT equipment.
  4. Cooling unit operation: The status of the cooling unit (e.g., on/off, mode).
  5. Equipment speed: The operating speed of equipment such as fans and compressors.

While this list provides a general framework, it lacks the specificity required for a practical implementation. The model presented here is more of a conceptual guide than a concrete solution.
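One way to make the thermal side more concrete is a first-order energy balance across a server: exhaust temperature equals inlet temperature plus IT power divided by the heat capacity of the airflow through the chassis. The sketch below applies that balance; the airflow figure is a placeholder, and real equipment would need measured fan curves.

```python
AIR_DENSITY_KG_M3 = 1.2      # approximate density of air at room conditions
AIR_CP_J_PER_KG_K = 1005.0   # specific heat of air

def server_outlet_temp_c(inlet_temp_c, it_power_w, airflow_m3_per_s):
    """First-order energy balance across a server: delta-T = P / (m_dot * cp)."""
    mass_flow_kg_s = AIR_DENSITY_KG_M3 * airflow_m3_per_s
    return inlet_temp_c + it_power_w / (mass_flow_kg_s * AIR_CP_J_PER_KG_K)

# Example: a 500 W server moving about 0.03 m^3/s of air (placeholder airflow)
# exhausts roughly 14 °C above its inlet: server_outlet_temp_c(25.0, 500.0, 0.03) ≈ 38.8
```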

5.5 Energy Consumption Model

Server Energy Consumption Model: The author decomposes server energy consumption into three components: CPU active power (frequency-dependent), CPU idle power, and other server power. I disagree with this overly simplified model: while the CPU is a major energy consumer, it typically accounts for only about half of a server's total power, so lumping everything else into a single term can introduce significant modeling error. I would at least model the power consumption of hard drives and fans explicitly.
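A slightly richer decomposition along these lines might look like the sketch below, which adds explicit disk and fan terms to the CPU-centric model. All coefficients are illustrative placeholders rather than measured values.

```python
def server_power_w(cpu_util, cpu_freq_ghz, n_disks, fan_speed_frac, p_other_w=60.0):
    """Illustrative server power model (all coefficients are placeholders):
    CPU idle floor, a frequency-dependent dynamic term, per-spindle disk power,
    cubic fan power, and a lumped 'everything else' term."""
    p_cpu_idle = 20.0
    p_cpu_dynamic = 80.0 * cpu_util * (cpu_freq_ghz / 3.0) ** 3  # dynamic power grows with f^3
    p_disks = 7.0 * n_disks
    p_fans = 25.0 * fan_speed_frac ** 3
    return p_cpu_idle + p_cpu_dynamic + p_disks + p_fans + p_other_w
```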

Cooling Energy Consumption Model: This section is very brief, almost nonexistent. The cooling energy consumption model is indeed complex, especially for water-based systems. I propose using a combination of physics-based modeling and neural networks to create a more comprehensive model.
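As one possible shape for that hybrid approach, the sketch below combines a physics baseline (fan power following the cube of fan speed) with a data-driven residual fitted by ordinary least squares, standing in for the neural network. The baseline coefficient and the choice of features are assumptions for illustration only.

```python
import numpy as np

def physics_baseline_kw(fan_speed_frac, rated_fan_kw=30.0):
    """Fan affinity law: power scales with the cube of fan speed (rated value is a placeholder)."""
    return rated_fan_kw * np.asarray(fan_speed_frac, dtype=float) ** 3

def fit_residual_model(fan_speed_frac, outdoor_temp_c, it_load_kw, measured_cooling_kw):
    """Fit a linear correction on top of the physics baseline from historical telemetry."""
    features = np.column_stack([
        np.asarray(outdoor_temp_c, dtype=float),
        np.asarray(it_load_kw, dtype=float),
        np.ones(len(it_load_kw)),                      # bias term
    ])
    residual = np.asarray(measured_cooling_kw, dtype=float) - physics_baseline_kw(fan_speed_frac)
    coeffs, *_ = np.linalg.lstsq(features, residual, rcond=None)
    return coeffs

def predict_cooling_kw(coeffs, fan_speed_frac, outdoor_temp_c, it_load_kw):
    """Physics baseline plus the learned residual correction."""
    features = np.column_stack([
        np.asarray(outdoor_temp_c, dtype=float),
        np.asarray(it_load_kw, dtype=float),
        np.ones(len(np.atleast_1d(it_load_kw))),
    ])
    return physics_baseline_kw(fan_speed_frac) + features @ coeffs
```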

5.6 Load and Energy Management

In this chapter, the author discusses at length the use of Dynamic Voltage and Frequency Scaling (DVFS) to throttle workloads for peak shaving and valley filling. I maintain my earlier stance: controlling IT workloads is not straightforward, and the added risk may not justify the benefit to the infrastructure side. This is a complex issue that requires careful coordination, so I believe the practical applicability of the strategies discussed in this chapter is limited and will face significant challenges.
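For completeness, here is roughly what a DVFS-based peak-shaving policy could look like if one did choose to implement it: cap the CPU frequency whenever the inlet temperature crosses a trigger, with hysteresis so the cap is not toggled on every sample. The thresholds and frequency steps are hypothetical, and the coordination caveats above still apply.

```python
def dvfs_frequency_cap_ghz(inlet_temp_c, currently_throttled,
                           trigger_c=32.0, release_c=30.0,
                           full_ghz=3.0, throttled_ghz=1.8):
    """Hysteresis policy (all thresholds and frequencies are illustrative):
    throttle above trigger_c, and return to full speed only once back below release_c."""
    if inlet_temp_c > trigger_c:
        return throttled_ghz, True
    if currently_throttled and inlet_temp_c > release_c:
        return throttled_ghz, True
    return full_ghz, False

# Usage: feed each new inlet reading and the previous throttle state.
# cap_ghz, throttled = dvfs_frequency_cap_ghz(33.5, currently_throttled=False)
```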

Part 6: Parasol Implementation

Parasol is a very small-scale experimental deployment with limited reference value for production facilities. Moreover, the authors do not provide detailed information on the DVFS implementation or address IT energy conservation.

Last words

This paper offers a valuable perspective: ASHRAE's thermal guidelines should serve as a reference rather than a rigid standard for data center design. By sacrificing some operational redundancy during the design phase, designers can achieve real cost savings. This approach is particularly relevant in today's competitive market, where tight cost control is essential: reducing excessive redundancy and tailoring designs to real-world usage scenarios can significantly lower TCO and give enterprises a competitive edge. However, determining the level of redundancy that operations managers can actually tolerate remains the critical open question.
