Airflow Management Considerations for a New Data Center – Part 2: Server Performance vs Inlet Temperature

by Ian Seaton | May 17, 2017 | Blog


This is a continuation of Airflow Management Considerations for a New Data Center – Part 1: Server Power vs Inlet Temperature.

How hot can you let your data center get before damage sets in? This is part 2 of a 7 part series on airflow management considerations for new data centers. In Part 1 I discussed server power versus inlet temperature. In this part, I will be talking about server performance versus inlet temperature.

Airflow management considerations will be those questions that will inform the degree to which we can take advantage of our excellent airflow management practices to drive down the operating cost of our data center. In my previous piece, part one of a seven-part series drawing on ASHRAE’s server metrics for determining a data center operating envelope, I explored the question of server power versus server inlet temperature, presenting a methodology for assessing the trade-off of mechanical plant energy savings versus increased server fan energy at higher temperatures. I suggested that for most applications, a data center could be allowed to encroach into much higher temperature ranges than many industry practitioners might have thought before server fan energy penalties reverse the savings trend. However, how much are we really saving if our temperature adventures are affecting how much work our computing equipment is doing for us? And that brings us to today’s subject and part two of this series: server performance versus server inlet temperature.

Servers today are much more thermally robust than recent legacy servers, particularly with the advent of Class A3 and Class A4 servers. In the recent past, as servers became equipped with variable speed fans and onboard thermal management, they contained the intelligence to respond to excessive temperatures by slowing down performance. Unfortunately, if energy savings features are disabled, as they frequently are, this self-preservation tactic will likely not function. Conversely, there are some server OEMs who are essentially only delivering A3 servers (with safe operation up to a 104˚F inlet), and an A2 server, with allowable operation up to 95˚F, is for all practical purposes going to be easier to source at flea markets than through OEM sales channels. So if a new data center is being equipped with new IT equipment, this is a more straightforward consideration. However, if legacy equipment will be moved into a new space, it will be important to contact the vendors to learn where performance temperature thresholds might be for different equipment. The evidence, as presented in the following paragraphs, suggests that operating up to the temperature levels that fall below server fan energy penalties will not result in inhibiting performance of today's data center ITE.

It might be surprising to learn that today's CPUs are designed to operate up to 95˚C or even 100˚C, and testing with Linpack to track floating point operations has revealed that Intel CPUs, for example, can operate at 100˚C (that's 212˚F for the casual speed reader) for up to 50% of the time before experiencing a slight drop-off in operating frequency and resultant FLOP transaction rate [1]. Of course, that is not a license to run our data centers at 212˚F, so the trick is to keep the data center operating at some point that assures a server inlet temperature is low enough to keep the CPU operating below that temperature where performance is impacted. Back when we were not sure what that threshold might be, the safe zone was inside the ASHRAE recommended envelope, which has slowly migrated to 64.4˚ to 80.6˚F (18˚ to 27˚C), despite vendor documentation that allowed for much wider ranges. That trick does not have to be such a trick these days, as most servers come with sensors and outputs that tell us the CPU temperature. While that information is available, it is not necessarily useful for minute-to-minute management of the data center unless every piece of equipment in the data center comes from the same vendor and comes equipped with the same CPU temperature monitoring output format. Without that homogeneity, which describes most of our spaces, we need some guidance on where we can take the outside temperature without adversely affecting the inside temperature.
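
To make that envelope comparison concrete, here is a minimal Python sketch (my illustration, not part of the original article) that classifies a measured server inlet temperature against the ASHRAE recommended envelope and the Class A2 and A3 allowable upper limits quoted in this series; the thresholds come from the text, while the function and variable names are my own.

```python
# Minimal sketch: classify a measured server inlet temperature against the
# envelopes cited in this article (ASHRAE recommended 64.4 to 80.6 F,
# Class A2 allowable up to 95 F, Class A3 allowable up to 104 F).
# Thresholds and names here are illustrative, not a monitoring product.

RECOMMENDED_F = (64.4, 80.6)   # ASHRAE recommended envelope (18 to 27 C)
A2_MAX_F = 95.0                # Class A2 allowable upper limit
A3_MAX_F = 104.0               # Class A3 allowable upper limit


def c_to_f(temp_c: float) -> float:
    """Convert Celsius to Fahrenheit (e.g. 100 C -> 212 F, as noted above)."""
    return temp_c * 9.0 / 5.0 + 32.0


def classify_inlet(temp_f: float) -> str:
    """Return a coarse label for a server inlet temperature in Fahrenheit."""
    low, high = RECOMMENDED_F
    if low <= temp_f <= high:
        return "within ASHRAE recommended envelope"
    if temp_f <= A2_MAX_F:
        return "outside recommended, within Class A2 allowable"
    if temp_f <= A3_MAX_F:
        return "above Class A2, within Class A3 allowable"
    return "above Class A3 allowable - check vendor documentation"


if __name__ == "__main__":
    for reading_f in (72.0, 93.0, 100.0, c_to_f(45.0)):
        print(f"{reading_f:6.1f} F: {classify_inlet(reading_f)}")
```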

When ASHRAE TC9.9 added the new server classes and extended the allowable temperature envelope in 2011, the following year we saw a relative flurry of scientific and engineering activity in search of understanding the implications of these environmental guidelines on the equipment deployed in data centers. One particularly well-defined and controlled study was conducted at IBM and reported at an American Society of Mechanical Engineers technical conference. Their focus was specifically on server performance within the Class A3 envelope (41-104˚F), and more specifically at the upper range of that envelope. They tested 1U, 2U and blade server packages with different power supplies, and they selected workload test packages to simulate both high-performance computing and virtualization cloud typical workloads. They evaluated over 70 different CPUs and selected test samples from the best and worst for power leakage to determine the effect of that variable on results at these conditions. They baselined each piece of equipment and associated workload test at 77˚F server inlet temperature and then re-tested at 95˚F (Class A2 upper limit) and at 104˚F (Class A3 upper limit).

The results, summarized in Table 1, wherein the 95˚F and 104˚F columns are the ratios of operations performed versus the 77˚F baseline, clearly indicate there is no degradation of performance at these higher temperatures. The only test that fell outside the ±1% tolerance of the test workloads and data acquisition was the worst power leakage blade system running Linpack in the intensified Turbo Boost mode, and that only showed a 2% performance degradation, or 1% beyond the tolerance margin of error.
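
For readers who want to apply the same ratio-to-baseline check to their own workload counters, here is a minimal Python sketch of that comparison; the sample operation counts are hypothetical, and only the 77˚F baseline, the ±1% tolerance, and the 2% worst-case outlier are taken from the study as reported above.

```python
# Minimal sketch of the ratio-to-baseline comparison described above: work
# completed at an elevated inlet temperature divided by work completed at the
# 77 F baseline, flagged when the ratio falls outside the +/-1% tolerance of
# the workloads and data acquisition. Sample counts below are hypothetical.

TOLERANCE = 0.01  # +/-1% measurement tolerance reported for the tests


def performance_ratio(ops_at_temp: float, ops_at_baseline: float) -> float:
    """Ratio of operations at the test temperature versus the 77 F baseline."""
    return ops_at_temp / ops_at_baseline


def degraded(ratio: float, tolerance: float = TOLERANCE) -> bool:
    """True if the ratio shows a drop larger than the measurement tolerance."""
    return ratio < 1.0 - tolerance


if __name__ == "__main__":
    # Hypothetical operation counts; a ratio of 0.98 mirrors the single 2%
    # outlier reported for the worst-leakage blade running Linpack in Turbo
    # Boost mode, 1% beyond the tolerance margin of error.
    samples = {
        "linpack_turbo_boost": (9800, 10000),
        "virtualization_mix": (9990, 10000),
    }
    for name, (ops_hot, ops_baseline) in samples.items():
        r = performance_ratio(ops_hot, ops_baseline)
        print(f"{name}: ratio={r:.3f} outside_tolerance={degraded(r)}")
```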

Within this same time frame, tests were conducted at the University of Toronto on just one server model, but with seven different hard drives from four major vendors, and they exercised the equipment with a far wider range of workloads and many more temperature settings. These tests took ambient temperatures much higher than the IBM tests, so performance declines became easier to identify outside the range of normal statistical error. Their benchmark workloads included measuring time to access 4gb of memory, giga-updates per second of 8kb chunk memory random access, speed of integer operations, speed of floating point operations, speed of responding to random read/write requests, speed of handling highly sequential 65kb read/write requests, on-line transaction processing, I/O processing of on-line transactions, decision support database workloads, disk-bound database workloads, file system transactions, and HPC computational queries, all on recognized, industry-standard tools designed to either stress different parts of the system or to model a number of real world applications [3]. Testing was conducted inside a thermal chamber in which temperatures could be controlled in 0.1˚C increments from -10˚ up to 60˚C (14-140˚F, and a bit wider range than we typically see in data centers today).

The University of Toronto researchers looked at both disk drive and CPU performance. For the disk drives, at an ambient temperature of 140˚F, they found throughput declines typically in the 5-10% range, with some up to 30%. More importantly, statistically noticeable throughput declines occurred at different ambient conditions for different disk drives: a couple were observed at 104˚F and 113˚F, and one did not show any reduction in throughput until 131˚F. If any of you are considering allowing your data centers to have “cold aisles” over 100˚F, and since all the tested equipment was rated at either 122˚F or 140˚F, I invite you to check out the original source for information on vendors and models [4]. Otherwise, if you do not anticipate allowing cold aisles or supply temperatures to exceed 100˚F, then disk drive throughput will not be affected by the data center environmental envelope. As for CPU and memory performance, they did not see any throttling down of performance on any of the benchmarks up to 131˚F [5].
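
As a rough illustration of that per-drive analysis, the following Python sketch walks a series of (ambient temperature, throughput) readings and reports the first temperature at which throughput falls more than a chosen fraction below the coolest-point baseline; the drive names and numbers are hypothetical, not data from the University of Toronto tests.

```python
# Minimal sketch of the per-drive analysis described above: walk a series of
# (ambient F, throughput) measurements and report the first temperature at
# which throughput drops more than a chosen fraction below the coolest-point
# reading. Drive names and numbers are hypothetical, not the study's data.

from typing import Optional, Sequence, Tuple


def throughput_decline_onset(
    readings: Sequence[Tuple[float, float]], threshold: float = 0.05
) -> Optional[float]:
    """Return the ambient temperature (F) at which throughput first falls more
    than `threshold` (0.05 = 5%) below the coolest reading, or None if never."""
    ordered = sorted(readings)            # sort by ambient temperature
    baseline = ordered[0][1]              # throughput at the coolest point
    for ambient_f, throughput in ordered[1:]:
        if throughput < baseline * (1.0 - threshold):
            return ambient_f
    return None


if __name__ == "__main__":
    drive_a = [(77, 180.0), (104, 172.0), (113, 165.0), (131, 150.0)]  # MB/s
    drive_b = [(77, 200.0), (104, 199.0), (113, 198.0), (131, 197.0)]
    for name, data in (("drive_a", drive_a), ("drive_b", drive_b)):
        print(name, "decline onset (F):", throughput_decline_onset(data))
```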

Data from research projects conducted immediately after the release of the 2011 ASHRAE environmental guidelines update suggests strongly that computing performance will not be degraded by server inlet temperatures within the ranges previously identified as thresholds before server fan energy increases eat into mechanical plant savings. In fact, the performance temperatures generally far exceed the economic temperature thresholds. This performance temperature headroom, therefore, suggests that some op ex savings might reasonably be sacrificed for the cap ex avoidance of not having to build a mechanical plant at all. And we're still not done: there are also ITE cost considerations, reliability considerations, and other environmental considerations, which I will be hitting in subsequent posts.
