Cooling the Heat: Strategies for Energy Efficiency in Data Centers

Cooling activities account for approximately 40% of a data centre's power consumption. Coupled with the pressure to achieve a better PUE, increasing demand for higher-density racks and rising energy costs, it quickly becomes clear that efficient cooling system design is a priority both for existing data centres looking to future-proof their facilities and for new builds wishing to maintain a cost-efficient and resilient operation.
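To put the headline figure in context, the relationship between the cooling share of facility power and PUE can be sketched in a few lines of Python; all numbers below are illustrative assumptions, not measured data:

```python
# Illustrative sketch: how the cooling share of facility power relates to PUE.
# All figures are assumed examples, not operational data.

def pue(it_power_kw: float, cooling_kw: float, other_overhead_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT power."""
    total = it_power_kw + cooling_kw + other_overhead_kw
    return total / it_power_kw

# Example: 1,000 kW IT load, with cooling at roughly 40% of total facility draw.
it_kw = 1000.0
cooling_kw = 750.0   # assumed cooling draw (750 / 1870 ~ 40% of total)
other_kw = 120.0     # lighting, UPS losses, etc. (assumed)

print(f"PUE = {pue(it_kw, cooling_kw, other_kw):.2f}")
```

Shrinking the cooling term is the single largest lever on this ratio, which is why the strategies below matter.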

Executive Summary

This article provides high-level technical guidance to decision makers in the mission-critical data centre space, reviewing the practical implementation pros and cons of various existing and developing cooling technologies from a facility engineering and project management perspective.

Site location and the required cooling demand are defining parameters for any choice relating to cooling technology implementation. Decisions should be made keeping in mind the need to 'future proof' your facility. Luckily, the gaps between various technologies are closing with the advent of hybrid and combined system options that can be tailored to provide the most efficient and cost-effective solution.

There is much potential to optimise existing facilities with a Plan, Do, Check, Act approach, be it via redefining facility set-points, intelligent air-flow management with enhanced monitoring, or by retro-fitting newer technologies in a phased approach alongside existing facilities.

The various available air- and liquid-cooled technologies and their technical limitations are described in depth to help you and your teams make an informed decision regarding the best cooling strategies for your organisation's growth plans going forward.

The technologies themselves are evolving quickly to adapt to future market requirements and sustainability legislation, in preparation for the next generation of IT equipment that is already on our doorstep.

Key decision factors include CapEx and OpEx over the lifetime of the installation, system availability at large or flexible scale, and the potential risk and failure modes, all of which are described in detail.

Credit: Vertiv - Understanding CDUs for Liquid Cooling

Location, location, location

The evaluation of cooling system options is a fundamental design step that should feature early on during new-build site selection discussions, and one that will have a direct impact on the long term OpEx of the facility in addition to its ability to cope with the challenges previously mentioned - namely increases in power demand within a fixed footprint.

Whilst there is a natural tendency to favour cooler climate regions for siting such facilities due to the higher availability of free cooling - the need for digital infrastructure globally doesn't always allow for this luxury.

The illustration below, by Microsoft, provides an overview of the cooling methods typically employed in different climates.

Credit: Microsoft Datacenters

Interestingly, technology developments, plus a combined focus on Power Usage Effectiveness (PUE) and Water Usage Effectiveness (WUE) as a more holistic pair of metrics, are leading to further refinement of cooling systems with hybrid functionality able to shift between operating modes, capitalising on ambient weather conditions and balancing energy- and water-saving priorities.
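As a sketch of this hybrid behaviour, the mode-selection logic can be illustrated in a few lines of Python. The setpoint, approach temperature and mode names below are illustrative assumptions, not any vendor's control logic:

```python
def select_cooling_mode(dry_bulb_c: float,
                        wet_bulb_c: float,
                        supply_setpoint_c: float = 24.0,
                        water_scarce: bool = False) -> str:
    """Pick the lowest-impact cooling mode for current ambient conditions.
    Thresholds and the 4 K heat-exchanger approach are assumed examples."""
    approach_c = 4.0
    if dry_bulb_c + approach_c <= supply_setpoint_c:
        return "dry free cooling"       # best PUE and WUE: no compressor, no water
    if wet_bulb_c + approach_c <= supply_setpoint_c and not water_scarce:
        return "evaporative assist"     # good PUE, but consumes water (worse WUE)
    return "mechanical (DX/CHW)"        # fallback: highest energy consumption
```

A water-scarce site would spend more hours in mechanical mode, illustrating the PUE/WUE trade-off the metrics are meant to capture.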

Higher facility temperatures

The question of running data centre facilities at higher temperatures with the aim of reducing cooling demand appears to have both supporters and critics. Whilst the allowable standard operating envelope for mission critical IT equipment has expanded under the ASHRAE standards - colocation service providers and enterprise-owned facilities are taking varying approaches based on their hardware requirements, customer agreements, sustainability vision and 'appetite for risk'. Key considerations should also include predictions of server lifetime and anticipated failure rates before implementing radical changes to the operating environment.

Further afield, Singapore has recently released a new 'Standard for Tropical Data Centres' aimed at supporting a gradual increase in operating temperatures.

An interesting offshoot for discussion includes waste heat re-use opportunities under the mantra: 'higher temps, higher value'.

Credit: ASHRAE

Airflow management & cooling optimisation

A key concern with air-cooled applications is that they often cannot target critical heat-generating components directly, which effectively contributes to wasted cooling capacity.

Before implementing potentially costly changes across your operating environment, it is always more prudent to 'get more bang for your buck' and optimise existing infrastructure across your facilities - following the Plan, Do, Check, Act approach discussed in previous articles.

For air-cooled arrangements, there are a few fundamental goals an airflow strategy should seek to achieve, which underpin any heat exchange system. Upsite's Lars Strong eloquently summarised an ideal system in a 2020 seminar. The goals include:

  1. Provide optimum IT equipment intake air conditions, with the lowest possible flow rate of conditioned air at the warmest possible temperature.

  2. Minimise losses and maximise heat exchange across the entire cooling loop, with a focus on achieving the highest possible return temperature to the cooling units.

Credit: Upsite Technologies
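The two goals above can be tied together with the basic sensible-heat relationship Q = ρ·V̇·cp·ΔT: the wider the air-side temperature difference, the less conditioned air must be moved for the same heat load. A minimal sketch, assuming standard air properties:

```python
# Assumed air properties near sea level; real designs use site-specific values.
RHO_AIR = 1.2     # density, kg/m^3
CP_AIR = 1005.0   # specific heat, J/(kg*K)

def required_airflow_m3s(heat_kw: float, delta_t_k: float) -> float:
    """Volumetric airflow needed to remove heat_kw at a given air delta-T,
    from Q = rho * V_dot * cp * dT."""
    return (heat_kw * 1000.0) / (RHO_AIR * CP_AIR * delta_t_k)

# Widening the delta-T (goal 2) directly cuts the required flow rate (goal 1):
for dt in (8.0, 12.0, 16.0):
    print(f"10 kW rack at dT = {dt:>4} K -> {required_airflow_m3s(10.0, dt):.2f} m^3/s")
```

This is why containment and higher return temperatures save fan energy: moving less air at a wider ΔT removes the same heat.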

High efficiency demonstration sites such as BTDC One aim to challenge the status quo by implementing 'intelligent' cooling systems that integrate IT workloads (i.e. kW draw), server fan speeds and CPU temperatures to maximise the efficiency of cooling delivery - contributing to a stable instantaneous PUE. Siemens offer an interesting 'White Space Cooling Optimisation' solution using Artificial Intelligence (AI) to achieve a similar objective. Huawei's iCooling solution also offers similar promise.

Advanced temperature monitoring strategies are also available, such as The Uptime Institute's approach of using three sensors at various positions on every other rack. The live sensor data, coupled with CFD modelling, can provide real-time analysis and 'next level' control.

Companies such as Boyd Corp engineer thermal solutions to drive 'smarter airflow', otherwise known as 'Calibrated Vectored Cooling (CVC)', such as heat sinks with thermal interface pads on high heat sources (i.e. server processors), plus air baffles, blockers, gaskets and seals that help redirect airflow through the server racks to where it is most needed.

Differential air pressure sensors also have a role to play in understanding airflow patterns within your facility, e.g. monitoring hot vs cold aisles, or front vs rear surfaces of cabinets, in order to identify unwanted leaks, validate that differential pressures are being maintained, and ensure airflow is moving in the right direction.
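As an illustration, a monitoring system might classify differential-pressure readings along these lines; the thresholds below are assumed examples for the sketch, not published standards:

```python
def check_aisle_pressure(dp_pa: float,
                         low_pa: float = 2.0,
                         high_pa: float = 15.0) -> str:
    """Classify a cold-aisle-minus-hot-aisle differential pressure reading (Pa).
    Threshold values are illustrative assumptions, not a standard."""
    if dp_pa < 0:
        return "ALARM: reverse flow - hot air recirculating into the cold aisle"
    if dp_pa < low_pa:
        return "WARN: low differential - likely leakage or bypass air"
    if dp_pa > high_pa:
        return "WARN: over-pressurised - fan energy being wasted"
    return "OK"
```

Feeding such classifications into the BMS alongside temperature data is one simple way to turn raw sensor readings into actionable airflow alerts.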

As facilities mature, it is well worthwhile inspecting and modernising critical monitoring and controls equipment to ensure it still offers 'best-in-class' functionality.

Credit: AKCP

Available cooling techniques

  • Air & Water Based 'Free-cooling'

Free cooling using air as a medium is a mature technology in the DC space, but given air's limited thermal transfer properties, such systems face a challenge when cooling higher-density racks, compounded by a tendency for mechanical failures and the potential to introduce moisture into the critical environment (direct systems). Indirect systems overcome this issue, albeit with a trade-off in efficiency.

Another dilemma arises in higher ambient temperatures, where supplementary mechanical DX (Direct Expansion), CHW (Chilled Water) or adiabatic/evaporative modes are required, at the cost of increased power consumption.

For climates with humidity issues, measures to treat excessively dry or damp air also add to the complexity and cost of such equipment.

Other forms of free-cooling include the use of naturally cold water as the medium, relying on nearby water sources. This option can also be hampered in water scarce regions, or in excessively cold areas where the water temperature drops below freezing point.

  • CHW Cooling & DX Cooling

CHW systems are efficient at dealing with large cooling loads but require a high CapEx. The anticipated cooling loads also need to be specified at the system design phase, and future requirements cannot always be defined at that point. Multiple DX units (Computer Room Air Conditioners, CRACs) may be ideal for smaller loads such as server/UPS rooms and remote or peripheral locations, due to their lower comparative cost and the flexibility to add more as the load increases. For larger applications, however, a few CHW runs can accomplish much more than many refrigerant lines.

Furthermore, the availability of low-GWP refrigerants, their respective price and future F-gas legislation developments are all key factors to consider when weighing up the options.

  • Evaporative & Adiabatic Cooling

The fact that water is more efficient at removing heat than air, coupled with the use of evaporation to enhance the cooling process, is what makes these cooling methods so effective. However, water consumption remains an issue with purely evaporative systems, which can be helped by integrating 'Thermosyphon Hybrid Cooling'. Indirect evaporative cooling (IEC) systems can also overcome issues associated with unwanted moisture and the introduction of external air. Two-stage systems are also available, termed 'Indirect Direct Evaporative Cooling' (IDEC) systems, which first cool via indirect means, followed by a second-stage direct 'adiabatic' cooling step; these systems boast the ability to lower temperatures further, with less humidity, water and energy.

Adiabatic systems share the power of evaporative cooling but are more water-efficient: they use water evaporation to pre-cool the ambient air only during the hottest parts of the day or year, running as a dry system during other periods, making them an attractive option in hotter, drier climates (less so in very humid tropical climates). In contrast, the water consumption of purely evaporative systems can be constant throughout the year. Finally, a DX coil stage can be provided to further enhance cooling capacity.

Credit: Vertiv
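The benefit of the adiabatic pre-cooling stage described above can be estimated with the standard direct-evaporative relationship T_out = T_db − ε·(T_db − T_wb). A short sketch, with an assumed saturation effectiveness:

```python
def adiabatic_outlet_c(dry_bulb_c: float,
                       wet_bulb_c: float,
                       effectiveness: float = 0.8) -> float:
    """Outlet temperature of a direct evaporative pre-cooler.
    effectiveness is an assumed saturation efficiency (typically 0.7-0.9)."""
    return dry_bulb_c - effectiveness * (dry_bulb_c - wet_bulb_c)

# Hot, dry climate: large wet-bulb depression -> big benefit (~25.6 C out).
print(adiabatic_outlet_c(40.0, 22.0))
# Humid tropics: small depression -> little benefit (~30.6 C out).
print(adiabatic_outlet_c(33.0, 30.0))
```

The wet-bulb depression drives everything here, which is exactly why these systems shine in hot, dry climates and underwhelm in humid ones.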

  • Rear door heat exchangers (RDHx), Direct-to-Chip (DtC) & Immersion Cooling

Liquid cooling technologies are developing at an unprecedented pace, with the potential to deliver cleaner, scalable and targeted solutions with higher efficiencies than air-cooled alternatives (albeit with higher CapEx) - contributing to lower PUE and WUE metrics.

In terms of retrofitting existing server racks with updated cooling systems, an important consideration is OEM warranties and the potential impact on them should servers be re-purposed to enable liquid cooling. In parallel, new-generation equipment is being developed that accepts liquid cooling by default. This is a critical step in giving DC operators peace of mind that their supplier network is future-proofing to meet upcoming challenges.

The proper infrastructure also needs to be designed for all types of liquid cooling technology, ensuring the fluid cooling loop enables effective heat transfer between the IT equipment, the secondary circuits and the facility cooling medium itself. A key design factor is the system's ability to guarantee precise temperature control of the facility cooling medium in response to changes in load. Fluid volumes should also be kept to the minimum required, to mitigate issues relating to leaks and over-pressure in the system.
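To illustrate the temperature-control requirement, here is a minimal sketch of one step of a proportional control loop for a coolant distribution unit's facility-side valve. The setpoint, gain and valve model are hypothetical; real CDUs use tuned, manufacturer-specific PID control:

```python
def cdu_valve_command(supply_temp_c: float,
                      setpoint_c: float = 32.0,
                      gain: float = 0.15,
                      prev_cmd: float = 0.5) -> float:
    """One step of a simple proportional controller for a CDU's facility-side
    valve (0.0 = closed, 1.0 = fully open). Illustrative sketch only."""
    error = supply_temp_c - setpoint_c   # positive -> loop too warm -> open valve
    cmd = prev_cmd + gain * error
    return max(0.0, min(1.0, cmd))       # clamp to the valve's physical range
```

Run each control cycle with fresh sensor data, this keeps the secondary loop supply temperature near its setpoint as the IT load swings.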

Finally, if improved cooling solutions can be provided, i.e. without airflow heat transfer limitations, reductions in data centre footprint can be achieved through higher rack power densities, enabling space-saving floor layouts.

Credit: LiquidStack's 2PIC 2022 Case Study

Rear door heat exchangers (RDHx), with their proximity to the servers and ability to remove heat close to its source, pose an interesting proposition; they also minimise the need for traditional hot-aisle containment.

As with any liquid-based system, leak detection and mitigation measures must be employed, with a focus on high-quality design and precision installation up front. CHW system pressure monitoring and transparent drip trays are also typically implemented. Redundancy is a key consideration, with the need for secondary cooling loop supplies or backup systems sized to tolerate the additional heat load in the event of RDHx failure.

For colocation facilities, the associated pipework distribution network to and from the white space may prove to be a hindrance if a revamp is required to accommodate new customers. It is also important to validate the heat removal performance of this technology and its ability to respond rapidly to changes in server workload, with supplemental fan cooling potentially required around the 60 kW density range, and limitations in high and ultra-high density operating environments.

Finally, in the case of retrofits, compatibility with the existing rack design is important to avoid unforeseen issues during installation.

Credit: Vertiv Liebert® DCD

Direct-to-Chip cooling (also known as DtC or direct-to-plate) takes advantage of the same rack-based architecture as air-cooled systems and can be deployed to existing infrastructure. DtC can itself be divided into two categories: 'microchannel' and 'microconvective' approaches. The former spreads a coolant over the entire surface area of a 'cold plate', whilst the latter uses targeted, perpendicular 'jet' cooling aimed at specific hot spots within the processor. The key heat-generating components include the CPUs, GPUs and memory modules, whilst either single-phase or two-phase cold plates can be utilised to facilitate heat transfer via a dielectric fluid engineered for DtC cooling.

Credit: Jetcool D2C Liquid Cooling Tech White Paper

Since this form of cooling targets specific heat sources within the server, it generally removes 70-75% of the heat generated within the rack, thus requiring a hybrid cooling approach (albeit with greatly reduced air-cooling infrastructure) to dissipate the remaining heat, e.g. from power supplies and IC capacitors.
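The resulting hybrid split can be sketched as follows; the capture fraction is an assumed midpoint of the 70-75% range cited above:

```python
def hybrid_heat_split(rack_kw: float, dtc_capture: float = 0.72):
    """Split rack heat between the liquid loop and residual air cooling.
    dtc_capture is an assumed midpoint of the 70-75% figure above."""
    liquid_kw = rack_kw * dtc_capture
    air_kw = rack_kw - liquid_kw
    return liquid_kw, air_kw

liquid, air = hybrid_heat_split(50.0)
print(f"50 kW rack: {liquid:.1f} kW to the liquid loop, {air:.1f} kW to residual air")
```

Even at high densities, the residual air-side load stays modest, which is why the remaining air-cooling infrastructure can be greatly downsized rather than eliminated.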

A key consideration is server-level or system failure, which can result in immediate overheating due to the limited coolant volume, causing lengthy downtime. Greater protection can be ensured by incorporating lower liquid pressures on the server side and dripless, double-sealed connectors. It nevertheless remains an important factor to consider in the mission-critical operating environment.

Mainstream data centre providers seem to be erring on the side of caution regarding uptake of this technology, not only because of the higher CapEx and specialist maintenance requirements, but also because existing facilities already have cooling infrastructure with a long operational lifespan. This is not preventing retrofits alongside existing systems via an iterative approach, however, allowing operators to gather real operational data and become accustomed to these systems.

Dwelling on the specialist maintenance aspect for a moment: the notion of large-scale DtC cooling deployment may face resistance due to the intricate servicing, specialised training, tools and enhanced security protocols required, plus the respective OpEx.

DC providers are encouraged to keep their finger on the pulse in the liquid cooling space, by developing partnerships with research institutions and establishing innovation centres to accelerate product development in this regard.

An example of an innovative research initiative is the 'COOLERCHIPS' programme in the US. One interesting development is research on 'intrachip micro-cooling' for the emerging class of high-performance electronics, for which conventional chip-cooling methods may struggle to remove heat effectively. Traditional DtC cold plates adhere to flat-profile chips, whilst future technologies might incorporate three-dimensional 'stacks of processing chips', hence the need for a truly embedded cooling system within the stack itself.

Credit: Iceotope learning hub

Immersion cooling techniques include both single-phase and two-phase (2PIC) immersion, via either an 'enclosed chassis (clamshell)' arrangement or an 'enclosed tank'. The former benefits from less dielectric fluid in a self-contained casing that installs in conventional server racks. These techniques offer a more plausible approach to larger-scale deployment than DtC variants, and they have the advantage of cooling everything, as opposed to just the high heat sources as with DtC, thereby reducing potential failure modes.

Consideration needs to be given to replacement frequencies, e.g. during IT refresh activities over the rack lifetime. Enclosed tank types benefit from the ability to change over IT equipment directly for renewal without replacing the tank itself. Immersion cooling in general also offers cost savings through a reduction in IT components such as heat sinks (assessed on a case-by-case basis), fans and humidity sensors. Conversely, component-level interaction with the dielectric fluid can lead to material and functionality degradation. Fluid contamination can occur via the electronics hardware itself, so the focus should be on specifying materials with low concentrations of extractable content, plus component pre-cleaning prior to immersion.

The term 'precision liquid immersion cooling' refers to the 'enclosed chassis (clamshell)' variant. Both Iceotope and LiquidCool offer solutions of this type, with the latter patenting an impressive combined DtC and total immersion cooling technology that fits into a standard 19" rack. Using a Directed-Flow™ concept, the coolest fluid is sent to the hottest electronic components first, before immersing other components to gather the remaining heat. This approach boasts further reductions in floor space and dielectric fluid requirements, plus enhanced serviceability. Research trends appear to be focusing on this initiative, with further developments in mind.

Credit: LiquidCool

From an operational perspective, single-phase arrangements benefit from simpler tank designs, easier fluid containment and fewer issues with material compatibility and fluid hygiene. Conversely, the cooling infrastructure for two-phase solutions is typically less complex due to their greater heat transfer efficiency. Potential fluid loss mechanisms in tank arrangements should also be minimised, e.g. losses during filling/start-up operations, venting (during power fluctuations), parasitic loss via tank seals and evaporative losses during servicing. In both cases the expected fluid lifetime and total fluid cost over the lifetime of the tank or data centre should be assessed.

Servicing of hardware (i.e. hot swapping) is possible in both configurations, with companies such as TMGcore utilising robotics technology for the replacement of failed servers.

A cost-benefit study should be conducted to fully evaluate the above considerations.

For those looking at retrofit opportunities in legacy or existing modern data centres, specific attention should be given to OEM warranties, plus the compatibility of all components that could come into contact with the dielectric fluids used in immersion cooling (i.e. labels, capacitors, connectors, sealants, relays, heat shrink, cables, mechanical HDDs, etc.). Attention should also be paid to the physical layout of IT equipment, taking into account heat dissipation characteristics and maximum temperature tolerance in relation to the fluid flow direction. Put simply, components that generate more heat or require lower operating temperatures should be arranged in the coolest part of the tank, whilst those with a higher tolerance should be situated downstream, in the highest part of the tank.
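The placement rule above amounts to sorting components by temperature tolerance, with the least tolerant nearest the coolest (inlet) end of the tank. A toy sketch with hypothetical component data, not vendor specifications:

```python
# Hypothetical component list; the tolerance figures are illustrative only.
components = [
    {"name": "PSU", "max_temp_c": 110},
    {"name": "CPU", "max_temp_c": 90},
    {"name": "NIC", "max_temp_c": 105},
    {"name": "GPU", "max_temp_c": 85},
]

# Lowest tolerance goes nearest the coolest (inlet) end of the tank;
# higher-tolerance parts can sit downstream in progressively warmer fluid.
layout = sorted(components, key=lambda c: c["max_temp_c"])

print([c["name"] for c in layout])  # -> ['GPU', 'CPU', 'NIC', 'PSU']
```

In practice the ordering would also weigh heat output and fluid-path geometry, but the principle of coolest-fluid-to-least-tolerant-component holds.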

The physical layout of IT equipment is more critical in single-phase arrangements, since in two-phase immersion a more uniform, isothermal environment can be achieved, eliminating stratification effects. The focus then shifts, however, to the spacing of components in the upper fluid region of two-phase systems, to allow an escape path for the vapour produced by the phase change within.

In some cases, optimisation of existing air-cooled IT components is required to adapt them for liquid cooling. One example is heat sink modification: immersion heat sinks typically require a larger fin pitch due to the increased viscosity of the fluids compared to air. Devices such as CPUs and GPUs that would normally require a copper heat sink in air-cooled or single-phase immersion installations would also require enhancement for use in two-phase immersion.

From a fire safety perspective, dielectric fluids typically feature an extremely high flash point (e.g. NatureCool™ at 325 °C) with zero ignition potential, thus lowering fire risks and insurance premiums. It is important, though, to monitor developments as IT equipment and its density evolve, to ensure high flash points are maintained. The availability of such fluids, plus the immersion cooling equipment itself, should be assessed for large-scale deployments. Sustainability developments are also important to monitor: for example, 3M has announced an exit from per- and polyfluoroalkyl substance (PFAS, 'forever chemicals') manufacturing by the end of 2025, and is said to supply approximately 80% of the fluids used in 2PIC technology. This puts additional pressure on system manufacturers to source PFAS-free formulations.

Sustainability-wise, the use of two-phase liquid cooling solutions in both DtC and immersion technologies is under intense scrutiny and requires an in-depth study taking into account potential legislative updates. This may lead to a market preference for single-phase DtC methods going forward.

Immersion technology suppliers are not content with second place in that regard, with recent Intel-Submer announcements highlighting the development of a 'Forced Convection Heat Sink (FCHS) package' that aims to expand the cooling capabilities of single-phase immersion offerings. Other advances include tank fan integration in preparation for even higher power densities.

Credit: Submer Case Study - Immersion Cooling of High-Density Compute Workloads

We hope you found this article useful!

If you are looking for professional feasibility study support or cost-benefit analysis relating to your next industrial mega-facility design & build project...

Contact — Biyat Energy & Environment Ltd (biyatenergyenvironment.com)

This article was written by Luay Zayed, Founder of Biyat Energy & Environment Ltd, a global energy and environmental consultancy specialising in turnkey engineering solutions that protect the environment and improve energy efficiency in the manufacturing and industrial sectors.

Credit: Emergen Research - Top 10 Companies in Data Center Cooling Market in 2024

Luay Zayed