In today's environment of exploding demand for computing power, the main server has become the core carrier of artificial intelligence training, cloud computing services, and scientific computing. As chip processes approach physical limits and per-rack power density climbs from 10 kW to 30 kW or more, the design of the cooling system directly determines a server's stability, energy efficiency, and total cost of ownership (TCO). From traditional air cooling to immersion liquid cooling, from passive heat sinks to AI-driven dynamic temperature control, cooling technology is undergoing revolutionary breakthroughs. What are the key considerations in main-server cooling design?
Challenges in main-server heat dissipation
Cooling a main server is, in essence, the task of efficiently transferring the heat generated by the chips, memory, hard disks, and other components to the external environment. What factors affect the path and efficiency of that heat transfer?
The first factor is heat-source density. The power density of a GPU cluster can reach 500 W/cm², far exceeding the roughly 100 W/cm² of a CPU, which traditional cooling methods struggle to handle. The second is the heat-transfer medium: the specific heat capacity of air (1.005 kJ/kg·K) is much lower than that of liquids (4.18 kJ/kg·K for water), which limits the efficiency of air-cooled systems. The third is the temperature difference: heat-dissipation efficiency is proportional to the temperature difference between the server's interior and its surroundings, yet the data center usually needs to stay cool (ASHRAE recommends 18–27°C), which further compresses the thermal headroom. These physical limits force cooling-system design to break with traditional thinking and evolve toward multi-dimensional collaborative innovation.
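The specific-heat gap above translates directly into coolant mass flow. A minimal sketch (illustrative numbers only) using the relation Q = ṁ · c_p · ΔT shows how much more air than water must be moved to carry the same heat load:

```python
# Illustrative calculation: mass flow of coolant needed to remove a heat load,
# from Q = m_dot * c_p * dT. The 30 kW rack and 10 K rise are assumed values.

def coolant_mass_flow(heat_load_w: float, specific_heat_j_per_kg_k: float,
                      delta_t_k: float) -> float:
    """Return the mass flow rate (kg/s) needed to carry away heat_load_w."""
    return heat_load_w / (specific_heat_j_per_kg_k * delta_t_k)

# 30 kW rack, 10 K allowed coolant temperature rise
air_flow = coolant_mass_flow(30_000, 1005, 10)    # air: c_p ≈ 1.005 kJ/kg·K
water_flow = coolant_mass_flow(30_000, 4180, 10)  # water: c_p ≈ 4.18 kJ/kg·K

print(f"air:   {air_flow:.2f} kg/s")   # ~3 kg/s of air
print(f"water: {water_flow:.2f} kg/s")  # ~0.7 kg/s of water
```

Roughly four times less water than air is needed per second, before even accounting for water's far higher density, which is why liquid loops dominate at high rack densities.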
Key design dimensions: from airflow management to cooling technology selection
Airflow management is the "skeleton" of the cooling system: it directly determines how much hot and cold air mix and how much energy is lost:
Hot/cold aisle containment Prevent short-circuiting of hot and cold airflows by enclosing the cold aisle (CAC) or the hot aisle (HAC). For example, Google's data centers use hot-aisle containment, with top-mounted fans ducting hot air out of the room, cutting the recirculation rate to below 5%.
Cabinet layout strategy High-power cabinets (such as GPU servers) should be spread out to avoid local overheating. Facebook's "honeycomb layout" equalizes air distribution by interleaving high- and low-density cabinets.
Floor height and perforation rate Raising the raised floor to more than 60 cm and optimizing the perforation rate of the perforated tiles (40–60% is recommended) can reduce airflow resistance by 20%–30%.
According to power density and cost budget, mainstream cooling technology can be divided into three categories:
Air-cooled systems use traditional computer room air conditioners (CRAC) and suit low-density scenarios (≤10 kW/rack), but the coefficient of performance (COP) is only 2–4 and cooling can account for up to 40% of power costs. Indirect evaporative cooling, which cools by evaporating water into outside air, is also an option in dry climates.
In a liquid cooling system, a copper or aluminum cold plate contacts the CPU or GPU directly, and the coolant (such as a 50/50 water–ethylene glycol mix) carries the heat away; this suits 15–30 kW/rack scenarios. The NVIDIA DGX A100 uses cold-plate cooling to reduce GPU temperature by 15°C. Immersion liquid cooling fully submerges the server in a non-conductive fluorinated liquid (such as 3M Novec) for quiet, fanless operation and supports ultra-high densities above 50 kW/rack; Bitcoin mining farms have cut electricity use by 40% with immersion cooling.
Hybrid cooling systems combine the advantages of air and liquid cooling. Huawei's FusionCol indirect liquid cooling, for example, transfers heat to an external cooling tower through a backplane heat exchanger, achieving a PUE of 1.15 or less.
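The three categories above map onto rack power density. A hypothetical selection helper (thresholds taken from the figures in the text) makes the decision rule explicit:

```python
# Hypothetical decision helper mapping rack power density to the cooling
# categories described above; thresholds follow the figures in the text.

def select_cooling(rack_power_kw: float) -> str:
    """Pick a cooling approach for a given rack power density (kW/rack)."""
    if rack_power_kw <= 10:
        return "air cooling (CRAC)"          # low density, <= 10 kW/rack
    if rack_power_kw <= 30:
        return "cold-plate liquid cooling"   # 15-30 kW/rack range
    return "immersion liquid cooling"        # ultra-high density, 50+ kW/rack

for kw in (6, 25, 60):
    print(f"{kw} kW/rack -> {select_cooling(kw)}")
```

In practice the boundaries also depend on budget and facility constraints (the hybrid option sits between the cold-plate and immersion tiers), so treat the cutoffs as a starting point, not a rule.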
Redundancy and disaster recovery design
The reliability of the heat dissipation system must meet the N+1 or 2N redundancy standards:
Dual power supply: key devices such as cooling pumps and fans need independent circuits to avoid single points of failure.
Dynamic switchover mechanism: when the primary cooling system fails, the backup system should take over within 30 seconds. Tencent's Tianjin data center, for example, uses dual-loop cooling pipes to support seamless switchover.
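The switchover logic can be sketched as a heartbeat watchdog. This is a minimal illustration with a hypothetical API, not any vendor's actual controller; the 30-second budget comes from the text:

```python
# Minimal sketch of a dynamic-switchover watchdog: if the primary cooling
# loop stops reporting heartbeats, the backup takes over within the budget.

import time

class CoolingFailover:
    def __init__(self, takeover_budget_s: float = 30.0):
        self.budget = takeover_budget_s
        self.active = "primary"
        self.last_heartbeat = time.monotonic()

    def heartbeat(self) -> None:
        """Called periodically by the primary loop's controller."""
        self.last_heartbeat = time.monotonic()

    def check(self) -> str:
        """Promote the backup if the primary has been silent too long."""
        if self.active == "primary":
            if time.monotonic() - self.last_heartbeat > self.budget:
                self.active = "backup"
        return self.active
```

A real system would also verify that the backup loop is actually delivering coolant before declaring the switchover complete; the sketch only shows the timing decision.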
Energy efficiency optimization: from PUE control to waste heat recovery
PUE (Power Usage Effectiveness) management
PUE = total data-center power consumption / IT-equipment power consumption, with an ideal value approaching 1:
Free cooling: when the outside temperature falls below a set value, outside air is brought in directly to cool the equipment.
Variable-frequency drives: dynamically adjust pump and fan speeds to the load to cut energy consumption at partial load. Schneider Electric's variable-frequency cooling systems save 25%–35% of energy.
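The PUE definition above reduces to a one-line calculation. A small sketch with illustrative numbers (the 1200 kW IT load and 180 kW overhead are assumptions, not measurements):

```python
# PUE as defined above: total facility power divided by IT power.
# The loads below are illustrative, not real measurements.

def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness; 1.0 means zero non-IT overhead."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw

# A 1200 kW IT load with 180 kW of cooling/lighting/UPS overhead:
print(round(pue(1200 + 180, 1200), 2))  # 1.15
```

Note that PUE only measures overhead relative to IT power; it says nothing about how efficiently the IT load itself computes, which is why it is tracked alongside, not instead of, workload metrics.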
Waste heat reuse
Heat emitted by servers can be recovered in the following ways:
District heating: The Stockholm data center in Sweden delivers waste heat to the municipal heating network to meet the needs of 900 households.
Absorption refrigeration: waste heat drives a lithium bromide chiller that supplies cooling to office areas, achieving cascaded use of energy.
Intelligent upgrade: an AI- and IoT-driven cooling revolution
Digital twins and simulation prediction use CFD (computational fluid dynamics) and digital-twin technology to preview the effect of different cooling schemes; IBM Thermal Advisor can cut thermal debugging time by 70% by optimizing cabinet layout in a virtual environment. AI dynamic temperature control rests on a real-time sensor network: temperature, humidity, and air-pressure sensors deployed inside the server and at the cabinet inlet and outlet, sampled at 1 Hz. Reinforcement learning closes the loop: the AI system developed by Google DeepMind adjusts cooling-equipment parameters in real time, cutting data-center cooling energy by 40%, while analysis of vibration and noise data provides early warning of fan failure or refrigerant leaks.
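To ground the control idea, here is a deliberately simple proportional fan-speed rule. Real AI-driven systems such as the reinforcement-learning approach mentioned above replace this fixed mapping with a learned policy; the setpoint and gain below are hypothetical:

```python
# Simple proportional temperature control: fan duty cycle rises linearly
# with inlet temperature above a setpoint. Setpoint and gain are assumed
# values for illustration, not tuned parameters from any real system.

def fan_speed_pct(inlet_temp_c: float,
                  setpoint_c: float = 25.0,
                  gain_pct_per_k: float = 10.0,
                  floor_pct: float = 20.0) -> float:
    """Fan duty cycle: idle floor below the setpoint, proportional above."""
    error = inlet_temp_c - setpoint_c
    speed = floor_pct + max(0.0, error) * gain_pct_per_k
    return min(100.0, speed)  # clamp at full speed

for t in (22, 27, 35):
    print(f"{t} °C -> {fan_speed_pct(t):.0f}% fan speed")
```

A learned policy improves on this mainly by anticipating load changes and coordinating many fans and pumps jointly, rather than reacting to each sensor in isolation.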
Future trend: green heat dissipation and new material breakthrough
In the future, main servers will use phase-change cooling (liquid → gas) to absorb latent heat, raising heat-transfer efficiency by up to five times. The thermal conductivity of SiC (490 W/m·K) exceeds that of copper (401 W/m·K), which can significantly lower chip junction temperatures. Environmentally friendly dielectric coolants (such as MIVOLT) biodegrade in the natural environment, reducing ecological impact.
Main-server cooling design must draw together thermodynamics, materials science, artificial intelligence, and other disciplines. As heat shifts from an energy burden to a value engine, the future data center will be not only a computing-power plant but also a hub of the smart energy network, creating more possibilities.