Monday, February 2, 2015

centos - Package temperature above threshold, cpu clock throttled



I have two HPE ProLiant DL360 Gen10 servers that are configured nearly identically. Both run CentOS 7.5. The only difference is that one has newer firmware and a newer kernel, which I installed in an attempt to fix this problem.



dmesg repeatedly reports the following, and the server's performance is suffering.



[Oct12 11:43] CPU5: Package temperature above threshold, cpu clock throttled (total events = 539077151)
[ +0.000001] CPU1: Package temperature above threshold, cpu clock throttled (total events = 539077144)
[ +0.000003] CPU4: Package temperature above threshold, cpu clock throttled (total events = 539077179)
[ +0.000002] CPU7: Package temperature above threshold, cpu clock throttled (total events = 539077201)
[ +0.000001] CPU3: Package temperature above threshold, cpu clock throttled (total events = 539077211)
[ +0.000004] CPU6: Package temperature above threshold, cpu clock throttled (total events = 539077197)
[ +0.000001] CPU2: Package temperature above threshold, cpu clock throttled (total events = 539077208)
[ +0.000001] CPU0: Package temperature above threshold, cpu clock throttled (total events = 539077122)
[Oct12 11:44] CPU6: Core temperature above threshold, cpu clock throttled (total events = 447115263)
[ +0.000001] CPU2: Core temperature above threshold, cpu clock throttled (total events = 447115267)
[ +0.002025] CPU6: Core temperature/speed normal
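
To see how the throttle counters differ per CPU, the log above can be summarized with a small script. This is only a sketch: it assumes the message format is exactly as shown in this dmesg excerpt, and the function name `summarize_throttling` is made up for illustration.

```python
import re
import sys
from collections import defaultdict

# Matches lines such as (format assumed from the dmesg excerpt above):
# "[ +0.000001] CPU1: Package temperature above threshold, cpu clock throttled (total events = 539077144)"
THROTTLE_RE = re.compile(
    r"CPU(?P<cpu>\d+): (?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)"
)


def summarize_throttling(dmesg_text):
    """Return {scope: {cpu: highest "total events" counter seen}}."""
    summary = defaultdict(dict)
    for line in dmesg_text.splitlines():
        match = THROTTLE_RE.search(line)
        if match is None:
            # Skips unrelated lines, including "temperature/speed normal".
            continue
        scope = match["scope"]
        cpu = int(match["cpu"])
        events = int(match["events"])
        summary[scope][cpu] = max(events, summary[scope].get(cpu, 0))
    return {scope: dict(cpus) for scope, cpus in summary.items()}


if __name__ == "__main__":
    # Reads dmesg output from standard input.
    for scope, cpus in summarize_throttling(sys.stdin.read()).items():
        for cpu in sorted(cpus):
            print(f"{scope} CPU{cpu}: {cpus[cpu]} events")
```

Saved under a name of your choice (for example throttle_summary.py), it could be fed with `dmesg | python3 throttle_summary.py`.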



The HPE iLO reports a temperature about 30°C lower than sensors does.



coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 0: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 2: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 3: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 4: +94.0°C (high = +86.0°C, crit = +96.0°C)


The HPE iLO interface reported the CPU at 55°C at the same time the sensors reading above was taken.



When I run sensors, I get the following in dmesg:



[Oct12 11:46] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180313/exfield-393)
[ +0.000726] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180313/psparse-516)
[ +0.000500] ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180313/power_meter-338)



I updated to the latest kernel (4.18.13-1.el7.elrepo.x86_64) this morning and that didn't help either.


Answer



I was able to mostly resolve this by updating the kernel in the OS. I'm now on 4.18.13-1.el7.elrepo.x86_64. The temperature sensors reports still differs from the one in the iLO UI, but the ratio between the CPU temperature and the "high" threshold is much better and lines up more closely with the iLO ratios.



coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +74.0°C (high = +86.0°C, crit = +96.0°C)
Core 0: +72.0°C (high = +86.0°C, crit = +96.0°C)
Core 2: +72.0°C (high = +86.0°C, crit = +96.0°C)
Core 3: +74.0°C (high = +86.0°C, crit = +96.0°C)
Core 4: +71.0°C (high = +86.0°C, crit = +96.0°C)
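
The ratio between each reading and its "high" threshold can be computed from the sensors output directly. The following is a rough sketch, assuming the coretemp line format shown above; `temp_ratios` is a hypothetical helper name, not part of lm-sensors.

```python
import re
import sys

# Matches lines such as (format assumed from the sensors output above):
# "Package id 0: +74.0°C (high = +86.0°C, crit = +96.0°C)"
SENSOR_RE = re.compile(
    r"^(?P<label>[^:]+):\s*(?P<temp>[+-]?\d+(?:\.\d+)?)°C\s*"
    r"\(high = (?P<high>[+-]?\d+(?:\.\d+)?)°C, "
    r"crit = (?P<crit>[+-]?\d+(?:\.\d+)?)°C\)"
)


def temp_ratios(sensors_text):
    """Return {label: (temp, high, temp / high)} for each temperature line."""
    ratios = {}
    for line in sensors_text.splitlines():
        match = SENSOR_RE.match(line.strip())
        if match is None:
            # Skips headers such as "coretemp-isa-0000" and "Adapter: ISA adapter".
            continue
        temp = float(match["temp"])
        high = float(match["high"])
        ratios[match["label"].strip()] = (temp, high, temp / high)
    return ratios


if __name__ == "__main__":
    # Reads sensors output from standard input.
    for label, (temp, high, ratio) in temp_ratios(sys.stdin.read()).items():
        print(f"{label}: {temp:.1f} of {high:.1f} ({ratio:.0%} of high)")
```

With the readings in this post, the package sits at 74/86, about 86% of "high", after the kernel update, compared with 95/86, about 110%, before it.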
