I have two HPE ProLiant DL360 Gen10 servers that are configured nearly identically. Both run CentOS 7.5. The only difference is that one has newer firmware and a newer kernel, installed in an attempt to fix this problem.
dmesg is reporting the following repeatedly, and the performance of the server is suffering.
[Oct12 11:43] CPU5: Package temperature above threshold, cpu clock throttled (total events = 539077151)
[ +0.000001] CPU1: Package temperature above threshold, cpu clock throttled (total events = 539077144)
[ +0.000003] CPU4: Package temperature above threshold, cpu clock throttled (total events = 539077179)
[ +0.000002] CPU7: Package temperature above threshold, cpu clock throttled (total events = 539077201)
[ +0.000001] CPU3: Package temperature above threshold, cpu clock throttled (total events = 539077211)
[ +0.000004] CPU6: Package temperature above threshold, cpu clock throttled (total events = 539077197)
[ +0.000001] CPU2: Package temperature above threshold, cpu clock throttled (total events = 539077208)
[ +0.000001] CPU0: Package temperature above threshold, cpu clock throttled (total events = 539077122)
[Oct12 11:44] CPU6: Core temperature above threshold, cpu clock throttled (total events = 447115263)
[ +0.000001] CPU2: Core temperature above threshold, cpu clock throttled (total events = 447115267)
[ +0.002025] CPU6: Core temperature/speed normal
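To see how quickly these counters grow, the throttle messages can be pulled out of the dmesg text and reduced to the latest "total events" value per CPU. The following is a minimal sketch, assuming the message format is exactly the one shown above; the function name latest_throttle_counts is made up for illustration and is not part of any tool.

```python
import re
from typing import Dict, Tuple

# Matches lines such as:
# "[ +0.000001] CPU1: Package temperature above threshold, cpu clock throttled (total events = 539077144)"
# The pattern is an assumption based only on the log lines quoted in this post.
THROTTLE_RE = re.compile(
    r"CPU(\d+): (Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (\d+)\)"
)


def latest_throttle_counts(dmesg_text: str) -> Dict[Tuple[int, str], int]:
    """Return the most recent 'total events' value per (cpu, scope)."""
    counts: Dict[Tuple[int, str], int] = {}
    for line in dmesg_text.splitlines():
        match = THROTTLE_RE.search(line)
        if match:
            cpu = int(match.group(1))
            scope = match.group(2)  # "Package" or "Core"
            counts[(cpu, scope)] = int(match.group(3))  # later lines overwrite earlier ones
    return counts


sample = (
    "[Oct12 11:43] CPU5: Package temperature above threshold, "
    "cpu clock throttled (total events = 539077151)\n"
    "[Oct12 11:44] CPU6: Core temperature above threshold, "
    "cpu clock throttled (total events = 447115263)\n"
    "[ +0.002025] CPU6: Core temperature/speed normal\n"
)
for (cpu, scope), total in sorted(latest_throttle_counts(sample).items()):
    print(f"CPU{cpu} {scope}: {total}")
```

Running this against two dmesg captures taken a known interval apart and subtracting the totals gives a rough throttle rate per CPU.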
The HPE iLO is reporting ~30°C less than sensors is reporting.
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 0: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 2: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 3: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 4: +94.0°C (high = +86.0°C, crit = +96.0°C)
The HPE iLO interface reports the CPU at 55°C at the same time the sensors reading is taken.
When I run sensors, I get the following in dmesg:
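To make the comparison concrete, the sensors lines can be parsed and set against the iLO figure. This is a rough sketch, assuming the coretemp output format shown above; Reading and parse_sensors are hypothetical names, and the iLO value is entered by hand because this post does not show how to query it.

```python
import re
from typing import List, NamedTuple


class Reading(NamedTuple):
    label: str
    temp: float
    high: float
    crit: float


# Matches lines such as "Core 0: +95.0°C (high = +86.0°C, crit = +96.0°C)".
# The pattern is an assumption based only on the output quoted in this post.
SENSOR_RE = re.compile(
    r"^\s*(?P<label>[^:]+):\s*\+?(?P<temp>-?\d+(?:\.\d+)?)°C\s*"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)°C, "
    r"crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C\)"
)


def parse_sensors(text: str) -> List[Reading]:
    """Extract temperature, high and crit values from coretemp-style lines."""
    readings = []
    for line in text.splitlines():
        m = SENSOR_RE.match(line)
        if m:  # header lines such as "Adapter: ISA adapter" do not match
            readings.append(Reading(m["label"].strip(), float(m["temp"]),
                                    float(m["high"]), float(m["crit"])))
    return readings


ILO_TEMP_C = 55.0  # value read manually from the iLO interface

sample = """coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 4: +94.0°C (high = +86.0°C, crit = +96.0°C)
"""
for r in parse_sensors(sample):
    print(f"{r.label}: {r.temp - ILO_TEMP_C:+.1f}°C vs iLO, "
          f"{r.crit - r.temp:.1f}°C below crit")
```

For the package line in this sample, the gap to the 55°C iLO reading comes out at 40°C, with only 1°C left before the critical threshold.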
[Oct12 11:46] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180313/exfield-393)
[ +0.000726] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180313/psparse-516)
[ +0.000500] ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180313/power_meter-338)
I updated to the latest kernel (4.18.13-1.el7.elrepo.x86_64) this morning and that didn't help either.
Answer
I was able to mostly resolve this by updating the OS kernel. I'm now on 4.18.13-1.el7.elrepo.x86_64. The temperature that sensors reports still differs from what the iLO UI shows, but the ratio between CPU temperature and "high" is much better and lines up more closely with the iLO ratios.
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +74.0°C (high = +86.0°C, crit = +96.0°C)
Core 0: +72.0°C (high = +86.0°C, crit = +96.0°C)
Core 2: +72.0°C (high = +86.0°C, crit = +96.0°C)
Core 3: +74.0°C (high = +86.0°C, crit = +96.0°C)
Core 4: +71.0°C (high = +86.0°C, crit = +96.0°C)
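The ratio mentioned here can be worked out from the package readings quoted in this post. This is a small illustrative calculation, not something sensors prints; the helper ratio_to_high is a made-up name, and the iLO side is left out because its threshold values are not given here.

```python
def ratio_to_high(temp_c: float, high_c: float) -> float:
    """Ratio of the current temperature to the 'high' threshold (above 1.0 means over it)."""
    return temp_c / high_c


HIGH_C = 86.0  # "high" threshold reported by sensors for Package id 0

before = ratio_to_high(95.0, HIGH_C)  # package reading from the question
after = ratio_to_high(74.0, HIGH_C)   # package reading after the kernel update

print(f"before: {before:.2f}")  # before: 1.10
print(f"after:  {after:.2f}")   # after:  0.86
```

A ratio above 1.0 is consistent with the "temperature above threshold" throttling messages; after the update the package reading sits at about 86% of the threshold.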