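As our physical servers age, we have found that many of them start reporting memory errors, which points to a failing memory module (DIMM). After running for a while longer, such a server suddenly crashes.
Logging in to the server and running dmesg shows errors like these: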
[78802930.264886] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[78802930.264892] EDAC sbridge MC0: CPU 10: Machine Check Event: 0 Bank 7: 8c00004000010092
[78802930.264894] EDAC sbridge MC0: TSC 0
[78802930.264896] EDAC sbridge MC0: ADDR 1de16e25c0
[78802930.264897] EDAC sbridge MC0: MISC 50141486
[78802930.264899] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1697764936 SOCKET 1 APIC 20
[78802931.091267] EDAC MC0: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x1de16e2 offset:0x5c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0092 socket:1 ha:0 channel_mask:4 rank:0)
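The last line of the log carries the details needed to locate the faulty DIMM. The following is a minimal sketch of how that line could be parsed; the function name `parse_edac_line` and the regular expression are illustrative assumptions based only on the sample above, not a format guaranteed by the kernel.

```python
import re

# Matches the summary line printed by the EDAC core, for example:
# "EDAC MC0: 1 CE memory read error on <label> (channel:2 slot:0 ...)"
# The pattern is an assumption derived from the sample log line only.
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) "
    r"(?P<msg>.+?) on (?P<label>\S+) \((?P<details>.*)\)"
)


def parse_edac_line(line):
    """Return a dict describing one EDAC error line, or None if it does not match."""
    m = EDAC_LINE.search(line)
    if m is None:
        return None
    details = {}
    # The parenthesised part is a space-separated list of key:value pairs;
    # tokens without a colon (such as the lone "-") are skipped.
    for token in m.group("details").split():
        key, sep, value = token.partition(":")
        if sep:
            details[key] = value
    return {
        "mc": int(m.group("mc")),
        "count": int(m.group("count")),
        "kind": m.group("kind"),      # CE = correctable, UE = uncorrectable
        "message": m.group("msg"),
        "label": m.group("label"),    # DIMM label reported by the driver
        "details": details,
    }


if __name__ == "__main__":
    sample = (
        "[78802931.091267] EDAC MC0: 1 CE memory read error on "
        "CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x1de16e2 "
        "offset:0x5c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0092 "
        "socket:1 ha:0 channel_mask:4 rank:0)"
    )
    info = parse_edac_line(sample)
    print(info["label"], info["details"]["channel"], info["details"]["slot"])
    # Assuming 4 KiB pages, page and offset combine into the physical address.
    d = info["details"]
    print(hex((int(d["page"], 16) << 12) + int(d["offset"], 16)))
```

For the sample line, the page and offset combine to 0x1de16e25c0, which matches the ADDR value in the MCE record, and the label shows the error is on socket 1, channel 2, DIMM 0.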
Fortunately, node_exporter exposes metrics for these errors, so all we need to do is add an alert rule on them:
- alert: Node-PhysicalMachineMemoryError
  expr: increase(node_edac_correctable_errors_total[5m]) > 0
  labels:
    team: Node
    severity: Warning
  annotations:
    summary: 'Please check the memory module for errors'
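To cross-check the metric on a host directly, the kernel's EDAC counters can be read from sysfs. The sketch below assumes the usual location `/sys/devices/system/edac/mc`, which, as far as I know, is also where node_exporter's EDAC collector gets its numbers; the function name `read_edac_counts` is hypothetical.

```python
from pathlib import Path

# Assumed default location of the EDAC memory-controller counters on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int or None, "ue": int or None}} for each mc* directory."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        # EDAC driver not loaded, or not a Linux host: nothing to report.
        return counts
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc_dir / filename).read_text().strip())
            except (OSError, ValueError):
                # Counter file missing or unreadable: record it as unknown.
                entry[key] = None
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for name, entry in read_edac_counts().items():
        print(name, "correctable:", entry["ce"], "uncorrectable:", entry["ue"])
```

A controller whose correctable count keeps growing between runs is the one to investigate, which is the same condition the alert rule expresses with increase(...) > 0.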
Please credit the source when reposting: IPCPU-网络之路 » Physical Server Memory EDAC Errors and Alerting