Running dmesg|grep -i error on 192.168.219.90 showed that this machine has a memory problem, as shown below:
[Hardware Error]: MC4 Error (node 1): L3 cache tag error.
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4_ADDR: 0x00000018edfd9100
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

Further investigation showed that the 5th DIMM is faulty, so the private cloud team needs to be contacted for a repair.
grep [0-9] /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count:146
/sys/devices/system/edac/mc/mc2/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch1_ce_count:0

A line whose count is not 0 indicates that corrected memory errors have been recorded for that location.
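With many counters, the non-zero ones are easy to miss among the zeros. The following is a minimal sketch that prints only the counters above zero. The function name edac_nonzero_ce and its optional root-directory argument are assumptions made for illustration; the sysfs layout is the one shown in the grep output above.

```shell
#!/bin/sh
# Print every EDAC corrected-error counter whose value is greater than zero.
# $1 (optional, hypothetical parameter): root of the EDAC tree; defaults to
# the sysfs path used in the article.
edac_nonzero_ce() {
    root="${1:-/sys/devices/system/edac/mc}"
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue            # unmatched glob or unreadable file
        count=$(cat "$f")
        case "$count" in
            ''|*[!0-9]*) continue ;;       # skip anything that is not a plain number
        esac
        if [ "$count" -gt 0 ]; then
            # Show the path relative to the root, followed by the count
            printf '%s:%s\n' "${f#"$root"/}" "$count"
        fi
    done
}
```

Calling edac_nonzero_ce with no argument reads the live counters; on the machine in this article it would be expected to report only the mc2/csrow2/ch0 entry.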
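mc: the memory controller index (here it matches the CPU node reported in dmesg: MC2 and node 2). csrow: the chip-select row. ch*: the channel within that row. Together they narrow the error down to one DIMM, although the exact mapping to a physical slot depends on the platform.

Next, list the DIMM slots with dmidecode: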
[root@customer log]# dmidecode -t memory |grep 'Locator: DIMM'
Locator: DIMM01
Locator: DIMM02
Locator: DIMM03
Locator: DIMM04
Locator: DIMM05
Locator: DIMM06
Locator: DIMM07
Locator: DIMM08
Locator: DIMM09
Locator: DIMM10
Locator: DIMM11
Locator: DIMM12
Locator: DIMM13
Locator: DIMM14
Locator: DIMM15
Locator: DIMM16
Locator: DIMM17
Locator: DIMM18
Locator: DIMM19
Locator: DIMM20
Locator: DIMM21
Locator: DIMM22
Locator: DIMM23
Locator: DIMM24
Locator: DIMM25
Locator: DIMM26
Locator: DIMM27
Locator: DIMM28
Locator: DIMM29
Locator: DIMM30
Locator: DIMM31
Locator: DIMM32

(Figure: memory as shown in the server management console.)
(Figure: layout of the memory slots on the motherboard.)
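When filing the repair request it can help to have each slot's locator next to its reported size. The following is a minimal sketch, assuming the usual dmidecode -t memory layout of "Memory Device" blocks containing "Size:" and "Locator:" fields. The function name dimm_table is hypothetical.

```shell
#!/bin/sh
# Read `dmidecode -t memory` output on stdin and print one line per
# Memory Device block in the form: <locator><TAB><size>
dimm_table() {
    awk '
        # Print the block collected so far, then reset for the next one
        function emit() {
            if (loc != "") printf "%s\t%s\n", loc, size
            loc = ""; size = ""
        }
        /^Memory Device/    { emit() }
        # Anchored at line start so "Bank Locator:" and "Volatile Size:" do not match
        /^[ \t]*Size:/      { sub(/^[ \t]*Size:[ \t]*/, "");    size = $0 }
        /^[ \t]*Locator:/   { sub(/^[ \t]*Locator:[ \t]*/, ""); loc  = $0 }
        END                 { emit() }
    '
}
```

Typical use would be dmidecode -t memory | dimm_table, run as root like the command above.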
Combined with the error log:
kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1
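The controller, error count, channel and slot can be pulled out of such a line mechanically. The following is a minimal sketch that assumes the exact line format quoted above; other kernel versions or EDAC drivers may word the message differently. The function name edac_ce_fields is hypothetical.

```shell
#!/bin/sh
# Read kernel log lines on stdin and, for each EDAC corrected-error line in
# the format quoted in the article, print: mc=N count=N channel=N slot=N
edac_ce_fields() {
    sed -n 's/.*EDAC MC\([0-9][0-9]*\): \([0-9][0-9]*\) CE .*(channel:\([0-9][0-9]*\) *slot:\([0-9][0-9]*\).*/mc=\1 count=\2 channel=\3 slot=\4/p'
}
```

For example, dmesg | edac_ce_fields would summarize the matching lines. Translating channel and slot numbers into a silkscreen label such as DIMM_F1 depends on the motherboard and is not attempted here.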
The faulty slot should therefore be DIMM_F1.

Solution:
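The last step is to pull the DIMM out of the faulty F1 slot, or move it to a different memory slot. After that, the system boots without reporting the error again.

Reposted from: https://blog.51cto.com/linushai/2063768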