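1. Background
During burn-in (aging) tests at our factory, several units were found to have crashed. After some analysis I attributed these crashes to DDR problems. This article records the cases and the evidence behind that judgment, as a reference for stability engineers.
2025/04/24: Added cases: Sections 2.1 to 2.6
2025/05/13: Added new cases: Sections 2.7 to 2.8
2. Cases
2.1 Case 1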
[ 137.648537][ T0] Unable to handle kernel execute from non-executable memory at virtual address ffffffc00c92c7a8
[ 137.649182][ T2652] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP
[ 137.671119][ T0] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 137.671121][ T0] Mem abort info:
[ 137.671122][ T0] ESR = 0x0000000086000005
[ 137.671124][ T0] EC = 0x21: IABT (current EL), IL = 32 bits
[ 137.671125][ T0] SET = 0, FnV = 0
[ 137.671126][ T0] EA = 0, S1PTW = 0
[ 137.671127][ T0] FSC = 0x05: level 1 translation fault
[ 137.671129][ T0] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000c85d7000
[ 137.671131][ T0] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 137.671224][ C0] Unexpected kernel BRK exception at EL1
[ 137.733876][ C0] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 137.733881][ C0] Mem abort info:
[ 137.733882][ C0] ESR = 0x0000000086000005
[ 137.733885][ C0] EC = 0x21: IABT (current EL), IL = 32 bits
[ 137.733888][ C0] SET = 0, FnV = 0
[ 137.733890][ C0] EA = 0, S1PTW = 0
[ 137.733892][ C0] FSC = 0x05: level 1 translation fault
[ 137.733895][ C0] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000ecd5f000
[ 137.733900][ C0] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 137.789473][ T2652] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 137.789476][ T2652] Mem abort info:
[ 137.789477][ T2652] ESR = 0x0000000086000005
[ 137.789479][ T2652] EC = 0x21: IABT (current EL), IL = 32 bits
[ 137.789482][ T2652] SET = 0, FnV = 0
[ 137.789484][ T2652] EA = 0, S1PTW = 0
[ 137.789485][ T2652] FSC = 0x05: level 1 translation fault
[ 137.789487][ T2652] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000d422e000
[ 137.789491][ T2652] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 139.095329][ T734] Unexpected kernel BRK exception at EL1
[ 139.154139][ T734] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 139.154142][ T734] Mem abort info:
[ 139.154143][ T734] ESR = 0x0000000086000005
[ 139.154145][ T734] EC = 0x21: IABT (current EL), IL = 32 bits
[ 139.154147][ T734] SET = 0, FnV = 0
[ 139.154149][ T734] EA = 0, S1PTW = 0
[ 139.154150][ T734] FSC = 0x05: level 1 translation fault
[ 139.154152][ T734] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000cbdfe000
[ 139.154155][ T734] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
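Basis for the judgment:
- Context T0 reports "Unable to handle kernel execute from non-executable memory", which means the kernel tried to execute code from a memory region that is not mapped as executable.
- Thread T2652 reports "Undefined instruction", meaning the CPU fetched an instruction word it could not decode, which usually points to a corrupted instruction stream or a corrupted branch target rather than a bug in the code itself. It then reports "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000". The ESR shows an instruction abort (EC = 0x21), so the kernel actually tried to execute from address 0.
- Thread T734 reports "Unexpected kernel BRK exception at EL1" followed by "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000".
Different threads on different cores hit exceptions within the same short time window, and the exceptions themselves look random, as if a large part of the page tables or the kernel text had become invalid. I therefore consider DDR instability the most likely cause.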
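The "several contexts fault within the same time window" criterion can be checked mechanically. Below is a minimal sketch, assuming the log carries the `[timestamp][caller]` prefix seen above (`T<pid>` for a task, `C<cpu>` for non-task context). The function name, the marker list, and the 2-second window are my own illustrative choices, not part of any kernel tool.

```python
import re

# Matches the "[ timestamp][ caller]" prefix, e.g. "[ 137.648537][ T0] ...".
PREFIX = re.compile(r"\[\s*(\d+\.\d+)\]\[\s*([TC]\d+)\]\s*(.*)")

# Substrings that start a fault report (taken from the Case 1 log).
FAULT_MARKERS = (
    "Unable to handle kernel",
    "Internal error: Oops",
    "Unexpected kernel BRK",
)

def faulting_contexts(log_text, window_s=2.0):
    """Return the callers that reported a fault within window_s of the first fault."""
    hits = []
    for line in log_text.splitlines():
        m = PREFIX.match(line.strip())
        if m and any(marker in m.group(3) for marker in FAULT_MARKERS):
            hits.append((float(m.group(1)), m.group(2)))
    if not hits:
        return set()
    first = min(t for t, _ in hits)
    return {ctx for t, ctx in hits if t - first <= window_s}

sample = """\
[ 137.648537][ T0] Unable to handle kernel execute from non-executable memory at virtual address ffffffc00c92c7a8
[ 137.649182][ T2652] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP
[ 137.671224][ C0] Unexpected kernel BRK exception at EL1
[ 139.095329][ T734] Unexpected kernel BRK exception at EL1
"""
contexts = faulting_contexts(sample)
print(sorted(contexts))
```

A single buggy driver normally produces one faulting context; several unrelated contexts in the result set is the pattern this article treats as suspicious.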
2.2 Case 2
[ 909.331382][T27244] Unable to handle kernel level 3 address size fault at virtual address ffffff816b0c8f58
[ 909.332051][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffff80a7ae4920
[ 909.332061][ T1548] Mem abort info:
[ 909.332064][ T1548] ESR = 0x0000000096000003
[ 909.332066][ T1548] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.332070][ T1548] SET = 0, FnV = 0
[ 909.332073][ T1548] EA = 0, S1PTW = 0
[ 909.332075][ T1548] FSC = 0x03: level 3 address size fault
[ 909.332078][ T1548] Data abort info:
[ 909.332079][ T1548] ISV = 0, ISS = 0x00000003
[ 909.332081][ T1548] CM = 0, WnR = 0
[ 909.332084][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.332089][ T1548] [ffffff80a7ae4920] pgd=180000027fc94003, p4d=180000027fc94003, pud=180000027fc94003, pmd=180000027fb56003, pte=0068020127ae4707
[ 909.332107][ T1548] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP
[ 909.332222][ T1548] Dumping ftrace buffer:
[ 909.332245][ T1548] (ftrace buffer empty)
[ 909.332248][ T1548] Modules linked in: wlan(OE)
[ 909.332253][ T1435] Unable to handle kernel level 3 address size fault at virtual address ffffff8003fc0d90
[ 909.332255][ T1548] nt36xxx_spi(OE)
[ 909.332258][ T1548] focaltech_spi(OE)
[ 909.332260][ T1435] Mem abort info:
[ 909.332261][ T1548] msm_drm(OE)
[ 909.332262][ T1435] ESR = 0x0000000096000003
[ 909.332264][ T1548] rmnet_offload(OE)
[ 909.332264][ T1435] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.332267][ T1548] rmnet_shs(OE)
[ 909.332267][ T1435] SET = 0, FnV = 0
[ 909.332270][ T1548] rmnet_perf_tether(OE)
[ 909.332270][ T1435] EA = 0, S1PTW = 0
[ 909.332273][ T1548] rmnet_wlan(OE)
[ 909.332273][ T1435] FSC = 0x03: level 3 address size fault
[ 909.332276][ T1435] Data abort info:
[ 909.332276][ T1548] msm_kgsl(OE)
[ 909.332277][ T1435] ISV = 0, ISS = 0x00000003
[ 909.332279][ T1548] rmnet_perf(OE)
[ 909.332279][ T1435] CM = 0, WnR = 0
[ 909.332282][ T1548] rmnet_core(OE)
[ 909.332282][ T1435] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.332284][ T1548] mac80211(E)
[ 909.332286][ T1435] [ffffff8003fc0d90] pgd=180000027ff78003
[ 909.332286][ T1548] machine_dlkm(OE)
[ 909.332288][ T1435] , p4d=180000027ff78003
[ 909.332289][ T1548] ipanetm(OE)
[ 909.332290][ T1435] , pud=180000027ff78003
[ 909.332291][ T1548] rmnet_ctl(OE)
[ 909.332292][ T1435] , pmd=180000027ff6b003
[ 909.332294][ T1548] wcd938x_dlkm(OE)
[ 909.332294][ T1435] , pte=0068020083fc0707
[ 909.333074][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffffc0010c4440
[ 909.333078][ T1548] Mem abort info:
[ 909.333079][ T1548] ESR = 0x0000000096000003
[ 909.333081][ T1548] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.333084][ T1548] SET = 0, FnV = 0
[ 909.333086][ T1548] EA = 0, S1PTW = 0
[ 909.333088][ T1548] FSC = 0x03: level 3 address size fault
[ 909.333090][ T1548] Data abort info:
[ 909.333092][ T1548] ISV = 0, ISS = 0x00000003
[ 909.333093][ T1548] CM = 0, WnR = 0
[ 909.333095][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.333098][ T1548] [ffffffc0010c4440] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000009dc1f003, pte=006802009b976703
[ 909.353464][ C5] Mem abort info:
[ 909.353471][ C5] ESR = 0x0000000096000003
[ 909.353473][ C5] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.353476][ C5] SET = 0, FnV = 0
[ 909.353477][ C5] EA = 0, S1PTW = 0
[ 909.353479][ C5] FSC = 0x03: level 3 address size fault
[ 909.353481][ C5] Data abort info:
[ 909.353482][ C5] ISV = 0, ISS = 0x00000003
[ 909.353483][ C5] CM = 0, WnR = 0
[ 909.353485][ C5] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.353490][ C5] [ffffff816b0c8f58] pgd=180000027f691003, p4d=180000027f691003, pud=180000027f691003, pmd=180000027f538003, pte=00680001eb0c8707
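Basis for the judgment:
Similar to Case 1: different threads on different cores triggered "level 3 address size fault" data aborts within the same time window, which is highly abnormal. DDR instability is the most likely cause.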
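An address size fault is raised when the output address held in a translation table entry has bits set above the configured physical address size, so it is worth looking at the `pte=` values in the dump. The sketch below extracts the output address from each level 3 entry printed above, assuming the 4 KiB granule layout where the address sits in bits [47:12]. The helper names are hypothetical. The actual physical address size of this SoC is not shown in the log, so whether a given bit is out of range is an inference.

```python
def pte_output_address(pte):
    """Output address of a 4 KiB-granule level 3 descriptor: bits [47:12]."""
    return pte & 0x0000_FFFF_FFFF_F000

def highest_set_bit(value):
    """Index of the most significant set bit (value must be non-zero)."""
    return value.bit_length() - 1

# Faulting virtual address -> pte value, copied from the Case 2 log.
ptes = {
    0xFFFFFF80A7AE4920: 0x0068020127AE4707,
    0xFFFFFF8003FC0D90: 0x0068020083FC0707,
    0xFFFFFFC0010C4440: 0x006802009B976703,
    0xFFFFFF816B0C8F58: 0x00680001EB0C8707,
}

for va, pte in ptes.items():
    pa = pte_output_address(pte)
    print(f"va={va:#x} valid={pte & 1} pa={pa:#x} top_bit={highest_set_bit(pa)}")
```

Three of the four entries carry an output address with bit 41 set, while the fourth looks ordinary. Note that the kernel prints these entries after the fault has been taken, so the dumped value is not necessarily the one the hardware walker read.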
2.3 Case 3
[10916.779670][ T2330] Unable to handle kernel level 3 address size fault at virtual address ffffffc00a2f2ec0
[10916.781507][ C1] CFI failure at try_to_wake_up+0x43c/0x8ac (target: select_task_rq_rt+0x0/0x2c4; expected type: 0xaa3494c0)
[10916.781696][T18182] Unable to handle kernel level 3 address size fault at virtual address ffffffc00a2f2368
[10916.781702][T18182] Mem abort info:
[10916.781705][T18182] ESR = 0x0000000096000003
[10916.781707][T18182] EC = 0x25: DABT (current EL), IL = 32 bits
[10916.781711][T18182] SET = 0, FnV = 0
[10916.781713][T18182] EA = 0, S1PTW = 0
[10916.781715][T18182] FSC = 0x03: level 3 address size fault
[10916.781718][T18182] Data abort info:
[10916.781720][T18182] ISV = 0, ISS = 0x00000003
[10916.781722][T18182] CM = 0, WnR = 0
[10916.781724][T18182] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[10916.781728][T18182] [ffffffc00a2f2368] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000027fff9003, pte=00782000a22f2703
[10916.781742][T18182] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP
[10916.782643][T18182] CPU: 3 PID: 18182 Comm: oid.aac.decoder Tainted: G C OE 6.1.118-android14-11-maybe-dirty #1
[10916.782650][T18182] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.782655][T18182] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[10916.782661][T18182] pc : uprobe_mmap+0x38/0x500
[10916.782672][T18182] lr : mmap_region+0x7a0/0xa40
[10916.782680][T18182] sp : ffffffc00e64bae0
[10916.782682][T18182] x29: ffffffc00e64bb00 x28: ffffff801cb93c00 x27: ffffff80b4a7ec80
[10916.782690][T18182] x26: ffffff806b78a580 x25: ffffff80b4a7ec80 x24: ffffff818bd800c8
[10916.782697][T18182] x23: 0000000000000001 x22: 00000000040444f9 x21: ffffff801cb93c00
[10916.782704][T18182] x20: 000000740106e000 x19: ffffff80b4a7ec80 x18: ffffffc00a3b3058
[10916.782711][T18182] x17: 000000740106e000 x16: 0000000000000000 x15: 0000000000000008
[10916.782718][T18182] x14: 0000000000000000 x13: 0000000000000010 x12: 0000000000000008
[10916.782724][T18182] x11: ffffff80ac8b1a68 x10: 0000000000000000 x9 : 0000000000000000
[10916.782730][T18182] x8 : ffffffc00a2f2368 x7 : 0000000000000000 x6 : ffffff80b4a7ec80
[10916.782737][T18182] x5 : 0000000000000000 x4 : 00000000040444f9 x3 : 000000740106f000
[10916.782743][T18182] x2 : ffffffc0083235d0 x1 : 00000000040444f9 x0 : ffffff818bd800c8
[10916.782751][T18182] Call trace:
[10916.782755][T18182] uprobe_mmap+0x38/0x500
[10916.782761][T18182] mmap_region+0x7a0/0xa40
[10916.782766][T18182] do_mmap+0x3f8/0x520
[10916.782771][T18182] vm_mmap_pgoff+0x19c/0x204
[10916.782779][T18182] ksys_mmap_pgoff+0x78/0xf4
[10916.782784][T18182] __arm64_sys_mmap+0x34/0x44
[10916.782791][T18182] invoke_syscall+0x58/0x114
[10916.782796][T18182] el0_svc_common+0x88/0xfc
[10916.782801][T18182] do_el0_svc+0x24/0x84
[10916.782805][T18182] el0_svc+0x2c/0x90
[10916.782814][T18182] el0t_64_sync_handler+0x68/0xb4
[10916.782819][T18182] el0t_64_sync+0x1a4/0x1a8
[10916.782829][T18182] Code: f9431d08 f81f83a8 f0010108 910da108 (f8bfc108)
[10916.782833][T18182] ---[ end trace 0000000000000000 ]---
[10916.783247][T18182] Kernel panic - not syncing: Oops: Fatal exception
[10916.783251][T18182] SMP: stopping secondary CPUs
[10916.783275][ C6] VendorHooks: CPU6: stopping
[10916.783280][ C6] CPU: 6 PID: 21140 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[10916.783284][ C6] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783286][ C6] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783289][ C6] pc : 000000000040d8f4
[10916.783290][ C6] lr : 0000000000408f80
[10916.783291][ C6] sp : 00000073e88be740
[10916.783292][ C6] x29: 00000073e88be770 x28: 00000000ffffffff x27: 000000003d7f0b10
[10916.783297][ C6] x26: 0000000000000000 x25: 000000003d7f0b10 x24: 0000000000000002
[10916.783301][ C6] x23: 00000073ddd7d108 x22: 0000000000000020 x21: 0000000000000000
[10916.783304][ C6] x20: 000000003d7f08a0 x19: 000000000002d31e x18: ffffffffffffff9c
[10916.783307][ C6] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783311][ C6] x14: 00000073dd4e53f0 x13: 0000000000338eb0 x12: 00000073de1c9e40
[10916.783314][ C6] x11: 000000003d7f0cf8 x10: 000000003d7f18f8 x9 : 4b3473b672eba93d
[10916.783317][ C6] x8 : 0a8fd9b8d6950f60 x7 : f9be10c4ee25e2da x6 : 000000003d7f14f8
[10916.783321][ C6] x5 : ab8f8c979761d4ff x4 : 000000003d7f18f8 x3 : 0000000000112e80
[10916.783324][ C6] x2 : 00000000ffffffff x1 : 0000000000000000 x0 : 000000003d7f0bd0
[10916.783328][ C0] VendorHooks: CPU0: stopping
[10916.783333][ C0] CPU: 0 PID: 21139 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[10916.783338][ C0] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783340][ C0] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783345][ C0] pc : 0000000000403dac
[10916.783347][ C0] lr : 000000000040431c
[10916.783348][ C0] sp : 00000073e90be750
[10916.783350][ C0] x29: 00000073e90be750 x28: 00000000ffffffff x27: 000000003d7ef690
[10916.783357][ C0] x26: 0000000000000000 x25: 000000003d7ef690 x24: 0000000000000001
[10916.783362][ C0] x23: 00000073cdc6a1a0 x22: 00000073ce1c9eb0 x21: 0000000000000002
[10916.783368][ C0] x20: 000000003d7ef420 x19: 0000000000044ba0 x18: ffffffffffffff9c
[10916.783374][ C0] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783380][ C0] x14: 00000073cd4e53f0 x13: 0000000000338eb0 x12: 00000073ce1c9e40
[10916.783386][ C0] x11: 000000003d7ef878 x10: 000000003d7f0478 x9 : 91b7fbbaf02e9e2e
[10916.783391][ C0] x8 : 20768bca88d14982 x7 : 512baf47a67b8016 x6 : 000000003d7f0078
[10916.783397][ C0] x5 : d65f7063f8d5498f x4 : 0000000000487f68 x3 : 00000073cdd7d019
[10916.783403][ C0] x2 : 0000000031ed750b x1 : 00000000000000f7 x0 : 00000073cdd13f7d
[10916.783410][ C4] VendorHooks: CPU4: stopping
[10916.783414][ C4] CPU: 4 PID: 21136 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[10916.783419][ C4] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783421][ C4] pstate: 20001000 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783426][ C4] pc : 000000000040d4b0
[10916.783427][ C4] lr : 000000000040d8c8
[10916.783429][ C4] sp : 00000073f11d6720
[10916.783430][ C4] x29: 00000073f11d6720 x28: 00000000ffffffff x27: 000000003d7eb910
[10916.783436][ C4] x26: 0000000000000000 x25: 000000003d7eb910 x24: 0000000000000001
[10916.783442][ C4] x23: 00000073e5c6a1a0 x22: 0000000000000000 x21: 00000000ffffffff
[10916.783448][ C4] x20: 000000003d7eb9d0 x19: 000000003d7eb9d0 x18: ffffffffffffff9c
[10916.783454][ C4] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783459][ C4] x14: 00000073e54e53f0 x13: 0000000000338eb0 x12: 00000073e61c9e40
[10916.783465][ C4] x11: 000000003d7ebaf8 x10: 000000003d7ec6f8 x9 : 10053edea2ad6252
[10916.783471][ C4] x8 : 3d153fa93c6a4f09 x7 : 00000000000001fc x6 : 000000003d7ebf18
[10916.783476][ C4] x5 : 80cd37161fb6742d x4 : 000000003d7ec318 x3 : a691699a77262e3d
[10916.783482][ C4] x2 : 000000003d7ec718 x1 : 000000003d7ebaf0 x0 : 000000003d7ec2f8
[10916.783490][ C5] VendorHooks: CPU5: stopping
[10916.783494][ C5] CPU: 5 PID: 21137 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[10916.783499][ C5] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783501][ C5] pstate: 20001000 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783506][ C5] pc : 000000000040d3ec
[10916.783507][ C5] lr : 000000000040d8c8
[10916.783509][ C5] sp : 00000073f39d6720
[10916.783510][ C5] x29: 00000073f39d6720 x28: 00000000ffffffff x27: 000000003d7ecd90
[10916.783516][ C5] x26: 0000000000000000 x25: 000000003d7ecd90 x24: 0000000000000000
[10916.783522][ C5] x23: 00000073c5b57238 x22: 0000000000000000 x21: 00000000ffffffff
[10916.783527][ C5] x20: 000000003d7ece50 x19: 000000003d7ece50 x18: ffffffffffffff9c
[10916.783533][ C5] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783539][ C5] x14: 00000073c54e53f0 x13: 0000000000338eb0 x12: 00000073c61c9e40
[10916.783544][ C5] x11: 000000003d7ecf78 x10: 000000003d7edb78 x9 : edc62dc9e20a6222
[10916.783550][ C5] x8 : ca5081040d02d94e x7 : 000000003d7ed1f8 x6 : 000000003d7ed778
[10916.783556][ C5] x5 : 2dd9c9cc140bf6aa x4 : 000000003d7eddf8 x3 : 9f128466d6045ff6
[10916.783561][ C5] x2 : 000000003d7ed9f8 x1 : 000000003d7ecf70 x0 : 000000003d7ed778
[10916.783568][ C2] VendorHooks: CPU2: stopping
[10916.783572][ C2] CPU: 2 PID: 21138 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[10916.783577][ C2] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783579][ C2] pstate: 20001000 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783583][ C2] pc : 000000000040d3d0
[10916.783585][ C2] lr : 000000000040d8c8
[10916.783586][ C2] sp : 00000073eaffe720
[10916.783588][ C2] x29: 00000073eaffe720 x28: 00000000ffffffff x27: 000000003d7ee210
[10916.783594][ C2] x26: 0000000000000000 x25: 000000003d7ee210 x24: 0000000000000001
[10916.783600][ C2] x23: 00000073edc6a1a0 x22: 0000000000000000 x21: 00000000ffffffff
[10916.783605][ C2] x20: 000000003d7ee2d0 x19: 000000003d7ee2d0 x18: ffffffffffffff9c
[10916.783611][ C2] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783616][ C2] x14: 00000073ed4e53f0 x13: 0000000000338eb0 x12: 00000073ee1c9e40
[10916.783622][ C2] x11: 000000003d7ee3f8 x10: 000000003d7eeff8 x9 : 2a40f0d8788f2339
[10916.783627][ C2] x8 : 0067b24ec2453898 x7 : 000000003d7ee758 x6 : 000000003d7eebf8
[10916.783633][ C2] x5 : ad17418726ae9005 x4 : 000000003d7ef358 x3 : fd96a6b096648357
[10916.783638][ C2] x2 : 000000003d7eef58 x1 : 000000003d7ee3f0 x0 : 000000003d7eebf8
[10916.789423][ C7] VendorHooks: CPU7: stopping
[10916.789425][ C7] CPU: 7 PID: 2330 Comm: HwBinder:2002_1 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[10916.789427][ C7] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.789428][ C7] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[10916.789430][ C7] pc : console_emit_next_record+0x348/0x3d4
[10916.789435][ C7] lr : console_emit_next_record+0x344/0x3d4
[10916.789438][ C7] sp : ffffffc013f433f0
[10916.789438][ C7] x29: ffffffc013f43480 x28: ffffffc00a0ecd90 x27: ffffffc00a0cae30
[10916.789441][ C7] x26: 0000000000000000 x25: ffffffc00a2e7660 x24: 000000000000006e
[10916.789443][ C7] x23: ffffffc00a2e97b8 x22: ffffffc00a2e7778 x21: ffffffc00a1f78d0
[10916.789445][ C7] x20: 0000000000000001 x19: ffffffc013f434dc x18: ffffffc013b5d0a8
[10916.789447][ C7] x17: 00000000529c6ef0 x16: 00000000529c6ef0 x15: 75616620657a6973
[10916.789449][ C7] x14: 000000000000000a x13: 0000000000000030 x12: 000000000000006e
[10916.789451][ C7] x11: 00000000ffffffff x10: 0000000100000001 x9 : ffffff808ff62580
[10916.789452][ C7] x8 : 0000000100000001 x7 : 205b5d3037363937 x6 : 372e36313930315b
[10916.789454][ C7] x5 : ffffffc00a2e97cf x4 : ffffffc013f4336f x3 : ffffffc008887cf8
[10916.789456][ C7] x2 : 0000000000000005 x1 : 00000000000000c0 x0 : ffffffc00a1f6ce8
[10916.789458][ C7] Call trace:
[10916.789459][ C7] console_emit_next_record+0x348/0x3d4
[10916.789461][ C7] console_unlock+0x154/0x24c
[10916.789466][ C7] vprintk_emit+0xcc/0x27c
[10916.789469][ C7] vprintk_default+0x44/0x70
[10916.789472][ C7] vprintk+0xe4/0x114
[10916.789473][ C7] _printk+0x54/0x80
[10916.789476][ C7] do_mem_abort+0xd0/0x118
[10916.789480][ C7] el1_abort+0x3c/0x5c
[10916.789484][ C7] el1h_64_sync_handler+0x54/0x90
[10916.789485][ C7] el1h_64_sync+0x68/0x6c
[10916.789488][ C7] memblock_is_map_memory+0x34/0x84
[10916.789492][ C7] __check_object_size+0xbc/0x29c
[10916.789496][ C7] binder_alloc_copy_user_to_buffer+0x108/0x288
[10916.789500][ C7] binder_transaction+0x1130/0x1ea8
[10916.789503][ C7] binder_thread_write+0xb54/0x2768
[10916.789506][ C7] binder_ioctl+0x550/0x2438
[10916.789508][ C7] __arm64_sys_ioctl+0xa8/0xe4
[10916.789512][ C7] invoke_syscall+0x58/0x114
[10916.789515][ C7] el0_svc_common+0x88/0xfc
[10916.789516][ C7] do_el0_svc+0x24/0x84
[10916.789517][ C7] el0_svc+0x2c/0x90
[10916.789519][ C7] el0t_64_sync_handler+0x68/0xb4
[10916.789521][ C7] el0t_64_sync+0x1a4/0x1a8
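Analysis:
First, multiple cores hit exceptions at the same time.
Second, trace one of the threads (T18182):
The x8 register is formed from the page address 0xFFFFFFC00A2F2000 plus the offset 0x368.
uprobe_mmap() accesses the global variable uprobes_tree:
if (no_uprobe_events())
    return 0;
It tries to read &uprobes_tree.rb_node → 0xffffffc00a2f2368.
The page table dump shows a valid level 3 entry for this address (pte=00782000a22f2703), so the page was mapped. The output address in that entry, 0x2000a22f2000, has bit 45 set, which puts it outside the configured physical address range.
This triggers the level 3 address size fault.
This case is rather strange. There is no direct evidence yet that DDR is at fault, but because several cores faulted at the same time, DDR is a reasonable suspect.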
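The origin of x8 can be cross-checked against the `Code:` line of the oops, whose bracketed word is the instruction at the faulting pc (uprobe_mmap+0x38). The sketch below decodes the last three words by hand. The decoder functions are hypothetical helpers that cover only these three encodings, and the "implied code page" is derived from x8 because the log does not print the absolute address of uprobe_mmap.

```python
def adrp_offset(insn):
    """Decode ADRP Xd, label. Returns (rd, byte offset added to the PC's page)."""
    if insn & 0x9F000000 != 0x90000000:
        raise ValueError("not an ADRP instruction")
    immlo = (insn >> 29) & 0x3
    immhi = (insn >> 5) & 0x7FFFF
    imm = (immhi << 2) | immlo
    if imm & (1 << 20):  # sign-extend the 21-bit page count
        imm -= 1 << 21
    return insn & 0x1F, imm << 12

def add_imm(insn):
    """Decode 64-bit ADD Xd, Xn, #imm. Returns (rd, rn, imm)."""
    if insn & 0xFF800000 != 0x91000000:
        raise ValueError("not a 64-bit ADD (immediate)")
    imm = (insn >> 10) & 0xFFF
    if insn & (1 << 22):  # optional LSL #12
        imm <<= 12
    return insn & 0x1F, (insn >> 5) & 0x1F, imm

def ldapr_regs(insn):
    """Decode 64-bit LDAPR Xt, [Xn]. Returns (rt, rn)."""
    if insn & 0xFFFFFC00 != 0xF8BFC000:
        raise ValueError("not a 64-bit LDAPR")
    return insn & 0x1F, (insn >> 5) & 0x1F

# Last three words of the "Code:" line in the Case 3 log.
adrp, add, fault = 0xF0010108, 0x910DA108, 0xF8BFC108
x8 = 0xFFFFFFC00A2F2368  # x8 from the register dump

rd, page_off = adrp_offset(adrp)
_, rn, imm = add_imm(add)
rt, base = ldapr_regs(fault)
target_page = x8 - imm
code_page = target_page - page_off  # page the ADRP itself must sit in

print(f"adrp  x{rd}, pc_page+{page_off:#x}")
print(f"add   x{rd}, x{rn}, #{imm:#x}")
print(f"ldapr x{rt}, [x{base}]")
print(f"target page {target_page:#x}, implied code page {code_page:#x}")
```

The decoded sequence is an ADRP/ADD pair that builds the address in x8, followed by a load from [x8], which matches the read of uprobes_tree.rb_node described above. The instructions and the address are self-consistent, so the fault does not appear to come from a wild pointer.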
2.4 Case 4
[ 909.331382][T27244] Unable to handle kernel level 3 address size fault at virtual address ffffff816b0c8f58
[ 909.332051][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffff80a7ae4920
[ 909.332061][ T1548] Mem abort info:
[ 909.332064][ T1548] ESR = 0x0000000096000003
[ 909.332066][ T1548] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.332070][ T1548] SET = 0, FnV = 0
[ 909.332073][ T1548] EA = 0, S1PTW = 0
[ 909.332075][ T1548] FSC = 0x03: level 3 address size fault
[ 909.332078][ T1548] Data abort info:
[ 909.332079][ T1548] ISV = 0, ISS = 0x00000003
[ 909.332081][ T1548] CM = 0, WnR = 0
[ 909.332084][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.332089][ T1548] [ffffff80a7ae4920] pgd=180000027fc94003, p4d=180000027fc94003, pud=180000027fc94003, pmd=180000027fb56003, pte=0068020127ae4707
[ 909.332107][ T1548] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP
[ 909.332222][ T1548] Dumping ftrace buffer:
[ 909.332245][ T1548] (ftrace buffer empty)
[ 909.332248][ T1548] Modules linked in: wlan(OE)
[ 909.332253][ T1435] Unable to handle kernel level 3 address size fault at virtual address ffffff8003fc0d90
[ 909.332255][ T1548] nt36xxx_spi(OE)
[ 909.332258][ T1548] focaltech_spi(OE)
[ 909.332260][ T1435] Mem abort info:
[ 909.332261][ T1548] msm_drm(OE)
[ 909.332262][ T1435] ESR = 0x0000000096000003
[ 909.332264][ T1548] rmnet_offload(OE)
[ 909.332264][ T1435] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.332267][ T1548] rmnet_shs(OE)
[ 909.332267][ T1435] SET = 0, FnV = 0
[ 909.332270][ T1548] rmnet_perf_tether(OE)
[ 909.332270][ T1435] EA = 0, S1PTW = 0
[ 909.332273][ T1548] rmnet_wlan(OE)
[ 909.332273][ T1435] FSC = 0x03: level 3 address size fault
[ 909.332276][ T1435] Data abort info:
[ 909.332276][ T1548] msm_kgsl(OE)
[ 909.332277][ T1435] ISV = 0, ISS = 0x00000003
[ 909.332279][ T1548] rmnet_perf(OE)
[ 909.332279][ T1435] CM = 0, WnR = 0
[ 909.332282][ T1548] rmnet_core(OE)
[ 909.332282][ T1435] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.332284][ T1548] mac80211(E)
[ 909.332286][ T1435] [ffffff8003fc0d90] pgd=180000027ff78003
[ 909.332286][ T1548] machine_dlkm(OE)
[ 909.332288][ T1435] , p4d=180000027ff78003
[ 909.332289][ T1548] ipanetm(OE)
[ 909.332290][ T1435] , pud=180000027ff78003
[ 909.332291][ T1548] rmnet_ctl(OE)
[ 909.332292][ T1435] , pmd=180000027ff6b003
[ 909.332294][ T1548] wcd938x_dlkm(OE)
[ 909.332294][ T1435] , pte=0068020083fc0707
[ 909.333074][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffffc0010c4440
[ 909.333078][ T1548] Mem abort info:
[ 909.333079][ T1548] ESR = 0x0000000096000003
[ 909.333081][ T1548] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.333084][ T1548] SET = 0, FnV = 0
[ 909.333086][ T1548] EA = 0, S1PTW = 0
[ 909.333088][ T1548] FSC = 0x03: level 3 address size fault
[ 909.333090][ T1548] Data abort info:
[ 909.333092][ T1548] ISV = 0, ISS = 0x00000003
[ 909.333093][ T1548] CM = 0, WnR = 0
[ 909.333095][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.333098][ T1548] [ffffffc0010c4440] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000009dc1f003, pte=006802009b976703
[ 909.353464][ C5] Mem abort info:
[ 909.353471][ C5] ESR = 0x0000000096000003
[ 909.353473][ C5] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.353476][ C5] SET = 0, FnV = 0
[ 909.353477][ C5] EA = 0, S1PTW = 0
[ 909.353479][ C5] FSC = 0x03: level 3 address size fault
[ 909.353481][ C5] Data abort info:
[ 909.353482][ C5] ISV = 0, ISS = 0x00000003
[ 909.353483][ C5] CM = 0, WnR = 0
[ 909.353485][ C5] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
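Basis for the judgment:
Similar to Case 1: different threads on different cores triggered "Unable to handle kernel level 3 address size fault at virtual address" within the same time window, which is highly abnormal. DDR instability is the most likely cause.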
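The "Mem abort info" lines are simply the bit fields of the ESR value, so they can be reproduced from the raw number when only the "Internal error: Oops: ..." line survives in a truncated log. This is a minimal sketch; the name tables cover only the exception classes and fault kinds that appear in this article, and the function names are my own.

```python
# Exception classes seen in the logs of this article.
EC_NAMES = {0x21: "IABT (current EL)", 0x25: "DABT (current EL)"}

# Upper four bits of the 6-bit fault status code; the low two bits are the table level.
FSC_KINDS = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}

def fsc_name(fsc):
    kind = FSC_KINDS.get(fsc >> 2)
    return f"level {fsc & 0x3} {kind}" if kind else f"other (FSC={fsc:#04x})"

def decode_esr(esr):
    iss = esr & 0x1FFFFFF            # ISS: bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,    # exception class: bits [31:26]
        "il": (esr >> 25) & 0x1,     # 1 means a 32-bit instruction
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,     # write-not-read, meaningful for data aborts only
        "fsc": iss & 0x3F,           # fault status code: ISS bits [5:0]
    }

for esr in (0x86000005, 0x96000003, 0x96000005, 0x96000006):
    d = decode_esr(esr)
    ec = EC_NAMES.get(d["ec"], "unknown")
    print(f"ESR={esr:#010x} EC={d['ec']:#x}: {ec}, FSC={d['fsc']:#04x}: {fsc_name(d['fsc'])}")
```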
2.5 Case 5
[ 909.331382][T27244] Unable to handle kernel level 3 address size fault at virtual address ffffff816b0c8f58
[ 909.332051][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffff80a7ae4920
[ 909.332061][ T1548] Mem abort info:
[ 909.332064][ T1548] ESR = 0x0000000096000003
[ 909.332066][ T1548] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.332070][ T1548] SET = 0, FnV = 0
[ 909.332073][ T1548] EA = 0, S1PTW = 0
[ 909.332075][ T1548] FSC = 0x03: level 3 address size fault
[ 909.332078][ T1548] Data abort info:
[ 909.332079][ T1548] ISV = 0, ISS = 0x00000003
[ 909.332081][ T1548] CM = 0, WnR = 0
[ 909.332084][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.332089][ T1548] [ffffff80a7ae4920] pgd=180000027fc94003, p4d=180000027fc94003, pud=180000027fc94003, pmd=180000027fb56003, pte=0068020127ae4707
[ 909.332107][ T1548] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP
[ 909.332222][ T1548] Dumping ftrace buffer:
[ 909.332245][ T1548] (ftrace buffer empty)
[ 909.332248][ T1548] Modules linked in: wlan(OE)
[ 909.332253][ T1435] Unable to handle kernel level 3 address size fault at virtual address ffffff8003fc0d90
[ 909.332255][ T1548] nt36xxx_spi(OE)
[ 909.332258][ T1548] focaltech_spi(OE)
[ 909.332260][ T1435] Mem abort info:
[ 909.332261][ T1548] msm_drm(OE)
[ 909.332262][ T1435] ESR = 0x0000000096000003
[ 909.332264][ T1548] rmnet_offload(OE)
[ 909.332264][ T1435] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.332267][ T1548] rmnet_shs(OE)
[ 909.332267][ T1435] SET = 0, FnV = 0
[ 909.332270][ T1548] rmnet_perf_tether(OE)
[ 909.332270][ T1435] EA = 0, S1PTW = 0
[ 909.332273][ T1548] rmnet_wlan(OE)
[ 909.332273][ T1435] FSC = 0x03: level 3 address size fault
[ 909.332276][ T1435] Data abort info:
[ 909.332276][ T1548] msm_kgsl(OE)
[ 909.332277][ T1435] ISV = 0, ISS = 0x00000003
[ 909.332279][ T1548] rmnet_perf(OE)
[ 909.332279][ T1435] CM = 0, WnR = 0
[ 909.332282][ T1548] rmnet_core(OE)
[ 909.332282][ T1435] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.332284][ T1548] mac80211(E)
[ 909.332286][ T1435] [ffffff8003fc0d90] pgd=180000027ff78003
[ 909.332286][ T1548] machine_dlkm(OE)
[ 909.332288][ T1435] , p4d=180000027ff78003
[ 909.332289][ T1548] ipanetm(OE)
[ 909.332290][ T1435] , pud=180000027ff78003
[ 909.332291][ T1548] rmnet_ctl(OE)
[ 909.332292][ T1435] , pmd=180000027ff6b003
[ 909.332294][ T1548] wcd938x_dlkm(OE)
[ 909.332294][ T1435] , pte=0068020083fc0707
[ 909.333074][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffffc0010c4440
[ 909.333078][ T1548] Mem abort info:
[ 909.333079][ T1548] ESR = 0x0000000096000003
[ 909.333081][ T1548] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.333084][ T1548] SET = 0, FnV = 0
[ 909.333086][ T1548] EA = 0, S1PTW = 0
[ 909.333088][ T1548] FSC = 0x03: level 3 address size fault
[ 909.333090][ T1548] Data abort info:
[ 909.333092][ T1548] ISV = 0, ISS = 0x00000003
[ 909.333093][ T1548] CM = 0, WnR = 0
[ 909.333095][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 909.333098][ T1548] [ffffffc0010c4440] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000009dc1f003, pte=006802009b976703
[ 909.353464][ C5] Mem abort info:
[ 909.353471][ C5] ESR = 0x0000000096000003
[ 909.353473][ C5] EC = 0x25: DABT (current EL), IL = 32 bits
[ 909.353476][ C5] SET = 0, FnV = 0
[ 909.353477][ C5] EA = 0, S1PTW = 0
[ 909.353479][ C5] FSC = 0x03: level 3 address size fault
[ 909.353481][ C5] Data abort info:
[ 909.353482][ C5] ISV = 0, ISS = 0x00000003
[ 909.353483][ C5] CM = 0, WnR = 0
[ 909.353485][ C5] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
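Basis for the judgment:
Similar to Case 4: different threads on different cores triggered "Unable to handle kernel level 3 address size fault at virtual address" within the same time window, which is highly abnormal. DDR instability is the most likely cause.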
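When two threads oops at once, their output interleaves line by line (the module list of T1548 is mixed with the abort report of T1435 above), which makes each report hard to read. A minimal sketch for separating the streams by caller tag and re-joining continuation fragments follows. It assumes the `[timestamp][caller]` prefix is present, and the "starts with a comma" rule for continuations is a heuristic of mine that fits the page table dump lines in this log.

```python
import re
from collections import defaultdict

PREFIX = re.compile(r"\[\s*(\d+\.\d+)\]\[\s*([TC]\d+)\]\s?(.*)")

def split_by_context(log_text):
    """Group message bodies by caller tag, preserving order within each caller."""
    streams = defaultdict(list)
    for line in log_text.splitlines():
        m = PREFIX.match(line.strip())
        if m:
            streams[m.group(2)].append(m.group(3))
    return dict(streams)

def join_continuations(lines):
    """Append fragments that start with ',' to the previous line of the same caller."""
    merged = []
    for text in lines:
        if merged and text.startswith(","):
            merged[-1] += text
        else:
            merged.append(text)
    return merged

sample = """\
[ 909.332286][ T1435] [ffffff8003fc0d90] pgd=180000027ff78003
[ 909.332286][ T1548] machine_dlkm(OE)
[ 909.332288][ T1435] , p4d=180000027ff78003
[ 909.332289][ T1548] ipanetm(OE)
[ 909.332290][ T1435] , pud=180000027ff78003
[ 909.332291][ T1548] rmnet_ctl(OE)
[ 909.332292][ T1435] , pmd=180000027ff6b003
[ 909.332294][ T1548] wcd938x_dlkm(OE)
[ 909.332294][ T1435] , pte=0068020083fc0707
"""
streams = split_by_context(sample)
t1435 = join_continuations(streams["T1435"])
print(t1435[0])
```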
2.6 Case 6
[ 1052.754789][ T5244] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0
[ 1052.758577][T23517] [NVT-ts] nvt_bin_header_parser 188: ovly_info = 1, ilm_dlm_num = 2, ovly_sec_num = 3, info_sec_num = 13, partition = 18
[ 1052.767961][ T5244] Mem abort info:
[ 1052.767965][ T5244] ESR = 0x0000000096000005
[ 1052.767967][ T5244] EC = 0x25: DABT (current EL), IL = 32 bits
[ 1052.767969][ T5244] SET = 0, FnV = 0
[ 1052.794504][ T535] fg_read_current: [FG_NFG1000] successed to read IBAT:-923
[ 1052.794611][ T5244] EA = 0, S1PTW = 0
[ 1052.801073][T23517] [NVT-ts] nvt_bootloader_reset 848: end
[ 1052.808016][ T5244] FSC = 0x05: level 1 translation fault
[ 1052.808018][ T5244] Data abort info:
[ 1052.808019][ T5244] ISV = 0, ISS = 0x00000005
[ 1052.808020][ T5244] CM = 0, WnR = 0
[ 1052.808021][ T5244] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000d8f03000
[ 1052.808023][ T5244] [00000000000000e0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 1052.808030][ T5244] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[ 1052.808074][ T5244] Dumping ftrace buffer:
[ 1052.859124][T31837] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0
[ 1052.861728][T31830] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0
[ 1052.861735][T31830] Mem abort info:
[ 1052.861737][T31830] ESR = 0x0000000096000006
[ 1052.861739][T31830] EC = 0x25: DABT (current EL), IL = 32 bits
[ 1052.861742][T31830] SET = 0, FnV = 0
[ 1052.861744][T31830] EA = 0, S1PTW = 0
[ 1052.861746][T31830] FSC = 0x06: level 2 translation fault
[ 1052.861749][T31830] Data abort info:
[ 1052.861750][T31830] ISV = 0, ISS = 0x00000006
[ 1052.861751][T31830] CM = 0, WnR = 0
[ 1052.861753][T31830] user pgtable: 4k pages, 39-bit VAs, pgdp=00000001ea216000
[ 1052.861757][T31830] [00000000000000e0] pgd=08000001eccbe003, p4d=08000001eccbe003, pud=08000001eccbe003, pmd=0000000000000000
[ 1052.862418][ T5244] (ftrace buffer empty)
[ 1052.870882][ T535] sc853x-charger 0-0061: sc853x_get_adc_data 1 5190
[ 1052.872118][T31837] Mem abort info:
[ 1052.881584][ T5244] qpnp_lcdb_regulator(E)
[ 1052.889699][ T5244] CPU: 6 PID: 5244 Comm: NPDecoder-CL Tainted: G C OE 6.1.118-android14-11-maybe-dirty #1
[ 1052.889702][ T5244] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.889704][ T5244] pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1052.889707][ T5244] pc : cap_capable+0xc/0x3c
[ 1052.889721][ T5244] lr : security_capable+0x68/0x90
[ 1052.889724][ T5244] sp : ffffffc0211737a0
[ 1052.889725][ T5244] x29: ffffffc0211737a0 x28: 0000000000000000 x27: 00000000000000b8
[ 1052.889728][ T5244] x26: ffffff8090868200 x25: ffffff8092bc4680 x24: 0000000000000000
[ 1052.889730][ T5244] x23: ffffffc0096165a8 x22: ffffff8049844900 x21: ffffffc00a0f1950
[ 1052.889732][ T5244] x20: 0000000000000017 x19: 0000000000000002 x18: ffffffc014695070
[ 1052.889734][ T5244] x17: 0000000090dc9518 x16: 0000000090dc9518 x15: b400007ccf9e3c90
[ 1052.889736][ T5244] x14: 0000020000000000 x13: 0000000000000102 x12: ffffff80955b5280
[ 1052.889739][ T5244] x11: ffffff8002b25480 x10: ffffff816c2f4b00 x9 : 0000000000000001
[ 1052.889741][ T5244] x8 : 0000000000000000 x7 : ffffffffffffff00 x6 : ffffff808b6ff0b4
[ 1052.889743][ T5244] x5 : 0000000000000006 x4 : 000000000000ed71 x3 : 0000000000000002
[ 1052.889745][ T5244] x2 : 0000000000000017 x1 : ffffffc00a0f1950 x0 : ffffff8049844900
[ 1052.889747][ T5244] Call trace:
[ 1052.889749][ T5244] cap_capable+0xc/0x3c
[ 1052.889752][ T5244] has_capability_noaudit+0x38/0x60
[ 1052.889758][ T5244] binder_do_set_priority+0x90/0x36c
[ 1052.889764][ T5244] binder_transaction_priority+0x114/0x264
[ 1052.889767][ T5244] binder_proc_transaction+0x248/0x898
[ 1052.889770][ T5244] binder_transaction+0x1768/0x1ea8
[ 1052.889773][ T5244] binder_thread_write+0xb54/0x2768
[ 1052.889776][ T5244] binder_ioctl+0x550/0x2438
[ 1052.889779][ T5244] __arm64_sys_ioctl+0xa8/0xe4
[ 1052.889783][ T5244] invoke_syscall+0x58/0x114
[ 1052.889786][ T5244] el0_svc_common+0xb4/0xfc
[ 1052.889788][ T5244] do_el0_svc+0x24/0x84
[ 1052.889789][ T5244] el0_svc+0x2c/0x90
[ 1052.889793][ T5244] el0t_64_sync_handler+0x68/0xb4
[ 1052.889796][ T5244] el0t_64_sync+0x1a4/0x1a8
[ 1052.889799][ T5244] Code: 90dc9518 f9405408 eb01011f 54000100 (b940e109)
[ 1052.889801][ T5244] ---[ end trace 0000000000000000 ]---
[ 1052.901222][ T5263] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0
[ 1052.904621][T31837] EC = 0x25: DABT (current EL), IL = 32 bits
[ 1052.909956][ T5263] Mem abort info:
[ 1052.909957][ T5263] ESR = 0x0000000096000005
[ 1052.909958][ T5263] EC = 0x25: DABT (current EL), IL = 32 bits
[ 1052.909959][ T5263] SET = 0, FnV = 0
[ 1052.909960][ T5263] EA = 0, S1PTW = 0
[ 1052.909961][ T5263] FSC = 0x05: level 1 translation fault
[ 1052.909962][ T5263] Data abort info:
[ 1052.909962][ T5263] ISV = 0, ISS = 0x00000005
[ 1052.909963][ T5263] CM = 0, WnR = 0
[ 1052.909963][ T5263] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000d4f9a000
[ 1052.909965][ T5263] [00000000000000e0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 1052.919834][ T5244] Kernel panic - not syncing: Oops: Fatal exception
[ 1052.919836][ T5244] SMP: stopping secondary CPUs
[ 1052.919861][ C2] VendorHooks: CPU2: stopping
[ 1052.919868][ C2] CPU: 2 PID: 31833 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[ 1052.919874][ C2] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.919877][ C2] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[ 1052.919883][ C2] pc : 0000000000403db4
[ 1052.919885][ C2] lr : 000000000040431c
[ 1052.919887][ C2] sp : 0000007674c55670
[ 1052.919889][ C2] x29: 0000007674c55670 x28: 00000000ffffffff x27: 000000001707c250
[ 1052.919896][ C2] x26: 0000000000000000 x25: 000000001707c230 x24: 0000000000000000
[ 1052.919903][ C2] x23: 000000001707c160 x22: 0000000000000001 x21: 000000001707bfa0
[ 1052.919910][ C2] x20: 0000000000000000 x19: 0000007670bce194 x18: ffffffffffffff9c
[ 1052.919916][ C2] x17: 0000000000000049 x16: 0000000000002710 x15: 00000000004d4b48
[ 1052.919921][ C2] x14: d207e381491a4aaf x13: de4a6af0a0bfaff5 x12: 3c913bab09122bb9
[ 1052.919928][ C2] x11: 94e4dff91f08a69f x10: fadccd4d70ad9f9a x9 : e29afe0f16109821
[ 1052.919934][ C2] x8 : 6cfd2256f07dec61 x7 : 170f324bae43d0d6 x6 : 0000007671736954
[ 1052.919939][ C2] x5 : 0b76436913d88acd x4 : 0000000000487f68 x3 : 0000007670c0428d
[ 1052.919945][ C2] x2 : 000000001532474e x1 : 0000000000000030 x0 : 0000007670be9bbb
[ 1052.919954][ C4] VendorHooks: CPU4: stopping
[ 1052.919958][ C4] CPU: 4 PID: 31836 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[ 1052.919963][ C4] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.919966][ C4] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[ 1052.919970][ C4] pc : 0000000000403dac
[ 1052.919972][ C4] lr : 000000000040431c
[ 1052.919973][ C4] sp : 000000766e327680
[ 1052.919975][ C4] x29: 000000766e327680 x28: 00000000004896b8 x27: 0000000000000000
[ 1052.919981][ C4] x26: 0000000000000001 x25: 0000000000000003 x24: 000000001707ffa0
[ 1052.919987][ C4] x23: 0000000000000000 x22: 00000000fe7a4e9c x21: 000000001707fd20
[ 1052.919992][ C4] x20: 0000000000000005 x19: 000000765a3aee14 x18: ffffffffffffff9c
[ 1052.919997][ C4] x17: 0000000000000049 x16: 0000000000002710 x15: 00000000004d4b48
[ 1052.920003][ C4] x14: 0000007659737df0 x13: 00000000003cd87c x12: 0000000000000000
[ 1052.920008][ C4] x11: 0101010101010101 x10: 0000000000000038 x9 : 00000000ffffff80
[ 1052.920014][ C4] x8 : 2525252525252525 x7 : 0000000000000000 x6 : 00000076783294c6
[ 1052.920019][ C4] x5 : 0000000000000000 x4 : 0000000000487f68 x3 : 000000765a3e4f0d
[ 1052.920024][ C4] x2 : 00000000197791b9 x1 : 000000000000005a x0 : 000000765a3dfca7
[ 1052.920032][ C1] VendorHooks: CPU1: stopping
[ 1052.920035][ C1] CPU: 1 PID: 31831 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[ 1052.920040][ C1] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.920042][ C1] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[ 1052.920047][ C1] pc : 0000000000403db4
[ 1052.920049][ C1] lr : 000000000040431c
[ 1052.920051][ C1] sp : 0000007677b27680
[ 1052.920053][ C1] x29: 0000007677b27680 x28: 00000000004896b8 x27: 0000000000000000
[ 1052.920059][ C1] x26: 0000000000000000 x25: 0000000000000002 x24: 0000000017079900
[ 1052.920064][ C1] x23: 0000000000000000 x22: 0000000053465b49 x21: 00000000170796a0
[ 1052.920071][ C1] x20: 000000000000000a x19: 000000764e0efc00 x18: ffffffffffffff9c
[ 1052.920076][ C1] x17: 0000000000000049 x16: 0000000000002710 x15: 00000000004d4b48
[ 1052.920084][ C1] x14: 000000764d737df0 x13: 00000000003cd87c x12: 000000764e66ee80
[ 1052.920092][ C1] x11: 0000000017079af8 x10: 000000001707a6f8 x9 : 1fe47d6b184bc804
[ 1052.920098][ C1] x8 : 972ea7d08aa01c03 x7 : a67d73ef8fa9f5ac x6 : 000000001707a2f8
[ 1052.920104][ C1] x5 : bf66cf1f7ee5e51c x4 : 0000000000487f68 x3 : 000000764e125cf9
[ 1052.920109][ C1] x2 : 000000000caa6a19 x1 : 0000000000000004 x0 : 000000764e0f8ba8
[ 1052.920116][ C5] VendorHooks: CPU5: stopping
[ 1052.920120][ C5] CPU: 5 PID: 31832 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[ 1052.920125][ C5] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.920128][ C5] pstate: 20001000 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[ 1052.920132][ C5] pc : 00000000004480b0
[ 1052.920134][ C5] lr : 0000000000409058
[ 1052.920135][ C5] sp : 0000007678327770
[ 1052.920137][ C5] x29: 0000007678327770 x28: 00000000ffffffff x27: 000000001707add0
[ 1052.920143][ C5] x26: 0000000000000000 x25: 000000001707adb0 x24: 0000000000000012
[ 1052.920148][ C5] x23: 0000000000000000 x22: 0000000000000002 x21: 000000001707ad10
[ 1052.920154][ C5] x20: 000000001707ab20 x19: 000000000000000d x18: ffffffffffffff9c
[ 1052.920160][ C5] x17: 0000000000000049 x16: 0000000000002710 x15: 00000000004d4b48
[ 1052.920165][ C5] x14: 6baa6e969e5d8f32 x13: 9d424bc2b56d0dc6 x12: 9920ea012d42ac2d
[ 1052.920171][ C5] x11: 70e1532c0b43064e x10: 4c95bcd98c78e77b x9 : 8124cc131ff463a1
[ 1052.920176][ C5] x8 : 2151d750098d0124 x7 : 6d70fcd99f7e25a5 x6 : 000000765127e870
[ 1052.920182][ C5] x5 : d092813433abb5d6 x4 : 0000000000000008 x3 : 000000765266f190
[ 1052.920187][ C5] x2 : 00000000000122f8 x1 : 00000076521b5d90 x0 : 000000765125aab8
[ 1052.922166][ C0] VendorHooks: CPU0: stopping
[ 1052.922169][ C0] CPU: 0 PID: 31837 Comm: QMESA_64 Tainted: G D C OE 6.1.118-android14-11-maybe-dirty #1
[ 1052.922174][ C0] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.922176][ C0] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1052.922181][ C0] pc : console_emit_next_record+0x348/0x3d4
[ 1052.922202][ C0] lr : console_emit_next_record+0x344/0x3d4
[ 1052.922206][ C0] sp : ffffffc023e23690
[ 1052.922208][ C0] x29: ffffffc023e23720 x28: ffffffc00a0ecd90 x27: ffffffc00a0cae30
[ 1052.922214][ C0] x26: 0000000000000000 x25: ffffffc00a2e7660 x24: 0000000000000029
[ 1052.922220][ C0] x23: ffffffc00a2e97b8 x22: ffffffc00a2e7778 x21: ffffffc00a1f78d0
[ 1052.922228][ C0] x20: 0000000000000001 x19: ffffffc023e2377c x18: ffffffc01fd1d0c0
[ 1052.922234][ C0] x17: 00000000529c6ef0 x16: 00000000529c6ef0 x15: ffffffc00902ed5c
[ 1052.922241][ C0] x14: 0000000000000030 x13: 0000000000000020 x12: 0000000000000029
[ 1052.922246][ C0] x11: 00000000ffffffff x10: 0000000000000001 x9 : ffffff808622b840
[ 1052.922252][ C0] x8 : 0000000000000001 x7 : 545b5d3135373136 x6 : 382e32353031205b
[ 1052.922257][ C0] x5 : ffffffc00a2e97cf x4 : ffffffc023e2360f x3 : ffffffc008887cf8
[ 1052.922263][ C0] x2 : 0000000000000006 x1 : 00000000000000c0 x0 : ffffffc00a1f6ce8
[ 1052.922269][ C0] Call trace:
[ 1052.922274][ C0] console_emit_next_record+0x348/0x3d4
[ 1052.922280][ C0] console_unlock+0x154/0x24c
[ 1052.922286][ C0] vprintk_emit+0xcc/0x27c
[ 1052.922292][ C0] vprintk_default+0x44/0x70
[ 1052.922299][ C0] vprintk+0xe4/0x114
[ 1052.922303][ C0] _printk+0x54/0x80
[ 1052.922308][ C0] mem_abort_decode+0x5c/0x144
[ 1052.922318][ C0] __do_kernel_fault+0x248/0x2b4
[ 1052.922324][ C0] do_page_fault+0x218/0x4c8
[ 1052.922344][ C0] do_translation_fault+0x38/0x54
[ 1052.922349][ C0] do_mem_abort+0x58/0x118
[ 1052.922353][ C0] el1_abort+0x3c/0x5c
[ 1052.922359][ C0] el1h_64_sync_handler+0x54/0x90
[ 1052.922364][ C0] el1h_64_sync+0x68/0x6c
[ 1052.922370][ C0] cap_capable+0xc/0x3c
[ 1052.922378][ C0] file_ns_capable+0x20/0x44
[ 1052.922385][ C0] pagemap_read+0x124/0x428
[ 1052.922393][ C0] vfs_read+0x100/0x2c8
[ 1052.922399][ C0] ksys_read+0x78/0xe8
[ 1052.922402][ C0] __arm64_sys_read+0x1c/0x2c
[ 1052.922406][ C0] invoke_syscall+0x58/0x114
[ 1052.922411][ C0] el0_svc_common+0xb4/0xfc
[ 1052.922416][ C0] do_el0_svc+0x24/0x84
[ 1052.922420][ C0] el0_svc+0x2c/0x90
[ 1052.922428][ C0] el0t_64_sync_handler+0x68/0xb4
[ 1052.922434][ C0] el0t_64_sync+0x1a4/0x1a8
[ 1053.951825][ T5244] SMP: failed to stop secondary CPUs 3,6-7
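Basis for the diagnosis:
First, several cores hit exceptions at the same time.
Second, tracing one of the threads, T5244, shows that it died at ldr w9,[x8,#0xE0]. The fault clearly happens while dereferencing cred->user_ns: x8 is 0, which is consistent with the NULL pointer dereference at 0x00000000000000e0 reported in the log.
x8 was loaded by the preceding ldr x8,[x0,#0xA8] (opcode f9405408 in the Code line). x0 holds a normal-looking cred pointer, 0xFFFFFF8049844900, so the load read cred->user_ns from 0xFFFFFF80498449A8 and should have returned a valid kernel pointer to a user namespace. Instead x8 is 0. The value read back from memory is wrong, which points clearly at a memory problem.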
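The two loads discussed above can be cross-checked directly against the Code line of the Oops. The following is a minimal sketch, assuming only the standard AArch64 encoding of LDR (immediate, unsigned offset); the helper name decode_ldr_uimm is made up for this illustration and is not part of any tool.

```python
def decode_ldr_uimm(word):
    """Decode an AArch64 LDR (immediate, unsigned offset) instruction word.

    Returns (dest_reg, base_reg, byte_offset), or None when the word is not
    a 32-bit or 64-bit integer LDR of that form.
    """
    # Bits [29:22] must be 0b11100101 (integer load, unsigned offset, opc=01).
    # Bit 31 must be 1; bit 30 then selects a 32-bit (Wt) or 64-bit (Xt) load.
    if (word >> 22) & 0xFF != 0b11100101 or not (word >> 31) & 1:
        return None
    is_64bit = (word >> 30) & 1
    scale = 8 if is_64bit else 4           # imm12 is scaled by the access size
    imm12 = (word >> 10) & 0xFFF
    rn = (word >> 5) & 0x1F                # base register (31 means sp)
    rt = word & 0x1F                       # destination register
    dest = ("x" if is_64bit else "w") + str(rt)
    base = "sp" if rn == 31 else "x" + str(rn)
    return dest, base, imm12 * scale


# Words taken from "Code: 90dc9518 f9405408 eb01011f 54000100 (b940e109)".
for word in (0xF9405408, 0xB940E109):
    dest, base, offset = decode_ldr_uimm(word)
    print("%08x: ldr %s,[%s,#0x%X]" % (word, dest, base, offset))
```

With the words from this case, f9405408 decodes to a load of x8 from x0 at offset 0xA8, and b940e109 (the faulting instruction at the PC) decodes to a load of w9 from x8 at offset 0xE0. The helper returns None for any other instruction form, such as the cmp and b.eq words in between.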
2.7 Case 7
```
[ 884.542922][ T283] Unable to handle kernel level 3 address size fault at virtual address ffffffc00a2c25e8
[ 884.552843][ T283] Mem abort info:
[ 884.556518][ T283] ESR = 0x0000000096000003
[ 884.561034][ T283] EC = 0x25: DABT (current EL), IL = 32 bits
[ 884.567154][ T283] SET = 0, FnV = 0
[ 884.570952][ T283] EA = 0, S1PTW = 0
[ 884.574893][ T283] FSC = 0x03: level 3 address size fault
[ 884.580611][ T283] Data abort info:
[ 884.584308][ T283] ISV = 0, ISS = 0x00000003
[ 884.588948][ T283] CM = 0, WnR = 0
[ 884.592840][ T283] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[ 884.600696][ T283] [ffffffc00a2c25e8] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000027fff9003, pte=00780000a22c2703
[ 884.616092][ T283] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP
[ 884.623129][ T283] Dumping ftrace buffer:
[ 884.627242][ T283] (ftrace buffer empty)
[ 884.632065][ T283] CPU: 7 PID: 283 Comm: ueventd Tainted: G C OE 6.1.118-android14-11-maybe-dirty #1
[ 884.632070][ T283] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 884.632073][ T283] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 884.632075][ T283] pc : vm_area_dup+0x20/0x12c
[ 884.632095][ T283] lr : copy_mm+0x3b8/0x7fc
[ 884.632097][ T283] sp : ffffffc00beb3b70
[ 884.632098][ T283] x29: ffffffc00beb3b70 x28: fffffffffff7dfff x27: ffffff801d6b12c0
[ 884.632102][ T283] x26: 0000000000000000 x25: ffffff80a994d708 x24: ffffff801c83bc80
[ 884.632105][ T283] x23: 000000000000067c x22: ffffff80049fc060 x21: ffffffc00a2c2000
[ 884.632107][ T283] x20: ffffff801c83bc80 x19: ffffff80045312c0 x18: ffffffc00b4ad050
[ 884.632110][ T283] x17: 0000007f98a6a000 x16: 0000000000000006 x15: 0000007f98a6a000
[ 884.632112][ T283] x14: 0000000000000001 x13: 0000000000000241 x12: 0000000000000000
[ 884.632115][ T283] x11: ffffff809f5a3e80 x10: 0000000100000000 x9 : 0000000004000000
[ 884.632118][ T283] x8 : 0000000000000070 x7 : 0000000000000000 x6 : ffffff801c83b578
[ 884.632120][ T283] x5 : 0000000000000001 x4 : 0000000000000000 x3 : 0000000000000242
[ 884.632123][ T283] x2 : 0000000100023ad4 x1 : 0000000000000cc0 x0 : ffffff801c83bc80
[ 884.632126][ T283] Call trace:
[ 884.632128][ T283] vm_area_dup+0x20/0x12c
[ 884.632133][ T283] copy_mm+0x3b8/0x7fc
[ 884.632135][ T283] copy_process+0x4dc/0xc9c
[ 884.632136][ T283] kernel_clone+0xb0/0x488
[ 884.632138][ T283] __arm64_sys_clone+0x5c/0x8c
[ 884.632140][ T283] invoke_syscall+0x58/0x114
[ 884.632144][ T283] el0_svc_common+0xb4/0xfc
[ 884.632146][ T283] do_el0_svc+0x24/0x84
[ 884.632147][ T283] el0_svc+0x2c/0x90
[ 884.632155][ T283] el0t_64_sync_handler+0x68/0xb4
[ 884.632157][ T283] el0t_64_sync+0x1a4/0x1a8
[ 884.632162][ T283] Code: 910003fd 90011095 aa0003f4 52819801 (f942f6a0)
[ 884.638961][ T283] ---[ end trace 0000000000000000 ]---
```

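From the stack frame, new is still NULL, and the PC shows that kmem_cache_alloc has not run yet: the BL that calls it is the instruction after the faulting one.
Let us look at the assembly for this part: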
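```asm
90011095 adrp x21,0xFFFFFFC00A2C2000
AA0003F4 mov x20,x0 ; x20,orig
52819801 mov w1,#0xCC0 ; w1,#3264
F942F6A0 ldr x0,[x21,#0x5E8] ; x0,[x21,#1512]
```

The ldr loads the value stored at x21+0x5E8, which is vm_area_cachep, into x0, so at this point x0 should become 0xFFFFFF800B4EA000.
In fact x0 is still 0xFFFFFF801C83BC80, the value it had on entry to vm_area_dup (orig, also copied into x20), so the ldr never completed. The log shows that the load itself took a level 3 address size fault at 0xffffffc00a2c25e8, which is exactly x21+0x5E8, a plain read of a kernel global variable. That is very strange!
This is where unstable DDR becomes a reasonable suspect. It is best to check with the memory team whether the current DDR part has known issues of this kind, and then run memory stress tests on this unit.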
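The "Mem abort info" lines at the top of this case's log are all derived from the single ESR value. The sketch below shows how the fields split out, assuming the standard ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and the usual fault status encoding where the low two bits give the table level. The function name decode_esr and the FAULT_KINDS table are illustrative only.

```python
# Fault status codes below 0x10 encode the fault kind in bits [3:2]
# and the translation table level in bits [1:0].
FAULT_KINDS = {
    0: "address size fault",
    1: "translation fault",
    2: "access flag fault",
    3: "permission fault",
}


def decode_esr(esr):
    """Split an AArch64 ESR value into the fields printed by the kernel."""
    iss = esr & 0x1FFFFFF
    fsc = iss & 0x3F
    info = {
        "EC": (esr >> 26) & 0x3F,   # exception class, e.g. 0x25 = DABT (current EL)
        "IL": (esr >> 25) & 1,      # 1 means a 32-bit instruction
        "ISS": iss,
        "FSC": fsc,
        "WnR": (iss >> 6) & 1,      # only meaningful for data aborts
    }
    if fsc < 0x10:
        info["fault"] = "level %d %s" % (fsc & 3, FAULT_KINDS[fsc >> 2])
    return info


print(decode_esr(0x96000003))
```

For the ESR in this case, 0x96000003, this gives EC 0x25 and FSC 0x03, a level 3 address size fault on a read (WnR = 0), matching the kernel's own decoding in the log.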
2.8 Case 8
[10907.962594][ T14] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP
[10907.971885][ T14] Dumping ftrace buffer:
[10907.976090][ T14] (ftrace buffer empty)
[10907.981288][ T14] CPU: 2 PID: 14 Comm: rcu_preempt Tainted: G WC OE 6.1.118-android14-11-maybe-dirty #1
[10907.981296][ T14] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10907.981301][ T14] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[10907.981308][ T14] pc : __mod_timer+0x160/0x404
[10907.981330][ T14] lr : __mod_timer+0x150/0x404
[10907.981336][ T14] sp : ffffffc00a52bc50
[10907.981338][ T14] x29: ffffffc00a52bc70 x28: ffffffc00a16d620 x27: ffffffc00a09f240
[10907.981349][ T14] x26: 00000000ffffffff x25: 0000000000000000 x24: 0000000000000000
[10907.981357][ T14] x23: 0000000000000002 x22: ffffff81f6f4f240 x21: 0000000000000000
[10907.981365][ T14] x20: 0000000100287754 x19: ffffffc00a52bcd8 x18: ffffffc00a3ed028
[10907.981374][ T14] x17: 0000000022d92518 x16: 0000000022d92518 x15: ffffff81f6f60180
[10907.981382][ T14] x14: 0000000000000010 x13: 0000000100287753 x12: 0000000100288192
[10907.981390][ T14] x11: 0000000000000002 x10: 0000000100287753 x9 : 0000000000000000
[10907.981397][ T14] x8 : 0000000000000000 x7 : 0000000000000001 x6 : fffffffdef8cc520
[10907.981405][ T14] x5 : 00000000000003e8 x4 : 0000000000000001 x3 : ffffffc00a16dea0
[10907.981413][ T14] x2 : 0000000000000000 x1 : ffffff81f6f4f240 x0 : 0000000000000000
[10907.981421][ T14] Call trace:
[10907.981426][ T14] __mod_timer+0x160/0x404
[10907.981433][ T14] schedule_timeout+0x8c/0x12c
[10907.981443][ T14] rcu_gp_fqs_loop+0x200/0x968
[10907.981452][ T14] rcu_gp_kthread+0x64/0x2bc
[10907.981457][ T14] kthread+0x10c/0x154
[10907.981465][ T14] ret_from_fork+0x10/0x20
[10907.981477][ T14] Code: 2a0003f8 36000057 34000e38 b9402277 (f503201f)
[10907.988286][ T14] ---[ end trace 0000000000000000 ]---
[10908.094081][ T14] Kernel panic - not syncing: Oops - Undefined instruction: Fatal exception
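The instruction at the faulting PC should be D503201F (nop), but the Code line shows F503201F, which is an undefined instruction. The two words differ in a single bit, so this is a bit flip and can be attributed to the DDR.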
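The single-bit claim is easy to check by XOR-ing the expected opcode with the one in the Code line. This is a small sketch; the helper name flipped_bits is made up for illustration.

```python
def flipped_bits(expected, observed):
    """Return the positions (0 = least significant) of all differing bits."""
    diff = expected ^ observed
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


expected = 0xD503201F   # nop, the instruction expected at this PC
observed = 0xF503201F   # the word shown in parentheses in the Code line
bits = flipped_bits(expected, observed)
print("differing bits:", bits)
print("single bit flip" if len(bits) == 1 else "more than one bit differs")
```

For this case the only differing bit is bit 29. The same check can be applied whenever a dumped opcode does not match the disassembly of the kernel image.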
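3. Summary
3.1 Why unstable DDR is a plausible cause
- Memory access errors: if the memory is unstable or faulty (signal interference, power problems, hardware defects), some addresses may be read or written incorrectly. Such errors can crash the kernel, and on a multi-core system several cores may access the affected region at about the same time and fail concurrently.
- Concurrency on multi-core systems: with unstable memory, several CPU cores can run into trouble together. For example, a region may be readable from one core but not from another, or the same address may read back correctly on one core and incorrectly on another. The result can look like kernel panics that happen "in sync".
- Memory mapping and virtual memory: memory errors can break virtual memory mappings, which surfaces as faults the kernel cannot handle, such as "NULL pointer dereference" or memory aborts (translation faults, permission faults and so on). These are fairly common when memory is unstable.
- Cache coherency: modern processors rely on caches for performance. If memory is unstable, cached data may become inconsistent, which can make several CPU cores fault at once and panic. This applies in particular to larger multi-core systems that use a cache coherency protocol, where unstable memory may disturb cache synchronization between cores.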
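3.2 Possible symptoms of unstable memory
- Random crashes that cannot be reproduced: a memory problem does not have to show up on every run, so kernel panics may appear at random.
- Several cores crashing together: as in the cases above, multiple cores panic at the same time, which usually indicates a fairly serious memory access failure.
- Failures that only appear under stress: in some cases memory errors do not show up on first use and only appear under heavy load or specific conditions.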
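3.3 How to verify whether DDR is the root cause
Run DDR stress tests on the affected units, and follow the memory engineers' recommendations on which tests to run.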