1. Background

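During burn-in (aging) tests at our factory, several devices hung or crashed. After analysis, I attributed these failures to DDR problems. This article records the cases and the evidence behind that judgment, as a reference for stability engineers.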


2025/04/24: Added cases: Sections 2.1 to 2.6

2025/05/13: Added new cases: Sections 2.7 to 2.8

2. Cases

2.1 Case 1

[  137.648537][    T0] Unable to handle kernel execute from non-executable memory at virtual address ffffffc00c92c7a8
[  137.649182][ T2652] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP
[  137.671119][    T0] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  137.671121][    T0] Mem abort info:
[  137.671122][    T0]   ESR = 0x0000000086000005
[  137.671124][    T0]   EC = 0x21: IABT (current EL), IL = 32 bits
[  137.671125][    T0]   SET = 0, FnV = 0
[  137.671126][    T0]   EA = 0, S1PTW = 0
[  137.671127][    T0]   FSC = 0x05: level 1 translation fault
[  137.671129][    T0] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000c85d7000
[  137.671131][    T0] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[  137.671224][    C0] Unexpected kernel BRK exception at EL1
[  137.733876][    C0] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  137.733881][    C0] Mem abort info:
[  137.733882][    C0]   ESR = 0x0000000086000005
[  137.733885][    C0]   EC = 0x21: IABT (current EL), IL = 32 bits
[  137.733888][    C0]   SET = 0, FnV = 0
[  137.733890][    C0]   EA = 0, S1PTW = 0
[  137.733892][    C0]   FSC = 0x05: level 1 translation fault
[  137.733895][    C0] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000ecd5f000
[  137.733900][    C0] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[  137.789473][ T2652] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  137.789476][ T2652] Mem abort info:
[  137.789477][ T2652]   ESR = 0x0000000086000005
[  137.789479][ T2652]   EC = 0x21: IABT (current EL), IL = 32 bits
[  137.789482][ T2652]   SET = 0, FnV = 0
[  137.789484][ T2652]   EA = 0, S1PTW = 0
[  137.789485][ T2652]   FSC = 0x05: level 1 translation fault
[  137.789487][ T2652] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000d422e000
[  137.789491][ T2652] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[  139.095329][  T734] Unexpected kernel BRK exception at EL1
[  139.154139][  T734] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  139.154142][  T734] Mem abort info:
[  139.154143][  T734]   ESR = 0x0000000086000005
[  139.154145][  T734]   EC = 0x21: IABT (current EL), IL = 32 bits
[  139.154147][  T734]   SET = 0, FnV = 0
[  139.154149][  T734]   EA = 0, S1PTW = 0
[  139.154150][  T734]   FSC = 0x05: level 1 translation fault
[  139.154152][  T734] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000cbdfe000
[  139.154155][  T734] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000

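Basis for the judgment:

  1. Thread T0 reports "Unable to handle kernel execute from non-executable memory", which means the kernel tried to execute code from a memory region that is not mapped as executable.
  2. Thread T2652 reports "Undefined instruction": the CPU fetched a word that it could not decode as a valid instruction. In kernel code that otherwise runs stably, this more likely points to corrupted instruction memory or a corrupted branch target than to a genuine software bug. The same thread then reports "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000". The line "EC = 0x21: IABT (current EL)" shows that this is an instruction abort, that is, the kernel tried to fetch an instruction from address 0.
  3. Thread T734 reports "Unexpected kernel BRK exception at EL1", followed by the same "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000".

Different threads on different cores hit exceptions within the same short time window (about 1.5 seconds in this log), and the kind of exception looks random, as if a large part of the page tables or the kernel text had become invalid. I therefore consider DDR instability the most likely cause of this kind of failure.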
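
The "many contexts fault at almost the same time" criterion can be checked mechanically. The following is a minimal sketch, not a finished tool: the function names, the list of message prefixes and the 2-second window are assumptions chosen for illustration. It relies on the printk caller field shown in these logs, where T<n> is a task ID and C<n> is a CPU number printed when the message comes from interrupt context.

```python
import re

# Matches a printk line with caller info, e.g. "[  137.648537][    T0] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\[\s*([TC]\d+)\]\s(.*)$")

# Message prefixes treated as the start of a fault report.
# This list is taken from the logs in this article and is not exhaustive.
FAULT_PREFIXES = (
    "Unable to handle kernel",
    "Internal error:",
    "Unexpected kernel BRK",
    "CFI failure",
)


def fault_events(log_text):
    """Return (timestamp, context, message) for every line that starts a fault report."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group(3).startswith(FAULT_PREFIXES):
            events.append((float(m.group(1)), m.group(2), m.group(3)))
    return events


def contexts_in_window(events, window_s=2.0):
    """Return the set of contexts that faulted within window_s seconds of the first fault."""
    if not events:
        return set()
    first = min(t for t, _, _ in events)
    return {ctx for t, ctx, _ in events if t - first <= window_s}


# Four lines copied from the Case 1 log.
sample = """\
[  137.648537][    T0] Unable to handle kernel execute from non-executable memory at virtual address ffffffc00c92c7a8
[  137.649182][ T2652] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP
[  137.671224][    C0] Unexpected kernel BRK exception at EL1
[  139.095329][  T734] Unexpected kernel BRK exception at EL1
"""

events = fault_events(sample)
print(sorted(contexts_in_window(events)))
```

For the four sample lines, all four contexts (C0, T0, T2652, T734) fall inside the window. A single software bug normally produces one faulting context; several unrelated ones in the same window is the pattern this article treats as a hint of unstable DDR.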

2.2 Case 2

[  909.331382][T27244] Unable to handle kernel level 3 address size fault at virtual address ffffff816b0c8f58
[  909.332051][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffff80a7ae4920
[  909.332061][ T1548] Mem abort info:
[  909.332064][ T1548]   ESR = 0x0000000096000003
[  909.332066][ T1548]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.332070][ T1548]   SET = 0, FnV = 0
[  909.332073][ T1548]   EA = 0, S1PTW = 0
[  909.332075][ T1548]   FSC = 0x03: level 3 address size fault
[  909.332078][ T1548] Data abort info:
[  909.332079][ T1548]   ISV = 0, ISS = 0x00000003
[  909.332081][ T1548]   CM = 0, WnR = 0
[  909.332084][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.332089][ T1548] [ffffff80a7ae4920] pgd=180000027fc94003, p4d=180000027fc94003, pud=180000027fc94003, pmd=180000027fb56003, pte=0068020127ae4707
[  909.332107][ T1548] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP
[  909.332222][ T1548] Dumping ftrace buffer:
[  909.332245][ T1548]    (ftrace buffer empty)
[  909.332248][ T1548] Modules linked in: wlan(OE)
[  909.332253][ T1435] Unable to handle kernel level 3 address size fault at virtual address ffffff8003fc0d90
[  909.332255][ T1548]  nt36xxx_spi(OE)
[  909.332258][ T1548]  focaltech_spi(OE)
[  909.332260][ T1435] Mem abort info:
[  909.332261][ T1548]  msm_drm(OE)
[  909.332262][ T1435]   ESR = 0x0000000096000003
[  909.332264][ T1548]  rmnet_offload(OE)
[  909.332264][ T1435]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.332267][ T1548]  rmnet_shs(OE)
[  909.332267][ T1435]   SET = 0, FnV = 0
[  909.332270][ T1548]  rmnet_perf_tether(OE)
[  909.332270][ T1435]   EA = 0, S1PTW = 0
[  909.332273][ T1548]  rmnet_wlan(OE)
[  909.332273][ T1435]   FSC = 0x03: level 3 address size fault
[  909.332276][ T1435] Data abort info:
[  909.332276][ T1548]  msm_kgsl(OE)
[  909.332277][ T1435]   ISV = 0, ISS = 0x00000003
[  909.332279][ T1548]  rmnet_perf(OE)
[  909.332279][ T1435]   CM = 0, WnR = 0
[  909.332282][ T1548]  rmnet_core(OE)
[  909.332282][ T1435] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.332284][ T1548]  mac80211(E)
[  909.332286][ T1435] [ffffff8003fc0d90] pgd=180000027ff78003
[  909.332286][ T1548]  machine_dlkm(OE)
[  909.332288][ T1435] , p4d=180000027ff78003
[  909.332289][ T1548]  ipanetm(OE)
[  909.332290][ T1435] , pud=180000027ff78003
[  909.332291][ T1548]  rmnet_ctl(OE)
[  909.332292][ T1435] , pmd=180000027ff6b003
[  909.332294][ T1548]  wcd938x_dlkm(OE)
[  909.332294][ T1435] , pte=0068020083fc0707

[  909.333074][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffffc0010c4440
[  909.333078][ T1548] Mem abort info:
[  909.333079][ T1548]   ESR = 0x0000000096000003
[  909.333081][ T1548]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.333084][ T1548]   SET = 0, FnV = 0
[  909.333086][ T1548]   EA = 0, S1PTW = 0
[  909.333088][ T1548]   FSC = 0x03: level 3 address size fault
[  909.333090][ T1548] Data abort info:
[  909.333092][ T1548]   ISV = 0, ISS = 0x00000003
[  909.333093][ T1548]   CM = 0, WnR = 0
[  909.333095][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.333098][ T1548] [ffffffc0010c4440] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000009dc1f003, pte=006802009b976703
[  909.353464][    C5] Mem abort info:
[  909.353471][    C5]   ESR = 0x0000000096000003
[  909.353473][    C5]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.353476][    C5]   SET = 0, FnV = 0
[  909.353477][    C5]   EA = 0, S1PTW = 0
[  909.353479][    C5]   FSC = 0x03: level 3 address size fault
[  909.353481][    C5] Data abort info:
[  909.353482][    C5]   ISV = 0, ISS = 0x00000003
[  909.353483][    C5]   CM = 0, WnR = 0
[  909.353485][    C5] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.353490][    C5] [ffffff816b0c8f58] pgd=180000027f691003, p4d=180000027f691003, pud=180000027f691003, pmd=180000027f538003, pte=00680001eb0c8707

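Basis for the judgment:

Similar to Case 1: different threads on different cores panicked at the same time. Here the fault is a level 3 address size fault (FSC = 0x03) rather than a NULL pointer dereference. An address size fault is raised when the output address held in a page-table entry lies outside the physical address range the MMU is configured for, which should never happen with intact kernel page tables. This is highly abnormal and most likely caused by DDR instability.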
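
For addresses in the kernel linear map, the physical address a PTE should contain can be predicted from the virtual address, so the printed PTEs can be compared against it. The sketch below does this for three VA/PTE pairs copied from the log above. It rests on assumptions that the log does not confirm: that the linear map starts at 0xffffff8000000000 (39-bit VAs), that DRAM starts at 0x80000000, and that the linear map is a plain offset mapping. The DRAM base was inferred from the one PTE that looks intact. The function names are hypothetical.

```python
PAGE_OFFSET = 0xFFFFFF8000000000  # assumed start of the linear map for 39-bit VAs
PHYS_OFFSET = 0x80000000          # assumed DRAM base, inferred from the log, not confirmed
OA_MASK = 0x0000FFFFFFFFF000      # output address field of a 4K page descriptor, bits [47:12]


def expected_pa(va):
    """Physical page address a linear-map VA should translate to under the assumptions above."""
    return (va - PAGE_OFFSET + PHYS_OFFSET) & ~0xFFF


def flipped_bits(va, pte):
    """Bit positions where the PTE's output address differs from the expected one."""
    diff = expected_pa(va) ^ (pte & OA_MASK)
    return [b for b in range(48) if (diff >> b) & 1]


# (virtual address, pte) pairs copied from the Case 2 log.
samples = [
    (0xFFFFFF80A7AE4920, 0x0068020127AE4707),
    (0xFFFFFF8003FC0D90, 0x0068020083FC0707),
    (0xFFFFFF816B0C8F58, 0x00680001EB0C8707),
]

for va, pte in samples:
    print(hex(va), hex(pte & OA_MASK), flipped_bits(va, pte))
```

Under these assumptions, the first two entries differ from the expected address only in bit 41, and the third matches exactly. The same bit being wrong in unrelated PTEs is consistent with a memory fault, although it does not prove one. The entry for ffffffc0010c4440 is not in the linear map and cannot be checked this way.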

2.3 Case 3

[10916.779670][ T2330] Unable to handle kernel level 3 address size fault at virtual address ffffffc00a2f2ec0
[10916.781507][    C1] CFI failure at try_to_wake_up+0x43c/0x8ac (target: select_task_rq_rt+0x0/0x2c4; expected type: 0xaa3494c0)
[10916.781696][T18182] Unable to handle kernel level 3 address size fault at virtual address ffffffc00a2f2368
[10916.781702][T18182] Mem abort info:
[10916.781705][T18182]   ESR = 0x0000000096000003
[10916.781707][T18182]   EC = 0x25: DABT (current EL), IL = 32 bits
[10916.781711][T18182]   SET = 0, FnV = 0
[10916.781713][T18182]   EA = 0, S1PTW = 0
[10916.781715][T18182]   FSC = 0x03: level 3 address size fault
[10916.781718][T18182] Data abort info:
[10916.781720][T18182]   ISV = 0, ISS = 0x00000003
[10916.781722][T18182]   CM = 0, WnR = 0
[10916.781724][T18182] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[10916.781728][T18182] [ffffffc00a2f2368] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000027fff9003, pte=00782000a22f2703
[10916.781742][T18182] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP

[10916.782643][T18182] CPU: 3 PID: 18182 Comm: oid.aac.decoder Tainted: G         C OE      6.1.118-android14-11-maybe-dirty #1
[10916.782650][T18182] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.782655][T18182] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[10916.782661][T18182] pc : uprobe_mmap+0x38/0x500
[10916.782672][T18182] lr : mmap_region+0x7a0/0xa40
[10916.782680][T18182] sp : ffffffc00e64bae0
[10916.782682][T18182] x29: ffffffc00e64bb00 x28: ffffff801cb93c00 x27: ffffff80b4a7ec80
[10916.782690][T18182] x26: ffffff806b78a580 x25: ffffff80b4a7ec80 x24: ffffff818bd800c8
[10916.782697][T18182] x23: 0000000000000001 x22: 00000000040444f9 x21: ffffff801cb93c00
[10916.782704][T18182] x20: 000000740106e000 x19: ffffff80b4a7ec80 x18: ffffffc00a3b3058
[10916.782711][T18182] x17: 000000740106e000 x16: 0000000000000000 x15: 0000000000000008
[10916.782718][T18182] x14: 0000000000000000 x13: 0000000000000010 x12: 0000000000000008
[10916.782724][T18182] x11: ffffff80ac8b1a68 x10: 0000000000000000 x9 : 0000000000000000
[10916.782730][T18182] x8 : ffffffc00a2f2368 x7 : 0000000000000000 x6 : ffffff80b4a7ec80
[10916.782737][T18182] x5 : 0000000000000000 x4 : 00000000040444f9 x3 : 000000740106f000
[10916.782743][T18182] x2 : ffffffc0083235d0 x1 : 00000000040444f9 x0 : ffffff818bd800c8
[10916.782751][T18182] Call trace:
[10916.782755][T18182]  uprobe_mmap+0x38/0x500
[10916.782761][T18182]  mmap_region+0x7a0/0xa40
[10916.782766][T18182]  do_mmap+0x3f8/0x520
[10916.782771][T18182]  vm_mmap_pgoff+0x19c/0x204
[10916.782779][T18182]  ksys_mmap_pgoff+0x78/0xf4
[10916.782784][T18182]  __arm64_sys_mmap+0x34/0x44
[10916.782791][T18182]  invoke_syscall+0x58/0x114
[10916.782796][T18182]  el0_svc_common+0x88/0xfc
[10916.782801][T18182]  do_el0_svc+0x24/0x84
[10916.782805][T18182]  el0_svc+0x2c/0x90
[10916.782814][T18182]  el0t_64_sync_handler+0x68/0xb4
[10916.782819][T18182]  el0t_64_sync+0x1a4/0x1a8
[10916.782829][T18182] Code: f9431d08 f81f83a8 f0010108 910da108 (f8bfc108) 
[10916.782833][T18182] ---[ end trace 0000000000000000 ]---
[10916.783247][T18182] Kernel panic - not syncing: Oops: Fatal exception
[10916.783251][T18182] SMP: stopping secondary CPUs
[10916.783275][    C6] VendorHooks: CPU6: stopping
[10916.783280][    C6] CPU: 6 PID: 21140 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[10916.783284][    C6] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783286][    C6] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783289][    C6] pc : 000000000040d8f4
[10916.783290][    C6] lr : 0000000000408f80
[10916.783291][    C6] sp : 00000073e88be740
[10916.783292][    C6] x29: 00000073e88be770 x28: 00000000ffffffff x27: 000000003d7f0b10
[10916.783297][    C6] x26: 0000000000000000 x25: 000000003d7f0b10 x24: 0000000000000002
[10916.783301][    C6] x23: 00000073ddd7d108 x22: 0000000000000020 x21: 0000000000000000
[10916.783304][    C6] x20: 000000003d7f08a0 x19: 000000000002d31e x18: ffffffffffffff9c
[10916.783307][    C6] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783311][    C6] x14: 00000073dd4e53f0 x13: 0000000000338eb0 x12: 00000073de1c9e40
[10916.783314][    C6] x11: 000000003d7f0cf8 x10: 000000003d7f18f8 x9 : 4b3473b672eba93d
[10916.783317][    C6] x8 : 0a8fd9b8d6950f60 x7 : f9be10c4ee25e2da x6 : 000000003d7f14f8
[10916.783321][    C6] x5 : ab8f8c979761d4ff x4 : 000000003d7f18f8 x3 : 0000000000112e80
[10916.783324][    C6] x2 : 00000000ffffffff x1 : 0000000000000000 x0 : 000000003d7f0bd0
[10916.783328][    C0] VendorHooks: CPU0: stopping
[10916.783333][    C0] CPU: 0 PID: 21139 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[10916.783338][    C0] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783340][    C0] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783345][    C0] pc : 0000000000403dac
[10916.783347][    C0] lr : 000000000040431c
[10916.783348][    C0] sp : 00000073e90be750
[10916.783350][    C0] x29: 00000073e90be750 x28: 00000000ffffffff x27: 000000003d7ef690
[10916.783357][    C0] x26: 0000000000000000 x25: 000000003d7ef690 x24: 0000000000000001
[10916.783362][    C0] x23: 00000073cdc6a1a0 x22: 00000073ce1c9eb0 x21: 0000000000000002
[10916.783368][    C0] x20: 000000003d7ef420 x19: 0000000000044ba0 x18: ffffffffffffff9c
[10916.783374][    C0] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783380][    C0] x14: 00000073cd4e53f0 x13: 0000000000338eb0 x12: 00000073ce1c9e40
[10916.783386][    C0] x11: 000000003d7ef878 x10: 000000003d7f0478 x9 : 91b7fbbaf02e9e2e
[10916.783391][    C0] x8 : 20768bca88d14982 x7 : 512baf47a67b8016 x6 : 000000003d7f0078
[10916.783397][    C0] x5 : d65f7063f8d5498f x4 : 0000000000487f68 x3 : 00000073cdd7d019
[10916.783403][    C0] x2 : 0000000031ed750b x1 : 00000000000000f7 x0 : 00000073cdd13f7d
[10916.783410][    C4] VendorHooks: CPU4: stopping
[10916.783414][    C4] CPU: 4 PID: 21136 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[10916.783419][    C4] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783421][    C4] pstate: 20001000 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783426][    C4] pc : 000000000040d4b0
[10916.783427][    C4] lr : 000000000040d8c8
[10916.783429][    C4] sp : 00000073f11d6720
[10916.783430][    C4] x29: 00000073f11d6720 x28: 00000000ffffffff x27: 000000003d7eb910
[10916.783436][    C4] x26: 0000000000000000 x25: 000000003d7eb910 x24: 0000000000000001
[10916.783442][    C4] x23: 00000073e5c6a1a0 x22: 0000000000000000 x21: 00000000ffffffff
[10916.783448][    C4] x20: 000000003d7eb9d0 x19: 000000003d7eb9d0 x18: ffffffffffffff9c
[10916.783454][    C4] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783459][    C4] x14: 00000073e54e53f0 x13: 0000000000338eb0 x12: 00000073e61c9e40
[10916.783465][    C4] x11: 000000003d7ebaf8 x10: 000000003d7ec6f8 x9 : 10053edea2ad6252
[10916.783471][    C4] x8 : 3d153fa93c6a4f09 x7 : 00000000000001fc x6 : 000000003d7ebf18
[10916.783476][    C4] x5 : 80cd37161fb6742d x4 : 000000003d7ec318 x3 : a691699a77262e3d
[10916.783482][    C4] x2 : 000000003d7ec718 x1 : 000000003d7ebaf0 x0 : 000000003d7ec2f8
[10916.783490][    C5] VendorHooks: CPU5: stopping
[10916.783494][    C5] CPU: 5 PID: 21137 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[10916.783499][    C5] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783501][    C5] pstate: 20001000 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783506][    C5] pc : 000000000040d3ec
[10916.783507][    C5] lr : 000000000040d8c8
[10916.783509][    C5] sp : 00000073f39d6720
[10916.783510][    C5] x29: 00000073f39d6720 x28: 00000000ffffffff x27: 000000003d7ecd90
[10916.783516][    C5] x26: 0000000000000000 x25: 000000003d7ecd90 x24: 0000000000000000
[10916.783522][    C5] x23: 00000073c5b57238 x22: 0000000000000000 x21: 00000000ffffffff
[10916.783527][    C5] x20: 000000003d7ece50 x19: 000000003d7ece50 x18: ffffffffffffff9c
[10916.783533][    C5] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783539][    C5] x14: 00000073c54e53f0 x13: 0000000000338eb0 x12: 00000073c61c9e40
[10916.783544][    C5] x11: 000000003d7ecf78 x10: 000000003d7edb78 x9 : edc62dc9e20a6222
[10916.783550][    C5] x8 : ca5081040d02d94e x7 : 000000003d7ed1f8 x6 : 000000003d7ed778
[10916.783556][    C5] x5 : 2dd9c9cc140bf6aa x4 : 000000003d7eddf8 x3 : 9f128466d6045ff6
[10916.783561][    C5] x2 : 000000003d7ed9f8 x1 : 000000003d7ecf70 x0 : 000000003d7ed778
[10916.783568][    C2] VendorHooks: CPU2: stopping
[10916.783572][    C2] CPU: 2 PID: 21138 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[10916.783577][    C2] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.783579][    C2] pstate: 20001000 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[10916.783583][    C2] pc : 000000000040d3d0
[10916.783585][    C2] lr : 000000000040d8c8
[10916.783586][    C2] sp : 00000073eaffe720
[10916.783588][    C2] x29: 00000073eaffe720 x28: 00000000ffffffff x27: 000000003d7ee210
[10916.783594][    C2] x26: 0000000000000000 x25: 000000003d7ee210 x24: 0000000000000001
[10916.783600][    C2] x23: 00000073edc6a1a0 x22: 0000000000000000 x21: 00000000ffffffff
[10916.783605][    C2] x20: 000000003d7ee2d0 x19: 000000003d7ee2d0 x18: ffffffffffffff9c
[10916.783611][    C2] x17: 000000000000000d x16: 0000000000002710 x15: 00000000004d4b48
[10916.783616][    C2] x14: 00000073ed4e53f0 x13: 0000000000338eb0 x12: 00000073ee1c9e40
[10916.783622][    C2] x11: 000000003d7ee3f8 x10: 000000003d7eeff8 x9 : 2a40f0d8788f2339
[10916.783627][    C2] x8 : 0067b24ec2453898 x7 : 000000003d7ee758 x6 : 000000003d7eebf8
[10916.783633][    C2] x5 : ad17418726ae9005 x4 : 000000003d7ef358 x3 : fd96a6b096648357
[10916.783638][    C2] x2 : 000000003d7eef58 x1 : 000000003d7ee3f0 x0 : 000000003d7eebf8
[10916.789423][    C7] VendorHooks: CPU7: stopping
[10916.789425][    C7] CPU: 7 PID: 2330 Comm: HwBinder:2002_1 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[10916.789427][    C7] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[10916.789428][    C7] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[10916.789430][    C7] pc : console_emit_next_record+0x348/0x3d4
[10916.789435][    C7] lr : console_emit_next_record+0x344/0x3d4
[10916.789438][    C7] sp : ffffffc013f433f0
[10916.789438][    C7] x29: ffffffc013f43480 x28: ffffffc00a0ecd90 x27: ffffffc00a0cae30
[10916.789441][    C7] x26: 0000000000000000 x25: ffffffc00a2e7660 x24: 000000000000006e
[10916.789443][    C7] x23: ffffffc00a2e97b8 x22: ffffffc00a2e7778 x21: ffffffc00a1f78d0
[10916.789445][    C7] x20: 0000000000000001 x19: ffffffc013f434dc x18: ffffffc013b5d0a8
[10916.789447][    C7] x17: 00000000529c6ef0 x16: 00000000529c6ef0 x15: 75616620657a6973
[10916.789449][    C7] x14: 000000000000000a x13: 0000000000000030 x12: 000000000000006e
[10916.789451][    C7] x11: 00000000ffffffff x10: 0000000100000001 x9 : ffffff808ff62580
[10916.789452][    C7] x8 : 0000000100000001 x7 : 205b5d3037363937 x6 : 372e36313930315b
[10916.789454][    C7] x5 : ffffffc00a2e97cf x4 : ffffffc013f4336f x3 : ffffffc008887cf8
[10916.789456][    C7] x2 : 0000000000000005 x1 : 00000000000000c0 x0 : ffffffc00a1f6ce8
[10916.789458][    C7] Call trace:
[10916.789459][    C7]  console_emit_next_record+0x348/0x3d4
[10916.789461][    C7]  console_unlock+0x154/0x24c
[10916.789466][    C7]  vprintk_emit+0xcc/0x27c
[10916.789469][    C7]  vprintk_default+0x44/0x70
[10916.789472][    C7]  vprintk+0xe4/0x114
[10916.789473][    C7]  _printk+0x54/0x80
[10916.789476][    C7]  do_mem_abort+0xd0/0x118
[10916.789480][    C7]  el1_abort+0x3c/0x5c
[10916.789484][    C7]  el1h_64_sync_handler+0x54/0x90
[10916.789485][    C7]  el1h_64_sync+0x68/0x6c
[10916.789488][    C7]  memblock_is_map_memory+0x34/0x84
[10916.789492][    C7]  __check_object_size+0xbc/0x29c
[10916.789496][    C7]  binder_alloc_copy_user_to_buffer+0x108/0x288
[10916.789500][    C7]  binder_transaction+0x1130/0x1ea8
[10916.789503][    C7]  binder_thread_write+0xb54/0x2768
[10916.789506][    C7]  binder_ioctl+0x550/0x2438
[10916.789508][    C7]  __arm64_sys_ioctl+0xa8/0xe4
[10916.789512][    C7]  invoke_syscall+0x58/0x114
[10916.789515][    C7]  el0_svc_common+0x88/0xfc
[10916.789516][    C7]  do_el0_svc+0x24/0x84
[10916.789517][    C7]  el0_svc+0x2c/0x90
[10916.789519][    C7]  el0t_64_sync_handler+0x68/0xb4
[10916.789521][    C7]  el0t_64_sync+0x1a4/0x1a8

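Analysis:

First, several cores hit exceptions at the same time.

Next, trace one of the threads (T18182).

The x8 register is formed by adding the offset 0x368 to the page address 0xFFFFFFC00A2F2000.

uprobe_mmap() accesses the global variable uprobes_tree:

if (no_uprobe_events())
    return 0;

It tries to read &uprobes_tree.rb_node, which is at 0xffffffc00a2f2368.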
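
The origin of x8 can be cross-checked against the "Code:" line of the oops, which lists the instruction words just before the faulting one (shown in parentheses). The sketch below decodes the three words that matter here. It is a minimal decoder written for this example, covering only ADRP, the unshifted 64-bit ADD immediate, and the 64-bit LDAPR; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def decode_adrp(word):
    """Return (rd, byte_offset) for an A64 ADRP, or None if the word is not ADRP."""
    if (word & 0x9F000000) != 0x90000000:
        return None
    immlo = (word >> 29) & 0x3
    immhi = (word >> 5) & 0x7FFFF
    imm = (immhi << 2) | immlo
    if imm & (1 << 20):          # sign-extend the 21-bit page count
        imm -= 1 << 21
    return word & 0x1F, imm << 12


def decode_add_imm(word):
    """Return (rd, rn, imm12) for a 64-bit 'ADD Xd, Xn, #imm' without shift, else None."""
    if (word & 0xFFC00000) != 0x91000000:
        return None
    return word & 0x1F, (word >> 5) & 0x1F, (word >> 10) & 0xFFF


def decode_ldapr(word):
    """Return (rt, rn) for a 64-bit 'LDAPR Xt, [Xn]', else None."""
    if (word & 0xFFFFFC00) != 0xF8BFC000:
        return None
    return word & 0x1F, (word >> 5) & 0x1F


# Last three words of the Code: line; f8bfc108 is the faulting instruction.
adrp_word, add_word, ldapr_word = 0xF0010108, 0x910DA108, 0xF8BFC108

_, page_delta = decode_adrp(adrp_word)
_, _, low = decode_add_imm(add_word)
x8 = 0xFFFFFFC00A2F2368          # x8 from the register dump

# Page that the ADRP instruction itself must have been executing from.
adrp_page = (x8 - low - page_delta) & MASK64
print(hex(page_delta), hex(low), hex(adrp_page), decode_ldapr(ldapr_word))
```

The words decode to adrp x8 (page offset 0x2023000), add x8, x8, #0x368 and ldapr x8, [x8]. Working backwards from x8 puts the ADRP instruction in page 0xffffffc0082cf000, which is in the kernel text range. The address in x8 was therefore computed correctly, and the fault is caused by how that address was translated, not by a bad pointer.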

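The page is in fact mapped: the page-table walk reaches a valid level 3 entry (pte=00782000a22f2703). However, the output address stored in that entry lies outside the physical address range the MMU is configured for.

This is what triggers the level 3 address size fault.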
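
As a rough check of this fault, the sketch below extracts the output address from the PTE printed in the log and lists the address bits that lie above an assumed physical address width. The width of 40 bits is an assumption for illustration only. The real value depends on the SoC and on the kernel's TCR_EL1.IPS setting, neither of which appears in the log. The sketch also assumes a 4K granule without 52-bit physical addresses, and the function names are hypothetical.

```python
OA_MASK = 0x0000FFFFFFFFF000  # output address field of a 4K page descriptor, bits [47:12]


def is_valid_page(pte):
    """A level 3 descriptor is a valid page entry when bits [1:0] are 0b11."""
    return (pte & 0x3) == 0x3


def pte_output_address(pte):
    """Physical page address stored in the descriptor."""
    return pte & OA_MASK


def bits_beyond_pa(pte, pa_bits):
    """Output address bits set at or above pa_bits, i.e. outside the assumed PA range."""
    oa = pte_output_address(pte)
    return [b for b in range(pa_bits, 48) if (oa >> b) & 1]


pte = 0x00782000A22F2703      # pte printed for ffffffc00a2f2368 in the Case 3 log
ASSUMED_PA_BITS = 40          # assumption, not taken from the log

print(is_valid_page(pte), hex(pte_output_address(pte)), bits_beyond_pa(pte, ASSUMED_PA_BITS))
```

With these inputs the entry is valid, its output address is 0x2000a22f2000, and the only bit above the assumed width is bit 45. With that bit cleared the address would be 0xa22f2000, which is in the same range as other physical addresses in these logs. A single flipped bit in the PTE is therefore one plausible explanation, though the log alone does not prove it.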

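This case is rather odd. There is no direct evidence yet that DDR is at fault, but because several cores faulted at the same time, DDR is a reasonable suspect.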

2.4 Case 4

[  909.331382][T27244] Unable to handle kernel level 3 address size fault at virtual address ffffff816b0c8f58
[  909.332051][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffff80a7ae4920
[  909.332061][ T1548] Mem abort info:
[  909.332064][ T1548]   ESR = 0x0000000096000003
[  909.332066][ T1548]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.332070][ T1548]   SET = 0, FnV = 0
[  909.332073][ T1548]   EA = 0, S1PTW = 0
[  909.332075][ T1548]   FSC = 0x03: level 3 address size fault
[  909.332078][ T1548] Data abort info:
[  909.332079][ T1548]   ISV = 0, ISS = 0x00000003
[  909.332081][ T1548]   CM = 0, WnR = 0
[  909.332084][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.332089][ T1548] [ffffff80a7ae4920] pgd=180000027fc94003, p4d=180000027fc94003, pud=180000027fc94003, pmd=180000027fb56003, pte=0068020127ae4707
[  909.332107][ T1548] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP
[  909.332222][ T1548] Dumping ftrace buffer:
[  909.332245][ T1548]    (ftrace buffer empty)
[  909.332248][ T1548] Modules linked in: wlan(OE)
[  909.332253][ T1435] Unable to handle kernel level 3 address size fault at virtual address ffffff8003fc0d90
[  909.332255][ T1548]  nt36xxx_spi(OE)
[  909.332258][ T1548]  focaltech_spi(OE)
[  909.332260][ T1435] Mem abort info:
[  909.332261][ T1548]  msm_drm(OE)
[  909.332262][ T1435]   ESR = 0x0000000096000003
[  909.332264][ T1548]  rmnet_offload(OE)
[  909.332264][ T1435]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.332267][ T1548]  rmnet_shs(OE)
[  909.332267][ T1435]   SET = 0, FnV = 0
[  909.332270][ T1548]  rmnet_perf_tether(OE)
[  909.332270][ T1435]   EA = 0, S1PTW = 0
[  909.332273][ T1548]  rmnet_wlan(OE)
[  909.332273][ T1435]   FSC = 0x03: level 3 address size fault
[  909.332276][ T1435] Data abort info:
[  909.332276][ T1548]  msm_kgsl(OE)
[  909.332277][ T1435]   ISV = 0, ISS = 0x00000003
[  909.332279][ T1548]  rmnet_perf(OE)
[  909.332279][ T1435]   CM = 0, WnR = 0
[  909.332282][ T1548]  rmnet_core(OE)
[  909.332282][ T1435] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.332284][ T1548]  mac80211(E)
[  909.332286][ T1435] [ffffff8003fc0d90] pgd=180000027ff78003
[  909.332286][ T1548]  machine_dlkm(OE)
[  909.332288][ T1435] , p4d=180000027ff78003
[  909.332289][ T1548]  ipanetm(OE)
[  909.332290][ T1435] , pud=180000027ff78003
[  909.332291][ T1548]  rmnet_ctl(OE)
[  909.332292][ T1435] , pmd=180000027ff6b003
[  909.332294][ T1548]  wcd938x_dlkm(OE)
[  909.332294][ T1435] , pte=0068020083fc0707
[  909.333074][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffffc0010c4440
[  909.333078][ T1548] Mem abort info:
[  909.333079][ T1548]   ESR = 0x0000000096000003
[  909.333081][ T1548]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.333084][ T1548]   SET = 0, FnV = 0
[  909.333086][ T1548]   EA = 0, S1PTW = 0
[  909.333088][ T1548]   FSC = 0x03: level 3 address size fault
[  909.333090][ T1548] Data abort info:
[  909.333092][ T1548]   ISV = 0, ISS = 0x00000003
[  909.333093][ T1548]   CM = 0, WnR = 0
[  909.333095][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.333098][ T1548] [ffffffc0010c4440] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000009dc1f003, pte=006802009b976703
[  909.353464][    C5] Mem abort info:
[  909.353471][    C5]   ESR = 0x0000000096000003
[  909.353473][    C5]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.353476][    C5]   SET = 0, FnV = 0
[  909.353477][    C5]   EA = 0, S1PTW = 0
[  909.353479][    C5]   FSC = 0x03: level 3 address size fault
[  909.353481][    C5] Data abort info:
[  909.353482][    C5]   ISV = 0, ISS = 0x00000003
[  909.353483][    C5]   CM = 0, WnR = 0
[  909.353485][    C5] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000

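Basis for the judgment:

Similar to Case 1: different threads on different cores triggered "Unable to handle kernel level 3 address size fault at virtual address" at the same time, which is highly abnormal. The most likely cause is DDR instability.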
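
The "Mem abort info" lines in these logs are the kernel's own decoding of the ESR value, and the same decoding is easy to reproduce when only the raw ESR is available. The sketch below is a minimal illustration: the name tables contain only the exception classes and fault status codes that appear in this article, and the function name is hypothetical. The full encodings are defined in the Arm Architecture Reference Manual.

```python
# Only the values seen in this article's logs; not a complete table.
EC_NAMES = {0x21: "IABT (current EL)", 0x25: "DABT (current EL)"}
FSC_NAMES = {
    0x03: "level 3 address size fault",
    0x05: "level 1 translation fault",
    0x06: "level 2 translation fault",
}


def decode_esr(esr):
    """Split an ESR_ELx value into EC, IL and (for aborts) the fault status code."""
    ec = (esr >> 26) & 0x3F   # exception class, bits [31:26]
    il = (esr >> 25) & 0x1    # instruction length, bit [25]
    fsc = esr & 0x3F          # IFSC/DFSC, bits [5:0]; meaningful for aborts only
    return {
        "EC": ec,
        "EC_name": EC_NAMES.get(ec, "not in this table"),
        "IL_bits": 32 if il else 16,
        "FSC": fsc,
        "FSC_name": FSC_NAMES.get(fsc, "not in this table"),
    }


# ESR values copied from the logs in this article.
for esr in (0x96000003, 0x86000005, 0x96000006):
    print(hex(esr), decode_esr(esr))
```

For 0x96000003 this yields EC = 0x25 (data abort at the current EL) and FSC = 0x03 (level 3 address size fault), matching the kernel's printout. Distinguishing address size faults from translation faults matters here, because the former implies a corrupted but valid-looking page-table entry rather than a missing mapping.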

2.5 Case 5

[  909.331382][T27244] Unable to handle kernel level 3 address size fault at virtual address ffffff816b0c8f58
[  909.332051][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffff80a7ae4920
[  909.332061][ T1548] Mem abort info:
[  909.332064][ T1548]   ESR = 0x0000000096000003
[  909.332066][ T1548]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.332070][ T1548]   SET = 0, FnV = 0
[  909.332073][ T1548]   EA = 0, S1PTW = 0
[  909.332075][ T1548]   FSC = 0x03: level 3 address size fault
[  909.332078][ T1548] Data abort info:
[  909.332079][ T1548]   ISV = 0, ISS = 0x00000003
[  909.332081][ T1548]   CM = 0, WnR = 0
[  909.332084][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.332089][ T1548] [ffffff80a7ae4920] pgd=180000027fc94003, p4d=180000027fc94003, pud=180000027fc94003, pmd=180000027fb56003, pte=0068020127ae4707
[  909.332107][ T1548] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP
[  909.332222][ T1548] Dumping ftrace buffer:
[  909.332245][ T1548]    (ftrace buffer empty)
[  909.332248][ T1548] Modules linked in: wlan(OE)
[  909.332253][ T1435] Unable to handle kernel level 3 address size fault at virtual address ffffff8003fc0d90
[  909.332255][ T1548]  nt36xxx_spi(OE)
[  909.332258][ T1548]  focaltech_spi(OE)
[  909.332260][ T1435] Mem abort info:
[  909.332261][ T1548]  msm_drm(OE)
[  909.332262][ T1435]   ESR = 0x0000000096000003
[  909.332264][ T1548]  rmnet_offload(OE)
[  909.332264][ T1435]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.332267][ T1548]  rmnet_shs(OE)
[  909.332267][ T1435]   SET = 0, FnV = 0
[  909.332270][ T1548]  rmnet_perf_tether(OE)
[  909.332270][ T1435]   EA = 0, S1PTW = 0
[  909.332273][ T1548]  rmnet_wlan(OE)
[  909.332273][ T1435]   FSC = 0x03: level 3 address size fault
[  909.332276][ T1435] Data abort info:
[  909.332276][ T1548]  msm_kgsl(OE)
[  909.332277][ T1435]   ISV = 0, ISS = 0x00000003
[  909.332279][ T1548]  rmnet_perf(OE)
[  909.332279][ T1435]   CM = 0, WnR = 0
[  909.332282][ T1548]  rmnet_core(OE)
[  909.332282][ T1435] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.332284][ T1548]  mac80211(E)
[  909.332286][ T1435] [ffffff8003fc0d90] pgd=180000027ff78003
[  909.332286][ T1548]  machine_dlkm(OE)
[  909.332288][ T1435] , p4d=180000027ff78003
[  909.332289][ T1548]  ipanetm(OE)
[  909.332290][ T1435] , pud=180000027ff78003
[  909.332291][ T1548]  rmnet_ctl(OE)
[  909.332292][ T1435] , pmd=180000027ff6b003
[  909.332294][ T1548]  wcd938x_dlkm(OE)
[  909.332294][ T1435] , pte=0068020083fc0707
[  909.333074][ T1548] Unable to handle kernel level 3 address size fault at virtual address ffffffc0010c4440
[  909.333078][ T1548] Mem abort info:
[  909.333079][ T1548]   ESR = 0x0000000096000003
[  909.333081][ T1548]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.333084][ T1548]   SET = 0, FnV = 0
[  909.333086][ T1548]   EA = 0, S1PTW = 0
[  909.333088][ T1548]   FSC = 0x03: level 3 address size fault
[  909.333090][ T1548] Data abort info:
[  909.333092][ T1548]   ISV = 0, ISS = 0x00000003
[  909.333093][ T1548]   CM = 0, WnR = 0
[  909.333095][ T1548] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
[  909.333098][ T1548] [ffffffc0010c4440] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000009dc1f003, pte=006802009b976703
[  909.353464][    C5] Mem abort info:
[  909.353471][    C5]   ESR = 0x0000000096000003
[  909.353473][    C5]   EC = 0x25: DABT (current EL), IL = 32 bits
[  909.353476][    C5]   SET = 0, FnV = 0
[  909.353477][    C5]   EA = 0, S1PTW = 0
[  909.353479][    C5]   FSC = 0x03: level 3 address size fault
[  909.353481][    C5] Data abort info:
[  909.353482][    C5]   ISV = 0, ISS = 0x00000003
[  909.353483][    C5]   CM = 0, WnR = 0
[  909.353485][    C5] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000

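Basis for the judgment:

Similar to Case 4: different threads on different cores triggered "Unable to handle kernel level 3 address size fault at virtual address" at the same time, which is highly abnormal. The most likely cause is DDR instability.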

2.6 Case 6

[ 1052.754789][ T5244] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0
[ 1052.758577][T23517] [NVT-ts] nvt_bin_header_parser 188: ovly_info = 1, ilm_dlm_num = 2, ovly_sec_num = 3, info_sec_num = 13, partition = 18
[ 1052.767961][ T5244] Mem abort info:
[ 1052.767965][ T5244]   ESR = 0x0000000096000005
[ 1052.767967][ T5244]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1052.767969][ T5244]   SET = 0, FnV = 0
[ 1052.794504][  T535] fg_read_current: [FG_NFG1000] successed to read IBAT:-923
[ 1052.794611][ T5244]   EA = 0, S1PTW = 0
[ 1052.801073][T23517] [NVT-ts] nvt_bootloader_reset 848: end
[ 1052.808016][ T5244]   FSC = 0x05: level 1 translation fault
[ 1052.808018][ T5244] Data abort info:
[ 1052.808019][ T5244]   ISV = 0, ISS = 0x00000005
[ 1052.808020][ T5244]   CM = 0, WnR = 0
[ 1052.808021][ T5244] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000d8f03000
[ 1052.808023][ T5244] [00000000000000e0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 1052.808030][ T5244] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[ 1052.808074][ T5244] Dumping ftrace buffer:
[ 1052.859124][T31837] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0
[ 1052.861728][T31830] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0
[ 1052.861735][T31830] Mem abort info:
[ 1052.861737][T31830]   ESR = 0x0000000096000006
[ 1052.861739][T31830]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1052.861742][T31830]   SET = 0, FnV = 0
[ 1052.861744][T31830]   EA = 0, S1PTW = 0
[ 1052.861746][T31830]   FSC = 0x06: level 2 translation fault
[ 1052.861749][T31830] Data abort info:
[ 1052.861750][T31830]   ISV = 0, ISS = 0x00000006
[ 1052.861751][T31830]   CM = 0, WnR = 0
[ 1052.861753][T31830] user pgtable: 4k pages, 39-bit VAs, pgdp=00000001ea216000
[ 1052.861757][T31830] [00000000000000e0] pgd=08000001eccbe003, p4d=08000001eccbe003, pud=08000001eccbe003, pmd=0000000000000000
[ 1052.862418][ T5244]    (ftrace buffer empty)
[ 1052.870882][  T535] sc853x-charger 0-0061: sc853x_get_adc_data 1 5190
[ 1052.872118][T31837] Mem abort info:
[ 1052.881584][ T5244]  qpnp_lcdb_regulator(E)
[ 1052.889699][ T5244] CPU: 6 PID: 5244 Comm: NPDecoder-CL Tainted: G         C OE      6.1.118-android14-11-maybe-dirty #1
[ 1052.889702][ T5244] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.889704][ T5244] pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1052.889707][ T5244] pc : cap_capable+0xc/0x3c
[ 1052.889721][ T5244] lr : security_capable+0x68/0x90
[ 1052.889724][ T5244] sp : ffffffc0211737a0
[ 1052.889725][ T5244] x29: ffffffc0211737a0 x28: 0000000000000000 x27: 00000000000000b8
[ 1052.889728][ T5244] x26: ffffff8090868200 x25: ffffff8092bc4680 x24: 0000000000000000
[ 1052.889730][ T5244] x23: ffffffc0096165a8 x22: ffffff8049844900 x21: ffffffc00a0f1950
[ 1052.889732][ T5244] x20: 0000000000000017 x19: 0000000000000002 x18: ffffffc014695070
[ 1052.889734][ T5244] x17: 0000000090dc9518 x16: 0000000090dc9518 x15: b400007ccf9e3c90
[ 1052.889736][ T5244] x14: 0000020000000000 x13: 0000000000000102 x12: ffffff80955b5280
[ 1052.889739][ T5244] x11: ffffff8002b25480 x10: ffffff816c2f4b00 x9 : 0000000000000001
[ 1052.889741][ T5244] x8 : 0000000000000000 x7 : ffffffffffffff00 x6 : ffffff808b6ff0b4
[ 1052.889743][ T5244] x5 : 0000000000000006 x4 : 000000000000ed71 x3 : 0000000000000002
[ 1052.889745][ T5244] x2 : 0000000000000017 x1 : ffffffc00a0f1950 x0 : ffffff8049844900
[ 1052.889747][ T5244] Call trace:
[ 1052.889749][ T5244]  cap_capable+0xc/0x3c
[ 1052.889752][ T5244]  has_capability_noaudit+0x38/0x60
[ 1052.889758][ T5244]  binder_do_set_priority+0x90/0x36c
[ 1052.889764][ T5244]  binder_transaction_priority+0x114/0x264
[ 1052.889767][ T5244]  binder_proc_transaction+0x248/0x898
[ 1052.889770][ T5244]  binder_transaction+0x1768/0x1ea8
[ 1052.889773][ T5244]  binder_thread_write+0xb54/0x2768
[ 1052.889776][ T5244]  binder_ioctl+0x550/0x2438
[ 1052.889779][ T5244]  __arm64_sys_ioctl+0xa8/0xe4
[ 1052.889783][ T5244]  invoke_syscall+0x58/0x114
[ 1052.889786][ T5244]  el0_svc_common+0xb4/0xfc
[ 1052.889788][ T5244]  do_el0_svc+0x24/0x84
[ 1052.889789][ T5244]  el0_svc+0x2c/0x90
[ 1052.889793][ T5244]  el0t_64_sync_handler+0x68/0xb4
[ 1052.889796][ T5244]  el0t_64_sync+0x1a4/0x1a8
[ 1052.889799][ T5244] Code: 90dc9518 f9405408 eb01011f 54000100 (b940e109) 
[ 1052.889801][ T5244] ---[ end trace 0000000000000000 ]---
[ 1052.901222][ T5263] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0
[ 1052.904621][T31837]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1052.909956][ T5263] Mem abort info:
[ 1052.909957][ T5263]   ESR = 0x0000000096000005
[ 1052.909958][ T5263]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1052.909959][ T5263]   SET = 0, FnV = 0
[ 1052.909960][ T5263]   EA = 0, S1PTW = 0
[ 1052.909961][ T5263]   FSC = 0x05: level 1 translation fault
[ 1052.909962][ T5263] Data abort info:
[ 1052.909962][ T5263]   ISV = 0, ISS = 0x00000005
[ 1052.909963][ T5263]   CM = 0, WnR = 0
[ 1052.909963][ T5263] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000d4f9a000
[ 1052.909965][ T5263] [00000000000000e0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 1052.919834][ T5244] Kernel panic - not syncing: Oops: Fatal exception
[ 1052.919836][ T5244] SMP: stopping secondary CPUs
[ 1052.919861][    C2] VendorHooks: CPU2: stopping
[ 1052.919868][    C2] CPU: 2 PID: 31833 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[ 1052.919874][    C2] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.919877][    C2] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[ 1052.919883][    C2] pc : 0000000000403db4
[ 1052.919885][    C2] lr : 000000000040431c
[ 1052.919887][    C2] sp : 0000007674c55670
[ 1052.919889][    C2] x29: 0000007674c55670 x28: 00000000ffffffff x27: 000000001707c250
[ 1052.919896][    C2] x26: 0000000000000000 x25: 000000001707c230 x24: 0000000000000000
[ 1052.919903][    C2] x23: 000000001707c160 x22: 0000000000000001 x21: 000000001707bfa0
[ 1052.919910][    C2] x20: 0000000000000000 x19: 0000007670bce194 x18: ffffffffffffff9c
[ 1052.919916][    C2] x17: 0000000000000049 x16: 0000000000002710 x15: 00000000004d4b48
[ 1052.919921][    C2] x14: d207e381491a4aaf x13: de4a6af0a0bfaff5 x12: 3c913bab09122bb9
[ 1052.919928][    C2] x11: 94e4dff91f08a69f x10: fadccd4d70ad9f9a x9 : e29afe0f16109821
[ 1052.919934][    C2] x8 : 6cfd2256f07dec61 x7 : 170f324bae43d0d6 x6 : 0000007671736954
[ 1052.919939][    C2] x5 : 0b76436913d88acd x4 : 0000000000487f68 x3 : 0000007670c0428d
[ 1052.919945][    C2] x2 : 000000001532474e x1 : 0000000000000030 x0 : 0000007670be9bbb
[ 1052.919954][    C4] VendorHooks: CPU4: stopping
[ 1052.919958][    C4] CPU: 4 PID: 31836 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[ 1052.919963][    C4] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.919966][    C4] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[ 1052.919970][    C4] pc : 0000000000403dac
[ 1052.919972][    C4] lr : 000000000040431c
[ 1052.919973][    C4] sp : 000000766e327680
[ 1052.919975][    C4] x29: 000000766e327680 x28: 00000000004896b8 x27: 0000000000000000
[ 1052.919981][    C4] x26: 0000000000000001 x25: 0000000000000003 x24: 000000001707ffa0
[ 1052.919987][    C4] x23: 0000000000000000 x22: 00000000fe7a4e9c x21: 000000001707fd20
[ 1052.919992][    C4] x20: 0000000000000005 x19: 000000765a3aee14 x18: ffffffffffffff9c
[ 1052.919997][    C4] x17: 0000000000000049 x16: 0000000000002710 x15: 00000000004d4b48
[ 1052.920003][    C4] x14: 0000007659737df0 x13: 00000000003cd87c x12: 0000000000000000
[ 1052.920008][    C4] x11: 0101010101010101 x10: 0000000000000038 x9 : 00000000ffffff80
[ 1052.920014][    C4] x8 : 2525252525252525 x7 : 0000000000000000 x6 : 00000076783294c6
[ 1052.920019][    C4] x5 : 0000000000000000 x4 : 0000000000487f68 x3 : 000000765a3e4f0d
[ 1052.920024][    C4] x2 : 00000000197791b9 x1 : 000000000000005a x0 : 000000765a3dfca7
[ 1052.920032][    C1] VendorHooks: CPU1: stopping
[ 1052.920035][    C1] CPU: 1 PID: 31831 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[ 1052.920040][    C1] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.920042][    C1] pstate: 80001000 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[ 1052.920047][    C1] pc : 0000000000403db4
[ 1052.920049][    C1] lr : 000000000040431c
[ 1052.920051][    C1] sp : 0000007677b27680
[ 1052.920053][    C1] x29: 0000007677b27680 x28: 00000000004896b8 x27: 0000000000000000
[ 1052.920059][    C1] x26: 0000000000000000 x25: 0000000000000002 x24: 0000000017079900
[ 1052.920064][    C1] x23: 0000000000000000 x22: 0000000053465b49 x21: 00000000170796a0
[ 1052.920071][    C1] x20: 000000000000000a x19: 000000764e0efc00 x18: ffffffffffffff9c
[ 1052.920076][    C1] x17: 0000000000000049 x16: 0000000000002710 x15: 00000000004d4b48
[ 1052.920084][    C1] x14: 000000764d737df0 x13: 00000000003cd87c x12: 000000764e66ee80
[ 1052.920092][    C1] x11: 0000000017079af8 x10: 000000001707a6f8 x9 : 1fe47d6b184bc804
[ 1052.920098][    C1] x8 : 972ea7d08aa01c03 x7 : a67d73ef8fa9f5ac x6 : 000000001707a2f8
[ 1052.920104][    C1] x5 : bf66cf1f7ee5e51c x4 : 0000000000487f68 x3 : 000000764e125cf9
[ 1052.920109][    C1] x2 : 000000000caa6a19 x1 : 0000000000000004 x0 : 000000764e0f8ba8
[ 1052.920116][    C5] VendorHooks: CPU5: stopping
[ 1052.920120][    C5] CPU: 5 PID: 31832 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[ 1052.920125][    C5] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.920128][    C5] pstate: 20001000 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
[ 1052.920132][    C5] pc : 00000000004480b0
[ 1052.920134][    C5] lr : 0000000000409058
[ 1052.920135][    C5] sp : 0000007678327770
[ 1052.920137][    C5] x29: 0000007678327770 x28: 00000000ffffffff x27: 000000001707add0
[ 1052.920143][    C5] x26: 0000000000000000 x25: 000000001707adb0 x24: 0000000000000012
[ 1052.920148][    C5] x23: 0000000000000000 x22: 0000000000000002 x21: 000000001707ad10
[ 1052.920154][    C5] x20: 000000001707ab20 x19: 000000000000000d x18: ffffffffffffff9c
[ 1052.920160][    C5] x17: 0000000000000049 x16: 0000000000002710 x15: 00000000004d4b48
[ 1052.920165][    C5] x14: 6baa6e969e5d8f32 x13: 9d424bc2b56d0dc6 x12: 9920ea012d42ac2d
[ 1052.920171][    C5] x11: 70e1532c0b43064e x10: 4c95bcd98c78e77b x9 : 8124cc131ff463a1
[ 1052.920176][    C5] x8 : 2151d750098d0124 x7 : 6d70fcd99f7e25a5 x6 : 000000765127e870
[ 1052.920182][    C5] x5 : d092813433abb5d6 x4 : 0000000000000008 x3 : 000000765266f190
[ 1052.920187][    C5] x2 : 00000000000122f8 x1 : 00000076521b5d90 x0 : 000000765125aab8
[ 1052.922166][    C0] VendorHooks: CPU0: stopping
[ 1052.922169][    C0] CPU: 0 PID: 31837 Comm: QMESA_64 Tainted: G      D  C OE      6.1.118-android14-11-maybe-dirty #1
[ 1052.922174][    C0] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1052.922176][    C0] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1052.922181][    C0] pc : console_emit_next_record+0x348/0x3d4
[ 1052.922202][    C0] lr : console_emit_next_record+0x344/0x3d4
[ 1052.922206][    C0] sp : ffffffc023e23690
[ 1052.922208][    C0] x29: ffffffc023e23720 x28: ffffffc00a0ecd90 x27: ffffffc00a0cae30
[ 1052.922214][    C0] x26: 0000000000000000 x25: ffffffc00a2e7660 x24: 0000000000000029
[ 1052.922220][    C0] x23: ffffffc00a2e97b8 x22: ffffffc00a2e7778 x21: ffffffc00a1f78d0
[ 1052.922228][    C0] x20: 0000000000000001 x19: ffffffc023e2377c x18: ffffffc01fd1d0c0
[ 1052.922234][    C0] x17: 00000000529c6ef0 x16: 00000000529c6ef0 x15: ffffffc00902ed5c
[ 1052.922241][    C0] x14: 0000000000000030 x13: 0000000000000020 x12: 0000000000000029
[ 1052.922246][    C0] x11: 00000000ffffffff x10: 0000000000000001 x9 : ffffff808622b840
[ 1052.922252][    C0] x8 : 0000000000000001 x7 : 545b5d3135373136 x6 : 382e32353031205b
[ 1052.922257][    C0] x5 : ffffffc00a2e97cf x4 : ffffffc023e2360f x3 : ffffffc008887cf8
[ 1052.922263][    C0] x2 : 0000000000000006 x1 : 00000000000000c0 x0 : ffffffc00a1f6ce8
[ 1052.922269][    C0] Call trace:
[ 1052.922274][    C0]  console_emit_next_record+0x348/0x3d4
[ 1052.922280][    C0]  console_unlock+0x154/0x24c
[ 1052.922286][    C0]  vprintk_emit+0xcc/0x27c
[ 1052.922292][    C0]  vprintk_default+0x44/0x70
[ 1052.922299][    C0]  vprintk+0xe4/0x114
[ 1052.922303][    C0]  _printk+0x54/0x80
[ 1052.922308][    C0]  mem_abort_decode+0x5c/0x144
[ 1052.922318][    C0]  __do_kernel_fault+0x248/0x2b4
[ 1052.922324][    C0]  do_page_fault+0x218/0x4c8
[ 1052.922344][    C0]  do_translation_fault+0x38/0x54
[ 1052.922349][    C0]  do_mem_abort+0x58/0x118
[ 1052.922353][    C0]  el1_abort+0x3c/0x5c
[ 1052.922359][    C0]  el1h_64_sync_handler+0x54/0x90
[ 1052.922364][    C0]  el1h_64_sync+0x68/0x6c
[ 1052.922370][    C0]  cap_capable+0xc/0x3c
[ 1052.922378][    C0]  file_ns_capable+0x20/0x44
[ 1052.922385][    C0]  pagemap_read+0x124/0x428
[ 1052.922393][    C0]  vfs_read+0x100/0x2c8
[ 1052.922399][    C0]  ksys_read+0x78/0xe8
[ 1052.922402][    C0]  __arm64_sys_read+0x1c/0x2c
[ 1052.922406][    C0]  invoke_syscall+0x58/0x114
[ 1052.922411][    C0]  el0_svc_common+0xb4/0xfc
[ 1052.922416][    C0]  do_el0_svc+0x24/0x84
[ 1052.922420][    C0]  el0_svc+0x2c/0x90
[ 1052.922428][    C0]  el0t_64_sync_handler+0x68/0xb4
[ 1052.922434][    C0]  el0t_64_sync+0x1a4/0x1a8
[ 1053.951825][ T5244] SMP: failed to stop secondary CPUs 3,6-7

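Basis for the judgment:

First, several cores hit exceptions at almost the same time.

Second, follow one of the threads, T5244.

It died at ldr w9,[x8,#0xE0]. The fault clearly occurs when the pointer loaded from cred->user_ns is dereferenced: x8 is 0, which matches the faulting virtual address 0xe0.

x8 is produced by ldr x8,[x0,#0xA8] (the word f9405408 in the Code: line). x0 holds a normal-looking value, 0xFFFFFF8049844900, so x8 is loaded from address 0xFFFFFF80498449A8 and should contain a valid kernel pointer. Instead x8 is 0. The value that reached the register is not what should have been in memory, which points to a memory problem.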
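The loads in this trace can be checked directly against the `Code:` line of the oops. The following is a minimal sketch that decodes only the A64 `LDR (immediate, unsigned offset)` form for W/X registers; the function name `decode_ldr_uimm` is made up for illustration, and it is not a general disassembler.

```python
def decode_ldr_uimm(word):
    """Decode an A64 integer LDR (immediate, unsigned offset) instruction word.

    Returns a disassembly-like string, or None if the word is not this form.
    """
    # Fixed bits: bit 31 = 1, bits 29..22 = 0b11100101; bit 30 selects W or X.
    if (word & 0xBFC00000) != 0xB9400000:
        return None
    is64 = (word >> 30) & 1
    imm12 = (word >> 10) & 0xFFF
    rn = (word >> 5) & 0x1F
    rt = word & 0x1F
    # The immediate is scaled by the access size (8 bytes for X, 4 for W).
    offset = imm12 * (8 if is64 else 4)
    prefix = "x" if is64 else "w"
    target = f"{prefix}zr" if rt == 31 else f"{prefix}{rt}"
    base = "sp" if rn == 31 else f"x{rn}"
    return f"ldr {target}, [{base}, #{offset:#x}]"


# Words taken from the "Code:" line of thread T5244 in the log above.
for word in (0xF9405408, 0xEB01011F, 0x54000100, 0xB940E109):
    print(f"{word:08x}  {decode_ldr_uimm(word)}")
```

Words that are not in this LDR form (here the compare and the conditional branch) are reported as `None`, so the two loads stand out in the output.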

2.7 Case 7

	[  884.542922][  T283] Unable to handle kernel level 3 address size fault at virtual address ffffffc00a2c25e8
	[  884.552843][  T283] Mem abort info:
	[  884.556518][  T283]   ESR = 0x0000000096000003
	[  884.561034][  T283]   EC = 0x25: DABT (current EL), IL = 32 bits
	[  884.567154][  T283]   SET = 0, FnV = 0
	[  884.570952][  T283]   EA = 0, S1PTW = 0
	[  884.574893][  T283]   FSC = 0x03: level 3 address size fault
	[  884.580611][  T283] Data abort info:
	[  884.584308][  T283]   ISV = 0, ISS = 0x00000003
	[  884.588948][  T283]   CM = 0, WnR = 0
	[  884.592840][  T283] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000a1c14000
	[  884.600696][  T283] [ffffffc00a2c25e8] pgd=100000027fffe003, p4d=100000027fffe003, pud=100000027fffe003, pmd=100000027fff9003, pte=00780000a22c2703
	[  884.616092][  T283] Internal error: Oops: 0000000096000003 [#1] PREEMPT SMP
	[  884.623129][  T283] Dumping ftrace buffer:
	[  884.627242][  T283]    (ftrace buffer empty)
	[  884.632065][  T283] CPU: 7 PID: 283 Comm: ueventd Tainted: G         C OE      6.1.118-android14-11-maybe-dirty #1
	[  884.632070][  T283] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
	[  884.632073][  T283] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
	[  884.632075][  T283] pc : vm_area_dup+0x20/0x12c
	[  884.632095][  T283] lr : copy_mm+0x3b8/0x7fc
	[  884.632097][  T283] sp : ffffffc00beb3b70
	[  884.632098][  T283] x29: ffffffc00beb3b70 x28: fffffffffff7dfff x27: ffffff801d6b12c0
	[  884.632102][  T283] x26: 0000000000000000 x25: ffffff80a994d708 x24: ffffff801c83bc80
	[  884.632105][  T283] x23: 000000000000067c x22: ffffff80049fc060 x21: ffffffc00a2c2000
	[  884.632107][  T283] x20: ffffff801c83bc80 x19: ffffff80045312c0 x18: ffffffc00b4ad050
	[  884.632110][  T283] x17: 0000007f98a6a000 x16: 0000000000000006 x15: 0000007f98a6a000
	[  884.632112][  T283] x14: 0000000000000001 x13: 0000000000000241 x12: 0000000000000000
	[  884.632115][  T283] x11: ffffff809f5a3e80 x10: 0000000100000000 x9 : 0000000004000000
	[  884.632118][  T283] x8 : 0000000000000070 x7 : 0000000000000000 x6 : ffffff801c83b578
	[  884.632120][  T283] x5 : 0000000000000001 x4 : 0000000000000000 x3 : 0000000000000242
	[  884.632123][  T283] x2 : 0000000100023ad4 x1 : 0000000000000cc0 x0 : ffffff801c83bc80
	[  884.632126][  T283] Call trace:
	[  884.632128][  T283]  vm_area_dup+0x20/0x12c
	[  884.632133][  T283]  copy_mm+0x3b8/0x7fc
	[  884.632135][  T283]  copy_process+0x4dc/0xc9c
	[  884.632136][  T283]  kernel_clone+0xb0/0x488
	[  884.632138][  T283]  __arm64_sys_clone+0x5c/0x8c
	[  884.632140][  T283]  invoke_syscall+0x58/0x114
	[  884.632144][  T283]  el0_svc_common+0xb4/0xfc
	[  884.632146][  T283]  do_el0_svc+0x24/0x84
	[  884.632147][  T283]  el0_svc+0x2c/0x90
	[  884.632155][  T283]  el0t_64_sync_handler+0x68/0xb4
	[  884.632157][  T283]  el0t_64_sync+0x1a4/0x1a8
	[  884.632162][  T283] Code: 910003fd 90011095 aa0003f4 52819801 (f942f6a0) 
	[  884.638961][  T283] ---[ end trace 0000000000000000 ]---

![](https://iliuqi-obsidian.oss-cn-shanghai.aliyuncs.com/20250513103301157.jpeg)
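From the stack frame, new is still NULL, and from the position of the PC we can tell that kmem_cache_alloc has not run yet: the BL that calls it is the next instruction.

Let's analyze this part of the assembly: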
```asm
90011095 adrp x21,0xFFFFFFC00A2C2000
AA0003F4 mov x20,x0 ; x20,orig
52819801 mov w1,#0xCC0 ; w1,#3264
F942F6A0 ldr x0,[x21,#0x5E8] ; x0,[x21,#1512]
```

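The ldr loads the value stored at address x21+0x5E8 into x0; that location is vm_area_cachep.

So at this point x0 should be 0xFFFFFF800B4EA000.

In fact x0 is still 0xFFFFFF801C83BC80, the value it had on entry to vm_area_dup (the orig argument, which the preceding mov also copied into x20). In other words, the ldr never completed: it is the instruction that raised the address size fault on 0xFFFFFFC00A2C25E8, which is exactly x21+0x5E8, even though the PTE printed for that address in the dump appears to be an ordinary valid mapping. This is very strange!

At this point it is reasonable to suspect unstable DDR. It is best to ask the memory team whether the current DDR part has known issues of this kind, and then run memory stress tests on this particular machine.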
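As a cross-check of the Case 7 log, the sketch below decodes the ESR value, recomputes the faulting address from the adrp page plus the ldr offset, and extracts the output address from the printed PTE. It assumes a 4K granule with the output address in bits [47:12] of the descriptor; the helper names and the small `FSC_NAMES` table (limited to the codes that appear in this article's logs) are illustrative only.

```python
# Fault status codes seen in the logs of this article (not a complete table).
FSC_NAMES = {
    0x03: "level 3 address size fault",
    0x05: "level 1 translation fault",
    0x06: "level 2 translation fault",
}


def decode_esr(esr):
    """Split an ESR_EL1 value into (EC, IL, FSC) for an abort exception."""
    ec = (esr >> 26) & 0x3F   # exception class
    il = (esr >> 25) & 1      # 1 = 32-bit instruction
    fsc = esr & 0x3F          # fault status code, low 6 bits of the ISS
    return ec, il, fsc


def pte_output_address(pte):
    """Output address of a 4K page descriptor, assuming OA in bits [47:12]."""
    return pte & 0x0000FFFFFFFFF000


ec, il, fsc = decode_esr(0x96000003)
print(f"EC={ec:#x} IL={il} FSC={fsc:#04x} ({FSC_NAMES.get(fsc, 'unknown')})")

adrp_page = 0xFFFFFFC00A2C2000      # x21 after the adrp
fault_addr = adrp_page + 0x5E8      # address accessed by the ldr
print(f"ldr address = {fault_addr:#x}")

pte = 0x00780000A22C2703            # value printed in the oops
print(f"PTE valid bits = {pte & 0x3:#x}, output address = {pte_output_address(pte):#x}")
```

The recomputed address equals the virtual address reported in the first line of the oops, which confirms that the ldr at vm_area_dup+0x20 is the access that faulted.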

2.8 Case 8

[10907.962594][ T14] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP

[10907.971885][ T14] Dumping ftrace buffer:

[10907.976090][ T14] (ftrace buffer empty)

[10907.981288][ T14] CPU: 2 PID: 14 Comm: rcu_preempt Tainted: G WC OE 6.1.118-android14-11-maybe-dirty #1

[10907.981296][ T14] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)

[10907.981301][ T14] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)

[10907.981308][ T14] pc : __mod_timer+0x160/0x404

[10907.981330][ T14] lr : __mod_timer+0x150/0x404

[10907.981336][ T14] sp : ffffffc00a52bc50

[10907.981338][ T14] x29: ffffffc00a52bc70 x28: ffffffc00a16d620 x27: ffffffc00a09f240

[10907.981349][ T14] x26: 00000000ffffffff x25: 0000000000000000 x24: 0000000000000000

[10907.981357][ T14] x23: 0000000000000002 x22: ffffff81f6f4f240 x21: 0000000000000000

[10907.981365][ T14] x20: 0000000100287754 x19: ffffffc00a52bcd8 x18: ffffffc00a3ed028

[10907.981374][ T14] x17: 0000000022d92518 x16: 0000000022d92518 x15: ffffff81f6f60180

[10907.981382][ T14] x14: 0000000000000010 x13: 0000000100287753 x12: 0000000100288192

[10907.981390][ T14] x11: 0000000000000002 x10: 0000000100287753 x9 : 0000000000000000

[10907.981397][ T14] x8 : 0000000000000000 x7 : 0000000000000001 x6 : fffffffdef8cc520

[10907.981405][ T14] x5 : 00000000000003e8 x4 : 0000000000000001 x3 : ffffffc00a16dea0

[10907.981413][ T14] x2 : 0000000000000000 x1 : ffffff81f6f4f240 x0 : 0000000000000000

[10907.981421][ T14] Call trace:

[10907.981426][ T14] __mod_timer+0x160/0x404

[10907.981433][ T14] schedule_timeout+0x8c/0x12c

[10907.981443][ T14] rcu_gp_fqs_loop+0x200/0x968

[10907.981452][ T14] rcu_gp_kthread+0x64/0x2bc

[10907.981457][ T14] kthread+0x10c/0x154

[10907.981465][ T14] ret_from_fork+0x10/0x20

[10907.981477][ T14] Code: 2a0003f8 36000057 34000e38 b9402277 (f503201f)

[10907.988286][ T14] ---[ end trace 0000000000000000 ]---

[10908.094081][ T14] Kernel panic - not syncing: Oops - Undefined instruction: Fatal exception

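The instruction word at the PC should be D503201F (a nop), but the CPU fetched F503201F, which is an undefined instruction. The two values differ only in bit 29, so a single bitflip occurred, and this can be attributed to a DDR problem.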
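The bitflip can be confirmed by XOR-ing the expected and the observed instruction words. This is a small sketch; the helper name `flipped_bits` is an assumption made for illustration.

```python
def flipped_bits(expected, observed, width=32):
    """Return the bit positions (0 = least significant) where two words differ."""
    diff = expected ^ observed
    return [bit for bit in range(width) if (diff >> bit) & 1]


expected = 0xD503201F   # nop, the instruction that should be at the PC
observed = 0xF503201F   # word shown in parentheses on the "Code:" line
bits = flipped_bits(expected, observed)
print(f"xor = {expected ^ observed:#010x}, differing bits = {bits}")
```

A result with exactly one differing bit is what a single-bit memory error looks like; a software bug that overwrites code would normally change many bits at once.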

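3. Summary

3.1 Why unstable DDR can be the cause

  1. Memory access errors: if the memory is unstable or faulty (signal interference, power supply problems, hardware defects), some addresses may not be read or written correctly. The kernel then works on corrupted data or instructions and crashes. On a multi-core system several cores may access the affected region at nearly the same time, so the errors appear concurrently.
  2. Concurrency on multi-core systems: with unstable memory, several CPU cores can run into problems together. For example, the same address may read back correctly on one core and incorrectly on another. This can produce kernel panics that look "synchronized".
  3. Memory mapping and virtual memory: memory errors may break virtual address translation and trigger faults the kernel cannot handle, such as "NULL pointer dereference" or other memory aborts (translation faults, permission faults, and so on). These are fairly common when memory is unstable.
  4. Cache coherency: modern processors use caches to improve performance. If memory is unstable, the data held in the caches may become inconsistent, so that several CPU cores hit exceptions at the same time and panic. In multi-core systems that rely on a cache coherency protocol, unstable memory may in particular lead to cache synchronization problems between cores.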

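3.2 Potential symptoms of unstable memory

  • Random crashes that cannot be reproduced: a memory problem does not necessarily occur on every run, so kernel panics may be triggered at random.
  • Several cores crashing at the same time: as in the cases above, multiple cores panic together, which usually indicates a fairly serious memory access failure.
  • Failures that only appear under load: in some cases a memory error does not show up on first use and only appears under high load or specific conditions.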
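The "several cores crashing at the same time" symptom can be checked mechanically on a kernel log. The following is a sketch that assumes the `[timestamp][context]` prefix used by the logs in this article; the function name `faulting_contexts` and the 0.1 s window are arbitrary choices for illustration.

```python
import re

# Matches e.g. "[ 1052.861728][T31830] Unable to handle kernel ..."
FAULT_RE = re.compile(r"\[\s*(\d+\.\d+)\]\[\s*([TC]\d+)\]\s+Unable to handle kernel")


def faulting_contexts(lines, window=0.1):
    """Return the task/CPU tags that reported a fault within `window`
    seconds of the earliest fault report, in order of appearance."""
    hits = []
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            hits.append((float(match.group(1)), match.group(2)))
    if not hits:
        return []
    first = min(timestamp for timestamp, _ in hits)
    seen = []
    for timestamp, context in sorted(hits):
        if timestamp - first <= window and context not in seen:
            seen.append(context)
    return seen


# Two fault reports copied from the log earlier in this article.
sample = [
    "[ 1052.861728][T31830] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0",
    "[ 1052.861735][T31830] Mem abort info:",
    "[ 1052.901222][ T5263] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e0",
]
print(faulting_contexts(sample))
```

More than one tag in the result means that independent contexts faulted within the window, which is the pattern described in the second bullet.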

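3.3 How to verify whether DDR is the root cause

Run a DDR stress test on the affected machine. For the details of such testing, follow the advice of the memory engineers.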
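To illustrate the idea behind such a test, here is a minimal write/read-back sketch. It is not a replacement for the tools recommended by the memory engineers: the function name `pattern_test` and the pattern set are assumptions, and a buffer this small mostly exercises the CPU caches, so a real test needs buffers far larger than the caches and many more passes.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each byte pattern, read it back, and return a list
    of (offset, expected, observed) tuples for every mismatching byte."""
    buf = bytearray(size_bytes)
    mismatches = []
    for pattern in patterns:
        buf[:] = bytes([pattern]) * size_bytes      # write pass
        for offset, value in enumerate(buf):        # read-back pass
            if value != pattern:
                mismatches.append((offset, pattern, value))
    return mismatches


errors = pattern_test(1 << 16)
print(f"{len(errors)} mismatching bytes")
```

On healthy hardware the list is empty. Any entry gives the offset and the expected and observed byte, which can then be XOR-ed to see which bits flipped.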