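1. Problem background
- The issue reproduces almost every time.
- It can be traced to the display panel: with the panel connected, the hang no longer occurs.
2. Problem analysis
2.1 Inspect dmesg_TZ.txt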
Line 2749: [ 1.572743][ T187] gh-watchdog hypervisor:qcom,gh-watchdog: wdog bark_time 20000
Line 2751: [ 1.572773][ T187] gh-watchdog hypervisor:qcom,gh-watchdog: wdog pet_time 15000
Line 2752: [ 1.573753][ T187] gh-watchdog hypervisor:qcom,gh-watchdog: QCOM Apps Watchdog Initialized
Line 9187: [ 110.709447][ C0] gh-watchdog hypervisor:qcom,gh-watchdog: Causing a QCOM Apps Watchdog bite!
Line 9188: [ 110.718164][ C0] gh-watchdog hypervisor:qcom,gh-watchdog: vWdog-CTL: 1, vWdog-time since last pet: 1088, vWdog-expired status: 1
----------begin Watchdog----------
Non-secure Watchdog data
Watchdog enabled
Pet time: 15.0s
Bark time: 20.0s
Watchdog last pet: 107.249543659
Watchdog next pet: 122.249543659
Watchdog next bark: 127.249543659
Watchdog pet timer not expired
The watchdog bit, which is what presents as a hang.
According to the dump, the last pet was at 107.249543659 s and the next bark was due at 127.249543659 s.
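The timestamps in the dump can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming that "next pet" and "next bark" are simply the last pet time plus the configured pet_time and bark_time; the function name `watchdog_deadlines` is illustrative and not part of any tool.

```python
from decimal import Decimal

def watchdog_deadlines(last_pet_s, pet_time_ms, bark_time_ms):
    """Return (next_pet, next_bark) in seconds.

    Assumption: both deadlines are plain offsets from the last pet.
    Decimal keeps the nanosecond digits exact.
    """
    last = Decimal(last_pet_s)
    next_pet = last + Decimal(pet_time_ms) / 1000
    next_bark = last + Decimal(bark_time_ms) / 1000
    return next_pet, next_bark

# Values taken from the log: pet_time 15000 ms, bark_time 20000 ms.
next_pet, next_bark = watchdog_deadlines("107.249543659", 15000, 20000)
print("next pet :", next_pet)
print("next bark:", next_bark)
```

Both results match the dump. Note that the bite is logged at 110.709447 s, which is earlier than this bark deadline and after the soft-lockup panic at 108.111625 s, so the bite appears to follow the panic rather than an expired pet timer.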
[ 108.053406][ C0] watchdog: BUG: soft lockup - CPU#0 stuck for 45s! [kworker/u16:7:253]
[ 108.061627][ C0] CPU#0 Utilization every 4s during lockup:
[ 108.067398][ C0] #1: 1% system, 8% softirq, 93% hardirq, 0% idle
[ 108.074390][ C0] #2: 0% system, 8% softirq, 93% hardirq, 0% idle
[ 108.081376][ C0] #3: 0% system, 8% softirq, 93% hardirq, 0% idle
[ 108.088361][ C0] #4: 1% system, 8% softirq, 93% hardirq, 0% idle
[ 108.095352][ C0] #5: 0% system, 8% softirq, 93% hardirq, 0% idle
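This log indicates:
- CPU0 is almost entirely consumed by hard-interrupt (hardirq) context, about 93% in every 4 s sample, so softirq work and kernel threads get almost no CPU time;
- kworker/u16:7 is an unbound kernel worker thread, and it cannot make progress either;
- the stall has lasted 45 s, far beyond the watchdog's pet/bark thresholds (15 s / 20 s).
Combined with the information above, the most likely explanation is:
A device interrupt (for example QtiBus / IPC / CAN) does not leave its IRQ handler in time. Possible causes:
- Interrupt storm (high interrupt rate): the device keeps raising the interrupt, so the CPU re-enters the handler over and over;
- An infinite loop or a blocking call inside the handler: for example a misused lock, or waiting for a condition that is never met;
- The bottom half (tasklet/softirq) is never triggered or never completes, so the interrupt is never serviced properly and keeps re-firing.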
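To check this pattern in other dumps without reading every line by hand, the utilization samples can be parsed mechanically. This is a rough sketch: the regular expression is written against the log format shown above, and the 90% threshold in `hardirq_bound` is an arbitrary assumption, not a kernel-defined limit.

```python
import re

# Sample lines copied from the soft-lockup report above.
LOG = """\
[ 108.067398][ C0] #1: 1% system, 8% softirq, 93% hardirq, 0% idle
[ 108.074390][ C0] #2: 0% system, 8% softirq, 93% hardirq, 0% idle
[ 108.081376][ C0] #3: 0% system, 8% softirq, 93% hardirq, 0% idle
[ 108.088361][ C0] #4: 1% system, 8% softirq, 93% hardirq, 0% idle
[ 108.095352][ C0] #5: 0% system, 8% softirq, 93% hardirq, 0% idle
"""

SAMPLE_RE = re.compile(
    r"#(\d+):\s+(\d+)% system,\s+(\d+)% softirq,\s+(\d+)% hardirq,\s+(\d+)% idle"
)

def parse_utilization(log):
    """Extract one dict per utilization sample from a soft-lockup report."""
    keys = ("sample", "system", "softirq", "hardirq", "idle")
    return [dict(zip(keys, map(int, m.groups()))) for m in SAMPLE_RE.finditer(log)]

def hardirq_bound(samples, threshold=90):
    """True if every sample spends at least `threshold` percent in hardirq.

    The threshold is an assumption chosen for illustration.
    """
    return bool(samples) and all(s["hardirq"] >= threshold for s in samples)

samples = parse_utilization(LOG)
print(len(samples), "samples, hardirq-bound:", hardirq_bound(samples))
```

A result of True across all samples points at interrupt load rather than a long-running kernel thread, which would show up as a high "system" percentage instead.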
2.2 Inspect the call stack
[ 108.105820][ C0] CPU: 0 PID: 253 Comm: kworker/u16:7 Tainted: G WC OE 6.6.56-android15-8-maybe-dirty-debug #1 1400000003000000474e550054ccef5515b3b24a
[ 108.105836][ C0] Hardware name: Qualcomm Technologies, Inc. Kunzite QRD (DT)
[ 108.105849][ C0] Workqueue: dsi_err_workq dsi_display_handle_fifo_overflow [msm_drm]
[ 108.107156][ C0] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 108.107171][ C0] pc : handle_softirqs+0x130/0x638
[ 108.107191][ C0] lr : handle_softirqs+0x12c/0x638
[ 108.107206][ C0] sp : ffffffc080003f60
[ 108.107217][ C0] x29: ffffffc080003f90 x28: ffffff8034114100 x27: ffffffc0822b1e48
[ 108.107252][ C0] x26: ffffffc0814575f1 x25: ffffffc082494940 x24: 0000000000000000
[ 108.107287][ C0] x23: 0000000000000002 x22: ffffffc0824d8730 x21: ffffffc0822af008
[ 108.107320][ C0] x20: ffffffc0824b6140 x19: ffffff8034114100 x18: ffffffc080005010
[ 108.107354][ C0] x17: 00000000ad6b63b6 x16: 00000000ad6b63b6 x15: 00000000d46ec33a
[ 108.107388][ C0] x14: 00000000c43ac6e8 x13: ffffffc080000000 x12: ffffffc080004000
[ 108.107422][ C0] x11: 9c2b5b8708ce6000 x10: ffffffc08003089c x9 : 9c2b5b8708ce6000
[ 108.107455][ C0] x8 : 0000000100000100 x7 : 0000000000000000 x6 : ffffffc0801f2734
[ 108.107489][ C0] x5 : 0000000000000000 x4 : 0000000000000001 x3 : 0000000000000000
[ 108.107522][ C0] x2 : ffffff8034114100 x1 : ffffffc08003089c x0 : ffffffc08003089c
[ 108.107555][ C0] Call trace:
[ 108.107567][ C0] handle_softirqs+0x130/0x638
[ 108.107582][ C0] ____do_softirq+0x14/0x24
[ 108.107598][ C0] call_on_irq_stack+0x3c/0x70
[ 108.107614][ C0] __irq_exit_rcu+0x6c/0x104
[ 108.107627][ C0] irq_exit_rcu+0x10/0x44
[ 108.107641][ C0] el1_interrupt+0x38/0x58
[ 108.107657][ C0] el1h_64_irq_handler+0x18/0x24
[ 108.107672][ C0] el1h_64_irq+0x68/0x6c
[ 108.107686][ C0] dsi_ctrl_hw_cmn_ctrl_reset+0x8c/0x240 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[ 108.108982][ C0] dsi_ctrl_reset+0x58/0x88 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[ 108.110269][ C0] dsi_display_handle_fifo_overflow+0x160/0x294 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[ 108.111564][ C0] process_one_work+0x260/0x624
[ 108.111580][ C0] worker_thread+0x28c/0x454
[ 108.111594][ C0] kthread+0x118/0x158
[ 108.111609][ C0] ret_from_fork+0x10/0x20
[ 108.111625][ C0] Kernel panic - not syncing: softlockup: hung tasks
[ 108.118187][ C0] CPU: 0 PID: 253 Comm: kworker/u16:7 Tainted: G WC OEL 6.6.56-android15-8-maybe-dirty-debug #1 1400000003000000474e550054ccef5515b3b24a
[ 108.132991][ C0] Hardware name: Qualcomm Technologies, Inc. Kunzite QRD (DT)
[ 108.140321][ C0] Workqueue: dsi_err_workq dsi_display_handle_fifo_overflow [msm_drm]
[ 108.149632][ C0] Call trace:
[ 108.152801][ C0] dump_backtrace+0xf0/0x140
[ 108.157268][ C0] show_stack+0x18/0x28
[ 108.161295][ C0] dump_stack_lvl+0x70/0xa4
[ 108.165677][ C0] panic+0x158/0x3e4
[ 108.169447][ C0] watchdog_timer_fn+0x394/0x494
[ 108.174260][ C0] __hrtimer_run_queues+0x1d8/0x40c
[ 108.179337][ C0] hrtimer_interrupt+0xf4/0x3b8
[ 108.184060][ C0] arch_timer_handler_virt+0x50/0x64
[ 108.189226][ C0] handle_percpu_devid_irq+0x100/0x320
[ 108.194564][ C0] generic_handle_domain_irq+0x48/0x68
[ 108.199903][ C0] gic_handle_irq+0x4c/0x114
[ 108.204368][ C0] do_interrupt_handler+0xa0/0xe8
[ 108.209275][ C0] el1_interrupt+0x34/0x58
[ 108.213573][ C0] el1h_64_irq_handler+0x18/0x24
[ 108.218385][ C0] el1h_64_irq+0x68/0x6c
[ 108.222502][ C0] handle_softirqs+0x130/0x638
[ 108.227147][ C0] ____do_softirq+0x14/0x24
[ 108.231528][ C0] call_on_irq_stack+0x3c/0x70
[ 108.236171][ C0] __irq_exit_rcu+0x6c/0x104
[ 108.240636][ C0] irq_exit_rcu+0x10/0x44
[ 108.244848][ C0] el1_interrupt+0x38/0x58
[ 108.249145][ C0] el1h_64_irq_handler+0x18/0x24
[ 108.253956][ C0] el1h_64_irq+0x68/0x6c
[ 108.258072][ C0] dsi_ctrl_hw_cmn_ctrl_reset+0x8c/0x240 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[ 108.269287][ C0] dsi_ctrl_reset+0x58/0x88 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[ 108.279369][ C0] dsi_display_handle_fifo_overflow+0x160/0x294 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[ 108.291181][ C0] process_one_work+0x260/0x624
[ 108.295913][ C0] worker_thread+0x28c/0x454
[ 108.300379][ C0] kthread+0x118/0x158
[ 108.304332][ C0] ret_from_fork+0x10/0x20
[ 108.308630][ C0] SMP: stopping secondary CPUs
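Anomaly: execution is stuck in dsi_ctrl_hw_cmn_ctrl_reset().
In both backtraces this function is the process-context frame that the interrupt landed on, which suggests it never returns normally: it may be spinning in a loop, waiting on a condition that is never met, or being interrupted so often that it cannot make progress. It is reached from the dsi_err_workq worker through dsi_display_handle_fifo_overflow(), so a stall here also blocks that worker.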
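Picking out the interrupted function from a long backtrace can be automated. The sketch below assumes the arm64 trace layout shown above, where `el1h_64_irq` marks the IRQ entry and the next frame down is the code that was interrupted; the helper names are illustrative.

```python
import re

# First call trace from the log above, truncated after process_one_work.
TRACE = """\
[ 108.107567][ C0] handle_softirqs+0x130/0x638
[ 108.107582][ C0] ____do_softirq+0x14/0x24
[ 108.107598][ C0] call_on_irq_stack+0x3c/0x70
[ 108.107614][ C0] __irq_exit_rcu+0x6c/0x104
[ 108.107627][ C0] irq_exit_rcu+0x10/0x44
[ 108.107641][ C0] el1_interrupt+0x38/0x58
[ 108.107657][ C0] el1h_64_irq_handler+0x18/0x24
[ 108.107672][ C0] el1h_64_irq+0x68/0x6c
[ 108.107686][ C0] dsi_ctrl_hw_cmn_ctrl_reset+0x8c/0x240 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[ 108.108982][ C0] dsi_ctrl_reset+0x58/0x88 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[ 108.110269][ C0] dsi_display_handle_fifo_overflow+0x160/0x294 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[ 108.111564][ C0] process_one_work+0x260/0x624
"""

FRAME_RE = re.compile(r"\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

def parse_frames(trace):
    """Return the function names of a call trace, innermost frame first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(match.group(1))
    return frames

def interrupted_frame(frames, irq_entry="el1h_64_irq"):
    """Return the frame below the outermost IRQ entry, or None if there is none.

    The last occurrence is used so that nested interrupts (an IRQ taken while
    softirqs are running) still resolve to the process-context function.
    """
    positions = [i for i, name in enumerate(frames) if name == irq_entry]
    if not positions or positions[-1] + 1 >= len(frames):
        return None
    return frames[positions[-1] + 1]

print(interrupted_frame(parse_frames(TRACE)))
```

Running the same helper on the second (panic) backtrace gives the same function, even though that trace contains two nested IRQ entries.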
2.3 Inspect interrupts
272 0x73 0xff 1165940 0 0 0 0 0 0 0 msm_drm GICv3 v.v (struct irq_desc *)0xffffff805024b400
273 0x4 0xff 1165212 0 0 0 0 0 0 0 dsi_ctrl sde v.v (struct irq_desc *)0xffffff804fea6800
IRQ usage statistics
| IRQ | Interrupt no. (hex) | Trigger count | Device name | Controller | irq_desc address |
|---|---|---|---|---|---|
| 272 | 0x73 | 1165940 | msm_drm | GICv3 | 0xffffff805024b400 |
| 273 | 0x4 | 1165212 | dsi_ctrl | sde | 0xffffff804fea6800 |
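✅ These numbers also support the conclusion that the system is in an interrupt storm:
- the msm_drm and dsi_ctrl interrupts have each fired more than a million times, which is abnormally high;
- the two counts differ by only a few hundred, which suggests cascaded interrupts within the same display subsystem;
- msm_drm is routed through the GICv3 interrupt controller, so the hardware source is well defined: these appear to be real hardware interrupts firing continuously, not spurious interrupts.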
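The counts can be turned into a rough rate to make "abnormally high" concrete. This sketch assumes the counters started at boot and that the dump was captured around the panic at roughly 108 s; that uptime figure is an estimate taken from the log timestamps, not a value from the IRQ table.

```python
# Trigger counts from the IRQ table above.
counts = {"msm_drm": 1165940, "dsi_ctrl": 1165212}

# Assumption: counters accumulate from boot and the panic happened at ~108 s.
uptime_s = 108.0

diff = counts["msm_drm"] - counts["dsi_ctrl"]
ratio = counts["dsi_ctrl"] / counts["msm_drm"]
rate_per_s = counts["msm_drm"] / uptime_s

print(f"difference : {diff}")
print(f"dsi/msm    : {ratio:.4%}")
print(f"average    : {rate_per_s:.0f} interrupts/s")
```

Under these assumptions the average is on the order of ten thousand interrupts per second, and almost every msm_drm interrupt has a matching dsi_ctrl interrupt, which fits the cascade interpretation. The true peak rate is likely higher, since the storm probably did not start at boot.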
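✅ Interrupt not masked + improper handling = endless loop
- These interrupts very likely lead to the calls to dsi_display_handle_fifo_overflow();
- If one of them is never effectively disabled or acknowledged and keeps invoking its handler, the current CPU (CPU0 here) stays occupied by interrupt handling, cannot return to the scheduler, and ends up in a soft lockup.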
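The mechanism can be illustrated with a toy model. This is not driver code: it is a deliberately simplified simulation, assuming a level-triggered status bit (hypothetical name `status_pending`) that keeps the interrupt asserted until the handler clears it, and a CPU that always services a pending interrupt before running the worker.

```python
def simulate(ticks, handler_clears_status):
    """Toy model: each tick the CPU runs either the IRQ handler or the worker.

    status_pending stands in for a device status bit (for example a FIFO
    overflow flag) that re-raises the interrupt for as long as it is set.
    """
    status_pending = True
    irq_ticks = 0
    worker_ticks = 0
    for _ in range(ticks):
        if status_pending:
            irq_ticks += 1                 # interrupt wins over the worker
            if handler_clears_status:
                status_pending = False     # handler acks/clears the source
        else:
            worker_ticks += 1              # worker finally gets the CPU
    return irq_ticks, worker_ticks

stuck = simulate(1000, handler_clears_status=False)
healthy = simulate(1000, handler_clears_status=True)
print("status never cleared:", stuck)
print("status cleared      :", healthy)
```

When the status is never cleared, the worker gets zero ticks, which is the same shape as the 93% hardirq / 0% idle samples in the soft-lockup report.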
2.4 Find the hardware interrupt numbers
Use T32 to look up the hw_irq that corresponds to each of these two interrupts.
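As a cross-check on a device that stays up (for example with the panel connected), the same mapping can usually be read from sysfs instead of a ramdump. This is a sketch under assumptions: it relies on the kernel exposing `/sys/kernel/irq/<irq>/hwirq`, and the Linux IRQ numbers 272 and 273 may be different on another boot or build, so they should be confirmed against /proc/interrupts first.

```python
from pathlib import Path

def read_hwirq(linux_irq, root="/sys/kernel/irq"):
    """Return the hardware IRQ number for a Linux IRQ, or None if unavailable.

    `root` is a parameter so the lookup can be pointed at a copied sysfs tree.
    """
    path = Path(root) / str(linux_irq) / "hwirq"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

# IRQ numbers from the table above; they are assumptions on any other boot.
for irq in (272, 273):
    print(irq, read_hwirq(irq))
```

A result of None means the entry is missing or unreadable on that system, in which case the T32 lookup on the ramdump remains the reference.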