I. Problem Background

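  1. The issue reproduces almost every time.
  2. The problem can be traced to the display: with the display panel plugged in, the hang no longer occurs.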

II. Problem Analysis

2.1 Inspect dmesg_TZ.txt

Line 2749: [    1.572743][  T187] gh-watchdog hypervisor:qcom,gh-watchdog: wdog bark_time 20000
Line 2751: [    1.572773][  T187] gh-watchdog hypervisor:qcom,gh-watchdog: wdog pet_time 15000
Line 2752: [    1.573753][  T187] gh-watchdog hypervisor:qcom,gh-watchdog: QCOM Apps Watchdog Initialized
Line 9187: [  110.709447][    C0] gh-watchdog hypervisor:qcom,gh-watchdog: Causing a QCOM Apps Watchdog bite!
Line 9188: [  110.718164][    C0] gh-watchdog hypervisor:qcom,gh-watchdog: vWdog-CTL: 1, vWdog-time since last pet: 1088, vWdog-expired status: 1

----------begin Watchdog----------
Non-secure Watchdog data
Watchdog enabled
Pet time: 15.0s
Bark time: 20.0s
Watchdog last pet: 107.249543659
Watchdog next pet: 122.249543659
Watchdog next bark: 127.249543659
Watchdog pet timer not expired

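The watchdog was triggered, which is what shows up as the system hang.

The log shows that the last pet happened at 107.249543659 s and that the next bark was due at 127.249543659 s.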
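
The timestamps in the watchdog dump follow directly from the configured pet and bark times. A minimal sketch of that arithmetic, using only values quoted from the logs above (the variable names are illustrative):

```python
from decimal import Decimal

# Values copied from the watchdog dump and dmesg above (seconds).
last_pet = Decimal("107.249543659")
pet_time = Decimal("15.0")
bark_time = Decimal("20.0")
bite_msg = Decimal("110.709447")  # "Causing a QCOM Apps Watchdog bite!"

next_pet = last_pet + pet_time    # when the next pet is due
next_bark = last_pet + bark_time  # when the bark fires if no pet arrives

print(next_pet)              # 122.249543659
print(next_bark)             # 127.249543659
print(bite_msg < next_bark)  # True
```

The last line shows that the bite message was logged before the bark deadline, and the dump itself reports "Watchdog pet timer not expired". One possible reading is that the bite was forced by the panic path after the soft lockup rather than by a missed pet; this is an inference from the timestamps, not something the log states explicitly.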

[  108.053406][    C0] watchdog: BUG: soft lockup - CPU#0 stuck for 45s! [kworker/u16:7:253]
[  108.061627][    C0] CPU#0 Utilization every 4s during lockup:
[  108.067398][    C0] 	#1:   1% system,	  8% softirq,	 93% hardirq,	  0% idle
[  108.074390][    C0] 	#2:   0% system,	  8% softirq,	 93% hardirq,	  0% idle
[  108.081376][    C0] 	#3:   0% system,	  8% softirq,	 93% hardirq,	  0% idle
[  108.088361][    C0] 	#4:   1% system,	  8% softirq,	 93% hardirq,	  0% idle
[  108.095352][    C0] 	#5:   0% system,	  8% softirq,	 93% hardirq,	  0% idle

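This log indicates the following:

  • CPU0 is dominated by hard interrupt (hardirq) context, so softirqs and kernel threads cannot be scheduled; hardirq handling takes up about 93% of CPU0 in every sample;
  • kworker/u16:7 is an unbound kernel worker thread, and it cannot run either;
  • the stall lasts 45 s, far beyond the watchdog's configured pet (15 s) and bark (20 s) thresholds.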
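
To make the hardirq share explicit, the utilization samples can be parsed out of the soft lockup report. The sketch below is only an illustration: the regular expression and the function name `parse_utilization` are assumptions, and the input is the five sample lines copied from the log above.

```python
import re

# Utilization samples copied from the soft lockup report above.
LOG = """\
[  108.067398][    C0]  #1:   1% system,   8% softirq,  93% hardirq,   0% idle
[  108.074390][    C0]  #2:   0% system,   8% softirq,  93% hardirq,   0% idle
[  108.081376][    C0]  #3:   0% system,   8% softirq,  93% hardirq,   0% idle
[  108.088361][    C0]  #4:   1% system,   8% softirq,  93% hardirq,   0% idle
[  108.095352][    C0]  #5:   0% system,   8% softirq,  93% hardirq,   0% idle
"""

PATTERN = re.compile(
    r"#(\d+):\s+(\d+)% system,\s+(\d+)% softirq,\s+(\d+)% hardirq,\s+(\d+)% idle"
)

def parse_utilization(text):
    """Return one dict per '#N:' sample line found in text."""
    keys = ("sample", "system", "softirq", "hardirq", "idle")
    return [dict(zip(keys, map(int, m.groups()))) for m in PATTERN.finditer(text)]

samples = parse_utilization(LOG)
avg_hardirq = sum(s["hardirq"] for s in samples) / len(samples)
window_s = len(samples) * 4  # the report says "every 4s"

print(len(samples), avg_hardirq, window_s)  # 5 93.0 20
```

Across the 20 s covered by the five samples, CPU0 reports 0% idle and a constant 93% in hardirq, which matches the reading given in the list above.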

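Combined with the earlier information, the most likely explanation is:

A device interrupt (for example QtiBus / IPC / CAN) does not leave its IRQ handler in time. Possible causes:

  1. Interrupt storm (high interrupt rate): the device keeps raising interrupts, so the CPU re-enters interrupt handling over and over;
  2. An infinite loop or blocking in the handler logic: for example, misuse of a lock, or waiting on a condition that is never satisfied;
  3. The bottom half (tasklet/softirq) is not triggered or cannot complete, so the interrupt is never effectively serviced and keeps re-entering.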
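
One way to check the interrupt storm hypothesis on a device that is still responsive is to diff two snapshots of /proc/interrupts and look for IRQs with an unusually high rate. The sketch below assumes the usual /proc/interrupts layout (a CPU header row followed by one row per IRQ). The helper names, the threshold, and the two sample snapshots are hypothetical and are not taken from this device.

```python
def parse_proc_interrupts(text):
    """Map IRQ label -> total count across CPUs for /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Only the first ncpu columns are per-CPU counters.
        totals[label.strip()] = sum(int(f) for f in fields[:ncpu] if f.isdigit())
    return totals

def storm_candidates(before, after, interval_s, threshold_per_s):
    """Return {label: rate} for IRQs whose rate exceeds threshold_per_s."""
    rates = {}
    for label, count in after.items():
        rate = (count - before.get(label, 0)) / interval_s
        if rate > threshold_per_s:
            rates[label] = rate
    return rates

# Hypothetical snapshots taken 1 s apart; every number here is made up.
SNAP_A = """\
           CPU0       CPU1
 10:       1000          0     GICv3  40 Level     dev_a
 11:        500        480     GICv3  41 Level     dev_b
"""
SNAP_B = """\
           CPU0       CPU1
 10:      12000          0     GICv3  40 Level     dev_a
 11:        510        490     GICv3  41 Level     dev_b
"""

result = storm_candidates(
    parse_proc_interrupts(SNAP_A), parse_proc_interrupts(SNAP_B),
    interval_s=1.0, threshold_per_s=5000,
)
print(result)  # {'10': 11000.0}
```

The threshold of 5000 interrupts per second is arbitrary; a sensible value depends on the device and on what rate is normal for each interrupt source.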

2.2 Inspect the stack

[  108.105820][    C0] CPU: 0 PID: 253 Comm: kworker/u16:7 Tainted: G        WC OE      6.6.56-android15-8-maybe-dirty-debug #1 1400000003000000474e550054ccef5515b3b24a
[  108.105836][    C0] Hardware name: Qualcomm Technologies, Inc. Kunzite QRD (DT)
[  108.105849][    C0] Workqueue: dsi_err_workq dsi_display_handle_fifo_overflow [msm_drm]
[  108.107156][    C0] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  108.107171][    C0] pc : handle_softirqs+0x130/0x638
[  108.107191][    C0] lr : handle_softirqs+0x12c/0x638
[  108.107206][    C0] sp : ffffffc080003f60
[  108.107217][    C0] x29: ffffffc080003f90 x28: ffffff8034114100 x27: ffffffc0822b1e48
[  108.107252][    C0] x26: ffffffc0814575f1 x25: ffffffc082494940 x24: 0000000000000000
[  108.107287][    C0] x23: 0000000000000002 x22: ffffffc0824d8730 x21: ffffffc0822af008
[  108.107320][    C0] x20: ffffffc0824b6140 x19: ffffff8034114100 x18: ffffffc080005010
[  108.107354][    C0] x17: 00000000ad6b63b6 x16: 00000000ad6b63b6 x15: 00000000d46ec33a
[  108.107388][    C0] x14: 00000000c43ac6e8 x13: ffffffc080000000 x12: ffffffc080004000
[  108.107422][    C0] x11: 9c2b5b8708ce6000 x10: ffffffc08003089c x9 : 9c2b5b8708ce6000
[  108.107455][    C0] x8 : 0000000100000100 x7 : 0000000000000000 x6 : ffffffc0801f2734
[  108.107489][    C0] x5 : 0000000000000000 x4 : 0000000000000001 x3 : 0000000000000000
[  108.107522][    C0] x2 : ffffff8034114100 x1 : ffffffc08003089c x0 : ffffffc08003089c
[  108.107555][    C0] Call trace:
[  108.107567][    C0]  handle_softirqs+0x130/0x638
[  108.107582][    C0]  ____do_softirq+0x14/0x24
[  108.107598][    C0]  call_on_irq_stack+0x3c/0x70
[  108.107614][    C0]  __irq_exit_rcu+0x6c/0x104
[  108.107627][    C0]  irq_exit_rcu+0x10/0x44
[  108.107641][    C0]  el1_interrupt+0x38/0x58
[  108.107657][    C0]  el1h_64_irq_handler+0x18/0x24
[  108.107672][    C0]  el1h_64_irq+0x68/0x6c
[  108.107686][    C0]  dsi_ctrl_hw_cmn_ctrl_reset+0x8c/0x240 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[  108.108982][    C0]  dsi_ctrl_reset+0x58/0x88 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[  108.110269][    C0]  dsi_display_handle_fifo_overflow+0x160/0x294 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[  108.111564][    C0]  process_one_work+0x260/0x624
[  108.111580][    C0]  worker_thread+0x28c/0x454
[  108.111594][    C0]  kthread+0x118/0x158
[  108.111609][    C0]  ret_from_fork+0x10/0x20
[  108.111625][    C0] Kernel panic - not syncing: softlockup: hung tasks
[  108.118187][    C0] CPU: 0 PID: 253 Comm: kworker/u16:7 Tainted: G        WC OEL     6.6.56-android15-8-maybe-dirty-debug #1 1400000003000000474e550054ccef5515b3b24a
[  108.132991][    C0] Hardware name: Qualcomm Technologies, Inc. Kunzite QRD (DT)
[  108.140321][    C0] Workqueue: dsi_err_workq dsi_display_handle_fifo_overflow [msm_drm]
[  108.149632][    C0] Call trace:
[  108.152801][    C0]  dump_backtrace+0xf0/0x140
[  108.157268][    C0]  show_stack+0x18/0x28
[  108.161295][    C0]  dump_stack_lvl+0x70/0xa4
[  108.165677][    C0]  panic+0x158/0x3e4
[  108.169447][    C0]  watchdog_timer_fn+0x394/0x494
[  108.174260][    C0]  __hrtimer_run_queues+0x1d8/0x40c
[  108.179337][    C0]  hrtimer_interrupt+0xf4/0x3b8
[  108.184060][    C0]  arch_timer_handler_virt+0x50/0x64
[  108.189226][    C0]  handle_percpu_devid_irq+0x100/0x320
[  108.194564][    C0]  generic_handle_domain_irq+0x48/0x68
[  108.199903][    C0]  gic_handle_irq+0x4c/0x114
[  108.204368][    C0]  do_interrupt_handler+0xa0/0xe8
[  108.209275][    C0]  el1_interrupt+0x34/0x58
[  108.213573][    C0]  el1h_64_irq_handler+0x18/0x24
[  108.218385][    C0]  el1h_64_irq+0x68/0x6c
[  108.222502][    C0]  handle_softirqs+0x130/0x638
[  108.227147][    C0]  ____do_softirq+0x14/0x24
[  108.231528][    C0]  call_on_irq_stack+0x3c/0x70
[  108.236171][    C0]  __irq_exit_rcu+0x6c/0x104
[  108.240636][    C0]  irq_exit_rcu+0x10/0x44
[  108.244848][    C0]  el1_interrupt+0x38/0x58
[  108.249145][    C0]  el1h_64_irq_handler+0x18/0x24
[  108.253956][    C0]  el1h_64_irq+0x68/0x6c
[  108.258072][    C0]  dsi_ctrl_hw_cmn_ctrl_reset+0x8c/0x240 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[  108.269287][    C0]  dsi_ctrl_reset+0x58/0x88 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[  108.279369][    C0]  dsi_display_handle_fifo_overflow+0x160/0x294 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[  108.291181][    C0]  process_one_work+0x260/0x624
[  108.295913][    C0]  worker_thread+0x28c/0x454
[  108.300379][    C0]  kthread+0x118/0x158
[  108.304332][    C0]  ret_from_fork+0x10/0x20
[  108.308630][    C0] SMP: stopping secondary CPUs

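Anomaly: dsi_ctrl_hw_cmn_ctrl_reset() is stuck

This function appears in both backtraces as the frame that the IRQ interrupted (directly below the outermost el1h_64_irq), which indicates that it did not return normally. It may be stuck in an infinite loop, waiting incorrectly on a condition, or performing a blocking operation in interrupt context (which is very dangerous).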
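
The interrupted frame can be located mechanically: in an arm64 call trace, the function listed right after the outermost el1h_64_irq entry is the code that was running when the interrupt arrived. The sketch below applies that rule to the first ten frames of the first backtrace above; the regular expression and the helper names are illustrative assumptions.

```python
import re

# First ten frames of the first call trace above, copied verbatim.
TRACE = """\
[  108.107567][    C0]  handle_softirqs+0x130/0x638
[  108.107582][    C0]  ____do_softirq+0x14/0x24
[  108.107598][    C0]  call_on_irq_stack+0x3c/0x70
[  108.107614][    C0]  __irq_exit_rcu+0x6c/0x104
[  108.107627][    C0]  irq_exit_rcu+0x10/0x44
[  108.107641][    C0]  el1_interrupt+0x38/0x58
[  108.107657][    C0]  el1h_64_irq_handler+0x18/0x24
[  108.107672][    C0]  el1h_64_irq+0x68/0x6c
[  108.107686][    C0]  dsi_ctrl_hw_cmn_ctrl_reset+0x8c/0x240 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
[  108.108982][    C0]  dsi_ctrl_reset+0x58/0x88 [msm_drm 1400000003000000474e5500eea6e9e76729ba32]
"""

# Matches "symbol+0xOFFSET/0xSIZE" following the "[ time][ cpu]" prefix.
FRAME = re.compile(r"\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

def frames(trace):
    """Return function names in call-trace order (innermost first)."""
    return [m.group(1) for line in trace.splitlines() if (m := FRAME.search(line))]

def interrupted_frame(trace, irq_entry="el1h_64_irq"):
    """Return the frame listed right after the outermost irq_entry."""
    names = frames(trace)
    last = max(i for i, name in enumerate(names) if name == irq_entry)
    return names[last + 1]

print(interrupted_frame(TRACE))  # dsi_ctrl_hw_cmn_ctrl_reset
```

Using the outermost (last listed) el1h_64_irq matters for the second backtrace, where the timer interrupt that ran the soft lockup detector adds a second, nested IRQ entry above the first one.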

2.3 Inspect interrupts

272 0x73         0xff       1165940    0          0          0          0          0          0          0           msm_drm                        GICv3           v.v (struct irq_desc *)0xffffff805024b400  
 273 0x4          0xff       1165212    0          0          0          0          0          0          0           dsi_ctrl                       sde             v.v (struct irq_desc *)0xffffff804fea6800

IRQ usage statistics

IRQ   Number (hex)   Trigger count   Device name   Controller   irq_desc address
272   0x73           1165940         msm_drm       GICv3        0xffffff805024b400
273   0x4            1165212         dsi_ctrl      sde          0xffffff804fea6800

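✅ This data also supports the conclusion that the system is in an interrupt storm

  • The interrupt counts for the msm_drm and dsi_ctrl modules are abnormally high (over one million each);
  • The two counts differ by only a few hundred (728), which suggests they may be cascaded interrupts within the same display subsystem;
  • The msm_drm interrupt is attached to the GICv3 interrupt controller, so the hardware source is well defined: this is a real hardware interrupt firing continuously, not a spurious interrupt.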
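
The figures in the list above can be checked with a few lines of arithmetic. The counts are copied from the IRQ dump; the uptime of roughly 108 s is an assumption taken from the soft lockup timestamp, so the resulting rate is only a rough average.

```python
# Counts copied from the IRQ dump above.
msm_drm_count = 1165940   # IRQ 272
dsi_ctrl_count = 1165212  # IRQ 273

# Assumption: counters started at boot and the dump reflects about 108 s
# of uptime (soft lockup timestamp); the real capture time may differ.
uptime_s = 108

diff = msm_drm_count - dsi_ctrl_count
avg_rate = dsi_ctrl_count / uptime_s

print(diff)             # 728
print(round(avg_rate))  # 10789
```

An average of roughly ten thousand interrupts per second over the whole uptime is a lower bound on the peak rate: if the interrupts were concentrated in the lockup window, the rate during that window would be higher.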

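✅ Unmasked interrupt + improper handling = infinite loop

  • These interrupts very likely trigger the calls to dsi_display_handle_fifo_overflow();
  • If one of them is never effectively disabled or acknowledged and keeps invoking its handler, the current CPU (CPU0 here) stays occupied by the interrupt handler and cannot return to the scheduler, resulting in a soft lockup.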

2.4 Find the hardware IRQ numbers

Use T32 to look up the hw_irq that corresponds to each of these two interrupts.

III. Root Cause