1. Problem Symptom
Device freeze (system hang followed by a watchdog reset).
2. Problem Analysis
2.1 Initial classification of the issue
Inspect dmesg_TZ.txt:
[161726.646617][ C0] gh-watchdog hypervisor:qcom,gh-watchdog: QCOM Apps Watchdog bark! Now = 161726.646605
[161726.646652][ C0] gh-watchdog hypervisor:qcom,gh-watchdog: QCOM Apps Watchdog last pet at 161706.646098
[161726.646663][ C0] gh-watchdog hypervisor:qcom,gh-watchdog: cpu alive mask from last pet ff

From the offline-parsed dump logs we can immediately tell that a watchdog-triggered crash occurred:
last pet : 161706 s
bark time: 161726 s
No pet occurred for the full 20 s window.
2.2 Examining the watchdog state
----------begin Watchdog----------
Non-secure Watchdog data
Pet time: 15.0s
Bark time: 20.0s
Watchdog last pet: 161706.000
Watchdog pet timer not expired
Current jiffies crossed pet_timer expires jiffies
CPU online bits: 11111111
CPU runqueue online bits: 11111111
CPU isolated bits: 00000000
CPU dying bits: 00000000
pet_timer_flags: 0x26800003
pet_timer_expires: 4335322726
Current jiffies : 4335324028
Timestamp of last timer interrupt(last_jiffies_update): 161726.924
tick_do_timer_cpu: 3
CPU logical map: [0, 1, 2, 3, 4, 5, 6, 7]
tick_broadcast_oneshot_mask: 00000000
tick_broadcast_pending_mask: 00000000
tick_broadcast_force_mask: 00000000
tick_broadcast_device cpumask: 11111111
tick_broadcast_device next_event: 9223372036.854776
CPU0 tick_device next_event: 161726.720000
CPU1 tick_device next_event: 161726.928000
CPU2 tick_device next_event: 161726.928000
CPU3 tick_device next_event: 161726.928000
CPU4 tick_device next_event: 161726.940000
CPU5 tick_device next_event: 161726.940000
CPU6 tick_device next_event: 161726.940000
CPU7 tick_device next_event: 161726.940000
CPU#0 : ping_start: 161706.646000 : ping_end: 161706.646013
CPU#1 : ping_start: 161706.646013 : ping_end: 161706.646044
CPU#2 : ping_start: 161706.646044 : ping_end: 161706.646053
CPU#3 : ping_start: 161706.646053 : ping_end: 161706.646053
CPU#4 : ping_start: 161706.646054 : ping_end: 161706.646063
CPU#5 : ping_start: 161706.646064 : ping_end: 161706.646070
CPU#6 : ping_start: 161706.646071 : ping_end: 161706.646080
CPU#7 : ping_start: 161706.646080 : ping_end: 161706.646088
timer_fired : 161706.645948
thread_start : 161706.645999
------------end Watchdog----------

Key fields:
Pet time = 15 s: the system schedules a pet (pet_timer) every 15 seconds.
Bark time = 20 s: if no pet arrives within 20 seconds, the watchdog "barks" and ultimately "bites".
last pet = 161706.000: the last real pet was recorded at kernel timestamp 161706.000 s.
jiffies & expires: pet_timer_expires = 4335322726 is the jiffies value at which the next pet was scheduled to fire; Current jiffies = 4335324028. Since 4335324028 > 4335322726, the pet_timer that should have fired by now was "missed".
ping records:
CPU#0 : ping_start: 161706.646000 → ping_end: 161706.646013
CPU#1 : ping_start: 161706.646013 → ping_end: 161706.646044
…
CPU#7 : ping_start: 161706.646080 → ping_end: 161706.646088
Every CPU's ping_start was recorded at the 161706 s pet, i.e. at the last successful pet.
Two of the fields above, jiffies & expires and the ping records, deserve a closer explanation.
pet_timer is registered on the kernel timer list like any other timer (hrtimer or jiffies timer). "expires < current" above means the timer callback should have run at jiffies = 4335322726, yet at the current jiffies of 4335324028 it still had not executed.
This is typically seen when:
interrupts are disabled globally for a long time (local_irq_disable / an uninterruptible section), or
the kernel sits in some critical section, e.g. heavy spinlock holding, or softirqs/tasklets being blocked indefinitely.
Each time the pet_timer expires, the watchdog issues a ping on every CPU and records ping_start and ping_end.
These two timestamps therefore reflect the moment the timer callback last executed successfully.
From the dump above, the pet_timer should have fired at jiffies = 4335322726, but because the timer path was blocked (missed tick scheduling, masked interrupts, or a nohz issue), it never ran; hence no new pings were sent and no newer ping_start timestamps exist.
2.3 Locating the watchdog pet_timer
Inspect timerlist.txt:
--------------------------------------------------------------------------------------------------------------------------------
CPU 3(tvec_base: ffffff89fd4d4780 timer_jiffies: 4335322329(161720.136s) next_timer: 4335322384(161720.356s) active_timers: NA)
--------------------------------------------------------------------------------------------------------------------------------
+ vectors Timers (75)
INDEX TIMER_LIST_ADDR EXPIRES EXPIRES(s) FUNCTION WORK REMARKS
68 ffffff87b31498a8 4335322654 161721.436s tcp_write_timer
98 ffffff8799dee8a8 4335322378 161720.332s tcp_write_timer
98 ffffff87f2012ca8 4335322378 161720.332s tcp_write_timer
98 ffffff87f20154a8 4335322378 161720.332s tcp_write_timer
98 ffffff87b3fcdea8 4335322378 161720.332s tcp_write_timer
98 ffffff87b1515ea8 4335322378 161720.332s tcp_write_timer
98 ffffff87f2015ea8 4335322378 161720.332s tcp_write_timer
98 ffffff80394818a8 4335322378 161720.332s tcp_write_timer
98 ffffff87b3fca2a8 4335322378 161720.332s tcp_write_timer
105 ffffff80578836a8 4335322436 161720.564s tcp_write_timer
105 ffffff87c43140a8 4335322436 161720.564s tcp_write_timer
105 ffffff87c43168a8 4335322436 161720.564s tcp_write_timer
106 ffffff87b294f2a8 4335322444 161720.596s tcp_write_timer
111 ffffff87c43104a8 4335322484 161720.756s tcp_write_timer
112 ffffff8057884aa8 4335322493 161720.792s tcp_write_timer
113 ffffff87c43136a8 4335322500 161720.82s tcp_write_timer
114 ffffff87b2948ea8 4335322508 161720.852s tcp_write_timer
117 ffffff80578868a8 4335322536 161720.964s tcp_write_timer
122 ffffff80578840a8 4335322575 161721.12s tcp_write_timer
122 ffffff80578804a8 4335322575 161721.12s tcp_write_timer
122 ffffff879120f2a8 4335322575 161721.12s tcp_write_timer
133 ffffff87da92ea18 4335325478 161732.732s tcp_keepalive_timer
153 ffffff87b3cc18a8 4335322654 161721.436s tcp_write_timer
153 ffffff87ba2484a8 4335322628 161721.332s tcp_write_timer
153 ffffff87ba24b6a8 4335322628 161721.332s tcp_write_timer
153 ffffff87ced40ea8 4335322628 161721.332s tcp_write_timer
153 ffffff87f20118a8 4335322628 161721.332s tcp_write_timer
153 ffffff87b29498a8 4335322628 161721.332s tcp_write_timer
154 ffffff87b3fc8ea8 4335322693 161721.592s tcp_write_timer
154 ffffff87b3cc22a8 4335322693 161721.592s tcp_write_timer
154 ffffff8027bf2960 4335322726 161721.724s qcom_wdt_pet_task_wakeup[qcom_wdt_core]
155 ffffff80394872a8 4335322815 161722.08s tcp_write_timer
155 ffffff8039480ea8 4335322777 161721.928s tcp_write_timer
156 ffffff87b333caa8 4335322855 161722.24s tcp_write_timer
156 ffffff8054fec0a8 4335322866 161722.284s tcp_write_timer
157 ffffff80480c4aa8 4335322932 161722.548s tcp_write_timer
157 ffffff8022e72ca8 4335322893 161722.392s tcp_write_timer
158 ffffff8056ab84a8 4335323003 161722.832s tcp_write_timer
158 ffffff80480c04a8 4335322970 161722.7s tcp_write_timer
158 ffffff87f20136a8 4335322985 161722.76s tcp_write_timer
159 ffffff805f37d4a8 4335323039 161722.976s tcp_write_timer
160 ffffff8036a822a8 4335323115 161723.28s tcp_write_timer
160 ffffff803680aca8 4335323075 161723.12s tcp_write_timer
161 ffffff80480c68a8 4335323156 161723.444s tcp_write_timer
162 ffffff80480c18a8 4335323261 161723.864s tcp_write_timer
162 ffffff80480c2ca8 4335323210 161723.66s tcp_write_timer
163 ffffff8022e736a8 4335323315 161724.08s tcp_write_timer
164 ffffff80480c40a8 4335323365 161724.28s tcp_write_timer
165 ffffff87812440a8 4335323455 161724.64s tcp_write_timer
165 ffffff87b3cc72a8 4335323417 161724.488s tcp_write_timer
166 ffffff87b16adea8 4335323494 161724.796s tcp_write_timer
167 ffffff805c8ed4a8 4335323556 161725.044s tcp_write_timer
167 ffffff87b13b84a8 4335323528 161724.932s tcp_write_timer
168 ffffff87b14872a8 4335323620 161725.3s tcp_write_timer
168 ffffff87b3fce8a8 4335323588 161725.172s tcp_write_timer
169 ffffff87b3fcaca8 4335323709 161725.656s tcp_write_timer
169 ffffff80475be8a8 4335323660 161725.46s tcp_write_timer
170 ffffff87b3fc84a8 4335323753 161725.832s tcp_write_timer
171 ffffff87b3cc36a8 4335323840 161726.18s tcp_write_timer
171 ffffff87814884a8 4335323792 161725.988s tcp_write_timer
172 ffffff87815154a8 4335323890 161726.38s tcp_write_timer
173 ffffff878148b6a8 4335323931 161726.544s tcp_write_timer
174 ffffff805eec36a8 4335323974 161726.716s tcp_write_timer
176 ffffff8039482ca8 4335324158 161727.452s tcp_write_timer
177 ffffff805c8ecaa8 4335324190 161727.58s tcp_write_timer
178 ffffff805c8e8ea8 4335324256 161727.844s tcp_write_timer
178 ffffff8799deb6a8 4335324266 161727.884s tcp_write_timer
179 ffffff80417422a8 4335324342 161728.188s tcp_write_timer
179 ffffff8047a5d4a8 4335324301 161728.024s tcp_write_timer
180 ffffff891f0c84a8 4335324381 161728.344s tcp_write_timer
181 ffffff87f2010ea8 4335324473 161728.712s tcp_write_timer
239 ffffffc08231a9d0 4335328588 161745.172s delayed_work_timer_fn work_fn
239 ffffffc087eebd38 4335328408 161744.452s process_timeout
239 ffffffc083f33d48 4335328324 161744.116s process_timeout
254 ffffffc082327aa0 4335336088 161775.172s delayed_work_timer_fn gc_worker

The watchdog's pet_timer sits on CPU 3, still pending and waiting to run, with EXPIRES = 161721.724 s. Had it fired at 161721 s as scheduled there would have been no problem, but system time reached 161726 s without it ever firing.
CPU 0's timer_jiffies had already advanced to 4335323981 (161726.744 s), while CPU 3's timer_jiffies was still stuck at 4335322329 (161720.136 s).
CPU 3's timer wheel therefore lagged by roughly 6.6 seconds, so it never got the chance to reach jiffies 4335322726 (161721.724 s) and fire qcom_wdt_pet_task_wakeup; the watchdog declared a timeout: bark -> bite.
Note also the large number of tcp_write_timer entries queued ahead of qcom_wdt_pet_task_wakeup on CPU 3.
2.4 Examining CPU 3's kernel stack
Parsing debug information for MSM_DUMP_DATA_CPU_CTX. Version: 32 Magic: 42444832 Source: vm_3_vcpu_300
Parsing CPU3 affinity 300 context start 802e6920 end 802e6db0 version 32 client_id-> 0
cpu_type = MSM_DUMP_CPU_TYPE_AARCH64
ctx_type = MSM_DUMP_CTX_TYPE_NS_VCPU_EL10_CTX
cpu_index = 3
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_GPRS
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_NEON
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_SYSREGS_EL0
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_EL1
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_VM_EL2
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_CNTV_EL10
Core vcpu3 PC: dev_queue_xmit_nit+8c <ffffffc080d6e8f4>
Core vcpu3 LR: dev_queue_xmit_nit+150 <ffffffc080d6e9b8>
[<ffffffc080d6e8f4>] dev_queue_xmit_nit+0x8c
[<ffffffc080d70f40>] dev_hard_start_xmit+0x90
[<ffffffc080dd5e18>] sch_direct_xmit+0xf0
[<ffffffc080dd64c4>] __qdisc_run+0x3cc
[<ffffffc080d7df64>] qdisc_run+0x48
[<ffffffc080d7204c>] __dev_queue_xmit+0x3e0
[<ffffffc080e76d84>] ip_finish_output2+0x48c
[<ffffffc080e76760>] __ip_finish_output+0x194
[<ffffffc080e73300>] ip_finish_output+0x20
[<ffffffc080e73440>] ip_output+0x50
[<ffffffc080e73914>] __ip_queue_xmit+0x414
[<ffffffc080e7397c>] ip_queue_xmit+0x14
[<ffffffc080e9db4c>] __tcp_transmit_skb+0x814
[<ffffffc080e99fe0>] __tcp_retransmit_skb+0x2cc
[<ffffffc080e9aa38>] tcp_retransmit_skb+0x1c
[<ffffffc080e9f868>] tcp_retransmit_timer+0x58c
[<ffffffc080e9fffc>] tcp_write_timer_handler+0x94
[<ffffffc080ea02c8>] tcp_write_timer+0x78
[<ffffffc080188a58>] call_timer_fn+0x4c
[<ffffffc0801888ac>] __run_timers+0x1c4
[<ffffffc080187814>] run_timer_softirq+0x24
[<ffffffc0800b95cc>] handle_softirqs+0x124
[<ffffffc080010224>] __do_softirq+0x14
[<ffffffc0800165b0>] ____do_softirq+0x10
[<ffffffc08001653c>] call_on_irq_stack+0x3c
[<ffffffc080016590>] do_softirq_own_stack+0x1c
[<ffffffc0800b99ac>] __irq_exit_rcu+0x54
[<ffffffc0800b994c>] irq_exit_rcu+0x10
[<ffffffc08102c0dc>] el1_interrupt+0x38
[<ffffffc08102c098>] el1h_64_irq_handler+0x18
[<ffffffc080011304>] el1h_64_irq+0x68
[<ffffffc0800b9498>] do_softirq+0x48
[<ffffffc0800b9448>] __local_bh_enable_ip+0x88
[<ffffffc08103e3d0>] _raw_spin_unlock_bh+0x24
[<ffffffc07dbe662c>] wlan_hdd_netif_queue_control[qca_cld3_adrastea]+0x508
[<ffffffc07dbec6d0>] hdd_cm_netif_queue_enable[qca_cld3_adrastea]+0x13c
[<ffffffc07dbee040>] hdd_cm_connect_success_pre_user_update[qca_cld3_adrastea]+0x940
[<ffffffc07dbed1c0>] hdd_cm_connect_complete[qca_cld3_adrastea]+0x50
[<ffffffc07dce1e9c>] osif_cm_connect_comp_ind[qca_cld3_adrastea]+0x3c
[<ffffffc07dce25a4>] osif_connect_handler[qca_cld3_adrastea]+0x130
[<ffffffc07dce208c>] osif_cm_connect_complete_cb[qca_cld3_adrastea]+0x10
[<ffffffc07de37e4c>] mlme_cm_osif_connect_complete[qca_cld3_adrastea]+0x3c
[<ffffffc07def0b00>] cm_fw_roam_sync_propagation[qca_cld3_adrastea]+0x708
[<ffffffc07ddaa474>] cm_roam_sync_event_handler_cb[qca_cld3_adrastea]+0x1fc
[<ffffffc07dd9d428>] wlan_cm_sta_mlme_vdev_roam_notify[qca_cld3_adrastea]+0x24
[<ffffffc07ddc0cf4>] sta_mlme_vdev_notify_roam_start[qca_cld3_adrastea]+0x54
[<ffffffc07de3e140>] mlme_vdev_subst_up_active_event[qca_cld3_adrastea]+0x200
[<ffffffc07de18df0>] wlan_sm_dispatch[qca_cld3_adrastea]+0x164
[<ffffffc07de3aeec>] wlan_vdev_mlme_sm_deliver_evt[qca_cld3_adrastea]+0x60
[<ffffffc07def0114>] cm_fw_send_vdev_roam_event[qca_cld3_adrastea]+0x5c
[<ffffffc07de2d09c>] cm_subst_roam_sync_event[qca_cld3_adrastea]+0x90
[<ffffffc07de18df0>] wlan_sm_dispatch[qca_cld3_adrastea]+0x164
[<ffffffc07de2cd68>] cm_state_roaming_event[qca_cld3_adrastea]+0x124
[<ffffffc07de18df0>] wlan_sm_dispatch[qca_cld3_adrastea]+0x164
[<ffffffc07de313e4>] cm_state_connected_event[qca_cld3_adrastea]+0x150
[<ffffffc07de18df0>] wlan_sm_dispatch[qca_cld3_adrastea]+0x164
[<ffffffc07de31e30>] cm_sm_deliver_event[qca_cld3_adrastea]+0x124
[<ffffffc07def0038>] cm_fw_roam_sync_req[qca_cld3_adrastea]+0xe8
[<ffffffc07ddaa0e4>] cm_roam_sync_event_handler[qca_cld3_adrastea]+0x40
[<ffffffc07dd94098>] target_if_cm_roam_sync_event[qca_cld3_adrastea]+0x120
[<ffffffc07dde9970>] __wmi_control_rx[qca_cld3_adrastea]+0x3ec
[<ffffffc07ddf0a68>] wmi_process_fw_event_handler[qca_cld3_adrastea]+0x34
[<ffffffc07de502d4>] scheduler_target_if_mq_handler[qca_cld3_adrastea]+0x4c
[<ffffffc07de51a30>] scheduler_thread[qca_cld3_adrastea]+0x1cc
[<ffffffc0800e1158>] kthread+0x114
[<ffffffc0800164f0>] ret_from_fork+0x10

PC → dev_queue_xmit_nit
CPU 3's program counter (PC) is in dev_queue_xmit_nit (the network transmit path). Reading down the stack, it got there via the tcp_retransmit_timer → tcp_write_timer_handler → call_timer_fn expiry path, which itself runs under run_timer_softirq, handle_softirqs and __do_softirq. Softirq processing was entered from _raw_spin_unlock_bh and the subsequent irq/rcu exit flow. From this call stack, the softirq processing chain never yielded back to the timer layer.
Disabling local_bh delays the timer softirq
In the network driver (qca_cld3_adrastea), functions such as wlan_hdd_netif_queue_control briefly disable bottom halves (local_bh_disable()) in softirq-sensitive sections. If the path is long or reentrant, that context keeps hold of CPU 3, and the timer softirq (including the pet_timer) can only run once local_bh_enable() is reached. Since CPU 3 is seen repeatedly handling upper-layer qca cm_roam, WMI and scheduler events before reaching __local_bh_enable_ip, this disable/enable window can stretch to several seconds, which is exactly why the expired pet_timer never got a chance to fire.
3. Root Cause
Softirqs are non-preemptible: once the kernel enters do_softirq(), it keeps processing every pending softirq type until the queue drains, or until softirq processing is explicitly disabled or deferred. Concretely:
The network driver or softirq handler path calls local_bh_disable() (i.e. disables bottom halves, the deferred softirq execution context), so even when a new softirq arrives (including the watchdog's timer softirq), it cannot enter do_softirq() again until local_bh_enable(). The network-stack work on CPU 3 (dev_queue_xmit_nit → tcp_retransmit_timer → the various qca_cld3 WMI/roam callbacks) occupied softirq context for a long stretch and repeatedly disabled bottom halves, never yielding to the watchdog's timer softirq. Because a softirq cannot be interrupted by another softirq, once the network softirq begins, the timer softirq can only run after the network softirq fully completes and bottom halves are re-enabled; if that blockage exceeds the watchdog's pet/bark window, the timer callback is missed and the watchdog declares a timeout.
4. Solution
wlan team: To be continued!