AI智能摘要
系统出现死机,核心原因是Watchdog定时器未按时触发“喂狗”操作导致20秒内无回应。从dmesg和timerlist日志分析,定时器的回调因中断或内核临界区阻塞未能执行,所有CPU的最后ping均停留在同一时刻,反映定时器链路被阻塞。这突出内核定时器和中断管理对系统稳定性的关键影响。
此摘要由AI分析文章内容生成,仅供参考。

一、问题现象

死机

二、问题分析

2.1 初步定位问题类型

查看dmesg_TZ.txt

[161726.646617][    C0] gh-watchdog hypervisor:qcom,gh-watchdog: QCOM Apps Watchdog bark! Now = 161726.646605
[161726.646652][    C0] gh-watchdog hypervisor:qcom,gh-watchdog: QCOM Apps Watchdog last pet at 161706.646098
[161726.646663][    C0] gh-watchdog hypervisor:qcom,gh-watchdog: cpu alive mask from last pet ff

从离线解析的dump日志中,我们可以轻松定位出发生了watchdog 死机。

last pet : 161706s

bark time: 161726s

20s时间内都没有喂狗

2.2 查看watchdog

----------begin Watchdog----------
Non-secure Watchdog data
Pet time: 15.0s
Bark time: 20.0s
Watchdog last pet: 161706.000
Watchdog pet timer not expired
Current jiffies crossed pet_timer expires jiffies
CPU online bits: 11111111
CPU runqueue online bits: 11111111
CPU isolated bits: 00000000
CPU dying bits: 00000000
pet_timer_flags: 0x26800003
pet_timer_expires: 4335322726
Current jiffies  : 4335324028
Timestamp of last timer interrupt(last_jiffies_update): 161726.924
tick_do_timer_cpu: 3
CPU logical map: [0, 1, 2, 3, 4, 5, 6, 7]
tick_broadcast_oneshot_mask: 00000000
tick_broadcast_pending_mask: 00000000
tick_broadcast_force_mask: 00000000
tick_broadcast_device cpumask: 11111111
tick_broadcast_device next_event: 9223372036.854776
CPU0 tick_device next_event: 161726.720000
CPU1 tick_device next_event: 161726.928000
CPU2 tick_device next_event: 161726.928000
CPU3 tick_device next_event: 161726.928000
CPU4 tick_device next_event: 161726.940000
CPU5 tick_device next_event: 161726.940000
CPU6 tick_device next_event: 161726.940000
CPU7 tick_device next_event: 161726.940000
CPU#0 : ping_start: 161706.646000 : ping_end: 161706.646013
CPU#1 : ping_start: 161706.646013 : ping_end: 161706.646044
CPU#2 : ping_start: 161706.646044 : ping_end: 161706.646053
CPU#3 : ping_start: 161706.646053 : ping_end: 161706.646053
CPU#4 : ping_start: 161706.646054 : ping_end: 161706.646063
CPU#5 : ping_start: 161706.646064 : ping_end: 161706.646070
CPU#6 : ping_start: 161706.646071 : ping_end: 161706.646080
CPU#7 : ping_start: 161706.646080 : ping_end: 161706.646088
timer_fired : 161706.645948
thread_start : 161706.645999
------------end Watchdog----------
  • Pet time = 15 s
    系统每隔 15 秒要触发一次“喂狗”操作(schedule pet_timer)。

  • Bark time = 20 s
    如果 20 秒内仍未再次“喂狗”,则 Watchdog 会“bark”并最终“bite”。

  • last pet = 161706.000
    最后一次真实的喂狗动作记录在内核时间戳 161706.000 s。

  • jiffies & expires

    • pet_timer_expires = 4335322726:以 jiffies 为单位,下一次计划触发喂狗动作的时刻。

    • Current jiffies = 4335324028:当前系统 jiffies。

    • 由于 4335324028 > 4335322726,说明本该在此时触发的 pet_timer 已经“错过”

  • ping_start

CPU#0 : ping_start: 161706.646000 → ping_end: 161706.646013
CPU#1 : ping_start: 161706.646013 → ping_end: 161706.646044
…
CPU#7 : ping_start: 161706.646080 → ping_end: 161706.646088

所有cpu的ping_start都是161706s的那次喂狗的时间点触发的记录。
这里我们要解释一下第四点和第五点
pet_timer 是以定时器(hrtimer 或 jiffies_timer)的方式注册在内核定时器链表上。上述“expires < current”说明定时器回调本应在 jiffies=4335322726 时触发,但在当前 jiffies 已到 4335324028 仍未执行。

这种情况常见于:

  • 全局中断被长时间禁止(local_irq_disable/uninterruptible section)。

  • 内核处于某些 critical section,例如大批量的自旋锁持有或软中断/任务let 被无限阻塞。

每次 pet_timer(喂狗定时器)到期时,Watchdog 会在各个 CPU 上发起一次 ping,记录下 ping_start 和 ping_end。

  • 因此,这两个时间戳反映的是上一次定时器回调成功执行的时刻。

从前面的 dump 可以看到,pet_timer 应该在 jiffies=4335322726 时触发,但由于定时器链路被阻塞(tick 漏调度、全局中断被屏蔽或 nohz 模式问题),它一直没跑,所以也就没机会重新发 ping,也就没有新的 ping_start 时间戳。

2.3 找到watchdog pet_timer

查看timerlist.txt

--------------------------------------------------------------------------------------------------------------------------------
CPU 3(tvec_base: ffffff89fd4d4780 timer_jiffies: 4335322329(161720.136s) next_timer: 4335322384(161720.356s) active_timers: NA)
--------------------------------------------------------------------------------------------------------------------------------

+ vectors Timers (75)

	INDEX  TIMER_LIST_ADDR    EXPIRES        EXPIRES(s)     FUNCTION                                 WORK                                                 REMARKS
	68     ffffff87b31498a8   4335322654     161721.436s    tcp_write_timer                                                                               
	98     ffffff8799dee8a8   4335322378     161720.332s    tcp_write_timer                                                                               
	98     ffffff87f2012ca8   4335322378     161720.332s    tcp_write_timer                                                                               
	98     ffffff87f20154a8   4335322378     161720.332s    tcp_write_timer                                                                               
	98     ffffff87b3fcdea8   4335322378     161720.332s    tcp_write_timer                                                                               
	98     ffffff87b1515ea8   4335322378     161720.332s    tcp_write_timer                                                                               
	98     ffffff87f2015ea8   4335322378     161720.332s    tcp_write_timer                                                                               
	98     ffffff80394818a8   4335322378     161720.332s    tcp_write_timer                                                                               
	98     ffffff87b3fca2a8   4335322378     161720.332s    tcp_write_timer                                                                               
	105    ffffff80578836a8   4335322436     161720.564s    tcp_write_timer                                                                               
	105    ffffff87c43140a8   4335322436     161720.564s    tcp_write_timer                                                                               
	105    ffffff87c43168a8   4335322436     161720.564s    tcp_write_timer                                                                               
	106    ffffff87b294f2a8   4335322444     161720.596s    tcp_write_timer                                                                               
	111    ffffff87c43104a8   4335322484     161720.756s    tcp_write_timer                                                                               
	112    ffffff8057884aa8   4335322493     161720.792s    tcp_write_timer                                                                               
	113    ffffff87c43136a8   4335322500     161720.82s     tcp_write_timer                                                                               
	114    ffffff87b2948ea8   4335322508     161720.852s    tcp_write_timer                                                                               
	117    ffffff80578868a8   4335322536     161720.964s    tcp_write_timer                                                                               
	122    ffffff80578840a8   4335322575     161721.12s     tcp_write_timer                                                                               
	122    ffffff80578804a8   4335322575     161721.12s     tcp_write_timer                                                                               
	122    ffffff879120f2a8   4335322575     161721.12s     tcp_write_timer                                                                               
	133    ffffff87da92ea18   4335325478     161732.732s    tcp_keepalive_timer                                                                           
	153    ffffff87b3cc18a8   4335322654     161721.436s    tcp_write_timer                                                                               
	153    ffffff87ba2484a8   4335322628     161721.332s    tcp_write_timer                                                                               
	153    ffffff87ba24b6a8   4335322628     161721.332s    tcp_write_timer                                                                               
	153    ffffff87ced40ea8   4335322628     161721.332s    tcp_write_timer                                                                               
	153    ffffff87f20118a8   4335322628     161721.332s    tcp_write_timer                                                                               
	153    ffffff87b29498a8   4335322628     161721.332s    tcp_write_timer                                                                               
	154    ffffff87b3fc8ea8   4335322693     161721.592s    tcp_write_timer                                                                               
	154    ffffff87b3cc22a8   4335322693     161721.592s    tcp_write_timer                                                                               
	154    ffffff8027bf2960   4335322726     161721.724s    qcom_wdt_pet_task_wakeup[qcom_wdt_core]                                                       
	155    ffffff80394872a8   4335322815     161722.08s     tcp_write_timer                                                                               
	155    ffffff8039480ea8   4335322777     161721.928s    tcp_write_timer                                                                               
	156    ffffff87b333caa8   4335322855     161722.24s     tcp_write_timer                                                                               
	156    ffffff8054fec0a8   4335322866     161722.284s    tcp_write_timer                                                                               
	157    ffffff80480c4aa8   4335322932     161722.548s    tcp_write_timer                                                                               
	157    ffffff8022e72ca8   4335322893     161722.392s    tcp_write_timer                                                                               
	158    ffffff8056ab84a8   4335323003     161722.832s    tcp_write_timer                                                                               
	158    ffffff80480c04a8   4335322970     161722.7s      tcp_write_timer                                                                               
	158    ffffff87f20136a8   4335322985     161722.76s     tcp_write_timer                                                                               
	159    ffffff805f37d4a8   4335323039     161722.976s    tcp_write_timer                                                                               
	160    ffffff8036a822a8   4335323115     161723.28s     tcp_write_timer                                                                               
	160    ffffff803680aca8   4335323075     161723.12s     tcp_write_timer                                                                               
	161    ffffff80480c68a8   4335323156     161723.444s    tcp_write_timer                                                                               
	162    ffffff80480c18a8   4335323261     161723.864s    tcp_write_timer                                                                               
	162    ffffff80480c2ca8   4335323210     161723.66s     tcp_write_timer                                                                               
	163    ffffff8022e736a8   4335323315     161724.08s     tcp_write_timer                                                                               
	164    ffffff80480c40a8   4335323365     161724.28s     tcp_write_timer                                                                               
	165    ffffff87812440a8   4335323455     161724.64s     tcp_write_timer                                                                               
	165    ffffff87b3cc72a8   4335323417     161724.488s    tcp_write_timer                                                                               
	166    ffffff87b16adea8   4335323494     161724.796s    tcp_write_timer                                                                               
	167    ffffff805c8ed4a8   4335323556     161725.044s    tcp_write_timer                                                                               
	167    ffffff87b13b84a8   4335323528     161724.932s    tcp_write_timer                                                                               
	168    ffffff87b14872a8   4335323620     161725.3s      tcp_write_timer                                                                               
	168    ffffff87b3fce8a8   4335323588     161725.172s    tcp_write_timer                                                                               
	169    ffffff87b3fcaca8   4335323709     161725.656s    tcp_write_timer                                                                               
	169    ffffff80475be8a8   4335323660     161725.46s     tcp_write_timer                                                                               
	170    ffffff87b3fc84a8   4335323753     161725.832s    tcp_write_timer                                                                               
	171    ffffff87b3cc36a8   4335323840     161726.18s     tcp_write_timer                                                                               
	171    ffffff87814884a8   4335323792     161725.988s    tcp_write_timer                                                                               
	172    ffffff87815154a8   4335323890     161726.38s     tcp_write_timer                                                                               
	173    ffffff878148b6a8   4335323931     161726.544s    tcp_write_timer                                                                               
	174    ffffff805eec36a8   4335323974     161726.716s    tcp_write_timer                                                                               
	176    ffffff8039482ca8   4335324158     161727.452s    tcp_write_timer                                                                               
	177    ffffff805c8ecaa8   4335324190     161727.58s     tcp_write_timer                                                                               
	178    ffffff805c8e8ea8   4335324256     161727.844s    tcp_write_timer                                                                               
	178    ffffff8799deb6a8   4335324266     161727.884s    tcp_write_timer                                                                               
	179    ffffff80417422a8   4335324342     161728.188s    tcp_write_timer                                                                               
	179    ffffff8047a5d4a8   4335324301     161728.024s    tcp_write_timer                                                                               
	180    ffffff891f0c84a8   4335324381     161728.344s    tcp_write_timer                                                                               
	181    ffffff87f2010ea8   4335324473     161728.712s    tcp_write_timer                                                                               
	239    ffffffc08231a9d0   4335328588     161745.172s    delayed_work_timer_fn                    work_fn                                              
	239    ffffffc087eebd38   4335328408     161744.452s    process_timeout                                                                               
	239    ffffffc083f33d48   4335328324     161744.116s    process_timeout                                                                               
	254    ffffffc082327aa0   4335336088     161775.172s    delayed_work_timer_fn                    gc_worker      

发现watchdog的pet_timer线程位于CPU3上,正处于peding状态等待执行,EXPIRES为161721.724s ,如果按照计划161721触发喂狗,也就没问题的,但是现在系统时间到161726s都没有触发。

  • CPU 0timer_jiffies 已经是 4335323981(161726.744 s)

  • CPU 3timer_jiffies 仍停留在 4335322329(161720.136 s)

CPU 3 的定时器时钟整整落后了大约 6.6 秒,所以它根本没有机会跑到 4335322726(161721.724 s)去触发 qcom_wdt_pet_task_wakeup,Watchdog 就判定超时、bark -> bite 了。

并且可以发现CPU3 qcom_wdt_pet_task_wakeup前面存在大量的tcp_write_timer等待运行。

2.4 查看CPU3的内核栈

Parsing debug information for MSM_DUMP_DATA_CPU_CTX. Version: 32 Magic: 42444832 Source: vm_3_vcpu_300
Parsing CPU3 affinity 300 context start 802e6920 end 802e6db0 version 32 client_id-> 0
cpu_type = MSM_DUMP_CPU_TYPE_AARCH64
ctx_type = MSM_DUMP_CTX_TYPE_NS_VCPU_EL10_CTX
cpu_index = 3
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_GPRS
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_NEON
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_SYSREGS_EL0
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_EL1
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_VM_EL2
regset_name = MSM_DUMP_REGSET_IDS_AARCH64_CNTV_EL10
Core vcpu3 PC: dev_queue_xmit_nit+8c <ffffffc080d6e8f4>
Core vcpu3 LR: dev_queue_xmit_nit+150 <ffffffc080d6e9b8>

[<ffffffc080d6e8f4>] dev_queue_xmit_nit+0x8c
[<ffffffc080d70f40>] dev_hard_start_xmit+0x90
[<ffffffc080dd5e18>] sch_direct_xmit+0xf0
[<ffffffc080dd64c4>] __qdisc_run+0x3cc
[<ffffffc080d7df64>] qdisc_run+0x48
[<ffffffc080d7204c>] __dev_queue_xmit+0x3e0
[<ffffffc080e76d84>] ip_finish_output2+0x48c
[<ffffffc080e76760>] __ip_finish_output+0x194
[<ffffffc080e73300>] ip_finish_output+0x20
[<ffffffc080e73440>] ip_output+0x50
[<ffffffc080e73914>] __ip_queue_xmit+0x414
[<ffffffc080e7397c>] ip_queue_xmit+0x14
[<ffffffc080e9db4c>] __tcp_transmit_skb+0x814
[<ffffffc080e99fe0>] __tcp_retransmit_skb+0x2cc
[<ffffffc080e9aa38>] tcp_retransmit_skb+0x1c
[<ffffffc080e9f868>] tcp_retransmit_timer+0x58c
[<ffffffc080e9fffc>] tcp_write_timer_handler+0x94
[<ffffffc080ea02c8>] tcp_write_timer+0x78
[<ffffffc080188a58>] call_timer_fn+0x4c
[<ffffffc0801888ac>] __run_timers+0x1c4
[<ffffffc080187814>] run_timer_softirq+0x24
[<ffffffc0800b95cc>] handle_softirqs+0x124
[<ffffffc080010224>] __do_softirq+0x14
[<ffffffc0800165b0>] ____do_softirq+0x10
[<ffffffc08001653c>] call_on_irq_stack+0x3c
[<ffffffc080016590>] do_softirq_own_stack+0x1c
[<ffffffc0800b99ac>] __irq_exit_rcu+0x54
[<ffffffc0800b994c>] irq_exit_rcu+0x10
[<ffffffc08102c0dc>] el1_interrupt+0x38
[<ffffffc08102c098>] el1h_64_irq_handler+0x18
[<ffffffc080011304>] el1h_64_irq+0x68
[<ffffffc0800b9498>] do_softirq+0x48
[<ffffffc0800b9448>] __local_bh_enable_ip+0x88
[<ffffffc08103e3d0>] _raw_spin_unlock_bh+0x24
[<ffffffc07dbe662c>] wlan_hdd_netif_queue_control[qca_cld3_adrastea]+0x508
[<ffffffc07dbec6d0>] hdd_cm_netif_queue_enable[qca_cld3_adrastea]+0x13c
[<ffffffc07dbee040>] hdd_cm_connect_success_pre_user_update[qca_cld3_adrastea]+0x940
[<ffffffc07dbed1c0>] hdd_cm_connect_complete[qca_cld3_adrastea]+0x50
[<ffffffc07dce1e9c>] osif_cm_connect_comp_ind[qca_cld3_adrastea]+0x3c
[<ffffffc07dce25a4>] osif_connect_handler[qca_cld3_adrastea]+0x130
[<ffffffc07dce208c>] osif_cm_connect_complete_cb[qca_cld3_adrastea]+0x10
[<ffffffc07de37e4c>] mlme_cm_osif_connect_complete[qca_cld3_adrastea]+0x3c
[<ffffffc07def0b00>] cm_fw_roam_sync_propagation[qca_cld3_adrastea]+0x708
[<ffffffc07ddaa474>] cm_roam_sync_event_handler_cb[qca_cld3_adrastea]+0x1fc
[<ffffffc07dd9d428>] wlan_cm_sta_mlme_vdev_roam_notify[qca_cld3_adrastea]+0x24
[<ffffffc07ddc0cf4>] sta_mlme_vdev_notify_roam_start[qca_cld3_adrastea]+0x54
[<ffffffc07de3e140>] mlme_vdev_subst_up_active_event[qca_cld3_adrastea]+0x200
[<ffffffc07de18df0>] wlan_sm_dispatch[qca_cld3_adrastea]+0x164
[<ffffffc07de3aeec>] wlan_vdev_mlme_sm_deliver_evt[qca_cld3_adrastea]+0x60
[<ffffffc07def0114>] cm_fw_send_vdev_roam_event[qca_cld3_adrastea]+0x5c
[<ffffffc07de2d09c>] cm_subst_roam_sync_event[qca_cld3_adrastea]+0x90
[<ffffffc07de18df0>] wlan_sm_dispatch[qca_cld3_adrastea]+0x164
[<ffffffc07de2cd68>] cm_state_roaming_event[qca_cld3_adrastea]+0x124
[<ffffffc07de18df0>] wlan_sm_dispatch[qca_cld3_adrastea]+0x164
[<ffffffc07de313e4>] cm_state_connected_event[qca_cld3_adrastea]+0x150
[<ffffffc07de18df0>] wlan_sm_dispatch[qca_cld3_adrastea]+0x164
[<ffffffc07de31e30>] cm_sm_deliver_event[qca_cld3_adrastea]+0x124
[<ffffffc07def0038>] cm_fw_roam_sync_req[qca_cld3_adrastea]+0xe8
[<ffffffc07ddaa0e4>] cm_roam_sync_event_handler[qca_cld3_adrastea]+0x40
[<ffffffc07dd94098>] target_if_cm_roam_sync_event[qca_cld3_adrastea]+0x120
[<ffffffc07dde9970>] __wmi_control_rx[qca_cld3_adrastea]+0x3ec
[<ffffffc07ddf0a68>] wmi_process_fw_event_handler[qca_cld3_adrastea]+0x34
[<ffffffc07de502d4>] scheduler_target_if_mq_handler[qca_cld3_adrastea]+0x4c
[<ffffffc07de51a30>] scheduler_thread[qca_cld3_adrastea]+0x1cc
[<ffffffc0800e1158>] kthread+0x114
[<ffffffc0800164f0>] ret_from_fork+0x10
  • PC → dev_queue_xmit_nit

    • CPU 3 的程序计数器(PC)在 dev_queue_xmit_nit(网络层发包路由)附近,接着一路在 tcp_retransmit_timertcp_write_timer_handlercall_timer_fn 等路径上不断循环。

    • 它再到达 run_timer_softirqhandle_softirqs__do_softirq,然后直接跳到 _raw_spin_unlock_bh 及后续 rcu/irq 退出流程。从调用栈看,软中断处理链条并未让出到定时器层面。

  • local_bh 禁用导致定时器延后

    • 在网络驱动(qca_cld3_adrastea)的 wlan_hdd_netif_queue_control 等函数里,软中断上下文中会短暂禁用 bottom-halves (__local_bh_disable()),但如果路径过长或递归调用,中断上下文会一直占据 CPU3,直到 __local_bh_enable() 才允许定时器软中断(包括 pet_timer)运行。

    • 由于你看到 CPU 3 多次在 __local_bh_enable_ip 之前反复处理上层 qca_cm_roam、wmi、scheduler 等事件,这个关闭–开启窗口可能超过数秒,直接导致 pet_timer 过期还没机会触发。

三、问题根因

软中断本身是 非抢占 的:一旦进入 do_softirq(),内核会连续地处理所有 pending 软中断类型,直到软中断队列清空,或者处理软中断被明确禁用/延后。具体交互方式:

  • 网络驱动或软中断处理路径里调用了 __local_bh_disable()(即禁止 bottom halves,也就是禁用软中断上下文延后执行),那么即便有新的 softirq(包括wdt定时器软中断)到来,也要等到 __local_bh_enable() 之后才能再次进入 do_softirq()

  • CPU 3 上的网络栈处理(dev_queue_xmit_nittcp_retransmit_timer → 各种 qca_cld3 的 WMI/roam 回调)占用了很长时间的软中断上下文,并且多次短暂禁用底半部,没有让出给wdt定时器软中断。

  • 由于 软中断不允许被其它软中断打断,一旦网络软中断开始,定时器软中断就只能等到网络软中断完全结束、底半部重新使能后才能执行;如果这段 “堵塞” 超过了 Watchdog 的 pet / bark 窗口,定时器回调就会错过,从而导致 Watchdog 认为超时。

四、解决方案

wlan team: To be continued!