0. 问题现象

死机,dmesg中日志如下:

[   34.029307][    C0] ------------[ cut here ]------------
[   35.844227][    C0] msm_watchdog f410000.qcom,wdt: QCOM Apps Watchdog bark! Now = 35.844222
[   35.844236][    C0] msm_watchdog f410000.qcom,wdt: QCOM Apps Watchdog last pet at 15.841438
[   35.844243][    C0] msm_watchdog f410000.qcom,wdt: cpu alive mask from last pet ff
[   35.844347][    C0] CPU0 next event is 35840000000
[   35.844350][    C0] CPU1 next event is 34048000000
[   35.844353][    C0] CPU2 next event is 34048000000
[   35.844356][    C0] CPU3 next event is 34048000000
[   35.844358][    C0] CPU4 next event is 34048000000
[   35.844360][    C0] CPU5 next event is 34048000000
[   35.844363][    C0] CPU6 next event is 34048000000
[   35.844365][    C0] CPU7 next event is 34046135975
[   35.850421][    C0] Top irqs in last 35850 ms:
[   35.850426][    C0] IRQ 11 [GICv3:arch_timer] - 29310 times
[   35.850430][    C0] IRQ 295 [ipcc:glink-native-adsp] - 6227 times
[   35.850434][    C0] IRQ 284 [GICv3:cci] - 4691 times
[   35.850438][    C0] IRQ 16 [GICv3:ipcc_0] - 8446 times
[   35.850441][    C0] IRQ 93 [GICv3:ufshcd] - 9749 times
[   35.850447][    C0] msm_watchdog f410000.qcom,wdt: Causing a QCOM Apps Watchdog bite!
[   35.850456][    C0] msm_watchdog f410000.qcom,wdt: Wdog - STS: 0x1401be, CTL: 0x3, BARK TIME: 0x9ffc4, BITE TIME: 0xeffa6

1. 问题分析

1.1 check watchdog基础信息

----------begin Watchdog----------
Non-secure Watchdog data
Watchdog enabled
Pet time: 15.0s
Bark time: 20.0s
Watchdog last pet: 1.5e-08
Watchdog next pet: 15.000000015
Watchdog next bark: 20.000000015
Watchdog task is waiting on core 6 from 30.961304
CPU online bits: 11111111
CPU runqueue online bits: 11111101
CPU isolated bits: 00000000
CPU dying bits: 00000010
pet_timer_flags: 0x29400006
pet_timer_expires: 4294900006
Current jiffies  : 4294901254
Timestamp of last timer interrupt(last_jiffies_update): 35.828
tick_do_timer_cpu: 7
CPU logical map: [0, 1, 2, 3, 4, 5, 6, 7]
tick_broadcast_oneshot_mask: 00000000
tick_broadcast_pending_mask: 00000000
tick_broadcast_force_mask: 00000000
tick_broadcast_device cpumask: 00000010
tick_broadcast_device next_event: 9223372036.854776
CPU0 tick_device next_event: 35.840000
CPU1 tick_device next_event: 34.048000
CPU2 tick_device next_event: 34.048000
CPU3 tick_device next_event: 34.048000
CPU4 tick_device next_event: 34.048000
CPU5 tick_device next_event: 34.048000
CPU6 tick_device next_event: 34.048000
CPU7 tick_device next_event: 34.046136
CPU#0 : ping_start: 15.841391 : ping_end: 15.841398
CPU#1 : ping_start: 15.841398 : ping_end: 15.841405
CPU#2 : ping_start: 15.841406 : ping_end: 15.841413
CPU#3 : ping_start: 15.841414 : ping_end: 15.841422
CPU#4 : ping_start: 15.841422 : ping_end: 15.841428
CPU#5 : ping_start: 15.841429 : ping_end: 15.841434
CPU#6 : ping_start: 15.841434 : ping_end: 15.841434
CPU#7 : ping_start: 15.841434 : ping_end: 15.841436
timer_fired : 30.961298
thread_start : 15.841390
------------end Watchdog----------

1.2 查看tz中记录的non secure watchdog日志

core 0:

Core 0 PC: __delay+9c <ffffffc008faa978>

Core 0 LR: __delay[jt]+8c <ffffffc008faa998>

 

[<ffffffc008faa978>] __delay+0x9c

[<ffffffc008faaa54>] __const_udelay+0x28

[<ffffffc0014d5890>] qcom_wdt_trigger_bite[qcom_wdt_core]+0x424

[<ffffffc0014d6118>] qcom_wdt_bark_handler[qcom_wdt_core]+0xac

[<ffffffc008142e9c>] __handle_irq_event_percpu+0xbc

[<ffffffc008143150>] handle_irq_event+0x48

[<ffffffc008149218>] handle_fasteoi_irq+0x124

[<ffffffc008141ed0>] generic_handle_domain_irq+0x44

[<ffffffc008010144>] gic_handle_irq+0x4c

[<ffffffc008fdf820>] el1_interrupt+0x98

[<ffffffc008fdf77c>] el1h_64_irq_handler+0x18

[<ffffffc008011304>] ret_to_kernel[jt]+0x0

[<ffffffc00813fd60>] console_emit_next_record+0x330

[<ffffffc00813d598>] console_unlock+0x150

[<ffffffc00813d2f0>] vprintk_emit[jt]+0x16c

[<ffffffc00813d6e4>] vprintk_default+0x44

[<ffffffc0081402a4>] vprintk[jt]+0x90

[<ffffffc008fd87c4>] _printk+0x54

[<ffffffc0080afb14>] __warn_printk+0xe0

[<ffffffc008d87634>] dev_watchdog+0x220

[<ffffffc0081863cc>] call_timer_fn+0x4c

[<ffffffc00818622c>] __run_timers+0x1b8

[<ffffffc0081852d4>] run_timer_softirq+0x24

[<ffffffc008010330>] __do_softirq+0x11c

[<ffffffc008016f68>] ____do_softirq+0x10

[<ffffffc008016ef4>] call_on_irq_stack+0x3c

[<ffffffc008016f48>] do_softirq_own_stack+0x1c

[<ffffffc0080b9340>] __irq_exit_rcu+0x54

[<ffffffc0080b92e0>] irq_exit_rcu+0x10

[<ffffffc008fdf82c>] el1_interrupt+0xa4

[<ffffffc008fdf77c>] el1h_64_irq_handler+0x18

[<ffffffc008011304>] ret_to_kernel[jt]+0x0

[<ffffffc00815ddc0>] rcu_momentary_dyntick_idle+0x18

[<ffffffc0081c3b4c>] multi_cpu_stop+0x94

[<ffffffc0081c4330>] cpu_stopper_thread+0xe8

[<ffffffc0080e9a7c>] smpboot_thread_fn[jt]+0x104

[<ffffffc0080e2db0>] kthread+0x104

[<ffffffc008016ea8>] ret_from_fork+0x10

其他核的trace都类似以下:

Core 7 PC: rcu_momentary_dyntick_idle+3c <ffffffc00815dde4>

Core 7 LR: multi_cpu_stop+94 <ffffffc0081c3b4c>

 

[<ffffffc00815dde4>] rcu_momentary_dyntick_idle+0x3c

[<ffffffc0081c3b4c>] multi_cpu_stop+0x94

[<ffffffc0081c4330>] cpu_stopper_thread+0xe8

[<ffffffc0080e9a7c>] smpboot_thread_fn[jt]+0x104

[<ffffffc0080e2db0>] kthread+0x104

[<ffffffc008016ea8>] ret_from_fork+0x10

所有的cpu都暂时停止了工作,等待rcu_mementary_dyntick_idle
而cpu0 正在执行printk的动作,其他core都在等cpu0执行完输出任务,msm-watchdog没办法在规定时间内pet,导致了此次异常

2. 解决方案

关闭kernel的串口日志