问题现场

笔者第一次遇到类似问题,首先查看当时的kernel log,定位到的异常是大量的I/O ERROR以及read ERROR,初步怀疑跟storage访问相关。

继续看log,会发现许多设备在走shutdown流程,比如说第四行开始的log,就是下电时需要把缓存写进storage,在完成一些storage同步操作:

 sd 0:0:0:5: [sdf] Synchronizing SCSI cache

因此怀疑当时在走shutdown关机流程。

[20834.577473][T24761] fsa4480_shutdown poweroff reset switch.
[20834.581554][ T1377] BATTERY_CHG: get_quick_charge_type: usb_real_type[0][Unknown] temp[4400]
[20834.581712][T24761] arm-smmu 3da0000.kgsl-smmu: disabling translation
//下面几行也是shutdown流程中,storage进行同步时打印的log
[20834.583470][T24761] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
[20834.589569][T24761] sd 0:0:0:4: [sde] Synchronizing SCSI cache
[20834.590049][T24761] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
[20834.590354][T24761] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
[20834.590564][T24761] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
[20834.590896][T24761] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[20834.601527][T24761] ufshcd-qcom 1d84000.ufshc: ufs_qcom_parse_reg_info: Unable to find qcom,vccq-proxy-vote-supply regulator, assuming enabled
[20834.602494][    T9] sd 0:0:0:0: rejecting I/O to offline device
[20834.602537][    T9] I/O error, dev sda, sector 31903920 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
[20834.602801][T24761] msm_geni_serial a94000.qcom,qup_uart: msm_geni_serial_driver_shutdown called 1
[20834.610731][T32385] I/O error, dev sda, sector 41577464 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[20834.610925][ T1186] I/O error, dev sda, sector 31815224 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
[20834.610952][T32385] I/O error, dev sda, sector 14663480 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[20834.610987][ T1273] I/O error, dev sda, sector 32468816 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
[20834.611016][ T1273] Read-error on swap-device (253:0:13692368)
[20834.611058][T32385] I/O error, dev sda, sector 31150640 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
[20834.611169][T32385] I/O error, dev sda, sector 41446704 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[20834.611301][T32385] I/O error, dev sda, sector 29971016 op 0x0:(READ) flags 0x80700 phys_seg 33 prio class 2
[20834.611487][T32385] I/O error, dev sda, sector 29971024 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[20834.611572][ T3052] I/O error, dev sda, sector 14663480 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[20834.611575][ T2805] Read-error on swap-device (253:0:700056)
[20834.611729][ T1273] Read-error on swap-device (253:0:10023008)
[20834.612891][ T3052] Read-error on swap-device (253:0:6690328)
[20834.620757][T24761] msm_geni_serial a94000.qcom,qup_uart: msm_geni_serial_remove driver removed 1
[20834.621189][T24761] arm-smmu 15000000.apps-smmu: disabling translation
[20834.641723][T11612] Read-error on swap-device (253:0:8699136)
[20834.642096][T11612] Read-error on swap-device (253:0:16448464)
[20834.652392][  T509] Read-error on swap-device (253:0:7215016)
[20834.652620][  T509] Read-error on swap-device (253:0:9236288)
[20834.654678][T13910] Read-error on swap-device (253:0:3670560)
[20834.661837][ T1411] Read-error on swap-device (253:0:9453728)
[20834.662237][ T1383] process_accept_req: 4 callbacks suppressed
[20834.662245][ T1383] smcinvoke: process_accept_req: Setting pid:1383, server id : 18 state to defunct
[20834.665740][    T1] init: Service 'vendor.sensors-hal-multihal' (pid 1415) received signal 7 from uid 1000 pid 1415
[20834.665797][    T1] init: Sending signal 9 to service 'vendor.sensors-hal-multihal' (pid 1415) process group...
[20834.671824][ T4309] binder: 1610:4309 ioctl c0306201 7a92011a68 returned -14
[20834.672458][ T4311] binder: 1610:4311 ioctl c0306201 7a91f0da68 returned -14
[20834.679359][ T1370] smcinvoke: process_tzcb_req: server is defunct, state= 1 tzhandle = -2147418089
[20834.679378][ T1370] smcinvoke: process_tzcb_req: server invalid, res: -90
[20834.684681][ T1370] smcinvoke: process_tzcb_req: server is defunct, state= 1 tzhandle = -2147483625
[20834.684705][ T1370] smcinvoke: process_tzcb_req: server invalid, res: -90
[20834.685119][T25029] devfreq 1d84000.ufshc: dvfs failed with (-16) error
[20834.686306][  T268] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007
[20834.686340][  T268] CPU: 0 PID: 268 Comm: init Tainted: G        WC OE      6.6.56-android15-8-ga3c3e7c841d9-ab13323268-4k #1 1400000003000000474e550019e9fdb8a0d3938f
[20834.686361][  T268] Hardware name: Qualcomm Technologies, Inc. Kunzite QRD (DT)
[20834.686374][  T268] Call trace:
[20834.686384][  T268]  dump_backtrace+0xe8/0x108
[20834.686408][  T268]  show_stack+0x18/0x28
[20834.686422][  T268]  dump_stack_lvl+0x50/0x6c
[20834.686442][  T268]  dump_stack+0x18/0x24
[20834.686457][  T268]  panic+0x158/0x36c
[20834.686472][  T268]  do_exit+0x800/0xa6c
[20834.686492][  T268]  trace_android_vh_do_group_exit+0x0/0x90
[20834.686507][  T268]  get_signal+0x7bc/0x8d4
[20834.686522][  T268]  do_notify_resume+0x11c/0x764
[20834.686535][  T268]  el0_svc+0x64/0x68
[20834.686548][  T268]  el0t_64_sync_handler+0x68/0xbc
[20834.686560][  T268]  el0t_64_sync+0x1a8/0x1ac

搜索关键字healthd,发现问题时间点电量确实很低,在走低电关机流程

[20834.470085][ T1377] healthd: battery l=1 v=3171 t=44.0 h=2 st=3 c=-1016000 fc=5513000 cc=6 chg=

初步分析

因此目前可以得到两点关键信息:

  1. 系统当时正在走低电关机流程

  2. 系统尝试访问磁盘数据但出现了IO ERROR

因此笔者怀疑是和下电时序相关:

系统下电流程中,storage先下电了,导致有进程尝试去访问storage失败。

至于为什么会以Attempt to kill init的形式体现,目前不清楚咋回事。

同步咨询平台。

平台回复

从log中可以看出,当时charger相关函数检测到了系统低电之后,调用了kernel_power_off来执行下电关机。可以看下面贴的当时CPU3的调用栈。

kernel_power_off是对于kernel层执行的power off,走的是底层的shutdown流程。

正常下电需要确保安卓上层已经完成shutdown,再进行底层shutdown,以防止一些竞态问题。而此题是直接从底层开始shutdown,就无法保证上层已经完成全部的shutdown流程了,此题直接原因是由于外部存储设备已经下电,上层去访问失败导致出现的异常,kill init的动作是上层访问存储设备失败之后产生的。

[20834.540034][T24761] CPU: 3 PID: 24761 Comm: kworker/3:1 Tainted: G        WC OE      6.6.56-android15-8-ga3c3e7c841d9-ab13323268-4k #1 1400000003000000474e550019e9fdb8a0d3938f
[20834.540047][T24761] Hardware name: Qualcomm Technologies, Inc. Kunzite QRD (DT)
[20834.540054][T24761] Workqueue: events battery_chg_check_status_work [qti_battery_charger]
[20834.540146][T24761] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[20834.540156][T24761] pc : remove_proc_entry+0x168/0x1ec
[20834.540166][T24761] lr : remove_proc_entry+0x168/0x1ec
[20834.540174][T24761] sp : ffffffc0b97cbbc0
[20834.540178][T24761] x29: ffffffc0b97cbbe0 x28: ffffff88520fd880 x27: ffffffc0822ac000
[20834.540194][T24761] x26: 0000000000000001 x25: ffffffc082182698 x24: 0000000000000006
[20834.540207][T24761] x23: 0000000000000006 x22: ffffffc082182718 x21: ffffffc07cb7bd06
[20834.540221][T24761] x20: 0000000000000000 x19: ffffffc07cb7bd06 x18: ffffffc09c901050
[20834.540235][T24761] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000004
[20834.540249][T24761] x14: ffffff88f5300000 x13: 0000000000000003 x12: 0000000000000003
[20834.540263][T24761] x11: 0000000100003973 x10: c000000100003973 x9 : d34d315511775600
[20834.540281][T24761] x8 : d34d315511775600 x7 : 277761725f707427 x6 : 70742720656d616e
[20834.540297][T24761] x5 : ffffffc0822d2795 x4 : ffffffc08156c43e x3 : 0000000000000000
[20834.540310][T24761] x2 : 0000000000000000 x1 : ffffffc0b97cb950 x0 : 000000000000000d
[20834.540324][T24761] Call trace:
[20834.540329][T24761]  remove_proc_entry+0x168/0x1ec
[20834.540338][T24761]  goodix_ts_procfs_exit+0x6c/0xd8 [gt9916k_spi 1400000003000000474e5500edc467ed48425c12]
[20834.540428][T24761]  goodix_ts_shutdown+0x16c/0x1fc [gt9916k_spi 1400000003000000474e5500edc467ed48425c12]
[20834.540511][T24761]  platform_shutdown+0x3c/0x4c
[20834.540523][T24761]  device_shutdown+0x1ac/0x270
[20834.540535][T24761]  kernel_power_off+0x3c/0xf4
[20834.540547][T24761]  battery_chg_check_status_work+0x214/0x220 [qti_battery_charger 1400000003000000474e5500337ac494df74d690]
[20834.540630][T24761]  process_scheduled_works+0x1d4/0x468
[20834.540641][T24761]  worker_thread+0x244/0x334
[20834.540650][T24761]  kthread+0x114/0x1bc
[20834.540658][T24761]  ret_from_fork+0x10/0x20

解决方法

与charger同事对齐,发现他们在问题版本之后几天已经合入了优化代码:

diff --git a/drivers/power/supply/qti_battery_charger.c b/drivers/power/supply/qti_battery_charger.c
index 4bc5c32..19cee06 100644
--- a/drivers/power/supply/qti_battery_charger.c
+++ b/drivers/power/supply/qti_battery_charger.c
 
......
 static const int usb_prop_map[USB_PROP_MAX] = {
@@ -1572,7 +1574,7 @@
  pr_emerg("Initiating a shutdown in 100 ms\n");
  msleep(100);
  pr_emerg("Attempting kernel_power_off: Battery voltage low\n");
 -kernel_power_off();
+// kernel_power_off();
 }
......

删除了原本代码中底层shutdown的操作kernel_power_off,之后低电情况下先走上层关机流程,可以避免此问题。

使用新版本进行压测,没有复现此问题。

代码浅析

正常情况是上层先走完shutdown流程,该关闭的服务/工具什么的都关闭,再通过syscall让底层继续shutdown。

这里来简单看一下上层转底层的代码:

RebootSystem

上层流程走完之后,最终会走到这里,进入底层shutdown流程


void __attribute__((noreturn))
RebootSystem(unsigned int cmd, const std::string& rebootTarget, const std::string& reboot_reason) {
    LOG(INFO) << "Reboot ending, jumping to kernel";

    if (!IsRebootCapable()) {
        // On systems where init does not have the capability of rebooting the
        // device, just exit cleanly.
        exit(0);
    }

    switch (cmd) {
        case ANDROID_RB_POWEROFF:
            reboot(RB_POWER_OFF);//SHUTDOWN
            break;

        case ANDROID_RB_RESTART2:
            syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2,
                    LINUX_REBOOT_CMD_RESTART2, rebootTarget.c_str());
            break;

        case ANDROID_RB_THERMOFF:
            if (android::base::GetBoolProperty("ro.thermal_warmreset", false)) {
                std::string reason = "shutdown,thermal";
                if (!reboot_reason.empty()) reason = reboot_reason;

                LOG(INFO) << "Try to trigger a warm reset for thermal shutdown";
                syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2,
                        LINUX_REBOOT_CMD_RESTART2, reason.c_str());
            } else {
                reboot(RB_POWER_OFF);
            }
            break;
    }
    // In normal case, reboot should not return.
    PLOG(ERROR) << "reboot call returned";
    abort();
}

SYSCALL_DEFINE4(reboot)

关机/重启相关的系统调用

/*
 * Reboot system call: for obvious reasons only root may call it,
 * and even root needs to set up some magic numbers in the registers
 * so that some mistake won't make this reboot the whole machine.
 * You can also set the meaning of the ctrl-alt-del-key here.
 *
 * reboot doesn't sync: do that yourself before calling this.
 */
SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
                void __user *, arg)
{
        struct pid_namespace *pid_ns = task_active_pid_ns(current);
        char buffer[256];
        int ret = 0;

       ...............

        /* Instead of trying to make the power_off code look like
         * halt when pm_power_off is not set do it the easy way.
         */
        if ((cmd == LINUX_REBOOT_CMD_POWER_OFF) && !kernel_can_power_off())
                cmd = LINUX_REBOOT_CMD_HALT;

        mutex_lock(&system_transition_mutex);
        switch (cmd) {
        case LINUX_REBOOT_CMD_RESTART:
                kernel_restart(NULL);
                break;

        case LINUX_REBOOT_CMD_CAD_ON:
                C_A_D = 1;
                break;

        case LINUX_REBOOT_CMD_CAD_OFF:
                C_A_D = 0;
                break;

        case LINUX_REBOOT_CMD_HALT:
                kernel_halt();
                do_exit(0);

        case LINUX_REBOOT_CMD_POWER_OFF:
                kernel_power_off();//最终还是调用此函数,但是此时已经走完了上层shutdown
                do_exit(0);
                break;

    ............
        mutex_unlock(&system_transition_mutex);
        return ret;
}

补充

为什么当时会怀疑是下电时序导致的问题呢?

因为之前笔者遇到过一例NOC ERROR,也是发生在系统低电关机流程中

最终定位到是因为GPU和ARM-SMMU由于下电时序没有做好同步,对于某一块地址出现了竞态访问,导致出现异常。

因此,在系统下电流程中,做好时序上的同步是非常重要的。