问题现场
笔者第一次遇到类似问题,首先查看当时的kernel log,定位到的异常是大量的I/O ERROR以及read ERROR,初步怀疑跟storage访问相关。
继续看log,会发现许多设备在走shutdown流程,比如说第四行开始的log,就是下电时需要把缓存写进storage,在完成一些storage同步操作:
sd 0:0:0:5: [sdf] Synchronizing SCSI cache
因此怀疑当时在走shutdown关机流程。
[20834.577473][T24761] fsa4480_shutdown poweroff reset switch.
[20834.581554][ T1377] BATTERY_CHG: get_quick_charge_type: usb_real_type[0][Unknown] temp[4400]
[20834.581712][T24761] arm-smmu 3da0000.kgsl-smmu: disabling translation
//下面几行也是shutdown流程中,storage进行同步时打印的log
[20834.583470][T24761] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
[20834.589569][T24761] sd 0:0:0:4: [sde] Synchronizing SCSI cache
[20834.590049][T24761] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
[20834.590354][T24761] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
[20834.590564][T24761] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
[20834.590896][T24761] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[20834.601527][T24761] ufshcd-qcom 1d84000.ufshc: ufs_qcom_parse_reg_info: Unable to find qcom,vccq-proxy-vote-supply regulator, assuming enabled
[20834.602494][ T9] sd 0:0:0:0: rejecting I/O to offline device
[20834.602537][ T9] I/O error, dev sda, sector 31903920 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
[20834.602801][T24761] msm_geni_serial a94000.qcom,qup_uart: msm_geni_serial_driver_shutdown called 1
[20834.610731][T32385] I/O error, dev sda, sector 41577464 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[20834.610925][ T1186] I/O error, dev sda, sector 31815224 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
[20834.610952][T32385] I/O error, dev sda, sector 14663480 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[20834.610987][ T1273] I/O error, dev sda, sector 32468816 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
[20834.611016][ T1273] Read-error on swap-device (253:0:13692368)
[20834.611058][T32385] I/O error, dev sda, sector 31150640 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
[20834.611169][T32385] I/O error, dev sda, sector 41446704 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[20834.611301][T32385] I/O error, dev sda, sector 29971016 op 0x0:(READ) flags 0x80700 phys_seg 33 prio class 2
[20834.611487][T32385] I/O error, dev sda, sector 29971024 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[20834.611572][ T3052] I/O error, dev sda, sector 14663480 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[20834.611575][ T2805] Read-error on swap-device (253:0:700056)
[20834.611729][ T1273] Read-error on swap-device (253:0:10023008)
[20834.612891][ T3052] Read-error on swap-device (253:0:6690328)
[20834.620757][T24761] msm_geni_serial a94000.qcom,qup_uart: msm_geni_serial_remove driver removed 1
[20834.621189][T24761] arm-smmu 15000000.apps-smmu: disabling translation
[20834.641723][T11612] Read-error on swap-device (253:0:8699136)
[20834.642096][T11612] Read-error on swap-device (253:0:16448464)
[20834.652392][ T509] Read-error on swap-device (253:0:7215016)
[20834.652620][ T509] Read-error on swap-device (253:0:9236288)
[20834.654678][T13910] Read-error on swap-device (253:0:3670560)
[20834.661837][ T1411] Read-error on swap-device (253:0:9453728)
[20834.662237][ T1383] process_accept_req: 4 callbacks suppressed
[20834.662245][ T1383] smcinvoke: process_accept_req: Setting pid:1383, server id : 18 state to defunct
[20834.665740][ T1] init: Service 'vendor.sensors-hal-multihal' (pid 1415) received signal 7 from uid 1000 pid 1415
[20834.665797][ T1] init: Sending signal 9 to service 'vendor.sensors-hal-multihal' (pid 1415) process group...
[20834.671824][ T4309] binder: 1610:4309 ioctl c0306201 7a92011a68 returned -14
[20834.672458][ T4311] binder: 1610:4311 ioctl c0306201 7a91f0da68 returned -14
[20834.679359][ T1370] smcinvoke: process_tzcb_req: server is defunct, state= 1 tzhandle = -2147418089
[20834.679378][ T1370] smcinvoke: process_tzcb_req: server invalid, res: -90
[20834.684681][ T1370] smcinvoke: process_tzcb_req: server is defunct, state= 1 tzhandle = -2147483625
[20834.684705][ T1370] smcinvoke: process_tzcb_req: server invalid, res: -90
[20834.685119][T25029] devfreq 1d84000.ufshc: dvfs failed with (-16) error
[20834.686306][ T268] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007
[20834.686340][ T268] CPU: 0 PID: 268 Comm: init Tainted: G WC OE 6.6.56-android15-8-ga3c3e7c841d9-ab13323268-4k #1 1400000003000000474e550019e9fdb8a0d3938f
[20834.686361][ T268] Hardware name: Qualcomm Technologies, Inc. Kunzite QRD (DT)
[20834.686374][ T268] Call trace:
[20834.686384][ T268] dump_backtrace+0xe8/0x108
[20834.686408][ T268] show_stack+0x18/0x28
[20834.686422][ T268] dump_stack_lvl+0x50/0x6c
[20834.686442][ T268] dump_stack+0x18/0x24
[20834.686457][ T268] panic+0x158/0x36c
[20834.686472][ T268] do_exit+0x800/0xa6c
[20834.686492][ T268] trace_android_vh_do_group_exit+0x0/0x90
[20834.686507][ T268] get_signal+0x7bc/0x8d4
[20834.686522][ T268] do_notify_resume+0x11c/0x764
[20834.686535][ T268] el0_svc+0x64/0x68
[20834.686548][ T268] el0t_64_sync_handler+0x68/0xbc
[20834.686560][ T268] el0t_64_sync+0x1a8/0x1ac
搜索关键字healthd,发现问题时间点电量确实很低,在走低电关机流程
[20834.470085][ T1377] healthd: battery l=1 v=3171 t=44.0 h=2 st=3 c=-1016000 fc=5513000 cc=6 chg=
初步分析
因此目前可以得到两点关键信息:
系统当时正在走低电关机流程
系统尝试访问磁盘数据但出现了IO ERROR
因此笔者怀疑是和下电时序相关:
系统下电流程中,storage先下电了,导致有进程尝试去访问storage失败。
至于为什么会以Attempt to kill init的形式体现,目前不清楚咋回事。
同步咨询平台。
平台回复
从log中可以看出,当时charger相关函数检测到了系统低电之后,调用了kernel_power_off来执行下电关机。可以看下面贴的当时CPU3的调用栈。
kernel_power_off是对于kernel层执行的power off,走的是底层的shutdown流程。
正常下电需要确保安卓上层已经完成shutdown,再进行底层shutdown,以防止一些竞态问题。而此题是直接从底层开始shutdown,就无法保证上层已经完成全部的shutdown流程了,此题直接原因是由于外部存储设备已经下电,上层去访问失败导致出现的异常,kill init的动作是上层访问存储设备失败之后产生的。
[20834.540034][T24761] CPU: 3 PID: 24761 Comm: kworker/3:1 Tainted: G WC OE 6.6.56-android15-8-ga3c3e7c841d9-ab13323268-4k #1 1400000003000000474e550019e9fdb8a0d3938f
[20834.540047][T24761] Hardware name: Qualcomm Technologies, Inc. Kunzite QRD (DT)
[20834.540054][T24761] Workqueue: events battery_chg_check_status_work [qti_battery_charger]
[20834.540146][T24761] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[20834.540156][T24761] pc : remove_proc_entry+0x168/0x1ec
[20834.540166][T24761] lr : remove_proc_entry+0x168/0x1ec
[20834.540174][T24761] sp : ffffffc0b97cbbc0
[20834.540178][T24761] x29: ffffffc0b97cbbe0 x28: ffffff88520fd880 x27: ffffffc0822ac000
[20834.540194][T24761] x26: 0000000000000001 x25: ffffffc082182698 x24: 0000000000000006
[20834.540207][T24761] x23: 0000000000000006 x22: ffffffc082182718 x21: ffffffc07cb7bd06
[20834.540221][T24761] x20: 0000000000000000 x19: ffffffc07cb7bd06 x18: ffffffc09c901050
[20834.540235][T24761] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000004
[20834.540249][T24761] x14: ffffff88f5300000 x13: 0000000000000003 x12: 0000000000000003
[20834.540263][T24761] x11: 0000000100003973 x10: c000000100003973 x9 : d34d315511775600
[20834.540281][T24761] x8 : d34d315511775600 x7 : 277761725f707427 x6 : 70742720656d616e
[20834.540297][T24761] x5 : ffffffc0822d2795 x4 : ffffffc08156c43e x3 : 0000000000000000
[20834.540310][T24761] x2 : 0000000000000000 x1 : ffffffc0b97cb950 x0 : 000000000000000d
[20834.540324][T24761] Call trace:
[20834.540329][T24761] remove_proc_entry+0x168/0x1ec
[20834.540338][T24761] goodix_ts_procfs_exit+0x6c/0xd8 [gt9916k_spi 1400000003000000474e5500edc467ed48425c12]
[20834.540428][T24761] goodix_ts_shutdown+0x16c/0x1fc [gt9916k_spi 1400000003000000474e5500edc467ed48425c12]
[20834.540511][T24761] platform_shutdown+0x3c/0x4c
[20834.540523][T24761] device_shutdown+0x1ac/0x270
[20834.540535][T24761] kernel_power_off+0x3c/0xf4
[20834.540547][T24761] battery_chg_check_status_work+0x214/0x220 [qti_battery_charger 1400000003000000474e5500337ac494df74d690]
[20834.540630][T24761] process_scheduled_works+0x1d4/0x468
[20834.540641][T24761] worker_thread+0x244/0x334
[20834.540650][T24761] kthread+0x114/0x1bc
[20834.540658][T24761] ret_from_fork+0x10/0x20
解决方法
与charger同事对齐,发现他们在问题版本之后几天已经合入了优化代码:
diff --git a/drivers/power/supply/qti_battery_charger.c b/drivers/power/supply/qti_battery_charger.c
index 4bc5c32..19cee06 100644
--- a/drivers/power/supply/qti_battery_charger.c
+++ b/drivers/power/supply/qti_battery_charger.c
......
static const int usb_prop_map[USB_PROP_MAX] = {
@@ -1572,7 +1574,7 @@
pr_emerg("Initiating a shutdown in 100 ms\n");
msleep(100);
pr_emerg("Attempting kernel_power_off: Battery voltage low\n");
-kernel_power_off();
+// kernel_power_off();
}
......
删除了原本代码中底层shutdown的操作kernel_power_off,之后低电情况下先走上层关机流程,可以避免此问题。
使用新版本进行压测,没有复现此问题。
代码浅析
正常情况是上层先走完shutdown流程,该关闭的服务/工具什么的都关闭,再通过syscall让底层继续shutdown。
这里来简单看一下上层转底层的代码:
RebootSystem
上层流程走完之后,最终会走到这里,进入底层shutdown流程
void __attribute__((noreturn))
RebootSystem(unsigned int cmd, const std::string& rebootTarget, const std::string& reboot_reason) {
LOG(INFO) << "Reboot ending, jumping to kernel";
if (!IsRebootCapable()) {
// On systems where init does not have the capability of rebooting the
// device, just exit cleanly.
exit(0);
}
switch (cmd) {
case ANDROID_RB_POWEROFF:
reboot(RB_POWER_OFF);//SHUTDOWN
break;
case ANDROID_RB_RESTART2:
syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2,
LINUX_REBOOT_CMD_RESTART2, rebootTarget.c_str());
break;
case ANDROID_RB_THERMOFF:
if (android::base::GetBoolProperty("ro.thermal_warmreset", false)) {
std::string reason = "shutdown,thermal";
if (!reboot_reason.empty()) reason = reboot_reason;
LOG(INFO) << "Try to trigger a warm reset for thermal shutdown";
syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2,
LINUX_REBOOT_CMD_RESTART2, reason.c_str());
} else {
reboot(RB_POWER_OFF);
}
break;
}
// In normal case, reboot should not return.
PLOG(ERROR) << "reboot call returned";
abort();
}
SYSCALL_DEFINE4(reboot)
关机/重启相关的系统调用
/*
* Reboot system call: for obvious reasons only root may call it,
* and even root needs to set up some magic numbers in the registers
* so that some mistake won't make this reboot the whole machine.
* You can also set the meaning of the ctrl-alt-del-key here.
*
* reboot doesn't sync: do that yourself before calling this.
*/
SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
void __user *, arg)
{
struct pid_namespace *pid_ns = task_active_pid_ns(current);
char buffer[256];
int ret = 0;
...............
/* Instead of trying to make the power_off code look like
* halt when pm_power_off is not set do it the easy way.
*/
if ((cmd == LINUX_REBOOT_CMD_POWER_OFF) && !kernel_can_power_off())
cmd = LINUX_REBOOT_CMD_HALT;
mutex_lock(&system_transition_mutex);
switch (cmd) {
case LINUX_REBOOT_CMD_RESTART:
kernel_restart(NULL);
break;
case LINUX_REBOOT_CMD_CAD_ON:
C_A_D = 1;
break;
case LINUX_REBOOT_CMD_CAD_OFF:
C_A_D = 0;
break;
case LINUX_REBOOT_CMD_HALT:
kernel_halt();
do_exit(0);
case LINUX_REBOOT_CMD_POWER_OFF:
kernel_power_off();//最终还是调用此函数,但是此时已经走完了上层shutdown
do_exit(0);
break;
............
mutex_unlock(&system_transition_mutex);
return ret;
}
补充
为什么当时会怀疑是下电时序导致的问题呢?
因为之前笔者遇到过一例NOC ERROR,也是发生在系统低电关机流程中
最终定位到是因为GPU和ARM-SMMU由于下电时序没有做好同步,对于某一块地址出现了竞态访问,导致出现异常。
因此,在系统下电流程中,做好时序上的同步是非常重要的。