Symptom

The system crashes (kernel panic).

Analysis

[  189.052980][ T5068] Unable to handle kernel paging request at virtual address 00046ffca9037bf9
[  189.052991][ T5068] Mem abort info:
[  189.052997][ T5068]   ESR = 0x0000000096000004
[  189.053005][ T5068]   EC = 0x25: DABT (current EL), IL = 32 bits
[  189.053013][ T5068]   SET = 0, FnV = 0
[  189.053020][ T5068]   EA = 0, S1PTW = 0
[  189.053027][ T5068]   FSC = 0x04: level 0 translation fault
[  189.053035][ T5068] Data abort info:
[  189.053039][ T5068]   ISV = 0, ISS = 0x00000004
[  189.053045][ T5068]   CM = 0, WnR = 0
[  189.053053][ T5068] [00046ffca9037bf9] address between user and kernel address ranges
[  189.053064][ T5068] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  189.053311][ T5068] Dumping ftrace buffer:
[  189.053331][ T5068]    (ftrace buffer empty)
[  189.055391][ T5068] CPU: 1 PID: 5068 Comm: binder:1027_3 Tainted: G        WC OE      6.1.118-android14-11-maybe-dirty #1
[  189.055405][ T5068] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[  189.055412][ T5068] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  189.055426][ T5068] pc : dpm_complete+0x128/0x44c
[  189.055451][ T5068] lr : dpm_complete+0x114/0x44c
[  189.055462][ T5068] sp : ffffffc0243fbb40
[  189.055468][ T5068] x29: ffffffc0243fbb60 x28: ffffff8035d52580 x27: ffffffc00a1fc000
[  189.055489][ T5068] x26: ffffffc00a1fc210 x25: ffffffc0243fbb48 x24: ffffff8093e724a0
[  189.055508][ T5068] x23: ffffff8093e72518 x22: ffffff8093e72400 x21: ffffffc0092f0ae9
[  189.055527][ T5068] x20: ffffffc00a1fc1c0 x19: 0000000000000010 x18: ffffffc022c2d078
[  189.055545][ T5068] x17: 000000007b71745f x16: 000000007b71745f x15: ffffff8179342180
[  189.055564][ T5068] x14: 0000000000000010 x13: ffffffc0082809d4 x12: ffffffc00939e698
[  189.055582][ T5068] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffc00a0c7000
[  189.055600][ T5068] x8 : a9046ffca9037bfd x7 : 3a4d50006574656c x6 : 0000101a1e00090b
[  189.055619][ T5068] x5 : 0b09001e1a100000 x4 : 0000008000000000 x3 : ffffff8056d3a9c8
[  189.055637][ T5068] x2 : 00000000ffff93a3 x1 : 0000000000000000 x0 : ffffff8093e72400
[  189.055657][ T5068] Call trace:
[  189.055663][ T5068]  dpm_complete+0x128/0x44c
[  189.055677][ T5068]  suspend_devices_and_enter+0x894/0xc04
[  189.055698][ T5068]  pm_suspend+0x330/0x694
[  189.055711][ T5068]  state_store+0x104/0x1c8
[  189.055724][ T5068]  kobj_attr_store+0x30/0x48
[  189.055747][ T5068]  sysfs_kf_write+0x54/0x6c
[  189.055769][ T5068]  kernfs_fop_write_iter+0x104/0x1a4
[  189.055789][ T5068]  vfs_write+0x244/0x2e0
[  189.055805][ T5068]  ksys_write+0x78/0xe8
[  189.055816][ T5068]  __arm64_sys_write+0x1c/0x2c
[  189.055829][ T5068]  invoke_syscall+0x58/0x114
[  189.055845][ T5068]  el0_svc_common+0xb4/0xfc
[  189.055857][ T5068]  do_el0_svc+0x24/0x84
[  189.055867][ T5068]  el0_svc+0x2c/0x90
[  189.055884][ T5068]  el0t_64_sync_handler+0x68/0xb4
[  189.055897][ T5068]  el0t_64_sync+0x1a4/0x1a8
[  189.055920][ T5068] Code: b40002a8 f9400508 b40003e8 aa1603e0 (b85fc110) 
[  189.055933][ T5068] ---[ end trace 0000000000000000 ]---
[  189.169167][ T5068] Kernel panic - not syncing: Oops: Fatal exception
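
The fields in the oops above can be cross-checked by hand. The sketch below decodes the ESR value and the instruction shown in parentheses on the `Code:` line (`b85fc110`), then recomputes the faulting address from x8. The helper names are made up for illustration. Masking off the top byte is an assumption about how the kernel prints a tagged address, chosen because it reproduces the address in the log.

```c
#include <assert.h>
#include <stdint.h>

/* Exception Class: ESR_ELx bits [31:26]. */
static unsigned esr_ec(uint64_t esr) { return (unsigned)((esr >> 26) & 0x3f); }

/* Data Fault Status Code: ESR_ELx bits [5:0] (meaningful for data aborts). */
static unsigned esr_dfsc(uint64_t esr) { return (unsigned)(esr & 0x3f); }

/* WnR: ESR_ELx bit 6; 0 means the faulting access was a read. */
static unsigned esr_wnr(uint64_t esr) { return (unsigned)((esr >> 6) & 1); }

/* True if insn is a 32-bit LDUR (load register, unscaled immediate). */
static int is_ldur32(uint32_t insn)
{
	return (insn & 0xffe00c00u) == 0xb8400000u;
}

/* Register and offset fields of an LDUR instruction. */
struct ldur_fields {
	unsigned rt;   /* destination register number, bits [4:0] */
	unsigned rn;   /* base register number, bits [9:5] */
	int imm9;      /* signed byte offset, bits [20:12] */
};

static struct ldur_fields decode_ldur(uint32_t insn)
{
	struct ldur_fields f;
	int imm = (int)((insn >> 12) & 0x1ff);

	f.rt = insn & 0x1f;
	f.rn = (insn >> 5) & 0x1f;
	f.imm9 = (imm & 0x100) ? imm - 0x200 : imm;   /* sign-extend 9 bits */
	return f;
}

/* Address the load touches, with the top byte cleared.
 * Assumption: the top byte is treated as a tag and dropped when printed. */
static uint64_t fault_addr(uint64_t base, int imm9)
{
	return (base + (uint64_t)(int64_t)imm9) & 0x00ffffffffffffffULL;
}
```

With the values from the log, the instruction decodes as a 32-bit load into w16 from x8 minus 4, and `fault_addr(0xa9046ffca9037bfd, -4)` gives `00046ffca9037bf9`, the address reported on the first line of the oops. That suggests x8 held a garbage pointer at the time of the crash, which fits the later finding that the structure it was loaded from had been freed.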

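Initial localization


The crash happens during a system suspend cycle: the call trace runs from state_store through pm_suspend and suspend_devices_and_enter into dpm_complete.

Devices are suspended one after another.

The faulting device is disp_feature/disp-DSI-0.

Something went wrong in the suspend flow: the class of disp-DSI-0 appears to have been unregistered.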

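First issue


The dmesg log shows two threads running the initialization flow at the same time:
Around 7.0x s, thread T615 reaches mi_display_pwrkey_callback_set
Around 7.04 s, thread T710 services the pwrkey IRQ
Around 7.40 s, T710 initializes mi disp_core and mi disp_log
Around 7.45 s, T675 initializes mi_disp_core and mi disp_log again, sees that they are already initialized, and returns immediately
Around 7.45 s, T675 initializes mi disp_feature


This gives the first issue:
The display initialization is triggered from the power-key interrupt handler instead of going through the regular display flow. This needs to be reworked.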
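
One possible direction for that rework, sketched as plain userspace C: the getter used by the power-key callback stops initializing anything as a side effect, and the callback simply bails out when the display side is not ready yet. All names here (`get_feature`, `display_init`, `powerkey_callback`) are hypothetical stand-ins, not the driver's real functions, and the `-EAGAIN` return value is an arbitrary choice for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct disp_feature { int version; };

/* Written only by the regular display init path. */
static struct disp_feature *g_feature;

/* Pure getter: never initializes anything as a side effect. */
static struct disp_feature *get_feature(void)
{
	return g_feature;
}

/* Hypothetical stand-in for the display driver's own init path. */
static int display_init(struct disp_feature *storage)
{
	if (g_feature)
		return 0;          /* already initialized */
	storage->version = 1;
	g_feature = storage;
	return 0;
}

/* Hypothetical power-key callback: does nothing if display is not ready. */
static int powerkey_callback(void)
{
	struct disp_feature *df = get_feature();

	if (!df)
		return -EAGAIN;    /* too early: ignore this key event */
	return 0;
}
```

With this shape, an early key press can no longer start a second, competing initialization from interrupt context. Real driver code would still need proper locking around the global pointer.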

Second issue

	Line 4538: [    7.456376][  T710] sysfs: cannot create duplicate filename '/devices/virtual/mi_display/disp_feature'
	Line 4549: [    7.467624][  T710] CPU: 1 PID: 710 Comm: irq/135-pm8941_ Tainted: G        WC OE      6.1.118-android14-11-maybe-dirty #1
	Line 4559: [    7.485547][  T710] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
	Line 4560: [    7.485552][  T710] Call trace:
	Line 4561: [    7.485555][  T710]  dump_backtrace+0xf4/0x11c
	Line 4562: [    7.485569][  T710]  show_stack+0x18/0x24
	Line 4563: [    7.485573][  T710]  dump_stack_lvl+0x60/0x90
	Line 4564: [    7.485580][  T710]  sysfs_create_dir_ns+0xf0/0x150
	Line 4565: [    7.485588][  T710]  kobject_add_internal+0x228/0x478
	Line 4566: [    7.485595][  T710]  kobject_add+0x94/0x10c
	Line 4567: [    7.485600][  T710]  device_add+0x144/0x618
	Line 4568: [    7.485607][  T710]  device_create_groups_vargs+0xcc/0x12c
	Line 4570: [    7.499011][  T710]  device_create+0x58/0x80
	Line 4571: [    7.499017][  T710]  mi_disp_feature_init+0xdc/0x20c [msm_drm]
	Line 4573: [    7.510902][  T710]  mi_get_disp_feature+0x20/0x40 [msm_drm]
	Line 4575: [    7.522143][  T710]  mi_display_powerkey_callback+0x18/0x80 [msm_drm]
	Line 4577: [    7.537274][  T710]  pm8941_pwrkey_irq+0x1e8/0x330 [pm8941_pwrkey]
	Line 4578: [    7.537302][  T710]  irq_thread_fn+0x44/0xa4
	Line 4579: [    7.537315][  T710]  irq_thread+0x164/0x290
	Line 4580: [    7.537320][  T710]  kthread+0x10c/0x154
	Line 4581: [    7.537328][  T710]  ret_from_fork+0x10/0x20
	Line 4583: [    7.547231][  T710] kobject_add_internal failed for disp_feature with -EEXIST, don't try to register things with the same name in the same directory.
	Line 4588: [    7.559217][  T710] [mi_disp:mi_disp_feature_init [msm_drm]] [E]create device failed for disp_feature
	Line 4591: [    7.572531][  T710] ------------[ cut here ]------------
	Line 4593: [    7.584887][  T710] remove_proc_entry: removing non-empty directory '/proc/mi_display', leaking at least 'mipi_rw_prim'
	Line 4594: [    7.584917][  T710] WARNING: CPU: 1 PID: 710 at fs/proc/generic.c:720 remove_proc_entry+0x1e0/0x1ec
	Line 4595: [    7.584935][  T710] Modules linked in: rmnet_wlan(OE) rmnet_offload(OE) rmnet_perf(OE) rmnet_shs(OE) rmnet_perf_tether(OE) rmnet_core(OE) gauge_iio(E) ipanetm(OE) 
	Line 4625: [    7.672205][  T710] CPU: 1 PID: 710 Comm: irq/135-pm8941_ Tainted: G        WC OE      6.1.118-android14-11-maybe-dirty #1
	Line 4626: [    7.672211][  T710] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
	Line 4627: [    7.672214][  T710] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
	Line 4628: [    7.672219][  T710] pc : remove_proc_entry+0x1e0/0x1ec
	Line 4629: [    7.672234][  T710] lr : remove_proc_entry+0x1e0/0x1ec
	Line 4630: [    7.672240][  T710] sp : ffffffc00b4a3c60
	Line 4631: [    7.672242][  T710] x29: ffffffc00b4a3c80 x28: 0000000000000000 x27: 00000000ffffffff
	Line 4632: [    7.672250][  T710] x26: 0000000000000001 x25: ffffffc00a1a4580 x24: 000000000000000a
	Line 4633: [    7.672256][  T710] x23: 000000000000000a x22: ffffffc009318048 x21: ffffff804c52b180
	Line 4634: [    7.672263][  T710] x20: ffffff804c52b22c x19: ffffff804c52b200 x18: ffffffc00aafd048
	Line 4635: [    7.672269][  T710] x17: 0000000000000015 x16: 00000000000000a4 x15: ffffffc00902ec88
	Line 4636: [    7.672276][  T710] x14: 0000000000000001 x13: 000000000000004e x12: 0000000000000018
	Line 4637: [    7.672282][  T710] x11: 00000000ffffffff
	Line 4640: [    7.687628][  T710]  x10: ffffffc00a09eb5c x9 : 67aa0542b3522000
	Line 4641: [    7.687638][  T710] x8 : 67aa0542b3522000 x7 : 656c20746120676e x6 : 0000000000000027
	Line 4642: [    7.687644][  T710] x5 : ffffff8179154234 x4 : ffffffc0093675d5 x3 : ffff0a00ffffff04
	Line 4643: [    7.687651][  T710] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000063
	Line 4644: [    7.687658][  T710] Call trace:
	Line 4645: [    7.687663][  T710]  remove_proc_entry+0x1e0/0x1ec
	Line 4646: [    7.687673][  T710]  mi_disp_core_deinit+0x34/0x60 [msm_drm]
	Line 4653: [    7.705247][  T710]  mi_disp_feature_init+0x16c/0x20c [msm_drm]
	Line 4663: [    7.722296][  T710]  mi_get_disp_feature+0x20/0x40 [msm_drm]
	Line 4669: [    7.739086][  T710]  mi_display_powerkey_callback+0x18/0x80 [msm_drm]
	Line 4671: [    7.762509][  T710]  pm8941_pwrkey_irq+0x1e8/0x330 [pm8941_pwrkey]
	Line 4672: [    7.762528][  T710]  irq_thread_fn+0x44/0xa4
	Line 4673: [    7.762539][  T710]  irq_thread+0x164/0x290
	Line 4674: [    7.762544][  T710]  kthread+0x10c/0x154
	Line 4675: [    7.762550][  T710]  ret_from_fork+0x10/0x20
	Line 4677: [    7.784476][  T710] ---[ end trace 0000000000000000 ]---
	Line 4678: [    7.784632][  T710] [mi_disp:mi_display_powerkey_callback [msm_drm]] [E]invalid dsi_display or dsi_panel ptr

pm8941_pwrkey_irq eventually ends up calling mi_disp_core_deinit. The corresponding code:

void mi_disp_core_deinit(void)
{
	if (!g_disp_core)
		return;
	debugfs_remove_recursive(g_disp_core->debugfs_dir);
	remove_proc_entry(MI_DISPLAY_PROCFS_DIR, NULL);
	class_destroy(g_disp_core->class);
	kfree(g_disp_core);
	g_disp_core = NULL;     // clears g_disp_core, but df->class still holds the old class pointer
}

This destroys g_disp_core->class, frees g_disp_core with kfree, and sets it to NULL.
I specifically asked an AI about this:
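{% tip success %}

  1. class_destroy destroys the class.
  2. kfree(g_disp_core) does not zero the memory that g_disp_core points to. It only returns the block to the allocator, so the memory can be handed out and used again.
  3. g_disp_core = NULL changes only the pointer variable itself, so that it points to NULL instead of the old address.
{% endtip %}

Tracing the callers of mi_disp_core_deinit confirms that the call comes from the following code: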
int mi_disp_feature_init(void)
{
	int ret = 0;
	struct disp_feature *df = NULL;
	struct disp_core *disp_core = NULL;
	int i;

	ret = mi_disp_core_init();
	if (ret < 0)
		return -ENODEV;

	mi_disp_log_init();

	disp_core = mi_get_disp_core();
	if (!disp_core)
		return -ENODEV;

	if (g_disp_feature) {
		DISP_INFO("mi disp_feature already initialized, return!\n");
		return 0;
	}

	df = kzalloc(sizeof(struct disp_feature), GFP_KERNEL);
	if (!df) {
		DISP_ERROR("can not allocate Buffer\n");
		ret = -ENOMEM;
		goto err_core_deinit;
	}

	ret = mi_disp_cdev_register(DISP_FEATURE_DEVICE_NAME,
				&disp_feature_fops, &df->cdev);
	if (ret < 0) {
		DISP_ERROR("cdev register failed for %s\n", DISP_FEATURE_DEVICE_NAME);
		goto err_alloc_mem;
	}

	df->dev_id = df->cdev->dev;
	df->class = disp_core->class;                     // disp_core->class is copied into disp_feature
	df->pdev = device_create(df->class, NULL, df->dev_id, df, DISP_FEATURE_DEVICE_NAME);
	if (IS_ERR(df->pdev)) {
		DISP_ERROR("create device failed for %s\n", DISP_FEATURE_DEVICE_NAME);    // this message appears in the log
		ret = -ENODEV;
		goto err_cdev_register;
	}

	df->version = MI_DISP_FEATURE_VERSION;
	for (i = MI_DISP_PRIMARY; i < MI_DISP_MAX; i++) {
		df->d_display[i].dev = NULL;
		df->d_display[i].display = NULL;
		df->d_display[i].disp_id = MI_DISP_MAX;
		df->d_display[i].intf_type = MI_INTF_MAX;
		mutex_init(&df->d_display[i].mutex_lock);
	}
	INIT_LIST_HEAD(&df->client_list);
	spin_lock_init(&df->client_spinlock);

	g_disp_feature = df;                                // on the first successful init, the allocated df pointer is stored in the global

	DISP_INFO("mi disp_feature driver initialized!\n");

	if (hwconf_init() < 0) {
		DISP_ERROR("can not initialize hwconf.\n");
	}

	return 0;

err_cdev_register:                                      // execution jumps here
	mi_disp_cdev_unregister(df->cdev);
err_alloc_mem:
	kfree(df);
err_core_deinit:
	mi_disp_core_deinit();     // mi_disp_core_deinit is called from here
	return ret;
}

goto err_cdev_register
err_cdev_register: // execution jumps here
mi_disp_cdev_unregister(df->cdev); // unregisters the cdev
err_alloc_mem:
kfree(df); // returns df's memory to the allocator
err_core_deinit:
mi_disp_core_deinit(); // tears down the core, including the class

void mi_disp_cdev_unregister(struct cdev *cdev)
{
	unregister_chrdev_region(cdev->dev, 1);
	cdev_del(cdev);
	cdev = NULL;
}

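This is where the second issue shows up.
cdev is a parameter, that is, a local variable of the function. Setting it to NULL has no effect on the caller's argument.
So df->cdev should still be non-NULL. Inspecting g_disp_feature->cdev confirms this: it was indeed not cleared.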
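
The pass-by-value behaviour can be reproduced in a few lines of plain C. In this minimal sketch, `struct cdev_like` and both function names are invented for the illustration. The first function has the same shape as mi_disp_cdev_unregister, and the second takes a pointer to the caller's pointer, which is one common way to let a teardown helper clear it.

```c
#include <assert.h>
#include <stddef.h>

struct cdev_like { int dev; };

/* Same shape as the original helper: only the local copy is changed. */
static void unregister_by_value(struct cdev_like *cdev)
{
	/* real teardown of *cdev would go here */
	cdev = NULL;
	(void)cdev;        /* the assignment has no effect outside this function */
}

/* Receives the address of the caller's pointer and clears that pointer. */
static void unregister_and_clear(struct cdev_like **cdevp)
{
	if (!cdevp || !*cdevp)
		return;
	/* real teardown of **cdevp would go here */
	*cdevp = NULL;
}
```

After `unregister_by_value(p)` the caller's `p` still holds the old address. After `unregister_and_clear(&p)` it is NULL, so a later NULL check in the caller actually works.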

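The disassembly of the function confirms this as well.

x0 carries the value of cdev. On entry the function saves x0 into x19, and all later operations work on x19 rather than on x0.
The ldr x19, [sp, #0x10] at the end only restores the caller's x19, which is a callee-saved register that was spilled to the stack in the prologue. Nothing is ever stored back to the caller's df->cdev, so the pointer the caller holds is unchanged when the function returns.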

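Third issue

goto err_cdev_register
	err_cdev_register:                                      // execution jumps here
		mi_disp_cdev_unregister(df->cdev);   // unregisters the cdev
	err_alloc_mem:
		kfree(df);                                             // returns df's memory to the allocator
	err_core_deinit:
		mi_disp_core_deinit();                        // tears down the core

After kfree(df), neither df nor g_disp_feature is set to NULL,
which easily leads to problems.

Note:
Once g_disp_feature = df has run, df and g_disp_feature hold the same address, but they are two separate pointer variables stored at different locations. kfree(df) only returns the block to the allocator. If that memory is then reused, df and g_disp_feature still hold the old address, and dereferencing either one results in a fault or corrupted data.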
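
To make the reuse scenario concrete without relying on undefined behaviour, the sketch below uses a one-slot toy allocator in place of kmalloc/kfree. Everything in it (`toy_alloc`, `toy_free`, `union payload`, the 0x1234 marker) is invented for the illustration. The real slab allocator is far more complex, but the effect on stale pointers is the same in spirit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct feature { unsigned long class_ptr; };

/* The single block managed by the toy allocator. */
union payload {
	struct feature f;      /* how the first owner uses the block */
	char text[32];         /* how a later, unrelated owner uses it */
};

static union payload slot;
static int slot_in_use;

static union payload *toy_alloc(void)
{
	if (slot_in_use)
		return NULL;
	slot_in_use = 1;
	return &slot;
}

/* Marks the block reusable. Contents and outstanding pointers are untouched. */
static void toy_free(union payload *p)
{
	if (p == &slot)
		slot_in_use = 0;
}

/* Returns 1 if a stale pointer ends up reading another owner's data. */
static int stale_pointer_sees_foreign_data(void)
{
	union payload *df = toy_alloc();
	union payload *g = df;           /* second pointer, same address */
	union payload *other;
	int result;

	df->f.class_ptr = 0x1234;
	toy_free(df);                    /* neither df nor g changes */

	other = toy_alloc();             /* the next user gets the same block */
	memset(other->text, 0x5a, sizeof(other->text));

	result = (g == other) && (g->f.class_ptr != 0x1234);
	toy_free(other);
	return result;
}
```

The stale pointer `g` is still non-NULL and still "looks valid", yet the field it reads now contains whatever the new owner wrote. This is the same pattern as the corrupted class members seen in the crash dump.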

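Summary

Although three issues were found, the actual crash happens because the class was destroyed and that state was never propagated to g_disp_feature. Both g_disp_core and g_disp_feature must be set to NULL.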

	df->class = disp_core->class;                     // disp_core->class is copied into disp_feature
void mi_disp_core_deinit(void)
{
	if (!g_disp_core)
		return;
	debugfs_remove_recursive(g_disp_core->debugfs_dir);
	remove_proc_entry(MI_DISPLAY_PROCFS_DIR, NULL);
	class_destroy(g_disp_core->class);
	kfree(g_disp_core);
	g_disp_core = NULL;     // clears g_disp_core, but the copied class pointer is not cleared: df->class can still reach the destroyed class
}

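As a result, the suspend flow assumes the class still exists, which triggers the crash. In TRACE32, every member of the class looks corrupted, which indicates that this memory block has most likely been reused by another owner.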
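
One way to keep a failing init path from tearing down state that another user still depends on is to count the users of the shared core and destroy it only when the last one lets go. The sketch below is a userspace model under that assumption, not the vendor's fix. `core_get`, `core_put`, and the `class_alive` flag (standing in for class_create/class_destroy) are hypothetical, and a real driver would have to protect these functions with a lock.

```c
#include <assert.h>
#include <stdlib.h>

struct core {
	int class_alive;   /* stands in for the struct class lifetime */
	int refs;          /* number of users currently holding the core */
};

static struct core *g_core;

/* Creates the core on first use and takes a reference.
 * Not thread-safe: real code needs a mutex around this. */
static struct core *core_get(void)
{
	if (!g_core) {
		g_core = calloc(1, sizeof(*g_core));
		if (!g_core)
			return NULL;
		g_core->class_alive = 1;   /* stands in for class_create() */
	}
	g_core->refs++;
	return g_core;
}

/* Drops one reference; tears the core down only when nobody uses it. */
static void core_put(void)
{
	if (!g_core)
		return;
	if (--g_core->refs > 0)
		return;                    /* another user still needs the class */
	g_core->class_alive = 0;       /* stands in for class_destroy() */
	free(g_core);
	g_core = NULL;
}
```

In this model, the failing caller's error path would call `core_put()` instead of an unconditional deinit, so the class stays alive for the caller that initialized successfully and still has devices registered in it.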