一次linux死锁问题的调试

13 篇文章 0 订阅

系统环境

内核配置:单核单cpu,禁抢占,内核版本linux4.1.15

问题描述

读取sys节点无法返回,系统没有报panic 等ops 信息,这个现象和死锁比较像,打开内核lockdep 配置来检测调试:如下图,打开这些检测。来进行定位
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
好了,运行后重新操作一遍果然打印出来信息:

[ 223.052568] mm ERR line = 869 mm_read_data_and_report_camaro : check sum error
[ 223.060730] mm ERR line = 1030 mm_irq_handle_thread : I2C report point process error
[ 223.070885] mm ERR line = 869 mm_read_data_and_report_camaro : check sum error
[ 223.079103] mm ERR line = 1030 mm_irq_handle_thread : I2C report point process error
[ 223.089128] mm ERR line = 869 mm_read_data_and_report_camaro : check sum error
[ 223.097291] mm ERR line = 1030 mm_irq_handle_thread : I2C report point process error
[ 223.107360] mm ERR line = 869 mm_read_data_and_report_camaro : check sum error
[ 223.115573] mm ERR line = 1030 mm_irq_handle_thread : I2C report point process error
[ 223.125706] mm ERR line = 869 mm_read_data_and_report_camaro : check sum error
[ 223.133867] mm ERR line = 1030 mm_irq_handle_thread : I2C report point process error
[ 223.143981] mm ERR line = 869 mm_read_data_and_report_camaro : check sum error
[ 223.152146] mm ERR line = 1030 mm_irq_handle_thread : I2C report point process error
[ 223.161036] BUG: sleeping function called from invalid context at kernel/irq/manage.c:110
[ 223.169237] in_atomic(): 1, irqs_disabled(): 128, pid: 143, name: cat
[ 223.175690] 2 locks held by cat/143:
[ 223.179274] #0: (&mm_data->mm_mutex){…}, at: [] mm_irq_disable+0x18/0x98 [mm_limc]
[ 223.189861] #1: (&(&mm_data->irq_lock)->rlock){…}, at: [] mm_irq_disable+0x24/0x98 [mm_limc]
[ 223.200953] CPU: 0 PID: 143 Comm: cat Tainted: G W O 4.1.15 #10
[ 223.207840] Hardware name: Freescale i.MX6 Ultralite (Device Tree)
[ 223.214070] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[ 223.221842] [] (show_stack) from [] (synchronize_irq+0x28/0x88)
[ 223.229550] [] (synchronize_irq) from [] (mm_irq_disable+0x44/0x98 [mm_limc])
[ 223.239217] [] (mm_irq_disable [mm_limc]) from [] (mm_update_with_bin_read+0x1f4/0x354 [mm_limc])
[ 223.251283] [] (mm_update_with_bin_read [mm_limc]) from [] (proc_reg_read+0x70/0x84)
[ 223.261489] [] (proc_reg_read) from [] (__vfs_read+0x28/0xc8)
[ 223.268999] [] (__vfs_read) from [] (vfs_read+0x84/0xe8)
[ 223.276073] [] (vfs_read) from [] (SyS_read+0x40/0x78)
[ 223.282974] [] (SyS_read) from [] (ret_fast_syscall+0x0/0x3c)
[ 223.290475] BUG: scheduling while atomic: cat/143/0x00000002
[ 223.296146] 2 locks held by cat/143:
[ 223.299730] #0: (&mm_data->mm_mutex){…}, at: [] mm_irq_disable+0x18/0x98 [mm_limc]
[ 223.310289] #1: (&(&mm_data->irq_lock)->rlock){…}, at: [] mm_irq_disable+0x24/0x98 [mm_limc]
[ 223.321371] Modules linked in: tty_dev(O) tty_host(O) usb(O) lcd_panel_LCD_103(O) lcd_hw_mpu(O) lcd_fb(O) sci_ncn8025(O) pcd_pn512(O) pcd_base(O) imx_es8374(O) snd_soc_fsl_asrc(O) snd_soc_fsl_sai(O) es8374(O) imx_pcm_dma(O) csi(O) mm_limc(O) msr(O) keypad_matrix(O) input_base(O) gpio_pca953x(O) Q25_M01_P00(O) devices_base(O) misc(O) logger(O) verify(O) device_node_manager(O) bbl(O) rtc(O)
[ 223.356681] CPU: 0 PID: 143 Comm: cat Tainted: G W O 4.1.15 #10
[ 223.363564] Hardware name: Freescale i.MX6 Ultralite (Device Tree)
[ 223.369777] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[ 223.377547] [] (show_stack) from [] (__schedule_bug+0x50/0x68)
[ 223.385146] [] (__schedule_bug) from [] (__schedule+0x5c/0x3d0)
[ 223.392829] [] (__schedule) from [] (schedule+0x90/0xac)
[ 223.399902] [] (schedule) from [] (synchronize_irq+0x6c/0x88)
[ 223.407437] [] (w) from [] (mm_irq_disable+0x44/0x98 [mm_limc])
[ 223.417089] [] (mm_irq_disable [mm_limc]) from [] (mm_update_with_bin_read+0x1f4/0x354 [mm_limc])
[ 223.429157] [] (mm_update_with_bin_read [mm_limc]) from [] (proc_reg_read+0x70/0x84)
[ 223.439357] [] (proc_reg_read) from [] (__vfs_read+0x28/0xc8)
[ 223.446863] [] (__vfs_read) from [] (vfs_read+0x84/0xe8)
[ 223.453939] [] (vfs_read) from [] (SyS_read+0x40/0x78)
[ 223.460839] [] (SyS_read) from [] (ret_fast_syscall+0x0/0x3c)
[ 223.468632] BUG: spinlock trylock failure on UP on CPU#0, irq/34-touch,ir/62
[ 223.475718] lock: 0xc87f814c, .magic: dead4ead, .owner: cat/143, .owner_cpu: 0
[ 223.483045] CPU: 0 PID: 62 Comm: irq/34-touch,ir Tainted: G W O 4.1.15 #10
[ 223.490887] Hardware name: Freescale i.MX6 Ultralite (Device Tree)
[ 223.497104] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[ 223.504881] [] (show_stack) from [] (do_raw_spin_trylock+0x44/0x50)
[ 223.512920] [] (do_raw_spin_trylock) from [] (_raw_spin_lock_irqsave+0x5c/0x90)
[ 223.522019] [] (_raw_spin_lock_irqsave) from [] (mm_i2c_isr+0x60/0xe0 [mm_limc])
[ 223.531900] [] (mm_i2c_isr [mm_limc]) from [] (irq_thread_fn+0x1c/0x34)
[ 223.540970] [] (irq_thread_fn) from [] (irq_thread+0xec/0x184)
[ 223.548564] [] (irq_thread) from [] (kthread+0xdc/0xf4)
[ 223.555553] [] (kthread) from [] (ret_from_fork+0x14/0x2c)
[ 240.562794] BUG: spinlock lockup suspected on CPU#0, irq/34-touch,ir/62
[ 240.569429] lock: 0xc87f814c, .magic: dead4ead, .owner: cat/143, .owner_cpu: 0
[ 240.576753] CPU: 0 PID: 62 Comm: irq/34-touch,ir Tainted: G W O 4.1.15 #10
[ 240.584590] Hardware name: Freescale i.MX6 Ultralite (Device Tree)
[ 240.590799] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[ 240.598568] [] (show_stack) from [] (do_raw_spin_lock+0xe4/0x124)
[ 240.606427] [] (do_raw_spin_lock) from [] (_raw_spin_lock_irqsave+0x78/0x90)
[ 240.615264] [] (_raw_spin_lock_irqsave) from [] (mm_i2c_isr+0x60/0xe0 [mm_limc])
[ 240.625138] [] (mm_i2c_isr [mm_limc]) from [] (irq_thread_fn+0x1c/0x34)
[ 240.634206] [] (irq_thread_fn) from [] (irq_thread+0xec/0x184)
[ 240.641797] [] (irq_thread) from [] (kthread+0xdc/0xf4)
[ 240.648785] [] (kthread) from [] (ret_from_fork+0x14/0x2c)
[ 240.656729] mm INFO line = 172 mm_irq_disable :
[ 240.662220] mm INFO line = 295 mm_update_with_bin_read :
[ 240.668455] mm INFO line = 1556 mm_read_tp_info : driver version 1.0.0.0
[ 240.682689] mm ERR line = 260 mm_i2c_transfer : mm_i2c_transfer err
[ 240.721053] mm ERR line = 260 mm_i2c_transfer : mm_i2c_transfer err mm ERR line = 260 mm_i2c_transfer : mm_i2c_transfer err
[ 240.788321] mm ERR line = 260 mm_i2c_transfer : mm_i2c_transfer err[ 240.815881] mm ERR line = 343 mm_i2c_write_and_read : mm_i2c_write_and_read, i2c write error, ret = -5
[ 241.026415] mm INFO line = 1439 mm_get_chip_type : ----------------------ILI Chip Type=0x0-------------------------
[ 241.055193] mm INFO line = 1444 mm_get_chip_type : ----------------------ILI Chip Type=0x17-------------------------
[ 241.078152] mm INFO line = 1439 mm_get_chip_type : ----------------------ILI Chip Type=0x9ff-------------------------
[ 241.106245] mm INFO line = 420 mm_reset : delay = 100
[ 241.222236] mm ERR line = 1763 tp_show : read TP info fail
[ 241.229920] mm INFO line = 1444 mm_get_chip_type : ----------------------ILI Chip Type=0xff-------------------------
[ 241.266931] mm INFO line = 420 mm_reset : delay = 100
[ 241.382894] mm INFO line = 297 mm_update_with_bin_read :
[ 241.389124] mm ERR line = 300 mm_update_with_bin_read : I2C error,don’t upgrate -1

从上面提示可以看出mm_i2c_isr 在获取锁失败,而这把锁正是被cat进程拿着,对应的节点的函数mm_update_with_bin_read中的mm_irq_disable 的确有获取这两把锁,

读取sys节点的接口函数如下:

static ssize_t mm_update_with_bin_read(struct file *pFile,
        char __user * buf, size_t nCount,
        loff_t * pPos)
{
    int ret = 0;
    struct file *filp;
    struct inode *inode;
    unsigned char tmpbuf[256] = { 0 };
    mm_segment_t fs;
    off_t fsize;
    unsigned char *save_hex_content = NULL;
    
    if (*pPos != 0) {
        return 0;
    }
    
    if (!(strstr(mm_bin_path, ".bin"))) {
        tp_log_err("mm_bin_path is invalid mm_bin_path = %s\n",
                   mm_bin_path);
        nCount = sprintf(tmpbuf,"mm_bin_path is invalid mm_bin_path = %s\n",
                    mm_bin_path);
        goto out;
    } else {
        filp = filp_open(mm_bin_path, O_RDONLY, 0);
        
        if (IS_ERR(filp)) {
            tp_log_err("hex File Open Error mm_bin_path = %s\n",
                       mm_bin_path);
            nCount =sprintf(tmpbuf,"hex File Open Error mm_bin_path = %s\n",
                        mm_bin_path);
            goto out;
        } else {
            tp_log_info("hex File Open Right,O_RDONLY %s\n",
                        mm_bin_path);
                        
            if (!filp->f_op) {
                tp_log_err("File Operation Method Error\n");
                nCount =sprintf(tmpbuf, "File Operation Method Error\n");
                goto out;
            } else {
                inode = file_inode(filp);
                fsize = inode->i_size;
                tp_log_info("File size:%d \n", (int)fsize);
                save_hex_content = (unsigned char *)vmalloc((int)fsize);
                    
                if (!(save_hex_content)) {
                    tp_log_err("alloctation save_hex_content memory failed\n");
                    nCount =sprintf(tmpbuf, "alloctation save_hex_content memory failed\n");
                    goto out;
                }
                
                fs = get_fs();
                set_fs(KERNEL_DS);
     			vfs_read(filp, save_hex_content, fsize, &(filp->f_pos));
                set_fs(fs);
                filp_close(filp, NULL);
            }
        }
    }    
    mm_irq_disable();    
    ret = mm_read_tp_info();
    
    if (ret < 0) {
        tp_log_err("I2C error,don't upgrate %d\n", ret);
        nCount = sprintf(tmpbuf, "I2C error,don't upgrate= %d\n", ret);
        goto out;
    } else {
        target_fw_update_ptr = save_hex_content;
       
        mm_data->firmware_updating = true;
        ret = mm_upgrade_firmware(mm_force_update);
        mm_data->firmware_updating = false;      

        if (ret < 0) {
            tp_log_err("upgrade fail ret = %d\n", ret);
            nCount = sprintf(tmpbuf, "upgrade fail ret = %d\n", ret);
            goto out;
        }
    }    
    ret = mm_read_tp_info();
    
    if (ret < 0)
        tp_log_err("read TP info fail\n");        
out:
    mm_irq_enable();    
    if (save_hex_content) {
        vfree(save_hex_content);
        save_hex_content = NULL;
    }    
    *pPos += nCount;
    return nCount;
}
void mm_irq_disable(void)
{
    unsigned long irqflag = 0;
    mutex_lock(&mm_data->mm_mutex);
    spin_lock_irqsave(&mm_data->irq_lock, irqflag);
 
    if ((mm_data->irq_status)) {
		  disable_irq(mm_data->client->irq);
        mm_data->irq_status = false;
 
    }
    spin_unlock_irqrestore(&mm_data->irq_lock, irqflag);	
    mutex_unlock(&mm_data->mm_mutex);
}

static irqreturn_t mm_i2c_isr(int irq, void *dev_id)
{
    unsigned long irqflag = 0;

    if (mm_data->firmware_updating) {
        tp_log_err("firmware_updating return\n");
        return IRQ_HANDLED;
    }
    tp_log_debug(PRINT_I2C_IRQ,"Enter i2c IRQ\n");    
    spin_lock_irqsave(&mm_data->irq_lock, irqflag);
    
    if (mm_data->irq_status) {
        mm_data->irq_status = false;
        disable_irq_nosync(mm_data->client->irq);
        mm_data->irq_trigger = true;
        wake_up_interruptible(&waiter);
    }
    
    spin_unlock_irqrestore(&mm_data->irq_lock, irqflag);
    return IRQ_HANDLED;
}
     ret =request_threaded_irq(mm_data->client->irq, NULL,
                                 mm_i2c_isr,
                                 IRQF_TRIGGER_LOW | IRQF_ONESHOT,
                                 "touch,irq-gpio", mm_data);
void synchronize_irq(unsigned int irq)
{
	struct irq_desc *desc = irq_to_desc(irq);

	if (desc) {
		__synchronize_hardirq(desc);
		/*
		 * We made sure that no hardirq handler is
		 * running. Now verify that no threaded handlers are
		 * active.
		 */
		wait_event(desc->wait_for_threads,
			   !atomic_read(&desc->threads_active));
	}
}

分析

从调试信息可以找出问题在于 :
mm_i2c_isr 获取不到mm_data->irq_lock,因为mm_irq_disable 函数获取了mm_data->irq_lock,那为啥没有释放这个锁,因为调用的disable_irq 会调用synchronize_irq,这里面在等待irq的中断处理完成,而中断处理线程 正是mm_i2c_isr ,因此这两个线程彼此等待,产生了死锁

思考

  1. 这个问题出现的概率很高,几乎是必现,可是从源码来看,mm_irq_disable 已经调用spin_lock_irqsave关中断了,为啥还能触发中断线程呢? 这个中断究竟是啥时来的?
    因为从现象上看是先由mm_irq_disable 获取锁,后进中断线程,中断却不可能在mm_irq_disable 拿到锁之后触发,而是在其cat 线程获取锁之前产生的,之后进入中断上下文处理,退出中断上下文退回到cat 线程中,而此时再获取了锁,可是由于前面分析的,刚才的中断还有中断线程没有完成,因此synchronize_irq会触发等待休眠。而一旦中断线程被调度就发生了死锁。
    2,从上面的解释,触发死锁的条件是在cat 线程开始到mm_irq_disable 执行期间产生了中断,那这时到底产生中断的概率有多大,通过示波器一看,这个中断的确有问题,触发频率很高,所以这个死锁的概率也就大了
    3.之前调用printk打印跟踪,这个死锁又没了,这是为什么,prink 会关中断,导致没有进入中断,但是理论上还是有点说不过去。只能是概率小了,并没有彻底解决啊,其他时间,中断一旦触发,依然会发生死锁才对,这个地方还是没有搞懂,希望有明白的告诉一下我。
  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值