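Hardware platform: MTK6797
Software version: Android 6.0
Problem description: Our hardware design is somewhat unusual. It is built on the 6797 and uses a USB-to-Ethernet port so the device can act as a server. Because of an oversight by the hardware solution vendor, an ASIX gigabit USB-to-Ethernet chip was wired to our USB 1.0 port. USB 1.0 severely limits what the NIC chip can do, but given the schedule we went ahead with it anyway. That was the start of a long road of fixing USB problems...
The problem first showed up when a system upgrade pulled the upgrade file from the server at high bandwidth, which crashed the kernel. Since the crash was triggered by network activity, the first thing to examine was the USB-to-Ethernet driver.
Part of the crash log follows: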
[ 161.458388] (0)[152:hps_main] IRQ occurrs in this duration, IRQ[105:musbfsh], dur:64077 ns (s:161458283615, e:161458347692)
[ 161.458388]
[ 161.473320] (0)[152:hps_main][TASKLET DURATION WARN] Tasklet:usbnet_bh+0x0/0x2d8, dur:14656693 ns > 10 ms,(s:161458501307,e:161473158000)
[ 161.473387] (0)[152:hps_main] IRQ occurrs in this duration, IRQ[105:musbfsh], dur:94538 ns (s:161473162538, e:161473257076)
[ 161.473387]
[ 161.473655] (0)[152:hps_main][SOFTIRQ DURATION WARN] SoftIRQ:6, dur:15132154 ns > 5 ms,(s:161458495692,e:161473627846)
[ 161.473788] (0)[152:hps_main] IRQ occurrs in this duration, IRQ[105:musbfsh], dur:59461 ns (s:161473662769, e:161473722230)
[ 161.473788]
[ 161.577681] (0)[152:hps_main][SOFTIRQ DURATION WARN] SoftIRQ:3, dur:103056846 ns > 5 ms,(s:161474588769,e:161577645615)
[ 161.577861] (0)[152:hps_main] IRQ occurrs in this duration, IRQ[105:musbfsh], dur:93692 ns (s:161577690154, e:161577783846)
[ 161.577861]
[ 161.593203] (0)[152:hps_main][TASKLET DURATION WARN] Tasklet:usbnet_bh+0x0/0x2d8, dur:14925692 ns > 10 ms,(s:161578221000,e:161593146692)
[ 161.593270] (0)[152:hps_main] IRQ occurrs in this duration, IRQ[105:musbfsh], dur:96308 ns (s:161593039923, e:161593136231)
[ 161.593270]
[ 161.593857] (0)[152:hps_main][SOFTIRQ DURATION WARN] SoftIRQ:6, dur:15613693 ns > 5 ms,(s:161578215538,e:161593829231)
[ 161.593918] (0)[152:hps_main] IRQ occurrs in this duration, IRQ[105:musbfsh], dur:94462 ns (s:161593724461, e:161593818923)
[ 161.593918]
[ 161.699285] (0)[152:hps_main][SOFTIRQ DURATION WARN] SoftIRQ:3, dur:103754231 ns > 5 ms,(s:161595495461,e:161699249692)
[ 161.699466] (0)[152:hps_main] IRQ occurrs in this duration, IRQ[105:musbfsh], dur:95000 ns (s:161699293231, e:161699388231)
[ 161.699466]
[ 161.715018] (0)[152:hps_main][TASKLET DURATION WARN] Tasklet:usbnet_bh+0x0/0x2d8, dur:15173539 ns > 10 ms,(s:161699788000,e:161714961539)
[ 161.715156] (0)[152:hps_main] IRQ occurrs in this duration, IRQ[105:musbfsh], dur:56615 ns (s:161715026539, e:161715083154)
[ 161.715156]
[ 161.715584] (0)[152:hps_main][SOFTIRQ DURATION WARN] SoftIRQ:6, dur:16034077 ns > 5 ms,(s:161699522923,e:161715557000)
[ 161.715641] (0)[152:hps_main] IRQ occurrs in this duration, IRQ[105:musbfsh], dur:53462 ns (s:161715449769, e:161715503231)
[ 161.715641]
[ 161.722630] -(4)[2300:busybox]HPS task info (run on CPU 0)
[ 161.722649] -(4)[2300:busybox]BUG: failure at /export/source/alps/kernel-3.18/drivers/misc/mediatek/base/power/mt6797/mt_hotplug_strategy_core.c:248/cpuhp_timer_func()!
[ 161.722663] -(4)[2300:busybox]Unable to handle kernel paging request at virtual address 0000dead
[ 161.722671] -(4)[2300:busybox]pgd = ffffffc02b4e6000
[ 161.722677] [0000dead] *pgd=000000006ae0c003, *pud=000000006ae0c003, *pmd=000000007b79d003, *pte=0000000000000000
[ 161.722707] -(4)[2300:busybox]Internal error: Oops: 96000047 [#1] PREEMPT SMP
[ 161.722723] -(5)[0:swapper/5]CPU5: stopping
[ 161.722736] -(5)[0:swapper/5]CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 3.18.22 #2
[ 161.722743] -(5)[0:swapper/5]Hardware name: MT6797 (DT)
[ 161.722750] -(5)[0:swapper/5]Call trace:
[ 161.722766] -(5)[0:swapper/5][<ffffffc00008cb28>] dump_backtrace+0x0/0x15c
[ 161.722776] -(5)[0:swapper/5][<ffffffc00008cc94>] show_stack+0x10/0x1c
[ 161.722788] -(5)[0:swapper/5][<ffffffc000a5b404>] dump_stack+0x74/0xb8
[ 161.722799] -(5)[0:swapper/5][<ffffffc00009628c>] handle_IPI+0x370/0x3a0
[ 161.722808] -(5)[0:swapper/5][<ffffffc000084564>] gic_handle_irq+0xd8/0xe0
[ 161.722816] -(5)[0:swapper/5]Exception stack(0xffffffc07aa5fd80 to 0xffffffc07aa5fea0)
[ 161.722827] -(5)[0:swapper/5]fd80: 0112db30 ffffffc0 a71de749 00000025 7aa5fed0 ffffffc0 0077bbc4 ffffffc0
[ 161.722837] -(5)[0:swapper/5]fda0: 60000145 00000000 00086e1c ffffffc0 0077bbc0 ffffffc0 00000000 00000000
[ 161.722846] -(5)[0:swapper/5]fdc0: 7d090000 00000000 7aa5fe60 ffffffc0 8497feb0 00042c1d 00007df6 00000000
[ 161.722856] -(5)[0:swapper/5]fde0: 00000018 00000000 0077bb08 ffffffc0 7aa5c000 ffffffc0 00000001 00000000
[ 161.722866] -(5)[0:swapper/5]fe00: 019a3000 ffffffc0 014a2000 ffffffc0 72792bf0 ffffffc0 00000001 00000000
[ 161.722875] -(5)[0:swapper/5]fe20: fffffffe 0fffffff 00000131 00000000 dc8cb016 cb88537f 00000084 00000000
[ 161.722885] -(5)[0:swapper/5]fe40: 00000005 00000000 0112db30 ffffffc0 a71de749 00000025 7e08d478 ffffffc0
[ 161.722894] -(5)[0:swapper/5]fe60: 00000005 00000000 00000000 00000000 a71c6de2 00000025 00000005 00000000
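Each DURATION WARN line in the log above carries its own measurement: a `dur:` value plus start (`s:`) and end (`e:`) timestamps in nanoseconds. A minimal user-space sketch in C for pulling those fields out of a saved log line might look like the following. The names `dur_warn` and `parse_dur_warn` are made up for illustration and are not part of the MTK kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of one "... DURATION WARN" or "IRQ occurrs" log line (hypothetical type). */
struct dur_warn {
	unsigned long long dur_ns;   /* value after "dur:" */
	unsigned long long start_ns; /* value after "(s:" */
	unsigned long long end_ns;   /* value after "e:" */
};

/* Returns 1 and fills *out if the line carries dur/s/e fields, otherwise 0. */
static int parse_dur_warn(const char *line, struct dur_warn *out)
{
	const char *p;

	assert(line != NULL && out != NULL);

	p = strstr(line, "dur:");
	if (p == NULL || sscanf(p, "dur:%llu ns", &out->dur_ns) != 1)
		return 0;

	p = strstr(p, "(s:");
	/* The space in the format matches zero or more blanks after the comma. */
	if (p == NULL ||
	    sscanf(p, "(s:%llu, e:%llu", &out->start_ns, &out->end_ns) != 2)
		return 0;

	return 1;
}
```

For the first usbnet_bh tasklet line in the log, this gives dur_ns = 14656693, and end_ns - start_ns comes to the same value, so a single tasklet run took roughly 14.7 ms.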
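As the log shows, before the crash the CPU load-balancing algorithm was being invoked frequently (the hps_main thread, the HPS task named in the BUG line). Load balancing can be understood simply as the scheduling mechanism the system uses to even out the workload across the cores of a multi-core processor. This tells us there was a lot of work in flight during the transfer and that the work on the cores had to be rebalanced again and again. That led us to the interrupts involved in network transfer: while data is moving, the USB-to-Ethernet interrupt fires at a very high rate. The interrupt's name is shown below: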
root@aeon6797_6c_m:/ # cat /proc/interrupts | grep musbfsh
105: 72286999 GICv3 105 musbfsh
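To put a number on how often this interrupt fires, the counter can be sampled twice and the difference divided by the sampling interval. The sketch below assumes the usual /proc/interrupts layout (IRQ number, one counter per online CPU, then the controller name); `irq_total` and `irq_rate` are hypothetical helper names, not an existing API.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters of one /proc/interrupts line; 0 if malformed. */
static unsigned long long irq_total(const char *line)
{
	const char *p;
	unsigned long long total = 0;

	assert(line != NULL);

	p = strchr(line, ':');          /* counters start after "105:" */
	if (p == NULL)
		return 0;
	p++;

	for (;;) {
		char *end;

		while (*p == ' ' || *p == '\t')
			p++;
		if (!isdigit((unsigned char)*p))
			break;          /* reached the controller name, e.g. "GICv3" */
		total += strtoull(p, &end, 10);
		p = end;
	}
	return total;
}

/* Interrupts per second between two samples taken `seconds` apart. */
static double irq_rate(unsigned long long before, unsigned long long after,
		       double seconds)
{
	if (seconds <= 0.0 || after < before)
		return 0.0;
	return (double)(after - before) / seconds;
}
```

Running the grep above twice, one second apart, and feeding both lines through `irq_total` and then `irq_rate` gives the per-second figure directly.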
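This interrupt fires tens of thousands of times per second. Because USB 1.0 has limited capacity, busy conditions are hard to avoid when it has to handle interrupts at such a rate, and at that point the system, as a safety measure, hits a BUG_ON check that was placed in the driver. BUG_ON works as follows:
What BUG_ON does: a few kernel calls make it easy to flag bugs, assert conditions, and print information. The two most common are BUG() and BUG_ON().
When called, they trigger an oops, which produces a stack backtrace and prints error information. How these statements cause an oops depends on the hardware architecture.
Most architectures define BUG() and BUG_ON() as some kind of illegal operation, which naturally produces the desired oops. You can use these calls as assertions when you want to assert that some condition should never happen:
if (bad_thing)
	BUG(); // requires the kernel option General setup -> Configure standard kernel features -> BUG() support
Or use the better form: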
BUG_ON(bad_thing);
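panic() can be used to raise a more serious error. Calling panic() not only prints an error message (oops) but also halts the whole system. Obviously, you should only use it in extremely bad situations:
if (terrible_thing)
	panic("foo is %ld\n", foo);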
The code that triggers this event, together with our change, is shown below:
--- a/kernel-3.18/drivers/misc/mediatek/usb11/musbfsh_hsdma.c
+++ b/kernel-3.18/drivers/misc/mediatek/usb11/musbfsh_hsdma.c
@@ -191,8 +191,8 @@ static int dma_channel_program(struct dma_channel *channel,
musbfsh_channel->transmit ? "Tx" : "Rx", packet_sz,
(unsigned int)dma_addr, len, mode);
- BUG_ON(channel->status == MUSBFSH_DMA_STATUS_UNKNOWN ||
- channel->status == MUSBFSH_DMA_STATUS_BUSY);
+ //BUG_ON(channel->status == MUSBFSH_DMA_STATUS_UNKNOWN ||
+ // channel->status == MUSBFSH_DMA_STATUS_BUSY);
channel->actual_len = 0;
musbfsh_channel->start_addr = dma_addr;
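With the change above, the BUG_ON no longer fires when the channel status is BUSY or UNKNOWN. ASIX's official explanation is that the problem comes from connecting their USB-to-Ethernet chip over USB 1.0, whose processing capacity is limited; with a USB 2.0 or faster connection the problem does not occur.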
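Commenting out the check means the function goes on to reprogram a channel that may still be busy. A more conservative variant would log the condition and refuse the request instead of either crashing or silently continuing. The following is only a user-space model of that idea in C: the type and function names are hypothetical, and whether the callers in the musbfsh driver cope with a zero return value is an assumption that has not been verified here. In the kernel, the logging step would typically be a WARN_ON_ONCE() rather than fprintf().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-in for the driver's channel states (hypothetical). */
enum dma_model_status {
	DMA_MODEL_UNKNOWN,
	DMA_MODEL_FREE,
	DMA_MODEL_BUSY
};

/* Simplified stand-in for a DMA channel (hypothetical). */
struct dma_channel_model {
	enum dma_model_status status;
	size_t actual_len;
	unsigned int rejected;  /* how many requests were refused */
};

/* Returns 1 if the channel was programmed, 0 if the request was refused. */
static int channel_program_model(struct dma_channel_model *ch)
{
	assert(ch != NULL);

	if (ch->status == DMA_MODEL_UNKNOWN || ch->status == DMA_MODEL_BUSY) {
		/* Warn and back off instead of BUG_ON() or silently continuing. */
		ch->rejected++;
		fprintf(stderr, "dma model: channel not ready (status %d)\n",
			(int)ch->status);
		return 0;
	}

	ch->actual_len = 0;
	ch->status = DMA_MODEL_BUSY;
	return 1;
}
```

The trade-off is that a refused request has to be handled somewhere, for example by retrying later or falling back to a non-DMA path, so this is a direction to evaluate rather than a drop-in replacement for the two-line change.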
And with that, the long stalemate over this USB problem was finally broken...
After what felt like decades of debugging, two lines settled it.