复现步骤
执行:
cts-tf > run cts -m CtsNetTestCases
后概率复现;
初步分析
通过多次尝试,发现大多集中在如下这项测试项:
android.net.wifi.cts.WifiManagerTest#testStartLocalOnlyHotspotSingleRequestByApps
单跑此项后验证,确实有较高概率导致system_server进程崩溃,Android上层重启:
cts-tf > run cts -m CtsNetTestCases -t android.net.wifi.cts.WifiManagerTest#testStartLocalOnlyHotspotSingleRequestByApps
系统崩溃日志节选如下:
01-13 17:51:36.588 1338 1486 E AndroidRuntime: *** FATAL EXCEPTION IN SYSTEM PROCESS: android.io
01-13 17:51:36.588 1338 1486 E AndroidRuntime: java.lang.IllegalStateException: command '110 network route add local wlan1 fe80::/64' failed with '400 110 addRoute() failed (No such device)'
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.NetworkManagementService.modifyRoute(NetworkManagementService.java:1176)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.NetworkManagementService.addInterfaceToLocalNetwork(NetworkManagementService.java:2673)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.connectivity.tethering.TetherInterfaceStateMachine.configureLocalIPv6Routes(TetherInterfaceStateMachine.java:386)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.connectivity.tethering.TetherInterfaceStateMachine.setRaParams(TetherInterfaceStateMachine.java:446)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.connectivity.tethering.TetherInterfaceStateMachine.updateUpstreamIPv6LinkProperties(TetherInterfaceStateMachine.java:338)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.connectivity.tethering.TetherInterfaceStateMachine.access$700(TetherInterfaceStateMachine.java:67)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.connectivity.tethering.TetherInterfaceStateMachine$BaseServingState.processMessage(TetherInterfaceStateMachine.java:570)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.connectivity.tethering.TetherInterfaceStateMachine$LocalHotspotState.processMessage(TetherInterfaceStateMachine.java:607)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.internal.util.StateMachine$SmHandler.processMsg(StateMachine.java:992)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.internal.util.StateMachine$SmHandler.handleMessage(StateMachine.java:809)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at android.os.Handler.dispatchMessage(Handler.java:106)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at android.os.Looper.loop(Looper.java:193)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at android.os.HandlerThread.run(HandlerThread.java:65)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.ServiceThread.run(ServiceThread.java:44)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: Caused by: com.android.server.NativeDaemonConnector$NativeDaemonFailureException: command '110 network route add local wlan1 fe80::/64' failed with '400 110 addRoute() failed (No such device)'
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:515)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:411)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:406)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:391)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: at com.android.server.NetworkManagementService.modifyRoute(NetworkManagementService.java:1172)
01-13 17:51:36.588 1338 1486 E AndroidRuntime: ... 13 more
01-13 17:51:36.589 1338 1486 I am_crash: [1338,0,system_server,-1,com.android.server.NativeDaemonConnector$NativeDaemonFailureException,command '110 network route add local wlan1 fe80::/64' failed with '400 110 addRoute() failed (No such device)',NativeDaemonConnector.java,515]
从这里可以一目了然看到异常的调用栈;
逻辑分析
这里就略去逐步分析的过程,直接上分析结果:
- 这项测试项是测试第三方应用是否可以重复调用
WifiManager.startLocalOnlyHotspot()
(期望值为不能) - 测试会调用两次
WifiManager.startLocalOnlyHotspot()
,在调用第二次时会捕获framework抛出的IllegalStateException
,并在之后调用WifiManager.stopLocalOnlyHotspot()
将其关闭; - 而问题就出在了最后的
WifiManager.stopLocalOnlyHotspot()
:
正常情况
WifiServiceImpl.stopLocalOnlyHotspot()
会发出android.net.wifi.WIFI_AP_STATE_CHANGED
广播,状态为WifiManager.WIFI_AP_STATE_DISABLING
;Tethering
会收到该广播,并调用TetherInterfaceStateMachine.unwanted()
进行路由规则的反注册;- 与此同时,
WifiServiceImpl
会调用HalDeviceManager.removeIfaceInternal()
通知HAL服务将该接口销毁; - 虽然广播与销毁接口的调用并非线程安全,但由于调用存在先后顺序,且
Tethering
本身也是运行在system_server
进程中,因此正常情况下时序还是如上图所示,不会出问题的;
异常情况
Tethering
负责监听android.net.wifi.WIFI_AP_STATE_CHANGED
广播的线程为IoThread
,因此,如果次线程出现拥堵,导致广播接收出现延时,那么就会导致如下情况:
图中黄色部分表示异常部分,而红色则是系统崩溃的发生点:
- 如图所示,由于
WifiServiceImpl.startLocalOnlyHotspot()
调用的广播送达Tethering
之后,运行在IoThread
的TetherInterfaceStateMachine
并未在期望的时间内完成与Netd
进程的交互,以至于在而后的WifiServiceImpl.stopLocalOnlyHotspot()
调用后销毁接口已经完成之后,TetherInterfaceStateMachine
仍还在尝试为该接口添加路由规则。从而因此了异常,导致system_server
异常重启;
解决方案
具体方案就不提供了,这里提供三个思路供参考:
- 问题发生于
startLocalOnlyHotspot()
与stoptLocalOnlyHotspot()
之间无间隔的快速调用,因此在此处添加一个必要时间窗口判断,并做对应的延时,是可以规避该问题的;(固定延时,还是监听Tethering
状态回调来讲无序调用变成有序调用均可) - 问题的根本在于
TetherInterfaceStateMachine
对已经销毁的接口发送添加路由规则的请求,因此在发送前对接口是否被销毁进行判断,也可以规避该问题; - 广播
android.net.wifi.WIFI_AP_STATE_CHANGED
未能及时送达,是因为接收者处于一个可能会繁忙的线程中,因此将其从该线程剥离,或者使用Handler.sendMessageAtFrontOfQueue()
来确保其优先级,也是可以考虑的一个方向;