1. 系统运行环境
客观的反应系统的执行环境,通常包括如CPU 利用率,Memory 使用情况, Storage 剩余情况等。这些资料也非常重要,比如可以快速的知道,当时是否有Process 在疯狂的执行,当时是不是处于严重的low memory 情况, Storage 是否有耗尽的情况发生等。
客观的反应系统的执行环境,通常包括如CPU 利用率,Memory 使用情况, Storage 剩余情况等。这些资料也非常重要,比如可以快速的知道,当时是否有Process 在疯狂的执行,当时是不是处于严重的low memory 情况, Storage 是否有耗尽的情况发生等。
2. CPU Usage
追查CPU 利用率可大体的知道,当时机器是否有Process 在疯狂的运行, 当时系统运行是否繁忙。通常死机分析,只需要抓取基本的使用情况即可。下面说一下一般的抓取方式
追查CPU 利用率可大体的知道,当时机器是否有Process 在疯狂的运行, 当时系统运行是否繁忙。通常死机分析,只需要抓取基本的使用情况即可。下面说一下一般的抓取方式
2.1 top
top 可以简单的查询Cpu 的基本使用情况
Usage: top [ -m max_procs ] [ -n iterations ] [ -d delay ] [ -s sort_column ] [ -t ] [ -h ]
-m num Maximum number of processes to display.
-n num Updates to show before exiting.
-d num Seconds to wait between updates.
-s col Column to sort by (cpu,vss,rss,thr).
-t Show threads instead of processes.
-h Display this help screen.
注意的是top 的CPU% 是按全部CPU 来计算的,如果以单线程来计算,比如当时有开启4个核心,那么最多吃到25%.
个人常见的操作方式如: top -t -m 5 -n 2
top 可以简单的查询Cpu 的基本使用情况
Usage: top [ -m max_procs ] [ -n iterations ] [ -d delay ] [ -s sort_column ] [ -t ] [ -h ]
-m num Maximum number of processes to display.
-n num Updates to show before exiting.
-d num Seconds to wait between updates.
-s col Column to sort by (cpu,vss,rss,thr).
-t Show threads instead of processes.
-h Display this help screen.
注意的是top 的CPU% 是按全部CPU 来计算的,如果以单线程来计算,比如当时有开启4个核心,那么最多吃到25%.
个人常见的操作方式如: top -t -m 5 -n 2
I.E 正常情况
User 2%, System 12%, IOW 0%, IRQ 0%
User 14 + Nice 0 + Sys 67 + Idle 471 + IOW 0 + IRQ 0 + SIRQ 2 = 554
User 14 + Nice 0 + Sys 67 + Idle 471 + IOW 0 + IRQ 0 + SIRQ 2 = 554
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
2423 2423 1 8% R 2316K 1128K root top top
270 270 0 1% S 2160K 924K root aee_resmon /system/bin/aee_resmon
159 159 0 0% D 0K 0K root bat_thread_kthr
3 3 0 0% S 0K 0K root ksoftirqd/0
57 57 0 0% D 0K 0K root hps_main
2423 2423 1 8% R 2316K 1128K root top top
270 270 0 1% S 2160K 924K root aee_resmon /system/bin/aee_resmon
159 159 0 0% D 0K 0K root bat_thread_kthr
3 3 0 0% S 0K 0K root ksoftirqd/0
57 57 0 0% D 0K 0K root hps_main
User 1%, System 7%, IOW 0%, IRQ 0%
User 10 + Nice 0 + Sys 41 + Idle 494 + IOW 0 + IRQ 0 + SIRQ 0 = 545
User 10 + Nice 0 + Sys 41 + Idle 494 + IOW 0 + IRQ 0 + SIRQ 0 = 545
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
2423 2423 1 8% R 2324K 1152K root top top
57 57 0 0% D 0K 0K root hps_main
242 419 0 0% S 8600K 4540K shell mobile_log_d /system/bin/mobile_log_d
982 991 0 0% S 4364K 1156K media_rw sdcard /system/bin/sdcard
272 272 0 0% S 30680K 9048K root em_svr /system/bin/em_svr
从上面可以看出, 系统基本运行正常,没有很吃CPU 的进程.
2423 2423 1 8% R 2324K 1152K root top top
57 57 0 0% D 0K 0K root hps_main
242 419 0 0% S 8600K 4540K shell mobile_log_d /system/bin/mobile_log_d
982 991 0 0% S 4364K 1156K media_rw sdcard /system/bin/sdcard
272 272 0 0% S 30680K 9048K root em_svr /system/bin/em_svr
从上面可以看出, 系统基本运行正常,没有很吃CPU 的进程.
I.E 异常情况
User 59%, System 4%, IOW 2%, IRQ 0%
User 1428 + Nice 0 + Sys 110 + Idle 811 + IOW 67 + IRQ 0 + SIRQ 1 = 2417
User 1428 + Nice 0 + Sys 110 + Idle 811 + IOW 67 + IRQ 0 + SIRQ 1 = 2417
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
16132 32195 3 14% R 997100K 53492K bg u0_a60 Thread-1401 com.android.mms
16132 32190 1 14% R 997100K 53492K bg u0_a60 Thread-1400 com.android.mms
16132 32188 2 14% R 997100K 53492K bg u0_a60 Thread-1399 com.android.mms
16132 32187 0 14% R 997100K 53492K bg u0_a60 Thread-1398 com.android.mms
18793 18793 4 1% R 2068K 1020K shell top top
16132 32195 3 14% R 997100K 53492K bg u0_a60 Thread-1401 com.android.mms
16132 32190 1 14% R 997100K 53492K bg u0_a60 Thread-1400 com.android.mms
16132 32188 2 14% R 997100K 53492K bg u0_a60 Thread-1399 com.android.mms
16132 32187 0 14% R 997100K 53492K bg u0_a60 Thread-1398 com.android.mms
18793 18793 4 1% R 2068K 1020K shell top top
User 67%, System 3%, IOW 7%, IRQ 0%
User 1391 + Nice 0 + Sys 75 + Idle 435 + IOW 146 + IRQ 0 + SIRQ 1 = 2048
User 1391 + Nice 0 + Sys 75 + Idle 435 + IOW 146 + IRQ 0 + SIRQ 1 = 2048
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
16132 32195 3 16% R 997100K 53492K bg u0_a60 Thread-1401 com.android.mms
16132 32188 2 16% R 997100K 53492K bg u0_a60 Thread-1399 com.android.mms
16132 32190 0 16% R 997100K 53492K bg u0_a60 Thread-1400 com.android.mms
16132 32187 1 16% R 997100K 53492K bg u0_a60 Thread-1398 com.android.mms
18793 18793 4 2% R 2196K 1284K shell top top
16132 32195 3 16% R 997100K 53492K bg u0_a60 Thread-1401 com.android.mms
16132 32188 2 16% R 997100K 53492K bg u0_a60 Thread-1399 com.android.mms
16132 32190 0 16% R 997100K 53492K bg u0_a60 Thread-1400 com.android.mms
16132 32187 1 16% R 997100K 53492K bg u0_a60 Thread-1398 com.android.mms
18793 18793 4 2% R 2196K 1284K shell top top
可以明显的看到,贵司的mms 的4个thread 都有进入了deadloop, 分别占用了一个cpu core. 同时可以快速抓取他们的java trace, 进一步可以看到当时MMS 的四个backtrace, 以便快速分析.
"Thread-1401" prio=5 tid=32 SUSPENDED JIT
| group="main" sCount=1 dsCount=0 obj=0x4264f860 self=0x7b183558
| sysTid=32195 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2078705952
| state=S schedstat=( 3284456714198 104216273858 383002 ) utm=324720 stm=3725 core=5
at com.yulong.android.mms.c.f.d(MmsChatDataServer.java:~1095)
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:7582)
at java.lang.Thread.run(Thread.java:841)
| group="main" sCount=1 dsCount=0 obj=0x4264f860 self=0x7b183558
| sysTid=32195 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2078705952
| state=S schedstat=( 3284456714198 104216273858 383002 ) utm=324720 stm=3725 core=5
at com.yulong.android.mms.c.f.d(MmsChatDataServer.java:~1095)
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:7582)
at java.lang.Thread.run(Thread.java:841)
"Thread-1400" prio=5 tid=31 SUSPENDED JIT
| group="main" sCount=1 dsCount=0 obj=0x41f5d8f0 self=0x7be2a8e8
| sysTid=32190 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2078029504
| state=S schedstat=( 3284905134412 105526230562 382946 ) utm=324805 stm=3685 core=5
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
at java.lang.Thread.run(Thread.java:841)
| group="main" sCount=1 dsCount=0 obj=0x41f5d8f0 self=0x7be2a8e8
| sysTid=32190 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2078029504
| state=S schedstat=( 3284905134412 105526230562 382946 ) utm=324805 stm=3685 core=5
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
at java.lang.Thread.run(Thread.java:841)
"Thread-1399" prio=5 tid=30 SUSPENDED JIT
| group="main" sCount=1 dsCount=0 obj=0x42564d28 self=0x7b0e6838
| sysTid=32188 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2077662640
| state=S schedstat=( 3288042313685 103203810616 375959 ) utm=325143 stm=3661 core=7
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
at java.lang.Thread.run(Thread.java:841)
| group="main" sCount=1 dsCount=0 obj=0x42564d28 self=0x7b0e6838
| sysTid=32188 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2077662640
| state=S schedstat=( 3288042313685 103203810616 375959 ) utm=325143 stm=3661 core=7
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
at java.lang.Thread.run(Thread.java:841)
"Thread-1398" prio=5 tid=29 SUSPENDED
| group="main" sCount=1 dsCount=0 obj=0x4248e5a8 self=0x7be0d128
| sysTid=32187 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2079251904
| state=S schedstat=( 3287248372432 105116936413 379634 ) utm=325055 stm=3669 core=6
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
at java.lang.Thread.run(Thread.java:841)
| group="main" sCount=1 dsCount=0 obj=0x4248e5a8 self=0x7be0d128
| sysTid=32187 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2079251904
| state=S schedstat=( 3287248372432 105116936413 379634 ) utm=325055 stm=3669 core=6
at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
at java.lang.Thread.run(Thread.java:841)
当时处于suspend, 即意味着当时这四个thread 正在执行java code, 而抓取backtrace 时强制将thread suspend 。看起来客户改动所致,并且客户有Proguard, 麻烦客户自己review 代码。
2.2 cputime
通常用cputime 来打印一段时间内, CPU 的利用率的统计情况, 资讯比top 更加详细.
通常用cputime 来打印一段时间内, CPU 的利用率的统计情况, 资讯比top 更加详细.
cputime -h
Usage: cputime [-start/-stop] [-n iterations] [-d delay] [-e time] [ -m max_count ] [-p] [-s sort_column] [-i id] [-h]
-start Start cpu time monitor.
-stop Stop cpu time monitor.
-n num Updates to show before exiting.
-d num Seconds to wait between updates.
-m num Maximum number of information to display.
-e num Enable CPU time monitor and stop monitor after num seconds. If no this parameter will
show last cputime monitor data.
-p Show process instead of thread. If no this parameter default will show thread information.
-s col Column to sort by time/user/kernel/id/isr_c/isr_t(cputime/user time/kernel time/id/isr_count/
isr_time). If no this parameter default will sort by cputime.
-i id show isr information of thread id.
-h Display this help screen.
* Example *
cputime -e 100: Enable cputime monitor and stop after 100 seconds. Then show thread cputime, and sort by cputime.
cputime -e 200 -s user: Enable cputime monitor and stop after 200 seconds. Then show thread cputime, and sort by user time.
cputime: Show thread cputime, and sort by cputime.
cputime -p -s id: Show process cputime, and sort by process id.
cputime -e 100: Enable cputime monitor and stop after 100 seconds. Then show thread cputime, and sort by cputime.
cputime -e 200 -s user: Enable cputime monitor and stop after 200 seconds. Then show thread cputime, and sort by user time.
cputime: Show thread cputime, and sort by cputime.
cputime -p -s id: Show process cputime, and sort by process id.
2.3 ftrace
ftrace 可以纪录CPU 最为详细的执行情况, 即linux scheduler 的执行情况. 通常默认只开启 sched_switch.
如何抓取ftrace 可以查询相关的FAQ.
ftrace 可以纪录CPU 最为详细的执行情况, 即linux scheduler 的执行情况. 通常默认只开启 sched_switch.
如何抓取ftrace 可以查询相关的FAQ.
2.4 Kernel core status
有的时候我们需要追查一下,当时Kernel 的基本调度情况,以及接收中断的情况,以判断当前CPU 执行的基本情况是否异常。比如有时候如果某个中断上来太过频繁,就容易导致系统运行缓慢,甚至死机。
* CPU Sched status
adb shell cat proc/sched_debug
Use sysrq-trigger
有的时候我们需要追查一下,当时Kernel 的基本调度情况,以及接收中断的情况,以判断当前CPU 执行的基本情况是否异常。比如有时候如果某个中断上来太过频繁,就容易导致系统运行缓慢,甚至死机。
* CPU Sched status
adb shell cat proc/sched_debug
Use sysrq-trigger
* CPU interrupts
adb shell cat proc/interrupts
adb shell cat proc/irq/irq_id/spurious
adb shell cat proc/interrupts
adb shell cat proc/irq/irq_id/spurious
3. Memory Usage
Memory Usage, 我们通常会审查,系统当时memory 是否足够, 是否处于low memory 状态, 是否可能出现因无法申请到memory 而卡死的情况.
Memory Usage, 我们通常会审查,系统当时memory 是否足够, 是否处于low memory 状态, 是否可能出现因无法申请到memory 而卡死的情况.
常见的一些基本信息如下:
* meminfo: basic memory status
adb shell cat proc/meminfo
adb shell cat proc/pid/maps
adb shell cat proc/pid/smaps
* procrank info: all process memory status
adb shell procrank
adb shell procmem pid
adb shell dumpsys meminfo pid
* zoneinfo:
adb shell cat proc/zoneinfo
* buddyinfo:
adb shell cat /proc/buddyinfo
4. Storage Usage
查看Storage 的情况,通常主要是查询data 分区是否已经刷满, sdcard 是否已经刷满, 剩余的空间是否足够。以及是否有产生超大文件等。
通常使用的命令如 df
df
Filesystem Size Used Free Blksize
/dev 446.0M 128.0K 445.8M 4096
/sys/fs/cgroup 446.0M 12.0K 445.9M 4096
/mnt/secure 446.0M 0.0K 446.0M 4096
/mnt/asec 446.0M 0.0K 446.0M 4096
/mnt/obb 446.0M 0.0K 446.0M 4096
/system 1.2G 915.3M 355.5M 4096
/data 1.1G 136.7M 1010.1M 4096
/cache 106.2M 48.0K 106.2M 4096
/protect_f 4.8M 52.0K 4.8M 4096
/protect_s 4.8M 48.0K 4.8M 4096
/mnt/cd-rom 1.2M 1.2M 0.0K 2048
/mnt/media_rw/sdcard0 4.6G 1.1G 3.4G 32768
/mnt/secure/asec 4.6G 1.1G 3.4G 32768
/storage/sdcard0 4.6G 1.1G 3.4G 32768
查看Storage 的情况,通常主要是查询data 分区是否已经刷满, sdcard 是否已经刷满, 剩余的空间是否足够。以及是否有产生超大文件等。
通常使用的命令如 df
df
Filesystem Size Used Free Blksize
/dev 446.0M 128.0K 445.8M 4096
/sys/fs/cgroup 446.0M 12.0K 445.9M 4096
/mnt/secure 446.0M 0.0K 446.0M 4096
/mnt/asec 446.0M 0.0K 446.0M 4096
/mnt/obb 446.0M 0.0K 446.0M 4096
/system 1.2G 915.3M 355.5M 4096
/data 1.1G 136.7M 1010.1M 4096
/cache 106.2M 48.0K 106.2M 4096
/protect_f 4.8M 52.0K 4.8M 4096
/protect_s 4.8M 48.0K 4.8M 4096
/mnt/cd-rom 1.2M 1.2M 0.0K 2048
/mnt/media_rw/sdcard0 4.6G 1.1G 3.4G 32768
/mnt/secure/asec 4.6G 1.1G 3.4G 32768
/storage/sdcard0 4.6G 1.1G 3.4G 32768
以及ls, du 等命令,如du
du -help
usage: du [-H | -L | -P] [-a | -d depth | -s] [-cgkmrx] [file ...]
du -LP -d 1
8 ./lost+found
88 ./local
384 ./misc
48 ./nativebenchmark
912 ./nativetest
8 ./dontpanic
13376 ./data
8 ./app-private
8 ./app-asec
129424 ./app-lib
8 ./app
136 ./property
16 ./ssh
116312 ./dalvik-cache
8 ./resource-cache
48 ./drm
8 ./mediadrm
8 ./security
3888 ./nvram
8 ./amit
8 ./acdapi
88 ./@btmtk
32 ./sec
8 ./user
16 ./media
16 ./agps_supl
8 ./anr
8 ./gps_mnl
8 ./nfc_socket
16 ./ccci_cfg
32 ./mdlog
1312 ./system
176 ./recovery
32 ./backup
274688 .