Collectl is a powerful tool to monitor system resources on Linux

最新推荐文章于 2023-08-05 01:56:17 发布

yuzhoudiwang

最新推荐文章于 2023-08-05 01:56:17 发布

阅读量1k

点赞数

分类专栏： linux 系统配置

linux 系统配置专栏收录该内容

9 篇文章 0 订阅

订阅专栏

Monitoring system resources

Linux system admins often need to monitor system resources like cpu, memory, disk, network etc to make sure that the system is in a good condition. And there are plenty of commands like iotop, top, free, htop, sar etc to do the task. Today we shall take a look at a tool called collectl that can be used to measure, monitor and analyse system performance on linux.

Collectl is a nifty little program that does a lot more than most other tools. It comes with a extensive set of options that allow users to not only measure the values of multiple different system metrics but also save the data for later analysis. Unlike other tools, which are designed to measure only a specific system parameter, collectl can monitor different parameters at the same time and report them in a suitable manner.

Unlike most monitoring tools that either focus on a small set of statistics, format their output in only one way, run either interatively or as a daemon but not both, collectl tries to do it all. You can choose to monitor any of a broad set of subsystems which currently include buddyinfo, cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp.

Take a peek at the command before we start digging deeper.

$ collectl
waiting for 1 second sample...
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut 
   0   0   864   1772      0      0      0      0      0      1      0       0 
   5   2  1338   2734      0      0      8      2      0      0      0       1 
   1   0  1222   2647      0      0     92      3      0      2      0       1 
   1   0   763   1722      0      0     80      3      0      1      0       2

The cpu usage, disk io, and network activity is being logged every second. The data is not difficult to read for those who understand it. The list keeps growing at a defined time interval and is easily loggable to a file. And collectl provides necessary options to record, search and do other useful things with the data.

Install collectl

Ubuntu/Debian and the likes have Collectl is available in the default repositories, so just apt it.

$ sudo apt-get install collectl

Fedora/CentOS too have it in the repos, so grab it with yum.

Usage

Essential theory - Collectl subsystems

Different types of system resources that can be measured are called subsystems. Like cpu, memory, network bandwidth and so on. If you just run the collectl command, it will show the cpu, disk and network subsystems in a batch mode output. That has already been shown above.

According to the man page, collectl identifies the following subsystems.

SUMMARY SUBSYSTEMS

b - buddy info (memory fragmentation)
c - CPU
d - Disk
f - NFS V3 Data
i - Inode and File System
j - Interrupts
l - Lustre
m - Memory
n - Networks
s - Sockets
t - TCP
x - Interconnect
y - Slabs (system object caches)

DETAIL SUBSYSTEMS

This is the set of detail data from which in most cases the corresponding summary data is derived.  There are currently 2 types that
do not have corresponding summary data and those are "Environmental" and "Process".  So, if one has 3 disks  and  chooses  -sd,  one
will  only  see a single total taken across all 3 disks.  If one chooses -sD, individual disk totals will be reported but no totals.
Choosing -sdD will get you both.

C - CPU
D - Disk
E - Environmental data (fan, power, temp),  via ipmitool
F - NFS Data
J - Interrupts
L - Lustre OST detail OR client Filesystem detail
M - Memory node data, which is also known as numa data
N - Networks
T - 65 TCP counters only available in plot format
X - Interconnect
Y - Slabs (system object caches)
Z - Processes

To monitor and measure a particular subsystem use the "-s" option and add the subsytem identifier to it. Now lets try out a few examples.

1. Monitor cpu usage

To monitor just the summary of cpu usage use "-sc"

$ collectl -sc
waiting for 1 second sample...
#<--------CPU-------->
#cpu sys inter  ctxsw 
   3   0  1800   3729 
   3   0  1767   3599

To observe each cpu individually, use "C". It will output multiple lines together, one for each cpu.

$ collectl -sC
waiting for 1 second sample...

# SINGLE CPU STATISTICS
#   Cpu  User Nice  Sys Wait IRQ  Soft Steal Idle
      0     3    0    0    0    0    0     0   96
      1     3    0    0    0    0    0     0   96
      2     2    0    0    0    0    0     0   97
      3     1    0    0    0    0    0     0   98
      0     2    0    0    0    0    0     0   97
      1     2    0    2    0    0    0     0   95
      2     1    0    0    0    0    0     0   98
      3     4    0    1    0    0    0     0   95

Using the C and c option together will fetch you both individual measures and the summary stats in a mmore comprehensive manner, if you need.

2. Monitor memory

Use the m subsystem to check the memory

$ collectl -sm
waiting for 1 second sample...
#<-----------Memory----------->
#Free Buff Cach Inac Slab  Map 
   2G 220M   1G   1G 210M   3G 
   2G 220M   1G   1G 210M   3G 
   2G 220M   1G   1G 210M   3G

Should not be difficult to interpret.
The M option would give further details about the memory.

$ collectl -sM
waiting for 1 second sample...

# MEMORY STATISTICS 
# Node    Total     Used     Free     Slab   Mapped     Anon   Locked    Inact Hit%
     0    7975M    5939M    2036M  215720K  372184K        0    6652K    1434M    0
     0    7975M    5939M    2036M  215720K  372072K        0    6652K    1433M    0

Does that look similar to what free reports ?

3. Check disk usage

The d and D options provide the summary and details on disk usage.

$ collectl -sd
waiting for 1 second sample...
#<----------Disks----------->
#KBRead  Reads KBWrit Writes 
      4      1    136     24 
      0      0     80     13

$ collectl -sD
waiting for 1 second sample...

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sda              0      0    0    0       0      0    0    0       0     0     0      0    0
sda              0      0    0    0       0      0    0    0       0     0     0      0    0
sda              1      0    2    1      17      1    5    3       2     2     6      2    1
sda              0      0    0    0      92     11    5   18      18     1    12     12    5

Another option that provides extended information is the "--verbose" option. It expands the summary to include more information but is not identical to using D.

$ collectl -sd --verbose

4. Report multiple systems together

So lets say you want a report of cpu, memory and disk io together, then use the subsystems together.

$ collectl -scmd
waiting for 1 second sample...
#<--------CPU--------><-----------Memory-----------><----------Disks----------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes 
   4   0  2187   4334   1G 221M   1G   1G 210M   3G      0      0      0      0 
   3   0  1896   4065   1G 221M   1G   1G 210M   3G      0      0     20      5

5. Display time with the stats

To display the time in each line along with the measurements, use the T option. And over that, to specify options, you need to use the "-o" switch.

$ collectl -scmd -oT
waiting for 1 second sample...
#         <--------CPU--------><-----------Memory-----------><----------Disks----------->
#Time     cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map KBRead  Reads KBWrit Writes 
12:03:05    3   0  1961   4013   1G 225M   1G   1G 212M   3G      0      0      0      0 
12:03:06    3   0  1884   3810   1G 225M   1G   1G 212M   3G      0      0      0      0 
12:03:07    3   0  2011   4060   1G 225M   1G   1G 212M   3G      0      0      0      0

You could also display the time in milliseconds with "-oTm".

6. Change sample count

Every row the collectl reports is a snapshot or sample. And it takes these snapshots at regular intervals, say 1 second. The i option sets the interval and c option sets the sample count.

$ collectl -c1 -sm
waiting for 1 second sample...
#<-----------Memory----------->
#Free Buff Cach Inac Slab  Map 
   1G 261M   1G   1G 228M   3G

To change interval use the i options

$ collectl -sm -i2
waiting for 2 second sample...
#<-----------Memory----------->
#Free Buff Cach Inac Slab  Map 
   1G 261M   1G   1G 229M   3G

The above command would collect memory stats every 2 seconds.

7. Use collectl like iotop

Out of the plenty options, the "top" option makes collectl report process-wise statistics much like iostat/top commands. The list is continuously updated and can be sorted on a number of fields.

$ collectl --top iokb

The output looks like this

# TOP PROCESSES sorted by iokb (counters are /sec) 09:44:57
# PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT  UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
 3104  enlighte 20  2683    3 S  938M   33M  0  0.00  0.00   0  00:09.16    0    4    0    0 /usr/bin/ktorrent 
    1  root     20     0    0 S   26M    3M  2  0.00  0.00   0  00:01.30    0    0    0    0 /sbin/init 
    2  root     20     0    0 S     0     0  3  0.00  0.00   0  00:00.00    0    0    0    0 kthreadd 
    3  root     20     2    0 S     0     0  0  0.00  0.00   0  00:00.02    0    0    0    0 ksoftirqd/0 
    4  root     20     2    0 S     0     0  0  0.00  0.00   0  00:00.00    0    0    0    0 kworker/0:0 
    5  root      0     2    0 S     0     0  0  0.00  0.00   0  00:00.00    0    0    0    0 kworker/0:0H 
    7  root     RT     2    0 S     0     0  0  0.00  0.00   0  00:00.08    0    0    0    0 migration/0 
    8  root     20     2    0 S     0     0  2  0.00  0.00   0  00:00.00    0    0    0    0 rcu_bh 
    9  root     20     2    0 S     0     0  0  0.00  0.00   0  00:00.00    0    0    0    0 rcuob/0

The output is very similar to the top command and it sorts the process by the amount of disk io in descending order.

To display only top 5 processes use it as follows

$ collectl --top iokb,5

To learn about what fields the above list can be sorted, use the following command

$ collectl --showtopopts
The following is a list of --top's sort types which apply to either
process or slab data.  In some cases you may be allowed to sort
by a field that is not part of the display if you so desire

TOP PROCESS SORT FIELDS

Memory
  vsz    virtual memory
  rss    resident (physical) memory

Time
  syst   system time
  usrt   user time
  time   total time
  accum  accumulated time

I/O
  rkb    KB read
  wkb    KB written
  iokb   total I/O KB

  rkbc   KB read from pagecache
  wkbc   KB written to pagecache
  iokbc  total pagecacge I/O
  ioall  total I/O KB (iokb+iokbc)

  rsys   read system calls
  wsys   write system calls
  iosys  total system calls

  iocncl Cancelled write bytes

Page Faults
  majf   major page faults
  minf   minor page faults
  flt    total page faults

Context Switches
  vctx   volunary context switches
  nctx   non-voluntary context switches

Miscellaneous (best when used with --procfilt)
  cpu    cpu number
  pid    process pid
  thread total process threads (not counting main)

TOP SLAB SORT FIELDS

  numobj    total number of slab objects
  actobj    active slab objects
  objsize   sizes of slab objects
  numslab   number of slabs
  objslab   number of objects in a slab
  totsize   total memory sizes taken by slabs
  totchg    change in memory sizes
  totpct    percent change in memory sizes
  name      slab names

8. Use collectl like top

To make collectl report like top, we just have to report processes ordered by the cpu usage.

$ collectl --top

The output should be like this

# TOP PROCESSES sorted by time (counters are /sec) 14:08:46
# PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT  UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
 9471  enlighte 20  9102    0 R   63M   22M  3  0.03  0.10  13  00:00.81    0    0    0    3 /usr/bin/perl 
 3076  enlighte 20  2683    2 S  521M   40M  2  0.00  0.03   3  00:55.14    0    0    0    2 /usr/bin/yakuake 
 3877  enlighte 20  3356   41 S    1G  218M  1  0.00  0.03   3  10:10.50    0    0    0    0 /opt/google/chrome/chrome 
 4625  enlighte 20  2895   36 S    1G  241M  2  0.00  0.02   2  08:24.39    0    0    0   12 /usr/lib/firefox/firefox 
 5638  enlighte 20  3356    3 S    1G  265M  1  0.00  0.02   2  09:55.04    0    0    0    2 /opt/google/chrome/chrome 
 1186  root     20  1152    4 S  502M   76M  0  0.00  0.01   1  03:02.96    0    0    0    0 /usr/bin/X 
 1334  www-data 20  1329    0 S   87M    1M  2  0.00  0.01   1  00:00.85    0    0    0    0 nginx:

You can also display sub system information along with the above

$ collectl --top -scm

9. List processes like ps

To just list out the processes like ps command, without updating continously, just set the sample count to 1 with the "c" options

$ collectl -c1 -sZ -i:1

The above command will list out all the processes much like "ps -e". The 'procfilt' option can be used to filter out specific processes from the process. The 'procopts' option can be used to specify another set of options for fine tune the process list display.

10. Use collectl like vmstat

Collectl has got a direct option to make it behave like vmstat

$ collectl --vmstat
waiting for 1 second sample...
#procs ---------------memory (KB)--------------- --swaps-- -----io---- --system-- ----cpu-----
# r  b   swpd   free   buff  cache  inact active   si   so    bi    bo   in    cs us sy  id wa
  1  0      0  1733M   242M  1922M  1137M   710M    0    0     0   108 1982  3918  2  0  95  1
  1  0      0  1733M   242M  1922M  1137M   710M    0    0     0     0 1906  3886  1  0  98  0
  1  0      0  1733M   242M  1922M  1137M   710M    0    0     0     0 1739  3480  3  0  96  0

11. Detailed information about subsystems

The following command would collect "5 samples" of CPU statistics at "1 second" interval and print detailed information (verbose) along with the time.

$ collectl -sc -c5 -i1 --verbose -oT
waiting for 1 second sample...

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
#Time      User  Nice   Sys  Wait   IRQ  Soft Steal  Idle  CPUs  Intr  Ctxsw  Proc  RunQ   Run   Avg1  Avg5 Avg15 RunT BlkT
14:22:10     11     0     0     0     0     0     0    87     4  1312   2691     0   866     1   0.78  0.86  0.78    1    0
14:22:11     15     0     0     0     0     0     0    84     4  1283   2496     0   866     1   0.78  0.86  0.78    1    0
14:22:12     17     0     0     0     0     0     0    82     4  1342   2658     0   866     0   0.78  0.86  0.78    0    0
14:22:13     15     0     0     0     0     0     0    84     4  1241   2429     0   866     1   0.78  0.86  0.78    1    0
14:22:14     11     0     0     0     0     0     0    88     4  1270   2488     0   866     0   0.80  0.87  0.78    0    0

Change the "-s" parameter to view a different subsystem.

Summary

The post so far was just a bird's view of this amazing tool called collectl. It should have given a fair idea of how flexible it is. The discussion however leaves out various other features of collectl which include the ability to record and "playback" the captured data, export data for various file formats and data formats that can be used with external tools for analysis etc.

Another major feature that collectl supports is running as a service that allows for remote monitoring making it a perfect tool for keeping a watch on resources of remote linux machines or an entire server cluster.

Collectl is accompanied with an additional set of tools named Collectl Utils (colmux, colgui, colplot) that can be used to process and analyse the data collected. May be we shall take a look at those in another post.

Check the man page to learn more about the options. I would also recommend checking out the FAQs to get a quick idea about collectl. Next, read up the collectl documentation for more indepth examples to get beyond the basics. There is also a command equivalence matrix which maps the more common commands like sar, iostat, netstat, top with their collectl equivalents.