pio and topio

I originally wrote this program for Solaris when I realized that top (or prstat) couldn't sort on I/O. I now have versions for other operating systems as well. The Solaris section comes first, followed by the HP-UX, Windows and Linux sections. If you're on HP-UX, I recommend you glance through the Solaris section first.

Solaris

pio (for Process I/O) on Solaris is a program that shows how much character I/O a process has read or written. topio shows the most I/O-intensive processes, which is something the top and prstat commands cannot do. If you're impatient, jump to the most useful command, topio -d.

For instance, the following command shows that your current shell has read and written 10994 characters so far (we'll talk about other columns later).
$ pio -p $$
PID     InpBlk  OutpBlk RWChar  MjPgFlt Comm
392     0       33      10994   0       -ksh

You can watch a process continuously (-H removes the header):
$ pio -p 554
PID     InpBlk  OutpBlk RWChar  MjPgFlt Comm
554     991     0       116099  922     find /
$ while true; do
> pio -Hp 554
> sleep 1
> done
554     2266    0       243623  2095    find /
554     2339    0       251878  2166    find /
554     2408    0       258030  2229    find /
^C$

Or look at all processes (note that pio doesn't need SUID permission to look at other users' processes, and -A prevents the header from being printed):
$ pio -A #columns are pid, InpBlk, OutpBlk, RWChar, MjPgFlt, Comm
0       280     1       0       93      sched
0       280     1       0       93      sched
0       280     1       0       93      sched
1       125     1       68121   92      /etc/init -
2       0       0       0       0       pageout
3       0       210     0       0       fsflush
342     12      4       8691    11      /usr/lib/saf/sac -t 300
...

The next useful thing to do is to write a program that sorts on the RWChar column. I wrote a Perl script specifically for this purpose, appropriately named topio.
$ topio
** WARNING: Running topio without -d may not be **
** what you want. Type topio -h for help.       **
PID     InpBlk  OutpBlk RWChar  MjPgFlt Command
338     530     4       2394164 408     /usr/openwin/bin/Xsun :0 -nobanner -defdepth 24 -auth /var/dt/A:0-oHaGLa
368     96      5       2027102 86      /usr/lib/ssh/sshd
363     188     0       834232  163     dtgreet -display :0
348     196     0       679939  147     /usr/sfw/sbin/snmpd
388     0       4       291163  0       /usr/lib/ssh/sshd
214     7       0       128703  7       /usr/sbin/inetd -s
350     50      1       84051   40      /usr/dt/bin/dtlogin -daemon
1       125     1       68121   92      /etc/init -
247     0       3       57536   0       /usr/lib/utmpd
314     8       0       34965   4       /usr/lib/snmp/snmpdx -y -c /etc/snmp/conf
^C$

Probably the most useful feature of pio and topio is topio's -d option, which sorts on the delta, i.e. the difference in each process's Read/Write Characters between two consecutive samples. (While the examples above were run on my laptop, the screen shot below was captured on a server, so the numbers differ.)
$ topio -d -s2 -n5	#display 5 top Delta-I/O processes every 2 seconds
--PID-------RWChar-----DltRWC-----MjPgFlt-DltMPF Command------------------------
 8872     64025286    1835008           7      0 ora_dbw0_ORATEST
 5945    289675626      56832          62      0 ora_lgwr_ORATRN
 5947    497918441      49152         324      0 ora_ckpt_ORATRN
 5773   3917910706      28672          47      0 ora_lgwr_INTTST
 5943   3609392512      16384           1      0 ora_dbw0_ORATRN
--PID-------RWChar-----DltRWC-----MjPgFlt-DltMPF Command------------------------
 8874   3130856681    2108416         112      0 ora_lgwr_ORATEST
18528     11064724     831589           0      0 oracleORATEST
 5945    289729898      54272          62      0 ora_lgwr_ORATRN
 5775   1537267122      49152         223      0 ora_ckpt_INTTST
 8876    302708422      16384         241      0 ora_ckpt_ORATEST
--PID-------RWChar-----DltRWC-----MjPgFlt-DltMPF Command------------------------
 8872     68752070    4726784           7      0 ora_dbw0_ORATEST
 8874   3132257001    1400320         112      0 ora_lgwr_ORATEST
18528     11361185     296461           0      0 oracleORATEST
18526     48811015     178640         113      0 oracleORATEST
 5775   1537381810     114688         223      0 ora_ckpt_INTTST
^C$

According to topio output without -d (not shown here), process 5773 has the highest absolute I/O in the RWChar column. But its delta I/O, the difference in absolute I/O between two consecutive samples, only occasionally puts it near the top. This process happens to be the Oracle background process LGWR, which writes to the redo logfiles of the INTTST database. At the time of the capture, this LGWR process wrote 28672 bytes to the logfiles in a 2-second period. (LGWR does not read unless the database is being recovered from a crash.) If you're only checking the I/O of Oracle processes, you may want to supplement this information with what Oracle's own tools offer, such as the statistics collected in the v$sess_io view. (Unfortunately, v$sess_io doesn't record physical writes.)
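The delta sort can be sketched in a few lines of Python. This is only an illustration of the idea, not the Perl code in topio: the function name top_by_delta and the choice to skip processes that have no earlier sample are assumptions of the sketch.

```python
def top_by_delta(prev, curr, n=5):
    """Rank processes by how much their cumulative counter grew.

    prev and curr map pid -> cumulative RWChar from two consecutive samples.
    Returns up to n tuples of (pid, rwchar, delta), largest delta first.
    """
    rows = []
    for pid, total in curr.items():
        if pid not in prev:
            # Simplification of this sketch: a process seen for the first
            # time has no baseline, so it is left out of this round.
            continue
        rows.append((pid, total, total - prev[pid]))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:n]


# RWChar values taken from the second and third samples shown above.
# The earlier value for PID 8872 is its third-sample total minus its delta.
previous = {8874: 3130856681, 18528: 11064724, 5775: 1537267122, 8872: 64025286}
current = {8874: 3132257001, 18528: 11361185, 5775: 1537381810, 8872: 68752070}

for pid, rwchar, delta in top_by_delta(previous, current, n=3):
    print(pid, rwchar, delta)
```

With these numbers the sketch ranks PIDs 8872, 8874 and 18528 first, in the same order as the third sample above.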

Download the source code pio.c and type gcc -o pio pio.c. Also download topio and read the line below #!. Put pio and topio in the same directory and chmod them to make them executable. If you wish to run topio from a directory other than the one they are in, change $PIO in topio to the absolute path. The current version additionally reports each process's major page faults, in the hope that true disk I/O, as opposed to page cache I/O, can be deduced. Note for x86 Solaris users: gcc 3 has problems with some headers. Use gcc 2.95 instead, unless you want to fix the header files.

How does it work? Before Solaris 10 (see note1), there were two ways to get the I/O count of a process on Solaris. Brendan Gregg's psio Perl program uses the prex utility to probe into the kernel and filter on a specific process. My pio, originally written by looking at Jim Mauro and Richard McDougall's msacct (see note2), published in Appendix C of Solaris Internals, fetches the I/O count from the /proc filesystem. (I'm not using microstate accounting as Jim's program does; it is essential for CPU costing but would add some performance overhead.) Basically, pio gets process I/O statistics from /proc/pid/usage, specifically the fields pr_inblock, pr_oublock and pr_ioch of struct prusage, as explained on pp. 314-5 of Solaris Internals and in the proc(4) man page. You may wonder how much precious information our UNIX boxes collect without it ever being used. Quite a lot: if you don't write programs like this to fetch the data, it is collected and simply thrown away.

What the numbers mean: pr_inblock and pr_oublock are generally not very useful. According to Adrian Cockcroft, "inblock and outblock [sic] counters are uninteresting as they only refer to filesystem metadata for the old-style buffer cache". Indeed, beginning with Solaris 2, the old buffer cache was largely replaced by the page cache and is only used to store metadata. So if you see an occasional jump in InpBlk and OutpBlk, it is, for instance, because the file's allocated blocks need to be extended or shrunk to accommodate more or less data, so the inode is updated. What I observed is that when a process does I/O continuously, RWChar keeps increasing, while InpBlk and OutpBlk stay the same for some time, suddenly jump, stay the same for a while, and jump again. But the ratio of this jump in blocks to the number of characters added to RWChar is not consistent from file to file. That's why the 2nd and 3rd columns of pio output don't look important to me.

The statistic pr_ioch or Read/Write Characters lumps reads and writes together and there's no way to separate them. The only workaround I can think of is something like
#trace read/write syscalls, redirect stderr (which truss outputs to) to Perl filter,
#which prints syscall return value, i.e. number of chars read/written;
#lines without a numeric return value (e.g. failed calls) are skipped
truss -t read,pread -p pid 2>&1 | perl -nle 'print $1 if /= (\d+)$/'
truss -t write,pwrite -p pid 2>&1 | perl -nle 'print $1 if /= (\d+)$/'
You can do some math there to sum up the return values for a period of time. If scatter-gather I/O is used, you may want to add readv and writev to -t.
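As a rough sketch of that math, the Python below sums the return values found at the end of truss lines. The function name sum_return_values and the sample lines are made up for illustration; the lines only imitate truss's output format and were not captured from a real run.

```python
import re

# A successful truss line ends with "= <return value>"; failed calls end
# with an error such as "Err#4 EINTR" and are ignored.
RETURN_VALUE = re.compile(r"= (\d+)\s*$")


def sum_return_values(lines):
    """Add up the numeric return values found at the end of truss lines."""
    total = 0
    for line in lines:
        match = RETURN_VALUE.search(line)
        if match:
            total += int(match.group(1))
    return total


# Illustrative lines in truss's output format, not captured from a real run.
sample = [
    'read(3, "some data"..., 8192)\t= 8192',
    'read(3, "more data"..., 8192)\t= 1024',
    "read(3, 0x08047A30, 8192)\tErr#4 EINTR",
]
print(sum_return_values(sample))  # 9216
```

To use it on live data, one could pass sys.stdin as the lines argument and pipe truss's redirected output into the script for the period of interest.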

Another problem with pio is that RWChar includes all kinds of I/O, i.e. disk I/O as well as terminal and network I/O. If somebody has left the top program running for a long time (because he doesn't know the lighter-weight prstat!), topio may show that this top process has accumulated a lot of RWChar, and possibly a lot of delta I/O in topio -d output, particularly if top was launched with a short interval (like top -s1). You can reproduce this problem with a tight loop of echo "some characters" without a sleep in the loop. The current version of my program includes the major page fault statistic in the hope of identifying real disk I/O. Fortunately, people often use topio to monitor daemon processes, including Oracle server processes, which do no terminal I/O. But disk I/O and network I/O (if any) are still mixed.

____________________
note1 Solaris 10 has the powerful DTrace facility which can be used to provide process I/O statistics.

note2 Jim Mauro's msacct uses printf("%ld".. for process usage. I changed it to printf("%lu".. in pio.c. Otherwise numbers greater than 2 billion would show as negative. They're defined as unsigned long anyway.
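
The effect described in note2 can be reproduced with a little arithmetic. The Python sketch below assumes a 32-bit long, as in a 32-bit process; the helper name as_signed_32 is made up for illustration. It reinterprets an unsigned counter as signed, which is what printing it with %ld amounts to.

```python
import struct


def as_signed_32(value):
    """Reinterpret the bits of an unsigned 32-bit value as a signed one."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


rwchar = 3917910706  # RWChar of process 5773 in the topio -d sample above
print(as_signed_32(rwchar))      # -377056590
print(as_signed_32(2147483647))  # 2147483647, the largest value that survives
```

Any counter above 2147483647 wraps to a negative number this way, which is why pio.c prints the fields with %lu.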

HP-UX

Assuming you have quickly read the Solaris section, I only highlight a few points here. pio on HP-UX tells you how many read and write operations a process has performed. topio sorts all processes by either reads or writes.

$ pio -p $$
PID     InpOps  OutpOps MjPgFlt Comm
25240   8       16      0       sh

$ topio -n3 -s2 -kW #display 3 top Delta-Write processes every 2 seconds
--PID ProcName--------- -----Reads ---DltR -----Writs ---DltW -----PFlts ---DltF
   52 vxfsd                    225       0     322383       6          0       0
 1700 midaemon                   0       0          0       0          0       0
13730 ia64_corehw                0       0          0       0          0       0
--PID ProcName--------- -----Reads ---DltR -----Writs ---DltW -----PFlts ---DltF
   52 vxfsd                    225       0     322388       5          0       0
 1429 java                     191       0       8080       1        794       0
 1700 midaemon                   0       0          0       0          0       0

While the Solaris version lumps read and write characters together, the HP-UX version separately counts input and output, and it counts read and write operations, not number of characters. In addition, the HP-UX version no longer needs -d to sort on deltas.

Download source code pio.c and type cc -D_PSTAT64 -o pio pio.c. Also download topio and read the line below #!. Put pio and topio in the same directory and chmod to make them executable. If you wish to run topio from directories other than where they are, change $PIO in topio to the absolute path. [Dec 2008, Alexander Beyn comments "on HPUX 11.00, I had to #define _RUSAGE_EXTENDED before sys/pstat.h was included, otherwise pst_inblock and pst_oublock were not part of the pst_status structure...It looks like HP-UX 11.11 (released in 2000) and newer expose those fields without _RUSAGE_EXTENDED."]

How does it work? pio fetches I/O statistics from pstat, specifically pst_inblock and pst_oublock fields of struct pst_status. You can see these fields in /usr/include/sys/pstat/pm_pstat_body.h (thanks to Don Morris and Christof Meerwald on the newsgroup). Note that judging by the names, you would think they represent number of input and output blocks, just like pr_inblock and pr_oublock on Solaris. But the header file comment says they are block input and output operations.

Windows

Windows Task Manager allows you to view process statistics. On Windows 2000, XP and later, if you go to View | Select Columns, you can add I/O-related counters. There are, however, two limitations. First, Task Manager can't display processes on a remote computer. Second, the I/O counters are absolute values accumulated since process startup. The absolute values answer questions such as "What process has read the most, in bytes or in number of read operations?" But in reality, one more often asks a different question: "What process is currently doing the most reading?" My topio program answers the second question. Here's a screen shot showing the top 5 processes on server 123.45.67.89 every 2 seconds, sorted by delta write bytes (the DltWBts column). I launched Winzip to compress some files right after I started topio.

D:\>perl d:\systools\topio.pl -m123.45.67.89 -n5 -s2 -kw
--PID ProcName---- -----RBytes DltRBts --Reads DltR -----WBytes DltWBts --Writs DltW -----CBytes DltCBts ---Cntls DltC --PFlts DltF
 1864 WinMgmt         24645682   16410    4973   90     5778555   48320   20850   87   448413706       0  7765109   14 1367503  257
  304 SERVICES       377067071    2208 6365120   48   618084696   32980 5756074   51    49798992     510  5599270   56   82296    3
    8 System             34644       0      83    0   504278961    8258  708219   12    71108078       0  4601016   11   41190    5
  316 LSASS            6986726    5024  102785   10    10817344    1756   84879    9     7439645       8   185753    8   48880   25
  552 svchost           562472    1157     999    4      265419    1286     605    2      140400       0     3367    6    3141    1
--PID ProcName---- -----RBytes DltRBts --Reads DltR -----WBytes DltWBts --Writs DltW -----CBytes DltCBts ---Cntls DltC --PFlts DltF
 4256 WINZIP32         1396315 1396168      60   55      216465  216463      20   19       22268    9218      796  417    1506  196
 1864 WinMgmt         24662620   16938    5067   94     5828011   49456   20942   92   448413706       0  7765123   14 1367530  262
  304 SERVICES       377067959     888 6365136   16   618119520   34824 5756100   26    49807164    8172  5599355   85   82296    0
  316 LSASS            6993022    6296  102827   42    10821296    3952   84910   31     7439733      88   185814   61   48880    0
  552 svchost           563629    1157    1003    4      266705    1286     607    2      140400       0     3373    6    3142    1
^C

You might think that if a process is doing a lot of I/O, it must be burning a lot of CPU. That's not always true or obvious. For instance, when the Oracle database is running on my laptop with all database sessions idle, I notice hard disk activity about once every three seconds. Task Manager doesn't show oracle.exe as a top CPU process, but my topio does show it as a top I/O process. A trivial example can also be set up in which one process does nothing but busy-loop on a null operation while another process genuinely reads a big file; the first process is higher on CPU usage. These are the cases where topio can be of some use. It can actually sort on any of the I/O counters, including page faults, which hopefully can be used to deduce real disk I/O as opposed to I/O against the system cache. (Task Manager has a PF Delta column, equivalent to my DltF, but I include it here for your convenience.) One caveat, though, is that all these I/O counters lump disk, terminal and network I/O together. There's no way to separate them. You have to use other information to know which of the three types it really is. But generally, a Windows service process such as oracle.exe has no terminal I/O, so you can eliminate that.

Download pio.vbs and topio.pl to the same folder. (Rename the files to pio.vbs and topio.pl after you download.) If you wish to run topio.pl from folders other than where they are, change $PIO in topio.pl to the absolute path. Unless you already have Perl installed (such as the one that comes with Oracle client), download and install ActivePerl. Then first, change to the folder and type perl topio.pl -h to verify Help works. Type perl topio.pl to run the program with all default values. (If you have associated .pl with perl.exe, you can try just topio.pl. But I find that it may mess up command line options. Always prepending perl solves the problem.) Please read help first, or find the help in the Usage part of topio.pl source code. Make sure your console window is 132 characters wide to avoid line wrapping.

How does it work? topio is a Perl script that sorts on values supplied by pio.vbs, a VBScript that fetches I/O-related statistics for all processes running on the system. The statistics are collected by WMI (Windows Management Instrumentation) so make sure that service is not stopped on the target machine. pio.vbs uses Microsoft WMI class Win32_PerfRawData_PerfProc_Process to obtain this data.

The functionality of topio may eventually be merged into pstats, another freeware tool for Windows. The reason I'm developing topio separately is that the Microsoft WMI class Win32_PerfRawData_PerfProc_Process is totally flawed: all those "... per second" counters are not per-second values at all; instead, they're cumulative since process startup (see my message posted to the official Microsoft newsgroup, which has gone unanswered). Until Microsoft addresses that problem, we have to sort on delta values for I/O as well as CPU usage counters. Only memory counters can stay as absolute values, because the question "What process is using the most memory now?" is more practical than "What process has gained the most memory in the past few seconds?" If you do need an answer to the second question, pmon has a "Mem Diff" column in its output.

Linux

Historically, Linux didn't record process I/O usage either in the proc filesystem or through the getrusage call (getrusage has the fields for the current process, but they're never populated). On those older Linux boxes, if we want a process I/O count, we can use a kernel module to catch read(2) and write(2) syscalls or their variants; I think this is exactly what AT Consultancy's atop has been doing all these years. Alternatively, we can enable block I/O debugging, as iodump does. Also see this thread. Finally, SystemTap's uid-iotop could be used as well. However, in my opinion, all of those have been made obsolete by the introduction of /proc/pid/io in kernel 2.6.18-164.

Assume you have /proc/pid/io:
Download pio and topio into the same directory, type chmod to make them executable. If you wish to run topio from a directory other than where they are now, change $PIO in topio to the absolute path. Run ./topio -h first. A screen shot of actual use is shown below. (You may have to run as root now because from some version of Linux on, /proc/*/io is no longer other-readable.)

$ ./topio -s3 -krb -n4 #display top 4 delta-read-bytes (DltRBts) processes every 3 seconds
--PID ProcName----- -----RdChars DltRChs ------WtChrs DltWChs -------Rds -DltRds ------Wts DltW ------RdBytes DltRBts -----WtByts DtWB ---CWBts DltCWB
22537 ora_lmon_orac    732113332     560       206773       0    8914128       7      7935    0  144191709184  114688      413696    0        0      0
22555 ora_ckpt_orac   1317082923    1482        95358       0    8779307      10       123    0   36745311232   32768       32768    0        0      0
11629 /u01/crs/orac  16103068023   17851   8661523718    8092   50064836      59  30159419   34   14550792704   10752  7267489280 4608 52371046   8192
12599 /u01/crs/orac  79179130109  743582  57139477524  561498   56850567     404  16021147  154    3476266496    4096 10401308672 4915 11809955   8192

See the Linux documentation for a detailed description of the counters I use. They are exposed in /proc/pid/io, which Linux has been planning to add to the mainstream build for some time. For example, kernel 2.6.18-164 already has it; in fact, it appeared much earlier if TASK_DELAY_ACCT and TASK_IO_ACCOUNTING are configured in the kernel (check with egrep 'TASK_DELAY_ACCT|TASK_IO_ACCOUNTING' /boot/config-`uname -r`; if only TASK_IO_ACCOUNTING is configured, topio will work if modified). A Red Hat DTrace article talks about it. Guillaume Chazarain has already implemented iotop based on these I/O counters; a minor issue is that it uses fancy Python features, which may complicate installation. Before long, Red Hat 6 will be widely deployed, and it comes with iotop. So my port of topio to Linux may be replaced in practical use, unless the official iotop doesn't live up to our expectations.
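
For readers who want to look at the raw counters themselves, here is a minimal Python sketch that parses the "name: value" lines of /proc/pid/io. It is not how the Linux pio is implemented; the function names parse_proc_io and read_proc_io are hypothetical and the sample values are made up. The seven fields appear to correspond to the seven counter columns in the topio output above.

```python
def parse_proc_io(text):
    """Turn the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def read_proc_io(pid):
    """Read the counters of one process; may require root on some systems."""
    with open(f"/proc/{pid}/io") as handle:
        return parse_proc_io(handle.read())


# Made-up values in the file's layout, for illustration only.
sample = """rchar: 2012
wchar: 0
syscr: 7
syscw: 0
read_bytes: 0
write_bytes: 0
cancelled_write_bytes: 0
"""
print(parse_proc_io(sample)["rchar"])  # 2012
```

Taking two such dictionaries a few seconds apart and subtracting them field by field gives the delta columns.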


The only other major OS I care about is AIX. According to this discussion, AIX has nmon, which can sort on process I/O. Indeed, it works as expected (press t, then 5, to sort on I/O). There is also an nmon for Linux, but on Linux it actually sorts on delta page faults, not really on I/O.
