Oracle Process: Process Tracing and /proc File System
I. Introduction
Why? (for fun and curiosity)
How? (truss in Solaris, strace in Linux)
II. Fun Items
1. Tracing a process
Most software applications offer their own debugging tools. For instance, Oracle v$ views can be used for Oracle database troubleshooting as well as performance tuning. In addition, application-independent tools are sometimes used; when a process crashes and generates a core file, you might get insight into the cause of the crash by running a debugger such as gdb on it, particularly if the core contains symbols.
In this article we talk about one other application-independent tool, process tracing programs. Such a program lists all system calls made by the running process. Here's an example. Once I saw a process named A+ taking more than 10% CPU in top on a 16-CPU Sparc R420 server. trussing the process indicats that the only thing it's doing is
read(0,<some_hex_value>,1) = 0and it's doing that repeatedly very fast. In order to interpret process tracing, you have to have a little experience in system programming or familiarity of section 2 of the man page. The above read syscall says the process tries to read 1 byte from stdin (file descriptor 0) into a buffer pointed to by <some_hex_value>, and it fails to read anything. Unless it's redirected, stdin is keyboard. So the process was written in a way of busy-loop waiting for somebody to press a key! After talking to the users of this software and got approval, we killed the process. We could have opened a ticket with the software company reporting this as a bug.
Another case. Developers complained about slow performance on the development server. top shows that two oracle processes are eating up CPU. I found the SQL sessions associated with the processes but couldn't tell why they took so much CPU (they were just executing some PL/SQL code which calls Oracle-wrapped procedures). Back to UNIX, I ran truss and saw they repeatedly executing brk() with an argument that kept increasing at a rate of 1MB per 3 seconds or so. (brk(), as well as sbrk(), is the syscall behind malloc or calloc library function to allocate heap memory, or free to deallocate memory.) Repeatedly running /usr/proc/bin/pmap -x <pid> indicated that their heap size kept going up at a rate of 1MB per 3 seconds, other lines staying the same. /proc/<pid>/as, the process address space, increased in size at the same rate. So, we've found a typical memory leak! The code is wrapped by Oracle but our developer found a workaround. We killed the sessions from SQL*Plus and a while later the processes disappeared from UNIX (reaped by PMON).1
Now let's take a look at Oracle server process, a.k.a shadow process. Tracing an Oracle process when it's scanning a datafile shows these lines
pread64(407, "0602\0\0\0 A I p\0 W90EF".., 8192, 0x292E0000) = 8192 pread64(407, "0602\0\0\0 A I o\0 W90EF".., 8192, 0x292DE000) = 8192 pread64(407, "0602\0\0\0 A I q\0 W90EF".., 8192, 0x292E2000) = 8192This means that this process is reading the file with a process file descriptor 407. In Solaris, you can use Vic Abel's lsof to find out what file it is: lsof -p <pid> -a -d 407; in Linux, ls -l /proc/<pid>/fd shows the open files with their file descriptors. The first few bytes it reads are shown (ASCII 6, ASCII 2, NUL, NUL, NUL, letter A, letter I,...) and each pread() system call (pread64() is just its largefile-aware variant) attempts to read 8192 bytes (the 3rd argument of the system call), which is the Oracle db_block_size (if db_block_size were 16k, it would read 16384 bytes, assuming db file sequential read), and both calls successfully read the expected number of bytes (returned value matches the 3rd argument). If the read is scattered, the return value will be actual db_file_multiblock_read_count times db_block_size; you can verify this by tracing an Oracle session with event 10046 at level 12 and look at p3 for 'db file scattered read' event. Note that the very first few bytes read by each pread() are always the same. That's what Oracle calls block common header: 06 for normal data block type, 02 for Oracle8 or 8i format, \0\0 not used, \0 A I p for relative data block address. The most interesting may be the first byte, data block type, which you can obtain by simple tests. See footnote 2 for some common types.
Few if any DBAs trace an oracle process routinely; instead they look at the wait event the SQL session is waiting on. Note that v$process.spid is the UNIX process ID (for dedicated server configuration) and v$process.addr matches v$session.paddr. Thus you find the SQL session. Corresponding to the above truss output is "db file sequential read" event in v$session_wait. You can further find the SQL the session is executing by matching v$session.sql_address with v$sql.address (to eliminate the infinitesimal chance that one of the two views has more than one row matching, you may also match v$session.sql_hash_value with v$sql.hash_value, but that never happened to me). You can also find the segment name and type for the db file read by plugging in file_id and block_id from v$session_wait.
Before we go further, let me say a few words about file system I/O size. This size is not the same as hard disk physical block size, which is always 512 bytes (NBPSCTR, bytes per disk sector, in /usr/include/sys/param.h); instead it's a multiple of that. Let's assume the OS is Solaris and the file system is UFS. UNIX system administrators may all prefer to create file systems (using newfs(1M) or directly mkfs(1M)) with the largest supported logical block size on UFS, 8192 bytes (The largest on UFS is MAXBSIZE defined in /usr/include/sys/param.h in Solaris 2 or later). To find out the block size of the file system where your current directory is, you use either df -g . or perl -e '$a=(stat ".")[11]; print $a'. If you have root access and the OS is Solaris, you can also run mkfs -m <raw_device_file> or fstyp -v <raw_device_file>.3 The block size given by df -g . is f_bsize of struct statvfs defined as "fundamental file system block size" in statvfs.h. The perl one-liner is simply a convenient way to get the 12th element of the array returned by stat(2), which is the suggested or optimal value of I/O size, and can be run on any UNIX. This suggested I/O size, even though in the inode structure on a per-file basis, actually is the "fundamental block size filesystem used by the file system" (see struct vattr, member va_blksize on p.352 of B. Goodheart, J. Cox The Magic Garden Explained).
Back to our database. Suppose the above database had a db_block_size 16k. Each read() system call would still read a db_block_size, not the UFS file system I/O size; i.e., the two pread lines shown above would have the four 8192's replaced by four 16384's. So what role does the OS I/O size play if it's different from db_block_size? Maurice Bach describes the algorithm of read(),4 of which pread() is just a variant. In his description, the file inode is locked near the beginning of the system call (note that this step only exists for the legacy s5 file system, and is replaced by reader-writer lock5 in UFS!). A file system block size of bytes are read. The number of bytes read is compared with the number of bytes expected to read. In this case, it's not enough (only 8192 out of 16384 is read). So another I/O size of bytes are read. Then it's compared again. If equals or exceeds (now it's yes), exit the loop and release the inode lock, thus finishing this read() system call. So you see that within one read() call, there could actually be multiple lower level file system reads not visible in process tracing, which has the finest granularity of system calls. Steve Adams used to warn us that all UFS based datafiles are better created with db_block_size of 8k because UNIX file system I/O size is usually 8k (direct I/O, raw disks, or any proprietary file system are not UFS-based); if you use 16k as often suggested for data warehouses, two low-level file system reads occur serially anyway. Somebody also posted a message 6 claiming changing db_block_size from 2k to 4k sped up batch jobs. Bert Scalzo investigated the effect of db_block_size on performance running Oracle on Linux. It's understandable that going from 2k to 4k increased performance by 138%. But it's surprising that changing db_block_size from 4k to 8k (the maximum supported on Linux) reduced data load time by almost 32%. Since Linux ext2 filesystem has a 4k block size, this benchmark seems to contradict the common belief that performance is the best when db_block_size matches file system I/O size; instead doubling it may improve performance. I wish more tests could be done, particularly on various filesystems.
Case studies
Case 1: SYS Uses SYSTEM as its Temporary Tablespace
The following is the oracle process for the session in which SYS runs ?/rdbms/admin/utlrp.sql, compiling invalid objects, in Oracle 8.1.6.
pwrite64(407, "\b02\0\0\0 A P W\0 y >B7".., 8192, 0x2A0AE000) = 8192 pwrite64(407, "\b02\0\0\0 A P X\0 y >B8".., 8192, 0x2A0B0000) = 8192 pwrite64(407, "\b02\0\0\0 A P Y\0 y >B8".., 8192, 0x2A0B2000) = 8192 pwrite64(407, "\b02\0\0\0 A P Z\0 y >B8".., 8192, 0x2A0B4000) = 8192 pwrite64(407, "\b02\0\0\0 A P [\0 y >B8".., 8192, 0x2A0B6000) = 8192 pwrite64(407, "\b02\0\0\0 A P \\0 y >B8".., 8192, 0x2A0B8000) = 8192 pread64(407, "\b02\0\0\0 A P >\0 y >AF".., 8192, 0x2A07C000) = 8192 pread64(407, "\b02\0\0\0 A P ?\0 y >AF".., 16384, 0x2A07E000) = 16384 pread64(407, "\b02\0\0\0 A P A\0 y >B1".., 24576, 0x2A082000) = 24576 pread64(407, "\b02\0\0\0 A P D\0 y >B3".., 40960, 0x2A088000) = 40960 pread64(407, "\b02\0\0\0 A P I\0 y >B5".., 81920, 0x2A092000) = 81920 pread64(407, "\b02\0\0\0 A P S\0 y >B7".., 81920, 0x2A0A6000) = 81920Two things are suspicious in this truss output. \b or ASCII 8 indicates this block is type Sort Key. But according to lsof, fd 407 is a SYSTEM tablespace datafile. Additionally, you would also ask why a USER shadow process (not background process) writes to a datafile? A foreground proocess usually only writes to a temporary datafile or tempfile (or to LOB segments, which generates wait event "direct path write (lob)"). The truth is that it can write to a datafile if sort occurs on this file. So I ran this SQL
SQL> select username, temporary_tablespace from dba_users where temporary_tablespace != 'TEMP'; USERNAME TEMPORARY_TABLESPACE ------------------------------ ------------------------------ SYS SYSTEM SYSTEM SYSTEM OUTLN SYSTEM DBSNMP SYSTEM AURORA$ORB$UNAUTHENTICATED SYSTEM ORDSYS SYSTEM ORDPLUGINS SYSTEM MDSYS SYSTEM CSMIG SYSTEM 9 rows selected.This is a rapidwiz fresh install of Oracle Applications 11i which bundles Oracle database 8.1.6. Immediately after I corrected SYS's temporary tablespace, truss tells me
pwrite64(395, "\b02\0\0\0 @\0\v\0 y N W".., 8192, 90112) = 8192 pwrite64(395, "\b02\0\0\0 @\0\f\0 y N W".., 8192, 98304) = 8192 pwrite64(395, "\b02\0\0\0 @\0\r\0 y N W".., 8192, 106496) = 8192 pwrite64(395, "\b02\0\0\0 @\00E\0 y N W".., 8192, 114688) = 8192where 395 is the TEMP tempfile. To confirm, I changed SYS's temporary tablespace back to SYSTEM and TEMP and back several times, the pread64 and pwrite64 syscalls that occurred on a Sort Key (\b) block indeed alternated between a SYSTEM datafile and the TEMP tempfile.
Interestingly though, v$sort_segment only showed a sort segment created in TEMP, not SYSTEM, probably because "the view is only updated when the tablespace is of the TEMPORARY type", according to documentation | References | V$SORT_SEGMENT.
Case 2: What to Do if You See Too Many yield()s
$ truss -t \!times -p 10747 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 kill(17202, SIGWINCH) = 0 semctl(983042, 8, 8, 1) = 0 yield() = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 yield() = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 yield() = 0 semctl(983042, 8, 8, 1) = 0 yield() = 0 yield() = 0 semctl(983042, 8, 8, 1) = 0 yield() = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 yield() = 0 yield() = 0 semctl(983042, 8, 8, 1) = 0 yield() = 0 yield() = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 semctl(983042, 8, 8, 1) = 0 yield() = 0 semctl(983042, 8, 8, 1) = 0 ... kill(17202, SIGWINCH) = 0
v$session_wait keeps showing "log buffer space", "log file switch completion" and less frequently "db file sequential read". The process running utlrp.sql has been running for more than one hour.
If a UNIX process calls yield system calls frequently, it may well be that this process has been running for some time and consumed too much CPU. Everything being equal, breaking up the job to be run in several sessions may help
connect user/passwd a little SQL connect user/passwd a little SQLThe reason is that every new connection creates a new shadow process (and the old process automatically exits). Operating system schedules the new process with a fair amount of CPU. Having said that, I have to warn that a process may call too many yields as a result of other problems. Sometimes you have to think of everything in perspective. It's possible a bad design or bad SQL can be changed to avoid too much CPU usage.
2. What can /proc/<oraclepid> tell us?
$ ps -fp 27535
UID PID PPID C STIME TTY TIME CMD
oracle 27535 1 0 Jul 10 ? 8:45 oracleORCL (DESCRIPTION=(LOCAL=no)(ADDRESS=(PROTOCOL=BEQ)))
$ cd /proc/27535
$ ls -l
total 173177
-rw------- 1 oracle dba 88621056 Jul 10 15:14 as
-r-------- 1 oracle dba 152 Jul 10 15:14 auxv
-r-------- 1 oracle dba 56 Jul 10 15:14 cred
--w------- 1 oracle dba 0 Jul 10 15:14 ctl
lr-x------ 1 oracle dba 0 Jul 10 15:14 cwd ->
dr-x------ 2 oracle dba 6592 Jul 10 15:14 fd
-r--r--r-- 1 oracle dba 120 Jul 10 15:14 lpsinfo
-r-------- 1 oracle dba 912 Jul 10 15:14 lstatus
-r--r--r-- 1 oracle dba 536 Jul 10 15:14 lusage
dr-xr-xr-x 3 oracle dba 48 Jul 10 15:14 lwp
-r-------- 1 oracle dba 3936 Jul 10 15:14 map
dr-x------ 2 oracle dba 800 Jul 10 15:14 object
-r-------- 1 oracle dba 14656 Jul 10 15:14 pagedata
-r--r--r-- 1 oracle dba 336 Jul 10 15:14 psinfo
-r-------- 1 oracle dba 3936 Jul 10 15:14 rmap
lr-x------ 1 oracle dba 0 Jul 10 15:14 root ->
-r-------- 1 oracle dba 1440 Jul 10 15:14 sigact
-r-------- 1 oracle dba 1232 Jul 10 15:14 status
-r--r--r-- 1 oracle dba 256 Jul 10 15:14 usage
-r-------- 1 oracle dba 0 Jul 10 15:14 watch
-r-------- 1 oracle dba 6232 Jul 10 15:14 xmap
$ ls -l fd
total 7205088
c--------- 1 oracle tty 24, 4 Jul 5 15:25 0
c--------- 1 oracle tty 24, 4 Jul 5 15:25 1
s--------- 0 root root 0 Jul 29 22:41 18
c--------- 1 oracle tty 24, 4 Jul 5 15:25 2
c--------- 1 root sys 13, 2 Jul 29 22:42 3
c--------- 1 root sys 13, 2 Jul 29 22:42 4
-rw-rw---- 1 oracle dba 541073408 Jul 29 22:42 408
-rw-r----- 1 oracle dba 3145736192 Jul 29 22:41 409
c--------- 1 root sys 13, 2 Jul 29 22:42 5
s--------- 0 root root 0 Jul 5 15:25 6
c--------- 1 root sys 13, 2 Jul 29 22:42 7
D--------- 1 root root 0 Jan 18 2001 8
-r--r--r-- 1 oracle dba 358912 Aug 3 2000 9
$ lsof -p 27535 -a -d 0,1,2,3,4,5,6,7,8,9,18,408,409
...
3. Why is shared memory in /usr/proc/bin/pmap -x <oraclepid> shown as private?
$ ps -fp 25822
UID PID PPID C STIME TTY TIME CMD
oracle 25822 1 0 23:35:09 ? 0:00 ora_pmon_tiny
$ /usr/proc/bin/pmap -x 25822
25822: ora_pmon_tiny
Address Kbytes Resident Shared Private Permissions Mapped File
00010000 26192 12272 8424 3848 read/exec oracle
019B2000 264 264 200 64 read/write/exec oracle
019F4000 880 824 - 824 read/write/exec [ heap ]
20000000 37064 37064 - 37064 read/write/exec/shared [shmid=0x193]
80000000 55376 55376 - 55376 read/write/exec/shared [shmid=0x7d2]
EF000000 4656 3032 1120 1912 read/exec libjox8.so
...
$ ipcs -m
IPC status from as of Sun Jan 27 23:37:09 2002
T ID KEY MODE OWNER GROUP
Shared Memory:
m 0 0x50000fff --rw-r--r-- root root
m 1 00000000 --rw-rw-rw- root root
m 2002 00000000 --rw-rw---- oracle dba
m 403 0xbf821444 --rw-rw---- oracle dba
The two shared memory segments (with shmid 0x193 and 0x7d2 or 403 and 2002 in decimal, respectively) show that all their resident memory is private, yet permissions include "shared". This is something I'm still studying. Sun released Solaris source code. You can find the source code C program pmap.c in osnet_volume\usr\src\cmd\ptools\pmap directory expanded from sol_foundation-8-fcs-src-en-b.zip (similar code is in osnet_volume\usr\src\uts\common\fs\proc\prsubr.c) . The shared memory column of the pmap -x output is calculated as
printK((pmp->pr_ashared + pmp->pr_vshared) * (pmp->pr_pagesize / 1024));/usr/include/sys/procfs.h defined pr_ashared and pr_vshared. So we conclude that the shared memory thus calculated is the size in bytes of anonymous shared pages plus vnode shared pages. The next two lines in pmap.c indicates that private memory of the process is merely resident memory minus shared memory. In the case of an Oracle process as shown above, shared memory is for some reason 0, so the private memory is blindly "calculated" to be the same as resident.
So the question is, why does pmap or really the /proc file system fail to calculate the shared memory portion of an Oracle process? I don't know the answer. But it probably has to do with ISM, intimate shared memory, which is used by default. If you disable ISM by setting the Oracle initialization parameter _use_ism=false, or in older version Oracle use_ism=false, then at lease some shared memory segments (if multiple) for the instance will show the shared memory size under the shared instead of private column. Maybe this bug is in the code for segspt_shm, the driver implementing ISM. But I don't see source code for segspt_shm.
New This bug is corrected by the unbundled tool pmem.sparc:
$ pmem.sparc 29938 29938: ora_pmon_tiny Address Kbytes Resident Shared Private Permissions Mapped File 00010000 37064 16280 16280 - read/exec oracle 02450000 304 304 256 48 read/write/exec oracle 0249C000 984 280 - 280 read/write/exec [ heap ] 80000000 280 280 280 - read/write/exec/shared [shmid=0x1772] 81000000 49152 49152 49152 - read/write/exec/shared [shmid=0xaf3] 84000000 65536 65536 65536 - read/write/exec/shared [shmid=0xaf4] 88000000 49152 49152 49152 - read/write/exec/shared [shmid=0xaf5] 8B000000 65536 65536 65536 - read/write/exec/shared [shmid=0xaf6] 8F000000 49688 49688 49688 - read/write/exec/shared [shmid=0xaf7] EF000000 4184 1368 1368 - read/exec libjox9.so ...
4. oraus.msb, not oraus.msg.
From: Yong Huang (yong321@yahoo.com) Subject: Re: Contention on oraus.msg? rick@oyarsa.com (rlr1031) wrote in message news:<978df3d.0106290244.61c7eaee@posting.google.com>... > Has anyone experienced a problem with contention on the file oraus.msg in a > high volume, OLTP, OCI-based application? I know the file is used to > translate messages, but how is it being opened and used by Oracle? And when > in the processing of an OCI statement is this being hit? On the client? On > the server? Prior to sending the response? > Our internal database execution times look good, but we have occasional slow > downs in the response from the OCI statements. > > tia, > R > Oracle 8.1.7.x, single Unix platform with clustered disks (sorry cannot be > more specific at this time) Are you sure it's oraus.msg, not oraus.msb? I've been playing around tracing an Oracle process on the OS side recently. I noticed that all Oracle processes, background or foreground, open $ORACLE_HOME/rdbms/mesg/oraus.msb (note msb, not msg) with a file descriptor 9 (I only checked 8.1.6 and 8.1.7 for Sparc Solaris 2.6, UFS filesystem). The file is kept open during the lifetime of the Oracle process. Your observation is interesting and I've never seen anyone mention this contention. But posting a little more details should help. What OS or ORA error do you get if any? Can you show us a screen shot with no or very little editing? Yong Huang yong321@yahoo.com
5. dbms_lock.sleep can be awaken by SIGALRM
6. SQL*Plus may or may not do anything on server side
__________________
1 In this case, the process was only calling brk consecutively. You may argue that even though their virtual memory size (from ps and pmap) showed 500MB, they didn't use 500MB physical memory, because brk only allocates that much memory in virtual address space, not in physical memory which would happen when pages in virtual memory are first referenced later. (See J. Mauro, R. McDougall, Solaris Internals, p.135) Well, remember that allocated virtual memory consumes total swap space, memory- plus disk-based; when it is too low, no new process may be created. Also remember that every process has a memory resource limit (ulimit -v). Suppose it's unlimited or 2 GB on a 32-bit OS, the process will be killed -9 and core dumped when it reaches the limit. In fact this is exactly what happened to us when we encountered the Livelink Workflow memory leak bug in its Version 9.0.0; the process reaches 2 GB size in 70 minutes very predictably because its heap size increases by 500 K every second.
3 But on the Veritas Filesystem, it's possible that fstyp -v and df -g report different block sizes. According to a Veritas note, Solaris bundled df is not reading the vxfs filesystem superblock correctly; the block size from fstyp -v or the "fragment" size from df -g should be trusted.
4 M.J. Bach, "The Design of the UNIX Operating System", p.97.
5 In the old s5 filesystem (s5fs), an inode lock is exclusively acquired even during a read() call so that no other process can acquire the same lock on the same file for read(). Many UNIX vairants introduced the shared lock concept for read(). Solaris 2.6 /usr/include/sys/fs/ufs_inode.h has this section:
6 In case the message on groups.google.com is gone, it's repeated here:
* i_rwlock: Serializes ufs_write and ufs_setattr request
* and allows ufs_read requests to proceed in parallel.
* Serializes reads/updates to directories.
indicating an inode lock for file read is shared, not exclusive. SunOS4 inode.h does not seem to differentiate between shared and exclusive. Bach's description is based on s5fs. I checked the Magic Garden book. The description of the read() system call on p.297 and file system operation of VOP_RWLOCK on p.355 do not make a distinction of inode locks. Vahalia's Unix Internals mentions this as a footnote on p.269.
Four years ago I was involved with a database that a bunch of drongos
had set to 2K "because it was an OLTP system". Actually, it was a
3-tier client-server system, with a huge app server cache and heaps of
very, very large overnight batch processing. All file systems had
been set to 4K or 8K block size!
Once I finally cut through the "bureaucracy" of this bunch of idiots,
the db_block_size was set to 4K as a first try. All batch runs
immediately sliced the runtime into half with no visible or measurable
impact on the on-line stuff. Makes me wonder where these "guidelines"
came from...
Cheers
Nuno Souto
nsouto@bigpond.net.au.nospam
http://www.users.bigpond.net.au/the_Den/index.html
To my Computer Page