Process-Oriented Performance Monitoring

Many performance monitoring tools on the market or in the freeware world focus on the system. But if you're asked "My xyz program ran very slow at 10pm last night. What was wrong?", system-wide data are rarely sufficient for you to make a conclusion. Perhaps only one out of a hundred times it was due to something completely unrelated to a running process, such as bad memory chips. Most of the time, you need to look at the processes at OS level and database sessions inside Oracle that existed at the moment in order to say "The xyz process suddenly used a huge amount of memory", or "The SQL execution plan for your application changed last night". Without this process specific information, all you can say, based on sar output or the data collected by your expensive monitoring tool, is "Something caused severe memory shortage", or "There was a big CPU spike", but you don't know what application did it. Fortunately, it only takes a small effort to remedy this. That's what I call process-oriented performance monitoring.

The perfmon.sh script shown below is just one of many ways you implement this philosophy on a Linux or UNIX system. It records OS process performance data with top and Oracle session performance data by querying v$session and v$process (if pre-10g, add v$session_wait). The beauty of combining top and Oracle wait event check into one report is that you can easily cross-check. This is important because it follows a performance analyst's natural flow in troubleshooting. If you're notified about a slow running application, after initial verification, the first thing you as a DBA will do on the database server is probably to run top. If there're indeed some oracle processes using lots of CPU as is often the case, you find the Oracle sessions attached to these processes. The provided perfmon.sh offers you the capability to do after-crisis analysis as if the problem was happening right now. It runs as a shell script preferably scheduled as a cron job with once per minute interval. (If higher execution frequency is desired, the script can be modified to run the main logic twice or more times with sleep n in between, where n is a little less than 60 devided by the number of times.)

#!/bin/bash
#perfmon.sh: Oracle session and OS process correlated performance monitoring script.
#Example assumes a 4 node RAC where hosts are named like host[1-4] and this script runs on host4.

export ORACLE_SID=ORCL
export ORACLE_HOME=/u01/app/oracle/product/10.2.0/db
export PATH=$ORACLE_HOME/bin:/usr/bin:/bin
TIMEOUT=10 #this many seconds allowed to connect to another host of RAC
cd /home/oracle/scripts/perfmon

date "+%Y%m%d%H%M%S" >> perfmon.log

for i in 1 2 3; do
  echo -e "\nhost$i\n" >> perfmon.log
  ssh host$i "COLUMNS=132 top -cb | head -15 | sed 's/  *$//' | grep -v ' top -cb$'" >> perfmon.log &
  p=$!
  (sleep $TIMEOUT; kill -1 $p) &
  k=$!
  wait $p
  exit=$?
  #Normal exits are 0..127, signals are 128+signo
  case $exit in
    129) echo '(timed out)' >&2;;
    *) kill $k;;   #  Kill the killer.
  esac
done
echo -e "\nhost4\n"  >> perfmon.log #This script runs locally on host4.
COLUMNS=132 top -cb | head -15 | sed 's/  *$//' | grep -v ' top -cb$' >> perfmon.log

#Feel free to exclude other unwanted events in the query
sqlplus -S / as sysdba <<EOF >> perfmon.log
set pagesize 1000 linesize 200 feedback 0
col i for 9
col sid for 9999
col evnt for a30
col usr for a15
col mach for a20
col prog for a20
col spid for a5
select a.inst_id i,sid,substr(a.username,1,15) usr,substr(machine,1,20) mach,substr(a.program,1,20) prog,sql_id,prev_sql_id,substr(event,1,30) evnt,p1,p2,p3,spid,pga_alloc_mem pga
from gv\$session a,gv\$process b where a.paddr=b.addr and a.inst_id=b.inst_id and event!='PX Deq: reap credit' and wait_class!='Idle' order by 1,3;
exit
EOF

There's really no magic here and many shops are already doing this. The above script is for a 4-node RAC whose hosts are named like host[number] and this script runs on host4, saving OS process and Oracle session data to a log file on host4. The sleep $TIMEOUT and kill logic is to prevent hanging ssh processes in case of node hang (code snippet from newsgroup). Because of the message about process killed (Terminated), the cron job is better set up like ... perfmon.sh > /dev/null 2>&1, or replace /dev/null with a file for debugging.note0 Remove the pink text , if you don't run RAC, or don't want to stress the network by ssh (therefore you run each copy of the script on each host locally). If you like, you can modify the script to include not just top CPU processes, but all processes whose sessions are captured in the SQL section (i.e. non-idle wait sessions); alternatively, you can include not just non-idle wait sessions, but all sessions whose Oracle server processes show up in top.

top is just an example program to gather OS process performance data. You can use lighter-weight prstat on Solaris, or topas on AIX, or more powerful atop on Linux (e.g. to get process I/O statistics). top is run in batch mode (-b) so it can be logged, and only the top few processes are recorded. -c shows a more complete command, up to 132 characters on one line (COLUMNS=132). Since new version of top has no RPM as of March 2008 (Ref), the syntax top -cb number is not used at this time. Note that Microsoft Windows would be very different so I don't give an example script here.note1

If your database is 9i or earlier, add v$session_wait to the SQL and join on sid to v$session. You also need to exclude individual idle waits because there's no wait_class column anywhere, e.g. (see "Current session waits" at SQLsFreqUsed.txt):
where not (event like '%message%' or event like 'SQL*Net%' or event like '%timer%' or event like 'jobq%')
There's no hard and fast rule about what event is idle and what is not. For instance, 10g v$event_name groups about half of 'PX%' events into 'Other' wait class. Should they be in 'Idle' class instead? And blindly ignoring all 'SQL*Net%' events may be bad because frequent 'SQL*Net % to client' may indicate a real network problem. Nevertheless, I exclude PX Deq: reap credit because the gv$ view query from perfmon.sh itself always causes that wait on other instances.

The output of the perfmon.sh script is like
20080327124601
...
host2

top - 12:46:04 up 131 days, 11:49,  1 user,  load average: 1.17, 1.13, 1.25
Tasks: 312 total,   4 running, 308 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.7% us,  3.9% sy,  0.0% ni, 73.5% id,  8.5% wa,  0.1% hi,  0.3% si
Mem:   8293488k total,  7828272k used,   465216k free,   332448k buffers
Swap:  8388576k total,  2264096k used,  6124480k free,  3918592k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28543 oracle    22   0 5233m 141m 129m R   83  1.7   0:00.54 oracletest (LOCAL=NO)
 8272 oracle    15   0 5449m 2.3g 2.1g D   22 29.6 943:24.05 ora_pz99_test
10098 oracle    16   0 5231m 2.2g 2.2g D    7 27.9  28:15.04 oracletest (LOCAL=NO)
28548 root      16   0  4976 2912 2320 S    3  0.0   0:00.01 /bin/sh /etc/init.d/init.cssd fatal
 4760 oracle    15   0 5228m 108m  99m S    3  1.3   0:00.89 oracletest (LOCAL=NO)
 8302 oracle    15   0 5231m 102m  91m S    2  1.3   0:01.75 oracletest (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
 8451 oracle    15   0 5231m 227m 216m S    1  2.8   0:07.85 oracletest (LOCAL=NO)

 I  SID USR             MACH                 PROG                 SQL_ID        PREV_SQL_ID   EVNT                                   P1         P2         P3 SPID         PGA
-- ---- --------------- -------------------- -------------------- ------------- ------------- ------------------------------ ---------- ---------- ---------- ----- ----------
 2  755 CB              EXAMPLE\APPSERVER001 wwwpx.exe            9ss1vsmq53ucs fqwpnzvugsgs6 gc cr request                         173      10806          1 28543    8188249
 2  626 ABC_USER        EXAMPLE\APPSERVER2   abc.exe              8jt2wr0d0r7tg cq7an5qmrb1bq gc cr request                          44     128581          1 10098    5132225
 2  554 SYS             host1.example.com    oracle@host2.exampl.                             db file scattered read                  3     333341         16 8272   217796545
 3  816 LOGISTICS_USER  ECP\CSNETTRANS       wwwpx.exe            2aj1gub45fjs5 f4gkzwsdd78gk db file sequential read                53       6110          1 29580    6025561
...

Each snapshot begins with a timestamp. Below the name of the host on which this snapshot is taken are the top 15 rows of top output. I usually pay attention to load average, total tasks, user and sys CPU, and free memory. Then comes the process part, which is the main focus of this article. top process list offers two benefits: CPU usage sorting (which could be achieved inside sqlplus as well but with expensive calculation, i.e. sorting on sampling deltas of 'CPU used by this session' or 'CPU used when call started'), and spotting occasional non-database CPU intensive processes (which sqlplus cannot do).

For instance, here OS process 28543 has used 83% CPU in the last sampling period. In the SQL sectionnote2 below, you see that it corresponds to an Oracle session with SID 755 on instance 2. This session is waiting on global cache CR request for file 173 and block 10806. What exactly is the session doing? Given the sql_id, check v$sql% (if checked frequently and only stats are needed, v$sqlstats in 10g is preferred over v$sql or v$sqlarea because the query requires no latching and the SQL text stays in it longer). For 10g, if the SQL is gone even from v$sqlstats, you can check to see if it was captured by AWR, specifically dba_hist_sqltext and a few other dba_hist_sql% views.note3

Process memory is harder. Suppose you see low free memory in top and don't see a non-database process using much memory, focus on Oracle session's PGA.note4 Since top runs in batch mode, you would have to launch top another time to show the top memory consumers (pipe top -b to sort on %MEM column), if it's absolutely needed.note5 To better assess process memory usage, it may be better to look inside Oracle unless there's a non-database memory hogger. For instance, the Oracle session for OS process 8272 uses about 207 MB PGA private memory, i.e. 26% of 800 MB pga_aggregate_target on this database. Does the session use a huge PL/SQL table, or it temporaily overshot the limit indirectly set by pga_aggregate_target? Unfortunately, the sql_id's and prev_sql_id's for pz99 (and pz98,pz97,...) are not recorded in v$session; these are the RAC parallel sessions initiated by a parent session from another instance querying a gv$ view (this relationship is recorded in v$px_session). For this particular problem, you can narrow down to the segment or extent based on the wait event parameters, and keep an eye on this process next time. Since it runs as SYS and the process name is pz99, it could be a system management session from a DBA's script or emagent. Since OS pid is known, check to see if Oracle bdump or udump has a trace file for this process. The special Linux process state D is not a good sign, perhaps indicating bad or stressed I/O.

For those enjoying 10g active session history (ASH) and automatic workload repository (AWR), enjoy them! ASH is a wonderful process- as well as system-oriented monitoring system. Once every second, it records performance data in v$session and other dynamic performance views, actually by directly attaching to their base table data structures in shared memory. In spite of the sophistication, however, there're a few reasons you may want to use your own monitoring tool instead of ASH. Additional license cost is a major one. Concern about its overhead may be another. Keep an eye on mmon, m000 and mmnl background processes' metrics (excessive CPU usage? UGA leak?...) Increase _ash_sampling_interval if needed. If you do decide to disable ASH, set _ash_enable to false. Do not simply set statistics_level to basic without considering its complex implications.note6 However, bear in mind that ASH and AWR are great technologies. If the DBA team relies on Enterprise Manager, and you have the license, there's even more reason to keep them on. Unless the database is genuinely short of power, you can additionally run my perfmon.sh. This offers lots of unique advantages. The ease of cross-checking top OS processes against Oracle sessions aside, since you can modify the script, you have maximum freedom to decide what you want or don't want; perhaps you want some officially idle waits to be included, or non-idle waits to be excluded, or less (or more) columns to care about in v$session than ASH does, and more detailed info about a particular application, less sampling at night. The list goes on.

Process-oriented performance monitoring emphasizes monitoring at the granularity of processes. In Oralce 10g, DBA's can use services to manage and monitor applications. Services in essence group processes or sessions in a more business-meaningful way. It lies in the middle between the entire system and individual processes on the scale of granularity. Service-oriented performance monitoring is achieved with numerous v$service% views. But of course this approach lacks "cooperation" on the OS side, missing a link between the OS process and the Oracle service. It can't completely replace process-oriented monitoring.

In summary, process-oriented performance monitoring focuses on processes, instead of the system. "Process" is a general term in this context. The idea is not specific to post-crisis Oracle performance analysis, but has relevance to any application that contains more or less independent worker threads, of which database sessions are examples. Windows or UNIX systems need to record top OS processes. Mid-tier app servers need to record resource usage of individual threads for individual business functions, if possible. Without this process-level information, it's difficult to track down what has dragged down the overall performance.

__________________
note0 My actual crontab is as follows, capturing any possible "fall-out" message to a file in /tmp, and overwriting the log of the same day last month:

$ crontab -l | grep perfmon
* * * * * /u01/app/oracle/scripts/perfmon/perfmon.sh > /tmp/perfmon.out 2>&1
59 23 * * * /bin/mv /u01/app/oracle/scripts/perfmon/perfmon.log /u01/app/oracle/scripts/perfmon/perfmon.log.$(/bin/date +'\%d') > /tmp/rotateperfmonlog.out 2>&1
note1 Windows runs Oracle sessions as OS threads inside oracle.exe process. So the top or top-like program needs to sort on CPU sampling deltas of the threads of the process. But then in order to catch occasional or accidental non-Oracle top CPU processes, the top program also needs to watch at the process level. These two tasks are not easy to accomplish with one sorting output. In any case, if needed, I can write two separate programs to achieve the two goals; there's no program I know of that can write top Windows processes or top threads to the console or into a text file.

note2 My example perfmon.sh script would not create the SQL section if the SQL didn't find even one single matching row that met the where condition.

note3 There're various reasons a SQL is not recorded in v$sql% views, such as a recursive SQL, a DDL, or a pseudo cursor, as operated on a LOB column directly. In that case, you can see the cryptic "SQL" text like table_4_200_4139_0_0 in v$open_cursor or v$db_object_cache, and of course the library cache fixed table x$kglob or its derived ones (this particular one is a SQL reading (4) LOB of a table whose object_id is 0x4139).

note4 The example is from a Red Hat Linux 4 with kernel 2.6.9. Since HugePages is not enabled on this server and lock_sga is default false, top SHR shows shared memory in resident working set only; otherwise it would show the Oracle SGA, which is 5008m in this particular instance, plus a little other shared memory.

note5 You can sort processes that do not use shared memory easily on resident memory. If they use shared memory as Oracle processes do, the command is more involved and depends on Linux version. For all these commands, see this note.

note6 Some companies demand absolutely top raw horsepower of the database engine for their applications by shutting down anything fancy that comes with Oracle. An easy way to achieve this is to set statistics_level to basic. Do not do this unless (1) the application is already well written (all except data warehouse jobs use bind variables, SQLs are reviewed by SQL tuning specialists, etc.), and (2) the whole DBA team is sufficiently technical. The reason is that this non-default, "light-weight", statistics_level not just turns off ASH and various advisories, but also freezes many other important statistics. See this document for all details.


To my Computer Page
To my OraNotes Page