Tips on Handling Log File

1. How to get alert by reading log files

Many people write shell scripts to read and parse server log files, such as Oracle alert.log, Apache error_log..., and email them when certain strings in the log are found. The scripts are usually run as a cron job. But the problem with this approach is that it may incur heavy disk I/O especially when the file is already large, and depending on how often the cron job runs, it may not send the notification as quickly as you would like.

These shortcomings can be easily overcome. Try tail -f on the log file and pipe the result to an action command. tail -f reads the tail of the file once a second and sleeps in between. (The GNU version of tail allows you to adjust check interval with -s.) That's close to "immediate" response and uses much less I/O and CPU, regardless the log file size. Instead of a cron job, all you need to do is type a command like this (replace mail with mailx if needed)

#Monitoring and instant notification of Oracle ORA-600 error
nohup tail -f /path_to_bdump/alert_ORCL.log | perl -nle 'system("mail -s \"ORCL error\" yourname\@example.com < /dev/null") if /^ORA-00600/' &
But it may make more sense to put the command sequence in a script and worry about nohup at the time you launch it. So create a file named e.g. logmonitor which has
#Apache sends status 200 to user but not followed by data (i.e. "-" at line end)
tail -f access_log | perl -nle 'system("mail -s ChkApache you\@somecompany.com < /dev/null") if / 200 -$/'

and type the command nohup logmonitor &.

You can also let it email to you the line in the log instead of a blank message

tail -f access_log | perl -nle 'system("echo \"$_\" | mail -s ChkApache yourname\@example.com anotheremail\@yahoo.com") if / 200 -$/'
A more sophisticated example that skips some errors you don't care

#!/bin/ksh
#Usage: $0 OracleSID
#In older versions of Oracle, ORA-xxx may be ORA-00xxx or ORA-000xxx. Now it's always 5 digits.
cd /path_to_bdump
export SID=$1
tail -f alert_$SID.log | perl -nle '$SID=$ENV{SID}; system("echo \"$_\" | mail -s \"$SID Database Error\" yourname\@example.com") if (/^ORA-/ and !/^ORA-1543/)'

If you're more comfortable with awk than perl, replace perl with nawk or gawk (but probably not awk in case it has a broken system command)

tail -f alert_ORCL.log | nawk '/^ORA-00600/ {system("mail -s \"ORCL error\" youremail < /dev/null")}'

The disadvantage of using the method demonstrated so far to send alert is that it processes one line at a time; if you have 100 lines that match the pattern, you'll get 100 emails. While the traditional scan-the-whole-logfile approach easily groups the lines into one alert, here you have to do a little programming. We can modify logmonitor to be a full-fledged script action.pl or ck_alert_log.sh with some logic in it to achieve the goal.

tail -f /var/adm/messages | perl -nl action.pl
tail -f /path/alert.log | /path/ck_alert_log.sh

Here's the actual script I use at my work to check for ORA- errors. It avoids sending me too many emails if many errors occur in quick succession.

#!/bin/bash
#ck_alert_log.sh: Check alert.log for ORA- errors. Please run
#touch -t $(date +"%m%d")0000 /tmp/ck_alert_log.lastmail
#first, and then run this script as the following command (not as a cron job)
#nohup tail -s 60 -f /path/alert_SID.log | /path/ck_alert_log.sh &

export PATH=/bin:/usr/bin:/usr/sbin:/sbin
RECIPIENT="you@example.com yourbuddy@example.com"
DELAY=300 #No email is sent unless the previous email was sent at least $DELAY seconds ago
SID=mysid

#check for ^ORA- from stdin
while read ERR_LINE; do

#Modify the pattern in the case-block as needed
case $ERR_LINE in
ORA-*)

#if found, see if current time is at least $DELAY seconds newer than time of the marker file lastmail
if [[ ! -z $ERR_LINE ]]; then
  if [[ $(($(date +"%s")-$DELAY)) > $(ls -l --time-style=+"%s" /tmp/ck_alert_log.lastmail | awk '{print $6}') ]]; then
    #echo $ERR_LINE #debug
    echo $ERR_LINE | mail -s "ORA- error in alert_$SID.log" $RECIPIENT
    touch -m /tmp/ck_alert_log.lastmail #update the marker file to current time
  fi
fi;;

esac

done

Note that the script is very Linux-, or rather GNU-, centric, i.e. %s for date and ls, the optional -s 60 for tail.

To test, make sure your alert.log has an ORA- error recently. Create one with a silly command, e.g. alter database datafile 1 resize 1;, or exec dbms_system.ksdwrt(2, 'ORA-12345 test'). Since the script has never been run before, the timestamp file /tmp/ck_alert_log.lastmail doesn't exist. Create one with an earlier time: touch -t $(date +"%m%d")0000 /tmp/ck_alert_log.lastmail. Then run the command below (change the paths of course)

nohup tail -s 60 -f /path_to_bdump/alert_mysid.log | /path/ck_alert_log.sh) & > /tmp/ck_alert_log.out

There's one more little problem. Your tail -f... | cmd... command is not preserved across reboot. You could choose to start it from one of your rc scripts. However, I strongly recommend you create a cron job to start or restart these processes. The job just periodically checks for their presence and starts them if they died for any reason. A crontab entry like this will do:

5 * * * * [[ $(ps -ef | grep ck_alert_log | grep -v grep) ]] || (tail -s 60 -f /path_to_bdump/alert_mysid.log | /path/ck_alert_log.sh) & > /tmp/ck_alert_log.out 2>&1

That once-per-hour job wakes up and checks a process named like ck_alert_log. If it's missing, it starts tail -f alert.log | ck_alert_log.sh. Good practice in setting up a cron job calls for logging anything, if any, to a file such as /tmp/ck_alert_log.out instead of /dev/null or $MAIL for easy troubleshooting.

If you don't want that, at least set up a job to only remind you to start the processes:

5 * * * * [[ $(ps -ef | grep ck_alert_log | grep -v grep) ]] || (echo 'Please run: tail -s 60 -f /path_to_bdump/alert_mysid.log | /path/ck_alert_log.sh &' | mail -s 'Restart ck_alert_log.sh' yourname@example.com)

If you're on Windows, then you have to find a tail implementation. Good candidates are in the following

UNIX emulation software


2. How to read a large file

If you have a several hundred MB log file, how do you quickly find a portion of it you're interested in? Your vi may not be able to open the file, complaining file too large, and/or take too long and use too much memory. Here's what I usually do.

Let's say you want Mon Jul 15's data from July's access_log, here's the command

sed -n '/^Mon Jul 15/p; /^Tue Jul 16/q' access_log > /tmp/qq
Without the quit command, sed would scan the file to the end or till you press ^C. Note the undocumented semicolon that allows you to put two sed commands on one line. (By the way, on Solaris creating a file under /tmp may be faster because tmpfs should be memory-based, unless you're short on RAM. But remember to delete big files under /tmp when you're done because they reduce available swap space.)

If you know the data you want is not too far from the end

tail -1000 access_log | head -1

If you still see lines too late, increase 1000 to 5000, 10000 and so on. But it should not be too close to the head of the file. You should always have an idea of when the file starts and when it stops. If you know it only contains records for July, then it's of course July 1 to July 31. But even then you have to take into account very uneven data distribution. If you don't know the start and end, do run head -1 access_log and tail -1 access_log first.

Unless you have to, do not start with head as in head -1000 access_log | tail. That would actually print 1000 lines to the pipe reader which only cares about the last 10 lines. Count the total lines (wc -l access_log) only if you have to. And generally do not split the file into chunks (split -b 10000000 access_log).

Needless to say, pattern match should use the most restrictive regular expression when dealing with large files. This means that if you search for all entries in Apache access_log for the client 172.168.123.45 and you know the IP is at line beginning, you should use /^172\.168\.123\.45/ instead of /172.168.123.45/. egrep may be faster than grep, which is in turn faster than fgrep. Therefore either of these two commands are OK

egrep '^172\.168\.123\.45' access_log
sed -n '/^172\.168\.123\.45/p' access_log

Subtle differences of regexp's in different tools are clearly explained in Jeffrey Friedl's Mastering Regular Expressions.

To my Computer Page