[warning: written in 2002 through 2005]

HOW IS DIO ENABLED ON SOLARIS BY ORACLE FILESYSTEMIO_OPTIONS?

Direct I/O can be enabled in two ways: with the forcedirectio mount option (which you can see in mount command output afterwards), or with the directio(3C) function call. Since Oracle's filesystemio_options parameter obviously can't control the mount option, it must be making a directio() call. A simple program like this:

$ cat dio.c
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
  int fd, rc;
  fd = open("testfile", O_RDONLY);
  rc = directio(fd, DIRECTIO_ON);
  printf("%d", rc);
  return 0;
}
$ gcc dio.c
$ touch testfile
$ truss a.out

generates the following `truss a.out' output:

...
open("testfile", O_RDONLY)                      = 3
ioctl(3, _ION('f', 76, 0), 0x00000001)          = 0
...

I compiled dio.c with gcc. My testfile is on UFS (x86 Solaris 10). The sign of a directio() function call in this system call trace is _ION('f', 76, 0) used as the cmd argument to ioctl() (with 0x1 after it for DIRECTIO_ON). [According to the Solaris 8 source code /src/osnet_volume/usr/src/lib/libc/port/gen/directio.c, this ioctl() argument is _FIODIRECTIO, defined in /usr/include/sys/filio.h as _IO('f', 76). I'm not exactly sure about _ION but it looks like a variant of _IO with an additional number argument.]

Anyway, if you truss an Oracle server process when you run

alter system set filesystemio_options = directio;

inside Oracle, watch for ioctl() with _ION as its second argument. Alternatively, try `truss -u:directio a.out' to focus on this library function. In the test of my dio program, I see

/1:     open("testfile", O_RDONLY)              = 3
/1@1:   -> libc:directio(0x3, 0x1, 0xd2ff260c, 0xd2f008e9)
/1:     ioctl(3, _ION('f', 76, 0), 0x00000001)  = 0
/1@1:   <- libc:directio() = 0

Or try `truss -t!all -u:directio a.out' to remove all syscalls from the output and make it cleaner.

Chao_ping's test of Oracle on Solaris: file handles 26200 and 26201 are datafiles on UFS; the others are on VxFS with Quick I/O. _filesystemio_options=directio. truss on dbwr shows:

ioctl(26201, _ION('f', 76, 0), 0x00000001)      = 0

And with your second test plan:

[oracle@ebaysha2**8i]$grep direct dio3.log
/1@1: -> libc:directio(0x6655, 0x1, 0xfffffffe7e355e40, 0x746571fefefefeff)
/1@1: <- libc:directio() = 0
/1@1: -> libc:directio(0x6654, 0x1, 0xfffffffe7e355e40, 0x746571fefefefeff)
/1@1: <- libc:directio() = 0
/1@1: -> libc:directio(0x6653, 0x1, 0xfffffffe7e3559e0, 0x746571fefefefeff)
/1@1: <- libc:directio() = 0
/1@1: -> libc:directio(0x6652, 0x1, 0xfffffffe7e3559e0, 0x746571fefefefeff)
/1@1: <- libc:directio() = 0

A related question: how do we find out whether a process called directio(3C) at some point in the past, rather than catching directio() at the moment it is called? The pfiles command may not reveal everything in the in-core inode, which is where I believe directio() makes its change. On old Solaris versions, you can use crash(1M) to look for a difference in the user area dump (u command) for the process, possibly followed by a file command (some common usage can be found at http://yong321.freeshell.org/computer/SolarisMdb.txt). Newer versions replace (and augment) crash with mdb, but I haven't played with it much.

DIO ON LINUX

Linux direct I/O can be verified by running strace on Oracle processes and watching for the O_DIRECT flag in open(2) calls. See my note ./OCFS.txt for an example.
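For comparison with the Solaris dio.c above, here is a minimal sketch of the Linux counterpart (my illustration, not part of the original tests; the file name "testfile" and the 4096-byte alignment are assumptions, as some devices accept 512):

$ cat odirect.c
/* Request direct I/O on Linux by passing O_DIRECT to open(2).
   O_DIRECT requires the buffer, file offset and transfer size to be
   suitably aligned, hence posix_memalign below. */
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main()
{
  void *buf;
  ssize_t n;
  int fd = open("testfile", O_RDONLY | O_DIRECT);
  if (fd < 0) { perror("open"); return 1; }
  if (posix_memalign(&buf, 4096, 4096)) return 1;
  n = read(fd, buf, 4096);      /* bypasses the page cache */
  printf("%zd\n", n);
  free(buf);
  close(fd);
  return 0;
}
$ gcc odirect.c

Under `strace ./a.out', watch for something like open("testfile", O_RDONLY|O_DIRECT) = 3; that flag is the Linux analogue of the _ION('f', 76, 0) ioctl signature above.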
Although not documented, a caveat similar to the one in the Solaris directio(3C) man page may still apply, i.e. the last open(2) call with O_DIRECT by any process on a given file determines whether the file content is placed in the page cache, although it may be that content already cached stays cached. Think about it: what happens if process P1 opens file F without O_DIRECT and places its content in the page cache, P2 opens F with the flag, and P3 opens F without it? Is F's content in the page cache or not? Regardless of whether the file content is in the page cache, O_DIRECT on open(2) probably does advise subsequent read() and write() calls of the current process to bypass the page cache, even if the cached content is readily available.

[update 2011-06] I believe at least partially because of Linus Torvalds' insistence, ext2 and ext3 (and ext4 as well?) do not support a direct I/O mount option. Unless you use a filesystem that supports this mount option (vxfs etc.), the only way for a write() not to contaminate the page cache is the O_DIRECT flag on all open() calls (Linux doesn't have a directio() library function). But most programs don't do that: cp, mv, gzip, etc. (Watch Cached in /proc/meminfo decrease while running those commands on a big file, and eventually kswapd show up in `top'.) Luckily, Oracle does, with the help of the filesystemio_options parameter.

DIO PERFORMANCE TEST ON SOLARIS

My test on Solaris 2.6 shows that directio on UFS does make a difference. Here are my notes:

The database db_block_buffers is 2048 and db_block_size 8192, so the buffer cache is 16m, except in Test IV below. OS is Solaris 2.6 on sun4u. UFS file system. All datafiles, logfiles and controlfiles are in one directory. A table called large is created as select * from dba_objects where 1=2. Then insert into large select * from dba_objects is executed many times until the table size is about 170m (from user_segments.bytes). The query is select /*+ full (t) */ count(*) from large t. It's run three times under each condition. Timings in seconds of the three runs are separated by ";".

I.   Just mounted filesystem, startup db:                      11.37; 4.46; 4.06
II.  Shutdown db, umount, mount, startup db:                   11.18; 4.56; 4.86
III. Shutdown db, umount, mount -o forcedirectio, startup db:  19.99; 19.19; 19.29
IV.  Same as III but change db_block_buffers to 32768:         19.39; 2.14; 2.03

Observation: The first access of the table takes 11 seconds with filesystem caching (Test I) and 20 seconds with direct I/O (III). The second access takes about four and a half seconds with filesystem caching (I). With direct I/O, if the Oracle buffer cache is too small, it still takes 19 seconds (III); with a sufficiently large cache, it takes 2 seconds (IV).

Conclusion: If your data access pattern is well defined, and if the amount of data can be fully cached in the Oracle buffer cache, then DIO is better, because the second time (not the first time) the same data is accessed, it's fast. If neither condition is met, I'll take the lazy man's approach and let the filesystem intelligently manage it.
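The cache effect in Tests I through III can be reproduced at the file level without Oracle. The following is a sketch of mine (Solaris-specific because of directio(3C); the 8k read size merely mirrors db_block_size): it scans a file twice with buffered I/O, then advises direct I/O and scans twice more. The second buffered scan should be fast, served from the page cache, while both direct scans should take roughly the cold-read time, mirroring the drop from 11.37 to 4.46 in Test I versus the flat 19 seconds in Test III:

$ cat scan2.c
/* Time sequential scans of a file, buffered first, then with
   directio(3C) advice. Pass the file to scan as the argument. */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/fcntl.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

static double scan(int fd)          /* seconds for one full scan */
{
  char buf[8192];
  struct timeval t0, t1;
  lseek(fd, 0, SEEK_SET);
  gettimeofday(&t0, NULL);
  while (read(fd, buf, sizeof buf) > 0)
    ;
  gettimeofday(&t1, NULL);
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(int argc, char **argv)
{
  double a, b;
  int fd;
  if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0) return 1;
  a = scan(fd); b = scan(fd);
  printf("buffered: %.2f then %.2f seconds\n", a, b);
  directio(fd, DIRECTIO_ON);        /* advise DIO from here on */
  a = scan(fd); b = scan(fd);
  printf("direct:   %.2f then %.2f seconds\n", a, b);
  close(fd);
  return 0;
}

Any file of a few hundred megabytes will do; run it against a file on a forcedirectio mount as well if you want the file-level equivalent of Test III from the start.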
******** Unrelated: Testing effect of sticky bit on datafiles ********

Adrian Cockcroft and Richard Pettit say turning on the sticky bit for database files makes the inode updates for the files asynchronous and is completely safe. It "may help get back some of the performance" ("Sun Performance and Tuning", 2nd ed. p.165). But in a March 6, 2002 eSeminar hosted by Quest Software, Richard dismissed this trick as old and probably no longer valid. I did the following test. No performance difference is found with or without the sticky bit on database files.

Experimental:

The database db_block_buffers is 2048 and db_block_size 8192, so the buffer cache is 16m. OS is Solaris 2.6 on sun4u. UFS file system. All datafiles, logfiles and controlfiles are in one directory. A table called large is created as select * from dba_objects where 1=2. Then insert into large select * from dba_objects is executed many times. In SQL*Plus, I set timing on for each insert. The time for each insert is shown below.

Initially without sticky bit: 06.60, 07.92, 07.02 seconds
Turned on sticky bit (chmod +t *, must run as root): 07.02, 08.62, 08.02, 07.81 seconds
Turned off (chmod -t *): 08.53
Turned on again: 10.85, 06.10, 09.83, 08.03, 08.42, 08.73, 07.93, 08.03, 08.23, 09.34, 08.53
Turned off: 08.23, 08.93, 08.72, 09.14, 08.93, 08.03, 09.43, 07.82, 08.34, 08.32

Conclusion: There's no performance difference with or without the sticky bit on database files.

***********************************************************************

DIO ON TRU64

Subject: SUMMARY: Extremely slow Oracle performance with Apps 11.0.3 after OS upgrade
From: tackenhu AT fnal.gov
Date: Sun, 22 Dec 2002 23:37:35 -0600

> We believe the core cause of this performance degradation was the change
> in the default behavior of the Oracle database's use of the filesystem
> when run on HP/Compaq Tru64 5.1A UNIX (as opposed to on the original
> version 4.0F). On 5.1A, Oracle uses direct I/O by default (avoiding any
> operating system filesystem caching), whereas on 4.0F, Oracle doesn't do
> this (and, in fact, the OS doesn't support direct I/O) so it uses the
> filesystem cache.
>
> In our case, this difference in filesystem cache usage dramatically
> impacted a number of queries, most of which manipulated a large quantity
> of data by repeatedly reading the same data over and over (poor query
> design, but hadn't caused problems in the past). It would appear that
> prior to the 5.1A upgrade, the performance of these queries was being
> propped up by heavy use of the AdvFS filesystem cache at the OS level.
> Once we went to 5.1A, this cache wasn't being used (because Oracle was
> using direct I/O) and our queries fell victim to extreme I/O waits, since
> the SGA DB_BLOCK_BUFFERS weren't large enough to buffer as much data as
> had been buffered in the filesystem buffer pre-OS upgrade.
>
> From a purist standpoint, the problem could be considered a combination of
> several poorly designed queries (many repetitive reads on a large quantity
> of data) combined with an undersized data buffer cache for the instance
> (causing it to depend heavily on OS filesystem caching). The OS
> upgrade's effect of shifting Oracle I/O to direct I/O made these causes
> yield drastic performance slowdowns.
>
> Our solution was to dramatically increase our DB_BUFFER_CACHE size (by
> approximately 20 times). Another possible solution is to disable direct
> I/O usage by Oracle RDBMS (which can be done in 8.1.7.2 by the
> _TRU64_DIRECTIO_DISABLE parameter, which via an Oracle patch can also be
> made available in 8.1.7.1) - since this solution required a patch and our
> initial testing got mixed results using this, we opted for the data buffer
> cache increase (since we could make the memory available to support this).
> We continue to test the direct I/O parameter in a more controlled
> environment (dedicated test box set up for just this purpose) in case we
> need a fallback, but our performance is now back to roughly the same as
> pre-upgrade.
>
> We are also working to have the development team examine these queries
> that do the repetitive reads (according to some trace analysis, roughly 3
> million blocks read to generate 8000 rows of output - when you consider
> the entire database is less than 14 Gb (and 3 million blocks would be
> around 22 Gb), it is safe to say the query isn't optimizing its reads).
>
> Once again, thanks for everyone's suggestions. For anyone planning a
> 4.0F to 5.X upgrade for a system running Oracle databases, I would
> strongly suggest you look closely at performance for large jobs and
> examine how caching is being used - the direct I/O change can have
> dramatic effects.

When direct I/O is disabled (the default), performance is not very sensitive to the Oracle buffer cache size. When direct I/O is enabled and the buffer cache is smaller than the table I do a full table scan on, performance is worse than with direct I/O disabled. But when the buffer cache is larger than the table size, with direct I/O the *second* time the table is scanned it's very fast. So if all conditions are met, we should use direct I/O and increase the Oracle buffer cache, as other experts say (see Ixora News - March 2002).

However, this strategy is easier said than done. To take full advantage of the memory otherwise used as the filesystem page cache, you need to increase the Oracle buffer cache to the point where your system will almost start to swap. I haven't done this, but I imagine you have to carefully calculate the memory usage of all other applications on the system, and most likely rebuild the Oracle kernel and use a 64-bit OS to get a large SGA. On Solaris, the filesystem page cache is automatically sized by the OS to use all available physical memory, but the Oracle buffer cache is sized by you, and too big a size causes swapping. Direct I/O also disables the filesystem's intelligence of prefetching data on sequential reads and delaying data writes. So its optimal use takes a lot of work, on a system where activities are fairly predictable and data have been scanned into the Oracle cache prior to peak usage time.

BTW, I have some questions for you:

1. Earlier you said you had a 95% buffer cache hit ratio. You increased the cache by 20 times. What's the ratio now?

2. I assume you're saying _TRU64_DIRECTIO_DISABLE was true when you used Tru64 4.0F and changed itself to false when you upgraded the OS to 5.1A, with the same version of Oracle. Is that correct?

3. Direct I/O is a filesystem mount option. Did the OS upgrade automatically enable the option, or did somebody manually enable it? On Solaris, even if the filesystem is not mounted with direct I/O, applications can still advise the filesystem so on a per-file basis. Maybe that's what Tru64 5.1A does?

4. Why did manually changing _TRU64_DIRECTIO_DISABLE give you mixed results?

Yong Huang

Hi, Yong,

In answer to your questions from 12/26/2002...

> 1. Earlier you said you had a 95% buffer cache hit ratio. You increased
> the cache by 20 times. What's the ratio now?

What I meant to say was the buffer cache hit ratio was about 55% pre-SGA (DB_BLOCK_BUFFER) increase and 95% post increase.

> 2. I assume you're saying _TRU64_DIRECTIO_DISABLE was true when you used
> Tru64 4.0F and changed itself to false when you upgraded the OS to 5.1A,
> with the same version of Oracle. Is that correct?

Roughly, yes. _TRU64_DIRECTIO_DISABLE was irrelevant on Tru64 4.0F as the OS didn't provide any direct I/O support - so direct I/O was never used.
Once the OS was upgraded to 5.1A, Oracle by default used direct I/O, and this behavior could only be disabled with the _TRU64_DIRECTIO_DISABLE Oracle parameter.

> 3. Direct I/O is a filesystem mount option. Did the OS upgrade
> automatically enable the option, or did somebody manually enable it? On
> Solaris, even if the filesystem is not mounted with direct I/O,
> applications can still advise the filesystem so on a per-file basis.
> Maybe that's what Tru64 5.1A does?

I believe Solaris must differ from Tru64 in this regard. Direct I/O is not a filesystem mount option in Tru64 (at least on AdvFS volumes, which we use). Rather, use of direct I/O is determined by passing (or omitting) a direct I/O flag to the open() system call when opening a file. Therefore, to utilize direct I/O, a program must specifically open the file for direct I/O (which is a code modification that Oracle made to their RDBMS code - as early as 8.0.6.3, I believe, and including all 8.1.7 code streams).

> 4. Why did manually changing _TRU64_DIRECTIO_DISABLE give you mixed
> results?

We believe this was due to the other processing on the box (we have 8 other database instances on our development box), which would affect the contents of the filesystem cache and overall performance. To get reproducible results, we believe we would need a dedicated system, but we couldn't halt all development on the system for this testing.

ANOTHER DIO TEST ON SOLARIS

$ uname -a
SunOS countfleet 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-2
$ mount | grep ^/f[12]    # /f2 has DIO turned on
/f1 on /dev/dsk/c0t1d0s0 setuid/read/write/largefiles on Wed Jan 15 16:17:29 2003
/f2 on /dev/dsk/c0t1d0s1 forcedirectio/setuid/read/write/largefiles on Wed Jan 15 16:17:29 2003
$ grep maxphys /etc/system
set maxphys = 1048576

Database 9.0.1.3:

create tablespace test datafile '/f1/oradata/tiny/test.dbf' size 400m
extent management local uniform size 32k;

Run three times, this took 35, 36 and 36 seconds, respectively. The same command with f1 changed to f2 took 25, 27 and 26 seconds, respectively, about 9 seconds faster. /f1 is regular UFS and /f2 is DIO UFS. While the tablespace is being created on /f1, truss is run against the shadow process, and the second run shows:

$ truss -c -p 9704
^C
syscall              seconds   calls  errors
read                     .00       1
write                    .00       3
open                     .00       2
close                    .00      10
time                     .00       2
lseek                    .00       2
times                    .03     282
semsys                   .00      31
ioctl                    .00       3       3
fdsync                   .00       1
fcntl                    .01      14
poll                     .01     146
sigprocmask              .00      56
context                  .00      14
fstatvfs                 .00       3
writev                   .00       2
getrlimit                .00       3
setitimer                .00      28
lwp_create               .00       2
lwp_self                 .00       1
lwp_cond_wai             .03     427
lwp_cond_sig             .15     427
kaio                    5.49     469     430  <-- More kernelized IO time
stat64                   .00       3       1
fstat64                  .00       3
pread64                  .00      32
pwrite64                 .35     432  <-- Each pwrite() call takes 350/432 = 0.8 ms
open64                   .00       6
                        ----     ---    ---
sys totals:             6.07    2405     434
usr time:               1.71
elapsed:               36.74

When the tablespace is created on /f2:

$ truss -c -p 9704
^C
syscall              seconds   calls  errors
read                     .00       1
write                    .00       3
open                     .00       2
close                    .00      10
time                     .00       2
lseek                    .00       2
times                    .02     282
semsys                   .00      31
ioctl                    .00       3       3
fdsync                   .00       1
fcntl                    .00      14
poll                     .01     146
sigprocmask              .00      56
context                  .00      14
fstatvfs                 .00       3
writev                   .00       2
getrlimit                .00       3
setitimer                .00      28
lwp_cond_wai             .00     430
lwp_cond_sig             .03     430
kaio                     .50     462     430  <-- Much less kernelized IO time
stat64                   .00       3       1
fstat64                  .00       3
pread64                  .01      32
pwrite64                 .00     432  <-- pwrite calls take practically no time
open64                   .00       6
                        ----     ---    ---
sys totals:              .57    2401     434
usr time:               1.94
elapsed:               27.72

During the first run, the result on /f1 is even worse. But for a good benchmark, I usually ignore the first run.
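Incidentally, the pwrite64 line can be timed without Oracle. Below is a sketch of mine (the 8k block size and 432 calls merely echo the truss table above; the file paths are hypothetical). Don't expect it to reproduce Oracle's numbers: Oracle drove much of its I/O through kernel async I/O (the kaio line), whereas here each pwrite on the forcedirectio mount is a real disk write, while on a buffered mount it usually just copies into the page cache.

$ cat pwtest.c
/* Time 432 pwrite(2) calls of 8 KB each to the file given as the
   argument. Compare a file on /f1 (buffered) with one on /f2 (DIO). */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

#define BLKSZ  8192
#define NCALLS 432

int main(int argc, char **argv)
{
  static char buf[BLKSZ];
  struct timeval t0, t1;
  int fd, i;
  if (argc < 2 || (fd = open(argv[1], O_WRONLY | O_CREAT, 0644)) < 0)
    return 1;
  gettimeofday(&t0, NULL);
  for (i = 0; i < NCALLS; i++)
    if (pwrite(fd, buf, BLKSZ, (off_t)i * BLKSZ) != BLKSZ) {
      perror("pwrite");
      return 1;
    }
  gettimeofday(&t1, NULL);
  printf("%d pwrites: %.3f seconds\n", NCALLS,
         (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
  close(fd);
  return 0;
}
$ gcc pwtest.c
$ truss -c a.out /f1/tmp/pw.dat    # then again with a file under /f2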
USEFUL REFERENCE

Ixora News - March 2002 (Direct benefits)