[warning: written in 2002 through 2005]

HOW IS DIO ENABLED ON SOLARIS BY ORACLE FILESYSTEMIO_OPTIONS?

Direct I/O can be enabled in two ways: with the forcedirectio mount option (which you can see in mount command output afterwards), or with the directio(3C) function call. Since Oracle's filesystemio_options parameter obviously can't control the mount option, it must be making a directio() call. A simple program like this:

$ cat dio.c
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
  int fd, rc;
  fd = open("testfile", O_RDONLY);
  rc = directio(fd, DIRECTIO_ON);
  printf("%d", rc);
  return 0;
}
$ gcc dio.c
$ touch testfile
$ truss a.out

generates the following `truss a.out' output:

...
open("testfile", O_RDONLY)                      = 3
ioctl(3, _ION('f', 76, 0), 0x00000001)          = 0
...

I compiled dio.c with gcc. My testfile is on UFS (x86 Solaris 10). The sign of a directio() function call in this system call trace is _ION('f', 76, 0) used as the cmd argument to ioctl() (with 0x1 after it for DIRECTIO_ON). [According to the Solaris 8 source code /src/osnet_volume/usr/src/lib/libc/port/gen/directio.c, this ioctl() argument is _FIODIRECTIO, defined in /usr/include/sys/filio.h as _IO('f', 76). I'm not exactly sure about _ION but it looks like a variant of _IO with an additional number argument.]

Anyway, if you truss an Oracle server process when you run

alter system set filesystemio_options = directio;

inside Oracle, watch for ioctl() with _ION as its second argument. Alternatively, try `truss -u:directio a.out' to focus on this library function. In the test of my dio program, I see

/1:     open("testfile", O_RDONLY)              = 3
/1@1:   -> libc:directio(0x3, 0x1, 0xd2ff260c, 0xd2f008e9)
/1:     ioctl(3, _ION('f', 76, 0), 0x00000001)  = 0
/1@1:   <- libc:directio() = 0

Or try `truss -t!all -u:directio a.out' to remove all syscalls from the output and make it cleaner.

Chao_ping's test of Oracle on Solaris: file handles 26200 and 26201 are datafiles on UFS; the others are on VxFS with Quick I/O. _filesystemio_options=directio. truss on dbwr shows:

ioctl(26201, _ION('f', 76, 0), 0x00000001)      = 0

And with your second test plan:

[oracle@ebaysha2**8i]$grep direct dio3.log
/1@1: -> libc:directio(0x6655, 0x1, 0xfffffffe7e355e40, 0x746571fefefefeff)
/1@1: <- libc:directio() = 0
/1@1: -> libc:directio(0x6654, 0x1, 0xfffffffe7e355e40, 0x746571fefefefeff)
/1@1: <- libc:directio() = 0
/1@1: -> libc:directio(0x6653, 0x1, 0xfffffffe7e3559e0, 0x746571fefefefeff)
/1@1: <- libc:directio() = 0
/1@1: -> libc:directio(0x6652, 0x1, 0xfffffffe7e3559e0, 0x746571fefefefeff)
/1@1: <- libc:directio() = 0

A related question: how do we find out whether a process called directio(3C) at some point in the past, rather than catching directio() at the moment it is called? The pfiles command may not reveal everything in the in-core inode, which is where I believe directio() makes its change. On old Solaris versions, you can use crash(1M) to look for a difference in the user area dump (u command) for the process, possibly followed by a file command (some common usage can be found at http://yong321.freeshell.org/computer/SolarisMdb.txt). Newer versions replace (and augment) crash with mdb, but I haven't played with it much.

DIO ON LINUX

Linux direct I/O can be verified by running strace on Oracle processes and watching for the O_DIRECT flag in open(2) calls. See my note ./OCFS.txt for an example.
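For comparison with the Solaris dio.c above, here is a minimal sketch of the Linux counterpart (my illustration, not part of the original tests; the file name "testfile" and the 4096-byte alignment are assumptions, as some devices accept 512):

$ cat odirect.c
/* Request direct I/O on Linux by passing O_DIRECT to open(2).
   O_DIRECT requires the buffer, file offset and transfer size to be
   suitably aligned, hence posix_memalign below. */
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main()
{
  void *buf;
  ssize_t n;
  int fd = open("testfile", O_RDONLY | O_DIRECT);
  if (fd < 0) { perror("open"); return 1; }
  if (posix_memalign(&buf, 4096, 4096)) return 1;
  n = read(fd, buf, 4096);      /* bypasses the page cache */
  printf("%zd\n", n);
  free(buf);
  close(fd);
  return 0;
}
$ gcc odirect.c

Under `strace ./a.out', watch for something like open("testfile", O_RDONLY|O_DIRECT) = 3; that flag is the Linux analogue of the _ION('f', 76, 0) ioctl signature above.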
Although not documented, a caveat similar to the one in the Solaris directio(3C) man page may still apply, i.e. the last open(2) call with O_DIRECT by any process on a given file determines whether the file content is placed in the page cache, although it may be that content already cached stays cached. Think about it: what happens if process P1 opens file F without O_DIRECT and places its content in the page cache, P2 opens F with the flag, and P3 opens F without it? Is F's content in the page cache or not? Regardless of whether the file content is in the page cache, O_DIRECT on open(2) probably does advise subsequent read() and write() calls of the current process to bypass the page cache, even if the cached content is readily available.

[update 2011-06] I believe at least partially because of Linus Torvalds' insistence, ext2 and ext3 (and ext4 as well?) do not support a direct I/O mount option. Unless you use a filesystem that supports this mount option (vxfs etc.), the only way for a write() not to contaminate the page cache is the O_DIRECT flag on all open() calls (Linux doesn't have a directio() library function). But most programs don't do that: cp, mv, gzip, etc. (Watch Cached in /proc/meminfo decrease while running those commands on a big file, and eventually kswapd show up in `top'.) Luckily, Oracle does, with the help of the filesystemio_options parameter.

DIO PERFORMANCE TEST ON SOLARIS

My test on Solaris 2.6 shows that directio on UFS does make a difference. Here are my notes:

The database db_block_buffers is 2048 and db_block_size 8192, so the buffer cache is 16m, except in Test IV below. OS is Solaris 2.6 on sun4u. UFS file system. All datafiles, logfiles and controlfiles are in one directory. A table called large is created as select * from dba_objects where 1=2. Then insert into large select * from dba_objects is executed many times until the table size is about 170m (from user_segments.bytes). The query is select /*+ full (t) */ count(*) from large t. It's run three times under each condition. Timings in seconds of the three runs are separated by ";".

I.   Just mounted filesystem, startup db:                      11.37; 4.46; 4.06
II.  Shutdown db, umount, mount, startup db:                   11.18; 4.56; 4.86
III. Shutdown db, umount, mount -o forcedirectio, startup db:  19.99; 19.19; 19.29
IV.  Same as III but change db_block_buffers to 32768:         19.39; 2.14; 2.03

Observation: The first access of the table takes 11 seconds with filesystem caching (Test I) and 20 seconds with direct I/O (III). The second access takes about four and a half seconds with filesystem caching (I). With direct I/O, if the Oracle buffer cache is too small, it still takes 19 seconds (III); with a sufficiently large cache, it takes 2 seconds (IV).

Conclusion: If your data access pattern is well defined, and if the amount of data can be fully cached in the Oracle buffer cache, then DIO is better, because the second time (not the first time) the same data is accessed, it's fast. If neither condition is met, I'll take the lazy man's approach and let the filesystem intelligently manage it.
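The cache effect in Tests I through III can be reproduced at the file level without Oracle. The following is a sketch of mine (Solaris-specific because of directio(3C); the 8k read size merely mirrors db_block_size): it scans a file twice with buffered I/O, then advises direct I/O and scans twice more. The second buffered scan should be fast, served from the page cache, while both direct scans should take roughly the cold-read time, mirroring the drop from 11.37 to 4.46 in Test I versus the flat 19 seconds in Test III:

$ cat scan2.c
/* Time sequential scans of a file, buffered first, then with
   directio(3C) advice. Pass the file to scan as the argument. */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/fcntl.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

static double scan(int fd)          /* seconds for one full scan */
{
  char buf[8192];
  struct timeval t0, t1;
  lseek(fd, 0, SEEK_SET);
  gettimeofday(&t0, NULL);
  while (read(fd, buf, sizeof buf) > 0)
    ;
  gettimeofday(&t1, NULL);
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(int argc, char **argv)
{
  double a, b;
  int fd;
  if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0) return 1;
  a = scan(fd); b = scan(fd);
  printf("buffered: %.2f then %.2f seconds\n", a, b);
  directio(fd, DIRECTIO_ON);        /* advise DIO from here on */
  a = scan(fd); b = scan(fd);
  printf("direct:   %.2f then %.2f seconds\n", a, b);
  close(fd);
  return 0;
}

Any file of a few hundred megabytes will do; run it against a file on a forcedirectio mount as well if you want the file-level equivalent of Test III from the start.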
******** Unrelated: Testing effect of sticky bit on datafiles ********

Adrian Cockcroft and Richard Pettit say turning on the sticky bit for database files makes the inode updates for the files asynchronous and is completely safe. It "may help get back some of the performance" ("Sun Performance and Tuning", 2nd ed. p.165). But in a March 6, 2002 eSeminar hosted by Quest Software, Richard dismissed this trick as old and probably no longer valid. I did the following test. No performance difference is found with or without the sticky bit on database files.

Experimental:

The database db_block_buffers is 2048 and db_block_size 8192, so the buffer cache is 16m. OS is Solaris 2.6 on sun4u. UFS file system. All datafiles, logfiles and controlfiles are in one directory. A table called large is created as select * from dba_objects where 1=2. Then insert into large select * from dba_objects is executed many times. In SQL*Plus, I set timing on for each insert. The time for each insert is shown below.

Initially without sticky bit: 06.60, 07.92, 07.02 seconds
Turned on sticky bit (chmod +t *, must run as root): 07.02, 08.62, 08.02, 07.81 seconds
Turned off (chmod -t *): 08.53
Turned on again: 10.85, 06.10, 09.83, 08.03, 08.42, 08.73, 07.93, 08.03, 08.23, 09.34, 08.53
Turned off: 08.23, 08.93, 08.72, 09.14, 08.93, 08.03, 09.43, 07.82, 08.34, 08.32

Conclusion: There's no performance difference with or without the sticky bit on database files.

***********************************************************************

DIO ON TRU64

Subject: SUMMARY: Extremely slow Oracle performance with Apps 11.0.3 after OS upgrade
From: tackenhu AT fnal.gov
Date: Sun, 22 Dec 2002 23:37:35 -0600

> We believe the core cause of this performance degradation was the change
> in the default behavior of the Oracle database's use of the filesystem
> when run on HP/Compaq Tru64 5.1A UNIX (as opposed to on the original
> version 4.0F). On 5.1A, Oracle uses direct I/O by default (avoiding any
> operating system filesystem caching), whereas on 4.0F, Oracle doesn't do
> this (and, in fact, the OS doesn't support direct I/O) so it uses the
> filesystem cache.
>
> In our case, this difference in filesystem cache usage dramatically
> impacted a number of queries, most of which manipulated a large quantity
> of data by repeatedly reading the same data over and over (poor query
> design, but hadn't caused problems in the past). It would appear that
> prior to the 5.1A upgrade, the performance of these queries was being
> propped up by heavy use of the AdvFS filesystem cache at the OS level.
> Once we went to 5.1A, this cache wasn't being used (because Oracle was
> using direct I/O) and our queries fell victim to extreme I/O waits, since
> the SGA DB_BLOCK_BUFFERS weren't large enough to buffer as much data as
> had been buffered in the filesystem buffer pre-OS upgrade.
>
> From a purist standpoint, the problem could be considered a combination of
> several poorly designed queries (many repetitive reads on a large quantity
> of data) combined with an undersized data buffer cache for the instance
> (causing it to depend heavily on OS filesystem caching). The OS
> upgrade's effect of shifting Oracle I/O to direct I/O made these causes
> yield drastic performance slowdowns.
>
> Our solution was to dramatically increase our DB_BUFFER_CACHE size (by
> approximately 20 times). Another possible solution is to disable direct
> I/O usage by Oracle RDBMS (which can be done in 8.1.7.2 by the
> _TRU64_DIRECTIO_DISABLE parameter, which via an Oracle patch can also be
> made available in 8.1.7.1) - since this solution required a patch and our
> initial testing got mixed results using this, we opted for the data buffer
> cache increase (since we could make the memory available to support this).
> We continue to test the direct I/O parameter in a more controlled
> environment (dedicated test box set up for just this purpose) in case we
> need a fallback, but our performance is now back to roughly the same as
> pre-upgrade.
>
> We are also working to have the development team examine these queries
> that do the repetitive reads (according to some trace analysis, roughly 3
> million blocks read to generate 8000 rows of output - when you consider
> the entire database is less than 14 Gb (and 3 million blocks would be
> around 22 Gb), it is safe to say the query isn't optimizing its reads).
>
> Once again, thanks for everyone's suggestions. For anyone planning a
> 4.0F to 5.X upgrade for a system running Oracle databases, I would
> strongly suggest you look closely at performance for large jobs and
> examine how caching is being used - the direct I/O change can have
> dramatic effects.

When direct I/O is disabled (the default), performance is not very sensitive to the Oracle buffer cache size. When direct I/O is enabled and the buffer cache is smaller than the table I do a full table scan on, performance is worse than with direct I/O disabled. But when the buffer cache is larger than the table size, with direct I/O the *second* time the table is scanned it's very fast. So if all conditions are met, we should use direct I/O and increase the Oracle buffer cache, as other experts say (see Ixora News - March 2002).

However, this strategy is easier said than done. To take full advantage of the memory otherwise used as the filesystem page cache, you need to increase the Oracle buffer cache to the point where your system will almost start to swap. I haven't done this, but I imagine you have to carefully calculate the memory usage of all other applications on the system, and most likely rebuild the Oracle kernel and use a 64-bit OS to get a large SGA. On Solaris, the filesystem page cache is automatically sized by the OS to use all available physical memory, but the Oracle buffer cache is sized by you, and too big a size causes swapping. Direct I/O also disables the filesystem's intelligence of prefetching data on sequential reads and delaying data writes. So its optimal use takes a lot of work, on a system where activities are fairly predictable and data have been scanned into the Oracle cache prior to peak usage time.

BTW, I have some questions for you:

1. Earlier you said you had a 95% buffer cache hit ratio. You increased the cache by 20 times. What's the ratio now?

2. I assume you're saying _TRU64_DIRECTIO_DISABLE was true when you used Tru64 4.0F and changed itself to false when you upgraded the OS to 5.1A, with the same version of Oracle. Is that correct?

3. Direct I/O is a filesystem mount option. Did the OS upgrade automatically enable the option, or did somebody manually enable it? On Solaris, even if the filesystem is not mounted with direct I/O, applications can still advise the filesystem so on a per-file basis. Maybe that's what Tru64 5.1A does?

4. Why did manually changing _TRU64_DIRECTIO_DISABLE give you mixed results?

Yong Huang

Hi, Yong,

In answer to your questions from 12/26/2002...

> 1. Earlier you said you had a 95% buffer cache hit ratio. You increased
> the cache by 20 times. What's the ratio now?

What I meant to say was the buffer cache hit ratio was about 55% pre-SGA (DB_BLOCK_BUFFER) increase and 95% post increase.

> 2. I assume you're saying _TRU64_DIRECTIO_DISABLE was true when you used
> Tru64 4.0F and changed itself to false when you upgraded the OS to 5.1A,
> with the same version of Oracle. Is that correct?

Roughly, yes. _TRU64_DIRECTIO_DISABLE was irrelevant on Tru64 4.0F as the OS didn't provide any direct I/O support - so direct I/O was never used.
Once the OS was upgraded to 5.1A, Oracle by default used direct I/O, and this behavior could only be disabled with the _TRU64_DIRECTIO_DISABLE Oracle parameter.

> 3. Direct I/O is a filesystem mount option. Did the OS upgrade
> automatically enable the option, or did somebody manually enable it? On
> Solaris, even if the filesystem is not mounted with direct I/O,
> applications can still advise the filesystem so on a per-file basis.
> Maybe that's what Tru64 5.1A does?

I believe Solaris must differ from Tru64 in this regard. Direct I/O is not a filesystem mount option in Tru64 (at least on AdvFS volumes, which we use). Rather, use of direct I/O is determined by passing (or omitting) a direct I/O flag to the open() system call when opening a file. Therefore, to utilize direct I/O, a program must specifically open the file for direct I/O (which is a code modification that Oracle made to their RDBMS code - as early as 8.0.6.3, I believe, and including all 8.1.7 code streams).

> 4. Why did manually changing _TRU64_DIRECTIO_DISABLE give you mixed
> results?

We believe this was due to the other processing on the box (we have 8 other database instances on our development box), which would affect the contents of the filesystem cache and overall performance. To get reproducible results, we believe we would need a dedicated system, but we couldn't halt all development on the system for this testing.

ANOTHER DIO TEST ON SOLARIS

$ uname -a
SunOS countfleet 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-2
$ mount | grep ^/f[12]    # /f2 has DIO turned on
/f1 on /dev/dsk/c0t1d0s0 setuid/read/write/largefiles on Wed Jan 15 16:17:29 2003
/f2 on /dev/dsk/c0t1d0s1 forcedirectio/setuid/read/write/largefiles on Wed Jan 15 16:17:29 2003
$ grep maxphys /etc/system
set maxphys = 1048576

Database 9.0.1.3:

create tablespace test datafile '/f1/oradata/tiny/test.dbf' size 400m
extent management local uniform size 32k;

Run three times, this took 35, 36 and 36 seconds, respectively. The same command with f1 changed to f2 took 25, 27 and 26 seconds, respectively, about 9 seconds faster. /f1 is regular UFS and /f2 is DIO UFS. While the tablespace is being created on /f1, truss is run against the shadow process, and the second run shows:

$ truss -c -p 9704
^C
syscall              seconds   calls  errors
read                     .00       1
write                    .00       3
open                     .00       2
close                    .00      10
time                     .00       2
lseek                    .00       2
times                    .03     282
semsys                   .00      31
ioctl                    .00       3       3
fdsync                   .00       1
fcntl                    .01      14
poll                     .01     146
sigprocmask              .00      56
context                  .00      14
fstatvfs                 .00       3
writev                   .00       2
getrlimit                .00       3
setitimer                .00      28
lwp_create               .00       2
lwp_self                 .00       1
lwp_cond_wai             .03     427
lwp_cond_sig             .15     427
kaio                    5.49     469     430  <-- More kernelized IO time
stat64                   .00       3       1
fstat64                  .00       3
pread64                  .00      32
pwrite64                 .35     432  <-- Each pwrite() call takes 350/432 = 0.8 ms
open64                   .00       6
                        ----     ---    ---
sys totals:             6.07    2405     434
usr time:               1.71
elapsed:               36.74

When the tablespace is created on /f2:

$ truss -c -p 9704
^C
syscall              seconds   calls  errors
read                     .00       1
write                    .00       3
open                     .00       2
close                    .00      10
time                     .00       2
lseek                    .00       2
times                    .02     282
semsys                   .00      31
ioctl                    .00       3       3
fdsync                   .00       1
fcntl                    .00      14
poll                     .01     146
sigprocmask              .00      56
context                  .00      14
fstatvfs                 .00       3
writev                   .00       2
getrlimit                .00       3
setitimer                .00      28
lwp_cond_wai             .00     430
lwp_cond_sig             .03     430
kaio                     .50     462     430  <-- Much less kernelized IO time
stat64                   .00       3       1
fstat64                  .00       3
pread64                  .01      32
pwrite64                 .00     432  <-- pwrite calls take practically no time
open64                   .00       6
                        ----     ---    ---
sys totals:              .57    2401     434
usr time:               1.94
elapsed:               27.72

During the first run, the result on /f1 is even worse. But for a good benchmark, I usually ignore the first run.
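Incidentally, the pwrite64 line can be timed without Oracle. Below is a sketch of mine (the 8k block size and 432 calls merely echo the truss table above; the file paths are hypothetical). Don't expect it to reproduce Oracle's numbers: Oracle drove much of its I/O through kernel async I/O (the kaio line), whereas here each pwrite on the forcedirectio mount is a real disk write, while on a buffered mount it usually just copies into the page cache.

$ cat pwtest.c
/* Time 432 pwrite(2) calls of 8 KB each to the file given as the
   argument. Compare a file on /f1 (buffered) with one on /f2 (DIO). */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

#define BLKSZ  8192
#define NCALLS 432

int main(int argc, char **argv)
{
  static char buf[BLKSZ];
  struct timeval t0, t1;
  int fd, i;
  if (argc < 2 || (fd = open(argv[1], O_WRONLY | O_CREAT, 0644)) < 0)
    return 1;
  gettimeofday(&t0, NULL);
  for (i = 0; i < NCALLS; i++)
    if (pwrite(fd, buf, BLKSZ, (off_t)i * BLKSZ) != BLKSZ) {
      perror("pwrite");
      return 1;
    }
  gettimeofday(&t1, NULL);
  printf("%d pwrites: %.3f seconds\n", NCALLS,
         (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
  close(fd);
  return 0;
}
$ gcc pwtest.c
$ truss -c a.out /f1/tmp/pw.dat    # then again with a file under /f2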
USEFUL REFERENCE

Ixora News - March 2002 (Direct benefits)