Quick Tips on Linux HugePages

This section is a practical list of steps you can follow to set up HugePages. Failing to configure HugePages causes significant performance problems, especially on servers with lots of memory. For example, on one of our Financial databases, configuring HugePages alone reduced the runtime of a job by one hour. There are many little details in configuring HugePages, so it's better to have a checklist.

1. Determine an appropriate SGA size. Take 50% of server memory as a starting point, e.g. 20G for a 40G memory server if single-node, a little less if RAC. Remember to add up SGA's for all instances if the server runs more than one instance.

2. To simplify calculation, always use MB as unit. To avoid SGA granule (chunk size) roundup, set SGA to an integer multiple of its granule size, listed below for your convenience (Ref: 947152.1), assuming you didn't and won't set inmemory_size or memory_max_target, nor _ksmg_granule_size:

sga_max_size         granule size
<= 1GB               4MB
> 1GB and <= 8GB     16MB
> 8GB and <= 16GB    32MB
> 16GB and <= 32GB   64MB
> 32GB and <= 64GB   128MB
> 64GB and <= 128GB  256MB
> 128GB              512MB

So if you plan to have an SGA of 20 GB, its granule size will be 64 MB. If the instance is already up, you can also find the granule size for this instance by
SQL> select bytes/1048576 mb from v$sgainfo where name='Granule Size';

3. For an SGA of about 20 GB, 20480 MB is a good number (while e.g. 20000 MB is not because that's not a multiple of 64 MB). It's a good idea to set sga_max_size and sga_target to the same value so SGA is fully reserved up front on instance startup:
SQL> alter system set sga_max_size=20480m scope=spfile;
SQL> alter system set sga_target=20480m scope=spfile;
Then set HugePages slightly larger, for example 20500 MB, i.e. 10250 pages (each page is 2 MB). So vi /etc/sysctl.conf and add vm.nr_hugepages=10250.

4. Make sure /etc/sysctl.conf has kernel.shmmax set to a large number. To make it simple, just set it to the server memory in bytes. The first number shown in the output of command `free' (under "total", to the right of "Mem:") is in KB. So, just use that appended with three 0's. (Don't worry about the big numbers because these parameters are just mathematical limits to check against. OS won't allocate anything according to their values.)

5. Make sure /etc/security/limits.conf has
oracle soft memlock limit
oracle hard memlock limit
To make it simple, you can just set limit to the server memory in KB, which is the first number shown in the output of command `free' (under "total", to the right of "Mem:").

6. Disable HugePages in ASM and MGMTDB (if it's installed)
Change environment to ASM with `. oraenv', and login and run
SQL> alter system set use_large_pages=false scope=spfile;
(Remember to do the same to MGMTDB if this database can't be removed.)

7. Disable transparent HugePages
Transparent HugePages may cause unnecessary CPU usage and other problems. Disable it by modifying /etc/default/grub to append transparent_hugepage=never to the last part of the string value (inside the quotation marks) for GRUB_CMDLINE_LINUX. (So the line looks like GRUB_CMDLINE_LINUX="... rhgb quiet transparent_hugepage=never"). Save the file and run `grub2-mkconfig -o /boot/grub2/grub.cfg'. After server reboot, `cat /proc/cmdline' should show transparent_hugepage=never as part of the value, and /sys/kernel/mm/transparent_hugepage/enabled should show [never] as part of the value.
Also disable tuned in case it re-enables transparent HugePages:
# systemctl disable tuned
Note: On RHEL9, run grub2-mkconfig --update-bls-cmdline -o /boot/grub2/grub.cfg instead (thanks to Michael Schwager). Alternatively, use a tuned profile, or run the command `echo never > /sys/kernel/mm/transparent_hugepage/enabled' (you may add it to /etc/rc.local).

8. On RHEL9 and possibly some earlier versions in some cases, vm.hugetlb_shm_group in /etc/sysctl.conf must be set to the group ID of user oracle. Get the GID by `id oracle'. Either oinstall or dba's GID is fine. (Ref: Red Hat, 2491966.1)

9. Once the server is rebooted and instance(s) is/are up, check SGA setting and its actual value:
SQL> select value/1048576 from v$spparameter where name like 'sga%';
SQL> select value/1048576 from v$parameter where name like 'sga%';
If they don't match (because the setting is not a multiple of the granule size; see Step 2), it's always the latter that is bigger. Set the parameters to the value in v$parameter so it's less confusing in the future:
SQL> alter system set sga_max_size=<above_value_in_v$parameter>m scope=spfile;
SQL> alter system set sga_target=<above_value_in_v$parameter>m scope=spfile;

Check HugePages usage by:
$ grep HugePages /proc/meminfo
HugePages_Free should be only a little larger than HugePages_Rsvd.
You can also check by reading alert_SID.log. At the very beginning of instance startup, a few lines indicate how many HugePages are needed and how many are provided by the OS.
In 12c+, you can also check by
SQL> select "AREA NAME", "SEGMENT SIZE"/1048576, "SIZE"/1048576, pagesize, shmid from x$ksmssinfo;
Note the lines for PAGESIZE of 2097152.

10. In very rare cases, HugePages could be used by third party software.
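The granule lookup and rounding in steps 2 and 3 can be sketched as a small shell helper. This is only an illustration under the same assumptions as step 2 (inmemory_size, memory_max_target and _ksmg_granule_size are not set); the function names granule_mb and round_up_sga are made up for this example and are not part of any Oracle or Linux tool.

```shell
# granule_mb SGA_MB: print the granule size in MB, per the table in step 2.
granule_mb() {
    sga_mb=$1
    if   [ "$sga_mb" -le 1024 ];   then echo 4
    elif [ "$sga_mb" -le 8192 ];   then echo 16
    elif [ "$sga_mb" -le 16384 ];  then echo 32
    elif [ "$sga_mb" -le 32768 ];  then echo 64
    elif [ "$sga_mb" -le 65536 ];  then echo 128
    elif [ "$sga_mb" -le 131072 ]; then echo 256
    else echo 512
    fi
}

# round_up_sga SGA_MB: round SGA_MB up to the next multiple of its granule.
round_up_sga() {
    g=$(granule_mb "$1")
    echo $(( ( ($1 + g - 1) / g ) * g ))
}

granule_mb 20480      # prints 64
round_up_sga 20480    # prints 20480 (already a multiple of 64)
round_up_sga 20000    # prints 20032 (20000 is not a multiple of 64)
```

On a running instance, the v$sgainfo query in step 2 remains the authoritative way to get the granule size.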

For explanations of some suggestions outlined above, read on.

2020-12, 2024-02

The following is the original article (with updates constantly added over the years) meant to provide quick and practical tips on using Linux HugePages on servers that run Oracle databases.

If you don't want to be accurate in calculating how much memory should be allocated for HugePages, give a rough and very generous estimate. Start all Oracle instances on the box. (To save time, startup nomount is enough.) Check the difference between HugePages_Free and HugePages_Rsvd, which is the wastage, because HugePages_Free includes reserved but not actually used memory. For example,

2458-2341=117 pages of HugePages, or 234 MB of memory (assuming a 2 MB page size), will never be used. You do NOT have to wait until the instances have been used for a while; as reserved pages are actually touched, that would decrease both HugePages_Free and HugePages_Rsvd, but not the difference between them. To understand the 3 lines of HugePages_*, look at this simple diagram

Now, let's dynamically shrink HugePages to reduce wastage. Take the example of 3190 HugePages shown earlier. Let's cut the wastage down to, say, 10 pages. So we should decrease HugePages_Total by 117-10=107. That is, change 3190 to 3190-107=3083.

As root, write the new value with echo 3083 > /proc/sys/vm/nr_hugepages, then cat /proc/sys/vm/nr_hugepages to confirm the number has been reduced to 3083. Update vm.nr_hugepages in /etc/sysctl.conf with the correct number so it takes effect on next reboot.
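The shrink arithmetic above can be written down as a short shell sketch. The numbers are the ones used in this article (3190 total, 2458 free, 2341 reserved, 10 spare pages to keep); the function names hugepages_waste and shrink_target are hypothetical helpers for illustration only.

```shell
# hugepages_waste FREE RSVD: pages that are free and not even reserved.
hugepages_waste() {
    echo $(( $1 - $2 ))
}

# shrink_target TOTAL FREE RSVD KEEP: new nr_hugepages that leaves KEEP spare pages.
# If the wastage is already at or below KEEP, TOTAL is returned unchanged.
shrink_target() {
    waste=$(hugepages_waste "$2" "$3")
    if [ "$waste" -le "$4" ]; then
        echo "$1"
    else
        echo $(( $1 - (waste - $4) ))
    fi
}

hugepages_waste 2458 2341          # prints 117
shrink_target 3190 2458 2341 10    # prints 3083

# On a live server the three inputs could be read from /proc/meminfo, e.g.
#   awk '/^HugePages_(Total|Free|Rsvd):/ {print $1, $2}' /proc/meminfo
```

The result is only a candidate value; writing it to /proc/sys/vm/nr_hugepages and /etc/sysctl.conf remains a manual, root-only step.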

The advantage of over-allocating HugePages at the beginning is that it saves time in getting the memory allocation right on the first try. In addition, dynamically changing HugePages allocation ensures no memory is wasted. In case of shutting down an Oracle instance for an extended period of time, you may choose to lower /proc/sys/vm/nr_hugepages to give the memory back to OS as well as Oracle PGA.

However, if you start back up the previously shutdown instance, you'll have to increase the nr_hugepages number, and you may not be able to bring it up fully to the desired number if the available memory is no longer physically contiguous. When that happens, you may or may not be able to start the instance depending on the setting of use_large_pages. If it's set to true (default), the instance may be started but it uses no HugePages at all and you'll waste lots of HugePages unless you give up and lower nr_hugepages back down to give it to OS and wait till next server reboot. So think it over whenever you plan to lower the value.

In older versions of Oracle, the only way to know that HugePages is used is to check /proc/meminfo. Later versions show the lines in alert_sid.log (Oracle 11g example):

The instance in this example here clearly has too much unused HugePages. I would cut configured HugePages down from 2000 to 2000-815+overhead, say, 1200. The overhead is related to roundup of shared memory segments for the instance as shown in ipcs or sysresv upward to the nearest SGA granule size.

In 12c, the alert.log has these lines instead (excluding the annoying timestamp lines profusely intercalated):

This example only wastes 3 HugePages, corresponding to the following /proc/meminfo values where 10-7=3:

Beginning with Linux kernel 2.6.29 or Red Hat Enterprise Linux 6 and possibly later minor releases of RHEL 5, /proc/pid/smaps provides clues about HugePages usage as well.

The last two lines showing 2 MB instead of 4 KB page size are the telltale sign that HugePages are used. If you want to see all processes using HugePages, you can run as root

Beginning with Oracle 12c, you can also check the fixed table x$ksmssinfo (probably Kernel Service, Memory Sga OS (level) Info), which not only tells us whether the memory page size is that of HugePages, but even maps the SGA components with shared memory segments. The example below is from Oracle 12.1.0.2, where in-memory area is configured. (I removed the ipcs lines irrelevant to this Oracle instance in the example.)

As you can see, this fixed table tells us HugePages is used except for Oracle's interface to the OS in the generic memory management layer (skgm overhead), which still uses the default 4 KB page size. The largest segment of 3288334336 bytes in size is in two parts: Variable Size (not the same as Variable Size shown by SQL*Plus command show sga, which excludes buffer cache) used for buffer cache and various SGA pools (shared pool, java pool, large pool), and part of the in-memory area or column store (imc area default 0). The second largest segment of 83886080 bytes contains the other part of in-memory area (imc area rdonly 0). The remaining two segments are obvious. But in spite of small sizes, they don't seem to be fully used.

The same type of information is also written to a trace file, even in Oracle 11g, although in 11g it's not exposed to any table, e.g.

To make calculation of HugePages easier, always use MB in dealing with memory. When setting SGA, set it to an integer multiple of the memory granule size (see the size table at the beginning of this article) so that the value you set (seen in v$spparameter) matches the value you end up with (seen in v$parameter):
select value/1048576 from v$spparameter where name like 'sga%';
select value/1048576 from v$parameter where name like 'sga%';
Then set vm.nr_hugepages in /etc/sysctl.conf a little larger than the SGA in MB divided by 2 (since each page is 2 MB). Remember to add up all SGA's if the server has multiple instances, except for those where you set use_large_pages to false.
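As a rough sketch of that sizing rule, the helper below adds up the SGAs (in MB) of all instances that use HugePages, adds some slack, and converts the sum to 2 MB pages. The name nr_hugepages_for and the 8192 MB second instance are assumptions made for this example; the 20 MB slack reproduces the 20480 MB to 20500 MB example from the checklist.

```shell
# nr_hugepages_for SLACK_MB SGA_MB [SGA_MB ...]
# Sum the SGAs of all instances that use HugePages, add SLACK_MB,
# and print the number of 2 MB pages, rounded up.
nr_hugepages_for() {
    total_mb=$1
    shift
    for sga_mb in "$@"; do
        total_mb=$(( total_mb + sga_mb ))
    done
    echo $(( (total_mb + 1) / 2 ))
}

nr_hugepages_for 20 20480          # one 20480 MB SGA: prints 10250
nr_hugepages_for 20 20480 8192     # plus a hypothetical 8192 MB SGA: prints 14346
```

Instances with use_large_pages=false (such as ASM, as discussed next) are simply left out of the argument list.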

An ASM instance does not have a buffer cache (the so-called "ASM buffer cache" caches some metadata). There's no need to configure HugePages for it, even if it's small. So disable it by alter system set use_large_pages=false scope=spfile and bounce it. If your 12c or 18c RAC installation includes MGMTDB, you definitely should disable its usage of HugePages; the parameter is true by default. Since this management database is only run on one of the nodes, usually but not always on the first node, accommodating its HugePages requirement on that single node but not the others would either complicate HugePages setup or waste memory, depending on whether you configure the same HugePages on all nodes. In Oracle 19c, we're finally relieved by Oracle's decision to make this useless MGMTDB database optional (and I strongly recommend you not install it).

Don't forget to set memlock in /etc/security/limits.conf (and add session required pam_limits.so to /etc/pam.d/login) and kernel.shmmax in /etc/sysctl.conf high enough to cover the entire SGA since HugePages must be physically contiguous. To make it simple, set them to the physical memory of the server (but note that memlock is in KB while kernel.shmmax is in bytes); they are just mathematical limits and do not actually allocate anything.

Changing the values in /etc/security/limits.conf requires you to log in again because your shell takes the values from this file. If the instance is up, you can find the limits of a running process by cat /proc/pid/limits. (If GI was started with limits too low and you don't want to bounce GI, perhaps because you have multiple DB instances, you must use sqlplus, not srvctl, to bounce the instance for which you want to have a higher limit.)

Also, kernel.shmall should not be too low. To make it simple, just set it to the number of pages as if all memory were in 4k pages, i.e. the Mem value under total in the output of the free command, divided by 4, since the default page size is 4k and the free command output is in KB. For new values of kernel.shmmax and kernel.shmall to take effect, just type sysctl -p.
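The unit conversions above are easy to get wrong, so here is a minimal sketch that prints candidate lines from the total memory in KB. The function name mem_limits is hypothetical, and 41943040 KB is just an illustrative figure for a nominal 40 GB server. kernel.shmmax uses the checklist's shortcut of appending three 0's to the KB value, which is slightly below the true byte count but still far above any SGA on that server.

```shell
# mem_limits MEM_KB: print candidate sysctl.conf and limits.conf lines.
# MEM_KB is the "total" figure on the "Mem:" line of `free` (in KB).
mem_limits() {
    mem_kb=$1
    echo "kernel.shmmax=${mem_kb}000"        # bytes: KB value with three 0's appended
    echo "kernel.shmall=$(( mem_kb / 4 ))"   # count of 4 KB pages
    echo "oracle soft memlock ${mem_kb}"     # limits.conf memlock is in KB
    echo "oracle hard memlock ${mem_kb}"
}

mem_limits 41943040

# On a live server the argument could be taken from free, e.g.
#   mem_limits "$(free | awk '/^Mem:/ {print $2}')"
```

The first two lines belong in /etc/sysctl.conf and the last two in /etc/security/limits.conf; the script only prints them and changes nothing.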

/etc/default/grub (or for older Linux, /boot/grub/grub.conf) should have transparent_hugepage=never appended to the GRUB_CMDLINE_LINUX line. After editing it, run grub2-mkconfig --update-bls-cmdline -o /boot/grub2/grub.cfg (remove --update-bls-cmdline if older than RHEL9). To check whether the currently running kernel has it disabled, cat /proc/cmdline. Transparent HugePages causes high sys CPU. You may also disable tuned since it may enable THP again: systemctl disable tuned.
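Besides /proc/cmdline, the file /sys/kernel/mm/transparent_hugepage/enabled shows the active setting in square brackets. The following sketch extracts that bracketed word so it can be used in a script; thp_active is a hypothetical helper name, and the sample strings mimic the usual content of that file.

```shell
# thp_active "CONTENT": print the word shown in [brackets], i.e. the active
# Transparent HugePages mode, from the content of
# /sys/kernel/mm/transparent_hugepage/enabled.
thp_active() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

thp_active "always madvise [never]"    # prints never
thp_active "[always] madvise never"    # prints always

# Live check (not run here):
#   thp_active "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
```

Anything other than never after a reboot suggests that the kernel parameter did not take effect or that something such as tuned changed the setting afterwards.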

One very simple thing. Make sure memory_max_target and memory_target are not set. If they are, and you use an spfile, alter system reset memory_max_target and alter system reset memory_target. In older versions, you may have to use the trick in Doc 1138645.1.

Troubleshooting an interesting case
2020-12

The server has two Oracle instances running. One has the entire SGA in HugePages as expected. But the other does not, according to alert.log

2020-11-30T19:09:38.444409-06:00
  PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)
2020-11-30T19:09:38.444477-06:00
        4K       Configured               8          256520        NONE
2020-11-30T19:09:38.444661-06:00
     2048K            86925           87296           86795        NONE

So we fall short of 87296-86925=371 HugePages. Therefore, part of SGA, most notably log buffer (Redo Buffers), is allocated in the conventional 4k pagesize memory:

SQL> select "AREA NAME", "SEGMENT SIZE", "SIZE", pagesize, shmid from x$ksmssinfo;

AREA NAME                        SEGMENT SIZE         SIZE     PAGESIZE        SHMID
-------------------------------- ------------ ------------ ------------ ------------
Variable Size                    181999239168 181999239168      2097152           18
Variable Size                       536870912    536870912         4096           19
Redo Buffers                        513802240    513802240         4096           20
Fixed Size                           23068672     23068672      2097152           17
skgm overhead                           32768        32768         4096           21
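As a cross-check, the page counts in the alert.log lines and the segment sizes in x$ksmssinfo can be reconciled with plain shell arithmetic. All numbers below are copied from the two outputs above; the variable names are only for this sketch.

```shell
# 2 MB pages expected but not allocated as HugePages.
missing_2m=$(( 87296 - 86795 ))

# The same shortfall expressed as 4 KB pages (512 small pages per 2 MB page),
# plus the 8 small pages that were expected at 4K anyway.
small_pages=$(( missing_2m * 512 + 8 ))

# Segments listed with PAGESIZE 4096, converted to 4 KB pages.
seg_small_pages=$(( (536870912 + 513802240 + 32768) / 4096 ))

# Segments listed with PAGESIZE 2097152, converted to 2 MB pages.
seg_huge_pages=$(( (181999239168 + 23068672) / 2097152 ))

echo "$missing_2m $small_pages $seg_small_pages $seg_huge_pages"
# prints: 501 256520 256520 86795
```

Both views agree: 86795 pages were allocated at 2 MB, and the 256520 pages allocated at 4K are exactly the 501 missing HugePages plus the 8 small pages (the 32768-byte skgm overhead) that were expected at 4K in any case.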

HugePages (vm.nr_hugepages, i.e. /proc/sys/vm/nr_hugepages) was calculated as the sum of the SGA's of the two databases plus about 10 MB, +ASM has use_large_pages set to false, and both system shmmax and process memlock are set to essentially the physical memory of the box. So why do we still fall short? Is there anything else that could be using HugePages? This is the way to find out:

# grep -l '^KernelPageSize:     2048 kB' /proc/*/smaps > /tmp/abc.txt
# vi /tmp/abc.txt #change each line to just the pid, append comma to each line, join the lines into one with %j!, prepend the line with "ps -fp "
# sh /tmp/abc.txt

The output is hundreds of processes, all but one being of the two running Oracle databases. That single one that is not is

F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
5 S root      2185     1  0  80   0 - 1388607 futex_ Nov30 ?      00:08:08 ./cybAgent.bin -a

To confirm this never-heard-of process is indeed using HugePages, open /proc/2185/smaps with vi and see these lines

e0000000-100000000 rw-s 00000000 00:0e 0                                 /SYSV00000000 (deleted)
Size:             524288 kB
Rss:                   0 kB
...
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Locked:                0 kB
VmFlags: rd wr sh mr mp me ms de ht sd

The memory map of this process contains a 512 MB shared memory segment whose pagesize is 2M, i.e. the HugePages pagesize. No wonder one of the Oracle databases can't grab the HugePages meant for Oracle! So, is there any way to prevent this cybAgent process from using HugePages? A Google search found that this is from a product called Autosys from Broadcom. I registered a login on their community forum and posted a message asking this question. (But the message needs an admin's approval to appear. Having waited for two days with no answer, I found their website contact and sent a site feedback message asking why the admin didn't approve it. A curt email came back just saying the content of my posting violated their policy.) Another way is to make use of vm.hugetlb_shm_group to limit HugePages to oracle's group only, although cybAgent.bin runs as root. But since we'll replace Autosys with cron jobs soon, I didn't bother.
[Update 2021-11] Broadcom published the article "Huge pages problems prevent Oracle DB to start with WA Agent (Linux)" on 06/25/2021, which seems to provide a way to stop cybAgent from using HugePages.
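The manual vi editing used in the search above can be replaced by a short pipeline. This is a sketch, not the method actually used in this case: smaps_to_pids is a hypothetical helper that turns the file names printed by grep -l into a comma-separated pid list for ps -fp. In the demo line, 2185 is the cybAgent pid from this case and 4021 is a made-up pid.

```shell
# smaps_to_pids: read /proc/PID/smaps paths on stdin and print "PID,PID,...".
# Non-numeric entries such as /proc/self/smaps are ignored.
smaps_to_pids() {
    sed -n 's|^/proc/\([0-9][0-9]*\)/smaps$|\1|p' | paste -sd, -
}

printf '/proc/2185/smaps\n/proc/4021/smaps\n/proc/self/smaps\n' | smaps_to_pids
# prints 2185,4021

# As root on a live server (not run here):
#   ps -fp "$(grep -l '^KernelPageSize: *2048 kB' /proc/[0-9]*/smaps 2>/dev/null | smaps_to_pids)"
```

Filtering the ps output for anything that is not an Oracle process then points straight at third-party users of HugePages.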

(This article was published by IOUG in 2015. That old version is still available as a PDF file.)
