Quick Tips on Linux HugePages

This section is a practical list of steps you can follow to set up HugePages. Failing to configure HugePages can cause significant performance problems, especially on servers with lots of memory. For example, on one of our Financial databases, configuring HugePages alone reduced the runtime of a job by one hour. There are many little details in configuring HugePages, so it's better to have a checklist.

1. Determine an appropriate SGA. Take 50% of server memory as a starting point, e.g. 20G for a 40G memory server if single-node, a little less if RAC. Remember to add up SGA's for all instances if the server runs more than one instance.

2. For ease in calculation, always use MB as unit. To avoid SGA granule (chunk size) roundup, set SGA to an integer multiple of its granule size, which could be 32MB, 64MB, 128MB, or even greater if you have a huge SGA. You can find the SGA granule size for this instance by
SQL> select bytes/1048576 from v$sgainfo where name='Granule Size';
or
SQL> select ksppstvl/1048576 from x$ksppsv where indx=(select indx from x$ksppi where ksppinm='_ksmg_granule_size');
But it's OK to ignore this roundup for now.

3. Suppose SGA is set to 20480 MB. You can set HugePages to for example 20500 MB i.e. 10250 pages (each page is 2 MB). So vi /etc/sysctl.conf and add vm.nr_hugepages=10250.

4. Make sure /etc/sysctl.conf has kernel.shmmax set to a large number. To make it simple, just set it to the server memory in bytes. The first number shown in the output of command `free' (under "total", to the right of "Mem:") is in KB. So, just use that number with three 0's appended, which is close enough to multiplying by 1024.

5. Make sure /etc/security/limits.conf has
oracle soft memlock limit
oracle hard memlock limit

To make it simple, just set limit to the server memory in KB, which is the first number shown in the output of command `free' (under "total", to the right of "Mem:").

6. Disable HugePages in ASM and MGMTDB
Change environment to ASM with `. oraenv', then log in and run
SQL> alter system set use_large_pages=false scope=spfile;
Do the same to MGMTDB if it's installed.

7. Disable transparent HugePages
Modify /etc/default/grub to append transparent_hugepage=never to the last part of the string value (inside the quotation marks) for GRUB_CMDLINE_LINUX, save the file and run `grub2-mkconfig -o /boot/grub2/grub.cfg'. After server reboot, `cat /proc/cmdline' should show transparent_hugepage=never as part of the value.
Also disable tuned in case it re-enables transparent HugePages:
# systemctl disable tuned

8. Once the server is rebooted and instance(s) is/are up, check SGA setting and its actual value:
SQL> select value/1048576 from v$spparameter where name like 'sga%';
SQL> select value/1048576 from v$parameter where name like 'sga%';

If they don't match (because the setting is not a multiple of the granule size; see Step 2), it's always the latter that is bigger. Set the parameters to be the same so it's less confusing in the future:
SQL> alter system set sga_max_size=<above_value_in_v$parameter>m scope=spfile;
SQL> alter system set sga_target=<above_value_in_v$parameter>m scope=spfile;

Check HugePages usage by:
$ grep HugePages /proc/meminfo
HugePages_Free should be only a little larger than HugePages_Rsvd.
You can also check by reading alert_SID.log. At the very beginning of instance startup, a few lines indicate how much HugePages is needed and how much is provided by OS.
In 12c+, you can also check by
SQL> select "AREA NAME", "SEGMENT SIZE"/1048576, "SIZE"/1048576, pagesize, shmid from x$ksmssinfo;
Note the lines for PAGESIZE of 2097152.

9. In very rare cases, HugePages could be used by third party software.
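
The arithmetic in Steps 1 and 3 can be sketched as a small shell function. This is only an illustration: the function name nr_hugepages_for and its headroom argument are made up here, and it assumes the 2 MB page size used throughout this article.

```sh
# Sketch: number of HugePages needed for one or more SGAs given in MB.
# Usage: nr_hugepages_for HEADROOM_MB SGA_MB [SGA_MB ...]
# Assumes a 2 MB HugePage size; the headroom is a judgment call.
nr_hugepages_for() {
    headroom_mb=$1; shift
    page_mb=2
    total_mb=$headroom_mb
    for sga_mb in "$@"; do
        total_mb=$((total_mb + sga_mb))
    done
    # Round up so a partial page still gets a whole HugePage.
    echo $(( (total_mb + page_mb - 1) / page_mb ))
}
```

For the example in Step 3, nr_hugepages_for 20 20480 prints 10250, the value to put in vm.nr_hugepages. With more than one instance on the server, list every SGA after the headroom.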

For explanations of some suggestions outlined above, read on.

2020-12

The following is the original article (with updates constantly added over the years) meant to provide quick and practical tips on using Linux HugePages on servers that run Oracle databases.

1. Be generous first and dynamically shrink later

If you don't want to be accurate in calculating how much memory should be allocated for HugePages, give a rough and very generous estimate. Start all Oracle instances on the box. (To save time, startup nomount is enough.) Check the difference between HugePages_Free and HugePages_Rsvd, which is the wastage, because HugePages_Free includes reserved but not actually used memory. For example,

$ grep Huge /proc/meminfo
...
HugePages_Total:  3190
HugePages_Free:   2458
HugePages_Rsvd:   2341

2458-2341=117 pages of HugePages or 234 MB memory (assumes 2 MB page size) will never be used. You do NOT have to wait till the instances have been used for a while; as reserved pages actually get touched, both HugePages_Free and HugePages_Rsvd decrease, but the difference between them does not change. To understand the 3 lines of HugePages_*, look at this simple diagram:

UUUUUFFFF <-- Total split into really used (U) and free (F)
UUUUURRR. <-- Total split into really used (U), reserved (R) and really free (.)
If one letter or dot is one HugePage, the above says
HugePages_Total: 9
HugePages_Free:  4
HugePages_Rsvd:  3
and you'll have 4-3=1 page completely wasted.
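
The Free minus Rsvd subtraction can be scripted. The following is a minimal sketch, not a polished tool: the function name hugepages_wasted is made up here, it reads /proc/meminfo unless another file is given, and it falls back to a 2 MB page size if no Hugepagesize line is present.

```sh
# Sketch: report HugePages that are free and not reserved (the wastage).
# Usage: hugepages_wasted [FILE]   (FILE defaults to /proc/meminfo)
hugepages_wasted() {
    awk '
        /^HugePages_Free:/ { free = $2 }
        /^HugePages_Rsvd:/ { rsvd = $2 }
        /^Hugepagesize:/   { page_kb = $2 }
        END {
            if (page_kb == "") page_kb = 2048   # assumption: 2 MB pages
            wasted = free - rsvd
            printf "%d pages (%d MB)\n", wasted, wasted * page_kb / 1024
        }
    ' "${1:-/proc/meminfo}"
}
```

Fed the three HugePages_* lines from the first example, it prints 117 pages (234 MB).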

Now, let's dynamically shrink HugePages to reduce wastage. Take the example of 3190 HugePages shown earlier. Let's cut the wastage down to, say, 10 pages. So we should decrease HugePages_Total by 117-10=107. That is, change 3190 to 3190-107=3083.

# echo 3083 > /proc/sys/vm/nr_hugepages

Run cat /proc/sys/vm/nr_hugepages to confirm the number has been reduced to 3083. Then update vm.nr_hugepages in /etc/sysctl.conf with the new number so it takes effect on next reboot.
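
The shrink calculation above can be written as a function so the subtraction is not done by hand each time. This is a sketch under assumptions: the name shrink_target and the default of keeping 10 spare pages are choices made here for illustration, and the function only prints a number; it changes nothing by itself.

```sh
# Sketch: new nr_hugepages value that leaves only KEEP spare pages.
# Usage: shrink_target TOTAL FREE RSVD [KEEP]   (KEEP defaults to 10)
shrink_target() {
    total=$1 free=$2 rsvd=$3 keep=${4:-10}
    wasted=$((free - rsvd))
    if [ "$wasted" -le "$keep" ]; then
        echo "$total"                          # nothing worth reclaiming
    else
        echo $((total - (wasted - keep)))
    fi
}
# As root, the result could then be applied with:
#   shrink_target 3190 2458 2341 10 > /proc/sys/vm/nr_hugepages
```

With the numbers from the example, shrink_target 3190 2458 2341 10 prints 3083.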

The advantage of over-allocating HugePages at the beginning is that it saves time in getting the memory allocation right on the first try. In addition, dynamically changing HugePages allocation ensures no memory is wasted. In case of shutting down an Oracle instance for an extended period of time, you may choose to lower /proc/sys/vm/nr_hugepages to give the memory back to OS as well as Oracle PGA.

However, if you start back up the previously shutdown instance, you'll have to increase the nr_hugepages number, and you may not be able to bring it up fully to the desired number if the available memory is no longer physically contiguous. When that happens, you may or may not be able to start the instance depending on the setting of use_large_pages. If it's set to true (default), the instance may be started but it uses no HugePages at all and you'll waste lots of HugePages unless you give up and lower nr_hugepages back down to give it to OS and wait till next server reboot. So think it over whenever you plan to lower the value.

2. Seeing is believing

In older versions of Oracle, the only way to know that HugePages is used is to check /proc/meminfo. Later versions show the lines in alert_sid.log (Oracle 11g example):

Total Shared Global Region in Large Pages = 2370 MB (100%)

Large Pages used by this instance: 1185 (2370 MB)
Large Pages unused system wide = 815 (1630 MB)
Large Pages configured system wide = 2000 (4000 MB)
Large Page size = 2048 KB

The instance in this example here clearly has too much unused HugePages. I would cut configured HugePages down from 2000 to 2000-815+overhead, say, 1200. The overhead is related to roundup of shared memory segments for the instance as shown in ipcs or sysresv upward to the nearest SGA granule size.
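
The same subtraction can be pulled straight from the 11g alert log lines. A rough sketch follows; the function name suggest_from_alert is made up here, the default overhead of 15 pages is just the figure implied by the "say, 1200" above, and it assumes the "system wide" lines are worded exactly as in the example.

```sh
# Sketch: suggest a HugePages count from 11g alert log "Large Pages" lines.
# Usage: suggest_from_alert FILE [OVERHEAD_PAGES]   (overhead defaults to 15)
suggest_from_alert() {
    overhead=${2:-15}
    awk -v overhead="$overhead" '
        /Large Pages unused system wide/     { unused = $7 }
        /Large Pages configured system wide/ { configured = $7 }
        END { print configured - unused + overhead }
    ' "$1"
}
```

If the log contains several instance startups, the last pair of lines wins, so it may be better to feed it only the startup section you care about.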

In 12c, the alert.log has these lines instead (excluding the annoying timestamp lines profusely intercalated):

  PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)
        4K       Configured               5               5        NONE
     2048K             1620            1617            1617        NONE

This example only wastes 3 HugePages, corresponding to the following /proc/meminfo values where 10-7=3:

HugePages_Total:    1620
HugePages_Free:       10
HugePages_Rsvd:        7
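
To pick the numbers out of the 12c table without wading through the timestamp lines, something like the following could be used. It is a sketch only: the function name largepage_summary is made up, and it assumes the 2048K row has the column order shown above (available, expected, allocated).

```sh
# Sketch: summarize the 2048K row of the 12c alert.log page table.
# surplus     = AVAILABLE_PAGES - EXPECTED_PAGES (negative means too few)
# unallocated = EXPECTED_PAGES - ALLOCATED_PAGES (SGA pages left in 4K memory)
largepage_summary() {
    awk '
        $1 == "2048K" && NF >= 4 {
            printf "surplus=%d unallocated=%d\n", $2 - $3, $3 - $4
        }
    ' "$1"
}
```

A positive surplus is the wastage discussed earlier; a non-zero unallocated value means part of the SGA did not get HugePages.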

Beginning with Linux kernel 2.6.29 or Red Hat Enterprise Linux 6 and possibly later minor releases of RHEL 5, /proc/pid/smaps provides clues about HugePages usage as well.

# cat /proc/<any pid of Oracle instance>/smaps
...
61000000-a7000000 rwxs 00000000 00:0c 1146885                            /SYSV00000000 (deleted)
Size:            1146880 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:     2048 kB  <-- 2MB HugePage size
MMUPageSize:        2048 kB  <-- 2MB HugePage size

The last two lines showing 2 MB instead of 4 KB page size are the telltale sign that HugePages are used. If you want to see all processes using HugePages, you can run as root

grep '^KernelPageSize:     2048 kB' /proc/[0-9]*/smaps | awk -F/ '{print $3}' > /tmp/$$
echo $':%s/$/,/\n:%j!\n$xIps -opid=,ruser=,args= -p \E:x\n' | vi /tmp/$$ 2>/dev/null
sh /tmp/$$
rm /tmp/$$
That of course shows all processes running Oracle on an Oracle server. But the commands are generic, not specific to an Oracle server.
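
If you prefer not to drive vi from a script, the same list can probably be built with a plain pipeline. This is a sketch under assumptions: the function name hugepage_pids and its optional proc_root argument (handy for trying it against a copy of the files) are made up here, and it only looks for the 2048 kB page size.

```sh
# Sketch: comma-separated PIDs whose smaps show a 2 MB KernelPageSize.
# Usage: hugepage_pids [PROC_ROOT]   (PROC_ROOT defaults to /proc)
hugepage_pids() {
    proc_root=${1:-/proc}
    # grep -l lists each matching smaps file once; the PID is the
    # directory name just before "smaps".
    grep -l '^KernelPageSize: *2048 kB' "$proc_root"/[0-9]*/smaps 2>/dev/null |
        awk -F/ '{ print $(NF-1) }' | sort -n | paste -sd, -
}
# As root (needed to read other users' smaps):
#   ps -o pid=,ruser=,args= -p "$(hugepage_pids)"
```

The output is a list such as 7,101 that can be handed to ps -p directly. Errors from processes that exit while being scanned are simply discarded.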

Beginning with Oracle 12c, you can also check the fixed table x$ksmssinfo (probably Kernel Service, Memory Sga OS (level) Info), which not only tells us whether the memory page size is that of HugePages, but even maps the SGA components with shared memory segments. The example below is from Oracle 12.1.0.2, where in-memory area is configured. (I removed the ipcs lines irrelevant to this Oracle instance in the example.)

SQL> select "AREA NAME", "SEGMENT SIZE", "SIZE", pagesize, shmid from x$ksmssinfo;

AREA NAME            SEGMENT SIZE       SIZE   PAGESIZE      SHMID
-------------------- ------------ ---------- ---------- ----------
imc area rdonly 0        83886080   83886080    2097152   87588873
Variable Size          3288334336 3254779904    2097152   87621642
imc area default 0     3288334336   33554432    2097152   87621642
Redo Buffers             14680064   13844480    2097152   87654411
Fixed Size                4194304    2932736    2097152   87556104
skgm overhead               20480      20480       4096   87687180

SQL> !ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
...
0x00000000 87556104   oracle     640        4194304    25
0x00000000 87588873   oracle     640        83886080   25
0x00000000 87621642   oracle     640        3288334336 25
0x00000000 87654411   oracle     640        14680064   25
0x639dac14 87687180   oracle     640        20480      25

As you can see, this fixed table tells us HugePages is used except for Oracle's interface to the OS in the generic memory management layer (skgm overhead), which still uses the default 4 KB page size. The largest segment of 3288334336 bytes in size is in two parts: Variable Size (not the same as Variable Size shown by SQL*Plus command show sga, which excludes buffer cache) used for buffer cache and various SGA pools (shared pool, java pool, large pool), and part of the in-memory area or column store (imc area default 0). The second largest segment of 83886080 bytes contains the other part of in-memory area (imc area rdonly 0). The remaining two segments are obvious. But in spite of small sizes, they don't seem to be fully used.
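
As a cross-check, the x$ksmssinfo output can be added up per page size, counting each shared memory segment once. The sketch below assumes the query output has been spooled to a text file with the four numeric columns last, as in the example; the function name pages_by_pagesize is made up here.

```sh
# Sketch: pages per page size from spooled x$ksmssinfo output.
# Each shmid is counted once, using its SEGMENT SIZE.
# Output lines: PAGESIZE_BYTES NUMBER_OF_PAGES
pages_by_pagesize() {
    awk '
        # Data rows end in: SEGMENT_SIZE SIZE PAGESIZE SHMID (all numeric).
        NF >= 5 && $NF ~ /^[0-9]+$/ && $(NF-1) ~ /^[0-9]+$/ && $(NF-3) ~ /^[0-9]+$/ {
            shmid = $NF
            if (!(shmid in seen)) {
                seen[shmid] = 1
                bytes[$(NF-1)] += $(NF-3)
            }
        }
        END {
            for (ps in bytes)
                printf "%d %d\n", ps, bytes[ps] / ps
        }
    ' "$1" | sort -n
}
```

For the 12.1.0.2 example above this gives 5 pages of 4096 bytes and 1617 pages of 2097152 bytes, which agrees with the ALLOCATED_PAGES column of the 12c alert.log table shown earlier.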

The same type of information is also written to a trace file, even in Oracle 11g, although in 11g it's not exposed to any table, e.g.

Large Pages segment allocation succeed, size = 335544320, shmid = 3473420, target_node = 129, large page used 160
Large Pages segment allocation succeed, size = 31876710400, shmid = 3702803, target_node = 129, large page used 15200
Large Pages segment allocation succeed, size = 2097152, shmid = 3768341, target_node = 129, large page used 1
(In 12c, the wording is "Shared memory segment allocated:".) So, you can use the shmid to match up with the lines in the ipcs -m output.

3. Miscellaneous

To make calculation of HugePages easier, always use MB in dealing with memory. When setting SGA, set it to an integer multiple of memory granule size, which you can find by
select ksppstvl/1048576 mb from x$ksppsv where indx=(select indx from x$ksppi where ksppinm='_ksmg_granule_size');
so that the value you set (seen in v$spparameter) matches the value you end up with (seen in v$parameter)
select value/1048576 from v$spparameter where name like 'sga%';
select value/1048576 from v$parameter where name like 'sga%';
Then set vm.nr_hugepages in /etc/sysctl.conf a little larger than SGA in MB divided by 2 (since each page is 2 MB). Remember to add up all SGA's if the server has multiple instances, except for those with use_large_pages set to false.
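
Rounding an SGA size up to the next granule multiple is simple integer arithmetic. A minimal sketch, with a made-up function name and both values in MB:

```sh
# Sketch: round an SGA size (MB) up to a multiple of the granule size (MB).
# Usage: round_up_to_granule SGA_MB GRANULE_MB
round_up_to_granule() {
    sga_mb=$1 granule_mb=$2
    echo $(( (sga_mb + granule_mb - 1) / granule_mb * granule_mb ))
}
```

For example, round_up_to_granule 20000 64 prints 20032, while a value that is already a multiple, such as 20480 with a 128 MB granule, comes back unchanged.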

An ASM instance does not have a buffer cache (the so-called "ASM buffer cache" caches some metadata). There's no need to configure HugePages for it, even if it's small. So disable it by alter system set use_large_pages=false scope=spfile and bounce it. If your 12c or 18c RAC installation includes MGMTDB, you definitely should disable its usage of HugePages; the parameter is true by default. Since this management database is only run on one of the nodes, usually but not always on the first node, accommodating its HugePages requirement on that single node but not the others would either complicate HugePages setup or waste memory, depending on whether you configure the same HugePages on all nodes. In Oracle 19c, we're finally relieved by Oracle's decision to make this useless MGMTDB database optional (and I strongly recommend you not install it).

Don't forget to set memlock in /etc/security/limits.conf (and add session required pam_limits.so to /etc/pam.d/login) and kernel.shmmax in /etc/sysctl.conf high enough to cover the entire SGA since HugePages must be physically contiguous. To make it simple, set them to the physical memory of the server (but note memlock uses unit KB while kernel.shmmax uses byte); they are just mathematical limits and do not actually allocate anything. Changing the values in /etc/security/limits.conf requires you to re-login because your shell takes the values in this file. If the instance is up, you can find the running process limits by cat /proc/pid/limits. (If GI was started with limits too low and you don't want to bounce GI, perhaps because you have multiple DB instances, you must use sqlplus, not srvctl, to bounce the instance for which you want to have a higher limit.) Also, kernel.shmall should not be too low. To make it simple, just set it to the number of pages as if all memory would be in 4k size, i.e. the Mem value under total of command free, divided by 4, since the default page size is 4k and the free command output is in KB. For new values of kernel.shmmax and kernel.shmall to take effect, just type sysctl -p.
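
The three "set it to physical memory" values differ only in their units, which is easy to get wrong by hand. The sketch below prints candidate lines from the memory total in KB; the function name limits_from_memtotal is made up, the OS user is assumed to be oracle, and it multiplies by 1024 for kernel.shmmax rather than appending three 0's.

```sh
# Sketch: candidate settings derived from total memory in KB.
# Usage: limits_from_memtotal MEM_KB
limits_from_memtotal() {
    mem_kb=$1
    echo "kernel.shmmax=$((mem_kb * 1024))"   # bytes
    echo "kernel.shmall=$((mem_kb / 4))"      # 4 KB pages
    echo "oracle soft memlock $mem_kb"        # KB, for limits.conf
    echo "oracle hard memlock $mem_kb"
}
# On a live server the input could come from MemTotal, the same number
# that free shows under "total":
#   limits_from_memtotal "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
```

The first two lines are meant for /etc/sysctl.conf and the last two for /etc/security/limits.conf. Nothing is written anywhere; review the output before copying it in.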

/etc/default/grub (or for older Linux, /boot/grub/grub.conf) should have transparent_hugepage=never appended to the GRUB_CMDLINE_LINUX line. On grub2 systems, then run grub2-mkconfig -o /boot/grub2/grub.cfg. To check if the currently running kernel has it disabled, cat /proc/cmdline. Transparent HugePages causes high sys CPU. You may also disable tuned since it may enable THP again: systemctl disable tuned.
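
Both checks can be wrapped in small helpers. This is a sketch with made-up function names; besides /proc/cmdline it looks at /sys/kernel/mm/transparent_hugepage/enabled, which is where current kernels report the active THP mode in square brackets (some older Red Hat kernels use a different path, so treat the default as an assumption).

```sh
# Sketch: print the active THP mode, i.e. the bracketed word in a line
# such as "always madvise [never]".
# Usage: thp_mode [FILE]
thp_mode() {
    sed -n 's/.*\[\(.*\)].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Sketch: succeed if a kernel command line contains transparent_hugepage=never.
# Usage: cmdline_disables_thp "$(cat /proc/cmdline)"
cmdline_disables_thp() {
    case " $1 " in
        *" transparent_hugepage=never "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

If thp_mode prints anything other than never after a reboot, something such as tuned may have switched it back on.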

One very simple thing. Make sure memory_max_target and memory_target are not set. If they are, and you use an spfile, alter system reset memory_max_target and alter system reset memory_target. In older versions, you may have to use the trick in Doc 1138645.1.


2015-09, 2017-08, 2020-11


Troubleshooting an interesting case
2020-12

The server has two Oracle instances running. One has the entire SGA in HugePages as expected. But the other does not, according to alert.log

2020-11-30T19:09:38.444409-06:00
  PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)
2020-11-30T19:09:38.444477-06:00
        4K       Configured               8          256520        NONE
2020-11-30T19:09:38.444661-06:00
     2048K            86925           87296           86795        NONE
So we fall short by 87296-86925=371 HugePages. Therefore, part of the SGA, most notably the log buffer (Redo Buffers), is allocated in conventional 4k pagesize memory:
SQL> select "AREA NAME", "SEGMENT SIZE", "SIZE", pagesize, shmid from x$ksmssinfo;

AREA NAME                        SEGMENT SIZE         SIZE     PAGESIZE        SHMID
-------------------------------- ------------ ------------ ------------ ------------
Variable Size                    181999239168 181999239168      2097152           18
Variable Size                       536870912    536870912         4096           19
Redo Buffers                        513802240    513802240         4096           20
Fixed Size                           23068672     23068672      2097152           17
skgm overhead                           32768        32768         4096           21
HugePages (vm.nr_hugepages, i.e. /proc/sys/vm/nr_hugepages) is calculated as the sum of the SGA's of the two databases plus about 10 MB, +ASM has use_large_pages set to false, and system shmmax, as well as process memlock, is set to essentially the physical memory of the box. So why do we still fall short? Is there anything else that could be using HugePages? This is the way to find out:
# grep -l '^KernelPageSize:     2048 kB' /proc/*/smaps > /tmp/abc.txt
# vi /tmp/abc.txt #change each line to just the pid, append comma to each line, join the lines into one with %j!, prepend the line with "ps -fp "
# sh /tmp/abc.txt
The output is hundreds of processes, all but one being of the two running Oracle databases. That single one that is not is
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
5 S root      2185     1  0  80   0 - 1388607 futex_ Nov30 ?      00:08:08 ./cybAgent.bin -a
To confirm this never-heard-of process is indeed using HugePages, open /proc/2185/smaps with vi and see these lines
e0000000-100000000 rw-s 00000000 00:0e 0                                 /SYSV00000000 (deleted)
Size:             524288 kB
Rss:                   0 kB
...
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Locked:                0 kB
VmFlags: rd wr sh mr mp me ms de ht sd
The memory map of this process contains a 512 MB (524288 kB) shared memory segment whose pagesize is 2M, i.e. HugePages pagesize. No wonder one of the Oracle databases can't grab the HugePages meant for Oracle! So, is there any way to prevent this cybAgent process from using HugePages? A Google search found that this is from a product called Autosys from Broadcom. I registered a login on their community forum and posted a message to it asking this question. (Update: The message needs admin's approval to appear. Having waited for two days, I found their website contact and sent a site feedback message asking why the admin didn't approve. A curt email came back just saying the content of my posting violated their policy.) Another way is to make use of vm.hugetlb_shm_group to limit HugePages to oracle's group only, since cybAgent.bin runs as root. But since we'll replace Autosys with cron jobs soon, I didn't bother.


(This article was published by IOUG in 2015. That old version is still available as a PDF file.)



