Long Reboot Time and Swap Usage on VM?

One VM running RHEL 7.7 always has the problem that it takes a very long time to reboot, measured by the time gap between the last line before and the first line after the reboot in /var/log/messages. For example, the following shows 31 minutes 10 seconds or 1870 seconds of time gap:

Nov 23 16:04:36 hostname npd[29501]: ntpd exiting on signal 15
Nov 23 16:35:46 hostname kernel: Initializing cgroup subsys cpuset

Some articles such as Group Policies applied that force your Windows 2008 R2 systems to clear their pagefiles at shutdown point to a large pagefile or swap partition. On the VM experiencing this problem (VM 16 shown below), I just set up a cron job to run the following code at 12:00 AM daily

free
grep VmSwap /proc/[1-9]*/status | grep -v ' 0 kB$'
date
swapoff -a
date
swapon -a

where the grep line captures the running processes that use swap at the time, and turning off and back on swap purges the swap partition.(Ref) The latest run log

              total        used        free      shared  buff/cache   available
Mem:       32778544    29134548      308508      290660     3335488     3075188
Swap:       4095996       57864     4038132
/proc/4790/status:VmSwap:             96 kB
Wed Nov 27 00:00:02 CST 2019
Wed Nov 27 00:00:27 CST 2019

shows that at midnight of 11/27/2019, there was 57864 KB used swap, which was purged in 27-2=25 seconds, i.e. at a rate of 2314.56 KB/sec. (Process 4790 is python /usr/bin/goferd --foreground. It only used 96 out of 57864 KB swap. Once the process using swap exits, there's no way to find them afterwards.)

To investigate whether there's a relationship between the amount of used swap and reboot time, 48 VMs were checked for the reboot time gap (in seconds) shown in /var/log/messages and for the last line of sar -S -f /var/log/sa/sa?? (the system activity report on swap usage for the day of reboot) right before the line showing "LINUX RESTART", excluding the "Average" line (i.e. the swap usage of the system before the reboot).

VM#  gap swap (time  kbswpfree kbswpused  %swpused  kbswpcad   %swpcad)
---- --- --------------------------------------------------------------
********** Gerneral DBs:
01    57 12:00:02 PM   4023036     72960      1.78      3564      4.88
02   117 12:00:01 PM   4012540     83456      2.04      6392      7.66
03    16 09:30:01 AM   4063484     32512      0.79       280      0.86
04    15 10:00:01 AM   4073724     22272      0.54       300      1.35
05    46 05:30:01 PM   4072956     23040      0.56       880      3.82
06    27 05:30:01 PM   4075004     20992      0.51       904      4.31
07    73 11:50:01 AM   3717628    378368      9.24      3660      0.97
08    35 12:10:01 PM   3807228    288768      7.05     13724      4.75
09    84 02:30:01 PM   3834108    261888      6.39     10860      4.15
10   146 02:20:01 PM   3537708    558288     13.63     20380      3.65
11    67 02:40:01 PM   4056060     39936      0.98       176      0.44
12    43 02:20:01 PM   3968764    127232      3.11     17872     14.05
********** General DBs (standbys):
13    48 01:50:01 PM   3619580    476416     11.63      7824      1.64
14    43 03:10:01 PM   4049916     46080      1.13     16664     36.16
15   111 03:40:01 PM   4058876     37120      0.91     19072     51.38
********** Dedicated DBs (non-CNR):
16  1870 04:00:01 PM   2027384   2068612     50.50     70932      3.43
17  804+ 04:40:01 PM   2976764   1119232     27.33     45896      4.10
18    67 10:00:01 AM   2714364   1381632     33.73     17244      1.25
19    22 11:00:01 AM   3988988    107008      2.61     12260     11.46
20    24 02:00:01 PM   3900924    195072      4.76      4012      2.06
21   148 06:40:01 AM   4054012     41984      1.03       144      0.34
22   430 11:30:01 PM   4095996         0      0.00         0      0.00
23    94 12:10:01 PM   3253816    842180     20.56     45528      5.41
24  1370 10:20:01 AM   2097148         0      0.00         0      0.00
25    25 10:00:01 AM   4084212     11784      0.29       260      2.21
26   173 10:00:01 PM   2038052     59096      2.82     13032     22.05
27    53 10:10:01 AM   2097148         0      0.00         0      0.00
28   271 04:30:02 PM    261580       560      0.21       112     20.00
29    41 11:40:01 AM   4081404     14592      0.36       724      4.96
30   109 11:40:01 AM   3563260    532736     13.01     17308      3.25
31    52
32    57
********** CNR DBs (generally better hardware):
33     27 11:50:01 AM   3930952    165044      4.03      8968      5.43
34     26 12:20:01 PM   4084988     11008      0.27       188      1.71
35     41 02:28:02 PM 129892344         0      0.00         0      0.00
36     37 09:00:01 AM   3783932    312064      7.62      5096      1.63
37     28 09:20:01 AM   3984636    111360      2.72      2208      1.98
38     60 11:50:01 AM   2829564   1266432     30.92     20420      1.61
39     26 12:20:01 PM   3835528    260468      6.36      6208      2.38
********** Web servers:
40    616 06:00:01 AM   4095996         0      0.00         0      0.00
41     54 03:00:01 PM   4094196      1800      0.04        52      2.89
42     49 03:00:01 PM   4095476       520      0.01        72     13.85
43     65 03:00:01 PM   4073972     22024      0.54       328      1.49
44     44 06:00:01 AM   4095996         0      0.00         0      0.00
********** Oracle forms servers:
45     44 06:00:01 AM   4095996         0      0.00         0      0.00
46    108 03:00:01 PM   3681292    414704     10.12      5904      1.42
47     21 03:00:02 PM   4022012     73984      1.81      3112      4.21
48     32 06:00:01 AM   4095996         0      0.00         0      0.00

(VM 16 is the one that always reboots very slowly. The latest reboot of VM 17 was interrupted by manually resetting the VM, hence +. VMs 31 and 32 are missing the sar files.)

According to the data thus collected, it doesn't appear to have a good positive relationship between reboot time and swap usage, measured in whatever criterion shown here. Very roughly, we may say that if the used swap is high, the likelihood of a fast reboot is low, but not much else with confidence. For example, although VM 16's 2068612 KB used swap is high and its reboot time is also long (1870 seconds), VM 24 is a strong counter-example, so is VM 18. Plotting any of the 5 swap usage metrics against the reboot gap time shows very scattered data.

After the cron job script on VM 16 has been run for some days, an average swap purging rate of about 3 MB/sec has been observed. At that rate, its e.g. 2068612 KB or 2 GB used swap would be purged in about 680 seconds, which is only 36% of 1870 seconds recorded in /var/log/messages. So 2/3 of the reboot time on 11/23 was still not accounted for.

Next time, we'll closely monitor the console of the VM during the reboot. Meanwhile, the nightly swap check and purging job should continue to run. Swap usage probably plays a partial role in slowing down reboot, along with other factors. Periodically purging the swap partition does little harm and purging prior to the reboot may help shorten the reboot time. I'll update this note once I have more data.

2019-11