One VM running RHEL 7.7 always has the problem that it takes a very long time to reboot, measured by the time gap between the last line before and the first line after the reboot in /var/log/messages. For example, the following shows 31 minutes 10 seconds or 1870 seconds of time gap:
Nov 23 16:04:36 hostname npd[29501]: ntpd exiting on signal 15 Nov 23 16:35:46 hostname kernel: Initializing cgroup subsys cpusetSome articles such as Group Policies applied that force your Windows 2008 R2 systems to clear their pagefiles at shutdown point to a large pagefile or swap partition. On the VM experiencing this problem (VM 16 shown below), I just set up a cron job to run the following code at 12:00 AM daily
free grep VmSwap /proc/[1-9]*/status | grep -v ' 0 kB$' date swapoff -a date swapon -awhere the grep line captures the running processes that use swap at the time, and turning off and back on swap purges the swap partition.(Ref) The latest run log
total used free shared buff/cache available Mem: 32778544 29134548 308508 290660 3335488 3075188 Swap: 4095996 57864 4038132 /proc/4790/status:VmSwap: 96 kB Wed Nov 27 00:00:02 CST 2019 Wed Nov 27 00:00:27 CST 2019shows that at midnight of 11/27/2019, there was 57864 KB used swap, which was purged in 27-2=25 seconds, i.e. at a rate of 2314.56 KB/sec. (Process 4790 is python /usr/bin/goferd --foreground. It only used 96 out of 57864 KB swap. Once the process using swap exits, there's no way to find them afterwards.)
To investigate whether there's a relationship between the amount of used swap and reboot time, 48 VMs were checked for the reboot time gap (in seconds) shown in /var/log/messages and for the last line of sar -S -f /var/log/sa/sa?? (the system activity report on swap usage for the day of reboot) right before the line showing "LINUX RESTART", excluding the "Average" line (i.e. the swap usage of the system before the reboot).
VM# gap swap (time kbswpfree kbswpused %swpused kbswpcad %swpcad) ---- --- -------------------------------------------------------------- ********** Gerneral DBs: 01 57 12:00:02 PM 4023036 72960 1.78 3564 4.88 02 117 12:00:01 PM 4012540 83456 2.04 6392 7.66 03 16 09:30:01 AM 4063484 32512 0.79 280 0.86 04 15 10:00:01 AM 4073724 22272 0.54 300 1.35 05 46 05:30:01 PM 4072956 23040 0.56 880 3.82 06 27 05:30:01 PM 4075004 20992 0.51 904 4.31 07 73 11:50:01 AM 3717628 378368 9.24 3660 0.97 08 35 12:10:01 PM 3807228 288768 7.05 13724 4.75 09 84 02:30:01 PM 3834108 261888 6.39 10860 4.15 10 146 02:20:01 PM 3537708 558288 13.63 20380 3.65 11 67 02:40:01 PM 4056060 39936 0.98 176 0.44 12 43 02:20:01 PM 3968764 127232 3.11 17872 14.05 ********** General DBs (standbys): 13 48 01:50:01 PM 3619580 476416 11.63 7824 1.64 14 43 03:10:01 PM 4049916 46080 1.13 16664 36.16 15 111 03:40:01 PM 4058876 37120 0.91 19072 51.38 ********** Dedicated DBs (non-CNR): 16 1870 04:00:01 PM 2027384 2068612 50.50 70932 3.43 17 804+ 04:40:01 PM 2976764 1119232 27.33 45896 4.10 18 67 10:00:01 AM 2714364 1381632 33.73 17244 1.25 19 22 11:00:01 AM 3988988 107008 2.61 12260 11.46 20 24 02:00:01 PM 3900924 195072 4.76 4012 2.06 21 148 06:40:01 AM 4054012 41984 1.03 144 0.34 22 430 11:30:01 PM 4095996 0 0.00 0 0.00 23 94 12:10:01 PM 3253816 842180 20.56 45528 5.41 24 1370 10:20:01 AM 2097148 0 0.00 0 0.00 25 25 10:00:01 AM 4084212 11784 0.29 260 2.21 26 173 10:00:01 PM 2038052 59096 2.82 13032 22.05 27 53 10:10:01 AM 2097148 0 0.00 0 0.00 28 271 04:30:02 PM 261580 560 0.21 112 20.00 29 41 11:40:01 AM 4081404 14592 0.36 724 4.96 30 109 11:40:01 AM 3563260 532736 13.01 17308 3.25 31 52 32 57 ********** CNR DBs (generally better hardware): 33 27 11:50:01 AM 3930952 165044 4.03 8968 5.43 34 26 12:20:01 PM 4084988 11008 0.27 188 1.71 35 41 02:28:02 PM 129892344 0 0.00 0 0.00 36 37 09:00:01 AM 3783932 312064 7.62 5096 1.63 37 28 09:20:01 AM 3984636 111360 2.72 2208 1.98 38 60 11:50:01 AM 2829564 1266432 30.92 20420 1.61 39 26 12:20:01 PM 3835528 260468 6.36 6208 2.38 ********** Web servers: 40 616 06:00:01 AM 4095996 0 0.00 0 0.00 41 54 03:00:01 PM 4094196 1800 0.04 52 2.89 42 49 03:00:01 PM 4095476 520 0.01 72 13.85 43 65 03:00:01 PM 4073972 22024 0.54 328 1.49 44 44 06:00:01 AM 4095996 0 0.00 0 0.00 ********** Oracle forms servers: 45 44 06:00:01 AM 4095996 0 0.00 0 0.00 46 108 03:00:01 PM 3681292 414704 10.12 5904 1.42 47 21 03:00:02 PM 4022012 73984 1.81 3112 4.21 48 32 06:00:01 AM 4095996 0 0.00 0 0.00(VM 16 is the one that always reboots very slowly. The latest reboot of VM 17 was interrupted by manually resetting the VM, hence +. VMs 31 and 32 are missing the sar files.)
According to the data thus collected, it doesn't appear to have a good positive relationship between reboot time and swap usage, measured in whatever criterion shown here. Very roughly, we may say that if the used swap is high, the likelihood of a fast reboot is low, but not much else with confidence. For example, although VM 16's 2068612 KB used swap is high and its reboot time is also long (1870 seconds), VM 24 is a strong counter-example, so is VM 18. Plotting any of the 5 swap usage metrics against the reboot gap time shows very scattered data.
After the cron job script on VM 16 has been run for some days, an average swap purging rate of about 3 MB/sec has been observed. At that rate, its e.g. 2068612 KB or 2 GB used swap would be purged in about 680 seconds, which is only 36% of 1870 seconds recorded in /var/log/messages. So 2/3 of the reboot time on 11/23 was still not accounted for.
Next time, we'll closely monitor the console of the VM during the reboot. Meanwhile, the nightly swap check and purging job should continue to run. Swap usage probably plays a partial role in slowing down reboot, along with other factors. Periodically purging the swap partition does little harm and purging prior to the reboot may help shorten the reboot time. I'll update this note once I have more data.
2019-11