Unstable Private Interconnect Network Found to be Caused by IP Conflict

Ever since this pre-production 2-node Oracle cluster was set up a month ago, the cluster interconnect network has been unstable. At no specific time throughout the day, but definitely a few times in the early morning, the network breaks and the Oracle GI (Grid Infrastructure, a.k.a. clusterware) ocssd.log shows "no network HB" (HB for heartbeat). Sometimes the loss of the HB persists longer than the GI tolerance (30 seconds in 11gR2 and up), and the GI stack crashes. The two cluster nodes are VMware guests. To troubleshoot, the SA moved the two nodes onto one ESXi host to rule out any possible network traffic to the outside. And yet the problem continued.

Troubleshooting:

A cron job is set up to UDP "ping" from one node to the other. The script simply sends a UDP packet to port 42424 of the partner node [note1]:

#run on nodea to "ping" nodeb at its physical private IP ...13 and virtual link-local private IP 169.254.24.48 on Oracle CSSD port 42424:
echo "$(date "+%Y%m%d %H:%M:%S"): $(traceroute -nrU 10.114.196.13 -p42424)" >> udppingtest.log
echo "$(date "+%Y%m%d %H:%M:%S"): $(traceroute -nrU 169.254.24.48 -p42424)" >> udppingtest.log

(Actually, the second ping, to the link-local IP, is not needed; that address is used by RAC cache fusion only. The GI heartbeat goes only over the physical interconnect.[note2])

According to the job log, the network failure times indeed match the ocssd.log "no network HB" times. This proves that the network problem is at the OS level, not the Oracle level.

Finally, something odd was found, and on one node only, nodeb. If I run arping ("send ARP REQUEST to a neighbour host" per the man page) from node B to A, i.e., ARP ping from the private IP of nodeb, through the private interconnect physical interface, to the private IP of nodea:

nodeb ~ $ arping -s 10.114.196.13 -I eno33559296 10.114.196.12   <-- source: nodeb IP ...13, destination: nodea IP ...12
ARPING 10.114.196.12 from 10.114.196.13 eno33559296
Unicast reply from 10.114.196.12 [00:50:56:A6:E2:4A]  0.585ms   <-- MAC address ...:E2:4A
Unicast reply from 10.114.196.12 [00:50:56:A6:F6:A9]  0.721ms   <-- MAC address ...:F6:A9
Unicast reply from 10.114.196.12 [00:50:56:A6:F6:A9]  0.695ms
^CSent 2 probes (1 broadcast(s))
Received 3 response(s)

I see that the first MAC address differs from the rest (I pressed Control-C to abort). This is absolutely reproducible and only happens on this node. Immediately after arping, I run arp -a to see the MAC address to IP mapping:

nodeb ~ $ arp -a
...
? (10.114.196.12) at 00:50:56:a6:e2:4a [ether] on eno33559296   <-- MAC address ...:E2:4A
...

I see that the first MAC address, ending with E2:4A, is still bound to the destination IP (the private IP of node A), and the one ending with F6:A9 is not in the `arp -a' output. And I can see that this E2:4A MAC is indeed bound to the private IP on node A:

nodea ~ $ ifconfig eno33559296   <-- run on nodea
eno33559296: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.114.196.12  netmask 255.255.255.0  broadcast 10.114.196.255   <-- IP address ...12
        inet6 fe80::250:56ff:fea6:e24a  prefixlen 64  scopeid 0x20<link>   <-- MAC address ...:E2:4A
        ether 00:50:56:a6:e2:4a  txqueuelen 1000  (Ethernet)
...

So the mystery is: where is the host or device with the F6:A9 MAC? Why does my node B contact the correct E2:4A first and, from the second ping on, contact the mysterious F6:A9 host? Could this be related to our frequent network problem?
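By the way, this arping check can be scripted so it does not rely on eyeballing the output. Here is a minimal sketch (not what we actually ran during the troubleshooting; the interface name and IPs are the ones used above, and arping needs root). Run it on each node against its partner's private IP; on a healthy interconnect only one MAC should ever answer.

#!/bin/bash
# Minimal sketch: flag a possible IP conflict if ARP probes for one IP
# are answered by more than one distinct MAC address.
# IFACE, SRC and DST below are this cluster's values; adjust as needed.
IFACE=eno33559296      # private interconnect NIC
SRC=10.114.196.13      # this node's private IP
DST=10.114.196.12      # partner node's private IP

# Send 10 ARP requests and collect the distinct MACs that replied.
MACS=$(arping -c 10 -s "$SRC" -I "$IFACE" "$DST" | grep -o '\[[0-9A-Fa-f:]*\]' | sort -u)

if [ -z "$MACS" ]; then
    echo "No ARP reply from $DST"
elif [ "$(echo "$MACS" | wc -l)" -gt 1 ]; then
    echo "WARNING: $DST answered from more than one MAC:" $MACS
else
    echo "OK: $DST answered from a single MAC:" $MACS
fi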
By contrast, I don't see the problem if I arping from node A to B:

nodea ~ $ arping -s 10.114.196.12 -I eno33559296 10.114.196.13   <-- source: nodea IP ...12, destination: nodeb IP ...13
...
? (10.114.196.13) at 00:50:56:a6:64:32 [ether] on eno33559296   <-- MAC ...:64:32 as expected
...

The destination MAC address is consistent and is shown in `arp -a', and ifconfig (not shown here) confirms the 64:32 MAC is bound to the correct IP.

Solution:

Log in to the VMware vSphere Client. Find the ESXi host the guests are on and click on it. Go to the Summary tab. Open the vSphere Distributed Switch the host "plugs" into. Go to the Ports tab. There we find a guest host that should not be there. The SA immediately recognized a mistake he had made and deleted this guest. It is very likely that this guest had the IP 10.114.196.12, conflicting with that of my Oracle server nodea, and that it had the MAC ...:F6:A9. Since the guest was immediately destroyed, we were not able to do more analysis. Ever since then, the Oracle cluster has been extremely stable.

Lessons and more questions:

If this were DHCP, conflicting IPs would be reported to the DHCP clients involved in the conflict. Since this is not DHCP, conflicting IPs have to be found manually. ARP ping is a basic and efficient tool, although its output does not readily point to the problem of conflicting IPs. In general, when the network is unstable, arping may be used in addition to ping to see whether there is any anomaly at a lower layer. If needed, the ultimate troubleshooting tool, a packet sniffer that captures and analyzes the various OSI layers of the network, can be used.

It is still an interesting mystery that arping to an IP which two devices claim to own consistently gets the first response from one device and all subsequent responses from the other. Why is it not random, e.g. a few responses from one device intermingled with the other's?

December 2015

_______________

[note1] In Oracle 11gR2 and above, in spite of the documentation http://docs.oracle.com/cd/E11882_01/install.112/e47689/app_port.htm saying that port 42424 is the default port used by the CSS daemon and that there is a dynamic port range, port 42424 is actually only used for management work such as startup, reconfiguration, etc. Regular heartbeats go over a randomly assigned port. Verify with e.g. `tcpdump -i <interface>'. See also http://oradbatips.blogspot.com/2012/12/tip-104-node-eviction-in-rac-11gr2-due.html. In 11gR1 and older, the port is fixed at 49895, using the TCP protocol: http://docs.oracle.com/cd/B28359_01/install.111/b32002/app_port.htm

[note2] Highly Available IP (HAIP) FAQ for Release 11.2 (Doc ID 1664291.1): "HAIP provides a layer of transparency between multiple Network Interface Cards (NICs) on a given node that are used for the cluster_interconnect which is used by RDBMS and ASM instances. However it is not used for the Clusterware network heartbeat."

Why does Oracle bother to create virtual interfaces and bind link-local addresses to them for the cluster private interconnect? It may be related to ease of failover. Ideally, we should have two (or more) NICs for the interconnect. Since Oracle no longer recommends OS-level bonding or teaming, its clusterware takes over the job of failing over between multiple IPs. If the physical IPs were used, Oracle would have to implement bonding exactly as the OS would. Oracle decided to create one more layer of abstraction or virtualization, probably because it is easier for its own cluster software this way.
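To see this extra layer on a live cluster, here is a rough sketch of what one might run. It assumes 11gR2 or later Grid Infrastructure, GRID_HOME pointing to the GI home, and eno33559296 as the private NIC (as on this cluster); output formats vary by version.

# which interface(s) GI has registered as the cluster_interconnect
$GRID_HOME/bin/oifcfg getif
# the physical private IP, plus the 169.254.x.x HAIP address if it is up
ip -4 addr show dev eno33559296
# status of the HAIP resource in the lower (init) stack
$GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init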
Why not use this higher layer of virtualization, HAIP, for the network heartbeat as well? I don't know. Maybe it is a good idea to leave such basic, low-level traffic to the physical interface.
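One way to see this split on the wire, heartbeat on the physical addresses and cache fusion on the HAIP addresses, is to sniff the private NIC. A rough sketch follows (run as root; the interface and IPs are this cluster's, and the filter is by host rather than by port because the heartbeat port is dynamic, per [note1]):

# UDP between the two physical private IPs; this includes the CSSD heartbeat
tcpdump -nn -i eno33559296 udp and host 10.114.196.12 and host 10.114.196.13
# UDP involving the link-local HAIP range: cache fusion and ASM traffic, not the heartbeat
tcpdump -nn -i eno33559296 udp and net 169.254.0.0/16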