I am building a vSAN cluster consisting of 2 racks with 3 nodes each (this will eventually become a stretched cluster). Each rack uses its own subnets, as listed below:

Rack 1:

  • Management: 10.73.8.0/25 (Gateway: 10.73.8.126)
  • vMotion: 10.73.10.0/25 (Gateway: 10.73.10.126)
  • vSAN: 10.73.11.0/25 (Gateway: 10.73.10.126)

Rack 2:

  • Management: 10.73.8.128/25 (Gateway: 10.73.8.254)
  • vMotion: 10.73.10.128/25 (Gateway: 10.73.10.254)
  • vSAN: 10.73.11.128/25 (Gateway: 10.73.10.254)

I built the cluster with all nodes in Rack 1 without any problems. Everything works and I have a few test VMs running. When I try to add nodes from Rack 2 to the same cluster, I get a "vSAN cluster partition" error. Here is what I've checked and tested:

  1. I have full end-to-end connectivity between ALL nodes (vmkping between nodes in both racks works on all subnets with MTU-sized packets and fragmentation disabled).
  2. The unicast agent list on all nodes is correctly showing all other nodes with the right UUIDs, IP addresses, and cert thumbprints.
  3. I've tried various permutations of leaving/joining the cluster with the partitioned nodes.
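For readers repeating check 1, here is a minimal sketch of how the MTU-sized, no-fragmentation vmkping test is usually built. The payload size is the MTU minus 28 bytes (20 for the IPv4 header, 8 for the ICMP header). The interface name `vmk2` and the peer address are placeholders, not values from this post, and the helper functions are hypothetical. The script only prints the commands rather than running them.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
vmkping_payload() {
    echo $(( $1 - 28 ))
}

# Print (do not run) the vmkping command for an interface, MTU and peer.
# -I selects the vmkernel interface, -d sets the don't-fragment bit,
# -s sets the ICMP payload size.
vmkping_cmd() {
    echo "vmkping -I $1 -d -s $(vmkping_payload "$2") $3"
}

# vmk2 and 10.73.11.129 are assumed placeholders for the vSAN vmk
# and a Rack 2 peer; substitute your own values.
vmkping_cmd vmk2 1500 10.73.11.129
vmkping_cmd vmk2 9000 10.73.11.129
```

The printed commands can then be run on each ESXi host against every peer's vSAN address.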

Everything I've found while searching says the issue should be one of the above, but that doesn't appear to be the case. I even added static routes for the vSAN networks, even though I already have override default gateways set on the vSAN vmks. No dice.

I know this is a strange one, but if anyone can point me toward other possible causes of this error, it would be greatly appreciated.

2 Answers

#!/bin/bash
# Stop at the first failed command so the success message is only printed
# when every route was actually added.
set -euo pipefail

# Define the static route parameters
# (the /24 covers the vSAN subnets of both racks)
NETWORK="10.73.11.0/24"
GATEWAY_RACK1="10.73.10.126"
GATEWAY_RACK2="10.73.10.254"

# Apply static route on Rack 1 hosts
for host in "esxi1-rack1" "esxi2-rack1" "esxi3-rack1"; do
    ssh "root@$host" "esxcli network ip route ipv4 add --network $NETWORK --gateway $GATEWAY_RACK1"
done

# Apply static route on Rack 2 hosts
for host in "esxi1-rack2" "esxi2-rack2" "esxi3-rack2"; do
    ssh "root@$host" "esxcli network ip route ipv4 add --network $NETWORK --gateway $GATEWAY_RACK2"
done

echo "Static routes configured successfully."
  • Please explain how this helps the problem. Commented Jun 19 at 18:18

For those who stumble upon this: I found the solution. A router in the network incorrectly had NAT enabled. All vSAN nodes could still ping each other (so no basic connectivity alarms were raised), but because the IP addresses of the Rack 2 devices did not match the IPs in the unicast agent list, those messages were apparently being rejected. That is why vSAN was declaring a network partition.

Here is how I discovered it: I disabled the default gateway for the vSAN interfaces and used static routing instead. Once that was done, ping stopped working. Why? Because no static route was defined for the NATed IP addresses...
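A more direct way to spot this kind of mismatch is to compare the source addresses a host actually receives vSAN traffic from against the addresses it expects. The sketch below is a hypothetical helper, not something from the original troubleshooting. It assumes you have saved the expected peer IPs (for example, copied from `esxcli vsan cluster unicastagent list`) and the observed source IPs (for example, extracted from a packet capture on the vSAN vmk) into two plain files, one address per line. The sample addresses are made up.

```shell
#!/bin/sh
# find_unexpected_sources EXPECTED_FILE OBSERVED_FILE
# Prints each distinct observed source IP that is NOT in the expected list.
# -F treats the patterns as fixed strings and -x requires whole-line matches,
# so 10.73.11.1 does not accidentally match 10.73.11.129.
find_unexpected_sources() {
    sort -u "$2" | grep -v -x -F -f "$1" || true
}

# Demo with made-up data: 192.0.2.10 stands in for a NATed source address.
demo_dir=$(mktemp -d)
printf '10.73.11.1\n10.73.11.129\n' > "$demo_dir/expected"
printf '10.73.11.1\n192.0.2.10\n192.0.2.10\n' > "$demo_dir/observed"
find_unexpected_sources "$demo_dir/expected" "$demo_dir/observed"
rm -rf "$demo_dir"
```

Any address this prints is a peer the host is hearing from under an IP that is not in its expected list, which is what a stray NAT rule would produce.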
