I've been dealing with instability here, but I think I've got a handle
on it. I'll see how the next 24-48 hours go after having taken
everything offline for the past 12 hours. The DB on my primary node got
corrupted once again. I've changed the SCSI controller my VMs were
using, enabled jumbo frames on my iSCSI subnet, and increased the queue
depth on my HBA. Together these seem to have stabilized the IO issues,
and performance has so far been better than it was. Although the iSCSI
traffic wasn't saturating the bandwidth, I did find that giving each
SKS node its own iSCSI LUN has helped with the queue depth contention
I was observing.
All that said, it may take a little while for my cluster to get back up
to speed. The key dump I imported had only 5461780 keys, while my
secondary nodes still had 5464961 (3181 more) before the primary fell
over and went offline. Hopefully the cluster will be able to sync up
quickly without putting any strain on my peers.
-----BEGIN PGP SIGNATURE-----