Recovery Scenarios

The recovery scenarios in the following sections assume the default health check settings from tc.properties:

l2.healthcheck.l1.ping.idletime = 5000
l2.healthcheck.l1.ping.interval = 1000
l2.healthcheck.l1.ping.probes = 3
l2.healthcheck.l1.socketConnectTimeout = 5
l2.healthcheck.l1.socketConnectCount = 10

l2.healthcheck.l2.ping.idletime = 5000
l2.healthcheck.l2.ping.interval = 1000
l2.healthcheck.l2.ping.probes = 3
l2.healthcheck.l2.socketConnectTimeout = 5
l2.healthcheck.l2.socketConnectCount = 10

l1.healthcheck.l2.ping.idletime = 5000
l1.healthcheck.l2.ping.interval = 1000
l1.healthcheck.l2.ping.probes = 3
l1.healthcheck.l2.socketConnectTimeout = 5
l1.healthcheck.l2.socketConnectCount = 13

Failure detection times with the default health monitoring parameters in tc.properties:
- L1 - L2 detection of failure
  - 109 secs, if L2 is reachable and L1 can initiate a new socket connection. This allows a maximum of 109 secs of GC on L2.
  - 13 secs, if connectivity to L2 is broken and L1 cannot create a new socket connection to L2.
- L2 - L2 detection of failure
  - 85 secs, if the active (passive) L2 is reachable and the passive (active) L2 can initiate a new socket connection.
  - 13 secs, if connectivity to the active (passive) L2 is broken and the passive (active) L2 cannot create a new socket connection to it.
- L2 - L1 detection of failure
  - 85 secs, if L1 is reachable and L2 can initiate a new socket connection. This allows a maximum of 85 secs GC on L1.
  - 13 secs, if the connectivity to L1 is broken and L2 cannot create a new socket connection.
  - If L2 cannot initiate a socket connection during the first connection cycle (for example, because of firewall settings), a socket connection failure message is printed in the server logs and all the L2→L1 health check properties for this particular client are multiplied by a factor of 10.
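
The detection times listed above can be reproduced from the property values with a short calculation. The sketch below is an illustration, not Terracotta code: the `HealthCheck` class and its method names are hypothetical, and it assumes that socketConnectTimeout acts as a multiplier of ping.interval and that the probe cycle repeats socketConnectCount times while the peer still accepts socket connections. Under those assumptions the arithmetic matches the 85, 109 and 13 sec figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    """Hypothetical holder for one group of healthcheck properties."""
    idletime_ms: int             # ping.idletime
    interval_ms: int             # ping.interval
    probes: int                  # ping.probes
    socket_connect_timeout: int  # socketConnectTimeout (assumed: multiplier of ping.interval)
    socket_connect_count: int    # socketConnectCount

    def cycle_ms(self) -> int:
        # One cycle: all ping probes go unanswered, then the socket connect window elapses.
        return (self.interval_ms * self.probes
                + self.socket_connect_timeout * self.interval_ms)

    def reachable_ms(self) -> int:
        # Peer still accepts new socket connections (for example, during a long GC):
        # the cycle is assumed to repeat socketConnectCount times.
        return self.idletime_ms + self.socket_connect_count * self.cycle_ms()

    def unreachable_ms(self) -> int:
        # Connectivity is broken: the first socket connect attempt fails,
        # so only one cycle is spent.
        return self.idletime_ms + self.cycle_ms()


# Values copied from the tc.properties defaults listed earlier.
L2_TO_L1 = HealthCheck(5000, 1000, 3, 5, 10)  # l2.healthcheck.l1.*
L2_TO_L2 = HealthCheck(5000, 1000, 3, 5, 10)  # l2.healthcheck.l2.*
L1_TO_L2 = HealthCheck(5000, 1000, 3, 5, 13)  # l1.healthcheck.l2.*

for name, hc in (("L2 - L1", L2_TO_L1), ("L2 - L2", L2_TO_L2), ("L1 - L2", L1_TO_L2)):
    print(name, hc.reachable_ms() // 1000, "secs reachable,",
          hc.unreachable_ms() // 1000, "secs unreachable")
```

The only difference between the three groups is socketConnectCount, which is why L1 - L2 detection takes 109 secs in the reachable case while the other two take 85 secs.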
Reconnect properties
- L2 - L1 reconnect properties
  - l2.l1reconnect.enabled = true (default is false)
  - l2.l1reconnect.timeout.millis = 15000 (default is 5000)
- L2 - L2 reconnect properties
  - l2.nha.tcgroupcomm.reconnect.enabled = true (default is false)
  - l2.nha.tcgroupcomm.reconnect.timeout = 15000 (default is 2000)

With the above parameters set:

Max allowed GC at L1 before it is quarantined from the cluster = L2-L1 health monitoring = 85 secs
Max allowed GC at passive L2 before it is quarantined by active L2 = L2-L2 health monitoring = 85 secs
Max allowed GC at active L2 before:
- the passive takes over = L2-L2 health monitoring (85 secs) + election time (5 secs) = 90 secs
- L1 disconnects from the active L2 and tries to connect to another L2 = L1-L2 health monitoring = 109 secs (a GC pause leaves the L2 reachable, so the reachable-case figure applies)
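
As a quick check of the failover arithmetic, the sketch below adds the election time to the L2-L2 detection window and tests whether a given GC pause stays within a limit. The function names are hypothetical, the constants are the figures quoted in this section, and treating the limit as inclusive is an assumption made for illustration.

```python
# Figures quoted in this section, in seconds.
L2_TO_L1_DETECTION_SECS = 85  # active L2 monitoring an L1
L2_TO_L2_DETECTION_SECS = 85  # one L2 monitoring the other L2
ELECTION_TIME_SECS = 5        # election time quoted in this section


def passive_takeover_secs(detection_secs: int = L2_TO_L2_DETECTION_SECS,
                          election_secs: int = ELECTION_TIME_SECS) -> int:
    """Time before the passive takes over: failure detection plus election."""
    return detection_secs + election_secs


def gc_pause_tolerated(pause_secs: float, limit_secs: float) -> bool:
    """True if a GC pause stays within the limit (assumed inclusive)."""
    return pause_secs <= limit_secs


print(passive_takeover_secs())                        # 85 + 5
print(gc_pause_tolerated(60, L2_TO_L1_DETECTION_SECS))
print(gc_pause_tolerated(95, passive_takeover_secs()))
```

A 60 sec pause on an L1 stays inside the 85 sec window, while a 95 sec pause on the active L2 exceeds the 90 sec takeover threshold.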