
Recovery Scenarios

The recovery scenarios in the following sections assume the default health check settings from tc.properties:

l2.healthcheck.l1.ping.idletime = 5000
l2.healthcheck.l1.ping.interval = 1000
l2.healthcheck.l1.ping.probes = 3
l2.healthcheck.l1.socketConnectTimeout = 5
l2.healthcheck.l1.socketConnectCount = 10

l2.healthcheck.l2.ping.idletime = 5000
l2.healthcheck.l2.ping.interval = 1000
l2.healthcheck.l2.ping.probes = 3
l2.healthcheck.l2.socketConnectTimeout = 5
l2.healthcheck.l2.socketConnectCount = 10

l1.healthcheck.l2.ping.idletime = 5000
l1.healthcheck.l2.ping.interval = 1000
l1.healthcheck.l2.ping.probes = 3
l1.healthcheck.l2.socketConnectTimeout = 5
l1.healthcheck.l2.socketConnectCount = 13

  1. Default health monitoring parameters in tc.properties
    • L1 - L2 detection of failure
      • 109 secs, if L2 is reachable and L1 can initiate a new socket connection. This allows a maximum of 109 secs GC on L2.
      • 13 secs, if the connectivity to L2 is broken and L1 cannot create a new socket connection to L2.
    • L2 - L2 detection of failure
      • 85 secs, if the other L2 is reachable and a new socket connection to it can be initiated. This applies in both directions (active to passive and passive to active).
      • 13 secs, if the connectivity to the other L2 is broken and a new socket connection to it cannot be created. This also applies in both directions.
    • L2 - L1 detection of failure
      • 85 secs, if L1 is reachable and L2 can initiate a new socket connection. This allows a maximum of 85 secs GC on L1.
      • 13 secs, if the connectivity to L1 is broken and L2 cannot create a new socket connection.
      • If L2 cannot initiate a socket connection during the first connection cycle (for example, because of firewall settings), a socket connection failure message is printed in the server logs and all the L2→L1 health check properties for that client are multiplied by a factor of 10.
  2. Reconnect properties
    • L2 - L1 reconnect parameters
      • l2.l1reconnect.enabled = true (default is false)
      • l2.l1reconnect.timeout.millis = 15000 (default is 5000)
    • L2 - L2 reconnect properties
      • l2.nha.tcgroupcomm.reconnect.enabled = true (default is false)
      • l2.nha.tcgroupcomm.reconnect.timeout = 15000 (default is 2000)
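
The detection times listed under item 1 can be reproduced from the health check settings. The sketch below assumes that the worst case for a reachable peer is the idle time plus socketConnectCount probe cycles, where each cycle is the ping probes plus one socket connect timeout counted in ping intervals, and that an unreachable peer is given up on after a single cycle. That formula is not stated in this section; it is an assumption that matches the 109, 85 and 13 second figures above. The function name and the DEFAULTS table are illustrative only.

```python
def detection_times_ms(idletime, interval, probes, connect_timeout, connect_count):
    """Return (reachable_ms, unreachable_ms) for one health check direction.

    Assumed formula (chosen because it reproduces the documented figures):
      cycle       = probes * interval + connect_timeout * interval
      reachable   = idletime + connect_count * cycle
      unreachable = idletime + cycle
    """
    cycle = probes * interval + connect_timeout * interval
    reachable = idletime + connect_count * cycle
    unreachable = idletime + cycle
    return reachable, unreachable


# (idletime, interval, probes, socketConnectTimeout, socketConnectCount)
# Values copied from the default tc.properties settings listed above.
DEFAULTS = {
    "l2.healthcheck.l1": (5000, 1000, 3, 5, 10),
    "l2.healthcheck.l2": (5000, 1000, 3, 5, 10),
    "l1.healthcheck.l2": (5000, 1000, 3, 5, 13),
}

if __name__ == "__main__":
    for prefix, settings in DEFAULTS.items():
        reachable, unreachable = detection_times_ms(*settings)
        print(f"{prefix}: {reachable // 1000} secs if reachable, "
              f"{unreachable // 1000} secs if unreachable")
```

Changing a value in DEFAULTS shows how a tuned setting would shift the detection window under the same assumed formula; for example, a larger socketConnectCount lengthens only the reachable case.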

With the above parameters set:

  • Max allowed GC at L1 before it is quarantined from the cluster = L2-L1 health monitoring = 85 secs
  • Max allowed GC at passive L2 before it is quarantined by active L2 = L2-L2 health monitoring = 85 secs
  • Max allowed GC at active L2 before:
    • Passive takes over = L2-L2 health monitoring (85 secs) + election time (5 secs) = 90 secs
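    • L1 disconnects from active L2 and tries to connect to another L2 = L1-L2 health monitoring = 109 secs (an L2 paused in GC is still reachable, so the 109 secs case from item 1 applies, not the 13 secs broken-connectivity case)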
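
The L2-side limits above are simple sums of the detection windows and the election time. A minimal sketch of that arithmetic follows, using only the 85 second and 5 second figures quoted in this section; the constant names, dictionary keys and helper functions are hypothetical and not part of any Terracotta API.

```python
# Figures taken from the text above, in seconds.
L2_L1_DETECT_SECS = 85   # L2 -> L1 health monitoring window
L2_L2_DETECT_SECS = 85   # L2 -> L2 health monitoring window
ELECTION_SECS = 5        # election time quoted above


def max_gc_pauses():
    """Return the longest GC pause tolerated before each event occurs."""
    return {
        "l1_quarantined": L2_L1_DETECT_SECS,
        "passive_l2_quarantined": L2_L2_DETECT_SECS,
        "passive_takes_over": L2_L2_DETECT_SECS + ELECTION_SECS,
    }


def pause_exceeds(pause_secs, event):
    """True if a GC pause of pause_secs is longer than the limit for event."""
    return pause_secs > max_gc_pauses()[event]


if __name__ == "__main__":
    for event, limit in max_gc_pauses().items():
        print(f"{event}: {limit} secs")
```

For instance, pause_exceeds(60, "l1_quarantined") is False under these defaults, so a 60 second GC pause on an L1 would stay inside the L2-L1 window.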