Failure Scenarios

This page discusses "grey outages" (degraded operating characteristics) as well as clear-cut "black and white" failures.

Client (L1) Failures

Loss of Terracotta-Client Java PID

Expected Behavior

  • After the PID is lost, the primary Terracotta server (L2) of each mirror group logs 'DSO Server - Connection to [L1 IP:PORT] DISCONNECTED. Health Monitoring for this node is now disabled.'
  • Slowdown / zero TPS at the admin console for 15 seconds (L2-L1 reconnect), as the resources held by the L1 will not be released until then
  • After 15 seconds the Terracotta Server Array ejects the L1 from the cluster and the primary L2 log prints 'shutdownClient() : Removing txns from DB :'
  • Once the L1 is ejected, the admin console no longer shows the failed L1 in the client list and TPS recovers

Monitor

  1. Observe latency in user requests (some requests might have to wait up to 15 seconds).
  2. GC and heap usage at other L1s. Because of application-level backlog, heap usage might increase until the cluster recovers

Observation

L2 Active Log = WARN tc.operator.event - NODE : Server1 Subsystem: CLUSTER_TOPOLOGY Message: Node ClientID[0] left the cluster

When = immediately from the loss of PID

Limit (with default values)= 0 seconds

For Reconnect properties enabled:

L2 Active Log = (same)

When = after [l2.l1reconnect.timeout.millis] from the loss of PID
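The reconnect window is controlled through tc.properties on the server side. A minimal sketch, assuming the standard Terracotta property names and this page's 15-second window (verify the exact defaults shipped with your release):

```properties
# Enable the L2->L1 reconnect window so a briefly disconnected client
# can resume its session instead of being ejected immediately.
l2.l1reconnect.enabled = true
# How long (ms) the server array waits for the L1 to reconnect
# before releasing its resources and ejecting it from the cluster.
l2.l1reconnect.timeout.millis = 15000
```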

Expected Recovery Time

15 secs (L2-L1 reconnect)

Action to be taken

Recycle the client JVM.

Terracotta-Client Host Reboot

Expected Behavior

  • After the PID is lost, the primary Terracotta servers (L2) log 'DSO Server - Connection to [L1 IP:PORT] DISCONNECTED. Health Monitoring for this node is now disabled.'
  • Slowdown / zero TPS at the admin console for 15 seconds (L2-L1 reconnect), as the resources held by the L1 will not be released until then
  • After 15 seconds the Terracotta Server Array ejects the L1 from the cluster and each primary L2 prints 'shutdownClient() : Removing txns from DB :' in its logs.
  • Once the L1 is ejected, the admin console no longer shows the failed L1 in the client list and TPS recovers

Monitor

  1. Observe latency in user requests (some requests might have to wait up to 15 seconds).
  2. GC and heap usage at other L1s. Because of application-level backlog, heap usage might increase until the cluster recovers

Observation

L2 Active Log = WARN tc.operator.event - NODE : Server1 Subsystem: HA Message: Node ClientID[0] left the cluster

When= after ping.idletime + (ping.interval * ping.probes) + ping.interval

Limit (with default values)= 4 - 9 seconds

The limit is a measure of the time within which the process detects the condition. Why is it a limit and not an absolute value?

This is because, when the system encountered the problem, it could have been in one of the two states below.

State 1: All the components were in continuous conversation, and thus Health Monitoring has to factor in the initial ping.idletime before detection of the problem begins.

State 2: All the components were connected to each other, but the application load or communication pattern was such that there was a communication silence > ping.idletime. This means that the system was already doing Health Monitoring in the background, and the cluster was detected healthy at all times before this new problem arrived.

Therefore, it is possible that you may see the detection time as an interval within this limit.

All the expressions from here on show the maximum time detection can take, inclusive of the limiting ping.idletime. To get the lower end of the limit interval, just deduct ping.idletime from the equations.
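The two states above translate directly into the '4 - 9 seconds' limit. A small illustration (Python, with default values assumed from the figures quoted on this page: ping.idletime = 5000 ms, ping.interval = 1000 ms, ping.probes = 3):

```python
# Health-check detection window for the two starting states described above.
# Default values assumed from this page's "4 - 9 seconds" limit.
PING_IDLETIME_MS = 5000   # silence before health monitoring kicks in
PING_INTERVAL_MS = 1000   # gap between successive pings
PING_PROBES = 3           # unanswered pings tolerated before escalation

def detection_ms(include_idletime: bool) -> int:
    """Worst-case time to detect a dead peer.

    State 1 (continuous conversation) pays the full ping.idletime first;
    State 2 (already silent > ping.idletime) skips it, giving the lower
    bound of the limit interval.
    """
    base = PING_INTERVAL_MS * PING_PROBES + PING_INTERVAL_MS
    return (PING_IDLETIME_MS if include_idletime else 0) + base

print(detection_ms(True) / 1000)    # State 1: 9.0 seconds
print(detection_ms(False) / 1000)   # State 2: 4.0 seconds
```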

For Reconnect properties enabled:

L2 Active Log = (same)

When = after [l2.l1reconnect.timeout.millis] from the loss of PID

Expected Recovery Time

15 secs (L2-L1 reconnect)

Action to be taken

Start Client-JVM after machine reboot (On Restart client rejoins the cluster.)

Terracotta-Client Host Extended Power Outage

Expected Behavior

text

Expected Recovery Time

text

Action to be taken

text

L1 Local Disk Full

Expected Behavior

Terracotta code should execute without any impact, except nothing will be logged in log files.

Monitor

Whether application threads are able to proceed, as their ability to write to disk (e.g. logging) will be hampered.

Observation

text

Expected Recovery Time

As soon as disk usage falls back to normal.

Action to be taken

Cleanup local disk to resume Terracotta Client Logging

L1 CPU Pegged

Expected Behavior

  • Slowdown in TPS at the admin console because the L1 will not be able to release resources (e.g. locks) quickly, and the Terracotta Server Array (L2) will take more time to commit the transactions that are to be applied on this L1

  • TPS recovers when CPU returns to normal. Run tests with different intervals of high CPU usage (15s, 30s, 60s, 120s, 300s)

Monitor

  1. Observe latency in user request, some/all of them will be processed slower
  2. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until cluster recovers

Observation

text

Expected Recovery Time

As soon as CPU usage returns back to normal.

Action to be taken

Analyze Root Cause and remedy.

L1 Memory Pegged (Constant GC)

Expected Behavior

Slowdown / zero TPS at the admin console because any resource (e.g. locks) held by the L1 will not be released until GC is over, and the Terracotta server (L2) will not be able to commit transactions that are to be applied on this L1.

Case 1: Full GC cycle < 45 secs - No message in L1/L2 logs. The admin console reflects normal TPS once the L1 recovers from GC.

Case 2: Full GC cycle > 45 secs

  • After 45 secs, L2 health monitoring declares the L1 dead and prints 'INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl.TCGroupManager - L1:PORT is DEAD' in the L2 logs.
  • After 45 secs, the primary L2 ejects the L1 from the cluster and prints 'shutdownClient() : Removing txns from DB :' in the L2 logs.
  • Once the L1 is ejected, the admin console does not show the failed L1 in the client list.
  • If the L1 recovers after 45 secs and tries to reconnect, the L2 refuses all connections and prints this message in the L2 logs: 'INFO com.tc.net.protocol.transport.ServerStackProvider - Client Cannot Reconnect ConnectionID() not found. Connection attempts from the Terracotta client at [L1 IP:PORT] are being rejected by the Terracotta server array. Restart the client to allow it to rejoin the cluster. Many client reconnection failures can be avoided by configuring the Terracotta server array for "permanent-store" and tuning reconnection parameters. For more information, see http://www.terracotta.org/ha'

Monitor

  1. Observe latency in user request - some/all of them will be processed slower
  2. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until the cluster recovers

Observation

L2 Active Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl.TCGroupManager - 127.0.0.1:56735 might be in Long GC. Ping-probe cycles completed since last reply : 1 .... .... INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - localhost:56735 is DEAD

When = Detection in ping.idletime + (ping.interval * ping.probes) + ping.interval

Disconnection in ping.idletime + socketConnectCount * [(ping.interval * ping.probes) + ping.interval]

Limit (with default values) = detection in 4 - 9 seconds, disconnection in 45 seconds
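Plugging this page's assumed defaults into the disconnection expression above (ping.idletime = 5000 ms, ping.interval = 1000 ms, ping.probes = 3, and socketConnectCount = 10, the last derived from the quoted 45-second figure — verify against your tc.properties):

```python
# Disconnection time when the L1 is stuck in a long GC: the server retries
# the full probe cycle socketConnectCount times before declaring it dead.
# Default values assumed from this page's figures.
PING_IDLETIME_MS = 5000
PING_INTERVAL_MS = 1000
PING_PROBES = 3
SOCKET_CONNECT_COUNT = 10   # assumed: yields the quoted 45 s

probe_cycle_ms = PING_INTERVAL_MS * PING_PROBES + PING_INTERVAL_MS
disconnection_ms = PING_IDLETIME_MS + SOCKET_CONNECT_COUNT * probe_cycle_ms
print(disconnection_ms / 1000)  # 45.0 seconds
```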

Expected Recovery Time

Max allowed GC time at L1 = 'L2-L1 health monitoring' = 45 secs.

Action to be taken

Analyze GC issues and remedy.

L1-L2 Connectivity Failure

L1 NIC Failure - Dual NIC Client Host

Expected Behavior

Slowdown / zero TPS at the admin console because any resource (e.g. locks) held by the L1 will not be released, and the Terracotta server (L2) will not be able to commit transactions that are to be applied on this L1.

Case 1: Client host fails over to the standby NIC within 14 seconds - No message in L1/L2 logs. TPS resumes to normal at the admin console as soon as the L1 NIC is restored.

Case 2: Client host fails over to the standby NIC after 14 seconds

  • After 14 secs, Terracotta Server Array health monitoring declares the L1 dead and prints 'INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - L1 IP:PORT is DEAD' in the L2 logs.
  • After 14 secs, the primary L2 ejects the L1 from the cluster and prints 'shutdownClient() : Removing txns from DB :' in the L2 logs.
  • Once the L1 is ejected, the admin console does not show the failed L1 in the client list.
  • If the L1 recovers after 14 secs and tries to reconnect, the L2 does not allow it to reconnect and prints this message in the L2 logs: 'INFO com.tc.net.protocol.transport.ServerStackProvider - Client Cannot Reconnect ConnectionID() not found. Connection attempts from the Terracotta client at [L1 IP:PORT] are being rejected by the Terracotta server array. Restart the client to allow it to rejoin the cluster. Many client reconnection failures can be avoided by configuring the Terracotta server array for "permanent-store" and tuning reconnection parameters. For more information, see http://www.terracotta.org/ha'

Monitor

  1. Observe latency in user request, some/all of them will be processed slower
  2. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until cluster recovers

Observation

L1 Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - Socket Connect to indev1.terracotta.lan:9510(callbackport:9510) taking long time. probably not reachable. [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - indev1.terracotta.lan:9510 is DEAD

L2 Active Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - Socket Connect to pbhardwa.terracotta.lan:52275(callbackport:52274) taking long time. probably not reachable. [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - pbhardwa.terracotta.lan:52275 is DEAD

When = Detection in ping.idletime + (ping.interval * ping.probes) + ping.interval

Disconnection in ping.idletime + (ping.interval * ping.probes + socketConnectTimeout * ping.interval) + ping.interval

Limit (with default values) = detection in 4 - 9 seconds, disconnection in 14 seconds
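The 14-second figure follows from the disconnection expression above with this page's assumed defaults (ping.idletime = 5000 ms, ping.interval = 1000 ms, ping.probes = 3, and socketConnectTimeout = 5, the last derived from the quoted 14-second figure):

```python
# Disconnection time for a silent-peer failure (e.g. NIC down): one extra
# socket-connect verification of socketConnectTimeout ping intervals is
# added to the normal probe cycle. Default values assumed from this page.
PING_IDLETIME_MS = 5000
PING_INTERVAL_MS = 1000
PING_PROBES = 3
SOCKET_CONNECT_TIMEOUT = 5   # assumed: expressed in ping intervals

disconnection_ms = (PING_IDLETIME_MS
                    + PING_INTERVAL_MS * PING_PROBES
                    + SOCKET_CONNECT_TIMEOUT * PING_INTERVAL_MS
                    + PING_INTERVAL_MS)
print(disconnection_ms / 1000)  # 14.0 seconds
```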

Expected Recovery Time

Max allowed NIC recovery time at L1 = 'L2-L1 health monitoring' = 14 secs.

Action to be taken

No action needed immediately. At some point fix failed NIC.

Primary Switch Failure

Expected Behavior

If the switch fails such that the primary L2 (of a mirror group) is unreachable from the hot-standby L2 of the same mirror group and from all L1s:

  • Zero TPS at admin console
  • Max allowed recovery time from switch failure = 'min ((L2-L1 health monitoring (14 secs)), (L1-L2 health monitoring (14 secs)), (L2-L2 health monitoring (14 secs)))' = 14 secs.
  • If failover to the redundant switch occurs within 14 secs, cluster topology remains untouched and TPS resumes to normal at the admin console after switch recovery.
  • If the switch does not fail over within 14 secs, the hot-standby L2 starts an election to become the primary and all L1s disconnect from the primary L2.
  • The hot-standby L2 becomes primary 19 secs (14 secs + 5 secs of election time) after the switch failure.
  • The complete recovery time will be more than 19 secs; the exact time will depend on cluster runtime conditions. Ideally the cluster should recover completely within 25 secs.

If the switch fails such that primary and hot-standby L2 connectivity is intact, while L1 connectivity with the primary L2 is broken:

  • Zero TPS at admin console
  • Max allowed recovery time from switch failure = L2-L2 health monitoring = 14 secs.
  • If failover to the redundant switch occurs within 14 secs, cluster topology remains untouched and TPS resumes to normal at the admin console after switch recovery.
  • If the switch does not fail over within 14 secs, the L2 quarantines all the L1s from the cluster. After switch recovery, all the L1s have to be restarted to make them rejoin the cluster.

Monitor

  1. Observe latency in user requests, as they are not processed until connectivity recovers
  2. GC and heap usage at other L1s. Because of application-level backlog, heap usage might increase until the cluster recovers

Observation

text

Expected Recovery Time

Max allowed recovery time from switch failure = 14 secs; if the hot-standby takes over, the cluster should ideally recover completely within 25 secs.

Action to be taken

No action needed immediately. Restore the switch at a later point.

Primary L2 NIC Failure - Dual NIC Terracotta Server Host

Expected Behavior

Zero TPS at admin console.

Case 1: TC server host fails over to the standby NIC within 14 seconds - TPS resumes on the admin console as soon as NIC recovery happens.

Case 2: TC server host does not fail over to the standby NIC within 14 seconds

  • After 14 secs all L1s disconnect from the primary L2 and try to connect to the hot-standby L2.
  • After 14 secs, the hot-standby L2 starts an election to become primary and prints 'Starting Election to determine cluster wide ACTIVE L2' in its logs.
  • After 19 secs, the hot-standby L2 becomes the primary L2 and prints 'Becoming State[ ACTIVE-COORDINATOR ]' in its logs.
  • Once the hot-standby L2 becomes the primary, all L1s will reconnect to it. The cluster recovers when the new primary's log prints 'Switching GlobalTransactionID Low Water mark provider since all resent transactions are applied' and TPS resumes at the admin console.
  • Once the old primary L2 recovers, it is zapped by the new primary L2.

Monitor

  1. Observe latency in user request as they are not processed until primary/hot-standby L2 recovers
  2. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until TPS recovers

Observation

L1 Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - Socket Connect to indev1.terracotta.lan:9510(callbackport:9510) taking long time. probably not reachable. [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - indev1.terracotta.lan:9510 is DEAD

L2 Passive Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - Socket Connect to pbhardwa.terracotta.lan:52275(callbackport:52274) taking long time. probably not reachable. [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - pbhardwa.terracotta.lan:52275 is DEAD

When = Detection in ping.idletime + (ping.interval * ping.probes) + ping.interval

Disconnection in ping.idletime + (ping.interval * ping.probes + socketConnectTimeout * ping.interval) + ping.interval

Passive becomes active in ping.idletime + (ping.interval * ping.probes + socketConnectTimeout * ping.interval) + ping.interval + Election Time

Limit (with default values) = detection in 4 - 9 seconds, disconnection in 14 seconds, passive takes over in 19 seconds
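The takeover timeline above can be sketched as follows (durations assumed from this page's default figures):

```python
# Failover timeline when the primary L2 becomes unreachable and the
# hot-standby takes over. Durations assumed from this page's defaults.
DETECTION_MAX_S = 9      # health check first suspects the peer (upper bound)
DISCONNECTION_S = 14     # peer declared DEAD; L1s drop the old primary
ELECTION_S = 5           # hot-standby election to ACTIVE-COORDINATOR

takeover_s = DISCONNECTION_S + ELECTION_S   # passive becomes active
print(takeover_s)  # 19 seconds; full recovery is somewhat longer
```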

Expected Recovery Time

Max allowed recovery time = 'min ( (L2-L1 health monitoring (14 secs)), (L1-L2 health monitoring (14 secs)), (L2-L2 health monitoring (14 secs)) )' = 14 secs.

The complete recovery time will be more than 19 secs and exact time will depend on cluster runtime condition. Ideally cluster should recover completely within 25 seconds.

Action to be taken

No action needed immediately. At some point, fix the failed NIC after forcing a failover to the standby Terracotta server.

Terracotta Server (L2) Subsystem failure

Hot-standby L2 Available - Primary L2 Java PID Exits

Expected Behavior

Zero TPS at admin console.

  • After 15 seconds, the hot-standby L2 starts an election to become the primary and prints 'Starting Election to determine cluster wide ACTIVE L2' in its logs.
  • All L1s disconnect from the primary L2 after 15 secs and connect to the old hot-standby L2 when it becomes primary.
  • After 20 secs the hot-standby becomes primary and prints 'Becoming State[ ACTIVE-COORDINATOR ]' in its logs.
  • Once the hot-standby L2 becomes primary, all L1s will reconnect to it. The cluster recovers when the new primary's log prints 'Switching GlobalTransactionID Low Water mark provider since all resent transactions are applied' and TPS resumes at the admin console.

Monitor

  1. Observe latency in user requests; some/all of them will not complete until the L2-L2 reconnect interval elapses
  2. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until TPS recovers

Observation

L2 Passive Log = WARN tc.operator.event - NODE : Server1 Subsystem: CLUSTER_TOPOLOGY Message: Node Server2 left the cluster

When = immediately when the PID exits

L1 Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Client - Connection to [localhost:8510] DISCONNECTED. Health Monitoring for this node is now disabled.

When = immediately when the PID exits

Limit = detection is immediate; the L2 passive takes over as active after the Election Time (default = 5 seconds)

For Reconnect properties enabled:

L2 Passive Log = (same)

When = after [l2.nha.tcgroupcomm.reconnect.timeout] from the loss of PID
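The L2-L2 reconnect window named above is likewise set through tc.properties. A minimal sketch, assuming the property names referenced on this page and a 15-second window (verify the defaults for your release):

```properties
# Enable the L2<->L2 reconnect window so a briefly unreachable server
# is not immediately quarantined by its mirror-group peer.
l2.nha.tcgroupcomm.reconnect.enabled = true
# How long (ms) a peer L2 waits for reconnection before treating the
# disconnect as a node departure.
l2.nha.tcgroupcomm.reconnect.timeout = 15000
```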

Expected Recovery Time

The complete recovery time will be more than 20 secs (L2-L2 reconnect + Election time) and exact time will depend on cluster runtime condition. Ideally cluster should recover completely within 25 seconds.

Action to be taken

No action needed immediately (given failover). Restart L2 (it will now become the hot-standby).

Hot-standby L2 Available - Primary L2 Host Reboot

Expected Behavior

Clients fail over to Hot-standby L2, which then becomes primary. Once Primary L2 comes back (i.e. is restarted after the machine reboot sequence), it will join the cluster as hot-standby.

Observation

text

Expected Recovery Time

Same as F9

Action to be taken

No action needed immediately. Restart L2 after reboot (it will now become the hot-standby)

Hot-standby L2 Available - Primary Host Unavailable: Extended Power Outage

Expected Behavior

Zero TPS at admin console.

After 14 seconds, the hot-standby starts an election to become primary and prints 'Starting Election to determine cluster wide ACTIVE L2' in its logs. All L1s disconnect from the primary L2 after 14 secs and connect to the old hot-standby L2 when it becomes primary.

After 19 secs the hot-standby becomes primary and prints 'Becoming State[ ACTIVE-COORDINATOR ]' in its logs. Once the hot-standby L2 becomes primary, all L1s will reconnect to it. The cluster recovers when the new primary's log prints 'Switching GlobalTransactionID Low Water mark provider since all resent transactions are applied' and TPS resumes at the admin console.

Monitor

  1. Observe latency in user request
  2. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until TPS recovers

Observation

text

Expected Recovery Time

The L1 will detect the failure after (L1-L2 health monitoring (14 secs)). The hot-standby L2 will detect the failure after (L2-L2 health monitoring (14 secs)). The complete recovery time will be more than 19 secs (L2-L2 health monitoring (14 secs) + election time (5 secs)); the exact time will depend on cluster runtime conditions. Ideally the cluster should recover completely within 25 seconds.

Action to be taken

Same as F10 (a)

Primary L2 Local Disk Full

Expected Behavior

Same as F9

Observation

text

Expected Recovery Time

15 secs.

Action to be taken

No action needed immediately (given failover to Hot-standby L2) Clean up disk and restart services (this L2 will now be hot-standby)

Primary L2 - CPU Pegged at 100%

Expected Behavior

Slowdown in TPS at the admin console because the L2 will take more time to process transactions. TPS recovers when CPU returns to normal. Run tests with different intervals of high CPU usage (15s, 30s, 60s, 120s, 300s)

Monitor

  1. Observe latency in user requests, as they are processed slower until the CPU recovers
  2. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until cluster recovers

Observation

text

Expected Recovery Time

As soon as CPU usage returns back to normal

Action to be taken

Root-cause analysis and fix (Thread dumps needed if escalated to TC)

Primary L2 Memory Pegged - Excessive GC or I/O (Host Reachable but PID Unresponsive)

Expected Behavior

Zero TPS at admin console as primary L2 cannot process any transaction.

Case 1: GC cycle < 45 secs - The L1 and hot-standby L2 logs will display 'WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - L2 might be in Long GC. GC count since last ping reply :' if the L2 is in GC for more than 9 secs. - TPS returns to normal at the admin console as soon as the primary L2 recovers from GC.

Case 2: GC cycle > 45 secs

  • After 45 secs, the hot-standby L2 declares the primary L2 dead and prints '[HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - L2 IP:PORT is DEAD' in its logs.
  • After 45 secs, the hot-standby starts an election to become primary and prints 'Starting Election to determine cluster wide ACTIVE L2' in its logs.
  • After 50 secs the hot-standby becomes primary and prints 'Becoming State[ ACTIVE-COORDINATOR ]' in its logs.
  • After 57 secs, all L1s declare the old primary L2 dead and print '[HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - L2 IP:PORT is DEAD' in their logs.
  • After 57 secs, all L1s disconnect from the old primary L2 and try to connect to the old hot-standby L2 (which should have become primary by now).
  • Once the hot-standby L2 becomes primary, all L1s will reconnect to it. The cluster recovers when the new primary's log prints 'Switching GlobalTransactionID Low Water mark provider since all resent transactions are applied'.
  • Once the old primary L2 recovers from GC, it is zapped by the new primary L2.

Monitor

  1. Observe latency in user request as none of them is processed until primary L2 recovers from GC or hot-standby L2 takes over
  2. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until cluster recovers

Observation

L1 Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 might be in Long GC. GC count since last ping reply : 1 ... ... ... But its too long. No more retries [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 is DEAD

When = Detection in ping.idletime + l1.healthcheck.l2.ping.probes * ping.interval + ping.interval

Dead in ping.idletime + l1.healthcheck.l2.socketConnectCount * (l1.healthcheck.l2.ping.probes * ping.interval + ping.interval)

Limit = detect in 5 - 9 seconds, dead in 57 seconds

L2 Passive Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 might be in Long GC. GC count since last ping reply : 1 ... ... ... But its too long. No more retries [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 is DEAD

When = Detection in ping.idletime + l2.healthcheck.l2.ping.probes * ping.interval + ping.interval

Dead in ping.idletime + l2.healthcheck.l2.socketConnectCount * (l2.healthcheck.l2.ping.probes * ping.interval + ping.interval)

Limit = detect in 5 - 9 seconds, dead in 45 seconds

The L2 passive takes over as active after Dead Time + Election Time.
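The two 'Dead' expressions above differ only in socketConnectCount, which is why the L1 (57 secs) lags the passive L2 (45 secs) in declaring the old primary dead. A sketch with counts derived from this page's figures (13 for l1.healthcheck.l2 and 10 for l2.healthcheck.l2 are assumptions — verify against your tc.properties):

```python
# Why the L1 declares the old primary dead later (57 s) than the passive
# L2 does (45 s): same formula, different socketConnectCount defaults.
# All values assumed from the figures quoted on this page.
PING_IDLETIME_MS = 5000
PING_INTERVAL_MS = 1000
PING_PROBES = 3

def dead_time_ms(socket_connect_count: int) -> int:
    """ping.idletime + count * (probes * interval + interval)."""
    cycle = PING_PROBES * PING_INTERVAL_MS + PING_INTERVAL_MS
    return PING_IDLETIME_MS + socket_connect_count * cycle

print(dead_time_ms(13) / 1000)  # l1.healthcheck.l2: 57.0 seconds
print(dead_time_ms(10) / 1000)  # l2.healthcheck.l2: 45.0 seconds
```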

Expected Recovery Time

Max allowed GC time = 'min ((L1-L2 health monitoring (57 secs)), (L2-L2 health monitoring (45 secs)))' = 45 secs. The max complete recovery time will be more than 57 secs; the exact time will depend on cluster runtime conditions. Ideally the cluster should recover completely within 65 seconds.

Action to be taken

Root cause analysis to avoid this situation (e.g. more Heap, GC Tuning, etc. based on what the root-cause analysis dictates).

Primary L2 Available - Hot-standby L2 PID Unresponsive

Expected Behavior

Slow/zero TPS at admin console, as the primary L2 cannot commit transactions to the hot-standby L2. The primary L2 prints 'Connection to [Passive L2 IP:PORT] DISCONNECTED. Health Monitoring for this node is now disabled.' in its logs as soon as the hot-standby L2 fails. After 15 secs, the primary L2 quarantines the hot-standby L2 from the cluster, prints 'NodeID[Passive L2 IP:PORT] left the cluster' in its logs, and TPS returns to normal at the admin console.

Monitor

  1. Once the hot-standby is recycled, it rejoins the cluster. Monitor the time the hot-standby L2 takes to move to the PASSIVE-STANDBY state. Until the hot-standby L2 moves to PASSIVE-STANDBY, the cluster has a single point of failure, as the hot-standby L2 cannot take over in case the primary L2 fails.
  2. Observe latency in user request (some/all of the requests might have to wait until L2-L2 reconnect interval).
  3. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until L2-L2 reconnect interval.

Observation

L2 Active Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 might be in Long GC. GC count since last ping reply : 1 ... ... ... But its too long. No more retries [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 is DEAD

When = Detection in ping.idletime + l2.healthcheck.l2.ping.probes * ping.interval + ping.interval

Dead in ping.idletime + l2.healthcheck.l2.socketConnectCount * (l2.healthcheck.l2.ping.probes * ping.interval + ping.interval)

Limit = detect in 5 - 9 seconds, dead in 45 seconds

Expected Recovery Time

Recovery Time = [L2-L2 Reconnect] = 15 secs

Action to be taken

Restart the hot-standby L2 (blow away the dirty BDB database before restart) in case of PID failure / host failure.

Primary L2 Available - Hot-standby L2 Host Failure

Expected Behavior

Slow/zero TPS at admin console, as the primary L2 cannot commit transactions to the hot-standby L2. After 14 secs, the primary L2 prints 'Connection to [Passive L2 IP:PORT] DISCONNECTED. Health Monitoring for this node is now disabled.' in its logs. After 14 secs, the primary L2 quarantines the hot-standby L2 from the cluster, prints 'NodeID [Passive L2 IP:PORT] left the cluster' in its logs, and TPS returns to normal at the admin console.

Monitor

  1. Once the hot-standby is recycled, it rejoins the cluster. Monitor the time the hot-standby L2 takes to move to the PASSIVE-STANDBY state. Until the hot-standby L2 moves to PASSIVE-STANDBY, the cluster has a single point of failure, as the hot-standby L2 cannot take over in case the primary L2 fails.
  2. Observe latency in user request (some/all of the requests might have to wait until L2-L2 reconnect interval).
  3. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until L2-L2 reconnect interval.

Observation

text

Expected Recovery Time

Recovery Time = [L2-L2 health monitoring] = 14 secs

Action to be taken

Restart the hot-standby L2 (blow away the dirty BDB database before restart; not needed for 2.7.x or above) in case of PID failure / host failure.

Primary L2 Available - Hot-standby L2 NIC Failure (Dual NIC Host)

Expected Behavior

Slow/zero TPS at admin console, as the primary L2 cannot commit transactions to the hot-standby L2.

Case 1: The hot-standby L2 host fails over to the secondary NIC within 14 secs - No impact on cluster topology. TPS at the admin console resumes as soon as the NIC is restored at the hot-standby L2.

Case 2: The hot-standby host does not fail over to the standby NIC within 14 secs

  • After 14 secs the primary L2 prints 'Connection to [indev2.terracotta.lan:46133] DISCONNECTED. Health Monitoring for this node is now disabled.'
  • After 14 secs, the primary L2 quarantines the hot-standby L2 from the cluster, prints 'NodeID[Passive L2 IP:PORT] left the cluster' in its logs, and TPS returns to normal at the admin console.

Monitor

  1. Once the hot-standby is recycled, it rejoins the cluster. Monitor the time the hot-standby L2 takes to move to the PASSIVE-STANDBY state. Until the hot-standby L2 moves to PASSIVE-STANDBY, the cluster has a single point of failure, as the hot-standby L2 cannot take over in case the primary L2 fails.
  2. Observe latency in user request (some/all of the requests might have to wait until L2-L2 reconnect interval).
  3. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until L2-L2 reconnect interval.

Observation

L2 Active Log = INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - Socket Connect to indev1.terracotta.lan:8530(callbackport:8530) taking long time. probably not reachable. INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - indev1.terracotta.lan:8530 is DEAD

When = Detection in ping.idletime + ping.probes * ping.interval + ping.interval

Dead in ping.idletime + ping.probes * ping.interval + l2.healthcheck.l2.socketConnectTimeout * ping.interval

Limit = 9 - 14 seconds (with default values)

Expected Recovery Time

Recovery Time = [L2-L2 health monitoring] = 14 secs

Action to be taken

If quarantined from the cluster, restart the hot-standby L2 (blow away the dirty BDB database before restart; not needed for 2.7.x or above) in case of PID failure / host failure.

Primary L2 Available - Hot-standby L2 "Gray" Issue (CPU High)

Expected Behavior

Slow TPS at the admin console as the primary L2 takes more time to commit transactions at the hot-standby L2. TPS recovers when CPU returns to normal. Run tests with different intervals of high CPU usage (15s, 30s, 60s, 120s, 300s)

Monitor

  1. Observe latency in user request
  2. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until cluster recovers

Observation

text

Expected Recovery Time

Recovers as soon as CPU becomes normal

Action to be taken

Analyze root-cause and resolve high-CPU issue at Hot-standby L2.

Primary L2 Available - Hot-standby L2 "Gray" Issue (Memory Pegged)

Expected Behavior

Slow/zero TPS at admin console, as the primary L2 can commit transactions locally but cannot commit them at the hot-standby L2.

Case 1: GC cycle < 45 secs - The primary L2 log will display 'WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - L2 might be in Long GC. GC count since last ping reply :' if the L2 is in GC for more than 9 secs. - TPS returns to normal at the admin console as soon as the hot-standby L2 recovers from GC.

Case 2: GC cycle > 45 secs - After 45 secs, primary L2 health monitoring declares the hot-standby L2 dead. - The primary L2 prints '[HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - [Passive L2 IP:PORT] is DEAD' in its logs. - After 45 seconds, the primary L2 quarantines the hot-standby L2 from the cluster, prints 'NodeID[Passive L2 IP:PORT] left the cluster' in its logs, and TPS returns to normal at the admin console.

Monitor

  1. Once the hot-standby is recycled, it rejoins the cluster. Monitor the time the hot-standby L2 takes to move to the PASSIVE-STANDBY state. Until the hot-standby L2 moves to PASSIVE-STANDBY, the cluster has a single point of failure, as the hot-standby L2 cannot take over in case the primary L2 fails.
  2. Monitor the backlog at the application queue and observe the backlog recovery time. Ensure the backlog recovery time is within the application's acceptable range
  3. Observe latency in user request
  4. GC and Heap usage at other L1s. Because of application level backlog, heap usage might increase until cluster recovers

Observation

L2 Active Log = WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 might be in Long GC. GC count since last ping reply : 1 ... ... ... But its too long. No more retries [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - localhost:9510 is DEAD

When = Detection in ping.idletime + l2.healthcheck.l2.ping.probes * ping.interval + ping.interval

Dead in ping.idletime + l2.healthcheck.l2.socketConnectCount * (l2.healthcheck.l2.ping.probes * ping.interval + ping.interval)

Limit = detect in 5 - 9 seconds, dead in 45 seconds

Expected Recovery Time

Max Recovery Time = [L2-L2 health monitoring] = 45 secs

Action to be taken

Root-cause analysis and fix needed for the hot-standby L2's memory getting pegged (the actual action varies considerably in this case, depending on the symptoms and the analysis).

Primary L2 Available - Hot-standby L2 "Grey" Issue (Disk Full)

Expected Behavior

Same as F14.

Observation

text

Expected Recovery Time

Same as F14.

Action to be taken

The hot-standby process dies with BDB errors. Restart the hot-standby L2.

Primary and Hot-standby L2 Failure

Expected Behavior

All application threads that need a DSO lock from the Terracotta server, or that are writing "Terracotta transactions" with the transaction buffer full, will block. Once the mirror group(s) is restored, all the L1s connected to it before the failure will reconnect and normal application activity resumes.

Observation

text

Expected Recovery Time

Depends on when the Terracotta Server Array is recycled.

Action to be taken

Not designed for N+1 failure. Restart mirror group(s) primary and hot-standby after collecting artifacts for root-cause analysis.

Other Failures

Data Center Failure

Expected Behavior

Complete cluster outage. The Terracotta Server Array is not designed to survive the loss of both the primary and the hot-standby L2 of a mirror group (N+1 failure).

Observation

text

Expected Recovery Time

Minutes

Action to be taken

Once Data-Center is restored, restart Terracotta Server Array. Then restart L1 Nodes. Cluster state will be restored to point of outage.