  High Availability With 2 Nodes
Added by Taylor Gautier, last edited by Igal Levy on Jul 11, 2008

UNDER CONSTRUCTION

The information on this page is out of date. A new page is under construction.

It is possible to run Terracotta DSO with only 2 nodes. In this configuration, a minimum of four processes must be running to achieve High Availability with Terracotta DSO.

Terminology

L1 : The Terracotta client. This is where the application resides. Often, this is a server application, but from the Terracotta DSO point of view, it's a client.

L2 : The Terracotta Server. Terracotta servers are JVM processes.

2x2 : This configuration may be referred to as "2x2" as it involves placing one L1 and one L2 together onto one machine, and another L1 and L2 on another machine.

Configuration

The four processes are one L1 and one L2 on one machine, and one L1 and one L2 on the other machine. For the purposes of discussion, we name these processes:

L1A – The L1 process on server A
L2A – The L2 process on server A (At steady state, it is the "Active" L2.)
L1B – The L1 process on server B
L2B – The L2 process on server B

Terracotta Reconnect Window

During a failover procedure, Terracotta DSO expects that all clients previously connected to the cluster will reconnect to the newly active server.

The Reconnect Window setting controls how long Terracotta DSO will wait for all clients to reconnect. If all clients connect to the newly active Terracotta server, cluster operations can proceed immediately. If they do not all connect, cluster operations are paused awaiting the connection of all clients.

If not all clients connect within the Reconnect Window period, the clients that have not reconnected are banned from the cluster.
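
The rule described above can be sketched as a small decision function. This is an illustrative model only, not Terracotta code: the function name resolve_reconnect_window and its parameters are hypothetical, and it assumes every client's reconnect time is measured from the moment the new server becomes active.

```python
from typing import Dict, Set, Tuple


def resolve_reconnect_window(
    expected_clients: Set[str],
    reconnect_times: Dict[str, float],
    window_seconds: float,
) -> Tuple[float, Set[str]]:
    """Model the Reconnect Window on a newly active server (hypothetical helper).

    expected_clients: clients that were connected before the failover.
    reconnect_times:  seconds after failover at which each client reconnected;
                      clients that never reconnect are simply absent.
    Returns (time at which cluster operations resume, set of banned clients).
    """
    on_time = {
        c for c in expected_clients
        if c in reconnect_times and reconnect_times[c] <= window_seconds
    }
    banned = expected_clients - on_time
    if not banned:
        # Everyone came back: operations resume when the last client arrives.
        resume_at = max((reconnect_times[c] for c in on_time), default=0.0)
    else:
        # Someone is missing: the server waits out the full window, then moves on.
        resume_at = window_seconds
    return resume_at, banned


if __name__ == "__main__":
    # L1B never reconnects, so the server waits the whole window and bans it.
    print(resolve_reconnect_window({"L1A", "L1B"}, {"L1A": 5.0}, 120.0))
```

The point of the model is that a single missing client costs the whole cluster the full window, which is why the window should be sized against how quickly clients can actually detect a dead connection.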

Configuring Timeouts Correctly

The cluster consists of L1A, L2A, L1B and L2B, operating in a steady state. L2A is the Active server.

  1. The Ethernet cable on server A is pulled out. L2A notices the disconnect of L2B because of its heartbeat detection.
  2. L2B notices the disconnect of L2A because of its heartbeat detection.
  3. L2A is already active, so it stays active.
  4. L2B has no other active servers nearby, so it elects itself active.
    Note that this has set up a classic split-brain scenario, but since server A is disconnected from the network, this is an expected result of pulling the cable.
  5. L2A can no longer successfully write to L1B, since the connection is severed, but L2A does not know this. It begins to push back on L1A, which eventually stops when all of its available transaction buffers are used.
  6. L1B can no longer successfully write to L2A, since the connection is severed, but L1B does not know this. It too eventually stops when all its transaction buffers are used. Meanwhile, L2B has begun the L1 Reconnect Window.
  7. Eventually, the L1 Reconnect Window on L2B expires. Since no L1s reconnected to it within the configured time period, the cluster moves on with no L1s attached. From this point on, L2B does not accept any L1 from the previous cluster (the one that existed before the cable was pulled), since that L1's state is now out of date with respect to the cluster.
  8. Eventually, L1B notices that its TCP connection is severed, and tries to fail over to L2B. L2B rejects this client because it no longer accepts clients from the previous cluster. In other words, L1B is permanently orphaned.
  9. Eventually, L2A notices that the TCP connection to L1B is severed, and drops L1B from the cluster. L1A can now proceed.

Note that if your operating system drops the interface or the IP address of the interface when the cable is pulled, the events above for L1A and L2A may not happen exactly as described. Those two processes may stop working altogether since prior to the cable being pulled they may have been communicating on the interface/IP address that was dropped.
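
The outcome for L1B in the sequence above depends on which timer fires first: L1B's detection of the dead TCP connection, or the Reconnect Window on L2B. A minimal sketch of that comparison follows. The function name l1b_outcome is hypothetical, and the model assumes that L2B becomes active at time zero and that L1B attempts to fail over the instant it detects the dead connection; heartbeat detection delay between the L2s is ignored.

```python
def l1b_outcome(tcp_detect_seconds: float, reconnect_window_seconds: float) -> str:
    """Classify what happens to L1B after the cable on server A is pulled.

    tcp_detect_seconds:       how long L1B takes to notice that its
                              connection to L2A is dead.
    reconnect_window_seconds: the Reconnect Window that L2B opens when it
                              becomes active.
    Simplified model, not Terracotta code.
    """
    if tcp_detect_seconds <= reconnect_window_seconds:
        # L1B reaches L2B while the window is still open.
        return "rejoined"
    # The window expired first, so L2B rejects L1B (step 8 above).
    return "orphaned"


if __name__ == "__main__":
    # Example values are assumptions chosen for illustration only.
    print(l1b_outcome(tcp_detect_seconds=7200.0, reconnect_window_seconds=120.0))
    print(l1b_outcome(tcp_detect_seconds=30.0, reconnect_window_seconds=120.0))
```

In this model, the scenario walked through above corresponds to the "orphaned" branch, where the TCP timeout is longer than the Reconnect Window.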

Pulling the cable from the passive server (TCP Timeout < Terracotta Reconnect Window)

TBD

Gotchas

Firewall

Both Windows and Mac OS come with firewalls that can block inbound TCP connections. Make sure these are disabled, or are configured to let connections through on port 9530 (the default keepalive port).
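
One way to check whether a firewall is in the way is to attempt a plain TCP connection from one machine to the other. The sketch below uses only the Python standard library; the helper name can_connect is hypothetical, and the default port is simply the 9530 value mentioned above. It only shows that something accepts connections on the port, not that the process listening there is a Terracotta server.

```python
import socket


def can_connect(host: str, port: int = 9530, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # "serverA" is a placeholder hostname; substitute the other machine's name.
    print(can_connect("serverA"))
```

Run it from server B against server A, and again in the other direction, since each firewall filters its own inbound traffic.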

TBD

  • Feature Discussion
    • L1 - L2 Heartbeat
    • Configurable "Quorum"
    • Automatic detection of co-resident L1s with L2

This forum topic describes how to tune TCP keepalive on Mac OS (the procedure ought to be similar on other Unix operating systems):

http://forums.terracotta.org/forums/posts/list/0/633.page

--Orion
