Open Source

BigMemory WAN Replication Operations

Logging
- Synchronization
- Incremental Updates
WAN Cleanup Utility
- Procedure to convert caches
Synchronizing Replica Caches
- Bootstrapping a new cache
- Recovering after failure
Fault Tolerance
- Automatic Recovery Scenarios
- Manual Recovery

Logging

BigMemory WAN Replication provides logging with messages that are easily parsable by third-party log watchers or scrapers. Logging includes:

Topology changes
Regularly print stats
Number of mutations.
Message Size
Data conflicts and repairs
Failover/recovery status/progress

Significant WAN replication log messages regard synchronization and incremental updates. The messages below provide important markers in the WAN replication logs.

Synchronization

Master Side

Log message issued when the Master cache is sending a synchronization batch to the Replica cache:
- INFO [master-0] FullScanMasterSynchronizer - Master 'localhost:9001' initiated sync protocol for replica 'localhost:9003' and cache '__tc_clustered-ehcache|CacheManager|test-cache-1'
- INFO [master-sync-13] FullScanMasterSynchronizer - Master 'localhost:9001' sent SYNC_UPDATE request with 7 events to replica 'localhost:9002' for cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1'
Log message posting the total number of SYNC_UPDATEs that should be processed by the Replica:
- INFO [New I/O worker #1] FullScanMasterSynchronizer - Total number of SYNC_UPDATE requests submitted to replica 'localhost:9002' is 184
Log message issued once the synchronization between Master and Replica has completed:
- INFO [New I/O worker #4] FullScanMasterSynchronizer - Master 'localhost:9001' successfully synchronized replica 'localhost:9002' for cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache3'

Replica Side

Log message issued when the Replica cache has started synchronzing with the Master cache:
- INFO [replica-sync-4] FullScanReplicaSynchronizer - Replica got SYNC_START request from master 'localhost:9001' for cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache3'
Log message issued when the Replica cache has processed a batch of synchronization updates from the Master cache:
- INFO [replica-sync-2] FullScanReplicaSynchronizer - Replica got SYNC_UPDATE request from master 'localhost:9001' for cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1' with 8 keys
Log message issued when the Replica cache has completed synchronization with Master cache, but the Replica is not yet activated:
- INFO [replica-sync-5] FullScanReplicaSynchronizer - Replica got SYNC_END request from master 'localhost:9001' for cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache3'
Log message issued when the Replica cache is successfully activated after synchronization:
- INFO [New I/O worker #1] ReplicaCache - Replica 'localhost:9002' activated cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache3' by request from master 'localhost:9001'

Incremental Updates

Log messages for incremental updates are for Bidirectional mode only. Note that watermarks are for batches of updates, which have been batched as part of the internal WAN process.

Log messages issued when the Master is receiving acknowledgements of a Replica successfully storing a batch of updates in the TSA:
- INFO [master-0] UnitReplicator - Replica cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1@localhost:9002' was at watermark '1131', is now at '1199'
- INFO [master-0] MasterCache - Lowest watermark across all replicas for cache 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1' is now '1199'
Log messages issued when a Replica has succesfully acklowledged the storage of the updates in the TSA to the Master cache:
- INFO [New I/O worker #4] ReplicaCache - Replica 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1@localhost:9002' successfully acknowledged watermark 1131 with master 'localhost:9001'
- INFO [New I/O worker #4] ReplicaCache - Replica 'tc_clustered-ehcache|__DEFAULT|wan-test-cache1@localhost:9002' successfully acknowledged watermark 1199 with master 'localhost:9001'

WAN Cleanup Utility

If a cache that was registered with a TSA subsequently becomes inactive, then when the cache becomes active again, the TSA will attempt to verify that it has the same configuration. Therefore, in order to change the WAN enabled/disabled status of caches that have already been registered in a TSA, the existing WAN information will need to be removed. The cleanup-wan-metadata utility, found in the /server/bin directory of the BigMemory kit, is provided for this purpose.

To run the script:

cleanup-wan-metadata.sh  –f "configLocation"

The script requires the configLocation argument, which is the URI location to either a wan_config.xml file or an ehcache.xml file.

If the configLocation specifies an ehcache.xml, the script performs cleanup all the caches mentioned in it. If the configLocation specifies a wan_config.xml, the script will look for all the ehcache.xml locations listed in it and cleanup their caches.

The cleanup is performed for every cache, whether WAN-enabled or not. This is useful when you want to convert a WAN-enabled cache to non-WAN cache, or a non-WAN cache to a WAN-enabled cache.

Note that the script does not modify ehcache.xml or wan-config.xml files in any way. It simply removes WAN-related information from Terracotta servers.

Procedure to convert caches

In a wan-enabled system, if you need to EITHER convert a wan-enabled cache to a non-wan cache OR convert a non-wan cache to a wan-enabled-cache, follow the below steps:

Bring down all the clients and Orchestrator.
Modify the wan-config.xml and the ehcache.xml files as per the new configuration requirement.
Run the cleanup-wan-metadata script with the new configuration as the parameter.
Start the Orchestrator and clients again with the new configuration files.

Synchronizing Replica Caches

The Orchestrator activates a Replica cache when it has successfully synchronized the state of the cache with the Master cache. A Master cache can be immediately activated by an Orchestrator, as there is no synchronization to perform. Orchestrators managing Master caches perform continuous health-checking to verify the active status of the Replica caches managed by the other Orchestrators.

There are two main cases when the state of a Replica cache needs to be resynchronized by the Master cache: bootstrapping a new cache and recovering after a failure.

Bootstrapping a new cache

Perform these steps when starting a cache for the first time, or after the cache was fully cleared.

Ensure the Replica cache configuration is the same as the Master cache.
Ensure the WAN configuration for the Replica cache is valid:
- Check that there is a valid list of Master caches.
- For a given cache, the WAN configuration should be identical across all regions.
Ensure the TSA is running.
Start the Orchestrator.

On startup, the new Replica cache will be inactive while synchronizing (clients cannot use the cache). In this mode, it is receiving incremental updates and synchronizing the full state of the cache. Once fully synchronized, the Replica cache will be active.

Recovering after failure

BigMemory WAN Replication is built with fault tolerance features to automatically handle short-term failures and most longer-term failures.

When a replica reconnects with the master, its behavior is governed by the Orchestrator configuration parameter replicaDisconnectBehaviorType. If this parameter is set to:

remainDisconnected, the replica will remain disconnected from the master, and will continue to operate offline.
reconnectResync, the replica will reconnect to the master and will resync the contents of its cache. In this case, all local changes on the replica region will be dropped in favor of whatever is in the master region.

Fault Tolerance

Recovery from most failures is accomplished automatically by the WAN replication service, however some scenarios require user intervention.

Automatic Recovery Scenarios

Orchestrator Failover

It is recommended to run multiple Orchestrators in a region, either on the same machine or rack, or in different racks to ensure availability. Although only one master Orchestrator is mandatory, you can start one or more standby Orchestrators at any time during application run time to provide failover protection.

Standby Orchestrators are passive, so running extra Orchestrator processes will have minimal runtime impact under normal operations. Upon failure, the other regions will look for the next available Orchestrator and resume replication(there will be a full sync for replica caches after master failover).

Master-Replica Disconnection

When a Master cache fails, control is given to the failover Master (if any) listed in wan-config.xml. A Replica cache will not take over as a Master. If there is no failover Master, the Replicas will continue to operate in isolation (i.e., no replication will take place). When the Master re-starts, the behavior of its Replicas is governed by the replicaDisconnectBehavior property in wan-config.xml. By default, the Replicas will attempt to reconnect to the Master and to the failover Master listed in wan-config.xml. Upon reconnection to a Master, the Replicas are deactivated, cleared, resynchronized to the Master cache, and then reactivated. All local changes on the Replica region will be dropped in favor of whatever is in the Master region. (Similarly, even if the Master does not fail but its Replicas become disconnected from their Master, their behavior is also controlled by the replicaDisconnectBehavior property.) For more information, refer to Orchestrator Configuration Parameters.

If no failover Master is listed in wan-config.xml and the master is lost, the operator has the option to restart a Replica to act as a Master, as described below.

In a bi-directional configuration without a failover Master, you should choose your most important Replica region to take over as the Master because any changes that occurred since the Master failed will be lost in all other regions. To do this, change your wan-config.xml to reflect that the Replica is now the Master, and then restart the Replica. It would be a good idea to remove or comment out the old Master in case it comes back.

In a uni-directional configuration without a failover Master, since the "writes" are only performed in one region, you will not lose any changes. You may simply designate that Replica region as the new Master.

TSA Disconnection

Upon any disconnection from its TSA, the Orchestrator deactivates and waits the amount of time specified by the l2.l1reconnect.timeout.millis property. This property is described in the Automatic Client Reconnect section.

When communication between the TSA and the Orchestrator is resumed:

If the downtime was shorter than the configured reconnect timeout, then WAN replication will resume immediately, without the need for any resynchronization.
If the downtime exceeded the configured reconnect timeout, then the Orchestrator will resynchronize all of its Master caches, and subsequently those Master caches will resynchronize all of their Replica caches.

Cache Recovery Operations

Unidirectional cache
- In master region, cache operations remains operational even if local orchestrator went down
- In replica region, cache operations remains operational even if local orchestrator went down but replica region won’t receive any updates from the master region. Once orchestrator comes up, it will connect to master orchestrator and deactivates the cache (cache operations blocked) and does a full sync
- Only if replicaDisconnectBehavior = reconnectResync (default ).
Bidirectional cache
- In master region, cache operations remains operational even if orchestrator went down
- In replica region, cache operations wait for the local orchestrator to come up and after local orchestrator comes up, it will connect to master region and does full sync
- Only if replicaDisconnectBehavior = reconnectResync (default ).
With standby orchestrator, standby orchestrator becomes active orchestrator automatically and
- in replica region, both unidirectional and bidirectional cache operations wait for full sync to be completed and cache activation.
  - Only if replicaDisconnectBehavior = reconnectResync (default ).
- in master region, both unidirectional and bidirectional cache operations continue to operate and new active orchestrator syncs all replica orchestrator.
- Only if replicaDisconnectBehavior = reconnectResync (default ).

Master Region Failure

If the master region fails entirely, there will most likely be data writes that were not completed (updated to Replica Region), though application may believe they were. As described in this section, recommended action would be to promote Replica region caches to master.

Manual Recovery

If a TSA failure takes place in a region with Master caches, upon recovery the Master caches cannot automatically force a resynchronization of all live Replicas across the WAN, because that could result in data loss. In this case, when an entire region is down (for example, your data center is offline), you must use a manual recovery process. You will need to designate new Master caches, update the configuration, and restart the Orchestrators.

In preparation for disaster recovery, two Orchestrator configurations should be ready:

Orchestrator Configuration	Region A	Region B
wan-config-1.xml	Master	Replica
wan-config-2.xml	Replica	Master

If Region A goes down, then Configuration 2 can be applied in Region B. Ideally, Region B would already be serving as a read-only or backup region. When Region B becomes the Master, it can then be used to resynchronize Region A.

Products

Components

More Documentation