Open Source

Bulk Loading

Introduction
API
Bulk-Load API Example Code
Performance Improvement
FAQ
Performance Tips

Introduction

BigMemory Max has a bulk-loading mode that dramatically speeds up bulk loading into caches using the Terracotta Server Array (TSA). Bulk loading is designed to be used for:

cache warming - where caches need to be filled before bringing an application online
periodic batch loading - say an overnight batch process that uploads data

The bulk-load API optimizes bulk-loading of data by removing the requirement for locks and adding transaction batching. The bulk-load API also allows applications to discover whether a cache is in bulk-load mode and to block based on that mode.

API

With bulk loading, the API for putting data into BigMemory Max stays the same. Just use cache.put(...) cache.load(...) or cache.loadAll(...). What changes is that there is a special mode that suspends the normal distributed-cache consistency guarantees and provides optimised flushing to the TSA (the L2 cache).

NOTE: The Bulk-Load API and the Configured Consistency Mode
The initial consistency mode of a cache is set by configuration and cannot be changed programmatically (see the attribute "consistency" in `<terracotta>`. The bulk-load API should be used for temporarily suspending the configured consistency mode to allow for bulk-load operations.

The following are the bulk-load API methods that are available in org.terracotta.modules.ehcache.Cache.

public boolean isClusterBulkLoadEnabled()

Returns true if a cache is in bulk-load mode (is not consistent) throughout the cluster. Returns false if the cache is not in bulk-load mode (is consistent) anywhere in the cluster.
public boolean isNodeBulkLoadEnabled()

Returns true if a cache is in bulk-load mode (is not consistent) on the current node. Returns false if the cache is not in bulk-load mode (is consistent) on the current node.
public void setNodeBulkLoadEnabled(boolean)

Sets a cache's consistency mode to the configured consistency mode (false) or to bulk load (true) on the local node. There is no operation if the cache is already in the mode specified by setNodeBulkLoadEnabled(). When using this method on a nonstop cache, a multiple of the nonstop cache's timeout value applies. The bulk-load operation must complete within that timeout multiple to prevent the configured nonstop behavior from taking effect. For more information on tuning nonstop timeouts, see Tuning Nonstop Timeouts and Behaviors.
public void waitUntilBulkLoadComplete()

Waits until a cache is consistent before returning. Changes are automatically batched and the cache is updated throughout the cluster. Returns immediately if a cache is consistent throughout the cluster.

Note the following about using bulk-load mode:

Consistency cannot be guaranteed because isClusterBulkLoadEnabled() can return false in one node just before another node calls setNodeBulkLoadEnabled(true) on the same cache. Understanding exactly how your application uses the bulk-load API is crucial to effectively managing the integrity of cached data.
If a cache is not consistent, any ObjectNotFound exceptions that may occur are logged.
get() methods that fail with ObjectNotFound return null.
Eviction is independent of consistency mode. Any configured or manually executed eviction proceeds unaffected by a cache's consistency mode.

Bulk-Load API Example Code

The following example code shows how a clustered application with BigMemory Max can use the bulk-load API to optimize a bulk-load operation:

import net.sf.ehcache.Cache;
public class MyBulkLoader {
CacheManager cacheManager = new CacheManager();  // Assumes local ehcache.xml.
  Cache cache = cacheManager.getEhcache(\"myCache\"); // myCache defined in ehcache.xml.
  cache.setNodeBulkLoadEnabled(true); // myCache is now in bulk mode.
// Load data into myCache.

// Done, now set myCache back to its configured consistency mode.
cache.setNodeBulkLoadEnabled(false); 
}

NOTE: Potentional Error With Non-Singleton CacheManager
Ehcache does not allow multiple CacheManagers with the same name to exist in the same JVM. `CacheManager()` constructors creating non-singleton CacheManagers can violate this rule, causing an error. If your code may create multiple CacheManagers of the same name in the same JVM, avoid this error by using the `static CacheManager.create() methods"()`, which always return the named (or default unnamed) CacheManager if it already exists in that JVM. If the named (or default unnamed) CacheManager does not exist, the `CacheManager.create()` methods create it.

On another node, application code that intends to touch myCache can run or wait, based on whether myCache is consistent or not:

...
if (!cache.isClusterBulkLoadEnabled()) {
// Do some work.
}
else {
 cache.waitUntilBulkLoadComplete()
 // Do the work when waitUntilBulkLoadComplete() returns.
}
...

Waiting may not be necessary if the code can handle potentially stale data:

...
if (!cache.isClusterBulkLoadEnabled()) {
// Do some work.
}
else {
// Do some work knowing that data in myCache may be stale.
}
...

Performance Improvement

The performance improvement is an order of magnitude faster. ehcacheperf (Spring Pet Clinic) now has a bulk load test which shows the performance improvement for using a Terracotta cluster. Consider also that multi-threading is likely to improve performance.

FAQ

How does bulk-loading affect pinned caches?

If a cache has been pinned, switching the cache into bulk-load mode removes the cached data. The data will then be faulted in from the TSA as it is needed.

Are there any alternatives to putting the cache into bulk-load mode?

Bulk-loading Cache methods putAll(), getAll(), and removeAll() provide high-performance and eventual consistency. These can also be used with strong consistency. If you can use them, it's unnecessary to use bulk-load mode. See the API documentation for details.

Why does the bulk loading mode only apply to Terracotta clusters?

Ehcache, both standalone and replicated, is already very fast and nothing needed to be added.

How does bulk load with RMI distributed caching work?

The core updates are very fast. RMI updates are batched by default once per second, so bulk loading will be efficiently replicated.

Bulk Loading and Nonstop

If a nonstop cache is bulk-loaded, a multiplier is applied to the configured nonstop timeout whenever the method net.sf.ehcache.Ehcache.setNodeBulkLoadEnabled(boolean) is used. The default value of the multiplier is 10. You can tune the multiplier using the bulkOpsTimeoutMultiplyFactor system property:

-Dnet.sf.ehcache.nonstop.bulkOpsTimeoutMultiplyFactor=10

For a bulk-loaded nonstop cache, the cache size displayed in the TMC is subject to the bulkOpsTimeoutMultiplyFactor. Increasing this multiplier on the clients can facilitate more accurate size reporting.

This multiplier also affects the methods net.sf.ehcache.Ehcache.getAll(), net.sf.ehcache.Ehcache.removeAll(), and net.sf.ehcache.Ehcache.removeAll(boolean).

Performance Tips

Bulk Loading on Multiple Nodes

The implementation scales very well when the load is split up against multiple Ehcache CacheManagers on multiple machines. Adding nodes for bulk loading is likely to improve performance.

Why not run in bulk load mode all the time

Terracotta clustering provides consistency, scaling and durability. Some applications will require consistency, or not for some caches, such as reference data. It is possible to run a cache permanently in inconsistent mode.