Node/Machine fail behavior of distributed caches

My high level question is: what happens to a distributed cache when nodes fail?
We have 2 servers which run 4 JVMs each. We use the default backup count of 1.
What happens when an entire machine fails (all 4 JVMs go down with the ship)?
What happens when I stop and restart each JVM one at a time?
My main concern is data loss. Since I have the backup count set to 1, my expectation for both scenarios above is that I would lose no cached data, but that does not appear to be the case. I am left wondering in what scenarios the backups help.
How does the cluster tell the difference between (a) a node failed but will be restored soon enough so don't reduce the cluster size, and (b) a node was removed and will never come back so reduce the cluster size?
It would be nice to see a wiki page that describes the gory details of how the cluster handles various failure scenarios.

Each partition is allocated to a JVM and a backup of that partition is allocated to another JVM. If you are running on multiple physical machines then Coherence will put the backup partition on a different machine from the primary. You can tell how successful Coherence has been at doing this by looking at the StatusHA value for your services in JMX using something like JConsole. If the backup partitions are on different machines from the primary partitions the StatusHA value will say MACHINE-SAFE, if the backup is on the same machine as the primary the StatusHA value will be NODE-SAFE, and if there is no backup the StatusHA value will be ENDANGERED.
There is also a status called BALANCED, which means that besides being MACHINE-SAFE, the partitions are also as evenly distributed between nodes (not boxes) as possible.
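For illustration, a minimal sketch that reads StatusHA over JMX from inside a cluster node. It assumes JMX management is enabled on the node (for example with -Dtangosol.coherence.management=all); the "Coherence" MBean domain and the StatusHA attribute are the documented defaults for partitioned services.

import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.AttributeNotFoundException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class StatusHACheck {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // One Service MBean is registered per service per node.
        Set<ObjectName> names =
                server.queryNames(new ObjectName("Coherence:type=Service,*"), null);
        for (ObjectName name : names) {
            try {
                Object status = server.getAttribute(name, "StatusHA");
                System.out.println(name + " -> StatusHA=" + status);
            } catch (AttributeNotFoundException e) {
                // Not a partitioned service (e.g. an invocation service); skip.
            }
        }
    }
}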
When you lose a JVM (or multiple JVMs if you lose a whole machine) this causes a partition loss event for the partitions that were allocated to the dead JVMs. In the case of losing a single JVM the backup partition now becomes the primary and a new backup is created (following the same rules about creating the backup on another machine if possible). If you lose a whole machine then the same thing happens, but on a bigger scale.
A small correction: a partition loss event happens when you lose both the primary and all backups. What you described is not a partition loss, as a backup is still there and is promoted to primary.
Also, losing a whole machine works the same way only if you were machine-safe (or at least the partitions which had primaries on the lost box were machine-safe). If those partitions were not machine-safe, then you would have lost partitions, as all copies of non-machine-safe partitions on that box were lost.
Other than that it does happen as described.
In your case you should not necessarily see data loss if you kill a single node from the cluster you described, nor should you lose data if you kill a whole machine - provided, as mentioned, that the cluster (or at least the partitions having primaries on the killed box) is machine-safe.
There are scenarios where data loss can occur, for example losing two JVMs on different machines at exactly the same time - this is because there is a very high chance that those two JVMs shared the primary and backup copies of at least one partition.
If you lose a JVM the cluster size will always be reduced - it cannot be anything else, as a node has just departed the cluster.
The above descriptions may be a bit simplified but I think they are close enough to describe what you wanted to know.
Best regards,
Robert

Similar Messages

  • Newsfeed error - The operation failed because the server could not access the distributed cache.

    Recently installed SharePoint 2013 RTM; on the newsfeed page an error is displayed, and no entries appear in the Following or Everyone tabs.
    "The operation failed because the server could not access the distributed cache."
    Reading through various posts, I've checked:
    - Activity feeds and mentions tabs are working as expected.
    - User Profile Service is operational and syncing as expected
    - Search is operational and indexing as expected
    - The farm was installed based on the autospinstaller scripts.
    - Don't believe this to be a permissions issue, during testing added accounts to the admin group to verify
    Any suggestions are welcomed, thanks.
    The full error message and trace logs are as follows.
    SharePoint returned the following error: The operation failed because the server could not access the distributed cache. Internal type name: Microsoft.Office.Server.Microfeed.MicrofeedException. Internal error code: 55. Contact your system administrator
    for help in resolving this problem.
    From the trace logs there's several messages which are triggered around the same time:
    http://msdn.microsoft.com/en-AU/library/System.ServiceModel.Diagnostics.TraceHandledException.aspx
    Handling an exception. Exception details: System.ServiceModel.FaultException`1[Microsoft.Office.Server.UserProfiles.FeedCacheFault]: Unexpected exception in FeedCacheService.GetPublishedFeed: Object reference not set to an instance of an object.. (Fault Detail is equal to Microsoft.Office.Server.UserProfiles.FeedCacheFault).
    /LM/W3SVC/2/ROOT/d71732192b0d4afdad17084e8214321e-1-129962393079894191
    System.ServiceModel.FaultException`1[[Microsoft.Office.Server.UserProfiles.FeedCacheFault, Microsoft.Office.Server.UserProfiles, Version=15.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c]], System.ServiceModel, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089
    Unexpected exception in FeedCacheService.GetPublishedFeed: Object reference not set to an instance of an object..
     at Microsoft.Office.Server.UserProfiles.FeedCacheService.Microsoft.Office.Server.UserProfiles.IFeedCacheService.GetPublishedFeed(FeedCacheRetrievalEntity fcTargetEntity, FeedCacheRetrievalEntity fcViewingEntity, FeedCacheRetrievalOptions fcRetOptions)
     at SyncInvokeGetPublishedFeed(Object , Object[] , Object[] )    
     at System.ServiceModel.Dispatcher.SyncMethodInvoker.Invoke(Object instance, Object[] inputs, Object[]& outputs)    
     at System.ServiceModel.Dispatcher.DispatchOperationRuntime.InvokeBegin(MessageRpc& rpc)    
     at System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage5(MessageRpc& rpc)    
     at System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage31(MessageRpc& rpc)    
     at System.ServiceModel.Dispatcher.MessageRpc.Process(Boolean isOperationContextSet)
    System.ServiceModel.FaultException`1[Microsoft.Office.Server.UserProfiles.FeedCacheFault]: Unexpected exception in FeedCacheService.GetPublishedFeed: Object reference not set to an instance of an object.. (Fault Detail is equal to Microsoft.Office.Server.UserProfiles.FeedCacheFault).
    SPSocialFeedManager.GetFeed: Exception: Microsoft.Office.Server.Microfeed.MicrofeedException: ServerErrorFetchingConsolidatedFeed : ( Unexpected exception in FeedCacheService.GetPublishedFeed: Object reference not set to an instance of an object.. ) : Correlation
    ID:db6ddc9b-8d2e-906e-db86-77e4c9fab08f : Date and Time : 31/10/2012 1:40:20 PM    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedThreadCollection.PopulateConsolidated(SPMicrofeedRetrievalOptions retOptions, SPMicrofeedContext context)    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedThreadCollection.Populate(SPMicrofeedRetrievalOptions retrievalOptions, SPMicrofeedContext context)    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedManager.CommonGetFeedFor(SPMicrofeedRetrievalOptions retrievalOptions)    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedManager.CommonPubFeedGetter(SPMicrofeedRetrievalOptions feedOptions, MicrofeedPublishedFeedType feedType, Boolean publicView)    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedManager.GetPublishedFeed(String feedOwner, SPMicrofeedRetrievalOptions feedOptions, MicrofeedPublishedFeedType typeOfPubFeed)    
     at Microsoft.Office.Server.Social.SPSocialFeedManager.Microsoft.Office.Server.Social.ISocialFeedManagerProxy.ProxyGetFeed(SPSocialFeedType type, SPSocialFeedOptions options)    
     at Microsoft.Office.Server.Social.SPSocialFeedManager.<>c__DisplayClass4b`1.<S2SInvoke>b__4a()
    Microsoft.Office.Server.Social.SPSocialFeedManager.GetFeed: Microsoft.Office.Server.Microfeed.MicrofeedException: ServerErrorFetchingConsolidatedFeed : ( Unexpected exception in FeedCacheService.GetPublishedFeed: Object reference not set to an instance of
    an object.. ) : Correlation ID:db6ddc9b-8d2e-906e-db86-77e4c9fab08f : Date and Time : 31/10/2012 1:40:20 PM    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedThreadCollection.PopulateConsolidated(SPMicrofeedRetrievalOptions retOptions, SPMicrofeedContext context)    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedThreadCollection.Populate(SPMicrofeedRetrievalOptions retrievalOptions, SPMicrofeedContext context)    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedManager.CommonGetFeedFor(SPMicrofeedRetrievalOptions retrievalOptions)    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedManager.CommonPubFeedGetter(SPMicrofeedRetrievalOptions feedOptions, MicrofeedPublishedFeedType feedType, Boolean publicView)    
     at Microsoft.Office.Server.Microfeed.SPMicrofeedManager.GetPublishedFeed(String feedOwner, SPMicrofeedRetrievalOptions feedOptions, MicrofeedPublishedFeedType typeOfPubFeed)    
     at Microsoft.Office.Server.Social.SPSocialFeedManager.Microsoft.Office.Server.Social.ISocialFeedManagerProxy.ProxyGetFeed(SPSocialFeedType type, SPSocialFeedOptions options)    
     at Microsoft.Office.Server.Social.SPSocialFeedManager.<>c__DisplayClass4b`1.<S2SInvoke>b__4a()    
     at Microsoft.Office.Server.Social.SPSocialUtil.InvokeWithExceptionTranslation[T](ISocialOperationManager target, String name, Func`1 func)
    Microsoft.Office.Server.Social.SPSocialFeedManager.GetFeed: Microsoft.Office.Server.Social.SPSocialException: The operation failed because the server could not access the distributed cache. Internal type name: Microsoft.Office.Server.Microfeed.MicrofeedException.
    Internal error code: 55.    
     at Microsoft.Office.Server.Social.SPSocialUtil.TryTranslateExceptionAndThrow(Exception exception)    
     at Microsoft.Office.Server.Social.SPSocialUtil.InvokeWithExceptionTranslation[T](ISocialOperationManager target, String name, Func`1 func)    
     at Microsoft.Office.Server.Social.SPSocialFeedManager.<>c__DisplayClass48`1.<S2SInvoke>b__47()    
     at Microsoft.Office.Server.Social.SPSocialUtil.InvokeWithExceptionTranslation[T](ISocialOperationManager target, String name, Func`1 func)

    Thanks Thuan,
    I've restarted the Distributed Cache service and the error is still occurring.
    The AppFabric Caching Service is running under the service apps account, and does appear operational based on:
    > use-cachecluster
    > get-cache
    CacheName                                                              [Host] Regions
    default
    DistributedAccessCache_1e9f4999-0187-40e8-aa92-f8308d47d6e9
    DistributedActivityFeedCache_1e9f4999-0187-40e8-aa92-f8308d47d6e9
    DistributedActivityFeedLMTCache_1e9f4999-0187-40e8-aa92-f8308d47d6e9   [SERVER:22233] LMT(Primary)
    DistributedBouncerCache_1e9f4999-0187-40e8-aa92-f8308d47d6e9
    DistributedDefaultCache_1e9f4999-0187-40e8-aa92-f8308d47d6e9
    DistributedLogonTokenCache_1e9f4999-0187-40e8-aa92-f8308d47d6e9        [SERVER:22233] Default_Region_0538(Primary), Default_Region_0004(Primary), Default_Region_0451(Primary)
    DistributedSearchCache_1e9f4999-0187-40e8-aa92-f8308d47d6e9
    DistributedSecurityTrimmingCache_1e9f4999-0187-40e8-aa92-f8308d47d6e9
    DistributedServerToAppServerAccessTokenCache_1e9f4999-0187-40e8-aa92-f8308d47d6e9

  • Data-node affinity in a distributed cache

    My apologies if this is addressed elsewhere...
    I have a few questions regarding the association of a cached object to a cluster node in a distributed cache:
    1. What factor(s) determine which node is the primary cluster node for a given object in a distributed cache?
    2. Similarly, assuming that at least one backup node is configured, what determines which node will be the backup node for a given object?
    Thanks.

    Hi,
    There is not yet the ability to specify node ownership (through the DistributedCacheService). The basic issue is that a significant chunk of our technology is involved in managing node ownership without introducing non-scalable state or data vulnerability. Allowing users to control this would shift that responsibility onto application code. This is a very difficult task to manage in a manner that is scalable, performant and fault-tolerant (any two of those are fairly easy to accomplish).
    In practice, this is not much of an issue as we have patterns to work around this (including data-driven load balancing and our near-cache technology) without impacting any of those three requirements.
    I believe there are plans to add this ability to a public Coherence API in a future release, but this would be (as discussed above) a very advanced feature.
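    As a rough illustration of how ownership can at least be observed (not controlled): a sketch assuming a Coherence release where com.tangosol.net.PartitionedService exposes getKeyOwner(); the cache name and key here are invented.

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.Member;
    import com.tangosol.net.NamedCache;
    import com.tangosol.net.PartitionedService;

    public class KeyOwnerCheck {
        public static void main(String[] args) {
            NamedCache cache = CacheFactory.getCache("example"); // hypothetical cache
            PartitionedService service = (PartitionedService) cache.getCacheService();
            Object key = "some-key";
            // The primary owner of the partition this key hashes to.
            Member owner = service.getKeyOwner(key);
            System.out.println(key + " is owned by " + owner);
        }
    }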
    Jon Purdy
    Tangosol, Inc.

  • Set request timeout for distributed cache

    Hi,
    Coherence provides 3 parameters we can tune for the distributed cache
    tangosol.coherence.distributed.request.timeout - the default client request timeout for distributed cache services
    tangosol.coherence.distributed.task.timeout - the default server execution timeout for distributed cache services
    tangosol.coherence.distributed.task.hung - the default time before a thread is reported as hung by distributed cache services
    It seems these timeout values are used both for system activities (node discovery, data re-balancing, etc.) and for user activities (get, put). We would like to set the request timeout for get/put, but a low threshold like 10 ms sometimes causes the system activities to fail. Is there a way for us to set the timeout values separately? Or is it even possible to set a timeout on individual calls (like get(key, timeout))?
    -thanks

    Hi,
    not necessarily for get and put methods, but for queries, entry-processors, entry-aggregators and invocable agent sending, you can make the sent filter, aggregator, entry-processor or agent implement PriorityTask, which allows you to make QoS expectations known to Coherence. Most or all stock aggregators and entry-processors implement PriorityTask, if I remember correctly.
    For more info, look at the documentation of PriorityTask.
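    For example, a minimal sketch (class name invented) of an entry processor that carries its own 10 ms request timeout via PriorityTask, leaving the service-wide defaults untouched:

    import com.tangosol.net.PriorityTask;
    import com.tangosol.util.InvocableMap;
    import com.tangosol.util.processor.AbstractProcessor;

    public class TimedReadProcessor extends AbstractProcessor implements PriorityTask {
        public Object process(InvocableMap.Entry entry) {
            // A get() expressed as an entry processor, so PriorityTask applies.
            return entry.isPresent() ? entry.getValue() : null;
        }
        public int getSchedulingPriority()      { return SCHEDULE_STANDARD; }
        public long getExecutionTimeoutMillis() { return TIMEOUT_DEFAULT; }
        public long getRequestTimeoutMillis()   { return 10L; } // applies to this call only
        public void runCanceled(boolean fAbandoned) { /* nothing to clean up */ }
    }

    Invoked as cache.invoke(key, new TimedReadProcessor()), the 10 ms budget applies to that request without lowering tangosol.coherence.distributed.request.timeout for node discovery or re-balancing.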
    Best regards,
    Robert

  • Need Help regarding initial configuration for distributed cache

    Hi ,
    I am new to tangosol and trying to set up a basic partitioned distributed cache, but I am not able to do so.
    Here is my Scenario,
    My application DataServer creates the instance of the Tangosol cache.
    I have this config.xml set on the machine where my application starts.
    <?xml version="1.0"?>
    <!DOCTYPE cache-config SYSTEM "cache-config.dtd">
    <cache-config>
        <caching-scheme-mapping>
            <!-- Caches with any name will be created as default near. -->
            <cache-mapping>
                <cache-name>*</cache-name>
                <scheme-name>default-distributed</scheme-name>
            </cache-mapping>
        </caching-scheme-mapping>
        <caching-schemes>
            <!-- Default distributed caching scheme. -->
            <distributed-scheme>
                <scheme-name>default-distributed</scheme-name>
                <service-name>DistributedCache</service-name>
                <backing-map-scheme>
                    <class-scheme>
                        <scheme-ref>default-backing-map</scheme-ref>
                    </class-scheme>
                </backing-map-scheme>
                <autostart>true</autostart>
            </distributed-scheme>
            <!-- Default backing map scheme definition used by all the caches
                 that do not require any eviction policies -->
            <class-scheme>
                <scheme-name>default-backing-map</scheme-name>
                <class-name>com.tangosol.util.SafeHashMap</class-name>
                <init-params></init-params>
            </class-scheme>
        </caching-schemes>
    </cache-config>
    Now on the same machine I start a different client using the command
    java -Dtangosol.coherence.distributed.localstorage=false -Dtangosol.coherence.cacheconfig=near-cache-config.xml -classpath
    "C:/calypso/software/release/build" -jar ../lib/coherence.jar
    The problem I am facing is
    1) Even if I do not start the client, my application server caches the data. My config.xml is set to a distributed scheme, so I would expect it never to cache the data locally...
    2) I want to bind different caches to different processes on different machines, say for e.g.:
    machine1 should cache cache1 objects
    machine2 should cache cache2 objects
    and so on... but I could not find any documentation which explains how to do this setting. Can someone give me an example of how to do it?
    3) I want to know the details of the caches stored on any particular node - how do I find out, say for e.g., which caches machine1 contains and their corresponding object values, etc.?
    Regards
    Mahesh

    Hi, thanks for the answer.
    After digging into the wiki a lot I found something related to KeyAssociation. I think what I need is an implementation of KeyAssociation that stores a particular cache type's objects on a particular node or group of nodes.
    Say, for e.g., I want this kind of setup:
    Cache1 --> node1, node2, as I forecast this would take a lot of memory (so I assign these JVMs around 10 G)
    Cache2 --> node3, assigned a small amount of memory (like 2 G)
    and so on...
    From the wiki documentation I see:
    Key Association
    By default the specific set of entries assigned to each partition is transparent to the application. In some cases it may be advantageous to keep certain related entries within the same cluster node. A key-associator may be used to indicate related entries, the partitioned cache service will ensure that associated entries reside on the same partition, and thus on the same cluster node. Alternatively, key association may be specified from within the application code by using keys which implement the com.tangosol.net.cache.KeyAssociation interface.
    Does someone have an example explaining how this is done in the simplest way? (See the sketch below.)
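    A minimal sketch of the KeyAssociation approach quoted above; the class and field names (OrderLineKey, orderId) are invented for illustration. All keys sharing an orderId land in the same partition, and therefore on the same storage node, as any other key associated with that orderId.

    import java.io.Serializable;
    import com.tangosol.net.cache.KeyAssociation;

    public class OrderLineKey implements KeyAssociation, Serializable {
        private final String orderId;
        private final int lineNo;

        public OrderLineKey(String orderId, int lineNo) {
            this.orderId = orderId;
            this.lineNo = lineNo;
        }

        // Coherence partitions this key by the associated key (the order id)
        // rather than by the composite key itself.
        public Object getAssociatedKey() {
            return orderId;
        }

        // Value-based equality is required for any cache key.
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof OrderLineKey)) return false;
            OrderLineKey that = (OrderLineKey) o;
            return lineNo == that.lineNo && orderId.equals(that.orderId);
        }

        public int hashCode() {
            return 31 * orderId.hashCode() + lineNo;
        }
    }

    Note that key association only co-locates related entries; it does not choose which node they land on. Giving Cache1 and Cache2 different memory budgets would instead need separate cache services with storage enabled on different sets of nodes, as discussed in the next thread.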

  • Different distributed caches within the cluster

    Hi,
    i've three machines n1 , n2 and n3 respectively that host tangosol. 2 of them act as the primary distributed cache and the third one acts as the secondary cache. i also have weblogic running on n1 and based on some requests pumps data on to the distributed cache on n1 and n2. i've a listener configured on n1 and n2 and on the entry deleted event i would like to populate tangosol distributed service running on n3. all the 3 nodes are within the same cluster.
    i would like to ensure that the data directly coming from weblogic should only be distributed across n1 and n2 and NOT n3. for e.g. i do not start an instance of tangosol on node n3. and an object gets pruned from either n1 or n2. so ideally i should get a storage not configured exception which does not happen.
    The point is, the moment I say CacheFactory.getCache("Dist:n3") in the cache listener, tangosol does populate the secondary cache by creating an instance of Dist:n3 on either n1 or n2, depending on where the object has been pruned from.
    from my understanding i dont think we can have a config file on n1 and n2 that does not have a scheme for n3. i tried doing that and got an illegalstate exception.
    my next step was to define the Dist:n3 scheme on n1 and n2 with local storage false and have a similar config file on n3 with local-storage for Dist:n3 as true and local storage for the primary cache as false.
    can i configure local-storage specific to a cache rather than to a node.
    i also have an EJB deployed on weblogic that also entertains a getData request. i.e. this ejb will also check the primary cache and the secondary cache for data. i would have the statement
    NamedCache n3 = CacheFactory.getCache("n3") in the bean as well.

    Hi Jigar,
    i've three machines n1, n2 and n3 respectively that host tangosol. 2 of them act as the primary distributed cache and the third one acts as the secondary cache.

    First, I am curious as to the requirements that drive this configuration setup.

    i would like to ensure that the data directly coming from weblogic should only be distributed across n1 and n2 and NOT n3. for e.g. i do not start an instance of tangosol on node n3, and an object gets pruned from either n1 or n2, so ideally i should get a storage not configured exception, which does not happen.
    The point is, the moment i say CacheFactory.getCache("Dist:n3") in the cache listener, tangosol does populate the secondary cache by creating an instance of Dist:n3 on either n1 or n2 depending on where the object has been pruned from.
    from my understanding i dont think we can have a config file on n1 and n2 that does not have a scheme for n3. i tried doing that and got an illegalstate exception.
    my next step was to define the Dist:n3 scheme on n1 and n2 with local storage false and have a similar config file on n3 with local-storage for Dist:n3 as true and local storage for the primary cache as false.
    can i configure local-storage specific to a cache rather than to a node.
    i also have an EJB deployed on weblogic that also entertains a getData request, i.e. this ejb will also check the primary cache and the secondary cache for data. i would have the statement NamedCache n3 = CacheFactory.getCache("n3") in the bean as well.

    In this scenario, I would recommend having the "primary" and "secondary" caches on different cache services (i.e. distributed-scheme/service-name). Then you can configure local storage on a service by service basis (i.e. distributed-scheme/local-storage).
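    For illustration only (scheme names and system properties invented), two distributed schemes whose storage can be toggled independently on each node:

    <distributed-scheme>
        <scheme-name>primary-distributed</scheme-name>
        <service-name>PrimaryDistributedCache</service-name>
        <local-storage system-property="primary.localstorage">true</local-storage>
        <backing-map-scheme>
            <local-scheme/>
        </backing-map-scheme>
        <autostart>true</autostart>
    </distributed-scheme>
    <distributed-scheme>
        <scheme-name>secondary-distributed</scheme-name>
        <service-name>SecondaryDistributedCache</service-name>
        <local-storage system-property="secondary.localstorage">false</local-storage>
        <backing-map-scheme>
            <local-scheme/>
        </backing-map-scheme>
        <autostart>true</autostart>
    </distributed-scheme>

    With this sketch, n1 and n2 would start with -Dprimary.localstorage=true -Dsecondary.localstorage=false, and n3 with the opposite settings.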
    Later,
    Rob Misek
    Tangosol, Inc.

  • Distributed cache size limit

    Hi,
    I want to create a distributed cache with 2 nodes.
    Each node can have a maximum of 500 entries.
    The total entries in the cache across both nodes should not exceed 1000.
    If a user tries to put more than 1000 elements into the cache, then some old entries (LRU, LFU) should be evicted and the new entries added.
    Can you please help me with the scheme for the above scenario?
    Your help will be appreciated.
    Thanks & Regards,
    Viral Gala

    Hi,
    I tried the code below, but <high-units> was not working - the cache size came out as 1010 (greater than 500). (A note on a likely cause follows the console output below.)
    Java code
    package com.splwg.ccb.domain.pricing;

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;

    public class CoherenceSizeTest {
        public static void main(String[] args) {
            NamedCache cache = CacheFactory.getCache("CheckSize");
            for (int i = 0; i < 1010; i++) {
                String key = "key" + i;
                cache.put(key, new Long(i));
            }
            System.out.println(" Cache size : " + cache.size());
        }
    }
    config file
    <?xml version="1.0"?>
    <!DOCTYPE cache-config SYSTEM "cache-config.dtd">
    <cache-config>
        <caching-scheme-mapping>
            <cache-mapping>
                <cache-name>CheckSize</cache-name>
                <scheme-name>default-distributed</scheme-name>
            </cache-mapping>
        </caching-scheme-mapping>
        <caching-schemes>
            <distributed-scheme>
                <scheme-name>default-distributed</scheme-name>
                <service-name>DistributedCache</service-name>
                <backing-map-scheme>
                    <local-scheme />
                </backing-map-scheme>
                <high-units>500</high-units>
                <autostart>true</autostart>
            </distributed-scheme>
        </caching-schemes>
    </cache-config>
    Console output
    MasterMemberSet(
      ThisMember=Member(Id=1, Timestamp=2014-11-07 16:31:21.123, Address=10.180.7.97:8088, MachineId=16932, Location=site:,machine:OFSS310723,process:4036, Role=SplwgCcbDomainCoherenceSizeTest)
      OldestMember=Member(Id=1, Timestamp=2014-11-07 16:31:21.123, Address=10.180.7.97:8088, MachineId=16932, Location=site:,machine:OFSS310723,process:4036, Role=SplwgCcbDomainCoherenceSizeTest)
      ActualMemberSet=MemberSet(Size=1
        Member(Id=1, Timestamp=2014-11-07 16:31:21.123, Address=10.180.7.97:8088, MachineId=16932, Location=site:,machine:OFSS310723,process:4036, Role=SplwgCcbDomainCoherenceSizeTest)
      MemberId|ServiceVersion|ServiceJoined|MemberState
        1|3.7.1|2014-11-07 16:31:24.375|JOINED
      RecycleMillis=1200000
      RecycleSet=MemberSet(Size=0
    TcpRing{Connections=[]}
    IpMonitor{AddressListSize=0}
    Cache size : 1010
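    For reference: in the configuration above, <high-units> is a setting of the backing map's local-scheme; placed directly under <distributed-scheme> it has no effect, which would explain the observed size of 1010. A sketch of a corrected backing map, capping each storage node at 500 units (1000 across the two nodes) and evicting via LRU:

    <backing-map-scheme>
        <local-scheme>
            <eviction-policy>LRU</eviction-policy>
            <high-units>500</high-units>
        </local-scheme>
    </backing-map-scheme>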

  • Distributed cache and MapListener

    Do the MapListeners receive all events on a distributed cache when the cache is updated on a node different from the one on which the MapListener is located?
    I ask this because, I've been testing the following configuration:
    A distributed cache started on 2 machines.
    When listening to an event on a cache (using getCache(String, ClassLoader) to get the cache, and addMapListener() to connect my implementation of MapListener to the cache), nothing is received when the other node is updated.
    Am I misusing MapListeners?

    Sorry. I had been testing with an inappropriate configuration (-Dtangosol.coherence.ttl=0), so the 2 machines could not see each other.
    The MapListener works as expected.
    Pedro
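    For reference, the listener wiring described above, as a minimal sketch (cache name illustrative):

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.util.MapEvent;
    import com.tangosol.util.MultiplexingMapListener;

    public class ListenerDemo {
        public static void main(String[] args) throws InterruptedException {
            NamedCache cache = CacheFactory.getCache("example-distributed");
            // With a working cluster (non-zero TTL here), this fires for
            // inserts, updates and deletes made by any member of the cluster.
            cache.addMapListener(new MultiplexingMapListener() {
                protected void onMapEvent(MapEvent evt) {
                    System.out.println("event: " + evt);
                }
            });
            Thread.sleep(Long.MAX_VALUE); // keep the JVM alive to receive events
        }
    }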

  • Distributed cache performance?

    Hi,
    I have a question about the performance of a cluster using a distributed cache:
    A distributed cache is available in the cluster, using the expiry-delay functionality. Each node first inserts new entries in the cache and then periodically updates the entries as long as the entry is needed in the cluster (entries that are no longer periodically updated will be removed due to the expiry-delay).
    I performed a small test using a cluster with two nodes that each inserted ~2000 entries in the distributed cache. The nodes then periodically update their entries at 5 minutes intervals (using the Map.put(key, value) method). The nodes never access the same entries, so there will be no synchronization issues.
    The problem is that the CPU load on the machines running the nodes are very high, ~70% (and this is quite powerful machines with 4 CPUs running Linux). To be able to find the reason for the high CPU load, I used a profiler tool on the application running on one of the nodes. It showed that the application spent ~70% of the time in com.tangosol.coherence.component.net.socket.UdpSocket.receive. Is this normal?
    Since each node has a lot of other things to do, it is not acceptable that 70% of the CPU is used only for this purpose. Can this be a cache configuration issue, or do I have to find some other approach to perform this task?
    Regards
    Andreas

    Hi Andreas,
    Can you provide us with some additional information. You can e-mail it to our support account.
    - JProfiler snapshot of the profiling showing high CPU utilization
    - multiple full thread dumps for the process taken a few seconds apart, these should be taken when running outside of the profiler
    - Your override file (tangosol-coherence-override.xml)
    - Your cache configuration file (coherence-cache-config.xml)
    - logs from the high CPU event, please also include -verbose:gc in the logs, directing the output to the coherence log file
    - estimates on the sizes of the objects being updated in the cache
    As this is occurring even when you are not actively adding data to the cache, can you describe what else your application is doing at this time. It would be extremely odd for Coherence to consume any noticeable amount of CPU if you are not making heavy use of the cache.
    Note that when using the Map.put method the old value is returned to the caller, which for a distributed cache means extra network load; you may wish to consider switching to Map.putAll(), as this does not need to return the old value and is more efficient even if you are only operating on a single entry.
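    For example, a tiny sketch of that suggestion:

    import java.util.Collections;
    import com.tangosol.net.NamedCache;

    public final class CachePutHelper {
        // put() ships the previous value back over the network; putAll() does
        // not, so it is cheaper even when writing a single entry.
        public static void putQuiet(NamedCache cache, Object key, Object value) {
            cache.putAll(Collections.singletonMap(key, value));
        }
    }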
    thanks,
    Mark

  • Distributed Cache : Performance issue; takes long to get data

    Hi there,
     I have set up a cluster on a single Linux machine with 11 nodes (min & max heap = 1GB). The nodes are connected through a multicast address / port number. I have configured the Distributed Cache service to run on all the nodes, and 2 nodes with ExtendTCPService. I loaded a dataset of 13 million entries into the cache (approximately 5GB), where the key is a String and the value is an Integer.
     I run a java process from another Linux machine on the same network that makes use of this cache. The process fetches around 200,000 items from the cache, and it takes around 180 seconds just to fetch the data from the cache.
     I had a look at Performance Tuning > Coherence Network Tuning and checked the Publisher and Receiver success rates, and both were nearly 0.998 on all the nodes.
     It is a bit hard to believe that it takes so long - maybe I'm missing something. I would appreciate any advice on this.
         More info :
              a) All nodes are running on Java 5 update 7
              b) The java process is running on JDK1.4 Update 8
              c) -server option is enabled on all the nodes and the java process
              d) I'm using Tangosol Coherence 3.2.2b371
               e) cache-config.xml
                        <?xml version="1.0"?>
                        <!DOCTYPE cache-config SYSTEM "cache-config.dtd">
                        <cache-config>
                        <caching-scheme-mapping>
                        <cache-mapping>
                        <cache-name>dist-*</cache-name>
                        <scheme-name>dist-default</scheme-name>
                        </cache-mapping>
                        </caching-scheme-mapping>
                        <caching-schemes>
                        <distributed-scheme>
                        <scheme-name>dist-default</scheme-name>
                        <backing-map-scheme>
                             <local-scheme/>
                        </backing-map-scheme>
                        <lease-granularity>member</lease-granularity>
                        <autostart>true</autostart>
                        </distributed-scheme>
                        </caching-schemes>
                        </cache-config>
         Thanks,
         Amit Chhajed

    Hi Amit,
     Is the java test process single threaded, i.e. you performed 200,000 consecutive cache.get() operations? If so then this would go a long way towards explaining the results, as most of the time in all processes would be spent waiting on the network, and your results would come out to just over 1ms per operation. Please be sure to run with multiple test threads, and also it would be good to make use of the cache.getAll() call where possible to have a single thread fetching multiple items in parallel.
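     For example, a bulk-fetch sketch (names illustrative):

     import java.util.Collection;
     import java.util.Map;
     import com.tangosol.net.NamedCache;

     public final class BulkFetch {
         // One getAll() round trip fetches many entries in parallel across
         // the cluster, instead of ~1 ms of network wait per serial get().
         public static Map fetch(NamedCache cache, Collection keys) {
             return cache.getAll(keys);
         }
     }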
     Also you may need to do some tuning on your cache server side. In general I would say that on a 1GB heap you should only utilize roughly 750 MB of that space for cache storage. Taking backups into consideration this means 375MB of primary data per JVM. So with 11 nodes, this would mean a cache capacity of 4GB. At 5GB of data each cache server will be running quite low on free memory, resulting in frequent GCs which will hurt performance. Based on my calculations you should use 14 cache servers to hold your 5GB of data. Be sure to run with -verbose:gc to monitor your GC activity.
         You must also watch your machine to make sure that your cache servers aren't getting swapped out. This means that your server machine needs to have enough RAM to keep all the cache servers in memory. Using "top" you will see that a 1GB JVM actually takes about 1.2 GB of RAM. Thus for 14 JVMs you would need ~17GB of RAM. Obviously you need to leave some RAM for the OS, and other standard processes as well, so I would say this box would need around 18GB RAM. You can use "top" and "vmstat" to verify that you are not making active use of swap space. Obviously the easiest thing to do if you don't have enough RAM, would be to split your cache servers out onto two machines.
         See http://wiki.tangosol.com/display/COH32UG/Evaluating+Performance+and+Scalability for more information on things to consider when performance testing Coherence.
         thanks,
         Mark

  • Limitation on number of objects in distributed cache

    Hi,
    Is there a limitation on the number (or total size) of objects in a distributed cache? I am seeing a big increase in response time when the number of objects exceeds 16,000. Normally, the ServiceMBean.RequestAverageDuration value is in the 6-8ms range as long as the number of objects in the cache is less than 16K - I've run our application for weeks at a time without seeing any problems. However, once the number of objects exceeds the magic number of 16K the average request duration almost immediately jumps to over 100ms and continues to climb as more objects are added.
    I'm fairly confident that the cache is indexed properly (as Dimitri helped us with that). Are there any configuration changes that could possibly help out here? We are using Coherence 3.3.
    Any suggestions would be greatly appreciated.
    Thanks,
    Jim

    Hi Jim,
    The results from the load test look quite normal: the system fairly quickly stabilizes at a particular performance level and remains there for the duration of the test. In terms of latency, we see that the cache.putAll operations take ~45ms per bulk operation, where each operation puts 100 1K items; for cache.getAll operations we see about ~15ms per bulk operation. Additionally, note that the test runs over 256,000 items, so it is well beyond the 16,000 limit you've encountered.
    So it looks like your application is exhibiting different behavior than this test. You may wish to configure this test to behave as similarly to yours as possible. For instance you can set the size of the cache to just over/under 16,000 using the -entries parameter, set the size of the entries to 900 bytes using the -size parameter, and set the total number of threads per worker using the -threads parameter.
    What is quite interesting is that at 256,000 1K objects the latency measured with this test is apparently less than half the latency you are seeing with a much smaller cache size. This would seem to point at the issue being related to or rooted in your test. Would you be able to provide a more detailed description of how you are using the cache and the types of operations you are performing?
    thanks,
    mark

  • Error message when using a MessageListener using a distributed cache

    Hi --
    I'm getting the following error message when I attach a message listener to a distributed cache. I get the same message if I attach the listener to the NearCache in front of the DistributedCache, or to the DistributedCache itself.
    My message listener listens for a create() operation and writes the created value out to the database. Both the key and value are java objects that are getting "serialized" when they're pushed into the cache. The listener is never called.
    The error spits out two messages, which look like:
    2003-04-07 21:48:05.281 Tangosol Coherence 2.1/239 <Error> (thread=DistributedCache:EventDispatcher): An exception occurred while dispatching this event:
    CacheEvent: MapEvent{com.tangosol.coherence.component.util.daemon.queueProcessor
    .service.DistributedCache$BinaryMap added: key=Binary(length=269, value=0x0005AC
    ED000573720021636F6D2E6F6C742E646174612E696E7465726E616C2E4461746162617365554944
    6FABB5383C6013B402000078720021636F6D2E6F6C742E646174612E696E7465726E616C2E416273
    7472616374554944D04F591196E4DC1B0200024C000D657874656E73696F6E4461746174000F4C6A
    6176612F7574696C2F4D61703B4C0009756964537472696E677400124C6A6176612F6C616E672F53
    7472696E673B7870737200116A6176612E7574696C2E486173684D61700507DAC1C31660D1030002
    46000A6C6F6164466163746F724900097468726573686F6C6478703F400000000000087708000000
    0B0000000078740011363930395F436F6D706F6E656E74426964), value=Binary(length=1069,
    value=0x0005ACED000573720026636F6D2E6562726576696174652E61756374696F6E2E6269642
    E436F6D706F6E656E744269648EC95C4DE33A88D802000D5A0007626573744269644A000B6269645
    3657175656E6365440004636F73745A0007696E697469616C5A00066E65774269645A00067469654
    2696444000576616C75654C000B61756374696F6E4D6F64657400294C636F6D2F656272657669617
    4652F61756374696F6E2F6576656E742F41756374696F6E4D6F64653B4C000A61756374696F6E554
    9447400124C636F6D2F6F6C742F646174612F5549443B4C000A636F6D70616E7955494471007E000
    24C000C636F6D706F6E656E7455494471007E00024C000A737472696E67436F73747400124C6A617
    6612F6C616E672F537472696E673B4C000B737472696E6756616C756571007E00037872002D636F6
    D2E6562726576696174652E636F6D6D6F6E2E416273747261637450657273697374656E744F626A6
    56374497E2729A24CA5790200034C000A637265617465446174657400104C6A6176612F7574696C2
    F446174653B4C000375696471007E00024C000A7570646174654461746571007E000578707372000
    E6A6176612E7574696C2E44617465686A81014B59741903000078707708000000F46B9A286278737
    20021636F6D2E6F6C742E646174612E696E7465726E616C2E44617461626173655549446FABB5383
    C6013B402000078720021636F6D2E6F6C742E646174612E696E7465726E616C2E416273747261637
    4554944D04F591196E4DC1B0200024C000D657874656E73696F6E4461746174000F4C6A6176612F7
    574696C2F4D61703B4C0009756964537472696E6771007E00037870737200116A6176612E7574696
    C2E486173684D61700507DAC1C31660D103000246000A6C6F6164466163746F72490009746872657
    3686F6C6478703F4000000000000877080000000B0000000078740011363930395F436F6D706F6E6
    56E744269647371007E00077708000000F46B9A286278000000000000000000402E0000000000000
    00000402E00000000000073720027636F6D2E6562726576696174652E61756374696F6E2E6576656
    E742E41756374696F6E4D6F6465BD0C9E245C328B4F02000078720029636F6D2E656272657669617
    4652E636F6D6D6F6E2E41627374726163745479706553616665456E756D506D8C41B0144DB302000
    249000A696E744C69746572616C4C000D737472696E674C69746572616C71007E000378700000000
    374000A50524F44554354494F4E7371007E00097371007E000D3F4000000000000877080000000B0
    00000007874000A38335F41756374696F6E7371007E00097371007E000D3F4000000000000877080
    000000B000000007874000A34325F436F6D70616E797371007E00097371007E000D3F40000000000
    00877080000000B00000000787400103131315F426964436F6D706F6E656E747070)}
    2003-04-07 21:48:05.687 Tangosol Coherence 2.1/239 <Warning> (thread=CoherenceLogger): Asynchronous logging character limit exceeded; discarding 3 log messages (lines=17, chars=1416)

    Kris,
    First of all you should increase the value of the logging-config/character-limit element in tangosol-coherence.xml to see the message in its entirety. The default setting is 4096 characters, which is not enough to see your exception text.
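    For example (a sketch; only the relevant element is shown, and the exact layout may vary by release):

    <coherence>
        <logging-config>
            <character-limit>1048576</character-limit>
        </logging-config>
    </coherence>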
    When you do that I believe you will see that the actual exception is java.lang.ClassNotFoundException indicating that the node that has the listener installed doesn't know about the class that is being put into the cache and could be easily fixed as shown here: http://www.tangosol.com/faq-coherence.jsp#classnotfound
    Please let me know if that doesn't help.
    Gene

  • Error handling for distributed cache synchronization

    Hello,
    Can somebody explain to me how the error handling works for distributed cache synchronization?
    Say I have four nodes of a weblogic cluster and 4 different sessions on each one of those nodes.
    On node A an update happens on object B. This update is going to be propagated to all the other nodes B, C, D. But for some reason the connection between node A and node B is lost.
    In the following xml
    <cache-synchronization-manager>
        <clustering-service>...</clustering-service>
        <should-remove-connection-on-error>true</should-remove-connection-on-error>
    </cache-synchronization-manager>
    If I set this to true, does this mean that TopLink will stop sending updates from node A to node B? I presume all of this is transparent, and that in order to handle any errors I do not have to write any code to capture this kind of error.
    Is that correct ?
    Aswin.

    This "should-remove-connection-on-error" option mainly applies to RMI or RMI_IIOP cache synchronization. If you use JMS for cache synchronization, then connectivity and error handling is provided by the JMS service.
    For RMI, when this is set to true (which is the default) if a communication exception occurs in sending the cache synchronization to a server, that server will be removed and no longer synchronized with. The assumption is that the server has gone down, and when it comes back up it will rejoin the cluster and reconnect to this server and resume synchronization. Since it will have an empty cache when it starts back up, it will not have missed anything.
    You do not have to perform any error handling, however if you wish to handle cache synchronization errors you can use a TopLink Session ExceptionHandler. Any cache synchronization errors will be sent to the session's exception handler and allow it to handle the error or be notified of the error. Any errors will also be logged to the TopLink session's log.
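    A sketch of such a handler; this assumes TopLink's oracle.toplink.exceptions.ExceptionHandler interface and Session.setExceptionHandler, whose exact names may vary by release:

    import oracle.toplink.exceptions.ExceptionHandler;

    public class SyncErrorHandler implements ExceptionHandler {
        // Called for cache synchronization errors routed to the session.
        public Object handleException(RuntimeException exception) {
            System.err.println("cache sync error: " + exception);
            return null; // swallow the error; rethrow instead to propagate it
        }
    }

    // registration (e.g. at session startup):
    // session.setExceptionHandler(new SyncErrorHandler());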

  • High-units reflect twice the amount with dual JVM's in a distributed cache

    HI all,
    I have a question - i have a near cache scheme defined - running 4 JVM's with my application deployed to it (localstorage=false) - and 2 JVM's for the distributed cache (localstorage=true)
    The high-units is set to 2000 - but the cache is allowing 4000. Is this b/c each JVM will allow for 2000 high-units each?
    I was under the impression that as long as coherence is running in the same multi-cast address and port - that the total high-units would be 2000 not 4000.
    Thanks...

    user644269 wrote:
    HI all,
    I have a question - i have a near cache scheme defined - running 4 JVM's with my application deployed to it (localstorage=false) - and 2 JVM's for the distributed cache (localstorage=true)
    The high-units is set to 2000 - but the cache is allowing 4000. Is this b/c each JVM will allow for 2000 high-units each?
    I was under the impression that as long as coherence is running in the same multi-cast address and port - that the total high-units would be 2000 not 4000.
    Thanks...
    Hi,
    the high-units setting is per backing map, so in your case it means 2000 units per storage-enabled node (4000 in total across your two storage nodes).
    From 3.5 it will become a bit more complex with the partition aware backing maps.
    Best regards,
    Robert

  • Query from Distributed Cache

    Hi
    I am a newbie to Oracle Coherence, trying to get hands-on experience by running an example (coherence-example-distributedload.zip) (Coherence GE 3.6.1). I am running two server instances. After this I ran "load.cmd" to distribute data across the two server nodes - I can see that data is partitioned across the server instances.
    Now I run another instance (on another JVM) of the program, which tries to join the distributed cache and query the data loaded on the server instances. I see that the new JVM joins the cluster, but querying for data returns no records. Can you please tell me if I am missing something?
         NamedCache nNamedCache = CacheFactory.getCache("example-distributed");
         Filter filter = new GreaterFilter("getLocId", "1000");
         Set keySet = nNamedCache.keySet(filter);
    I see here that keySet has no records. Can you please help?
    Thanks
    sunder

    I got this problem sorted out - the problem was in cache-config.xml. The correct one looks as below.
    <distributed-scheme>
        <scheme-name>example-distributed</scheme-name>
        <service-name>DistributedCache1</service-name>
        <backing-map-scheme>
            <read-write-backing-map-scheme>
                <scheme-name>DBCacheLoaderScheme</scheme-name>
                <internal-cache-scheme>
                    <local-scheme>
                        <scheme-ref>DBCache-eviction</scheme-ref>
                    </local-scheme>
                </internal-cache-scheme>
                <cachestore-scheme>
                    <class-scheme>
                        <class-name>com.test.DBCacheStore</class-name>
                        <init-params>
                            <init-param>
                                <param-type>java.lang.String</param-type>
                                <param-value>locations</param-value>
                            </init-param>
                            <init-param>
                                <param-type>java.lang.String</param-type>
                                <param-value>{cache-name}</param-value>
                            </init-param>
                        </init-params>
                    </class-scheme>
                </cachestore-scheme>
                <cachestore-timeout>6000</cachestore-timeout>
                <refresh-ahead-factor>0.5</refresh-ahead-factor>
            </read-write-backing-map-scheme>
        </backing-map-scheme>
        <thread-count>10</thread-count>
        <autostart>true</autostart>
    </distributed-scheme>
    <invocation-scheme>
        <scheme-name>example-invocation</scheme-name>
        <service-name>InvocationService1</service-name>
        <autostart system-property="tangosol.coherence.invocation.autostart">true</autostart>
    </invocation-scheme>
    I had missed the <class-scheme> element inside <cachestore-scheme> of <read-write-backing-map-scheme>.
    Thanks
    sunder
