DB_EVENT_REP_CONNECT_BROKEN event on surviving replicas!

Hello all,
I am working on a high-availability project with the latest Berkeley DB 5.3. I have 3 replicas (10.10.8.5, 10.10.8.7, and 10.10.8.8), with one master at a time. The master gets a virtual IP, and that is the IP the client knows... Failover works only when the elected master is the group creator node: if that node fails, another one is elected as the new master and everything works perfectly. In the other cases (when the master is not the group creator), when I kill the master, the other two replicas receive a DB_EVENT_REP_CONNECT_BROKEN event and no election for a new master is held. Thus the client cannot communicate with them and gets connection refused... The exact output (with many debug flags on) is:
[1327795212:363390][20270/1149090128] TROVE:DBPF:Berkeley DB: EOF on connection to site 10.10.8.8:2000 ( 10.10.8.8 is the master node that just crashed )
(*) Debugging : BerkeleyDB_replica_state_callback : default : 4 // This is the code for DB_EVENT_REP_CONNECT_BROKEN
[1327795212:363620][20270/1149090128] TROVE:DBPF:Berkeley DB: Repmgr: bust connection. Block archive
My DB_CONFIG files are:
10.10.8.5
set_flags DB_TXN_NOSYNC on
set_flags DB_AUTO_COMMIT on
repmgr_site 10.10.8.7 5010
repmgr_site 10.10.8.8 5010
repmgr_site 10.10.8.5 5010 db_local_site on db_group_creator on
rep_set_priority 100
10.10.8.7
set_flags DB_TXN_NOSYNC on
set_flags DB_AUTO_COMMIT on
repmgr_site 10.10.8.5 5010 db_bootstrap_helper on
repmgr_site 10.10.8.7 5010 db_local_site on
repmgr_site 10.10.8.8 5010
rep_set_priority 100
10.10.8.8
set_flags DB_TXN_NOSYNC on
set_flags DB_AUTO_COMMIT on
repmgr_site 10.10.8.5 5010 db_bootstrap_helper on
repmgr_site 10.10.8.7 5010
repmgr_site 10.10.8.8 5010 db_local_site on
rep_set_priority 100
Any ideas about this error?
Many thanks,
Dimos.

We need more information to understand the problem.
"I have 3 replicas (10.10.8.5, 10.10.8.7, and 10.10.8.8), with one master at a time. The master gets a virtual IP, and that is the IP the client knows..."
When you say virtual IP, do you mean some host string other than 10.10.8.5, 10.10.8.7 or 10.10.8.8? If so, this could be the problem. You must refer to each site in the replication group in a consistent manner. We have no way to relate more than one host string and port to a single site.
"Failover works only when the elected master is the group creator node: if that node fails, another one is elected as the new master and everything works perfectly. In the other cases (when the master is not the group creator), when I kill the master, the other two replicas receive a DB_EVENT_REP_CONNECT_BROKEN event and no election for a new master is held."
So was this your sequence of events? If it was different, please elaborate.
1. Start up replication with site 5 master, sites 7 and 8 clients.
2. Kill site 5, site 8 became master.
3. Site 5 rejoins the replication group as a client.
4. Kill site 8 (current master), sites 5 and 7 get CONNECT_BROKEN but don't start an election.
[1327795212:363390][20270/1149090128] TROVE:DBPF:Berkeley DB: EOF on connection to site 10.10.8.8:2000 ( 10.10.8.8 is the master node that just crashed )
(*) Debugging : BerkeleyDB_replica_state_callback : default : 4 // This is the code for DB_EVENT_REP_CONNECT_BROKEN
[1327795212:363620][20270/1149090128] TROVE:DBPF:Berkeley DB: Repmgr: bust connection. Block archive
It looks like you have turned on verbose output, but what you display here appears to come from more than one site.
The first line talks about a broken connection to site 8 and you say site 8 is the master that just crashed. This implies that this line is from one of the clients (site 5 or site 7).
The final line ("bust connection") is a message that should only be coming from a site that thinks it's the current master. I assume this must be from site 8.
The only other explanation is that you are using more than one host string to refer to the same site, as I mentioned above, which we don't support.
I have one other question - are you using rep_set_config() to turn off the DB_REPMGR_CONF_ELECTIONS flag at any time? I realize this is unlikely, but it's worth ruling this out.
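If it helps, here is a minimal sketch of an event handler that logs the election-related events alongside the connection events, so you can see whether an election is ever attempted. It uses the Java API's EventHandlerAdapter; your application appears to be C, so treat these handler names as the Java analogues of the C event codes and adapt accordingly:

import com.sleepycat.db.EventHandlerAdapter;

// Log the replication events around a master failure so it is visible
// whether an election is ever attempted after DB_EVENT_REP_CONNECT_BROKEN.
public class RepEventLogger extends EventHandlerAdapter {
    @Override
    public void handleRepClientEvent() {
        System.out.println("DB_EVENT_REP_CLIENT: this site is now a client");
    }
    @Override
    public void handleRepMasterEvent() {
        System.out.println("DB_EVENT_REP_MASTER: this site is now the master");
    }
    @Override
    public void handleRepElectedEvent() {
        System.out.println("DB_EVENT_REP_ELECTED: this site won an election");
    }
    @Override
    public void handleRepNewMasterEvent(int envId) {
        System.out.println("DB_EVENT_REP_NEWMASTER: new master is env id " + envId);
    }
}

If none of these fire on the surviving clients after you kill the master, that would support the theory that elections are disabled or that the group membership is inconsistent.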
Paula Bingham
Oracle

Similar Messages

  • How to handle event "DB_EVENT_PANIC" on replica

    Hi BDB experts,
    I am writing a db HA application based on BDB version 4.6.21. Two daemons run on two machines: one as master, which reads/writes the db, and one as backup, which only reads the db. The backup sometimes gets a "DB_EVENT_PANIC" event for reasons I don't know. What should I do on receiving such an event? Should the daemon exit and then open the environment again to run recovery? Can one process reopen the environment for recovery without exiting first?
    Another question, unrelated to the issue above: using BDB HA (Base API), can there be 3 processes, where process1 is neither master nor client but just writes the db, process2 is the rep master that opens the same env (just for HA; it will not write the db), and process3 is a client that connects to the master to get the db synced?
    Thanks,
    Min

    That is, if in an HA environment N processes write the db, each process should establish a connection with the replica and send HA messages to the replica when it writes the db, right?
    It is said "Subsequent replication processes must at least call the DB_ENV->rep_set_transport method.
    Yes, each process should establish a connection to the replica. Each process also needs the DB_ENV->rep_set_transport call so that it will invoke your application's send function to send each HA message to the replica.
    Those processes may call the DB_ENV->rep_start method". I am confused about why rep_start is not a must. If rep_start is not called, BDB will not start HA threads and the process can't send messages to the replica, right?
    I am assuming from your earlier information that you are using Base API calls. The Base API calls do not create their own threads. When using the Base API, it is the application's responsibility to create and manage its own threads.
    One possible way to design an application is to have a main replication process on the master that calls rep_set_transport and rep_start and creates one or more threads to handle all incoming messages from other sites.
    In this design, there can be N additional processes on the master that perform database writes which only need to send the logging information about these writes to the other sites. The call to rep_set_transport provides the information needed to do this so there is no need for these N additional processes to call rep_start. Our documentation is allowing for cases like this.
    You do not have to design your application this way and you can make rep_start calls in more than one process as long as they each supply the same DB_REP_CLIENT or DB_REP_MASTER value.
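    For illustration, here is a minimal sketch of this design using the Base API's Java binding (com.sleepycat.db). The env path, env IDs, and flag choices are assumptions for the example, and the transport callback's exact signature should be checked against your 4.6 javadoc:

    import com.sleepycat.db.*;
    import java.io.File;

    // Every process (the main replication process and the short-lived
    // writers) supplies the transport callback (rep_set_transport) so its
    // log records can be shipped to the other sites; only the main
    // replication process calls startReplication (rep_start).
    public class RepProcess {
        static Environment openEnv(boolean mainRepProcess) throws Exception {
            EnvironmentConfig cfg = new EnvironmentConfig();
            cfg.setAllowCreate(true);
            cfg.setTransactional(true);
            cfg.setInitializeCache(true);
            cfg.setInitializeLocking(true);
            cfg.setInitializeLogging(true);
            cfg.setInitializeReplication(true);
            cfg.setReplicationTransport(1 /* local env id */, new ReplicationTransport() {
                public int send(Environment env, DatabaseEntry control, DatabaseEntry rec,
                                LogSequenceNumber lsn, int envid, boolean noBuffer,
                                boolean permanent, boolean anywhere, boolean isRetry) {
                    // Application-specific: ship (control, rec) to the other site(s).
                    return 0;
                }
            });
            Environment env = new Environment(new File("/path/to/env"), cfg);
            if (mainRepProcess) {
                env.startReplication(null /* opaque cdata */, true /* start as master */);
            }
            return env;
        }
    }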
    If my understanding is right, for the N-process HA case, there is a question below:
    Each process used to be single-threaded, running for a few seconds and then exiting after it finishes its task. Which event from BDB notifies the main thread that the sync to the replica is done? Since the process is not a daemon, I am afraid that if it exits as in the old logic, the BDB HA sync may not have finished.
    This is a reason to consider a design with a main replication process that remains running and handles all incoming messages. Then you can have multiple additional short-lived processes that perform database writes and exit.
    If your application cannot follow this model, then you will need to design your application to make sure each process survives long enough to get whatever synchronization from the replica it requires. You should look at the Reference Guide sections "Building the communications infrastructure" and "Transactional guarantees" for more details.
    Just out of curiosity, why are you using 4.6? That is not a very recent release.
    Paula Bingham
    Oracle

  • iCal events are moving around the calendar

    I tried to move an event from one date to another tonight, and after that iCal started flashing that event back and forth between the original date and the new one right in front of my eyes, and another event was replicated and moved to another date.
    I have tried deleting the shifting events, and even deleted the calendars that the 2 events are under, but they keep reappearing. I have also tried refreshing com.apple.ical.plist, and have restarted the machine several times.
    Could this be a virus? I'm too nervous to sync my iPhone just in case.
    We have Leopard with all the latest updates (OS X 10.5.8) with iCal 3.0.8.

    Worked it out - I was using a custom date format to give me the day and date in the menu bar - reverting to one of the standard built-in 'British' formats works - but then I lose my day in the menu bar....

  • Node Event Handler

    Hi,
    I am using com.sleepycat.db.EventHandlerAdapter to capture environment-related events. What scenarios trigger "handlePanicEvent"?
    Deleting some files from the database folder while the environment is still up and running causes PANIC_ERROR in ReplicationConfiguration, but it does not trigger the "handlePanicEvent" method. Why not?
    In the case of replication, if the master node's database folder is tampered with, the master node becomes unusable and needs a recovery or a restart with a fresh database. Is there an event broadcast to the replicas when the master is inactive due to a PANIC error?
    Thanks
    Priya

    Hello,
    handlePanicEvent is triggered when a Berkeley DB method throws a RunRecoveryException. This indicates that the database environment has failed, that all threads of control in the database environment should exit the environment, and that recovery should be run.
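    For example, here is a minimal sketch of that pattern in the Java API (the env path and flag choices are illustrative assumptions):

    import com.sleepycat.db.*;
    import java.io.File;

    // Note the panic in the event handler, have all threads close their
    // handles and leave the environment, then reopen it with recovery.
    public class PanicAwareApp {
        static volatile boolean panicked = false;

        static Environment open(File home, boolean runRecovery) throws Exception {
            EnvironmentConfig cfg = new EnvironmentConfig();
            cfg.setAllowCreate(true);
            cfg.setTransactional(true);
            cfg.setInitializeCache(true);
            cfg.setInitializeLocking(true);
            cfg.setInitializeLogging(true);
            cfg.setRunRecovery(runRecovery);   // DB_RECOVER on reopen
            cfg.setEventHandler(new EventHandlerAdapter() {
                @Override
                public void handlePanicEvent() {
                    panicked = true;           // signal all threads to exit the env
                }
            });
            return new Environment(home, cfg);
        }

        public static void main(String[] args) throws Exception {
            File home = new File("/path/to/env");
            Environment env = open(home, false);
            // ... threads watch "panicked" and close their handles on panic ...
            if (panicked) {
                try { env.close(); } catch (DatabaseException ignored) { }
                env = open(home, true);        // reopen in the same process, running recovery
            }
        }
    }

    This also bears on Min's question: the same process can generally reopen the environment with recovery without exiting first, provided every other thread and process has closed its handles and left the environment.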
    Thanks,
    Sandra

  • HT204053 Multiple apple ID prompts at same time on phone

    I have already changed my old apple ID to my new email, yet on my phone I am prompted for 3 different apple IDs: the old email, the new email, and mobile me...How do I change that?

    You wouldn't have to use the same iCloud ID for both of you to use Find My iPhone; you would just have to sign into, in the Find My iPhone app, the iCloud account that the device you are trying to locate is using.  Sharing the same iCloud account is usually not recommended because when you do this, any data you sync with the account is merged and you will end up with each other's data on your devices.  Also, any actions taken on one device (such as adding contacts or calendar events) will be replicated on the other.  You would also be sharing the same 5 GB of iCloud storage for your data and backups rather than each having your own.
    If you want to do this anyway, start by deciding which device will be keeping the current iCloud account.  On the one that will be changing accounts, if you have any photos in photo stream that are not in your camera roll or backed up somewhere else, save these to your camera roll by opening the photo stream album in the thumbnail view, tapping Edit, then tapping all the photos you want to save, then Share, then Save to Camera Roll.
    Once this is done, go to Settings>iCloud, scroll to the bottom and tap Delete Account.  When prompted about what to do with the iCloud data, be sure to select Keep On My iDevice.  Next, sign into the other account, turn iCloud data syncing for contacts, etc. back to On, and when prompted about merging with iCloud, choose Merge.  This will upload the data to the new account and merge it with the data that is already there.

  • HT204053 Multiple apple id's

    My daughter and I have different Apple IDs. We would like to use the same one so that we can use apps such as Find My iPhone. How could this be done?

    You wouldn't have to use the same iCloud ID for both of you to use Find My iPhone; you would just have to sign into, in the Find My iPhone app, the iCloud account that the device you are trying to locate is using.  Sharing the same iCloud account is usually not recommended because when you do this, any data you sync with the account is merged and you will end up with each other's data on your devices.  Also, any actions taken on one device (such as adding contacts or calendar events) will be replicated on the other.  You would also be sharing the same 5 GB of iCloud storage for your data and backups rather than each having your own.
    If you want to do this anyway, start by deciding which device will be keeping the current iCloud account.  On the one that will be changing accounts, if you have any photos in photo stream that are not in your camera roll or backed up somewhere else, save these to your camera roll by opening the photo stream album in the thumbnail view, tapping Edit, then tapping all the photos you want to save, then Share, then Save to Camera Roll.
    Once this is done, go to Settings>iCloud, scroll to the bottom and tap Delete Account.  When prompted about what to do with the iCloud data, be sure to select Keep On My iDevice.  Next, sign into the other account, turn iCloud data syncing for contacts, etc. back to On, and when prompted about merging with iCloud, choose Merge.  This will upload the data to the new account and merge it with the data that is already there.

  • TS3988 The Apple ID that I use isn't the one I want for my iCloud. How do I remove the account that is already logged in without deleting everything off my iPhone?

    All of the iPhones we have in my house are all connected to the same Apple ID. If I upload all of my stuff to iCloud will it merge with anything anyone else uploads?
    Thanks!

    Yes.  All of the phones that use the same ID for iCloud will be sharing the same iCloud account.  Any data synced with the shared account will be merged and will appear on all devices.  Also, any actions taken on synced data (such as deleting contacts or adding calendar events) will be replicated on all devices sharing the account.
    To keep your data separate, everyone should use a separate iCloud account with a separate ID on their devices.  You can continue to share the same ID for iTunes; it does not need to be the same as the ID you use for iCloud.

  • Best Practice for 2008R2 DC off site backup

    Hello.
    At the request of management I am trying to finalize a plan that covers the different potential events that could afflict our domain controllers.
    We currently have 2 DCs in our environment; one holds all the FSMO roles, so we have redundancy if one goes down, and I can use DCPromo to build a new one and add it to the domain, then do metadata cleanup if needed.
    I have system state backups, which will be useful if something happens to AD and I still have the original hardware to restore onto.
    However, if something horrific happens and I lose both DCs and need to restore to new hardware, I am dubious of the reliability of these system state backups, as I have tested them in the past and often got BSOD issues.
    I toyed with the idea of having a 3rd DC hosted off-site in a data center, replicating to it, and then I could use it to rebuild new ones on-site if such a disaster were to occur.
    Anyone have any suggestions or ideas on this one, or speak from their own experiences on this subject?
    Many Thanks

    Having an off-site DC is always good, but in the event of a replicated failure it's not going to help. There are situations where you would need to do a forest recovery (for example, backing out of a schema update); then you need to restore a system state backup of one DC and preferably re-install the others.
    Enfo Zipper
    Christoffer Andersson – Principal Advisor
    http://blogs.chrisse.se - Directory Services Blog

  • Level Number of Current member using MDX

    Hi
    What is the MDX formula for finding the Level number of the Current Member of the dimension? I tried the following as per the document but it says syntax error.
    Year.CurrentMember.Level
    Thanks
    Kannan.

    If you have registered a MemberListener on that replicated cache's cache service, you should have received a MemberLeft event when the replicated cache service was shut down on that node, unless the JVM died or was killed and Coherence therefore had no chance to send out the event.
    Has anything unusual shown up in the log?
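    For context, here is a minimal sketch of registering such a listener on a replicated cache's service using the Coherence com.tangosol.net API; the cache name is a hypothetical placeholder:

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.MemberEvent;
    import com.tangosol.net.MemberListener;
    import com.tangosol.net.NamedCache;

    // Report members joining and leaving the cache service. If a JVM dies
    // abruptly, no MemberLeft event may be delivered for it.
    public class DepartureWatcher {
        public static void main(String[] args) {
            NamedCache cache = CacheFactory.getCache("example-cache"); // hypothetical name
            cache.getCacheService().addMemberListener(new MemberListener() {
                public void memberJoined(MemberEvent e)  { System.out.println("joined: "  + e.getMember()); }
                public void memberLeaving(MemberEvent e) { System.out.println("leaving: " + e.getMember()); }
                public void memberLeft(MemberEvent e)    { System.out.println("left: "    + e.getMember()); }
            });
        }
    }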

  • Problem establishing new feeder channels following an election when ReplicationGroupAdmin or DbPing used

    This question pertains to 5.0.84.  I didn't find any prior posts here with specifically the same problem.
    My application uses JE HA with replication across 4 nodes.  We've been testing various failure scenarios in which we network-isolate the Master node.  We see the 3 surviving replicas call an election, a Master is determined, feeder channels are established and the overall replication group resumes its expected functionality.   We also tested scenarios in which the Master of the now-reduced 3-node replication group is similarly network-isolated.    The results generally worked but were less consistent when the group transitions from 3 to 2 nodes.   We thought that adjusting the quorum (electable group size override) was the proper way to address this issue and had good results by manually adjusting the electable group size override downward.  
    We implemented a StateChangeListener, and in its stateChange() method we attempt to automate adjustment of the electable group size override by calling the getNodeState() methods of (alternately) DbPing and ReplicationGroupAdmin to query each of the 4 possible nodes to see which are reachable and in a non-DETACHED state. However, it appears that the very use of either DbPing or ReplicationGroupAdmin within our application at any node causes the establishment of feeder channels following the election of a new Master to fail (regardless of the number of surviving electable nodes). This seems rather counterintuitive: introducing either DbPing or ReplicationGroupAdmin within our application code creates a Heisenberg-uncertainty-principle effect, in which attempting to observe the number of active nodes changes the behavior of the replication group, or more specifically, the feeder channel establishment. Is there a prohibition against using DbPing, ReplicationGroupAdmin, or the getNodeState() methods within an electable node application? Likewise, does using these objects/methods within a monitor app change behavior among electable nodes following an election? We seem to be observing such behavior.
    Thanks,
    Ted
    Here is the je.info.0 log, from node01.  Node02 initially is Master.   At 17:57:55, we network-isolate node02.  We see an election is called, node01 wins and begins trying to establish feeders but cannot.
    130903 17:55:56:752 INFO [node01] Started ServiceDispatcher. HostPort=ap01.domain.acme.net:13000
    130903 17:55:56:867 INFO [node01] Current group size: 4
    130903 17:55:56:868 INFO [node01] Existing node node01 querying for a current master.
    130903 17:55:56:913 INFO [node01] Node node01 started
    130903 17:55:56:914 INFO [node01] Election initiated; election #1
    130903 17:55:56:926 INFO [node01] Started election thread Tue Sep 03 17:55:56 PDT 2013
    130903 17:56:29:308 INFO [node01] Master changed to node02
    130903 17:56:29:310 INFO [node01] Election finished. Elapsed time: 32396ms
    130903 17:56:29:310 INFO [node01] Exiting election after 5 retries
    130903 17:56:29:311 INFO [node01] Election thread exited. Group master: node02(64)
    130903 17:56:29:310 INFO [node01] Replica loop started with master: node02(64)
    130903 17:56:29:343 INFO [node01] Replica-feeder handshake start
    130903 17:56:29:536 INFO [node01] Replica-feeder node02 handshake completed.
    130903 17:56:29:553 INFO [node01] Replica-feeder node02 syncup started. Replica range: first=843,326 last=843,393 sync=843,393 txnEnd=843,393
    130903 17:56:29:650 INFO [node01] Rollback to matchpoint 843,393 at 0x33/0x2d38 status=No active txns, nothing to rollback
    130903 17:56:29:651 INFO [node01] Replica-feeder node02 start stream at VLSN: 843,394
    130903 17:56:29:652 INFO [node01] Replica-feeder node02 syncup ended. Elapsed time: 102ms
    130903 17:56:29:658 INFO [node01] Replica initialization completed. Replica VLSN: 843,393  Heartbeat master commit VLSN: 843,393 VLSN delta: 0
    130903 17:56:29:662 INFO [node01] Joined group as a replica.  join consistencyPolicy=PointConsistencyPolicy targetVLSN=843,393 first=843,326 last=843,393 sync=843,393 txnEnd=843,393
    130903 17:56:29:667 INFO [node01] Replay thread started. Message queue size:1000
    130903 17:56:29:666 INFO [node01] Refreshed 0 monitors.
    130903 17:56:30:939 WARNING [node01] Electable group size override changed to:3
    130903 17:57:54:860 INFO [node01] Inactive channel: node02(64) forced close. Timeout: 7000ms.
    130903 17:57:55:406 INFO [node01] Exiting inner Replica loop.
    130903 17:57:55:407 INFO [node01] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:57:56:049 INFO [node01] Election initiated; election #2
    130903 17:57:56:050 INFO [node01] Election in progress. Waiting....
    130903 17:57:56:057 INFO [node01] Started election thread Tue Sep 03 17:57:56 PDT 2013
    130903 17:58:08:115 INFO [node01] Master changed to node01
    130903 17:58:08:116 INFO [node01] Election finished. Elapsed time: 12067ms
    130903 17:58:08:121 INFO [node01] Request for unknown Service: Feeder Registered services: [Acceptor, Learner, LogFileFeeder, LDiff, NodeState, Group, BinaryNodeState]
    130903 17:58:08:122 INFO [node01] Request for unknown Service: Feeder Registered services: [Acceptor, Learner, LogFileFeeder, LDiff, NodeState, Group, BinaryNodeState]
    130903 17:58:08:122 INFO [node01] Exiting election after 2 retries
    130903 17:58:08:123 INFO [node01] Election thread exited. Group master: node01(61)
    130903 17:58:09:124 INFO [node01] Request for unknown Service: Feeder Registered services: [Acceptor, Learner, LogFileFeeder, LDiff, NodeState, Group, BinaryNodeState]
    130903 17:58:09:130 INFO [node01] Request for unknown Service: Feeder Registered services: [Acceptor, Learner, LogFileFeeder, LDiff, NodeState, Group, BinaryNodeState]
    130903 17:58:10:128 INFO [node01] Request for unknown Service: Feeder Registered services: [Acceptor, Learner, LogFileFeeder, LDiff, NodeState, Group, BinaryNodeState]
    130903 17:58:10:137 INFO [node01] Request for unknown Service: Feeder Registered services: [Acceptor, Learner, LogFileFeeder, LDiff, NodeState, Group, BinaryNodeState]
    130903 17:58:11:132 INFO [node01] Request for unknown Service: Feeder Registered services: [Acceptor, Learner, LogFileFeeder, LDiff, NodeState, Group, BinaryNodeState]
    ...      The same message repeats ad nauseam....
    Here is node02's je.info.0 log for the same time period.   Node02 starts as Master but this host becomes network-isolated and shuts down its feeder channels at 17:57:55:
    130903 17:56:26:861 INFO [node02] Chose lowest utilized file for cleaning. fileChosen: 0x2f totalUtilization: 50 bestFileUtilization: 0 lnSizeCorrectionFactor: 0.55577517 isProbe: false
    130903 17:56:26:912 INFO [node02] CleanerRun 1 ends on file 0x2f probe=false invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=1 nINsObsolete=0 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=0 nLNsCleaned=0 nLNsDead=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 logSummary=<CleanerLogSummary endFileNumAtLastAdjustment="0x33" initialAdjustments="5" recentLNSizesAndCounts="Cor:1274201/3219-Est:1494628/3219 Cor:535/6-Est:965/11 Cor:788/8-Est:1190/13 Cor:789/8-Est:1289/14 Cor:714/2-Est:1310/9 Cor:37694/611-Est:178596/619 Cor:168637/3961-Est:1164331/3961 Cor:62/1-Est:2286/21 Cor:22130/1446-Est:417561/1451 Cor:2429020/9915-Est:3837944/9915 "> inSummary=<INSummary totalINCount="0" totalINSize="0" totalBINDeltaCount="0" totalBINDeltaSize="0" obsoleteINCount="0" obsoleteINSize="0" obsoleteBINDeltaCount="0" obsoleteBINDeltaSize="0"/> estFileSummary=<summary totalCount="1" totalSize="31" totalINCount="0" totalINSize="0" totalLNCount="0" totalLNSize="0" maxLNSize="0" obsoleteINCount="0" obsoleteLNCount="0" obsoleteLNSize="0" obsoleteLNSizeCounted="0" getObsoleteSize="31" getObsoleteINSize="0" getObsoleteLNSize="0" getMaxObsoleteSize="31" getMaxObsoleteLNSize="0" getAvgObsoleteLNSizeNotCounted="NaN"/> recalcFileSummary=<summary totalCount="1" totalSize="31" totalINCount="0" totalINSize="0" totalLNCount="0" totalLNSize="0" maxLNSize="0" obsoleteINCount="0" obsoleteLNCount="0" obsoleteLNSize="0" obsoleteLNSizeCounted="0" getObsoleteSize="31" getObsoleteINSize="0" getObsoleteLNSize="0" getMaxObsoleteSize="31" getMaxObsoleteLNSize="0" getAvgObsoleteLNSizeNotCounted="NaN"/> lnSizeCorrection=0.55577517 newLnSizeCorrection=0.55577517 estimatedUtilization=0 correctedUtilization=0 recalcUtilization=0
    130903 17:56:26:973 INFO [node02] Started ServiceDispatcher. HostPort=ap02.domain.acme.net:13000
    130903 17:56:26:992 INFO [node02] Request for unknown Service: Acceptor Registered services: []
    130903 17:56:27:108 INFO [node02] Current group size: 4
    130903 17:56:27:108 INFO [node02] Existing node node02 querying for a current master.
    130903 17:56:27:186 INFO [node02] Node node02 started
    130903 17:56:27:186 INFO [node02] Election initiated; election #1
    130903 17:56:27:206 INFO [node02] Started election thread Tue Sep 03 17:56:27 PDT 2013
    130903 17:56:29:290 INFO [node02] Winning proposal: Proposal(00000140e6787717:0000000000000000000000000a000253:00000001) Value: Value:10.0.2.83$$$13000$$$node02$$$64
    130903 17:56:29:299 INFO [node02] Master changed to node02
    130903 17:56:29:301 INFO [node02] Election finished. Elapsed time: 2114ms
    130903 17:56:29:302 INFO [node02] Feeder manager accepting requests.
    130903 17:56:29:309 INFO [node02] Request for unknown Service: Feeder Registered services: [Acceptor, Learner, LogFileFeeder, LDiff, NodeState, Group, BinaryNodeState]
    130903 17:56:29:312 INFO [node02] Joining group as master
    130903 17:56:29:317 INFO [node02] Refreshed 0 monitors.
    130903 17:56:29:318 INFO [node02] Election thread exited. Group master: node02(64)
    130903 17:56:29:374 INFO [node02] Feeder accepted connection from java.nio.channels.SocketChannel[connected local=/10.0.2.83:13000 remote=/10.0.2.82:60052]
    130903 17:56:29:477 INFO [node02] Feeder-replica handshake start
    130903 17:56:29:539 INFO [node02] Feeder-replica node01 handshake completed.
    130903 17:56:29:542 INFO [node02] Feeder-replica node01 syncup started. Feeder range: first=843,326 last=843,393 sync=843,393 txnEnd=843,393
    130903 17:56:29:654 INFO [node02] Feeder-replica node01 start stream at VLSN: 843,394
    130903 17:56:29:656 INFO [node02] Feeder-replica node01 syncup ended. Elapsed time: 113ms
    130903 17:56:29:660 INFO [node02] Feeder output thread for replica node01 started at VLSN 843,394 master at 843,393 VLSN delta=-1 socket=(node01(61))java.nio.channels.SocketChannel[connected local=/10.0.2.83:13000 remote=/10.0.2.82:60052]
    130903 17:56:30:321 INFO [node02] Feeder accepted connection from java.nio.channels.SocketChannel[connected local=/10.0.2.83:13000 remote=/10.0.2.63:47920]
    130903 17:56:30:326 INFO [node02] Feeder-replica handshake start
    130903 17:56:30:349 INFO [node02] Feeder-replica node04 handshake completed.
    130903 17:56:30:350 INFO [node02] Feeder-replica node04 syncup started. Feeder range: first=843,326 last=843,395 sync=843,395 txnEnd=843,395
    130903 17:56:30:363 INFO [node02] Feeder-replica node04 start stream at VLSN: 843,394
    130903 17:56:30:364 INFO [node02] Feeder-replica node04 syncup ended. Elapsed time: 14ms
    130903 17:56:30:383 INFO [node02] Feeder output thread for replica node04 started at VLSN 843,394 master at 843,395 VLSN delta=1 socket=(node04(62))java.nio.channels.SocketChannel[connected local=/10.0.2.83:13000 remote=/10.0.2.63:47920]
    130903 17:56:31:678 WARNING [node02] Electable group size override changed to:3
    130903 17:57:21:200 INFO [node02] Feeder accepted connection from java.nio.channels.SocketChannel[connected local=/10.0.2.83:13000 remote=/10.0.2.84:38224]
    130903 17:57:21:206 INFO [node02] Feeder-replica handshake start
    130903 17:57:21:272 INFO [node02] Feeder-replica node03 handshake completed.
    130903 17:57:21:273 INFO [node02] Feeder-replica node03 syncup started. Feeder range: first=843,326 last=843,399 sync=843,399 txnEnd=843,399
    130903 17:57:21:389 INFO [node02] Feeder-replica node03 start stream at VLSN: 843,394
    130903 17:57:21:390 INFO [node02] Feeder-replica node03 syncup ended. Elapsed time: 117ms
    130903 17:57:21:392 INFO [node02] Feeder output thread for replica node03 started at VLSN 843,394 master at 843,399 VLSN delta=5 socket=(node03(63))java.nio.channels.SocketChannel[connected local=/10.0.2.83:13000 remote=/10.0.2.84:38224]
    130903 17:57:55:115 INFO [node02] Inactive channel: node01(61) forced close. Timeout: 7000ms.
    130903 17:57:55:115 INFO [node02] Inactive channel: node04(62) forced close. Timeout: 7000ms.
    130903 17:57:55:116 INFO [node02] Shutting down feeder for replica node01 Reason: null write time:  2ms Avg write time: 98us
    130903 17:57:55:125 INFO [node02] Inactive channel: node03(63) forced close. Timeout: 7000ms.
    130903 17:57:55:126 INFO [node02] Shutting down feeder for replica node04 Reason: null write time:  3ms Avg write time: 147us
    130903 17:57:55:127 INFO [node02] Shutting down feeder for replica node03 Reason: null write time:  1ms Avg write time: 90us
    130903 17:57:55:409 INFO [node02] Feeder output for replica node01 shutdown. feeder VLSN: 843,402 currentTxnEndVLSN: 843,401
    130903 17:57:55:410 INFO [node02] Feeder output for replica node03 shutdown. feeder VLSN: 843,402 currentTxnEndVLSN: 843,401
    130903 17:57:55:411 INFO [node02] Feeder output for replica node04 shutdown. feeder VLSN: 843,402 currentTxnEndVLSN: 843,401
    130903 17:58:49:456 INFO [node02] Master changed to node01
    130903 17:58:50:227 INFO [node02] Master change: Master change. Node master id: node02(64) Group master id: node01(61)
    130903 17:58:50:228 INFO [node02] Releasing commit block latch
    130903 17:58:50:229 INFO [node02] Feeder manager exited. CurrentTxnEnd VLSN: 843,401
    130903 17:58:50:230 INFO [node02] RepNode main thread shutting down.
    130903 17:58:50:233 INFO [node02] RepNode shutdown exception:
    (JE 5.0.84) node02(64):/data/node02 com.sleepycat.je.rep.stream.MasterStatus$MasterSyncException: Master change. Node master id: node02(64) Group master id: node01(61) MASTER_TO_REPLICA_TRANSITION: This node was a master and must reinitialize internal state to become a replica. The application must close and reopen all Environment handles. Environment is invalid and must be closed.node02(64)[MASTER]
    No feeders.
    GlobalCBVLSN=843,393
    Group info [RepGroup] 8e8b97b5-f9bc-4b9e-bb6e-56edfbe79674
    Representation version: 2
    Change version: 4
    Max rep node ID: 64
    Node:node02 ap02.domain.acme.net:13000 (is member) changeVersion:4 LocalCBVLSN:843,393 at:Tue Sep 03 17:56:31 PDT 2013
    Node:node03 ap03.domain.acme.net:13000 (is member) changeVersion:3 LocalCBVLSN:843,393 at:Tue Sep 03 17:57:21 PDT 2013
    Node:node01 ap01.domain.acme.net:13000 (is member) changeVersion:1 LocalCBVLSN:843,393 at:Tue Sep 03 17:56:29 PDT 2013
    Node:node04 ap04.domain.acme.net:13000 (is member) changeVersion:2 LocalCBVLSN:843,393 at:Tue Sep 03 17:56:30 PDT 2013
    vlsnRange=first=843,326 last=843,401 sync=843,401 txnEnd=843,401
    lastReplayedTxn=null lastReplayedVLSN=843,393 numActiveReplayTxns=0
    130903 17:58:50:234 INFO [node02] Shutting down node node02(64)
    130903 17:58:50:235 INFO [node02] Elections shutdown initiated
    130903 17:58:50:238 INFO [node02] Elections shutdown completed
    130903 17:58:50:239 INFO [node02] RepNode main thread: MASTER node02(64) exited.
    130903 17:58:50:240 INFO [node02] ServiceDispatcher shutdown starting. HostPort=ap02.domain.acme.net:13000 Registered services: []
    130903 17:58:50:243 INFO [node02] ServiceDispatcher shutdown completed. HostPort=ap02.domain.acme.net:13000
    130903 17:58:50:243 INFO [node02] node02(64) shutdown completed.
    Here is node03's log. We see that node01 wins the election, but node03 is unable to establish a replica-feeder channel to it:
    130903 17:57:20:961 INFO [node03] CleanerRun 1 ends on file 0x2f probe=false invokedFromDaemon=true finished=true fileDeleted=false nEntriesRead=1 nINsObsolete=0 nINsCleaned=0 nINsDead=0 nINsMigrated=0 nBINDeltasObsolete=0 nBINDeltasCleaned=0 nBINDeltasDead=0 nBINDeltasMigrated=0 nLNsObsolete=0 nLNsCleaned=0 nLNsDead=0 nLNsMigrated=0 nLNsMarked=0 nLNQueueHits=0 nLNsLocked=0 logSummary=<CleanerLogSummary endFileNumAtLastAdjustment="0x32" initialAdjustments="5" recentLNSizesAndCounts="Cor:1274201/3219-Est:1494628/3219 Cor:535/6-Est:965/11 Cor:788/8-Est:1190/13 Cor:789/8-Est:1289/14 Cor:714/2-Est:1310/9 Cor:37694/611-Est:178596/619 Cor:168637/3961-Est:1164331/3961 Cor:62/1-Est:2286/21 Cor:22130/1446-Est:417561/1451 Cor:2429020/9915-Est:3837944/9915 "> inSummary=<INSummary totalINCount="0" totalINSize="0" totalBINDeltaCount="0" totalBINDeltaSize="0" obsoleteINCount="0" obsoleteINSize="0" obsoleteBINDeltaCount="0" obsoleteBINDeltaSize="0"/> estFileSummary=<summary totalCount="1" totalSize="31" totalINCount="0" totalINSize="0" totalLNCount="0" totalLNSize="0" maxLNSize="0" obsoleteINCount="0" obsoleteLNCount="0" obsoleteLNSize="0" obsoleteLNSizeCounted="0" getObsoleteSize="31" getObsoleteINSize="0" getObsoleteLNSize="0" getMaxObsoleteSize="31" getMaxObsoleteLNSize="0" getAvgObsoleteLNSizeNotCounted="NaN"/> recalcFileSummary=<summary totalCount="1" totalSize="31" totalINCount="0" totalINSize="0" totalLNCount="0" totalLNSize="0" maxLNSize="0" obsoleteINCount="0" obsoleteLNCount="0" obsoleteLNSize="0" obsoleteLNSizeCounted="0" getObsoleteSize="31" getObsoleteINSize="0" getObsoleteLNSize="0" getMaxObsoleteSize="31" getMaxObsoleteLNSize="0" getAvgObsoleteLNSizeNotCounted="NaN"/> lnSizeCorrection=0.55577517 newLnSizeCorrection=0.55577517 estimatedUtilization=0 correctedUtilization=0 recalcUtilization=0
    130903 17:57:20:991 INFO [node03] Started ServiceDispatcher. HostPort=ap03.domain.acme.net:13000
    130903 17:57:21:111 INFO [node03] Current group size: 4
    130903 17:57:21:112 INFO [node03] Existing node node03 querying for a current master.
    130903 17:57:21:155 INFO [node03] Master changed to node02
    130903 17:57:21:182 INFO [node03] Node node03 started
    130903 17:57:21:183 INFO [node03] Replica loop started with master: node02(64)
    130903 17:57:21:205 INFO [node03] Replica-feeder handshake start
    130903 17:57:21:269 INFO [node03] Replica-feeder node02 handshake completed.
    130903 17:57:21:282 INFO [node03] Replica-feeder node02 syncup started. Replica range: first=843,326 last=843,393 sync=843,393 txnEnd=843,393
    130903 17:57:21:385 INFO [node03] Rollback to matchpoint 843,393 at 0x32/0x55da status=No active txns, nothing to rollback
    130903 17:57:21:386 INFO [node03] Replica-feeder node02 start stream at VLSN: 843,394
    130903 17:57:21:387 INFO [node03] Replica-feeder node02 syncup ended. Elapsed time: 105ms
    130903 17:57:21:391 INFO [node03] Replica initialization completed. Replica VLSN: 843,393  Heartbeat master commit VLSN: 843,399 VLSN delta: 6
    130903 17:57:21:396 INFO [node03] Replay thread started. Message queue size:1000
    130903 17:57:21:417 INFO [node03] Joined group as a replica.  join consistencyPolicy=PointConsistencyPolicy targetVLSN=843,399 first=843,326 last=843,399 sync=843,399 txnEnd=843,399
    130903 17:57:21:421 INFO [node03] Refreshed 0 monitors.
    130903 17:57:55:069 INFO [node03] Inactive channel: node02(64) forced close. Timeout: 7000ms.
    130903 17:57:55:406 INFO [node03] Exiting inner Replica loop.
    130903 17:57:55:407 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:57:56:059 INFO [node03] Election initiated; election #1
    130903 17:57:56:068 INFO [node03] Started election thread Tue Sep 03 17:57:56 PDT 2013
    130903 17:58:08:101 INFO [node03] Winning proposal: Proposal(00000140e679d235:0000000000000000000000000a000254:00000001) Value: Value:10.0.2.82$$$13000$$$node01$$$61
    130903 17:58:08:108 INFO [node03] Master changed to node01
    130903 17:58:08:112 INFO [node03] Election finished. Elapsed time: 12053ms
    130903 17:58:08:113 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:08:124 INFO [node03] Exiting inner Replica loop.
    130903 17:58:08:125 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:08:126 INFO [node03] Retry #: 0/10 Will retry replica loop after 1000ms.
    130903 17:58:09:127 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:09:131 INFO [node03] Exiting inner Replica loop.
    130903 17:58:09:131 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:09:132 INFO [node03] Retry #: 1/10 Will retry replica loop after 1000ms.
    130903 17:58:10:133 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:10:138 INFO [node03] Exiting inner Replica loop.
    130903 17:58:10:138 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:10:139 INFO [node03] Retry #: 2/10 Will retry replica loop after 1000ms.
    130903 17:58:11:139 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:11:142 INFO [node03] Exiting inner Replica loop.
    130903 17:58:11:142 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:11:143 INFO [node03] Retry #: 3/10 Will retry replica loop after 1000ms.
    130903 17:58:12:144 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:12:146 INFO [node03] Exiting inner Replica loop.
    130903 17:58:12:147 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:12:147 INFO [node03] Retry #: 4/10 Will retry replica loop after 1000ms.
    130903 17:58:13:148 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:13:162 INFO [node03] Exiting inner Replica loop.
    130903 17:58:13:222 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:13:222 INFO [node03] Retry #: 5/10 Will retry replica loop after 1000ms.
    130903 17:58:14:223 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:14:225 INFO [node03] Exiting inner Replica loop.
    130903 17:58:14:226 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:14:226 INFO [node03] Retry #: 6/10 Will retry replica loop after 1000ms.
    130903 17:58:15:227 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:15:229 INFO [node03] Exiting inner Replica loop.
    130903 17:58:15:230 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:15:230 INFO [node03] Retry #: 7/10 Will retry replica loop after 1000ms.
    130903 17:58:16:231 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:16:233 INFO [node03] Exiting inner Replica loop.
    130903 17:58:16:234 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:16:234 INFO [node03] Retry #: 8/10 Will retry replica loop after 1000ms.
    130903 17:58:17:235 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:17:237 INFO [node03] Exiting inner Replica loop.
    130903 17:58:17:238 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:17:238 INFO [node03] Retry #: 9/10 Will retry replica loop after 1000ms.
    130903 17:58:18:112 INFO [node03] Election thread exited. Group master: node01(61)
    130903 17:58:18:239 INFO [node03] Replica loop started with master: node01(61)
    130903 17:58:18:280 INFO [node03] Exiting inner Replica loop.
    130903 17:58:18:280 INFO [node03] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:18:281 INFO [node03] Failed to recover from exception: Failed after retries: 10 with retry interval: 1000ms., despite 10 retries.
    com.sleepycat.je.rep.impl.node.Replica$ConnectRetryException: Failed after retries: 10 with retry interval: 1000ms.
            at com.sleepycat.je.rep.impl.node.Replica.createReplicaFeederChannel(Replica.java:777)
            at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:578)
            at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:392)
            at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:328)
            at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1402)
    130903 17:58:18:282 INFO [node03] Election initiated; election #2
    130903 17:58:18:282 INFO [node03] Election in progress. Waiting....
                .... The retries and the ConnectRetryException repeat ad nauseam....
    We see a similar pattern in the log of node04.  We observe at 17:58:08:101 that node01 has become Master but we're unable to establish a feeder connection to it:
    130903 17:55:48:149 INFO [node04] Started ServiceDispatcher. HostPort=ap04.domain.acme.net:13000
    130903 17:55:48:230 INFO [node04] Current group size: 4
    130903 17:55:48:231 INFO [node04] Existing node node04 querying for a current master.
    130903 17:55:48:258 INFO [node04] Node node04 started
    130903 17:55:48:259 INFO [node04] Election initiated; election #1
    130903 17:55:48:266 INFO [node04] Started election thread Tue Sep 03 17:55:48 PDT 2013
    130903 17:56:29:300 INFO [node04] Master changed to node02
    130903 17:56:29:301 INFO [node04] Election finished. Elapsed time: 41043ms
    130903 17:56:29:301 INFO [node04] Replica loop started with master: node02(64)
    130903 17:56:29:303 INFO [node04] Exiting election after 5 retries
    130903 17:56:29:304 INFO [node04] Election thread exited. Group master: node02(64)
    130903 17:56:29:309 INFO [node04] Exiting inner Replica loop.
    130903 17:56:29:310 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:56:29:310 INFO [node04] Retry #: 0/10 Will retry replica loop after 1000ms.
    130903 17:56:30:310 INFO [node04] Replica loop started with master: node02(64)
    130903 17:56:30:324 INFO [node04] Replica-feeder handshake start
    130903 17:56:30:348 INFO [node04] Replica-feeder node02 handshake completed.
    130903 17:56:30:352 INFO [node04] Replica-feeder node02 syncup started. Replica range: first=843,326 last=843,393 sync=843,393 txnEnd=843,393
    130903 17:56:30:361 INFO [node04] Rollback to matchpoint 843,393 at 0x34/0x1628 status=No active txns, nothing to rollback
    130903 17:56:30:362 INFO [node04] Replica-feeder node02 start stream at VLSN: 843,394
    130903 17:56:30:362 INFO [node04] Replica-feeder node02 syncup ended. Elapsed time: 10ms
    130903 17:56:30:382 INFO [node04] Replica initialization completed. Replica VLSN: 843,393  Heartbeat master commit VLSN: 843,395 VLSN delta: 2
    130903 17:56:30:386 INFO [node04] Replay thread started. Message queue size:1000
    130903 17:56:30:397 INFO [node04] Joined group as a replica.  join consistencyPolicy=PointConsistencyPolicy targetVLSN=843,395 first=843,326 last=843,395 sync=843,395 txnEnd=843,395
    130903 17:56:30:401 INFO [node04] Refreshed 0 monitors.
    130903 17:56:30:922 WARNING [node04] Electable group size override changed to:3
    130903 17:57:55:223 INFO [node04] Inactive channel: node02(64) forced close. Timeout: 7000ms.
    130903 17:57:55:398 INFO [node04] Exiting inner Replica loop.
    130903 17:57:55:399 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:57:55:401 INFO [node04] Election initiated; election #2
    130903 17:57:55:401 INFO [node04] Election in progress. Waiting....
    130903 17:57:55:406 INFO [node04] Started election thread Tue Sep 03 17:57:55 PDT 2013
    130903 17:58:08:101 INFO [node04] Master changed to node01
    130903 17:58:08:102 INFO [node04] Election finished. Elapsed time: 12701ms
    130903 17:58:08:102 INFO [node04] Exiting election after 2 retries
    130903 17:58:08:102 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:08:103 INFO [node04] Election thread exited. Group master: node01(61)
    130903 17:58:08:110 INFO [node04] Exiting inner Replica loop.
    130903 17:58:08:110 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:08:111 INFO [node04] Retry #: 0/10 Will retry replica loop after 1000ms.
    130903 17:58:09:111 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:09:114 INFO [node04] Exiting inner Replica loop.
    130903 17:58:09:114 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:09:114 INFO [node04] Retry #: 1/10 Will retry replica loop after 1000ms.
    130903 17:58:10:115 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:10:117 INFO [node04] Exiting inner Replica loop.
    130903 17:58:10:118 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:10:118 INFO [node04] Retry #: 2/10 Will retry replica loop after 1000ms.
    130903 17:58:11:118 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:11:121 INFO [node04] Exiting inner Replica loop.
    130903 17:58:11:121 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:11:122 INFO [node04] Retry #: 3/10 Will retry replica loop after 1000ms.
    130903 17:58:12:122 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:12:125 INFO [node04] Exiting inner Replica loop.
    130903 17:58:12:125 INFO [node04] Retry #: 4/10 Will retry replica loop after 1000ms.
    130903 17:58:13:126 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:13:129 INFO [node04] Exiting inner Replica loop.
    130903 17:58:13:129 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:13:129 INFO [node04] Retry #: 5/10 Will retry replica loop after 1000ms.
    130903 17:58:14:130 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:14:132 INFO [node04] Exiting inner Replica loop.
    130903 17:58:14:133 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:14:133 INFO [node04] Retry #: 6/10 Will retry replica loop after 1000ms.
    130903 17:58:15:133 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:15:136 INFO [node04] Exiting inner Replica loop.
    130903 17:58:15:136 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:15:137 INFO [node04] Retry #: 7/10 Will retry replica loop after 1000ms.
    130903 17:58:16:137 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:16:140 INFO [node04] Exiting inner Replica loop.
    130903 17:58:16:140 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:16:140 INFO [node04] Retry #: 8/10 Will retry replica loop after 1000ms.
    130903 17:58:17:141 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:17:143 INFO [node04] Exiting inner Replica loop.
    130903 17:58:17:144 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:17:144 INFO [node04] Retry #: 9/10 Will retry replica loop after 1000ms.
    130903 17:58:18:145 INFO [node04] Replica loop started with master: node01(61)
    130903 17:58:18:147 INFO [node04] Exiting inner Replica loop.
    130903 17:58:18:148 INFO [node04] Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
    130903 17:58:18:148 INFO [node04] Failed to recover from exception: Failed after retries: 10 with retry interval: 1000ms., despite 10 retries.
    com.sleepycat.je.rep.impl.node.Replica$ConnectRetryException: Failed after retries: 10 with retry interval: 1000ms.
            at com.sleepycat.je.rep.impl.node.Replica.createReplicaFeederChannel(Replica.java:777)
            at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:578)
            at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:392)
            at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:328)
            at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1402)
    130903 17:58:18:149 INFO [node04] Election initiated; election #3
    130903 17:58:18:149 INFO [node04] Election in progress. Waiting....
    130903 17:58:18:150 INFO [node04] Started election thread Tue Sep 03 17:58:18 PDT 2013
       ....The retries and the ConnectRetryException repeat ad nauseam....

    The issue is that you're hitting an unsupported model: you're violating the contract of the StateChangeListener by issuing an expensive operation in the middle of the stateChange method. From http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/rep/StateChangeListener.html#stateChange(com.sleepycat.je.rep.StateChangeEvent), we caution:
    void stateChange(StateChangeEvent stateChangeEvent) throws RuntimeException ....
    This method should do the minimal amount of work, queuing any resource intensive operations for processing by another thread before returning to the caller, so that it does not unduly delay the other housekeeping operations performed by the internal thread which invokes this method.
    In other words, we really mean that the work done by stateChange should not issue any I/O or network communication, and if it does so, it should let this thread return, and do the expensive work in another thread.
    "This seems rather counterintuitive: introducing either DbPing or ReplicationGroupAdmin within our application code creates a Heisenberg-uncertainty-principle effect, in which attempting to observe the number of active nodes changes the behavior of the replication group, or more specifically, the feeder channel establishment. Is there a prohibition against using DbPing, ReplicationGroupAdmin, or the getNodeState() methods within an electable node application? Likewise, does using these objects/methods within a monitor app change behavior among electable nodes following an election? We seem to be observing such behavior."
    While an application can certainly use DbPing or ReplicationGroupAdmin, the problem is that the state change listener is executing within the critical paths of the replication group state changes. You mentioned monitors, and an alternate model of watching and managing replication group membership is to use http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/rep/monitor/Monitor.html. This class, and the associated MonitorStateChangeListener, run outside the ReplicatedEnvironment, and could be another way of implementing application logic.
    So in summary, you should either use com.sleepycat.je.rep.monitor.Monitor* to implement your logic, or have the implementation of StateChangeListener hand the work off to another thread to do asynchronously, so that stateChange() can return quickly.
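    For example, here is a minimal sketch of that hand-off (the single-thread worker and the quorum-adjustment method are illustrative, not part of the JE API):

    import com.sleepycat.je.rep.ReplicatedEnvironment;
    import com.sleepycat.je.rep.StateChangeEvent;
    import com.sleepycat.je.rep.StateChangeListener;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Queue the expensive work (DbPing/ReplicationGroupAdmin queries and the
    // electable group size override adjustment) so stateChange() returns fast.
    public class AsyncStateChangeListener implements StateChangeListener {
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        public void stateChange(StateChangeEvent event) throws RuntimeException {
            final ReplicatedEnvironment.State state = event.getState();
            // No I/O or network calls here; just hand the event off and return.
            worker.submit(new Runnable() {
                public void run() {
                    adjustQuorumFor(state);  // hypothetical application method
                }
            });
        }

        private void adjustQuorumFor(ReplicatedEnvironment.State state) {
            // Safe place for DbPing/ReplicationGroupAdmin.getNodeState() calls
            // and for adjusting the electable group size override.
        }
    }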
    Please let me know if this helps.
    Thanks,
    Bogdan

  • Errors in event log of Secondary DPM server protecting replicas on Primary

    Hello again
    I have two DPM servers, one situated on-site (primary) and one situated off-site (secondary). Protection jobs seem to be running correctly on both servers in that the jobs complete and I am able to restore data from the backups. I use the primary server
    to make the initial backups of critical systems and data (Exchange MDB's etc) and the secondary server to backup those replicas off-site in case of primary site loss or DPM system loss.
    The primary server is a physical server and the secondary server is a virtual server. Both DPM servers have their DPM databases stored on one physical SQL server that is in the primary site.
    Basically, what is happening is that every day our virtual machines are snapshotted (secondary DPM server included), and every day the snapshot of the secondary DPM server fails. I see the following two entries in the event log of the secondary server.
    Error 1:
    WARNING
    Source: MSDPM
    Event ID: 955
    The description for Event ID 955 from source MSDPM cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
    If the event originated on another computer, the display information had to be saved with the event.
    The following information was included with the event:
    The consistency check resulted in the following changes to SQL Server Agent schedules: Schedules added: 2 Schedules removed: 2 Schedules updated: 0.  
    Problem Details:
    <ConsistencyCheck><__System><ID>26</ID><Seq>27861</Seq><TimeCreated>22/05/2014 23:01:31</TimeCreated><Source>SchedulerImpl.cs</Source><Line>719</Line><HasError>True</HasError></__System><Tags><JobSchedule
    /></Tags></ConsistencyCheck>
    the message resource is present but the message is not found in the string/message table
    Error 2
    ERROR
    Source: MSDPM
    Event ID: 4212
    The description for Event ID 4212 from source MSDPM cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
    If the event originated on another computer, the display information had to be saved with the event.
    The following information was included with the event:
    DpmWriter service encountered an error during PrepareBackup as more than one component is selected for backup in the same snapshot set.  Select a single DPM replica for backup and try the operation again.
    Problem Details:
    <DpmWriterEvent><__System><ID>30</ID><Seq>7</Seq><TimeCreated>23/05/2014 00:30:45</TimeCreated><Source>d:\btvsts\21011\private\product\tapebackup\dpswriter\vssfunctionality.cpp</Source><Line>438</Line><HasError>True</HasError></__System><DetailedCode>4212</DetailedCode></DpmWriterEvent>
    the message resource is present but the message is not found in the string/message table
    These two events are followed every day by another event, from VMware Tools:
    Error 3:
    WARNING
    Source: VMWare Tools
    Event ID: 1000
    [ warning] [vmvss:vmvss] CVmSnapshotRequestor::CheckWriterStatus():1536: writer DPM Writer in failed state: res = 0x800423f4, err = 0x1, error =
    Has anyone come across this before? Currently I am not quite sure what is going wrong and whether it is actually related to snapshots failing, but I want to try to fix these errors first and see what happens.
    Regards

    You are using VMware for virtualization?
    Are you trying to do an online backup of the VM? I think that will not work.
    One thing I wonder: you have installed a second DPM server in case site one fails or goes down, but the SQL for DPM 2 is in site one? Try to move the SQL to an external site for DPM 2.
    Seidl Michael | http://www.techguy.at |
    twitter.com/techguyat | facebook.com/techguyat

  • Event ID - 13568 The File Replication Service has detected that the replica set "DOMAIN SYSTEM VOLUME (SYSVOL SHARE)" is in JRNL_WRAP_ERROR.

    We had a major storm over the weekend which caused an unexpected shutdown.
    I am having an issue with one of my domain controllers, with Event ID 13568.
    The domain controller which is running Windows Server 2012 was added successfully just a couple of days ago.
    I do not have a full backup of the server yet.
    It only has a GC role on it.
    What are the things I should look out for before I attempt to Enable Journal Wrap Automatic Restore and set it to 1?
    Would it be safer to just demote the server and start from scratch?
    Thank you all for reading!
    Mladen
    The File Replication Service has detected that the replica set "DOMAIN SYSTEM VOLUME (SYSVOL SHARE)" is in JRNL_WRAP_ERROR.
     Replica set name is    : "DOMAIN SYSTEM VOLUME (SYSVOL SHARE)"
     Replica root path is   : "c:\windows\sysvol\domain"
     Replica root volume is : "\\.\C:"
     A Replica set hits JRNL_WRAP_ERROR when the record that it is trying to read from the NTFS USN journal is not found.  This can occur because of one of the following reasons.
     [1] Volume "\\.\C:" has been formatted.
     [2] The NTFS USN journal on volume "\\.\C:" has been deleted.
     [3] The NTFS USN journal on volume "\\.\C:" has been truncated. Chkdsk can truncate the journal if it finds corrupt entries at the end of the journal.
     [4] File Replication Service was not running on this computer for a long time.
     [5] File Replication Service could not keep up with the rate of Disk IO activity on "\\.\C:".
     Setting the "Enable Journal Wrap Automatic Restore" registry parameter to 1 will cause the following recovery steps to be taken to automatically recover from this
    error state.
     [1] At the first poll, which will occur in 5 minutes, this computer will be deleted from the replica set. If you do not want to wait 5 minutes, then run "net stop ntfrs"
    followed by "net start ntfrs" to restart the File Replication Service.
     [2] At the poll following the deletion this computer will be re-added to the replica set. The re-addition will trigger a full tree sync for the replica set.
    WARNING: During the recovery process data in the replica tree may be unavailable. You should reset the registry parameter described above to 0 to prevent automatic recovery from
    making the data unexpectedly unavailable if this error condition occurs again.
    To change this registry parameter, run regedit.
    Click on Start, Run and type regedit.
    Expand HKEY_LOCAL_MACHINE.
    Click down the key path:
       "System\CurrentControlSet\Services\NtFrs\Parameters"
    Double click on the value name
       "Enable Journal Wrap Automatic Restore"
    and update the value.
    If the value name is not present you may add it with the New->DWORD Value function under the Edit Menu item. Type the value name exactly as shown above.
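    For reference, a minimal sketch of the same registry change from an elevated PowerShell prompt (the key path and value name are taken verbatim from the event text above; restarting NtFrs triggers the first poll immediately instead of waiting 5 minutes):
        # Create (or overwrite) the DWORD value named in the event text.
        New-ItemProperty -Path "HKLM:\System\CurrentControlSet\Services\NtFrs\Parameters" `
            -Name "Enable Journal Wrap Automatic Restore" `
            -PropertyType DWord -Value 1 -Force
        # Restart the File Replication Service so the first poll happens now.
        Restart-Service NtFrs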

    I set Enable Journal Wrap Automatic Restore to 1 and the recovery was successful.
    I will monitor it to make sure it does not occur again.
    Thanks, everyone, for your replies.
    Mladen

  • EVENT ID 30011 Error: 85 (Timeout) ReplicationType:LocalRegistrarReplication Source: LS User Replicator

    When we try to move Lync users between two pools in the same site, we receive an error saying the move operation failed because the user is not provisioned.
    We have two Lync 2013 servers (Windows Server 2008 R2) and SQL Server 2012 SP2.
    In Event Viewer we noticed the replication error below:
    Event ID 30011
    "Encountered an unrecognized error while processing objects from a domain. This error caused User Replicator to abort synchronization of this domain.  Synchronization will be retried for this domain.
    If this domain is not enabled for Lync Server, then this error can be ignored.
    Domain: **.*** (DN: DC=**,DC=**) Error: 85 (Timeout) ReplicationType:LocalRegistrarReplication
    Cause: The cause for this error can vary. Please review the errors listed above.
    Resolution:
    Contact support services if the error is not descriptive enough to remedy the problem."
    Best Regards, Fadi.F.Haddad

    Hi,
    I think you have several issues. First, can you check whether your Lync server is synced with the domain? There could be some stale domain entries that were created or deleted for testing purposes.
    Can you run this command from the Lync FE and check that the expected domains are listed: Get-CsUserReplicatorConfiguration
    You can remove any unwanted entries.
    If you want to add (or remove) domains in the list:
        Set-CsUserReplicatorConfiguration -Identity global -ADDomainNamingContextList @{add="dc=Zubi,dc=local"}
        Set-CsUserReplicatorConfiguration -Identity global -ADDomainNamingContextList @{remove="dc=Zubi,dc=local"}
    You can also check the basics: whether DNS is replicating between the domain and all of your Front End servers.
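    After changing the configuration, a quick way to confirm what User Replicator will sync (a hedged sketch; Get-CsUserReplicatorConfiguration is the standard Lync Server cmdlet and ADDomainNamingContextList is its documented property):
        # List the domain naming contexts User Replicator is configured to synchronize.
        Get-CsUserReplicatorConfiguration |
            Select-Object -ExpandProperty ADDomainNamingContextList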
    Regards
    Zahoor

  • Calendars duplicated, events replicated moving to iCloud

    I'd like to use iCloud. I really would. But every time I try moving my calendars (iCal) to the cloud, two calendars are partially duplicated (each splits into two calendars with the same name, with some events in both but neither containing all the events of the original), and many events are replicated, sometimes many times over. Any idea how I can get a "clean" copy of my calendars in iCloud?

    What do you mean by that? From what I'm reading, we won't be able to use iCloud features without Lion anyway.

  • Replicating MENU_BUTTON, TOOLBAR events

    Hi All,
    I wanted to know whether the events in CL_GUI_ALV_GRID - MENU_BUTTON and TOOLBAR - are available in the new ALV Object Model. If not, is there a way to replicate them?
    I couldn't find anything matching these events in the various CL_SALV* classes.
    Thanks and Regards,
    Vidya.

    The SALV object model has a Functions object that does a job similar to the TOOLBAR event.
    In SALV, to add your own function to the ALV grid, get the functions object and call its method ADD_FUNCTION:
    *... §3.1 activate ALV generic functions
        data: lr_functions type ref to cl_salv_functions,
              l_text       type string,
              l_icon       type string.
        " gr_table is the CL_SALV_TABLE instance; gc_true equals abap_true in the demo program.
        lr_functions = gr_table->get_functions( ).
        lr_functions->set_all( gc_true ).
    *... §3.2 include own functions
        l_text = 'My Button'.
        l_icon = icon_complete.
        try.
            " Add a self-defined function to the ALV toolbar.
            lr_functions->add_function(
              name     = 'MYFUNCTION'
              icon     = l_icon
              text     = l_text
              tooltip  = l_text
              position = if_salv_c_function_position=>right_of_salv_functions ).
          catch cx_salv_existing cx_salv_wrong_call.
        endtry.
    To handle the added function, register a handler for the event ADDED_FUNCTION:
        data: lr_events type ref to cl_salv_events_table,
              gr_events type ref to lcl_handle_events. " handler instance (declaration added here)
        lr_events = gr_table->get_event( ).
        create object gr_events.
        set handler gr_events->on_user_command for lr_events.
    Event handler class
    class lcl_handle_events definition.
      public section.
        methods:
          on_user_command for event added_function of cl_salv_events
            importing e_salv_function.
    endclass.                    "lcl_handle_events DEFINITION
    class lcl_handle_events implementation.
      method on_user_command.
        perform show_function_info using e_salv_function text-i08.
      endmethod.                    "on_user_command
    endclass.                    "lcl_handle_events IMPLEMENTATION
    Check program SALV_DEMO_TABLE_FUNCTIONS for more information.
    Regards,
    Naimesh Patel
