Split Brain Scenario

Hello,
Whilst testing Data Guard with FSFO, I seem to have managed to achieve a split-brain situation where two databases in the Data Guard configuration were considered primary databases and both were available to receive client connections. I don't understand why this happened and I'm looking for some assurance that this is not a bug in Oracle.
My Data Guard Configuration is as follows:-
1 primary database (DBa)
3 physical standby databases (DBb, DBc, DBd)
All databases are single instance, i.e. no RAC, and are running Oracle 11g (11.1.0.7) on RHEL 5.3.
DBb is the Fast-Start Failover target for DBa. The FSFO observer process is running on a stand-alone server called OBS1.
To simulate a 'data centre disaster' I did the following :-
1) Kill the SMON processes on the servers running DBa and DBb (Note I did not kill the Observer process)
2) From DGMGRL on the server running DBc issue the following commands :-
DGMGRL> disable fast_start failover force (Without doing this I could not issue the subsequent failover command)
DGMGRL> failover to DBc ;
This worked as expected and DBc was established as the new primary database in the configuration. DBd continued to function correctly as a standby database. Subsequent client connections were routed to DBc as expected.
3) I then attempted to simulate the two failed databases DBa and DBb rejoining the configuration. Firstly I put DBa into MOUNT status using the STARTUP MOUNT command from the SQL*Plus command line.
4) Before I did anything with DBb, the Observer process that was still running on OBS1 detected that DBa was 'active' again and OPENed the database. In doing this it took no notice of the fact that DBc was already open and acting as the primary database in the configuration. The result was that two databases in the configuration, DBa and DBc, were in an OPEN state and acting as a primary database, i.e. split brain. The TNSNAMES.ora configuration on the Oracle machines meant that it was now perfectly possible for client connections to be spread over both these machines.
I am very concerned as to why Oracle allowed the above situation to happen. Was my test unreasonable, or should Oracle have detected that DBc was the new primary database after I attempted to restart DBa in MOUNT state? I now understand that if I had also issued the STOP OBSERVER command in DGMGRL, after issuing the FAILOVER to DBc command, then the FSFO Observer could not have OPENed DBa once DBc had become the primary database. So is this the only mechanism that must be used in the above scenario to prevent a split brain?
Any advice would be greatly appreciated.
Thanks,
Shaun

Your environment, with three standbys, is fairly complex, and I've not worked in a similar one. My recommendation would be to open an SR at Metalink. Please let us know what they tell you, as we may very well find ourselves with multiple standbys in the future.
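
For anyone trying to reason about the sequence in the question, here is a minimal C++ sketch of the failure mode. It is a toy model under stated assumptions, not Oracle's broker or observer logic: the `Database` struct, the `Role` and `State` enums, and the functions `open_primaries`, `naive_open` and `guarded_open` are all hypothetical names made up for illustration. It only shows why opening a restarted database on the strength of its own (stale) role record gives two open primaries, while first checking for an existing open primary does not.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: hypothetical types, not Oracle's broker metadata.
enum class Role { Primary, Standby };
enum class State { Down, Mounted, Open };

struct Database {
    std::string name;
    Role local_role;  // role recorded in this database's own, possibly stale, metadata
    State state;
};

// Count databases that are both open and acting as primary.
int open_primaries(const std::vector<Database>& dbs) {
    int n = 0;
    for (const auto& db : dbs)
        if (db.local_role == Role::Primary && db.state == State::Open) ++n;
    return n;
}

// Naive restart: trust the restarted database's own role record and open it.
void naive_open(Database& db) {
    if (db.state == State::Mounted && db.local_role == Role::Primary)
        db.state = State::Open;
}

// Guarded restart: open only if no other database is already an open primary.
// Returns true if db was opened.
bool guarded_open(Database& db, const std::vector<Database>& dbs) {
    if (db.state != State::Mounted || db.local_role != Role::Primary)
        return false;
    for (const auto& other : dbs)
        if (&other != &db && other.local_role == Role::Primary &&
            other.state == State::Open)
            return false;
    db.state = State::Open;
    return true;
}
```

With DBa mounted and DBc already open as primary, `guarded_open` on DBa returns false and leaves one open primary, whereas `naive_open` on DBa leaves two.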

Similar Messages

Risk of split brain scenario

Hi,
We have come across a huge concern within my company. We are running RAC and Data Guard on RHEL 5.3 with 10.2.0.4 PSU 2. Both primary and standby are configured as RAC databases, and we have run into a problem when doing a failover test. The primary fails over as expected, but the broker's reinstate command fails to determine that the failed database is no longer a primary and wants to open it in read/write mode. This is very dangerous, and I was wondering if anyone else has come across it. When using the observer and FSFO it has worked as expected.
Desperate for suggestions!
Regards,
Martin

Hi,
thanks for the reply. I have some more info about tests we have done with different patch-levels on 10.2.0.4 with different PSU levels, see descriptions below:
Case 1 – Successful reinstate.
No FSFO, oracle 10.2.0.4.0
Broker log at the startup mount time of the old primary (to be reinstated): the new primary returns the error ORA-16623.
DG 2010-03-18-16:46:57 0 2 0 DMON: >> Starting Data Guard Broker bootstrap <<
DG 2010-03-18-16:46:57 0 2 0 DMON: Attach state object
DG 2010-03-18-16:46:57 0 2 0 DMON: chief lock convert for bootstrap
DG 2010-03-18-16:46:57 0 2 0 DMON Registering service DT9990ELM7_0_DGB with listener(s)
DG 2010-03-18-16:46:57 0 2 0 Executing SQL [ALTER SYSTEM REGISTER]
DG 2010-03-18-16:46:57 0 2 0 SQL [ALTER SYSTEM REGISTER] Executed successfully
DG 2010-03-18-16:46:57 0 2 0 DMON: Broker Configuration DT9990ELM7_0.XXXX.COM
DG 2010-03-18-16:46:57 0 2 0 Metadata Version: 2.5 / UID = 562794543 (0x218b902f) / SEQ = 0 / MIV = 19
DG 2010-03-18-16:46:57 0 2 0 Protection Mode: Maximum Performance
DG 2010-03-18-16:46:57 0 2 0 Fast-Start Failover (FSFO): Disabled
DG 2010-03-18-16:46:57 0 2 0 Primary Database: DT9990ELM7_0 (01010000)
DG 2010-03-18-16:46:57 0 2 0 Standby Database: DT9990ELM8_1, Enabled Physical Standby (02010000)
DG 2010-03-18-16:46:57 0 2 0 DMON: site 01001000, instance 00000003 queuing healthcheck lock request
DG 2010-03-18-16:46:57 0 2 0 DMON: Health check master lock conversion successful
DG 2010-03-18-16:46:57 0 2 0 DMON: a process acquired the healthcheck master lock
DG 2010-03-18-16:47:02 0 2 0 NSV1: Received error ORA-16623 from target remote site DT9990ELM8_1.
DG 2010-03-18-16:47:05 0 2 0 Version Check status returned is 16623
DG 2010-03-18-16:47:05 0 2 713983617 DMON: All destinations will be deferred
DG 2010-03-18-16:47:05 0 2 0 Executing SQL [alter system set log_archive_dest_state_2 = 'RESET']
DG 2010-03-18-16:47:05 0 2 0 SQL [alter system set log_archive_dest_state_2 = 'RESET'] Executed successfully
DG 2010-03-18-16:47:05 0 2 0 DMON: Redo transport to database DT9990ELM8_1 (02001000) has been deferred
DG 2010-03-18-16:47:05 0 2 713983617 DMON: Entered rfm_release_chief_lock for CTL_VC
DG 2010-03-18-16:47:05 0 2 713983617 DMON: chief lock convert for site quiesce
DG 2010-03-18-16:47:05 0 2 0 NSV1: NetSlave shutting down.
DG 2010-03-18-16:47:06 0 2 0 DMON: NSV1 terminated
Case 2 – reinstate not successful
FSFO enabled, oracle 10.2.0.4.3
Broker log at the startup mount time of the old primary (to be reinstated): the new primary returns the error ORA-16541. The reinstate will not work
DG 2010-03-19-08:46:58 0 2 0 DMON: >> Starting Data Guard Broker bootstrap <<
DG 2010-03-19-08:46:58 0 2 0 DMON: Attach state object
DG 2010-03-19-08:46:58 0 2 0 DMON: chief lock convert for bootstrap
DG 2010-03-19-08:46:58 0 2 0 DMON Registering service SMPELM7_1_DGB with listener(s)
DG 2010-03-19-08:46:58 0 2 0 Executing SQL [ALTER SYSTEM REGISTER]
DG 2010-03-19-08:46:58 0 2 0 SQL [ALTER SYSTEM REGISTER] Executed successfully
DG 2010-03-19-08:46:58 0 2 0 DMON: Broker Configuration SMPELM8_0.xxxx.COM
DG 2010-03-19-08:46:58 0 2 0 Metadata Version: 2.5 / UID = 15015 (0x00003aa7) / SEQ = 3 / MIV = 57
DG 2010-03-19-08:46:58 0 2 0 Protection Mode: Maximum Availability
DG 2010-03-19-08:46:58 0 2 0 Fast-Start Failover (FSFO): Enabled
DG 2010-03-19-08:46:58 0 2 0 Primary Database: SMPELM7_1 (02010000)
DG 2010-03-19-08:46:58 0 2 0 Standby Database: SMPELM8_0, Enabled Physical Standby (01010000)
DG 2010-03-19-08:46:58 0 2 0 DMON: site 02001000, instance 00000001 queuing healthcheck lock request
DG 2010-03-19-08:46:58 0 2 0 DMON: Health check master lock conversion successful
DG 2010-03-19-08:46:58 0 2 0 DMON: a process acquired the healthcheck master lock
DG 2010-03-19-08:46:58 0 2 0 DMON: Creating process FSFP
DG 2010-03-19-08:47:02 0 2 0 FSFP: Process started
DG 2010-03-19-08:47:03 0 2 0 DMON: FSFP successfully started
DG 2010-03-19-08:47:07 0 2 0 NSV0: Received error ORA-16541 from target remote site SMPELM8_0.
DG 2010-03-19-08:47:10 0 2 0 Version Check status returned is 16541
DG 2010-03-19-08:47:10 0 2 0 DMON: FSFO SetState(2, 0x0) operation requires an ack
DG 2010-03-19-08:47:10 0 2 0 Primary shutdown is possible if ack is not satisfied within 60 seconds
DG 2010-03-19-08:47:10 0 2 714041223 DMON: Entered rfm_release_chief_lock for CTL_VC
DG 2010-03-19-08:47:25 0 2 0 DMON: Creating process RSM0
DG 2010-03-19-08:47:28 0 2 0 RSM0: Attach state object
DG 2010-03-19-08:47:30 0 2 0 RSM0: HEALTH CHECK ERROR: ORA-16783: instance SMP not open for read and write access
DG 2010-03-19-08:47:31 0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16817: unsynchronized Fast-Start Failover configuration
DG 2010-03-19-08:47:31 0 2 0 RSM0: HEALTH CHECK ERROR: ORA-16820: Fast-Start Failover observer is no longer observing this database
DG 2010-03-19-08:47:31 0 2 714041224 Operation CTL_GET_STATUS cancelled during phase 2, error = ORA-16825
DG 2010-03-19-08:47:31 0 2 714041224 Operation CTL_GET_STATUS cancelled during phase 2, error = ORA-16825
DG 2010-03-19-08:47:40 0 2 714041224 DMON: Database SMPELM8_0 is still working on the task.
DG 2010-03-19-08:47:55 0 2 714041224 DMON: Database SMPELM8_0 is still working on the task.
DG 2010-03-19-08:48:10 0 2 714041224 DMON: Database SMPELM8_0 is still working on the task.
DG 2010-03-19-08:48:25 0 2 714041224 DMON: Database SMPELM8_0 is still working on the task.
Case 3 – reinstate not successful
FSFO not enabled, oracle 10.2.0.4.2
Broker log at the startup mount time of the old primary (to be reinstated): the new primary returns the error ORA-16541. The reinstate will not work
DG 2010-03-19-12:50:02 0 2 0 DMON: >> Starting Data Guard Broker bootstrap <<
DG 2010-03-19-12:50:02 0 2 0 DMON: Attach state object
DG 2010-03-19-12:50:02 0 2 0 DMON: chief lock convert for bootstrap
DG 2010-03-19-12:50:02 0 2 0 DMON Registering service PP0164ELM7_1_DGB with listener(s)
DG 2010-03-19-12:50:02 0 2 0 Executing SQL [ALTER SYSTEM REGISTER]
DG 2010-03-19-12:50:02 0 2 0 SQL [ALTER SYSTEM REGISTER] Executed successfully
DG 2010-03-19-12:50:02 0 2 0 DMON: Broker Configuration PP0164ELM8_0.XXXX.COM
DG 2010-03-19-12:50:02 0 2 0 Metadata Version: 2.5 / UID = 90006921 (0x055d6589) / SEQ = 5 / MIV = 51
DG 2010-03-19-12:50:02 0 2 0 Protection Mode: Maximum Availability
DG 2010-03-19-12:50:02 0 2 0 Fast-Start Failover (FSFO): Disabled
DG 2010-03-19-12:50:02 0 2 0 Primary Database: PP0164ELM7_1 (02010000)
DG 2010-03-19-12:50:02 0 2 0 Standby Database: PP0164ELM8_0, Enabled Physical Standby (01010000)
DG 2010-03-19-12:50:02 0 2 0 DMON: site 02001000, instance 00000001 queuing healthcheck lock request
DG 2010-03-19-12:50:02 0 2 0 DMON: Health check master lock conversion successful
DG 2010-03-19-12:50:02 0 2 0 DMON: a process acquired the healthcheck master lock
DG 2010-03-19-12:50:05 0 2 0 NSV0: Received error ORA-16541 from target remote site PP0164ELM8_0.
DG 2010-03-19-12:50:08 0 2 0 Version Check status returned is 16541
DG 2010-03-19-12:50:08 0 2 0 Executing SQL [ALTER SYSTEM REGISTER]
DG 2010-03-19-12:50:08 0 2 0 SQL [ALTER SYSTEM REGISTER] Executed successfully
DG 2010-03-19-12:50:08 0 2 714055803 DMON: Evaluating critical status of standbys in configuration
DG 2010-03-19-12:50:08 0 2 714055803 DMON: Evaluating critical status of 0x1010000
DG 2010-03-19-12:50:08 0 2 714055803 ChangeCritical value is FALSE
DG 2010-03-19-12:50:08 0 2 714055803 IsCritical value is FALSE
DG 2010-03-19-12:50:08 0 2 714055803 DMON: status from rfi_post_instances() for CTL_ENABLE = ORA-00000
DG 2010-03-19-12:50:08 0 2 0 INSV: Received message for inter-instance publication
DG 2010-03-19-12:50:08 0 2 714055803 DMON: dispersing message to standbys for ENABLE phase RESYNCH
DG 2010-03-19-12:50:08 0 2 0 req_id 2.1.714055803, opcode CTL_ENABLE, phase RESYNCH, flags 5
DG 2010-03-19-12:50:08 0 2 0 NSV0: Start sending metadata file: /OCRS/broker/dr2PP0164ELM7_1.dat.
DG 2010-03-19-12:50:08 0 2 0 INSV: Reply received for message with
DG 2010-03-19-12:50:08 0 2 0 req ID 2.1.714055803, opcode CTL_ENABLE, phase RESYNCH
DG 2010-03-19-12:50:08 0 2 0 NSV0: DRCX returns error ORA-16656 on OPEN.
/Martin

Split brain

Does a split-brain scenario happen only when we have multiple (multiplexed) voting disks?
If we have a single voting disk (external redundancy), can we come across a split-brain scenario?

Does a split-brain scenario happen only when we have multiple (multiplexed) voting disks?
No.
If we have a single voting disk (external redundancy), can we come across a split-brain scenario?
Oh yes, you can have split brain with just the one voting disk. As you identified, if you have external redundancy, you will only have one voting disk.

Election problem after repeated split-brains with two nodes

Hi
I'm using a customized source based on BDB 5.1.19 (excxx_repquote) with two sites, one MASTER and the other SLAVE:
nsite=2
ack=quorum
- the master writes to quotedb at a rate of 10 txn per sec
- the test consists of isolating the client from the master (split brain) and reconnecting it after a random time between 1 sec and 10 sec
The test runs well about 10 times, but at some point the slave process receives DB_EVENT_REP_ELECTION_FAILED, and the master enters election mode and never leaves CLIENT mode. I should mention that, to freeze the client, the process kills itself (kill -9 on its own pid) when it receives this event.
Here is the verbose log on the master:
[1307872770:871621][6510/47655809107168] MASTER: rep_send_function returned: 110
[1307872770:973655][6510/47655809107168] MASTER: bulk_msg: Send buffer after copy due to PERM
[1307872770:973667][6510/47655809107168] MASTER: send_bulk: Send 266 (0x10a) bulk buffer bytes
[1307872770:973672][6510/47655809107168] MASTER: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 68 eid -1, type bulk_log, LSN [21][986648] perm
[1307872770:973693][6510/47655809107168] MASTER: will await acknowledgement: need 1
[1307872771:26623][6510/47655809107168] MASTER: rep_send_function returned: 110
[1307872771:126380][6510/1162996032] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 70 eid 0, type log, LSN [21][946345]
[1307872771:126407][6510/1162996032] MASTER: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 68 eid -1, type dupmaster, LSN [0][0] nobuf
[1307872771:126695][6510/1162996032] MASTER: rep_start: Found old version log 17
[1307872771:126753][6510/1162996032] CLIENT: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 68 eid -1, type newclient, LSN [0][0] nobuf
[1307872771:126833][6510/1183975744] CLIENT: starting election thread
[1307872771:126876][6510/1183975744] CLIENT: Start election nsites 2, ack 1, priority 100
[1307872771:126890][6510/1183975744] CLIENT: Election thread owns egen 69
[1307872771:127423][6510/1173485888] CLIENT: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 70 eid 0, type newclient, LSN [0][0]
[1307872771:130079][6510/1183975744] CLIENT: Tallying VOTE1[0] (2147483647, 69)
[1307872771:130113][6510/1183975744] CLIENT: Beginning an election
[1307872771:130134][6510/1183975744] CLIENT: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 68 eid -1, type vote1, LSN [21][986728] nobuf
[1307872771:130147][6510/1173485888] CLIENT: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 68 eid -1, type master_req, LSN [0][0] nobuf
[1307872771:130438][6510/1152506176] CLIENT: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 70 eid 0, type vote1, LSN [21][946437]
[1307872771:130460][6510/1162996032] CLIENT: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 70 eid 0, type alive, LSN [21][986728]
[1307872771:130467][6510/1152506176] CLIENT: Updating gen from 68 to 70
[1307872771:130482][6510/1162996032] CLIENT: Received ALIVE egen of 71, mine 69
[1307872771:130503][6510/1162996032] CLIENT: Election finished in 0.003602000 sec
[1307872771:130515][6510/1162996032] CLIENT: Election done; egen 70
[1307872771:130534][6510/1152506176] CLIENT: Received vote1 egen 71, egen 71
[1307872771:130581][6510/1152506176] CLIENT: Tallying VOTE1[0] (0, 71)
[1307872771:130593][6510/1089075520] CLIENT: starting election thread
[1307872771:130619][6510/1152506176] CLIENT: Incoming vote: (eid)0 (pri)100 ELECTABLE (gen)70 (egen)71 [21,946437]
[1307872771:130642][6510/1152506176] CLIENT: Not in election, but received vote1 0x282c 0x8
[1307872771:130674][6510/1089075520] CLIENT: Start election nsites 2, ack 1, priority 100
[1307872771:130692][6510/1089075520] CLIENT: Election thread owns egen 71
[1307872771:130704][6510/1194465600] CLIENT: starting election thread
[1307872771:130733][6510/1194465600] CLIENT: Start election nsites 2, ack 1, priority 100
[1307872771:132922][6510/1089075520] CLIENT: Tallying VOTE1[1] (2147483647, 71)
[1307872771:132949][6510/1089075520] CLIENT: Accepting new vote
[1307872771:132958][6510/1089075520] CLIENT: Beginning an election
[1307872771:132973][6510/1089075520] CLIENT: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 70 eid -1, type vote1, LSN [21][986728] nobuf
[1307872771:132985][6510/1194465600] CLIENT: election thread is exiting
[1307872771:133012][6510/1089075520] CLIENT: Tallying VOTE2[0] (2147483647, 71)
[1307872771:133037][6510/1089075520] CLIENT: Counted my vote 1
[1307872771:133048][6510/1089075520] CLIENT: Skipping phase2 wait: already got 1 votes
[1307872771:133060][6510/1089075520] CLIENT: Got enough votes to win; election done; (prev) gen 70
[1307872771:133071][6510/1089075520] CLIENT: Election finished in 0.002367000 sec
[1307872771:133084][6510/1089075520] CLIENT: Election done; egen 72
[1307872771:133111][6510/1089075520] CLIENT: Ended election with 0, e_th 1, egen 72, flag 0x2a2c, e_fl 0x0, lo_fl 0x6
[1307872771:133170][6510/1173485888] CLIENT: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 70 eid 0, type alive, LSN [0][0]
[1307872771:133187][6510/1173485888] CLIENT: Racing replication msg lockout, ignore message.
[1307872771:173744][6510/1162996032] CLIENT: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 70 eid 0, type vote2, LSN [0][0]
[1307872771:173769][6510/1162996032] CLIENT: Racing replication msg lockout, ignore message.
[1307872771:231593][6510/1183975744] CLIENT: Ended election with 0, e_th 0, egen 72, flag 0x2a2c, e_fl 0x0, lo_fl 0x1c
[1307872771:231629][6510/1183975744] CLIENT: election thread is exiting
[1307872777:443794][6510/1131526464] CLIENT: init connection to site 2.0.0.210:12345 with result 115
[1307872971:644194][6510/1131526464] CLIENT: init connection to site 2.0.0.210:12345 with result 115
[1307873165:844583][6510/1131526464] CLIENT: init connection to site 2.0.0.210:12345 with result 115
[1307873360:44955][6510/1131526464] CLIENT: init connection to site 2.0.0.210:12345 with result 115
[1307873554:245347][6510/1131526464] CLIENT: init connection to site 2.0.0.210:12345 with result 115
[1307873748:445736][6510/1131526464] CLIENT: init connection to site 2.0.0.210:12345 with result 115
[1307873942:646117][6510/1131526464] CLIENT: init connection to site 2.0.0.210:12345 with result 115
[1307874136:846509][6510/1131526464] CLIENT: init connection to site 2.0.0.210:12345 with result 115
.... and it stays in this state indefinitely.
My question is why the master is suddenly turned into a CLIENT and why it never returns to MASTER.
Thanks in advance ...
Here is the log for the client:
[1307872315:455113][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type log, LSN [21][984396]
[1307872315:455134][1282/1160603968] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type log, LSN [21][984483] perm
[1307872315:609962][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type bulk_log, LSN [21][984733] perm
[1307872315:764958][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type bulk_log, LSN [21][984986] perm
[1307872315:919962][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type bulk_log, LSN [21][985238] perm
[1307872316:75018][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type bulk_log, LSN [21][985491] perm
[1307872316:229959][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type bulk_log, LSN [21][985741] perm
[1307872316:384949][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type bulk_log, LSN [21][985993] perm
[1307872316:499899][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type bulk_log, LSN [21][986141] perm
[1307872316:539895][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type log, LSN [21][986221]
[1307872316:540078][1282/1171093824] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type log, LSN [21][986307]
[1307872316:540100][1282/1160603968] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type log, LSN [21][986394] perm
[1307872316:694950][1282/1171093824] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type bulk_log, LSN [21][986648] perm
[1307872316:847349][1282/1129134400] MASTER: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 70 eid -1, type log, LSN [21][946345]
[1307872316:847698][1282/1171093824] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type dupmaster, LSN [0][0]
[1307872316:847999][1282/1181583680] MASTER: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type newclient, LSN [0][0]
[1307872316:848168][1282/1171093824] MASTER: rep_start: Found old version log 17
[1307872316:848222][1282/1181583680] CLIENT: Racing replication msg lockout, ignore message.
[1307872316:848398][1282/1171093824] CLIENT: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 70 eid -1, type newclient, LSN [0][0] nobuf
[1307872316:848504][1282/1192073536] CLIENT: starting election thread
[1307872316:848542][1282/1192073536] CLIENT: Start election nsites 2, ack 1, priority 100
[1307872316:848566][1282/1192073536] CLIENT: Election thread owns egen 71
[1307872316:849634][1282/1192073536] CLIENT: Tallying VOTE1[0] (2147483647, 71)
[1307872316:849654][1282/1192073536] CLIENT: Beginning an election
[1307872316:849680][1282/1192073536] CLIENT: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 70 eid -1, type vote1, LSN [21][946437] nobuf
[1307872316:851403][1282/1160603968] CLIENT: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type vote1, LSN [21][986728]
[1307872316:851448][1282/1160603968] CLIENT: Received vote1 egen 69, egen 71
[1307872316:851470][1282/1160603968] CLIENT: Received old vote 69, egen 71, ignoring vote1
[1307872316:851481][1282/1160603968] CLIENT: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 70 eid 0, type alive, LSN [21][986728] nobuf
[1307872316:851538][1282/1171093824] CLIENT: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 68 eid 0, type master_req, LSN [0][0]
[1307872316:851558][1282/1171093824] CLIENT: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 70 eid 0, type alive, LSN [0][0] nobuf
[1307872316:854254][1282/1160603968] CLIENT: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 70 eid 0, type vote1, LSN [21][986728]
[1307872316:854275][1282/1160603968] CLIENT: Received vote1 egen 71, egen 71
[1307872316:854317][1282/1160603968] CLIENT: Tallying VOTE1[1] (0, 71)
[1307872316:854339][1282/1160603968] CLIENT: Incoming vote: (eid)0 (pri)100 ELECTABLE (gen)70 (egen)71 [21,986728]
[1307872316:854353][1282/1160603968] CLIENT: Existing vote: (eid)2147483647 (pri)100 (gen)70 (sites)2 [21,946437]
[1307872316:854369][1282/1160603968] CLIENT: Accepting new vote
[1307872316:854379][1282/1160603968] CLIENT: Phase1 election done
[1307872316:854395][1282/1160603968] CLIENT: Voting for 0
[1307872316:854407][1282/1160603968] CLIENT: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 70 eid 0, type vote2, LSN [0][0] nobuf
[1307872317:960344][1282/1192073536] CLIENT: After phase 2: votes 0, nvotes 1, nsites 2
[1307872317:960389][1282/1192073536] CLIENT: Election finished in 1.111809000 sec
[1307872317:960401][1282/1192073536] CLIENT: Election done; egen 72
[1307872317:960412][1282/1192073536] CLIENT: Ended election with -30974, e_th 0, egen 72, flag 0x282c, e_fl 0x0, lo_fl 0x0
Kill me !!
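
Reading the two logs side by side, two rules seem to be at work, and a small C++ sketch may make them easier to follow. This is a simplified model under assumptions, not Berkeley DB's internal election code: the `Vote` struct and the functions `prefers` and `election_won` are hypothetical names, and the real election also tracks generations and other state that is left out here. The first rule follows the comment in the sample source further down (the most recent log records win, priority breaks ties) and matches the `Accepting new vote` line in the client log, where [21,986728] replaces [21,946437]. The second only restates `After phase 2: votes 0, nvotes 1, nsites 2`: an election that collects fewer votes than `nvotes` fails.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical, simplified vote record; not Berkeley DB's internal structure.
struct Vote {
    std::uint32_t lsn_file;    // log file number, e.g. 21 in [21,986728]
    std::uint32_t lsn_offset;  // offset in that file, e.g. 986728
    int priority;              // election priority, e.g. 100
};

// True if 'incoming' should replace 'existing' as the preferred candidate:
// the more recent LSN wins, and priority breaks ties.
bool prefers(const Vote& incoming, const Vote& existing) {
    return std::tie(incoming.lsn_file, incoming.lsn_offset, incoming.priority) >
           std::tie(existing.lsn_file, existing.lsn_offset, existing.priority);
}

// An election round succeeds only if the votes received reach the required count.
bool election_won(int votes, int nvotes) {
    assert(votes >= 0 && nvotes > 0);
    return votes >= nvotes;
}
```

Under this model, `election_won(0, 1)` is false, which is the situation the client log ends in just before the process kills itself.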
--- my source
on the master I run manually :
txn_rate 1
loop_rate 10
loop 1 20000
/*-
 * See the file LICENSE for redistribution information.
 *
 * Copyright (c) 2001, 2010 Oracle and/or its affiliates. All rights reserved.
 *
 * $Id$
 */

/*
 * In this application, we specify all communication via the command line. In
 * a real application, we would expect that information about the other sites
 * in the system would be maintained in some sort of configuration file. The
 * critical part of this interface is that we assume at startup that we can
 * find out
 *      1) what our Berkeley DB home environment is,
 *      2) what host/port we wish to listen on for connections; and
 *      3) an optional list of other sites we should attempt to connect to.
 *
 * These pieces of information are expressed by the following flags.
 * -h home (required; h stands for home directory)
 * -l host:port (required; l stands for local)
 * -C or -M (optional; start up as client or master)
 * -r host:port (optional; r stands for remote; any number of these may be
 *     specified)
 * -R host:port (optional; R stands for remote peer; only one of these may
 *     be specified)
 * -a all|quorum (optional; a stands for ack policy)
 * -b (optional; b stands for bulk)
 * -n nsites (optional; number of sites in replication group; defaults to 0
 *     to try to dynamically compute nsites)
 * -p priority (optional; defaults to 100)
 * -v (optional; v stands for verbose)
 */
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>
#include <sstream>
#include <sys/types.h>
#include <signal.h>
#include <db_cxx.h>
#include "RepConfigInfo.h"
#include "dbc_auto.h"
using std::cout;
using std::cin;
using std::cerr;
using std::endl;
using std::ends;
using std::flush;
using std::istream;
using std::istringstream;
using std::ostringstream;
using std::string;
using std::getline;
#include <stdio.h>
#include <readline/readline.h>
#include <readline/history.h>
#define     CACHESIZE     (10 * 1024 * 1024)
#define     DATABASE     "quote.db"
#define     DATABASE2     "quote2.db"
const char *progname = "excxx_repquote";
#include <errno.h>
#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#define     snprintf          _snprintf
#define     sleep(s)          Sleep(1000 * (s))
extern "C" {
extern int getopt(int, char * const *, const char *);
extern char *optarg;
}
typedef HANDLE thread_t;
typedef DWORD thread_exit_status_t;
#define     thread_create(thrp, attr, func, arg)                    \
(((*(thrp) = CreateThread(NULL, 0,                         \
     (LPTHREAD_START_ROUTINE)(func), (arg), 0, NULL)) == NULL) ? -1 : 0)
#define     thread_join(thr, statusp)                         \
((WaitForSingleObject((thr), INFINITE) == WAIT_OBJECT_0) &&          \
GetExitCodeThread((thr), (LPDWORD)(statusp)) ? 0 : -1)
#else /* !_WIN32 */
#include <pthread.h>
typedef pthread_t thread_t;
typedef void* thread_exit_status_t;
#define     thread_create(thrp, attr, func, arg)                    \
pthread_create((thrp), (attr), (func), (arg))
#define     thread_join(thr, statusp) pthread_join((thr), (statusp))
#endif
// Struct used to store information in Db app_private field.
typedef struct {
     bool app_finished;
     bool in_client_sync;
     bool is_master;
     bool no_dummy_wr;
} APP_DATA;
static void log(const char *);
// Thread entry points; pthread_create expects the signature void *(*)(void *).
void *checkpoint_thread (void *);
void *log_archive_thread (void *);
void *dummy_write_thread (void *);
class RepQuoteExample {
public:
     RepQuoteExample();
     void init(RepConfigInfo* config);
     void doloop();
     int terminate();
     static void event_callback(DbEnv* dbenv, u_int32_t which, void *info);
     void print_stocks_size(Db *dbp);
private:
     // disable copy constructor.
     RepQuoteExample(const RepQuoteExample &);
     void operator = (const RepQuoteExample &);
     // internal data members.
     APP_DATA          app_data;
     RepConfigInfo *app_config;
     DbEnv          cur_env;
     thread_t ckp_thr;
     thread_t lga_thr;
     thread_t dmy_thr;
     // private methods.
     void print_stocks(Db *dbp);
     void print_env(DbEnv *dbenv);
     void prompt();
};
RepQuoteExample *g_runner=NULL;
RepConfigInfo *g_config=NULL;
class DbHolder {
public:
     DbHolder(DbEnv *env, const char *_dbname) : env(env)
     {
          dbp = 0;
          if (_dbname) dbname=_dbname;
          else dbname=DATABASE;
     }
     ~DbHolder() {
          try {
               close();
          } catch (...) {
               // Ignore: this may mean another exception is pending
          }
     }
     bool ensure_open(bool creating) {
          if (dbp)
               return (true);
          dbp = new Db(env, 0);
          u_int32_t flags = DB_AUTO_COMMIT;
          if (creating)
               flags |= DB_CREATE;
          try {
               //dbp->open(NULL, DATABASE, NULL, DB_BTREE, flags, 0);
               //dbp->open(NULL, dbname, NULL, DB_BTREE, flags, 0);
               dbp->open(NULL, NULL, dbname, DB_BTREE, flags, 0);
               return (true);
          } catch (DbDeadlockException e) {
          } catch (DbRepHandleDeadException e) {
          } catch (DbException e) {
               if (e.get_errno() == DB_REP_LOCKOUT) {
                    // Just fall through.
               } else if (e.get_errno() == ENOENT && !creating) {
                    // Provide a bit of extra explanation.
                    log("Stock DB does not yet exist");
               } else
                    throw;
          }
          // (All retryable errors fall through to here.)
          log("please retry the operation");
          close();
          return (false);
     }
     void close() {
          if (dbp) {
               try {
                    dbp->close(0);
                    delete dbp;
                    dbp = 0;
               } catch (...) {
                    delete dbp;
                    dbp = 0;
                    throw;
               }
          }
     }
     operator Db *() {
          return dbp;
     }
     Db *operator->() {
          return dbp;
     }
private:
     Db *dbp;
     DbEnv *env;
     const char *dbname;
};
class StringDbt : public Dbt {
public:
#define GET_STRING_OK 0
#define GET_STRING_INVALID_PARAM 1
#define GET_STRING_SMALL_BUFFER 2
#define GET_STRING_EMPTY_DATA 3
     int get_string(char **buf, size_t buf_len)
     {
          size_t copy_len;
          int ret = GET_STRING_OK;
          if (buf == NULL) {
               cerr << "Invalid input buffer to get_string" << endl;
               return GET_STRING_INVALID_PARAM;
          }
          // make sure the string is null terminated.
          memset(*buf, 0, buf_len);
          // if there is no string, just return.
          if (get_data() == NULL || get_size() == 0)
               return GET_STRING_OK;
          if (get_size() >= buf_len) {
               ret = GET_STRING_SMALL_BUFFER;
               copy_len = buf_len - 1; // save room for a terminator.
          } else
               copy_len = get_size();
          memcpy(*buf, get_data(), copy_len);
          return ret;
     }
     size_t get_string_length()
     {
          if (get_size() == 0)
               return 0;
          return strlen((char *)get_data());
     }
     void set_string(char *string)
     {
          set_data(string);
          set_size((u_int32_t)strlen(string));
     }
     StringDbt(char *string) :
     Dbt(string, (u_int32_t)strlen(string)) {};
     StringDbt() : Dbt() {};
     ~StringDbt() {};
     // Don't add extra data to this sub-class since we want it to remain
     // compatible with Dbt objects created internally by Berkeley DB.
};
Db *g_repquote=NULL;
RepQuoteExample::RepQuoteExample() : app_config(0), cur_env(0) {
     app_data.app_finished = 0;
     app_data.in_client_sync = 0;
     app_data.is_master = 0; // assume I start out as client
     app_data.no_dummy_wr = 0; // when set, the dummy write thread does not run
}
int (*old_rep_process_message)
          __P((DB_ENV *, DBT *, DBT *, int, DB_LSN *));
// Debug wrapper: print the LSN pointer, then forward to the saved original handler.
int my_rep_process_message __P((DB_ENV *arg1, DBT *arg2, DBT *arg3, int arg4, DB_LSN *arg5))
{
     printf("EZ->>> my_rep_process_message:%p\n", (void *)arg5);
     return old_rep_process_message(arg1, arg2, arg3, arg4, arg5);
}
void RepQuoteExample::init(RepConfigInfo *config) {
     app_config = config;
     cur_env.set_app_private(&app_data);
     cur_env.set_errfile(stderr);
     app_data.no_dummy_wr=config->no_dummy_wr;
     if (app_data.no_dummy_wr)
          printf("No dummy !!!\n");
     //EZ->cur_env.set_errpfx(progname);
     cur_env.set_event_notify(event_callback);
     // Configure bulk transfer to send groups of records to clients
     // in a single network transfer. This is useful for master sites
     // and clients participating in client-to-client synchronization.
     if (app_config->bulk)
          cur_env.rep_set_config(DB_REP_CONF_BULK, 1);
     // Set the total number of sites in the replication group.
     // This is used by repmgr internal election processing.
     if (app_config->totalsites > 0)
          cur_env.rep_set_nsites(app_config->totalsites);
     // Turn on debugging and informational output if requested.
     if (app_config->verbose)
          cur_env.set_verbose(DB_VERB_REPLICATION, 1);
     cur_env.set_verbose(DB_VERB_REPMGR_MISC, 1);
     cur_env.set_verbose(DB_VERB_RECOVERY, 1);
     cur_env.set_verbose(DB_VERB_REPLICATION, 1);
     cur_env.set_verbose(DB_VERB_REP_ELECT, 1);
     cur_env.set_verbose(DB_VERB_REP_LEASE, 1);
     cur_env.set_verbose(DB_VERB_REP_SYNC, 1);
     cur_env.set_verbose(DB_VERB_REPMGR_MISC, 1);
     // Set replication group election priority for this environment.
     // An election first selects the site with the most recent log
     // records as the new master. If multiple sites have the most
     // recent log records, the site with the highest priority value
     // is selected as master.
     cur_env.rep_set_priority(app_config->priority);
     // Set the policy that determines how master and client sites
     // handle acknowledgement of replication messages needed for
     // permanent records. The default policy of "quorum" requires only
     // a quorum of electable peers sufficient to ensure a permanent
     // record remains durable if an election is held. The "all" option
     // requires all clients to acknowledge a permanent replication
     // message instead.
     cur_env.repmgr_set_ack_policy(app_config->ack_policy);
     // Set the threshold for the minimum and maximum time the client
     // waits before requesting retransmission of a missing message.
     // Base these values on the performance and load characteristics
     // of the master and client host platforms as well as the round
     // trip message time.
     cur_env.rep_set_request(20000, 500000);
     // Configure deadlock detection to ensure that any deadlocks
     // are broken by having one of the conflicting lock requests
     // rejected. DB_LOCK_DEFAULT uses the lock policy specified
     // at environment creation time or DB_LOCK_RANDOM if none was
     // specified.
     cur_env.set_lk_detect(DB_LOCK_DEFAULT);
     // The following base replication features may also be useful to your
     // application. See Berkeley DB documentation for more details.
     // - Master leases: Provide stricter consistency for data reads
     // on a master site.
     // - Timeouts: Customize the amount of time Berkeley DB waits
     // for such things as an election to be concluded or a master
     // lease to be granted.
     // - Delayed client synchronization: Manage the master site's
     // resources by spreading out resource-intensive client
     // synchronizations.
     // - Blocked client operations: Return immediately with an error
     // instead of waiting indefinitely if a client operation is
     // blocked by an ongoing client synchronization.
     cur_env.repmgr_set_local_site(app_config->this_host.host,
          app_config->this_host.port, 0);
     for ( REP_HOST_INFO *cur = app_config->other_hosts; cur != NULL;
          cur = cur->next) {
          cur_env.repmgr_add_remote_site(cur->host, cur->port,
               NULL, cur->peer ? DB_REPMGR_PEER : 0);
     }
     // Configure heartbeat timeouts so that repmgr monitors the
     // health of the TCP connection. Master sites broadcast a heartbeat
     // at the frequency specified by the DB_REP_HEARTBEAT_SEND timeout.
     // Client sites wait for message activity the length of the
     // DB_REP_HEARTBEAT_MONITOR timeout before concluding that the
     // connection to the master is lost. The DB_REP_HEARTBEAT_MONITOR
     // timeout should be longer than the DB_REP_HEARTBEAT_SEND timeout.
     cur_env.rep_set_timeout(DB_REP_HEARTBEAT_SEND, 5000000);
     cur_env.rep_set_timeout(DB_REP_HEARTBEAT_MONITOR, 10000000);
     // The following repmgr features may also be useful to your
     // application. See Berkeley DB documentation for more details.
     // - Two-site strict majority rule - In a two-site replication
     // group, require both sites to be available to elect a new
     // master.
     // - Timeouts - Customize the amount of time repmgr waits
     // for such things as waiting for acknowledgements or attempting
     // to reconnect to other sites.
     // - Site list - return a list of sites currently known to repmgr.
     // We can now open our environment, although we're not ready to
     // begin replicating. However, we want to have a dbenv around
     // so that we can send it into any of our message handlers.
     cur_env.set_cachesize(0, CACHESIZE, 0);
     cur_env.set_flags(DB_REP_PERMANENT, 1);
     //cur_env.set_flags(DB_TXN_WRITE_NOSYNC, 1);
/*     u_int32_t maxlocks=300000;
     if (maxlocks != 0)
          cur_env.set_lk_max_locks(maxlocks);
     u_int32_t maxlocks_o=300000;
     if (maxlocks_o != 0)
          cur_env.set_lk_max_objects(maxlocks_o);
     u_int32_t maxmutex=300000;
     if (maxmutex != 0)
          cur_env.mutex_set_max(maxmutex);
     DbEnv          *m_env=&cur_env;
     m_env->set_flags(DB_TXN_NOSYNC, 1);
     m_env->set_lk_max_lockers(60000);
     m_env->set_lk_max_objects(60000);
     m_env->set_lk_max_locks(60000);
     m_env->set_tx_max(60000);
     //m_env->repmgr_set_ack_policy(DB_REPMGR_ACKS_NONE);
     m_env->rep_set_timeout(DB_REP_ACK_TIMEOUT, 50 * 1000); //50ms
     m_env->rep_set_timeout(DB_REP_CHECKPOINT_DELAY, 0);
     //m_env->rep_set_timeout(DB_REP_CONNECTION_RETRY, 30 * 1000 * 1000); // 30 seconds
     m_env->rep_set_timeout(DB_REP_ELECTION_TIMEOUT, 1 * 1000 * 1000); // 1 second
     m_env->rep_set_timeout(DB_REP_FULL_ELECTION_TIMEOUT, 5 * 1000 * 1000); // 5 seconds
     m_env->rep_set_timeout(DB_REP_CONNECTION_RETRY, 5 * 1000 * 1000);
     //m_env->rep_set_timeout(DB_REP_ELECTION_RETRY, 10 * 1000 * 1000); //10 seconds
     //m_env->rep_set_timeout(DB_REP_HEARTBEAT_MONITOR, 80 * 1000 * 1000); //80 seconds
     //m_env->rep_set_timeout(DB_REP_HEARTBEAT_SEND, 500 * 1000); //500 milli seconds
     //The minimum number of microseconds a client waits before requesting retransmission
     u_int32_t rep_req_min = 40000; //40 000 microsec = 40 mili
     //The maximum number of microseconds a client waits before requesting retransmission
     u_int32_t rep_req_max = 1280000;// 1 280 000 microsec = 1.28 sec
     u_int32_t rep_limit_gbytes = 0;
     u_int32_t rep_limit_bytes = 100 * 1024 * 1024; // 100MB
     m_env->rep_set_request(rep_req_min, rep_req_max);
     m_env->rep_set_limit(rep_limit_gbytes, rep_limit_bytes);
     cur_env.open(app_config->home, DB_CREATE | DB_RECOVER |
     DB_THREAD | DB_INIT_REP | DB_INIT_LOCK | DB_INIT_LOG |
     DB_INIT_MPOOL | DB_INIT_TXN , 0);
     //keep old function for chain
     //old_rep_process_message=cur_env.get_DB_ENV()->rep_process_message;
     //derouting
     //cur_env.get_DB_ENV()->rep_process_message=my_rep_process_message;
     /*int _i;
     cur_env.log_get_config(DB_LOG_DIRECT, &_i);printf ("DB_LOG_DIRECT = %d\n",_i);
     cur_env.log_get_config(DB_LOG_DSYNC, &_i);printf ("DB_LOG_DSYNC = %d\n",_i);
     cur_env.log_get_config(DB_LOG_AUTO_REMOVE, &_i);printf ("DB_LOG_AUTO_REMOVE = %d\n",_i);
     cur_env.log_get_config(DB_LOG_IN_MEMORY, &_i);printf ("DB_LOG_IN_MEMORY = %d\n",_i);
     cur_env.log_get_config(DB_LOG_ZERO,&_i);printf ("DB_LOG_ZERO = %d\n",_i);
     */
     // Start checkpoint and log archive support threads.
     (void)thread_create(&ckp_thr, NULL, checkpoint_thread, &cur_env);
     (void)thread_create(&lga_thr, NULL, log_archive_thread, &cur_env);
     (void)thread_create(&dmy_thr, NULL, dummy_write_thread, &cur_env);
     cur_env.repmgr_start(3, app_config->start_policy);
}

int RepQuoteExample::terminate() {
     try {
          // Wait for checkpoint and log archive threads to finish.
          // Windows does not allow NULL pointer for exit code variable.
          thread_exit_status_t exstat;
          (void)thread_join(lga_thr, &exstat);
          (void)thread_join(ckp_thr, &exstat);
          (void)thread_join(dmy_thr, &exstat);
          // We have used the DB_TXN_NOSYNC environment flag for
          // improved performance without the usual sacrifice of
          // transactional durability, as discussed in the
          // "Transactional guarantees" page of the Reference
          // Guide: if one replication site crashes, we can
          // expect the data to exist at another site. However,
          // in case we shut down all sites gracefully, we push
          // out the end of the log here so that the most
          // recent transactions don't mysteriously disappear.
          cur_env.log_flush(NULL);
          cur_env.close(0);
     } catch (DbException &dbe) {
          cout << "error closing environment: " << dbe.what() << endl;
     }
     return 0;
}
void RepQuoteExample::prompt() {
     cout << "QUOTESERVER";
     if (!app_data.is_master)
          cout << "(read-only)";
     cout << "> " << flush;
}
void log(const char *msg) {
     // Get the current time and format it as text.
     time_t currentTime;
     time(&currentTime);
     char buff[255];
     strncpy(buff, ctime(&currentTime), sizeof(buff) - 1);
     buff[sizeof(buff) - 1] = '\0';
     // ctime() ends its result with a newline; cut the string there.
     char *p = strchr(buff, '\n');
     if (p != NULL)
          *p = '\0';
     cerr << buff << " - " << msg << endl;
}
// Simple command-line user interface:
// - enter "<stock symbol> <price>" to insert or update a record in the
//     database;
// - just press Return (i.e., blank input line) to print out the contents of
//     the database;
// - enter "quit" or "exit" to quit.
void RepQuoteExample::doloop() {
     DbHolder dbh1(&cur_env,DATABASE);
     DbHolder dbh2(&cur_env,DATABASE2);
     DbHolder *dbh=&dbh1;
     DbTxn *txn;
     string input;
bool truncate = false;
     char *c;
     using_history();
     g_repquote=*dbh;
     int loop_rate = 0;
     int txn_rate = 500;
     while (prompt(), /*getline(cin, input)*/c=readline(NULL)) {
          input=std::string(c);
          add_history(c);
          free(c);
          int start_loop = 0;
          int end_loop = 0;
          int start_loop_d = 0;
          int end_loop_d = 0;
          istringstream is(input);
          string token1, token2, token3;
truncate = false;
start_loop = 0;
end_loop = 0;
          // Read up to 3 tokens from the input.
          int count = 0;
          if (is >> token1) {
               count++;
               if (is >> token2)
               count++;
               if (is >> token3)
               count++;
          if (count == 1) {
     if (token1 == "truncate" ) {
                    truncate = true;
               else if (token1 == "env" ){
                    print_env(&cur_env);
                    continue;
     else if (token1 == "verbose" ) {
                    app_config->verbose = !app_config->verbose;
                    if (app_config->verbose)
                         cur_env.set_verbose(DB_VERB_REPLICATION, 1);
                         cur_env.set_verbose(DB_VERB_REPMGR_MISC, 1);
                         cur_env.set_verbose(DB_VERB_RECOVERY, 1);
                         cur_env.set_verbose(DB_VERB_REP_ELECT, 1);
                         cur_env.set_verbose(DB_VERB_REP_LEASE, 1);
                         cur_env.set_verbose(DB_VERB_REP_SYNC, 1);
                         cur_env.set_verbose(DB_VERB_REPMGR_MISC, 1);
                         log("verbose is on");
                    else
                         cur_env.set_verbose(DB_VERB_REPLICATION, 0);
                         cur_env.set_verbose(DB_VERB_REPMGR_MISC, 0);
                         cur_env.set_verbose(DB_VERB_RECOVERY, 0);
                         cur_env.set_verbose(DB_VERB_REP_ELECT, 0);
                         cur_env.set_verbose(DB_VERB_REP_LEASE, 0);
                         cur_env.set_verbose(DB_VERB_REP_SYNC, 0);
                         cur_env.set_verbose(DB_VERB_REPMGR_MISC, 0);
                         log("verbose is off");
                    continue;
     else if (token1 == "print" ) {
               print_stocks(*dbh);
                    count = 0;
     else if (token1 == "db1" ) {
                    dbh=&dbh1;
                    g_repquote=*dbh;
                    log( "switch to Db1");
                    count = 0;
     else if (token1 == "db2" ) {
                    dbh=&dbh2;
                    g_repquote=*dbh;
                    log( "switch to Db2");
                    count = 0;
               else if (token1 == "exit" || token1 == "quit") {
                    app_data.app_finished = 1;
                    break;
               } else {
                    log("Format: <stock> <price>");
                    continue;
else if (count == 2)
               if (token1 == "loop_rate" ){
     loop_rate = atoi(token2.c_str());
                    continue;
               if (token1 == "txn_rate" ){
     txn_rate = atoi(token2.c_str());
                    continue;
else if (count == 3)
if (token1 == "loop" ) {
start_loop = atoi(token2.c_str());
end_loop = start_loop + atoi(token3.c_str());
if (token1 == "delete" ) {
start_loop_d = atoi(token2.c_str());
end_loop_d = start_loop_d + atoi(token3.c_str());
          // Here we know the command needs the database (count is 0, 2
          // or 3, or 1 for "truncate"), so we're about to try a DB operation.
          // Open database with DB_CREATE only if this is a master
          // database. A client database uses polling to attempt
          // to open the database without DB_CREATE until it is
          // successful.
          // This DB_CREATE polling logic can be simplified under
          // some circumstances. For example, if the application can
          // be sure a database is already there, it would never need
          // to open it with DB_CREATE.
          if (!dbh->ensure_open(app_data.is_master))
               continue;
          try {
               if (count == 0)
                    if (app_data.in_client_sync)
                         log( "Cannot read data during client initialization - please try again.");
                    else
                         print_stocks_size(*dbh);
               else if (!app_data.is_master)
                    log("Can't update at client");
               else {
                    if (truncate)
u_int32_t no_remove;
                    txn = NULL;
cur_env.txn_begin(NULL, &txn, DB_TXN_NOWAIT);
                         try
          (*dbh)->truncate(txn, &no_remove, 0);
// commit
txn->commit(0);
txn = NULL;
} catch (DbException &e) {
std::cout << "Error on txn commit: " << e.what() << std::endl;
                    //     } catch (DbDeadlockException &) {
                    if (txn != NULL)
                         (void)txn->abort();
// std::cout << "Error on txn commit: " << std::endl;
else if (start_loop)
int j=0;
for (int i=start_loop; i<=end_loop; i=i+txn_rate)
//transaction begin
               txn = NULL;
               cur_env.txn_begin(NULL, &txn, 0);
for (j=i; j<=end_loop && j<=(i+txn_rate); j++)
                              Dbt key, value;
     std::string key1, value1;
     std::stringstream sstrm;
     sstrm << "key" << j << ends;
     key1 = sstrm.str();
               key.set_data((void *)key1.c_str());
               key.set_size((u_int32_t)strlen(key1.c_str()));
     sstrm.str("");
     int payload = rand() + j;
                              sstrm << "price" << payload << ends;
     value1 = sstrm.str();
               value.set_data((void *)value1.c_str());
               value.set_size((u_int32_t)strlen(value1.c_str()));
     // Perform the database put
     (*dbh)->put(txn, &key, &value, 0);
                         printf("Kill me !!\n");
                         kill(getpid(), SIGKILL);
                         exit(0);
     try
                              // commit
                    txn->commit(0);
                    txn = NULL;
               } catch (DbException &e) {
                    std::cout << "Error on txn commit: " << e.what() << std::endl;
                         if (loop_rate>0)
                              usleep(txn_rate * 1000 * 1000 / loop_rate);
                    else if (start_loop_d)
int j=0;
for (int i=start_loop_d; i<=end_loop_d; i=i+100)
//transaction begin
               txn = NULL;
               cur_env.txn_begin(NULL, &txn, 0);
for (j=i; j<=end_loop_d && j<=(i+100); j++)
                              Dbt key, value;
     std::string key1, value1;
     std::stringstream sstrm;
     sstrm << "key" << j << ends;
     key1 = sstrm.str();
               key.set_data((void *)key1.c_str());
               key.set_size((u_int32_t)strlen(key1.c_str()));
     // Perform the database delete
     (*dbh)->del(txn, &key, 0);
     try
                              // commit
                    txn->commit(0);
                    txn = NULL;
               } catch (DbException &e) {
                    std::cout << "Error on txn commit: " << e.what() << std::endl;
                    else
                         const char *symbol = token1.c_str();
                         StringDbt key(const_cast<char*>(symbol));
                         const char *price = token2.c_str();
                         StringDbt data(const_cast<char*>(price));
                         (*dbh)->put(NULL, &key, &data, 0);
          } catch (DbDeadlockException &e) {
               log("please retry the operation");
               dbh->close();
          } catch (DbRepHandleDeadException &e) {
               log("please retry the operation");
               dbh->close();
          } catch (DbException &e) {
               if (e.get_errno() == DB_REP_LOCKOUT) {
                    log("please retry the operation");
                    dbh->close();
               } else
                    throw;
          }
     }
     dbh->close();
}
void RepQuoteExample::event_callback(DbEnv* dbenv, u_int32_t which, void *info)
{
     static char buf[256];
     APP_DATA *app = (APP_DATA *)dbenv->get_app_private();
     info = NULL;          /* Currently unused. */
     switch (which) {
     case DB_EVENT_REP_CLIENT:
          app->is_master = 0;
          app->in_client_sync = 1;
          snprintf(buf, sizeof(buf), "%s - %s", progname, "CLIENT");
          //EZ->dbenv->set_errpfx(buf);
          log("DB_EVENT_REP_CLIENT.");
          break;
     case DB_EVENT_REP_MASTER:
          app->is_master = 1;
          app->in_client_sync = 0;
          snprintf(buf, sizeof(buf), "%s - %s", progname, "MASTER");
          //EZ->dbenv->set_errpfx(buf);
          log("DB_EVENT_REP_MASTER.");
          break;
     case DB_EVENT_REP_NEWMASTER:
          log("DB_EVENT_REP_NEWMASTER.");
          app->in_client_sync = 1;
          break;
     case DB_EVENT_REP_PERM_FAILED:
          // Did not get enough acks to guarantee transaction
          // durability based on the configured ack policy. This
          // transaction will be flushed to the master site's
          // local disk storage for durability.
          log("DB_EVENT_REP_PERM_FAILED.");
          log("Insufficient acknowledgements to guarantee transaction durability.");
          break;
     case DB_EVENT_REP_STARTUPDONE:
          app->in_client_sync = 0;
          log("DB_EVENT_REP_STARTUPDONE.");
          break;
     case DB_EVENT_REP_ELECTION_FAILED:
          log("DB_EVENT_REP_ELECTION_FAILED.");
          //g_runner->init(g_config);
          printf("Kill me !!\n");
          kill(getpid(), SIGKILL);
          exit(0);
          break;
     case DB_EVENT_REP_DUPMASTER:
          log("DB_EVENT_REP_DUPMASTER.");
          break;
     default:
          dbenv->errx("ignoring event %d", which);
     }
}
void RepQuoteExample::print_stocks_size(Db *dbp) {
     DB_BTREE_STAT *statp;
     dbp->stat(NULL, &statp, 0);
     log("db_stat");
     cout << "***************************************** >>>>>>>>>>> : database contains " << (u_long)statp->bt_ndata << " records\n";
     // The statistics structure is allocated by the library; release it.
     free(statp);
}
void RepQuoteExample::print_env(DbEnv *dbenv) {
     dbenv->stat_print(DB_STAT_ALL);
}
void RepQuoteExample::print_stocks(Db *dbp) {
     StringDbt key, data;
#define     MAXKEYSIZE     10
#define     MAXDATASIZE     20
     char keybuf[MAXKEYSIZE + 1], databuf[MAXDATASIZE + 1];
     char *kbuf, *dbuf;
     memset(&key, 0, sizeof(key));
     memset(&data, 0, sizeof(data));
     kbuf = keybuf;
     dbuf = databuf;
     DbcAuto dbc(dbp, 0, 0);
     cout << "\tSymbol\tPrice" << endl
          << "\t======\t=====" << endl;
     int no_records = 0;
     for (int ret = dbc->get(&key, &data, DB_FIRST);
          ret == 0;
          ret = dbc->get(&key, &data, DB_NEXT)) {
          key.get_string(&kbuf, MAXKEYSIZE);
          data.get_string(&dbuf, MAXDATASIZE);
          no_records++;
          cout << "\t" << keybuf << "\t" << databuf << endl;
     }
     cout << "********************** NO Records " << no_records << endl;
     cout << endl << flush;
     dbc.close();
}
static void usage() {
     cerr << "usage: " << progname << " -h home -l host:port [-CM]"
     << "[-r host:port][-R host:port]" << endl
     << " [-a all|quorum][-b][-n nsites][-p priority][-v]" << endl;
     cerr << "\t -h home (required; h stands for home directory)" << endl
     << "\t -l host:port (required; l stands for local)" << endl
     << "\t -C or -M (optional; start up as client or master)" << endl
     << "\t -r host:port (optional; r stands for remote; any "
     << "number of these" << endl
     << "\t may be specified)" << endl
     << "\t -R host:port (optional; R stands for remote peer; only "
     << "one of" << endl
     << "\t these may be specified)" << endl
     << "\t -a all|quorum (optional; a stands for ack policy)" << endl
     << "\t -b (optional; b stands for bulk)" << endl
     << "\t -n nsites (optional; number of sites in replication "
     << "group; defaults " << endl
     << "\t     to 0 to try to dynamically compute nsites)" << endl
     << "\t -p priority (optional; defaults to 100)" << endl
     << "\t -v (optional; v stands for verbose)" << endl;
     exit(EXIT_FAILURE);
}
int main(int argc, char **argv) {
     RepConfigInfo config;
     int ch;     // getopt() returns an int
     char *portstr, *tmphost;
     int tmpport;
     bool tmppeer;
     config.no_dummy_wr = false;
     // Extract the command line parameters
     while ((ch = getopt(argc, argv, "E:a:bCh:l:Mn:p:R:r:vw")) != EOF) {
          tmppeer = false;
          switch (ch) {
          case 'a':
               if (strncmp(optarg, "all", 3) == 0)
                    config.ack_policy = DB_REPMGR_ACKS_ALL;
               else if (strncmp(optarg, "quorum", 6) != 0)
                    usage();
               break;
          case 'b':
               config.bulk = true;
               break;
          case 'C':
               config.start_policy = DB_REP_CLIENT;
               break;
          case 'E':
config.start_policy = DB_REP_ELECTION;
break;
          case 'h':
               config.home = optarg;
               break;
          case 'l':
               config.this_host.host = strtok(optarg, ":");
               if ((portstr = strtok(NULL, ":")) == NULL) {
                    cerr << "Bad host specification." << endl;
                    usage();
               }
               config.this_host.port = (unsigned short)atoi(portstr);
               config.got_listen_address = true;
               break;
          case 'M':
               config.start_policy = DB_REP_MASTER;
               break;
          case 'n':
               config.totalsites = atoi(optarg);
               break;
          case 'p':
               config.priority = atoi(optarg);
               break;
          case 'R':
               tmppeer = true; // FALLTHROUGH
          case 'r':
               tmphost = strtok(optarg, ":");
               if ((portstr = strtok(NULL, ":")) == NULL) {
                    cerr << "Bad host specification." << endl;
                    usage();
               }
               tmpport = (unsigned short)atoi(portstr);
               config.addOtherHost(tmphost, tmpport, tmppeer);
               break;
          case 'v':
               config.verbose = true;
               break;
          case 'w':
               config.no_dummy_wr = true;
               //config.priority = 2;
               break;
          case '?':
          default:
               usage();
          }
     }
     // Error check command line.
     if ((!config.got_listen_address) || config.home == NULL)
          usage();
     RepQuoteExample runner;
     g_runner=&runner;
     g_config=&config;
     try {
          runner.init(&config);
          runner.doloop();
     } catch (DbException &dbe) {
          cerr << "Caught an exception during initialization or"
               << " processing: " << dbe.what() << endl;
     }
     runner.terminate();
     return 0;
}
// This is a very simple thread that performs checkpoints at a fixed
// time interval. For a master site, the time interval is one minute
// plus the duration of the checkpoint_delay timeout (30 seconds by
// default.) For a client site, the time interval is one minute.
void *checkpoint_thread(void *args)
{
     DbEnv *env;
     APP_DATA *app;
     int i, ret;
     env = (DbEnv *)args;
     app = (APP_DATA *)env->get_app_private();
     for (;;) {
          // Wait for one minute, polling once per second to see if
          // application has finished. When application has finished,
          // terminate this thread.
          for (i = 0; i < 60; i++) {
               sleep(1);
               if (app->app_finished == 1)
                    return ((void *)EXIT_SUCCESS);
          }
          // Perform a checkpoint.
          // original line
          if ((ret = env->txn_checkpoint(0, 0, 0)) != 0) {
          //if ((ret = env->txn_checkpoint(0, 0, DB_FORCE)) != 0) {
               env->err(ret, "Could not perform checkpoint.\n");
               return ((void *)EXIT_FAILURE);
          }
     }
}
// This is a simple log archive thread. Once per minute, it removes all but
// the most recent 3 logs that are safe to remove according to a call to
// DBENV->log_archive().
// Log cleanup is needed to conserve disk space, but aggressive log cleanup
// can cause more frequent client initializations if a client lags too far
// behind the current master. This can happen in the event of a slow client,
// a network partition, or a new master that has not kept as many logs as the
// previous master.
// The approach in this routine balances the need to mitigate against a
// lagging client by keeping a few more of the most recent unneeded logs
// with the need to conserve disk space by regularly cleaning up log files.
// Use of automatic log removal (DBENV->log_set_config() DB_LOG_AUTO_REMOVE
// flag) is not recommended for replication due to the risk of frequent
// client initializations.
void *log_archive_thread(void *args)
{
     DbEnv *env;
     APP_DATA *app;
     char **begin, **list;
     int i, listlen, logs_to_keep, minlog, ret;
     env = (DbEnv *)args;
     app = (APP_DATA *)env->get_app_private();
     logs_to_keep = 3;
     for (;;) {
          // Wait for one minute, polling once per second to see if
          // application has finished. When application has finished,
          // terminate this thread.
          for (i = 0; i < 60; i++) {
               sleep(1);
               if (app->app_finished == 1)
                    return ((void *)EXIT_SUCCESS);
          }
          // Get the list of unneeded log files.
          if ((ret = env->log_archive(&list, DB_ARCH_ABS)) != 0) {
               env->err(ret, "Could not get log archive list.");
               return ((void *)EXIT_FAILURE);
          }
          if (list != NULL) {
               listlen = 0;
               // Get the number of logs in the list.
               for (begin = list; *begin != NULL; begin++, listlen++);
               // Remove all but the logs_to_keep most recent
               // unneeded log files.
               minlog = listlen - logs_to_keep;
               for (begin = list, i= 0; i < minlog; list++, i++) {
                    if ((ret = unlink(*list)) != 0) {
                         env->err(ret,
                              "logclean: remove %s", *list);
                         env->errx(
                              "logclean: Error remove %s", *list);
                         free(begin);
                         return ((void *)EXIT_FAILURE);
                    }
               }
               free(begin);
          }
     }
}
#define DATABASE_DUMMY "dummy.db"
void create_dummy_db(DB_ENV *env, DB **dbp)
{
     DB_ENV *dbenv=env;
     int ret;
     u_int32_t db_flags;
     if ((ret = db_create(dbp, dbenv, 0)) != 0)
          dbenv->err(dbenv, ret, "create_dummy_db: db_create");
     db_flags = DB_AUTO_COMMIT | DB_CREATE;
     //if ((ret = (*dbp)->open(*dbp,NULL, DATABASE, NULL, DB_BTREE, db_flags, 0)) != 0)
     if ((ret = (*dbp)->open(*dbp,NULL, NULL, DATABASE_DUMMY, DB_BTREE, db_flags, 0)) != 0)
          dbenv->err(dbenv, ret, "create_dummy_db: DB->open");
}
void reopen_dummy_db(DB_ENV *env, DB **dbp)
{
     DB_ENV *dbenv=env;
     int ret;
     u_int32_t db_flags;
     if ((ret = db_create(dbp, dbenv, 0)) != 0)
          dbenv->err(dbenv, ret, "reopen_dummy_db: db_create");
     db_flags = DB_AUTO_COMMIT | DB_CREATE;
     //if ((ret = (*dbp)->open(*dbp,NULL, DATABASE, NULL, DB_BTREE, db_flags, 0)) != 0)
     if ((ret = (*dbp)->open(*dbp,NULL, NULL, DATABASE_DUMMY, DB_BTREE, db_flags, 0)) != 0)
          dbenv->err(dbenv, ret, "reopen_dummy_db: DB->open");
}
void perform_db_operation(DB_ENV *env, DB **dbp, bool bRead)
//main loop
//DB *dbp=NULL;
DB_ENV *dbenv=env;
int ret;
u_int32_t db_flags;
DBT key, data;
char buf[20]="dummy", *rbuf;
rbuf=buf;
if (*dbp == NULL)
create_dummy_db(dbenv, dbp);
if (! bRead)
     memset(&key, 0, sizeof(key));
     memset(&data, 0, sizeof(data));
     key.data = buf;
     key.size = (u_int32_t)strlen(buf);
     data.data = rbuf;
     data.size = (u_int32_t)strlen(rbuf);
     if ((ret = (*dbp)->put(*dbp, NULL, &key, &data, 0)) != 0)
          if (ret == DB_REP_HANDLE_DEAD)
               //create_dummy_db(dbenv, dbp);
               reopen_dummy_db(dbenv, dbp);
               (*dbp)->err(*dbp, ret, "DB->put :");
          else
          if (ret != DB_KEYEXIST)
               (*dbp)->err(*dbp, ret, "perform_db_operation: DB->put");
     else
          DB_BTREE_STAT *statp;
          (*dbp)->stat(*dbp,NULL, &statp, 0);
          std::cout<<"dbp read stats: key#"<< statp->bt_nkeys <<std::endl;
void *dummy_write_thread(void *args)
{
     DbEnv *env;
     APP_DATA *app;
     DB *m_dbp = NULL; // dummy database handle, opened on first use
     env = (DbEnv *)args;
     app = (APP_DATA *)env->get_app_private();
     // Stop once the application has finished, so that the
     // thread_join() in terminate() can return.
     while (app->app_finished != 1) {
          if (! app->no_dummy_wr) {
               if (app->is_master)
                    perform_db_operation(env->get_DB_ENV(),&m_dbp,false);
                    //env->txn_checkpoint(0, 0, DB_FORCE);
               usleep(1 * 1000 * 1000);
          } else {
               if (app->is_master) {
                    //DB *db_quote=g_repquote->get_DB();
                    //perform_db_operation(env->get_DB_ENV(),&db_quote,true);
                    //if (g_repquote)
                    //     g_runner->print_stocks_size(g_repquote);
                    //env->txn_checkpoint(0, 0, DB_FORCE);
                    //perform_db_operation(env->get_DB_ENV(),&m_dbp,false);
                    env->rep_flush();
               }
               usleep(4 * 1000 * 1000);
          }
     }
     return ((void *)EXIT_SUCCESS);
}
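The comments in init() mention the two-site strict majority rule, and my test group has only two sites. As I read the Berkeley DB documentation, with that rule switched off the surviving site of a two-site group may take over as master on its own, which is what allows a duplicate master once the link comes back. The function below is only a simplified model of that vote count under this assumption; can_win_election and its parameters are names I made up for illustration, it is not Berkeley DB code.

```cpp
#include <cassert>

// Simplified model, NOT Berkeley DB code: can a site that currently
// reaches `reachable` sites (itself included) out of `nsites` become
// master? The name and parameters are hypothetical.
bool can_win_election(int nsites, int reachable, bool two_site_strict)
{
    assert(nsites >= 1 && reachable >= 1 && reachable <= nsites);
    // Assumption: without the strict rule, a two-site group lets the
    // surviving site take over on its own.
    if (nsites == 2 && !two_site_strict)
        return true;
    // Otherwise more than half of the group must be reachable.
    return reachable > nsites / 2;
}
```

In this model, can_win_election(2, 1, false) is true for both halves of a partitioned two-site group, so both may end up as master, while can_win_election(2, 1, true) is false and the isolated client stays a client.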
My script to simulate the split brain:
#!/bin/bash
[ -z "$node1" ] && node1=10.10.32.121
[ -z "$node2" ] && node2=10.10.32.91
# SIGKILL (9) cannot be trapped, so it is not listed here.
trap myend 0 1 2 3 6 14 15
myend()
{
     echo "Receive signal to stop test..."
     un_split_brain
     echo "done"
     exit 1
}
split_brain()
{
     echo -n "Split-Brain at node $node..."
     snmpset -m ALL -v 2c -c svil 10.10.0.100 ifAdminStatus.41 i 2 >/dev/null 2>&1
     echo "done"
}
un_split_brain()
{
     echo -n "Undo Split-Brain at node $node..."
     snmpset -m ALL -v 2c -c svil 10.10.0.100 ifAdminStatus.41 i 1 >/dev/null 2>&1
     echo "done"
}
is_slave()
{
     local r=$(ssh root@$1 "tail -2 /tmp/BDB.log" | grep -c CLIENT)
     [ $r -gt 1 ] && ret=1 || ret=0
     return $ret
}
is_master()
{
     local r=$(ssh root@$1 "tail -2 /tmp/BDB.log" | grep -c MASTER)
     [ $r -gt 1 ] && ret=1 || ret=0
     return $ret
}
wait_for_master()
{
     echo -n "Waiting for MASTER at node $node ... "
     is_master $node
     r=$?
     while [ $r -ne 1 ]
     do
          usleep 500000
          is_master $node
          r=$?
          echo -n "."
     done
     echo "done"
}
wait_for_slave()
{
     local r
     local tm
     tm=0
     echo -n "Waiting for SLAVE at node $node ... "
     is_slave $node
     r=$?
     while [ $r -ne 1 ]
     do
          usleep 500000
          is_slave $node
          r=$?
          echo -n "."
          tm=$((tm+1))
          [ $tm -gt 120 ] && break
     done
     # Returns 1 if the node became a slave, 0 on timeout.
     [ $tm -gt 120 ] && ret=0 || ret=1
     echo "done"
     return $ret
}
run_test_split_brain()
{
     local nt
     nt=1
     nfails=0
     x=4
     # Test the node given as argument, node2 by default.
     node=${1:-$node2}
     while ((1))
     do
          printf "*************** TEST [%02d] ********************\n" $nt
          split_brain
          wait_for_master
          x=$((RANDOM%9))
          echo -n " waiting $x sec ..."
          sleep $x
          echo "done"
          un_split_brain
          wait_for_slave
          r=$?
          [ ! $r -eq 1 ] && echo "`date` - test [$nt] - fails ..." || echo "`date` - test [$nt] - OK ."
          [ ! $r -eq 1 ] && nfails=$((nfails+1))
          # Percentage of tests that succeeded so far.
          perc_success=$(echo "100.0 - $nfails / $nt * 100.0" | bc -l)
          echo "************************************************ [% Success test $perc_success % ]"
          nt=$((nt+1))
          x=$((RANDOM%9))
          echo -n " waiting $x sec ..."
          sleep $x
     done
}
run_test_split_brain
Here is the makefile that runs the two environments.
I run:
- make run
and, in another window, sh test_split_brain.sh
node1?=10.10.32.121
node2?=10.10.32.91
nsite?=2
debug?=0
all: RepQuoteExampleEric install
RepConfigInfo.o: RepConfigInfo.cpp RepConfigInfo.h
     g++ -I/usr/local/BerkeleyDB.5.1/include/ -g -O0 -c RepConfigInfo.cpp -o RepConfigInfo.o
RepQuoteExampleEric: RepQuoteExampleEric.cpp RepConfigInfo.o
     g++ -I/usr/local/BerkeleyDB.5.1/include/ -g -O0 RepQuoteExampleEric.cpp RepConfigInfo.o -o RepQuoteExampleEric -L /usr/local/BerkeleyDB.5.1/lib/ -lreadline -lcurses -ldb_cxx
kill:
     -ssh -X root@$(node1) "killall -9 /root/RepQuoteExampleEric"
     -ssh -X root@$(node2) "killall -9 /root/RepQuoteExampleEric"
run: RepQuoteExampleEric kill install clean_env
     ssh -X root@$(node1) "xterm -geom 100x20+100+100 -e \"LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/RepQuoteExampleEric -h /opt/bdb/ -l 2.0.0.110:12345 -r 2.0.0.210:12345 -a quorum -b -n $(nsite) -v | tee /tmp/BDB.log\"" &
     ssh -X root@$(node2) "xterm -geom 100x20+800+100 -e \"LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/RepQuoteExampleEric -h /opt/bdb/ -l 2.0.0.210:12345 -r 2.0.0.110:12345 -a quorum -b -n $(nsite) -v -w | tee /tmp/BDB.log\"" &
run_node2: clean_env2
     ssh -X root@$(node2) "xterm -geom 100x20+800+100 -e \"LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/RepQuoteExampleEric -h /opt/bdb/ -l 2.0.0.210:12345 -r 2.0.0.110:12345 -a quorum -b -n $(nsite) -v -w | tee /tmp/BDB.log\"" &
debug_node2: clean_env2
     ssh -X root@$(node2) "xterm -geom 100x20+800+100 -e \"LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/RepQuoteExampleEric -h /opt/bdb/ -l 2.0.0.210:12345 -r 2.0.0.110:12345 -a quorum -b -n $(nsite) -v -w | tee /tmp/BDB.log\"" &
     sleep 3
     ssh -X root@$(node2) /sbin/pidof RepQuoteExampleEric >/tmp/pid
     ssh -X root@$(node2) ~/kdbg /root/db-5.1.19/examples/cxx/excxx_repquote/RepQuoteExampleEric -p `cat /tmp/pid`
run_debug_node1: RepQuoteExampleEric kill install clean_env
     ssh -X root@$(node1) "xterm -geom 100x20+100+100 -e \"LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/kdbg /root/RepQuoteExampleEric\" " &
     ssh -X root@$(node2) "xterm -geom 100x20+800+100 -e \"LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/RepQuoteExampleEric -h /opt/bdb/ -l 2.0.0.210:12345 -r 2.0.0.110:12345 -a quorum -b -n $(nsite) -v\"" &
run_debug_node2: RepQuoteExampleEric kill install clean_env
     ssh -X root@$(node1) "xterm -geom 100x20+100+100 -e \"LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/RepQuoteExampleEric -h /opt/bdb/ -l 2.0.0.110:12345 -r 2.0.0.210:12345 -a quorum -b -n $(nsite) -v\" " &
     ssh -X root@$(node2) "xterm -geom 100x20+800+100 -e \"LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/kdbg /root/RepQuoteExampleEric\"" &
install: RepQuoteExampleEric
     scp RepQuoteExampleEric root@$(node1):~
     scp RepQuoteExampleEric root@$(node2):~
clean_env: clean_env1 clean_env2
clean_env1:
     ssh -X root@$(node1) rm -rf /opt/bdb/*
clean_env2:
     ssh -X root@$(node2) rm -rf /opt/bdb/*

Sun Fire V490 x 2 servers with Oracle RAC facing Split brain problem

Hi all,
I have 2 Sun Fire V490 servers running Oracle RAC, and they hit a split brain problem. The database instance on one of the nodes went down. The DBA claims it is due to a network problem, but the networks appear to be OK. We use the onboard CE1 interface for the cluster interconnect and CE0 as the public interface.
Has anybody faced this kind of problem? Could this be a hardware or OS patch problem?
I kept a continuous ping running for 24 hours after this last happened, and the output shows no packet loss.
Many thanks in advance.
Ushas Symon

In order to diagnose this properly, you'll need to provide too much detail and far too many log files for a generic discussion forum to handle.
Use your service contract and open a support case.
Because a cluster environment is involved you'll likely end up talking to the cluster support staff.
They can analyze hardware and software errors as well as review whether you configured the systems in a supportable fashion.
Be prepared to make a direct connection to each system and gather data, for example by using the Explorer tool. The technical support staff will tell you what they will actually need.

Split Brain Problem

Dear All,
The RAC database version is 10.2.0.3, and the operating system is SPARC5.10.
The database uses Veritas Sun Cluster.
Voting Disk Size = 10MB
Yesterday it experienced a split brain problem: each node assumed it was the only surviving member of the cluster.
At first, Node 1 hung and the network link was broken, so our system admin decided to restart the server.
After that, the workload on Node 2 increased significantly, and there was nothing we could do in the database.
Fortunately, the Veritas team could handle it by cleaning up the process that was hung.
In my opinion, interconnect communication between node1 and node2 was broken, so each node assumed that it was the only member of the cluster.
To prevent database corruption, Veritas Cluster blocked all changes to the database.
So, would you tell me what actually happened?
Was it related to the heartbeat to the controlfile, considering there was an ORA-600 [2103]?
Best Regards,
abip

Take a look at Bug 5526987
This issue is fixed in
10.2.0.4 (Server Patch Set)
11.1.0.6 (Base Release)
Symptoms:
Internal Error May Occur (ORA-600)
Hang (Involving Shared Resource)
ORA-600 [2103]
Related To:
Automatic Storage Management (ASM)
Description
This problem is introduced in 10.2.0.2 by the fix for bug 4671216.
A file creation rollback in ASM can lead to an ASM
instance hang when the rolling back session tries to get
the FA enqueue in X mode but it is already held. This can
lead to an ORA-600 [2103] on the database instance.
This issue is marked as a notable fix and it is recommended
to install a patch for this issue where ASM is used.
Further details on this issue can be found in Note 468572.1

Split brain syndrome in RAC

As per split brain syndrome handling in Oracle RAC, in case of interconnect failures the master node will evict the other/dead nodes.
Let's say that in a 2-node RAC configuration node 1 is defined as the master node (by some parameter such as load and others). In case of network failures node 1 will terminate node 2 from the cluster.
What happens if the master node, in this case node 1, fails? Which node will terminate node 1, and will node 2 become the master node?

Hi,
It occurs when the instance members in a RAC fail to ping/connect to each other via the private interconnect, but the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. So basically, due to lack of communication, an instance thinks that the other instance it is not able to connect to is down, and that it needs to do something about the situation. The problem is that if we leave these instances running, the same block might get read and updated in these individual instances, and there would be a data integrity issue, as the blocks changed in one instance will not be locked and could be overwritten by another instance. Oracle has efficiently implemented a check for the split brain syndrome.
In RAC, if any node becomes inactive, or if other nodes are unable to ping/connect to a node in the RAC, then the node which first detects that one of the nodes is not accessible will evict that node from the RAC group. E.g. there are 4 nodes in a RAC instance, node 3 becomes unavailable, and node 1 tries to connect to node 3 and finds it not responding; then node 1 will evict node 3 from the RAC group and will leave only Node1, Node2 & Node4 in the RAC group to continue functioning.
The split brain concept can become more complicated in large RAC setups. For example, there are 10 RAC nodes in a cluster, and say 4 nodes are not able to communicate with the other 6. So 2 groups are formed in this 10-node RAC cluster (one group of 4 nodes and the other of 6 nodes). Now the nodes will quickly try to affirm their membership by locking the controlfile; then the node that locks the controlfile will try to check the votes of the other nodes. The group with the largest number of active nodes gets preference and the others are evicted. Moreover, I have only seen this node eviction issue with 1 node getting evicted and the rest functioning fine, so I cannot really testify from experience that this is how it works, but this is the theory behind it.
When a node is evicted, Oracle RAC will usually reboot that node and try to do a cluster reconfiguration to include the evicted node again.
You will see the Oracle error ORA-29740 when there is a node eviction in RAC. There are many reasons for a node eviction, such as the heartbeat not being received by the controlfile, being unable to communicate with the clusterware, etc.
You can also go through Metalink Note ID: 219361.1
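The "largest group wins" rule described above can be sketched in a few lines of shell. This is only an illustration of the idea, not how Oracle implements it: the function name pick_survivor is made up for this example, and the tie-break (the first group listed wins) is an assumption of the sketch.

```sh
# Sketch only: choose the surviving sub-cluster after a partition.
# Arguments are the sizes of the sub-clusters, in order.
# Prints the 1-based position of the largest group, or 0 if no sizes
# were given. On a tie the first group listed wins (an assumption of
# this sketch, not a statement about Oracle's behaviour).
pick_survivor() {
     best_idx=0
     best_size=0
     idx=0
     for size in "$@"; do
          idx=$((idx + 1))
          if [ "$size" -gt "$best_size" ]; then
               best_idx=$idx
               best_size=$size
          fi
     done
     echo "$best_idx"
}
```

For the 10-node example, `pick_survivor 4 6` prints 2: the group of 6 stays and the group of 4 is evicted.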
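A quick way to check whether an instance has logged evictions is to count the ORA-29740 lines in its alert log. The helper below is a minimal sketch: count_evictions is a made-up name, and the log file to pass in depends on your installation, so no path is assumed here.

```sh
# Sketch only: print how many lines of a log file mention ORA-29740.
# Usage: count_evictions /path/to/alert_log   (the path is yours to supply)
count_evictions() {
     # grep -c prints the number of matching lines. It exits non-zero
     # when that number is 0, so "|| true" keeps the exit status clean.
     grep -c 'ORA-29740' "$1" || true
}
```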

How to test "Split brain"

Hi ,
I tried to demonstrate the automatic reboot in abnormal cluster
situations, but I failed:
Setup:
IAS1, IAS2 and IAS3 running on their own Sun with Solaris 8
Load balancing: round robin
Number of sync backups: 1
Restart at abnormal cluster: on
I deployed the HaServlet and everything went fine; session data was
distributed.
I brought down IAS3, which was in the Sync alternate role, unplugged the
network cable of the machine on which IAS3 was running and started IAS3
again.
IAS3 came up as a primary, because it could not detect the other
primary.
So far, according to what I understand, everything is OK.
After plugging the network cable back in, no IAS restarts occurred. The
round robin load balancing worked again and, as a result of having 2 primary
servers, every time I visited ias3 my session data was lost.
Question: Did I miss something in my "Split Brain" experiment?
Version SP3 on Solaris 8, only HaServlet deployed and working nicely
DSync distributed before the experiment.
After restarting ias3 it came up as alternate again, as expected and
everything worked as before.
Any hints are welcome, thanks
Robert
Robert Schrijvers
Javix Training & Software development
e-mail: [email protected]
website: www.javix.nl
phone +31 (0) 629594749

Just a quick question before we go any further: did you have the option to restart the cluster when an abnormal cluster is detected turned on? Sorry, but I needed to ask.

Some quick help needed with certificates and split brain dns.

I run Exchange 2010 and have one CAS server (srv03). I have split brain DNS configured and working in my system. I got a new certificate this year because of the new regulations that won't allow .internal names in the SAN portion of an SSL cert.
I have followed several TIDs on the internet, and still, when I tried to implement it today, the Outlook clients started getting a popup that says [the name on the certificate is invalid or does not match the name of the site]. At the top of this popup is srv03.abccorp.internal, which is what it was before.
The certificate is for mail.abccorp.com and also includes autodiscover.abccorp.com and srv03.abccorp.com.
When I run [Get-clientAccessServer | fl Name,AutoDiscoverServiceInternalUri] the name and the URL are correct and have the .com value.
When I run the test email autoconfiguration from my Outlook icon and look at the log, the Autodiscover URL found through SCP is correct and it says Succeeded at the end. In the results tab, however, the Server, Availability Service and OOF URL are still showing the .internal instead of .com. The Internal OWA, External OWA and the OAB are correctly displaying the .com. What commands do I need to run to change these, as they seem to be the problem?
I wasted a lot of time chasing the autodiscover before I found out about this test in Outlook and realized the autodiscover URL was correct. :-)
I have two days left on my old cert that has both .com and .internal SANs, so I rolled that back into service so the users stop getting messages. Any help would be appreciated.

Hi OTS,
You can run the following command to change the InternalUrl attribute of the EWS virtual directory:
Set-WebServicesVirtualDirectory -Identity "CAS_Server_Name\EWS (Default Web Site)" -InternalUrl https://mail.abccorp.com/ews/exchange.asmx
Best regards,
Niko Cheng
TechNet Community Support

CUC 10.0.1 cluster status stuck in Split Brain Recovery (SBR) on Primary server - HA reports fine.

Hi,
Have a 10.01.11900 CUC cluster and everything is working fine (no one is having issues with voice mail, etc.) but the cluster status reporting is not consistent.
DBreplication is showing 2 on both servers.
Primary unity server cluster status shows Primary/split brain recovery.
HA Unity server cluster status shows Primary/Secondary.
utils diagnose test - everything tests fine except the tomcat_connectors test.
test - tomcat_connectors : Failed - The HTTPS port is not responding to local requests. Please collect all of the Tomcat logs for root cause analysis: file get activelog tomcat/logs/*
We've shut down the HA server and rebooted primary, then waited a while after primary was back up/active before bringing the HA server back up, and it is still the same.
We reset DB replication, with the same result.
On the HA server I made the HA primary and the cluster status flipped to Secondary/Primary. I then made primary the primary again, but the primary server cluster status always shows Split Brain Recovery for the secondary/HA server.
No core dumps on either server and all services are started.
Has anyone seen this before, or have any thoughts? I have a TAC case on this but so far I am in the same boat.
Would the utils cuc cluster renegotiate command help? I did not replace a server, so I don't really want to overwrite data on the publisher server. The issue seems to be with the publisher since HA shows fine, but I'm not sure. I don't want to lose messages etc., so I don't really want to run these commands.
Thanks.

Ok, thanks.
The SRM logs indicate the Connection Digital Networking Replication Agent service is not running, however when I start it it stops right away and the cuReplicator log states digital networking is not enabled.
From SRM Log:
23:47:20.100 |17755,,,SRM,7,<svcmon> checkServiceStatus: started service monitoring
23:47:20.100 |17755,,,SRM,7,<svcmon> Service Status: 1 service(s) not running. Service name(s):
23:47:20.100 |17755,,,SRM,7,<svcmon> Connection Digital Networking Replication Agent
23:47:24.674 |28471,,,SRM,11,<Timer-3> [snd] Type: Heartbeat
From Replicator log:
admin:file tail activelog cuc/diag_CuReplicator_00000049.uc
23:42:59.208 HDR|09/14/2014 ,Significant
23:42:59.208 |28914,,,CuReplicator,0,Digital Networking is not enabled. Replicator will stop now.
There is no digital networking setup to other unity systems, and only one location.
Also, the Server Role Manager can't be restarted from the CLI or the GUI, so it needs either root access or a server reboot.
I compared it to another CUC cluster and deactivated the Digital Networking service and the SRM logs seem happier now, will wait a bit and see if it clears the SBR status up.

Split Brain handling in oracle 10g

How does Oracle 10g deal with split brain? Does it use a voting disk or files, or does it use any of the SCSI protocols?

It uses the CRS voting disk to determine who should and shouldn't continue.

Quick questions (bulk operations, WAN, provisioning, split brain)

Thank you for your assistance with the following questions..
1. Can I use the "disk overflow" feature when using the "partitioned cache" topology? I need to just do puts and gets.
2. Are bulk operations (i.e. bulk put, bulk get, bulk erase) available in all cache topologies? I am interested in bulk put/get when using a "partitioned cache" and when using your "WAN capability" to connect geographically dispersed clusters (Coherence*Extend). Do you guarantee data and event delivery over the WAN?
3. Does the WAN capability (Coherence*Extend) allow multiple clusters to talk with each other (N:N) or does it only support a "star topology" (1 main cluster connected to N other "peripheral" clusters)?
4. As a result of a load bottleneck (i.e. a cache server(s) running too hot) in a "partitioned region" topology, can I add new cache servers/boxes "on-the-fly" to load balance my "client load"? What client load balancing policies are typically recommended (i.e. sticky, random)? How does dynamic provisioning of new servers (e.g., adding more RAM to my distributed cache) affect the hashing of my map regions and my network bandwidth? What kind of performance would I expect if I add one or more new cache members when I was close to having an OOM on one or more of my already running cache servers? Even if I was not close to an OOM, what kind of network spike would I expect and what kind of client operation latency should I expect?
5. Could I have more information to understand your "data location transparency"?
6. How do you solve "split brain" problems (i.e. sets of members get separated and later try to rejoin cluster)? Do you have or is it possible to implement a "quorum-based policy" to decide who should live and who should die? We have seen "split brain" problems with messaging retransmission storms in the past and are looking for options.

Hi user614602,
1 - 3. Yes to all.
4 - 6. Those are quite advanced and open-ended questions (and by no means "quick"). Please contact your sales rep to schedule a conference call to discuss your requirements and how Coherence addresses the issues you raise.
Regards,
Gene

HSRP "Split Brain" on the STP Topology

Hello. I'm a network administrator in my company.
I have a question about HSRP "Split Brain" on the STP Topology.
I am verifying the HSRP down time on the attached network topology,
pinging (ICMP) from the PC to the HSRP virtual IP.
I have found an unexpected scenario: ping goes down when I reboot L2 Core SW 1 and Router 1's HSRP goes Active from Init status.
Why does ping go down?
Shouldn't Router 1's gratuitous ARP be forwarded to L2 Core SW 2?
Sorry to trouble you. Could you please advise?

Thank you for your reply, rejeevh. I retried the "L2 Core SW 1 Down Test".
As a result, I found that my verification scenario was wrong.
Although it is not shown in the figure, in practice both routers have another link, across the Core SWs, to other routers, with OSPF and "redistribute connected" enabled.
"Ping from PC to HSRP Virtual IP" was the wrong scenario; "ping from PC to another OSPF router's interface" is the correct one.
I verified the correct scenario by rebooting L2 Core SW 1. Ping goes down there as well,
but the reason was simple: the return packet from the other router was dropped at SW 1's EtherChannel interface while it was in the STP listening/learning state. (SW 1's other interfaces had Portfast or Portfast Trunk enabled.)
I confirmed that the result improved after enabling Portfast Trunk on the EtherChannel interfaces of SW 2 and SW 1.
Thank you very much for your reply.

Split Brain in RAC

Hi ALL,
I am new to RAC and am just reading about RAC concepts & architecture. I have some idea about GRD, Cache Fusion, etc., but am still not sure about split brain.
Could anyone kindly tell me:
1. What is split brain?
2. Why and when does it happen?
3. What is the solution for split brain?
Regards
Senthil

Hi Senthil,
Regardless of RAC, split brain is a general cluster issue.
So, some general information about your questions:
In a cluster you have servers that run the same resources, in active-active mode (like RAC) or active-passive (like fail safe). In both cases only 1 node is the master, the others are slaves. This is for synchronization, so there will be no cases in which both servers think they can do "whatever they want" without letting the other servers "know" about it. In active-passive it's easier to understand, since the master decides which server will hold the resources. Split brain is the case where two servers in the cluster think they are the master (for example, in active-passive they can decide to start the resources on 2 different servers, in RAC they can decide that a cached block is the current copy and not held by another server, etc.)
The solution is a very complicated thing. Usually in clusters, one of the servers will perform a reboot (eviction) once this is identified.
By the way, this is one of the reasons the clusterware uses 2 heartbeats (voting disk and interconnect): if one of them is down, there won't be a split brain, but using the available resource (interconnect or voting disk) the cluster will decide which server will reboot. Then, if the server starts without the voting disk or interconnect, the clusterware will not start, to prevent a split brain.
Hope it was clear.
Liron Amitzi
Senior DBA consultant
[www.dbsnaps.com]
[www.orbiumsoftware.com]

Split Brain Syndrome

As Oracle says, the voting disk is used for avoiding split brain syndrome.
Can anyone please explain to me in detail how the voting disk is used for avoiding split brain syndrome?
Can I use the Solaris IPMP concept to avoid interconnect failure and split brain syndrome? If yes, then why use a voting disk?
Regards,
Yasser.

The Oracle clusterware is responsible for maintaining cluster health.
In order to do so, the cluster daemons keep track of local and remote health.
Local health means the cluster daemons can use a kernel module to see if the machine 'lost time', which means the system is too busy (and cannot be stable in a cluster). Also, if the machine cannot write to the voting disk, the machine is rebooted to avoid erroneous writes to disks. The rebooting is called 'stonith' (shoot the other node in the head) and the prevention of writing erroneous information is called 'fencing'.
Remote health is determined by sending network packets through the private network, and by writing to the voting disk or voting disks.
Also, the machines in a cluster hold a vote every time the number of machines in the cluster changes, and determine a master. The results of the vote are written to the voting disk, and every machine in the cluster keeps on writing that it is still available in the cluster. That way a dead machine can be detected both through the network and by looking at the voting disk.
If the network between two machines in a cluster is disturbed, the cluster is said to have a 'split brain'.
Because of the voting disk or disks, the split brain can be solved by the master terminating the other machine or machines.
Since Oracle 10g, Oracle Clusterware is required for RAC. The network and the voting disk are mandatory components of Oracle Clusterware.
This means the voting disk is needed, even with IP multipathing (IPMP).
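To make the role of the two heartbeats concrete, here is a minimal shell sketch of the decision a node faces. It is a simplification based only on the description above, not Oracle's actual logic: the function name decide_action and its three outcome labels are made up for this example.

```sh
# Sketch only: what a node does given the state of its two heartbeats.
# Usage: decide_action NET_OK DISK_OK   (1 = healthy, 0 = failed)
# Prints one of:
#   continue  - both heartbeats are healthy
#   arbitrate - interconnect lost but voting disk reachable, so the
#               survivor is decided through the voting disk
#   reboot    - voting disk not reachable, so the node fences itself
decide_action() {
     net_ok=$1
     disk_ok=$2
     if [ "$disk_ok" -ne 1 ]; then
          echo reboot
     elif [ "$net_ok" -ne 1 ]; then
          echo arbitrate
     else
          echo continue
     fi
}
```

`decide_action 0 1` prints arbitrate, which is the case IPMP cannot cover on its own: even with redundant network paths, a node that loses the interconnect still needs the voting disk to find out whether it or its peer should stay up.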
