Bdb Btree Freezing

Hi All,
I have an interesting issue with bdb Btrees.
Here is the setup:
u_int32_t env_flags = DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK;
env = new DbEnv(0);
env->set_msgfile(berkleyLog);
env->set_verbose(DB_VERB_DEADLOCK, 1);
env->open(dir.c_str(), env_flags, 0);
db = new Db(env, 0);
db->set_bt_compare(compare_double);
db->set_flags(DB_DUPSORT);
db->set_pagesize(pageSize);
u_int32_t oFlags = DB_CREATE ;
db->open(NULL, uuid.c_str(), NULL, DB_BTREE, oFlags, 0);
What is happening is that after inserting about 2000 rows, give or take a few hundred, a call to Db->put deadlocks and won't return. CPU load on the box increases to 100%, and from that point on BDB doesn't respond.
This problem was fixed when using independent DBs without the environment by increasing the page size. But now that I have started using environments, it seems to have crept in again. The data being inserted is double key/value pairs. My cache size is 200 MB, and I have tried with and without locking. (This particular application doesn't need locking at the moment.)
This issue occurs on version 5.1.19
Any ideas? Thanks
Michael

Hi Michael,
Thanks for your question. I'll do my best to answer, but a complete reproducible test case would make tracking down the issue much easier.
> What is happening is that after inserting about 2000 rows, give or take a few hundred, a call to Db->put deadlocks and won't return. CPU load on the box increases to 100%, and from that point on BDB doesn't respond.
The behavior you describe matches a hang or infinite loop, rather than a deadlock. In a deadlock the CPU load would generally go to 0.
Can you debug the application when it is in this state? Could you post a stack trace?
I notice that you're using a custom key comparison routine. Please post the content of that routine.
Have you ever killed a process while it's operating on the database that is displaying this behavior? If so, it's possible that the database has been corrupted at some point.
Could you try running the db_stat -CA utility (http://download.oracle.com/docs/cd/E17076_02/html/api_reference/C/db_stat.html) on the environment, to see if there are any locks being held that you wouldn't expect?
Regards,
Alex Gorrod
Oracle Berkeley DB

Similar Messages

  • Bdb btree with DB_INIT_CDB crashes on concurrent write

    Hi,
    I am having issues with bdb and its locking mechanisms.
    The following code results in either a seg fault, or what looks like a deadlock/endless loop:
    #include <iostream>
    #include <vector>
    #include <cstring>
    #include "db_cxx.h"
    #include <boost/thread.hpp>
    using namespace std;

    int
    compare_double(DB *dbp, const DBT *a, const DBT *b){
        double ai, bi;
        memcpy(&ai, a->data, sizeof(double));
        memcpy(&bi, b->data, sizeof(double));
        return (ai > bi ? 1 : ((ai < bi) ? -1 : 0));
    }

    void thread_instance(Db* db, double start){
        double finish = start + 5000;
        for(double x = start; x < finish; x++){
            Dbt key(&x, sizeof(double));
            Dbt ddata(&x, sizeof(double));
            db->put(NULL, &key, &ddata, 0);
        }
    }

    int main(){
        system("rm data/*");
        u_int32_t env_flags = DB_CREATE | DB_INIT_MPOOL | DB_INIT_CDB;
        DbEnv* env = new DbEnv(0);
        env->set_cachesize(0, 2000000, 1);
        env->open("data/", env_flags, 0);
        Db* db = new Db(env, 0);
        db->set_bt_compare(compare_double);
        db->set_flags(DB_DUPSORT);
        db->set_pagesize(32768);
        db->set_dup_compare(compare_double);
        u_int32_t oFlags = DB_CREATE;
        try {
            db->open(NULL, "db", NULL, DB_BTREE, oFlags, 0);
        } catch (DbException &e) {
        } catch (std::exception &e) {
        }
        vector<boost::thread*> threads;
        for(int x = 0; x < 3; x++){
            threads.push_back(new boost::thread(boost::bind(&thread_instance, db, (double)(x * 5000))));
        }
        for(size_t x = 0; x < threads.size(); x++){
            threads[x]->join();
        }
        return 0;
    }
    I have tried DB_INIT_LOCK as well, but with the same results.
    What is going on here?

    I forgot to include the stack trace:
    Program received signal EXC_BAD_ACCESS, Could not access memory.
    Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000019
    [Switching to process 34816]
    0x00000001002e36a7 in __bamc_put ()
    (gdb) ba
    #0 0x00000001002e36a7 in __bamc_put ()
    #1 0x0000000100386689 in __dbc_iput ()
    #2 0x0000000100387a6c in __dbc_put ()
    #3 0x0000000100383092 in __db_put ()
    #4 0x0000000100397888 in __db_put_pp ()
    #5 0x00000001002cee59 in Db::put ()
    #6 0x0000000100001f88 in thread_instance (db=0x1007006c0, start=5000) at src/main.cpp:16
    #7 0x0000000100698254 in thread_proxy ()
    #8 0x00007fff80cb9456 in _pthread_start ()
    #9 0x00007fff80cb9309 in thread_start ()

  • Need help getting data out of Berkeley DB

    Hi All
    I have an old version of Berkeley DB (about 5 years old, I think) that we use at work for storing key/value pairs. We use this to store IDs for data that we later load into a relational database. We have multiple files with key/value pairs. One particular file is very large (11 GB) and has about 100 million entries.
    We are trying to migrate the data from Berkeley DB to a relational database. I was able to get the entries out of most files, but this large file is giving trouble.
    I can only get out about 26 million entries. I have tried Ruby and Perl to get the data out (our main application is in Ruby) and tried multiple approaches, but all hit the limit at about 26.xx million records.
    If anybody has experienced a similar thing and knows a way to get the data out, your help is highly appreciated.
    Thanks
    Harsh D.

    Hi All
    This is for a Berkeley DB version that is at least 5 years old. I do not know the exact version and do not know how to find out. This is not the Java Edition or the XML Edition.
    Below is what I am doing in Ruby:
    db = nil
    options = { "set_pagesize"  => 8 * 1024,
                "set_cachesize" => [0, 8024 * 1024, 0] }
    puts "starting to open db"
    db = BDB::Btree.open(ARGV[0], nil, 0, options)
    if (db.size < 1)
      puts "\nNothing to dump; #{ARGV[0]} is empty."
    end
    puts "progressing with the db"
    myoutput = ARGV[1]
    puts "allocating the output file #{myoutput}"
    f = File.open(myoutput, "w")
    i = 0
    iteration = 0
    puts "starting to iterate the db"
    db.each do |k, v|
      a = k.inspect
      b = v.inspect
      f.puts "#{a}|#{b}"
      i = i + 1
      if (i > 1000000)
        iteration = iteration + 1
        puts "iteration #{iteration}"
        i = 0
      end
    end
    This only outputs about 26.xx million records. I am sure there are more than 50 million entries in the database.
    I also tried some other approaches, but nothing seems to work; I end up getting only 26.xx million entries in the output.
    In some cases, I managed to get it to output more records, but after 26.xx million everything is output as duplicate entries, so they are of no use to me.
    The Ruby is a 32-bit version. I tried this on Windows 7 (64-bit) and also on Red Hat Linux 5 (64-bit).
    Thanks
    Harsh
    We ran db_stat on the ExpId database and below are the results
    ExpId
    53162 Btree magic number
    8 Btree version number
    Big-endian Byte order
    Flags
    2 Minimum keys per-page
    8192 Underlying database page size
    2031 Overflow key/data size
    4 Number of levels in the tree
    151M Number of unique keys in the tree (151263387)
    151M Number of data items in the tree (151263387)
    9014 Number of tree internal pages
    24M Number of bytes free in tree internal pages (68% ff)
    1304102 Number of tree leaf pages
    3805M Number of bytes free in tree leaf pages (64% ff)
    0 Number of tree duplicate pages
    0 Number of bytes free in tree duplicate pages (0% ff)
    0 Number of tree overflow pages
    0 Number of bytes free in tree overflow pages (0% ff)
    0 Number of empty pages
    0 Number of pages on the free list

  • BTREE and duplicate data items: over 300 people read this, nobody answers?

    I have a btree consisting of keys (a 4 byte integer) - and data (a 8 byte integer).
    Both integral values are "most significant byte (MSB) first" since BDB does key compression, though I doubt there is much to compress with such small key size. But MSB also allows me to use the default lexical order for comparison and I'm cool with that.
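    The MSB-first point can be illustrated without BDB (a minimal sketch; `put_u32_be` is a hypothetical helper, not a library function): encoding unsigned integers most-significant-byte first makes bytewise `memcmp` order agree with numeric order, which is exactly why the default lexical comparator suffices.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstring>

    // Encode a 32-bit unsigned integer most-significant-byte first, so that
    // memcmp on the encoded bytes matches numeric order.
    void put_u32_be(uint8_t out[4], uint32_t v) {
        out[0] = uint8_t(v >> 24);
        out[1] = uint8_t(v >> 16);
        out[2] = uint8_t(v >> 8);
        out[3] = uint8_t(v);
    }

    int main() {
        uint8_t a[4], b[4];
        put_u32_be(a, 300);   // bytes: 00 00 01 2C
        put_u32_be(b, 70000); // bytes: 00 01 11 70
        // Lexical (bytewise) order agrees with numeric order.
        assert(std::memcmp(a, b, 4) < 0);
        // A little-endian layout would not preserve this, because the
        // low-order bytes would be compared first.
        return 0;
    }
    ```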
    The special thing about it is that with a given key, there can be a LOT of associated data, thousands to tens of thousands. To illustrate, a btree with a 8192 byte page size has 3 levels, 0 overflow pages and 35208 duplicate pages!
    In other words, my keys have a large "fan-out". Note that I wrote "can", since some keys only have a few dozen or so associated data items.
    So I configure the b-tree for DB_DUPSORT. The default lexical ordering with set_dup_compare is OK, so I don't touch that. I'm getting the data items sorted as a bonus, but I don't need that in my application.
    However, I'm seeing very poor "put (DB_NODUPDATA) performance", due to a lot of disk read operations.
    While there may be a lot of reasons for this anomaly, I suspect BDB spends a lot of time tracking down duplicate data items.
    I wonder if in my case it would be more efficient to have a b-tree with as key the combined (4 byte integer, 8 byte integer) and a zero-length or 1-length dummy data (in case zero-length is not an option).
    I would lose the ability to iterate with a cursor using DB_NEXT_DUP but I could simulate it using DB_SET_RANGE and DB_NEXT, checking if my composite key still has the correct "prefix". That would be a pain in the butt for me, but still workable if there's no other solution.
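    That simulation can be sketched in plain C++ (illustrative only; `make_key` is a hypothetical helper, and a sorted `std::vector` stands in for the Btree's key order): build a 12-byte composite key, 4-byte id then 8-byte value, both MSB-first, and emulate DB_NEXT_DUP by continuing the scan only while the first 4 bytes still match the prefix.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Build a 12-byte composite key: 4-byte id then 8-byte value, both
    // MSB-first, so lexical order sorts by id, then by value within an id.
    std::vector<uint8_t> make_key(uint32_t id, uint64_t value) {
        std::vector<uint8_t> k(12);
        for (int i = 0; i < 4; i++) k[i]     = uint8_t(id    >> (8 * (3 - i)));
        for (int i = 0; i < 8; i++) k[4 + i] = uint8_t(value >> (8 * (7 - i)));
        return k;
    }

    int main() {
        std::vector<std::vector<uint8_t>> keys = {
            make_key(7, 100), make_key(5, 2), make_key(5, 1), make_key(9, 3)
        };
        std::sort(keys.begin(), keys.end()); // lexical order, as the Btree keeps it
        // "Cursor" scan: count entries whose 4-byte prefix is id 5,
        // the equivalent of DB_SET_RANGE followed by DB_NEXT + prefix check.
        uint8_t prefix[4];
        std::memcpy(prefix, make_key(5, 0).data(), 4);
        int count = 0;
        for (const auto &k : keys)
            if (std::memcmp(k.data(), prefix, 4) == 0) count++;
        assert(count == 2);
        return 0;
    }
    ```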
    Another possibility would be to just add all the data integers as a single big giant data blob item associated with a single (unique) key. But maybe this is just doing what BDB does... and would probably exchange "duplicate pages" for "overflow pages"
    Or, the slowdown is a BTREE thing and I could use a hash table instead. In fact, what I don't know is how duplicate pages influence insertion speed. But the BDB source code indicates that in contrast to BTREE the duplicate search in a hash table is LINEAR (!!!) which is a no-no (from hash_dup.c):
         while (i < hcp->dup_tlen) {
              memcpy(&len, data, sizeof(db_indx_t));
              data += sizeof(db_indx_t);
              DB_SET_DBT(cur, data, len);
              /*
               * If we find an exact match, we're done. If in a sorted
               * duplicate set and the item is larger than our test item,
               * we're done. In the latter case, if permitting partial
               * matches, it's not a failure.
               */
              *cmpp = func(dbp, dbt, &cur);
              if (*cmpp == 0)
                   break;
              if (*cmpp < 0 && dbp->dup_compare != NULL) {
                   if (flags == DB_GET_BOTH_RANGE)
                        *cmpp = 0;
                   break;
    What's the expert opinion on this subject?
    Vincent
    Message was edited by:
    user552628

    Hi,
    The special thing about it is that with a given key,
    there can be a LOT of associated data, thousands to
    tens of thousands. To illustrate, a btree with a 8192
    byte page size has 3 levels, 0 overflow pages and
    35208 duplicate pages!
    In other words, my keys have a large "fan-out". Note
    that I wrote "can", since some keys only have a few
    dozen or so associated data items.
    So I configure the b-tree for DB_DUPSORT. The default
    lexical ordering with set_dup_compare is OK, so I
    don't touch that. I'm getting the data items sorted
    as a bonus, but I don't need that in my application.
    However, I'm seeing very poor "put (DB_NODUPDATA)
    performance", due to a lot of disk read operations.
    In general, performance slowly decreases when there are a lot of duplicates associated with a key. For the Btree access method, lookups and inserts have O(log n) complexity (the search time depends on the number of keys stored in the underlying tree). When doing puts with DB_NODUPDATA, leaf pages have to be searched in order to determine whether the data is a duplicate. Thus, given that for each key there is (in most cases) a large number of associated data items (up to thousands or tens of thousands), an impressive number of pages has to be brought into the cache to check against the duplicate criterion.
    Of course, the problem of sizing the cache and the database's pages arises here. These settings should tend toward large values, so that the cache can accommodate large pages (each hosting hundreds of records).
    Setting the cache and the page size to their ideal values is a process of experimenting.
    http://www.oracle.com/technology/documentation/berkeley-db/db/ref/am_conf/pagesize.html
    http://www.oracle.com/technology/documentation/berkeley-db/db/ref/am_conf/cachesize.html
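    A rough back-of-envelope, with assumed numbers (the per-item overhead is a guess for illustration, not Berkeley DB's actual figure), shows why large duplicate sets touch many pages on a DB_NODUPDATA put:

    ```cpp
    #include <cassert>

    int main() {
        // Illustrative estimate only; real per-item page overhead varies.
        const int page_size      = 8192; // bytes per leaf page
        const int key_bytes      = 4;
        const int data_bytes     = 8;
        const int overhead_bytes = 12;   // assumed per-item bookkeeping

        int items_per_page = page_size / (key_bytes + data_bytes + overhead_bytes);
        // Tens of thousands of duplicates under a single key...
        int duplicates    = 35000;
        int pages_touched = (duplicates + items_per_page - 1) / items_per_page;

        assert(items_per_page == 341);
        assert(pages_touched == 103); // ~100 page reads to vet one put, if uncached
        return 0;
    }
    ```

    The point of the arithmetic: if the cache cannot hold those hundred-odd pages for the hot keys, every duplicate check turns into disk reads, which matches the observed slowdown.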
    While there may be a lot of reasons for this anomaly,
    I suspect BDB spends a lot of time tracking down
    duplicate data items.
    I wonder if in my case it would be more efficient to
    have a b-tree with as key the combined (4 byte
    integer, 8 byte integer) and a zero-length or
    1-length dummy data (in case zero-length is not an
    option).
    Indeed, this may be the best alternative, but testing must be done first. Try this approach and provide us with feedback.
    You can have records with a zero-length data portion.
    Also, you could provide more information on whether or not you're using an environment, if so, how did you configure it etc. Have you thought of using multiple threads to load the data ?
    Another possibility would be to just add all the
    data integers as a single big giant data blob item
    associated with a single (unique) key. But maybe this
    is just doing what BDB does... and would probably
    exchange "duplicate pages" for "overflow pages"
    This is a poor approach, since bringing an overflow page into the cache is more time-consuming than bringing in a regular page, so a performance penalty results. Also, processing the entire collection of keys and data implies more work from a programming point of view.
    Or, the slowdown is a BTREE thing and I could use a
    hash table instead. In fact, what I don't know is how
    duplicate pages influence insertion speed. But the
    BDB source code indicates that in contrast to BTREE
    the duplicate search in a hash table is LINEAR (!!!)
    which is a no-no (from hash_dup.c):
    The Hash access method has, as you observed, a linear duplicate search: finding the bucket is O(1), but the search within a duplicate set is proportional to the number of items in it. Combined with the fact that you don't want duplicate data, using the Hash access method may not improve performance.
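    A toy comparison-count model (not BDB's actual code paths; the 35208 figure from the post is reused here simply as an item count) makes the linear-vs-logarithmic gap concrete:

    ```cpp
    #include <cassert>
    #include <vector>

    // Comparisons needed to locate the last of n sorted duplicates:
    // linear scan (as in hash_dup.c) vs binary search over sorted items.
    int main() {
        const int n = 35208;
        std::vector<int> dups(n);
        for (int i = 0; i < n; i++) dups[i] = i;

        // Linear: worst case touches every item.
        int linear_worst = n;

        // Binary: ~log2(n) probes (lower_bound-style search for n - 1).
        int binary_worst = 0;
        for (int lo = 0, hi = n; lo < hi; binary_worst++) {
            int mid = (lo + hi) / 2;
            if (dups[mid] < n - 1) lo = mid + 1; else hi = mid;
        }

        assert(linear_worst == 35208);
        assert(binary_worst <= 17); // log2(35208) is about 15.1
        return 0;
    }
    ```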
    This is a performance/tuning problem and it involves a lot of resources on our part to investigate. If you have a support contract with Oracle, then please don't hesitate to put up your issue on Metalink, or indicate that you want this issue to be taken in private, and we will create an SR for you.
    Regards,
    Andrei

  • Segfault in __lock_get_internal using BDB 4.7.25

    Hi,
    I am having trouble finding the root cause of a segfault. The program generating the fault uses both the bdb and repmgr APIs; the segfault happens in a bdb call.
    Here is a quick run-down of the problem. My test is set up with two nodes. The master node is started first, then queried by a client program. Then a client node is started. It replicates the database successfully, then is queried by the same client program. Each node is asked to perform two database gets; the first completes, the second causes the segfault, but only on the client node.
    Each node is configured the same, except the client node will close and re-open the database after the synchronization is done.
    I would appreciate any insight into what could be causing my problem, as I've noted the segfault occurs during a lock acquisition. The program is multi-threaded, but I enable the database to be thread-safe.
    I've included an example of the API calls made to setup each environment, a backtrace from the client corefile, and the verbose output from both nodes during the run.
    h5. Node Configuration Example
    int master_port = 10001;
    int client_port = 10002;
    DB_ENV *env;
    DB *db;
    int env_flags = DB_CREATE | DB_INIT_LOCK | DB_INIT_LOG | DB_INIT_MPOOL | DB_INIT_TXN | DB_INIT_REP | DB_RECOVER | DB_THREAD;
    int db_flags = DB_CREATE | DB_AUTO_COMMIT | DB_THREAD;
    db_env_create(&env, 0);
    env->set_lk_detect(env, DB_LOCK_DEFAULT);
    if (master)
        env->repmgr_set_local_site(env, "localhost", master_port, 0);
    else
        env->repmgr_set_local_site(env, "localhost", client_port, 0);
    /*
     * The DB_REPMGR_PEER seems useless in this example. But the actual
     * design allows for a client to peer with another client.
     */
    if (master)
        env->repmgr_add_remote_site(env, "localhost", 0, NULL, DB_REPMGR_PEER);
    else
        env->repmgr_add_remote_site(env, "localhost", master_port, NULL, DB_REPMGR_PEER);
    if (master)
        env->open(env, "/tmp/dbs_m", env_flags, 0);
    else
        env->open(env, "/tmp/dbs_c", env_flags, 0);
    db_create(&db, env, 0);
    db->open(db, NULL, "DB", NULL, DB_BTREE, db_flags, 0);
    env->repmgr_start(env, 3, DB_REP_ELECTION);
    h5. GDB backtrace
    GNU gdb 6.8
    Copyright (C) 2008 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law. Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "i686-pc-linux-gnu"...
    Reading symbols from /lib/libpthread.so.0...done.
    Loaded symbols for /lib/libpthread.so.0
    Reading symbols from /lib/libc.so.6...done.
    Loaded symbols for /lib/libc.so.6
    Reading symbols from /lib/ld-linux.so.2...done.
    Loaded symbols for /lib/ld-linux.so.2
    Reading symbols from /lib/libnss_files.so.2...done.
    Loaded symbols for /lib/libnss_files.so.2
    Reading symbols from /lib/libnss_dns.so.2...done.
    Loaded symbols for /lib/libnss_dns.so.2
    Reading symbols from /lib/libresolv.so.2...done.
    Loaded symbols for /lib/libresolv.so.2
    Core was generated by `./dbserver/dbserver bootstrap=localhost:24050 address=localhost:17000 -'.
    Program terminated with signal 11, Segmentation fault.
    [New process 685]
    #0 0x0814239f in __lock_get_internal (lt=0x40140868, sh_locker=0x4032d508, flags=0, obj=0x81f7108, lock_mode=DB_LOCK_READ,
    timeout=0, lock=0xbe7ff5bc) at ../dist/../lock/lock.c:586
    586               OBJECT_LOCK(lt, region, obj, lock->ndx);
    (gdb) bt full
    #0 0x0814239f in __lock_get_internal (lt=0x40140868, sh_locker=0x4032d508, flags=0, obj=0x81f7108, lock_mode=DB_LOCK_READ,
    timeout=0, lock=0xbe7ff5bc) at ../dist/../lock/lock.c:586
         newl = (struct __db_lock *) 0x0
         lp = (struct __db_lock *) 0x40142e48
         env = (ENV *) 0x40140860
         sh_obj = (DB_LOCKOBJ *) 0x0
         region = (DB_LOCKREGION *) 0x40140880
         ip = (DB_THREAD_INFO *) 0x40142e48
         ndx = 3196058196
         part_id = 1074222655
         did_abort = 1073875436
         ihold = 0
         grant_dirty = 1075064392
         no_dd = 0
         ret = 0
         t_ret = 1073875436
         holder = 1075064392
         sh_off = 0
         action = 3196056724
    #1 0x08141da2 in __lock_get (env=0x401407f0, locker=0x4032d508, flags=0, obj=0x81f7108, lock_mode=DB_LOCK_READ,
    lock=0xbe7ff5bc) at ../dist/../lock/lock.c:456
         lt = (DB_LOCKTAB *) 0x40140868
         ret = 0
    #2 0x08181674 in __db_lget (dbc=0x81f7080, action=0, pgno=1075054832, mode=DB_LOCK_READ, lkflags=0, lockp=0xbe7ff5bc)
    at ../dist/../db/db_meta.c:1035
         dbp = (DB *) 0x401407e0
         couple = {{op = DB_LOCK_DUMP, mode = DB_LOCK_NG, timeout = 3196058052, obj = 0x400546b8, lock = {off = 32, ndx = 0,
          gen = 0, mode = DB_LOCK_NG}}, {op = 136380459, mode = 3196057632, timeout = 3196057624, obj = 0xbe7ff9c4, lock = {
          off = 3196057576, ndx = 0, gen = 1073916640, mode = DB_LOCK_NG}}, {op = 1073875800, mode = 3196057972, timeout = 0,
        obj = 0x0, lock = {off = 35, ndx = 66195, gen = 3196057576, mode = 43}}}
         reqp = (DB_LOCKREQ *) 0x0
         txn = (DB_TXN *) 0x0
         env = (ENV *) 0x401407f0
         has_timeout = 0
         i = 0
         ret = -1
    #3 0x080d8f9e in __bam_get_root (dbc=0x81f7080, pg=1075054832, slevel=1, flags=1409, stack=0xbe7ff6a8)
    at ../dist/../btree/bt_search.c:94
         cp = (BTREE_CURSOR *) 0x8259248
         dbp = (DB *) 0x401407e0
         lock = {off = 1073709056, ndx = 510075, gen = 2260372568, mode = 3758112764}
         mpf = (DB_MPOOLFILE *) 0x401407f8
         h = (PAGE *) 0x0
         lock_mode = DB_LOCK_READ
         ret = 89980928
         t_ret = 134764095
    #4 0x080d9407 in __bam_search (dbc=0x81f7080, root_pgno=1075054832, key=0xbe7ffa6c, flags=1409, slevel=1, recnop=0x0,
    ---Type <return> to continue, or q <return> to quit---
    exactp=0xbe7ff8b0) at ../dist/../btree/bt_search.c:203
         t = (BTREE *) 0x401408f8
         cp = (BTREE_CURSOR *) 0x8259248
         dbp = (DB *) 0x401407e0
         lock = {off = 0, ndx = 0, gen = 0, mode = DB_LOCK_NG}
         mpf = (DB_MPOOLFILE *) 0x401407f8
         env = (ENV *) 0x401407f0
         h = (PAGE *) 0x0
         base = 0
         i = 0
         indx = 0
         inp = (db_indx_t *) 0x0
         lim = 0
         lock_mode = DB_LOCK_NG
         pg = 0
         recno = 0
         adjust = 0
         cmp = 0
         deloffset = 0
         ret = 0
         set_stack = 0
         stack = 0
         t_ret = 0
         func = (int (*)(DB *, const DBT *, const DBT *)) 0
    #5 0x0819b1d1 in __bamc_search (dbc=0x81f7080, root_pgno=0, key=0xbe7ffa6c, flags=26, exactp=0xbe7ff8b0)
    at ../dist/../btree/bt_cursor.c:2501
         t = (BTREE *) 0x401408f8
         cp = (BTREE_CURSOR *) 0x8259248
         dbp = (DB *) 0x401407e0
         h = (PAGE *) 0x0
         indx = 0
         inp = (db_indx_t *) 0x0
         bt_lpgno = 0
         recno = 0
         sflags = 1409
         cmp = 0
         ret = 0
         t_ret = 0
    #6 0x08196ff7 in __bamc_get (dbc=0x81f7080, key=0xbe7ffa6c, data=0xbe7ffa50, flags=26, pgnop=0xbe7ff93c)
    at ../dist/../btree/bt_cursor.c:970
         cp = (BTREE_CURSOR *) 0x8259248
         dbp = (DB *) 0x401407e0
         mpf = (DB_MPOOLFILE *) 0x401407f8
         orig_pgno = 0
         orig_indx = 0
         exact = 1075236764
         newopd = 1
    ---Type <return> to continue, or q <return> to quit---
         ret = 136272648
    #7 0x0816f6fc in __dbc_get (dbc_arg=0x81f7080, key=0xbe7ffa6c, data=0xbe7ffa50, flags=26) at ../dist/../db/db_cam.c:700
         dbp = (DB *) 0x401407e0
         dbc = (DBC *) 0x0
         dbc_n = (DBC *) 0x81f7080
         opd = (DBC *) 0x0
         cp = (DBC_INTERNAL *) 0x8259248
         cp_n = (DBC_INTERNAL *) 0x0
         mpf = (DB_MPOOLFILE *) 0x401407f8
         env = (ENV *) 0x401407f0
         pgno = 0
         indx_off = 0
         multi = 0
         orig_ulen = 0
         tmp_flags = 0
         tmp_read_uncommitted = 0
         tmp_rmw = 0
         type = 64 '@'
         key_small = 0
         ret = 136268720
         t_ret = -1098909244
    #8 0x0817a1ac in __db_get (dbp=0x8258bb0, ip=0x0, txn=0x0, key=0xbe7ffa6c, data=0xbe7ffa50, flags=26)
    at ../dist/../db/db_iface.c:760
         dbc = (DBC *) 0x81f7080
         mode = 0
         ret = 0
         t_ret = 1075208764
    #9 0x08179f6c in __db_get_pp (dbp=0x8258bb0, txn=0x0, key=0xbe7ffa6c, data=0xbe7ffa50, flags=0)
    at ../dist/../db/db_iface.c:684
         ip = (DB_THREAD_INFO *) 0x0
         env = (ENV *) 0x81f4bb0
         mode = 0
         handle_check = 1
         ignore_lease = 0
         ret = 0
         t_ret = 1073880126
         txn_local = 0
    #10 0x0804c7a8 in _get (database=0x81f37a8, txn=0x0, query=0x821d1a0, callName=0x81cc1b7 "GET") at ../dbserver/database.c:503
         k = {data = 0x81f67e8, size = 22, ulen = 22, dlen = 0, doff = 0, app_data = 0x0, flags = 0}
         v = {data = 0x821d2a0, size = 255, ulen = 255, dlen = 0, doff = 0, app_data = 0x0, flags = 256}
         err = 136263592
         __PRETTY_FUNCTION__ = "_get"
    #11 0x0804c8f0 in get (database=0x81f37a8, txn_id=3, query=0x821d1a0) at ../dbserver/database.c:643
         txn = (DB_TXN *) 0x416a7db4
    #12 0x08053f1d in workerThreadMain (threadArg=0x7c87b) at ../dbserver/server.c:433
         type = ISProtocol_IDENTIFYMASTER
         class = <value optimized out>
    ---Type <return> to continue, or q <return> to quit---
         s = {context = 0x8211930, protocol = 0x8211980, socketToClient = 3, query = 0x821d1a0, deleteClientSocket = ISFalse,
      abortActiveTxn = ISFalse}
         __PRETTY_FUNCTION__ = "workerThreadMain"
    #13 0x4001d0ba in pthread_start_thread () from /lib/libpthread.so.0
    No symbol table info available.
    #14 0x400fad6a in clone () from /lib/libc.so.6
    No symbol table info available.
    h5. Verbose Master Node Log
    REP_UNDEF: rep_start: Found old version log 14
    CLIENT: db rep_send_message: msgv = 5 logv 14 gen = 0 eid -1, type newclient, LSN [0][0] nogroup nobuf
    CLIENT: starting election thread
    CLIENT: elect thread to do: 0
    CLIENT: repmgr elect: opcode 0, finished 0, master -2
    CLIENT: elect thread to do: 1
    CLIENT: Start election nsites 1, ack 1, priority 100
    CLIENT: Tallying VOTE1[0] (2147483647, 1)
    CLIENT: Beginning an election
    CLIENT: db rep_send_message: msgv = 5 logv 14 gen = 0 eid -1, type vote1, LSN [1][8702] nogroup nobuf
    CLIENT: Tallying VOTE2[0] (2147483647, 1)
    CLIENT: Counted my vote 1
    CLIENT: Skipping phase2 wait: already got 1 votes
    CLIENT: Got enough votes to win; election done; winner is 2147483647, gen 0
    CLIENT: Election finished in 0.039845000 sec
    CLIENT: Election done; egen 2
    CLIENT: Ended election with 0, sites 0, egen 2, flags 0x200a01
    CLIENT: Election done; egen 2
    CLIENT: New master gen 2, egen 3
    MASTER: rep_start: Old log version was 14
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type newmaster, LSN [1][8702] nobuf
    MASTER: restore_prep: No prepares. Skip.
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][8702]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][8785]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][8821]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][8904] perm
    MASTER: rep_send_function returned: -30975
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][8948]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9034]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9115]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9202]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9287] flush perm
    MASTER: rep_send_function returned: -30975
    MASTER: election thread is exiting
    MASTER: accepted a new connection
    MASTER: handshake introduces unknown site localhost:10002
    MASTER: EID 0 is assigned for site localhost:10002
    MASTER: db rep_process_message: msgv = 5 logv 14 gen = 0 eid 0, type newclient, LSN [0][0] nogroup
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type newsite, LSN [0][0] nobuf
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type newmaster, LSN [1][9367] nobuf
    MASTER: NEWSITE info from site localhost:10002 was already known
    MASTER: db rep_process_message: msgv = 5 logv 14 gen = 0 eid 0, type master_req, LSN [0][0] nogroup
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type newmaster, LSN [1][9367] nobuf
    MASTER: db rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type verify_req, LSN [1][8658]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type verify, LSN [1][8658] nobuf
    MASTER: db rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type update_req, LSN [0][0]
    MASTER: Walk_dir: Getting info for dir: db
    MASTER: Walk_dir: Dir db has 10 files
    MASTER: Walk_dir: File 0 name: __db.001
    MASTER: Walk_dir: File 1 name: __db.002
    MASTER: Walk_dir: File 2 name: __db.rep.gen
    MASTER: Walk_dir: File 3 name: __db.rep.egen
    MASTER: Walk_dir: File 4 name: __db.003
    MASTER: Walk_dir: File 5 name: __db.004
    MASTER: Walk_dir: File 6 name: __db.005
    MASTER: Walk_dir: File 7 name: __db.006
    MASTER: Walk_dir: File 8 name: log.0000000001
    MASTER: Walk_dir: File 9 name: ROUTER
    MASTER: Walk_dir: File 0 (of 1) ROUTER at 0x40356018: pgsize 4096, max_pgno 1
    MASTER: Walk_dir: Getting info for in-memory named files
    MASTER: Walk_dir: Dir INMEM has 0 files
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type update, LSN [1][9367] nobuf
    MASTER: db rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type page_req, LSN [0][0]
    MASTER: page_req: file 0 page 0 to 1
    MASTER: page_req: Open 0 via mpf_open
    MASTER: sendpages: file 0 page 0 to 1
    MASTER: sendpages: 0, page lsn [0][1]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type page, LSN [1][9367] nobuf resend
    MASTER: sendpages: 0, lsn [1][9367]
    MASTER: sendpages: 1, page lsn [1][9202]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type page, LSN [1][9367] nobuf resend
    MASTER: sendpages: 1, lsn [1][9367]
    MASTER: db rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log_req, LSN [1][28]
    MASTER: [1][28]: LOG_REQ max lsn: [1][9367]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][28] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][91] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][4266] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8441] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8535] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8575] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8658] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8702] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8785] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8821] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8904] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8948] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9034] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9115] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9202] nobuf resend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9287] nobuf resend
    MASTER: db rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type all_req, LSN [1][9287]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9287] nobuf resend logend
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9367]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9469]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9548] flush perm
    MASTER: will await acknowledgement: need 1
    MASTER: rep_send_function returned: 110
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9628]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9696]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9785] flush perm
    MASTER: will await acknowledgement: need 1
    MASTER: got ack [1][9548](2) from site localhost:10002
    MASTER: got ack [1][9785](2) from site localhost:10002
    MASTER: got ack [1][9287](2) from site localhost:10002
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type start_sync, LSN [1][9785] nobuf
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9865]
    MASTER: db rep_send_message: msgv = 5 logv 14 gen = 2 eid -1, type log, LSN [1][9948] flush perm
    MASTER: will await acknowledgement: need 1
    MASTER: got ack [1][9948](2) from site localhost:10002
    EOF on connection from site localhost:10002
    Verbose Client Node Log:
    REP_UNDEF: EID 0 is assigned for site localhost:10001
    REP_UNDEF: rep_start: Found old version log 14
    CLIENT: db2 rep_send_message: msgv = 5 logv 14 gen = 0 eid -1, type newclient, LSN [0][0] nogroup nobuf
    CLIENT: starting election thread
    CLIENT: elect thread to do: 0
    CLIENT: repmgr elect: opcode 0, finished 0, master -2
    CLIENT: init connection to site localhost:10001 with result 115
    CLIENT: handshake from connection to localhost:10001
    CLIENT: handshake with no known master to wake election thread
    CLIENT: reusing existing elect thread
    CLIENT: repmgr elect: opcode 3, finished 0, master -2
    CLIENT: elect thread to do: 3
    CLIENT: db2 rep_send_message: msgv = 5 logv 14 gen = 0 eid -1, type newclient, LSN [0][0] nogroup nobuf
    CLIENT: repmgr elect: opcode 0, finished 0, master -2
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type newsite, LSN [0][0]
    CLIENT: db2 rep_send_message: msgv = 5 logv 14 gen = 0 eid -1, type master_req, LSN [0][0] nogroup nobuf
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type newmaster, LSN [1][9367]
    CLIENT: Election done; egen 1
    CLIENT: Updating gen from 0 to 2 from master 0
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type newmaster, LSN [1][9367]
    CLIENT: egen: 3. rep version 5
    CLIENT: db2 rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type verify_req, LSN [1][8658] any nobuf
    CLIENT: sending request to peer
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type verify, LSN [1][8658]
    CLIENT: db2 rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type update_req, LSN [0][0] nobuf
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type update, LSN [1][9367]
    CLIENT: Update setup for 1 files.
    CLIENT: Update setup: First LSN [1][28].
    CLIENT: Update setup: Last LSN [1][9367]
    CLIENT: Walk_dir: Getting info for dir: db2
    CLIENT: Walk_dir: Dir db2 has 11 files
    CLIENT: Walk_dir: File 0 name: __db.001
    CLIENT: Walk_dir: File 1 name: __db.002
    CLIENT: Walk_dir: File 2 name: __db.rep.gen
    CLIENT: Walk_dir: File 3 name: __db.rep.egen
    CLIENT: Walk_dir: File 4 name: __db.003
    CLIENT: Walk_dir: File 5 name: __db.004
    CLIENT: Walk_dir: File 6 name: __db.005
    CLIENT: Walk_dir: File 7 name: __db.006
    CLIENT: Walk_dir: File 8 name: log.0000000001
    CLIENT: Walk_dir: File 9 name: ROUTER
    CLIENT: Walk_dir: File 0 (of 1) ROUTER at 0x40356018: pgsize 4096, max_pgno 1
    CLIENT: Walk_dir: File 10 name: __db.rep.db
    CLIENT: Walk_dir: Getting info for in-memory named files
    CLIENT: Walk_dir: Dir INMEM has 0 files
    CLIENT: Next file 0: pgsize 4096, maxpg 1
    CLIENT: db2 rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type page_req, LSN [0][0] any nobuf
    CLIENT: sending request to peer
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type page, LSN [1][9367] resend
    CLIENT: PAGE: Received page 0 from file 0
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type page, LSN [1][9367] resend
    CLIENT: PAGE: Write page 0 into mpool
    CLIENT: rep_write_page: Calling fop_create for ROUTER
    CLIENT: PAGE_GAP: pgno 0, max_pg 1 ready 0, waiting 0 max_wait 0
    CLIENT: FILEDONE: have 1 pages. Need 2.
    CLIENT: PAGE: Received page 1 from file 0
    CLIENT: PAGE: Write page 1 into mpool
    CLIENT: PAGE_GAP: pgno 1, max_pg 1 ready 1, waiting 0 max_wait 0
    CLIENT: FILEDONE: have 2 pages. Need 2.
    CLIENT: NEXTFILE: have 1 files. RECOVER_LOG now
    CLIENT: NEXTFILE: LOG_REQ from LSN [1][28] to [1][9367]
    CLIENT: db2 rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type log_req, LSN [1][28] any nobuf
    CLIENT: sending request to peer
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][28] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][91] resend
    CLIENT: rep_apply: Set apply_th 1
    CLIENT: rep_apply: Decrement apply_th 0
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][4266] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8441] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8535] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8575] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8658] resend
    CLIENT: Returning NOTPERM [1][8658], cmp = 1
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8702] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8785] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8821] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8904] resend
    CLIENT: Returning NOTPERM [1][8904], cmp = 1
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][8948] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9034] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9115] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9202] resend
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9287] resend
    CLIENT: Returning NOTPERM [1][9287], cmp = 1
    CLIENT: rep_apply: Set apply_th 1
    CLIENT: rep_apply: Decrement apply_th 0
    CLIENT: Returning LOGREADY up to [1][9287], cmp = 0
    CLIENT: Election done; egen 3
    Recovery starting from [1][28]
    Recovery complete at Fri Jul 31 10:11:33 2009
    Maximum transaction ID 80000002 Recovery checkpoint [0][0]
    CLIENT: db2 rep_send_message: msgv = 5 logv 14 gen = 2 eid 0, type all_req, LSN [1][9287] any nobuf
    CLIENT: sending request to peer
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9287] resend logend
    CLIENT: Start-up is done [1][9287]
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9367]
    CLIENT: rep_apply: Set apply_th 1
    CLIENT: rep_apply: Decrement apply_th 0
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9469]
    CLIENT: rep_apply: Set apply_th 1
    CLIENT: rep_apply: Decrement apply_th 0
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9548] flush
    CLIENT: rep_apply: Set apply_th 1
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9628]
    CLIENT: rep_apply: Decrement apply_th 0
    CLIENT: Returning ISPERM [1][9548], cmp = 0
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9696]
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9785] flush
    CLIENT: Returning NOTPERM [1][9785], cmp = 1
    CLIENT: rep_apply: Set apply_th 1
    CLIENT: rep_apply: Decrement apply_th 0
    CLIENT: Returning ISPERM [1][9785], cmp = 0
    CLIENT: Returning ISPERM [1][9287], cmp = -1
    CLIENT: election thread is exiting
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type start_sync, LSN [1][9785]
    CLIENT: ALIVE: Completed sync [1][9785]
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9865]
    CLIENT: rep_apply: Set apply_th 1
    CLIENT: rep_apply: Decrement apply_th 0
    CLIENT: db2 rep_process_message: msgv = 5 logv 14 gen = 2 eid 0, type log, LSN [1][9948] flush
    CLIENT: rep_apply: Set apply_th 1
    CLIENT: rep_apply: Decrement apply_th 0
    CLIENT: Returning ISPERM [1][9948], cmp = 0
    Regards,
    Chris

    I was able to track this issue down to a usage error: I was making a DB API call from within a callback, which violates the API's re-entrancy assumptions.

  • LOG_FILE_NOT_FOUND bug possible in current BDB JE?

    I've seen references to the LOG_FILE_NOT_FOUND bug in older BDB JE versions (4.x and 5.x up to 5.0.34); however, I seem to be hitting something similar with 5.0.48.
    I have a non-transactional, deferred-write DB that seems to have gotten itself into an inconsistent state. It was fine loading several million records, but after ~8 hours of operation, bailed out with:
    com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 5.0.55) /tmp/data/index fetchTarget of 0x9f1/0x24d34eb parent IN=44832 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0xdcf/0x5a96c91 lastLoggedVersion=0xdcf/0x5a96c91 parent.getDirty()=true state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed.
         at com.sleepycat.je.tree.IN.fetchTarget(IN.java:1429)
         at com.sleepycat.je.tree.BIN.fetchTarget(BIN.java:1251)
         at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2229)
         at com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1434)
         at com.sleepycat.je.Cursor.searchInternal(Cursor.java:2716)
         at com.sleepycat.je.Cursor.searchAllowPhantoms(Cursor.java:2576)
         at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:2430)
         at com.sleepycat.je.Cursor.search(Cursor.java:2397)
         at com.sleepycat.je.Database.get(Database.java:1042)
         at com.xxxx.db.BDBCalendarStorageBackend.indexCalendar(BDBCalendarStorageBackend.java:95)
         at com.xxxx.indexer.TicketIndexer.indexDeltaLogs(TicketIndexer.java:201)
         at com.xxxx.indexer.DeltaLogLoader.run(DeltaLogLoader.java:87)
    Caused by: java.io.FileNotFoundException: /tmp/data/index/000009f1.jdb (No such file or directory)
         at java.io.RandomAccessFile.open(Native Method)
         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:101)
         at com.sleepycat.je.log.FileManager$6.<init>(FileManager.java:1282)
         at com.sleepycat.je.log.FileManager.openFileHandle(FileManager.java:1281)
         at com.sleepycat.je.log.FileManager.getFileHandle(FileManager.java:1147)
         at com.sleepycat.je.log.LogManager.getLogSource(LogManager.java:1102)
         at com.sleepycat.je.log.LogManager.getLogEntry(LogManager.java:808)
         at com.sleepycat.je.log.LogManager.getLogEntryAllowInvisibleAtRecovery(LogManager.java:772)
         at com.sleepycat.je.tree.IN.fetchTarget(IN.java:1412)
         ... 11 more
    Subsequent opens/uses of the DB pretty much instantly yield the same error. I tried upgrading to 5.0.55 (hence the version in the output above) but still get the same error.
    As a recovery attempt, I used DbDump to try to dump the DB, but it failed with a similar error. Salvage mode let me dump it successfully; however, reloading it into a clean environment by programmatically running DbLoad.load() (so I can set up my env) caused the following error after about 30% of the DB had been restored:
    Exception in thread "main" com.sleepycat.je.EnvironmentFailureException: (JE 5.0.55) Node 11991 should have been split before calling insertEntry UNEXPECTED_STATE: Unexpected internal state, may have side effects. fetchTarget of 0x25/0x155a822 parent IN=2286 IN class=com.sleepycat.je.tree.IN lastFullVersion=0x3e/0x118d8f6 lastLoggedVersion=0x3e/0x118d8f6 parent.getDirty()=false state=0
         at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:376)
         at com.sleepycat.je.tree.IN.insertEntry1(IN.java:2326)
         at com.sleepycat.je.tree.IN.insertEntry(IN.java:2296)
         at com.sleepycat.je.tree.BINDelta.reconstituteBIN(BINDelta.java:216)
         at com.sleepycat.je.tree.BINDelta.reconstituteBIN(BINDelta.java:144)
         at com.sleepycat.je.log.entry.BINDeltaLogEntry.getIN(BINDeltaLogEntry.java:53)
         at com.sleepycat.je.log.entry.BINDeltaLogEntry.getResolvedItem(BINDeltaLogEntry.java:43)
         at com.sleepycat.je.tree.IN.fetchTarget(IN.java:1422)
         at com.sleepycat.je.tree.Tree.searchSubTreeUntilSplit(Tree.java:1786)
         at com.sleepycat.je.tree.Tree.searchSubTreeSplitsAllowed(Tree.java:1729)
         at com.sleepycat.je.tree.Tree.searchSplitsAllowed(Tree.java:1296)
         at com.sleepycat.je.tree.Tree.findBinForInsert(Tree.java:2205)
         at com.sleepycat.je.dbi.CursorImpl.putInternal(CursorImpl.java:834)
         at com.sleepycat.je.dbi.CursorImpl.put(CursorImpl.java:779)
         at com.sleepycat.je.Cursor.putAllowPhantoms(Cursor.java:2243)
         at com.sleepycat.je.Cursor.putNoNotify(Cursor.java:2200)
         at com.sleepycat.je.Cursor.putNotify(Cursor.java:2117)
         at com.sleepycat.je.Cursor.putNoDups(Cursor.java:2052)
         at com.sleepycat.je.Cursor.putInternal(Cursor.java:2020)
         at com.sleepycat.je.Database.putInternal(Database.java:1324)
         at com.sleepycat.je.Database.put(Database.java:1194)
         at com.sleepycat.je.util.DbLoad.loadData(DbLoad.java:544)
         at com.sleepycat.je.util.DbLoad.load(DbLoad.java:414)
         at com.xxxx.db.BDBCalendarStorageBackend.loadBDBDump(BDBCalendarStorageBackend.java:254)
         at com.xxxx.cli.BDBTool.run(BDBTool.java:49)
         at com.xxxx.cli.AbstractBaseCommand.execute(AbstractBaseCommand.java:114)
         at com.xxxx.cli.BDBTool.main(BDBTool.java:69)
    The only other slightly exotic thing I'm using is a custom partial BTree comparator; however, it quite happily loaded/updated tens of millions of records for hours before the FileNotFound error cropped up, so it seems unlikely to be the cause.
    Any ideas?
    Thanks in advance,
    fb.

    Thanks heaps to Mark for working through this with me.
    You're welcome. Thanks for following up and explaining it for the benefit of others. And I'm very glad it wasn't a JE bug!
    My solution is to switch to using a secondary database for providing differentiated "uniqueness" vs "ordering".
    An index for uniqueness may be a good solution. But as you said in email, it adds significant overhead (memory and disk). This overhead can be minimized by keeping your keys (primary and secondary) as small as possible, and enabling key prefixing.
    I'd also like to point out that adding a secondary isn't always the best choice. For example, if the number of keys with the same C1 value is fairly small, another way of checking for uniqueness (when inserting) is to iterate over them, looking for a match on C1:C3. The cost of this iteration may be less than the cost of maintaining a uniqueness index. To make this work, you'll have to use Serializable isolation during the iteration, to prevent another thread from inserting a key in that range.
    If you're pushing the performance limits of your hardware, it may be worth trying more than one such approach and comparing the performance. If performance is not a big concern, then the additional index is the simplest approach to get right.
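The iterate-and-check alternative Mark describes can be sketched with a plain in-memory sorted map standing in for the BTree. The `C1|C2` composite key layout and the separator characters are hypothetical illustrations; in the real database this loop would be a cursor range scan run under Serializable isolation so no other thread can insert into the range mid-check:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class UniquenessScan {
    // Hypothetical key layout: "C1|C2" sorts by C1 then C2; the C3 value being
    // checked for uniqueness is stored as the record's data.
    // Before inserting, scan every entry whose key carries the same C1 prefix
    // and reject the insert if any of them already holds the same C3.
    static boolean isDuplicate(NavigableMap<String, String> db, String c1, String c3) {
        // subMap over [c1 + "|", c1 + "~") covers all keys with this C1 prefix,
        // since '|' (0x7C) < '~' (0x7E) and no key byte sorts between them here.
        for (String value : db.subMap(c1 + "|", c1 + "~").values()) {
            if (value.equals(c3)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> db = new TreeMap<>();
        db.put("alpha|001", "x");
        db.put("alpha|002", "y");
        db.put("beta|001", "x");
        System.out.println(isDuplicate(db, "alpha", "y")); // true: alpha already has C3=y
        System.out.println(isDuplicate(db, "alpha", "z")); // false
    }
}
```

As Mark notes, whether this beats a secondary index depends on how many keys share a C1 value: the scan cost grows with the duplicate count, while the index cost is paid on every write.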
    --mark

  • Error in generating BdB:tree

    Hi,
    I don't have much experience using BDBs (or Perl), but currently I'm working on storing a huge collection of data into one via Perl. I'm using a combination of Recnos and Btrees, and I recently ran into a problem: when I try to retrieve a tied handle to a BDB subdatabase, I get the following error:
    No such file or directory subname: unexpected file type or format
    Where subname is the db that I'm querying for.
    When I ran db_verify against the bdb, it returned the following message:
    db_verify: Subdatabase entry references page 11188 of invalid type 5
    db_verify: DB->verify: daily_percs_log.db: DB_VERIFY_BAD: Database verification failed
    which tells me that the data in the bdb is probably corrupt.
    Would anyone have any idea why this is happening?
    The program that I wrote isn't multi-threaded, so I shouldn't have to worry about using transactions, right?

    Hi,
    For recoverability you should use transactions and logging. In addition, unless you're using the subdatabases (logical databases in a single physical file) in read-only mode, you need to enable locking. Even if you're running single-threaded, locking is needed because there may be conflicts during page allocation.
    Enabling the transaction, logging and locking subsystems implies using a database environment; moreover, if any of the subdatabases in the file is opened for update, all of the subdatabases in the file must share a memory pool (the memory pool subsystem needs to be enabled).
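    Andrei's checklist (transactions, logging, locking, and a shared memory pool, all inside an environment) maps onto environment flags roughly as follows. This is only a sketch, written in the Java base API used elsewhere in this document since the same DB_INIT_* flags apply in every binding; the environment path is hypothetical:

```java
// Sketch: enabling the subsystems Andrei lists.
EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setAllowCreate(true);
envConfig.setTransactional(true);       // DB_INIT_TXN   -- transactions
envConfig.setInitializeLogging(true);   // DB_INIT_LOG   -- write-ahead logging
envConfig.setInitializeLocking(true);   // DB_INIT_LOCK  -- needed even single-threaded
envConfig.setInitializeCache(true);     // DB_INIT_MPOOL -- shared by all subdatabases
Environment env = new Environment(new java.io.File("/path/to/env"), envConfig);
```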
    Here is more information on subdatabases:
    http://www.oracle.com/technology/documentation/berkeley-db/db/ref/am/opensub.html
    The Reference Guide is a good place to get information on architecting a transactional application:
    http://www.oracle.com/technology/documentation/berkeley-db/db/ref/toc.html
    Now, since you don't have any logs around to recover from, and probably no recent snapshot on top of which to replay the last changes, the solution (if you cannot recreate the databases from scratch) is a salvage dump (-r or -R) followed by a reload of the data:
    http://www.oracle.com/technology/documentation/berkeley-db/db/ref/dumpload/utility.html
    http://www.oracle.com/technology/documentation/berkeley-db/db/utility/db_dump.html
    http://www.oracle.com/technology/documentation/berkeley-db/db/utility/db_load.html
    Regards,
    Andrei

  • BDB read performance problem: lock contention between GC and VM threads

    Problem: BDB read performance degrades badly once the size of the BDB crosses roughly 20GB. At that point it takes more than one hour to read/delete/add 200K keys.
    Of these 200K keys, about 15-30K are new; that number should eventually come down, and after a point there should be no new keys at all.
    Application:
    Transactional Data Store application. A single-threaded process that reads one key's data, deletes the data, and adds new data. The keys are really small (20 bytes) and the data is large (grows from 1KB to 100KB).
    On one machine, I have a total of 3 processes running, each accessing its own BDB on a separate RAID 1+0 drive. So, according to me, there should really be no disk I/O wait slowing down the reads.
    After a point (past 20GB), there are about 4-5 million keys in my BDB, and the data associated with each key can be anywhere between 1KB and 100KB. Eventually every key will have 100KB of data associated with it.
    Hardware:
    16 core Intel Xeon, 96GB of RAM, 8 drive, running 2.6.18-194.26.1.0.1.el5 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
    BDB config: BTREE
    bdb version: 4.8.30
    bdb cache size: 4GB
    bdb page size: experimented with 8KB, 64KB.
    3 processes, each process accesses its own BDB on a separate RAIDed(1+0) drive.
    envConfig.setAllowCreate(true);
    envConfig.setTxnNoSync(ourConfig.asynchronous);
    envConfig.setThreaded(true);
    envConfig.setInitializeLocking(true);
    envConfig.setLockDetectMode(LockDetectMode.DEFAULT);
    When writing to BDB (asynchronous transactions):
    TransactionConfig tc = new TransactionConfig();
    tc.setNoSync(true);
    When reading from BDB (allow reading from uncommitted pages):
    CursorConfig cc = new CursorConfig();
    cc.setReadUncommitted(true);
    BDB stats: BDB size 49GB
    $ db_stat -m
    3GB 928MB Total cache size
    1 Number of caches
    1 Maximum number of caches
    3GB 928MB Pool individual cache size
    0 Maximum memory-mapped file size
    0 Maximum open file descriptors
    0 Maximum sequential buffer writes
    0 Sleep after writing maximum sequential buffers
    0 Requested pages mapped into the process' address space
    2127M Requested pages found in the cache (97%)
    57M Requested pages not found in the cache (57565917)
    6371509 Pages created in the cache
    57M Pages read into the cache (57565917)
    75M Pages written from the cache to the backing file (75763673)
    60M Clean pages forced from the cache (60775446)
    2661382 Dirty pages forced from the cache
    0 Dirty pages written by trickle-sync thread
    500593 Current total page count
    500593 Current clean page count
    0 Current dirty page count
    524287 Number of hash buckets used for page location
    4096 Assumed page size used
    2248M Total number of times hash chains searched for a page (2248788999)
    9 The longest hash chain searched for a page
    2669M Total number of hash chain entries checked for page (2669310818)
    0 The number of hash bucket locks that required waiting (0%)
    0 The maximum number of times any hash bucket lock was waited for (0%)
    0 The number of region locks that required waiting (0%)
    0 The number of buffers frozen
    0 The number of buffers thawed
    0 The number of frozen buffers freed
    63M The number of page allocations (63937431)
    181M The number of hash buckets examined during allocations (181211477)
    16 The maximum number of hash buckets examined for an allocation
    63M The number of pages examined during allocations (63436828)
    1 The max number of pages examined for an allocation
    0 Threads waited on page I/O
    0 The number of times a sync is interrupted
    Pool File: lastPoints
    8192 Page size
    0 Requested pages mapped into the process' address space
    2127M Requested pages found in the cache (97%)
    57M Requested pages not found in the cache (57565917)
    6371509 Pages created in the cache
    57M Pages read into the cache (57565917)
    75M Pages written from the cache to the backing file (75763673)
    $ db_stat -l
    0x40988 Log magic number
    16 Log version number
    31KB 256B Log record cache size
    0 Log file mode
    10Mb Current log file size
    856M Records entered into the log (856697337)
    941GB 371MB 67KB 112B Log bytes written
    2GB 262MB 998KB 478B Log bytes written since last checkpoint
    31M Total log file I/O writes (31624157)
    31M Total log file I/O writes due to overflow (31527047)
    97136 Total log file flushes
    686 Total log file I/O reads
    96414 Current log file number
    4482953 Current log file offset
    96414 On-disk log file number
    4482862 On-disk log file offset
    1 Maximum commits in a log flush
    1 Minimum commits in a log flush
    160KB Log region size
    195 The number of region locks that required waiting (0%)
    $ db_stat -c
    7 Last allocated locker ID
    0x7fffffff Current maximum unused locker ID
    9 Number of lock modes
    2000 Maximum number of locks possible
    2000 Maximum number of lockers possible
    2000 Maximum number of lock objects possible
    160 Number of lock object partitions
    0 Number of current locks
    1218 Maximum number of locks at any one time
    5 Maximum number of locks in any one bucket
    0 Maximum number of locks stolen by for an empty partition
    0 Maximum number of locks stolen for any one partition
    0 Number of current lockers
    8 Maximum number of lockers at any one time
    0 Number of current lock objects
    1218 Maximum number of lock objects at any one time
    5 Maximum number of lock objects in any one bucket
    0 Maximum number of objects stolen by for an empty partition
    0 Maximum number of objects stolen for any one partition
    400M Total number of locks requested (400062331)
    400M Total number of locks released (400062331)
    0 Total number of locks upgraded
    1 Total number of locks downgraded
    0 Lock requests not available due to conflicts, for which we waited
    0 Lock requests not available due to conflicts, for which we did not wait
    0 Number of deadlocks
    0 Lock timeout value
    0 Number of locks that have timed out
    0 Transaction timeout value
    0 Number of transactions that have timed out
    1MB 544KB The size of the lock region
    0 The number of partition locks that required waiting (0%)
    0 The maximum number of times any partition lock was waited for (0%)
    0 The number of object queue operations that required waiting (0%)
    0 The number of locker allocations that required waiting (0%)
    0 The number of region locks that required waiting (0%)
    5 Maximum hash bucket length
    $ db_stat -CA
    Default locking region information:
    7 Last allocated locker ID
    0x7fffffff Current maximum unused locker ID
    9 Number of lock modes
    2000 Maximum number of locks possible
    2000 Maximum number of lockers possible
    2000 Maximum number of lock objects possible
    160 Number of lock object partitions
    0 Number of current locks
    1218 Maximum number of locks at any one time
    5 Maximum number of locks in any one bucket
    0 Maximum number of locks stolen by for an empty partition
    0 Maximum number of locks stolen for any one partition
    0 Number of current lockers
    8 Maximum number of lockers at any one time
    0 Number of current lock objects
    1218 Maximum number of lock objects at any one time
    5 Maximum number of lock objects in any one bucket
    0 Maximum number of objects stolen by for an empty partition
    0 Maximum number of objects stolen for any one partition
    400M Total number of locks requested (400062331)
    400M Total number of locks released (400062331)
    0 Total number of locks upgraded
    1 Total number of locks downgraded
    0 Lock requests not available due to conflicts, for which we waited
    0 Lock requests not available due to conflicts, for which we did not wait
    0 Number of deadlocks
    0 Lock timeout value
    0 Number of locks that have timed out
    0 Transaction timeout value
    0 Number of transactions that have timed out
    1MB 544KB The size of the lock region
    0 The number of partition locks that required waiting (0%)
    0 The maximum number of times any partition lock was waited for (0%)
    0 The number of object queue operations that required waiting (0%)
    0 The number of locker allocations that required waiting (0%)
    0 The number of region locks that required waiting (0%)
    5 Maximum hash bucket length
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Lock REGINFO information:
    Lock Region type
    5 Region ID
    __db.005 Region name
    0x2accda678000 Region address
    0x2accda678138 Region primary address
    0 Region maximum allocation
    0 Region allocated
    Region allocations: 6006 allocations, 0 failures, 0 frees, 1 longest
    Allocations by power-of-two sizes:
    1KB 6002
    2KB 0
    4KB 0
    8KB 0
    16KB 1
    32KB 0
    64KB 2
    128KB 0
    256KB 1
    512KB 0
    1024KB 0
    REGION_JOIN_OK Region flags
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Lock region parameters:
    524317 Lock region region mutex [0/9 0% 5091/47054587432128]
    2053 locker table size
    2053 object table size
    944 obj_off
    226120 locker_off
    0 need_dd
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Lock conflict matrix:
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Locks grouped by lockers:
    Locker Mode Count Status ----------------- Object ---------------
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Locks grouped by object:
    Locker Mode Count Status ----------------- Object ---------------
    Diagnosis:
    I'm seeing way too much lock contention on the Java Garbage Collector threads and also the VM thread when I strace my Java process, and I don't understand the behavior.
    We are spending more than 95% of the time trying to acquire locks, and I don't know what these locks are. Any info here would help.
    Earlier I thought the overflow pages were the problem, since the 100KB data size exceeds all overflow page limits. So I switched to duplicate keys, chunking my data to fit within the overflow page limits.
    Now I don't see any overflow pages in my system, but I still see bad BDB read performance.
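    The chunking workaround described above (splitting each large value into duplicate records small enough to stay on normal pages) can be sketched as follows. The 4000-byte chunk size is a hypothetical figure chosen to fit comfortably on an 8KB page; a real store would derive it from the page size minus per-record overhead:

```java
import java.util.ArrayList;
import java.util.List;

public class ValueChunker {
    // Hypothetical chunk size: small enough that each duplicate record
    // fits on a regular leaf page instead of spilling to overflow pages.
    static final int CHUNK_SIZE = 4000;

    // Split a large value into fixed-size chunks, each to be stored as a
    // duplicate record under the original key.
    static List<byte[]> split(byte[] value) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < value.length; off += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, value.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(value, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    // Reassemble the original value from its chunks, read back in order.
    static byte[] join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] c : chunks) total += c.length;
        byte[] out = new byte[total];
        int off = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, off, c.length);
            off += c.length;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] value = new byte[100_000]; // a 100KB record, as in the post
        List<byte[]> chunks = split(value);
        System.out.println(chunks.size());       // 25
        System.out.println(join(chunks).length); // 100000
    }
}
```

    Note that with DB_DUPSORT, duplicates sort by byte value rather than insertion order, so a real implementation would prefix each chunk with a fixed-width sequence number before storing it, and strip that prefix when reassembling.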
    $ strace -c -f -p 5642 --->(607 times the lock timed out, errors)
    Process 5642 attached with 45 threads - interrupt to quit
    % time     seconds  usecs/call     calls    errors syscall
    98.19    7.670403        2257      3398       607 futex
     0.84    0.065886           8      8423           pread
     0.69    0.053980        4498        12           fdatasync
     0.22    0.017094           5      3778           pwrite
     0.05    0.004107           5       808           sched_yield
     0.00    0.000120          10        12           read
     0.00    0.000110           9        12           open
     0.00    0.000089           7        12           close
     0.00    0.000025           0      1431           clock_gettime
     0.00    0.000000           0        46           write
     0.00    0.000000           0         1         1 stat
     0.00    0.000000           0        12           lseek
     0.00    0.000000           0        26           mmap
     0.00    0.000000           0        88           mprotect
     0.00    0.000000           0        24           fcntl
    100.00    7.811814                 18083       608 total
    The above stats show that there is too much time spent locking (futex calls), and I don't understand that because
    the application is really single-threaded. I have turned on asynchronous transactions, so the writes might be
    flushed asynchronously in the background, but spending that much time locking and timing out seems wrong.
    So there is possibly something I'm not setting, or something odd in how the JVM behaves on my box.
    I grep-ed for futex calls in one of my strace log snippets and see that a VM thread grabbed the mutex the maximum
    number of times (223), followed by the garbage collector threads. The following are the lock counts and thread PIDs
    within the process:
    These are the GC threads (each thread has grabbed the lock roughly 85 times on average):
      86 [8538]
      85 [8539]
      91 [8540]
      91 [8541]
      92 [8542]
      87 [8543]
      90 [8544]
      96 [8545]
      87 [8546]
      97 [8547]
      96 [8548]
      91 [8549]
      91 [8550]
      80 [8552]
    "VM Periodic Task Thread" prio=10 tid=0x00002aaaf4065000 nid=0x2180 waiting on condition (Main problem??)
     223 [8576] ==> grabbing a lock 223 times -- not sure why this is happening…
    "pool-2-thread-1" prio=10 tid=0x00002aaaf44b7000 nid=0x21c8 runnable [0x0000000042aa8000] -- main worker thread
       34 [8648] (the main thread grabs the futex only 34 times, compared to the other threads)
    The load average seems OK, though my system thinks it has very little memory left, which I
    think is because a lot of memory is being used for the file system cache?
    top - 23:52:00 up 6 days, 8:41, 1 user, load average: 3.28, 3.40, 3.44
    Tasks: 229 total, 1 running, 228 sleeping, 0 stopped, 0 zombie
    Cpu(s): 3.2%us, 0.9%sy, 0.0%ni, 87.5%id, 8.3%wa, 0.0%hi, 0.1%si, 0.0%st
    Mem: 98999820k total, 98745988k used, 253832k free, 530372k buffers
    Swap: 18481144k total, 1304k used, 18479840k free, 89854800k cached
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    8424 rchitta 16 0 7053m 6.2g 4.4g S 18.3 6.5 401:01.88 java
    8422 rchitta 15 0 7011m 6.1g 4.4g S 14.6 6.5 528:06.92 java
    8423 rchitta 15 0 6989m 6.1g 4.4g S 5.7 6.5 615:28.21 java
    $ java -version
    java version "1.6.0_21"
    Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
    Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
    Maybe I should make my application a Concurrent Data Store application, as there is really only one thread doing the writes and reads. But I would like
    to understand why my process is spending so much time in locking.
    Can I try any other options? How do I prevent such heavy locking from happening? Has anyone seen this kind of behavior? Maybe this is
    all normal; I'm pretty new to using BDB.
    If there is a way to disable locking, that would also work, as there is only one thread really doing all the work.
    Should I disable the file system cache? One issue is that my application does not utilize the cache very well: once I visit a key, I don't visit that
    key again for a very long time, so it is very possible that the key has to be read from disk again.
    It is possible that I'm thinking about this completely wrong, focusing too much on locking behavior when the problem is elsewhere.
    Any thoughts/suggestions are welcome. Your help on this is much appreciated.
    Thanks,
    Rama

    Hi,
    Looks like you're using BDB, not BDB JE, and this is the BDB JE forum. Could you please repost here?:
    Berkeley DB
    Thanks,
    mark

  • Warming up File System Cache for BDB Performance

    Hi,
    We are using BDB DPL - JE package for our application.
    With our current machine configuration, we have
    1) 64 GB RAM
    2) 40-50 GB -- Berkeley DB data size
    To warm up the file system cache, we cat the .jdb files to /dev/null (to minimize disk access)
    e.g
         // Read all jdb files in the directory; a shell is required for the
         // glob expansion and the output redirection to take effect
         p = Runtime.getRuntime().exec(new String[] { "sh", "-c", "cat " + dirPath + "*.jdb > /dev/null 2>&1" });
    Our application checks whether new data is available every 15 minutes. If new data is available, it clears all old references and loads the new data, along with running cat *.jdb > /dev/null.
    I would like to know whether something like this can improve BDB read performance, and if not, whether there is a better method to warm up the file system cache?
    Thanks,

    We've done a lot of performance testing with how to best utilize memory to maximize BDB performance.
    You'll get the best and most predictable performance by having everything in the DB cache. If the on-disk size of 40-50GB that you mention includes the default 50% utilization, then it should be able to fit. I probably wouldn't use a JVM larger than 56GB and a database cache percentage larger than 80%. But this depends a lot on the size of the keys and values in the database. The larger the keys and values, the closer the DB cache size will be to the on disk size. The preload option that Charles points out can pull everything into the cache to get to peak performance as soon as possible, but depending on your disk subsystem this still might take 30+ minutes.
    If everything does not fit in the DB cache, then your best bet is to devote as much memory as possible to the file system cache. You'll still need a large enough database cache to store the internal nodes of the btree databases. For our application and a dataset of this size, this would mean a JVM of about 5GB and a database cache percentage around 50%.
    I would also experiment with using CacheMode.EVICT_LN or even CacheMode.EVICT_BIN to reduce the pressure on the garbage collector. If you have something in the file system cache, you'll get reasonably fast access to it (maybe 25-50% as fast as if it's in the database cache, whereas pulling it from disk is 1-5% as fast), so unless you have very high locality between requests you might not want to put it into the database cache. What we found was that data was pulled in from disk, put into the DB cache, stayed there long enough to be promoted during GC to the old generation, and then was evicted from the DB cache. This long-lived garbage put a lot of strain on the garbage collector and led to very high stop-the-world GC times. If your application doesn't have latency requirements, then this might not matter as much to you. By setting the cache mode for a database to CacheMode.EVICT_LN, you effectively tell BDB not to put the value (the leaf node, or LN) into the cache.
    Relying on the file system cache is more unpredictable unless you control everything else that happens on the system since it's easy for parts of the BDB database to get evicted. To keep this from happening, I would recommend reading the files more frequently than every 15 minutes. If the files are in the file system cache, then cat'ing them should be fast. (During one test we ran, "cat *.jdb > /dev/null" took 1 minute when the files were on disk, but only 8 seconds when they were in the file system cache.) And if the files are not all in the file system cache, then you want to get them there sooner rather than later. By the way, if you're using Linux, then you can use "echo 1 > /proc/sys/vm/drop_caches" to clear out the file system cache. This might come in handy during testing. Something else to watch out for with ZFS on Solaris is that sequentially reading a large file might not pull it into the file system cache. To prevent the cache from being polluted, it assumes that sequentially reading through a large file doesn't imply that you're going to do a lot of random reads in that file later, so "cat *.jdb > /dev/null" might not pull the files into the ZFS cache.
    That sums up our experience with using the file system cache for BDB data, but I don't know how much of it will translate to your application.

  • Questions in partial key matches of Btree

    Hi,
    For the Btree access method, partial key matches and range searches can be done through a cursor with the DB_SET_RANGE flag specified. It returns "the smallest record in the database greater than or equal to the supplied key". So it seems to support only range searches like ">" or ">=". If I want to do range searches like "<" or "<=", what shall I do? Does BDB support them?
    Thanks!

    Sorry, I made a mistake. The cursor is just a position. Once I have reached the correct position, I can move the cursor forward or backward freely.

  • BDB- Recno multiple key/data pairs retrieval

    Hey,
    I am new to BDB and have just started to work with Berkeley DB (ver. 5.3.15) using the Linux C API.
    I set up a simple Recno DB which is populated with the sequential keys 1, 2, 3, ..., 100. Records are variable-length, although I am limiting them to a maximum size.
    Below are the environment and DB open flags I am using:
    dbenv->open(dbenv, DB_HOME_DIR, DB_SYSTEM_MEM | DB_INIT_LOCK | DB_CREATE | DB_INIT_MPOOL, 0)
    dbp->open(dbp, NULL, NULL,DATABASE_NAME, DB_RECNO, DB_CREATE, 0664))
    Single record get/put or using cursor to iterate over the all DB works well.
    However, I would like to retrieve multiple records in a single get call.
    These records can be non-sequential.
    For example, retrieving 3 records with the keys 4, 89, 90. I prefer the bulk buffer to be as minimal as possible (avoiding unnecessary stack or heap memory allocation).
    I have read a few examples about using bulk retrieval, though I couldn't find any example of a recno bulk get on multiple specified keys.
    From what I figured out till now, it seems that I should use:
    Get flags: DB_SET_RECNO, DB_MULTIPLE_KEY. And the macros: DB_MULTIPLE_INIT and DB_MULTIPLE_RECNO_NEXT to iterate over a bulk buffer received.
    But I couldn't figure out where and how I should specify the list of keys.
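    Until that part is pinned down, the caller-side loop that DB_MULTIPLE_INIT and DB_MULTIPLE_RECNO_NEXT produce can be sketched without BDB itself. The buffer layout below (record number, length, data bytes, repeated, terminated by a zero record number) is a simplified assumption for illustration only; the real BDB bulk buffer keeps an offset index at its tail, but the consumption pattern has the same shape:

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical flat bulk-buffer layout: 4-byte recno, 4-byte length,
     * then the data bytes; a recno of 0 terminates the buffer.  This is a
     * sketch of the iteration pattern, not the real BDB wire format. */
    static size_t walk_bulk(const unsigned char *p,
                            unsigned recnos[], char data[][16])
    {
        size_t n = 0;
        for (;;) {
            unsigned recno, len;
            memcpy(&recno, p, 4);
            if (recno == 0)
                break;                    /* end-of-buffer marker */
            memcpy(&len, p + 4, 4);
            memcpy(data[n], p + 8, len);  /* copy the record's bytes out */
            data[n][len] = '\0';
            recnos[n++] = recno;
            p += 8 + len;                 /* advance to the next entry */
        }
        return n;
    }

    int main(void)
    {
        /* Hand-built buffer holding recnos 4 ("ip") and 89 ("gtw"). */
        unsigned char buf[64] = {0};
        unsigned r4 = 4, r89 = 89, len2 = 2, len3 = 3;
        memcpy(buf, &r4, 4);       memcpy(buf + 4, &len2, 4);
        memcpy(buf + 8, "ip", 2);
        memcpy(buf + 10, &r89, 4); memcpy(buf + 14, &len3, 4);
        memcpy(buf + 18, "gtw", 3);

        unsigned recnos[4];
        char data[4][16];
        size_t n = walk_bulk(buf, recnos, data);
        assert(n == 2 && recnos[0] == 4 && recnos[1] == 89);
        assert(strcmp(data[0], "ip") == 0 && strcmp(data[1], "gtw") == 0);
        return 0;
    }
    ```

    Note that, as documented, a bulk get fills such a buffer starting from a single cursor position; it does not accept a caller-supplied list of arbitrary keys.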
    Besides, the BDB documentation says: "For DB_SET_RECNO to be specified, the underlying database must be of type Btree, and it must have been created with the DB_RECNUM flag."
    Does opening the DB with the DB_RECNO flag imply that the underlying database is a Btree? If I create a Btree instead of a recno database, wouldn't I lose access performance?
    I would appreciate it if anyone could supply some guidelines or an example to help me figure out how to retrieve multiple key/data pairs from a recno DB.
    Thanks in advance
    Kimel

    I am checking BDB to see if it can suit my needs (mostly performance-wise).
    It should work on a simple home router device and hold router runtime information (static/dynamic).
    This information should be accessible to processes in the system, which can write or read the device info.
    The DB is not required to be persistent and is recreated on every reboot (memory only).
    I believe the DB will hold no more than 200 parameters at most.
    DB access rate is around 30 records per sec (write/read)...
    Currently, I am considering either BDB or plain Linux shared memory (random access + a semaphore).
    There are pros and cons to each...
    If I choose BDB, I will use the in-memory DB and recno, due to access performance considerations.
    Getting back to my question: I would like to be able to read a list of parameters in a single get call.
    To use recno, I will give every parameter a unique ID (1 - IP, 2 - subnet, 3 - default gateway).
    e.g.: IP, subnet, default gateway (get keys 1, 2, 3).
    Hope this gives you the relevant info.
    Thanks

  • How to write BTree entries to distinct pages

    Hi All,
    I am writing concurrent transaction tests for my Java library, which uses a Berkeley DB BTree. The intention of these tests is to assert that my library provides the advertised levels of isolation by using the underlying Berkeley DB locking correctly. Here is a simple example test:
    --- Java ---
    String setup = "setup: declare(X,Y,Z) ";
    String writeY = "writeY: temp=X Y=1 write|commit";
    String writeZ = "writeZ: temp=X Z=1 write|commit";
    new ScenarioRunner(setup, writeY, writeZ).run();
    In this test, I am simply checking that no conflicts occurred (shared reads, non-overlapping writes). However, to make the test work, I need to make sure that the entries (X, Y, Z) are created on separate pages.
    So how does one accomplish this?
    My current (painful) solution is to create a set of test values and then select those values that appear on different pages to be used as X, Y, Z in the test. This is accomplished by reading the logs (using db4.6_printlog).
    Your Ideas?
    Thanks in Advance,
    Roberto
    Edited by: Roberto Faria on Apr 20, 2009 2:59 PM

    Hi Roberto,
    Just in case you are still looking for a way to resolve this:
    Suppose that you have the database page size set to 512B. The size of an item (key or data item) that can be stored on a leaf page will be approximately 128B (usually calculated as page_size/4). Any item exceeding this size will be forced onto an overflow page.
    [http://www.oracle.com/technology/documentation/berkeley-db/db/ref/am_conf/pagesize.html|http://www.oracle.com/technology/documentation/berkeley-db/db/ref/am_conf/pagesize.html]
    [http://www.oracle.com/technology/documentation/berkeley-db/db/ref/am_misc/diskspace.html|http://www.oracle.com/technology/documentation/berkeley-db/db/ref/am_misc/diskspace.html]
    The actual formula for calculating the size at which an item overflows looks like this: overflow_item_size = (page_size - page_overhead) / (minimum_no_of_keys * 2) - item_overhead * 2
    Remember that on a Btree database page the minimum number of keys that can be stored is 2 (each record requires 2 slots: one for the key, one for the data).
    Now, you can pad your key and data items so that they extend up to this limit. You can set your page size to 512B, pad the items so that they extend up to 111B (based on the formula mentioned above), and store the records you need with some dummy records in between, such as (key | data):
    keyX<pad_bytes> | dataForKeyX<pad_bytes>
    keyXX<pad_bytes> | dataForKeyXX<pad_bytes> (dummy record)
    keyZ<pad_bytes> | dataForKeyZ<pad_bytes>
    keyZZ<pad_bytes> | dataForKeyZZ<pad_bytes> (dummy record)
    The idea is that, since BDB uses a lexicographical comparison routine for keys by default, each record you need will land on a page together with an ignorable/dummy record, so you should end up with the keys you need on different leaf pages.
    You'll have to adjust the number of padding bytes according to the page size and the size of your key/data items.
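    A quick way to sanity-check the padding size is to evaluate the formula directly. The overhead constants below (26B of page overhead, 5B per item) are assumptions chosen to reproduce the 111B figure mentioned above and will vary by BDB version; treat this as a sketch of the arithmetic, not the authoritative internals:

    ```c
    #include <assert.h>

    /* Sizing arithmetic from the formula above, read as
     *   (page_size - page_overhead) / (min_keys * 2) - item_overhead * 2.
     * The overhead values passed in are illustrative assumptions. */
    static unsigned max_leaf_item(unsigned page_size, unsigned page_overhead,
                                  unsigned min_keys, unsigned item_overhead)
    {
        return (page_size - page_overhead) / (min_keys * 2)
               - item_overhead * 2;
    }

    int main(void)
    {
        /* With a 512B page, assumed 26B page overhead and 5B per-item
         * overhead, items may grow to ~111B before being forced to
         * overflow pages. */
        assert(max_leaf_item(512, 26, 2, 5) == 111);
        return 0;
    }
    ```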
    Regards,
    Andrei

  • BDB Lib Crash (4.8.24)

    I'm using BDB 4.8.25 on a 64-bit CPU with multi-threaded access, but the library crashes when I use db->del(). If I don't use db->del() and just use put and get, there is no problem. The stack is:
    0x0000000000629e11 in __memp_fget (dbmfp=0x9c8730, pgnoaddr=0x4820b3b4, ip=0x0, txn=0x0, flags=0, addrp=0x4820b360)
    at ../dist/../mp/mp_fget.c:250
    #1 0x000000000067ffff in __bam_get_root (dbc=0x9c8a80, pg=1, slevel=1, flags=1409, stack=0x4820b514)
    at ../dist/../btree/bt_search.c:116
    #2 0x0000000000680659 in __bam_search (dbc=0x9c8a80, root_pgno=1, key=0x4820bd70, flags=1409, slevel=1, recnop=0x0,
    exactp=0x4820b864) at ../dist/../btree/bt_search.c:290
    #3 0x0000000000668d2b in __bamc_search (dbc=0x9c8a80, root_pgno=0, key=0x4820bd70, flags=27, exactp=0x4820b864)
    at ../dist/../btree/bt_cursor.c:2785
    #4 0x0000000000663878 in __bamc_get (dbc=0x9c8a80, key=0x4820bd70, data=0x4820bd40, flags=27, pgnop=0x4820b91c)
    at ../dist/../btree/bt_cursor.c:1088
    #5 0x00000000005dc94f in __dbc_iget (dbc=0x9c8a80, key=0x4820bd70, data=0x4820bd40, flags=27) at ../dist/../db/db_cam.c:934
    #6 0x00000000005dc379 in __dbc_get (dbc=0x9c8a80, key=0x4820bd70, data=0x4820bd40, flags=27) at ../dist/../db/db_cam.c:755
    #7 0x00000000005e4400 in __db_get (dbp=0x9c80a0, ip=0x0, txn=0x0, key=0x4820bd70, data=0x4820bd40, flags=27)
    at ../dist/../db/db_iface.c:778
    #8 0x00000000005e4184 in __db_get_pp (dbp=0x9c80a0, txn=0x0, key=0x4820bd70, data=0x4820bd40, flags=0)
    at ../dist/../db/db_iface.c:693
    #9 0x00000000005be05d in Db::get (this=0x9c8000, txnid=0x0, key=0x4820bd70, value=0x4820bd40, flags=0)
    DB environment is:
    if ((ret = dbenv->set_timeout(200000, DB_SET_LOCK_TIMEOUT)) != 0) {
    log_error("set dbenv timeout error!");
    exit(-1);
    if ((ret = dbenv->set_lk_detect(DB_LOCK_EXPIRE)) !=0 ) {
    log_error("set dbenv lock detect error!");
    exit(-1);
    if ((ret = dbenv->set_cachesize(20, 0, 1)) != 0) {
    log_error("set dbenv cache size error!");
    exit(-1);
    dbenv->open(dataDir, DB_INIT_MPOOL | DB_INIT_LOG | DB_THREAD | DB_CREATE | DB_SYSTEM_MEM, 0664);
    The crash occurs in the get method, but when I don't use db->del(), I have not seen the crash again.
    Edited by: user9233151 on 2010-2-28 11:47 PM
    Edited by: user9233151 on 2010-2-28 11:50 PM
    Edited by: user9233151 on 2010-2-28 11:51 PM

    Hi,
    Do you have a small stand-alone test program that demonstrates the issue? Is the issue reproducible at will?
    Does the problem happen when trying to delete a specific key/record?
    Are any error messages reported? Is this the complete stack? What does the crash consist of?
    Regards,
    Andrei

  • BDB dumps core after adding approx 19MB of data

    Hi,
    BDB core dumps after adding about 19MB of data & killing and restarting it several times.
    Stack trace :
    #0 0xc00000000033cad0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
    (gdb) bt
    #0 0xc00000000033cad0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
    #1 0xc000000000260cf0:0 in raise+0x30 () from /usr/lib/hpux64/libc.so.1
    #2 0xc0000000002fe710:0 in abort+0x190 () from /usr/lib/hpux64/libc.so.1
    warning:
    ERROR: Use the "objectdir" command to specify the search
    path for objectfile db_err.o.
    If NOT specified will behave as a non -g compiled binary.
    warning: No unwind information found.
    Skipping this library /integhome/jobin/B063_runEnv/add-ons/lib/libicudata.sl.34.
    #3 0xc000000022ec2340:0 in __db_assert+0xc0 ()
    from /integhome/jobin/B063_runEnv/service/sys/servicerun/bin/libdb_cxx-4.3.so
    warning:
    ERROR: Use the "objectdir" command to specify the search
    path for objectfile db_meta.o.
    If NOT specified will behave as a non -g compiled binary.
    #4 0xc000000022ed2870:0 in __db_new+0x780 ()
    from /integhome/jobin/B063_runEnv/service/sys/servicerun/bin/libdb_cxx-4.3.so
    warning:
    ERROR: Use the "objectdir" command to specify the search
    path for objectfile bt_split.o.
    If NOT specified will behave as a non -g compiled binary.
    #5 0xc000000022ded690:0 in __bam_root+0xb0 ()
    from /integhome/jobin/B063_runEnv/service/sys/servicerun/bin/libdb_cxx-4.3.so
    #6 0xc000000022ded2d0:0 in __bam_split+0x1e0 ()
    from /integhome/jobin/B063_runEnv/service/sys/servicerun/bin/libdb_cxx-4.3.so
    warning:
    ERROR: Use the "objectdir" command to specify the search
    path for objectfile bt_cursor.o.
    If NOT specified will behave as a non -g compiled binary.
    #7 0xc000000022dc83f0:0 in __bam_c_put+0x360 ()
    from /integhome/jobin/B063_runEnv/service/sys/servicerun/bin/libdb_cxx-4.3.so
    warning:
    ERROR: Use the "objectdir" command to specify the search
    path for objectfile db_cam.o.
    If NOT specified will behave as a non -g compiled binary.
    #8 0xc000000022eb8c10:0 in __db_c_put+0x740 ()
    from /integhome/jobin/B063_runEnv/service/sys/servicerun/bin/libdb_cxx-4.3.so
    warning:
    ERROR: Use the "objectdir" command to specify the search
    path for objectfile db_am.o.
    If NOT specified will behave as a non -g compiled binary.
    #9 0xc000000022ea4100:0 in __db_put+0x4c0 ()
    from /integhome/jobin/B063_runEnv/service/sys/servicerun/bin/libdb_cxx-4.3.so---Type <return> to continue, or q <return> to quit---
    warning:
    ERROR: Use the "objectdir" command to specify the search
    path for objectfile db_iface.o.
    If NOT specified will behave as a non -g compiled binary.
    #10 0xc000000022eca7a0:0 in __db_put_pp+0x240 ()
    from /integhome/jobin/B063_runEnv/service/sys/servicerun/bin/libdb_cxx-4.3.so
    warning:
    ERROR: Use the "objectdir" command to specify the search
    path for objectfile cxx_db.o.
    If NOT specified will behave as a non -g compiled binary.
    #11 0xc000000022d92c90:0 in Db::put(DbTxn*,Dbt*,Dbt*,unsigned int)+0x120 ()
    from /integhome/jobin/B063_runEnv/service/sys/servicerun/bin/libdb_cxx-4.3.so
    What is the behaviour of BDB if it is killed and restarted while a BDB transaction is in progress?
    Does anybody have an idea why BDB dumps core in the above scenario?
    Regards
    Sandhya

    Hi Bogdan,
    As you suggested, I am using the flags below to open the environment:
    DB_RECOVER | DB_CREATE | DB_INIT_LOG | DB_INIT_MPOOL | DB_INIT_TXN | DB_THREAD
    DB_INIT_LOCK is not used because at the application level we maintain a lock to guard against multiple simultaneous accesses.
    The following message is output on the console, and the process dumps core with the same stack trace as posted before:
    __db_assert: "last == pgno" failed: file "../dist/../db/db_meta.c", line 163
    I ran the db_verify, db_stat, and db_recover tools on the DB; their results are below.
    db_verify <dbfile>
    db_verify: Page 4965: partially zeroed page
    db_verify: ./configserviceDB: DB_VERIFY_BAD: Database verification failed
    db_recover -v
    Finding last valid log LSN: file: 1 offset 42872
    Recovery starting from [1][42200]
    Recovery complete at Sat Jul 28 17:40:36 2007
    Maximum transaction ID 8000000b Recovery checkpoint [1][42964]
    db_stat -d <dbfile>
    53162 Btree magic number
    9 Btree version number
    Big-endian Byte order
    Flags
    2 Minimum keys per-page
    8192 Underlying database page size
    1 Number of levels in the tree
    60 Number of unique keys in the tree
    60 Number of data items in the tree
    0 Number of tree internal pages
    0 Number of bytes free in tree internal pages (0% ff)
    1 Number of tree leaf pages
    62 Number of bytes free in tree leaf pages (99% ff)
    0 Number of tree duplicate pages
    0 Number of bytes free in tree duplicate pages (0% ff)
    0 Number of tree overflow pages
    0 Number of bytes free in tree overflow pages (0% ff)
    0 Number of empty pages
    0 Number of pages on the free list
    db_stat -E <dbfile>
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Default database environment information:
    4.3.28 Environment version
    0x120897 Magic number
    0 Panic value
    2 References
    0 The number of region locks that required waiting (0%)
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Per region database environment information:
    Mpool Region:
    2 Region ID
    -1 Segment ID
    1MB 264KB Size
    0 The number of region locks that required waiting (0%)
    Log Region:
    3 Region ID
    -1 Segment ID
    1MB 64KB Size
    0 The number of region locks that required waiting (0%)
    Transaction Region:
    4 Region ID
    -1 Segment ID
    16KB Size
    0 The number of region locks that required waiting (0%)
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    DB_ENV handle information:
    Set Errfile
    db_stat Errpfx
    !Set Errcall
    !Set Feedback
    !Set Panic
    !Set Malloc
    !Set Realloc
    !Set Free
    Verbose flags
    !Set App private
    !Set App dispatch
    !Set Home
    !Set Log dir
    /integhome/jobin/B064_July2/runEnv/temp Tmp dir
    !Set Data dir
    0660 Mode
    DB_INIT_LOG, DB_INIT_MPOOL, DB_INIT_TXN, DB_USE_ENVIRON Open flags
    !Set Lockfhp
    Set Rec tab
    187 Rec tab slots
    !Set RPC client
    0 RPC client ID
    0 DB ref count
    -1 Shared mem key
    400 test-and-set spin configuration
    !Set DB handle mutex
    !Set api1 internal
    !Set api2 internal
    !Set password
    !Set crypto handle
    !Set MT mutex
    DB_ENV_LOG_AUTOREMOVE, DB_ENV_OPEN_CALLED Flags
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Default logging region information:
    0x40988 Log magic number
    10 Log version number
    1MB Log record cache size
    0660 Log file mode
    1Mb Current log file size
    632B Log bytes written
    632B Log bytes written since last checkpoint
    1 Total log file writes
    0 Total log file write due to overflow
    1 Total log file flushes
    1 Current log file number
    42872 Current log file offset
    1 On-disk log file number
    42872 On-disk log file offset
    1 Maximum commits in a log flush
    1 Minimum commits in a log flush
    1MB 64KB Log region size
    0 The number of region locks that required waiting (0%)
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Log REGINFO information:
    Log Region type
    3 Region ID
    __db.003 Region name
    0xc00000000b774000 Original region address
    0xc00000000b774000 Region address
    0xc00000000b883dd0 Region primary address
    0 Region maximum allocation
    0 Region allocated
    REGION_JOIN_OK Region flags
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    DB_LOG handle information:
    !Set DB_LOG handle mutex
    0 Log file name
    !Set Log file handle
    Flags
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    LOG handle information:
    0 file name list mutex (0%)
    0x40988 persist.magic
    10 persist.version
    0 persist.log_size
    0660 persist.mode
    1/42872 current file offset LSN
    1/42872 first buffer byte LSN
    0 current buffer offset
    42872 current file write offset
    68 length of last record
    0 log flush in progress
    0 Log flush mutex (0%)
    1/42872 last sync LSN
    1/41475 cached checkpoint LSN
    1MB log buffer size
    1MB log file size
    1MB next log file size
    0 transactions waiting to commit
    1/0 LSN of first commit
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    LOG FNAME list:
    0 File name mutex (0%)
    1 Fid max
    ID Name Type Pgno Txnid DBP-info
    0 configserviceDB btree 0 0 No DBP 0 0 0
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Default cache region information:
    1MB 262KB 960B Total cache size
    1 Number of caches
    1MB 264KB Pool individual cache size
    0 Maximum memory-mapped file size
    0 Maximum open file descriptors
    0 Maximum sequential buffer writes
    0 Sleep after writing maximum sequential buffers
    0 Requested pages mapped into the process' address space
    43312 Requested pages found in the cache (89%)
    4968 Requested pages not found in the cache
    640 Pages created in the cache
    4965 Pages read into the cache
    621 Pages written from the cache to the backing file
    4818 Clean pages forced from the cache
    621 Dirty pages forced from the cache
    0 Dirty pages written by trickle-sync thread
    166 Current total page count
    146 Current clean page count
    20 Current dirty page count
    131 Number of hash buckets used for page location
    53888 Total number of times hash chains searched for a page
    4 The longest hash chain searched for a page
    92783 Total number of hash buckets examined for page location
    0 The number of hash bucket locks that required waiting (0%)
    0 The maximum number of times any hash bucket lock was waited for
    0 The number of region locks that required waiting (0%)
    5615 The number of page allocations
    10931 The number of hash buckets examined during allocations
    22 The maximum number of hash buckets examined for an allocation
    5439 The number of pages examined during allocations
    11 The max number of pages examined for an allocation
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Pool File: temporary
    1024 Page size
    0 Requested pages mapped into the process' address space
    43245 Requested pages found in the cache (99%)
    1 Requested pages not found in the cache
    635 Pages created in the cache
    0 Pages read into the cache
    617 Pages written from the cache to the backing file
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Pool File: configserviceDB
    8192 Page size
    0 Requested pages mapped into the process' address space
    65 Requested pages found in the cache (1%)
    4965 Requested pages not found in the cache
    1 Pages created in the cache
    4965 Pages read into the cache
    0 Pages written from the cache to the backing file
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Mpool REGINFO information:
    Mpool Region type
    2 Region ID
    __db.002 Region name
    0xc00000000b632000 Original region address
    0xc00000000b632000 Region address
    0xc00000000b773f08 Region primary address
    0 Region maximum allocation
    0 Region allocated
    REGION_JOIN_OK Region flags
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    MPOOL structure:
    0/0 Maximum checkpoint LSN
    131 Hash table entries
    64 Hash table last-checked
    48905 Hash table LRU count
    48914 Put counter
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    DB_MPOOL handle information:
    !Set DB_MPOOL handle mutex
    1 Underlying cache regions
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    DB_MPOOLFILE structures:
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    MPOOLFILE structures:
    File #1: temporary
    0 Mutex (0%)
    0 Reference count
    18 Block count
    634 Last page number
    0 Original last page number
    0 Maximum page number
    0 Type
    0 Priority
    0 Page's LSN offset
    32 Page's clear length
    0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 f8 0 0 0 0 ID
    deadfile, file written Flags
    File #2: configserviceDB
    0 Mutex (0%)
    1 Reference count
    148 Block count
    4965 Last page number
    4964 Original last page number
    0 Maximum page number
    0 Type
    0 Priority
    0 Page's LSN offset
    32 Page's clear length
    0 0 b6 59 40 1 0 2 39 ac 13 6f 0 a df 18 0 0 0 0 ID
    file written Flags
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Cache #1:
    BH hash table (131 hash slots)
    bucket #: priority, mutex
    pageno, file, ref, LSN, mutex, address, priority, flags
    bucket 0: 47385, 0/0%:
    4813, #2, 0, 0/1, 0/0%, 0x04acf0, 47385
    4944, #2, 0, 0/0, 0/0%, 0x020c18, 48692

  • Can I configure the BTREE compare function at search time?

    Hi, all,
    I was implementing an 'order by' feature on top of BDB. The problem I met is the "reverse order" case, i.e. 'order by xxx desc'.
    BDB provides the cursor->get API for searching, with flags such as DB_SET_RANGE and DB_GET_BOTH_RANGE. This is perfect for 'order by xxx asc' searches.
    But when I do a reverse search, it is complicated, because flags such as DB_SET_RANGE find the record that is greater than or equal to the key, not less than or equal, and the situation becomes even more complicated when records are duplicated (I need to move the cursor forward and backward repeatedly). My question is: is there a way to specify the Btree compare function at search time? So far I can only find a way to specify the compare function when the db is created. An option that reverses the compare function at search time would also solve the problem.
    Thanks very much!
    Regards,
    Steve
    Edited by: Steve Chu on Jun 4, 2009 4:42 PM

    Hi,
    It is not possible to change the btree comparison function after the database has been created. It needs to return consistent results at all times, or the internal btree structure will become corrupt.
    I don't really understand the issue you are facing. Can't you do a DBC->get(DB_SET_RANGE) call and then walk the cursor backwards? That is, if the position returned from the DBC->get(DB_SET_RANGE) call is not equal to the requested key, call DBC->get(DB_PREV) before iterating any further.
    It is possible to control the behavior of the previous/next operations on cursors in relation to duplicate sets by using the correct flags. See the documentation for [DB_PREV_NODUP|http://www.oracle.com/technology/documentation/berkeley-db/db/api_c/dbc_get.html#DB_PREV_NODUP]. You can see the other related flags documented on the same page.
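    The DB_SET_RANGE-then-DB_PREV technique can be illustrated with a plain sorted array standing in for the btree's key order. This is a sketch of the cursor logic only, not actual BDB calls:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Find the index of the largest key <= target, or -1 if none.
     * keys[] plays the role of the btree's sorted key order. */
    static int find_le(const int *keys, size_t n, int target)
    {
        size_t i = 0;
        /* DB_SET_RANGE: position at the first key >= target. */
        while (i < n && keys[i] < target)
            i++;
        if (i < n && keys[i] == target)
            return (int)i;      /* exact hit: cursor already placed */
        if (i == 0)
            return -1;          /* everything is greater: no <= match */
        return (int)i - 1;      /* DB_PREV: step back once */
    }

    int main(void)
    {
        const int keys[] = {10, 20, 30, 40};
        assert(find_le(keys, 4, 30) == 2);  /* exact match */
        assert(find_le(keys, 4, 25) == 1);  /* falls back to 20 */
        assert(find_le(keys, 4, 5) == -1);  /* nothing <= 5 */
        assert(find_le(keys, 4, 99) == 3);  /* past the end: last key */
        return 0;
    }
    ```

    From the resulting position, continuing with DB_PREV yields the rows in descending key order, which is effectively the 'order by xxx desc' scan.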
    Regards,
    Alex Gorrod, Oracle Berkeley DB
