Performance problems on multi processor machine

Hi,
we are using software written in Java 1.4 (servlets running on Tomcat 4.1.29), and its performance is very poor. Sadly we don't have the source code for the application, but from the log files we can see a very large overhead when execution moves between processors: for example, when changing from processor 1 to processor 2, a statement which usually needs only 50 ms takes around 20 seconds to finish. That cannot be right.
Do you have any suggestions, maybe about the parameters used to start Java?
We use the following startup options:
-d64 -server -Xms1G -Xmx2G -Xmn800m -XX:+DisableExplicitGC -XX:+UseParallelGC -verbose:GC -Djava.awt.headless=true
Thanks for your help,
Anton

Before anyone answers this, check out what was already attempted at his stinkin' CROSSPOST:
http://forum.java.sun.com/thread.jsp?thread=553113&forum=54&message=2706725

Similar Messages

Simulating a multi-processor machine

Hi there,
I am writing a server-like application and would appreciate some advice from the gurus.
During start-up and while running, this app performs quite a few operations that could block the current thread. For example, one of the start-up operations opens a socket. If the socket cannot be opened, it has to wait quite a while before throwing the TimeoutException.
This makes the application feel slow. (None of the GUI thread work is blocking; the data just comes through slowly.)
Now, I do know about threading, and I can go that route quite easily, but I am not keen on the thread creation overhead (roughly 20k of memory for each thread, even though most of the tasks are incredibly short-lived).
What I am interested in is a hybrid approach. I was thinking of simulating a multi-processor machine. This would entail having a fixed number of threads and sending them work as the tasks arrive.
However, coordinating these threads will require a fair amount of synchronized code, so I am not sure whether it is worth it.
Has anyone tried such an approach, or does anyone have any ideas about this?
Many thanks
-Philip.
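
The "fixed number of threads that receive work as tasks arrive" idea is what a thread pool does. A minimal sketch, assuming Java 5 or later (java.util.concurrent); the class name FixedPoolSketch, the pool size of 4 and the sleeping tasks are illustrative assumptions, not anything from the post:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class FixedPoolSketch {

    // Runs every task on a pool of poolSize reusable threads and
    // returns the results in submission order.
    static List<String> runAll(List<Callable<String>> tasks, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (Callable<String> task : tasks) {
                futures.add(pool.submit(task));   // queued until a worker thread is free
            }
            List<String> results = new ArrayList<String>();
            for (Future<String> f : futures) {
                results.add(f.get());             // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown();                      // no new tasks; workers exit when idle
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<String>> tasks = new ArrayList<Callable<String>>();
        for (int i = 0; i < 8; i++) {
            final int id = i;
            tasks.add(new Callable<String>() {
                public String call() throws Exception {
                    Thread.sleep(50);             // stands in for a blocking operation
                    return "task-" + id;
                }
            });
        }
        System.out.println(runAll(tasks, 4));
    }
}
```

The pool owns the work queue and its locking, so the application code itself needs no extra synchronized blocks just to hand out tasks.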

Sorry about the double post,
please ignore this thread.
Wish there was a way for me to delete
this thread...
<Prays to Java Gods>

Performance degradation on multi-processor computer

I saw a couple of similar topics, but they are not the same. The issue is that a .NET app is not affected as much as the Java app. See below. Thank you.
We observe a strange picture here. With the same test and the same data, the more CPUs the computer has, the slower the test runs:
1) Powerful 8-CPU server with 1 CPU assigned to the process via "Set Affinity" in Task Manager: 1578 ms
2) Powerful 8-CPU server with 2 CPUs assigned to the process via "Set Affinity" in Task Manager: 1656 ms
3) Powerful 8-CPU server with no adjustments via Task Manager, all 8 CPUs enabled: 3469 ms
4) Tests 1-3 were on the powerful server. On my less powerful laptop with a single core set in BIOS I get: 921 ms (!!!) How come?
The test has 2 active threads. The first one puts 5,000,000 updates into a queue. The second thread pops updates from the queue at the same time. Source is provided.
With a more sophisticated test that I cannot provide, the performance difference is much more noticeable. That test has more synchronized calls.
The only 2 explanations that I can think of are:
1) The need to synchronize CPU caches across all CPUs. But why would the computer need to sync all 8 CPUs if only 2 threads are using that data, meaning at most 2 CPUs are active?
2) The Windows scheduler keeps moving the active thread from one CPU to another. But why would it do so if that only makes the more powerful server slower?
I hope there is a good explanation for this behavior and, most importantly, a solution that lets our server really beat my laptop's numbers.
By the way, the same type of sample in .NET v2.0 (also provided) with an x64 binary runs in 546 ms on my laptop and in 781 and 813 ms on the server with affinity set to 1 and to 1-8 CPUs respectively. That makes me think this issue is JVM related. Please comment.
== ThreadBoundaryQueue.java file ==
import java.text.NumberFormat;
import java.util.ArrayDeque;

public class ThreadBoundaryQueue
{
    private static final int COUNT = 5*1000*1000;
    private static final int delayBeforeStart = 10;

    private static void print(String text)
    {
        System.out.println(Thread.currentThread().getName() + ": " + text);
    }

    public static void main(String[] args) throws Exception
    {
        SysInfo.dump();
        print("Preparing data for test...");
        Long events[] = new Long[COUNT];
        for(int i = 0; i < COUNT; ++i)
            events[i] = new Long(i/87);
        final Object notifier = new Object();
        final boolean finished[] = new boolean[1];
        SimpleEventQueueWithArrayDeque queue = new SimpleEventQueueWithArrayDeque(new IEventDispatcher<Long>()
        {
            long sum_;
            int count_;

            @Override
            public void dispatchEvent(Long event)
            {
                // some dummy work per event
                for(long l = event.longValue() * 50; l > 0; l /=17)
                    sum_ += l;
                if(++count_ == COUNT)
                {
                    // last event processed: wake up the main thread
                    synchronized (notifier)
                    {
                        finished[0] = true;
                        notifier.notify();
                        print("some dummy sum is " + sum_);
                    }
                }
            }
        });
        queue.startDispatching();
        print("Test starts in "+delayBeforeStart+" seconds");
        print("================================");
        Thread.sleep(delayBeforeStart*1000);
        print("Started...");
        TimeCounter sendingTime = new TimeCounter();
        TimeCounter processingTime = new TimeCounter();
        for(int i = 0; i < COUNT; ++i)
            queue.postEvent(events[i]);
        sendingTime.dump("sending of " + COUNT + " events");
        synchronized (notifier)
        {
            while(!finished[0])
                notifier.wait(1);
        }
        processingTime.dump("processing of " + COUNT + " events");
        print("...Done");
        queue.stopDispatching();
    }

    private static class SysInfo
    {
        public static void dump()
        {
            for(String p : new String[]{
                "os.name",
                "os.version",
                "os.arch",
                "sun.arch.data.model",
                "java.runtime.version",
                "java.vm.name",
            }) print(p+"="+System.getProperty(p));
            print("CPU count=" + Runtime.getRuntime().availableProcessors() + "\n");
        }
    }

    private static class TimeCounter
    {
        private long start_;

        public TimeCounter()
        {
            start_ = System.currentTimeMillis();
        }

        public void dump(String name)
        {
            print( name
                + " took, milliseconds: "
                + NumberFormat.getIntegerInstance().format(System.currentTimeMillis()-start_));
        }
    }

    private interface IEventDispatcher<EventType>
    {
        void dispatchEvent(EventType event);
    }

    private static class SimpleEventQueueWithArrayDeque implements Runnable
    {
        private ArrayDeque<Long> buffer_;
        private Thread thread_;
        private IEventDispatcher<Long> dispatcher_;

        public SimpleEventQueueWithArrayDeque(IEventDispatcher<Long> dispatcher)
        {
            buffer_ = new ArrayDeque<Long>(65535);
            dispatcher_ = dispatcher;
        }

        public void postEvent(Long event)
        {
            synchronized (buffer_)
            {
                buffer_.addLast(event);
                buffer_.notify();
            }
        }

        @Override
        public void run()
        {
            try
            {
                Long event;
                while(!Thread.interrupted())
                {
                    // take one event under the lock, dispatch it outside the lock
                    synchronized (buffer_)
                    {
                        event = buffer_.poll();
                        while(event == null)
                        {
                            buffer_.wait(300);
                            event = buffer_.poll();
                        }
                    }
                    dispatcher_.dispatchEvent(event);
                }
            }
            catch(InterruptedException ex)
            {
            }
        }

        public synchronized void startDispatching()
        {
            if(thread_ == null)
            {
                thread_ = new Thread(this, "queue dispatcher");
                thread_.setDaemon(false);
                thread_.start();
            }
        }

        public synchronized void stopDispatching()
        {
            if(thread_ != null)
            {
                thread_.interrupt();
                try
                {
                    thread_.join();
                }
                catch(Exception ex){}
                thread_ = null;
            }
        }
    }
}

No problem. We are pretty sure of what we see. We repeated the test. The new test makes 5 warm-up rounds of 5,000,000 updates each, and then another 20 rounds that are timed. Results with our process running at "Real-time priority" in both cases:
8 CPUs on the 8-CPU server: 40.5 seconds
1 CPU on the same 8-CPU server: 17.9 seconds
Just to add more confidence in the results, we did the same with our PRODUCTION system, which runs 4 heavy processes on an 8-CPU server. We set the affinity of each process to 2 of the 8 CPUs. That made the heavily loaded system use around half the CPU time (each of the 4 processes makes a lot of "synchronized" calls). How can this be explained, and how should we configure our system with that knowledge? And why does Windows and/or the JVM not do this for us?
Code is below. Thank you.
import java.text.NumberFormat;
import java.util.ArrayDeque;

public class ThreadBoundaryQueue {
    private static final int COUNT = 5*1000*1000;
    private static final int delayBeforeStart = 15;
    private static Long Events_[];

    private static void print(String text) {
        System.out.println(Thread.currentThread().getName() + ": " + text);
    }

    public static void main(String[] args) throws Exception {
        SysInfo.dump();
        final Object notifier = new Object();
        final boolean finished[] = new boolean[1];
        IEventDispatcher<Long> d = new IEventDispatcher<Long>() {
            long sum_;
            int count_;
            @Override
            public void dispatchEvent(Long event) {
                // some dummy work per event
                for(long l = event.longValue() * 50; l > 0; l /=17) {
                    sum_ += l;
                }
                if(++count_ == COUNT) {
                    // last event of the round: reset counters and wake up test()
                    synchronized (notifier) {
                        finished[0] = true;
                        print("some dummy sum is " + sum_);
                        sum_ = 0;
                        count_ = 0;
                        notifier.notify();
                    }
                }
            }
        };
        print("Preparing data for tests...");
        Events_ = new Long[COUNT];
        for(int i = 0; i < COUNT; ++i) {
            Events_[i] = new Long(i/87);
        }
        print("Test starts in "+delayBeforeStart+" seconds");
        print("================================");
        Thread.sleep(delayBeforeStart*1000);
        print("BEGIN Warmup...");
        for(int i = 0; i < 5; ++i) {
            print("Test " + (i+1) + "...");
            test(new SimpleEventQueueWithArrayDeque(d), notifier, finished);
        }
        print("END Warmup...");
        TimeCounter allTime = new TimeCounter();
        for(int i = 0; i < 20; ++i) {
            print("Test " + (i+1) + "...");
            test(new SimpleEventQueueWithArrayDeque(d), notifier, finished);
        }
        print("...Done");
        allTime.dump("ALL TESTS after warm up");
    }

    private static void test(SimpleEventQueueWithArrayDeque queue, Object notifier, boolean[] finished) throws Exception {
        synchronized (notifier) {
            finished[0] = false;
        }
        queue.startDispatching();
        print("Started...");
        TimeCounter sendingTime = new TimeCounter();
        TimeCounter processingTime = new TimeCounter();
        for(int i = 0; i < COUNT; ++i) {
            queue.postEvent(Events_[i]);
        }
        sendingTime.dump("sending of " + COUNT + " events");
        synchronized (notifier) {
            while(!finished[0]) {
                notifier.wait(100);
            }
        }
        processingTime.dump("processing of " + COUNT + " events");
        queue.stopDispatching();
    }

    private static class SysInfo {
        public static void dump() {
            for(String p : new String[]{"os.name", "os.version", "os.arch",
                    "sun.arch.data.model", "java.runtime.version", "java.vm.name",})
                print(p+"="+System.getProperty(p));
            print("CPU count=" + Runtime.getRuntime().availableProcessors() + "\n");
        }
    }

    private static class TimeCounter {
        private long start_;
        public TimeCounter() {
            start_ = System.currentTimeMillis();
        }
        public void dump(String name) {
            print( name
                + " took, milliseconds: "
                + NumberFormat.getIntegerInstance().format(System.currentTimeMillis()-start_));
        }
    }

    private interface IEventDispatcher<EventType> {
        void dispatchEvent(EventType event);
    }

    private static class SimpleEventQueueWithArrayDeque implements Runnable {
        private ArrayDeque<Long> buffer_;
        private Thread thread_;
        private IEventDispatcher<Long> dispatcher_;

        public SimpleEventQueueWithArrayDeque(IEventDispatcher<Long> dispatcher) {
            buffer_ = new ArrayDeque<Long>(65535);
            dispatcher_ = dispatcher;
        }

        public void postEvent(Long event) throws Exception {
            synchronized(buffer_) {
                buffer_.addLast(event);
                buffer_.notifyAll();
            }
        }

        @Override
        public void run() {
            try {
                Long event;
                while(!Thread.interrupted()) {
                    // take one event under the lock, dispatch it outside the lock
                    synchronized(buffer_) {
                        event = buffer_.poll();
                        while(event == null) {
                            buffer_.wait(100);
                            event = buffer_.poll();
                        }
                    }
                    dispatcher_.dispatchEvent(event);
                }
            }
            catch(InterruptedException ex){}
        }

        public synchronized void startDispatching() {
            if(thread_ == null) {
                thread_ = new Thread(this, "queue dispatcher");
                thread_.setDaemon(false);
                thread_.start();
            }
        }

        public synchronized void stopDispatching() {
            if(thread_ != null) {
                thread_.interrupt();
                try {
                    thread_.join();
                }
                catch(Exception ex){}
                thread_ = null;
            }
        }
    }
}
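
One thing that may be worth measuring, offered as an assumption and not as a diagnosis of the numbers above: in this test both threads take the same monitor once per event, so the lock changes hands millions of times. A sketch of a consumer that takes many events per wake-up, using LinkedBlockingQueue.drainTo from java.util.concurrent (Java 5 or later). The class name BatchDrainSketch, the batch size of 1024 and the simple summing "work" are illustrative choices, not part of the posted test:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BatchDrainSketch {

    // Consumes exactly 'expected' events from the queue and returns their sum.
    // Each wake-up takes one event with poll() and then drains whatever else is
    // already queued, so the queue is locked a couple of times per batch
    // instead of once per event.
    static long consume(BlockingQueue<Long> queue, int expected) throws InterruptedException {
        List<Long> batch = new ArrayList<Long>(1024);
        long sum = 0;
        int seen = 0;
        while (seen < expected) {
            Long first = queue.poll(100, TimeUnit.MILLISECONDS);
            if (first == null) {
                continue;                         // nothing arrived yet, wait again
            }
            batch.add(first);
            // grab up to 1023 more without blocking, never more than still expected
            queue.drainTo(batch, Math.min(1023, expected - seen - 1));
            for (Long event : batch) {
                sum += event.longValue();         // stands in for dispatchEvent()
            }
            seen += batch.size();
            batch.clear();
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        final int count = 1000 * 1000;
        final BlockingQueue<Long> queue = new LinkedBlockingQueue<Long>();
        Thread producer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < count; i++) {
                    queue.add(Long.valueOf(i));
                }
            }
        }, "producer");
        long start = System.currentTimeMillis();
        producer.start();
        long sum = consume(queue, count);
        producer.join();
        System.out.println("sum=" + sum + ", ms=" + (System.currentTimeMillis() - start));
    }
}
```

Whether this helps on the 8-CPU server can only be settled by running it there with and without affinity set, the same way the original test was measured.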

Installation problems on multi distro machine [solved]

Started with 0.7.2 i686, but had problems with that, so I was advised to try 0.8. Just downloaded the base-0.8-beta1-20070122-i686.iso image. md5sum and burn verify showed no problems. However, I'm having lots of bother getting it installed on a machine containing other distros. So I tried installing on VMWare and had no problems whatsoever with either the iso image or the burnt CD.
I had to use ide-legacy to boot the real machine. As the disk was already prepared I went straight to option 3, Set Filesystem Mountpoints, and set the swap (overwrite), / (ext3/overwrite) and boot (ext2/no overwrite) partitions. Then, on selecting all base packages, I get the error
checking package integrity...error: archive kernel-headers-2.6.19
Package installation failed
I continued with 4. configure system anyway and got a load of mount point errors, and the chroot fails. When I edit (vim) the list of configuration files, they're all empty, and if I try to add anything I get error E212: Can't open file for writing. Obviously I can't set the password either.
So I thought I'd try the ftp base install. The network setup was fine. Instead of the above package integrity error, I get
error: the following file conflicts were found :
grub: /mnt/boot/grub/menu.lst: exists in filesystem
errors occured, no packages were upgraded
Package installation failed
Also I get a load of errors on trying to install the kernel, but at least there was no package corruption.
I tried both methods a number of times. With the CD install I sometimes get more CD errors. With the ftp base install I get exactly the same problems every time. I also tried preparing the disk within the installer using cfdisk. However, when I deleted /dev/hda10 and created a new one, cfdisk wanted to rename most of the other partitions, so I didn't go ahead with that. I've seen this problem with cfdisk before.
I know I could move the virtual machine to the real machine, but I'd rather not have the hassle. Anyone got some workarounds?
Last edited by grazie (2007-02-18 16:43:40)

Thanks for your reply mutlu_inek.
mutlu_inek wrote:What do you mean by "overwrite"? Formatting?
Yes, the installer gives the option to format, or not, the selected partitions during the installation. The app uses the terms overwrite/no overwrite.
mutlu_inek wrote:Maybe your designated /boot partition is corrupted (or not actually ext2)? Maybe you should try formatting it.
The /boot partition was ext2 and I wasn't aware of any corruption. The fsck check on booting was fine and no other distro was having problems. However, removing the kernels and config files or reformatting the /boot partition seemed like a good option that I hadn't yet tried. I backed up the partition and allowed the installer to format/overwrite it during the install. This fixed all the problems.
mutlu_inek wrote:The partition naming/numbering scheme is ordered. If you delete a partition and then re-create it, it will have the same name. If you just delete it, names of partitions with higher numbers will change. That is not cfdisk specific, not even Linux-specific.
My disk has 10 partitions. Within cfdisk I deleted /dev/hda10 and recreated it. However, rather than getting a new /dev/hda10, the partition was designated /dev/hda7, with the original 7, 8 and 9 all reassigned. I'd say this was a bug.

Performance problem using multi-thread

I am using Berkeley DB to store int/double pairs, and I randomly get 100,0000 results to check the performance. Using 1 thread, it takes about 1 minute. But using 4 threads, it takes 4 minutes. The more threads, the lower the performance. Does anyone know the reason?
Env flags:
envFlags=DB_CREATE | DB_INIT_MPOOL | DB_THREAD | DB_PRIVATE|DB_INIT_CDB;
DB open flags: DB_THREAD
and i use db->get method to get results.

Berkeley DB 4.8 will be released soon and has features that address CMP/SMP scalability. We would appreciate your feedback on your experiences with it. Once available, please test it and post your results here; I'm sure you'll be pleasantly surprised.
regards,
-greg

Multi-Processor Performance

I would like to know what performance gains I might expect from moving our
weblogic application server to a multi-processor machine. Will 2 processors
handle twice the server load of the one-processor machine?
Platform: Solaris 2.6
Weblogic Server: 4.5.1 SP7
NativeIO enabled
Weblogic Server is the only thing running on the machine.
Other Questions:
1. Is there anything that needs to be done(other than purchase another
license) for the weblogic server to work on a multi-processor system?
2. Will the weblogic server naturally take advantage of both processors?
3. Will performance gains be uniform or will certain features gain more
from multiple processors?
Any links or suggestions are appreciated.
thanks,
Jeremy

Hi Jeremy -
If you are interested in modeling this before implementing it to determine
performance gains, you might want to check out our scalability assessment
services description, see attached. We are a BEA Technology Alliance Partner
that specializes in answering those specific performance questions, and have
done that for a number of clients in the past few weeks.
(See also eQASEsheet2.pdf) - this describes our capacity sizing tool that works
particularly well for Weblogic.
Todd
Todd Wiseman
Dir/Business Development
eQASE LLC
(303)790-4242 x130
(303)790-2816
www.eqase.com
Java Performance & Scalability
[eQASE WLS Consulting Offerings.pdf]
[eQASEsheet2.pdf]

Performance problem with synchronized singleton

I'm using the singleton pattern to cache incoming JMS Message data from a 3rd party. I'm seeing terrible performance though, and I think it's because I've misunderstood something.
My singleton class stores incoming JMS messages in a HashMap, so that successive messages can be checked to see if they are a new piece of data, or an update to an earlier one.
I followed the traditional examples of a private constructor and a public getInstance method, and applied the double-checked locking to the latter. However, a colleague then suggested that all my other methods in the same class should also be synchronized - is this the case or am I creating an unnecessary performance bottleneck? Or have I unwittingly created that bottleneck elsewhere?
package com.mycode;

import java.util.HashMap;
import java.util.Iterator;

public class DataCache {
    private volatile static DataCache uniqueInstance;
    private HashMap<String, DataCacheElement> dataCache;

    private DataCache() {
        if (dataCache == null) {
            dataCache = new HashMap<String, DataCacheElement>();
        }
    }

    public static DataCache getInstance() {
        if (uniqueInstance == null) {
            synchronized (DataCache.class) {
                if (uniqueInstance == null) {
                    uniqueInstance = new DataCache();
                }
            }
        }
        return uniqueInstance;
    }

    public synchronized void put(String uniqueID, DataCacheElement dataCacheElement) {
        dataCache.put(uniqueID, dataCacheElement);
    }

    public synchronized DataCacheElement get(String uniqueID) {
        DataCacheElement dataCacheElement = dataCache.get(uniqueID);
        return dataCacheElement;
    }

    public synchronized void remove(String uniqueID) {
        dataCache.remove(uniqueID);
    }

    public synchronized int getCacheSize() {
        return dataCache.keySet().size();
    }

    /**
     * Flushes all objects from the cache that are older than the
     * expiry time.
     * @param expiryTime (long milliseconds)
     */
    public synchronized void flush(long expiryTime) {
        String uniqueID;
        long currentDate = System.currentTimeMillis();
        long compareDate = currentDate - (expiryTime);
        Iterator<String> iterator = dataCache.keySet().iterator();
        while( iterator.hasNext() ){
            // Get element by unique key
            uniqueID = iterator.next();
            DataCacheElement dataCacheElement = get(uniqueID);
            // get time from element
            long lastUpdatedDate = dataCacheElement.getUpdatedDate();
            // if the element is older than the expiry time, remove it from the cache
            if (lastUpdatedDate < compareDate) {
                // remove through the iterator: calling remove(uniqueID) here changes
                // the map mid-iteration and can throw ConcurrentModificationException
                iterator.remove();
            }
        }
    }

    public synchronized void empty() {
        dataCache.clear();
    }
}

m0thr4 wrote:
SunFred wrote:
m0thr4 wrote:
I [...] applied the double-checked locking
Which is broken. http://www.ibm.com/developerworks/java/library/j-dcl.html
From the link:
"The theory behind double-checked locking is perfect. Unfortunately, reality is entirely different. The problem with double-checked locking is that there is no guarantee it will work on single or multi-processor machines.
The issue of the failure of double-checked locking is not due to implementation bugs in JVMs but to the current Java platform memory model. The memory model allows what is known as "out-of-order writes" and is a prime reason why this idiom fails."
I had a read of that article and have a couple of questions about it:
1. The article was written way back in May 2002 - is the issue they describe relevant to Java 6's memory model?
DCL will work starting with 1.5 (which revised the memory model), if you make the variable you're testing volatile. However, there's no reason to do it.
Lazy instantiation is almost never appropriate, and for those rare times when it is, use a nested class to hold your instance reference. (There are examples if you search for them.) I'd be willing to bet lazy instantiation is not appropriate in your case, so you don't need to muck with syncing or DCL or any of that nonsense.
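
For readers who do need lazy creation, the "nested class to hold your instance reference" advice looks roughly like the sketch below. It is an illustration under assumptions, not the poster's class: the name HolderCacheSketch is made up, a plain Long timestamp stands in for DataCacheElement, and ConcurrentHashMap (Java 5 or later) replaces the synchronized HashMap methods.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HolderCacheSketch {

    // Holder is initialized only when getInstance() is first called. The JVM
    // guarantees class initialization is thread-safe, so no locking is needed.
    private static class Holder {
        static final HolderCacheSketch INSTANCE = new HolderCacheSketch();
    }

    // Maps a unique ID to the time (ms) the entry was last updated.
    private final Map<String, Long> lastUpdated = new ConcurrentHashMap<String, Long>();

    private HolderCacheSketch() { }

    public static HolderCacheSketch getInstance() {
        return Holder.INSTANCE;
    }

    public void put(String id, long updatedMillis) {
        lastUpdated.put(id, Long.valueOf(updatedMillis));
    }

    public boolean contains(String id) {
        return lastUpdated.containsKey(id);
    }

    public int size() {
        return lastUpdated.size();
    }

    // Removes every entry last updated before cutoffMillis. Removal goes through
    // the iterator, which ConcurrentHashMap supports during iteration.
    public void flush(long cutoffMillis) {
        Iterator<Map.Entry<String, Long>> it = lastUpdated.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue().longValue() < cutoffMillis) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        HolderCacheSketch cache = HolderCacheSketch.getInstance();
        long now = System.currentTimeMillis();
        cache.put("stale", now - 60000);
        cache.put("fresh", now);
        cache.flush(now - 1000);
        System.out.println("stale kept: " + cache.contains("stale")
                + ", fresh kept: " + cache.contains("fresh"));
    }
}
```

With a concurrent map, readers and writers of different keys no longer queue on one monitor, which is the bottleneck the original question asks about.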

WLS on Multi-Processors

A few questions about WLS 5.1 on multi-processor machines:
1. Is there anything that needs to be done(other than purchase another
license) for a weblogic server to work on a multi-processor system?
2. Will WLS take advantage of all processors with just ONE invocation of WLS?
Or will I have to run one instance of WLS for each processor?
3. Will performance gains be uniform or will certain features gain more
from multiple processors?
Any answers, insights or pointers to answers are appreciated.
Thanks.
-Heng

"I consider WebLogic to be a great no-nonsense J2EE implementation (not
counting class loaders ;-)."
Look for major improvements in that area in version 6.0.
Thanks,
Michael
Michael Girdley
BEA Systems Inc
"Cameron Purdy" <[email protected]> wrote in message
news:[email protected]...
Rob,
I consider WebLogic to be a great no-nonsense J2EE implementation (not
counting class loaders ;-). Gemstone's architecture is quite elaborate when
compared to WebLogic, and BTW they spare no opportunity to compare to
WebLogic although never by name. (Read their white paper on scalability to
see what I mean.) I am quite impressed by their architecture; it appears to
be set up for dynamic reconfiguration of many-tier processing. For example,
where WL coalesces (i.e. pass by ref if possible), Gemstone will always
distribute if possible, creating a "path" through (e.g.) 7 levels of JVMs
(each level having a dynamic number of JVMs available in a pool) and if
there is a problem at any level, the request gets re-routed (full failover
apparently). I would say that they are set up quite well to solve the
travelling salesperson problem ... you could probably implement a web-driven
neural net on their architecture. (I've never used the Gemstone product,
but I've read about everything that they published on it.) I would assume
that for certain types of scaling problems, the Gemstone architecture would
work very very well. I would also guess that there are latency issues and
administration nightmares, but I've had the latter with every app server
that I've ever used, so ... $.02.
Cameron Purdy
[email protected]
http://www.tangosol.com
WebLogic Consulting Available
"Rob Woollen" <[email protected]> wrote in message
news:[email protected]...
Dimitri Rakitine wrote:
"Hrm. Gemstone reasons are somewhat different."
I'm not a Gemstone expert, but I believe their architecture is quite
different from a WebLogic cluster. Different architectures might have
different trade-offs.
However, out of curiosity, what are their reasons?
"Anyway, here is my question:
why running multiple instances of WL is more efficient than running one with
high execute thread count?"
The usual reason is that most garbage collectors suspend all of the JVM
threads. Using multiple WLS instances causes the pauses to be
staggered. Newer Java VMs offer incremental collectors as an option, so
this may no longer be as big of an issue.
-- Rob

Will multi-threaded code run better in a multi-processor environment?

Is there any proof that multi-threaded code will run better in a multi-processor environment?
Will Java run better in a multi-processor environment? Is there any proof?
Thanks in advance

"Is there any proof?"
The proof is in the pudding, so to speak. The best thing to do is to benchmark your application on single and multi-processor machines.
It should be noted that there are reports of threading problems with Java on multi-processor machines. I haven't seen the problems myself, nor do I remember the exact problems people were seeing (and on what OS/hardware), but if you search these forums you should be able to find the details.

Multi-processor Multi-Threaded deadlock

Hi all-
I've posted this over at jGuru.com and haven't come up with an effective solution, but I wanted to try here before reporting this as a bug. It deals with kicking off many threads at once (such as might happen when requests are coming in over a socket).
I'm looking for a way to find out when all of my threads have finished processing. I have several different implementations that work on a single-processor machine:
inst is an array of 1000+ threads. The type of inst is a class that extends Thread (and thus implements the Runnable interface).
for (i = 0; i < Count; i++)
    inst[i].start();
for (i = 0; i < Count; i++)
    inst[i].join();
However, this never terminates on a multiprocessor machine. I've also tried an isAlive loop instead of join, which waits until all threads have terminated before it goes on. Again, this works on the single-processor machine but not on the multi-processor one. Additionally, someone suggested a solution with a third "counter" class that is synchronized and decremented in each thread as processing finishes, after which notifyAll is called. The main thread waits after it starts all the threads, wakes up to check whether the count is zero, and if it's not, goes back to sleep. Again, this works on the single-processor machine but not on the multi-processor one.
The thread itself ends up doing a JNI call, which in turn calls another DLL that I just finished making thread safe (this has been tested via C programs). The procedures are mathematically intensive, but calculation time is around half a second on a P3 800 MHz.
Any help on this is greatly appreciated. My next step will likely be tearing down the application to exclude the calculating part just to test the JVM behavior. Is there a spec for JVM behavior on a multi-processor machine? I haven't seen one, and it's hard to know what to expect from join() (joining across two processors seems like a weird concept), currentThread() (since 2+ threads are running at the same time), etc.
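
The "counter class" variant described in this post exists ready-made as java.util.concurrent.CountDownLatch (Java 5 or later). A minimal sketch, with the class name LatchSketch and the trivial counter increment standing in for the real JNI work as assumptions. The timed await is the useful part for this kind of hang: if it times out, the remaining count shows that some workers never finished, which points at the work itself and not at the waiting logic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LatchSketch {

    // Starts 'count' threads, waits up to timeoutSeconds for all of them to
    // signal completion, and returns how many actually did their work.
    static int startAndWait(int count, long timeoutSeconds) throws InterruptedException {
        final CountDownLatch done = new CountDownLatch(count);
        final AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < count; i++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        completed.incrementAndGet();   // stands in for the real work
                    } finally {
                        done.countDown();              // always signal, even if the work throws
                    }
                }
            }, "worker-" + i).start();
        }
        boolean allFinished = done.await(timeoutSeconds, TimeUnit.SECONDS);
        if (!allFinished) {
            // some workers are stuck; the latch count says how many
            System.out.println("still running: " + done.getCount());
        }
        return completed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completed: " + startAndWait(1000, 30));
    }
}
```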

"My next step will likely be tearing down the application to exclude the calculating part just to test the JVM behavior."
Sounds like a really good idea.
"Is there a spec with the JVM behavior on a multi processor machine?"
The behaviour of threads is defined in the specs.
You might want to check the bug database also. There are bug fixes for various versions of Java.

CPU time from a multi processor

Hi
I need to get the CPU time from a multi-processor machine.
The top command will not do the job for me, because I need to use the command in automation to test the CPU time for 2 hours or more. I am thinking about redirecting the CPU % to a file, and at the end I will compute the average of the numbers in that file.
If I can see the CPU time for each of the two processors, that would be great, but it is more important to collect the overall CPU status.
I don't have a command line that I can use, so I could use some help.
Thanks, Shay

mpstat in fact works on my Opteron 270 dual-processor dual-core machine running Solaris 10. For instance, 'mpstat 3 5' displays 5 reports, each 3 seconds apart, showing the status of each CPU:
% mpstat 3 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1 0 1 383 245 73 0 5 0 0 87 1 0 0 99
1 1 0 1 33 4 51 0 4 0 0 57 0 0 0 99
2 0 0 1 38 0 72 0 2 0 0 42 0 0 0 100
3 1 0 1 53 26 49 0 1 1 0 47 0 0 0 100
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 393 252 76 0 8 0 0 94 1 0 0 99
1 0 0 0 48 4 82 0 7 1 0 71 0 0 0 100
2 3 0 0 39 0 76 0 4 1 0 51 0 0 0 100
3 0 0 0 44 25 35 0 3 2 0 49 0 0 0 100
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 382 250 64 0 5 0 0 111 0 0 0 100
1 0 0 0 29 4 43 0 4 0 0 56 1 0 0 99
2 0 0 1 48 1 93 0 3 0 0 39 0 0 0 100
3 0 0 0 69 29 78 0 1 1 0 65 0 0 0 100
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 386 250 72 0 4 0 0 111 0 0 0 99
1 0 0 0 28 3 42 0 3 0 0 55 1 0 0 99
2 0 0 0 42 0 81 0 1 0 0 43 0 0 0 100
3 0 0 0 67 29 74 0 0 1 0 63 0 0 0 100
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 404 252 98 0 5 0 0 100 1 0 0 99
1 0 0 0 45 9 68 0 5 1 0 53 0 0 0 100
2 0 0 0 34 0 64 0 4 0 0 50 0 0 0 100
3 0 0 0 58 27 60 0 2 2 0 73 0 0 0 100
(The first report summarizes all activity since boot.)
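
If the mpstat output is redirected to a file, the averaging step from the question can be done afterwards. A sketch under stated assumptions: the class name MpstatAverageSketch is made up, every report is assumed to start with a header row whose first field is CPU, and idl is assumed to be the last column, as in the output above. The first report is skipped because it summarizes activity since boot.

```java
import java.util.Arrays;
import java.util.List;

public class MpstatAverageSketch {

    // Mean of the last column (idl) over all per-CPU rows, ignoring the
    // first report because it covers activity since boot.
    static double averageIdle(List<String> lines) {
        int reports = 0;
        long total = 0;
        int rows = 0;
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 2) {
                continue;                       // blank or unexpected line
            }
            if (cols[0].equals("CPU")) {
                reports++;                      // a header row starts a new report
                continue;
            }
            if (reports < 2) {
                continue;                       // still inside the since-boot report
            }
            total += Integer.parseInt(cols[cols.length - 1]);
            rows++;
        }
        return rows == 0 ? Double.NaN : (double) total / rows;
    }

    public static void main(String[] args) {
        // Two reports for two CPUs, in the same layout as the mpstat output.
        List<String> sample = Arrays.asList(
            "CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl",
            "0 1 0 1 383 245 73 0 5 0 0 87 1 0 0 99",
            "1 1 0 1 33 4 51 0 4 0 0 57 0 0 0 99",
            "CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl",
            "0 0 0 0 393 252 76 0 8 0 0 94 1 0 0 99",
            "1 0 0 0 48 4 82 0 7 1 0 71 0 0 0 100");
        double idle = averageIdle(sample);
        System.out.println("average idle %: " + idle + ", average busy %: " + (100 - idle));
    }
}
```

For a real run the lines would be read from the redirected file instead of the inline sample; any non-numeric line after the second header would need extra handling.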

Monitor multi processor

hey all..
How can I monitor each processor in my multi-processor machine, which runs Solaris 8?
And if I create a thread using Java, why do I not see it in the process list when I use the ps command?

Regarding question 1: try the "mpstat" command...
$ mpstat 3
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 177 7 145 149 35 781 44 104 50 0 462 15 11 13 61
1 142 8 530 519 435 27 48 104 46 0 103 12 10 14 64
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 8639 0 2906 700 589 923 72 291 222 0 7232 18 50 1 30
1 8659 0 2237 439 315 994 75 290 236 0 7370 20 52 3 25
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 8350 0 2641 685 578 853 69 273 198 0 10050 23 50 2 26
1 8739 0 2353 491 313 1101 141 271 222 0 8423 21 51 1 27
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 8991 0 2834 680 554 987 81 300 238 0 6952 19 51 0 30
1 8825 0 2438 506 384 1013 80 293 237 0 7001 21 52 0 27

Performance problem - mutexes with multi-cpu machine

Hi!
My company is developing a multi-threaded server program for a
multi-CPU machine which communicates with an Oracle database on a separate
machine. We use Solaris 7, Workshop 5 CC and the pthreads API.
We tested our program on a 4-CPU E4500 with a 2-CPU E420 Oracle server.
We upgraded the E4500 from 4 to 8 CPUs, and to our surprise, instead of a
performance improvement we got a performance degradation (8 CPUs run
about 5% slower than 4 CPUs).
After a long investigation we found out that under stress load our program
spends most of its time in lwp_mutex-related operations.
The truss -c statistics showed 160 secs in mutex operations and
about 2 secs in read/write in the Oracle client-side library.
Here is some truss output, for example:
19989 29075/5: 374.0468 0.0080 lwp_mutex_lock(0x7F2F3F60) = 0
19990 29075/31: 374.0466 0.0006 lwp_mutex_wakeup(0x7F2F3F60) = 0
19991 29075/5: 374.0474 0.0006 lwp_mutex_wakeup(0x7F2F3F60) = 0
19992 29075/30: 374.0474 0.0071 lwp_mutex_lock(0x7F2F3F60) = 0
19993 29075/30: 374.0484 0.0010 lwp_mutex_wakeup(0x7F2F3F60) = 0
19994 29075/31: 374.0483 0.0017 lwp_mutex_lock(0x7F2F3F60) = 0
19995 29075/5: 374.0492 0.0018 lwp_mutex_lock(0x7F2F3F60) = 0
19996 29075/31: 374.0491 0.0008 lwp_mutex_wakeup(0x7F2F3F60) = 0
19997 29075/5: 374.0499 0.0007 lwp_mutex_wakeup(0x7F2F3F60) = 0
19998 29075/30: 374.0499 0.0015 lwp_mutex_lock(0x7F2F3F60) = 0
19999 29075/5: 374.0507 0.0008 lwp_mutex_lock(0x7F2F3F60) = 0
20000 29075/30: 374.0507 0.0008 lwp_mutex_wakeup(0x7F2F3F60) = 0
20001 29075/5: 374.0535 0.0028 lwp_mutex_wakeup(0x7F2F3F60) = 0
20002 29075/30: 374.0537 0.0030 lwp_mutex_wakeup(0x7F2F3F60) = 0
20003 29075/31: 374.0537 0.0046 lwp_mutex_lock(0x7F2F3F60) = 0
20004 29075/5: 374.0547 0.0012 lwp_mutex_lock(0x7F2F3F60) = 0
20005 29075/31: 374.0546 0.0009 lwp_mutex_wakeup(0x7F2F3F60) = 0
20006 29075/5: 374.0554 0.0007 lwp_mutex_wakeup(0x7F2F3F60) = 0
20007 29075/30: 374.0557 0.0020 lwp_mutex_lock(0x7F2F3F60) = 0
20008 29075/31: 374.0555 0.0009 lwp_mutex_wakeup(0x7F2F3F60) = 0
20009 29075/5: 374.0564 0.0010 lwp_mutex_lock(0x7F2F3F60) = 0
20010 29075/30: 374.0564 0.0007 lwp_mutex_wakeup(0x7F2F3F60) = 0
20011 29075/5: 374.0572 0.0008 lwp_mutex_wakeup(0x7F2F3F60) = 0
20012 29075/28: 374.0574 0.0170 lwp_mutex_lock(0x7F2F3F60) = 0
20013 29075/31: 374.0575 0.0020 lwp_mutex_wakeup(0x7F2F3F60) = 0
We have several questions:
1. We always get the same mutex address: 0x7F2F3F60, even with different
binaries. It looks as if all threads wait on one magic
mutex. Why?
2. We read in an article on unixinsider.com that on Solaris, when a mutex is
unlocked, all the threads waiting on this mutex are woken up. It also looks that way
from the truss output. What is the solution for this problem? unixinsider.com
recommends a native Solaris read-write lock with all threads as writers.
Is there any other solution? Should it improve performance?
3. We heard that Solaris 8 has a better pthreads implementation using a
one-level threading model, where threads are one-to-one with
LWPs, rather than the two-level model that is used in the
standard libthread implementation, where user-level threads are
multiplexed over possibly fewer LWPs. Do mutexes in this library
behave in the "Solaris 7" way, or does it put the thread to sleep when it unlocks
the mutex? Is it possible to use this library on Solaris 7?
4. Is there a plug-in solution, like mtmalloc or hoard for new/delete, that changes
pthread mutexes?
Thank you in advance for your help,
Alexander Indenbaum
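
To compare the cost of lock and wakeup calls from a trace like the one above, the second time column can be totalled per system call. The following is only a sketch in Java (not the poster's C++). It assumes each line has the layout shown in the post: a sequence number, pid/lwp, a timestamp, a second time value, then the call. The post does not say which truss timing flags produced that second value, so treat the totals as whatever that column measures. TrussSummary and sumBySyscall are made-up names.

```java
import java.util.Map;
import java.util.TreeMap;

public class TrussSummary {
    /**
     * Sums the fourth whitespace-separated field (the second time column)
     * per system call name. Lines that do not match the layout are skipped.
     */
    static Map<String, Double> sumBySyscall(String[] lines) {
        Map<String, Double> totals = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length < 5) {
                continue; // too short to be a trace line
            }
            int paren = f[4].indexOf('(');
            if (paren < 0) {
                continue; // fifth field is not "name(args..."
            }
            double seconds;
            try {
                seconds = Double.parseDouble(f[3]);
            } catch (NumberFormatException e) {
                continue; // time column is not a number
            }
            totals.merge(f[4].substring(0, paren), seconds, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // First four lines of the trace quoted in the post.
        String[] sample = {
            "19989 29075/5: 374.0468 0.0080 lwp_mutex_lock(0x7F2F3F60) = 0",
            "19990 29075/31: 374.0466 0.0006 lwp_mutex_wakeup(0x7F2F3F60) = 0",
            "19991 29075/5: 374.0474 0.0006 lwp_mutex_wakeup(0x7F2F3F60) = 0",
            "19992 29075/30: 374.0474 0.0071 lwp_mutex_lock(0x7F2F3F60) = 0",
        };
        for (Map.Entry<String, Double> e : sumBySyscall(sample).entrySet()) {
            System.out.printf("%-18s %.4f s%n", e.getKey(), e.getValue());
        }
    }
}
```

Grouping by the pid/lwp field instead of the call name would show whether a few LWPs account for most of the time.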

Jdeveloper dual core processor performance problem

I have a dual-core 2.4 GHz processor and 2 GB of RAM, I'm running JDeveloper 9.0.5.2, and my performance is terrible. Other developers in the company have non-dual-core processors, and they can start our application in debug mode using the JDeveloper embedded OC4J in 10 seconds, where it takes me 4 minutes! It is a Struts/EJB web application. Is there anything I can do to help in debug mode? Cheers.
Murray

Hi Bernard,
Which version of McAfee are you using?
On my (personal) laptop, I'm using McAfee VirusScan 9.0.10. I don't frequently run JDeveloper on this laptop, but when I do, it's not experiencing significant startup delays (it's a very low-power machine: PIII 650, 512 MB).
McAfee VirusScan seems to have very few configuration options (noticeably different from Norton, which I use on my corporate desktop machine). I specifically remember changing the "File Types to Scan" option to "Program files and documents only". You can get to this by right-clicking the "M" notification area icon, then the VirusScan->Options menu, then the Advanced button on the ActiveShield page.
In Norton, I think I have it configured so that it only scans files on write rather than on read. I also exclude directories which contain JDeveloper installs or other large Java apps (although scanning only on write eliminates most of the performance problems anyway and still leaves your system reasonably secure).
The easiest way to convince your MIS dept that the virus checker is the source of your problems might be to ask them to allow you to turn it off in order to test the difference it makes to performance. It's a reasonable request to make if you're trying to eliminate possible causes for the slowdown (from the description you gave, the AV upgrade is the first place I'd start looking).
If the virus checker is the source of your problems, you'll probably be seeing massive slowness in most large Java applications that have a large number of JARs on their classpath.
Thanks,
Brian

Urgent : UI refresh problem on dual processor or hyper threaded machine

Hello,
I am developing an applet that displays a set of tabs to the user. Each tab contains JLists.
When the user selects from any of the JLists, all lists are updated to reflect what is still available for selection. The number of elements in a list can be as large as a few hundred thousand. This all works fine on a single-processor machine.
However, when running the same applet on a dual-processor machine or an Intel hyper-threaded processor, the lists are not updated correctly. For example, if a list originally has 100 thousand elements in it, then when I select an element from it, the list is updated with the available elements. Assume there are only 2 thousand elements available after that selection. Everything is fine so far, but when I de-select my selection, the list is supposed to be updated again with the original 100 thousand elements. The real problem is that the list does get updated properly, but a few elements from the list are not visible on the screen. However, if I minimize the browser (as it is an applet) and restore it again, all elements within the list are displayed correctly. Any help will be highly appreciated.

> When your JList refresh occurs, the items are loaded into the JList from some other Object, right? Not knowing more about your code, I can't say exactly what that would be.
Yes, you are right. Let me explain in brief the architecture of the code.
I have a class, say CatPanel, that extends JPanel and has a BorderLayout with a JList (wrapped in a JScrollPane) in the center.
The main GUI class controls all such CatPanels. So, when the list is to be updated, the GUI class invokes a method on CatPanel with a Vector of elements to be inserted as an argument. Elements are inserted using the setListData method of the JList.
> If I'm right, then this might explain the behavior you are seeing. In a single processor environment, the data is not cached and therefore everything is fine. When you go to the dual-processor machine, the threads have their own caches and are not always in sync. When you minimize the window and restore it, this could trigger a resync of the cache.
Is there a way to trigger such synchronization?
> One thing I'm not sure about is, when you return to the 'main' list, do you still see the selected sublist or the main list with some elements missing?
When a list is updated as a result of de-selection, a few elements from the start of the list are not visible.
I can see all other elements by using the scroll bar. So, if there are some selections that belong to the beginning portion of the list, then the selected sublist is not visible.
Let me explain with the following example. Assume that with the current selection, I have 100 elements in the list and all of them are visible on the screen. Now, as a result of de-selection, suppose the list should contain 200 elements rather than 100. What happens in this case is that if the first 50 elements are not visible, I see blank space corresponding to those 50 elements at the start of the list, followed by the remaining elements. I tried overriding the paint method of the element that is actually inserted into the list, but the result is still the same.
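
One thing that may be worth trying, offered only as a guess: Swing components are meant to be touched on the event dispatch thread (EDT) only. The thread does not say which thread the main GUI class is on when it calls into CatPanel. If it is a worker thread, queueing the setListData call onto the EDT is the usual remedy for updates that show up only after a forced repaint. A minimal sketch follows, with made-up names (EdtListUpdate, runOnEdt, updateList, flushEdt); it is not the poster's code.

```java
import java.lang.reflect.InvocationTargetException;
import java.util.Vector;
import javax.swing.JList;
import javax.swing.SwingUtilities;

public class EdtListUpdate {
    /** Runs the update on the EDT: right away if already there, queued otherwise. */
    static void runOnEdt(Runnable update) {
        if (SwingUtilities.isEventDispatchThread()) {
            update.run();
        } else {
            SwingUtilities.invokeLater(update);
        }
    }

    /** Replaces the list contents on the EDT, then asks for a fresh layout and repaint. */
    static void updateList(final JList<String> list, final Vector<String> items) {
        runOnEdt(() -> {
            list.setListData(items);
            list.revalidate();
            list.repaint();
        });
    }

    /**
     * Blocks until everything queued before this call has run on the EDT.
     * Must not be called from the EDT itself.
     */
    static void flushEdt() {
        try {
            SwingUtilities.invokeAndWait(() -> { });
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (InvocationTargetException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

In CatPanel, the method that currently calls setListData directly would call EdtListUpdate.updateList(list, elements) instead. If the update already runs on the EDT, this changes nothing and the cause lies elsewhere.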
