Performance degradation on multi-processor computer

I have seen a couple of similar topics, but they are not the same. The issue is that the .NET version of the app is barely affected while the Java version is. See below. Thank you.
A strange picture is observed here. Same test, same data: the more CPUs the computer has, the slower the test runs:
1) Powerful 8-CPU server with 1 CPU assigned to the process via "Set Affinity" in Task Manager: 1578 ms
2) Powerful 8-CPU server with 2 CPUs assigned to the process via "Set Affinity" in Task Manager: 1656 ms
3) Powerful 8-CPU server with no adjustments via Task Manager - all 8 CPUs enabled: 3469 ms
4) Tests 1-3 were on the powerful server. On my much less powerful laptop, with a single core enabled in the BIOS, I get: 921 ms (!!!) How come?
The test has 2 active threads. The first one pushes 5,000,000 updates into a queue; at the same time the second thread pops updates from that queue. Source is provided below.
With a more sophisticated test that I cannot provide, the performance difference is much more noticeable. That test makes more synchronized calls.
The only 2 explanations I can think of are:
1) The need to synchronize CPU caches across all available CPUs. But why would the machine need to keep all 8 CPUs coherent if only 2 threads touch that data, meaning at most 2 CPUs are active at any time?
2) The Windows scheduler keeps moving the active threads from CPU to CPU. But why would it do so if that only makes the more powerful server slower?
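To illustrate what I mean in (1), here is a toy sketch - not the real test, just two threads taking turns on one lock, so every hand-off drags the lock word and the shared counter between whichever CPU caches the threads happen to run on (class name and iteration count are made up for illustration only):

public class LockPingPong {
    private static final Object lock_ = new Object();
    private static long counter_ = 0;
    private static final long ITERATIONS = 10 * 1000 * 1000;

    public static void main(String[] args) throws Exception {
        Runnable worker = new Runnable() {
            @Override
            public void run() {
                for (long i = 0; i < ITERATIONS; ++i) {
                    synchronized (lock_) { // contended lock: ownership bounces between threads
                        ++counter_;
                    }
                }
            }
        };
        Thread a = new Thread(worker, "ping");
        Thread b = new Thread(worker, "pong");
        long start = System.currentTimeMillis();
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("counter=" + counter_
                + ", took " + (System.currentTimeMillis() - start) + " ms");
    }
}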
I hope there is a good explanation for this behavior - and, most importantly, a way to configure our server so it really beats my laptop's numbers.
By the way, the same kind of sample in .NET 2.0 (also provided) as an x64 binary runs in 546 ms on my laptop, and in 781 ms and 813 ms on the server with affinity set to 1 CPU and to all 8 CPUs respectively. That makes me think this issue is JVM-related. Please comment.
== ThreadBoundaryQueue.java file ==
import java.text.NumberFormat;
import java.util.ArrayDeque;

public class ThreadBoundaryQueue {

    private static final int COUNT = 5 * 1000 * 1000;
    private static final int delayBeforeStart = 10;

    private static void print(String text) {
        System.out.println(Thread.currentThread().getName() + ": " + text);
    }

    public static void main(String[] args) throws Exception {
        SysInfo.dump();
        print("Preparing data for test...");
        Long events[] = new Long[COUNT];
        for (int i = 0; i < COUNT; ++i) {
            events[i] = new Long(i / 87);
        }
        final Object notifier = new Object();
        final boolean finished[] = new boolean[1];
        SimpleEventQueueWithArrayDeque queue = new SimpleEventQueueWithArrayDeque(new IEventDispatcher<Long>() {
            long sum_;
            int count_;

            @Override
            public void dispatchEvent(Long event) {
                for (long l = event.longValue() * 50; l > 0; l /= 17) {
                    sum_ += l;
                }
                if (++count_ == COUNT) {
                    synchronized (notifier) {
                        finished[0] = true;
                        notifier.notify();
                        print("some dummy sum is " + sum_);
                    }
                }
            }
        });
        queue.startDispatching();
        print("Test starts in " + delayBeforeStart + " seconds");
        print("================================");
        Thread.sleep(delayBeforeStart * 1000);
        print("Started...");
        TimeCounter sendingTime = new TimeCounter();
        TimeCounter processingTime = new TimeCounter();
        for (int i = 0; i < COUNT; ++i) {
            queue.postEvent(events[i]);
        }
        sendingTime.dump("sending of " + COUNT + " events");
        synchronized (notifier) {
            while (!finished[0]) {
                notifier.wait(1);
            }
        }
        processingTime.dump("processing of " + COUNT + " events");
        print("...Done");
        queue.stopDispatching();
    }

    private static class SysInfo {
        public static void dump() {
            for (String p : new String[] {
                    "os.name",
                    "os.version",
                    "os.arch",
                    "sun.arch.data.model",
                    "java.runtime.version",
                    "java.vm.name",
            }) {
                print(p + "=" + System.getProperty(p));
            }
            print("CPU count=" + Runtime.getRuntime().availableProcessors() + "\n");
        }
    }

    private static class TimeCounter {
        private long start_;

        public TimeCounter() {
            start_ = System.currentTimeMillis();
        }

        public void dump(String name) {
            print(name
                    + " took, milliseconds: "
                    + NumberFormat.getIntegerInstance().format(System.currentTimeMillis() - start_));
        }
    }

    private interface IEventDispatcher<EventType> {
        void dispatchEvent(EventType event);
    }

    private static class SimpleEventQueueWithArrayDeque implements Runnable {
        private ArrayDeque<Long> buffer_;
        private Thread thread_;
        private IEventDispatcher<Long> dispatcher_;

        public SimpleEventQueueWithArrayDeque(IEventDispatcher<Long> dispatcher) {
            buffer_ = new ArrayDeque<Long>(65535);
            dispatcher_ = dispatcher;
        }

        public void postEvent(Long event) {
            synchronized (buffer_) {
                buffer_.addLast(event);
                buffer_.notify();
            }
        }

        @Override
        public void run() {
            try {
                Long event;
                while (!Thread.interrupted()) {
                    synchronized (buffer_) {
                        event = buffer_.poll();
                        while (event == null) {
                            buffer_.wait(300);
                            event = buffer_.poll();
                        }
                    }
                    dispatcher_.dispatchEvent(event);
                }
            } catch (InterruptedException ex) {
                // stopDispatching() interrupts the thread to end the loop
            }
        }

        public synchronized void startDispatching() {
            if (thread_ == null) {
                thread_ = new Thread(this, "queue dispatcher");
                thread_.setDaemon(false);
                thread_.start();
            }
        }

        public synchronized void stopDispatching() {
            if (thread_ != null) {
                thread_.interrupt();
                try {
                    thread_.join();
                } catch (Exception ex) {
                    // ignore
                }
                thread_ = null;
            }
        }
    }
}

No problem - we are quite sure of what we see, and we repeated the test. The new test runs 5 warm-up rounds of 5,000,000 updates each, and then another 20 rounds that are actually counted. Results, with our process set to "Realtime" priority in both cases:
All 8 CPUs on the 8-CPU server: 40.5 seconds
1 CPU on the same 8-CPU server: 17.9 seconds
To add more confidence in the results, we did the same with our PRODUCTION system, which runs 4 heavy processes on an 8-CPU server. We set the affinity of each process to 2 of the 8 CPUs, and the heavily loaded system then used roughly half the CPU time (each of the 4 processes makes a lot of synchronized calls). How can this be explained, and how should we configure our system given that knowledge? And why don't Windows and/or the JVM do this for us?
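Something like the following launcher is one way to pin each JVM without Task Manager (only a sketch: the affinity mask, classpath and main class are placeholders, and it assumes a Windows version whose "start" command supports the /affinity switch):

import java.util.Arrays;

public class PinnedLauncher {
    public static void main(String[] args) throws Exception {
        // start /affinity takes a hex CPU mask: 3 = CPUs 0 and 1, C = CPUs 2 and 3, and so on.
        // The JVM is launched in its own console window, already pinned to that mask.
        ProcessBuilder pb = new ProcessBuilder(Arrays.asList(
                "cmd", "/c", "start", "/affinity", "3",
                "java", "-cp", ".", "ThreadBoundaryQueue"));
        pb.start().waitFor();
    }
}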
The code is below. Thank you.
import java.text.NumberFormat;
import java.util.ArrayDeque;

public class ThreadBoundaryQueue {

    private static final int COUNT = 5 * 1000 * 1000;
    private static final int delayBeforeStart = 15;
    private static Long Events_[];

    private static void print(String text) {
        System.out.println(Thread.currentThread().getName() + ": " + text);
    }

    public static void main(String[] args) throws Exception {
        SysInfo.dump();
        final Object notifier = new Object();
        final boolean finished[] = new boolean[1];
        IEventDispatcher<Long> d = new IEventDispatcher<Long>() {
            long sum_;
            int count_;

            @Override
            public void dispatchEvent(Long event) {
                for (long l = event.longValue() * 50; l > 0; l /= 17) {
                    sum_ += l;
                }
                if (++count_ == COUNT) {
                    synchronized (notifier) {
                        finished[0] = true;
                        print("some dummy sum is " + sum_);
                        sum_ = 0;
                        count_ = 0;
                        notifier.notify();
                    }
                }
            }
        };
        print("Preparing data for tests...");
        Events_ = new Long[COUNT];
        for (int i = 0; i < COUNT; ++i) {
            Events_[i] = new Long(i / 87);
        }
        print("Test starts in " + delayBeforeStart + " seconds");
        print("================================");
        Thread.sleep(delayBeforeStart * 1000);
        print("BEGIN Warmup...");
        for (int i = 0; i < 5; ++i) {
            print("Test " + (i + 1) + "...");
            test(new SimpleEventQueueWithArrayDeque(d), notifier, finished);
        }
        print("END Warmup...");
        TimeCounter allTime = new TimeCounter();
        for (int i = 0; i < 20; ++i) {
            print("Test " + (i + 1) + "...");
            test(new SimpleEventQueueWithArrayDeque(d), notifier, finished);
        }
        print("...Done");
        allTime.dump("ALL TESTS after warm up");
    }

    private static void test(SimpleEventQueueWithArrayDeque queue, Object notifier, boolean[] finished) throws Exception {
        synchronized (notifier) {
            finished[0] = false;
        }
        queue.startDispatching();
        print("Started...");
        TimeCounter sendingTime = new TimeCounter();
        TimeCounter processingTime = new TimeCounter();
        for (int i = 0; i < COUNT; ++i) {
            queue.postEvent(Events_[i]);
        }
        sendingTime.dump("sending of " + COUNT + " events");
        synchronized (notifier) {
            while (!finished[0]) {
                notifier.wait(100);
            }
        }
        processingTime.dump("processing of " + COUNT + " events");
        queue.stopDispatching();
    }

    private static class SysInfo {
        public static void dump() {
            for (String p : new String[] { "os.name", "os.version", "os.arch",
                    "sun.arch.data.model", "java.runtime.version", "java.vm.name", }) {
                print(p + "=" + System.getProperty(p));
            }
            print("CPU count=" + Runtime.getRuntime().availableProcessors() + "\n");
        }
    }

    private static class TimeCounter {
        private long start_;

        public TimeCounter() {
            start_ = System.currentTimeMillis();
        }

        public void dump(String name) {
            print(name
                    + " took, milliseconds: "
                    + NumberFormat.getIntegerInstance().format(System.currentTimeMillis() - start_));
        }
    }

    private interface IEventDispatcher<EventType> {
        void dispatchEvent(EventType event);
    }

    private static class SimpleEventQueueWithArrayDeque implements Runnable {
        private ArrayDeque<Long> buffer_;
        private Thread thread_;
        private IEventDispatcher<Long> dispatcher_;

        public SimpleEventQueueWithArrayDeque(IEventDispatcher<Long> dispatcher) {
            buffer_ = new ArrayDeque<Long>(65535);
            dispatcher_ = dispatcher;
        }

        public void postEvent(Long event) throws Exception {
            synchronized (buffer_) {
                buffer_.addLast(event);
                buffer_.notifyAll();
            }
        }

        @Override
        public void run() {
            try {
                Long event;
                while (!Thread.interrupted()) {
                    synchronized (buffer_) {
                        event = buffer_.poll();
                        while (event == null) {
                            buffer_.wait(100);
                            event = buffer_.poll();
                        }
                    }
                    dispatcher_.dispatchEvent(event);
                }
            } catch (InterruptedException ex) {
                // stopDispatching() interrupts the thread to end the loop
            }
        }

        public synchronized void startDispatching() {
            if (thread_ == null) {
                thread_ = new Thread(this, "queue dispatcher");
                thread_.setDaemon(false);
                thread_.start();
            }
        }

        public synchronized void stopDispatching() {
            if (thread_ != null) {
                thread_.interrupt();
                try {
                    thread_.join();
                } catch (Exception ex) {
                    // ignore
                }
                thread_ = null;
            }
        }
    }
}
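For comparison, the same hand-off could be built on java.util.concurrent instead of explicit synchronized/wait/notify. This is only a rough sketch (it is not what we measured above, and the dispatcher interface is repeated here just to keep the snippet self-contained):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Mirrors the IEventDispatcher used by the test above.
interface IEventDispatcher<EventType> {
    void dispatchEvent(EventType event);
}

// Same producer/consumer shape as SimpleEventQueueWithArrayDeque, but the
// blocking hand-off is delegated to ArrayBlockingQueue.
class BlockingEventQueue implements Runnable {
    private final BlockingQueue<Long> buffer_ = new ArrayBlockingQueue<Long>(65536);
    private final IEventDispatcher<Long> dispatcher_;
    private Thread thread_;

    BlockingEventQueue(IEventDispatcher<Long> dispatcher) {
        dispatcher_ = dispatcher;
    }

    public void postEvent(Long event) throws InterruptedException {
        buffer_.put(event); // blocks only while the queue is full
    }

    @Override
    public void run() {
        try {
            while (!Thread.interrupted()) {
                Long event = buffer_.poll(100, TimeUnit.MILLISECONDS);
                if (event != null) {
                    dispatcher_.dispatchEvent(event);
                }
            }
        } catch (InterruptedException ex) {
            // stopDispatching() interrupts the thread to end the loop
        }
    }

    public synchronized void startDispatching() {
        if (thread_ == null) {
            thread_ = new Thread(this, "queue dispatcher");
            thread_.start();
        }
    }

    public synchronized void stopDispatching() throws InterruptedException {
        if (thread_ != null) {
            thread_.interrupt();
            thread_.join();
            thread_ = null;
        }
    }
}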

Similar Messages

  • Performance problems on multi processor machine

    Hi,
    we are using software written in Java 1.4 (as servlets on Tomcat 4.1.29) which has very poor performance. Sadly we don't have the source code for the application, but from the log files we can see that there is a very large overhead when execution moves between processors; i.e. when moving from processor 1 to processor 2, a statement which usually needs only 50 ms takes around 20 seconds to finish. That cannot be right....
    Do you have any suggestions, maybe about the parameters used to start Java?
    We use the following startup-properties:
    -d64 -server -Xms1G -Xmx2G -Xmn800m -XX:+DisableExplicitGC -XX:+UseParallelGC -verbose:GC -Djava.awt.headless=true
    Thanks for your help,
    Anton

    Before anyone answers this, check out what was already attempted at his stinkin' CROSSPOST:
    http://forum.java.sun.com/thread.jsp?thread=553113&forum=54&message=2706725

  • JRE with multi-processors

    Does anybody know whether JRE 1.3 is able to work well on a multi-processor computer with a Unix OS?
    If so, how is this done? Which parameters does one have to initialize in the JVM so that it would work?
    If not, how does it work with JVM 1.4, if at all?

    Threaded code on a single cpu gets timesliced/swapped by the underlying OS but on a multi-CPU system, you get true
    parallel execution for free.
    Scott Oaks & Henry Wong have a chapter on this in the O'Reilly Java Threads book.

  • Multi-Processor Performance

    I would like to know what performance gains I might expect from moving our
    weblogic application server to a multi-processor machine. Will 2 processors
    handle twice the load of the one-processor machine?
    Platform: Solaris 2.6
    Weblogic Server: 4.5.1 SP7
    NativeIO enabled
    Weblogic Server is the only thing running on the machine.
    Other Questions:
    1. Is there anything that needs to be done (other than purchasing another
    license) for the weblogic server to work on a multi-processor system?
    2. Will the weblogic server naturally take advantage of both processors?
    3. Will performance gains be uniform or will certain features gain more
    from multiple processors?
    Any links or suggestions are appreciated.
    thanks,
    Jeremy

    Hi Jeremy -
    If you are interested in modeling this before implementing it to determine
    performance gains, you might want to check out our scalability assessment
    services description, see attached. We are a BEA Technology Alliance Partner
    that specializes in answering those specific performance questions, and have
    done that for a number of clients in the past few weeks.
    (See also eQASEsheet2.pdf) - this describes our capacity sizing tool that works
    particularly well for Weblogic.
    Todd
    --
    Todd Wiseman
    Dir/Business Development
    eQASE LLC
    (303)790-4242 x130
    (303)790-2816
    www.eqase.com
    Java Performance & Scalability
    [eQASE WLS Consulting Offerings.pdf]
    [eQASEsheet2.pdf]

  • Performance degradation with addition of unicasting option

    We have been using the multicast protocol for setting up the data grid between the application nodes, with the VM arguments
    -Dtangosol.coherence.clusteraddress=${Broadcast Address} -Dtangosol.coherence.clusterport=${Broadcast port}
    As a certain node in the application was expected to be in a different subnet and multicast was not feasible, we opted for well-known addressing, with the following additional VM arguments set on the server nodes (all in the same subnet):
    -Dtangosol.coherence.machine=${server_name} -Dtangosol.coherence.wka=${server_ip} -Dtangosol.coherence.localport=${server_port}
    and the following on the remote client node, pointing to one of the server nodes:
    -Dtangosol.coherence.wka=${server_ip} -Dtangosol.coherence.wka.port=${server_port}
    But this drastically deteriorated performance, both when pushing data into the cache and when getting events via a map listener.
    From the Coherence logging statements it doesn't seem that multicast is being used, at least within the server nodes (which are in the same subnet).
    Is it feasible to have unicast and multicast coexist? How can I verify whether that is already set up?
    Is the performance degradation with well-known addressing a limitation, and is it expected?

    Hi Mahesh,
    From your description it sounds as if you've configured each node with a WKA list including just itself. This would result in N clusters rather than 1. Your client would then be serviced by the resources of just a single cache server rather than an entire cluster. If this is the case you will see that all nodes are identified as member 1. To set up WKA I would suggest using the override file rather than system properties, and place perhaps 10% of your nodes on that list. Then use this exact same file for all nodes. If I've misinterpreted your configuration please provide additional details.
    Thanks,
    Mark
    Oracle Coherence

  • SCOM reports "A significant portion of the database buffer cache has been written out to the system paging file. This may result in severe performance degradation"

    This was discussed here, with no resolution
    http://social.technet.microsoft.com/Forums/en-US/exchange2010/thread/bb073c59-b88f-471b-a209-d7b5d9e5aa28?prof=required
    I have the same issue.  This is a single-purpose physical mailbox server with 320 users and 72GB of RAM.  That should be plenty.  I've checked and there are no manual settings for the database cache.  There are no other problems with
    the server, nothing reported in the logs, except for the aforementioned error (see below).
    The server is sluggish.  A reboot will clear up the problem temporarily.  The only processes using any significant amount of memory are store.exe (using 53GB), regsvc (using 5) and W3 and Monitoringhost.exe using 1 GB each.  Does anyone have
    any ideas on this?
    Warning ESE Event ID 906. 
    Information Store (1497076) A significant portion of the database buffer cache has been written out to the system paging file.  This may result in severe performance degradation. See help link for complete details of possible causes. Resident cache
    has fallen by 213107 buffers (or 11%) in the last 207168 seconds. Current Total Percent Resident: 79% (1574197 of 1969409 buffers)

    Brian,
    We had this event log entry as well which SCOM picked up on, and 10 seconds before it the Forefront Protection 2010 for Exchange updated all of its engines.
    We are running Exchange 2010 SP2 RU3 with no file system antivirus (the boxes are restricted and have UAC turned on as mitigations). We are running the servers primarily as Hub Transport servers with 16GB of RAM, but they do have the mailbox role installed
    for the sole purpose of serving as our public folder servers.
    So we theorized the STORE process was just grabbing a ton of RAM, and occasionally it was told to dump the memory so the other processes could grab some - thus generating the alert. Up until last night we thought nothing of it, but ~25 seconds after the
    cache flush to paging file, we got the following alert:
    Log Name:      Application
    Source:        MSExchangeTransport
    Date:          8/2/2012 2:08:14 AM
    Event ID:      17012
    Task Category: Storage
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      HTS1.company.com
    Description:
    Transport Mail Database: The database could not allocate memory. Please close some applications to make sure you have enough memory for Exchange Server. The exception is Microsoft.Exchange.Isam.IsamOutOfMemoryException: Out of Memory (-1011)
       at Microsoft.Exchange.Isam.JetInterop.CallW(Int32 errFn)
       at Microsoft.Exchange.Isam.JetInterop.MJetOpenDatabase(MJET_SESID sesid, String file, String connect, MJET_GRBIT grbit, MJET_WRN& wrn)
       at Microsoft.Exchange.Isam.JetInterop.MJetOpenDatabase(MJET_SESID sesid, String file, MJET_GRBIT grbit)
       at Microsoft.Exchange.Isam.JetInterop.MJetOpenDatabase(MJET_SESID sesid, String file)
       at Microsoft.Exchange.Isam.Interop.MJetOpenDatabase(MJET_SESID sesid, String file)
       at Microsoft.Exchange.Transport.Storage.DataConnection..ctor(MJET_INSTANCE instance, DataSource source).
    Followed by:
    Log Name:      Application
    Source:        MSExchangeTransport
    Date:          8/2/2012 2:08:15 AM
    Event ID:      17106
    Task Category: Storage
    Level:         Information
    Keywords:      Classic
    User:          N/A
    Computer:      HTS1.company.com
    Description:
    Transport Mail Database: MSExchangeTransport has detected a critical storage error, updated the registry key (SOFTWARE\Microsoft\ExchangeServer\v14\Transport\QueueDatabase) and as a result, will attempt self-healing after process restart.
    Log Name:      Application
    Source:        MSExchangeTransport
    Date:          8/2/2012 2:13:50 AM
    Event ID:      17102
    Task Category: Storage
    Level:         Warning
    Keywords:      Classic
    User:          N/A
    Computer:      HTS1.company.com
    Description:
    Transport Mail Database: MSExchangeTransport has detected a critical storage error and has taken an automated recovery action.  This recovery action will not be repeated until the target folders are renamed or deleted. Directory path:E:\EXCHSRVR\TransportRoles\Data\Queue
    is moved to directory path:E:\EXCHSRVR\TransportRoles\Data\Queue\Queue.old.
    So it seems as if Forefront Protection 2010 for Exchange inadvertently triggered the cache flush, which didn't appear to happen quickly or thoroughly enough for the transport service to do what it needed to do, so it freaked out and performed the subsequent
    actions.
    Do you have any ideas on how to prevent this 906 warning, which cascaded into a transport service outage?
    Thanks!

  • JDBC, SQL*Net wait interface, performance degradation on 10g vs. 9i

    Hi All,
    I came across a performance issue that I think results from a misconfiguration of something between Oracle and JDBC. The logic of my system runs 12 threads in Java. Each thread performs a simple 'select a,b,c...f from table_xyz' on a different table (so I have 12 different tables, with cardinality from 3 to 48 million rows, and one worker thread per table).
    In each thread I'm creating a result set that is explicitly marked as forward-only, the transaction is set read-only, and the fetch size is set to 100,000 records. The Java logic processes records in a standard while(rs.next()) {...} loop.
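    Roughly, each worker thread looks like the sketch below (reconstructed from the description above; the connection details and table/column names are placeholders, not the real code):

    import java.sql.*;

    public class FullScanWorker implements Runnable {
        private final String url_, user_, password_, table_;

        FullScanWorker(String url, String user, String password, String table) {
            url_ = url; user_ = user; password_ = password; table_ = table;
        }

        @Override
        public void run() {
            try {
                Connection con = DriverManager.getConnection(url_, user_, password_);
                con.setReadOnly(true); // read-only transaction
                Statement stmt = con.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                     ResultSet.CONCUR_READ_ONLY);
                stmt.setFetchSize(100000); // 100,000-row fetches, as described above
                ResultSet rs = stmt.executeQuery("select a, b, c, d, e, f from " + table_);
                while (rs.next()) {
                    // process one row in plain Java
                }
                rs.close();
                stmt.close();
                con.close();
            } catch (SQLException ex) {
                ex.printStackTrace();
            }
        }
    }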
    I'm experiencing performance degradation between execution on Oracle 9i and Oracle 10g of the same java code, on the same machine, on the same data. The difference is enormous, 9i execution takes 26 hours while 10g execution takes 39 hours.
    I have collected statspack for 9i and awr report for 10g. Below I've enclosed top wait events for 9i and 10g
    ===== 9i ===================
    Event | Waits | Timeouts | Total Wait Time (s) | Avg wait (ms) | Waits/txn
    db file sequential read 22,939,988 0 6,240 0 0.7
    control file parallel write 6,152 0 296 48 0.0
    SQL*Net more data to client 2,877,154 0 280 0 0.1
    db file scattered read 26,842 0 91 3 0.0
    log file parallel write 3,528 0 83 23 0.0
    latch free 94,845 0 50 1 0.0
    process startup 93 0 5 50 0.0
    log file sync 34 0 2 46 0.0
    log file switch completion 2 0 0 215 0.0
    db file single write 9 0 0 33 0.0
    control file sequential read 4,912 0 0 0 0.0
    wait list latch free 15 0 0 12 0.0
    LGWR wait for redo copy 84 0 0 1 0.0
    log file single write 2 0 0 18 0.0
    async disk IO 263 0 0 0 0.0
    direct path read 2,058 0 0 0 0.0
    slave TJ process wait 1 1 0 12 0.0
    ===== 10g ==================
    Event | Waits | %Time-outs | Total Wait Time (s) | Avg wait (ms) | Waits/txn
    db file scattered read 268,314 .0 2,776 10 0.0
    SQL*Net message to client 278,082,276 .0 813 0 7.1
    io done 20,715 .0 457 22 0.0
    control file parallel write 10,971 .0 336 31 0.0
    db file parallel write 15,904 .0 294 18 0.0
    db file sequential read 66,266 .0 257 4 0.0
    log file parallel write 3,510 .0 145 41 0.0
    SQL*Net more data to client 2,221,521 .0 102 0 0.1
    SGA: allocation forcing comp 2,489 99.9 27 11 0.0
    log file sync 564 .0 23 41 0.0
    os thread startup 176 4.0 19 106 0.0
    latch: shared pool 372 .0 11 29 0.0
    latch: library cache 537 .0 5 10 0.0
    rdbms ipc reply 57 .0 3 49 0.0
    log file switch completion 5 40.0 3 552 0.0
    latch free 4,141 .0 2 0 0.0
    I put full blame for the slowdown on the SQL*Net message to client wait event. All I could find about this event is that it is a network-related problem. I would assume that to be true if the database and the client were on different machines; however, in my case they are on the very same machine.
    I'd be very grateful if someone could point me in the right direction, i.e. give a hint: what statistics should I analyze further? What might cause this event to appear? Why does the probable cause (which is said to be outside the DB) affect only the 10g instance?
    Thanks in advance,
    Rafi.

    Hi Steven,
    Thanks for the input. It's a fact that I did not gather statistics on my tables. My understanding is that statistics are useful for queries more complex than a simple select * from table_xxx. In my case the tables don't have indexes, and there is no filtering condition either. A full table scan is what I actually want, as all the software logic is inside the Java code.
    Explain plans are as follows:
    ======= 10g ================================
    PLAN_TABLE_OUTPUT
    Plan hash value: 1141003974
    | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
    | 0 | SELECT STATEMENT | | 1 | 259 | 2 (0)| 00:00:01 |
    | 1 | TABLE ACCESS FULL| xxx | 1 | 259 | 2 (0)| 00:00:01 |
    In sqlplus I get:
    SQL> set autotrace traceonly explain statistics;
    SQL> select * from xxx;
    36184384 rows selected.
    Elapsed: 00:38:44.35
    Execution Plan
    Plan hash value: 1141003974
    | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
    | 0 | SELECT STATEMENT | | 1 | 259 | 2 (0)| 00:00:01 |
    | 1 | TABLE ACCESS FULL| xxx | 1 | 259 | 2 (0)| 00:00:01 |
    Statistics
    1 recursive calls
    0 db block gets
    3339240 consistent gets
    981517 physical reads
    116 redo size
    26535700 bytes received via SQL*Net from client
    2412294 SQL*Net roundtrips to/from client
    0 sorts (memory)
    0 sorts (disk)
    36184384 rows processed
    ======= 9i =================================
    PLAN_TABLE_OUTPUT
    | Id | Operation | Name | Rows | Bytes | Cost |
    | 0 | SELECT STATEMENT | | | | |
    | 1 | TABLE ACCESS FULL | xxx | | | |
    Note: rule based optimization
    In sqlplus I get:
    SQL> set autotrace traceonly explain statistics;
    SQL> select * from xxx;
    36184384 rows selected.
    Elapsed: 00:17:43.06
    Execution Plan
    0 SELECT STATEMENT Optimizer=CHOOSE
    1 0 TABLE ACCESS (FULL) OF 'xxx'
    Statistics
    0 recursive calls
    1 db block gets
    3306118 consistent gets
    957515 physical reads
    100 redo size
    23659424 bytes sent via SQL*Net to client
    26535867 bytes received via SQL*Net from client
    2412294 SQL*Net roundtrips to/from client
    0 sorts (memory)
    0 sorts (disk)
    36184384 rows processed
    Thanks for pointing out the difference in table scans. I infer that 9i is doing a single-block full table scan (db file sequential read) while 10g is using a multi-block full table scan (db file scattered read).
    I now have a theory that 9i is faster because sequential reads use contiguous buffer space while scattered reads use discontiguous buffer space. Since I'm accessing data 'row by row' in JDBC, 10g might have an overhead in serving data from discontiguous buffer space. This overhead shows itself as the SQL*Net message to client wait. Does that make any sense?
    Is there any way I could force 10g (e.g. with a hint) to use sequential reads instead of scattered reads for a full table scan?
    I'll experiment with FTS tuning in 10g by enabling automatic multi-block reads tuning (i.e. db_file_multiblock_read_count=0 instead of 32 as it is now). I'll also check if response time improves after statistics are gathered.
    Please advice if you have any other ideas.
    Thanks & regards,
    Rafi.

  • Concurrency in Swing,  Multi-processor system

    I have two questions:
    1. This is a classic situation where I am looking for a definitive answer on: I've read about the single-thread rule/EDT, SwingWorker, and the use of invokeLater()/invokeAndWait(). The system I am designing will have multiple Swing windows (JInternalFrames) that do fairly complex GUI work. No direct interaction is needed between the windows, which greatly simplifies things. Some windows are horrendously complex, and I simply want to ensure that one slow window doesn't bog down the rest of the UI. I'm not entirely clear on what exactly I should be threading: should the entire JInternalFrame itself be runnable? The expensive operation within the JInternalFrame? A good example of this is a complex paint() method: in this case I've heard of spawning a thread to render to a back-buffer of sorts, then blitting the whole thing when ready. In short, what's the cleanest approach here to ensure that one rogue window doesn't block others? I apologize if this is something addressed over and over, but most examples seem to point to the classic case of "the expensive DB operation" within a Swing app.
    2. Short and sweet: any way to have Swing take advantage of multi-processor systems, say, a system with 6 processors available to it? If you have one Swing process that spawns 10 threads, that's still just one process and the OS probably wouldn't be smart enough to distribute the threads across processors, I'm guessing. Any input on this would be helpful. Thank you!

    (1) You need to use a profiler. This is the first step in any sort of optimization. The profiler does two important things: First, it tells you where the real bottlenecks are (which is usually not what you expect), and eliminates any doubt as to a certain section of code being 'slow' or 'fast'. Second, the profiler lets you compare results before and after. That way, you can check that your code changes actually increased performance, and by exactly how much.
    (2) Generally speaking, if there are 10 threads and 10 CPU's, then each thread runs concurrently on a different CPU.
    As per (1), the suggestion to use double buffering is likely the best way to go. When you think about what it takes to draw an image, 90% of it can be done in a worker thread. The geometry, creating Shapes, drawing them onto a graphics object, transformations and filters - all of that can be done offline. Only copying the buffered image onscreen is the 10% that needs to happen on the EDT thread. But again, use a profiler first.
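    A rough sketch of that split (assuming a panel that owns its own back-buffer; the names are illustrative, not taken from the poster's code):

    import java.awt.Graphics;
    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import javax.swing.JPanel;
    import javax.swing.SwingWorker;

    public class BackBufferPanel extends JPanel {
        private volatile BufferedImage backBuffer_;

        // Call on the EDT; the heavy drawing is pushed to a worker thread.
        public void renderAsync() {
            final int w = Math.max(1, getWidth());
            final int h = Math.max(1, getHeight());
            new SwingWorker<BufferedImage, Void>() {
                @Override
                protected BufferedImage doInBackground() {
                    BufferedImage img = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
                    Graphics2D g = img.createGraphics();
                    // ... expensive geometry, shapes, transforms and filters go here ...
                    g.dispose();
                    return img;
                }

                @Override
                protected void done() { // runs back on the EDT
                    try {
                        backBuffer_ = get();
                    } catch (Exception ignored) {
                    }
                    repaint(); // paintComponent() below only blits the finished image
                }
            }.execute();
        }

        @Override
        protected void paintComponent(Graphics g) {
            super.paintComponent(g);
            if (backBuffer_ != null) {
                g.drawImage(backBuffer_, 0, 0, null);
            }
        }
    }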

  • Cleanest multi-processor install?

    I recently purchased a dual Xeon 3.0Ghz Quad-core Dell T7400 Precision Workstation (used) with 32GB of RAM.
    I ran some testing before moving all of my software to the new machine, and it very quickly became clear that the system had some stability issues which I eventually attributed to the discovery that the 2 processors were not a matched pair (different builds and stepping levels).  My first thought was to remove the older one and use the system with a single processor before finding a match for that remaining processor and installing it.
    But then it occurred to me that I might also have to reinstall my OS (Windows 7 Ultimate), Creative Suite DP5 and all (or some) of my other software after installing that second processor, since both Windows and Photoshop check for the number of processors at installation and so will originally be set up to run on a single processor machine, not a dual.
    I like to run as clean a system as possible (NO unnecessary software, etc.), so besides the unpleasant prospect of the double investment of time (installation, uninstallation and then reinstallation of the OS and 10 -15 other applications), I'm also concerned about junk that might be left over, either in the registry or elsewhere, after all of that installing, uninstalling and reinstalling.
    So here is the question:  for optimal future use (and Photoshop CS5 is probably 80+% of what I'll do on this workstation) should I:
    Go ahead and install the OS, CS5 and other software and start to use the new (single processor) computer now and then just add the second processor when I find one at a decent price, or
    Keep using my current workstation until I have a matched pair of processors for the new machine.  Then, AFTER installing that matched pair, install the OS and application software and begin to use the new machine.
    I am anxious to switch to the new machine and jump from 4GB of RAM to 32GB, but if starting with a single processor and then adding a second one in a week or a month will cost me either stability or performance down the road, it's not worth it.
    Thoughts?  Should I start migrating to the new machine today, or should I wait until I have a matched pair of processors in place?

    Yes, my current workstation is a dual-processor Precision 690, and up until about 3 months ago it was also a rock-solid performer.  Then within a few week period I:
    upgraded from Adobe CSDP4 to CSDP5,
    replaced a failing video card with a new ATI Radeon 1GB PCI2.0 Express card,
    upgraded from my old Wacom (original style) tablet to an Intuos3,
    installed some miscellaneous disk management and backup software utilities.
    I've been having minor problems ever since.  For example, after about 75% of reboots or wakes from hibernation I have to manually reinstall the Wacom driver, and the skies in my files are now displaying with spectacular banding.  (Luckily the banding is limited to the display, not the files themselves.)  I'm pretty confident that it's nothing a clean reinstall of the OS and all of my main apps won't fix, but if I'm going to go through that, I decided I'd rather go through it on a T7400 than on my 690. (Wanted access to quad core processors running at 1600 FSB, faster memory, and PCI 2.0 Express video bus.)  A buddy of mine who does some consumer graphics might want the 690, and I have no qualms whatsoever about selling it to him.  Even at 6+ years old it will run rings around anything else he has in his shop.  And if he does not want it I'll put it on eBay.
    My last 4 workstations have all been Dell - a 620 followed by two 690s and now a T7400.  The current T7400 issue aside, since mismatched CPUs are not the machine's fault, I've never had a day of trouble with any one of them.  And once I get the CPU issue straightened out on the T7400 I expect it to be a VERY solid machine, too.

  • Multi Processor rendering & editing optimization

    Hello,
    Got some great info here from my last post!  So thanx
    So my question is...is it more ideal to use Xeon or multi processor board setups than single CPU chips?  My main aim is for rendering or workflow turnaround since in the next couple of months I might land a nice spot in the top five post companies in my sector...god willing.
    Therefore I plan to build a completely new system either focused on Multi CPU or buying multi crossfire Open CL rendering GPU cards.  I can't find any documentation or recommendations for improved performances in rendering speeds or realtime fx with a Multi CPU board or the NOW added multi Stream or CUDA GPU (SLI or CrossFire) rendering setups. 
    if anybody could explain any myths or facts as to how Premiere Pro takes advantage of CPUs or GPUs would be a big help.  I'm currently using a Quadro 2000 and an i7-990x with 24gb of memory.
    thnx

    Is your Quadro connected to a 10-bit monitor?
    If not, you would be better served by a faster GTX card
    Best Video Card http://forums.adobe.com/thread/1238382
    Also, view the results of the CS5 Benchmark http://ppbm5.com/ to see what is fast

  • Multi processor Solaris 2.6 and mutex locks

    hi,
    is anyone aware of any documented issues with Solaris 2.6 running
    on dual-SPARC processors (multi-processor environment) where the
    programs using "mutex locks" (multi-threaded applications), require
    some special handling, in terms of compiling and linking, to some
    special libraries.
    As far as I remember, in some OS book, maybe Peterson's, it was said
    that the mechanism for implementing mutex locks on multi-processor
    systems is to use low-level spin-locks. This brings down performance
    on a single-processor system, making the processor do a busy-wait,
    but it happens to be the only way of mutex locking in a multi-processor
    system. If this is so, then where is such behaviour documented in the case
    of Solaris 2.6?
    I have had problems with my applications crashing (rather, hanging up)
    in a vfork() on such a system, but the same application works fine, with
    100% reproducibility, on a single-processor system.
    thanks for any inputs or suggestions and/or information.
    regards,
    banibrata dutta

    I am also facing a similar problem, with an application written multi-threaded using POSIX mutexes. When I run it on a SINGLE processor machine, i.e.
    SunOS sund4 5.7 Generic_106541-11 sun4u sparc SUNW,Ultra-5_10
    It works perfectly.
    But when I try to run it on a dual processor machine, i.e.
    SunOS sund2 5.7 Generic_106541-11 sun4u sparc SUNW,Ultra-250
    It blocks in one of the mutexes.
    Please inform us what the problem is. Mr. B. Datta, if you come to know of
    anything through any channel, please inform me also at [email protected]
    Thanx & regards,
    -venkat

  • WLS on Multi-Processors

    A few questions about WLS 5.1 on multi-processor machines:
    1. Is there anything that needs to be done(other than purchase another
    license) for a weblogic server to work on a multi-processor system?
    2. Will WLS take advantage of all processors with just ONE invocation of WLS?
    Or will I have to run one instance of WLS for each processor?
    3. Will performance gains be uniform or will certain features gain more
    from multiple processors?
    Any answers, insights or pointers to answers are appreciated.
    Thanks.
    -Heng

    >
    I consider WebLogic to be a great no-nonsense J2EE implementation (not
    counting class loaders ;-).
    Look for major improvements in that area in version 6.0.
    Thanks,
    Michael
    Michael Girdley
    BEA Systems Inc
    "Cameron Purdy" <[email protected]> wrote in message
    news:[email protected]...
    Rob,
    I consider WebLogic to be a great no-nonsense J2EE implementation (not
    counting class loaders ;-). Gemstone's architecture is quite elaborate when
    compared to WebLogic, and BTW they spare no opportunity to compare to
    WebLogic although never by name. (Read their white paper on scalability to
    see what I mean.) I am quite impressed by their architecture; it appears to
    be set up for dynamic reconfiguration of many-tier processing. For example,
    where WL coalesces (i.e. pass by ref if possible), Gemstone will always
    distribute if possible, creating a "path" through (e.g.) 7 levels of JVMs
    (each level having a dynamic number of JVMs available in a pool) and if
    there is a problem at any level, the request gets re-routed (full failover
    apparently). I would say that they are set up quite well to solve the
    travelling salesperson problem ... you could probably implement a web-driven
    neural net on their architecture. (I've never used the Gemstone product,
    but I've read about everything that they published on it.) I would assume
    that for certain types of scaling problems, the Gemstone architecture would
    work very very well. I would also guess that there are latency issues and
    administration nightmares, but I've had the latter with every app server
    that I've ever used, so ... $.02.
    Cameron Purdy
    [email protected]
    http://www.tangosol.com
    WebLogic Consulting Available
    "Rob Woollen" <[email protected]> wrote in message
    news:[email protected]...
    Dimitri Rakitine wrote:
    Hrm. Gemstone reasons are somewhat different.
    I'm not a Gemstone expert, but I believe their architecture is quite
    different from a WebLogic cluster. Different architectures might have
    different trade-offs.
    However, out of curiosity, what are their reasons?
    Anyway, here is my question:
    why is running multiple instances of WL more efficient than running one
    with a high execute thread count?
    The usual reason is that most garbage collectors suspend all of the JVM
    threads. Using multiple WLS instances causes the pauses to be
    staggered. Newer Java VMs offer incremental collectors as an option, so
    this may no longer be as big of an issue.
    -- Rob
    >

  • Prefetch capability for single or multi processor

    Hi all
    I wonder whether SAM supports prefetch functionality for single or multi processors.
    I need that for a research topic.

    http://www.smartcrashreports.com/
    Log:
    com.unsanity.smartcrashreports Smart Crash Reports version 1.2.1 (1.2.1) /Users/ChrisFarley/Library/InputManagers/Smart Crash Reports/Smart Crash Reports.bundle/Contents/MacOS/Smart Crash Reports
    Well you have it installed, have a look at their site, and see if you find it usefull otherwise uninstall it.
    I don't know it either. (After uninstalling) unplug your router and restart your computer, check for update's > Apple menu>software update.
    When your devices are up and running select Safari and a window will appear, choose Network Diagnosis to set up a your wireless connection.

  • Multi Processor environments

    Just a quick question for ye, no detail really required at this time.
    Is it possible to run a multi-threaded Java application in a multi-processor environment, where the processors would share the work between them?
    I suppose basically what I am asking is: does the JVM have the capability to run multiple threads across multiple processors?
    If the answer to this question is yes, how difficult is it to set up, and are there any special requirements or extensions required?
    Cheers

    In most current JVMs, the facilities of the OS for spreading execution across the system's CPUs are provided to the Java programmer quite transparently. Avoid 'green threads' and you are good as gold.
    On Solaris 7 and earlier, it gets tricky to convince the two-level kernel threading mechanism to really use all CPUs (or even more than one). The Sun JVMs for Solaris provide 'unsupported' but often suggested command line options to use 'bound threads' to solve this problem, and Java 1.3.0 and 1.3.1 do a much better job. I've run WLS on 10-way 4500 systems and was able to consume about 70% of the 10 CPUs with productive work - the principal limitation was our ability to apply load to the system.
    In Solaris 8, there is an alternate thread library that a process can choose to use (just put it in the front of the LD_LIBRARY_PATH). This causes that process to use a one-level (or flat) Solaris thread model which can improve performance (and will definitely increase processor utilization) on larger SMP systems.
    Chuck
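    As a quick sanity check of the above (a minimal sketch, nothing platform-specific assumed): start one busy thread per reported processor and watch whether all CPUs load up.

    public class SpreadCheck {
        public static void main(String[] args) throws Exception {
            int cpus = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[cpus];
            for (int i = 0; i < cpus; ++i) {
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        long x = 0;
                        for (long j = 0; j < 500L * 1000 * 1000; ++j) {
                            x += j ^ (x >>> 3); // pure CPU work, no synchronization
                        }
                        System.out.println(Thread.currentThread().getName() + " -> " + x);
                    }
                }, "worker-" + i);
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }
    }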

  • Big Performance Degradation in LabVIEW 2012

    Hi all,
    I was expecting a performance increase upgrading to LV2012, as usual; unfortunately, it seems performance has degraded by more than 50% on a simple benchmark I created for the purpose.
    For the time being, I will stick with LV2011 due to this.
    LV2011 running on a sbRIO9606 (steady around 3800 dereferencing/referencings per second):
    LV2012 on the same computer (at about 1300 dereferencing/referencings per second):
    Any takes on the issue? 
    Source attached.
    Br,
    /Roger
    Attachments:
    PerformanceLV2011.zip (406 KB)

    Ben wrote:
    RogerIsaksson wrote:
    "Check for button presses 4 billion times a second".
    So what? It's a benchmark program. It's not intended for any practical use besides showing off the performance degradation that I experience in the newer version of labview.
    "race conditions between locals being a prime example"
    Did you cut'n paste that nonsense from the internetz?
    "Maybe those loops get higher priority in the new compiler"
    What does priority have to do with execution performance?
    You are clearly not understanding the issue here.
    Br,
    /Roger
    Could you please post images of the benchmarking code?
    The machine I use for the forums does not have a modern version of LV so I can only look at pictures.
    There is a chance I may be able to explain your observations.
    No promises!
    Curious,
    Ben
    I wonder why you don't have LV2011. (Oh, you have a personal copy back home?)
    The best solution is the one you find by yourself.
