Niagara II Floating Point Operations

Niagara II has much improved floating-point performance over Niagara I. However, I'm wondering whether the performance of floating-point-intensive threads could be improved further by amalgamating the floating-point processing from each core into a dedicated unit that executes instructions from all threads in parallel, so that if a single core experiences a floating-point-intensive load, the aggregate floating-point load across the entire processor is better optimized?

pfirmst wrote:
Niagara II has much improved floating-point performance over Niagara I. However, I'm wondering whether the performance of floating-point-intensive threads could be improved further by amalgamating the floating-point processing from each core into a dedicated unit that executes instructions from all threads in parallel, so that if a single core experiences a floating-point-intensive load, the aggregate floating-point load across the entire processor is better optimized?

I'm not a hardware designer, but I think there are two problems with this approach.
a. There would probably need to be extra interconnect and arbitration to get from the issue pipeline to the shared floating-point units and back again with the result. This would tend to add latency to floating-point instruction execution and would probably be bad for performance. For comparison, the floating-point latency on T1 (which shares a single FPU among all cores) is about 26 cycles, while on T2 (with an FPU per core) it's about 6 cycles.
b. In order to get more floating-point performance from a single thread, the issue logic would also need to be changed to issue more floating-point instructions in a single cycle (i.e. superscalar issue). This would be good for single-thread performance, but would require more complexity and space, and might reduce the number of cores/threads that can fit on a single chip. The corollary is that since each T2 core can issue at most two floating-point instructions per cycle (one from each of two threads), each core could make use of at most two floating-point units.
On CMT chips, sharing is good because it leads to higher efficiency and utilization, but too much sharing can also hurt performance. There needs to be balance in the design.
Peter.

Similar Messages

  • Qosmio F50 - Webcam error - Invalid floating point operation

    Hi
I have a Qosmio F50 (4 weeks old). The problem is that after two weeks the webcam stopped working with *Invalid floating point operation*.
I have to close the program via Task Manager: *Camera Assistant Software not responding*.
I have tried the following:
Updated all drivers, system restore.
The Toshiba help line, after all other efforts, suggested a complete reinstall. That is too aggressive for me, as it would lose certain software on the laptop that I had to transfer from my old PC after pleading with certain software companies (only one license etc.).
Has anybody else had this problem, and is there a simple solution?
    Will

    Hi,
I have the same problem. My Qosmio F50 laptop is 4 weeks old and now the webcam has stopped working... it's quite annoying; whenever I start the application it says 'Invalid floating point operation'. To me this sounds like an issue with the software and its compatibility with Vista. I think it can happen after putting the laptop into hibernation or suspend: when it comes back, the webcam stops working. So let's see if Toshiba releases a fix or a new version quickly; otherwise I would be quite disappointed with such a good laptop.

  • Error - "Invalid floating point operation"

I keep getting a message that says "invalid floating point operation", and the files that I am trying to read are scrambled. They used to be fine. Please advise....


  • Question in floating point operation

    Hi,
I have a question about Java floating-point operations.
public class test
{
     public static void main(String args[])
     {
          double d1 = 243.35;
          double d2 = 2.3;
          System.out.println(d1 * d2);
          System.out.println((float)d1 * (float)d2);
     }
}
    The result is,
    java version "1.4.1_02"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1_02-b06)
    Java HotSpot(TM) Client VM (build 1.4.1_02-b06, mixed mode)
    5.597049999999999E8
    5.5970502E8
Though the multiplication does not produce an irrational number like 1/3, the result of the first statement is not accurate enough. In our project this multiplication involves money, and we cannot ignore the error.
Can anyone explain why this is happening? Do I need to convert all the numbers to float to avoid this, or is it a bug?
    ~ Sathiya Dhanapal.

The underlying problem is that not all numbers can be represented exactly in a floating point representation. But if you perform all calculations using doubles and then round to two fractional digits at the end, you should get a "correct" result, UNLESS you have used ill-conditioned formulas that introduce other kinds of arithmetic errors.
There's another way around this when it comes to counting money, and that's to use integers (long or int). You convert every amount to the lowest monetary unit (like a cent or whatever). Every money amount can now be represented exactly, but you still have to be careful because the rounding problem is still there (what do you do with the last cent when you split 100 cents three ways?).
In your example, the "more correct" result you got from using floats instead of doubles is only an illusion. The result has been implicitly rounded because fewer bits were used. If you round the double result to the same precision as the float result, they're the same.
    The important lesson in all this is TO KNOW WHEN TO ROUND.
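To make the two suggestions concrete, here is a minimal Java sketch (the class name and sample figures are mine, not from the thread): compute in doubles and round once at the end, or hold money as long cents so every amount is exact and the leftover-cent policy is explicit.
public class MoneyExample
{
     public static void main(String args[])
     {
          // 1) Compute in double; round once, at the end, to two fractional digits.
          double product = 243.35 * 2.3;                      // not exactly 559.705
          double rounded = Math.round(product * 100.0) / 100.0;
          System.out.println(rounded);                        // 559.7 after rounding
          // 2) Hold money as long cents: exact integer arithmetic.
          long cents = 100;                                   // 1.00 represented as cents
          long share = cents / 3;                             // 33 cents each
          long leftover = cents % 3;                          // 1 cent left; assign by policy
          System.out.println(share + " " + leftover);         // 33 1
     }
}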

  • Floating point operations are slower when small values are used?

I have the following simple program that multiplies two different floating point numbers many times. As you can see, one of the numbers is very small. When I measured the time of both multiplications, I was surprised that the one with the tiny number takes much longer. It seems that working with very small doubles is slower... Does anyone know what is happening?
public static void main(String[] args) throws Exception
{
        long iterations = 10000000;
        double result;
        double number = 0.1D;
        double numberA = Double.MIN_VALUE;
        double numberB = 0.0008D;
        long startTime, endTime, elapsedTime;
        //Multiply numberA
        startTime = System.currentTimeMillis();
        for(int i=0; i < iterations; i++)
            result = number * numberA;
        endTime = System.currentTimeMillis();
        elapsedTime = endTime - startTime;
        System.out.println("Number A) Time elapsed: " + elapsedTime + " ms");
        //Multiply numberB
        startTime = System.currentTimeMillis();
        for(int i=0; i < iterations; i++)
            result = number * numberB;
        endTime = System.currentTimeMillis();
        elapsedTime = endTime - startTime;
        System.out.println("Number B) Time elapsed: " + elapsedTime + " ms");
}
Result:
    Number A) Time elapsed: 3546 ms
    Number B) Time elapsed: 110 ms
    Thanks,
    Diego

Verrry interrresting... After a few tweaks (sum & print the multiplication result to prevent HotSpot from removing the entire loop, move the work into one method to avoid code-alignment effects and such, and loop so HotSpot compiles everything; code below),
I find that "java -server" gives the same times for both the small and the big value, whereas "java -Xint" and "java -client" exhibit the asymmetry. So should I conclude that my CPU's floating point unit treats both values the same, but the client/server compilers do something ...what?
    (You may need to add or remove a zero in "iterations" so that you get sane times with -client and -server.)
public class t
{
    public static void main(String[] args)
    {
        for (int n = 0; n < 10; n++) {
            doit(Double.MIN_VALUE);
            doit(0.0008D);
        }
    }
    static void doit(double x)
    {
        long iterations = 100000000;
        double result = 0;
        double number = 0.1D;
        long start = System.currentTimeMillis();
        for (int i=0; i < iterations; i++)
            result += number * x;
        long end = System.currentTimeMillis();
        System.out.println("time for " + x + ": " + (end - start) + " ms, result " + result);
    }
}

  • Floating point operations....

We ran into a serious calculation problem; to give you a few examples:
trace(1.3456-1.3454) // gives 0.00019999999999997797 but should clearly be 0.0002
trace(1.3456*1.3456) // gives 1.8106393599999997 but should be 1.81063936
    Any idea on how to resolve the problem?
    Thanks,
    Dan

This is just a result of the limitations of accuracy in the binary representation of decimal numbers.
http://kb.adobe.com/selfservice/viewContent.do?externalId=tn_13989&sliceId=1
There's some discussion about AS3 in the link below, but the general principles are the same for AS2, even though the same code (apparently) might give different results in AS2.
http://www.kirupa.com/forum/archive/index.php/t-247416.html
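A common workaround, sketched here in Java (the nearlyEqual helper name is mine, and the same idea carries over to ActionScript), is to compare against an application-chosen tolerance instead of testing for exact equality, or to round to a fixed number of decimals before display:
public class ToleranceExample
{
    // Hypothetical helper: treat a and b as equal when they differ
    // by less than an application-chosen tolerance.
    static boolean nearlyEqual(double a, double b, double epsilon)
    {
        return Math.abs(a - b) < epsilon;
    }
    public static void main(String[] args)
    {
        double diff = 1.3456 - 1.3454;                        // 1.9999999999997797E-4
        System.out.println(diff == 0.0002);                   // false: exact comparison fails
        System.out.println(nearlyEqual(diff, 0.0002, 1e-12)); // true
        // Rounding to 4 decimals before display also hides the representation error.
        System.out.println(Math.round(diff * 10000.0) / 10000.0); // 2.0E-4
    }
}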

  • Cannot get Oracle 10g to start on a G5.  Floating point exception

After a very painful 10g (EE) installation process, i.e. fixing all of the following:
    1) Created the missing /opt directory
    2) Installation of XCode 1.2
    3) Fixing the root.sh file
    4) Downloaded the crstl file provided by Ron
    5) Copied /etc/oratab/oratab to /etc/oratab
I tried bringing up the Oracle 10g instance by logging on to SQL*Plus as sysdba and running
startup nomount pfile='/Users/oracle/admin/db01/scripts/init.ora'. The instance comes up for a few seconds and crashes. This is what I get in the alert.log:
    ==========================================================
    Sat Jul 17 11:40:08 2004
    Starting ORACLE instance (normal)
    LICENSE_MAX_SESSION = 0
    LICENSE_SESSIONS_WARNING = 0
    Picked latch-free SCN scheme 2
    KCCDEBUG_LEVEL = 0
    Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST
    Autotune of undo retention is turned on.
    Dynamic strands is set to TRUE
    Running with 2 shared and 18 private strand(s). Zero-copy redo is FALSE
    IMODE=BR
    ILAT =18
    LICENSE_MAX_USERS = 0
    SYS auditing is disabled
    Starting up ORACLE RDBMS Version: 10.1.0.3.0.
    System parameters with non-default values:
    processes = 150
    sga_target = 146800640
    control_files = /Users/oracle/oradata/db01/control01.ctl, /Users/oracle/oradata/db01/control02.ctl, /Users/oracle/oradata/db01/control03.ctl
    db_block_size = 8192
    compatible = 10.1.0.2.0
    db_file_multiblock_read_count= 16
    db_recovery_file_dest = /Users/oracle/flash_recovery_area
    db_recovery_file_dest_size= 2147483648
    undo_management = AUTO
    undo_tablespace = UNDOTBS1
    remote_login_passwordfile= EXCLUSIVE
    db_domain =
    dispatchers = (PROTOCOL=TCP) (SERVICE=db01XDB)
    job_queue_processes = 10
    background_dump_dest = /Users/oracle/admin/db01/bdump
    user_dump_dest = /Users/oracle/admin/db01/udump
    core_dump_dest = /Users/oracle/admin/db01/cdump
    db_name = db01
    open_cursors = 300
    pga_aggregate_target = 16777216
    PMON started with pid=2, OS id=4037
    MMAN started with pid=3, OS id=4039
    DBW0 started with pid=4, OS id=4041
    LGWR started with pid=5, OS id=4043
    CKPT started with pid=6, OS id=4045
    SMON started with pid=7, OS id=4047
    RECO started with pid=8, OS id=4049
    Sat Jul 17 11:40:16 2004
    starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
    CJQ0 started with pid=9, OS id=4051
    Sat Jul 17 11:40:16 2004
    starting up 1 shared server(s) ...
    Sat Jul 17 11:40:18 2004
    Errors in file /Users/oracle/admin/db01/bdump/db01_ckpt_4045.trc:
    ORA-07445: exception encountered: core dump [semop+8] [SIGFPE] [Invalid floating point operation] [0xA0004CE4] [] []
    Sat Jul 17 11:40:19 2004
    Errors in file /Users/oracle/admin/db01/bdump/db01_mman_4039.trc:
    ORA-07445: exception encountered: core dump [semop+8] [SIGFPE] [Invalid floating point operation] [0x41EDB3C] [] []
    Sat Jul 17 11:40:21 2004
    Errors in file /Users/oracle/admin/db01/bdump/db01_pmon_4037.trc:
    ORA-00822: MMAN process terminated with error
    Sat Jul 17 11:40:21 2004
    PMON: terminating instance due to error 822
    Instance terminated by PMON, pid = 4037
    ==========================================================
Any idea what needs to be done to fix this error? I remember that I had the very same issue with the Oracle 9i R2 Developer release.
Any help will be greatly appreciated.

After a very painful 10g (EE) installation process i.e fixing all the following: <snip>
Sat Jul 17 11:40:19 2004
Errors in file /Users/oracle/admin/db01/bdump/db01_mman_4039.trc:
ORA-07445: exception encountered: core dump [semop+8] [SIGFPE] [Invalid floating point operation] [0x41EDB3C] [] []
Sat Jul 17 11:40:21 2004
Errors in file /Users/oracle/admin/db01/bdump/db01_pmon_4037.trc:
ORA-00822: MMAN process terminated with error
Sat Jul 17 11:40:21 2004
PMON: terminating instance due to error 822
Instance terminated by PMON, pid = 4037
Any idea on what needs to be done to fix this error. I remember that i had the very same issue with the Oracle 9i R2 Developers release. Any help will be greatly appreciated. <snip>

You mentioned the 9iR2 release. Do you still have any reference to the 9iR2 software in your environment? With a little luck you have, and in that case it is not so hard to find a solution...
Ronald.
http://homepage.mac.com/ik_zelf/oracle

  • ERROR: floating-point constants should not appear

    ERROR: floating-point constants should not appear
    Error preverifying class KGUI.KLabel
    com.sun.kvem.ktools.ExecutionException: Preverifier returned 1
    Build failed
    I get this error after adding this code to my application:
private int C=0;
private float D=0.0f;
private int H=0;
private float store=0.0f;
private int pos=0;
private int SH=0;
..................some code.....................
H = 1 + ROWS*(f1.getHeight());
if(H/kawalki.length > 0){
     C = H/kawalki.length;
     D = kawalki.length/(H%kawalki.length);
     SH = C;
     if(D >= 0.5f)
          SH++;
}else{
     SH = 1;
     D = kawalki.length/(H%kawalki.length);
}
.................some code..............
protected void keyPressed(int keyCode){
     int game = getGameAction(keyCode);
     switch(game){
     case UP:
          if(line > 0){
               line--;
               pos -= C;
               store -= D;
               if(store < 0.0f){
                    pos--;
                    store = 1.0f + store;
               }
               repaint();
          }
          break;
     case DOWN:
          if(line + ROWS < kawalki.length){
               line++;
               pos += C;
               store += D;
               if(store >= 1.0f){
                    pos++;
                    store--;
               }
               repaint();
          }
          break;
     }
}
    Can anybody help me quick?

If the platform is CLDC 1.1 you can have floats. Run preverify to see the options. The CLDC 1.1 preverifier has options for rejecting floats/doubles, but the default seems to be to allow them.
    Usage: preverify [options] classnames|dirnames ...
    where options include:
    -classpath <directories separated by ';'>
    Directories in which to look for classes
    -d <directory> Directory in which output is written (default is ./output/)
    -cldc1.0 Checks for existence of language features prohibited
    by CLDC 1.0 (native methods, floating point and finalizers)
    -nofinalize No finalizers allowed
    -nonative No native methods allowed
    -nofp No floating point operations allowed
    @<filename> Read command line arguments from a text file
    Command line arguments must all be on a single line
    Directory names must be enclosed in double quotes (")
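As an illustration (hypothetical invocations built only from the usage text above; the directory names are mine), you could reproduce the CLDC 1.0 check, or forbid floating point explicitly, like this:
preverify -cldc1.0 -d ./output ./classes
preverify -nofp -d ./output ./classes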

  • Floating point multiplication

    hello everybody!
I use OpenSPARC T1. For floating-point multiplication, where are the upper 64 bits (bits 64 to 128) computed and stored? ...in the FPU, or does it use the SPU?
    thanx in advance

Hi,
According to the OpenSPARC T1 micro-architecture specification (page 204), the FPU includes three independent execution pipelines:
Floating-point adder (FPA) – adds, subtracts, compares, conversions
Floating-point multiplier (FPM) – multiplies
Floating-point divider (FPD) – divides
However, keep in mind that all the registers for the floating-point operations are kept in the cores.
This is what the specs (page 31) say about the SPU: "Stream processing unit (SPU) is used for modular arithmetic functions for crypto."

  • Using Floating Point

I wrote a mobile Java program that uses floating-point operations,
but when I install it on the mobile as a JAR file,
the phone refuses it and gives the message: "No supported floating point".
So what is the solution to this problem?

Recently I had to do a project where the main task was to port a J2SE application to J2ME. The J2SE application was full of floating-point operations, and as far as I know J2ME (CLDC 1.0) does not support floating-point operations due to low memory availability. So I converted all the floating-point operations to fixed point. I found no other way to execute floating-point operations :-(
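For readers curious what such a conversion looks like, here is a minimal 16.16 fixed-point sketch in Java (the class name, method names, and scale are illustrative assumptions, not the poster's actual code):
public class Fixed
{
    // 16.16 fixed point: the low 16 bits hold the fraction.
    static final int ONE = 1 << 16;
    static int toFixed(int whole) { return whole << 16; }
    // Widen to long so the intermediate product/quotient does not overflow.
    static int mul(int a, int b) { return (int)(((long)a * b) >> 16); }
    static int div(int a, int b) { return (int)(((long)a << 16) / b); }
    public static void main(String[] args)
    {
        int half = ONE / 2;                  // 0.5
        int three = toFixed(3);              // 3.0
        int result = mul(three, half);       // 1.5 in 16.16 form
        // Display: integer part, then the fraction scaled to four decimal digits.
        System.out.println((result >> 16) + "." + ((result & 0xFFFF) * 10000L / ONE));
    }
}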

  • Precision operations (floating point) on FPGA in LabVIEW 2011

    Dear Experts....
For my application I have to perform a demodulation operation on an FPGA. I want to store an array of double-precision numbers, but when I try to perform any double-precision operation I get the error "Wire: Type not supported in current target". From the forums I learned that in LabVIEW 2011 I cannot use double-precision operations on an FPGA. What is the alternative? Please help me with this; my work has been delayed because of this problem.
Thanks... Kindly guide...

Dear Mathan, thanks for your reply... I have already gone through the link you sent, but for my application I need an array of floating-point numbers; I cannot use integers. I have attached my VI. I have to mix a 20 MHz signal with sine and cosine signals to achieve demodulation, so for the sine and cosine values I need floating point. Is there any way to overcome this problem?
    Attachments:
fpga.vi 30 KB

  • Inline functions in C, gcc optimization and floating point arithmetic issues

For several days I have really become a fan of Alchemy. But after intensive testing I have found several issues which I'd like to solve, but I can't without some help.
So... I'm porting an old game-console emulator I wrote in ANSI C. The code works on both gcc and Visual Studio without any modification or cross-compile macros. The only platform-specific code is the audio and video output, which is out of scope because I have ported audio and video within AS3.
    Here are the issues:
1. Inline functions - Having even a single inline function makes the code work incorrectly (although not crash), whether optimization is enabled or not (-O0 or -O3). My current workaround is converting the inline functions to macros, which achieves the same effect. Any ideas why inline functions break the code?
2. Compiler optimizations - well, my project consists of many C files, one of which is called flash.c; it contains the main and exported functions. I build the project as follows:
    gcc -c flash.c -O0 -o flash.o     //Please note the -O0 option!!!
    gcc -c file1.c -O3 -o file1.o
    gcc -c file2.c -O3 -o file2.o
    ... and so on
    gcc *.o -swc -O0 -o emu.swc   //Please note the -O0 option again!!!
    mxmlc.exe -library-path+=emu.swc --target-player=10.0.0 Emu.as
for file in $( ls *.o )   # Removes the obj files
    do
        rm $file
    done
If I use any option other than -O0 in "gcc -c flash.c -O0 -o flash.o", the program stops working correctly, exactly as with the inline functions (but still does not crash or print any errors in debug). flash.c has 4 static functions to be exported to AS3, plus the main function. Do you know why?
If I use any option other than -O0 in "gcc *.o -swc -O0 -o emu.swc", the program stops working correctly exactly as above; yet if I specify -O1, -O2 or -O3 the SWC file gets smaller, up to 2x for -O3. Why? Is there a method to optimize all the obj files except flash.o? I suspect an issue similar to the one when compiling it.
3. Floating point issues - this is the worst one. My code is mainly based on integer arithmetic, but in 1-2 places it requires floating-point arithmetic. One of them is the conversion of the 16-bit 44.1 kHz sound buffer to a float buffer with the same sample rate but with samples in the range from -1.0 to 1.0.
    My code:
void audio_prepare_as()
{
    uint32 i;
    for(i = 0; i < audioSamples; i += 2)
    {
        audiobuffer[i] = (float)snd.buffer[i] / 32768;
        audiobuffer[i+1] = (float)snd.buffer[i+1] / 32768;
    }
}
My audio playback works perfectly, but not when using the above conversion, and I have inspected the float numbers - all incorrect and invalid. I tried other code with simple floats - same story. It is as if Alchemy refuses to work with floats. What is wrong? I have another place where I must resize the framebuffer, and there I have a float involved - same problem. Please help me.
Found the floating-point problem: audiobuffer is written to a ByteArray and then used in AS. But C floats are obviously not stored the same way as those in AS3. Now the floating point is resolved.
The optimization issues remain! I really need to speed up my code.
Thank you in advance!
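A plausible explanation for the ByteArray mismatch (my assumption; the poster doesn't say which it was) is byte order: if I recall correctly, AS3's ByteArray defaults to big-endian, while C floats on x86 are little-endian. This Java sketch shows how the same four bytes decode to very different floats depending on byte order:
public class FloatBytes
{
    public static void main(String[] args)
    {
        int bits = Float.floatToIntBits(0.5f);             // IEEE 754 single-precision bits
        int swapped = Integer.reverseBytes(bits);          // same bytes, opposite order
        System.out.println(Float.intBitsToFloat(bits));    // 0.5
        System.out.println(Float.intBitsToFloat(swapped)); // a tiny subnormal, nowhere near 0.5
    }
}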

    Dear Bernd,
I am still unable to turn on the optimizations together with the inline functions. None of the inline functions contain any stdlib call, just pure assignments, reads, simple arithmetic and bitwise operations.
In fact, the file containing the main function and the functions exported to AS3 did have memset and memcpy. I tried your suggestion and put the code above the functions calling memset and memcpy. It did not work, so I put the code in a header which is included topmost in each C file. The only system header I use is malloc.h, and it is included topmost. In another C file I use pow, sin and log10 from math.h, but I removed it and did the same thing:
    //shared.h
    #ifndef _SHARED_H_
    #define _SHARED_H_
    #include <malloc.h>
static void * custom_memmove( void * destination, const void * source, unsigned int num ) {
  void *result;
  __asm__("%0 memmove(%1, %2, %3)\n" : "=r"(result) : "r"(destination), "r"(source), "r"(num));
  return result;
}
static void * custom_memcpy ( void * destination, const void * source, unsigned int num ) {
  void *result;
  __asm__("%0 memcpy(%1, %2, %3)\n" : "=r"(result) : "r"(destination), "r"(source), "r"(num));
  return result;
}
static void * custom_memset ( void * ptr, int value, unsigned int num ) {
  void *result;
  __asm__("%0 memset(%1, %2, %3)\n" : "=r"(result) : "r"(ptr), "r"(value), "r"(num));
  return result;
}
static float custom_pow(float x, int y) {
  float result;
  __asm__("%0 pow(%1, %2)\n" : "=r"(result) : "r"(x), "r"(y));
  return result;
}
static double custom_sin(double x) {
  double result;
  __asm__("%0 sin(%1)\n" : "=r"(result) : "r"(x));
  return result;
}
static double custom_log10(double x) {
  double result;
  __asm__("%0 log10(%1)\n" : "=r"(result) : "r"(x));
  return result;
}
#define memmove custom_memmove
#define memcpy custom_memcpy
#define memset custom_memset
#define pow custom_pow
#define sin custom_sin
#define log10 custom_log10
    #include "types.h"
    #include "macros.h"
    #include "m68k.h"
    #include "z80.h"
    #include "genesis.h"
    #include "vdp.h"
    #include "render.h"
    #include "mem68k.h"
    #include "memz80.h"
    #include "membnk.h"
    #include "memvdp.h"
    #include "system.h"
    #include "loadrom.h"
    #include "input.h"
    #include "io.h"
    #include "sound.h"
    #include "fm.h"
    #include "sn76496.h" 
    #endif /* _SHARED_H_ */ 
It still behaves the same way, as if nothing was changed (works incorrectly - displays a jerky image which does not move, although the image is supposed to move).
As I am porting an emulator (Sega Mega Drive), I use many arrays of function pointers for implementing the opcodes of the CPUs. Could this be an issue?
I did a workaround for the floating-point problem, but processing is very slow, so I hear only bzzt bzzt - but that is out of scope for now. The emulator compiled with gcc runs at 300 fps on a 1.3 GHz machine, whereas my non-optimized AVM2 code compiled by Alchemy produces 14 fps. The pure rendering is super fast; the problem lies in the computational power of the AVM. The framebuffer and the emulation are generated in the C code, and only the pixels are copied to AS3, where they are plotted in a BitmapData. On a 2.0 GHz dual core I achieved only 21 fps. The goal is 60 fps to have smooth audio and video, but that is off-topic. After all, everything works (slowly) without optimization, and I would like to somehow turn it on. Suggestions?
    Here is the file with the main function:
    #include "shared.h"
    #include "AS3.h"
    #define FRAMEBUFFER_LENGTH    (320*240*4)
    static uint8* framebuffer;
    static uint32  audioSamples;
AS3_Val sega_rom(void* self, AS3_Val args)
{
    int size, offset, i;
    uint8 hardware;
    uint8 country;
    uint8 header[0x200];
    uint8 *ptr;
    AS3_Val length;
    AS3_Val ba;
    AS3_ArrayValue(args, "AS3ValType", &ba);
    country = 0;
    offset = 0;
    length = AS3_GetS(ba, "length");
    size = AS3_IntValue(length);
    ptr = (uint8*)malloc(size);
    AS3_SetS(ba, "position", AS3_Int(0));
    AS3_ByteArray_readBytes(ptr, ba, size);
    //FILE* f = fopen("boris_dump.bin", "wb");
    //fwrite(ptr, size, 1, f);
    //fclose(f);
    if((size / 512) & 1)
    {
        size -= 512;
        offset += 512;
        memcpy(header, ptr, 512);
        for(i = 0; i < (size / 0x4000); i += 1)
            deinterleave_block(ptr + offset + (i * 0x4000));
    }
    memset(cart_rom, 0, 0x400000);
    if(size > 0x400000) size = 0x400000;
    memcpy(cart_rom, ptr + offset, size);
    /* Free allocated file data */
    free(ptr);
    hardware = 0;
    for (i = 0x1f0; i < 0x1ff; i++)
    {
        switch (cart_rom[i]) {
        case 'U':
            hardware |= 4;
            break;
        case 'J':
            hardware |= 1;
            break;
        case 'E':
            hardware |= 8;
            break;
        }
    }
    if (cart_rom[0x1f0] >= '1' && cart_rom[0x1f0] <= '9') {
        hardware = cart_rom[0x1f0] - '0';
    } else if (cart_rom[0x1f0] >= 'A' && cart_rom[0x1f0] <= 'F') {
        hardware = cart_rom[0x1f0] - 'A' + 10;
    }
    if (country) hardware = country; //simple autodetect override
    //From PicoDrive
    if (hardware & 8) {
        hw = 0xc0; vdp_pal = 1;
    } // Europe
    else if (hardware & 4) {
        hw = 0x80; vdp_pal = 0;
    } // USA
    else if (hardware & 2) {
        hw = 0x40; vdp_pal = 1;
    } // Japan PAL
    else if (hardware & 1) {
        hw = 0x00; vdp_pal = 0;
    } // Japan NTSC
    else {
        hw = 0x80; // USA
    }
    if (vdp_pal) {
        vdp_rate = 50;
        lines_per_frame = 312;
    } else {
        vdp_rate = 60;
        lines_per_frame = 262;
    }
    /*SRAM*/
    if(cart_rom[0x1b1] == 'A' && cart_rom[0x1b0] == 'R')
    {
        save_start = cart_rom[0x1b4] << 24 | cart_rom[0x1b5] << 16 |
            cart_rom[0x1b6] << 8  | cart_rom[0x1b7] << 0;
        save_len = cart_rom[0x1b8] << 24 | cart_rom[0x1b9] << 16 |
            cart_rom[0x1ba] << 8  | cart_rom[0x1bb] << 0;
        // Make sure start is even, end is odd, for alignment
        // A ROM that I came across had the start and end bytes of
        // the save ram the same and wouldn't work.  Fix this as seen
        // fit, I know it could probably use some work. [PKH]
        if(save_start != save_len)
        {
            if(save_start & 1) --save_start;
            if(!(save_len & 1)) ++save_len;
            save_len -= (save_start - 1);
            saveram = (unsigned char*)malloc(save_len);
            // If save RAM does not overlap main ROM, set it active by default since
            // a few games can't manage to properly switch it on/off.
            if(save_start >= (unsigned)size)
                save_active = 1;
        }
        else
        {
            save_start = save_len = 0;
            saveram = NULL;
        }
    }
    else
    {
        save_start = save_len = 0;
        saveram = NULL;
    }
    return AS3_Int(0);
}
AS3_Val sega_init(void* self, AS3_Val args)
{
    system_init();
    audioSamples = (44100 / vdp_rate) * 2;
    framebuffer = (uint8*)malloc(FRAMEBUFFER_LENGTH);
    return AS3_Int(vdp_rate);
}
AS3_Val sega_reset(void* self, AS3_Val args)
{
    system_reset();
    return AS3_Int(0);
}
AS3_Val sega_frame(void* self, AS3_Val args)
{
    uint32 width;
    uint32 height;
    uint32 x, y;
    uint32 di, si, r;
    uint16 p;
    AS3_Val fb_ba;
    AS3_ArrayValue(args, "AS3ValType", &fb_ba);
    system_frame(0);
    AS3_SetS(fb_ba, "position", AS3_Int(0));
    width = (reg[12] & 1) ? 320 : 256;
    height = (reg[1] & 8) ? 240 : 224;
    for(y = 0; y < 240; y++)
    {
        for(x = 0; x < 320; x++)
        {
            di = 1280*y + (x << 2); /* parentheses added: "+" binds tighter than "<<" */
            si = (y << 10) + ((x + bitmap.viewport.x) << 1);
            p = *((uint16*)(bitmap.data + si));
            framebuffer[di + 3] = (uint8)((p & 0x1f) << 3);
            framebuffer[di + 2] = (uint8)(((p >> 5) & 0x1f) << 3);
            framebuffer[di + 1] = (uint8)(((p >> 10) & 0x1f) << 3);
        }
    }
    AS3_ByteArray_writeBytes(fb_ba, framebuffer, FRAMEBUFFER_LENGTH);
    AS3_SetS(fb_ba, "position", AS3_Int(0));
    r = (width << 16) | height;
    return AS3_Int(r);
}
AS3_Val sega_audio(void* self, AS3_Val args)
{
    AS3_Val ab_ba;
    AS3_ArrayValue(args, "AS3ValType", &ab_ba);
    AS3_SetS(ab_ba, "position", AS3_Int(0));
    AS3_ByteArray_writeBytes(ab_ba, snd.buffer, audioSamples * sizeof(int16));
    AS3_SetS(ab_ba, "position", AS3_Int(0));
    return AS3_Int(0);
}
int main()
{
    AS3_Val romMethod = AS3_Function(NULL, sega_rom);
    AS3_Val initMethod = AS3_Function(NULL, sega_init);
    AS3_Val resetMethod = AS3_Function(NULL, sega_reset);
    AS3_Val frameMethod = AS3_Function(NULL, sega_frame);
    AS3_Val audioMethod = AS3_Function(NULL, sega_audio);
    // construct an object that holds references to the functions
    AS3_Val result = AS3_Object("sega_rom: AS3ValType, sega_init: AS3ValType, sega_reset: AS3ValType, sega_frame: AS3ValType, sega_audio: AS3ValType",
        romMethod, initMethod, resetMethod, frameMethod, audioMethod);
    // Release
    AS3_Release(romMethod);
    AS3_Release(initMethod);
    AS3_Release(resetMethod);
    AS3_Release(frameMethod);
    AS3_Release(audioMethod);
    // notify that we initialized -- THIS DOES NOT RETURN!
    AS3_LibInit(result);
    // should never get here!
    return 0;
}

  • 128-bit floating point numbers on new AMD quad-core Barcelona?

    There's quite a lot of buzz over at Slashdot about the new AMD quad core chips, announced yesterday:
    http://hardware.slashdot.org/article.pl?sid=07/02/10/0554208
    Much of the excitement is over the "new vector math unit referred to as SSE128", which is integrated into each [?!?] core; Tom Yager, of Infoworld, talks about it here:
    Quad-core Opteron? Nope. Barcelona is the completely redesigned x86, and it’s brilliant
    Now here's my question - does anyone know what the inputs and the outputs of this coprocessor look like? Can it perform arithmetic [or, God forbid, trigonometric] operations [in hardware] on 128-bit quad precision floats? And, if so, will LabVIEW be adding support for it? [Compare here versus here.]
    I found a little bit of marketing-speak blather at AMD about "SSE 128" in this old PDF Powerpoint-ish presentation, from June of 2006:
    http://www.amd.com/us-en/assets/content_type/DownloadableAssets/PhilHesterAMDAnalystDayV2.pdf
    WARNING: PDF DOCUMENT
    Page 13: "Dual 128-bit SSE dataflow, Dual 128-bit loads per cycle"
    Page 14: "128-bit SSE and 128-bit Loads, 128b FADD, 128 bit FMUL, 128b SSE, 128b SSE"
    etc etc etc
    While it's largely just gibberish to me, "FADD" looks like what might be a "floating point adder", and "FMUL" could be a "floating point multiplier", and God forbid that the two "SSE" units might be capable of computing some 128-bit cosines. But I don't know whether that old paper is even applicable to the chip that was released yesterday, and I'm just guessing as to what these things might mean anyway.
    Other than that, though, AMD's main website is strangely quiet about the Barcelona announcement. [Memo to AMD marketing - if you've just released the greatest thing since sliced bread, then you need to publicize the fact that you've just released the greatest thing since sliced bread...]

    I posted a query over at the AMD forums, and here's what I was told.
    I had hoped that e.g. "128b FADD" would be able to do something like the following:
    /* "quad" is a hypothetical 128-bit quad precision  */
    /* floating point number, similar to "long double"  */
    /* in recent versions of C++:                       */
    quad x, y, z;
    x = 1.000000000000000000000000000001;
    y = 1.000000000000000000000000000001;
    /* the hope was that "128b FADD" could perform the  */
    /* following 128-bit addition in hardware:          */
    z = x + y;
    However, the answer I'm getting is that "128b FADD" is just a set of two 64-bit adders running in parallel, which are capable of adding two vectors of 64-bit doubles more or less simultaneously:
    double x[2], y[2], z[2];
    x[0] = 1.000000000000000000000000000001;
    y[0] = 1.000000000000000000000000000001;
    x[1] = 2.000000000000000000000000000222;
    y[1] = 2.000000000000000000000000000222;
    /* Apparently the coordinates of the two "vectors" x & y       */
    /* can be sent to "128b FADD" in parallel, and the following   */
    /* two summations can be computed more or less simultaneously: */
    z[0] = x[0] + y[0];
    z[1] = x[1] + y[1];
    Thus e.g. "128b FADD", working in concert with "128b FMUL", will be able to [more or less] halve the amount of time it takes to compute a dot product of vectors whose coordinates are 64-bit doubles.
    So this "128-bit" circuitry is great if you're doing lots of linear algebra with 64-bit doubles, but it doesn't appear to offer anything in the way of greater precision for people who are interested in precision-sensitive calculations.
    By the way, if you're at all interested in questions of precision sensitivity & round-off error, I'd highly recommend Prof Kahan's page at Cal-Berzerkeley:
    http://www.cs.berkeley.edu/~wkahan/
    PDF DOCUMENT: How JAVA's Floating-Point Hurts Everyone Everywhere
    http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf
    PDF DOCUMENT: Matlab's Loss is Nobody's Gain
    http://www.cs.berkeley.edu/~wkahan/MxMulEps.pdf

  • How can floating point division be faster than integer division?

    Hello,
    I don't know if this is a Java quirk, or if I am doing something wrong. Check out this code:
public class TestApp
{
     public static void main(String args[])
     {
          long lngOldTime;
          long lngNewTime;
          long lngTimeDiff;
          int Tmp;
          lngOldTime = System.currentTimeMillis();
          for( int A=1 ; A<=20000 ; A++)
               for( int B=1 ; B<=20000 ; B++)
                    Tmp = A / B;
          lngNewTime = System.currentTimeMillis();
          lngTimeDiff = lngNewTime - lngOldTime;
          System.out.println(lngTimeDiff);
     }
}
It reports that the division operations took 18,116 milliseconds.
    Now check out this code (integers replaced with doubles):
public class TestApp
{
     public static void main(String args[])
     {
          long lngOldTime;
          long lngNewTime;
          long lngTimeDiff;
          double Tmp;
          lngOldTime = System.currentTimeMillis();
          for( double A=1 ; A<=20000 ; A++)
               for( double B=1 ; B<=20000 ; B++)
                    Tmp = A / B;
          lngNewTime = System.currentTimeMillis();
          lngTimeDiff = lngNewTime - lngOldTime;
          System.out.println(lngTimeDiff);
     }
}
It runs in 11,276 milliseconds.
    How is it that the second code snippet could be so much faster than the first? I am using jdk1.4.2_04
    Thanks in advance!

"I'm afraid you missed several key points. I only used longs for measuring the time (System.currentTimeMillis returns a long)."
Sorry, you are correct, I did miss that.
"However, even if I had, double is also a 64-bit data type - so technically that would have been a fairer test. The fact that 64-bit floating-point divisions are faster than 32-bit integer divisions is what confuses me. Oh, just in case you're interested, using floats in that same snippet takes only 7,471 milliseconds to execute!"
Then the other explanation is that the HotSpot compiler is optimizing the floating-point code to use the CPU's floating-point instructions, but it is not optimizing the integer divide in the same way.
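As an aside, timing loops like the ones above are easily distorted by JIT warm-up and dead-code elimination. Here is a hedged sketch (class and method names are mine) of the usual precautions: warm the code up first, and consume the result so the loop body cannot be optimized away.
public class DivisionBench
{
    // Returns a value derived from every division so HotSpot
    // cannot eliminate the loop body as dead code.
    static long intDivisions()
    {
        long sink = 0;
        for (int a = 1; a <= 20000; a++)
            for (int b = 1; b <= 20000; b++)
                sink += a / b;
        return sink;
    }
    public static void main(String[] args)
    {
        intDivisions();                          // warm-up run: let HotSpot compile the method
        long start = System.currentTimeMillis();
        long sink = intDivisions();              // timed run
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(elapsed + " ms (result " + sink + ")");
    }
}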

  • Floating point Number & Packed Number

Hi, can anyone tell me the difference between using a floating-point number and a packed number?
When should each be used?

Packed numbers - type P
Type P data allows digits after the decimal point. The number of decimal places is generic and is determined in the program. The value range of type P data depends on its size and the number of digits after the decimal point. The valid size can be any value from 1 to 16 bytes. Two decimal digits are packed into one byte, while the last byte contains one digit and the sign. Up to 14 digits are allowed after the decimal point. The initial value is zero. When working with type P data, it is a good idea to set the program attribute Fixed point arithmetic. Otherwise, type P numbers are treated as integers.
You can use type P data for such values as distances, weights, amounts of money, and so on.
Floating point numbers - type F
The value range of type F numbers is 1x10**-307 to 1x10**308 for positive and negative numbers, including 0 (zero). The accuracy is approximately 15 decimal digits, depending on the floating-point arithmetic of the hardware platform. Since type F data is internally converted to a binary representation, rounding errors can occur. Although the ABAP processor tries to minimize these effects, you should not use type F data if high accuracy is required. Instead, use type P data.
You use type F fields when you need to cope with very large value ranges and rounding errors are not critical.
Using I and F fields for calculations is quicker than using P fields. Arithmetic operations using I and F fields are very similar to the actual machine code operations, while P fields require more support from the software. Nevertheless, you have to use type P data to meet accuracy or value-range requirements.
reward if useful
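The P-versus-F contrast above is the same decimal-versus-binary trade-off discussed in the Java threads earlier on this page. As a rough cross-language analogy (my example, not from the post), compare exact decimal arithmetic with binary floating point in Java:
import java.math.BigDecimal;
public class DecimalVsBinary
{
    public static void main(String[] args)
    {
        // Decimal arithmetic (in the spirit of ABAP type P): exact for decimal fractions.
        BigDecimal d = new BigDecimal("1.3456").subtract(new BigDecimal("1.3454"));
        System.out.println(d);                  // 0.0002
        // Binary floating point (in the spirit of ABAP type F): rounding error appears.
        System.out.println(1.3456 - 1.3454);    // 1.9999999999997797E-4
    }
}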
