Scalar distance across columns

I need to calculate the scalar (Euclidean) distance between two vectors, each dimension represented by columns in a table. I have a procedure that does it the slow and painful way: loop over all vector pairs/loop over all columns/look up the individual cell values/sum over all the differences. It appears to me though, that there should be a more straightforward way to do this using the:
syntax in PL/SQL, then loop over the variables/column elements to sum the differences. However, I can't figure out how to keep the pointers for each variable in sync.
E.g.: coord1(1,2,3), coord2(2,3,1), differences(1,1,2), sum=4.
I can do this in Perl using hashes/arrays, but not in PL/SQL. What is the right frame/syntax to do this? If it were a couple of hundred rows, I'd do it in Perl, but with millions of rows and hundreds of columns, it gets real tedious, not to mention easily out-of-sync with the database contents.
I'd appreciate any suggestions!

Glad that it works for you. So, you want a fishing lesson :-).
First, the full documentation for the DBMS_SQL package can be found in Oracle's PL/SQL Packages and Types Reference.
Essentially, DBMS_SQL is an industrial strength version of EXECUTE IMMEDIATE (well, properly, EXECUTE IMMEDIATE is a highly simplified version of DBMS_SQL, since DBMS_SQL came first). It has many advantages over EXECUTE IMMEDIATE because it allows you full control over virtually everything to do with the query. So, you can relatively easily create bind variables on the fly to create statements with an unknown number of binds. It also allows you to find the columns from any arbitrary query through DBMS_SQL.Describe_Columns.
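To make the Describe_Columns point concrete, here is a minimal sketch of it on its own, assuming a table called t like the one in my sample. The variable names are made up for the illustration, and only three of the numeric type codes are decoded (1, 2 and 12); anything else is printed as the raw code.

```sql
DECLARE
   l_cur     INTEGER;            -- cursor handle
   l_col_cnt INTEGER;            -- number of columns in the select list
   l_desc    DBMS_SQL.Desc_Tab;  -- one record per column
BEGIN
   l_cur := DBMS_SQL.Open_Cursor;
   -- Parsing is enough for Describe_Columns; the query is never executed
   DBMS_SQL.Parse(l_cur, 'SELECT * FROM t', DBMS_SQL.Native);
   DBMS_SQL.Describe_Columns(l_cur, l_col_cnt, l_desc);

   FOR i IN 1 .. l_col_cnt LOOP
      DBMS_OUTPUT.Put_Line(
         RPAD(l_desc(i).col_name, 32) ||
         CASE l_desc(i).col_type
            WHEN 1  THEN 'VARCHAR2'
            WHEN 2  THEN 'NUMBER'
            WHEN 12 THEN 'DATE'
            ELSE 'type code ' || l_desc(i).col_type
         END);
   END LOOP;

   DBMS_SQL.Close_Cursor(l_cur);
END;
/
```

Run it with SET SERVEROUTPUT ON in SQL*Plus to see one line per column.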
You never actually see any values returned, because I never actually return any. Since you wanted your output in another table, I decided to build one big insert statement, rather than iterate through all the rows, calculate the distance, and then insert one row at a time. It seemed more efficient to me. I also decided to "hard code", through a bind variable, each of the column values from the ref_id, rather than try to dynamically build a join, which seemed simpler to me.
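The "unknown number of binds" idea can be sketched in isolation like this. It is only an illustration, not part of the procedure: the collection l_vals, the placeholder names :b1, :b2, ... and the query against dual are all assumptions chosen to keep the example self-contained. The pattern is the same one used for the distance formula: generate a placeholder name per item while building the statement, then bind by that same generated name after parsing.

```sql
DECLARE
   TYPE num_list_tp IS TABLE OF NUMBER INDEX BY PLS_INTEGER;
   l_vals   num_list_tp;                 -- values to bind; count not fixed in advance
   l_cur    INTEGER;                     -- cursor handle
   l_sql    VARCHAR2(4000) := 'SELECT ';
   l_total  NUMBER;                      -- receives the single result column
   l_ignore INTEGER;
BEGIN
   l_vals(1) := 10;
   l_vals(2) := 20;
   l_vals(3) := 12;

   -- Build "SELECT :b1 + :b2 + :b3 FROM dual", one placeholder per value
   FOR i IN 1 .. l_vals.COUNT LOOP
      l_sql := l_sql || ':b' || i || ' + ';
   END LOOP;
   l_sql := RTRIM(l_sql, '+ ') || ' FROM dual';

   l_cur := DBMS_SQL.Open_Cursor;
   DBMS_SQL.Parse(l_cur, l_sql, DBMS_SQL.Native);

   -- Bind each value using the same generated name
   FOR i IN 1 .. l_vals.COUNT LOOP
      DBMS_SQL.Bind_Variable(l_cur, ':b' || i, l_vals(i));
   END LOOP;

   -- Declare the result column type, run the query, read the value
   DBMS_SQL.Define_Column(l_cur, 1, l_total);
   l_ignore := DBMS_SQL.Execute_And_Fetch(l_cur);
   DBMS_SQL.Column_Value(l_cur, 1, l_total);
   DBMS_SQL.Close_Cursor(l_cur);

   DBMS_OUTPUT.Put_Line('Total: ' || l_total);
END;
/
```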
1. DBMS_SQL.Describe_Columns returns a table of records, one record for each column, that gives information much like that in the xxx_tab_columns views. If some id columns may not be numeric, you can test the col_type field in the table of records and define the character columns with a dummy VARCHAR2 variable instead. Just be aware that the col_type field returns a numeric version of the column type, not a string. If you look at the text from user_tab_cols:
SELECT text
FROM dba_views
WHERE view_name = 'USER_TAB_COLS'
You can see where Oracle DECODEs the numeric type to NUMBER, VARCHAR2 etc. Essentially, VARCHAR2 is 1 and NUMBER is 2.
2. I replaced the FOR scale_rec IN ... LOOP block with another DBMS_SQL built query to pull the AVG and VARIANCE from the table, it is commented in the code below.
3. You cannot really put a threshold on the computation of the distance, since the distance is not really known to the query until it is fully executed (i.e. at the time of inserting). However, you could modify the insert query to filter out rows with distances greater than 3. I will leave that modification as an exercise for the fisherman, but the end sql statement would look something like:
SELECT src, id, dist
FROM (SELECT :src_id src, id,
        SQRT(ABS((POWER((:sc2 - :mean2)/:var2, 2) - POWER((col1 - :mean2)/:var2, 2))+
                 (POWER((:sc3 - :mean3)/:var3, 2) - POWER((col2 - :mean3)/:var3, 2))+
                 (POWER((:sc4 - :mean4)/:var4, 2) - POWER((col3 - :mean4)/:var4, 2)))) dist
      FROM t
      WHERE id <> :src_id)
WHERE dist <= 3
Now here is a heavily commented version taking into account the proper formula.
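CREATE OR REPLACE PROCEDURE euclidean_distance (p_src_tab IN VARCHAR2) AS
   -- Collection types, assumed from how the variables are used below:
   -- NUMBER values keyed by column name, and NUMBER values keyed by position
   TYPE scale_cols_tp IS TABLE OF NUMBER INDEX BY VARCHAR2(30);
   TYPE src_cols_tp IS TABLE OF NUMBER INDEX BY BINARY_INTEGER;
   -- Associative Array for scaling factors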
   mean_cols scale_cols_tp; -- Will hold average of columns
   var_cols scale_cols_tp;  -- Will hold variance of columns
   -- Array to hold column values for the reference id
   src_cols src_cols_tp;
   l_dummyn NUMBER;  -- Dummy Number for Define Column
   l_ignore NUMBER;  -- Receives rows processed from queries
   l_sqlstr VARCHAR2(32767);  -- For sql statements
   src_cur  NUMBER;  -- Reference ID Cursor handle
   src_tab  DBMS_SQL.Desc_Tab;  -- Table of Records describing passed table
   src_col_cnt NUMBER;   -- Number of columns in passed table
   ins_cur  NUMBER;  -- Insert Cursor handle
   mv_cur  NUMBER;  -- Cursor handle for Mean and Variance of table
   l_col_pos NUMBER; -- Index into src_tab
   l_res_pos NUMBER; -- Column position in mv_cur
   l_x1     VARCHAR2(10);  -- Dynamic bind variable for source column values
   l_x2     VARCHAR2(30);  -- Column name from source columns
   l_meani  VARCHAR2(10);  -- Dynamic bind variable for mean of table
   l_vari   VARCHAR2(10);  -- Dynamic bind variable for variance of table
BEGIN
   -- Get a "handle" for a cursor, and parse the query
   src_cur := DBMS_SQL.Open_Cursor;
   DBMS_SQL.Parse(src_cur, 'SELECT * FROM '||p_src_tab||' WHERE id = :src_id',DBMS_SQL.Native);
   -- Describe the table to get number of columns, their names and their types
   DBMS_SQL.Describe_Columns(src_cur, src_col_cnt, src_tab);
   -- I now have a table of records (src_tab) showing similar info
   -- to that shown in xxx_tab_columns, one record per column
   -- and the number of columns in the table (actually the number of records
   -- in src_tab) is in src_col_cnt
   -- Define the column types for src_cur.  This just tells DBMS_SQL that
   -- the column at position i (based on the column list in the select)
   -- is of the passed data type.  It is not setting a variable to
   -- receive the column value.
   -- I am assuming all columns, including ID, are NUMBER.
   -- If id may be alpha, test src_tab(i).col_type to check the data type
   FOR i IN 1 .. src_col_cnt LOOP
      DBMS_SQL.DEFINE_COLUMN(src_cur, i, l_dummyn);
   END LOOP;
-- This section replaces the FOR scale_rec IN LOOP
   -- Get mean and variance for passed table
   -- Build the sql statement
   l_sqlstr := 'SELECT ';
   FOR i IN 2 .. src_col_cnt LOOP
      l_sqlstr := l_sqlstr||'AVG('||src_tab(i).col_name||'),'||
                  'VARIANCE('||src_tab(i).col_name||'),';
   END LOOP;
   -- l_sqlstr is now:
   -- SELECT AVG(col1), VARIANCE(col1),AVG(col2), VARIANCE(col2),
   --        AVG(col3), VARIANCE(col3),
   -- So trim the trailing , and add FROM
   l_sqlstr := RTRIM(l_sqlstr,',');
   l_sqlstr := l_sqlstr||' FROM '||p_src_tab;
   -- Set up the cursor
   mv_cur := DBMS_SQL.OPEN_CURSOR;
   DBMS_SQL.Parse(mv_cur, l_sqlstr, DBMS_SQL.Native);
   -- Getting 2 results (AVG and VARIANCE) for each column
   -- and we know that they are all numeric
   -- but need to ignore the id column so
   FOR i IN 1 .. (src_col_cnt - 1) * 2 LOOP
      DBMS_SQL.DEFINE_COLUMN(mv_cur, i, l_dummyn);
   END LOOP;
   -- Now Run the query and assign columns into associative array
   l_ignore := DBMS_SQL.EXECUTE_AND_FETCH(mv_cur);
   l_col_pos := 2;  -- start in Record 2 of src_tab
   l_res_pos := 1;  -- start in Column 1 of result set
   WHILE l_col_pos <= src_col_cnt LOOP
      DBMS_SQL.COLUMN_VALUE(mv_cur, l_res_pos, mean_cols(src_tab(l_col_pos).col_name));
      DBMS_SQL.COLUMN_VALUE(mv_cur, l_res_pos + 1, var_cols(src_tab(l_col_pos).col_name));
      l_col_pos := l_col_pos + 1;
      l_res_pos := l_res_pos + 2;
   END LOOP;
   -- I end up with two associative arrays, both indexed by column name
   -- mean_cols values are the average for each column
   -- var_cols values are the variance for each column
   -- We're done with this query so
   DBMS_SQL.Close_Cursor(mv_cur);
-- END replacement FOR scale_rec IN LOOP
   -- Build the insert statement
   -- I took the ABSolute value of the SUM of L(i) because with
   -- my data I was getting negative values which blew the SQRT
   l_sqlstr := 'INSERT INTO diag SELECT :src_id, id, SQRT(ABS(';
   FOR i IN 2 .. src_col_cnt LOOP
      -- Dynamically create bind variables to hold "fixed" values
      -- (i.e. reference values, mean, and variance)
      -- and plug the correct column names for the target columns
      l_x1 := ':sc'||i; -- For i = 2 gives :sc2
      l_x2 := src_tab(i).col_name; -- For i = 2 gives COL1
      l_meani := ':mean'||i; -- For i = 2 gives :mean2
      l_vari := ':var'||i; -- For i = 2 gives :var2
      -- Append this column formula to sql statement
      l_sqlstr := l_sqlstr ||'(POWER(('||l_x1||' - '||l_meani||
                  ')/'||l_vari||', 2) - POWER(('||l_x2||' - '||
                  l_meani||')/'||l_vari||', 2)) + ';
   END LOOP;
   -- Here, l_sqlstr is:
   -- INSERT INTO diag
   -- SELECT :src_id, id, SQRT(ABS((POWER((:sc2 - :mean2)/:var2, 2) -
   --                               POWER((col1 - :mean2)/:var2, 2))+
   --                              (POWER((:sc3 - :mean3)/:var3, 2) -
   --                               POWER((col2 - :mean3)/:var3, 2))+
   --                              (POWER((:sc4 - :mean4)/:var4, 2) -
   --                               POWER((col3 - :mean4)/:var4, 2))+
   -- so get rid of the trailing + and space
   l_sqlstr := RTRIM(l_sqlstr,'+ ');
   -- Now close the open bracket from SQRT and ABS and add FROM and WHERE
   l_sqlstr := l_sqlstr||')) FROM '||p_src_tab||' WHERE id <> :src_id';
   -- Now we have a valid insert statement so
   -- Prepare and parse the insert cursor
   ins_cur := DBMS_SQL.Open_Cursor;
   DBMS_SQL.Parse(ins_cur, l_sqlstr, DBMS_SQL.Native);
   -- bind in the mean and variance which are fixed for this set of columns
   FOR i IN 2 .. src_col_cnt LOOP
      DBMS_SQL.Bind_Variable(ins_cur,':mean'||i, mean_cols(src_tab(i).col_name));
      DBMS_SQL.Bind_Variable(ins_cur,':var'||i, var_cols(src_tab(i).col_name));
      -- for i = 2, These two calls resolve as:
      -- DBMS_SQL.Bind_Variable(ins_cur,:mean2, mean_cols(COL1));
      -- DBMS_SQL.Bind_Variable(ins_cur,:var2, var_cols(COL1));
      -- which means bind the value found in mean_cols('COL1') to
      -- the bind variable :mean2 and the value found in var_cols('COL1') to 
      -- the bind variable :var2
   END LOOP;
   -- Get the reference IDs
   FOR intrest_rec IN (SELECT ref_id FROM query) LOOP
      -- For each reference ID bind into the source query
      DBMS_SQL.Bind_Variable(src_cur,':src_id', intrest_rec.ref_id);
      -- So, here, on the first iteration, we are about to execute
      -- the statement
      -- SELECT * FROM t WHERE id = 1     
      l_ignore := DBMS_SQL.Execute_And_Fetch(src_cur);
      -- Get the column values from each source row and
      -- bind that value (e.g. 1 in my sample) to variable :src_id
      DBMS_SQL.Bind_Variable(ins_cur,':src_id', intrest_rec.ref_id);
      FOR i IN 2 .. src_col_cnt LOOP
         -- Retrieve the value of each column of the source row
         -- into the table of NUMBERS
         DBMS_SQL.COLUMN_VALUE(src_cur, i, src_cols(i));
         -- Then bind it into the insert cursor
         DBMS_SQL.Bind_Variable(ins_cur,':sc'||i, src_cols(i));
      END LOOP;
      -- execute the insert statement
      l_ignore := DBMS_SQL.EXECUTE(ins_cur);
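   END LOOP;
   -- Done with both cursors, so release them
   DBMS_SQL.Close_Cursor(src_cur);
   DBMS_SQL.Close_Cursor(ins_cur);
END;
If you still have questions, I will be around until Thursday, then back January 4.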
