Term frequency (tf-idf)

I need some help with term frequency (how to store the data).
I have several documents that will be parsed and tokenized into words (with a stopword list and a stemmer applied).
The words are then stored in the database (where the word is the primary key).
Given the equation, let
td = number of times the term t appears in document d,
md = maximum term frequency of any term in d,
then tf = td / md.
My problem is this.
Suppose Document 1 contains the word "Java" 500 times (td) and the max is 1000 (md) // some other word has frequency 1000.
And suppose in Document 2, "Java" occurs 200 times (td) and md is 500.
tf for Document 1 = 0.5
and tf for Document 2 = 0.4
However, I want to calculate the weight of the term across all documents.
I don't think I can just add 200 to the counter and divide by the highest md.
I'm a bit lost here.
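One way to keep the bookkeeping straight, as a minimal sketch (the class and method names here are made up for illustration): keep td per (term, document) and md per document, compute tf per document, and derive any collection-wide weight from the per-document tf values, e.g. by averaging them, rather than adding raw counts across documents.

import java.util.HashMap;
import java.util.Map;

// Hedged sketch, not a definitive design: per-document counts are kept
// separate so tf = td/md is always computed against the right md.
public class TermStats {
    // term -> (docId -> raw count td)
    private final Map<String, Map<Integer, Integer>> counts = new HashMap<>();
    // docId -> max term frequency md in that document
    private final Map<Integer, Integer> maxFreq = new HashMap<>();

    public void add(String term, int docId, int td) {
        counts.computeIfAbsent(term, t -> new HashMap<>()).put(docId, td);
        maxFreq.merge(docId, td, Math::max);
    }

    // tf for one document: td / md
    public double tf(String term, int docId) {
        Map<Integer, Integer> perDoc = counts.get(term);
        if (perDoc == null || !perDoc.containsKey(docId)) return 0.0;
        return perDoc.get(docId) / (double) maxFreq.get(docId);
    }

    // One possible collection-level weight: the average of the per-document
    // tf values. Summing raw counts and dividing by a single md would mix
    // incompatible normalizations.
    public double averageTf(String term) {
        Map<Integer, Integer> perDoc = counts.get(term);
        if (perDoc == null || perDoc.isEmpty()) return 0.0;
        double sum = 0.0;
        for (int docId : perDoc.keySet()) sum += tf(term, docId);
        return sum / perDoc.size();
    }
}

With the numbers from the question, tf("java", 1) = 0.5, tf("java", 2) = 0.4, and averageTf("java") = 0.45.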

ricky171
Welcome to the forum. Please don't post in old threads that are long dead. When you have a question, please start a topic of your own. Feel free to provide a link to an old thread if relevant.
Also, please don't solicit off-forum communication by email. This not only goes against the spirit of the forum, which is to share problems and their solutions, but will also fetch you plenty of spam and rarely any help.
I'm locking this thread now. It's more than 4½ years old.
I'm also blocking both of the posts soliciting off-forum communication.
db

Similar Messages

  • Info Retrieval term frequency (td.idf) calculation problem

    Anyone who's done any IR work like search engines should be familiar with the td.idf formula I've tried to use. The problem is, I've got it wrong somewhere: the results come back the same for each document, and sometimes I get negative results.
    <%@ page language="java" contentType="text/html"
        errorPage="errorpage.jsp" import="java.util.*,java.sql.*" %>
    <%
        String strUser = "cooks";
        String strPass = "d885764c";
        String strConnectURL = "jdbc:mysql://localhost/cooks";
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(strConnectURL, strUser, strPass);
        Statement stmt = conn.createStatement();
        ResultSet rs;
        String sql;
        String searchTerm = request.getParameter("search");
        ArrayList score = new ArrayList();
        ArrayList product = new ArrayList();
        final int recall = 20;
        ArrayList splitTerms = new ArrayList();
        StringTokenizer st = new StringTokenizer(searchTerm);
        while (st.hasMoreTokens()) {
            splitTerms.add(st.nextToken());
        }
        String tempSearchTerm = "" + splitTerms.get(0);
        if (splitTerms.size() == 1 && tempSearchTerm.length() == 5) {
            try {
                int productCode = Integer.parseInt("" + splitTerms.get(0));
                sql = "SELECT id FROM product WHERE id = " + productCode + ";";
                rs = stmt.executeQuery(sql);
                while (rs.next()) {
                    response.sendRedirect("product.jsp?productID=" + rs.getString(1));
                }
            } catch (NumberFormatException e) {}
        }
        ArrayList docBaseN = new ArrayList();
        ArrayList docBaseD = new ArrayList();
        ArrayList docBaseNid = new ArrayList();
        ArrayList docBaseDid = new ArrayList();
        // ------------------------ Get items that have part of searchTerm in name -------------------------- //
        sql = "SELECT id, name FROM product WHERE UPPER(name) LIKE UPPER('%";
        for (int i = 0; i < splitTerms.size(); i++) {
            sql += splitTerms.get(i) + "%')";
            if (i < splitTerms.size() - 1) {
                // the name query must OR on name, not description (bug in the original)
                sql += " OR UPPER(name) LIKE UPPER('%";
            }
        }
        sql += ";";
        rs = stmt.executeQuery(sql);
        while (rs.next()) {
            docBaseNid.add(rs.getInt(1));
            docBaseN.add(rs.getString(2));
        }
        out.print("<br><hr><strong>GET RELEVANT NAMED ITEMS: </strong>\t" + sql + "<br>");
        // --------------------- Get items that have part of searchTerm in description ---------------------- //
        sql = "SELECT id, description FROM product WHERE UPPER(description) LIKE UPPER('%";
        for (int i = 0; i < splitTerms.size(); i++) {
            sql += splitTerms.get(i) + "%')";
            if (i < splitTerms.size() - 1) {
                sql += " OR UPPER(description) LIKE UPPER('%";
            }
        }
        sql += ";";
        rs = stmt.executeQuery(sql);
        while (rs.next()) {
            docBaseDid.add(rs.getInt(1));
            docBaseD.add(rs.getString(2));
        }
        out.print("<strong>GET RELEVANT DESCRIBED ITEMS: </strong>\t" + sql + "<hr>");
        rs.close();
        stmt.close();
        conn.close();
        // ---------------------------------- Calculate TD.IDF for n ---------------------------------------- //
        double TDn = 0.0;
        double[] TDIDFn = new double[docBaseN.size()];
        double Nn = 0;
        int sumN = 0;
        for (int i = 0; i < docBaseN.size(); i++) {
            st = new StringTokenizer("" + docBaseN.get(i));
            sumN += st.countTokens();
            while (st.hasMoreTokens()) {
                String baseTerm = st.nextToken();
                for (int j = 0; j < splitTerms.size(); j++) {
                    session.setAttribute("debug", new Integer(136)); // debug marker from the original
                    if (baseTerm.equalsIgnoreCase("" + splitTerms.get(j))) {
                        Nn++;
                    }
                }
            }
        }
        out.print("Nn = " + Nn + "\tsumN = " + sumN + " ");
        TDn = (sumN > 0) ? Nn / sumN : 0; // guard: double division never throws, so the old catch never fired
        out.print(TDn);
        double[] IDFn = new double[docBaseN.size()];
        int Dn = docBaseN.size();
        int[] docsWithTn = new int[docBaseN.size()];
        for (int i = 0; i < docBaseN.size(); i++) {
            IDFn[i] = 0.00;
            TDIDFn[i] = 0.00;
            docsWithTn[i] = 0;
        }
        for (int i = 0; i < splitTerms.size(); i++) {
            for (int j = 0; j < docBaseN.size(); j++) {
                st = new StringTokenizer("" + docBaseN.get(j));
                out.print("tokens = " + st.countTokens() + "<p>");
                compare:
                while (st.hasMoreTokens()) {
                    String tempStrA = "" + st.nextToken();
                    String tempStrB = "" + splitTerms.get(i);
                    out.print(tempStrA + " " + tempStrB + "<br>");
                    if (tempStrA.equalsIgnoreCase(tempStrB)) {
                        out.print("found match <p>");
                        docsWithTn[j]++;
                        out.print(docsWithTn[j] + "<p>");
                        break compare;
                    }
                }
            }
        }
        for (int i = 0; i < docsWithTn.length; i++) {
            if (docsWithTn[i] > 0) {
                // floating-point division: Dn/docsWithTn[i] as ints truncates and skews the IDF
                IDFn[i] += 1 - Math.log((double) Dn / docsWithTn[i]);
            } else {
                IDFn[i] = 0;
            }
            out.print("Dn = " + Dn + "<br>");
            out.print("docsWithTn[" + i + "] = " + docsWithTn[i] + "<br>");
            out.print("IDFn[" + i + "] = " + IDFn[i] + "<br>");
        }
        for (int i = 0; i < IDFn.length; i++) {
            TDIDFn[i] += TDn * IDFn[i];
            out.print("<br><strong>id = " + docBaseNid.get(i) + " \tTDIDFn[" + i + "] = " + TDIDFn[i] + "</strong>");
        }
        out.print("<p>");
        // ---------------------------------- Calculate TD.IDF for d ---------------------------------------- //
        double TDd = 0.0;
        double[] TDIDFd = new double[docBaseD.size()];
        double Nd = 0;
        int sumD = 0;
        for (int i = 0; i < docBaseD.size(); i++) {
            st = new StringTokenizer("" + docBaseD.get(i));
            sumD += st.countTokens();
            while (st.hasMoreTokens()) {
                String baseTerm = st.nextToken();
                for (int j = 0; j < splitTerms.size(); j++) {
                    if (baseTerm.equalsIgnoreCase("" + splitTerms.get(j))) {
                        Nd++;
                    }
                }
            }
        }
        out.print("Nd = " + Nd + "\tsumD = " + sumD + " ");
        TDd = (sumD > 0) ? Nd / sumD : 0;
        out.print(TDd);
        double[] IDFd = new double[docBaseD.size()];
        int Dd = docBaseD.size();
        int[] docsWithTd = new int[docBaseD.size()];
        for (int i = 0; i < docBaseD.size(); i++) {
            IDFd[i] = 0.00;
            TDIDFd[i] = 0.00;
            docsWithTd[i] = 0;
        }
        for (int i = 0; i < splitTerms.size(); i++) {
            for (int j = 0; j < docBaseD.size(); j++) {
                st = new StringTokenizer("" + docBaseD.get(j));
                out.print("tokens = " + st.countTokens() + "<p>");
                compare:
                while (st.hasMoreTokens()) {
                    String tempStrA = "" + st.nextToken();
                    String tempStrB = "" + splitTerms.get(i);
                    out.print(tempStrA + " " + tempStrB + "<br>");
                    if (tempStrA.equalsIgnoreCase(tempStrB)) {
                        out.print("found match <p>");
                        docsWithTd[j]++;
                        out.print(docsWithTd[j] + "<p>");
                        break compare;
                    }
                }
            }
        }
        for (int i = 0; i < docsWithTd.length; i++) {
            if (docsWithTd[i] > 0) {
                IDFd[i] += 1 - Math.log((double) Dd / docsWithTd[i]);
            } else {
                IDFd[i] = 0;
            }
            out.print("Dd = " + Dd + "<br>");
            out.print("docsWithTd[" + i + "] = " + docsWithTd[i] + "<br>");
            out.print("IDFd[" + i + "] = " + IDFd[i] + "<br>");
        }
        for (int i = 0; i < IDFd.length; i++) {
            TDIDFd[i] += TDd * IDFd[i];
            out.print("<br><strong>id = " + docBaseDid.get(i) + " \tTDIDFd[" + i + "] = " + TDIDFd[i] + "</strong>");
        }
    %>


  • Obtain term vectors from CONTEXT index

    I'm hoping to do Latent Semantic Indexing within the Oracle framework. There are many obstacles along the way, not least of which is obtaining the term-document matrix for the corpus. Since we are already using Oracle Text for bread-and-butter search, I would like to be able to read the term vectors for each document from that existing index. How can I go about doing this? I suspect there must be a way to read the $I table BLOB and derive the values without re-indexing everything yourself. Is there a better way? The Oracle docs recommend against reading the index tables and then seem to provide no transparent means of obtaining information about the index. I view this as a huge weakness of Oracle Text compared to open-source Lucene/Solr, which to my understanding provides visibility into the term vectors. This seems an unnecessary and big oversight, and it will lead companies looking to do innovative text mining to other frameworks that don't have such glaring shortcomings.
    I'm a bit frustrated, perhaps unfairly. Perhaps I just haven't found where in the docs this can be done. Any help is appreciated.

    I wanted a vector for each document containing the term frequency and the inverse document frequency for each term in the corpus. Internally, this is how Oracle handles search and ranking via the Salton algorithm, but I believe they calculate the weights on the fly from the inverted index, which holds the term frequencies and keeps the inverse document frequency at the head of each posting list. What this means for me is that the full representation would be too massive and sparse to be of any practical use. With that in mind, I can use the CTX_DOC.TOKENS function to obtain just the tokens in a document, group and count the result table, put the counts into my own simplified inverted index, and go through all the documents in this manner.
    So in summary, I think they don't provide that term vector because it is probably never actually materialized; it is stored in inverted form instead, and the weights are calculated on the fly. The Oracle Text index is also concerned with much more than what I need, considering fuzzy string matching, wildcards, etc. So in the end it will take me a day or two to write code to build my own simplified index and run numerical algorithms on the sparse matrix stored in that manner.
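    A minimal sketch of that simplified inverted index, assuming the tokens have already been fetched per document (e.g. from CTX_DOC.TOKENS results queried over JDBC; that plumbing is not shown here):

    import java.util.HashMap;
    import java.util.Map;

    // Rough sketch of the simplified inverted index described above: the
    // term-document matrix is never materialized, and tf-idf weights are
    // computed on the fly from sparse postings.
    public class SimpleInvertedIndex {
        // token -> (docId -> term frequency)
        private final Map<String, Map<String, Integer>> postings = new HashMap<>();
        private int numDocs = 0;

        public void addDocumentTokens(String docId, Iterable<String> tokens) {
            numDocs++;
            for (String token : tokens) {
                postings.computeIfAbsent(token, t -> new HashMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }

        // tf-idf weight for one (token, doc) cell of the term-document matrix
        public double weight(String token, String docId) {
            Map<String, Integer> posting = postings.get(token);
            if (posting == null || !posting.containsKey(docId)) return 0.0;
            double tf = posting.get(docId);
            double idf = Math.log((double) numDocs / posting.size());
            return tf * idf;
        }
    }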

  • Problem in near duplicate detection in c#

    I have developed an application for near-duplicate detection in C#. It works for plain strings, but not for PDF files. I suspect the
    GetSimilarity method does not work properly, yet no error is raised.
    My application code is as follows:
    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using System.Windows.Forms;
    using System.IO;
    using iTextSharp.text;
    using System.Threading;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    using System.Text.RegularExpressions;
    using WindowsFormsApplication1.appcode;
    namespace WindowsFormsApplication1
    {
        public partial class Form1 : Form
        {
            string filename;
            FileInfo[] data1;
            FileInfo[] data2;
            string path;

            public Form1()
            {
                InitializeComponent();
            }

            private void button1_Click(object sender, EventArgs e)
            {
                OpenFileDialog openFileDialog = new OpenFileDialog();
                openFileDialog.CheckFileExists = true;
                openFileDialog.AddExtension = true;
                openFileDialog.Filter = "PDF files (*.pdf)|*.pdf";
                DialogResult result = openFileDialog.ShowDialog();
                if (result == DialogResult.OK)
                {
                    filename = Path.GetFileName(openFileDialog.FileName);
                    path = Path.GetDirectoryName(openFileDialog.FileName);
                    textBox1.Text = path + "\\" + filename;
                }
            }

            private void button2_Click(object sender, EventArgs e)
            {
                OpenFileDialog openFileDialog = new OpenFileDialog();
                openFileDialog.CheckFileExists = true;
                openFileDialog.AddExtension = true;
                openFileDialog.Filter = "PDF files (*.pdf)|*.pdf";
                DialogResult result = openFileDialog.ShowDialog();
                if (result == DialogResult.OK)
                {
                    filename = Path.GetFileName(openFileDialog.FileName);
                    path = Path.GetDirectoryName(openFileDialog.FileName);
                    textBox2.Text = path + "\\" + filename;
                }
            }

            public static string ExtractTextFromPdf(string filename)
            {
                using (PdfReader r = new PdfReader(filename))
                {
                    StringBuilder text = new StringBuilder();
                    for (int i = 1; i <= r.NumberOfPages; i++)
                        text.Append(PdfTextExtractor.GetTextFromPage(r, i));
                    return text.ToString();
                }
            }

            // Identical to ExtractTextFromPdf; kept as in the original post
            public static string Extract(string filename)
            {
                using (PdfReader r = new PdfReader(filename))
                {
                    StringBuilder text = new StringBuilder();
                    for (int i = 1; i <= r.NumberOfPages; i++)
                        text.Append(PdfTextExtractor.GetTextFromPage(r, i));
                    return text.ToString();
                }
            }

            private void button3_Click(object sender, EventArgs e)
            {
                StopWordsHandler stopword = new StopWordsHandler();
                string s = ExtractTextFromPdf(textBox1.Text);
                string s1 = Extract(textBox2.Text);
                string[] doc = new string[2] { s, s1 };
                TFIDF tfidf = new TFIDF(doc);
                float fl = tfidf.GetSimilarity(0, 1);
                var sformatted = string.Format("Value: {0:P2}.", fl);
                MessageBox.Show(sformatted); // display the percentage (sformatted was never used in the original)
            }
        }
    }
    StopWordsHandler.cs:
    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    namespace WindowsFormsApplication1.appcode
    {
        class StopWordsHandler
        {
            public static string[] stopWordsList = new string[] {
    "a",
    "about",
    "above",
    "across",
    "afore",
    "aforesaid",
    "after",
    "again",
    "against",
    "agin",
    "ago",
    "aint",
    "albeit",
    "all",
    "almost",
    "alone",
    "along",
    "alongside",
    "already",
    "also",
    "although",
    "always",
    "am",
    "american",
    "amid",
    "amidst",
    "among",
    "amongst",
    "an",
    "and",
    "anent",
    "another",
    "any",
    "anybody",
    "anyone",
    "anything",
    "are",
    "aren't",
    "around",
    "as",
    "aslant",
    "astride",
    "at",
    "athwart",
    "away",
    "b",
    "back",
    "bar",
    "barring",
    "be",
    "because",
    "been",
    "before",
    "behind",
    "being",
    "below",
    "beneath",
    "beside",
    "besides",
    "best",
    "better",
    "between",
    "betwixt",
    "beyond",
    "both",
    "but",
    "by",
    "c",
    "can",
    "cannot",
    "can't",
    "certain",
    "circa",
    "close",
    "concerning",
    "considering",
    "cos",
    "could",
    "couldn't",
    "couldst",
    "d",
    "dare",
    "dared",
    "daren't",
    "dares",
    "daring",
    "despite",
    "did",
    "didn't",
    "different",
    "directly",
    "do",
    "does",
    "doesn't",
    "doing",
    "done",
    "don't",
    "dost",
    "doth",
    "down",
    "during",
    "durst",
    "e",
    "each",
    "early",
    "either",
    "em",
    "english",
    "enough",
    "ere",
    "even",
    "ever",
    "every",
    "everybody",
    "everyone",
    "everything",
    "except",
    "excepting",
    "f",
    "failing",
    "far",
    "few",
    "first",
    "five",
    "following",
    "for",
    "four",
    "from",
    "g",
    "gonna",
    "gotta",
    "h",
    "had",
    "hadn't",
    "hard",
    "has",
    "hasn't",
    "hast",
    "hath",
    "have",
    "haven't",
    "having",
    "he",
    "he'd",
    "he'll",
    "her",
    "here",
    "here's",
    "hers",
    "herself",
    "he's",
    "high",
    "him",
    "himself",
    "his",
    "home",
    "how",
    "howbeit",
    "however",
    "how's",
    "i",
    "id",
    "if",
    "ill",
    "i'm",
    "immediately",
    "important",
    "in",
    "inside",
    "instantly",
    "into",
    "is",
    "isn't",
    "it",
    "it'll",
    "it's",
    "its",
    "itself",
    "i've",
    "j",
    "just",
    "k",
    "l",
    "large",
    "last",
    "later",
    "least",
    "left",
    "less",
    "lest",
    "let's",
    "like",
    "likewise",
    "little",
    "living",
    "long",
    "m",
    "many",
    "may",
    "mayn't",
    "me",
    "mid",
    "midst",
    "might",
    "mightn't",
    "mine",
    "minus",
    "more",
    "most",
    "much",
    "must",
    "mustn't",
    "my",
    "myself",
    "n",
    "near",
    "'neath",
    "need",
    "needed",
    "needing",
    "needn't",
    "needs",
    "neither",
    "never",
    "nevertheless",
    "new",
    "next",
    "nigh",
    "nigher",
    "nighest",
    "nisi",
    "no",
    "no-one",
    "nobody",
    "none",
    "nor",
    "not",
    "nothing",
    "notwithstanding",
    "now",
    "o",
    "o'er",
    "of",
    "off",
    "often",
    "on",
    "once",
    "one",
    "oneself",
    "only",
    "onto",
    "open",
    "or",
    "other",
    "otherwise",
    "ought",
    "oughtn't",
    "our",
    "ours",
    "ourselves",
    "out",
    "outside",
    "over",
    "own",
    "p",
    "past",
    "pending",
    "per",
    "perhaps",
    "plus",
    "possible",
    "present",
    "probably",
    "provided",
    "providing",
    "public",
    "q",
    "qua",
    "quite",
    "r",
    "rather",
    "re",
    "real",
    "really",
    "respecting",
    "right",
    "round",
    "s",
    "same",
    "sans",
    "save",
    "saving",
    "second",
    "several",
    "shall",
    "shalt",
    "shan't",
    "she",
    "shed",
    "shell",
    "she's",
    "short",
    "should",
    "shouldn't",
    "since",
    "six",
    "small",
    "so",
    "some",
    "somebody",
    "someone",
    "something",
    "sometimes",
    "soon",
    "special",
    "still",
    "such",
    "summat",
    "supposing",
    "sure",
    "t",
    "than",
    "that",
    "that'd",
    "that'll",
    "that's",
    "the",
    "thee",
    "their",
    "theirs",
    "their's",
    "them",
    "themselves",
    "then",
    "there",
    "there's",
    "these",
    "they",
    "they'd",
    "they'll",
    "they're",
    "they've",
    "thine",
    "this",
    "tho",
    "those",
    "thou",
    "though",
    "three",
    "thro'",
    "through",
    "throughout",
    "thru",
    "thyself",
    "till",
    "to",
    "today",
    "together",
    "too",
    "touching",
    "toward",
    "towards",
    "true",
    "'twas",
    "'tween",
    "'twere",
    "'twill",
    "'twixt",
    "two",
    "'twould",
    "u",
    "under",
    "underneath",
    "unless",
    "unlike",
    "until",
    "unto",
    "up",
    "upon",
    "us",
    "used",
    "usually",
    "v",
    "versus",
    "very",
    "via",
    "vice",
    "vis-a-vis",
    "w",
    "wanna",
    "wanting",
    "was",
    "wasn't",
    "way",
    "we",
    "we'd",
    "well",
    "were",
    "weren't",
    "wert",
    "we've",
    "what",
    "whatever",
    "what'll",
    "what's",
    "when",
    "whencesoever",
    "whenever",
    "when's",
    "whereas",
    "where's",
    "whether",
    "which",
    "whichever",
    "whichsoever",
    "while",
    "whilst",
    "who",
    "who'd",
    "whoever",
    "whole",
    "who'll",
    "whom",
    "whore",
    "who's",
    "whose",
    "whoso",
    "whosoever",
    "will",
    "with",
    "within",
    "without",
    "wont",
    "would",
    "wouldn't",
    "wouldst",
    "x",
    "y",
    "ye",
    "yet",
    "you",
    "you'd",
    "you'll",
    "your",
    "you're",
    "yours",
    "yourself",
    "yourselves",
    "you've",
    "z",
    private static Hashtable _stopwords=null;
    public static object AddElement(IDictionary collection,Object key, object newValue)
    object element = collection[key];
    collection[key] = newValue;
    return element;
    public static bool IsStopword(string str)
    //int index=Array.BinarySearch(stopWordsList, str)
    return _stopwords.ContainsKey(str);
    public StopWordsHandler()
    if (_stopwords == null)
    _stopwords = new Hashtable();
    double dummy = 0;
    foreach (string word in stopWordsList)
    AddElement(_stopwords, word, dummy);
    TFIDF.cs:
    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    namespace WindowsFormsApplication1.appcode
    {
        class TFIDF
        {
            private string[] _docs;
            private string[][] _ngramDoc;
            private int _numDocs = 0;
            private int _numTerms = 0;
            private ArrayList _terms;
            private int[][] _termFreq;
            private float[][] _termWeight;
            private int[] _maxTermFreq;
            private int[] _docFreq;

            public class TermVector
            {
                public static float ComputeCosineSimilarity(float[] vector1, float[] vector2)
                {
                    if (vector1.Length != vector2.Length)
                        throw new Exception("DIFFERING LENGTHS ARE NOT ALLOWED");
                    float denom = (VectorLength(vector1) * VectorLength(vector2));
                    if (denom == 0F)
                        return 0F;
                    else
                        return (InnerProduct(vector1, vector2) / denom);
                }

                public static float InnerProduct(float[] vector1, float[] vector2)
                {
                    if (vector1.Length != vector2.Length)
                        throw new Exception("DIFFERING LENGTHS ARE NOT ALLOWED");
                    float result = 0F;
                    for (int i = 0; i < vector1.Length; i++)
                        result += vector1[i] * vector2[i];
                    return result;
                }

                public static float VectorLength(float[] vector)
                {
                    float sum = 0.0F;
                    for (int i = 0; i < vector.Length; i++)
                        sum = sum + (vector[i] * vector[i]);
                    return (float)Math.Sqrt(sum);
                }
            }

            private IDictionary _wordsIndex = new Hashtable();

            public TFIDF(string[] documents)
            {
                _docs = documents;
                _numDocs = documents.Length;
                MyInit();
            }

            private void GeneratNgramText()
            {
            }

            private ArrayList GenerateTerms(string[] docs)
            {
                ArrayList uniques = new ArrayList();
                _ngramDoc = new string[_numDocs][];
                for (int i = 0; i < docs.Length; i++)
                {
                    Tokeniser tokenizer = new Tokeniser();
                    string[] words = tokenizer.Partition(docs[i]);
                    for (int j = 0; j < words.Length; j++)
                        if (!uniques.Contains(words[j]))
                            uniques.Add(words[j]);
                }
                return uniques;
            }

            private static object AddElement(IDictionary collection, object key, object newValue)
            {
                object element = collection[key];
                collection[key] = newValue;
                return element;
            }

            private int GetTermIndex(string term)
            {
                object index = _wordsIndex[term];
                if (index == null) return -1;
                return (int)index;
            }

            private void MyInit()
            {
                _terms = GenerateTerms(_docs);
                _numTerms = _terms.Count;
                _maxTermFreq = new int[_numDocs];
                _docFreq = new int[_numTerms];
                _termFreq = new int[_numTerms][];
                _termWeight = new float[_numTerms][];
                for (int i = 0; i < _terms.Count; i++)
                {
                    _termWeight[i] = new float[_numDocs];
                    _termFreq[i] = new int[_numDocs];
                    AddElement(_wordsIndex, _terms[i], i);
                }
                GenerateTermFrequency();
                GenerateTermWeight();
            }

            private float Log(float num)
            {
                return (float)Math.Log(num); // natural log; the original comment claimed log2
            }

            private void GenerateTermFrequency()
            {
                for (int i = 0; i < _numDocs; i++)
                {
                    string curDoc = _docs[i];
                    IDictionary freq = GetWordFrequency(curDoc);
                    IDictionaryEnumerator enums = freq.GetEnumerator();
                    _maxTermFreq[i] = int.MinValue;
                    while (enums.MoveNext())
                    {
                        string word = (string)enums.Key;
                        int wordFreq = (int)enums.Value;
                        int termIndex = GetTermIndex(word);
                        _termFreq[termIndex][i] = wordFreq;
                        _docFreq[termIndex]++;
                        if (wordFreq > _maxTermFreq[i]) _maxTermFreq[i] = wordFreq;
                    }
                }
            }

            private void GenerateTermWeight()
            {
                for (int i = 0; i < _numTerms; i++)
                    for (int j = 0; j < _numDocs; j++)
                        _termWeight[i][j] = ComputeTermWeight(i, j);
            }

            private float GetTermFrequency(int term, int doc)
            {
                int freq = _termFreq[term][doc];
                int maxfreq = _maxTermFreq[doc];
                return ((float)freq / (float)maxfreq);
            }

            private float GetInverseDocumentFrequency(int term)
            {
                int df = _docFreq[term];
                return Log((float)(_numDocs) / (float)df);
            }

            private float ComputeTermWeight(int term, int doc)
            {
                float tf = GetTermFrequency(term, doc);
                float idf = GetInverseDocumentFrequency(term);
                return tf * idf;
            }

            private float[] GetTermVector(int doc)
            {
                float[] w = new float[_numTerms];
                for (int i = 0; i < _numTerms; i++)
                    w[i] = _termWeight[i][doc];
                return w;
            }

            public float GetSimilarity(int doc_i, int doc_j)
            {
                float[] vector1 = GetTermVector(doc_i);
                float[] vector2 = GetTermVector(doc_j);
                return TermVector.ComputeCosineSimilarity(vector1, vector2);
            }

            private IDictionary GetWordFrequency(string input)
            {
                //string convertedInput = input.ToLower();
                Tokeniser tokenizer = new Tokeniser();
                String[] words = tokenizer.Partition(input);
                Array.Sort(words);
                String[] distinctWords = GetDistinctWords(words);
                IDictionary result = new Hashtable();
                for (int i = 0; i < distinctWords.Length; i++)
                {
                    result[distinctWords[i]] = CountWords(distinctWords[i], words);
                }
                return result;
            }

            private string[] GetDistinctWords(String[] input)
            {
                if (input == null)
                    return new string[0];
                ArrayList list = new ArrayList();
                for (int i = 0; i < input.Length; i++)
                    if (!list.Contains(input[i])) // N-GRAM SIMILARITY?
                        list.Add(input[i]);
                return Tokeniser.ArrayListToArray(list);
            }

            private int CountWords(string word, string[] words)
            {
                int itemIdx = Array.BinarySearch(words, word);
                if (itemIdx > 0)
                    while (itemIdx > 0 && words[itemIdx].Equals(word))
                        itemIdx--;
                int count = 0;
                while (itemIdx < words.Length && itemIdx >= 0)
                {
                    if (words[itemIdx].Equals(word)) count++;
                    itemIdx++;
                    if (itemIdx < words.Length)
                        if (!words[itemIdx].Equals(word)) break;
                }
                return count;
            }
        }
    }
    Tokeniser.cs:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Collections;
    using System.Text;
    using System.Threading.Tasks;
    using System.Text.RegularExpressions;
    namespace WindowsFormsApplication1.appcode
    {
        class Tokeniser
        {
            public static string[] ArrayListToArray(ArrayList arraylist)
            {
                string[] array = new string[arraylist.Count];
                for (int i = 0; i < arraylist.Count; i++) array[i] = (string)arraylist[i];
                return array;
            }

            public string[] Partition(string input)
            {
                Regex r = new Regex("([ \\t{}():;. \n])");
                //input = input.ToLower();
                String[] tokens = r.Split(input);
                ArrayList filter = new ArrayList();
                for (int i = 0; i < tokens.Length; i++)
                {
                    MatchCollection mc = r.Matches(tokens[i]);
                    if (mc.Count <= 0 && tokens[i].Trim().Length > 0
                        && !StopWordsHandler.IsStopword(tokens[i]))
                        filter.Add(tokens[i]);
                }
                return ArrayListToArray(filter);
            }

            public Tokeniser()
            {
            }
        }
    }
    button3 implements the compare functionality; in its handler I have to report the similarity between the two PDF files as a percentage.
    Please check the code for the similarity between the two PDF files; if there is any problem, please let me know. Please help me.
    Thank you.

    Hi
    Actually, iText is a third-party library for creating PDFs, originally written for Java; iTextSharp is the C# adaptation of that
    library. Questions regarding iText are better asked on the iText forum, as this is a Microsoft forum:
    http://itextpdf.com/support
    Thanks for your understanding.
    Best regards,
    Kristin

  • Portfolio Recovery

    I have this collection account still on my credit report after it was paid in full. I was told they would remove it, and I have now written them yet another GW letter. My question: the DOFD is 12/2009. Should this really still be reporting on my credit? What are your thoughts?
    PORTFOLIO RECOVERY ASSOC
    Riverside Commerce Center
    120 Corporate Blvd Ste 100
    Norfolk, VA 23502-4962
    Account Number: CAPIT-2025601864XXXX
    Status: COLLECTION ACCOUNT
    Account Owner: Individual Account
    High Credit: $339
    Type of Account: Open
    Date Opened: 04/01/2013
    Balance: $0
    Date Reported: 06/24/2015
    Date of Last Payment: 03/2015
    Actual Payment Amount: $47
    Date of Last Activity: N/A
    Date Major Delinquency First Reported: 02/2015
    Months Reviewed: 1
    Creditor Classification: Retail
    Activity Designator: N/A
    Type of Loan: Factoring Company Account (debt buyer)
    Date of First Delinquency: 12/2009
    Comments: Consumer disputes after resolution, Chapter 7 bankruptcy dismissed
    (Credit Limit, Term Duration, Terms Frequency, Amount Past Due, Scheduled Payment Amount, Charge Off Amount, Deferred Payment Start Date, Balloon Payment Amount, Balloon Payment Date, and Date Closed are blank.)
    81-Month Payment History (2013-2015): all reported months NR (not reported)

    Lvlover26 wrote:
    [quotes the post above]
    Your DOFD is December 2009, so this December would be 6 years. I would guess it wouldn't hurt for you to call the CRAs and ask for an early deletion (if the customer service person doesn't understand what you are talking about, ask to speak with a supervisor). If they don't want to early exclude (EE), then I would file a complaint with the BBB and/or CFPB. If you have any documents agreeing to delete, remember to mention that in your complaint and attach a copy. You can also explain to the CRAs that you have a copy if they would like to see it. Good luck!

  • Thesaurus / Dictionary

    Dear All,
    I have a program that does the following:
    1. The user enters keyword(s).
    2. The program surfs the website(s) to bring back the most relevant documents.
    3. Another program extracts all the important words from those documents and calculates the term frequency and the document frequency.
    I need to do the following
    From those extracted words I need to create a thesaurus table in the following form
    Index : Term
    0 : Computer
    0 : Computers
    1 : Data
    2 : Database
    2 : Databases
    Note: similar terms should have the same index.
    Can I use the term frequency / document frequency to come up with the index value?
    Looking for your smart ideas!!!
    Cheers,
    --Vj

    You will probably want to use the java.util.Hashtable
    class to store your "important words". If this is a client application (it does not reside on the server that the HTML files are on), you will need to open a socket on port 80 of the server to request the web pages. Then strip out all HTML tags and store the A HREF targets in a vector. You can then go through the document and look for "important" words, and afterwards visit each of the sites in your vector of links.
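    As a rough sketch of the counting side only (using java.net.URL rather than a raw socket, and a crude regex in place of real tag stripping; WordCounter is a made-up name):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.Hashtable;

    // Fetch a page and count word occurrences in a Hashtable, as suggested above.
    public class WordCounter {
        public static Hashtable countWords(String address) throws Exception {
            Hashtable counts = new Hashtable();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(address).openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                String text = line.replaceAll("<[^>]*>", " "); // crude HTML tag stripping
                for (String word : text.toLowerCase().split("\\W+")) {
                    if (word.length() == 0) continue;
                    Integer c = (Integer) counts.get(word);
                    counts.put(word, c == null ? Integer.valueOf(1) : Integer.valueOf(c + 1));
                }
            }
            in.close();
            return counts;
        }
    }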

  • Thesaurus

    Dear All,
    I have a program that does the following:
    1. The user enters keyword(s).
    2. The program surfs the website(s) to bring back the most relevant documents.
    3. Another program extracts all the important words from those documents and calculates the term frequency and the document frequency.
    I need to do the following
    From those extracted words I need to create a thesaurus table in the following form
    Index : Term
    0 : Computer
    0 : Computers
    1 : Data
    2 : Database
    2 : Databases
    Note: similar terms should have the same index.
    Can I use the term frequency / document frequency to come up with the index value?
    Looking for your smart ideas!!!
    Cheers,
    --Vj

    here's a lightweight class to give you a basic thesaurus...
    import java.util.List;
    import java.util.ArrayList;
    import java.util.Map;
    import java.util.HashMap;
    public class Thesaurus {
      private Map indexWords = new HashMap();
      private Map wordIndexes = new HashMap();

      public void addWord(int index, String word) {
        // Make sure word is not already in Thesaurus
        if (!wordIndexes.containsKey(word)) {
          Integer indexObject = new Integer(index);
          // Place word in word list
          wordIndexes.put(word, indexObject);
          // maintain index of matching words
          List words;
          if (indexWords.containsKey(indexObject)) {
            words = (List) indexWords.get(indexObject);
          } else {
            words = new ArrayList();
            indexWords.put(indexObject, words);
          }
          words.add(word);
        }
      }

      public List getWordMatches(String word) {
        // Check if word is in Thesaurus
        if (wordIndexes.containsKey(word)) {
          Integer indexObject = (Integer) wordIndexes.get(word);
          List words = (List) indexWords.get(indexObject);
          List result = new ArrayList(words);
          result.remove(word);
          return result;
        } else {
          return new ArrayList();
        }
      }

      public static void main(String[] args) {
        Thesaurus t = new Thesaurus();
        t.addWord(0, "Computer");
        t.addWord(0, "Computers");
        t.addWord(1, "Data");
        t.addWord(2, "Database");
        t.addWord(2, "Databases");
        System.out.println(t.indexWords);
        System.out.println(t.wordIndexes);
        List list = t.getWordMatches("Computer");
        System.out.println(list);
      }
    }
    Hope that helps...

  • HashMap memory usage

    Hi,
    I am implementing an indexer/compressor for plain text files (text, query logs, and URL files). The basic skeleton of the indexer is a Huffman codec, plus various add-ons to boost performance.
    Huffman is used on words (Huffword); the first operation I execute is a complete scan of the file to collect term frequencies, which I then use to generate the Huffman model. Frequencies are stored in a HashMap<String, Integer>.
    The main problem is the HashMap's size; I quickly run out of memory.
    In a 300MB query log I collect around 1,700,000 String-Integer pairs; is it possible that I need a 512MB heap?
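    For a rough sense of where the memory goes, here is the counting loop in question with back-of-envelope sizes in the comments; the per-object sizes are typical JVM estimates, not measurements:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;

    // Term-frequency collection as described above. For ~1,700,000 distinct
    // words, each entry costs roughly: HashMap entry (~24 bytes) + Integer
    // (~16 bytes) + String header and char[] (~40 bytes plus 2 bytes per char).
    // At an average word length of ~10 chars that is ~120 bytes per entry,
    // i.e. on the order of 200 MB before any other allocations, so a heap in
    // the few-hundred-MB range is plausible. (Estimates, not measurements.)
    public class FreqCounter {
        public static HashMap<String, Integer> count(String path) throws Exception {
            HashMap<String, Integer> freq = new HashMap<String, Integer>();
            BufferedReader in = new BufferedReader(new FileReader(path));
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    if (word.length() == 0) continue;
                    Integer c = freq.get(word);
                    freq.put(word, c == null ? 1 : c + 1);
                }
            }
            in.close();
            return freq;
        }
    }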

    >
    Huffman is used on words (Huffword); the first operation I execute is a complete scan of the file to collect term frequencies, which I then use to generate the Huffman model. Frequencies are stored in a HashMap<String, Integer>.
    The main problem is the HashMap's size; I quickly run out of memory.
    In a 300MB query log I collect around 1,700,000 String-Integer pairs; is it possible that I need a 512MB heap?
    >
    Answer to your question: yes, if you are consuming lots of memory, you need lots of heap.
    Answer to the question you didn't ask: with that many unique words, attempting to assign each word a Huffman code will make your file larger. Huffman codes are only useful when you have a relatively small vocabulary in which an even smaller number of terms predominates. This allows you to use a small number of bits for the frequently-occurring items and a large number of bits for the rarely-occurring items.
    In your case you're going to have an extremely broad tree, with most of the terms being leaf nodes. If I'm remembering correctly, a leaf node will take log2(x) + N bits (where N accounts for the non-overlapping leading bits of the few predominant words), so 24+ bits per word. Plus, you have to store your entire dictionary in the file to be used for reconstruction.
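    A quick sanity check of that estimate (the 1,700,000 figure comes from the question above):

    // A Huffman leaf for a rarely-occurring word in a ~1.7M-word vocabulary
    // needs at least about log2(1,700,000) bits, before the extra prefix
    // bits N mentioned above are added.
    public class CodeLength {
        public static void main(String[] args) {
            double bits = Math.log(1_700_000) / Math.log(2);
            System.out.printf("minimum leaf depth ~ %.1f bits%n", bits); // ~20.7
        }
    }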

  • Performance problem

    Someone please help me. I am suffering a performance problem with my code below. I would highly appreciate it if anyone could help me optimize it. It took 4.7 hours to run 150,000 records over 5 runs.
    /**
     * <p>Title: COllaborative REcommender for SEARCH</p>
     * <p>Description: Collaborative recommendation for users by clustering user queries in a community</p>
     * <p>Copyright: Copyright (c) 2003</p>
     * <p>Company: </p>
     * @author Chan Soe Win, Nyein
     * @version 1.0
     */
    import java.sql.*;
    import java.io.*;
    import java.util.*;
    import java.util.StringTokenizer;
    import java.math.*;
    public class ClusterGene {
        static String DBUrl = "jdbc:mysql://localhost/corec";
        static Connection Conn = null;
        static Statement Stmt = null;
        static PreparedStatement getDt = null;
        static PreparedStatement countCluster = null;
        static PreparedStatement insCenter = null;
        static PreparedStatement updCenter = null;
        static PreparedStatement insCluster = null;
        static PreparedStatement selectCenter = null;
        static PreparedStatement countClusterMember = null;
        static PreparedStatement getClusterMember = null;
        static PreparedStatement redefCluster = null;
        static int lastClusterID;

        public ClusterGene() {
            try {
                Class.forName("com.mysql.jdbc.Driver").newInstance();
                Conn = DriverManager.getConnection(DBUrl);
                Conn.setAutoCommit(true);
                Stmt = Conn.createStatement();
                getDt = Conn.prepareStatement("SELECT Dt FROM tbl_url WHERE id=? AND type=?");
                countCluster = Conn.prepareStatement("SELECT DISTINCT cluster_id FROM tbl_center");
                insCenter = Conn.prepareStatement("INSERT INTO tbl_center(cluster_id, url_id,type_id,lf) VALUES(?,?,?,?)");
                insCluster = Conn.prepareStatement("INSERT INTO tbl_cluster(cluster_id,query_id,sim_query,sim_title,sim_snippet,sim_result,sim_outlink,sim_inlink) VALUES (?,?,?,?,?,?,?,?)");
                updCenter = Conn.prepareStatement("UPDATE tbl_center SET lf=? WHERE cluster_id=? AND url_id=? AND type_id=?");
                selectCenter = Conn.prepareStatement("SELECT url_id, type_id,lf FROM tbl_center WHERE cluster_id=? AND url_id=? AND type_id=?");
                countClusterMember = Conn.prepareStatement("SELECT count(DISTINCT(query_id)) FROM tbl_cluster WHERE cluster_id=?");
                getClusterMember = Conn.prepareStatement("SELECT url_id,type_id,lf FROM tbl_lf WHERE query_id=?");
                redefCluster = Conn.prepareStatement("UPDATE tbl_center SET lf=? WHERE cluster_id=?");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String args[]) {
            ClusterGene cluster = new ClusterGene();
            int clusterid = 0;
            double currentlf = 0.0;
            int termid, typeid;
            double lf;
            double sim_content, sim_link, sim_hybrid, sim_query, sim_title, sim_snippet, sim_result, sim_outlink, sim_inlink;
            // NOTE: lfidf is a small helper bean (not shown in the post) with
            // fields query_id, url_id, type_id and lfidf.
            lfidf lfidf = new lfidf();
            lfidf member = new lfidf();
            lfidf center = new lfidf();
            ArrayList lfidfArray = new ArrayList();
            HashMap lfidfmap = new HashMap();
            double lfsquare1, lfsquare2, lfsquare3, lfsquare4, lfsquare5, lfsquare6;
            double sumq1q2type1, sumq1q2type2, sumq1q2type3, sumq1q2type4, sumq1q2type5, sumq1q2type6;
            long totaltimetaken = 0;
            // select a new query
            for (int j = 0; j < 10; j++) {
                for (int k = 1; k < 101; k++) {
                    long start = System.currentTimeMillis();
                    System.out.print(k + ",");
                    lfsquare1 = 0; lfsquare2 = 0; lfsquare3 = 0;
                    lfsquare4 = 0; lfsquare5 = 0; lfsquare6 = 0;
                    sim_content = 0.0; sim_link = 0.0; sim_hybrid = 0.0;
                    sim_query = 0.0; sim_title = 0.0; sim_snippet = 0.0;
                    sim_result = 0.0; sim_outlink = 0.0; sim_inlink = 0.0;
                    try {
                        // read a case (a query) from the query table with term frequencies
                        String sql = "SELECT url_id, type_id, lf FROM tbl_lf WHERE query_id=" + k;
                        ResultSet rs = Stmt.executeQuery(sql);
                        typeid = 0;
                        lfidfmap = new HashMap();
                        while (rs.next()) {
                            lfidf = new lfidf();
                            termid = rs.getInt("url_id");
                            typeid = rs.getInt("type_id");
                            lfidf.query_id = k;
                            lfidf.url_id = termid;
                            lfidf.type_id = typeid;
                            lf = rs.getDouble("lf");
                            getDt.setInt(1, termid);
                            getDt.setInt(2, typeid);
                            ResultSet dtrs = getDt.executeQuery();
                            dtrs.next();
                            lfidf.lfidf = log2(1 + lf) * log2(10000 / dtrs.getDouble("Dt"));
                            Integer key = new Integer(termid);
                            lfidfmap.put(key, lfidf);
                            if (typeid == 1) {
                                lfsquare1 += lfidf.lfidf * lfidf.lfidf;
                            } else if (typeid == 2) {
                                lfsquare2 += lfidf.lfidf * lfidf.lfidf;
                            } else if (typeid == 3) {
                                lfsquare3 += lfidf.lfidf * lfidf.lfidf;
                            } else if (typeid == 4) {
                                lfsquare4 += lfidf.lfidf * lfidf.lfidf;
                            } else if (typeid == 5) {
                                lfsquare5 += lfidf.lfidf * lfidf.lfidf;
                            } else if (typeid == 6) {
                                lfsquare6 += lfidf.lfidf * lfidf.lfidf;
                            }
                        } // end while (rs.next())
                        // compare with all existing cluster centers
                        boolean newseed = true;
                        try {
                            ResultSet clcountrs = countCluster.executeQuery();
                            while (clcountrs.next()) {
                                // select a cluster center
                                clusterid = clcountrs.getInt("cluster_id");
                                lastClusterID = clusterid;
                                lfidf clfidf = new lfidf();
                                int cltermid, cltypeid;
                                double cllfidf;
                                sumq1q2type1 = 0; sumq1q2type2 = 0; sumq1q2type3 = 0;
                                sumq1q2type4 = 0; sumq1q2type5 = 0; sumq1q2type6 = 0;
                                double cllfsquare1 = 0, cllfsquare2 = 0, cllfsquare3 = 0;
                                double cllfsquare4 = 0, cllfsquare5 = 0, cllfsquare6 = 0;
                                try {
                                    String clqry = "SELECT url_id, type_id,lf FROM tbl_center WHERE cluster_id=" + clusterid;
                                    ResultSet clrs = Stmt.executeQuery(clqry);
                                    while (clrs.next()) {
                                        clfidf = new lfidf();
                                        cltermid = clrs.getInt("url_id");
                                        cltypeid = clrs.getInt("type_id");
                                        cllfidf = clrs.getDouble("lf");
                                        getDt.setInt(1, cltermid);
                                        getDt.setInt(2, cltypeid);
                                        ResultSet cldtrs = getDt.executeQuery();
                                        cldtrs.next();
                                        Integer tindex = new Integer(cltermid);
                                        if (lfidfmap.containsKey(tindex)) {
                                            clfidf = (lfidf) lfidfmap.get(tindex);
                                            if (clfidf != null) { // was: !clfidf.equals(null), which can never be true
                                                if (clfidf.type_id == 1) {
                                                    sumq1q2type1 += cllfidf * clfidf.lfidf;
                                                    cllfsquare1 += cllfidf * cllfidf;
                                                }
                                                if (clfidf.type_id == 2) {
                                                    sumq1q2type2 += cllfidf * clfidf.lfidf;
                                                    cllfsquare2 += cllfidf * cllfidf;
                                                }
                                                if (clfidf.type_id == 3) {
                                                    sumq1q2type3 += cllfidf * clfidf.lfidf;
                                                    cllfsquare3 += cllfidf * cllfidf;
                                                }
                                                if (clfidf.type_id == 4) {
                                                    sumq1q2type4 += cllfidf * clfidf.lfidf;
                                                    cllfsquare4 += cllfidf * cllfidf;
                                                }
                                                if (clfidf.type_id == 5) {
                                                    sumq1q2type5 += cllfidf * clfidf.lfidf;
                                                    cllfsquare5 += cllfidf * cllfidf;
                                                }
                                                if (clfidf.type_id == 6) {
                                                    sumq1q2type6 += cllfidf * clfidf.lfidf;
                                                    cllfsquare6 += cllfidf * cllfidf;
                                                }
                                            } // if (clfidf != null)
                                        }
                                    } // end while (clrs.next())
                                } catch (Exception e) {
                                    e.printStackTrace();
                                }
                                if (cllfsquare1 > 0 && lfsquare1 > 0) {
                                    sim_query = sumq1q2type1 / Math.sqrt(cllfsquare1 * lfsquare1);
                                } else {
                                    sim_query = 0.0;
                                }
                                if (cllfsquare2 > 0 && lfsquare2 > 0) {
                                    sim_title = sumq1q2type2 / Math.sqrt(cllfsquare2 * lfsquare2);
                                } else {
                                    sim_title = 0.0;
                                }
                                if (cllfsquare3 > 0 && lfsquare3 > 0) {
                                    sim_snippet = sumq1q2type3 / Math.sqrt(cllfsquare3 * lfsquare3);
                                } else {
                                    sim_snippet = 0.0;
                                }
                                if (cllfsquare4 > 0 && lfsquare4 > 0) {
                                    sim_inlink = sumq1q2type4 / Math.sqrt(cllfsquare4 * lfsquare4);
                                } else {
                                    sim_inlink = 0.0;
                                }
                                if (cllfsquare5 > 0 && lfsquare5 > 0) {
                                    sim_outlink = sumq1q2type5 / Math.sqrt(cllfsquare5 * lfsquare5);
                                } else {
                                    sim_outlink = 0.0;
                                }
                                if (cllfsquare6 > 0 && lfsquare6 > 0) {
                                    sim_result = sumq1q2type6 / Math.sqrt(cllfsquare6 * lfsquare6);
                                } else {
                                    sim_result = 0.0;
                                }
                                sim_content = ((1.0 / 3.0) * sim_query) + ((1.0 / 3.0) * sim_title) + ((1.0 / 3.0) * sim_snippet);
                                sim_link = ((1.0 / 3.0) * sim_result) + ((1.0 / 3.0) * sim_outlink) + ((1.0 / 3.0) * sim_inlink);
                                sim_hybrid = (0.75 * sim_content) + (0.25 * sim_link);
                                if (sim_hybrid > 0.25) {
                                    newseed = false;
                                    // (an insCluster / countClusterMember block was commented out here in the original)
                                    // if the number of members changes, redefine the cluster center
                                    for (Iterator f = lfidfmap.keySet().iterator(); f.hasNext();) {
                                        Object key = f.next();
                                        member = (lfidf) lfidfmap.get(key);
                                        getDt.setInt(1, member.url_id);
                                        getDt.setInt(2, member.type_id);
                                        ResultSet centrs = getDt.executeQuery();
                                        centrs.next();
                                        double centerlfidf = log2(1 + member.lfidf) * log2(10000 / centrs.getDouble("Dt"));
                                        selectCenter.setInt(1, clusterid);
                                        selectCenter.setInt(2, member.url_id);
                                        selectCenter.setInt(3, member.type_id);
                                        ResultSet selectRS = selectCenter.executeQuery();
                                        if (selectRS.next()) {
                                            currentlf = selectRS.getDouble("lf") + centerlfidf;
                                            try {
                                                updCenter.setDouble(1, currentlf / 2);
                                                updCenter.setInt(2, clusterid);
                                                updCenter.setInt(3, member.url_id);
                                                updCenter.setInt(4, member.type_id);
                                                updCenter.executeUpdate();
                                            } catch (Exception e) {
                                                e.printStackTrace();
                                            }
                                        } else {
                                            try {
                                                insCenter.setInt(1, clusterid);
                                                insCenter.setInt(2, member.url_id);
                                                insCenter.setInt(3, member.type_id);
                                                insCenter.setDouble(4, centerlfidf);
                                                insCenter.executeUpdate();
                                            } catch (Exception e) {
                                                e.printStackTrace();
                                            }
                                        }
                                    } // end of for iterator
                                } // end of if (sim_hybrid > 0.25)
                            } // end of while (clcountrs.next())
                            // if the new case does not fit into any existing cluster, it becomes a new cluster itself
                            if (newseed) { // was: if (newseed = true), an assignment that is always true
                                String lfsql = "SELECT query_id, url_id,type_id,lf FROM tbl_lf WHERE query_id=" + k;
                                ResultSet lfrs = Stmt.executeQuery(lfsql);
                                int count = 0;
                                double centerlfidf = 0.0;
                                while (lfrs.next()) {
                                    count++;
                                    getDt.setInt(1, lfrs.getInt("url_id"));
                                    getDt.setInt(2, lfrs.getInt("type_id"));
                                    ResultSet centerrs = getDt.executeQuery();
                                    centerrs.next();
                                    centerlfidf = log2(1 + lfrs.getDouble("lf")) * log2(10000 / centerrs.getDouble("Dt"));
                                    if (centerlfidf > 0) {
                                        insCenter.setInt(1, k);
                                        insCenter.setInt(2, lfrs.getInt("url_id"));
                                        insCenter.setInt(3, lfrs.getInt("type_id"));
                                        insCenter.setDouble(4, centerlfidf);
                                        insCenter.execute();
                                        newseed = false;
                                    }
                                }
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    totaltimetaken += ((System.currentTimeMillis() - start) / 1000);
                } // end of for loop k
            } // end of for loop j
            System.out.println("Total time taken is " + totaltimetaken + " seconds");
        }

        private static double log2(double d) {
            return Math.log(d) / Math.log(2.0);
        }
    }

    I haven't tried understanding the flow of your program, just giving you database & JDBC tips...
    1. Try creating database indexes for the fields in your SQL WHERE clauses; this may improve querying.
    2. If you've got a huge number of SQL insert statements, then you should look at auto-commit.
    By default, the database connection sets auto-commit to on.
    So, for every single SQL insert or update, a database commit is performed.
    If you're inserting/updating a huge batch of records, try setting auto-commit to off, then explicitly call
    commit() after maybe 50 inserts.
    We got a serious performance gain from that tip recently.
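    As a minimal sketch of that tip applied to the insCenter statement from the code above (a method meant to live in the same class; the batch size of 50 and the double[] row layout are arbitrary choices for illustration):

    // Commit once per 50 inserts instead of once per statement.
    static void insertCenters(java.util.List<double[]> rows) throws java.sql.SQLException {
        Conn.setAutoCommit(false);
        int pending = 0;
        for (double[] r : rows) { // r = {clusterId, urlId, typeId, lf} (hypothetical shape)
            insCenter.setInt(1, (int) r[0]);
            insCenter.setInt(2, (int) r[1]);
            insCenter.setInt(3, (int) r[2]);
            insCenter.setDouble(4, r[3]);
            insCenter.executeUpdate();
            if (++pending % 50 == 0) Conn.commit(); // flush every 50 inserts
        }
        Conn.commit(); // commit the remainder
        Conn.setAutoCommit(true);
    }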
    As mentioned, I didn't look at the flow/logic of your code, just the DB stuff.
    regards,
    Owen

  • Configuration of various retention goals

    Hi All,
    I am trying to configure Exchange backup on DPM2012R2 with below retention goals:
    1. Short term protection on disk - daily with retention period of 7 days.
    2. Long term protection on tape - weekly with retention period of 4 weeks.
    3. Long term protection on tape - monthly with retention period of 99 years.
    For disk backup, I have selected express full backup daily with retention period of 7 days.
    For tape backup, I have configured 2 goals as below:
    However, the configuration summary displays the long-term frequency as weekly with a retention period of 99 years.
    How do I ensure that both goals are met? I could then segregate the weekly and monthly backups based on the retention period, and this would help me optimize the number of tapes required for backup.
    Thanks,
    Swapnil.
    Regards, Swapnil Malpani

    Hi,
    You have configured the recovery goals correctly; the summary is picking the frequency of weekly because that is the frequency of the first goal, which should be the most frequent.
    You can run a PowerShell script to see when each tape goal will run next.
    The script to list the schedules is in the forum thread below (it's the 2nd script listed):
    # This script will list all currently scheduled backup-to-tape jobs
    # It will list scheduled, last run and next run dates
    Tape backups run at the wrong time:
    http://social.technet.microsoft.com/Forums/en-US/dpmtapebackuprecovery/thread/4fafbcb0-ac2c-4867-8434-31f1f5e532e0
    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT]
    This posting is provided "AS IS" with no warranties, and confers no rights.

  • Reading the Index

    Hello,
    I am running some tests on an index created by Oracle Text and need to know how to extract data from the index.
    Specifically, I am trying to determine the longest posting entry (the term that appears in the most documents), the total occurrences of the longest posting entry, and the number of distinct terms in the index.
    On a side note I am also unsure on how to determine the size of an index.
    Thanks for helping the less experienced!

    Hi
    Try querying the DR$<index_name>$I table which has the following columns:
    TOKEN_TEXT,
    TOKEN_TYPE,
    TOKEN_FIRST,
    TOKEN_LAST,
    TOKEN_COUNT,
    TOKEN_INFO
    TOKEN_COUNT gives you the document frequency (i.e. the number of documents in which a term appears).
    The number of rows in the table gives you the number of distinct terms in the index but be careful if you've used theme indexing as these terms are included as well.
    By total occurrences, do you mean the total term frequency, i.e. the total number of occurrences of a term across all documents including where a term has multiple occurrences in the same document? If so, I think that information is in the TOKEN_INFO column but it's a BLOB. Has anyone been able to decode it?
    Brian
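    As a sketch of those two lookups over JDBC, assuming an index named MYIDX (so the table is DR$MYIDX$I) and placeholder connection details; TOKEN_COUNT is summed per token in case its postings span several $I rows:

    import java.sql.*;

    // Query the $I table described above for the longest posting entry and
    // the number of distinct terms. Adjust the index name to your own.
    public class IndexStats {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//localhost:1521/XE", "scott", "tiger"); // placeholders
            Statement stmt = conn.createStatement();
            // term appearing in the most documents (highest document frequency)
            ResultSet rs = stmt.executeQuery(
                    "SELECT token_text, SUM(token_count) FROM dr$myidx$i " +
                    "GROUP BY token_text ORDER BY 2 DESC");
            if (rs.next()) {
                System.out.println("longest posting: " + rs.getString(1)
                        + " in " + rs.getInt(2) + " docs");
            }
            // number of distinct terms (theme tokens are counted too, as noted above)
            rs = stmt.executeQuery("SELECT COUNT(DISTINCT token_text) FROM dr$myidx$i");
            if (rs.next()) {
                System.out.println("distinct terms: " + rs.getInt(1));
            }
            conn.close();
        }
    }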

  • Help Please....CFPB Reply on Portfolio Recovery Loan Type: Factoring Company Complaint

    I sent a CFPB complaint that PRA has listed 2 accounts on my husband's CR as Loan Type: Factoring Company, and they are doing NOTHING about it! I thought that a COLLECTION could not be listed as a factoring account! That's what I've been reading on this forum. Plus, I've read that a junk debt buyer is not a factoring company! His debts were not in good standing when purchased, 4 years after the charge-off!
    Can someone please help me and point me in the right direction?
    Here's PRA and CFPB's response:
    Portfolio Recovery Associates said:
    Explanation of closure: PRA purchased the Capital One Bank (USA), N.A. ("Capital One") MasterCard credit card account ending in 8273 from Capital One on or about February 25, 2015. Business records provided to PRA by Capital One at the time of our purchase verify that the account was opened on November 10, 2010, by Brian Roberts whose social security number ends in **** and that a balance of $772.26 was due on the account at the time of PRA's purchase. PRA purchased the Capital One Bank (USA), N.A. ("Capital One") MasterCard credit card account ending in 5355 from Capital One on or about March 27, 2015. Business records provided to PRA by Capital One at the time of our purchase verify that the account was opened on March 23, 2011, by Brian Roberts whose social security number ends in **** and that a balance of $444.52 was due on the account at the time of PRA's purchase. PRA sent its initial notification letters to you on or about March 5, 2015 regarding the PRA account ending in 8273 and on or about April 2, 2015, regarding the PRA account ending in 5355. In response to your complaint, PRA has sent letters containing validation on or about July 28, 2015 regarding the PRA accounts ending in 5355 and 8273.
    CFPB Response
    In response to your complaint, Portfolio Recovery Associates, LLC ("PRA") verified the PRA accounts ending in 5355 and 8273. We believe that no further steps in response to your complaint or follow-up actions are required at this time.


  • Increase counter frequency performance

    Hello,
    I want to increase the frequency performance of my period counter. I'm using a USB-6210 board; the attached VI does the period measurement.
    The problem is that I want to measure the period of an 8 MHz signal (I know that is a lot; I would be happy even with 4 MHz). The source frequency for the counter is 80 MHz. If the frequency is high, the accuracy is not very critical for me.
    1. Most of the time I get the error "Buffer overwritten". I've seen that I can get rid of it by decreasing the frequency, but I don't want to do that. I think another solution would be to increase the number of points that are read. I noticed that the maximum buffer size is around 9000 points (I read it with a DAQmx Read property node).
    2. Another thing I've noticed is that the while loop doing the data reading should contain no other operations or delays. Is this true, or just a coincidence?
    3. There is a strange behavior: if I start the acquisition with a high-frequency input, I get the "Buffer overwritten" error almost instantly. If I start the acquisition at a low frequency, I can then raise the input even to high frequencies.
    4. There is another strange behavior: at high input frequencies, the measured frequency and the measured period increase and decrease together. I think this is caused by aliasing. Where can I find more information about the board's limits?
    If you can give me any other advice/hints/links/PDFs, I would be very thankful.
    There may be some small mistakes in the VI; I made it only to get a feel for what I'm doing, and I didn't check it with the hardware.
    Regards,
    Paul
    Attachments:
    example.vi 32 KB

    Paul,
    I have added some comments to your questions below.
    Regards,
    Jochen 
    KPanda wrote:
    Jochen,
    thanks for this information. This is what I had been looking for for quite a while.
    I still have a question related to this topic: I've read that the maximum size of the FIFO is 1024 samples. What does that mean?
    [JK:] The FIFO is the hardware buffer on the board. In general the PCI-bus or the USB should have enough bandwidth to transfer the data as fast as they are acquired by the device, but in fact there are sometimes some latencies that require some local memory on the board. That's what is called FIFO in this context.
    Is this FIFO the same as the Available Samples Per Channel value from the DAQmx Read property node?
    [JK:]  No. This value refers to the buffer in the PC's memory that is allocated for the acquisition operation.
    I've noticed that when the value of this property passes 9000, I get the Buffer Overwritten error. If that is the limit, why do I reach more than 9000 samples per channel? Please take a look at the attachments (test1.png - a screenshot with the values / speed_test_x - the VI that I used for this measurement).
    [JK:] The buffer size is not limited to 9000 values. NI-DAQmx allocates memory automatically by default. If you like you can increase the buffer size manually.
    What is the relation to the maximum number of samples that can be read with Counter 1D Read N Samples? In my VI, N = 250 samples. Can I increase it in order to avoid the error? If so, what should the maximum limit be - 1024?
     [JK:] You can increase the number of values to read up to the size of the buffer (not of the FIFO). A reasonable value is up to 50% of the buffer size, but this is not a strict rule. Anything between 10% and 90% could make sense, depending on the timing requirements of your application.
    Paul
    PS: I hope I translated the LabVIEW terms correctly. My LabVIEW is in German (but I don't know German, so it is a nightmare for me).
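    A quick back-of-the-envelope calculation, using the figures from this thread (8 MHz input, a 9000-sample buffer, 250-sample reads), shows why the buffer overflows almost instantly at high input frequencies. The numbers are Paul's; the arithmetic is only a hedged sketch:

    public class BufferBudget {
        public static void main(String[] args) {
            double inputHz = 8_000_000.0; // one period measurement arrives per input cycle
            int bufferSamples = 9_000;    // observed PC-side buffer size
            int readChunk = 250;          // N in the Counter 1D Read N Samples call

            // Time until the buffer is full if nothing is read out:
            double fillMs = bufferSamples / inputHz * 1e3;
            System.out.printf("Buffer fills in %.3f ms%n", fillMs); // ~1.125 ms

            // Average time budget per read call before data is overwritten:
            double budgetUs = readChunk / inputHz * 1e6;
            System.out.printf("Read budget per chunk: %.2f us%n", budgetUs); // ~31.25 us
        }
    }

    A loop iteration every ~31 us is not realistic over USB, which is why enlarging the buffer and reading much larger chunks (Jochen's 10% to 90% of the buffer size) is the practical fix.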

  • Payment terms for Advances in Sales & Distribution

    Hi friends !!
    1) You may know that we have payment terms like, for example, 30% advance and the remaining payment after delivery.
    In such a scenario, the customer needs to pay the advance while the sales order is being created. He then needs an acknowledgement or bill for that amount. How and where do we configure that?
    2) If the advance amount is received at sales order level, do we need to issue the invoice for the full amount or for the balance amount?
    Can anyone please help me understand the logic?
    Thanks & Regards,
    Sreedhar.
    Email: [email protected]

    hi,
    If it is a real-estate-based scenario, please follow this procedure:
    You have to define conditions for sales-based rent in order to be able to post sales-based rent. Also, if advance payments were made for sales-based rent or a minimum sales-based rent is defined, conditions allow you to offset these against the sales-based rent to be paid.
    Prerequisites
    You defined the necessary condition types in Customizing and assigned them attributes for sales-based rent. In Customizing choose:
    SAP Customizing Implementation Guide - Flexible Real Estate Management - Conditions and Flows - Condition Types
    You set the Relevant to Sales indicator on the General Data tab page in the real estate contract.
    Procedure
    1. On the Conditions tab page, assign the necessary conditions to the contract. The following condition types are relevant for sales-based rent settlement:
    · Advance Payment for Sales-Based Rent
    One-time or recurring condition for sales-based rent; requires actual rent as condition purpose. Advance payments made by the tenant are offset against the sales-based rent resulting from the sales-based rent agreement.
    · Minimum Rent
    Minimum amount of sales-based rent to be paid. This is usually a condition with the condition purpose actual rent, but can also be used as a statistical condition.
    · Maximum Rent
    Maximum amount of sales-based rent to be paid (cap). Maximum rent is usually represented by one or more statistical conditions. The maximum amount is the sum of the conditions, limited to the given settlement period.
    · Sales-Based Rent (Required for Every Sales Rule)
    This condition defines how the receivable or credit memo is posted. It is not possible to assign a unit price to this condition, since the posting amount is determined by the sales-based rent settlement. The calculation formula always has to be sales-based rent (see: Customizing settings). The system generates a cash flow item for the posting that reflects the posting period and settlement period. For this reason, you do not enter a frequency for this condition. Instead, it is defined as a one-time condition.
    The Sales-Based Rent condition is also used to specify which objects of the sales-based rent agreement form the basis for sales-based rent. This makes it possible to determine sales-based rent using measurements as a basis (for example, per square foot of retail space). If the condition has a distribution rule, then sales are distributed to the objects using this rule.
    Note that the system uses only the last distribution formula in the settlement period. The system does not distribute proportionally within the settlement period if there is a change to the distribution formula.
    2. On the Terms subtab, assign the same sales rule to all advance payment conditions for sales-based rent, to all minimum sales-based rent conditions, and to just one Sales-Based Rent condition.
    You have to assign a single Sales-Based Rent condition to each sales rule.
    If a contract contains more than one sales rule, you have to assign that many Sales-Based Rent conditions in order to be able to perform settlement. You cannot use advance payment or minimum sales-based rent conditions for settlement. A sales rule can be assigned any number of advance payment conditions (including none) and any number of minimum sales-based rent conditions (including none).
    regards,
    Siddharh.
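    To make the offsetting logic concrete, here is a simplified, hedged sketch of how a settlement combines these condition types. The figures and the method are illustrative only, not RE-FX internals:

    public class SalesBasedRentSettlement {
        // Simplified settlement: rent = sales * rate, clamped to [minRent, maxRent],
        // with advance payments offset against the result. A positive return value
        // is a receivable; a negative one would be a credit memo.
        static double settle(double reportedSales, double rate,
                             double minRent, double maxRent, double advancesPaid) {
            double rent = reportedSales * rate;
            rent = Math.max(rent, minRent);  // minimum sales-based rent
            rent = Math.min(rent, maxRent);  // maximum rent (cap)
            return rent - advancesPaid;      // offset advance payments
        }

        public static void main(String[] args) {
            // Illustrative period: $500,000 sales at 2%, minimum $8,000, cap $15,000,
            // $9,000 already paid in advances -> $1,000 receivable.
            System.out.println(settle(500_000, 0.02, 8_000, 15_000, 9_000));
        }
    }

    With $500,000 in reported sales at a 2% rate, the rent of $10,000 falls between the minimum ($8,000) and the cap ($15,000); offsetting $9,000 of advance payments leaves a $1,000 receivable.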

  • Frequency response requirements for headphones with CMSS on XFi ???

    Hi,
    I would like to know if someone could tell me what kind of headphones are suitable for the CMSS mode with the X-Fi.
    I mean: flat response, free-field correction, or diffuse-field correction.
    Applying HRTF filtering should mean that headphones with a flat response are the best option (the same configuration as for binaural recordings).
    But I strongly doubt the Creative team expects customers to own such a pair of headphones, as they are rather for scientific use (psychoacoustics, audiology, etc.).
    So, if we look at the technical solutions for a wide audience, we have two options (FF correction and DF correction). There is a catch here, because these corrections are intended to reproduce some of the effects of the HRTF (for two different environmental configurations of HRTF measurement). That is why the frequency response of most headphones has a notch between 4 kHz and 10 kHz.
    To simplify: if we listen to binaural sounds with conventional headphones, the effect of the outer pinna is applied twice.
    So I guess Creative has implemented some kind of normalization/equalization/correction process to deal with the non-flat frequency response of headphones, but does anyone know whether they chose diffuse-field or free-field correction?
    This might seem like a detail, but the issue can be very important for the accurate localisation and coloration of 3D sounds with headphones.
    Thank you, and please forgive my English!

    The only possibility I can think of is that 2/2.1 mode is NOT as simple as headphone mode with crosstalk cancellation. Perhaps the HRTF only kicks in for sound sources outside of the arc directly in front of the listener. If that were the case, you wouldn't perceive any distortion for sound sources in front of you.
    Also, you are wrong regarding DirectSound3D. Keep in mind that Direct3D and DirectSound3D are not the same. The whole point of OpenAL and DirectSound3D is that they present an API to the programmer through which there is NO specification of the number of speakers. When using OpenAL or DirectSound3D, the only thing a programmer can do is specify the location of a mono sound source in 3D space relative to the listener. The speaker settings for your DirectSound3D or OpenAL device then determine how this sound is "rendered" by the sound card. It is not under the control of the game. For example, if you have 5.1 speakers and the 3D position is behind you, the SOUND CARD will make the decision to use the rear speakers. If you use headphones, the SOUND CARD will decide to apply an HRTF to create the illusion of a rear sound source. The point is that the game does not control how many speakers you will get sound from.
    However, to further complicate the situation, there are SOME games (HL2 is an example) where DirectSound3D is used, BUT the sound output of the game itself IS a function of the Windows speaker settings. This is not how programmers are SUPPOSED to use DirectSound3D. I've written about this countless times. There is a good post on [H]ard|Forum about this. Do an "advanced" search with my username (thomase) looking for the terms "hl2" and "cmss".
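    To illustrate the point about the API: the application only places a mono source in 3D space, and the device decides how to render it (speaker panning or HRTF). Here is a hedged sketch using the LWJGL OpenAL bindings (assuming lwjgl and lwjgl-openal are on the classpath); buffer creation and audio data are omitted, so treat the setup as illustrative:

    import org.lwjgl.openal.*;
    import java.nio.ByteBuffer;
    import java.nio.IntBuffer;

    public class PositionalSourceDemo {
        public static void main(String[] args) {
            // Open the default device and create a context; the device, not the
            // application, decides whether output goes to speakers or HRTF headphones.
            long device = ALC10.alcOpenDevice((ByteBuffer) null);
            long context = ALC10.alcCreateContext(device, (IntBuffer) null);
            ALC10.alcMakeContextCurrent(context);
            AL.createCapabilities(ALC.createCapabilities(device));

            int source = AL10.alGenSources();
            // The only spatial information the application supplies: a mono
            // source position relative to the listener (here: left and behind,
            // since the default listener faces -Z).
            AL10.alSource3f(source, AL10.AL_POSITION, -1.0f, 0.0f, 1.0f);
            // AL10.alSourcei(source, AL10.AL_BUFFER, someMonoBuffer); // omitted here
            AL10.alSourcePlay(source);

            // Rendering happens in the driver/hardware, not in this code.
            AL10.alDeleteSources(source);
            ALC10.alcDestroyContext(context);
            ALC10.alcCloseDevice(device);
        }
    }

    Nothing in this code mentions a speaker count; that is exactly the separation described above.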

Maybe you are looking for

  • Why can't I open a Captivate 8 project in Captivate 7?

    I really need to be able to open a Captivate 8 project back in Captivate 7. I had no idea that once I installed Captivate 8 it would change all of my files. Is there any way I can do this without corrupting my projects?

  • Partial and residual payments (Incoming) through F-28

    Hi, I am trying to understand how partial and residual payments work. 1) I have an invoice of $100 for a customer. I am getting a payment of $30 from the customer. When I am clearing this using the partial payments tab in F-28, it is creating a new document

  • Full screen slide show?

    How do I display photos full screen on a photo page created in iWeb and published to .Mac? No matter how I size the photos in iPhoto, they display too small in a web browser after clicking "start slideshow" on the created web page.

  • KM Repository- Restricting folder hierarchy depth creation

    Hi All, In KM Repository, the requirement is that a user should be able to create, for example, folder B under root folder A, and a new folder C under folder B. But he should not be allowed to create a new folder under folder C. Can we restrict this? Please

  • I reset my apple id and lost access to my previously purchased music.

    I reset my Apple ID password and lost access to my previously purchased music. How do I get those songs back in my cloud?