Naive bayes

Hello,
I'm implementing this and have a few questions. I'm reading the words of the emails from a file. I know that one technique is to remove the 100 most frequent words. That said, I notice that while reading I get a lot of high-frequency tokens like '_', '.', numbers, single letters and other stray characters. So I'm wondering:
- Must I count these characters among the 100 words to remove? If yes, is there a simple way to check for them in one shot (I mean something like: if (myWord == '@_;.') then discard the word)?
- Or must I discard these characters while reading from the file, and then remove the 100 most frequent actual words (real words, I mean, not just characters or numbers)?
Thanks

In reply to mickey0:
How can we know your requirements? Whether or not you consider symbols, numbers, or letters to be words is up to you, or whoever gave you this task. What you do with characters like that is up to you. What do you want us to tell you?
Yes. You should remove them. Does that answer your question?
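
For the "check it in one shot" part of the question, a regular expression is one common option. The sketch below is only an illustration, not something from the thread: it assumes a "real word" is a run of two or more ASCII letters, and the names `tokenize` and `drop_most_frequent` are made up for the example. It discards symbols, numbers and single letters while reading, then removes the k most frequent remaining words.

```python
import re
from collections import Counter

# Assumption: a "real word" is two or more ASCII letters. Adjust the
# pattern if digits, apostrophes or accented letters should count.
WORD_RE = re.compile(r"[a-z]{2,}")

def tokenize(text):
    """Lowercase the text and keep only runs of two or more letters."""
    return WORD_RE.findall(text.lower())

def drop_most_frequent(docs, k=100):
    """Tokenize each document, then remove the k most frequent words."""
    tokenized = [tokenize(doc) for doc in docs]
    counts = Counter(word for words in tokenized for word in words)
    frequent = {word for word, _ in counts.most_common(k)}
    return [[w for w in words if w not in frequent] for words in tokenized]

emails = ["Buy now . _ 1 a cheap pills now", "Meeting at 10 . see you now"]
print(drop_most_frequent(emails, k=1))
```

With this approach the stray characters never enter the frequency table, so the 100 removed entries are all actual words, which corresponds to the second option in the question.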

Similar Messages

Exporting Naive Bayes PMML

I have Oracle 11g and SQL Developer 3.2 installed on my machine. In Oracle Data Miner, I have created a Naive Bayes model. I want to export it in PMML format. I tried using the PL/SQL package DBMS_DATA_MINING, but it exports only Decision Tree models in PMML format. I'm not able to figure it out using the Java API.
Is there a way to export Naive Bayes in PMML format?

Hi,
Only Decision Tree is supported for PMML export.
Of course there is an export/import model API for all Data Mining models; it just is not in PMML format.
There has not been much demand for PMML export, as most users like having the models persisted in the database, where they can easily be used for scoring etc.
What use case do you have for exporting PMML?
BTW, ODM does have some support for PMML import in case that is of interest.
Thanks, Mark

How to classify a document based on Microsoft's Naive Bayes algorithm?

I am quite unfamiliar with SSAS at the moment. I am aware of the classical implementation of Naive Bayes; I have learned about it from here. However, what I am looking for is a complete walkthrough of how to use this particular algorithm with SSAS.
For simplicity, let me assume we are supposed to classify a news write-up as having positive criticism or negative criticism. In positive articles we observe words like good, awesome, super, recommended, love, like, etc. occurring frequently. In negative articles we mostly observe words like bad, poor, unsatisfactory, unsatisfied, pathetic, etc. There are only two possible outcomes (positive or negative), so generalizing on patterns is fairly simple.
To start with, we have a few write-ups with their corresponding outcomes, which are mostly in accordance with the patterns we've generalized above. If we were to do this without the help of a data mining tool, we would do the following:
1) Take the first write-up (assume this one is a positive article).
2) Split the whole write-up into words.
3) Remove the stopwords, like the, this, that, etc. (words that provide grammatical structure to the write-up; they occur frequently, so we get rid of them). We now have a corpus of words.
4) Assign this corpus to the outcome positive. We simply note how many times the outcome positive appears, and also the frequency of the individual words that tend to give the outcome positive.
5) Take the next write-up (assume this one is negative).
6) Repeat steps 2-5, updating the respective frequencies each time.
Once we have looked at all documents, we can actually prepare test cases.
In accordance with the formula above, nc is the number of times good actually gives the outcome positive, p is a prior estimate (= 0.5, since there are only 2 outcomes), and n is the number of times the outcome positive appears in our corpus.
How can I use SSAS to go about verifying these kinds of test cases manually?
I am a bundle of mistakes intertwined together with good intentions
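
The formula the post refers to did not survive on this page. The symbols nc, p and n match the usual m-estimate, P(word | outcome) = (nc + m*p) / (n + m), so the sketch below assumes that is what was meant. The equivalent sample size m, the stopword list and the function names are assumptions made for illustration. It follows the manual procedure described above in plain Python and does not show how SSAS computes anything.

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "this", "that", "is", "a", "was"}  # illustrative only

def train(labelled_docs):
    """Count outcomes (n) and, per outcome, documents containing each word (nc)."""
    outcome_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, outcome in labelled_docs:
        # A set, so each word is counted at most once per write-up.
        words = {w for w in text.lower().split() if w not in STOPWORDS}
        outcome_counts[outcome] += 1
        word_counts[outcome].update(words)
    return outcome_counts, word_counts

def m_estimate(nc, n, p=0.5, m=2):
    """Assumed m-estimate: (nc + m*p) / (n + m); m is a hypothetical choice."""
    return (nc + m * p) / (n + m)

docs = [
    ("this film is good", "positive"),
    ("good and awesome", "positive"),
    ("that was bad", "negative"),
]
outcome_counts, word_counts = train(docs)
n = outcome_counts["positive"]        # write-ups labelled positive
nc = word_counts["positive"]["good"]  # positive write-ups containing "good"
print(m_estimate(nc, n))              # (2 + 2*0.5) / (2 + 2) = 0.75
```

Small hand-checkable numbers like these are what a manual test case would compare against the tool's output.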

You posted such a nice question -- and no one answered it.
I would not use Naïve Bayes for this task. There are three other approaches:
1) Use either Association Rules or Decision Trees to determine associated words inside individual documents (treating each document as a transaction).
http://msdn.microsoft.com/en-us/library/ms175595(v=sql.120).aspx
2) Use Integration Services Term Extraction Transformation and Term Lookup Transformation to create this application in Integration Services (Data Flow).
http://technet.microsoft.com/en-us/library/ms141713.aspx
3) Use Semantic Search.
http://technet.microsoft.com/en-us/library/gg492075.aspx
Mark Tabladillo PhD (MVP, SAS Expert; MCT, MCITP, MCAD .NET) http://www.marktab.net

Naive Bayes in 9i

Hello,
we are exploring 9iDB's Data Mining features. About the Naive Bayes algorithm: we would like to know how the LOG_CONDITIONAL_PROBABILITY attribute in the table RT_(nnnnnn) is computed. RT_(nnnnnn) is the resulting table associated with built models.
We are acquainted with the theory of NB, but we can't figure out the exact formula that is used to calculate this attribute.
Thanks in advance.
Regards,
Mauro Rossotto
http://thinklab.telecomitalialab.com

O9iDM uses a variation on the standard NB formula which takes into consideration a factor for missing values.

ERROR: while executing Naive bayes apply

I get the following error when I try to apply a model using Naive Bayes apply.
The error displayed is "Received an exception in main: MiningApplyResult <applyOutputResult.resultName> does not exist or is invalid".
I use Oracle 9i ODM, and the database release is 9.2.0.1.0.
Please let me know the possible cause and how to resolve it.
Thanks in advance,
Lax.

Can you give the code block that throws this exception?
-Sunil

Naive Bayes Training - CLOB as output data type for JSON string?

Hello everyone,
My training model outputs a large JSON string that doesn't fit into one row, so the string is split across multiple rows. Default - or only possible output data type for that matter - is varchar according to the official documentation on PAL. Is there any chance I could use CLOB for the output?
Regards
Henry

The property you provided is for applying an NB model, not for building one.
For build, you should have used Sample_NaiveBayesBuild property file.
I ran the NB apply sample program with your property, and was able to get past the point where you have problem.
Please double check if you really used the property you provided and if it matches to the program you wanted to use.

Naive Bayes Classifier question

How many folds are being performed when you select cross-fold validation?
Thanks,
Dianna

N-fold validation, where N is the number of cases (records).
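
N-fold validation with N equal to the number of records is usually called leave-one-out: each record is held out once and the model is built on all the others. A minimal sketch of how the folds are formed, with a made-up helper name and no claim about how the tool implements it internally:

```python
def leave_one_out(records):
    """Yield (training_records, held_out_record) once per record."""
    for i, held_out in enumerate(records):
        yield records[:i] + records[i + 1:], held_out

cases = ["r1", "r2", "r3", "r4"]
folds = list(leave_one_out(cases))
print(len(folds))  # one fold per record
for training, held_out in folds:
    print(held_out, training)
```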

A question about NAIVE BAYES

hi there
I can run NaiveBayesBuild.java correctly. But when I run NaiveBayesApply.java, it shows:
model Apply Task Phase:
Invoking NaiveBayes Model apply.
Status: ERROR
Duration: 6 seconds
Display Apply Table Result Phase:
Received an exception in main: MiningApplyResult Sample_NB_APPLY_RESULT does not exist or is invalid.
Using jad, I found that
oracle.dmt.odm.task.MiningTaskStatus:ERROR - 2007-11-30 14:45:03.765 - ORA-20750: ODM_NAIVE_BAYES_APPLY.APPLY: ???? (??) ??, ???? (ORA-01031: insufficient privileges)?
ORA-06512: at "ODM.ODM_NAIVE_BAYES_APPLY", line 129
ORA-06512: at "ODM.ODM_NAIVE_BAYES_APPLY", line 860
ORA-01031: insufficient privileges
ORA-06512: at "ODM.ODM_NAIVE_BAYES_MODEL", line 682
ORA-06512: at line 1
I want to know which privilege I should grant to which user.
many thanks,
liun

Hi Liun,
Check whether you have CREATE TABLE privileges in your account.
You can also find some info at:
http://www.oracle.com/technology/documentation/datamining.html
Thanks
Jim

Classification with Adaptive Bayes Network - What's behind ?

Hello,
what's behind classification with an Adaptive Bayes Network?
Neural networks?
Thank You
Martin Sautter

Adaptive Bayes Network (ABN) is closer to Naive Bayes and Decision Trees than to neural networks. You can find more information on ABN in section 3.1.4 of the Oracle Data Mining Concepts Guide, 10g Release 1 (10.1), Part Number B10698-01, which is available on OTN through http://www.oracle.com/pls/db10g/portal.portal_demo3?selected=6
Hope this helps.
-joe yarmus

How to change the mining type of an attribute?

Quite often, when I build a model with the Naive Bayes algorithm (using the DBMS_DATA_MINING package), I get a model in which some attributes have type VARCHAR2 and a categorical mining type, although the data type of these attributes is NUMBER.
My question is, on what basis does ODM assign mining type to an attribute?
Is that the number of distinct values of an attribute?
Can I change these mining types somehow using PL/SQL?
I'm using ODM 10g R2
Thank you
Luke

Nested Tables and Predictable Attributes

BOL states "The data in a nested table can be used for prediction or for input, or for both. For example, you might have two nested table columns in a model: one nested table column might contain a list of the products that a customer has purchased, while the other nested table column contains information about the customer's hobbies and interests, possibly obtained from a survey. In this scenario, you could use the customer's hobbies and interests as an input for analyzing purchasing behavior, and predicting likely purchases."
However, I cannot find where it states that you cannot use a predictable attribute from a Nested Table with the Neural Network algorithm.
Can you please tell me why I receive the error "Microsoft Neural Network does not support predictable nested tables" when attempting to process my model with one predictable column in a Nested Table, or am I missing something? The same structure works fine for Naive Bayes, Decision Trees, etc., but not for Neural Network or Logistic Regression.

Hi Namak,
Thank you for your question.
I am trying to involve someone more familiar with this topic to take a further look at this issue. Some delay might be expected while the issue is transferred. Your patience is greatly appreciated.
Thank you for your understanding and support.
Regards,
Charlie Liao
TechNet Community Support

Searching pattern in database table. (Number of rows in table: More than 3 cro

Actually, I have a DB table with 2 columns (columnA, time). The table has 70 lakh rows. I have to retrieve all rows whose columnA value is present at least 2 times within an interval of 10 minutes. E.g., the (columnA, time) values are
{a,June-01-2011 10:13:12},{b,June-01-2011 10:14:12},{b,June-01-2011 10:15:12},{c,June-01-2011 10:16:12},{b,June-01-2011 10:17:12},{d,June-01-2011 10:18:12},{d,June-01-2011 10:25:12},{e,June-01-2011 10:26:12},{e,June-01-2011 11:38:12},{f,June-01-2011 10:39:12},{f,June-01-2011 10:43:12},{a,June-01-2011 10:44:12},{f,June-01-2011 10:51:12},{b,June-01-2011 10:51:12},{b,June-01-2011 10:53:12},{c,June-01-2011 10:54:12},{g,June-01-2011 10:55:12},{b,June-01-2011 10:56:12},{b,June-01-2011 10:57:12},{b,June-01-2011 10:58:12}
Then I have to retrieve following rows in output : {b,June-01-2011 10:14:12},{b,June-01-2011 10:15:12},{b,June-01-2011 10:17:12},{d,June-01-2011 10:18:12},{d,June-01-2011 10:25:12},{f,June-01-2011 10:39:12},{f,June-01-2011 10:43:12},{b,June-01-2011 10:51:12},{b,June-01-2011 10:53:12},{b,June-01-2011 10:56:12},{b,June-01-2011 10:57:12},{b,June-01-2011 10:58:12}
Is this related to data mining? I have 3 crore rows in the database table, and I have to search for many patterns of this type in my application. I have spent hours looking for tutorials on Google, but I cannot seem to find anything that holds my hand. Try to be more clear; I'm out of ideas on this problem, even though it sounds like a classic. Can Oracle Java data mining solve the problem?

First of all, thanks for the reply; I will take care of your suggestion.
Number of rows in table: more than 3*10^7 (30 million rows), but it can be more than 120 million.
Actually, I have to display all rows that satisfy the following criterion: a specific value of column A should appear at least 2 times within an interval of 10 minutes.
For example:
Eg 1. {b,June-01-2011 10:14:12},{b,June-01-2011 10:15:12},{b,June-01-2011 10:17:12} (this value appears 3 times within an interval of 10 minutes).
Eg 2. {a,June-01-2011 10:13:12} is present only one time, so I don't want this row in the output.
Eg 3. {e,June-01-2011 10:26:12},{e,June-01-2011 11:38:12} is present 2 times in the DB, but not within an interval of 10 minutes, so I don't want these rows in the output (t1=10:26:12, t2=11:38:12; the difference is more than 10 minutes).
In my example I specified only 2 columns, but in the actual scenario there will be 8-10 columns in the table. This is only one pattern; there are many more patterns that I have to search for in the DB. After satisfying this pattern, the data has to pass many other patterns, and I want the results to be retrieved very fast.
Please don't focus on answering the example above. What I actually want to know is: what should the approach be for this type of problem?
Is this problem related to Oracle data mining, and how?
Can data mining algorithms (Minimum Description Length, Naive Bayes, Apriori, Decision Tree, Non-Negative Matrix Factorization, Support Vector Machine, etc.) solve my problem?
Different types of queries will be fired against this table.
If I am still not clear, then let me know.
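
The stated criterion can also be read as a grouping and sorting problem rather than a model-building one. The sketch below is one possible reading, not an answer given in the thread: it keeps a row when the same columnA value has another row at most 10 minutes before or after it. That rule is an assumption. The expected output in the first post seems to follow a slightly different windowing rule (f at 10:51:12 is 8 minutes after f at 10:43:12 but is not listed), so the condition may need adjusting. The function and helper names are made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def rows_with_repeat(rows, window=timedelta(minutes=10)):
    """Keep rows whose key has another row at most `window` away in time."""
    by_key = defaultdict(list)
    for key, ts in rows:
        by_key[key].append(ts)
    kept = []
    for key, stamps in by_key.items():
        stamps.sort()
        for i, ts in enumerate(stamps):
            # After sorting, only the immediate neighbours need checking.
            near_prev = i > 0 and ts - stamps[i - 1] <= window
            near_next = i + 1 < len(stamps) and stamps[i + 1] - ts <= window
            if near_prev or near_next:
                kept.append((key, ts))
    return sorted(kept, key=lambda row: row[1])

def t(hhmm):
    """Helper for the example: a time on June 1, 2011."""
    return datetime.strptime("2011-06-01 " + hhmm, "%Y-%m-%d %H:%M")

rows = [("a", t("10:13")), ("b", t("10:14")), ("b", t("10:15")),
        ("e", t("10:26")), ("e", t("11:38")), ("a", t("10:44"))]
print(rows_with_repeat(rows))
```

For tens of millions of rows the same comparison of each row with its neighbours would normally be done inside the database rather than in application code; the Python version is only meant to make the rule concrete.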

Predictive Algorithm for Churn analysis

Hi,
Can anybody help me with the algorithm which I can use for churn analysis?
Thanks,
Atul

Hi Atul,
For churn analysis, which is usually treated as a binary classification problem where the customer is either staying or leaving (churning), I would suggest one of the following algorithms:
CNR Decision Tree, which also provides a decision tree to explain which feature split influences the target (churn) the most.
You could also choose one of the R-based Neural Network algorithms; however, the resulting predictive model and results are usually hard to explain.
If need be, you can extend the set of available algorithms by adding your own R functions; there are a lot of examples in this community.
If you have SAP HANA you could also choose:
Decision Trees: C4.5, CHAID or CART (new in SAP HANA SP08).
Other supervised learning algorithms for binary classification: Naive Bayes or SVM (Support Vector Machine).
There are a lot more but this should get you started.
Best regards,
Kurt Holst

Build model - view test result - i have no ROC tab

Dear all,
When I build e.g. a classification SVM (Lin. Reg., Naive Bayes) model and view the test results, the window has only these tabs: Performance, Performance Matrix, Lift, Profit. There is no ROC tab. Do you know why? I had this problem before with SQL Developer 3.2.09.30_x64, and I still have it now with SQL Developer 3.2.20.09.87_x64. When I try to tune the SVM algorithm manually (rather than leaving it automatic), there are only these tabs: Cost, Benefit, Lift, Profit. But no ROC tab :(((
When I use the old Oracle Data Miner software from 2011, version 11.1.0.3.0 (build 11705), connecting to the same database server (using the same account), and I build a classification SVM model, I do have the ROC curve.
Can anyone help me solve this mysterious problem?
Thank you!!!

We don't have "preferred target value" during model building in the new data miner. However, you can use the Transform node to transform your target into 2 classes (preferred target value and others). You then use the output from the Transform node as input source for your model build.
Here is a process to transform your target into 2 classes:
- Create a Transform node
- In Transform node, select the target column, click "Add Transformation" icon on the toolbar
- In the Add Transform dialog, select "Custom" Binning Type, click "Generate Default Bins" button (accept default settings)
- In the "Custom bin values" listbox, remove the non-preferred target values (select the values and click the "Remove Transformation" icon on the toolbar)
- Now, you should have one preferred target value in the "Custom bin values" listbox, click OK to finish
You can now connect the Transform node to a Build node. In the Build node, select the transformed target (it should have the "_BIN" suffix in the name) as the Target for model build.
Hope this helps!
Denny
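
The effect of the Transform node steps above can be pictured as collapsing the target into two classes. The snippet below is only a conceptual illustration in plain Python. The label "OTHER" and the function name are assumptions, and the actual bin labels produced by Data Miner may differ.

```python
def binarize_target(values, preferred, other_label="OTHER"):
    """Keep the preferred target value; map every other value to one label."""
    return [v if v == preferred else other_label for v in values]

targets = ["gold", "silver", "bronze", "gold"]
print(binarize_target(targets, preferred="gold"))
```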

Data Miner Error on 11g

Data Miner (11.1.0.1.0)
Oracle 11.1.0.7.1
During a Naive Bayes Mining Activity the build gets past the Sample, Discretize, Split, but fails on the Build with the following error:
ORA-40101: Data Mining System Error
ORA-01426: numeric overflow
ORA-06512: at "SYS.DBMS_DATA_MINING", line 1666
ORA-06512: at "SYS.DBMS_JDM_INTERNAL", line 1145
Basically I have a source_id (pk), country_code, city_name as the dataset. Only ran against 1000 rows.
Ideas?

Hi,
You stated earlier that you had just 1000 rows, which is not a lot of data to mine.
Now you have 10 million, so you have plenty of data.
Probably more than you would generally use for your first round of model building.
From your testing of the input data to the build step, it would appear that there are no immediate issues where the transformations may be generating failure.
Can you explain what your target attribute is and how many distinct target values you have?
It appears you have only one predictive attribute for the model to use, which is OK, just kind of lean.
Since it is categorical, it will get binned to top N by default, which leaves you with even less information.
Also, what version of db and data miner are you using?
Thanks, Mark
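
To make the remark about top N binning concrete: the general idea is to keep the N most frequent values of a categorical attribute and merge the rest into a single bin, which is why information is lost. The sketch below shows that idea in plain Python. The label "OTHER", the value of N and the function name are assumptions, and this is not Oracle's actual implementation.

```python
from collections import Counter

def top_n_bin(values, n, other_label="OTHER"):
    """Keep the n most frequent values; merge all others into one bin."""
    counts = Counter(values)
    keep = {v for v, _ in counts.most_common(n)}
    return [v if v in keep else other_label for v in values]

cities = ["Paris", "Paris", "Rome", "Oslo", "Rome", "Paris", "Lima"]
print(top_n_bin(cities, n=2))
```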
