Counting word frequency

Hi all,
I'm very new to Java, only a few months into it. I am working on a program that compresses a file, thread, or string. First it takes a string of undetermined length as input, splits it using the delimiter, and places the pieces into a list. Then I take the recurring words out of that list into another list and return the indexes where the words occur; this has been done without any problems.
Next I want to get the word frequency and put the most common word at the first index, the next most common at the second, and so on. I am not sure how best to go about this. I did use a map, but that just gives the frequency of the words without any ordering. Can anybody give me any suggestions that I could try, bearing in mind that Java is my first language, so this is also a confidence-building exercise?
Many thanks to anybody who responds.

Okay, so you have a Map with the 'word' as keys and the 'frequency' as values, right?
If so, create a class (WordOccurrence, for example) that holds a "word" and an "occurrence" variable, and let that class implement the Comparable interface. Then you can do something like this:
public static List<WordOccurrence> getWordOccurrences(String text) {
    // get the occurrence of each (unique) word
    Map<String, Integer> frequencyMap = getFrequencyMap(text);
    // create a list to hold your WordOccurrence instances
    List<WordOccurrence> list = new ArrayList<WordOccurrence>();
    // for each word in your map, create a WordOccurrence instance
    for (String word : frequencyMap.keySet()) {
        list.add(new WordOccurrence(word, frequencyMap.get(word)));
    }
    // sort the list of WordOccurrences
    Collections.sort(list);
    // return the sorted list
    return list;
}
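
The reply relies on a WordOccurrence class and a getFrequencyMap helper that it does not show. The following is a minimal sketch of how they might look, not the poster's actual code: the class name WordFrequencyDemo, the field names, the whitespace-only word splitting, and the tie-breaking rule (alphabetical) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFrequencyDemo {

    // Hypothetical value class: one word plus how often it occurred.
    static class WordOccurrence implements Comparable<WordOccurrence> {
        final String word;
        final int occurrence;

        WordOccurrence(String word, int occurrence) {
            this.word = word;
            this.occurrence = occurrence;
        }

        // Higher counts sort first; ties fall back to alphabetical order.
        @Override
        public int compareTo(WordOccurrence other) {
            int byCount = Integer.compare(other.occurrence, this.occurrence);
            return byCount != 0 ? byCount : this.word.compareTo(other.word);
        }

        @Override
        public String toString() {
            return word + "=" + occurrence;
        }
    }

    // Assumed helper: counts whitespace-separated words.
    static Map<String, Integer> getFrequencyMap(String text) {
        Map<String, Integer> frequencyMap = new HashMap<String, Integer>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input yields one empty token; skip it
            }
            Integer count = frequencyMap.get(word);
            frequencyMap.put(word, count == null ? 1 : count + 1);
        }
        return frequencyMap;
    }

    // Same shape as the method in the reply: map -> list -> sorted list.
    static List<WordOccurrence> getWordOccurrences(String text) {
        List<WordOccurrence> list = new ArrayList<WordOccurrence>();
        for (Map.Entry<String, Integer> entry : getFrequencyMap(text).entrySet()) {
            list.add(new WordOccurrence(entry.getKey(), entry.getValue()));
        }
        Collections.sort(list);
        return list;
    }

    public static void main(String[] args) {
        // prints [the=3, and=2, bat=1, cat=1, hat=1]
        System.out.println(getWordOccurrences("the cat and the hat and the bat"));
    }
}
```

Putting the ordering in compareTo keeps Collections.sort(list) as simple as in the reply; a separate Comparator would work just as well if the class should not have a single "natural" order.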

Similar Messages

  • Where, if any, Word Count and Word Frequency tools on Apple Works or MS Wrd

    5/26/2008. Is there a "word count" and/or "word frequency" tool(s) on Apple Works word document and/or Microsoft Word for Mac word document, and how does one use or activate it, or find such for it on the web?? Many thanks. C. Yopst, Chicago

    Hi C,
    In an AppleWorks WP document, go Edit > Writing Tools > Word count. This will give you a count of characters, words, lines and paragraphs in the document or, if you have selected a portion of the document, in the selection.
    To my knowledge, there's no tool to give a word frequency figure, although that could be done with some fairly easy manipulation involving a spreadsheet.
    For MS Word questions, I'd suggest checking the Word section of Microsoft's Mactopia site. Try searching "count words".
    Regards,
    Barry

  • Word Frequency Counter...

    Hello all, I am working on a project that is supposed to read in a text file named on the command line and then break it up into words. As the words are read in by the Scanner, I need a counter that tracks the number of times each word has occurred so far, which I can access and display in the output. So far I have come up with this driver/main class, plus the Count class that I'm trying to use to keep track of the number of times a word has occurred in the text, so that I can add it to a HashMap and display it later... The problem is, whenever I run the program with a text file, it just ends up displaying all the words in a line and then a number 1 next to it. What I need is output that looks similar to this... For example,
    hello 1
    world 1
    Any help would be appreciated! Thanks.
    {code}
    import java.io.*;
    import java.util.*;

    public class Driver {
        public static void main(String[] args) {
            HashMap words = new HashMap();
            String nameOfFile = args[0];
            File file = new File(nameOfFile);
            String wordd;
            Count count;
            try {
                Scanner scanner = new Scanner(file).useDelimiter(" \t\n\r\f.,<>\"\'=/");
                while (scanner.hasNext()) {
                    String word = scanner.next();
                    count = (Count) words.get(word);
                    if (count == null) {
                        words.put(word, new Count(word, 1));
                    } else {
                        count.i++;
                    }
                    System.out.println(word);
                }
            } catch (FileNotFoundException e) {
            }
            Set set = words.entrySet();
            Iterator iter = set.iterator();
            while (iter.hasNext()) {
                Map.Entry entry = (Map.Entry) iter.next();
                wordd = (String) entry.getKey();
                count = (Count) entry.getValue();
                System.out.println(wordd +
                   (wordd.length() < 8 ? "\t\t" : "\t") +
                   count.i);
            }
        }
    }
    {code}
    {code}
    public class Count {
        String word;
        int i;

        public Count(String inputWord, int increment) {
            word = inputWord;
            i = increment;
        }
    }
    {code}
    Edited by: VisualAssassin on Apr 22, 2009 2:45 PM

    VisualAssassin wrote:
    {code}
    Scanner scanner = new Scanner(file).useDelimiter(" \t\n\r\f.,<>\"\'=/");
    {code}
    According to the documentation for Scanner.useDelimiter(), the String supplied is used as a regular expression. Therefore, for the scanner to split the input into two separate tokens, your input stream would have to contain all of those listed characters in order (with the "." matching any single character)!
    Instead, use this (untested; it needs import java.util.regex.Pattern):
    {code}
    Scanner scanner = new Scanner(file).useDelimiter("[" + Pattern.quote(" \t\n\r\f.,<>\"\'=/") + "]+");
    {code}
    The opening and closing square brackets tell the regular expression engine to match any one of those characters, and the plus means one or more times. Pattern.quote is used to escape characters that could get you into trouble because they have a special meaning in regexes, notably "." when it appears outside a character class.
    Edited by: endasil on 22-Apr-2009 11:43 PM
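
The character-class idea from the reply can be tried on an in-memory string. This is a small sketch rather than the poster's tested code: the class name DelimiterDemo and the sample sentence are made up, and the delimiter is written out as an explicit character class instead of going through Pattern.quote, which works here because none of the listed characters is special inside square brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {

    // One or more of: space, tab, newline, carriage return, form feed,
    // and the punctuation characters listed in the question.
    static final String DELIMITER_REGEX = "[ \\t\\n\\r\\f.,<>\"'=/]+";

    // Splits the input into tokens using the delimiter regex above.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        Scanner scanner = new Scanner(input).useDelimiter(DELIMITER_REGEX);
        while (scanner.hasNext()) {
            tokens.add(scanner.next());
        }
        scanner.close();
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, hello, again]
        System.out.println(tokenize("hello, world. hello/again"));
    }
}
```

With tokens split this way, repeated words such as "hello" arrive as separate, identical tokens, so a map lookup on the token can find and increment an existing count.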

  • Creating of the word-frequency histogram from the Oracle Text

    I need to build a "word-frequency histogram" from an Oracle Text index: a list of the tokens in the index, where each token carries the list of documents that contain it and the frequency of the token in each document. Does anybody know how to get this data out of an Oracle Text index so that the result can be saved to a table or to a text file?

    You can use ctx_report.token_info to decipher the token_info column, but I don't think the report format that it produces is what you want. You can use a query template and specify algorithm=count to obtain the number of times a token appears in the indexed column. You can do that for every token by using the dr$...$i table, as shown below. Formatting is preserved by prefacing the code with pre enclosed in square brackets on the line above all of the code and /pre in square brackets on the line below all of the code.
    SCOTT@10gXE> create table otntest
      2    (doc_id       number primary key,
      3       document  varchar2(100))
      4  /
    Table created.
    SCOTT@10gXE> insert all
      2  into otntest values (1, 'This is a test for generating a histogram')
      3  into otntest values (2, 'Histogram shows the list of documents that contain that token and frequency')
      4  into otntest values (3, 'frequency histogram frequency histogram frequency')
      5  select * from dual
      6  /
    3 rows created.
    SCOTT@10gXE> create index otntest_ctx_idx
      2  on otntest(document)
      3  indextype is ctxsys.context
      4  /
    Index created.
    SCOTT@10gXE> column token_text format a30
    SCOTT@10gXE> select t.doc_id, i.token_text, score (1) as token_count
      2  from   otntest t,
      3           (select distinct token_text
      4            from   dr$otntest_ctx_idx$i) i
      5            where  contains
      6                  (document,
      7                   '<query>
      8                   <textquery grammar="CONTEXT">'
      9                   || i.token_text ||
    10                   '</textquery>
    11                   <score datatype="INTEGER" algorithm="COUNT"/>
    12                   </query>',
    13                   1) > 0
    14  order  by doc_id, token_text
    15  /
        DOC_ID TOKEN_TEXT                     TOKEN_COUNT
             1 GENERATING                               1
             1 HISTOGRAM                                1
             1 TEST                                     1
             2 CONTAIN                                  1
             2 DOCUMENTS                                1
             2 FREQUENCY                                1
             2 HISTOGRAM                                1
             2 LIST                                     1
             2 SHOWS                                    1
             2 TOKEN                                    1
             3 FREQUENCY                                3
             3 HISTOGRAM                                2
    12 rows selected.
    SCOTT@10gXE>

  • Is there a way to create a Word Frequency list in 8.2.5?

    I would like to be able to create a Word Frequency list for each PDF in a batch of PDFs. I have used Batch Processing to make sure each one is OCR/searchable. Is there a way to do this in Acrobat Pro 8.2.5 / Windows XP?
    Thanks!

    The SMTP banner is not customizable in 3x.
    In 4x and IMS 5x, you can use the configutil <b>service.smtp.banner</b> attribute.

  • Counting words in a single cell in Numbers'09

    Hi there,
    I'm relatively new to Mac world, but I do have years of computer experience from a PC and have also had to do with Macs at the age of first eMacs . I have finally decided to switch to the brighter side of life (hopefully ;)).
    But here is my question: I need to count words in a cell in Numbers'09.
    Is there a specific function combination for achieving this? My idea was: strip excess spaces, count the occurrences of the space character in the cell, add 1, and voila! The problem is that I cannot achieve it using formulas in Numbers'09. I have found some help for Excel, but the formulas are a little different. And, well, I would like to leave the past behind and stick to Apple programs if I can. I don't like the idea of installing Excel on a Windows Boot Camp partition only for this purpose.
    Any help would be greatly appreciated. Thanks.
    Aleksander

    Badunit wrote:
    Yvan once had a list of all the different localizations. He may still have it.
    I'm late but, I was very busy
    The table with every localized functions names is (and will remain) available on my iDisk :
    <http://public.me.com/koenigyvan>
    Download :
    For_iWork:iWork '09:functionsNames.numbers.zip
    An easy solution for non-English users (like me) is to duplicate Numbers.app and remove all of its language resources except English.
    When you run that copy, it will run in English (except for the decimal and parameter separators, the date and time formats, and the default currency).
    That makes it easy to enter the formulas given in this forum.
    Once the document is saved, we can open it in the 'standard' Numbers and the formulas will be localized automatically.
    Yvan KOENIG (VALLAURIS, France) mardi 2 mars 2010 18:30:45

  • How to count words in a PDF file?

    Is there any way I can count words in a PDF file without resorting to Acrobat Reader (which apparently has that feature)?
    That's a massive program, which I actually don't like.
    I need to count words in the PDF file because I write my papers with LaTeX, and they're full of my extensive comments.
    Do you know of any alternative?

    IIRC that utility can no longer be found in xpdf (the official Archlinux package); it's part of poppler now
    edit: it's pdftotext btw
    Last edited by dolby (2008-05-11 13:35:05)

  • Counting words in a textArea

    Help !!
    I am trying to count the number of words contained in a text area - Is it possible to do this ?
    I am new to Java and I'm stuck because I only know how to write programs that use FileReader to count the number of words in a file.
    Is it possible to count the number of words in a string directly - or will I have to save the string to file and then apply another program to count the number of words ?

    Insert this code in your source file (TextArea1 is the name of the TextArea whose words you want to count):
    String texte = TextArea1.getText();
    String rt = String.valueOf((char) 13) + String.valueOf((char) 10);
    StringTokenizer mots = new StringTokenizer(texte, " ,.:;!?\t" + rt);
    int nbremots = mots.countTokens();
    and place this at the top of the file:
    import java.util.StringTokenizer;
    nbremots is the number of words. The delimiters between words are given in the second argument of the StringTokenizer constructor. I have chosen the main punctuation signs, such as space, comma, and full stop. The String rt represents the line break on Windows (carriage return plus line feed). If you work on Unix, prefer this code:
    String texte = TextArea1.getText();
    StringTokenizer mots = new StringTokenizer(texte, " ,.:;!?\t\n");
    int nbremots = mots.countTokens();
    With StringTokenizer you can also read the words one by one, using this kind of loop:
    while (mots.hasMoreTokens()) {
        String mot = mots.nextToken();
        // use mot here
    }
    It is a useful class. See the API documentation for StringTokenizer in the package java.util.
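
The StringTokenizer Javadoc describes the class as legacy and recommends String.split or java.util.regex for new code. As a hedged alternative to the snippets above, here is a sketch that counts words directly in a string with split; the class name TextWordCount and the exact delimiter set are assumptions, and in a GUI the input would come from something like TextArea1.getText().

```java
public class TextWordCount {

    // One or more whitespace or punctuation characters separate two words.
    // \s covers space, tab, \r and \n, so Windows and Unix line breaks both work.
    static final String DELIMITERS = "[\\s,.:;!?]+";

    static int countWords(String text) {
        // Remove leading delimiters, otherwise split() reports an empty first token.
        String stripped = text.replaceFirst("^" + DELIMITERS, "");
        if (stripped.isEmpty()) {
            return 0;
        }
        // split() drops trailing empty strings, so trailing delimiters are harmless.
        return stripped.split(DELIMITERS).length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello, world! How are you?")); // prints 5
        System.out.println(countWords("  ...  "));                    // prints 0
    }
}
```

No file is needed at any point: the string is processed in memory, which answers the original worry about having to save the text first.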

  • Convert a counter to frequency

    Hi,
    I am using two counters on a 6023E board. I have two pulse generators which
    give 1000 pulses in one rotation.
    How can I convert the counts to frequency or speed in LabView?
    Greetings,
    Erik.

    We found that the best way of obtaining a speed measurement from a pulse generator is to use buffered period measurement: acquire several periods (more than one), discard the first measurement (it may be incorrect because of the uncertainty in where the measurement starts relative to the signal edge), and then average the remaining measurements.
    The next step is to divide the timebase frequency used (the internal 20 MHz timebase, for example) by the average count obtained in the preceding step: this gives the average pulse frequency in Hz, which must be multiplied by 60 and divided by the number of pulses per revolution to give the rpm value.
    Step two can be put in a loop in order to obtain a continuous measurement, provided that you restart the counter once you have obtained a finished reading from it.
    Roberto
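
The arithmetic described above can be written down outside LabVIEW to check the numbers. This is only an illustrative sketch in Java, not LabVIEW or DAQ driver code: the class name PulseSpeed, the method name, and the sample readings are invented, and the readings are assumed to be period measurements expressed in timebase ticks.

```java
public class PulseSpeed {

    // Converts buffered period readings (in timebase ticks) to rpm.
    // The first reading is discarded, as suggested in the reply.
    static double rpm(double timebaseHz, double[] periodTicks, int pulsesPerRev) {
        if (periodTicks.length < 2) {
            throw new IllegalArgumentException("need at least two period readings");
        }
        double sum = 0.0;
        for (int i = 1; i < periodTicks.length; i++) {
            sum += periodTicks[i];
        }
        double averageTicks = sum / (periodTicks.length - 1);
        double pulseHz = timebaseHz / averageTicks; // pulses per second
        return pulseHz * 60.0 / pulsesPerRev;       // revolutions per minute
    }

    public static void main(String[] args) {
        // 20 MHz timebase, 1000 pulses per rotation, about 20000 ticks per pulse period.
        double[] readings = {15000, 20000, 20000, 20000};
        System.out.println(rpm(20e6, readings, 1000)); // prints 60.0
    }
}
```

In the sample, 20 MHz divided by 20000 ticks gives 1000 pulses per second; at 1000 pulses per rotation that is one rotation per second, or 60 rpm.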

  • Count words in each sentece in a file

    Hi all
    i want to ask how can i count the words in each sentence in a file?
    i.e. if i have the following sentences
    i ate the cake.
    today is Sunday.
    i went to school 5 day a week.
    to get the numbers
    4
    3
    8
    any ideas??

    you could read the file line per line, put the line in a string and use StringTokenizer to split it into words and count them. Or you could read the file char per char, increasing the word counter every time you find a blank char; when the char read is a newline you save the old counter and start a new word count for the new line.

    That's an option, but a sentence is not ended by a newline. A sentence ends with a full stop/point.
    Kaj
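
Following Kaj's point that sentences end at a full stop rather than at a newline, here is a minimal sketch that splits on sentence-ending punctuation first and on whitespace second. The class and method names are hypothetical, and the approach is deliberately naive: abbreviations or decimal numbers containing "." would be treated as sentence ends.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceWordCount {

    // Returns the number of words in each sentence, in order of appearance.
    static List<Integer> wordsPerSentence(String text) {
        List<Integer> counts = new ArrayList<Integer>();
        // A sentence ends at one or more of '.', '!' or '?'.
        for (String sentence : text.split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing but whitespace between two terminators
            }
            counts.add(trimmed.split("\\s+").length);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "i ate the cake.\ntoday is Sunday.\ni went to school 5 day a week.";
        System.out.println(wordsPerSentence(text)); // prints [4, 3, 8]
    }
}
```

Because line breaks are just whitespace here, a sentence that wraps across two lines is still counted as one sentence.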

  • Is there a method for counting words?

    Hi!
    Is there a method for counting words?
    How do I read specific data ( row, column ) out of an array?
    Thx
    Lebite

    There could be a better way, but this is how I would do it:
            String[][] myArray = { {"Blah Blah Blah"},
                                   {"Blah Blah Blah"},
                                   {"Blah Blah Blah"} };
            int tokens = 0;
            for (int i = 0; i < myArray.length; i++) {
                // use the row's own length: the row and column counts can differ
                for (int j = 0; j < myArray[i].length; j++) {
                    StringTokenizer st = new StringTokenizer(myArray[i][j]);
                    tokens += st.countTokens();
                }
            }
            System.out.println(tokens); // prints 9

  • Help in word frequency program for .doc files

    hi,,
    can anybody help me...
    i have implemented a prog to find word frequency of .txt files,
    now i want to use the same prog for .doc files,
    how can this be done???

    Hi,
    I'm sure a few seconds on Google and you would have found the answer. But, take a look at Apache POI. This will allow you to extract the text from the doc files.
    http://poi.apache.org/
    Ben.

  • Count word without space in C#

    Dear sir,
    I would like to count the words in a sentence without counting the spaces in C#, but I could not get this code to work. Please help me fix it. The code is below:
                int i = 0, b = 0;
                int Count2 = 1;
                for (i = 0; i < Paragraph.Length; i++) {
                    if (Paragraph[i] == ' ') {
                        for (b = i; b < Paragraph.Length; b++) {
                            if (Paragraph[b] != ' ' && Paragraph[b] != '\t') {
                                Count2++;
                                break;
                            }
                        }
                    }
                }
                Console.WriteLine("Total Word = {0} .", Count2);

    Dear Sir,
    I want to count the words in a sentence without counting the spaces, but when I run the program the spaces are counted along with the words. I want to count only the words. Please guide me. My C# program is as follows.
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using System.Text.RegularExpressions;
    using System.IO;
    namespace Static_Method_Count_Word {
        class Program {
            static void Main(string[] args) {
                Console.WriteLine("Please Enter Any Paragraph .");
                string Paragraph = Console.ReadLine();
                Count.Coun(Paragraph);
                Console.ReadLine();
            }
        }
        class Count {
            public static void Coun(string Paragraph) {
                int i = 0, b = 0;
                int Count2 = 1;
                for (i = 0; i < Paragraph.Length; i++) {
                    if (Paragraph[i] == ' ') {
                        for (b = i; b < Paragraph.Length; b++) {
                            if (Paragraph[b] != ' ' && Paragraph[b] != '\t') {
                                Count2++;
                                break;
                            }
                        }
                    }
                }
                Console.WriteLine("Total Word = {0} .", Count2);
            }
        }
    }
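
One likely cause of the over-count, assuming the nesting shown above reflects the original program, is that every space in a run of spaces finds the same following word and increments the counter again. A common fix is to count only the transitions from whitespace into a word. The sketch below shows that idea in Java rather than C#, to match the other examples on this page; the loop ports to C# almost line for line (char.IsWhiteSpace instead of Character.isWhitespace). The class and method names are made up.

```java
public class WhitespaceAwareCount {

    // Counts words by counting the places where a word starts:
    // a non-whitespace character that is not already inside a word.
    static int countWords(String paragraph) {
        int count = 0;
        boolean inWord = false;
        for (int i = 0; i < paragraph.length(); i++) {
            boolean isSpace = Character.isWhitespace(paragraph.charAt(i));
            if (!isSpace && !inWord) {
                count++; // first character of a new word
            }
            inWord = !isSpace;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("  one   two\tthree ")); // prints 3
        System.out.println(countWords(""));                    // prints 0
    }
}
```

Starting the counter at 0 instead of 1 also gives the right answer for empty input and for input that begins with spaces.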

  • Best way to implement a word frequency counter (input = textfile)?

    I had this for an interview question and basically came up with a solution that uses a hash table...
    //create hash table
    //bufferedreader
    //read file in,
    //for each word encountered, create an object that has (String word, int count) and push into hash table
    //then loop and read out all the hash table entries
    ===skip this stuff if you don't feel like reading too much
    Then the interviewer proceeded to grill me on why I shouldn't use a tree or any other data structure for that matter... I was kinda stumped on that.
    He also asked me what happens if the number of words exceeds the capacity of the hash table. I said you can increase the capacity of the hash table, but that doesn't sound too efficient, and I'm not sure how you know how much to increase it by. I had some OK solutions:
    1. read the file through once, get the number of words in the file, and set the hashtable capacity to that number
    2. do #1, but run another algorithm that will figure out the distinct # of words
    3. separate chaining
    ===
    Anyhow, what kind of answers/algorithms would you guys have come up with? Thanks in advance.

    You wrote: //for each word encountered, create an object that has (String word, int count) and push into hash table
    Well, first you need to check whether the word is already in the hashtable, right? If it is there, you increment its count instead of pushing a new object.

    You wrote: then the interviewer proceeded to grill me on why i shouldn't use a tree or any other data structure for that matter...
    A hashtable has amortized O(1) time for insert and search. A balanced binary search tree has O(log n) complexity for the same operations. So a hashtable will be faster for a large number of words. The other option is a so-called "trie" (google for more), which has O(m) complexity, where m is the length of the word being inserted or looked up, independent of how many words are stored. So if your words aren't too long, a trie may be just as fast as a hashtable. The trie may also use less memory than the hashtable.

    You wrote: also he asked me what happens if the number of words exceed the capacity of the hash table?
    The hashmap implementation that comes with Java grows automatically, so you don't need to worry about it. It may not "sound" efficient to have to copy the entire data structure, but the copy happens quickly and occurs relatively infrequently compared with the number of words you'll be inserting.

    You wrote: 1. read the file thru once, and get the number of words in the file, set the hashtable capacity to that number
    I would do anything to avoid making two passes over the data. Assuming you're reading it from disk, most of the time will be spent reading from disk, not inserting into the hashtable. If you really want to size the hashtable a priori, you can make it big enough to hold all the words in the English language, which IIRC is about 20,000.
    And relax, you had the right answer. I used to work in this field and this is exactly how we implemented our frequency counter, and it worked perfectly well. Don't let these interviewers push you around; just tell them why you thought a hashtable was the best choice. Show off your analytical skills!
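
For comparison with the hashtable approach, this is a rough sketch of the trie alternative mentioned in the reply. Each insert or lookup walks one node per character, so its cost depends on the word's length rather than on how many words are stored. The class name TrieCounter and its structure are assumptions for illustration; a production trie would usually use a more compact child representation than a HashMap per node.

```java
import java.util.HashMap;
import java.util.Map;

public class TrieCounter {

    // One node per character position; count is non-zero only where a word ends.
    private static class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        int count;
    }

    private final Node root = new Node();

    // Records one more occurrence of the given word.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        node.count++;
    }

    // Returns how often the word was added, or 0 if it was never seen.
    int count(String word) {
        Node node = root;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.children.get(word.charAt(i));
        }
        return node == null ? 0 : node.count;
    }

    public static void main(String[] args) {
        TrieCounter counter = new TrieCounter();
        for (String word : "the then the".split(" ")) {
            counter.add(word);
        }
        System.out.println(counter.count("the"));  // prints 2
        System.out.println(counter.count("then")); // prints 1
        System.out.println(counter.count("th"));   // prints 0
    }
}
```

Words that share a prefix ("the", "then") share nodes, which is where the possible memory saving comes from.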

  • How to count words in a text ?

    Hi all,
    can anyone show me how to count the number of words in a text?
    thanks in advance,
    Toan.

    Hi,
    Are you reading the text from a file or is it stored in a string buffer?
    The best way would be to use a StringTokenizer. Assuming that all your words are separated by a space ' ', you could use that as your delimiter.
    Something like..
    String text = "This is just a bunch of text stored as a string.";
    StringTokenizer words = new StringTokenizer( text, " ", false );
    int numberOfWords = words.countTokens();
    I hope that helps,
    .kim
