How to extract unique words in all files

Hi,
I am trying to extract all the UNIQUE words from all the files in a directory. My code gives me all the words, but it also prints single letters (including repeats of the same letter), and the output is not unique. I don't want the single letters, only the unique words in the files. For example, if file x contains the words "cat mat sat" and file y contains the words "cat mat bat pat", my output should be "cat mat sat bat pat".
Can you please let me know what to use to get only the unique words, or how I can modify my code to get the desired output?
    //String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
    Pattern p = Pattern.compile("[\\w']+", Pattern.MULTILINE);
    FileInputStream fis = new FileInputStream(file);
    FileChannel fc = fis.getChannel();
    ByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, (int) fc.size());
    Charset cs = Charset.forName("8859_1");
    CharsetDecoder cd = cs.newDecoder();
    CharBuffer cb = cd.decode(bb);
    // Run some matches
    Matcher m = p.matcher(cb);
    while (m.find()) {
        //System.out.println(cb.substring(m.start(), m.end()));
        System.out.println(m.group());
    }
Thanks
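
One way to get the behaviour described in the question is to collect the matches in a Set instead of printing each one, and to require at least two characters per match so that single letters are skipped. The following is only a sketch under a few assumptions: the class name UniqueWords and its method names are made up for illustration, only the files directly inside the directory are read, and the text is decoded as ISO-8859-1 as in the original snippet.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class UniqueWords {
    // Two or more word characters or apostrophes, so single letters never match.
    private static final Pattern WORD = Pattern.compile("[\\w']{2,}");

    /** Adds every word found in text to the set; the set itself drops duplicates. */
    static void collectWords(CharSequence text, Set<String> words) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
    }

    /** Collects the unique words of all regular files directly inside dir. */
    static Set<String> uniqueWordsInDirectory(Path dir) throws IOException {
        Set<String> words = new LinkedHashSet<>(); // keeps first-seen order
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    // Assumption: same single-byte encoding as in the question.
                    String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
                    collectWords(text, words);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(String.join(" ", uniqueWordsInDirectory(dir)));
    }
}
```

Because one set is shared across all files, a word seen in file x is not added again when it turns up in file y. The comparison is case-sensitive, so "Cat" and "cat" count as different words unless each match is lower-cased first, and the order in which the directory's files are visited is not guaranteed.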

Can you tell me how to add the contents of a file into a set?
    import java.io.*;
    import java.util.*;
    public class RmDup {
        public static void main(String args[]) {
            FileReader fr = new FileReader("Dups.txt");
            BufferedReader br = new BufferedReader(fr);
            String s1[] = br.readLine();
            String s2;
            while ((s2 = br.readLine()) != null) {
                HashSet ref = new HashSet(s1); // create a HashSet
                Iterator i = ref.iterator(); // get iterator
                System.out.println("\nNonduplicates are: ");
                while (i.hasNext())
                    System.out.print(i.next() + " ");
                System.out.println();
            }
        }
    }
I am trying to use this, but I am getting two errors: 1: incompatible types and 2: cannot find symbol constructor HashSet(java.lang.String[]).
Thanks
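
Both errors come from types: readLine() returns a single String rather than a String[], and HashSet has no constructor that takes an array (an array can be wrapped with Arrays.asList first, or the elements can be added one by one). Below is a minimal sketch of reading a file's words into a set. The class name RmDupSketch and the method readWords are hypothetical, and splitting on whitespace is an assumption about what counts as a word.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical class name; a sketch, not the original poster's code.
class RmDupSketch {
    /** Reads all lines from source and returns the distinct whitespace-separated words. */
    static Set<String> readWords(Reader source) throws IOException {
        Set<String> words = new LinkedHashSet<>(); // one set for the whole file
        try (BufferedReader br = new BufferedReader(source)) {
            String line; // readLine() returns a String, not a String[]
            while ((line = br.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) { // a blank line splits into one empty string
                        words.add(word);   // add() silently ignores duplicates
                    }
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Pass a file name such as Dups.txt; without one, a small in-memory sample is used.
        Reader in = args.length > 0
                ? new FileReader(args[0])
                : new StringReader("cat mat sat\ncat mat bat pat\n");
        System.out.println("Nonduplicates are: " + String.join(" ", readWords(in)));
    }
}
```

The set is created once, before the loop, so it accumulates words across every line; creating it inside the loop, as in the snippet above, would start from scratch on each line.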

Similar Messages

How to extract text from a PDF file?

Hello Suners,
I need to know how to extract text from a PDF file.
Does anyone know what character encoding a PDF file uses? When I use an input stream to read the file, I get what look like encrypted characters instead of the original text in the file.
Is there any procedure I should follow when reading a PDF file?
    File f=new File("D:/File.pdf");
    FileReader fr=new FileReader(f);
    BufferedReader br=new BufferedReader(fr);
    String s=br.readLine();
Any help will be deeply appreciated.

jverd wrote:
> First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading.
Oops, you are absolutely right, that was a silly mistake.
> Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.
getPageContent returns an array of bytes, so the question becomes: how do I get text from this array? I was thinking of:
    private void jButton1_actionPerformed(ActionEvent e) {
        PdfReader read;
        StringBuffer buff = new StringBuffer();
        try {
            read = new PdfReader("d:/getjobid2727.pdf");
            read.getMetaData();
            byte[] data = read.getPageContent(1);
            int i = 0;
            while (i > -1) {
                buff.append(data);
                i++;
            }
            String str = buff.toString();
            FileOutputStream fos = new FileOutputStream("D:/test.txt");
            Writer out = new OutputStreamWriter(fos, "UTF8");
            out.write(str);
            out.close();
            read.close();
        } catch (Exception f) {
            f.printStackTrace();
        }
    }
"D:/test.txt" was not created when I ran the program.
Are my steps right?
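
Two things in the snippet above probably explain the missing file. The condition i > -1 stays true while i keeps growing, so the loop does not finish in any practical sense and the code that writes the file is never reached. In addition, StringBuffer has no append(byte[]) overload, so append(data) resolves to append(Object) and appends the array's identity string rather than its bytes. Here is a hedged sketch of decoding a byte array once and writing it out. The class name PageBytesSketch, the sample bytes, and the ISO-8859-1 choice are assumptions for illustration; the sample array merely stands in for whatever getPageContent(1) returns.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class; uses only the standard library, no PDF library calls.
class PageBytesSketch {
    /** Decodes the bytes in one step; no loop is needed. */
    static String decode(byte[] data) {
        // Assumption: a single-byte charset, so every byte maps to exactly one char.
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the array returned by read.getPageContent(1) in the post.
        byte[] data = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        Path out = Files.createTempFile("page", ".txt");
        Files.write(out, decode(data).getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + out);
    }
}
```

Even when the bytes are decoded correctly, what comes out is the page's raw content rather than plain readable text, which matches jverd's suspicion about getPageContent; getting the visible text would still require interpreting that content.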

How to extract text from a PDF file using php?

How to extract text from a PDF file using php?
thanks
fabio

> Do you know of any other way this can be done?
There are many ways, but this is out of scope for this forum. You can try this forum: http://forum.planetpdf.com/

I have a 27" iMac with 16GB of factory installed SDRAM and 1T harddrive. It is telling me the harddrive is full and is not functioning correctly. How can I adjust so that all files are managed on the harddrive rather than the SDRAM. (BTW - design flaw)

Processor 3.4 GHz Intel Core i7
Memory 16 GB 1333 MHz DDR3
I have a 27" iMac with 16GB of factory installed SDRAM and 1T harddrive. It is telling me the harddrive is full and is not functioning correctly. How can I adjust so that all files are managed on the harddrive rather than the SDRAM. (BTW - design flaw here.)
Older Macs with a single Harddrive would simply expand OS management on the drive, and I think I understand the new Fusion Drive to do just that, but how do I get this product that I spent a great deal of money on to be more than a pretty screen?

I was confused by this statement
It is telling me the harddrive is full and is not functioning correctly.
OS X manages RAM, and when you run out, it creates swap files to extend your RAM on the HDD (hard disk drive). The OS itself takes about 16 GB of space, and that too resides on the HDD. Your 16 GB of RAM is really a temporary space that holds things from the HDD so the processor can work on them. There is no design flaw between your RAM and HDD. Something else is going on, and there are a lot of smart people here to help you figure out what that "something else" is.
I would do three things. First I would create a backup, backups are important. Second, I would reboot into recovery and repair my HDD. Lastly while still booted in recovery I would repair permissions.
Can you capture a screen shot of the exact error you are getting?

Pdf sent won't open beyond cover sheet? how to allow click thru of all files?

pdf sent won't open beyond cover sheet? how to allow click thru of all files?

That's handy, because I can run gmail and Chrome too.
Ok, I sent myself a PDF attachment. I'm going to go step by step through what happened
to me, and you can tell me where your experience becomes different.
1. The email shows a preview of the top part of page 1 for the attachment.
2. I hover over the preview and I get two icons, "Download" and "Save to drive".
3. I click Download.
4. The attachment downloads and I see it in the status bar.
5. I click on the document in the status bar. It opens to page 1, but I can use the
scroll bar to read all of the pages.
6. I hover in the bottom right of the page, until Chrome's PDF toolbar appears.
7. I click the save icon.
8. I save the file to my desktop.
9. I open the file from my desktop. It also has all pages.

How do I translate word perfect old files for my new iMac?

How do I translate Word Perfect old files for my new iMac? They are transferred from a 13-year-old Dell. My McLink did not work.
Thanks.

The current version of the free LibreOffice continues to support Word Perfect documents. OpenOffice had to drop this support due to licensing issues. LO installs in two places: /Applications and your local Library/Application Support directory. Simple to install. Simple to remove. Good PDF documentation from their site. It is a capable MS Office replacement.
I just opened a wpd legal deposition in LO and it looked fine.

How do you gain access to all files on different users?

How do you gain access to all files and folders for each user?

http://forums.whirlpool.net.au/archive/718273

How to Extract Data from the PDF file to an internal table.

Hi friends,
How can I extract data from a PDF file to an internal table?
Thanks in Advance
Shankar

Shankar,
Have a look at these threads:-
extracting the data from pdf file to internal table in abap
Adobe Form (data extraction error)
Chintan

How to extract the contents of ZIP file stored in oracle database column ?

The file is in ZIP format and is stored in a BLOB database column. The contents of the file are in binary format.
How can I extract those contents and recreate the same ZIP file as output in ODI?
Thanks
Arun

Perhaps you can do something like what is described in the support note "How To Load A PDF File Or Other Binary Files To a BLOB Column From ODI (Doc ID 1412753.1)".

How to extract binary data from a File?

Hello there!
Hope I'm right here.
I have a slight problem with a file containing text and binary data. At the beginning of the file there is only some text, followed by some HTML code, and at the end there is some binary content such as a PDF or an image (JPG/GIF).
My problem is how to split these parts and save each one in a suitable format. The text and the HTML are no problem, but I don't know how to extract the binary data. I tried it with an InputStream, but I don't know how to find the exact position where the binary data begins.
Anyone have an idea how I can solve this?
Thanks in advance and sorry for my bad english.

I don't believe a bullet-proof solution exists.
If you know where the HTML ends, you may skip whitespace and then assume the rest is binary data. Or, if you know the first bytes of the binary data, you may locate those. With either of these solutions, though, you risk coming across a file where it doesn't work as expected, for example if the first byte of the binary data happens to match a whitespace character.
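
The second suggestion, locating the first bytes of the binary data, could be sketched as follows. PDF files begin with the bytes "%PDF", GIF files with "GIF8", and JPEG files with FF D8 FF, so one of those can serve as the signature. The class name BinarySplitSketch and the sample content are made up for illustration, and the same caveat applies: if the signature also happens to occur inside the text or HTML part, the split lands in the wrong place.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class; a plain byte-pattern search using only the standard library.
class BinarySplitSketch {
    /** Returns the index of the first occurrence of signature in data, or -1 if absent. */
    static int indexOf(byte[] data, byte[] signature) {
        if (signature.length == 0) {
            return 0;
        }
        outer:
        for (int i = 0; i <= data.length - signature.length; i++) {
            for (int j = 0; j < signature.length; j++) {
                if (data[i + j] != signature[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the mixed text/HTML/binary file.
        byte[] file = "some text<html></html>\n%PDF-1.4 ...".getBytes(StandardCharsets.ISO_8859_1);
        byte[] pdfMagic = "%PDF".getBytes(StandardCharsets.ISO_8859_1);
        int start = indexOf(file, pdfMagic);
        if (start >= 0) {
            // Everything from the signature onward is treated as the binary part.
            byte[] binary = Arrays.copyOfRange(file, start, file.length);
            System.out.println("Binary part starts at " + start + ", " + binary.length + " bytes");
        }
    }
}
```

Working on bytes rather than on decoded characters matters here, because reading the whole file through a Reader could alter the binary part before it is saved.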

How do you change the program ALL files of a type open with?

Hello. I run Parallels Desktop for Mac on this system. Since I upgraded to v3, OS X wants to open my Mac Word/Excel/Etc... files using the MS Office installed in windows XP from PD. I don't always have PD running, and do not want it to start up to use the programs on it. I know, not a Mac problem... But, my question is this:
Is there a way to change the program that ALL files of a certain type (ex: pdf, doc, txt, etc...) will start? I have tried the OPT+RT CLICK which says to ALWAYS open this file with xxx.... but that only flags that single file. I want every file that I ever open anywhere on my system/network to use the program I specify. This is simple to do in Win XP, and I'm guessing there's a way to do it in OS X.
Help/information appreciated.
Thank you,
Scott
MacPro Quad Xeon 3.0GHz Mac OS X (10.4.9) 16 GB RAM - 1TB Internal RAID1

Well, with the program Onyx, I've found that LaunchServices database is the culprit. To fix this I just pick a type of file I want to change. Right click the file, choose get info, and change the program it opens with, then click the change all button. Done!
Cool.
Scott

How to extract Oracle data into Excel file?

For a small automation project I have to extract data from a table/
tables and append it to the existing excel file and feed that excel
file to a command that will load data into some other environment. I
am totally new to this. So to get started I wanted to know,
1) How to extract data from sample table Foo which has columns A,B,C
and append these values as new columns to an existing excel say
fooresults.csv ?
2) Can I achieve this in pl/sql script or do I need to write unix or
perl script or some other programing language, please advise?

The "extract data from a table" part is easy, you could do that with VB/ADO, or .NET/ODP.NET. It's then a matter of taking that data and appending it to a spreadsheet that might be the hard part, and how you'd do that exactly is really more of a Microsoft question than an Oracle one.
If you want to be able to do this from the database itself and your database is on Windows, you could use either .NET Stored Procedures (http://www.oracle.com/technology/tech/dotnet/ode/index.html) if you can manipulate the spreadsheet in .NET code, or you could also use Oracle's COM Automation Feature (http://www.oracle.com/technology/tech/windows/com_auto/index.html) if you're handy with the COM object model for Excel.
How you'd do that exactly via either .net or com or vb is the crux of the problem and is something you'd need to know before it turns into an Oracle question, but if you already know how to do that and now just need to figure out a way to do that from Oracle, either of the above might help.
Hope it helps,
Greg

Yosemite 10.10.2 Microsoft Word Opening ALL Files

I just updated to Yosemite 10.10.2. Microsoft Word opens ALL of my Word files when I open the program. I have uninstalled and re-downloaded it. I also went to System Preferences and changed those settings, to no avail.
Each time I open Word, it just pops open so many windows that within 20 seconds the program just crashes and offers to send an error report. I cannot use Word at all. Please help.

What version of Office are you using?
You might try looking/posting here.
Microsoft Support – Office for Mac
Microsoft Support – Office for Mac (2)

How do I clear the " view all files " history in Adobe Reader XI I have recent files set to 1 "View all files " shows all files from 22 june to current date

As I indicated in my question, I'm having a problem clearing the "View all files" history in Adobe Reader XI.

My current version is 11.0.07, and the "View all recent files" selection is found by clicking on File; in the drop-down menu, it is the selection just above the list of recent files.

How can I search and find all files modified since december 1st 2010

I just want to see all the files that have been modified on my Windows Server 2008 since December 1, 2010. I want to see if there is anything unusual. Is this possible without 3rd party software?
I support two Windows Server 2008 machines. On one of them, I can't seem to get any advanced search options until I type at least one character in the name of the file I am searching for (but I want to search all files). Even then, the search is slow, it tells me it did not bring back all results because there are too many, and the results window is too awkward to navigate. On the second server, when I click in the search box, it has buttons I can click, one of which is "Date Modified", but they only have predefined options like "Last week" and "yesterday".
Every version of Windows Explorer gets worse and worse at searching. I rarely use it anymore. I usually install "Search Everything" from voidtools.com. You can download, install, index, and search and get your results more quickly than by finding them in Windows Explorer.

Hi,
This is probably because Windows Search is not installed on the second server. Follow these steps to install it:
1) Start Server Manager
2) Click on Roles in the left navigation pane
3) Select Add Role in the Roles Summary pane to the right
4) Select the File Services role and click Next
5) Select the Windows Search role service and Finish the wizard.
Meanwhile, when you search for a file such as "data", enter the keyword data and click Advanced Search at the bottom of the results box. Then you can select "Date modified" and "is after" December 1, 2010.
Shaon Shan | TechNet Subscriber Support in forum
