Can Java be used to parse Microsoft Word(.doc) files?

Hi guys,
I want to know whether Java can be used to parse Microsoft Word (.doc) files, for example to search for a string or to check for grammatical errors.
Thanks in advance.
Avichal

Hey man, pretty much anything can be done these days.
About your question: a .doc file is not a plain text file. It is a binary format that stores the text alongside formatting data, extra character support and other stuff.
If you ignore those parts and treat it as a plain text file, a keyword search is a much simpler job, though it can miss text that is not stored as plain characters, and it is no help for grammar checking.
Here is some code that searches for keywords in all the doc, txt, pdf and html files
in a given folder and its subfolders. It is a servlet, but you can change it into a normal program.
It first checks each file's extension to see whether it is a doc, pdf, html or txt file. If so, it reads the file line by line,
splits each line into tokens and counts the tokens that contain one of the search strings.
Along with the results, the code below also displays the time taken and the number of matches found in each document.
import java.io.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class search_local extends HttpServlet
{
  public void service( HttpServletRequest _req, HttpServletResponse _res ) throws ServletException, IOException
  {
    long startTime = System.currentTimeMillis();

    //- Resolve the folder to search; getRealPath() returns null if the app is not unpacked on disk
    String rootPath = getServletContext().getRealPath( "/docs" );
    String searchText = _req.getParameter( "search_text" );
    if ( rootPath == null || searchText == null || new File( rootPath ).isDirectory() == false )
    {
      System.out.println( "Invalid directory or missing search_text parameter" );
      _res.setStatus( HttpServletResponse.SC_NO_CONTENT );
      return;
    }
    File RootDir = new File( rootPath );

    //- Split the query into keywords
    Vector<String> kList = new Vector<String>( 3 );
    StringTokenizer st = new StringTokenizer( searchText, "+" );
    while ( st.hasMoreTokens() )
      kList.addElement( st.nextToken().trim() );

    //- Run through list (walks the folder tree one folder at a time)
    Vector<cDirInfo> toBeDone = new Vector<cDirInfo>( 10 );
    Vector<cPage> found = new Vector<cPage>( 10 );
    toBeDone.addElement( new cDirInfo( RootDir, RootDir.list( new htmlFilter() ) ) );
    while ( toBeDone.isEmpty() == false )
    {
      cDirInfo tX = toBeDone.firstElement();
      toBeDone.removeElementAt( 0 );
      if ( tX.dirList == null ) //- list() returns null for folders that cannot be read
        continue;
      for ( int x = 0; x < tX.dirList.length; x++ )
      {
        File newFile = new File( tX.rootDir, tX.dirList[x] );
        if ( newFile.isDirectory() )
          toBeDone.addElement( new cDirInfo( newFile, newFile.list( new htmlFilter() ) ) );
        else
        {
          int freq = searchFile( kList, newFile );
          if ( freq != 0 )
            found.addElement( new cPage( freq, newFile ) );
        }
      }
    }

    long totalTime = System.currentTimeMillis() - startTime;
    formatResults( found, kList, totalTime, rootPath, _res );
  }

  private void formatResults( Vector<cPage> _fList, Vector<String> _kList, long time, String _root, HttpServletResponse _res ) throws IOException
  {
    _res.setContentType( "text/html" );
    PrintWriter Out = new PrintWriter( _res.getOutputStream() );
    Out.println( "<HTML><HEAD><TITLE>Search results</TITLE></HEAD>" );
    Out.println( "<BODY><H3>Search Results</H3> " );
    Out.println( "Keywords: " );
    for ( String key : _kList )
      Out.println( key + " : " );
    Out.println( " <CENTER><HR WIDTH=100%></CENTER> " );
    for ( cPage sPage : _fList )
    {
      //- Turn the file path into a link relative to the docs folder
      String link = sPage.cFile.toString();
      link = "http://localhost/BugFix/docs/" + link.substring( link.indexOf( _root )+_root.length(), link.length() );
      Out.println( "<A HREF=\"" + link + "\">" + sPage.cFile.getName() + "</A>" );
      Out.println( "(" + sPage.freq + ") " );
    }
    if ( _fList.size() == 0 )
      Out.println( "No sites found! " );
    Out.println( " <CENTER><HR WIDTH=100%></CENTER>" );
    Out.println( " Time to complete: " + ((double)time/1000) + " seconds" );
    Out.println( "</BODY></HTML>" );
    Out.flush();
  }

  private int searchFile( Vector<String> _klist, File _filename )
  {
    //- Counts the tokens in one file that contain a keyword.
    //- The file is read as plain single-byte text, so in binary formats (.doc, .pdf)
    //- only text that happens to be stored as plain characters can match.
    int frequency = 0;
    try ( BufferedReader In = new BufferedReader( new InputStreamReader( new FileInputStream( _filename ), "ISO-8859-1" ) ) )
    {
      String LineIn, token;
      boolean bValid = true;
      cLineParse lp;
      while ( (LineIn = In.readLine()) != null )
      {
        lp = new cLineParse( LineIn.toUpperCase() );
        //- Compare by length; the original compared strings with != ""
        while ( (token = lp.nextToken()).length() != 0 )
        {
          //- Stop counting inside anchors, headers and similar tags
          if ( token.indexOf( "<" ) != -1 && (
               token.indexOf( "<A" ) != -1 ||
               token.indexOf( "<HE" ) != -1 ||
               token.indexOf( "<APP" ) != -1 ||
               token.indexOf( "<SER" ) != -1 ||
               token.indexOf( "<TEX" ) != -1 ))
            bValid = false;
          //- Resume counting after the matching closing tag
          else if ( token.indexOf( "<" ) != -1 && (
               token.indexOf( "</A" ) != -1 ||
               token.indexOf( "</HE" ) != -1 ||
               token.indexOf( "</APP" ) != -1 ||
               token.indexOf( "</SER" ) != -1 ||
               token.indexOf( "</TEX" ) != -1 ))
            bValid = true;
          else if ( bValid )
          {
            for ( String key : _klist )
              if ( token.indexOf( key.toUpperCase() ) != -1 )
                frequency++;
          }
        }
      }
    }
    catch( IOException E )
    {
      System.out.println( "Could not read " + _filename + ": " + E.getMessage() );
    }
    return frequency;
  }
}

//----- Supporting classes
class cPage
{
  public int freq;
  public File cFile;
  public cPage( int _freq, File _cFile )
  {
    freq = _freq;
    cFile = _cFile;
  }
}

class htmlFilter implements FilenameFilter
{
  //- Accepts folders plus html, pdf, txt and doc files
  public boolean accept( File dir, String name )
  {
    File tF = new File( dir, name );
    if ( tF.isDirectory() )
      return true;
    int indx = name.lastIndexOf( "." );
    if ( indx == -1 )
      return false;
    String Ext = name.substring( indx+1, name.length() ).toLowerCase();
    return Ext.equals( "html" ) ||
           Ext.equals( "pdf" ) ||
           Ext.equals( "txt" ) ||
           Ext.equals( "doc" );
  }
}

class cDirInfo
{
  public File rootDir;
  public String[] dirList;
  public cDirInfo( File _r, String[] _d )
  {
    rootDir = _r;
    dirList = _d;
  }
}

class cLineParse
{
  String L;
  public cLineParse( String _s )
  {
    L = _s;
  }
  //- Returns the next tag or word from the line, or "" when the line is used up
  public String nextToken()
  {
    String ns = "";
    boolean bStart = false;
    for ( int x=0; x < L.length(); x++ )
    {
      if ( L.charAt(x) == '<' && ns.length() != 0 )
      {
        L = L.substring( x, L.length() );
        return ns;
      }
      else if ( L.charAt(x) == '<' )
      {
        ns = ns + L.charAt( x );
        bStart = true;
      }
      else if ( L.charAt(x) == '>' ||
                L.charAt(x) == '\r' ||
                ( L.charAt(x) == ' ' && bStart == false ) )
      {
        ns = ns + L.charAt( x );
        L = L.substring( x+1, L.length() );
        return ns;
      }
      else
        ns = ns + L.charAt( x );
    }
    L = "";
    return ns;
  }
}
//- End of file
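
If you want the same idea as a normal program rather than a servlet, it could look roughly like the sketch below. This is only a minimal sketch under assumptions: the class name FolderSearch and its methods are made up for illustration, it needs Java 9 or later, and it counts non-overlapping, case-insensitive occurrences of one keyword instead of the token matching used in the servlet. Like the servlet, it treats every file as plain single-byte text, so it will miss text in a .doc or .pdf that is not stored as plain characters.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical standalone counterpart of the servlet above.
final class FolderSearch {
    // Same extensions as the htmlFilter class in the post.
    private static final Set<String> EXTENSIONS = Set.of("doc", "pdf", "html", "txt");

    static boolean hasSearchableExtension(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot != -1
                && EXTENSIONS.contains(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // Counts non-overlapping, case-insensitive occurrences of keyword in text.
    static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        String haystack = text.toUpperCase(Locale.ROOT);
        String needle = keyword.toUpperCase(Locale.ROOT);
        int count = 0;
        int i = haystack.indexOf(needle);
        while (i != -1) {
            count++;
            i = haystack.indexOf(needle, i + needle.length());
        }
        return count;
    }

    // Walks root recursively and returns the hit count for each matching file.
    static Map<Path, Integer> search(Path root, String keyword) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(FolderSearch::hasSearchableExtension)
                        .collect(Collectors.toList());
        }
        Map<Path, Integer> hits = new TreeMap<>();
        for (Path file : files) {
            // ISO-8859-1 maps every byte to one char, so binary files never fail to decode.
            String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
            int n = countOccurrences(text, keyword);
            if (n > 0) {
                hits.put(file, n);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java FolderSearch.java <folder> <keyword>");
            return;
        }
        long start = System.currentTimeMillis();
        search(Paths.get(args[0]), args[1])
                .forEach((file, n) -> System.out.println(file + " (" + n + ")"));
        System.out.println("Time to complete: "
                + (System.currentTimeMillis() - start) / 1000.0 + " seconds");
    }
}
```

Run it with a folder and a keyword as the two arguments. It prints each matching file with its hit count, then the elapsed time, much like the servlet's result page.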

Similar Messages

To JDeveloper Tech Team & All: Displaying Microsoft Word Doc file in JDeveloper

Is there any way we can display a Microsoft Word document file in JDeveloper? I have JDeveloper 3.2.2 on my computer running Windows NT version 4.0 (SP 6). I appreciate your immediate response. Thanks.

In theory, you can display a DOC file in JDeveloper: given a JavaBean capable of reading and displaying DOC files, it could be hooked
into JDeveloper as an Addin.
But, wouldn't it be easier to display it in MS Word / other Doc Viewer?
If DOC files are part of your development,
then you can put a quick Invocation to Word / other tool in your JDeveloper Tools menu by editing the [JDev]\bin\tools.cfg file - See help for more info.
I hope this helps,
if not, please clarify further.
-John

How can save/convert a preview file into a Microsoft Word/.doc file?

Basically, I've opened some email attachments and they've opened in 'Preview'.
I have tried to save them as a .doc file so that I can open them in Microsoft Word, but have been unable to do this.
I have tried 'save as...' but it doesn't come up with a .doc option
I have also tried the 'Open with' in the get info pop up, and have changed it to microsoft word but when I then try to open it, it comes up with a 'Convert File' box.
I have tried some of the options available but none of them work
Am I missing something? or am I going down the wrong route?
Is it even possible to do?
I'm not great at computers and so answers with obvious steps would be great!
Thanks!

Are these PDF files that you're attempting to save as .doc (or .docx) files? Preview won't do it. You can use a number of third-party applications to accomplish what you want, but none of them are that inexpensive.
The least expensive application may be PDFpen - you can save Word documents from PDF files with it, amongst other things.
Good luck,
Clinton

Changed handling of Microsoft Word .doc files in Finder and Quicklook

This is happening on my new MacBook Pro with Retina Display and Mountain Lion 10.8.4.
Starting a week or 10 days ago, the appearance and behavior of .doc files in my Finder changed from the .docx style to the style associated with Word 6.0/95. The application associated with .doc files has always been MS Word 2011. Before this changed, I could preview them in Quick Look and the icons were not changed. Word will open the .doc files if I double-click from the Finder (or open them from the application's Open command) but I can't Quick Look the .doc files. I have tried to change the application using the two Finder methods, namely through the "Get Info" command or by control-clicking on the .doc file icon. .docx files and all versions of other MS Office applications behave normally. This seems to be a problem with .doc files only.
An interesting thing is that if I change the application that opens a .doc file to Pages, the "Kind" designation in the Get Info dialog changes from MS Word 6.0/95 to "Microsoft Word 97 - 2004 document", but changing the associated application back to Word 2011 doesn't fix the problem and the description changes back to Word 6.0/95. There are different recommended applications.
All of this worked fine until a few weeks ago. The problem is on my new MacBook Pro with Retina display. I have an iMac 27 inch, latest model, and this is not happening on that machine. I have tried this in several different user accounts on my MBP and the behavior is the same in each one.
I've been working with this for a week or two, and have tried resetting Quick Look preferences and restarting the process as suggested on Macfixit and elsewhere. But the problem persists. I can't tell if this is a file association problem or a Quick Look problem. Usually I find solutions here but it doesn't look like anyone else is complaining.

Baltwo, thanks for your advice and quick response. Thanks to you, I think I'm making some progress after a lot of frustration. I'm not quite "there" yet, though, so I hope I'm not imposing when I ask for a little more of your help and expertise.
I ran the suggested command (after deleting the space) and got the following in the Terminal window:
lsregister: [OPTIONS] [ <path>... ]
 [ -apps <domain>[,domain]... ]
 [ -libs <domain>[,domain]... ]
 [ -all <domain>[,domain]... ]
Paths are searched for applications to register with the Launch Service database.
Valid domains are "system", "local", "network" and "user". Domains can also
be specified using only the first letter.
-kill Reset the Launch Services database before doing anything else
-seed If database isn't seeded, scan default locations for applications and libraries to register
-lint Print information about plist errors while registering bundles
-convert Register apps found in older LS database files
-lazy n Sleep for n seconds before registering/scanning
-r Recursive directory scan, do not recurse into packages or invisible directories
-R Recursive directory scan, descending into packages and invisible directories
-f force-update registration even if mod date is unchanged
-u unregister instead of register
-v Display progress information
-dump Display full database contents after registration
-h Display this help
It looks like I have reset the Launch Services database but need to register or reregister my applications. Maybe run -seed or -convert? Sorry to be dense about what to do next.

Printing a microsoft word doc using Java Print API

Hi,
I have to print a Microsoft Word doc. I am using the Java Print API, but the code is printing only hash codes instead of the actual document.
Here is the code. Please let me know what's wrong with it.
CODE:::
public String print() throws Exception {
    String realPath = getRealPath("/images/formLibrary/csaAddressContactRequestForm100.doc");
    PrintRequestAttributeSet pras1 = new HashPrintRequestAttributeSet();
    DocFlavor flavor1 = DocFlavor.INPUT_STREAM.AUTOSENSE;
    PrintService defaultService = PrintServiceLookup.lookupDefaultPrintService();
    DocPrintJob job = defaultService.createPrintJob();
    FileInputStream fis1 = new FileInputStream(realPath);
    DocAttributeSet das = new HashDocAttributeSet();
    Doc doc1 = new SimpleDoc(fis1, flavor1, das);
    job.print(doc1, pras1);
    Thread.sleep(10000);
    System.exit(0);
    return "";
}

By using an appropriate library. JText, whatever.
Google, man.

I think Rene meant iText!

Whatever. :) Never used it, I just remembered there was something named like that.

Thanks.

Does anyone know if you can use dictation with Microsoft Word?

I am looking to get Mountain Lion but was just wondering if you can use dictation for Microsoft Word or any other word document program.
Thanks a million!!
Beau

I have personally used it with Word and it does work. I even think I remember reading that you can use it with any program in which you can type. Go to settings on your macbook to make sure it's turned on. Chances are your shortcut to access dictation is pressing the Fn key twice.

How can I change a Microsoft Word document file into a picture file?

How can I change a Microsoft Word document file into a picture or jpeg file? I am wanting to make the image I created my background on my macbook pro.

After I had the document image the way I wanted it, I saved it as a web page and went from there. Below are the steps starting after I did the "save as" option in Word:
1) Select "Save As Web Page". I changed the location from documents to pictures when the window came up to save it as a web page.
2) Go to "Finder" on your main screen, or on your main toolbar at the bottom.
3) Click on the "Pictures" tab and find the file you just re-saved as a web page. (I included "web page" or something similar in the new title so I could easily find the correct file I was looking for)
4) Open the correct file and then "right click" on the actual image. (Use 2 fingers to do so on a Mac)
5) Select "Use Image As Desktop Picture", and voilà! The personally created image, or whatever it is that you wanted, is now your background.
**One problem I encountered while doing this is that the image would show up like it was right-aligned in relation to the whole screen. The only way I could figure how to fix this was to go back to the very original document in Word, (the one before it was saved as a web page), and move everything over to the left.
I hope this helps someone else who was as frustrated as I was with something that I thought would have been very simple to do! If you have any tips or suggestions of your own, please feel free to share. : )

How can i change the language in Microsoft word

Hi,
can you please tell me how I can change the language in Microsoft Word?

You might want to also search/ask in the forums devoted entirely to that product:
http://answers.microsoft.com/en-us/mac/forum/macword

Can you launch and close a Microsoft Word file from Captivate 8.01?

Can you launch and close a Microsoft Word Document from inside your Captivate 8.01 project?
Thank you, L

You can launch it as long as the server has the mime/type, especially if a docx.
Doubt that you can close it.

Can Java be used to manipulate the Windows registry?

Can Java be used to manipulate the Windows registry? If so, can someone point me to some examples?
Thanks!

There is no supplied capability to do this, because the registry is Windows platform specific.
It might be possible by making the call through a native Windoze app.
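
One partial exception worth mentioning: the standard java.util.prefs.Preferences API stores its data in the registry when running on Windows (under a JavaSoft-specific Prefs key), so an application can keep its own settings there without native code. It cannot read or write arbitrary registry keys, so it does not answer the general question. A minimal sketch, assuming a made-up node path and class name (PrefsDemo, /com/example/prefsdemo), might look like this:

```java
import java.util.prefs.Preferences;

// Hypothetical demo class; the node path below is an assumption for illustration.
final class PrefsDemo {
    static final String NODE = "/com/example/prefsdemo";

    // Stores a value in the per-user preferences tree, reads it back, then removes it.
    static String roundTrip(String key, String value) {
        Preferences node = Preferences.userRoot().node(NODE);
        node.put(key, value);
        String readBack = node.get(key, null); // null is the default if the key is missing
        node.remove(key);
        return readBack;
    }

    public static void main(String[] args) {
        System.out.println("read back: " + roundTrip("answer", "42"));
    }
}
```

On other platforms the same code runs unchanged but uses a different backing store, which is the point of the API: it hides the platform-specific part.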

How to load word .doc-files into java applications?

Hi,
I want to load a word .doc-file (or any other type of document) into a java application, as part of the application.
Is there any java special library to do it?
Any solution or hint?
Thank you.

No, there is no existing Java library to do it that I am aware of. You could write your own, however, as the Word 97 binary file format is published. I am not sure whether the Word 2000 or XP formats are published as well, but since they put out the 97 format, presumably they'd do the same with the rest, although I am surprised that Micro$$$oft even published the 97 format.
A library such as this would take quite some time to write, due to the extensive format, but here's a link to a copy of it just so you can see what it'd take: http://www.redbrick.dcu.ie/~bob/Tech/wword8.html
Also keep in mind that Micro$$$oft has never remained binary compatible between versions. In short, this means that you'd have to parse each version's documents differently. Not a fun task.
If a Java library to read Micro$$$oft file formats already does exist, I'm willing to bet that it costs $$$ mucho dinero $$$.
Hey, aren't proprietary file formats just wonderful?
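
To give a feel for the very first step of such a parser: Word 97-2003 .doc files are stored in an OLE2 compound file container, which starts with a fixed 8-byte signature. The sketch below only checks that signature. It is a minimal sketch under assumptions: the class and method names (DocSniffer, looksLikeOle2) are made up, it needs Java 9 or later, and a match only says the file is an OLE2 container (old .xls and .ppt files share the same signature), not that it is a valid Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper; it does not parse any Word structures.
final class DocSniffer {
    // Signature at the start of every OLE2 compound file.
    private static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true if the file begins with the OLE2 signature.
    static boolean looksLikeOle2(Path file) throws IOException {
        byte[] header = new byte[OLE2_SIGNATURE.length];
        try (InputStream in = Files.newInputStream(file)) {
            int read = in.readNBytes(header, 0, header.length);
            return read == header.length && Arrays.equals(header, OLE2_SIGNATURE);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + looksLikeOle2(Paths.get(name)));
        }
    }
}
```

Everything after this check (finding the document stream inside the container, then the text pieces) is where the published format description comes in.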

Parse a word doc

hi,
I have to parse a Word doc and extract only the relevant info. Can anyone please tell me how to do it?
thanks
tulip

You can use FileInputStream or RandomAccessFile.
You'll find plenty of examples of both in this forum itself ;-)
-TC
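
A FileInputStream only hands you raw bytes, so you still have to decide what counts as text. A crude approach, similar to the Unix strings tool, is to keep runs of printable ASCII bytes above some minimum length. The following is a rough sketch under assumptions: the class name PrintableRuns and the minimum length of 4 are arbitrary choices, and text stored as 16-bit Unicode inside the .doc will not be found this way.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it knows nothing about the Word file structure.
final class PrintableRuns {
    // Collects runs of printable ASCII bytes that are at least minLength long.
    static List<String> extract(InputStream in, int minLength) throws IOException {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b >= 0x20 && b <= 0x7E) {
                current.append((char) b);
            } else {
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            for (String run : extract(in, 4)) {
                System.out.println(run);
            }
        }
    }
}
```

Expect a lot of noise (font names, field codes, metadata) mixed in with the real text, so deciding what is "relevant info" still needs filtering on top of this.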

HT204394 how do i put microsoft word docs onto icloud

how do i put microsoft word docs onto icloud so that i can transfer to my mac book

Not really.
You can sign into iCloud.com from a web browser on your PC.
Open Pages and drag the Word document in or click the Gear on the top right and upload.
This will convert it to a Pages document. It will be accessible to you through iCloud and the Pages app on your iOS devices. You can always redownload the file from iCloud as a Word document.
As with any document conversion, this may alter the formatting.

Unable to convert Microsoft Word doc to PDF in Word (there is no response)

I am unable to convert a Microsoft Word doc to PDF in Word (it does not respond) or to create a PDF from a Word doc in Adobe Acrobat X Standard 10.1.1 with all updates installed. I receive a pop-up saying "Missing PDF Maker Files: Do you want to run the installer in Repair Mode". I have done this several times. I have uninstalled and reinstalled the program twice. It still does not work. I'm running Windows 7 Home version and Microsoft Office XP 2002. This is a brand new Acrobat program right out of the box. Suggestions please.

In WORD 2002, I believe you can only print to the Adobe PDF printer. I think that WORD 2003 is the first compatible with AA X. Check out http://kb2.adobe.com/cps/333/333504.html.

Microsoft Word docs are formatted wrong when opened in pages

Hi,
I've been using pages from the very beginning and I am very happy with it (still waiting for "auto-save", though), but I do have one problem:
often when I open word .docs in pages they are formatted wrongly.
This happens especially with tables and pictures in tables.
If the pictures are in one line in MS Word, they often jump to the next line in pages.
Or the margins are wrong or both.
If I open it in Open Office, for example, they are fine.
Is there anything I can do about it?
Will it be better in pages 09?
Frank

Hi, I have a similar problem with Word (.doc) files containing a Table converted to Pages. Having read that Pages can convert from Word, I bought iWork instead of Office for Mac. I need to work on Word documents sent to me as attachments and find that Tables do not survive the conversion to Pages properly. Often the text within a Table is shown but, if a lengthy text (i.e. more than a page), the Table does not continue to the next page and the text becomes hidden below the visible Table. If I cut out some text from the middle of the Table, the missing text is revealed, progressively from the bottom of the Table. In other words, the text is there in the document but cannot be seen if it is more than can be fitted on one page. Any ideas on how to sort this? I am not encouraged that iWork 09 does not appear to help in a similar problem to mine.
