Parse a word doc

Hi,
I have to parse a Word doc and extract only the relevant info. Can anyone please tell me how to do it?
Thanks,
tulip

You can use FileInputStream or RandomAccessFile.
You'll find plenty of examples of both in this forum ;-)
-TC
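
A minimal sketch of the FileInputStream route, assuming all you want at first is to find out what kind of file you are holding. Binary .doc files are stored in an OLE2 compound-file container and .docx files are ZIP packages, so the first few bytes tell them apart. The class and method names here (WordFormatSniffer, sniff, readHead) are made up for illustration. A match only identifies the container, not that the file is really a Word document.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class WordFormatSniffer {
    // OLE2 compound-file signature (container used by binary .doc files)
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature (container used by .docx files)
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    /** Classifies the leading bytes of a file as "ole2", "zip" or "unknown". */
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) return "ole2";
        if (startsWith(head, ZIP)) return "zip";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    /** Reads up to the first 8 bytes of the file with a plain FileInputStream. */
    static byte[] readHead(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[8];
            int n = 0;
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) break;   // file shorter than 8 bytes
                n += r;
            }
            return Arrays.copyOf(buf, n);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sniff(readHead(args[0])));
    }
}
```

Run it with the path of the document as the only argument. Getting the actual text out of the container is a separate and much bigger job.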

Similar Messages

Can Java be used to parse Microsoft Word (.doc) files?

Hi guys,
I want to know whether Java can be used to parse Microsoft Word (.doc) files, for example to search for a string or to check for grammatical errors.
Thanks in advance.
Avichal

Hey man, anything and everything can be done these days.
About your question: a .doc can be treated like a normal text file with some extra features, extra character support and other stuff (strictly speaking it is a binary format, so expect some garbage around the text).
If you ignore those parts and treat it as a normal text file, it's a much simpler job.
Here is some code that searches for the keywords in all the doc, txt, pdf and html files
in the given folder and its subfolders. It's a servlet, but you can change it into a normal program.
It first checks whether each file is a doc, pdf, html or txt file; if so, it reads the file line by line,
splits each line into tokens and checks every token for the search strings, then displays the result.
Along with the result, the code below also displays the time taken and the number of times the search strings were found in each document.
import java.io.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class search_local extends HttpServlet {
    public void service( HttpServletRequest _req, HttpServletResponse _res ) throws ServletException, IOException {
        long startTime = System.currentTimeMillis();
        // ServletContext.getRealPath replaces the deprecated ServletRequest.getRealPath
        String rootPath = getServletContext().getRealPath( "/docs" );
        File RootDir = ( rootPath == null ) ? null : new File( rootPath );
        if ( RootDir == null || !RootDir.isDirectory() ) {
            System.out.println( "Invalid directory" );
            _res.setStatus( HttpServletResponse.SC_NO_CONTENT );
            return;
        }
        //- Split the "+"-separated search text into keywords
        String searchText = _req.getParameter( "search_text" );
        Vector<String> kList = new Vector<String>( 3 );
        StringTokenizer st = new StringTokenizer( searchText == null ? "" : searchText, "+" );
        while ( st.hasMoreTokens() )
            kList.addElement( st.nextToken().trim() );
        //- Run through list: walk the directory tree breadth-first
        Vector<cDirInfo> toBeDone = new Vector<cDirInfo>( 10 );
        Vector<cPage> found = new Vector<cPage>( 10 );
        toBeDone.addElement( new cDirInfo( RootDir, RootDir.list( new htmlFilter() ) ) );
        while ( !toBeDone.isEmpty() ) {
            cDirInfo tX = toBeDone.firstElement();
            toBeDone.removeElementAt( 0 );
            if ( tX.dirList == null )   // list() returns null if the directory cannot be read
                continue;
            // A bounded loop replaces the original for(;;) that ended on ArrayIndexOutOfBoundsException
            for ( int x = 0; x < tX.dirList.length; x++ ) {
                File newFile = new File( tX.rootDir, tX.dirList[x] );
                if ( newFile.isDirectory() ) {
                    toBeDone.addElement( new cDirInfo( newFile, newFile.list( new htmlFilter() ) ) );
                } else {
                    int freq = searchFile( kList, newFile );
                    if ( freq != 0 )
                        found.addElement( new cPage( freq, newFile ) );
                }
            }
        }
        long totalTime = System.currentTimeMillis() - startTime;
        formatResults( found, kList, totalTime, rootPath, _res );
    }

    private void formatResults( Vector<cPage> _fList, Vector<String> _kList, long time, String _root, HttpServletResponse _res ) throws IOException {
        _res.setContentType( "text/html" );
        PrintWriter Out = new PrintWriter( _res.getOutputStream() );
        Out.println( "<HTML><HEAD><TITLE>Search results</TITLE></HEAD>" );
        Out.println( "<BODY><H3>Search Results</H3> " );
        Out.println( "Keywords: " );
        for ( String keyword : _kList )
            Out.println( keyword + " : " );
        Out.println( " <CENTER><HR WIDTH=100%></CENTER> " );
        for ( cPage sPage : _fList ) {
            //- Turn the file path below the docs root into a link
            String link = sPage.cFile.toString();
            link = "http://localhost/BugFix/docs/" + link.substring( link.indexOf( _root )+_root.length(), link.length() );
            link = link.replace( File.separatorChar, '/' );
            Out.println( "<A HREF=\"" + link + "\">" + sPage.cFile.getName() + "</A>" );
            Out.println( "(" + sPage.freq + ") " );
        }
        if ( _fList.size() == 0 )
            Out.println( "No sites found! " );
        Out.println( " <CENTER><HR WIDTH=100%></CENTER>" );
        Out.println( " Time to complete: " + ((double)time/1000) + " seconds" );
        Out.println( "</BODY></HTML>" );
        Out.flush();
    }

    private int searchFile( Vector<String> _klist, File _filename ) {
        //- Reads the file line by line and counts the tokens that contain a keyword
        int frequency = 0;
        // BufferedReader replaces the deprecated DataInputStream.readLine()
        try ( BufferedReader In = new BufferedReader( new FileReader( _filename ) ) ) {
            String LineIn, token;
            boolean bValid = true;   // false while inside a tag whose text is skipped
            cLineParse lp;
            while ( (LineIn = In.readLine()) != null ) {
                lp = new cLineParse( LineIn.toUpperCase() );
                // isEmpty() instead of != "", which compares references rather than contents
                while ( !(token = lp.nextToken()).isEmpty() ) {
                    if ( token.indexOf( "<" ) != -1 && (
                         token.indexOf( "<A" ) != -1 ||
                         token.indexOf( "<HE" ) != -1 ||
                         token.indexOf( "<APP" ) != -1 ||
                         token.indexOf( "<SER" ) != -1 ||
                         token.indexOf( "<TEX" ) != -1 ) ) {
                        bValid = false;
                    } else if ( token.indexOf( "<" ) != -1 && (
                         token.indexOf( "</A" ) != -1 ||
                         token.indexOf( "</HE" ) != -1 ||
                         token.indexOf( "</APP" ) != -1 ||
                         token.indexOf( "</SER" ) != -1 ||
                         token.indexOf( "</TEX" ) != -1 ) ) {
                        bValid = true;
                    } else if ( bValid ) {
                        for ( String k : _klist ) {
                            String key = k.toUpperCase();
                            if ( token.indexOf( key ) != -1 )
                                frequency++;
                        }
                    }
                }
            }
        } catch( IOException E ) {
            // unreadable files are skipped, as in the original
        }
        return frequency;
    }
}
class cPage {
    public int freq;     // number of keyword hits in the file
    public File cFile;   // the file that was searched
    public cPage( int _freq, File _cFile ) {
        freq = _freq;
        cFile = _cFile;
    }
}
//- End of file
//----- Supporting classes
class htmlFilter implements FilenameFilter {
    //- Accepts directories plus html, pdf, txt and doc files
    public boolean accept( File dir, String name ) {
        File tF = new File( dir, name );
        if ( tF.isDirectory() )
            return true;
        int indx = name.lastIndexOf( "." );
        if ( indx == -1 )
            return false;
        String Ext = name.substring( indx+1, name.length() ).toLowerCase();
        return Ext.equals( "html" ) ||
               Ext.equals( "pdf" ) ||
               Ext.equals( "txt" ) ||
               Ext.equals( "doc" );
    }
}
class cDirInfo {
    public File rootDir;       // the directory itself
    public String[] dirList;   // names of the entries inside it
    public cDirInfo( File _r, String[] _d ) {
        rootDir = _r;
        dirList = _d;
    }
}
class cLineParse {
    String L;   // the part of the line not yet consumed
    public cLineParse( String _s ) {
        L = _s;
    }
    //- Returns the next tag or word, or "" once the line is used up
    public String nextToken() {
        String ns = "";
        boolean bStart = false;   // true while inside a "<...>" tag
        for ( int x=0; x < L.length(); x++ ) {
            if ( L.charAt(x) == '<' && ns.length() != 0 ) {
                L = L.substring( x, L.length() );
                return ns;
            } else if ( L.charAt(x) == '<' ) {
                ns = ns + L.charAt( x );
                bStart = true;
            } else if ( L.charAt(x) == '>' ||
                        L.charAt(x) == '\r' ||
                        ( L.charAt(x) == ' ' && bStart == false ) ) {
                ns = ns + L.charAt( x );
                L = L.substring( x+1, L.length() );
                return ns;
            } else {
                ns = ns + L.charAt( x );
            }
        }
        L = "";
        return ns;
    }
}
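
Since the post says the servlet can be changed into a normal program, here is a rough sketch of what a command-line version might look like. It is not a port of the code above: the class name KeywordSearch and its methods are made up, it counts every occurrence of a keyword instead of counting matching tokens, and it does not skip text inside HTML tags. Files are decoded as ISO-8859-1 so that binary .doc and .pdf content never fails to decode, which also means text stored as UTF-16 or compressed inside those files will not be found.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class KeywordSearch {
    // Same extensions the htmlFilter class accepts
    private static final List<String> EXTENSIONS = Arrays.asList("html", "pdf", "txt", "doc");

    /** Counts case-insensitive, non-overlapping occurrences of all keywords in the text. */
    static int countHits(String text, List<String> keywords) {
        String upper = text.toUpperCase(Locale.ROOT);
        int hits = 0;
        for (String k : keywords) {
            String key = k.toUpperCase(Locale.ROOT);
            if (key.isEmpty()) continue;
            for (int i = upper.indexOf(key); i >= 0; i = upper.indexOf(key, i + key.length())) {
                hits++;
            }
        }
        return hits;
    }

    /** True if the file name ends in one of the searched extensions. */
    static boolean hasWantedExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot >= 0
            && EXTENSIONS.contains(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /** Walks root recursively and returns each matching file with its hit count. */
    static Map<Path, Integer> search(Path root, List<String> keywords) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> hasWantedExtension(p.getFileName().toString()))
                        .collect(Collectors.toList());
        }
        Map<Path, Integer> found = new TreeMap<>();
        for (Path p : files) {
            // ISO-8859-1 maps every byte to one char, so binary files decode without errors
            String text = new String(Files.readAllBytes(p), StandardCharsets.ISO_8859_1);
            int hits = countHits(text, keywords);
            if (hits > 0) found.put(p, hits);
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        List<String> keywords = Arrays.asList(args).subList(1, args.length);
        search(Paths.get(args[0]), keywords)
            .forEach((p, n) -> System.out.println(p + " (" + n + ")"));
        System.out.println("Time to complete: "
            + (System.currentTimeMillis() - start) / 1000.0 + " seconds");
    }
}
```

The first argument is the folder to search and the remaining arguments are the keywords.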

Read contents of Word docs?

Hello all:
I have a directory of word documents that I need to loop
through and read the contents and save various parts of the textual
content into a database.
I've used cfdirectory to loop through the directory and then
cffile action="read" to read the contents of the file into a
variable. However, I have what appears to be binary information
stored before and after the text that is saved in the variable
specified in the cffile tag.
How can I get rid of this so that I'm left with just the text
contained in the Word file?
TIA
Lisa

When you read a binary file you get binary data. Word .doc files are not text files. If you cannot convert the files to txt or at least rtf, you will have to use the Word COM object to parse the file. This is a very problematic solution, as it involves installing MS Word on the server. The trouble is that MS Word is not designed to run on a server, and both Adobe (née Macromedia) and Microsoft warn against doing so.
If you do so, have good access to the server. Anytime your program does something that causes MS Word to ask a question with a dialog box, it is going to send that to the server's screen, lock up, and wait for somebody sitting at the server to answer the dialog. Since it is not a server application, it doesn't understand how to send these to clients in any way.
Now, since you can read some of the text from the binary, you may be able to get it out with regex or other string processing, but that does not sound like fun to me.
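
For what it's worth, the string-processing route could look roughly like the sketch below. It is written in Java rather than CFML, the names (PrintableRuns, extract) are made up, and it simply keeps runs of printable ASCII bytes, much like the Unix strings tool. It assumes the text is stored one byte per character; text that Word stored as UTF-16 will be missed, and some of the runs it returns will be font names and other metadata rather than document text.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class PrintableRuns {
    /** Returns every run of at least minLength printable ASCII characters found in data. */
    static List<String> extract(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;                 // treat the byte as unsigned
            if (c >= 0x20 && c <= 0x7E) {     // space through tilde
                current.append((char) c);
            } else {
                if (current.length() >= minLength) runs.add(current.toString());
                current.setLength(0);         // a binary byte ends the current run
            }
        }
        if (current.length() >= minLength) runs.add(current.toString());
        return runs;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (String run : extract(data, 8)) {
            System.out.println(run);
        }
    }
}
```

A higher minimum length throws away more binary noise but also drops short lines of real text, so the value 8 used in main is only a starting point.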

I need to import a word doc into the film script format

I have a script saved in a Word doc and I want to edit it in Adobe Story. How would I do that?

Hi,
In the Story Projects view, you can click the 'Import' button, choose the Word document, and then choose the Film or TV template as appropriate for the import. If your Word document has been properly formatted (e.g., Scene Heading is of the format INT. SET NAME etc.), Adobe Story will be able to parse the elements correctly and import them without issues.
Thanks
Aurobinda

How to load word .doc-files into java applications?

Hi,
I want to load a word .doc-file (or any other type of document) into a java application, as part of the application.
Is there any java special library to do it?
Any solution or hint?
Thank you.

No, there is no existing Java library to do it that I am aware of. You could write your own, however, as the Word 97 binary file format is published. I am not sure whether the Word 2000 or XP formats are as well, but since they put out the 97 format, presumably they'd do the same with the rest, although I am surprised that Micro$$$oft even published the 97 format.
A library such as this would take quite some time to write, due to the extensive format, but here's a link to a copy of it just so you can see what it'd take: http://www.redbrick.dcu.ie/~bob/Tech/wword8.html
Also keep in mind that Micro$$$oft has never remained binary compatible between versions. In short, this means that you'd have to parse each version's documents differently. Not a fun task.
If a Java library to read Micro$$$oft file formats already does exist, I'm willing to bet that it costs $$$ mucho dinero $$$.
Hey, aren't proprietary file formats just wonderful?

Parsing a txt doc and using the text to put into an arraylist

So I have a sample doc like this:
add name hairy; mass 4; species bird
sort mass
save myaddresses.txt
So I would like to have a Scanner read the above text and parse for the words name, mass and species, storing the word after each one into an ArrayList. There is a command-line argument that names the txt file. I also use a delimiter to separate out the variables.
import java.util.*;
import java.io.*;
public class Animallog {
public static void main (String [] args)throws Exception
Animal a = new Person();
name = a.getname();
mass = a.getmass();
species= a.getspecies();
ArrayList <Animal> list = new ArrayList<Animal>();
i=0;
list.get(i).SetName();
list.get(i).SetMass();
list.get(i).SetSpecies();
i++;
break;
File f = new File (args[0]);
Scanner interactionsinput = new Scanner(f);
interactionsinput.useDelimiter(";");
nameinput = interactionsinput.next();
if (nameinput.equalsIgnoreCase("name"))
massinput = interactionsinput.next();
if (massinput.equalsIgnoreCase("mass"))
speciesinput = interactionsinput.next();
if (speciesinput.equalsIgnoreCase("species"))
So here I'm totally lost. I'm supposed to parse this sample and put the data into the ArrayList as a set of information for a single animal.
}
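
One possible way to approach it, as a sketch only. It assumes every "add" line has the form shown in the sample, that mass is a number, and that a plain Animal class with three fields is enough. The names AnimalLogSketch, parseAdd and apply are made up, and the "save" command is left out. Instead of a ";" delimiter on the Scanner, it reads whole lines and splits each one, which keeps the command word and the fields apart.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Scanner;

final class AnimalLogSketch {
    /** One record built from an "add" line. */
    static final class Animal {
        String name = "";
        double mass;
        String species = "";

        @Override
        public String toString() {
            return name + " " + mass + " " + species;
        }
    }

    /** Parses the part after "add", e.g. "name hairy; mass 4; species bird". */
    static Animal parseAdd(String fields) {
        Animal a = new Animal();
        for (String field : fields.split(";")) {
            String[] kv = field.trim().split("\\s+", 2);   // keyword, then its value
            if (kv.length < 2) continue;
            String key = kv[0];
            String value = kv[1].trim();
            if (key.equalsIgnoreCase("name")) a.name = value;
            else if (key.equalsIgnoreCase("mass")) a.mass = Double.parseDouble(value);
            else if (key.equalsIgnoreCase("species")) a.species = value;
        }
        return a;
    }

    /** Applies one command line to the list; unknown commands are ignored. */
    static void apply(String line, List<Animal> list) {
        String[] parts = line.trim().split("\\s+", 2);
        String command = parts[0];
        String rest = parts.length > 1 ? parts[1].trim() : "";
        if (command.equalsIgnoreCase("add")) {
            list.add(parseAdd(rest));
        } else if (command.equalsIgnoreCase("sort") && rest.equalsIgnoreCase("mass")) {
            list.sort(Comparator.comparingDouble((Animal a) -> a.mass));
        }
    }

    public static void main(String[] args) throws FileNotFoundException {
        List<Animal> list = new ArrayList<>();
        try (Scanner in = new Scanner(new File(args[0]))) {
            while (in.hasNextLine()) {
                apply(in.nextLine(), list);
            }
        }
        list.forEach(System.out::println);
    }
}
```

Sorting by name or species would be another branch in apply with a different comparator.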

:o]
Indeed, and it is important to notice the "like me" he added to it. If he had said Java to be too hard for newbies, there'd be enough proof that it isn't, because then there'd be no experts.

I cannot send a Pages document in Word format via email from my iPad if it has a picture in the doc. I can send a Word doc if it does not have a pic in it, or a PDF with a pic. Any thoughts why a Word doc with a pic in it won't email in Pages? Thanks

I cannot send a Pages document as Word via email if the doc contains a picture. If I email a Pages doc without a pic in Word format, or a PDF with a pic, it will go through. Not sure why Pages won't send a Word doc with a picture in it. I checked the security on the email recipient's side, and the email doesn't get blocked or thrown in the junk box.
thanks,
drainguy41

If you have upgraded to Mountain Lion, Save As… has returned to the File menu when you hold down the Option/alt key. But you don't really "save" as other file types, you translate & export as Word or RTF or text or PDF & that is easily done by going to File > Export or Share > Export.
Also, please do us all a favor & don't use all capitals in your posts, either the body of the post or the title. All caps is the internet equivalent of shouting & is very hard to read.

Some time in the last week or so a particular pair of word (doc) files won't open.

I get the following msgs when I try to open a pair of word (doc)s - "Filename.doc" is being used by "another user". Do you want to make a copy - Y/N
Response of Y gets Word cannot open this document. The document might be in use or might not be a valid word document. I've tried to restore it from time machine created feb 24, but get same result.
Any clues as to how to (a) resolve and (b) what I did to cause this. It is a real PITA!! - Thanks Q

Since your question is related to Microsoft products, I suggest you post your question on their own forums for their Mac software:
http://answers.microsoft.com/en-us/mac

When converting word doc to pdf, my images with text only show the background color of the text box.

I have a Word doc that I am trying to convert to PDF. I have JPEGs with text boxes on top of them on one page. It looks great on the screen, but after I convert to PDF the text boxes only have half the text; the first half of the text box is just white, the background color. If I take the background color out of the text box, the text converts over fine, but I need the background color.
I have tried many things here on the print settings, standard, high quality print, unchecking the compression on the images. Any help?

Thank you for your posting. These forums are specific to the
Acrobat.com website and its set of hosted services, and do not
cover the Acrobat family of desktop products. Please visit the
following forums for any questions related to the Acrobat family of
desktop products:
http://www.adobeforums.com/cgi-bin/webx/.3bbeda8b/

Convert multiple files (word docs) to multiple pdf

Hi,
Is there a script I can use in batch processing to convert about 1500 word documents into pdf? I know it's possible to convert multiple files into one pdf, but this is no good to me. Doing these conversions one by one is going to take forever!
I've tried selecting a few documents at a time and then selecting "convert to pdf" from the right-click menu, but each one requires that you tell it where to save, and then opens the file when it is done.
I need to convert Word files on a regular basis for work, and could really use a batch process for this with so many to do!
I have Acrobat 7 Pro (version 7.1.0) on XP Pro SP 2 at work, and Acrobat 8 Pro (Version 8.1.2) on XP Pro SP 3 at home.
If there is any way to do this via a batch process, I would really appreciate knowing how!!
Apologies if this has been covered in another thread, I searched but couldn't find anything.
Thanks in advance :)

I have Acrobat 9 Pro. I can batch convert Word docs to PDFs by doing this:
1. Open Acrobat Pro. Click File > Create PDF > Batch Create Multiple Files...
2. A window will open prompting you to add files. Click Add Files > Add Files... OR Add Folders... If adding a folder, navigate to it, and click OK to add it to the list. You can also select a bunch of files and drag & drop them into the Add Files window.
3. Once you have all the files listed that you want to convert, click OK. A new window called Output Options will open. In this window, select your preferred settings. For me, I want all the new PDFs to have the same filename and be in the same folder as the Word docs, so I choose these settings:
4. Click OK, and then the batch process will begin running. You will see Word opening and closing. However, you won't have to click Save or anything. You can run it unattended. The process takes a little while, so I usually set up a batch to run, then go to lunch. Once finished, you should have all your new PDFs:
Hope that helps someone!

Using PDF Printer TO Print Multiple Word Docs At Once.

I have just upgraded from Acrobat to Acrobat 3D. In Acrobat 5 I was able to select a batch of Word documents (usually 150+) and use the Send To command to send them to Acrobat Distiller; each document would open one at a time, print, and then close. Now that I have been upgraded to Acrobat 3D, if I use the same Send To command to send a batch of Word docs to the PDF printer, it tries to open all the selected files at once and I get the error message "the command cannot be performed because a dialog box is open...". This is because the first document is open and printing, so the second document, which opened at the same time, cannot run the print command. Can anyone help?

Thanks for the suggestion but...
1. I can't get these scripts to run in 10.4.8
What if you save the script as a stand alone app? Does that make a difference?
2. I haven't spent much time trying since it looks
like they only convert "JPEG", "GIFf", "PICT", "TIFF",
"PDF" & "TEXT". Am I wrong?
That's what it looks like from the "extension-list" - what type of files are you trying to convert?

Attachments (word docs and PDFs) to emails can not be opened by the receivers. Are received as .dat files or application/octet-stream

Attachments I add to emails (word doc or PDFs) can not be opened by the email receivers. Word documents attachments are received with ATT00427.dat (application/octet-stream) or .doc with (application/octet-stream). PDF attached files arrive with .pdf (application/octet-stream). My copy of my sent email has the same attachment extensions.
== This happened ==
Every time Firefox opened

Firefox doesn't do email, it's a web browser.
If you are using Firefox to access web mail, you need to seek support from your service provider or a forum for that service.
If your problem is with Thunderbird, see this forum for support.
[http://www.mozillamessaging.com/en-US/support/]
or this one
[http://forums.mozillazine.org/viewforum.php?f=39]

Creating a link to a word doc

I'm having trouble creating a link to a word doc on my computer.
I select "enable as hyperlink" and then "Link to file".
I select the Word doc on my computer. It doesn't create a link to the file.
Hmmmm... then I tried it with a pdf. Still no link.
Do I need to have a pdf ? The document was created with microsoft word.
However I also tried it with an old Adobe pdf file.. still no go.
Any thoughts? I'm using an old version of iWeb 1.1.2
Thanks.

arf arf wrote:
Do I need to have a pdf ?
No.
arf arf wrote:
The document was created with microsoft word.
That should work too.
NOTE: Links to files aren't testable within iWeb — they become active when published and viewed in a browser.
An alternative approach is to post MS Word docs (or PDFs) to the free Posterous where they'll be automatically presented in a convenient viewer, e.g.
http://dont-panic.posterous.com/pdf-document-example
...Then in iWeb set up a simple text or image hyperlink to that particular Posterous blog entry.

I am trying to get my word doc's to fit PDF pages correctly

I have a website and I rely on my opt-in email lists, where people sign up for my reports, which are sent out in PDF form. I write the reports in a Word 2007 doc and save them as PDF, but the pages don't save as a full page. Sometimes the content only takes up half the width of the page and doesn't go to the top either, as if it was simply centered in the PDF. How can I change this, or what am I doing wrong? If you need to see the PDF, just go to http://www.yourmobilevoice.com and you can download it yourself. It takes a lot of time to write the reports but even longer to try to get them to fit properly.

I am sorry; as you can see, you would need to sign up for the report in
order to get it. Otherwise, Pat, an opt-in email list would not work if people
could just download it. But thanks; if you want me to send it direct to you, I
sure can do that as well.
Thanks Scott

Opening a Word doc

Hello all, I am new to Flex. I have some experience with AS3 (still switching over from AS2).
Anyway, I am making a Flex project where I have a text link in a datagrid. This link needs to open a word document that is on our company network. Currently I'm working on my own personal computer on my C drive (not sure if that affects anything - when i hit Run does it run the application from my C drive? It should still work from there i'd think?) We have a bunch of other drives on our network, and the word doc I'm trying to open is on a "P:" drive.
I'm not sure what I might be doing wrong, but I created a LinkButton and tried 2 different attempts at opening a Word doc.
I tried creating a new URLRequest and navigateToURL to attempt to open the word doc - nothing happens when I click the link. The link button will launch google if i replace my word doc path with the google url.
I tried sending the file path to javascript via ExternalInterface and having javascript open it up. It doesn't work that way either.
Is there something I should know about opening up a Word doc file with Flex? Or possibly is it not working because of where the file resides? Any thoughts?
Thanks
-Michele

Actually, no and yes. No you do not have to re-save all of your Word files. It depends on what you want to do with them.
The file you have open is actually a Pages file which was created by converting the old Word file (which has not been altered at all). If you want to keep it and any changes you have made as a Pages file, you must "save as." The next time you open that file, it will open naturally in Pages. Every time you open the original Word file, since it has not been converted to native Pages, it will go through the same translating process as it did this time.
