Help: analyze microsoft word docs

I'm trying to write a program to run linguistic analysis on Microsoft Word documents. I understand how to do I/O with plain text files, but for a Word document, how do I deal with the extra formatting? Can any of you Java gurus out there give me some ideas? Thanks in advance!

Thanks ChuckBing! ;-)
Any suggestions for an external program that can convert a Word file to plain text?
Also, what about implementing an editor kit, since editor kits can support a custom text format?
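
For the newer .docx format only (Word 2007 and later), plain text can be pulled out without an external converter, because a .docx file is a ZIP archive whose body text lives in word/document.xml. The sketch below uses only the JDK (Java 11 or later). The class name DocxText and its methods are invented for this illustration, and it reads body paragraphs only: headers, footnotes, tabs and line breaks are ignored, and text boxes may appear twice. Legacy binary .doc files need a real parser such as Apache POI's HWPF module, which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: extracts body text from a .docx using only the JDK.
final class DocxText {
    // Namespace of WordprocessingML elements such as w:p (paragraph) and w:t (text run).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a .docx stream, one paragraph per line. */
    static String extract(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().equals("word/document.xml")) {
                    // readAllBytes() stops at the end of the current ZIP entry.
                    return paragraphs(zip.readAllBytes());
                }
            }
            throw new IllegalArgumentException("not a .docx: word/document.xml is missing");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    private static String paragraphs(byte[] xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted files cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
            StringBuilder out = new StringBuilder();
            NodeList paras = dom.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paras.getLength(); i++) {
                // Concatenate every text run inside this paragraph.
                NodeList runs = ((Element) paras.item(i)).getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < runs.getLength(); j++) {
                    out.append(runs.item(j).getTextContent());
                }
                out.append('\n');
            }
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse word/document.xml", ex);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxText <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            System.out.print(extract(in));
        }
    }
}
```

The returned string can then go through the same plain-text pipeline already used for ordinary text files.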

Similar Messages

HT204394: How do I put Microsoft Word docs onto iCloud?

How do I put Microsoft Word docs onto iCloud so that I can transfer them to my MacBook?

Not really.
You can sign into iCloud.com from a web browser on your PC.
Open Pages and drag the Word document in or click the Gear on the top right and upload.
This will convert it to a Pages document. It will be accessible to you through iCloud and the Pages app on your iOS devices. You can always redownload the file from iCloud as a Word document.
As with any document conversion, this may alter the formatting.

Unable to convert Microsoft Word doc to PDF in Word (there is no response)

I am unable to convert a Microsoft Word doc to PDF in Word (it does not respond) or to create a PDF from a Word doc in Adobe Acrobat X Standard 10.1.1 with all updates installed. I receive a pop-up saying "Missing PDF Maker Files: Do you want to run the installer in Repair Mode". I have done this several times. I have uninstalled and reinstalled the program twice. It still does not work. I'm running Windows 7 Home and Microsoft Office XP 2002. This is a brand new Acrobat program right out of the box. Suggestions, please.

In Word 2002, I believe you can only print to the Adobe PDF printer. I think that Word 2003 is the first version compatible with Acrobat X. Check out http://kb2.adobe.com/cps/333/333504.html.

Convert Microsoft Word docs to Pages?

I love Pages and want to convert a bunch of Microsoft Word docs into Pages documents. The action would open the .doc file in Pages, then save it as a Pages document with the same title.
Is there anything that would let me do this? Thanks.

I've tried that myself, to no avail.
Here's what I've tried - if anyone can twiddle with this to make it actually work, I too would be grateful!
1 find finder items
2 get specified finder items
3 copy finder items (to save originals)
4 launch app (pages)
but then there's no action for creating a new file or saving-as or anything...
thanks and peace-
DW

Printing a microsoft word doc using Java Print API

Hi,
I have to print a Microsoft Word doc. I am using the Java Print API, but the code prints only hash codes instead of the actual document.
Here is the code. Please let me know what's wrong with it.
Code:
public String print() throws Exception {
    String realPath = getRealPath("/images/formLibrary/csaAddressContactRequestForm100.doc");
    PrintRequestAttributeSet pras1 = new HashPrintRequestAttributeSet();
    DocFlavor flavor1 = DocFlavor.INPUT_STREAM.AUTOSENSE;
    PrintService defaultService = PrintServiceLookup.lookupDefaultPrintService();
    DocPrintJob job = defaultService.createPrintJob();
    FileInputStream fis1 = new FileInputStream(realPath);
    DocAttributeSet das = new HashDocAttributeSet();
    Doc doc1 = new SimpleDoc(fis1, flavor1, das);
    job.print(doc1, pras1);
    Thread.sleep(10000);
    System.exit(0);
    return "";
}

By using an appropriate library. JText, whatever.
Google, man.
I think Rene meant iText!
Whatever. :) Never used it, I just remembered there was something named like that.
Thanks.
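
A likely cause of the garbage output is that the posted code streams the raw bytes of the .doc file to the printer with the AUTOSENSE flavor, and the print service does not interpret the Word format. One alternative, sketched below under assumptions, is to hand the file to whatever application the operating system associates with .doc files through java.awt.Desktop (Java 6 and later). The class and method names are invented for this illustration. It needs a desktop session on the machine running the JVM, so it may not be usable from inside a servlet container like the one the question appears to run in.

```java
import java.awt.Desktop;
import java.io.File;
import java.io.IOException;

// Hypothetical helper: delegates printing to the application registered for the file type.
final class WordPrint {
    /**
     * Asks the operating system to print the file with its associated application.
     * Returns false when the file is missing, the platform offers no PRINT action,
     * or the request fails.
     */
    static boolean printWithAssociatedApp(File file) {
        if (!file.isFile()) {
            return false;
        }
        if (!Desktop.isDesktopSupported()) {
            return false; // for example, a headless server
        }
        Desktop desktop = Desktop.getDesktop();
        if (!desktop.isSupported(Desktop.Action.PRINT)) {
            return false;
        }
        try {
            desktop.print(file); // the associated application does the rendering
            return true;
        } catch (IOException | RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java WordPrint <file.doc>");
            return;
        }
        System.out.println(printWithAssociatedApp(new File(args[0])) ? "sent to printer" : "could not print");
    }
}
```

The design choice here is to leave rendering to Word (or another registered application) rather than to teach the Java Print API a format it does not understand.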

Can Java be used to parse Microsoft Word(.doc) files?

Hi guys,
I want to know whether Java can be used to parse Microsoft Word (.doc) files, for example to search for a string or to check for grammatical errors.
Thanks in advance.
Avichal

Hey man, anything and everything can be done these days.
About your question: a .doc file is not plain text, but its text often sits in readable runs among the binary formatting data.
If you ignore those parts and treat it as a normal text file, the job is much simpler, although some text can be missed that way.
Here is code that searches for the keyword in all the .doc, .txt, .pdf and .html files
in the given folder and its subfolders. It is a servlet, but you can change it into a normal program.
It first checks whether each file is a .doc, .pdf, .html or .txt file; if so, it reads the file line by line,
splits each line into tokens and counts the tokens that contain a search string.
Along with the result, the code below also displays the time taken and the number of times the search string was found in each document.
import java.io.*;
import java.util.*;
import java.net.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class search_local extends HttpServlet
{
  public void service( HttpServletRequest _req, HttpServletResponse _res ) throws ServletException, IOException
  {
    long startTime = System.currentTimeMillis();
    File RootDir = new File( _req.getRealPath( "/docs/" ) );
    if ( RootDir.isDirectory() == false )
    {
      System.out.println( "Invalid directory" );
      _res.setStatus( HttpServletResponse.SC_NO_CONTENT );
      return;
    }

    //- Split the "+"-separated query into keywords
    Vector kList = new Vector( 3 );
    StringTokenizer st = new StringTokenizer( _req.getParameter( "search_text" ), "+" );
    while ( st.hasMoreTokens() )
      kList.addElement( st.nextToken().trim() );

    //- Run through list
    Vector toBeDone = new Vector( 10 );
    Vector found = new Vector( 10 );
    String dir[] = RootDir.list( new htmlFilter() );
    cDirInfo tX = new cDirInfo( RootDir, dir );
    toBeDone.addElement( tX );
    while ( toBeDone.isEmpty() == false )
    {
      tX = (cDirInfo)toBeDone.firstElement();
      //- Bounded loop; the original ran until an ArrayIndexOutOfBoundsException
      for ( int x = 0; tX.dirList != null && x < tX.dirList.length; x++ )
      {
        File newFile = new File( tX.rootDir, tX.dirList[x] );
        if ( newFile.isDirectory() )
        {
          //- Queue the subdirectory for a later pass
          File t = new File( tX.rootDir, tX.dirList[x] );
          String a[] = newFile.list( new htmlFilter() );
          toBeDone.addElement( new cDirInfo( t, a ) );
        }
        else
        {
          int freq = searchFile( kList, newFile );
          if ( freq != 0 )
            found.addElement( new cPage( freq, newFile ) );
        }
      }
      toBeDone.removeElementAt(0);
    }
    dir = null;
    long totalTime = System.currentTimeMillis() - startTime;
    formatResults( found, kList, totalTime, _req.getRealPath( "/docs" ), _res );
  }

  private void formatResults( Vector _fList, Vector _kList, long time, String _root, HttpServletResponse _res ) throws IOException
  {
    _res.setContentType("text/html");
    PrintWriter Out = new PrintWriter( _res.getOutputStream() );
    Out.println( "<HTML><HEAD><TITLE>Search results</TITLE></HEAD>" );
    Out.println( "<BODY><H3>Search Results</H3> " );
    Out.println( "Keywords: " );
    Enumeration E = _kList.elements();
    while ( E.hasMoreElements() )
      Out.println( (String)E.nextElement() + " : " );
    Out.println( " <CENTER><HR WIDTH=100%></CENTER> " );
    E = _fList.elements();
    cPage sPage;
    String link;
    while ( E.hasMoreElements() )
    {
      sPage = (cPage)E.nextElement();
      link = sPage.cFile.toString();
      link = "http://localhost/BugFix/docs/" + link.substring( link.indexOf( _root )+_root.length(), link.length() );
      Out.println( "<A HREF=" + link + ">" + sPage.cFile.getName() + "</A>" );
      Out.println( "(" + sPage.freq + ") " );
    }
    if ( _fList.size() == 0 )
      Out.println( "No sites found! ");
    Out.println( " <CENTER><HR WIDTH=100%></CENTER>" );
    Out.println( " Time to complete: " + ((double)time/1000) + " seconds" );
    Out.println( "</BODY></HTML>" );
    Out.flush();
  }

  private int searchFile( Vector _klist, File _filename )
  {
    //- Links the file
    int frequency=0;
    try
    {
      //- BufferedReader replaces the deprecated DataInputStream.readLine()
      BufferedReader In = new BufferedReader( new FileReader( _filename ) );
      String LineIn, token;
      boolean bValid = true;
      Enumeration E;
      cLineParse lp;
      while ( (LineIn = In.readLine()) != null )
      {
        lp = new cLineParse( LineIn.toUpperCase() );
        //- Compare by length; the original compared strings with !=
        while ( (token=lp.nextToken()).length() != 0 )
        {
          //- Stop counting inside these opening tags
          if ( token.indexOf( "<" ) != -1 && (
               token.indexOf( "<A" ) != -1 ||
               token.indexOf( "<HE" ) != -1 ||
               token.indexOf( "<APP" ) != -1 ||
               token.indexOf( "<SER" ) != -1 ||
               token.indexOf( "<TEX" ) != -1 ))
            bValid = false;
          //- Resume counting after the matching closing tags
          else if ( token.indexOf( "<" ) != -1 && (
               token.indexOf( "</A" ) != -1 ||
               token.indexOf( "</HE" ) != -1 ||
               token.indexOf( "</APP" ) != -1 ||
               token.indexOf( "</SER" ) != -1 ||
               token.indexOf( "</TEX" ) != -1 ))
            bValid = true;
          else if ( bValid )
          {
            E = _klist.elements();
            String key;
            while ( E.hasMoreElements() )
            {
              key = ((String)E.nextElement()).toUpperCase();
              if ( token.indexOf( key ) != -1 )
                frequency++;
            }
          }
        }
      }
      In.close();
    }
    catch( IOException E ){}
    return frequency;
  }
}

class cPage extends Object
{
  public int freq;
  public File cFile;
  public cPage( int _freq, File _cFile )
  {
    freq = _freq;
    cFile = _cFile;
  }
}
//- End of file
//----- Supporting classes
class htmlFilter implements FilenameFilter
{
  //- Accepts directories and files ending in html, pdf, txt or doc
  public boolean accept(File dir, String name)
  {
    File tF = new File( dir, name );
    if ( tF.isDirectory() )
      return true;
    int indx = name.lastIndexOf( "." );
    if ( indx == -1 )
      return false;
    String Ext = name.substring( indx+1, name.length() ).toLowerCase();
    if ( Ext.equals( "html" ) ||
         Ext.equals( "pdf" ) ||
         Ext.equals( "txt" ) ||
         Ext.equals( "doc" ) )
      return true;
    return false;
  }
}

class cDirInfo
{
  public File rootDir;
  public String[] dirList;
  public cDirInfo( File _r, String[] _d )
  {
    rootDir = _r;
    dirList = _d;
  }
}

class cLineParse
{
  String L;
  public cLineParse( String _s )
  {
    L = _s;
  }

  //- Returns the next tag or word and removes it from L; "" when L is exhausted
  public String nextToken()
  {
    String ns="";
    boolean bStart = false;
    for ( int x=0; x < L.length(); x++ )
    {
      if ( L.charAt(x) == '<' && ns.length() != 0 )
      {
        L = L.substring( x, L.length() );
        return ns;
      }
      else if ( L.charAt(x) == '<' )
      {
        ns = ns + L.charAt( x );
        bStart = true;
      }
      else if ( L.charAt(x) == '>' ||
                L.charAt(x) == '\r' ||
                ( L.charAt(x) == ' ' && bStart == false ) )
      {
        ns = ns + L.charAt( x );
        L = L.substring( x+1, L.length() );
        return ns;
      }
      else
        ns = ns + L.charAt( x );
    }
    L = "";
    return ns;
  }
}
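
The counting step in the servlet above can also be sketched on its own, apart from the servlet plumbing. This is a simplified illustration, not the poster's code: KeywordCount and its two methods are names invented here, tokens are split on whitespace only (the HTML tag skipping is left out), and it assumes the file content has already been read into a String. It needs Java 9 or later for List.of.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified version of the servlet's filter and counting logic.
final class KeywordCount {
    /** True for the four extensions the servlet's filter accepts. */
    static boolean hasSearchableExtension(String name) {
        int dot = name.lastIndexOf('.');
        if (dot == -1) {
            return false;
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return List.of("html", "pdf", "txt", "doc").contains(ext);
    }

    /**
     * Counts (token, keyword) pairs in which a whitespace-separated token
     * contains the keyword, ignoring case.
     */
    static int countMatches(String text, List<String> keywords) {
        int frequency = 0;
        for (String token : text.toUpperCase(Locale.ROOT).split("\\s+")) {
            for (String key : keywords) {
                if (!key.isEmpty() && token.contains(key.toUpperCase(Locale.ROOT))) {
                    frequency++;
                }
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("The cat and the dog", List.of("the")));
    }
}
```

Because matching is by substring, a keyword such as "the" also counts tokens like "other", which mirrors the indexOf test in the servlet.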

Importing Microsoft Word doc to InDesign with embedded EPS art ~ scaling issue

Hi, my workflow calls for creating content in Microsoft Word 2010 with embedded EPS art, in this case MathType 6.7a math objects. When I import these manuscripts (after saving as Word 97/2003 format) into my Adobe InDesign CS5.5 templates, the embedded inline graphics have been resized. Strangely, InDesign is keeping the container frame at the correct dimensions and then upsizing or downsizing the art inside that box.
When I export a sample graphic from the Word file and unembed the same graphic after importing it into InDesign, the two graphics are different sizes. InDesign might increase the size of one embedded graphic by 400% and then scale it down to 25% inside the anchored picture box, and then in the next anchored picture box it might decrease the size of the art to 50% and scale it to 200%.
I did a test and created a graphic in Adobe Illustrator, saved it as EPS, placed the graphic into a Microsoft Word 2010 document, saved it out to 97/2003 format, imported that doc into an empty Adobe InDesign CS5.5 file. Again! InDesign changed the size of the art and re-scaled it to appear the same.
I've been able to duplicate this issue on InDesign CS4 also, on both Windows XP and Windows 7.
Has anyone else run into this issue? Does anyone know why the InDesign import filter doesn't import inline art in a Word document at 100%? Thanks for the help!

My solution for this is:
1. Place the *.docx (yes, .docx) Word document with MathType equations into InDesign. This sets the correct baseline for the equations, which is very important.
2. Download this ScaleGraphics script: http://in-tools.com/downloads/indesign/scripts/ScaleGraphics.zip
3. Copy the script from the zip folder to C:\Program Files\Adobe\Adobe InDesign CS4\Scripts\Scripts Panel\Samples\JavaScript
4. Open the Scripts palette (Window/Automation/Scripts), find the new script and run it.
Voila, all equations are at 100%...
Script explanation: http://in-tools.com/article/scripts-blog/scale-graphics-script/
In the Links palette the equations have an .eps extension, but these are embedded WMF files, so you can't open them in Photoshop or distill them with Distiller. :-((( I use the Export to PDF option in InDesign to make the PDF file.
If you want to export all these EPS links from InDesign, use this method:
1. In the Links palette, select all links.
2. In the palette menu, choose "Unembed Link".
3. In the dialog that appears, choose "No".
4. Select the folder where you want InDesign to save the files.
5. Press "Select".
6. Now all links are unembedded from the InDesign document into the new folder, and you can edit them with MathType.

Ch@ndra plz help: conversion to word doc

Some people have responded to this post, but it is not working as desired.
The code below works, but when I try to add the table, it does not work...
Ch@ndra, in case you can post the code for creating tables, that would be great.
Someone posted a PDF file as well to achieve this, but the steps and code given in it do not work correctly...
Sorry for bothering you...
Thanks,
Hello,
Any pointers on how to convert a flat file into meaningful Word output?
Approach: I want to add some lines of introduction to the Word document,
then add some data from the flat file,
then some explanation of the data, and then another bunch of data.
Earlier I had posted this question and Ch@ndra replied with some input.
From what I could understand, we can create a new Word file and add some text to it, but how do we combine the selected flat-file data with some text?
Here is the code that Ch@ndra posted for me:
INCLUDE ole2incl.
DATA: word TYPE ole2_object,
documentos TYPE ole2_object,
documento TYPE ole2_object,
selection TYPE ole2_object,
font TYPE ole2_object.
CREATE OBJECT word 'WORD.APPLICATION'.
CALL METHOD OF word 'Documents' = documentos.
CALL METHOD OF documentos 'Add' = documento.
CALL METHOD OF documento 'Activate'.
GET PROPERTY OF word 'Selection' = selection.
GET PROPERTY OF selection 'Font' = font.
SET PROPERTY OF word 'Visible' = 1.
SET PROPERTY OF font 'Name' = 'Arial'.
SET PROPERTY OF font 'Size' = 10.
SET PROPERTY OF font 'Bold' = 1. "or 0
SET PROPERTY OF font 'Underline' = 1. "or 0
CALL METHOD OF selection 'TypeText' EXPORTING #1 = 'Assesment Tool'.
CALL METHOD OF selection 'TypeParagraph'.
CALL METHOD OF documento 'SaveAs' EXPORTING #1 = 'c:\test.doc'.
CALL METHOD OF word 'Quit'.
Are any other commands available to convert the data into a table and other forms?
Thanks

If you have adobe X, simply go to "File" -> "Save-As" and Microsoft Word is an option. Also, there is a download (paid) called Able2Extract that will allow you to convert pdfs to word, excel, etc. but the formatting is often lost. If you get the premium package, you can even convert scanned documents. I've used Adobe X to do the conversion, and it retains headers, footers and all formatting.
I can also share an article about converting PDF files to Word (.doc); you just need to follow the easy guide.
Hope it can help you a lot.

How to store microsoft word doc in SAP

Hello friends,
I have a text editor defined on my screen. I want to store a Microsoft Word document (.doc) in it, and I also want a function so that I can load a .doc into it. Is there any way I can do this?
Your help will be greatly appreciated
Regards,
Arpit

It doesn't matter. All the user wants is to upload a Word document into the editor box and display its contents as is. But what is basically happening is that the CL_GUI_TEXTEDIT class I am currently using only supports plain text and shows all other characters as garbage.
So is there any way?
Thanks for your reply.
Arpit

Changed handling of Microsoft Word .doc files in Finder and Quicklook

This is happening on my new MacBook Pro with Retina Display and Mountain Lion 10.8.4.
Starting a week or 10 days ago, the appearance and behavior of .doc files in my Finder changed from the .docx style to the style associated with Word 6.0/95. The application associated with .doc files has always been MS Word 2011. Before this changed, I could preview them in Quick Look and the icons were not changed. Word will open the .doc files if I double-click from the Finder (or open them from the application's Open command), but I can't Quick Look the .doc files. I have tried to change the application using the two Finder methods, namely through the "Get Info" command or by control-clicking on the .doc file icon. The .docx files and all versions of other MS Office applications behave normally. This seems to be a problem with .doc files only.
An interesting thing is that if I change the application used to open a .doc file to Pages, the "Kind" designation in the Get Info dialog changes from MS Word 6.0/95 to "Microsoft Word 97 - 2004 document", but changing the associated application back to Word 2011 doesn't fix the problem, and the description changes back to Word 6.0/95. There are different recommended applications.
All of this worked fine until a few weeks ago. The problem is on my new MacBook Pro with Retina display. I have an iMac 27 inch, latest model, and this is not happening on that machine. I have tried this in several different user accounts on my MBP and the behavior is the same in each one.
I've been working with this for a week or two, and have tried resetting Quick Look preferences and restarting the process as suggested on Macfixit and elsewhere. But the problem persists. I can't tell if this is a file association problem or a Quick Look problem. Usually I find solutions here, but it doesn't look like anyone else is complaining.

Baltwo, thanks for your advice and quick response. Thanks to you, I think I'm making some progress after a lot of frustration. I'm not quite "there" yet, though, so I hope I'm not imposing when I ask for a little more of your help and expertise.
I ran the suggested command (after deleting the space) and got the following in the Terminal window:
lsregister: [OPTIONS] [ <path>... ]
 [ -apps <domain>[,domain]... ]
 [ -libs <domain>[,domain]... ]
 [ -all <domain>[,domain]... ]
Paths are searched for applications to register with the Launch Service database.
Valid domains are "system", "local", "network" and "user". Domains can also
be specified using only the first letter.
-kill Reset the Launch Services database before doing anything else
-seed If database isn't seeded, scan default locations for applications and libraries to register
-lint Print information about plist errors while registering bundles
-convert Register apps found in older LS database files
-lazy n Sleep for n seconds before registering/scanning
-r Recursive directory scan, do not recurse into packages or invisible directories
-R Recursive directory scan, descending into packages and invisible directories
-f force-update registration even if mod date is unchanged
-u unregister instead of register
-v Display progress information
-dump Display full database contents after registration
-h Display this help
It looks like I have reset the Launch Services database but need to register or reregister my applications. Maybe run -seed or -convert? Sorry to be dense about what to do next.

IMac won't print microsoft word doc "open with" apple pages

When it's set to open with Pages, it just freezes the Pages program and won't print. If I set it to open with TextEdit or some other program, then it will print. Note that if I open it first with Pages, then it will print from within Pages. It just won't print using File > Print from the Apple menu. What's the fix for this? Currently I have it set to open all Word docs in LibreOffice in order to print docs with ease, but I would rather use the Apple Pages app.

I felt I had to come and help in case your problem was not solved.
I have a customer who just came up with this problem. He is using Office 2003 on Windows 7 Home Edition 64-bit, along with a Color LaserJet 2600n. We had just reinstalled his software on a new computer when we started getting "Current Printer Is Unavailable" in Excel. Trying to access Page Setup gave us the same message, and then we had a dialog listing every computer available.
We produced a PDF and tried to print it from Acrobat Reader; it crashed the same way you mentioned.
Here's what happened:
We first installed the driver available from the HP web site.
We did some updates from the Microsoft web site. There was an optional hardware update for the LaserJet 2600n, so we installed it. This update is completely {Language Filter Evasion}
Uninstall your printer driver (well, just remove it from your printer list).
Then reinstall it. In our case, it was accessed through IP.
When it was installing, it asked us whether to keep the same driver or REPLACE the driver... pick REPLACE the driver!
And you are good to go...
Paul Champoux... from Sherbrooke? Are you working with Brian?

To JDeveloper Tech Team & All: Displaying Microsoft Word Doc file in JDeveloper

Is there any way we can display a Microsoft Word document file in JDeveloper? I have JDeveloper 3.2.2 on my computer running Windows NT version 4.0 (SP 6). I would appreciate your immediate response. Thanks.

In theory, you can display a DOC file in JDeveloper: given the correct JavaBean capable of reading and displaying DOC files, it could be hooked into JDeveloper as an Addin.
But wouldn't it be easier to display it in MS Word or another DOC viewer?
If DOC files are part of your development, you can put a quick invocation of Word or another tool in your JDeveloper Tools menu by editing the [JDev]\bin\tools.cfg file. See the help for more info.
I hope this helps,
if not, please clarify further.
-John

How do I convert a PDF file to a Microsoft Word doc?

How do I convert a PDF file to a Microsoft Word doc?

Please see our Getting Started guide for ExportPDF: http://forums.adobe.com/docs/DOC-2412

Microsoft Word docs are formatted wrong when opened in pages

Hi,
I've been using Pages from the very beginning and I am very happy with it (still waiting for "auto-save", though), but I do have one problem:
often when I open Word .docs in Pages they are formatted wrongly.
This happens especially with tables and pictures in tables.
If the pictures are in one line in MS Word, they often jump to the next line in pages.
Or the margins are wrong or both.
If I open it in Open Office, for example, they are fine.
Is there anything I can do about it?
Will it be better in pages 09?
Frank

Hi, I have a similar problem with Word (.doc) files containing a Table converted to Pages. Having read that Pages can convert from Word, I bought iWork instead of Office for Mac. I need to work on Word documents sent to me as attachments and find that Tables do not survive the conversion to Pages properly. Often the text within a Table is shown but, if a lengthy text (i.e. more than a page), the Table does not continue to the next page and the text becomes hidden below the visible Table. If I cut out some text from the middle of the Table, the missing text is revealed, progressively from the bottom of the Table. In other words, the text is there in the document but cannot be seen if it is more than can be fitted on one page. Any ideas on how to sort this? I am not encouraged that iWork 09 does not appear to help in a similar problem to mine.

All pdf's are listed as Microsoft Word docs

Is this normal?
Every time I open a local PDF file using Adobe Digital Editions, the application lists it as "Microsoft Word" and lists the file extension as ".doc"... what's up with that?
It's not a file association issue, as PDF is assigned to Adobe Reader.

On a Windows system, disable the preview pane in Windows Explorer.
