Parsing word documents

Can I use Oracle Text API to parse a word document (stored on a
file server) from a Java program? I want to extract specific
text from this document.
Thanks for your help!

Oracle Text does not provide features for parsing Word documents.
However, if you have an XML version of the document, you can extract sections
using XMLType functions.
The following paper describes the XMLType:
http://www.oracle.com/oramag/oracle/01-nov/index.html?o61xml.html

Similar Messages

Parse ms word document...

Can anyone advise how I can parse an MS Word document in Java, modify it, and then save the file?
Some sample source code would be great.

I'm unable to replace text in the HWPF document; I can only use insertBefore / insertAfter. Is there a way to replace text?
Help appreciated.

Parsing and replacing text items in a word document?

Hi,
I was wondering if someone could point me in a
good direction as to how I would store word documents
in the database as BLOBs but then be able to open
them in plsql and replace tags with values and
save it as a new document???
Thanks in advance!

There are examples in the interMedia User's Guide on loading documents into OrdDoc type objects. Once you do that, you should have a relational interface to the object. Depending on the specific features that the Word plug-in supports, you may be able to edit the document in place via PL/SQL.
I'm more familiar with the OrdImage object, which provides most of the basic image manipulation routines for most image formats. I assume that the OrdDoc object provides a basic document manipulation API as well.
Justin
Distributed Database Consulting, Inc.
http://www.ddbcinc.com/askDDBC

Upload and display Word Document in WD application

Hello,
I have a WD ABAP application where the user wants to upload a Word / Excel file (from their own local drive).
The document should be saved in SAP, and it should also be possible to display the document later in the WD application.
I implemented the upload UI element in the view to determine the path of the document.
For the display, I implemented the Office control UI element.
1. When I browse for the document, the properties data, filename and MIME type are filled into the bound context elements of the upload UI element.
2. I bound the datasource property of the Office control to the same context element that is also bound to the data property of the upload UI element.
The Office control opens a Word document, but the document is empty.
Is it possible that the document is not uploaded correctly?
In another application I implemented an upload for a PDF document. There I implemented the following code as the action of the 'Upload' button.
data lo_nd_pdf type ref to if_wd_context_node.
data lo_el_pdf type ref to if_wd_context_element.
data ls_pdf type wd_this->element_pdf.
data lv_pdf like ls_pdf-pdf.
* navigate from <CONTEXT> to <PDF> via lead selection
lo_nd_pdf = wd_context->get_child_node( name = wd_this->wdctx_pdf ).
* get element via lead selection
lo_el_pdf = lo_nd_pdf->get_element( ).
* get single attribute
lo_el_pdf->get_attribute(
  exporting
    name = `PDF`
  importing
    value = lv_pdf ).
* Get a reference to the form processing class.
data: l_fp type ref to if_fp.
l_fp = cl_fp=>get_reference( ).
* Get a reference to the PDF Object class.
data: l_pdfobj type ref to if_fp_pdf_object.
l_pdfobj = l_fp->create_pdf_object( ).
* set the pdf in the PDF Object
l_pdfobj->set_document( pdfdata = lv_pdf ).
* set the PDF Object to extract the form data.
l_pdfobj->set_extractdata( ).
* execute call to ADS
l_pdfobj->execute( ).
* get the PDF Form data
data: pdf_form_data type xstring.
l_pdfobj->get_data(
  importing
    formdata = pdf_form_data ).
* convert the xstring form data to string so it can be processed using the iXML classes
data: converter type ref to cl_abap_conv_in_ce,
      formxml type string.
converter = cl_abap_conv_in_ce=>create( input = pdf_form_data ).
converter->read(
  importing
    data = formxml ).
* pull in the iXML type group.
type-pools: ixml.
* get a reference to iXML object
data: l_ixml type ref to if_ixml.
l_ixml = cl_ixml=>create( ).
* get iStream object from StreamFactory
data: streamfactory type ref to if_ixml_stream_factory,
      istream type ref to if_ixml_istream.
streamfactory = l_ixml->create_stream_factory( ).
istream = streamfactory->create_istream_string( formxml ).
* create an XML document class that will be used to process the XML
data: document type ref to if_ixml_document.
document = l_ixml->create_document( ).
* create the parser class
data: parser type ref to if_ixml_parser.
parser = l_ixml->create_parser( stream_factory = streamfactory
                                istream = istream
                                document = document ).
* parse the XML
parser->parse( ).
* define XML Node type object
data: node type ref to if_ixml_node,
      attributes type ref to if_ixml_named_node_map.
* get the psi sales data Node and value.
data ls_psi_sales type wd_this->element_psi_sales.
data: lt_dfies type table of dfies,
      ls_defies type dfies.
call function 'DDIF_NAMETAB_GET'
  exporting
    tabname = 'ZCM_PSI_SALES'
*   ALL_TYPES = ' '
*   LFIELDNAME = ' '
*   GROUP_NAMES = ' '
*   UCLEN =
* IMPORTING
*   X030L_WA =
*   DTELINFO_WA =
*   TTYPINFO_WA =
*   DDOBJTYPE =
*   DFIES_WA =
*   LINES_DESCR =
  tables
*   X031L_TAB =
    dfies_tab = lt_dfies
  exceptions
    not_found = 1
    others = 2.
if sy-subrc <> 0.
* MESSAGE ID SY-MSGID TYPE SY-MSGTY NUMBER SY-MSGNO
* WITH SY-MSGV1 SY-MSGV2 SY-MSGV3 SY-MSGV4.
endif.
data: lv_fieldname type string.
field-symbols <fs_field> type any.
loop at lt_dfies into ls_defies.
  lv_fieldname = ls_defies-fieldname.
  node = document->find_from_name( name = lv_fieldname ).
  assign component lv_fieldname
    of structure ls_psi_sales
    to <fs_field>.
  if <fs_field> is assigned.
    <fs_field> = node->get_value( ).
  endif.
endloop.
* WRITE DATA INTO CONTEXT
data lo_nd_psi_sales type ref to if_wd_context_node.
data lo_el_psi_sales type ref to if_wd_context_element.
* navigate from <CONTEXT> to <PSI_SALES> via lead selection
lo_nd_psi_sales = wd_context->get_child_node( name = wd_this->wdctx_psi_sales ).
* get element via lead selection
lo_el_psi_sales = lo_nd_psi_sales->get_element( ).
* set all declared attributes
lo_el_psi_sales->set_static_attributes(
  exporting
    static_attributes = ls_psi_sales ).
Do I also need code like this to upload a Word document?
Which interface / class exists for Word documents? (For PDF upload there is the interface IF_FP.)
How can I save a document in SAP? (As a MIME object? With which method?)
I hope someone can help me!
BR

You can use the fileupload and filedownload uielements.
Check these links:
[File Upload|http://help.sap.com/saphelp_nw70ehp1/helpdata/en/b3/be7941601b1d09e10000000a155106/content.htm]
[File Download|http://help.sap.com/saphelp_nw70ehp1/helpdata/en/09/a5884121a41c09e10000000a155106/content.htm]
When you upload a file and save it in SAP, are you saving it as an xstring?
If yes, follow these steps for the file download:
1. Create a FileDownload UI element in your view.
2. Create an attribute of type xstring.
3. Bind this attribute to the data property of your FileDownload UI element.
4. Because you save the document in xstring format during the file upload, fetch it from your database table and pass the value to the FileDownload, i.e. set the attribute bound to the data property of the FileDownload UI element to the xstring content.

Problem in Downloading Word Document of very huge size

Hi Folks,
I upload a Word document of 5 MB to the server. I hex-encode it before
I upload. The problem is that when I try to download the file from the server, it
keeps parsing the document and doesn't stop. It gets stuck in
the while loop. I am using BufferedInputStream for reading the data and
storing it in a String. The CPU utilization also stays at 100% in Task Manager.
Can any of you suggest a solution? Should I replace BufferedInputStream with
some other class?
Thanks in advance.
-deena

No wonder it takes long or even seems to hang... You're going to use an incredible amount of memory within that loop. First of all you recreate your buffer every time, then you create a new String from it, and then you concatenate it to the other one, which again creates a new String. A StringBuffer would probably help, but it's still a waste.
Try this:
byte inbuffer[] = new byte[BUFSIZE];
// hex encoding doubles the size; when decoding, half the input size is enough
byte outbuffer[] = new byte[BUFSIZE * 2];
int readed;
while ((readed = in.read(inbuffer)) != -1) {
    // or whatever way you do it; assumed to return the number of bytes it produced
    int written = hexEncode(outbuffer, inbuffer, readed);
    out.write(outbuffer, 0, written); // or wherever you want to put the data
}
Anyway, the idea is to reuse one input and one output buffer so memory isn't wasted. This relies on the fact that InputStream.read(byte[]) reads up to as many bytes as the buffer holds and returns the number of bytes actually read, or -1 at the end of the stream.
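
For reference, the buffer-reuse idea above can be sketched as a complete class. This is only an illustration under assumptions: the class name HexStreamCopy, the helper names, the lowercase hex alphabet and the 8192-byte buffer size are all made up for the example and are not taken from the original poster's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: streams data through two fixed buffers instead of
// building ever-growing Strings inside the read loop.
class HexStreamCopy {
    private static final byte[] HEX =
            "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

    // Writes two hex digits per input byte into 'out'; returns bytes produced.
    static int hexEncode(byte[] out, byte[] in, int len) {
        for (int i = 0; i < len; i++) {
            int b = in[i] & 0xFF;
            out[2 * i] = HEX[b >>> 4];
            out[2 * i + 1] = HEX[b & 0x0F];
        }
        return len * 2;
    }

    // Copies 'in' to 'out' hex-encoded, reusing the same two buffers throughout.
    static void copyHexEncoded(InputStream in, OutputStream out) throws IOException {
        byte[] inBuffer = new byte[8192];                   // assumed buffer size
        byte[] outBuffer = new byte[inBuffer.length * 2];   // encoding doubles the size
        int read;
        while ((read = in.read(inBuffer)) != -1) {          // -1 signals end of stream
            int written = hexEncode(outBuffer, inBuffer, read);
            out.write(outBuffer, 0, written);
        }
    }

    // Convenience wrapper for small in-memory inputs.
    static String toHex(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            copyHexEncoded(new ByteArrayInputStream(data), sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(sink.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(toHex("Word".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

For a 5 MB file, copyHexEncoded would be called with a FileInputStream and the servlet or socket output stream; memory use stays at the two buffers regardless of file size.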

Associating requirements in my Word Document with NI Gateway

What do I need to do in order to associate
requirements in my Word Document with NI Gateway? I have created a MS
Word Style defined as Requirement_ID and Requirement_Text and applied
them to their respective parts of the document but when I load the
document in NI Gateway I get no response.

jstrong -
You could customize the Word NIRG Type to add a new Requirement element which uses a regular expression to parse Table information.
When you include requirements in a Table, NIRG parses that information into the intermediate file in the form of |d Requirement_ID REQX: blah, where "|" specifies the beginning of a new cell in the row and d specifies the cell number.
Please take a look at the attached project, which is a modification of the NIRG Word example. You'll notice that I use a custom Type called Word1 to load the Word Example - Requirements.doc file.
Hope this helps!
Manooch H.
National Instruments
Attachments:
Word.zip 44 KB

How to update a template(word document) dynamically.

Hi guys,
I have a requirement to store details dynamically in a template, which is a Word document, and save it on the desktop.
I am able to open a new Word document dynamically.
Please give your suggestions.
Regards,
Rajesh

If you can use Word 2003 files, you can create WordML to create documents dynamically.
Look here for a brief introduction: http://www.xmlw.ie/aboutxml/wordml.htm
I've used WordML in combination with XMLBeans in one project.
It's a very nice combination, since you don't have to parse the full XML, just the part you are interested in.
If you use XMLBeans, for instance, you can base your template on an existing Word file, which saves you the struggle of writing all the entries in the template file by hand.
Combine this with the FileDownload UI element in Web Dynpro to let the user download the filled template to the desktop.
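
As a rough illustration of the template idea, the sketch below fills placeholders in a WordML (Word 2003 XML) template. It deliberately swaps XMLBeans for the JDK's built-in DOM API so that it runs without extra libraries. The class name WordMlTemplateFiller, the ${name} placeholder syntax and the tiny inline template are assumptions made for the example; a real template saved from Word contains far more markup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: replaces ${key} placeholders inside WordML text runs.
class WordMlTemplateFiller {
    // Namespace of the Word 2003 XML (WordML) vocabulary.
    static final String WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    static String fill(String templateXml, Map<String, String> values) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed for the namespace-based lookup below
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(templateXml)));

        // Only the w:t (text) elements are touched; the rest of the document is kept.
        NodeList texts = doc.getElementsByTagNameNS(WORDML_NS, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            Node t = texts.item(i);
            String content = t.getTextContent();
            for (Map.Entry<String, String> e : values.entrySet()) {
                content = content.replace("${" + e.getKey() + "}", e.getValue());
            }
            t.setTextContent(content);
        }

        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String template = "<w:wordDocument xmlns:w=\"" + WORDML_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Dear ${name},</w:t></w:r></w:p>"
                + "</w:body></w:wordDocument>";
        System.out.println(fill(template, Collections.singletonMap("name", "Rajesh")));
    }
}
```

One caveat: Word may split a placeholder across several w:t runs when the template is edited, in which case a per-run replacement like this one will not find it. Typing each placeholder in one go, without changing formatting in the middle, usually keeps it in a single run.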

Parsing Word 2007 with XPath

I am trying to get the following code to parse the text areas out of a Word 2007 document.
     public static void main (String[] args) {
          InputStream docXMLIS = null;
          ZipFile docxFile = null;
          ZipEntry docXML = null;
          try {
               docxFile = new ZipFile(new File("testing.docx"));
               docXML = docxFile.getEntry("word/document.xml");
               docXMLIS = docxFile.getInputStream(docXML);
          } catch (ZipException ze) {
               System.out.println("Zip error.");
          } catch (IOException ioe) {
               System.out.println("IO Error.");
          }
          String text = "";
          try {
               XPathFactory factory = XPathFactory.newInstance();
               XPath xPath = factory.newXPath();
               XPathExpression xPathExpression = null;
               //String expression = "//w:document/descendant::w:t";
               String expression = "//w:document/w:body/w:p/w:r/w:t/text()";
               InputSource inputSource = new InputSource(docXMLIS);
               NodeList nodeList = (NodeList) xPath.evaluate(expression, inputSource, XPathConstants.NODESET);
               for (int i=0; i<nodeList.getLength(); i++)
                    text += nodeList.item(i).getNodeValue();
          } catch (XPathExpressionException xpee) {
               System.out.println("xpee");
          } catch (Exception e) {
               System.out.println("It broke.");
          }
          System.out.println(text);
          System.out.println(text.length());
     }
However, the error-checking printlns are just putting out zeroes. Where is the logic behind this going wrong?
Thanks.

I have the feeling this is related to a bad handling of namespaces: I do not see any declaration of the "w" NamespaceContext in your code.
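
A minimal sketch of what declaring the "w" namespace could look like follows. The class name DocxTextExtractor and the inline sample XML are assumptions for illustration; in the question's code the input stream would be the word/document.xml entry of the .docx file. The namespace URI is the WordprocessingML main namespace used by Word 2007 documents.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: extracts the text runs of a WordprocessingML document.
class DocxTextExtractor {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Maps the prefix "w" used in the XPath expression to the Word namespace.
    static final NamespaceContext WORD_NS = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return W_NS.equals(uri) ? "w" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return W_NS.equals(uri)
                    ? Collections.singletonList("w").iterator()
                    : Collections.<String>emptyIterator();
        }
    };

    static String extractText(InputStream documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // otherwise prefixed steps cannot match
        Document doc = dbf.newDocumentBuilder().parse(documentXml);

        XPath xPath = XPathFactory.newInstance().newXPath();
        xPath.setNamespaceContext(WORD_NS);
        NodeList runs = (NodeList) xPath.evaluate(
                "//w:body//w:t", doc, XPathConstants.NODESET);

        StringBuilder text = new StringBuilder();
        for (int i = 0; i < runs.getLength(); i++) {
            text.append(runs.item(i).getTextContent());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>Word</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        System.out.println(extractText(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

The prefix in the XPath expression does not have to match the prefix used in the file; only the namespace URI has to match, which is why the mapping lives in the NamespaceContext and not in the expression itself.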

Display Word-document in TextPane ?

Is it possible to display a word-document in a JTextPane ?
If yes,how can it be done ?
thanx.

You have to code a class that parses a Word doc file
and then returns the text and the attributes that the
TextPane supports (which I think is only text).

Hi,
you can use all the styles that the StyleConstants class supports in a JTextPane if you use a DefaultStyledDocument in it and set up the styles in a SimpleAttributeSet with the methods of the StyleConstants class. To insert styled text into the JTextPane, use the document's insertString(...) method, to which you can pass the string and the SimpleAttributeSet.
You can also apply these styles during editing with the setParagraphAttributes(...) and setCharacterAttributes(...) methods.
greetings Marsian
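
The approach described in this reply might look roughly like the sketch below. The class name StyledTextDemo and the sample strings are made up for the example, and the Word parsing step itself is not shown: the text and attributes are hard-coded where a Word parser would supply them.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;

// Hypothetical demo: builds a styled document that a JTextPane can display.
class StyledTextDemo {
    static DefaultStyledDocument buildDocument() throws BadLocationException {
        DefaultStyledDocument doc = new DefaultStyledDocument();

        // Attributes for a heading: bold, larger font.
        SimpleAttributeSet heading = new SimpleAttributeSet();
        StyleConstants.setBold(heading, true);
        StyleConstants.setFontSize(heading, 18);

        // Attributes for body text: italic.
        SimpleAttributeSet body = new SimpleAttributeSet();
        StyleConstants.setItalic(body, true);

        // insertString takes the offset, the text and the attributes to apply.
        doc.insertString(doc.getLength(), "Title\n", heading);
        doc.insertString(doc.getLength(), "Body text", body);
        return doc;
    }

    public static void main(String[] args) throws BadLocationException {
        DefaultStyledDocument doc = buildDocument();
        System.out.println(doc.getText(0, doc.getLength()));
    }
}
```

To show the result, the document can be passed to the pane's constructor, for example new JTextPane(StyledTextDemo.buildDocument()), and the pane added to a JScrollPane in a frame.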

Read equations from word document programatically

hi
I think this is a difficult problem, but there must be a solution.
I want to read equations from a Word document using Java. I also tried an XML parser, but I can't find the equation format there.
Can anybody tell me how to extract equations from Word?

Everyone that has posted that you have 2 problems is absolutely correct: you will have to use some 3rd-party product to read the file, or make your own MS Word decoder, a task that MS itself has problems doing reliably between versions. You could set up some DDE or OLE via Java/C and Excel... have fun if that is your chosen path; you'll want to ask in the JNI forum how to get started there.
As far as parsing goes, you have to be able to identify the formula. Is it an Excel formula? If so, you have to be able to copy the formula, since it does not show as text in cells. If it is not a formula, but just text in a cell, then copy it into a String in Java.
Now that you have the String, use String.split or another tokenizer to break the String apart, and then you have to process the tokens. Nobody here is going to write the front end of an interpreter for you to process your string into appropriate logic and tokens.
The task you embark on is not a trivial one, but it can be done. The real question is: "Is it really worth it to you?" Only you can answer that. If you need reference material for parsing, look up discussions of language, compiler, and interpreter development; most of the ones I've run across have excellent discussions on parsing and tokenizing.
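
To make the tokenizing step a little more concrete, here is a small sketch that uses a regular expression in place of String.split. The class name FormulaTokenizer and the token rules (numbers, names or cell references, single-character operators) are assumptions chosen for the example; a real formula grammar would need more cases, and characters that match no rule are silently skipped here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer for simple spreadsheet-style formulas.
class FormulaTokenizer {
    private static final Pattern TOKEN = Pattern.compile(
            "\\d+(?:\\.\\d+)?"              // numbers such as 2 or 2.5
            + "|[A-Za-z_][A-Za-z0-9_]*"     // names and cell references such as SUM or A1
            + "|[-+*/^=(),:]");             // single-character operators and punctuation

    // Returns the tokens in order; whitespace and unknown characters are skipped.
    static List<String> tokenize(String formula) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(formula);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("=SUM(A1:B2) * 2.5"));
    }
}
```

Turning the resulting token list into something that can be evaluated is the interpreter front end mentioned above and is a separate, larger job.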

Writing html page into word document

HI,
I want to write an HTML document into a Word document using Java code.
When I save the web page as a Word document, the fonts are saved properly.
I almost did it with the following code, but the fonts are missing: it just writes all the text in a single font. I want an exact replica of the HTML page in the Word document. Please help me out with some input.
Thanks in advance.
Regards,
Ashok.
public String readContent() {
    String contText = "";
    EditorKit kit = new HTMLEditorKit();
    Document doc = kit.createDefaultDocument();
    // The Document class does not yet
    // handle charsets properly.
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    try {
        // Create a reader on the HTML content.
        Reader rd = getReader(fileName);
        // Parse the HTML.
        kit.read(rd, doc, 0);
        // Iterate through the elements
        // of the HTML document.
        //ElementIterator it = new ElementIterator(doc);
        //javax.swing.text.Element elem;
        //while ((elem = it.next()) != null) {
        //SimpleAttributeSet s = (SimpleAttributeSet)elem.getAttributes().getAttribute(HTML.Tag.A);
        // code to read the content
        int nleft = doc.getLength();
        Segment text = new Segment();
        int offs = 0;
        text.setPartialReturn(true);
        while (nleft > 0) {
            // fill the Segment so that text.count reflects what was read
            doc.getText(offs, nleft, text);
            contText += text.toString();
            System.out.println(contText);
            // do something with text
            nleft -= text.count;
            offs += text.count;
        }
        //if (s != null) {
        //System.out.println(s.getAttribute(HTML.Attribute.HREF));
        //}
        //}
    } catch (Exception e) {
        e.printStackTrace();
        //System.exit(1);
    }
    return contText;
} //end of reading content method

static Reader getReader(String uri)
        throws IOException {
    if (uri.startsWith("http:")) {
        // Retrieve from Internet.
        URLConnection conn = new URL(uri).openConnection();
        return new InputStreamReader(conn.getInputStream());
    } else {
        // Retrieve from file.
        return new FileReader(uri);
    }
}

Maybe I'm missing something here, but where are you doing anything with MS Word? I see you write the text to standard output, but where does it go into a Word document? How do you create the Word document: using JNI, or is it a Java API?

I need to convert a PDF file to a Word document so it can be edited, but the text recognition options do not have the language that I need. How can I convert the file in the language I want?

I need to convert a PDF file to a Word document so it can be edited. But the text recognition options do not have the language that I need. How can I convert the file in the language I want?

Acrobat provides no language translation capability.
Localize the OS, the MS Office applications, Acrobat, etc. to the desired language, then try again.
Alternative: transfer a copy of content into a web based translation service (Bing or Google provides a free service).
Transfer the output into a word processing program that is localized to the appropriate language.
Do cleanup.
Be well...

Error in converting a pdf document to a Word document

I have Adobe Acrobat 6 running under Win XP.
My Panasonic Lumix camera manual is a pdf file that I wish to place on my Kindle 4. I copied the pdf file across but, although Kindle is supposed to be able to handle pdf files, it wouldn't with this one - it simply froze when I opened it in the Kindle. I therefore decided to convert the pdf file to a Word document file and email it to Amazon for a free conversion to a Kindle azw file. I attempted to use Save As in Acrobat to save the pdf file as a doc file but, after a few pages, I got the error message:
Bad pdf; could not read page structure. <Bad pdf; error in processing fonts: cannot find CMap resource file> [26-27].

For what it's worth I pulled in some of the product line's user guide PDFs. Those I looked over were authored with FrameMaker or InDesign.
Both of these support solid PDF output, so from a quick look I would not attribute the core issue to a poorly created PDF.
With that said; two things are evident. They are not Tagged and, with the heavy graphics content, of a healthy file size.
Tagged PDF is more than a little important as this is what provides the essential ingredient for export of PDF content (retaining font info, format, layout, etc.).
A healthy file size associated with the significant graphics content means that what is "under the hood" of the computer in use is significant as export puts a load on these resources.
Example: A local machine having integrated graphics is hard pressed compared to a local machine having a dedicated graphics card with a comfortable amount of onboard RAM.
Due to "design" improvements over the years Acrobat X does a much better job of "export"/"save as" for untagged PDF.
But, "export"/"save as" of a well-formed Tagged PDF trumps.
Be well...

Can no longer see my I-photo photos, when trying to insert photo into Word Document.

I had some computer problems and did upgrade to Snow Leopard from Leopard. Using older MacBook Pro.
Now, when I try to insert a picture into a Word Document, I cannot see my photos!
I am working in a word document, I choose "Insert Photo from my files" and when the window comes up so I can choose the photo to insert... when I select the "photos" choice under the Media section on the Left side of the box that comes up, the box to the right is BLANK, both on the top and bottom! None of my photos come up for selection.
This method to choose photos to insert into documents used to work before... now it is not. After selecting "photos" all my photos would appear in the right box, and I would move through my I-photo events to choose my saved photo and insert it. Yes, photos are showing and present in I-photo. They just won't appear for importing.
This is so frustrating, not to mention time-consuming, trying to figure out what is going on and how to fix it.
Can anyone help? Please?

Oroilore-
I do not see a way to disable the Picture Frame icon. I looked at Settings-Picture Frame, but none of the options turns it off. The only way I can think of would be if all of your photos had been deleted. If there were no photos, you couldn't have a slide show!
One thing to try is to reset (reboot) your iPad. Hold both the Home and Sleep buttons for several seconds until the Apple logo appears. Ignore the "Slide to power off" arrow. The iPad will restart after a couple of minutes. Resetting this way will not hurt anything, and sometimes clears up mysterious problems.
Fred

I am working in Adobe Acrobat 9 Pro and just created a PDF form from an MS Word document. I need to find out how to have a date field in my form which will update automatically. Can someone out there help me?

I am working in Adobe Acrobat 9 Pro and just created a pdf form from a MS Word document. I need to find out how to have a date field in my form which will update automatically.

Update automatically under which circumstances, exactly?
