Extracting text from .doc,.ppt,.pdf files

How can I extract ASCII text from file types like .doc, .ppt, .pdf, .xls, etc.?
Any tips or hints would be helpful.
Thanks
Rama

Hi, I tried for PDF, but didn't succeed.
The following is for text/Doc files:
<pre>
import java.io.*;

public class Doc {
     public static void main(String[] args) {
          File file = new File("c:\\downloads\\WP2001.doc");
          StringBuffer buff = new StringBuffer();
          // Reads the file as plain text; this suits text files, not binary formats.
          try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
               String line;
               // readLine() returns null at end of file, so no extra read() is needed
               while ((line = reader.readLine()) != null) {
                    buff.append(line).append("\n");
               }
               System.out.println(buff);
          } catch (FileNotFoundException fne) {
               System.out.println("File Not Found: " + fne);
          } catch (IOException ioe) {
               System.out.println("Error reading file: " + ioe);
          }
     }
}
</pre>
pathreading
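
Before reading a file line by line, it can help to check whether it is plain text at all. The following is a minimal sketch, not a complete detector: it assumes that PDF files begin with the ASCII marker "%PDF-" and that legacy .doc/.xls/.ppt files begin with the OLE2 compound-file signature. The class and method names (FormatSniffer, sniff, sniffFile) are made up for this illustration, and readNBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class FormatSniffer {
    // "%PDF-" in ASCII: the marker at the start of a PDF file.
    private static final byte[] PDF_MAGIC = {0x25, 0x50, 0x44, 0x46, 0x2D};
    // Signature of the OLE2 container used by legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    // Classifies a file by its leading bytes: "pdf", "ole2" or "unknown".
    static String sniff(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) return "pdf";
        if (startsWith(head, OLE2_MAGIC)) return "ole2";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Reads at most the first 8 bytes of the file and classifies them.
    static String sniffFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[8];
            int n = in.readNBytes(head, 0, head.length);
            return sniff(Arrays.copyOf(head, n));
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + sniffFile(Paths.get(name)));
        }
    }
}
```

A result of "pdf" or "ole2" means a line-based reader will mostly print garbage, and a format-aware library (PDFBox and iText are both mentioned in the related threads below) is the more realistic route.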

Similar Messages

  • How do I extract pages from a Secured PDF file

    How do I extract pages from a Secured PDF file?

    Adobe would call that hacking, and don't allow discussion of it in this forum. You should contact the copyright holder and see if they are prepared to release the password, or an unsecured document, to you. If it's something made for you like a bank statement you should tell the bank how inconvenient their choices are.

  • Read text from a simple PDF file

    Is it possible to extract text from a simple PDF (Non-Interactive) in ABAP? May be using the class CL_FP_PDF_OBJECT ?
    Let's say I have a pdf document with a couple of lines of text. How can I read the "actual" text in an ABAP program?
    Thanks for your help!

    Of course you can do this, but not using the standard SAP/ABAP/Adobe tools.
    Check the Java library called iText. You may think you don't need Java, of course. If you cannot use Java, you can at least study the code of this library and implement the same thing in ABAP yourself.
    There is no such tool in standard SAP, because why would anyone buy an Interactive Forms licence if it were that easy?

  • Javascript in .PDF's - Extracting text from .doc or .txt

    Hello All,
    I am very new to JavaScript in PDFs, but I seem to find my way around doing miscellaneous work with forms. What I need:
    I need a Form with a Submit button that locates and extracts the text from a file and places it into another field.
    Example:
    on Server:
    one.txt or one.doc
    two.txt or two.doc,
    ...etc
    You type one in the form and submit -- it pulls all of the txt from one.txt off the server and places it into a field.
    Also, if there is any way to do this with tables to avoid multiple files, that would be even better.
    I know I am a newbie, but this would be a game-changer for what I do.
    Thank you.

    Thanks for the advice
    It is accessing a shared file server (among employees) and it is to be a .pdf used in Adobe Acrobat Professional
    Basically I want it to be a form that pulls txt based on what was in the typed box or drop-down menu from a .txt or .doc

  • How do I link to Anchor Text from browser to PDF file exported from InDesign?

    I see that this question was asked previously but not answered:
    http://forums.adobe.com/message/3737541
    I need a way to link to topics (from an HTML page) within a structured PDF file generated from InDesign, i.e. one that has an interactive bookmarks
    panel, that doesn't break the links when the file is updated.
    In particular, I don't want to refer to:
    *  page numbers because the document will repaginate when it is updated, breaking existing links
    * named destinations that are numeric as addition/deletion of topics renumbers the named destinations, breaking the links.
    Is there any option in InDesign that creates named destinations derived from bookmark titles as that would seem to solve my problem?
    I have a similar need, but I have not been able to find an answer anywhere online, even after several hours of research.
    I am on contract with General Electric to develop a UI Style Guide for an international team of developers. My client wants to be able to distribute the Style Guide as a PDF document, but needs to provide links to specific topics when submitting a "Story" in an Agile development environment to the programmers.
    Ideally, I would like to be able to create "named destinations" in InDesign such that I can provide the client with a list of URLs similar to this format:
    file:\\StyleGuide\Test1-20131024a.pdf\#nameddest=AnchorText-01
    However, I can not figure out how to create "named destinations" in the PDF file from the InDesign file in which I have created the Anchor Text in the Hyperlinks tab.
    Alternately, the client would be satisfied with the ability to link to specific bookmarks in the exported PDF file that have been derived from the configuration of the Table of Contents styles, but my research has led me to believe that it is not possible to hyperlink from a browser window to a bookmark in a PDF file (only to specific page numbers, search word lists, named destinations, and comments) as described here: http://blogs.adobe.com/tcs/2011/01/tcs-specific/linking-to-a-page-within-a-pdf-and-more.html
    Having been using InDesign since 1988, when it was PageMaker, I would prefer to use it to provide GE with the most attractive deliverable, but they are now leaning heavily toward having me licensed with Madcap Flare, which has this capability built in:
    http://webhelp.madcapsoftware.com/flare9/Default.htm#Nav_Links/Named_Destinations/Creating_Named_Destinations.htm
    Please help me to find a solution that will allow me to continue using InDesign for this project.
    Thank you!
    Lynne O’Connor
    Technical Writer (contractor)
    GE Oil & Gas - Measurement & Control

    Follow-up post (solution is not available):
    I have just completed a phone call with Gaurev Sethi of Adobe escalated technical support in which I shared my screen to explain and demonstrate the desire to link from an external URL to a specific location within a PDF file that has been exported from InDesign. After 40 minutes, during which I was placed on hold a few times while he consulted with his team, it was determined that the desired functionality is not supported in InDesign and has been identified as a limitation of the software.
    In researching alternate documentation software that supports PDF export, I discovered that Madcap Software supports this functionality in its Flare product:
    http://webhelp.madcapsoftware.com/flare9/Default.htm#Nav_Links/Named_Destinations/Creating_Named_Destinations.htm
    Note that linking via URL to a bookmarked location within a PDF is not supported:
    http://blogs.adobe.com/tcs/2011/01/tcs-specific/linking-to-a-page-within-a-pdf-and-more.html
    More information about the parameters that can be specified when opening PDF files is available here:
    http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_open_parameters.pdf
    Although the above link is for a previous version of Acrobat, I could find no evidence to the contrary in the current SDK documentation.
    -Lynne
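
    For readers who already have named destinations in a PDF, the link format discussed in this thread can be generated programmatically. This is only a sketch: the host example.com, the class name PdfOpenLink and both method names are hypothetical, and it assumes the viewer honors the nameddest and page open parameters described in the Adobe document linked above.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PdfOpenLink {
    // Builds a link whose fragment asks the viewer to jump to a named destination.
    static String toNamedDestination(String host, String path, String destination)
            throws URISyntaxException {
        return new URI("https", host, path, "nameddest=" + destination).toString();
    }

    // Builds a link whose fragment asks the viewer to open a given page number.
    static String toPage(String host, String path, int page) throws URISyntaxException {
        return new URI("https", host, path, "page=" + page).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // example.com is a placeholder host, not a real location of the style guide.
        System.out.println(toNamedDestination(
            "example.com", "/StyleGuide/Test1-20131024a.pdf", "AnchorText-01"));
        System.out.println(toPage("example.com", "/StyleGuide/Test1-20131024a.pdf", 3));
    }
}
```

    Generating the links does not solve the underlying problem in the thread, which is getting stable, non-numeric named destinations into the exported PDF in the first place.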

  • How do i extract pages from a large pdf file?

    I have Windows XP, with Adobe Reader 9 and PDF version 1.4 (Acrobat 5.x) installed. I received a large pdf file (19,000KB) with 25 pages of architectural drawings. I want to extract just 8 pages from that document. Under the Documents Properties Tab, it states page extraction is allowed. How do i do this?
    Morgy1

    I'm not an Acrobat user. But if you have a PDF print driver, or if Acrobat 5 has a print-to-PDF mode, you can tell the PDF print driver to print the page numbers that you want, and it should prompt you to save the selected pages in PDF format. If you don't have a PDF print driver, you can Google PDFCreate, go to its website, and download and install it.

  • How do I copy text from an uploaded pdf file?

    I have a document and want to copy the text of one page into a new document.  How do I proceed to do that?  I am using the Adobe cloud product.
    Thanks in advance.

    Just some observations.
    If PDF page content is an image (and music notes may very well be that in a given PDF), then all that can be exported is the image.
    Only an image editor can "edit" such content.
    As to textual content - once in TXT, RTF, DOC or DOCX yes, expect to have to do "cleanup"
    Of course there's the old school alternative that is always available.
    Paper copy beside you as you transcribe to a fresh word processing file.
    Be well...

  • Extracting text from the iWeb Domain file

    Hi Folks
    We used iWeb to run our travel diary when we were travelling last year, I now want to take all our diary entries and add them into another website I'm building. I'm aware that I can fire up iWeb and just copy and paste the text but with over nine months of diary entries it's going to be a laborious task and quite frankly I'd rather have root canal work!
    Surely I must be able to extract the text or the daily entries from the Domain file, I just don't know how.
    Does anyone out there know or have any ideas how it can be done?
    Many thanks
    Potterman
    17" Powerbook G4   Mac OS X (10.4.9)  

    You need only use a browser to copy text.
    Open the page in a browser and drag the cursor across the text portion. Once highlighted Command-C to copy.
    Use paste Command-V to add the text into the other software or html files. You could even just paste the whole shebang into one large TextEdit file to add at your leisure.
    Even hundreds of pages would only take an hour. It would take longer than that to open iWeb, then each page then copy and switch to another app.
    Of course this will only work if the source really is text. iWeb has a habit of converting text to .png files when Web safe fonts are not used.

  • How can I extract text from PowerPoint files, Word files, PDF files

    hi friends,
    I need to extract text from PowerPoint files, Word files and PDF files for my application. Is it possible to extract the text from those files? If yes, please give a solution to this problem. I would be thankful for any solution.

    My reply would be the same.
    http://forum.java.sun.com/thread.jspa?threadID=676559&tstart=0

  • How to extract text from a PDF file?

    Hello Suners,
    I need to know how to extract text from a PDF file.
    Does anyone know what the character encoding in a PDF file is? When I use an input stream to read the file, it gives encrypted-looking characters, not the original text in the file.
    Are there any procedures I should follow while reading a PDF file?
    File f=new File("D:/File.pdf");
    FileReader fr=new FileReader(f);
    BufferedReader br=new BufferedReader(fr);
    String s=br.readLine();
    Any help will be deeply appreciated.

    jverd wrote:
    > First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading.
    Oops, you are absolutely right, that was a silly mistake to forget that.
    > Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.
    getPageContent returns an array of bytes, so the question will be:
    how do I get text from this array? I was thinking of:
        private void jButton1_actionPerformed(ActionEvent e) {
            PdfReader read;
            StringBuffer buff=new StringBuffer();
            try {
                read = new PdfReader("d:/getjobid2727.pdf");
                read.getMetaData();
                byte[] data=read.getPageContent(1);
                int i=0;
                while(i>-1){
                    buff.append(data);
                    i++;
                }
                String str=buff.toString();
                FileOutputStream fos = new FileOutputStream("D:/test.txt");
                Writer out = new OutputStreamWriter(fos, "UTF8");
                out.write(str);
                out.close();
                read.close();
            } catch (Exception f) {
                f.printStackTrace();
            }
        }
    "D:/test.txt" wasn't created when I ran the program.
    Are my steps right?
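
    Two things seem to go wrong in the loop in the question: buff.append(data) appends the array object itself on every pass rather than decoding its bytes, and the condition i>-1 stays true as i counts up. Beyond that, the bytes of a page's content are drawing operators, not plain text. The sketch below is a deliberately naive illustration of that point, assuming an uncompressed content stream in which text appears as a literal string followed by the Tj operator, as in "(Hello) Tj". The class and method names are hypothetical, and it ignores TJ arrays, hex strings, octal escapes and font encodings, which a library's own text extraction has to handle.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentStreamText {
    // A literal string operand followed by the Tj (show text) operator.
    // Backslash escapes such as "\)" are allowed inside the parentheses.
    private static final Pattern SHOW_TEXT =
        Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    // Returns the literal strings shown with Tj, in order of appearance.
    static List<String> literalStrings(byte[] contentStream) {
        // ISO-8859-1 maps every byte to one char, so no bytes are lost in decoding.
        String ops = new String(contentStream, StandardCharsets.ISO_8859_1);
        List<String> shown = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(ops);
        while (m.find()) {
            // Only the escapes \( \) and \\ are undone here; others are left as-is.
            shown.add(m.group(1).replaceAll("\\\\([()\\\\])", "$1"));
        }
        return shown;
    }

    public static void main(String[] args) {
        // Hand-written sample stream, not taken from a real PDF.
        String sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (PDF \\(text\\)) Tj ET";
        for (String s : literalStrings(sample.getBytes(StandardCharsets.ISO_8859_1))) {
            System.out.println(s);
        }
    }
}
```

    For real documents this approach breaks quickly, which is why the replies in these threads point to dedicated libraries instead of manual parsing.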
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             

  • How to extract text from a PDF file using php?

    How to extract text from a PDF file using php?
    thanks
    fabio

    > Do you know of any other way this can be done?
    There are many ways. But this out of scope of this forum. You can try this forum: http://forum.planetpdf.com/

  • Applescript or workflow to extract text from PDF and rename PDF with the results

    Hi Everyone,
    I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistently, or they are supplied as multi-page PDFs.
    What I need to do is name each PDF with the code which is in the text on the PDF.
    It would work like this in an ideal world:
    1. Split PDF into single pages
    2. Extract text from PDF
    3. Rename PDF using the extracted text
    I'm struggling with part 3!
    I can get a textfile with just the code (using a call to BBEDIT I'm extracting the code)
    I did think about using a variable for the name, but the rename function doesn't let me use variables.

    Hello
    You may also try the following AppleScript script, which is a wrapper around a RubyCocoa script. It will ask you to choose source PDF files and a destination directory. Then it will scan the text of each page of the PDF files for the predefined pattern and save the page as a new PDF file in the destination directory, named with the string matched by the pattern. Pages which do not contain a string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of the script.)
    Currently the regex pattern is set to:
    /HB-.._[0-9]{6}/
    which means HB- followed by two characters and _ and 6 digits.
    Minimally tested under 10.6.8.
    Hope this may help,
    H
    _main()
    on _main()
        script o
            property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
                default location (path to desktop) with multiple selections allowed
            set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
                default location (path to desktop)
            set args to ""
            repeat with a in my aa
                set args to args & a's POSIX path's quoted form & space
            end repeat
            considering numeric strings
                if (system info)'s system version < "10.9" then
                    set ruby to "/usr/bin/ruby"
                else
                    set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
                end if
            end considering
            do shell script ruby & " <<'EOF' - " & args & "
    require 'osx/cocoa'
    include OSX
    require_framework 'PDFKit'
    outdir = ARGV.shift.chomp('/')
    ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
        url = NSURL.fileURLWithPath(f)
        doc = PDFDocument.alloc.initWithURL(url)
        path = doc.documentURL.path
        pcnt = doc.pageCount
        (0 .. (pcnt - 1)).each do |i|
            page = doc.pageAtIndex(i)
            page.string.to_s =~ /HB-.._[0-9]{6}/
            name = $&
            unless name
                puts \"no matching string in page #{i + 1} of #{path}\"
                next # ignore this page
            end
            doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
            unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
                puts \"failed to save page #{i + 1} of #{path}\"
            end
        end
    end
    EOF"
        end script
        tell o to run
    end _main
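
    The match-and-rename step (part 3 of the question) can also be tried on its own, outside AppleScript. The following is a rough sketch that reuses the pattern HB-.._[0-9]{6} from the reply; it assumes the page text has already been extracted by some other means, and the class and method names are invented for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StockCodeRenamer {
    // HB- followed by any two characters, an underscore and six digits.
    private static final Pattern CODE = Pattern.compile("HB-.._[0-9]{6}");

    // Returns the first stock code found in the text, if any.
    static Optional<String> findCode(String pageText) {
        Matcher m = CODE.matcher(pageText);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    // Renames the file to "<code>.pdf" in the same folder; leaves it alone if no code is found.
    static Path renameToCode(Path pdf, String pageText) throws IOException {
        Optional<String> code = findCode(pageText);
        if (!code.isPresent()) {
            return pdf;
        }
        // Files.move without options fails if the target name is already taken.
        return Files.move(pdf, pdf.resolveSibling(code.get() + ".pdf"));
    }

    public static void main(String[] args) throws IOException {
        // Demo on an empty temporary file standing in for a single-page PDF.
        Path dir = Files.createTempDirectory("pdf-rename");
        Path pdf = Files.createFile(dir.resolve("scan-001.pdf"));
        Path renamed = renameToCode(pdf, "Item HB-AB_123456 in stock");
        System.out.println(renamed.getFileName());
    }
}
```

    Because the move fails when two pages yield the same code, duplicates surface as errors rather than silently overwriting an earlier file.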

  • Extract text from pdf

    Hi, is it possible to extract text from a pdf file using the command line to get an output like you would get by using the File menu and then 'Save as text..."?
    I also noticed that in the installation folder there is a small executable called AcroTextExtractor which sounds interesting, but I was unable to figure out how to use it.

    What's wrong with using Automator for this? It certainly seems the easiest. I'm not aware of any built-in AppleScript commands that will do this, but you should also ask on the AppleScript forum under Mac OS Technologies.
    Message was edited by: V.K.

  • How to read/extract text from pdf

    Respected All,
    I want to read/extract text from a PDF. I tried using etymon but did not succeed.
    Could anyone guide me on this?
    Thanks and regards,
    Ajay.

    Thank you very much Abhilshit, PDFBox works for reading pdf.
    Regards,
    Ajay.

  • Extract Text from pdf using C#

    Hi,
    We are solution developers using Acrobat. As we have a requirement to extract text from PDFs using C#, we have downloaded and installed the Adobe SDK. We have found only four examples in C#, and those are used only for viewing PDFs in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?
    Thank you for your help.
    Regards
    kiranmai

    Okay so I went ahead and actually added the text extraction functionality to my own C# application, since this was a requested feature by the client anyhow, which originally we were told to bypass if it wasn't "cut and dry", but it wasn't bad so I went ahead and gave the client the text extraction that they wanted. Decided I'd post the source code here for you. This returns the text from the entire document as a string.
           // Needs System, System.Collections.Generic and System.Reflection,
           // plus a project reference to the Acrobat type library.
           private static string GetText(AcroPDDoc pdDoc)
           {
                AcroPDPage page;
                int pages = pdDoc.GetNumPages();
                string pageText = "";
                for (int i = 0; i < pages; i++)
                {
                    page = (AcroPDPage)pdDoc.AcquirePage(i);
                    object jso, jsNumWords, jsWord;
                    List<string> words = new List<string>();
                    try
                    {
                        jso = pdDoc.GetJSObject();
                        if (jso != null)
                        {
                            object[] args = new object[] { i };
                            jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                            int numWords = Int32.Parse(jsNumWords.ToString());
                            // Word indexes are zero-based, so stop before numWords.
                            for (int j = 0; j < numWords; j++)
                            {
                                object[] argsj = new object[] { i, j, false };
                                jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                                words.Add((string)jsWord);
                            }
                        }
                        foreach (string word in words)
                        {
                            pageText += word;
                        }
                    }
                    catch
                    {
                        // Skip pages whose words cannot be read.
                    }
                }
                return pageText;
           }
