Extracting text from PDF files produced by Oracle Reports

Hi,
I am currently using Report Builder 9.0.4.0.21 to produce reports in PDF format.
The PDF reports display on screen and print correctly.
However, copying and pasting from a PDF report into a text editor produces
garbage characters. I also failed to extract the text using any of the available Adobe
plug-ins. I know that the PDF report uses font subsetting with a custom
encoding. I have already read the PDF reference manual, and it seems that
the PDF report is missing the mapping tables needed to convert the custom encoding
used in the report back to ANSI or Unicode.
Is there a solution to this problem?
Are there any environment variables or settings that I am missing?
Your help is really appreciated.

Hello,
Your problem may be related to a limitation in the PDF output generated by Reports 9.0.2 / 9.0.4 when font subsetting is used:
Font Subsetting Creates PDF Output not Searchable with Acrobat Reader (Doc ID 311345.1)
This limitation no longer exists in Reports 10.1.2 / 11.1.
Regards
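
For readers wondering what the "missing mapping tables" are: a PDF font with a custom encoding can carry a ToUnicode CMap that maps each character code back to Unicode, and copy-and-paste relies on it. Below is a minimal Java sketch of that idea, assuming a simplified CMap that contains only bfchar entries. The class name ToUnicodeSketch and the sample codes are made up for illustration; they are not taken from an actual Oracle Reports file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Oracle Reports or any PDF library.
class ToUnicodeSketch {

    // Collects the "<code> <unicode>" hex pairs found between beginbfchar and endbfchar.
    static Map<Integer, String> parseBfChars(String cmap) {
        Map<Integer, String> table = new HashMap<>();
        Matcher section = Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL).matcher(cmap);
        Pattern pair = Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");
        while (section.find()) {
            Matcher m = pair.matcher(section.group(1));
            while (m.find()) {
                table.put(Integer.parseInt(m.group(1), 16), utf16beHexToString(m.group(2)));
            }
        }
        return table;
    }

    // The destination value is UTF-16BE written as hex, four hex digits per code unit.
    static String utf16beHexToString(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    // Codes without a table entry become U+FFFD, which is roughly the "garbage" case.
    static String decode(int[] codes, Map<Integer, String> table) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(table.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: code 01 stands for "H", code 02 for "i".
        String cmap = "2 beginbfchar\n<01> <0048>\n<02> <0069>\nendbfchar\n";
        int[] codes = {1, 2};
        Map<Integer, String> noTable = new HashMap<>();
        System.out.println(decode(codes, parseBfChars(cmap))); // prints "Hi"
        System.out.println(decode(codes, noTable));            // prints two replacement characters
    }
}
```

With the table present the codes decode to readable text; without it there is nothing to translate the subset font's private codes, which matches the symptom described above.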

Similar Messages

  • Extracting text from pdf file

    Hi All
    I want to extract only the text from a PDF file.
    I am trying to extract text from a PDF file using PDFBox, but I am getting an error. My code is:
    /*
     * Main.java
     * Created on den 10 september 2007, 23:01
     * To change this template, choose Tools | Template Manager
     * and open the template in the editor.
     */
    package extracttext;

    import org.pdfbox.exceptions.InvalidPasswordException;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class Main {

        /**
         * @param args the command line arguments
         */
        public static void main(String[] args) throws Exception {
            int startPage = 1;
            int endPage = Integer.MAX_VALUE;
            PDDocument document = null;
            try {
                document = PDDocument.load("C:\\thesis\\fileread\\sim.pdf");
                if (document.isEncrypted()) {
                    try {
                        // Try the empty password first.
                        document.decrypt("");
                    } catch (InvalidPasswordException e) {
                        System.err.println("Error: Document is encrypted with a password.");
                        System.exit(1);
                    }
                }
                // Extract the text of all pages, ordered by position on the page.
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setSortByPosition(true);
                stripper.setStartPage(startPage);
                stripper.setEndPage(endPage);
                System.out.println("Text: " + stripper.getText(document));
            } finally {
                // Always release the document, even if extraction fails.
                if (document != null) {
                    document.close();
                }
            }
        }
    }
    Can anybody please help me solve this problem?
    Regards,
    UK

    I get the following error message:
    Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
    at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
    at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
    at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at extracttext.Main.main(Main.java:55)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 1 second)
    I would appreciate it if you could help me write a Java program that extracts only the text from a PDF file.
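
    A NoClassDefFoundError for org/fontbox/afm/FontMetric means the class was not found at run time, which usually points to the FontBox jar being absent from the runtime classpath. The following is a minimal sketch for checking this before running the extraction. ClasspathCheck is a made-up helper, not part of PDFBox, and which jar supplies each class is an assumption based on the package names.

```java
// Hypothetical helper for illustration; not part of PDFBox or FontBox.
class ClasspathCheck {

    // Returns true if the named class can be located on the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.pdfbox.pdmodel.PDDocument", // assumed to come from the PDFBox jar
            "org.fontbox.afm.FontMetric"     // the class named in the error; assumed to come from the FontBox jar
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

    If the second line reports MISSING, adding the FontBox jar next to the PDFBox jar on the classpath (or in the IDE project libraries) is the first thing to try.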

  • Applescript or workflow to extract text from PDF and rename PDF with the results

    Hi Everyone,
    I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistently, or they are supplied as multi-page PDFs.
    What I need to do is name each PDF with the code which is in the text on the PDF.
    It would work like this in an ideal world:
    1. Split PDF into single pages
    2. Extract text from PDF
    3. Rename PDF using the extracted text
    I'm struggling with part 3!
    I can get a text file with just the code (I'm extracting the code using a call to BBEdit).
    I did think about using a variable for the name, but the rename function doesn't let me use variables.

    Hello
    You may also try the following AppleScript, which is a wrapper around a RubyCocoa script. It will ask you to choose source PDF files and a destination directory. Then it will scan the text of each page of the PDF files for the predefined pattern and save the page as a new PDF file, named with the string extracted by the pattern, in the destination directory. Pages which do not contain a string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of the script.)
    Currently the regex pattern is set to:
    /HB-.._[0-9]{6}/
    which means HB- followed by two characters and _ and 6 digits.
    Minimally tested under 10.6.8.
    Hope this may help,
    H
    _main()
    on _main()
        script o
            property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
                default location (path to desktop) with multiple selections allowed
            set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
                default location (path to desktop)
            set args to ""
            repeat with a in my aa
                set args to args & a's POSIX path's quoted form & space
            end repeat
            considering numeric strings
                if (system info)'s system version < "10.9" then
                    set ruby to "/usr/bin/ruby"
                else
                    set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
                end if
            end considering
            do shell script ruby & " <<'EOF' - " & args & "
    require 'osx/cocoa'
    include OSX
    require_framework 'PDFKit'
    outdir = ARGV.shift.chomp('/')
    ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
        url = NSURL.fileURLWithPath(f)
        doc = PDFDocument.alloc.initWithURL(url)
        path = doc.documentURL.path
        pcnt = doc.pageCount
        (0 .. (pcnt - 1)).each do |i|
            page = doc.pageAtIndex(i)
            page.string.to_s =~ /HB-.._[0-9]{6}/
            name = $&
            unless name
                puts \"no matching string in page #{i + 1} of #{path}\"
                next # ignore this page
            end
            doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
            unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
                puts \"failed to save page #{i + 1} of #{path}\"
            end
        end
    end
    EOF"
        end script
        tell o to run
    end _main
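
    Before running the script on hundreds of files, the pattern itself can be tried in isolation. Here is a minimal Java sketch using the same expression as the script above; the class StockCodeSketch and the sample strings are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class for trying out the stock-code pattern.
class StockCodeSketch {

    // Same pattern as in the script: "HB-", any two characters, "_", then six digits.
    private static final Pattern CODE = Pattern.compile("HB-.._[0-9]{6}");

    // Returns the first stock code found in the page text, if any.
    static Optional<String> firstCode(String pageText) {
        Matcher m = CODE.matcher(pageText);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstCode("Item HB-AB_123456 in stock").orElse("no match")); // prints "HB-AB_123456"
        System.out.println(firstCode("Item HB-AB_12345 only").orElse("no match"));      // prints "no match"
    }
}
```

    If your codes use a different prefix or length, adjust the pattern here first and then copy it into the script.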

  • How to read/extract text from pdf

    Respected All,
    I want to read/extract text from a PDF. I tried using Etymon but did not succeed.
    Could anyone guide me on this?
    Thanks and regards,
    Ajay.

    Thank you very much Abhilshit, PDFBox works for reading pdf.
    Regards,
    Ajay.

  • Extract Text from pdf using C#

    Hi,
    We are solution developers using Acrobat. As we have a requirement to extract text from PDFs using C#, we downloaded and installed the Adobe SDK. We found only four examples in C#, and those only show how to view a PDF in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?
    Thank you for your help.
    Regards
    kiranmai

    Okay so I went ahead and actually added the text extraction functionality to my own C# application, since this was a requested feature by the client anyhow, which originally we were told to bypass if it wasn't "cut and dry", but it wasn't bad so I went ahead and gave the client the text extraction that they wanted. Decided I'd post the source code here for you. This returns the text from the entire document as a string.
           // Requires: using System; using System.Collections.Generic; using System.Reflection; using Acrobat;
           private static string GetText(AcroPDDoc pdDoc)
           {
               AcroPDPage page;
               int pages = pdDoc.GetNumPages();
               string pageText = "";
               for (int i = 0; i < pages; i++)
               {
                   page = (AcroPDPage)pdDoc.AcquirePage(i);
                   object jso, jsNumWords, jsWord;
                   List<string> words = new List<string>();
                   try
                   {
                       jso = pdDoc.GetJSObject();
                       if (jso != null)
                       {
                           object[] args = new object[] { i };
                           jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                           int numWords = Int32.Parse(jsNumWords.ToString());
                           // Word indices are zero-based, so stop before numWords.
                           for (int j = 0; j < numWords; j++)
                           {
                               object[] argsj = new object[] { i, j, false };
                               jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                               words.Add((string)jsWord);
                           }
                       }
                       foreach (string word in words)
                       {
                           pageText += word;
                       }
                   }
                   catch
                   {
                       // Errors on a page are ignored; that page contributes no text.
                   }
               }
               return pageText;
           }

  • Editing text from pdf file

    How do I edit text in a PDF file?

    Adobe Reader does not allow editing the text of a PDF document. You will need to get Acrobat on your Windows or Mac to do that.

  • Extracting Images from PDF file

    Hello All,
    I am reading a PDF file and need to extract images from it programmatically. The problem is that some images are stored inside the PDF file using the FlateDecode filter, so I need to decode the data first before I can extract the image. I don't know how to decode that image data. Is there any way or API to do that in C++?
    Thanks
    Aarti Nagpal

    I think you can do it through the Cos object layer in a VC++ plug-in. Go through PDEFilterSpec in the
    Acrobat core API reference.
    Be well..
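
    As background, FlateDecode is the zlib/deflate compression method, so any zlib-compatible inflater can decode the raw stream bytes. Here is a minimal sketch in Java rather than C++, using java.util.zip; in C++ the zlib library plays the same role. FlateSketch is a made-up class, the "stream" in main is produced by the sketch itself as a stand-in for bytes read from a PDF, and any predictor settings in the stream's DecodeParms would still have to be applied afterwards.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration class; not part of the Acrobat SDK.
class FlateSketch {

    // Inflates zlib-wrapped deflate data, which is what a FlateDecode stream contains.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater(); // default mode expects the zlib header
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Compresses bytes so the sketch has something to decode without a real PDF.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] stream = deflate("raw image samples".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(inflate(stream), StandardCharsets.UTF_8)); // prints "raw image samples"
    }
}
```

    The decoded bytes are raw samples, not a ready-made image file; the width, height, colour space and bits per component from the image dictionary are still needed to write them out in an image format.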

  • Extract text from pdf

    Hi, is it possible to extract text from a PDF file using the command line, to get output like you would get by using the File menu and then "Save as text..."?
    I also noticed that in the installation folder there is a small executable called AcroTextExtractor which sounds interesting, but I was unable to figure out how to use it.

    What's wrong with using Automator for this? It certainly seems the easiest. I'm not aware of any built-in AppleScript commands that will do this, but you should also ask on the AppleScript forum under Mac OS Technologies.

  • Does IBR support to extract text from office files

    Hi experts,
    Can we use IBR to extract Word/Excel/PowerPoint content to a text file? Where is the documentation for this function?
    Best regards

    Hi ,
    Do you want to extract text based on some rules? Is it for searching on specific text from the file? If yes, the Oracle Text search feature does that.
    If it is to populate some metadata with the extracted text, then you want the Content Categorizer component. For details on this component and its functionality, please go through the following documentation: http://docs.oracle.com/cd/E14571_01/doc.1111/e10978/c11_content_categorizer.htm#sthref1210
    In either case, IBR is not the engine that does this. It is used solely for document conversion.
    Hope this helps .
    Thanks,
    Srinath

  • Reading and extracting information from pdf file

    Hi everybody!
    What I am looking for is Java packages that allow me to read and extract information from a PDF file.
    I would really appreciate a link with sample code.
    Thanks in advance!

    STFW.
    http://www.google.com/search?q=java+read+pdf&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8

  • How can I read content from PDF file stored in Oracle 9i XMLDB

    Hi Friends:
    I have a question: I don't know how to read a string, for example "Hello", from a PDF file stored in Oracle 9i XMLDB. I have already stored the PDF file in XMLDB. Any suggestions are appreciated. Thank you in advance.

    You may be able to do something with Oracle Text. The following shows how to get an HTML rendition of a binary document. I think you can also get plain text instead of HTML.
    set echo on
    spool xfilesUtilties.log
    connect sys/&1 as sysdba
    grant ctxapp to &2
    /
    connect &2/&3
    -- Policy that lets Oracle Text filter binary documents without an index.
    begin
      ctxsys.ctx_ddl.create_policy(policy_name=>'XFILES_HTML_GENERATION', filter=>'ctxsys.auto_filter');
    end;
    /
    create or replace package xfiles_internal_11010
    authid definer
    as
      function renderAsHTML(sourceDoc BLOB) return CLOB;
    end;
    /
    show errors
    create or replace package body xfiles_internal_11010
    as
    function renderAsHTML(sourceDoc BLOB)
    return CLOB
    as
      html_content CLOB;
    begin
      dbms_lob.createTemporary(html_content,true,DBMS_LOB.SESSION);
      -- plaintext => false requests HTML output from the filter.
      ctx_doc.policy_filter(policy_name => 'XFILES_HTML_GENERATION',
                            document => sourceDoc,
                            restab => html_content,
                            plaintext => false);
      return html_content;
    end;
    end;
    /
    show errors
    create or replace package xfiles_utilities_11010
    authid current_user
    as
      HOME_FOLDER   constant varchar2(700) := xdb_constants.HOME_FOLDER;
      PUBLIC_FOLDER constant varchar2(700) := xdb_constants.PUBLIC_FOLDER;
      function renderAsHTML(sourceFile VARCHAR2) return CLOB;
      function transformToHTML(xmldoc XMLType, xslPath VARCHAR2) return CLOB;
    end;
    /
    show errors
    create or replace package body xfiles_utilities_11010
    as
    -- Reads the document at the given XML DB repository path and renders it.
    function renderAsHTML(sourceFile VARCHAR2)
    return CLOB
    as
    begin
      return xfiles_internal_11010.renderAsHTML(xdburitype(sourceFile).getBLOB());
    end;
    function transformToHTML(xmldoc XMLType, xslPath VARCHAR2)
    return CLOB
    as
      html clob;
    begin
      select xmldoc.transform(xdburitype(xslPath).getXML()).getClobVal()
        into html
        from dual;
      return html;
    end;
    end;
    /
    show errors
    grant execute on xfiles_utilities_11010 to public
    /
    create or replace public synonym xfiles_utilities for xfiles_utilities_11010
    /
    quit

  • Extract text from PDF without opening PDF in window C#

    Hello,
    I'm creating an application for searching text in PDFs. I found some code which uses the Acrobat SDK (installed on my system). But all the snippets I find seem to open a PDF window and then extract the text. Is it possible to extract the text without opening this window? I think this would speed up the search, since I need to search a lot of files. I just need a list with the file name and page number where the search string is found.
    AcroAVDoc avDoc = (AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc");
    avDoc.Open(System.IO.Path.GetFullPath(filespec), filespec);
    Then I use the JavaScript objects to access "getPageNumWords" and "getPageNthWord" in a loop, putting the words in a string.
    I didn't want to put the entire code here because it's easily found all over the web.
    Thanks in advance for your help.

    Hello,
    I own a copy of Acrobat Pro 9, and it is for my own use. I am not a professional developer and this application will not be distributed.

  • Extracting pages from PDF file and creating new subfile PDF

    I am a .NET C# developer looking into creating an app that extracts a subset of pages from a PDF document, say, given a start and end page number, and possibly creating a new PDF file with that subset of pages.  Is there a convenient way of doing this ?

    Ok. The Acrobat SDK is not directly going to help, because the Acrobat SDK requires Acrobat, and Acrobat is not for server use.
    Adobe have the Adobe PDF Library for server use. It is C/C++ (there may be a Java interface too from DataLogics), but it is huge overkill for this task, and many find it rather pricey for a simple task like this. There are third-party libraries and tools, but this is not the place to discuss them.

  • Copying text from PDF files in another language

    When I try to copy text from a PDF file which is written in Greek, the minute I paste it into another document the text cannot be recognized and comes out as a jumbled mess. How can I solve this problem?

    Into what other document type?  Word?  Do you have a Greek font selected when you paste?  Or a Unicode font?

  • Extract text from PDF and parse to Excel, automator? applescript?

    I've been trying to do this with Automator and have been getting nowhere.
    I have some PDF files that have selectable, formatted text in them.
    It's an old college directory that I would like to import. There are chunks of text:
    Name:  XXX, XXX
    Telephone: (XXX) XXX-XXXX        DOB:    XX/XX/XX          Gender: XXX
    Hometown:
    dflasdkfjdlkj
    asdflajksdflj
    adsflakjdsf
    Name:  XXX, XXX
    Telephone: (XXX) XXX-XXXX        DOB:    XX/XX/XX          Gender: XXX
    Hometown:
    dflasdkfjdlkj
    asdflajksdflj
    adsflakjdsf
    Name:  XXX, XXX
    Telephone: (XXX) XXX-XXXX        DOB:    XX/XX/XX          Gender: XXX
    Hometown:
    dflasdkfjdlkj
    asdflajksdflj
    adsflakjdsf
    etc.
    I'd like to find a way to extract the names, phone numbers, and hometowns and get them into their respective columns in Excel. It seems that I don't stand a chance with Automator. Do I need a programmer for this?
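
    Once the text is out of the PDF, turning blocks like these into rows is mostly a pattern-matching job. Here is a minimal Java sketch that assumes the text follows the Name / Telephone / Hometown layout shown above, with phone numbers written as (ddd) ddd-dddd. The class DirectoryParseSketch and the sample records are made up for illustration. It writes CSV lines, which Excel can open directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class for parsing directory-style text blocks.
class DirectoryParseSketch {

    // One record: a "Name:" line, a "Telephone:" line, then "Hometown:" followed by
    // free text that runs until the next "Name:" line or the end of the input.
    private static final Pattern RECORD = Pattern.compile(
        "Name:[ \\t]*([^\\n]+?)[ \\t]*\\n"
        + "Telephone:[ \\t]*(\\(\\d{3}\\) \\d{3}-\\d{4})[^\\n]*\\n"
        + "Hometown:\\s*([\\s\\S]*?)\\s*(?=\\nName:|\\z)");

    // Returns one {name, telephone, hometown} array per record found.
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = RECORD.matcher(text.replace("\r\n", "\n"));
        while (m.find()) {
            // Multi-line hometown text is joined into a single cell.
            String hometown = m.group(3).trim().replaceAll("\\s*\\n\\s*", " ");
            rows.add(new String[] { m.group(1), m.group(2), hometown });
        }
        return rows;
    }

    // Quotes every field so that commas inside names do not split columns.
    static String toCsvLine(String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(fields[i].replace("\"", "\"\"")).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data in the layout described in the post.
        String sample =
            "Name:  Doe, Jane\n"
            + "Telephone: (555) 123-4567        DOB:    01/02/80          Gender: F\n"
            + "Hometown:\n"
            + "Springfield\n"
            + "Name:  Roe, Richard\n"
            + "Telephone: (555) 765-4321        DOB:    03/04/81          Gender: M\n"
            + "Hometown:\n"
            + "Shelbyville\n";
        System.out.println("\"Name\",\"Telephone\",\"Hometown\"");
        for (String[] row : parse(sample)) {
            System.out.println(toCsvLine(row));
        }
    }
}
```

    Real directory pages will probably have irregular entries, so it is worth checking the number of rows produced against the number of "Name:" lines in the source text.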

