Export text from pdf to csv or xls / identity-H text

Hello,
Is it possible to export the text in a PDF to a CSV or Excel file (with the corresponding page numbers where the text was found)?
For a product we make, we need to put all the text of a page into that page's metadata. Normally we use Ghostscript for this, but when customers provide a PDF with Identity-H text, it fails most of the time. When it fails, we create a PostScript file from the PDF and recreate the PDF with Distiller; quite often Ghostscript will then recognize the text. If that doesn't work either, we have to enter all the text manually into an Excel file, and with all the text boxes and layout in the PDF this is a frustrating task (especially for a few hundred pages).
The export asked about above would solve this for us if it is possible, because it would be failproof. But if someone knows another solution to the actual problem we experience with Identity-H, that would be very helpful too.
Thanks in advance!
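
One possible shape for the export itself, as a minimal sketch rather than a tested solution: write one CSV row per page, holding the page number and the extracted text. The extractor assumes the third-party pypdf library (`PdfReader`, `extract_text()`), which nobody in this thread mentioned, and the function names `iter_page_text` and `write_page_csv` are made up for illustration. If an Identity-H font carries no usable mapping back to Unicode, this extractor will probably return the same garbage as Ghostscript does.

```python
import csv
import io


def iter_page_text(pdf_path):
    """Yield (page_number, text) pairs; assumes the third-party pypdf package."""
    from pypdf import PdfReader  # assumption: pypdf is installed

    reader = PdfReader(pdf_path)
    for number, page in enumerate(reader.pages, start=1):
        # extract_text() can come back empty for pages it cannot decode
        yield number, page.extract_text() or ""


def write_page_csv(pages, out_file):
    """Write (page_number, text) pairs as CSV rows under a header row."""
    writer = csv.writer(out_file)
    writer.writerow(["page", "text"])
    for number, text in pages:
        # Collapse line breaks and repeated whitespace so a page stays on one row
        writer.writerow([number, " ".join(text.split())])


if __name__ == "__main__":
    # Demo with in-memory sample data instead of a real PDF
    sample = [(1, "First page\ntext"), (2, "Second page")]
    buffer = io.StringIO()
    write_page_csv(sample, buffer)
    print(buffer.getvalue())
```

With a real file, the two pieces would be combined by passing `iter_page_text("input.pdf")` (a placeholder path) to `write_page_csv` together with a file opened using `newline=""`. Excel opens the resulting CSV directly.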


Similar Messages

  • Help with exporting data from pdf form

    I have about 100 PDF forms that I created in Adobe FormsCentral and distributed as PDF forms (rather than on the web). I am trying to export the data into a spreadsheet, but when I export it, the fields are all jumbled in the CSV file: they are not in the same order as in the form. I need to export the data all together, so I go to the Forms menu, select "Manage Form Data" and then "Merge Data Files into Spreadsheet". I tried exporting a single file, but that gave me something really weird.
    Please help, I have a deadline next week to analyze this data and can't make sense of it once it is exported to a spreadsheet.

    Would you please share your form with me and send me one of your pdf forms and some of the csv files?
    You can share your form by doing the following:
    1. Click on the “Share” icon on the bottom left corner.
    2. Click on “Add Collaborator” on the popup menu.
    3. Enter [email protected] under “People to share with”.
    4. Set subject to "Export data from pdf form"
    5. Click the “Share” button on the bottom right of the dialog.
    Thanks
    Ken

  • How to read/extract text from pdf

    Respected All,
    I want to read/extract text from a PDF. I tried using Etymon but did not succeed.
    Could anyone guide me on this?
    Thanks and regards,
    Ajay.

    Thank you very much Abhilshit, PDFBox works for reading pdf.
    Regards,
    Ajay.

  • Copying text from pdf with embedded font

    I have tried everything to copy and paste text from a PDF into Word. I think because it has an embedded font, the text comes over garbled. I have downloaded the font, tried to open the file in several other apps, and viewed it as HTML to copy and paste ...
    Does anyone have a trick they can share with me before I poke my eyes out?
    thank you

    Thanks for your prompt reply.
    As I said, I have the font installed on my system. For your reference,
    the first link below is the PDF file, and the second link is
    the fonts used. Kindly help me sort out this issue.
    https://www.yousendit.com/download/T2dkcHBEVEh0QTIwYjhUQw
    https://www.yousendit.com/download/T2dkcHBFQXBrYUJYd3NUQw

  • Copying text from PDF to Pages

    I am trying to copy text from a PDF file into Pages. After I paste the copied text into my new Pages document, the spacing between most of the words is lost.
    For example,
    "Copying text from PDF to Pages" is imported as "CopyingtextfromPDFtoPages"
    Does anyone know how to correct this?
    Imac   Mac OS X (10.4.7)  

    Rishi,
    Welcome to Apple Discussions.
    After reading your post, I tried to duplicate this problem. I opened a PDF, selected a sentence, then copied it to the clipboard. I then opened Pages, selected the blank template, then pasted in the text. It pasted perfectly.
    Does this problem happen with all text in a PDF? With different PDFs?
    -Dennis

  • Will not print text from PDFs - all other print is fine - Using nitro reader - Win7- HP4255

    Will not print text from PDFs - all other print is fine - Using nitro reader - Win7- HP4255

    Mulga
    Welcome to the HP Community Forum.
    Have you tried asking your question on the Nitro-Reader Forum?
    Nitro Reader Forum
    If you would like to try using the Adobe Reader, you might find help here:
    Manage Print Output with Print Preview
    See the section on PDF files
    I am pleased to provide assistance on behalf of HP. I do not work for HP. 
    Kind Regards,
    Dragon-Fur

  • Extract Text from pdf using C#

    Hi,
    We are solution developers using Acrobat. As we have a requirement to extract text from PDFs using C#, we have downloaded and installed the Adobe SDK. We have found only four examples in C#, and those are used only for viewing a PDF in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?
    Thank you for your help.
    Regards
    kiranmai

    Okay so I went ahead and actually added the text extraction functionality to my own C# application, since this was a requested feature by the client anyhow, which originally we were told to bypass if it wasn't "cut and dry", but it wasn't bad so I went ahead and gave the client the text extraction that they wanted. Decided I'd post the source code here for you. This returns the text from the entire document as a string.
    // Needs: using System; using System.Collections.Generic; using System.Reflection; using Acrobat;
    private static string GetText(AcroPDDoc pdDoc)
    {
        AcroPDPage page;
        int pages = pdDoc.GetNumPages();
        string pageText = "";
        for (int i = 0; i < pages; i++)
        {
            page = (AcroPDPage)pdDoc.AcquirePage(i);
            object jso, jsNumWords, jsWord;
            List<string> words = new List<string>();
            try
            {
                jso = pdDoc.GetJSObject();
                if (jso != null)
                {
                    // Number of words on page i (zero-based page index)
                    object[] args = new object[] { i };
                    jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                    int numWords = Int32.Parse(jsNumWords.ToString());
                    // Word indices run from 0 to numWords - 1
                    for (int j = 0; j < numWords; j++)
                    {
                        // false = keep trailing punctuation and whitespace on each word
                        object[] argsj = new object[] { i, j, false };
                        jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                        words.Add((string)jsWord);
                    }
                }
                foreach (string word in words)
                {
                    pageText += word;
                }
            }
            catch
            {
                // Ignore pages whose text cannot be read
            }
        }
        return pageText;
    }

  • Editing text from pdf file

    how to edit text from pdf file?

    Adobe Reader does not allow editing the text of a PDF document. You will need to get Acrobat on your Windows or Mac to do that.

  • Applescript or workflow to extract text from PDF and rename PDF with the results

    Hi Everyone,
    I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistently, or they are supplied as multi-page PDFs.
    What I need to do is name each PDF with the code which is in the text on the PDF.
    It would work like this in an ideal world:
    1. Split PDF into single pages
    2. Extract text from PDF
    3. Rename PDF using the extracted text
    I'm struggling with part 3!
    I can get a text file with just the code (I'm extracting the code using a call to BBEdit).
    I did think about using a variable for the name, but the rename function doesn't let me use variables.

    Hello
    You may also try the following AppleScript, which is a wrapper around a RubyCocoa script. It will ask you to choose the source PDF files and a destination directory. It then scans the text of each page of the PDF files for the predefined pattern and saves each matching page as a new PDF file in the destination directory, named after the string the pattern extracted. Pages that do not contain a string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of the script.)
    Currently the regex pattern is set to:
    /HB-.._[0-9]{6}/
    which means HB- followed by two characters and _ and 6 digits.
    Minimally tested under 10.6.8.
    Hope this may help,
    H
    _main()
    on _main()
        script o
            property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
                default location (path to desktop) with multiple selections allowed
            set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
                default location (path to desktop)
            set args to ""
            repeat with a in my aa
                set args to args & a's POSIX path's quoted form & space
            end repeat
            considering numeric strings
                if (system info)'s system version < "10.9" then
                    set ruby to "/usr/bin/ruby"
                else
                    set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
                end if
            end considering
            do shell script ruby & " <<'EOF' - " & args & "
    require 'osx/cocoa'
    include OSX
    require_framework 'PDFKit'
    outdir = ARGV.shift.chomp('/')
    ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
        url = NSURL.fileURLWithPath(f)
        doc = PDFDocument.alloc.initWithURL(url)
        path = doc.documentURL.path
        pcnt = doc.pageCount
        (0 .. (pcnt - 1)).each do |i|
            page = doc.pageAtIndex(i)
            page.string.to_s =~ /HB-.._[0-9]{6}/
            name = $&
            unless name
                puts \"no matching string in page #{i + 1} of #{path}\"
                next # ignore this page
            end
            doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
            unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
                puts \"failed to save page #{i + 1} of #{path}\"
            end
        end
    end
    EOF"
        end script
        tell o to run
    end _main
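
The pattern used in the script above can be tried out on its own before running it over hundreds of files. The following is only a sketch in Python (not part of the original script); the helper name `target_name` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the script: "HB-", any two characters, "_", six digits
CODE_PATTERN = re.compile(r"HB-.._[0-9]{6}")


def target_name(page_text):
    """Return '<code>.pdf' for the first match in page_text, or None."""
    match = CODE_PATTERN.search(page_text)
    return f"{match.group(0)}.pdf" if match else None


if __name__ == "__main__":
    print(target_name("Stock code: HB-AB_123456 rev 2"))
    print(target_name("no code on this page"))
```

A page for which `target_name` returns `None` corresponds to a page the script reports as ignored.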

  • Conversion from .pdf to .docx: I have converted a text in Slovenian in .pdf, but all the accents are missing in the converted .docx. version

    Conversion from .pdf to .docx:
    I have converted a text in Slovenian in .pdf, but all the accents are missing in the converted .docx. version.
    Does Adobe have a converter for texts with European-language accents, or shall I cancel my subscription?

    Subscription for what: ExportPDF or Acrobat?
    [topic moved to Acrobat.com Services forum]

  • How to read line number text from PDF using plugin?

    Hi, I would like to know how to read line number text from PDF using plugin?
    Thanks in advance.

    Ok, some background reading of the PDF Reference will help you understand why this is so difficult. PDF files are not organised into lines. It is best to think of each word or character on the page as being a graphic with its own position. The human eye sees lines where a series of graphics (words) are roughly in the same horizontal region.
    In the general case it is difficult or even impossible to answer this. You may have columns with different spacing (but the PDF stores no information on what is a column). You may have subscripts and superscripts. You may have text in graphics coinciding with other text. Commonly, there may be titles, headings or page numbers which are just ordinary text and might count as lines.
    That said, what you need to do is extract the text on the page and its positions. The WordFinder APIs are the way to do that. Now, sort all the words out, using the Y coordinates and size to try and guess what makes a "line". Now you are in a position to find the text (divided into words, not strings) and report the "line number" you have estimated.
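
The grouping step described above might look roughly like this. It is a sketch under stated assumptions, not the WordFinder API: each word is assumed to have already been extracted as a `(text, x, y, height)` tuple, and both that tuple layout and the `tolerance` parameter are hypothetical. Words whose y coordinates lie within a fraction of the text height of each other are treated as one line.

```python
def group_into_lines(words, tolerance=0.5):
    """Group (text, x, y, height) word tuples into estimated lines of text.

    A word joins the current line when its y coordinate differs from the
    line's first word by less than tolerance * that word's height.
    """
    lines = []  # each entry: [reference_y, reference_height, [(x, text), ...]]
    # PDF y coordinates grow upward, so sort descending to read top to bottom
    for text, x, y, height in sorted(words, key=lambda w: -w[2]):
        if lines and abs(lines[-1][0] - y) < tolerance * lines[-1][1]:
            lines[-1][2].append((x, text))
        else:
            lines.append([y, height, [(x, text)]])
    # Within each line, order the words left to right by x
    return [" ".join(t for _, t in sorted(items)) for _, _, items in lines]


if __name__ == "__main__":
    sample = [
        ("world", 50, 700, 10),
        ("Hello", 10, 701, 10),
        ("Second", 10, 680, 10),
        ("line", 60, 680, 10),
    ]
    for number, line in enumerate(group_into_lines(sample), start=1):
        print(number, line)
```

Columns, superscripts and overlapping graphics text will still defeat a simple threshold like this, for the reasons given above.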

  • How to import text from Word and retain italics and/or indented text?

    Is it possible to import or copy text from Word into InDesign and retain italics or text that is indented in the Word version?

    just wanna edit wrote:
    Thank you, I think I understand the idea of "place" (which is what I would call "paste")
    The two are not at all the same. Paste involves copying text to the clipboard and then pasting from the clipboard. Place is an import operation.

  • My partner and I both have iPhones. Since the recent software update, he randomly receives texts from my contacts, and also when I'm sending texts. Is this because we share the same Apple ID? This didn't happen before the software update.

    My partner and I both have iPhones. Since the recent software update, he randomly receives texts from my contacts, and also when I'm sending texts. Is this because we share the same Apple ID? This didn't happen before the software update.

    You can both share the same Apple ID for purchases, but I suggest one of you use another email address for iMessage, FaceTime, etc.
    It is easy to do; the easiest way is to set up an iCloud email address directly from your iPhone. Once it is set up, one of you will use it for iCloud services, while both of you still use the existing Apple ID to make purchases.

  • HT1688 Is there an "easy" way to either 1) cut and paste a large block of text from the Messages app or 2) access this text through a computer?

    Is there an "easy" way to either 1) cut and paste a large block of text from the Messages app or 2) access this text through a computer?

    Tap and hold the text you want to copy, then tap Copy.

  • IPhone6 is not sending or receiving texts from non-Apple users, except when in Group texts. I've tried the various fixes on the main support page to no avail. Any ideas?

    iPhone6 is not sending or receiving texts from non-Apple users, except when in Group texts. I've tried the various fixes on the main support page to no avail. Any ideas?

    Have you contacted your carrier to make sure there are no issues with your account?
    ~Lyssa
