Ways to extract text/data from a PDF

I have just been given five large boxes of old work orders, each of which includes (in the same position on every page) a customer's name, address, telephone number, etc. I'm pretty adept at scanning to PDF using Adobe Acrobat 9.0 Pro (version 9.5.1.283), but I don't know how (or whether it's even possible) to batch scan just the part of the page that contains the customer data and save it to Excel, CSV, or similar. I know I could highlight the required text on each page and copy and paste it into Excel, but is there a more elegant solution? To create the PDF, I can scan using our office Epson all-in-one (although it's on the other side of the office) or a Neatdesk Scanner I'm trying out. The latter scans the pages quickly, but if I set the Neat software to treat the document as a contact card, it only picks up our company address from the top of the page and ignores everything else ;-) I'm running Windows 7.
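
One possible approach, sketched here in Python rather than inside Acrobat: if the scanned pages have already been run through OCR so that the PDF contains real text, a library such as pdfplumber (third-party, assumed to be installed) can crop every page to the same rectangle and pull out only the text in that area. The box coordinates, the column names, and the helper names below are all hypothetical and would need adjusting to the actual work orders.

```python
import csv
from pathlib import Path

# Hypothetical customer-data region as (x0, top, x1, bottom) in PDF points,
# measured from the top-left corner of the page. Adjust to the real layout.
CUSTOMER_BOX = (36, 120, 300, 220)


def extract_region_text(pdf_path, box=CUSTOMER_BOX):
    """Yield the text found inside `box` on each page of an already-OCR'd PDF."""
    import pdfplumber  # third-party; imported lazily so the helpers below work without it

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            yield page.crop(box).extract_text() or ""


def text_to_row(text, n_fields=4):
    """Turn a block of text into exactly n_fields values, one non-empty line per field.

    Surplus lines are folded into the last field; missing fields are left empty.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    head, tail = lines[:n_fields - 1], lines[n_fields - 1:]
    row = head + ([" ".join(tail)] if tail else [])
    return row + [""] * (n_fields - len(row))


def pdfs_to_csv(pdf_paths, out_path, box=CUSTOMER_BOX):
    """Write one CSV row per page: source file, page number, then the fields."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "page", "name", "address 1", "address 2", "tel"])
        for pdf_path in pdf_paths:
            pages = extract_region_text(pdf_path, box)
            for page_no, text in enumerate(pages, start=1):
                writer.writerow([Path(pdf_path).name, page_no] + text_to_row(text))
```

Something like `pdfs_to_csv(["box1.pdf"], "customers.csv")` would then produce a file Excel can open directly. The quality of the result depends entirely on the OCR, so it is worth spot-checking a few rows against the paper originals.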

Similar Messages

  • Is there a way to copy text boxes from one pdf file to a different one?

    Is there a way to copy text boxes from one PDF file to a different one in Acrobat 11?

    Yes. In Form Edit mode select the field(s), press Ctrl+C and then move to
    the other file and press Ctrl+V.

  • What's the best way to extract hex data from control code in Intel format

    Does anyone know an easy way to extract the green set of data from this EEPROM dump?
    :020000040000FA
    :10420000110022003300440055006600770088004A
    :104210009900AA00BB00CC00EE00FF00FF00FF00E9
    :10422000FF00FF00FF00FF00FF00FF00FF00FF0096
    :10423000FF00FF00FF00FF00FF00FF00FF00FF0086
    :10424000FF00FF00FF00FF00FF00FF00FF00FF0076
    :10425000FF00FF00FF00FF00FF00FF00FF00FF0066
    :10426000FF00FF00FF00FF00FF00FF00FF00FF0056
    :10427000FF00FF00FF00FF00FF00FF00FF00FF0046
    :10428000FF00FF00FF00FF00FF00FF00FF00FF0036
    :10429000FF00FF00FF00FF00FF00FF00FF00FF0026
    :1042A000FF00FF00FF00FF00FF00FF00FF00FF0016
    :1042B000FF00FF00FF00FF00FF00FF00FF00FF0006
    :1042C000FF00FF00FF00FF00FF00FF00FF00FF00F6
    :1042D000FF00FF00FF00FF00FF00FF00FF00FF00E6
    :1042E000FF00FF00FF00FF00FF00FF00FF00FF00D6
    :10430000FF00FF00FF00FF00FF00FF00FF00FF00B5
    :10431000FF00FF00FF00FF00FF00FF00FF00FF00A5
    :10432000FF00FF00FF00FF00FF00FF00FF00FF0095
    :10433000FF00FF00FF00FF00FF00FF00FF00FF0085
    :10434000FF00FF00FF00FF00FF00FF00FF00FF0075
    :10435000FF00FF00FF00FF00FF00FF00FF00FF0065
    :10436000FF00FF00FF00FF00FF00FF00FF00FF0055
    :10437000FF00FF00FF00FF00FF00FF00FF00FF0045
    :10438000FF00FF00FF00FF00FF00FF00FF00FF0035
    :10439000FF00FF00FF00FF00FF00FF00FF00FF0025
    :1043A000FF00FF00FF00FF00FF00FF00FF00FF0015
    :1043B000FF00FF00FF00FF00FF00FF00FF00FF0005
    :1043C000FF00FF00FF00FF00FF00FF00FF00FF00F5
    :1043D000FF00FF00FF00FF00FF00FF00FF00FF00E5
    :1043E000FF00FF00FF00FF00FF00FF00FF00FF00D5
    :1043F000FF00FF00FF00FF00FF00FF00FF00FF00C5
    :02401200FF3F6E
    :00000001FF

    rolfk wrote:
    crossrulz,
    Intel Hex Format is a text format describing binary data, usually used for chip programming devices. Besides records for the actual data itself it also allows for addressing records, such that you only need to have the data in the file that needs to be written to the chip, even if the various areas are all over the address range of the chip. Since it is a well known format for hardware developers, some people coming here assume that everybody knows what it is. Of course many true softies may never have heard of it.
    rolfk, thanks for the information.  I was not aware of this.  I guess I am a true softie even with my EE background.
    There are only two ways to tell somebody thanks: Kudos and Marked Solutions
    Unofficial Forum Rules and Guidelines
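
To make the record layout described above concrete, here is a minimal Python sketch of an Intel HEX reader. It assumes only the record types that appear in the dump (00 data, 01 end of file, 04 extended linear address); the function name is made up for illustration, and the thread itself is about LabVIEW, so this only shows the parsing logic.

```python
def parse_intel_hex(text):
    """Return {absolute_address: byte_value} for every data record in the dump."""
    memory = {}
    upper = 0  # upper 16 address bits, set by type-04 records
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError(f"not an Intel HEX record: {line!r}")
        record = bytes.fromhex(line[1:])
        # All bytes of a valid record, checksum included, sum to 0 modulo 256.
        if sum(record) & 0xFF != 0:
            raise ValueError(f"bad checksum: {line!r}")
        count = record[0]
        addr = int.from_bytes(record[1:3], "big")
        rtype = record[3]
        data = record[4:4 + count]
        if rtype == 0x00:    # data record
            base = (upper << 16) + addr
            for offset, value in enumerate(data):
                memory[base + offset] = value
        elif rtype == 0x04:  # extended linear address
            upper = int.from_bytes(data, "big")
        elif rtype == 0x01:  # end of file
            break
    return memory


dump = """:020000040000FA
:10420000110022003300440055006600770088004A
:00000001FF"""
memory = parse_intel_hex(dump)
wanted = bytes(memory[a] for a in range(0x4200, 0x4210))
print(wanted.hex())
```

Which bytes were highlighted in green is not recoverable from the text of the post, so the address range 0x4200 to 0x420F above is only a placeholder for whatever region is actually wanted.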

  • Any way to extract contact data from old data.syncdb files????

    Had the unfortunate consequence of a power outage corrupting my address book database, which then populated to all other devices it synched to - iPhone, Entourage, even .Mac online. Now all contacts are gone.
    No, I didn't have a backup (smack my wrists), but in digging into user/library/syncservices/local, I found a bunch of files called data.syncdb. Opening these files in text edit yielded lines and lines of code, and a cursory glance indicates that at least some of the old contact data is stored in these files.
    How do I extract this data? I have done search after search for procedures, but all I can find on the web is how deleting these files makes synch happen faster. I don't want to delete these files, as they seem to contain the last vestiges of my contacts?
    Any ideas???

    You cannot without having a working version of the original OS X / Logic / K1 combo installed on a Mac...  which is what you stated in your post.
    The Kontakt 'data' is not stored in the LSO file as such... just 'pointers' to the data that is stored within Kontakt and its libraries. No installed Kontakt 1... no data... and no way to get to it.

  • Is there a way to extract/export fonts from a PDF file?

    Hi,
    I have a Tamil PDF file which is associated with a Tamil font.
    The PDF displays the Tamil characters successfully.
    I would like to know whether there is a way to extract the fonts used in the PDF file as font files.
    Thanks in advance.

    You may want to try this - http://onlinefontconverter.com/extract_font_from_pdf.php
    ~Deepak

  • The fastest way to extract the data from Sun One Directory Server 5.2

    I'm trying to figure out the best way to extract the whole contents of the Directory Tree of a Sun One DS 5.2. I assume that the best way is db2ldif, but it takes about 17 min for a database of about 1.5 GB, which seems quite a long time. Is there a faster way?

    Thanks,
    Actually the file I need should be readable - I need to parse it later on. But I think I just found the answer in the development kit. The utility is called dbscan and it works directly on the database files.
    Thanks again anyway,
    Ayelet

  • Extracting text data from CRM

    Hi,
      We have a NOTE section/field in CRM, which users use to enter notes on Activities. We need to pull it into BW. How can we do this? A NOTE can be of any length!

    Thanks to Eugene Khusainov
    Long texts in SAP BW: Modeling – Follow Up
    Of course the clever way would be to publish to a KM document and then attach the document as a BLOB object to the master data, but you can figure out how to do that yourself (and don't forget to write a blog about it).

  • Extract text data from dbc on the fly

    Hi All,
    I am currently setting up a testsystem for a CAN based project with LV7.1 and Teststand 4.
    I have my PXI8461 CAN card sending and receiving CAN messages fine but I wish to perform some work on all channel ( signal ) information before passing it up to Teststand for analysis.
    For my Teststand purposes it is preferable to work with text, for example if the DUT has a button pressed then the appropriate signal gets sent up to Teststand as "On" or "Off" etc rather than 1 and 0. ( CANoe has this feature and the LDF driver kit for LIN in Labview also )
    This is very useful for Teststand when checking test conditions rather than trying to equate a test with a number, so tests can be written like if result = "Ignition On" then Pass else Fail.
    The dbc file I have contains this text representation of data, but is there any VI which can be run after a Read Channel which will convert Channel data to text?
    Thanks,
    Mike 

    Hi Mike,
    Thanks for the post! I hope you're well today.
    I'm not very familiar with the CAN palette, but I'm sure we could use lower-level VIs to build this functionality.
    So, on the fly, the CAN message is a number (0 or 1), and you then convert it using a case structure?
    I imagine this issue is more complex than I have caught on to so far, so some example code etc. would help.
    Kind Regards,
    James.
    Kind Regards
    James Hillman
    Applications Engineer 2008 to 2009 National Instruments UK & Ireland
    Loughborough University UK - 2006 to 2011
    Remember Kudos those who help!
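
The lookup Mike describes can be illustrated outside LabVIEW with a short Python sketch. It assumes the dbc file stores its value descriptions in `VAL_` lines of the form `VAL_ <message id> <signal name> <value> "<text>" ... ;`, and that signal names are unique across messages; the function names and the sample line are made up for illustration.

```python
import re

# One VAL_ line: message id, signal name, then value/"text" pairs up to the final ';'
VAL_LINE = re.compile(r'^VAL_\s+(\d+)\s+(\w+)\s+(.*?);\s*$')
VAL_PAIR = re.compile(r'(-?\d+)\s+"([^"]*)"')


def load_value_tables(dbc_text):
    """Return {signal_name: {raw_value: text}} built from the VAL_ lines of a dbc file."""
    tables = {}
    for line in dbc_text.splitlines():
        match = VAL_LINE.match(line.strip())
        if match:
            _msg_id, signal, body = match.groups()
            tables[signal] = {int(v): label for v, label in VAL_PAIR.findall(body)}
    return tables


def to_text(tables, signal, raw):
    """Map a raw channel value to its text, falling back to the number as a string."""
    return tables.get(signal, {}).get(int(raw), str(raw))


# Hypothetical dbc fragment, not taken from the thread
sample_dbc = 'VAL_ 256 Ignition 1 "Ignition On" 0 "Ignition Off" ;'
tables = load_value_tables(sample_dbc)
print(to_text(tables, "Ignition", 1))
```

Loading the table once at start-up and doing a dictionary lookup after each channel read keeps the per-message cost small, which is roughly what a case structure or lookup array would do on the LabVIEW side.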

  • Extract text messages from my Iphone 5

    Is there a way to extract text messages from my Iphone 5 so I can save them and print them if necessary?

    You can capture a screen shot of the text message on your iPhone.  Simply view it, and briefly hold the on/off and home buttons together until you hear a shutter clicking sound.
    The photo will appear in Camera Roll, and you can then save it as you would a photo.

  • Need to pre-populate and Extract data from static PDF form

    Hi Jasmin or Jayan or anyone else that can answer.
    I have a requirement to use Digital Signatures.  Because of that, the forms must be static PDFs and the form variables will be “document form”.  I want to pre-populate the form via an SQL query and custom render process and render it as PDF so that the submitter can apply a digital signature when he/she is done and ready to submit for approval.  Subsequent approvers will also digitally sign the form.  I know that I will specify the custom render to render only once and thereby preserve the signature(s) on the form.  I do, however, need to extract data from the form to control the business process.  I cannot access the data in the form the same way I do with an xdp and I also cannot pre-populate the same way I do with an xdp.
    Any suggestions on how to attack this?

    Parth, one problem with your approach is that he will submit a PDF, and therefore you won't be able to put the PDF in a variable that's supposed to contain just XML.
    The prepopulation should be the same. If you start off with an xdp, then you will call a render service that merges data with your xdp to create a PDF.
    Now when you submit, you will submit the entire PDF back in the Document Form variable. In Workbench, you can use the FormDataIntegration service to extract data from that PDF that's being stored under Document Form var/object/document and put it in an xml variable. Then you can just use xPath to do your condition.
    I'm assuming you'll just pass that same Document Form variable to the next step, because if you make any change to the PDF it'll break the signature.
    Let me know if I missed anything.
    Jasmin

  • How to Extract Data from the PDF file to an internal table.

    Hi friends,
    How can I extract data from a PDF file to an internal table?
    Thanks in Advance
    Shankar

    Shankar,
    Have a look at these threads:-
    extracting the data from pdf  file to internal table in abap
    Adobe Form (data extraction error)
    Chintan

  • How can I use Automator to extract specific Data from a text file?

    I have several hundred text files that contain a bunch of information. I only need six values from each file and ideally I need them as columns in an excel file.
    How can I use Automator to extract specific Data from the text files and either create a new text file or excel file with the info? I have looked all over but can't find a solution. If anyone could please help I would be eternally grateful!!! If there is another, better solution than automator, please let me know!
    Example of File Contents:
    Link Time = DD/MMM/YYYY
    Random Text
    161 179 bytes of CODE    memory (+ 68 range fill )
    16 789 bytes of DATA    memory (+ 59 absolute )
    1 875 bytes of XDATA   memory (+ 1 855 absolute )
    90 783 bytes of FARCODE memory
    What I would like to have as a final file:
    EXCEL COLUMN1   Column 2    Column3   Column4   Column5   Column6
    MM/DD/YYYY      filename1   161179    16789     1875      90783
    MM/DD/YYYY      filename2   xxxxxx    xxxxx     xxxx      xxxxx
    MM/DD/YYYY      filename3   xxxxxx    xxxxx     xxxx      xxxxx
    Is this possible? I can't imagine having to go through each and every file one by one. Please help!!!

    Hello
    You may try the following AppleScript script. It will ask you to choose a root folder where to start searching for *.map files and then create a CSV file named "out.csv" on desktop which you may import to Excel.
    set f to (choose folder with prompt "Choose the root folder to start searching")'s POSIX path
    if f ends with "/" then set f to f's text 1 thru -2
    do shell script "/usr/bin/perl -CSDA -w <<'EOF' - " & f's quoted form & " > ~/Desktop/out.csv
    use strict;
    use open IN => ':crlf';
    chdir $ARGV[0] or die qq($!);
    # read the NUL-separated file list produced by find -print0
    local $/ = qq(\\0);
    my @ff = map {chomp; $_} qx(find . -type f -iname '*.map' -print0);
    local $/ = qq(\\n);
    #     CSV spec
    #     - record separator is CRLF
    #     - field separator is comma
    #     - every field is quoted
    #     - text encoding is UTF-8
    local $\\ = qq(\\015\\012);    # CRLF
    local $, = qq(,);            # COMMA
    # print column header row
    my @dd = ('column 1', 'column 2', 'column 3', 'column 4', 'column 5', 'column 6');
    print map { s/\"/\"\"/og; qq(\").$_.qq(\"); } @dd;
    # print data row per each file
    while (@ff) {
        my $f = shift @ff;    # file path
        if ( ! open(IN, '<', $f) ) {
            warn qq(Failed to open $f: $!);
            next;
        }
        $f =~ s%^.*/%%og;    # file name
        @dd = ('', $f, '', '', '', '');
        while (<IN>) {
            chomp;
            # swap DD/MM/YYYY to MM/DD/YYYY
            $dd[0] = \"$2/$1/$3\" if m%Link Time\\s+=\\s+([0-9]{2})/([0-9]{2})/([0-9]{4})%o;
            ($dd[2] = $1) =~ s/ //g if m/([0-9 ]+)\\s+bytes of CODE\\s/o;
            ($dd[3] = $1) =~ s/ //g if m/([0-9 ]+)\\s+bytes of DATA\\s/o;
            ($dd[4] = $1) =~ s/ //g if m/([0-9 ]+)\\s+bytes of XDATA\\s/o;
            ($dd[5] = $1) =~ s/ //g if m/([0-9 ]+)\\s+bytes of FARCODE\\s/o;
            # stop reading once every field has been filled
            last unless grep { /^$/ } @dd;
        }
        close IN;
        print map { s/\"/\"\"/og; qq(\").$_.qq(\"); } @dd;
    }
    EOF"
    Hope this may help,
    H
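
For anyone who would rather not go through AppleScript and Perl, the same idea can be sketched with Python's standard library alone. This is an illustration, not a port verified against real .map files: the function names are made up, the date pattern copies the script's numeric DD/MM/YYYY assumption, and `rglob("*.map")` is case-sensitive on some systems, unlike `find -iname`.

```python
import csv
import re
from pathlib import Path

DATE_RE = re.compile(r"Link Time\s+=\s+(\d{2})/(\d{2})/(\d{4})")
SIZE_RE = re.compile(r"(\d[\d ]*)\s+bytes of (CODE|DATA|XDATA|FARCODE)\s")
KINDS = ["CODE", "DATA", "XDATA", "FARCODE"]


def parse_map(text):
    """Return [date, code, data, xdata, farcode] as strings ('' when not found)."""
    date = ""
    match = DATE_RE.search(text)
    if match:
        day, month, year = match.groups()
        date = f"{month}/{day}/{year}"  # DD/MM/YYYY in the file, MM/DD/YYYY in the output
    sizes = {}
    for digits, kind in SIZE_RE.findall(text):
        # keep the first hit per kind and drop the thousands-separator spaces
        sizes.setdefault(kind, digits.replace(" ", ""))
    return [date] + [sizes.get(kind, "") for kind in KINDS]


def maps_to_csv(root, out_path):
    """Write one fully quoted CSV row per *.map file found under root."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, quoting=csv.QUOTE_ALL)
        writer.writerow(["column 1", "column 2", "column 3",
                         "column 4", "column 5", "column 6"])
        for path in sorted(Path(root).rglob("*.map")):
            row = parse_map(path.read_text(encoding="utf-8", errors="replace"))
            writer.writerow([row[0], path.name] + row[1:])
```

Calling `maps_to_csv("/path/to/root", "out.csv")` gives a file with the same six columns as the script above. The file is scanned as a whole rather than line by line, so the numbers are also picked up when they sit on the line before "bytes of".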

  • I want to extract data from a PDF using Java

    I would prefer to extract data from a PDF and convert it to XML. Is there an API that will convert a PDF to some Adobe format XML? Ideally I would like to add some JAR files to my classpath, similar to PDFBox. I don't want to install a bunch of server-side components or anything like that.
    Thanks!

    Thank you for the reply!
    If I installed the server side components, how would a Java client invoke a service to export data from a PDF? RMI, Web Services?

  • Extracting data from a pdf form

    Hi,
    livecycle es2, workbench 9.0
    I'm new to workbench and have a problem extracting data from a pdf form submitted to a short lived process.
    I have set up the following very simple process :
    default startpoint >  ProcessForm > exportData > set value > set value > Write Document
    The intention is to update the document and write it to disk. So far, each step works except for the 'export data' where I cannot get the pdf to extract to xml.
    The Input to the 'export data' step is a variable (myDoc), Data Type: Document,  created from the incoming PDF form.
    If I write out myDoc it is an exact copy of the incoming document, so I guess the start and finish steps of the process are OK.
    The incoming (PDF) form I was given had no data schema, but  I thought I could access the form data by exporting to an xml variable....
      Service : FormDataIntegration  / exportData
    input (PDF Document)    variable : myDoc
      output(Data extracted)     variable : myXMLData
    Then in the next step (set value) access the xml element I am after ..
    Mappings
    Location:  /process_data/@groupId      Expression: /process_data/myXMLData/xdp/datasets/data/form1/mainPage/groupId
    This did not work, so I took the incoming form, exported the form data to an XML file, and created a schema using Stylus Studio. I then imported that into the myXMLdata definition. (BTW, do I need to specify the root node after importing it?)
    Still not working !
    Extra info : The XML view of my incoming  form shows I have a minimal dataset definition- is this OK ??
    <connectionSet xmlns="http://www.xfa.org/schema/xfa-connection-set/2.8/">
       <?originalXFAVersion http://www.xfa.org/schema/xfa-connection-set/2.4/?></connectionSet>
    <xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
       <xfa:data xfa:dataNode="dataGroup"/>
    </xfa:datasets>
    The schema created by stylus studio has none of the xfdf, xfa settings I have seen on other schemas - is this OK ?
    Any help to get this fixed greatly appreciated
    thanks
    steve

    Hey, thanks for the offer, but I am now sorted after I found a simple working example online.
    This is a similar process to the one I am working on, and is clearly described and easy to follow...
    http://eslifeline.wordpress.com/2009/04/25/extracting-data-from-signed-pdf-using-livecycle-server/
    girish bedekar - I thank you !
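
As a side note for anyone debugging a similar path expression outside Workbench, the Python sketch below shows how namespaces can make a plain path such as `/xdp/datasets/data/form1/mainPage/groupId` miss its target. The sample XML is hypothetical (the thread never shows the exported data), and whether this was the cause in the case above is not confirmed by the thread.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used for lookups; the xfa-data URI appears in the post above.
NS = {
    "xdp": "http://ns.adobe.com/xdp/",
    "xfa": "http://www.xfa.org/schema/xfa-data/1.0/",
}

# Hypothetical export; element names follow the path quoted in the question.
SAMPLE = """<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">
  <xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
    <xfa:data>
      <form1><mainPage><groupId>42</groupId></mainPage></form1>
    </xfa:data>
  </xfa:datasets>
</xdp:xdp>"""


def find_group_id(xml_text, path="xfa:datasets/xfa:data/form1/mainPage/groupId"):
    """Return the text of the groupId element, or None if the path does not match."""
    root = ET.fromstring(xml_text)
    node = root.find(path, NS)
    return node.text if node is not None else None


with_namespaces = find_group_id(SAMPLE)
without_namespaces = find_group_id(SAMPLE, "datasets/data/form1/mainPage/groupId")
print(with_namespaces, without_namespaces)
```

The first lookup qualifies `datasets` and `data` with their namespace and finds the value; the second uses bare names and finds nothing, even though the document looks the same to the eye.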

  • How to extract the data into Excel / PDF from SAP

    Hi,
    We have spool number of a report output.
    We want to extract the data into Excel / PDF from SAP directly...
    Plz guide...

    Hi ,
    Please check this thread: HOW TO DOWNLOAD SAP OUTPUT TO EXCEL FILE. Hope your problem will be solved.
    You can also check this: http://wiki.sdn.sap.com/wiki/display/sandbox/ToConvertSpoolDataintoPDFandExcelFormatandSenditinto+Mail
    Thanks,
    Abhijit
