Text extraction script for pdf documents

Hello Everyone,
As everyone in the U.S. knows, tax season has begun. I am looking to the Apple Community for help with a script that will spare me the daunting task of manually extracting data from my bank statements to put into my expense tracking software. The software I am using at the moment is "Neat Receipts", which will only import ".pdf" and image files. I have very limited scripting knowledge at this point; however, I have begun the process of learning the craft. With deadlines steadily approaching, I have put off manually combing through hundreds of pages of documents in search of a more efficient way of accomplishing this task. Therefore, I have turned to the Apple Community for help for myself and possibly millions of others with the same issue.
Thus far, I have downloaded all of my bank statements for the last year and have organized them into a folder on my desktop. Each file is labeled with a specific name, such as "TD Bank Statement - Jan 2012.pdf". I would like to extract the data from each PDF so that I can reimport it into a separate PDF file under a new name. First, I would like to select the folder containing all the bank statements. Second, I would like to retrieve the "Transaction Date", "Vendor", and "Transaction Amount" from all the statements. Third, I would like to combine the "Transaction Date", "Vendor", and "Transaction Amount" and place them into a new PDF file. Last, I would like to name the new file with the date of the transaction followed by the vendor, a delimiter, and then the name of the file from which the transaction originated. In other words, each single transaction is exported to a new PDF, and the file name starts with its transaction date.
Here is a sample of the data I am looking to capture:
12/6/13, WAL WAL MART SUPER, $25.37
Sample file output would look like this:
12/6/13, WAL WAL MART SUPER - TD Bank Statement - Jan 2012.PDF
I am actively working on this as I type, to test my knowledge and ability to solve this problem myself. I would like some feedback, input, and help with this.
First, I believe the script should perform an OCR of the PDF file.
Second, variables should be set to tell the script what to look for: (date), (transaction amount), and all lines that follow until it hits another (date).
Third, group all lines and insert a (delimiter) in place of hard returns and tabs.
Fourth, export the grouped data into a (new pdf) file.
Fifth, rename the (new pdf) file with the (transaction date), followed by a (delimiter), followed by the original file name.
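
The second, third, and fifth steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it expects the statement text to have been extracted already (by OCR or a text export, which is not shown), it assumes every transaction starts with a date such as 12/6/13 and contains one amount with two decimals, and the names `parse_transactions` and `output_name` are made up for illustration. Writing the new PDF (the fourth step) is left out. Note that "/" cannot appear in a file name on a Mac, so the date in the file name uses "-" instead.

```python
import re
from dataclasses import dataclass

# Assumed layout: a transaction begins with a date like 12/6/13 and contains
# one amount like $25.37 or 25.37; everything else is the vendor text.
DATE_RE = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4})\b\s*(.*)$")
AMOUNT_RE = re.compile(r"\$?(\d[\d,]*\.\d{2})(?!\d)")


@dataclass
class Transaction:
    date: str
    vendor: str
    amount: str


def parse_transactions(text: str) -> list[Transaction]:
    """Group lines into transactions; a line starting with a date opens a group."""
    groups: list[list[str]] = []
    for raw in text.splitlines():
        line = raw.replace("\t", " ").strip()
        if not line:
            continue
        if DATE_RE.match(line):
            groups.append([line])
        elif groups:
            groups[-1].append(line)  # continuation of the current transaction

    result = []
    for group in groups:
        joined = " ".join(group)  # hard returns become single spaces
        date, rest = DATE_RE.match(joined).groups()
        amount = AMOUNT_RE.search(rest)
        if amount is None:
            continue  # no amount found, so this is not a transaction row
        # Vendor is whatever remains once the amount is cut out.
        vendor = " ".join((rest[: amount.start()] + rest[amount.end():]).split())
        result.append(Transaction(date, vendor, "$" + amount.group(1)))
    return result


def output_name(tx: Transaction, source_file: str, delimiter: str = " - ") -> str:
    """Build '<date>, <vendor><delimiter><original file name>'."""
    safe_date = tx.date.replace("/", "-")  # '/' is a path separator
    return f"{safe_date}, {tx.vendor}{delimiter}{source_file}"
```

Real statements will have headers, footers, and running balances that this does not handle, so the two patterns would need adjusting to the bank's actual layout.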

Acrobat can only work with what is present in the file. For instance,
in some cases there is just a scan, a picture, and no text can be
extracted.
Sometimes letters are doubled up when the document's creator used
"fake bold", where letters are printed twice to make an illusion of
bold text.
Aandi Inston

Similar Messages

  • This regards Adobe Reader XI 11.0.07 and Adobe Acrobat Pro XI 11.0.07 running on Win 7. Copying text from a text callout to paste into a text callout in another PDF document (3 instances of Adobe open) sometimes (only sometimes) causes the program to crash

    This regards Adobe Reader XI 11.0.07 and Adobe Acrobat Pro XI 11.0.07 running on Win 7. Copying text from a text callout to paste into a text callout in another PDF document (3 instances of Adobe open) sometimes (only sometimes) causes the program to crash, losing unsaved work. Windows Task Manager shows only a small percentage of CPU used and plenty of memory available. What is causing this?

    scholtzkie wrote:
    "Please wait...If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.   You can upgrade ...  For more assistance....    Windows is either a registered trademark...."
    This usually occurs if you use a browser that uses its own PDF viewer, not the Adobe Reader plugin; see http://helpx.adobe.com/acrobat/kb/pdf-browser-plugin-configuration.html

  • Does anyone know how I can create a text field in a pdf document that will multiply a total in another box?

    does anyone know how I can create a text field in a pdf document that will multiply a total in another box? I’m making an interactive pdf for an order form (attached), and I need to find a way for the “total quantity” number to multiply by 9 and total in the “Amount Due” box.

    Hey Gary,
    Have a look at this post: Re: horizontal scrolling similar to excel
    Andy's reply will show you how to make a table scroll horizontally, but it will be tough to accomplish it in some sort of easily replicable way. I am working on a JQuery extension that will help accomplish this, but I have had my time invested in another project at the moment.
    Good Luck,
    Tyson

  • How to save the output of sap script to pdf document in sap

    hi abapers
    how to save the output of sap script in sap so that can retrieve the saved document later.
    i have to save the rcia output from sap script in pdf document in sap so that it can be retrieved later
    how to use dms

    Hi deepika,
    This thread will solve ur problem OTF  -> PDF
    Regards,
    Pravin

  • Show key instead of text in ME51N for PR document type

    Hi, does anyone know how to show the key instead of the text in ME51N for the PR document type? How can I keep the setting after shutting down the PC?
    Thanks.

    Can you give more info about your need as it is not clear which text you're referring to.
    Regards,
    Vivek

  • Where is tool bar for pdf documents?

    where is tool bar for pdf documents workspace?

    Never mind
    I just printed the PDF file I couldn't open or type on.
    It was my tax organizer to file taxes.
    I'm done w it.
    Thanks!
    Sent from my iPhone

  • Thumbnails for PDF-Documents

    Hi,
    I want to use "thumbnails"-property (rnd:thumbnail) for PDF-Documents. Do I need an extra plug-in?
    Thanks!
    Kind regards
    Natalya Kolesnikova

    Hi Mark,
    as far as I know the BADIs here are used to connect a special converter which is able to convert PDF files into useable thumbnail formats.
    Maybe the information in the following link can help you to find a consulting partner which can help you to realise this function:
    http://service.sap.com/~sapdownload/011000358700006245642006E/
    Best regards,
    Christoph

  • Formatting size and resolution for PDF documents

    How do I format size and resolution for PDF documents with Adobe Reader X?

    No way that I know of with just the free Reader.

  • Text determination steps for Invoice document

    Hi friends,
    can any one send text determination steps for invoice document.
    Thanks ,
    Laxminarayana

    hi
    GO TO VOTXN
    PRESS ON TEXT TYPE
    CREATE AS PER YOUR REQUIREMENT AND SAVE IT
    GO BACK CHANGE
    DEFINE TEXT ID PROCEDURE
    DEFINE ACCESS SEQUENCE
    ASSIGN TEXT ID PROCEDURE TO ACCESS SEQUENCE
    ALSO CHECK PARTNER FUNCTION AND LANGUAGE  AND ASSIGN WITH TEXT ID OF CUSTOMER TYPE
    SAVE IT
    NOW U CAN ASSIGN EST AT RELEVANT TEXT OBJECT LIKE INVOICE.
    HEREAFTER SYSTEM WILL GIVE YOU POP UP MESSAGE WHENEVER YOU COME ACROSS THIS APPLICATION IN SALES PROCESS.
    THANKS
    REWARD IT

  • Using java scripts for pdf

    I am trying to find a sort of tutorial on how to use java scripts for pdf files, particularly in setting up repetitive links between several pdf files.

    Here is a very nice website that has tons and tons of stuff for pdf using java script. Enjoy!
    http://www.planetpdf.com/forumarchive/forum34.htm

  • Text extraction from a pdf

    Hello,
    I am trying to convert pdf documents in plain text but the output is disappointing.
    The document was generated using acrobat pdf printer from Microsoft Word.
    Opened the resulting pdf in Acrobat Reader and did a "save as text" on it.
    The resulting text is broken, letters are missing or doubled. Is there some catch to it?
    I cannot understand why Acrobat cannot interpret its own files.
    Best regards,
    Vlad

    Acrobat can only work with what is present in the file. For instance,
    in some cases there is just a scan, a picture, and no text can be
    extracted.
    Sometimes letters are doubled up when the document's creator used
    "fake bold", where letters are printed twice to make an illusion of
    bold text.
    Aandi Inston
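
To illustrate the "fake bold" case: if a PDF library can report each character together with its position on the page, the second printing of a letter can be recognized because it sits almost exactly on top of the first. The sketch below assumes such (character, x, y) data is already available; it calls no real PDF library, and the `Glyph` type, the `drop_fake_bold` name, and the tolerance of 1.0 are illustrative guesses, not anything Acrobat does.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # horizontal position on the page (assumed units: points)
    y: float  # vertical position on the page


def drop_fake_bold(glyphs: list[Glyph], tolerance: float = 1.0) -> str:
    """Keep a glyph only if no identical character was already kept
    within `tolerance` of the same position."""
    kept: list[Glyph] = []
    for g in glyphs:
        is_repeat = any(
            g.char == k.char
            and abs(g.x - k.x) <= tolerance
            and abs(g.y - k.y) <= tolerance
            for k in kept
        )
        if not is_repeat:
            kept.append(g)
    return "".join(k.char for k in kept)
```

A genuinely doubled letter, such as the "ll" in "Hello", survives because the two letters are a full character width apart, while the fake-bold copy is offset by only a fraction of that.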

  • Extract element from PDF document from automatized process

    Hi
    I have never worked with PDF documents, and I am looking for a solution for:
    extracting elements (simply text at first) from a PDF document, in paragraphs and/or in tables
    so that I can then manipulate them in further processing
    Do you have any ideas or information about my need (what would be the best way to do this?)
    SDK package: is this possible, and if yes, which one?
    a solution other than an SDK package
    PDF versions supported: the latest ones?
    any advice about the best development language: Java (which I prefer), or another?
    Thanks for all your advice!!!
    Lst

    Well, if you want Java - then Adobe only has server-side options for you. We don't offer desktop Java APIs.  Our server-side options are part of the Adobe LiveCycle family of products.
    For client-side, we have the Adobe Acrobat SDK (which also requires Adobe Acrobat to be installed) or the PDFLibrary SDK (for stand-alone applications).  Both are C/C++ based.

  • Setting passwords for pdf document generated by SAP system

    We are generating PDF documents from SAP system and distributing them as email attachment. just want to know if anyhow we can enforce the password protection while generating these documents or may be setting the password protection before distribution. If yes, then what are the requirements/essentials for the same?

    1. Is there any OSS Note on the same issue where it has been published by SAP that it can only be ensured using third party solutions? If yes, may I have the link?
    I have no clue on this, i.e. whether we can integrate Adobe Policy Server. My understanding was that in its current state it is NOT possible to have secured PDF forms.
    2. May I have the more elaboration on the way you mentioned in your reply?
    Well what you can do is hide all the elements in the form and keep only a TextField visible. Now write a script and check the value entered in the TextField. If it matches with the default password you want to set then make the form visible else display error prompt back to user.

  • How to position text in an existing PDF document with X,Y coordinates

    There used to be a CFX PDF tag that could do this, but the company (www.easel2.com) does not appear to exist any more.
    This is what I want to do. I have an existing PDF file that is uploaded by a user. I want to receive the file, then put a registration number (some text) in the lower left corner of the 2nd page (I don't know the exact x,y coordinates, but I can figure that out later).
    Does anyone know how I can do this?

    Unfortunately, I can not create the PDF.  It is uploaded by the user, and I have to open the PDF, insert the text and save it.  BUT, I did find the solution in a slight roundabout way.
    1) Create an image file containing the text I need
    2) Use that image file as a watermark with 10 opacity and use the tag's x,y coordinate setting.
    Below is the code I found to work:
    <!--- Create a blank image that is 500 pixels square. --->
    <cfset myImage=ImageNew("",500,500)>
    <!--- Set the background color for the image to white. --->
    <cfset ImageSetBackgroundColor(myImage,"white")>
    <!---Clear the rectangle specified on myImage and apply the background color. --->
    <cfset ImageClearRect(myImage,0,0,500,500)>
    <!--- Turn on antialiasing. --->
    <cfset ImageSetAntialiasing(myImage)>
    <!--- Draw the text. ---> 
    <cfset attr=StructNew()>
    <cfset attr.size=50>
    <cfset attr.style="bold">
    <cfset attr.font="Verdana">
    <cfset ImageSetDrawingColor(myImage,"blue")>
    <cfset ImageDrawText(myImage,"PROOF",100,250,attr)>
    <!--- Write the text image to a file. --->
    <cfimage action="write" source="#myImage#" destination="text.tiff" overwrite ="yes">
    <!--- Use the text image as a watermark in the PDF document. --->
    <cfpdf action="addwatermark" source="c:/book/1.pdf" image="text.tiff"
        destination="watermarked.pdf" overwrite="yes">

  • Dynamically bind website text field values to PDF document

    Hi there,
    I am in search of a step by step tutorial on how to generate a pdf document based on values from website text fields.
    I have a form produced in LiveCycle Designer ES3 and a website in which we insert test results for our company into a MySQL database.
    My thought was that once we have inserted values on the website, we could have a link to the PDF document and the same values would
    somehow magically appear in the PDF. I have read somewhere that this can be done with XML and the use of different 3rd-party programs.
    Is this possible?
    What approach would be the easiest? Perhaps someone knows about a tutorial on the matter?
    Data is stored from PHP website to MySql database. Website produced in Dreamweaver CS5
    Thanks in advance for any tips/help!
    Regards,
    Christian

    It's possible with an XFA form as well as an AcroForm, but to do the work on a server additional software is needed as you mentioned. Adobe's offering is in their LiveCycle lineup and possibly . Others from third-parties include FDFMerge, PDFlib, iText, Debenu Quick PDF Library, and others. The idea is you merge the data with the PDF on the server and serve it up or save it on the server and provide a link on the HTML page that's returned. Documentation for these libraries will include samples, but you'll also have to learn how to do this securely if that's important for the system you're putting together.
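
As a rough illustration of the "data as XML" half of that idea, the sketch below builds an XFDF document (Adobe's XML form-data format, typically used with AcroForm fields) from a set of field values. XFDF is an assumption here, since the thread does not say which format to use, and an XFA form may expect a different XML data file. The field names, the `build_xfdf` function, and "report.pdf" are all hypothetical; the values would come from the database in practice. Merging the data into the PDF on the server still needs one of the libraries named above and is not shown.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(values: dict[str, str], pdf_href: str) -> str:
    """Return an XFDF string holding one <field>/<value> pair per form field."""
    ET.register_namespace("", XFDF_NS)  # serialize with a default xmlns
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    # <f href="..."/> names the PDF form this data belongs to.
    ET.SubElement(root, f"{{{XFDF_NS}}}f", {"href": pdf_href})
    fields = ET.SubElement(root, f"{{{XFDF_NS}}}fields")
    for name, value in values.items():
        field = ET.SubElement(fields, f"{{{XFDF_NS}}}field", {"name": name})
        # ElementTree escapes &, < and > in the text automatically.
        ET.SubElement(field, f"{{{XFDF_NS}}}value").text = str(value)
    return ET.tostring(root, encoding="unicode")
```

The field names in the XML have to match the field names in the PDF form exactly, otherwise the values will not show up.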
