OCR Text Recognition

Hello All,
I am new to Acrobat and have been assigned a task that I believe requires deep knowledge of Acrobat.
We have thousands of PDF documents that contain scanned documents. In other words, we scan the documents and generate PDFs to keep them in one document.
I have to write a batch process that extracts the text from these PDFs (the text is inside the images) and creates text files from it. I looked at the sequence feature of Adobe and was able to generate the batch file with it, but I see some problems with this approach:
1) It prompts the user to select the image resolution. I have to deliver an unattended solution.
2) If the source folder has thousands of documents and the process fails for one of them, it stops right there. I would like to come up with some failover mechanism so that I can move on to the next document.
Ideally, I would like to have the front end in .NET and perform the conversion with the help of the SDK, if possible. Since I am new to this area, I would like to ask what design pattern I should follow to extract the text from the PDF documents (the text is inside the images).
We have a license for Acrobat Professional; do we need anything else to get hold of the SDK?
Thank you,
HK

Thanks for the prompt response. Would it be possible for you to share some more details? Are you aware of any sample where someone has done something similar? I am a C# programmer and was wondering how difficult it would be for me to write a plug-in in this case.
Thank you,
Hemant Kathuria
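
One way to picture the failover requirement from the original question (keep going when one document fails) is the loop below. It is only a minimal sketch in Python, not Acrobat SDK code: `extract_text` is a hypothetical callable standing in for whatever OCR engine or SDK call ends up doing the real work, and the folder layout is assumed.

```python
import logging
from pathlib import Path
from typing import Callable, List, Tuple


def run_batch(
    source_dir: Path,
    out_dir: Path,
    extract_text: Callable[[Path], str],  # hypothetical OCR hook, supplied by the caller
) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Process every PDF in source_dir; record failures instead of stopping."""
    out_dir.mkdir(parents=True, exist_ok=True)
    succeeded: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    for pdf in sorted(source_dir.glob("*.pdf")):
        try:
            text = extract_text(pdf)
            (out_dir / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
            succeeded.append(pdf)
        except Exception as exc:  # one bad document must not end the run
            logging.warning("Skipping %s: %s", pdf.name, exc)
            failed.append((pdf, str(exc)))
    return succeeded, failed
```

The returned `failed` list can be written to a log or retried later, which keeps the run unattended while still leaving a record of what needs attention.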

Similar Messages

  • How to print OCR text in Adobe invoice form

          Hi All,
          I have a requirement to print the customer number and invoice number as OCR text on the invoice Adobe form.
          So, if someone has any idea or has gone through this scenario, please share your suggestions.
          Thanks in advance,   
          Nilesh R Gaikwad

    Hi Nilesh,
    Try to check if you can choose OCR-B in Smartform Layout Designer.
    If not, then the font might be corrupt and you must turn to the Basis team.
    If yes, then it is not a TrueType font, and you must upload a TrueType font (.ttf, licensed font, might come at a cost) into SAP with SE73. Something like this:
    Cheers,
    Tao

  • Performing OCR text recognition in all pages of a PDF document which has blank pages

    The PDF pages in the file are scanned as images, but there are blank pages too. So, pages 1 and 2 have images of scanned text, page 3 is blank, pages 4 and 5 have images of scanned text, page 6 is blank, and so on. The series of blank pages is 3, 6, 9....
    I am using Adobe Acrobat Pro version 9 on Windows XP. I choose the Document option in the menu, OCR Text recognition->Recognize text using OCR->All pages. But the OCR stops after page 2.
    Then I go to page 4 and find that it is still an image (and I cannot copy the text from it). So I have to choose Document, OCR Text recognition->Recognize text using OCR->All pages again. But the OCR stops after page 6.
    The document has some 200 pages and I don't want to manually choose Document, OCR Text recognition->Recognize text using OCR->All pages for every run of pages that has text.
    I also tried Document, OCR Text recognition->Recognize text using OCR->Page 1 to page 200, but that does not work either.
    I can manually go to each blank page, delete it, and then do the OCR, but that would be tedious.
    1. Am I missing something in doing the OCR of all pages?
    2. Can I set up a batch process to do the delete of blank page series or can I set up a batch process which does OCR of all pages(and does not stop at a blank page) as it is doing now?
    Any suggestions would be appreciated.
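
    If the OCR run only stops when it reaches a blank page (an assumption based on the behavior described above), one workaround is to OCR each run of non-blank pages separately. The Python sketch below only computes those page ranges from the known blank-page series; `ocr_ranges` is a hypothetical helper, and feeding the ranges to Acrobat is left open.

```python
from typing import Iterable, List, Tuple


def ocr_ranges(page_count: int, blank_pages: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) page ranges that contain no blank page."""
    blank = set(blank_pages)
    ranges: List[Tuple[int, int]] = []
    start = None  # first page of the range currently being built
    for page in range(1, page_count + 1):
        if page in blank:
            if start is not None:
                ranges.append((start, page - 1))
                start = None
        elif start is None:
            start = page
    if start is not None:
        ranges.append((start, page_count))
    return ranges


# Blank pages are 3, 6, 9, ... as in the question; shown here for 9 pages.
print(ocr_ranges(9, range(3, 10, 3)))  # [(1, 2), (4, 5), (7, 8)]
```

    For the 200-page document this yields 67 ranges, the last being pages 199 to 200.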

    Hm, you could convert every PDF page to an image file (JPEG, etc.):
    - install the ImageMagick package
    - use the convert command:
    convert -trim -density 150 your.pdf your.jpg
    This will generate a JPG file for every PDF page; "-trim" removes the white margins, and "-density" specifies the resolution of the page, here 150 dots per inch (you could also try "-resize" instead of "-density").
    This does not generate a new pdf-file, but maybe there is a way to convert the images back into a single pdf?
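
    The round trip can be scripted. The sketch below, in Python, builds the `convert` command from the reply and a second one for the reverse direction; ImageMagick can write several input images into one multi-page PDF when the output name ends in `.pdf`. The helper names are made up for illustration, the file names are placeholders, and on ImageMagick 7 the executable may be `magick` rather than `convert`.

```python
import subprocess
from pathlib import Path
from typing import List


def pdf_to_jpg_cmd(pdf: Path, out_image: Path, dpi: int = 150) -> List[str]:
    """Command from the reply above: rasterize each PDF page to a JPEG."""
    return ["convert", "-trim", "-density", str(dpi), str(pdf), str(out_image)]


def jpgs_to_pdf_cmd(images: List[Path], out_pdf: Path) -> List[str]:
    """Reverse direction: combine images, in the order given, into one PDF."""
    return ["convert", *[str(p) for p in images], str(out_pdf)]


def run(cmd: List[str]) -> None:
    """Run a command and raise CalledProcessError if ImageMagick reports failure."""
    subprocess.run(cmd, check=True)
```

    The caller has to pass the images in page order; a plain alphabetical sort of numbered file names would put page 10 before page 2. The resulting PDF contains images only, so it still needs an OCR pass to become searchable.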

  • Is it possible to add or change OCR text in a batch process?

    Hello,
    This is my first submission.  Really hope you can help as if there's a solution to this it will significantly help our business.
    Is it possible to 'batch process' the adding or changing of OCR text in a PDF?
    This might sound like a strange process but let me explain what we do:
    1. We scan old, handwritten books and registers.  Typically these registers contain lists of information. Each page will be scanned to a filename like pge01.jpg, pge02.jpg, pge03.jpg etc.
    2. We transcribe the content of each page.  Each page will contain multiple records (e.g. 20 records); typical fields might be:
    Unique ID
    Surname
    Forename
    Year
    Address
    Filename (e.g. pge02.jpg)
    3. We provide this content back within database driven software, so that when a user performs a search on, say, 'Surname=Jones' & 'Year=1945', all of the scanned pages containing handwritten text that matches the search criteria are displayed in a list.  The user can then click on a search result and see the scanned page containing that record.
    Rather than provide database driven software, we'd simply like to produce a standard PDF file.  Each page within the PDF will show each of the scanned pages (pge01.jpg, pge02.jpg, pge03.jpg etc.).  But where you would normally store the 'ocr recognised text' behind each image, we would like to show our transcribed content.
    If this is possible, then I realise that it's likely that you can't do field searches (i.e. 'Surname=Jones' & 'Year=1945') but at the very least I'd be able to type 'Jones' in to the search box and it would find all pages that contained the transcribed word 'jones'.
    If it's possible to add the transcribed data as 'ocrd text' then is it possible to do it in some sort of batch process?  We scan lots of big books and capture millions of records - so doing it manually is not an option.
    Any help that anyone can provide will be hugely appreciated.
    Thanks,
    Paul

    I don't think it's possible to manually add OCRd text, but you can add form
    fields with the text in them. And yes, it is possible to search the content
    of form fields, using a script.
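
    Independent of how the text ends up inside the PDF, the page-level keyword search Paul describes can be sketched in a few lines of Python. The field names come from his list; the functions `build_index` and `search` are hypothetical, and nothing here touches Acrobat.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(records: Iterable[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each lower-cased transcribed word to the page files it occurs on."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        page = record["Filename"]
        for field, value in record.items():
            if field == "Filename":
                continue  # the file name identifies the page, it is not content
            for word in str(value).lower().split():
                index[word].add(page)
    return index


def search(index: Dict[str, Set[str]], *words: str) -> List[str]:
    """Return the pages on which every given word occurs (case-insensitive)."""
    hits = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*hits)) if hits else []
```

    Searching for "Jones" and "1945" together returns pages where both words appear somewhere on the page, not necessarily in the same record, which matches the limitation Paul already expects for a plain PDF text search.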

  • Make a pdf ocr text searchable on a mac with adobe

    How do I make a PDF's OCR text searchable on a Mac with Adobe?

    With "adobe" what: Acrobat?  What version?

  • OCR-text from pdf to pdf ?

    Is it possible to copy the OCR-text from a pdf file obtained after Finereader and paste it in the original pdf?

    Having had to perform OCR indicates the original PDF is a scanner's output. That makes it an image / picture of text forming the PDF page content.
    An alternative approach -
    Using Acrobat XI Pro create a new, blank PDF page (****+Ctrl+T).
    Copy the OCR text.
    Use Acrobat XI's Tools - Add Text and paste the OCR text at the cursor location.
    Be well...

  • How to determine (in bulk) whether a document has OCR text and what reader versions are supported?

    Help!
    I've inherited a task which has spanned many years and which was not well thought out over the transitions.  At least three project teams have scanned and stored tens of thousands of documents to PDF.  What was discovered subsequently was that the project teams did not apply a uniform standard for which versions of Adobe would be supported in each PDF, and that not all documents appear to have been OCR'ed as part of the scan process.
    This has resulted in two major problems.  First, PDFs which support all Reader versions are bloated and consuming significant amounts of storage; second, the automated processing tools which depend upon the OCR text are failing once they pass the front and rear cover sheets (which do contain extractable text).  I need to know if there is a way that PDFs can be bulk scanned to determine which Reader versions are supported (say 8.0 to current), and if the OCR'ed / extractable text is not just limited to the first few and last pages of each PDF.
    I have been manually fixing individual files with Adobe Acrobat 9.0.  I can force Adobe to re-OCR and save the files, but I would rather not have to re-process the existing bulk that we have unless absolutely necessary.  If I could determine which ones need fixing and process just those, it would save man-years of work.
    Thanks, in advance, for any assistance.
    Michael

    What I meant by supporting a version of Reader is that I don't need the files to be fully backwards compatible all the way back to say the first few versions of Adobe Reader.  (There are likely limitations on that much backwards compatibility, anyway.)  One of the scanners that was used apparently was set for full backwards compatibility, to the extent possible, for every PDF that it generated.  Some of those PDFs are huge, commonly 300-400MBs in size.  If I open them in Acrobat 9.0, limit the backwards compatibility to Adobe Reader 8.0 and forward, then resave the file, the size is often significantly reduced.
    As to how it is measured, there is something in the PDF itself that indicates a minimum version for Adobe Reader for compatibility purposes.  When you select for compatibility in Acrobat during the save process I mentioned, you get to pick the version at which you want to stop -- so if you selected 8.0, it would be compatible with 8.0, 9.0, and so on.
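
    The version marker mentioned above can be read in bulk without opening Acrobat: a PDF file starts with a header of the form `%PDF-1.x`. The Python sketch below reads only that header; the function names are made up for illustration. Two caveats: a `/Version` entry in the document catalog can override the header, and this says nothing about whether a page has extractable text, which would need a PDF library.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Header looks like "%PDF-1.4"; it should sit at the very start of the file.
_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def header_version(pdf: Path) -> Optional[str]:
    """Return the PDF version declared in the file header, or None if absent."""
    with pdf.open("rb") as handle:
        head = handle.read(1024)
    match = _HEADER.search(head)
    return match.group(1).decode("ascii") if match else None


def scan_folder(folder: Path) -> Dict[str, Optional[str]]:
    """Map each *.pdf file name in folder to its declared header version."""
    return {p.name: header_version(p) for p in sorted(folder.glob("*.pdf"))}
```

    Sorting the result by version gives a first cut of which files were written with the older compatibility setting and might shrink when resaved.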

  • New user can't figure out fixing OCR text

    I am a new user of Acrobat Professional 8.1.2 on Mac OS 10.5.3. There's lots I don't know so I hope my problem has a simple solution where an experienced user can say of course, you should be doing this or that.
    I am making pdfs from scans of printed articles. The scanner I'm using isn't mine and isn't connected to my computer so I can't scan direct to pdf. So I'm using jpegs, all scanned at 300 to 600 dpi, and cleaned up a bit with Photoshop before I read them into Acrobat.
    Everything works great up to the point where I cut in the OCR text recognition. Even though this text is perfectly clean and readable to the eye, the accuracy of the OCR (as determined by copying blocks to a text reader) leaves much to be desired. The Acrobat help file says to correct any errors using the Find OCR Suspects function. But this function is far too unsuspicious - it found no suspects at all in any of these files!
    I tried correcting by hand using the Touchup Text tool. To begin with, this didn't work well because the invisible text I was selecting to correct, was *not* directly over the visible text in the bitmap. But far worse than that, after I make 2 or 3 corrections this way something causes the "invisible text" in that entire paragraph to be hopelessly garbled!
    Incidentally, my text recognition settings are: Language, English. Output style, Searchable Image (Exact). Downsample: none.
    So, what silly little thing am I not doing that I should be?

    I just tried using my Bold. When I hit "Open" I receive the following error: "The media being played is of an unsupported format"
    When I try to open the file on my PC the link is an ASX file. Here's a list of the supported media types on a BlackBerry. ASX is not listed.
    http://www.blackberry.com/btsc/KB05482

  • Combine OCR pdfs into one file and keep OCR text

    I posted a few weeks ago, but found out today that I misunderstood the problem. We got back our files from the vendor digitized as individual pdfs that have already been OCRed. When we try to combine the individual pages into issues (these are newspapers), the OCR text is lost. We're using Acrobat 8 and 9 Professional.
    Thanks!

    Hi
         I tried this with A8 and A9 but could not replicate the issue as mentioned by you. Here is what I did:
         1. OCR two single page scanned PDFs.
         2. Combine the two single page OCRed PDFs into one PDF.
         3. Check if the text is still there.
         and I found that the text was retained.
         How were the PDFs OCRed in your case? Were these PDFs OCRed using Acrobat or some other tool?
    -Ravish

  • OCR Text in Adobe Acrobat 8 Pro

    Is there a way to set in preferences that when I hit my highlight text button, it automatically OCRs the document like it used to? Also, is there an OCR button for the toolbar?

    To make it appear, press "Ctrl+E". It's known as the properties toolbar and can only be active when you interact with an element (e.g., form field, text box) that allows you to edit. The properties bar changes depending on what type of element is selected and the context.
    When creating a new PDF from a blank page, Acrobat allows you to use it, but as you've found, when you save the document and reopen it, it is no longer available. This is a special case and it will not work to change the text on a page with other documents.

  • Advice about OCR text in ID

    This project is a total labor of love for my 90 year-old mother-in-law, who wrote a weekly column for a small-town newspaper for years back in the 1980s. We are going to do a booklet compilation of her articles for the enjoyment of family and friends (not to sell). We are looking into different printing options, like the local Kinkos (digital printing and binding) and another source for low-volume hardback book printing. No details yet.
    I have basic ID skills but NO experience with Acrobat, which I own as part of my CS2 suite. It was suggested in this forum that Acrobat might be the way to go (project scope has changed since then), and I have spent several days trying to learn Acrobat. It's not coming easy for me, so I've decided that ID is the program I am most comfortable with for this project. I've been experimenting with methods and doing a lot of research in the forums with threads that seemed applicable to my project.
    I would greatly appreciate anyone who has the time to read through my proposed process and point out any errors that could turn out to be a problem with the end printing.
    I haven't nailed down the overall design concept and presentation of the book because family members don't agree on things yet. (Ugh, that's a whole different challenge.) But I don't envision a master page layout deal because I really want to imitate the different layouts of the published articles, so I envision each page having its own personality.
    So I just want to be productive in doing the grunt work of obtaining the articles and setting up individual pages in ID that might or might not be used. Then I will put the final together according to the printer's instructions.
    1. I obtain the articles from a paid subscription newspaper archive service and it gives me the option to save and view this PDF in Acrobat. The PDF is a "snapshot" of the newspaper page. Then I crop in Acrobat to just the article I want and print it out and save it.
    2. Then I put the printed article on my scanner and push the text button on the front of the scanner. The articles are in different sizes  and number of columns but they are all consistent newspaper style—single-spaced, justified columns with a 3 space indent at the beginning of the paragraph. (I know that Acrobat has an OCR feature but I couldn't figure it out.)
    3. Then the article appears in Microsoft Word 2000 as an rtf document all in one long page with inconsistent spacing between lines. It does have the paragraph indents and shows the typos or misunderstood characters in green. I fix all the typos etc. in Word and it's obvious that there is inherent formatting involved, but I have no idea how that happens from a scan or if it could create problems.
    4. Last I copy all the text in Word and paste it on my ID document, guessing at the width of the column I want. Then I click on the overset (from Peter's response on a thread about text flowing) and make the remaining columns. But I want each article to be printed on one page, so I don't want to risk it carrying over and ID making another page. The text shows up in ID as Gil Sans instead of Times Roman from Word. No huge deal, I just select and change. There are no paragraph indents but I just added 3 spaces. (I know the nudge thing is bad and I promised not to do it again) so I need to understand the formatting.
    So any glaring errors in my proposed methods that could make me cry and have to start over?
    Thanks in advance,
    Patty

    Hi Patty.
    I am working on a similar project that involves original newspaper clippings (not preserved well so they are quite yellow) from 1863.
    > ...I have spent several days trying to learn Acrobat. It's not coming easy for me so I've decided that ID is the program I am most comfortable with for this project.
    I am very new to using InDesign and InCopy. There are three inexpensive learning resources that I am using: Lynda.com, the online resources of my public library, and this forum. Lynda.com offers excellent multi-media instruction, but it is time consuming to complete the courses (some exceed nine hours). The public library has contracted with certain publishers so that as a member, I have access to the electronic version of those extremely expensive instructional books that go out of date the next time a new version of the software is released (thereby saving me $$$).
    > I haven't nailed down the overall design concept and presentation of the book because family members don't agree on things yet. (Ugh, that's a whole different challenge.) But I don't envision a master page layout deal because I really want to imitate the different layouts of the published articles, so I envision each page having its own personality.
    Welcome to the world of a Self Published Author. ;-)
    > So I just want to be productive in doing the grunt work of obtaining the articles and setting up individual pages in ID that might or might not be used. Then I will put the final together according to the printer's instructions.
    That sounds like a good plan.
    > 1. I obtain the articles from a paid subscription newspaper archive service and it gives me the option to save and view this PDF in Acrobat. The PDF is a "snapshot" of the newspaper page. Then I crop in Acrobat to just the article I want and print it out and save it.
    Okay, there might be some difficulties that crop up doing it that way. When a scanned image is cropped in Acrobat, the entire image remains intact and it is only the viewable area that is cropped--so no information is lost. That means if the article you want is in the center of a three column newspaper, all of the information surrounding that article will remain available.
    It is possible to open the original image in Photoshop through Acrobat and permanently delete the extraneous information, but unfortunately I am not using CS2, so I cannot provide the command path to do so.
    > 2. Then I put the printed article on my scanner and push the text button on the front of the scanner. The articles are in different sizes  and number of columns but they are all consistent newspaper style—single-spaced, justified columns with a 3 space indent at the beginning of the paragraph. (I know that Acrobat has an OCR feature but I couldn't figure it out.)
    If the OCR is not working right, then it may be because it is OCR'ing the entire page, not just the area that was cropped.
    Since Acrobat can OCR image files, I see a couple of options. Open the original in Photoshop through Acrobat, crop out all of the unnecessary text, align the article text so that it is in a single column, save it, then just initiate the OCR in Acrobat on that existing page.
    Another option is to paste screen prints from the article as it appears in Acrobat into Photoshop, align the text into a single column, save it and then run the Acrobat OCR on that file.
    > 3. Then the article appears in Microsoft Word 2000 as an rtf document all in one long page with inconsistent spacing between lines. It does have the paragraph indents and shows the typos or misunderstood characters in green. I fix all the typos etc. in Word and it's obvious that there is inherent formatting involved, but I have no idea how that happens from a scan or if it could create problems.
    That's from Word, not the scan itself.
    The only way I have been able to clean up the garbage-code Word creates has been to paste the text into Notepad, re-copy it, then Paste Special > Unformatted Text back into the same document (or sometimes a new doc). Then it's a manual process of cleaning up the text that is easier than working with the original scan.
    For some reason, there have been times when I have used the Paste Special > Unformatted Text directly into Word and it has been almost as messy as the original, so I've opted to just paste to notepad, copy and repaste back into Word. Seems inefficient, but it works.
    Best of luck with your project.

  • Correct OCR text in PDFs

    This question was posted in response to the following article: http://help.adobe.com/en_US/acrobat/pro/using/WS58a04a822e3e50102bd615109794195ff-7f6f.w.html

    When you convert your scan to recognizable text in Acrobat, which selection are you using -- Searchable Image, Searchable Image Exact or Clearscan?
    You'll also find some suggestions from Dave in this forum post on how to work with the invisible layer of text in an OCR'd PDF.

  • OCR text with strange spacing

    We are OCRing large numbers of PDF documents using Abbyy FineReader 9.0. The recognised text that you can see within the Abbyy program looks great, but if you save the OCR'd files as PDFs and then copy/paste the text out of the resulting PDF file opened in Adobe Acrobat Prof 8.0, there is often a space between each character, making the text search function useless! Has anyone else encountered this? This problem doesn't occur if you copy/paste the text from the same PDF doc in earlier versions of Adobe Acrobat. Abbyy customer service has been completely useless, so any help would be greatly appreciated.

    Hi Mike,
    Thanks very much for your email. Our problem is that we are mainly doing batch OCRing, as we have thousands of documents being digitized - do you happen to know if Adobe supports text recognition in batches within Acrobat, or within some other Adobe software?
    Thanks very much
    Penny
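
    With thousands of files, it helps to find out which PDFs are affected before re-running OCR on any of them. The Python sketch below flags text in which most tokens are single characters, which is what "a space between each character" produces. It assumes the text has already been extracted to a string by some other tool, and the 0.8 threshold is an arbitrary starting point, not a tested value.

```python
def single_char_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.8) -> bool:
    """Heuristic: True when nearly every token is a single character."""
    return single_char_ratio(text) >= threshold
```

    Normal English prose has only a few one-letter tokens such as "a" or "I", so affected and unaffected files should separate clearly; borderline ratios are worth checking by hand.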

  • Feature Request - Allow direct editing of hidden OCR text to correct errors

    Examining an OCR-scanned PDF allows the OCR plain-text results to be "previewed," but in a read-only format.  We need to be able to edit this text to correct the inevitable errors.
    I just scanned a document, and the results are mostly OK, but there are two lines that are complete garbage.  I personally have never performed a completely error-free OCR scan in Acrobat or anywhere else.  Editing the image text is no solution, especially if the font is unavailable.
    Other OCR-enabled apps I have worked with expect you to edit the results.  Acrobat should too.

    If Adobe won't do it, maybe somebody else will.  Nuance.com specializes in OCR, among other things.  They've also got a PDF converter.  Maybe they'll recognize the value of marrying the two.

  • Trying to overwrite OCR text

    I'm having issues in Adobe Acrobat Professional 9.5.5 where I can't overwrite text that was OCRed.  I don't know if I'm missing a step.  When I search the document, it does pick up the particular text that I want to update.  Any suggestions?
    (To be more specific) I have selected the "Document" menu and OCRed all text.  What I cannot do is delete the text and replace it within the .pdf document.
    Thank you

    Acrobat (9 & above) provides three OCR methods.
    --| Searchable Image
    --| Searchable Image (Exact)
    --| ClearScan
    The first two provide an OCR output that makes use of "hidden" / "invisible" text.
    No practicable way to touchup/edit this.
    Try using ClearScan.
    Be well...
