Correct OCR text in PDFs

This question was posted in response to the following article: http://help.adobe.com/en_US/acrobat/pro/using/WS58a04a822e3e50102bd615109794195ff-7f6f.w.html

When you convert your scan to recognizable text in Acrobat, which selection are you using -- Searchable Image, Searchable Image Exact or Clearscan?
You'll also find some suggestions from Dave in this forum post on how to work with the invisible layer of text in an OCR'd PDF.

Similar Messages

  • OCR-text from pdf to pdf ?

    Is it possible to copy the OCR-text from a pdf file obtained after Finereader and paste it in the original pdf?

    Having had to perform OCR indicates the original PDF is a scanner's output. That makes it an image / picture of text forming the PDF page content.
    An alternative approach -
    Using Acrobat XI Pro create a new, blank PDF page (****+Ctrl+T).
    Copy the OCR text.
    Use Acrobat XI's Tools - Add Text and paste the OCR text at the cursor location.
    Be well...

  • Correcting OCR errors that are not "Suspects"

    I routinely find OCR errors that Acrobat does not flag as suspects. I have not found any way to correct this text, as it seems the only tool that Acrobat has for correcting OCR text is the "Find Suspect" tool. It would seem that if the text is there, it should be editable. But after hours of research, I have not found a way. However, I also have not found anything that says that this text cannot be edited. Am I missing some tool that would help in situations like these?

    Hi,
    The ability to mark a non-suspect and correct it is available in Acrobat DC. You can use the following steps:
    Go to tools>Enhance Scans>Recognize Text>Correct Recognized Text
    Check "Show recognized text" check box
    Double click on the word you want to mark as a suspect and correct it.

  • Copy text in pdf gives me gibberish. Is there a way to OCR to correct?

    I have a few documents that are complete gibberish when I select text and copy. If I open them in Acrobat Pro, select text, Copy, and "Show Clipboard" in Finder, I see a bunch of "skull characters", and if I open them in Preview and do the same, I see strings of dots. The text in the clipboard cannot be pasted intelligibly into any other program, and I cannot search the document.
    Some of these documents were downloaded from commercial sites. One of them came from (I think) OCR'ing a scanned document using ClearScan. An example of such a document is at https://public.me.com/ix/alanterra/Reynoso%202006%20p%201.pdf?disposition=download+1317001 233647 (it's small, 55K).
    It seems to me that one way to do this would be to convert the document into a "scanned" pdf, and then OCR it. But the only way I can figure out how to do this is to image each page separately in Photoshop, and then assemble the pages into a new document.
    There must be a way to deal with this problem.
    Any thoughts?
    A
    PS--If you look at the document linked to above, you will note that the text in the footer is coherent, but not the text in the body of the document.

    You will be able to use OCR in Acrobat after you convert the type to outlines. You will need to add some transparency, then use the flattener preview to outline your type. Here are the steps (for Acrobat 9):
    1. Document> Watermark> Add (add a text watermark, hit the space bar once).
    2. Advanced> Print Production> Flattener Preview> Convert all text to outlines (checkbox on). Save.
    3. Document> OCR text recognition> recognize text using OCR. Select all text with the type tool, copy.
    This method is not perfect, you will need to check the copy for errors.

  • How can I correct "hidden" text in a searchable PDF file?

    This seems like a simple question. However, the answers are invariably complex, do not yield the desired result, and often answer a different question entirely. I say all that just to warn people up front that the "problem" is easier than how many people and PDF application developers, including Adobe, typically understand it while the proposed "solutions" are invariably a total...well, botch is a reasonable word if a bit understated.
    Here is the actual problem:
    I have "searchable" PDF files created by scanning documents and running them through an OCR process. I create "searchable" PDF files in order to archive, index, and eventually enable searching for the documents scanned. A "searchable" PDF satisfies those criteria better than any other commonly used, "portable" archive format -- though I would be happy if someone could point out an obvious alternative I may have overlooked. I do not need perfect OCR results. If I need a document to edit or perhaps feed into a spreadsheet or database, I expect to be able to reprocess the page images in a given "searchable" PDF file to OCR and convert the contents to Word, RTF, Excel, or another file format as necessary with more care for the results than for the archived document itself. Therefore, the "searchable" PDF document is the scanned page images which compose it while the OCR generated "searchable" text is secondary, but still important. Therefore, each file must contain scanned page images of sufficient detail to be efficiently converted by OCR if possible and legible enough for whoever views the images to be able to work out what an OCR process may fail to understand. Once scanned, those pages are the "document" and therefore "immutable." However, OCR is imperfect. For a searchable document archive, it does not have to be, but some errors are significant in that they may prevent the document from being found by a search. Therefore, there must be a way to view and, if necessary, edit the "hidden" text in a "searchable" PDF without altering the visual display of a document or how it is printed. No strike-throughs. No visible "corrections." None of the stuff PDF editors want to insert into a PDF file when editing it. I do not want to edit the document without exporting it to a format appropriate for an editable document. I just want adequately "correct" hidden text in a "searchable" PDF file.
    I apologize for the length and redundancy in my description of the problem. However, past attempts to explain my problem and objectives as well as what I have seen in reply to similar queries across the Internet indicate that most people trying to answer this question come at it from the same point of view shared by most, if not all, PDF tool or application vendors. They seem to think that any desire to edit a PDF file is a desire to have a PDF word processor of some sort. Or, they assume that the OCR process employed may need tweaking of the means by which people apply it and then a process like "find suspects" is adequate to deal with any errors. But no, those are not what I am trying to accomplish and answers which address those topics do not answer this question.
    In short, which tool or application from any vendor will reveal the "searchable" hidden text in a PDF produced by any OCR or other process and then enable corrections to the hidden text without changing any document display parameters at all? Note, hidden text typically includes bounding box information denoting the portion of the image from which the text was recognized. That information must not be lost or changed when editing the "searchable" text.
    So, any tools or applications capable of doing this? If Adobe Acrobat XI Pro can (use of a trial copy demonstrated that the hidden text content can be reviewed, but editing did not work by any straight-forward means I could work out while trying out the application), fine. However, $500.00 list or even a $200.00 possible upgrade from a copy of Adobe Acrobat X Standard which came with my scanner is a lot of money for personal use when review and edit of the OCR generated hidden text in a "searchable" PDF file is the only function I require. Therefore, other suggested tools or applications which do what I need for less would be greatly appreciated.
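
    The requirement above (change the recognized text, leave its bounding box alone) can be illustrated with a small data model. This is only a sketch under assumptions: the `HiddenWord` class, its `bbox` layout, and the sample values are hypothetical and do not reflect how Acrobat or any OCR tool stores hidden text internally.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class HiddenWord:
    """One OCR-recognized word plus the image region it came from (hypothetical model)."""
    text: str
    bbox: Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1) in page units


def correct_word(words: List[HiddenWord], index: int, new_text: str) -> List[HiddenWord]:
    """Return a new word list with one word's text replaced; its bbox is left untouched."""
    fixed = list(words)
    fixed[index] = replace(words[index], text=new_text)
    return fixed


# Made-up sample data: "lnvoice" is a typical misread of "Invoice".
page = [
    HiddenWord("lnvoice", (72.0, 700.0, 130.0, 714.0)),
    HiddenWord("1945", (140.0, 700.0, 170.0, 714.0)),
]
corrected = correct_word(page, 0, "Invoice")
```

    Because the dataclass is frozen and `replace` copies every field except `text`, the positional information cannot be changed by accident while correcting a word.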

    My "claim"? Actually I've made no "claim" such as you've mentioned.
    Simply stated your OP has foundational premises that presume as factual what is not.
    Here, we're in Adobe's hosted user forum for Acrobat.
    Any other application use is not material. 
    Acrobat XI provides 3 OCR methods.
    Searchable Image, Searchable Image (Exact) & ClearScan.
    Only the first two provide the "hidden" text output.
    (Glyphs have no stroke, no fill)
    Going back to the Acrobat 3 product family, the design purpose of Searchable Image and Searchable Image (Exact) has been to facilitate the use of Find / Search.
    The "hidden" text can be touched up. Acrobat Pro provides the facility to view the hidden text.
    So you can see the OCR output that correlates to the bit-map images of the characters that are present.
    With Acrobat XI Pro use Tools - Protection - Remove Hidden Information
    In the Remove Hidden Information pane select "Hidden text" then "Show preview".
    The default for the preview is "Show Only Hidden Text".
    Back in the PDF --
    You'd select some of the hidden text and retype what you suspect is the correct string of characters.
    Save and return to the preview of the hidden text.
    If you got it right, good. Continue.
    If not, darn - try again.
    Plug 'n chug -- somewhere over the rainbow it'll be done eh.
    Full disclosure -- this is something I've done (enquiring minds don't you know).
    I've found it to be a rather Sisyphean undertaking.
    So, "doable" but not practicable.
    This is to be expected because such touchups are not the concern / focus of the output from Searchable Image or Searchable Image (Exact) - (the names tell it all).
    To have touchup "editability" of an OCR output using Acrobat make use of ClearScan.
    ClearScan replaces recognized character bit-maps with a character from an Acrobat internal font.
    The character strings can be selected to change to a generic, system available font.
    Something that is good to know when embarking on the "tweak the PDF" journey is that PDF (the file format / technology as defined by its ISO Standard, ISO 32000-1) does not tolerate "editing". PDF is decidedly not a word processor file format and "editing" can quickly render a PDF unusable.
    Minor touchups can be made and your best "tool" for this is still Acrobat Pro. (Save As often and periodically "bank" the PDF via some file rename scheme.) 
    Be well...
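
    For readers curious what "no stroke, no fill" looks like at the file level: ISO 32000-1 defines text rendering mode 3 (the `3 Tr` operator) as text that is neither filled nor stroked, i.e. invisible. The sketch below builds such a content-stream fragment as a string. It is an illustration only; the function names, the font resource name `F1`, and the coordinates are assumptions, and it is not a claim about the exact operators Acrobat writes.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for use inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text: str, x: float, y: float,
                       font: str = "F1", size: float = 12) -> str:
    """Build content-stream operators that place `text` using render mode 3.

    Mode 3 means the glyphs are neither filled nor stroked, so the text is
    searchable and selectable but not painted on the page.
    """
    return (
        "BT\n"                                   # begin text object
        f"/{font} {size:g} Tf\n"                 # select font resource and size
        "3 Tr\n"                                 # text rendering mode 3: invisible
        f"{x:g} {y:g} Td\n"                      # move to the text position
        f"({escape_pdf_string(text)}) Tj\n"      # show the string
        "ET\n"                                   # end text object
    )


ops = invisible_text_ops("Correct OCR (text)", 72, 720)
print(ops)
```

    This sketch handles plain Latin text only; a real page also needs a matching font resource and an encoding the font supports.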

  • Performing OCR text recognition in all pages of a PDF document which has blank pages

    The PDF pages in the file are scanned as images, but there are blank pages also. So, pages 1, 2 have images of scanned text, page 3 is blank, pages 4, 5 have images of scanned text, page 6 is blank and so on. The series of blank pages is 3, 6, 9....
    I am using Adobe Acrobat Pro version 9 on Windows XP. I choose the Document option in menu, OCR Text recognition->Recognize text using OCR->All pages. But, the OCR stops after page 2.
    Then, I go to page 4 and find that it is still an image (and I cannot copy the text from it). So, then I have to choose the Document option in menu, OCR Text recognition->Recognize text using OCR->All pages. But, the OCR stops after page 6.
    The document has some 200 pages and I don't want to manually choose the Document option in menu, OCR Text recognition->Recognize text using OCR->All pages for every run of pages which have text.
    I also tried choose the Document option in menu, OCR Text recognition->Recognize  text using OCR->Page 1 to page 200, but that too does not work.
    I can manually go to each blank page, delete it, then do the OCR but that would be tedious.
    1. Am I missing something in doing the OCR of all pages?
    2. Can I set up a batch process to do the delete of blank page series or can I set up a batch process which does OCR of all pages(and does not stop at a blank page) as it is doing now?
    Any suggestions would be appreciated.

    Hm, you could convert every pdf-page to an image file (jpeg, etc.):
    -install the package imagemagick
    -use the command convert
    convert -trim -density 150 your.pdf your.jpg
    This will generate a jpg-file for every pdf-page; "-trim" removes the white margins, and "-density 150" sets the resolution used to render each page to 150 dots per inch (you could also try "-resize" instead of "-density").
    This does not generate a new pdf-file, but maybe there is a way to convert the images back into a single pdf?
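
    On the open question of getting back to a single PDF: ImageMagick's convert also accepts several input images followed by a .pdf output name. The sketch below only builds that argument list in page order. It assumes the per-page files are named your-0.jpg, your-1.jpg, and so on (the numbering convert normally applies to multi-page input); `natural_key`, `build_merge_command`, and the output name merged.pdf are illustrative names, not part of ImageMagick.

```python
import re
from typing import List, Tuple


def natural_key(name: str) -> List[Tuple]:
    """Sort key that compares digit runs numerically, so your-2.jpg sorts before your-10.jpg."""
    return [
        (0, int(part)) if part.isdigit() else (1, part.lower())
        for part in re.split(r"(\d+)", name)
        if part
    ]


def build_merge_command(image_names: List[str], output_pdf: str) -> List[str]:
    """Return an argument list of the form: convert <images in page order> <output.pdf>."""
    return ["convert", *sorted(image_names, key=natural_key), output_pdf]


pages = ["your-10.jpg", "your-2.jpg", "your-0.jpg", "your-1.jpg"]
command = build_merge_command(pages, "merged.pdf")
print(" ".join(command))
```

    On a machine with ImageMagick installed, the list could be handed to `subprocess.run(command, check=True)`. The merged PDF contains only images, so OCR would still have to be run on it afterwards, and blank pages would need to be left out of `pages` beforehand if they are the cause of the stall.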

  • Make a pdf ocr text searchable on a mac with adobe

    How do I make a pdf ocr text searchable on a mac with adobe

    With "adobe" what: Acrobat?  What version?

  • How to use OCR Font A type by the time of writing some text into Pdf fil

    Hi,
    I am generating one pdf file in java. How can I use OCR Font A for text of pdf file ..Please can any one help where can I get OCR Font A and how to use that one in java ... I want to write some text into pdf file and that text should use OCR Font A family ...
    Thanks.

    This document shows how to disable OCR during conversion; just do the opposite: https://forums.adobe.com/docs/DOC-3062

  • Correcting OCR'd text misreads

    Hi everyone,
    I'm using Acrobat Pro 8.
    I've OCR'd a bunch of scanned documents and found that the OCR utility does tend to skip over some lines and other times it misreads text (and doesn't mark it as suspect).
    Is there a way to correct misread text? Is there a find/replace feature in Acrobat Pro 8?
    What I've come up with is using the text touchup tool and importing the text into word to do a global find/replace. But I'd rather do it in acrobat if it's available.
    Thanks!
    Kevin

    You can do a find, but not a find and replace. Searching by hand is probably just as easy as the Word route. You are finding some of the limitations of the OCR utility. If you do a lot of OCR, you really need to look into a full-fledged OCR package, not this plugin for Acrobat.
    The quality of the original images is very important in the ability of Acrobat to do the job. Also, you need to be sure that you have proper resolution and such (or Acrobat won't even run the OCR).
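
    If the text is exported anyway, the global find/replace step does not have to happen in Word. A minimal sketch, assuming the OCR text has already been saved as plain text; the `fix_misreads` helper and the correction table are made up for illustration and would need to be built from the misreads actually seen in the documents.

```python
import re
from typing import Dict


def fix_misreads(text: str, corrections: Dict[str, str]) -> str:
    """Replace each misread whole word with its correction, leaving other text alone."""
    if not corrections:
        return text
    # \b anchors keep the match to whole words, so "tbe" inside "footbed" is not touched.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, corrections)) + r")\b")
    return pattern.sub(lambda match: corrections[match.group(1)], text)


# Hypothetical correction table: misread word -> intended word.
corrections = {"tbe": "the", "rnodel": "model", "0CR": "OCR"}
sample = "Run 0CR on tbe rnodel of the footbed."
print(fix_misreads(sample, corrections))
```

    This only cleans up the exported copy; it does not write anything back into the hidden text layer of the PDF.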

  • When i add text to PDFs and work on the file for awhile, my red added text starts to turn into red X's with a box around them. I have OCRs turned off, i have the latest update, and i have registered the product. What is happening to my text while i'm work

    When i add text to PDFs and work on the file for awhile, my red added text starts to turn into red X's with a box around them. I have OCRs turned off, i have the latest update, and i have registered the product. What is happening to my text while i'm working on these files? On top of this, my red arrows get moved around also.

    Hi ,
    Could you please share a few details, such as which version of Acrobat you are using?
    Which OS are you working on?
    Does this happen with one particular PDF, or with all of them?
    Have you tried the same steps with OCR turned on? Please check and compare the results. Does that help in any way?
    If the file is not confidential, could you please share it with us so that we can analyze it on our end and get back to you with an appropriate answer?
    Please share the file on [email protected] and please cc [email protected] as well .
    Regards
    Sukrit Dhingra

  • Combine OCR pdfs into one file and keep OCR text

    I posted a few weeks ago, but found out today that I misunderstood the problem. We got back our files from the vendor digitized as individual pdfs that have already been OCRed. When we try to combine the individual pages into issues (these are newspapers), the OCR text is lost. We're using Acrobat 8 and 9 Professional.
    Thanks!

    Hi
    I tried this with A8 and A9 but could not replicate the issue as mentioned by you. Here is what I did:
    1. OCR two single page scanned PDFs.
    2. Combine the two single page OCRed PDFs into one PDF.
    3. Check if the text is still there.
    I found that the text was retained.
    How were the PDFs OCRed in your case? Were these PDFs OCRed using Acrobat or some other tool?
    -Ravish

  • New user can't figure out fixing OCR text

    I am a new user of Acrobat Professional 8.1.2 on Mac OS 10.5.3. There's lots I don't know so I hope my problem has a simple solution where an experienced user can say of course, you should be doing this or that.
    I am making pdfs from scans of printed articles. The scanner I'm using isn't mine and isn't connected to my computer so I can't scan direct to pdf. So I'm using jpegs, all scanned at 300 to 600 dpi, and cleaned up a bit with Photoshop before I read them into Acrobat.
    Everything works great up to the point where I cut in the OCR text recognition. Even though this text is perfectly clean and readable to the eye, the accuracy of the OCR (as determined by copying blocks to a text reader) leaves much to be desired. The Acrobat help file says to correct any errors using the Find OCR Suspects function. But this function is far too unsuspicious - it found no suspects at all in any of these files!
    I tried correcting by hand using the Touchup Text tool. To begin with, this didn't work well because the invisible text I was selecting to correct, was *not* directly over the visible text in the bitmap. But far worse than that, after I make 2 or 3 corrections this way something causes the "invisible text" in that entire paragraph to be hopelessly garbled!
    Incidentally, my text recognition settings are: Language, English. Output style, Searchable Image (Exact). Downsample: none.
    So, what silly little thing am I not doing that I should be?

    I just tried using my Bold. When I hit "Open" I receiving the following error: "The media being played is of an unsupported format"
    When I try to open the file on my PC the link is an ASX file. Here's a list of the supported media types on a BlackBerry. ASX is not listed
    http://www.blackberry.com/btsc/KB05482

  • How to change text in PDF doc. which is a musical score

    Hello,
    I'm new here, so please excuse me if I do or say something I shouldn't.
    I need to change the words in a musical score because the font is too small. OCR recognition doesn't work because there are illustrations that are different from images or text... Is there a way to get in there and make the changes I need to do?
    Any help greatly appreciated.

    Thanks for the reply, but I have Adobe Reader 9 Pro. Will it still not work?
    On 29 Sept. 2011 at 16:09, Claudio González wrote:
    Unfortunately, not with the free Reader.

  • Why can't I "Save as Text" a pdf file received as an email attachment?

    I can "Save as text" a pdf file which I have created in my own computer (that is, it goes into MS notebook that I then can Copy and Save as an MS Word file) but not when I receive a pdf as an email attachment. (The file is saved, but it is empty.) Why would I want to convert my own pdf back to text? Well, in case I no longer have the original Word document I suppose, but the thing is "Save as text" works with my pdf, but not with those I receive from others. How come? Thanks!

    Is this a scanned PDF? If so, it must first be OCR'd.

  • Is it possible to add or change OCR text in a batch process?

    Hello,
    This is my first submission.  Really hope you can help as if there's a solution to this it will significantly help our business.
    Is it possible to 'batch process' the adding or changing of OCR text in a PDF?
    This might sound like a strange process but let me explain what we do:
    1. We scan old, handwritten books and registers.  Typically these registers contain lists of information. Each page will be scanned to a filename like pge01.jpg, pge02.jpg, pge03.jpg etc.
    2. We transcribe the content of each page.  Each page will contain multiple records (e.g. 20 records). Typical fields might be:
    Unique ID
    Surname
    Forename
    Year
    Address
    Filename (i.e. pge02.jpg)
    3. We provide this content back within database driven software so that when a user performs a search on, say, 'Surname=Jones' & 'Year=1945', all of the scanned pages containing handwritten text that matches the search criteria are displayed in a list.  The user can then click on a search result and see the scanned page containing that record.
    Rather than provide database driven software, we'd simply like to produce a standard PDF file.  Each page within the PDF will show each of the scanned pages (pge01.jpg, pge02.jpg, pge03.jpg etc.).  But where you would normally store the 'ocr recognised text' behind each image, we would like to show our transcribed content.
    If this is possible, then I realise that it's likely that you can't do field searches (i.e. 'Surname=Jones' & 'Year=1945') but at the very least I'd be able to type 'Jones' in to the search box and it would find all pages that contained the transcribed word 'jones'.
    If it's possible to add the transcribed data as 'ocrd text' then is it possible to do it in some sort of batch process?  We scan lots of big books and capture millions of records - so doing it manually is not an option.
    Any help that anyone can provide will be hugely appreciated.
    Thanks,
    Paul

    I don't think it's possible to manually add OCR'd text, but you can add form fields with the text in them. And yes, it is possible to search the content of form fields, using a script.
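
    Independently of how the text ends up inside the PDF, the lookup Paul describes (type 'Jones', get every page whose transcription contains it) amounts to a word-to-page index. A minimal sketch, assuming the transcribed records are available as dictionaries with the field names listed in the question; the sample records and the `build_index` / `find_pages` helpers are made up, and this is not Acrobat scripting.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(records: Iterable[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each lowercased transcribed word to the set of page filenames containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        page = record["Filename"]
        for field, value in record.items():
            if field == "Filename":
                continue  # the filename identifies the page; it is not searchable content
            for word in value.lower().split():
                index[word].add(page)
    return index


def find_pages(index: Dict[str, Set[str]], term: str) -> List[str]:
    """Return the sorted page filenames whose transcription contains `term` (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))


# Made-up sample records using the fields from the question.
records = [
    {"Unique ID": "1", "Surname": "Jones", "Forename": "Mary", "Year": "1945",
     "Address": "2 Mill Lane", "Filename": "pge02.jpg"},
    {"Unique ID": "2", "Surname": "Smith", "Forename": "John", "Year": "1945",
     "Address": "4 Mill Lane", "Filename": "pge02.jpg"},
    {"Unique ID": "3", "Surname": "Jones", "Forename": "Alun", "Year": "1946",
     "Address": "9 High Street", "Filename": "pge03.jpg"},
]
index = build_index(records)
print(find_pages(index, "jones"))
```

    The same per-page word lists could also serve as the text placed into each page's form field in a batch run.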
