Automation of OCR and text extraction possible in Acrobat 9?

Hi,
I need to know whether I can use Acrobat 9, on Windows, in the following way.
I have an application running. It has a set of image files and image-only PDFs. For each one it sends a request to Acrobat to OCR the file, produce a text-searchable PDF, produce a text file of the text content, and save both to paths supplied by the application. This should all be done without any user input, so there shouldn't be any windows popping up waiting for a response; failing that, the program should be able to detect that such a window has come up and deal with it itself.
Is this feasible? I would hopefully be coding in C or Ruby.
It would also be nice to be able to ask Acrobat whether a given PDF is already text-searchable or not, although this is not essential.
Thanks,
Farmer.

Well, since the actual use of Acrobat for their purposes will be very limited, another option might be as follows:
They install Acrobat on a couple of computers in the office. I write a simple plugin for Acrobat to perform the above processing on a desktop. Whenever there is a set of documents needing to be converted (as described above), one of the people with Acrobat runs the plugin on the documents, then sends the processed files to the server. (Though maybe I don't need to write a plugin? I remember seeing that Acrobat does do batch processing itself anyway.)
Does this sound reasonable? I have looked at the on-line description of LiveCycle PDF Generator but it seems to have a lot of extra stuff that's not necessary for my situation, and seems that it will be much more expensive.
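A minimal sketch of the part of this workflow that does not depend on Acrobat: collecting the input files and deciding where each searchable PDF and text file should go. `OcrJob`, `plan_jobs`, `run_all`, and the `run_ocr` callback are hypothetical names for illustration, not Acrobat APIs; the OCR step itself (an Acrobat batch sequence, a plugin, or another engine) is left to the caller. The extension list is also an assumption.

```python
from dataclasses import dataclass
from pathlib import Path

# Extensions treated as OCR inputs (an assumption; adjust to your scanner output).
INPUT_EXTENSIONS = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}


@dataclass(frozen=True)
class OcrJob:
    source: Path          # image file or image-only PDF
    searchable_pdf: Path  # where the text-searchable PDF should be saved
    text_file: Path       # where the extracted plain text should be saved


def plan_jobs(input_dir: Path, output_dir: Path) -> list[OcrJob]:
    """Build one job per supported file found directly in input_dir."""
    jobs = []
    for source in sorted(input_dir.iterdir()):
        if source.is_file() and source.suffix.lower() in INPUT_EXTENSIONS:
            stem = source.stem
            jobs.append(
                OcrJob(source, output_dir / f"{stem}.pdf", output_dir / f"{stem}.txt")
            )
    return jobs


def run_all(jobs, run_ocr):
    """Call the caller-supplied run_ocr(job) for each job.

    Failures are collected and returned instead of stopping the batch,
    so an unattended run never waits on a person.
    """
    failures = []
    for job in jobs:
        try:
            run_ocr(job)
        except Exception as exc:
            failures.append((job.source, str(exc)))
    return failures
```

Note that two inputs sharing a stem (say `a.tif` and `a.pdf`) would map to the same output paths in this sketch, so a real version would need a naming rule for that case.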

Similar Messages

OCR and Extract

Hi,
I have two questions:
1) When I try to "Recognize Text Using OCR" a document (which is a collection of scanned images of invoices), the result is extremely poor, with most characters not being recognized at all. The image quality is not great, but I have been able to obtain better results by a) printing an invoice and scanning it with OCR; and b) using Able2Extract from www.ocr.com. However, a) would take too long and b) doesn't seem to have the option of saving OCRed pages as PDF.
My question is: is it possible to OCR a lower quality PDF more reliably in Acrobat? If not, is there other software that can do this?
2) Is it possible to extract pages that satisfy certain conditions? For example, can you extract all pages that contain any kind of notes or markings (e.g. Acrobat highlights) or pages that contain certain words?
Thanks in advance for your assistance.

An answer to question 2 is a conditional yes. The condition is that you have a version which is not Reader (which I assume you do), and also you will need a custom-made script that will determine which pages to extract. Contact me by email if you're interested in such a tool.
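As a rough illustration of the page-selection logic such a script would need, here is a sketch that picks pages by keyword. It assumes the text of each page has already been extracted into a list of strings by some other tool; `pages_matching` is a hypothetical helper, not part of Acrobat, and it does not cover the highlight/markup case.

```python
def pages_matching(page_texts, keywords):
    """Return 1-based page numbers whose text contains any keyword.

    Matching is a case-insensitive substring test; empty keywords are ignored.
    """
    wanted = [k.lower() for k in keywords if k]
    hits = []
    for number, text in enumerate(page_texts, start=1):
        lowered = text.lower()
        if any(k in lowered for k in wanted):
            hits.append(number)
    return hits
```

The resulting page numbers could then be handed to whatever tool performs the actual page extraction.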

Proofread and correct OCR'd text in Acrobat 10 Pro

How do you proofread and correct text produced by OCR from a scanned document, in Acrobat 10 Pro?
I scan (many, large) paper documents, then use Recognise Text. After the OCR phase, if I save PDFs as text, I can see many scan errors.
I would like to be able to correct those errors in the scanned text, so that names etc can be successfully searched. However I cannot find any way to view and correct the scanned text.
I experimented with Tools / Content / Edit Document Text, but I cannot see how to display the scanned text to allow correction. It appears to operate on the PDF image. But if I try to change the document image to correct known errors (e.g. in spacing), and then save the PDF as text again, the string where I changed the image becomes gibberish.
How is Edit Document Text supposed to work? Is there any way to achieve what I am looking for (fixing many errors in large OCR'd documents)?
Regards,
Sue.

"This is a 76-page document, and the users will expect to see the image looking like the scanned original."
That locks it down. The only way to satisfy this is Searchable Image (Exact).
The scanned image serves as an objective replacement for the source hardcopy.
The OCR output exists to facilitate search/find.
At the end of the day there is no practical means of editing OCR's Hidden Text layer with it in the PDF.
That's not to say you cannot work at it and get results. But, the operative word is practical.
In that context you may want to look over a reply I made here:
http://forums.adobe.com/thread/950209?tstart=0
To increase accuracy of OCR recognition:
Yes, there are dedicated OCR applications (desktop or server). Having used several of each as well as Acrobat's OCR, I've learned that the scanner and the quality of the hardcopy source are also significant.
Regarding the remainder of your post above.
Ok, I cannot replicate what you describe with a PDF I've been using.
It is a scanned image of a single page of textual content.
After ClearScan I can export to Word (&, of course, have some cleanup required).
I can use the TouchUp Text / Edit Document Text tool to select all the PDF page's content (the ClearScan output).
Changed the font to TimesNewRoman, saved, and exported to Word.
The content in Word needed cleanup.
Next, I selected various words and typed in a replacement word.
After a Save I Exported to Word. The changed words carried through.
re: Q1 - What you describe is symptomatic of the Hidden text output of Searchable Image / Searchable Image (Exact) and not ClearScan. So, I'm perplexed.
re: Q2 - An advantage of ClearScan is being able to edit a text string to correct it. So, sure, why not correct? With that said, it can be a tedious and labor intensive activity. As well, typos are possible during correction which begs the question "Who bells the cat?" 8^)
re: Q3 - If corrections to the ClearScan output meet your needs an export to Word may not be needed.
However, sometimes ClearScan cannot recognize the image of a character and leaves it as a bitmapped image.
So, to correct you'd have to get into a word processor.
re: Q4 - Goes back to Q3.
Here are some useful video tutorials:
http://acrobatusers.com/tutorials/clearscan-vs-imagetext-ocr
A listing of others: http://acrobatusers.com/tutorials/filter/search&keywords=scanning%20ocr&tut_type=Video&channel=tutorials/
At Adobe TV:
http://acrobatusers.com/tutorials/filter/search&keywords=scanning%20ocr&tut_type=Video&channel=tutorials/
Be well...

OCR and hidden text in PDF scans of historic documents

I need to edit the hidden text behind a scanned PDF image of a document. The image must remain as an “exact” copy of the original scanned document.
I used Acrobat Pro (versions 7 and 9) to make PDF images of old typed documents from the 1940’s. When I open those images and run OCR in version 9, then examine the hidden (invisible) text layer behind the image, there are errors. For example, the word “book” has been picked-up by the OCR as the word “look.” I need to change the “l” to a “b” in order to make the PDF accurate when it is searched at a later date.
I have checked many user forums. Most people imply that hidden text can be viewed, but NOT edited in Acrobat Pro 7 and 9. (Hidden text can be viewed in Version 9 by selecting “Document” “Examine Document” and then clicking on the “+” symbol next to “Hidden Text,” then clicking “Show preview.”) Some say to use Adobe Capture 3.0 to edit hidden text. Others say to use Photoshop or Illustrator to edit hidden text (I think these folks may have been confused, because Photoshop and Illustrator would be used, logically, to edit the image ON TOP OF the hidden text). Yet another person seemed to say that a hidden text editor was added to Acrobat 8, but was taken away in Acrobat 9. (I can’t verify that because I don’t have version 8.)
The closest answer I was able to find involved using the Text Touch Up Tool on top of the image to edit hidden text behind it, but when you do that you are typing “blind.” In other words, you highlight a spot on the image (top layer) where you THINK the error MIGHT be, and you type the correction without being able to see what you are typing over. Then, you go back to the “Examine Document” procedure (described above) to see if you “hit” your mark, and if not, you redo it until you do “hit” your mark. With the number of documents and corrections that we have, that procedure would be too labor intensive and thus a budget breaker.
If we have to buy more software, my preference would be to buy a genuine Adobe product because I have experienced problems in the past switching back and forth between Adobe products and other PDF manipulation software.
Can anyone answer any of these questions:
(1) Is there a way in Acrobat versions 7, 8 or 9 to edit hidden text, and if so, how?
(2) What Adobe software (other than Acrobat) will edit hidden text behind a PDF image?
(3) Assuming no Adobe product will edit hidden text behind a PDF image, are there any non-Adobe products that will do that?
Thank you!

Hi,
Unless you use Acrobat 8 Pro's "Formatted Text & Graphics" or Acrobat 9 Pro's ClearScan you will find that there is no practicable means of editing the OCR "hidden text" in a PDF.
The TouchUp text tool (Advanced Editing toolbar) is reliant upon the selected text having an available system font to use during touchup. However, both Searchable Image and Searchable Image (Exact) OCR output is of text rendering mode 3 (invisible text) that is provided from within Acrobat and not any installed system or other application installed font.
With Searchable Image (Exact) you have the untouched image augmented by the invisible text which is provided as a user aid for search or find with Adobe Reader or Acrobat. The invisible text is not intended to support word processor like editing.
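For readers wondering what "text rendering mode 3" looks like inside the file: in a PDF page content stream the rendering mode is set with the `Tr` operator, so invisible OCR text is typically preceded by `3 Tr`. The sketch below assumes the content stream has already been decompressed by some other tool; `uses_invisible_text` is a hypothetical helper, and it is only a crude heuristic (the same bytes inside a string literal would also match).

```python
import re

# Matches the content-stream operator "3 Tr" (set text rendering mode to 3).
# The lookbehind rejects operands such as "13" or "0.3".
_INVISIBLE_TR = re.compile(rb"(?<![\d.])3\s+Tr\b")


def uses_invisible_text(content_stream: bytes) -> bool:
    """Return True if the decompressed content stream sets rendering mode 3."""
    return _INVISIBLE_TR.search(content_stream) is not None
```

A check like this could serve as a first hint that a page carries a hidden OCR layer rather than ordinary visible text.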
To your questions:
#1. There is no practicable way to edit invisible text (text rendering mode 3) with Acrobat (any past or current release).
#2. None.
#3. A good question. Perhaps a specialty program. Keep in mind, many products provide a promise but those that actually deliver tend to be expensive.
Something to play with. Using Acrobat 9 Pro or Pro Extended, try the Preflight Fixup to embed hidden text.
Then try using the TouchUp Text tool. You may also want to see if you can change the font type of this newly embedded font.
(use copies of the "real" files - just in case <g>).
Be well...

Pdf text extract problem with CID font and Identity-H

Hi all,
I am facing a big problem with text extraction from a PDF file.
Currently I am using congviews pdf2xl text extraction tool.
About 95% of the text extracts correctly, but a few characters show up as boxes, some as ?, and some as a dotted circle mark.
Font Used:
ArialUnicodeMS(Embedded Subset)
Type:True Type (CID)
Encoding:Identity-H
TimesNewRomanPSMT
Type:True Type
Encoding:ANSI
ActualFont:TimesNewRomanPSMT
ActualFontType:TrueType
Anyone please help me to overcome this.
Regards
Gilbert.X

I tried the Acrobat Pro 9 export option; it retrieved only letters and numbers, and all of the Hindi characters show up as just ........
By the way, how can I upload my PDF file to this forum? Please guide me.
Regards
Gilbert.X
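One plausible cause, offered as an assumption rather than a diagnosis of this particular file: with Identity-H encoding the bytes of each string are 2-byte codes that select glyphs in the font directly, so an extractor has to rely on the font's ToUnicode mapping to get back to Unicode. Where that mapping is missing or incomplete, tools tend to emit boxes or "?". The sketch below shows the decoding step with a hand-written mapping; `decode_identity_h` and the sample mapping are hypothetical and do not parse a real PDF.

```python
def decode_identity_h(raw: bytes, to_unicode: dict[int, str]) -> str:
    """Decode 2-byte big-endian character codes using a ToUnicode-style mapping.

    Codes missing from the mapping become U+FFFD (the replacement character),
    which is roughly what shows up as a box or '?' in extracted text.
    A trailing odd byte is ignored.
    """
    out = []
    for i in range(0, len(raw) - 1, 2):
        code = int.from_bytes(raw[i:i + 2], "big")
        out.append(to_unicode.get(code, "\ufffd"))
    return "".join(out)


# Hypothetical mapping: only two of the three codes below are covered.
sample_map = {0x0024: "A", 0x0025: "B"}
decoded = decode_identity_h(bytes.fromhex("002400250099"), sample_map)
```

Complex scripts such as Devanagari add a further difficulty, since one glyph can stand for several characters, so a per-glyph mapping may be incomplete even when present.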

I am so confused. I am thinking about buying an iPhone 4S but I want to use those card things that you can put the code in for the phone is this possible? I'm not talking about the 50$ unlimited card just the regular 20$ cards that hold mins and texts...

I am so confused. I am thinking about buying an iPhone 4S but I want to use those card things that you can put the code in for the phone is this possible? I'm not talking about the 50$ unlimited card just the regular 20$ cards that hold mins and texts... As long as it has the support for the service it's hooked up to?

As far as I know there is no US cell carrier that allows pay as you go with an iPhone. You can purchase an unlocked iPhone 4S at an Apple Store and try it with a non-supported carrier.

Horizontal RSS feed- is it possible to create an RSS feed (showing both pictures and text), that scrolls from left to right rather than top to bottom?

I am trying to create an RSS feed that will scroll from left to right rather than top to bottom as is standard in iWeb. Ideally, I would like to show both images and text. I am actually using iWeb '09. I am not opposed to writing html code if necessary.
Thank you for your help!

Problem solved without having to change anything in the Flash.
The solution was to add a small rule to the CSS in the HTML:
html {
overflow-x: hidden;
overflow-y: auto;
}
That solves the problem in all browsers. VERY useful to know!

Is it possible to have icons for the main forward/back toolbar line and text only for the bookmark line in the toolbars?

In the customizing toolbars section it looks like you can either have icons and text, icons only or text only.
I'd like to have the toolbar with the back/forward buttons, home button, etc. be icon only and the toolbar below, which is the bookmark bar to be text only. I don't see how to do this.
Thanks.

See:
*Bookmarks Deiconizer: https://addons.mozilla.org/firefox/addon/bookmarks-deiconizer/

Acrobat 9 OCR and "OCR Suspects"

I downloaded the trial for version 9.
Took a poorly scanned page and OCR'd it.
It (expectedly) had a few errors.
Then I selected "OCR Suspects" from menus.
What it should have done is found the "low confidence" results, but
instead, it said no OCR suspects were found.
This used to work in version 8, but I can't get 'OCR suspects' working in V9 trial.
Can anyone confirm if this works in the full version of Acrobat 9 Pro or Standard?

It's strange that while I posted to this Adobe forum, there is a response over at objectmix.com. As contributing to this topic from 2 locations seems confusing, I'll carry on here.
Amannagpal76 responded, saying in part that ClearScan in 9 Pro replaces Formatted Text & Graphics. Good to know this. ClearScan does, however, continue the mix. If ocr doesn't work on a character graphic, that graphic will continue to be displayed as such, amidst ClearScan's synthesized type 3 font imitation of the original font. This is most obvious when using the marquee zoom tool.
Aman suggests using the Touchup Text Tool and changing the font to any font installed on one's system. This doesn't work for ClearScan. Selecting a different font in Touchup for a PDF that came via a wordprocessor works fine, but not for a PDF that came via a scan. That, unfortunately, is the only time that ClearScan is used. The error message when I try this states that there's no system font to match the one in ClearScan, and text can't be added or deleted.
ClearScan is remarkable for the small size file it produces. That size can be reduced considerably even further by converting it to the Adobe 7 file format. ClearScan's synthesized font is also remarkable when enlarging the page on screen. Then you can see its true outlines -- rather chewed up in high magnification, but that's OK. It would be nice to extract the font in question and use it on one's system. One downside to ClearScan is that its ocr fails to retain italics when output to RTF and Word.
I have never found a suspect in 9 Pro.
The conclusion from the above is that the hidden text produced by any ocr'ing in 9 Pro can't be corrected.

OCR and Reducing file size

I have a large document (a book) that I am trying to scan. I will be scanning it chapter by chapter. The book was printed in grayscale, so I don't have a pure BLACK AND WHITE document. I would like to optimize the file size, but I have a few questions about that.
Currently running:
Windows 7
Acrobat Pro X
Epson GT-S80 High-speed scanner
1. What is a good typical workflow? I have tried scanning the documents to PDF using the scanner's software and then opening them in Acrobat to OCR them. I have also tried using Acrobat's Scan feature with OCR as one of the steps in the scanning process. I have tried letting both programs do their own color mode detection, where they mix black and white and grayscale to reduce the file size, but have typically told them to stick with grayscale because that gives me the cleanest and clearest document. Does anyone have any recommendations on getting a good quality image using a mix of black and white and grayscale, or should I keep using just grayscale?
2. I am having some trouble, I think, with the file size. I have a 12 page document I believe was either scanned at 300 dpi or was scanned at full resolution because I used CLEARSCAN, and downsampled everything to 300 dpi. I don't remember exactly, but that file is about 2.20 MB in size, and I think that runs about 185K per page. I would think there could be a way to get a smaller file.
3. For text recognition purposes, this document is not ideal because it is a collection of powerpoint slide sheets (2 - 3 slides per page), and in some cases there is text on top of image in the slides, and it seems very hard to discern.
4. Once a document has been scanned, and OCR has been run on it, I was under the impression that the OCR is in a separate layer, and that (if Searchable Text is chosen), you basically have a scanned image with another layer of searchable text. Because the OCR'd text is "there somewhere", is it possible to remove the scanned image text, and have just the raw recognized text, similar to if I created the document in Word, and created a PDF?
5. Sort of back to number 1, suppose I am stuck with leaving the scanned image behind, and just running OCR, what is the optimal way to reduce the file size of the PDF? I had read that running your scan at 600 dpi may help with the text recognition. The same article suggested doing the higher resolution scan and using the ClearScan because it would a) recognize the text better and b) convert the text image to actual text and reduce the file size. From there, should I then just run the PDF optimizer to downsample the images to a certain DPI to further reduce the size?
Hopefully you all can understand what I am saying and help fill in some gaps.
Thanks,
Ian
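To put the numbers in question 2 in perspective, here is a back-of-the-envelope calculation of how large a scanned page is before any compression. It assumes a US Letter page (8.5 x 11 inches), which the post does not state; `raw_page_bytes` is a hypothetical helper.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


gray_300 = raw_page_bytes(8.5, 11, 300, 8)   # 8-bit grayscale at 300 dpi
mono_300 = raw_page_bytes(8.5, 11, 300, 1)   # 1-bit black and white at 300 dpi
gray_600 = raw_page_bytes(8.5, 11, 600, 8)   # 8-bit grayscale at 600 dpi
```

Under that assumption a 300 dpi grayscale page is about 8.4 MB raw, a 1-bit page is one eighth of that, and 600 dpi quadruples the pixel count, so roughly 185K per page already reflects heavy compression. Further savings would most likely have to come from the color mode or resolution choices discussed above.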

Let us know if this tutorial helps you with your workflow: "Acrobat X: Taking the guesswork out of scanning to PDF".

Partial OCR on one page possible?

I'm using Adobe Acrobat 9 Pro (with Windows XP). I've got a PDF page that shows various news articles. The end user should be able to use CTRL + F for some, but not all, of the articles on that page. Is it possible to apply OCR to only part of one page for the articles where the text should be findable? I have experimented with layers, but so far have not been able to get this to work.
If Adobe Acrobat cannot do this, is there another application that can do partial OCR and still have the results look like the original scan after OCR has been applied? This would definitely be a compromise, but I'd like to be aware of any possible option out there.

Nobody replied, so I put in a request directly with Adobe Technical Support. They supplied this answer:
I understand that you would like to know if Acrobat can perform an OCR
on only a portion of a page instead of the entire page of a PDF file.
Acrobat does not contain a preference that allows you to OCR only a
portion of a page.
I have submitted this as a feature request.

Can OCR'd text be edited?

I'm using Acrobat 9 Pro, and working with a document, a TIFF file, that was converted to PDF and OCR'd in Acrobat. Now I need to edit some of the text using the TouchUp Text tool, but am not able to edit it. In previous versions I was able to edit OCR'd text. Is it still possible in Acrobat 9?

Hi jay,
Keep in mind that there are three ways to OCR with Acrobat.
With Acrobat 8 Pro/3D:
1. Searchable Image
2. Searchable Image (Exact)
3. Formatted Text & Graphics
With Acrobat 9 Pro/Pro Extended
1. Searchable Image
2. Searchable Image (Exact)
3. ClearScan
Methods # 3 may be used to edit "suspects".
Methods #1 & #2 place invisible text (text rendering mode 3). There has been and continues to be no practicable means of editing this directly.
It is not intended to be "word processor like editable text" (but then neither is any PDF content).
However, there is something to play with.
If you used methods #1 or #2 and you have Acrobat Pro/Pro Extended, try the Preflight Fixup that embeds fonts.
"Embed fonts (even if text is invisible)"
Which is described:
"If a PDF uses fonts which are not embedded into the PDF file they are embedded.
This fixup embeds fonts even if they are only used for text which is invisible (text rendering mode 3).
It is required that the respective fonts are present in the system's font folders.
Some fonts may have a flag indicating that their license does not allow embedding.
In this case the fonts are not embedded into the PDF."
I've had no occasion to use this fixup so I cannot say it will or will not serve your needs.
But, may still be worth a run on some trial files, eh?
"Rotated"
The OCR characters (the "invisible text" of text rendering mode 3) are "rotated".
Be well...

I need to search a multi-page pdf and then extract just the pages returned by the search.

Version 5.0 of Preview allowed me to search a multi-paged pdf for specific text and then extract those pages returned in the search. Version 5.5 of Preview has removed that function. Does anyone know of a work-around, either using a third party piece of software, or perhaps automator? I'm desperate!
Thanks!

BrooklynJohn wrote:
Version 5.5 of Preview
Sorry I can't answer your question, but I don't understand how you are using this version of Preview with Mac OS X v10.6.4.

Pictures and text boxes in rtf template

Hi there,
are there any general rules for applying pictures and text boxes to rtf-templates?
As far as I have experienced, the text boxes are not shown when generating a report. Is it generally not possible to display text boxes? Are there any work arounds?
Thank you,
BR
Lena

Hi,
do you mean instead of text boxes?
Yes, that is no problem.
Are there any known issues regardings pictures and any other design elements?
Thank you!
BR
Lena

Chart and text items are not synchron

hi all,
i am a newbie in apex 4.0.
i have two regions. the first shows a chart, based on pl/sql-code. the second shows text items for possible filters and the current statement of the chart.
when user clicks on a bar in the chart then
1. i set the first text item as filter,
2. a pl/sql-code runs, build a sql statement, assign the sql statement string to the second text item and
3. returns the statement for the chart.
it works fine. the chart refresh. and the data i can see are correct.
but the text items in the other region don't refresh.
when i press "refresh"-button at browser level everything is fine. chart and text items fit together.
what can I do to refresh all regions when the chart changes the sql-statement / User clicks?
Is there any way to read out the current statement of a chart?
thx in advance
jogi
Edited by: Jogi on 09.05.2011 12:31

Bernd: the reason I asked about views is that you don't have any error messages. This might indicate that (a) you have no items in the view, or (b) there's something wrong with view-role-user assignment.
To check for (a), please go to the published procurement catalog, and go to Views tab. Check that your View is Active. Click on your View ID link to display view details. You should see a list of characteristics assigned to your view in Assign Characteristics sub-tab (the list should not be blank!). Go to Assign Items sub-tab. Navigate in your schema to find items that are supposed to be assigned to your view. You should see "Yes" in the "Assigned" column for those products. If you don't, then you simply don't have any items in your view.
Another thing I'd like you to check: when the user calls your procurement catalog for search, do you see the name of the catalog displayed just below the drop-down "Select Categories Hierarchically"?
Cheers,
Serguei
