Using OCR to Create Searchable PDFs

Hi, I'm converting about 30 bankers' boxes of paper into PDF. What I want to do is save each file as a PDF that is keyword-searchable. I DON'T want to convert everything to MS word.
I'm new to this and started the job yesterday with ReadIris 11, but I'm not happy with the results. I can't figure out how to get ReadIris to keep the original formatting and to save as a PDF.
I don't want the formatting or the look of the docs to change - I just want to make new PDFs that are searchable by keyword. I am addicted to using the "search" feature in the Finder to locate files by keyword and if I could search ALL my PAPER files as well, I would be in heaven.
Any suggestions on how to do this? Thanks.

thanks for the excellent tips. i've been making some headway with this project - and you are right i think it is best to scan the docs first, save them as pdf images, and then convert them later. this will save me some time as i won't have to convert everything all at once. and i'll have both the original and the OCR versions. i'll definitely look into PDFpen.
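
For what it's worth, the "scan first, convert later" workflow can be sketched roughly like this. It is only a sketch under assumptions: the folder layout (one folder of image-only scans, one folder of OCR output) and the `ocr_one` hook are hypothetical, and the hook stands in for whatever OCR tool you end up using. The originals are never touched, and a file is skipped once an OCR copy with the same name exists, so the job can be done in batches.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def pending_scans(scan_dir: Path, ocr_dir: Path) -> list[Path]:
    """Return scans in scan_dir that have no same-named PDF in ocr_dir yet."""
    done = {p.name for p in ocr_dir.glob("*.pdf")} if ocr_dir.is_dir() else set()
    return sorted(p for p in scan_dir.glob("*.pdf") if p.name not in done)


def run_batch(
    scan_dir: Path,
    ocr_dir: Path,
    ocr_one: Callable[[Path, Path], None],
) -> list[Path]:
    """OCR every pending scan into ocr_dir; files in scan_dir are left as-is."""
    ocr_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in pending_scans(scan_dir, ocr_dir):
        dst = ocr_dir / src.name
        # Hypothetical hook: call your OCR tool here (source path, output path).
        ocr_one(src, dst)
        converted.append(dst)
    return converted
```

Running `run_batch` again later only picks up scans added since the last run, which is the point of keeping both the original and the OCR versions.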

Similar Messages

  • How do i create searchable pdfs on acrobat 9

    how do i create searchable pdfs on acrobat Pro 9?

    Master your content in a text editor (Notepad), a word processor (Word), a page layout application (InDesign) or some application that you do a "File | Print" with. Print to the Adobe PDF virtual printer that is installed with Acrobat (any version). You'll have a searchable PDF. If a PDF has page content that is the image of text (the output of a scanner), use Acrobat's OCR feature to obtain searchable content.
    Be well...

  • Creating Searchable PDFs

    Hello,
    I am fairly new to Acrobat development and have a couple of questions that I would greatly appreciate help and direction on:
    1) What is the best way to programmatically create a searchable PDF from a .doc, .xls, .ppt and various image formats e.g. tiff, jpeg, bmp, etc...
    2) What is the best way to programmatically invoke the OCR capabilities of Acrobat SDK.
    Any help and or direction here would be greatly appreciated.
    Thanks.

    Hello Malky,
    Thanks for your response. Here are some answers to your questions:
    1) To run on the server as a background process.
    2) Yes, it will be automatic.
    Here is what I want to do:
    1) User uploads a file (.doc, .xls, .ppt or any image file type)
    2) The system converts to a PDF which needs to be searchable.
    3) I want to include OCR support during the conversion if images exist.
    Does that make sense? My questions are:
    1) What's the best way to convert any of these types into PDF? Is there any open-source solution to do this?
    2) How can I make use of the OCR capabilities? How can I do that through the SDK?
    Thank you very much!

  • Using Acrobat to create a PDF from .swf files

    I'm trying to beef up a standard powerpoint presentation into something i could have more control over, and was considering Acrobat as it would (i hoped) allow me to import .swf files from Flash. In reading up on just how flexible acrobat is though, it doesn't really seem to have all of the functionality that i want - if anyone could offer feedback on the following, that would be great in terms of clearing up the queries that i have beforehand!
    essentially, i want to create a PDF file which will open fullscreen on any machine with no toolbars, windows, additional content in the window except my PDF content. So i would have a central shape/box, which would be white, on which i have text.
    I want this to open centrescreen, fullscreen, with no acrobat reader branding / navigation visible at all. Similar to the way a powerpoint document would open, however - i don't want Adobe Reader to scale my PDF at all.
    I've set my canvas in flash to 1024x768, and i would want Reader to display my file at exactly 1024x768 pixels, and not scale the document up to suit different monitors. So in a way, utilising something like you could with a Flash projector file where you could set the fullscreen's "allowscale" function to "false" (keeping the document at exactly 100%) and the actual fullscreen call itself to "true" (display the document in fullscreen mode, taking over your entire monitor).
    Also, i would want to specify a 'background', which is simple in flash as i can just create an oversized movie clip with whatever i want to appear in the background, and this will cover any excess area in monitors that exceed my canvas size. (So i usually create a movie clip of an image or gradient for the background that measures 2600 x 1600 pixels). Will acrobat be able to render this correctly, or is it just going to display the actual canvas size and exclude my oversized 'background' movie clip? If i use my 1024 x 768 canvas, will it clip the big background movie clip to fit?
    Thanks folks... bit long winded but i just wanted to make sure i've covered the bases!

  • Installing .csf files and using them to create a pdf from Indesign.

    I have a .csf file that was sent to me that I need to:
    1.     Install (not sure how to do that in CS4),
    2.     And be able to use in Indesign when I create a pdf from an Indesign file.
    I'm not sure how to do this.

    Like it or not, InDesign is a color managed application, and your files are all being managed.
    You'll always have less grief if the working spaces match the output, but that isn't always possible. If you want to convert to the profiles specified in the CSF file, choose one of the "Convert to Profile" settings and select the correct profile as the destination. You'll have two options, preserve numbers, or don't preserve numbers.
    Preserving numbers will cause colors to shift, but will preserve 100% K blacks. Not preserving numbers will change the numbers to give you the closest match possible in appearance, but will royally screw up your 100% K type if going to press.

  • Using Preview to create a PDF with scanner

    Hello,
    I just installed my new printer and want to create PDFs. I know my HP software can do it for me, but can Preview create PDF files as well? It would be similar to Adobe Prof. for Windows where I can create it through that software. Thanks!

    Preview can open PDF, and can combine multiple PDFs into one file, or remove individual pages from a PDF.
    On the Mac OS, PDFs can be created from any program through the print dialog box.
    Select Print from the File menu. Instead of clicking Print in the dialog box that appears, select PDF from the pop-up list in the lower left corner of the dialog box.
    Instead of printing a hardcopy, it'll create a PDF file.
    Preview can then open the PDF. To merge PDFs, make sure you have the sidebar open in Preview. Dragging individual PDFs to or from the sidebar will add or remove pages from the original document.

  • Using XML files created with PDF form in Excel

    I have returned survey XML files created by a survey form using Lifecycle Designer. Have been unsuccessful in the importing of multiple XML files into Excel as a 2nd file just overlays the 1st file's data.
    I have been reading a number of posts that most likely tell me that sending the PDF survey form out via e-mail and getting the returned XML file back via e-mail was not the best way to do it. Unfortunately it is what I have to deal with now, so I have two questions:
    1. Is there a known method for importing all of these XML files into any Office program more easily than what I have been dealing with, and
    2. What method would be best used if I have surveys to send out but have no web server or any other tool other than my local software on my PC for collecting and compiling the returned data?

    Are you clicking the Send Email button while previewing the pdf form?
         -> If yes, have you specified preview data on Form Properties panel?
                    -> If yes, does your form contain Table or repeatable subforms?
    If all the above points are true, you will get multiple data in your xml.
    Nith
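
    On the first question, one local-only option is to skip Excel's XML import and flatten all the returned files into a single CSV, one row per response, which Excel opens directly. The sketch below is a guess at the general shape rather than a known method: it assumes each returned file is a small XML document whose leaf elements are the form fields, and the function names are made up for illustration. Repeated field names within one file (tables, repeatable subforms) would overwrite each other, and namespaced tags would appear with their `{namespace}` prefix.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def read_response(xml_path):
    """Flatten one returned XML file into {field_name: text}.

    Assumption: every element without children is a form field.
    """
    root = ET.parse(xml_path).getroot()
    row = {}
    for elem in root.iter():
        if len(elem) == 0:  # leaf element, treated as a field
            row[elem.tag] = (elem.text or "").strip()
    return row


def merge_responses(xml_dir, csv_path):
    """Write one CSV row per *.xml file in xml_dir; return the row count."""
    rows = [read_response(p) for p in sorted(Path(xml_dir).glob("*.xml"))]
    # Collect column names in first-seen order across all responses.
    fields = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

    Because each file becomes its own row, a second file no longer overlays the first one's data.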

  • Problem using Automator to create "Watermark PDF Documents"

    After having updated our macs from Mavericks to Yosemite, the "Watermark PDF Documents" using Automator is no longer working.

    Hi Miykael,
    Do you have flash player installed on your system for that browser? Did you try accessing that pdf file from a different browser?
    I would also recommend you refer to this KB document: https://helpx.adobe.com/acrobat/using/display-pdf-browser-acrobat-xi.html
    Regards,
    Rahul

  • How to use cfdocument to create a PDF file and save the file on the server?

    Hi,
    I want to use cfdocument to create a PDF file and save it on the server for other people to download. Can you give me an idea how to do this? Thanks.
    <cfdocument format = "PDF" pagetype="A4"
    orientation="portrait">
    </cfdocument>
    Mark

    Hi
    <cfdocument filename="" format = "PDF" pagetype="A4"
    orientation="portrait">
    </cfdocument>
    Give the physical path in the filename attribute. You need write
    permission for that folder to create the PDF file.

  • How can I correct "hidden" text in a searchable PDF file?

    This seems like a simple question. However, the answers are invariably complex, do not yield the desired result, and often answer a different question entirely. I say all that just to warn people up front that the "problem" is easier than how many people and PDF application developers, including Adobe, typically understand it while the proposed "solutions" are invariably a total...well, botch is a reasonable word if a bit understated.
    Here is the actual problem:
    I have "searchable" PDF files created by scanning documents and running them through an OCR process. I create "searchable" PDF files in order to archive, index, and eventually enable searching for the documents scanned. A "searchable" PDF satisfies those criteria better than any other commonly used, "portable" archive format -- though I would be happy if someone could point out an obvious alternative I may have overlooked.
    I do not need perfect OCR results. If I need a document to edit or perhaps feed into a spreadsheet or database, I expect to be able to reprocess the page images in a given "searchable" PDF file to OCR and convert the contents to Word, RTF, Excel, or another file format as necessary with more care for the results than for the archived document itself. Therefore, the "searchable" PDF document is the scanned page images which compose it while the OCR generated "searchable" text is secondary, but still important. Therefore, each file must contain scanned page images of sufficient detail to be efficiently converted by OCR if possible and legible enough for whoever views the images to be able to work out what an OCR process may fail to understand. Once scanned, those pages are the "document" and therefore "immutable."
    However, OCR is imperfect. For a searchable document archive, it does not have to be, but some errors are significant in that they may prevent the document from being found by a search. Therefore, there must be a way to view and, if necessary, edit the "hidden" text in a "searchable" PDF without altering the visual display of a document or how it is printed. No strike-throughs. No visible "corrections." None of the stuff PDF editors want to insert into a PDF file when editing it. I do not want to edit the document without exporting it to a format appropriate for an editable document. I just want adequately "correct" hidden text in a "searchable" PDF file.
    I apologize for the length and redundancy in my description of the problem. However, past attempts to explain my problem and objectives as well as what I have seen in reply to similar queries across the Internet indicate that most people trying to answer this question come at it from the same point of view shared by most, if not all, PDF tool or application vendors. They seem to think that any desire to edit a PDF file is a desire to have a PDF word processor of some sort. Or, they assume that the OCR process employed may need tweaking of the means by which people apply it and then a process like "find suspects" is adequate to deal with any errors. But no, those are not what I am trying to accomplish and answers which address those topics do not answer this question.
    In short, which tool or application from any vendor will reveal the "searchable" hidden text in a PDF produced by any OCR or other process and then enable corrections to the hidden text without changing any document display parameters at all? Note, hidden text typically includes bounding box information denoting the portion of the image from which the text was recognized. That information must not be lost or changed when editing the "searchable" text.
    So, any tools or applications capable of doing this? If Adobe Acrobat XI Pro can (use of a trial copy demonstrated that the hidden text content can be reviewed, but editing did not work by any straight-forward means I could work out while trying out the application), fine. However, $500.00 list or even a $200.00 possible upgrade from a copy of Adobe Acrobat X Standard which came with my scanner is a lot of money for personal use when review and edit of the OCR generated hidden text in a "searchable" PDF file is the only function I require. Therefore, other suggested tools or applications which do what I need for less would be greatly appreciated.

    My "claim"? Actually I've made no "claim" such as you've mentioned.
    Simply stated your OP has foundational premises that presume as factual what is not.
    Here, we're in Adobe's hosted user forum for Acrobat.
    Any other application use is not material. 
    Acrobat XI provides 3 OCR methods.
    Searchable Image, Searchable Image (Exact) & ClearScan.
    Only the first two provide the "hidden" text output.
    (Glyphs have no stroke, no fill)
    From back to the Acrobat 3 product family the design functionality of Searchable Image and Searchable Image (Exact) has been to facilitate the use of Find / Search.
    The "hidden" text can be touched up. Acrobat Pro provides the facility to view the hidden text.
    So you can see the OCR output that correlates to the bit-map images of the characters that are present.
    With Acrobat XI Pro use Tools - Protection -Remove Hidden Information
    In the Remove Hidden Information pane select "Hidden text" then "Show preview".
    The default for the preview is "Show Only Hidden Text".
    Back in the PDF --
    You'd select some of the hidden text and retype what you suspect is the correct string of characters.
    Save and return to the preview of the hidden text.
    If you got it right, good. Continue.
    If not, darn - try again.
    Plug 'n chug -- somewhere over the rainbow it'll be done eh.
    Full disclosure -- this is something I've done (enquiring minds don't you know).
    I've found it to be a rather Sisyphean undertaking.
    So, "doable" but not practicable.
    This is to be expected because such touchups are not the concern / focus of the output from Searchable Image or Searchable Image (Exact) - (the names tell it all).
    To have touchup "editability" of an OCR output using Acrobat make use of ClearScan.
    ClearScan replaces recognized character bit-maps with a character from an Acrobat internal font.
    The character strings can be selected to change to a generic, system available font.
    Something that is good to know when embarking on the "tweak the PDF" journey is that PDF (the file format / technology as defined by its ISO Standard, ISO 32000-1) does not tolerate "editing". PDF is decidedly not a word processor file format and "editing" can quickly render a PDF unusable.
    Minor touchups can be made and your best "tool" for this is still Acrobat Pro. (Save As often and periodically "bank" the PDF via some file rename scheme.) 
    Be well...

  • Is there a program to extract email addresses from a searchable pdf?

    Is there a program that will extract email addresses from a searchable pdf?
    I scanned a 75 page excel spreadsheet and used OCR to create a searchable pdf. I've verified that the OCR did work, the email address are searchable, but I need a way to extract them from the pdf so that I can add them to an email list database. There is other data in the spreadsheets that is not needed and it is making it impossible to just copy and paste. Does anybody know if there is a program available that works on the mac platform for this. Any help is greatly appreciated. Thanks!
    Nate

    Nate,
    You might want to repost this in the Unix forum, or one of the scripts forums here:
    AppleScripts: http://discussions.apple.com/forum.jspa?forumID=724
    Unix: http://discussions.apple.com/forum.jspa?forumID=735
    Automator: http://discussions.apple.com/forum.jspa?forumID=1261
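
    If you go the scripting route, the core of it is a pattern match over the text layer. A minimal sketch, assuming you have already exported the PDF's text to a plain string (for example by saving the PDF as text); the `extract_emails` name and the pattern are illustrative only. The pattern is deliberately simple and will miss unusual addresses, and any address the OCR misread will come out misread.

```python
import re

# Simple pattern: local part, "@", domain with at least one dot.
# Not a full RFC 5322 validator, just enough to pull addresses out of noise.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return unique addresses found in text, lowercased, in first-seen order."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        addr = match.lower()
        if addr not in seen:
            seen.add(addr)
            result.append(addr)
    return result
```

    The other spreadsheet columns are simply ignored, so there is no need to clean the text up first; the resulting list can be pasted or written out one address per line for import into the mailing list.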

  • Searchable PDFs are now more than 10 times as large

    When I was using an older version (70.0.128.0) on Windows XP to create searchable PDF files, the files were less than 10% of the size of the searchable PDF files created on Windows 7 64bit with the new version (130.0.44.62).
    Sample: 9 pages, black and white, pdf searchable
      359 KB on Windows XP, 70.0.128.0
    3847 KB on Windows 7 64bit, 130.0.44.62
    Looking at the results, I do not notice any difference.
    For all other files I notice a similar factor.
    Anyone observing the same 'feature'?
    What have I done wrong?
    Best regards,
    Wolfgang

    You may have changed the settings file selection, maybe for the better. You may not have been embedding all of the fonts for instance and now you are. It may also be that you changed from printing to the Adobe PDF printer and started using PDF Maker that by default adds a lot of bloat to the PDF. That bloat can be useful, but not if you are worried about size.

  • Using Image Thumbnails in InDesign/PDF

    Hi,
    I'm using InDesign to create a PDF training document but I'm having difficulty in getting some larger images to display at a good resolution. What I would like to do as a result is to place my images as hyperlinked thumbnails which, when clicked, would display the fullsize image either within the same document or in a separate window. Does anyone know if that is possible with Acrobat(PDF)/InDesign?
    Thanks in advance for any help. I'm currently using CS4 Design Standard on Windows XP and am awaiting a copy of CS5 Design Standard.
    Regards,
       Patrick

    Surprising things are possible with Javascript in Acrobat, so what you ask for might be possible within Acrobat, but you'd best ask over in the Acrobat-specific forum if you want to pursue that. I don't think that what you want can be handled in InDesign alone. However, the InDesign forum might be able to help you get the original images to display at the resolution you'd like, thereby obviating the need for HTML+JS-style image popups.

  • Creating a PDF document with visible bookmarks

    Hi.
    I'm using VB to create a PDF document with bookmarks. I have no problem with this part.
    What I want to do, though, is to program the document so that the bookmarks pane is visible when I open the document.
    Any ideas?
    All Best,
    Ethan

    I found the answer in the Interapplication Communication API Reference:
    Dim SetPageMode as Boolean = PDDoc.SetPageMode(nPageMode),
    where nPageMode has the possible values:
    0: leave the view mode as is
    1: display without bookmarks or thumbnails
    2: display using thumbnails
    3: display using bookmarks

  • Creating a PDF-file with CONVERT_OTF. How to set the properties?

    Gents, Ladies,
    I have an ABAP that uses CONVERT_OTF and creates a pdf-file. This works fine but now the security properties of the pdf-file need to be set to not-modifiable (See File, Document Security in Acrobat's Reader).
    Can any of you help me to set any of the properties of the pdf-file?
    Best regards,
    Tim van Steenbergen.

    Hi tim,
    sorry, it's not possible (see Steff's posting)
    http://www.abapforum.com/forum/viewtopic.php?t=318&highlight=pdf
    grx
    Andreas
