Creating Searchable PDFs

Hello,
I am fairly new to Acrobat development and have a couple of questions that I would greatly appreciate help and direction on:
1) What is the best way to programmatically create a searchable PDF from .doc, .xls, and .ppt files and from various image formats (e.g. TIFF, JPEG, BMP)?
2) What is the best way to programmatically invoke the OCR capabilities of the Acrobat SDK?
Any help or direction here would be greatly appreciated.
Thanks.

Hello Malky,
Thanks for your response. Here are some answers to your questions:
1) To run on the server as a background process.
2) Yes, it will be automatic.
Here is what I want to do:
1) User uploads a file (.doc, .xls, .ppt or any image file type)
2) The system converts to a PDF which needs to be searchable.
3) I want to include OCR support during the conversion if images exist.
Does that make sense? My questions are:
1) What's the best way to convert any of these file types into PDF? Is there any open-source solution to do this?
2) How can I make use of the OCR capabilities? Can I do this through the SDK?
Thank you very much!
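
The thread does not record an answer to the open-source question, so the following is only a minimal sketch of one possible route, not a recommendation from the original posters. It assumes LibreOffice (`soffice`) and OCRmyPDF (`ocrmypdf`) are installed on the server. The function name `plan_conversion`, the extension lists, and the 300 DPI value are hypothetical choices for illustration. The function only builds the command lines; a background worker would run them, for example with `subprocess.run`.

```python
from pathlib import Path
from typing import List

# Assumed extension lists; adjust to whatever the upload form accepts.
OFFICE_EXTS = {".doc", ".xls", ".ppt"}
IMAGE_EXTS = {".tif", ".tiff", ".jpg", ".jpeg", ".bmp"}


def plan_conversion(upload: str, out_dir: str) -> List[List[str]]:
    """Return the external commands (not executed here) for one uploaded file."""
    src = Path(upload)
    out = Path(out_dir)
    ext = src.suffix.lower()
    searchable = out / f"{src.stem}.searchable.pdf"

    if ext in OFFICE_EXTS:
        # Step 1: LibreOffice writes <stem>.pdf into out_dir.
        plain = out / f"{src.stem}.pdf"
        return [
            ["soffice", "--headless", "--convert-to", "pdf",
             "--outdir", str(out), str(src)],
            # Step 2: OCR only the pages that have no text yet
            # (for example, pages that are embedded scans).
            ["ocrmypdf", "--skip-text", str(plain), str(searchable)],
        ]

    if ext in IMAGE_EXTS:
        # OCRmyPDF can take a single image as input; the DPI here is an assumption
        # for images that carry no resolution metadata.
        return [["ocrmypdf", "--image-dpi", "300", str(src), str(searchable)]]

    raise ValueError(f"unsupported file type: {ext or '(none)'}")
```

Whether every image format converts cleanly depends on the installed tool versions, so each format should be tried before relying on this in production.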

Similar Messages

  • How do I create searchable PDFs in Acrobat 9?

    How do I create searchable PDFs in Acrobat Pro 9?

    Master your content in a text editor (Notepad), a word processor (Word), a page layout application (InDesign), or any other application that offers "File | Print". Print to the Adobe PDF virtual printer that is installed with Acrobat (any version). You'll have a searchable PDF. If a PDF has page content that is only an image of text (the output of a scanner), use Acrobat's OCR feature to obtain searchable content.
    Be well...

  • Using OCR to Create Searchable PDFs

    Hi, I'm converting about 30 bankers' boxes of paper into PDF. What I want to do is save each file as a PDF that is keyword-searchable. I DON'T want to convert everything to MS word.
    I'm new to this and started the job yesterday with ReadIris 11, but I'm not happy with the results. I can't figure out how to get ReadIris to keep the original formatting and to save as a PDF.
    I don't want the formatting or the look of the docs to change - I just want to make new PDFs that are searchable by keyword. I am addicted to using the "search" feature in the Finder to locate files by keyword, and if I could search ALL my PAPER files as well, I would be in heaven.
    Any suggestions on how to do this? Thanks.

    Thanks for the excellent tips. I've been making some headway with this project, and you are right: I think it is best to scan the docs first, save them as PDF images, and then convert them later. This will save me some time, as I won't have to convert everything all at once, and I'll have both the original and the OCR versions. I'll definitely look into PDFpen.

  • How can I correct "hidden" text in a searchable PDF file?

    This seems like a simple question. However, the answers are invariably complex, do not yield the desired result, and often answer a different question entirely. I say all that just to warn people up front that the "problem" is easier than how many people and PDF application developers, including Adobe, typically understand it while the proposed "solutions" are invariably a total...well, botch is a reasonable word if a bit understated.
    Here is the actual problem:
    I have "searchable" PDF files created by scanning documents and running them through an OCR process. I create "searchable" PDF files in order to archive, index, and eventually enable searching for the documents scanned. A "searchable" PDF satisfies those criteria better than any other commonly used, "portable" archive format -- though I would be happy if someone could point out an obvious alternative I may have overlooked. I do not need perfect OCR results. If I need a document to edit or perhaps feed into a spreadsheet or database, I expect to be able to reprocess the page images in a given "searchable" PDF file to OCR and convert the contents to Word, RTF, Excel, or another file format as necessary with more care for the results than for the archived document itself. Therefore, the "searchable" PDF document is the scanned page images which compose it while the OCR generated "searchable" text is secondary, but still important. Therefore, each file must contain scanned page images of sufficient detail to be efficiently converted by OCR if possible and legible enough for whoever views the images to be able to work out what an OCR process may fail to understand. Once scanned, those pages are the "document" and therefore "immutable." However, OCR is imperfect. For a searchable document archive, it does not have to be, but some errors are significant in that they may prevent the document from being found by a search. Therefore, there must be a way to view and, if necessary, edit the "hidden" text in a "searchable" PDF without altering the visual display of a document or how it is printed. No strike-throughs. No visible "corrections." None of the stuff PDF editors want to insert into a PDF file when editing it. I do not want to edit the document without exporting it to a format appropriate for an editable document. I just want adequately "correct" hidden text in a "searchable" PDF file.
    I apologize for the length and redundancy in my description of the problem. However, past attempts to explain my problem and objectives as well as what I have seen in reply to similar queries across the Internet indicate that most people trying to answer this question come at it from the same point of view shared by most, if not all, PDF tool or application vendors. They seem to think that any desire to edit a PDF file is a desire to have a PDF word processor of some sort. Or, they assume that the OCR process employed may need tweaking of the means by which people apply it and then a process like "find suspects" is adequate to deal with any errors. But no, those are not what I am trying to accomplish and answers which address those topics do not answer this question.
    In short, which tool or application from any vendor will reveal the "searchable" hidden text in a PDF produced by any OCR or other process and then enable corrections to the hidden text without changing any document display parameters at all? Note, hidden text typically includes bounding box information denoting the portion of the image from which the text was recognized. That information must not be lost or changed when editing the "searchable" text.
    So, are any tools or applications capable of doing this? If Adobe Acrobat XI Pro can, fine (use of a trial copy demonstrated that the hidden text content can be reviewed, but editing did not work by any straightforward means I could work out while trying out the application). However, $500.00 list, or even a possible $200.00 upgrade from the copy of Adobe Acrobat X Standard which came with my scanner, is a lot of money for personal use when reviewing and editing the OCR-generated hidden text in a "searchable" PDF file is the only function I require. Therefore, other suggested tools or applications which do what I need for less would be greatly appreciated.

    My "claim"? Actually I've made no "claim" such as you've mentioned.
    Simply stated your OP has foundational premises that presume as factual what is not.
    Here, we're in Adobe's hosted user forum for Acrobat.
    Any other application use is not material. 
    Acrobat XI provides 3 OCR methods.
    Searchable Image, Searchable Image (Exact) & ClearScan.
    Only the first two provide the "hidden" text output.
    (Glyphs have no stroke, no fill)
    From back to the Acrobat 3 product family the design functionality of Searchable Image and Searchable Image (Exact) has been to facilitate the use of Find / Search.
    The "hidden" text can be touched up. Acrobat Pro provides the facility to view the hidden text.
    So you can see the OCR output that correlates to the bitmap images of the characters that are present.
    With Acrobat XI Pro use Tools - Protection - Remove Hidden Information.
    In the Remove Hidden Information pane select "Hidden text" then "Show preview".
    The default for the preview is "Show Only Hidden Text".
    Back in the PDF --
    You'd select some of the hidden text and retype what you suspect is the correct string of characters.
    Save and return to the preview of the hidden text.
    If you got it right, good. Continue.
    If not, darn - try again.
    Plug 'n chug -- somewhere over the rainbow it'll be done eh.
    Full disclosure -- this is something I've done (enquiring minds don't you know).
    I've found it to be a rather Sisyphean undertaking.
    So, "doable" but not practicable.
    This is to be expected because such touchups are not the concern / focus of the output from Searchable Image or Searchable Image (Exact) - (the names tell it all).
    To have touchup "editability" of an OCR output using Acrobat, make use of ClearScan.
    ClearScan replaces recognized character bit-maps with a character from an Acrobat internal font.
    The character strings can be selected to change to a generic, system available font.
    Something that is good to know when embarking on the "tweak the PDF" journey is that PDF (the file format / technology as defined by its ISO Standard, ISO 32000-1) does not tolerate "editing". PDF is decidedly not a word processor file format and "editing" can quickly render a PDF unusable.
    Minor touchups can be made and your best "tool" for this is still Acrobat Pro. (Save As often and periodically "bank" the PDF via some file rename scheme.) 
    Be well...
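
To make the "no stroke, no fill" remark concrete: in a PDF content stream, text rendering mode 3 (the `Tr` operator) draws glyphs with neither fill nor stroke, which is how a hidden text layer can sit over a scanned page image. The sketch below only builds such an operator sequence as a string. The function name `invisible_text_ops` and the font resource name `F1` are hypothetical, and real OCR output typically uses compressed streams and encoded fonts, so this illustrates the mechanism rather than providing an editing tool.

```python
def invisible_text_ops(text: str, x: float, y: float,
                       font: str = "F1", size: float = 10) -> str:
    """Build content-stream operators that place `text` invisibly at (x, y)."""
    # Backslash and parentheses must be escaped inside a PDF literal string.
    escaped = (text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    return "\n".join([
        "BT",                          # begin text object
        f"/{font} {size:g} Tf",        # select font resource and size
        "3 Tr",                        # rendering mode 3: neither fill nor stroke
        f"1 0 0 1 {x:g} {y:g} Tm",     # text matrix: position only, no scaling
        f"({escaped}) Tj",             # show the (invisible) string
        "ET",                          # end text object
    ])
```

Correcting a misrecognized word without moving it would, in this simplified picture, mean changing only the string shown by `Tj` while leaving the `Tm` position untouched.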

  • Searchable PDFs are now more than 10 times as large

    When I was using an older version (70.0.128.0) on Windows XP to create searchable PDF files, the files were less than 10% of the size of the searchable PDF files created on Windows 7 64-bit using the new version (130.0.44.62).
    Sample: 9 pages, black and white, pdf searchable
      359 KB on Windows XP, 70.0.128.0
    3847 KB on Windows 7 64bit, 130.0.44.62
    Looking at the results, I do not notice any difference.
    For all other files I notice a similar factor.
    Anyone observing the same 'feature'?
    What have I done wrong?
    Best regards,
    Wolfgang

    You may have changed the settings file selection, maybe for the better. You may not have been embedding all of the fonts for instance and now you are. It may also be that you changed from printing to the Adobe PDF printer and started using PDF Maker that by default adds a lot of bloat to the PDF. That bloat can be useful, but not if you are worried about size.

  • Adobe PDF Printer - no longer creating searchable files

    Hi,
    I am using Acrobat Professional 9.4.3 on Windows 7 64-bit.
    I just noticed Adobe PDF Printer is no longer creating searchable files and I can't figure out why or how to set it up to do so.
    Looking back at PDFs created in February, the PDF Producer is Acrobat Distiller 9.4.0.  These are searchable.
    Current files - which are not searchable - show the PDF Producer as Acrobat Distiller 9.4.2.  Should it be 9.4.3 - to match the current version of Professional?
    Any thoughts why this would be happening?
    Thank you,
    Kevin

    Hi,
    I think I found the answer.
    I updated to Acrobat Professional 9.4.4.  But that isn't the answer.
    I tracked it down to the installation of Firefox 4.
    Found these discussions:
    http://support.mozilla.com/en-US/questions/788993
    http://forums.mozillazine.org/viewtopic.php?f=23&t=2122001&sid=76d02be681f087f8bb0ba6ded5055198
    Specifically, in the second discussion there is a comparison of browsers with hardware acceleration turned on and off and their outputs (text vs graphic).
    When I turn hardware acceleration off, the PDF from the Adobe PDF Printer is searchable.
    I am using Firefox 4.0.1.
    Kevin

  • Acrobat creating searchable image PDF

    I have a PDF made from images of a set of scanned pages; I also have the text of those pages in a Word document.  I could make a searchable PDF of the images by using Acrobat's OCR and correction tools, but this would be a long and imprecise process (since the text is in Old English and a number of the characters are unusual and not recognized correctly by the Acrobat OCR), and seems unnecessary since I have a good copy of the text in another file.  Is there a way that I can take the image PDF and import the Word document and have the two merged to create a searchable image PDF?

    I had great hopes for the hidden text layer Acrobat creates when OCR-ing text, but I never figured out how to get sufficient control over that text.  In particular, I couldn't get corrections to work smoothly -- always a challenge when OCR-ing mixed languages, not to mention hand-written Chinese.  I've heard Abbyy FineReader might allow easier access to the hidden OCR text, but I decided to try making my own text layer when digitizing back issues of a scholarly journal, Early China.  Testers found Reader's menu for swapping layers awkward, so I put a button for this on each page -- actually, two buttons superimposed, each visible in the appropriate layer.  This example (3.5 MB) includes quite a few archaic Chinese characters and other snippets from the scans set in-line with text in the text layer, which is practical only because InDesign CS4's export to PDF is smart enough to re-use a single copy of an image that occurs repeatedly.
    In fact, I used Acrobat's OCR to recover as much of the text as it could.  The copy still required extensive fixing up, nor was re-setting it in InDesign trivial.  But the result does meet the goal of giving the reader both searchable text and a convenient way to check the original publication.

  • Searchable PDF Unreadable...

    I have created two versions of a PDF. One is a searchable PDF, created using IRIS, and the other is a standard PDF. Why is the searchable PDF unreadable in Adobe Reader for iOS? That is, why are the pages blank? I should mention the searchable PDF is readable in iOS Mail, Google Chrome, AiO Remote (HP app), and iBooks. Just not Adobe Reader for iOS. Thanks...

    Hi,
    We can examine your PDF document to determine if Adobe Reader for iOS has a defect.
    Would you share your PDF document with us?  I have sent you the instructions on how to share your PDF document with us in a separate forum email.
    Thank you!

  • How can I search for document content, specific words, in my scanned (and searchable) pdf files?

    When using Spotlight to search my files for text within a scanned document, I may or may not find documents that were scanned by a scan program (ScanSnap) to a searchable PDF file on a FireWire-attached drive.
    Is there a specific method that yields usable results 100% of the time?

    Spotlight should index PDF files (as well as Word and some other kinds of files) but it seems that PDFs are not created equal. I've seen and heard complaints that some scanner/OCR programs are saving files that aren't searchable and yours is one of them. Try opening one of the files in Preview and resaving it.

  • How to make a searchable PDF from an AutoCAD DWG file

    We have PDF files created from AutoCAD, but they are not searchable using Adobe Reader X.  I checked the source DWG file to find what font was used, as suggested in an old post, and found that the DWG uses Arial font.  I am not sure what version of AutoCAD was used to create the file, but I can open it with AutoCAD LT 2011.  Can anyone guide me to a way to make searchable PDF files from these DWG files?

    Ok, this is as I feared; something CAD programs are particularly likely to do is draw text with a pen, rather than use text. There might be options in AutoCAD to control this.

  • What software do I need to make OCR/searchable PDF's?

    I recently bought an Epson scanner so I can digitize a mountain of documents I've accumulated over the years. I played with it a little bit before a failing hard drive forced me to get a new laptop (MacBook Pro), at the same time upgrading to Mavericks. So I'm kinda sorta starting over.
    It looks like I can easily create PDF's with my Epson scanner alone. What's confusing me is PDF's with searchable text; I think the technical term is OCR.
    Is this something I should be able to do with a scanner alone, or do I need to make Acrobat (or another program) part of the workflow?
    I have my scanner plugged into my laptop right now, and I can create PDF's with it. But if I open Acrobat and choose "Create PDF," then choose either "EPSON Scan" or "EPSON Scan Settings," I get an error message: "There was an error opening this document. This file cannot be found."
    I've also read about a software program called ABBYY FineReader. Is this something I can use instead of Acrobat to make OCR PDF's?
    Also, how do I know if a PDF I've created is OCR searchable? Can I simply search for a word or phrase on the PDF with a program like Dreamweaver or Apple's Spotlight?
    Thanks.

    OK, now I see it. I just opened a file I recently scanned, chose "In This File," and I was able to search the text.
    So if I can't get Acrobat to communicate with my scanner, then I'll just scan all my files as "plain PDF's," later batch converting them to OCR PDF's. In fact, I wonder if that might actually be a little faster than my original plan - teaming up Acrobat and my scanner to create OCR PDF's from the beginning. It's probably a lot easier and faster just working with a scanner.
    Thanks.
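
For checking many files at once rather than searching each one by hand, one rough test of whether a PDF is "OCR searchable" is to extract its text and see whether anything meaningful comes out. The sketch below assumes Poppler's `pdftotext` command-line tool is installed. The function names and the 20-character threshold are hypothetical choices for illustration, not something from the thread.

```python
import subprocess
from typing import Callable


def run_pdftotext(pdf_path: str) -> str:
    """Extract text with the pdftotext CLI; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_text_layer(pdf_path: str,
                   extract: Callable[[str], str] = run_pdftotext,
                   min_chars: int = 20) -> bool:
    """Return True if the PDF yields at least `min_chars` non-space characters."""
    text = extract(pdf_path)
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars
```

A batch job could call `has_text_layer` on each "plain PDF" and send only the files that return `False` to OCR. The check says nothing about OCR quality, only that some text is present.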

  • Is there a program to extract email addresses from a searchable pdf?

    Is there a program that will extract email addresses from a searchable pdf?
    I scanned a 75-page Excel spreadsheet and used OCR to create a searchable PDF. I've verified that the OCR did work and the email addresses are searchable, but I need a way to extract them from the PDF so that I can add them to an email list database. There is other data in the spreadsheets that is not needed, and it is making it impossible to just copy and paste. Does anybody know if there is a program available that works on the Mac platform for this? Any help is greatly appreciated. Thanks!
    Nate

    Nate,
    You might want to repost this in the Unix forum, or one of the scripts forums here:
    AppleScripts: http://discussions.apple.com/forum.jspa?forumID=724
    Unix: http://discussions.apple.com/forum.jspa?forumID=735
    Automator: http://discussions.apple.com/forum.jspa?forumID=1261
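
As a rough illustration of the scripting route suggested above: once the PDF's text has been exported to plain text (for example with Acrobat's text export or a tool such as `pdftotext`, both assumptions here), a short script can pull out anything shaped like an email address. The pattern below is deliberately simple and is not a full address validator, and the name `extract_emails` is hypothetical. OCR errors (such as a misread "@" or ".") will cause misses, so the results need a manual check.

```python
import re
from typing import List

# Simple pattern: local part, "@", then a domain with at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text: str) -> List[str]:
    """Return unique addresses in order of first appearance, lower-cased."""
    seen = set()
    found = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        if address not in seen:
            seen.add(address)
            found.append(address)
    return found
```

The returned list can then be written out one address per line for import into the mailing list database.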

  • HP LaserJet Pro MFP M521 and searchable PDF

    Dear All, Does anyone know if the HP LaserJet Pro MFP M521 can scan to a searchable PDF (OCR)? I can't find that option anywhere (I read somewhere that this was a feature). Would this be an optional component/license? Thanks in advance. Regards, Pardeep Saini

    OCR is a proprietary function offered by some but not all HP products. The printer would need to be licensed in order to create OCR output from scratch if it cannot already do so. The User Guide has a small section on the OCR capabilities of this printer on page 107: http://content.etilize.com/User-Manual/1024547645.pdf A workaround would be to purchase Adobe OCR products and then republish the original PDFs with OCR.

  • Is there a way to create a PDF form that ANYONE can fill out and SAVE with their content?

    Is there a way to create a PDF form that ANYONE can fill out and SAVE with their content? By anyone, I mean someone who can download and use the free Adobe Reader, on either a Mac or PC. I have Acrobat Pro, and would like to be able to create forms that can not only be filled out and printed, but saved and emailed, which is not an option with the forms I have created to date. They can be filled out, but not saved, with Adobe Reader.
    TIA,
    Nancy

    To do what Dave indicated you need to do, it depends on what version of Acrobat you have:
    Acrobat 8: Advanced > Enable Usage Rights in Adobe Reader
    Acrobat 9: Advanced > Extend Features in Adobe Reader
    Acrobat 10: File > Save As > Reader Extended PDF > Enable Additional Features
    Acrobat 11: File > Save as Other > Reader Extended PDF > Enable More Tools (includes form fill-in & save)
    I wonder what it will be next time?

  • I am working in Adobe Acrobat 9 Pro and just created a PDF form from an MS Word document. I need to find out how to have a date field in my form which will update automatically. Can someone out there help me?

    I am working in Adobe Acrobat 9 Pro and just created a pdf form from a MS Word document. I need to find out how to have a date field in my form which will update automatically.

    Update automatically under which circumstances, exactly?
