Advice about OCR text in ID

This project is a total labor of love for my 90-year-old mother-in-law, who wrote a weekly column for a small-town newspaper for years back in the 1980s. We are going to do a booklet compilation of her articles for the enjoyment of family and friends (not to sell). We are looking into different printing options, like the local Kinkos (digital printing and binding) and another source for low-volume hardback book printing. No details yet.
I have basic ID skills but NO experience with Acrobat, which I own as part of my CS2 suite. It was suggested in this forum that Acrobat might be the way to go (project scope has changed since then), and I have spent several days trying to learn Acrobat. It's not coming easy for me, so I've decided that ID is the program I am most comfortable with for this project. I've been experimenting with methods and doing a lot of research in the forums with threads that seemed applicable to my project.
I would greatly appreciate anyone who has the time to read through my proposed process and point out any errors that could turn out to be a problem with the end printing.
I haven't nailed down the overall design concept and presentation of the book because family members don't agree on things yet. (Ugh, that's a whole different challenge.) But I don't envision a master page layout deal because I really want to imitate the different layouts of the published articles, so I envision each page having its own personality.
So I just want to be productive in doing the grunt work of obtaining the articles and setting up individual pages in ID that might or might not be used. Then I will put the final together according to the printer's instructions.
1. I obtain the articles from a paid subscription newspaper archive service and it gives me the option to save and view this PDF in Acrobat. The PDF is a "snapshot" of the newspaper page. Then I crop in Acrobat to just the article I want and print it out and save it.
2. Then I put the printed article on my scanner and push the text button on the front of the scanner. The articles are in different sizes and numbers of columns, but they are all consistent newspaper style: single-spaced, justified columns with a 3-space indent at the beginning of each paragraph. (I know that Acrobat has an OCR feature but I couldn't figure it out.)
3. Then the article appears in Microsoft Word 2000 as an RTF document, all in one long page with inconsistent spacing between lines. It does have the paragraph indents and shows the typos or misunderstood characters in green. I fix all the typos etc. in Word, and it's obvious that there is inherent formatting involved, but I have no idea how that happens from a scan or if it could create problems.
4. Last, I copy all the text in Word and paste it into my ID document, guessing at the width of the column I want. Then I click on the overset (from Peter's response on a thread about text flowing) and make the remaining columns. But I want each article to be printed on one page, so I don't want to risk it carrying over and ID making another page. The text shows up in ID as Gill Sans instead of Times Roman from Word. No huge deal, I just select and change. There are no paragraph indents, so I just added 3 spaces. (I know the nudge thing is bad and I promised not to do it again.) So I need to understand the formatting.
So any glaring errors in my proposed methods that could make me cry and have to start over?
Thanks in advance,
Patty

Hi Patty.
I am working on a similar project that involves original newspaper clippings (not preserved well so they are quite yellow) from 1863.
> ...I have spent several days trying to learn Acrobat. It's not coming easy for me so I've decided that ID is the program I am most comfortable with for this project.
I am very new to using InDesign and InCopy. There are three inexpensive learning resources that I am using: Lynda.com, the online resources of my public library, and this forum. Lynda.com offers excellent multimedia instruction, but it is time-consuming to complete the courses (some exceed nine hours). The public library has contracted with certain publishers so that, as a member, I have access to the electronic versions of those extremely expensive instructional books that go out of date the next time a new version of the software is released (thereby saving me $$$).
> I haven't nailed down the overall design concept and presentation of the book because family members don't agree on things yet. (Ugh, that's a whole different challenge.) But I don't envision a master page layout deal because I really want to imitate the different layouts of the published articles, so I envision each page having its own personality.
Welcome to the world of a Self Published Author. ;-)
> So I just want to be productive in doing the grunt work of obtaining the articles and setting up individual pages in ID that might or might not be used. Then I will put the final together according to the printer's instructions.
That sounds like a good plan.
> 1. I obtain the articles from a paid subscription newspaper archive service and it gives me the option to save and view this PDF in Acrobat. The PDF is a "snapshot" of the newspaper page. Then I crop in Acrobat to just the article I want and print it out and save it.
Okay, there might be some difficulties that crop up doing it that way. When a scanned image is cropped in Acrobat, the entire image remains intact and only the viewable area is cropped, so no information is lost. That means if the article you want is in the center of a three-column newspaper page, all of the information surrounding that article will remain available.
It is possible to open the original image in Photoshop through Acrobat and permanently delete the extraneous information, but unfortunately I am not using CS2, so I cannot provide the command path to do so.
> 2. Then I put the printed article on my scanner and push the text button on the front of the scanner. The articles are in different sizes and numbers of columns, but they are all consistent newspaper style: single-spaced, justified columns with a 3-space indent at the beginning of each paragraph. (I know that Acrobat has an OCR feature but I couldn't figure it out.)
If the OCR is not working right, then it may be because it is OCR'ing the entire page, not just the area that was cropped.
Since Acrobat can OCR image files, I see a couple of options. Open the original in Photoshop through Acrobat, crop out all of the unnecessary text, align the article text so that it is in a single column, save it, then just initiate the OCR in Acrobat on that existing page.
Another option is to paste screen prints from the article as it appears in Acrobat into Photoshop, align the text into a single column, save it and then run the Acrobat OCR on that file.
> 3. Then the article appears in Microsoft Word 2000 as an RTF document, all in one long page with inconsistent spacing between lines. It does have the paragraph indents and shows the typos or misunderstood characters in green. I fix all the typos etc. in Word, and it's obvious that there is inherent formatting involved, but I have no idea how that happens from a scan or if it could create problems.
That's from Word, not the scan itself.
The only way I have been able to clean up the garbage code Word creates has been to paste the text into Notepad, re-copy it, then Paste Special > Unformatted Text back into the same document (or sometimes a new doc). After that it's a manual process of cleaning up the text, which is easier than working with the original scan.
For some reason, there have been times when I have used Paste Special > Unformatted Text directly in Word and the result has been almost as messy as the original, so I've opted to just paste to Notepad, copy, and re-paste back into Word. Seems inefficient, but it works.
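Once the text is plain (for example, saved out of Notepad), the remaining manual cleanup could in principle be scripted. The Python below is only a minimal sketch under stated assumptions, not something either poster used: it assumes each paragraph in the OCR output starts with an indent of two or more spaces (or a tab), as in the newspaper columns described above, and the function name `clean_ocr_text` is hypothetical.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Return one paragraph per line, with no leading indent and single spaces.

    Assumption: a line that starts with two or more spaces (or a tab)
    begins a new paragraph; every other non-blank line continues the
    current paragraph.
    """
    paragraphs = []
    current = []
    for line in raw.splitlines():
        if not line.strip():
            # Blank lines come from the inconsistent OCR line spacing; skip them.
            continue
        starts_paragraph = line.startswith("  ") or line.startswith("\t")
        if starts_paragraph and current:
            paragraphs.append(" ".join(current))
            current = []
        # Collapse runs of whitespace inside the line to single spaces.
        current.append(re.sub(r"\s+", " ", line.strip()))
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "   First line of\nparagraph one.\n\n\n   Second  paragraph\nhere.\n"
    print(clean_ocr_text(sample))
```

Text cleaned this way has no typed indents at all, so in InDesign the paragraph indent can come from a first-line indent in the paragraph formatting rather than from three spaces.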
Best of luck with your project.

Similar Messages

  • Please give me advice about....what is the best virus guard in store and is virus worked in iPhone (my English is not good sorry)

    Please give me advice about....what is the best virus guard in store and is virus worked in iPhone (my English is not good sorry)

    If you are getting spammed via e-mail or text saying that your iOS device has a virus (it usually says to call some 800 number), then this is a scam - do NOT contact these crooks

  • How to print OCR text in Adobe invoice form

          Hi All,
          I have a requirement to print the customer number and invoice number as OCR text on the invoice Adobe form.
          So, if someone has any idea or has gone through this scenario, please share your suggestions.
          Thanks in advance,   
          Nilesh R Gaikwad

    Hi Nilesh,
    Try to check whether you can choose OCR-B in the Smartform Layout Designer.
    If not, then the font might be corrupt and you must turn to the Basis team.
    If yes, then it is not a TrueType font, and you must upload a TrueType font (.ttf; a licensed font, which might come at a cost) into SAP with SE73.
    Cheers,
    Tao

  • Performing OCR text recognition in all pages of a PDF document which has blank pages

    The PDF pages in the file are scanned as images, but there are blank pages also. So, pages 1 and 2 have images of scanned text, page 3 is blank, pages 4 and 5 have images of scanned text, page 6 is blank, and so on. The series of blank pages is 3, 6, 9....
    I am using Adobe Acrobat Pro version 9 on Windows XP. I choose the Document option in the menu, then OCR Text recognition->Recognize text using OCR->All pages. But the OCR stops after page 2.
    Then I go to page 4 and find that it is still an image (and I cannot copy the text from it). So I have to choose Document again, then OCR Text recognition->Recognize text using OCR->All pages. But the OCR stops after page 6.
    The document has some 200 pages and I don't want to manually choose Document, OCR Text recognition->Recognize text using OCR->All pages for every run of pages that has text.
    I also tried Document, OCR Text recognition->Recognize text using OCR->Page 1 to page 200, but that does not work either.
    I can manually go to each blank page, delete it, then do the OCR, but that would be tedious.
    1. Am I missing something in doing the OCR of all pages?
    2. Can I set up a batch process to do the delete of blank page series or can I set up a batch process which does OCR of all pages(and does not stop at a blank page) as it is doing now?
    Any suggestions would be appreciated.

    Hm, you could convert every PDF page to an image file (JPEG, etc.):
    - install the ImageMagick package
    - use the convert command:
    convert -density 150 your.pdf -trim your.jpg
    This will generate a JPEG file for every PDF page; "-density 150" sets the resolution used to render each page, in dots per inch, and has to come before the input file, while "-trim" removes the white margins from each rendered page (you could also try "-resize" to change the output size).
    This does not generate a new PDF file, but maybe there is a way to convert the images back into a single PDF?
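    If the OCR has to be started once per run of text pages, it helps to know the page ranges up front. The Python below is a small sketch that only computes those ranges; it assumes the blank pages really do fall on every third page (3, 6, 9, ...) as described in the question, and the function name `text_page_ranges` is hypothetical. It does not drive Acrobat in any way.

```python
def text_page_ranges(total_pages, blank_every=3):
    """Return (first, last) page ranges that contain no blank pages.

    Assumption: every page whose number is a multiple of `blank_every`
    is blank, and all other pages hold scanned text.
    """
    ranges = []
    start = None
    for page in range(1, total_pages + 1):
        is_blank = page % blank_every == 0
        if is_blank:
            if start is not None:
                # A blank page closes the current run of text pages.
                ranges.append((start, page - 1))
                start = None
        elif start is None:
            start = page
    if start is not None:
        ranges.append((start, total_pages))
    return ranges


if __name__ == "__main__":
    for first, last in text_page_ranges(9):
        print(f"OCR pages {first} to {last}")
```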

  • Is it possible to add or change OCR text in a batch process?

    Hello,
    This is my first submission.  Really hope you can help as if there's a solution to this it will significantly help our business.
    Is it possible to 'batch process' the adding or changing of OCR text in a PDF?
    This might sound like a strange process but let me explain what we do:
    1. We scan old, handwritten books and registers.  Typically these registers contain lists of information. Each page will be scanned to a filename like pge01.jpg, pge02.jpg, pge03.jpg etc.
    2. We transcribe the content of each page.  Each page will contain multiple records (e.g. 20 records). Typical fields might be:
    Unique ID
    Surname
    Forename
    Year
    Address
    Filename (i.e. pge02.jpg)
    3. We provide this content back within database-driven software so that when a user performs a search on, say, 'Surname=Jones' & 'Year=1945', all of the scanned pages containing handwritten text that matches the search criteria are displayed in a list.  The user can then click on a search result and see the scanned page containing that record.
    Rather than provide database driven software, we'd simply like to produce a standard PDF file.  Each page within the PDF will show each of the scanned pages (pge01.jpg, pge02.jpg, pge03.jpg etc.).  But where you would normally store the 'ocr recognised text' behind each image, we would like to show our transcribed content.
    If this is possible, then I realise that it's likely that you can't do field searches (i.e. 'Surname=Jones' & 'Year=1945') but at the very least I'd be able to type 'Jones' in to the search box and it would find all pages that contained the transcribed word 'jones'.
    If it's possible to add the transcribed data as 'ocrd text' then is it possible to do it in some sort of batch process?  We scan lots of big books and capture millions of records - so doing it manually is not an option.
    Any help that anyone can provide will be hugely appreciated.
    Thanks,
    Paul

    I don't think it's possible to manually add OCRd text, but you can add form
    fields with the text in them. And yes, it is possible to search the content
    of form fields, using a script.
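    Whichever way the transcribed text ends up attached to the PDF, the search behaviour Paul describes (type 'Jones', get every page whose transcription contains it) amounts to a word-to-page index. The Python below is only an illustrative sketch of that idea outside of Acrobat; the record layout follows the field list in the question, and the function names `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict


def build_index(records):
    """Map each lower-cased word in the transcribed fields to page filenames."""
    index = defaultdict(set)
    for record in records:
        filename = record["Filename"]
        for field, value in record.items():
            if field == "Filename":
                continue  # the filename is the search result, not searchable text
            for word in re.findall(r"[a-z0-9]+", str(value).lower()):
                index[word].add(filename)
    return index


def search(index, term):
    """Return the sorted page filenames whose transcription contains `term`."""
    return sorted(index.get(term.lower(), set()))


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    records = [
        {"Unique ID": 1, "Surname": "Jones", "Forename": "Mary",
         "Year": 1945, "Address": "2 High Street", "Filename": "pge02.jpg"},
        {"Unique ID": 2, "Surname": "Smith", "Forename": "John",
         "Year": 1946, "Address": "9 Mill Lane", "Filename": "pge03.jpg"},
    ]
    print(search(build_index(records), "Jones"))
```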

  • Advice about using the laptop with closed lid while connected to external display

    Hello all HP forum members!
    I have an ENVY 17 with Win8.1 64-bit, a great laptop, but I need advice about using it while connected via HDMI cable to an external display (TV).
    When I play heavy 3D games, the computer produces more heat than usual (of course this is normal), but if I do so with the lid closed, could the heat harm the display in some way?
    Thanks!

    Hi @VVel ,
    Thank you for visiting the HP Support Forums and Welcome. It is a great site for information and questions. I have looked into your issue about your HP ENVY 17 Notebook and issue with keeping the Notebook on while closing the lid. Here is a document on changing the Power option.
    Control Panel - Power Options - upper left - Choose what closing the lid does.
    Change Plan Settings and also check Advanced Settings. Make sure you do this for both battery and power.
    Hope this helps.
    Thanks.

  • Make a pdf ocr text searchable on a mac with adobe

    How do I make a pdf ocr text searchable on a mac with adobe

    With "adobe" what: Acrobat?  What version?

  • OCR-text from pdf to pdf ?

    Is it possible to copy the OCR-text from a pdf file obtained after Finereader and paste it in the original pdf?

    Having had to perform OCR indicates the original PDF is a scanner's output. That makes it an image / picture of text forming the PDF page content.
    An alternative approach -
    Using Acrobat XI Pro create a new, blank PDF page (****+Ctrl+T).
    Copy the OCR text.
    Use Acrobat XI's Tools - Add Text and paste the OCR text at the cursor location.
    Be well...

  • About OCR and Vote Disk size

    Hallo,
    I have a simple question about OCR and Vote Disk in a 10g/11g RAC setup:
    The official documentation speaks about sizing OCR and Vote Disk at about 280 MB, but does that mean each OCR and Vote Disk LUN?
    In other words, for 2 OCR and 3 Vote Disks, do I need 280*5 MB (in multiple LUNs or a single LUN)?
    Is that so?
    thx in advance

    ottocolori wrote:
    > Hallo,
    > I have a simple question about OCR and Vote Disk in a 10g/11g RAC setup:
    > official documentation speaks about sizing OCR and Vote Disk about 280 MB, but do they mean each OCR and Vote disk LUN ?
    > In other words, for 2 OCR and 3 Vote disk I need 280*5 MB ( in multiple o single LUN)
    > Is it so?
    > thx in advance
    Yes. Also see Re: Reduce size of OCR file

  • How to determine (in bulk) whether a document has OCR text and what reader versions are supported?

    Help!
    I've inherited a task which has spanned many years and which was not well thought out over the transitions.  At least three project teams have scanned and stored tens of thousands of documents to PDF.  What was discovered subsequently was that the project teams did not apply a uniform standard for which versions of Adobe would be supported in each PDF, and that not all documents appear to have been OCR'ed as part of the scan process.
    This has resulted in two major problems.  First, PDFs which support all Reader versions are bloated and consuming significant amounts of storage; second, the automated processing tools which depend upon the OCR text are failing once they pass the front and rear cover sheets (which do contain extractable text).  I need to know if there is a way that PDFs can be bulk scanned to determine which Reader versions are supported (say 8.0 to current), and if the OCR'ed / extractable text is not just limited to the first few and last pages of each PDF.
    I have been manually fixing individual files with Adobe Acrobat 9.0.  I can force Adobe to re-OCR and save the files, but I would rather not have to re-process the existing bulk that we have unless absolutely necessary.  If I could determine which ones need fixing and just processing those it will save man years of work.
    Thanks, in advance, for any assistance.
    Michael

    What I meant by supporting a version of Reader is that I don't need the files to be fully backwards compatible all the way back to say the first few versions of Adobe Reader.  (There are likely limitations on that much backwards compatibility, anyway.)  One of the scanners that was used apparently was set for full backwards compatibility, to the extent possible, for every PDF that it generated.  Some of those PDFs are huge, commonly 300-400MBs in size.  If I open them in Acrobat 9.0, limit the backwards compatibility to Adobe Reader 8.0 and forward, then resave the file, the size is often significantly reduced.
    As to how it is measured, there is something in the PDF itself that indicates a minimum version for Adobe Reader for compatibility purposes.  When you select for compatibility in Acrobat during the save process I mentioned, you get to pick the version at which you want to stop -- so if you selected 8.0, it would be compatible with 8.0, 9.0, and so on.
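    The "something in the PDF itself" includes the file header, which begins with a marker such as %PDF-1.7; Acrobat's compatibility choices correspond to these PDF versions (Acrobat 8, for example, corresponds to PDF 1.7). As a rough sketch of a bulk check, the Python below reads only that header for every PDF in a folder. It rests on assumptions: a version entry in the document catalog can override the header, so this is a first-pass filter rather than a definitive answer, and it says nothing about whether the pages contain OCR text. The function names are hypothetical.

```python
import re
from pathlib import Path

# The header marker normally sits in the first bytes of the file.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def pdf_header_version(path):
    """Return the version string from the PDF header, or None if absent."""
    with open(path, "rb") as handle:
        head = handle.read(1024)
    match = HEADER_RE.search(head)
    return match.group(1).decode("ascii") if match else None


def scan_folder(folder):
    """Return {filename: header version} for every .pdf directly in `folder`."""
    return {
        pdf.name: pdf_header_version(pdf)
        for pdf in sorted(Path(folder).glob("*.pdf"))
    }


if __name__ == "__main__":
    for name, version in scan_folder(".").items():
        print(f"{name}: PDF {version}")
```

    Files that report an older version than expected could then be queued for re-saving, leaving the rest untouched.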

  • Need advice about a WiFi set-up

    Dear Gang,
    I want to have a pure 802.11n network using the 5 GHz spectrum and the wide array setting. However, I have one desktop computer that is 802.11g. I'd like to run a 25 ft. Cat 5 cable to one of the LAN ports on the Time Capsule and then shut off the WiFi on this computer. However, strangely enough, I've always had all my computers (even desktops) networked via WiFi, so I don't know how well using the LAN works.
    What are the thoughts of the group?
    I know that ethernet should give me greater speeds to the TC's hard drive and should solve my problem with bringing a non-802.11n machine in to a "pure" network, but I don't want to do so if the set-up is arduous or if doing so makes me lose a lot of features that I get in an all WiFi network.
    TIA,
    Mark

    dwb,
    I'll definitely take your advice about the ethernet cables - thank you.
    Your info is reassuring as it means that my dual G5 will get a little more life out of it by connecting it in to my "pure" system via ethernet.
    Two other quick questions:
    -should I connect my new iMac via ethernet to the router, too, even though it's capable of going on my system in 802.11n 5ghz?
    -also, as an aside, I'm trying to figure out if my new 802.11n network will equal my current 802.11g network in terms of distance, as I have an Apple TV that's a few rooms over. That Apple TV gets about 4 bars of signal from my current network. Do you think that 802.11n at 5 GHz will have an equivalent range?
    Mark

  • Need advice about zip rar files

    Here's wishing you all a very happy New Year!! Mr Greenhorn is back again! I have a question about zip and rar files. As some of you may remember, this is my first Mac, so I'm used to using the zip/rar programme on my old Windows O/S. Now I receive zip/rar files and can't open them. What is anyone's advice about downloading the zip/rar programme onto my Mac? Any particular pro's and con's? Any particular site to download from? Now my Mac has a compression programme included which converts files to zip files, but is that compatible to the actual zip/rar programme? And finally, where is the advantage to compressing files? Is it only for uploading or is there some benefit to having files 'compressed' on your actual laptop?

    Hi There
    Mac OS X has a built-in Archive Utility which doesn't support .rar files. To uncompress rar files you will need a utility like RAR Expander or Stuffit Expander; both are free. I suggest doing a search on MacUpdate (http://www.macupdate.com) or VersionTracker (http://www.versiontracker.com) to find a programme that you like.
    I personally don't see any advantage in keeping compressed files on your laptop (unless the archives are being used often for emailing, uploading to websites etc). Compressing files does save disk space, but if you get to the point where you need to compress your files to gain disk space then I suggest buying an external HD or a bigger internal HD.
    Hope this helps
    J.C

  • Need advice about best characterset for XMLDB

    Hi,
    Oracle 9.2.0.5 Windows 2000
    Please give me advice about the best character set
    configuration for XML DB.
    During installation the Oracle installer suggests
    charset=AL32UTF8 for multilingual data and ncharset=
    AL16UTF16.
    Are these good settings for a database which will be
    used for ordinary multilingual data and XML DB?
    Thanks,
    Viacheslav

    Yes, we strongly recommend the use of AL32UTF8 for XML DB.

  • Need advice about coalesce and deallocate unused space

    Hi experts;
    I'm looking for advice about coalesce and deallocate unused space.
    I have a tablespace that is 87% full, and one of the tables in that tablespace has 1,150,325 records.  I'm going to delete 500,000 records from that table, but I understand that to release the space used by those records I need to run another procedure. I was reading about coalescing a tablespace and deallocating unused space.
    Apparently both processes can help me free space. Could you share your comments on their advantages and disadvantages, so that I can choose the best solution?
    Thanks for your comments.
    Al

    Hi
    After rows are deleted, the high water mark stays where it was, and so does the size of the table. You need to bring down the high water mark.
    Here is what you need to do to bring it down. We do this monthly for performance purposes.
    This is an EBS R12 system, but the procedure is the same for an EBS or a non-EBS database.
    After you purge or delete data in a table:
    1) alter table APPLSYS.WF_ITEM_ATTRIBUTE_VALUES move;  -- this operation leaves all indexes on the table UNUSABLE
    2) select owner, index_name, status from dba_indexes  -- list the indexes on APPLSYS tables that are no longer valid
    where table_owner = upper('APPLSYS')
    and
    status NOT IN ('VALID','N/A');
    3) spool idxrebuild.sql  -- generate a script to rebuild those indexes
    select 'alter index ' ||owner||'.'||index_name ||' rebuild online;'  from dba_indexes
    where table_owner = upper('APPLSYS')
    and
    status NOT IN ('VALID','N/A');
    spool off
    4) run idxrebuild.sql   -- to rebuild the indexes.  At this point, if you check the space on the table, it is still the same; you need to run #5
    5) exec fnd_stats.gather_schema_stats ('APPLSYS');  -- fnd_stats is for EBS systems; you can replace it with the database equivalent command
    Use this statement to count the blocks before and after the operation to see the difference:
    select segment_name, sum(blocks) "Total Blocks" from dba_extents
    where
    owner IN ('APPLSYS')
    AND segment_name = 'WF_ITEM_ATTRIBUTE_VALUES'
    group by segment_name;
    Hope this helps.

  • Need advice about headphones and splitter for HP EliteBook

    Hello,
    I would like some advice about what headphones and headphones splitter I should use for an HP EliteBook. I am going on a plane trip with my kids, and I plan to get them both Leapfrog headphones. However, I need a headphone splitter so that they can both watch a movie on the same computer.
    Does anyone have any suggestions for a splitter for an HP EliteBook? I tried Amazon but couldn't find anything.
    Thanks

    Hi,
    The following one is for more than 2:
        http://www.officeworks.com.au/shop/officeworks/belkin-rockstar-headphone-splitter-bef8z274
    and the following one is from Amazon:
        http://www.amazon.com/Belkin-Speaker-and-Headphone-Splitter/dp/B00009WQSR
    Regards.
    BH
