Batch text extraction with CleanContent SDK (OIT)

Hi all,
we're using the CleanContent SDK to extract the text from large numbers of documents for later text processing. I've successfully implemented (or should I say copied from the sample code) a simple scenario that extracts a single file, and I'm now calling this function for every single document, which gets quite heavy when we need to process thousands of documents.
I was wondering whether there are any features or techniques to process large numbers of documents more efficiently, but I haven't found much documentation or many forum threads on the subject. Should/can we reuse request or other objects to avoid setup/loading time in any way?
Many thanks in advance for your suggestions, examples and experiences !
Best regards,
Benjamin

Hi Bex,
thanks for your reply.
I'm calling Outside In from within the database using the Oracle-embedded JVM, so it's not exactly the same as calling the app 1000 times individually, but probably not the same as calling it 1000 times from within the same Java-only app either. I've already parallelized it to the extent that the DB starts several asynchronous processes, each calling out to Outside In, and it is indeed taking 100% CPU once a few of these have taken off.
But still, even if the application created SecureRequests one after the other rather than being called for a single SecureRequest each time, you'd only avoid some default Java setup costs, right? As far as I can see (or that's at least why I posted this question), there is no "intra-app" reuse of objects or resources, and therefore I'd think the (probably considerable) setup costs of the SecureRequest object will still have to be paid for each document processed.
So what I was looking for is some advice on whether there are any features in the CleanContent modules for batch processing that I could take advantage of when fine-tuning the Java code called from the DB.
Best regards,
Benjamin
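On the reuse question: whether CleanContent request objects can safely be reused across documents is not confirmed anywhere in this thread, so the following is only a generic sketch of the pattern, written in Python for brevity rather than Java. `FakeExtractor` is a hypothetical stand-in for whatever object is expensive to set up; it is not part of the SDK. Each worker thread builds one extractor and reuses it for every document it handles, so the setup cost is paid once per worker rather than once per document.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeExtractor:
    """Hypothetical stand-in for an extraction object that is costly to create."""

    instances = 0            # counts how many extractors were ever built
    _lock = threading.Lock()

    def __init__(self):
        with FakeExtractor._lock:
            FakeExtractor.instances += 1

    def extract(self, document: bytes) -> str:
        # Placeholder "extraction": decode the bytes and trim whitespace.
        return document.decode("utf-8", errors="replace").strip()


_local = threading.local()


def get_extractor() -> FakeExtractor:
    """Return this thread's extractor, creating it on first use only."""
    extractor = getattr(_local, "extractor", None)
    if extractor is None:
        extractor = FakeExtractor()
        _local.extractor = extractor
    return extractor


def extract_one(document: bytes) -> str:
    return get_extractor().extract(document)


def extract_batch(documents, workers: int = 4):
    """Extract text from all documents, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, documents))
```

With 4 workers and 1000 documents, at most 4 extractors are created. The same shape maps onto a Java `ExecutorService` combined with a `ThreadLocal`, provided the real object tolerates reuse.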

Similar Messages

  • Refrying PDF with subset embedded fonts fixes text extraction

    Hi All,
    I know it is not a good idea to (just) refry PDF files (PDF -> EPS -> PDF). Especially when the PDF contains subset embedded fonts. Chances are you will end up with a PDF file which does not contain valid (searchable) text.
    I did not know the opposite could also be true. The following zip file contains 2 PDF files, each containing two words: the original and the refried version.
    Refried.zip
    When selecting text from the original PDF file (using Acrobat 6 through X), it contains incorrect text, in this case invalid capitals. If I try the same in the refried version, the extracted text is correct.
    It seems strange to me that a process which can only result in loss of information "fixes" this text issue. Somewhere the correct text must be hidden in the original PDF file. Not only capitals seem to be affected but also random characters, which seem to be fixed once refried.
    Could anyone think of an explanation?
    Is there a workaround without having to refry the PDF (refrying often results in loss of information)? I have no influence on the PDF files I receive, therefore I cannot embed the full fonts.
    I am using the C++ SDK for Acrobat to write plugins.
    Any pointers would be great!
    Kind regards,
    Robert

    Thanks again for your reply,
    Your explanation makes sense.
    I went ahead and removed the tounicode cmap just to see what would happen
           if (CosDictKnown (cosFont, ASAtomFromString ("ToUnicode")))
             CosDictRemove (cosFont,ASAtomFromString ("ToUnicode"));
    As you predicted this fixes some issues and introduces new ones.
    The results differed from the refry method: in some cases the refried PDF did not contain extractable text, in other cases the PDF without the "ToUnicode" CMap had no extractable text.
    Maybe I could combine the information from different text extraction methods to make an educated guess as to which one (or which combination) is best :S
    I suppose looking at individual textruns (with all its complexity) would not help me either...
    Kind regards,
    Robert
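The "educated guess" idea could be prototyped outside the plugin before committing to anything in C++. This is a minimal sketch in Python under loud assumptions: the candidate texts have already been extracted by each method, and "best" simply means the highest share of characters that are not replacement, control, private-use, or unassigned code points. `plausibility` and `pick_best` are hypothetical names, and this heuristic would not catch the wrong-capitals case, which produces perfectly ordinary characters.

```python
import unicodedata


def plausibility(text: str) -> float:
    """Share of characters that look like real text (0.0 for empty input)."""
    if not text:
        return 0.0
    good = 0
    for ch in text:
        category = unicodedata.category(ch)
        is_replacement = ch == "\ufffd"
        # Cc = control, Co = private use, Cn = unassigned; keep normal whitespace.
        is_junk = category in ("Cc", "Co", "Cn") and ch not in "\n\r\t"
        if not (is_replacement or is_junk):
            good += 1
    return good / len(text)


def pick_best(candidates: dict) -> str:
    """Return the name of the extraction method whose text scores highest."""
    return max(candidates, key=lambda name: plausibility(candidates[name]))
```

For example, `pick_best({"original": "H\ufffdllo", "refried": "Hello", "no_tounicode": ""})` selects `"refried"`.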

  • Custom batch rename files with Aperture 3 in the following format: IMG_0023.cr2 to Smith_YYMMDD_0023.cr2?  I cannot find a way to structure the date in Aperture as such, as well as extract only the camera file

    Please advise how to custom batch rename files with Aperture 3 in the following format: IMG_0023.cr2 to Smith_120816_0023.cr2?  I cannot find a way to structure the date in Aperture as such (YYMMDD), as well as extract only the camera file (0023, for example).  Adobe Bridge CS5 can do this, but NONE of the Adobe software is retina optimized, and is terrible to look at.

    In Aperture you are limited to renaming files by the entries in the File Naming preset window.
    At what point are you looking to rename, import or export? It might be possible to do what you are looking to do external to Aperture either via a script or other software.
    regards

  • Pdf text extract problem with CID font and Identity-H

    Hi all,
    I am facing a big problem with text extraction from a PDF file.
    Currently I am using Cogniview's PDF2XL text extraction tool.
    About 95% of the text extracts correctly, but a few characters show up as a box, some as ?, and some as a dotted circle mark.
    Font Used:
    ArialUnicodeMS(Embedded Subset)
    Type: TrueType (CID)
    Encoding:Identity-H
    TimesNewRomanPSMT
    Type:True Type
    Encoding:ANSI
    ActualFont:TimesNewRomanPSMT
    ActualFontType:TrueType
    Anyone please help me to overcome this.
    Regards
    Gilbert.X

    I tried the Acrobat Pro 9 export option; it retrieved only letters and numbers, and all of the Hindi characters show up as just ........
    By the way, how can I upload my PDF file to this forum? Please guide me.
    Regards
    Gilbert.X

  • Batch process text document with automator

    Hi all,
    I would like to create a workflow with Automator that looks like this:
    1) Open text file with TextEdit
    2) "Find and replace"
    3) Save
    4) Close
    My workflow works if I select one text file, but I didn't manage to make it run on more than one text file at a time, which is what I would like.
    Please help.
    Thanks,
    p
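No Automator answer appears in this thread. Outside Automator, the four steps (open, find and replace, save, close) applied to many files at once can be sketched in a few lines of Python. This assumes plain-text files in a known encoding (UTF-8 here) and a literal, not pattern-based, replacement; `replace_in_files` is a hypothetical helper name.

```python
from pathlib import Path


def replace_in_files(paths, find: str, replace: str, encoding: str = "utf-8") -> int:
    """Replace every occurrence of find in each file; return how many files changed."""
    changed = 0
    for path in map(Path, paths):
        text = path.read_text(encoding=encoding)      # open
        updated = text.replace(find, replace)         # find and replace
        if updated != text:
            path.write_text(updated, encoding=encoding)  # save (file is closed afterwards)
            changed += 1
    return changed
```

A call such as `replace_in_files(Path("notes").glob("*.txt"), "foo", "baz")` would process every `.txt` file in a folder named `notes`; files without a match are left untouched.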

  • Extract text file with HTML tags from JTextPane

    hello world
    I have a big problem !
    I am creating an applet with a JTextPane ...
    so I can write text, (bold, italic etc), i can insert images.
    Now i want to create a text file with all the HTML tags
    corresponding to what I wrote in my JTextPane.
    I want to have and save the HTML file corresponding to what i wrote ...
    Is it possible ? Help me please ....
    Jeremie

    writing to a file from an applet is going to take a fair amount of work on your part.
    in order to write to a file from your applet, you have to use servlets or jsp to write to a file on your server. if you wish to write locally, look into signing your applet or policy settings of your browser.
    for writing to a file to the server, i suggest you look into servlets and tomcat to run the servlets.
    i just finished a project that used servlets and they take some time to figure out, but it's definitely worth your time.
    here are some websites...
    http://www.j-nine.com/pubs/applet2servlet/Applet2Servlet.html
    http://jakarta.apache.org
    other websites have tutorials that you can look at too
    Andy

  • QTP Automation with Acrobat SDK

    Hi,
    I would like to verify the PDF content using HP Automation tool - QTP by opening the PDF and extracting the Text content from PDF.
    I am trying the OLE style and using a series of commands, as described in many postings. The first one is:
    CreateObject("AcroExch.App")
    It throws error, saying "Cannot create ActiveX component"
    From the Forums I found that it will not work with Adobe Reader and will work with Full Adobe Product with Adobe SDK.
    Now I would like to know which version of Acrobat will have Acrobat SDK.
    My Client already procured the below Acrobat Products.
    VLA ACROBAT 10  win universal English
    VLA ACROBAT 10 WIN Universal English STD to STD Upgrade ST-ST
    VLA ACROBAT Pro 10  win universal English
    VLA ACROBAT Pro 10 WIN Universal English STD to STD Upgrade ST-ST
    Pls let me know which product( which has Adobe SDK as well)  I should choose to proceed with QTP Automation.
    Any Advise/Help greatly appreciated.
    Thanks.
    Anand Muthunayagam.

    Check out http://labs.adobe.com/technologies/aptt/

  • Importing text file (with file names) into Automator.. is it possible?

    Hello all,
    I have been working with Windows Batch files for my line of work. I have a couple of file names in a text file (a column), which I want to copy from one folder of one hdd to another folder on a different hdd. I have been trying to do this kind of work with a Mac. I already know how you copy and rename files in automator (which isn't difficult, of course) but you have to 'select' the files in the finder first (with get specified items).
    But the only way i see that you can specify items is by selecting them... is there a way to import a text file with all the file names instead of selecting all the file names manually?
    or is there an AppleScript alternative which I can use to import the text file (or just copy into applescript) and run before the query's of copying and renaming the files? I am kind of new to Apple programming.
    The text file looks like this:
    image1.jpg
    image2.jpg
    etc..
    so there has to be a command to: 'goto' a specific folder as well.
    Thanks in advance!

    You can import text files, but if they are just names you will need an additional action to add the source folder path. A *Run AppleScript* action can be used, for example:
    Tested workflow:
    1) *Ask for Finder Items* {Type: files } -- choose the text file containing the names
    2) *Combine Text Files* -- this gets the text file contents
    3) *Filter Paragraphs* { return paragraphs that are not empty } -- skip blank lines
    4) *Run AppleScript* -- copy and paste the following script:
    on run {input, parameters} -- add folder path
        (*
        add the specified folder path to a list of file names
            input: a list of text items (the file names)
            output: a list of file paths (aliases)
        *)
        set output to {}
        set SkippedItems to {} -- this will be a list of skipped items (errors)
        set SourceFolder to (choose folder with prompt "Choose the folder containing the file names") as text -- this is the folder containing the names
        repeat with AnItem in the input -- step through each name in the input
            try
                set AnItem to SourceFolder & AnItem -- add the prefix
                set the end of the output to (AnItem as alias) -- test
            on error number ErrorNumber -- oops
                set ErrorNumber to ("  (" & ErrorNumber as text) & ")" -- add the specific error number
                set the end of SkippedItems to (AnItem as text) & ErrorNumber
            end try
        end repeat
        ShowSkippedAlert for SkippedItems
        return the output -- pass the result(s) to the next action
    end run

    to ShowSkippedAlert for SkippedItems
        (*
        show an alert dialog for any items skipped, with the option to cancel the workflow
            parameters - SkippedItems [list]: the items skipped
            returns nothing
        *)
        if SkippedItems is not {} then
            set {AlertText, TheCount} to {"Error with AppleScript action", count SkippedItems}
            if TheCount is greater than 1 then
                set theMessage to (TheCount as text) & space & " items were skipped:"
            else
                set theMessage to "1 " & " item was skipped:"
            end if
            set {TempTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, return}
            set {SkippedItems, AppleScript's text item delimiters} to {SkippedItems as text, TempTID}
            if button returned of (display alert AlertText message (theMessage & return & SkippedItems) ¬
                alternate button "Cancel" default button "OK") is "Cancel" then error number -128
        end if
        return
    end ShowSkippedAlert
    5) *Copy Finder Items* { To: _your external drive_ }

  • Use of SAP text variable with replacement path in BO Designer

    Dear experts,
    I created a universe based on an SAP BW query. In this query I used a text variable in order to get dynamic column headers.
    The text variable uses a "replacement path", that is, the text is derived automatically from the user input triggered by an SAP BW variable.
    Example:
    The query contains a restricted key figure which shows net values for a selected year based on the variable PYEAR.
    The name of this restricted key figure is &ZYEAR&, where ZYEAR is the text variable with replacement path from the variable PYEAR.
    When I generate the universe based on this query I get the component &YEAR&.
    When I use this universe in WebI, my result contains a header with &YEAR& although I selected, for example, the year 2006.
    According to several how-tos and white papers on SDN, it seems that text variables with replacement path are supported, so I am disappointed by this result.
    Can you give me advice on how to get the right result?
    Thanks a lot,
    Olivier Doubroff

    Hi Rishit,
    I am trying to achieve the same as you, but it seems like text variables in restricted key figures do not work. I am using BO XI3.1 SP3 and SAP BW 7.01 SP6.
    A workaround for me has been to use the "user response" function in WebI to create a WebI variable that holds the dynamic text title. If the user inputs Jan 2010, I change the input to a date using the "user response" and "ToDate" functions in the WebI variable editor. After changing the input to a date, I use RelativeDate to subtract roughly one month (e.g. 25 days) from the user input. Then I have both Jan 2010 and Dec 2009 as WebI variables to use as headers for my restricted key figures.
    The formulas can easily become a little long, but by tweaking the user response string, you should be able to get dynamic headings by using webi functionality. But be aware that you need one webi variable for each dynamic heading if you use this method.
    Let me know if it works or if I can help more:-)
    Best regards,
    Morten
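The date arithmetic behind that workaround can be checked outside WebI. The sketch below is Python, not WebI formula syntax, and only illustrates why subtracting 25 days from the first day of a month always lands in the previous month (every month has at least 28 days). It assumes the user response has the form "Jan 2010" with English month abbreviations; `previous_month_label` is a made-up name.

```python
from datetime import datetime, timedelta


def previous_month_label(user_response: str) -> str:
    """Turn 'Jan 2010' into the label of the month before it."""
    # Parsing "Jan 2010" yields the first day of that month.
    first_of_month = datetime.strptime(user_response, "%b %Y")
    # 25 days earlier is always inside the previous month.
    earlier = first_of_month - timedelta(days=25)
    return earlier.strftime("%b %Y")
```

So a user response of "Jan 2010" gives the two headers "Jan 2010" and "Dec 2009", one WebI variable per heading as described above.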

  • Batch file extracting all files from nested archives

    I have managed to leverage a powerful forfiles command line utility with the mighty 7z compression program.
    Below is a simple batch file extracting all files from nested archives hidden at any depth inside other archives and/or folders. After the extraction each archive file turns into a folder having the archive file name. If, for example, there was an "outer.rar" archive file containing nothing but an "inner.zip" archive with only "afile.txt" inside, "outer.rar" becomes "...\outer.rar\inner.zip\afile.txt" file system path.
    @echo off
    rem extract_nested_archives.bat
    move %1 "%TMP%"\%2
    md %2
    7z x -o%1 -y %TMP%\%2
    del "%TMP%"\%2
    for %%a in (zip rar jar z bz2 gz gzip tgz tar lha iso wim cab rpm deb) do forfiles /P %1 /S /M *.%%a /C "cmd /c if @isdir==FALSE extract_nested_archives.bat @path @file"
    ARCHIVES ARE DELETED DURING THE EXTRACTION! Make a copy before running the script!
    "7z.exe" and "extract_nested_archives.bat" should be in folders available via the %PATH% environment variable.
    The first parameter of extract_nested_archives.bat is the full path name of the archive or folder that should be fully expanded; the second parameter is just the archive or folder name without the path. So you should run "c:\temp\extract_nested_archives.bat c:\temp\outer.rar outer.rar" from the command line to completely expand "outer.rar". "c:\temp" must be the current folder.
    Best regards, 0x000000AF

    Incredibly useful!  Thank you so much.  I did make a couple of small changes to make the script a little easier to use from the end-user perspective.
    First - I don't like making the user input the redundant second parameter, so I added this snippet which extracts it from the first parameter. The first line of the snippet enables delayed expansion so that special characters in the file name don't break anything. The second line pulls the parameter into a variable, and the third line uses delayed expansion on that new variable. Before implementing delayed expansion I had problems with file paths which included parentheses.
    SetLocal EnableDelayedExpansion
    Set SOURCE=%1
    For %%Z in (!source!) do (
        set FILENAME=%%~nxZ
    )
    set FILENAME=%FILENAME:"=%
    Anyway, once that was done, I just used %FILENAME% everywhere in the script instead of %2 (making sure to correct quotes as needed).
    This way, to run my script all you need to run is:
    C:\temp\extract_nested_archives.bat C:\temp\Archive.zip
    Second - I didn't want to modify the Windows environment variable. So I replaced 7z with "%PROGRAMFILES%\7-zip\7z.exe".
    I also replaced extract_nested_archives.bat with "%~f0" (which represents the full path+filename of the current script).
    Here is my full script now.  Tested on Windows 8 with the 64-bit version of 7-zip installed:
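    @echo off
    rem extract_nested_archives.bat <full path of archive>
    Setlocal EnableDelayedExpansion
    Set source=%1
    rem derive the bare file name (name + extension) from the full path
    For %%Z in (!source!) do (
        set FILENAME=%%~nxZ
    )
    rem strip any double quotes from the file name
    set FILENAME=%FILENAME:"=%
    rem move the archive aside, create a folder with its name, extract into it
    move /Y %1 "%TMP%\%FILENAME%"
    md "%FILENAME%"
    "%PROGRAMFILES%\7-zip\7z.exe" x -o%1 -y "%TMP%\%FILENAME%"
    DEL "%TMP%\%FILENAME%"
    rem recurse into every archive that was just extracted
    for %%a in (zip rar jar z bz2 gz gzip tgz tar lha iso wim cab rpm deb) do (
        forfiles /P %1 /S /M *.%%a /C "cmd /c if @isdir==FALSE "%~f0" @path @file"
    )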
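The batch files above rely on 7z and forfiles. As a rough cross-platform sketch of the same idea, here is a Python version limited to .zip archives, because that is what the standard `zipfile` module reads. `expand_nested` is a name chosen for illustration. Like the batch files, it deletes each archive after extracting it and turns it into a folder of the same name, so run it on a copy.

```python
import zipfile
from pathlib import Path


def expand_nested(archive: Path) -> None:
    """Replace archive with a folder of the same name, recursing into nested zips."""
    # Move the archive aside so a folder can take over its name.
    aside = archive.with_name(archive.name + ".tmp")
    archive.rename(aside)
    archive.mkdir()
    with zipfile.ZipFile(aside) as zf:
        zf.extractall(archive)
    aside.unlink()  # the original archive is deleted, as in the batch file
    # Collect nested archives first, then expand each one in turn.
    for child in sorted(archive.rglob("*.zip")):
        if child.is_file():
            expand_nested(child)
```

Given an `outer.zip` that contains only `inner.zip`, which in turn contains `afile.txt`, calling `expand_nested(Path("outer.zip"))` leaves the path `outer.zip/inner.zip/afile.txt` on disk.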

  • Soft KeyBoard is not working on ios 7 with Air sdk 3.8

    Hi
    In my app the soft keyboard is not working on iOS 7 with AIR SDK 3.8. Does anyone know a solution for this?

    Hi,
    There's no question that TextFields and TextAreas weren't working in our case, likely because we have a deep displaylist with a variety of object types. Presumably, AIR has changed how it looks through the displaylist for objects that need keyboard, and perhaps we have an object type somewhere in the hierarchy that AIR no longer recurses. It's definitely something that changed, though.
    It's no picnic to put together a sample app, I can't afford that time when I have a solution. But the symptom was very clear, a textbox would open, and the cursor would just blink with no way of interacting with it.
    I'm happy using StageText directly, because it's a more direct way to interact with the OS and gives more control.
    It also solves a bug in AIR that I haven't reported yet, but is as follows. Rarely, when you move the container of a TextField or TextArea after it has been created, AIR will crash and freeze iOS devices. It doesn't happen on Android or desktop, but with a user base of about 100,000 for our app, we've had it reported maybe 50 times. One of our dialogs sometimes needs to reposition the elements, which is done animated. During this, AIR will crash about 0.01% of the time. We tried only creating the TextArea, but not activating it or even having it visible, but even an invisible TextArea will crash, presumably because AIR moves the internal StageText overlay around as well, and this confuses iOS after a while during the animation.
    By using StageText directly, I finally also have a way to get rid of this bug, because I simply don't activate StageText until the object has already been positioned. Prior to that, it isn't even an editable text field, it's a label like anything else. So I'm happy I did this solution.
    Let's just leave this thread as a record if someone else has the same problem. I'm quite sure it's because of our very complicated display list, and AIR having changed how it scans the displaylist for objects that need keyboard.
    Best,
    Per

  • Is Adobe Flash Builder 4.7 compatible with Flex SDK 3.5?

    Is Adobe Flash Builder 4.7 compatible with Flex SDK 3.5?

    Kind of.  I never actually solved the bug, but I did make it past the install. 
      Upon running Adobe Flash Builder 4.7 Plug-in Installer it appeared to crash and only load a blank screen with a single unclickable button.  It wasn't actually "hanging", and would react when I clicked on the program menu and moused over "Services".  This would allow my clicks and keystrokes to register with the program, but only after I mouse over the "Services" menu item.  For example, you need to click "okay", then mouse over "Services".  When you want to enter text you need to click in the text box, and mouse over "Services".  Then you need to type the text you want, and mouse over "Services" to see the text appear.  It's a PITA, but you can get through the install by doing this. 
      After installing the FB plug-in, starting Eclipse with editors open appeared to cause errors.  If this happens, it seems to be because FB attempts to load when Eclipse starts up.  Close all editors and restart Eclipse.  The Welcome page for FB may pop open.  At the bottom there is a check box that will keep that form from loading - check it.  If the page does not open when Eclipse starts, then open up an MXML file.  The Welcome page should load, and you can check the box. 
    I hope this helps you move forward.  Also, if you find an actual solution please let me know.

  • Creating batch of coupons with unique code

    Hello,
    I wanted to provide some of my clients with coupons for additional services at a discounted price, and was wondering if there is a way to create batch of pdfs with unique coupon code (which would be provided), that would be then printed.
    TIA
    Lukasz Bien

    I thought about something that is similar to *Office functionality, where you open some document, and run mail wizard (which takes data from external db, or other source).
    PDF is created in Adobe InDesign (static elements like text or graphic).

  • Read Only Display of Radio group and Text area with counter not working

    Hello,
    I am using Apex 3.2, with 10g for the database
    I have this form, with fields that will set to read only when status = 'closed'
    All of the fields display as read only except for 2. I cannot figure out why this is not working correctly.
    1st field is Issues that is a text area with character counter, with a sql query behind it, that is set to null unless the query is pulling in the data.
    2nd field is Status which is a radio group that will not display as read only when status = 'closed'
    I have other fields on the form with the same format and they change to read only when the status = 'closed', I have even copied the pl/sql expression from one field to these fields and it still doesn't work correctly. I have also tried javascript for an on load event, which works, but once I click on the save button, it disables all of the page items, which works correctly, but I purposely forget to enter information, to make sure the validations are firing correctly, which it does, but the script disables everything, not allowing me to correct the errors. The javascript is firing on the on page load event.
    Any help on this is greatly appreciated.
    Mary

    Dung,
    That API seems to have a bug, it returns true/false/null, so you could use 'return not nvl(htmldb_util.current_user_in_group(p_group_name => 'APP Admin'),false)' to get a false value.
    Unfortunately there's another problem: using the read-only attributes for checkbox or radiogroup item makes them hidden. My suggestion would be to create another item that has disabled="disabled" in the HTML Form Element attribute in the item definition and display that item or the non-disabled item alternately, using conditions based on the current_user_in_group logic.
    Scott

  • How do I forward a text message with the iPhone?

    How do I forward a text message with the iPhone?

    You could take a screen shot (hold and release both buttons simultaneously) and then email the resulting image to someone.
    // crude workaround for now...
