Analyzing PDF Content

I'm working on a document management solution built in-house on .NET technologies. Here is a quick rundown of the current process:
One or more documents in PDF format are uploaded to the server through a web service. A document is usually a financial document such as a statement or an invoice. A Windows Service picks up the file, splits a single batched document into multiple documents, analyzes each document's content to determine which entity it belongs to, and links the two together.
Currently, the solution extracts the entire text from the PDF and uses a combination of character counts and stopwords to extract the document identification number, such as the invoice number or account number. This identifier is used to search the database and link the document to the applicable entities. At the moment, this process is an administrative nightmare: everything setup-related depends on our company, because the setup process is too complex to leave in our clients' hands. The current solution uses the iTextSharp and PDFBox libraries.
I've recently started on this project and would like to change the way this entire process works. Here is what I have in mind:
On the setup screen, the client uploads a sample document such as a statement, uses a region selection tool to nominate a particular area for analysis, and nominates a single stop word to look for in that area. The identifier next to the stop word is extracted and in turn used to identify the entity the document should be linked to. Much simpler than the current process in words, but in practice not so simple, I guess!
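To make the region-plus-stop-word idea concrete, here is a minimal sketch in Python. It is not Acrobat SDK, iTextSharp, or PDFBox code: it assumes some extraction step has already produced a list of words with page coordinates, and the names `Word`, `Region`, and `find_identifier` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """One extracted word and its position on the page (hypothetical type)."""
    text: str
    x: float
    y: float


@dataclass
class Region:
    """Rectangle nominated by the user on the setup screen (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def find_identifier(words: List[Word], region: Region, stop_word: str) -> Optional[str]:
    """Return the word that follows stop_word inside region, or None."""
    # Keep only words inside the region, in reading order (top-down, left-right).
    inside = sorted((w for w in words if region.contains(w)), key=lambda w: (w.y, w.x))
    for i, word in enumerate(inside[:-1]):
        # Ignore case and a trailing colon when comparing against the stop word.
        if word.text.rstrip(":").lower() == stop_word.lower():
            return inside[i + 1].text
    return None


# Assumed sample data: an invoice header plus an unrelated "No:" in the footer.
words = [
    Word("Invoice", 400, 50), Word("No:", 450, 50), Word("INV-10023", 480, 50),
    Word("No:", 50, 700), Word("999", 80, 700),
]
header = Region(left=380, top=40, right=600, bottom=60)
print(find_identifier(words, header, "No"))
```

Because the search is limited to the nominated region, the second "No:" in the footer is never considered, which is the main advantage over scanning the full page text.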
Now my question is if this is at all possible with the Acrobat SDK? Should I just stick to the current method instead?
Thank you for your input!

@try67: Thanks for replying! Yeah, I'm trying to move away from using iTextSharp and PDFBox because the process is very confusing and prone to errors for our end-users. I don't really know the Adobe Acrobat products apart from using Adobe Reader, so I'm not entirely sure how the JavaScript part should work... Basically this process should be automated in a Windows Service according to the setup done on the ASP.NET part of the application by the user. I'll repost my question in the SDK forums, or alternatively a moderator is welcome to move and / or merge the threads?

Similar Messages

  • Analyze PDF Content

    mmm, assumptions are dangerous... I'm not bound to the PDF format for the processing part, in fact any software printer could work. If my clients can print to the software printer from Pastel or whatever financial package they are using, they are sorted! From there the file gets uploaded to our server to be processed, split up and saved in PDF format. The only part where the PDF format is important, is when the end-users log in and view their documents, obviously because the PDF format is not bound to platform and has a wider acceptance amongst end-users...
    I'm investigating other document formats because of the exorbitant licensing fees for PDF server software and the apparent lack of solutions that don't involve bending over backwards when not using those server packages... If you could point me in the right direction, I'd be very grateful and happy to investigate PDF and build a prototype to test. The project sponsor is willing to spend on server software, but most of the packages we've come across so far are in the $4,000 range and upwards. Is there an Adobe product that can provide the functionality explained in the first post without the $4,000 price tag?

  • Servlet retrieves pdf content stored in oracle and sends this content

    I have PDF content stored in an Oracle database. The database column type is LONG RAW. I am retrieving the contents using JDBC's getBinaryStream. The content type is set to application/pdf. When the code is executed, Acrobat is launched but the content is not displayed. After analyzing the contents retrieved from the database, I discovered that characters were missing. I replicated the same code in a different language and it worked.
    Has anyone else encountered this problem? If so, how was it resolved?
    Any ideas or comments would be appreciated.
    z.

    What is the character set of the Oracle DB that you are using? I would suggest setting the character set to Unicode and trying again.
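Bytes going missing from a binary payload are often a sign that the data passed through a text (character) conversion somewhere along the way. That is an assumption about the cause here, not something the thread confirms. The Python sketch below illustrates the difference: a chunked byte-for-byte copy preserves the payload, while a round trip through a character decoder silently drops bytes. The helper name `copy_binary` is hypothetical.

```python
import io


def copy_binary(source, sink, chunk_size: int = 8192) -> int:
    """Copy a binary stream to another in chunks; return the byte count."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Stand-in for PDF data: a header followed by every possible byte value.
payload = b"%PDF-1.4\n" + bytes(range(256))

out = io.BytesIO()
copied = copy_binary(io.BytesIO(payload), out)
print(copied == len(payload), out.getvalue() == payload)

# Treating the same bytes as text loses everything the decoder cannot map.
damaged = payload.decode("ascii", errors="ignore").encode("ascii")
print(len(payload) - len(damaged), "bytes lost through text conversion")
```

The same principle applies in a servlet: read from the binary stream and write to the response's binary output stream, never through a character writer.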

  • Firefox is stumbling when pages with PDF content come up. how can i get my old version3.5.6 back?

    Firefox is stumbling when pages with PDF content come up. How can I get my old version 3.5.6 back?

    Thanks for your response Cor el, I'll try that the next time it happens, it's completely random. One moment it functions normally then suddenly it changes its mind.
    Restarting my pc usually brings it back to normal until next time.
    One thing I've noticed, though I'm not 100% sure there's a connection: I edit my websites online and log into cPanel, which opens new pages in separate tabs, as it should. It seems that this problem occurs when I'm logged into cPanel. If I'm logged in and then open a new tab for regular browsing of other sites, all links on those sites also open in new tabs.
    I can't say for sure that it has only ever happened when logged into cPanel, but it certainly did on the last 3 occasions.
    At the moment it's behaving normally, so there's not much I can do to check your suggestions, but I will try them the next time it occurs.
    One other thing I forgot to mention: since I've had the new Firefox, I've noticed that when I log into Yahoo and post comments, I often find myself having to log on for each new post. Everything was fine with the old Firefox and there have been no other changes to my PC.

  • "This is an Adobe Illustrator file that was saved without PDF content."

    I have a client that has been working on a file in AI CS, never left his machine, but when he went to open it today he gets a message that says this file "was generated by a newer version of Illustrator. Would you like to import this file? Some data loss may occur".
    And when we click the "Import" button, we then get:
    "This is an Adobe® Illustrator® file that was saved without PDF content. To place or open this file in other applications, it should be re-saved from Adobe Illustrator with the "Create PDF Compatible File" options turned on. This option is in the Illustrator Native Format Options dialog box, which appears when saving an Adobe Illustrator file using the Save As command."
    However, even though this was created in AI CS, on all other machines and in all other versions of AI (even CS3, 3.3, 2, etc.) we get the exact same error message. How it can be newer than CS 3.3 when it was created in CS is strange to begin with.
    I assume this is some sort of corruption, but was wondering if there were any cures. It seems our backups have backed up this file in its current (corrupt) state, which is a huge issue at this point.
    Thank you very much for any help!

    I have sometimes had this trouble since I updated Acrobat.
    "This is an Adobe® Illustrator® file that was saved without PDF content ..."
    Not sure what's wrong - it seems to happen even though the Illy file is saved with PDF content and everything embedded.
    Another thing: When viewing pdfs under Advanced>Print Production>Output Preview I repeatedly get an RGB simulation preview even though I know that the file is CMYK. So I have to choose a CMYK simulation preview and then things look o.k.
    Is there any way to stop the RGB preview coming up? I note that Convert Colours messes things up completely, so it's only the preview that's wrong.
    Any ideas, Wade?

  • PDF content disposition on IE

    Hi,
    I am having difficulties in displaying PDF content from JSP. Obviously URL ends with mypdffile.jsp. From the JSP, I set attributes as follows. It works as intended on Firefox. Not on IE;
    response.setContentType("application/pdf;charset=UTF-8");
    response.setHeader("Content-Disposition", "inline;filename=\"Document.pdf\"");
    Any suggestions how I can fix this mess?

    On IE, it shows nothing! I can see that it reads something then nothing shows up. It is meant to display inside the viewing frame without creating external frames. If I change to "Content-disposition: attachment; filename=ddd.pdf", it comes out with PDF view frame. If you move the frame, you see lots of dirty traces on Windows frame!
    It may be that IE is not handling the extension well, or size information may be required.
    JSP is convenient to write changeable reports. PDF is generated using iText.
    Regards.
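The poster guesses that size information may be required. As a language-neutral illustration (written in Python rather than JSP), the sketch below builds the three headers such a response would typically carry, including an explicit Content-Length. The helper `pdf_response_headers` is hypothetical, and whether adding the length fixes the IE behaviour described above is not confirmed by this thread.

```python
def pdf_response_headers(filename: str, body: bytes, inline: bool = True) -> dict:
    """Build HTTP headers for serving a PDF body (illustrative helper)."""
    # Escape backslashes and quotes so the quoted filename stays well-formed.
    safe_name = filename.replace("\\", "\\\\").replace('"', '\\"')
    disposition = "inline" if inline else "attachment"
    return {
        # PDF is binary, so no charset parameter is attached to the type.
        "Content-Type": "application/pdf",
        "Content-Disposition": f'{disposition}; filename="{safe_name}"',
        # Exact byte length of the payload, known before the body is sent.
        "Content-Length": str(len(body)),
    }


for name, value in pdf_response_headers("Document.pdf", b"%PDF-1.4").items():
    print(f"{name}: {value}")
```

In a JSP or servlet this corresponds to generating the PDF into a byte buffer first, so that the length is known before anything is written to the response.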

  • I want to open an Illustrator file in Illustrator CS4 but I am getting this message "This is an Adobe Illustrator file that was saved without PDF content..."

    I want to open an Illustrator file in Illustrator CS4, but I am getting this message: "This is an Adobe Illustrator file that was saved without PDF content..." I tried the Save As option with "Create PDF Compatible File" turned on, but the same message still shows. PLEASE HELP.
    Thanks in advance.

    I have uploaded the files to my dropbox - Dropbox - AI
    I would really appreciate it if anyone is willing to down-save the version of these files to Illustrator CS4 and send it to me.
    thanks

  • Displaying PDF content in Android Air app - how?

    Hi, I'm fairly new to developing AIR apps for mobile devices. I'm looking to have my app display PDF files and can't find a definitive way of doing it. I've read that using StageWebView would be the way to do it.
    Here's the relevant part of my code (PDF path changed):
    if (StageWebView.isSupported) {
        currentState = "normal";
        webView.stage = stage;
        webView.viewPort = new Rectangle(20, 100, 450, 450);
        webView.addEventListener(LocationChangeEvent.LOCATION_CHANGE, onURLChange);
        webView.loadURL("http://path.to.my.pdf");
        addEventListener(ViewNavigatorEvent.REMOVING, onRemove);
    } else {
        currentState = "unsupported";
        lblSupport.text = "StageWebView feature not supported";
    }
    It's working up until the point of actually displaying the PDF content, where I get nothing (a blank screen).
    I'm using Flash Builder 4.5 / Air 2.6 and debugging on a Motorola Xoom tablet. Adobe Reader is installed on the device.
    Any help with this would be greatly appreciated.

    Solved.
    For some reason I couldn't access the HTMLLoader object outside of the function that initialised it, despite the fact that I was listening for EVENT_COMPLETE.
    Basically, to get round this bug, just put all the initialise and load code in the same function and you should be laughing.

  • PDF content not searchable from Sharepoint 2010 Sever tried every thing

    I have checked many blogs and tried many options, but I am unable to search PDF content using the Adobe IFilter. Here is what I have done:
    (1) Installed the Adobe filter from
    http://www.adobe.com/support/downloads/detail.jsp?ftpID=4025
    (2) Set the environment variable in the Control Panel to the bin directory of the IFilter on the server.
    (3) Downloaded the PDF icon picture (17x17) from the Adobe web site http://www.adobe.com/misc/linking.html and copied it to
    C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\TEMPLATE\IMAGES\
    Backed up the docicon.xml file and added an entry for the PDF icon to docicon.xml in
    C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\TEMPLATE\XML
    <Mapping Key="pdf" Value="pdficon17.gif">
    (4) Added the PDF file type on the File Type page under Search Service.
    (5) Added the following registry entries:
    \\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\14.0\Search\Setup\Filters
    "Extension"="pdf"
    "FileTypeBucket"=dword:00000001
    "MimeTypes"="application/pdf"
    \\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\14.0\Search\Setup\ContentIndexCommon\Filters\Extension
    Added .pdf and set the value below:
    {E8978DA6-047F-4E3D-9C78-CDBE46041603}
    (6) Stopped and started the search service using:
    Net Stop OSearch14
    Net Start OSearch14
    (7) Restarted the server.
    (8) Reset the index.
    (9) Ran a full crawl.
    I am still unable to search PDF content from SharePoint search.
    Can anyone help with this? I will be thankful.
    regards
    MCTS,ITIL

    Hello Shahid,
    Your question has been quite clearly answered here http://social.technet.microsoft.com/Forums/sharepoint/en-US/126076e8-7511-44cc-9c59-ce8cdd534758/sharepoint-foundation-2010-ocr-help?forum=sharepointgeneralprevious
    If you are looking for an OCR solution that will allow you to use the full-text search without having to dive into all the technicalities of it, maybe Udocx is something for you (see Udocx.com). It allows you to upload documents from your MFP and desktop directly into SharePoint folders including OCR and metadata.
    Cheers,
    Iris

  • Inserting pdf content to database

    Hi,
    I am working with Adobe LiveCycle and inserting PDF content into a database through a web service. LiveCycle is installed on my machine, so data insertion works from my machine, but when I try to insert data from other systems in our network it does not work. Adobe Reader 9 is installed on those machines. Is it necessary to register any other DLL in the GAC for this to work on a client machine? If so, please specify the DLL. I am waiting for your reply; please reply as soon as possible.

    Hi,
    I suspect that you also have the full version of Adobe Acrobat on your PC. That is why the data connections work on your system.
    Unless you Reader Enable your form with LiveCycle Reader Extensions Enterprise Suite, web service calls and data connections will not work in Reader. So if your users have Reader, the data connections will not work on their PCs.
    Here is a summary of functionality in Acrobat and Reader.
    Good luck,
    Niall

  • Passing PDF content thru XI

    Hi All,
    I have a scenario where the sender issues a synchronous SOAP request with PO details to XI. XI then calls an RFC using the request payload. The RFC's response is a table of PDF content. Now I have to pass this PDF data back to the sender as it is. The sender will later take care of displaying the PDF output. I just need to pass this table data unchanged to the sender.
    Any idea how to achieve this?
    Below I am displaying the response payload from the RFC.
    I am getting the exception below in the response mapping.
    Runtime exception occurred during execution of application mapping program com/sap/xi/tf/_MM_RFCShipDoc_To_ShippingDocDataResponse_: com.sap.aii.utilxi.misc.api.BaseRuntimeException; Fatal Error: com.sap.engine.lib.xml.parser.ParserException: XMLParser : #0 not allowed in Character data sections(:main:, row:252, col:6)
    Thanks,
    Smita

    Hi Smitha,
    Check this how to guide ..will help u out
    https://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/9913a954-0d01-0010-8391-8a3076440b6e
    Go through this as well
    https://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/b0c63b84-19e5-2910-fc81-f438716573d5
    Should help u out
    Regards,
    Mohd Tauseef I
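The parser error quoted in the question ("#0 not allowed in Character data sections") is what one would expect when raw binary bytes, including NUL, end up in an XML text node, since XML does not permit that character. A common way around this, which the thread itself does not confirm as the fix, is to Base64-encode the binary content before it goes into the message and decode it on the receiving side. The Python sketch below illustrates only the encoding idea; the element names `Response` and `PdfContent` are hypothetical and are not the actual XI message structure.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_pdf_in_xml(pdf_bytes: bytes, tag: str = "PdfContent") -> str:
    """Embed binary PDF data in an XML document as Base64 text."""
    root = ET.Element("Response")  # hypothetical wrapper element
    child = ET.SubElement(root, tag)
    # Base64 output uses only ASCII letters, digits, '+', '/' and '='.
    child.text = base64.b64encode(pdf_bytes).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap_pdf_from_xml(xml_text: str, tag: str = "PdfContent") -> bytes:
    """Recover the original bytes from the Base64 text node."""
    root = ET.fromstring(xml_text)
    return base64.b64decode(root.find(tag).text)


# Stand-in payload containing NUL bytes, which raw XML text cannot carry.
pdf_bytes = b"%PDF-1.4\n\x00\x01\x02binary\x00stream"
message = wrap_pdf_in_xml(pdf_bytes)
print(message)
print(unwrap_pdf_from_xml(message) == pdf_bytes)
```

Base64 grows the payload by roughly a third, which is the usual trade-off for carrying binary data inside an XML message.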

  • WD Abap - show pdf content

    hi folks,
    I'm trying to show PDF content (type binary) with WD ABAP, but how?
    - With Adobe interactive forms: I bound pdfsource to the binary value, but I always get the message "ugly scripts are running at that url".
    - "Normal" interactive forms with a normal interface rather than a PDF data stream work.
    we use NW2004s SP 05
    any ideas?
    kind regards oliver

    hi thomas,
    Well, I did not see the URL, but I got the following ACF trace:
    ERROR|20060103160230|WD0284_5376|CAcfControl::ParseParameter|ReadArrayFromUrl failed, Url= http://cat01130.tyrolit.com:8080/sap/bc/webdynpro/tyed/ca_atcp_cre_0001/~wd_key43ACC6054ACC02C8000000000A180A82/acf_control?sap-contextid=SID%3aANON%3acat01130_NY1_01%3aIUyg6v83oBfW_5oMpeaC3BmYWOS9DRej4fSmLpQq-NEW&sap-wd-clientWindowId=2855&sap-wd-resource-id=WD0284&sap-wd-filedownload=X|HRESULT=-2147467259(Unspecified error)
    ERROR|20060103160230|WD0284_5376|CAcfControl::ParsePropertyParameters|ParseParameter failed|HRESULT=-2147467259(Unspecified error)
    ERROR|20060103160230|WD0284_5376|CAcfControl::SendDataFromClientToAcf|SetProperty: ParsePropertyParameters failed for property dataSource|HRESULT=-2147467259(Unspecified error)
    ERROR|20060103160230|WD0284_5376|CAcfControl::invoke|InovkeInternal failed|HRESULT=-2147467259(Unspecified error)

  • How can we send PDF content from R3 to SAP-UI5 using ODATA

    Hi Experts,
    My PDF content is in XSTRING format on R3, and I am trying to send that content to my UI5 app, but I cannot find a suitable type for my entity type properties.
    Please help me.
    Regards
    Saumya

    Hi,
    Refer to this discussion thread: Generate PDF file in backend, and "send" it to SAP Gateway
    Regards,
    Chandra

  • Does acrobat has any option to read pdf content objects(stamps, layers, images etc) without using pl

    Does Acrobat have any option to read PDF content objects (stamps, layers, images, etc.) without using a plug-in, from C#.NET?
    My requirement is to read a PDF file and extract all the page objects (stamps, layers, and images) along with their coordinates. Is it possible to do this from C#.NET without using plug-ins?
    Please help me.

    These are very different things.
    JavaScript has some access to layers (called OCG). Layers are just names and have no coordinates to retrieve.
    Stamps may be annotations; there is some minimal access to these too I think. It may be hard to identify what is, and is not, a stamp even with a plug-in.
    Images are part of the actual page contents, and are only accessible to plug-ins.

  • This is an Adobe Illustrator File that was saved without PDF Content.

    Hi everybody,
    I'm working in illustrator CS6 and I am trying to save my file as PDF. However I can't seem to do it.
    I save the file as an .ai file and tick the "Create PDF Compatible File" option. Then, when I click Save As, opt for the PDF version and click Save, the PDF file comes up with no images and just this message:
    This is an Adobe® Illustrator® File that was saved without PDF Content. To Place or open this file in other applications, it should be re-saved from Adobe Illustrator with the "Create PDF Compatible File" option turned on. This option is in the Illustrator Native Format Options dialog box, which appears when saving an Adobe Illustrator file using the Save As command.
    This hasn't happened before when saving to PDF. If you could offer any help or advice I would much appreciate it.
    Cheers
    R

    Rosie,
    This is seriously weird: even if you have Create PDF Compatible File unticked you should be able to Save As PDF.
    There may be something corrupt, so the list may be worth trying:
    The following is a general list of things you may try when the issue is not in a specific file, and when it is not caused by issues with opening a file from external media, see below. You may have tried/done some of them already; 1) and 2) are the easy ones for temporary strangenesses, and 3) and 4) are specifically aimed at possibly corrupt preferences); 5) is a list in itself, and 6) is the last resort.
    If possible/applicable, you should save current artwork first, of course.
    1) Close down Illy and open again;
    2) Restart the computer (you may do that up to at least 5 times);
    3) Close down Illy and press Ctrl+Alt+Shift/Cmd+Option+Shift during startup (easy but irreversible);
    4) Move the folder (follow the link with that name) with Illy closed (more tedious but also more thorough and reversible), for CS3 - CC you may find the folder here:
    https://helpx.adobe.com/illustrator/kb/preference-file-location-illustrator.html
    5) Look through and try out the relevant among the Other options (follow the link with that name, Item 7) is a list of usual suspects among other applications that may disturb and confuse Illy, Item 15) applies to CC, CS6, and maybe CS5);
    Even more seriously, you may:
    6) Uninstall (ticking the box to delete the preferences), run the Cleaner Tool (if you have CS3/CS4/CS5/CS6/CC), and reinstall.
    http://www.adobe.com/support/contact/cscleanertool.html
