Extract embedded xml from PDF/A-3b (also creation)

Hello there,
In the context of a research project, we are currently trying to extract embedded XML from a PDF/A-3b document via code.
The project deals with establishing a new invoicing standard (ZUGFeRD: ferd-net.de, German only). Invoices are expressed as XML, which is embedded in a PDF/A file.
What we are trying to achieve is extraction of the XML via Java code. For testing purposes, we are currently using a third-party SDK to extract the invoice XML by calling an .exe file and then picking up the results in Java.
I currently have only one valid example file that can be processed via this SDK. To get more data, I used the trial version of Acrobat Pro to alter the embedded XML file. To be more specific, I deleted the embedded file, added a new XML file, and used Preflight to make the PDF conform to PDF/A-3b. Although the file seems to have the same properties as the original, it can no longer be processed via the extraction SDK. Since messing around with Acrobat does not seem to get me anywhere, I am now looking into extracting data from the PDF myself.
Is there any existing implementation/library/solution for extracting the data in a Java context? The few third-party tools I found are all based on a .NET/Windows native environment. I have heard rumors about Adobe giving out tools to extract embedded data from PDF/A?
How about the other way around? Is it possible to embed XML into a PDF via Java, given that there already is a PDF file we can attach it to?
Thanks for reading; I really appreciate any help or input!
Greetings,
Florian

Hi Florian,
I would look for general purpose PDF libraries that can open a PDF and access data objects in it.
All in all it is not too difficult to get to the embedded XML, once you have a library that can access and read data structures/data objects inside a PDF file. Some understanding of the inner workings of PDF data structures will help you get the job done (e.g. read the section about embedded files in the PDF standard / ISO 32000-1, as well as the chapter about PDF syntax).
Olaf
Olaf Druemmer | Managing Director | callas software GmbH | Schoenhauser Allee 6/7 | 10119 Berlin
Tel +49.30.4439031-0 | Fax +49.30.4416402 | [email protected] | www.callassoftware.com
Amtsgericht Charlottenburg, HRB 59615 | Geschäftsführung: Olaf Drümmer, Ulrich Frotscher
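
To illustrate the data structures mentioned above, here is a deliberately naive sketch in plain Java, not a substitute for a general-purpose PDF library. It assumes the embedded file stream dictionary carries the optional `/Type /EmbeddedFile` entry, that the stream uses only the `/FlateDecode` filter, and that the file is neither encrypted nor stored inside an object stream. The class name `NaiveEmbeddedFileScanner` and its methods are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Naive sketch: finds Flate-compressed /EmbeddedFile streams by scanning raw PDF bytes. */
final class NaiveEmbeddedFileScanner {

    // Assumption: the stream dictionary carries the optional "/Type /EmbeddedFile" entry.
    private static final Pattern TYPE = Pattern.compile("/Type\\s*/EmbeddedFile(?![A-Za-z])");
    // End of the stream dictionary, the "stream" keyword, and the line break before the data.
    private static final Pattern STREAM_START = Pattern.compile(">>\\s*stream\\r?\\n");

    static List<byte[]> extract(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so String offsets equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<byte[]> files = new ArrayList<>();
        Matcher type = TYPE.matcher(text);
        Matcher streamStart = STREAM_START.matcher(text);
        int from = 0;
        while (type.find(from)) {
            if (!streamStart.find(type.end())) {
                break;
            }
            int dataStart = streamStart.end();
            int dataEnd = text.indexOf("endstream", dataStart);
            if (dataEnd < 0) {
                break;
            }
            // Inspect the dictionary text from the enclosing "obj" keyword up to the data.
            int objStart = Math.max(0, text.lastIndexOf("obj", type.start()));
            if (text.substring(objStart, dataStart).contains("/FlateDecode")) {
                files.add(inflate(pdf, dataStart, dataEnd - dataStart));
            }
            from = dataEnd;
        }
        return files;
    }

    private static byte[] inflate(byte[] data, int offset, int length) {
        Inflater inflater = new Inflater();
        inflater.setInput(data, offset, length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            // The inflater stops at the end of the zlib data and ignores the trailing line break.
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated data: return what was decoded so far
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Stream is not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Calling `NaiveEmbeddedFileScanner.extract(Files.readAllBytes(path))` would return the decoded bytes of each matching stream. A real implementation should resolve the embedded file through the document catalog and file specification dictionaries instead of pattern matching, which is exactly what a PDF library does for you.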

Similar Messages

Extracting XML from Pdf form

There is an industry standard pdf form with an underlying XML schema which can be opened in Adobe reader.
The form has a custom button on Page 2 called "export" which can be manually clicked to export the XML file.
We will have hundreds of these forms. How would I automate the extraction of this XML document?
I would prefer to just write a simple script and extract out the xml to a file folder
Thanks for your help.

Thanks Patrick.
We are thinking about using a third party native Java library to do this (http://www.qoppa.com/pdffields/jpfindex.html). I was hoping we could use acrobat reader, since everyone has it!
Here are a few more things.
1. We are a software vendor that sells its solutions, and our software needs to extract the XML from PDF. We have a Java-based program that parses this XML and does stuff with it.
2. Obviously, we would need to be able to redistribute whatever solution we use to extract the XML from PDF.
3. Can Acrobat Professional batch mode be executed from Java?
4. If so, instead of distributing a full-blown Acrobat Professional or requiring customers to buy it, is there a library that Adobe provides that we could repackage and redistribute? If so, can you send me some pointers on where I could find those libraries and how much they would cost for each distribution we do?
5. If not, are you familiar with Qoppa, or do you have recommendations for any other third-party library for Java?
Thanks a bunch!

How remove embedded font from PDF

When I print to PDF on Mac OS 10.6.8, fonts are embedded in the PDF file by default. This adds unnecessary bytes to the PDF file (the file is huge).
How can I turn off font embedding? Or how can I remove embedded fonts from a PDF file?

After opening dozens and dozens of linked files, I finally found the offending "empty line of text" in one of the AI files I placed in the INDD file. Open > Select All... then check the font panel. With mixed fonts, it was empty, but if everything was the correct font, it was filled in. It was just one AI file.
I want to thank you all for your great ideas and for sharing your experience. Onward, now.

Extracting Tiff Image from PDF document

Hi Leo,
I need to extract TIFF images from a PDF file in .NET applications. Is there any way to extract images using JavaScript, plug-ins or other APIs?
If possible, could you kindly send a code snippet?

LiveCycle is a range of products, designed (almost all) to run on
servers. Except LiveCycle Designer, bundled with Acrobat. These
provide a Java API.
http://www.adobe.com/products/livecycle/
Aandi Inston

Extract email addresses from PDF file?

Hi,
Does anybody know if there is any built-in way to extract email addresses from a PDF file in Acrobat?
I tried "Save As" text/Excel, but this is a laborious task, especially when the PDF is large!
Thanks

I've developed a script that does just that. Have a look here:
http://try67.blogspot.com/2012/02/acrobat-list-all-email-addresses.html

Want to extract data in xml from pdf.....

I am a newbie to LiveCycle ES.
I made a PDF form design.
Now I need a process which can extract the data in XML format
from the PDF form.
Please give me an example I can understand, or step-by-step information.

Hi Arun,
Are you using a WHERE condition in the SELECT statement when fetching the records?
If so, check whether the fields in the WHERE condition are primary key fields; otherwise create a secondary index for those
non-primary-key fields in the WHERE condition.
This may help you.
Thanks and Regards,
Prakash.K

Help: XSU and extracting embedded xml

Hello
I am trying to pull data out of a relational database as XML using XSU. Some of the varchar2 columns in the database have embedded XML in them. When I pass in my select statement, the returned XML escapes the embedded XML's special characters (turning them into entities), thus defeating the use of the embedded XML.
Is there some way to force XSU to NOT escape special characters so the embedded XML will remain intact? If not what would be the next best way to extract XML automatically from relational tables? My goal is to store the resulting XML from a query into a clob stored back in the database. So I'm not sure if I could even use XSQL pages.
Any help would be appreciated.
Kenneth

XSU currently cannot be told to leave the special characters unescaped.
From 9i onward, however, the problem can be solved by storing your XML as XMLType.
For now you can use XSLT to work around the problem; please refer to the documentation demo for XSQL.

Extract Embedded XML within XML using XSLT

Hi,
We have a unique scenario where our incoming payload is coming from Oracle Database table which has one column of CLOB type storing complete raw XML.
We need to extract this embedded raw XML and process it further, each XML has a unique XSD associated and we have that details in a separate column.
So our DBAdapter incoming payload looks like below
<rows>
<row>
<xml_xsd>xyz.xsd</xml_xsd>
<xml>RAW XML DATA</xml>
</row>
<row>
<xml_xsd>xyzv2.xsd</xml_xsd>
<xml>RAW XML DATA</xml>
</row>
<row>
<xml_xsd>xyzv2.xsd</xml_xsd>
<xml>RAW XML DATA</xml>
</row>
<row>
<xml_xsd>xyzv3.xsd</xml_xsd>
<xml>RAW XML DATA</xml>
</row>
</rows>
How can we leverage an XSL transformation to extract the embedded XML in each row? I need each individual XML document to be available for further mapping. I can split the payload per XSD using XPath filtering, but I cannot find a way to parse the embedded XML and assign it to a target schema.
Research so far points to either doing two transformations to get the resulting XML or, if available, using the Saxon parser with its parse() extension.
Any other ideas or suggestions would be helpful. The challenge here is performance, as I need to do this in bulk and will have many rows to process.
Thanks in advance.

Hi,
You don't have a finite set of XSDs, and you probably won't be creating a variable for each type of XSD.
Secondly, as far as I know, XSLT doesn't support dynamic XPath.
Question:
Do you really need the XSD to do the validation?
A possible solution to your question would be a Java approach like the one below: pass in the XML and the XPath query.
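    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    // Prints the value of every node matched by the query and returns the last one
    // (null if nothing matched or an error occurred).
    public String evalXpath(String xml, String xpathQuery) {
        String xpathResult = null;
        DocumentBuilderFactory domFactory =
            DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder builder = domFactory.newDocumentBuilder();
            InputSource is = new InputSource(new StringReader(xml));
            Document dDoc = builder.parse(is);
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xPath.evaluate(xpathQuery, dDoc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                // getNodeValue() is null for element nodes; select text() or attributes
                xpathResult = nodes.item(i).getNodeValue();
                System.out.println(xpathResult);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return xpathResult;
    }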
Thanks,
Rosh
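
The two-step idea discussed in this thread (parse the outer payload, then parse each row's text as its own document) can be sketched in plain Java as follows. This assumes the embedded XML arrives as escaped text or CDATA inside `<xml>`, which is how a CLOB value would normally be serialized; the class `EmbeddedXmlRows` and its `Row` type are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: parses the outer payload, then parses each row's embedded XML separately. */
final class EmbeddedXmlRows {

    /** One row of the payload: the XSD name plus the parsed embedded document. */
    static final class Row {
        final String xsd;
        final Document document;

        Row(String xsd, Document document) {
            this.xsd = xsd;
            this.document = document;
        }
    }

    static List<Row> extractRows(String payload) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // One builder is reused for all rows to avoid repeated setup cost in bulk runs.
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document outer = builder.parse(new InputSource(new StringReader(payload)));
            NodeList rows = outer.getElementsByTagName("row");
            List<Row> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                String xsd = row.getElementsByTagName("xml_xsd").item(0).getTextContent().trim();
                // getTextContent() resolves entities and CDATA, yielding the raw embedded XML.
                String rawXml = row.getElementsByTagName("xml").item(0).getTextContent().trim();
                Document inner = builder.parse(new InputSource(new StringReader(rawXml)));
                result.add(new Row(xsd, inner));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Payload or embedded XML is not well-formed", e);
        }
    }
}
```

Each returned `Row` keeps the XSD name next to the parsed document, so later mapping code could choose its target schema per row without evaluating XPath dynamically.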

How to extract the image from pdf file

Hi friends,
Is it possible to extract the images on a page of a PDF file?
If so, please share how.
Thanks in advance,
abu

In later versions of Acrobat you can select an Image with the Select tool, then right-hand click for Save options.
------------->
It helps if you quote your exact version of Adobe Acrobat/Reader - choose [Help, About...] to find this.
Also useful: Version numbers of other software (e.g. Word) if relevant. Age of computer and amount of memory (RAM) available (right-clicking on 'My Computer' and choosing Properties gives you this, plus processor speed).

How can I extract single pages from pdf document

how can I extract a single page from pdf document

Purchase and install Acrobat XI.
Open a multi-page PDF.
Use the click path of:
Tools - Pages - Under "Manipulate Pages": Extract
Be well...

Extracting Linked images from PDF

Good afternoon,
I have a PDF which I created in Illustrator, but unfortunately I had a problem with my computer which meant that the recovered files were corrupted. I found a PDF on disc which opens in Acrobat just fine, but when I open it in Illustrator it asks for the linked images, which of course I do not have.
The linked images must be in the PDF as they show up ok.
Is there a way of extracting the images from the PDF so that I can do a repair in illustrator?
Hope someone can help.
Thank you,
Kirk

If you have Photoshop, open that PDF from it: File/Open, then choose Images...
Another option, if you have Acrobat Pro, is to go to Advanced/Document Processing/Export Images.

How to extract word coordinates from PDF using vc++6.0

With the SDK, I only know how to get word coordinates from a PDF using JavaScript, with the rest done in VB. But I don't know how to get the coordinates through VC++ 6.0. Can anyone help me?
Thank you in advance!

PDEWordFinder is the usual method for getting words and co-ordinates.
PDFEdit is not usually used, it is not suitable for getting text.
It is very hard work to make the two worlds work together (e.g. to
edit text you find).
Aandi Inston

Extracting the xml from string and parse it

Hi all,
I have a web service, and calling one of its methods returns XML data, but the data is stored in a string.
For example:
String str = keysstub.getUserLMLArray(UserID, hash, Provider, Filter.ALL, TimeStampString).getXmlResults();
returns
<id>123456</id><id>123457</id><id>123458</id><id>123459</id><id>123461</id>
and stores it in str.
I have to read this XML from the string and parse it to retrieve the data from it.
Please suggest how I can parse this XML from the string. Code snippets from anyone would be of great help.
Thanks and Regards,
Shikha

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    /**
     * Get DOM document from a string containing valid XML.
     * @param string String to read XML content from.
     * @param varargs Optional arguments: 1: Validating?, 2: NamespaceAware?
     * @return DOM document or null if the input string is null.
     * @throws Exception if parsing failed.
     */
    static public Document toDocument(String string, boolean... varargs) throws Exception {
        Document result = null;
        if (string != null) {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            if (varargs != null && varargs.length > 0) {
                int count = varargs.length;
                if (count > 0) {
                    factory.setValidating(varargs[0]); //needs error handler
                }
                if (count > 1) {
                    factory.setNamespaceAware(varargs[1]);
                }
            }
            DocumentBuilder db = factory.newDocumentBuilder();
            result = db.parse(new InputSource(new StringReader(string)));
        }//else: input unavailable
        return result;
    }//toDocument()
And don't forget that the string must be valid XML, so in your example the header is missing and a single root element must enclose all the other elements, e.g.
<?xml version="1.0"?>
<ids>
<id>123456</id><id>123457</id><id>123458</id><id>123459</id><id>123461</id>
</ids>
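
As a small illustration of the wrapping step described above, the following sketch encloses the fragment in a root element before parsing it and then collects the text of each `<id>`. The class name `IdFragmentParser` is invented for this example, and it assumes the fragment itself carries no XML declaration (the declaration is optional, so none is added here).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: wraps a rootless sequence of <id> elements in one root element and reads the values. */
final class IdFragmentParser {

    static List<String> parseIds(String fragment) {
        // A well-formed document needs exactly one root element, so add one around the fragment.
        String wrapped = "<ids>" + fragment + "</ids>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));
            NodeList nodes = doc.getElementsByTagName("id");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(nodes.item(i).getTextContent());
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // prints [123456, 123457]
        System.out.println(parseIds("<id>123456</id><id>123457</id>"));
    }
}
```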

Extracting Embedded Files From a PDF with Adobe.APS

I'm not sure if this needs to be here in security or if it needs to be in another forum, so if an admin feels it needs to move to a different forum, feel free to.
The situation I am in is that I get about 100-150 PDFs a day that I need to extract an embedded file from. Right now this is a completely manual process and is very time consuming. What I have been trying to do is automate the extraction.
The issue I am running into is that the files are encrypted with Adobe.APS, so my Java code can't handle the security, and I can't find any other software that handles Adobe.APS.
I was wondering if Adobe had a product that could do this, or if there was an API that could handle this. I can perform the extraction on a platform of any flavor (Windows, Mac, Linux, etc.).
Any help in this regard would be greatly appreciated. Thanks.

In my case I have a drop-down list of files with preloaded filenames to attach. Here is the code that works for me on a click of button.
var selectFileName = form1.subform.DropDownList1.rawValue;
if (selectFileName != "Please Select") {
    var doc = event.target;
    doc.importDataObject(selectFileName);
    var MyPar1 = doc.getDataObject(selectFileName);
    var filename = MyPar1.path;
}
After you click the button, it opens a Windows dialog box asking you to choose the file and adds the attachment to the attachment pane at runtime. To view the attachments, simply open the attachment pane at runtime after you add them.
Good luck,
SekharN

Cannot Extract Embedded Font in PDF Report Generated Using Crystal Reports XI

The issue is that sometimes the PDF reports generated by Crystal Report XI are not showing the fonts properly. Arial font is used in the Crystal report template(.rpt file) for field label and data elements. When the report is opened in Adobe Reader, it displays the message - "Cannot extract the embedded font 'AAAAAA+ArialBold'. some characters may not display or print correctly." Other PDF Readers like Foxit show blank spaces instead. I tried to change the font to Courier New, Times New Roman etc but still getting the same message.
The issue is consistent when the application is deployed in Windows server. When deployed in Unix server the issue happens very occasionally. The font files are available and installed in the UNIX box.
Please tell me why this error message is shown and also tell me how to get it fixed.

Hi Scot
Just replied to your email re. conf call.
In the meantime.
1st question.
    Some of our developers are creating the reports under crystal 2008, while some are 2011.
    Could this be a problem, when they are run by the SDK?
    Should we upgrade all our developers to crystal 2011?
The version of CR a report is created in should not matter at all. I am 100% certain this is not the issue.
2nd question
    When the report is run, to create the pdf... why would it actually embed the font on the pdf?
    if they weren't embedded.. wouldn't it just pull up the proper font, if the user's pc had that font on their
    pc?
    if it does the common fonts aren't embedded, we would just have to worry about the things like
    barcode 3 of 9
I can not answer this as that essentially is the way the product is designed. It may look like a limitation, but there are no work-arounds and I would not consider this to be a bug. Perhaps an enhancement, but a pretty faint hope there. And even if, enhancements take very, very low priority and take long, long time before implementation.
3rd question
    Would the fonts, just need to be installed on the server?   we've seen some indication, that they
    also need to be in a java sub-directory
The fonts must be on the server, as that is where the report engine does its work; the report is then streamed to the user's browser. That is, no work is done in the browser. Also, the process must have rights to access the fonts.
One last point. You mention deployment to Linux. Linux is not supported. Please see: SAP Crystal Reports, Developer Version for Microsoft Visual Studio - Supported Platforms
I suspect this is more an issue with CR not working with the framework for Linux.
- Ludek
