Extracting pdf that's embedded in xml

I need to extract/parse a PDF file that is embedded in an XML file. I'm able to do it with the Microsoft XML Parser, but I want to do it on Unix using Java.
I've tried FOP, but it is only good for creating PDFs out of text and does not seem able to handle embedded binary.
Any help would be appreciated.

Your question is confused. You have a PDF file embedded in your XML? If that's so, then FOP would be useless because its purpose is to create PDF files, not to read them. And embedded binary? Most binary values in the 0x00 to 0x1f range are invalid XML characters, so no XML parser should handle them. (Although Microsoft hasn't always been good at implementing other people's standards.)
Perhaps you could restate your question. If you just have a PDF file embedded as a giant text node in XML, then it shouldn't be too hard to extract it using any DOM technology. But somehow I don't think that's what you have.
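
For the simple case described above (the PDF stored as one text node), here is a minimal sketch. It assumes the PDF is base64-encoded, which the original post never confirms, and the element name `pdf` is a placeholder. It is written in Python for brevity; the same steps map to a DOM parser plus `java.util.Base64` in Java.

```python
import base64
import xml.etree.ElementTree as ET


def extract_pdf(xml_text, tag="pdf"):
    """Decode the base64 text of the first <tag> element into PDF bytes.

    The element name "pdf" is a placeholder, not taken from the original post.
    """
    root = ET.fromstring(xml_text)
    node = root if root.tag == tag else root.find(f".//{tag}")
    if node is None or not node.text:
        raise ValueError(f"no <{tag}> element with text content found")
    # Base64 inside XML is often wrapped across lines, so drop all whitespace.
    data = base64.b64decode("".join(node.text.split()))
    # A PDF file normally starts with the "%PDF-" marker; use it as a sanity check.
    if not data.startswith(b"%PDF-"):
        raise ValueError("decoded content does not look like a PDF")
    return data


if __name__ == "__main__":
    payload = b"%PDF-1.4\n%%EOF\n"  # stand-in bytes, not a complete PDF
    encoded = base64.b64encode(payload).decode("ascii")
    sample = f"<doc><pdf>\n  {encoded}\n</pdf></doc>"
    with open("extracted.pdf", "wb") as out:
        out.write(extract_pdf(sample))
```

If the real document uses namespaces, the tag passed in would need the `{namespace-uri}name` form that ElementTree expects.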

Similar Messages

Need to extract PDF form answers into an XML file

Hi Adobe gurus, we have a requirement like this. We have a Time Tracking and Project Monitoring System whose DB is Oracle 10g R2. We want to automate our "Meeting Minute" processing.
1. The project lead will write minutes into a form PDF. i.e. a PDF where people can type in information into fields and also tick check-boxes etc.
2. The PDF will be e-mailed to the project manager.
3. The project manager will save the PDF in a directory on the hard disk.
4. Then he will run a program.
5. The program will pick up all the PDFs in that directory one by one.
6. For each PDF, the program should read the form fields, get their values and create an .XML file for it.
7. Then another program will read the XML files, extract the information and store it in a DB against each project lead.
I went through the thread PDF to XML conversion, but unfortunately it has no complete solution. This problem is present for lots of people and I would be really grateful if Adobe experts can give a complete solution. In order to make it easy to answer, I have made a small questionnaire below:
(A.) When creating a PDF form (i.e. a PDF where you can type answers to questions, tick checkboxes, etc.), can you create the PDF with structure? For example, the field into which a user types the project name should be identifiable (by a tag or something like PROJECT_NAME) when we create an XML file out of it later. Is this called creating a tagged PDF?
(B.) Does this mean that we cannot convert an untagged PDF to XML?
(C.) In order to convert a PDF to XML, do we need an XSD or DTD? I ask because some converters from the web, like this one, ask for the XSD, and this tool, which converts a PDF to XML, asks for a rule set before converting the PDF. So is this necessary? I.e., do we have to have our own XSD and rule set, or does the PDF->XML utility convert to XML based on predefined Adobe PDF tags?
(D.) Do we HAVE TO use the Adobe LiveCycle ES DLLs to do this? I ask because most of the free PDF-to-XML converters give wrong results and come with no guarantee.
(E.) Can you please elaborate on the process of converting PDFs to XML? Please note that we are doing this using a program (i.e. we process a batch of PDFs).
(F.) If we use the Adobe DLLs, do we have to purchase the Adobe LiveCycle ES product?
(G.) If we purchase Adobe LC ES, can we use the DLLs from Java or .NET? Or is it possible to call the Adobe DLLs only from C or C++ (I read about this on the net)?
(H.) Is Adobe LiveCycle ES a separate product from the Adobe SDK and Adobe PDF writer?
Your advice would be greatly appreciated.
Thanks in advance.
Ravi de Silva.

Thanks. However, I just want the user to be able to hit the Submit by E-mail button and e-mail the form to the manager. I believe that this is what this button should do, except that it converts the file to .XML format.
Still looking for help. thanks

Can you post a PDF that has embedded media in it, onto a website?

I have embedded audio and video in a PDF. Now I want to post it on a website so that people can view it but not download it. Is this possible?

If it can be viewed, it can be downloaded.

Is it possible to insert image that is embedded in XML tag value?

Hello,
I am trying to import a blog XML. Blogs can have images in their bodies. So can I have this image imported and placed in content place holder along with other text in the XML? I am using import XML here....thanks

I don't know that it's a flaw just because it's not the way you want it. That appears to be just how the app is. That's not a bad idea, though. You could give Apple feedback here and suggest it: Apple - Mac OS X - Feedback

Extracting Base64 images embedded inside XML and converting them into PDF using BizTalk

Hi,
I'm presently working on a scenario where we will be getting huge XML files containing Base64-encoded images. The scenario goes like this:
1) The client will drop the XML files with embedded Base64 images in an SFTP location.
2) First, we have to extract the Base64-encoded images and the metadata from the XML file.
3) Second, we need to convert the extracted Base64-encoded images into multiple PDFs.
4) Then merge the PDFs into a single file.
5) Then the merged PDF will be stored in a particular location.
6) The files are presumed to be very big (~1 GB per XML file), so we need to take care of performance as well.
The sample XML looks like this:
<ns0:tran xmlns:ns0="http://Sample.Schemas.Record_XML">
<tranheader>
</tranheader>
<item>
    <image>
      <frontimage>
        <frontimage> image 1 part 1</frontimage>
      </frontimage>
      <rearimage>
        <rearimage>image 1 part 2</rearimage>
      </rearimage>
      <frontimage>
        <frontimage> image 2 part 1</frontimage>
      </frontimage>
      <rearimage>
        <rearimage>image 2 part 2</rearimage>
      </rearimage>
      <frontimage>
        <frontimage> image 3 part 1</frontimage>
      </frontimage>
      <rearimage>
        <rearimage>image 3 part 2</rearimage>
      </rearimage>
    </image>
</item>
<trantrailer>
</trantrailer>
</ns0:tran>
Thanks & Regards

Do you really need to use BizTalk for this requirement? This can be done better with standard .NET code in a Windows service or scheduled task. If you want to poll, you can implement a file watcher class that picks up the file when it arrives in the SFTP folder and converts the Base64 images to PDFs.
Another point: I don't know why you want to "convert the extracted Base64 encoded image into multiple pdf" (point 3) and "then merge the PDF's into a single file" (point 4). You can create a single PDF file directly (unless there is a reason I'm not aware of for creating separate PDF files and merging them again into a single PDF file).
Anyway, if you still need to use BizTalk, you have some options in general:
Option 1:
Receive the message using a BizTalk receive location with a pass-through pipeline at the receive end.
Create a send port with a filter on the receive port name. In the send port, use a custom send pipeline. In the send pipeline, use a custom pipeline component that extracts the Base64 content from the XML file, converts the Base64-encoded image to PDF and sends the PDF file at the send port level.
Option 2 (this option works better if you have some processing based on your <tranheader> record):
Receive the message using a BizTalk receive location with a custom pipeline that strips off, decodes and stores the Base64-encoded document in a temporary store (file system). So when the message is published to the MessageBox database, it doesn't contain the hefty Base64-encoded data part; the message will be lightweight when it's published into the MessageBox.
Process the XML message (without the Base64-encoded document) with or without an orchestration, where you do the processing based on your <tranheader> record.
At the last moment, at the send port level, retrieve the stored file from the temporary store (file system) and convert the image to PDF. The hefty processing (creating and merging PDFs) can be done at the send pipeline level before sending the PDF file to the destination.
Here are the guidelines to keep in mind if you need to implement this process in BizTalk:
Try to avoid publishing the hefty message to the MessageBox.
Conversion of Base64 to PDF can only be done using .NET code. So your options for doing this conversion in BizTalk are a receive pipeline, a send pipeline, or a .NET helper in an orchestration.
Try to avoid orchestrations as much as possible, because heavy processing and message transmission are already involved.
The following articles should help you in this context:
Dealing with base64 encoded XML documents in BizTalk
To convert Base64 to PDF/JPEG using C# code:
TechNet-Wiki Code: Converting Base64 strings to Bitmap images
Convert Image to Base64 String and Base64 String to Image
Base64 encoding and decoding in .NET
Regards,
M.R.Ashwin Prabhu
If this answers your question please mark it accordingly. If this post is helpful, please vote as helpful by clicking the upward arrow mark next to my reply.
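
Whatever hosts the code, the extraction step for a ~1 GB file is best done with a streaming parser rather than by loading the whole document. The following is only a sketch of that idea, in Python rather than the .NET suggested above. It assumes the innermost <frontimage>/<rearimage> elements of the sample hold the Base64 text, and it leaves PDF creation and merging out, since that needs a PDF library.

```python
import base64
import io
import xml.etree.ElementTree as ET

# Element names taken from the sample XML in the question.
IMAGE_TAGS = {"frontimage", "rearimage"}


def iter_images(source):
    """Yield (tag, image_bytes) for each innermost image element.

    `source` is a file name or a binary file object. The XML is parsed
    incrementally, so the whole document is never held in memory at once.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Only leaf elements carry Base64 text; the outer wrappers have children.
        if elem.tag in IMAGE_TAGS and len(elem) == 0 and elem.text:
            yield elem.tag, base64.b64decode("".join(elem.text.split()))
        # Release text and children that have already been processed.
        elem.clear()


if __name__ == "__main__":
    front = base64.b64encode(b"front-1").decode("ascii")
    rear = base64.b64encode(b"rear-1").decode("ascii")
    sample = (
        '<ns0:tran xmlns:ns0="http://Sample.Schemas.Record_XML"><item><image>'
        f"<frontimage><frontimage>{front}</frontimage></frontimage>"
        f"<rearimage><rearimage>{rear}</rearimage></rearimage>"
        "</image></item></ns0:tran>"
    )
    for tag, data in iter_images(io.BytesIO(sample.encode("utf-8"))):
        print(tag, len(data))
```

Each decoded image can be written to the temporary store as soon as it is yielded, which matches the "keep the hefty payload out of the message" guideline above.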

Extract PDF embedded in XML

Hi All,
I have tried searching but did not get proper content on this topic. Could you please help.
I have to extract PDF content embedded in an XML and send it via FTP.

Hi,
Have you read this document? http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/9913a954-0d01-0010-8391-8a3076440b6e?QuickLink=index&overridelayout=true&5003637721700
http://scn.sap.com/community/pi-and-soa-middleware/blog/2009/05/17/trouble-writing-out-a-pdf-in-xipi
Regards
srinivas

Refrying PDF with subset embedded fonts fixes text extraction

Hi All,
I know it is not a good idea to (just) refry PDF files (PDF -> EPS -> PDF), especially when the PDF contains subset embedded fonts. Chances are you will end up with a PDF file that does not contain valid (searchable) text.
I did not know the opposite could also be true. The following zip file contains two PDF files, each containing two words: the original and the refried version.
Refried.zip
When selecting text from the original PDF file (using Acrobat 6 through X), the selection contains incorrect text, in this case invalid capitals. If I try the same in the refried version, the extracted text is correct.
It seems strange to me that a process which can only result in loss of information "fixes" this text issue. Somewhere the correct text must be hidden in the original PDF file. Not only capitals seem to be affected, but also random characters, which seem to be fixed once refried.
Could anyone think of an explanation?
Is there a workaround that does not require refrying the PDF (refrying often results in loss of information)? I have no influence on the PDF files I receive, therefore I cannot embed the full fonts.
I am using the C++ SDK for Acrobat to write plugins.
Any pointers would be great!
Kind regards,
Robert

Thanks again for your reply,
Your explanation makes sense.
I went ahead and removed the ToUnicode CMap just to see what would happen:
// Drop the font's ToUnicode entry if it is present (experiment only).
if (CosDictKnown(cosFont, ASAtomFromString("ToUnicode")))
    CosDictRemove(cosFont, ASAtomFromString("ToUnicode"));
As you predicted, this fixes some issues and introduces new ones.
The results differed from the refry method: in some cases the refried PDF did not contain extractable text, in other cases the PDF without the ToUnicode CMap had no extractable text.
Maybe I could combine the information from different text extraction methods to make an educated guess as to which one (or which combination) is best :S
I suppose looking at individual text runs (with all their complexity) would not help me either...
Kind regards,
Robert

Embedding fonts in an existing pdf that does not have the fonts embedded

Hi,
I have an existing PDF that fails to open on my e-reader.
So I opened it in Acrobat X Pro to see what might be the cause.
Selecting "file" --> "properties" --> "font" (fonts used in this document), I can see that this PDF uses multiple Windows TrueType fonts and that none of them are embedded in the PDF document.
My e-reader is Linux-based, so I am guessing this might be the problem.
Being an e-reader, it didn't give me much to go on.
So I would like to embed all the fonts in this list into the PDF file and save it again with the fonts embedded.
I went to "file" --> "save as optimized pdf" --> "font".
But there, both lists (embedded and unembedded fonts) are empty.
Why is this, and how can I embed the fonts shown in the document properties into the PDF?
They seem to be standard Windows fonts, e.g. ArialNarrow.

I just looked a bit further and actually checked whether the fonts exist, and found the following:
I am using Windows Vista.
The fonts mentioned in the properties of the PDF do exist among the installed Windows fonts, but in Windows they are no longer TrueType but OpenType.
So I am guessing this might be the reason they are not in the list of unembedded fonts in the optimizer window.
Is there an easy way to tell Acrobat to replace one font (not found on this computer) with another font that does exist, and also to embed that font in the PDF document?

Extract Process Chain ID/Name that an embedded ABAP program is run from?

Hi all,
I have created some Process Chains in SAP BW in which I have incorporated some ABAP program Process Types that use the same ABAP program.
In these embedded ABAP program Process Types I need to extract the name of the Process Chain they run from (ID, Technical Name, Description).
Is there any way to do this?
One solution that is not possible to implement (due to parallel runs of process chains that use the same ABAP program) is the following:
search in table RSPCVARIANT for your program as follows
- field TYPE = "ABAP"
- field FNAM = "PROGRAM"
- field LOW = <program name>
take the value of field VARIANTE and use this in table RSPCPROCESSLOG (enter a date selection for BATCHDATE as well)
take the most recent entry (should be the one you're actually running at that moment)
via field LOG_ID, retrieve CHAIN_ID (technical name of your process chain) in table RSPCLOGCHAIN
So my question here is:
- Is there any way to extract the Process Chain ID/Name that an embedded ABAP program Process Type is run from?
Thanks beforehand for your feedback!
regards
Oddmar

Hi Erik,
I am stuck with a similar requirement: I have an ABAP program in my process chain and I need the technical name of the process chain in the ABAP program at run-time.
Did you get a solution or work-around for this scenario?
Thanks in advance.
Regards,
Chetana.

How to Optimize an xml PDF that has already been filled out

I created a PDF form in LiveCycle and forgot to uncheck "embed fonts". I have already sent out the PDF form with Extended Features for Adobe Reader enabled, and I have received over 300 forms back. Is there any way to reduce the file size of forms that are already filled out? Are there any workarounds?
PLEASE HELP!! I appreciate any assistance you can provide.

Although there is a way to use the Action Wizard to re-import the XML it can be tedious to appropriately rename all the XML files.
I made a web app that my department uses to batch update all the PDFs that the recipients filled out into the latest version of the PDF.
I made a standalone version you can use here.
This application is AS IS. Use at your own risk and expense.
I'll expect my cheque in the mail ; )
Kyle

How to reduce a PDF document that contains an Adobe XML form?

Hi,
I have created some fillable PDF files using Adobe LiveCycle Designer and used Adobe Acrobat Pro XI to save them as Reader Extended PDF > Enable More Tools (includes form fill-in and save)…
The forms include images, text fields, check boxes, radio buttons, tables and other controls, as well as JavaScript behind them. The files are 1 to 3 MB in size. I tried Reduced Size PDF and Optimized PDF, but I got a message: “This PDF document contains an Adobe XML form. Such files cannot be optimized.”
My questions are:
Is there a way I can reduce the file size greatly, say to about 100 KB?
And after the size is reduced, will the form still keep its original quality in format and function?
Thanks

Hi,
if you only use default fonts like Arial or Myriad, don't embed fonts into the form at all.
If you have to use a specific font because of corporate identity, you should check whether there are fields in your form using other fonts like Arial.
Floating fields especially are a cause of big files, as they use Designer's default font even if you'll never see it in the PDF.
When using images, you should not check the "embed" checkbox, as this will save the images as a base64 stream in the XML source.
If unchecked, the images are stored as a hex stream in the PDF stream, which results in a smaller file size.
Also, don't use high-res images; 150 dpi is more than enough. You also don't need 16-bit depth, as Designer only supports 8-bit images with RGB colors.

Why a pdf document with embedded fonts can be copied but is not searchable in pdf reader

I am writing PDF files with embedded subset fonts. As required, I am including the ToUnicode and CIDSet objects. To test, I created a simple PDF with two Hebrew characters. I can select the two characters, copy them to the clipboard, and paste them properly into another application such as Word. But I am not able to search for a word containing these two characters: Adobe Reader (or Acrobat) displays the message that the word was not found. So in essence, I have created a PDF document whose text can be copied properly but is not searchable. Any idea what I might be missing when creating the document?
Additional information:
1. The file in question is a minimal file with just two characters. I have tested with many such files in many different languages, including English. None of the files are searchable.
2. Curiously, if I search for the letter 'e', Adobe Reader highlights an incorrect word, even if the letter 'e' does not exist in the file.
3. Adobe Acrobat is also not able to search within this file; however, when I save the file to another disk file, the saved file is searchable. I confirmed that the major objects, such as the font file, ToUnicode object, CID object, and font descriptor objects, are the same in the saved file. However, one of the font objects is moved closer to the top of the file.
4. Foxit is able to search these files properly.
5 0 obj
<</Filter /FlateDecode /Length 115>>
stream
        q 0.750000 0 0 0.750000 0.000000 792.000000 cm
        q q q 0.160000 0.000000 0.000000 0.160000 0.000000 0.000000 cm
        BT /F0 100.000000 Tf 0 g 750.000000 -690 Td[<02B0>] TJ 35.000000 0 Td[<02B9>] TJ ET Q
        Q
        Q
        Q
endstream
endobj
10 0 obj
<</FontName/AAAAAA+ArialUnicode/CIDSet 9 0 R /Ascent 905/CapHeight 905/Descent -212/FontFamily(Arial)/Flags 32/FontBBox [0 -212 1000 905]/ItalicAngle 0/StemV 0/FontFile2 7 0 R/Type/FontDescriptor>>
endobj
11 0 obj
<</BaseFont/AAAAAA+ArialUnicode/CIDToGIDMap/Identity/CIDSystemInfo <</Ordering(Identity)/Registry(Adobe) /Supplement 0>> /FontDescriptor 10 0 R/Subtype/CIDFontType2/Type/Font>>
endobj
12 0 obj
<</Subtype/Type0/BaseFont/AAAAAA+ArialUnicode/Encoding/Identity-H/DescendantFonts [11 0 R]/ToUnicode 8 0 R/Type/Font>>
endobj
8 0 obj
<</Filter /FlateDecode /Length 252>>
stream
        /CIDInit /ProcSet findresource begin
        12 dict begin
        begincmap
        /CIDSystemInfo
        << /Registry (Adobe)
        /Ordering (UCS) /Supplement 0 >> def
        /CMapName /Adobe-Identity-UCS def
        /CMapType 2 def
        1 begincodespacerange
        <0000> <FFFF>
        endcodespacerange
        3 beginbfchar
        <0000> <0000>
        <02B0> <05E0>
        <02B9> <05E9>
        endbfchar
        endcmap
        CMapName currentdict /CMap defineresource pop
        end
        end
endstream
endobj
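
To make the listing easier to check by hand, here is a small sketch that reads the bfchar section of the (already decompressed) ToUnicode CMap above and shows which Unicode characters the two codes in the content stream map to. It only handles simple `beginbfchar` entries, not `bfrange`, and it does not explain the search failure; it merely confirms what the CMap declares.

```python
import re
import unicodedata


def parse_bfchar(cmap_text):
    """Map character codes to Unicode strings from beginbfchar..endbfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, flags=re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values in a ToUnicode CMap are UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# The bfchar entries copied from object 8 in the post.
CMAP = """
3 beginbfchar
<0000> <0000>
<02B0> <05E0>
<02B9> <05E9>
endbfchar
"""

if __name__ == "__main__":
    table = parse_bfchar(CMAP)
    # <02B0> and <02B9> are the two codes shown by the TJ operators in object 5.
    for code in (0x02B0, 0x02B9):
        char = table[code]
        print(f"<{code:04X}> -> U+{ord(char):04X} {unicodedata.name(char)}")
```

So the two glyph codes drawn in the content stream are declared as the Hebrew letters nun and shin, which is consistent with copy and paste working.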

I figured the app might have that ability - considering you can add text, highlight, add a signature, annotate and draw - so my thought was why not delete a page, or rearrange for that matter?.. That should be an option, this way we don't have to export to one of the other apps to delete or rearrange..
Thanks for the help, Bernd.
BTW if anyone is looking - PDF Max can do all of the above and delete and rearrange. With PDF Splicer you can delete and rearrange as well, but it has no other features.
And as for Steve Werner, whose comment was deleted after it got to my inbox: it is much more than a Reader, as you can plainly see from the number of tasks the Reader app can do above.

Crystal Report that reads from an XML file Datetime or Date

I have a Crystal Reports 2008 report that reads from an XML file. The date data in the source XML file looks like this: 2008-03-10
But Crystal Reports interprets it as a datetime. I need the report to show 2008/03/10 (date), not 2008-03-10T00:00:00-05:00 (datetime).
See the example (source XML file, report, and parameter file to execute the report) at: http://www.5websoft.com/sample.zip
Import the file in the designer and you can verify that fields of type date are incorrectly interpreted as datetime; the fields are not mapped correctly.
Help.....
Thanks!

You could always reformat the field to only display the date portion:
Format Field > Date and Time tab; choose the date style you need here.
Or create a formula to extract just the date and use this field in your report:
date({table.field})
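
If the XML can be preprocessed before the report reads it, the same "keep only the date part" idea can be sketched outside Crystal Reports. This is an illustration in Python, not part of the Crystal formula language; the function name is made up for the example.

```python
from datetime import datetime


def date_only(value):
    """Format an ISO 8601 date or datetime string as YYYY/MM/DD, dropping the time."""
    return datetime.fromisoformat(value).strftime("%Y/%m/%d")


if __name__ == "__main__":
    print(date_only("2008-03-10T00:00:00-05:00"))
    print(date_only("2008-03-10"))
```

Both calls give the 2008/03/10 form asked for in the question, because the time and UTC offset are simply discarded rather than converted.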

How can I programmatically identify PDF files with embedded images?

Our company has 27,266,949 .PDF files that we're planning to compress in order to save server space.
We don't want to compress any of the .PDF files that have embedded images, so as not to alter the images' state.
How can we programmatically create a list of files to exclude from the compression process?

Ah, see, I told you we were new to this. And no, my taxes already have enough digits in the balance.
OK, so based on that, we should be able to use the preflight tool to identify the PDFs with images, factor them out, and then continue with lossless compression on the remaining balance.
That will give us the compression we need to save space, but also allow us to stand in a court of law (if the scenario were ever to occur) and proclaim that none of our medical images have ever been altered by compression.
Sound like a reasonable plan?
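
As a rough cross-check on the preflight results, a byte-level scan can flag files that declare image XObjects. This is a heuristic sketch only, not a substitute for a real PDF parser or for preflight: it would miss inline images inside compressed content streams, and it can flag objects that are no longer referenced. The function names are made up for the example.

```python
import re
from pathlib import Path

# Image XObjects are stream objects whose dictionary contains /Subtype /Image.
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def may_contain_images(pdf_bytes):
    """Return True if the raw bytes contain an image XObject dictionary entry."""
    return IMAGE_XOBJECT.search(pdf_bytes) is not None


def files_to_exclude(folder):
    """List the *.pdf files under folder that appear to contain images."""
    return sorted(
        path for path in Path(folder).rglob("*.pdf")
        if may_contain_images(path.read_bytes())
    )


if __name__ == "__main__":
    for path in files_to_exclude("."):
        print(path)
```

With millions of files, anything this scan and preflight disagree on is worth inspecting by hand before it goes into the compression batch.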

Extracting data from a tag of an XML document stored in a field of a CSV file

We have an XML document stored in a CLOB column of a CSV file. We have to extract one value from a <tag> and discard the remaining data.
Sample:-
<ROW>
<ID>100</ID>
<ORDER_DATE>2000.12.20</ORDER_DATE>
<SHIPTO_NAME>Adrian Howard</SHIPTO_NAME>
<SHIPTO_STREET>500 Marine World Parkway</SHIPTO_STREET>
<SHIPTO_CITY>Redwood City</SHIPTO_CITY>
<SHIPTO_STATE>CA</SHIPTO_STATE>
<SHIPTO_ZIP>94065</SHIPTO_ZIP>
</ROW>
Required Output:-
We have to extract "500 Marine World Parkway" from the tag <SHIPTO_STREET>.
The sample XML above is stored in one of the columns, which is of CLOB datatype.
Any idea how to do this in PL/SQL?

As BP suggested you can use an XPATH query to extract that information from your XML. However it depends a bit on your XML data.
Here are two examples:
Single-row XML:
select extractvalue(xmltype('<ROW>
<ID>100</ID>
<ORDER_DATE>2000.12.20</ORDER_DATE>
<SHIPTO_NAME>Adrian Howard</SHIPTO_NAME>
<SHIPTO_STREET>500 Marine World Parkway</SHIPTO_STREET>
<SHIPTO_CITY>Redwood City</SHIPTO_CITY>
<SHIPTO_STATE>CA</SHIPTO_STATE>
<SHIPTO_ZIP>94065</SHIPTO_ZIP>
</ROW>')
,'//SHIPTO_STREET/text()') as result
from dual;
RESULT
500 Marine World Parkway
Multi-row XML:
select extractvalue(column_value,'SHIPTO_STREET/text()') as result
from table(xmlsequence(extract(xmltype('<ROWS>
<ROW>
<ID>100</ID>
<ORDER_DATE>2000.12.20</ORDER_DATE>
<SHIPTO_NAME>Adrian Howard</SHIPTO_NAME>
<SHIPTO_STREET>500 Marine World Parkway</SHIPTO_STREET>
<SHIPTO_CITY>Redwood City</SHIPTO_CITY>
<SHIPTO_STATE>CA</SHIPTO_STATE>
<SHIPTO_ZIP>94065</SHIPTO_ZIP>
</ROW>
<ROW>
<ID>200</ID>
<ORDER_DATE>2000.12.20</ORDER_DATE>
<SHIPTO_NAME>Adrian Howard</SHIPTO_NAME>
<SHIPTO_STREET>Test</SHIPTO_STREET>
<SHIPTO_CITY>Redwood City</SHIPTO_CITY>
<SHIPTO_STATE>CA</SHIPTO_STATE>
<SHIPTO_ZIP>94065</SHIPTO_ZIP>
</ROW>
</ROWS>'
),'ROWS/ROW/SHIPTO_STREET')));
RESULT
500 Marine World Parkway
Test
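
If the CSV file can also be processed before it is loaded into the database, the same lookup can be sketched outside PL/SQL. This is an alternative illustration in Python, not what the question asked for; the column name `ORDER_XML` is hypothetical, since the post does not name the column.

```python
import csv
import io
import xml.etree.ElementTree as ET


def shipto_streets(csv_file, xml_column="ORDER_XML"):
    """Yield the SHIPTO_STREET text from the XML stored in one CSV column.

    `xml_column` is a made-up header name; adjust it to the real file.
    """
    for row in csv.DictReader(csv_file):
        root = ET.fromstring(row[xml_column])
        # ".//" finds the element at any depth below the root (<ROW> here).
        yield root.findtext(".//SHIPTO_STREET")


if __name__ == "__main__":
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["ID", "ORDER_XML"])
    writer.writerow(["1", "<ROW><ID>100</ID><SHIPTO_STREET>500 Marine World Parkway</SHIPTO_STREET></ROW>"])
    buffer.seek(0)
    for street in shipto_streets(buffer):
        print(street)
```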
