PDF to HTML or .doc conversion

Hi
Is there any Java API for converting PDF files to HTML, .doc or XML format?
It seems PDF files don't maintain any structure internally, so it's difficult to parse them or convert them to other file formats.
Comments and suggestions are welcome.
-Ven

Do you know about GObcl.com?
They have a free web service to convert a PDF file to HTML or .doc, or a .doc file to PDF.
Just mail your *.doc file to [email protected] and you will get a reply in the form of a zip file containing doc files.
Maybe you can use this in your app.
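
On the point that PDF keeps no structure inside: a page is essentially a set of text runs placed at coordinates, so a converter first has to rebuild lines from positions. Below is a minimal sketch of that step only, assuming the text runs have already been extracted by some PDF library. The `Fragment` type, the `LineGrouper` class and the tolerance value are hypothetical names chosen for illustration, not part of any real API.

```java
import java.util.ArrayList;
import java.util.List;

final class LineGrouper {
    /** A run of text at a page position (hypothetical type for this sketch). */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Groups fragments whose baselines differ by at most {@code tolerance} into one line. */
    static List<String> toLines(List<Fragment> fragments, double tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // PDF y coordinates grow upward, so a larger y is higher on the page.
        sorted.sort((a, b) -> Double.compare(b.y, a.y));

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : sorted) {
            if (!current.isEmpty() && Math.abs(current.get(0).y - f.y) > tolerance) {
                lines.add(join(current));
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    /** Orders one line left to right and joins the runs with single spaces. */
    private static String join(List<Fragment> line) {
        line.sort((a, b) -> Double.compare(a.x, b.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(60, 700.2, "world"));
        page.add(new Fragment(10, 700.0, "Hello"));
        page.add(new Fragment(10, 680.0, "Second"));
        for (String line : toLines(page, 2.0)) {
            System.out.println(line);
        }
    }
}
```

Real documents also need column detection, hyphenation handling and table recovery, which is why converted output is often poor.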

Similar Messages

Export pdf to html/txt/xml

Hi,
I downloaded "adobe acrobat x pro" for trying the "save as"/export functionality to xml/htm/text etc. and the result was exactly what I was looking for in terms of output, keeping formatting etc.
However, I am building an application which needs an embedded library in order to do PDF to HTML/TXT/XML conversion on the fly while keeping formatting.
I have tried a number of libraries for PDF to HTML/TXT/XML conversion and none of them deliver anything near what Adobe Acrobat X Pro does in terms of keeping format, tables etc.
So, my question is: how can I get access to the "save as"/export functionality of Adobe Acrobat X Pro through any official Adobe library, SDK, service or product? I assume Acrobat X Pro itself does not expose an API for the convert functionality and may not be used server-side.
Best regards,
Rick

It sounds like you want to use Acrobat as a web service. Before pursuing this route, note that such use of Acrobat is not permitted under the license, so it may not be worth pursuing. It is also fair to ask why you need to convert to HTML on a regular basis; I can understand the need on occasion.
For programmable features you should probably check in the SDK forum.

Unique issue with PDF to WORD .doc conversion with Acrobat Pro - any ideas?

I have been unable to solve the following issue when converting (save as...) PDF documents to Microsoft Word .doc using numerous methods. This could either be an issue that would be fixed in Acrobat Pro itself, or in MS Word - posting to the Adobe forums first.
PREFACE: I am attempting to use the converted .doc file with translation applications/software. Google Translator Toolkit is what I use the most, but ALL other translators are having this very same issue with the .doc file. --The source PDFs are product information from drug manufacturers in various countries that I need to have translated to English. I do not have access to their source documents, as they do not provide their own source docs for obvious reasons.
ALSO: I cannot use Google Translator toolkit to translate from PDFs directly - if you do that, it will attempt to translate a PDF and then export in an .html file, but it does not get the exact spacing of the sentences correctly, which leads to errors in translating - key things such as "can take with alcohol" and "do not take with alcohol". So that's out!
I am not having any problems with the resultant .doc file in MS Word itself. It looks right, the spacing matches the original PDF source perfectly, prints correctly, etc... Reference here on a product info sheet from Austria in German:
The problem: This is a screenshot from Google Translator Toolkit. On the right side of the image, the letter spacing in the .doc file I am uploading is not being read correctly, resulting in untranslated gibberish. (Note: this isn't a problem with the translation applications or software. All of them have this issue with .doc files converted from .pdf, and the issue isn't present with any ordinary .doc file that wasn't converted from a .pdf. It definitely has something to do with some kind of embedded data in the .doc file that I cannot isolate!)
My settings in Adobe Pro (convert from PDF to .doc):
Page layout: Flowing Text (this prevents the resultant .doc from having all of those text boxes, which also don't then work in translators)
Include comments: True
Include images: True
Run OCR if needed: True
Notes:
-I have run OCR text recognition on the source PDF files in their specific language.
-I have edited the accessibility of the PDF and have run the tag recognition and quick checks (to see if they solved the issue, which they did not - tagged or untagged, same problems!)
-I have exported the .doc BACK to PDF using MS Word's function, which results in a great looking tagged PDF. THEN I re-saved this new PDF back as a .doc - same issue.
-I have tried saving the PDF in all of the other formats that the translators accept. All have different issues. The only one that works consistently is saving to a .txt (plain)... The best is a .doc to .doc conversion, with all the original spacing. (I am not spending hours reformatting a .txt translation in word)...
I can't seem to find where this spacing data is in the .doc file! (Changing the fonts, sizes and margins doesn't fix this either.) I have tried so many methods...
Any thoughts on other things to try in Adobe Pro (or Word)?
EDIT: Here's an additional tidbit of info that may be the key to this... There's some kind of coding in the .doc that Adobe Pro converted from the source PDF that doesn't display in Word, but that is being seen by the translation programs. I have no idea what these codes are, but I want to remove them!
Message was edited by: KaotikADC

I would suggest you look at the fonts that are being used. It may be a font issue that is not properly being read by the translation program.

Clean pdf to html conversion

I am searching for a clean conversion from PDF to HTML or XML. I know that there are many solutions that preserve the positions and layout of the content. But I am searching for a conversion tool that converts into clean HTML without CSS, or into XML. Existing products create a mess of thousands of div and span tags, in which you cannot tell a header from a table.
Example (what I need):
<h1>This is a header 1</h1>
<p>This is text</p>
<img src="..... />
<h2> This is header 2 </h2>
<table>.....This is a table ....... </table>
Existing solutions:
<div style="position:.....">Text</div>
<div style="position:.....">Text</div>
<div style="position:.....">Text</div>
<div style="position:.....">Text</div>
Is there any product, which can do that? (batch conversion on servers (e.g. JAVA))

Given that CSS is part of HTML - I don't see why that would be an issue.
Since this is Adobe's forum, we offer a Java-focused server side solution called LiveCycle ES.
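
For what it's worth, the kind of output asked for above can be approximated with a heuristic: classify each extracted text block by its font size relative to the body text and emit a semantic tag instead of a positioned div. This is only a sketch under stated assumptions. The `Block` type, the `CleanHtml` class and the 1.6x / 1.2x thresholds are hypothetical, and the text blocks are assumed to come from some PDF extraction step that is not shown. Tables and images need separate detection.

```java
import java.util.ArrayList;
import java.util.List;

final class CleanHtml {
    /** A block of text with its dominant font size (hypothetical type for this sketch). */
    static final class Block {
        final String text;
        final double fontSize;

        Block(String text, double fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Emits h1/h2/p tags by comparing each block's font size with the body size. */
    static String toHtml(List<Block> blocks, double bodySize) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            String tag;
            if (b.fontSize >= bodySize * 1.6) {
                tag = "h1"; // assumed threshold for a top-level header
            } else if (b.fontSize >= bodySize * 1.2) {
                tag = "h2"; // assumed threshold for a sub-header
            } else {
                tag = "p";
            }
            sb.append('<').append(tag).append('>')
              .append(escape(b.text))
              .append("</").append(tag).append(">\n");
        }
        return sb.toString();
    }

    /** Escapes the three characters that are unsafe in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("This is a header 1", 24));
        blocks.add(new Block("This is header 2", 16));
        blocks.add(new Block("This is text", 12));
        System.out.print(toHtml(blocks, 12));
    }
}
```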

I am in the process of expanding a database of chemistry journal articles. These materials are ideally acquired in two formats when both are available-- PDF and HTML. To oversimplify, PDFs are for the user to read, and derivatives of the HTML versions a

I am in the process of expanding a database of chemistry journal articles. These materials are ideally acquired in two formats when both are available-- PDF and HTML. To oversimplify, PDFs are for the user to read, and derivatives of the HTML versions are for the computer to read. Both formats are, of course, readily recognized and indexed by Spotlight. Journal articles have two essential components with regards to a database: the topical content of the article itself, and the cited references to other scientific literature. While a PDF merely lists these references, the HTML version has, in addition, links to the cited items. Each link URL contains the digital object identifier (doi) for the item it points to. A doi is a unique string that points to one and only one object, and can be quite useful if rendered in a manner that enables indexing by Spotlight. Embedded URL's are, of course, ignored by Spotlight. As a result, HTML-formatted articles must be processed so that URL's are openly displayed as readable text before Spotlight will recognize them. Conversion to DOC format using MS Word, followed by conversion to RTF using Text Edit accomplishes this, but is quite labor intensive.
In the last few months, I have added about 3,500 articles to this collection, which means that any procedure for rendering URL's must be automated and able to process large batches of documents with minimal user oversight. This procedure needs to generate a separate file for each HTML document processed. Trials using Automator's "Get Specified Finder Items" and "Get Selected Finder Items", as well as "Ask For Finder Items" (along with "Get URLs From Web Pages") give unsatisfactory results. When provided with multiple input documents, these three commands generate output in which the URLs from multiple input items are merged into a single block, which yields a single file using "Create New Word Document" as the subsequent step. A one-to-one, input file to output file result can be obtained by processing one file at a time, but this requires manual selection of each item and one-at-a-time processing. What I need is a command that accepts multiple input documents, but processes them one at a time, generating a separate output for each file processed. Is there a way for Automator to do this?
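
Outside Automator, the one-input-file-to-one-output-file behaviour described above is straightforward to script. The sketch below is not an Automator answer; it assumes the HTML files are UTF-8, that link targets sit in quoted `href` attributes, and that writing a sibling file with a `.urls.txt` suffix is acceptable. The class name, the suffix and the sample DOI URL are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractor {
    // Matches href="..." or href='...' in any letter case; unquoted values are ignored.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns every quoted href value found in the markup, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Processes the files one at a time and writes a separate link list per input. */
    static void processEach(List<Path> htmlFiles) throws IOException {
        for (Path in : htmlFiles) {
            String html = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
            Path out = in.resolveSibling(in.getFileName() + ".urls.txt");
            Files.write(out, extractLinks(html), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        List<Path> inputs = new ArrayList<>();
        for (String arg : args) {
            inputs.add(Paths.get(arg));
        }
        processEach(inputs);
    }
}
```

Because each output is plain text with the URLs (and therefore the DOIs) written out in full, the result should be indexable in the same way as the hand-made RTF files.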

Hi,
With the project all done, I'm preparing for the presentation. I managed to get my hands on an HD beamer for the night (Epson TW2000) and am planning to do the presentation in HD.
That, of course, brought up some problems. I posted a thread which I'll repost here. Sorry for the repost; I do not normally do this, but since this thread is about the same thing, I'd like to ask you the same question. The final version is in After Effects, but that doesn't alter the question. It's about export:
"I want to export my AE project of approx 30 min containing several HD files to a Blu Ray disc. The end goal is to project the video in HD quality using the Epson EMP-TW2000 projector. This projector is HD compatible.
To project the video I need to connect the beamer to a computer capable of playing a heavy HD file (1), OR burn the project to a BRD (2) and play it using a BRplayer.
I prefer option 2, so my question is: which would be the preferred export preset?
Project specs:
                    - 1920x1080 sq pix (16:9)
                    - 25 fps
                    - my imported video files (Prem.Pro sequences) are also 25 fps and are Progressive (!)
Exporting to a BRD-compatible format raises a big problem: my project files are 25 fps and progressive, and I believe that the only Blu-ray preset displaying 1920x1080 at 25 fps requires INTERLACED video (I viewed the presets found on this forum, this thread)... There is also a progressive format, BUT then you need 30 fps (29,...).
So, is there one dimension that can be changed without changing the content of the video, and if so, which one (the interlacing or the fps)?
I'm not very familiar with the whole Blu-ray thing, I hope that someone can help me out."
Please give it a look.
Thanks,
Jef

Generate pdf and html(urgent)

Can anybody tell me how to generate PDF and HTML from a single report?
Thanks in advance.

From a single report, you can generate outputs to html, htmlcss, pdf, rtf, XML and text formats.
If you use rwclient, rwrun or rwservlet methods, specify desformat=pdf/html and the destination file name in desname command line parameters.
If you use Reports Builder, open a report, select File->Generate to file and select html/pdf. Then give the file name.
For more details, refer to the Reports Tutorial / Publishing Reports document on this site.
http://otn.oracle.com/docs/products/reports/content.html
Thanks,
The Oracle Reports team
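
For anyone driving this from code, here is a small sketch that assembles the two command lines described above, one per output format. Only `desformat` and `desname` come from the reply; the `report=` and `destype=file` keywords, the report name `sales.rdf`, and the assumption that `rwrun` is on the PATH are additions for illustration and should be checked against your Reports version. The sketch only builds and prints the arguments; the list could be handed to `ProcessBuilder` to run it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReportsCommand {
    /** Builds an rwrun argument list for one output format of one report. */
    static List<String> build(String report, String format, String outputFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("rwrun");                  // assumed to be on the PATH
        cmd.add("report=" + report);       // assumed keyword for the report module
        cmd.add("destype=file");           // assumed: write the output to a file
        cmd.add("desformat=" + format);    // pdf or html, as described in the reply
        cmd.add("desname=" + outputFile);  // destination file name
        return cmd;
    }

    public static void main(String[] args) {
        // Same report, two runs, two formats.
        for (String format : Arrays.asList("pdf", "html")) {
            System.out.println(String.join(" ", build("sales.rdf", format, "sales." + format)));
        }
    }
}
```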

How can I convert a Pages color document to a PDF black and white doc?

How can I convert a Pages color document to a PDF black and white doc? Or convert the color to B&W in a new doc?

> How can I convert a Pages color document to a PDF black and white doc? Or convert the color to B&W in a new doc?
The general idea is that you colour correct photographs once, archive, and convert with or without colour changes. The archived photograph is unchanged - or we would be colour correcting the same photographs again and again and again.
If you have a photograph with a corrected exposure, you can open the photograph in the Apple ColorSync Utility, apply a colour space conversion to a grayscale appearance using the preinstalled ICC profile, save the photograph under another name, and place that in your pagination.
If you have a paginated document with corrected exposures, and any such non-scalable bitmap or scalable spline graphics as you have chosen to add, you can render the pagination as a whole to PDF through the same ICC profile, carrying out the same colour space conversion on any and all objects.
Caveat: If you intend the pagination for certain processes, in particular offset lithography, then you are probably expected not to render the type to grayscale, but rather to render it to single ink solid black. No software can determine what printing process you intend, you have to understand a bit about printing, and how to set up general colour space conversions in software. Ask your prepress provider, and if the answer is not prompt and proficient, pick another provider.
/hh

Problems with saving pdf as a word doc

Hello, I recently upgraded to Acrobat X Pro, and I went to convert a one-page PDF into a Word doc. I went to "File," "Save As," "Word Document," and when I went to view the Word doc, the format, layout, font, etc. were completely garbled. What could I be doing wrong? Thanks!

Your problem is likely with the PDF not having the tags that provide the format information you are expecting. Such information is not needed for the PDF, but is used for the conversion back to WORD. You could try converting a file from DOC to PDF and back to DOC with both a print to the Adobe PDF printer and PDF Maker to get the format info. See how they compare and you should get the idea.

Issue with Exporting Pdf as HTML

Hi all,
I am trying to convert a PDF to HTML using "Save as Other -> HTML (Web Page)" in the File menu. After conversion, I see that some of the characters set in the Symbol font are not converted properly (e.g. the text 8.7 appears as "" instead of numerals). Any thoughts on this issue? Thanks in advance.
Regards
Srini

I can't see this working very well. Even if it exported a correct reference to the symbol font, people who view the converted page probably won't have the font and won't see the symbols. You can perhaps replace them with graphics.

Exporting a PDF to HTML without OCR

Hi All,
I am using Adobe Acrobat X to export PDFs to HTML files. It looks like the HTML conversion runs an OCR process on the document before the HTML page is written. This results in a lot of images not showing properly, because the OCR process strips out the text and puts it in the body of the HTML rather than recognizing that it should be part of the image. I had used Acrobat 9 to convert to HTML in the past and this was not an issue.
Is there any possible way to disable the OCR portion of the HTML conversion in Acrobat X?
Thanks,
Teri

Hi Teri,
Edit > Preferences > Converting from PDF > HTML. Click 'Edit Settings..' and uncheck 'Run OCR if needed'.
-David

Can't download word docs or PDF's from Google Docs

I cannot download word docs or pdf's from Google docs. I can download MP3's.
I get an error message saying IE can't find the website.
The Google Docs forum response to this problem was to uninstall the current version of Flash and install Flash 10.1. I'd like to try this suggestion but don't know how to implement the procedure, where to find Flash 10.1, or which version of it to install.
What is the best way to do that?
I currently have Flash Player 10.3.181.14 installed
I use IE Ver 8.0.6001.1870
And Windows XP pro version 5.1.2600 SP 3 Build 2600
Thanks in advance for any help you can offer

I cannot quite understand how Google Docs is related to Flash Player. But anyway, to revert to an earlier Flash Player version:
download the FP uninstaller from http://kb2.adobe.com/cps/141/tn_14157.html and save it to disk;
download the archived FP 10.1 installers from http://kb2.adobe.com/cps/142/tn_14266.html and save them to disk;
close all browser instances, then run the downloaded uninstaller;
extract the latest 10.1 installer (ActiveX for IE, plugin for other browsers) and run it.
Don't hesitate to ask again if my instructions are not clear.

HTML to Tiff conversion

Hi,
I want to perform HTML to TIFF conversion.
The HTML file is on my system and I want to convert it into a TIFF file from my Java code. The HTML file contains some formatted text and 3-4 images.
I have a GUI tool that takes an HTML file path, takes a snapshot of the HTML and converts it into TIFF. But I want to do this in my own Java program.
Does Java provide an API for doing this, or does any other vendor provide one as a jar, so that I can include the jar and call its API for the conversion?
Thanks,
Manish

Do you know of a method in the xdk that takes a well formed HTML doc and using xsd / xslt convert back to original xml spec?
Because you created (and as long as you create) the HTML from XML it will be well formed (every tag will be ended with an end-tag) and you can therefore transform it back into XML.
Most times it will not be possible to convert HTML found on the 'internet' into XML because this HTML is not well formed. For example, many people forget to end a paragraph of text within HTML with the </p> tag.
We are evaluating using xslt to convert the XML to a form based medium for content maintenance. Wondering if once a XML document is parsed to HTML (DOM) can it be parsed back to XML for subsequent update to stored value in blob column. Specifically interested in conversion (parser) from HTML to XML
Simply can HTML (in DOM format validated against a xsd) be transformed back to XML ?
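
The well-formedness point above can be checked directly with the standard JAXP parser: markup that parses as XML can then be fed to an XSLT transform, while markup with unclosed tags is rejected up front. This is a minimal sketch using only the JDK's `javax.xml.parsers` API rather than the XDK; the class and method names are hypothetical. It does not validate against an XSD and does not harden the parser against external entities.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    /** Returns true if the markup parses as XML, i.e. every tag is properly closed and nested. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well formed, so it cannot be transformed as XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser unavailable or input unreadable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><p>text</p></body></html>"));
        System.out.println(isWellFormed("<html><body><p>text</body></html>"));
    }
}
```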

I want to convert pdf to html

Hello
My English is very poor, sorry.
I want to read a PDF and display it in a web page (using JSP).
I have two problems:
First, rendering the PDF as HTML (not just the text; JPedal is good for that).
Second, saving the images (JPedal can do this), but I cannot find out the image positions.
In any case, I want to convert PDF to HTML.
Please recommend a good library.

codingMonkey wrote:
DanCrintea wrote:
HTML to PDF with Java, using OpenOffice.org - example here: http://www.dancrintea.ro/html-to-pdf/
You can use OpenOffice.org running as a server and command it remotely for document conversion.
Besides HTML to PDF, other conversions are also possible:
doc --> pdf, html, txt, rtf
xls --> pdf, html, csv
ppt --> pdf, swf
Code example:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import officetools.OfficeFile; // this is my tools package
// source HTML file and destination PDF file
FileInputStream fis = new FileInputStream(new File("c:/test.html"));
FileOutputStream fos = new FileOutputStream(new File("c:/test.pdf"));
// suppose OpenOffice.org runs on localhost, port 8100
OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
f.convert(fos,"pdf");
Methinks someone is close to getting their account blocked for resurrecting hordes of zombies.
Indeed. Abuse already reported.

Adobe Reader won't allow PDF to save as doc

I have Windows 8 and Adobe Reader. I don't use the touchscreen features because my laptop is not touchscreen capable. I have been trying for 2 hours to convert a PDF file to a DOC so that I can fill out the form in Microsoft Word.
When I pull up the PDF in Adobe, I click the "save as" option and it takes me to the save section. The only options I have in the save screen are where I want to save the file, the description (file name), the file type (for which the only option is PDF; no other file type is available) and, finally, the save button.
I am desperately trying to change the file type to a DOC and have hit a dead end. I have tried renaming the file and adding ".doc" at the end to change the format. I have tried opening Word and opening the file that way, but the characters are wrong; it looks like a bunch of jumbled characters instead of the PDF. Finally, I opened "Reader" and attempted to copy/paste into Word, with the same results as in Adobe, and the save as function there won't let me change it to a doc either.
I'm totally stuck. I need to change this PDF to a Word document but I'm having the worst time doing this! Does anyone know a way to change a PDF file to a document other than trying to pull it up in Word, changing it in Adobe or trying to rename the file? I desperately need help!

The free Reader has never been able to convert PDF files to any other format, and most probably never will. You can try using the free 30 day trial of Acrobat to attempt a conversion. And if I may say so, if someone created a form in PDF format, and expects it to be filled and submitted in PDF format, I don't think s/he will be happy receiving it in Word format.
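
As a side note on why renaming did not help: the extension is only a label, and the file contents stay PDF, which is why Word showed jumbled characters. A quick way to confirm what a file really is is to look at its first bytes, since PDF files normally begin with the marker `%PDF-`. The sketch below assumes the marker sits at the very start of the file (lenient readers may accept it slightly later); the class name and the `form.doc` file name are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PdfSniffer {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the PDF header marker. */
    static boolean looksLikePdf(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads just enough of the file to test for the marker, whatever its extension. */
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // file is shorter than the marker
                }
                filled += n;
            }
        }
        return filled == head.length && looksLikePdf(head);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "form.doc");
        System.out.println(file + " is really a PDF: " + looksLikePdf(file));
    }
}
```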

Using the Export to PDF online service, I get "Conversion Failure" error message; why?

Using the Export to PDF online service, I get "Conversion Failure" error message; why?
The document was originally created with Word (*.doc) then printed to PDF format; now I need to convert back to Word, but I get the "Conversion Failure" error message, both on conversion back to *.doc format and on an attempt to convert to *.docx format.

Hi lfordlaw,
How large is the PDF that you're trying to convert? How did you generate the PDF from Word (did you use Acrobat, or Word's built-in PDF generator)? What version of Word are you using?
The answers to those questions should help us get to the bottom of things.
In the meantime, please try the following:
Clear the browser cache and log back in to https://cloud.acrobat.com.
Try a different web browser (see System Requirements | Adobe Acrobat Pro and Online Services for a list of supported browsers).
Try to convert a different file (if you can convert another file without error, the one you're receiving the error on may be damaged).
I look forward to hearing back from you.
Best,
Sara
