How to parse a html document?

I am trying to parse an HTML document that I load from a URL over the internet. The HTML is not well formed, but that's OK. The problem is that the document builder throws an exception because the document is not well formed.
Can I parse an HTML document using the document builder?
Please note that I set validating to false, and the parse still fails with a fatal error saying that the <meta> tag must have a corresponding </meta> tag.
I am using code like the following:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
DocumentBuilder db = factory.newDocumentBuilder();
Document doc = db.parse(urlString);

"The html is not well formed but thats ok."
No, it isn't.
"Validation" means checking that the XML conforms to a schema or a DTD. Don't confuse that with checking whether the XML is well-formed, which means whether it follows the basic rules of XML like opening tags have to have matching closing tags. Which is what your message is telling you -- your file isn't well-formed XML.
So sure, you can parse HTML or anything else with an XML parser, just be prepared to be told it isn't well-formed XML.
If you want to clean up HTML so that it's well-formed XML, there are products like HTMLTidy and JTidy that will do that for you.
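To make the distinction between validation and well-formedness concrete, here is a minimal sketch using only the JDK's own parser. The class name WellFormedCheck and the two sample strings are made up for illustration. Validation is switched off in both cases, yet the unclosed <meta> tag is still rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
public class WellFormedCheck {

    /** Returns true if the text parses as well-formed XML. Validation is off. */
    public static boolean isWellFormed(String text) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // only disables DTD validation
            DocumentBuilder db = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <meta> has no closing tag.
        String html = "<html><head><meta name=\"a\" content=\"b\"></head><body/></html>";
        // XHTML style: <meta/> is self-closed.
        String xhtml = "<html><head><meta name=\"a\" content=\"b\"/></head><body/></html>";
        System.out.println("HTML well-formed?  " + isWellFormed(html));
        System.out.println("XHTML well-formed? " + isWellFormed(xhtml));
    }
}
```

Running a tool such as JTidy first is one way to get from the first form to the second.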

Similar Messages

Parsing and HTML document

Does anyone know how to parse an HTML document without having JEditorPane (pain-in-the-neck) do it for you?

If you want to do a small amount of parsing, then it would make sense to write a custom program as described in the previous response. If, however, you want to do a lot of parsing, then it might make more sense to try to make use of an XML parser. If you are trying to parse html pages that are your own, then you might want to think about transforming them into xhtml so that an XML parser will be able to process them.
XML parsing is easy. I recently developed a web site for a set of mock exams for the Java Programmer Certification. Originally, I started developing the pages in HTML, but I quickly realized that I would have a hard time managing the exams in that format. I then organized the exam into a set of XML documents, one document for each topic. To publish a set of cross-topic exams, I use JDOM (with the help of SAX) to load all of the questions into the Java Collections Framework, where I can easily organize a set of four cross-topic exams. I also use JDOM to number the questions and answers before writing the new exams out to a new set of four XML files. Then I use XSLT to transform the four exam.xml documents into eight HTML files: four for the questions and four for the answers.
If you would like to take a look at the result, then please use the following link.
http://www.geocities.com/danchisholm2000/
If you own the html files that you want to parse, then I would try to find a way to transform them into valid xml. XHTML might be a good choice.
Dan Chisholm
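As a rough illustration of the XML-to-HTML step described above, here is a minimal sketch using the JDK's built-in XSLT support (javax.xml.transform). The exam format (<exam>, <question>, <text>), the stylesheet, and the class name ExamToHtml are all assumptions made for the example, not the actual files from the site.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class; the exam vocabulary is invented for illustration.
public class ExamToHtml {

    // Stylesheet: emits one numbered list item per <question> element.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/exam'>"
        + "<html><body><ol>"
        + "<xsl:for-each select='question'><li><xsl:value-of select='text'/></li></xsl:for-each>"
        + "</ol></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms an exam XML string into an HTML string. */
    public static String transform(String examXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(examXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String exam = "<exam>"
                + "<question><text>What is a class</text></question>"
                + "<question><text>What is an interface</text></question>"
                + "</exam>";
        System.out.println(transform(exam));
    }
}
```

In a real setup the stylesheet and the exam documents would live in files, and the same Transformer could be reused for each exam.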

Problem parsing a html document

Hi all,
I need to parse a html document.
InputStream is = new java.io.FileInputStream(new File("c:/temp/htmldoc.html"));
DOMFragmentParser DOMparser = new DOMFragmentParser();
DocumentFragment doc = new HTMLDocumentImpl().createDocumentFragment();
DOMparser.parse(new InputSource(is), doc);
NodeList nl = doc.getChildNodes();
I get just the following 3 nodes, even though htmldoc.html is a proper HTML document:
#document-fragment
HTML
#text
Any suggestions/help are most welcome. Thanks

Here's an example showing how to do this via javax.xml:
import java.io.*;
import java.net.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
public class HTMLElementLister {
     public static void main(String[] args) throws Exception {
          URLConnection con = new URL("http://www.mywebsite.com/index.html").openConnection();
          con.connect();
          InputStream in = con.getInputStream();
          Document doc = null;
          try {
               DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
               DocumentBuilder db = dbf.newDocumentBuilder();
               // Throws a SAXParseException unless the page is well-formed XML (e.g. XHTML)
               doc = db.parse(in);
          } finally {
               in.close();
          }
          // List the top-level nodes of the document
          NodeList nodes = doc.getChildNodes();
          for (int i=0; i<nodes.getLength(); i++) {
               Node node = nodes.item(i);
               String nodeName = node.getNodeName();
               System.out.println(nodeName);
               if ("html".equalsIgnoreCase(nodeName)) {
                    System.out.println("|");
                    // List the children of <html>, typically head and body
                    NodeList grandkids = node.getChildNodes();
                    for (int j=0; j<grandkids.getLength(); j++) {
                         Node contentNode = grandkids.item(j);
                         nodeName = contentNode.getNodeName();
                         System.out.println("|- " + nodeName);
                         if ("body".equalsIgnoreCase(nodeName)) {
                              System.out.println("   |");
                              // List the direct children of <body>
                              NodeList bodyNodes = contentNode.getChildNodes();
                              for (int k=0; k<bodyNodes.getLength(); k++) {
                                   Node bodyNode = bodyNodes.item(k);
                                   System.out.println("   |- " + bodyNode.getNodeName());
                              }
                         }
                    }
               }
          }
     }
}

Parsing an HTML document

I want to parse an HTML document and replace its anchor tags with my own on the fly. Can anybody suggest how to do it, please?
Ajay

If your HTML files are not well-formed (which is likely for most HTML files), for example because attribute values are not enclosed in quotation marks, most XML parsers will fail.
Anand from this forum introduced me to JTidy, and it worked very well. It is an HTML parser that is able to tidy up your HTML code.

How to Parse an HTML File?

Hi all
I want to parse an HTML file?
How is it possible?
After taking an HTML file as input, I need to parse it and print or modify values based on some tags.
Please help me, how to parse an HTML file?

You start by reading the first character and then continue until you reach the last character.
For a more serious answer, try elaborating on your question. It's really, really vague.

Parsing a HTML document

I want to parse an HTML file and extract the data from it, eliminating the HTML tags. Can anybody give me some idea how to do that?
The HTML file may contain javascript functions also.

Hi,
here is a method for replacing strings in a text:
http://forums.java.sun.com/thread.jsp?forum=31&thread=185221
I know it isn't exactly what you want, but maybe it helps you to begin.
regards
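For the original question of dropping the tags and keeping only the text, one option that ships with the JDK is the lenient HTML parser in javax.swing.text.html.parser, which does not require well-formed input. The sketch below is only an illustration: the class name HtmlTextExtractor is made up, and how the parser reports the body of script elements is something worth checking against your own pages.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
public class HtmlTextExtractor {

    /** Returns the text content of the HTML, with each text run separated by a space. */
    public static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        // The callback receives one call per run of text between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        // Note the unclosed <br>: this input is not well-formed XML.
        String html = "<html><body><p>Hello <b>world</b><br>again</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The same callback class also has handleStartTag, handleEndTag and handleSimpleTag methods, which could be overridden to react to particular tags.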

How to parse a HTML file using HTML parser in J2SE?

I want to parse an HTML file using HTML parser. Can any body help me by providing a sample code to parse the HTML file?
Thanks and Cheers,
Amaresh

What HTML parser and what does "parsing" mean to you?

How to parse multiple xml documents from single buffer

Hello,
I am trying to use jaxb 2.0 to parse a buffer which contains multiple xml documents. However, it seems that it is meant to only parse a single document at a time and throws an exception when it gets to the 2nd document.
Is there a way I can tell jaxb to only parse the first complete document and not fetch the next one out of the buffer? Or what is the most efficient way to separate the buffer into two documents without parsing it manually. If I have to search the buffer for the next document root and then split the buffer, it seems like that defeats the purpose of using jaxb as the parser.
I am using the Unmarshaller.unmarshal method and the exception I am getting is:
org.xml.sax.SAXParseException: Illegal character at end of document, <.]
     at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)
     at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.createUnmarshalException(UnmarshallerImpl.java:476)
     at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:198)
     at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:167)
     at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:137)
     at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:184)
Thank you for your help

It's just like any other XML parser, it's only designed to parse one XML document. If you have something that concatenates two XML documents together (that's what your "buffer" sounds like), then stop doing that.
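If the concatenation cannot be stopped at the source, one possible workaround (my assumption, not a JAXB feature) is to wrap the whole buffer in a synthetic root element, parse it once with a DOM parser, and then treat each child element as its own document. This only works when the individual documents carry no XML declaration or DOCTYPE. The names MultiDocSplitter and wrapper are made up for the sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of JAXB or any other library.
public class MultiDocSplitter {

    /** Returns the root element of each document found in the buffer, in order. */
    public static List<Element> parseAll(String buffer) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The synthetic <wrapper> root turns the concatenation into one well-formed document.
            String wrapped = "<wrapper>" + buffer + "</wrapper>";
            Document doc = db.parse(new InputSource(new StringReader(wrapped)));
            List<Element> roots = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child instanceof Element) { // skip whitespace text between documents
                    roots.add((Element) child);
                }
            }
            return roots;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String buffer = "<order id='1'/>\n<invoice id='2'/>";
        for (Element root : parseAll(buffer)) {
            System.out.println(root.getTagName());
        }
    }
}
```

Each returned Element could then be passed to Unmarshaller.unmarshal(Node), which accepts a DOM node as its source.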

I don't know how to edit a html document

I know how to CREATE a html doc using MSWord but don't know how to edit it once it's done. Is there a Firefox add on or plug in that lets one do this? Or is a different program needed.
Specifically, I had a MSWord doc with several jpeg images and, when I converted to html, all were stripped out and I want to put them back in.
How?

I don't know either, but this is the forum for the 5g iPods, you may want to post in the Nano forum.

How to parser a HTML page to get its variable and values?

Hi, everyone, here is my situation:
I need to parse an HTML page to get the variables and their associated values between the <form>...</form> tags. For example, if you have a piece of HTML like the one below:
<form>
<input type = "hidden" name = "para1" value = "value1">
<select name = "para2">
<option>value2</option>
</form>
The actual page is much more complex than this. I want to retrieve para1 = value1 and para2 = value2. I tried JTidy, but it doesn't recognize select. Could you recommend a good package for this purpose, preferably with sample code?
Thanks a lot
Kevin

See for example Request taglib from Coldtags suite:
http://www.servletsuite.com/jsp.htm

How to parse an XML document with oracle8i

Has anyone a good link or an example how to decode and store an XML document into an oracle8i database.
I've only found good material for Oracle9i.
Thank you
Roger

Here is an example of parsing XML, taken from the Oracle8i 8.1.7 XDK.
This one parses external OS files, but it could easily be converted to
use a CLOB or VARCHAR2 string for parsing XML documents.
If you want to use a CLOB to store and manipulate XML documents, you can use the XMLParser and XMLDom
packages along with the DBMS_LOB package to do that.
-- This file demonstrates a simple use of the parser and DOM API.
-- The XML file that is given to the application is parsed and the
-- elements and attributes in the document are printed.
-- The use of setting the parser options is demonstrated.
set serveroutput on;
create or replace procedure domsample(dir varchar2, inpfile varchar2,
errfile varchar2) is
p xmlparser.parser;
doc xmldom.DOMDocument;
-- prints elements in a document
procedure printElements(doc xmldom.DOMDocument) is
nl xmldom.DOMNodeList;
len number;
n xmldom.DOMNode;
begin
-- get all elements
nl := xmldom.getElementsByTagName(doc, '*');
len := xmldom.getLength(nl);
-- loop through elements
for i in 0..len-1 loop
n := xmldom.item(nl, i);
dbms_output.put(xmldom.getNodeName(n) || ' ');
end loop;
dbms_output.put_line('');
end printElements;
-- prints the attributes of each element in a document
procedure printElementAttributes(doc xmldom.DOMDocument) is
nl xmldom.DOMNodeList;
len1 number;
len2 number;
n xmldom.DOMNode;
e xmldom.DOMElement;
nnm xmldom.DOMNamedNodeMap;
attrname varchar2(100);
attrval varchar2(100);
begin
-- get all elements
nl := xmldom.getElementsByTagName(doc, '*');
len1 := xmldom.getLength(nl);
-- loop through elements
for j in 0..len1-1 loop
n := xmldom.item(nl, j);
e := xmldom.makeElement(n);
dbms_output.put_line(xmldom.getTagName(e) || ':');
-- get all attributes of element
nnm := xmldom.getAttributes(n);
if (xmldom.isNull(nnm) = FALSE) then
len2 := xmldom.getLength(nnm);
-- loop through attributes
for i in 0..len2-1 loop
n := xmldom.item(nnm, i);
attrname := xmldom.getNodeName(n);
attrval := xmldom.getNodeValue(n);
dbms_output.put(' ' || attrname || ' = ' || attrval);
end loop;
dbms_output.put_line('');
end if;
end loop;
end printElementAttributes;
begin
-- new parser
p := xmlparser.newParser;
-- set some characteristics
xmlparser.setValidationMode(p, FALSE);
xmlparser.setErrorLog(p, dir || '/' || errfile);
xmlparser.setBaseDir(p, dir);
-- parse input file
xmlparser.parse(p, dir || '/' || inpfile);
-- get document
doc := xmlparser.getDocument(p);
-- Print document elements
dbms_output.put('The elements are: ');
printElements(doc);
-- Print document element attributes
dbms_output.put_line('The attributes of each element are: ');
printElementAttributes(doc);
-- deal with exceptions
exception
when xmldom.INDEX_SIZE_ERR then
raise_application_error(-20120, 'Index Size error');
when xmldom.DOMSTRING_SIZE_ERR then
raise_application_error(-20120, 'String Size error');
when xmldom.HIERARCHY_REQUEST_ERR then
raise_application_error(-20120, 'Hierarchy request error');
when xmldom.WRONG_DOCUMENT_ERR then
raise_application_error(-20120, 'Wrong doc error');
when xmldom.INVALID_CHARACTER_ERR then
raise_application_error(-20120, 'Invalid Char error');
when xmldom.NO_DATA_ALLOWED_ERR then
raise_application_error(-20120, 'No data allowed error');
when xmldom.NO_MODIFICATION_ALLOWED_ERR then
raise_application_error(-20120, 'No mod allowed error');
when xmldom.NOT_FOUND_ERR then
raise_application_error(-20120, 'Not found error');
when xmldom.NOT_SUPPORTED_ERR then
raise_application_error(-20120, 'Not supported error');
when xmldom.INUSE_ATTRIBUTE_ERR then
raise_application_error(-20120, 'In use attr error');
end domsample;
show errors;

How to parser a html page and get useful information?

I fetch the page by its URL. After getting the whole page,
is there any way to keep the useful text and discard the rest, such as ads and
other related links?
I have tried using java.util.regex.*.
Are there any other methods for doing this?

Regex isn't a good method unless your requirements are quite simple. In general if you want a Java HTML parser they are not hard to find -- "java html parser" is a good choice of keywords for an internet search.

How to print a html document in landscape mode with css

Hi ,
Below is my code, which I need to print in landscape mode by default using CSS. Is there any way to do this?
<html><head><style type="text/css">
table.automatic {table-layout:fixed; width:100%; border-collapse: collapse; word-wrap:break-word;}TR.head{font-family: verdana; font-size: 10pt; vertical-align: top ; color: #ffffff; background-color: black; font-weight: bold; }TR.data{font-family: verdana; font-size: 8pt; vertical-align: top}</style></head><body><table width="100%" border="0" cellspacing="0" cellpadding="4"><tbody><tr><td></td><td align="center"><font size="2" style="verdana"><b>EMERGENCY ICE Installs by Manager And Production Install Date Range</b></font><br><td></td><table width="100%" border="0" cellspacing="0" cellpadding="4"><thead><tr><td width="74%" align="center"><font size="1" style="verdana"><b>From:</b>09/01/07</font><font size="1" style="verdana"><b>&nbsp&nbspTo:</b>01/08/08</font></td></tr></thead></table></td></tr></tbody></table><br></br>
<table border="1" class="automatic"><thead><tr class="head"><td width="8%">Manager</td><td width="5%">ICE # </td><td width="10%">Plan Prod Install Date</td><td width="10%">Doc Status</td><td width="10%">Doc Title</td><td width="7%">Owner</td><td width="5%">Risk Desc</td><td width="4%">AS/400</td><td width="4%">Unix</td><td width="4%">Win Srv</td><td width="4%">Win Dsk</td><td width="4%">Vax</td><td width="5%">Target Loc</td><td width="10%">No Chg Risk Desc</td><td width="15%">Brief Chg Desc</td></tr></thead><tbody>
<tr class="data"><td rowspan="12" width="5%">Randal Hockenberry</td><td width="8%">16926</td><td width="10%">Dec 31,2007</td><td width="10%">Pending Manager-Business Review</td><td width="10%">AUTO APPROVE</td><td width="7%">Padma Chirumalla</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">Y</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="5%">TEST</td><td width="10%">TEST</td><td width="15%">TESTzxdcxvsrfdgdsgfdffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff</td></tr>
<tr class="data"><td width="8%">16933</td><td width="10%">Dec 28,2007</td><td width="10%">Pending Non-QS Testing</td><td width="10%">SR</td><td width="7%">Padma Chirumalla</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">Y</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="5%">XC</td><td width="10%">DFV</td><td width="15%">CX</td></tr>
<tr class="data"><td width="8%">16927</td><td width="10%">Dec 28,2007</td><td width="10%">Pending Risk Review</td><td width="10%">AUTO APPROVAL TEST</td><td width="7%">Padma Chirumalla</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">Y</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="5%">AUTO</td><td width="10%">AUTO APPROVE</td><td width="15%">AUTO APPROVE</td></tr>
<tr class="data"><td width="8%">15926</td><td width="10%">Nov 27,2007</td><td width="10%">Pending Code Review</td><td width="10%">CM Install LEVEL 3 QS will test</td><td width="7%">Gary Zhong</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">N</td><td width="4%">Y</td><td width="4%">N</td><td width="4%">Y</td><td width="5%">pitt99</td><td width="10%">testing the app</td><td width="15%">CM Install LEVEL 3 QS will test</td></tr>
<tr class="data"><td width="8%">15712</td><td width="10%">Nov 22,2007</td><td width="10%">Rejected</td><td width="10%">Test Documentation Request</td><td width="7%">Autumn Priddy</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="5%">x</td><td width="10%">none</td><td width="15%">run-through for documentation purposes</td></tr>
<tr class="data"><td width="8%">16029</td><td width="10%">Nov 21,2007</td><td width="10%">Rejected</td><td width="10%">Contractor</td><td width="7%">Padma Chirumalla</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">Y</td><td width="4%">N</td><td width="5%">Windows</td><td width="10%">testing</td><td width="15%">Testing</td></tr>
<tr class="data"><td width="8%">16030</td><td width="10%">Nov 19,2007</td><td width="10%">Rejected</td><td width="10%">Contractor</td><td width="7%">Padma Chirumalla</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">Y</td><td width="4%">N</td><td width="5%">Windows</td><td width="10%">testing</td><td width="15%">Testing</td></tr>
<tr class="data"><td width="8%">15840</td><td width="10%">Nov 10,2007</td><td width="10%">Pending Follow Up Review</td><td width="10%">SMS Install LEVEL 3</td><td width="7%">Stephen Sciullo</td><td width="5%">Low</td><td width="4%">Y</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">Y</td><td width="5%">pitt88</td><td width="10%">system testing</td><td width="15%">SMS Install LEVEL 3</td></tr>
<tr class="data"><td width="8%">15719</td><td width="10%">Nov 08,2007</td><td width="10%">Pending Test Install</td><td width="10%">CM Install LEVEL 1 pitt3/col3</td><td width="7%">Stephen Sciullo</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">Y</td><td width="5%">pitt88</td><td width="10%">testing</td><td width="15%">testing</td></tr>
<tr class="data"><td width="8%">14815</td><td width="10%">Nov 03,2007</td><td width="10%">Rejected</td><td width="10%">test</td><td width="7%">Stephen Sciullo</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">N</td><td width="4%">Y</td><td width="4%">Y</td><td width="4%">N</td><td width="5%">pitt2</td><td width="10%">test</td><td width="15%">test</td></tr>
<tr class="data"><td width="8%">14915</td><td width="10%">Nov 01,2007</td><td width="10%">Pending Follow Up Review</td><td width="10%">Contractor</td><td width="7%">Padma Chirumalla</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">Y</td><td width="4%">N</td><td width="5%">windows</td><td width="10%">test</td><td width="15%">Test</td></tr>
<tr class="data"><td width="8%">14816</td><td width="10%">Oct 31,2007;Nov 01,2007;Nov 02,2007</td><td width="10%">Production Install Complete</td><td width="10%">Contractor</td><td width="7%">Padma Chirumalla</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">Y</td><td width="4%">N</td><td width="5%">windows</td><td width="10%">test</td><td width="15%">test</td></tr>
<tr></tr>
<tr class="data"><td rowspan="1" width="5%">Linda Humm</td><td width="8%">16034</td><td width="10%">Dec 06,2007</td><td width="10%">Pending Production Install</td><td width="10%">CM Install LEVEL 1 pitt3/col3</td><td width="7%">Randal Hockenberry</td><td width="5%">Low</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">N</td><td width="4%">Y</td><td width="5%">pitt88</td><td width="10%">testing</td><td width="15%">testing</td></tr>
<tr></tr>
</tbody></table></body></html>
Thanks,

Why are you asking a CSS question in a Java/JSP/JSTL forum? I would rather use a CSS forum for that. That HTML code is unnecessary too. Just post relevant code only, such as that CSS snippet.
Anyway, it may differ per web browser if this page rule will be picked up and it may also be a client side setting regarding printer settings. Nothing to do here.

Servlets and XML - How to Parse the XML Document?

I would like to create a servlet, which when accessed, will retrieve an
XML file remotely over the Internet, and parse it to the browser's
screen.
Can anyone point to example code that will help me along?
Thanks.

Cross post:
http://forum.java.sun.com/thread.jspa?threadID=5114779&messageID=9391940#9391940

How to display html document returned by utl_http package (POST method)

I am using oracle forms 10g, data base version is 10g.
I have written a database procedure that calls the utl_http package's POST method, and the request returns an HTML document. How do I display this HTML document from an Oracle form?
Thank you
Hema

Here you have...
A Full Web Browser Java Bean - Oracle Forms PJCs/Java Beans
http://forms.pjc.bean.over-blog.com/article-26251949.html
