Parsing HTML documents

I am trying to write an application that uses a parsed HTML document to retrieve data. The problem is that the parser in JDK 1.4.1 cannot parse the document completely: some fields are skipped, among other problems. I believe it has to do with html32.bdtd. Is there a later version?

Parsing an HTML document is a huge task. You shouldn't do it yourself; javax.swing.text.html and javax.swing.text.html.parser already provide almost everything you need.
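
As a rough illustration of data retrieval with the Swing parser, the sketch below collects the href of every anchor tag through an HTMLEditorKit.ParserCallback. The class name LinkLister, the method hrefs and the sample markup are made up for this example; only the javax.swing.text.html calls come from the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href attribute of every <a> tag.
class LinkLister {
    static List<String> hrefs(String html) throws IOException {
        final List<String> found = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        found.add(href.toString());
                    }
                }
            }
        };
        // The last argument (true) ignores any charset declared in a <meta> tag.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"a.html\">A</a> <a href=\"b.html\">B</a></body></html>";
        System.out.println(hrefs(html));
    }
}
```

Other callback methods (handleText, handleSimpleTag, handleComment) can be overridden in the same way, depending on which data is needed.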

Similar Messages

  • Parse HTML document embedded in IFRAME

    Dear fellows:
    How can I access the contents of an HTML document embedded in an IFRAME tag, using the Java class HTMLEditorKit.Parser?
    It is well known that the contents of such an embedded HTML document can be accessed by JavaScript at the front end. However, I am more interested in processing it at the back end, using HTMLEditorKit.Parser or any Java Swing API.
    Thanks for help.

    The javax.swing.text.html framework barely supports HTML 3.2.

  • Counting lines of parsed HTML documents

    Hello,
    I am using a HTMLEditorKit.ParserCallback to handle data generated by a ParserDelegator.
    Everything is ok but I can not find how to catch end of lines (I need to know at what line a tag or an attribute is found).
    Thanks in advance for any hints.

    I noticed that the parse() method of ParserDelegator creates a DocumentParser object to do the actual parsing of the HTML document. DocumentParser has a getCurrentLine() method, so I tried extending ParserDelegator to get access to the DocumentParser. However, getCurrentLine() is protected, so I ended up extending DocumentParser as well.
    You probably have code something like:
    new MyParserDelegator().parse(reader, this, false);
    This should be replaced with:
    parser = new MyParserDelegator();
    parser.parse(reader, this, false);
    where you defined an instance variable: MyParserDelegator parser;
    You can now use parser.getCurrentLine() in any of your parser callback methods.
    Note that you may not always get the result you expect for the current line: many times I found the line to be 1 greater than I thought it should be. Anyway, you can decide whether the code is of any value.
    Following is the code for MyParserDelegator and its MyDocumentParser inner class. Good luck.
    import java.io.IOException;
    import java.io.Reader;
    import java.io.Serializable;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.DTD;
    import javax.swing.text.html.parser.DocumentParser;
    import javax.swing.text.html.parser.ParserDelegator;

    public class MyParserDelegator extends ParserDelegator implements Serializable
    {
         // Parser used by the most recent call to parse(); null before that.
         MyDocumentParser parser;

         public void parse(Reader r, HTMLEditorKit.ParserCallback cb, boolean ignoreCharSet) throws IOException
         {
              String name = "html32";
              DTD dtd = createDTD( DTD.getDTD( name ), name );
              parser = new MyDocumentParser(dtd);
              parser.parse(r, cb, ignoreCharSet);
         }

         public int getCurrentLine()
         {
              return parser.getCurrentLine();
         }

         // Widens access to the protected getCurrentLine() of the base parser.
         public static class MyDocumentParser extends DocumentParser
         {
              public MyDocumentParser(DTD dtd)
              {
                   super(dtd);
              }

              public int getCurrentLine()
              {
                   return super.getCurrentLine();
              }
         }
    }

  • Parsing HTML characters (e.g. &nbsp)

    Hi
    Apologies if I'm missing something obvious, I haven't been able to find an answer searching the API or Forums...
    I'm parsing HTML documents (currently as Strings) to extract certain information. Is there an easy way to replace all special HTML characters such as &nbsp; or &lt; with a space or < respectively, without having to do a string replace on every possible HTML character?
    I know there's an HTML parser in swing but that seems to be geared towards creating an HTML editor.
    Any help would be appreciated!

    There are also a number of open source or shareware programs, such as TidyHTML, that clean up and parse existing HTML. Check out Sourceforge or www.downloads.com.
    - Saish
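
    For what it's worth, the Swing parser mentioned in the question can also be used without building an editor. The sketch below is one possible approach, not the only one: it feeds the markup through ParserDelegator and keeps only the text, which the parser hands over with character entities already resolved. The class name HtmlText and the method textOf are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: returns the text of an HTML string with entities decoded.
class HtmlText {
    static String textOf(String html) throws IOException {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Text runs from separate elements are appended back to back.
                out.append(data);
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        // &nbsp; arrives as U+00A0; map it to an ordinary space.
        return out.toString().replace('\u00a0', ' ');
    }

    public static void main(String[] args) throws IOException {
        System.out.println(textOf("<html><body><p>a &lt; b&nbsp;&amp; c</p></body></html>"));
    }
}
```

    Since the tags themselves never reach handleText, the same helper doubles as a simple tag stripper.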

  • Problem parsing a html document

    Hi all,
    I need to parse a html document.
    InputStream is = new java.io.FileInputStream(new File("c:/temp/htmldoc.html"));
    DOMFragmentParser DOMparser = new DOMFragmentParser();
    DocumentFragment doc = new HTMLDocumentImpl().createDocumentFragment();
    DOMparser.parse(new InputSource(is), doc);
    NodeList nl = doc.getChildNodes();
    I get just 3 of the following nodes...... though the document htmldoc.html is a proper html doc..
    #document-fragment
    HTML
    #text
    Any suggestions/help are most welcome. Thanks

    Here's an example showing how to do this via javax.xml:
    import java.io.*;
    import java.net.*;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;
    public class HTMLElementLister {
         public static void main(String[] args) throws Exception {
              URLConnection con = new URL("http://www.mywebsite.com/index.html").openConnection();
              con.connect();
              // getInputStream() avoids the unchecked cast of getContent()
              InputStream in = con.getInputStream();
              Document doc = null;
              try {
                   DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                   DocumentBuilder db = dbf.newDocumentBuilder();
                   doc = db.parse(in);
              } finally {
                   in.close();
              }
              // Print the top-level nodes, then the children of <html> and <body>
              NodeList nodes = doc.getChildNodes();
              for (int i=0; i<nodes.getLength(); i++) {
                   Node node = nodes.item(i);
                   String nodeName = node.getNodeName();
                   System.out.println(nodeName);
                   if ("html".equalsIgnoreCase(nodeName)) {
                        System.out.println("|");
                        NodeList grandkids = node.getChildNodes();
                        for (int j=0; j<grandkids.getLength(); j++) {
                             Node contentNode = grandkids.item(j);
                             nodeName = contentNode.getNodeName();
                             System.out.println("|- " + nodeName);
                             if ("body".equalsIgnoreCase(nodeName)) {
                                  System.out.println("   |");
                                  NodeList bodyNodes = contentNode.getChildNodes();
                                  for (int k=0; k<bodyNodes.getLength(); k++) {
                                       Node bodyNode = bodyNodes.item(k);
                                       System.out.println("   |- " + bodyNode.getNodeName());
                                  }
                             }
                        }
                   }
              }
         }
    }

  • Parsing an HTML document

    I want to parse an HTML document and replace its anchor tags with mine on the fly. Can anybody suggest how to do it, please?
    Ajay

    If your HTML files are not well-formed (as is likely with most HTML files), for example because attribute values are not enclosed in quotation marks, most XML parsers will fail.
    Anand from this forum introduced JTidy to me and it worked very well. It is an HTML parser that is able to tidy up your HTML code.

  • How to parse a html document?

    I am trying to parse an HTML document that I load from a URL over the internet. The HTML is not well formed but that's OK. The problem is that the document builder throws an exception because the document is not well formed.
    Can I parse an HTML document using the document builder?
    Please note that I set validating to false and the parse still has a fatal error saying the <meta> tag must have a corresponding </meta> tag.
    I am using code like the following.....
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(false);
    DocumentBuilder db = factory.newDocumentBuilder();
    doc = db.parse(urlString);

    "The html is not well formed but thats ok."
    No, it isn't.
    "Validation" means checking that the XML conforms to a schema or a DTD. Don't confuse that with checking whether the XML is well-formed, which means whether it follows the basic rules of XML like opening tags have to have matching closing tags. Which is what your message is telling you -- your file isn't well-formed XML.
    So sure, you can parse HTML or anything else with an XML parser, just be prepared to be told it isn't well-formed XML.
    If you want to clean up HTML so that it's well-formed XML, there are products like HTMLTidy and JTidy that will do that for you.
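
    The difference between validation and well-formedness can be seen in a small sketch like the following. The class name WellFormedCheck and the sample markup are invented for illustration; the point is only that setValidating(false) does not make the parser accept an unclosed <meta> tag.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether markup is well-formed XML.
class WellFormedCheck {
    static boolean isWellFormed(String markup)
            throws ParserConfigurationException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // switches off DTD validation only
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors are always fatal, validating or not.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String open = "<html><head><meta name=\"x\"></head><body></body></html>";
        String closed = "<html><head><meta name=\"x\"/></head><body></body></html>";
        System.out.println(isWellFormed(open));
        System.out.println(isWellFormed(closed));
    }
}
```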

  • Parsing and HTML document

    Does anyone know how to parse an HTML document without having JEditorPane (in-the-neck) do it for you?

    If you want to do a small amount of parsing, then it would make sense to write a custom program as described in the previous response. If, however, you want to do a lot of parsing, then it might make more sense to try to make use of an XML parser. If you are trying to parse html pages that are your own, then you might want to think about transforming them into xhtml so that an XML parser will be able to process them.
    XML parsing is easy. I recently developed a web site for a set of mock exams for the Java Programmer Certification. Originally, I started developing the pages in HTML, but I quickly realized that I would have a hard time managing the exams in that format. I then organized the exam into a set of XML documents, one document for each topic. To publish a set of cross-topic exams, I use JDOM (with the help of SAX) to load all of the questions into the Java Collections Framework, where I can easily organize a set of four cross-topic exams. I also use JDOM to number the questions and answers before writing the new exams out to a new set of four XML files. Then I use XSLT to transform the four exam.xml documents into eight HTML files: four for the questions and four for the answers.
    If you would like to take a look at the result, then please use the following link.
    http://www.geocities.com/danchisholm2000/
    If you own the html files that you want to parse, then I would try to find a way to transform them into valid xml. XHTML might be a good choice.
    Dan Chisholm

  • Why can't I make call to parse HTML from inside Thread?

    This is driving me crazy. With a defined HTMLEditorKit.ParserCallback object "callback", I am attempting to parse an HTML document retrieved from a URL by using:
    new ParserDelegator().parse(new InputStreamReader(url.openStream( )), callback, true);
    It doesn't work if I initiate the call in any way from within the run method of a Thread subclass (the way I'd like to do it). If I make the call in the constructor of the Thread subclass, however, it runs fine. I know it must have something to do with the fact that parse runs in a Thread of its own, but the way to fix it isn't apparent to me.
    I would appreciate some words from people who might know what's happening here... THANKS in advance.

    Don't bother - figured it out - thanks.
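
    The poster never said what the fix was. For reference, here is a minimal sketch of calling the parser from a Thread's run method. It assumes, as far as I can tell from the JDK, that ParserDelegator.parse works synchronously in the calling thread, so the caller only has to wait for the worker to finish before reading the results. The class name ThreadedParse and the sample input are invented.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example: runs the Swing HTML parser on a worker thread.
class ThreadedParse {
    static String parseInThread(final String html) throws InterruptedException {
        final StringBuffer text = new StringBuffer();
        final HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        Thread worker = new Thread() {
            @Override
            public void run() {
                try {
                    // Returns only after the whole input has been parsed.
                    new ParserDelegator().parse(new StringReader(html), callback, true);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        };
        worker.start();
        worker.join(); // without this, the caller may read 'text' too early
        return text.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parseInThread("<html><body><p>hello</p></body></html>"));
    }
}
```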

  • JEditorPane parsing HTML

    Hi all,
    I am using JEditorPane and its ability to parse HTML, which, although relatively old and crusty, is certainly all I need for the job.
    Now, I understand there is a chain of classes involved in taking my .html file and turning it into something we can see in a JEditorPane. For example, an img tag is picked up by HTMLEditorKit and turned into an ImageView for display purposes.
    I want to do the following: I have subclassed HTMLEditorKit and have overridden the HTMLFactory (although at the moment it just defers everything to super). I want to be able to pick out all of the HTML comment tags as they go through the HTMLEditorKit:
    <!-- hey hey this is a comment -->
    ... and get to the comment text, "hey hey this is a comment", as a Java string. However, I've been digging around with Element for hours now, and although my HTMLFactory correctly picks out the comments from the rest of the elements:
    else if (kind == HTML.Tag.COMMENT)
    {
        System.out.println("I found a comment but don't know what it said!!");
    }
    ... as you can see, I don't know how to get to the comment text itself.
    The reason why I want access to the comment text is that I want to supplement the HTML code a little bit and add something in the comment that will affect the way it is rendered when I read it depending on the comment - so there's the reason if curious.
    Any help, and I do mean anything at all, would be much appreciated, as this is the last obstacle in my path to getting this thing working :)
    Thanks for your time!
    - Peter

    Here is some old code I have lying around that attempts to iterate through all the elements. If I remember correctly the comment text is found in the AttributeSet of the element:
    import java.io.*;
    import java.net.*;
    import java.util.*;
    import javax.swing.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;

    class GetHTML
    {
        public static void main(String[] args)
        {
            EditorKit kit = new HTMLEditorKit();
            Document doc = kit.createDefaultDocument();
            // The Document class does not yet handle charset's properly.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            try
            {
                // Create a reader on the HTML content.
                Reader rd = getReader(args[0]);
                // Parse the HTML.
                kit.read(rd, doc, 0);
                System.out.println( doc.getText(0, doc.getLength()) );
                System.out.println("----");
                // Iterate through the elements of the HTML document.
                ElementIterator it = new ElementIterator(doc);
                Element elem = null;
                while ( (elem = it.next()) != null )
                {
                    AttributeSet as = elem.getAttributes();
                    System.out.println( "\n" + elem.getName() + " : " + as.getAttributeCount() );
                    if ( elem.getName().equals( HTML.Tag.IMG.toString() ) )
                    {
                        Object o = elem.getAttributes().getAttribute( HTML.Attribute.SRC );
                        System.out.println( o );
                    }
                    // 'enum' is a reserved word since Java 5, so the variable is called 'names'.
                    Enumeration<?> names = as.getAttributeNames();
                    while( names.hasMoreElements() )
                    {
                        Object name = names.nextElement();
                        Object value = as.getAttribute( name );
                        System.out.println( "\t" + name + " : " + value );
                        if (value instanceof DefaultComboBoxModel)
                        {
                            DefaultComboBoxModel<?> model = (DefaultComboBoxModel<?>)value;
                            for (int j = 0; j < model.getSize(); j++)
                            {
                                Object o = model.getElementAt(j);
                                Object selected = model.getSelectedItem();
                                if ( o.equals( selected ) )
                                    System.out.println( o + " : selected" );
                                else
                                    System.out.println( o );
                            }
                        }
                    }
                    if ( elem.getName().equals( HTML.Tag.SELECT.toString() ) )
                    {
                        Object o = as.getAttribute( HTML.Attribute.ID );
                        System.out.println( o );
                    }
                    //  Weird, the text for each tag is stored in a 'content' element
                    if (elem.getElementCount() == 0)
                    {
                        int start = elem.getStartOffset();
                        int end = elem.getEndOffset();
                        System.out.println( "\t" + doc.getText(start, end - start) );
                    }
                }
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
            System.exit(0);
        }

        // Returns a reader on the HTML data. If 'uri' begins
        // with "http:", it's treated as a URL; otherwise,
        // it's assumed to be a local filename.
        static Reader getReader(String uri)
            throws IOException
        {
            // Retrieve from Internet.
            if (uri.startsWith("http:"))
            {
                URLConnection conn = new URL(uri).openConnection();
                return new InputStreamReader(conn.getInputStream());
            }
            // Retrieve from file.
            else
            {
                return new FileReader(uri);
            }
        }
    }
    To test it just use:
    java GetHTML somefile.html
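
    If the comment text is only needed while reading the file, and not from inside the view factory, a different route is to read it at the parser level. The sketch below is that swapped-in technique, not the Element-based approach asked about: it overrides handleComment in an HTMLEditorKit.ParserCallback. The class name CommentLister and the sample markup are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of every HTML comment.
class CommentLister {
    static List<String> comments(String html) throws IOException {
        final List<String> found = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleComment(char[] data, int pos) {
                // 'data' holds the characters between the comment delimiters.
                found.add(new String(data).trim());
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><!-- hey hey this is a comment --><p>text</p></body></html>";
        System.out.println(comments(html));
    }
}
```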

  • How to parse XML document with default namespace with JDOM XPath

    Hi All,
    I am having difficulty parsing a namespaced HTML document using Saxon and the TagSoup parser. The relevant content of this document is as follows:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    </head>
    <body>
        <div id="container">
            <div id="content">
                <table class="sresults">
                    <tr>
                        <td>
                            <a href="http://www.abc.com/areas" title="Hollywood, CA">hollywood</a>
                        </td>
                        <td>
                            <a href="http://www.abc.com/areas" title="San Jose, CA">san jose</a>
                        </td>
                        <td>
                            <a href="http://www.abc.com/areas" title="San Francisco, CA">san francisco</a>
                        </td>
                        <td>
                            <a href="http://www.abc.com/areas" title="San Diego, CA">San diego</a>
                        </td>
                  </tr>
    </body>
    </html>
    Below are the relevant code snippets illustrating how I have attempted to retrieve the contents (the value of <a>):
                 import java.util.*;
                 import org.jdom.*;
                 import org.jdom.xpath.*;
                 import org.saxpath.*;
                 import org.ccil.cowan.tagsoup.Parser;
    ( 1 )       frInHtml = new FileReader("C:\\Tmp\\ABC.html");
    ( 2 )       brInHtml = new BufferedReader(frInHtml);
    ( 3 ) //    SAXBuilder saxBuilder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
    ( 4 )       SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
    ( 5 )       org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);
    ( 6 )       XPath xpath =  XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
    ( 7 )       xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");
    ( 8 )       java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));
    ( 9 )       Iterator iterator = list.iterator();
    ( 10 )     while (iterator.hasNext())
    ( 11 )     {
    ( 12 )            Object object = iterator.next();
    ( 13 ) //         if (object instanceof Element)
    ( 14 ) //               System.out.println(((Element)object).getTextNormalize());
    ( 15 )             if (object instanceof Content)
    ( 16 )                   System.out.println(((Content)object).getValue());
    ....
    This program works on the same document without the default namespace, in which case the "ns" prefix in the XPath statements (lines 6-7) is not needed either. Moreover, in the past I used "org.apache.xerces.parsers.SAXParser" to retrieve the content of <a> successfully from the same document without the default namespace.
    I would like to achieve the following objectives if possible:
    ( i ) Exclude the DTD and namespace in order to simplify the parsing process. How could this be done?
    ( ii ) If this is not possible, how do I include the namespace in the XPath statements (lines 6-7) so that the value of <a> is picked up correctly?
    ( iii ) Would changing from "org.apache.xerces.parsers.SAXParser" to "org.ccil.cowan.tagsoup.Parser" make any difference as far as using XPath is concerned?
    ( iv ) Failing to exclude the DTD, how do I change the lookup of a PUBLIC DTD to a local SYSTEM one and include a local DTD for reference?
    I am running JDK 1.6.0_06, Netbeans 6.1, JDOM 1.1, Saxon6-5-5, Tagsoup 1.2 on Windows XP platform.
    Any assistance would be appreciated.
    Thanks in advance,
    Jack

    Here's an example of using a custom EntityResolver with the standard DocumentBuilder provided by the JDK. The code may or may not be similar for the parsers that you're using.
    import java.io.IOException;
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    public class ParseExamples
    {
        private final static String COMMON_XML
            = "<music>"
            +     "<artist name=\"Anderson, Laurie\">"
            +         "<album>Big Science</album>"
            +         "<album>Strange Angels</album>"
            +     "</artist>"
            +     "<artist name=\"Fine Young Cannibals\">"
            +         "<album>The Raw &amp; The Cooked</album>"
            +     "</artist>"
            + "</music>";
        private final static String COMMON_DTD
            = "<!ELEMENT music (artist*)>"
            + "<!ELEMENT artist (album+)>"
            + "<!ELEMENT album (#PCDATA)>"
            + "<!ATTLIST artist name CDATA #REQUIRED>";

        public static void main(String[] argv)
        throws Exception
        {
            // this version uses just a SYSTEM identifier - note that it gets turned
            // into a file: URL
            String xml = "<!DOCTYPE music SYSTEM \"bar\">"
                       + COMMON_XML;
            // this version uses both PUBLIC and SYSTEM identifiers; the SYSTEM ID
            // gets munged, the PUBLIC ID doesn't
    //        String xml = "<!DOCTYPE music PUBLIC \"foo\" \"bar\">"
    //                   + COMMON_XML;
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            // Serve the DTD from memory instead of resolving the identifiers.
            db.setEntityResolver(new EntityResolver()
            {
                public InputSource resolveEntity(String publicId, String systemId)
                    throws SAXException, IOException
                {
                    System.out.println("publicId = " + publicId);
                    System.out.println("systemId = " + systemId);
                    return new InputSource(new StringReader(COMMON_DTD));
                }
            });
            Document dom = db.parse(new InputSource(new StringReader(xml)));
            System.out.println("root element name = " + dom.getDocumentElement().getNodeName());
        }
    }
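
    On question ( ii ), binding a prefix to the default namespace can be sketched with the JDK's own javax.xml.xpath API, used here instead of the JDOM XPath class from the question. The class name XhtmlLinks, the method linkTexts and the shortened sample document are invented, and the sample is well-formed with no DOCTYPE, so neither TagSoup nor a DTD lookup is involved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: selects <a> elements in an XHTML document via a bound prefix.
class XhtmlLinks {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    static List<String> linkTexts(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // off by default; prefixed XPath steps need it
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Map the arbitrary prefix "ns" to the document's default namespace.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "ns" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return prefix == null
                        ? Collections.<String>emptyList().iterator()
                        : Collections.singletonList(prefix).iterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(
                "/ns:html/ns:body//ns:table[@class='sresults']/ns:tr/ns:td/ns:a",
                doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<table class=\"sresults\"><tr>"
                + "<td><a href=\"http://www.abc.com/areas\">hollywood</a></td>"
                + "<td><a href=\"http://www.abc.com/areas\">san jose</a></td>"
                + "</tr></table></body></html>";
        System.out.println(linkTexts(xhtml));
    }
}
```

    The prefix in the XPath expression does not have to match anything in the document; only the namespace URI it is bound to matters.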

  • Parsing html to text

    I'm looking for a library which can remove all tags from an HTML document, i.e. ending up with the 'content' of the HTML doc.
    Does anyone know of such a library or have some example code for doing it?

    Hi slackman,
    I don't know if this is what you are looking for...give more details if I misunderstand :)
    import java.util.*;
    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.helpers.*;
    import org.xml.sax.*;
    public class MyHandler extends DefaultHandler {
        // Called by the SAX parser for each run of character data between tags.
        public void characters(char[] aCh, int aStart, int aLength) throws SAXException {
            System.out.println(new String(aCh, aStart, aLength));
        }
        public static void main(String[] aArgs) {
            try {
                String oHTMLTest = "<html><body>This is a content </body></html>";
                StringReader oReader = new StringReader(oHTMLTest);
                InputSource oSource = new InputSource(oReader);
                SAXParserFactory oParserFactory = SAXParserFactory.newInstance();
                SAXParser oParser = oParserFactory.newSAXParser();
                MyHandler oHandler = new MyHandler();
                oParser.parse(oSource, oHandler);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

  • Parsing word documents

    Can I use Oracle Text API to parse a word document (stored on a
    file server) from a Java program? I want to extract specific
    text from this document.
    Thanks for your help!

    Oracle Text does not provide features for parsing Word documents.
    However if you have an XML version of it you can extract sections
    using XMLType functions.
    The following paper describes the XMLType:
    http://www.oracle.com/oramag/oracle/01-nov/index.html?o61xml.html

  • Parsing rdf documents

    I'm attempting to parse an RDF document, but I can't navigate
    the returned XML tree because the first tag is called rdf:RDF. I
    can't start with just rdf, and rdf:RDF as a variable name breaks
    ColdFusion.
    I can get in using xmlSearch, but I'd like to be able to parse
    both RSS and RDF easily.

