Parsing HTML into DOM using HTMLEditorKit

I am trying to parse an HTML file using javax.swing.text.html.HTMLEditorKit. My constraints are that I cannot install new libraries such as JTidy, and I must use a .jsp file, not a servlet. I'm able to fetch the URL and parse it using a ParserCallback, but my overridden handleText method will not write to the page. Furthermore, I cannot pass anything out of this method to use later because it returns void. I want to get some data back from this method, or at least do something useful within it. Is that possible?
    java.net.URL url = new java.net.URL("http://" + request.getServerName() + "/" + urls.get(i));
    java.io.InputStream is = url.openStream();
    java.io.InputStreamReader isr = new java.io.InputStreamReader(is);
    java.io.BufferedReader br = new java.io.BufferedReader(isr);
    javax.swing.text.html.HTMLEditorKit.ParserCallback callback =
        new javax.swing.text.html.HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                out.println(data);
            }
        };
    new javax.swing.text.html.parser.ParserDelegator().parse(br, callback, false);
Attempting to print from within this method gives this error:
Attempt to use a non-final variable out from a different method. From enclosing blocks, only final local variables are available.
Maybe I need to write the whole output XML file from inside the ParserCallback?

Those are rather stupid requirements. Okay, I can see the one about not using external libraries because nobody knows how to deal with the licences. But making you use a JSP instead of a servlet just gets in the way of writing the Java code which you could probably do perfectly well if you didn't have to cram it into a JSP scriptlet. Stupid.
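But anyway: the error message says you need a final local variable. So don't just sit there, give it a final local variable. In a JSP, "out" is a javax.servlet.jsp.JspWriter, so something like "final javax.servlet.jsp.JspWriter fakeOut = out", followed by using "fakeOut" rather than "out" inside the callback, should work. Note that JspWriter.println() throws IOException, which handleText() is not declared to throw, so the call also needs a try/catch around it.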
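
Another way to get data out of the void callback is to have it append to an object that the enclosing code holds a final reference to. This is only a sketch: TextCollector and extractText are made-up names, not part of the original JSP, and it assumes the text nodes simply need to be concatenated.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TextCollector {
    /** Parses HTML from the reader and returns the text nodes joined by spaces. */
    static String extractText(Reader reader) {
        // The reference is final, but the buffer it points to is mutable,
        // so the anonymous class may append to it.
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        try {
            // true = ignore charset directives found in <meta> tags
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }
}
```

In the scriptlet, the result could then be written once parsing has finished, for example with out.println(TextCollector.extractText(br)), which keeps all use of "out" outside the anonymous class.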

Similar Messages

Problem occurs when parsing uri into dom.

Hey guys, I have a problem when I try to parse an RSS feed URL into a DOM.
It works fine with most XML URLs. However, "http://digg.com/rss/index.xml", for example, doesn't work at all. It gives java.net.SocketException (connection reset), so I reckon it could not even open the connection to that URL.
I'm not sure how this works, or whether it's an issue with the RSS feed server.
Some code is below:
Document document = null;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try {
    DocumentBuilder builder = factory.newDocumentBuilder();
    document = builder.parse(uri);
} catch (SAXException sxe) {.................} catch...............
Thank you in advance for any help.
Cheers,
Lin

This problem has nothing to do with parsing. I used a URLConnection object to get access to the input stream provided by the URL "http://digg.com/rss/index.xml". After hanging for 378 seconds, I got the following error when the getInputStream() method was invoked on the URLConnection object:
java.net.SocketException: Unexpected end of file from server
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:684)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:554)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:682)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:554)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:939)
...
The stack trace shows that the error occurs after the HTTP connection has been opened, while the response headers are being read. The server's response cannot even be read as a valid HTTP response, let alone parsed as an XML document. Nowhere have I found a clear-cut answer to this problem, which makes me think this error might be caused by a bug. You should try posting here:
http://forum.java.sun.com/forum.jspa?forumID=536
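
To see for yourself whether a failure comes from the network or from the parser, it can help to do the two steps separately. The sketch below assumes a plain HTTP URL; FeedLoader, open and parse are hypothetical names, and the timeout value is whatever the caller considers reasonable.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class FeedLoader {
    /** Step 1: network only. Timeouts make a stalled server fail fast instead of hanging. */
    static InputStream open(String uri, int timeoutMillis) throws IOException {
        URLConnection conn = new URL(uri).openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        return conn.getInputStream();
    }

    /** Step 2: parsing only. Any stream works here, including an in-memory one. */
    static Document parse(InputStream in) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Input could not be parsed as XML", e);
        }
    }
}
```

If open() throws, the parser was never involved; if open() succeeds and parse() fails, the content itself is the problem.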

Parse html (a href) using regex

Hello,
I would like to extract all the URLs from a website that appear inside <a href="parse string"> tags.
I already have the regex, which is
String regex = "< *a.*href *= *['|\"]";
Could you please advise which method of the Pattern or Matcher classes I should use in order to get as output
*only* the URL inside the " " marks?
I have already tried the end and start methods, which return the indexes, but I don't get the desired result.
Thanks in advance.
P.S. I have also tried to use HtmlParser, but I found it difficult, so I prefer to use a regex.

Please continue in your original thread.
[http://forums.sun.com/thread.jspa?threadID=5363751]
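
For readers landing on this thread: the usual tool for this is a capturing group read back with Matcher.group(). Below is a minimal sketch, not the asker's original pattern. The regex is rewritten so that one group remembers the opening quote and a second captures the URL, and HrefExtractor is a hypothetical name. Regexes remain fragile on real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Group 1 is the opening quote; group 2 is everything up to the same quote again.
    private static final Pattern HREF = Pattern.compile(
            "<\\s*a\\b[^>]*?\\bhref\\s*=\\s*(['\"])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the href values of all anchor tags found in the given HTML. */
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // find() advances to the next match
            urls.add(m.group(2));   // group(2) is only the URL, without the quotes
        }
        return urls;
    }
}
```

Compared with the pattern in the question, [^>]*? keeps the match inside a single tag (a greedy .* can run across several tags), and the character class no longer contains a literal | character.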

How to convert HTML into XML

I know I can transform XML into HTML, but are there any tools or methods to convert HTML into XML?
I have HTML that is not well-formed, with a lot of data fields and a lot of unclosed tags. This HTML was generated from some XML (as far as I can see), but I can't find a way to turn it back into XML so that I can eventually store the data in another database.
Can anyone help me? I would appreciate it!
KIB

As Sam has told you, you can use JTidy for this purpose. Sample code that converts an HTML file to an XML file is given at the following URL;
see the documentation as well.
http://sourceforge.net/docman/display_doc.php?docid=1298&group_id=13153
gaurav_k1

Parsing HTML using Swing's HTMLEditorKit

Hi all,
I posted this question on the "Java programming", but I think I posted on the wrong forum. So, please let me know if I have posted on the wrong forum, again.
Anyway, I have read an article on parsing HTML using the Swing HTML Parser (http://java.sun.com/products/jfc/tsc/articles/bookmarks/index.html). However, I find that the HTMLEditorKit is unable to understand the <Meta> tag under the <Head> tag? Is this true? I am getting an error message:
javax.swing.text.ChangedCharSetException
at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:172)
at javax.swing.text.html.parser.Parser.startTag(Parser.java:327)
at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1786)
at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1821)
at javax.swing.text.html.parser.Parser.parse(Parser.java:1980)
at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:109)
at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:74)
at URLReader.main(URLReader.java:58)
Below is some simple code that writes out the HTML file it reads in:
public static void main(String[] args) throws Exception {
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
        public void handleText(char[] data, int pos) {
            try {
                System.out.println(data);
            } catch (Exception e) {
                System.out.println("IOE: " + e);
            }
        }
    };
    Reader reader = new FileReader("myFile.html");
    new ParserDelegator().parse(reader, callback, false);
}
The HTML file that fails to parse is:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>NWS WSR-88D Radar System Transmit/Receive Status</title>
</head>
<p>A <foo>xx</foo>link</html>
If I take away <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, there is no problem.
Any suggestions? Thanks in advance.

Hi,
Setting the third argument really works!!! Yee..... haa....!!!
WORKING SOLUTION: new ParserDelegator().parse(reader, callback, true);
MANY... MANY THANKS for looking at the problem!!!
Pass true as the third argument (ignoreCharSet) of the parse method.
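
A small self-contained check of that fix, as a sketch: it feeds the parser a document containing the same kind of meta tag from an in-memory string. IgnoreCharsetDemo and parseIgnoringCharset are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class IgnoreCharsetDemo {
    /** Parses the HTML with ignoreCharSet = true and returns one text node per line. */
    static String parseIgnoringCharset(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };
        try {
            // With false here, a <meta ... charset=...> tag makes the parser throw
            // ChangedCharSetException; with true the directive is ignored.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Ignoring the directive means the Reader's own encoding is trusted, so the Reader should be created with the right charset in the first place.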

Problem in parsing a xml string using dom parser

I want to parse an XML string using a DOM parser. The parse function in the DOM parser takes only an input stream as its argument, so I wrote the code as
InputStream inputstream = new StringBufferInputStream(XmlData);
InputSource inputSource = new InputSource(inputstream);
but a SAXException is thrown, along with the warning
"java.io.StringBufferInputStream in java.io has been deprecated"
Please help me.

I want to parse an XML string using a DOM parser. The parse function in the DOM parser takes only an input stream as its argument.
This is not true of the DOM parser in Java 1.4. So you might want to get rid of your old parser and replace it with something more current. Or perhaps you are using 1.4 and you just didn't read all of the API docs.
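
What the reply is hinting at, sketched with the JAXP classes that ship with the JDK: DocumentBuilder.parse() also accepts an InputSource, and an InputSource can wrap a Reader. XmlStringParser is a hypothetical name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlStringParser {
    /** Parses an XML document held in a String into a DOM Document. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // A character stream avoids the deprecated StringBufferInputStream
            // and any byte/char encoding mismatch.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML string", e);
        }
    }
}
```

If a SAXException still appears with this approach, the string itself is probably not well-formed XML, and the exception message usually names the line and column.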

Parse into array using JDOM! please help

hey,
I've managed to parse an XML document using JDOM.
I need to be able to parse it and store the text (the value of each node) in an array, and then insert it into a database, etc.
The problem is with the recursive function listChildren, which calls itself. Can someone tell me where to insert the block of code so that I can store the values in an array of strings?
Here's the code:
public static void parse(String stock) throws Exception
{
    SAXBuilder builder = new SAXBuilder();
    Reader r = new StringReader(stock);
    Document doc = builder.build(r);
    Element root = doc.getRootElement();
    listChildren(root, 0);
}
public static void listChildren(Element current, int depth) throws Exception
{
    String nodes = current.getName();
    System.out.println(nodes + " : " + current.getText());
    List children = current.getChildren();
    Iterator iterator = children.iterator();
    while (iterator.hasNext())
    {
        Element child = (Element) iterator.next();
        listChildren(child, depth + 1);
    }
}
I'm looking for something like:
a = current.getText();
but I don't know where to include this line of code. Please help.
cheers,
Shivek Sachdev

Hi, I suggest you make an array of byte arrays
--> Byte[][] and use one row for each number.
You can do two things:
Take each digit of a number and put the digits one by one in the columns of the row corresponding to that number. Of course, it may take too much room if the int[] is too big, but that is the easiest way, I think.
The other way is to divide your number into bit sets (class BitSet) of 8 bits each; then you can save each bit into a column of your array, and you still have one number in each row. To put your numbers back together, use the same class.
Maybe someone has an easier way; I couldn't think of any.
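
For the JDOM question at the top of this thread, a common pattern is to pass a List down through the recursion and add to it at the point where the value is currently printed. The sketch below swaps in the JDK's built-in DOM API so that it runs without JDOM; with JDOM the same shape would use the getChildren() and getText() calls from the question. NodeTextCollector is a hypothetical name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NodeTextCollector {
    /** Returns the non-empty text of every element, in document order. */
    static List<String> collect(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            List<String> values = new ArrayList<>();
            collect(root, values);
            return values;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // The list is created once by the caller and shared by every recursive call.
    private static void collect(Element current, List<String> values) {
        StringBuilder own = new StringBuilder();
        NodeList children = current.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                own.append(child.getNodeValue());   // text directly inside this element
            }
        }
        String text = own.toString().trim();
        if (!text.isEmpty()) {
            values.add(text);                       // this replaces the println
        }
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, values);   // recurse into child elements
            }
        }
    }
}
```

If a real array is needed afterwards, values.toArray(new String[0]) converts the list.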

Convert html into tidy html to convert pdf using iText

hello.
I am trying to convert an HTML document into a PDF.
First I tried iText, and it works properly, but it needs all the tags to be written correctly.
When you give it HTML that is not well formed, it throws an exception.
So is there any way to convert such HTML to PDF?
Or, if not, is there a way to convert the HTML into properly tagged HTML
so that it is easy to convert it to PDF?
If you have any working example using Tidy.jar, please send it to me.
Thanks..

Hi,
I had a similar task to do, i.e. converting HTML to PDF.
Please follow the link to this site and download the trial code.
http://www.pd4ml.com
I was able to convert my HTML to PDF.
Have a look at it and let me know.
Regards,
Joe

JEditorPane parsing HTML

Hi all,
I am using JEditorPane and its ability to parse HTML, which, although relatively old and crusty, is certainly all I need for the job.
Now, I understand there is a chain of classes involved in taking my .html file and turning it into something we can see in a JEditorPane. For example, an img tag is picked up by HTMLEditorKit and turned into an ImageView for display purposes.
I want to do the following: I have subclassed HTMLEditorKit, and have overridden the HTMLFactory (although at the moment it just defers everything to super). I want to be able to pick out all of the HTML comment tags as they go through the HTMLEditorKit:
<!-- hey hey this is a comment -->
... and get to the comment text, "hey hey this is a comment", as a Java string. However, I've been digging around with Element for hours now, and although my HTMLFactory correctly digs out the comments from the rest of the elements:
else if (kind == HTML.Tag.COMMENT)
{
    System.out.println("I found a comment but don't know what it said!!");
}
... as you can see, I don't know how to get to the comment text itself.
The reason why I want access to the comment text is that I want to supplement the HTML code a little bit and add something in the comment that will affect the way it is rendered when I read it depending on the comment - so there's the reason if curious.
Any help, and I do mean anything at all, would be much appreciated, as this is the last obstacle in my path to getting this thing working :)
Thanks for your time!
- Peter

Here is some old code I have lying around that attempts to iterate through all the elements. If I remember correctly, the comment text is found in the AttributeSet of the element:
import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTML
{
    public static void main(String[] args)
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // The Document class does not yet handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try
        {
            // Create a reader on the HTML content.
            Reader rd = getReader(args[0]);
            // Parse the HTML.
            kit.read(rd, doc, 0);
            System.out.println( doc.getText(0, doc.getLength()) );
            System.out.println("----");
            // Iterate through the elements of the HTML document.
            ElementIterator it = new ElementIterator(doc);
            Element elem = null;
            while ( (elem = it.next()) != null )
            {
                AttributeSet as = elem.getAttributes();
                System.out.println( "\n" + elem.getName() + " : " + as.getAttributeCount() );
                if ( elem.getName().equals( HTML.Tag.IMG.toString() ) )
                {
                    Object o = elem.getAttributes().getAttribute( HTML.Attribute.SRC );
                    System.out.println( o );
                }
                // "enum" is a reserved word since Java 5, so the variable is called "names".
                Enumeration<?> names = as.getAttributeNames();
                while( names.hasMoreElements() )
                {
                    Object name = names.nextElement();
                    Object value = as.getAttribute( name );
                    System.out.println( "\t" + name + " : " + value );
                    if (value instanceof DefaultComboBoxModel)
                    {
                        DefaultComboBoxModel model = (DefaultComboBoxModel)value;
                        for (int j = 0; j < model.getSize(); j++)
                        {
                            Object o = model.getElementAt(j);
                            Object selected = model.getSelectedItem();
                            if ( o.equals( selected ) )
                                System.out.println( o + " : selected" );
                            else
                                System.out.println( o );
                        }
                    }
                }
                if ( elem.getName().equals( HTML.Tag.SELECT.toString() ) )
                {
                    Object o = as.getAttribute( HTML.Attribute.ID );
                    System.out.println( o );
                }
                // Weird, the text for each tag is stored in a 'content' element
                if (elem.getElementCount() == 0)
                {
                    int start = elem.getStartOffset();
                    int end = elem.getEndOffset();
                    System.out.println( "\t" + doc.getText(start, end - start) );
                }
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        System.exit(1);
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
To test it just use:
java GetHTML somefile.html
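
A parser-level alternative, in case only the comment text is needed and not the Element tree: HTMLEditorKit.ParserCallback has a handleComment(char[], int) method that receives the characters between the comment delimiters. This is a different technique from the ElementIterator code above, sketched under the assumption that the comments can be read in a separate pass; CommentCollector is a hypothetical name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class CommentCollector {
    /** Returns the trimmed text of every HTML comment the parser reports. */
    static List<String> comments(String html) {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleComment(char[] data, int pos) {
                // data holds the comment body without the surrounding delimiters
                found.add(new String(data).trim());
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }
}
```

The pos argument gives the position in the source, which could be used to relate a comment to the element rendered near it.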

JS/HTML: HTML to DOM so I can XPath

I'm new to all this, and all I want to do is get info from a
page and redisplay it. I would really like to get that info the way
the title suggests, though: I want to start with the HTML source,
parse it into a DOM, and then use XPath to get the info I want. I've
only ever done this sort of thing in Greasemonkey before, though, and
don't seem to be able to get it working.
Any help would be appreciated.

Just to answer my own question ;).......
quote:
<html>
<head>
<title>Get It!</title>
<link href="sample.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="lib/air/AIRAliases.js"></script>
<script type="text/javascript" src="lib/air/AIRIntrospector.js"></script>
<script type="text/javascript">
// AIR-related functions created by the developer
function onHTMLLoadComplete(e) {
    // get a reference to the top level html document
    var doc = html.window.document;
    //var doc = e.target.window.document;
    var node = doc.evaluate("//title", doc).iterateNext();
    // while (thisNode = nodes.iterateNext()) {
    //     alert( thisNode.textContent );
    //     thisNode = nodes.iterateNext();
    // }
    var elem = document.createElement( 'div' );
    elem.innerText = 'Title of Page is: ' + node.textContent;
    document.body.appendChild( elem );
}
// loads the content of a remote URL
function doRequest(url) {
    var req = new XMLHttpRequest();
    req.onreadystatechange = function() {
        if (req.readyState == 4) {
            var str = req.responseText;
            html = new air.HTMLLoader();
            html.addEventListener(air.Event.COMPLETE, onHTMLLoadComplete);
            html.loadString(str);
        }
    };
    req.open('GET', url, true);
    req.send(null);
}
function openInBrowser(url) {
    air.navigateToURL( new air.URLRequest(url));
}
</script>
</head>
<body>
<h3>HTML to DOM for XPath</h3>
<ul>
<li>XMLHttpRequest object can reach into remote domains &mdash; the following loads http://www.adobe.com:
<br/>
<input type="button" onclick='doRequest("http://www.adobe.com");' value='doRequest("http://www.adobe.com");'/>
</li>
</ul>
</body>
</html>
Now I'd like to know if there's an option for it to not load
images when it parses the DOM. I assume it's still loading the
images, judging from the amount of time it took to load (I have dialup). If
not, I wonder if a regular expression could be made to wreck the
URLs of the images (by renaming the src attribute to something
else) and then search for the new attribute with XPath... pity I'm
no good at regular expressions.

Inserting HTML into JTextPane

Hi,
I am trying to insert HTML into JTextPane.
I am using the following code for the same.
JTextPane jedit = new JTextPane();
jedit.setContentType("text/html");
HTMLDocument doc = (HTMLDocument)jedit.getDocument();
String text = "<a href=\"???\">hyperlink</a>asd<a href=\"???\">hyperlink123</a>";
HTMLEditorKit editorKit = (HTMLEditorKit)jedit.getEditorKit();
editorKit.insertHTML(doc, doc.getLength(), text, 0, 0, null);
doc.insertString(doc.getLength(),"Hi testing",null);
text = "<a href=\"???\">hyperlink123</a>";
The problem is that the HTML gets inserted on a new line. I do not want the new line.
I know there is an API like insertBeforeEnd(...), but I do not know how to use it.
Any help for the above problem will be of great use.
Thanks.

Look, I have got the answer... I guess this is the root cause of the new-line problem.
Whenever you want the text to be inserted
- on a new line, pass null as the last argument of HTMLEditorKit.insertHTML
- on the same line, pass the HTML tag you are inserting (an HTML.Tag constant) as the last argument. That's it!!!
ENJOY :-)

Parse siblings in DOM

Hi,
HELP!!!
I went through all the possible help code from the net which tells me how to access the value/child of a tag.
But what I need is to access the column values for each table. There are several tables in this xml file. So, I search for the table name through getElementsByTagName(), then I need to parse for the column-names. I'm not able to make it work with getNextSibling().
<table-name>hist_client</table-name>
<data-element>
<column-name>camp_id</column-name>
<column-name>ccr_id</column-name>
<column-name>record_id</column-name>
<column-name>record_num</column-name>
</data-element>
-thanx

Hey Sai,
Thanx.
I believe I'm using the DOM parser. My code is very similar to this first program -'Reading the XML':
http://java.sun.com/webservices/docs/1.0/tutorial/doc/JAXPXSLT5.html
The article was quite enlightening:
As opposed to SAX, the DOM specification covers only the tree representation of the XML document. Instantiating and using the parser is not actually covered by DOM itself, and the specific implementation must be named directly in the application code. After the input document has been parsed, the resulting DOM tree can be retrieved from the parser using the getDocument function, which returns a Document instance. The Document interface extends the Node interface and represents the root node of the document. It is then used with the appropriate unmarshaller class, similar to the SAX case.
I'm probably using the DOM because I don't need to do anything more than get info about the nodes.
Why did you ask? So, where did I go wrong???
-s
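
A frequent reason getNextSibling() appears not to work on XML like the fragment in the question is that the next sibling of table-name is usually a whitespace Text node, not the data-element element. The sketch below skips non-element siblings. It assumes the fragment sits inside some root element (the question does not show it), and ColumnLookup is a hypothetical name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ColumnLookup {
    /** Returns the column-name values that follow the given table-name. */
    static List<String> columnsFor(String xml, String tableName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> columns = new ArrayList<>();
        NodeList names = doc.getElementsByTagName("table-name");
        for (int i = 0; i < names.getLength(); i++) {
            Node nameNode = names.item(i);
            if (!tableName.equals(nameNode.getTextContent().trim())) {
                continue;
            }
            // Walk past whitespace text nodes and comments to the next element.
            Node sibling = nameNode.getNextSibling();
            while (sibling != null && sibling.getNodeType() != Node.ELEMENT_NODE) {
                sibling = sibling.getNextSibling();
            }
            if (sibling != null && "data-element".equals(sibling.getNodeName())) {
                NodeList cols = ((Element) sibling).getElementsByTagName("column-name");
                for (int j = 0; j < cols.getLength(); j++) {
                    columns.add(cols.item(j).getTextContent().trim());
                }
            }
        }
        return columns;
    }
}
```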

Parsing HTML files

Hello,
I have a question about parsing HTML files. Usually, when I get an HTML file and need to find all the text in it, I do the following. This code collects all of the hyperlinks and ignores all the HTML tags, keeping just the actual text. It's fine for smaller files, but occasionally I'll hit a large online text file; it works, but it's way too slow for large files. I don't need to do all of this HTML tag stripping for plain text files, however. Is there a way to still grab all the text without doing any tag searching, to make it faster?
thanks,
private void find() throws IOException
{
    //Really slow for large text files. Need a way to just use a regular scanner on an internet text file
    new ParserDelegator().parse(new InputStreamReader(myBase.openStream()),
            new ParserListener(),
            true);
}
/**
 * Inner class for processing all "<a href.."> tags when reading a base URL.
 */
private class ParserListener extends HTMLEditorKit.ParserCallback
{
    final String IGNORED_LINKS = "^(http|mailto|\\W).*";
    public void handleStartTag (HTML.Tag t, MutableAttributeSet a, int pos)
    {
        if (t == HTML.Tag.A)
        {
            String href = (String)(a.getAttribute(HTML.Attribute.HREF));
            //System.out.println(href);
            //System.out.println(href.matches(IGNORED_LINKS) + "\t" + href);
            if (! (href == null || href.matches(IGNORED_LINKS)) && !myURLs.contains(href))
                myURLs.add(href);
        }
        //TODO fix
        if (t == HTML.Tag.TITLE)
        {
            String title = (String) (a.getAttribute(HTML.Attribute.TITLE));
            if (!(title == null))
                myTitle = title;
            else myTitle = "No title was found";
        }
    }
    public void handleText (char[] data, int pos)
    {
        myText.append(" ");
        myText.append(data);
    }
}

JFactor2004 wrote:
My question is. If I know an html file is actually just a txt file
This isn't a question. HTML files are text by definition.
is it possible to look through it (maybe use something similar to a regular scanner) without doing anything with html.
That depends on what you mean by "doing something with HTML". You can certainly read it one line at a time.
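
Reading one line at a time, without involving the HTML parser at all, might look like the sketch below. PlainTextReader and readAll are hypothetical names. The method takes any Reader, so for the code in the question it could be called with new InputStreamReader(myBase.openStream()).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

class PlainTextReader {
    /** Reads the whole character stream line by line; no tag processing is done. */
    static String readAll(Reader source) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) when done
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Deciding when to take this path is a separate question; checking the Content-Type reported by the URLConnection for text/plain would be one option.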

Parse HTML document embedded in IFRAME

Dear fellows:
How can I access contents of an HTML document embedded in an IFRAME tag, by using java class HTMLEditorKit.Parser?
It is well known that the contents of such an embedded HTML document can be accessed by JavaScript on the front end. However, I am more interested in processing it on the back end, using HTMLEditorKit.Parser or any other Java Swing API.
Thanks for help.

The javax.swing.text.html framework barely supports HTML 3.2.

Parse html js and ajax

I am trying to write a parser for a website. The site has a list of elements that I need to read and compare with my database. I have used HTMLEditorKit.ParserCallback to parse it and Apache HttpClient to connect to it. On other websites this works fine, but this website uses JavaScript and maybe Ajax. How can I process JavaScript from a website with Java? Is there any Java class that works like a web browser, so that I can fill in form inputs, click buttons, and read the final markup after this process?

I am trying to write a parser for a website. The site has a list of elements that I need to read and compare with my database. I have used HTMLEditorKit.ParserCallback to parse it and Apache HttpClient to connect to it.
This is fundamentally a hard problem. And even when you "solve" it, your solution will inherently be brittle. You presumably have no control over the site's designers, who can (and likely will) change it. Then your solution will need to be updated.
On other websites this works fine, but this website uses JavaScript and maybe Ajax. How can I process JavaScript from a website with Java?
There will either be embedded JavaScript on the page or a URL linking to it. Either way, you can grab the embedded JavaScript or make a separate URLConnection call to get the JavaScript linked in the URL. From that point, you would want to use a JavaScript parser. However, the designer of the site can do just about anything with code (as opposed to static content such as HTML). The JavaScript itself may be obfuscated or minified. This is an even harder task than simply obtaining elements in markup.
Is there any Java class that works like a web browser, so that I can fill in form inputs, click buttons, and read the final markup after this process?
I have no idea what that might be or what you are looking for. What are you actually trying to accomplish? Why do you feel you need to analyze a third party's web page and determine how it works in a semi-automated fashion?
- Saish
