HTML parsing: JavaScript

I have to parse some HTML pages and collect the links present on each page.
I am using HTMLEditorKit.ParserCallback to parse the pages.
Some pages contain JavaScript code, e.g. in an onclick attribute.
How can I parse the JavaScript code and fetch the URLs it contains?
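
One possible approach, as a minimal sketch rather than a full answer: the Swing parser does not interpret JavaScript, but it does hand the raw onclick text to the callback as an ordinary attribute, so a regular expression can pull quoted string literals out of it. The class name OnclickLinkCollector, the helper names and the "contains a dot or slash" rule are assumptions made up for this example; real scripts that build URLs dynamically would need an actual JavaScript engine.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
class OnclickLinkCollector extends HTMLEditorKit.ParserCallback {
    // Matches a single- or double-quoted string literal in script text.
    private static final Pattern QUOTED = Pattern.compile("(['\"])(.*?)\\1");

    final List<String> urls = new ArrayList<>();

    // Returns the string literals in a JavaScript snippet that look like URLs.
    static List<String> extractUrls(String js) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(js);
        while (m.find()) {
            String literal = m.group(2);
            // Assumption: a literal containing a dot or a slash is a URL candidate.
            if (literal.contains(".") || literal.contains("/")) {
                found.add(literal);
            }
        }
        return found;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        scan(a);
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        scan(a);
    }

    // Looks for an onclick attribute by name, whatever type its key has.
    private void scan(MutableAttributeSet a) {
        Enumeration<?> names = a.getAttributeNames();
        while (names.hasMoreElements()) {
            Object key = names.nextElement();
            if ("onclick".equalsIgnoreCase(key.toString())) {
                urls.addAll(extractUrls(String.valueOf(a.getAttribute(key))));
            }
        }
    }

    // Parses an HTML string and returns the URL candidates found in onclick attributes.
    static List<String> collect(String html) {
        OnclickLinkCollector collector = new OnclickLinkCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.urls;
    }
}
```

The attribute is looked up by comparing key names because onclick is not one of the HTML.Attribute constants, so it cannot be requested through a predefined key.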

Similar Messages

Webpage (HTML) parsing...

Any ideas on how to parse an HTML page? I'm trying to do it with a StreamTokenizer but with little success. I don't think this class was made to do this sort of thing, ordinarily anyway. Is there a better choice? StringTokenizer? Here's what I have so far:
URLConnection uc = url.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader
                                        (uc.getInputStream()));
StreamTokenizer stok = new StreamTokenizer(br);
stok.eolIsSignificant(false);
String inputLine;
for (int i = 0; stok.nextToken() != StreamTokenizer.TT_EOF; i++) {
    System.out.println("token #" + i + stok.toString());
}
It gives me a result like this:
token #0Token['<'], line 3
token #1Token[script], line 3
token #2Token[language], line 3
token #3Token['='], line 3
token #4Token[javascript], line 3
token #5Token['>'], line 3
token #6Token['<'], line 4
token #7Token['!'], line 4
token #8Token['-'], line 4
token #9Token['-'], line 4
token #10Token[function], line 5
token #11Token[dojump], line 5
token #12Token['('], line 5
token #13Token[')'], line 5
token #14Token['{'], line 6
token #15Token[document.location.href], line 7
token #16Token['='], line 7
token #17Token[play247.asp?page=promo&id=72&r=R2], line 7
What I want is all the links that have "promo" as a parameter, like the one in token #17. Any suggestions?

Java has a callback parser, which notifies you when start and end tags are found. You can then query the attributes and search for the desired string. Here's a sample to get you started:
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class TestParser extends HTMLEditorKit.ParserCallback {
     boolean ignoreText;
     public static void main(String[] args)
     throws IOException {
          TestParser parser = new TestParser();
          // args[0] is the file to parse
          Reader reader = new FileReader(args[0]);
          try {
               new ParserDelegator().parse(reader, parser, false);
          } catch (IOException e) {
               System.out.println(e);
          }
     }
     public void handleComment(char[] data, int pos) {
          System.out.println(data);
     }
     public void handleEndOfLineString(String eol) {
     }
     public void handleEndTag(HTML.Tag tag, int pos) {
          System.out.println("/" + tag);
     }
     public void handleError(String errorMsg, int pos) {
          System.out.println(pos + ":" + errorMsg);
     }
     public void handleMutableTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
          System.out.println("mutable:" + tag + ": " + pos + ": " + a);
     }
     public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
          System.out.println( tag + ":" + a );
     }
     public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
          System.out.println( tag + ":" + a );
     }
     public void handleText(char[] data, int pos) {
          System.out.println( data );
     }
}
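
Once the callback has produced candidate URLs, the "promo" filtering itself can be done on the query string. The following is only a sketch under the assumption that "promo as a parameter" means a parameter value, as in page=promo; the class and method names (PromoFilter, hasParamValue, keepMatching) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for filtering URLs by a query parameter value.
class PromoFilter {
    // True if any query parameter of the URL has exactly the wanted value.
    static boolean hasParamValue(String url, String wanted) {
        int q = url.indexOf('?');
        if (q < 0) {
            return false; // no query string at all
        }
        String query = url.substring(q + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash); // drop the fragment
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            if (value.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the URLs that carry the wanted parameter value.
    static List<String> keepMatching(List<String> urls, String wanted) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (hasParamValue(url, wanted)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Comparing whole parameter values avoids false hits that a plain contains("promo") check would give for URLs such as promotion.asp.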

APPLESCRIPT AND HTML PARSING.

Hi,
I'm new to AppleScript, so I'm not quite sure whether what I want to do is actually called HTML parsing. Basically, I want an AppleScript variable that is linked to a value in the actual HTML, but I don't know how to make AppleScript access data inside HTML code. To give you a better idea, inside the HTML is something like this:
100
That value "100" changes, but its maximum is 100. I want to create a script that responds when the value starts to drop by loading another link.
Am I making sense? Again, what I'd like to achieve is to make AppleScript use that value INSIDE the HTML as its own variable (and perform the right actions as the value changes).
Any help would be appreciated.

First, you could open the site you mentioned in Safari and run a little JavaScript via AppleScript to get that value.
JavaScript is the "best" way to get a specific value out of an HTML element, but it only works in browsers.
e.g.
tell application "Safari"
open location "http://apple.com"
delay 6
set mypromo to do JavaScript "document.getElementById('promos').getElementsByTagName('a')[0].title" in document 1
display dialog "Title of first Promo is:" & return & mypromo
end tell
Or you could just download the raw source, convert it to text, and search for the phrase you are looking for,
e.g.
set mysource_html to do shell script "curl http://mysite.org/bla.html"
set mysource_txt to do shell script "curl http://mysite.org/bla.html | textutil -stdin -convert txt -format html -stdout"
if mysource_html contains "<a>100</a>" then
display dialog "Hey, value of 100 is reached"
end if
--or something like
if mysource_txt contains "100" then
display dialog "Hey, value of 100 is reached"
end if

Sending a JavaScript object variable to a servlet and parsing it into a Java variable

Can anyone help me send a JavaScript object variable from the client and parse it into a Java variable in a servlet? Here is what I mean:
Suppose I have an object variable var_js with these properties:
<script>
var var_js = {
id: 'var_js1',
name: 'this is var javascript',
allow_value_type: ['int', 'string', 'object']
};
</script>
/* After converting the JavaScript object to a Java variable (the part I hope you can help me with), the Java variable (var_java) should become something like: */
var_java.id = "var_js1";
var_java.name = "this is var javascript";
var_java.allow_value_type = {"int", "string", "object"}

You could have this html page:
<html>
<script>
var var_js = {
id: 'var_js1',
name: 'this is var javascript',
allow_value_type: ['int', 'string', 'object']
};
function send()
{
// Copy each property into its hidden field, then submit the form.
document.getElementById("id").value = var_js.id;
document.getElementById("name").value = var_js.name;
// The array is converted to the string "int,string,object".
document.getElementById("allow_value_type").value = var_js.allow_value_type;
document.myForm.submit();
}
</script>
<form name="myForm" action="http://localhost:8080/servlet/myServlet" method="post" >
<input id="id" name="id" type="hidden" value="">
<input id="name" name="name" type="hidden" value="">
<input id="allow_value_type" name="allow_value_type" type="hidden" value="">
<input id="cmdGo" type="button" value="Button" onClick="send()">
</form>
</html>
Then have a servlet like this:
import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
public class myServlet extends HttpServlet {
    public void doPost( HttpServletRequest request,
                        HttpServletResponse response )
            throws ServletException, IOException {
        String id = request.getParameter("id");
        String name = request.getParameter("name");
        // The parameter name must match the hidden field in the form.
        String allowed_value_type = request.getParameter("allow_value_type");
        Var_java var_java = new Var_java(id,name,allowed_value_type);
        // and you have your Java object
    }
}
class Var_java {
    String id;
    String name;
    String allowed_value_type;
    public Var_java(String id,String name,String allowed_value_type) {
        this.id=id;
        this.name=name;
        this.allowed_value_type=allowed_value_type;
    }
}
well...something like that i think.
Hope it helps.
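
To get back the array the question asked for, the comma-joined string can be split on the Java side. This is only a sketch: VarJava is a made-up name, it is kept free of the servlet API so it runs on its own, and it assumes none of the values contains a comma (JSON would be the safer transport if they can).

```java
// Hypothetical holder class mirroring the JavaScript object.
class VarJava {
    final String id;
    final String name;
    final String[] allowValueType;

    // allowValueTypeCsv is the raw request parameter, e.g. "int,string,object".
    VarJava(String id, String name, String allowValueTypeCsv) {
        this.id = id;
        this.name = name;
        // A JavaScript array assigned to an input's value is joined with commas,
        // so splitting on commas restores the individual entries.
        if (allowValueTypeCsv == null || allowValueTypeCsv.isEmpty()) {
            this.allowValueType = new String[0];
        } else {
            this.allowValueType = allowValueTypeCsv.split(",");
        }
    }
}
```

In the servlet, the third constructor argument would be the value returned by request.getParameter for the hidden field.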

How to parse a HTML file using HTML parser in J2SE?

I want to parse an HTML file using an HTML parser. Can anybody help me by providing sample code to parse the HTML file?
Thanks and cheers,
Amaresh

What HTML parser and what does "parsing" mean to you?

STYLE tag problem in HTML Parser.

Hi,
I am trying to parse an HTML file. I am able to extract the content of various tags like Tag.SPAN, Tag.DIV and so on.
I want to extract the text content of Tag.STYLE. What should I do? The problem is that the HTML parser right now does not support this tag, along with 5 more tags, among them Tag.META and Tag.PARAM.
Please help me out.

Before responding to this posting, you may want to check out the discussion in the OP's previous posting on this topic:
http://forum.java.sun.com/thread.jspa?threadID=634938

Don't understand error message from HTML parser?

I've written a simple test program to parse a simple html file.
Everything works fine except for the <img src="test.gif"> tag.
It understands the img tag and the handleSimpleTag gets called.
I can even pick out the src attribute. But I get a very strange error message.
When I run the test program below on the test.html file (also below) I get the following output:
handleError(134) = req.att srcimg?
What does "req.att srcimg?" mean?!?!?
/John
This is my test program:
import javax.swing.text.html.*;
import javax.swing.text.*;
import javax.swing.text.html.parser.*;
import java.io.*;
public class htmltest extends HTMLEditorKit.ParserCallback
{
public htmltest()
{
   super();
}
public void handleError(String errorMsg, int pos)
{
   System.err.println("handleError("+pos+") = " + errorMsg);
}
static public void main (String[] argv) throws Exception
{
    Reader reader = new FileReader("test.html");
    new ParserDelegator().parse(reader, new htmltest(), false);
}
}
This is the "test.html" file
<html>
<head>
</head>
<body>
This is a plain text.<br>
This is <b>bold</b> and this is <i>itallic</i>!<br>
<img src="test.gif">
"This >is also a plain test text."<br>
</body>
</html>
----------------------------------------------------------------------

The handleError() method is not well documented, and neither is the rest of the javax.swing.text.html package or its design. You can ignore what the method reports as long as the parser's other results are correct and your HTML file is valid.

Attempting to use HTML parser - getAttribute() not performing as expected.

How am I misusing getAttribute()?
I am expecting (String)a.getAttribute((String)"name") to give me a value other than null in the example below. What am I doing wrong?
The HTML test source (missing headers/body, so yes, it's not proper):
<input name="unit_1" size=5 maxsize=5 value="hr">
<input name="qty_1" size=5 value=4>
<input name="unit_1" size=5 maxsize=5 value="hr">
<input name="partnumber_1" size=10 value="Java Work">
<input name="description_1" size=50 value="Slip shod work at outragous prices">
<input name="sellprice_1" size=9 value=185.00>
<input name="discount_1" size=3 value=>
What I'd like to see is this:
About to parse test
Parsing error: invalid.tagattmaxsizeinput? at 39
Tag start(<html>, 1 attrs)
Tag start(<head>, 1 attrs)
Tag end(</head>)
Tag start(<body>, 1 attrs)
Tag(<input>, 4 attrs)
found input
unit_1
hr
Tag(<input>, 3 attrs)
found input
qty_1
4
Rather than this:
About to parse test
Parsing error: invalid.tagattmaxsizeinput? at 39
Tag start(<html>, 1 attrs)
Tag start(<head>, 1 attrs)
Tag end(</head>)
Tag start(<body>, 1 attrs)
Tag(<input>, 4 attrs)
found input
null
null
Tag(<input>, 3 attrs)
found input
null
null
The code that reads the HTML and give the output looks like this:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
/**
 * This small demo program shows how to use the
 * HTMLEditorKit.Parser and its implementing class
 * ParserDelegator in the Swing system.
 */
class DataSaved {
    String InputName;
    String InputValue;
    boolean IsHidden;
}
public class HtmlParseDemo {
    public static void main(String [] args) {
        DataSaved DataSet[];
        Reader r;
        if (args.length == 0) {
            System.err.println("Usage: java HTMLParseDemo [url | file]");
            System.exit(0);
        }
        String spec = args[0];
        try {
            if (spec.indexOf("://") > 0) {
                URL u = new URL(spec);
                Object content = u.getContent();
                if (content instanceof InputStream) {
                    r = new InputStreamReader((InputStream)content);
                } else if (content instanceof Reader) {
                    r = (Reader)content;
                } else {
                    throw new Exception("Bad URL content type.");
                }
            } else {
                r = new FileReader(spec);
            }
            HTMLEditorKit.Parser parser;
            System.out.println("About to parse " + spec);
            parser = new ParserDelegator();
            parser.parse(r, new HTMLParseLister(), true);
            r.close();
        } catch (Exception e) {
            System.err.println("Error: " + e);
            e.printStackTrace(System.err);
        }
    }
}
/**
 * HTML parsing proceeds by calling a callback for
 * each and every piece of the HTML document. This
 * simple callback class simply prints an indented
 * structural listing of the HTML data.
 */
class HTMLParseLister extends HTMLEditorKit.ParserCallback {
    int indentSize = 0;
    protected void indent() {
        indentSize += 3;
    }
    protected void unIndent() {
        indentSize -= 3; if (indentSize < 0) indentSize = 0;
    }
    protected void pIndent() {
        for(int i = 0; i < indentSize; i++) System.out.print(" ");
    }
    public void handleText(char[] data, int pos) {
        pIndent();
        System.out.println("Text(" + data.length + " chars)");
    }
    public void handleComment(char[] data, int pos) {
        pIndent();
        System.out.println("Comment(" + data.length + " chars)");
    }
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        pIndent();
        System.out.println("Tag start(<" + t.toString() + ">, " +
            a.getAttributeCount() + " attrs)");
        indent();
    }
    public void handleEndTag(HTML.Tag t, int pos) {
        unIndent();
        pIndent();
        System.out.println("Tag end(</" + t.toString() + ">)");
    }
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        String name;
        String value;
        boolean hidden;
        pIndent();
        System.out.println("Tag(<" + t.toString() + ">, " +
            a.getAttributeCount() + " attrs)");
        if( t==HTML.Tag.INPUT) {
            System.out.println("found input");
            name = (String)a.getAttribute((String)"name");
            value = (String)a.getAttribute((String)"value");
            System.out.println(name);
            System.out.println(value);
        }
    }
    public void handleError(String errorMsg, int pos){
        System.out.println("Parsing error: " + errorMsg + " at " + pos);
    }
}

System.out.println( a.getAttribute(HTML.Attribute.NAME) );
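
To spell out the one-liner above: the parser stores known attributes under HTML.Attribute constants, not under plain strings, so a lookup with the string "name" finds nothing. Here is a small self-contained sketch of the difference; the class name AttributeKeyDemo and its fields are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class comparing the two kinds of attribute keys.
class AttributeKeyDemo extends HTMLEditorKit.ParserCallback {
    // "name=value" pairs read with HTML.Attribute keys.
    final List<String> pairs = new ArrayList<>();
    // Results of looking up the same attribute with a plain String key.
    final List<Object> byStringKey = new ArrayList<>();

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.INPUT) {
            Object name = a.getAttribute(HTML.Attribute.NAME);
            Object value = a.getAttribute(HTML.Attribute.VALUE);
            pairs.add(name + "=" + value);
            // The String "name" is a different key from HTML.Attribute.NAME.
            byStringKey.add(a.getAttribute("name"));
        }
    }

    // Parses an HTML string and returns the populated callback.
    static AttributeKeyDemo parse(String html) {
        AttributeKeyDemo demo = new AttributeKeyDemo();
        try {
            new ParserDelegator().parse(new StringReader(html), demo, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return demo;
    }
}
```

Attributes the parser does not recognise (maxsize in the test source, for instance) have no HTML.Attribute constant, so for those a String key is the only option.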

Loading HTML with javascript from SDCard

Hi - my first post so please be kind
My handset is a BES connected 8100 running v4.2.1.66.
I've only recently discovered how to load files from the local media card using the file:///SDCard/example.html semantics, but I'm very frustrated I cannot seem to load scripts embedded in these files, that will load and run normally when the file is retrieved using http.
Is there some magic extension I need to give my files to enable processing of scripts from local files? I've tried the obvious file extensions without luck (htm, html, wml, jsp, asp).
It seems to me that for local files the handset is setting the MIME type from the extension alone, given the error you get when you load a file with an unknown extension - "The returned page had no content type, and therefore cannot be processed."
Can anybody help me shed any light on this behavior and how to get around my lack of javascript on locally loaded files?
Feel free to slap me if this is a known bug in the software I'm running, or has been answered 50 times.
Many thanks; Andrew.

Under which folder can we put the HTML and JavaScript files? I have installed the BlackBerry SDK 4.6 (i.e.
Research In Motion\BlackBerry JDE 4.6.1).
Objective: to open and view a local HTML file in the BlackBerry browser.

Exception in html parser under Linux

Hi all,
The following code is copied from Tech Tip 23Sep1999. I have compiled it and run it under Win98, where it works fine for any URI. However, when I try to run it under Linux, it throws exceptions. I noticed that some web sites can be parsed with the program under Linux but some can't. I wonder what the difference between the platforms is. Can anyone tell me how to make the program work under Linux?
Rgds,
unplug
configuration
RedHat 7.1
JDK1.3.1
Failed: java GetLinks http://java.sun.com
Worked: java GetLinks http://www.apache.org
--beginning of code
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
class GetLinks {
  public static void main(String[] args) {
    EditorKit kit = new HTMLEditorKit();
    Document doc = kit.createDefaultDocument();
    // The Document class does not yet
    // handle charset's properly.
    doc.putProperty("IgnoreCharsetDirective",
      Boolean.TRUE);
    try {
      // Create a reader on the HTML content.
      Reader rd = getReader(args[0]);
      // Parse the HTML.
      kit.read(rd, doc, 0);
      // Iterate through the elements
      // of the HTML document.
      ElementIterator it = new ElementIterator(doc);
      javax.swing.text.Element elem;
      while ((elem = it.next()) != null) {
        SimpleAttributeSet s = (SimpleAttributeSet)
          elem.getAttributes().getAttribute(HTML.Tag.A);
        if (s != null) {
          System.out.println(
            s.getAttribute(HTML.Attribute.HREF));
        }
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
    System.exit(1);
  }
  // Returns a reader on the HTML data. If 'uri' begins
  // with "http:", it's treated as a URL; otherwise,
  // it's assumed to be a local filename.
  static Reader getReader(String uri)
    throws IOException {
    if (uri.startsWith("http:")) {
      // Retrieve from Internet.
      URLConnection conn=
        new URL(uri).openConnection();
      return new
        InputStreamReader(conn.getInputStream());
    } else {
      // Retrieve from file.
      return new FileReader(uri);
    }
  }
}
--End of code
--Exception in Linux
Exception in thread "main" java.lang.NoClassDefFoundError
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:120)
at java.awt.Toolkit$2.run(Toolkit.java:512)
at java.security.AccessController.doPrivileged(Native Method)
at java.awt.Toolkit.getDefaultToolkit(Toolkit.java:503)
at javax.swing.text.html.CSS.getValidFontNameMapping(CSS.java:932)
at javax.swing.text.html.CSS$FontFamily.parseCssValue(CSS.java:1789)
at javax.swing.text.html.CSS.getInternalCSSValue(CSS.java:531)
at javax.swing.text.html.CSS.addInternalCSSValue(CSS.java:516)
at javax.swing.text.html.StyleSheet.addCSSAttribute(StyleSheet.java:436)
at javax.swing.text.html.HTMLDocument$HTMLReader$ConvertAction.start(HTMLDocument.java:2536)
at javax.swing.text.html.HTMLDocument$HTMLReader.handleStartTag(HTMLDocument.java:1992)
at javax.swing.text.html.parser.DocumentParser.handleStartTag(DocumentParser.java:145)
at javax.swing.text.html.parser.Parser.startTag(Parser.java:333)
at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1786)
at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1821)
at javax.swing.text.html.parser.Parser.parse(Parser.java:1980)
at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:109)
at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:74)
at javax.swing.text.html.HTMLEditorKit.read(HTMLEditorKit.java:239)
at GetLinks.main(GetLinks.java:23)

Support for CSS and clearly defined. Also, Dictionary getDocumentProperties() is not properly explained, meaning it doesn't give methods to get all the properties an HTML document can have.
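
Looking at the stack trace in the question, the failure starts in java.awt.Toolkit.getDefaultToolkit, which is reached from the CSS font handling while HTMLEditorKit.read builds the document. If only the links are needed, one possible workaround is to skip the document model and collect the href values directly from the parser callbacks. This is a sketch under the assumption that the callback route does not touch the AWT toolkit on the affected machine, which has not been verified here; the class name CallbackGetLinks is made up.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical variant of GetLinks that never creates an HTMLDocument.
class CallbackGetLinks extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Collects the href of every anchor read from the given reader.
    static List<String> getLinks(Reader rd) throws IOException {
        CallbackGetLinks callback = new CallbackGetLinks();
        // true: ignore any charset directive found in the page
        new ParserDelegator().parse(rd, callback, true);
        return callback.links;
    }

    // Convenience overload for HTML that is already in memory.
    static List<String> getLinks(String html) {
        try {
            return getLinks(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The Reader returned by getReader in the original program can be passed straight to getLinks(Reader).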

Use of HTML and Javascript within EP

I have a newbie question:
I have several HTML documents with embedded JavaScript, e.g. various calculators we use on the current website.
If I want to migrate these HTML sources to EP content, what is the best way to do that?
I assume that all existing HTML and JavaScript will render as normal without much development involved?
Is there a good how-to example that I can use to demonstrate this?
many thanks for your help

Hi,
Well, there are two options:
1) You make a portal component and then use it in the portal.
2) You use your earlier HTML document as it is inside the portal.
In case 1 you need to create the portal objects and then a portal project. You can use JavaScript in that as well.
In case 2 you can directly create a URL iView for the HTML document and then view it from the portal. However, this is not a good way of using your JavaScript. Personally, I suggest you go for a portal project.
I hope this helps you!
Regards
Pravesh
PS: Please consider rewarding points.

HTML parser in J2ME

Hi all,
I'm stuck with the same problem too. I'm developing a J2ME (MIDlet) application in which I have to open an HTTP connection. I also want to parse the HTML response and display the contents using J2ME elements on the mobile. I'm not able to solve this problem. Please help me if anyone has come across a solution to this problem.
Below links are the related threads:
http://forums.sun.com/thread.jspa?forumID=76&threadID=250460
http://forums.sun.com/thread.jspa?forumID=76&threadID=5235530
Thanks in advance
Nandy

> Hi All,
> I like to ask if anyone knows if there is a HTML
> parser available in J2ME? I am building an application
> that needs to display HTML on the client.
Try Google; a few do exist, but I don't know about free ones.
> Alternatively I may consider using XML, however I
> learnt that parsing XML is expensive in terms of
> computing power - is it the same for HTML?
If you are controlling the content returned, the two would be about the same, as XML and HTML have the same roots. Some XML parsers do exist, and are free to use.
You might be best off returning a custom format, designed around the limitations of the device you are using.

What HTML and JavaScript engine is used within Adobe AIR on the desktop?

HTML and JavaScript within Adobe AIR are handled by the WebKit HTML/JavaScript engine.

I've made a little headway with this. Within your initHandler just make a call to login:
FacebookMobile.login(loginCallback, this.stage, [], webview);
webview is a StageWebView instance with the viewPort defined. If I left it null, or didn't set the viewPort nothing happens...
var webview:StageWebView = new StageWebView();
webview.viewPort = new Rectangle(0,0,400,400);
I'm now getting a login screen.

Can i publish HTML or JavaScript on my iWeb pages?

I don't know how to embed HTML or JavaScript. I'm adding advertising affiliates (Amazon, etc.) but I cannot post their code. I've had to paste the GIF logo and then create a link.

Here is a thread in which I have listed the steps involved in adding external HTML to your iWeb pages. It's pretty straightforward and you can apply this technique to pretty much any kind of HTML that you would like to add...
http://discussions.apple.com/click.jspa?searchID=-1&messageID=2446855
Good luck! Let me know if you have any problems.

Error on HTML Parser

Hi,
I'm trying to parse a HTML page but I always get the same error, which is the following exception:
javax.swing.text.ChangedCharSetException
In the class ParserCallback I'm using the method handleError and it shows:
req.att contentmeta?
ioexception???
just before the exception occurs.
The only line where this error occurs in the html page is:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
and I know that the exact point is the attribute 'content'. If it is removed or changed to 'contenttype', the error disappears.
The problem is that I can't change the attribute because the HTML page is not mine; it is fetched from the Web. And I don't want to remove it.
Does anybody know what is happening?
Thanks!!
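
A possible way around this, assuming the page is parsed through ParserDelegator: the third argument of parse is the ignoreCharSet flag, and passing true tells the parser to carry on when it meets a charset declaration instead of throwing ChangedCharSetException. The sketch below only lists the tag names it sees; CharsetTolerantParse and tagNames are names invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper that parses a page while ignoring its charset directive.
class CharsetTolerantParse {
    // Returns the names of the tags reported by the parser, in order.
    static List<String> tagNames(String html) {
        final List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tags.add(t.toString());
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tags.add(t.toString());
            }
        };
        try {
            // Third argument is ignoreCharSet: true skips the charset check.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }
}
```

The trade-off is that the Reader's existing encoding is used for the whole page. If the declared charset matters, the alternative is to catch ChangedCharSetException, read the charset from getCharSetSpec(), and reopen the stream with that encoding.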

I am also having a problem with HTML parsing in Java.
I have given a complete description of the problem at this link, along with the log and my sample code:
http://forum.java.sun.com/thread.jspa?threadID=643683&tstart=0
If you could take a look...
