HTML parsing : javascript

I have to parse some HTML pages and collect the links present on each page.
I am using HTMLEditorKit.ParserCallback to parse the pages.
Some pages contain JavaScript code, e.g. in an onclick attribute.
How can I parse the JavaScript code and fetch the URL it contains?
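
One possible approach, sketched here under assumptions: keep using HTMLEditorKit.ParserCallback, look for an onclick attribute on every tag, and pull quoted string literals out of the script text with a regular expression. The class name OnclickUrlCollector and its helper methods are hypothetical, and the regex is only a heuristic (it does not understand JavaScript, escaped quotes, or URLs built by concatenation).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects string literals found in onclick handlers
// as URL candidates. The caller still has to decide which ones are URLs.
class OnclickUrlCollector extends HTMLEditorKit.ParserCallback {
    // Matches '...' or "..." without nested or escaped quotes (a heuristic).
    private static final Pattern QUOTED = Pattern.compile("'([^']*)'|\"([^\"]*)\"");

    final List<String> candidates = new ArrayList<>();

    // Returns the contents of every quoted literal in the script text.
    static List<String> quotedStrings(String script) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(script);
        while (m.find()) {
            found.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return found;
    }

    private void inspect(MutableAttributeSet attrs) {
        // Attribute keys may be HTML.Attribute constants or plain strings,
        // so compare on the key's string form.
        for (Enumeration<?> names = attrs.getAttributeNames(); names.hasMoreElements();) {
            Object key = names.nextElement();
            if ("onclick".equalsIgnoreCase(key.toString())) {
                candidates.addAll(quotedStrings(String.valueOf(attrs.getAttribute(key))));
            }
        }
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        inspect(attrs);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        inspect(attrs);
    }

    // Parses the HTML text and returns all candidates found in onclick attributes.
    static List<String> collect(String html) {
        OnclickUrlCollector collector = new OnclickUrlCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.candidates;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"#\" "
                + "onclick=\"window.location='page.html?id=1'\">go</a></body></html>";
        System.out.println(collect(html));
    }
}
```

Anything beyond simple literals (for example a URL assembled at run time) would probably need a real JavaScript engine rather than a regex.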

Similar Messages

  • Webpage (HTML) parsing...

    Any ideas on how to parse an HTML page? I'm trying to do it with a StreamTokenizer but with little success. I don't think this class was made for this sort of thing, ordinarily anyway. Is there a better choice? StringTokenizer? Here's what I have so far:
      URLConnection uc = url.openConnection();
      BufferedReader br = new BufferedReader(new InputStreamReader
                                            (uc.getInputStream()));
      StreamTokenizer stok = new StreamTokenizer(br);
      stok.eolIsSignificant(false);
      String inputLine;
      for (int i=0; (stok.nextToken() != StreamTokenizer.TT_EOF); i++) {
        System.out.println("token #" + i + stok.toString());
      }
    It gives me a result like this:
    token #0Token['<'], line 3
    token #1Token[script], line 3
    token #2Token[language], line 3
    token #3Token['='], line 3
    token #4Token[javascript], line 3
    token #5Token['>'], line 3
    token #6Token['<'], line 4
    token #7Token['!'], line 4
    token #8Token['-'], line 4
    token #9Token['-'], line 4
    token #10Token[function], line 5
    token #11Token[dojump], line 5
    token #12Token['('], line 5
    token #13Token[')'], line 5
    token #14Token['{'], line 6
    token #15Token[document.location.href], line 7
    token #16Token['='], line 7
    token #17Token[play247.asp?page=promo&id=72&r=R2], line 7
    What I want is all the links that have "promo" as a parameter. Any suggestions?

    Java has a callback parser, which notifies you when start and end tags are found. You can then query the attributes and search for the desired string. Here's a sample to get you started:
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;
    public class TestParser extends HTMLEditorKit.ParserCallback {
         boolean ignoreText;
         public static void main(String[] args)
         throws IOException {
              TestParser parser = new TestParser();
              // args[0] is the file to parse
              Reader reader = new FileReader(args[0]);
              try {
                   // the parser calls back into the handle* methods below
                   new ParserDelegator().parse(reader, parser, false);
              } catch (IOException e) {
                   System.out.println(e);
              }
         }
         public void handleComment(char[] data, int pos) {
              System.out.println(data);
         }
         public void handleEndOfLineString(String eol) {
         }
         public void handleEndTag(HTML.Tag tag, int pos) {
              System.out.println("/" + tag);
         }
         public void handleError(String errorMsg, int pos) {
              System.out.println(pos + ":" + errorMsg);
         }
         public void handleMutableTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
              System.out.println("mutable:" + tag + ": " + pos + ": " + a);
         }
         // tags without an end tag, e.g. <br> or <img>
         public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
              System.out.println( tag + ":" + a );
         }
         public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
              System.out.println( tag + ":" + a );
         }
         public void handleText(char[] data, int pos) {
              System.out.println( data );
         }
    }
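
    The sample above prints every tag. To "search for the desired string", a narrower callback could keep only anchor tags whose href contains a given substring such as "promo". This is a minimal sketch; the class name PromoLinkFilter and its find method are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: keeps <a href="..."> values containing a substring.
class PromoLinkFilter extends HTMLEditorKit.ParserCallback {
    private final String needle;
    final List<String> matches = new ArrayList<>();

    PromoLinkFilter(String needle) {
        this.needle = needle;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            // href is looked up with the HTML.Attribute constant, not a string
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null && href.toString().contains(needle)) {
                matches.add(href.toString());
            }
        }
    }

    // Parses the HTML text and returns the matching href values in document order.
    static List<String> find(String html, String needle) {
        PromoLinkFilter filter = new PromoLinkFilter(needle);
        try {
            new ParserDelegator().parse(new StringReader(html), filter, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return filter.matches;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"play247.asp?page=promo\">a</a>"
                + "<a href=\"index.html\">b</a></body></html>";
        System.out.println(find(html, "promo"));
    }
}
```

    Note that this only sees href attributes. A URL assigned inside a script block, like the document.location.href line in the token dump, would not be reported by it.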

  • APPLESCRIPT AND HTML PARSING.

    Hi,
    I'm new to AppleScript, so I'm not quite sure if what I want to do is actually called HTML parsing. Basically, I want a variable in AppleScript that is linked to a value in the actual HTML, but I don't know how to make AppleScript access data inside HTML code. To give you a better idea, inside the HTML is something like this:
    100
    That value changes, but its maximum is 100. I want to create a script that responds when the value starts to drop by loading another link.
    Am I making sense? Again, what I'd like to achieve is to make AppleScript use that value inside the HTML as its own variable (and perform the right actions as the value changes).
    Any help would be appreciated.

    First, you could open the site you mentioned in Safari and run a little JavaScript via AppleScript to get that value.
    JavaScript is the "best" way to get a specific value out of an HTML element, but it only works in browsers.
    e.g.
    tell application "Safari"
    open location "http://apple.com"
    delay 6
    set mypromo to do JavaScript "document.getElementById('promos').getElementsByTagName('a')[0].title" in document 1
    display dialog "Title of first Promo is:" & return & mypromo
    end tell
    Or you could just download the raw source, convert it to text, and search for the phrase you are looking for
    e.g.
    set mysource_html to do shell script "curl http://mysite.org/bla.html"
    set mysource_txt to do shell script "curl http://mysite.org/bla.html | textutil -stdin -convert txt -format html -stdout"
    if mysource_html contains "<a>100</a>" then
    display dialog "Hey, value of 100 is reached"
    end if
    --or something like
    if mysource_txt contains "100" then
    display dialog "Hey, value of 100 is reached"
    end if

  • Sending and parse javascript object variable to java variable

    Can anyone help me send a JavaScript object variable from the client and parse it into a Java variable in a servlet? Here is what I mean:
    Suppose I have an object variable var_js with its properties:
    <script>
    var var_js = {
    id: 'var_js1',
    name: 'this is var javascript',
    allow_value_type: ['int', 'string', 'object']
    };
    </script>
    /* After converting the JavaScript object to a Java variable (the part I hope you can help with), the Java variable (var_java) should become something like: */
    var_java.id = "var_js1";
    var_java.name = "this is var javascript";
    var_java.allow_value_type = {"int", "string", "object"}

    You could have this HTML page:
    <html>
    <script>
    var var_js = {
    id: 'var_js1',
    name: 'this is var javascript',
    allow_value_type: ['int', 'string', 'object']
    };
    function send() {
    document.getElementById("id").value = var_js.id;
    document.getElementById("name").value = var_js.name;
    // assigning the array stores it as the string "int,string,object"
    document.getElementById("allow_value_type").value = var_js.allow_value_type;
    document.myForm.submit();
    }
    </script>
    <form name="myForm" action="http://localhost:8080/servlet/myServlet" method="post" >
    <input id="id" name="id" type="hidden" value="">
    <input id="name" name="name" type="hidden" value="">
    <input id="allow_value_type" name="allow_value_type" type="hidden" value="">
    <input id="cmdGo" type="button" value="Button" onClick="send()">
    </form>
    </html>
    Then have a servlet like this:
    import javax.servlet.*;
    import javax.servlet.http.*;
    import java.io.*;
    public class myServlet extends HttpServlet {
    public void doPost( HttpServletRequest request,
    HttpServletResponse response )
    throws ServletException, IOException {
    String id = request.getParameter("id");
    String name = request.getParameter("name");
    // the parameter name must match the name attribute of the hidden input
    String allowed_value_type = request.getParameter("allow_value_type");
    Var_java var_java = new Var_java(id,name,allowed_value_type);
    // and you have your Java object
    }
    }
    class Var_java {
    String id;
    String name;
    String allowed_value_type;
    public Var_java(String id,String name,String allowed_value_type) {
    this.id=id;
    this.name=name;
    this.allowed_value_type=allowed_value_type;
    }
    }
    Well... something like that, I think.
    Hope it helps.
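
    In the page above, the JavaScript array is assigned to a text field, so it reaches the servlet as one comma-joined string ("int,string,object") rather than as an array. A small sketch of turning that string back into a String[] follows; the class and method names are hypothetical, and it assumes the individual values never contain commas.

```java
import java.util.Arrays;

// Hypothetical helper for the allow_value_type request parameter.
class AllowValueTypes {
    // Splits "int,string,object" into {"int", "string", "object"}.
    // A missing or blank parameter yields an empty array.
    static String[] parse(String param) {
        if (param == null || param.trim().isEmpty()) {
            return new String[0];
        }
        // tolerate optional whitespace around the commas
        return param.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("int,string,object")));
    }
}
```

    If the values could contain commas, sending the object as JSON and parsing it on the server would likely be more robust than this split.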

  • How to parse a HTML file using HTML parser in J2SE?

    I want to parse an HTML file using an HTML parser. Can anybody help me by providing sample code to parse the HTML file?
    Thanks and cheers,
    Amaresh

    What HTML parser and what does "parsing" mean to you?

  • STYLE tag problem in HTML Parser.

    Hi,
    I am trying to parse an HTML file. I am able to extract the content of various tags like Tag.SPAN, Tag.DIV and so on.
    I want to extract the text content of Tag.STYLE. What should I do? The problem is that the HTML parser currently does not support this tag, along with five more tags such as Tag.META and Tag.PARAM.
    Please help me out.

    Before responding to this posting, you may want to check out the discussion in the OP's previous posting on this topic:
    http://forum.java.sun.com/thread.jspa?threadID=634938

  • Don't understand error message from HTML parser?

    I've written a simple test program to parse a simple html file.
    Everything works fine except for the <img src="test.gif"> tag.
    It understands the img tag and the handleSimpleTag gets called.
    I can even pick out the src attribute. But I get a very strange error message.
    When I run the test program below on the test.html file (also below) I get the following output:
    handleError(134) = req.att srcimg?
    What does "req.att srcimg?" mean?!?!?
    /John
    This is my test program:
    import javax.swing.text.html.*;
    import javax.swing.text.*;
    import javax.swing.text.html.parser.*;
    import java.io.*;
    public class htmltest extends HTMLEditorKit.ParserCallback {
    public htmltest() {
       super();
    }
    public void handleError(String errorMsg, int pos) {
       System.err.println("handleError("+pos+") = " + errorMsg);
    }
    static public void main (String[] argv) throws Exception {
        Reader reader = new FileReader("test.html");
        new ParserDelegator().parse(reader, new htmltest(), false);
    }
    }
    This is the "test.html" file
    <html>
    <head>
    </head>
    <body>
    This is a plain text.<br>
    This is <b>bold</b> and this is <i>itallic</i>!<br>
    <img src="test.gif">
    "This >is also a plain test text."<br>
    </body>
    </html>
    ----------------------------------------------------------------------

    The handleError() method is poorly documented, as is the whole javax.swing.text.html package and its design. You can ignore what the method reports as long as the parser's other results are correct and your HTML file is valid.

  • Attempting to use HTML parser - getAttribute() not performing as expected.

    How am I misusing getAttribute()?
    I am expecting (String)a.getAttribute((String)"name") to give me a value other than null in the example below. What am I doing wrong?
    The HTML test source (missing headers/body, so yes, it's not proper):
    <input name="unit_1" size=5 maxsize=5 value="hr">
    <input name="qty_1" size=5 value=4>
    <input name="unit_1" size=5 maxsize=5 value="hr">
    <input name="partnumber_1" size=10 value="Java Work">
    <input name="description_1" size=50 value="Slip shod work at outragous prices">
    <input name="sellprice_1" size=9 value=185.00>
    <input name="discount_1" size=3 value=>
    What I'd like to see is this:
    About to parse test
    Parsing error: invalid.tagattmaxsizeinput? at 39
    Tag start(<html>, 1 attrs)
    Tag start(<head>, 1 attrs)
    Tag end(</head>)
    Tag start(<body>, 1 attrs)
    Tag(<input>, 4 attrs)
    found input
    unit_1
    hr
    Tag(<input>, 3 attrs)
    found input
    qty_1
    4
    Rather than this:
    About to parse test
    Parsing error: invalid.tagattmaxsizeinput? at 39
    Tag start(<html>, 1 attrs)
    Tag start(<head>, 1 attrs)
    Tag end(</head>)
    Tag start(<body>, 1 attrs)
    Tag(<input>, 4 attrs)
    found input
    null
    null
    Tag(<input>, 3 attrs)
    found input
    null
    null
    The code that reads the HTML and gives the output looks like this:
    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;
    /**
     * This small demo program shows how to use the
     * HTMLEditorKit.Parser and its implementing class
     * ParserDelegator in the Swing system.
     */
    class DataSaved {
        String InputName;
        String InputValue;
        boolean IsHidden;
    }
    public class HtmlParseDemo {
        public static void main(String [] args) {
            DataSaved DataSet[];
            Reader r;
            if (args.length == 0) {
                System.err.println("Usage: java HTMLParseDemo [url | file]");
                System.exit(0);
            }
            String spec = args[0];
            try {
                if (spec.indexOf("://") > 0) {
                    URL u = new URL(spec);
                    Object content = u.getContent();
                    if (content instanceof InputStream) {
                        r = new InputStreamReader((InputStream)content);
                    }
                    else if (content instanceof Reader) {
                        r = (Reader)content;
                    }
                    else {
                        throw new Exception("Bad URL content type.");
                    }
                }
                else {
                    r = new FileReader(spec);
                }
                HTMLEditorKit.Parser parser;
                System.out.println("About to parse " + spec);
                parser = new ParserDelegator();
                parser.parse(r, new HTMLParseLister(), true);
                r.close();
            }
            catch (Exception e) {
                System.err.println("Error: " + e);
                e.printStackTrace(System.err);
            }
        }
    }
    /**
     * HTML parsing proceeds by calling a callback for
     * each and every piece of the HTML document. This
     * simple callback class simply prints an indented
     * structural listing of the HTML data.
     */
    class HTMLParseLister extends HTMLEditorKit.ParserCallback
    {
        int indentSize = 0;
        protected void indent() {
            indentSize += 3;
        }
        protected void unIndent() {
            indentSize -= 3; if (indentSize < 0) indentSize = 0;
        }
        protected void pIndent() {
            for(int i = 0; i < indentSize; i++) System.out.print(" ");
        }
        public void handleText(char[] data, int pos) {
            pIndent();
            System.out.println("Text(" + data.length + " chars)");
        }
        public void handleComment(char[] data, int pos) {
            pIndent();
            System.out.println("Comment(" + data.length + " chars)");
        }
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            pIndent();
            System.out.println("Tag start(<" + t.toString() + ">, " +
                a.getAttributeCount() + " attrs)");
            indent();
        }
        public void handleEndTag(HTML.Tag t, int pos) {
            unIndent();
            pIndent();
            System.out.println("Tag end(</" + t.toString() + ">)");
        }
        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            String name;
            String value;
            boolean hidden;
            pIndent();
            System.out.println("Tag(<" + t.toString() + ">, " +
                a.getAttributeCount() + " attrs)");
            if( t==HTML.Tag.INPUT) {
                System.out.println("found input");
                name = (String)a.getAttribute((String)"name");
                value = (String)a.getAttribute((String)"value");
                System.out.println(name);
                System.out.println(value);
            }
        }
        public void handleError(String errorMsg, int pos){
            System.out.println("Parsing error: " + errorMsg + " at " + pos);
        }
    }

    System.out.println( a.getAttribute(HTML.Attribute.NAME) );
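
    The one-line answer works because, as far as I can tell, the Swing parser stores the attributes it knows under HTML.Attribute constants rather than under plain strings, so a lookup with the string "name" finds nothing (the null output in the question). A minimal sketch of the corrected lookup for both name and value follows; the class name InputAttributeDemo and its read method are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo: reads name/value pairs from <input> tags using
// HTML.Attribute keys instead of string keys.
class InputAttributeDemo extends HTMLEditorKit.ParserCallback {
    final List<String> pairs = new ArrayList<>();

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.INPUT) {
            Object name = attrs.getAttribute(HTML.Attribute.NAME);
            Object value = attrs.getAttribute(HTML.Attribute.VALUE);
            pairs.add(name + "=" + value);
        }
    }

    // Parses the HTML text and returns "name=value" for each <input> found.
    static List<String> read(String html) {
        InputAttributeDemo demo = new InputAttributeDemo();
        try {
            new ParserDelegator().parse(new StringReader(html), demo, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return demo.pairs;
    }

    public static void main(String[] args) {
        String html = "<html><body><form>"
                + "<input name=\"unit_1\" size=5 value=\"hr\">"
                + "</form></body></html>";
        System.out.println(read(html));
    }
}
```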

  • Loading HTML with javascript from SDCard

    Hi - my first post so please be kind
    My handset is a BES connected 8100 running v4.2.1.66.
    I've only recently discovered how to load files from the local media card using the file:///SDCard/example.html semantics, but I'm very frustrated I cannot seem to load scripts embedded in these files, that will load and run normally when the file is retrieved using http.
    Is there some magic extension I need to give my files to enable processing of scripts from local files?  I've tried the obvious file extensions without luck (htm, html, wml, jsp, asp).
    It seems to me that for local files the handset is setting the MIME type from the extension alone, given the error you get when you load a file with an unknown extension - "The returned page had no content type, and therefore cannot be processed."
    Can anybody help me shed any light on this behavior and how to get around my lack of javascript on locally loaded files?
    Feel free to slap me if this is a known bug in the software I'm running, or has been answered 50 times.
    Many thanks; Andrew.

    Under which folder can we put the HTML and JavaScript files? I have installed the BlackBerry SDK 4.6 (i.e.
    Research In Motion\BlackBerry JDE 4.6.1).
    Objective: to open and view a local HTML file in the BlackBerry browser.

  • Exception in html parser under Linux

    Hi all,
    The following code is copied from the Tech Tip of 23 Sep 1999. I have compiled it and run it under Win98, where it works fine for any URI. However, when I run it under Linux, it throws exceptions. I noticed that some web sites can be parsed by the program on Linux but some can't. I wonder what the difference between the platforms is. Can anyone tell me how to make the program work under Linux?
    Rgds,
    unplug
    configuration
    RedHat 7.1
    JDK1.3.1
    Failed: java GetLinks http://java.sun.com
    Worked: java GetLinks http://www.apache.org
    --beginning of code
    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    class GetLinks {
        public static void main(String[] args) {
            EditorKit kit = new HTMLEditorKit();
            Document doc = kit.createDefaultDocument();
            // The Document class does not yet
            // handle charset's properly.
            doc.putProperty("IgnoreCharsetDirective",
                Boolean.TRUE);
            try {
                // Create a reader on the HTML content.
                Reader rd = getReader(args[0]);
                // Parse the HTML.
                kit.read(rd, doc, 0);
                // Iterate through the elements
                // of the HTML document.
                ElementIterator it = new ElementIterator(doc);
                javax.swing.text.Element elem;
                while ((elem = it.next()) != null) {
                    SimpleAttributeSet s = (SimpleAttributeSet)
                        elem.getAttributes().getAttribute(HTML.Tag.A);
                    if (s != null) {
                        System.out.println(
                            s.getAttribute(HTML.Attribute.HREF));
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.exit(1);
        }
        // Returns a reader on the HTML data. If 'uri' begins
        // with "http:", it's treated as a URL; otherwise,
        // it's assumed to be a local filename.
        static Reader getReader(String uri)
            throws IOException {
            if (uri.startsWith("http:")) {
                // Retrieve from Internet.
                URLConnection conn=
                    new URL(uri).openConnection();
                return new
                    InputStreamReader(conn.getInputStream());
            } else {
                // Retrieve from file.
                return new FileReader(uri);
            }
        }
    }
    --End of code
    --Exception in Linux
    Exception in thread "main" java.lang.NoClassDefFoundError
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:120)
    at java.awt.Toolkit$2.run(Toolkit.java:512)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.awt.Toolkit.getDefaultToolkit(Toolkit.java:503)
    at javax.swing.text.html.CSS.getValidFontNameMapping(CSS.java:932)
    at javax.swing.text.html.CSS$FontFamily.parseCssValue(CSS.java:1789)
    at javax.swing.text.html.CSS.getInternalCSSValue(CSS.java:531)
    at javax.swing.text.html.CSS.addInternalCSSValue(CSS.java:516)
    at javax.swing.text.html.StyleSheet.addCSSAttribute(StyleSheet.java:436)
    at javax.swing.text.html.HTMLDocument$HTMLReader$ConvertAction.start(HTMLDocument.java:2536)
    at javax.swing.text.html.HTMLDocument$HTMLReader.handleStartTag(HTMLDocument.java:1992)
    at javax.swing.text.html.parser.DocumentParser.handleStartTag(DocumentParser.java:145)
    at javax.swing.text.html.parser.Parser.startTag(Parser.java:333)
    at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1786)
    at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1821)
    at javax.swing.text.html.parser.Parser.parse(Parser.java:1980)
    at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:109)
    at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:74)
    at javax.swing.text.html.HTMLEditorKit.read(HTMLEditorKit.java:239)
    at GetLinks.main(GetLinks.java:23)

    Support for CSS is not clearly defined. Also, Dictionary getDocumentProperties() is not properly explained, meaning it doesn't give methods to get all the properties an HTML document can have.

  • Use of HTML and Javascript within EP

    I have a newbie question:
    I have several HTML documents with embedded JavaScript, e.g. various calculators we use on the current website.
    If I want to migrate these HTML sources to EP content, what is the best way to do it?
    I assume that all existing HTML and JavaScript renders as normal without too much development involved?
    Is there a good example or how-to that I can use to demonstrate this?
    Many thanks for your help

    Hi,
    Well, there are two options:
    1) You build a portal component and then use it in the Portal.
    2) You use your existing HTML document as it is inside the portal.
    In case 1 you need to create the portal objects and then a portal project. You can use JavaScript in that as well.
    In case 2 you can directly create a URL iView for the HTML document and then view it from the portal. However, this is not a good way of using your JavaScript. Personally, I suggest you go for a portal project.
    I hope this helps you!
    Regards
    Pravesh
    PS: Please consider rewarding points.

  • HTML parser in J2ME

    Hi all,
    I'm stuck with the same problem too. I'm developing a J2ME (MIDlet) application in which I have to open an HTTP connection. I also want to parse the HTML response and display the contents using J2ME elements on the mobile. I'm not able to solve this problem. Please help me if anyone has come across a solution.
    Below links are the related threads:
    http://forums.sun.com/thread.jspa?forumID=76&threadID=250460
    http://forums.sun.com/thread.jspa?forumID=76&threadID=5235530
    Thanks in advance
    Nandy

    > Hi All,
    > I like to ask if anyone knows if there is a HTML
    > parser available in J2ME? I am building an application
    > that needs to display HTML on the client.
    Try Google; a few do exist, but I don't know about free ones.
    > Alternatively I may consider using XML, however I
    > learnt that parsing XML is expensive in terms of
    > computing power - is it the same for HTML?
    If you are controlling the content returned, the two would be about the same, as XML and HTML have the same roots. Some XML parsers do exist, and are free to use.
    You might be best off returning a custom format, designed around the limitations of the device you are using.

  • What HTML and JavaScript engine is used within Adobe AIR on the desktop?

    HTML and JavaScript within Adobe AIR are handled by the WebKit HTML/JavaScript engine.

    I've made a little headway with this. Within your initHandler just make a call to login:
    FacebookMobile.login(loginCallback, this.stage, [], webview);
    webview is a StageWebView instance with the viewPort defined. If I left it null, or didn't set the viewPort nothing happens...
    var webview:StageWebView = new StageWebView();
    webview.viewPort = new Rectangle(0,0,400,400);
    I'm now getting a login screen.

  • Can i publish HTML or JavaScript on my iWeb pages?

    I don't know how to embed HTML or JavaScript. I'm adding advertising affiliates (Amazon, etc.) but I cannot post their code. I've had to paste the GIF logo and then create a link.

    Here is a thread in which I have listed the steps involved in adding external HTML to your iWeb pages. It's pretty straightforward and you can apply this technique to pretty much any kind of HTML that you would like to add...
    http://discussions.apple.com/click.jspa?searchID=-1&messageID=2446855
    Good luck! Let me know if you have any problems.

  • Error on HTML Parser

    Hi,
    I'm trying to parse a HTML page but I always get the same error, which is the following exception:
    javax.swing.text.ChangedCharSetException
    In the class ParserCallback I'm using the method handleError and it shows:
    req.att contentmeta?
    ioexception???
    just before the exception occurs.
    The only line where this error occurs in the html page is:
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    and I know that the exact point is the attribute 'content'. If it is removed or changed to 'contenttype' the error disappears.
    The problem is that I can't change the attribute because the HTML page is not mine; it is fetched from the Web. And I don't want to remove it.
    Does anybody know what is happening?
    Thanks!!
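
    A possible explanation, offered as an assumption rather than a confirmed diagnosis: ChangedCharSetException is a subclass of IOException (which would match the "ioexception???" message), and the Swing parser appears to raise it when it meets a meta tag that declares a charset while the third argument of ParserDelegator.parse (ignoreCharSet) is false. Passing true should let parsing continue without touching the page. A minimal sketch, with a hypothetical class name:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: parses HTML while ignoring any charset directive
// and returns the text content it saw.
class CharsetTolerantParse {
    static String textOf(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        try {
            // third argument: ignoreCharSet = true
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-1\"></head>"
                + "<body>Hello</body></html>";
        System.out.println(textOf(html));
    }
}
```

    The trade-off is that the declared charset is then ignored. If the encoding matters, another option would be to catch ChangedCharSetException, read the charset from it, and reopen the stream with that encoding.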

    I am also having a problem with HTML parsing in Java.
    I have given a complete description of the problem at this link, along with the log and my sample code:
    http://forum.java.sun.com/thread.jspa?threadID=643683&tstart=0
    If you could take a look at it...
