Parsing HTML, how to?

I use a URL to fetch the HTML-formatted text. However, parsing the HTML
to extract the information cleanly with StringTokenizer is a pain. For example,
I want to get the stock prices from
http://finance.yahoo.com/q/hp?s=YHOO&a=00&b=1&c=2005&d=01&e=20&f=2005&g=d
How can I easily extract the prices and the dates so my program can
process the data further? Thank you.

"The problem is, most (I believe all) Java HTML parsers rely on the HTML being well-formed, which isn't true all the time."
I think the default parser with the editor kit is actually fairly lenient.
"You can thank Microsoft Internet Explorer for allowing an HTML page to miss a closing tag."
And Netscape. And the W3C for not slapping them both.
"If you know the field (string) that you are looking for, you can probably use indexOf to get to that position, and slowly parse the document from there."
And watch it blow up in your face when the content changes.
As a technical exercise, writing a parser is not too hard. As a solution to your problem, I'd recommend getting the data from a better source as suggested above.
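If you do end up scraping the page, here is a minimal sketch of pulling table cells out with the Swing parser instead of StringTokenizer or indexOf. The class name TableCells and the method rows are made up for illustration, and it assumes the dates and prices sit in plain, non-nested table rows, which may not match the real page:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of each table row as a list of cells.
public class TableCells {
    public static List<List<String>> rows(Reader html) throws IOException {
        final List<List<String>> rows = new ArrayList<>();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            private List<String> row;    // cells of the row being read, or null
            private StringBuilder cell;  // text of the cell being read, or null

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.TR) {
                    row = new ArrayList<>();
                } else if (t == HTML.Tag.TD || t == HTML.Tag.TH) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (cell != null) {
                    cell.append(data);   // only keep text that is inside a cell
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if ((t == HTML.Tag.TD || t == HTML.Tag.TH) && cell != null) {
                    if (row != null) {
                        row.add(cell.toString().trim());
                    }
                    cell = null;
                } else if (t == HTML.Tag.TR && row != null) {
                    rows.add(row);
                    row = null;
                }
            }
        };
        // Third argument true: ignore any charset directive in the page.
        new ParserDelegator().parse(html, cb, true);
        return rows;
    }
}
```

Each inner list is one row, so a row whose first cell looks like a date can be converted with whatever date and number parsing you already use. It will still break when the page layout changes, as noted above.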

Similar Messages

  • JEditorPane parsing HTML

    Hi all,
    I am using JEditorPane and its ability to parse HTML, which, although relatively old and crusty, is certainly all I need for the job.
    Now, I understand there is a chain of classes involved in taking my .html file and turning it into something we can see in a JEditorPane. For example, an img tag is picked up by HTMLEditorKit and turned into an ImageView for display purposes.
    I want to do the following: I have subclassed HTMLEditorKit and overridden the HTMLFactory (although at the moment it just defers everything to super). I want to be able to pick out all of the HTML comment tags as they go through the HTMLEditorKit:
    <!-- hey hey this is a comment -->
    ... and get to the comment text, "hey hey this is a comment", as a Java string. However, I've been digging around with Element for hours now, and although my HTMLFactory correctly digs out the comments from the rest of the elements:
    else if (kind == HTML.Tag.COMMENT)
    {
        System.out.println("I found a comment but don't know what it said!!");
    }
    ... as you can see, I don't know how to get to the comment text itself.
    The reason why I want access to the comment text is that I want to supplement the HTML code a little bit and add something in the comment that will affect the way it is rendered when I read it depending on the comment - so there's the reason if curious.
    Any help, and I do mean anything at all, would be much appreciated, as this is the last obstacle in my path to getting this thing working :)
    Thanks for your time!
    - Peter

    Here is some old code I have lying around that attempts to iterate through all the elements. If I remember correctly the comment text is found in the AttributeSet of the element:
    import java.io.*;
    import java.net.*;
    import java.util.*;
    import javax.swing.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    class GetHTML {
        public static void main(String[] args) {
            EditorKit kit = new HTMLEditorKit();
            Document doc = kit.createDefaultDocument();
            // The Document class does not yet handle charsets properly.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            try {
                // Create a reader on the HTML content.
                Reader rd = getReader(args[0]);
                // Parse the HTML.
                kit.read(rd, doc, 0);
                System.out.println( doc.getText(0, doc.getLength()) );
                System.out.println("----");
                // Iterate through the elements of the HTML document.
                ElementIterator it = new ElementIterator(doc);
                Element elem = null;
                while ( (elem = it.next()) != null ) {
                    AttributeSet as = elem.getAttributes();
                    System.out.println( "\n" + elem.getName() + " : " + as.getAttributeCount() );
                    if ( elem.getName().equals( HTML.Tag.IMG.toString() ) ) {
                        Object o = elem.getAttributes().getAttribute( HTML.Attribute.SRC );
                        System.out.println( o );
                    }
                    // 'enum' is a reserved word since Java 5, so the variable is named 'names'.
                    Enumeration<?> names = as.getAttributeNames();
                    while ( names.hasMoreElements() ) {
                        Object name = names.nextElement();
                        Object value = as.getAttribute( name );
                        System.out.println( "\t" + name + " : " + value );
                        if (value instanceof DefaultComboBoxModel) {
                            DefaultComboBoxModel model = (DefaultComboBoxModel)value;
                            for (int j = 0; j < model.getSize(); j++) {
                                Object o = model.getElementAt(j);
                                Object selected = model.getSelectedItem();
                                if ( o.equals( selected ) )
                                    System.out.println( o + " : selected" );
                                else
                                    System.out.println( o );
                            }
                        }
                    }
                    if ( elem.getName().equals( HTML.Tag.SELECT.toString() ) ) {
                        Object o = as.getAttribute( HTML.Attribute.ID );
                        System.out.println( o );
                    }
                    // Weird, the text for each tag is stored in a 'content' element
                    if (elem.getElementCount() == 0) {
                        int start = elem.getStartOffset();
                        int end = elem.getEndOffset();
                        System.out.println( "\t" + doc.getText(start, end - start) );
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.exit(1);
        }
        // Returns a reader on the HTML data. If 'uri' begins
        // with "http:", it's treated as a URL; otherwise,
        // it's assumed to be a local filename.
        static Reader getReader(String uri) throws IOException {
            if (uri.startsWith("http:")) {
                // Retrieve from Internet.
                URLConnection conn = new URL(uri).openConnection();
                return new InputStreamReader(conn.getInputStream());
            } else {
                // Retrieve from file.
                return new FileReader(uri);
            }
        }
    }
    To test it just use:
    java GetHTML somefile.html
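If only the comment text is needed, one route that avoids the element tree is the parser callback: HTMLEditorKit.ParserCallback has a handleComment method that receives the comment characters. The sketch below uses made-up names (CommentText, comments). For the element route, I believe the text is stored in the element's AttributeSet under HTML.Attribute.COMMENT, but treat that as an assumption to verify.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: returns the text of every HTML comment the parser reports.
public class CommentText {
    public static List<String> comments(Reader html) throws IOException {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleComment(char[] data, int pos) {
                // data holds the characters between "<!--" and "-->"
                found.add(new String(data).trim());
            }
        };
        new ParserDelegator().parse(html, cb, true);
        return found;
    }
}
```

One caveat: as far as I can tell the parser also reports the contents of script blocks through handleComment, so pages with scripts may need extra filtering.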

  • Parse HTML behaviour

    Hi,
    Can anybody explain the behavior of the SunOne Web server when parse HTML is enabled for all HTML?
    If we have a valid HTML file on the web server, for instance http://servername/myhtml.html, the page is served by the web server. If we append anything to the URL after that, the same page is still served. Consider a URL such as http://servername/myhtml.html/ahdjksad/asdhjsad/sdhjklsad/asjdksald (anything after the .html): the web server still loads the myhtml.html page. If we disable parse HTML, this no longer works. I want server-side includes to work without the server serving such wrong URLs.
    How does this come about? That path doesn't exist on the server, so it should give a 404 error. Is this the way parse HTML works? Is there any way to restrict it while keeping the parse HTML functionality for server-side includes enabled, other than a custom NSAPI?
    If anyone has noticed this behavior, please explain.
    Thanks,
    Rijesh.

    Hi,
    Actually, I load an external HTML file using the LoadVars class methods:
    var oLoad:LoadVars=new LoadVars();
    oLoad.load("external.html");
    I want to parse that HTML in Flash.
    I need some text from the HTML page (the HTML has 200 lines of code).
    How can I parse that HTML and trace that particular text?

  • DocumentParser parsing HTML ...

    i am parsing HTML of website through this
    HTMLEditorKit.Parser parser = new javax.swing.text.html.parser.ParserDelegator();
    i was able to parse www.yahoo.com
    its html code (first few lines)
    <html><head>
    <script language=javascript>
    var now=new Date,t1=0,t2=0,t3=0,t4=0,t5=0,t6=0,cc='',ylp='';t1=now.getTime();
    </script>
    <title>Yahoo!</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l r (cz 1 lz 1 nz 1 oz 1 vz 1) gen true for "http://www.yahoo.com" r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l r (n 0 s 0 v 0 l 0) gen true for "http://www.yahoo.com" r (n 0 s 0 v 0 l 0))'>
    <base href="http://www.yahoo.com/_ylh=X3oDMTEwZGh2NmNjBF9TAzI3MTYxNDkEdGVzdAMwBHRtcGwDaW5kZXgtdGJs/" target=_top>
    <script language=javascript>------------
    and my corresponding log goes like this ....
    0 DEBUG  [main]  - Start :html
    15 DEBUG  [main]  - Start :head
    15 DEBUG  [main]  - Start :script
    15 DEBUG  [main]  - End :script
    15 DEBUG  [main]  - Start :title
    15 DEBUG  [main]  - End :title
    15 DEBUG  [main]  - meta -- http-equiv=Content-Type content=text/html; charset=UTF-8
    31 DEBUG  [main]  - meta -- http-equiv=PICS-Label content=(PICS-1.1 "http://www.icra.org/ratingsv02.html" l r (cz 1 lz 1 nz 1 oz 1 vz 1) gen true for "http://www.yahoo.com" r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l r (n 0 s 0 v 0 l 0) gen true for "http://www.yahoo.com" r (n 0 s 0 v 0 l 0))
    31 INFO  [main]  - base http://www.yahoo.com/_ylh=X3oDMTEwZGh2NmNjBF9TAzI3MTYxNDkEdGVzdAMwBHRtcGwDaW5kZXgtdGJs/
    31 DEBUG  [main]  - Start :script
    31 DEBUG  [main]  - End :script
    31 DEBUG  [main]  - Start :script
    62 DEBUG  [main]  - End :script
    62 DEBUG  [main]  - Start :style
    62 DEBUG  [main]  - End :style
    62 DEBUG  [main]  - Start :script
    next I parsed www.java.sun.com/index.html
    its html code (first few lines ) goes like this ...
    <html>
    <head>
    <title>Java Technology</title>
    <meta name="keywords" content="Java, platform" />
    <meta name="description" content="Java technology is a portfolio of products that are based on the power of networks and the idea that the same software should run on many different kinds of systems and devices." />
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
    <meta name="date" content="2003-11-23" />
    <link rel="stylesheet" href="/css/default_developer.css" />
    <script type="text/javascript" language="JavaScript" src="/js/popUp.js"></script>
    <script type="text/javascript" language="JavaScript" src="/js/support_incident.js"></script>
    <link href="http://developers.sun.com/rss/java.xml" rel="alternate" type="application/rss+xml" title="rss" />
    </head>
    <!--stopindex-->
    <body leftmargin="0"....-----
    and my corresponding log goes like this ...
    0 DEBUG  [main]  - Start :html
    16 DEBUG  [main]  - Start :head
    16 DEBUG  [main]  - Start :title
    16 DEBUG  [main]  - End :title
    16 INFO  [main]  - meta --- name=keywords content=Java, platform
    16 DEBUG  [main]  - End :head
    16 DEBUG  [main]  - Start :body
    16 DEBUG  [main]  - Simple Tag :link
    Now, as you can see from the logs, the META tag of yahoo was read in twice by the parser, while the META tag of java.sun.com/index.html was read only once.
    One visible difference between the HTML of these two pages is that the META tag of the yahoo page doesn't have a closing tag (isn't well-formed), whereas the META tag of java.sun.com is well-formed.
    Why is the meta tag (of java.sun.com) being ignored by the parser?
    Is it because of this...
    javax.swing.text.html.parser.Parser.java , method boolean ignoreElement(Element elem) : line 429
    returns true for ignoring the meta tag in the HTML file...
    Is my problem due to this?
    How can I possibly overcome this? :-(
    Code for my Callback class looks like this ...
         HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
              public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                   try {
                        if (t == HTML.Tag.A) {
                             String hrefValue = (String) a.getAttribute(HTML.Attribute.HREF);
                             logger.log(Level.INFO, t + " " + hrefValue);
                        } else {
                             logger.log(Level.DEBUG, "Start :" + t);
                        }
                   } catch (Exception e) { e.printStackTrace(); }
              }
              public void handleEndTag(HTML.Tag t, int pos) {
                   try {
                        logger.log(Level.DEBUG, "End :" + t);
                   } catch (Exception e) { e.printStackTrace(); }
              }
              public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                   try {
                        if (t == HTML.Tag.BASE) {
                             String hrefValue = (String) a.getAttribute(HTML.Attribute.HREF);
                             logger.log(Level.INFO, t + " " + hrefValue);
                        } else if (t == HTML.Tag.FRAME) {
                             String srcValue = (String) a.getAttribute(HTML.Attribute.SRC);
                             logger.log(Level.INFO, t + " " + srcValue);
                        } else if (t == HTML.Tag.META) {
                             String nm = (String) a.getAttribute(HTML.Attribute.NAME);
                             String content = (String) a.getAttribute(HTML.Attribute.CONTENT);
                             if ("keywords".equalsIgnoreCase(nm) || "description".equalsIgnoreCase(nm)) {
                                  // i found it
                                  logger.log(Level.INFO, t + " --- " + a);
                             } else {
                                  logger.log(Level.DEBUG, t + " -- " + a);
                             }
                        } else {
                             logger.log(Level.DEBUG, "Simple Tag :" + t);
                        }
                   } catch (Exception e) { e.printStackTrace(); }
              }
         };
    I want to read the values of the meta tag attributes "name" and "content" where <meta name="keywords" content="asdfasdfasdf" > or <meta name="description" content="asdfasdfasdf">. How can I do that?

    ok ...
    Then, if there is some other way to read HTML tags such as meta, a (anchor), base and frame (only these tags matter to me) without being concerned about the way the HTML has been coded, please tell me.
    Searching the internet showed that there are HTML parsers that use StringTokenizer-like techniques to read HTML.
    Has anyone here ever used anything like this?
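For what it's worth, here is a rough sketch of the "don't care how the HTML was coded" approach for meta tags only, using regular expressions instead of the Swing parser. MetaScan and namedMeta are invented names. It assumes attribute values never contain a ">" character, and regex scanning of HTML is fragile in general, so treat it as a fallback and not a real parser:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans raw HTML text for <meta name=... content=...> tags.
public class MetaScan {
    // A whole <meta ...> tag, with or without a trailing slash.
    private static final Pattern META =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // One attribute: name="value", name='value' or name=value.
    private static final Pattern ATTR = Pattern.compile(
        "([\\w-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>/]+))");

    /** Maps each meta "name" (lower-cased) to its "content" value. */
    public static Map<String, String> namedMeta(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = META.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                // Exactly one of groups 2 to 4 is set, depending on the quoting style.
                String value = attr.group(2) != null ? attr.group(2)
                             : attr.group(3) != null ? attr.group(3)
                             : attr.group(4);
                String key = attr.group(1).toLowerCase();
                if (key.equals("name")) {
                    name = value;
                } else if (key.equals("content")) {
                    content = value;
                }
            }
            if (name != null && content != null) {
                result.put(name.toLowerCase(), content);
            }
        }
        return result;
    }
}
```

With this, namedMeta(page).get("keywords") and namedMeta(page).get("description") give the two values asked about, whether or not the tag ends in "/>". The same idea could be extended to a, base and frame, at the cost of more patterns to maintain.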

  • Parsing HTML characters (e.g. &nbsp)

    Hi
    Apologies if I'm missing something obvious, I haven't been able to find an answer searching the API or Forums...
    I'm parsing HTML documents (currently as Strings) to extract certain information. Is there an easy way to replace all special HTML characters such as &nbsp; or &lt; with a space or < respectively, without having to do a string replace on every possible HTML character?
    I know there's an HTML parser in swing but that seems to be geared towards creating an HTML editor.
    Any help would be appreciated!

    There are also a number of open source or shareware programs, such as TidyHTML, that clean-up and parse existing HTML. Check out Sourceforge or www.downloads.com.
    - Saish
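If pulling in a library is not an option, a small decoder can be written in one pass with a regular expression. The sketch below is only an illustration: Entities and decode are made-up names, the named-entity table holds just five entries (extend it as needed), and &nbsp; is mapped to a plain space because that is what the question asks for. Numeric references such as &#65; and &#x42; are handled generically.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces character references in a string in a single pass.
public class Entities {
    private static final Pattern REF =
        Pattern.compile("&(#[xX]?[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("nbsp", " ");   // deliberately a plain space, per the question
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("amp", "&");
        NAMED.put("quot", "\"");
    }

    public static String decode(String s) {
        Matcher m = REF.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String repl;
            if (body.startsWith("#")) {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                try {
                    int cp = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
                    repl = new String(Character.toChars(cp));
                } catch (IllegalArgumentException e) {
                    repl = m.group();   // not a valid number or code point: leave as-is
                }
            } else {
                // Unknown names are left untouched.
                repl = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, Entities.decode("a&nbsp;b &lt; c") returns "a b < c".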

  • Parsing HTML documents

    I am trying to write an application that uses a parsed html document to perform some data retrieval. The problem that I am having is that the parser in JDK1.4.1 is unable to completely parse the document correctly. Some fields are skipped as well as other problems. I believe it has to do with the html32.bdtd. Is there a later version?

    Parsing an HTML document is a huge task; you shouldn't do it yourself. Instead, javax.swing.text.html and javax.swing.text.html.parser already provide almost everything you need.

  • Parsing HTML files

    Hello,
    I have a question about parsing HTML files. Usually when I get an HTML file and need to find all the text in it, I do the following. It collects all of the hyperlinks and ignores all the HTML tags, keeping just the actual text. It's fine for smaller files, but occasionally I'll hit a large online text file; it works, but it's way too slow for large files. I don't need any of this HTML tag stripping for text files, however. Is there a way to still grab all the text without doing any tag searching, to make it faster?
    thanks,
    private void find() throws IOException {
            // Really slow for large text files.  Need a way to just use a regular scanner on an internet text file
            new ParserDelegator().parse(new InputStreamReader(myBase.openStream()),
                    new ParserListener(),
                    true);
        }
        /**
         * Inner class for processing all "<a href.."> tags when reading a base URL.
         */
        private class ParserListener extends HTMLEditorKit.ParserCallback {
            final String IGNORED_LINKS = "^(http|mailto|\\W).*";
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    String href = (String)(a.getAttribute(HTML.Attribute.HREF));
                    //System.out.println(href);
                    //System.out.println(href.matches(IGNORED_LINKS) + "\t" + href);
                    if (! (href == null || href.matches(IGNORED_LINKS)) && !myURLs.contains(href))
                        myURLs.add(href);
                }
                //TODO fix: the title text is element content, not a TITLE attribute of the tag
                if (t == HTML.Tag.TITLE) {
                    String title = (String) (a.getAttribute(HTML.Attribute.TITLE));
                    if (!(title == null))
                        myTitle = title;
                    else myTitle = "No title was found";
                }
            }
            public void handleText(char[] data, int pos) {
                myText.append(" ");
                myText.append(data);
            }
        }

    JFactor2004 wrote:
    "My question is. If I know an html file is actually just a txt file"
    This isn't a question. HTML files are text by definition.
    "is it possible to look through it (maybe use something similar to a regular scanner) without doing anything with html."
    That depends on what you mean by "doing something with HTML". You can certainly read it one line at a time.
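Reading one line at a time could look roughly like this. PlainText and readAll are invented names. The method takes any Reader, so with the code from the question it would presumably be called as PlainText.readAll(new InputStreamReader(myBase.openStream())):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper: reads a text source line by line with no HTML handling at all.
public class PlainText {
    public static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped source) when done
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }
}
```

Whether this is actually faster than the parser for your files is something to measure; it simply skips all tag processing.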

  • Parsing HTML using Swing's HTMLEditorKit

    Hi all,
    I posted this question on the "Java programming", but I think I posted on the wrong forum. So, please let me know if I have posted on the wrong forum, again.
    Anyway, I have read an article on parsing HTML using the Swing HTML Parser (http://java.sun.com/products/jfc/tsc/articles/bookmarks/index.html). However, I find that the HTMLEditorKit is unable to understand the <Meta> tag under the <Head> tag? Is this true? I am getting an error message:
    javax.swing.text.ChangedCharSetException
    at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:172)
    at javax.swing.text.html.parser.Parser.startTag(Parser.java:327)
    at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1786)
    at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1821)
    at javax.swing.text.html.parser.Parser.parse(Parser.java:1980)
    at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:109)
    at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:74)
    at URLReader.main(URLReader.java:58)
    Below is a simple code to write out the html file it reads in:
    public static void main(String[] args) throws Exception {
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                try {
                    System.out.println(data);
                } catch (Exception e) {
                    System.out.println("IOE: " + e);
                }
            }
        };
        Reader reader = new FileReader("myFile.html");
        new ParserDelegator().parse(reader, callback, false);
    }
    The html file that is having a problem reading in is:
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <title>NWS WSR-88D Radar System Transmit/Receive Status</title>
    </head>
    <p>A <foo>xx</foo>link</html>
    If I take away <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, there is no problem.
    Any suggestions? Thanks in advance.

    Hi,
    Setting the third argument really works!!! Yee..... haa....!!!
    WORKING SOLUTION: new ParserDelegator().parse(reader, callback, true);
    MANY... MANY THANKS for looking at the problem!!!
    Send third argument in parse method as true.

  • Extension .HTM instead of .HTML, How can I change that in MUSE?

    I'm rebuilding an existing website in which all the pages use the extension .HTM instead of .HTML. How can I change that in Muse?

    Unfortunately, it is not possible to change the extension of the pages from .html to .htm.
    We do not have that option available in Muse. Muse only generates .html extension.
    Regards,
    Prateek

  • I don't own a MAC. I can probably use one at the library. I am trying to publish my book with Apple's ibooks, etc. Since my files are in Word.doc or html; how can I format my book so as to publish it?

    I don't own a MAC. I can probably use one at the library. I am trying to publish my book with Apple's ibooks, etc. Since my files are in Word.doc or html; how can I format my book so as to publish it?
    www.amessageforthehumanrace.org

    Use an aggregator and follow their instructions for formatting.

  • Parsing html files via an url

    Hi,
    I already have a Java program that is able to read in HTML files that are stored on my computer's hard drive. Now I would like to expand its functionality by being able to parse HTML files straight from the web.
    For example, when the program is run, I would like to be able to give it a URL for a given website. Then, I would like to be able to parse the HTML file that the link goes to.
    I've searched the forum, but have not been able to find anything of any real use. If you could offer an overview or point me towards a resource, I would be very grateful.

    If you've done things right, you have a HTML reader/parser that takes an InputStream. For Files, this would be a FileInputStream.
    For URLs, this would be the InputStream you get from URLConnection.getInputStream(). You can get a URLConnection by calling openConnection() on a URL instance (created from your input url of course).
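A minimal sketch of that idea, with invented names (Source, open): pick the stream by looking at the input string, then hand the InputStream to the existing parser. Treating anything that does not start with http:// or https:// as a file name is an assumption made for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical helper: opens either a web page or a local file as an InputStream.
public class Source {
    public static InputStream open(String location) throws IOException {
        if (location.startsWith("http://") || location.startsWith("https://")) {
            // URL -> URLConnection -> InputStream, as described above
            return new URL(location).openConnection().getInputStream();
        }
        // Anything else is assumed to be a path on the local disk.
        return new FileInputStream(location);
    }
}
```

The caller is responsible for closing the returned stream, for example with try-with-resources.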

  • Parsing HTML - best tool

    Hi guys, I'd like to know the best open source API to parse HTML and get the required data from it. Hopefully one that uses a SAX parser, but the HTML is not fully XML compliant, i.e. it is not XHTML.
    Thanks
    Abe

    Thanks, I found my answer: use Jericho HTML Parser. Do any of you guys know of a better one?
    Thanks
    Abe

  • Parsing HTML from Google API results

    Hello,
    I just downloaded the Google API (http://www.google.com/apis) and I am trying to parse the HTML content which is returned so that it can be displayed in a TextArea or some other GUI component.
    Here are my questions:
    1. Is there a Java class that can parse HTML and display it correctly?
    2. If not, are there any third party, preferably free Java components that can do that?
    3. Has anyone tried out the Google API? Any interesting applications?
    Thank you.
    Hanxue

    To convert plain text to HTML, you can process the text with simple code like this:
    1.
    String inputText = getInputText();
    StringBuffer HTMLOutputText = new StringBuffer();
    java.util.StringTokenizer st = new java.util.StringTokenizer(inputText, "\n\r");
    while ( st.hasMoreTokens() ) {
        HTMLOutputText.append(st.nextToken());
        HTMLOutputText.append("<br>");
    }
    // insert the top level HTML tags
    HTMLOutputText.insert(0, "<HTML> <HEAD><TITLE> Some Title</TITLE></HEAD> <BODY>");
    HTMLOutputText.insert( HTMLOutputText.length(), "</BODY> </HTML>" );
    2. Even simpler, but as far as I know it doesn't display right in a JEditorPane:
    String inputText = getInputText();
    inputText = "<HTML> <HEAD><TITLE> Some Title</TITLE></HEAD> <BODY> <PRE> <TT>"
        + inputText + "</TT></PRE></BODY> </HTML>";

  • What is Execute to Parse % and how to tune it when it lower?

    What is Execute to Parse % and how to tune it when it is lower?

    Gjohn wrote:
    "What is Execute to Parse % and how to tune it when it is lower?"
    If you don't know what it is, how are you going to decide that you need to tune it?
    Here's a little information on how pointless it can be to get too worried about that particular "Instance Efficiency" percentage in Statspack and the AWR: http://jonathanlewis.wordpress.com/2006/12/27/analysing-statspack-2/
    Regards
    Jonathan Lewis
    http://jonathanlewis.wordpress.com
    http://www.jlcomp.demon.co.uk
    To post code, statspack/AWR report, execution plans or trace files, start and end the section with the tag {noformat}{noformat} (lowercase, curly brackets, no spaces) so that the text appears in fixed format.
    "There's no sense in being precise when you don't even know what you're talking about"
    John von Neumann

  • Could not parse html cannot be top level tag

    hi
    i have installed the siteminder webagent in sunone server, i am trying to call my first page (index.jsp) it contains the following jsp code
    <jsp:useBean id="helpBroker" class="javax.help.ServletHelpBroker" scope="session" />
    <%@ taglib uri="/WEB-INF/jhlib.tld" prefix="jh" %>
    <jh:validate helpBroker="<%= helpBroker %>" helpSetName="<%= strContext %>" />
    (here strContext is the delphi.hs file)
    Then it gives the "Could not parse html cannot be top level tag"
    error in the server's log. If I comment out the <jh:validate helpBroker="<%= helpBroker %>" helpSetName="<%= strContext %>" /> line, everything works fine.
    Can anybody suggest what the error is?

    It could be that your hs file is blocked by your firewall configuration. Check whether you are able to access the hs file from a different browser session.
