Problem in HTML parsing

I am trying to parse HTML. For markup like this:
<span class="authorName">Steve </span>
I want to retrieve the value "Steve".
Right now I am using a parser to generate the HTML tree, with this code:
Tidy tidy = new Tidy();
tidy.setXHTML(xhtml);
d = tidy.parseDOM(in, out);
NodeList spanNode = d.getElementsByTagName("span");
int length = spanNode.getLength();
for (int i = 0; i < length; i++) {
    org.w3c.dom.Node span = spanNode.item(i);
    // Reads the value of the class attribute, not the text inside the element
    String tempAltText = span.getAttributes().getNamedItem("class").getNodeValue();
    if (tempAltText.equals("authorName")) {
        System.out.println("the item is " + tempAltText);
    } else {
    }
}
tempAltText stores "authorName" as a string, but not "Steve".
Please give me some suggestions.

Try this:
XPathFactory factory = XPathFactory.newInstance();
XPath xPath = factory.newXPath();
XPathExpression xPathExpression = xPath.compile("//*[@class='authorName']");
System.out.println(xPathExpression.evaluate(d));
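
The XPath suggestion works because evaluating an expression to a string yields the text of the first matching element, whereas the loop in the question reads the value of the class attribute. Below is a minimal, self-contained sketch of the same idea using only the JDK's XML APIs. The class name and the parse helper are hypothetical and exist only for this illustration; the sketch assumes well-formed markup rather than JTidy output.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AuthorNameExample {

    // Returns the trimmed text of the first element whose class attribute is "authorName".
    static String authorName(Document doc) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        // evaluate(String, Object) returns the string value of the first matching node
        return xPath.evaluate("//*[@class='authorName']", doc).trim();
    }

    // Hypothetical helper for this sketch: builds a DOM from well-formed markup.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document d = parse(
                "<html><body><span class=\"authorName\">Steve </span></body></html>");
        System.out.println(authorName(d)); // prints "Steve"
    }
}
```

If several authors can appear on one page, the same expression could be evaluated as a node set and iterated instead of taking only the first match.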

Similar Messages

  • STYLE tag problem in HTML Parser.

    Hi,
    I am trying to parse an HTML file. I am able to extract the content of various tags like Tag.SPAN, Tag.DIV and so on.
    I want to extract the text content of Tag.STYLE. What should I do? The problem is that the HTML parser currently does not support this tag, along with 5 more tags such as Tag.META, Tag.PARAM and so on.
    Please help me out.

    Before responding to this posting, you may want to check out the discussion in the OP's previous posting on this topic:
    http://forum.java.sun.com/thread.jspa?threadID=634938

  • Problem with HTML Parser and multiple instances

    I have a parser program which queries an online shopping comparison web page and extracts the information needed. I am trying to run this program with different search terms, which are created by splitting an entered sentence, so each word is sent separately. However, the outputs (text files) are the same for each word, even though the correct term and output file appear to be passed. I suspect the connection is not being closed each time, but I am not sure why this is happening.
    If I create an identical copy of the program and run that after the first one, it works, but this is not an appropriate solution.
    Any help would be much appreciated. Here is some of my code; if more is required I will post it.
    To run the program:
    StringTokenizer t = new StringTokenizer("red green yellow", " ");
    int c = 0;
    Parser1 p = new Parser1();
    while (t.hasMoreTokens()) {
        c++;
        String tok = t.nextToken();
        File tem = new File("C:/" + c + ".txt");
        p.mainprog(tok, tem);
        p.mainprog(tok, tem);
        p.mainprog(tok, tem);
    }
    The parser:
    import javax.swing.text.html.parser.*;
    import javax.swing.text.html.*;
    import javax.swing.text.*;
    import java.awt.*;
    import java.util.*;
    import javax.swing.*;
    import java.io.*;
    import java.net.*;
    public class Parser1 extends HTMLEditorKit.ParserCallback {
        // variable declarations
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            // ...methods
        }
        public void handleText(char[] data, int pos) {
            // ...methods
        }
        public void handleTitleTag(HTML.Tag t, char[] data) {
        }
        public void handleEmptyTag(HTML.Tag t, char[] data) {
        }
        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            // ...methods
        }
        static void mainprog(String term, File file) {
            try { // outer try; its exact starting point is not shown in the post
                // ...proxy and authentication methods
                Authenticator.setDefault(new MyAuthenticator());
                HTMLEditorKit editorKit = new HTMLEditorKit();
                HTMLDocument HTMLDoc;
                Reader HTMLReader;
                try {
                    String temp = new String(term);
                    String fullurl = new String(MainUrl + temp);
                    url = new URL(fullurl);
                    InputStream myInStream;
                    myInStream = url.openConnection().getInputStream();
                    HTMLReader = (new InputStreamReader(myInStream));
                    HTMLDoc = (HTMLDocument) editorKit.createDefaultDocument();
                    HTMLDoc.putProperty("IgnoreCharsetDirective", new Boolean(true));
                    ParserDelegator parser = new ParserDelegator();
                    HTMLEditorKit.ParserCallback callback = new Parser1();
                    parser.parse(HTMLReader, callback, true);
                    callback.flush();
                    HTMLReader.close();
                    myInStream.close();
                } catch (IOException IOE) {
                    IOE.printStackTrace();
                } catch (Exception e) {
                    e.printStackTrace();
                }
                try {
                    // Write the collected results, one per line, to the output file
                    FileWriter writer = new FileWriter(file);
                    BufferedWriter bw = new BufferedWriter(writer);
                    for (int i = 0; i < vect.size(); i++) {
                        bw.write((String) vect.elementAt(i));
                        if (vect.elementAt(i) != vect.lastElement()) {
                            bw.newLine();
                        }
                    }
                    bw.flush();
                    bw.close();
                    writer.close();
                } catch (IOException IOE) {
                    IOE.printStackTrace();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            } catch (IOException IOE) {
                System.out.println("User options not found.");
            }
        }
    }

    How many Directory Servers are you using?
    Are both serverconfig.xml files of PS instances the same?
    Set debug level to message in the appropriate AMConfig.properties of your portal instances and look into AM debug files.
    For some reason amSDK seems not to get the correct service values.
    -Bernhard

  • Error on HTML Parser

    Hi,
    I'm trying to parse a HTML page but I always get the same error, which is the following exception:
    javax.swing.text.ChangedCharSetException
    In the class ParserCallback I'm using the method handleError and it shows:
    req.att contentmeta?
    ioexception???
    just before the exception occurs.
    The only line where this error occurs in the html page is:
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    and I know that the exact point is the attribute 'content'. If it is removed or changed to 'contenttype', the error disappears.
    The problem is that I can't change the attribute because the HTML page is not mine; it is fetched from the Web. And I don't want to remove it.
    Does anybody know what is happening?
    Thanks!!

    I am also having a problem with HTML parsing in Java.
    I have given a complete description of the problem at this link, along with the log and my sample code:
    http://forum.java.sun.com/thread.jspa?threadID=643683&tstart=0
    Please take a look if you can.
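
    A note on the exception above: the Swing parser raises ChangedCharSetException when it meets a meta tag that declares a charset and it has not been told to ignore charset directives. One way to avoid it, sketched below, is to pass true as the third argument (ignoreCharSet) of ParserDelegator.parse. This is a minimal illustration rather than the original poster's code; the class and method names are hypothetical, and it assumes the Reader already decodes the bytes with the correct charset.

    ```java
    import java.io.IOException;
    import java.io.StringReader;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    class IgnoreCharsetExample {

        // Collects all text content, ignoring any charset declared in a <meta> tag.
        static String extractText(String html) throws IOException {
            StringBuilder text = new StringBuilder();
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    text.append(data);
                }
            };
            // Third argument true = ignoreCharSet, so the meta tag does not abort parsing
            new ParserDelegator().parse(new StringReader(html), callback, true);
            return text.toString();
        }

        public static void main(String[] args) throws IOException {
            String html = "<html><head>"
                    + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
                    + "</head><body><p>Hello</p></body></html>";
            System.out.println(extractText(html));
        }
    }
    ```

    When the page is loaded into an HTMLDocument through an editor kit instead, setting the document property "IgnoreCharsetDirective" to Boolean.TRUE before reading serves the same purpose.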

  • Problem with SAX parser - entity must finish with a semi-colon

    Hi,
    I'm pretty new to the complexities of using SAXParserFactory and its cousins of XMLReaderAdapter, HTMLBuilder, HTMLDocument, entity resolvers and the like, so wondered if perhaps someone could give me a hand with this problem.
    In a nutshell, my code is really nothing more than a glorified HTML parser - a web page editor, if you like. I read in an HTML file (only one that my software has created in the first place), parse it, then produce a Swing representation of the various tags I've parsed from the page and display this on a canvas. So, for instance, I would convert a simple <TABLE> of three rows and one column, via an HTMLTableElement, into a Swing JPanel containing three JLabels, suitably laid out.
    I then allow the user to amend the values of the various HTML attributes, and I then write the HTML representation back to the web page.
    It works reasonably well, albeit a bit heavy on resources. Here's a summary of the code for parsing an HTML file:
    htmlBuilder = new HTMLBuilder();
    parserFactory = SAXParserFactory.newInstance();
    parserFactory.setValidating(false);
    parserFactory.setNamespaceAware(true);
    FileInputStream fileInputStream = new FileInputStream(htmlFile);
    InputSource inputSource = new InputSource(fileInputStream);
    DoctypeChangerStream changer = new DoctypeChangerStream(inputSource.getByteStream());
    changer.setGenerator(
        new DoctypeGenerator() {
            public Doctype generate(Doctype old) {
                return new DoctypeImpl(
                    old.getRootElement(),
                    old.getPublicId(),
                    old.getSystemId(),
                    old.getInternalSubset());
            }
        });
    resolver = new TSLLocalEntityResolver("-//W3C//DTD XHTML 1.0 Transitional//EN", "xhtml1-transitional.dtd");
    readerAdapter = new XMLReaderAdapter(parserFactory.newSAXParser().getXMLReader());
    readerAdapter.setDocumentHandler(htmlBuilder);
    readerAdapter.setEntityResolver(resolver);
    readerAdapter.parse(inputSource);
    htmlDocument = htmlBuilder.getHTMLDocument();
    htmlBody = (HTMLBodyElement) htmlDocument.getBody();
    traversal = (DocumentTraversal) htmlDocument;
    walker = traversal.createTreeWalker(htmlBody, NodeFilter.SHOW_ELEMENT, null, true);
    rootNode = new DefaultMutableTreeNode(new WidgetTreeRootNode(htmlFile));
    createNodes(walker);
    However, I'm having a problem parsing a piece of HTML for a streaming video widget. The key part of this HTML is as follows:
    <object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"
            id="client"
            width="100%"
            height="100%"
            codebase="http://fpdownload.macromedia.com/get/flashplayer/current/swflash.cab">
        <param name="movie" value="client.swf?user=lkcl&stream=stream2&streamtype=live&server=rtmp://192.168.250.206/oflaDemo" />
        etc....
    You will see that the <param> tag in the HTML has a value attribute which is a URL plus three URL parameters. It looks absolutely standard, and in fact works correctly in a browser. However, when my readerAdapter.parse() method gets to this point, it throws an exception saying that there should be a semi-colon after the entity 'stream'. I can see what's happening: the SAXParser thinks that the ampersand marks the start of a new entity. When it finds '&stream' it expects it to finish with a semi-colon (much like &nbsp; and other such HTML entities). The only way I can get the parser past this point is to encode all the relevant ampersands as %26, but then the web page stops working! Aaargh....
    Can someone explain what my options are for getting around this problem ? Some property I can set on the parser ? A different DTD ? Not to use SAX at all ? Override the parser's exception handler ? A completely different approach ?!
    Could you provide a simple example to explain what you mean ?
    Thanks in anticipation !

    You probably don't have the ampersands in your "value" attribute escaped properly. It should look like this:
    value="client.swf?user=lkcl&amp;stream=...
    Most HTML processors (i.e. browsers) will overlook that omission, because almost nobody does it right when they are generating HTML by hand, but XML processors won't.
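
    To illustrate the point about escaping, here is a small sketch using the JDK's own XML parser (not the HTMLBuilder/XMLReaderAdapter setup from the question). It shows that a raw ampersand in an attribute value is rejected, while &amp; is accepted and is decoded back to a plain & in the value the application sees, so the URL still works. The class and method names are hypothetical.

    ```java
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.xml.sax.InputSource;

    class AmpersandEscapeExample {

        // Parses the markup and returns the value attribute of the first <param> element.
        static String paramValue(String xml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element param = (Element) doc.getElementsByTagName("param").item(0);
            return param.getAttribute("value");
        }

        // Returns true if the markup is accepted by the XML parser.
        static boolean parses(String xml) {
            try {
                paramValue(xml);
                return true;
            } catch (Exception e) {
                return false;
            }
        }

        public static void main(String[] args) throws Exception {
            String raw = "<object><param name=\"movie\" "
                    + "value=\"client.swf?user=lkcl&stream=stream2\"/></object>";
            String escaped = "<object><param name=\"movie\" "
                    + "value=\"client.swf?user=lkcl&amp;stream=stream2\"/></object>";
            System.out.println(parses(raw));         // false: "&stream" is read as an entity reference
            System.out.println(paramValue(escaped)); // client.swf?user=lkcl&stream=stream2
        }
    }
    ```

    Browsers decode &amp; in attribute values the same way, so a page written with escaped ampersands should pass the same URL to the plugin as before.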

  • HTML parser in J2ME

    Hi all,
    I'm stuck with the same problem too. I'm developing a J2ME (MIDlet) application in which I have to open an HTTP connection. I also want to parse the HTML response and display the contents using J2ME elements on the mobile. I'm not able to solve this problem. Please help me if anyone has come across a solution.
    Below links are the related threads:
    http://forums.sun.com/thread.jspa?forumID=76&threadID=250460
    http://forums.sun.com/thread.jspa?forumID=76&threadID=5235530
    Thanks in advance
    Nandy

    Hi All,
    I like to ask if anyone knows if there is a HTML
    parser available in J2ME? I am building an application
    that needs to display HTML on the client.
    Try google, a few do exist, but I don't know about free ones.
    Alternatively I may consider using XML, however I
    learnt that parsing XML is expensive in terms of
    computing power - is it the same for HTML?
    If you are controlling the content returned, the two would be about the same, as XML and HTML have the same roots. Some XML parsers do exist, and are free to use.
    You might be best off returning a custom format, designed around the limitations of the device you are using.

  • JEditorPane slow HTML parsing and GUI lockup

    Hi,
    I use a JEditorPane to display an HTML page. If the page is relatively small, everything is OK, but with big pages (about 1 Mb) it takes a huge amount of time to load (actually, to parse) the page. The main problem is that the user interface locks up and doesn't even repaint until the page is loaded. Is there any workaround for this problem? I use JDK 1.5.0_2 std. edition.

    Try loading a 1Mb plain text file and you will find
    that loading is slow. So it will only be slower with
    the additional parsing required for an HTML file.
    Really? The following is the text loading code, which is run in a separate thread:
    BufferedInputStream in = new BufferedInputStream(con.getInputStream());
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    if (type.startsWith("text/plain") || type.startsWith("text/xml")) { // It's not an HTML page
        jEditorPane1.setContentType("text/plain");
        jEditorPane1.setFont(new Font("Monospaced", Font.PLAIN, 14));
        page = new StringBuilder();
        Document ed_doc = jEditorPane1.getDocument();
        while ((line = reader.readLine()) != null) {
            line = line.concat("\n");
            page.append(line);
            ed_doc.insertString(ed_doc.getLength(), line, null);
        }
    }
    jEditorPane1.getDocument().putProperty(HTMLDocument.StreamDescriptionProperty, u);
    Yes, it takes some time to load a big page, but no, it does not block GUI repainting and does not block any events such as mouse, etc.
    I'm not an HTML
    expert but I believe if you have, for example, a
    large table structure then the table won't be painted
    until the parser knows how many columns is in the
    table and how wide to make each column. It can't
    determine column width until all the data is read.
    The files I use for testing don't have tables, etc. See this: http://www.lib.ru/ADAMS/liff.txt (Though it ends with .txt, it is an HTML file: almost plain text with some HTML formatting tags). So, again, the problem is not that parsing is slow; the problem is that parsing blocks all GUI events, and I'd like to know why and how to cope with this problem. Running setPage() in a separate thread doesn't change anything.

  • Java HTML Parser

    What seems to be the best tool for parsing HTML, without converting to XHTML unless it's very robust and can handle any HTML page?
    I've been looking at JTidy - http://java-source.net/open-source/html-parsers/jtidy - but that has some problems converting from HTML to XHTML.
    Jericho seems to parse HTML without converting to XHTML and looks reasonable:
    http://jerichohtml.sourceforge.net/doc/index.html
    Has anyone tried the other HTML parsers at http://java-source.net/open-source/html-parsers?
    I would like more information on other HTML parsers people have tried, preferably ones that convert to XHTML without any problems, so we can use a SAX parser to interpret the XML. Looking forward to your input.
    Kind Regards
    Abs

    It kinda depends what you need to use it for.
    Rent me and I'll tell you more.
    Googled it ;-)
    http://sourceforge.net/projects/htmlparser/

  • Using the HTML Parser as a filter

    I have the need to take an HTML file, filter it so that the SOURCE attribute on all of the IMG tags are modified, and then write the filtered file out. This seems pretty straight forward, but I can't seem to figure out how to get the Swing HTML parser to do what I want.
    If I had some way to parse the file into some structure, and then be able to take that structure and "unparse" it back into a file I would be fine. This assumes that I'd be able to override the handling of the IMG tag (or I guess all of the simple tags) on the parser side so that I'd be able to replace one of the attributes with something that's more meaningful for my purpose.
    I know this is not so difficult, I just can't see how to do it. Any help? Thanks in advance.
    Sander Smith

    If the problem is that simple, I'd use java.util.regex classes(Pattern and Matcher) and could write the program in an hour.
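
    A rough sketch of the regex approach suggested above, using java.util.regex.Pattern and Matcher to rewrite the src attribute of every img tag. The class name, the method name and the idea of prepending a prefix are assumptions made for illustration. The pattern only handles double-quoted src values and would also match attributes such as data-src, so it is a starting point for simple, predictable input rather than a general HTML rewriter.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class ImgSrcRewriter {

        // Group 1: everything up to the opening quote of src; group 2: the value; group 3: closing quote
        private static final Pattern IMG_SRC = Pattern.compile(
                "(<img\\b[^>]*?\\bsrc\\s*=\\s*\")([^\"]*)(\")",
                Pattern.CASE_INSENSITIVE);

        // Returns the HTML with the given prefix prepended to every img src value.
        static String prefixImageSources(String html, String prefix) {
            Matcher m = IMG_SRC.matcher(html);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                String replacement = m.group(1) + prefix + m.group(2) + m.group(3);
                // quoteReplacement keeps '$' and '\' in the new value from being treated specially
                m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            }
            m.appendTail(out);
            return out.toString();
        }

        public static void main(String[] args) {
            String html = "<p><img alt=\"x\" src=\"a.png\"> <IMG SRC=\"b.gif\"></p>";
            System.out.println(prefixImageSources(html, "http://example.com/"));
        }
    }
    ```

    Everything outside the matched attribute is copied through unchanged, which is what makes this usable as a filter: read the file into a string, rewrite it, and write the result out.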

  • Control of HTML parsing

    Hello,
    I am trying to remove certain tags from an HTML source, but I am unfamiliar with the use of the ParserDelegator or ParserCallback classes. Say you have this example here:
    import javax.swing.text.html.parser.*;
    import javax.swing.text.html.*;
    import javax.swing.text.*;
    import java.io.*;
    public class TextFromHtml extends HTMLEditorKit.ParserCallback {
        public void handleText(char[] data, int pos) {
            System.out.println(new String(data));  // or other code you like
        }
        public static void main(String argv[]) {
            try {
                Reader r = new FileReader("testPrint.html");
                ParserDelegator parser = new ParserDelegator();
                HTMLEditorKit.ParserCallback callback = new TextFromHtml();
                parser.parse(r, callback, true);  // or 'false' if you like
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    Is there any way I can control which HTML lines are parsed?
    Thanks in advance!
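
    One possible way to control what is kept, sketched here under the assumption that "removing certain tags" means dropping the text inside a given element: the callback also receives start and end tag events, so it can track whether the parser is currently inside a tag to be skipped and ignore text while it is. The class name, the method name and the choice of H1 in the usage line are hypothetical.

    ```java
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    class SelectiveTextExtractor extends HTMLEditorKit.ParserCallback {
        private final HTML.Tag skipped;               // tag whose text should be dropped
        private final List<String> kept = new ArrayList<>();
        private int depth;                            // > 0 while inside the skipped tag

        SelectiveTextExtractor(HTML.Tag skipped) {
            this.skipped = skipped;
        }

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(skipped)) {
                depth++;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag t, int pos) {
            if (t.equals(skipped) && depth > 0) {
                depth--;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (depth == 0) {
                kept.add(new String(data));
            }
        }

        // Returns the text chunks that are not inside the skipped tag.
        static List<String> textOutside(String html, HTML.Tag skipped) throws IOException {
            SelectiveTextExtractor callback = new SelectiveTextExtractor(skipped);
            new ParserDelegator().parse(new StringReader(html), callback, true);
            return callback.kept;
        }

        public static void main(String[] args) throws IOException {
            String html = "<html><body><h1>Skip me</h1><p>Keep me</p></body></html>";
            System.out.println(textOutside(html, HTML.Tag.H1));
        }
    }
    ```

    The same pattern extends to a set of tags by replacing the single field with a collection and checking membership in the start and end handlers.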

    I was able to extract the img src tag code using some string manipulation, but now I have a problem with parsing the rest of the HTML source...
    import javax.swing.text.html.parser.*;
    import javax.swing.text.html.*;
    import javax.swing.text.*;
    import javax.swing.*;
    import java.io.*;
    public class TextFromHtml extends HTMLEditorKit.ParserCallback {
        static BufferedReader reader;
        static PrintWriter pw;

        // Collapse runs of whitespace into single spaces
        public static String trimmer(String trimText) {
            return (trimText.trim()).replaceAll("\\s+", " ");
        }

        public void handleText(char[] data, int pos) {
            String text = new String(data);
            text = trimmer(text);
            if (text.indexOf(">") != -1) {
                String[] temp = text.split(">");
                if (temp.length > 1)
                    text = temp[1].trim();
                else text = "";
            }
            pw.println(text);
        }

        //public void removeTags(String file)
        public static void main(String[] args) {
            String file = "";
            String queue = "";
            try {
                String s = args[0];
                reader = new BufferedReader(new FileReader(s));
                file = s.substring(0, s.lastIndexOf('.')) + ".txt";
                pw = new PrintWriter(new BufferedWriter(new FileWriter(file)));
                ParserDelegator parser = new ParserDelegator();
                HTMLEditorKit.ParserCallback callback = new TextFromHtml();
                parser.parse(reader, callback, true);
                pw.close();
                reader = new BufferedReader(new FileReader(file));
                BufferedWriter fout = new BufferedWriter(new FileWriter("out.txt"));
                while ((queue = reader.readLine()) != null) {
                    if (queue.length() != 0) {
                        System.out.println(queue);
                        fout.write(queue);
                    }
                }
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            //return array;
        }
    }
    For some reason, I can only output queue to the CMD prompt; anything else, such as output to another file or returning queue as an array of Strings, gives me null values. Does anyone see what is wrong with the code?
    Thanks in advance!
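
    One thing that stands out in the code above: fout is a BufferedWriter that is never flushed or closed, so anything written to it can stay in the buffer and never reach out.txt. It also writes each line without a line separator. The sketch below shows the copy step with try-with-resources, which closes (and therefore flushes) both streams even if an exception is thrown. The class and method names are hypothetical, and this addresses only the file output, not the "array of Strings" part of the question.

    ```java
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    class CopyNonEmptyLines {

        // Copies every non-empty line from in to out, one line per output line.
        static void copy(Path in, Path out) throws IOException {
            try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (!line.isEmpty()) {
                        writer.write(line);
                        writer.newLine();
                    }
                }
            } // both streams are closed here, which flushes the buffered output
        }

        public static void main(String[] args) throws IOException {
            Path in = Files.createTempFile("lines-in", ".txt");
            Path out = Files.createTempFile("lines-out", ".txt");
            Files.write(in, Arrays.asList("first", "", "second"));
            copy(in, out);
            System.out.println(Files.readAllLines(out));
        }
    }
    ```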

  • Extending javax.swing.text.html.parser

    I'm trying to extend the JDK's prepackaged HTML parser via a specific javax.swing.text.html.parser.DTD (taking the prepackaged HTML32 DTD and adding entities to it), but I'm stuck because the DTD class is poorly described in the javadoc (e.g. integer parameters with no description of their values, messy list structures, etc.). Is there any alternative documentation, or maybe some working examples for my problem?

  • Sleep() in ActionScript3 and/or Html parser issue

    Hello, I hope you can help me, please.
    I am using a html parser (http://code.google.com/p/htmlsprite/) to render several tables into a movieclip.
    My problem is: as it takes some time to render, the only one actually being added to the movieclip is the last one. I would use a timer, but that means I would need to call a function, and I am using a "for" loop to build the movieclips. Is there a way to say "wait for x seconds before continuing processing" in ActionScript3?
    Thank you very much for your help.

    You can't stall a for loop from processing. Instead of using a for loop, you could use a Timer to control the speed at which some code executes.

  • HTML parsing, AttributeSet.getAttribute() doesn't work

    I parsed a website using javax.swing.text.html.parser.
    When I get a javax.swing.text.html.parser.Element, elem, I used elem.getAttributSet to get the AttributeSet of elem, atts. Then I used atts.getAttribute(HTML.Tag.FORM) to get the surrounding form tag. This works fine in jdk 1.3.8, but for jdk 1.4.2 and after, it just returns null.
    Is this a parsing bug in Java? Is there any way to get around this problem?

    Well, it won't work as iWeb has no import facility so cannot open html files.
    What you could do is upload the html file to wherever you are hosting your site and create a link to it from iWeb, or find another package similar to the one you are using at present that is for Mac rather than PC.

  • How to parse a HTML file using HTML parser in J2SE?

    I want to parse an HTML file using HTML parser. Can any body help me by providing a sample code to parse the HTML file?
    Thanks and Cheers,
    Amaresh

    What HTML parser and what does "parsing" mean to you?

  • Problem with HTML viewer

    Dear All ,
    I am facing a problem with the HTML Viewer. My scenario is as follows:
    1. I have created one HTML page. On that page there are 4 images.
    2. I imported that HTML page into SAP with the help of transaction SMW0.
    3. I called that HTML page in my ABAP program using the method "load_html_document" of class cl_gui_html_viewer.
    4. This works perfectly on the machine on which all this development was done.
    But the issue is that when I execute my ABAP program on a different machine, the images on that HTML page are not displayed.
    Can you please guide me on how to remove that machine dependency?
    Regards,
    Nikhil

    Hi Nikhil,
    Please check whether the image was imported properly. Also check whether there is any option you might have forgotten while importing, such as a dependency.
    Regards
    Abhii...
