STYLE tag problem in HTML Parser.

Hi,
I am trying to parse an HTML file. I am able to extract the content of various tags like Tag.SPAN, Tag.DIV and so on.
I want to extract the text content of Tag.Style. What should I do? The problem is that HTML Parser right now does not support this tag, along with 5 more tags such as Tag.META and Tag.PARAM.
Please help me out.

Before responding to this posting, you may want to check out the discussion in the OP's previous posting on this topic:
http://forum.java.sun.com/thread.jspa?threadID=634938
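
If the parser in use does not expose STYLE content, one workaround is to pull it out of the raw markup before parsing. The following is only a minimal sketch under that assumption, not the HTML Parser library's own API: the class name StyleText and the method styleContents are made up for illustration, and a regular expression will miss unusual markup (for example a literal "</style>" inside a CSS comment).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the raw text of every <style>...</style> block.
class StyleText {
    // DOTALL lets '.' span line breaks; CASE_INSENSITIVE accepts <STYLE> as well.
    private static final Pattern STYLE = Pattern.compile(
            "<style\\b[^>]*>(.*?)</style\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> styleContents(String html) {
        List<String> out = new ArrayList<String>();
        Matcher m = STYLE.matcher(html);
        while (m.find()) {
            out.add(m.group(1).trim()); // group 1 is the text between the tags
        }
        return out;
    }
}
```

Calling StyleText.styleContents(pageSource) returns one string per STYLE element, in document order.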

Similar Messages

A little problem getting the style tag of a html file separate from rest

I'm making a program that will take in a URL and then search through that URL for all a, link, embed, frame, and img tags, find their sources, and download them. I also want to search through the style and find anything that uses a URL (ex. background-image:url('somepic.jpg')) and download that file. In the end, you should be able to go to the directory you saved it all in, open index.html, and see an exact replica of the original site. Now, my problem is that my program isn't getting the style tag's contents. Here's my code:
import java.io.*;
import java.util.*;
import java.net.*;
public class Test
{
     //-->>>> MAIN <<<<--//
     public static void main(String...a)
     {
          try{
               System.out.print("Enter URL: ");
               String target = new Scanner(System.in).next();
               URL url = null;
               try{
                    url = new URL(target);
               }catch(MalformedURLException x){
                    url = new URL("http://" + target);
               }
               Scanner scan = new Scanner(url.openStream());
               scan.useDelimiter("<");
               ArrayList<String> tokens = new ArrayList<String>();
               while(scan.hasNext())
               {
                    String str = scan.next();
                    str = str.trim();
                    Scanner tags = new Scanner(str);
                    if(tags.hasNext())
                    {
                         String tag = tags.next();
                         if(tag.equalsIgnoreCase("a") || tag.equalsIgnoreCase("img") || tag.equalsIgnoreCase("link") || tag.equalsIgnoreCase("embed") || tag.equalsIgnoreCase("frame"))
                              tokens.add(str);
                         else if(tag.equalsIgnoreCase("style"))
                              tokens.add(str);// This isn't adding anything
                    }
               }
               for(String str : tokens)
                    System.out.println(str);
          }catch(UnknownHostException x){
               System.err.println("Host not found.");
          }catch(Exception x){
               x.printStackTrace();
          }
     }
     //-->>>> FindURLAttributes <<<<--// <--- Under construction
     private static ArrayList<String> findURLAttributes(String tag)
     {
          ArrayList<String> tokens = new ArrayList<String>();
          tokens.add(tag);
          return tokens;
     }
}

I've never tried it, but it seems like using an existing html parser would be a lot easier. I've worked with xml dom parsers, and it's not really that hard. I don't imagine working with an html dom would be too difficult either, at least it wouldn't be as hard as doing it by hand. Google for java html parser and see if any of them suit your needs.
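
Whichever way the style text is obtained, the url(...) references mentioned in the question still have to be picked out of the CSS. This is a rough sketch of that one step using java.util.regex; CssUrls and urlsIn are hypothetical names, and it assumes the simple url(x), url('x') and url("x") forms only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists the targets of url(...) references in CSS text.
class CssUrls {
    // Group 1 is the optional quote character, group 2 is the bare target.
    private static final Pattern URL_REF = Pattern.compile(
            "url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)",
            Pattern.CASE_INSENSITIVE);

    static List<String> urlsIn(String css) {
        List<String> out = new ArrayList<String>();
        Matcher m = URL_REF.matcher(css);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }
}
```

Relative targets such as somepic.jpg would still need to be resolved against the page address, for example with the java.net.URL(URL context, String spec) constructor, before downloading.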

Problem with HTML Parser and multiple instances

I have a parser program which queries an online shopping comparison web page and extracts the information needed. I am trying to run this program with different search terms, which are created by splitting an entered sentence so that each word is sent separately. However, the outputs (text files) are the same for each word, even though the correct term and output file seem to be passed. I suspect the connection is not being closed each time, but I am not sure why this is happening.
If I create an identical copy of the program and run that after the first one, it works, but this is not an appropriate solution.
Any help would be much appreciated. Here is some of my code; if more is required I will post it.
To run the program:
StringTokenizer t = new StringTokenizer("red green yellow", " ");
        int c = 0;
        Parser1 p = new Parser1();
        while (t.hasMoreTokens()) {
            c++;
            String tok = t.nextToken();
            File tem = new File("C:/"+c+".txt");
                p.mainprog(tok, tem);
                p.mainprog(tok, tem)
                p.mainprog(tok, tem);
}
The parser:
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;
import javax.swing.text.*;
import java.awt.*;
import java.util.*;
import javax.swing.*;
import java.io.*;
import java.net.*;
public class Parser1 extends HTMLEditorKit.ParserCallback {
    variable declarations
   public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos){
...methods
public void handleText(char[] data, int pos){
       ...methods
public void handleTitleTag(HTML.Tag t, char[] data){
public void handleEmptyTag(HTML.Tag t, char[] data){
public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos){
...methods
      static void mainprog(String term, File file) {
...proxy and authentication methods
                    Authenticator.setDefault(new MyAuthenticator() );
                    HTMLEditorKit editorKit = new HTMLEditorKit();
                    HTMLDocument HTMLDoc;
                    Reader HTMLReader;
                  try {
                        String temp = new String(term);
                        String fullurl = new String(MainUrl+temp);
                        url = new URL(fullurl);
                        InputStream myInStream;
                        myInStream = url.openConnection().getInputStream();
                        HTMLReader = (new InputStreamReader(myInStream));
                        HTMLDoc = (HTMLDocument) editorKit.createDefaultDocument();
                        HTMLDoc.putProperty("IgnoreCharsetDirective", new Boolean(true));
                        ParserDelegator parser = new ParserDelegator();
                        HTMLEditorKit.ParserCallback callback = new Parser1();
                        parser.parse(HTMLReader, callback, true);
                        callback.flush();
                        HTMLReader.close();
                        myInStream.close();
                 catch (IOException IOE) {
                    IOE.printStackTrace();
                catch (Exception e) {
                    e.printStackTrace();
      try {
            FileWriter writer = new FileWriter(file);
            BufferedWriter bw = new BufferedWriter(writer);
            for (int i = 0; i < vect.size(); i++){
                bw.write((String)vect.elementAt(i));
                if (vect.elementAt(i)!=vect.lastElement()){
                    bw.newLine();
            bw.flush();
            bw.close();
            writer.close();
        catch (IOException IOE) {
                    IOE.printStackTrace();
                catch (Exception e) {
                    e.printStackTrace();
          }   catch (IOException IOE) {
                 System.out.println("User options not found.");
}
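
The posted code does not show where vect is declared. If it is a static field shared by every run, results from one search could carry over into the next, which would be one possible explanation for identical output files. The sketch below shows the alternative of keeping all collected text in the callback instance and creating a fresh callback per parse. TextCollector and collect are hypothetical names, and the in-memory page stands in for the real URL connection.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical collector: all state lives in the instance, nothing is static.
class TextCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> lines = new ArrayList<String>();

    @Override
    public void handleText(char[] data, int pos) {
        lines.add(new String(data));
    }

    // Parses one document with a brand-new collector and returns only its text.
    static List<String> collect(Reader html) {
        TextCollector callback = new TextCollector();
        try {
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.lines;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"red", "green", "yellow"}) {
            // Stand-in for the page fetched for each search term.
            Reader page = new StringReader("<html><body><p>" + term + "</p></body></html>");
            System.out.println(term + " -> " + collect(page));
        }
    }
}
```

Because each call builds its own list, nothing from the "red" run can appear in the "green" output.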

How many Directory Servers are you using?
Are both serverconfig.xml files of PS instances the same?
Set debug level to message in the appropriate AMConfig.properties of your portal instances and look into AM debug files.
For some reason amSDK seems not to get the correct service values.
-Bernhard

Problem in HTML parsing

I am trying to parse HTML, and for an HTML fragment like this
<span class="authorName">Steve </span>
I want to retrieve the value "Steve".
Right now I am using a parser to generate the HTML tree, with this code:
Tidy tidy = new Tidy();
tidy.setXHTML(xhtml);
d = tidy.parseDOM(in,out);
NodeList spanNode = d.getElementsByTagName("span");
int length = spanNode.getLength();
for(int i = 0;i<length;i++)
{
org.w3c.dom.Node span = spanNode.item(i);
String tempAltText = span.getAttributes().getNamedItem("class").getNodeValue();
if(tempAltText.equals("authorName")){
System.out.println("the item is " + tempAltText);
}
else{
}
}
tempAltText stores "authorName" as a string, but not "Steve".
Please give me some suggestions.

try this
XPathFactory factory=XPathFactory.newInstance();
XPath xPath=factory.newXPath();
XPathExpression xPathExpression=xPath.compile("//*[@class='authorName']");
System.out.println(xPathExpression.evaluate(d));
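
For comparison, the same value can be reached without XPath: the class attribute only identifies the span, while "Steve" is held in a text node that is a child of the span. The sketch below sticks to basic org.w3c.dom calls (getFirstChild, getNodeValue) on the assumption that the Document returned by Tidy supports them. SpanText and its methods are hypothetical names, and the parse method uses the JDK's own XML parser purely as a stand-in for tidy.parseDOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SpanText {
    // Concatenates the values of the text nodes directly under an element.
    static String directText(Node element) {
        StringBuilder sb = new StringBuilder();
        for (Node c = element.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                sb.append(c.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Trimmed text of the first <span> whose class attribute equals cssClass, or null.
    static String textOfSpan(Document d, String cssClass) {
        NodeList spans = d.getElementsByTagName("span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element span = (Element) spans.item(i);
            if (cssClass.equals(span.getAttribute("class"))) {
                return directText(span).trim();
            }
        }
        return null;
    }

    // Stand-in for tidy.parseDOM(...): parses well-formed markup with the JDK parser.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the document from the question, SpanText.textOfSpan(d, "authorName") would yield the author's name rather than the attribute value.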

Style coding problems in html version of report

How come when I create a very ugly plain html report using OR 6i, the html loads very quickly in the browser, but when I add the style coding the WWW wizard creates the html chugs like a snail? Is there some way to get around it, like external css? I'm a novice at OR 6i so much help is appreciated!!
Thanks,
Dia

can you please tell me which parameter should I set because
I am running report from menu and I gave parameters like
add_parameter(pl_id,'DesFORMAT',TEXT_PARAMETER,'HTMLCSS');
IT DOES NOT WORK !! IT GIVES ME SAME OUT PUT AS PLAIN html with
out any borders OR COLORS( basically With out STYLE SHEET
FORMAT !!)
Hello,
I assume you are referring to the difference between DesFormat=HTML and DesFormat=HTMLCSS. There should be no big difference between these two formats. In Reports 6i all style-related information is created on the fly and in-line; there is no reference to an external file available.
Regards,
The Oracle Reports Team

Error on HTML Parser

Hi,
I'm trying to parse a HTML page but I always get the same error, which is the following exception:
javax.swing.text.ChangedCharSetException
In the class ParserCallback I'm using the method handleError and it shows:
req.att contentmeta?
ioexception???
just before the exception occurs.
The only line where this error occurs in the html page is:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
and I know that the exact point is the attribute 'content'. If it is removed or changed to 'contenttype' the error disappears.
The problem is that I can't change the attribute because the HTML page is not mine; it is fetched from the Web. And I don't want to remove it.
Does anybody know what is happening?
Thanks!!

i am also having a problem with html parsing in java
i have given a detailed / complete description of the problem on this link along with the log and my sample code ...
http://forum.java.sun.com/thread.jspa?threadID=643683&tstart=0
if u could see this ...
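
For readers hitting the same ChangedCharSetException: the third argument of ParserDelegator.parse is named ignoreCharSet, and passing true asks the parser not to act on the charset given in the meta tag. The following is a small sketch of that call only; IgnoreCharsetDemo and textOf are hypothetical names, and the page is an in-memory string rather than a real download. When the document is loaded through HTMLEditorKit instead, the "IgnoreCharsetDirective" document property used elsewhere on this page plays the same role.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class IgnoreCharsetDemo {
    // Returns the text chunks of a page, ignoring any charset directive in it.
    static List<String> textOf(String html) {
        final List<String> text = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.add(new String(data));
            }
        };
        try {
            // true = ignore the charset named in <meta http-equiv="Content-Type" ...>
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text;
    }
}
```

The meta tag stays in the page untouched; only the parser's reaction to it changes.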

JRE1.5 swing.html parser fails to parse data between script tags

Hi all...
I've written a class that extends the java-provided default HTML parser to parse for text inside a table. In JRE1.4.* the parser works fine and extracts data between <script> tags as text. However now that I've upgraded to 1.5, the data between <script> tags are no longer parsed. Any suggestion anyone?
Steve

According to the API docs, the 1.5 parser supports HTML 3.2, for which the spec states that the content of SCRIPT and STYLE tags should not be rendered. I assume it doesn't have a scripting engine, so it won't get executed either.

Where can I get an Html error report of all the syntax and tag problems?

Where can I get an Html error report of all the syntax and tag problems?

Thank you for your answer.
Where is the DW validation for me?
My files are in my computer so I don’t have an external URL.
File > validation > as xml = closes DW... Maybe because it is not a correct command for HTML,
And
Window > results > validation = gives a partial mistakes (e.g. shows an open tag without closing tag, but doesn’t show a closing tag without an open tag).
Thank you.

Struts bean/html tag problem

i am writing this url_element on screen
<bean:write name="url_element"/>
then i got a link
<html:link action="urlDetail.do?url=???">aaa</html:link>
how can i transfer the url_element into ??? do i need to write something like <%...%>?

just found the solution:
You can't use a <bean:write> tag inside an <html:link> tag. Struts doesn't support it.
The easiest way to fix this is to abandon use of <html:link> and just use the plain old html tags <a></a>. Then you will have no problem substituting parameters with <bean:write> tags.
The only advantage <html:link> has over the plain html tag is that it automatically does URL rewriting (Adding of the jsessionid to the URL so that sessions may be tracked even if the user has turned cookies off). If you're not using this function anyway, as most modern websites don't, you may as well use the html tag.
If you still want to use <html:link>, use the struts-el version of the html tags and use EL expressions instead of <bean:write> tags.

HTML tag problem when adding Google rich snippets in templates?

The new Google plus Rich Snippets allow us to add a schema tag like this example:
<html itemscope itemtype="http://schema.org/Article">
I can do this on my .html pages but as soon as I make any changes to the .dwt template for the page it reverts all of the tags back to <html>
How do I stop this happening?
I have tried:
Changing preferences in Dreamweaver's default code rewrite settings to "'Never Rewrite Code' for HTML document type " but this does not appear to fix things.
Very frustrating as I have lost days of work.
Any help would be appreciated
many thanks
Craig

I've done what you have said (I think); however, something is not quite right. I am assuming that I need to add the schema below the <BODY> tag if I make the changes to the template as you have suggested.
This is my practice page which is in a template:
http://www.psychics.co.uk/lovepsychic/index.html
Following the Adobe instruction on the forum: High at the top of the header section we get this code:

For the customised body tag we get this:
<body itemscope itemtype="http://schema.org/Product">
Then on the left column of the page and below the navigation and below the <body> I have added this:

<h1 itemprop="name" content="Love Psychics Readings">Love Psychics - Psychic Love Predictions Online</h1>
<img itemprop="image" content="http://www.psychics.co.uk/images/schema/love.jpg" src="/images/stars.gif" width="160" height="76" alt="Psychics and Mediums Network"></img>
<p itemprop="description" content="Getting a reading with love psychics to find out about relationships, love, romance, marriage and family. Article about our Love psychics readings onlineservices.">Copyright Psychics & Mediums Network - QKE Ltd.</p>

I must be getting close but it just shows the normal Google plus stuff.
I need to get your microdata markup into the body of the document somewhere - otherwise Google will use the something like <title> and <meta description> tags and guess at an image.
Any idea where I am going wrong or is it a quirk in Dreamweaver?

Problem with SAX parser - entity must finish with a semi-colon

Hi,
I'm pretty new to the complexities of using SAXParserFactory and its cousins of XMLReaderAdapter, HTMLBuilder, HTMLDocument, entity resolvers and the like, so wondered if perhaps someone could give me a hand with this problem.
In a nutshell, my code is really nothing more than a glorified HTML parser - a web page editor, if you like. I read in an HTML file (only one that my software has created in the first place), parse it, then produce a Swing representation of the various tags I've parsed from the page and display this on a canvas. So, for instance, I would convert a simple <TABLE> of three rows and one column, via an HTMLTableElement, into a Swing JPanel containing three JLabels, suitably laid out.
I then allow the user to amend the values of the various HTML attributes, and I then write the HTML representation back to the web page.
It works reasonably well, albeit a bit heavy on resources. Here's a summary of the code for parsing an HTML file:
htmlBuilder = new HTMLBuilder();
parserFactory = SAXParserFactory.newInstance();
parserFactory.setValidating(false);
parserFactory.setNamespaceAware(true);
FileInputStream fileInputStream = new FileInputStream(htmlFile);
InputSource inputSource = new InputSource(fileInputStream);
DoctypeChangerStream changer = new DoctypeChangerStream(inputSource.getByteStream());
changer.setGenerator(
   new DoctypeGenerator()
   {
      public Doctype generate(Doctype old)
      {
         return new DoctypeImpl
         (
            old.getRootElement(),
            old.getPublicId(),
            old.getSystemId(),
            old.getInternalSubset()
         );
      }
   }
);
resolver = new TSLLocalEntityResolver("-//W3C//DTD XHTML 1.0 Transitional//EN", "xhtml1-transitional.dtd");
readerAdapter = new XMLReaderAdapter(parserFactory.newSAXParser().getXMLReader());
readerAdapter.setDocumentHandler(htmlBuilder);
readerAdapter.setEntityResolver(resolver);
readerAdapter.parse(inputSource);
htmlDocument = htmlBuilder.getHTMLDocument();
htmlBody = (HTMLBodyElement)htmlDocument.getBody();
traversal = (DocumentTraversal)htmlDocument;
walker = traversal.createTreeWalker(htmlBody,NodeFilter.SHOW_ELEMENT, null, true);
rootNode = new DefaultMutableTreeNode(new WidgetTreeRootNode(htmlFile));
createNodes(walker);
However, I'm having a problem parsing a piece of HTML for a streaming video widget. The key part of this HTML is as follows:
            <object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"
              id="client"
        width="100%"
        height="100%"
              codebase="http://fpdownload.macromedia.com/get/flashplayer/current/swflash.cab">
              <param name="movie" value="client.swf?user=lkcl&stream=stream2&streamtype=live&server=rtmp://192.168.250.206/oflaDemo" />
         etc....
You will see that the <param> tag in the HTML has a value attribute which is a URL plus three URL parameters. It looks absolutely standard, and in fact works correctly in a browser. However, when my readerAdapter.parse() method gets to this point, it throws an exception saying that there should be a semi-colon after the entity 'stream'. I can see what's happening: basically the SAXParser thinks that the ampersand marks the start of a new entity. When it finds '&stream' it expects it to finish with a semi-colon (e.g. much like &nbsp; and other such HTML characters). The only way I can get the parser past this point is to encode all the relevant ampersands to %26, but then the web page stops working! Aaargh....
Can someone explain what my options are for getting around this problem ? Some property I can set on the parser ? A different DTD ? Not to use SAX at all ? Override the parser's exception handler ? A completely different approach ?!
Could you provide a simple example to explain what you mean ?
Thanks in anticipation !

You probably don't have the ampersands in your "value" attribute escaped properly. It should look like this:
value="client.swf?user=lkcl&amp;stream=..."
Most HTML processors (i.e. browsers) will overlook that omission, because almost nobody does it right when they are generating HTML by hand, but XML processors won't.
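
The effect can be reproduced with the JDK's own XML parser. This is a small sketch rather than the poster's SAX pipeline: AmpersandDemo, escapeBareAmpersands and paramValue are hypothetical names, and the regular expression is a simple heuristic that leaves anything already shaped like an entity or character reference alone.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AmpersandDemo {
    // Escapes '&' unless it already starts a reference such as &amp; or &#38;.
    static String escapeBareAmpersands(String markup) {
        return markup.replaceAll("&(?!#?\\w+;)", "&amp;");
    }

    // Parses a single element and returns its value attribute;
    // throws IllegalArgumentException if the markup is not well-formed XML.
    static String paramValue(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return d.getDocumentElement().getAttribute("value");
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + e.getMessage(), e);
        }
    }
}
```

Once the ampersands are written as &amp; in the file, the parser accepts the element and hands back the attribute value with plain & characters, so the URL seen by the application (and by a browser) is unchanged.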

JEditorPane slow HTML parsing and GUI lockup

Hi,
I use a JEditorPane to display an HTML page. If the page is relatively small, everything is OK, but with big pages (about 1 Mb) it takes a huge amount of time to load (actually, to parse) the page. And the main problem is that the user interface locks and doesn't even repaint until the page is loaded. Is there any workaround for this problem? I use JDK 1.5.0_2 std. edition.

Try loading a 1Mb plain text file and you will find
that loading is slow. So it will only be slower with
the additional parsing required for an HTML file.
Really? The following is the text loading code, which is run in a separate thread:
                    BufferedInputStream in = new BufferedInputStream(con.getInputStream());
                    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                    String line;
                    if(type.startsWith("text/plain")||type.startsWith("text/xml")) // It's not an HTML page
                    {
                        jEditorPane1.setContentType("text/plain");
                        jEditorPane1.setFont(new Font("Monospaced",Font.PLAIN,14));
                        page=new StringBuilder();
                        Document ed_doc=jEditorPane1.getDocument();
                        while ( ( line = reader.readLine()) != null )
                        {
                            line=line.concat("\n");
                            page.append(line);
                            ed_doc.insertString(ed_doc.getLength(),line,null);
                        }
                    }
                    jEditorPane1.getDocument().putProperty(HTMLDocument.StreamDescriptionProperty, u);
Yes, it takes some time to load a big page, but no, it does not block GUI repainting and does not block any events such as mouse, etc.
I'm not an HTML
expert but I believe if you have, for example, a
large table structure then the table won't be painted
until the parser knows how many columns are in the
table and how wide to make each column. It can't
determine column width until all the data is read.
The files I use for testing don't have tables, etc. See this: http://www.lib.ru/ADAMS/liff.txt (though it ends with .txt, it is an HTML file: almost plain text with some HTML formatting tags). So, again, the problem is not that parsing is slow; the problem is that parsing blocks all GUI events, and I'd like to know why and how to cope with this problem. Running setPage() in a separate thread doesn't change anything.

Using the HTML Parser as a filter

I have the need to take an HTML file, filter it so that the SOURCE attribute on all of the IMG tags are modified, and then write the filtered file out. This seems pretty straight forward, but I can't seem to figure out how to get the Swing HTML parser to do what I want.
If I had some way to parse the file into some structure, and then be able to take that structure and "unparse" it back into a file I would be fine. This assumes that I'd be able to override the handling of the IMG tag (or I guess all of the simple tags) on the parser side so that I'd be able to replace one of the attributes with something that's more meaningful for my purpose.
I know this is not so difficult, I just can't see how to do it. Any help? Thanks in advance.
Sander Smith

If the problem is that simple, I'd use java.util.regex classes(Pattern and Matcher) and could write the program in an hour.
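
A rough sketch of that regex approach is shown below. It is an illustration under stated assumptions, not a full HTML filter: ImgSrcFilter and rewrite are hypothetical names, only quoted src values are handled, and the mapping function is supplied by the caller.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcFilter {
    // Group 1: everything up to the opening quote; group 2: the quote; group 3: the old src value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img\\b[^>]*?\\bsrc\\s*=\\s*)([\"'])(.*?)\\2",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the markup with every quoted img src value passed through mapSrc.
    static String rewrite(String html, UnaryOperator<String> mapSrc) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + m.group(2) + mapSrc.apply(m.group(3)) + m.group(2);
            // quoteReplacement keeps '$' and '\' in the new value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Everything outside the matched src values is copied through unchanged, which is what makes this usable as a filter: read the file into a string, call rewrite, and write the result back out.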

Control of HTML parsing

Hello,
I am trying to remove certain tags from an HTML source, but I am unfamiliar with the ParserDelegator and ParserCallback classes. Say you have this example:
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;
import javax.swing.text.*;
import java.io.*;
public class TextFromHtml extends HTMLEditorKit.ParserCallback{
public void handleText(char[] data, int pos){
      System.out.println(new String(data)); // or other code you like
}
public static void main(String argv[]){
    try{
      Reader r = new FileReader("testPrint.html");
      ParserDelegator parser = new ParserDelegator();
      HTMLEditorKit.ParserCallback callback = new TextFromHtml();
      parser.parse(r, callback, true); // or 'false' if you like
    }
    catch (IOException e){
      e.printStackTrace();
    }
}
}
Is there any way I can control which HTML lines are parsed?
Thanks in advance!
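
One way to control what gets through, sketched here as an assumption about what is wanted, is to track the start and end tags in the callback and ignore text while inside an unwanted element. SelectiveText and textWithout are hypothetical names, and the choice of which tags to skip is left to the caller.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SelectiveText extends HTMLEditorKit.ParserCallback {
    private final Set<HTML.Tag> skipped;
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside at least one skipped element

    SelectiveText(HTML.Tag... tagsToSkip) {
        skipped = new HashSet<HTML.Tag>(Arrays.asList(tagsToSkip));
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (skipped.contains(t)) skipDepth++;
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (skipped.contains(t) && skipDepth > 0) skipDepth--;
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (skipDepth == 0) text.append(data).append('\n');
    }

    // Returns the document text, one chunk per line, minus text inside the skipped tags.
    static String textWithout(String html, HTML.Tag... tagsToSkip) {
        SelectiveText cb = new SelectiveText(tagsToSkip);
        try {
            new ParserDelegator().parse(new StringReader(html), cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb.text.toString();
    }
}
```

For example, SelectiveText.textWithout(html, HTML.Tag.B) would drop bold runs and keep the rest of the text.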

I was able to extract the img src tag code using some string manipulation, but now I have a problem with parsing the rest of the HTML source...
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;
import javax.swing.text.*;
import javax.swing.*;
import java.io.*;
public class TextFromHtml extends HTMLEditorKit.ParserCallback
{
        static BufferedReader reader;
        static PrintWriter pw;
        public static String trimmer(String trimText)
        {
               return (trimText.trim()).replaceAll("\\s+", " ");
        }
        public void handleText(char[] data, int pos)
        {
             String text = new String(data);
             text = trimmer(text);
                if(text.indexOf((">")) != -1)
                {
                        String[] temp = text.split((">"));
                        if(temp.length>1)
                                text = temp[1].trim();
                        else text = "";
                }
             pw.println(text);
        }
        //public void removeTags(String file)
        public static void main(String[] args)
        {
                String file = "";
                String queue = "";
                try
                {
                        String s = args[0];
                        reader = new BufferedReader(new FileReader(s));
                        file = s.substring(0, s.lastIndexOf('.')) + ".txt";
                        pw = new PrintWriter(new BufferedWriter(new FileWriter(file)));
                        ParserDelegator parser = new ParserDelegator();
                        HTMLEditorKit.ParserCallback callback = new TextFromHtml();
                        parser.parse(reader, callback, true);
                        pw.close();
                        reader = new BufferedReader(new FileReader(file));
                        BufferedWriter fout = new BufferedWriter(new FileWriter("out.txt"));
                        while ((queue = reader.readLine()) != null)
                        {
                                if(queue.length() != 0)
                                {
                                        System.out.println(queue);
                                        fout.write(queue);
                                }
                        }
                        reader.close();
                }catch (IOException e)
                {
                        e.printStackTrace();
                }
                //return array;
        }
}
For some reason, I can only output queue to the CMD prompt; anything else, such as output to another file or returning queue as an array of Strings, gives me null values. Does anyone see what is wrong with the code?
Thanks in advance!
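
One thing that stands out in the code as posted: fout is a BufferedWriter that never appears to be closed or flushed, so whatever was written to it may still be sitting in the buffer when the program exits, leaving out.txt empty. A sketch of the copy step with try-with-resources follows; CopyNonEmptyLines and copy are hypothetical names, and UTF-8 is an assumption about the file encoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CopyNonEmptyLines {
    // Copies every non-empty line from 'in' to 'out'; returns the number of lines written.
    static int copy(Path in, Path out) {
        int written = 0;
        // try-with-resources closes (and therefore flushes) both streams, even on error.
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.length() != 0) {
                    writer.write(line);
                    writer.newLine(); // keep one line per entry in the output
                    written++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

The same idea applies to the original code without restructuring it: calling fout.close() after the loop should be enough for the data to reach the file.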

HTML parsing, AttributeSet.getAttribute() doesn't work

I parsed a website using javax.swing.text.html.parser.
When I get a javax.swing.text.html.parser.Element, elem, I use elem.getAttributSet to get the AttributeSet of elem, atts. Then I use atts.getAttribute(HTML.Tag.FORM) to get the surrounding form tag. This works fine in jdk 1.3.8, but for jdk 1.4.2 and after, it just returns null.
Is this a parsing bug in Java? Is there any way to get around this problem?

Well, it won't work as iWeb has no import facility so cannot open html files.
What you could do is upload the html file to wherever you are hosting your site and create a link to it from iWeb, or find another package similar to the one you are using at present that is for Mac rather then PC.
