Parsing an HTML document

Does anyone know how to parse an HTML document without having JEditorPane (a pain in the neck) do it for you?

If you want to do a small amount of parsing, then it would make sense to write a custom program as described in the previous response. If, however, you want to do a lot of parsing, then it might make more sense to try to make use of an XML parser. If you are trying to parse html pages that are your own, then you might want to think about transforming them into xhtml so that an XML parser will be able to process them.
XML parsing is easy. I recently developed a web site for a set of mock exams for the Java Programmer Certification. Originally, I started developing the pages in html, but I quickly realized that I would have a hard time managing the exams in that format. I then organized the exam into a set of xml documents, one document for each topic. To publish a set of cross-topic exams, I use JDOM (with the help of SAX) to load all of the questions into the Java Collections Framework, where I can easily organize a set of four cross-topic exams. I also use JDOM to number the questions and answers before writing the new exams out to a new set of four xml files. Then I use XSLT to transform the four exam.xml documents into eight HTML files: four html files for the questions and four for the answers.
If you would like to take a look at the result, then please use the following link.
http://www.geocities.com/danchisholm2000/
If you own the html files that you want to parse, then I would try to find a way to transform them into valid xml. XHTML might be a good choice.
Dan Chisholm
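The XML-to-HTML step described above can be sketched with the XSLT support that ships in the JDK (javax.xml.transform). This is only an illustration, not the actual exam site's code: the exam element names, the stylesheet, and the class name ExamToHtml are all made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ExamToHtml {
    // Hypothetical exam document; the element and attribute names are assumptions.
    static final String EXAM_XML =
        "<exam topic=\"Operators\">"
      + "<question>What does 5 / 2 evaluate to?</question>"
      + "<question>What does 5 % 2 evaluate to?</question>"
      + "</exam>";

    // Hypothetical stylesheet: turns the exam into an HTML page with a numbered list.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/exam\">"
      + "<html><body><h1><xsl:value-of select=\"@topic\"/></h1>"
      + "<ol><xsl:for-each select=\"question\">"
      + "<li><xsl:value-of select=\".\"/></li>"
      + "</xsl:for-each></ol>"
      + "</body></html>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML and returns the serialized result. */
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(EXAM_XML, STYLESHEET));
    }
}
```

In a real setup the XML and the stylesheet would normally live in files, and StreamSource and StreamResult accept File arguments as well.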

Similar Messages

Parsing an HTML document

I want to parse an HTML document and replace the anchor tags with my own on the fly. Can anybody suggest how to do it, please?
Ajay

If your HTML files are not well-formed (as is likely with most HTML files), for example because attribute values are not enclosed in quotation marks, most XML parsers will fail.
Anand from this forum introduced JTidy to me and it worked very well. It is an HTML parser that is able to tidy up your HTML code.

How to parse a html document?

I am trying to parse an html document that I load from a url over the internet. The html is not well formed but that's OK. The problem is that the document builder throws an exception because the document is not well formed.
Can I parse an html document using the document builder?
Please note that I set validating to false and the parse still has a fatal error saying the <meta> tag must have a corresponding </meta> tag.
I am using code like the following:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
DocumentBuilder db = factory.newDocumentBuilder();
doc = db.parse(urlString);

"The html is not well formed but that's OK."
No, it isn't.
"Validation" means checking that the XML conforms to a schema or a DTD. Don't confuse that with checking whether the XML is well-formed, which means whether it follows the basic rules of XML like opening tags have to have matching closing tags. Which is what your message is telling you -- your file isn't well-formed XML.
So sure, you can parse HTML or anything else with an XML parser, just be prepared to be told it isn't well-formed XML.
If you want to clean up HTML so that it's well-formed XML, there are products like HTMLTidy and JTidy that will do that for you.
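The difference between validation and well-formedness can be seen in a small sketch. The class name WellFormedCheck and the sample markup are made up for illustration; the parser calls are the standard JAXP ones. Validation is switched off in both cases, and the unclosed <meta> tag is still rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    /** Returns true if the markup parses as well-formed XML. No DTD or schema is consulted. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // disables DTD validation only, not well-formedness checks
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness error
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"a\" content=\"b\"></head><body></body></html>";
        String xhtml = "<html><head><meta name=\"a\" content=\"b\"/></head><body></body></html>";
        System.out.println(isWellFormed(html));  // false: <meta> is never closed
        System.out.println(isWellFormed(xhtml)); // true: <meta .../> is self-closing
    }
}
```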

Photosmart C6180 prints too many copies of PDFs and HTML-documents

Hi,
My Photosmart C6180 prints multiple copies of documents even if I haven't asked for it. This seems to only apply to PDFs and HTML-documents, I haven't seen it happen with Word/Excel. It's not consistent: it happens most of the time but not always for these documents. When I look at the console once it has started printing, I briefly see the status "Printing - Restarting". That's probably when it starts the 2nd copy. Sometimes it stops by itself, other times it seems to be looping and I have to turn the printer off and delete the document from the queue.
HP solution center SW is updated, I run Vista with latest service packs on a Thinkpad T400s.
Any ideas?
Thanks!

My C6180 printed multiple copies of any document from my Vista OS laptop (wireless connection), but not from my XP OS desktop (USB connection), even though I selected 1 copy on the print menu. Unchecking "Enable bidirectional support" in the Properties / Ports dialog box solved the problem. Hope this helps someone too.

Problem parsing a html document

Hi all,
I need to parse a html document.
InputStream is = new java.io.FileInputStream(new File("c:/temp/htmldoc.html"));
DOMFragmentParser DOMparser = new DOMFragmentParser();
DocumentFragment doc = new HTMLDocumentImpl().createDocumentFragment();
DOMparser.parse(new InputSource(is), doc);
NodeList nl = doc.getChildNodes();
I get just 3 of the following nodes...... though the document htmldoc.html is a proper html doc..
#document-fragment
HTML
#text
Any suggestions/help are most welcome. Thanks

Here's an example showing how to do this via javax.xml:
import java.io.*;
import java.net.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;

public class HTMLElementLister {
    public static void main(String[] args) throws Exception {
        URLConnection con = new URL("http://www.mywebsite.com/index.html").openConnection();
        con.connect();
        InputStream in = con.getInputStream();
        Document doc = null;
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            // This only succeeds if the page is well-formed XML (for example XHTML)
            doc = db.parse(in);
        } finally {
            in.close();
        }
        // Print the top-level nodes, then the children of <html> and of <body>
        NodeList nodes = doc.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            String nodeName = node.getNodeName();
            System.out.println(nodeName);
            if ("html".equalsIgnoreCase(nodeName)) {
                System.out.println("|");
                NodeList grandkids = node.getChildNodes();
                for (int j = 0; j < grandkids.getLength(); j++) {
                    Node contentNode = grandkids.item(j);
                    nodeName = contentNode.getNodeName();
                    System.out.println("|- " + nodeName);
                    if ("body".equalsIgnoreCase(nodeName)) {
                        System.out.println("   |");
                        NodeList bodyNodes = contentNode.getChildNodes();
                        for (int k = 0; k < bodyNodes.getLength(); k++) {
                            Node bodyChild = bodyNodes.item(k);
                            System.out.println("   |- " + bodyChild.getNodeName());
                        }
                    }
                }
            }
        }
    }
}

Parsing a HTML document

I want to parse an HTML file and extract the data from it, eliminating the HTML tags. Can anybody give me some idea how to do that?
The HTML file may contain javascript functions also.

Hi,
here is a method for replacing strings in a text:
http://forums.java.sun.com/thread.jsp?forum=31&thread=185221
I know it isn't exactly what you want, but maybe it helps you to begin.
regards
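As an alternative to string replacement, the tag-stripping asked about above can be sketched with the HTML parser bundled with Swing, which works without a JEditorPane. This is a minimal sketch: the class name HtmlTextExtractor is made up, and how script and style content is reported is not handled here, so check the output on your own pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class HtmlTextExtractor {
    /** Collects the text chunks the parser reports, separated by single spaces. */
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // true = ignore charset directives in the document; we already have a Reader
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```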

What happened to the firefox icon on htm and html document files?

Today I noticed it for the first time. The icon used to look like a blank page with a mini Firefox icon on it. Now the htm and html icon looks like a blank white page that is bent at the top right-hand corner. I tried a Firefox reset and it didn't help. I don't want to reinstall Firefox because I want to keep the old Firefox download manager and not the new one that is on the right of the search bar (I saw it on my Ubuntu Linux computer), but my computer with the html file icon problem is Windows 7 x64.

It didn't solve it for me. I have tried the switch icons, uninstall and reinstall, making IE default and then back to FF, file association (there is no Program group in Control Panel in XP btw, the earlier Contributor poster is thinking only of Win7) and scanned other forums without any solution.
mhtml and html files save fine, but are tagged with the Windows default icon for a program the o/s does not recognise. This causes constant irritation because while quickly browsing data files, my eye falls on the unrecognised file icon and I pause wondering what is the problem before realising it's the Firefox error yet again.
There are FF users on other forums with the same problem too, some devising solutions which work for them but not for others, probably because computers are so individual.
I am a recent convert to Windows XP 64-bit sp2 on a new build. With my previous 32-bit machines, I had no problem. I think, like Apple, Firefox has not come to terms with 64-bit despite it supposedly being "the future." I use FF v12 because I can't stand the look of its successors. However, from browsing forums, I gather it makes no matter what version is used. It seemingly happened after v11 and continues throughout.
Any help would be greatly appreciated.
Thanks.

Parsing HTML documents

I am trying to write an application that uses a parsed html document to perform some data retrieval. The problem that I am having is that the parser in JDK1.4.1 is unable to completely parse the document correctly. Some fields are skipped as well as other problems. I believe it has to do with the html32.bdtd. Is there a later version?

Parsing an HTML document is a huge task. You shouldn't do it yourself; javax.swing.text.html and javax.swing.text.html.parser already provide almost everything you need.
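As an illustration of those packages, here is a minimal sketch that pulls the href of every anchor tag out of a page using ParserDelegator and a ParserCallback. The class name LinkLister and the sample HTML are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class LinkLister {
    /** Returns the href values of all <a> start tags, in document order. */
    static List<String> listLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) { // anchors without href (named anchors) are skipped
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.com/a\">A</a>"
                    + " <a name=\"top\">no href</a></body></html>";
        System.out.println(listLinks(html));
    }
}
```

The same callback also has handleEndTag, handleSimpleTag and handleText methods, so other kinds of data retrieval follow the same pattern.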

Parse and output XML document while preserving attribute order

QUESTION: How can I take in an element with attributes from an XML and output the same element and attributes while preserving the order of those attributes?
The following code will parse an XML document and generate (practically) unchanged output. However, all attributes are ordered a-z.
Example: The following element
<work_item_type work_item_db_site="0000000000000000" work_item_db_id="0" work_item_type_code="3" user_tag_ident="Step" name="Work Step" gmt_last_updated="2008-12-31T18:00:00.000000000" last_upd_db_site="0000000000000000" last_upd_db_id="0" rstat_type_code="1">
</work_item_type>
is output as:
<work_item_type gmt_last_updated="2008-12-31T18:00:00.000000000" last_upd_db_id="0" last_upd_db_site="0000000000000000" name="Work Step" rstat_type_code="1" user_tag_ident="Step" work_item_db_id="0" work_item_db_site="0000000000000000" work_item_type_code="3">
</work_item_type>
As you may notice, there is no difference between these besides the order of the attributes!
I am convinced that the problem is not in stylesheet.xslt, but if you are not, it is posted below.
Please, someone help me out with this! I have a feeling the solution is simple.
The following takes the XML from source.xml and outputs it to DEST_filename with attributes in a-z order.
Code:
private void OutputFile(String DEST_filename, String style_filename){
     //StreamSource stylesheet = new StreamSource(style_filename);
     try{
          File dest_file = new File(DEST_filename);
          if(!dest_file.exists())
              dest_file.createNewFile();
          TransformerFactory tranFactory = TransformerFactory.newInstance();
          Transformer aTransformer = tranFactory.newTransformer();
          aTransformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
          // DOMSource needs a parsed Node, not a file name, so parse source.xml first
          Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("source.xml"));
          Source src = new DOMSource(doc);
          Result dest = new StreamResult(dest_file);
          aTransformer.transform(src, dest);
          System.out.println("Finished");
     }
     catch(Exception e){
          System.err.print(e);
          System.exit(-1);
     }
}

You can't. The reason is, the XML Recommendation explicitly says the order of attributes is not significant. Therefore conforming XML serializers won't treat it as if it were significant.
If you have an environment where you think that the order of attributes is significant, your first step should be to reconsider. Possibly it isn't really significant and you are over-reaching in some way. Or possibly someone writing requirements is ignorant of this fact and the requirement can be discarded.
Or possibly your output is being given to somebody else who has a defective parser which expects the attributes to be in a particular order. You could quote the XML Recommendation to those people but often XML bozos are resistant to change. If you're stuck writing for that parser then you'll have to apply some non-XML processing to your output to fix it up on their behalf.
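To check that two elements really are equivalent apart from attribute order, one option is the DOM method Node.isEqualNode, which compares attributes by name and value rather than by position. The sketch below is only an illustration; the class name AttributeOrderDemo is made up and the sample elements are shortened versions of the one in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class AttributeOrderDemo {
    /** Parses the XML text and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse: " + xml, e);
        }
    }

    /** True if both documents have equal root elements, ignoring attribute order. */
    static boolean sameElement(String first, String second) {
        return parseRoot(first).isEqualNode(parseRoot(second));
    }

    public static void main(String[] args) {
        String original = "<work_item_type work_item_db_id=\"0\" name=\"Work Step\" rstat_type_code=\"1\"/>";
        String reordered = "<work_item_type name=\"Work Step\" rstat_type_code=\"1\" work_item_db_id=\"0\"/>";
        String changed = "<work_item_type name=\"Work Step\" rstat_type_code=\"2\" work_item_db_id=\"0\"/>";
        System.out.println(sameElement(original, reordered));
        System.out.println(sameElement(original, changed));
    }
}
```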

How to view html documents

I feel like something went wrong with my Macbook MacBook Pro (13-inch, Mid 2010) Processor 2,4 GHz Intel Core 2 Duo Memory 4 GB 1067 MHz DDR3.
I'm using Yosemite OS X 10.10.2
I am a student and I started 2 months ago to open and download PDFs and HTML documents with no problem. Then about a week ago, all of these docs I downloaded from the web showed up in code. I tried opening them with Safari, but no dice. What can I do?

Counting lines of parsed HTML documents

Hello,
I am using a HTMLEditorKit.ParserCallback to handle data generated by a ParserDelegator.
Everything is ok but I can not find how to catch end of lines (I need to know at what line a tag or an attribute is found).
Thanks in advance for any hints.

I noticed that the parse() method of ParserDelegator creates a DocumentParser object to do the actual parsing of the HTML document. DocumentParser contains a method getCurrentLine(). So I tried extending ParserDelegator so I could access the DocumentParser. However, the getCurrentLine method is protected, so I ended up also extending DocumentParser.
You probably have code something like:
new MyParserDelegator().parse(reader, this, false);
This should be replaced with:
parser = new MyParserDelegator();
parser.parse(reader, this, false);
where you defined an instance variable: MyParserDelegator parser;
You can now use parser.getCurrentLine() in any of your parser callback methods.
Note that you may not always get the results that you expect for the current line; many times I found the line to be 1 greater than I thought it should be. Anyway, you can decide if the code is of any value.
Following is the code for MyParserDelegator and the MyDocumentParser inner class. Good luck.
import java.io.IOException;
import java.io.Reader;
import java.io.Serializable;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.DocumentParser;
import javax.swing.text.html.parser.ParserDelegator;

public class MyParserDelegator extends ParserDelegator implements Serializable {
     MyDocumentParser parser;

     // Create the DocumentParser ourselves so we can keep a reference to it
     public void parse(Reader r, HTMLEditorKit.ParserCallback cb, boolean ignoreCharSet) throws IOException {
          String name = "html32";
          DTD dtd = createDTD( DTD.getDTD( name ), name );
          parser = new MyDocumentParser(dtd);
          parser.parse(r, cb, ignoreCharSet);
     }

     public int getCurrentLine() {
          return parser.getCurrentLine();
     }

     // Inner class that makes the protected getCurrentLine() public
     public class MyDocumentParser extends DocumentParser {
          public MyDocumentParser(DTD dtd) {
               super(dtd);
          }

          public int getCurrentLine() {
               return super.getCurrentLine();
          }
     }
}
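For completeness, here is a sketch of how a callback might use such a delegator to record the line on which each tag first appears. It repeats the same idea as MyParserDelegator under different, made-up names (TagLineRecorder, LineAwareDelegator, LineAwareParser) so that it compiles on its own. As noted above, the reported line can be off by one, so treat the numbers as approximate.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.DocumentParser;
import javax.swing.text.html.parser.ParserDelegator;

final class TagLineRecorder {
    /** DocumentParser subclass that exposes the protected getCurrentLine(). */
    static final class LineAwareParser extends DocumentParser {
        LineAwareParser(DTD dtd) {
            super(dtd);
        }

        int line() {
            return getCurrentLine();
        }
    }

    /** Delegator that keeps a reference to the parser it creates. */
    static final class LineAwareDelegator extends ParserDelegator {
        private transient LineAwareParser parser;

        @Override
        public void parse(Reader r, HTMLEditorKit.ParserCallback cb, boolean ignoreCharSet)
                throws IOException {
            String name = "html32";
            parser = new LineAwareParser(createDTD(DTD.getDTD(name), name));
            parser.parse(r, cb, ignoreCharSet);
        }

        int currentLine() {
            return parser.line();
        }
    }

    /** Maps each tag name to the line the parser reported when the tag was first seen. */
    static Map<String, Integer> recordLines(String html) {
        Map<String, Integer> firstLineByTag = new LinkedHashMap<>();
        LineAwareDelegator delegator = new LineAwareDelegator();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                firstLineByTag.putIfAbsent(tag.toString(), delegator.currentLine());
            }
        };
        try {
            delegator.parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return firstLineByTag;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\n<p>first</p>\n<div>second</div>\n</body>\n</html>";
        recordLines(html).forEach((tag, line) -> System.out.println(tag + " near line " + line));
    }
}
```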

I just updated my Firefox browser to Firefox 8. I am a college student and practice with HTML and CSS for class assignments. The fonts in all my html documents are being overwritten online by your script typeface. How do I resolve this issue?

I just updated my Firefox browser to Firefox 8. I am a college student and practice with HTML and CSS for class assignments. The fonts in all my html documents are being overwritten online by your script typeface. I did not have this issue in the older version. I use an iMAC running OS10.6.8. How do I resolve this issue?

Starting with this, you have errors in your CSS code.
body {
margin-top: 0px;
margin-right: 0px;
margin-bottom: 0px;
margin-left: 0px;
color: 151515;
font-family: "Gill Sans", "Gill Sans MT", "Myriad Pro", "DejaVu Sans Condensed", Helvetica, Arial, sans-serif;
background-color: EFF5F8;
}
The hex color values need a leading #. Corrected, with the four margins collapsed into one shorthand:
body {
margin:0;
color: #151515;
font-family: "Gill Sans", "Gill Sans MT", "Myriad Pro", "DejaVu Sans Condensed", Helvetica, Arial, sans-serif;
background-color: #EFF5F8;
font-size: 100%;
}
Related links:
Windows Chrome, why do my fonts look so bad? - Lee Green
css3 - Bad font rendering Chrome - Stack Overflow
Nancy O.

Storing and parsing an XML document

Hi everybody,
does anybody know how to store (clob) and to parse (query) an XML document? Can you give me an example and/or some useful information?
Thanx in advance,
Ettore

Hi, first try to determine how much time the XML parsing takes:
you could just remove the insert lines and time it again.
If the time is similar, the XML parsing has a great influence on the total time.
Otherwise the inserts are the bottleneck, and you could try FORALL syntax instead of a FOR loop (FORALL currentNode IN 0 .. xmldom.getLength(nodeList) - 1, followed by the INSERT statement) or
a direct load using 'insert into <table_name> select ...'.
FORALL minimizes context switches, so it could help.
To achieve a direct load you first need to rewrite your procedure as a function that returns a table (a table of objects), so your insert could look like
insert into SWTable
select * from table
(cast(
USP_FILEINFOINSTANCEINSERT(
URLFile => x,
lComputerID => y) as SWTable_t))
where SWTable_t is a table of SWTable_o
and SWTable_o is an object with all of SWTable's attributes

Loading scripts - what's the difference between loading into edge via script window and including a script in the html document?

I have an html page that loads two Edge compositions and an external custom javascript file. The javascript file includes the bootstrapCallback so I can store references to the loaded compositions and communicate with them. This seems to work well. The problem I have is when I also try to load custom plugin javascript files into the Edge compositions via the script window inside Edge. I don't understand how this works. For example, if I load a custom javascript file into one of the compositions, can only that composition use its functionality? Is loading scripts via the Edge script window the same as including them in the html document? I'm confused about how the two relate, please help me understand.

Parse HTML document embedded in IFRAME

Dear fellows:
How can I access contents of an HTML document embedded in an IFRAME tag, by using java class HTMLEditorKit.Parser?
It is well known that the contents of such an embedded HTML document can be accessed by javascript at the front end. However, I am more interested in processing it at the back end, using HTMLEditorKit.Parser or any Java Swing API.
Thanks for help.

The javax.swing.text.html framework barely supports HTML 3.2.
