Parsing HTML from Java, How

Problem:
I need to connect to a URL using Java.
Then I need to parse the HTML at that URL to detect image content,
*so that I can replace each image with something else, such as the text [IMAGE].*
I am already able to connect and read the HTML from the URL, but I need your help with parsing it for image links.
I just need to parse the HTML page to find the image links/content on that page.
How can this be done?

Hi shazzad,
Could you please try this:
/*
 * XsdReader.java
 *
 * Created on September 12, 2008, 11:36 AM
 *
 * To change this template, choose Tools | Template Manager
 * and open the template in the editor.
 */
package EDITool;
import java.io.File;
import java.io.IOException;
import java.util.Hashtable;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
/**
 * @author Rajesh
 */
public class XsdReader {
    /** Creates a new instance of XsdReader */
    private DocumentBuilder docBuilder;
    private Document doc;
    private DocumentBuilderFactory docBuilderFactory;
    private File xsdFile;
    private File xmlFile;
    private Hashtable dataList;
    private String modXML = "";
    public XsdReader() {
        docBuilder = null;
        doc = null;
        docBuilderFactory = DocumentBuilderFactory.newInstance();
        try {
            docBuilder = docBuilderFactory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            System.out.println("Wrong parser configuration: " + e.getMessage());
        }
        this.xsdFile = null;
        this.xmlFile = null;
        dataList = new Hashtable();
    }
    /** Holds the minOccurs/maxOccurs values read for one schema element. */
    public class Attr {
        private String minOccurs;
        private String maxOccurs;

        public Attr(String minOccurs, String maxOccurs) {
            this.minOccurs = minOccurs;
            this.maxOccurs = maxOccurs;
        }
    }
    public void xsdParser(String xsdInputFileName, int level) {
        try {
            this.xsdFile = new File(xsdInputFileName);
        } catch (Exception e) {
            System.out.println("Invalid file name: " + e);
        }
        try {
            // Parse the file into a DOM tree and walk it from the top
            doc = docBuilder.parse(this.xsdFile);
            NodeList nodeList = doc.getChildNodes();
            xsdRecursive(nodeList, level);
        } catch (SAXException e) {
            System.out.println("Wrong XML file structure: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("Could not read file: " + e.getMessage());
        } catch (Exception e) {
            System.out.println("Error: " + e);
        }
    }
    public void xsdRecursive(NodeList nodeList, int level) {
        try {
            for (int i = 0; i < nodeList.getLength(); i++) {
                Node node = nodeList.item(i);
                if (node.getNodeType() == Node.ELEMENT_NODE && level == 1) {
                    // Pass 1: remember minOccurs/maxOccurs for every named element
                    if (node.hasAttributes()) {
                        Element e = (Element) node;
                        String minOccurs = e.getAttribute("minOccurs");
                        if (!minOccurs.equals("")) {
                            String name = e.getAttribute("name");
                            if (name.equals("")) {
                                name = e.getAttribute("ref");
                            }
                            if (name.equals("")) {
                                continue;
                            }
                            String maxOccurs = e.getAttribute("maxOccurs");
                            System.out.println(name);
                            this.dataList.put(name, new Attr(minOccurs, maxOccurs));
                        }
                    }
                } else if (node.getNodeType() == Node.ELEMENT_NODE && level == 2) {
                    // Pass 2: copy the remembered values onto the matching element.
                    // The original cast the org.w3c.dom node to org.dom4j types, which
                    // fails with a ClassCastException; the DOM API is used instead.
                    Attr attr = (Attr) this.dataList.get(node.getNodeName());
                    if (attr != null) {
                        Element e = (Element) node;
                        e.setAttribute("min", attr.minOccurs);
                        e.setAttribute("max", attr.maxOccurs);
                    }
                }
                if (node.hasChildNodes()) {
                    xsdRecursive(node.getChildNodes(), level);
                }
            }
        } catch (Exception e) {
            System.out.println("Error: " + e);
        }
    }
    // Single-node variant; xsdParser() above does not call it.
    public void xsdRecursive(Node node, int level) {
        try {
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) node;
                if (node.hasAttributes()) {
                    String minOccurs = e.getAttribute("minOccurs");
                    if (!minOccurs.equals("")) {
                        String name = e.getAttribute("name");
                        if (name.equals("")) {
                            name = e.getAttribute("ref");
                        }
                        String maxOccurs = e.getAttribute("maxOccurs");
                        System.out.println(name);
                        this.dataList.put(name, new Attr(minOccurs, maxOccurs));
                    }
                } else if (level == 2) {
                    Attr attr = (Attr) this.dataList.get(node.getNodeName());
                    if (attr != null) {
                        // The original cast the node to Document and wrote maxOccurs
                        // into minAttr; setting both attributes on the element
                        // itself appears to be the intent.
                        e.setAttribute("min", attr.minOccurs);
                        e.setAttribute("max", attr.maxOccurs);
                        System.out.println(e.getNodeName() + "\t" + e.getAttribute("min"));
                    }
                }
            }
            // Visit the children; the original loop re-tested the parent on every pass.
            NodeList nodeList = node.getChildNodes();
            for (int i = 0; i < nodeList.getLength(); i++) {
                xsdRecursive(nodeList.item(i), level);
            }
        } catch (Exception ex) {
            System.out.println("Exception in xsdRecursive: " + ex);
        }
    }
    public void display() {
        Attr attr = (Attr) this.dataList.get("ISA01");
        System.out.println(attr.maxOccurs + "\t" + attr.minOccurs);
    }

    public static void main(String[] s) {
        XsdReader xsdReader = new XsdReader();
        //String xsdInputFileName = "C:/Documents and Settings/vs73471/Documents/SAP/workspace/Sample/src/packages/com/sap/java/Copy of 850.xml";
        String xsdInputFileName = "C:/Documents and Settings/vs73471/Desktop/www.html";
        xsdReader.xsdParser(xsdInputFileName, 1);
        System.out.println(xsdReader.dataList.size());
        String xmlInputFileName = "C:/Documents and Settings/vs73471/Desktop/Reference XML Generate/850_Dummy.xml";
        xsdReader.xsdParser(xmlInputFileName, 2);
        // xsdReader.display();
    }
}
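
For the image-link part of the question specifically, a minimal sketch using the HTML parser that ships with the JDK (javax.swing.text.html) might look like the following. The class and method names (ImageLinkFinder, findImageSources, replaceImages) are made up for illustration, and the regex-based replacement is an assumption that only holds for simple markup; it is not a full HTML rewrite.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ImageLinkFinder {

    /** Returns the src value of every <img> tag, in document order. */
    static List<String> findImageSources(String html) {
        final List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <img> has no closing tag, so it is reported as a "simple" tag
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    /** Replaces each <img ...> tag with the literal text [IMAGE] (simple markup only). */
    static String replaceImages(String html) {
        return html.replaceAll("(?i)<img\\b[^>]*>", "[IMAGE]");
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hi</p><img src=\"a.png\"></body></html>";
        System.out.println(findImageSources(html));
        System.out.println(replaceImages(html));
    }
}
```

In a real program the html string would be whatever was already read from the URL connection; a dedicated HTML library may cope better with badly formed pages.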

Similar Messages

Calling html from java with some values

Hello friends,
I can open an HTML file from Java. I want to pass some values from a Java class to that HTML file. Can anybody help me with how to pass parameters?
Runtime.getRuntime().exec("rundll32 url.dll,FileProtocolHandler http://localhost:8080/jcoDemo/dp.html");
is what I am using to open the HTML file.
Thanks

Just add GET query parameters to the URL.
Besides, if you are using Java 1.6 or newer, use Desktop#browse() (http://java.sun.com/javase/6/docs/api/java/awt/Desktop.html#browse(java.net.URI)) rather than Runtime#exec(). It is cross-platform, while your Runtime approach works on Microsoft Windows systems only.
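
A rough sketch of both suggestions together, assuming Java 10 or newer for the URLEncoder overload used here. The class name, the buildUri helper, and the parameter names user and id are hypothetical; only the base URL comes from the question.

```java
import java.awt.Desktop;
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class OpenPageWithParams {

    /** Appends the given values to the base URL as an encoded GET query string. */
    static URI buildUri(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            sb.append(separator)
              .append(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
            separator = '&';
        }
        return URI.create(sb.toString());
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("user", "John Doe"); // hypothetical parameter
        params.put("id", "42");         // hypothetical parameter
        URI uri = buildUri("http://localhost:8080/jcoDemo/dp.html", params);
        System.out.println(uri);

        // Open the default browser only where the platform supports it
        if (Desktop.isDesktopSupported()
                && Desktop.getDesktop().isSupported(Desktop.Action.BROWSE)) {
            Desktop.getDesktop().browse(uri);
        }
    }
}
```

The page itself still has to read the values, for example from the query string in script or on the server side.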

Opening HTML from Java executable?

I have an HTML file that will be located on a CD that spawns off an installation. I need to call this HTML from a java executable. How do I do this?
Thanks,
Scott Roth

You could adopt a more refined solution through the Swing package: have the application display the HTML file in a JEditorPane or JPane component.
Take a look at jdk1.2.2\demo\jfc\SwingSet\HtmlPanel.java
In this case your app may become slow, but if you run it on a powerful machine (128 MB RAM, 16 MB video) then don't worry, it works.

XML parsing regressing from Java 1.4 to 1.5

Hi,
I have a piece of code using Jakarta Digester. The code is pretty simple and was working fine until I switched from Java 1.4 to Java 1.5. I did not change the Digester jar.
After investigation, I noticed that the problem disappears if I remove the DOCTYPE entry from the XML file I'm parsing.
Is there any known regression?
Thanks =P
Stack Trace
java.net.UnknownHostException: D
     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:177)
     at java.net.Socket.connect(Socket.java:507)
     at java.net.Socket.connect(Socket.java:457)
     at sun.net.NetworkClient.doConnect(NetworkClient.java:157)
     at sun.net.NetworkClient.openServer(NetworkClient.java:118)
     at sun.net.ftp.FtpClient.openServer(FtpClient.java:488)
     at sun.net.ftp.FtpClient.openServer(FtpClient.java:475)
     at sun.net.www.protocol.ftp.FtpURLConnection.connect(FtpURLConnection.java:270)
     at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:352)
     at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
     at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:905)
     at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:872)
     at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:282)
     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(XMLDocumentScannerImpl.java:1021)
     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
Error while getting input files description. Process aborted.
     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
     at org.apache.commons.digester.Digester.parse(Digester.java:1556)
     at com.wwdm.datahandle.TransformDescriptorFactory.getTransformer(TransformDescriptorFactory.java:99)
     at mainTest.main(mainTest.java:29)
XML
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE descriptor SYSTEM "file://D:/dtds/WWDM_Input.dtd">
<descriptor>
</description>
------

I made an error in my XML while posting, so just to avoid replies saying the problem comes from there:
XML
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE descriptor SYSTEM "file://D:/dtds/WWDM_Input.dtd">
<descriptor>
</descriptor>
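
The stack trace reports UnknownHostException: D from an FTP connection, which is consistent with the system identifier file://D:/dtds/WWDM_Input.dtd being read as a URL whose host is D. The sketch below uses java.net.URI only to show how the two-slash and three-slash spellings are split; the class and method names are invented for illustration, and whether the three-slash form resolves the Digester problem in this setup is an assumption, not something confirmed in the thread.

```java
import java.net.URI;

class DoctypeUriCheck {

    /** Host component of the identifier, or null when it has none. */
    static String hostOf(String systemId) {
        return URI.create(systemId).getHost();
    }

    /** Path component of the identifier. */
    static String pathOf(String systemId) {
        return URI.create(systemId).getPath();
    }

    public static void main(String[] args) {
        String twoSlashes = "file://D:/dtds/WWDM_Input.dtd";
        String threeSlashes = "file:///D:/dtds/WWDM_Input.dtd";

        // With two slashes the drive letter lands in the host position
        System.out.println(hostOf(twoSlashes) + " | " + pathOf(twoSlashes));

        // With three slashes the host is empty and the drive is part of the path
        System.out.println(hostOf(threeSlashes) + " | " + pathOf(threeSlashes));
    }
}
```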

Parsing HTML in Java

I want to remove the style attribute and replace it with a unique class in HTML, using Java.
Input html:
<div style="A">
<div style="B">
</div>
<div style="C">
</div>
</div>
Output updated html:
<div class="class01">
<div class="class02">
</div>
<div class="class03">
</div>
</div>
Please tell me how I can do this easily in Java!
I am trying to do it using the code available at:
http://www.java2s.com/Tutorial/Java/0120__Development/ParseHTML.htm
If you know any other good way, then please tell me! I don't have R&D time and have to get this done soon.

Have you tried anything yourself with any code yet? I don't believe you have even tried once.
It is not a tough job to find and replace a string in Java, even within HTML. But first you need to write an algorithm for how you will search for each style attribute and replace it, and that depends entirely on the formatting of your HTML, as mentioned above.
Work out the algorithm and try to implement it; if you run into problems during the implementation, come back here.
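
One possible shape for such an algorithm, as a sketch only: a regular expression that finds double-quoted style attributes and swaps each occurrence for a numbered class. The names StyleToClass and replaceStyles are hypothetical, the numbering is per occurrence (identical styles still get different classes), and the regex assumes simple, well-formed markup like the sample in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StyleToClass {

    // Matches a style attribute with a double-quoted value, including the leading space
    private static final Pattern STYLE_ATTR =
            Pattern.compile("\\sstyle\\s*=\\s*\"[^\"]*\"", Pattern.CASE_INSENSITIVE);

    /** Replaces every style="..." attribute with class="classNN", numbering from 01. */
    static String replaceStyles(String html) {
        Matcher matcher = STYLE_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        int counter = 0;
        while (matcher.find()) {
            counter++;
            String replacement = String.format(" class=\"class%02d\"", counter);
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<div style=\"A\">\n<div style=\"B\">\n</div>\n<div style=\"C\">\n</div>\n</div>";
        System.out.println(replaceStyles(input));
    }
}
```

To keep the original rules, the removed style values would also need to be collected and written out as CSS for the new classes; that step is left out here.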

Parsing HTML from Google API results

Hello,
I just downloaded the Google API (http://www.google.com/apis) and I am trying to parse the HTML content which is returned so that it can be displayed in a TextArea or some other GUI component.
Here are my questions:
1. Is there a Java class that can parse HTML and display it correctly?
2. If not, are there any third-party, preferably free, Java components that can do that?
3. Has anyone tried out the Google API? Any interesting applications?
Thank you.
Hanxue

To convert plain text to HTML, you can process the text with simple code like this:
1.
String inputText = getInputText(); // getInputText() stands for however you obtain the text
StringBuffer HTMLOutputText = new StringBuffer();
java.util.StringTokenizer st = new java.util.StringTokenizer(inputText, "\n\r");
while ( st.hasMoreTokens() ) {
    HTMLOutputText.append(st.nextToken());
    HTMLOutputText.append("<br>");
}
// insert the top-level HTML tags
HTMLOutputText.insert(0, "<HTML> <HEAD><TITLE> Some Title</TITLE></HEAD> <BODY>");
HTMLOutputText.append("</BODY> </HTML>");
2. Even simpler, but as far as I know it doesn't display correctly in a JEditorPane:
String inputText = getInputText();
inputText = "<HTML> <HEAD><TITLE> Some Title</TITLE></HEAD> <BODY> <PRE> <TT>"
        + inputText + "</TT></PRE></BODY> </HTML>";

Parse HTML from Multiple Lines of Text

Does anyone know if there's a way to extract the text from a multi-line text enterprise custom field? I'm using the OData feed to read this field into an Excel report, and the text is being returned (from Project Online) with HTML. I'm trying to either retain the formatting in Excel or strip the HTML from the field value.
Anyone have any suggestions?
Thanks,
Roland

Hi Roland,
See this similar thread, which advises creating a macro.
Hope this helps,
Guillaume Rouyre, MBA, MVP, P-Seller

How to get the parsed HTML from a JSF Page

Hi,
I have an application in which the user must fill in some data. Later on, a confirmation page is shown with the user's information. I want to email that confirmation page to the company. The question is: how can I get the resulting HTML page as a string from that JSF confirmation page within the application?
Any help is appreciated,
Tiago Gaspar.

I need the exact same page the user is viewing, i.e. the HTML filled in with the information.

Print HTML from Java

Hello All,
Can anyone please provide example code showing how to print an HTML file from localhost using Java code?
Thanks a lot!

Not sure what you are asking. You've mentioned several different things.
One is printing - do you mean to put something on paper? There should be examples on how to print a file if you do a search.
The other is 'from localhost' - this implies a socket connection to 127.0.0.1. Does your code connect to an HTTP server and issue an HTTP GET for an HTML page?
Again there should be sample code for connecting to a server and reading the response if you do a search.

Why can't I make call to parse HTML from inside Thread?

This is driving me crazy. With a defined HTMLEditorKit.ParserCallback object "callback", I am attempting to parse an HTML document retrieved from a URL by using:
new ParserDelegator().parse(new InputStreamReader(url.openStream( )), callback, true);
It doesn't work if I initiate the call in any way from within the run method of a Thread subclass (the way I'd like to do it). If I make the call in the constructor of the Thread subclass, however, it runs fine. I know it must have something to do with the fact that parse runs in a Thread of its own - but the way to fix it isn't apparent to me.
I would appreciate some words from people who might know what's happening here... THANKS in advance.

Don't bother - figured it out - thanks.

Parse HTML from URL

Hello everyone,
I am working on a project where I need to be able to pull data out of a website. Essentially, I will give the program the URL of the site and then parse the source to extract the information I need. Is the built-in Java HTML parsing a good choice, or are there some third-party libraries you would recommend? Let me know if you need more detail, and thanks for any help.
Nathan

Nath5 wrote:
... the problem I have is that the element names are a standard value with a random number added to the end.
Not if this is HTML you are parsing, it isn't. HTML doesn't have anything like that.
Badly-designed XML might have element names like that, though. So you ought to go back and see if you're actually getting XML. In that case you should use an XML parser instead of an HTML parser. And it may well be possible to write an XPath expression that finds such elements.
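
If the input does turn out to be XML, an XPath expression along those lines can be sketched with the JDK's own javax.xml APIs. The class name, the findByPrefix helper, and the sample prefix price are hypothetical; the prefix is assumed not to contain a single quote, since it is pasted into the expression as-is.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrefixedElementFinder {

    /** Returns "name=text" for every element whose name starts with the given prefix. */
    static List<String> findByPrefix(String xml, String prefix) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // name() is the element name; starts-with() ignores the random suffix
            String expression = "//*[starts-with(name(), '" + prefix + "')]";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeName() + "=" + nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><price123>10</price123><other>x</other><price987>20</price987></root>";
        System.out.println(findByPrefix(xml, "price"));
    }
}
```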

Set Text to a frame.html from java

Hi everybody,
This is my case:
I have a string (in Java code) and I want to show it on a frame.html page.
Do you have any ideas?
Please help!

It sounds like you might want to create a JavaServer Page. JavaServer Pages, or JSPs, allow you to present dynamically created information within a static HTML page. The dynamic content, the Java String, would be enclosed within special tags.
You can learn more about JSP pages by visiting these links:
http://www.apl.jhu.edu/~hall/java/Servlet-Tutorial/
http://pdf.coreservlets.com/
http://java.sun.com/developer/technicalArticles/javaserverpages/JSP20/
http://java.sun.com/developer/technicalArticles/Programming/jsp/

For calling COBOL from Java, how can I add the COBOL bin directory to PATH

Hi all,
When I test my algorithm using JUnit I get this exception. Can anyone tell me what exactly this error means?
Exception in thread "CobolThread 1" java.lang.UnsatisfiedLinkError: no cbljvm_sun in java.library.path
     at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1734)
     at java.lang.Runtime.loadLibrary0(Runtime.java:823)
     at java.lang.System.loadLibrary(System.java:1028)
     at com.microfocus.cobol.RuntimeSystem.<clinit>(Unknown Source)
     at com.splwg.base.support.cobol.host.CobolThread.run(CobolThread.java:30)
Thanks&Regards
sivaram

I overcame the above exception by adding the Micro Focus bin directory to the classpath.
But I am now getting one more error. Please help me resolve this issue.
Exception in thread "CobolThread 1" com.splwg.shared.common.LoggedException: Unable to load class CIPZMEMJ. Return code: 2
- 2011-09-07 12:51:52,111 [CobolThread 1] ERROR (cobol.host.CIPZMEMJ) Unable to load class CIPZMEMJ. Return code: 2
com.splwg.shared.common.LoggedException: Unable to load class CIPZMEMJ. Return code: 2
     at com.splwg.shared.common.LoggedException.raised(LoggedException.java:65)
     at com.splwg.base.support.cobol.host.CIPZMEMJ.checkInitialized(CIPZMEMJ.java:36)
     at com.splwg.base.support.cobol.host.CIPZMEMJ.touchToCobol(CIPZMEMJ.java:46)
     at com.splwg.base.support.cobol.host.CobolThread.run(CobolThread.java:34)
     at com.splwg.shared.common.LoggedException.raised(LoggedException.java:65)
     at com.splwg.base.support.cobol.host.CIPZMEMJ.checkInitialized(CIPZMEMJ.java:36)
     at com.splwg.base.support.cobol.host.CIPZMEMJ.touchToCobol(CIPZMEMJ.java:46)
     at com.splwg.base.support.cobol.host.CobolThread.run(CobolThread.java:34)

How to invoke Matlab from Java

Hi, I want to pass some data generated from Java class to Matlab, then invoke Matlab from Java to run the computation program (file written in Matlab, myfile.m). I know Matlab can use classes generated from Java, how to drive Matlab from Java?
I appreciate your help!
yaya

According to their documentation, you can't. Having said that, again according to their documentation, there are plans to support this in future releases.
m
