How to parse HTML from Java

Problem:
I need to connect to a URL using Java.
Then I need to parse the HTML at that URL for image content,
*so that I can replace each image with something else, such as the text [IMAGE].*
I am already able to connect and read the HTML of the URL, but I need your help with parsing for image links.
I just need to parse the HTML page to detect the image links/content on that page.
How can it be done?

Hi shazzad,
Could you please try this:
/*
 * XsdReader.java
 *
 * Created on September 12, 2008, 11:36 AM
 *
 * To change this template, choose Tools | Template Manager
 * and open the template in the editor.
 */
package EDITool;

import java.io.File;
import java.io.IOException;
import java.util.Hashtable;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
 * @author Rajesh
 */
public class XsdReader {
    private DocumentBuilder docBuilder;
    private Document doc;
    private DocumentBuilderFactory docBuilderFactory;
    private File xsdFile;
    private File xmlFile;
    // Maps an element name to its minOccurs/maxOccurs values.
    private Hashtable<String, Attr> dataList;
    private String modXML = "";

    /** Creates a new instance of XsdReader */
    public XsdReader() {
        docBuilder = null;
        doc = null;
        docBuilderFactory = DocumentBuilderFactory.newInstance();
        try {
            docBuilder = docBuilderFactory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            System.out.println("Wrong parser configuration: " + e.getMessage());
        }
        this.xsdFile = null;
        this.xmlFile = null;
        dataList = new Hashtable<String, Attr>();
    }

    /** Holds the occurrence constraints read from the schema. */
    public class Attr {
        private String minOccurs;
        private String maxOccurs;

        public Attr(String minOccurs, String maxOccurs) {
            this.minOccurs = minOccurs;
            this.maxOccurs = maxOccurs;
        }
    }

    /**
     * Parses the given file. Level 1 collects minOccurs/maxOccurs from a schema;
     * level 2 copies the collected values onto matching elements as min/max.
     */
    public void xsdParser(String xsdInputFileName, int level) {
        // new File(...) does not touch the disk, so no try/catch is needed here.
        this.xsdFile = new File(xsdInputFileName);
        try {
            doc = docBuilder.parse(this.xsdFile);
            NodeList nodeList = doc.getChildNodes();
            xsdRecursive(nodeList, level);
        } catch (SAXException e) {
            System.out.println("Wrong XML file structure: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("Could not read file: " + e.getMessage());
        } catch (Exception e) {
            System.out.println("Error: " + e);
        }
    }

    public void xsdRecursive(NodeList nodeList, int level) {
        try {
            for (int i = 0; i < nodeList.getLength(); i++) {
                Node node = nodeList.item(i);
                if (node.getNodeType() == Node.ELEMENT_NODE && level == 1) {
                    if (node.hasAttributes()) {
                        Element e = (Element) node;
                        String minOccursValue = e.getAttribute("minOccurs");
                        if (!minOccursValue.equals("")) {
                            String name = e.getAttribute("name");
                            if (name.equals("")) {
                                name = e.getAttribute("ref");
                            }
                            if (name.equals("")) {
                                continue;
                            }
                            String minOccurs = e.getAttribute("minOccurs");
                            String maxOccurs = e.getAttribute("maxOccurs");
                            System.out.println(name);
                            this.dataList.put(name, new Attr(minOccurs, maxOccurs));
                        }
                    }
                } else if (node.getNodeType() == Node.ELEMENT_NODE && level == 2) {
                    String currentTagName = node.getNodeName();
                    Attr attr = this.dataList.get(currentTagName);
                    // The original cast the W3C node to org.dom4j.Element, which fails
                    // at runtime; the W3C Element API can set the attributes directly.
                    if (attr != null) {
                        Element e = (Element) node;
                        e.setAttribute("min", attr.minOccurs);
                        e.setAttribute("max", attr.maxOccurs);
                    }
                }
                if (node.hasChildNodes()) {
                    xsdRecursive(node.getChildNodes(), level);
                }
            }
        } catch (Exception e) {
            System.out.println("Error: " + e);
        }
    }

    /** Single-node variant of the method above; it is not called from main. */
    public void xsdRecursive(Node node, int level) {
        try {
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) node;
                // if(level == 1)
                if (node.hasAttributes()) {
                    String minOccursValue = e.getAttribute("minOccurs");
                    if (!minOccursValue.equals("")) {
                        String name = e.getAttribute("name");
                        if (name.equals("")) {
                            name = e.getAttribute("ref");
                        }
                        String minOccurs = e.getAttribute("minOccurs");
                        String maxOccurs = e.getAttribute("maxOccurs");
                        System.out.println(name);
                        this.dataList.put(name, new Attr(minOccurs, maxOccurs));
                    }
                } else if (level == 2) {
                    Attr attr = this.dataList.get(node.getNodeName());
                    if (attr != null) {
                        // The original cast the node to Document and set the max
                        // value on the min attribute; set both on the element instead.
                        e.setAttribute("min", attr.minOccurs);
                        e.setAttribute("max", attr.maxOccurs);
                        System.out.println(e.getNodeName() + "\t" + e.getAttribute("min"));
                    }
                }
            }
            // Visit the children; the original loop never used its index.
            NodeList nodeList = node.getChildNodes();
            for (int i = 0; i < nodeList.getLength(); i++) {
                xsdRecursive(nodeList.item(i), level);
            }
        } catch (Exception e) {
            System.out.println("Exception in xsdRecursive: " + e);
        }
    }

    public void display() {
        Attr attr = this.dataList.get("ISA01");
        System.out.println(attr.maxOccurs + "\t" + attr.minOccurs);
    }

    public static void main(String[] s) {
        XsdReader xsdReader = new XsdReader();
        //String xsdInputFileName = "C:/Documents and Settings/vs73471/Documents/SAP/workspace/Sample/src/packages/com/sap/java/Copy of 850.xml";
        String xsdInputFileName = "C:/Documents and Settings/vs73471/Desktop/www.html";
        xsdReader.xsdParser(xsdInputFileName, 1);
        System.out.println(xsdReader.dataList.size());
        String xmlInputFileName = "C:/Documents and Settings/vs73471/Desktop/Reference XML Generate/850_Dummy.xml";
        xsdReader.xsdParser(xmlInputFileName, 2);
        //  xsdReader.display();
    }
}
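
For the task in the original question (finding image tags and putting the text [IMAGE] in their place), a regular expression may be enough for simple pages. The following is only a minimal sketch, not a full HTML parser: the class and method names are made up for illustration, and it assumes that no attribute value inside an <img> tag contains a '>' character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the thread.
final class ImageTagReplacer {
    // Matches a whole <img ...> tag, in any letter case.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches src="..." or src='...' inside a tag.
    private static final Pattern SRC_ATTR =
            Pattern.compile("\\bsrc\\s*=\\s*(\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);

    /** Returns the HTML with every <img> tag replaced by the text [IMAGE]. */
    static String replaceImages(String html) {
        return IMG_TAG.matcher(html).replaceAll("[IMAGE]");
    }

    /** Returns the src value of every <img> tag that has a quoted src attribute. */
    static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher tag = IMG_TAG.matcher(html);
        while (tag.find()) {
            Matcher src = SRC_ATTR.matcher(tag.group());
            if (src.find()) {
                // Group 2 holds a double-quoted value, group 3 a single-quoted one.
                sources.add(src.group(2) != null ? src.group(2) : src.group(3));
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p>Logo: <img src=\"logo.png\" alt=\"x\"> and <IMG SRC='a/b.gif'/></p>";
        System.out.println(replaceImages(html));
        System.out.println(imageSources(html));
    }
}
```

For real-world pages with malformed markup, a proper HTML parser is likely to be more robust than a regular expression.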

Similar Messages

  • Calling html from java with some values

    Hello friends,
    I can call an HTML file from Java. I want to pass some values from a Java class file to that HTML file. Can anybody help me with how to pass parameters?
    Runtime.getRuntime().exec("rundll32 url.dll,FileProtocolHandler http://localhost:8080/jcoDemo/dp.html");
    is what I am using to call the HTML file.
    Thanks

    Just add GET query parameters to the URL.
    Besides, if you are using Java 1.6 or newer, use Desktop#browse() (http://java.sun.com/javase/6/docs/api/java/awt/Desktop.html#browse(java.net.URI)) instead of Runtime#exec(). It is cross-platform, while your Runtime approach works on Microsoft Windows systems only.
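
    The advice above could be sketched roughly as follows. The parameter names (user, note) and the helper class are hypothetical; only the URL comes from the question. The page itself would still need some way to read the query string, for example with script in the page.

```java
import java.awt.Desktop;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; names are illustrative only.
final class OpenPageWithParams {
    /** Appends URL-encoded GET parameters to the given base URL. */
    static URI withQuery(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = baseUrl.contains("?") ? '&' : '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator)
               .append(encode(entry.getKey()))
               .append('=')
               .append(encode(entry.getValue()));
            separator = '&';
        }
        return URI.create(url.toString());
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("user", "scott");        // hypothetical parameter
        params.put("note", "hello world");  // hypothetical parameter
        URI uri = withQuery("http://localhost:8080/jcoDemo/dp.html", params);
        if (Desktop.isDesktopSupported()
                && Desktop.getDesktop().isSupported(Desktop.Action.BROWSE)) {
            Desktop.getDesktop().browse(uri);  // opens the default browser
        } else {
            System.out.println("Cannot open a browser here; URL is " + uri);
        }
    }
}
```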

  • Opening HTML from Java executable?

    I have an HTML file that will be located on a CD that spawns off an installation. I need to call this HTML from a java executable. How do I do this?
    Thanks,
    Scott Roth

    You may adopt a more refined solution through the Swing package: have the application show the HTML file in a JEditorPane (or JTextPane) component.
    Take a look at jdk1.2.2\demo\jfc\SwingSet\HtmlPanel.java
    In this case your app may become slow, but if you run it on a powerful machine (128 MB RAM, 16 MB video) then don't worry, it works.

  • XML parsing regressing from Java 1.4 to 1.5

    Hi,
    I have a piece of code using Jakarta Digester. This piece of code is pretty simple and was working fine until I switched from Java 1.4 to Java 1.5. However, I didn't change the Digester jar.
    After investigation, I noticed that the problem disappears if I remove the DOCTYPE entry from the XML file I'm parsing.
    Is there any known regression?
    Thanks =P
    Stack Trace
    java.net.UnknownHostException: D
         at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:177)
         at java.net.Socket.connect(Socket.java:507)
         at java.net.Socket.connect(Socket.java:457)
         at sun.net.NetworkClient.doConnect(NetworkClient.java:157)
         at sun.net.NetworkClient.openServer(NetworkClient.java:118)
         at sun.net.ftp.FtpClient.openServer(FtpClient.java:488)
         at sun.net.ftp.FtpClient.openServer(FtpClient.java:475)
         at sun.net.www.protocol.ftp.FtpURLConnection.connect(FtpURLConnection.java:270)
         at sun.net.www.protocol.ftp.FtpURLConnection.getInputStream(FtpURLConnection.java:352)
         at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
         at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:905)
         at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:872)
         at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:282)
         at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(XMLDocumentScannerImpl.java:1021)
         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
         at org.apache.commons.digester.Digester.parse(Digester.java:1556)
         at com.wwdm.datahandle.TransformDescriptorFactory.getTransformer(TransformDescriptorFactory.java:99)
         at mainTest.main(mainTest.java:29)
    Error while getting input files description. Process aborted.
    XML
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE descriptor SYSTEM "file://D:/dtds/WWDM_Input.dtd">
    <descriptor>
    </description>
    ------

    I made an error in my XML while posting, so just to avoid replies saying that the problem comes from there:
    XML
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE descriptor SYSTEM "file://D:/dtds/WWDM_Input.dtd">
    <descriptor>
    </descriptor>
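
    The thread has no answer, but the trace (UnknownHostException: D, reached through FtpURLConnection) suggests that the parser reads "D" in file://D:/dtds/... as a host name; writing the system identifier as file:///D:/dtds/WWDM_Input.dtd may avoid that. If the DTD is not needed at all, another option is to stop the parser from fetching it. The sketch below shows that second idea with the plain JAXP DocumentBuilder rather than Digester; the class name is hypothetical, and whether skipping the DTD is acceptable depends on whether the document relies on entities or defaults declared in it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; illustrates an EntityResolver that skips external DTDs.
final class SkipExternalDtd {
    /** Parses the XML without fetching any external entity and returns the root element name. */
    static String rootElementName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Answer every external entity request (including the DOCTYPE's DTD)
            // with an empty document, so no file or network access is attempted.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
                + "<!DOCTYPE descriptor SYSTEM \"file://D:/dtds/WWDM_Input.dtd\">\n"
                + "<descriptor>\n</descriptor>";
        System.out.println(rootElementName(xml));
    }
}
```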

  • Parsing HTML in Java

    I want to remove the style attribute from each element and replace it with a unique class, using Java.
    Input html:
    <div style="A">
    <div style="B">
    </div>
    <div style="C">
    </div>
    </div>
    Output updated html:
    <div class="class01">
    <div class="class02">
    </div>
    <div class="class03">
    </div>
    </div>
    Please tell me how I can do this easily in Java.
    I am trying to do it using the code available at:
    http://www.java2s.com/Tutorial/Java/0120__Development/ParseHTML.htm
    If you know any other good way, please tell me. I don't have time for research and have to finish this soon.

    Have you tried any code yourself yet? I don't believe you have tried even once.
    Finding and replacing a string in Java is not a tough job, even within HTML. But first you need to work out an algorithm for how you will search for each style attribute and replace it, and that depends entirely on the formatting of your HTML as shown above.
    Write the algorithm and try to implement it; if you run into problems during the implementation, come back here.
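
    One possible find-and-replace algorithm of the kind described above, as a minimal sketch: scan for double-quoted style attributes and replace each with a numbered class. The class name and the numbering scheme are assumptions based on the sample output in the question. It gives every occurrence a new class (identical styles are not merged) and would also match the text style="..." outside a tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class StyleToClass {
    // Matches style="..." with a double-quoted value.
    private static final Pattern STYLE_ATTR =
            Pattern.compile("\\bstyle\\s*=\\s*\"[^\"]*\"", Pattern.CASE_INSENSITIVE);

    /** Replaces each style attribute with class="classNN", numbered in document order. */
    static String replaceStyles(String html) {
        Matcher matcher = STYLE_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        int counter = 0;
        while (matcher.find()) {
            counter++;
            String replacement = String.format("class=\"class%02d\"", counter);
            // quoteReplacement guards against '$' or '\' being treated specially.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div style=\"A\">\n<div style=\"B\">\n</div>\n"
                + "<div style=\"C\">\n</div>\n</div>";
        System.out.println(replaceStyles(html));
    }
}
```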

  • Parsing HTML from Google API results

    Hello,
    I just downloaded the Google API (http://www.google.com/apis) and I am trying to parse the HTML content which is returned so that it can be displayed in a TextArea or some other GUI component.
    Here are my questions:
    1. Is there a Java class that can parse HTML and display it correctly?
    2. If not, are there any third-party, preferably free, Java components that can do that?
    3. Has anyone tried out the Google API? Any interesting applications?
    Thank you.
    Hanxue

    To convert plain text to HTML, you can process the text with simple code like this:
    1.
    String inputText = getInputText(); // placeholder for however you obtain the text
    StringBuffer HTMLOutputText = new StringBuffer();
    java.util.StringTokenizer st = new java.util.StringTokenizer(inputText, "\n\r");
    // append each line followed by a line break tag
    while ( st.hasMoreTokens() ) {
        HTMLOutputText.append(st.nextToken());
        HTMLOutputText.append("<br>");
    }
    /// insert the top level HTML tags
    HTMLOutputText.insert(0, "<HTML> <HEAD><TITLE> Some Title</TITLE></HEAD> <BODY>");
    HTMLOutputText.append("</BODY> </HTML>");
    2. Even simpler, but as far as I know it doesn't display right in a JEditorPane:
    String inputText = getInputText();
    inputText = "<HTML> <HEAD><TITLE> Some Title</TITLE></HEAD> <BODY> <PRE> <TT>"
        + inputText + "</TT></PRE></BODY> </HTML>";

  • Parse HTML from Multiple Lines of Text

    Does anyone know if there's a way to extract the text from a Multi-line text enterprise custom field? I'm using the OData feed to read this field into an Excel report, and the text is being returned (from Project Online) with HTML. I'm trying to either retain the formatting in Excel or strip the HTML from the field value.
    Anyone have any suggestions?
    Thanks,
    Roland

    Hi Roland,
    See this similar thread, which advises creating a macro.
    Hope this helps,
    Guillaume Rouyre, MBA, MVP, P-Seller

  • How to get the parsed HTML from a JSF Page

    Hi,
    I have an application in which the user must fill in some data. Later on, a confirmation page is shown with the user's information. I want to email that confirmation page to the company. The question is: how can I get the resulting HTML page as a string from that JSF confirmation page within the application?
    Any help is appreciated,
    Tiago Gaspar.

    I need the exact same page the user is viewing, i.e. the HTML filled in with the information.

  • Print HTML from Java

    Hello All,
    can anyone please provide example code of how to print an html-file from localhost by Java-code ?
    Thanx a lot !!

    Not sure what you are asking. You've mentioned several different things.
    One is printing - do you mean to put something on paper? There should be examples on how to print a file if you do a search.
    The other is 'from localhost' - this implies a socket to connect to 127.0.0.1 - Does your code connect to an HTTP server, issue a HTTP GET for an HTML page?
    Again there should be sample code for connecting to a server and reading the response if you do a search.

  • Why can't I make call to parse HTML from inside Thread?

    This is driving me crazy. With a defined HTMLEditorKit.ParserCallback object "callback", I am attempting to parse an HTML document retrieved from a URL by using:
    new ParserDelegator().parse(new InputStreamReader(url.openStream( )), callback, true);
    It doesn't work if I initiate the call in any way from within the run method of a Thread subclass (the way I'd like to do it). If I make the call in the constructor of the Thread subclass, however, it runs fine. I know it must have something to do with the fact that parse runs in a Thread of it's own - but the way to fix it isn't apparent to me.
    I would appreciate some words from people who might know what's happening here... THANKS in advance.

    Don't bother - figured it out - thanks.

  • Parse HTML from URL

    Hello everyone,
    I am working on a project where I need to be able to pull data out of a website. So essentially I will give the program the URL of the site and then parse the source to extract the information I need. Is the built-in Java HTML parsing a good choice, or are there some third-party libraries you would recommend? Let me know if you need more detail, and thanks for any help.
    Nathan

    Nath5 wrote:
    ... the problem I have is that the element names are a standard value with a random number added to the end.
    Not if this is HTML you are parsing, it isn't. HTML doesn't have anything like that.
    Badly-designed XML might have element names like that, though. So you ought to go back and see if you're actually getting XML. In that case you should use an XML parser instead of an HTML parser. And it may well be possible to write an XPath expression that finds such elements.
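
    As a rough illustration of the XPath suggestion, the sketch below finds every element whose name starts with a fixed prefix, using the standard javax.xml.xpath API. The element names (price123 and so on) and the class name are invented for the example, and the prefix is assumed not to contain a quote character.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; names are illustrative only.
final class PrefixedElementFinder {
    /** Returns the names of all elements whose name starts with the given prefix, in document order. */
    static List<String> elementNamesStartingWith(String xml, String prefix) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // starts-with() and name() are standard XPath 1.0 functions.
            String expression = "//*[starts-with(name(), '" + prefix + "')]";
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                names.add(hits.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><price123>9</price123><price987>5</price987><name1>x</name1></root>";
        System.out.println(elementNamesStartingWith(xml, "price"));
    }
}
```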

  • Set Text to a frame.html from java

    Hi everybody,
    This is my case:
    I have a string (in Java code) and I want to show it on a frame.html page.
    Do you have any ideas?
    Please help!

    It sounds like you might want to create a JavaServer Page. JavaServer Pages, or JSPs, allow you to present dynamically created information within an otherwise static HTML page. The dynamic content, here the Java String, would be enclosed within special tags.
    You can learn more about JSP pages by visiting these links:
    http://www.apl.jhu.edu/~hall/java/Servlet-Tutorial/
    http://pdf.coreservlets.com/
    http://java.sun.com/developer/technicalArticles/javaserverpages/JSP20/
    http://java.sun.com/developer/technicalArticles/Programming/jsp/

  • For calling COBOL from Java, how can I add the COBOL bin directory to PATH?

    Hi all,
    When I test my algorithm using JUnit I get this exception. Can anyone tell me what exactly this error means?
    Exception in thread "CobolThread 1" java.lang.UnsatisfiedLinkError: no cbljvm_sun in java.library.path
         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1734)
         at java.lang.Runtime.loadLibrary0(Runtime.java:823)
         at java.lang.System.loadLibrary(System.java:1028)
         at com.microfocus.cobol.RuntimeSystem.<clinit>(Unknown Source)
         at com.splwg.base.support.cobol.host.CobolThread.run(CobolThread.java:30)
    Thanks&Regards
    sivaram

    I overcame the above exception by setting the Micro Focus bin directory in the classpath.
    But I am getting one more error. Please help me resolve this issue.
    Exception in thread "CobolThread 1" com.splwg.shared.common.LoggedException: Unable to load class CIPZMEMJ. Return code: 2
    - 2011-09-07 12:51:52,111 [CobolThread 1] ERROR (cobol.host.CIPZMEMJ) Unable to load class CIPZMEMJ. Return code: 2
    com.splwg.shared.common.LoggedException: Unable to load class CIPZMEMJ. Return code: 2
         at com.splwg.shared.common.LoggedException.raised(LoggedException.java:65)
         at com.splwg.base.support.cobol.host.CIPZMEMJ.checkInitialized(CIPZMEMJ.java:36)
         at com.splwg.base.support.cobol.host.CIPZMEMJ.touchToCobol(CIPZMEMJ.java:46)
         at com.splwg.base.support.cobol.host.CobolThread.run(CobolThread.java:34)
         at com.splwg.shared.common.LoggedException.raised(LoggedException.java:65)
         at com.splwg.base.support.cobol.host.CIPZMEMJ.checkInitialized(CIPZMEMJ.java:36)
         at com.splwg.base.support.cobol.host.CIPZMEMJ.touchToCobol(CIPZMEMJ.java:46)
         at com.splwg.base.support.cobol.host.CobolThread.run(CobolThread.java:34)

  • How to invoke Matlab from Java

    Hi, I want to pass some data generated from Java class to Matlab, then invoke Matlab from Java to run the computation program (file written in Matlab, myfile.m). I know Matlab can use classes generated from Java, how to drive Matlab from Java?
    I appreciate your help!
    yaya

    According to their documentation, you can't. Having said that, again according to their documentation, there are plans to support this in future releases.
    m
