Parse HTML from URL

Hello everyone,
I am working on a project where I need to pull data out of a website. Essentially, I will give the program the URL of the site and then parse the source to extract the information I need. Is the built-in Java HTML parsing a good choice, or are there third-party libraries you would recommend? Let me know if you need more detail, and thanks for any help.
Nathan

Nath5 wrote:
... the problem I have is that the element names are a standard value with a random number added to the end.
Not if this is HTML you are parsing, it isn't. HTML doesn't have anything like that.
Badly-designed XML might have element names like that, though. So you ought to go back and see if you're actually getting XML. In that case you should use an XML parser instead of an HTML parser. And it may well be possible to write an XPath expression that finds such elements.
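
If the data does turn out to be XML, the XPath idea could be sketched roughly like this. The element names (price123, price877), the prefix "price" and the class name SuffixedElements are made up for illustration; only the javax.xml.xpath calls come from the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SuffixedElements {

    /**
     * Returns the text of every element whose name starts with the given prefix.
     * Assumption: the prefix contains no single quote, since it is pasted
     * straight into the XPath expression.
     */
    static List<String> textOf(String xml, String prefix) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expression = "//*[starts-with(name(), '" + prefix + "')]";
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Hypothetical input: a fixed element name followed by a random number.
        String xml = "<data><price123>9.99</price123><name>Widget</name>"
                + "<price877>4.50</price877></data>";
        System.out.println(textOf(xml, "price")); // prints [9.99, 4.50]
    }
}
```

Matching on the name prefix keeps the expression independent of whatever number gets appended.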

Similar Messages

Why can't I make call to parse HTML from inside Thread?

This is driving me crazy. With a defined HTMLEditorKit.ParserCallback object "callback", I am attempting to parse an HTML document retrieved from a URL by using:
new ParserDelegator().parse(new InputStreamReader(url.openStream( )), callback, true);
It doesn't work if I initiate the call in any way from within the run method of a Thread subclass (the way I'd like to do it). If I make the call in the constructor of the Thread subclass, however, it runs fine. I know it must have something to do with the fact that parse runs in a Thread of its own, but the way to fix it isn't apparent to me.
I would appreciate some words from people who might know what's happening here... THANKS in advance.

Don't bother - figured it out - thanks.
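
The thread never says what the fix turned out to be. For anyone landing here, this is a minimal sketch of calling parse from inside a worker thread's run method. It is not the poster's solution: the names ThreadedParse and parseOnWorker are invented, and a StringReader stands in for url.openStream() so the example runs without a network.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ThreadedParse {

    /** Parses the HTML on a worker thread and returns the text chunks it saw, one per line. */
    static String parseOnWorker(String html) throws InterruptedException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };
        Thread worker = new Thread(() -> {
            // In the original post this reader would wrap url.openStream().
            try (Reader reader = new StringReader(html)) {
                new ParserDelegator().parse(reader, callback, true);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        worker.start();
        // Wait for the worker; after join() its writes to 'text' are visible here.
        worker.join();
        return text.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.print(parseOnWorker("<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

One thing worth checking in a case like this is whether the calling code waits for the worker before it looks at the results; the join() call above does that.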

Parsing HTML from Java, How

Problem:
I need to connect to a URL using Java.
Then I need to parse the HTML at that URL for image content,
*so that I can replace each image with something else, such as the text [IMAGE].*
I am already able to connect and read the HTML from the URL, but I need your help with parsing for image links.
I just need to parse the HTML page to detect the image links/content on that page.
How can this be done?

hi shazzad,
Could you please try with this,
/*
 * XsdReader.java
 *
 * Created on September 12, 2008, 11:36 AM
 *
 * To change this template, choose Tools | Template Manager
 * and open the template in the editor.
 */
package EDITool;

import java.io.File;
import java.io.IOException;
import java.util.Hashtable;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
 * @author Rajesh
 */
public class XsdReader {
    private DocumentBuilder docBuilder;
    private Document doc;
    private DocumentBuilderFactory docBuilderFactory;
    private File xsdFile;
    private File xmlFile;
    private Hashtable<String, Attr> dataList;
    private String modXML = "";

    /** Creates a new instance of XsdReader */
    public XsdReader() {
        docBuilder = null;
        doc = null;
        docBuilderFactory = DocumentBuilderFactory.newInstance();
        try {
            docBuilder = docBuilderFactory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            System.out.println("Wrong parser configuration: " + e.getMessage());
        }
        this.xsdFile = null;
        this.xmlFile = null;
        dataList = new Hashtable<String, Attr>();
    }

    /** Holds the minOccurs/maxOccurs values recorded for one element name. */
    public class Attr {
        private String minOccurs;
        private String maxOccurs;

        public Attr(String minOccurs, String maxOccurs) {
            this.minOccurs = minOccurs;
            this.maxOccurs = maxOccurs;
        }
    }

    /** Parses the given file; level 1 collects occurrence data, level 2 applies it. */
    public void xsdParser(String xsdInputFileName, int level) {
        try {
            this.xsdFile = new File(xsdInputFileName);
        } catch (Exception e) {
            System.out.println("File error: " + e);
        }
        try {
            doc = docBuilder.parse(this.xsdFile);
            NodeList nodeList = doc.getChildNodes();
            xsdRecursive(nodeList, level);
        } catch (SAXException e) {
            System.out.println("Wrong XML file structure: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("Could not read file: " + e.getMessage());
        } catch (Exception e) {
            System.out.println("Error: " + e);
        }
    }

    public void xsdRecursive(NodeList nodeList, int level) {
        try {
            for (int i = 0; i < nodeList.getLength(); i++) {
                Node node = nodeList.item(i);
                if (node.getNodeType() == Node.ELEMENT_NODE && level == 1) {
                    // Level 1: remember minOccurs/maxOccurs for each named element.
                    if (node.hasAttributes()) {
                        Element e = (Element) node;
                        String minOccursValue = e.getAttribute("minOccurs");
                        if (!minOccursValue.equals("")) {
                            String name = e.getAttribute("name");
                            if (name.equals(""))
                                name = e.getAttribute("ref");
                            if (name.equals(""))
                                continue;
                            String minOccurs = e.getAttribute("minOccurs");
                            String maxOccurs = e.getAttribute("maxOccurs");
                            System.out.println(name);
                            this.dataList.put(name, new Attr(minOccurs, maxOccurs));
                        }
                    }
                } else if (node.getNodeType() == Node.ELEMENT_NODE && level == 2) {
                    // Level 2: copy the recorded values onto matching elements.
                    String currentTagName = node.getNodeName();
                    Attr attr = this.dataList.get(currentTagName);
                    NamedNodeMap nodeAttr = node.getAttributes();
                    // The posted code cast the node to org.dom4j.Element here, which
                    // fails at run time with a ClassCastException for a w3c DOM node.
                    // Element.setAttribute from org.w3c.dom does the same job.
                    if (attr != null) {
                        Element e = (Element) node;
                        e.setAttribute("min", attr.minOccurs);
                        e.setAttribute("max", attr.maxOccurs);
                    }
                }
                if (node.hasChildNodes())
                    xsdRecursive(node.getChildNodes(), level);
            }
        } catch (Exception e) {
            System.out.println("Error: " + e);
        }
    }

    // Unused alternative version. NOTE: as posted, the loop looks at 'node' on
    // every pass and never at nodeList.item(i); that is left unchanged here.
    public void xsdRecursive(Node node, int level) {
        NodeList nodeList = node.getChildNodes();
        try {
            for (int i = 0; i < nodeList.getLength(); i++) {
                if (node.getNodeType() == Node.ELEMENT_NODE) {
                    // if(level == 1)
                    if (node.hasAttributes()) {
                        Element e = (Element) node;
                        String minOccursValue = e.getAttribute("minOccurs");
                        if (!minOccursValue.equals("")) {
                            String name = e.getAttribute("name");
                            if (name.equals(""))
                                name = e.getAttribute("ref");
                            String minOccurs = e.getAttribute("minOccurs");
                            String maxOccurs = e.getAttribute("maxOccurs");
                            System.out.println(name);
                            this.dataList.put(name, new Attr(minOccurs, maxOccurs));
                        }
                    } else if (level == 2) {
                        System.out.println("rajesh");
                        String currentTagName = node.getNodeName();
                        Attr attr = this.dataList.get(currentTagName);
                        NamedNodeMap nodeAttr = node.getAttributes();
                        // Attributes are created by the owning Document, not by casting
                        // the element to Document as the posted code did.
                        Element e = (Element) node;
                        org.w3c.dom.Attr minAttr = node.getOwnerDocument().createAttribute("min");
                        minAttr.setValue(attr.minOccurs);
                        e.setAttributeNode(minAttr);
                        org.w3c.dom.Attr maxAttr = node.getOwnerDocument().createAttribute("max");
                        maxAttr.setValue(attr.maxOccurs); // the post set minAttr twice
                        e.setAttributeNode(maxAttr);
                        System.out.println(e.getNodeName() + "\t" + e.getAttribute("min"));
                    }
                }
            }
        } catch (Exception e) {
            System.out.println("Exception in xsdRecursive: " + e);
        }
    }

    public void display() {
        Attr attr = this.dataList.get("ISA01");
        System.out.println(attr.maxOccurs + "\t" + attr.minOccurs);
    }

    public static void main(String s[]) {
        XsdReader xsdReader = new XsdReader();
        //String xsdInputFileName = "C:/Documents and Settings/vs73471/Documents/SAP/workspace/Sample/src/packages/com/sap/java/Copy of 850.xml";
        String xsdInputFileName = "C:/Documents and Settings/vs73471/Desktop/www.html";
        xsdReader.xsdParser(xsdInputFileName, 1);
        System.out.println(xsdReader.dataList.size());
        String xmlInputFileName = "C:/Documents and Settings/vs73471/Desktop/Reference XML Generate/850_Dummy.xml";
        xsdReader.xsdParser(xmlInputFileName, 2);
        // xsdReader.display();
    }
}
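
The class above uses an XML DocumentBuilder, which will generally reject HTML that is not well-formed XML. For the original question (finding images and putting [IMAGE] in their place), a sketch using the JDK's own HTML parser might look like this. ImageFinder and scan are invented names, and the sample page is made up; the parser classes are the standard javax.swing.text.html ones.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ImageFinder extends HTMLEditorKit.ParserCallback {
    /** The src value of every img tag, in document order. */
    final List<String> imageSources = new ArrayList<>();
    /** The page text with [IMAGE] written wherever an img tag was seen. */
    final StringBuilder text = new StringBuilder();

    // img has no end tag, so the parser reports it as a "simple" tag.
    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.IMG) {
            Object src = attrs.getAttribute(HTML.Attribute.SRC);
            if (src != null) {
                imageSources.add(src.toString());
            }
            text.append("[IMAGE]");
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        text.append(data);
    }

    /** Runs the parser over the given HTML and returns the filled-in callback. */
    static ImageFinder scan(String html) throws IOException {
        ImageFinder finder = new ImageFinder();
        new ParserDelegator().parse(new StringReader(html), finder, true);
        return finder;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical page; in practice the reader would wrap the URL's input stream.
        ImageFinder finder = scan("<html><body><p>Logo: <img src=\"logo.png\"> here</p>"
                + "<img src=\"photos/cat.jpg\"></body></html>");
        System.out.println(finder.imageSources);
        System.out.println(finder.text);
    }
}
```

This only rebuilds the text content. If the surrounding markup has to survive as well, the same callback would also need to echo the start and end tags it receives.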

Parsing HTML from Google API results

Hello,
I just downloaded the Google API (http://www.google.com/apis) and I am trying to parse the HTML content which is returned so that it can be displayed in a TextArea or some other GUI component.
Here are my questions:
1. Is there a Java class that can parse HTML and display it correctly?
2. If not, are there any third-party, preferably free, Java components that can do that?
3. Has anyone tried out the Google API? Any interesting applications?
Thank you.
Hanxue

To convert plain text to HTML, you can parse the text with simple code like this:
1.
String inputText = getInputText(); // getInputText() is wherever your text comes from
StringBuffer HTMLOutputText = new StringBuffer();
java.util.StringTokenizer st = new java.util.StringTokenizer(inputText, "\n\r");
while ( st.hasMoreTokens() ) {
    HTMLOutputText.append(st.nextToken());
    HTMLOutputText.append("<br>");
}
// insert the top level HTML tags
HTMLOutputText.insert(0, "<HTML> <HEAD><TITLE> Some Title</TITLE></HEAD> <BODY>");
HTMLOutputText.append("</BODY> </HTML>");
2. Even simpler, but as far as I know it doesn't display right in a JEditorPane:
String inputText = getInputText();
inputText = "<HTML> <HEAD><TITLE> Some Title</TITLE></HEAD> <BODY> <PRE> <TT>"
    + inputText + "</TT></PRE></BODY> </HTML>";
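
One gap in both variants: characters such as < and & in the input are copied through unchanged, so they would be read as markup. A small sketch that escapes them first is given here. TextToHtml, escape and toHtml are names invented for the example, and unlike the StringTokenizer version it keeps empty lines.

```java
public class TextToHtml {

    /** Replaces the characters that have a special meaning in HTML. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Wraps plain text in a minimal HTML page, one <br> per line break. */
    static String toHtml(String title, String text) {
        StringBuilder html = new StringBuilder("<html><head><title>")
                .append(escape(title))
                .append("</title></head><body>");
        // Split on Windows, old Mac and Unix line endings; -1 keeps trailing empty lines.
        String[] lines = text.split("\r\n|\r|\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                html.append("<br>");
            }
            html.append(escape(lines[i]));
        }
        return html.append("</body></html>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml("Some Title", "a < b\nc & d"));
    }
}
```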

Parse HTML from Multiple Lines of Text

Does anyone know if there's a way to extract the text from a Multi-line text enterprise custom field? I'm using the OData feed to read this field into an Excel report, and the text is being returned (from Project Online) with HTML. I'm trying to either retain the formatting in Excel or strip the HTML from the field value.
Anyone have any suggestions?
Thanks,
Roland

Hi Roland,
See this similar thread, which advises creating a macro.
Hope this helps,
Guillaume Rouyre, MBA, MVP, P-Seller

How to get rid of .responsive.html from URL?

When I add responsive in the device group, the page URL gets .responsive.html appended, which I do not want. Is there a way to get rid of .responsive from the URL and still keep my page responsive?
Thanks in advance.

Anyone?
Thanks in advance.

How to get the parsed HTML from a JSF Page

Hi,
I have an application in which the user must fill in some data. Later on, a confirmation page is shown with the user's information. I want to email that confirmation page to the company. The question is: how can I get the resulting HTML page as a string from that JSF confirmation page within the application?
Any help is appreciated,
Tiago Gaspar.

I need the exact same page the user is viewing, i.e. the HTML filled in with the information.

Removing index.html from URL bar

Hi guys, I have uploaded my site to a URL, but I can only see my website if I add /index.html to the URL. How do I get it so that if I go to www.thisismysite.com I can see it, rather than having to type www.thisismysite.com/index.html?
Cheers,
Ian

Hello,
This needs to be changed at the hosting end of the site. In case you are using Adobe Business Catalyst for hosting, please make sure you have created index.html as the start page of the website.
If you are using other hosting service, please contact them and they will help you set index.html as the start page.
Hope this helps.
Regards,
Sachin

Can I delete "index.html" from URL as it appears in browser?

Hello,
I notice websites often appear in the browser ending in just ".com" with no "index.html" after it.
Is there any way to not have that appear?
Thanks,
Jane

Which version and build # of DW do you have? You'll find it under Help > About Dreamweaver.
Also, which preferences have you specified for new documents?
Go to Edit > Preferences > New Document.
The default extension should be either .htm or .html, .shtml or .php -- depending on the type of files you work with most.
Nancy O.

Parsing html files via an url

Hi,
I already have a Java program that is able to read in HTML files that are stored on my computer's hard drive. Now I would like to expand its functionality by being able to parse HTML files straight from the web.
For example, when the program is run, I would like to be able to give it a URL for a given website. Then, I would like to be able to parse the HTML file that the link goes to.
I've searched the forum, but have not been able to find anything of any real use. If you could offer an overview or point me towards a resource, I would be very grateful.

If you've done things right, you have an HTML reader/parser that takes an InputStream. For files, this would be a FileInputStream.
For URLs, this would be the InputStream you get from URLConnection.getInputStream(). You can get a URLConnection by calling openConnection() on a URL instance (created from your input url of course).
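
A rough sketch of that arrangement, in case it helps. HtmlSource, parse, parseFile and parseUrl are invented names, and parse is only a stand-in that reads the stream as UTF-8 text (readAllBytes needs Java 9 or later); your real parser would go there.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class HtmlSource {

    /** Stand-in for the existing parser: it only needs an InputStream. */
    static String parse(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    /** Local file: the stream is a FileInputStream. */
    static String parseFile(String fileName) throws IOException {
        try (InputStream in = new FileInputStream(fileName)) {
            return parse(in);
        }
    }

    /** Remote page: the stream comes from URLConnection.getInputStream(). */
    static String parseUrl(String address) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        try (InputStream in = conn.getInputStream()) {
            return parse(in);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java HtmlSource <file-or-url>");
            return;
        }
        String source = args[0];
        boolean remote = source.startsWith("http:") || source.startsWith("https:");
        System.out.println(remote ? parseUrl(source) : parseFile(source));
    }
}
```

Because parse only sees an InputStream, nothing in it has to change when the source moves from disk to the web.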

Weblogic call a excel-file from URL doesn't open MSExcel but flat html

Hi,
WLS 10.3.5
Forms 11.1.1.4
I am migrating from AS10g to WLS 10.3.5 / Forms 11,
and I see a difference between Forms 10g / AS and Forms 11 / WLS
when opening an Excel file with web.show_document.
In Forms 10g / AS10g,
the call
web.show_document('http://MyAS10_Server/myFormsMapping/myExcelfile.xls', '_blank');
opens a Windows dialog box
offering a choice between
opening with MS Excel
or
downloading and saving as a file.
In WLS 10.3.5 / Forms 11.1.1.4,
the call, through the Web Cache port 8090 as well as the OHS port 8888,
web.show_document('http://MyWLS_Server:8090/myFormsMapping/myExcelfile.xls', '_blank');
immediately opens the Excel file in the browser as HTML.
How can I get the same behaviour under WLS as before in AS 10g?
Is it an OHS configuration setting?
regards
get answer here :
Weblogic: when call a excelfile from URL doesn't open MSExcel but flat html
Edited by: astramare on Sep 12, 2011 11:59 AM

JEditorPane parsing HTML

Hi all,
I am using JEditorPane and its ability to parse HTML, which, although relatively old and crusty, is certainly all I need for the job.
Now, I understand there is a chain of classes involved in taking my .html file and turning it into something we can see in a JEditorPane. For example, an img tag is picked up by HTMLEditorKit and turned into an ImageView for display purposes.
I want to do the following: I have subclassed HTMLEditorKit and have overridden the HTMLFactory (although at the moment it just defers everything to super). I want to be able to pick out all of the HTML comment tags as they go through the HTMLEditorKit:
<!-- hey hey this is a comment -->
... and get to the comment text, "hey hey this is a comment", as a Java string. However, I've been digging around with Element for hours now, and although my HTMLFactory correctly picks out the comments from the rest of the elements:
else if (kind == HTML.Tag.COMMENT)
{
    System.out.println("I found a comment but don't know what it said!!");
... as you can see, I don't know how to get to the comment text itself.
The reason I want access to the comment text is that I want to supplement the HTML code a little and add something in the comment that will affect the way it is rendered when I read it. So that's the reason, if you're curious.
Any help, and I do mean anything at all, would be much appreciated, as this is the last obstacle in my path to getting this thing working :)
Thanks for your time!
- Peter

Here is some old code I have lying around that attempts to iterate through all the elements. If I remember correctly the comment text is found in the AttributeSet of the element:
import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTML
{
    public static void main(String[] args)
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // The Document class does not yet handle charset's properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try
        {
            // Create a reader on the HTML content.
            Reader rd = getReader(args[0]);
            // Parse the HTML.
            kit.read(rd, doc, 0);
            System.out.println( doc.getText(0, doc.getLength()) );
            System.out.println("----");
            // Iterate through the elements of the HTML document.
            ElementIterator it = new ElementIterator(doc);
            Element elem = null;
            while ( (elem = it.next()) != null )
            {
                AttributeSet as = elem.getAttributes();
                System.out.println( "\n" + elem.getName() + " : " + as.getAttributeCount() );
                if ( elem.getName().equals( HTML.Tag.IMG.toString() ) )
                {
                    Object o = elem.getAttributes().getAttribute( HTML.Attribute.SRC );
                    System.out.println( o );
                }
                // "enum" became a reserved word in Java 5, so the variable is renamed.
                Enumeration<?> names = as.getAttributeNames();
                while( names.hasMoreElements() )
                {
                    Object name = names.nextElement();
                    Object value = as.getAttribute( name );
                    System.out.println( "\t" + name + " : " + value );
                    if (value instanceof DefaultComboBoxModel)
                    {
                        DefaultComboBoxModel<?> model = (DefaultComboBoxModel<?>)value;
                        for (int j = 0; j < model.getSize(); j++)
                        {
                            Object o = model.getElementAt(j);
                            Object selected = model.getSelectedItem();
                            if ( o.equals( selected ) )
                                System.out.println( o + " : selected" );
                            else
                                System.out.println( o );
                        }
                    }
                }
                if ( elem.getName().equals( HTML.Tag.SELECT.toString() ) )
                {
                    Object o = as.getAttribute( HTML.Attribute.ID );
                    System.out.println( o );
                }
                // Weird, the text for each tag is stored in a 'content' element
                if (elem.getElementCount() == 0)
                {
                    int start = elem.getStartOffset();
                    int end = elem.getEndOffset();
                    System.out.println( "\t" + doc.getText(start, end - start) );
                }
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        System.exit(1);
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
To test it just use:
java GetHTML somefile.html
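
If all you need is the comment text itself, a parser callback is a more direct route than walking the elements. This is a sketch under that assumption, not a drop-in for the HTMLFactory approach: CommentReader and comments are invented names, and the sample page is made up. Inside a ViewFactory, the text is probably stored in the element's AttributeSet under HTML.Attribute.COMMENT, but I have not verified that here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class CommentReader {

    /** Returns the text of every comment the parser reports, trimmed. */
    static List<String> comments(String html) throws IOException {
        List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleComment(char[] data, int pos) {
                // 'data' is the text between the comment delimiters.
                found.add(new String(data).trim());
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><p>Some text</p>"
                + "<!-- hey hey this is a comment --></body></html>";
        for (String comment : comments(page)) {
            System.out.println("comment: " + comment);
        }
    }
}
```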

Parse HTML behaviour

Hi,
Can anybody explain the behavior of the SunOne Web Server when parse HTML is enabled for all HTML files?
If we have a valid HTML file on the web server, for instance http://servername/myhtml.html, then the page will be loaded by the web server. But if we put anything in the URL after that, the same page still gets served. Consider a URL such as http://servername/myhtml.html/ahdjksad/asdhjsad/sdhjklsad/asjdksald (anything after the HTML file name): the web server will still load the myhtml.html page. If we disable parse HTML, such URLs won't work. I want server-side includes to work without the server serving such wrong URLs.
How does this come about? That path does not exist on the web server, so it should give a 404 error. Is this the way parse HTML works? Is there any way to restrict it while keeping the parse HTML functionality enabled for server-side includes, other than a custom NSAPI?
If anyone has noticed this behavior, please explain.
Thanks,
Rijesh.

Hi,
Actually, I load an external HTML file using the LoadVars class methods:
var oLoad:LoadVars=new LoadVars();
oLoad.load("external.html");
I want to parse that HTML in Flash.
I need some text from the HTML page (the HTML has 200 lines of code).
How can I parse that HTML and trace that particular text?

How to display a dynamic image file from url?

Hey, I want to display a dynamic image file from a URL in an applet. For example, a JPG file from a video camera server that always holds a single frame picture. My Java file looks like this:
//PlayJpg.java:
import java.awt.*;
import java.applet.*;
import java.net.*;

public class PlayJpg extends Applet implements Runnable {
    public static void main(String args[]) {
        Frame F = new Frame("My Applet/Application Window");
        F.setSize(480, 240);
        PlayJpg A = new PlayJpg();
        F.add(A);
        A.start(); // Web browser calls start() automatically
        // A.init(); - we skip calling it this time
        // because it contains only Applet specific tasks.
        F.setVisible(true);
    }

    Thread count = null;
    String urlStr = null;
    int sleepTime = 0;
    Image image = null;

    // called only for an applet - unless called explicitly by an application
    public void init() {
        sleepTime = Integer.parseInt(getParameter("refreshTime"));
        urlStr = getParameter("jpgFile");
    }

    // called only for an applet - unless called explicitly by an application
    public void start() {
        count = (new Thread(this));
        count.start();
    }

    // called only for applet when the browser leaves the web page
    public void stop() {
        count = null;
    }

    public void paint(Graphics g) {
        try {
            URL location = new URL(urlStr);
            image = getToolkit().getImage(location);
        } catch (MalformedURLException mue) {
            showStatus(mue.toString());
        } catch (Exception e) {
            System.out.println("Sorry. System Caught Exception in paint().");
            System.out.println("e.getMessage():" + e.getMessage());
            System.out.println("e.toString():" + e.toString());
            System.out.println("e.printStackTrace():");
            e.printStackTrace();
        }
        if (image != null) g.drawImage(image, 1, 1, 320, 240, this);
    }

    // called each time the display needs to be repainted
    public void run() {
        while (count == Thread.currentThread()) {
            try {
                Thread.sleep(sleepTime * 1000);
            } catch (Exception e) {}
            repaint(); // forces update of the screen
        }
    }
}
// end of PlayJpg.java
My HTML file looks like this:
<html>
<applet code="PlayJpg.class" width=320 height=240>
<param name=jpgFile value="http://Localhost/playjpg/snapshot0.jpg">
<param name=refreshTime value="1">
</applet>
</html>
With this HTML I only ever get the first frame picture, even though the JPG file is dynamic.
Why?
Can you help me?
Thanks.
Joe

Hi,
Add this line inside your run() method, right before your call to repaint():
if (image != null) {image.flush();}
Hope this helps,
Kurt.
