How to parse an HTML page and get useful information?

I am fetching a page by its URL. Once I have the whole page,
is there any way to extract the useful text and discard the rest, such as ads
and other related links?
I have tried using java.util.regex.*;
are there any other methods for doing this?

Regex isn't a good method unless your requirements are quite simple. In general if you want a Java HTML parser they are not hard to find -- "java html parser" is a good choice of keywords for an internet search.

Similar Messages

  • How to parse an HTML page to get its variables and values?

    Hi, everyone, here is my situation:
    I need to parse an HTML page to get the variables and their associated values between the <form>...</form> tags. For example, if you have a piece of HTML as below:
    <form>
    <input type = "hidden" name = "para1" value = "value1">
    <select name = "para2">
    <option>value2</option>
    </select>
    </form>
    The actual page is much more complex than this. I want to retrieve para1 = value1 and para2 = value2. I tried JTidy, but it doesn't recognize select. Could you recommend a good package for this purpose? Sample code would be even better.
    Thanks a lot
    Kevin

    See for example Request taglib from Coldtags suite:
    http://www.servletsuite.com/jsp.htm

  • How to parse a web page and use the same session(?) to post?

    Hi,
    let's say there is a web page requesting:
    -username
    -password
    -sequence of letters (assume it's not an image, just plain text)
    I can get the source of the page, parse it and find the sequence.
    However, I couldn't manage to post it (the sequence changes).
    Any help would be great, thx.
            try {
                int statusCode;
                String urlAddress = "mywebpage";
                HttpClient client = new org.apache.commons.httpclient.HttpClient();
                String key = "";
                URL url = new URL(urlAddress);
                URLConnection conn = url.openConnection();
                BufferedReader d = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                StringBuffer buffer = new StringBuffer();
                // Read until end of stream; ready() can return false before the whole page has arrived.
                String line;
                while ((line = d.readLine()) != null) {
                    buffer.append(line);
                }
                String contents = buffer.toString();
                // Collect the character that follows each "operation" marker.
                int index = contents.indexOf("operation");
                while (index > -1) {
                    key = key + contents.substring(index + 10, index + 11);
                    index = contents.indexOf("operation", index + 11);
                }
                System.out.println(contents);
                System.out.println(key);
                /* Alternative attempt: fetch the page through HttpClient instead of URLConnection.
                GetMethod getMethod = new GetMethod(urlAddress);
                getMethod.getParams().setCookiePolicy(CookiePolicy.RFC_2109);
                statusCode = client.executeMethod(getMethod);
                if (statusCode != -1) {
                    contents = getMethod.getResponseBodyAsString();
                    index = contents.indexOf("operation");
                    while (index > -1) {
                        key = key + contents.substring(index + 10, index + 11);
                        index = contents.indexOf("operation", index + 11);
                    }
                    System.out.println(contents);
                    System.out.println(key);
                }
                */
                PostMethod method = new PostMethod(urlAddress);
                // Configure the form parameters
                method.addParameter("user", "test");
                method.addParameter("pwd", key);
                // Execute the POST method
                statusCode = client.executeMethod(method);
                if (statusCode != -1) {
                    contents = method.getResponseBodyAsString();
                    method.releaseConnection();
                    System.out.println(contents);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

    perhaps the website checks its source? What HTTP response do you get back?
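
    One thing that may be worth checking is whether the GET and the POST share cookies. In the posted code the page is read through URLConnection while the POST goes through HttpClient, so any session cookie set by the first request is probably not sent with the second. Below is a rough sketch of keeping both requests in one session. It swaps in the JDK's own java.net.http client (Java 11+) rather than Commons HttpClient; the class name SessionPost and the helpers extractKey, formBody and login are made up for illustration, and the "user"/"pwd" field names and the "operation" marker are simply copied from the question.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class SessionPost {

    // Mirrors the question's logic: take the character at offset 10 from each "operation" marker.
    static String extractKey(String contents) {
        StringBuilder key = new StringBuilder();
        int index = contents.indexOf("operation");
        while (index > -1 && index + 11 <= contents.length()) {
            key.append(contents.charAt(index + 10));
            index = contents.indexOf("operation", index + 11);
        }
        return key.toString();
    }

    // Encodes parameters as application/x-www-form-urlencoded.
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Hypothetical flow: GET the page, then POST with the same client so cookies carry over.
    static String login(String url, String user) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager()) // stores cookies from the GET, resends them on the POST
                .build();
        HttpRequest get = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String page = client.send(get, HttpResponse.BodyHandlers.ofString()).body();

        Map<String, String> form = new LinkedHashMap<>();
        form.put("user", user);
        form.put("pwd", extractKey(page));
        HttpRequest post = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(form)))
                .build();
        HttpResponse<String> response = client.send(post, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() + "\n" + response.body();
    }

    public static void main(String[] args) {
        // Offline demonstration of the two helpers only; login() needs a real URL.
        System.out.println(extractKey("operation=a operation=b"));
        System.out.println(formBody(Map.of("user", "test")));
    }
}
```

    With Commons HttpClient 3.x the equivalent idea would be to run both the GetMethod and the PostMethod through the same HttpClient instance, as the commented-out part of the question starts to do.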

  • How to test JSP pages and servlets using JUnit?

    How do I test JSP pages using JUnit? How do I configure it, and what are the steps to run the JUnit test cases?

    Hi xiepei,
    since you are using modbus, a simple error checking is implicit in the protocol and is the comparison between returned checksum and the calculated one on the received message: checksum errors, if any, are an effect of communications errors (you should have at least 2 bits changed and on particular patterns to have the checksum be calculated correctly!). You won't be able to calculate BER on them, but you can calculate PER (Packet Error Rate).
    Another flag for communication errors, in the other direction, is to intercept error messages from the device: if it fully implements the modbus protocol, it should return some warning in case of error (I seem to remember that in some cases it returns the received message with some error bits added: please check the modbus documentation).

  • How to Parse XML with SAX and Retrieving the Information?

    Hiya!
    I have written this code in one of my classes:
    /**Parse XML File**/
    SAXParserFactory factory = SAXParserFactory.newInstance();
    GameContentHandler gameCH = new GameContentHandler();
    try
    {
         SAXParser saxParser = factory.newSAXParser();
         saxParser.parse(recentFiles[0], gameCH);
    }
    catch(javax.xml.parsers.ParserConfigurationException e)
    {
         e.printStackTrace();
    }
    catch(java.io.IOException e)
    {
         e.printStackTrace();
    }
    catch(org.xml.sax.SAXException e)
    {
         e.printStackTrace();
    }
    /**Parse XML File**/
    games = gameCH.getGames();
    And here is the content handler:
    import java.util.ArrayList;
    import org.xml.sax.*;
    import org.xml.sax.helpers.DefaultHandler;
    class GameContentHandler extends DefaultHandler
    {
         private ArrayList<Game> games = new ArrayList<Game>();
         public void startDocument()
         {
              System.out.println("Start document.");
         }
         public void endDocument()
         {
              System.out.println("End document.");
         }
         public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException
         {
         }
         public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException
         {
         }
         public void characters(char[] ch, int start, int length) throws SAXException
         {
              /**for (int i = start; i < start+length; i++)
                   System.out.print(ch[i]);**/
         }
         public ArrayList<Game> getGames()
         {
              return games;
         }
    }
    And here is the xml i am trying to parse:
    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
    <Database>
         <Name></Name>
         <Description></Description>
         <CurrentGameID></CurrentGameID>
         <Game>
              <gameID></gameID>
              <name></name>
              <publisher></publisher>
              <platform></platform>
              <type></type>
              <subtype></subtype>
              <genre></genre>
              <serial></serial>
              <prodReg></prodReg>
              <expantionFor></expantionFor>
              <relYear></relYear>
              <expantion></expantion>
              <picPath></picPath>
              <notes></notes>
              <discType></discType>
              <owner></owner>
              <location></location>
              <borrower></borrower>
              <numDiscs></numDiscs>
              <discSize></discSize>
              <locFrom></locFrom>
              <locTo></locTo>
              <onLoan></onLoan>
              <borrowed></borrowed>
              <manual></manual>
              <update></update>
              <mods></mods>
              <guide></guide>
              <walkthrough></walkthrough>
              <cheats></cheats>
              <savegame></savegame>
              <completed></completed>
         </Game>
    </Database>
    I have been trying for ages and just can't get the content handler class to extract a gameID and instantiate a Game to add to my ArrayList! How do I extract the information from my file?
    I have tried so many things in the startElement() method that I can't actually remember what I've tried and what I haven't! If you need to know, the Game class is instantiated with new Game(int gameID) and the rest of the variables are public.
    Please help someone...

    OK, how's this?
    // Field of the handler class: text collected for the element currently being read.
    private String current = "";
    public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException
    {
         current = "";
    }
    public void endElement(String namespaceURI, String localName, String qualifiedName) throws SAXException
    {
         try
         {
              if(qualifiedName.equals("Game") || qualifiedName.equals("Database"))
                   {return;}
              else if(qualifiedName.equals("gameID"))
                   {games.add(new Game(Integer.parseInt(current)));}
              else if(qualifiedName.equals("name"))
                   {games.get(games.size()-1).name = current;}
              else if(qualifiedName.equals("publisher"))
                   {games.get(games.size()-1).publisher = current;}
              // etc...
              else
                   {System.out.println("ERROR - Qualified Name found in xml that does not exist as database field: " + qualifiedName);}
         }
         catch (Exception e) {} //Ignore
    }
    public void characters(char[] ch, int start, int length) throws SAXException
    {
         // May be called several times per element, so append rather than assign.
         current += new String(ch, start, length);
    }
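
    For anyone who wants to try the idea end to end, here is a minimal self-contained sketch of the same accumulate-in-characters(), act-in-endElement() pattern. The Game class below is a cut-down stand-in with only three fields, the names GameSaxDemo and parse(String) are invented for the example, and the inGame flag is an addition to keep Database-level elements such as <Name> from being mistaken for game fields.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class GameSaxDemo {

    // Cut-down stand-in for the poster's Game class (assumption: the real one has many more fields).
    static class Game {
        final int gameID;
        String name;
        String publisher;
        Game(int gameID) { this.gameID = gameID; }
    }

    static class GameContentHandler extends DefaultHandler {
        private final List<Game> games = new ArrayList<>();
        private final StringBuilder current = new StringBuilder();
        private boolean inGame = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            current.setLength(0);               // start collecting text for the new element
            if (qName.equals("Game")) inGame = true;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            current.append(ch, start, length);  // may arrive in several chunks
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String text = current.toString().trim();
            if (qName.equals("Game")) { inGame = false; return; }
            if (!inGame) return;                // skip Database-level elements
            if (qName.equals("gameID")) {
                games.add(new Game(Integer.parseInt(text)));
            } else if (!games.isEmpty()) {
                Game g = games.get(games.size() - 1);
                if (qName.equals("name")) g.name = text;
                else if (qName.equals("publisher")) g.publisher = text;
            }
        }

        List<Game> getGames() { return games; }
    }

    // Parses XML held in a string; a file or stream works the same way.
    static List<Game> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        GameContentHandler handler = new GameContentHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getGames();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Database><Name>My games</Name><Game><gameID>7</gameID>"
                + "<name>Example</name><publisher>Acme</publisher></Game></Database>";
        for (Game g : parse(xml)) {
            System.out.println(g.gameID + " " + g.name + " " + g.publisher);
        }
    }
}
```

    Note that an empty <gameID></gameID>, as in the posted file, would make Integer.parseInt throw, so real data needs a check there.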

  • I have a rectangle with the word caption written twice on it which appears on my page and obscures my information - how can I get rid of this please?

    I have a rectangle with the word 'caption' written twice on it which appears on my page and obscures my information. How can I get rid of it please?

    Can you post the URL so we can test the page and look at its code?

  • HT1430 How do I change page and get back to app store homepage?

    How do I change page and get back to app store homepage?

    Reset your homepage.
    http://support.mozilla.com/en-US/kb/How+to+set+the+home+page

  • HT5137 my app is stuck on one page of the ibook I was reading. How do I close my ibook and get back to the main library?

    My app is stuck on one page of the ibook I was reading. How do I close the ibook and get back to the main library?

    A number of extensions can cause that problem, you'll need to do a little troubleshooting to find out which extension is causing that to happen for you.
    http://support.mozilla.com/en-US/kb/troubleshooting+extensions+and+themes

  • My ibook is frozen on one page. How do I close the ibook and get back to the main library?

    My ibook is frozen on one page. How do I close the ibook and get back to the main library?

    This is the iPod touch forum. You seem to have an iBook problem, and I can't understand your problem with that.
    I do not understand what you mean by page and what you mean by library.

  • How do I insert new record and get results on a landing page

    how do I insert new record and get results on a landing page

    It's not clear from your post what you are asking. In a SQL database, you use the INSERT statement to insert a row into a table. You use the SELECT statement to retrieve rows. Here's some basic info on how to do that within PHP
    PHP MySQL Insert Into
    PHP MySQL Select

  • How to open an HTML page that is part of my project using Captivate 6

    I'm looking to create an HTML page that will use some javascript to extract information from Captivate and then render it to the browser window as a report.
    I see how you can open a URL from Captivate (by putting in an explicit address, e.g. www.abc.com or http://www.abc.com/myReport.html),
    but I'm curious what address I can use if the HTML file is in the SAME folder as the index.html that launches the Captivate project (or perhaps one folder down). I want to do this so that it works whether or not I've published the project to a web server.
    I was thinking I could use a relative reference (i.e. something like .\myReport.html), but I haven't had any luck so far.

    Thanks Seth.
    I just tried that, but when I run it in preview mode (by simply hitting F12), it gives me an error because it can't find the file in the temporary preview folder it creates, i.e.:
    C:\Users\Tom\AppData\Local\Temp\CP2840464090993Session\CPTrustFolder2840464091009\CaptivatePreviewLoader\
    I'm hoping to find a place to put it so that it works both when running F12 and when running in regular 'published' mode.
    I was thinking I could put it in the 'C:\Program Files\Adobe\Adobe Captivate 6 (32 Bit)\Templates\Publish' folder, but when I do that, it doesn't seem to get copied to the .\CaptivatePreviewLoader folder when running F12.

  • Parsing a html page

    I want to parse a page for specified contents.
    I feel that it is easy to do, but my problem is that there are many URLs in the HTML page, and the program has to follow each link and grab the specified content from that page. In this way it has to parse all the links.
    Can anyone help me with this? Also, if anyone has code for it, please tell me. I have been trying hard at it.
    Thanks, Swetha.

    Sounds like you are making a web spider. Here are a few open source spiders you could "dissect" (pun intended).
    http://java-source.net/open-source/crawlers
    There are also a fair number of tutorials on this kind of project if you poke around Google for a little bit. There are several ways you could approach this; most likely the one you choose will be based on how many URLs you plan on visiting and what you plan on doing there.
    If you only plan on visiting a small number of URLs, you could simply maintain a list of unvisited pages and a list of visited pages. These could be linked lists if you don't care about seeing the same page twice, or perhaps a hash set if you do. So you pull off your first URL, read the contents of the page, find the occurrences of http:// and add those URLs to your unvisited list. When you are done with the current page, move its URL to the visited list.
    When you are parsing out the URLs, you could do something as simple as using a StringTokenizer to break the HTML code into words. Then you could tell that a word is a link by calling something similar to s.startsWith("href="); and go from there...
    If you are going to be visiting many pages, you might want to investigate using multiple threads. In that case you'll need to use a list that is thread-safe, and you might want to throttle the threads (have them sleep a little in between URL requests) so you don't go blasting their/your bandwidth...
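
    The unvisited/visited bookkeeping described above could be sketched roughly like this. To keep it runnable without a network, the "web" is just a map from URL to HTML, so pages.get(url) stands in for the real HTTP fetch; the class name TinySpider, the maxPages limit and the regular expression are all illustrative choices, and a real HTML parser would cope better with messy markup than the regex does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinySpider {

    // Naive matcher for quoted href attributes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl over an in-memory "web" (URL -> HTML), visiting each URL at most once.
    static Set<String> crawl(String start, Map<String, String> pages, int maxPages) {
        Deque<String> unvisited = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>(); // a set, so repeats are cheap to detect
        unvisited.add(start);
        while (!unvisited.isEmpty() && visited.size() < maxPages) {
            String url = unvisited.poll();
            if (!visited.add(url)) {
                continue;                      // already seen this page
            }
            String html = pages.get(url);      // stand-in for fetching the page over HTTP
            if (html == null) {
                continue;                      // page not available
            }
            for (String link : extractLinks(html)) {
                if (!visited.contains(link)) {
                    unvisited.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> pages = Map.of(
                "http://a.example/", "<a href=\"http://b.example/\">b</a> <a href='http://c.example/'>c</a>",
                "http://b.example/", "<a href=\"http://a.example/\">back</a>",
                "http://c.example/", "no links here");
        System.out.println(crawl("http://a.example/", pages, 10));
    }
}
```

    For a multi-threaded version the queue and set would need thread-safe replacements, plus a delay between requests as mentioned above.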

  • How to create a html page

    Hello, I am very new to Dreamweaver 8, and I have searched everywhere possible on how to create an HTML page. My website has a navigation bar, and on the navigation bar it has links, like forums, your accounts, etc., but those are already integrated with the website. I have an option in the admin area where I can create a new category so that it shows up in the navigation bar, and I have a drop-down menu where I can select what to put in that category. I also have an option to put in an external URL, so that when users click on it, it takes them to wherever that link points. The point is that I want to make a category in the navigation menu that says Lyrics, so that I can put up lyrics of songs etc., and so that when users click on that link it takes them to the lyrics page. Do you guys understand? Sorry for the bad English, I am not quite good with it.

    The most eloquent description will do nothing for us. To get a solution to your problem, we must see your code. Can you upload the file and post a link to it, please?
    Murray
    "don_playboy" wrote:
    > Well, what I am trying to say is that I created a folder named Liricas (which is lyrics in Spanish) on my FTP. Since I created that folder on my FTP, it will show in my navigation menu control panel, but it will be blank since it has nothing in it.
    > What I want to do is this: since I created that folder, when I go to my nav menu control panel and select it as a category, it shows on my navigation menu (the Liricas, that is), but when you click it, it has nothing, since there is no HTML or anything in the category.
    > Now I want to create an HTML page so that I can put album names, and under those album names I want to create a link or category for lyrics. All this will be stored on the FTP, and when users click on the Liricas category in my navigation menu, it will take them to the Liricas index page, which will show all the lyrics...

  • Parsing an html page

    I am trying to parse an HTML page read from the internet.
    I assume I need to create a URL from its address, but after that I'm not sure how to go about reading the lines of that HTML page.
    I would like to load each line as a String into an array or a Vector so that I can easily parse each line from there...
    any tips?
    thanks.

    haha...this is what i ended up doing...and i was able to parse it all pretty easy...
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    in.readLine();
    thanks!
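
    For completeness, reading every line into a list (rather than a single readLine() call) might look like the sketch below. The class name PageLines and the helper readLines are made up for the example, and the URL overload assumes the page is encoded as UTF-8, which is not guaranteed for an arbitrary site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PageLines {

    // Reads all lines from any Reader; the reader is closed when done.
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) { // null marks end of stream
                lines.add(line);
            }
        }
        return lines;
    }

    // Convenience overload for a web page (assumption: the page is UTF-8).
    static List<String> readLines(URL url) throws IOException {
        return readLines(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Uses an in-memory page so the example runs without a network connection.
        List<String> lines = readLines(new StringReader("<html>\n<body>Hello</body>\n</html>"));
        System.out.println(lines.size() + " lines, second: " + lines.get(1));
    }
}
```

    Going line by line is fine for simple pages, but tags can span lines, so for anything beyond that an HTML parser is the safer route.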

  • How can I save a page and all its component parts in a single file, like IE does as an MHT - it's much easier for mailing to people where page address not available?? (as in output from an airline booking site, for example)

    how can I save a page and all its component parts in a single file, like IE does as an MHT?
    It's much easier for mailing to people where page address not available?? (as in output from an airline booking site, for example)
    It is simply too painful to have to zip everything up into a single file to send. MHT format has been available for years now from IE, and with every new FF release it's the first thing I look for. I have been using FF for years, and hate having to come out of it, go over into IE (which I even took out of startup) and key everything in again, in order to send somebody something in a convenient format that they can open with a single click.
    I can't believe this hasn't been asked before, so have you looked at it and rejected it? Have MS kept the file format secret?
    Thanks
    MG

    This is not really an answer just my comments on your question.
    I am sure I recollect efforts being made to get mhtml to work with FF.
    Probably the important thing to remember about .mhtml is that if other browsers do support it they may need addons, and may not necessarily render the content correctly/consistently.
    There are FF addons designed for archiving webpages, you could try them, but that then assumes the recipient has the same software.
    You could simply save the page from FF to your XP pc; then offline open it with and save it using IE, before then emailing using FF, and attaching the .mht or mhtml file that you have now created on your PC.
    As an alternative method, in some cases it could be worth considering taking a screen grab of the required page, then sending that to the recipient as a single email attachment in, for instance, bitmap or JPEG format.
    Something such as an airline booking may be designed with a print option; it could be worthwhile looking at sending the print file itself as an email attachment.
