Matching HTML links with regex

Hi all,
I'm making a basic web crawler and am looking for a regular expression that is perhaps more efficient than the one I'm using at the moment.
The regex I've got right now seems to work OK, but it doesn't find all (or even most) of the links on the page.
Pattern ptn = Pattern.compile("<a?\\s+href\\s*=\\s*\"?(.*?)[\"|>]", Pattern.CASE_INSENSITIVE);
Matcher match = ptn.matcher(strHTML);
while (match.find()) {
    // process links here
}
Thanks as always for your help.

wraith2008 wrote:
Hi all,
I'm making a basic web crawler and am looking for a regular expression that is perhaps more efficient than the one I'm using at the moment.
...

Two reasons for using an HTML parser instead:
- it's faster than regex;
- it will find more links (including incorrectly formed ones) than your regex will.
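
If a regex is still preferred for a quick crawler, part of the reason the posted pattern misses links is visible in the pattern itself: "<a?" makes the "a" optional, and "\\s+href" only matches when href is the first attribute after the tag name. The following is a minimal sketch, not a replacement for a real parser; the class name LinkExtractor, the method extractLinks and the pattern are illustrative assumptions, not something posted in the thread. It accepts other attributes before href and double-quoted, single-quoted or unquoted values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Hypothetical pattern (not from the thread): "<a", any other attributes,
    // then href with a double-quoted, single-quoted, or unquoted value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Only one of the three alternatives takes part in any given match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    links.add(m.group(g));
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a class=\"nav\" href='a.html'>A</a> <A HREF=b.html>B</A>";
        System.out.println(extractLinks(html));
    }
}
```

This still treats HTML as flat text, so commented-out markup or attributes such as data-href can produce false matches, which is the kind of case a parser handles better.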

Similar Messages

  • Is there a way to include html link with applet?

    I am thinking of writing an applet that would be useful to others, which I would provide for free. In return I would want a link from the pages that use the applet. It is important that the link is search engine friendly, a pure href html link that will provide google page rank to me.
    Is there a way to include such a link with an applet so that it can't be removed?

    No. You can't force other sites to change their content.

  • How to use the html:link with a arraylist

    Hi everyone,
    I want to display data using the Struts html:link tag.
    I query the database and put each row into a JavaBean, then add all the beans to an ArrayList. In the Action, I call request.setAttribute("lovetable", articlelist) to pass the list to the JSP page.
    I want to pass an "id" parameter through the hyperlink so that I can read it when the link is clicked.
    But how do I use html:link to do this?
    I tried <html:link action="viewtopic.do" paramId="id" paramName="lovetable" paramProperty="id"/>, but it doesn't work and Tomcat reports this error:
    org.apache.jasper.JasperException: No getter method for property id of bean lovetable
         at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:248)
         at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
         at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
    lovetable is an ArrayList, so it has no getter or setter for "id".
    How can I use it to pass the parameter? :( Thanks

    Thank you.
    I got it working with bean:define, and all the data is now displayed in the JSP. My JSP code is:
    <logic:iterate id="love" name="lovetable">
      <bean:define id="idbean" name="love"/>
      <tr bgcolor="<%=color%>">
        <td><bean:write name="love" property="id"/></td>
        <td><html:link forward="viewtopic" paramId="id" paramName="idbean" paramProperty="id"><bean:write name="love" property="title"/></html:link></td>
        <td><bean:write name="love" property="name"/></td>
        <td><bean:write name="love" property="time"/></td>
      </tr>
    </logic:iterate>
    In the Action:
    ResultSet rs = .............
    List articlelist = new ArrayList();
    while (rs.next()) {
        articlebean bean = new articlebean();
        bean.setName(rs.getString("name"));
        bean.setTitle(rs.getString("title"));
        articlelist.add(bean);
    }
    request.setAttribute("lovetable", articlelist);
    The above code works properly.
    The html:link tag needs a bean to work with, not an ArrayList, so I iterate over the list to get each bean.
    The logic:iterate tag needs a collection to work with, so I make the Action return a List.
    Is that right?
    Any ideas? :)
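
The "No getter method for property id" error is about JavaBean conventions: a property called "id" is readable only if the class has a public getId() method, and ArrayList has none. A small sketch of that check using the standard java.beans.Introspector follows. ArticleBean here is a hypothetical stand-in for the thread's articlebean class, whose source is not shown, and GetterCheck and hasGetter are names invented for the illustration.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;

class GetterCheck {
    // Hypothetical bean standing in for the thread's articlebean class.
    public static class ArticleBean {
        private int id;
        private String title;
        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
    }

    // True if the class exposes a readable JavaBean property with this name.
    static boolean hasGetter(Class<?> type, String property) {
        try {
            for (PropertyDescriptor pd : Introspector.getBeanInfo(type).getPropertyDescriptors()) {
                if (pd.getName().equals(property) && pd.getReadMethod() != null) {
                    return true;
                }
            }
            return false;
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasGetter(ArrayList.class, "id"));   // the list itself
        System.out.println(hasGetter(ArticleBean.class, "id")); // one element of the list
    }
}
```

This is why iterating first and pointing paramName at a single element, as in the JSP above, is the approach that works.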

  • html:link with lots of overlapping single and double quotes

    Hi,
    Please take a look at the following code and let me know how I can fix the problem.
    <html:link href='<%="javascript:gotoUrl(' "+
    myobject.getAstring()
    +" ')"%>' >

    My rule is to work in steps, from the outside in:
    1) Let's make a link. The outermost quotes will be double.
    <html:link href="" >
    2) What do we want to put into the link? A JavaScript call, so let's add that. Don't forget the ; at the end.
    <html:link href="javascript:gotoUrl();" >
    3) The JavaScript function takes a string parameter. Since these are inner quotes, we use single quotes:
    <html:link href="javascript:gotoUrl('');" >
    4) The parameter passed to the JavaScript function comes from a scriptlet:
    <html:link href="javascript:gotoUrl('<%= myobject.getAstring() %>');" >
    This way, it should be clearer which quotes open and close each other, and you learn to limit what needs to come out of the scriptlet code, which makes it easier to read.

  • Linking secure html link with JSF?

    Hey all,
    I do have a previous post about j_security_check and container-based security (http://swforum.sun.com/jive/thread.jspa?threadID=54464&tstart=0), but since this problem could be answered without it, I'm asking separately.
    I want to be able to e-mail links with pre-populated attributes (identifiers, dates, what have you) but still have the link be secure and require auth. I also want the user to be taken to the linked page automatically after auth. How does one do this with JSF?
    thanks,
    -D

    Hi,
    Please go through the below thread:
    http://swforum.sun.com/jive/thread.jspa?forumID=123&threadID=50520
    Hope this helps.
    Thanks,
    RK.

  • Help with regex pattern matching.

    Hi everyone
    I am trying to write a regex that will extract each of the links from a piece of HTML code. The sample piece of HTML is as follows:
    <td class="content" valign="top">
         <!-- BODY CONTENT -->
    <script language="JavaScript"
    src="http://chat.livechatinc.net/licence/1023687/script.cgi?lang=en&groups=0"></script>
    <a href="makeReservation.html">Making a reservation</a><br/>
    <a href="changeAccount.html">Changing my account</a><br/>
    <a href="viewBooking.html">Viewing my bookings</a><br/>
    I am interested in extracting each link and the corresponding text for that link into groups.
    So far I have the following regex:
    <td class="content" valign="top">.*?<a href="(.*?)">(.*?)</a><br>
    However, this regex only matches the first line in the block of links, and I need to match each line in the block.
    Any ideas? Any suggestions are appreciated, as always.
    Thanks.

    Hi sabre,
    thanks for the reply.
    I am already using a while loop with matcher.find(), but it still only returns the first link based on my regex.
    The code is as follows:
    private static final Pattern MENU_ITEM_PATTERN = compilePattern("<td class=\"content\" valign=\"top\">.*?<a href=\"(.*?)\">(.*?)</a><br>");

    private LinkedHashMap<String, String> findHelpLinks(String body) {
        LinkedHashMap<String, String> helpLinks = new LinkedHashMap<String, String>();
        Matcher matcher = MENU_ITEM_PATTERN.matcher(body);
        while (matcher.find()) {
            String link = matcher.group(1);
            String linkText = matcher.group(2);
            if (link != null && linkText != null) {
                helpLinks.put(link, linkText);
            }
        }
        return helpLinks;
    }

    private static Pattern compilePattern(String pattern) {
        return Pattern.compile(pattern, Pattern.DOTALL | Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
    }
    Any ideas?
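
One likely cause, going only by the code shown: the pattern starts with the <td class="content" ...> prefix, and that prefix occurs once in the page, so find() can succeed at most once. The sample also ends each line with <br/> while the pattern expects <br>. A minimal sketch that drops both anchors and matches once per link is below; the class name HelpLinks and the simplified pattern are assumptions for illustration, not the fix that was given in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HelpLinks {
    // No <td ...> prefix and no trailing <br>, so the pattern can match
    // once for every anchor instead of once for the whole cell.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=\"(.*?)\">(.*?)</a>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static Map<String, String> findHelpLinks(String body) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = LINK.matcher(body);
        while (m.find()) {
            links.put(m.group(1), m.group(2)); // href -> link text
        }
        return links;
    }

    public static void main(String[] args) {
        String body = "<td class=\"content\" valign=\"top\">\n"
                + "<a href=\"makeReservation.html\">Making a reservation</a><br/>\n"
                + "<a href=\"changeAccount.html\">Changing my account</a><br/>\n"
                + "<a href=\"viewBooking.html\">Viewing my bookings</a><br/>";
        findHelpLinks(body).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```

If only the links inside that one table cell are wanted, the cell's content can be cut out first and this method applied to that substring.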

  • Problem with regex pattern/matcher

    Hello, I used some code I found in the tutorial and the forums to do some HTML pattern matching. I'm just now learning regex and I can't figure out how to find each occurrence of my pattern. Here's the code.
    import java.util.regex.*;

    public class Parser {
        public Parser() {
        }

        public static void main(String[] args) {
            String INPUT = "<tr><td>1st cell</td><td>2nd cell</td></tr>";
            String REGEX = "<td>.*</td>";
            Pattern p = Pattern.compile(REGEX);
            Matcher m = p.matcher(INPUT);
            int count = 0;
            while (m.find()) {
                count++;
                System.out.println("Match number " + count);
                System.out.println("start(): " + m.start());
                System.out.println("end(): " + m.end());
            }
        }
    }
    The output is:
    Match number 1
    start(): 4
    end(): 38
    I'm looking for a match for each cell, not the outermost match... I hope that made sense.
    Thanks for the help!

    sabre, I guess your hint is nicer, eh? ;-)
    I prefer your regex because I like a closed form of termination condition rather than the open .*. I suspect that both will work OK for the OP.
    Funny, I was about to suggest the same thing you suggested when I saw your post. I prefer the one with the lazy search.
    Also, just for fun, I would suggest a more generic one:
    "<(.*?)>(.*?)</\\1>"
    and, just because it is fun:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * HtmlParser.java
     * version 1.0
     * 14/07/2005
     * @author notivago
     */
    public class HtmlParser {
        private final Pattern pattern = Pattern.compile("<(.*?)>(.*?)</\\1>");

        public static void main(String[] args) {
            new HtmlParser().parse("<tr><td>1st cell</td><td>2nd cell</td></tr>");
        }

        // Prints each tag and its content, then recurses into the content.
        public void parse(String input) {
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                show(matcher);
                parse(matcher.group(2));
            }
        }

        private void show(Matcher matcher) {
            System.out.println("Tag: " + matcher.group(1));
            System.out.println("Content: " + matcher.group(2));
            System.out.println("----");
        }
    }
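
For reference, the single match from 4 to 38 in the question comes from the greedy .* running to the last </td> in the input. A minimal sketch of the lazy version mentioned in the replies is shown here; the class name CellMatcher and the method cells are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CellMatcher {
    // Reluctant .*? stops at the first </td> instead of the last one.
    private static final Pattern CELL = Pattern.compile("<td>(.*?)</td>");

    static List<String> cells(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(cells("<tr><td>1st cell</td><td>2nd cell</td></tr>"));
    }
}
```

A negated character class such as <td>([^<]*)</td> is the "closed form" alternative: it cannot run past the next tag at all, at the cost of not matching cells that contain nested markup.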

  • Filting html tags, css, and javascript with regex

    Hi everyone,
    I'm writing a small application where a user types in a URL and the text of the web page is displayed in a text area.
    I've got it to work, but it takes some time, and a lot of content I don't want is displayed as well: tags, scripts and sometimes CSS.
    Initially I filtered out the HTML tags with a regular expression, but I still get a lot of unwanted content.
    I'm not a confident Java programmer, and the idea of parsing HTML, CSS and JavaScript is the scariest idea ever to me, so my next idea is to keep only what is between the <body> tags and delete everything above and below them. Hopefully that will leave me with only the visible content of the site.
    I've messed around with regular expressions but I can't get it to work. Can anyone help out?
    Thanks a lot,
    Torre

    Darryl.Burke wrote:
    I tried out the regexes I posted on the source of a forum page, which is not valid HTML (it contains two opening and two closing body tags). With a bit of trial and error I was able to remove everything up to the first opening tag, and not the second, by using a reluctant quantifier, ^.*?, but couldn't for the life of me remove only the last closing tag while leaving the other, invalid one intact. How would you do that?

    Regexes always try to match the first occurrence of whatever they're looking for (the sentinel), and there's no way to change that behavior (but it would be handy if you could). What you have to do instead is make sure the rest of the regex can't match the sentinel. For that you need lookahead, and the simplest way to use it is to scan the rest of the text looking for the sentinel and, if it doesn't find one, go ahead and gobble up the remaining text:
    "(?is)</body>(?!.*</body).*$"
    However, if there are many occurrences of the sentinel, you could take a serious performance hit. Here's a much more efficient way:
    "(?is)</body>(?:[^<]++|<(?!/body>))*+$"
    After matching the sentinel, this regex gobbles up anything that's not the first character of the sentinel, or the first character as long as it isn't followed by the remaining characters of the sentinel. The advantages of this regex are that it never has to backtrack, and the lookahead is only applied when it's necessary, where the first regex applies it every time.
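
A short sketch of the second pattern in use follows. The regex is the one quoted in the reply; the class name BodyTrimmer, the method name and the sample input are assumptions made for the example.

```java
import java.util.regex.Pattern;

class BodyTrimmer {
    // Pattern quoted from the reply: the last </body> plus everything after it.
    private static final Pattern AFTER_LAST_BODY_CLOSE =
            Pattern.compile("(?is)</body>(?:[^<]++|<(?!/body>))*+$");

    static String stripAfterLastBodyClose(String html) {
        // An earlier </body> cannot match: the possessive loop stops at the
        // next </body> and "$" then fails without backtracking.
        return AFTER_LAST_BODY_CLOSE.matcher(html).replaceFirst("");
    }

    public static void main(String[] args) {
        String html = "<html><body>one</body><p>stray</p></body></html>";
        System.out.println(stripAfterLastBodyClose(html));
    }
}
```

With this input the first, invalid </body> is left in place and only the final </body></html> is removed.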

  • Is there a way to select MULTIPLE tabs and then copy ALL of the the URLs and titles/or URLs+titles+HTML links? This can be done with the Multiple Tab Handler add on; However, I prefer to use a Firefox feature rather than download an add on. Thanks.

    Currently, I can copy ONE tab's url and nothing else (not its name). Or I can bookmark all tabs that are open. However, I'd like to have the ability to select multiple tabs and then copy ALL of the the URLs AND their titles/or copy ALL of the URLs+titles+HTML links? This can be done with the Multiple Tab Handler add on; when I download the add on, I get a message saying that using the add on will disable Firefox's tab features. I prefer to use Firefox features rather than download and use an add on. Is there a way to do this without an add on?

    Hi LRagsdale517,
    You should definitely be able to upload multiple files by Shift-clicking or Ctrl-clicking the files you want to upload. Just to make sure you don't have an old version of the service cached, please clear the browser cache and then log in to https://cloud.acrobat.com/files. After clicking the File Upload icon in the upper-right corner, you should be able so select multiple files for upload.
    Please let us know how it goes.
    Best,
    Sara

  • How can i send data POST with a html link in a textfield

    Hello,
    This is my problem: I generate HTML links in a text field (from a PHP script). For each link, I would like to send data with the POST method to another script.
    My problem is that the getURL("lien", "", "POST") function can only be used from a movie clip or button event, and not from an HTML link.
    How can I do it?
    Do you have any idea?
    Thanks.

    Yes, thanks, but now my problem is in the function associated with this link:
    function SendPost() {
        var toto = "toto";
        getURL("http://127.0.0.1/board/scheduledfirst.php", "_blank", "GET");
    }
    The function is executed, but afterwards in PHP I can't get anything with echo ($_GET["toto"]); or echo ($_POST["toto"]);
    Why?

  • Exporting to PDF with hot html links

    I'm trying to create a pdf document that has functioning html links. Does Pages let you do this?

    To create a PDF with live links you'll need to buy a full version of Adobe Acrobat. Adobe Reader is just as titled - a reader. It will only allow reading and basic markup of pdfs.
    The PDFs created through the Print menu are also very simple PDFs. Their creation is not a function of the program the original document comes from, be it Pages, Numbers, Word, Excel, Appleworks etc. It is based on the operating system using technology which is probably licensed from Adobe.
    To get all the functionality available in a PDF, you've got to pay the piper - Adobe - for the software.
    Good luck,
    Terry

  • Having problems passing more than one parameter with html:link tag

    Hi guys,
    for my web application I'm using Struts. I've got a database with user details. I would like to get the list of users and link to the details of each user. I wrote the code and everything works fine, except that the user list is repeated as many times as there are users in the list.
    For ex: I have in the database User1, User2 and User3. I would like to have a result like:
    User1
    User2
    User3
    Instead of it I have the result like:
    User1
    User2
    User3
    User1
    User2
    User3
    User1
    User2
    User3
    What am I doing wrong? Could somebody help me, please?
    Thank you in advance.
    Here is a snippet of the code that I'm using in the JSP:
    <code>
    <logic:iterate id="root" name="user">               
                   <%
                        java.util.HashMap users = new java.util.HashMap();
                        params.put("user",root);
                        pageContext.setAttribute("usersName", users);
                   %>
                   <html:link name=" usersName " scope="page" page="/name.do">
                        <logic:iterate id="folder" name="user">
                             <bean:write name="folder" /><br>
                        </logic:iterate>
                   </html:link><br>
                   </logic:iterate>
    </code>

    Suggestion: next time you post code use the "CODE" button to put code tags around it. It formats much nicer that way :-)
    You have a nested loop structure here.
    <logic:iterate id="root" name="user">
      <%
        java.util.HashMap users = new java.util.HashMap();
        params.put("user",root);
        pageContext.setAttribute("usersName", users);
      %>
      <html:link name=" usersName " scope="page" page="/name.do">
        <logic:iterate id="folder" name="user">
          <bean:write name="folder" /><br>
        </logic:iterate>
      </html:link><br>
    </logic:iterate>
    Both loops iterate over your "user" collection.
    The first loop visits each user.
    The second loop then visits each user again, so you get number of users * number of users entries: 3 groups of 3.
    If you had 4 users, you would get 4 groups of 4.
    I only see you setting one parameter: "usersName". What other parameters do you need to pass?
    At a guess, the inner loop is unnecessary, and you want to write the user's name as the text for the link and also use it as a link parameter.
    <logic:iterate id="root" name="user">
      <%
        java.util.HashMap users = new java.util.HashMap();
        users.put("user", root);
        pageContext.setAttribute("usersName", users);
      %>
      <html:link name="usersName" scope="page" page="/name.do">
          <bean:write name="root" /><br>
      </html:link><br>
    </logic:iterate>

  • Send over HTML-Link an string to PDF with openparameters ?

    Hi,
    Is it possible to open a 3D PDF via an HTML link and pass strings to it (I don't mean a search string for the internal search function)?
    Thanks for help !

    It's very difficult to explain in words.
    I need help from this forum because I want my book (an interactive PDF) to open an external application, which is a sort of presentation with video and images created in Director.
    If I try to create a hypertext link between a button and this file, I am not given the option to choose it. So I thought of writing a script like - launch "start.app" - to make my presentation run (it does run if I ask the script to run).
    But I wonder: how can I make this event happen only when I ask my project to do it?
    For example: I am reading my book in Adobe Digital Editions and at the end of the third chapter there is an image. I mouse over it and realize it's a button. When I click it, a new window opens with my presentation running and I watch it. At the end I quit and keep on reading my book in Adobe Digital Editions.
    Do you have any idea?
    I hope I have explained it better this time.

  • Can place a swf with wmode transparent on top html without effect html link?

    I am using a DIV to put my SWF (param wmode, value=transparent) on top of the HTML, and I can see that the SWF background is transparent. But one thing happens: the HTML link is overlapped by the SWF, so I cannot click or even highlight the link in the HTML.
    If any expert understands what I mean, could you give me some advice or a solution? Thanks.

    adreny,
    > i using DIV to put my swf (Param=wmode value=transparent)
    > on top of the html, after that i can see that the swf background
    > is transparent.
    That's right.
    > but one thing happen , the html link has been overlap by the swf.
    > i cannot click even highlight the link of the html.
    This is one of the hazards of using wmode, I'm afraid. Some people feel strongly that it should never be used, but I personally feel it has some uses, even with its inherent disadvantages. There are more things to consider, in fact. Check out a few thoughts on the subject by Justin Everett-Church:
    http://justin.everett-church.com/index.php/2006/02/23/wmode-woes/
    David Stiller
    Co-author, Foundation Flash CS3 for Designers
    http://tinyurl.com/2k29mj
    "Luck is the residue of good design."

  • Struts - HTML:link tag with dynamic page attribute?

    I am trying to use html:link, but the page value is dynamic (it comes from the bean inside the iterate tag). The code below doesn't work; it produces an error. Is there a way to do this using only Struts tags? Any ideas?
    <logic:iterate id="myMenuForm" property="menuItem" scope="session" name="menuForm">
      <html:link page="<bean:write name="myMenuForm" property="menuDisplayName"/>">
        <bean:write name="myMenuForm" property="menuDisplayName"/>
      </html:link>
    </logic:iterate>
    Thanks,
    mlv

    Thanks for all your help. Based on all your comments, it is now working, and I thought I would provide some details about how I got it working. There are not that many examples within the Struts documentation.
    id = a unique name for the object that can be referenced within the iterate tag.
    name = the bean, in this case the ActionForm, that contains the List object that will be iterated. In my case, I have an ActionForm that contains a List property holding all the menu item rows from the database.
    property = the property name within the ActionForm that is the List object. It maps to the getter method.
    Within the <bean:write> tag you use the id name to get a handle on the current object from the iterate tag, and the property is one of the property values of the bean that is contained in the List.
    So in this case I have an ActionForm bean that contains a List property that contains a collection of beans.
    FormAction - bean
    property menuItems of type List contains:
    - row 1 = bean (with a property of menuDisplayName)
    - row 2 = bean (with a property of menuDisplayName)
    - row 3 = bean (with a property of menuDisplayName)
    Here's the JSP code for the tag.
    <logic:iterate id="myMenuForm" name="menuForm" property="menuItem">
      <html:link page="www.yahoo.com">
        <bean:write name="myMenuForm" property="menuDisplayName"/>
      </html:link>
    </logic:iterate>
    Now I would like to make the <html:link> page attribute value dynamic, taken from the iterate bean, so that the URL for the link is dynamic. If you know how to do that, please feel free to provide some help.
