HTML parser or regular expression

Dear all,
I know some of you will think my problem can be solved by an open-source html parser but I tested the following list of parsers (http://java-source.net/open-source/html-parsers) and failed to find one that meets my requirement as I explained below.
I would like to parse an HTML file and fetch the hyperlinks from it.
I wrote the following regular expression and it works in most cases:
.*(src|href|url|action)\s*=\s*["|']?(.*?)["|'|\s?|>].*
However, I have two problems so far:
1. For "<a href="directory.html">Directory</a> | <a href="a-z.html">A - Z</a>", I expected to fetch "directory.html" and "a-z.html" but I only got the last one.
2. I expected to exclude "http://www.javaeye.com/upload.jpg" in "<img alt="subwayline13" class="logo" src="http://www.javaeye.com/upload.jpg" title="subject" />". I still could not find a solution for this.
Therefore, I would be grateful for any new advice.
Merry Christmas and Happy New Year!
Pengyou

pengyou wrote:
> Dear all,
> I know some of you will think my problem can be solved by an open-source html parser but I tested the following list of parsers (http://java-source.net/open-source/html-parsers) and failed to find one that meets my requirement as I explained below.
Then you did something wrong when you were using the parser.
> I would like to parse an HTML file and fetch the hyperlinks from it.
> I wrote the following regular expression and it works in most cases:
> .*(src|href|url|action)\s*=\s*["|']?(.*?)["|'|\s?|>].*
> However, I have two problems so far:
> 1. For "<a href="directory.html">Directory</a> | <a href="a-z.html">A - Z</a>", I expected to fetch "directory.html" and "a-z.html" but I only got the last one.
> 2. I expected to exclude "http://www.javaeye.com/upload.jpg" in "<img alt="subwayline13" class="logo" src="http://www.javaeye.com/upload.jpg" title="subject" />". I still could not find a solution for this.
> Therefore, I would be grateful for any new advice.
Same advice as before.
1. Use an existing html parser correctly.
2. Write your own html parser. An actual parser. A parser would be part of your solution, not the entire solution.
And more advice... do not attempt to use regexes to parse html, nor xml for that matter. The reason is that by the time you get it right, if ever, you will have built a parser. So instead start with one right away.
I suspect that your actual problem is that you don't know what a parser is and what it should do. So you think that a "parser" should give you the result you want rather than giving you tokens. A parser parses a source based on a grammar and produces tokens. A token is not an image file until you further interpret a particular token that way.
Finally note that in the above I said you could build your own parser if you wanted. But then you must in fact build a parser. If you do it correctly then you are going to end up with something that is functionally equivalent to one of the existing parsers. If you do it wrong then it won't.
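For reference, the first problem in the question (only the last link is returned) most likely comes from the leading greedy `.*`, which consumes everything up to the last attribute on the line. A minimal sketch, assuming one only wants every attribute value in a line and accepts that this is still not a parser: drop the surrounding `.*` and call `find()` in a loop. The class and method names (`LinkExtractor`, `extract`) are hypothetical, and values containing spaces would be cut short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Attribute name, '=', an optional quote, then the value up to a quote, whitespace or '>'.
    private static final Pattern LINK = Pattern.compile(
            "\\b(?:src|href|url|action)\\s*=\\s*[\"']?([^\"'\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects every matching attribute value, in document order.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {          // find() moves on to the next match on each call
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"directory.html\">Directory</a> | <a href=\"a-z.html\">A - Z</a>";
        System.out.println(extract(line));
    }
}
```

Excluding the `src` of `<img>` tags (the second problem) needs knowledge of which tag an attribute belongs to, which is exactly the kind of context a real parser provides.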

Similar Messages

  • Replace All with Regular Expression

    Hi all,
    I need help replacing the String:
    to
    "<a href=\"1\">Java Programming</a>"
    I was trying to replace it with
    .replaceAll("\\[\\[*([^\\]]*?)*\\]\\]", "<a href=\"\"></a>")
    but I don't know how to separate the parameter values.
    Best regards

    Hi prometheuzz,
    you are the one!!
    But how could I do this for a String like this:
    Ah, more requirements...
    Things are getting a bit messy now:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    class Main {
        public static void main(String[] args) {
            String line = "any text any text any text any text any text. "+
                          "Click - [[Code: 6 Title: Java Programming]]. "+
                          "any text any text any text any text any text. "+
                          "- [[Código: 2 Título: Sun]] text text "+
                          "any text any text any text any text any text.";
            String newLine = line;
            String[] wikiTags = getWikiTags(newLine);
            for(String tag : wikiTags) {
                // Build the anchor from the code and the title found inside the tag.
                String newTag = "<a href=\""+get("(Code:|Código:)", " ", tag)+
                                "\">"+get("(Title:|Título:)", "$", tag)+"</a>";
                // Quote the tag and the replacement so regex metacharacters are taken literally.
                newLine = newLine.replaceFirst("\\[\\["+Pattern.quote(tag)+"\\]\\]",
                                               Matcher.quoteReplacement(newTag));
            }
            System.out.println(line);
            System.out.println(newLine);
        }
        // Returns the text found between each [[ and ]] pair.
        public static String[] getWikiTags(String text) {
            java.util.List<String> list = new java.util.ArrayList<String>();
            Pattern pattern = Pattern.compile("(?<=\\[\\[)(.*?)(?=\\]\\])");
            Matcher matcher = pattern.matcher(text);
            while(matcher.find()) {
                list.add(matcher.group());
            }
            return list.toArray(new String[list.size()]);
        }
        // Returns the text between the 'start' and 'end' patterns, or "#ERROR#" if there is none.
        public static String get(String start, String end, String text) {
            Pattern pattern = Pattern.compile("(?<="+start+"\\s)(.*?)(?="+end+")");
            Matcher matcher = pattern.matcher(text);
            return matcher.find() ? matcher.group() : "#ERROR#";
        }
    }
    Details about regex:
    http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
    http://www.regular-expressions.info/java.html
    http://java.sun.com/docs/books/tutorial/essential/regex/
    > Thanks a lot for your help
    Like I said, it's a bit of a messy solution. See if you can use (a part of) it.
    Good luck.

  • Quick regular expression question/help

    Can someone help me with two regular expressions I need? I could spend a while trying to figure it out myself, however time's short and I would really like to get a foolproof, optimal solution (my own attempt would be buggy).
    Sample sentence
    The population, is projected to reach 200,000, or more (by 2020).[7] This is {dummy} text.
    The first regular expression
    I need all brackets and every thing between them to be removed from a sentence.
    Brackets such as: ( ), [ ] and { } .
    I.e. Given the above sentence the following would be returned:
    The population, is projected to reach 200,000, or more. This is text.
    The second regular expression
    If a word has a trailing comma character I need to add a whitespace between the word and the comma.
    I.e. Given the sentence returned from the first regular expression, this regex would return:
    The population *,* is projected to reach 200,000 *,* or more. This is text.
    Many thanks to anyone who can help me with this!
    Edited by: Myles on Jan 18, 2008 8:12 AM

    http://java.sun.com/docs/books/tutorial/extra/regex/index.html
    http://www.regular-expressions.info
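
    A minimal sketch of the two replacements asked for above, assuming brackets are never nested and that a "trailing comma" means a comma followed by whitespace or the end of the text. The class and method names (`BracketCleaner`, `stripBrackets`, `spaceBeforeTrailingComma`) are hypothetical.

    ```java
    class BracketCleaner {
        // Removes (...), [...] and {...} groups together with any whitespace directly before them.
        static String stripBrackets(String s) {
            return s.replaceAll("\\s*(?:\\([^)]*\\)|\\[[^\\]]*\\]|\\{[^}]*\\})", "");
        }

        // Inserts a space before a comma that ends a word, i.e. one followed by whitespace or end of input.
        // The comma inside "200,000" is followed by a digit, so it is left alone.
        static String spaceBeforeTrailingComma(String s) {
            return s.replaceAll("(?<=\\S),(?=\\s|$)", " ,");
        }

        public static void main(String[] args) {
            String sample = "The population, is projected to reach 200,000, or more (by 2020).[7] This is {dummy} text.";
            String cleaned = stripBrackets(sample);
            System.out.println(cleaned);
            System.out.println(spaceBeforeTrailingComma(cleaned));
        }
    }
    ```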

  • Quick regular expression question!

    If I have a String such as:
    "This is a sentence.[1] This is another."
    OR
    "This is a sentence.(1) This is another."
    I.e. A String has a full stop within it and a non alphanumeric character immediately after it (without whitespace between the full stop and the character)
    How can I insert a whitespace character between the full stop and the non alphanumeric character (such as a bracket in the above examples)?
    So the above Strings would be transformed into:
    "This is a sentence. [1] This is another."
    "This is a sentence. (1) This is another."
    Thanks

    If I understand what you're asking...
    str = str.replaceAll("\\.([^\\p{Alnum}\\s])", ". $1");
    Input:
    "This is sentence.[1] This is another.(1) This is a third. [1] This is a fourth.& This is a fifth. This is the last."
    Output:
    "This is sentence. [1] This is another. (1) This is a third. [1] This is a fourth. & This is a fifth. This is the last."
    For more info:
    http://java.sun.com/docs/books/tutorial/extra/regex/index.html
    http://www.regular-expressions.info/

  • Regular Expression to remove space in HTML Tag

    Hello All,
    My HTML string is like below.
    select '<CityName>RICHMOND</CityName> 
    <StateCd>ABCD CDE 
    <StateCd/>
    <CtryCd>CAN</CtryCd>
    <CtrySubDivCd>BC</CtrySubDivCd>' Str from dual
    Desired Output is
    <CityName>RICHMOND</CityName><StateCd>ABCD CDE 
    <StateCd/><CtryCd>CAN</CtryCd><CtrySubDivCd>BC</CtrySubDivCd>
    i.e. I want to remove the whitespace between tags where it consists only of spaces, and otherwise leave the string as it is. Please help me implement this using a regular expression.

    Hi,
    It's unclear what you want.  This site seems to be formatting your message in some odd way.
    Post a statement like
    SELECT '...' FROM dual;
    without any formatting, to show your input, and post the exact output you want from that, with as little formatting as possible.  It might help if you use some character like ~ instead of spaces (just for posting; we'll find a solution that works for spaces).
    To remove the text that consists of spaces and nothing else between the tags, you can say
    REGEXP_REPLACE ( str
                   , '> +<'
                   , '><'
                   )
    How is this string being generated?  Maybe there's some easier, more efficient way to keep the bad sub-strings out of the string in the first place.

  • Replacing space with &nbsp; in html using regular expressions

    Hi
    I want to replace spaces with &nbsp; in HTML.
    I used the method below to replace the spaces in my html file.
    var spacePattern:RegExp = /(\s)/g;
    str = str.replace(spacePattern, "&nbsp;");
    Here the str variable contains the html file below. In this html file I want to replace the spaces in " What number does this  represents" with &nbsp;
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled Document</title>
    </head>
    <body>
    <b><TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0" KERNING="0"><B></B></FONT></P></TEXTFORMAT><TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0" KERNING="0"><B> What number does this Roman numeral represents MDCCCXVIII ?</B></FONT></P></TEXTFORMAT></b>
    </body>
    </html>
    But by using the above regular expression i am getting like this.
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled Document</title>
    </head><body>
    <b><TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0" KERNING="0"><B></B></FONT></P></TEXTFORMAT><TEXTFORMAT LEADING="2"><P A LIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0 " KERNING="0"><B> What number does this represents</B></FONT></P></TEXTFORMAT></b>
    </body>
    </html>
    What happens is that it replaces spaces with &nbsp; inside the HTML tags as well. But I want to replace only the spaces outside of the HTML tags. I want output like this, using regular expressions in Flex:
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled Document</title>
    </head>
    <body>What number does this represents</body>
    </html>
    Hi, please give me a solution to the above problem using regular expressions.
    Thanks in Advance to all
    Regards
    ssssssss

    Sorry, I missed some information above. The modified information was in red.
    What happens is that it replaces spaces with &nbsp; inside the HTML tags as well. But I want to replace only the spaces outside of the HTML tags. I want output like this, using regular expressions in Flex:
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled Document</title>
    </head>
    <body>What&nbsp;number&nbsp;does&nbsp;this&nbsp;represents</body>
    </html>
    Hi, please give me a solution to the above problem using regular expressions.
    Thanks in Advance to all
    Regards
    ssssssss
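
    One common way to leave the tags alone is a negative lookahead: a space is treated as "inside a tag" when the next angle bracket after it is a closing `>`. The thread is about Flex, so the sketch below is only an illustration in Java of that pattern idea; the class and method names are hypothetical, and it assumes attribute values and text never contain a literal `<` or `>`.

    ```java
    class NbspOutsideTags {
        // Replaces a space only if it is NOT followed by "non-bracket characters, then '>'",
        // which is how a space inside <...> looks from its own position.
        static String replaceSpaces(String html) {
            return html.replaceAll(" (?![^<>]*>)", "&nbsp;");
        }

        public static void main(String[] args) {
            String html = "<P ALIGN=\"LEFT\"><B> What number does this represents</B></P>";
            System.out.println(replaceSpaces(html));
        }
    }
    ```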

  • Regular Expressions for converting HTML to Structured Plain Text

    I'm writing a PL/SQL function that will convert HTML to plain text, but still preserve some of the formatting/line breaks. One of my challenges is in writing a regular expression to capture the text blocks while ignoring the markup. I'm trying to write an expression that will grab all of the text between start/end tags, but discard the tags. For example, to find all of the text between a start/end paragraph, I want to do something like:
    REGEXP_REPLACE('<p style="text-align:center;">This is the body of the paragraph</p>', '<p.*>(.*)</p>', '\1' || v_crlf )
    where \1 returns the contents of the paragraph and v_crlf (declared earlier in the function) inserts a line break. I know there are more general expressions that will remove all tags, but I want to specifically identify the tags so I can process them appropriately. This way I can easily convert HTML to plain text for email and reporting without having to keep two versions around. Any help would be greatly appreciated. Once I get this worked out, I will repost with the function code for others to use. Thanks.
    Edited by: jritschel on Oct 26, 2010 9:58 AM

    Here's a function I wrote for an app. I'm not making any promises on its accuracy as the app was just a proof of concept and never made it to production.
    function strip_html( p_clob in clob )
    return clob
    is
        l_out clob;
        l_test  number := 0;
        l_max_loops constant number := 20;
        i   pls_integer := 0;
    begin
        l_out := regexp_replace(p_clob,'<br>|<br />',chr(13)||chr(10),1,0,'imn');
        l_out := regexp_replace(l_out,'<p>',chr(13)||chr(10),1,0,'imn');
        l_out := replace(l_out,'<li>',chr(13)||chr(10)||'*<li>');
        l_out := regexp_replace(l_out,'<b>(.+?)</b>','*\1*',1,0,'imn');
        l_out := regexp_replace(l_out,'<u>(.+?)</u>','_\1_',1,0,'imn');
        loop
            l_test := regexp_instr(l_out,'<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>',1,1,0,'imn');
            exit when l_test = 0 or i > l_max_loops;
            l_out := regexp_replace(l_out,'<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>','\2',1,0,'imn');
            i := i + 1;
        end loop;
        return l_out;
    end strip_html;
    The loop is there to handle nested HTML.
    Tyler Muth
    http://tylermuth.wordpress.com
    "Applied Oracle Security: Developing Secure Database and Middleware Environments": http://sn.im/aos.book
    Edited by: Tyler on Oct 26, 2010 10:03 AM

  • Getting "Inner Html" using Regular Expressions. Learning RE in SDK1.4.

    Hello group.
    I am learning Regular Expressions in JAVA SDK 1.4 first. Not PERL or other language.
    Using the utility at the following link I am trying to get all the text between the <TR> and </TR> tags.
    http://jakarta.apache.org/oro/demo.html
    This seems simple but the line returns, breaks etc.. make it more difficult. I have worked on this for hours.
    There will be multiple table rows in my stream.
    My goal is to first get the text between the <TR> Tags...
    Then I was going to use groups to get data0, data1, data2, data3.
    Does this sound like a good plan? Should I use multiple RE or one RE that does 4 group returns.
    I was thinking the applet was causing my problem.
    <TR>.*?</TR> does not work.
    (<tr>\s*([^(</tr>)])+</tr>) does not work.
    I can get data0 to work as well as data1,2,3.
    Would it make more sense to split this multiple row table by </tr>?
    One row of malformed html (actually multiple rows):
    <TR>
    <TD bgColor=#ffffff><A class=fav
    href="http://nicesite.com/data0"
    >nicesite</A><IMG
    src="smile.gif"></TD>
    <TD bgColor=#12ff22><SPAN class=fav>data1</SPAN></TD>
    <TD bgColor=#12ff22><SPAN class=fav>data2</SPAN></TD>
    <TD bgColor=#12ff22><SPAN class=fav>data3</SPAN></TD>
    <TD align=middle bgColor=#ffffff><A
    href="#"><IMG
    src="smile.gif" border=0></A></TD>
    <TD align=middle bgColor=#ffffff>data
              4</TD>
    <TD align=middle bgColor=#ffffff>data5</TD></TR>
    s_____ I have seen some of your posts and tried to apply them. What do you think?
    Regards,
    NupeVic

    > http://jakarta.apache.org/oro/demo.html
    I prefer
    http://jregex.sourceforge.net/demoapp.html
    > This seems simple but the line returns, breaks etc.. make it more difficult.
    Yes, they do indeed.
    > There will be multiple table rows in my stream.
    > My goal is to first get the text between the <TR> Tags...
    > Then I was going to use groups to get data0, data1, data2, data3.
    > Does this sound like a good plan? Should I use multiple RE or one RE that does 4 group returns.
    One of the main features of regexes that you must realize is that they are mainly suited for non-recursive, linear data structures (btw, that's why regexes in general are hardly suited for html).
    So, if the number of TD items is fixed, you could
    1. search using a single pattern for the whole row, something like
    "<PatternForTR>"+
    "<PatternForTD>(<PatternForData>)<PatternFor/TD>"+
    "<PatternForTD>(<PatternForData>)<PatternFor/TD>"+
    "<PatternFor/TR>"
    so that group1 would contain data1 and so on.
    Otherwise, you should
    2. find each row using
    "<PatternForTR>(.*?)<PatternFor/TR>",
    then search the contents of group1 using
    "<PatternForTD>(<PatternForData>)<PatternFor/TD>".
    > <TR>.*?</TR> does not work.
    The pattern itself is ok, but in order for it to work one should enable the DOTALL flag (the 's' flag in the jregex demo), as the '.' doesn't accept line breaks by default.
    > (<tr>\s*([^(</tr>)])+</tr>) does not work.
    It seems that [^(</tr>)]+ is actually nonsense in this context.
    It describes a string that consists of any chars but '(', ')', '<', '>', 'r', 't', '/'.
    What you actually meant (a string that doesn't contain "</tr>") is achieved simply by using the non-greedy quantifier in <TR>.*?</TR>.
    > I can get data0 to work as well as data1,2,3.
    > Would it make more sense to split this multiple row table by </tr>?
    Going the second way above, you could find rows using the general pattern for TR:
    <tr.*?>(.+?)</tr>
    and search their contents (i.e. group#1) using the general pattern for TD:
    <td.*?>(.+?)</td>
    Finally, this is the specific pattern for TD that doesn't include the leading and trailing tags in group1:
    <td[^>]*>(?:\s*</?[^>]*>)*\s*(.+?)(?:\s*</?[^>]*>)*\s*</td>
    It succeeded in finding
    nicesite
    data1
    data2
    data3
    data
    4
    data5
    in your sample.
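
    The second approach above (find each row, then search its contents) can be sketched with java.util.regex roughly as follows. This is only an illustration under assumptions: tables are not nested, remaining inner tags are simply stripped, and the names `TableRows` and `parse` are hypothetical.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class TableRows {
        // DOTALL lets '.' cross line breaks; CASE_INSENSITIVE accepts <TR> as well as <tr>.
        private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
        private static final Pattern TR = Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
        private static final Pattern TD = Pattern.compile("<td\\b[^>]*>(.*?)</td>", FLAGS);

        // Returns one list of cell texts per table row.
        static List<List<String>> parse(String html) {
            List<List<String>> rows = new ArrayList<>();
            Matcher tr = TR.matcher(html);
            while (tr.find()) {
                List<String> cells = new ArrayList<>();
                Matcher td = TD.matcher(tr.group(1));   // search only inside the current row
                while (td.find()) {
                    String text = td.group(1)
                            .replaceAll("<[^>]+>", "")   // drop inner tags such as <SPAN>
                            .replaceAll("\\s+", " ")     // collapse line breaks and runs of spaces
                            .trim();
                    cells.add(text);
                }
                rows.add(cells);
            }
            return rows;
        }

        public static void main(String[] args) {
            String html = "<TR>\n<TD bgColor=#12ff22><SPAN class=fav>data1</SPAN></TD>\n"
                        + "<TD align=middle bgColor=#ffffff>data\n          4</TD></TR>";
            System.out.println(parse(html));
        }
    }
    ```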

  • HTML Escaping Regular Expression

    Assume I have the following string:
    <font face="arial">this</font> is a <b>very nice</b> " <a href>String</a>
    Now say I want to allow everything above, except I want to escape certain tags... i.e. I want to allow:
    <b>,</b>,<font ...>,</font> and nothing else. The escaped string above should then be converted to:
    <font face="arial">this</font> is a <b>very nice</b> " &lt;a href&gt; String &lt;/a&gt;
    The idea I'm trying to implement is to allow form input that may contain limited html tags, that I define, and escape anything else.
    This seems like it would be an existing regular expression; does anyone have any ideas?
    Thanks!

    Kudos and thanks for the regex. I did not know how to do negation in a regular expression, i.e. (?!FONT|B)
    To answer an earlier post, the application works much like a message board. I accept input from a textarea, then write that input out as an html file (much like how this forum works). I'd like to accept certain HTML markup in the input, but disallow tags like "<javascript>", "<object>" and "<embed>", etc., as writing those tags would allow users to post malicious input (redirects, popups, etc.). Thus, defining and parsing the tags I will accept is easier (and safer) than defining the tags not to accept.
    That being said, escaping quotes is somewhat important, as a " in html that is not in a tag should really be converted to &quot; for w3c browser standards. However, with 95% of my original question answered (that being the most important part), I'm satisfied with this. Thanks to all for the help thus far!
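
    The regex referred to above is not quoted in this thread, so the following is only a sketch of how a negative lookahead such as (?!FONT|B) can be used for this, assuming the allowed tags are b and font. The names `TagWhitelist` and `escapeDisallowed` are hypothetical. It is not a complete sanitizer: attributes on allowed tags (for example onclick) pass through unchanged.

    ```java
    import java.util.regex.Pattern;

    class TagWhitelist {
        // Matches any <...> that is NOT an opening or closing b/font tag.
        // The lookahead checks the tag name right after '<' (and an optional '/').
        private static final Pattern DISALLOWED = Pattern.compile(
                "<(?!/?(?:b|font)\\b[^>]*>)([^>]*)>", Pattern.CASE_INSENSITIVE);

        // Escapes the angle brackets of every disallowed tag and keeps its content visible.
        static String escapeDisallowed(String input) {
            return DISALLOWED.matcher(input).replaceAll("&lt;$1&gt;");
        }

        public static void main(String[] args) {
            String in = "<font face=\"arial\">this</font> is a <b>very nice</b> <a href>String</a>";
            System.out.println(escapeDisallowed(in));
        }
    }
    ```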

  • Splitting html ul tags and their content into string arrays using regular expression

    <ul data-role="listview" data-filter="true" data-inset="true">
    <li data-role="list-divider"></li><li><a href="#"><h3>
    my title
    </h3><p><strong></strong></p></a></li>
    </ul>
    <ul data-role="listview" data-filter="true" data-inset="true">
    <li data-role="list-divider"></li><li>test.</li>
    </ul>
    I need to be able to split this html into a string array whose elements each hold an entire <ul></ul> tag. Please help.
    Thanks.

    Hi friend.
    This forum is to discuss problems of C# development. Your question is not related to the topic of this forum.
    You'll need to post it in the dedicated Archived Forums N-R  > Regular Expressions
     for better support. Thanks for understanding.
    Best Regards,
    Kristin

  • Java Regular Expression to grab html tags

    Dear all,
    I have written a regular expression in java to grab the pairs of html tags in a String. It worked fine except that it cannot handle space or new line. What have I done wrong?
    My regular expression (I would use <tr></tr> as an example):
    <tr[^>]*>(.*?)</tr>
    It would work for <tr><one>two<three></tr>
    but not <tr ><one>two<three></tr>
    or <tr> <one> two <three></tr>
    or <tr>
    <one>two<three>
    </tr>
    Thanks a lot in advance

    > I have written a regular expression in java to grab the pairs of html tags in a String.
    I'll make one last-ditch suggestion that you grab a decent HTML parser, as the HTML specification allows for HTML tags that don't come in pairs. While I'm sure you will be able to eventually write regexes to handle this, it may be easier (depending on your requirements) to use tools to parse the HTML.
    Good luck!

  • Regular expression for BBcode list to html list

    Hi,
    we are migrating BBforum to Jive forum.
    BBforums has data which contains BBcode Strings. I found the following code after googling.
    // requires java.util.HashMap and java.util.Map
    public static String bbcode(String text) {
        String html = text;
        Map<String, String> bbMap = new HashMap<String, String>();
        bbMap.put("(\r\n|\r|\n|\n\r)", "<br/>");
        bbMap.put("\\[b\\](.+?)\\[b\\]", "<strong>$1</strong>");
        // Apply every BBcode pattern to the text in turn.
        for (Map.Entry<String, String> entry : bbMap.entrySet()) {
            html = html.replaceAll(entry.getKey(), entry.getValue());
        }
        return html;
    }
    I have BBcode with a format like
    [list] [*]blue[*]red[*] green[list]
    I have to replace this with <ul><li>blue</li><li>red</li>
    Can anyone suggest a Java regular expression which does the replacement above?
    Edited by: 875452 on Jul 31, 2011 8:03 AM

    Moderator advice: Please read the announcement(s) at the top of the forum listings and the FAQ linked from every page. They are there for a purpose.
    Then edit your post and format the code correctly.
    Moderator action: Moved from Development Tools » General Questions
    db
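
    A minimal sketch of the list conversion asked about above. It assumes the list is closed with [/list] (the post shows [list] at both ends, so the closing form is a guess) and that lists are not nested. The names `BbList` and `convert` are hypothetical.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class BbList {
        // Assumption: a list runs from [list] to [/list]; DOTALL lets items span lines.
        private static final Pattern LIST = Pattern.compile(
                "\\[list\\](.*?)\\[/list\\]", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        static String convert(String text) {
            Matcher m = LIST.matcher(text);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                StringBuilder ul = new StringBuilder("<ul>");
                // Each [*] starts a new item; empty pieces (e.g. before the first [*]) are skipped.
                for (String item : m.group(1).split("\\[\\*\\]")) {
                    String trimmed = item.trim();
                    if (!trimmed.isEmpty()) {
                        ul.append("<li>").append(trimmed).append("</li>");
                    }
                }
                ul.append("</ul>");
                // quoteReplacement keeps '$' and '\' in item text from being read as group references.
                m.appendReplacement(out, Matcher.quoteReplacement(ul.toString()));
            }
            m.appendTail(out);
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(convert("[list] [*]blue[*]red[*] green[/list]"));
        }
    }
    ```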

  • Regular expression for html links

    Hi, I'm trying to get text/link pairs from a string, accepted links
    are like:
    text1
    text2
    The expected result would be:
    url: url1
    Text: text1
    url: url2
    Text: text2
    I use the following regular expression to catch the texts and the urls:
    "<a href=\"*(.*)\"*.*>(.*)</a>"
    group(1) should be the url and group(2) the text.
    But it doesn't work properly; I got something like:
    url: http://url1/" garbagetags
    text: text1
    url: url2
    text: text2
    I'm trying to handle links with and without " as well as dynamic html
    tags.
    I think the problem is the regular expression string. I'm new to using
    them and I can't find the right one; if you know what's wrong with
    my R.E. string, please help me!
    Thanks

    Had to break it in to two regular expressions:
    import java.util.regex.*;
    class B2 {
        public static void main(String[] args) {
            //String INPUT = "<a href=\"http://url1/\" garbagetags>text1</a>";
            //String INPUT = "<a href=url2>text2</a>";
            //String INPUT ="<a href=\"http://www.google.com\">Google search engine</a>";
            String INPUT="<a id=1a class=q href=\"/imghp?hl=en&tab=wi&ie=UTF-8&oe=UTF-8\" onClick=\"return c('www.google.com/imghp','wi',event);\"><font size=-1>Images</font></a>";
            //String REGEX = "<a .*href=\\\"?h?t?t?p?:?/?/?([\\w\\.\\?\\&=\\-\\d]*)/?\\\"?.*>(.*)</a>";
            String REGEX = "<a .*href=\\\"?h?t?t?p?:?/?/?([\\w\\.\\?\\&=\\-\\d]*)/?\\\"?.*>";
            String REGEX2 = ">\\b([\\w\\s\\d]+)\\b<";
            // First pattern: capture the URL from the href attribute.
            Pattern p = Pattern.compile(REGEX);
            Matcher m = p.matcher(INPUT);
            if (m.find()) {
                System.out.println(m.group(1) + "     ");
            } else {
                System.out.println("No match found");
            }
            // Second pattern: capture the link text between '>' and '<'.
            Pattern p2 = Pattern.compile(REGEX2);
            Matcher m2 = p2.matcher(INPUT);
            if (m2.find()) {
                System.out.println(m2.group(1) + "     ");
            } else {
                System.out.println("No match found");
            }
        }
    }
    You do realize that you'll never get 100% accuracy with this. There are too many possible variations to account for them all.
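
    For the simpler inputs in the question (href quoted or unquoted, extra attributes after it), a single pattern with one alternative per quoting style may be enough. This is a sketch under those assumptions only, with hypothetical names `AnchorPairs` and `extract`; it inherits the accuracy limits mentioned above.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class AnchorPairs {
        // href value is captured by group 1 ("..."), group 2 ('...') or group 3 (unquoted);
        // group 4 is everything between the opening tag and the next </a>.
        private static final Pattern ANCHOR = Pattern.compile(
                "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))[^>]*>(.*?)</a>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        // Returns {url, text} pairs in document order.
        static List<String[]> extract(String html) {
            List<String[]> pairs = new ArrayList<>();
            Matcher m = ANCHOR.matcher(html);
            while (m.find()) {
                String url = m.group(1) != null ? m.group(1)
                           : m.group(2) != null ? m.group(2)
                           : m.group(3);
                pairs.add(new String[] { url, m.group(4) });
            }
            return pairs;
        }

        public static void main(String[] args) {
            String html = "<a href=\"http://url1/\" garbagetags>text1</a> <a href=url2>text2</a>";
            for (String[] pair : extract(html)) {
                System.out.println("url: " + pair[0]);
                System.out.println("text: " + pair[1]);
            }
        }
    }
    ```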

  • Regular Expression to convert URI to HTML link tag

    I'm trying to create a method that takes an input string and converts any URIs found in the string to a html link tag.
    For example,
    String input = "This is a test string.  I like http://www.sun.com/ and think you should check out http://java.sun.com/";
    The output should contain an html a tag for each URI and use the URI as the text as well.
    I need this for a blogging web app I'm working on.
    I tried a few things like
    import java.util.regex.*;
    import java.text.*;
    public class test {
        public static void main(String[] args) {
            final Pattern p = Pattern.compile("(\\sI\\n|^)(\\w+://[^\\s\\n]+)");
            final Matcher m = p.matcher(args[0]);
            System.out.println(args[0]);
            args[0] = m.replaceAll( "$1<a href=\"$2\">$2</a>");
            System.out.println(args[0]);
        }
    }
    Any ideas? Something like this will match "http://www.apple.com/" but not a complex string. I've googled this quite a bit and I'm not very good with regular expressions.

    Couldn't get your posted regex to work on anything. Try this:
    public static void main(String[] args) {
        String input = "This is a test string.  I like http://www.sun.com/ and think you should check out http://java.sun.com/";
        final Pattern p = Pattern.compile("(\\s|^)(\\w+://\\S+)(\\s|$)");
        final Matcher m = p.matcher(input);
        input = m.replaceAll("$1<a href=\"$2\">$2</a>$3");
        System.out.println(input);
    }
    It should match any URI; however, I recommend replacing \\w+ with a more concrete string like (?:(?:http)|(?:https)|(?:ftp)). Let me know if you have a URI that it doesn't match.

  • Regular expression html parsing

    I have following sample html
    <html><body>
    First Name<input type="text" class="txtField" name="txtFirstName"\>
    Last name
    <input type="text" name="txtLastName"\>
    Address <textarea name="address" rows="10">Here goes address</textarea>
    <input type="button" name="btnSubmit" class="button"\>
    </body></html>
    I'm trying to build a regular expression that finds a list of tags based on a set of available names.
    For example, I have a string array String names[] = {"txtLastName", "address"}
    Using the above array, the expression must find the matching tags in the above html.
    So in the above case the output should be
    <input type="text" name="txtLastName"\>
    <textarea name="address" rows="10">
    Can somebody suggest how this expression should be built?

    Hi,
    From your question, I understand that you want to parse the HTML file and,
    using the names in the array, get the code of those controls.
    In that case I think you can find the string that
    1. starts with '<' and ends with '>'
    2. contains one of your words ("txtLastName", "address", ...) in double quotes (opening and closing) exactly once.
    Best,
    Ronak
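
    The two conditions above can be sketched in Java by building the pattern from the names array. This assumes the name attribute is always written with double quotes and that no attribute value contains '<' or '>'. The names `TagsByName` and `find` are hypothetical.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class TagsByName {
        // Returns every tag (from '<' to '>') whose name attribute equals one of the given names.
        static List<String> find(String html, String... names) {
            StringBuilder alternatives = new StringBuilder();
            for (String n : names) {
                if (alternatives.length() > 0) {
                    alternatives.append('|');
                }
                alternatives.append(Pattern.quote(n));   // take each name literally
            }
            // [^<>]* keeps the match inside a single tag.
            Pattern p = Pattern.compile(
                    "<[^<>]*\\bname\\s*=\\s*\"(?:" + alternatives + ")\"[^<>]*>");
            List<String> tags = new ArrayList<>();
            Matcher m = p.matcher(html);
            while (m.find()) {
                tags.add(m.group());
            }
            return tags;
        }

        public static void main(String[] args) {
            String html = "First Name<input type=\"text\" class=\"txtField\" name=\"txtFirstName\"\\>\n"
                        + "Last name\n<input type=\"text\" name=\"txtLastName\"\\>\n"
                        + "Address <textarea name=\"address\" rows=\"10\">Here goes address</textarea>";
            for (String tag : find(html, "txtLastName", "address")) {
                System.out.println(tag);
            }
        }
    }
    ```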
