Parsing href=".." with regular expressions

I need to get all the hyperlinks on a web page. I use this code, but it doesn't work 100%:
public void parseLinks(){
String A = "([Hh][Rr][Ee][Ff]\\s*=\\s*\")";
String B = "(?!#|[Hh]ttp|[Mm]ailto|.cgi|.css)";
String C = "(.*)";
String D = "(\\s*\")";
String exp = A+B+C+D;
Pattern p = Pattern.compile(exp);
Matcher m = p.matcher(s);
while(m.find()){
System.out.println(m.group());
}
}
where s is the string being parsed (the HTML document). It more or less works on regular links, e.g.:
<a name="qwerty" href="qwerty.html"> gives the output href="qwerty.html"
but I want it to give the output qwerty.html. How can I do this? It doesn't work on links like:
<a name="qwerty" href="qwerty.html" class="link"> gives the output href="/aktuell/index.html" class="link".
How can I just get the path?

If I have s = "<a name = \"asd\" href=\"asd.html\"" or s = " it works with group 3.
But it doesn't work with s = "<a href=\"/openpos/index.html\" class=\"link\"><b>Open positions</b></a>",
which gives the output:
/openpos/index.html" class="link
Any suggestions? It seems that it fails if there is more than one " before the >.
Example code:
public void parseLinks(){
String s = "<a name = \"asd\" href=\"asd.html\"";
/*String s = "<a href=\"/openpos/index.html\" class=\"link\"><b>Open positions</b></a>";*/
String A = "([Hh][Rr][Ee][Ff]\\s*=\\s*\")";
String B = "(?!#|[Hh]ttp|[Mm]ailto|[Ll]ocation.|[Jj]avascript|.cgi|.css)";
String C = "(.*)";
String D = "(\\s*\")";
String exp = A+B+C+D;
Pattern p = Pattern.compile(exp);
Matcher m = p.matcher(s);
while(m.find()){
System.out.println(m.group(3));
}
}
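
One possible fix, sketched under the assumption that attribute values are always double-quoted: the greedy (.*) runs to the last quote on the line, so replacing it with the negated class [^"]* stops the match at the first closing quote, and reading the capturing group returns the path alone. The class name HrefExtractor and the method extractLinks are made up for illustration, and the exclusion list is shortened.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // Same href prefix as in the question; only the value part differs.
    // [^"]* cannot run past the closing quote, unlike the greedy .*
    private static final Pattern HREF = Pattern.compile(
            "[Hh][Rr][Ee][Ff]\\s*=\\s*\""
            + "(?!#|[Hh]ttp|[Mm]ailto|[Jj]avascript)"
            + "([^\"]*)\"");

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // group(1) is the attribute value without href=" and the closing quote
            links.add(m.group(1).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        String s = "<a href=\"/openpos/index.html\" class=\"link\"><b>Open positions</b></a>";
        System.out.println(extractLinks(s)); // prints [/openpos/index.html]
    }
}
```

For anything beyond simple, well-formed pages, an HTML parser is likely to be more robust than a regular expression.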

Similar Messages

  • DOM Parser fails with regular expression using anchor (caret, dollar)

    I'm using version "Oracle XDK Java 9.0.4.0.0 Production"
    In trying to parse XML against schema: a regular expression fails to parse the data "8:00" with the following simple regular expression: "^.*$" (used to narrow the error)
    The error message is
    <Line 14, Column 25>: XSD-2025: (Error) Invalid text '8:00' in element: 'XYZ'
    If I remove the anchors and just have ".*", the data is parsed successfully.
    I don't understand why the parse fails when I use anchors in the regular expression, while the Java Pattern/Matcher classes succeed with the anchors.

    That "ns670" string is an xml namespace prefix. it should have a corresponding xml namespace declaration somewhere in the xml document (i'm guessing you have not shown the whole document). the actual value of an xml namespace prefix is meaningless. if you parse the xml with a namespace aware DOM parser, it should generate Nodes with the correct namespace. the namespace is the value you care about when extracting data from the document, not the namespace prefix.
    alternately, if you parse the document using a namespace aware DOM parser, you can just look for nodes based on their "local" name (the part after the ":" separator) and ignore the namespace/prefix.
    whatever you do, please do not parse the xml with a regex, see this http://stackoverflow.com/a/1732454/552759 for details (applies to xml as well).

  • Grouping & Back-references with regular expressions on Replace Text window

    I really appreciate the inclusion of the Regular Expressions in the search & replace feature. One thing I am missing is back-references in the replacement expression. For instance, in the unix tools vi or sed, I might do something like this:
    s/\(firstPart\) \(secondPart\) \(oldThirdPart\)/\2 \1 newThirdPart/g
    which would allow me to switch the places of firstPart and secondPart, and totally replace thirdPart. If grouping and back-references are already present in the Replace Text window, how does one correctly invoke them?

    duplicate of Grouping & Back-references with regular expressions on Replace Text window

  • Assistance with Regular Expression and Tcl

    Hello Everyone,
      I recently began learning Tcl to develop scripts for automating network switch deployments. 
    In my script, I want to name the device with a location and the last three octets of the base mac address.
    I can get the Base MAC address by : 
    show version | include Base
     Base ethernet MAC Address       : 00:00:00:DB:CE:00
    And I can get the last three octets of the MAC address using the following regular expression. 
    ([0-9a-f]{2}[:-]){2}([0-9a-f]{2}$)
    But I have not been able to figure out how to call the regular expression in the tcl script.
    I have checked several resources but have not been able to figure it out.  Suggestions?
    Ultimately, I want to set the last three octets to a variable (something like below) and then call the variable when I name the switch.
    set mac [exec "sh version | i Base"] (include the regular expression)
    ios_config "hostname location$mac"
    Thanks for any assistance in advance.
    Chris

    This worked for me.
    Switch_1(tcl)#set result [exec show ver | inc Base]   
    Base ethernet MAC Address       : 00:1B:D4:F8:B1:80
    Switch_1(tcl)#regexp {([0-9A-F:]{8}\r)} $result -> mac
    1
    Switch_1(tcl)#puts $mac                               
    F8:B1:80
    Switch_1(tcl)#ios_config "hostname location$mac"      
    %Warning! Hostname should contain at least one alphabet or '-' or '_' character
    locationF8:B1:80(tcl)#

  • How to search with regular expression

    I make pdx files so that I can search text quickly. But Acrobat doesn't provide a way to search with regular expression. I'm wondering if there is a way that I don't know to search for regular expression in Acrobat Pro 9?

    First, Acrobat must "mount" the PDX.
    As "Find" does not use the cataloged index, use Shift+Ctrl+F to open the advanced search dialog.
    It may be helpful to first enter Acrobat Preferences and for the Search category tick "Always use advanced search options".
    Back to the Search dialog - use the drop down menu for "Look In" to pick "Select Index" then, if no PDXs show, click the Add button.
    In the Open Index File dialog, browse to the location of the desired PDX and select it.
    OK out and use "Return results containing" to pick a "Match ..." requirement or Boolean.
    To become familiar with query syntax, for Acrobat, it is good to review Acrobat Help.
    http://help.adobe.com/en_US/Acrobat/9.0/Professional/WS58a04a822e3e50102bd615109794195ff-7c4b.w.html
    Be well...

  • Problem with Regular Expression

    Hi There!!
    I have a problem with regular expression. I want to validate that one word and second word are same. For that I have written a regex
    Pattern p=Pattern.compile("([a-z][a-zA-Z]*)\\s\1");
    Matcher m=p.matcher("nikhil nikhil");
    boolean t=m.matches();
    if (t)
              System.out.println("There is a match");
         else
              System.out.println("There is no match");
    The result I am getting is always "There is no match".
    Your timely help will be much appreciated.
    Regards

    Ram wrote:
    ErasP wrote:
    You are missing a backward slash in the regex
    Pattern p = Pattern.compile("([a-z][a-zA-Z]*)\\s\\1");
    But this will fail in this case.
    Matcher m = p.matcher("Nikhil Nikhil");
    The reason for that is the [a-z].

    The OP had [a-z][a-zA-Z]* in his code, so presumably he knows what that means and wants that String not to match.
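
To make the missing-backslash point concrete, here is a small sketch (the class name RepeatedWord and the method isRepeated are invented for this example). In Java source, "\\1" reaches the regex engine as the backreference \1, whereas a bare "\1" is an octal character escape and never refers to group 1.

```java
import java.util.regex.Pattern;

public class RepeatedWord {
    // "\\s\\1" = one whitespace character followed by whatever group 1 captured
    private static final Pattern SAME_WORD_TWICE =
            Pattern.compile("([a-z][a-zA-Z]*)\\s\\1");

    public static boolean isRepeated(String text) {
        // matches() requires the whole input to fit the pattern
        return SAME_WORD_TWICE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRepeated("nikhil nikhil")); // prints true
        System.out.println(isRepeated("Nikhil Nikhil")); // prints false: [a-z] rejects the capital N
    }
}
```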

  • Get the string between li tags, with regular expression

    I have an unordered list, and I want to store all the strings between the li tags (<li>.?</li>) in an array:
    <ul>
    <li>This is String One</li>
    <li>This is String Two</li>
    <li>This is String Three</li>
    </ul>
    This is what I have so far:
    <li>(.*?)</li>
    but it is not correct, I only want the string without the li tags.
    Thanks.

    No one?
    Anyone here experienced with regular expressions?
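
The thread does not say which language is in use, so the following is only a Java sketch of the usual approach: the pattern <li>(.*?)</li> is already fine, and the text without the tags is in capturing group 1 rather than in the whole match. The names ListItems and extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListItems {
    private static final Pattern LI = Pattern.compile("<li>(.*?)</li>");

    public static List<String> extract(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = LI.matcher(html);
        while (m.find()) {
            // group() would include the tags; group(1) is only the text between them
            items.add(m.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        String html = "<ul>\n<li>This is String One</li>\n<li>This is String Two</li>\n</ul>";
        System.out.println(extract(html)); // prints [This is String One, This is String Two]
    }
}
```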

  • Parsing with regular expressions

    Hi
    I'm developing an application that is trying to parse a text in a text file. It's looking for strings like
    " 0551 TIMERHIN, S.L. "
    So I thought I could use the matches method to look for a string with several whitespace characters followed by 4 digits...
    I did
    line.matches("^\\s+[0-9]{4}")
    but it didn't work...
    Any help will be appreciated.
    <jl>

    String.matches tries to match the regular expression against the whole string.
    Your regex pattern "^\\s+[0-9]{4}" will match
    "                          0551"
    but not
    "                          0551 TIMERHIN, S.L. "
    You can do it using this code:
    import java.sql.*;
    import java.io.*;
    import java.util.regex.*;
    import java.util.*;
    public class RegExTest {
    public static void main (String args[]) {
       Pattern pat = null;
       Matcher m = null;
       String patternToMatch = "^\\s+[0-9]{4}";
       pat = Pattern.compile(patternToMatch);
       System.out.println("Pattern to match = " + patternToMatch);
       String line = "                          0551 TIMERHIN, S.L. ";
       m = pat.matcher(line);
       if (m.find()) { System.out.println("Line matched"); }
    } // end main
    } //End Class Test
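
Two other ways to express "the line starts with whitespace and four digits", shown as a sketch with invented names (LeadingNumber, startsWithCode, startsWithCodeUsingMatches): Matcher.lookingAt() anchors the match at the start without requiring it to reach the end, and String.matches() works once the pattern allows for the rest of the line.

```java
import java.util.regex.Pattern;

public class LeadingNumber {
    private static final Pattern PREFIX = Pattern.compile("\\s+[0-9]{4}");

    // lookingAt() must match at the beginning of the input but may stop before the end
    public static boolean startsWithCode(String line) {
        return PREFIX.matcher(line).lookingAt();
    }

    // matches() must consume the whole input, so the trailing .* absorbs the remainder
    public static boolean startsWithCodeUsingMatches(String line) {
        return line.matches("\\s+[0-9]{4}.*");
    }

    public static void main(String[] args) {
        String line = "     0551 TIMERHIN, S.L. ";
        System.out.println(startsWithCode(line));             // prints true
        System.out.println(startsWithCodeUsingMatches(line)); // prints true
        System.out.println(line.matches("^\\s+[0-9]{4}"));    // prints false
    }
}
```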
      

  • Replace All with Regular Expression

    Hi all,
    I need a help to replace the String:
    to
    "<a href=\"1\"">Java Programming</a>"
    I was trying to replace with
    .replaceAll("\\[\\[*([^\\]]*?)*\\]\\]", "<a href=\"\"></a>")
    but I don't know how to separate the parameters values.
    Best regards

    Hi prometheuzz,
    you are the one!!
    But how could I do this for a String like this:

    Ah, more requirements...
    Things are getting a bit messy now:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    class Main {
        public static void main(String[] args) {
            String line = "any text any text any text any text any text. "+
                          "Click - [[Code: 6 Title: Java Programming]]. "+
                          "any text any text any text any text any text. "+
                          "- [[Código: 2 Título: Sun]] text text "+
                          "any text any text any text any text any text.";
            String newLine = line;
            String[] wikiTags = getWikiTags(newLine);
            for(String tag : wikiTags) {
                String newTag = "<a href=\""+get("(Code:|Código:)", " ", tag)+
                                "\">"+get("(Title:|Título:)", "$", tag)+"</a>";
                newLine = newLine.replaceFirst("\\[\\["+tag+"\\]\\]", newTag);
            }
            System.out.println(line);
            System.out.println(newLine);
        }
        // Collect the text between each [[ and ]] pair
        public static String[] getWikiTags(String text) {
            java.util.List<String> list = new java.util.ArrayList<String>();
            Pattern pattern = Pattern.compile("(?<=\\[\\[)(.*?)(?=\\]\\])");
            Matcher matcher = pattern.matcher(text);
            while(matcher.find()) {
                list.add(matcher.group());
            }
            return list.toArray(new String[list.size()]);
        }
        // Return the text between the start marker and the end marker
        public static String get(String start, String end, String text) {
            Pattern pattern = Pattern.compile("(?<="+start+"\\s)(.*?)(?="+end+")");
            Matcher matcher = pattern.matcher(text);
            return matcher.find() ? matcher.group() : "#ERROR#";
        }
    }
    Details about regex:
    http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
    http://www.regular-expressions.info/java.html
    http://java.sun.com/docs/books/tutorial/essential/regex/
    > Thanks a lot for your help
    Like I said, it's a bit of a messy solution. See if you can use (a part of) it.
    Good luck.

  • Need help with regular expression

    I'm trying to use the java.util.regex package to extract URLs from html files.
    The URLs that I am interested in extracting from the HTML look like the following:
    <font color="#008000">http://forum.java.sun.com -
    So, the URL is always preceded by:
    <font color="#008000">
    and then followed by a space character and then a hyphen character. I want to be able to put all these URLs in a Vector object. This doesn't seem like it should be too difficult but for some reason I can't get anywhere with it. Any help would be greatly appreciated. Thanks!

    hi gupta am not sure of the java syntax but i can tell u about the regular expression...try this....
    <font color="#008000">(http:\/\/[a-zA-Z0-9.]+) [-]
    i dont know the java methods to call...just the reg exp...
    Sanjay Acharya
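
Since the reply leaves out the Java side, here is one way it could look. This is a sketch, not the poster's code: the class and method names are invented, a List stands in for the Vector mentioned in the question, and the URL part is loosened to \S+ (an assumption) so that paths and query strings are captured too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreenUrlExtractor {
    // Literal font tag, then the URL (group 1), then a space and a hyphen
    private static final Pattern URL = Pattern.compile(
            "<font color=\"#008000\">(http://\\S+) -");

    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<font color=\"#008000\">http://forum.java.sun.com - 12k</font>";
        System.out.println(extractUrls(html)); // prints [http://forum.java.sun.com]
    }
}
```

If a Vector is required, new Vector<>(extractUrls(html)) would copy the result into one.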

  • Help with Regular Expression for field validation

    I'm fairly new to using regular expressions and using Acrobat. This is probably a simple question, but I've been unable to figure it out.
    I have a text field on a PDF that I would like to be 9 characters in length. The first 2 characters can only be alphanumeric, the last 7 characters can only be numeric.
    At first I was using the following, which allows all the characters to be alphanumeric:
    var re = /^[A-Za-z0-9 :\\_]$/;
    if (event.change.length >0) {
        if (event.willCommit == false) {
            if (!re.test(event.change)) {
                event.rc = false;
            }
        }
    }
    That works fine, but it's not quite what I needed. With some assistance I changed it (see below) to fit what I was looking for. However, this didn't work; it prevents anything from being entered in the field:
    var re = /^[A-Za-z0-9]{2}\d{7}$/;
    if (event.change.length >0) {
        if (event.willCommit == false) {
            if (!re.test(event.change)) {
                event.rc = false;
            }
        }
    }
    Any help would be greatly appreciated.
    Thanks...

    Here's a function you can call from the field's custom Format script. It should be placed in a document-level JavaScript:
    function custom_ks1() {
        // Define non-committed regular expression
        var re = /^[A-Za-z0-9]{0,2}([0-9]{0,7})?$/;
        // Get all of the characters the user has entered
        var value = AFMergeChange(event);
        // Allow field to be cleared
        if(!value) return;
        if (event.willCommit) {
            // Define committed regular expression
            re = /^[A-Za-z0-9]{2}[0-9]{7}$/;
            if (!re.test(value)) {  // If final value doesn't match, alert user
                app.alert("Your error message goes here.");
                // event.rc = false
            }
        } else {  // not committed
            // Only allow characters that match the regular expression
            event.rc = re.test(value);
        }
    }
    Call it like this:
    // Custom Keystroke script
    custom_ks1();
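
The idea behind the two expressions in the reply can be shown outside Acrobat. The Java sketch below is not Acrobat API; FieldValidator and its two methods are hypothetical names. While the user is typing, the value is only a prefix of a valid entry, so the strict nine-character pattern cannot match yet; a relaxed pattern accepts every valid prefix, and the strict one is applied only to the final value.

```java
import java.util.regex.Pattern;

public class FieldValidator {
    // Accepts any prefix of a valid value: up to 2 alphanumerics, then up to 7 digits
    private static final Pattern WHILE_TYPING =
            Pattern.compile("^[A-Za-z0-9]{0,2}[0-9]{0,7}$");
    // Accepts only a complete value: exactly 2 alphanumerics, then exactly 7 digits
    private static final Pattern ON_COMMIT =
            Pattern.compile("^[A-Za-z0-9]{2}[0-9]{7}$");

    public static boolean acceptWhileTyping(String valueSoFar) {
        return WHILE_TYPING.matcher(valueSoFar).matches();
    }

    public static boolean acceptOnCommit(String finalValue) {
        return ON_COMMIT.matcher(finalValue).matches();
    }

    public static void main(String[] args) {
        System.out.println(acceptWhileTyping("AB12"));   // prints true
        System.out.println(acceptOnCommit("AB12"));      // prints false
        System.out.println(acceptOnCommit("AB1234567")); // prints true
    }
}
```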

  • Help with regular expression needed

    Hi,
    Perhaps someone here can help me with my regular expression I'm trying to build in my Java code.
    The regular expression that I'm looking to build consists of any non-whitespace character up until it finds one or two <>= symbols and then any character thereafter. So both these Strings would match the expression:
    City 1==London
    Age>=18
    The regular expression that I'm using is as follows:
    (\\S+)([><=]){1,2}(.+)
    However, group 1 always retrieves the first <>= symbol as in "City 1=". How can I make the <>= part greedy so that it retrieves both operator symbols?
    Thanks.

    Make the first group, the non-spaces, reluctant:
    "(\\S+?)([<>=]{1,2})(.+)"

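The difference between the two expressions in this thread can be checked with a small sketch (OperatorSplit and split are invented names). With the greedy \S+, the first group takes as much as it can and leaves only one operator character for the second group; with the reluctant \S+?, the first group stops as early as possible and the operator group can take both characters.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OperatorSplit {
    public static final Pattern GREEDY = Pattern.compile("(\\S+)([><=]){1,2}(.+)");
    public static final Pattern RELUCTANT = Pattern.compile("(\\S+?)([<>=]{1,2})(.+)");

    // Returns the three groups, or null when the input does not match at all
    public static String[] split(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(GREEDY, "Age>=18")));    // prints [Age>, =, 18]
        System.out.println(Arrays.toString(split(RELUCTANT, "Age>=18"))); // prints [Age, >=, 18]
    }
}
```

Note that the reply also moves the {1,2} quantifier inside the second group; with the quantifier outside, the group would only keep the last operator character it matched.
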
  • Help with regular expression to find a pattern in clob

    Can someone help me write a regular expression to query a CLOB that contains XML-type data?
    I need a query to find multiple occurrences of a variable string (i.e. <EMPID-XX>, where XX can be any number). If <EMPID-01> appears twice in the CLOB, I want the result as EMPID-01,2, and if EMPID-02 appears 4 times, I want the result as EMPID-02,4.

    with
    ofx_clob as
    (select q'~
    <EMPID>1
    < UNQID>123456
    < TIMESTAMP>...
    < ADDRINFO>
    < TITLE>^@~*
    < FIRST>ABCD
    < MI>
    < LAST>EFGH
    < ADDR1>ADDR1
    < ADDR2>^@~*
    < CITY>CITY
    <EMPID>2
    < UNQID>123457
    < TIMESTAMP>...
    < ADDRINFO>
    < TITLE>^@~*
    < FIRST>ABCD
    < MI>
    < LAST>EFGH
    < ADDR1>ADDR1
    < ADDR2>^@~*
    < CITY>CITY
    <EMPID>1
    < UNQID>123458
    < TIMESTAMP>...
    < ADDRINFO>
    < TITLE>^@~*
    < FIRST>ABCD
    < MI>
    < LAST>EFGH
    < ADDR1>ADDR1
    < ADDR2>^@~*
    < CITY>CITY
    ~' ofx from dual
    )
    select '<EMPID>' || to_char(ids) || '(' || to_char(count(*)) || ')' multi_empid
      from (select replace(regexp_substr(ofx,'<EMPID>\d*',1,level),'<EMPID>') ids
              from ofx_clob
            connect by level <= regexp_count(ofx,'<EMPID>')
           )
    group by ids having count(*) > 1
    MULTI_EMPID
    <EMPID>1(2)
    with
    ofx_clob as
    (select q'~
    <EMPID>1
    < UNQID>123456
    < TIMESTAMP>...
    < ADDRINFO>
    < TITLE>^@~*
    < FIRST>ABCD
    < MI>
    < LAST>EFGH
    < ADDR1>ADDR1
    < ADDR2>^@~*
    < CITY>CITY
    <EMPID>2
    < UNQID>123457
    < TIMESTAMP>...
    < ADDRINFO>
    < TITLE>^@~*
    < FIRST>ABCD
    < MI>
    < LAST>EFGH
    < ADDR1>ADDR1
    < ADDR2>^@~*
    < CITY>CITY
    <EMPID>1
    < UNQID>123456
    < TIMESTAMP>...
    < ADDRINFO>
    < TITLE>^@~*
    < FIRST>ABCD
    < MI>
    < LAST>EFGH
    < ADDR1>ADDR1
    < ADDR2>^@~*
    < CITY>CITY
    <EMPID>2
    < UNQID>123456
    < TIMESTAMP>...
    < ADDRINFO>
    < TITLE>^@~*
    < FIRST>ABCD
    < MI>
    < LAST>EFGH
    < ADDR1>ADDR1
    < ADDR2>^@~*
    < CITY>CITY
    <EMPID>1
    < UNQID>123458
    < TIMESTAMP>...
    < ADDRINFO>
    < TITLE>^@~*
    < FIRST>ABCD
    < MI>
    < LAST>EFGH
    < ADDR1>ADDR1
    < ADDR2>^@~*
    < CITY>CITY
    ~' ofx from dual
    )
    select '<EMPID>' || listagg(to_char(ids) || '(' || to_char(count(*)) || ')',',') within group (order by ids) multi_empid
      from (select replace(regexp_substr(ofx,'<EMPID>\d*',1,level),'<EMPID>') ids
              from ofx_clob
            connect by level <= regexp_count(ofx,'<EMPID>')
           )
    group by ids having count(*) > 1
    MULTI_EMPID
    <EMPID>1(3),2(2)
    Regards
    Etbin
    Message was edited by: Etbin
    used listagg to report more than one multiple <EMPID>
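
For comparison, the same counting logic outside the database, as a Java sketch (EmpIdCounter and countRepeated are invented names, and the text is assumed to be already loaded into a String): find every <EMPID> followed by digits, count each id, and keep only the ids that occur more than once.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmpIdCounter {
    private static final Pattern EMPID = Pattern.compile("<EMPID>(\\d+)");

    public static Map<String, Integer> countRepeated(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = EMPID.matcher(text);
        while (m.find()) {
            // Add 1 to the running count for this id
            counts.merge(m.group(1), 1, Integer::sum);
        }
        // Mirrors "having count(*) > 1": drop ids that appear only once
        counts.values().removeIf(count -> count < 2);
        return counts;
    }

    public static void main(String[] args) {
        String ofx = "<EMPID>1\n<UNQID>123456\n<EMPID>2\n<UNQID>123457\n<EMPID>1\n<UNQID>123458\n";
        System.out.println(countRepeated(ofx)); // prints {1=2}
    }
}
```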

  • Help With Regular Expression In Apex Validation

    Apex 3.2
    There is a validation type of regular expression in apex, but I have never used regular expression before,
    so a little help is appreciated.
    I need to validate a field. It is only allowed to contain alpha characters, numbers, spaces and the - (dash) character.
    I have tried several times to get this working
    eg
    [[:alpha:]]*[[:digit:]]*[[:space:]]*[-]*
    ^[[:alpha:][:digit:][:space:]-]+?
    and others, but I just can't get the syntax correct.
    Can someone help me with this please
    Gus

    Example:
    SQL> ed
    Wrote file afiedt.buf
      1  with t as (select 'This is some example text' as txt from dual union all
      2             select 'And this is the 2nd one with numbers' from dual union all
      3             select 'And this allows double-barrelled words with hyphens' from dual union all
      4             select 'But this one shouldn''t be allowed!' from dual
      5            )
      6  --
      7  select *
      8  from t
      9* where regexp_like(txt, '^[[:alnum:] -]*$')
    SQL> /
    TXT
    This is some example text
    And this is the 2nd one with numbers
    And this allows double-barrelled words with hyphens
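
The same check can be tried outside Apex. In this Java sketch (FieldCheck and isValid are invented names), \p{Alnum} plays the role of the POSIX class [:alnum:], and the hyphen is placed last in the character class so that it is read literally.

```java
import java.util.regex.Pattern;

public class FieldCheck {
    // Letters, digits, spaces and hyphens only, from start to end of the value
    private static final Pattern ALLOWED = Pattern.compile("^[\\p{Alnum} -]*$");

    public static boolean isValid(String value) {
        return ALLOWED.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("And this allows double-barrelled words with hyphens")); // prints true
        System.out.println(isValid("But this one shouldn't be allowed!"));                   // prints false
    }
}
```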

  • Help with Regular Expressions and regexp_replace

    Oh great Oracle gurus, can I get some help?
    I need to clean up the phone numbers that have been entered in Oracle eBusiness per_phones table. Some of the phone numbers have dashes, some have spaces and some have char. I would just like to take all the digits out and then re-format the number.
    Ex.
    914-123-1234 .. output (914) 123-1234
    9141231234 ..again (914) 123-1234
    914 123 1234 .. (914) 123-1234
    myphone ... just null
    (914)-123-1234.. (914) 123-1234
    I really tried to understand the regular expression statements, but for some reason I just can't understand them.

    Hi,
    Welcome to the forum!
    I would create a user-defined function for this. I expect there will be a lot of exceptions to the regular rules (for example, strings that do not contain exactly 10 digits, such as '1-800-987-6543') that can be handled, but would require lots of nested functions and other complicated code if you had to do it in a single statement.
    If you really want to do it with a regular expression:
    SELECT     phone_txt
    ,     REGEXP_REPLACE ( phone_txt
                     , '^\D*'          || -- 0 or more non-digits at the beginning of the string
                           '(\d\d\d)'     || -- \1 = 3 consecutive digits
                    '\D*'          || -- 0 or more non-digits
                           '(\d\d\d)'     || -- \2 = 3 consecutive digits
                    '\D*'          || -- 0 or more non-digits
                           '(\d\d\d\d)'     || -- \3 = 4 consecutive digits
                    '\D*$'             -- 0 or more non-digits at the end of the string
                     , '(\1) \2-\3'
                     )          AS new_phone_txt
    FROM    table_x
    ;
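
The "take all the digits out, then re-format" approach from the question is short to write as a standalone function. The following Java sketch is an illustration only (PhoneFormatter and format are invented names, and treating anything other than exactly 10 digits as null is an assumption based on the examples given).

```java
public class PhoneFormatter {
    public static String format(String raw) {
        if (raw == null) {
            return null;
        }
        // \D matches every non-digit, so only the digits remain
        String digits = raw.replaceAll("\\D", "");
        if (digits.length() != 10) {
            return null; // e.g. "myphone" or "1-800-987-6543"
        }
        return "(" + digits.substring(0, 3) + ") "
                + digits.substring(3, 6) + "-" + digits.substring(6);
    }

    public static void main(String[] args) {
        System.out.println(format("914-123-1234"));   // prints (914) 123-1234
        System.out.println(format("(914)-123-1234")); // prints (914) 123-1234
        System.out.println(format("myphone"));        // prints null
    }
}
```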

  • Help with Regular Expressions

    I have some sample code for string editing using regular expressions but I'm a little confused as to its behavior. Here's what I have:
    public class WordFixer
    {
        protected final String PUNC_MATCH = "[\\d\\p{Punct}]+";
        protected final String PUNC_PREFIX = "^" + PUNC_MATCH;
        protected final String PUNC_SUFFIX = PUNC_MATCH + "$";
        public String fixPrefix (String w)
        {
            return w.replaceFirst(PUNC_PREFIX, "");
        }
        public String fixSuffix (String w)
        {
            return w.replaceFirst(PUNC_SUFFIX, "");
        }
    }
    This replaces all leading and trailing punctuation with "" and it works. However, changing the replaceFirst's to just replace doesn't work. I don't understand why that is. Doesn't the ^ mean in the front, and doesn't the $ mean in the back, so shouldn't this work by just changing replaceFirst to replace?

    JFactor2004 wrote:
    This replaces all leading and trailing punctuation with "" and it works. However, changing the replaceFirst's to just replace doesn't work. I don't understand why that is.

    The first parameter of replaceFirst(...) is a regex String, and the ^, $ and [...] all have a special meaning in the regex language. The replace(...) method takes plain Strings as parameters, so the String "^[\\d\\p{Punct}]+" is interpreted as just that, without any special meaning.

    JFactor2004 wrote:
    Doesn't the ^ mean in the front, and doesn't the $ mean in the back,

    Yes, ^ means the start of the String (sometimes the start of a line) and $ means the end of the String (sometimes the end of a line).

    JFactor2004 wrote:
    so shouldn't this work by just changing replaceFirst to replace?

    Nope, see the explanation above.
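
A short sketch of the difference described above (ReplaceDemo and its method names are made up for the example): String.replace looks for the pattern text literally, while String.replaceFirst compiles it as a regular expression.

```java
public class ReplaceDemo {
    static final String PUNC_PREFIX = "^[\\d\\p{Punct}]+";

    // replace(): searches for the literal characters ^[\d\p{Punct}]+ and finds none
    public static String stripLiteral(String w) {
        return w.replace(PUNC_PREFIX, "");
    }

    // replaceFirst(): treats the same text as a regex anchored at the start
    public static String stripRegex(String w) {
        return w.replaceFirst(PUNC_PREFIX, "");
    }

    public static void main(String[] args) {
        System.out.println(stripLiteral("...hello")); // prints ...hello
        System.out.println(stripRegex("...hello"));   // prints hello
    }
}
```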
