Regular Expressions: Greedy vs Non-Greedy

Guys, I can't explain this behaviour, and I can't find any explanation for it in the docs:
SQL> with t as (select 'the 1 january of the year 2007' str from dual)
2 select regexp_substr(str,'.*?[[:digit:]][ ][[:alpha:]]+.*') substr1,
3         regexp_substr(str,'.*?[[:digit:]][ ][[:alpha:]]+.*$') substr2
4 from t
5 /
SUBSTR1        SUBSTR2
the 1 january  the 1 january of the year 2007
SQL>
The first part of the pattern is '.*?', a non-greedy search for any sequence of characters.
It is followed by 1 digit, then 1 space, then a greedy run of alpha characters,
and the last part of the mask is '.*' in the first case and '.*$' in the second.
The only difference is the '$' at the end.
AFAIK '.*' in the first case should stand for a GREEDY search for any sequence of characters.
So in my opinion, if '.*' stands at the end of the mask it should be equivalent to '.*$',
but somehow it becomes NON-GREEDY.
I just can't explain why.
Can anyone help?
Thanks.
PS
SQL> select * from v$version;
BANNER
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE     10.2.0.1.0     Production
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production
SQL>

This doesn't make any sense at all to me either. The only thing I did find that could explain it is at:
http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions116.htm#SQLRF06303
match_parameter is a text literal that lets you change the default matching behavior of the function. You can specify one or more of the following values for match_parameter:
'n' allows the period (.), which is the match-any-character character, to match the newline character. If you omit this parameter, the period does not match the newline character.
So maybe since the . doesn't match newline and $ does?
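For comparison, here is a minimal sketch of how a Perl-style backtracking engine treats the same two patterns. It uses Python's re module, not Oracle, and the POSIX classes are rewritten as [0-9] and [A-Za-z]; it only illustrates the behaviour the original poster expected and says nothing about why Oracle 10.2 differs.

```python
import re

text = "the 1 january of the year 2007"

# Lazy filler, one digit, one space, a run of letters, then a greedy tail.
without_anchor = re.search(r".*?[0-9][ ][A-Za-z]+.*", text).group(0)
with_anchor = re.search(r".*?[0-9][ ][A-Za-z]+.*$", text).group(0)

print(repr(without_anchor))
print(repr(with_anchor))
```

In this engine both patterns return the whole string, because the trailing .* is greedy with or without the $.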

Similar Messages

Java – Regular Expressions – Finding any non digit byte in a multiple byte

Hello,
I’m new to JAVA and Regular Expressions; I’m trying to write a regular expression that will find any records that contain a non digit byte in a multiple byte field.
I thought the following was the correct expression but it is only finding records that contain “all” non digit bytes.
\D{1,}
\D = Non Digit
{1,} = at least 1 or more
Below is my sample data. I would like the regular expression to find all of the records that are not all numeric. However when I use the regular expression \D{1,} it is only finding the 2 records that all bytes are non digits. (i.e. “ “ and “A “)
“ 111229”
“2 111229”
“20091229”
“200912c9”
“201#1229”
“20101229”
“20110229”
“20111*29”
“20111029”
“20111229”
“20B11229”
“A “
“A0111229”
Please note I have also tried \D{1,}+ and \D{1,}? And they also do not return my desired results
Any assistance someone can provide would be greatly appreciated.

You don't show the code you are using but I surmise you are using String.matches() which requires that the whole target must match the regular expression not just part of it. Instead you should create a Pattern and then a Matcher and use the Matcher.find() method. Check the Javadoc for Pattern and Matcher and look at the Java regex tutorial - http://docs.oracle.com/javase/tutorial/essential/regex/ .
P.S. You can re-use the Pattern object - you don't have to create it every time you need one.
P.P.S. Java regular expressions work with characters, not bytes, and characters are not bytes.
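The difference between matching the whole target and finding a match inside it can be sketched in Python, where re.fullmatch plays the role of String.matches() and re.search the role of Matcher.find(). The record values below are a shortened, hypothetical version of the sample data (the padding of the all-blank records is assumed).

```python
import re

records = [" 111229", "20091229", "200912c9", "201#1229", "        ", "A       "]

# Whole-string match: only records made up entirely of non-digits qualify.
all_non_digits = [r for r in records if re.fullmatch(r"\D+", r)]

# Search: any record containing at least one non-digit qualifies.
has_non_digit = [r for r in records if re.search(r"\D", r)]

print(all_non_digits)
print(has_non_digit)
```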

Regular expression to select non-matching pattern

Hi All,
I am having question on regular expressions
I want to select lines containing non-matching pattern.
For example if we consider following cities:
London
NewYork
Delhi
Mountainview
If above are the given then how to select all cities except "Delhi"
Please suggest. Thanks.

Hi,
You need to explain what actually you need to get out. As all these cities are in expression [a-z, A-Z]. Some more input required.
Kuldeep Jangra
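If the goal is simply "every line that is not exactly Delhi", one possible approach in engines that support lookahead is a negative lookahead anchored at the start of the line. This is a sketch in Python; whether the poster's tool supports lookahead is not known from the thread.

```python
import re

cities = ["London", "NewYork", "Delhi", "Mountainview"]

# (?!Delhi$) rejects a line whose entire content is "Delhi".
not_delhi = re.compile(r"^(?!Delhi$).*$")

selected = [city for city in cities if not_delhi.match(city)]
print(selected)
```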

Regular Expression replacement not working

I am trying to use a regular expression to replace non-ascii characters on a file, and I'm afraid I've reached the end of my regex knowledge.
Here is the specific code
'Set the Regular Expression parameters
Set RegEx = CreateObject("VBScript.Regexp")
RegEx.Global = True
RegEx.Pattern = "[^\u0000-\u007F]"
RegEx.IgnoreCase = True
'Replace the UTF-8 characters
ReplacedText = RegEx.Replace(FileText, "\u0020")
If I understand regular expressions correctly, the pattern "[^\u0000-\u007F]" should match any character that is not an ASCII character, and Replace should then substitute a space for it (which I understand is "\u0020"). What am I doing wrong?

Simply use
ReplacedText = RegEx.Replace(FileText, " ")
Regards, Hans Vogelaar (http://www.eileenslounge.com)
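The same idea, sketched in Python for comparison: the pattern carries the escape sequences, while the replacement is just the literal text to insert. The function name and the sample string are made up for illustration.

```python
import re

def replace_non_ascii(text, replacement=" "):
    """Replace every character outside the ASCII range with `replacement`."""
    return re.sub(r"[^\x00-\x7F]", replacement, text)

cleaned = replace_non_ascii("caf\u00e9 na\u00efve \u2013 ok")
print(repr(cleaned))
```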

Non-greedy regular expression search/replace

I want to be able to do non-greedy regular expression matches, as
Dreamweaver defaults to greedy and eats all of the text it can
match.
Is there an add-on that adds a checkbox or something to the
Search and Replace dialog for non-greediness?
I guess I could throw something in there to match some simple
expressions like...
Text to search and replace:
$Query .= ", dateReceived='".addslashes($dateReceived)."'" .
", dateReleased='".addslashes($dateReleased)."'";
Instead of using this now
Find: '"\.addslashes\(\$([^\)]*)\)\."'
Replace: ". dateInsert($$1) ."
I want to just use this without it getting greedy with the .*
Find: '"\.addslashes\(\$(.*)\)\."'
Replace: ". dateInsert($$1) ."

In a regular expression, if you want the .* to not be greedy
you add a '?' after it, so you have '.*?'
Find: '".addslashes($dateReleased)."'
Hope that help,
Chris
Adobe Dreamweaver Engineering
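The effect of the extra '?' can be seen in a small Python sketch. The input line is a simplified, hypothetical version of the PHP snippet from the question, and Python's re module stands in for Dreamweaver's engine.

```python
import re

line = "addslashes($dateReceived) . addslashes($dateReleased)"

# Greedy: .* runs to the LAST closing parenthesis on the line.
greedy = re.findall(r"addslashes\(\$(.*)\)", line)

# Non-greedy: .*? stops at the FIRST closing parenthesis.
lazy = re.findall(r"addslashes\(\$(.*?)\)", line)

print(greedy)
print(lazy)
```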

Regular Expressions - unsetting greedy possible?

Hi,
I'm currently working on a parser and have run into some problems with regular expressions in ABAP.
Let's say I want to calculate (2+2)*(3+3).
The RegExp \(.+\) finds everything between brackets. The problem is that the engine finds everything between the first opening and the last closing bracket (it should actually find the first opening and the first closing bracket).
Is there a way to tell the engine to work ungreedy?
Thanks for your help
Chris

Hi Prashant,
unfortunately this won't work either.
I'd better give some more information on the topic to increase understanding.
In order to calculate this string mathematically I created a function that works recursively. It calculates the (math) value of a string.
So let's say we want to calculate (2+2)*(3+3); the function is supposed to work this way:
math ( "(2+2)*(3+3)")
-> math ("2+2") (Calculating and returning value: 4)
The formula now is "4*(3+3)"
-> math("3+3") (Calculating and returning value: 6)
The formula now is "4*6"
-> math("4*6") (calculating and returning value 24)
Since ABAP does not know ungreedy searches in regular expressions, the function would work this way:
math ( "(2+2)*(3+3)")
->math( "2+2)*(3+3" ) (using the wrong brackets...)
... leading to a math error.
Your solution, Prashant, would work for the first recursive call. Then, the formula would be "(22)*(32)" again.
Thanks though
Regards
Christian
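When an engine has no non-greedy quantifiers, a common substitute is a negated character class: match an opening bracket, then anything that is not a bracket, then a closing bracket. A minimal sketch in Python (not ABAP), which also happens to pick the innermost pair that the recursive approach needs:

```python
import re

formula = "(2+2)*(3+3)"

# Greedy dot: spans from the first "(" to the last ")".
greedy = re.search(r"\((.+)\)", formula).group(1)

# Negated class: cannot cross a bracket, so it stops at the first ")".
innermost = re.search(r"\(([^()]*)\)", formula).group(1)

print(greedy)
print(innermost)
```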

Grep style shortest match (non-greedy) modifier ignored

I'm looking for confirmation on a possible bug that I've had problems with from InDesign version 5.5 up through the current CC 10.0.0.70.
What I'm trying to do is apply a character style via a grep rule in a paragraph style that matches everything up until the first colon in a block of text. The goal is to make that text bold. My grep rule is
.+?:
which means match one or more of any character, shortest match (non-greedy in grep terminology), followed by a colon. But the non-greedy modifier is being ignored and it's matching all of the text up until the last colon.
Here's a screen shot.
I've come up with a work around, but it would be nice for it to work properly. I'll report it as a bug if others confirm having the same problem. Thanks.

Thanks, Peter! I'm so used to the way grep works outside of InDesign that I forgot it would look for multiple instances within the paragraph. Technically, I did limit it to the first occurrence by using the question mark after the plus. (If you use any of the shortest match options in the drop-down menu that's what you get.)
But anchoring it to the beginning of the line with ^ did the trick.
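The behaviour described here is easy to reproduce outside InDesign. In this Python sketch (sample text is made up) the shortest-match quantifier works as expected, but an unanchored pattern is applied repeatedly, so it finds a match ending at every colon; anchoring with ^ leaves only the first.

```python
import re

text = "Name: value with a colon: inside"

# Unanchored: each successive match is short, but there are several of them.
unanchored = re.findall(r".+?:", text)

# Anchored: only the match that starts at the beginning of the text.
anchored = re.findall(r"^.+?:", text)

print(unanchored)
print(anchored)
```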

Regular Expression for non-words

hello all!
can you help me construct a regular expression that will match non-word strings say "��". I will be needing this to filter words from a Microsoft Word Document.
Thanx!

I don't think this is a problem that should be solved with regex. You would have to convert your Word document to a String and use replaceAll() with "\\W" as the regex.
Correct me if I am wrong but I thought that Word files were binary so your first problem will be to convert the file(s) to a String.

Help with regular expression needed

Hi,
Perhaps someone here can help me with my regular expression I'm trying to build in my Java code.
The regular expression that I'm looking to build consists of any non-whitespace character up until it finds one or two <>= symbols and then any character thereafter. So both these Strings would match the expression:
City 1==London
Age>=18
The regular expression that I'm using is as follows:
(\\S+)([><=]){1,2}(.+)
However, group 1 always retrieves the first <>= symbol as in "City 1=". How can I make the <>= part greedy so that it retrieves both operator symbols?
Thanks.

Make the first group, the non-spaces, reluctant:
"(\\S+?)([<>=]{1,2})(.+)"

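The effect of the reluctant first group in the thread above can be checked with a short Python sketch, using the "Age>=18" sample. Python's raw strings need only a single backslash where the Java string literal needs two.

```python
import re

sample = "Age>=18"

# Greedy \S+ takes as much as it can and gives back one operator character.
greedy = re.match(r"(\S+)([<>=]){1,2}(.+)", sample).groups()

# Reluctant \S+? stops as soon as the operator group can match.
reluctant = re.match(r"(\S+?)([<>=]{1,2})(.+)", sample).groups()

print(greedy)
print(reluctant)
```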
In a Regular expressions I can set up an "OR" statement?

Hi, I'm using e-Tester version 8.2. My problem is about regular expressions.
I need to capture a dynamic value from a form field (My-TxtBox). This textbox gets prepopulated data (when the customer has info in the database). This is the code:
*<input name="My-TxtBox" type="text" value="XXXX" id="ID-TxtBox"*
---- ( XXXX = dynamic number prepopulated by the web app)
So, I'm using this regular expression:
*<input name="My-TxtBox" type="text" value="(.+?)" id="ID-TxtBox"*
-----At this point, everything is fine.. but there is an exception
I'm getting a big problem when the customer doesn't have data in the server, the code is NOT like this
<input name="My-TxtBox" type="text" value="" id="ID-TxtBox"
When the customer doesn't have data in the server, the code is more like this:
*<input name="My-TxtBox" type="text" id="ID-TxtBox"*
---- (please note that there is not a value parameter now)
So, I think there is no way to create a CDV that will work for both cases? Any idea how to solve this?
I was thinking that maybe in the regexp syntax you can create an "OR" statement. My idea was to create a CDV
that works for both cases: when there is the "value=" string and when there is not.
something like this
This CDV returns the dynamic value when there is the "value=" string =
<input name="My-TxtBox" type="text" value="(.+?)" id="ID-TxtBox"
And this CDV returns "" when there is no "value=" string = Without value:
<input name="My-TxtBox" type="text" (.*?)id="ID-TxtBox"
My idea is to place something like this in some point of the CDV = *{* value="(.+?)" *OR* (.*?) *}*
so my dream is to create a CDV similar to this:
<input name="My-TxtBox" type="text" { value="(.+?)" *OR* (.*?) }id="ID-TxtBox
I was searching on Google but I simply can't find an answer. Is it possible to place an OR statement into a regexp, and what is the syntax?
Regards.. I appreciate your time.

Hola,
You can use a regular expression such as:
<input name="My-TxtBox" type="text" ?v?a?l?u?e?=?"?(.*?)"? id="ID-TxtBox"
Note that:
* There is a question mark (?) after each character that may or may not appear in the string. This means that the character is optional.
* Instead of using (.+?) you should use (.*?).
This means that it will match any character that appears zero or multiple times. The question mark here means that it is non-greedy, meaning that the .* will not swallow anything that matches the rest of the pattern (in this case the rest of the pattern is "? id="ID-TxtBox").
* The question mark in front of the v of value is there to match a space that may or may not exist.
A few other facts:
* In regular expressions the parentheses determine the sequence of operations and mark groups. Such groups can be referenced in code (not in eTester but in general). eTester will always get the value of the first group (first group = first set of parentheses).
* ORs in regular expressions can be expressed with the pipe "|" (without the quotes), but you will need parentheses in this case, which would not allow you to capture the group of characters that you want.
Regards,
[Z]{1}uriel C?
Edited by: Zuriel on Oct 5, 2009 3:07 PM
Edited to avoid having the text changed by the forum formatting options.
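In engines that support non-capturing groups, another way to express "the value attribute may be missing" is to make the whole attribute optional with (?: ... )?, which keeps the wanted text in group 1. The sketch below uses Python's re module; whether e-Tester's engine accepts (?: ... ) is an assumption that would need checking, and the value 1234 is made up.

```python
import re

pattern = re.compile(
    r'<input name="My-TxtBox" type="text"(?: value="(.*?)")? id="ID-TxtBox"'
)

with_value = '<input name="My-TxtBox" type="text" value="1234" id="ID-TxtBox"'
empty_value = '<input name="My-TxtBox" type="text" value="" id="ID-TxtBox"'
no_value = '<input name="My-TxtBox" type="text" id="ID-TxtBox"'

captured = [pattern.search(html).group(1) for html in (with_value, empty_value, no_value)]
print(captured)
```

Group 1 is None when the attribute is absent, so the caller can tell "no attribute" apart from "empty value".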

Java Regular Expressions and Pattern

I have a file that i first want to get all the lines that match a given pattern. Then from these lines that match i want to extract two values.
Example line for the pattern to match
INFO | jvm 1 | 2006/11/07 15:14:09 | INFO | Tue Nov 07 15:14:09 CET 2006 | XLDB PPS Data Dumper: MESSAGE:- 406 Processing .. '[ /opt/nexus/horizon/raw_data/network/pp_CE01S4H_sta_20050703T015717_SYDP3001_546.bdf ]'
So all the lines that are like these i want to extract two variables
2006/11/07 15:14:09
and
/opt/nexus/horizon/raw_data/network/pp_CE01S4H_sta_20050703T015717_SYDP3001_546.bdf
so i can store these variables in a database.
Can someone help me with writing the pattern to match and the regular expression to extract? Also, if anyone has a better way of doing this I am all ears, as I have a lot of log files to go through.

import java.util.regex.*;

class Main
{
  public static void main(String[] args)
  {
    String txt="INFO | jvm 1 | 2006/11/07 15:14:09 | INFO | Tue Nov 07 15:14:09 CET 2006 | XLDB PPS Data Dumper: MESSAGE:- 406 Processing .. '[ /opt/nexus/horizon/raw_data/network/pp_CE01S4H_sta_20050703T015717_SYDP3001_546.bdf ]'";
    String re1=".*?";     // Non-greedy match on filler
    String re2="((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";     // Time Stamp 1
    String re3=".*?";     // Non-greedy match on filler
    String re4="((?:\\/[\\w\\.]+)+)";     // Unix Path 1
    Pattern p = Pattern.compile(re1+re2+re3+re4,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find())
    {
        // Group 1 is the first timestamp, group 2 the first Unix path after it
        String timestamp1=m.group(1);
        String unixpath1=m.group(2);
        System.out.print("("+timestamp1+")"+"("+unixpath1+")"+"\n");
    }
  }
}
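A shorter pattern is possible if the log layout is as regular as the sample line suggests. The following Python sketch is one hypothetical alternative, not a translation of the Java code above: it assumes the wanted timestamp sits between two pipes and the path sits inside the last pair of square brackets.

```python
import re

line = ("INFO | jvm 1 | 2006/11/07 15:14:09 | INFO | Tue Nov 07 15:14:09 CET 2006 | "
        "XLDB PPS Data Dumper: MESSAGE:- 406 Processing .. "
        "'[ /opt/nexus/horizon/raw_data/network/"
        "pp_CE01S4H_sta_20050703T015717_SYDP3001_546.bdf ]'")

# Group 1: yyyy/mm/dd hh:mm:ss between pipes; group 2: the bracketed path.
LOG_LINE = re.compile(
    r"\|\s*(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s*\|.*\[\s*(\S+)\s*\]"
)

match = LOG_LINE.search(line)
timestamp, path = match.group(1), match.group(2)
print(timestamp)
print(path)
```

Lines that do not match return None from search(), so the same pattern can serve as both the filter and the extractor.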

Regular Expressions in CS5.5 - something is wrong

Hello Everybody,
Please correct me, but I think, I found a serious problem with regular Expressions in Indesign CS5.5 (and possibly in other apps from CS5.5).
Let's start with simple example:
var range = "a-a,a,a-a,a";
var regEx = /(a+-a+|a+)(,(a+-a+|a+))*/;
alert( "Match:" +regEx.test(range)+"\nLeftContext: "+RegExp.leftContext+"\nRightContext: "+RegExp.rightContext );
What I expected was true match and the left and the right context should be empty. In Indesign CS3 that is correct BUT NOT in CS5.5.
In CS 5.5 it seems that the only first "a-a" is matched and the rest is return as the rightContext - looks like big change (if not parsing error in RegExp engine).
Please correct me if I am wrong.
The second example - how to freeze ID CS5.5:
var range = "a-a,a,a-a,a";
var regEx = /(a+-a+|a+)(,(a+-a+|a+)){8,}/;
alert( "Match:" +regEx.test(range)+"\nLeftContext: "+RegExp.leftContext+"\nRightContext: "+RegExp.rightContext );
As you can see it differs only with the {8,} part instead of *
Run it in CS5.5 and you will see that ID hangs (in CS3 of course it runs flawlessly).
The third example - how to freeze ID 5.5 in one line (I posted it earlier in the Photoshop forum because a similar problem was raised there earlier):
alert((/(n|s)? /gmi).test('s') );
As you can guess - it freezes the CS5.5 (CS3 passes the test).
Please correct me if I am doing something wrong or it's the problem of Adobe.
Best regards,
Daniel Brylak

Hi Daniel,
Thanks for sharing. Really annoying indeed.
Just to complete your diagnosis, what you describe about CS5.5 is the same in CS5, while CS4 behaves as CS3.
var range = "a-a,a,a-a,a";
var regEx = /(a+-a+|a+)(,(a+-a+|a+))*/;
alert([
    "Match:" +regEx.test(range),
    "LeftContext: "+RegExp.leftContext.toSource(),        // => CS3/4: EMPTY -- CS5+: EMPTY
    "RightContext: "+RegExp.rightContext.toSource()        // => CS3/4: EMPTY -- CS5+: ",a,a-a,a"
    ].join('\r'));
So there is a serious implementation problem of the RegExp object from ExtendScript CS5.
I don't think it's related to the greedy modes. By default, JS RegExp quantifiers are greedy, and /a*/ still entirely captures "aaaaaa" in CS5+.
By the way, you can make any quantifier non-greedy by adding ? after the quantifier, e.g.: /a*?/, /a+?/, etc.
I guess that Adobe ExtendScript has a generic issue in updating the RegExp.lastIndex property in certain contexts—see http://forums.adobe.com/message/3719879#3719879 —which could explain several bugs such as the Negative Class bug —see http://forums.adobe.com/message/3510078#3510078 — or the problems you are mentioning today.
@+
Marc
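As a cross-check of what the first test expression is expected to return, here is the same pattern run through Python's re module. This only shows the result of one other backtracking engine; it is not ExtendScript and does not explain the CS5 behaviour.

```python
import re

text = "a-a,a,a-a,a"
m = re.match(r"(a+-a+|a+)(,(a+-a+|a+))*", text)

matched = m.group(0)
right_context = text[m.end():]

# Greedy versus non-greedy quantifier on a plain run of characters.
greedy_run = re.match(r"a*", "aaaaaa").group(0)
lazy_run = re.match(r"a*?", "aaaaaa").group(0)

print(repr(matched), repr(right_context), repr(greedy_run), repr(lazy_run))
```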

Introduction to regular expressions ... last part.

Continued from Introduction to regular expressions ... continued., here's the third and final part of my introduction to regular expressions. As always, if you find mistakes or have examples that you think could be solved through regular expressions, please post them.
Having fun with regular expressions - Part 3
In some cases, I may have to search for different values in the same column. If the searched values are fixed, I can use the logical OR operator or the IN clause, like in this example (using my brute force data generator from part 2):
SELECT data
FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
WHERE data IN ('abc', 'xyz', '012');
There are of course some workarounds as presented in this asktom thread but for a quick solution, there's of course an alternative approach available. Remember the "|" pipe symbol as OR operator inside regular expressions? Take a look at this:
SELECT data
FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
WHERE REGEXP_LIKE(data, '^(abc|xyz|012)$')
;
I can even use strings composed of values like 'abc, xyz , 012' by simply using another regular expression to replace "," and spaces with the "|" pipe symbol. After reading part 1 and 2 that shouldn't be too hard, right? Here's my "thinking in regular expression": Replace every "," and 0 or more leading/trailing spaces.
Ready to try your own solution?
Does it look like this?
SELECT data
FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
WHERE REGEXP_LIKE(data, '^(' || REGEXP_REPLACE('abc, xyz , 012', ' *, *', '|') || ')$')
;
If I didn't use the "^" and "$" metacharacters, this SELECT would search for any occurrence inside the data column, which could be useful if I wanted to combine a LIKE and an IN clause. Take a look at this example where I'm looking for 'abc%', 'xyz%' or '012%' and adding a case insensitive match parameter to it:
SELECT data
FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
WHERE REGEXP_LIKE(data, '^(abc|xyz|012)', 'i')
;
An equivalent non regular expression solution would have to look like one of these, not mentioning other options such as adding an extra "," and using the INSTR function:
SELECT data
FROM (SELECT data, LOWER(data) search
          FROM TABLE(regex_utils.gen_data('abcxyz012', 4)))
WHERE search LIKE 'abc%'
    OR search LIKE 'xyz%'
    OR search LIKE '012%'
;
SELECT data
FROM (SELECT data, SUBSTR(LOWER(data), 1, 3) search
          FROM TABLE(regex_utils.gen_data('abcxyz012', 4)))
WHERE search IN ('abc', 'xyz', '012')
;
I'll leave it to your imagination what a complete non regular expression example with 'abc, xyz , 012' as the search condition would look like.
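The "turn a comma-separated list into an alternation" step can be tried outside the database. This is a sketch in Python rather than Oracle SQL; the sample data list is made up, and values containing regex metacharacters would additionally need escaping (for example with re.escape), which is left out here.

```python
import re

search_list = "abc, xyz , 012"

# Replace each comma plus surrounding spaces with the "|" alternation operator.
alternation = re.sub(r" *, *", "|", search_list)

exact = re.compile("^(" + alternation + ")$")                 # behaves like IN
prefix = re.compile("^(" + alternation + ")", re.IGNORECASE)  # like LIKE 'abc%' etc.

data = ["abc", "ABCx", "xyz0", "012", "0123", "zzz"]
exact_hits = [d for d in data if exact.search(d)]
prefix_hits = [d for d in data if prefix.search(d)]

print(alternation, exact_hits, prefix_hits)
```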
As mentioned in the first part, regular expressions are not very good at formatting, except for some selected examples, such as phone numbers, which in my demonstration, have different formats. Using regular expressions, I can change them to a uniform representation:
WITH t AS (SELECT '123-4567' phone
             FROM dual
            UNION
           SELECT '01 345678'
             FROM dual
            UNION
           SELECT '7 87 8787'
             FROM dual)
SELECT t.phone, REGEXP_REPLACE(REGEXP_REPLACE(phone, '[^0-9]'), '(.{3})(.*)', '(\1)-\2')
FROM t
;
First, all non-digit characters are filtered out; afterwards the remaining string is put into a "(xxx)-xxxx" format, without cutting off any phone numbers that have more than 7 digits. Such a conversion could also be used to check the validity of entered data and to update the value with a uniform format afterwards.
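The two nested replacements translate almost one to one into Python, which makes it easy to see what each step produces. The function name is made up for this sketch.

```python
import re

def normalize_phone(raw):
    digits = re.sub(r"[^0-9]", "", raw)                    # step 1: keep digits only
    return re.sub(r"(.{3})(.*)", r"(\1)-\2", digits)       # step 2: (xxx)-rest

phones = ["123-4567", "01 345678", "7 87 8787"]
formatted = [normalize_phone(p) for p in phones]
print(formatted)
```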
Thinking about it, why not use regular expressions to check the format of other values? How about an IPv4 address? I'll do this step by step, using 127.0.0.1 as the final test case.
First I want to make sure, that each of the 4 parts of an IP address remains in the range between 0-255. Regular expressions are good at string matching but they don't allow any numeric comparisons. What valid strings do I have to take into consideration?
Single digit values: 0-9
Double digit values: 00-99
Triple digit values: 000-199, 200-255 (this one will be the trickiest part)
So far, I will have to use the "|" pipe operator to match all of the allowed combinations. I'll use my brute force generator to check if my solution works for a single value:
SELECT data
FROM TABLE(regex_utils.gen_data('0123456789', 3))
WHERE REGEXP_LIKE(data, '^(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})$')
;
More than 255 records? Leading zeros are allowed, but checking all the records, there's no value above 255. First step accomplished. The second part is to make sure that there are 4 such values, delimited by a "." dot. So I have to check for 0-255 plus a dot 3 times and then check for another 0-255 value. Doesn't sound too complicated, does it?
Using first my brute force generator, I'll check if I've missed any possible combination:
SELECT data
FROM TABLE(regex_utils.gen_data('03.', 15))
WHERE REGEXP_LIKE(data,
                   '^((25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})$')
;
Looks good to me. Let's check on some sample data:
WITH t AS (SELECT '127.0.0.1' ip
             FROM dual
            UNION
           SELECT '256.128.64.32'
             FROM dual)
SELECT t.ip
FROM t WHERE REGEXP_LIKE(t.ip,
                   '^((25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})$')
;
No surprises here. I can take this example a bit further and try to format valid addresses to a uniform representation, as shown in the phone number example. My goal is to display every IP address in the "xxx.xxx.xxx.xxx" format, using leading zeros for 2 and 1 digit values.
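The octet pattern and the "three times octet plus dot, then one more octet" structure can be verified quickly in Python. The names OCTET, IPV4 and is_ipv4 are made up for this sketch; the regular expression itself is the one from the text.

```python
import re

OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})"
IPV4 = re.compile(r"^(" + OCTET + r"\.){3}" + OCTET + r"$")

def is_ipv4(candidate):
    return IPV4.match(candidate) is not None

samples = ["127.0.0.1", "256.128.64.32", "007.0.0.1", "1.2.3"]
results = [is_ipv4(s) for s in samples]
print(results)
```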
Regular expressions don't have any format models like for example the TO_CHAR function, so how could this be achieved? Thinking in regular expressions, I first have to find a way to make sure, that each single number is at least three digits wide. Using my example, this could look like this:
WITH t AS (SELECT '127.0.0.1' ip
             FROM dual)
SELECT t.ip, REGEXP_REPLACE(t.ip, '([0-9]+)(\.?)', '00\1\2')
FROM t
;
Look at this: leading zeros. However, that first value "00127" doesn't look too good, does it? If you thought about using a second regular expression function to remove any excess zeros, you're absolutely right. Just take the past examples and think in regular expressions. Did you come up with something like this?
WITH t AS (SELECT '127.0.0.1' ip
             FROM dual)
SELECT t.ip, REGEXP_REPLACE(REGEXP_REPLACE(t.ip, '([0-9]+)(\.?)', '00\1\2'),
                            '[0-9]*([0-9]{3})(\.?)', '\1\2')
FROM t
;
Think about the possibilities: Now you can sort a table with unformatted IP addresses, if that is a requirement in your application, or you may find other values where you can use that "trick".
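The pad-then-trim trick works the same way in any engine with backreferences. A Python sketch (the function name pad_ip and the second address list are made up) shows the intermediate and final values and the sorting use case:

```python
import re

def pad_ip(ip):
    widened = re.sub(r"([0-9]+)(\.?)", r"00\1\2", ip)              # prepend two zeros
    return re.sub(r"[0-9]*([0-9]{3})(\.?)", r"\1\2", widened)       # keep last 3 digits

padded = pad_ip("127.0.0.1")
ordered = sorted(["10.1.22.3", "9.200.1.1", "10.1.3.40"], key=pad_ip)
print(padded, ordered)
```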
Since I'm on checking INET (internet) type of values, let's do some more, for example an e-mail address. I'll keep it simple and will only check on the
"x@x.xxx", "x@x.xxxx" and "x@x.xx.xx" format, where x represents an alphanumeric character. If you want, you can look up the corresponding RFC definition and try to build your own regular expression for that one.
Now back to this one: At least one alphanumeric character followed by an "@" at sign which is followed by at least one alphanumeric character followed by a "." dot and exactly 3 more alphanumeric characters or 2 more characters followed by a "." dot and another 2 characters. This should be an easy one, right? Use some sample e-mail addresses and my brute force generator, you should be able to verify your solution.
Here's mine:
SELECT data
FROM TABLE(regex_utils.gen_data('a1@.', 9))
WHERE REGEXP_LIKE(data, '^[[:alnum:]]+@[[:alnum:]]+(\.[[:alnum:]]{3,4}|(\.[[:alnum:]]{2}){2})$', 'i');
Checking on valid domains, in my opinion, should be done in a second function, to keep the checks themselves simple, but that's probably a discussion about readability and taste.
How about checking a valid URL? I can reuse some parts of the e-mail example and only have to decide what type of URLs I want, for example "http://", "https://" and "ftp://", any subdomain and a "/" after the domain. Using the case insensitive match parameter, this shouldn't take too long, and I can use this thread's URL as a test value. But take a minute to figure that one out for yourself.
Does it look like this?
WITH t AS (SELECT 'Introduction to regular expressions ... last part.' URL
             FROM dual
            UNION
           SELECT 'http://x/'
             FROM dual)
SELECT t.URL
FROM t
WHERE REGEXP_LIKE(t.URL, '^(https*|ftp)://(.+\.)*[[:alnum:]]+(\.[[:alnum:]]{3,4}|(\.[[:alnum:]]{2}){2})/', 'i')
;
Update: Improvements in 10g2
All of you, who are using 10g2 or XE (which includes some of 10g2 features) may want to take a look at several improvements in this version. First of all, there are new, perl influenced meta characters.
Rewriting my example from the first lesson, the WHERE clause would look like this:
WHERE NOT REGEXP_LIKE(t.col1, '^\d+$')
Or my example with searching decimal numbers:
'^(\.\d+|\d+(\.\d*)?)$'
Saves some space, doesn't it? However, this will only work in 10g2 and future releases.
Some of those meta characters even include non matching lists, for example "\S" is equivalent to "[^ ]", so my example in the second part could be changed to:
SELECT NVL(LENGTH(REGEXP_REPLACE('Having fun with regular expressions', '\S')), 0)
FROM dual
;
Other meta characters support search patterns in strings with newline characters. Just take a look at the link I've included.
Another interesting meta character is "?" non-greedy. In 10g2, "?" not only means 0 or 1 occurrence, it means also the first occurrence. Let me illustrate with a simple example:
SELECT REGEXP_SUBSTR('Having fun with regular expressions', '^.* +')
FROM dual
;
This is the old style, "greedy" search pattern, returning everything until the last space.
SELECT REGEXP_SUBSTR('Having fun with regular expressions', '^.* +?')
FROM dual
;
In 10g2, you'd get only "Having " because of the non-greedy search operation. Simulating that behavior in 10g1, I'd have to change the pattern to this:
SELECT REGEXP_SUBSTR('Having fun with regular expressions', '^[^ ]+ +')
FROM dual
;
Another new option is the "x" match parameter. Its purpose is to ignore whitespaces in the searched string. This would prove useful in ignoring trailing/leading spaces for example. Checking on unsigned integers with leading/trailing spaces would look like this:
SELECT REGEXP_SUBSTR(' 123 ', '^[0-9]+$', 1, 1, 'x')
FROM dual
;
However, I have to be careful. "x" would also allow " 1 2 3 " to qualify as a valid string.
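For readers who want to compare the greedy pattern, a non-greedy pattern and the negated-class workaround side by side, here is a sketch using Python's re module. Note the assumption: in Python the "?" has to follow the ".*" itself (".*?") to make the dot part non-greedy, so the second pattern below is written that way and is not a character-for-character copy of the Oracle example.

```python
import re

text = "Having fun with regular expressions"

greedy = re.match(r"^.* +", text).group(0)         # runs up to the last space
non_greedy = re.match(r"^.*? +", text).group(0)    # stops at the first space
negated = re.match(r"^[^ ]+ +", text).group(0)     # workaround without "?"

print(repr(greedy), repr(non_greedy), repr(negated))
```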
I hope you enjoyed reading this introduction and hope you'll have some fun with using regular expressions.
C.
Fixed some typos ...
Message was edited by:
cd
Included 10g2 features
Message was edited by:
cd

Can I write this condition with only one reg expr in Oracle (regexp_substr in my example)? I mean using only regexp_substr in the select clause, without regexp_like in the where clause.
For a better understanding of what I'd like to get, here is an example:
I have strings of two blocks separated by a space.
The first block has 5 symbols of [01], the second block has 3 symbols of [01].
The first block may optionally contain one (!), the second block may optionally contain one (>).
The idea is to find such strings with only one reg expr using regexp_substr in the select clause, so that if a string does not satisfy the requirements, null is returned for it in the result set.
with t as (select '10(!)010 10(>)1' num from dual union all
select '1112(!)0 111' from dual union all --incorrect because of '2'
select '(!)10010 011' from dual union all
select '10010(!) 101' from dual union all
select '10010 100(>)' from dual union all
select '13001 110' from dual union all -- incorrect because of '3'
select '100!01 100' from dual union all --incorrect because of ! without (!)
select '100(!)1(!)1 101' from dual union all -- incorrect because of two occurrences of (!)
select '1001(!)10 101' from dual union all --incorrect because of length of block1=6
select '1001(!)10 1011' from dual union all --incorrect because of length of block2=4
select '10110 1(>)11(>)0' from dual union all --incorrect because of two occurrences of (>)
select '1001(>)1 11(!)0' from dual)--incorrect because (!) and (>) are met not in their blocks
--end of test data
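In an engine that supports lookahead, the rules above can be put into a single pattern: two negative lookaheads forbid a second (!) or (>), and each digit may be preceded by its block's marker. This is only a sketch in Python; it assumes lookahead support, which the Oracle 10g functions discussed here may not offer, so it is not a drop-in regexp_substr argument.

```python
import re

RULE = re.compile(
    r"^(?!.*\(!\).*\(!\))"            # at most one (!)
    r"(?!.*\(>\).*\(>\))"             # at most one (>)
    r"(?:(?:\(!\))?[01]){5}(?:\(!\))?"  # block 1: five 0/1 digits, optional (!)
    r" "
    r"(?:(?:\(>\))?[01]){3}(?:\(>\))?$"  # block 2: three 0/1 digits, optional (>)
)

samples = [
    "10(!)010 10(>)1", "1112(!)0 111", "(!)10010 011", "10010(!) 101",
    "10010 100(>)", "13001 110", "100!01 100", "100(!)1(!)1 101",
    "1001(!)10 101", "1001(!)10 1011", "10110 1(>)11(>)0", "1001(>)1 11(!)0",
]

valid = [s for s in samples if RULE.match(s)]
print(valid)
```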

Regular expressions with multi character separator

I have data like the
where |`| is the separator for distinguishing two fields of data. I am having trouble writing a regular expression to display the data correctly.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> declare
2 l_string varchar2 (200) :='123` 456 |`|789 10 here|`||223|`|5434|`}22|`|yes';
3 v varchar2(40);
4 begin
5 v:=regexp_substr(l_string, '[^(|`|)]+', 1, 1);
6 dbms_output.put_line(v);
7 v:=regexp_substr(l_string, '[^(|`|)]+', 1, 2);
8 dbms_output.put_line(v);
9 v:=regexp_substr(l_string, '[^(|`|)]+', 1, 3);
10 dbms_output.put_line(v);
11 v:=regexp_substr(l_string, '[^(|`|)]+', 1, 4);
12 dbms_output.put_line(v);
13 v:=regexp_substr(l_string, '[^(|`|)]+', 1, 5);
14 dbms_output.put_line(v);
15 end;
16 /
123
456
789 10 here
223
5434
I need it to display
123` 456
789 10 here
|223
5434|`}22
yes
I am not sure how to handle multi character separators in data using reg expressions
Edited by: Clearance 6`- 8`` on Apr 1, 2011 3:35 PM
Edited by: Clearance 6`- 8`` on Apr 1, 2011 3:37 PM

Hi,
Actually, using non-greedy matching, you can do what you want with regular expressions:
VARIABLE     l_string     VARCHAR2 (100)
EXEC :l_string := '123` 456 |`|789 10 here|`||223|`|5434|`}22|`|yes'
SELECT     LEVEL
,     REPLACE ( REGEXP_SUBSTR ( '|`|' || REPLACE ( :l_string
                                 , '|`|'
                                  , '|`||`|'
                                 ) || '|`|'
                    , '\|`\|.*?\|`\|'
                    , 1
                    , LEVEL
                    )
           , '|`|'
           )     AS ITEM
FROM     dual
CONNECT BY     LEVEL     <= 7
;
Output:
LEVEL ITEM
    1 123` 456
    2 789 10 here
    3 |223
    4 5434|`}22
    5 yes
    6
    7
Here's how it works:
The pattern
~.*?~
is non-greedy; it matches the smallest possible string that begins and ends with a '~'. So
REGEXP_SUBSTR ('~SHALL~I~COMPARE~THEE~', '~.*?~', 1, 1)
returns '~SHALL~'. However,
REGEXP_SUBSTR ('~SHALL~I~COMPARE~THEE~', '~.*?~', 1, 2)
returns '~COMPARE~'. Why not '~I~'? Because the '~' between 'SHALL' and 'I' was part of the 1st match, so it can't be part of the 2nd match. So the first thing we have to do is double the delimiters; that's what the inner REPLACE does. Then we add delimiters to the beginning and end of the list. Once we've prepared the string like that, we can use the non-greedy REGEXP_SUBSTR to bring back the delimited items, with a delimiter at either end. We don't want those delimiters, so the outer REPLACE removes them.
I'm not sure this is any better than Sri's solution.
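The root of the original problem is that '[^(|`|)]+' is a character class: it excludes the individual characters "(", "|", "`" and ")" rather than the three-character sequence. A Python sketch makes the difference visible; re.split is used here as a stand-in for a "split on a literal multi-character delimiter" operation and is not an Oracle function.

```python
import re

data = "123` 456 |`|789 10 here|`||223|`|5434|`}22|`|yes"

# Character class: breaks the string at every single "(", "|", "`" or ")".
by_class = re.findall(r"[^(|`)]+", data)

# Literal three-character delimiter, with the pipes escaped.
by_delimiter = re.split(r"\|`\|", data)

print(by_class[:5])
print(by_delimiter)
```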

[Regular Expressions] Saving a variable number of matches

I'm stuck with the following problem and I don't seem to be able to solve without lots of ifs and else's.
I've got a program that you can pass patterns as parameters to. The program receives patterns as one single string.
The string could look like this:
a:i:foo r::bar t:ei:bark
or like this:
a:i:foo
What I'm hinting at is that the string comprises of several parts of the same structure. Each structure can be matched and saved with:
([art]:[ei]{0,2}:.*)
Now I want my regular expression able to match all the occurences without checking the string containing the pattern for something that could indicate the number of structures inside it. The following does not seem to work:
([art]:[ei]{0,2}:.*)+
So now I'm looking for something that would match one or more occurence of the structure and save it for future use.
I'd be really happy if someone could help me out here
Last edited by n0stradamus (2012-05-03 20:27:02)

Procyon wrote:
--> echo "a:i:foo r::bar t:ei:bark" | sed 's/\([art]:[ei]\{0,2\}:[^ ]*\)/1/'
1 r::bar t:ei:bark
--> echo "a:i:foo r::bar t:ei:bark" | sed 's/\([art]:[ei]\{0,2\}:[^ ]*\)/1/g'
1 1 1
If [^ ]* is not usable (spaces are allowed arbitrarily), you need a non-greedy .* and non-consuming look-ahead of " [art]:"
In python's re module, this is .*?(?=( [art]:|$))
>>> import re
>>> m=re.findall("([art]:[ei]{0,2}:.*?(?=( [art]:|$)))","a:i:foo r::bar t:ei:bark")
>>> print(m)
[('a:i:foo', ' r:'), ('r::bar', ' t:'), ('t:ei:bark', '')]
Exactly what I was looking for! I didn't know that you could specify .* to stop at a certain sequence of characters.
Could you please point me to some materials where I can read up on the topic?
Back to the regex: It works fine in Python, but sadly that is not the language I'm using
The program I need this for is written in C and until now the regex functions from glibc worked fine for me.
Have I missed a function similar to re.findall in glibc?
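As a side note on the Python version quoted above: findall returns tuples there because the lookahead contains a capturing group. A small variation, sketched below with non-capturing syntax, returns just the matched structures. This is still Python and does not answer the glibc question.

```python
import re

patterns = "a:i:foo r::bar t:ei:bark"

# Lazy .*? stops where the lookahead sees the next " x:" prefix or the end.
structures = re.findall(r"[art]:[ei]{0,2}:.*?(?= [art]:|$)", patterns)
print(structures)
```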
