Interesting Regex dilemmas

I need to do the following:
1. Given a list of regular expressions, find out which ones match some string. Would it be faster to do this in a for-loop, or to OR all the expressions into one large expression and use capture groups?
2. Given a list of regular expressions, is it possible to tell whether 2 or more of them will match SOME (unknown) string?
3. Is it possible to write 10 regular expressions so that "priority" is encoded in the expression, so that if 5 of those expressions match a given string, we know which of those 5 has the highest priority?
Please, kind sirs, advise (;

> 1. given a list of regular expressions find out which ones match some
> string, would i be faster to do this in a for-loop or OR all expressions
> into one large expression and use capture groups.
Profile it. But I'd favour the loop as more transparent.
> 2. given a list of regular expressions is it possible to tell if 2 or more
> expressions will match on SOME(unknown) string
Depends. By "regular expressions" do you mean regular expressions (in which case it is possible, although you'll have to write a moderate amount of code to do it) or expressions accepted by java.util.regex.Pattern, in which case it's at best not so easily proven to be possible?
> 3. is it possible to write 10 regular expressions so that "priority"
> is encoded in the expression, so that if 5 of those expressions
> match on a given string, we will know which of those 5 will have
> the highest priority.
Probably, but don't.
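For the first question, the two approaches can be put side by side before profiling them. The following is only a minimal sketch: the class and method names are made up for illustration, and the combined version assumes the individual expressions contain no capturing groups of their own (otherwise the group numbers shift).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhichMatch {
    // Approach 1: try every pattern separately; every pattern that matches is reported.
    static List<Integer> byLoop(List<Pattern> patterns, String input) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(input).find()) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Approach 2: OR all sources into one pattern, wrapping each in its own group.
    // Returns the index of the alternative that produced the first match, or -1.
    static int byAlternation(List<String> sources, String input) {
        StringBuilder combined = new StringBuilder();
        for (String source : sources) {
            if (combined.length() > 0) {
                combined.append('|');
            }
            combined.append('(').append(source).append(')');
        }
        Matcher m = Pattern.compile(combined.toString()).matcher(input);
        if (!m.find()) {
            return -1;
        }
        for (int g = 1; g <= m.groupCount(); g++) {
            if (m.group(g) != null) {
                return g - 1;
            }
        }
        return -1;
    }
}
```

Note that the two are not equivalent: for the expressions `[a-z]+`, `\d+` and `a.c` and the input "abc 123", the loop reports all three indices, while the combined expression reports only index 0, because an alternation stops at the first alternative that succeeds at the leftmost position. That ordering is also the closest thing to "priority" that a single expression offers.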

Similar Messages

Regex exec() limit or interesting work

Hello,
It says that exec is better and does not have the limit of 9:
https://forums.netiq.com/showthread....ctor-questions
I LiveDebugged my new supa-dupa-cool firewall collector and bumped into
the //.exec() limit, which is RegExp.$9. RegExp.$10 is empty.
But if I use variable=//.exec(); then variable[10] is fine. What is
the difference? I know one works and the other does not, but you know what I
mean. Is it something built in?
The RegExp.$9 value is the same as the variable[9] value. variable[14] is
still fine.
http://tinyurl.com/kalbhlk says that it is unlimited and that I have to use
variable=//.exec() instead of what the Collector Developer tutorial
says.
Thank you for your answers!
woodspeed

On 12/15/2014 06:24 AM, woodspeed wrote:
>
> It says that exec is better and has has not the 9 limit:
> https://forums.netiq.com/showthread....ctor-questions
> I LiveDebug my new supa-dupa-cool firewall collector and I bumped into
> the //.exec() limit, which is RegExp.$9. The RegExp.$10 is nothing.
> But if I use variable=//.exec(); variable[10] than it is good. What is
> the difference? I know it works and not works , but you know what I
> mean. Is it a built-in something?
The biggest differences that I have found are:
1. .exec() returns an array with no limits in size; as a result, it is
probably always the best way to return captured groups.
2. .test() gives you a nice boolean (match or not) and then you happen to
be able to get captured groups from the global RegExp object, but only up
through $9 as you noticed.
There may be performance benefits going one way or another, but I've done
a very small, statistically-insignificant, amount of testing and not
noticed much difference when the match works. When matches fail and the
regexes are not anchored (meaning the regex engine tries really hard to
find a match, at the cost of performance) they both quickly perform
terribly; I do not remember if one was worse than the other.
Summary: Use .exec(), assign the result to a variable, and then test for null if you want
to know if it succeeded/failed.
Good luck.

How can I extract multine data using a regex? ... driving me nuts!!

Hi everyone,
I have a file which I'm interested in extracting information from. Here is a snippet:
CGCNNNNGTAGTC
TAGTCNN
... my goal is to create a regex which extracts out everything beyond the N (on the first line), along with anything on the next line which is not N.
As a result, my desired outcome is "GTAGTCTAGTC"
What I've worked on so far (for my regex) is:
Pattern p = Pattern.compile("(N[ATGC].*)\\r\\n[ATGC].*N", Pattern.MULTILINE);
I've worked on it for some time now, and am very interested in advice as to how I can extract my desired information.
Thanks once again,
p.
Edited by: phosse1 on Apr 9, 2009 8:37 PM
Edited by: phosse1 on Apr 9, 2009 8:38 PM

Till one of the regex gurus comes around to shorten the regex by 50% and increase the readability and efficiency by 1000% ;-)
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CrossLineMatcher {
   public static void main(String[] args) {
      // The same two-line sample with the three common line terminators.
      String[] inputs = {
         "CGCNNNNGTAGTC\n" +
         "TAGTCNN",
         "CGCNNNNGTAGTC\r\n" +
         "TAGTCNN",
         "CGCNNNNGTAGTC\r" +
         "TAGTCNN",
      };
      // Group 1: non-N run after an N on the first line; group 2: non-N run before an N on the next line.
      String regex = "(?m)(?<=N)([^N]+)(?:$[\\r\\n]+^)([^N]+)(?=N)";
      Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
      for (String input : inputs) {
         Matcher matcher = pattern.matcher(input);
         while (matcher.find()) {
            System.out.println(matcher.group(1)+matcher.group(2));
         }
      }
      System.out.println("=======================");
      // Alternative: replace the whole input with the two captured runs.
      regex = "(?m).*?N([^N]+)(?:$[\\r\\n]+^)([^N]+)N.*?";
      for (String input : inputs) {
         System.out.println(input.replaceAll(regex, "$1$2"));
      }
   }
}
> everything beyond the N (on the first line), along with anything on the next line which is not N
You want to express a match for "anything that is not N", use [^N]
[ATGC] means "anything that is A, T, G or C"
db

Tokenize xml maybe using RegEx?

I'm working on a simple search engine for dynamically loaded XML data. I have data of this form (more or less):
<sessions>
<session id=##>
          <title><![CDATA[The Title]]></title>
          <presenter>A person or two goes here with title</presenter>
          <date>2011-2-15-10-00-a</date>
          <webex><![CDATA[https://alink.com]]></webex>
          <audience> <![CDATA[Various types that might be interested]]> </audience>
          <desc><![CDATA[A longish description that might include some simple html tags like bold or some lists]]></desc>
          <resources>
          <resource>
          <name><![CDATA[A slide deck, website, white patper, etc.]]></name>
          <link active="true"><![CDATA[thelink to the resourcepdf]]></link>
          <tip><![CDATA[A description of what is at the site or why the resource is interesting]]></tip>
          </resource>
     <resource>
     </resource>
          </resources>
</session>
<session>
</sessions>
</sessions>
I need to break apart all the "useful" words and run them through my indexer. Currently I'm using e4x to pull out certain nodes and get the content as a string. Then I'm using something like this to break it all up:
var tokens:Array=[" - ","?",",","."....etc];
for(var i:int=tokens.length-1;i>=0;i--){
   str=str.split(tokens[i]).join(" ");
}
Is there a quicker, more efficient, better way to do this? I'm just learning about RegEx and think it could maybe have some use here, but I'm not all that good with it. Part of the problem here is that the tokens array needs to take into account all of the possible characters that could signal divisions between words. But there are so many of them. It might be simpler to go the route of listing the things we want to keep. That list is much shorter:
remove any xml tags and remove any html tags
keep A-Za-z (including accented letters such as grave, acute, umlaut, etc.)
keep ' or - when they are in the middle of a word, i.e. surrounded by letters
everything else goes
So there are really two parts to this:
1. What is the best, fastest, easiest way to extract all the data from the xml.
2. What is the most reliable easiest way to break all that data into just the words.

What I'm seeing in these RegExps:
re1 in English:
Globally in the current string, any <![cdata[ or ]]> or <*> or http(s)://*\s or /.,(3 chars)"“”!(not)?(up to)@#$%^(and)*(nothing)[](n/a amount)(invalid range)–—(one or more):;<>©®™= or -(invalid range)(one or more than)
The substitutions are in (parens).
Think of RegExp as a language whose sole purpose is to give you a ton of wildcards with programmatic-like features to "describe" the content you want. Be careful with characters like ! (exclamation point): unlike in AS3, ! is not a "not" operator in a RegExp. On its own it is a literal character, and it only takes part in negation inside a negative lookahead, (?!...). Negation is normally written with a negated character class, [^...].
So to match a string that has NO lowercase 'a':
/^[^a]*$/
That means the (not) in the description above is only a loose reading; inside brackets the ! is just a literal exclamation point. If you explicitly want a character, the safest thing you can do is escape it just like you did with the brackets. To match an exclamation point:
/\!/
It's just like "reserved words" in coding. You'd never make a variable name like 'for' or 'if' because you know the compiler will balk. Same deal with RegExp. Knowing what the operators are (|,[,],^,$,{,},(,),.,*,+,?,\,etc) will help solidify your meaning. There are tons of reference guides out there, but since Perl was the big proponent of regular expressions I often just follow the simple PHP preg_* function syntax reference (click the links at the top for categories: http://php.net/manual/en/reference.pcre.pattern.syntax.php )
Any time you add in an "or" with | you're better off making a new RegExp for that. It's much easier to debug smaller complicated RegExps than a string of a bunch all together. trace() your string between every step to see which RegExps are misbehaving and medicate as needed.
For your example, from what I assume you want to do is just remove things. I'd do it like so:
removal of CDATA wrapper:
var str:String = '<![cdata[this is some text]]> moo';
var cdataRe:RegExp = /\<\!\[cdata\[(.*?)\]\]\>/i;
str = str.replace(cdataRe,"$1");
trace(str);
// trace: this is some text moo
This is a replace that shows you parenthesis's ability to capture text. Captured text will be put (in order of parenthesis) inside variables $1, $2, $3, etc. I captured the text between CDATA tags and my replacement was only the text inside it.
removal of any HTTP(S), RTMP, FTP links:
// important to note no space after, but will match
// taking out ftp,http,rtmp
var str:String = 'this is some text http://www.moo.com/a/b/c/?ref=123&q=2 and https://www.foo.com/cpanel/?a=login.do links HTTPS:// RTMP://media.someserver.com/moo.flv ftp://woo:[email protected]';
var httpsRe:RegExp = /[fhr]t{1,2}m*ps*\:\/\/.*?\s+|[fhr]t{1,2}m*ps*\:\/\/.*$/igm;
str = str.replace(httpsRe,'');
trace(str);
// trace: this is some text and links
You get the idea. I'm describing every bit of the text as I go. I wanted to show a decent usage of the | (or) branch in the case of removing 2 different types of links: a link in the beginning or middle of a sentence will have a space after it, while a link at the end of the string has no space. However, it's not perfect. Run tests on it and you'd see that if a link ended up at the end of a sentence, followed by a period, that period would get eaten too. It's exceptions like "no space after it" or "end of a sentence" that greedy RegExps need a lot of extra conditional logic for. That's why I wouldn't bundle more than a single purpose into one RegExp, because when you REALLY field test against data those seemingly simple one-purpose RegExps end up being huge.
The re2 I see above seems to have some very specific data sent to it. It's saying: A string containing a return or newline or space followed by quotes or apostrophe or dash followed by quotes or apostrophe or dash followed by a return or newline or space or just one or more spaces.
That's a pretty weird RegExp. That would match something like this:
var a:String = "
// or
var b:String = ' "" ';
The final 'or' is the only thing I'd condense because you have it in your bracket already. You're saying at the end [\r\n\s] or \s+. So:
var re2:RegExp = /[\r\n\s]["'\-]{2}[\r\n]*\s*/gm;
Writing it like that just states either return or newline or one or more spaces will match. You can see the usage of braces marking the range of matches I desire, so {2} means I need 2 of the previous characters specified in a row. 1-5 characters specified in a row is just as easy, /[a-z]{1,5}/ means from 1 to 5 lowercase letters from a to z.

Regex to split a URL, not matching

I'm trying to port Steve Levithan's parseUri JavaScript function (http://blog.stevenlevithan.com/archives/parseuri) to Java, but I'm having problems getting the regular expression to match anything, even basic URLs. I can confirm that the regular expression
^(?:(?![^:@]+:[^:@\/]*@)([^:\/?#.]+):)?(?:\/\/)?((?:(([^:@]*):?([^:@]*))?@)?([^:\/?#]*)(?::(\d*))?)(((\/(?:[^?#](?![^?#\/]*\.[^?#\/.]+(?:[?#]|$)))*\/?)?([^?#\/]*))(?:\?([^#]*))?(?:#(.*))?)
matches http://www.google.com/, but I can't get it to work in Java. The code I'm using is:
String spec = "http://www.google.com/";
HashMap tempUri = new HashMap();
String[] parts = {"source","protocol","authority","userInfo","user","password","host","port","relative","path","directory","file","query","anchor"};
Pattern pattern = Pattern.compile("^(?:(?![^:@]+:[^:@/]*@)([^:/?#.]+):)?(?://)?((?:(([^:@]*):?([^:@]*))?@)?([^:/?#]*)(?::(\\d*))?)(((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[?#]|$)))*/?)?([^?#/]*))(?:\\?([^#]*))?(?:#(.*))?)");
Matcher matcher = pattern.matcher(spec);
String match;
int i=0;
while(matcher.find()) {
    try {
        match = matcher.group(i);
        tempUri.put(parts[i], match);
    } catch(Exception ex) {
        tempUri.put(parts[i],"");
    }
    i++;
}
The problem is, the matcher only ever finds one match (the whole string), when it should be splitting it. I'm sure I'm escaping the characters properly.
Edited by: Echilon on Nov 3, 2008 5:08 AM

As usual, after trying for hours in vain, I stumble upon the answer just after posting. :P
Rather than while(matcher.find()), I used if(matcher.find()), then looped through the groups.
For anyone interested, http://leghumped.com/blog/2008/11/03/java-matching-urls-with-regex-wildcards/ might help :)
Edited by: Echilon on Nov 3, 2008 11:08 AM
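The fix described above (one call to find(), then a loop over the groups of that single match) could look roughly like this. To keep the example short it uses a much simpler, made-up pattern and part list, not the parseUri expression from the question; only the control flow is the point.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlParts {
    // Simplified stand-in pattern (an assumption for illustration, NOT parseUri).
    private static final Pattern URL =
        Pattern.compile("^(\\w+)://([^/:?#]+)(?::(\\d+))?([^?#]*)(?:\\?([^#]*))?");
    // One name per group; index 0 is the whole match.
    private static final String[] NAMES = {"source", "protocol", "host", "port", "path", "query"};

    static Map<String, String> parse(String spec) {
        Map<String, String> parts = new LinkedHashMap<>();
        Matcher m = URL.matcher(spec);
        if (m.find()) {                          // one match, many groups
            for (int i = 0; i <= m.groupCount(); i++) {
                String g = m.group(i);           // null if an optional group did not take part
                parts.put(NAMES[i], g == null ? "" : g);
            }
        }
        return parts;
    }
}
```

find() advances from match to match, while group(i) selects a part within the current match, which is why the original while loop only ever saw one result.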

Non-recursion-based regex

I'm doing some heavy-duty, very special-case regexing. In fact I have to build the expressions programmatically, as the actual regular expressions can grow to over 60,000 characters in length.
I'm doing this in 1.3.1 and have tried a couple of free/open source regex packages. As one might imagine, I get StackOverflowErrors very often in every package I've tried.
If this is such a problem with regular expressions, it seems someone would have created a Java regexp package that doesn't rely so heavily on recursion. (In fact people have, in other languages; I just haven't found one in Java.)
Anyway, my first question is: is there a Java non-recursion-based regex package out there?
I know I can increase the stack size, but to what?! I build my regexes on the fly, and the data they operate on is also generated on the fly.
I think an OutOfMemoryError with a non-recursive implementation would be a lot less likely than a StackOverflowError. ???
It's Saturday (as you can see) and it looks more and more like I may have to write/port a regex package that fits the bill. This is really a pain.
Another possibility is to use GNU Kawa (which implements proper tail recursion) to compile a Scheme regexp implementation into something I can use from Java. This seems really convoluted.
Any ideas?

> Indeed, a 60,000 char regex is, to be blunt,
> ridiculous. There has to be another approach to
> whatever problem you are solving (is it biological in
> nature, perchance?).
It's pattern recognition of structures of objects.
My problem is that what I would REALLY like to do boils down to "regular expressions" over a 2-dimensional array of arbitrary objects, not characters.
As a completely inane example: 3 Strings whose length is between 7 and 19 that don't start with "foo" or "bar", followed by any number of objects implementing Runnable but not Comparable.
I started doing this from scratch, but I thought a faster way to get things done was to cheat.
So I created a system to translate the objects I'm interested in to a "compact" String representation, right now it's about 32 characters long. I can specify properties I'm looking for with this representation, or specify I don't care with '.' values in the fields I don't care about. So for example a lowercase e in a particular index represents non-editable, uppercase means editable, '.' means I don't care. Some field/properties require multiple characters to represent them.
Now the patterns I'm looking for are in a 2 dimensional array of objects.
When you start multiplying hundreds of objects by hundreds of objects by 32 characters per object ... You get a pretty big regex.
Another option I've started on is custom, from-scratch "Java Object Regular Expressions", or JOREs as I like to call them (I already started coding this and needed a package name).
(Does anything like this already exist? I haven't found it.)
Anyway, 60,000 is bordering on worst case scenarios. It could happen. 7000-ish is a more realistic average I'd expect, but it doesn't make worst cases go away.
It seems even in many smaller cases I get a StackOverflowError, but there are probably more nested .{,}[^]+* type things in those to make them more complex.
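To make the "regular expressions over objects" idea concrete, here is a very small loop-based sketch of the inane example above. Everything in it is an assumption for illustration: the names are invented, "between 7 and 19" is read as inclusive, it uses Java 8 lambdas rather than anything available in 1.3.1, it handles a one-dimensional list only, and it is greedy with no backtracking, so it is far weaker than a real regex engine. What it does show is that matching predicates in a loop needs no recursion and no string encoding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class ObjectSeq {
    // One "atom" of the pattern: a test plus how many consecutive objects it may cover.
    static final class Step {
        final Predicate<Object> test;
        final int min;
        final int max;

        Step(Predicate<Object> test, int min, int max) {
            this.test = test;
            this.min = min;
            this.max = max;
        }
    }

    // A String of length 7..19 (inclusive) that starts with neither "foo" nor "bar".
    static final Predicate<Object> PLAIN_STRING = o -> {
        if (!(o instanceof String)) {
            return false;
        }
        String s = (String) o;
        return s.length() >= 7 && s.length() <= 19
            && !s.startsWith("foo") && !s.startsWith("bar");
    };

    // Implements Runnable but not Comparable.
    static final Predicate<Object> RUNNABLE_ONLY =
        o -> o instanceof Runnable && !(o instanceof Comparable);

    // "3 such Strings, followed by any number of such Runnables".
    static List<Step> example() {
        return Arrays.asList(
            new Step(PLAIN_STRING, 3, 3),
            new Step(RUNNABLE_ONLY, 0, Integer.MAX_VALUE));
    }

    // Greedy and iterative: each step consumes as many items as it can (up to max);
    // the match succeeds only if every step met its minimum and all items were consumed.
    static boolean matches(List<Step> steps, List<?> items) {
        int pos = 0;
        for (Step step : steps) {
            int taken = 0;
            while (taken < step.max && pos < items.size() && step.test.test(items.get(pos))) {
                pos++;
                taken++;
            }
            if (taken < step.min) {
                return false;
            }
        }
        return pos == items.size();
    }
}
```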

Regex for Java comments

Hi everyone,
Can anyone be so kind and post the simplest working regex pattern that matches all Java comments (multi-line, not in quoted strings, etc).
I'm starting to use (Java) REs and have been fighting with various combinations of |s, $s, \\*s, Pattern.MULTILINEs and Pattern.DOTALLs for a while now - extending the simple example shown at http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex , under "File Searching".
Thanks for your time and help!
-Nido

Hi Nido,
I would also be interested in a simple fix. I looked at this for a while and came up short of an easy solution that would cover all contingencies.
You can easily match a single line comment since it ends at the line return. I tried the pattern
//[^\r\n]*
which works predictably, but of course it doesn't account for the possibility of being inside a quote. You could of course match the whole line and then begin looking for quotes before the "//" but you would also have to look to see whether the quote closes before the "//" or not. And when matching quote marks you also would have to check whether they were escaped or not, since an escaped quote would itself be inside a quoted string.
This is far from your simple solution and implies using a combination of regular expressions and string tokenizers or plain old "indexOf()" matches, so that you could count the number of quotes before the "//", see whether the quote closes or not, and verify that no escaped quote marks are treated as quotes themselves.
For multi-line comments you also have a challenge. In fact I have a question for you and the rest of the community here. How solid is the support for matching multiple-line patterns at this point? I haven't been able to get it to work reliably for me yet, but that may be my misunderstanding. Regardless, if you just strip the line returns from the string (or whatever you're parsing) it will behave as predictably as a single line.
One trick I've used before (works with other languages as well) is to replace all the line returns with a token of my choice so that I can parse it reliably (since it's now a single line) and then after parsing I can replace the tokens with the original line returns. This makes multi-line parsing very solid but requires at least two extra lines of code.
Regardless, my poor brain has a hard time thinking of a solid way to match a multi-line comment. The danger is that regular expressions use "greedy" matching by default. That is, they try to match as much content as possible by the definition you provide. If you create a catch-all multi-line comment match such as:
/\*.*\*/
you run the risk of matching all the content found between two or more multi-line comments. You can use
/\*[^\*]*\*/
which will definitely not overrun the end of the comment, but it will not work properly if you have a "*" anywhere inside the content of your comment.
I wonder if you would be better off using a tokenizer or "indexOf()" type system on the multi-line comments so that you could get away from the greedy pattern matching? The other benefit is that line returns would be handled predictably.
Just some thoughts. Maybe I'm missing something here. Hope you get the simple answer you're looking for yet.
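One common way around both problems raised above (greedy overrun and comment markers inside quotes) is to let a single pattern match string and character literals as well as comments, and to keep only the comment alternatives. The sketch below is an assumption-laden illustration, not a complete Java lexer: the class name is invented, and it ignores Unicode escapes and text blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentFinder {
    private static final Pattern TOKEN = Pattern.compile(
          "\"(?:\\\\.|[^\"\\\\\\r\\n])*\""   // string literal, with escapes (skipped)
        + "|'(?:\\\\.|[^'\\\\\\r\\n])*'"      // char literal, with escapes (skipped)
        + "|(/\\*.*?\\*/)"                    // group 1: block comment, reluctant .*?
        + "|(//[^\\r\\n]*)",                  // group 2: line comment up to the line end
        Pattern.DOTALL);                       // lets . cross line breaks inside /* */

    static List<String> comments(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add(m.group(1));
            } else if (m.group(2) != null) {
                out.add(m.group(2));
            }
            // Otherwise a literal was matched and is skipped, so any // or /*
            // inside it can never start a comment match.
        }
        return out;
    }
}
```

The reluctant quantifier `.*?` stops at the first `*/`, so a `*` inside the comment body is fine, and Pattern.DOTALL removes the need to strip or replace line returns first.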

Question on regex

Hi,
I am currently trying to compile a regular expression that can ignore everything within < >.
From:-
<p>123</p><a>456
To:-
123456
The objective is to remove all the HTML tags. Can anybody shed some light on a regular expression that could cater for this?
Thanks.
Joseph

Note: my previous (naive) method only works if there are numbers between the tags. Check this page for details on regex patterns:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
If you're interested in parsing real html files, I suggest using a html parser:
http://java-source.net/open-source/html-parsers
Good luck.
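For the simple input in the question, a one-line replacement is enough. This is a minimal sketch with a made-up class name; as noted above, it is not a substitute for an HTML parser, and it will misbehave on a `>` inside an attribute value, on comments and on script blocks.

```java
final class TagStripper {
    // Remove everything from a '<' up to the next '>'.
    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }
}
```

With the sample from the question, `TagStripper.strip("<p>123</p><a>456")` returns "123456".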

Regex replaceAll question

I am using replaceAll to replace a bunch of "tags" with content for this bit of software that I am porting (from something else) that creates documents (in the end) by populating templates with data.
Here's my problem. The "tags" are all okay (they contain no regex syntax) but some of the data is not. I have now hit data that, as a String, has some funky regex characters in it ($ in particular), and regex got all whiny about that because there aren't any matching groups. Well no. That's true.
So I have to escape the $. But it occurs to me that I could well have this problem with other bits of content too (including some parts I haven't gotten to yet), and I am wondering if there is any sort of easy solution. Is there a way to tell replaceAll that the replacement String is a "literal" replacement, i.e. I don't want any regex parsing at all, just replace the found "tag" with the "content", that's it.
Some sort of escape for the whole sequence? Possibly. But I don't understand what it would be. \ is just for a single character?
So any quick solution? Or alternative?
BTW I am stuck with 1.4 on this project.

yawmark wrote:
> I'd recommend trying Apache Commons StringUtils.
> Hope this message doesn't disappear into the aether.
> ~
I think it did for a while but like Lazarus arose...
Anyway thank you for the effort.
This is an interesting suggestion but I'd rather not go the route of adding more libraries.
What I did at first was
private void replaceAll(StringBuffer buff, String toFind, String toReplace) {
    // Resume the search after each inserted replacement, so that a
    // replacement containing toFind cannot cause an endless loop.
    int at = buff.indexOf(toFind);
    while (at > -1) {
        buff.replace(at, at + toFind.length(), toReplace);
        at = buff.indexOf(toFind, at + toReplace.length());
    }
}
which worked well enough for me, but someone else pointed out quoteReplacement and suggested just doing my own version of that. Which works well too.
Anyway thanks again, I do appreciate it, what with the current situation and all.
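Matcher.quoteReplacement only arrived with Java 5, so on 1.4 a hand-rolled version is needed. A minimal sketch of what "my own version of that" might look like follows; the class and method names are invented, and the only characters treated specially are backslash and dollar sign, which are the two that replaceAll interprets in a replacement string.

```java
final class LiteralReplace {
    // Prefix every backslash and dollar sign with a backslash so that
    // replaceAll inserts the content literally.
    static String quote(String s) {
        StringBuffer sb = new StringBuffer(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' || c == '$') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // The tag is still a regex; only the content is treated as a literal.
    static String fill(String template, String tagRegex, String content) {
        return template.replaceAll(tagRegex, quote(content));
    }
}
```

On Java 5 or later, `Matcher.quoteReplacement(content)` can be used in place of `quote(content)`.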

Java regex doubt?

I have a string as follows:
String name = "aaaaaaaaaahgcnjcdcd";
I am trying to validate the above string. When the number of "a"s in the string is 5 or above, I need to throw an error. I tried as follows:
if (name.matches("a{5,}")){
System.out.println("There is an error");
}
The above code works only if my name is "aaaaaaaaaaaa" (contains nothing but more than 5 "a"s) and it does not work when name is like "aaaaaaaaaahgcnjcdcd". Can anyone give a suggestion so that I will be able to throw an error if there are more than 5 continuous "a"s along with other characters, as in "aaaaaaaaaahgcnjcdcd"?

cotton.m wrote:
> For the regex way you need optional wildcards on either side. And sabre or uncle_alice are likely to yell at me for suggesting indexOf so maybe just do that instead.
Far from it. If all the OP is interested in is a run of 5 or more 'a' chars then indexOf() is perfect, BUT if he wants a run of 5 or more of any one of a possible set of chars then regex fits the problem much better. Something like
boolean isValid = "your string xxxxx possibly containing 5 or more of a sequence of x,y or z".matches(".*([xyz])\\1{4,}.*");
it is possible to refine this so that there is much less backtracking but until I better understand the OP's requirement ...
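The underlying issue is that String.matches() must match the entire string, whereas Matcher.find() looks for the pattern anywhere inside it, so no wildcards are needed on either side. A small sketch, assuming the requirement is "a run of 5 or more consecutive a's" as the `{5,}` in the question suggests (the class name is made up):

```java
import java.util.regex.Pattern;

final class RunCheck {
    private static final Pattern FIVE_A = Pattern.compile("a{5,}");

    // True if the name contains 5 or more consecutive 'a' characters anywhere.
    static boolean hasLongRun(String name) {
        return FIVE_A.matcher(name).find();
    }
}
```

`RunCheck.hasLongRun("aaaaaaaaaahgcnjcdcd")` is true, while `RunCheck.hasLongRun("aaaahgc")` is false.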

Without regex package

How can I check whether a string belongs to a grammar or not without importing the regex package in my Java program?
If anybody has a good idea, please share it.

I am planning to develop an editor tool for Java programming, out of my own interest. Before that I want to learn lexical analysis in Java. So I wrote token-recognizing code in Java (it should recognize keywords, identifiers, constants, operators and special symbols).
For that I used the java.util.regex package (the Pattern and Matcher classes). But I got a lot of bugs and my code turned out not to be very workable. It is not working properly: sometimes it recognizes a keyword as an identifier, sometimes as a special symbol. So I want to know whether it is possible to write a token-recognizing program without using the regex package.
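A hand-written scanner is a common way to do this without java.util.regex. The sketch below is deliberately tiny and rests on assumptions: the class name, the token labels and the six-word keyword list are all made up for illustration, and there is no handling of string literals, comments or multi-character operators. The one design point it shows is to read the whole word first and only then decide whether it is a keyword, which avoids classifying "iffy" as the keyword "if".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MiniLexer {
    // Illustrative subset only; a real lexer would list every Java keyword.
    private static final Set<String> KEYWORDS =
        new HashSet<>(Arrays.asList("int", "if", "else", "while", "return", "class"));

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (Character.isJavaIdentifierStart(c)) {
                // Consume the complete word before classifying it.
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) {
                    i++;
                }
                String word = src.substring(start, i);
                tokens.add((KEYWORDS.contains(word) ? "KEYWORD:" : "IDENT:") + word);
            } else if (Character.isDigit(c)) {
                while (i < src.length() && Character.isDigit(src.charAt(i))) {
                    i++;
                }
                tokens.add("NUMBER:" + src.substring(start, i));
            } else {
                // Anything else is treated as a one-character symbol.
                i++;
                tokens.add("SYMBOL:" + c);
            }
        }
        return tokens;
    }
}
```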

Regex question: replace

Hi,
I'm getting into java.util.regex lately. Having used Perl for regex I'm trying to get familiar with Java's regex "spirit".
Concerning replacement, we can use replaceAll or replaceFirst; however:
- what if I want to replace only the third or fourth element?
- what if I want to replace the second to fourth elements?
In Perl we use "regex_expression_here for 2..4;" for instance.
If you have some interesting websites/tutorials related to Java regex, that would be great.
Thanks for your help.
Rgds,
SR

Yep,
here is a sample of replacement in Perl
$Line =~ s/\]/|/ for 2..4; #Replace 2nd 'til 4th delimiter (]) with pipe (|)
....
Based on the reference I gave earlier
import java.util.regex.*;

/**
 * A rewriter does a global substitution in the strings passed to its
 * 'rewrite' method. It uses the pattern supplied to its constructor,
 * and is like 'String.replaceAll' except for the fact that its
 * replacement strings are generated by invoking a method you write,
 * rather than from another string.
 * This class is supposed to be equivalent to Ruby's 'gsub' when given
 * a block. This is the nicest syntax I've managed to come up with in
 * Java so far. It's not too bad, and might actually be preferable if
 * you want to do the same rewriting to a number of strings in the same
 * method or class.
 * See the example 'main' for a sample of how to use this class.
 * @author Elliott Hughes
 */
public abstract class Rewriter_1 {
    private Pattern pattern;
    private Matcher matcher;

    /**
     * Constructs a rewriter using the given regular expression;
     * the syntax is the same as for 'Pattern.compile'.
     */
    public Rewriter_1(String regularExpression) {
        this.pattern = Pattern.compile(regularExpression);
    }

    /**
     * Returns the input subsequence captured by the given group
     * during the previous match operation.
     */
    public String group(int i) {
        return matcher.group(i);
    }

    /**
     * Overridden to compute a replacement for each match. Use
     * the method 'group' to access the captured groups.
     * 'index' is the 1-based number of the current match.
     */
    public abstract String replacement(int index);

    /**
     * Returns the result of rewriting 'original' by invoking
     * the method 'replacement' for each match of the regular
     * expression supplied to the constructor.
     */
    public String rewrite(CharSequence original) {
        this.matcher = pattern.matcher(original);
        StringBuffer result = new StringBuffer(original.length());
        int index = 0;
        while (matcher.find()) {
            matcher.appendReplacement(result, replacement(++index));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] arguments) {
        // Replace only the 3rd to 5th '|' with 'y'; leave the others alone.
        String result = new Rewriter_1("\\|") {
            public String replacement(int index) {
                if ((index >= 3) && (index <= 5)) {
                    return "y";
                } else {
                    return group(0);
                }
            }
        }.rewrite("| | | | | |");
        System.out.println(result);
    }
}

Three new bioinformatic coding problems now in WIKI (all require regex)

At the bottom of this page:
https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticBasicsInRelationtoScriptingLanguages
I've defined three new bioinformatic coding problems (Problems 3a-c).
These all involve regex parsing in one way or another and are therefore a lot more interesting than the first two problems I've presented.
I'm hoping that Anton will continue to knock out solutions, and that others will feel challenged enough to join him.
djh

Anton -
I am way impressed - not just by your code and how fast you produced it, but also by the speed with which you grasped the STRIDE documentation and put it to very good use.
Regarding the question you ask at the very end - this question goes to the heart of why protein tertiary structure (the actual 3-dimensional shape of one chain of a protein) is so hard to predict, and why governments, universities, and pharma are throwing so much money at it.
If protein primary and secondary structure were many-to-one, prediction of protein tertiary structure would be a lot lot simpler. (For example, if "VVAY" and all "similar" primary structure subsequences mapped to the same secondary structure.)
But the mapping is, unfortunately, many-to-many because the amino acids (AAs) in a primary structure chain always bond the same way - the COOH (carboxy) of one AA bonds to the HN (amino) of the next one to produce C-N with the OH and H going away to make a water molecule ... COOH - HN ---> C-N + H2O.
And as pointed out by Linus Pauling (the discoverer of protein secondary structure), the C-N bond can take two forms:
a) one that is likely to force the two amino acids to become HH (part of a helix)
b) one that is likely to force the two amino acids to become EE (part of a "strand")
You may be surprised to find out that even primary structure subsequences of six amino acids can wind up as different secondary structures - not always "H" vs "E", the variants might be
E TT H
H TT E
EEE HH
HH EEE
etc.
Anyway, thanks a lot for what you've done here.
If you decide to stay involved a little longer, you will very soon be able to understand why there may be real promise in the approach to protein structure that is being taken here:
http://strucclue.ornl.gov
(This is why I want to get SAP interested in the vertical bioinformatic sector - you can see why an integrated IDE with a robust DB is so important.)
BTW, the site mentions Dr. Arthur Lesk as a team member. Arthur is considered one of the fathers of what is called "structural alignment" (as opposed to pure sequence alignment.)
If you Google him, you'll see that he has written many many books on bioinformatics - all worth purchasing if you're going to become involved in this area. Arthur was at Cambridge (MRC) until recently, when he reached mandatory EU retirement age and took a position at Penn State here in the US.
Best
djh

Domains and regex

Hi,
I'm facing the problem to extract from a URL different parts of it, in particular I'm interested in extracting the pair "second level domain" + "top level domain".
I've used the following pattern in order to extract the domain (subdomain+second level+top level domain) plus other info like the parameters.
Pattern = "\\b((https?)://([-a-zA-Z0-9.]+)(:[0-9]*)?(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?)"
The third group extracts (i.e.) "www.google.com" from a URL like "http://www.google.com:8080/?a=b&c=d", but my target is to extract "google.com".
Does anybody have advice on how to address this issue (it's clear that I'm not a regex expert...)?
Thanks a lot
ny

jverd wrote:
> yawmark wrote:
> > jverd wrote:
> > Or you could do this:
> Or this:
> address = 'http://www.google.com:8080/?a=b&c=d'
> host = new URL(address).host.replaceAll('^www.','')
> assert host == 'google.com'
> ~
Sure, but it won't work in the general case, e.g. http://forum.java.sun.com/post!reply.jspa?messageID=10031705
host.replaceAll("(.*\\.)?(?=[^.]+\\.[^.]+)", "");
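The same "keep the last two labels" idea can also be written without a regex on the host, by letting java.net.URI do the URL parsing. This is only a sketch with a made-up class name, and it shares the limitation of the replaceAll above: it assumes the registrable domain is always the last two labels, which is wrong for suffixes such as co.uk and meaningless for IP-address hosts.

```java
import java.net.URI;

final class DomainOf {
    // Returns the last two dot-separated labels of the URL's host.
    static String lastTwoLabels(String address) {
        String host = URI.create(address).getHost();
        if (host == null) {
            return "";
        }
        String[] labels = host.split("\\.");
        if (labels.length < 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```

For "http://www.google.com:8080/?a=b&c=d" this gives "google.com", and for the forum.java.sun.com URL above it gives "sun.com".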

RegEx false positives

Hello all,
I hope someone here can help with this.
I have regular expressions configured in a DLP policy on our C370 to catch interesting traffic. The format I'm looking for is D12345678. I have a regex of
\b(?i)d\d{8}\b I've also tried \b(?i)d\d{8}\s
but I continue to see emails quarantined with text like D1234567890 or D12345678%1234 and 00D123456789.
I ran these through http://gskinner.com/RegExr and they test fine under all the situations, but on the Ironport emails continue to be quarantined. Can anyone explain what I'm missing?
Thanks,
Kendall
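As a point of comparison, here is how Java's regex engine treats the expression and the sample strings from the post. This says nothing certain about the appliance's own engine, and the second pattern is only an assumed reading of the intent (the ID must stand alone between whitespace or the ends of the text); the class and field names are made up.

```java
import java.util.regex.Pattern;

final class IdCheck {
    // The expression from the post, as java.util.regex interprets it.
    static final Pattern WITH_WORD_BOUNDARY = Pattern.compile("\\b(?i)d\\d{8}\\b");

    // Stricter variant: no non-whitespace character directly before or after the ID.
    static final Pattern WHITESPACE_DELIMITED = Pattern.compile("(?<!\\S)(?i)d\\d{8}(?!\\S)");

    static boolean hit(Pattern p, String text) {
        return p.matcher(text).find();
    }
}
```

Under these semantics D1234567890 and 00D123456789 are not matched by the \b version, but D12345678%1234 is, because % is not a word character and so a word boundary exists right after the eighth digit. The whitespace-delimited variant rejects that case too, at the cost of also rejecting an ID followed directly by punctuation such as a period.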

