Java Regex Help

I've tried and tried, but I still can't get my regex working!
What I need is a regex to look through a robots.txt file (which I have downloaded and stored in a String) and return anything after the word "Disallow:".
E.g., if I had the String:
User-agent: *
Disallow: /files
Disallow: /submit
Disallow: /upload/new
User-agent: SomeUA
Disallow: /all
I want the regex to return:
/files
/submit
/upload/new
/all
Thanks!

Parsing a robots.txt file is more complex than you seem to realize (and the more lenient you are, the more complex it becomes). Assuming the file conforms to the simplest specification, there are two major considerations you seem to be ignoring. First, every record starts with one or more User-agent: fields; if you ignore them, the Disallow: fields become effectively meaningless (unless you're just trying to find out what files and directories the webmaster doesn't want people looking at, but why would you want to know that, hmmmm?). Second, robots.txt supports Unix-style comments, where everything from the first '#' in a line to the end of that line is supposed to be ignored. If you just pluck out Disallow: fields while ignoring the surrounding text, you may retrieve fields that were commented out, which should almost certainly be regarded as false hits. Of course, you can filter out comments before you start plucking, but that still leaves the issue of the User-agent: fields.
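
To make the two considerations above concrete, here is a minimal sketch, not a complete robots.txt parser. It assumes the simplest line-oriented format: it strips '#' comments first and then files each Disallow: value under the User-agent: field(s) of its record. The class and method names (RobotsRules, parse) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RobotsRules {
    // "name: value" with optional surrounding whitespace; used with matches().
    private static final Pattern FIELD =
            Pattern.compile("\\s*([A-Za-z-]+)\\s*:\\s*(.*?)\\s*");

    /** Maps each user agent to the Disallow paths listed in its record(s). */
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        List<String> agents = new ArrayList<>();   // agents of the current record
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\R")) {
            // Drop everything from the first '#' to the end of the line.
            int hash = raw.indexOf('#');
            String line = hash >= 0 ? raw.substring(0, hash) : raw;
            Matcher m = FIELD.matcher(line);
            if (!m.matches()) {
                continue;                          // blank or unrecognised line
            }
            String name = m.group(1).toLowerCase(Locale.ROOT);
            String value = m.group(2);
            if (name.equals("user-agent")) {
                if (!lastWasAgent) {
                    agents.clear();                // a new record starts here
                }
                agents.add(value);
                rules.computeIfAbsent(value, k -> new ArrayList<>());
                lastWasAgent = true;
            } else if (name.equals("disallow")) {
                if (!value.isEmpty()) {            // an empty value disallows nothing
                    for (String agent : agents) {
                        rules.get(agent).add(value);
                    }
                }
                lastWasAgent = false;
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /files\n# Disallow: /secret\n"
                + "Disallow: /submit # trailing comment\n\nUser-agent: SomeUA\nDisallow: /all\n";
        System.out.println(parse(robots));
    }
}
```

Keeping the paths grouped by user agent means a caller can still flatten them into one list if that is really all that is wanted.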

Similar Messages

  • Converting sed regex to Java regex

    I am new to regular expressions. I have to write a regex that will do some replacements.
    If the input string is something like test:[email protected];value=abcd
    then when I apply the regex to it, the string should be changed to test:[email protected];value=replacedABC
    I am trying to replace test.com and abcd with the values I have supplied.
    The regex I have come up with, in sed, is
    s/\(^.*@\)\(.*\)$/\1replaceTest.com;value=replacedABC/i
    Now I am trying to get the regex in Java. I would think it will be something like (^.*@\)(.*\)$\1replaceTest.com;value=replacedABC
    But I am not sure how I can test this. Any idea how to make sure my Java regex is valid and does the required replacements?

    rsv-us wrote:
    Yep. Agreed.
    Since these replacements should be done in a single regex.
    Note that the sed replacement I posted is really made of two replacements! Just like your Java solution would.
    rsv-us wrote:
    I think once we send this regex to the third party, they will have to use either sed or perl (will perl do these replacements? not sure though) to get the output.
    Since we are not sure what tool/software the third party is going to use, I was trying to see how I can really test this. Then I read about sed, and this regex as is didn't work, so I had to put in all the slashes sed requires, and then the regex had become like s/\(^.*@\)\(.*\)$"/1replaceTest.com;value=replacedabcd/i
    Again: AFAIK that does not work. I tried it like this:
    $ echo test:[email protected];value=abcd | sed 's/\(^.*@\)\(.*\)$"/1replaceTest.com;value=replacedabcd/i'
    and the following is returned:
    test:[email protected]
    rsv-us wrote:
    [...] that we will have to send the Java regex to the third party, I was trying to see how I can convert this sed regex to Java. If I am right, with Java regex we won't be able to do all the finds and replacements in a single regex, right? If this is true, this leaves me with the question of whether I need to send the sed regex to the third party or, if I send the Java regex, whether they have to convert that to either a sed or perl regex.
    One more question: can we do these replacements in perl also, and if so, what will the equivalent regex for this be in perl?
    I can't understand what you are talking about. The large amount of spelling errors also doesn't help to make it clearer.
    Good luck though.
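
For the original question of how to test the Java side, a small sketch may help. It assumes the intent of the sed command in the question (keep everything up to the last "@", replace the rest), and because the sample address in the post is obscured, the input used here is a made-up stand-in. The class name SedToJava is also just for illustration.

```java
final class SedToJava {
    // sed:  s/\(^.*@\)\(.*\)$/\1replaceTest.com;value=replacedABC/i
    // Java: groups use bare parentheses, the back-reference in the
    //       replacement is $1 instead of \1, and (?i) replaces the /i flag.
    static String rewrite(String input) {
        return input.replaceAll("(?i)^(.*@)(.*)$", "$1replaceTest.com;value=replacedABC");
    }

    public static void main(String[] args) {
        // Hypothetical input; prints test:someone@replaceTest.com;value=replacedABC
        System.out.println(rewrite("test:someone@test.com;value=abcd"));
    }
}
```

Running a tiny class like this with a few sample inputs is usually the quickest way to check that a pattern compiles and does the intended replacement.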

  • Java Regex groups with quantifiers.

    I'm a bit stuck on a regex. I want to do something similar to this:
    (dog){6}
    dogdogdogdogdogdog
    and get back 6 separate groups with 'dog' in each one.
    This works fine with jakarta-regexp, but when I use the {} quantifiers in Java regex I lose the groupings I'm looking for and just get a single group with a value of 'dog'.
    I'm sure I'm doing something stupid here, and any help would be great!

    You can't do this with the SunJDK regex engine without using find() with a Matcher.
    I consider this feature of jakarta-regex to be a bug and reported it (and a couple of other features) as such several years ago. The feature makes it incompatible with most regex engines I have used.
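
One way to apply the find() suggestion, as a rough sketch: validate the overall shape with the quantified pattern first, then walk the input with find() to collect each repetition. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroups {
    private static final Pattern WHOLE = Pattern.compile("(?:dog){6}");
    private static final Pattern ONE = Pattern.compile("dog");

    /** Returns the six repetitions, or an empty list if the input is not exactly six of them. */
    static List<String> repetitions(String input) {
        if (!WHOLE.matcher(input).matches()) {
            return Collections.emptyList();
        }
        List<String> parts = new ArrayList<>();
        Matcher m = ONE.matcher(input);
        while (m.find()) {          // each find() yields the next repetition
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(repetitions("dogdogdogdogdogdog"));
    }
}
```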

  • Java Regex Pipe Delimited ?

    Hello
    I am trying to split a string which is pipe delimited. I am new to regex and new to Java.
    My Java/regex code line to split is:
    listColumns = aLine.split("\\|"); // my code has 2 backslash escape chars plus 1 pipe char, but this forum does not allow me to put pipes or escapes correctly and plain text help is of NO HELP 8^(
    My input string has 3 leading and 4 trailing pipe characters.
    My output from split (the 3 leading empty strings work, but the 4 trailing pipe delimiters don't):
    SplitStrings2:[]
    SplitStrings2:[]
    SplitStrings2:[]
    SplitStrings2:[col1]
    SplitStrings2:[col3]
    SplitStrings2:[col4]
    I do get 3 empty strings for the 3 leading pipes, but no empty strings for any of the 4 trailing pipe characters.
    What do I need to change in the code so that all repeated pipes result in the same number of empty strings returned by the split method?
    thanks
    YuriB
    Edited by: yurib on Nov 28, 2012 12:25 PM
    Edited by: yurib on Nov 28, 2012 12:29 PM

    1. The pipe is a meta-character so escape it.
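    2. By default, split() drops trailing empty strings; pass a limit argument to keep them.
    String s = "|||A|B|C||||";
    // With a positive limit, trailing empty strings are kept (up to that many elements).
    String[] array = s.split("[|]", 10);
    for (int i = 0; i < array.length; i++)
         System.out.println("" + i + ": " + array[i]);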
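
If the number of columns is not known in advance, a negative limit is a common alternative to a fixed one: it keeps all trailing empty strings however many there are. A small sketch (the class and method names are illustrative only):

```java
import java.util.Arrays;

final class PipeSplit {
    /** Splits on '|' and keeps every empty field, including trailing ones. */
    static String[] fields(String line) {
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        String s = "|||A|B|C||||";
        System.out.println(fields(s).length);        // 10: nine pipes give ten fields
        System.out.println(s.split("\\|").length);   // 6: trailing empties dropped
        System.out.println(Arrays.toString(fields(s)));
    }
}
```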

  • How do you get java regex to match two different pattern

    Hi,
    I am having trouble getting Java regex to match either of two patterns: "unknown host" or "100%".
    Here is a snippet of my code:
    try {
        Process child = Runtime.getRuntime().exec(" perl c:\\ping.pl");
        BufferedReader in = new BufferedReader(new InputStreamReader(child.getInputStream()));
        String patternStr = "(unknown host | 100 %)";
        Pattern pattern = Pattern.compile(patternStr);
        Matcher matcher = pattern.matcher(" ");
        String line = null;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
            matcher.reset(line);
            if (matcher.find())
                // line matches the pattern
                System.out.println("match string");
            else
                System.out.println("no matches");
    I thought the "|" means OR but somehow it is not working.
    kirk123

    Hi,
    with String patternStr = "(unknown host | 100 %)"; you are looking
    for the strings "unknown host " OR " 100 %", including the space after "host" and the space before "100".
    Try "(unknown host|100 %)"
    Hope this helps.
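
A short sketch of the point about whitespace inside an alternation. The pattern below is a variation, not the one from the reply: it uses \s* so that both "100%" and "100 %" match, which is an assumption about what the ping output looks like. The sample lines in main are invented.

```java
import java.util.regex.Pattern;

final class PingFailure {
    // No stray spaces around '|': every character inside the alternatives is literal.
    private static final Pattern FAILURE = Pattern.compile("unknown host|100\\s*%");

    static boolean indicatesFailure(String line) {
        return FAILURE.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(indicatesFailure("ping: unknown host example.invalid")); // true
        System.out.println(indicatesFailure("Lost = 4 (100% loss)"));               // true
        System.out.println(indicatesFailure("Lost = 0 (0% loss)"));                 // false
    }
}
```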

  • Java Access Helper Jar file problem

    I just downloaded Java Access Helper zip file, and unzipped, and run the following command in UNIX
    % java -jar jaccesshelper.jar -install
    I get the following error message, and the installation stopped.
    Exception in thread "main" java.lang.NumberFormatException: Empty version string
    at java.lang.Package.isCompatibleWith(Package.java:206)
    at jaccesshelper.main.JAccessHelperApp.checkJavaVersion(JAccessHelperApp.java:1156)
    at jaccesshelper.main.JAccessHelperApp.instanceMain(JAccessHelperApp.java:159)
    at JAccessHelper.main(JAccessHelper.java:39)
    If I try to run the jar file, I get the same error message.
    Does anyone know how I can fix this?
    Thanks

    Cross-posted, to waste yours and my time...
    http://forum.java.sun.com/thread.jsp?thread=552805&forum=54&message=2704318

  • What is the escape character for DOT in java regex?

    How to specify a dot character in a java regex?
    . itself represents any character
    \. is an illegal escape character

    The regex engine needs to see \. but if you're putting it into a String literal in a .java file, you need to make it \\., as Rene said. This is because the compiler also uses \ as an escape character, so it will take the first \ as escaping the second one, and remove it, and the string that gets passed onto the regex will be \.
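
A quick sketch of the difference, using String.split as the example (the class and method names are made up). Pattern.quote is shown as an alternative for when the delimiter should not be hand-escaped.

```java
import java.util.regex.Pattern;

final class DotEscape {
    /** Splits on a literal dot: the Java literal "\\." reaches the regex engine as \. */
    static String[] splitOnDot(String s) {
        return s.split("\\.");
    }

    /** Same result, letting Pattern.quote do the escaping. */
    static String[] splitOnDotQuoted(String s) {
        return s.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        System.out.println(splitOnDot("a.b.c").length);        // 3
        // An unescaped dot matches every character, so only empty strings
        // remain and split() drops them all.
        System.out.println("a.b.c".split(".").length);         // 0
    }
}
```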

  • Java Access Helper problem

    I just downloaded Java Access Helper zip file, and unzipped, and run the following command in UNIX
    % java -jar jaccesshelper.jar -install
    I get the following error message, and the installation stopped.
    Exception in thread "main" java.lang.NumberFormatException: Empty version string
    at java.lang.Package.isCompatibleWith(Package.java:206)
    at jaccesshelper.main.JAccessHelperApp.checkJavaVersion(JAccessHelperApp.java:1156)
    at jaccesshelper.main.JAccessHelperApp.instanceMain(JAccessHelperApp.java:159)
    at JAccessHelper.main(JAccessHelper.java:39)
    If I try to run the jar file, I get the same error message.
    Does anyone know how I can fix this?
    Thanks

    Sorry about the multiple posts; it was urgent for me to know the answer.
    I have JDK 1.4.2, and JAccessHelper should work with 1.3 or later.
    Can there be some kind of path problem?
    No, I doubt it. It's just some internal app failure: it looks like it's trying to determine the running Java version, and is not doing a good job of it. But the failure has nothing per se to do with Java; it's that specific app. What are we supposed to know about that app, though?

  • JAVA Regex Illegal Characters

    Hello - I am trying to find a list of all the special characters that have to be escaped in Java regex pattern matching, but I cannot find a complete list.
    I also understand that for the replaceAll function there is a separate list of characters that can't be used as-is in the replacement, and which have to be escaped differently.
    If anyone has access to a complete list of when to escape and how, I would greatly appreciate it!
    Thanks,
    Dan

    I also noticed this below link:
    http://java.sun.com/docs/books/tutorial/extra/regex/literals.html
    It said the following characters are meta-characters in regex API:
    ( [ { \ ^ $ | ) ? * + .
    But it also says the below:
    Note: In certain situations the special characters listed above will not be treated as metacharacters. You'll encounter this as you learn more about how regular expressions are constructed. You can, however, use this list to check whether or not a specific character will ever be considered a metacharacter. For example, the characters ! @ and # never carry a special meaning.
    Does anyone know if there would be any issues if I escaped when a character didn't need to be escaped?
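
Rather than maintaining a list by hand, one option is to let the library do the escaping. The sketch below uses Pattern.quote for the search text and Matcher.quoteReplacement for the replacement text (where "\" and "$" are the special characters); the helper name replaceLiteral is invented for the example. On the last question: the java.util.regex.Pattern documentation allows a backslash before any non-alphabetic character, but treats a backslash before an alphabetic character that is not a defined escape as an error.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplace {
    /** Replaces every literal occurrence of target, with no regex interpretation on either side. */
    static String replaceLiteral(String input, String target, String replacement) {
        return input.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // "(1+1)" contains regex metacharacters and "$2" looks like a group
        // reference, yet both are treated as plain text here.
        System.out.println(replaceLiteral("cost: (1+1)", "(1+1)", "$2"));  // cost: $2
    }
}
```

For purely literal replacements, String.replace(CharSequence, CharSequence) does the same job without involving regex syntax at all.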

  • Java Regex Pattern

    Hello,
    I have parsed a text file and want to use a Java regex pattern to get statuses like "warning" and "ok" (I only need an "ok" if it follows a "warning"). Does anyone have an idea how to find the "ok" that follows a "warning" status? Thanks in advance!
    Text example:
    121; test test; test0; ok; test test
    121; test test; test0; ok; test test
    123; test test; test1; warning; test test
    124; test test; test1; ok; test test
    125; test test; test2; warning; test test
    126; test test; test3; warning; test test
    127; test test; test4; warning; test test
    128; test test; test2; ok; test test
    129; test test; test3; ok; test test
    Java code:
    String flag = "warning";
    while ((line = bs.readLine()) != null) {
        String[] tokens = line.split(";");
        for (int i = 1; i < tokens.length; i++) {
            Pattern pattern = Pattern.compile(flag);
            Matcher matcher = pattern.matcher(tokens[i]);
            if (matcher.matches()) {
                // save into a list

    Sorry, I'll try to explain it in more detail. I want to parse this text file and save the statuses "warning" and "ok" into a list. The catch is that I only need an "ok" that follows a "warning" for the same test: if there is a "test1 warning", then I am looking for the later "test1 ok".
    121; content; test0; ok; 12444      <-- that i don't want to have
    123; content; test1; warning; 126767
    124; content; test1; ok; 1265        <-- that i need to have
    121; content; test9; ok; 12444      <-- that i don't want to have
    125; content; test2; warning; 2376
    126; content; test3; warning; 78787
    128; content; test2; ok; 877666    <-- that i need to have
    129; content; test3; ok; 877666    <-- that i need to have
    // maybe a regex pattern could deal with my problem here
    // if "warning|ok" then list all elements with the status "warning" or "ok"
    String flag = "warning";
    while ((line = bs.readLine()) != null) {
        String[] tokens = line.split(";");
        for (int i = 1; i < tokens.length; i++) {
            Pattern pattern = Pattern.compile(flag);
            Matcher matcher = pattern.matcher(tokens[i]);
            if (matcher.matches()) {
                // save into a list
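
A regex alone cannot remember which tests have already reported a warning, so one possible approach is to keep that state in a Set while reading the lines. The sketch below assumes the five-field "id; content; name; status; value" layout from the example and that an "ok" should be kept only once per preceding warning; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WarningTracker {
    /** Keeps every "warning" line plus each "ok" line that follows a warning for the same name. */
    static List<String> warningsAndRecoveries(List<String> lines) {
        Set<String> warned = new HashSet<>();   // names with an unresolved warning
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\\s*;\\s*");
            if (fields.length < 4) {
                continue;                       // not a status line
            }
            String name = fields[2];
            String status = fields[3];
            if (status.equals("warning")) {
                warned.add(name);
                kept.add(line);
            } else if (status.equals("ok") && warned.remove(name)) {
                kept.add(line);                 // this ok resolves an earlier warning
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "121; content; test0; ok; 12444",
                "123; content; test1; warning; 126767",
                "124; content; test1; ok; 1265");
        warningsAndRecoveries(lines).forEach(System.out::println);
    }
}
```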

  • Java regex problem

    Hi:
    I have the following texts in a flat file:
    scheduler is running
    system default destination: llp
    device for ps3: /dev/ps3
    device for ps: /dev/ecpp0
    device for llp: /dev/ecpp0
    How can I use java regex to print out the string after "device for " in this case the string "ps3" ,"ps" and "llp".

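        static final Pattern DEVICE_PATTERN = Pattern.compile(
                                          "device for ([^:]++)" );
        // Reuse one Matcher and reset it for each line.
        Matcher m = DEVICE_PATTERN.matcher( "" );
        String text;
        while ( (text = bufferedReader.readLine()) != null ) {
            // lookingAt() anchors the match at the start of the line.
            if ( m.reset(text).lookingAt() ) {
                String device = m.group( 1 );
                System.out.println( device );
            }
        }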
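
For trying this out without a file, here is a self-contained variant of the same idea. It assumes the lines have already been read into a List, and the class and method names are illustrative rather than part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeviceNames {
    // Group 1 is everything between "device for " and the next ':'.
    private static final Pattern DEVICE = Pattern.compile("device for ([^:]++)");

    static List<String> deviceNames(List<String> lines) {
        List<String> names = new ArrayList<>();
        Matcher m = DEVICE.matcher("");
        for (String line : lines) {
            // lookingAt() only succeeds if the line starts with "device for ".
            if (m.reset(line).lookingAt()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(deviceNames(Arrays.asList(
                "scheduler is running",
                "system default destination: llp",
                "device for ps3: /dev/ps3",
                "device for ps: /dev/ecpp0",
                "device for llp: /dev/ecpp0")));   // [ps3, ps, llp]
    }
}
```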

  • Sed rules to java regex

    Hi,
    what is the connection between sed regex rules and java regex rule. Is there an easy way to convert sed to java? or do i have to learn sed?....
    Thanks

    IIRC, Java regex rules are like Perl's (although the syntax for invocation differs a bit), and Perl's are basically a superset of sed's, except there's a difference with parentheses. In Perl/Java, parentheses always group and you have to backslash-quote them to make them interpreted as plain parenthesis characters, whereas in sed, you backslash-quote them to make them be interpreted as grouping indicators.
    Why? What problem are you having?
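
The parenthesis difference can be seen in a couple of lines. In this sketch the Java pattern uses bare parentheses for the capturing group and backslash-escaped ones for the literal characters, which is the reverse of basic sed syntax. The helper name firstGroup and the sample input are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParenthesesDemo {
    /** Returns capture group 1 of the first match, or null if there is no match. */
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Java regex f\((\w+)\): \( and \) are literal, (\w+) is the group.
        // The basic-sed spelling of the same idea would be f(\(...\)), with the
        // escaped parentheses doing the grouping instead.
        System.out.println(firstGroup("f\\((\\w+)\\)", "call f(x) now"));  // x
    }
}
```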

  • New to java(need help on access specifier)

    Hi! I am new to Java. Please help me: I have to make a project on access specifiers. I know all the theory, but
    I am unable to understand how I can define all the specifiers practically, I mean in a program.
    Thanks. Please help me.

    the most common project i can think of is a payroll system..
    you can have real implementation of all the access specifiers
    good luck

  • Java Regex, how do I replace \\ with \\\\?

    Hello,
    The following line doesn't work: someString.replaceAll("\\", "\\\\");
    The error I get is the following:
    Exception in thread "main" java.util.regex.PatternSyntaxException: Unexpected internal error near index 1
    ^
         at java.util.regex.Pattern.error(Unknown Source)
         at java.util.regex.Pattern.compile(Unknown Source)
         at java.util.regex.Pattern.<init>(Unknown Source)
         at java.util.regex.Pattern.compile(Unknown Source)
         at java.lang.String.replaceAll(Unknown Source)
         at TSPEngine.main(TSPEngine.java:34)
    How do I fix this?
    Thanks in advance,

    Interference wrote:
    Hello,
    The following line doesn't work: someString.replaceAll("\\", "\\\\");
    The error I get is the following:
    Exception in thread "main" java.util.regex.PatternSyntaxException: Unexpected internal error near index 1
    "\\" is how you represent "\" in a String literal in Java, since "\" is an escape character. But "\" is also an escape character in regular expressions, meaning to represent a literal "\" in a Java regex you need "\\\\". Your second string should be fine since it's not interpreted as a regular expression to my knowledge.
    Edit:
    No, the second string DOES need to be "\\\\\\\\", because replaceAll also treats "\" (and "$") as special characters in the replacement string.
    Edited by: endasil on 17-May-2010 11:46 AM
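
Putting the two corrections together, a minimal sketch (class and method names invented for the example). The second method uses String.replace, which takes both arguments literally and so avoids the double layer of escaping.

```java
final class BackslashDoubling {
    /** Regex version: "\\\\" is the pattern \\ and "\\\\\\\\" is the replacement \\\\ . */
    static String viaReplaceAll(String s) {
        return s.replaceAll("\\\\", "\\\\\\\\");
    }

    /** Literal version: only Java string-literal escaping applies. */
    static String viaReplace(String s) {
        return s.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        String path = "C:\\temp\\file.txt";          // C:\temp\file.txt
        System.out.println(viaReplaceAll(path));     // C:\\temp\\file.txt
        System.out.println(viaReplace(path));        // C:\\temp\\file.txt
    }
}
```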

  • Pascal to Java Please help me!

    I am having trouble creating a validation method for a user logging in.
    As I know Pascal well, can anyone translate the following Pascal code to Java? Please help me with this.
    PROGRAM USERLOGIN;
    TYPE
      USER_TYPE = RECORD
        USERNAME : STRING[15];
        PASSWORD : STRING[15];
    END;
        USFILE = FILE OF USER_TYPE;
    VAR
       USER: USER_TYPE;
       CKUSER: USER_TYPE;
       RECFILE : USFILE;
       I,J     : INTEGER;
    BEGIN
    WRITE('ENTER USERNAME:');READLN(CKUSER.USERNAME);
    WRITE('ENTER PASSWORD:');READLN(CKUSER.PASSWORD);
    FILESPEC :='USERFILE.USR';
    ASSIGN(RECFILE, FILESPEC);
    RESET(RECFILE);
      WHILE NOT EOF(RECFILE) DO
       BEGIN
         READ(RECFILE, USER);
         IF CKUSER.USERNAME = USER.USERNAME AND CKUSER.PASSWORD = USER.PASSWORD THEN
              LOGIN
        ELSE
              WRITE('USER NOT FOUND');
       END;
    END.
    Thank you

    Don't bother:
    http://forum.java.sun.com/thread.jsp?forum=54&thread=539277
