Regular expression html parsing

I have the following sample HTML:
<html><body>
First Name<input type="text" class="txtField" name="txtFirstName"\>
Last name
<input type="text" name="txtLastName"\>
Address <textarea name="address" rows="10">Here goes address</textarea>
<input type="button" name="btnSubmit" class="button"\>
</body></html>
I am trying to build a regular expression that finds a list of tags based on a set of available names.
For example, I have a string array: String names[] = {"txtLastName", "address"};
Using the above array, the expression must find the matching tags in the HTML above.
So in this case the output should be:
<input type="text" name="txtLastName"\>
<textarea name="address" rows="10">
Can somebody suggest how this expression should be built?

Hi,
As I understand your question, you want to parse the HTML file and, for each name in the array, get the markup of the matching control.
In that case I think you can find each string that
1. starts with '<' and ends with '>', and
2. contains one of your words ("txtLastName", "address", ...) in double quotes (opening and closing) exactly once.
Best,
Ronak
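The two rules above can be sketched in Java roughly as follows. This is only a sketch under some assumptions: the class and method names (TagsByName, findTags) are made up for illustration, the sample tags are closed with /> rather than the \> shown in the question, and attribute values are assumed to be double-quoted and free of '<' and '>'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagsByName {
    /** Returns every tag whose name attribute equals one of the given names (hypothetical helper). */
    static List<String> findTags(String html, String[] names) {
        // Quote each name so that regex metacharacters in it are matched literally.
        StringBuilder alternatives = new StringBuilder();
        for (String name : names) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(name));
        }
        // Rule 1: the match starts with '<' and ends with '>'.
        // Rule 2: it contains name="<one of the names>" somewhere in between.
        Pattern p = Pattern.compile(
            "<[^<>]*\\bname\\s*=\\s*\"(?:" + alternatives + ")\"[^<>]*>");
        List<String> tags = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<html><body>\n"
            + "First Name<input type=\"text\" class=\"txtField\" name=\"txtFirstName\"/>\n"
            + "Last name\n"
            + "<input type=\"text\" name=\"txtLastName\"/>\n"
            + "Address <textarea name=\"address\" rows=\"10\">Here goes address</textarea>\n"
            + "</body></html>";
        String[] names = {"txtLastName", "address"};
        for (String tag : findTags(html, names)) {
            System.out.println(tag);
        }
    }
}
```

For anything beyond small, well-behaved snippets like this one, a real HTML parser is likely the safer choice.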

Similar Messages

Regular expressions to parse arithmetic expression

Hello,
I would like to parse arithmetic expressions like the following:
"5.2 + cos($PI/2) * -5"
where valid expression entities are numbers, operations, variables and parentheses. So far I have worked out regular expressions that match each type of entity I need, and combined them into one big regex that is supplied to a Pattern and Matcher. Here is the code I wrote:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
    /** A regular expression matching numeric expressions (floating point numbers) */
    private static final String REGEX_NUMERIC = "(-?[0-9]+([0-9]*|[\\.]?[0-9]+))";
    /** A regular expression matching a valid variable name */
    private static final String REGEX_VARIABLE = "(\\$[a-zA-Z][a-zA-Z0-9]*)";
    /** A regular expression matching a valid operation string */
    public static final String REGEX_OPERATION = "(([a-zA-Z][a-zA-Z0-9]+)|([\\*\\/\\+\\-\\|\\?\\:\\@\\&\\^<>'`=%#]{1,2}))";
    /** A regular expression matching a valid parenthesis */
    private static final String REGEX_PARANTHESIS = "([\\(\\)])";

    public static void main(String[] args) {
        String s = "5.2 + cos($PI/2) * -5".replaceAll(" ", "");
        Pattern p = Pattern.compile(REGEX_OPERATION + "|" + REGEX_NUMERIC + "|" + REGEX_VARIABLE + "|" + REGEX_PARANTHESIS);
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
The output is
5
2
+
cos
$PI
2
5
There are a few problems:
1. It splits "5.2" into "5" and "2" instead of keeping it together (so there might be a problem with REGEX_NUMERIC, although "5.2".matches(REGEX_NUMERIC) returns true).
2. It interprets "... * -5" as the operation " *- " and the number "5" instead of " * " and "-5".
Any solutions to these two problems are greatly appreciated. Thank you in advance.
Radu Cosmin.

cosminr wrote:
So, I've written some illustrative examples and the output after parsing (separated by " , "):
String e1 = "abs(-5) + -3";
// output now: abs , ( , - , 5 , ) , + , - , 3 ,
// should be: abs , ( , -5 , ) , + , - , 3 , (Notice the -5)
I presume that should also be "-3" instead of ["-", "3"].
String e2 = "sqrt(abs($x=1 + -10))";
// output now: sqrt , ( , abs , ( , $x , = , 1 , + , - , 10 , ) , ) ,
// should be: sqrt , ( , abs , ( , $x , = , 1 , + , -10 , ) , ) , (Notice the -10)
String e3 = "$e * -1 + (2 - sqrt(4))";
// output now: $e , * , - , 1 , + , ( , 2 , - , sqrt , ( , 4 , ) , ) ,
// should be: $e , * , -1 , + , ( , 2 , - , sqrt , ( , 4 , ) , ) ,
String e4 = "sin($PI/4) - 3";
// output now: sin , ( , $PI , / , 4 , ) , - , 3 , (This one is correct)
String e5 = "sin($PI/4) - -3 - (-4)";
// output now: sin , ( , $PI , / , 4 , ) , - , - , 3 , - , ( , - , 4 , ) ,
// should be: sin , ( , $PI , / , 4 , ) , - , -3 , - , ( , -4 , ) , (Notice -3 and -4)
I hope they are relevant. If not, I will supply some more.
I made a small change to REGEX_NUMERIC and also put REGEX_NUMERIC at the start of the "complete pattern" when the Matcher is created:
import java.util.regex.*;

class Test {
    // An optional minus sign counts as part of the number only at the start of the
    // input or directly after an operator or an opening parenthesis.
    private static final String REGEX_NUMERIC = "(((?<=[-+*/(])|(?<=^))-)?\\d+(\\.\\d+)?";
    private static final String REGEX_VARIABLE = "\\$[a-zA-Z][a-zA-Z0-9]*";
    public static final String REGEX_OPERATION = "[a-zA-Z][a-zA-Z0-9]+|[-*/+|?:@&^<>'`=%#]";
    private static final String REGEX_PARANTHESIS = "[()]";

    public static void main(String[] args) {
        String[] tests = {
            "abs(-5) + -3",
            "sqrt(abs($x=1 + -10))",
            "$e * -1 + (2 - sqrt(4))",
            "sin($PI/4) - 3",
            "sin($PI/4) - -3 - (-4)",
            "-2-4" // minus sign at the start of the expression
        };
        Pattern p = Pattern.compile(REGEX_NUMERIC + "|" + REGEX_OPERATION + "|" + REGEX_VARIABLE + "|" + REGEX_PARANTHESIS);
        for (String s : tests) {
            // Strip all whitespace first; the look-behind relies on adjacent characters.
            s = s.replaceAll("\\s", "");
            Matcher m = p.matcher(s);
            System.out.printf("%-21s-->", s);
            while (m.find()) {
                System.out.printf("%6s", m.group());
            }
            System.out.println();
        }
    }
}
Note that since Java's regex engine does not support "variable length look-behinds", you will need to remove the white space from your expression; otherwise REGEX_NUMERIC will go wrong if a String looks like this:
"1 - - 1"
Good luck!

Command line style regular expression string parser

Hi people,
I am currently working on a program where I need to parse a file (or any input stream) line by line. I then need to parse every line for arguments. Each line is formatted similar to how arguments are passed to the command line. The regular expression needs to split every line by any encountered whitespace, but needs to be able to retain any whitespace within double quotes (i.e. "some spaced text here"). Arguments can be numbers, booleans and (quoted) strings. Quoted strings must also be able to have escaped quotes in it (as below). The quotes for the quoted string (the outer ones, obviously not the escaped ones) do not necessarily have to be retained.
An example input line:
arg1 arg2 "arg 3" "arg 4" 987 arg6 "arg \"arg \"arg 7"
Desired example output:
arg1
arg2
arg 3
arg 4
987
arg6
arg "arg "arg 7
After the input line has been split up the program will handle any parsing (i.e. numbers, booleans, etc.). The program currently uses a simple for loop to iterate over all characters in the line and splits it up appropriately by checking every character. However, if this can be done automatically by using a regular expression passed to String.split() (or with some use of the regex package), it would remove quite a bit of redundant code and make the program that much more maintainable.
I do not have much experience with regular expressions since I have never really had the need to use them, but if they can work in this case it would be great.
Thanks in advance for any help.

Almost any parsing problem can be solved if you throw a big enough and ugly enough regex at it, or so I'm told.
I think what you are doing is also amenable to java.io.StreamTokenizer:
import java.io.*;
import static java.io.StreamTokenizer.*;
public class StreamTokenizerExample {
    public static void main(String[] args) throws IOException {
        StringReader input = new StringReader("arg1 arg2 \"arg 3\" \"arg 4\" 987 arg6 \"arg \\\"arg\\\" arg 7\"\nnextline");
        StreamTokenizer in = new StreamTokenizer(input);
        in.eolIsSignificant(true);
        for(int ttype; (ttype = in.nextToken()) != TT_EOF; ) {
            switch (ttype) {
                case TT_WORD:
                    System.out.println("String[" + in.sval + "]");
                    break;
                case TT_NUMBER:
                    System.out.println("number[" + in.nval + "]");
                    break;
                case TT_EOL:
                    System.out.println("[EOL]");
                    break;
                case '"':
                    System.out.println("quoted[" + in.sval + "]");
                    break;
                default:
                    System.out.println("unexpected " + ttype);
            }
        }
    }
}
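Since the question asked for a regex, here is a minimal sketch of that route as well, using Matcher.find() rather than String.split(). The class name ArgSplitter and its split method are made up for illustration, and the sketch assumes that a backslash inside quotes escapes the next character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArgSplitter {
    // Either a double-quoted string (group 1, may contain backslash escapes)
    // or a run of non-whitespace characters (group 2).
    private static final Pattern TOKEN =
        Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|(\\S+)");

    static List<String> split(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                // Drop the backslash in front of each escaped character, so \" becomes ".
                result.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                result.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "arg1 arg2 \"arg 3\" \"arg 4\" 987 arg6 \"arg \\\"arg \\\"arg 7\"";
        for (String token : split(line)) {
            System.out.println(token);
        }
    }
}
```

The outer quotes are dropped and every token comes back as a String, so converting numbers and booleans is left to the caller.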

Help!!!!! Regular Expressions!!

I am trying to use regular expressions for parsing. The package required for that is
java.util.regex.*;
I am also using the import statement in some sample code, but compiling it gives an error:
ERRORS:
Replacement.java:6: package java.util.regex does not exist
import java.util.regex.*;
^
I have set the path to C:\jdk1.4\bin
I have also set the classpath to C:\jdk1.4\lib
I don't know why it doesn't recognise the java.util.regex package.
Please help!!
gaurav_k1

Have you checked if the regex package is part of JDK 1.4? I can't find it. What classes does it implement?
Yeah, it has been part of the JDK since 1.4:
http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html
I'm not sure what the original problem could be; possibly a previously installed JRE is being used? If you had one installed previously, check the classpaths and uninstall any old JRE (some people forget that, thinking they only need to remove the JDK). Could you give us any more hints?

Regular Expressions and Double Byte Characters ?

Is it possible to use Java Regular Expressions to parse
a file that will contain double byte characters ?
For example, I want a regular expression to match the following line
tag="double byte stuff" id="double byte stuff"

The comments on the bytes/strings were helpful. Thanks.
But I'm still confused as to what matching pattern could be used.
For example a pattern like:
[A-Za-z]
I assume would not match any double byte characters.
I also assume the following won't work either:
[\\p{Alpha}]
because it is a POSIX class, which covers US-ASCII only.
So how do you say "match the tag, then take any characters (double byte, ASCII, whatever), then match the next tag", per the
original example?
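One possible answer, as a hedged sketch: Java strings hold decoded characters rather than bytes, so once the file has been read with the right charset a negated class such as [^"]* accepts double byte characters as readily as ASCII, and \p{L} matches letters from any script. The class name DoubleByteMatch and its helpers are made up for illustration; the sample values are CJK characters written as Unicode escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleByteMatch {
    // [^"]* accepts any character except a double quote, whether ASCII or not.
    private static final Pattern LINE =
        Pattern.compile("tag=\"([^\"]*)\" id=\"([^\"]*)\"");

    // \p{L} matches a letter from any script, unlike [A-Za-z] or \p{Alpha}.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    /** Returns {tag value, id value}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    static boolean isAllLetters(String s) {
        return LETTERS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] values = parse("tag=\"\u65e5\u672c\u8a9e\" id=\"\u6f22\u5b57\"");
        System.out.println(values != null);
        System.out.println(isAllLetters("\u65e5\u672c\u8a9e"));
    }
}
```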

Regular Expressions Question

I am using regular expressions to parse a text file generated by a set of sensors attached to a SeaBird CTD profiler. The problem is that some lines use quotes as a special marker and some don't, but I need to treat both as the same data. Here is the example:
SeaBird Datafile (just some part):
"# name 0 = depS: depth, salt water [m]"
"# name 1 = t068: temperature, IPTS-68 [deg C]"
# name 2 = sal00: salinity, PSS-78 [PSU]
"# name 3 = sigma-t00: density, sigma-t [kg/m^3]"
"# name 4 = flS: fluorometer, sea tech"
And I am using this regular expression to read these lines and save those names:
private static final String ColumnName = "(^\"#|^# ) name (\\d)+ = ((.+)\"$|(.+)$)";
The string either starts with "# and ends with ", or it only starts with # and has no special marker at the end. Can anyone enlighten me on the correct regex? The one I posted above only works when the line is surrounded by quotes. Thanks in advance!
Christian A. Sueiras

Try this:
private static final String ColumnName = "^(\"?)# name (\\d+) = (.+)\\1$";
You match an optional quotation mark at the beginning and capture it in group #1. At the end, you match whatever was captured in group #1: either a quotation mark, or nothing.
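To see the backreference at work, here is a small sketch that runs the suggested pattern against a quoted and an unquoted line from the sample file. The class name SeaBirdColumns and the parse helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeaBirdColumns {
    // Group 1: optional opening quote, group 2: column index, group 3: description.
    // \1 at the end must repeat whatever group 1 captured (a quote or nothing).
    private static final Pattern COLUMN_NAME =
        Pattern.compile("^(\"?)# name (\\d+) = (.+)\\1$");

    /** Returns {index, description}, or null if the line is not a column header. */
    static String[] parse(String line) {
        Matcher m = COLUMN_NAME.matcher(line);
        return m.matches() ? new String[] {m.group(2), m.group(3)} : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "\"# name 0 = depS: depth, salt water [m]\"",
            "# name 2 = sal00: salinity, PSS-78 [PSU]"
        };
        for (String line : lines) {
            String[] column = parse(line);
            System.out.println(column[0] + " -> " + column[1]);
        }
    }
}
```

A line that opens with a quote but never closes it is rejected, because \1 then has nothing to match at the end.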

Regular expression confusion

Hi,
I want to use regular expressions to parse a text data file that has the following structure:
key1=value<EOL>
key2=value<EOL>
keyn=value<EOL>
<EOR><EOL>
key1=value<EOL>
keyn=value<EOL>
<EOR><EOL>
etc.
...where <EOR> is a user-specified string that defines the end of the record and <EOL> is a user-defined line delimiter (e.g. \n). I use the regular expression [^\n]+ to extract a single line; it returns a string up to but exclusive of the delimiter itself. I want to be able to do the same for a delimiter that is a string rather than a single character. For example, I would like to extract an entire record from the file using a regular expression; that is, all the characters (including line delimiters) up to but exclusive of the <EOR>, where <EOR> is a string such as "EOR".
Is there a pattern similar to [^\n]+ where I could specify a string rather than a single character?
Given the following data file (newline characters are shown for clarity)...
MAKE=FORD\n
MODEL=MUSTANG\n
YEAR=1969\n
EOR\n
MAKE=DODGE\n
MODEL=CHARGER\n
YEAR=1973\n
EOR\n
MAKE=CHEVROLET\n
MODEL=CORVETTE\n
YEAR=1977\n
<end of file>
I want to know of a regular expression that will extract an entire record such that the group() method would return, for example, the following string when applied to the start of the file:
MAKE=FORD\n
MODEL=MUSTANG\n
YEAR=1969\n

I might be missing the point of what you want to do here, but I would probably approach it this way (modifying alice's example.)
import java.util.regex.*;

public class Test
{
    public static void main(String[] args)
    {
        String str = "MAKE=FORD\n" +
            "MODEL=MUSTANG\n" +
            "YEAR=1969\n" +
            "EOR\n" +
            "MAKE=DODGE\n" +
            "MODEL=CHARGER\n" +
            "YEAR=1973\n" +
            "EOR\n" +
            "MAKE=CHEVROLET\n" +
            "MODEL=CORVETTE\n" +
            "YEAR=1977\n";
        // DOTALL lets '.' match line breaks; group 1 is everything up to "EOR" or the end of input.
        Pattern p = Pattern.compile("(.*?)(EOR|\\z)", Pattern.MULTILINE | Pattern.DOTALL);
        Matcher m = p.matcher(str);
        while (m.find())
        {
            System.out.println();
            System.out.println(m.group(1));
        }
    }
}
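As for a direct analogue of [^\n]+ for a string delimiter, one common idiom is a negative lookahead applied before every character: (?:(?!EOR).)+. The sketch below uses it and also consumes the delimiter so that the next record starts cleanly. The names RecordReader and records are made up, and the delimiter is assumed to be followed by at most one \n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordReader {
    /** Splits data into records that end at the given delimiter string or at end of input. */
    static List<String> records(String data, String eor) {
        String delimiter = Pattern.quote(eor);
        // (?s) lets '.' match line breaks. (?:(?!EOR).)+ consumes one character at a
        // time, but only while the delimiter does not start at the current position.
        Pattern p = Pattern.compile(
            "(?s)((?:(?!" + delimiter + ").)+)(?:" + delimiter + "\\n?|\\z)");
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(data);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "MAKE=FORD\nMODEL=MUSTANG\nYEAR=1969\nEOR\n"
            + "MAKE=DODGE\nMODEL=CHARGER\nYEAR=1973\nEOR\n"
            + "MAKE=CHEVROLET\nMODEL=CORVETTE\nYEAR=1977\n";
        for (String record : records(data, "EOR")) {
            System.out.println(record);
        }
    }
}
```

For large files, reading line by line and comparing each line with the delimiter is probably a better fit than a single regex over the whole content.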

Beginner question about Regular expression

Hi all !
I'd like to use a regular expression to parse a string like this:
*<ID>4</ID><GROUP>5</GROUP>....*
So for example to retrieve the ID I have built the following regular expression:
Pattern p = Pattern.compile("<ID>(.*?)</ID>");
Matcher m = p.matcher(handle);
if (m.find()) {
 System.out.println("->"+m.group());
} else {
System.out.println("No match!");
}
The function m.group() returns "<ID>4</ID>" but I want just the value (4) between the tags. Is there
a way to get it?
thanks a lot
mark

fmarchioniscreen wrote:
thank you very much, that's exactly what I needed.
But it looks like you're parsing some XML like data: probably better to use a proper parser on it.
Well, it's a very short string containing XML tags. It's used in a marginal area of the application, so I prefer just using a regular expression to fetch the values.
thanks again
Mark
You could use XPath to get the value.
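For reference, a minimal sketch of getting only the value with the pattern from the question: the parentheses in "<ID>(.*?)</ID>" already form capturing group 1, so m.group(1) returns what is between the tags, while m.group() returns the whole match. The class name IdExtractor and the firstId helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdExtractor {
    private static final Pattern ID = Pattern.compile("<ID>(.*?)</ID>");

    /** Returns the text between the first <ID> and </ID>, or null if there is none. */
    static String firstId(String handle) {
        Matcher m = ID.matcher(handle);
        // group() would be the whole match, e.g. "<ID>4</ID>"; group(1) is just the value.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println("->" + firstId("<ID>4</ID><GROUP>5</GROUP>"));
    }
}
```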

HTML Escaping Regular Expression

Assume I have the following string:
this is a very nice " <a href>String</a>
Now say I want to allow everything above, except I want to escape certain tags, i.e. I want to allow:
,,, and nothing else. The escaped string above should then be converted to:
this is a very nice " &lt;a href&gt;String&lt;/a&gt;
The idea I'm trying to implement is to allow form input that may contain limited HTML tags, which I define, and escape anything else.
This seems like it would be an existing regular expression, does anyone have any ideas?
Thanks!

Kudos and thanks for the regex. I did not know how to do negation in a regular expression, i.e. (?!FONT|B)
To answer an earlier post, the application works much like a message board. I accept input from a textarea, then write that input out as an html file (much like how this forum works). I'd like to accept certain HTML markup in the input, but disallow tags like "<javascript>", "<object>" and "<embed>", etc., as writing those tags would allow users to post malicious input (redirects, popups, etc.). Thus, defining and parsing what tags I will accept is easier than defining the tags not to accept (and safer).
That being said, escaping quotes is somewhat important, as a " in HTML that is not in a tag should really be converted to &quot; for W3C browser standards. However, with 95% of my original question answered (that being the most important part), I'm satisfied with this. Thanks to all for the help thus far!
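The regex from the earlier reply is not shown above, so the sketch below does not reproduce it. It swaps in a different, simpler technique: escape everything first, then restore a small whitelist of tags. The whitelist (b, i, u, without attributes) and the names TagWhitelist and escape are assumptions made for illustration, since the original list of allowed tags was lost from the post.

```java
import java.util.regex.Pattern;

class TagWhitelist {
    // Hypothetical whitelist: <b>, <i>, <u> and their closing tags, without attributes.
    private static final Pattern ALLOWED =
        Pattern.compile("&lt;(/?(?:b|i|u))&gt;", Pattern.CASE_INSENSITIVE);

    static String escape(String input) {
        // Step 1: escape every special character, '&' first so later entities survive.
        String escaped = input.replace("&", "&amp;")
                              .replace("<", "&lt;")
                              .replace(">", "&gt;")
                              .replace("\"", "&quot;");
        // Step 2: turn the escaped form of whitelisted tags back into real tags.
        return ALLOWED.matcher(escaped).replaceAll("<$1>");
    }

    public static void main(String[] args) {
        System.out.println(escape("this is a <b>very</b> nice \" <a href>String</a>"));
    }
}
```

Tags with attributes (such as FONT) are deliberately left out here; allowing them safely would mean validating each attribute as well.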

Java Regular Expression to grab html tags

Dear all,
I have written a regular expression in java to grab the pairs of html tags in a String. It worked fine except that it cannot handle space or new line. What have I done wrong?
My regular expression (I would use <tr></tr> as an example):
<tr[^>]*>(.*?)</tr>
It would work for <tr><one>two<three></tr>
but not <tr ><one>two<three></tr>
or <tr> <one> two <three></tr>
or <tr>
<one>two<three>
</tr>
Thanks a lot in advance

I have written a regular expression in java to grab the pairs of html tags in a String.
I'll make one last-ditch suggestion that you grab a decent HTML parser, as the HTML specification allows for HTML tags that don't come in pairs. While I'm sure you will be able to eventually write regexes to handle this, it may be easier (depending on your requirements) to use tools to parse the HTML.
Good luck!
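If you stay with the regex anyway, the likely culprit for the multi-line case is that '.' does not match line breaks by default. A minimal sketch with Pattern.DOTALL enabled follows; the class name RowGrabber and the rows helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowGrabber {
    // DOTALL makes '.' match line breaks too; CASE_INSENSITIVE accepts <TR> as well as <tr>.
    private static final Pattern ROW = Pattern.compile(
        "<tr[^>]*>(.*?)</tr>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the content between each <tr ...> and the nearest following </tr>. */
    static List<String> rows(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(rows("<tr>\n<one>two<three>\n</tr>"));
        System.out.println(rows("<tr ><one>two<three></tr>"));
    }
}
```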

Html parser of regular expression

Dear all,
I know some of you will think my problem can be solved by an open-source html parser but I tested the following list of parsers (http://java-source.net/open-source/html-parsers) and failed to find one that meets my requirement as I explained below.
I would like to parse a html file and fetch the hyper links from it.
I wrote the following regular expression and it works in most cases:
.*(src|href|url|action)\s*=\s*["|']?(.*?)["|'|\s?|>].*
However, I have two problems so far:
1. For "<a href="directory.html">Directory</a> | <a href="a-z.html">A - Z</a>", I expected to fetch "directory.html" and "a-z.html" but I only got the last one.
2. I expected to exclude "http://www.javaeye.com/upload.jpg" in "<img alt="subwayline13" class="logo" src="http://www.javaeye.com/upload.jpg" title="subject" />". I still could not find a solution for this.
Therefore, I hope you can give me some new advice.
Merry Christmas and Happy New Year!
Pengyou

pengyou wrote:
Dear all,
I know some of you will think my problem can be solved by an open-source html parser but I tested the following list of parsers (http://java-source.net/open-source/html-parsers) and failed to find one that meets my requirement as I explained below.
Then you did something wrong when you were using the parser.
I would like to parse a html file and fetch the hyper links from it.
I wrote the following regular expression and it works in most cases:
.*(src|href|url|action)\s*=\s*["|']?(.*?)["|'|\s?|>].*
However, I have two problems so far:
1. For "<a href="directory.html">Directory</a> | <a href="a-z.html">A - Z</a>", I expected to fetch "directory.html" and "a-z.html" but I only got the last one.
2. I expected to exclude "http://www.javaeye.com/upload.jpg" in "<img alt="subwayline13" class="logo" src="http://www.javaeye.com/upload.jpg" title="subject" />". I still could not find a solution for this.
Therefore, I hope you can give me some new advice.
Same advice as before.
1. Use an existing html parser correctly.
2. Write your own html parser. An actual parser. A parser would be part of your solution, not the entire solution.
And more advice... do not attempt to use regexes to parse html, nor xml for that matter. The reason is that by the time you get it right, if ever, you will have built a parser. So instead start with one right away.
I suspect that your actual problem is that you don't know what a parser is and what it should do. So you think that a "parser" should give you the result you want rather than giving you tokens. A parser parses a source based on a grammar and produces tokens. A token is not an image file until you further interpret a particular token that way.
Finally note that in the above I said you could build your own parser if you wanted. But then you must in fact build a parser. If you do it correctly then you are going to end up with something that is functionally equivalent to one of the existing parsers. If you do it wrong then it won't.
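With the advice about parsers standing, the two symptoms in the question can still be explained in regex terms: the leading and trailing .* make a single match swallow the whole line, so only one link comes out, and src is listed among the attribute names, so image URLs are included. A hedged sketch that drops the .* wrappers, loops with find(), and only accepts href and action is shown below. The names LinkFinder and links are made up, and attribute values are assumed to be quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkFinder {
    // Group 1: attribute name, group 2: opening quote, group 3: the value.
    // \2 requires the value to be closed by the same quote character that opened it.
    private static final Pattern LINK = Pattern.compile(
        "\\b(href|action)\\s*=\\s*([\"'])(.*?)\\2", Pattern.CASE_INSENSITIVE);

    static List<String> links(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        // find() walks through the input match by match instead of matching the whole line once.
        while (m.find()) {
            result.add(m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(links(
            "<a href=\"directory.html\">Directory</a> | <a href=\"a-z.html\">A - Z</a>"));
    }
}
```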

Regular Expression to remove space in HTML Tag

Hello All,
My HTML string is like below.
select '<CityName>RICHMOND</CityName>
<StateCd>ABCD CDE
<StateCd/>
<CtryCd>CAN</CtryCd>
<CtrySubDivCd>BC</CtrySubDivCd>' Str from dual
Desired Output is
<CityName>RICHMOND</CityName><StateCd>ABCD CDE
<StateCd/><CtryCd>CAN</CtryCd><CtrySubDivCd>BC</CtrySubDivCd>
i.e. want to remove those spaces from tag value area having only spaces otherwise leave as it is. Please help to implement the same using Regular expression.

Hi,
It's unclear what you want. This site seems to be formatting your message in some odd way.
Post a statement like
SELECT '...' FROM dual;
without any formatting, to show your input, and post the exact output you want from that, with as little formatting as possible. It might help if you use some character like ~ instead of spaces (just for posting; we'll find a solution that works for spaces).
To remove the text that consists of spaces and nothing else between the tags, you can say
REGEXP_REPLACE ( str
, '> +<'
, '><'
)
How is this string being generated? Maybe there's some easier, more efficient way to keep the bad sub-strings out of the string in the first place.

Replacing space with &nbsp; in html using regular expressions

Hi
I want to replace spaces with &nbsp; in HTML.
I used the method below to replace spaces in my HTML file.
var spacePattern:RegExp = /(\s)/g;
str = str.replace(spacePattern, "&nbsp;");
Here the str variable contains the HTML file below. In this HTML file I want to replace the spaces in "What number does this represents" with &nbsp;
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<TEXTFORMAT LEADING="2"></TEXTFORMAT><TEXTFORMAT LEADING="2"> What number does this Roman numeral represents MDCCCXVIII ?</TEXTFORMAT>
</body>
</html>
But by using the above regular expression i am getting like this.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head><body>
<TEXTFORMAT LEADING="2"></TEXTFORMAT><TEXTFORMAT LEADING="2"> What number does this represents</TEXTFORMAT>
</body>
</html>
What happens here is that it replaces spaces with &nbsp; inside the HTML tags as well. But I want to replace only the spaces outside the HTML tags. I want output like this, using regular expressions in Flex:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>What number does this represents</body>
</html>
Hi, please give me a solution to the above problem using regular expressions.
Thanks in Advance to all
Regards
ssssssss

Sorry, I missed some information in the post above. The modified information was in red.
Hi
I want to replace spaces with &nbsp; in HTML.
I used the method below to replace spaces in my HTML file.
var spacePattern:RegExp = /(\s)/g;
str = str.replace(spacePattern, "&nbsp;");
Here the str variable contains the HTML file below. In this HTML file I want to replace the spaces in "What number does this represents" with &nbsp;
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<TEXTFORMAT LEADING="2"></TEXTFORMAT><TEXTFORMAT LEADING="2"> What number does this Roman numeral represents MDCCCXVIII ?</TEXTFORMAT>
</body>
</html>
But by using the above regular expression i am getting like this.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head><body>
<TEXTFORMAT LEADING="2"></TEXTFORMAT><TEXTFORMAT LEADIN G="2"> What number does this represents</TEXTFORMAT>
</body>
</html>
What happens here is that it replaces spaces with &nbsp; inside the HTML tags as well. But I want to replace only the spaces outside the HTML tags. I want output like this, using regular expressions in Flex:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>What&nbsp;number&nbsp;does&nbsp;this&nbsp;represents</body>
</html>
Hi, please give me a solution to the above problem using regular expressions.
Thanks in Advance to all
Regards
ssssssss
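One way to leave the tags alone is a negative lookahead: replace a space only if the next angle bracket after it is not a closing '>'. The sketch below is written in Java rather than ActionScript, the names NbspOutsideTags and replaceSpaces are made up, and it assumes the text contains no stray '<' or '>' outside of tags.

```java
import java.util.regex.Pattern;

class NbspOutsideTags {
    // A space followed by non-bracket characters and then '>' is inside a tag.
    // The negative lookahead rejects those, so only spaces in text are replaced.
    private static final Pattern SPACE_OUTSIDE_TAG = Pattern.compile(" (?![^<>]*>)");

    static String replaceSpaces(String html) {
        return SPACE_OUTSIDE_TAG.matcher(html).replaceAll("&nbsp;");
    }

    public static void main(String[] args) {
        System.out.println(replaceSpaces(
            "<TEXTFORMAT LEADING=\"2\"> What number does this represents</TEXTFORMAT>"));
    }
}
```

Only literal spaces are replaced here; line breaks and tabs between tags are left untouched.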

Regular Expressions for converting HTML to Structured Plain Text

I'm writing a PL/SQL function that will convert HTML to plain text, but still preserve some of the formatting/line breaks. One of my challenges is in writing a regular expression to capture the text blocks while ignoring the markup. I'm trying to write an expression that will grab all of the text between start/end tags, but discard the tags. For example, to find all of the text between a start/end paragraph, I want to do something like:
REGEXP_REPLACE('This is the body of the paragraph', '<p.*>(.*)', '\1||v_crlf' )
where \1 returns the contents of the paragraph and v_crlf (declared earlier in the function) inserts a line break. I know there are more general expressions that will remove all tags, but I want to specifically identify the tags so I can process them appropriately. This way I can easily convert HTML to plain text for email and reporting without having to keep two versions around. Any help would be greatly appreciated. Once I get this worked out, I will repost with the function code for others to use. Thanks.
Edited by: jritschel on Oct 26, 2010 9:58 AM

Here's a function I wrote for an app. I'm not making any promises on its accuracy, as the app was just a proof of concept and never made it to production.
function strip_html( p_clob in clob )
return clob
is
 l_out clob;
 l_test number := 0;
 l_max_loops constant number := 20;
 i pls_integer := 0;
begin
 l_out := regexp_replace(p_clob,' | ',chr(13)||chr(10),1,0,'imn');
 l_out := regexp_replace(l_out,'',chr(13)||chr(10),1,0,'imn');
 l_out := replace(l_out,'<li>',chr(13)||chr(10)||'*<li>');
 l_out := regexp_replace(l_out,'(.+?)','*\1*',1,0,'imn');
 l_out := regexp_replace(l_out,'(.+?)','_\1_',1,0,'imn');
 loop
 l_test := regexp_instr(l_out,'<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>',1,1,0,'imn');
 exit when l_test = 0 or i > l_max_loops;
 l_out := regexp_replace(l_out,'<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>','\2',1,0,'imn');
 i := i + 1;
 end loop;
 return l_out;
end strip_html;
The loop is there to handle nested HTML.
Tyler Muth
http://tylermuth.wordpress.com
"Applied Oracle Security: Developing Secure Database and Middleware Environments": http://sn.im/aos.book
Edited by: Tyler on Oct 26, 2010 10:03 AM

Getting "Inner Html" using Regular Expressions. Learning RE in SDK1.4.

Hello group.
I am learning Regular Expressions in JAVA SDK 1.4 first. Not PERL or other language.
Using the utility at the following link I am trying to get all the text between the <TR> and </TR> tags.
http://jakarta.apache.org/oro/demo.html
This seems simple but the line returns, breaks etc.. make it more difficult. I have worked on this for hours.
There will be multiple table rows in my stream.
My goal is to first get the text between the <TR> Tags...
Then I was going to use groups to get data0, data1, data2, data3.
Does this sound like a good plan? Should I use multiple RE or one RE that does 4 group returns.
I was thinking the applet was causing my problem.
<TR>.*?</TR> does not work.
(<tr>\s*([^(</tr>)])+</tr>) does not work.
I can get data0 to work as well as data1,2,3.
Would it make more sense to split this multiple row table by </tr>?
One row of malformed html (actually multiple rows):
<TR>
<TD bgColor=#ffffff><A class=fav
href="http://nicesite.com/data0"
>nicesite</A><IMG
src="smile.gif"></TD>
<TD bgColor=#12ff22>data1</TD>
<TD bgColor=#12ff22>data2</TD>
<TD bgColor=#12ff22>data3</TD>
<TD align=middle bgColor=#ffffff><A
href="#"><IMG
src="smile.gif" border=0></A></TD>
<TD align=middle bgColor=#ffffff>data
4</TD>
<TD align=middle bgColor=#ffffff>data5</TD></TR>
s_____ I have seen some of your posts and tried to apply them. What do you think?
Regards,
NupeVic

http://jakarta.apache.org/oro/demo.html
I prefer
http://jregex.sourceforge.net/demoapp.html
This seems simple but the line returns, breaks etc..
make it more difficult.
Yes, they do indeed.
There will be multiple table rows in my stream.
My goal is to first get the text between the <TR>
Tags...
Then I was going to use groups to get data0, data1,
data2, data3.
Does this sound like a good plan? Should I use
multiple RE or one RE that does 4 group returns.
One of the main features of regexes that you must realize
is that they are mainly suited for non-recursive, linear data structures
(btw, that's why regexes in general are hardly suited for html).
So, if the number of TD items is fixed, you could
1. search using a single pattern for the whole row, something like
"<PatternForTR>"+
"<PatternForTD>(<PatternForData>)<PatternFor/TD>"+
"<PatternForTD>(<PatternForData>)<PatternFor/TD>"+
"<PatternFor/TR>"
so the group1 would contain data1 and so on
Otherwise, you should
2. find each row using
"<PatternForTR>(.*?)<PatternFor/TR>",
then search the contents of group1 using the
"<PatternForTD>(<PatternForData>)<PatternFor/TD>".
>
<TR>.*?</TR> does not work.
The pattern itself is OK, but for it to work you have to enable the DOTALL flag (the 's' flag in the jregex demo), as '.' doesn't accept line breaks by default.
(<tr>\s*([^(</tr>)])+</tr>) does not work.
It seems that [^(</tr>)]+ is actually nonsense in this context.
It describes a string that consists of any chars but '(', ')', '<', '>', 'r', 't', '/'.
What you actually meant (a string that doesn't contain "</tr>")
is achieved simply by using the non-greedy quantifier in <TR>.*?</TR>.
>
I can get data0 to work as well as data1,2,3.
Would it make more sense to split this multiple row
table by </tr>?
Going the second way above, you could find rows using the
general pattern for TR:
<tr.*?>(.+?)</tr>
and search their contents (i.e. group #1) using the
general pattern for TD:
<td.*?>(.+?)</td>
Finally, this is the specific pattern for TD that doesn't include the leading
and trailing tags in group1:
<td[^>]*>(?:\s*</?[^>]*>)*\s*(.+?)(?:\s*</?[^>]*>)*\s*</td>
It succeeded in finding
nicesite
data1
data2
data3
data
4
data5
in your sample.
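The second approach (find each row, then search the row's contents for cells) can be sketched in Java as follows. The names TableRows and parse are made up, the patterns are simplified versions of the general ones above, and nested tags inside a cell are simply stripped, which is an assumption about what "the data" means here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableRows {
    // DOTALL so '.' crosses line breaks; CASE_INSENSITIVE so <TR> and <tr> both match.
    private static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
    private static final Pattern ROW = Pattern.compile("<tr[^>]*>(.*?)</tr>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<td[^>]*>(.*?)</td>", FLAGS);

    /** Returns one list of cell texts per table row. */
    static List<List<String>> parse(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            // Step 2: search only inside the current row (group 1 of the row match).
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                // Remove any tags nested inside the cell and trim surrounding whitespace.
                cells.add(cellMatcher.group(1).replaceAll("<[^>]*>", "").trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<TR>\n<TD bgColor=#12ff22>data1</TD>\n"
            + "<TD bgColor=#12ff22>data2</TD></TR>";
        System.out.println(parse(html));
    }
}
```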
