Stripping HTML through regular expressions (please help)
Hi all,
I've been using the OROMatcher-1.1 regular expression package downloaded from apache.org.
It works well with my program, but I'm having problems building a correct regular expression to strip off HTML tags.
Can any of you help me build an expression that strips off ALL HTML tags, including those with odd spacing such as:
<a href = "www.here.com">click me</a>
Please help. I've tried for ages and it's driving me mad.
Hi,
I won't go into much detail, but the simplest way to do that would be to use XML technology. Try SAX or DOM, whichever you feel comfortable with. I think SAX would be the better choice. For details visit
http://java.sun.com/xml/?frontpage-spotlight
/khurram
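For comparison, the regex-only route asked about in the question can be sketched as follows. This is a minimal sketch, assuming java.util.regex (JDK 1.4 or later) rather than the OROMatcher package; the class name TagStripper and the method stripTags are made up for illustration. It will be fooled by a '>' inside an attribute value, by comments and by script blocks, which is one reason a real parser is the safer route.

```java
import java.util.regex.Pattern;

class TagStripper {
    // '<', an optional '/', a letter, then everything up to the next '>'.
    // [^>] also matches spaces and line breaks, so odd spacing inside a tag is fine.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

    // Returns the input with every opening and closing tag removed.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<a href = \"www.here.com\">click me</a>"));
    }
}
```

Requiring a letter right after '<' keeps plain text such as "1 < 2" intact.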
Similar Messages
-
HTML Escaping Regular Expression
Assume I have the following string:
<font face="arial">this</font> is a <b>very nice</b> " <a href>String</a>
Now say I want to allow everything above, except I want to escape certain tags, i.e. I want to allow:
<b>, </b>, <font ...>, </font> and nothing else. The escaped string above should then be converted to:
<font face="arial">this</font> is a <b>very nice</b> " &lt;a href&gt;String&lt;/a&gt;
The idea I'm trying to implement is to allow form input that may contain a limited set of HTML tags that I define, and escape anything else.
This seems like it would be an existing regular expression. Does anyone have any ideas?
Thanks!
Kudos and thanks for the regex. I did not know how to do negation in a regular expression, i.e. (?!FONT|B).
To answer an earlier post, the application works much like a message board. I accept input from a textarea, then write that input out as an HTML file (much like how this forum works). I'd like to accept certain HTML markup in the input, but disallow tags like "<javascript>", "<object>" and "<embed>", etc., as writing those tags would allow users to post malicious input (redirects, popups, etc.). Thus, defining and parsing the tags I will accept is easier (and safer) than defining the tags not to accept.
That being said, escaping quotes is somewhat important, as a " in HTML that is not in a tag should really be converted to &quot; for W3C browser standards. However, with 95% of my original question answered (that being the most important part), I'm satisfied with this. Thanks to all for the help thus far!
-
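An alternative to the negative lookahead is to match every tag and decide per tag name whether to escape it. The following is only a sketch under stated assumptions: it uses java.util.regex, the class TagWhitelist and method escapeDisallowed are hypothetical names, and the allowed set is the <b> and <font> pair from the post. It does not inspect attributes, so something like an event handler on an allowed tag would still pass through.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagWhitelist {
    // Tag names that pass through unescaped (from the post: <b> and <font>).
    private static final Set<String> ALLOWED =
            new HashSet<String>(Arrays.asList("b", "font"));

    // Group 1 is the tag name of an opening or closing tag.
    private static final Pattern TAG =
            Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Escapes the angle brackets of every tag whose name is not in ALLOWED.
    static String escapeDisallowed(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = m.group();
            if (!ALLOWED.contains(m.group(1).toLowerCase())) {
                tag = tag.replace("<", "&lt;").replace(">", "&gt;");
            }
            // quoteReplacement keeps '$' and '\' in the tag text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Comparing the whole tag name avoids a pitfall of a bare (?!FONT|B) lookahead, which would also let <body> through because it starts with "b".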
Quick regular expression question/help
Can someone help me with two regular expressions I need? I could spend a while trying to figure them out myself, but time is short and I would really like a foolproof, optimal solution (my own attempt would be buggy).
Sample sentence
The population, is projected to reach 200,000, or more (by 2020).[7] This is {dummy} text.
The first regular expression
I need all brackets and every thing between them to be removed from a sentence.
Brackets such as: ( ), [ ] and { } .
I.e. Given the above sentence the following would be returned:
The population, is projected to reach 200,000, or more. This is text.
The second regular expression
If a word has a trailing comma character I need to add a whitespace between the word and the comma.
I.e. Given the sentence returned from the first regular expression, this regex would return:
The population *,* is projected to reach 200,000 *,* or more. This is text.
Many thanks to anyone who can help me with this!
Edited by: Myles on Jan 18, 2008 8:12 AM
http://java.sun.com/docs/books/tutorial/extra/regex/index.html
http://www.regular-expressions.info
-
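The two expressions requested in this thread could look roughly like this in Java. It is a sketch, not a foolproof solution: the class and method names are invented for illustration, nested brackets are not handled, and a "trailing comma" is assumed to mean a comma that is followed by whitespace or the end of the string (so the comma inside 200,000 is left alone).

```java
class SentenceCleaner {
    // Removes (...), [...] and {...} groups, together with any whitespace
    // directly in front of them. Nested brackets are not supported.
    static String removeBrackets(String s) {
        return s.replaceAll("\\s*(?:\\([^)]*\\)|\\[[^\\]]*\\]|\\{[^}]*\\})", "");
    }

    // Inserts a space before a comma that follows a word character and is
    // itself followed by whitespace or the end of the input.
    static String spaceBeforeTrailingComma(String s) {
        return s.replaceAll("(\\w),(?=\\s|$)", "$1 ,");
    }

    public static void main(String[] args) {
        String sample = "The population, is projected to reach 200,000, "
                + "or more (by 2020).[7] This is {dummy} text.";
        String cleaned = removeBrackets(sample);
        System.out.println(cleaned);
        System.out.println(spaceBeforeTrailingComma(cleaned));
    }
}
```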
Replacing spaces with &nbsp; in HTML using regular expressions
Hi
I want to replace spaces with &nbsp; in HTML.
I used the method below to replace the spaces in my HTML file.
var spacePattern:RegExp = /(\s)/g;
str = str.replace(spacePattern, "&nbsp;");
Here the str variable contains the HTML file below. In this file I want to replace the spaces within the text "What number does this represents" with &nbsp;.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<b><TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0" KERNING="0"><B></B></FONT></P></TEXTFORMAT><TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0" KERNING="0"><B> What number does this Roman numeral represents MDCCCXVIII ?</B></FONT></P></TEXTFORMAT></b>
</body>
</html>
But with the above regular expression I get this:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head><body>
<b><TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0" KERNING="0"><B></B></FONT></P></TEXTFORMAT><TEXTFORMAT LEADING="2"><P A LIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0 " KERNING="0"><B> What number does this represents</B></FONT></P></TEXTFORMAT></b>
</body>
</html>
The problem is that it also replaces spaces with &nbsp; inside the HTML tags, but I only want to replace the spaces outside the tags. This is the result I want, using regular expressions in Flex:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>What number does this represents</body>
</html>
Please give me a solution to the above problem using regular expressions.
Thanks in advance to all.
Regards,
ssssssss
Sorry, I missed some information in the post above; the modified information was in red.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>What&nbsp;number&nbsp;does&nbsp;this&nbsp;represents</body>
</html>
Please give me a solution to the above problem using regular expressions.
Thanks in advance to all.
Regards,
ssssssss
-
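One way to restrict the replacement to text outside tags is a lookahead that checks which angle bracket comes next. The sketch below is written in Java rather than ActionScript, and the class and method names are hypothetical; the pattern itself uses only a lookahead and character classes, so it should carry over to a Flex RegExp with the g flag, though that is an assumption I have not verified. It assumes attribute values never contain '<' or '>', and it also converts spaces that sit between two tags.

```java
import java.util.regex.Pattern;

class NbspOutsideTags {
    // A space qualifies only if the next angle bracket after it is '<'
    // (or there is none at all), i.e. the space is not inside a tag.
    private static final Pattern TEXT_SPACE =
            Pattern.compile(" (?=[^<>]*(?:<|$))");

    // Replaces every space outside of tags with the &nbsp; entity.
    static String replaceTextSpaces(String html) {
        return TEXT_SPACE.matcher(html).replaceAll("&nbsp;");
    }

    public static void main(String[] args) {
        System.out.println(replaceTextSpaces(
                "<P ALIGN=\"LEFT\"><B> What number does this represents</B></P>"));
    }
}
```

Only the literal space character is matched, so line breaks between tags are left untouched.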
Question about Regular Expressions, please help!
I have created an app that reads files and extracts certain data using regular expressions in JDK 1.4, using the Pattern and Matcher classes.
However, it needs to run on JDK 1.2.2 (don't ask). The regular expression classes (Pattern and Matcher) are not available in 1.2.2, so I am looking for something similar that I can use.
I need something that loops through all the matches found in the file, the way Matcher works, i.e.
while (matcher.find())
// do this
Help!
http://jakarta.apache.org/regexp/
-
Regular Expressions, please help.
Hello everyone.
Can I get a Java Regular Expression to match with a word of the following language...
Start --> Expression;
Expression --> [0-9]+;
Expression --> Expression * Expression;
So the regexp should match with words like:
4;
4664;
4 * 763;
5 * 4534 * 23534;
04 * 002 * 1 * 10 * ...
I would be very happy if anyone could help.
I don't think that I need to learn anything more.
I am sure that what I want is not possible.
I want to build a compiler.
I have just finished the abstract syntax of my language. Now I need a way to compile the concrete syntax of my language to the abstract one.
But I think it is not possible with regular expressions,
because I need to match a syntax of Chomsky type 2.
I think regular expressions only match Chomsky type 3 languages.
But the backtracking mechanism of Java regular expressions might be able to do this.
I am not sure about this.
If you have any ideas, please post.
-
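As an aside, the three productions quoted in this thread happen to generate a regular language: one or more digit groups separated by '*' and terminated by ';'. For that fragment a plain pattern is enough, as the sketch below shows; the class and method names are invented, and optional whitespace around '*' is an assumption based on the sample words. Once the grammar gains nested constructs such as parentheses, a regular expression of this kind no longer suffices and a parser is needed.

```java
import java.util.regex.Pattern;

class ProductExpression {
    // digits, then zero or more "* digits" groups, then the closing ';'
    private static final Pattern WORD =
            Pattern.compile("\\d+(?:\\s*\\*\\s*\\d+)*;");

    // True if the whole string is a word of the grammar from the post.
    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }
}
```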
Getting "Inner Html" using Regular Expressions. Learning RE in SDK1.4.
Hello group.
I am learning Regular Expressions in JAVA SDK 1.4 first. Not PERL or other language.
Using the utility at the following link I am trying to get all the text between the <TR> and </TR> tags.
http://jakarta.apache.org/oro/demo.html
This seems simple but the line returns, breaks etc.. make it more difficult. I have worked on this for hours.
There will be multiple table rows in my stream.
My goal is to first get the text between the <TR> Tags...
Then I was going to use groups to get data0, data1, data2, data3.
Does this sound like a good plan? Should I use multiple RE or one RE that does 4 group returns.
I was thinking the applet was causing my problem.
<TR>.*?</TR> does not work.
(<tr>\s*([^(</tr>)])+</tr>) does not work.
I can get data0 to work as well as data1,2,3.
Would it make more sense to split this multiple row table by </tr>?
One row of malformed html (actually multiple rows):
<TR>
<TD bgColor=#ffffff><A class=fav
href="http://nicesite.com/data0"
>nicesite</A><IMG
src="smile.gif"></TD>
<TD bgColor=#12ff22><SPAN class=fav>data1</SPAN></TD>
<TD bgColor=#12ff22><SPAN class=fav>data2</SPAN></TD>
<TD bgColor=#12ff22><SPAN class=fav>data3</SPAN></TD>
<TD align=middle bgColor=#ffffff><A
href="#"><IMG
src="smile.gif" border=0></A></TD>
<TD align=middle bgColor=#ffffff>data
4</TD>
<TD align=middle bgColor=#ffffff>data5</TD></TR>
s_____, I have seen some of your posts and tried to apply them. What do you think?
Regards,
NupeVic
>
http://jakarta.apache.org/oro/demo.html
I prefer
http://jregex.sourceforge.net/demoapp.html
>
This seems simple but the line returns, breaks etc..
make it more difficult.
Yes, they do indeed.
There will be multiple table rows in my stream.
My goal is to first get the text between the <TR>
Tags...
Then I was going to use groups to get data0, data1,
data2, data3.
Does this sound like a good plan? Should I use
multiple RE or one RE that does 4 group returns.
One of the main features of regexes that you must realize
is that they are mainly suited for non-recursive, linear data structures
(btw, that's why regexes in general are hardly suited for html).
So, if the number of TD items is fixed, you could
1. search using a single pattern for the whole row, something like
"<PatternForTR>"+
"<PatternForTD>(<PatternForData>)<PatternFor/TD>"+
"<PatternForTD>(<PatternForData>)<PatternFor/TD>"+
"<PatternFor/TR>"
so the group1 would contain data1 and so on
Otherwise, you should
2. find each row using
"<PatternForTR>(.*?)<PatternFor/TR>",
then search the contents of group1 using the
"<PatternForTD>(<PatternForData>)<PatternFor/TD>".
>
<TR>.*?</TR> does not work.
The pattern itself is OK, but for it to work you need to enable the DOTALL flag (the 's' flag in the jregex demo), as '.' doesn't match line breaks by default.
(<tr>\s*([^(</tr>)])+</tr>) does not work.
It seems that [^(</tr>)]+ is actually nonsense in this context.
It describes a string that consists of any characters except '(', ')', '<', '>', 'r', 't', '/'.
What you actually meant (a string that doesn't contain "</tr>")
is achieved simply by using the non-greedy quantifier in <TR>.*?</TR>.
>
I can get data0 to work as well as data1,2,3.
Would it make more sense to split this multiple row
table by </tr>?
Going the second way above, you could find rows using the
general pattern for TR:
<tr.*?>(.+?)</tr>
and search their contents (i.e. group #1) using the
general pattern for TD:
<td.*?>(.+?)</td>
Finally, this is the specific pattern for TD that doesn't include the leading
and trailing tags in group 1:
<td[^>]*>(?:\s*</?[^>]*>)*\s*(.+?)(?:\s*</?[^>]*>)*\s*</td>
It succeeded in finding
nicesite
data1
data2
data3
data
4
data5
in your sample.
-
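The two-step approach from the reply (find each row, then search the row's contents for cells) can be sketched with java.util.regex as follows. This is an illustration under assumptions, not the reply author's code: the class TableRowSketch and its method are hypothetical, the patterns are simplified variants of the general TR and TD patterns, and leftover inner tags are stripped in a separate pass instead of with the longer TD pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableRowSketch {
    // DOTALL lets '.' cross line breaks; tags may be upper or lower case.
    private static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
    private static final Pattern ROW = Pattern.compile("<tr[^>]*>(.*?)</tr>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<td[^>]*>(.*?)</td>", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns one list of cell texts per table row found in the input.
    static List<List<String>> extract(String html) {
        List<List<String>> rows = new ArrayList<List<String>>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<String>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop tags nested in the cell, then collapse whitespace runs.
                String text = TAG.matcher(cell.group(1)).replaceAll("");
                cells.add(text.replaceAll("\\s+", " ").trim());
            }
            rows.add(cells);
        }
        return rows;
    }
}
```

Because whitespace is collapsed, a cell split over two lines such as "data" / "4" comes back as "data 4".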
Regular expression (regex) help!
I am trying to write a correct regular expression but am having difficulties.
I have a webpage saved as a string and want to extract all the links (urls) from the webpage string.
The trouble I am having is that some websites surround links using double quotes " " and some use single quotes ' ' around links in html:
Double quotes around url:
<a href="www.example.com"></a>
And single quotes:
<a href='www.example.com'></a>
So far I have a regex which extracts links if they are surrounded by double quotes (see below); however, if a page uses single quotes it screws up ;)
Pattern.compile("<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]", Pattern.CASE_INSENSITIVE);
So is there a way to say look for double quotes OR single quotes?
Many thanks
There's no need to escape the single quote (or apostrophe) in a regex. The only reason it was necessary to escape the double quote (or quotation mark) is that the regex was written as a String literal. Neither the single quote nor the double quote has any special meaning in regexes.
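To accept either kind of quote, a character class can capture the opening quote and a backreference can require the same one to close the value. A minimal sketch, assuming java.util.regex; LinkExtractor and extractLinks are illustrative names, and unquoted href values are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Group 1 captures the opening quote (" or '); \1 requires the same
    // quote to close the value. Group 2 is the URL itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Collects the href value of every anchor tag in the page string.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }
}
```

Only the double quote needs a backslash here, and only because the pattern is written as a Java String literal.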
-
Regular Expression query help.
Hi, your help will be appreciated,
I need to replace the a string's pattern with some special characters.
Input String := 'mytext*% align="quot;leftquot;><font face="quot;Arialquot;"> *% align="quot;leftquot;"><this is text><p this to replace >'
Output String := 'mytext@ align="quot;leftquot;$<font face="quot;Arialquot;"> @ align="quot;leftquot;"$<this is text><p this to replace >'
Replacing Rules:
1) '*%' should be replaced by '@'
2) '>' should be replaced by $ (only the EVERY FIRST occurrence after the character @ )
Tried with REGEXP but looks like need your help!
Thx
DJ.
Hi, DJ,
DeeJay wrote:
Perfect Frank. Thanks for your help.
Could you please explain how it is working? you know, these Regexps are hurdle for me always in understanding.
Not just you; regular expressions can be very cryptic.
We're saying "replace '*%x>' with '@x$'", where x is 0 or more characters from the set of all characters except '>'.
{code}
SELECT REGEXP_REPLACE ( 'mytext*% align="quot;leftquot;> *% align="quot;leftquot;"><this is text>'
, '\*' || -- asterisk (special character, must be escaped)
'%' || -- percent sign
'(' || -- begin \1 definition
'[' || -- begin set definition
'^' || -- "The set consisting of all characters EXCEPT ...
'>' || -- ... the greater-than sign"
']' || -- end set definition
'*' || -- 0 or more characters from the preceding set
')' || -- end \1 definition
'>' -- greater-than sign
, '@\1$'
) AS txt
FROM dual;
{code}
-
Java Regular Expression Need Help
I want a regular expression that accepts all numbers but skips any numbers that appear inside {}.
No, this is not working.
Then you need to be MUCH clearer as to exactly what you are trying to achieve...
We aren't mind readers... try posting the string you are parsing and the exact result that you want to get.
-
Column constant value pls help
COL cnt noprint
COLUMN cnt NEW_VALUE rowcount
I am using this for a select query as below:
ROWNUM cnt, COUNT (*) OVER () cnt
but the issue is when no rows are returned: rowcount is then not initialized.
When I use it for the trailer record
SELECT '* Trailer record *' || '|' || &rowcount
FROM DUAL
it gives the error below:
FROM DUAL
ERROR at line 2:
ORA-00936: missing expression
Pls help
S
Solomon Yakobson wrote:
Post EXACT and COMPLETE snippet of SQL*Plus session showing what you did. I can't reproduce it:
SQL> COLUMN cnt NEW_VALUE rowcount
SQL> SELECT COUNT (*) OVER () cnt
2 FROM DUAL
3 /
Suppose cnt returns no rows; then rowcount is null, because at that point rowcount has not been initialized.
>
SQL> SELECT '* Trailer record *' || '|' || &rowcount
2 FROM DUAL
3 /
old 1: SELECT '* Trailer record *' || '|' || &rowcount
new 1: SELECT '* Trailer record *' || '|' ||
Missing expression: this was my error.
>
'*TRAILERRECORD*'||'
* Trailer record *|1>
SQL>
SY. -
Regular expression to substring
Hi Folks;
I need to dynamically extract substrings from an attribute A.
The VARCHAR2 attribute A is defined like this: "LXXXXX/111111(+),LXXXXX/111111(-),LXXXXXX/111111,etc..." Always the same series.
I need to store all the "111111(+)", "111111(-)" and "111111" values of the same record in a new attribute named B.
I feel regular expressions could help me, but I'm not very good with them...
Thanks for your help. ^^
Try this,
SELECT LTRIM (REGEXP_SUBSTR (attrA,
'/[^,]+',
1,
LEVEL),'/')
FROM T
CONNECT BY LEVEL <= LENGTH (REGEXP_REPLACE ( attrA, '[^/]'))
Example
SQL> WITH T AS (SELECT 'LXXXXX/111111(+),LXXXXX/111111(-),LXXXXXX/111111,' attrA FROM DUAL)
2 SELECT LTRIM (REGEXP_SUBSTR (attrA,
3 '/[^,]+',
4 1,
5 LEVEL),'/') exprssn
6 FROM T
7 CONNECT BY LEVEL <= LENGTH (REGEXP_REPLACE ( attrA, '[^/]'));
EXPRSSN
111111(+)
111111(-)
111111
SQL> G. -
Regular expression usage question
Hi there.
I have a 200-byte EBCDIC variable record which I need to break down into fields. Fields are positional and are either text, binary numbers, packed-decimal or 64bytes long numbers.
My question is: can regular expressions handle this complex data?
I want to convert each field into its corresponding format: EBCDIC into ASCII text, binary into a Java Integer, and so on.
The reason for using regular expressions is that the record format could change, and a regular expression would be easier to modify without having to change the code.
Your words of advice are highly appreciated.
Please advise.
Regards,
Ulises
Regular expressions? I don't think so.
If you have a situation where positions 1-3 might be a binary number like client number, and the format might change so it moves to positions 12-14, then you could certainly write a record-format class to encapsulate that sort of information. In fact that would be a very good idea. But I can't imagine how a regular expression would help in getting a number out of three bytes, for example. -
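The record-format class suggested in the reply might look something like the sketch below. Everything in it is an assumption made for illustration: the names RecordFormatSketch, Field and Kind, the 0-based offsets, big-endian unsigned binary fields, and packed-decimal fields whose final nibble is the sign with only 0xD treated as negative. Text fields are left out; they could be decoded with an EBCDIC charset if the JVM provides one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RecordFormatSketch {
    enum Kind { BINARY, PACKED_DECIMAL }

    // One positional field: where it starts, how long it is, how to decode it.
    static final class Field {
        final String name;
        final int offset;
        final int length;
        final Kind kind;

        Field(String name, int offset, int length, Kind kind) {
            this.name = name;
            this.offset = offset;
            this.length = length;
            this.kind = kind;
        }
    }

    // Big-endian unsigned binary number.
    static long decodeBinary(byte[] rec, int offset, int length) {
        long value = 0;
        for (int i = 0; i < length; i++) {
            value = (value << 8) | (rec[offset + i] & 0xFF);
        }
        return value;
    }

    // Packed decimal: two digits per byte, last nibble is the sign.
    static long decodePacked(byte[] rec, int offset, int length) {
        long value = 0;
        for (int i = 0; i < length; i++) {
            int b = rec[offset + i] & 0xFF;
            value = value * 10 + (b >>> 4);
            if (i < length - 1) {
                value = value * 10 + (b & 0x0F);
            } else if ((b & 0x0F) == 0x0D) {
                value = -value; // assumption: only 0xD marks a negative number
            }
        }
        return value;
    }

    // Decodes every field of the layout; a format change only touches the layout.
    static Map<String, Long> decode(byte[] rec, Field[] layout) {
        Map<String, Long> out = new LinkedHashMap<String, Long>();
        for (Field f : layout) {
            long v = f.kind == Kind.BINARY
                    ? decodeBinary(rec, f.offset, f.length)
                    : decodePacked(rec, f.offset, f.length);
            out.put(f.name, v);
        }
        return out;
    }
}
```

If the client number moves from one position to another, only the Field entry in the layout array changes; the layout could also be loaded from a configuration file so that no code needs to be recompiled.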
Need help with regular expression
I'm trying to use the java.util.regex package to extract URLs from html files.
The URLs that I am interested in extracting from the HTML look like the following:
<font color="#008000">http://forum.java.sun.com -
So, the URL is always preceded by:
<font color="#008000">
and then followed by a space character and then a hyphen character. I want to be able to put all these URLs in a Vector object. This doesn't seem like it should be too difficult, but for some reason I can't get anywhere with it. Any help would be greatly appreciated. Thanks!
Hi gupta, I am not sure of the Java syntax, but I can tell you about the regular expression. Try this:
<font color="#008000">(http:\/\/[a-zA-Z0-9.]+) [-]
I don't know the Java methods to call, just the regexp.
Sanjay Acharya
-
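The Java side that the reply leaves open could be sketched like this with java.util.regex. The class and method names are made up, a List is used where the question mentions a Vector, and the character class from the reply is widened to any run of non-whitespace characters so that URLs with paths are also captured; that widening is my assumption, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreenUrlExtractor {
    // The fixed font tag, then the URL (group 1), then " -".
    private static final Pattern URL = Pattern.compile(
            "<font color=\"#008000\">(http://\\S+?) -");

    // Returns every URL that appears in the described position.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

If a Vector is really required, the result can be copied with new Vector<String>(extractUrls(html)).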
Regular Expression to remove space in HTML Tag
Hello All,
My HTML string is like below.
select '<CityName>RICHMOND</CityName>
<StateCd>ABCD CDE
<StateCd/>
<CtryCd>CAN</CtryCd>
<CtrySubDivCd>BC</CtrySubDivCd>' Str from dual
Desired Output is
<CityName>RICHMOND</CityName><StateCd>ABCD CDE
<StateCd/><CtryCd>CAN</CtryCd><CtrySubDivCd>BC</CtrySubDivCd>
i.e. I want to remove the spaces between tags where the tag value area contains only spaces, and otherwise leave it as it is. Please help me implement this using a regular expression.
Hi,
It's unclear what you want. This site seems to be formatting your message in some odd way.
Post a statement like
SELECT '...' FROM dual;
without any formatting, to show your input, and post the exact output you want from that, with as little formatting as possible. It might help if you use some character like ~ instead of spaces (just for posting; we'll find a solution that works for spaces).
To remove the text that consists of spaces and nothing else between the tags, you can say
REGEXP_REPLACE ( str
, '> +<'
, '><'
)
How is this string being generated? Maybe there's some easier, more efficient way to keep the bad sub-strings out of the string in the first place.