Stripping HTML through regular expressions (please help)

Hi all,
I've been trying to use the OROMatcher-1.1 regular expression package downloaded from apache.org.
It works well with my program, but I'm having problems building a correct regular expression to strip off HTML tags.
Can any of you help me build an expression that strips off ALL HTML tags, including those with funny spaces such as:
<a href = "www.here.com">click me</a>
Do help, please. I've tried for ages and it's driving me mad.

Hi,
I won't go into much detail, but the simplest way to do that would be to use XML technology. Try SAX or DOM, whichever you feel comfortable with. I think SAX would be a better choice. For details visit
http://java.sun.com/xml/?frontpage-spotlight
/khurram
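
For the regex route the question asks about, here is a minimal sketch. It assumes java.util.regex rather than the OROMatcher package the poster uses (the pattern itself is plain Perl5-style syntax, so it should carry over), and the class and method names are made up for illustration. It treats a tag as '<', any run of characters other than '>', then '>', which copes with the "funny spaces" but not with a literal '>' inside an attribute value.

```java
import java.util.regex.Pattern;

class TagStripper {
    // '<', then any run of characters that are not '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes every match of TAG and returns what is left.
    static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<a href = \"www.here.com\">click me</a>"));
    }
}
```

Because the character class [^>] also matches line breaks, tags that span several lines are removed too. For anything beyond simple markup, a parser as suggested above is the safer choice.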

Similar Messages

  • HTML Escaping Regular Expression

    Assume I have the following string:
    <font face="arial">this</font> is a <b>very nice</b> " <a href>String</a>
    Now say I want to allow everything above, except I want to escape certain tags, i.e. I want to allow:
    <b>, </b>, <font ...>, </font> and nothing else. The escaped string above should then be converted to:
    <font face="arial">this</font> is a <b>very nice</b> " &lt;a href&gt;String&lt;/a&gt;
    The idea I'm trying to implement is to allow form input that may contain limited HTML tags, that I define, and escape anything else.
    This seems like it would be an existing regular expression. Does anyone have any ideas?
    Thanks!

    Kudos and thanks for the regex. I did not know how to do negation in a regular expression, i.e. (?!FONT|B)
    To answer an earlier post, the application works much like a message board. I accept input from a textarea, then write that input out as an HTML file (much like how this forum works). I'd like to accept certain HTML markup in the input, but disallow tags like "<javascript>", "<object>" and "<embed>", etc., as writing those tags would allow users to post malicious input (redirects, popups, etc.). Thus, defining and parsing the tags I will accept is easier (and safer) than defining the tags not to accept.
    That being said, escaping quotes is somewhat important, as a " in HTML that is not inside a tag should really be converted to &quot; for W3C browser standards. However, with 95% of my original question answered (that being the most important part), I'm satisfied with this. Thanks to all for the help thus far!
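
The regex from the reply is not quoted in this thread, so the following is only a sketch of the negative-lookahead idea the poster mentions, written with java.util.regex. The class and method names are hypothetical. It escapes the angle brackets of every tag whose name is not font or b and leaves the allowed tags untouched.

```java
import java.util.regex.Pattern;

class TagWhitelistEscaper {
    // Matches any tag (opening or closing) whose name is NOT font or b.
    // (?i) ignores case; \b keeps "b" from also allowing <br> or <body>.
    private static final Pattern DISALLOWED_TAG =
        Pattern.compile("(?i)<(?!/?(?:font|b)\\b)([^>]*)>");

    static String escapeDisallowed(String html) {
        // $1 is the text between the brackets; only the brackets are escaped.
        return DISALLOWED_TAG.matcher(html).replaceAll("&lt;$1&gt;");
    }

    public static void main(String[] args) {
        String in = "<font face=\"arial\">this</font> is a <b>very nice</b> \" <a href>String</a>";
        System.out.println(escapeDisallowed(in));
    }
}
```

Note that this does not inspect attributes, so an allowed tag could still carry something unwanted (an event handler on a font tag, for instance). It illustrates the lookahead, not a complete sanitizer.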

  • Quick regular expression question/help

    Can someone help me with two regular expressions I need? I could spend a while trying to figure it out myself; however, time's short and I would really like to get a foolproof, optimal solution (my attempt would be buggy).
    Sample sentence
    The population, is projected to reach 200,000, or more (by 2020).[7] This is {dummy} text.
    The first regular expression
    I need all brackets and everything between them to be removed from a sentence.
    Brackets such as: ( ), [ ] and { }.
    I.e. given the above sentence, the following would be returned:
    The population, is projected to reach 200,000, or more. This is text.
    The second regular expression
    If a word has a trailing comma character, I need to add a whitespace between the word and the comma.
    I.e. given the sentence returned from the first regular expression, this regex would return:
    The population , is projected to reach 200,000 , or more. This is text.
    Many thanks to anyone who can help me with this!
    Edited by: Myles on Jan 18, 2008 8:12 AM

    http://java.sun.com/docs/books/tutorial/extra/regex/index.html
    http://www.regular-expressions.info
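
The reply only links to tutorials, so here is one possible sketch of the two expressions in Java (java.util.regex). The class and method names are invented for illustration, and the patterns assume brackets are not nested and that a "trailing" comma is one followed by whitespace or the end of the string.

```java
import java.util.regex.Pattern;

class SentenceCleaner {
    // Optional leading whitespace, then a (...), [...] or {...} group (no nesting).
    private static final Pattern BRACKETED =
        Pattern.compile("\\s*(?:\\([^)]*\\)|\\[[^\\]]*\\]|\\{[^}]*\\})");

    // A word character, a comma, then whitespace or end of input (lookahead only).
    private static final Pattern TRAILING_COMMA = Pattern.compile("(\\w),(?=\\s|$)");

    static String removeBrackets(String s) {
        return BRACKETED.matcher(s).replaceAll("");
    }

    static String spaceBeforeComma(String s) {
        // Re-emit the captured character, then a space, then the comma.
        return TRAILING_COMMA.matcher(s).replaceAll("$1 ,");
    }

    public static void main(String[] args) {
        String s = "The population, is projected to reach 200,000, or more (by 2020).[7] This is {dummy} text.";
        String step1 = removeBrackets(s);
        System.out.println(step1);
        System.out.println(spaceBeforeComma(step1));
    }
}
```

The lookahead in the second pattern is what leaves the comma inside 200,000 alone: it is followed by a digit, not by whitespace.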

  • Replacing space with &nbsp; in HTML using regular expressions

    Hi
    I want to replace spaces with &nbsp; in HTML.
    I used the method below to replace the spaces in my HTML file.
    var spacePattern:RegExp = /\s/g;
    str = str.replace(spacePattern, "&nbsp;");
    Here the str variable contains the HTML file below. In this HTML file I want to replace the spaces in "What number does this represents" with &nbsp;
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled Document</title>
    </head>
    <body>
    <b><TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0" KERNING="0"><B></B></FONT></P></TEXTFORMAT><TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" style = 'font-size:10px' COLOR="#0B333C" LETTERSPACING="0" KERNING="0"><B> What number does this Roman numeral represents MDCCCXVIII ?</B></FONT></P></TEXTFORMAT></b>
    </body>
    </html>
    But with the above regular expression, the spaces inside the HTML tags are replaced with &nbsp; as well (for example, <P ALIGN="LEFT"> becomes <P&nbsp;ALIGN="LEFT">). I want to replace only the spaces outside the HTML tags. I want a result like this, using regular expressions in Flex:
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled Document</title>
    </head>
    <body>What&nbsp;number&nbsp;does&nbsp;this&nbsp;represents</body>
    </html>
    Please give me a solution to the above problem using regular expressions.
    Thanks in advance to all
    Regards
    ssssssss
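
No answer is recorded in this thread. As a rough sketch of one approach, shown in Java rather than the poster's Flex/ActionScript (the lookahead syntax used here is standard in both regex flavours), a space can be treated as "outside a tag" when the next angle bracket after it is '<' or there is none at all. The class and method names are hypothetical, and the pattern assumes attribute values never contain '<' or '>'.

```java
import java.util.regex.Pattern;

class NbspOutsideTags {
    // A space is outside a tag if, scanning forward over non-bracket
    // characters, we reach '<' or the end of input before any '>'.
    private static final Pattern SPACE_OUTSIDE_TAG =
        Pattern.compile(" (?=[^<>]*(?:<|$))");

    static String replace(String html) {
        return SPACE_OUTSIDE_TAG.matcher(html).replaceAll("&nbsp;");
    }

    public static void main(String[] args) {
        System.out.println(replace("<P ALIGN=\"LEFT\"><B> What number does this represents</B></P>"));
    }
}
```

Only the literal space character is replaced here, so line breaks between tags stay as they are; swapping the leading space for \s would convert those too.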

  • Question about Regular Expressions, please help!

    I have created an app which reads files and extracts certain data using regular expressions in JDK 1.4, using the Pattern and Matcher classes.
    However, it needs to run on JDK 1.2.2 (don't ask). The regular expression classes (Pattern and Matcher) are not available in 1.2.2, so I am looking for something similar which I can use.
    I need something that loops through all the matches found in the file, like how Matcher works, i.e.
    while (matcher.find())
    // do this
    Help!

    http://jakarta.apache.org/regexp/

  • Regular Expressions, please help.

    Hello everyone.
    Can I get a Java Regular Expression to match with a word of the following language...
    Start --> Expression;
    Expression --> [0-9]+;
    Expression --> Expression * Expression;
    So the regexp should match with words like:
    4;
    4664;
    4 * 763;
    5 * 4534 * 23534;
    04 * 002 * 1 * 10 * ...
    I would be very happy, if anyone could help.

    I don't think that I need to learn anything more.
    I am fairly sure that what I want is not possible.
    I want to build a compiler.
    I just finished the abstract syntax of my language. Now I need a way to translate the concrete syntax of my language into the abstract one.
    But I think it is not possible with regular expressions,
    because I need to match a Chomsky type 2 (context-free) syntax.
    I think regular expressions only match Chomsky type 3 (regular) languages.
    But the backtracking mechanism of Java regexes might be able to do this.
    I am not sure about this.
    If you have any ideas, please post.
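
For the grammar exactly as posted, a single pattern does seem to be enough: the rules only ever derive numbers joined by '*' and closed by ';', and that set of words is a regular language even though the grammar is written recursively. A minimal sketch in Java follows (hypothetical class and method names; whitespace around tokens is assumed to be optional). Once the language gains nesting, such as parentheses, a real parser is needed.

```java
import java.util.regex.Pattern;

class ProductExpression {
    // number ( '*' number )* ';' with optional whitespace between tokens.
    private static final Pattern WORD =
        Pattern.compile("\\s*\\d+(?:\\s*\\*\\s*\\d+)*\\s*;\\s*");

    // matches() requires the whole input to fit the pattern.
    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord("5 * 4534 * 23534;"));
        System.out.println(isWord("4 *;"));
    }
}
```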

  • Getting "Inner Html" using Regular Expressions. Learning RE in SDK1.4.

    Hello group.
    I am learning Regular Expressions in JAVA SDK 1.4 first. Not PERL or other language.
    Using the utility at the following link I am trying to get all the text between the <TR> and </TR> tags.
    http://jakarta.apache.org/oro/demo.html
    This seems simple but the line returns, breaks etc.. make it more difficult. I have worked on this for hours.
    There will be multiple table rows in my stream.
    My goal is to first get the text between the <TR> Tags...
    Then I was going to use groups to get data0, data1, data2, data3.
    Does this sound like a good plan? Should I use multiple RE or one RE that does 4 group returns.
    I was thinking the applet was causing my problem.
    <TR>.*?</TR> does not work.
    (<tr>\s*([^(</tr>)])+</tr>) does not work.
    I can get data0 to work as well as data1,2,3.
    Would it make more sense to split this multiple row table by </tr>?
    One row of malformed html (actually multiple rows):
    <TR>
    <TD bgColor=#ffffff><A class=fav
    href="http://nicesite.com/data0"
    >nicesite</A><IMG
    src="smile.gif"></TD>
    <TD bgColor=#12ff22><SPAN class=fav>data1</SPAN></TD>
    <TD bgColor=#12ff22><SPAN class=fav>data2</SPAN></TD>
    <TD bgColor=#12ff22><SPAN class=fav>data3</SPAN></TD>
    <TD align=middle bgColor=#ffffff><A
    href="#"><IMG
    src="smile.gif" border=0></A></TD>
    <TD align=middle bgColor=#ffffff>data
              4</TD>
    <TD align=middle bgColor=#ffffff>data5</TD></TR>
    s_____, I have seen some of your posts and tried to apply them. What do you think?
    Regards,
    NupeVic

    > http://jakarta.apache.org/oro/demo.html
    I prefer
    http://jregex.sourceforge.net/demoapp.html
    > This seems simple but the line returns, breaks etc..
    > make it more difficult.
    Yes, they do indeed.
    > There will be multiple table rows in my stream.
    > My goal is to first get the text between the <TR>
    > Tags...
    > Then I was going to use groups to get data0, data1,
    > data2, data3.
    > Does this sound like a good plan? Should I use
    > multiple RE or one RE that does 4 group returns.
    One of the main features of regexes that you must realize
    is that they are mainly suited for non-recursive, linear data structures
    (btw, that's why regexes in general are hardly suited for HTML).
    So, if the number of TD items is fixed, you could
    1. search using a single pattern for the whole row, something like
    "<PatternForTR>"+
    "<PatternForTD>(<PatternForData>)<PatternFor/TD>"+
    "<PatternForTD>(<PatternForData>)<PatternFor/TD>"+
    "<PatternFor/TR>"
    so that group 1 would contain data1 and so on.
    Otherwise, you should
    2. find each row using
    "<PatternForTR>(.*?)<PatternFor/TR>",
    then search the contents of group 1 using
    "<PatternForTD>(<PatternForData>)<PatternFor/TD>".
    > <TR>.*?</TR> does not work.
    The pattern itself is OK, but in order for it to work one should enable the DOTALL flag (the 's' flag in the jregex demo), as the '.' doesn't accept line breaks by default.
    > (<tr>\s*([^(</tr>)])+</tr>) does not work.
    It seems that [^(</tr>)]+ is actually nonsense in this context.
    It describes a string that consists of any chars but '(', ')', '<', '>', 'r', 't', '/'.
    What you actually meant (a string that doesn't contain "</tr>")
    is achieved simply by using the non-greedy quantifier in <TR>.*?</TR>.
    > I can get data0 to work as well as data1,2,3.
    > Would it make more sense to split this multiple row
    > table by </tr>?
    Going the second way above, you could find rows using the
    general pattern for TR:
    <tr.*?>(.+?)</tr>
    and search their contents (i.e. group 1) using the
    general pattern for TD:
    <td.*?>(.+?)</td>
    Finally, this is the specific pattern for TD that doesn't include the leading
    and trailing tags in group 1:
    <td[^>]*>(?:\s*</?[^>]*>)*\s*(.+?)(?:\s*</?[^>]*>)*\s*</td>
    It succeeded in finding
    nicesite
    data1
    data2
    data3
    data
    4
    data5
    in your sample.
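
The second approach above (find each row, then each cell inside it) might look like this with java.util.regex. This is an assumption on my part: the thread itself uses the ORO and jregex demos. The class name is made up, and instead of the last, more specific TD pattern, this sketch simply strips any inner tags from the cell afterwards and collapses whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableRowExtractor {
    // DOTALL lets '.' cross line breaks; CASE_INSENSITIVE accepts <TR> and <tr>.
    private static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
    private static final Pattern ROW = Pattern.compile("<tr[^>]*>(.+?)</tr>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<td[^>]*>(.*?)</td>", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns one list of cell texts per table row.
    static List<List<String>> extract(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop nested tags (A, SPAN, IMG), then normalise whitespace.
                String text = TAG.matcher(cell.group(1)).replaceAll("");
                cells.add(text.trim().replaceAll("\\s+", " "));
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<TR>\n<TD bgColor=#12ff22><SPAN class=fav>data1</SPAN></TD>\n"
                    + "<TD align=middle>data\n      4</TD></TR>";
        System.out.println(extract(html));
    }
}
```

Splitting the work into a row pattern and a cell pattern keeps each regex short, and it works whether or not the number of TD items per row is fixed.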

  • Regular expression (regex) help!

    I am trying to write a correct regular expression but am having difficulties.
    I have a webpage saved as a string and want to extract all the links (URLs) from the webpage string.
    The trouble I am having is that some websites surround links with double quotes " " and some use single quotes ' ' around links in HTML:
    Double quotes around the URL:
    <a href="www.example.com"></a>
    And single quotes:
    <a href='www.example.com'></a>
    So far I have a regex which extracts links if they are surrounded with double quotes (see below); however, if a page uses single quotes it screws up ;)
    Pattern.compile("<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]",  Pattern.CASE_INSENSITIVE);
    So is there a way to say look for double quotes OR single quotes?
    Many thanks

    There's no need to escape the single-quote (or apostrophe) in a regex. The only reason it was necessary to escape the double-quote (or quotation mark) is that the regex was written in the form of a String literal. Neither the single-quote nor the double-quote has any special meaning in regexes.
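
One way to say "double quote OR single quote" is a character class plus a backreference, so that the closing quote has to be the same character as the opening one. The sketch below is not the pattern from the thread. The class and method names are hypothetical, and unquoted href values are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Group 1 captures the opening quote (" or '); \1 demands the same
    // character as the closing quote. Group 2 is the URL between them.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s+[^>]*?href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "<a href=\"www.example.com\"></a> <a href='www.example.org'></a>"));
    }
}
```

Inside the Java string literal only the double quote needs a backslash, which is the point made in the reply above; the apostrophe goes in as-is.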

  • Regular Expression query help.

    Hi, your help will be appreciated,
    I need to replace the a string's pattern with some special characters.
                            Input String := 'mytext*% align="quot;leftquot;><font face="quot;Arialquot;"> *% align="quot;leftquot;"><this is text><p this to replace >'
                            Output String := 'mytext@ align="quot;leftquot;$<font face="quot;Arialquot;"> @ align="quot;leftquot;"$<this is text><p this to replace >'
    Replacing Rules:
    1)              '*%'             should be replaced by '@'
    2)              '>'            should be replaced by $ (only the EVERY FIRST occurrence after the character @ )
    Tried with REGEXP but looks like need your help!
    Thx
    DJ.

    Hi, DJ,
    DeeJay wrote:
    > Perfect Frank. Thanks for your help.
    > Could you please explain how it is working? you know, these Regexps are hurdle for me always in understanding.
    Not just you; regular expressions can be very cryptic.
    We're saying "replace '*%x>' with '@x$'", where x is 0 or more characters from the set of all characters except '>'.
    SELECT  REGEXP_REPLACE ( 'mytext*% align="quot;leftquot;> *% align="quot;leftquot;"><this is text>'
                           , '\*' ||   -- asterisk (special character, must be escaped)
                             '%'  ||   -- percent sign
                             '('  ||   -- begin \1 definition
                             '['  ||   -- begin set definition
                             '^'  ||   -- "The set consisting of all characters EXCEPT ...
                             '>'  ||   --     ... the greater-than sign"
                             ']'  ||   -- end set definition
                             '*'  ||   -- 0 or more characters from the preceding set
                             ')'  ||   -- end \1 definition
                             '>'       -- greater-than sign
                           , '@\1$'
                           )  AS txt
    FROM    dual;
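
For readers more at home in Java than in Oracle SQL, the same replacement can be sketched with String.replaceAll. This is a port for illustration only, not part of the original answer, and the class name is invented. The one Java-specific wrinkle is that '$' is special in the replacement string and has to be escaped there.

```java
class MarkerReplacer {
    static String apply(String s) {
        // Pattern:     \*%  then group 1 = any run of non-'>' characters, then '>'
        // Replacement: '@', group 1, then a literal '$' (escaped as \$)
        return s.replaceAll("\\*%([^>]*)>", "@$1\\$");
    }

    public static void main(String[] args) {
        System.out.println(apply(
            "mytext*% align=\"quot;leftquot;> *% align=\"quot;leftquot;\"><this is text>"));
    }
}
```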

  • Java Regular Expression Need Help

    I want a regular expression that accepts all numbers but skips a number if it appears inside {}.

    > No this is not working
    Then you need to be MUCH clearer as to exactly what you are trying to achieve...
    We aren't mind readers... try posting the string you are parsing and the exact result that you want to get.

  • Column constant value pls help

    COL cnt noprint
    COLUMN  cnt NEW_VALUE rowcount
    I am using this for a select query as below:
    ROWNUM cnt, COUNT (*) OVER () cnt
    but the issue is that when no rows are returned, rowcount is not initialized.
    When I use it for the trailer record
    SELECT '* Trailer record *' || '|' || &rowcount
      FROM DUAL
    it gives the error below:
    FROM DUAL
    ERROR at line 2:
    ORA-00936: missing expression
    Pls help
    S

    Solomon Yakobson wrote:
    > Post EXACT and COMPLETE snippet of SQL*Plus session showing what you did. I can't reproduce it:
    > SQL> COLUMN  cnt NEW_VALUE rowcount
    > SQL> SELECT COUNT (*) OVER () cnt
    > 2    FROM DUAL
    > 3  /
    Suppose cnt returns no record: then rowcount is null, because at that point rowcount has not been initialized.
    > SQL> SELECT '* Trailer record *' || '|' || &rowcount
    > 2 FROM DUAL
    > 3 /
    > old 1: SELECT '* Trailer record *' || '|' || &rowcount
    > new 1: SELECT '* Trailer record *' || '|' ||
    Missing expression: this was my error.
    > '*TRAILERRECORD*'||'
    > * Trailer record *|1
    > SQL>
    > SY.

  • Regular expression to substring

    Hi folks,
    I need to dynamically extract substrings from an attribute A.
    The VARCHAR2 attribute A is defined like this: "LXXXXX/111111(+),LXXXXX/111111(-),LXXXXXX/111111,etc..." Always the same series.
    I need to store all the "111111(+)", "111111(-)", "111111" values of the same record in a new attribute named B.
    I feel regular expressions could help me, but I'm not very good at them...
    Thanks for your help. ^^

    Try this,
    SELECT LTRIM (REGEXP_SUBSTR (attrA,
                                 '/[^,]+',
                                 1,
                                 LEVEL),'/')
      FROM T
    CONNECT BY LEVEL <= LENGTH (REGEXP_REPLACE ( attrA, '[^/]'))
    Example
    SQL> WITH T AS (SELECT 'LXXXXX/111111(+),LXXXXX/111111(-),LXXXXXX/111111,' attrA FROM DUAL)
      2  SELECT LTRIM (REGEXP_SUBSTR (attrA,
      3                               '/[^,]+',
      4                               1,
      5                               LEVEL),'/') exprssn
      6    FROM T
      7  CONNECT BY LEVEL <= LENGTH (REGEXP_REPLACE ( attrA, '[^/]'));
    EXPRSSN
    111111(+)
    111111(-)
    111111
    SQL>
    G.
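
The same extraction idea ('/' followed by everything up to the next comma) can be sketched outside the database as well. The Java version below is an illustrative port, not part of the original answer, and the class and method names are made up. A capture group takes the place of the LTRIM call that strips the leading slash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlashFieldExtractor {
    // '/' followed by one or more non-comma characters (captured in group 1).
    private static final Pattern AFTER_SLASH = Pattern.compile("/([^,]+)");

    static List<String> extract(String attrA) {
        List<String> parts = new ArrayList<>();
        Matcher m = AFTER_SLASH.matcher(attrA);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(extract("LXXXXX/111111(+),LXXXXX/111111(-),LXXXXXX/111111,"));
    }
}
```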

  • Regular expression usage question

    Hi there.
    I have a 200-byte EBCDIC variable record which I need to break down into fields. Fields are positional and are either text, binary numbers, packed-decimal numbers or 64-byte-long numbers.
    My question is: can regular expressions handle data this complex?
    I want to convert each field into its corresponding format: EBCDIC into ASCII text, binary into a Java Integer and so on.
    The reason for using regular expressions is that the record format could change, and a regular expression would be easier to modify without having to change the code.
    Your words of advice are highly appreciated.
    Please advise.
    Regards,
    Ulises

    Regular expressions? I don't think so.
    If you have a situation where positions 1-3 might be a binary number like client number, and the format might change so it moves to positions 12-14, then you could certainly write a record-format class to encapsulate that sort of information. In fact that would be a very good idea. But I can't imagine how a regular expression would help in getting a number out of three bytes, for example.
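
A record-format class along the lines suggested above could be sketched as follows. Everything specific here is hypothetical: the field names, offsets and lengths are invented, and only the binary and packed-decimal decoders are shown. The idea is that positions live in one table, so a layout change means editing that table rather than the parsing code.

```java
class RecordFormat {
    // Hypothetical layout: each field knows its own offset and length.
    enum Field {
        CLIENT_NUMBER(0, 3),   // big-endian binary number
        AMOUNT(3, 4);          // packed decimal (two digits per byte, sign in last nibble)

        final int offset;
        final int length;

        Field(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Reads an unsigned big-endian binary number of up to 3 bytes.
    static int readBinary(byte[] record, Field f) {
        int value = 0;
        for (int i = 0; i < f.length; i++) {
            value = (value << 8) | (record[f.offset + i] & 0xFF);
        }
        return value;
    }

    // Reads a packed-decimal number; sign nibble 0xD means negative.
    static long readPackedDecimal(byte[] record, Field f) {
        long value = 0;
        for (int i = 0; i < f.length; i++) {
            int b = record[f.offset + i] & 0xFF;
            value = value * 10 + (b >> 4);          // high nibble is always a digit
            if (i < f.length - 1) {
                value = value * 10 + (b & 0x0F);    // low nibble is a digit except in the last byte
            }
        }
        int sign = record[f.offset + f.length - 1] & 0x0F;
        return sign == 0x0D ? -value : value;
    }

    public static void main(String[] args) {
        byte[] record = {0x00, 0x01, 0x02, 0x12, 0x34, 0x56, 0x7D};
        System.out.println(readBinary(record, Field.CLIENT_NUMBER));
        System.out.println(readPackedDecimal(record, Field.AMOUNT));
    }
}
```

Text fields would be decoded separately by constructing a String with an EBCDIC charset, assuming the runtime ships one.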

  • Need help with regular expression

    I'm trying to use the java.util.regex package to extract URLs from html files.
    The URLs that I am interested in extracting from the HTML look like the following:
    <font color="#008000">http://forum.java.sun.com -
    So, the URL is always preceeded by:
    <font color="#008000">
    and then followed by a space character and then a hyphen character. I want to be able to put all these URLs in a Vector object. This doesn't seem like it should be too difficult but for some reason I can't get anywhere with it. Any help would be greatly appreciated. Thanks!

    Hi Gupta, I am not sure of the Java syntax, but I can tell you about the regular expression. Try this:
    <font color="#008000">(http:\/\/[a-zA-Z0-9.]+) [-]
    I don't know the Java methods to call, just the regexp.
    Sanjay Acharya
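
On the Java side, the missing calls are Pattern.compile, Matcher.find and Matcher.group. Here is a small sketch with hypothetical class and method names. It uses a List where the poster mentions a Vector, and \S+ instead of the character class above so that URLs with paths are also captured; both are my substitutions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreenUrlExtractor {
    // The fixed font tag, then the URL (group 1), then a space and a hyphen.
    private static final Pattern URL = Pattern.compile(
        "<font color=\"#008000\">(http://\\S+) -");

    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("<font color=\"#008000\">http://forum.java.sun.com - 12k</font>"));
    }
}
```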

  • Regular Expression to remove space in HTML Tag

    Hello All,
    My HTML string is like below.
    select '<CityName>RICHMOND</CityName> 
    <StateCd>ABCD CDE 
    <StateCd/>
    <CtryCd>CAN</CtryCd>
    <CtrySubDivCd>BC</CtrySubDivCd>' Str from dual
    Desired Output is
    <CityName>RICHMOND</CityName><StateCd>ABCD CDE 
    <StateCd/><CtryCd>CAN</CtryCd><CtrySubDivCd>BC</CtrySubDivCd>
    i.e. want to remove those spaces from tag value area having only spaces otherwise leave as it is. Please help to implement the same using Regular expression.

    Hi,
    It's unclear what you want.  This site seems to be formatting your message in some odd way.
    Post a statement like
    SELECT '...' FROM dual;
    without any formatting, to show your input, and post the exact output you want from that, with as little formatting as possible.  It might help if you use some character like ~ instead of spaces (just for posting; we'll find a solution that works for spaces).
    To remove the text that consists of spaces and nothing else between the tags, you can say
    REGEXP_REPLACE ( str
                   , '> +<'
                   , '><'
                   )
    How is this string being generated?  Maybe there's some easier, more efficient way to keep the bad sub-strings out of the string in the first place.
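
The same idea in Java, as an illustrative port with an invented class name. One deliberate difference from the pattern above: \s+ is used instead of ' +', on the assumption that the gaps between tags in the sample are line breaks as well as spaces.

```java
class InterTagWhitespace {
    // Removes whitespace-only gaps between a closing '>' and the next '<'.
    // Text that contains anything besides whitespace is left untouched.
    static String collapse(String xml) {
        return xml.replaceAll(">\\s+<", "><");
    }

    public static void main(String[] args) {
        System.out.println(collapse(
            "<CityName>RICHMOND</CityName> \n<CtryCd>CAN</CtryCd>\n<CtrySubDivCd>BC</CtrySubDivCd>"));
    }
}
```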
