Filtering HTML tags, CSS, and JavaScript with regex

Hi everyone,
I'm writing a small application where a user types in a URL, and the text of the webpage is displayed in a text area.
I've got it to work; however, it takes some time, and a lot of content I don't want is displayed: tags, scripts and sometimes CSS.
Initially I filtered out the HTML tags with a regular expression, but I still get a lot of unwanted content.
I'm not a confident Java programmer, and the idea of parsing HTML, CSS and JavaScript is the scariest idea ever to me, so my next idea is to keep only what is between the <body> tags and delete everything above and below them. Hopefully that should leave me with only the visible content of the site.
I've messed around with regular expressions but I can't get it to work. Can anyone help out?
Thanks a lot,
Torre
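For readers who want to stay with the regex approach described above, here is a minimal sketch in Java. It is an assumption about what "filtering" could look like, not the poster's actual code: the class name TextExtractor and the method visibleText are hypothetical. It removes script and style blocks together with their contents first, then comments, then any remaining tags.

```java
class TextExtractor {
    // Hypothetical helper: reduces an HTML string to roughly its visible text.
    static String visibleText(String html) {
        // 1. Drop <script>...</script> and <style>...</style> including their contents.
        String s = html.replaceAll("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>", " ");
        // 2. Drop HTML comments.
        s = s.replaceAll("(?s)<!--.*?-->", " ");
        // 3. Drop every remaining tag, keeping the text between tags.
        s = s.replaceAll("(?s)<[^>]+>", " ");
        // 4. Collapse runs of whitespace left behind by the removals.
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

The order matters: if the tags were stripped first, the JavaScript and CSS source would be left behind as plain text. The sketch does not decode entities such as &amp;amp; and will be confused by a ">" inside an attribute value, so a real HTML parser is still the more robust choice.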

Darryl.Burke wrote:
> I tried out the regexes I posted on the source of a forum page, which is not valid HTML (it contains two each of the opening and closing body tags). With a bit of trial and error I was able to remove everything up to the first, and not the second, opening tag by using a reluctant quantifier, ^.*?, but couldn't for the life of me achieve removal of only the last closing tag, leaving the other, invalid one intact. How would you do that?

Regexes always try to match the first occurrence of whatever they're looking for (the sentinel), and there's no way to change that behavior (but it would be handy if you could). What you have to do instead is make sure the rest of the regex can't match the sentinel. For that you need lookahead, and the simplest way to use it is to scan the rest of the text looking for the sentinel and, if it doesn't find one, go ahead and gobble up the remaining text:

    "(?is)</body>(?!.*</body).*$"

However, if there are many occurrences of the sentinel, you could take a serious performance hit. Here's a much more efficient way:

    "(?is)</body>(?:[^<]++|<(?!/body>))*+$"

After matching the sentinel, this regex gobbles up anything that's not the first character of the sentinel, or the first character as long as it isn't followed by the remaining characters of the sentinel. The advantages of this regex are that it never has to backtrack, and the lookahead is only applied when it's necessary, whereas the first regex applies it at every candidate match.
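To show how the second regex could be wired into the original poster's program, here is a small sketch. The class name BodyTrimmer, the method bodyOnly and the HEAD pattern are assumptions for illustration; only the TAIL pattern is taken from the reply above.

```java
import java.util.regex.Pattern;

class BodyTrimmer {
    // Assumed pattern: everything up to and including the FIRST opening <body ...> tag.
    private static final Pattern HEAD = Pattern.compile("(?is)^.*?<body[^>]*>");
    // Pattern from the reply: the LAST closing </body> tag and everything after it.
    private static final Pattern TAIL =
            Pattern.compile("(?is)</body>(?:[^<]++|<(?!/body>))*+$");

    // Hypothetical helper: returns only what lies between the body tags.
    static String bodyOnly(String html) {
        String withoutHead = HEAD.matcher(html).replaceFirst("");
        return TAIL.matcher(withoutHead).replaceFirst("");
    }
}
```

On a page with two closing body tags, the first </body> cannot start a match, because the possessive group stops at the second </body> and the $ anchor then fails, so only the last closing tag and the text after it are removed.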

Similar Messages

  • Using CSS and Javascript to display a div with flash in it, mozilla reloads the flash file!

    I am using CSS and Javascript to display a div with an
    embedded flash object in it. Mozilla Firefox reloads the flash file
    when the div is displayed! (I don't want this to happen, as it's
    unexpected functionality; my expectation would be that the flash
    file would not change its state at all, and would remain in
    whatever state it was left in.)
    I was wondering if anyone has come across this issue and is
    there something I can do to prevent this from occurring?
    To be more specific, I have a single HTML page with 8 flash
    files embedded in it (yeah I know, it's a bit much). I am then
    using CSS and Javascript to display (via a numbered link (with an
    id)) an equivalent numbered div tag containing the flash file.
    Mozilla Firefox reloads the flash object that is in the div.
    Internet Explorer will not do this and will instead, load the flash
    object only upon initial view of the flash object. All subsequent
    links (in IE) will NOT reload the flash object on the page. I'm
    guessing this is some kind of difference in the flash player as an
    Active X object and the plugin, or is it just IE being clever? Or
    am I way off?
    Anyway, here is the code...

  • Custom command to strip all except specified tags / parameters using javascript or regex?

    Hi all
    I'm trying to figure out how to create a custom command that will allow me to strip all tags from a page except for those I specify, and will only allow the parameters that I specify for the remaining tags. It seems to me that it should be achievable either with a custom command using javascript, or by simply recording a command and doing a search and replace with regex.
    I'm not familiar with javascript, so I've been trying the latter route, but I'm open to help with either method (or for that matter, any other method that will allow me to perform this with just a couple of mouse clicks, and no need to initiate a long sequence of commands manually every time).
    The tags I'd like to retain are:
    p
    br
    ul
    ol
    li
    a
    table
    tr
    td
    Any tags other than the above would need to be stripped. For the tags above, I'd obviously want to retain opening and closing tags, and I'd also want to allow href and target parameters for the a tag. I would want to strip all other parameters from these tags, though.
    I got partway into this, in that I managed to make regex that finds all the tags apart from the above, but hadn't yet gotten to trying to strip the unwanted parameters:
    </?\w+(?<!p|br|ul|ol|li|a|table|tr|td)((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s* )/?>
    This validates on the online version of RegExr (http://gskinner.com/RegExr/), and finds all tags apart from the above correctly as far as I can see. However, it doesn't work in Dreamweaver's Find box with Use Regular Expression checked, reporting "Invalid Quantifier".
    ...and this is as far as I've gotten. If anybody can offer help, it'd be much appreciated!
    Thanks in advance to you all...
    Mike

    Thanks, Barry. Paste Special is what I'm currently doing, but as you can imagine, it's mighty tedious cleaning up the formatting by hand on every single document, when the formatting I need is all there to start with in a regular paste -- just along with a bunch of extraneous formatting. With as many as a dozen or more documents to do each day, some with lengthy, nested lists and many bits of formatting that I need to keep, it's a bit soul destroying to have to do it manually.
    It must be possible with Javascript, but my problem is I'm not a Javascript coder and would need to learn from scratch. It's likely possible with regex too, but my issue there is that Dreamweaver's version of Regex seems to be cut down from the version all of the regex validators I could find online use. I came fairly close to having something workable done entirely with regex when I last looked at this a couple of months back, and it validated just fine (only remaining quirk was to do with some attributes I didn't need, but that was probably solvable.) Unfortunately, although it validated in multiple online tools, Dreamweaver refused to accept it as valid.
    I've since found a (very old) Dreamweaver Command extension that partially works, incidentally, and could perhaps be extended to do what I need -- but again, it would require me to understand code I simply don't, as yet.
    http://www.andrewwooldridge.com/dreamweaver/commands.html
    The Remove Tags Except command there mostly works, but it seems to have an issue with nested tags that throws up an error message, and it also doesn't have a way for me to specify which attributes to keep, just which tags. Other than that it's close to ideal though -- the only other way it could be improved would be if I could kludge it to simply run straight away with a predetermined exclusions file, rather than making me manually select and load the exclusions every time.
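A possible direction for the tag-stripping part, sketched in Java rather than as a Dreamweaver command. The (?<!...) lookbehind in the regex above is a likely culprit for the "Invalid Quantifier" error, since older JavaScript-based regex engines do not support lookbehind; that is an assumption, not something confirmed in the thread. The sketch uses a negative lookahead with a word boundary instead. The class name TagWhitelist and the method stripDisallowedTags are hypothetical.

```java
import java.util.regex.Pattern;

class TagWhitelist {
    // Matches any opening or closing tag whose name is NOT one of the allowed names.
    // The \b after the group stops "p" from also protecting "pre" or "param".
    private static final Pattern DISALLOWED = Pattern.compile(
            "(?i)</?(?!(?:p|br|ul|ol|li|a|table|tr|td)\\b)\\w+[^>]*>");

    // Hypothetical helper: removes disallowed tags but keeps the text inside them.
    static String stripDisallowedTags(String html) {
        return DISALLOWED.matcher(html).replaceAll("");
    }
}
```

This only covers the first half of the request: attributes on the retained tags are left untouched, and an attribute value containing ">" would end the match early. Stripping attributes other than href and target would need a second pass.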

  • CSS and JavaScript support in OHJ

    Hello,
    We're considering switching to OHJ for our cross-browser/cross-platform docs, but have concerns over the level of CSS and JavaScript support in OHJ.
    Concerning CSS, the online docs mention support for "Cascading Style Sheets (CSS) 1 and most of CSS 2". Is there a resource that specifically lists what isn't supported from CSS 2?
    Concerning JavaScript, is it currently supported? Some of the web references I've seen say that it's not currently supported (http://helponline.oracle.com/ohguide/help/state/content/vtTopicId./navSetId.blafdoc/vtTopicFile.blafdoc%7Cusing_browvers~html/navId.3/).
    Thanks,
    Mike

    Hi Mike,
    OHJ uses the ICE Browser, which we license from a company named ICESoft. It generally compares favorably with other Java-based browsers, although it is not as complete as a native browser such as IE or Firefox.
    There is a list of what is and is not supported in CSS:
    http://www.icesoft.com/developer_guides/icebrowser/htmlguide/featuresappendix3.html#86328
    We recently finally resolved the licensing issues that were holding us back from releasing a version of OHJ with JavaScript support. The current versions don't support JavaScript, but our next major release will.
    Depending on your needs, we also have OHW, which is a server based help system, that uses the client's native browser.
    Regards,
    Jeffrey Stephenson
    Oracle

  • 8900 Curve - HTML 4.01 and JavaScript 1.5

    Hi. I am trying to run a chess Web app (www.gameknot.com) that requires HTML 4.01 and JavaScript 1.5, so far unsuccessfully when it comes to moving pieces. Does my 8900 Curve support HTML 4.01 and JavaScript 1.5? Thank you. Jeff

    Hi Jeff. I'm also trying to make JS work, in my case using a PhoneGap application, so it browses local HTML/JS instead of the web (but it's HTML and JavaScript after all). I had some trouble when it came to moving items like small images, even though the other JS functions seem to work fine. It seems we need to focus on which positioning-related properties/functions are available in the BB browsers.
    Rodrigo Bravo
    http://www.wilkonit.com

  • Css and javascript getting called twice in publish mode

    Hi all,
    We are facing an issue only on publish environment that our css and javascripts are loaded twice on publish.
    On author it works fine.
    Can someone give some clue?
    Regards,
    Shallu

    Sallu,
    Can you delete /var/classes and /var/clientlibs and try again ? Is it happening on all publish instance or few of them ?
    Yogesh

  • Does Firefox work with all html tags/CSS properties?

    I am considering Firefox because MSIE has been and is becoming more annoying.
    I want a browser that simply implements all html tags and CSS properties.
    I want Firefox to install without screwing with any other application on my computer.
    Possible?

    Sure, no problem installing Firefox alongside other browsers.
    You only need to decide which browser to set as the default browser that is used when you click a link in other programs.
    * http://developer.mozilla.org/en/Mozilla_CSS_support_chart
    * https://developer.mozilla.org/en/HTML

  • HTML tags in generated javascript

    Hello BEA Experts, As you know, auto-generated JavaScript functions like getNetuiTagNames in Workshop are wrapped in HTML comment tags like <!--
    --!>. To hide the script from browsers that do not support JavaScript, they should instead be generated as <!--
    //--!>.
    My question: is there any way to stop generation of the HTML comment tags and just generate the JavaScript within the script elements?
    Please let me know..
    Thanks in advance..
    -Bob F.

    Thanks for the reply, Vimla. Actually I don't have any problem with the browser that I am using; the problem is with the JavaScript parser, which cannot understand the "-->" tag unless it is placed in a // comment. I have updated the support engineer with the issue via email reply.
    Apparently a // comment before the HTML comment terminator solves both issues, i.e. the browser-related one as well as the JavaScript-parser-related one.
    Please look into the following link.
    http://www.w3.org/TR/html4/interact/scripts.html#h-18.2.1
    It states the following:
    The JavaScript engine allows the string "<!--" to occur at the start of a SCRIPT element, and ignores further characters until the end of the line. JavaScript interprets "//" as starting a comment extending to the end of the current line. This is needed to hide the string "-->" from the JavaScript parser.

  • Css and javascript in text item

    I've read in other threads that you can add javascript and css into simple text items by clicking the 'View HTML Source' checkbox and adding the js and css.
    When I do this then apply it the code that I have added disappears and therefore has no effect on the item.
    Any ideas what I'm doing wrong?
    Cheers,
    Steven.

    Hi
    This depends on your needs:
    WHEN-NEW-FORM-INSTANCE > if you want it to appear at the very beginning
    --=============
    or you can use
    WHEN-VALIDATE-ITEM > on your date fields
    Here is a link that may help you decide where to start with this code...
    http://www.dotnetspider.com/resources/22433-Triggers.aspx
    http://www.slideshare.net/magupta26/oracle-forms-tutorial
    Hope this helps ,
    Regards,
    Abdetu...

  • Safari not rendering CSS and javascript correctly

    There seems to be a bug in Safari when dealing with CSS display property and javascript.
    I've created a form that hides certain fields; when the right circumstances are met, those fields become visible. However, when they become visible, through the use of javascript, they don't display in the flow of the document. For example:
    [field 1]
    [field 2]
    [field 3]
    Assume that [field 2] is hidden on initial page load. And when [field 1] is clicked, then [field 2] should appear and it should appear after [field 1]. Safari, however, displays [field 2] above [field 1] and strips it of any styles. I spent 2 hours trying to figure out what I was doing wrong, but it turns out I wasn't doing anything wrong. It's Safari that can't render this correctly because when I test it in Firefox it works fine. Just as I expected it to. I hope this gets fixed in the next update.
    If anyone has any explanation or solutions or reasons for this peculiar anomaly please tell me.

    If you have not done so already, make sure you also tell Apple by using the Bug Reporting item in the Safari menu.

  • I am not getting required output after importing xml file having HTML tags, MathML and Latex code

    I created well formatted xml having html tags, MathML code and Latex code inside
    XML with sample latex
    <question>
        <p> 8. In this problem, <Image  style="vertical-align: -9pt" title="5a\frac{7b}{3}" alt="5a\frac{7b}{3}" href="http://www.snapwiz.com/cgi-bin/mathtex.cgi?%205a%5Cfrac%7B7b%7D%7B3%7D%0A"></Image> and <Image href="http://www.snapwiz.com/cgi-bin/mathtex.cgi?%207b%5Cfrac%7B5a%7D%7B3%7D" style="vertical-align: -9pt" title="7b\frac{5a}{3}" alt="7b\frac{5a}{3}"> </Image>are both mixed numbers, like <Image src="http://www.snapwiz.com/cgi-bin/mathtex.cgi?%203%20%5Cfrac%7B1%7D%7B2%7D%20=%203.5" style="vertical-align: -9pt" title="3 \frac{1}{2} = 3.5" alt="3 \frac{1}{2} = 3.5"></Image>. The expression <Image style="vertical-align: -9pt" title="5a \frac{7b}{3} \ \times 7b \frac{5a}{3}" alt="5a \frac{7b}{3} \ \times 7b \frac{5a}{3}" href="http://www.snapwiz.com/cgi-bin/mathtex.cgi?%205a%20%5Cfrac%7B7b%7D%7B3%7D%20%5C%20%5Ctimes %207b%20%5Cfrac%7B5a%7D%7B3%7D"> </Image>is equal to which of the following choices? </p>
    </question>
    XML with sample html tags
    <p>4. Examine the expression 3<i>k</i><sup>2</sup> + 6<i>k</i> - 5 + 6<i>k</i><sup>2</sup> + 2.</p><p>When it is simplified, which of the following is the equivalent expression?</p>
    XML with sample MathML tags
    <p>5. Find the vertex of the parabola associated with the quadratic function <math xmlns="http://www.w3.org/1998/Math/MathML"><mi>y</mi><mo> </mo><mo>=</mo><mo> </mo><mo>-</mo><msup><mi>x</mi><mn>2</mn></msup><mo> </mo><mo>+</mo><mo> </mo><mn>10</mn><mi>x</mi><mo> </mo><mo>+</mo><mo> </mo><mn>4</mn></math></p>
        <math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced><mrow><mo>-</mo><mn>5</mn><mo>,</mo><mo> </mo><mn>69</mn></mrow></mfenced></math><br>
        <math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced><mrow><mn>5</mn><mo>,</mo><mo> </mo><mn>29</mn></mrow></mfenced></math><br>
        <math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced><mrow><mn>5</mn><mo>,</mo><mo> </mo><mn>69</mn></mrow></mfenced></math><br>
        <math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced><mrow><mo>-</mo><mn>5</mn><mo>,</mo><mo> </mo><mn>29</mn></mrow></mfenced></math>
    None of the above works after importing the XML into an InDesign document/template: InDesign does not render the equivalent output but instead shows the raw text inside the tags.
    I have a lot of content in our database. I want to export it to XML, import it into InDesign, have InDesign render the required output like what we see in the browser, and export the output to a print format.
    Import formatted XML to InDesign --> Export to Quality Print Format
    Can anyone tell me whether this is possible with InDesign and, if so, whether I am missing anything when importing the XML to get the required output?
    Thanks in advance. Waiting for reply ASAP.

    Possibly your expectations are too high. ... Scratch that "possibly".
    XML, in general, cannot be "rendered". It's just an abstract data format. Compare it to common markup such as *asterisks* to indicate emphasized words -- e-mail clients will show this text in bold, but InDesign, and the vast majority of other software, does not.
    To "render" XML, HTML, MathML, or LaTeX -- you seem to freely mix these *very* different markup formats, unaware of the fact that they are vastly different from each other! -- in the format you "expect", you need to apply the appropriate software to each of  the original data files. None of these markup languages are supported natively by InDesign.

  • How to show cfm, html, js, css, and browser tab in coldfusion builder 3?

    On Coldfusion Builder 2, I see these tabs are available when I create a new page, but not on Coldfusion Builder 3.
    What should I do to show them on CFB 3?

    Builder 3 does not seem to offer those sub-tabs in its toolbar, but if you open each of those kinds of files, you’ll find that the top toolbar (where they appeared) shows many of the same icons for each of those type of files as was previously listed in those sub-tabs.
    I realize that isn’t as helpful if you are opening a single CFM or HTML file which happens to have also HTML, CSS, and JS code in it. You won’t see the icons for that other kind of code, even when you move to such code.
    FWIW, note that CFB 1 and 2 had previously relied upon an underlying 3rd party tool called Aptana, for most such  HTML, CSS, and JS-specific functionality. In CFB3 that product was dropped and the CFB team tried to recreate much of its functionality. This (the tabs for other code within a single file extension) may be  something they didn’t think was used often, or maybe there was a challenge in doing so.
    Even so, do note also that there is still support for CSS and JS (and HTML) within a CFM page, in terms of CFB’s features like code completion, code assist, outline view, etc.
    Still, I realize that some of the “wrap”-oriented tools in that old toolbar aren’t accessible any other way, so you may want to file a feature request (or bug report, however you may want to cast it) at bugbase.adobe.com.
    Hope that’s helpful.
    /charlie

  • Put HTML tag in XML generated with Strings panel

    Hello,
    I created a multilanguage application with the "Strings" panel.
    My customer needs to put some words in italics inside the dynamic text field.
    So I used the <i> tag or a CDATA tag inside the XML, but nothing works, even if I set the text field to "html" format.
    What can I do?
    Do we need to change the "locale.as" class?
    Thanks.
    Regards,

    Is the dynamic text field set to embed the fonts? (Properties Panel > Character.... radio button clicked to specify ranges)
    If it is, set it to no characters. Also make sure your dynamic text field is set in the properties panel to render as html.
    Here is a test:
    Create a new Flash document, select the text tool from the tools menu and create a text area on the stage. Make it a dynamic text field in the properties panel and click the render as html button. Give it an instance name of "myText" and on frame one of the main timeline put the following actionscript.
    myText.htmlText = "<i> This is Italics</i>. This is not"
    Test the movie and it should show what you are looking for.
    Dan Mode
    --> Adobe Community Expert
    *Flash Helps*
    http://www.smithmediafusion.com/blog/?cat=11
    *THE online Radio*
    http://www.tornadostream.com
    *Must Read*
    http://www.smithmediafusion.com/blog

  • Html tags br and hr in JEditorPane

    I have a JEditorPane that is used to display HTML data. The data is obtained from an XML file, transformed to HTML using a stylesheet and rendered in the JEditorPane using the setContentType("text/html") and setText() methods. However, the <br/> and <hr/> tags used in the HTML cause problems, in that a '>' sign shows up at the location of these tags in the JEditorPane. So if I have a <hr/> in the HTML, a horizontal line followed by a > sign is rendered in the JEditorPane. I am unable to figure out what's going on; the HTML is rendered fine in browsers.
    Also, is there a way to get rid of the <? xml version etc etc ?> tag at the beginning of the generated HTML without resorting to String manipulation exercises?
    Any help will be appreciated. I am using jdk 1.4
    Thanks in advance
    RS

    I am trying to figure out how to get rid of the pesky > signs. The HTML that I have used is pretty basic, just some text formatted by the <p>, <br> and <hr> tags. Is there any documentation on what JEditorPane supports (which HTML version or which tags) and what it does not?
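One thing that may be worth trying, sketched below under assumptions: Swing's HTML support targets an old HTML version, so the XML-style self-closing form (<br/>, <hr/>) is a likely cause of the stray > sign. Asking the transformer for the "html" output method makes it write <br> and <hr> instead, and OutputKeys.OMIT_XML_DECLARATION drops the <?xml ...?> line. The class name HtmlOutputSketch and the method toLegacyHtml are hypothetical, and the sketch uses an identity transform in place of the poster's stylesheet; with a real stylesheet, <xsl:output method="html"/> should have the same effect.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class HtmlOutputSketch {
    // Hypothetical helper: re-serializes well-formed markup using the HTML output method.
    static String toLegacyHtml(String xml) {
        try {
            // Identity transform (no stylesheet); stands in for the poster's XSLT.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Serialize as HTML, so empty elements are written as <br>, not <br/>.
            t.setOutputProperty(OutputKeys.METHOD, "html");
            // Do not emit the <?xml ...?> declaration.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }
}
```

The resulting string can then be passed to setText() as before.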

  • Uploading CSS and Javascript

    I have created a custom stylesheet and also some additional javascript functions which I have uploaded to the server.
    Whenever I make changes to these documents, the changes don't seem to appear for some time when I run a web template that uses them.
    Is there something that I need to do to force the updated versions to be used?
    I have tried deleting the cache in the WAD, but this has not helped.
    Any assistance on this would be appreciated.
    Regards
    Richard

    Richard -
    Try transaction smicm
    Once there, navigate to Goto ==> HTTP Server Cache ==> Invalidate ==> Global in System
    This should do the trick. If not, your browser might be caching the css/js file. If this is the case, do a hard refresh (Ctrl + F5 in Internet Explorer).
