Need a quicker way to strip most HTML tags while retaining a few.

I have to strip the majority of non-plain-text content (HTML, JavaScript, CSS) from a file whilst retaining the HTML tags I want.
I have found many solutions for stripping ALL of the tags, but none for stripping most whilst retaining a few.
I first looked at regular expressions, but I could not find a solution that didn't involve specifying all the tags I want removed instead of just listing the tags I want to keep.
I had something like this: str.replaceAll("\\<+?(^p|^th|^tr|^td|^h2|^h3|^h4|^li).+?\\>","") but it does not work.
A regular expression would be great, but I could not find one, so I coded my own solution.
The code below goes through each tag (both start and end tags) and strips it from the string if it is not a wanted tag.
private void stripTags(int startIndex){
          int startArrow = 0;
          int endArrow = 0;
          while(true){
               startArrow = text.indexOf("<", startIndex);
               endArrow = text.indexOf(">", startArrow+1);
               //no more tags (reached EOF)?
               if(startArrow == -1 || endArrow == -1)   return;
               if(text.substring(startArrow+1, startArrow+2).equals("/")){
                    //deal with the end tag
                    if(isWantedTag(text.substring(startArrow+2, startArrow+4))) {
                         startIndex = endArrow+1;
                    } else {
                         //remove the tag
                         text = text.substring(0, startArrow).concat(text.substring(endArrow+1, text.length()));
                         startIndex = startArrow;
                    }
               } else {
                    //deal with the start tag
                    if(isWantedTag(text.substring(startArrow+1, startArrow+3))){
                         //remove tag parameters
                         if(endArrow-(startArrow+2) > 1) text = text.substring(0, startArrow+3).concat(text.substring(endArrow, text.length()));
                         startIndex = text.indexOf(">", startArrow+1)+1;
                    } else {
                         //remove the tag
                         text = text.substring(0, startArrow).concat(text.substring(endArrow+1, text.length()));
                         startIndex = startArrow;
                    }
               }
          }
     }
     private boolean isWantedTag(String tag){
          return (tag.equals("p>") || tag.equals("p ") || tag.equals("th") || tag.equals("tr") || tag.equals("td") || tag.equals("h2") || tag.equals("h3") || tag.equals("h4") || tag.equals("li"));
     }
The problem is that this code takes ages to run: about 10 seconds on a very large HTML file. I need a solution that takes only a tenth of that time.
Does anybody have a regexp that could do the job, or a more efficient version of my tag stripper above?
Thanks in advance
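
For the regular-expression route asked about above, one possible sketch uses a negative lookahead, so that only the tags to keep are listed. The `^` in the attempt above acts as a start-of-input anchor rather than a negation, which is one reason it cannot work. The class and method names here (`KeepTagsRegex`, `stripUnwantedTags`) are made up for illustration, and the pattern assumes reasonably well-formed tags; unlike the hand-written stripper it leaves attributes on kept tags, and it does not remove the contents of script or style blocks.

```java
import java.util.regex.Pattern;

public class KeepTagsRegex {
    // Matches any opening or closing tag whose name is NOT in the keep-list.
    // (?!...) is a negative lookahead placed right after '<'; the \b stops
    // "p" from also protecting tags such as "pre".
    private static final Pattern UNWANTED_TAG = Pattern.compile(
            "<(?!/?(?:p|th|tr|td|h2|h3|h4|li)\\b)[^>]*>",
            Pattern.CASE_INSENSITIVE);

    public static String stripUnwantedTags(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div><p class=\"x\">Hi <b>there</b></p><pre>x</pre></div>";
        // Keeps the <p> element (with its attribute) and drops div, b and pre tags.
        System.out.println(stripUnwantedTags(html));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the expression on every call, which is what `String.replaceAll` would do.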

I suggest the following template:
int copyFrom = 0;
StringBuilder sb = new StringBuilder();
while (copyFrom < text.length()) {
    int copyTo = findStartTag(text, copyFrom);
    sb.append(text, copyFrom, copyTo);
    copyFrom = findEndTag(text, copyTo + 1);
}
text = sb.toString();
where findStartTag(String s, int offset) returns the index of the next start tag in s, starting the search at offset. It should return s.length() rather than -1 if nothing is found. findEndTag is similar.
Sorry, that's obviously wrong since you don't want to remove everything between the tags, just the tags themselves. The basic idea is the same, though:
int copyFrom = 0;
StringBuilder sb = new StringBuilder();
while (copyFrom < text.length()) {
    int copyTo = findStartOfTagToRemove(text, copyFrom);
    sb.append(text, copyFrom, copyTo);
    copyFrom = findEndOfTagToRemove(text, copyTo) + 1;
}
text = sb.toString();
Here, findStartOfTagToRemove finds the position of the '<' character of the next tag to remove, and findEndOfTagToRemove finds the position of the '>' character of that tag. As before, both should return text.length() rather than -1 when nothing is found.
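
A minimal, self-contained sketch of this single-pass idea follows. The original code rebuilds the whole string with substring/concat for every removed tag, so the work grows roughly quadratically with the input; appending to a StringBuilder touches each character once. The names `TagStripper`, `WANTED` and `tagName` are illustrative and not part of the template above, and instead of two separate finder methods the loop decides per tag whether to copy it. It assumes well-formed tags (a stray '<' in the text is treated as the start of a tag) and, like the question's code, drops attributes on the tags it keeps.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TagStripper {
    // Tags to keep; everything else is dropped (list taken from the question).
    private static final Set<String> WANTED = new HashSet<>(
            Arrays.asList("p", "th", "tr", "td", "h2", "h3", "h4", "li"));

    public static String stripTags(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        int copyFrom = 0;
        while (copyFrom < text.length()) {
            int lt = text.indexOf('<', copyFrom);
            int gt = (lt == -1) ? -1 : text.indexOf('>', lt + 1);
            if (gt == -1) {
                // No further complete tag: copy the remainder and stop.
                sb.append(text, copyFrom, text.length());
                break;
            }
            sb.append(text, copyFrom, lt); // plain text before the tag
            boolean closing = text.charAt(lt + 1) == '/';
            String name = tagName(text, closing ? lt + 2 : lt + 1, gt);
            if (WANTED.contains(name)) {
                // Keep the tag, lower-cased and without its attributes.
                sb.append(closing ? "</" : "<").append(name).append('>');
            }
            copyFrom = gt + 1; // continue after the '>'
        }
        return sb.toString();
    }

    // Reads the run of letters/digits starting at 'from', stopping before 'to'.
    private static String tagName(String text, int from, int to) {
        int end = from;
        while (end < to && Character.isLetterOrDigit(text.charAt(end))) {
            end++;
        }
        return text.substring(from, end).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<div><p class=\"x\">Hi <b>there</b></p></div>"));
    }
}
```

Looking names up in a set also means the keep-list can grow without adding more string comparisons per tag.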
Edited by: nygaard on Sep 30, 2008 10:39 AM

Similar Messages

  • I just upgraded to the latest version of iTunes and it duplicated virtually every track in my music library. I need a quick way to delete the duplicates. Sorting by "Date Added" will not help because they are all listed as added on 12/12/2011.

    Library Duplicated
    I just updated to the latest version of iTunes and it duplicated virtually every track in my library. I need a quick way to delete the duplicates. Sorting by "Date Added" will not work, because every track is listed as added on 12/12/2011 even though this happened today 12/19/2011.

    I've written a script called DeDuper which can help remove unwanted duplicates. See this  thread for background.
    tt2

  • Any  Quick Way to Strip Audio from Video Podcast

    I like the Mad Money podcast. All I can find is video podcast only. I really don't care to look at Jim Cramer's ugly mug everyday.... I would prefer just to have the audio only so I can put on an iPod shuffle.
    Does anyone know if an audio podcast of the show is available or, if not, of a quick way to strip it from the video?
    Thanks!

    You can also open the file as an audio file, and save it as an MP3, in Amadeus Lite ($24.99 from the Apple App Store - requires OSX 10.5.5) or Amadeus Pro ($40 - requires OSX 10.4 or higher, Snow Leopard compatible). Both are available in trial versions. (Unfortunately Audacity won't open these files.)

  • Stripping all HTML tags from a CLOB

    Hi all,
    Running Oracle 9.2.0.8 on AIX...
    We have a table which stores HTML document fragments in a clob. I have a requirement to convert these to plain/text (strip all HTML tags) for sending in a plain/text email body.
    I have read the following solution from Tom Kyte's site:
    http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:25695084847068
    Basically creating an Oracle text index on the CLOB column and calling ctx_doc.filter with "plaintext" parameter set to true.
    I noticed in Tom's example, he uses the default filter, which based on the docs, is NULL_FILTER, which applies no filtering. I have tried his example in my dev box, creating the text index on the CLOB column with no parameters.
    The call to ctx_doc.filter did not filter the html at all. I re-created the index and specified the INSO_FILTER and the filtering was done. I was under the impression that INSO_FILTER was for filtering binary content to plaintext...
    create table filter ( query_id number, document clob );
    create table demo
      ( id            int primary key,
        theclob       clob
      );
    SET DEFINE OFF;
    Insert into DEMO
       (ID, THECLOB)
    Values
       (1, '<html><body><p>This is a test of <strong>ctx_doc.filter</strong> and plaintext filtering.</p></body></html>');
    COMMIT;
    exec ctx_doc.filter('demo_idx',1, 'filter',1, true);
    The above code does not convert the HTML to plaintext...
    Now re-create the index with INSO_FILTER
    drop index demo_idx;
    create index demo_idx on demo(theClob) indextype is ctxsys.context parameters ('filter ctxsys.inso_filter');
    exec ctx_doc.filter('demo_idx',1, 'filter',1, true);
    The above scenario returns the string "This is a test of ctx_doc.filter and plaintext filtering."
    The Oracle documentation doesn't specify any special filter parameter that needs to be set... just wondering if I'm missing something here... or better yet, if there is a better solution to my problem. ;-)
    Thanks
    Stephane

    The difference between what you did and what Tom Kyte did is that you created your index on a clob column and Tom created his index on a blob column. What I don't know is why that makes a difference. I have demonstrated below with one blob column and one clob column, one index on the blob and one index on the clob, using the same code on both, with different results.
    SCOTT@orcl_11gR2> create table filter
      2    (query_id  number,
      3       document  clob)
      4  /
    Table created.
    SCOTT@orcl_11gR2> create table demo
      2    (id       int primary key,
      3       theblob   blob,
      4       theclob   clob)
      5  /
    Table created.
    SCOTT@orcl_11gR2> create index demo_blob_idx
      2  on demo (theblob)
      3  indextype is ctxsys.context
      4  /
    Index created.
    SCOTT@orcl_11gR2> create index demo_clob_idx
      2  on demo (theclob)
      3  indextype is ctxsys.context
      4  /
    Index created.
    SCOTT@orcl_11gR2> insert into demo values
      2    (1,
      3       utl_raw.cast_to_raw (
      4         '<html>
      5            <body>
      6              <p>
      7             This is a test of
      8             <strong> ctx_doc.filter </strong>
      9             and plaintext filtering.
    10              </p>
    11            </body>
    12          </html>'),
    13       '<html>
    14          <body>
    15            <p>
    16              This is a test of
    17              <strong> ctx_doc.filter </strong>
    18              and plaintext filtering.
    19            </p>
    20          </body>
    21        </html>')
    22  /
    1 row created.
    SCOTT@orcl_11gR2> exec ctx_doc.filter ('demo_blob_idx', 1, 'filter', 1, true)
    PL/SQL procedure successfully completed.
    SCOTT@orcl_11gR2> exec ctx_doc.filter ('demo_clob_idx', 1, 'filter', 2, true)
    PL/SQL procedure successfully completed.
    SCOTT@orcl_11gR2> select id, utl_raw.cast_to_varchar2 (theblob), theclob from demo
      2  /
            ID
    UTL_RAW.CAST_TO_VARCHAR2(THEBLOB)
    THECLOB
             1
    <html>
            <body>
              <p>
                This is a test of
                <strong> ctx_doc.filter </strong>
                and plaintext filtering.
              </p>
            </body>
          </html>
    <html>
          <body>
            <p>
              This is a test of
              <strong> ctx_doc.filter </strong>
              and plaintext filtering.
            </p>
          </body>
        </html>
    1 row selected.
    SCOTT@orcl_11gR2> select query_id, document from filter
      2  /
      QUERY_ID
    DOCUMENT
             1
    This is a test of ctx_doc.filter and plaintext filtering.
             2
    <html>
          <body>
            <p>
              This is a test of
              <strong> ctx_doc.filter </strong>
              and plaintext filtering.
            </p>
          </body>
        </html>
    2 rows selected.
    SCOTT@orcl_11gR2>

  • Need a Quicker Way to Unarchive...

    Hello All:
    I have some .tar.gz archives on DvD's that are made for backup. I went to go and open one today and it took forever to unarchive the file.
    Is there a way to just look at the contents of this file or is there a quicker way to unarchive?
    Thanks for any help or advice on this,
    Paul

    I was talking about the plug-in, but started from the program page and thought it worked the same, just more features. I guess that part was right

  • Need to copy Data from a specific Html Tag

    Hello,
    I am trying to use CF to access website and capture data from a specific tag to the end of that tag and store same in a csv file or database.
    The tag based search of an open file is where I am not able to get any head way. Any one has done this?

    You'll need to use a regular expression for that. CF supports regular expressions with the REFind, REFindNoCase and REReplace functions. Here's an example of using regular expressions to capture the value within an HTML tag:
    http://www.javamex.com/tutorials/regular_expressions/example_scraping_html.shtml
    It's in Java, but the syntax for regular expressions is the same in CF.
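
    As a rough illustration of the capture-group technique in Java (the language of the linked tutorial), the sketch below collects the text between each opening and closing occurrence of a given tag. The class and method names (`TagContent`, `extract`) are invented for this example; it assumes the tag is not nested inside itself and that the markup is reasonably well formed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagContent {
    // Returns the text between <tag ...> and </tag> for each occurrence.
    public static List<String> extract(String html, String tag) {
        String name = Pattern.quote(tag);
        // Group 1 is the reluctant (.*?) between the opening and closing tag;
        // DOTALL lets it span line breaks.
        Pattern p = Pattern.compile(
                "<" + name + "\\b[^>]*>(.*?)</" + name + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<html><title>Demo page</title>"
                + "<h2 id=\"a\">First</h2><h2>Second</h2></html>";
        System.out.println(extract(html, "h2"));
    }
}
```

    The resulting list could then be written out as CSV rows or inserted into a database, depending on where the captured data needs to go.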
    Dave Watts, CTO, Fig Leaf Software

  • Stripping selective HTML tags?

    Anyone have a "best practice" for stripping (or allowing) certain HTML
    tags on an insert/update transaction? I'm using TinyMCE in my forms
    which does this automatically, but if a user has javascript disabled
    they can bypass the TinyMCE stripping and still submit text with all
    sorts of dangerous HTML code.
    Is there a preferred method anyone has?
    Alec
    Adobe Community Expert

    Hi Alec,
    I´d first suggest to define all your "disallowed stuff" rules in an external function, which basically uses preg_replace (or str_replace in simpler cases) plus some regular expressions to replace whatever patterns with '' -- in addition you could also integrate the strip_tags function right here, like this:
    function block_content($what){
    $what = strip_tags($what, '
    $what = preg_replace('/<\/{0,1}font[^>]*>/i', '', $what);
    $what = preg_replace('/^
    <\/p>/i', '', $what);
    $what = preg_replace('/^(?:
     <\/p>)/im', '', $what);
    $what = preg_replace('/style="BACKGROUND:.*"/i', '', $what);
    return $what;
    }
    But I'd like to clean it when it's actually written to the database
    rather than when I'm pulling it out
    please try with establishing as Custom Trigger (type: BEFORE) and clean the contents by applying the "block_content" function to $tNG->getColumnValue("columnname"):
    $tNG->setColumnValue("columname", block_content($tNG->getColumnValue("columnname")));
    it´s a little experimental, as I haven´t done that before, but I think it should work ;-)
    Even so, users can still insert dangerous attributes
    well, good luck with trying to capture all possible dangerous stuff in this function :-)
    Cheers,
    Günter Schenk
    Adobe Community Expert, Dreamweaver

  • Best way to format text: HTML tags or CSS?

    Probably a silly question, but...  I can format the text within a div using CSS Styles (Font-size, Font-Style, Font-Weight, etc.) or by applying an HTML tag, such as <H1>. (Or, I guess a combination where I use CSS to format the tag, but let's skip that one for now.)
    In general, is one approach better than the other? Thanks!

    +1 Nancy. The specific tags exist today simply because of the meaning it adds to your site at the backend - especially for search engines to understand what's going on with your site. And since search engines dont 'see' your page, but 'read' your code, these tags are the only hope for them to figure out what your site contains and how to place it on their results.
    On a sidenote, the first method of using inline styles is time-consuming and results in larger filesizes since you'll end up doing it for each div tag. The second method is much simpler as you can style each individual tag in CSS and re-use them anywhere you want on your site.

  • I need a quick way for extracting 100's of clips.

    I have 90 min of video that we used for an old show and I need to pull out 100's of 3- 10 second clips and loops for use in another program called Arkaos. What is a good technique for grabbing and exporting quickly?
    Thanks,
    Ken

    Is the video already on your computer in a Quicktime file? Me, I would use MpegStreamclip (free) using in and out points (i and o keys) and export that clip. Fewest clicks, highest quality, quickest. Just put a folder on your desktop for the clips to go to.
    Is the video in already in iMovie HD? You can select that small clip and export making sure you check off the box that says to export only the selected clip. But 100s of clips, that is a lot of selecting and exporting in iMovie HD. IMovie 09 is built for that, but still, the fastest and easiest way that this non-expert knows is to make it into the highest quality quicktime movie I need, and then select and export using Streamclip.
    But iMovie HD experts may differ. Lot of different techniques out there is my guess.
    Hugh

  • HTML Tags not moved in PDF

    Hello,
    I would like to print a report as a PDF using BI Publisher.
    This works quite well so far, except for one problem:
    I pass HTML-formatted text from a CLOB field,
    and in the PDF output the HTML tags are not converted.
    The PDF document shows the HTML tags literally instead of a formatted heading,
    e.g.
    <*h1> This is the heading <*/h1> (without stars)
    What do I have to set so that I get a formatted
    heading?
    Does anybody have an idea?
    Edited by: user10460383 on 14.09.2009 06:16
    Edited by: user10460383 on 14.09.2009 06:17

    True - a PDF isn't going to support HTML encoding. HTML will just be seen as more text, and displayed that way.
    You can strip out HTML tags fairly easily with a regular expression in your query SQL - this simply looks for text between < and > characters, and removes it. That should work for basic HTML formatting tags, but it isn't 100% (it won't handle <script> blocks correctly, for instance).
       select regexp_replace(myHTML, '<[^>]*>', '') as myText
       from myTable...
    Implementing a method to convert the HTML formatting into RTF formatting is also possible, but not a trivial task - you'd effectively have to replace each HTML tag with an RTF equivalent -- e.g., replace <H1> with the RTF code to make a larger font, replace </H1> with RTF code to return the font to normal... etc...
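
    The same tag-stripping pattern can be tried outside the database. The Java sketch below applies `<[^>]*>` and, as an assumption that goes beyond the SQL shown, first removes whole script and style elements so that a '<' inside script code does not confuse the second pass. `PlainText` and `toPlainText` are illustrative names only.

```java
public class PlainText {
    public static String toPlainText(String html) {
        // Pass 1: drop <script>/<style> elements together with their contents.
        // (?is) = case-insensitive, and '.' also matches line breaks.
        String noBlocks = html.replaceAll(
                "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>", "");
        // Pass 2: remove every remaining tag, as the regexp_replace call does.
        return noBlocks.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1>"
                + "<script>if (a < b) alert(\"x\");</script>"
                + "<p>Body</p>";
        System.out.println(toPlainText(html));
    }
}
```

    Without the first pass, the script source would be left in the output as partially mangled text, which is the limitation mentioned above.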

  • Converter for converting all HTML tags to JSF tags

    hai all,
    i am new to JSF. i need a suggestions to convert all my HTML pages to JSF pages bcoz i was already created more pages in HTML and now i want to convert all the pages to JSF. can u give any suggestion plz post it here or send me a mail to [email protected]
    thanks in advance,
    regards,
    V.Sabarish

    hi roman,
    thanks for ur reply. it converts the file but the links r not converting....so can u suggest me another way for converting a HTML tags to JSF tags.

  • How to avoid HTML tags in CSV output

    Hi,
    I have a Interactive Report, some of the columns are links which are derived from functions as shown below.
    function1 (p1) - this function will return a url something like
              '<a href=f?p=........>'||p1||'</a>';
    select col1, col2, function1(col3) from table1
    Report Output
    11     22     _33_(link)
    CSV Download Output
    11     22     <a href=f?p=........>33</a>
    Report output is perfectly fine and links are also displayed correctly..*no issues*
    BUT when I download this as a CSV output...I am getting all the html tags *< a* along with href=..f?p=... */ a >* in the CSV file
    Is there any way to avoid these html tags in the CSV download file.
    Thanks,
    Deepak
    Edited by: Deepak_J on May 17, 2010 3:44 PM

    Hi,
    I am using following conditions (Interactive Report)..as shown below.. just say columns are DEPT and DEPT_EXPORT.
    IR Report column
       DEPT
         - condition- PL\SQL Expression... :REQUEST != 'CSV'
         (this will display as a normal IR column)
    Export Column
       DEPT_EXPORT
         - condition- PL\SQL Expression... :REQUEST = 'CSV'
         (this will display only in case of export to csv)
    1. When I run the report and do CSV download..sometime ..I get the DEPT_EXPORT column in the CSV file..sometime not..*No idea why it is happening..it's not consistent.*
    2. Secondly, I need to display DEPT_EXPORT column in CSV export..only when DEPT column has been selected in the Interactive Report.
    If no DEPT column in IR...no DEPT_EXPORT column in CSV. Is it possible??
    Thanks,
    Deepak

  • HTML tag place over QuickTime

    Is there a way to place a HTML tag (i.e. div) over QuickTime? I have a drop down menu that is hiding behind QuickTime. The div tag that QuickTime is contained displays underneath the menu. Formerly, I could use wmode="transparent", but that no longer works. I have tried using the z-index, but this has also not worked.

    What you posted was (or was interpreted by this forum as) this:
    Here is the tag:<br />
    <img then src="then" ”URL”><br />
    <br />
    This tag does not work
    I don't know what format eBay accepts, but your image can be displayed in these forums using this code:
    <IMG SRC="/___sbsstatic___/migration-images/105/10540553-1.gif"/>
    ...which is displayed like this:
    And if you need to scale it down, try this:
    <IMG SRC="/___sbsstatic___/migration-images/105/10540553-1.gif" HR WIDTH="50%"/>
    eBay may also accept hyperlinked images using this format:
    <A HREF="http://www.bcoutback.com" target="_blank"><IMG SRC="/___sbsstatic___/migration-images/105/10540553-1.gif" title=" Will open in new window "/></A>
    By the way, if you need to post HTML in these forums such that it is not interpreted, "escape" the HTML first via this site:
    http://www.htmlescape.net/htmlescape_tool.html
    And before posting images or escaped HTML in these forums, switch from Compose mode to Preview mode to verify that it displays as intended.

  • HTML Tags in sql

    Hi guys
    Im executing the below sql with the html tags, as it gone mess and returning error like invalid table name.
    SELECT '<p style="font-family: Script MT Bold;color: #800080;font-size:35px;text-align: center;">' || 'Welcome' || PAPF.FULL_NAME|| '</p>
    <p style="font-family: Times New Roman;color: #000000;font-size:15px;text-align: center;"> The Employees Information of</p>
    <p style="font-family: Times New Roman;color: #000000;font-size:20px;text-align: center;"> <strong>Central Bank of Kuwait</strong>'
    FROM FROM PER_ALL_PEOPLE_f PAPF, FND_USER FU
    WHERE PAPF.PERSON_ID = FU.EMPLOYEE_ID AND
    TRUNC(SYSDATE) BETWEEN PAPF.EFFECTIVE_START_DATE AND
    PAPF.EFFECTIVE_END_DATE
    Im not sure whether i missed any || or quotes in html tags while using with the column name in the select statement
    Thanks in advance.
    Regards,
    Vel

    Don't do this via the SQL selection as hardcoded HTML??
    I think the following 2 approaches far superior.
    If you need the actual SQL projection to contain HTML, implement user functions for data transformation. E.g.
    SQL> create or replace function MakeDiv(
      2          divBody varchar2,
      3          fontColor varchar2 default 'black',
      4          tooltip varchar2 default null )
      5  return varchar2 is
      6          DIV_TEMPLATE constant varchar2(4000) :=
      7          '<div style="color:$COLOR" title="$TOOLTIP"> $BODY </div>';
      8 
      9          htmlDiv varchar2(4000);
    10  begin
    11          htmlDiv := replace( DIV_TEMPLATE, '$BODY', divBody );
    12          htmlDiv := replace( htmlDiv, '$COLOR', fontColor );
    13          htmlDiv := replace( htmlDiv, '$TOOLTIP', tooltip );
    14          return( htmlDiv );
    15  end;
    16  /
    Function created.
    SQL>
    SQL> select MakeDiv( ename, 'blue', 'Employee '||empno ) from emp;
    MAKEDIV(ENAME,'BLUE','EMPLOYEE'||EMPNO)
    <div style="color:blue" title="Employee 7369"> SMITH </div>
    <div style="color:blue" title="Employee 7499"> ALLEN </div>
    <div style="color:blue" title="Employee 7521"> WARD </div>
    <div style="color:blue" title="Employee 7566"> JONES </div>
    <div style="color:blue" title="Employee 7654"> MARTIN </div>
    <div style="color:blue" title="Employee 7698"> BLAKE </div>
    <div style="color:blue" title="Employee 7782"> CLARK </div>
    <div style="color:blue" title="Employee 7788"> SCOTT </div>
    <div style="color:blue" title="Employee 7839"> KING </div>
    <div style="color:blue" title="Employee 7844"> TURNER </div>
    <div style="color:blue" title="Employee 7876"> ADAMS </div>
    <div style="color:blue" title="Employee 7900"> JAMES </div>
    <div style="color:blue" title="Employee 7902"> FORD </div>
    <div style="color:blue" title="Employee 7934"> MILLER </div>
    14 rows selected.
    SQL>
    An even better approach is not to do this in the SQL projection - but instead have a PL/SQL framework that accepts a SQL, and turns the SQL projection of that SQL into a HTML report. Where this framework can use different templates to generate different types of HTML reports.
    This approach is very successfully used by Oracle Apex.

  • Can I get html tag info through actionscript?

    Hi,
    I want to know if there is any way to get the html tag info through actionscript when the swf file is played in a web browser?
    for example, if the html snippet is:
    <embed name="xx" id="99" type="application/x-shockwave-flash" src="my.swf" />
    can I get the 'name', 'id' attribute value of this html tag from my.swf through actionscript? or is there any other way? I know actionscript can call functions in the html page, but that can't help me get the 'name', 'id' attribute value of the html tag.
    Any suggestion is highly appreciated, thanks in advance.

    Great! thanks, I will take a try of this.
    Gorka Ludlow wrote:
    > Use FlashVars:
    >
    > <embed name="xx" id="99"
    type="application/x-shockwave-flash"
    > src="my.swf" FlashVars="id=99&name=xx" />
    >
    > This will make two variables available at the main
    timeline of your movie.
    >
    > Cheers,
    > Gorka
    > www.AquiGorka.com
    >
    >
