Regex: Extracting text between two HTML tags

Hello,
the common answer to this question would be
<tag>(.*)</tag>
or
<tag>[^<]*
But I have LFs (and whitespace) around the tags:
<html>
  <body>
    One text line to be retrieved.
  </body>
</html>
So I tried
(?s)<body>(.*)</body>
But that didn't help.
What's missing?

Thanks for your replies.
@Sabre
I tried your suggestion with the following code to no avail
import java.util.*;
import java.util.regex.*;
public class X {
  public static void main(String[] args) {
    String s =
        "<html>\n" +
        "  <head>\n" +
        "\n" +
        "  </head>\n" +
        "  <body>\n" +
        "    One <u>text line</u> to be retrieved.\n" +
        "  </body>\n" +
        "</html>";
    String regex = "<body>([^<]*)</body>";
    Pattern p = Pattern.compile(regex); // Create the pattern.
    Matcher matcher = p.matcher(s); // Create the matcher with the string.
    while (matcher.find()) {
      System.out.printf("Found: \"%s\" from %d to %d.%n",
          matcher.group(), matcher.start(), matcher.end() - 1);
    }
  }
}
@ejp
Paul had a similar objection. But as I've written, my HTML string will always have this same structure, and all I have to do is extract the text. So if regex doesn't work in that case, I'd rather use two indexOf calls than bother with a parser.
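
For the sample string above, [^<]* cannot succeed because the body contains a nested <u> element, so the character class stops at the first "<" and never reaches </body>. The following is a minimal sketch, assuming the fixed structure the poster describes (a single <body> start tag without attributes). The names BodyText and extractBody are made up for illustration. It relies on find() rather than matches(), on (?s) so that "." also matches line feeds, and on a reluctant .*? so the match ends at the first </body>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodyText {
    // (?s) lets '.' match line terminators; .*? stops at the first </body>.
    private static final Pattern BODY = Pattern.compile("(?s)<body>(.*?)</body>");

    /** Returns the trimmed content of the first body element, or null if none is found. */
    static String extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String s = "<html>\n  <head>\n\n  </head>\n  <body>\n"
                 + "    One <u>text line</u> to be retrieved.\n  </body>\n</html>";
        System.out.println(extractBody(s));
    }
}
```

The nested <u> tags stay in the result. If only the plain text is wanted, they would have to be removed in a second step.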

Similar Messages

  • How to get a text between two XML tags?

    Hello everybody!
    I've got a problem! How can I extract text that is between tags, like <myTag> My text </myTag>? I have no problem getting the attributes inside the tags, but I don't know how to get the text between tags. Here is my XML:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <tangram_request service_id="3">
    <send keep_session="nada">
    <source></source>
    <destination>3196931566</destination>
    <channel_id>2</channel_id>
    <text>Teste ServerSocket!!!</text>
    </send>
    </tangram_request>
    Now, there's a fragment of my code, which gets some tags' attributes:
    DOMParser parser = new DOMParser();
    InputSource resp = new InputSource(new StringReader(XML));
    parser.parse(resp);
    Document doc = parser.getDocument();
    Node node = (doc.getElementsByTagName("tangram_request")).item(0);
    if (node instanceof Element) {
      Element el = (Element) node;
      service_id = el.getAttribute("service_id");
      System.out.println("\n\nService_id="+service_id);
    } else { System.out.println("Erro"); }
    node = (doc.getElementsByTagName("send")).item(0);
    if (node instanceof Element) {
      Element el = (Element) node;
      keep_session = el.getAttribute("keep_session");
      System.out.println("keep_session="+keep_session);
    } else { System.out.println("Erro"); }
    Now, I want to get the text between <destination> ... </destination>
    How could I do that?
    Thanks a lot
    Calegari

    Thanks... It worked fine!!!
    Now how can I get lots of <destination> elements? I tried something that didn't work:
    node = (doc.getElementsByTagName("destination")).item(0);
    while(node.hasChildNodes())
    destination = node.removeChild(node.getFirstChild()).getNodeValue();
    System.out.println("destination="+destination);
    And now, my XML is like:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <tangram_request service_id="3">
    <send keep_session="nada">
    <source></source>
    <destination>3196931566</destination>
    <destination>3196931567</destination>
    <channel_id>2</channel_id>
    <text>Teste ServerSocket!!!</text>
    </send>
    </tangram_request>
    Thanks so much!
    Calegari
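
    One way to read every <destination> is to loop over the NodeList instead of taking only item(0). This is a hedged sketch, not the poster's code: it uses the standard JAXP DocumentBuilder rather than the Xerces DOMParser shown above, and the names Destinations and textOfAll are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class Destinations {
    /** Collects the text of every element with the given tag name, in document order. */
    static List<String> textOfAll(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() returns the concatenated text below the element.
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Wrapped so callers do not have to handle the parser's checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tangram_request service_id=\"3\"><send keep_session=\"nada\">"
                   + "<destination>3196931566</destination>"
                   + "<destination>3196931567</destination>"
                   + "</send></tangram_request>";
        System.out.println(textOfAll(xml, "destination"));
    }
}
```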

  • Extract text file with HTML tags from JTextPane

    hello world
    I have a big problem !
    I am creating an applet with a JTextPane ...
    so I can write text, (bold, italic etc), i can insert images.
    Now i want to create a text file with all the HTML tags
    corresponding to what I wrote in my JTextPane.
    I want to have and save the HTML file corresponding to what i wrote ...
    Is it possible ? Help me please ....
    Jeremie

    writing to a file from an applet is going to take a fair amount of work on your part.
    in order to write to a file from your applet, you have to use servlets or jsp to write to a file on your server. if you wish to write locally, look into signing your applet or policy settings of your browser.
    for writing to a file to the server, i suggest you look into servlets and tomcat to run the servlets.
    i just finished a project that used servlets and they take some time to figure out, but it's definitely worth your time.
    here are some websites...
    http://www.j-nine.com/pubs/applet2servlet/Applet2Servlet.html
    http://jakarta.apache.org
    other websites have tutorials that you can look at too
    Andy
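
    The reply above covers where the file can be written. Producing the HTML text itself can be sketched separately. This is a minimal example under assumptions: it passes a StyledDocument (in the applet this would come from textPane.getStyledDocument()) to HTMLEditorKit.write, and the names StyledToHtml, toHtml and boldText are made up for the illustration. The exact markup produced depends on the JDK's writer, so treat the output as approximate HTML rather than a fixed format.

```java
import java.io.IOException;
import java.io.StringWriter;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;
import javax.swing.text.html.HTMLEditorKit;

public class StyledToHtml {
    /** Serializes the whole styled document to an HTML string. */
    static String toHtml(StyledDocument doc) {
        StringWriter out = new StringWriter();
        try {
            new HTMLEditorKit().write(out, doc, 0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    /** Builds a small document containing the given text in bold (stand-in for a JTextPane's document). */
    static StyledDocument boldText(String text) {
        StyledDocument doc = new DefaultStyledDocument();
        SimpleAttributeSet bold = new SimpleAttributeSet();
        StyleConstants.setBold(bold, true);
        try {
            doc.insertString(0, text, bold);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(toHtml(boldText("hello world")));
    }
}
```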

  • Getting text between two tag

    Hello,
    what would be the best method to implement to get text between
    two tags,
    eg <TEST1> this is a test </TEST1>
    as in; this is a test,
    i try to use BreakIterator, but it skip the tags,
    when i do word by word loop

    this is one way:
            String string = "<TEXT>This is the middle text</TEXT>";
            String x = string.substring(string.indexOf("<TEXT>") + "<TEXT>".length(), string.indexOf("</TEXT>"));
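
    The one-liner above throws a StringIndexOutOfBoundsException when either tag is missing, because indexOf then returns -1. A small sketch of the same indexOf approach with those cases checked (the class and method names are invented for the example):

```java
public class Between {
    /** Returns the text between the first open marker and the next close marker, or null if either is missing. */
    static String between(String s, String open, String close) {
        int start = s.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        // Search for the closing marker only after the opening one.
        int end = s.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String string = "<TEXT>This is the middle text</TEXT>";
        System.out.println(between(string, "<TEXT>", "</TEXT>"));
    }
}
```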

  • Text Catalog showing HTML tags

    We are having an issue after applying Bundle #22 for HCM 8.9 where the calls to the Text Catalog are now showing HTML tags. Has anyone else seen this? I'm trying to figure out whether it's the bundle or maybe something in our customizations that caused this change. Basically, the page where the text from the Text Catalog displays now shows not only the text but also raw HTML tags. Example: BR, B
    Thanks!
    Edited by: CoryU on May 11, 2010 2:00 PM

  • Making a text area read HTML tags

    How can i make a html text area recognize html tags.

    According to http://www.oracle.com/webapps/online-help/reports/10.1.2/state/content/navId.3/navSetId._/vtTopicFile.htmlhelp_rwbuild_hs%7Crwwhthow%7Cwhatare%7Coutput%7Ca_inlinehtml%7Ehtm/
    Oracle Reports only supports a specific set of HTML tags. Have you already checked if they match? Maybe you have to use REPLACE to translate them into "proper" ones in your report SQL statement.
    Have you already looked directly at your data to check whether the HTML tags are encoded in some way, e.g. &lt; for <?
    Patrick
    My APEX Blog: http://inside-apex.blogspot.com
    The ApexLib Framework: http://apexlib.sourceforge.net
    The APEX Builder Plugin: http://sourceforge.net/projects/apexplugin/

  • Can you text between two iPod touch iPods and is it free?

    Can you text between two iPod touches and is it free?  Is it free to text from Canada to US?

    Yes and yes. Both iPods need to have iOS 5 installed.

  • How to Extract TEXT ONLY from HTML ?

    I am developing speech-enabled browser and what I would like to do is to read aloud all the texts within webpage. My problem is how can I get only the text, not HTML tags, from the webpage. Similar question has been asked before in this forum but none of the given suggestions seem to work. Any help would greatly be appreciated.
    Is there anyone out there who is also using speech package? Is there any forum for java speech package?

    don't know about the speech part, but the text parsing
    is pretty simple, if you just want the text. You just
    take the string and run thru it char by char and
    remove the stuff between the < and > chars.

    Also you'd have to unescape anything that was escaped for HTML, such as &amp; should be replaced by & and &eacute; should be replaced by é and so on.
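
    The two suggestions above can be sketched roughly as follows. This is a simplified illustration, not a full HTML-to-text converter: it assumes no ">" inside attribute values, ignores comments and script blocks, and unescapes only a handful of entities. StripTags, stripTags and unescape are names made up for the example.

```java
public class StripTags {
    /** Drops everything between '<' and '>' and then unescapes a few common entities. */
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
            }
        }
        return unescape(out.toString());
    }

    /** Replaces a small, fixed set of entities; &amp; goes last so "&amp;lt;" becomes "&lt;", not "<". */
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&eacute;", "\u00e9")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Caf&eacute; &amp; <b>bar</b></p>"));
    }
}
```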

  • Is there a way to delete text between two strings?

    In Pages, is there a way to delete all text contained between two strings?
    For example, if I have a text document that looks like this:
    String 1
    String 2
    Unwanted text
    Unwanted text
    String 3
    String 4
    Is there a way to delete the unwanted text between string 2 and 3 so it looks like this:
    String 1
    String 2
    String 3
    String 4
    The unwanted text is different between documents but string 2 and 3 are constant. I want to do this via Automator for the same strings on multiple documents.
    Any help is appreciated!

    Do you mean Pages '09 v4.3?
    There were some links here:
    https://discussions.apple.com/message/24051199#24051199
    Peter

  • Getting text between two special characters as a new line

    Hi all ,
    I hope someone can point me in the right direction or tell me whether this is possible, or whether there is a function that can do this in T-SQL. I have a table with two fields, ORDERNUM and NARRATIVE. The values in the NARRATIVE field look something like this (they come from a flat file and are delimited by a "^"). For example:
    ORDERNUM NARRATIVE
    1234           ^Parcel shipped^picked^entry passed then returned back^white ford truck number 78455333^freight charges entered^parcel weight entered^parcel supervised^ticketsscannned^broken glass on floor^
    What I want is to separate the text between the "^" characters onto new lines, like
    1234 parcel shipped
    1234 picked
    1234 entry passed then returned back
    1234 white ford truck number 78455333
    Can anyone please help me, is there a way or can any one show me what function does this? SUBSTRING for sure is not an answer as there is no fixed length for this.
    Thanks
    SV

    CASE_NUMBER NARRATIVE
    000000GA ^000000G-A CHIEF OF POLICE 02-02-95 PGE 1^***BATCH RUN COPY***^INCIDENT: MOLEST OTHER^LOCATION: 00416 N COLORADO AV OTHER:^EI20^ATTENTION:^ SEX OFFENSE^NOTIFIED: 02-01-95 2300 HRS BY: OTHER INVEST 02-01-95 2305 HRS^ARRESTS: 00 INJURED: 00 DEAD: 00 VEH TOWED: 00 BEAT: B51^^PERSON 01^VICTIM-PERSON^STEVE WOOLBRIGHT NH/W/M/14/08-31-80^SUBPEONA: N^ADDRESS: 00416 N COLORADO AV^HOME-PHONE: 322-9249^^^NARRATIVE:^REPORT MADE FROM STATE 310 REPORT STATING ABOVE-LISTED SUBJECT^HAS BEEN SEXUALLY MOLESTED. INVESTIGATION REQUESTED.^^4725 POPCHEFF A P4687^^02-01-95 2025 HRS - WILDER EULA W3163 TAPE:02^^ 95.033 00:23-END OF REPORT-95.033 00:26^^^
    000718GB ^000718G-B ADDITION CHIEF OF POLICE 08-08-95 PGE 1^***BATCH RUN COPY***^INCIDENT: RECOVERED STOL. VEH PUBLIC STREET-ALLEY^LOCATION: PLAINFIELD,IN OTHER:^OCCURED: 04-01-93 TIME UNKNOWN EO11^ATTENTION:^ AUTO DESK AUTO THEFT WEST DISTRICT^NOTIFIED: 08-07-95 1600 HRS BY: OTHER INVEST 08-07-95 1600 HRS^ARRESTS: 00 INJURED: 00 DEAD: 00 VEH TOWED: 00 BEAT: OJ^^PERSON 01^OWNER/OPERATOR^DAVID FURMAN NH/W/M/ 08-20-57^DRIVERS LICENSE/SSN: 310702334^ADDRESS: 02495 AVON RD PLAINFIELD IN^HOME-PHONE: 838-0467^^^VEHICLE 01^PICKUP^BLACK 91 FORD F350 2DR^VIN: 2FTJW35G6MCA99397^DISPOSITION: RECOVERED^COMMENTS: OWNER RECOVERED VEHICLE IN 1993^^NARRATIVE:^ON 08/07/95, I RECEIVED A PHONE CALL FROM JAMES BEARD OF THE^^ -CONTINUED-^
    000718GB ^000718G-B ADDITION CHIEF OF POLICE 08-08-95 PGE 2^^NATIONAL INSURANCE CRIME BUREAU WHO STATED HE WAS DOING A^CHECK OF HIS RECORDS AND CAME ACROSS A VEHICLE THAT WAS^STOLEN ON 04/24/92, UNDER IPD CASE #00718GA. HE STATED HE HAD^SPOKEN TO THE INSURANCE COMPANY ON THIS CASE AND THEY STATED^THEY HAD PAID OFF ON THE VEHICLE AND SOLD THE VEHICLE BACK TO^THE OWNER. APPARENTLY THE PERSON REPORTING THE EVENT, DAVID^FURMAN, REPORTED HIS VEHICLE STOLEN ON 04/24/92, APPROXIMATELY^1 YEAR LATER HE WAS CONTACTED BY AN UNIDENTIFIED PERSON AND^TOLD HIM TO MEET HIM ON A CORNER IN THE INDIANAPOLIS AREA AND TO^GIVE THE MAN $5,000.00 AND HE WOULD GIVE HIM HIS TRUCK BACK.^MR. FURMAN HAD ALREADY BEEN PAID BY THE INSURANCE COMPANY^APPROXIMATELY $31,000.00 IN RESTITUTION FOR HIS VEHICLE.^MR. FURMAN THEN WENT TO THIS CORNER LOT AND MET WITH THIS^UNIDENTIFIED W/M AND PAID HIM $5,000 CASH. THE MAN THEN TURNED^THE 1991 PICKUP TRUCK OVER TO MR. FURMAN. MR. FURMAN THEN^SETTLED WITH THE INSURANCE COMPANY AND PAID THEM $19,000.00 AND^BOUGHT THE VEHICLE BACK FROM THE INSURANCE COMPANY AND IT IS^NOW TITLED TO HIM AND REGISTERED TO HIM. HOWEVER, AT NO^POINT DID MR. FURMAN OR THE INSURANCE COMPANY CONTACT THE^POLICE DEPARTMENT TO TAKE THE VEHICLE OUT OF THE SYSTEM AS^BEING STOLEN. THAT IS THE REASON FOR THE REPORT. THE VEHICLE^SHOULD BE REPORTED AS A RECOVERED AND RELEASED TO OWNER. DAMAGE^UNKNOWN.^^^^ -CONTINUED-^
    select TOP 214
    CASE_NUMBER,
    splitdata
    from
    (
    SELECT F1.CASE_NUMBER,
    O.splitdata
    FROM
    (
    SELECT *,
    cast('<X>'+replace(F.NARRATIVE,'^','</X><X>')+'</X>' as XML) as xmlfilter from [dbo].[CECASRPT] F
    )F1
    CROSS APPLY
    (
    SELECT fdata.D.value('.','varchar(5000)') as splitdata
    FROM f1.xmlfilter.nodes('X') as fdata(D)) O
    )t
    where t.splitdata <> ''
    SV

  • Text Search skiping HTML tags

    I have a table containing clob column.
    select code, details from search order by code;
    CODE DETAILS
    4 just a <b>test </b>insert
    5 just a <b>test</b> insert
    9 <HTML>just a <i>test</i> insert</HTML>
    10 checking test insert
    I have created a context index and add html tags in the stop list.
    exec ctx_ddl.create_stoplist('mystop', 'BASIC_STOPLIST');
    exec ctx_ddl.add_stopword('mystop', '<b>');
    exec ctx_ddl.add_stopword('mystop', '</b>');
    CREATE INDEX searchi ON search(details)
    INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS
    ('FILTER CTXSYS.AUTO_FILTER SECTION GROUP CTXSYS.AUTO_SECTION_GROUP STOPLIST MYSTOP');
    But when I search 'test insert' it only shows the following rows
    SQL> SELECT score(1), code, details FROM search WHERE CONTAINS(details, 'test insert', 1) > 0 ORDER BY score(1);
    SCORE(1) CODE DETAILS
    5 10 checking test insert
    5 9 <HTML>just a <i>test</i> insert</HTML>
    I would like to define a text index that skips the HTML tags and returns all the rows that contain the search phrase.

    Since you did not use code tags in your post, most of your html does not show, so it is difficult to tell what html is in your data or what values you set for your stopwords. One problem with stopwords is that, although the word is not indexed, it still expects some word where the stopword was, so searching for "word1 word2" will not find "word1 removed_stopword word2". How about using a procedure_filter as demonstrated below? I only removed a few tags, so you would need to either expand it to include others or search for starting and ending tags and remove what is in between.
    SCOTT@orcl_11g> CREATE TABLE search
      2    (code      NUMBER,
      3       details  CLOB)
      4  /
    Table created.
    SCOTT@orcl_11g> INSERT ALL
      2  INTO search VALUES (4, 'just a <b>test</b> insert')
      3  INTO search VALUES (5, 'just a <i>test</i> insert')
      4  INTO search VALUES (9, '<HTML>just a test insert</HTML>')
      5  INTO search VALUES (10, 'checking test insert')
      6  SELECT * FROM DUAL
      7  /
    4 rows created.
    SCOTT@orcl_11g> CREATE OR REPLACE PROCEDURE myproc
      2    (p_rowid    IN ROWID,
      3       p_in_clob  IN CLOB,
      4       p_out_clob IN OUT NOCOPY CLOB)
      5  AS
      6  BEGIN
      7    p_out_clob := REPLACE (p_in_clob, '<html>', '');
      8    p_out_clob := REPLACE (p_out_clob, '</html>', '');
      9    p_out_clob := REPLACE (p_out_clob, '<HTML>', '');
    10    p_out_clob := REPLACE (p_out_clob, '</HTML>', '');
    11    p_out_clob := REPLACE (p_out_clob, '<b>', '');
    12    p_out_clob := REPLACE (p_out_clob, '</b>', '');
    13    p_out_clob := REPLACE (p_out_clob, '<B>', '');
    14    p_out_clob := REPLACE (p_out_clob, '</B>', '');
    15    p_out_clob := REPLACE (p_out_clob, '<i>', '');
    16    p_out_clob := REPLACE (p_out_clob, '</i>', '');
    17    p_out_clob := REPLACE (p_out_clob, '<I>', '');
    18    p_out_clob := REPLACE (p_out_clob, '</I>', '');
    19  END myproc;
    20  /
    Procedure created.
    SCOTT@orcl_11g> SHOW ERRORS
    No errors.
    SCOTT@orcl_11g> BEGIN
      2    CTX_DDL.CREATE_PREFERENCE ('myfilter', 'PROCEDURE_FILTER');
      3    CTX_DDL.SET_ATTRIBUTE ('myfilter', 'PROCEDURE', 'myproc');
      4    CTX_DDL.SET_ATTRIBUTE ('myfilter', 'ROWID_PARAMETER', 'TRUE');
      5    CTX_DDL.SET_ATTRIBUTE ('myfilter', 'INPUT_TYPE', 'CLOB');
      6    CTX_DDL.SET_ATTRIBUTE ('myfilter', 'OUTPUT_TYPE', 'CLOB');
      7  END;
      8  /
    PL/SQL procedure successfully completed.
    SCOTT@orcl_11g> CREATE INDEX searchi
      2  ON search (details)
      3  INDEXTYPE IS CTXSYS.CONTEXT
      4  PARAMETERS ('FILTER myfilter')
      5  /
    Index created.
    SCOTT@orcl_11g> SELECT token_text FROM dr$searchi$i
      2  /
    TOKEN_TEXT
    CHECKING
    INSERT
    TEST
    3 rows selected.
    SCOTT@orcl_11g> COLUMN details FORMAT A35
    SCOTT@orcl_11g> SELECT score (1), code, details
      2  FROM   search
      3  WHERE  CONTAINS (details, 'test insert', 1) > 0
      4  ORDER  BY score (1)
      5  /
      SCORE(1)       CODE DETAILS
             3          4 just a <b>test</b> insert
             3          5 just a <i>test</i> insert
             3          9 <HTML>just a test insert</HTML>
             3         10 checking test insert
    4 rows selected.
    SCOTT@orcl_11g>

  • I require anyones help to extract string from an html tag

    I wanted to extract the strings available in the tag named html
    eg.. <html> Hello </html>
    I require to extract the string Hello into a string variable without using if's
    Secondly,
    i require to extract the name of form in the html tag ..
    eg.. <form name= 'basic'> </form>
    Can anyone tell me how this can be implemented efficiently, and which concept would make it easy to implement?
    Thanking you,
    bye..
    ur's True friend

    Use a ParserCallback. It notifies you every time it finds a new tag:
    http://forum.java.sun.com/thread.jspa?forumID=31&threadID=623365
    Or use the HTMLEditorKit to parse the HTML. Then use an iterator to get the tag you need. See reply 5:
    http://forum.java.sun.com/thread.jspa?forumID=31&threadID=655038
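
    As a rough illustration of the ParserCallback idea, the sketch below collects the document text and the name attribute of a <form> tag. It assumes the Swing HTML parser (ParserDelegator) is acceptable for the input. The class name HtmlProbe and its fields are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlProbe extends HTMLEditorKit.ParserCallback {
    final StringBuilder text = new StringBuilder();
    String formName; // stays null if no named form is seen

    @Override
    public void handleText(char[] data, int pos) {
        // Called once per run of text; separate runs with a space.
        text.append(data).append(' ');
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.FORM) {
            Object name = attrs.getAttribute(HTML.Attribute.NAME);
            if (name != null) {
                formName = name.toString();
            }
        }
    }

    /** Parses the given HTML and returns the callback holding what was found. */
    static HtmlProbe parse(String html) {
        HtmlProbe probe = new HtmlProbe();
        try {
            new ParserDelegator().parse(new StringReader(html), probe, true);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return probe;
    }

    public static void main(String[] args) {
        HtmlProbe p = parse("<html><body>Hello<form name='basic'></form></body></html>");
        System.out.println(p.text.toString().trim());
        System.out.println(p.formName);
    }
}
```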

  • Want text input containing HTML tags to appear as HTML in output format

    Hi,
    We have a table in an Oracle database that has a column named detail. One of its values is like this: <bold><italics>Good Morning</italics></bold>. What our client wants is that the output format should show: <b><i>Good Morning</b></i>. That is, BI Publisher should be able to parse the HTML tags and provide the desired output. Please tell me how to achieve this. Any help is much appreciated.
    Thanks and regards,
    Debarati

    Hi,
    have a look here (http://blogs.oracle.com/xmlpublisher/2007/01/formatting_html_with_templates.html) to get an idea.
    regards
    Rainer

  • Vertical text. Can FireFox display a vertical text? Which HTML tags should be used for that?

    The site http://habrahabr.ru/blogs/css/58732/ recommends a way of using vertical text in HTML/CSS (see example below). However, my instance of Firefox 3.6.22 does not display this vertical text. Please let me know your recommendations.
    <html>
    <head>
    <title>1</title>
    <style>
    <!--
    .vertical { overflow:hidden; line-height:30px; position:relative; white-space:nowrap; width:30px; height:200px; border:1px solid #999; }
    -->
    </style>
    </head>
    <body>
    <div class="vertical">Testing text</div>
    </body>
    </html>

    See: https://developer.mozilla.org/En/CSS/-moz-transform

  • Extract substrings between two words

    public class test
    {
      public static void testing()
      {
         String text = "<div>teext</div>sdfaf<div id='ii'>wwwwwww</div><div>nice</div>";
         String left = "<div>";
         String right = "</div>";
         int start = text.indexOf(left);
         int end = text.indexOf(right);
         int linkNum = 0;
         String[] parts = new String[50];
         while ((start = text.indexOf(left, start)) != -1) {
              start = text.indexOf(left, start);
              end = text.indexOf(right, end);
              parts[linkNum] = left + text.substring(start+1, text.indexOf(left, start+1));
              linkNum++;
              start = (end + 1);
         }
         for (int i=0; i<parts.length; i++)     {
              System.out.println("find: " + parts[i]);
         }
      }
      public static void main(String[] args) throws Exception {
         testing();
      }
    }
    must output
    find: teext
    find: nice
    but i have the following error
    Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -42
            at java.lang.String.substring(String.java:1444)
            at test.testing(test.java:15)
            at test.main(test.java:24)
    Press any key to continue...
    Can you please tell me what the problem is? Tnx
    Edited by: beginner1983 on Jun 8, 2009 7:14 AM

    String regex = "(?s)(?<=<div>).*?(?=</div>)";
    That will work in this case because the start tags don't contain any attributes. If there may be attributes, and you don't know or care which attributes and values will be present, you would use something like this for the start tag:
    "<div[^<>]*>"
    But you can't use that in a lookbehind, because there's no obvious upper limit to the number of characters it can match. (That requirement is peculiar to Java, by the way; among other regex flavors, some permit only fixed-length matches in lookbehinds, while others place no restrictions on them at all.)
    I'm with da.futt on this point: lookarounds should never be your first resort. Lookbehinds, especially, are much trickier than most people expect on first encountering them. You tend to find yourself going through ridiculous contortions trying to fit them into the overall regex. Lookaheads are much easier to work with, but you have to be careful to keep them in sync with your expectations; they can be slippery.
    In this case, your first impulse should have been to do a straight match of the whole element, with a capturing group to extract the element's content:
    String regex = "(?s)<div>(.*?)</div>";
    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
        System.out.println(m.group(1));
    }
    By the way, my copy of the [Regular Expressions Cookbook|http://www.amazon.com/gp/product/0596520689?ie=UTF8&tag=jgsbookselection&linkCode=as2&camp=1789&creative=390957&creativeASIN=0596520689] arrived the other day, and it did not disappoint. Good, solid stuff there.
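
    To make the difference between the two start-tag patterns concrete, here is a small sketch that runs both against the string from the question. The class name DivContents and the allowAttributes switch are invented for the illustration. Note that <div[^<>]*> is a loose pattern: it would also accept a tag such as <divider>, and neither variant copes with nested div elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivContents {
    /** Returns the content of each div; with allowAttributes, start tags may carry attributes. */
    static List<String> contents(String text, boolean allowAttributes) {
        String open = allowAttributes ? "<div[^<>]*>" : "<div>";
        // Capturing group 1 holds the element content; (?s) lets '.' span lines.
        Matcher m = Pattern.compile("(?s)" + open + "(.*?)</div>").matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "<div>teext</div>sdfaf<div id='ii'>wwwwwww</div><div>nice</div>";
        System.out.println(contents(text, false)); // plain <div> only
        System.out.println(contents(text, true));  // start tags with attributes too
    }
}
```

    With the plain <div> pattern the result is teext and nice, which is what the question asked for. The attribute-tolerant pattern additionally returns wwwwwww.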
