HTML tag parsing

I have a JSP page with a form in it. The form has a text area where the
user can enter HTML tags as well as regular text,
for example:
testing the tag <b> Hello </b>
I want to write a servlet that reads the form's input and makes sure
that all the HTML tags are properly closed. It also has to deal with <a href = ....> tags. I tried StringTokenizer, but it has several limitations. Any ideas or clues would be welcome.
Thanks.

You could try my HTML parser which I made available on these forums a while ago:
http://www.renegadeinternet.com/temp/htmlparser.zip
It can detect syntax errors such as tags not being opened or closed properly. However, it won't clean them up for you. If you want them cleaned up automatically, consider using JTidy instead:
http://sourceforge.net/projects/jtidy/
That said, the source code, compiled classes, and javadoc are all included in that htmlparser.zip file. A few quick examples are provided to give you a jump start. One of the examples goes through a document and looks for <A> tags and prints out the value of their href attribute. If you're checking for links, that may be a good base to start with.
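For the original question, a check that every tag is opened and closed in order can be sketched with a stack. This is only a minimal sketch, not the code from htmlparser.zip or JTidy: the class name TagBalanceChecker, the regular expression, and the short list of void tags are assumptions made for illustration, and attribute values that contain '>' are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reports tags in a fragment that are not opened or closed properly. */
final class TagBalanceChecker {

    // Matches <name ...>, </name> and <name ... />.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>");

    // Elements that never take a closing tag (assumed, incomplete list).
    private static final Set<String> VOID_TAGS =
            Set.of("br", "hr", "img", "input", "meta", "link");

    static List<String> check(String html) {
        List<String> problems = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open tags
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosed = m.group(3).trim().endsWith("/");
            if (VOID_TAGS.contains(name) || selfClosed) {
                continue; // nothing to balance
            }
            if (!closing) {
                open.push(name);
            } else if (open.isEmpty()) {
                problems.add("</" + name + "> has no opening tag");
            } else if (open.peek().equals(name)) {
                open.pop();
            } else {
                problems.add("expected </" + open.peek() + "> but found </" + name + ">");
            }
        }
        for (String name : open) {
            problems.add("<" + name + "> is never closed");
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(check("testing the tag <b> Hello </b>"));
        System.out.println(check("<a href = \"x.html\">link <b>bold</a>"));
    }
}
```

In a servlet, the text area's value would be passed to check(...) and the input rejected (or handed to a cleaner) when the returned list is not empty.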

Similar Messages

  • Replacing html tags in a htmldocument

    Hi Java Gurus
    I have an HTML document which has the background set to black and the foreground to white.
    When I print the document, the text comes out white (in other words, invisible).
    I think the foreground color of the document needs to be changed.
    How can I do this?
    Thanks in advance
    Naveen

    rest_in_peace wrote:
    You'll get a faster, more effective response to your questions by including as much relevant information as possible upfront. This should include:
    - Full APEX version
    - Full DB version and edition
    - Web server architecture (EPG, OHS or APEX listener)
    - Browser(s) and version(s) used
    - Theme
    - Template(s)
    - Region/item type(s)
    Read the FAQ and forum sticky threads for more information on using the forum effectively.
    With APEX we're fortunate to have a great resource in apex.oracle.com where we can reproduce and share problems. Reproducing things there is the best way to troubleshoot most issues.
    "I have created a form on a table. Now one of the fields in the form is a display only field and its datatype is varchar2."
    There are a number of different ways of "creating a form on a table" and of making "display only fields". Describe exactly what you mean using actual APEX terminology of regions, items, and their attributes.
    "The data in the database column of the field contains html tags which are displayed as they are, without being parsed. It is not possible for me to edit the data in the table itself since there are thousands of rows. So I need a way to display the data with the html tags parsed and not shown literally. Any help would be greatly appreciated."
    How this is achieved is version-dependent: provide the full APEX version.

  • Problem in parsing HTML tag

    Hello,
    I want to parse the text in a div, like: <div id="title">Action Result</div>
    My code is:
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
      if (t == HTML.Tag.DIV) {
        String page_title = (String) a.getAttribute(HTML.Attribute.ID);
        if (page_title != null) {
          System.out.println("Title : " + page_title);
        }
      }
    }
    public static void main(String argv[]) {
      try {
        Reader r = new FileReader("C://test1.html");
        ParserDelegator parser = new ParserDelegator();
        HTMLEditorKit.ParserCallback callback = new ParseTest();
        parser.parse(r, callback, false);
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
    But it does not work. Please advise how to do this.
    Thanks in advance.

    I also want to extract the text inside the tag, for example:
    <div id="title">Action Result</div>
    I want "Action Result" as my program's output.
    Please help me solve this problem.
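    One likely reason the callback never fires: in the Swing parser, handleSimpleTag is only called for empty tags such as <br> or <meta>. A <div> arrives through handleStartTag, its content through handleText, and its end through handleEndTag. A minimal sketch under those assumptions follows; the class name DivTextExtractor is made up for illustration, and it reads from a string rather than from C://test1.html.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical callback: collects the text inside the <div> that has a given id. */
final class DivTextExtractor extends HTMLEditorKit.ParserCallback {
    private final String wantedId;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // greater than 0 while inside the wanted div

    DivTextExtractor(String wantedId) {
        this.wantedId = wantedId;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t != HTML.Tag.DIV) {
            return;
        }
        if (depth > 0) {
            depth++; // nested div inside the wanted one
        } else if (wantedId.equals(a.getAttribute(HTML.Attribute.ID))) {
            depth = 1;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.DIV && depth > 0) {
            depth--;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (depth > 0) {
            text.append(data);
        }
    }

    static String extract(String html, String id) {
        DivTextExtractor callback = new DivTextExtractor(id);
        try (Reader r = new StringReader(html)) {
            new ParserDelegator().parse(r, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><div id=\"title\">Action Result</div></body></html>";
        System.out.println(extract(html, "title"));
    }
}
```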

  • Parse out the contents of meta tag using HTML.Tag

    I need help with using the HTML.Tag class. I don't even know where to start...
    I want to write a method that takes a very long string and the NAME of a meta tag, and returns the contents of that meta tag. Any help would be great.
    Sorry, I do not have much of a code base to start with. I am just guessing at how to get this to work:
    private String getMetatag(String content,String Metaname)
    String Metacontents;
    Object HTML.Tag.META;
    Object HTML.Attribute.NAME.Metaname;
         Object HTML.Attribute.CONTENT;
    return Metacontents;
    }

    One way to get started is to extend the class
    HTMLEditorKit.ParserCallback. Let's say the subclass is called class A.
    Override the method
    handleSimpleTag(HTML.Tag t, MutableAttributeSet attribute, int pos) { }
    Roughly, the implementation of that method looks like this:
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet attribute, int pos) {
      if (t.equals(HTML.Tag.META)) {
        // your procedure: what to do when a META tag is encountered
        // String str = (String) attribute.getAttribute(HTML.Attribute.NAME);
        // System.out.println(str);
      }
    }
    You would still need to find some detailed examples of how to use class A.
    Roughly, it is:
    parser.parse(reader, an_instance_of_class_A, true); // called from an outer class, such as a main class
    parser is the object returned by the getParser method. This method needs to be overridden as well.
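    Putting the pieces of this reply together, the getMetatag method from the question could be sketched as below. This is an illustration only: the class name MetaTagReader is made up, and it uses ParserDelegator directly instead of overriding getParser, which is a substitution on my part rather than what the reply describes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: returns the content attribute of the first meta tag with a given name. */
final class MetaTagReader {

    static String getMetaContent(String html, String metaName) {
        final String[] result = {null}; // holder the anonymous class can write to
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // <meta> is an empty tag, so it is reported here, not in handleStartTag
                if (t != HTML.Tag.META || result[0] != null) {
                    return;
                }
                Object name = a.getAttribute(HTML.Attribute.NAME);
                Object content = a.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null
                        && metaName.equalsIgnoreCase(name.toString())) {
                    result[0] = content.toString();
                }
            }
        };
        try (Reader r = new StringReader(html)) {
            // true = ignore any charset declared in the document itself
            new ParserDelegator().parse(r, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result[0]; // null when no matching meta tag was found
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"description\" content=\"Parsing demo\">"
                + "</head><body></body></html>";
        System.out.println(getMetaContent(html, "description"));
    }
}
```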

  • [svn:fx-trunk] 5289: Fix for - HTML tags in span tags in ASdoc comments not being parsed correctly.

    Revision: 5289
    Author: [email protected]
    Date: 2009-03-12 21:09:58 -0700 (Thu, 12 Mar 2009)
    Log Message:
    Fix for - HTML tags in <span> tags in ASdoc comments not being parsed correctly.
    QE Notes: Some baseline will require update.
    Doc Notes: None.
    Bugs: SDK-19815
    tests: checkintests, asdoc
    Ticket Links:
    http://bugs.adobe.com/jira/browse/SDK-19815
    Modified Paths:
    flex/sdk/trunk/modules/compiler/src/java/flex2/compiler/asdoc/AsDocUtil.java

    Resize/re-scale & optimize all images for the web in your graphics editor before you insert them into your web pages.  Saves bandwidth and reduces page load.
    Cycle2 is a responsive slideshow.  If you want all images to remain 400px and not responsive to layout,  you'll need to modify the CSS code a little.
    Details on using Previous & Next links are in the documentation.
    http://jquery.malsup.com/cycle2/demo/prevnext.php
    Nancy O.

  • Html tags not parsing in spry dataset

    Hello you all,
    I have a little master-detail page setup which works perfectly, except that the data pulled from a database and loaded into a Spry dataset is displayed without its HTML tags being parsed. I am seeing things like <p></p><br/> etc. on the page. Can anyone help with this?
    Message was edited by: jahflasher

    Set the column type to HTML for the affected column:
    http://labs.adobe.com/technologies/spry/articles/data_api/apis/dataset.html#setcolumntype

  • Define HTML Tag for Parser - Help?

    Hi all,
    I'm trying to write a program which downloads a HTML script, parses it, extracts the links and checks to see which of these links are broken. While the parser is picking up tags that are well-formed, such as:
    Mark Humphrys -
    Research -
    The HTML script has a few malformed HTML tags such as the following:
    <li><b> References </b>
    <li><b> References </b>
    The snippet of code I'm using to try and get these malformed tags is as follows:
         ParserCallback parserCallback = new ParserCallback() {
           public void handleText(final char[] data, final int pos) { }
           Tag a = HTML.Tag("a");
           public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
             if (tag == a) {
               String address = (String) attribute.getAttribute("href");
               list.add(address);
               System.out.println(address);
             }
           }
           public void handleEndTag(Tag t, final int pos) { }
           public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
           public void handleComment(final char[] data, final int pos) { }
           public void handleError(final java.lang.String errMsg, final int pos) { }
         };
    but I keep getting the error that the compiler can't find the Tag() method. At the start of my code I have:
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTML.Tag;
    so I don't understand why the compiler can't find the method. Is there something wrong with the way I'm using it?
    I have very little experience with this area so any help or pointers would be great!

    Sorry, the exact error message is:
    cannot find symbol,
    symbol: constructor Tag(java.lang.String)
    location: class javax.swing.text.html.HTML.Tag
    HTML.Tag a = new HTML.Tag("a");
    ^
    It should of course be a constructor, not a method, but the compiler still can't seem to find it. The proper code (as far as I can tell, although it still isn't working) is:
         ParserCallback parserCallback = new ParserCallback() {
           public void handleText(final char[] data, final int pos) { }
           HTML.Tag a = new HTML.Tag("a");
           public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
             if (tag == a) {
               String address = (String) attribute.getAttribute("href");
               list.add(address);
               System.out.println(address);
             }
           }
           public void handleEndTag(Tag t, final int pos) { }
           public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
           public void handleComment(final char[] data, final int pos) { }
           public void handleError(final java.lang.String errMsg, final int pos) { }
         };
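    As far as I can tell, HTML.Tag is not meant to be constructed from a string by calling code; the class ships predefined constants such as HTML.Tag.A that can be compared directly. The parser also stores known attributes under HTML.Attribute keys, so a lookup with the plain string "href" is likely to return null, and HTML.Attribute.HREF is the safer key. A minimal sketch under those assumptions (the class name LinkExtractor and the sample markup are made up for illustration):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: collects the href value of every <a> tag in a page. */
final class LinkExtractor {

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        ParserCallback callback = new ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) { // predefined constant, no constructor call
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try (Reader r = new StringReader(html)) {
            new ParserDelegator().parse(r, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // The <li> elements are deliberately left unclosed, as in the question.
        String html = "<html><body><ul>"
                + "<li><b><a href=\"refs.html\">References</a></b>"
                + "<li><a href=\"research.html\">Research</a>"
                + "</ul></body></html>";
        System.out.println(extractLinks(html));
    }
}
```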

  • Remove HTML Tags and parse the text out of it

    Hi All -
    I have a text file full of HTML tags, and I want to extract the text from it. Is there any package available that removes all the HTML tags from the text?
    For example:
    <HTML><BODY bgColor=#ffffff> This is the text i want to parse.</BODY></HTML>
    The result would be: This is the text I want to parse.
    The text can be very long and can have many different HTML tags. I cannot use REPLACE because there can be many more tags than I anticipated.
    Please respond as soon as possible. Thanks for all your help!
    Anuj Sharma

    Thank you all, but my content is plain HTML, not XML, and it is another application that saves it in the table:
    <html><head><title>Aprovação de ARC</title></head><body><font face=arial size=2><b>974-17016/ugadiego-2013</b></font><br><br><table border=0><tr><td><b><font face=arial size=1>Data da Abertura</font></b></td>    <td><font face=arial size=1>8/3/2013</font></td><tr><td><b><font face=arial size=1>Quebra Produtividade</font></b></td>    <td><font face=arial size=1>Sim</font></td><tr><td><b><font face=arial size=1>Quantidade</font></b></td>    <td><font face=arial size=1>17,5</font></td><tr><td><b><font face=arial size=1>Valor</font></b></td>    <td><font face=arial size=1>R$ 17496</font></td><tr><td><b><font face=arial size=1>Forma de Indenização</font></b></td>    <td><font face=arial size=1>Nota de Crédito</font></td><tr><td><b><font face=arial size=1>Observação</font></b></td>    <td><font face=arial size=1>Evidenciado a não conformidade do produto em visita a cliente pela assessoria agronômica e qualidade.
    Produto apresenta-se empedrado com desuniformidade de grânulos e por consequência geração de finos e falha de óleo.
    Produto expedido com GDAP.
    Bonificar o cliente em 10% do valor da compra = R$ 17.496,00 ou em toneladas e fertilizantes  que podem ficar em forma de crédito para o cliente retirar em fertilizante para o plantio  da soja. Conforme relatório do Sr. Ademilson Palharin em anexo.</font></td><tr><td><b><font face=arial size=1>Centro de Custo</font></b></td>    <td><font face=arial size=1>CAS1I4671 - MISTURA E ENSAQUE I                     </font></td></table><hr><font face=arial size=2><b>Favor incluir uma Observação (Se necessário) e selecionar o botão desejado para aprovar ou reprovar essa Indenização.</b></font><FORM ACTION='http://10.176.10.123/pgAprovaARCServidor.asp' METHOD='GET' ><font face=arial size=2><div>Observações:</div><textarea name='txtObs' rows='4' cols='60' maxlength='4000'></textarea><br><br><div><input type='submit' value='Aprovar'  name='acao'> <input type='submit' value='Reprovar' name='acao'></div></font><br><hr><font face=arial size=2 >Essa é uma mensagem automática.<br>Favor não responder esse email</font><hr><input type='hidden' name='cdARC' value='17016' ><input type='hidden' name='cdSeq' value='1' ><input type='hidden' name='cdFase' value='Indenizacao' ><input type='hidden' name='dsResp' value='ustrenat' ><input type='hidden' name='dsCargo' value='Vice Presidência' ><input type='hidden' name='dsSolic' value='LESIANE CIESLAK' ><input type='hidden' name='index' value='3' ><input type='hidden' name='rowatu' value='3' ></FORM></body></html>using oracle 9.2.08
    Edited by: muttleychess on Mar 19, 2013 11:36 AM

  • HTML tag strip (email parsing use case)

    Have you come across the requirement of parsing an HTML message, let's say an email, and converting it into text?
    The most common use case would be email parsing: remove all the HTML tags and use the result as a string.
    It has to remove entire tags, including their attributes; for example, tables with TRs and TDs that carry properties must be removed. BR should be converted into a carriage return, and so on.
    Cheers,
    Renato Fichmann

    Derek,
    In my situation, I'm triggering a process from an email. I need to extract information out of this email, but it's not formatted as XHTML. I don't know whether there's an easy way to preprocess the email before trying to dissect it.
    I'm working on an alternative solution, but that relies on the people sending me the email granting direct access to their systems.
    Hopefully there'll be some funky string functions like this in upcoming releases?
    Ryan
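    Outside the product discussed in this thread, one way to turn an HTML email body into text is to run it through a lenient HTML parser and keep only the text events. The sketch below uses the Swing parser that ships with Java; the class name HtmlToText and the choice of which end tags produce a line break are assumptions, not a built-in function of any product mentioned here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: drops all tags and attributes, keeping text and line breaks. */
final class HtmlToText {

    static String toText(String html) {
        StringBuilder out = new StringBuilder();
        ParserCallback callback = new ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data); // text between tags, with the tags already removed
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.BR) {
                    out.append('\n'); // BR becomes a line break
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                // Assumed set of block-level tags that should end a line.
                if (t == HTML.Tag.P || t == HTML.Tag.DIV || t == HTML.Tag.TR) {
                    out.append('\n');
                }
            }
        };
        try (Reader r = new StringReader(html)) {
            new ParserDelegator().parse(r, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toText(
                "<html><body><p>Line one<br>Line two</p></body></html>"));
    }
}
```

    Because the parser is forgiving about unclosed tags, this tends to cope with email markup that is not valid XHTML, which is the situation described above.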

  • XML Parser puts HTML Tags each on its own line... BAD

    Is there a way to tell Oracle XML Parser not to place each HTML tag on a separate line in the output stream? It appears that this is what is causing what I will call "dog-legging". The normal dotted line that surrounds an image on an HTML browser page has an extra protrusion in the lower-right corner under IE 4.x. A hyphen appears in the lower-right corner under NN 4.x.
    Any information that you good folks could provide me on this matter would be greatly appreciated.
    Thanks in advance,
    Dave Reese

    The way to avoid this is to specifically request that the HTML output be generated without indentation. Just include the
    following at the top-level of your stylesheet inside the outermost <xsl:stylesheet> element:
    <xsl:output method="html" indent="no"/>

  • Parsing XML with html tags for style

    I'm using flash to pull in XML data, but I want to use html
    tags to be able to style the text. When I add any html, it treats
    it as a sub-node and ignores the data. Also, line breaks in the xml
    are being converted to double spaced paragraphs? The relevant code
    is basically this:
    if (element.nodeName.toUpperCase() == "TEXT")
    {
      // add text to text array
      ar_text[s] = element.firstChild.nodeValue;
      textbox1.text = ar_text[0];
    }

    Try using htmlText instead of text, like this:
    textbox1.htmlText = ar_text[0]
    adam

  • Query to extract HTML tag with data

    Hi All,
    I have a string.
    '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P1.</P>NIMISH<P>THIS IS P2.</P></BODY></HTML>'
    I want to extract an HTML tag, including its opening and closing tags, together with its data.
    If I say P1,
    then the output should be
    '<P>THIS IS P1.</P>'
    and for P2
    the output should be
    <P>THIS IS P2.</P>
    Please help me write this query with a regular expression.
    I have tried the following, but it does not give the desired result:
    WITH T AS
    (
    SELECT
        '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P1.</P>NIMISH<P>THIS IS P2.</P></BODY></HTML>' STR
    FROM
        DUAL
    )
    SELECT REGEXP_SUBSTR(STR, '<P>.+P2.+</P>') FROM T
    Thanks & Regards
    Nimish Garg
    Edited by: Nimish Garg on May 7, 2012 5:49 PM

    Nimish Garg wrote:
    "My requirement is to extract a <tag>data</tag> from a HTML/XML string
    where data contains any specified value."
    HTML is not XML.
    And that is a critical distinction to make. HTML parsing is horribly complex. XML is quite easy. For HTML you have to code your own parser in PL/SQL. XML can be parsed using the XMLTYPE class/data type in PL/SQL.
    So if you need to find a single specific tag in HTML, I would not try to treat it as XML. I may not even try to use regular expressions.
    I would do a basic substring search for the start of the tag, then read the data following the tag until the end tag is reached, ensuring that there are no nested or embedded tags in the data. That check is needed because HTML is so widely abused, and that abuse is the accepted norm because the parsers used by browsers deal with it without complaining.
    Proper HTML is mostly a myth in my experience of "screen scraping" web servers for data extraction when they do not offer web services supplying the data.
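    The substring approach described in this reply can be sketched as follows. It is written in Java rather than PL/SQL purely for illustration; the class and method names are made up, and the sketch assumes attribute-less tags matched case-sensitively, as in the sample string. As an aside, the greedy pattern '<P>.+P2.+</P>' in the question starts matching at the first <P>, so it probably returns both paragraphs rather than just the second.

```java
/** Hypothetical helper: finds <tag>data</tag> where data contains a given value. */
final class TagFinder {

    static final String SAMPLE =
            "<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY."
            + "<P>THIS IS P1.</P>NIMISH<P>THIS IS P2.</P></BODY></HTML>";

    /** Returns the first matching element, or null when there is none. */
    static String find(String html, String tag, String needle) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int from = 0;
        while (true) {
            // 1. Basic substring search for the start of the tag.
            int start = html.indexOf(open, from);
            if (start < 0) {
                return null;
            }
            // 2. Read the data that follows, up to the end tag.
            int end = html.indexOf(close, start + open.length());
            if (end < 0) {
                return null;
            }
            String data = html.substring(start + open.length(), end);
            // 3. Accept only data without nested or embedded tags.
            if (data.indexOf('<') < 0 && data.contains(needle)) {
                return html.substring(start, end + close.length());
            }
            from = start + open.length(); // try the next occurrence
        }
    }

    public static void main(String[] args) {
        System.out.println(find(SAMPLE, "P", "P1"));
        System.out.println(find(SAMPLE, "P", "P2"));
    }
}
```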

  • Want text input containing HTML tags to appear as HTML in output format

    Hi,
    We have a table in an Oracle database with a column named detail. One of its values looks like this: <bold><italics>Good Morning</italics></bold>. Our client wants the output to show: <b><i>Good Morning</b></i>. That is, BI Publisher should be able to parse the HTML tags and produce the desired output. Please tell me how to achieve this. Any help is much appreciated.
    Thanks and regards,
    Debarati,
    [email protected]

    Hi,
    have a look here (http://blogs.oracle.com/xmlpublisher/2007/01/formatting_html_with_templates.html) to get an idea.
    regards
    Rainer

  • How to remove html tags from a column

    Hi
    Problem is this: I get a column with a comma separated list of id's and I can successfully parse these id's and use them elsewhere. BUT, occasionally there are html tags within that id list like this:
    1082471,1237423<br xmlns="http://www.w3.org/1999/xhtml" />
    Is there a way to just automatically remove all tags from a column? Could do this with regex, but since there is no support, I don't know what to do.

    Hi,
    If the HTML can be detected by a starting symbol like "<", then you could use the following.
    Unfortunately the operation "ReplaceRange" is only available at the text level, so you have to invoke a function (at least to my knowledge). You also need an index column in your table, so if you don't have one, you need to create it as well.
    This is your function:
    let
       fnRemoveHTML = (Value, Index) =>
    let
       Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
       IndeNo = Index,
       Value_ = Source{IndeNo-1}[Value],
       length = Text.Length(Text.From(Value_)),
       position = Text.PositionOf(Text.From(Value_), "<"),
       range = length-position,
       new= if Value_ is number then Value_ else Text.ReplaceRange(Value_, position, range, "")
    in
        new
    in
      fnRemoveHTML
    And this is how you invoke it:
    let
        Quelle = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
        Last = Table.AddColumn(Quelle, "Custom", each fn_RemoveHTML([Value], [Index])),
        ChangedType = Table.TransformColumnTypes(Last,{{"Custom", type number}})
    in
        ChangedType
    Provided your table is called “Tabelle1” & the column with your values to be replaced “Value” & your index-col “Index”
    Imke

  • How to disable html tags in richtext editor

    Hi All,
    I want to disable the HTML tags in the rich text editor. I am able to disable all the components using the following code:
    <af:richTextEditor id="rte2" toolboxLayout="spellcheck">
    <f:facet name="spellcheck">
    <af:commandLink id="chek" text="Check Spelling" styleClass="linkcont"/>
    </f:facet>
    </af:richTextEditor>
    When the page gets submitted, I don't want any HTML tags in the value. How can I achieve this?
    Regards,
    Smaran

    Hi,
    I am not sure your question is complete, as it seems to lack content. Anyway, did you try a value change listener to access the content and parse it for HTML tags?
    Frank
