Tagged text from Stanford parser

Hello all!
I am working on natural language processing (NLP) with the Stanford parser.
My goal is to get tagged text out of the parser, like the online demo here:
http://nlp.stanford.edu:8080/parser/
For example, for the input:
the screen is samsung is good
the output is:
the/DT screen/NN is/VBZ samsung/JJ is/VBZ good/JJ
My code is:
   LexicalizedParser lp = new LexicalizedParser("C:\\englishPCFG.ser");
    lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});
    String sent ="sound on samsung galaxy is the best.";
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
//  parse.pennPrint();
    System.out.println();
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    TreeGraphNode sentence = new TreeGraph(parse).root();
The API documentation is here:
http://tides.umiacs.umd.edu/webtrec/stanfordParser/javadoc/
but I didn't find what I need there.
Thanks for helping.

I am presuming that your question is that you don't know where to get the information from.
So experiment.
For example, call Tree.pennPrint() (it is commented out in your code) and look at what it prints.
And tdl is a collection, so iterate through it, get each item, and print the properties of each.
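To make the "experiment" suggestion concrete, here is a minimal sketch that turns Penn-style bracketed output (the kind pennPrint() writes) into word/TAG text. It works on a plain string and does not call the Stanford library at all; the class and method names (PennTagExtractor, toTaggedText) are made up for illustration, and the sample tree is hand-written, not real parser output. If your version of the parser has a method on Tree that returns the tagged words directly (I believe taggedYield() does this, but check the javadoc for your release), that would be the cleaner route.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts word/TAG pairs from a Penn-bracketed tree string.
final class PennTagExtractor {
    // A preterminal looks like "(TAG word)": one tag, one word, no nested brackets.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(([^()\\s]+)\\s+([^()\\s]+)\\)");

    static String toTaggedText(String pennTree) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PRETERMINAL.matcher(pennTree);
        while (m.find()) {
            // group(1) is the tag, group(2) is the word
            tokens.add(m.group(2) + "/" + m.group(1));
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Hand-written sample tree, assumed to resemble pennPrint() output.
        String tree = "(ROOT (S (NP (DT the) (NN screen)) (VP (VBZ is) (ADJP (JJ good)))))";
        System.out.println(toTaggedText(tree)); // prints: the/DT screen/NN is/VBZ good/JJ
    }
}
```

Phrase-level nodes such as (NP ...) are skipped because they contain nested brackets, so only the leaf tag/word pairs survive.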

Similar Messages

  • How to get a specific tag value from SAX parser

    I am using the SAX method to parse my XML file.
    My question is how to get the parsed characters back after calling the parser,
    especially the value of the <body> tag.
    Here is my XML file; I want to get the parsed <body> value after calling the SAX parser.
    <?xml version="1.0" encoding="UTF-8"?>
    <article>
    <content>
    <title>floraaaaa</title>
    <date>2004-03-19</date>
    <body>
    Details of an article, and i want to get the article details
    </body>
    </content>
    </article>

    here is the parser code I am using:
    import java.io.*;
    import org.apache.xerces.parsers.SAXParser;
    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.Locator;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;
    public class test2 {
         public String m_xmlDetail;
         public void readDetail(String url) {
              System.out.println("Parsing XML File: " + url + "\n\n");
              try {
                   XMLReader parser = new SAXParser();
                   ContentHandler contentHandler = new MyContentHandler();
                   parser.setContentHandler(contentHandler);
                   parser.parse(url);
              } //try ends here
              catch (IOException e) {
                   System.out.println("Error reading URI: " + e.getMessage());
              } //catch ends here
              catch (SAXException e) {
                   System.out.println("Error in parsing: " + e.getMessage());
              } //catch ends here
         } //function
    }//close class
    public class MyContentHandler implements ContentHandler {
         private Locator locator;
         //public String m_bodyDetail=new String();
         public void setDocumentLocator(Locator locator) {
              System.out.println(" * setDocumentLocator() called");
              this.locator = locator;
         }
         public void startDocument() throws SAXException {
              System.out.println("Parsing begins...");
         }
         public void endDocument() throws SAXException {
              System.out.println("...Parsing ends.");
         }
         public void processingInstruction(String target, String data)throws SAXException {
              System.out.println("PI: Target:" + target + " and Data:" + data);
         }
         public void startPrefixMapping(String prefix, String uri) {
              System.out.println("Mapping starts for prefix " + prefix + " mapped to URI " + uri);
         }
         public void endPrefixMapping(String prefix) {
              System.out.println("Mapping ends for prefix " + prefix);
         }
         public void startElement(String namespaceURI, String localName,String rawName, Attributes atts)throws SAXException {
              System.out.print("startElement: " + localName);
              if (!namespaceURI.equals("")) {
                   System.out.println(" in namespace " + namespaceURI + " (" + rawName + ")");
              } else {
                   System.out.println(" has no associated namespace");
              }
              for (int i=0; i<atts.getLength(); i++) {
                   System.out.println(" Attribute: " + atts.getLocalName(i) +"=" + atts.getValue(i));
              }
         }
         public void endElement(String namespaceURI, String localName, String rawName) throws SAXException {
              System.out.println("endElement: " + localName + "\n");
         }
         // The third argument is a length, not an end index.
         public void characters(char[] ch, int start, int length) throws SAXException {
              String s = new String(ch, start, length);
              System.out.println("characters: " + s);
         }
         public void ignorableWhitespace(char[] ch, int start, int length)throws SAXException {
              String s = new String(ch, start, length);
              System.out.println("ignorableWhitespace: [" + s + "]");
         }
         public void skippedEntity(String name) throws SAXException {
              System.out.println("Skipping entity " + name);
         }
    } //close class
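
    One way to get the <body> value back out of the handler, sketched under some assumptions: it uses the JDK's built-in javax.xml.parsers SAX parser and DefaultHandler rather than the Xerces SAXParser class from the post, and the names BodyTextHandler, getBody and extractBody are made up for illustration. The idea is to switch a flag on in startElement, append in characters(), switch it off in endElement, and read the buffer after parse() returns.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that remembers the text found inside <body>.
final class BodyTextHandler extends DefaultHandler {
    private final StringBuilder body = new StringBuilder();
    private boolean insideBody;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            insideBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            insideBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one element's text in several chunks, so append.
        if (insideBody) {
            body.append(ch, start, length);
        }
    }

    String getBody() {
        return body.toString().trim();
    }

    // Parses an XML string and returns the trimmed <body> text.
    static String extractBody(String xml) throws Exception {
        BodyTextHandler handler = new BodyTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getBody();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article><content><title>floraaaaa</title>"
                + "<body>\nDetails of an article\n</body></content></article>";
        System.out.println(extractBody(xml)); // prints: Details of an article
    }
}
```

    The same flag-and-buffer approach should carry over to the ContentHandler in the post: keep a field in MyContentHandler, fill it in characters(), and expose it with a getter that readDetail() calls after parser.parse(url).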

  • [IDCS3 WIN] Assert while importing tagged text

    Hi,
    I use the following code to import a tagged text from a buffer into a text frame:
    IDataBase* database = frameUIDRef.GetDataBase();
    InterfacePtr<IHierarchy> frameHierarhy(frameUIDRef, UseDefaultIID());
    int32 count = frameHierarhy->GetChildCount();
    InterfacePtr<IMultiColumnTextFrame> textFrame(frameHierarhy->QueryChild(0), UseDefaultIID());
    if( !textFrame )
        return UIDRef::gNull;
    UID storyUID = textFrame->GetTextModelUID();
    InterfacePtr<ITextModel> textModel(database, storyUID, UseDefaultIID());
    if( !textModel )
        return UIDRef::gNull;
    UIDRef result = UIDRef::gNull;
    InterfacePtr<IK2ServiceRegistry> services(gSession, UseDefaultIID());
    InterfacePtr<IK2ServiceProvider> service(services->QueryServiceProviderByClassID(kImportProviderService, kTaggedTextImportFilterBoss));
    InterfacePtr<IImportProvider> prov(service, IID_IIMPORTPROVIDER);
    InterfacePtr<IPMStream> stream(StreamUtil::CreatePointerStreamRead(taggedtext, strlen(taggedtext)));
    if (prov->CanImportThisStream(stream) == IImportProvider::kFullImport)
    {
        database->BeginTransaction();
        prov->ImportThis(database, stream, K2::kSuppressUI, &result);  //This line generates the Assert
        if (result != nil )
        {
            Utils<ITextUtils> textUtils;
            InterfacePtr<ICommand> moveAllStoryCommand(textUtils->QueryMoveStoryFromAllToAllCommand(result, ::GetUIDRef(textModel)));
            CmdUtils::ProcessCommand(moveAllStoryCommand);
        }
        database->EndTransaction();
    }
    It works fine, but when "ImportThis" is called I get this:
    ASSERT 'fCmdProcessorState == kDoing || fCmdProcessorState != kNotProcessing || cmdMgrRef.GetDataBase()->GetUndoSupport() == IDataBase::kNoUndoSupport' in ..\..\..\source\components\appframework\commandmgmt\CommandProcessor.cpp at line 2889 failed.
    Any help would be appreciated.
    Thanks in advance, David

    I would say the problem here is in using database->BeginTransaction()/EndTransaction(). Basically, you should never call these methods; you need to find or create a command to do the processing instead, then perhaps wrap the two commands in a command sequence.
    I know there was a post saying "don't use those methods" a long time ago by Ken Sadahiro (then of Adobe). You might find it with a search, though it will have been archived by now.
    Ian

  • Importing Adobe Tagged Text

    Hello,
    So here's what's happening…
    I've installed a Wordpress plugin that exports posts as Adobe Tagged Text.
    When I import, the text still shows up marked up with tags, just as it looks in the .txt file:
    <ANSI-MAC>
    <DefineParaStyle:NormalParagraphStyle=<Nextstyle:NormalParagraphStyle><cColor:Registration ><cSize:10><cLeading:11><pFirstLineIndent:12><cFont:Times New Roman>>
    <ParaStyle:NormalParagraphStyle>This <cTypeface:Italic>is<cTypeface:> a <cTypeface:Bold>test<cTypeface:>
    <ParaStyle:NormalParagraphStyle>Test paragraph
    <ParaStyle:NormalParagraphStyle>Testing this out
    Does it have something to do with the first line of code, <ANSI-MAC>?
    Regards,
    Eric

    You're using a very old InDesign workflow which I've never used. But it's referenced in Olav Kvern's and David Blatner's excellent "Real World InDesign" (I'm looking at the CS4 edition).
    On page 235, it reads, "The first characters in a tagged text file must state the character encoding (ASCII, ANSI, UNICODE, BIG5, or SJIS), followed by the platform (MAC or WIN). So the typical Windows tagged text begins with <ASCII-WIN>, and the Macintosh version begins with <ASCII-MAC>."
    It looks to me like it should work. Have you tried exporting some tagged text from InDesign to see how it looks? File > Export > Adobe InDesign Tagged Text.
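
    One thing that might be worth ruling out (a guess on my part, not something verified against InDesign) is whether anything sits in front of that first tag in the .txt file, such as a byte-order mark or a blank line. Below is a small sketch that checks whether a file's text starts with one of the encoding/platform pairs listed in the quote above. The class and method names are made up for illustration, and the pattern is built only from that quoted list, so it may be stricter or looser than what InDesign itself accepts.

```java
import java.util.regex.Pattern;

// Hypothetical checker for the first tag of a tagged text file.
final class TaggedTextHeader {
    // Encoding and platform names taken from the quoted passage only.
    private static final Pattern HEADER =
            Pattern.compile("^<(ASCII|ANSI|UNICODE|BIG5|SJIS)-(MAC|WIN)>");

    // True only when the header tag is the very first thing in the text.
    static boolean startsWithValidHeader(String fileContents) {
        return HEADER.matcher(fileContents).find();
    }

    public static void main(String[] args) {
        String fromPost = "<ANSI-MAC>\n<ParaStyle:NormalParagraphStyle>Test paragraph";
        System.out.println(startsWithValidHeader(fromPost));         // true
        System.out.println(startsWithValidHeader("\n" + fromPost));  // false: blank line first
    }
}
```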

  • Exporting Table from Indesign to tagged text.

    Hi, all!
    I need to export a table from InDesign to tagged text with a script.
    I found ExportAllStories.jsx in the samples.
    It's a good example, but I have some questions:
    1. My document has more than one table and I need to export only some of them. How can I export only one of them? Do tables have "labels" or some other distinguishing mark?
    2. How can I set the encoding of the destination file? By default InDesign writes <ASCII-WIN>, but I need <Unicode>.

    Edit > InCopy > Export All Stories.
    Open the INDD in InCopy.
    This assumes everyone involved has access to the files.

  • Select a text from a word/pdf document for tagging.

    Hi,
    After an overwhelming response to my first thread ("Thumbnail creation during Image Upload"), I am hoping that my second post will have a solution.
    I am currently working on a knowledge management tool using ABAP WD. The requirement is to open existing documents, select some text, and tag/categorize it. I also found out that, using Flash Islands, it is possible to get the selected text from a textField and pass it back to the bound ABAP WD variable.
    Is there a way to display a Word/PDF file and perform the text tagging in WD, perhaps using office integration? Any help would be greatly appreciated.

    Never mind--I figured it out.

  • How to build tag cloud from text document

    Hi,
    I want to build a tag cloud from text documents.
    Each text document has several lines of sentences.
    The tag cloud shows long sentences, not keywords.
    Please let me know how to build it.
    Also, can I use CAS or text enrichment?
    Thank you very much.
    SWKO

    Hi,
    You need to extract salient terms (e.g. people or product names) from your text documents by using the text enrichment or the text tagger component.
    This step will generate new attributes, which can be used in your tag cloud.
    Mathias

  • Batch export text from multiple Indesign files, applying em and strong tags

    Hi there,
    I have been struggling for a while now with the following workflow:
    batch exporting all text (text only) as plain text from multiple InDesign files, applying <em> and <strong> to all the italics and bold (for instance: Molluptas verion <strong>nossum</strong> idist <em>doluptatet</em> maiorerum quiaspienit, cum erferiosapis eos expe nonsequas verumquae dolor sim eos doluptatiur autet lab idicili beatum deliquat).
    Properly tagging the styles would be too time-consuming: these documents date back as far as 1998 and have inconsistent, untagged styles, so it would mean manually opening each file and assigning tags for all styles (not to mention that there might be some local overrides).
    This task is necessary to make the whole archive available on a WordPress website.
    Any feedback would be much appreciated.
    Thanks.

    There may well be a script, or you may have to pay someone to write one. All I can tell you is that any text not mapped to a style will be a mess in the HTML code.
    Can’t comment on Quark files. I have no clue what they’re capable of.
    Bob

  • Can I copy text with tag contents from JTextPane

    Hello,
    I want to copy text from a JTextPane together with its tag contents. I use the JTextPane for displaying an HTML page. If this is possible, please reply and explain how it should be done.

    Hi User,
    Please follow the below link:
    http://www.rittmanmead.com/2008/04/migration-obiee-projects-between-dev-and-prod-environments/
    Basically just to give you a heads up, you copy paste the RPD and webcatalog file from your dev instance to same locations in prod instance.
    Award points if your question is answered.
    Thanks,
    -Amith.

  • How to integrate IAS with Snacktory to get the main text from an HTML page

    Hi All,
    I am new to Endeca and IAS. I have a requirement to extract the main text from a whole HTML page before IAS saves the text to the Endeca_Document_Text property.
    Because IAS saves all the text on the page to Endeca_Document_Text, the result is not readable when shown on a web page, so I use a third-party API (Snacktory) to filter the main text out of the original page.
    Now I want to save this text to the Endeca_Document_Text property.
    Another question:
    I get zero pages crawled when I do the main-text filtering with Snacktory inside a ParseFilter (HTMLMetatagFilter implements ParseFilter).
    If the filter does only a little work, it runs fine; if it does more, the crawler fails to crawl the page. Does anyone know how to fix it?
    Log for the crawler:
    Successfully set recordstore configuration.
    INFO    2013-09-03 00:56:42,743    0    com.endeca.eidi.web.Main    [main]    Reading seed URLs from: /home/oracle/oracle/endeca/IAS/3.0.0/sample/myfirstcrawl/conf/endeca.lst
    INFO    2013-09-03 00:56:42,744    1    com.endeca.eidi.web.Main    [main]    Seed URLs: [http://www.liferay.com/community/forums/-/message_boards/category/]
    INFO    2013-09-03 00:56:43,497    754    com.endeca.eidi.web.db.CrawlDbFactory    [main]    Initialized crawldb: com.endeca.eidi.web.db.BufferedDerbyCrawlDb
    INFO    2013-09-03 00:56:43,498    755    com.endeca.eidi.web.Crawler    [main]    Using executor settings: numThreads = 100, maxThreadsPerHost=1
    INFO    2013-09-03 00:56:44,163    1420    com.endeca.eidi.web.Crawler    [main]    Fetching seed URLs.
    INFO    2013-09-03 00:56:46,519    3776    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-1]    come into EndecaHtmlParser getParse
    INFO    2013-09-03 00:56:46,519    3776    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-1]    come into HTMLMetatagFilter
    INFO    2013-09-03 00:56:46,519    3776    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-1]    meta tag viewport ==minimum-scale=1.0, width=device-width
    INFO    2013-09-03 00:56:52,889    10146    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-1]    come into EndecaHtmlParser getParse
    INFO    2013-09-03 00:56:52,889    10146    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-1]    come into HTMLMetatagFilter
    INFO    2013-09-03 00:56:52,890    10147    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-1]    meta tag viewport ==minimum-scale=1.0, width=device-width
    INFO    2013-09-03 00:56:59,184    16441    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-2]    come into EndecaHtmlParser getParse
    INFO    2013-09-03 00:56:59,185    16442    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-2]    come into HTMLMetatagFilter
    INFO    2013-09-03 00:56:59,185    16442    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-2]    meta tag viewport ==minimum-scale=1.0, width=device-width
    INFO    2013-09-03 00:57:07,057    24314    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-2]    come into EndecaHtmlParser getParse
    INFO    2013-09-03 00:57:07,057    24314    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-2]    come into HTMLMetatagFilter
    INFO    2013-09-03 00:57:07,057    24314    com.endeca.eidi.web.parse.HTMLMetatagFilter    [pool-1-thread-2]    meta tag viewport ==minimum-scale=1.0, width=device-width
    INFO    2013-09-03 00:57:07,058    24315    com.endeca.eidi.web.Crawler    [main]    Seeds complete.
    INFO    2013-09-03 00:57:07,090    24347    com.endeca.eidi.web.Crawler    [main]    Starting crawler shut down
    INFO    2013-09-03 00:57:07,095    24352    com.endeca.eidi.web.Crawler    [main]    Waiting for running threads to complete
    INFO    2013-09-03 00:57:07,095    24352    com.endeca.eidi.web.Crawler    [main]    Progress: Level: Cumulative crawl summary (level)
    INFO    2013-09-03 00:57:07,095    24352    com.endeca.eidi.web.Crawler    [main]    host-summary: www.liferay.com to depth 1
    host    depth    completed    total    blocks
    www.liferay.com    0    0    1    1
    www.liferay.com    1    0    0    0
    www.liferay.com    all    0    1    1
    INFO    2013-09-03 00:57:07,096    24353    com.endeca.eidi.web.Crawler    [main]    host-summary: total crawled: 0 completed. 1 total.
    INFO    2013-09-03 00:57:07,096    24353    com.endeca.eidi.web.Crawler    [main]    Shutting down CrawlDb
    INFO    2013-09-03 00:57:07,160    24417    com.endeca.eidi.web.Crawler    [main]    Progress: Host: Cumulative crawl summary (host)
    INFO    2013-09-03 00:57:07,162    24419    com.endeca.eidi.web.Crawler    [main]   Host: www.liferay.com:  0 fetched. 0.0 mB. 0 records. 0 redirected. 4 retried. 0 gone. 0 filtered.
    INFO    2013-09-03 00:57:07,162    24419    com.endeca.eidi.web.Crawler    [main]    Progress: Perf: All (cumulative) 23.6s. 0.0 Pages/s. 0.0 kB/s. 0 fetched. 0.0 mB. 0 records. 0 redirected. 4 retried. 0 gone. 0 filtered.
    INFO    2013-09-03 00:57:07,162    24419    com.endeca.eidi.web.Crawler    [main]    Crawl complete.
    ~/oracle/endeca
    ======================================
    Source code for the ParseFilter:
    package com.endeca.eidi.web.parse;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.log4j.Logger;
    import org.apache.nutch.metadata.Metadata;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseFilter;
    import org.apache.nutch.protocol.Content;
    import de.jetwick.snacktory.ArticleTextExtractor;
    import de.jetwick.snacktory.JResult;
    public class HTMLMetatagFilter implements ParseFilter {
        public static String METATAG_PROPERTY_NAME_PREFIX = "Endeca.Document.HTML.MetaTag.";
        public static String CONTENT_TYPE = "text/html";
        private static final Logger logger = Logger.getLogger(HTMLMetatagFilter.class);
        public Parse filter(Content content, Parse parse) throws Exception {
            logger.info("come into EndecaHtmlParser getParse");
            logger.info("come into HTMLMetatagFilter");
            //update the content with the main text in html page
            //content.setContent(HtmlExtractor.extractMainContent(content));
            // Check for null before touching the parse data.
            ParseData parseData = parse.getData();
            if (parseData == null) return parse;
            parseData.getParseMeta().add("FILTER-HTMLMETATAG", "ACTIVE");
            extractText(content, parse);
            logger.info("update the content with the main text content");
            return parse;
        }
        private void extractText(Content content, Parse parse){
            try {
                ParseData parseData = parse.getData();
                if (parseData == null) return;
                Metadata md = parseData.getParseMeta();
                ArticleTextExtractor extractor = new ArticleTextExtractor();
                String sourceHtml = new String(content.getContent());
                JResult res = extractor.extractContent(sourceHtml);
                String text = res.getText();
                md.set("Endeca_Document_Text", text);
            } catch (Exception e) {
                // Log instead of swallowing, so extraction failures show up in the crawl log.
                logger.error("main text extraction failed", e);
            }
        }
        public static void log(String msg){
            System.out.println(msg);
        }
        public Configuration getConf() {
            return null;
        }
        public void setConf(Configuration conf) {
        }
    }

    "but it only extracts URLs from <A> (anchor) tags. I want to be able to extract URLs from <MAP> tags as well"
    Gee, do you think you could modify the code to check for "Map" attributes as well?
    "Can someone maybe point a page containing info on the HTML toolkit for me?"
    It's called the API. Since you are using the HTMLEditorKit and an ElementIterator and an AttributeSet, I would start there.
    There is no such API that says "get me all the links", so you have to do a little work on your own.
    Maybe you could use a ParserCallback and every time you get a new tag you check for the "href" attribute.
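
    A rough sketch of that ParserCallback idea, using only the JDK's Swing HTML parser. The class name HrefCollector and the extract method are made up for illustration. It checks every tag for an "href" attribute instead of looking only at <A>, so the <AREA> entries inside a <MAP> should be picked up as well. Empty tags such as <AREA> are reported through handleSimpleTag rather than handleStartTag, which is why both callbacks are overridden.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that records the href attribute of every tag that has one.
final class HrefCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        collect(attrs);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        collect(attrs);
    }

    private void collect(MutableAttributeSet attrs) {
        Object href = attrs.getAttribute(HTML.Attribute.HREF);
        if (href != null) {
            hrefs.add(href.toString());
        }
    }

    // Parses an HTML string and returns the href values in document order.
    static List<String> extract(String html) throws IOException {
        HrefCollector collector = new HrefCollector();
        // Third argument true: ignore charset directives in <meta> tags.
        new ParserDelegator().parse(new StringReader(html), collector, true);
        return collector.hrefs;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"a.html\">link</a>"
                + "<map name=\"m\"><area href=\"b.html\"></map></body></html>";
        System.out.println(extract(html));
    }
}
```

    Note that this also collects href values from tags such as <link> and <base>; filter on the tag argument if you only want some of them.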

  • Problem to extract text from HTML document

    I have to extract some text from HTML files into my database (about 1000 files).
    The HTML files are from the ACM Digital Library: http://portal.acm.org/dl.cfm
    Each HTML page holds the information about one paper. I only want to get the text of "Title", "Abstract", "Classification", and "Keywords".
    The problem is that I can't find any pattern for parsing the HTML files.
    Example: I need to get Classification = "Theory of Computation", "ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY", "Numerical Algorithms and Problems", "Mathematics of Computing", "NUMERICAL ANALYSIS", etc.
    The section of the page source for "Classification" is below.
    Please give me any idea how to do this, or how to find a pattern to extract the text.
    <div class="indterms"><a href="#CIT"><img name="top" src=
    "img/arrowu.gif" hspace="10" border="0" /></a><span class=
    "heading"><a name="IndexTerms">INDEX TERMS</a></span>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Primary Classification:</a></span><br />
    &nbsp; <b>F.</b> <a href=
    "results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory of Computation</a><br />
    &nbsp; <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>F.2</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
    COMPLEXITY</a><br />
    &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>F.2.1</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Numerical Algorithms and Problems</a><br />
    </p>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Additional&nbsp;Classification:</a></span><br />
    &nbsp; <b>G.</b> <a href=
    "results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Mathematics of Computing</a><br />
    &nbsp; <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>G.1</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">NUMERICAL ANALYSIS</a><br />
    &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>G.1.6</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Optimization</a><br />
    &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border=
    "0" height="20" width="20" /> <b>Subjects:</b> <a href=
    "results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Linear programming</a><br />
    </p>
    <br />
    <p class="GenTerms"><span class="heading"><a name=
    "GenTerms">General Terms:</a></span><br />
    <a href=
    "results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Algorithms</a>, <a href=
    "results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory</a></p>
    <br />
    <p class="keywords"><span class="heading"><a name=
    "Keywords">Keywords:</a></span><br />
    <a href=
    "results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Simplex method</a>, <a href=
    "results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">complexity</a>, <a href=
    "results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">perturbation</a>, <a href=
    "results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">smoothed analysis</a></p>
    </div>

    One approach is to download HTML Parser from SourceForge
    http://htmlparser.sourceforge.net/ and write rules to match the title, abstract, etc.
    Another approach is to write your own parser that extracts only the title, abstract, etc.
    1. Tokenize the HTML file, i.e. convert the HTML into tokens (tags and values).
    2. Write a simple parser to extract the information you want.
    Find the pattern of the text you want to extract, for instance <p class="abstract">.
    Then write a rule for extracting the abstract, such as:
    if (tag is abstract) then extract abstract text
    Apply the same concept to the other tags.
    Attached is a sample parser that was used to extract the title and abstract from ACM HTML files. Please modify it to include keywords and other fields.
    Good luck.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    public class ACMHTMLParser {
         private String m_filename;
         private URLLexicalAnalyzer lexical;
         List urls = new ArrayList();
         public ACMHTMLParser(String filename) {
              super();
              m_filename = filename;
         }
         /**
          * parses only title and abstract
          */
         public void parse() throws Exception {
              lexical = new URLLexicalAnalyzer(m_filename);
              String word = lexical.getNextWord();
              boolean isabstract = false;
              while (null != word) {
                   if (isTag(word)) {
                        if (isTitle(word)) {
                             System.out.println("TITLE: " + lexical.getNextWord());
                        } else if (isAbstract(word) && !isabstract) {
                             parseAbstract();
                             isabstract = true;
                        }
                   }
                   word = lexical.getNextWord();
              }
              lexical.close();
         }
         public static void main(String[] args) throws Exception {
              ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
              parser.parse();
         }
         public static boolean isTag(String word) {
              return ( word.startsWith("<") && word.endsWith(">"));
         }
         public static boolean isTitle(String word) {
              return ( "<title>".equals(word));
         }
         //please modify according to the html source
         public static boolean isAbstract(String word) {
              return ( "<p class=\"abstract\">".equals(word));
         }
         // prints the first non-tag token that follows the abstract tag
         private void parseAbstract() throws Exception {
              while (true) {
                   String abs = lexical.getNextWord();
                   if (null == abs) break;   // end of file before any abstract text
                   if (!isTag(abs)) {
                        System.out.println(abs);
                        break;
                   }
              }
         }
         // splits the input into tokens: either a whole tag ("<...>") or the text between tags
         class URLLexicalAnalyzer {
           private BufferedReader m_reader;
           private boolean isTag;   // true when scanValue has already consumed the '<' of the next tag
           public URLLexicalAnalyzer(String filename) {
              try {
                m_reader = new BufferedReader(new FileReader(filename));
              } catch (IOException io) {
                System.out.println("ERROR, file not found " + filename);
                System.exit(1);
              }
           }
           public URLLexicalAnalyzer(InputStream in) {
              m_reader = new BufferedReader(new InputStreamReader(in));
           }
           public void close() {
              try {
                if (null != m_reader) m_reader.close();
              } catch (IOException ignored) {}
           }
           public String getNextWord() throws IOException {
              int c = m_reader.read();
              if (-1 == c) return null;
              if (Character.isWhitespace((char)c))
                return getNextWord();
              if ('<' == c || isTag)
                return scanTag(c);
              else
                   return scanValue(c);
           }
           private String scanTag(final int c) throws IOException {
              StringBuffer result = new StringBuffer();
              if ('<' != c) result.append('<');   // the '<' was consumed by scanValue
              result.append((char)c);
              int ch = -1;
              while (true) {
                ch = m_reader.read();
                if (-1 == ch) throw new IllegalArgumentException("un-terminate tag");
                if ('>' == ch) {
                     isTag = false;
                     break;
                }
                result.append((char)ch);
              }
              result.append((char)ch);   // append the closing '>'
              return result.toString();
           }
           private String scanValue(final int c) throws IOException {
                StringBuffer result = new StringBuffer();
                result.append((char)c);
                int ch = -1;
                while (true) {
                   ch = m_reader.read();
                   if (-1 == ch) throw new IllegalArgumentException("un-terminate value");
                   if ('<' == ch) {
                        isTag = true;
                        break;
                   }
                   result.append((char)ch);
                }
                return result.toString();
           }
         }
    }
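
    For the "Classification" part of the question, the same find-the-pattern idea can also be sketched with regular expressions instead of a hand-written tokenizer. This is only a sketch under assumptions: it relies on the markup looking like the snippet posted above (classification terms are the <a href=...> links inside <p class="Categories"> paragraphs), the class and method names are made up, and regexes over HTML are brittle, so expect to adjust it if the pages vary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor for the link texts inside <p class="Categories"> blocks.
final class ClassificationExtractor {
    private static final Pattern CATEGORIES_BLOCK = Pattern.compile(
            "<p\\s+class=\\s*\"Categories\"\\s*>(.*?)</p>", Pattern.DOTALL);
    // Only anchors with href; the <a name=...> heading anchors are skipped.
    private static final Pattern LINK_TEXT = Pattern.compile(
            "<a\\s+href=[^>]*>(.*?)</a>", Pattern.DOTALL);

    static List<String> extract(String html) {
        List<String> terms = new ArrayList<>();
        Matcher block = CATEGORIES_BLOCK.matcher(html);
        while (block.find()) {
            Matcher link = LINK_TEXT.matcher(block.group(1));
            while (link.find()) {
                // Link text can wrap across lines, so collapse whitespace.
                terms.add(link.group(1).replaceAll("\\s+", " ").trim());
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String html = "<p class=\"Categories\"><span class=\"heading\">"
                + "<a name=\"GenTerms\">Primary Classification:</a></span><br />\n"
                + "<b>F.</b> <a href=\n\"results.cfm?query=x\"\ntarget=\"_self\">Theory of Computation</a><br />\n"
                + "<b>F.2</b> <a href=\"results.cfm?query=y\" target=\"_self\">ANALYSIS OF ALGORITHMS AND PROBLEM\n"
                + "COMPLEXITY</a><br />\n</p>";
        System.out.println(extract(html));
        // prints: [Theory of Computation, ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY]
    }
}
```

    The keywords could presumably be handled the same way by matching <p class="keywords"> instead.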

  • Read Text from HTML-Pages and want to solve "ChangedCharSetException"

    Hello,
    I have an app that connects to pages via threads, parses them, and gives me only the text version of each HTML page. It works fine, but when it hits a page where the text is inside images, the whole app stops and gives me this message:
    javax.swing.text.ChangedCharSetException
            at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:169)
            at javax.swing.text.html.parser.Parser.startTag(Parser.java:372)
            at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1846)
            at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1881)
            at javax.swing.text.html.parser.Parser.parse(Parser.java:2047)
            at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:106)
            at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:78)
            at aufruf.main(aufruf.java:33)
    So I tried to catch it and call getCharSetSpec() and keyEqualsCharSet() from the class javax.swing.text.ChangedCharSetException, hoping that would solve the problem. But it still doesn't work...
    Then I searched the web and found that I have to add the line:
    doc.putProperty("IgnoreCharsetDirective", new Boolean(true));
    where "doc" is a new HTML document created with the HTMLEditorKit. I do not know much about that, so I hope someone can explain how I can solve the problem within my code.
    Here we go:
    import javax.swing.text.*;
    import java.lang.*;
    import java.util.*;
    import java.net.*;
    import java.io.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;
    public class myParser extends Thread {
            private String name;
            public void run() {
                    try {
                            URL viele = new URL(name);                       // "name" is a variable with a lot of links
                            URLConnection hs = viele.openConnection();
                            hs.connect();
                            if (hs.getContentType().startsWith("text/html")) {
                                    InputStream is = hs.getInputStream();
                                    InputStreamReader isr = new InputStreamReader(is);
                                    BufferedReader br = new BufferedReader(isr);
                                    Lesen los = new Lesen();
                                    ParserDelegator parser = new ParserDelegator();
                                    parser.parse(br,los, false);
                            }
                    } catch (MalformedURLException e) {
                            System.err.print("Doesn't work");
                    } catch (ChangedCharSetException e) {
                            e.getCharSetSpec();
                            e.keyEqualsCharSet();
                            e.printStackTrace();
                    } catch (Exception o) {
                    }
            }
            public void vowi(String n) {
                    name = n;
            }
    }
    And in case it is important, here is the class "Lesen":
    import java.net.*;
    import java.io.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;
    class Lesen extends HTMLEditorKit.ParserCallback
    {
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
        {
            try
            {
                if ((t==HTML.Tag.P) || (t==HTML.Tag.H1) || (t==HTML.Tag.H2) || (t==HTML.Tag.H3) || (t==HTML.Tag.H4) || (t==HTML.Tag.H5) || (t==HTML.Tag.H6))
                    System.out.println();
            }
            catch (Exception q)
            {
                System.out.println(q.getMessage());
            }
        }
        public void handleSimpleTag(HTML.Tag t,MutableAttributeSet a, int pos)
        {
            try
            {
                if (t==HTML.Tag.BR)
                {
                    System.out.println(); // new line
                    System.out.println();
                }
            }
            catch (Exception qw)
            {
                System.out.println(qw.getMessage());
            }
        }
        public void handleText(char[] data, int pos)
        {
            try
            {
                System.out.print(data);   // prints the text from HTML pages
            }
            catch (Exception ab)
            {
                System.out.println(ab.getMessage());
            }
        }
    }
    Thanks a lot for helping...
    Stephan
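
    The "IgnoreCharsetDirective" property mentioned in the question applies when the page is loaded into a document through HTMLEditorKit, rather than through a bare ParserDelegator. A minimal sketch of that route, assuming a hypothetical class name (IgnoreCharsetDemo) and a made-up sample page, could look like this:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical demo class; not part of the original poster's code.
class IgnoreCharsetDemo {
    // Loads HTML into an HTMLDocument and returns its plain text.
    static String htmlToText(Reader html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Ask the kit to ignore any charset named in a <meta> tag,
        // so no ChangedCharSetException is raised while reading.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(html, doc, 0);
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page with a charset directive in its head.
        String page = "<html><head>"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
            + "</head><body><p>Hello</p></body></html>";
        System.out.println(htmlToText(new StringReader(page)));
    }
}
```

    This builds a full document model, which is heavier than a callback-based parse, so it is probably only worth it if the document itself is needed afterwards.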

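    Change
    parser.parse(br,los, false);
    to
    parser.parse(br,los, true);
    The third argument of ParserDelegator.parse is ignoreCharSet. With true, the parser ignores the charset declared in the page instead of throwing ChangedCharSetException.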
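
    As a self-contained illustration of that call, here is a small sketch that parses a page containing a charset directive with ignoreCharSet set to true and collects the text. The class name (CharsetTolerantText) and the sample page are assumptions made up for the example; only ParserDelegator.parse and HTMLEditorKit.ParserCallback come from the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class; not part of the original poster's code.
class CharsetTolerantText {
    // Returns the text of an HTML page, one text run per line.
    static String extractText(Reader html) throws IOException {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append('\n');
            }
        };
        // Third argument is ignoreCharSet: true skips the <meta> charset check.
        new ParserDelegator().parse(html, collector, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample page with a charset directive in its head.
        String page = "<html><head>"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
            + "</head><body><p>Hello</p></body></html>";
        System.out.print(extractText(new StringReader(page)));
    }
}
```

    Note that ignoring the directive means the Reader's own encoding is used as-is, so pages in another charset may show wrong characters unless the InputStreamReader is created with the right encoding.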

  • Easy question: Extracting text from DOM

    I'm new to XML, so bear with me for a sec. I'm using the v2
    parser with JDK 1.2.2 to generate dynamic web pages from XML
    files. A button on the web page needs to kick off a routine that
    sends financial data from the DOM from which the page was
    generated into a graphing program. I've perused the parser API
    docs for methods that will extract the actual displayed text
    from
    DOM nodes (the content between the starting & ending <text>
    tags), but have only been able to extract element names,
    attributes, values, etc. For example, here's a snippet from the
    XML document:
    <text x="253" cellwidth="111" align="center" fontface="Arial,
    Helvetica" fontsize="3" pointsize="12" bold="true"
    italic="false">$93.63</text>
    It's the $93.63 that's important to me. Getting the attributes
    and values from within the tags is no problem. But I need to be
    able to suck out the actual text that gets spit out onto the
    browser. Any ideas?
    Thanks!
    Gus

    Oracle XML Team wrote:
    To get the "$93.63" from the text element, you first need to get its text node and then get the value of that text node.
    Oracle XML Team
    http://technet.oracle.com
    Oracle Technology Network

    Robert Truitt (guest) wrote:
    Add the following code in DOMSample.java:
    if ( n instanceof Element )
    {
        Element e = (Element) n;
        NodeList nl2 = e.getChildNodes();
        //NodeList nl2 = e.getElementsByTagName("#text"); // this doesn't work.
        System.out.print ( "\t" );
        for ( int j = 0; j < nl2.getLength(); j++ )
        {
            Node n2 = nl2.item(j);
            //System.out.print ( n2.getNodeName() );
            //if ( n2 instanceof Text ) // either of these works...
            if ( n2.getNodeType() == Node.TEXT_NODE )
                System.out.print ( " = " + n2.getNodeValue() );
        }
        System.out.println( );
    }
    after the line:
    System.out.print(n.getNodeName());
    in the function 'printElements(Document doc)'.
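
    The same idea (take the element, walk its children, read the value of each text node) can be tried outside DOMSample.java. The sketch below uses the JDK's built-in JAXP parser rather than the Oracle v2 parser from the thread, and the class name (TextNodeDemo), the helper names, and the wrapping <row> element are assumptions made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses JAXP, not the Oracle v2 parser.
class TextNodeDemo {
    // Concatenates the values of the element's direct text-node children.
    static String textOf(Element e) {
        StringBuilder sb = new StringBuilder();
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Parses the XML string and returns the text of the first element with the given tag name.
    static String firstText(String xml, String tagName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element e = (Element) doc.getElementsByTagName(tagName).item(0);
        return textOf(e);
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the <text> element from the question.
        String xml = "<row><text x=\"253\" align=\"center\" bold=\"true\">$93.63</text></row>";
        System.out.println(firstText(xml, "text")); // prints $93.63
    }
}
```

    Checking getNodeType() against Node.TEXT_NODE skips any child elements or comments, which matters if the <text> element ever contains nested markup.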

    SK (guest) wrote:
    XML Team,
    What would be the equivalent code for PL/SQL? I am trying to use xmlparser for PL/SQL and successfully compiled DOMSAMPLE.sql with the following commands:
    ---- from Oracle supplied DOMSAMPLE.sql ------------
    -- get all elements
    nl := xmlparser.getElementsByTagName(doc, '*');
    len := xmlparser.getLength(nl);
    dbms_output.put_line('>>>>>Printing elements ......');
    -- loop through elements
    for i in 0..len-1 loop
    n := xmlparser.item(nl, i);
    dbms_output.put(to_char(i)

  • How Do I Display HTML Formatted Text From A Data Table In Crystal Reports?

    I'm creating reports in Crystal XI.  The information being displayed in the reports comes from data tables where the text is formatted in HTML.
    I've worked with Crystal Reports enough to know that HTML text pulled from a data table doesn't appear in Crystal the same way it does in a web browser.  Crystal Reports ignores all the tags (...unless I'm missing something...) and just displays the text.
    Someone far more Crystal savvy than I (...who I don't have access to...) came up with a Formula Field workaround that tricks Crystal Reports into displaying some basic HTML tags.  Here's that workaround:
    <!--
    stringVar TableName := ;
    TableName := Replace (TableName, "<ul>","<br> <br>");
    TableName := Replace (TableName, "<li>", "<br>   &bull; ");
    TableName := Replace (TableName, "</li>", "");
    TableName := Replace (TableName, "</ul>","<br> <br>");
    TableName := Replace (TableName, "<a", "<u><font color='blue'");
    TableName := Replace (TableName, "</a>", "</font></u>");
    TableName
    -->
    QUESTION - Does any similar workaround exist so I can display an HTML Table in Crystal Reports?  If not, is there any way to display HTML formatted text from a data table in Crystal Reports as it would appear in a web browser?

    Hi Steven,
    To display HTML text in Crystal Reports, follow these steps.
    1. Right click on the field and select Paragraph tab.
    2. Under 'Text Interpretation' select 'HTML Text' and click OK.
    I have tried this approach, but it never works. Please reply if there is any way to solve the issue.

  • Keyword tags text box poorly designed, crashes PSE 8

    I was very interested to see that PSE 8 allows you to enter keyword tags using a text box, as other programs do, promising much faster tagging with large sets of tags.  But the feature has serious bugs rendering it unusable.
    After trying the keyword-tags text box for less than a minute on my large catalog with 426 tags, PSE 8 crashed.    Even with a fresh catalog and the default tag categories, it will crash after using the text box about 40 times.   (See below for a recipe.)
    Another bug concerns tags with commas in them, e.g. "Truckee, CA" (I have a couple hundred such Place tags).  If you type "Tru" in the text box, select the tag with the mouse, and hit enter, PSE 8 will create two new tags, Truckee and CA, in the Other category, and apply them to the photo.
    A design misfeature: After you select and apply a tag with the text box, it loses the keyboard focus.  So to select another tag for the same photo, you have to move your hand to the mouse and click in the text box again.  Partially defeats the whole purpose.
    Recipe for reproducing the bug:
    1. Create a new catalog.
    2. Import 50 photos.
    3. Select the next photo.
    4. In the keyword tag text box, type "p" and then enter.
    5. Go to step 3.
    On the 40th or 41st photo, my PSE 8 reliably crashes (Vista 32, 4 GB of memory).   If after step 1 you use the Keyword Tags > + > From File command to load a tag hierarchy of 210 tags (see the attached file), it will crash after just 9 photos.   And with 426 tags, it crashes after 5 photos.
    Interestingly, on my Vista 64 system (unsupported by Adobe, 8 core x 2.7 GHz), PSE 8 doesn't crash, but the text box gets unusably slow very quickly.  With 210 keyword tags in the catalog, it soon takes about 12 seconds to find a tag after you type a unique prefix.   That time gradually gets longer the more you use the text box.  With 426 tags, it takes almost 20 seconds to apply a tag using the text box.
    On both systems, it's clear what the nature of the bug is: Each time you use the text box, the process's memory grows by many megabytes.  I believe my Vista 64 system doesn't crash (at least not immediately) because it provides 3.5 GB of memory for use by PSE 8, while Vista 32 only provides about 1.5 GB.
    I'm getting a sinking feeling...

    Is the white triangle to the left of "Keyword Tags" pointing down or pointing to the right?  See this screen shot:
    Click on the white triangle -- does that make a difference?
    If not, try deleting the Organizer's preferences:
    http://www.johnrellis.com/psedbtool/photoshop-elements-faq.htm#Delete_the_Organizers_preferences
