Parsing HTML files

Hello,
I have a question about parsing HTML files. Usually, when I get an HTML file and need to find all the text in it, I use the code below. It collects all of the hyperlinks and ignores the HTML tags, keeping only the actual text. It's fine for smaller files, but occasionally I'll hit a large online text file; it works, but it's way too slow. I don't need any of this HTML tag stripping for plain text files, though. Is there a way to still grab all the text without doing any tag searching, to make it faster?
thanks,
private void find() throws IOException
{
    // Really slow for large text files. Need a way to just use a regular Scanner on an internet text file.
    new ParserDelegator().parse(new InputStreamReader(myBase.openStream()),
            new ParserListener(),
            true);
}

/**
 * Inner class for processing all "<a href..>" tags when reading a base URL.
 */
private class ParserListener extends HTMLEditorKit.ParserCallback
{
    final String IGNORED_LINKS = "^(http|mailto|\\W).*";

    public void handleStartTag (HTML.Tag t, MutableAttributeSet a, int pos)
    {
        if (t == HTML.Tag.A)
        {
            String href = (String)(a.getAttribute(HTML.Attribute.HREF));
            //System.out.println(href);
            //System.out.println(href.matches(IGNORED_LINKS) + "\t" + href);
            if (! (href == null || href.matches(IGNORED_LINKS)) && !myURLs.contains(href))
                myURLs.add(href);
        }
        //TODO fix: this reads the "title" attribute of the <title> tag, not the text inside the tag
        if (t == HTML.Tag.TITLE)
        {
            String title = (String) (a.getAttribute(HTML.Attribute.TITLE));
            if (!(title == null))
                myTitle = title;
            else myTitle = "No title was found";
        }
    }

    public void handleText (char[] data, int pos)
    {
        myText.append(" ");
        myText.append(data);
    }
}

JFactor2004 wrote:
My question is. If I know an html file is actually just a txt file
This isn't a question. HTML files are text by definition.
is it possible to look through it (maybe use something similar to a regular scanner) without doing anything with html.
That depends on what you mean by "doing something with HTML". You can certainly read it one line at a time.
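
Reading the resource one line at a time, with no tag handling, might look like the sketch below. It assumes that `myBase` in the original post is a `java.net.URL`; the class name `PlainTextFetcher`, the method names, and the UTF-8 charset are illustrative assumptions, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PlainTextFetcher {
    /** Reads everything from the reader line by line, with no HTML handling at all. */
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Line terminators are normalised to '\n'.
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    /** Convenience wrapper for a URL; UTF-8 is an assumption about the remote file. */
    static String readAll(URL url) throws IOException {
        return readAll(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
    }
}
```

Calling `PlainTextFetcher.readAll(myBase)` in place of the `ParserDelegator` call skips tag handling entirely, so it only makes sense when the resource is known to be plain text; any markup in the file would end up in the result unchanged.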

Similar Messages

  • Parsing html files via an url

    Hi,
    I already have a Java program that is able to read in HTML files that are stored on my computer's hard drive. Now I would like to expand its functionality by being able to parse HTML files straight from the web.
    For example, when the program is run, I would like to be able to give it a URL for a given website. Then, I would like to be able to parse the HTML file that the link goes to.
    I've searched the forum, but have not been able to find anything of any real use. If you could offer an overview or point me towards a resource, I would be very grateful.

    If you've done things right, you have an HTML reader/parser that takes an InputStream. For files, this would be a FileInputStream.
    For URLs, this would be the InputStream you get from URLConnection.getInputStream(). You can get a URLConnection by calling openConnection() on a URL instance (created from your input url of course).
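
A minimal sketch of that idea follows. The class `HtmlSource`, its `open` helper, and the stand-in `parse` method are hypothetical names used only for illustration; the point is that the parsing code sees nothing but an `InputStream`, so a file and a URL can be handled the same way.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class HtmlSource {
    /** Opens a local path or an http(s) URL as a plain InputStream. */
    static InputStream open(String location) throws IOException {
        if (location.startsWith("http://") || location.startsWith("https://")) {
            URLConnection connection = URI.create(location).toURL().openConnection();
            return connection.getInputStream();
        }
        return new FileInputStream(location);
    }

    /** Stand-in for the existing parser: all it needs is an InputStream. */
    static String parse(InputStream in) throws IOException {
        try (InputStream stream = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            // UTF-8 is an assumption; a real page may declare another charset.
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

With this shape, `HtmlSource.parse(HtmlSource.open(location))` works whether `location` is a path on disk or a web address.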

  • Parsing HTML File

    Hi.
    I'm working on a little app project to make an app for a French Chuck Norris Facts website. I need to parse the HTML files on their website to extract facts. I have found this piece of code to open the file. It seems to work, since the content shows up in my NSLog output.
    NSString *htmlFile=[NSString stringWithContentsOfFile:@"/Path/to/my/htmlfile.html"];
    NSLog(@"%@", htmlFile);
    Can anybody give me a clue how to extract the facts from the HTML file? Here is its structure:
    < d iv class="fact" i d="fact6897" >Chuck Norris blabla< /d iv>
    < d iv class="fact" i d="fact9355" >Chuck Norris blablaagain< /d iv>
    < d iv class="fact" i d="fact257" >Chuck Norris blablablabla< /d iv>
    I added spaces to the <d iv> to be shown on the forum.
    And it goes on. 30 facts per file.
    Thanks.
    Senly
    Message was edited by: Senly

    Look at the NSXML classes.
    http://developer.apple.com/mac/library/documentation/cocoa/Conceptual/NSXML_Concepts/NSXML.html

  • In java, can I parse HTML file

    and build a DOM tree? I think DOM Level 1 supports HTML, but does Java implement it?
    It would be very helpful if you could provide some sample code.
    Thanks

    Java has a simple parser that can parse HTML 3.2. See this thread for an example:
    http://forum.java.sun.com/thread.jsp?forum=31&thread=266798
    It also has a callback parser. See this article:
    http://java.sun.com/products/jfc/tsc/articles/bookmarks/index.html
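
As a rough illustration of the built-in parser, the sketch below loads a page into a Swing `HTMLDocument` and walks its element tree. Note that this is Swing's own `javax.swing.text.Element` tree, not a W3C DOM, and the class name `SwingHtmlTree` and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class SwingHtmlTree {
    /** Parses the HTML from the reader into a Swing HTMLDocument. */
    static HTMLDocument load(Reader in) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Avoids a ChangedCharSetException when the page carries a charset <meta> tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(in, doc, 0);
        return doc;
    }

    /** Depth-first walk that records the name of every element in the tree. */
    static void collectNames(Element element, List<String> out) {
        out.add(element.getName());
        for (int i = 0; i < element.getElementCount(); i++) {
            collectNames(element.getElement(i), out);
        }
    }
}
```

Starting the walk at `doc.getDefaultRootElement()` visits the whole document. Text runs show up as leaf elements rather than as separate text nodes, which is one of the ways this tree differs from a DOM.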

  • Can HTML files be parsed as JSP

    I currently have my web-app configured to process html files as JSP. Is it possible to configure NitroX so that it will parse .html files as JSP?
    I thought changing the Editor in the Preferences: Workbench->"File Associations" might work, but it didn't.

    Thanks, that solution works great. It would be terrific if options such as that were configurable within the NitroX Preferences.
    Another great preference setting would be a way to change the specified webroot. (I figured out how to do it manually.) The reason I had to do this was due to complications in getting JSTL support in the NitroX JSP editor for my existing webapp.
    I use JSP 1.2 and Resin 2.x for my production server. Since JSTL is automatically provided within the container, I do not have the JSTL TLDs within my WEB-INF subdirectories. In order to get JSTL support in the NitroX JSP Editor (without having to edit my Ant build script, directory/file structure, etc.), I had to create a new project, choose JSTL support, manually edit the .m7project file to specify my existing webroot folder, and then restart NitroX (or close and reopen the project). One great enhancement would be the ability to specify the webroot directory while creating a new Web Application.
    A solution to my dilemma, and for many other people I imagine, would be to provide JSTL support inherently by allowing the developer to set a flag for the project (during and after creation). When setting that flag, the developer could then be given the option of copying the files as well. Then JSTL could be available via the corresponding standard taglib uri, such as the uri http://java.sun.com/jstl/core for the core JSTL taglib.
    I am still in the trial period of NitroX. I almost gave up after spending 3-4 hours trying to get JSTL and EL to work. Things finally look promising and I can't wait to see how NitroX for JSF pans out. If some of these issues are taken care of, I imagine our company will buy a couple of licenses.
    Thanks,
    Jeffrey Lilly
    HomeGauge

  • Preventing CFMX7 to parse all .html files

    hi,
    I have coldfusion MX 7 installed and integrated with IIS6 on
    windows2003 server. To parse .html files through coldfusion, i have
    added the following lines to
    (C:\CFusionMX7\wwwroot\WEB-INF\web.xml) web.xml files
    <servlet-mapping id="macromedia_mapping_14">
    <servlet-name>CfmServlet</servlet-name>
    <url-pattern>*.html</url-pattern>
    </servlet-mapping>
    <servlet-mapping id="macromedia_mapping_15">
    <servlet-name>CfmServlet</servlet-name>
    <url-pattern>*.html/*</url-pattern>
    </servlet-mapping>
    I have a vhost site in IIS6 named something.abc.com. To make
    sure that only .html files for the something.abc.com site will be parsed
    through ColdFusion MX 7, I added an .html extension mapping to
    C:\CFusionMX7\runtime\lib\wsconfig\jrun_iis6.dll.
    After restarting both ColdFusion and IIS6, I found that .html files
    are being parsed through the ColdFusion server not only for the vhost
    site (something.abc.com) but also for other sites, which is not
    what I need. My requirement is that .html files are parsed only
    for those vhost sites that have the .html extension mapping; other
    vhosts' .html files should be served by IIS6.
    Any idea how to do this??
    Please Help..
    Mamun

    I'm also interested in this.
    We have a specific site that we would like to have ColdFusion
    process the .html files, but when setting up IIS and the web.xml
    file it seems to setup coldfusion to process all sites html files
    instead of just the one site.
    Is there a way to set the URL-PATTERN in the web.xml to just
    match a specific URL? Or does it only match directories and files
    under the URL string?
    Any guidance is appreciated.
    Thanks

  • Find Info from HTML file

    I am trying to develop a program to read URLs and extract specific content from their source. So far my program returns the HTML of a URL and writes it to a file called Results.txt.
    I now need to write a program that opens this Results file and extracts the info that appears after certain tags. Some of these files are rather large, to say the least, and parsing HTML files is no simple task compared to files separated by simple white space.
    Can anyone advise how I can search an HTML file for a particular tag? Is tokenising the file the answer? If so, how can I define a token, since HTML does not always separate tokens by white space?
    Thanks for your help
    Ross

    Well, OK, I agree with what you say; however, I have designed my final-year university project around parsing HTML, and that's what I'm committed to doing now. In hindsight I would have done things differently.
    I am having difficulty knowing how to parse the HTML, though. Basically, it's not nice to look at. For example, in the HTML below, how would I extract the info after the words "Double rooms from"?
    </td></tr>     <tr><td colspan="2"><hr size="1"/>
         <font size="3"><b>Orwell Lodge Hotel, Dalry</b></font>
    (2.6 miles / 3.6 km from the centre of Sighthill)
    </td></tr>
         <tr><td><img src="http://www.activehotels.com/photos/218697/AAB218697.jpg" border="0" width="96" height="72" alt="hotel" /></td>
         <td><font size="2">Single rooms from: &pound;40.00, Double rooms from: &pound;40.00</font>     
         <p /><font size="3"><b>For more details and online booking click here.</b></font>
         <p /><font size="2">Hotel details in other languages:
         <a href="http://www.orwelllodgehotel.activehotels.com/KNW&LANGUAGE=fr&subid=
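
For a narrow target like the price after "Double rooms from", one possible shortcut is a regular expression over the raw text rather than a full tokeniser. The sketch below assumes the price is always written as `&pound;` (or a literal pound sign) followed by a number, as in the sample above; the class name `RoomPrices` is hypothetical, and this approach would not generalise to arbitrary tag-based extraction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RoomPrices {
    // Matches e.g. "Double rooms from: &pound;40.00", case-insensitively,
    // accepting either the &pound; entity or a literal pound sign.
    private static final Pattern DOUBLE_ROOM = Pattern.compile(
            "Double rooms from:\\s*(?:&pound;|\u00A3)\\s*(\\d+(?:\\.\\d{1,2})?)",
            Pattern.CASE_INSENSITIVE);

    /** Returns every double-room price found in the HTML, in document order. */
    static List<String> doubleRoomPrices(String html) {
        List<String> prices = new ArrayList<>();
        Matcher m = DOUBLE_ROOM.matcher(html);
        while (m.find()) {
            prices.add(m.group(1));
        }
        return prices;
    }
}
```

Because the pattern never looks at tags, it keeps working when the surrounding `<font>` or `<td>` markup changes, but it breaks as soon as the wording or the currency format does.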
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            

  • How to parse a HTML file using HTML parser in J2SE?

    I want to parse an HTML file using HTML parser. Can any body help me by providing a sample code to parse the HTML file?
    Thanks and Cheers,
    Amaresh

    What HTML parser and what does "parsing" mean to you?

  • PARSING HTML ELEMENTS IN XML FILE? Help please, very urgent

    I am getting the input in this form
    <ul>
    <li>Strategies</li>
    <li>Planning</li>
    <li>Value</li>
    <li>Total Investment</li>
    </ul>
    I want to convert it into the format below so that the ContentHandler receives the HTML tags. The HTML elements are dynamic.
    contentHandler.startElement("", "ul", "ul", attrs);
    contentHandler.startElement("", "li", "li", attrs);
    contentHandler.characters(value.toCharArray(), 0, value.length());
    contentHandler.endElement("", "li", "li");
    contentHandler.startElement("", "li", "li", attrs);
    contentHandler.characters(value.toCharArray(), 0, value.length());
    contentHandler.endElement("", "li", "li");
    contentHandler.startElement("", "li", "li", attrs);
    contentHandler.characters(value.toCharArray(), 0, value.length());
    contentHandler.endElement("", "li", "li");
    contentHandler.startElement("", "li", "li", attrs);
    contentHandler.characters(value.toCharArray(), 0, value.length());
    contentHandler.endElement("", "li", "li");
    contentHandler.endElement("", "ul", "ul");
    Is there any library through which we can convert HTML tags into ContentHandler calls?
    Thanks in Advance
    Thanks
    Lakhi

    Actually i am parsing XML file,but i have HTML elements inside XML elements:
    <section id='2'><header><line>Agenda( Slide2 )</line></header>
    <line>
    <h3>Agenda</h3>
    <ol>
    <li>Overview of ABC Company inc.</li>
    <li>Defining and Measuring Employee Engagement</li>
    <li>Foresight's Survey Methodology</li>
    <li>Online Tools</li>
    <li>Standard and Custom Reporting Capabilities</li>
    <li>Action Planning and Best Practices</li>
    </ol></line></section>
    And i am using Contenthandler interface to parse,
              attrs.addCDATAAttribute("id",""+i);
                   contentHandler.startElement("", "section", "section", attrs);
                   attrs.clear();
                   contentHandler.startElement("", "header", "header", attrs);
                   contentHandler.startElement("", "line", "line", attrs);
                   contentHandler.characters(key.toCharArray(), 0, key.length());
                   contentHandler.endElement("", "line", "line");
                   contentHandler.endElement("", "header", "header");
                   contentHandler.startElement("", "line", "line", attrs);
    /* HERE I need to generate the Java calls for the HTML elements, as I described before, for elements like <li>Overview of ABC Company inc.</li>
    <li>Defining and Measuring Employee Engagement</li> ... </ol> */
                   contentHandler.characters(value.toCharArray(), 0, value.length());
                   contentHandler.endElement("", "line", "line");
                   contentHandler.endElement("", "section", "section");
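
When the embedded HTML is well-formed XML with a single root element, as in the `<ul>` and `<ol>` samples above, one option is to let a SAX parser generate the calls instead of writing them by hand. The sketch below is only that: `FragmentReplayer` and `replay` are hypothetical names, and real-world HTML that is not well-formed XML would have to be cleaned up before it could be fed through this.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class FragmentReplayer {
    /**
     * Parses a well-formed fragment and forwards its element and text events to target.
     * startDocument/endDocument are deliberately not forwarded, because target is
     * assumed to be in the middle of a document already.
     */
    static void replay(String fragment, final ContentHandler target)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is filled in
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts)
                    throws SAXException {
                target.startElement(uri, local, qName, atts);
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                target.endElement(uri, local, qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                target.characters(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(fragment)));
    }
}
```

At the point marked "HERE" in the code above, a call such as `FragmentReplayer.replay(value, contentHandler)` would then emit the `startElement`, `characters`, and `endElement` calls for whatever elements `value` contains.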

  • XML parser to parse XML inside HTML file

    Hi,
    I wish to know whether there are any parsers other than JAXP that can parse XML content present inside an HTML file. For example,
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title></title>
    </head>
    <body>
    <form id="j_id_jsp_1394907664_1" name="j_id_jsp_1394907664_1" method="post" action="/msaiphoneportal1.1c/pages/xmlchech.faces;jsessionid=5666F0E1CF0E44B978940F021012AA41" enctype="application/x-www-form-urlencoded">
    <input type="hidden" name="j_id_jsp_1394907664_1" value="j_id_jsp_1394907664_1" />
    <?xml version="1.0" encoding="UTF-8"?>
    <hospital>
    <Users>
    <User id="1" password="x" type="staff" username="x"/>
    <User id="2" password="y" type="staff" username="y"/>
    <User id="3" password="z" type="staff" username="z"/>
    </Users>
    <Survey/>
    <Patients>staaatus</Patients>
    </hospital>
    <input type="hidden" name="j_id_jsp_1394907664_1:j_id_jsp_1394907664_2" /><input type="hidden" name="javax.faces.ViewState" id="javax.faces.ViewState" value="-4298162632826268059:-1507671971163298623" autocomplete="off" />
    </form>
    </body>
    </html>
    I need to read the XML content inside. Is there any way please let me know
    Edited by: DHURAI on Jul 22, 2010 12:59 AM

    DHURAI wrote:
    while reading, we can fetch the starting of XML through <?xml> tag, but how we know the ending of the XML as it seems to be dynamic.
    1) Extract the document root element which follows the <?xml ... ?>
    2) From this root element , construct the associated root element terminal by inserting a / after the <.
    3) Search for the terminal.
    If the root name can also be the name of an enclosed element then you will have to count the number of terminals.
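
The three steps above can be sketched as follows. The class name `EmbeddedXml` is hypothetical, and the sketch assumes the simple case: the root element is not self-closing and its name does not recur inside the XML, so no counting of terminals is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedXml {
    // A "<" followed by an element name; enough to pick out the first start tag.
    private static final Pattern START_TAG = Pattern.compile("<([A-Za-z_][\\w.-]*)");

    /** Returns the embedded XML, from "<?xml" up to the root's closing tag, or null. */
    static String extract(String page) {
        int declStart = page.indexOf("<?xml");
        if (declStart < 0) return null;
        int declEnd = page.indexOf("?>", declStart);
        if (declEnd < 0) return null;

        // Step 1: the first start tag after the declaration is the root element.
        Matcher m = START_TAG.matcher(page);
        if (!m.find(declEnd + 2)) return null;
        String root = m.group(1);

        // Step 2: build the terminal by inserting "/" after "<".
        String terminal = "</" + root + ">";

        // Step 3: search for the terminal.
        int end = page.indexOf(terminal, m.end());
        if (end < 0) return null;
        return page.substring(declStart, end + terminal.length());
    }
}
```

The returned string can then be handed to any ordinary XML parser, since it no longer contains the surrounding HTML.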

  • Parse local HTML files

    Hi,
    I want to parse local HTML files.
    Is there another way than using Internet Explorer ($ie = new-object -com "InternetExplorer.Application";), without relying on external packages?
    At the moment I do something like that:
    $ie = new-object -com "InternetExplorer.Application";
    Start-Sleep -Seconds 1
    $ie.Navigate($srcFile)
    Start-Sleep -Seconds 1
    $ParsedHtml = $ie.Document
    foreach($child in $ParsedHtml.body.getElementsByTagName('table'))
    I still want to have methods like 'getElementById()' or 'getElementsByTagName()'.
    With my current approach, the performance is not really good, and it seems that the iexplore.exe process is not terminating at the end of the script.
    It also seems to have side effects on running Internet Explorer instances (from the GUI): sometimes IE cannot be started from PowerShell.
    Last time I also had a hanging script, which did not continue until I manually terminated the iexplore.exe process.
    The error was:
    Exception calling "Navigate" with "1" argument(s): "The remote procedure call failed. (Exception from HRESULT: 0x800706BE)"
    At D:\Scripts\Run.ps1:529 char:14
    + $ie.Navigate <<<< ($src)
    + CategoryInfo : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : ComMethodTargetInvocation
    So I would prefer a method of parsing HTML without IE.

    Hi John Mano,
    Please also try to parse local HTML files with System.Xml.Linq; this script may be helpful for you:
    Using PowerShell to parse local HTML files
    I hope this helps.
    XML?
    I thought HTML is not compatible with xml. 
    And as I don't know LINQ good ... 
    Ok, I'll give it a try. later.
    I can't answer the question about other ways to parse HTML, but to close your IE session you should do the following:
    $ie.Quit() # this terminates the IE process
    $ie = $null # this frees the COM object memory
    Thanks for that.
    I now use that, but seems to be still some IEs open ...
    Maybe a path missing where i dont do it.
    But finally I still get this error. And it is blocking the whole script ...
    Exception calling "Navigate" with "1" argument(s): "The remote procedure call failed. (Exception from HRESULT: 0x800706BE)"
    At E:\DailyBuild\Scripts\PublishTestResults.ps1:533 char:15
    + $ie.Navigate <<<< ($srcFile)
    + CategoryInfo : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : ComMethodTargetInvocation

  • How to Parse an HTML File?

    Hi all
    I want to parse an HTML file?
    How is it possible?
    After taking an input which is an HTML file, i need to parse it, and i need to print/modify values based on some tags?
    Please help me, how to parse an HTML file?

    You start by reading the first character and then continuing until you reach the last character.
    For a more serious answer, try elaborating on your question. It's really, really vague.

  • Parse an html file

    I want to parse an html file, making use of the DOM, is there any inbuilt java package for this. if so can anybody please give me some examples or links

    You start by reading the first character and then continuing until you reach the last character.
    For a more serious answer, try elaborating on your question. It's really, really vague.

  • How to use Swing html to parse a html file..

    I am currently working on a project in Java. I need to read an HTML file, extract each table tag along with its contents if it has a button embedded in it, and replace it with other code.
    I tried it with regular expressions in Java, but in vain. Any help would be great. I need an idea of how to use an HTML parser for this problem; any examples would be fine.
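
One way to approach this with the Swing parser is a callback that tracks open tables and remembers which ones contained a button. The sketch below is only an assumption about what "button" means here: it looks for `<input type="button">` and ignores `<button>` elements. The class name `ButtonTableFinder` is hypothetical, and the offsets are whatever positions the parser reports.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ButtonTableFinder extends HTMLEditorKit.ParserCallback {
    /** {startTagOffset, endTagOffset} of each table that directly contains a button. */
    final List<int[]> tablesWithButtons = new ArrayList<>();
    // One entry per currently open table: {startTagOffset, 1 if a button was seen else 0}.
    private final Deque<int[]> open = new ArrayDeque<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TABLE) {
            open.push(new int[] {pos, 0});
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        Object type = a.getAttribute(HTML.Attribute.TYPE);
        boolean button = t == HTML.Tag.INPUT && type != null
                && "button".equalsIgnoreCase(type.toString());
        if (button && !open.isEmpty()) {
            open.peek()[1] = 1; // marks only the innermost open table
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.TABLE && !open.isEmpty()) {
            int[] table = open.pop();
            if (table[1] == 1) {
                tablesWithButtons.add(new int[] {table[0], pos});
            }
        }
    }

    /** Parses the HTML and returns the offset pairs of all tables containing a button. */
    static List<int[]> find(Reader html) throws IOException {
        ButtonTableFinder finder = new ButtonTableFinder();
        new ParserDelegator().parse(html, finder, true);
        return finder.tablesWithButtons;
    }
}
```

The recorded offsets could then drive the replacement step on the original text. Whether they line up exactly with character positions in the source depends on how the parser counts line endings, so that part would need checking against real input.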

    Run the program from command line, this way you will see the errors, if any.
    example.: java -jar theJarfile.jar
    I was successful in creating a comm application. I placed the win32com.dll in the same directory of my application jar file and all worked.
    I also extracted the comm.jar , and jar'd my app with the extracted comm files to make one jar.
    I also had a fileWriter() to get the clients jre path and my app would write the javax.properties file to the correct place.
    It took me several weeks and late nights to accomplish this, but it was all necessary to be able to install only my app, and not a bunch of APIs that were needed.

  • Parsing a html file with pl/sql

    hi,
    does anybody know how can I reach an html file source code and parse it for obtaining a useful data with pl/sql? Can utl_http solve the problem? If it can how?
    thanks
    bahar karaoglu

    CREATE OR REPLACE FUNCTION search_pattern_exists(
       url IN VARCHAR2,
       search_pattern IN VARCHAR2 )
       RETURN VARCHAR2
    IS
       result         VARCHAR2(8) := 'UNKNOWN';
       http_request   UTL_HTTP.req;
       http_response  UTL_HTTP.resp;
       little_buffer  VARCHAR2(32767);
       big_buffer     CLOB;
    BEGIN
       DBMS_LOB.createTemporary( big_buffer, TRUE );
       http_request := UTL_HTTP.begin_request( url );
       http_response := UTL_HTTP.get_response( http_request );
       <<big_buffer_construction>>
       BEGIN
          LOOP
             little_buffer := NULL;
             UTL_HTTP.read_line( http_response, little_buffer, TRUE);
             IF little_buffer IS NOT NULL THEN
                DBMS_LOB.writeAppend( big_buffer, LENGTH(little_buffer), little_buffer );
             END IF;
          END LOOP;
          UTL_HTTP.end_response( http_response );
       EXCEPTION
           WHEN UTL_HTTP.end_of_body THEN UTL_HTTP.end_response( http_response );
       END big_buffer_construction;
       IF DBMS_LOB.instr( big_buffer, search_pattern ) != 0 THEN
          result := 'TRUE';
       ELSE
          result := 'FALSE';
       END IF;
       DBMS_LOB.freeTemporary( big_buffer );
       RETURN result;
    END search_pattern_exists;
    SELECT search_pattern_exists( 'http://otn.oracle.com', 'Oracle9i Application Server' )
    FROM DUAL;
    Michael
