Parsing HTML files

Hello,
I have a question about parsing HTML files. Usually when I get an HTML file and need to find all the text in it, I use the code below. It collects all of the hyperlinks and ignores the HTML tags, keeping just the actual text. It's fine for smaller files, but occasionally I'll hit a large online text file; it works, but it's way too slow for large files. I don't need any of this HTML tag stripping for plain text files, however. Is there a way to still grab all the text without doing any tag searching, to make it faster?
thanks,
private void find() throws IOException
{
    // Really slow for large text files. Need a way to just use a regular Scanner on an internet text file.
    new ParserDelegator().parse(new InputStreamReader(myBase.openStream()),
            new ParserListener(),
            true);
}

/**
 * Inner class for processing all "<a href=..>" tags when reading a base URL.
 */
private class ParserListener extends HTMLEditorKit.ParserCallback
{
    final String IGNORED_LINKS = "^(http|mailto|\\W).*";

    @Override
    public void handleStartTag (HTML.Tag t, MutableAttributeSet a, int pos)
    {
        if (t == HTML.Tag.A)
        {
            String href = (String)(a.getAttribute(HTML.Attribute.HREF));
            //System.out.println(href);
            //System.out.println(href.matches(IGNORED_LINKS) + "\t" + href);
            if (! (href == null || href.matches(IGNORED_LINKS)) && !myURLs.contains(href))
                myURLs.add(href);
        }
        //TODO fix: the text of <title> is delivered to handleText, not as an
        //attribute of the TITLE start tag, so this lookup normally returns null.
        if (t == HTML.Tag.TITLE)
        {
            String title = (String) (a.getAttribute(HTML.Attribute.TITLE));
            if (!(title == null))
                myTitle = title;
            else myTitle = "No title was found";
        }
    }

    @Override
    public void handleText (char[] data, int pos)
    {
        // Collect every text run, separated by a space.
        myText.append(" ");
        myText.append(data);
    }
}
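
The //TODO in the listing above concerns the page title. A minimal sketch of one way to capture it with the same callback parser: remember when the parser is inside the TITLE element and collect the text reported in between. The class and method names (TitleFinder, findTitle) are made up for this illustration and are not part of the original program.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original code.
final class TitleFinder {

    /** Callback that keeps the text seen between <title> and </title>. */
    private static final class TitleCallback extends HTMLEditorKit.ParserCallback {
        private boolean inTitle;
        private final StringBuilder title = new StringBuilder();

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t == HTML.Tag.TITLE) {
                inTitle = true;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag t, int pos) {
            if (t == HTML.Tag.TITLE) {
                inTitle = false;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (inTitle) {
                title.append(data);
            }
        }
    }

    /** Returns the trimmed title text, or the same fallback message the original code uses. */
    static String findTitle(Reader html) throws IOException {
        TitleCallback callback = new TitleCallback();
        new ParserDelegator().parse(html, callback, true);
        String title = callback.title.toString().trim();
        return title.isEmpty() ? "No title was found" : title;
    }
}
```

In the original ParserListener the same idea would mean adding a boolean field and a handleEndTag override, and setting myTitle from handleText.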

JFactor2004 wrote:
> My question is. If I know an html file is actually just a txt file
This isn't a question. HTML files are text by definition.
> is it possible to look through it (maybe use something similar to a regular scanner) without doing anything with html.
That depends on what you mean by "doing something with HTML". You can certainly read it one line at a time.
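
Reading one line at a time, as suggested above, might look like the following sketch. It skips the HTML parser entirely and uses a plain java.util.Scanner. The class name PlainTextReader and the UTF-8 charset are assumptions made for the example; the original post does not say how the remote file is encoded.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper, not part of the original code.
final class PlainTextReader {

    /** Collects every line from the source without any tag processing. */
    static String readAll(Readable source) {
        StringBuilder text = new StringBuilder();
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNextLine()) {
                text.append(scanner.nextLine()).append('\n');
            }
        }
        return text.toString();
    }

    /** Opens the URL and reads it as plain text (UTF-8 is an assumption). */
    static String readAll(URL url) throws IOException {
        try (Reader in = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
            return readAll(in);
        }
    }
}
```

In the original find() method, readAll(myBase) could replace the ParserDelegator call whenever the resource is known to be plain text.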

Similar Messages

Parsing html files via an url

Hi,
I already have a Java program that is able to read in HTML files that are stored on my computer's hard drive. Now I would like to expand its functionality so that it can parse HTML files straight from the web.
For example, when the program is run, I would like to be able to give it a URL for a given website. Then, I would like to be able to parse the HTML file that the link goes to.
I've searched the forum, but have not been able to find anything of any real use. If you could offer an overview or point me towards a resource, I would be very grateful.

If you've done things right, you have an HTML reader/parser that takes an InputStream. For files, this would be a FileInputStream.
For URLs, this would be the InputStream you get from URLConnection.getInputStream(). You can get a URLConnection by calling openConnection() on a URL instance (created from your input url of course).
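
As a rough sketch of that advice: the parsing code depends only on InputStream, and the caller decides whether the stream comes from a file or from a URLConnection. The names HtmlSources, openFile, openUrl and parse are invented for this example, and parse is only a stand-in for the existing parser (it simply decodes the stream as UTF-8, which is an assumption).

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program.
final class HtmlSources {

    /** Stream for an HTML file on the local disk. */
    static InputStream openFile(String path) throws IOException {
        return new FileInputStream(path);
    }

    /** Stream for an HTML page on the web, obtained through a URLConnection. */
    static InputStream openUrl(String address) throws IOException {
        URLConnection connection = URI.create(address).toURL().openConnection();
        return connection.getInputStream();
    }

    /** Stand-in for the existing parser: it only decodes the stream into a String. */
    static String parse(InputStream in) throws IOException {
        StringBuilder html = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                html.append(buffer, 0, n);
            }
        }
        return html.toString();
    }
}
```

With this split, parse(openFile(path)) and parse(openUrl(address)) go through exactly the same parsing code.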

Parsing HTML File

Hi.
I'm working on a small app for a French Chuck Norris Facts website. I need to parse the HTML files on their website to extract the facts. I have found this piece of code to open the file. It seems to work, since the contents show up in my NSLog output.
NSString *htmlFile=[NSString stringWithContentsOfFile:@"/Path/to/my/htmlfile.html"];
NSLog(@"%@", htmlFile);
Can anybody give me a clue about how to extract the facts from the HTML file? Here is its structure:
<div class="fact" id="fact6897">Chuck Norris blabla</div>
<div class="fact" id="fact9355">Chuck Norris blablaagain</div>
<div class="fact" id="fact257">Chuck Norris blablablabla</div>
And it goes on. 30 facts per file.
Thanks.
Senly
Message was edited by: Senly

Look at the NSXML classes.
http://developer.apple.com/mac/library/documentation/cocoa/Conceptual/NSXML_Concepts/NSXML.html

In java, can I parse HTML file

and build a DOM tree? I think DOM Level 1 supports HTML, but does Java implement it?
It would be very helpful if you could provide some sample code.
Thanks

Java has a simple parser that can parse HTML 3.2. See this thread for an example:
http://forum.java.sun.com/thread.jsp?forum=31&thread=266798
It also has a callback parser. See this article:
http://java.sun.com/products/jfc/tsc/articles/bookmarks/index.html
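
For illustration, a minimal sketch of the first option, using the Swing HTML classes that ship with the JDK. Note that this builds Swing's own document model (javax.swing.text.html.HTMLDocument), not a W3C DOM tree. The class name LinkLister and the choice to list link targets are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper written for this example.
final class LinkLister {

    /** Parses the HTML into an HTMLDocument and returns the href of every <a> element. */
    static List<String> listLinks(Reader html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Do not abort parsing when the page declares its own charset in a META tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(html, doc, 0);

        List<String> links = new ArrayList<>();
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
        return links;
    }
}
```

The parser targets HTML 3.2, so pages that rely on newer markup may not be represented faithfully.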

Can HTML files be parsed as JSP

I currently have my web-app configured to process html files as JSP. Is it possible to configure NitroX so that it will parse .html files as JSP?
I thought changing the Editor in the Preferences: Workbench->"File Associations" might work, but it didn't.

Thanks, that solution works great. It would be terrific if options such as that were configurable within the NitroX Preferences.
Another great preference setting would be a way to change the specified webroot. (I figured out how to do it manually.) The reason I had to do this was due to complications in getting JSTL support in the NitroX JSP editor for my existing webapp.
I use JSP 1.2 and Resin 2.x for my production server. Since JSTL is automatically provided within the container, I do not have the JSTL TLDs within my WEB-INF subdirectories. In order to get JSTL support in the NitroX JSP Editor (without having to edit my Ant build script, directory/file structure, etc.), I had to create a new project, choose JSTL support, manually edit the .m7project file to specify my existing webroot folder, and then restart NitroX (or close and reopen the project). One nice enhancement would be the ability to specify the webroot directory while creating a new Web Application.
A solution to my dilemma, and for many other people I imagine, would be to provide JSTL support inherently by allowing the developer to set a flag for the project (during and after creation). When setting that flag, the developer could then be given the option of copying the files as well. Then JSTL could be available via the corresponding standard taglib uri, such as the uri http://java.sun.com/jstl/core for the core JSTL taglib.
I am still in the trial period of NitroX. I almost gave up after spending 3-4 hours trying to get JSTL and EL to work. Things finally look promising and I can't wait to see how NitroX for JSF pans out. If some of these issues are taken care of, I imagine our company will buy a couple of licenses.
Thanks,
Jeffrey Lilly
HomeGauge

Preventing CFMX7 to parse all .html files

hi,
I have ColdFusion MX 7 installed and integrated with IIS6 on a
Windows 2003 server. To parse .html files through ColdFusion, I have
added the following lines to the web.xml file
(C:\CFusionMX7\wwwroot\WEB-INF\web.xml):
<servlet-mapping id="macromedia_mapping_14">
<servlet-name>CfmServlet</servlet-name>
<url-pattern>*.html</url-pattern>
</servlet-mapping>
<servlet-mapping id="macromedia_mapping_15">
<servlet-name>CfmServlet</servlet-name>
<url-pattern>*.html/*</url-pattern>
</servlet-mapping>
I have a vhost site in IIS6 named something.abc.com. To make
sure that only the .html files for the something.abc.com site are parsed
through ColdFusion MX 7, I added an .html extension mapping to
C:\CFusionMX7\runtime\lib\wsconfig\jrun_iis6.dll.
After restarting both ColdFusion and IIS6, I found that .html
files are being parsed through the ColdFusion server not only for the
vhost site (something.abc.com) but also for the other sites, which is
not what I need. My requirement is that .html files are parsed only
for those vhost sites that have the .html extension mapping; the other
vhosts' .html files should be served by IIS6.
Any idea how to do this?
Please help.
Mamun

I'm also interested in this.
We have a specific site for which we would like ColdFusion to
process the .html files, but when setting up IIS and the web.xml
file, it seems to set up ColdFusion to process every site's .html files
instead of just the one site.
Is there a way to set the URL-PATTERN in the web.xml to just
match a specific URL? Or does it only match directories and files
under the URL string?
Any guidance is appreciated.
Thanks

Find Info from HTML file

I am trying to develop a program to read URLs and extract specific content from their source. So far my program returns the HTML of a URL and writes it to a file called Results.txt.
I now need to write a program that opens this Results file and extracts the info that appears after certain tags. Some of these files are rather large, to say the least, and parsing HTML files is no simple task compared to files separated by simple white space.
Can anyone advise how I can search an HTML file for a particular tag? Is tokenising the file the answer? If so, how can I define a token, since HTML does not always separate tokens by white space?
Thanks for your help
Ross

Well, OK, I agree with what you say; however, I designed my final-year university project around parsing HTML, and that's what I'm committed to doing now. In hindsight I would have done things differently.
I am having difficulty knowing how to parse the HTML, though. It's not nice to look at. For example, in the HTML below, how would I extract the info after the words "Double rooms from"?
</td></tr>     <tr><td colspan="2"><hr size="1"/>
     <font size="3"><b>Orwell Lodge Hotel, Dalry</b></font>
(2.6 miles / 3.6 km from the centre of Sighthill)
</td></tr>
     <tr><td><img src="http://www.activehotels.com/photos/218697/AAB218697.jpg" border="0" width="96" height="72" alt="hotel" /></td>
     <td><font size="2">Single rooms from: £40.00, Double rooms from: £40.00</font>
     <p /><font size="3"><b>For more details and online booking click here.</b></font>
     <p /><font size="2">Hotel details in other languages:
     <a href="http://www.orwelllodgehotel.activehotels.com/KNW&LANGUAGE=fr&subid=
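
One possible approach for this particular page, offered only as a sketch: since the price follows a fixed label inside a single text run, a regular expression over the raw HTML can pick it out. The class name PriceFinder and the exact pattern are assumptions; a regular expression like this is fragile and would need adjusting if the site changes its wording or writes the currency as an entity such as &pound;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper written for this example.
final class PriceFinder {

    // \u00A3 is the pound sign; it is optional so the match still works without it.
    private static final Pattern DOUBLE_ROOM = Pattern.compile(
            "Double rooms from:\\s*\\u00A3?\\s*(\\d+(?:\\.\\d{2})?)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the amount that follows "Double rooms from:", or null if the label is absent. */
    static String doubleRoomPrice(String html) {
        Matcher m = DOUBLE_ROOM.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

For large files the same method can be applied to one line at a time while reading Results.txt, so the whole file never has to be held in memory.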

How to parse a HTML file using HTML parser in J2SE?

I want to parse an HTML file using an HTML parser. Can anybody help me by providing sample code to parse the HTML file?
Thanks and cheers,
Amaresh

What HTML parser and what does "parsing" mean to you?

PARSING HTML ELEMENTS IN XML FILE? Help please, very urgent

I am getting the input in this form
<ul>
<li>Strategies</li>
<li>Planning</li>
<li>Value</li>
<li>Total Investment</li>
</ul>
I want to convert it into the format below so that the ContentHandler parses the HTML tags. The HTML elements are dynamic.
contentHandler.startElement("", "ul", "ul", attrs);
contentHandler.startElement("", "li", "li", attrs);
contentHandler.characters(value.toCharArray(), 0, value.length());
contentHandler.endElement("", "li", "li");
contentHandler.startElement("", "li", "li", attrs);
contentHandler.characters(value.toCharArray(), 0, value.length());
contentHandler.endElement("", "li", "li");
contentHandler.startElement("", "li", "li", attrs);
contentHandler.characters(value.toCharArray(), 0, value.length());
contentHandler.endElement("", "li", "li");
contentHandler.startElement("", "li", "li", attrs);
contentHandler.characters(value.toCharArray(), 0, value.length());
contentHandler.endElement("", "li", "li");
contentHandler.endElement("", "ul", "ul");
Is there any library through which we can convert HTML tags into ContentHandler calls?
Thanks in Advance
Thanks
Lakhi

Actually I am parsing an XML file, but I have HTML elements inside the XML elements:
<section id='2'><header><line>Agenda( Slide2 )</line></header>
<line>
<h3>Agenda</h3>
<ol>
<li>Overview of ABC Company inc.</li>
<li>Defining and Measuring Employee Engagement</li>
<li>Foresight's Survey Methodology</li>
<li>Online Tools</li>
<li>Standard and Custom Reporting Capabilities</li>
<li>Action Planning and Best Practices</li>
</ol></line></section>
And I am using the ContentHandler interface to parse:
attrs.addCDATAAttribute("id", "" + i);
contentHandler.startElement("", "section", "section", attrs);
attrs.clear();
contentHandler.startElement("", "header", "header", attrs);
contentHandler.startElement("", "line", "line", attrs);
contentHandler.characters(key.toCharArray(), 0, key.length());
contentHandler.endElement("", "line", "line");
contentHandler.endElement("", "header", "header");
contentHandler.startElement("", "line", "line", attrs);
// HERE I need to generate the Java calls for the HTML elements, as I described before,
// for elements like <li>Overview of ABC Company inc.</li>
// <li>Defining and Measuring Employee Engagement</li> ... </ol>
contentHandler.characters(value.toCharArray(), 0, value.length());
contentHandler.endElement("", "line", "line");
contentHandler.endElement("", "section", "section");
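
One way to avoid writing those calls by hand, sketched under the assumption that the embedded HTML fragment is well-formed XML (as the <ul>/<li> and <ol>/<li> samples in this thread are): run the fragment through a SAX parser and forward only the element and character events to the existing ContentHandler. The class and method names (FragmentForwarder, forward) are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper written for this example.
final class FragmentForwarder {

    /**
     * Parses a well-formed fragment and replays its startElement, characters and
     * endElement events on the target. startDocument/endDocument are deliberately
     * not forwarded, because the target is already in the middle of a document.
     */
    static void forward(String fragment, final ContentHandler target)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                target.startElement(uri, localName, qName, atts);
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                target.endElement(uri, localName, qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                target.characters(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(fragment)));
    }
}
```

At the marked place in the listing above, a call such as FragmentForwarder.forward(value, contentHandler) would then replace the single characters(...) call. Real-world HTML that is not well-formed (unclosed tags, entities such as &nbsp;) would make the SAX parser fail, so this only fits input like the samples shown.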

XML parser to parse XML inside HTML file

Hi,
I wish to know whether there are any parsers apart from JAXP that can parse XML content present inside an HTML file. For example,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<form id="j_id_jsp_1394907664_1" name="j_id_jsp_1394907664_1" method="post" action="/msaiphoneportal1.1c/pages/xmlchech.faces;jsessionid=5666F0E1CF0E44B978940F021012AA41" enctype="application/x-www-form-urlencoded">
<input type="hidden" name="j_id_jsp_1394907664_1" value="j_id_jsp_1394907664_1" />
<?xml version="1.0" encoding="UTF-8"?>
<hospital>
<Users>
<User id="1" password="x" type="staff" username="x"/>
<User id="2" password="y" type="staff" username="y"/>
<User id="3" password="z" type="staff" username="z"/>
</Users>
<Survey/>
<Patients>staaatus</Patients>
</hospital>
<input type="hidden" name="j_id_jsp_1394907664_1:j_id_jsp_1394907664_2" /><input type="hidden" name="javax.faces.ViewState" id="javax.faces.ViewState" value="-4298162632826268059:-1507671971163298623" autocomplete="off" />
</form>
</body>
</html>
I need to read the XML content inside. If there is any way to do this, please let me know.
Edited by: DHURAI on Jul 22, 2010 12:59 AM

DHURAI wrote:
> while reading, we can fetch the starting of XML through <?xml> tag, but how we know the ending of the XML as it seems to be dynamic.
1) Extract the document root element which follows the <?xml ... ?>
2) From this root element , construct the associated root element terminal by inserting a / after the <.
3) Search for the terminal.
If the root name can also be the name of an enclosed element then you will have to count the number of terminals.
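
The three steps above, including the nesting count, can be sketched roughly as follows. The class name EmbeddedXml and the regular expressions are assumptions made for the example; in particular, the sketch assumes that no attribute value contains a '>' character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper written for this example.
final class EmbeddedXml {

    /** Returns the text from "<?xml" through the end of the root element, or null if not found. */
    static String extract(String html) {
        int decl = html.indexOf("<?xml");
        if (decl < 0) {
            return null;
        }
        int declEnd = html.indexOf("?>", decl);
        if (declEnd < 0) {
            return null;
        }
        // 1) The root element is the first element that follows the declaration.
        Matcher root = Pattern.compile("<([A-Za-z_][\\w.:-]*)").matcher(html);
        if (!root.find(declEnd + 2)) {
            return null;
        }
        String name = root.group(1);
        // 2) and 3) Look for opening and closing tags with the root's name and
        // keep a depth counter in case the same name is also used for nested elements.
        Pattern tag = Pattern.compile("<(/?)" + Pattern.quote(name) + "(?=[\\s/>])[^>]*?(/?)>");
        Matcher m = tag.matcher(html);
        int depth = 0;
        int pos = root.start();
        while (m.find(pos)) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(2).isEmpty();
            if (closing) {
                depth--;
            } else if (!selfClosing) {
                depth++;
            }
            pos = m.end();
            if (depth == 0) {
                return html.substring(decl, m.end());
            }
        }
        return null; // the root element is never closed
    }
}
```

The extracted string can then be handed to any XML parser, JAXP included.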

Parse local HTML files

Hi,
I want to parse local HTML files.
Is there another way than using Internet Explorer ($ie = new-object -com "InternetExplorer.Application";), without relying on external packages?
At the moment I do something like this:
$ie = new-object -com "InternetExplorer.Application";
Start-Sleep -Seconds 1
$ie.Navigate($srcFile)
Start-Sleep -Seconds 1
$ParsedHtml = $ie.Document
foreach($child in $ParsedHtml.body.getElementsByTagName('table'))
I still want to have methods like 'getElementById()' or 'getElementsByTagName()'.
With my current approach, the performance is not really good, and it seems that the iexplorer.exe process is not terminating at the end of the script.
It also seems to have side effects on running Internet Explorer instances (from the GUI): sometimes starting IE from PowerShell does not work.
Recently I also had a hanging script that did not continue until I manually terminated the iexplorer.exe process.
The error was:
Exception calling "Navigate" with "1" argument(s): "The remote procedure call failed. (Exception from HRESULT: 0x800706BE)"
At D:\Scripts\Run.ps1:529 char:14
+ $ie.Navigate <<<< ($src)
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : ComMethodTargetInvocation
So I would prefer a method for parsing HTML without IE.

Hi John Mano,
Please also try parsing local HTML files with System.Xml.Linq; this script may be helpful for you:
Using PowerShell to parse local HTML files
I hope this helps.

XML?
I thought HTML was not compatible with XML.
And as I don't know LINQ well ...
OK, I'll give it a try later.

I can't answer the question about other ways to parse HTML, but to close your IE session you should do the following:
$ie.Quit() # this terminates the IE process
$ie = $null # this frees the COM object memory

Thanks for that.
I now use that, but there still seem to be some IE processes open ...
Maybe there is a code path where I don't do it.
But in the end I still get this error, and it blocks the whole script ...
Exception calling "Navigate" with "1" argument(s): "The remote procedure call failed. (Exception from HRESULT: 0x800706BE)"
At E:\DailyBuild\Scripts\PublishTestResults.ps1:533 char:15
+ $ie.Navigate <<<< ($srcFile)
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : ComMethodTargetInvocation

How to Parse an HTML File?

Hi all
I want to parse an HTML file?
How is it possible?
After taking an input which is an HTML file, i need to parse it, and i need to print/modify values based on some tags?
Please help me, how to parse an HTML file?

You start by reading the first character and then continuing until you reach the last character.
For a more serious answer, try elaborating on your question. It's really, really vague.

Parse an html file

I want to parse an HTML file, making use of the DOM. Is there any built-in Java package for this? If so, can anybody please give me some examples or links?

You start by reading the first character and then continuing until you reach the last character.
For a more serious answer, try elaborating on your question. It's really, really vague.

How to use Swing html to parse a html file..

I am currently working on a project in Java. I need to read an HTML file, extract the table tag along with its contents if it has a button embedded in it, and replace it with other code.
I tried it with regular expressions in Java, but in vain. Any help would be great. I need an idea of how to use the HTML parser for this problem; any examples will be fine.

Run the program from command line, this way you will see the errors, if any.
example.: java -jar theJarfile.jar
I was successful in creating a comm application. I placed the win32com.dll in the same directory of my application jar file and all worked.
I also extracted the comm.jar , and jar'd my app with the extracted comm files to make one jar.
I also had a fileWriter() to get the clients jre path and my app would write the javax.properties file to the correct place.
It took me several weeks and late nights to accomplish this, but it was all necessary to be able to install only my app, and not a bunch of APIs that were needed.

Parsing a html file with pl/sql

hi,
Does anybody know how I can fetch the source code of an HTML file and parse it with PL/SQL to obtain useful data? Can utl_http solve the problem? If it can, how?
thanks
bahar karaoglu

CREATE OR REPLACE FUNCTION search_pattern_exists(
   url IN VARCHAR2,
   search_pattern IN VARCHAR2 )
   RETURN VARCHAR2
IS
   result         VARCHAR2(8) := 'UNKNOWN';
   http_request   UTL_HTTP.req;
   http_response UTL_HTTP.resp;
   little_buffer VARCHAR2(32767);
   big_buffer     CLOB;
BEGIN
   DBMS_LOB.createTemporary( big_buffer, TRUE );
   http_request := UTL_HTTP.begin_request( url );
   http_response := UTL_HTTP.get_response( http_request );
   <<big_buffer_construction>>
   BEGIN
      LOOP
         little_buffer := NULL;
         UTL_HTTP.read_line( http_response, little_buffer, TRUE);
         IF little_buffer IS NOT NULL THEN
            DBMS_LOB.writeAppend( big_buffer, LENGTH(little_buffer), little_buffer );
         END IF;
      END LOOP;
      UTL_HTTP.end_response( http_response );
   EXCEPTION
       WHEN UTL_HTTP.end_of_body THEN UTL_HTTP.end_response( http_response );
   END big_buffer_construction;
   IF DBMS_LOB.instr( big_buffer, search_pattern ) != 0 THEN
      result := 'TRUE';
   ELSE
      result := 'FALSE';
   END IF;
   DBMS_LOB.freeTemporary( big_buffer );
   RETURN result;
END search_pattern_exists;
SELECT search_pattern_exists( 'http://otn.oracle.com', 'Oracle9i Application Server' )
FROM DUAL;
Michael
