Extract article from HTML code

Hi,
I'm trying to build a search engine for an RSS feed. I'm storing every article as a BLOB field in the database. To optimize my search I need to extract only the article and nothing else (no unrelated hyperlinks or HTML code).
I'm using Swing's HTMLEditorKit to get the HTML content without the markup, but that's not enough. I need to clean the page of things like headers and footers, because they affect the search results.

I already parsed the XML in the RSS (I've got informa); the problem is the article itself.
Take this article for example: [http://news.bbc.co.uk/2/hi/europe/7572635.stm]. You've got the article, but you've also got things all around it, from links to other articles to headers, footers and menus.
All of this affects the search results dramatically, so I just want the article itself.
That's the greatest challenge.

Similar Messages

Problem to extract text from HTML document

I have to extract some text from HTML files into my database (about 1000 files).
The HTML files come from the ACM Digital Library: http://portal.acm.org/dl.cfm
Each HTML page holds the information about one paper. I only want to get the text of "Title", "Abstract", "Classification" and "Keywords".
The problem is that I can't find any pattern for parsing the HTML files.
Example: I need to get Classification = "Theory of Computation", "ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY", "Numerical Algorithms and Problems", "Mathematics of Computing", "NUMERICAL ANALYSIS", etc.
The section of code for "Classification" is below.
Please give me any idea how to do this, or how to find a pattern to extract the text.
<div class="indterms"><a href="#CIT"><img name="top" src=
"img/arrowu.gif" hspace="10" border="0" /></a><span class=
"heading"><a name="IndexTerms">INDEX TERMS</a></span>
<p class="Categories"><span class="heading"><a name=
"GenTerms">Primary Classification:</a></span><br />
&nbsp; <b>F.</b> <a href=
"results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Theory of Computation</a><br />
&nbsp; <img src="img/tree.gif" border="0" height="20" width=
"20" /> <b>F.2</b> <a href=
"results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
COMPLEXITY</a><br />
&nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
"20" width="20" /> <b>F.2.1</b> <a href=
"results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Numerical Algorithms and Problems</a><br />
</p>
<p class="Categories"><span class="heading"><a name=
"GenTerms">Additional&nbsp;Classification:</a></span><br />
&nbsp; <b>G.</b> <a href=
"results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Mathematics of Computing</a><br />
&nbsp; <img src="img/tree.gif" border="0" height="20" width=
"20" /> <b>G.1</b> <a href=
"results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">NUMERICAL ANALYSIS</a><br />
&nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
"20" width="20" /> <b>G.1.6</b> <a href=
"results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Optimization</a><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border=
"0" height="20" width="20" /> <b>Subjects:</b> <a href=
"results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Linear programming</a><br />
</p>
<br />
<p class="GenTerms"><span class="heading"><a name=
"GenTerms">General Terms:</a></span><br />
<a href=
"results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Algorithms</a>, <a href=
"results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Theory</a></p>
<br />
<p class="keywords"><span class="heading"><a name=
"Keywords">Keywords:</a></span><br />
<a href=
"results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Simplex method</a>, <a href=
"results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">complexity</a>, <a href=
"results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">perturbation</a>, <a href=
"results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">smoothed analysis</a></p>
</div>

One approach is to download HTML Parser from SourceForge (http://htmlparser.sourceforge.net/) and write rules to match the title, abstract, etc.
Another approach is to write your own parser that extracts only the title, abstract, etc.:
1. Tokenize the HTML file, i.e. convert the HTML into tokens (tags and values).
2. Write a simple parser to extract the information you want.
Find the pattern that marks the text you want to extract, for instance <p class="abstract">.
Then write a rule for extracting the abstract, such as
if (tag is abstract) then extract abstract text
and apply the same concept to the other tags.
Below is the sample parser that was used to extract the title and abstract from ACM HTML files. Please modify it to include keywords and other fields.
Good luck
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class ACMHTMLParser {
     private String m_filename;
     private URLLexicalAnalyzer lexical;
     List<String> urls = new ArrayList<String>();

     public ACMHTMLParser(String filename) {
          super();
          m_filename = filename;
     }

     /**
      * parses only title and abstract
      */
     public void parse() throws Exception {
          lexical = new URLLexicalAnalyzer(m_filename);
          String word = lexical.getNextWord();
          boolean isabstract = false;
          while (null != word) {
               if (isTag(word)) {
                    if (isTitle(word)) {
                         // the token right after <title> is the title text
                         System.out.println("TITLE: " + lexical.getNextWord());
                    } else if (isAbstract(word) && !isabstract) {
                         parseAbstract();
                         isabstract = true;
                    }
               }
               word = lexical.getNextWord();
          }
          lexical.close();
     }

     public static void main(String[] args) throws Exception {
          ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
          parser.parse();
     }

     public static boolean isTag(String word) {
          return ( word.startsWith("<") && word.endsWith(">"));
     }

     public static boolean isTitle(String word) {
          return ( "<title>".equals(word));
     }

     //please modify according to the html source
     public static boolean isAbstract(String word) {
          return ( "<p class=\"abstract\">".equals(word));
     }

     // prints the first non-tag token that follows the abstract tag
     private void parseAbstract() throws Exception {
          while (true) {
               String abs = lexical.getNextWord();
               if (null == abs) break;   // end of file, no abstract text found
               if (!isTag(abs)) {
                    System.out.println(abs);
                    break;
               }
          }
     }

     // splits the input into tokens: either a whole tag ("<...>") or the text between tags
     class URLLexicalAnalyzer {
       private BufferedReader m_reader;
       private boolean isTag;   // true when scanValue has already consumed the '<' of the next tag

       public URLLexicalAnalyzer(String filename) {
          try {
            m_reader = new BufferedReader(new FileReader(filename));
          } catch (IOException io) {
            System.out.println("ERROR, file not found " + filename);
            System.exit(1);
          }
       }

       public URLLexicalAnalyzer(InputStream in) {
          m_reader = new BufferedReader(new InputStreamReader(in));
       }

       public void close() {
          try {
            if (null != m_reader) m_reader.close();
          } catch (IOException ignored) {}
       }

       public String getNextWord() throws IOException {
          int c = m_reader.read();
          // skip whitespace between tokens (a loop instead of recursion, to avoid deep call stacks)
          while (-1 != c && Character.isWhitespace((char)c)) {
            c = m_reader.read();
          }
          if (-1 == c) return null;
          if ('<' == c || isTag)
            return scanTag(c);
          else
            return scanValue(c);
       }

       private String scanTag(final int c) throws IOException {
          StringBuilder result = new StringBuilder();
          // restore the '<' that scanValue consumed
          if ('<' != c) result.append('<');
          result.append((char)c);
          int ch = -1;
          while (true) {
            ch = m_reader.read();
            if (-1 == ch) throw new IllegalArgumentException("unterminated tag");
            if ('>' == ch) {
                 isTag = false;
                 break;
            }
            result.append((char)ch);
          }
          result.append((char)ch);
          return result.toString();
       }

       private String scanValue(final int c) throws IOException {
            StringBuilder result = new StringBuilder();
            result.append((char)c);
            int ch = -1;
            while (true) {
               ch = m_reader.read();
               if (-1 == ch) break;   // text at the very end of the file: return what was read
               if ('<' == ch) {
                    isTag = true;
                    break;
               }
               result.append((char)ch);
            }
            return result.toString();
       }
     }
}
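The reply asks the reader to extend the parser to keywords and other fields. As a rough sketch of one way to do that, and not part of the original parser, the block below pulls the link texts out of a paragraph with a given class (for example "keywords" or "Categories" in the sample above). The class name KeywordSketch and the method linkTexts are made up for illustration, and the regular expressions assume the markup looks like the ACM sample; other pages would need different patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the parser posted above.
final class KeywordSketch {
    // Text of an <a href=...> link; heading anchors that only carry name= are skipped.
    private static final Pattern LINK_TEXT = Pattern.compile(
            "<a\\s+href=[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // One <p class="NAME"> ... </p> block; DOTALL lets '.' span line breaks.
    private static Pattern block(String cssClass) {
        return Pattern.compile(
                "<p\\s+class=\\s*\"" + Pattern.quote(cssClass) + "\"\\s*>(.*?)</p>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    /** Returns the visible text of every href link inside paragraphs of the given class. */
    static List<String> linkTexts(String html, String cssClass) {
        List<String> out = new ArrayList<>();
        Matcher p = block(cssClass).matcher(html);
        while (p.find()) {
            Matcher a = LINK_TEXT.matcher(p.group(1));
            while (a.find()) {
                // collapse line breaks and runs of spaces inside the link text
                out.add(a.group(1).replaceAll("\\s+", " ").trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p class=\"keywords\"><a name=\"Keywords\">Keywords:</a><br />"
                + "<a href=\"r.cfm?q=1\" target=\"_self\">Simplex method</a>, "
                + "<a href=\"r.cfm?q=2\" target=\"_self\">complexity</a></p>";
        System.out.println(linkTexts(html, "keywords"));
    }
}
```

Calling linkTexts(html, "Categories") on the sample would collect the classification names in the same way. Regular expressions are fragile on arbitrary HTML, so this only suits pages that all come from the same template.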

Create a string from HTML code

Hello to all,
I have created a web browser integrated into a VI project, and from it I can read some values like temperature or humidity. I have also added an HTML window in order to take these values and export them to a string (for example, to check the values from the browser and see each one in a separate constant). Can you help me do this with an example?
Thanks in advance

Below I have attached the code and the HTML code from the sensor. Any recommendations would be useful to me. I have another problem now: in the HTML code you will see that I have three different values to export (temperature, humidity and dew point), but all of them match the same regular expression in my code. Do you know how to solve this? I always get the same temperature value for all three.
Thanks in advance
Attachments:
TempHumid.vi 41 KB
HTMLcode.jpg 68 KB

Parametre from html code

It's my first week with applets, so I was trying to take a parameter from the HTML code and use it as a color, e.g.:
<APPLET code="xxxx.class" width=200 height=200 >
<param name ="name1" value="green"/>
and then in the Java code I need to set this parameter as a color, e.g.:
g.setColor(Color.?);
? = the color given in the parameter
How can I do this?

The following snippet of code assumes that the permissible values
for the applet parameter "name1" are hexadecimal RGB values.
"0xFF0000" would be red.
"0x00FF00" would be green.
"0x0000FF" would be blue.
String s;
Color c;
// get the applet parameter from the HTML file
s = getParameter("name1");
try
{
    int aR, aG, aB; // RGB values
    // skip the "0x" prefix, then read two hex digits per channel
    aR = Integer.valueOf(s.substring(2,4),16).intValue();
    aG = Integer.valueOf(s.substring(4,6),16).intValue();
    aB = Integer.valueOf(s.substring(6,8),16).intValue();
    c = new Color(aR, aG, aB);
}
catch( Exception e )
{
    // missing or malformed parameter: fall back to green
    c = new Color(0x00, 0xFF, 0x00);
}
g.setColor(c);
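The question passes a color name ("green") while the answer expects a hexadecimal value. A possible way to accept both is sketched below. The class ColorParamSketch, the method parseRgb and the small name table are assumptions made for illustration; they are not part of the applet API. The method returns a packed 0xRRGGBB integer so that the sketch has no AWT dependency.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: turns an applet parameter into a packed 0xRRGGBB value.
final class ColorParamSketch {
    // Only a few names are listed here; extend the table as needed.
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("red", 0xFF0000);
        NAMED.put("green", 0x00FF00);
        NAMED.put("blue", 0x0000FF);
    }

    /** Accepts a known color name or "0xRRGGBB"; returns fallback for anything else. */
    static int parseRgb(String param, int fallback) {
        if (param == null) return fallback;
        String s = param.trim().toLowerCase(Locale.ROOT);
        Integer named = NAMED.get(s);
        if (named != null) return named;
        // exactly six hex digits after the "0x" prefix
        if (s.matches("0x[0-9a-f]{6}")) {
            return Integer.parseInt(s.substring(2), 16);
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(parseRgb("green", 0)));
        System.out.println(Integer.toHexString(parseRgb("0xFF0000", 0)));
    }
}
```

Inside the applet the result could be used as g.setColor(new Color(ColorParamSketch.parseRgb(getParameter("name1"), 0x00FF00))), since java.awt.Color has a constructor that takes a packed RGB int.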

To extract data from t.code mm02

Hi,
I want to write a program to extract the data from transaction code MM02 for classification and unit of measure.
In classification there are characteristics and values which I want to extract, and in unit of measure I want to extract the data.
If somebody knows how, please let me know.

Hi Rahul,
You can use the following tables to extract default material characteristics:
CABN, CABNT, CAWN & CAWNT
You can also refer to the following code, where I have used function modules (FMs):
*&--- Select class type available for material table (MARA)
    SELECT klart
            FROM tcla
            INTO TABLE t_klart
            WHERE obtab = c_mara AND
                  klart = '300'.    " Gives material class type
REFRESH t_objkeys.
    LOOP AT t_klart INTO w_klart.
      CLEAR : w_objkeys.
      w_objkeys-class_type     = w_klart.
      w_objkeys-objectkey      = w_matnr2.      "material number
      w_objkeys-objecttable    = c_mara.          " MARA
      APPEND w_objkeys TO t_objkeys.
      CLEAR : w_klart,
              w_objkeys.
    ENDLOOP.
REFRESH : t_allocations, t_class_char1, t_class_char.
*&---FM to get the Class Name for the material
    CALL FUNCTION 'CLBPAX_READ_CLASSIFICATIONS'
      EXPORTING
        t_object_keys = t_objkeys
        keydate       = sy-datum
      IMPORTING
        t_allocations     = t_allocations
        t_valuations_char = t_value_char
        t_valuations_num = t_value_num.
IF t_allocations[] IS NOT INITIAL.
    LOOP AT t_allocations INTO w_allocations WHERE class_type = '300'.
        CALL FUNCTION 'BAPI_CLASS_GETDETAIL'
          EXPORTING
            classtype            = w_allocations-class_type
            classnum             = w_allocations-classnum
          TABLES
            classcharacteristics = t_class_char1
            classcharvalues      = t_class_char.
      ENDLOOP.
    ENDIF.

Local information from html code

I have built an application for which I need websites that are built exclusively for people in the US. When retrieving the index/home pages of websites, some are in HTML, some in JSP and some in ASP.
So I need code, exclusively in Java, to find out which area those websites belong to.
Can somebody help me?
Thank you

Extracting info from HTML documents

My program returns the HTML of any web page entered by the user. The HTML documents that are returned all contain pricing information that I want to extract. Any idea of the best way to search an HTML document for the specific information I require? It seems like a huge task to split it all into tokens and search for the currency sign!

> This is a nightmare of a problem... the HTML files that I am retrieving are huge. All I need from them are a couple of lines of information. How do I find the specific information I need?
Load the entire file and search it. You find the information the same way you would when you look for it in the file's source code.
> Is it possible from a Java program to open the HTML file in a web browser, search, then return the info? The HTML files seem really complex to search.
How would this help?
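The suggestion to load the whole file and search it could look roughly like the sketch below. It assumes that a price is written as a currency symbol followed by digits, with optional thousands separators and decimals (for example a pound sign followed by 1,299.00). The thread does not say what the pages actually look like, so the pattern, the class name PriceFinderSketch and the method findPrices are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans a whole HTML document for price-like strings.
final class PriceFinderSketch {
    /** Returns every occurrence of the currency symbol followed by a number. */
    static List<String> findPrices(String html, char currency) {
        Pattern price = Pattern.compile(
                Pattern.quote(String.valueOf(currency))   // the symbol itself, taken literally
                + "\\s*\\d+(?:,\\d{3})*(?:\\.\\d{1,2})?"); // digits, optional ,ddd groups and decimals
        List<String> found = new ArrayList<>();
        Matcher m = price.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // '\u00A3' is the pound sign; pass whatever symbol the pages really use
        String html = "<td>Price: \u00A31,299.00</td><td>Delivery: \u00A34.95</td>";
        System.out.println(findPrices(html, '\u00A3'));
    }
}
```

The document can be read into a single String first, for instance with new String(Files.readAllBytes(path), charset). Even large pages are usually small enough for this, so no tokenizing is needed when only a few values are wanted.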

Extract textdata from HTML with AUTO_FILTER

Hi,
we're using Oracle AUTO_FILTER to extract text-information from DOC and PDF - Documents.
Works fine.
We also have data stored within a HTML structure.
We use our filter with the following options:
ctx_ddl.create_preference('SEARCH_iMT_ATTRIB_AFILTER', 'AUTO_FILTER');
ctx_ddl.create_policy('SE_IMT_POLICY', 'SEARCH_iMT_ATTRIB_AFILTER');
The filter itself is called within a loop:
CTX_DOC.Policy_Filter('SE_IMT_POLICY', v_blobtab(i), v_ctmp, TRUE);
It seems as if AUTO_FILTER converts our HTML to HTML again.
When trying to insert the AUTO_FILTER output, I get ORA-31167: XML nodes over 64K in size cannot be inserted.
How can I force AUTO_FILTER to return only plain text?
Thanks in advance
Message was edited by:
user557708

You need to create a section group preference employing HTML_SECTION_GROUP and then use it when creating your Policy.
Faisal

Function to extract descr from given code

Could someone kindly help me write the following function? The prototype of the function is
FUNCTION extract_descr(given_val VARCHAR2, table_name VARCHAR2, field_name VARCHAR2) RETURN BOOLEAN.
This function takes the above parameters and returns the description from the table (the table, the field and the code value are the parameters). Thanks.

Hello,
Simply create a cursor on the USER_TAB_COLUMNS view.
e.g.:
Cursor c_cols is
select column_name, data_type, data_length, data_precision, data_scale from user_tab_columns
where table_name = 'EMP';
Francois

Extract URL from HTML text

Suppose you have the following String, which is body text containing HTML.
String bodyText = " My name is Blake. I live in New York City. See my image here: <img href=\"http://www.blake.com/blake.jpg\"/> isn't my picture awesome? Tata for now!";
I want to extract the URL that contains the location of the image in this bodyText. Ideally I would create a function called public String extractor(String bodyText), to be used as
String imageURL = extractor(bodyText);
//imageURL should be "http://www.blake.com/blake.jpg"
My first thought is to use a regular expression, yet the only place I have found to use one is .replace in the String class. I am by no means an expert on regular expressions, so I haven't taken much time to try to figure it out that way. I obviously could do a linear search of the bodyText with a ton of if statements, but that's just poor coding. I want to see if anyone has come across this or has insight into this problem.
Thanks for all the help,
Blake

> How would the regexp change if there were multiple img tags within the String?
I don't rightly know, I'm just a raw beginner in regexes.
> Would this regexp return all the img URLs found in the String?
No, as it stands it would return only the last URL. But this will:
// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
String bodyText = " My name is Blake. " +
      "I live in New York City. See my image here: " +
      "<img href=\"http://www.blake.com/blake.jpg\"/>" +
      " isn't my picture awesome? Here's another: " +
      "<img href='http://www.blake.com/Vandelay.jpg'/>" +
      " Tata for now!";
String regex = "(?<=<img\\shref=[\"'])http://.*?(?=[\"']/?>)";
Pattern pattern = Pattern.compile (regex);
Matcher matcher = pattern.matcher (bodyText);
while (matcher.find ()) {
   System.out.println (matcher.group ());
}
Note the enhancement that takes into account that both single and double quotes are legal in HTML. But unlike the earlier example, this does not tolerate more than one space between <img and href=; I couldn't find a way to achieve that.
Visit this thread later, there are some real regex experts around who may post a better solution.
db
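One possible way around the single-space limitation is to use a capturing group instead of a lookbehind, because Java lookbehinds must have a bounded length while an ordinary group can follow any amount of whitespace or other attributes. The sketch below is an illustration and not the poster's solution. The class ImgUrlSketch and the method extractImageUrls are made-up names, and the pattern accepts both href (as in the thread) and the more usual src attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the URL of every <img> tag in a string.
final class ImgUrlSketch {
    // group 1 = opening quote (single or double), group 2 = the URL;
    // \1 requires the matching closing quote.
    private static final Pattern IMG_URL = Pattern.compile(
            "<img\\b[^>]*?\\b(?:src|href)\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractImageUrls(String bodyText) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_URL.matcher(bodyText);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String bodyText = "See my image here: "
                + "<img href=\"http://www.blake.com/blake.jpg\"/> and another: "
                + "<img    href='http://www.blake.com/Vandelay.jpg'/> Tata for now!";
        for (String url : extractImageUrls(bodyText)) {
            System.out.println(url);
        }
    }
}
```

Because the URL sits in its own group, extra spaces or attributes such as alt between <img and the URL attribute do not matter. A real HTML parser remains the more robust choice for markup that is not under your control.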

Tools for extracting strings from java code for internationalization

For legacy code with lots and lots of strings, what method is typically used to extract strings for internationalization?
Am I right in thinking you couldn't simply grep for strings? It could get pretty complicated, especially with escape characters, escaped quotes, etc.
-SK

When dealing with legacy code, it is nice to have an application that queries you as it extracts the strings. You have a choice whether to accept the string as a localizable entity or not.
There are several tools that do this...including some IDEs like JBuilder. Although it isn't a fully supported or robust tool, Sun has a utility for extracting strings in Java source files:
http://java.sun.com/products/jilkit/.
Regards,
John O'Conner
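To illustrate why a plain grep falls short, and how far a regular expression can get, here is a minimal sketch that finds string literals in Java source text while honouring escaped quotes. It is not one of the tools mentioned above. The class StringLiteralSketch and the method findLiterals are hypothetical names, and the sketch deliberately ignores comments and character literals, which a real extraction tool would have to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists the string literals found in a piece of Java source.
final class StringLiteralSketch {
    // A quote, then any mix of escape sequences (backslash + one char) and
    // ordinary characters (anything except quote, backslash, CR, LF), then a quote.
    private static final Pattern LITERAL =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\\\r\\n])*)\"");

    /** Returns the raw contents of each literal, escapes left as written in the source. */
    static List<String> findLiterals(String source) {
        List<String> literals = new ArrayList<>();
        Matcher m = LITERAL.matcher(source);
        while (m.find()) {
            literals.add(m.group(1));
        }
        return literals;
    }

    public static void main(String[] args) {
        String source = "String a = \"Hello\"; String b = \"say \\\"hi\\\"\";";
        for (String literal : findLiterals(source)) {
            System.out.println(literal);
        }
    }
}
```

A quote inside a comment or a character literal such as '"' would still confuse this pattern, which is one reason an interactive tool that lets you accept or reject each candidate string is the safer route for legacy code.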

Read values from html response

Hi,
I am trying to call an API using the UTL_HTTP POST method over SSL, read the response HTML page and extract values from the response.
I am able to make the call and get a response back in HTML format. I have stored the HTML response in a CLOB variable.
Now I want to parse this HTML, extract the values of the form input items and send them out through OUT parameters.
For example, from the response below I want to extract the value '1111d7nhcwse30wq' from 'I4GO_UNIQUEID'.
Can anyone help me with the code to parse this HTML response and extract the values?
Any help is greatly appreciated.
Thanks
Sharath
sample Code:
PROCEDURE get_token (
p_requesterreference IN VARCHAR2,
p_cardnumber IN VARCHAR2,
p_cardtype IN VARCHAR2,
p_cardholdername IN VARCHAR2,
p_expirationmonth IN VARCHAR2,
p_expirationyear IN VARCHAR2,
p_streetaddress IN VARCHAR2,
p_postalcode IN VARCHAR2,
p_cvv2code IN VARCHAR2,
po_uniqueid OUT VARCHAR2,
po_errorindicator OUT VARCHAR2,
po_primaryerrorcode OUT VARCHAR2,
po_response OUT VARCHAR2,
po_status_code OUT VARCHAR2,
po_reason_phrase OUT VARCHAR2
)
IS
v_url VARCHAR2 (200);
v_url_params VARCHAR2 (32767);
v_resp_str VARCHAR2 (32767);
l_http_req UTL_HTTP.req;
l_http_resp UTL_HTTP.resp;
v_requesterreference VARCHAR2 (12) := p_requesterreference;
v_i4go_cardnumber VARCHAR2 (32) := p_cardnumber;
v_i4go_streetaddress VARCHAR2 (30) := p_streetaddress;
v_i4go_postalcode VARCHAR2 (9) := p_postalcode;
v_i4go_expirationmonth VARCHAR2 (2) := p_expirationmonth; -- MM format
v_i4go_expirationyear VARCHAR2 (2) := p_expirationyear; -- yy format
v_i4go_cvv2code VARCHAR2 (3) := p_cvv2code;
v_name VARCHAR2 (256);
v_value VARCHAR2 (1024);
l_clob CLOB;
pv_amp CONSTANT CHAR (1) := CHR (38);
CURSOR setup_cur
IS
SELECT interface_id, interface_name, interface_url, account_id, site_id
FROM rsv.shift4_setup
WHERE interface_name = 'I4GO';
v_setup_rec setup_cur%ROWTYPE;
BEGIN
OPEN setup_cur;
FETCH setup_cur
INTO v_setup_rec;
CLOSE setup_cur;
v_url := 'https://certify.i4go.com//index.cfm?fuseaction=account.PostCardEntry';
v_url_params :=
pv_amp
|| 'i4GO_AccountID='
|| v_setup_rec.account_id
|| pv_amp
|| 'i4Go_SiteID='
|| v_setup_rec.site_id
|| pv_amp
|| 'i4Go_CardNumber='
|| v_i4go_cardnumber
|| pv_amp
|| 'i4Go_ExpirationMonth='
|| v_i4go_expirationmonth
|| pv_amp
|| 'i4Go_ExpirationYear='
|| v_i4go_expirationyear
|| pv_amp
|| 'i4Go_CVV2Code='
|| v_i4go_cvv2code
|| pv_amp
|| 'i4Go_PostalCode='
|| v_i4go_postalcode;
-- begin request using POST method
UTL_HTTP.set_response_error_check (FALSE);
UTL_HTTP.set_transfer_timeout (180);
UTL_HTTP.set_wallet ('file:/etc/ORACLE/WALLETS/oracle', 'welcome1');
l_http_req := UTL_HTTP.begin_request (v_url, 'POST');
UTL_HTTP.set_header (l_http_req, 'User-Agent', 'Mozilla/4.0');
UTL_HTTP.set_header (l_http_req, 'Content-Type', 'application/x-www-form-urlencoded');
UTL_HTTP.set_header (l_http_req, 'content-length', LENGTH (v_url_params));
UTL_HTTP.write_text (l_http_req, v_url_params);
-- get response
l_http_resp := UTL_HTTP.get_response (l_http_req);
po_status_code := l_http_resp.status_code;
po_reason_phrase := l_http_resp.reason_phrase;
-- read response into a clob
DBMS_LOB.createtemporary (l_clob, FALSE);
BEGIN
LOOP
UTL_HTTP.read_text (l_http_resp, v_resp_str, 32767);
DBMS_LOB.writeappend (l_clob, LENGTH (v_resp_str), v_resp_str);
END LOOP;
EXCEPTION
WHEN UTL_HTTP.end_of_body
THEN
-- end response
UTL_HTTP.end_response (l_http_resp);
END;
-- Free resources
DBMS_LOB.freetemporary (l_clob);
EXCEPTION
WHEN OTHERS
THEN
DBMS_LOB.freetemporary (l_clob);
DBMS_OUTPUT.put_line (UTL_HTTP.get_detailed_sqlerrm);
RAISE;
END;
sample response:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
     <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
     <title>Return With Payment Token</title>
     <script src="js/jquery-1.6.4.min.js" type="text/javascript"></script>
     <script type="text/javascript"></script>
</head>
<body onload="bodyOnLoad();">
     <form name="i4GoMainForm" id="i4GoMainForm" action="http://google.com" method="POST" onsubmit="$('#i4Go_submit').attr('disabled','disabled');">
               <input name="I4GO_RESPONSE" type="hidden" value="SUCCESS" />
               <input name="I4GO_RESPONSECODE" type="hidden" value="1" />
               <input name="I4GO_CARDTYPE" type="hidden" value="VS" />
               <input name="I4GO_UNIQUEID" type="hidden" value="1111d7nhcwse30wq" />
               <input name="I4GO_EXPIRATIONMONTH" type="hidden" value="12" />
               <input name="I4GO_EXPIRATIONYEAR" type="hidden" value="2012" />
               <input name="I4GO_CARDHOLDERNAME" type="hidden" value="" />
               <input name="I4GO_STREETADDRESS" type="hidden" value="" />
               <input name="I4GO_POSTALCODE" type="hidden" value="65000" />
          <div id="scriptDiv" style="font-family:Arial, Helvetica, sans-serif;font-size:18px;visibility:hidden;">
               <img src="images/loading040.gif" alt="Spinner..." />  Loading...
          </div>
          <div id="noScriptDiv" style="font-family:Arial, Helvetica, sans-serif;">
               <noscript>
                                   <h1>Statement of Tokenization</h1>
                                   <p>The payment information you have submitted has been securely stored in the Shift4 PCI-DSS certified data center and a token representing this information will be sent to the merchant for processing. Below is the information that will be returning to the originating merchant:</p>
                                   <ul>
                                        <li>Response: <strong>SUCCESS</strong></li>
                                        <li>Response Code: <strong>1</strong></li>
                                        <li>Card Type: <strong>VS</strong></li>
                                        <li>Token: <strong>1111d7nhcwse30wq</strong></li>
                                   </ul>
               </noscript>
<input type="submit" name="i4Go_submit" id="i4Go_submit" value="Continue" />
          </div>
     </form>
</body>
</html>
Edited by: sgudipat on Apr 24, 2012 1:20 PM

Here is a working example for your HTML that uses XPath to extract the values.
You can store your HTML response in a CLOB variable and then extract the value with XPath.
declare
   l_clob clob;
   l_value varchar2(100);
   l_xml xmltype;
begin
     l_clob :='<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Return With Payment Token</title>
<script src="js/jquery-1.6.4.min.js" type="text/javascript"></script>
<script type="text/javascript"></script>
   </head>
   <body onload="bodyOnLoad();">
   <form name="i4GoMainForm" id="i4GoMainForm" action="http://google.com" method="POST" onsubmit="$(''#i4Go_submit'').attr(''disabled'',''disabled'');">
   <input name="I4GO_RESPONSE" type="hidden" value="SUCCESS" />
   <input name="I4GO_RESPONSECODE" type="hidden" value="1" />
   <input name="I4GO_CARDTYPE" type="hidden" value="VS" />
   <input name="I4GO_UNIQUEID" type="hidden" value="1111d7nhcwse30wq" />
   <input name="I4GO_EXPIRATIONMONTH" type="hidden" value="12" />
   <input name="I4GO_EXPIRATIONYEAR" type="hidden" value="2012" />
   <input name="I4GO_CARDHOLDERNAME" type="hidden" value="" />
   <input name="I4GO_STREETADDRESS" type="hidden" value="" />
   <input name="I4GO_POSTALCODE" type="hidden" value="65000" />
<img src="images/loading040.gif" alt="Spinner..." /> Loading...
   <noscript>
   Statement of Tokenization
   The payment information you have submitted has been securely stored in the Shift4 PCI-DSS certified data center and a token representing this information will be sent to the merchant for processing. Below is the information that will be returning to the originating merchant:
       Response: SUCCESS
       Response Code: 1
       Card Type: VS
       Token: 1111d7nhcwse30wq
   </noscript>
   <input type="submit" name="i4Go_submit" id="i4Go_submit" value="Continue" />
   </form>
   </body>
   </html>';
     execute immediate 'alter session set events =''31156 trace name context forever, level 2''';
     l_xml := xmltype(l_clob);
     execute immediate 'alter session set events =''31156 trace name context off''';
     select extractvalue( l_xml
                        , '/html/body/form/input[@name="I4GO_CARDTYPE"]/@value'
                        , 'xmlns="http://www.w3.org/1999/xhtml"' )
     into l_value
     from dual;
     dbms_output.put_line(l_value);
   end;
Problem when parsing html with xpath and xmltype
Edited by: peterv6i.blogspot.com on Apr 26, 2012 9:38 AM

How extract properties from document saved in blob.

I need to extract all document properties from a document (any file type) saved in a BLOB.
I've tried three ways:
- I've tried using the relational ORDDoc.getProperties() by example, but this method raised the exception ORDSYS.ORDDOCEXCEPTIONS.DOC_PLUGIN_EXCEPTION for every BLOB.
- I can extract some properties with ctxhx.exe and the Meta key, but this method is very expensive: I have to 1) save the BLOB to a file, 2) convert the binary file to an HTML file, and 3) extract the properties from the HTML <meta> and <title> tags.
- I can extract all properties with COM Automation, but this requires installing a lot of applications, one for every document type. That's impossible.
Has anybody done the same task before? Help me, Oracle gurus!

ORDDoc.getProperties() will attempt to see if the media is a video, audio or image and set the appropriate properties.
I am guessing you have a Word-style document or something like that?
Can you tell me about your application and what you are trying to do?
Larry

Generate a thumbnail from HTML by pure Java on Linux without Graphics

Hi - we have a requirement to generate thumbnails from HTML code. The solution must be implemented in pure Java on Linux, where there is no graphics support.
Options tried already:
1. Third-party websites - ruled out by our client.
2. Paid products - ruled out by our client.
3. MediaTracker and other Java APIs - no luck, as there is no support after HTML 4.0.
4. Using any OS-dependent native library - ruled out by our client.
5. Lobo browser - but we are having trouble: sometimes it opens the browser before the screenshot is taken. We worked around this by putting Thread.sleep() in between, saving remote images into a local HTML file, etc. We had some success there, but the problem doesn't end here.
Questions:
1. For point 5 above, our Linux server had graphics support, but in code we set the system property java.awt.headless=true before capturing and generating the thumbnail. If we set this property in the code, does it guarantee that our code will not use any graphics support present in the underlying OS?
2. Is it really possible to generate images in Java on Linux where there is no X Window/X server installed? Are we just wasting time trying to achieve the unachievable?
Any suggestions are most welcome.
Regards,
Sanjeev

Thanks for your response! Yeah, we tried, but our requirements are a little different. We have HTML that we first have to render, and then we have to take a screenshot of whatever output comes. So in order to render the HTML we need a browser first, and I believe every OS that provides browser support has graphics capabilities, because a browser has frames, windows, toolbars, menu bars, etc., which all fall under graphics.
The above is the only way that I know. If there is another way which doesn't require graphics support, please let me know.
So the question basically is: if I follow the above approach (opening a browser and capturing a screenshot), is it possible on Linux with no graphics support? I read on the internet that the Lobo browser (written in Java) supports this kind of feature.

Trim filter for iPlanet - whitespace removing from html

Hi,
I've just published my Trim Filter NSAPI plugin for Sun ONE Web Server/iPlanet.
It is a plugin/filter which removes whitespace from HTML code. It's LGPL and it can be found here: http://www.thrull.com/iplanet/
Feel free to test it :-)
BR,
Igor

Cool, thanks for sharing it!
I took a quick look, and I think I spotted a bug in the write filter method:
        /* workaround for a bug? in iPlanet, when returning empty content */
        if (!out_size && amount) {
            rv = net_write(layer->lower, NON_EMPTY_STRING, 1);
            out_size = 1;
        } else {
            rv = net_write(layer->lower, (const char *)buffer, out_size);
        }
        return rv;
According to the NSAPI Programmer's Guide (http://docs.sun.com/source/817-6252/npgnsapi.html#wp1004627), the write filter method should return the number of bytes consumed on success. It looks like your write filter method is returning the number of bytes written instead. Perhaps that's the reason you needed that workaround?
