Extracting info from HTML documents

My program returns the HTML of any web page entered by the user. The HTML documents that are returned all contain pricing information that I want to extract. Any idea of the best way to search an HTML document for the specific information I require? It seems like a huge task to split it all into tokens and search for the £ sign!

This is a nightmare of a problem. The HTML files that I am retrieving are huge. All I need from them are a couple of lines of information. How do I find the specific information I need?

Load the entire file and search for it. You find the information the same way you would when you look for it in the file's source code.

Is it possible from a Java program to open the HTML file in a web browser, search, then return the info? The HTML files seem really complex to search on.

How would this help?
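The "load the file and search it" suggestion above can be sketched with a regular expression instead of hand-written tokenizing. This is only a minimal sketch: the class name PriceFinder and the price format (a currency symbol followed by digits and an optional two-digit fraction) are assumptions, not something the pages in question are known to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects everything in a page that looks like a price.
class PriceFinder {
    // Assumed format: currency symbol (pound, dollar or euro), optional space,
    // digits, optional two decimals. Thousands separators are not handled.
    private static final Pattern PRICE =
            Pattern.compile("[\u00A3$\u20AC]\\s?\\d+(?:\\.\\d{2})?");

    static List<String> findPrices(String html) {
        List<String> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(html);
        while (m.find()) {
            prices.add(m.group());
        }
        return prices;
    }

    public static void main(String[] args) {
        String html = "<td>Single rooms from: \u00A340.00, Double rooms from: \u00A355.50</td>";
        System.out.println(findPrices(html));
    }
}
```

The whole document is scanned in one pass, so the size of the file matters much less than it would with a token-by-token search.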

Similar Messages

Problem to extract text from HTML document

I have to extract some text from HTML files into my database (about 1000 files).
The HTML files come from the ACM Digital Library: http://portal.acm.org/dl.cfm
Each HTML page holds the information about one paper. I only want to get the text of "Title", "Abstract", "Classification" and "Keywords".
The problem is that I can't find any pattern for parsing the HTML files.
E.g. I need to get Classification = "Theory of Computation", "ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY", "Numerical Algorithms and Problems", "Mathematics of Computing", "NUMERICAL ANALYSIS", etc.
The section of code for "Classification" is below.
Please give me any ideas for doing this, or for finding a pattern to extract the text.
<div class="indterms"><a href="#CIT"><img name="top" src=
"img/arrowu.gif" hspace="10" border="0" /></a><a name="IndexTerms">INDEX TERMS</a>
<a name=
"GenTerms">Primary Classification:</a> 
&nbsp; F. <a href=
"results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Theory of Computation</a> 
&nbsp; <img src="img/tree.gif" border="0" height="20" width=
"20" /> F.2 <a href=
"results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
COMPLEXITY</a> 
&nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
"20" width="20" /> F.2.1 <a href=
"results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Numerical Algorithms and Problems</a> 

<a name=
"GenTerms">Additional&nbsp;Classification:</a> 
&nbsp; G. <a href=
"results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Mathematics of Computing</a> 
&nbsp; <img src="img/tree.gif" border="0" height="20" width=
"20" /> G.1 <a href=
"results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">NUMERICAL ANALYSIS</a> 
&nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
"20" width="20" /> G.1.6 <a href=
"results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Optimization</a> 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border=
"0" height="20" width="20" /> Subjects: <a href=
"results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Linear programming</a> 

 
<a name=
"GenTerms">General Terms:</a> 
<a href=
"results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Algorithms</a>, <a href=
"results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Theory</a>
 
<a name=
"Keywords">Keywords:</a> 
<a href=
"results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Simplex method</a>, <a href=
"results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">complexity</a>, <a href=
"results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">perturbation</a>, <a href=
"results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">smoothed analysis</a>
</div>

One approach is to download HTML Parser from SourceForge (http://htmlparser.sourceforge.net/) and write rules that match the title, abstract, etc.
Another approach is to write your own parser that extracts only the title, abstract, etc.
1. Tokenize the HTML file, i.e. convert the HTML into tokens (tags and values).
2. Write a simple parser to extract the information you need.
Find the pattern of the text you want to extract, for instance a tag such as <class "abstract">.
Then write a rule for extracting the abstract, such as
if (tag is abstract) then extract abstract text
Apply the same concept to the other tags.
Attached is the sample parser that was used to extract the title and abstract from ACM HTML files. Please modify it to include keywords and other fields.
Good luck.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class ACMHTMLParser {
    private String m_filename;
    private URLLexicalAnalyzer lexical;
    List urls = new ArrayList(); // not used by this sample

    public ACMHTMLParser(String filename) {
        super();
        m_filename = filename;
    }

    /**
     * Parses only the title and the abstract.
     */
    public void parse() throws Exception {
        lexical = new URLLexicalAnalyzer(m_filename);
        String word = lexical.getNextWord();
        boolean isabstract = false;
        while (null != word) {
            if (isTag(word)) {
                if (isTitle(word)) {
                    // the token right after <title> is the title text
                    System.out.println("TITLE: " + lexical.getNextWord());
                } else if (isAbstract(word) && !isabstract) {
                    parseAbstract();
                    isabstract = true;
                }
            }
            word = lexical.getNextWord();
        }
        lexical.close();
    }

    public static void main(String[] args) throws Exception {
        ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
        parser.parse();
    }

    public static boolean isTag(String word) {
        return (word.startsWith("<") && word.endsWith(">"));
    }

    public static boolean isTitle(String word) {
        return ("<title>".equals(word));
    }

    // please modify according to the html source:
    // put the tag that precedes the abstract in place of the empty string
    public static boolean isAbstract(String word) {
        return ("".equals(word));
    }

    // prints the first non-tag token that follows the abstract tag
    private void parseAbstract() throws Exception {
        while (true) {
            String abs = lexical.getNextWord();
            if (null == abs) break; // end of file, nothing more to read
            if (!isTag(abs)) {
                System.out.println(abs);
                break;
            }
        }
    }
}

// Splits an HTML file into tokens: tags ("<...>") and the text between them.
class URLLexicalAnalyzer {
    private BufferedReader m_reader;
    private boolean isTag; // true when the previous value ended at a '<'

    public URLLexicalAnalyzer(String filename) {
        try {
            m_reader = new BufferedReader(new FileReader(filename));
        } catch (IOException io) {
            System.out.println("ERROR, file not found " + filename);
            System.exit(1);
        }
    }

    public URLLexicalAnalyzer(InputStream in) {
        m_reader = new BufferedReader(new InputStreamReader(in));
    }

    public void close() {
        try {
            if (null != m_reader) m_reader.close();
        } catch (IOException ignored) {}
    }

    // returns the next tag or value, or null at end of file
    public String getNextWord() throws IOException {
        int c = m_reader.read();
        if (-1 == c) return null;
        if (Character.isWhitespace((char) c)) {
            return getNextWord();
        }
        if ('<' == c || isTag) {
            return scanTag(c);
        } else {
            return scanValue(c);
        }
    }

    private String scanTag(final int c) throws IOException {
        StringBuffer result = new StringBuffer();
        // scanValue() has already consumed the '<', so put it back
        if ('<' != c) result.append('<');
        result.append((char) c);
        int ch = -1;
        while (true) {
            ch = m_reader.read();
            if (-1 == ch) throw new IllegalArgumentException("unterminated tag");
            if ('>' == ch) {
                isTag = false;
                break;
            }
            result.append((char) ch);
        }
        result.append((char) ch); // the closing '>'
        return result.toString();
    }

    private String scanValue(final int c) throws IOException {
        StringBuffer result = new StringBuffer();
        result.append((char) c);
        int ch = -1;
        while (true) {
            ch = m_reader.read();
            if (-1 == ch) throw new IllegalArgumentException("unterminated value");
            if ('<' == ch) {
                isTag = true; // the next token is a tag whose '<' is already read
                break;
            }
            result.append((char) ch);
        }
        return result.toString();
    }
}
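For the "Classification" and "Keywords" part of the question, the values all sit inside <a href=...> links in the posted snippet, so a regular expression over the indterms block may be enough. The following is a rough sketch under that assumption; the class name AnchorTextExtractor is made up, and it relies on the link URLs never containing a ">" character, which holds for the snippet above but is not guaranteed in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns the visible text of every <a href=...> link.
class AnchorTextExtractor {
    // DOTALL lets the link text span line breaks, as it does in the ACM page.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> linkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // Drop nested tags (e.g. <img>) and collapse runs of whitespace.
            String text = m.group(1)
                    .replaceAll("<[^>]*>", "")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }
}
```

Anchors that only carry a name attribute (such as the "Primary Classification:" label) are skipped because the pattern requires href right after <a, and links that wrap only an image yield no text and are dropped.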

Remarks info from marketing document to journal

Hi All
I need a solution for displaying the remarks info from marketing documents in the journal entry.
Regards
Bongani

Hi,
To achieve this you have to use a stored procedure (SP). You may try this:
-- FOR DELIVERY JE (object type 15, table ODLN); for an A/R invoice use object type 13 and table OINV
IF @object_type = '15' AND (@transaction_type = 'A' OR @transaction_type = 'U')
BEGIN
     UPDATE OJDT
     SET U_BD_Remarks = (SELECT Comments FROM ODLN WHERE DocEntry = @list_of_cols_val_tab_del)
     WHERE TransID = (SELECT TransId FROM ODLN WHERE DocEntry = @list_of_cols_val_tab_del)
END
U_BD_Remarks is a UDF in my case.
For every other document you have to change the code accordingly.
Thanks
Ashutosh T

Extracting info from a web page

Hi,
I'm not sure if I'm asking this question in the right forum.
Can anyone tell me if there is a way to extract data from a web page?
For example, a web site such as Yahoo displays stock quotes or NASDAQ values updated almost in real time.
Now, if I want to get that information from the web page into one of my applications, say, something that uses that data, is there a way to do it?
Just curious

Yes, it's possible. You can use the java.net.URL object to connect to websites and download the HTML. Doing the coding is not that easy, and you should also be mindful of not redistributing data you've gotten from another site without permission.
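A minimal sketch of the java.net.URL approach mentioned above might look like this. The class name PageFetcher is made up, and the page is assumed to be UTF-8 encoded; a real program would need to check the charset the server reports and handle errors.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads whatever the URL points at into one String.
class PageFetcher {
    static String fetch(URL url) throws IOException {
        StringBuilder html = new StringBuilder();
        // Assumption: the content is UTF-8 text.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

It could be called as PageFetcher.fetch(java.net.URI.create(address).toURL()), where address is the page to read; the returned string can then be searched for the values of interest.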

Merge option during assembly of PDF from html documents.

Hello,
Can LiveCycle create a combined PDF document by converting HTML documents to PDF, with the option of merging them (eliminating whitespace) during DDX assembly?
Here is a simple case. Combine three html documents such as
html-1: contains text ONE
html-2: contains text TWO
html-3: contains text THREE
The default assembled document appears to have three pages, each containing its single word of text. However, a combined document with one page containing the merged text is desired in some cases.
Does LiveCycle handle this case? Thanks for any insight.
Jesse

Assembler will only deal with PDFs. PDF/G will take non-PDF content and make a PDF out of it. So in your case you would use PDF/G to change the HTML to PDF, then use Assembler to manipulate the three docs into a single doc.
Hope that helps

Extract textdata from HTML with AUTO_FILTER

Hi,
we're using Oracle AUTO_FILTER to extract text-information from DOC and PDF - Documents.
Works fine.
We also have data stored within a HTML structure.
We use our filter with the following options:
ctx_ddl.create_preference('SEARCH_iMT_ATTRIB_AFILTER', 'AUTO_FILTER');
ctx_ddl.create_policy('SE_IMT_POLICY', 'SEARCH_iMT_ATTRIB_AFILTER');
The filter itself is called within a loop:
CTX_DOC.Policy_Filter('SE_IMT_POLICY', v_blobtab(i), v_ctmp, TRUE);
It seems as if AUTO_FILTER converts our HTML to HTML again.
When trying to insert the AUTO_FILTER output, I get ORA-31167: XML nodes over 64K in size cannot be inserted
How can I force the AUTO_FILTER only to return plain-text?
Thanks in advance

You need to create a section group preference employing HTML_SECTION_GROUP and then use it when creating your Policy.
Faisal

Extracting info from webpages

I am trying to create a hotel program that has various features, including finding the cheapest hotel prices on the web. My program searches the web and returns the appropriate web page results in HTML format. My problem is that I'm not sure of the best way to extract the information I want. Below is an example of a web page (I know it's long, but if you copy it into a web browser, it does work, honest!). From this page I want to extract the hotels' names and prices.
http://www.bookings.org/searchresults.html?class_interval=2&country=gb&error_url=http%3A%2F%2Fwww.bookings.org%2Fcountry%2Fgb.html%3F&search_by=city&city=-2595386&region=Avon+Aberdeenshire&class_key=1&class=0&do_availability_check=on&checkin_monthday=21&checkin_year_month=2005-6&checkout_monthday=22&checkout_year_month=2005-6&newlangurl=%2Fcountry%2Fgb.en.html&x=77&y=14
At present I can return the HTML of this page. However, can anyone suggest how I go about extracting the specific info I require? The HTML file is huge!
Regards
Ross

Sorry pasted the wrong url. This one should work.
http://www.bookings.org/searchresults.html?class_interval=2&country=gb&error_url=http%3A%2F%2Fwww.bookings.org%2Fcountry%2Fgb.html%3F&search_by=city&city=-2595386&region=Avon+Aberdeenshire&class_key=1&class=0&do_availability_check=on&checkin_monthday=24&checkin_year_month=2005-3&checkout_monthday=27&checkout_year_month=2005-3&newlangurl=%2Fcountry%2Fgb.en.html&x=88&y=5

Find Info from HTML file

I am trying to develop a program to read URLs and extract specific content from the source of the URLs. So far my program returns the HTML of a URL and writes the HTML to a file called Results.txt.
I now need to write a program that opens up this Results file and extracts the info that appears after certain tags. Some of these files are rather large, to say the least, and parsing HTML files is no simple task compared to files separated by simple white space.
Can anyone advise how I can search an HTML file for a particular tag? Is tokenising the file the answer? If so, how can I define a token, since HTML does not always separate tokens by white space?
Thanks for your help
Ross

Well, OK, I agree with what you say; however, I have designed my final-year university project around parsing HTML and that's what I'm committed to doing now. In hindsight I would have done things differently.
I am having difficulty knowing how to parse the HTML, though. Basically, it's not nice to look at at all. For example, in the HTML below, how would I extract the info after the words "Double rooms from"?
</td></tr> <tr><td colspan="2"><hr size="1"/>
 Orwell Lodge Hotel, Dalry
(2.6 miles / 3.6 km from the centre of Sighthill)
</td></tr>
 <tr><td><img src="http://www.activehotels.com/photos/218697/AAB218697.jpg" border="0" width="96" height="72" alt="hotel" /></td>
 <td>Single rooms from: £40.00, Double rooms from: £40.00
 For more details and online booking click here.
 Hotel details in other languages:
 <a href="http://www.orwelllodgehotel.activehotels.com/KNW&LANGUAGE=fr&subid=
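One way to pull out the value that follows a known label such as "Double rooms from" is a small regular expression built around the label. This is only a sketch: the class name LabelledValueExtractor is made up, and it assumes the value ends at the next comma, tag or line break, as it does in the fragment above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns the text that follows "label:" in a page.
class LabelledValueExtractor {
    static String valueAfter(String html, String label) {
        // Pattern.quote() keeps any special characters in the label literal.
        // The value is assumed to stop at a comma, a '<' or the end of the line.
        Pattern p = Pattern.compile(
                Pattern.quote(label) + "\\s*:?\\s*([^,<\\r\\n]+)",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<td>Single rooms from: \u00A340.00, Double rooms from: \u00A340.00\n"
                + " For more details and online booking click here.</td>";
        System.out.println(valueAfter(html, "Double rooms from"));
    }
}
```

Only the first occurrence of the label is returned, so a page listing several hotels would need a loop over m.find() instead.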

Extract article from HTML code

Hi,
I'm trying to build a search engine for an RSS feed. The thing is, I'm trying to store every article as a BLOB field in the database. To optimize my search I need to extract only the article and nothing else (no unrelated hyperlinks or HTML code).
I'm using Swing's HTMLEditorKit to get the HTML content without the markup, but that's not enough. I need to clean the page of things like headers and footers (they affect the search results).

I already parsed the XML in the RSS (I've got Informa); the problem is the article itself.
Take this article for example: [http://news.bbc.co.uk/2/hi/europe/7572635.stm]. You've got the article, but you've also got things all around it, from links to different articles to headers, footers and menus.
All of this affects the search results dramatically, so I just want the article itself.
That's the greatest challenge.

Extracting Values from XML-Document in pl/sql

Hello!
I need to extract the content of the following extract:
<ns1:OXERPGetArticlesResponse xmlns:ns1="OXERPService">
<ns1:OXERPGetArticlesResult>
<ns1:OXERPType>
<ns1:aResult>
<ns1:ArrayOfString>
<ns1:string>OXID</ns1:string>
<ns1:string>531f91d4ab8bfb24c4d04e473d246d0b</ns1:string>
</ns1:ArrayOfString>
<ns1:ArrayOfString>
<ns1:string>OXARTNUM</ns1:string>
<ns1:string>0601-85-069</ns1:string>
</ns1:ArrayOfString>
<ns1:ArrayOfString>
<ns1:string>OXPRICE</ns1:string>
<ns1:string>100.5</ns1:string>
</ns1:ArrayOfString>
</ns1:aResult>
<ns1:blResult>true</ns1:blResult>
<ns1:sMessage/>
</ns1:OXERPType>
<ns1:OXERPType>
<ns1:aResult>
<ns1:ArrayOfString>
<ns1:string>OXID</ns1:string>
<ns1:string>531a8af7d9a9a5bb53b65a2b9a5356e5</ns1:string>
</ns1:ArrayOfString>
<ns1:ArrayOfString>
<ns1:string>OXARTNUM</ns1:string>
<ns1:string>0601-85-069-1</ns1:string>
</ns1:ArrayOfString>
<ns1:ArrayOfString>
<ns1:string>OXPRICE</ns1:string>
<ns1:string>89.9</ns1:string>
</ns1:ArrayOfString>
</ns1:aResult>
<ns1:blResult>true</ns1:blResult>
<ns1:sMessage/>
</ns1:OXERPType>
</ns1:OXERPGetArticlesResult>
</ns1:OXERPGetArticlesResponse>
The output should be:
OXID OXARTNUM OXPRICE
531f91d4ab8bfb24c4d04e473d246d0b 0601-85-069 100.5
531a8af7d9a9a5bb53b65a2b9a5356e5 0601-85-069-1 89.9
The count of rows and columns is variable.
I want to do this by using xmltype.extract but I found no way to create a loop over the content of the xml document.
Hopefully someone can help me!
Regards
Herbert

OK, then you should be able to use something like :
SQL> var xmldoc clob;
SQL> begin
2 :xmldoc := '<ns1:OXERPGetArticlesResponse xmlns:ns1="OXERPService">
3 <ns1:OXERPGetArticlesResult>
4 <ns1:OXERPType>
5 <ns1:aResult>
6 <ns1:ArrayOfString>
7 <ns1:string>OXID</ns1:string>
8 <ns1:string>531f91d4ab8bfb24c4d04e473d246d0b</ns1:string>
9 </ns1:ArrayOfString>
10 <ns1:ArrayOfString>
11 <ns1:string>OXARTNUM</ns1:string>
12 <ns1:string>0601-85-069</ns1:string>
13 </ns1:ArrayOfString>
14 <ns1:ArrayOfString>
15 <ns1:string>OXPRICE</ns1:string>
16 <ns1:string>100.5</ns1:string>
17 </ns1:ArrayOfString>
18 </ns1:aResult>
19 <ns1:blResult>true</ns1:blResult>
20 <ns1:sMessage/>
21 </ns1:OXERPType>
22 <ns1:OXERPType>
23 <ns1:aResult>
24 <ns1:ArrayOfString>
25 <ns1:string>OXID</ns1:string>
26 <ns1:string>531a8af7d9a9a5bb53b65a2b9a5356e5</ns1:string>
27 </ns1:ArrayOfString>
28 <ns1:ArrayOfString>
29 <ns1:string>OXARTNUM</ns1:string>
30 <ns1:string>0601-85-069-1</ns1:string>
31 </ns1:ArrayOfString>
32 <ns1:ArrayOfString>
33 <ns1:string>OXPRICE</ns1:string>
34 <ns1:string>89.9</ns1:string>
35 </ns1:ArrayOfString>
36 </ns1:aResult>
37 <ns1:blResult>true</ns1:blResult>
38 <ns1:sMessage/>
39 </ns1:OXERPType>
40 </ns1:OXERPGetArticlesResult>
41 </ns1:OXERPGetArticlesResponse>';
42 end;
43 /
PL/SQL procedure successfully completed.
SQL> SELECT x1.rec_id
2 , x2.col_name
3 , x2.col_value
4 FROM XMLTable(
5 XMLNamespaces('OXERPService' as "ns1"),
6 '/ns1:OXERPGetArticlesResponse/ns1:OXERPGetArticlesResult/ns1:OXERPType/ns1:aResult'
7 passing xmltype(:xmldoc)
8 columns rec_id for ordinality
9 , rec_xml xmltype path 'ns1:ArrayOfString'
10 ) x1,
11 XMLTable(
12 XMLNamespaces('OXERPService' as "ns1"),'/ns1:ArrayOfString'
13 passing x1.rec_xml
14 columns col_name varchar2(30) path 'ns1:string[1]'
15 , col_value varchar2(30) path 'ns1:string[2]'
16 ) x2
17 ;
    REC_ID COL_NAME   COL_VALUE
         1 OXID       531f91d4ab8bfb24c4d04e473d246d
         1 OXARTNUM   0601-85-069
         1 OXPRICE    100.5
         2 OXID       531a8af7d9a9a5bb53b65a2b9a5356
         2 OXARTNUM   0601-85-069-1
         2 OXPRICE    89.9
6 rows selected.
You mentioned that the number of columns is not known in advance. That's going to be a problem for presenting the data column-wise.
Version 11g has the PIVOT feature, but still you have to know how many columns there will be in the result set.
How are you going to use the data after extraction?
Maybe we could advise some other techniques more relevant for your requirement.

How to extract elements from a document

I'm new to Java and I'm using JDOM in a JSP page. I've a document like this:
<OFFERTA>
<FACOLTA idFacolta="F1">
<CORSO value="xxx"/>
<CORSO value="yyy"/>
</FACOLTA>
<FACOLTA idFacolta="F2">
<CORSO value="zzz"/>
</FACOLTA>
<FACOLTA idFacolta="F3">
</FACOLTA>
<FACOLTA idFacolta="F4">
</FACOLTA>
</OFFERTA>
I'd like to get a document with the same structure but with the only FACOLTA elements that match a requested value for the idFacolta attribute.
For example, if the requested idFacolta is "F2", I'd like to get:
<OFFERTA>
<FACOLTA idFacolta="F2">
<CORSO value="zzz"/>
</FACOLTA>
</OFFERTA>
I check in a loop if the element matches the idFacolta requested but when I find and try to delete it I get:
java.util.ConcurrentModificationException
 at org.jdom.ContentList$FilterListIterator.checkConcurrentModification(ContentList.java:1230)
 at org.jdom.ContentList$FilterListIterator.hasNext(ContentList.java:942)
Here's my code:
Document doc = builder.build(file);
Element root = doc.getRootElement();
List children = root.getChildren();
Iterator facoltaIterator = children.iterator();
// Check if there's a request
if (id!=null) {
 while (facoltaIterator.hasNext()) {
  Element facolta = (Element) facoltaIterator.next();
  if (!facolta.getAttribute("idFacolta").getValue().equalsIgnoreCase(id)) {
   // Delete
   root.removeContent(facolta);
  }
 }
}
All seems to be right, except the line
root.removeContent(facolta);
So, I tried to make a new document for the results, adding the element facolta:
Document doc = builder.build(file);
Element root = doc.getRootElement();
// Create a new document with the same root
Element rootRisultati = new Element("OFFERTA");
Document docRisultati = new Document(rootRisultati);
List children = root.getChildren();
Iterator facoltaIterator = children.iterator();
if (id!=null) {
 while (facoltaIterator.hasNext()) {
  Element facolta = (Element) facoltaIterator.next();
  if (facolta.getAttribute("idFacolta").getValue().equalsIgnoreCase(id)) {
   // Add the element to the new document
   rootRisultati.addContent(facolta);
  }
 }
}
but this causes a different exception:
org.jdom.IllegalAddException: The element already has an existing parent "OFFERTA"
 at org.jdom.ContentList.add(ContentList.java:190)
 at org.jdom.ContentList.add(ContentList.java:146)
 at java.util.AbstractList.add(AbstractList.java:84)
 at org.jdom.Element.addContent(Element.java:1062)
I'm not able to find a solution.
Is there a commonly used procedure for selecting elements while maintaining the structure of the document?
Can anyone help please? Thanks in advance.
Stefano

maybe you could try posting this in a JDOM forum? JDOM is not a standard XML tool from Sun / for Java... therefore out of the scope of this forum!
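For what it's worth, the first exception in the question comes from changing the child list while an iterator over it is still in use. In JDOM, calling facoltaIterator.remove() instead of root.removeContent(facolta) should avoid it, and for the second attempt adding a clone or a detached copy of the element should avoid the "already has an existing parent" error. The same "collect first, remove afterwards" idea is sketched below with the DOM API that ships with the JDK rather than JDOM, so that it runs without extra libraries; the class name FacoltaFilter is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: keeps only the FACOLTA elements with the requested id.
class FacoltaFilter {
    static Document keepOnly(String xml, String id) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        // Pass 1: only collect the elements to drop; the child list is not touched.
        List<Node> toRemove = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element
                    && "FACOLTA".equals(child.getNodeName())
                    && !id.equalsIgnoreCase(((Element) child).getAttribute("idFacolta"))) {
                toRemove.add(child);
            }
        }

        // Pass 2: remove them once the iteration is finished.
        for (Node child : toRemove) {
            root.removeChild(child);
        }
        return doc;
    }
}
```

The root element and the matching FACOLTA with its CORSO children stay in place, so the structure of the document is preserved.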

Extract info from RFH2 header in MQ message

Hi All,
I'm trying to send a MQ message with RFH header. Purpose is that JMS adapter extracts user data from RFH Header.
A few questions:
1.Format RFH2 header:
-In the MQMD I assign the format: MQFMT_RF_HEADER_2
-Apart from the mandatory fields in the RFH2 header I write user data to it as well and it looks like this:
<usr><msgType dt="string">data</msgType></usr> Is this sufficient or should there be an additional surrounding tag like <MQRFH2>?
2.In PI, I use a JMS Adapter. On the Module-tab I have added AF_Modules/DynamicConfigurationBean (type: Local Enterprise Bean, Module Key: RFHHEADER), after ConvertBinaryToXMBMessage and before CallSapAdapter.
Under Module Configuration I added 2 entries:
RFHHEADER key.0 read http://sap.com/xi/XI/System/JMS DCJMSMessageProperty0
RFHHEADER value.0 msgType
I want to extract the value of msgType and use it.
On the Parameter-tab, I suppose I have to add Adapter-specific message attributes, but it is not exactly clear what I should put there.
3.Dynamic Configuration Bean
Is there anything I should do to activate that Dynamic Configuration Bean? When I send a message, I don't see anything in the monitor of the processed XML messages, not even an error.
Kind Regards
Edmond Paulussen

Edmond Paulussen wrote:
> Hi All,
>
> I'm trying to send a MQ message with RFH header. Purpose is that JMS adapter extracts user data from RFH Header.
>
> A few questions:
> 1.Format RFH2 header:
> -In the MQMD I assign the format: MQFMT_RF_HEADER_2
> -Apart from the mandatory fields in the RFH2 header I write user data to it as well and it looks like this:
> <usr><msgType dt="string">data</msgType></usr>
> On the Parameter-tab, I suppose I have to add Adapter-specific message attributes, but it is not exactly clear what I should put there.
The list of additional parameters are stored in the ASMA field.
So you enter here the header value "msgType" with type String
The value "data" is stored in DCJMSMessageProperty0
You do not need the DynamicConfigurationBean in this scenario.

Extract URL from HTML text

Suppose you have the following String that is body text with HTML.
String bodyText = " My name is Blake. I live in New York City. See my image here: <img href=\"http://www.blake.com/blake.jpg\"/> isn't my picture awesome? Tata for now!";
I want to extract the URL that contains the location of the image in this bodyText. The ideal would be to create a function called public String extractor(String bodyText) to be used
String imageURL = extractor(bodyText);
//imageURL should be "http://www.blake.com/blake.jpg"
My first thought is to use a regular expression, yet the only place I know to use one is the replace methods of the String class. I am by no means an expert on regular expressions, so I haven't taken too much time to try to figure it out that way. I obviously could do a linear search of the bodyText with a ton of if statements, but that's just poor coding. I want to see if anyone has come across this or has insight into this problem.
Thanks for all the help,
Blake

> How would the regexp change if there were multiple img tags within the String.
I don't rightly know, I'm just a raw beginner in regexes.
> Would this regexp return all the img URLs found in the String.
No, as it stands it would return only the last URL. But this will:
String bodyText = " My name is Blake. " +
 "I live in New York City. See my image here: " +
 "<img href=\"http://www.blake.com/blake.jpg\"/>" +
 " isn't my picture awesome? Here's another: " +
 "<img href='http://www.blake.com/Vandelay.jpg'/>" +
 " Tata for now!";
String regex = "(?<=<img\\shref=[\"'])http://.*?(?=[\"']/?>)";
Pattern pattern = Pattern.compile (regex);
Matcher matcher = pattern.matcher (bodyText);
while (matcher.find ()) {
 System.out.println (matcher.group ());
}
Note the enhancement that takes into account that both single and double quotes are legal in HTML. But unlike the earlier example, this does not tolerate more than one space between <img and href=; I couldn't find a way to achieve that.
Visit this thread later, there are some real regex experts around who may post a better solution.
db
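On the "more than one space" limitation: one possible way around it is to drop the lookbehind and lookahead and use a capturing group instead, since the surrounding tag text can then be matched with \s+ and simply left out of the result. A minimal sketch, with the made-up class name ImgUrlExtractor and the same href-on-img markup as in the thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns every http:// URL found in an <img href=...> tag.
class ImgUrlExtractor {
    // Group 1 is the URL; whitespace around href and '=' may be any length.
    private static final Pattern IMG_HREF = Pattern.compile(
            "<img\\s+href\\s*=\\s*[\"'](http://.*?)[\"']\\s*/?>");

    static List<String> extractAll(String bodyText) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMG_HREF.matcher(bodyText);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }
}
```

The extractor(String bodyText) method asked for in the question could return the first element of this list, or null when the list is empty.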

Extract element from PDF document from automatized process

Hi,
I have never worked with PDF documents, and I am looking for a solution for:
extracting elements (just text to begin with) from paragraphs and/or tables in a PDF document,
so that I can then manipulate them in further processing.
Do you have any ideas or information about this (what would be the best way to do it)?
SDK package: is this possible, and if so, which one?
Any solution other than an SDK package?
PDF versions supported: the latest ones?
Any advice about the best development language: Java (which I prefer), or another?
Thanks for all your advice!
Lst

Well, if you want Java - then Adobe only has server-side options for you. We don't offer desktop Java APIs. Our server-side options are part of the Adobe LiveCycle family of products.
For client-side, we have the Adobe Acrobat SDK (which also requires Adobe Acrobat to be installed) or the PDFLibrary SDK (for stand-alone applications). Both are C/C++ based.

Extracting info from v$sqlarea

Hi,
I use Oracle 9i and I am trying to extract what a session is exactly doing in a certain moment.
So, on sqlplus I execute
select a.sid,a.username,b.sql_text
from v$session a inner join v$sqlarea b
on a.sql_address=b.address and
a.sql_hash_value=b.hash_value
where sid=&sid;
but sql_text column is just 1000 characters size and I want the whole statement.
Is there any way to extract this information using sqlplus? I know that Enterprise Manager shows me this information, but to get sql text at the right moment, I would like to use a script.
Thanks in advance
Alex

Hi Arup,
I am writing a Linux shell script that will automatically logmine information from the archive logs in the database.
I want to build a shell script that will make information from the archive logs accessible to users, i.e. automate LogMiner with either shell scripting or a set of
PL/SQL procedures, so that a user can use his reporting tool to specify the start date, end date and all other parameters to spool the information.
You could also help with any ideas you have.
Best Regards
