How to extract text in HTML Parser

Hi,
Say I have an HTML file. Now I want to extract the text content that is not inside any tags.

GetHTMLText is a simple class that should help:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTMLText {
     public static void main(String[] args)
          throws Exception {
          EditorKit kit = new HTMLEditorKit();
          Document doc = kit.createDefaultDocument();
          // The Document class does not yet handle charsets properly.
          doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
          // Create a reader on the HTML content.
          Reader rd = getReader(args[0]);
          // Parse the HTML.
          kit.read(rd, doc, 0);
          // The HTML text is now stored in the document.
          System.out.println( doc.getText(0, doc.getLength()) );
     }

     // Returns a reader on the HTML data. If 'uri' begins
     // with "http:", it's treated as a URL; otherwise,
     // it's assumed to be a local filename.
     static Reader getReader(String uri)
          throws IOException {
          // Retrieve from Internet.
          if (uri.startsWith("http:")) {
               URLConnection conn = new URL(uri).openConnection();
               return new InputStreamReader(conn.getInputStream());
          }
          // Retrieve from file.
          else {
               return new FileReader(uri);
          }
     }
}

Similar Messages

How to extract text from a PDF file?

Hello Suners,
I need to know how to extract text from a PDF file.
Does anyone know what the character encoding in a PDF file is? When I use an input stream to read the file, it gives encrypted-looking characters, not the original text in the file.
Is there any procedure I should follow while reading a PDF file?
File f=new File("D:/File.pdf");
FileReader fr=new FileReader(f);
BufferedReader br=new BufferedReader(fr);
String s=br.readLine();
Any help will be deeply appreciated.

jverd wrote:
> First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading.
Oops, you are absolutely right, that was a silly mistake to forget that.
> Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.
getPageContent returns an array of bytes, so the question will be:
how to get text from this array? I was thinking of:
    private void jButton1_actionPerformed(ActionEvent e) {
        PdfReader read;
        StringBuffer buff=new StringBuffer();
        try {
            read = new PdfReader("d:/getjobid2727.pdf");
            read.getMetaData();
            byte[] data=read.getPageContent(1);
            int i=0;
            while(i>-1){
                buff.append(data);
                i++;
            }
            String str=buff.toString();
            FileOutputStream fos = new FileOutputStream("D:/test.txt");
            Writer out = new OutputStreamWriter(fos, "UTF8");
            out.write(str);
            out.close();
            read.close();
        } catch (Exception f) {
            f.printStackTrace();
        }
    }
"D:/test.txt" hasn't been created when I ran the program!
Are my steps right?
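A small sketch of two things that stand out in the code above, using only the standard library. It assumes nothing about iText itself; the sample bytes are made up to resemble a PDF content stream. First, appending a byte[] to a StringBuffer stores the array's identity string, not its contents, and a `while(i>-1)` loop with `i++` keeps running for a very long time, so the file-writing code is never reached. Second, decoding the bytes gives readable characters, but a page content stream is most likely PDF drawing operators rather than plain text, so it would still need a real text extractor.

```java
import java.nio.charset.StandardCharsets;

final class ByteDecodeDemo {
    // Appending a byte[] resolves to append(Object), which stores the
    // array's identity string (something like "[B@1b6d3586"), not the data.
    static String appendArray(byte[] data) {
        return new StringBuilder().append(data).toString();
    }

    // Decoding converts the bytes to characters. ISO-8859-1 is used here
    // because it maps every byte to exactly one char (an assumption for the demo).
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical sample: what a tiny PDF content stream might look like.
        byte[] data = "BT (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(appendArray(data)); // identity string, not the text
        System.out.println(decode(data));      // the operators, still not plain text
    }
}
```

Even after decoding, the result contains operators such as BT, Tj and ET around the text, which is why a dedicated PDF text extraction API is usually the better route.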

How to extract text from a PDF file using php?

How to extract text from a PDF file using php?
thanks
fabio

> Do you know of any other way this can be done?
There are many ways, but this is out of scope for this forum. You can try this forum: http://forum.planetpdf.com/

How to extract TEXT for archived Purchase Orders ?

Hi Friends,
Can anyone tell me how to extract TEXT for archived Purchase Orders?
I have used READ_TEXT, but it does not fetch texts for archived POs. Whenever I try to fetch data from STXH for an archived PO, no value comes back and the result is SY-SUBRC <> 0.
Any demo code will be highly appreciated.
Thanks in advance..
Sivaji

Hi,
You can see that table STXH is linked to archiving object MM_EKKO (you can check this in tcode DB15).
My suggestion is that you read the data from the archive. See the demo object BC_SBOOK in tcode AOBJ, where you can find the report that reloads data. The aim is to get the data into an internal table. In report SBOOKR, for example, you can see this function module call:
* get data records from the data container
* SBOOK
 CALL FUNCTION 'ARCHIVE_GET_TABLE'
 EXPORTING
 archive_handle = lv_handle
 record_structure = 'SBOOK'
 all_records_of_object = 'X'
 TABLES
 table = lt_sbook_tmp
 EXCEPTIONS
 end_of_object = 0. "not entries of this type
* check lt_sbook_tmp entries against selections. Delete not
* requested entries
 LOOP AT lt_sbook_tmp ASSIGNING <ls_sbook>
 WHERE carrid IN s_carrid
 AND connid IN s_connid
 AND fldate IN s_fldate.
 APPEND <ls_sbook> TO lt_sbook.
 ENDLOOP.
 REFRESH lt_sbook_tmp.
The idea is that you get the same data that READ_TEXT would normally handle (because the data is no longer in the database) and recover the text from it.
I hope this helps you.
Regards
Eduardo

How to extract text of info object

How do I extract the text of an info object?
Example: the text of the project definition from 0PROJECT

Hi Siri,
I don't think you can display the text if you display the data in the DSO.
In the DSO, you will see only the key part.
So you don't have to load the infoobject text into the DSO; you just have to load the infoobject.
In BEx you have the option to see the key, the text, or both.
Refer to the thread below for details.
Link: Loading Master Data Text to DSO
Hope it helps clear up your doubt.
Regards,
Nikhil Joy

How to Extract TEXT ONLY from HTML ?

I am developing a speech-enabled browser, and what I would like to do is read aloud all the text within a webpage. My problem is how to get only the text, not the HTML tags, from the webpage. A similar question has been asked before in this forum, but none of the given suggestions seem to work. Any help would be greatly appreciated.
Is there anyone out there who is also using the speech package? Is there any forum for the Java speech package?

Don't know about the speech part, but the text parsing is pretty simple if you just want the text. You just take the string, run through it char by char, and remove the stuff between the < and > chars.

Also, you'd have to unescape anything that was escaped for HTML, such as &amp; should be replaced by & and &eacute; should be replaced by é, and so on.
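The two suggestions above (drop everything between < and >, then unescape entities) can be sketched roughly as follows. This is only an illustration: the class and method names are made up, only a handful of entities are handled, and it does not cope with comments, script blocks, or a > inside an attribute value.

```java
final class TagStripper {
    // Walks the input char by char and keeps only what lies outside <...>.
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
            }
        }
        return unescape(out.toString());
    }

    // Replaces a few common entities. "&amp;" goes last so that text such as
    // "&amp;lt;" becomes the literal "&lt;" instead of being unescaped twice.
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&nbsp;", " ")
                .replace("&eacute;", "\u00e9")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Caf&eacute; &amp; <b>bar</b></p>"));
    }
}
```

For real web pages a proper HTML parser is usually more robust than this kind of scan.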

Problem to extract text from HTML document

I have to extract some text from HTML files into my database (about 1000 files).
The HTML files come from the ACM Digital Library. http://portal.acm.org/dl.cfm
Each HTML page holds the information about one paper. I only want to get the text of "Title", "Abstract", "Classification" and "Keywords".
The problem is that I can't find any pattern for parsing the HTML files.
E.g. I need to get Classification = "Theory of Computation", "ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY", "Numerical Algorithms and Problems", "Mathematics of Computing", "NUMERICAL ANALYSIS", etc.
The HTML for the "Classification" section is below.
Please give me any ideas on how to do this, or on how to find a pattern for extracting text from it.
<div class="indterms"><a href="#CIT"><img name="top" src=
"img/arrowu.gif" hspace="10" border="0" /></a><a name="IndexTerms">INDEX TERMS</a>
<a name=
"GenTerms">Primary Classification:</a> 
&nbsp; F. <a href=
"results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Theory of Computation</a> 
&nbsp; <img src="img/tree.gif" border="0" height="20" width=
"20" /> F.2 <a href=
"results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
COMPLEXITY</a> 
&nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
"20" width="20" /> F.2.1 <a href=
"results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Numerical Algorithms and Problems</a> 

<a name=
"GenTerms">Additional&nbsp;Classification:</a> 
&nbsp; G. <a href=
"results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Mathematics of Computing</a> 
&nbsp; <img src="img/tree.gif" border="0" height="20" width=
"20" /> G.1 <a href=
"results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">NUMERICAL ANALYSIS</a> 
&nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
"20" width="20" /> G.1.6 <a href=
"results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Optimization</a> 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border=
"0" height="20" width="20" /> Subjects: <a href=
"results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Linear programming</a> 

 
<a name=
"GenTerms">General Terms:</a> 
<a href=
"results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Algorithms</a>, <a href=
"results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Theory</a>
 
<a name=
"Keywords">Keywords:</a> 
<a href=
"results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Simplex method</a>, <a href=
"results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">complexity</a>, <a href=
"results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">perturbation</a>, <a href=
"results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">smoothed analysis</a>
</div>

One approach is to download HTML Parser from SourceForge
http://htmlparser.sourceforge.net/ and write rules to match the title, abstract, etc.
Another approach is to write your own parser that extracts only the title, abstract, etc.
1. Tokenize the HTML file, i.e. convert the HTML into tokens (tags and values).
2. Write a simple parser to extract the information you need.
Find the pattern of the text you want to extract, for instance a tag with class "abstract".
Then write a rule for extracting the abstract, such as
if (tag is abstract) then extract abstract text
and apply the same concept to the other tags.
Below is the sample parser that was used to extract the title and abstract from ACM HTML files. Please modify it to include keywords and other fields.
Good luck.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class ACMHTMLParser {
    private String m_filename;
    private URLLexicalAnalyzer lexical;
    List urls = new ArrayList();

    public ACMHTMLParser(String filename) {
        super();
        m_filename = filename;
    }

    /**
     * parses only title and abstract
     */
    public void parse() throws Exception {
        lexical = new URLLexicalAnalyzer(m_filename);
        String word = lexical.getNextWord();
        boolean isabstract = false;
        while (null != word) {
            if (isTag(word)) {
                if (isTitle(word)) {
                    System.out.println("TITLE: " + lexical.getNextWord());
                } else if (isAbstract(word) && !isabstract) {
                    parseAbstract();
                    isabstract = true;
                }
            }
            word = lexical.getNextWord();
        }
        lexical.close();
    }

    public static void main(String[] args) throws Exception {
        ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
        parser.parse();
    }

    public static boolean isTag(String word) {
        return ( word.startsWith("<") && word.endsWith(">"));
    }

    public static boolean isTitle(String word) {
        return ( "<title>".equals(word));
    }

    //please modify according to the html source
    //(the empty string is a placeholder for the tag that opens the abstract)
    public static boolean isAbstract(String word) {
        return ( "".equals(word));
    }

    // Prints the first non-tag token that follows the abstract tag.
    private void parseAbstract() throws Exception {
        while (true) {
            String abs = lexical.getNextWord();
            if (abs == null) break; // end of input
            if (!isTag(abs)) {
                System.out.println(abs);
                break;
            }
        }
    }
}

// Splits the input into tokens: whole tags ("<...>") and the text between them.
class URLLexicalAnalyzer {
    private BufferedReader m_reader;
    private boolean isTag; // true when the previous token ended on a '<'

    public URLLexicalAnalyzer(String filename) {
        try {
            m_reader = new BufferedReader(new FileReader(filename));
        } catch (IOException io) {
            System.out.println("ERROR, file not found " + filename);
            System.exit(1);
        }
    }

    public URLLexicalAnalyzer(InputStream in) {
        m_reader = new BufferedReader(new InputStreamReader(in));
    }

    public void close() {
        try {
            if (null != m_reader) m_reader.close();
        } catch (IOException ignored) {}
    }

    public String getNextWord() throws IOException {
        int c = m_reader.read();
        if (-1 == c) return null;
        if (Character.isWhitespace((char)c)) {
            return getNextWord();
        }
        if ('<' == c || isTag) {
            return scanTag(c);
        } else {
            return scanValue(c);
        }
    }

    private String scanTag(final int c)
            throws IOException {
        StringBuffer result = new StringBuffer();
        // The '<' was already consumed by scanValue, so put it back.
        if ('<' != c) result.append('<');
        result.append((char)c);
        int ch = -1;
        while (true) {
            ch = m_reader.read();
            if (-1 == ch) throw new IllegalArgumentException("un-terminate tag");
            if ('>' == ch) {
                isTag = false;
                break;
            }
            result.append((char)ch);
        }
        result.append((char)ch); // the closing '>'
        return result.toString();
    }

    private String scanValue(final int c) throws IOException {
        StringBuffer result = new StringBuffer();
        result.append((char)c);
        int ch = -1;
        while (true) {
            ch = m_reader.read();
            if (-1 == ch) throw new IllegalArgumentException("un-terminate value");
            if ('<' == ch) {
                isTag = true;
                break;
            }
            result.append((char)ch);
        }
        return result.toString();
    }
}
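For the classification and keyword terms in the question, one pattern that seems to hold in the posted snippet is that every wanted term is the text of an <a href=...> link, while the section labels use <a name=...>. A rough sketch along those lines, using a regular expression from the standard library, is shown here. The class and method names are made up, and a regex is only a shortcut that assumes the pages all look like the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkTextExtractor {
    // Matches <a href=...>text</a>; the attribute part may span several lines.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=[^>]*>(.*?)</a>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text of every href link, with line breaks collapsed to spaces.
    static List<String> linkTexts(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String text = m.group(1).replaceAll("\\s+", " ").trim();
            // Skip links that wrap only an image or are empty.
            if (!text.isEmpty() && !text.startsWith("<")) {
                out.add(text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a name=\"GenTerms\">Primary Classification:</a> F. "
                + "<a href=\"results.cfm?query=x\"\ntarget=\"_self\">Theory of Computation</a>";
        System.out.println(linkTexts(html));
    }
}
```

To separate "Classification" from "Keywords", the input could first be cut at the <a name="Keywords"> label before running the extraction on each part.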

How to include text as HTML elements (see DOMElement)

I am working with Flash PRO CC v. 14.0. to convert my Flash website to HTML5 / javascript
I have converted a file to the HTML5 Canvas
I am very happy that the new Flash Pro has the feature to convert to HTML5 canvas
HOWEVER:
In my original .FLA file project I use only one font: Copperplate Bold. I use several sizes of that font within the project / scene
In the original file for all text I use static text, Letter spacing, AntiAlias, AutoKern and single line (Linetype)
- none of which the HTML5 canvas seem to allow / support?
How do I maintain the FONT look that I have chosen in my original FLASH project, after I convert to HTML5 canvas?
Is there a way in the HTML canvas to maintain the FONT look that I want?
HTML5 canvas will not allow Font embedding
The device font destroys the LOOK of my Copperplate Bold font.
How do I include text as HTML elements (see DOMElements)?
WARNINGS generated when I convert the original file into an HTML Canvas:
Warnings generated while copying/importing in 140827a HTML test.fla:
* AntiAlias is not supported in HTML5 Canvas document, and has been converted to DeviceFonts in an instance of Text.
* AutoKern is not supported in HTML5 Canvas document, and has been removed in an instance of Text.
* Frame Scripts have been commented
* LetterSpacing is not supported in HTML5 Canvas document, and has been converted to 0.0 in an instance of Text.
* LineType is not supported in HTML5 Canvas document, and has been converted to MultiLineNoWrap in an instance of Text.
* Some artwork contains Hairline stroke, which is not supported in HTML5 Canvas document, and has been converted to Solid.
* StaticText is not supported in HTML5 Canvas document, and has been converted to DynamicText in an instance of Text.
New HTML Canvas Document created.
NOTE: So far the only way I have been able to maintain the font look is to convert the fonts to .png files
This is painstaking work that I would like to avoid.
Even then I still get a WARNING when I test my scene - (no doubt because I left the original FONT text in guide layers)
After conversion ON TEST SCENE:
WARNINGS:
Frame numbers in EaselJS start at 0 instead of 1. For example, this affects gotoAndStop and gotoAndPlay calls. (18)
Only circular (not oval) radial gradients are supported. (85)
Text support is limited. It is generally recommended to include text as HTML elements (see DOMElement). (6)
Color effects are published as a filter and subject to the same limitations. (4)
Filters are very expensive and are not updated once applied. Cache as bitmap is automatically enabled when a filter is applied. This can prevent animations from updating. (2)
Content with both Bitmaps and Buttons may generate local security errors in some browsers if run from the local file system.
HOW CAN I MAINTAIN the FONT LOOK that I have chosen for my project?
How do I include text as HTML elements (see DOMElements)?
ANY HELP will be appreciated
A good, in depth, tutorial on the subject (FONTS) would be a BIG help to many using the convert to HTML5 canvas features.

Google has web fonts:
https://www.google.com/fonts
Choose a font from that site; Google then generates instructions on how to embed that font. For example, for Montserrat:
3. Add this code to your website:
<link href='http://fonts.googleapis.com/css?family=Montserrat:400,700' rel='stylesheet' type='text/css'>
4. Integrate the fonts into your CSS:
The Google Fonts API will generate the necessary browser-specific CSS to use the fonts. All you need to do is add the font name to your CSS styles. For example:
font-family: 'Source Sans Pro', sans-serif;
font-family: 'Ubuntu', sans-serif;
font-family: 'Montserrat Alternates', sans-serif;
font-family: 'Montserrat', sans-serif;
font-family: 'Open Sans', sans-serif;

How to extract text and image information from postscript file

I want to write a program that extracts text and image information from a PostScript file using Java. Is it possible? How do I extract it?
Thanks!

First of all, PostScript is not a "text" file. It can and often does contain binary data. Since PostScript streams often contain nested procedures, unless you process the procedure definitions and can "execute" them, you cannot simply "scan" a file to get what you want. No, I can't talk about this in detail since it is quite complex. But Adobe does have the PostScript Language Reference Manual on-line for download. Look that over and you will have a fairly healthy respect as to the task involved.
- Dov

Guide: how to extract text from any iOS notes app backup on your Mac

Scenario: you had a sweet notes app like DailyNotes, perhaps one that your girlfriend installed while you weren't looking. So it was on her iTunes account, let's say. Then one day you upgrade to a new iOS version and back up all your data. For whatever reason you erase your iPad, install the new version, and recover all the apps from the backup. (Maybe you were beta testing prior to that so you wanted to wipe it clean.) Still your data is gone! Or is it?
"Fortunately" you backed up. However all that this means is that the text you entered into DailyNotes is stored somewhere in your backup, inside of a file named something like
"8f5d7ff4111c9b9e4c8dbb7395efdce9c260e0de-20110814-232318", encoded in a .sqlite database file wrapped inside that data.
Now, if you reinstall the DailyNotes app (or whatever) off of the app store under your own user account, will you get your data back as well? I honestly don't know. But I tried to find out, and I could get no straight answers. Most people said I'd lose my data, and the only way to get back my data would be to restore from the old backup, which of course would erase my current data! So I'd have to backup, then restore from the older backup, then restore from the newer backup again. Does anyone realize how LONG these backups and restores take? FOREVER! I don't have that much time guys. I just need a fast way to access a simple text file. Why does iOS make this such a chore? It's my own content, my own data, which is mine, my copyright, my intellectual property, and the iOS is hiding it from me inside an anonymously-named text file.
SO HERE'S WHAT YOU DO:
Download mono framework here and install it: (free)
http://www.go-mono.com/mono-downloads/download.html
(this can be easily uninstalled later, it has an uninstaller app)
Download iPhone Backup Extractor here and extract it in your Downloads directory: (free)
http://www.iphonebackupextractor.com./free-download/
(this can be easily uninstalled later, just delete the folder)
Download SQLite Database Browser 2.0 b1, and extract it in your Downloads directory: (free)
http://sourceforge.net/projects/sqlitebrowser/
(this can easily be uninstalled later, just delete the app)
You can move the two apps to your Applications directory if you intend to keep them long-term, but you can run them just fine from the Downloads directory which will make them easier to identify and erase after you're done, if you don't plan to keep them around.
Go to Finder and open the iPhone Backup Extractor directory. Resize that window to the side of your screen.
Go to Terminal and set the window where you can see part of it if the previously mentioned Finder window from the last step was floating on top of it.
Type "cd " (yes that's c, d, space) in Terminal then hit command-Tab to switch to Finder.
Drag the folder icon from the title bar of the finder window into the Terminal window that's now in the background. For you newbies, the "title bar" is the VERY topmost edge of the window (the frame of the window) which should have a folder next to the words, "iPhone Backup Extractor" visible in it. You're clicking and dragging THAT folder icon into the terminal window in the background. HIT COMMAND-TAB AGAIN WHEN DONE.
Now you're back in the Terminal and it should say:
"cd /Users/yourname/Downloads/iphonebackupextractor-latest" after the unix prompt. HIT ENTER.
Now type:
mono iPhoneBackupExtractor.exe
This will run the iPhone Backup Extractor app. It takes a few minutes to load because mono is slow (it's basically running a Windows .NET program on the Mac). Be patient.
NOTE: The Backup Extractor can only see backups stored on your boot drive that are in the users folder of whatever user you're currently logged in as. So if your backup is on an external drive or a CD, etc., just copy it to the desktop.
Once iPhone Backup Extractor loads, you'll see its window where you can select the backup. Select whichever one you want to work with. Then hit Expert mode. Each app that was on your device at the time you made the backup will have a directory shown. It will be named something like:
com.ramki.dailynotes
Expand the one you want to recover data from by clicking the plus sign next to it. Then expand the Documents directory for it. You'll see a file called something like Daily_Notes.sqlite. Click the dark black box next to this and a checkmark should appear.
Once you've checked the file to recover, click "Extract selected" below and save to your Desktop (or wherever!).
Now quit iPhone Backup Extractor unless you have other data to also extract.
NEXT... OPEN the app SQLite Database Browser 2.0 b1.
Once it loads, open your Daily_Notes.sqlite file (or whatever .sqlite file you extracted, not necessarily Daily_Notes, that's just my example). If you followed my previous steps that file will be on your Desktop.
Once it loads you'll see three buttons at the center of the window near the top called, "Database Structure," "Browse Data," and "Execute SQL." CLICK ON "Browse Data."
Now on the left-hand side of the window there is a pop-up menu with the word "Table: [POP-UP-MENU-IS-HERE] "... CLICK on the pop-up menu. It will actually say the name of one of the database files that's within the SQLite database, something like "ZAPPSTATE" or "ZDAYDATA" or whatever (not "POP-UP-MENU-IS-HERE", that was just text I put as a placeholder since it could be anything really).
Now that you've clicked the pop-up menu, select each item one by one and look at the data that appears in the table.
(Don't worry, you're just working with a COPY of the file, so if you accidentally delete anything it's not a real problem, just delete the .sqlite file and start the steps over from the beginning of this message.)
You should eventually find a table that has the text that you're looking for! Mine was called "ZDAYCONTENT".
WHEN YOU ARE LOOKING AT THE DATABASE TABLE, it looks like an Excel spreadsheet. That means you may have to double-click on the database cell to get it to show you the entire contents of that part. (It only shows a truncated text string in each cell, but if you double-click, a new window will open on top of the current window, showing the full text that was in there.) Now you can copy the text out and paste it into another app like Text Edit or MS Word or Pages, etc. You can also export the data in the File Menu > Export to an SQL or CSV file. (CSV is a text file where the data is all there, just separated by commas. This can then be imported into Excel or Numbers or another database or printed, etc., or just opened into Pages or Word or BBEdit etc.)
CAVEATS: Dates will often be shown as a weird number like 3780 or 2863 etc. You may have to figure out on your own what this date means. I honestly have no clue. SQLite Database Browser does not seem to support viewing or exporting PNG and image files.
Other than that good luck. Post any questions here.
AND HEY, APPLE: MAKE THIS EASIER! FILES BELONG IN FOLDERS, NOT INSIDE FOLDERS THAT ARE INSIDE APPS!!!

Restore iPad data from backup files, with the help of iPad Data Extractor:
1. Settings>General>Reset>Erase all content and settings
2. You'll be asked twice to confirm
3. You'll see Apple logo and progress bar
4. You'll see a big iPad logo on screen
5. Configuration start
6. Set language
7. Set country
8. Select Network and input Password>Join
9. Enable Location Service>Next
10. You'll be given 3 options (a) Setup as New iPad (b) Restore from iCloud Backup (c) Restore from iTunes Backup
11. Select Restore from iTunes Backup
12. You will see picture of USB cable pointing towards iPad
13. Connect iPad to iTunes (make sure iTunes is on standby)
14. Tap Continue (computer)
15. Restore iPad from Backup (computer)
16. See progress bar with estimated time (computer)
17. See Restore in Progress on iPad
18. See Apple logo
19. See Apple and Progress Bar
20. Slide to Unlock
21. Copying Apps back to iPad (computer)
22. You'll see Loading/Installing/Waiting below the Apps (iPad)
23. Sync Music/Podcast/Movies to iPad (computer)
24. Sync completed (computer)

XML StAX - How to extract text and elements?

Hello,
I'm using StAX to parse this XML document (heavily reduced):
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
 <title>Foo</title>
</head>
<body>
 loading ...
</body>
</html>
I need to extract the data inside the <body> element, i.e. "loading ...". My problem is that I can only find methods that extract the text and not the elements. Is there an easy way to do this using the XMLStreamReader instance, or do I have to use another class?
Thanks

Thanks for your reply. But does that really mean that I have to create my own method, which will collect both text and elements information in a StringBuffer as I parse through the enclosing element? I just think it is strange that there isn't a convenient method to extract all data (text & elements) between one element.
Something like this?:
private void handleBody(XMLStreamReader parser,XMLEventAllocator allocator) throws XMLStreamException {
    StringBuffer body = new StringBuffer();
    while(true){
        String value = null;
        parser.next();
        if (parser.getEventType() == XMLStreamConstants.START_ELEMENT){
            String name = parser.getLocalName();
            if (!name.equalsIgnoreCase("body")){
                StartElement startElement = getXMLEvent(allocator,parser).asStartElement();
                value = startElement.toString();
            }
        }
        else if (parser.getEventType() == XMLStreamConstants.END_ELEMENT){
            String name = parser.getLocalName();
            if (name.equalsIgnoreCase("body")){
                break;
            }
            else{
                EndElement endElement = getXMLEvent(allocator,parser).asEndElement();
                value = endElement.toString();
            }
        }
        else if (parser.hasText()){
            value = parser.getText();
        }
        if (value != null){
            body.append(value);
        }
    }
}
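As far as I know, XMLStreamReader has no single call that returns the mixed text and markup inside an element, so collecting it by hand as above is a normal approach. A self-contained variant of the same idea, using only XMLStreamReader and a depth counter, might look like the sketch below. The class name is made up, tags are rebuilt from local names and attributes rather than copied verbatim, and text is not re-escaped.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class BodyExtractor {
    // Returns the text and rebuilt element tags found between <body> and </body>.
    static String innerBody(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringBuilder out = new StringBuilder();
            int depth = 0; // 0 = outside <body>, 1 = directly inside it
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (depth > 0) {
                        out.append('<').append(r.getLocalName());
                        for (int i = 0; i < r.getAttributeCount(); i++) {
                            out.append(' ').append(r.getAttributeLocalName(i))
                               .append("=\"").append(r.getAttributeValue(i)).append('"');
                        }
                        out.append('>');
                        depth++;
                    } else if ("body".equals(r.getLocalName())) {
                        depth = 1;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth == 1) break; // this is </body>
                    if (depth > 1) {
                        out.append("</").append(r.getLocalName()).append('>');
                        depth--;
                    }
                } else if (depth > 0 && r.hasText()) {
                    out.append(r.getText());
                }
            }
            r.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><head><title>Foo</title></head>"
                + "<body> loading <b>now</b> ...</body></html>";
        System.out.println(innerBody(xml));
    }
}
```

The depth counter avoids comparing names on every end tag and keeps nested elements balanced in the output.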

How to Extract Text coordinates from PDF

Hi,
Can anyone tell me how to get coordinates in a PDF document using VB or .NET? Suppose some text is written in a PDF document; how can I get the coordinates of that text? It's very urgent.
Thanks in Advance.

I am trying to use the getPageNthWordQuads information to determine if a word on the page is within a region that I am interested in.
I have a limited knowledge of javascript and have been looking up text manipulation functions and array manipulation functions in an attempt to figure out how to separate the values that are returned from the Quads routine. The Adobe documentation indicates that the Quads function returns an array, but when I try to access one of the values in the array, it gives me the entire contents of the array as though it is a string. If I use the .length function to try to determine the length of it, it tells me it is length of 1! I obviously am mis-handling this reference, but I have yet to find any specific examples that work with the quads array the way I am trying to work with it....
Here is my code...I am running it against an open file in batch processing mode(maybe this has something to do with it)...
var sourceDoc = this
var tx1=492.5;
var ty1=761.5;
var tx4=563;
var ty4=726.2;
try {
    for (var j = 0; j < (this.numPages); j=j+2){
        var cnt=0;
        var rcvrnum="";
        cnt = sourceDoc.getPageNumWords(j);
        if (j == 0) {
            try {
                for (var i = 0; i < cnt; i++) {
                    var quads = sourceDoc.getPageNthWordQuads(j,i);
                    var x1 = quads[0];
                    console.println("Page(" + j + "),Word(" + i + ") = " + sourceDoc.getPageNthWord({nPage: j, nWord: i}));
                    console.println("Quads length is " + quads.length);
                    console.println("X1 = " + x1);
                    if ( x1 >= tx1 & x1 <= tx4 & y1 >= ty4 & y1 <= ty1 ) {
                        console.println("Q1 is good");
                        console.println("Page(" + j + "),Word(" + i + ") = " + sourceDoc.getPageNthWord({nPage: j, nWord: i}));
                    }
                }
            } catch (e) { console.println("Aborted: " + e) };
        }
    }
} catch (f) { console.println("Aborted: " + f) };
I have tried several variations of the code above to try to extract my values so that I can compare them, but to no avail. The above code outputs to the console the following...
Page(0),Word(0) = OTTO
Quads length is 1
X1 = 19.350006103515625,782.15087890625,126.51744079589844,782.15087890625,19.350006103515625, 721.5038452148438,126.51744079589844,721.5038452148438
Page(0),Word(1) =
Quads length is 1
X1 = 125.17047119140625,782.15087890625,153.91525268554688,782.15087890625,125.17047119140625, 721.5038452148438,153.91525268554688,721.5038452148438
and so on...
x1 becomes the entire output from the array and yet I can not perform a simple split function on x1. If I try to split X1 into an array by splitting on the comma, I get the following error.
Aborted: TypeError: x1.split is not a function
Am I supposed to import some libraries or something?
Thanks for any help....
Kevin Ailes
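The console output above (length 1, and x1 printing as eight numbers) is consistent with the quads value being an array of quads, where each quad is itself an array of eight numbers. If that is the case, quads[0] is a whole quad rather than a single coordinate, which would also explain why split is not a function on it: it is an array, not a string. The sketch below models that shape in Java rather than Acrobat JavaScript, purely to illustrate the indexing and the region test; the class and method names are made up, and the coordinate order (x1, y1 first) is read off the posted output.

```java
final class QuadRegionCheck {
    // quads: one entry per quad; each quad holds 8 numbers (x1, y1, ..., x4, y4).
    // Returns true when the first corner of the first quad lies inside the region.
    static boolean firstCornerInRegion(double[][] quads,
                                       double left, double top,
                                       double right, double bottom) {
        if (quads.length == 0 || quads[0].length < 2) {
            return false;
        }
        double x1 = quads[0][0]; // first number of the first quad
        double y1 = quads[0][1]; // second number of the first quad
        // Logical && is used, and y1 is defined before it is compared.
        return x1 >= left && x1 <= right && y1 >= bottom && y1 <= top;
    }

    public static void main(String[] args) {
        double[][] otto = {{19.350006103515625, 782.15087890625,
                            126.51744079589844, 782.15087890625,
                            19.350006103515625, 721.5038452148438,
                            126.51744079589844, 721.5038452148438}};
        System.out.println(firstCornerInRegion(otto, 492.5, 761.5, 563, 726.2));
    }
}
```

In the posted script the equivalent change would be to read the numbers with a second index, such as quads[0][0] and quads[0][1], before comparing them.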

How to verify text is HTML-free

This may not be the place, but I'm writing web apps that use JSF so I thought I'd post it here.
I was wondering if there is a library/class that I can use to check that a String is "free" of HTML.
Specifically, I want to make sure that every text and text area field are free of HTML before I store them to the database.
I looked on the Apache site but nothing jumped out at me.
Anyone seen anything like this?

Apache Commons StringEscapeUtils has an escapeHtml() method.
http://commons.apache.org/lang/api-2.2/org/apache/commons/lang/StringEscapeUtils.html
This will escape HTML special characters and convert them to HTML entities, e.g. < to &lt; and > to &gt;, so that your final HTML source won't be malformed. There is no need to escape the text before saving it into the DB. The Java language and the DB itself don't care about it as long as you use JDBC's PreparedStatement instead of Statement, or better, an ORM API, e.g. Hibernate.
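For anyone who wants to see the idea without adding a dependency, here is a rough standard-library sketch of both halves: a heuristic check for tag-like content and an escaper for output. The names are made up, the check is only a heuristic (it looks for something shaped like a tag), and the escaper covers just five characters, so this is not a replacement for the Commons method.

```java
import java.util.regex.Pattern;

final class HtmlText {
    // Something shaped like a tag: '<' followed by a letter, '/' or '!', up to '>'.
    private static final Pattern TAG = Pattern.compile("<[A-Za-z/!][^>]*>");

    // Heuristic: true when no tag-like sequence is found in the text.
    static boolean looksHtmlFree(String s) {
        return !TAG.matcher(s).find();
    }

    // Escapes the characters that matter when text is written into an HTML page.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(looksHtmlFree("1 < 2 and 3 > 2"));
        System.out.println(escape("<b>a & b</b>"));
    }
}
```

Escaping at display time, as suggested above, tends to be safer than rejecting input, because plain text such as "1 < 2" is perfectly legitimate.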

Extract Text from pdf using C#

Hi,
We are solution developers using Acrobat. As we have a requirement to extract text from PDF using C#, we have downloaded and installed the Adobe SDK. We have found only four examples in C#, and those are used only for viewing PDF in a Windows application. Can you please guide us on how to extract text from PDF using the SDK in C#?
Thank you for your help.
Regards
kiranmai

Okay so I went ahead and actually added the text extraction functionality to my own C# application, since this was a requested feature by the client anyhow, which originally we were told to bypass if it wasn't "cut and dry", but it wasn't bad so I went ahead and gave the client the text extraction that they wanted. Decided I'd post the source code here for you. This returns the text from the entire document as a string.
 // Needs: using System; using System.Collections.Generic; using System.Reflection; using Acrobat;
 private static string GetText(AcroPDDoc pdDoc)
 {
     AcroPDPage page;
     int pages = pdDoc.GetNumPages();
     string pageText = "";
     for (int i = 0; i < pages; i++)
     {
         page = (AcroPDPage)pdDoc.AcquirePage(i);
         object jso, jsNumWords, jsWord;
         List<string> words = new List<string>();
         try
         {
             jso = pdDoc.GetJSObject();
             if (jso != null)
             {
                 object[] args = new object[] { i };
                 jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                 int numWords = Int32.Parse(jsNumWords.ToString());
                 // Word indices are zero-based, so stop before numWords.
                 for (int j = 0; j < numWords; j++)
                 {
                     object[] argsj = new object[] { i, j, false };
                     jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                     words.Add((string)jsWord);
                 }
             }
             foreach (string word in words)
             {
                 pageText += word;
             }
         }
         catch
         {
             // Errors on a page are ignored; that page's text is skipped.
         }
     }
     return pageText;
 }

Extract Text from PostScript

Can anyone please tell me how to extract text from postscript

Ghostscript is a postscript interpreter (one can actually program in postscript, and in fact postscript is a language BTW/FYI). Ghostscript is written in C (maybe C++) and yes, you can output to several different formats. I doubt you can go to HTML as it wouldn't really make much sense.
XML needs a DTD to make it of any use. You will have to call ghostscript through Runtime.exec() to use it, it will extract out purely the text from the PostScript ASSUMING the PS file contains text in that manner; it is possible to have PS text as output images in which case GS won't pick it up.
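A rough sketch of calling Ghostscript from Java, as described above, is shown here with ProcessBuilder (a more convenient wrapper around the same mechanism as Runtime.exec()). It assumes a gs executable on the PATH and a Ghostscript build that includes the txtwrite output device; both are assumptions about the local installation, and the class and method names are made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class PsTextExtractor {
    // Builds the command line: quiet batch run, text output written to stdout.
    static List<String> buildCommand(String gsExecutable, String psFile) {
        return List.of(gsExecutable, "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
                "-sDEVICE=txtwrite", "-sOutputFile=-", psFile);
    }

    // Runs Ghostscript and returns whatever text it printed.
    static String extract(String psFile) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand("gs", psFile))
                .redirectErrorStream(true) // merge stderr so the pipe cannot block
                .start();
        String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("Ghostscript exited with code " + exit + ": " + text);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(args[0]));
    }
}
```

As noted above, this only recovers text that the PostScript file actually draws as text; text rendered as images will not appear in the output.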
