Problem to extract text from HTML document

I have to extract some text from HTML files (about 1000 of them) into my database.
The HTML files come from the ACM Digital Library: http://portal.acm.org/dl.cfm
Each HTML page holds the information about one paper. I only want the text of "Title", "Abstract", "Classification" and "Keywords".
The problem is that I can't find any pattern to parse the HTML files.
Example: I need to get Classification = "Theory of Computation", "ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY", "Numerical Algorithms and Problems", "Mathematics of Computing", "NUMERICAL ANALYSIS", etc.
The section of code for "Classification" is below.
Please give me any idea how to do this, or how to find a pattern to extract the text.
<div class="indterms"><a href="#CIT"><img name="top" src=
"img/arrowu.gif" hspace="10" border="0" /></a><a name="IndexTerms">INDEX TERMS</a>
<a name=
"GenTerms">Primary Classification:</a> 
&nbsp; F. <a href=
"results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Theory of Computation</a> 
&nbsp; <img src="img/tree.gif" border="0" height="20" width=
"20" /> F.2 <a href=
"results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
COMPLEXITY</a> 
&nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
"20" width="20" /> F.2.1 <a href=
"results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Numerical Algorithms and Problems</a> 

<a name=
"GenTerms">Additional&nbsp;Classification:</a>
&nbsp; G. <a href=
"results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Mathematics of Computing</a> 
&nbsp; <img src="img/tree.gif" border="0" height="20" width=
"20" /> G.1 <a href=
"results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">NUMERICAL ANALYSIS</a> 
&nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
"20" width="20" /> G.1.6 <a href=
"results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Optimization</a> 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border=
"0" height="20" width="20" /> Subjects: <a href=
"results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Linear programming</a> 

 
<a name=
"GenTerms">General Terms:</a> 
<a href=
"results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Algorithms</a>, <a href=
"results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Theory</a>
 
<a name=
"Keywords">Keywords:</a> 
<a href=
"results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">Simplex method</a>, <a href=
"results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">complexity</a>, <a href=
"results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">perturbation</a>, <a href=
"results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
target="_self">smoothed analysis</a>
</div>

One approach is to download HTML Parser from SourceForge (http://htmlparser.sourceforge.net/) and write rules to match the title, abstract, etc.
Another approach is to write your own parser that extracts only the title, abstract, etc.:
1. Tokenize the HTML file, i.e. convert the HTML into tokens (tags and values).
2. Write a simple parser to extract the information you need.
Find the pattern of the text you want to extract, for instance a tag with class="abstract", then write a rule for extracting the abstract, such as
if (tag is abstract) then extract abstract text
Apply the same concept to the other tags.
Below is the sample parser that was used to extract the title and abstract from ACM HTML files. Please modify it to include keywords and other fields.
Good luck.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class ACMHTMLParser {
    private final String m_filename;
    private URLLexicalAnalyzer lexical;
    List<String> urls = new ArrayList<>(); // not used by this sample

    public ACMHTMLParser(String filename) {
        m_filename = filename;
    }

    /** Parses only the title and the abstract. */
    public void parse() throws Exception {
        lexical = new URLLexicalAnalyzer(m_filename);
        String word = lexical.getNextWord();
        boolean isabstract = false;
        while (null != word) {
            if (isTag(word)) {
                if (isTitle(word)) {
                    // the token right after <title> is the title text
                    System.out.println("TITLE: " + lexical.getNextWord());
                } else if (isAbstract(word) && !isabstract) {
                    parseAbstract();
                    isabstract = true;
                }
            }
            word = lexical.getNextWord();
        }
        lexical.close();
    }

    public static void main(String[] args) throws Exception {
        ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
        parser.parse();
    }

    public static boolean isTag(String word) {
        return (word.startsWith("<") && word.endsWith(">"));
    }

    public static boolean isTitle(String word) {
        return ("<title>".equals(word));
    }

    // Please modify according to the HTML source: the tag literal that marks
    // the abstract is missing from this post, so "" is only a placeholder.
    public static boolean isAbstract(String word) {
        return ("".equals(word));
    }

    // Prints the first non-tag token that follows the abstract marker.
    private void parseAbstract() throws Exception {
        while (true) {
            String abs = lexical.getNextWord();
            if (null == abs) break; // end of file, nothing to print
            if (!isTag(abs)) {
                System.out.println(abs);
                break;
            }
        }
    }
}

class URLLexicalAnalyzer {
    private BufferedReader m_reader;
    private boolean isTag; // true when scanValue has already consumed a '<'

    public URLLexicalAnalyzer(String filename) {
        try {
            m_reader = new BufferedReader(new FileReader(filename));
        } catch (IOException io) {
            System.out.println("ERROR, file not found " + filename);
            System.exit(1);
        }
    }

    public URLLexicalAnalyzer(InputStream in) {
        m_reader = new BufferedReader(new InputStreamReader(in));
    }

    public void close() {
        try {
            if (null != m_reader) m_reader.close();
        } catch (IOException ignored) {}
    }

    // Returns the next token (a whole tag or a run of text), or null at end of file.
    public String getNextWord() throws IOException {
        int c = m_reader.read();
        if (-1 == c) return null;
        if (Character.isWhitespace((char) c))
            return getNextWord();
        if ('<' == c || isTag)
            return scanTag(c);
        else
            return scanValue(c);
    }

    private String scanTag(final int c) throws IOException {
        StringBuffer result = new StringBuffer();
        if ('<' != c) result.append('<'); // the '<' was consumed by scanValue
        result.append((char) c);
        int ch = -1;
        while (true) {
            ch = m_reader.read();
            if (-1 == ch) throw new IllegalArgumentException("unterminated tag");
            if ('>' == ch) {
                isTag = false;
                break;
            }
            result.append((char) ch);
        }
        result.append((char) ch); // append the closing '>'
        return result.toString();
    }

    private String scanValue(final int c) throws IOException {
        StringBuffer result = new StringBuffer();
        result.append((char) c);
        int ch = -1;
        while (true) {
            ch = m_reader.read();
            if (-1 == ch) throw new IllegalArgumentException("unterminated value");
            if ('<' == ch) {
                isTag = true; // the next token is a tag whose '<' is already read
                break;
            }
            result.append((char) ch);
        }
        return result.toString();
    }
}
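
For the "Classification" and "Keywords" fields from the question, the same idea can be sketched with a regular expression instead of a hand-written tokenizer. This is only a sketch under assumptions: the class name IndexTermsExtractor and its methods are made up for illustration, and it assumes every page uses the layout of the snippet in the question, i.e. a label anchor such as <a name="Keywords">Keywords:</a> followed by <a href=...> links that carry the terms. It is not a general HTML parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the parser posted above.
final class IndexTermsExtractor {
    // Matches every anchor: group 1 = attribute text, group 2 = content.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+([^>]*)>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Maps each section label (e.g. "Keywords") to the link texts that follow it. */
    static Map<String, List<String>> extract(String html) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        List<String> current = null;
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String attrs = m.group(1).trim().toLowerCase();
            String text = clean(m.group(2));
            if (attrs.startsWith("name") && text.endsWith(":")) {
                // A label anchor such as <a name="Keywords">Keywords:</a> opens a section.
                String label = text.substring(0, text.length() - 1);
                current = sections.computeIfAbsent(label, k -> new ArrayList<>());
            } else if (attrs.startsWith("href") && current != null && !text.isEmpty()) {
                // A link inside the current section carries one term.
                current.add(text);
            }
        }
        return sections;
    }

    // Removes nested tags, turns non-breaking spaces into spaces, collapses whitespace.
    private static String clean(String s) {
        return s.replaceAll("<[^>]*>", "")
                .replace("&nbsp;", " ")
                .replace('\u00A0', ' ')
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java IndexTermsExtractor page.html");
            return;
        }
        // ISO-8859-1 never fails to decode; adjust if the pages use another charset.
        String html = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.ISO_8859_1);
        extract(html).forEach((label, terms) -> System.out.println(label + " = " + terms));
    }
}
```

Line breaks inside a link text (as in "ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY") are collapsed to single spaces, so the terms come out as one line each.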

Similar Messages

Extracting info from HTML documents

My program returns the HTML of any web page entered by the user. The HTML documents that are returned all contain pricing information that I want to extract. Any idea of the best way to search an HTML document for the specific information I require? It seems like a huge task to split it all into tokens and search for the currency sign!

> This is a nightmare of a problem... the HTML files that I am retrieving are huge. All I need from them are a couple of lines of information. How do I find the specific information I need?
Load the entire file and search it. You find the information the same way you would when you look for it in the file's source code.
> Is it possible from a Java program to open the HTML file in a web browser, search, then return the info? The HTML files seem really complex to search.
How would this help?
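
The "load the entire file, search for it" advice can be sketched roughly like this. The class name PriceFinder and the price format (a pound, euro or dollar sign followed by digits) are assumptions made for the example; the real pages may write prices differently, in which case the pattern has to be adapted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds price-like strings in an HTML page held in memory.
final class PriceFinder {
    // Assumed format: currency sign, optional space, digits with optional
    // thousands separators and optional two decimals, e.g. "$1,250" or "$12.99".
    private static final Pattern PRICE = Pattern.compile(
            "[\u00A3\u20AC$]\\s?\\d+(?:,\\d{3})*(?:\\.\\d{2})?");

    static List<String> findPrices(String html) {
        // Drop the tags first so markup between tokens does not get in the way.
        String text = html.replaceAll("<[^>]*>", " ");
        List<String> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            prices.add(m.group());
        }
        return prices;
    }

    public static void main(String[] args) {
        String html = "<tr><td>Widget</td><td class=\"p\">$12.99</td></tr>"
                + "<tr><td>Gadget</td><td class=\"p\"><b>$1,250</b></td></tr>";
        System.out.println(findPrices(html));
    }
}
```

Even for a huge page this is a single pass over one string, so there is no need to open a browser or to tokenize the whole document by hand.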

How can I solve the problem with copying text from Google Documents?

Ctrl+C and Ctrl+X do not always work in a Google document opened with Mozilla Firefox 30. This problem first occurred 2 days ago. No changes to the software were made.
The problem is intermittent. I can successfully copy several lines from the document and paste them somewhere else, and then suddenly the next one is not copied with Ctrl+C. Even Ctrl+X does not work until I change the selection of the lines. Sometimes it begins to work after changing the selection, sometimes not.
I have been using Google Documents for a long time. I always use Firefox to work with them, and I've never had this problem before.

This problem happens for me as well. I can copy/paste from Firefox initially, but then after pasting elsewhere the Firefox page has to be refreshed in order to copy something else....

ExtendScript: Get all text from a document

Hi all.
I have the following task: I need to translate a document into another language using ExtendScript. So, as "input" I have a document with a text/graphics/tables/etc. in Language_1 and a "somehow-separated file", which will contain data about translation into the Language_2. E.g.:
Some_text_in_language_1 Some_text_in_language_2
Some_other_text_in_language_1 Some_other_text_in_language_2
To get the source text from the document, I've tried to use this:
var pgf = doc.MainFlowInDoc.FirstTextFrameInFlow.FirstPgf;
while (pgf.ObjectValid()) {
    var test = pgf.GetText(Constants.FTI_String);
    var text, str;
    text = "";
    for (var i = 0; i < test.len; i += 1) {
        str = test[i].sdata.replace(/^\s+|\s+$/g, '');
        text = text + str;
        PrintTextItem(test[i]);
    }
    pgf = pgf.NextPgfInFlow;
}
But with this, I can only access the regular text in the document (e.g. the text in tables remains untouched). Is there any way I can get all the textual data from a specified document? Or maybe the full list of objects that can contain it, so I can iterate through them and extract it one by one? Or maybe there's a better way to solve this problem?
Thanks in advance! Any advice would be greatly appreciated.

There is another way to loop through ALL paragraphs in a document, regardless whether they are in a table or in the main text flow. You can use the FirstPgfInDoc property of the document and loop through all Pgf objects using the NextPgfInDoc property of the Pgf until you reach an invalid object. Note that this also includes all paragraphs in the master and reference pages, so it might be useful to check where the Pgf is located (on a body page or not). There is a script on this forum that does that - I believe it was created and posted by Rick Quatro.
Working your way through the main text flow does not guarantee that you have all the visible text in the doc. There may be multiple flows and there may also be text frames that are placed inside anchored frames. Those text frames are not contained directly in the main flow of the document.
Good luck with your scripting
Jang

Read Text from HTML-Pages and want to solve "ChangedCharSetException"

Hello,
I have an app that connects to pages via threads, parses them and gives me only the text version of each HTML page. It works fine, but if it finds a page where the text is within images, the whole app stops and gives me this message:
javax.swing.text.ChangedCharSetException
        at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:169)
        at javax.swing.text.html.parser.Parser.startTag(Parser.java:372)
        at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1846)
        at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1881)
        at javax.swing.text.html.parser.Parser.parse(Parser.java:2047)
        at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:106)
        at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:78)
        at aufruf.main(aufruf.java:33)
So I tried to handle it with "getCharSetSpec()" and "keyEqualsCharSet()" from the class "javax.swing.text.ChangedCharSetException" and hoped that this would solve the problem. But it still doesn't work...
Then I looked on the web and found that I have to add the line:
doc.putProperty("IgnoreCharsetDirective", new Boolean(true));
"doc" is a new HTML document created with the HTMLEditorKit. I do not have much knowledge about that, so I hope that someone can explain to me how I can solve this problem within my code.
Here we go:
import javax.swing.text.*;
import java.lang.*;
import java.util.*;
import java.net.*;
import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class myParser extends Thread {
        private String name;

        public void run() {
                try {
                        URL viele = new URL(name);                       // "name" is a variable with a lot of links
                        URLConnection hs = viele.openConnection();
                        hs.connect();
                        if (hs.getContentType().startsWith("text/html")) {
                                InputStream is = hs.getInputStream();
                                InputStreamReader isr = new InputStreamReader(is);
                                BufferedReader br = new BufferedReader(isr);
                                Lesen los = new Lesen();
                                ParserDelegator parser = new ParserDelegator();
                                parser.parse(br,los, false);
                        }
                } catch (MalformedURLException e) {
                        System.err.print("Doesn't work");
                } catch (ChangedCharSetException e) {
                        e.getCharSetSpec();
                        e.keyEqualsCharSet();
                        e.printStackTrace();
                } catch (Exception o) {
                        // ignored
                }
        }

        public void vowi(String n) {
                name = n;
        }
}
And in case it is important, here is the class "Lesen":
import java.net.*;
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
class Lesen extends HTMLEditorKit.ParserCallback {
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                try {
                        if ((t==HTML.Tag.P) || (t==HTML.Tag.H1) || (t==HTML.Tag.H2) || (t==HTML.Tag.H3) || (t==HTML.Tag.H4) || (t==HTML.Tag.H5) || (t==HTML.Tag.H6)) {
                                System.out.println();
                        }
                } catch (Exception q) {
                        System.out.println(q.getMessage());
                }
        }

        public void handleSimpleTag(HTML.Tag t,MutableAttributeSet a, int pos) {
                try {
                        if (t==HTML.Tag.BR) {
                                System.out.println(); // new line
                                System.out.println();
                        }
                } catch (Exception qw) {
                        System.out.println(qw.getMessage());
                }
        }

        public void handleText(char[] data, int pos) {
                try {
                        System.out.print(data);                                           // prints the text from HTML pages
                } catch (Exception ab) {
                        System.out.println(ab.getMessage());
                }
        }
}
Thanks a lot for helping...
Stephan

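Replace
parser.parse(br,los, false);
with
parser.parse(br,los, true);
The third argument of ParserDelegator.parse is ignoreCharSet. Passing true makes the parser ignore the page's charset directive instead of throwing ChangedCharSetException.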
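
A minimal, self-contained sketch of that call is shown here. The class PlainTextExtractor and the sample page are made up for illustration; only ParserDelegator.parse(Reader, HTMLEditorKit.ParserCallback, boolean) and the callback's handleText come from the JDK. The page carries a charset directive in a meta tag, which is the kind of directive that passing true as the third argument tells the parser to ignore.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text runs of an HTML page.
final class PlainTextExtractor {
    /** Returns the text content of the page, one text run per line. */
    static String extractText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append('\n');
            }
        };
        try {
            // Third argument is ignoreCharSet: true means a charset directive in
            // the page is ignored rather than reported as ChangedCharSetException.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-1\"><title>Demo</title></head>"
                + "<body><p>Hello world</p></body></html>";
        System.out.print(extractText(page));
    }
}
```

If the page's declared charset really matters for the output, the reader should be created with that charset up front, since the directive is no longer acted on.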

Extract text from hebrew pdf using adobe ifilter 6.0 reverse the letters

Hello pdf Users
I'm using Adobe IFilter 6.0 to extract PDF text from Hebrew documents. The text returned from the filter is reversed, both in the letters inside a word and in the word order.
Example (given in English letters)
Who am I
will give
I ma ohW
This is a known issue in bidi (bidirectional, meaning right-to-left) languages like Hebrew and Arabic, but I think I saw that IFilter should support Hebrew OK?
Any help?
Roee

Try the Adobe Acrobat Pro forums.

Extract Text from pdf using C#

Hi,
We are solution developers using Acrobat. As we have a requirement to extract text from PDF using C#, we have downloaded the Adobe SDK and installed it. We found only four examples in C#, and those are used only for viewing PDFs in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?
Thank you for your help.
Regards
kiranmai

Okay so I went ahead and actually added the text extraction functionality to my own C# application, since this was a requested feature by the client anyhow, which originally we were told to bypass if it wasn't "cut and dry", but it wasn't bad so I went ahead and gave the client the text extraction that they wanted. Decided I'd post the source code here for you. This returns the text from the entire document as a string.
 // Needs System.Collections.Generic (List) and System.Reflection (BindingFlags).
 private static string GetText(AcroPDDoc pdDoc)
 {
     AcroPDPage page;
     int pages = pdDoc.GetNumPages();
     string pageText = "";
     for (int i = 0; i < pages; i++)
     {
         page = (AcroPDPage)pdDoc.AcquirePage(i);
         object jso, jsNumWords, jsWord;
         List<string> words = new List<string>();
         try
         {
             jso = pdDoc.GetJSObject();
             if (jso != null)
             {
                 object[] args = new object[] { i };
                 jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                 int numWords = Int32.Parse(jsNumWords.ToString());
                 // word indexes run from 0 to numWords - 1
                 for (int j = 0; j < numWords; j++)
                 {
                     object[] argsj = new object[] { i, j, false };
                     jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                     words.Add((string)jsWord);
                 }
             }
             foreach (string word in words)
             {
                 pageText += word;
             }
         }
         catch
         {
             // errors on a page are ignored; that page contributes no text
         }
     }
     return pageText;
 }

How can I extract text from PowerPoint files, Word files, PDF files

hi friends,
I need to extract text from PowerPoint files, Word files and PDF files for my application. Is it possible to extract the text from those files? If yes, please give a solution to this problem. I would be thankful.

My reply would be the same.
http://forum.java.sun.com/thread.jspa?threadID=676559&tstart=0

Extract Text from PostScript

Can anyone please tell me how to extract text from postscript

Ghostscript is a postscript interpreter (one can actually program in postscript, and in fact postscript is a language BTW/FYI). Ghostscript is written in C (maybe C++) and yes, you can output to several different formats. I doubt you can go to HTML as it wouldn't really make much sense.
XML needs a DTD to make it of any use. You will have to call ghostscript through Runtime.exec() to use it, it will extract out purely the text from the PostScript ASSUMING the PS file contains text in that manner; it is possible to have PS text as output images in which case GS won't pick it up.
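
Calling Ghostscript from Java could look roughly like the sketch below. It rests on several assumptions: the executable is named "gs" and is on the PATH (on Windows it is usually gswin64c or similar), the installed Ghostscript provides the txtwrite output device, and the class name PostScriptText is made up for this example. ProcessBuilder is used here in place of Runtime.exec() because it takes the arguments as a list.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the Ghostscript command line.
final class PostScriptText {
    /** Builds the Ghostscript command; "gs" on the PATH is an assumption. */
    static List<String> command(String psFile, String txtFile) {
        return Arrays.asList(
                "gs", "-q", "-dNOPAUSE", "-dBATCH",
                "-sDEVICE=txtwrite",          // text extraction output device
                "-sOutputFile=" + txtFile,
                psFile);
    }

    /** Runs Ghostscript and returns its exit code (0 means success). */
    static int extract(String psFile, String txtFile)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(psFile, txtFile))
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java PostScriptText in.ps out.txt");
            return;
        }
        System.out.println("gs exit code: " + extract(args[0], args[1]));
    }
}
```

As noted above, this only yields text when the PostScript file stores it as text; pages that contain the text as images come out empty.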

Extracting text from header, body of an InCopy table...

Hi folks, I have a script that runs in InCopy CS3 on the Windows platform and part of it extracts text from the header and body parts of a table. If the insertion point is in the header, I can put the text into a variable using...
var textToExtract = app.selection[0].parentStory.contents;
Same scenario if the insertion point is in the body of the table. Anyway, I'm looking for a way to set the insertion point in the Header or Body sections of the table or, better yet, a way of extracting the data directly from those containers. Any ideas are, of course, appreciated. Thanks, Wil

Yes, I am stuck with the same problem.... any ideas out there?
thanks

How to extract text from a PDF file?

Hello Suners,
I need to know how to extract text from a PDF file.
Does anyone know what the character encoding of a PDF file is? When I use an input stream to read the file, it gives me encrypted-looking characters, not the original text in the file.
Is there any procedure I should follow while reading a PDF file?
File f=new File("D:/File.pdf");
FileReader fr=new FileReader(f);
BufferedReader br=new BufferedReader(fr);
String s=br.readLine();
Any help will be deeply appreciated.

jverd wrote:
> First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading.
Oops, you are absolutely right, that was a silly mistake to forget that.
> Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.
getPageContent returns an array of bytes, so the question will be: how do I get text from this array? I was thinking of:
    private void jButton1_actionPerformed(ActionEvent e) {
        PdfReader read;
        StringBuffer buff=new StringBuffer();
        try {
            read = new PdfReader("d:/getjobid2727.pdf");
            read.getMetaData();
            byte[] data=read.getPageContent(1);
            int i=0;
            while(i>-1){
                buff.append(data);
                i++;
            }
            String str=buff.toString();
            FileOutputStream fos = new FileOutputStream("D:/test.txt");
            Writer out = new OutputStreamWriter(fos, "UTF8");
            out.write(str);
            out.close();
            read.close();
        } catch (Exception f) {
            f.printStackTrace();
        }
    }
"D:/test.txt" hasn't been created when I ran the program.
Are my steps right?

How to extract text from a PDF file using php?

How to extract text from a PDF file using php?
thanks
fabio

> Do you know of any other way this can be done?
There are many ways, but this is out of scope for this forum. You can try this forum: http://forum.planetpdf.com/

How to read/extract text from pdf

Respected All,
I want to read/extract text from PDF. I tried using Etymon but did not succeed.
Could anyone guide me in this?
Thanks and regards,
Ajay.

Thank you very much Abhilshit, PDFBox works for reading pdf.
Regards,
Ajay.

A problem with copying text from english pdf to a word file

I have a problem with copying text from an English PDF to a Word file. The English text of the PDF turns into unknown signs when I copy it to the Word file.
I illustrated what I mean in the picture I attached. Note that I have Adobe Acrobat Reader 9. Please help, because I need to copy the text to translate it.

Is this an e-book? Does it allow for copying? Is it possible that the PDF file is a scan of a book?

Indesign CS3-JS - Problem in reading text from a text file

Can anyone help me...
I have a problem with reading text from a txt file. With the "readln" method I can read only the first line of the text. Is there any method to read the consecutive lines from the text file?
Currently I am using InDesign CS3 with JavaScript (for PC).
My JavaScript is as follows:
var myNewLinksFile = myFindFile("/Links/NewLinks.txt")
var myNewLinks = File(myNewLinksFile);
var a = myNewLinks.open("r", undefined, undefined);
myLine = myNewLinks.readln();
alert(myLine);
function myFindFile(myFilePath){
    var myScriptFile = myGetScriptPath();
    var myScriptFile = File(myScriptFile);
    var myScriptFolder = myScriptFile.path;
    myFilePath = myScriptFolder + myFilePath;
    if(File(myFilePath).exists == false){
        //Display a dialog.
        myFilePath = File.openDialog("Choose the file containing your find/change list");
    }
    return myFilePath;
}
function myGetScriptPath(){
    try{
        myFile = app.activeScript;
    }
    catch(myError){
        myFile = myError.fileName;
    }
    return myFile;
}
Thanks,
Bharath Raja G

Hi Bharath Raja G,
If you want to use readln, you'll have to iterate. I don't see a for loop in your example, so you're not iterating. To see how it works, take a closer look at FindChangeByList.jsx--you'll see that that script iterates to read the text file line by line (until it reaches the end of the file).
Thanks,
Ole
