Extract text from htm page.

I am trying to perform an action in Automator to extract a single text from many htm pages. The problem is that after this single text, comes reference composed by numbers and letters. Example: I have 872 archives of htm pages which I have to extract it: (inventory number:78899b3). Is it possible to extract? This different reference of which page and send to a text edit application?
Thank you

import java.io.*;
import javaorg.xml.sax.*;
import javaorg.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
public class FirstParser
  public FirstParser() throws Exception
    DefaultHandler handler = new MyHandler();
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new File("test.xml"), handler);
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
public class MyHandler extends DefaultHandler
  public void startDocument() throws SAXException
    //start parsing document
  public void startElement(String namespaceURI, String sName, String qName, Attributes attr)  throws SAXException
    //when <......> is opened
    //in <HTML> ---- qName = "HTML"
  public void endElement(String namespaceURI, String sName, String qName)  throws SAXException
    //when </......> is opened
  public void characters(char[] buf, int offset, int len)  throws SAXException
    String s = new String(buf, offset, len); //<H1>This is a heading</H1> ---- s = "This is a heading"
  public void endDocument() throws Exception
    the end of the document
}In the MyHandler class you will tell the JVM what to do when each method is automatically triggered
this will enable you to obtain the required data from the html data

Similar Messages

  • Automator to extract words from htm pages

    I am trying to perform an action in Automator to extract a single text from many htm pages. The problem is that after this single text, comes reference composed by numbers and letters. Example: I have 872 archives of htm pages which I have to extract it: (inventory number:78899b3). Is it possible to extract? This different reference of which page and send to a text edit application?
    Thank you

    Hi BDAqua, glad to see you again. The probem is :
    872 htm archives which contains a text with many informations but the one I need in each of these pages is Inventory number:349999b for example. Inventory number is there in these 872 pages, but the reference number changes in every one. I could, by my self, transform all these pages in one Word document, Is it help? The result I' am waiting for is to have a list of these inventory numbers in a single list. Sorry if I was not clear but I am trying to learn english to.
    thank you BDAqua.

  • How to extract text from a PDF file?

    Hello Suners,
    i need to know how to extract text from a pdf file?
    does anyone know what is the character encoding in pdf file, when i use an input stream to read the file it gives encrypted characters not the original text in the file.
    is there any procedures i should do while reading a pdf file,
    File f=new File("D:/File.pdf");
                   FileReader fr=new FileReader(f);
                   BufferedReader br=new BufferedReader(fr);
                   String s=br.readLine();any help will be deeply appreciated.

    jverd wrote:
    First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading. the case.oops you are absolutely right that was a silly mistake to forget that,
    Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.getPageContent return array of bytes so the question will be:
    how to get text from this array? i was thinking of :
        private void jButton1_actionPerformed(ActionEvent e) {
            PdfReader read;
            StringBuffer buff=new StringBuffer();
            try {
                read = new PdfReader("d:/getjobid2727.pdf");
                read.getMetaData();
                byte[] data=read.getPageContent(1);
                int i=0;
                while(i>-1){ 
                    buff.append(data);
    i++;
    String str=buff.toString();
    FileOutputStream fos = new FileOutputStream("D:/test.txt");
    Writer out = new OutputStreamWriter(fos, "UTF8");
    out.write(str);
    out.close();
    read.close();
    } catch (Exception f) {
    f.printStackTrace();
    "D:/test.txt"  hasn't been created!! when i ran the program,
    is my steps right?                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

  • Problem to extract text from HTML document

    I have to extract some text from HTML file to my database. (about 1000 files)
    The HTML files are get from ACM Digital Library. http://portal.acm.org/dl.cfm
    The HTML page is about the information of a paper. I only want to get the text of "Title" "Abstract" "Classification" "Keywords"
    The Problem is that I can't find any patten to parser the html files"
    EX: I need to get the Classification = "Theory of Computation","ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY","Numerical Algorithms and Problem","Mathematics of Computing","NUMERICAL ANALYSIS"......etc .
    The section code about "Classification" is below.
    Please give any idea to do this, or how to find patten to extract text from this.
    <div class="indterms"><a href="#CIT"><img name="top" src=
    "img/arrowu.gif" hspace="10" border="0" /></a><span class=
    "heading"><a name="IndexTerms">INDEX TERMS</a></span>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Primary Classification:</a></span><br />
    � <b>F.</b> <a href=
    "results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory of Computation</a><br />
    � <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>F.2</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
    COMPLEXITY</a><br />
    � � � <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>F.2.1</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Numerical Algorithms and Problems</a><br />
    </p>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Additional�Classification:</a></span><br />
    � <b>G.</b> <a href=
    "results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Mathematics of Computing</a><br />
    � <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>G.1</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">NUMERICAL ANALYSIS</a><br />
    � � � <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>G.1.6</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Optimization</a><br />
    � � � � � <img src="img/tree.gif" border=
    "0" height="20" width="20" /> <b>Subjects:</b> <a href=
    "results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Linear programming</a><br />
    </p>
    <br />
    <p class="GenTerms"><span class="heading"><a name=
    "GenTerms">General Terms:</a></span><br />
    <a href=
    "results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Algorithms</a>, <a href=
    "results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory</a></p>
    <br />
    <p class="keywords"><span class="heading"><a name=
    "Keywords">Keywords:</a></span><br />
    <a href=
    "results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Simplex method</a>, <a href=
    "results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">complexity</a>, <a href=
    "results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">perturbation</a>, <a href=
    "results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">smoothed analysis</a></p>
    </div>

    One approach is to download Htmlparser from sourceforge
    http://htmlparser.sourceforge.net/ and write the rules to match title, abstract etc.
    Another approach is to write your own parser that extract only title, abstract etc.
    1. tokenize the html file. --> convert html into tokens (tag and value)
    2. write a simple parser to extract certain information
    find out about the pattern of text you want to extract. For instance "<class "abstract">.
    then writing a rule for extracting abstract such as
    if (tag is abstract ) then extract abstract text
    apply the same concept for other tags
    Attached is the sample parser that was used to extract title and abstract from acm html files. Please modify to include keyword and other fields.
    good luck
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    public class ACMHTMLParser
         private String m_filename;
         private URLLexicalAnalyzer lexical;
         List urls = new ArrayList();
         public ACMHTMLParser(String filename)
              super();
              m_filename = filename;
          * parses only title and abstract
         public void parse() throws Exception
              lexical = new URLLexicalAnalyzer(m_filename);
              String word = lexical.getNextWord();
              boolean isabstract = false;
              while (null != word)
                   if (isTag(word))
                        if (isTitle(word))
                             System.out.println("TITLE: " + lexical.getNextWord());
                        else if (isAbstract(word) && !isabstract)
                             parseAbstract();
                             isabstract = true;
                   word = lexical.getNextWord();
              lexical.close();
         public static void main(String[] args) throws Exception
              ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
              parser.parse();
         public static boolean isTag(String word)
              return ( word.startsWith("<") && word.endsWith(">"));
         public static boolean isTitle(String word)
              return ( "<title>".equals(word));
         //please modify according to the html source
         public static boolean isAbstract(String word)
              return ( "<p class=\"abstract\">".equals(word));
         private void parseAbstract() throws Exception
              while (true)
                   String abs = lexical.getNextWord();
                   if (!isTag(abs))
                        System.out.println(abs);
                        break;
         class URLLexicalAnalyzer
           private BufferedReader m_reader;
           private boolean isTag;
           public URLLexicalAnalyzer(String filename)
              try
                m_reader = new BufferedReader(new FileReader(filename));
              catch (IOException io)
                System.out.println("ERROR, file not found " + filename);
                System.exit(1);
           public URLLexicalAnalyzer(InputStream in)
              m_reader = new BufferedReader(new InputStreamReader(in));
           public void close()
              try {
                if (null != m_reader) m_reader.close();
              catch (IOException ignored) {}
           public String getNextWord() throws IOException
              int c = m_reader.read();   
              if (-1 == c) return null; 
              if (Character.isWhitespace((char)c))
                return getNextWord();
              if ('<' == c || isTag)
                return scanTag(c);
              else
                   return scanValue(c);
           private String scanTag(final int c)
              throws IOException
              StringBuffer result = new StringBuffer();
              if ('<' != c) result.append('<');
              result.append((char)c);
              int ch = -1;
              while (true)
                ch = m_reader.read();
                if (-1 == ch) throw new IllegalArgumentException("un-terminate tag");
                if ('>' == ch)
                     isTag = false;
                     break;
                result.append((char)ch);
              result.append((char)ch);
              return result.toString();
           private String scanValue(final int c) throws IOException
                StringBuffer result = new StringBuffer();
                result.append((char)c);
                int ch = -1;
                while (true)
                   ch = m_reader.read();
                   if (-1 == ch) throw new IllegalArgumentException("un-terminate value");
                   if ('<' == ch)
                        isTag = true;
                        break;
                   result.append((char)ch);
                return result.toString();
    }

  • Extract Text from pdf using C#

    Hi,
    We are Solution developer using Acrobat,as we have reuirement of extracting text from pdf using C# we have downloaded adobe sdk and installed. We have found only four exmaples in C# and those are used only for viewing pdf in windows application. Can you please guide us how to extract text from pdf using SDK in C#.
    Thanks you for your help.
    Regards
    kiranmai

    Okay so I went ahead and actually added the text extraction functionality to my own C# application, since this was a requested feature by the client anyhow, which originally we were told to bypass if it wasn't "cut and dry", but it wasn't bad so I went ahead and gave the client the text extraction that they wanted. Decided I'd post the source code here for you. This returns the text from the entire document as a string.
           private static string GetText(AcroPDDoc pdDoc)
                AcroPDPage page;
                int pages = pdDoc.GetNumPages();
                string pageText = "";
                for (int i = 0; i < pages; i++)
                    page = (AcroPDPage)pdDoc.AcquirePage(i);
                    object jso, jsNumWords, jsWord;
                    List<string> words = new List<string>();
                    try
                        jso = pdDoc.GetJSObject();
                        if (jso != null)
                            object[] args = new object[] { i };
                            jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                            int numWords = Int32.Parse(jsNumWords.ToString());
                            for (int j = 0; j <= numWords; j++)
                                object[] argsj = new object[] { i, j, false };
                                jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                                words.Add((string)jsWord);
                        foreach (string word in words)
                            pageText += word;
                    catch
                return pageText;

  • Applescript or workflow to extract text from PDF and rename PDF with the results

    Hi Everyone,
    I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistantly, or they are supplied as multi-page PDFs.
    What I need to do is name each PDF with the code which is in the text on the PDF.
    It would work like this in an ideal world:
    1. Split PDF into single pages
    2. Extract text from PDF
    3. Rename PDF using the extracted text
    I'm struggling with part 3!
    I can get a textfile with just the code (using a call to BBEDIT I'm extracting the code)
    I did think about using a variable for the name, but the rename functions doesn't let me use variables.

    Hello
    You may also try the following applescript script, which is a wrapper of rubycocoa script. It will ask you choose source pdf files and destination directory. Then it will scan text of each page of pdf files for the predefined pattern and save the page as new pdf file with the name as extracted by the pattern in the destination directory. Those pages which do not contain string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of script.)
    Currently the regex pattern is set to:
    /HB-.._[0-9]{6}/
    which means HB- followed by two characters and _ and 6 digits.
    Minimally tested under 10.6.8.
    Hope this may help,
    H
    _main()
    on _main()
        script o
            property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
                default location (path to desktop) with multiple selections allowed
            set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
                default location (path to desktop)
            set args to ""
            repeat with a in my aa
                set args to args & a's POSIX path's quoted form & space
            end repeat
            considering numeric strings
                if (system info)'s system version < "10.9" then
                    set ruby to "/usr/bin/ruby"
                else
                    set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
                end if
            end considering
            do shell script ruby & " <<'EOF' - " & args & "
    require 'osx/cocoa'
    include OSX
    require_framework 'PDFKit'
    outdir = ARGV.shift.chomp('/')
    ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
        url = NSURL.fileURLWithPath(f)
        doc = PDFDocument.alloc.initWithURL(url)
        path = doc.documentURL.path
        pcnt = doc.pageCount
        (0 .. (pcnt - 1)).each do |i|
            page = doc.pageAtIndex(i)
            page.string.to_s =~ /HB-.._[0-9]{6}/
            name = $&
            unless name
                puts \"no matching string in page #{i + 1} of #{path}\"
                next # ignore this page
            end
            doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
            unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
                puts \"failed to save page #{i + 1} of #{path}\"
            end
        end
    end
    EOF"
        end script
        tell o to run
    end _main

  • How to extract text from a PDF file using php?

    How to extract text from a PDF file using php?
    thanks
    fabio

    > Do you know of any other way this can be done?
    There are many ways. But this out of scope of this forum. You can try this forum: http://forum.planetpdf.com/

  • How to read/extract text from pdf

    Respected All,
    I want to read/extract text from pdf. I tried using etymon but not succed.
    Could anyone will guide me in this.
    Thanks and regards,
    Ajay.

    Thank you very much Abhilshit, PDFBox works for reading pdf.
    Regards,
    Ajay.

  • Read Text from HTML-Pages and want to solve "ChangedCharSetException"

    Hello,
    I have an app that connect via threads with pages and parse them an gives me only the Text-version of a HTML-page. Works fine, but if it found a page, where the text is within images, than the whole app stopps and gave me the message:
    javax.swing.text.ChangedCharSetException
            at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:169)
            at javax.swing.text.html.parser.Parser.startTag(Parser.java:372)
            at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1846)
            at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1881)
            at javax.swing.text.html.parser.Parser.parse(Parser.java:2047)
            at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:106)
            at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:78)
            at aufruf.main(aufruf.java:33)So I tried to catch them with "getCharSetSpec()" and "keyEqualsCharSet( )" from the class "javax.swing.text.ChangedCharSetException" and hoped that this solved the problem. But still doesen't work...
    Then I looked at the web and found, that I have to add the line:
    doc.putProperty("IgnoreCharsetDirective", new Boolean(true));"doc." is a new HTML Dokument, created with the HTMLEditorKit. I do not have much knowledge about that and so I hope, that someone can explain me, how I can solve that problem, within my code.
    Here we go:
    import javax.swing.text.*;
    import java.lang.*;
    import java.util.*;
    import java.net.*;
    import java.io.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;
    public class myParser extends Thread
            private String name;
            public void run()
                    try
                            URL viele = new URL(name);                       // "name" ia a variable with a lot of links
                    URLConnection hs = viele.openConnection();
                    hs.connect();
                    if (hs.getContentType().startsWith("text/html"))
                            InputStream is = hs.getInputStream();
                            InputStreamReader isr = new InputStreamReader(is);
                            BufferedReader br = new BufferedReader(isr);
                            Lesen los = new Lesen();
                            ParserDelegator parser = new ParserDelegator();
                            parser.parse(br,los, false);
            catch (MalformedURLException e)
                    System.err.print("Doesn't work");
            catch (ChangedCharSetException e)
                    e.getCharSetSpec();
                    e.keyEqualsCharSet();
                    e.printStackTrace();
            catch (Exception o)
            public void vowi(String n)
                    name = n;
    }and for the case that it is important here is the class "Lesen"
    import java.net.*;
    import java.io.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;
    class Lesen extends HTMLEditorKit.ParserCallback
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
                    try
                            if ((t==HTML.Tag.P) || (t==HTML.Tag.H1) || (t==HTML.Tag.H2) || (t==HTML.Tag.H3) || (t==HTML.Tag.H4) || (t==HTML.Tag.H5) || (t==HTML.Tag.H6))
                                    System.out.println();
                    catch (Exception q)
                            System.out.println(q.getMessage());
            public void handleSimpleTag(HTML.Tag t,MutableAttributeSet a, int pos)
                    try
                            if (t==HTML.Tag.BR)
                                    System.out.println(); // Neue Zeile
                                    System.out.println();
                    catch (Exception qw)
                            System.out.println(qw.getMessage());
            public void handleText(char[] data, int pos)
                    try
                            System.out.print(data);                                           // prints the text from HTML-pages
                    catch (Exception ab)
                            System.out.println(ab.getMessage());
    }Thanks a lot for helping...
    Stephan

    parser.parse(br,los, false);
    parser.parse(br,los, true);

  • I have been unable to print from FF to Brother HL-1440. print generate pages w/odd bits of text from the page, and sometimes one image. If I want to print off the web, I have to print PDF. Older versions of FF print fine.

    I have been unable to print from FF to Brother HL-1440. print generate pages w/odd bits of text from the page, and sometimes one image. If I want to print off the web, I have to print PDF. Older versions of FF print fine.

    I downloaded OpenOffice and am able to print documents (in particular, the same document I created in MS Word). I guess MS Word is to blame. I believe the issue is concluded, but I'd welcome any comments about the Active Directory, as I might have a problem
    with OpenOffice in the future regarding that.

  • Extract text from pdf

    Hi, is it possible to extract text from a pdf file using the command line to get an output like you would get by using the File menu and then 'Save as text..."?
    I also noticed that in the installation folder there is a small executable called AcroTextExtractor which sounds interesting, but I was unable to figure out how to use it.

    what's wrong with using automator for this? this certainly seems the easiest. I'm not aware of any built in apple script commands that will do this. But You should also ask on the Apple script forum under Mac OS Technologies.
    Message was edited by: V.K.

  • How do I extract text from an email?

    Hello!
    I am in the process of trying to automate orders from my website. How do I extract text from an email and paste it into specific cells in an Excel spreadsheet using Automator?
    Many thanks,
    Toby Bateson

    If you select the message on the Inbox list, or open the message, you can then go to the Message menu of Mail and select Remove Attachments.
    Bob N.
    Mac Mini 1.5 GHz; iBook 900 mHz; iPod 20 GB   Mac OS X (10.4.7)  

  • How can i extract text from Power point files,wod files,pdf files

    hi friends,
    i need to extract text from the power point files,word files,pdf files for my application.Is it possible to extract the text from the those files .If yes plz give solution to this problem.i would be thankful if u givve solution to this problem.

    My reply would be the same.
    http://forum.java.sun.com/thread.jspa?threadID=676559&tstart=0

  • Can Pages copy text from one page to another automatically?

    Can Pages automatically copy text from one page to another?
    For example recurring text in a document e.g. addressee details or a date (not today's date).
    I want to type the topic title of one letter and have it automatically filled into the correct field in the next page which will be a letter to a different recipient.
    The letter text will be different for each recipient so mail merge probably won't work.
    Thanks
    Andy

    Thanks that worked great!!!(for the first line)
    Now why wont it populate more than the one other text field?
    F.Lic.Table1.Row1.agt_ssn_lic.rawValue = xfa.event.newText;
    F.PAGE3.agt_ssn_1.rawValue = xfa.event.newText;
    F.PAGE4.agt_ssn_2.rawValue = xfa.event.newText;

  • Extract text from Powerpoint

    Can anyone tell me whether it is possible to extract text from Powerpoint. Please help me out.

    hi,it is possible if you can pay a lot of money,
    since I just finished developed a web search app
    with my classmates,but we do not do the tranfer
    from pp to text ,the client did not ask for it
    because he do not wanna to pay so much money.

Maybe you are looking for