Applescript or workflow to extract text from PDF and rename PDF with the results

Hi Everyone,
I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistantly, or they are supplied as multi-page PDFs.
What I need to do is name each PDF with the code which is in the text on the PDF.
It would work like this in an ideal world:
1. Split PDF into single pages
2. Extract text from PDF
3. Rename PDF using the extracted text
I'm struggling with part 3!
I can get a textfile with just the code (using a call to BBEDIT I'm extracting the code)
I did think about using a variable for the name, but the rename functions doesn't let me use variables.

Hello
You may also try the following applescript script, which is a wrapper of rubycocoa script. It will ask you choose source pdf files and destination directory. Then it will scan text of each page of pdf files for the predefined pattern and save the page as new pdf file with the name as extracted by the pattern in the destination directory. Those pages which do not contain string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of script.)
Currently the regex pattern is set to:
/HB-.._[0-9]{6}/
which means HB- followed by two characters and _ and 6 digits.
Minimally tested under 10.6.8.
Hope this may help,
H
_main()
on _main()
    script o
        property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
            default location (path to desktop) with multiple selections allowed
        set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
            default location (path to desktop)
        set args to ""
        repeat with a in my aa
            set args to args & a's POSIX path's quoted form & space
        end repeat
        considering numeric strings
            if (system info)'s system version < "10.9" then
                set ruby to "/usr/bin/ruby"
            else
                set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
            end if
        end considering
        do shell script ruby & " <<'EOF' - " & args & "
require 'osx/cocoa'
include OSX
require_framework 'PDFKit'
outdir = ARGV.shift.chomp('/')
ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
    url = NSURL.fileURLWithPath(f)
    doc = PDFDocument.alloc.initWithURL(url)
    path = doc.documentURL.path
    pcnt = doc.pageCount
    (0 .. (pcnt - 1)).each do |i|
        page = doc.pageAtIndex(i)
        page.string.to_s =~ /HB-.._[0-9]{6}/
        name = $&
        unless name
            puts \"no matching string in page #{i + 1} of #{path}\"
            next # ignore this page
        end
        doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
        unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
            puts \"failed to save page #{i + 1} of #{path}\"
        end
    end
end
EOF"
    end script
    tell o to run
end _main

Similar Messages

  • Retrieving Data from SQLite and make link with the ID of the retrieved data?

    Hello, first of all sorry if my english is bad as im from Mexico.
    I using HTML/Javascript/SQLite, and im having an issue while retrieving data from SQLite, all my querys work right, but when i try to select an Id, the Name, Firstname, and Lastname and only use the Name, Firstname and Lastname to print on screen and the Id to use as a link or OnClick query with the id more info from the name of the person..
    My SQLite query is the next one.
    var sql = "SELECT [sis_persona].[idPersona], [sis_persona].[Nombre], [sis_persona].[Paterno], [sis_persona].[Materno] FROM  [sis_persona] WHERE Nombre  LIKE '"+ nombre1+"%' AND Paterno LIKE '"+ paterno1+"%' AND Materno LIKE '"+ materno1+"%' ORDER BY [sis_persona].[Nombre], [sis_persona].[Paterno], [sis_persona].[Materno] LIMIT 20";
    And my ResultHandler is the next one:
        row = document.createElement("tr");
                    cell = document.createElement("th");
                    cell.innerText = "Nombre";
                    row.appendChild(cell)       
                    cell = document.createElement("th");
                    cell.innerText = "Paterno";
                    row.appendChild(cell)       
                    cell = document.createElement("th");
                    cell.innerText = "Materno";
                    row.appendChild(cell)       
                    tbl.appendChild(row);
                    var numRows = result.data.length;
                    for (var i = 0; i < numRows; i++)
                        // iterate over the columns in the result row object
                        row = document.createElement("tr");
                        for (col in result.data[i])
                var ea = result.data[i];
                cell = document.createElement("td");
                var a  = document.createElement( 'a' );
                a.setAttribute('HREF','#');
                var txt = document.createTextNode( ea.Nombre );
                a.appendChild( txt );
                a.onclick = function() {querytest(ea.idPersona);};
                row.appendChild( a );
                row.appendChild(cell);
                            cell = document.createElement("td");
                var a  = document.createElement( 'a' );
                a.setAttribute('HREF','#');
                var txt = document.createTextNode( ea.Paterno );
                a.appendChild( txt );
                a.onclick = function() {querytest(ea.idPersona);};
                row.appendChild( a );
                row.appendChild(cell);
                        cell = document.createElement("td");
                var a  = document.createElement( 'a' );
                a.setAttribute('HREF','#');
                var txt = document.createTextNode( ea.Materno );
                a.appendChild( txt );
                a.onclick = function() {querytest(ea.idPersona);};
                row.appendChild( a );
                row.appendChild(cell);
                        tbl.appendChild(row);
    And this is where i have the problem, it print three times and i dont know how to take print only the result and make a link with the id from the query.
    var ea = result.data[i];
                cell = document.createElement("td");
                var a  = document.createElement( 'a' );
                a.setAttribute('HREF','#');
                var txt = document.createTextNode( ea.Nombre );
                a.appendChild( txt );
                a.onclick = function() {querytest(ea.idPersona);};
                row.appendChild( a );
                row.appendChild(cell);
    Anyone have any ideas?
    Sorry again if my english is bad.
    Thanks in advance

    Hi Prashant,
    Continuation to above my thread. There is a small correction.Directly in lsmw not possible. I uploaded using the same code of lsmw and made one report. Through that i uploaded the data directly selecting the file.
    Regards,
    Madhu.

  • Printing from iPhone and iPad directly with the USB cable

    Is it possible to connect an iPad directly to a printer via USB to print? Do I need to install any software on my iPad? Will this also work with iPhone?

    you still need a Pc or Mac connected via usb to the printer.
    there are programs that allow you to see fron iOs the printer as Airprint compatible, i dont remember the name of these programs.
    obiously the Mac/Pc and iPhone /iPad should be on the same wifi network

  • How to extract text from a PDF file?

    Hello Suners,
    i need to know how to extract text from a pdf file?
    does anyone know what is the character encoding in pdf file, when i use an input stream to read the file it gives encrypted characters not the original text in the file.
    is there any procedures i should do while reading a pdf file,
    File f=new File("D:/File.pdf");
                   FileReader fr=new FileReader(f);
                   BufferedReader br=new BufferedReader(fr);
                   String s=br.readLine();any help will be deeply appreciated.

    jverd wrote:
    First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading. the case.oops you are absolutely right that was a silly mistake to forget that,
    Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.getPageContent return array of bytes so the question will be:
    how to get text from this array? i was thinking of :
        private void jButton1_actionPerformed(ActionEvent e) {
            PdfReader read;
            StringBuffer buff=new StringBuffer();
            try {
                read = new PdfReader("d:/getjobid2727.pdf");
                read.getMetaData();
                byte[] data=read.getPageContent(1);
                int i=0;
                while(i>-1){ 
                    buff.append(data);
    i++;
    String str=buff.toString();
    FileOutputStream fos = new FileOutputStream("D:/test.txt");
    Writer out = new OutputStreamWriter(fos, "UTF8");
    out.write(str);
    out.close();
    read.close();
    } catch (Exception f) {
    f.printStackTrace();
    "D:/test.txt"  hasn't been created!! when i ran the program,
    is my steps right?                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

  • How to extract text from a PDF file using php?

    How to extract text from a PDF file using php?
    thanks
    fabio

    > Do you know of any other way this can be done?
    There are many ways. But this out of scope of this forum. You can try this forum: http://forum.planetpdf.com/

  • How to read/extract text from pdf

    Respected All,
    I want to read/extract text from pdf. I tried using etymon but not succed.
    Could anyone will guide me in this.
    Thanks and regards,
    Ajay.

    Thank you very much Abhilshit, PDFBox works for reading pdf.
    Regards,
    Ajay.

  • Extract Text from pdf using C#

    Hi,
    We are Solution developer using Acrobat,as we have reuirement of extracting text from pdf using C# we have downloaded adobe sdk and installed. We have found only four exmaples in C# and those are used only for viewing pdf in windows application. Can you please guide us how to extract text from pdf using SDK in C#.
    Thanks you for your help.
    Regards
    kiranmai

    Okay so I went ahead and actually added the text extraction functionality to my own C# application, since this was a requested feature by the client anyhow, which originally we were told to bypass if it wasn't "cut and dry", but it wasn't bad so I went ahead and gave the client the text extraction that they wanted. Decided I'd post the source code here for you. This returns the text from the entire document as a string.
           private static string GetText(AcroPDDoc pdDoc)
                AcroPDPage page;
                int pages = pdDoc.GetNumPages();
                string pageText = "";
                for (int i = 0; i < pages; i++)
                    page = (AcroPDPage)pdDoc.AcquirePage(i);
                    object jso, jsNumWords, jsWord;
                    List<string> words = new List<string>();
                    try
                        jso = pdDoc.GetJSObject();
                        if (jso != null)
                            object[] args = new object[] { i };
                            jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                            int numWords = Int32.Parse(jsNumWords.ToString());
                            for (int j = 0; j <= numWords; j++)
                                object[] argsj = new object[] { i, j, false };
                                jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                                words.Add((string)jsWord);
                        foreach (string word in words)
                            pageText += word;
                    catch
                return pageText;

  • Extract text from pdf

    Hi, is it possible to extract text from a pdf file using the command line to get an output like you would get by using the File menu and then 'Save as text..."?
    I also noticed that in the installation folder there is a small executable called AcroTextExtractor which sounds interesting, but I was unable to figure out how to use it.

    what's wrong with using automator for this? this certainly seems the easiest. I'm not aware of any built in apple script commands that will do this. But You should also ask on the Apple script forum under Mac OS Technologies.
    Message was edited by: V.K.

  • How can i extract text from Power point files,wod files,pdf files

    hi friends,
    i need to extract text from the power point files,word files,pdf files for my application.Is it possible to extract the text from the those files .If yes plz give solution to this problem.i would be thankful if u givve solution to this problem.

    My reply would be the same.
    http://forum.java.sun.com/thread.jspa?threadID=676559&tstart=0

  • Extracting text from PDF in columns?

    I'm trying to extract the text from a number of PDFs in which the text is in two columns. When I copy the txt and paste it into Word, it ends up being in a single column, the width of one of the original columns, but with a hard return at the end of each line.
    I don't really care if the extracted text ends up being in one column or two, I just want it not to have a return at the end of each line. Can anyone suggest an alternative method?

    i get the following error message:
    Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
    at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
    at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
    at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at extracttext.Main.main(Main.java:55)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 1 second)
    I would appreciate if you can please help me writing a java program that can extract only test from a pdf file

  • Problem to extract text from HTML document

    I have to extract some text from HTML file to my database. (about 1000 files)
    The HTML files are get from ACM Digital Library. http://portal.acm.org/dl.cfm
    The HTML page is about the information of a paper. I only want to get the text of "Title" "Abstract" "Classification" "Keywords"
    The Problem is that I can't find any patten to parser the html files"
    EX: I need to get the Classification = "Theory of Computation","ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY","Numerical Algorithms and Problem","Mathematics of Computing","NUMERICAL ANALYSIS"......etc .
    The section code about "Classification" is below.
    Please give any idea to do this, or how to find patten to extract text from this.
    <div class="indterms"><a href="#CIT"><img name="top" src=
    "img/arrowu.gif" hspace="10" border="0" /></a><span class=
    "heading"><a name="IndexTerms">INDEX TERMS</a></span>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Primary Classification:</a></span><br />
    � <b>F.</b> <a href=
    "results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory of Computation</a><br />
    � <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>F.2</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
    COMPLEXITY</a><br />
    � � � <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>F.2.1</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Numerical Algorithms and Problems</a><br />
    </p>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Additional�Classification:</a></span><br />
    � <b>G.</b> <a href=
    "results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Mathematics of Computing</a><br />
    � <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>G.1</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">NUMERICAL ANALYSIS</a><br />
    � � � <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>G.1.6</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Optimization</a><br />
    � � � � � <img src="img/tree.gif" border=
    "0" height="20" width="20" /> <b>Subjects:</b> <a href=
    "results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Linear programming</a><br />
    </p>
    <br />
    <p class="GenTerms"><span class="heading"><a name=
    "GenTerms">General Terms:</a></span><br />
    <a href=
    "results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Algorithms</a>, <a href=
    "results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory</a></p>
    <br />
    <p class="keywords"><span class="heading"><a name=
    "Keywords">Keywords:</a></span><br />
    <a href=
    "results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Simplex method</a>, <a href=
    "results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">complexity</a>, <a href=
    "results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">perturbation</a>, <a href=
    "results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">smoothed analysis</a></p>
    </div>

    One approach is to download Htmlparser from sourceforge
    http://htmlparser.sourceforge.net/ and write the rules to match title, abstract etc.
    Another approach is to write your own parser that extract only title, abstract etc.
    1. tokenize the html file. --> convert html into tokens (tag and value)
    2. write a simple parser to extract certain information
    find out about the pattern of text you want to extract. For instance "<class "abstract">.
    then writing a rule for extracting abstract such as
    if (tag is abstract ) then extract abstract text
    apply the same concept for other tags
    Attached is the sample parser that was used to extract title and abstract from acm html files. Please modify to include keyword and other fields.
    good luck
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    public class ACMHTMLParser
         private String m_filename;
         private URLLexicalAnalyzer lexical;
         List urls = new ArrayList();
         public ACMHTMLParser(String filename)
              super();
              m_filename = filename;
          * parses only title and abstract
         public void parse() throws Exception
              lexical = new URLLexicalAnalyzer(m_filename);
              String word = lexical.getNextWord();
              boolean isabstract = false;
              while (null != word)
                   if (isTag(word))
                        if (isTitle(word))
                             System.out.println("TITLE: " + lexical.getNextWord());
                        else if (isAbstract(word) && !isabstract)
                             parseAbstract();
                             isabstract = true;
                   word = lexical.getNextWord();
              lexical.close();
         public static void main(String[] args) throws Exception
              ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
              parser.parse();
         public static boolean isTag(String word)
              return ( word.startsWith("<") && word.endsWith(">"));
         public static boolean isTitle(String word)
              return ( "<title>".equals(word));
         //please modify according to the html source
         public static boolean isAbstract(String word)
              return ( "<p class=\"abstract\">".equals(word));
         private void parseAbstract() throws Exception
              while (true)
                   String abs = lexical.getNextWord();
                   if (!isTag(abs))
                        System.out.println(abs);
                        break;
         class URLLexicalAnalyzer
           private BufferedReader m_reader;
           private boolean isTag;
           public URLLexicalAnalyzer(String filename)
              try
                m_reader = new BufferedReader(new FileReader(filename));
              catch (IOException io)
                System.out.println("ERROR, file not found " + filename);
                System.exit(1);
           public URLLexicalAnalyzer(InputStream in)
              m_reader = new BufferedReader(new InputStreamReader(in));
           public void close()
              try {
                if (null != m_reader) m_reader.close();
              catch (IOException ignored) {}
           public String getNextWord() throws IOException
              int c = m_reader.read();   
              if (-1 == c) return null; 
              if (Character.isWhitespace((char)c))
                return getNextWord();
              if ('<' == c || isTag)
                return scanTag(c);
              else
                   return scanValue(c);
           private String scanTag(final int c)
              throws IOException
              StringBuffer result = new StringBuffer();
              if ('<' != c) result.append('<');
              result.append((char)c);
              int ch = -1;
              while (true)
                ch = m_reader.read();
                if (-1 == ch) throw new IllegalArgumentException("un-terminate tag");
                if ('>' == ch)
                     isTag = false;
                     break;
                result.append((char)ch);
              result.append((char)ch);
              return result.toString();
           private String scanValue(final int c) throws IOException
                StringBuffer result = new StringBuffer();
                result.append((char)c);
                int ch = -1;
                while (true)
                   ch = m_reader.read();
                   if (-1 == ch) throw new IllegalArgumentException("un-terminate value");
                   if ('<' == ch)
                        isTag = true;
                        break;
                   result.append((char)ch);
                return result.toString();
    }

  • How do I extract text from an email?

    Hello!
    I am in the process of trying to automate orders from my website. How do I extract text from an email and paste it into specific cells in an Excel spreadsheet using Automator?
    Many thanks,
    Toby Bateson

    If you select the message on the Inbox list, or open the message, you can then go to the Message menu of Mail and select Remove Attachments.
    Bob N.
    Mac Mini 1.5 GHz; iBook 900 mHz; iPod 20 GB   Mac OS X (10.4.7)  

  • Extract text from Powerpoint

    Can anyone tell me whether it is possible to extract text from Powerpoint. Please help me out.

    hi,it is possible if you can pay a lot of money,
    since I just finished developed a web search app
    with my classmates,but we do not do the tranfer
    from pp to text ,the client did not ask for it
    because he do not wanna to pay so much money.

  • Extract Text from PostScript

    Can anyone please tell me how to extract text from postscript

    Ghostscript is a postscript interpreter (one can actually program in postscript, and in fact postscript is a language BTW/FYI). Ghostscript is written in C (maybe C++) and yes, you can output to several different formats. I doubt you can go to HTML as it wouldn't really make much sense.
    XML needs a DTD to make it of any use. You will have to call ghostscript through Runtime.exec() to use it, it will extract out purely the text from the PostScript ASSUMING the PS file contains text in that manner; it is possible to have PS text as output images in which case GS won't pick it up.

  • Extracting text from header, body of an InCopy table...

    Hi folks, I have a script that runs in InCopy CS3 on the Windows platform and part of it extracts text from the header and body parts of a table. If the insertion point is in the header, I can put the text into a variable using...
    var textToExtract = app.selection[0].parentStory.contents;
    Same scenario if the insertion point is in the body of the table. Anyway, I'm looking for a way to set the insertion point in the Header or Body sections of the table or, better yet, a way of extracting the data directly from those containers. Any ideas are, of course, appreciated. Thanks, Wil

    Yes, I am stuck with the same problem.... any ideas out there?
    thanks

Maybe you are looking for