Extracting text from selected stories

Platform : Windows
Version: CS2
Language: VB Script
I run a weekly newspaper, and each week I want to repurpose approximately 60 pages for the web.
The stories are unstructured, and each page is a separate InDesign document.
I thought that a headline and story text box could be selected manually and a script run to copy the contents to the clipboard, with a marker to split the headline from the story that follows.
Equally, in the case of a picture, a path and caption could be captured.
The data could then be handled by a VB programme, to create a file for each story.
I know this seems quite laborious, but with shortcuts, the job will take about an hour.
Is there a script somewhere that I could look at and modify to do something like this?
Alternatively, is there a better way of repurposing unstructured documents for the web?
TIA

Hi John,
Try the ExportAllStories.vbs example script--it'll export all of the stories in a document into a folder. That's probably not exactly what you want, but it'll give you a reasonable place to start.
You can find it in the CS2 sample scripts archive, at:
http://download.adobe.com/pub/adobe/indesign/InDesign_Example_Scripts.zip
Thanks,
Ole

Similar Messages

  • How do I extract text from an email?

    Hello!
    I am in the process of trying to automate orders from my website. How do I extract text from an email and paste it into specific cells in an Excel spreadsheet using Automator?
    Many thanks,
    Toby Bateson

    If you select the message on the Inbox list, or open the message, you can then go to the Message menu of Mail and select Remove Attachments.
    Bob N.

  • Applescript or workflow to extract text from PDF and rename PDF with the results

    Hi Everyone,
    I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistently, or they are supplied as multi-page PDFs.
    What I need to do is name each PDF with the code which is in the text on the PDF.
    It would work like this in an ideal world:
    1. Split PDF into single pages
    2. Extract text from PDF
    3. Rename PDF using the extracted text
    I'm struggling with part 3!
    I can get a text file containing just the code (I'm extracting the code with a call to BBEdit).
    I did think about using a variable for the name, but the rename function doesn't let me use variables.

    Hello
    You may also try the following AppleScript, which is a wrapper around a RubyCocoa script. It will ask you to choose the source PDF files and a destination directory. Then it will scan the text of each page of the PDF files for the predefined pattern and save the page as a new PDF file in the destination directory, named after the string the pattern matched. Pages that contain no string matching the pattern are ignored. (Ignored pages, if any, are reported in the result of the script.)
    Currently the regex pattern is set to:
    /HB-.._[0-9]{6}/
    which means HB- followed by two characters and _ and 6 digits.
    Minimally tested under 10.6.8.
    Hope this may help,
    H
    _main()
    on _main()
        script o
            property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
                default location (path to desktop) with multiple selections allowed
            set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
                default location (path to desktop)
            set args to ""
            repeat with a in my aa
                set args to args & a's POSIX path's quoted form & space
            end repeat
            considering numeric strings
                if (system info)'s system version < "10.9" then
                    set ruby to "/usr/bin/ruby"
                else
                    set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
                end if
            end considering
            do shell script ruby & " <<'EOF' - " & args & "
    require 'osx/cocoa'
    include OSX
    require_framework 'PDFKit'
    outdir = ARGV.shift.chomp('/')
    ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
        url = NSURL.fileURLWithPath(f)
        doc = PDFDocument.alloc.initWithURL(url)
        path = doc.documentURL.path
        pcnt = doc.pageCount
        (0 .. (pcnt - 1)).each do |i|
            page = doc.pageAtIndex(i)
            page.string.to_s =~ /HB-.._[0-9]{6}/
            name = $&
            unless name
                puts \"no matching string in page #{i + 1} of #{path}\"
                next # ignore this page
            end
            doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
            unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
                puts \"failed to save page #{i + 1} of #{path}\"
            end
        end
    end
    EOF"
        end script
        tell o to run
    end _main

  • Extracting text from header, body of an InCopy table...

    Hi folks, I have a script that runs in InCopy CS3 on the Windows platform and part of it extracts text from the header and body parts of a table. If the insertion point is in the header, I can put the text into a variable using...
    var textToExtract = app.selection[0].parentStory.contents;
    Same scenario if the insertion point is in the body of the table. Anyway, I'm looking for a way to set the insertion point in the Header or Body sections of the table or, better yet, a way of extracting the data directly from those containers. Any ideas are, of course, appreciated. Thanks, Wil

    Yes, I am stuck with the same problem.... any ideas out there?
    thanks

  • How to extract text from a PDF file?

    Hello Suners,
    I need to know how to extract text from a PDF file.
    Does anyone know what character encoding a PDF file uses? When I use an input stream to read the file, it gives what look like encrypted characters, not the original text in the file.
    Is there any procedure I should follow while reading a PDF file?
        File f = new File("D:/File.pdf");
        FileReader fr = new FileReader(f);
        BufferedReader br = new BufferedReader(fr);
        String s = br.readLine();
    Any help will be deeply appreciated.

    jverd wrote:
    > First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading.
    Oops, you are absolutely right; that was a silly mistake to forget that.
    > Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.
    getPageContent returns an array of bytes, so the question becomes: how do I get text from this array? I was thinking of:
        private void jButton1_actionPerformed(ActionEvent e) {
            PdfReader read;
            StringBuffer buff = new StringBuffer();
            try {
                read = new PdfReader("d:/getjobid2727.pdf");
                read.getMetaData();
                byte[] data = read.getPageContent(1);
                int i = 0;
                while (i > -1) {
                    buff.append(data);
                    i++;
                }
                String str = buff.toString();
                FileOutputStream fos = new FileOutputStream("D:/test.txt");
                Writer out = new OutputStreamWriter(fos, "UTF8");
                out.write(str);
                out.close();
                read.close();
            } catch (Exception f) {
                f.printStackTrace();
            }
        }
    "D:/test.txt" hasn't been created when I ran the program.
    Are my steps right?
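
    A possible explanation, offered as a guess rather than a diagnosis: StringBuffer has no append(byte[]) overload, so append(data) falls back to append(Object) and adds the array's toString() value, and a while (i > -1) loop keeps doing that for a very long time. The sketch below uses only the standard library and shows one way to turn a byte array into text and save it in a single step. The class name, the method names and the sample bytes are made up for illustration. Note that the bytes of a page content stream are PDF drawing operators, so the decoded result would still need a text-extraction step.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PageBytesToText {
    // Decode the whole array at once; no loop is needed.
    // ISO-8859-1 maps each byte to exactly one char, so no byte is lost.
    public static String decode(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Write the decoded text to a file as UTF-8.
    public static void save(byte[] data, Path target) throws IOException {
        Files.write(target, decode(data).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the bytes returned by getPageContent(1).
        byte[] data = "BT (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        Path out = Files.createTempFile("page", ".txt");
        save(data, out);
        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

    In the posted method, the equivalent change would be to replace the while loop with a single decode of data before writing the file.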

  • How to extract text from a PDF file using php?

    How to extract text from a PDF file using php?
    thanks
    fabio

    > Do you know of any other way this can be done?
    There are many ways, but this is out of scope for this forum. You can try this forum: http://forum.planetpdf.com/

  • How to read/extract text from pdf

    Respected All,
    I want to read/extract text from PDF. I tried using Etymon but did not succeed.
    Could anyone guide me in this?
    Thanks and regards,
    Ajay.

    Thank you very much Abhilshit, PDFBox works for reading pdf.
    Regards,
    Ajay.

  • Problem to extract text from HTML document

    I have to extract some text from HTML files into my database (about 1000 files).
    The HTML files come from the ACM Digital Library. http://portal.acm.org/dl.cfm
    Each HTML page holds the information about one paper. I only want to get the text of "Title", "Abstract", "Classification" and "Keywords".
    The problem is that I can't find any pattern with which to parse the HTML files.
    EX: I need to get the Classification = "Theory of Computation","ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY","Numerical Algorithms and Problems","Mathematics of Computing","NUMERICAL ANALYSIS"......etc.
    The section of code for "Classification" is below.
    Please give me any idea of how to do this, or how to find a pattern to extract the text.
    <div class="indterms"><a href="#CIT"><img name="top" src=
    "img/arrowu.gif" hspace="10" border="0" /></a><span class=
    "heading"><a name="IndexTerms">INDEX TERMS</a></span>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Primary Classification:</a></span><br />
    &nbsp; <b>F.</b> <a href=
    "results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory of Computation</a><br />
    &nbsp; <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>F.2</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
    COMPLEXITY</a><br />
    &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>F.2.1</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Numerical Algorithms and Problems</a><br />
    </p>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Additional&nbsp;Classification:</a></span><br />
    &nbsp; <b>G.</b> <a href=
    "results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Mathematics of Computing</a><br />
    &nbsp; <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>G.1</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">NUMERICAL ANALYSIS</a><br />
    &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>G.1.6</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Optimization</a><br />
    &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border=
    "0" height="20" width="20" /> <b>Subjects:</b> <a href=
    "results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Linear programming</a><br />
    </p>
    <br />
    <p class="GenTerms"><span class="heading"><a name=
    "GenTerms">General Terms:</a></span><br />
    <a href=
    "results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Algorithms</a>, <a href=
    "results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory</a></p>
    <br />
    <p class="keywords"><span class="heading"><a name=
    "Keywords">Keywords:</a></span><br />
    <a href=
    "results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Simplex method</a>, <a href=
    "results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">complexity</a>, <a href=
    "results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">perturbation</a>, <a href=
    "results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">smoothed analysis</a></p>
    </div>

    One approach is to download HTML Parser from SourceForge
    http://htmlparser.sourceforge.net/ and write rules to match the title, abstract, etc.
    Another approach is to write your own parser that extracts only the title, abstract, etc.:
    1. Tokenize the HTML file, i.e. convert the HTML into tokens (tags and values).
    2. Write a simple parser to extract certain information.
    Find the pattern of the text you want to extract, for instance <p class="abstract">,
    then write a rule for extracting the abstract, such as
    if (tag is abstract) then extract abstract text
    and apply the same concept to the other tags.
    Attached is the sample parser that was used to extract the title and abstract from ACM HTML files. Please modify it to include keywords and other fields.
    Good luck
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    public class ACMHTMLParser {
        private String m_filename;
        private URLLexicalAnalyzer lexical;
        List urls = new ArrayList();

        public ACMHTMLParser(String filename) {
            super();
            m_filename = filename;
        }

        /** Parses only the title and the abstract. */
        public void parse() throws Exception {
            lexical = new URLLexicalAnalyzer(m_filename);
            String word = lexical.getNextWord();
            boolean isabstract = false;
            while (null != word) {
                if (isTag(word)) {
                    if (isTitle(word)) {
                        System.out.println("TITLE: " + lexical.getNextWord());
                    } else if (isAbstract(word) && !isabstract) {
                        parseAbstract();
                        isabstract = true;
                    }
                }
                word = lexical.getNextWord();
            }
            lexical.close();
        }

        public static void main(String[] args) throws Exception {
            ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
            parser.parse();
        }

        public static boolean isTag(String word) {
            return (word.startsWith("<") && word.endsWith(">"));
        }

        public static boolean isTitle(String word) {
            return ("<title>".equals(word));
        }

        // Please modify according to the HTML source.
        public static boolean isAbstract(String word) {
            return ("<p class=\"abstract\">".equals(word));
        }

        // Prints the first non-tag token that follows the abstract tag.
        private void parseAbstract() throws Exception {
            while (true) {
                String abs = lexical.getNextWord();
                if (abs == null) {
                    break; // end of file reached before any text was found
                }
                if (!isTag(abs)) {
                    System.out.println(abs);
                    break;
                }
            }
        }

        /** Splits the input into tag tokens ("<...>") and value tokens (text between tags). */
        class URLLexicalAnalyzer {
            private BufferedReader m_reader;
            private boolean isTag; // true when scanValue has already consumed the '<' of the next tag

            public URLLexicalAnalyzer(String filename) {
                try {
                    m_reader = new BufferedReader(new FileReader(filename));
                } catch (IOException io) {
                    System.out.println("ERROR, file not found " + filename);
                    System.exit(1);
                }
            }

            public URLLexicalAnalyzer(InputStream in) {
                m_reader = new BufferedReader(new InputStreamReader(in));
            }

            public void close() {
                try {
                    if (null != m_reader) m_reader.close();
                } catch (IOException ignored) {}
            }

            public String getNextWord() throws IOException {
                int c = m_reader.read();
                if (-1 == c) return null;
                if (Character.isWhitespace((char) c)) {
                    return getNextWord();
                }
                if ('<' == c || isTag) {
                    return scanTag(c);
                } else {
                    return scanValue(c);
                }
            }

            private String scanTag(final int c) throws IOException {
                StringBuffer result = new StringBuffer();
                if ('<' != c) result.append('<');
                result.append((char) c);
                int ch = -1;
                while (true) {
                    ch = m_reader.read();
                    if (-1 == ch) throw new IllegalArgumentException("un-terminate tag");
                    if ('>' == ch) {
                        isTag = false;
                        break;
                    }
                    result.append((char) ch);
                }
                result.append((char) ch); // append the closing '>'
                return result.toString();
            }

            private String scanValue(final int c) throws IOException {
                StringBuffer result = new StringBuffer();
                result.append((char) c);
                int ch = -1;
                while (true) {
                    ch = m_reader.read();
                    if (-1 == ch) throw new IllegalArgumentException("un-terminate value");
                    if ('<' == ch) {
                        isTag = true;
                        break;
                    }
                    result.append((char) ch);
                }
                return result.toString();
            }
        }
    }
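
    For the "Classification" section specifically, a regular-expression pass may be enough. The following is only a sketch. It assumes every page uses the same markup as the snippet posted in the question, namely that each classification term is the text of a link carrying target="_self" inside a <p class="Categories"> block. The class and method names are made up for illustration, and regular expressions on HTML are fragile, so a real HTML parser is the safer choice if the pages vary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryExtractor {
    // One <p class="Categories"> ... </p> block; DOTALL lets '.' span line breaks.
    private static final Pattern BLOCK =
        Pattern.compile("<p class=\"Categories\">(.*?)</p>", Pattern.DOTALL);
    // The text of a link that carries target="_self"; heading anchors have no target.
    private static final Pattern LINK =
        Pattern.compile("target=\"_self\">(.*?)</a>", Pattern.DOTALL);

    public static List<String> extract(String html) {
        List<String> terms = new ArrayList<>();
        Matcher block = BLOCK.matcher(html);
        while (block.find()) {
            Matcher link = LINK.matcher(block.group(1));
            while (link.find()) {
                // Collapse line breaks inside the link text into single spaces.
                terms.add(link.group(1).replaceAll("\\s+", " ").trim());
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String html = "<p class=\"Categories\"><b>F.</b> <a href=\"results.cfm?query=x\"\n"
            + "target=\"_self\">Theory of Computation</a><br />\n"
            + "<b>F.2</b> <a href=\"results.cfm?query=y\"\n"
            + "target=\"_self\">ANALYSIS OF ALGORITHMS AND PROBLEM\nCOMPLEXITY</a><br /></p>";
        System.out.println(extract(html));
    }
}
```

    The same two-step idea (find the enclosing block, then collect the link text inside it) should carry over to the "keywords" and "GenTerms" paragraphs by changing the class name in the first pattern.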

  • Extract Text from pdf using C#

    Hi,
    We are solution developers using Acrobat. As we have a requirement to extract text from PDF using C#, we have downloaded and installed the Adobe SDK. We have found only four examples in C#, and those are used only for viewing a PDF in a Windows application. Can you please guide us on how to extract text from PDF using the SDK in C#?
    Thank you for your help.
    Regards
    kiranmai

    Okay, so I went ahead and added the text extraction functionality to my own C# application, since the client had requested this feature anyhow. Originally we were told to bypass it if it wasn't "cut and dry", but it wasn't bad, so I gave the client the text extraction they wanted. I decided I'd post the source code here for you. This returns the text from the entire document as a string.
           // Needs: using System; using System.Collections.Generic;
           //        using System.Reflection; using Acrobat;
           private static string GetText(AcroPDDoc pdDoc)
           {
                AcroPDPage page;
                int pages = pdDoc.GetNumPages();
                string pageText = "";
                for (int i = 0; i < pages; i++)
                {
                    page = (AcroPDPage)pdDoc.AcquirePage(i);
                    object jso, jsNumWords, jsWord;
                    List<string> words = new List<string>();
                    try
                    {
                        // The JavaScript bridge exposes getPageNumWords and getPageNthWord.
                        jso = pdDoc.GetJSObject();
                        if (jso != null)
                        {
                            object[] args = new object[] { i };
                            jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                            int numWords = Int32.Parse(jsNumWords.ToString());
                            // Word indices are zero-based, so stop before numWords.
                            for (int j = 0; j < numWords; j++)
                            {
                                // false = do not strip punctuation and whitespace from the word.
                                object[] argsj = new object[] { i, j, false };
                                jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                                words.Add((string)jsWord);
                            }
                        }
                        foreach (string word in words)
                        {
                            pageText += word;
                        }
                    }
                    catch
                    {
                        // Pages whose words cannot be read are skipped.
                    }
                }
                return pageText;
           }

  • Extract text from pdf

    Hi, is it possible to extract text from a PDF file using the command line, to get output like you would get by using the File menu and then 'Save as text...'?
    I also noticed that in the installation folder there is a small executable called AcroTextExtractor which sounds interesting, but I was unable to figure out how to use it.

    What's wrong with using Automator for this? It certainly seems the easiest. I'm not aware of any built-in AppleScript commands that will do this, but you should also ask on the AppleScript forum under Mac OS Technologies.
    Message was edited by: V.K.

  • How can I extract text from PowerPoint files, Word files, PDF files

    hi friends,
    I need to extract text from PowerPoint files, Word files and PDF files for my application. Is it possible to extract the text from those files? If yes, please give a solution to this problem. I would be thankful.

    My reply would be the same.
    http://forum.java.sun.com/thread.jspa?threadID=676559&tstart=0

  • Extract text from Powerpoint

    Can anyone tell me whether it is possible to extract text from Powerpoint. Please help me out.

    Hi, it is possible if you can pay a lot of money. I have just finished developing a web search app with my classmates, but we did not do the conversion from PowerPoint to text; the client did not ask for it because he did not want to pay so much money.

  • Extract Text from PostScript

    Can anyone please tell me how to extract text from postscript

    Ghostscript is a postscript interpreter (one can actually program in postscript, and in fact postscript is a language BTW/FYI). Ghostscript is written in C (maybe C++) and yes, you can output to several different formats. I doubt you can go to HTML as it wouldn't really make much sense.
    XML needs a DTD to make it of any use. You will have to call ghostscript through Runtime.exec() to use it, it will extract out purely the text from the PostScript ASSUMING the PS file contains text in that manner; it is possible to have PS text as output images in which case GS won't pick it up.
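
    A rough sketch of the external call described above, with several assumptions: it uses ProcessBuilder rather than Runtime.exec(), it assumes a Ghostscript build that includes the txtwrite output device, and it assumes the executable is reachable as "gs" (on Windows the console executable usually has a different name, such as gswin64c). The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class PsToText {
    // Build the Ghostscript command line for plain-text output.
    public static List<String> buildCommand(String gsExecutable, String psFile, String txtFile) {
        return Arrays.asList(
            gsExecutable,
            "-q",                      // suppress the startup banner
            "-dNOPAUSE", "-dBATCH",    // run without prompts and exit when done
            "-sDEVICE=txtwrite",       // text-extraction output device
            "-sOutputFile=" + txtFile,
            psFile);
    }

    // Run Ghostscript and return its exit code (0 normally means success).
    public static int convert(String gsExecutable, String psFile, String txtFile)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(gsExecutable, psFile, txtFile))
            .inheritIO()   // show Ghostscript's own messages on the console
            .start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: PsToText input.ps output.txt");
            return;
        }
        System.out.println("Ghostscript exit code: " + convert("gs", args[0], args[1]));
    }
}
```

    As noted above, this only helps when the PostScript file actually contains text; text that was rendered as images will not come through.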

  • Extracting text from BLOB data

    Let me start by saying that I'm flying by the seat of my pants on this and I'm neither a CR expert nor can I write SQL. Short of a little basic back on my old Apple IIe (way back when,) my programming skills are essentially nil.
    That said, here's what I'm trying to achieve.
    I have created a report that uses an ODBC connection to an Oracle database. It works fine, all the tables are properly linked and it delivers a properly functioning report. I now need to extract some text from a BLOB field. I have all the necessary tables added to the report and I've linked them as well. However, when I add an SQL expression and paste in the  SQL string that uses the function, "getprop" to try to extract the necessary text from the BLOB data, I get the error: "missing expression"
    I notice that within the SQL Expression editor, the field for the BLOB data is not listed in the table in which it resides, though the non-BLOB fields are listed. That is, when I say: select getprop(sl.blobdata,'confirmCompany'), there is no sl.blobdata field which I believe is the source of the "missing expression" error.
    If I create a new blank report and simply paste the SQL in as a command in the Database Expert, it will deliver the expected results. I get the two fields I need. This means the SQL string is fine. But, of course, all the other columns I need are not available. As I mentioned above, I can't write SQL and I have a fully functional  report that has everything except the two columns i want to add from the BLOB field.
    However if I go back to my complete report and add the SQL as a command in the Database Expert, it then asks me to link the fields to the other tables already in the report and that just doesn't work.
    I apologize for any mucked up terminology and general concept confusing statements I've made above and hope someone will be able to help me out.
    Thanks in advance.
    ~m

    So, in this case you might create a subreport that has
    "the SQL in as a command in the Database Expert, it will deliver the expected results. I get the two fields I need. This means the SQL string is fine. But, of course, all the other columns I need are not available. As I mentioned above, I can't write SQL and I have a fully functional report that has everything except the two columns i want to add from the BLOB field. "
    of course, this will return all the results, every time it gets kicked off, this is why having something to link to is handy.
    You can filter the subreport data, based on main report parameters, if that will help.

  • Extracting text from Customer master information records

    Hi,
    I want to extract the texts from the customer-material information record (transaction VD52). All the input data, such as sales org, distribution channel, customer number and material number, are stored in structure MV10A. How can I use this data to extract the relevant text descriptions in the customer-material info records?
    Thanks for your response.

    No problem, we can concatenate all four in one field.
    First you need to declare a variable G_NAME(70) type C.
    Then use the syntax
    CONCATENATE VKORG VTWEG KUNNR MATNR INTO G_NAME.
    (here the four variables are assumed to hold the sales org, distribution channel, customer number and material number).
    Pass this G_NAME to the function module.
    Also, you need to use conversion routines to get the correct customer and material numbers.
    call function 'CONVERSION_EXIT_MATN1_INPUT'
         exporting
              input        = MATNR
        IMPORTING
             OUTPUT       = MATNR
        EXCEPTIONS
             LENGTH_ERROR = 1
             OTHERS       = 2.
    call function 'CONVERSION_EXIT_ALPHA_INPUT'
         exporting
              input   = KUNNR
        IMPORTING
             OUTPUT  = KUNNR.
    Use above 2 routines before you do concatenation.
