Extracting text from a file name on export / import (Regular Expressions??)

I’m not even sure whether the File Naming section of Lightroom’s publishing service supports regular expressions. Basically, I’m trying to extract the left portion of the file name, i.e. everything before the underscore “_”. When I import a file, I rename it to the current image sequence number and then append the date the photo was taken; a typical file name is “05625_2008-01-05.dng”. On export I would like the new name to be only the sequence number, in this case “05625.jpg”. Ideally I would then like to append the name of the folder that contains the file: “05625 - FolderName.jpg”.
I don’t want to go down the road of figuring out the correct syntax if regular expressions aren’t supported. Thanks in advance - CES

Is the imported sequence number captured in metadata somewhere, or is there somewhere that all of the available fields and their reference names can be found?
Unfortunately, it is not available anywhere to the user (but it's still stored in at least one field that I know of).
By the way, why did you choose to put the sequence number at the beginning of the name (1234_2010-08-13.jpg)? The common practice is to leave it as a suffix at the end. That ensures the files sort in chronological order by filename, and you could easily have used the suffix when exporting files. You wouldn't have this problem now.
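
For readers who end up doing the rename outside Lightroom, the transformation asked for above can be sketched in Java. This only illustrates the regular expression involved and says nothing about what Lightroom's File Naming dialog supports. The class name `ExportName`, the method `exportName`, and the example folder names are made up for this sketch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lightroom.
class ExportName {
    // Group 1 captures everything before the first underscore.
    private static final Pattern SEQUENCE = Pattern.compile("^([^_]+)_.*$");

    static String exportName(Path source) {
        String fileName = source.getFileName().toString();
        Matcher m = SEQUENCE.matcher(fileName);
        String sequence;
        if (m.matches()) {
            sequence = m.group(1);
        } else {
            // No underscore: fall back to the name without its extension.
            int dot = fileName.lastIndexOf('.');
            sequence = dot > 0 ? fileName.substring(0, dot) : fileName;
        }
        // Name of the folder that directly contains the file, if any.
        Path parent = source.getParent();
        String folder = (parent == null || parent.getFileName() == null)
                ? "" : parent.getFileName().toString();
        return folder.isEmpty()
                ? sequence + ".jpg"
                : sequence + " - " + folder + ".jpg";
    }

    public static void main(String[] args) {
        Path p = Paths.get("Photos", "Vacation", "05625_2008-01-05.dng");
        System.out.println(exportName(p)); // prints "05625 - Vacation.jpg"
    }
}
```

The pattern `^([^_]+)_.*$` is the part that would carry over to any tool that does accept regular expressions: capture everything up to the first underscore and discard the rest.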

Similar Messages

  • Extracting string from a file name

    Hello,
    I have a legacy (read: I didn't build it) SharePoint list that includes some validation when uploading files, and it's giving me some trouble.
    Basically, our users are required to add files to a list in a certain filename format; based on the naming convention, files are approved or rejected and routed to the appropriate location.
    One of the validations looks at a section of the file name and compares it to a folder name in the library.
    For example, the file name format is XX_AAA_999_2014_05.xlsx, and that matches the folder name /submissions/2014_05.
    Currently the rule says: look at the last 7 characters of the folder name and the 7 characters starting at position 12 of the file name, and make sure they match.
    The problem is that the 999 in the example above is a sequential identifier for the project a file is associated with, i.e. it ranges from project 000 to project 999. We've now hit project 1000, so any file added for project 1000 (and beyond) fails because the starting position has shifted one spot. (Note: we have active 3-digit projects, so I cannot simply change the start to position 13, not to mention what that would do to my history.)
    So, my task is to come up with something that can accommodate 3- or 4-digit numbers.
    I'm trying to stick as closely as possible to the original setup so I don't mess up the history, so I'm looking at other methods of getting to the same data in the string. Another problem is that the file names include the extension, and the extension can be 3 (pdf) or 4 (xlsx) characters long.
    I've tried this:  =LEFT([Source File Name],SEARCH(".",[Source File Name])-1)
    but that brings back everything in front of the period, and I need just the 7 preceding characters. Is there a way to limit the number of characters a LEFT() function returns?
    In a nutshell, the 4 variations of file names are as follows; the section I need to extract
    is the 2014_05 just before the extension:
    ZZ_AAA_999_2014_05.xls
    ZZ_AAA_999_2014_05.xlsx
    ZZ_AAA_1000_2014_05.xls
    ZZ_AAA_1000_2014_05.xlsx
    Thanks!
    Kevin

    Hi,
    According to your description, you want to retrieve the string “2014_05” from the file name.
    I would suggest creating a SharePoint Designer workflow to implement the filename-handling logic.
    SharePoint Designer 2010 already includes some useful utility workflow actions that let users handle the various requirements that come out of business scenarios.
    For the string handling, consider the
    Utility Actions:
    http://msdn.microsoft.com/en-us/library/office/jj164026(v=office.15).aspx
    Another two links about creating SharePoint Designer workflow for your reference:
    http://office.microsoft.com/en-001/sharepoint-designer-help/introduction-to-designing-and-customizing-workflows-HA101859249.aspx
    http://www.codeproject.com/Tips/415107/Create-a-Workflow-using-SharePoint-Designer
    Thanks
    Patrick Liang
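
The underlying string logic from the question (take the 7 characters that sit immediately before the extension, however long the project number or the extension is) can be sketched in Java as follows. This is only an illustration of the idea, not a SharePoint workflow; the class and method names are hypothetical, and it assumes the last period in the name starts the extension.

```java
// Hypothetical helper illustrating the "7 characters before the extension" rule.
class PeriodFromFileName {
    // Returns the 7 characters immediately before the last '.', e.g. "2014_05".
    static String period(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 7) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        return fileName.substring(dot - 7, dot);
    }

    public static void main(String[] args) {
        String[] names = {
            "ZZ_AAA_999_2014_05.xls",
            "ZZ_AAA_999_2014_05.xlsx",
            "ZZ_AAA_1000_2014_05.xls",
            "ZZ_AAA_1000_2014_05.xlsx"
        };
        for (String name : names) {
            // Each line prints the name followed by "2014_05".
            System.out.println(name + " -> " + period(name));
        }
    }
}
```

Because the position is counted back from the period rather than forward from the start, neither the 3- versus 4-digit project number nor the 3- versus 4-character extension matters. In the calculated-column syntax from the question, the same idea would presumably be RIGHT(LEFT([Source File Name],SEARCH(".",[Source File Name])-1),7); that formula is untested here.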

  • Does IBR support to extract text from office files

    Hi Experts,
    Can we use IBR to extract Word/Excel/PowerPoint content to a text file? Where is the documentation for this function?
    Best regards

    Hi,
    Do you mean extracting text based on some rules? Is it for searching on that specific text in the files? If so, the Oracle Text search feature does that.
    If the goal is to populate metadata with the extracted text, that is the job of the Content Categorizer component. For details on this component and its functionality, please go through the following documentation: http://docs.oracle.com/cd/E14571_01/doc.1111/e10978/c11_content_categorizer.htm#sthref1210
    In either case, IBR is not the engine that does this; it is used solely for document conversion.
    Hope this helps .
    Thanks,
    Srinath

  • Extracting text from pdf file

    Hi All
    I want to extract only the text from a PDF file.
    I am trying to extract the text using PDFBox, but I am getting an error. My code is like this:
    /*
     * Main.java
     * Created on den 10 september 2007, 23:01
     * To change this template, choose Tools | Template Manager
     * and open the template in the editor.
     */
    package extracttext;

    import org.pdfbox.exceptions.InvalidPasswordException;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;
    //import java.awt.Rectangle;
    //import java.util.List;
    import org.pdfbox.pdmodel.PDPage;

    public class Main {
        /** Creates a new instance of Main */
        public Main() {
        }

        /**
         * @param args the command line arguments
         */
        public static void main( String[] args ) throws Exception {
            int startPage = 1;
            int endPage = Integer.MAX_VALUE;
            PDDocument document = null;
            try {
                document = PDDocument.load( "C:\\thesis\\fileread\\sim.pdf" );
                if( document.isEncrypted() ) {
                    try {
                        // try the empty password first
                        document.decrypt( "" );
                    } catch( InvalidPasswordException e ) {
                        System.err.println( "Error: Document is encrypted with a password." );
                        System.exit( 1 );
                    }
                }
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setSortByPosition( true );
                stripper.setStartPage( startPage );
                stripper.setEndPage( endPage );
                System.out.println("Text: " + stripper.getText(document));
            } finally {
                // always release the document, even if extraction fails
                if( document != null ) {
                    document.close();
                }
            }
        }
    }
    Can anybody please help me solve this problem?
    Regards,
    UK

    I get the following error message:
    Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
    at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
    at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
    at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at extracttext.Main.main(Main.java:55)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 1 second)
    I would appreciate it if you could help me write a Java program that extracts only the text from a PDF file.

  • File-File - Need to extract data from source file name???

    Hello Experts,
    I have a unique situation. In my file-to-file scenario, the source file name has the format XYZ_yymmddHHMM.dat. There is a field in the target file that has to be filled with the date contained in the source file name (yymmdd). How can this be achieved? Normally we do it the other way round, using variable substitution, where a file is named depending on the value of a field in the target structure.
    Please help.
    Regards,
    Yash

    Hi,
    Please prepare a UDF with the following code;
    I mean, use the dynamic configuration concept.
    Once you get the file name, use a substring function to capture the date from the right side.
    //write your code here
    // getFileName User Defined Function
    // function to create name of output file
    String filename;
    filename = strFile;
    try {
        // initialize DynamicConfiguration to create the file with the given name
        DynamicConfiguration conf = (DynamicConfiguration) container
            .getTransformationParameters()
            .get(StreamTransformationConstants.DYNAMIC_CONFIGURATION);
        DynamicConfigurationKey key = DynamicConfigurationKey.create( "http://sap.com/xi/XI/System/File", "FileName");
        // create file with the specified name
        conf.put(key, filename);
    } catch (Exception ex) {
        // ignored: fall through and return the name unchanged
    }
    return filename;
    warm regards
    mahesh.
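
The substring step mentioned in the reply can be sketched on its own in plain Java. This is a minimal illustration assuming the name always follows the XYZ_yymmddHHMM.dat pattern from the question; the class name `SourceFileDate`, the method `datePart`, and the sample file name are hypothetical. In a real UDF the name would come from the dynamic configuration rather than a literal.

```java
// Hypothetical helper showing only the substring logic, independent of SAP PI.
class SourceFileDate {
    // "XYZ_yymmddHHMM.dat" -> "yymmdd" (the 6 characters after the last underscore).
    static String datePart(String fileName) {
        int underscore = fileName.lastIndexOf('_');
        if (underscore < 0 || fileName.length() < underscore + 7) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        return fileName.substring(underscore + 1, underscore + 7);
    }

    public static void main(String[] args) {
        // Sample name invented for the example.
        System.out.println(datePart("XYZ_1405231530.dat")); // prints "140523"
    }
}
```

Counting forward from the underscore, rather than backward from the end, keeps the result correct even if the extension changes length.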

  • Extracting text from PDF files produced by Oracle reports

    Hi,
    I am currently using Report Builder 9.0.4.0.21 to produce reports in PDF format.
    The pdf reports were displayed to screen and printed to printer correctly.
    However, copying and pasting from the PDF report into a text editor produces
    garbage characters. I also failed to extract the text using any of the available Adobe
    plug-ins. I know that the PDF report uses font subsetting with custom
    encoding. I have already read the PDF reference manual, and it seems that
    the PDF report is missing the mapping tables needed to convert the custom encoding
    used in the report back to ANSI or Unicode.
    Is there a solution to this problem?
    Are there any environment variables or settings that I am missing?
    Your help is really appreciated.

    Hello,
    Your problem may be related to a limitation of the PDF generated by Reports 9.0.2 / 9.0.4 when using font subsetting:
    Font Subsetting Creates PDF Output not Searchable with Acrobat Reader (Doc ID 311345.1)
    This limitation no longer exists in Reports 10.1.2 / 11.1.
    Regards

  • How to extract text from a PDF file?

    Hello Suners,
    I need to know how to extract text from a PDF file.
    Does anyone know what the character encoding of a PDF file is? When I use an input stream to read the file, it gives encrypted-looking characters, not the original text in the file.
    Is there any procedure I should follow when reading a PDF file?
        File f=new File("D:/File.pdf");
        FileReader fr=new FileReader(f);
        BufferedReader br=new BufferedReader(fr);
        String s=br.readLine();
    Any help will be deeply appreciated.

    jverd wrote:
    > First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading.
    Oops, you are absolutely right; that was a silly mistake.
    > Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.
    getPageContent returns an array of bytes, so the question becomes:
    how do I get text from this array? I was thinking of:
        private void jButton1_actionPerformed(ActionEvent e) {
            PdfReader read;
            StringBuffer buff=new StringBuffer();
            try {
                read = new PdfReader("d:/getjobid2727.pdf");
                read.getMetaData();
                byte[] data=read.getPageContent(1);
                int i=0;
                while(i>-1){
                    buff.append(data);
                    i++;
                }
                String str=buff.toString();
                FileOutputStream fos = new FileOutputStream("D:/test.txt");
                Writer out = new OutputStreamWriter(fos, "UTF8");
                out.write(str);
                out.close();
                read.close();
            } catch (Exception f) {
                f.printStackTrace();
            }
        }
    "D:/test.txt" hasn't been created when I run the program.
    Are my steps right?

  • How to extract text from a PDF file using php?

    How to extract text from a PDF file using php?
    thanks
    fabio

    > Do you know of any other way this can be done?
    There are many ways, but this is out of scope for this forum. You can try this forum: http://forum.planetpdf.com/

  • How can I extract text from PowerPoint files, Word files, PDF files

    Hi friends,
    I need to extract text from PowerPoint files, Word files, and PDF files for my application. Is it possible to extract the text from those files? If so, please give a solution to this problem; I would be thankful.

    My reply would be the same.
    http://forum.java.sun.com/thread.jspa?threadID=676559&tstart=0

  • Extracting text from .doc,.ppt,.pdf files

    How can I extract ASCII text from file types like .doc, .ppt, .pdf, .xls, etc.?
    Any tips/hints would be helpful
    Thanks
    Rama

    Hi, I tried it for PDF, but didn't succeed.
    The following is for text/Doc files:
    <pre>
    import java.io.*;

    public class Doc {
        public static void main(String[] args) {
            try {
                File file = new File("c:\\downloads\\WP2001.doc");
                LineNumberReader buffer = new LineNumberReader(new FileReader(file));
                StringBuffer buff = new StringBuffer("");
                String line;
                // readLine() returns null at end of file; testing with read() instead
                // would swallow the first character of every following line
                while ((line = buffer.readLine()) != null) {
                    //System.out.println(line);
                    buff.append(line).append("\n");
                }
                buffer.close();
                System.out.println(buff);
            } catch (Exception fne) {
                System.out.println("File Not Found" + fne);
            }
        }
    }
    </pre>
    pathreading

  • Reading long text from excel file to an internal table

    Hi
    Can anybody tell me how to read long text from an Excel file into an internal table?
    When I use the FM KCD_EXCEL_OLE_TO_INT_CONVERT, it reads only 32 characters from each cell.
    But one of the cells in my Excel sheet has very long text, which I need to upload into an internal table.
    Which FM or what logic do I need to use for this problem?
    Regards

    Hi,
    Here is an example program.  It will upload an Excel file with two columns.  You could also assign the Excel structure dynamically, but I wanted to keep the example simple.  The main point is that the internal table (it_excel in this example) must match the Excel structure that you want to convert.
    Remember, this is just an example to help you figure out how to properly use the technique.  It will certainly need to be modified to fit your requirements, and as always there may be a better way to get the Excel converted... this is just one possibility that has worked for me in the past.
    *& Report  zexcel_upload_test                            *
    REPORT  zexcel_upload_test.
    TYPE-POOLS: truxs.
    TYPES: BEGIN OF ty_excel,
             col_a(10) TYPE n,
             col_b(35) TYPE c,
           END OF ty_excel.
    DATA: l_data_tab         TYPE TABLE OF string,
          l_text_data        TYPE truxs_t_text_data,
          l_gui_filename     TYPE string,
          it_excel           TYPE TABLE OF ty_excel.
    FIELD-SYMBOLS: <wa_excel>  TYPE ty_excel.
    PARAMETERS: p_file TYPE rlgrap-filename.
    * Pass the file name in the correct format
    l_gui_filename = p_file.
    * Upload data from PC
    CALL METHOD cl_gui_frontend_services=>gui_upload
      EXPORTING
        filename                = l_gui_filename
        filetype                = 'ASC'
        has_field_separator     = 'X'
      CHANGING
        data_tab                = l_data_tab
      EXCEPTIONS
        file_open_error         = 1
        file_read_error         = 2
        no_batch                = 3
        gui_refuse_filetransfer = 4
        invalid_type            = 5
        no_authority            = 6
        unknown_error           = 7
        bad_data_format         = 8
        header_not_allowed      = 9
        separator_not_allowed   = 10
        header_too_long         = 11
        unknown_dp_error        = 12
        access_denied           = 13
        dp_out_of_memory        = 14
        disk_full               = 15
        dp_timeout              = 16
        OTHERS                  = 17.
    IF sy-subrc <> 0.
    *   MESSAGE ...
      EXIT.
    ENDIF.
    * Convert from Excel into the appropriate itab
    l_text_data[] = l_data_tab[].
    CALL FUNCTION 'TEXT_CONVERT_XLS_TO_SAP'
      EXPORTING
        i_field_seperator    = 'X'
        i_tab_raw_data       = l_text_data
        i_filename           = p_file
      TABLES
        i_tab_converted_data = it_excel
      EXCEPTIONS
        conversion_failed    = 1
        OTHERS               = 2.
    IF sy-subrc <> 0.
    *   MESSAGE ...
      EXIT.
    ENDIF.
    LOOP AT it_excel ASSIGNING <wa_excel>.
    *  Do something here...
    ENDLOOP.
    AT SELECTION-SCREEN ON VALUE-REQUEST FOR p_file.
      PERFORM filename_get CHANGING p_file.
    *       FORM filename_get                                             *
    FORM filename_get CHANGING p_in_file TYPE rlgrap-filename.
      DATA: l_in_file  TYPE string,
            l_filetab  TYPE filetable,
            wa_filetab TYPE LINE OF filetable,
            l_rc       TYPE i,
            l_action   TYPE i,
            l_init_dir TYPE string.
    * Set the initial directory to whatever you want it to be
      l_init_dir = 'C:\'.
    * Call the file open dialog without multiselect
      CALL METHOD cl_gui_frontend_services=>file_open_dialog
        EXPORTING
          window_title            = 'Load file'
          default_extension       = '.XLS'
          default_filename        = l_in_file
          initial_directory       = l_init_dir
          multiselection          = ' '
        CHANGING
          file_table              = l_filetab
          rc                      = l_rc
          user_action             = l_action
        EXCEPTIONS
          file_open_dialog_failed = 1
          cntl_error              = 2
          error_no_gui            = 3
          OTHERS                  = 4.
      IF sy-subrc <> 0.
        REFRESH l_filetab.
      ENDIF.
    * Read the selected filename
      READ TABLE l_filetab INTO wa_filetab INDEX 1.
      IF sy-subrc = 0.
        p_in_file = wa_filetab-filename.
      ENDIF.
    ENDFORM.                    " filename_get
    Regards,
    Jamie

  • Problem to extract text from HTML document

    I have to extract some text from HTML files (about 1000 of them) into my database.
    The HTML files come from the ACM Digital Library: http://portal.acm.org/dl.cfm
    Each HTML page holds the information about one paper. I only want to get the text of "Title", "Abstract", "Classification", and "Keywords".
    The problem is that I can't find any pattern for parsing the HTML files.
    E.g., I need to get Classification = "Theory of Computation", "ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY", "Numerical Algorithms and Problems", "Mathematics of Computing", "NUMERICAL ANALYSIS", etc.
    The section of code for "Classification" is below.
    Please give me any ideas on how to do this, or on how to find a pattern for extracting text from it.
    <div class="indterms"><a href="#CIT"><img name="top" src=
    "img/arrowu.gif" hspace="10" border="0" /></a><span class=
    "heading"><a name="IndexTerms">INDEX TERMS</a></span>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Primary Classification:</a></span><br />
    &nbsp; <b>F.</b> <a href=
    "results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory of Computation</a><br />
    &nbsp; <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>F.2</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
    COMPLEXITY</a><br />
    &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>F.2.1</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Numerical Algorithms and Problems</a><br />
    </p>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Additional&nbsp;Classification:</a></span><br />
    &nbsp; <b>G.</b> <a href=
    "results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Mathematics of Computing</a><br />
    &nbsp; <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>G.1</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">NUMERICAL ANALYSIS</a><br />
    &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>G.1.6</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Optimization</a><br />
    &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border=
    "0" height="20" width="20" /> <b>Subjects:</b> <a href=
    "results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Linear programming</a><br />
    </p>
    <br />
    <p class="GenTerms"><span class="heading"><a name=
    "GenTerms">General Terms:</a></span><br />
    <a href=
    "results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Algorithms</a>, <a href=
    "results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory</a></p>
    <br />
    <p class="keywords"><span class="heading"><a name=
    "Keywords">Keywords:</a></span><br />
    <a href=
    "results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Simplex method</a>, <a href=
    "results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">complexity</a>, <a href=
    "results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">perturbation</a>, <a href=
    "results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">smoothed analysis</a></p>
    </div>

    One approach is to download HTML Parser from SourceForge
    (http://htmlparser.sourceforge.net/) and write rules to match the title, abstract, etc.
    Another approach is to write your own parser that extracts only the title, abstract, etc.:
    1. Tokenize the HTML file, i.e. convert the HTML into tokens (tags and values).
    2. Write a simple parser to extract the information you need.
    Find the pattern that marks the text you want to extract, for instance <p class="abstract">,
    then write a rule for extracting the abstract, such as
    if (tag is abstract) then extract abstract text
    and apply the same concept to the other tags.
    Attached is the sample parser that was used to extract the title and abstract from ACM HTML files. Please modify it to include keywords and other fields.
    good luck
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    public class ACMHTMLParser {
        private String m_filename;
        private URLLexicalAnalyzer lexical;
        List urls = new ArrayList();

        public ACMHTMLParser(String filename) {
            super();
            m_filename = filename;
        }

        /**
         * parses only title and abstract
         */
        public void parse() throws Exception {
            lexical = new URLLexicalAnalyzer(m_filename);
            String word = lexical.getNextWord();
            boolean isabstract = false;
            while (null != word) {
                if (isTag(word)) {
                    if (isTitle(word)) {
                        System.out.println("TITLE: " + lexical.getNextWord());
                    } else if (isAbstract(word) && !isabstract) {
                        parseAbstract();
                        isabstract = true;
                    }
                }
                word = lexical.getNextWord();
            }
            lexical.close();
        }

        public static void main(String[] args) throws Exception {
            ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
            parser.parse();
        }

        public static boolean isTag(String word) {
            return ( word.startsWith("<") && word.endsWith(">"));
        }

        public static boolean isTitle(String word) {
            return ( "<title>".equals(word));
        }

        //please modify according to the html source
        public static boolean isAbstract(String word) {
            return ( "<p class=\"abstract\">".equals(word));
        }

        // prints the first non-tag token after the abstract tag
        private void parseAbstract() throws Exception {
            while (true) {
                String abs = lexical.getNextWord();
                if (null == abs) break;   // end of input: nothing to print
                if (!isTag(abs)) {
                    System.out.println(abs);
                    break;
                }
            }
        }

        // splits the input into tag tokens ("<...>") and value tokens (text between tags)
        class URLLexicalAnalyzer {
            private BufferedReader m_reader;
            private boolean isTag;   // true when scanValue has already consumed a '<'

            public URLLexicalAnalyzer(String filename) {
                try {
                    m_reader = new BufferedReader(new FileReader(filename));
                } catch (IOException io) {
                    System.out.println("ERROR, file not found " + filename);
                    System.exit(1);
                }
            }

            public URLLexicalAnalyzer(InputStream in) {
                m_reader = new BufferedReader(new InputStreamReader(in));
            }

            public void close() {
                try {
                    if (null != m_reader) m_reader.close();
                } catch (IOException ignored) {}
            }

            public String getNextWord() throws IOException {
                int c = m_reader.read();
                if (-1 == c) return null;
                if (Character.isWhitespace((char)c))
                    return getNextWord();
                if ('<' == c || isTag)
                    return scanTag(c);
                else
                    return scanValue(c);
            }

            private String scanTag(final int c) throws IOException {
                StringBuffer result = new StringBuffer();
                // restore the '<' that scanValue consumed
                if ('<' != c) result.append('<');
                result.append((char)c);
                int ch = -1;
                while (true) {
                    ch = m_reader.read();
                    if (-1 == ch) throw new IllegalArgumentException("un-terminate tag");
                    if ('>' == ch) {
                        isTag = false;
                        break;
                    }
                    result.append((char)ch);
                }
                result.append((char)ch);   // the closing '>'
                return result.toString();
            }

            private String scanValue(final int c) throws IOException {
                StringBuffer result = new StringBuffer();
                result.append((char)c);
                int ch = -1;
                while (true) {
                    ch = m_reader.read();
                    if (-1 == ch) throw new IllegalArgumentException("un-terminate value");
                    if ('<' == ch) {
                        isTag = true;
                        break;
                    }
                    result.append((char)ch);
                }
                return result.toString();
            }
        }
    }
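
For the classification and keyword sections shown in the question, where every value is the text of a link carrying target="_self", a regular expression may be enough. The following is a minimal sketch under that assumption, not part of the parser above; the class name `AnchorText`, the method `linkTexts`, and the shortened sample URLs are made up for the example. A real HTML parser will be more robust on markup that deviates from this shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the text of every <a ... target="_self"> link.
class AnchorText {
    // DOTALL lets the link text span line breaks, as it does in the ACM pages.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*target=\"_self\"[^>]*>(.*?)</a>", Pattern.DOTALL);

    static List<String> linkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // Collapse the line breaks inside the link text into single spaces.
            texts.add(m.group(1).replaceAll("\\s+", " ").trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Shape copied from the question; the URLs are abbreviated placeholders.
        String html =
              "<span class=\"heading\"><a name=\"GenTerms\">Primary Classification:</a></span><br />\n"
            + "<b>F.</b> <a href=\n\"results.cfm?query=F\"\ntarget=\"_self\">Theory of Computation</a><br />\n"
            + "<b>F.2</b> <a href=\n\"results.cfm?query=F2\"\ntarget=\"_self\">ANALYSIS OF ALGORITHMS AND PROBLEM\n"
            + "COMPLEXITY</a><br />";
        for (String text : linkTexts(html)) {
            System.out.println(text);
        }
        // prints:
        // Theory of Computation
        // ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY
    }
}
```

Heading anchors such as `<a name="GenTerms">` are skipped because they have no target attribute. To separate classifications from keywords, the same method could be applied to each `<p class="...">` block in turn.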

  • Applescript or workflow to extract text from PDF and rename PDF with the results

    Hi Everyone,
    I get supplied hundreds of PDFs which each contain a stock code, but the PDFs themselves are not named consistantly, or they are supplied as multi-page PDFs.
    What I need to do is name each PDF with the code which is in the text on the PDF.
    It would work like this in an ideal world:
    1. Split PDF into single pages
    2. Extract text from PDF
    3. Rename PDF using the extracted text
    I'm struggling with part 3!
    I can get a text file containing just the code (I'm extracting the code with a call to BBEdit).
    I did think about using a variable for the name, but the rename function doesn't let me use variables.

    Hello
    You may also try the following AppleScript, which is a wrapper around a RubyCocoa script. It will ask you to choose the source PDF files and a destination directory. It then scans the text of each page of each PDF for the predefined pattern and saves the page as a new PDF in the destination directory, named with the string the pattern matched. Pages that contain no matching string are ignored. (Ignored pages, if any, are reported in the script's result.)
    Currently the regex pattern is set to:
    /HB-.._[0-9]{6}/
    which means HB- followed by any two characters, an underscore, and six digits.
    Minimally tested under 10.6.8.
    Hope this may help,
    H
    _main()
    on _main()
        script o
            property aa : choose file with prompt ("Choose pdf files.") of type {"com.adobe.pdf"} ¬
                default location (path to desktop) with multiple selections allowed
            set my aa's beginning to choose folder with prompt ("Choose destination folder.") ¬
                default location (path to desktop)
            set args to ""
            repeat with a in my aa
                set args to args & a's POSIX path's quoted form & space
            end repeat
            considering numeric strings
                if (system info)'s system version < "10.9" then
                    set ruby to "/usr/bin/ruby"
                else
                    set ruby to "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby"
                end if
            end considering
            do shell script ruby & " <<'EOF' - " & args & "
    require 'osx/cocoa'
    include OSX
    require_framework 'PDFKit'
    outdir = ARGV.shift.chomp('/')
    ARGV.select {|f| f =~ /\\.pdf$/i }.each do |f|
        url = NSURL.fileURLWithPath(f)
        doc = PDFDocument.alloc.initWithURL(url)
        path = doc.documentURL.path
        pcnt = doc.pageCount
        (0 .. (pcnt - 1)).each do |i|
            page = doc.pageAtIndex(i)
            page.string.to_s =~ /HB-.._[0-9]{6}/
            name = $&
            unless name
                puts \"no matching string in page #{i + 1} of #{path}\"
                next # ignore this page
            end
            doc1 = PDFDocument.alloc.initWithData(page.dataRepresentation) # doc for this page
            unless doc1.writeToFile(\"#{outdir}/#{name}.pdf\")
                puts \"failed to save page #{i + 1} of #{path}\"
            end
        end
    end
    EOF"
        end script
        tell o to run
    end _main
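
To check the pattern on its own before running the whole script, here is a minimal Python sketch. It is not part of the script above. The helper name `extract_code` and the sample page text are made up for illustration, and the pattern is copied unchanged from the Ruby code.

```python
import re

# Same pattern as in the script above:
# "HB-", any two characters, an underscore, then six digits.
STOCK_CODE = re.compile(r"HB-.._[0-9]{6}")


def extract_code(page_text):
    """Return the first stock code found in page_text, or None if there is none."""
    match = STOCK_CODE.search(page_text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Hypothetical page text, for illustration only.
    sample = "Item: Widget\nCode: HB-AB_123456\nQty: 4"
    print(extract_code(sample))
    print(extract_code("no code on this page"))
```

If your codes look different, only the pattern needs to change. Keep in mind that the matched string becomes the file name, so a pattern that can match characters such as "/" would need extra handling.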

  • JavaScript in PDFs - Extracting text from .doc or .txt

    Hello All,
    I am very new to JavaScript in PDFs, but I seem to find my way around doing misc. work with forms. What I need:
    I need a Form with a Submit button that locates and extracts the text from a file and places it into another field.
    Example:
    on Server:
    one.txt or one.doc
    two.txt or two.doc,
    ...etc
    You type "one" in the form and submit, and it pulls all of the text from one.txt off the server and places it into a field.
    Also, if there is any way to do this with tables to avoid multiple files, that would be even better.
    I know I am a newbie, but this would be a game-changer for what I do.
    Thank you.

    Thanks for the advice.
    It accesses a shared file server (shared among employees), and the form is to be a PDF used in Adobe Acrobat Professional.
    Basically, I want a form that pulls text from a .txt or .doc file based on what was typed in the box or chosen from a drop-down menu.

  • How do I grab a value from a file name and load it in a field/column?

    Hi,
    I am loading this .txt file (OUS_RAW_NYC_05_2011.txt) into an internal table i_raw.
    I want to pick out the characters NYC from the file name and use them as the value of <wa_raw>-field1 for all records.
    How do I do this?
    Please advise.
    Thanks!

    Hi Durgesh,
    I am doing this in a program via SE38 and not via transformation routine.
    Now I am working on this piece of code to get the value.
    file_str = //rdmsbw/dev/data/output/all/OUS_RAW_HCM_05_2011.txt
    I only want the characters HCM from file_str.
    When I execute this code below:
        MOVE file_str TO org_unit.
        WRITE / org_unit+37(12).
        <wa_raw>-/BIC/ZOUORGUT = org_unit.
    my output is: //rdm
    How do I extract HCM?
    Please advise.
    PS: please also help me with my other post:
    http://forums.sdn.sap.com/thread.jspa?threadID=2141618&tstart=0
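
The extraction logic itself can be sketched independently of ABAP. The Python sketch below is only an illustration of the approach, not the poster's program: take the file name without its directory, drop the extension, split on the underscore, and pick the third token. The helper name `token_from_filename` and its parameters are hypothetical. In ABAP the same idea would presumably be built on the SPLIT ... AT statement rather than a fixed offset.

```python
import posixpath


def token_from_filename(path, index=2, sep="_"):
    """Return the token at position `index` of the file name in `path`.

    The directory part and the extension are removed first, so the result
    does not depend on how long the directory path is.
    """
    base = posixpath.basename(path)   # "OUS_RAW_HCM_05_2011.txt"
    stem = base.rsplit(".", 1)[0]     # "OUS_RAW_HCM_05_2011"
    parts = stem.split(sep)           # ["OUS", "RAW", "HCM", "05", "2011"]
    if index >= len(parts):
        raise ValueError("file name has too few tokens: " + base)
    return parts[index]


if __name__ == "__main__":
    file_str = "//rdmsbw/dev/data/output/all/OUS_RAW_HCM_05_2011.txt"
    print(token_from_filename(file_str))
    # The fixed-offset variant: offset 37, length 3 for this exact path.
    print(file_str[37:40])
```

For this exact path the fixed offset does point at the right place (offset 37, but with a length of 3, not 12). It stops working as soon as the directory path changes length, which is why the sketch splits on the separator instead.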

Maybe you are looking for