Split a pdf based on text?

Hello,
We'd like to split a large pdf (1200+ pages) into multiple files. We have Acrobat X Pro and in this instance, have thousands of records in the initial pdf with each record ending in 'End of Record'.
1.) Can we split the initial pdf into multiple files by simply telling Acrobat to create a new file each time it sees 'End of Record'?
2.) Can we (batch) insert top-level bookmarks after each 'End of Record' then use the Split Document command to create multiple files?
Any help is appreciated!

Possible with JavaScript, but it will be a fair amount of coding.
First you need to read each word on each page and search for the words "End of Record". This assumes that the string "End of Record" does not appear elsewhere in the text of the pages being extracted. It also assumes that the words "End of Record" come back consecutively when you retrieve them with getPageNthWord, because words are returned not in reading order but in the order in which the text was placed on the page.
You could also just extract the pages based on finding the words "End of Record".
getPageNthWord
extractPages
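
A minimal sketch of that approach in Acrobat JavaScript, under a few stated assumptions. The helper names findMarkerPages and recordRanges and the output file names are hypothetical. Each record is assumed to end on its own page (the sketch splits at page granularity, so several records on one page would stay together), and the marker words are assumed to come back consecutively from getPageNthWord. Only numPages, getPageNumWords, getPageNthWord and extractPages are Acrobat Doc members.

```javascript
// Returns the 0-based numbers of the pages that contain the marker phrase.
// `doc` is an Acrobat Doc object (or any object exposing the same members).
function findMarkerPages(doc, marker) {
    var target = marker.toLowerCase().split(/\s+/);
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var n = doc.getPageNumWords(p);
        for (var i = 0; i + target.length <= n; i++) {
            var hit = true;
            for (var k = 0; k < target.length; k++) {
                // Third argument true asks Acrobat to strip punctuation and whitespace.
                var w = doc.getPageNthWord(p, i + k, true);
                if (!w || w.toLowerCase() !== target[k]) { hit = false; break; }
            }
            if (hit) { pages.push(p); break; }
        }
    }
    return pages;
}

// Turns marker pages into [start, end] page ranges, one range per record.
function recordRanges(markerPages, numPages) {
    var ranges = [], start = 0;
    for (var j = 0; j < markerPages.length; j++) {
        ranges.push([start, markerPages[j]]);
        start = markerPages[j] + 1;
    }
    // Pages after the last marker (if any) become a final range.
    if (start < numPages) ranges.push([start, numPages - 1]);
    return ranges;
}

// Inside Acrobat, usage might look like this (file names are assumptions):
// var r = recordRanges(findMarkerPages(this, "End of Record"), this.numPages);
// for (var i = 0; i < r.length; i++)
//     this.extractPages({ nStart: r[i][0], nEnd: r[i][1], cPath: "record_" + (i + 1) + ".pdf" });
```

Keeping the search separate from the extraction lets you print the ranges and check them before any files are written. As far as I know, extractPages with a cPath argument only runs from the console or a batch sequence, so treat the commented usage as something to try there.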

Similar Messages

DisAssembling a PDF based on a text string on the page

I am looking for some guidance (and an example, if at all possible) on how to disassemble a multipage PDF based on text such as "Tax ID" contained on certain pages. For example, in a 1000-page document, 100 pages may contain the text "Tax ID" for 100 different people, and I would like 100 separate PDFs as output, each holding the one page with that person's "Tax ID". In addition, it would be great to extract the value next to the text "Tax ID" so that the PDFs could be named accordingly.
The challenge is how to get the page numbers that contain the "Tax ID" text, along with the text sitting to the right of it. Once I have that, I can simply feed that information back into Assembler via the DDX for extraction.
Any help here would be greatly appreciated.

You've posed an interesting problem. Here is one approach that requires you to create a few steps to your Workbench process.
1.) Invoke the Assembler service with a DDX that extracts text information from the original PDF.
2.) Invoke the XSLT service to convert the extracted text info into a Bookmark file.
3.) Invoke the Assembler service with a two-part DDX that imports the Bookmark file into the original PDF and then uses the added bookmarks to disassemble the PDF.
Invoke the Assembler service with a DDX that extracts text information from the original PDF
Here is a DDX that extracts text info:
<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
<DocumentText result="text">
    <PDF source="myOriginalPDF"/>
</DocumentText>
</DDX>
The result will be an XML file with this appearance:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="C:\Adobe\TaxID.xslt"?>
<DocText xmlns="http://ns.adobe.com/DDX/DocText/1.0/">
    <TextPerPage>
        <Page pageNumber="1">to market to market</Page>
        <Page pageNumber="1">TAX ID 1111 Gee I owe a lot of money to the IRS . How could this be ?</Page>
        <Page pageNumber="2">TAX ID 2222 We all owe lots of money</Page>
        <Page pageNumber="3">TAX ID 3333 We all owe lots of money</Page>
    </TextPerPage>
</DocText>
Invoke the XSLT service to convert the extracted text info into a Bookmark file
Here is an XSLT that converts the text info into a Bookmark file:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:textInfo="http://ns.adobe.com/DDX/DocText/1.0/">
    <xsl:output method="xml" version="1.0" encoding="UTF-8"/>
    <xsl:template match="/">
        <Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
            <xsl:apply-templates/>
        </Bookmarks>
    </xsl:template>
    <xsl:template match="textInfo:Page">
        <xsl:variable name="myText" select="text()"/>
        <xsl:if test='contains( $myText, "TAX ID")'>
            <xsl:variable name="taxID"
                select='substring($myText, 8, 4)'/>
            <Bookmark>
                <Dest>
                    <Fit>
                        <xsl:attribute name="PageNum">
                            <xsl:value-of select="@pageNumber"/>
                        </xsl:attribute>
                    </Fit>
                </Dest>
                <Title>
                    <xsl:value-of select="$taxID"/>
                </Title>
            </Bookmark>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>
Here is the result of this XSLT applied against the example text info:
<?xml version="1.0" encoding="UTF-8"?>
<Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
    <Bookmark xmlns="">
        <Dest>
            <Fit PageNum="1"/>
        </Dest>
        <Title>1111</Title>
    </Bookmark>
    <Bookmark xmlns="">
        <Dest>
            <Fit PageNum="2"/>
        </Dest>
        <Title>2222</Title>
    </Bookmark>
    <Bookmark xmlns="">
        <Dest>
            <Fit PageNum="3"/>
        </Dest>
        <Title>3333</Title>
    </Bookmark>
</Bookmarks>
If you use this XSLT, you should refine it to search for the string "TAX ID" at the beginning of the page rather than anywhere in the page. You should also improve the identification of the TAX ID number to be independent of the length.
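
The two refinements are easy to see outside the stylesheet. Here is a minimal JavaScript sketch of the matching logic only, assuming the ID is the first whitespace-delimited token after "TAX ID". The function name extractTaxId is hypothetical. Inside the XSLT, the same idea could be built from the standard XPath functions starts-with(), substring-after() and substring-before().

```javascript
// Returns the token that follows "TAX ID" when the page text starts with it,
// or null when the page does not start with "TAX ID".
function extractTaxId(pageText) {
    // ^ anchors the match at the start of the page text;
    // (\S+) captures the ID whatever its length.
    var m = /^\s*TAX ID\s+(\S+)/.exec(pageText);
    return m ? m[1] : null;
}

// Example with the sample page text from above:
// extractTaxId("TAX ID 2222 We all owe lots of money") returns "2222"
```

Anchoring at the start avoids bookmarking a page that merely mentions "TAX ID" in running text, and capturing up to the next whitespace removes the fixed length of 4 used in substring($myText, 8, 4).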
Invoke the Assembler with a two-part DDX
Write a DDX that imports the Bookmark file into the original PDF and then uses the added bookmarks to disassemble the PDF.

How to change background color of text in pdf based by font name

Hi
How can I change the background color of text in a PDF based on font name? Is there any option in JavaScript? For example, if the PDF contains the Arial font, the background of all Arial text should be changed to red on all pages, and likewise for every other font in the PDF, each with a different color.
Thanks in Advance

Hi
1) Is there any way to highlight text with different colors based on font using JavaScript?
2) How do I list the fonts used in a PDF using JavaScript?
3) How do I highlight text using JavaScript?
Thanks in Advance

Text Object Error in Pdf based print forms

Hello Friends,
I am trying to include a text object in Adobe PDF-based print form.
In the context, I have created a node for the text. I chose the Text Type as Include Text. I am able to choose the required Text Object and Text ID from the respective search helps. When trying to activate the form, I am getting an error saying that I did not specify a text name. I tried to rectify this error but could not do so.
Please help me out on how to rectify this error.
Points will be rewarded for useful answers.
Thanks,
John.

There is no need of activation for standard text... save will do...
Also note: standard text is client dependent... you need to attach to your transport request manually to move between clients...
Close the thread once your question is answered.
Regards,
Sairam

Creating layout in PDF Based Form to print table content.

Hi,
I am having a problem creating the layout of a PDF-based form. I do not need any interactive text, only an active table in my context whose data I need to print. What I did was drag the table from the Data View onto the body page and activate. When I run it, I get only the table structure, without any data.
Can anyone help me or point me to a tutorial for this? I have checked SAPnet for PDF-based print forms, but it somehow skips how to create the layout.
With regards,
Saurabh Kumar Pandey

Have a look at help.sap.com:
Interactive Forms based on Adobe Software: http://help.sap.com/saphelp_erp2005/helpdata/en/b7/64348655fb46149098d95bdca103d0/content.htm
Inserting a Table or Loop: http://help.sap.com/saphelp_erp2005/helpdata/en/4c/9cc19e5c874091a99790e540b06f3a/content.htm

Splitting a PDF with iText

Hi,
I'm trying to split a PDF into multiple PDF files, one PDF per bookmark.
Can anyone give any pointers?
I've got the bookmarks (via the outline). They are bookmarks within a bookmark, which is why I go two levels in. But I've got lost trying to find the pages and write them out.
Thanks
Mike
import com.lowagie.text.pdf.*;
import java.util.*;
public class PDFSplit {
    public static void main( String argv[] ) throws Throwable {
        PdfReader reader = new PdfReader("test.pdf");
        PdfDictionary dic = reader.getCatalog();
        PdfObject o = dic.get( PdfDictionary.OUTLINES );
        PdfDictionary outline = (PdfDictionary)reader.getPdfObject( o );
        // The bookmarks of interest sit one level below the first top-level bookmark.
        PdfDictionary first = (PdfDictionary)reader.getPdfObject( outline.get( PdfName.FIRST ) );
        PdfDictionary ff = (PdfDictionary)reader.getPdfObject( first.get( PdfName.FIRST ) );
        // Walk the sibling chain and print each bookmark title.
        while( ff != null ) {
            String title = ff.get( PdfName.TITLE ).toString();
            System.out.println( title );
            PdfObject next = ff.get( PdfName.NEXT );
            if( next == null )
                ff = null;
            else
                ff = (PdfDictionary)reader.getPdfObject( next );
        }
    }
}

Tis OK, worked it out, but I do feel as if a little too much magic is keeping it together....
(very ugly code follows)
import com.lowagie.text.*;
import com.lowagie.text.pdf.*;
import java.io.*;
import java.util.*;
public class PDFSplit {
    public static void main( String argv[] ) throws Throwable {
        PdfReader reader = new PdfReader("test.pdf");
        PdfDictionary dic = reader.getCatalog();
        reader.consolidateNamedDestinations();
        PdfObject o = dic.get( PdfDictionary.OUTLINES );
        PdfDictionary outline = (PdfDictionary)reader.getPdfObject( o );
        // The bookmarks of interest sit one level below the first top-level bookmark.
        PdfDictionary first = (PdfDictionary)reader.getPdfObject( outline.get( PdfName.FIRST ) );
        PdfDictionary ff = (PdfDictionary)reader.getPdfObject( first.get( PdfName.FIRST ) );
        int size = reader.getNumberOfPages();
        int start = 1;
        // Remember the first bookmark's title, then move on to the second bookmark.
        String title = ff.get( PdfName.TITLE ).toString();
        PdfObject next = ff.get( PdfName.NEXT );
        if( next == null ) ff = null;
        else ff = (PdfDictionary)reader.getPdfObject( next );
        while( ff != null ) {
            // The page this bookmark points to is where the previous section stops.
            PdfArray a = (PdfArray)reader.getPdfObject( ff.get( PdfName.DEST ) );
            PdfDictionary page = (PdfDictionary)reader.getPdfObject( (PdfObject)a.getArrayList().get(0) );
            int end = size;
            for( int i = 1; i<size;i++ ) {
                if( page.equals( reader.getPageN(i) ) ) {
                    end = i;
                    break;
                }
            }
            System.out.println( end );
            // Copy pages start..end-1 into a file named after the previous bookmark.
            Document document = new Document( reader.getPageSizeWithRotation( 1 ) );
            PdfCopy writer = new PdfCopy( document, new FileOutputStream( title + ".pdf" ) );
            document.open();
            for( int i = start; i<end; i++ ) {
                System.out.println( i );
                PdfImportedPage cpage = writer.getImportedPage( reader, i );
                writer.addPage( cpage );
            }
            PRAcroForm form = reader.getAcroForm();
            if (form != null)
                writer.copyAcroForm(reader);
            document.close();
            title = ff.get( PdfName.TITLE ).toString();
            start = end;
            next = ff.get( PdfName.NEXT );
            if( next == null )
                ff = null;
            else
                ff = (PdfDictionary)reader.getPdfObject( next );
        }
        // The last bookmark's section runs to the end of the document.
        Document document = new Document( reader.getPageSizeWithRotation( 1 ) );
        PdfCopy writer = new PdfCopy( document, new FileOutputStream( title + ".pdf" ) );
        document.open();
        for( int i = start; i<=size; i++ ) {
            System.out.println( i );
            PdfImportedPage cpage = writer.getImportedPage( reader, i );
            writer.addPage( cpage );
        }
        PRAcroForm form = reader.getAcroForm();
        if (form != null)
            writer.copyAcroForm(reader);
        document.close();
    }
}

Disassemble PDF based on Content Table

Suppose a PDF file has a table of contents. Is it possible to split this PDF file into multiple small PDF files that only contains one chapter?
Thanks,
P

What I am trying to do is find a programmatic way to split a PDF file into multiple PDF files that each contain only one chapter, based on the table of contents information in the PDF file. The bookmarks may not contain the table of contents information. For example, some PDF files have no bookmarks but do have a table of contents. How can this be handled programmatically?
Thanks.

Error while running PDF based views

Hi,
A few days back I was able to run my interactive PDF-based forms in my WDA applications. There was a patch upgrade, after which I am getting the error 'WebDynpro Exception: ADS: Request start time: Wed Mar 28 19:05:48 IST 2007(200,101). ' However, I am able to run another application that was also created before the patches were applied. So I am confused as to why my application is failing.
Regards
Murali

Here is the dump
Error when processing your request
What has happened?
The URL http://rethr6.XXX.com:8001/sap/bc/webdynpro/sap/zpdf/ was not called due to an error.
Note
The following error text was processed in the system D35 : WebDynpro Exception: ADS: Request start time: Thu Mar 29 13:33:24 IST 2007(200,101).
The error occurred on the application server RET_XXX_01 and in the work process 0 .
The termination type was: RABAX_STATE
The ABAP call stack was:
Method: RAISE of program CX_WD_GENERAL=================CP
Method: CREATE_PDF of program CL_WD_ADOBE_SERVICES==========CP
Method: IF_WDR_VIEW_ELEMENT_ADAPTER~SET_CONTENT of program /1WDA/LADOBE==================CP
Method: IF_WDR_VIEW_ELEMENT_ADAPTER~SET_CONTENT of program /1WDA/LADOBE==================CP
Method: IF_WDR_VIEW_ELEMENT_ADAPTER~SET_CONTENT of program /1WDA/L8STANDARD==============CP
Method: IF_WDR_VIEW_ELEMENT_ADAPTER~SET_CONTENT of program /1WDA/L8STANDARD==============CP
Method: IF_WDR_VIEW_ELEMENT_ADAPTER~SET_CONTENT of program /1WDA/L7STANDARD==============CP
Method: IF_WDR_VIEW_ELEMENT_ADAPTER~SET_CONTENT of program /1WDA/L8STANDARD==============CP
Method: IF_WDR_VIEW_ELEMENT_ADAPTER~SET_CONTENT of program /1WDA/L8STANDARD==============CP
Method: IF_WDR_VIEW_ELEMENT_ADAPTER~SET_CONTENT of program /1WDA/L7STANDARD==============CP
What can I do?
If the termination type was RABAX_STATE, then you can find more information on the cause of the termination in the system D35 in transaction ST22.
If the termination type was ABORT_MESSAGE_STATE, then you can find more information on the cause of the termination on the application server RETHR6_D35_01 in transaction SM21.
If the termination type was ERROR_MESSAGE_STATE, then you can search for more information in the trace file for the work process 0 in transaction ST11 on the application server RET_XXX_01 . In some situations, you may also need to analyze the trace files of other work processes.
If you do not yet have a user ID, contact your system administrator.
Error code: ICF-IE-http -c: 236 -u: MURLI236 -l: E -s: D35 -i: RETHR6_D35_01 -w: 0 -d: 20070329 -t: 133325 -v: RABAX_STATE -e: UNCAUGHT_EXCEPTION

Error in generating PDF Based form - SUI Report

Hi,
We are running quarterly reports for unemployment reporting in the USA using Tax Reporter.
We are not able to see the complete spool output for the Wage Type Listings, only a single page. In the Tax Reporter log we get the error "Error in generating PDF Based form HR_F_WLIST_CA" for the respective states.
Any idea how to resolve this?
We are on ERP 6.04.
Thanks,

Hi,
I think it's a Basis problem. Ask the Basis guys to repair the connection and try again.
Regards,
Manoj.

Assigning a Numeric Value in a Cell Based on Text in Another Cell

In advance, thanks for your assistance. I'm trying, in vain, to assign a numeric value in a cell based on text (from a dropdown menu) in another cell. For example, in cell A5 I have a dropdown list that includes the options "blue", "red", "white", and "gold." I want cell C15 to be 2 if A5="blue"; I want C15 to be 0 if A5="red"; I want C15 to be 2 if A5="white"; and, I want C15 to be 1 if A5="gold."

Tippet,
This is a job for LOOKUP.
The expression for the Result cell is: =LOOKUP(A2, Lookup :: A1:A4, Lookup :: B1:B4)
The aux. table contains the matches that you assign for the colors.
Regards,
Jerry

I am trying to export a combined PDF based on the Book option using the script below, but I am getting the following error message: "Invalid value for parameter 'to' of method 'exportFile'. Expected File, but received 1952403524". If anyone knows, please suggest a solution.

Dear ALL,
I am trying to export a combined PDF based on the Book option using the script below, but I am getting the following error message: "Invalid value for parameter 'to' of method 'exportFile'. Expected File, but received 1952403524". If anyone knows, please suggest a solution.
var myBookFileName, myBookFileName_temp;
if ( myFolder != null )
{
    var myFiles = [];
    var myAllFilesList = myFolder.getFiles("*.indd");
    for (var f = 0; f < myAllFilesList.length; f++)
    {
        var myFile = myAllFilesList[f];
        myFiles.push(myFile);
    }
    if ( myFiles.length > 0 )
    {
        myBookFileName = myFolder + "/" + myFolder.name + ".indb";
        myBookFileName_temp = myFolder.name;
        myBookFile = new File( myBookFileName );
        myBook = app.books.add( myBookFile );
        myBook.automaticPagination = false;
        for ( i = 0; i < myFiles.length; i++ )
        {
            myBook.bookContents.add( myFiles[i] );
        }
        var pdfFile = File(File(myFolder).fsName + "\\" + myBookFileName_temp + "_WEB.pdf");
        var bookComps = myBook.bookContents;
        if (bookComps.length === 1)
        {
            bookComps = [bookComps];
        }
        var myPDFExportPreset = app.pdfExportPresets.item("AER6");
        app.activeBook.exportFile(ExportFormat.PDF_TYPE,File("D:\\AER\\WEBPDF.pdf"),false,myPDFExportPreset,bookComps);
        //myBook.exportFile (ExportFormat.pdfType, pdfFile, false);
        //myBook.exportFile(pdfFile, false, pdfPref, bookComps);
        myBook.close(SaveOptions.yes);
    }
}

Change the below line:
app.activeBook.exportFile(ExportFormat.PDF_TYPE,File("D:\\AER\\WEBPDF.pdf"),false,myPDFExportPreset,bookComps);
to
app.activeBook.exportFile(ExportFormat.PDF_TYPE,File("D:\\AER\\WEBPDF.pdf"),false,myPDFExportPreset);
Vandy

Acrobat X: PDF to PS and back results in a "nice" PDF, but copy text from it is bad!

We used this workflow (PDF -> PS -> PDF) to get very small filesizes in any cases.
Before, we had problems with certain PDFs with many layers.
But now we have problems with the text within these PDFs.
Selecting text, copying it to the clipboard, and pasting it somewhere results in small rectangles ( ).
Any ideas?
Thanks
Norbert

Acrobat shows very normal text of this PDF:
Select it with the select tool, ctrl-c, and paste it into e.g. find dialog shows rectangles:

PDF-form loses text style on another computer

I created a PDF form with text fields using our corporate font. My colleague has to open it in Preview and insert text into those fields (she has the corporate font installed). But when she opens it and types the text, my text style is lost (the corporate font and probably its size). What should I do so that she can type in my form and get the same text style as I have? Thank you.
Arina.

Do NOT use Apple Preview to open PDF forms. It corrupts them in various ways and is very buggy.
Stick with Adobe software for best results.

How do I split a pdf file when the file size is too large?

How do I split a pdf file when the file size is too large? Thanks!

With Adobe Acrobat. It can also optimize your document to make the size smaller.

I am converting a .docx file to a pdf and the text is coming out blurry

I am trying to convert a .docx file to a PDF and the text keeps coming out blurry. Some sentences seem to be bolded in the PDF as well. All the colored text looks as if there's a shadow behind it, and all the text in bold seems extra blurry. I am saving the file as a PDF. I tried to print the file as a PDF but it kept crashing. I have Adobe Acrobat Pro and Distiller, but I'm not savvy about which program does what.
Thanks for any help!

If you are trying to convert a Word file, ensure the text is in 100% black only.
There are hundreds of different combinations of settings that can be used to create PDFs, but only a few will give you a good 'print ready' file, which is what I think you may be after.
You may have font issues that are stopping the text from looking sharp.
If you have Distiller, you can try to print as .ps (save as PostScript), load Distiller up and choose one of the high quality settings in the pop-down menu.
If you are working on a Mac you can drag and drop the .ps file directly onto the Dock icon for Distiller; if you are running Windows you will need to navigate to the file via the menu bar (you will need to know where you saved your .ps file).
There may be other issues with the original file format that is causing problems with your PDF creation.
Let me know how you get on.
Cheers
