Disassembling a PDF based on a text string on the page

I am looking for some guidance (and an example, if at all possible) on how to disassemble a multipage PDF based on text like "Tax ID" contained on certain pages. I have a document that contains 1000 pages; 100 of those pages may contain the text "Tax ID" for 100 different people, and I would like 100 different PDFs as the output, each holding the 1 page that has that person's "Tax ID". In addition, it would be great to extract the value next to the text "Tax ID" so that the PDFs could be named accordingly.
The challenge here is: how do I get the page numbers that contain the "Tax ID" text, along with the text sitting to the right of it? Once I have that, I can simply feed that information back into Assembler via the DDX for extraction.
Any help here would be greatly appreciated.

You've posed an interesting problem. Here is one approach, which requires you to add a few steps to your Workbench process:
Invoke the Assembler service with a DDX that extracts text information from the original PDF.
Invoke the XSLT service to convert the extracted text info into a Bookmark file.
Invoke the Assembler service with a two-part DDX which imports the Bookmark file into the original PDF and then uses the added bookmarks to disassemble the PDF.
Invoke the Assembler service with a DDX that extracts text information from the original PDF
Here is a DDX that extracts text info:
<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
  <DocumentText result="text">
    <PDF source="myOriginalPDF"/>
  </DocumentText>
</DDX>
The result will be an XML file with this appearance:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="C:\Adobe\TaxID.xslt"?>
<DocText xmlns="http://ns.adobe.com/DDX/DocText/1.0/">
    <TextPerPage>
        <Page pageNumber="1">to market to market</Page>
        <Page pageNumber="1">TAX ID 1111 Gee I owe a lot of money to the IRS . How could this be ?</Page>
        <Page pageNumber="2">TAX ID 2222 We all owe lots of money</Page>
        <Page pageNumber="3">TAX ID 3333 We all owe lots of money</Page>
    </TextPerPage>
</DocText>
Invoke the XSLT service to convert the extracted text info into a Bookmark file
Here is an XSLT that converts the text info into a Bookmark file:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:textInfo="http://ns.adobe.com/DDX/DocText/1.0/">
    <xsl:output method="xml" version="1.0" encoding="UTF-8"/>
    <xsl:template match="/">
        <Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
            <xsl:apply-templates/>
        </Bookmarks>
    </xsl:template>
    <xsl:template match="textInfo:Page">
        <xsl:variable name="myText" select="text()"/>
        <xsl:if test='contains($myText, "TAX ID")'>
            <!-- Characters 8-11 hold the ID when the page text starts with "TAX ID nnnn" -->
            <xsl:variable name="taxID" select='substring($myText, 8, 4)'/>
            <Bookmark>
                <Dest>
                    <Fit>
                        <xsl:attribute name="PageNum">
                            <xsl:value-of select="@pageNumber"/>
                        </xsl:attribute>
                    </Fit>
                </Dest>
                <Title>
                    <xsl:value-of select="$taxID"/>
                </Title>
            </Bookmark>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>
Here is the result of this XSLT applied against the example text info:
<?xml version="1.0" encoding="UTF-8"?>
<Bookmarks xmlns="http://ns.adobe.com/pdf/bookmarks" version="1.0">
    <Bookmark xmlns="">
        <Dest>
            <Fit PageNum="1"/>
        </Dest>
        <Title>1111</Title>
    </Bookmark>
    <Bookmark xmlns="">
        <Dest>
            <Fit PageNum="2"/>
        </Dest>
        <Title>2222</Title>
    </Bookmark>
    <Bookmark xmlns="">
        <Dest>
            <Fit PageNum="3"/>
        </Dest>
        <Title>3333</Title>
    </Bookmark>
</Bookmarks>
If you use this XSLT, you should refine it to search for the string "TAX ID" at the beginning of the page rather than anywhere in the page. You should also improve the identification of the TAX ID number to be independent of the length.
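The two refinements just mentioned could be sketched roughly as follows. This is an untested variant, not a drop-in replacement: it assumes the ID is the first whitespace-delimited token after "TAX ID " and that the page text begins with that label once whitespace is normalized. It also declares the bookmarks namespace on the stylesheet element, on the assumption that Assembler expects every bookmark element (not just the root) to be in that namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:textInfo="http://ns.adobe.com/DDX/DocText/1.0/"
    xmlns="http://ns.adobe.com/pdf/bookmarks"
    exclude-result-prefixes="textInfo">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

    <xsl:template match="/">
        <Bookmarks version="1.0">
            <xsl:apply-templates select="//textInfo:Page"/>
        </Bookmarks>
    </xsl:template>

    <xsl:template match="textInfo:Page">
        <!-- Trim the page text and collapse runs of whitespace -->
        <xsl:variable name="myText" select="normalize-space(.)"/>
        <!-- Only pages whose text BEGINS with the label are bookmarked -->
        <xsl:if test="starts-with($myText, 'TAX ID ')">
            <!-- Append a space so substring-before also works when the ID is the last token on the page -->
            <xsl:variable name="rest" select="concat(substring-after($myText, 'TAX ID '), ' ')"/>
            <!-- Assumption: the ID is the first token after the label, whatever its length -->
            <xsl:variable name="taxID" select="substring-before($rest, ' ')"/>
            <Bookmark>
                <Dest>
                    <Fit PageNum="{@pageNumber}"/>
                </Dest>
                <Title><xsl:value-of select="$taxID"/></Title>
            </Bookmark>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>
```

If the real label is "Tax ID" in mixed case, or the ID can contain spaces, the test and the token extraction would need to be adjusted accordingly.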
Invoke the Assembler with a two-part DDX
Write a DDX that imports the Bookmark file into the original PDF and then uses the added bookmarks to disassemble the PDF.
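As a starting point, the two-part DDX might look something like the sketch below. It is untested and rests on several assumptions: `taxIDBookmarks` is a hypothetical input-map key for the Bookmark file produced by the XSLT step, `bookmarkedPDF` is a hypothetical name for the intermediate result, `TaxID` is an arbitrary prefix, and the result of the first part is assumed to be usable as a source in the second part.

```xml
<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
  <!-- Part 1: import the generated bookmarks into the original PDF.
       "taxIDBookmarks" and "bookmarkedPDF" are hypothetical names. -->
  <PDF result="bookmarkedPDF">
    <PDF source="myOriginalPDF">
      <Bookmarks source="taxIDBookmarks"/>
    </PDF>
  </PDF>
  <!-- Part 2: split the bookmarked PDF at its bookmarks.
       Assumes the result of part 1 can be referenced as a source here. -->
  <PDFsFromBookmarks prefix="TaxID">
    <PDF source="bookmarkedPDF"/>
  </PDFsFromBookmarks>
</DDX>
```

My understanding is that PDFsFromBookmarks produces one document per level-1 bookmark, and that each document runs from its bookmark up to the next one, so it is worth checking whether each output contains only the "Tax ID" page or also the pages that follow it.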

Similar Messages

  • PDF files break up when scrolling down the page

    Embedded pdf files break up when scrolling down the page i.e.
    http://www.wight-cam.co.uk/WightCAM/HTML/2011/110321.htm

    Go to the "inspector"-click on the "table" icon-then "headers and footers" where you can set the number of headers or footers you want and click/unclick "freeze". You can also do that directly from the button bar on the "header-footer" buttons

  • Find a text frame on the page with script label

    hello to all
    I need to create a script to
    find a text frame on the page with script label "xxx"
    and read its contents into a variable.
    The content of the text frame is a number.
    thanks

    Hi Roberto,
    Welcome to the forum,
    This will find the labeled textFrame on the active page.
    var myLabel = "Foo", // change to label
        myPage = app.properties.activeWindow && app.activeWindow.activePage,
        myTextFrames = myPage.textFrames.everyItem().getElements().slice(0),
        l = myTextFrames.length,
        myVariable;
    while (l--) {
        if (myTextFrames[l].label != myLabel) continue;
        myVariable = myTextFrames[l].contents;
        break; // presuming there's only one "Foo" labeled frame on the page
        // Otherwise you'll need an array
    }
    alert(myVariable);
    Trevor

  • [Forum FAQ] How to find and replace text strings in the shapes in Excel using Windows PowerShell

    Windows PowerShell is a powerful command tool that we can use for management and operations. In this article we introduce the detailed steps to use Windows PowerShell to find and replace text strings in the shapes in an Excel object.
    Since the Excel.Application object is available for representing the entire Microsoft Excel application, we can invoke the relevant properties and methods to interact with an Excel document.
    The figure below is an Excel file:
    Figure 1.
    You can use the PowerShell script below to list the text in the shapes and replace each text string with the corresponding entry in $text:
    $text = "text1","text2","text3","text3"
    $Excel = New-Object -ComObject Excel.Application
    $Excel.visible = $true
    $Workbook = $Excel.workbooks.open("d:\shape.xlsx")      # Open the excel file
    $Worksheet = $Workbook.Worksheets.Item("shapes")        # Open the worksheet named "shapes"
    $shape = $Worksheet.Shapes                              # Get all the shapes
    $i = 0                                                  # Index into $text, so the replacements are applied in sequence
    Foreach ($sh in $shape) {
        $sh.TextFrame.Characters().text                     # Output the current text of the shape
        $sh.TextFrame.Characters().text = $text[$i++]       # Change the text of the shapes one by one
    }
    $WorkBook.Save()                                        # Save workbook in excel
    $WorkBook.Close()                                       # Close workbook in excel
    [void]$excel.quit()                                     # Quit Excel
    Before invoking the methods and properties, we can use the cmdlet “Get-Member” to list the available methods.
    Besides, we can also find the documents about these methods and properties in MSDN:
    Workbook.Worksheets Property (Excel):
    http://msdn.microsoft.com/en-us/library/office/ff835542(v=office.15).aspx
    Worksheet.Shapes Property:
    http://msdn.microsoft.com/en-us/library/office/ff821817(v=office.15).aspx
    Shape.TextFrame Property:
    http://msdn.microsoft.com/en-us/library/office/ff839162(v=office.15).aspx
    TextFrame.Characters Method (Excel):
    http://msdn.microsoft.com/en-us/library/office/ff195027(v=office.15).aspx
    Characters.Text Property (Excel):
    http://msdn.microsoft.com/en-us/library/office/ff838596(v=office.15).aspx
    After running the script above, we can see the changes in the figure below:
    Figure 2.

    Thank you for the information, but does this thread really need to be stuck to the top of the forum?
    If there must be a sticky, I'd rather see a link to a page on the wiki that has links to all of these ForumFAQ posts.
    EDIT: I see this is no longer stuck to the top of the forum, thank you.

  • Text runs off the page / doesn't resize with window

    Hi,
    I have seen this before but can't figure out how to fix this:
    In Mail and other applications that involve typing text, the sentences run off the "page".
    When I resize the window the lines don't adapt to the window.
    I am stumped, please help :-)
    Rogier

    Howdy. See if this helps:
    https://discussions.apple.com/thread/5298047?tstart=150

  • Subform text flows off the page

    My subform text in preview doesn't page break. It flows right off the page. And I do have "Allow Page Breaks within Content" checked inside the Object tab. What's wrong?

    Everything that your subform is a child of has to allow page breaks as well. So your text field has to allow page breaks, as does your subform, any other subforms that the subform is contained in, and the page itself. Normally, if you have a problem like this, there's something else that contains your text field that will have a yellow triangle on it with an exclamation mark, and it will say something like, "This object may not work properly. Although the object is allowed to break, deselecting the Allow Page Breaks Within Content option of the parent object restricts this object from breaking between pages."

  • InD4 - Won't let me put text up to the page edge - need to apply a header! Help!

    I am trying to put a text box at the top of my document to use as a header - something I've done hundreds of times before.  But, this time, when I add a text box, and add text, it's like there's an invisible object there with a text wrap applied, and it won't let me put my text to the top of the page.
    For info: the text box options are set to align text at the top, and to ignore text wrap on any other objects.  I've tried selecting all to see if there was something there that i was missing and needed to delete - no luck.  Even tried shutting down InDesign, reopening, and starting a new document - it does the same thing on the new document.
    It's a space about an inch wide, that extends from one side of the page to the other - but just on the top - I can place text all the way to the edge of the page on the sides and bottom, no problem.  HELP!!!!

    My guess is that you have aligned your text to the document grid, and your grid starts a little bit below the top of your page. If that's the case, you have two options to solve your problem:
    1) Adjust your grid's starting point from Preferences > Grids
    2) Change your paragraph style setting Align text to Grid to None; you will find it on the Indents and Spacing tab (alternatively, you can change it locally from the control panel)

  • When converting word doc to pdf, my images with text only show the background color of the text box.

    I have a Word doc that I am trying to convert to PDF.  I have JPEGs with text boxes on top of them on one page.  It looks great on the screen, but after I convert to PDF the text boxes only have half the text; the first half of the text box is just white, the background color.  If I take the background color out of the text box, the text converts over fine, but I need the background color.
    I have tried many things here on the print settings, standard, high quality print, unchecking the compression on the images.  Any help?

    Thank you for your posting. These forums are specific to the
    Acrobat.com website and its set of hosted services, and do not
    cover the Acrobat family of desktop products. Please visit the
    following forums for any questions related to the Acrobat family of
    desktop products:
    http://www.adobeforums.com/cgi-bin/webx/.3bbeda8b/

  • Printing to any PDF printer results in an image of the page instead of searchable text from all websites

    When I print to any number of different pdf printer software I've tried, it always yields an image of the webpage but never searchable text in the pdf which is the result of printing to pdf from IE or Chrome.

    I suspected as much. I was hopeful it was something like a setting that needed to be changed, or a reg hack that needed a bit changed from 0 to 1, to make the printing work as expected. I only installed the local printer to see if the RDP session tunnel was in fact causing the print size to balloon, so I will continue to instruct users to use that instead of the local redirected printer. I am disappointed but not surprised.
    Edit:
    Additionally, I have tested setting the GPO settings for Easy Print to use the redirected printer's drivers instead of the ones used by Easy Print, by setting "Computer Configuration -> Administrative Templates -> Windows Components -> Remote Desktop Services -> Remote Desktop Session Host -> Printer Redirection" to "Disabled", but with no luck. In fact the printer received the proper Adobe PDF driver, but the job went out into print spool hell and was lost to the ether. I have concluded that this approach will not bear any fruit.

  • Can we create entity object based on a text file on the OS?

    I understand you can interface with files using WebDAV. Can we just create an entity object whose source is a text/XML file on the OS instead of a table?

    Have a look at the URL data source (under new->business tier->web services) - it allows you to create a data control based on an XML or csv file.
    http://technology.amis.nl/blog/?p=1592
    You can also create a data control based on a Java class that interact with a file.

  • How do I localize text strings in the javascriptresource section in a JavaScript for Photoshop?

    I have made a JavaScript for Photoshop. I am using the <javascriptresource> and <terminology> tags to make the script work with actions. How do I make the text strings within the <javascriptresource> section localized (translated)? The localize function and the other ways to localize text that are described in the documentation do not work.

    I recommend that you post your question in the scripting forum, as this forum is meant for beginners. It is not that we won't help; it is that we may not be able to provide the best help. Good luck, I hope you can figure it out.
    http://forums.adobe.com/community/photoshop/photoshop_scripting?view=discussions
    http://forums.adobe.com/community/photoshop/photoshop_sdk?view=discussions

  • When I convert my excel spreadsheet to pdf objects like graphs move down in the page!

    Hi everyone I am very new to this! Please help!
    I was working with someone who has been migrated to Office 2007.  If we create an Excel spreadsheet and convert it to PDF, all is well. When we do the same with his old files, the graphs move to the bottom of the PDF page, any text boxes change font, and the graphs overrun the text. We saved the files as .xlsm and tried again, but to no avail.
    Can someone please help?

    You posted your question to the forum comments forum. This forum is for questions about these forums, not for questions about other Adobe products.
    I am not sure which product you are having a problem with (Acrobat to convert to PDF? Reader to read the PDF? Or the builtin PDF capability of MS Office?), but if it is an Adobe product you can find the appropriate forum from the list of all forums.

  • Acrobat printing certain PDFs with 8 - 9 lines across the pages.

    A few Acrobat users are seeing thin black lines on the printed pages.  We do not see the lines in the PDF on the computer screen.  I am unable to reproduce the issue with Acrobat Pro 10.1.3. I upgraded the user that brought this to my attention to Acrobat Pro 10.1.3 and she is still seeing the black lines on her printed pages.  I had another user test to the same printer and we did not see the lines on the printed pages.  She is using Acrobat Pro 10.0.2.
    At this point I think it's a user setting?  Any ideas or suggestions?
    I attempted to upload examples in PDF format, but PDFs are not allowed?

    Hi,
    Please share the files.
    Are you facing this problem on particular machines? And is this problem reproducible with every PDF document on these machines?
    Also, please tell us which OS you are using on the machines where this problem is occurring.
    Thanks.

  • Displaying text item inside the page I would like

    Hi, is it possible to have a text item link open a new window that displays the formatted title, description and the item's own text, not on the blank default page but on a special page which I have created for this (with my layout as a template)?
    Thanks
    Marek

    Hi Marek,
    I believe what you are looking for is the new Portal 10.1.4 feature, the item placeholder:
    http://download.oracle.com/docs/cd/B14099_15/portal.1014/b13809/template.htm#CIADAEGG
    "You can use Portal Templates for items to enforce a particular layout, style, and associated content. With Portal Templates for items, a requested item displays within the layout defined by the template rather than in place on the item's container page. For example, when a link to an item displays on the item's container page, users click the link, and the item content displays within the context of its associated Portal Template. The item's content is displayed in place of the item placeholder on the template."
    I hope it helps...
    Cheers,
    Pedro.

  • To solve another problem, I need to reset my user agent. But when I type about:config in the address bar, all I get is a Firefox window that has the text "config" on the page and nothing else. No text or fields or ANYTHING. What can i do short of unins

    Firefox 3.6.3
    Windows XP
    Otherwise browses OK.
    == User Agent ==
    Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

    Make sure that you type about:config as one word and without any spaces in the location bar.
    You can also use "Reset all user preferences to Firefox defaults" on the Safe mode start window - See http://kb.mozillazine.org/Resetting_preferences
