How to pick out the main body/article when surrounded by ads?

Hi all,
First of all, not exactly sure where I should post this, so please let me know a better forum if you know of one.
I am working with a linguistic search engine that takes pdf articles, indexes them based on lists of keywords, and then does word counts. For this latest project the pdfs are downloads of newspaper articles, typically 1 to 2 pages long. My linguistic engine works on these fine, but because they are downloads from the web, the pdfs have extraneous advertisements, links, and so on surrounding the main article. This leads to false negatives, so I need to get rid of the things around the main body.
If the articles all had the same format (e.g. headline at the top and some sort of copyright notice at the end), I'd be able to focus the linguistics engine solely on the article. Unfortunately that is not the case.
What I am considering now is some sort of pre-processing using a pdf editor. From what I understand, text in the pdf format is stored in elements which describe the layout. Typically the width of the article text stays the same throughout, and it spans a larger width than something like an advertisement, which contains a link and a description. Would it be possible to pick out the text elements based on something like width or font type, and then save just those portions to another pdf or text file? Can one do such things with Acrobat or the SDK?
I'll have about 5,000 to 10,000 docs to work on, so I need an automated procedure; going through manually and copying and pasting the relevant portions would be too tedious.
Thanks for any suggestions.

It's really hard to say what will be possible without the benefit of seeing a typical example. It's not generally true that "...text in the pdf format is stored in elements which describe the layout..." If the ads are in consistent locations (e.g., on the right 2 inches of the page), you can automate the redaction process using JavaScript.
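
For what it's worth, the width idea from the question can be sketched outside Acrobat. This is not the JavaScript redaction route suggested above, just a minimal Python illustration of the filtering step. It assumes you already have text blocks with bounding boxes from some extraction library (for example, PyMuPDF's `page.get_text("blocks")` returns tuples starting with x0, y0, x1, y1, text). Since PDFs do not necessarily store text in layout elements, those blocks are the extractor's guess. The `TextBlock` type, the function name and the thresholds are all hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple


class TextBlock(NamedTuple):
    """A block of text with its bounding box in page units (e.g. points)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def pick_article_blocks(blocks: List[TextBlock],
                        min_width_ratio: float = 0.8,
                        bucket: float = 10.0) -> List[TextBlock]:
    """Keep blocks at least about as wide as the column carrying the most text."""
    if not blocks:
        return []
    # Weight each (rounded) block width by the amount of text it carries,
    # on the assumption that the article body holds most of the characters.
    weights = Counter()
    for b in blocks:
        width_bucket = round((b.x1 - b.x0) / bucket) * bucket
        weights[width_bucket] += len(b.text)
    body_width = max(weights, key=weights.get)
    # Drop anything much narrower than the body column (ads, link boxes).
    kept = [b for b in blocks if (b.x1 - b.x0) >= min_width_ratio * body_width]
    # Return in reading order: top to bottom, then left to right.
    return sorted(kept, key=lambda b: (b.y0, b.x0))
```

The text of the kept blocks could then be joined and written to a plain text file for the indexing engine. Whether width alone is a good enough signal depends on the actual pages, so it is worth checking against a handful of typical examples first.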

Similar Messages

  • How to find out the main org units in the system

    Hi,
    I want to find all the main org units in the system, i.e. all the parent org units and not the sub org units.
    Can you please let me know how?
    Regards
    Rajesh

    Hi,
    As I understand your query: go to PA30, select the position field, select the structure search tab, go to Find (Ctrl+F), select the object type S (position), and give the position number. It will then show the position's place and the root org unit of the position, and it will expand the structure.
    Regards
    Devi

  • How can I hide the main front panel when I run from the executable?

    Hopefully this is an easy question. I have an application where the front panel is not the main GUI and I would like to hide it when the program runs, but only when it runs from the executable. How can I do this?

    NIquist wrote:
    Yeah, don't do that. Executables built in LV require at least one front panel to be open. If it isn't, the run-time engine automatically closes the executable. I haven't checked recently, but I assume this is still the case.
    Instead, you can set the FP state to minimized (as suggested earlier) or (better) to hidden.
    P.S. one side point - the property and invoke nodes have a shortcut - if you use the Application or VI classes and don't connect a reference, they default to the current app or VI. That means you don't have to open the reference.

  • How to find out the Correct Controlfile script Trace .trc file in /bdump

    Hi Guys
    This may be the most childish query in this forum.
    I want to know how to find the correct trace file when we issue alter database backup controlfile to trace at the SQL prompt to create the controlfile script.
    I find it a bit confusing to go through .trc files with the same date and almost the same time, out of hundreds of trace files in the /bdump directory, to find the correct one.
    To find the alert log file in the /bdump directory we run $ ls -l al* and get the alert log file. Is there a similar way to find the controlfile script trace file?
    Thanks & regards
    MZ

    Creation of the trace file does not happen automatically. What script do you have that creates the control file trace, and when does it run? Look for files in that time frame. Better, modify that script to specifically name the file: BACKUP CONTROLFILE TO TRACE AS ....
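
    The "look for files in that time frame" advice can be scripted. Below is a minimal Python sketch, not an Oracle tool: it lists the .trc files in a directory that were modified within a recent window and that contain the text CREATE CONTROLFILE, which a controlfile trace script includes. The function name, the one-hour window and the directory argument are assumptions for illustration.

```python
import time
from pathlib import Path


def recent_controlfile_traces(dump_dir, max_age_seconds=3600,
                              marker="CREATE CONTROLFILE"):
    """Return recently modified .trc files containing the marker, newest first."""
    cutoff = time.time() - max_age_seconds
    hits = []
    for path in Path(dump_dir).glob("*.trc"):
        mtime = path.stat().st_mtime
        if mtime < cutoff:
            continue  # outside the time frame we care about
        # Trace files are plain text; ignore any undecodable bytes.
        text = path.read_text(errors="ignore")
        if marker in text.upper():
            hits.append((mtime, path))
    # Sort by modification time only, newest first.
    return [p for _, p in sorted(hits, key=lambda h: h[0], reverse=True)]
```

    Call it with whichever dump directory your instance actually writes the trace to. Naming the file explicitly in the BACKUP CONTROLFILE TO TRACE AS statement, as suggested above, avoids the search altogether.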

  • How to place a Logo, Picture ..etc in the main body of an email.

    My requirement is to send an email having a logo as the header. The logo must be placed in the main body of the email and NOT as an attachment.
    I have tried out the following but it gives garbage:
    REPORT  ztest_pratik01.
    * INTERNAL TABLES
    DATA:
          i_objpack   TYPE STANDARD TABLE OF sopcklsti1,
          i_objtext   TYPE STANDARD TABLE OF solisti1,
          i_objbin    TYPE STANDARD TABLE OF solisti1,
          i_hex       TYPE STANDARD TABLE OF solix,
          i_receivers TYPE STANDARD TABLE OF somlreci1.
    * WORK AREAS
    DATA: wa_email_doc TYPE sodocchgi1,
          wa_objpack   TYPE sopcklsti1,
          wa_objtext   TYPE solisti1,
          wa_hex       TYPE solix,
          wa_receivers TYPE somlreci1.
    * CONSTANTS
    CONSTANTS: c_x     TYPE flag   VALUE 'X',
               c_u     TYPE char1  VALUE 'U'.
    DATA: v_body TYPE i.
    * START-OF-SELECTION
    START-OF-SELECTION.
      PERFORM sub_get_logo.
    * END-OF-SELECTION
    END-OF-SELECTION.
      wa_email_doc-obj_name = 'TEST Mail'.
      "Mail Subject
      wa_email_doc-obj_descr
      = 'TEST'.
      v_body = LINES( i_hex ).
    * Creating the entry for the compressed document
    *--(1): Creating entry for the Main Mail body text in itab i_objtext
      wa_objpack-transf_bin = 'X'.
      " Starting index(row) For header information in the itab i_objpack
      wa_objpack-head_start = 1.
      " No of lines for the header information in itab i_objpack
      wa_objpack-head_num   = 1.
      " The row(index) of the itab i_objtext from where the Mail Body starts
      wa_objpack-body_start = 1.  "Skipped the first Line
      " The number of lines in the Mail body
      wa_objpack-body_num   = v_body.  "We have two lines from the 2nd row
      " Document type. There are also whole lot of other options
      wa_objpack-doc_type   = 'RAW'.
      wa_objpack-obj_name   = 'LOGO.BMP'.
      wa_objpack-obj_descr  = 'MAIL BODY'.
      wa_objpack-obj_langu  = ' '.
      " In this case one can skip this. Normally ist calculated as
      " no of linex * 255
      wa_objpack-doc_size = v_body * 255.
      APPEND wa_objpack TO i_objpack.
      CLEAR wa_objpack.
    * Building the recipient list
    * Recipient information
      wa_receivers-receiver = sy-uname.
      wa_receivers-rec_type = 'B'. "To SAP Inbox
      APPEND wa_receivers TO i_receivers.
      CLEAR wa_receivers.
      wa_receivers-receiver = mail id.
      wa_receivers-rec_type = c_u.
      APPEND wa_receivers TO i_receivers.
      CLEAR wa_receivers.
    * Finally send the document
      CALL FUNCTION 'SO_NEW_DOCUMENT_ATT_SEND_API1'
        EXPORTING
          document_data              = wa_email_doc
          put_in_outbox              = 'X'
          commit_work                = 'X'
        TABLES
          packing_list               = i_objpack
          contents_bin               = i_objbin
          contents_txt               = i_objtext
          contents_hex               = i_hex
          receivers                  = i_receivers
        EXCEPTIONS
          too_many_receivers         = 1
          document_not_sent          = 2
          document_type_not_exist    = 3
          operation_no_authorization = 4
          parameter_error            = 5
          x_error                    = 6
          enqueue_error              = 7
          OTHERS                     = 8.
      IF sy-subrc <> 0.
        MESSAGE ID sy-msgid TYPE sy-msgty NUMBER sy-msgno
                WITH sy-msgv1 sy-msgv2 sy-msgv3 sy-msgv4.
      ENDIF.
    *&      Form  SUB_GET_LOGO
    *       text
    *  -->  p1        text
    *  <--  p2        text
    FORM sub_get_logo .
      DATA: graphic_url(255),
            graphic_refresh(1).
      DATA: graphic_size TYPE i.
      DATA: l_graphic_xstr TYPE xstring,
          l_graphic_conv TYPE i,
          l_graphic_offs TYPE i.
      DATA: BEGIN OF graphic_table OCCURS 0,
            line(255) TYPE x,
          END OF graphic_table.
      CLEAR: graphic_url,
             graphic_table[].
      CALL METHOD cl_ssf_xsf_utilities=>get_bds_graphic_as_bmp
        EXPORTING
          p_object  = 'GRAPHICS'
          p_name    = 'Z_LOGO'
          p_id      = 'BMAP'
          p_btype   = 'BMON'
        RECEIVING
          p_bmp     = l_graphic_xstr
        EXCEPTIONS
          not_found = 1
          OTHERS    = 2.
      if sy-subrc = 1.
        message e287 with g_stxbitmaps-tdname.
      elseif sy-subrc <> 0.
        message id sy-msgid type sy-msgty number sy-msgno
                with sy-msgv1 sy-msgv2 sy-msgv3 sy-msgv4.
        exit.
      endif.
      graphic_size = XSTRLEN( l_graphic_xstr ).
      CHECK graphic_size > 0.
      l_graphic_conv = graphic_size.
      l_graphic_offs = 0.
      WHILE l_graphic_conv > 255.
        graphic_table-line = l_graphic_xstr+l_graphic_offs(255).
        APPEND graphic_table.
        l_graphic_offs = l_graphic_offs + 255.
        l_graphic_conv = l_graphic_conv - 255.
      ENDWHILE.
      graphic_table-line = l_graphic_xstr+l_graphic_offs(l_graphic_conv).
      APPEND graphic_table.
      LOOP AT graphic_table.
        wa_hex = graphic_table.
        APPEND wa_hex TO i_hex.
      ENDLOOP.
    ENDFORM.                    " SUB_GET_LOGO
    Any ideas on how to do this?

    Hi ,
    I advise you to raise an OSS note so that SAP can suggest what needs to be done in this case.
    Hope my suggestion is helpful.
    Thanks & Regards
    Pradeep Akula .

  • How do I copy and paste a flyer from Pages to the main body of an e-mail?

    How do I copy and paste a flyer that was designed in pages to the main body of the e-mail

    Print or Export to pdf and drag that into your eMail.
    Peter

  • How to find out the user-exits?

    hi.
    how to find out the user-exits?
    regards
    eswar.

    Hi,
    *& Report  ZEXITFINDER
    report  zexitfinder.
    *& Enter the transaction code that you want to search through in order
    *& to find which Standard SAP User Exits exists.
    *& Tables
    tables : tstc, "SAP Transaction Codes
    tadir, "Directory of Repository Objects
    modsapt, "SAP Enhancements - Short Texts
    modact, "Modifications
    trdir, "System table TRDIR
    tfdir, "Function Module
    enlfdir, "Additional Attributes for Function Modules
    tstct. "Transaction Code Texts
    *& Variables
    data : jtab like tadir occurs 0 with header line.
    data : field1(30).
    data : v_devclass like tadir-devclass.
    *& Selection Screen Parameters
    selection-screen begin of block a01 with frame title text-001.
    selection-screen skip.
    parameters : p_tcode like tstc-tcode obligatory.
    selection-screen skip.
    selection-screen end of block a01.
    *& Start of main program
    start-of-selection.
    * Validate Transaction Code
    select single * from tstc
    where tcode eq p_tcode.
    * Find Repository Objects for transaction code
    if sy-subrc eq 0.
    select single * from tadir
    where pgmid = 'R3TR'
    and object = 'PROG'
    and obj_name = tstc-pgmna.
    move : tadir-devclass to v_devclass.
    if sy-subrc ne 0.
    select single * from trdir
    where name = tstc-pgmna.
    if trdir-subc eq 'F'.
    select single * from tfdir
    where pname = tstc-pgmna.
    select single * from enlfdir
    where funcname = tfdir-funcname.
    select single * from tadir
    where pgmid = 'R3TR'
    and object = 'FUGR'
    and obj_name = enlfdir-area.
    move : tadir-devclass to v_devclass.
    endif.
    endif.
    * Find SAP Modifications
    select * from tadir
    into table jtab
    where pgmid = 'R3TR'
    and object = 'SMOD'
    and devclass = v_devclass.
    select single * from tstct
    where sprsl eq sy-langu
    and tcode eq p_tcode.
    format color col_positive intensified off.
    write:/(19) 'Transaction Code - ',
    20(20) p_tcode,
    45(50) tstct-ttext.
    skip.
    if not jtab[] is initial.
    write:/(95) sy-uline.
    format color col_heading intensified on.
    write:/1 sy-vline,
    2 'Exit Name',
    21 sy-vline ,
    22 'Description',
    95 sy-vline.
    write:/(95) sy-uline.
    loop at jtab.
    select single * from modsapt
    where sprsl = sy-langu and
    name = jtab-obj_name.
    format color col_normal intensified off.
    write:/1 sy-vline,
    2 jtab-obj_name hotspot on,
    21 sy-vline ,
    22 modsapt-modtext,
    95 sy-vline.
    endloop.
    write:/(95) sy-uline.
    describe table jtab.
    skip.
    format color col_total intensified on.
    write:/ 'No of Exits:' , sy-tfill.
    else.
    format color col_negative intensified on.
    write:/(95) 'No User Exit exists'.
    endif.
    else.
    format color col_negative intensified on.
    write:/(95) 'Transaction Code Does Not Exist'.
    endif.
    * Take the user to SMOD for the Exit that was selected.
    at line-selection.
    get cursor field field1.
    check field1(4) eq 'JTAB'.
    set parameter id 'MON' field sy-lisel+1(10).
    call transaction 'SMOD' and skip first screen.
    Regards

  • How do I change the main iCloud account on my iPhone?

    How do I change the main iCloud account on my iPhone?

    I activated an iPad for a work associate; somehow, MY Apple ID account is assigned to his iPad's iCloud backup. His iPad is using up all MY iCloud space, although he IS set up in the App Store with his own Apple ID.
    On my own iPad (just to check what the settings/options were), I went to Settings > iCloud and tapped "Delete Account". But then I got a message that all Photo Stream and documents stored in iCloud would be deleted from this iPad. So, two questions:
    Is there ANY way for him to retrieve the documents he saved in MY iCloud account? There's 9.1 GB of HIS backup in my account (I purchased extra space before I realized it was being eaten up by his data and not mine).
    Because he'd be logging back in under his OWN ID, which has never backed up his device before, he'll lose everything, right?
    I really need someone who KNOWS the workings of iOS and iCloud to answer this one, please. Much (or at least some) of his data is work-related; at 9 GB it can't all be work, so I'd guess there are a lot of movies, photos and music. As this is a WORK-issued device, so be it if he has to re-load all that.

  • How to find out the locale setting in a browser?

    In my Java web project, the multi-lingual issue on the JSP pages is handled by JSTL. It works fine so far for any input messages. However, some messages come from the container. It is needed to find out the browser's country and language settings inside of the container. My question is how to find out the settings?
    Thanks in advance.
    v.

    Hi, this I found on James' website--see http://www.rf.net/~james/perli18n.html#Q28
    Q28. Can web servers automatically detect the language of the browser and display the correct localized page?
    A28. Yes. HTTP/1.1 defines the details of how content negotiation works, including language content.
    WWW browsers send an Accept-Language request header specifying which languages are preferred for responses. This technique works fairly well, although some versions of Netscape Navigator send an improperly formatted request parameter. Also, switching language preferences in either Navigator or IE 4 doesn't always "take" without first deleting a language hint.
    Few sites do content-negotiation on language, and interestingly enough I do not know of any major portals doing this. One site that does is Sun's documentation library at SunDocs. Debian.org does a very nice job of using Apache content negotiation with languages and also has some nice help info too on Setting the Default Language.
    Apache's Content Negotiation features will select the right page to return, whether HTML or image file. Annoyingly, the match logic is very literal, so a browser request for en-us will not match a server entry of en except as a last resort. Any other exact match will win over en, even if en-us was first preference.
    There are 2 ways of doing content negotiation with Apache: type or variant maps and multiviews.
    Variant Maps
    In httpd.conf, disable Options Multiviews if configured and add
    AddType type-var var
    DirectoryIndex index.var
    Then create an index.var file like this:
    URI: start; vary="type,language"
    URI: index.html.en
    Content-type: text/html
    Content-language: en-GB
    URI: index.html.en
    Content-type: text/html
    Content-language: en-US
    URI: index.html.en
    Content-type: text/html
    Content-language: en
    URI: index.html.fr
    Content-type: text/html
    Content-language: fr-CA
    URI: index.html.fr
    Content-type: text/html
    Content-language: fr
    URI: index.html.es
    Content-type: text/html
    Content-language: es
    Multiviews
    The multiviews technique works like this. This does add extra server load, as each content directory must be scanned for the variant document names.
    index.html is localized into variant documents such as:
    index.html.en (or index.en.html)
    index.html.fr
    index.html.es
    index.html (or index.html.html as a reader has recommended) symlinked to one of the above as the last resort.
    Here's an example of httpd.conf directives for this:
    # in httpd.conf
    AddLanguage en .en
    AddLanguage fr .fr
    AddLanguage es .es
    # LanguagePriority allows you to give precedence to some languages
    # in case of a tie during content negotiation.
    # Just list the languages in decreasing order of preference.
    LanguagePriority en es fr de pt
    Options Multiviews
    # end of httpd.conf fragment
    Starting in Apache 1.2, you may also create documents with multiple language extensions.
    O'Reilly's Apache - The Definitive Guide, Chapter 6
    ApacheWeek article on Language Negotiation
    Another ApacheWeek article on Language Negotiation
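
    For the original question: inside a servlet container, `HttpServletRequest.getLocale()` and `getLocales()` already expose the browser's preference, derived from the Accept-Language header. To illustrate what that header carries, here is a minimal Python sketch of parsing it and picking a supported language. Unlike the literal Apache matching described above, this sketch falls back from a regional tag such as en-us to its primary language en. The function names and the default are hypothetical, and it is not a full implementation of HTTP content negotiation.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, most preferred first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q value as "not acceptable"
        prefs.append((-quality, position, tag.strip().lower()))
    # Highest quality first; ties keep header order; q=0 entries are dropped.
    return [tag for neg_q, _, tag in sorted(prefs) if neg_q < 0]


def pick_language(header, available, default="en"):
    """Pick the first preferred tag we support, trying the primary language too."""
    for tag in parse_accept_language(header):
        if tag in available:
            return tag
        primary = tag.split("-")[0]
        if primary in available:
            return primary
    return default
```

    For example, `pick_language("en-US,en;q=0.9", {"en", "fr", "es"})` selects `en` even though no en-us variant is offered.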

  • How to find out the account group information in customer master record?

    how to find out the account group information in customer master record?
    in which tab? thanks in advance

    Hi
    Go to XD02 and select the Extras from the main menu , you will find Account group info -> click on the No.ranges.
    reward if it helps
    SR

  • Transaction IW32.How to find out the person name.(Last changed By)

    Hi all,
    When I execute transaction IW32, it displays the Changed By and Created By fields. Can anybody please tell me how to find the name of the person who made the last change, i.e. the value behind the Changed By field?
    Please tell me the table and field name for the Last Changed By field.
    Regards,
    Munna.

    hi,
    Check table AUFK, field AENAM, for the order number (AUFNR) in IW32.

  • How to find out the locks in BPS & IP

    Hi Viewers,
    Can anybody tell me how to find the locks in BPS, and the same in IP? If possible, please provide the navigation steps as well.
    Thanks & Regards,
    Venkat Vanarasi.

    Dear Venkat,
    You can use SM12 Transaction code for Lock Management.
    Regards,
    Malik
    If the reply is useful, don't forget about the points.

  • I have a manual that contains headings and index entries that contain less than and greater than characters, < and >. The Publish to Responsive HTML5 function escapes these correctly in the main body of the text but does not work correctly in either the C

    I have a manual that contains headings and index entries that contain less than and greater than characters, < and >. The Publish to Responsive HTML5 function escapes these correctly in the main body of the text but does not work correctly in either the Contents or the Index of the generated HTML. In the Contents the words are completely missing and in the index entries the '\' characters that are required in the markers remain in the entry but the leading less than symbol and the first character of the word is deleted; hence what should appear as <dataseries> appears as \ataseries\>. I believe this is a FMv12 bug. Has anyone else experienced this? Is the FM team aware and working on a fix. Any suggestions for a workaround?

    The Index issue is more complicated since in order to get the < and > into the index requires the entry itself to be escaped. So, in order to index '<x2settings>' you have to key '\<x2settings\>'. Looking at the generated index entry in the .js file we see '<key name=\"\\2settings\\&gt;\">. This is a bit of a mess and produces an index entry of '\2settings\>'. This ought to be '<key name=\"&amp;lt;x2settings&amp;gt;\" >'. I have tested this fix and it works - but the worst of it is that the first character of the index entry has been stripped out. Consequently I cannot fix this with a few global changes - and I have a lot of index entries of this type. I'm looking forward to a response to this since I cannot publish this document in its current state.  

  • How to find out the Idoc number triggered for any material transfer from SAP

    Hi Folks,
    Can anybody let me know how to find out the Idoc number triggered for any material transfer from SAP?
    Do we have any navigation for that in MM03?
    Thanks,
    SPMD.

    Hi Shabbirmdpasha,
    If you know the user name, you can find the idoc numbers created by that user. The problem is that this gives not only the material idocs but all the idocs created by that user. Go to SE16 --> table name EDIDS --> fill in the approximate date, give the user ID in UNAME, and execute. This will list all the idocs created by that user. I know it is only a partial solution.
    I would also suggest posting the same question in the ABAP forums for more answers:
    ABAP Development
    Regards,
    ---Satish

  • How to find out the Functional module related to a T-code

    Hi All ,
    Please tell me how to find out the function module related to a T-code.
    I want it for the T-code RSZDELETE.

    Hi
    There is no direct way to see this.
    You need to pick the program (SE37/SE38) and tables (SE16/SE11) and see where it is used.
    The FM for RSZDELETE is RSZ_DB_COMP_REORG_AS_POPUP.
    Hope it helps
