Extracting text in correct reading order

Hi,
I am using the Adobe SDK to extract text from PDFs. It is working fine, but I am facing one major problem: when I extract text from a PDF with multiple columns, the text is not extracted in the correct order. I found that the reading order can be corrected with the TouchUp Reading Order tool, but when I use it, tag the content as text, and then extract the text, the spaces between words and sentences are removed, which is not acceptable. Please suggest how I can extract text in the correct reading order, and how to keep TouchUp Reading Order from removing the spaces.
Thanks
Kiranmai

You can get the position of each word in the file and do your own analysis. Acrobat cannot know that you want columns; there is always a measure of guesswork in text extraction (unless the file is tagged, which defines a precise reading order).
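As a rough illustration of "do your own analysis", here is a minimal Python sketch. It assumes you have already pulled each word's text and bounding box out of the PDF by whatever means your SDK offers; the Word class, the min_gap and line_tol thresholds, and the function names are all hypothetical and not part of any Adobe API. It guesses column gutters from wide horizontal gaps, reads each column top to bottom, and joins words with explicit spaces so none are lost.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word box
    x1: float  # right edge of the word box
    y: float   # baseline, PDF-style: larger values are higher on the page


def column_ranges(words, min_gap=20.0):
    """Merge the horizontal extents of all words. A gap of at least
    min_gap between merged extents is treated as a column gutter."""
    columns = []
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if columns and x0 - columns[-1][1] < min_gap:
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            columns.append([x0, x1])
    return columns


def reading_order(words, min_gap=20.0, line_tol=2.0):
    """Return the text column by column, one output line per text line."""
    lines_out = []
    for left, right in column_ranges(words, min_gap):
        col = [w for w in words if left <= w.x0 and w.x1 <= right]
        col.sort(key=lambda w: (-w.y, w.x0))  # top to bottom, then left to right
        line, last_y = [], None
        for w in col:
            # A baseline jump larger than line_tol starts a new line.
            if last_y is not None and abs(w.y - last_y) > line_tol:
                line.sort(key=lambda t: t.x0)
                lines_out.append(" ".join(t.text for t in line))
                line = []
            line.append(w)
            last_y = w.y
        if line:
            line.sort(key=lambda t: t.x0)
            lines_out.append(" ".join(t.text for t in line))
    return "\n".join(lines_out)
```

This is only a heuristic: a heading or figure that spans both columns bridges the gutter and makes the whole page look like one column, so real pages usually need the analysis applied per region rather than per page.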

Similar Messages

How to extract TEXT for archived Purchase Orders ?

Hi Friends,
Can anyone tell me how to extract TEXT for archived Purchase Orders?
I have used READ_TEXT, but it does not fetch texts for archived POs. Whenever I try to fetch data from STXH for an archived PO, no value comes back and the result is SY-SUBRC <> 0.
Any demo code will be highly appreciated.
Thanks in advance..
Sivaji

Hi,
You can see that table STXH is linked to archiving object MM_EKKO (you can see it in tcode DB15).
My suggestion is that you read the data from the archive. See the demo archiving object BC_SBOOK in tcode AOBJ, where you can find the report that reloads data. The point is to get the data into an internal table. For example, in report SBOOKR you can see this function module call:
*   get data records from the data container
*   SBOOK
    CALL FUNCTION 'ARCHIVE_GET_TABLE'
      EXPORTING
        archive_handle        = lv_handle
        record_structure      = 'SBOOK'
        all_records_of_object = 'X'
      TABLES
        table                 = lt_sbook_tmp
      EXCEPTIONS
        end_of_object         = 0.         "no entries of this type
*   check lt_sbook_tmp entries against selections. Delete not
*   requested entries
    LOOP AT lt_sbook_tmp ASSIGNING <ls_sbook>
                         WHERE carrid IN s_carrid
                           AND connid IN s_connid
                           AND fldate IN s_fldate.
      APPEND <ls_sbook> TO lt_sbook.
    ENDLOOP.
    REFRESH lt_sbook_tmp.
The idea is that you get the same data that READ_TEXT would handle (because the data is no longer in the database) and recover the text from it.
I hope this helps you
Regards
Eduardo

Accessibility: Reading order of tables and anchored frames

I am creating accessible, tagged (section 508 compliant) PDFs in FrameMaker 9. The reading order for tables and frames is not correct.
When I view the PDF reading order using Adobe Acrobat Professional or another screen reader, anchored items such as tables and anchored frames are placed last in the reading order, regardless of where they appear in the document flow or page layout. The reading order skips over all tables and frames, reading all paragraphs on a page first, then reading the tables and frames as the last objects on the page. It doesn't make logical sense to skip over tables/frames, as they generally apply to the content that precedes them.
For example, the following document structure:
<Paragraph 1>
<Table 1>
<Paragraph 2>
<Anchored Frame 1>
<Paragraph 3>
<Anchored Frame 2>
<Paragraph 4>
is being read by assistive technology as:
<Paragraph 1>
<Paragraph 2>
<Paragraph 3>
<Paragraph 4>
<Table 1>
<Anchored Frame 1>
<Anchored Frame 2>
I want the document structure to be read correctly as intended.
In other words, the PDFs generated by FrameMaker 9 are not completely accessible because of the incorrect reading order output by default. This information is not listed in the VPAT for FrameMaker 9.
I want to avoid any post processing using Acrobat's Touch Up Reading Order tool. Is there a way to automate updates to reading order?
Can FrameMaker 9 logically place tables and anchored frames into the correct reading order? How do I adjust these settings?
Thanks in advance!

As mentioned above, tables and anchored frames are inserted into their own paragraph style, "Frame". The paragraph style "Frame" is tagged. To my knowledge, there are no options for tagging or not tagging tables or anchored frames.
Regardless of which paragraph type the table or anchored frame is inserted into, and what that paragraph's tagging settings are, it is still last in the reading order.
I've tried a variety of options: tagging the "Frame" paragraph style as a sibling, child, and parent of my other paragraphs; I've even tried omitting it from the reading order. None of these options presents anchored frames (and tables) in the logical reading order.
Even images that are inline (within a paragraph, not in their own paragraph) are not being read as part of the paragraph. Inline images get skipped over by screen readers and get read at the end of the page, which makes no sense whatsoever.
All tables and images end up at the end of the reading order (after ALL paragraphs) regardless of the tagging settings.
Refer to my previous screenshot for a clear diagram of what is happening to the reading order. Each of those anchors is in its own paragraph style. I want tables and anchored frames to be sequential in the reading order along with paragraphs. (1,2,3,4,5,6 not 1,4,2,5,3,6.)
I'm using the Tags tab of the "PDF Setup" dialog to adjust these settings. Is there somewhere else I should be making changes to the reading order?
This is a bit disturbing because FrameMaker touts creating accessible documents and this severe reading order issue impairs my ability to do so. I would not consider documents that jump around the page in an illogical, fixed order to be accessible. I'm very surprised that no one else has encountered this issue (at least that I can find...)

PDWordFinder does not extract text in order

Hi,
My Word document had a few comments.
I converted the Word document to PDF by File->SaveAs->Adobe PDF.
I did not convert the comments to sticky notes, so they appear the same as in the Word document.
My application uses the PDWordFinder API to extract text from the document.
I notice that the text in these comments is retrieved only at the end.
Why is the text in the comments (not sticky notes) retrieved last and not in the order it appears in the document?
Is there any option to make the wordfinder retrieve text in the order of appearance?

I need to extract text in 'reading' order, but it's not very clear how to use PDWordFinderAcquireWordList parameters.
Can I use different 'reading order' for PDDocCreateWordFinderUCS method, or can I use xySortTable?
What are the sorting parameters (if they exist) for AcquireWordList or WordFinder? Thanks

Extract text file from a folder and read the content

Hi
I have "n" text files saved in a folder, with automatically generated names that include the date (DD/MM/YYYY) and a measurement output value.
Eg: 1) Die_1_DUT_outputvalue_DD_MM_YYYY.txt
2) Die_1_DUT_outputvalue_DD_MM_YYYY_ABC.txt
In the above files, part of the 2nd file's name is the same as the first file's (i.e. Die_1_DUT and DD_MM_YYYY), whereas outputvalue is different and an additional string, ABC, is appended.
Now I want to find the 2nd file by matching its naming pattern against the 1st file (note: the outputvalue in the file name is different for the two files). So far I have followed this method:
1) Use a list folder with *.txt pattern to search all the text files; the output is a 1D array
2) then use array to cluster and then flatten to XML function to have all the text file names as a string element (not a 1D array)
3) then pass the output of the 2nd step to the string match pattern and use a regular expression to get the required file name
4) send the output of the 3rd step to search 1D array to get the index, then get the file name, and later use read text file to read the content of the text file
I am stuck at the 3rd step: I don't know what regular expression to supply, because the outputvalue in the names of the two files is different. Is there any way I can actually extract the file name/file?
Any suggestions?

Some bits in your code are unnecessary, a leaner implementation here:
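The implementation referred to in the reply is not reproduced here. As a language-neutral illustration of the matching idea only (this is Python, not LabVIEW), here is a minimal sketch. It assumes the output value itself contains no underscore and that the companion file always ends in _ABC; NAME_RE, find_companion and read_companion are hypothetical names, and the sample file names in the usage note are made up.

```python
import re
from pathlib import Path

# prefix _ outputvalue _ DD_MM_YYYY [_ABC] .txt
# Assumption: the output value contains no underscore.
NAME_RE = re.compile(
    r"^(?P<prefix>.+)_(?P<value>[^_]+)_(?P<date>\d{2}_\d{2}_\d{4})"
    r"(?P<suffix>_ABC)?\.txt$"
)


def find_companion(first_name, all_names):
    """Return the *_ABC.txt name that shares prefix and date with
    first_name, ignoring the output value, or None if there is none."""
    first = NAME_RE.match(first_name)
    if first is None or first.group("suffix"):
        return None
    for name in all_names:
        cand = NAME_RE.match(name)
        if (
            cand
            and cand.group("suffix")
            and cand.group("prefix") == first.group("prefix")
            and cand.group("date") == first.group("date")
        ):
            return name
    return None


def read_companion(folder, first_name):
    """List the *.txt files in folder, find the companion, read its content."""
    names = [p.name for p in Path(folder).glob("*.txt")]
    match = find_companion(first_name, names)
    return None if match is None else (Path(folder) / match).read_text()
```

For example, given Die_1_DUT_3.3_01_02_2013.txt, the sketch would pick Die_1_DUT_4.7_01_02_2013_ABC.txt and skip Die_2_DUT_4.7_01_02_2013_ABC.txt. The point is to parse each name into its parts and compare only prefix and date, rather than building one regular expression that has to know the output value in advance.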

How to read/extract text from pdf

Respected All,
I want to read/extract text from a PDF. I tried using Etymon but did not succeed.
Could anyone guide me in this?
Thanks and regards,
Ajay.

Thank you very much Abhilshit, PDFBox works for reading pdf.
Regards,
Ajay.

TouchUp Reading Order Tool - invisible Text

Last time I used the TouchUp Reading Order tool (in order to mark out columns of text to copy into a spreadsheet) in November, it worked with no problems. However, when I do this using Acrobat 9.3.0 the text below the first line disappears once I select the content type. The text is still there and I can select it, but it is not visible, even if the page structure is cleared.
Can anyone else reproduce this or am I doing something odd?
J

The "highlight" box/boxes provided by TORU display a high level overview of the tagged PDF's Block Level Structure Elements (BLSE) rather than an element ("tag") by element highlight.
As to establishing a proper "order" - you want to be working out of the Tags panel.
Reliance on TORU to establish a proper tag tree having proper PDF tag semantics can often leave one with a really gobbered PDF.
Be well...

Reversed brackets in Arabic extracted text

I'm working on a system that is reasonably good at extracting text from two different PDF documents and comparing them. It's built using PDFL (I'm hoping the community for Acrobat SDK will be willing to help me out since I can't find a forum for PDFL.)
I run into a problem when working with Arabic text. The issue is reversible symbols like brackets ( ( ), { }, etc) and some other things (like < >) are visibly identical in the two documents, but are encoded as their opposites.
i.e.
Document 1 - Text looks like (ABCDE) and is encoded with the unicode values for (ABCDE)
Document 2 - Text looks like (ABCDE) and is encoded with the unicode values for )ABCDE(
I figure this has something to do with right-to-left read order and mixed font detection or perhaps some other font-setting.
I need a way of detecting when this reversal happens so I can compensate for it when extracting the text. I'm stumbling in the dark at this point and would appreciate any direction that could be given.
Thanks,
NN

I can't actually post the PDF (confidentiality agreement prevents it), but I can give you some info:
Document 1: (where encoding is correct)
Created by easyPDF SDK and uses SimplifiedArabic font family for the problem characters and their surrounding text.
Using the Acrobat TextSelection tool to copy/paste the problem text from this document into Notepad results in text that looks right.
Document 2: (where encoding is reverse of displayed character)
It was generated by InDesign CS3 and is using the WinSoft Pro font family for the problem characters and their surrounding text.
Using the Acrobat TextSelection tool to copy/paste the problem text from this document into Notepad results in text with brackets reversed.
What should I be looking for to be missing/wrong from the font definition or content stream?
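Until the cause in the font or content stream is found, the reversal can at least be detected on the extracted strings. Unicode marks characters such as ( ) { } < > with the Bidi_Mirrored property, which is why the same code point can appear as either glyph depending on direction. The Python sketch below is only a heuristic under that assumption: MIRROR_PAIRS is a small hand-written table (not a complete list of mirrored characters), and the function names are hypothetical, not part of PDFL.

```python
import unicodedata

# Hand-written subset of mirrored pairs; extend as needed.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def has_rtl(text):
    """True if the text contains any right-to-left (R or AL) character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def swap_mirrored(text):
    """Replace each known mirrored character with its counterpart."""
    return "".join(MIRROR_PAIRS.get(ch, ch) for ch in text)


def differs_only_by_mirroring(a, b):
    """True when a and b are different strings of equal length whose only
    differences are mirrored characters swapped for their counterparts."""
    if a == b or len(a) != len(b):
        return False
    return all(
        x == y or (unicodedata.mirrored(x) and MIRROR_PAIRS.get(x) == y)
        for x, y in zip(a, b)
    )
```

When comparing two documents, a pair of runs for which has_rtl is true and differs_only_by_mirroring is true could be normalized with swap_mirrored before the comparison. Whether that is the right compensation depends on which of the two files is actually encoded wrongly, which this check cannot tell you.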

Text Flow from service order to requisition

Hi Everyone,
I am having a problem regarding text flow from service orders (IW31/IW32) to the requisition (ME53N).
I created/updated a service order in IW31/IW32 and entered some text in the "Operation Short Text" column. When I release the service order, the short text should ideally be reflected in the purchase requisition. What is happening right now is this: if the length of the text entered in IW32 is more than 40 characters, the text is reflected in requisition ME53N; otherwise, updates to this column are not reflected in ME53N.
I debugged ME53N and found that it gets the text using READ_TEXT, so I assume the text is stored as a standard text when IW31/IW32 is saved. I also checked the text that is created after saving IW32, and it contains the changed data only if the data is longer than 40 characters.
To me it looks like standard SAP behaviour that the text flows from service order to purchase requisition only if it is longer than 40 characters. If anyone has any idea, please share. Is there any suggestion for correcting this, or any explanation or OSS note?
<REMOVED BY MODERATOR>
Edited by: Alvaro Tejada Galindo on Aug 13, 2008 3:30 PM

Hi,
my situation is:
- a WM managed warehouse, society A;
- a HU managed warehouse (without WM), society B;
- a purchasing process of HU from society A towards society B.
Society B has a scheduling agreement; when a delivery schedule appears, a sales order and a delivery are created in society A. After the goods issue is posted for the delivery, an IDoc transfers the information for inbound delivery creation.
This process is ok without WM, but with a WM managed warehouse the idoc has the following problem:
"V51VP - item was not found - process cancelled".
Can you help me to transfer these HUs?

Long Text problem in Process order header

Hi All,
I am using SAVE_TEXT FM to update the header long text in process order.
Also, I am updating the field AUFK-LTEXT = 'E'.
But when I display the order and click on long text, it does not display anything, as the text is not saved.
When I update the text directly in the order using COR2, it gets saved.
Does anyone know why the text is not being saved through FM SAVE_TEXT?
Also tried COMMIT WORK but was not successful.
The parameters I am passing to the FM are
TDOBJECT = 'AUFK'
TDID = 'KOPF'
TDSPRAS = SY-LANGU
TDNAME = sy-mandt+order number with leading zeros.
and the text lines in internal table.
Am I missing anything else here?
Thanks,
Sandeep

Hi Sandeep,
First check table STXH for the order which you saved manually, in order to verify that the values you are passing to the FM SAVE_TEXT are correct.
Also check the documentation which is supplied with this function to determine the INSERT and SAVEMODE_DIRECT values.
Also maybe check function COMMIT_TEXT and its documentation.
Regards,
Robert
PS. Also test the scenario in which text should be added to already existing text. The SAVE_TEXT function wipes out everything, so you should first read the existing text (READ_TEXT) and then save it together with the new text using SAVE_TEXT. (Check function group STXD for possibly related functions to use.)
PPS. Thinking about my comments under PS., I recall now that this was the symptom of the long text passed on through BAPI_SALESORDER_CHANGE and therefore maybe this is not the case for SAVE_TEXT.
Edited by: RJ. Schamhart on Feb 3, 2011 4:53 PM

Extracting text to Unicode (Korean, Japanese, ...)

Hi,
I am using the PDFWordFinder to extract text from PDFs in Unicode.
This works fine for a lot of documents, even with Japanese, Korean, Chinese ones.
But, I have some documents, using Korean fonts, which do not seem to be 'compatible' with the PDFWordFinder API.
The returned char codes are in the Unicode surrogates range (i.e. the first value is 0xDBC0 and the next one 0xD801, for example).
It seems that the font has an internal /ToUnicode table (I have seen this resource using a COS viewer).
I thought that the PDFWordFinder was able to read and process internal /ToUnicode tables in order to return the corresponding Unicode chars. Am I wrong ?
If the PDFWordFinder is able to do the job, what option am I missing if it does not work ?
Thanks for your help.
Pierre

When I copy / paste text into Word I get squares...not the characters that are displayed in the PDF itself.
If I do the same with docs for which text extraction using PDFWordFinder is working, copy / paste is OK.
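One thing that can be checked independently of the SDK is whether the returned values form valid UTF-16 surrogate pairs. A valid pair is a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF); going by the two values quoted above, both fall in the high range, so as quoted they would not decode to any character. The Python sketch below assumes you have the returned codes as a list of 16-bit integers; classify_unit and decode_units are hypothetical helper names.

```python
def classify_unit(u):
    """Classify one 16-bit code unit as 'high', 'low' or 'bmp'."""
    if 0xD800 <= u <= 0xDBFF:
        return "high"
    if 0xDC00 <= u <= 0xDFFF:
        return "low"
    return "bmp"


def decode_units(units):
    """Decode UTF-16 code units. Returns (text, unpaired), where unpaired
    lists the indexes of surrogates that are not part of a valid pair;
    each of those is replaced by U+FFFD in the text."""
    out, unpaired = [], []
    i = 0
    while i < len(units):
        u = units[i]
        kind = classify_unit(u)
        if (
            kind == "high"
            and i + 1 < len(units)
            and classify_unit(units[i + 1]) == "low"
        ):
            # Combine the pair into one supplementary-plane code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
            continue
        if kind == "bmp":
            out.append(chr(u))
        else:
            unpaired.append(i)
            out.append("\ufffd")
        i += 1
    return "".join(out), unpaired
```

If the list of unpaired indexes is not empty for the problem documents, the values are not real text, which would be consistent with the squares seen on copy/paste. That points at the mapping data in the file rather than at a missing option, though this check alone cannot confirm it.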

How do I extract text from an email?

Hello!
I am in the process of trying to automate orders from my website. How do I extract text from an email and paste it into specific cells in an Excel spreadsheet using Automator?
Many thanks,
Toby Bateson

If you select the message on the Inbox list, or open the message, you can then go to the Message menu of Mail and select Remove Attachments.
Bob N.
Mac Mini 1.5 GHz; iBook 900 mHz; iPod 20 GB Mac OS X (10.4.7)

Text element of sales order form

I have a question about the text element of sales order printing form.
I run a sales order with a price condition (ZKFA) in VA03 and want to print it out.
Before I add the ZKFA price, it prints correctly; after I enter the ZKFA price, all the item details are missing.
I've checked the SAPscript form and see that there are many /E (text element) entries controlling the output, and I think that is the key.
But I don't know how to debug it step by step in SE71 and watch the foreground process to find where the problem is.
I don't know which system program controls the form's text elements.

I've checked the system program RVADOR01 and know where the key point is.
I still don't know why, if the item has the condition ZKFA (air freight), that item's details can't be shown.
If I add the condition ZDI2 (discount), it shows fine in the form.
I don't know why the code can't get into ELEMENT = 'ITEM_LINE_PRICE_QUANTITY' if the item has the surcharge ZKFA.
I see this code in RVADOR01:
FORM ITEM_PRICE_PRINT.
LOOP AT TKOMVD.
    KOMVD = TKOMVD.
    IF SY-TABIX = 1 AND
     ( KOMVD-KOAID = CHARB OR
       KOMVD-KSCHL = SPACE ).
      CALL FUNCTION 'WRITE_FORM'
           EXPORTING
                ELEMENT = 'ITEM_LINE_PRICE_QUANTITY'.
    ELSE.
      IF KOMVD-KNTYP NE 'f'.
        CALL FUNCTION 'WRITE_FORM'
             EXPORTING
                  ELEMENT = 'ITEM_LINE_PRICE_TEXT'.
      ELSE.
        CALL FUNCTION 'WRITE_FORM'
             EXPORTING
                  ELEMENT = 'ITEM_LINE_REBATE_IN_KIND'.
      ENDIF.
    ENDIF.
ENDLOOP.
ENDFORM.

Short text from a sales order not pulling through to a STO

Hi,
Does anyone have any idea why the short text from a sales order would pull correctly into a purchase requisition, but when this PR is converted to a PO then the short text in the PO reverts to that in the material master?
Thanks in advance.
jj

Hi,
In ECC 6.0 EHP 4 it gets copied from PR to PO, but not in lower versions.
In lower versions you can achieve the same using a PO BAdI or through enhancement spots.
Thanks & Regards,

Text auto-correction is turning on again!

Hi all!
My iPhone4 is running iOS 5.0.1 and I'm very happy!
But recently it keeps turning text auto-correction on. I turn it off, but some time later I find it turned on again.
I can't understand what causes it to do this. How can I turn this feature absolutely off?

What troubleshooting have you tried over these few days? You charging the phone? Try a reset. Hold the sleep/wake and home buttons together until you see the Apple logo and then release. The phone should reboot. This can take up to 60 seconds.
