Extracting text to Unicode (Korean, Japanese, ...)

Hi,
I am using the PDWordFinder to extract text from PDFs in Unicode.
This works fine for a lot of documents, even Japanese, Korean, and Chinese ones.
But I have some documents, using Korean fonts, which do not seem to be 'compatible' with the PDWordFinder API.
The returned character codes are in the Unicode surrogate range (e.g. the first value is 0xDBC0 and the next one is 0xD801).
It seems that the font has an internal /ToUnicode table (I have seen this resource using a COS viewer).
I thought that the PDWordFinder was able to read and process internal /ToUnicode tables in order to return the corresponding Unicode characters. Am I wrong?
If the PDWordFinder is able to do the job, what option am I missing if it does not work?
Thanks for your help.
Pierre

When I copy / paste text into Word I get squares, not the characters that are displayed in the PDF itself.
If I do the same with documents for which text extraction using PDWordFinder works, copy / paste is OK.
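
For reference, the two values quoted above can be checked against the UTF-16 surrogate ranges. The following is a minimal Java sketch, not a statement about how PDWordFinder decodes text internally; the class and method names are made up for illustration and only `java.lang.Character` is used. It shows that 0xDBC0 is a high surrogate from the private-use range (0xDB80 to 0xDBFF) and that 0xD801 is also a high surrogate, so the two units do not form a valid pair.

```java
final class SurrogateCheck {
    /** Describes one UTF-16 code unit as returned by a text extractor. */
    static String classify(char unit) {
        if (unit >= 0xDB80 && unit <= 0xDBFF) {
            return "high surrogate (private-use planes)";
        }
        if (Character.isHighSurrogate(unit)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(unit)) {
            return "low surrogate";
        }
        return "BMP character";
    }

    /** Returns the code point for a valid pair, or -1 if the two units do not pair up. */
    static int combine(char first, char second) {
        return Character.isSurrogatePair(first, second)
                ? Character.toCodePoint(first, second)
                : -1;
    }

    public static void main(String[] args) {
        char first = 0xDBC0;   // first value from the post
        char second = 0xD801;  // second value from the post
        System.out.println(classify(first));
        System.out.println(classify(second));
        System.out.println(combine(first, second)); // -1: a high surrogate followed by another high surrogate
    }
}
```

Values like these suggest the extractor is handing back codes that were never meant as displayable Unicode text, which would be consistent with the squares seen after copy / paste.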

Similar Messages

  • Data extraction from Non-Unicode ECC6 to Unicode SAP BI system

    Hi,
    We have an existing non-Unicode ECC6 system. Currently we are installing a Unicode SAP BI system. Can anyone tell me whether there are any issues in data extraction from a non-Unicode ECC6 system to a Unicode SAP BI system?
    Please also note that our data contains Asian (Korean, Japanese & Chinese) fonts.
    Regards,
    Anirban

    Hi Des Gallagher,
    Thank you for your reply.
    I have gone through the notes you suggested, but they describe issues related to BW 3.x versions. We are currently on SAP_BW 700 - SP16. Among other notes, I also found note 510882, which might be helpful for custom developments.
    But I am still wondering whether we are going to face major issues with data extraction from a non-Unicode ECC6 system to a Unicode SAP_BW 700 system.
    In case you have any further details, please let me know.
    Thanks in advance.
    Regards,
    Anirban Kundu

  • Reversed brackets in Arabic extracted text

    I'm working on a system that is reasonably good at extracting text from two different PDF documents and comparing them.  It's built using PDFL (I'm hoping the community for Acrobat SDK will be willing to help me out since I can't find a forum for PDFL.)
    I run into a problem when working with Arabic text.  The issue is reversible symbols like brackets ( ( ), { }, etc) and some other things (like < >) are visibly identical in the two documents, but are encoded as their opposites. 
    i.e.
    Document 1 - Text looks like (ABCDE) and is encoded with the unicode values for (ABCDE)
    Document 2 - Text looks like (ABCDE) and is encoded with the unicode values for )ABCDE(
    I figure this has something to do with right-to-left read order and mixed font detection or perhaps some other font-setting.
    I need a way of detecting when this reversal happens so I can compensate for it when extracting the text.  I'm stumbling in the dark at this point and would appreciate any direction that could be given.
    Thanks,
    NN

    I can't actually post the PDF (confidentiality agreement prevents it), but I can give you some info:
    Document 1: (where encoding is correct)
    Created by easyPDF SDK and uses SimplifiedArabic font family for the problem characters and their surrounding text.
    Using the Acrobat TextSelection tool to copy/paste the problem text from this document into Notepad results in text that looks right.
    Document 2: (where encoding is reverse of displayed character)
    It was generated by InDesign CS3 and is using the WinSoft Pro font family for the problem characters and their surrounding text.
    Using the Acrobat TextSelection tool to copy/paste the problem text from this document into Notepad results in text with brackets reversed.
    What should I be looking for to be missing/wrong from the font definition or content stream?
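
One way to compensate while comparing, independent of what is wrong in the font or content stream, is to treat two strings as equal when they differ only by mirrored brackets. The following is a minimal Java sketch under that assumption; the class and method names are hypothetical, and the bracket list covers only the ASCII pairs mentioned in the post. The JDK method `Character.isMirrored` can be used to find further candidates.

```java
import java.util.HashMap;
import java.util.Map;

final class MirrorSketch {
    // Each bracket maps to its counterpart, in both directions.
    private static final Map<Character, Character> PAIRS = new HashMap<>();
    static {
        String pairs = "()[]{}<>";
        for (int i = 0; i < pairs.length(); i += 2) {
            PAIRS.put(pairs.charAt(i), pairs.charAt(i + 1));
            PAIRS.put(pairs.charAt(i + 1), pairs.charAt(i));
        }
    }

    /** Swaps each paired bracket for its counterpart; other characters are unchanged. */
    static String swapBrackets(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = PAIRS.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    /** True if the strings are equal, or equal once every bracket in b is mirrored. */
    static boolean equalUpToMirroring(String a, String b) {
        return a.equals(b) || a.equals(swapBrackets(b));
    }

    public static void main(String[] args) {
        System.out.println(swapBrackets(")ABCDE("));                    // (ABCDE)
        System.out.println(equalUpToMirroring("(ABCDE)", ")ABCDE("));   // true
    }
}
```

This only handles the case where every bracket in a run is reversed; a run with a mix of reversed and correct brackets would need a per-character comparison.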

  • Non-Roman Characters (Korean/Japanese/Chinese) appearing as boxes

    Hey,
    I've had this problem before, but it has seemed to fix itself in the past. But ever since the release of iTunes 9, I've been having a constant problem with any Kanji, Kana, or Hangul appearing in my library for the program. Instead of appearing as the foreign song title, the names only appear as squares or little boxes.
    When I load the songs to my iPod, the problem does not seem to persist there. After a lot of research, I feel that this is a problem having to do with Unicode. I know for a fact that my computer is capable of displaying Korean and Japanese, since it works in other programs such as Office 2007, and even Windows Media Player.
    On another note, right-clicking these problematic songs while browsing my library brings up "Play 'suchandsuchsong'" in the correct characters. These songs show up normally in Windows Explorer. However, right-clicking one in Windows Explorer and selecting Properties gives me the little boxes as well.
    I hope I've given enough information to explain my problem, and I will be very grateful for a solution to this problem which many people seem to have.
    Edit: Additionally, solutions like changing the ID3 tags have not helped, but this might be because they were rather ambiguous in their directions (such as the solution given by iTunes in the iTunes Help).
    Message was edited by: svarogexodeus

    I didn't have the problem of boxes, but I had a similar problem of the Korean characters showing as symbols. To fix it, I (1) right-clicked on the song, (2) chose Convert ID3 Tags, (3) checked the "Reverse Unicode" box, then hit "OK". You can highlight as many songs as you want at once and do the same thing, to save time.
    Hope this helps you.

  • Unicode to Japanese characters

    Hi,
    I need a function to convert unicode to Japanese characters. I have a unicode string in my syncBO and it needs to be converted to the "strange" Japanese characters.
    When I read the unicode String from the MAMText files it is automatically done by the MI framework.
    Unicode string example: \u6A5F\u80FD\u5834\u6240\u4E00\u89A7
    Are there standard MI functions that convert this kind of String? For the MAMText files it works via the getString method of the com.sap.ip.me.api.services.MEResourceBundle class.
    I hope someone knows more about this conversion.
    Thanks in advance! Kind regards,
    Bart Elshout

    Hi Bart,
    It would be nice if you could give me feedback on any errors you run into.
    I know the code might not cover all of the requirements; you might need to add other escaped characters in the switch block. I forgot to tell you that to use the a2n function you should provide it with the string in its ISO format.
    Try running the MyTest program below with the following argument:
    MyTest "\u6A5F\u80FD \u5834 \u6240\u4E00\u89A7"
    public class MyTest {
      public static void main(String[] args) {
        System.out.println(a2n(args[0]));
      }
    }
    If you use it the following way, it will not work as intended, since the Java compiler translates the \uXXXX escapes in the literal into the native characters before a2n ever sees them. There is no need for the a2n function in that case.
      String s = "\u6A5F\u80FD \u5834 \u6240\u4E00\u89A7";
      System.out.println(a2n(s));
    If you are reading text from an ASCII file, you need to specify that the input stream should be ISO-8859-1, i.e. something like
    new InputStreamReader(inStream, "ISO-8859-1")
    Hope this helps.
    jo
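
The a2n function itself is not shown in this thread. As an illustration only, here is a minimal Java sketch of what such a conversion could look like: it replaces each backslash-u escape followed by four hex digits with the character it names and leaves everything else untouched. The class and method names are hypothetical, and this is not the MI framework's own implementation.

```java
final class EscapeDecoder {
    /** Decodes backslash-u escapes (four hex digits each); malformed escapes are copied as-is. */
    static String decodeUnicodeEscapes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 6 <= input.length()
                    && input.charAt(i + 1) == 'u' && isHex(input, i + 2)) {
                out.append((char) Integer.parseInt(input.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /** True if the four characters starting at 'start' are all hexadecimal digits. */
    private static boolean isHex(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the compiler from translating the escapes itself.
        String escaped = "\\u6A5F\\u80FD\\u5834\\u6240\\u4E00\\u89A7";
        System.out.println(decodeUnicodeEscapes(escaped));
    }
}
```

Whether the decoded characters display correctly still depends on the console or UI font and encoding.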

  • Space issues while extracting text

    Hi,
    I am using PDTextSelectEnumText to extract text containing both Japanese and English. My sample document has two scenarios:
    a) a space between the English and Japanese text; b) no space between the English and Japanese text.
    My issue is with a space that gets added to the end of the English text, leading to a string mismatch between the extracted text and the original PDF text (in the second scenario).
    What changes do I need to make so that a space is read only when it is actually present between the text runs?
    Or is there any other function for text extraction?
    Please help.
    Thanks,
    Sind

    Acrobat 7 hasn't been supported for at least 3 years now. You will need to move to Acrobat 9 and the 9 SDK (or later) to obtain support.

  • How to extract text from a PDF file?

    Hello Suners,
    I need to know how to extract text from a PDF file.
    Does anyone know what character encoding a PDF file uses? When I read the file through an input stream, I get what look like encrypted characters, not the original text in the file.
    Is there any procedure I should follow when reading a PDF file?
    File f = new File("D:/File.pdf");
    FileReader fr = new FileReader(f);
    BufferedReader br = new BufferedReader(fr);
    String s = br.readLine();
    Any help will be deeply appreciated.

    jverd wrote:
    > First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic Java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading.
    Oops, you are absolutely right; that was a silly mistake.
    > Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.
    getPageContent returns an array of bytes, so the question becomes: how do I get text from this array? I was thinking of:
        private void jButton1_actionPerformed(ActionEvent e) {
            PdfReader read;
            StringBuffer buff = new StringBuffer();
            try {
                read = new PdfReader("d:/getjobid2727.pdf");
                read.getMetaData();
                byte[] data = read.getPageContent(1);
                int i = 0;
                while (i > -1) {
                    buff.append(data);
                    i++;
                }
                String str = buff.toString();
                FileOutputStream fos = new FileOutputStream("D:/test.txt");
                Writer out = new OutputStreamWriter(fos, "UTF8");
                out.write(str);
                out.close();
                read.close();
            } catch (Exception f) {
                f.printStackTrace();
            }
        }
    "D:/test.txt" hasn't been created when I ran the program.
    Are my steps right?
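
Two details in the snippet above are worth illustrating. `StringBuffer` has no `append(byte[])` overload, so the call resolves to `append(Object)` and appends the array's identity string rather than its contents; and a loop on `i > -1` with `i++` keeps appending for as long as memory lasts, so the code that writes the file is never reached. The following is a minimal Java sketch of decoding a byte array once with an explicit charset. The class name and the sample bytes are made up, and the decoded result of a real page would be raw PDF content-stream operators, not the page's plain text.

```java
import java.nio.charset.StandardCharsets;

final class PageBytesSketch {
    /** Decodes raw bytes once; ISO-8859-1 maps every byte to exactly one char. */
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the bytes returned by getPageContent.
        byte[] data = "BT (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);

        StringBuffer wrong = new StringBuffer();
        wrong.append(data);               // append(Object): yields something like "[B@1b6d3586"
        System.out.println(wrong);

        System.out.println(decode(data)); // BT (Hello) Tj ET
    }
}
```

Turning content-stream operators into readable text needs a PDF-aware text extractor, which is why a dedicated library is the usual route.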

  • How to extract text from a PDF file using php?

    How to extract text from a PDF file using php?
    thanks
    fabio

    > Do you know of any other way this can be done?
    There are many ways, but this is out of scope for this forum. You can try this forum: http://forum.planetpdf.com/

  • How to read/extract text from pdf

    Respected All,
    I want to read/extract text from a PDF. I tried using Etymon but did not succeed.
    Could anyone guide me on this?
    Thanks and regards,
    Ajay.

    Thank you very much Abhilshit, PDFBox works for reading pdf.
    Regards,
    Ajay.

  • Infoobject change for extracting texts data.

    Hi BW guys,
    Here is my requirement.
    I have one InfoObject 'salesmen', which is already used in some other ODSs and cubes.
    Now I want to extract text data for the object 'salesmen'; for that I will need to change my InfoObject (the change is: adding the credit control area object under compounding).
    But when I activate the InfoObject again, it gives errors.
    Error messages:
    1) InfoObject XXXXX (or ref.) is used in data targets with data -> Error:
    2) Characteristic XXXXX: Compound or reference was changed
    3)InfoObject XXXXX being used in InfoCube XXXX (contains data)
    etc....
    But i don't want to delete the data in any data target.
    Is there any way to solve this problem?
    Thanks in advance......

    Hi,
    If you do not have many cubes and ODSs with this salesman, you can consider another, better, but more time-consuming way.
    1. Create a new IO for your salesman and add the compounding attribute you want.
    2. Load master data for the new IO.
    3. Create copies of your InfoProviders.
    4. In each of them, delete the old salesman IO and insert the new one.
    5. Create export DataSources for the old cubes.
    6. Create update rules for the new data targets based on the old ones.
    7. In the URs, map your new IO to the old one. All other IOs should be mapped 1:1 (new<-old).
    8. Reload the data targets.
    That's all.
    The way I proposed earlier is less preferable, because you would have to change the data already loaded into the data targets anyway. In that case it is better to change the data model as you want.
    Best regards,
    Eugene

  • Problem to extract text from HTML document

    I have to extract some text from HTML files (about 1000 files) into my database.
    The HTML files come from the ACM Digital Library. http://portal.acm.org/dl.cfm
    Each HTML page holds the information about one paper. I only want to get the text of "Title", "Abstract", "Classification", and "Keywords".
    The problem is that I can't find any pattern for parsing the HTML files.
    Example: I need to get Classification = "Theory of Computation", "ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY", "Numerical Algorithms and Problems", "Mathematics of Computing", "NUMERICAL ANALYSIS", etc.
    The section of code for "Classification" is below.
    Please give me any idea of how to do this, or how to find a pattern for extracting the text.
    <div class="indterms"><a href="#CIT"><img name="top" src=
    "img/arrowu.gif" hspace="10" border="0" /></a><span class=
    "heading"><a name="IndexTerms">INDEX TERMS</a></span>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Primary Classification:</a></span><br />
    &nbsp; <b>F.</b> <a href=
    "results.cfm?query=CCS%3AF%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory of Computation</a><br />
    &nbsp; <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>F.2</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">ANALYSIS OF ALGORITHMS AND PROBLEM
    COMPLEXITY</a><br />
    &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>F.2.1</b> <a href=
    "results.cfm?query=CCS%3A%22F%2E2%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Numerical Algorithms and Problems</a><br />
    </p>
    <p class="Categories"><span class="heading"><a name=
    "GenTerms">Additional&nbsp;Classification:</a></span><br />
    &nbsp; <b>G.</b> <a href=
    "results.cfm?query=CCS%3AG%2E%2A&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Mathematics of Computing</a><br />
    &nbsp; <img src="img/tree.gif" border="0" height="20" width=
    "20" /> <b>G.1</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">NUMERICAL ANALYSIS</a><br />
    &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border="0" height=
    "20" width="20" /> <b>G.1.6</b> <a href=
    "results.cfm?query=CCS%3A%22G%2E1%2E6%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Optimization</a><br />
    &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="img/tree.gif" border=
    "0" height="20" width="20" /> <b>Subjects:</b> <a href=
    "results.cfm?query=CCS%3A%22Linear%20programming%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Linear programming</a><br />
    </p>
    <br />
    <p class="GenTerms"><span class="heading"><a name=
    "GenTerms">General Terms:</a></span><br />
    <a href=
    "results.cfm?query=genterm%3A%22Algorithms%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Algorithms</a>, <a href=
    "results.cfm?query=genterm%3A%22Theory%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Theory</a></p>
    <br />
    <p class="keywords"><span class="heading"><a name=
    "Keywords">Keywords:</a></span><br />
    <a href=
    "results.cfm?query=keyword%3A%22Simplex%20method%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">Simplex method</a>, <a href=
    "results.cfm?query=keyword%3A%22complexity%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">complexity</a>, <a href=
    "results.cfm?query=keyword%3A%22perturbation%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">perturbation</a>, <a href=
    "results.cfm?query=keyword%3A%22smoothed%20analysis%22&coll=ACM&dl=ACM&CFID=22820732&CFTOKEN=38147335"
    target="_self">smoothed analysis</a></p>
    </div>

    One approach is to download HTML Parser from SourceForge
    http://htmlparser.sourceforge.net/ and write the rules to match title, abstract, etc.
    Another approach is to write your own parser that extracts only title, abstract, etc.
    1. Tokenize the HTML file, i.e. convert the HTML into tokens (tag and value).
    2. Write a simple parser to extract certain information.
    Find the pattern of the text you want to extract, for instance <p class="abstract">.
    Then write a rule for extracting the abstract, such as
    if (tag is abstract) then extract abstract text
    Apply the same concept to the other tags.
    Attached is the sample parser that was used to extract title and abstract from ACM HTML files. Please modify it to include keywords and other fields.
    Good luck.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    public class ACMHTMLParser {
        private String m_filename;
        private URLLexicalAnalyzer lexical;
        List urls = new ArrayList();

        public ACMHTMLParser(String filename) {
            super();
            m_filename = filename;
        }

        /** Parses only title and abstract. */
        public void parse() throws Exception {
            lexical = new URLLexicalAnalyzer(m_filename);
            String word = lexical.getNextWord();
            boolean isabstract = false;
            while (null != word) {
                if (isTag(word)) {
                    if (isTitle(word)) {
                        System.out.println("TITLE: " + lexical.getNextWord());
                    } else if (isAbstract(word) && !isabstract) {
                        parseAbstract();
                        isabstract = true;
                    }
                }
                word = lexical.getNextWord();
            }
            lexical.close();
        }

        public static void main(String[] args) throws Exception {
            ACMHTMLParser parser = new ACMHTMLParser("./acm_html.html");
            parser.parse();
        }

        public static boolean isTag(String word) {
            return (word.startsWith("<") && word.endsWith(">"));
        }

        public static boolean isTitle(String word) {
            return ("<title>".equals(word));
        }

        // please modify according to the html source
        public static boolean isAbstract(String word) {
            return ("<p class=\"abstract\">".equals(word));
        }

        // Prints the first non-tag token after the abstract tag.
        private void parseAbstract() throws Exception {
            while (true) {
                String abs = lexical.getNextWord();
                if (abs == null) {
                    break; // end of file reached before any abstract text
                }
                if (!isTag(abs)) {
                    System.out.println(abs);
                    break;
                }
            }
        }

        // Splits the input into tokens: either a whole tag or the text between two tags.
        class URLLexicalAnalyzer {
            private BufferedReader m_reader;
            // True when scanValue has already consumed the '<' that opens the next tag.
            private boolean isTag;

            public URLLexicalAnalyzer(String filename) {
                try {
                    m_reader = new BufferedReader(new FileReader(filename));
                } catch (IOException io) {
                    System.out.println("ERROR, file not found " + filename);
                    System.exit(1);
                }
            }

            public URLLexicalAnalyzer(InputStream in) {
                m_reader = new BufferedReader(new InputStreamReader(in));
            }

            public void close() {
                try {
                    if (null != m_reader) m_reader.close();
                } catch (IOException ignored) {}
            }

            // Returns the next tag or text token, or null at end of input.
            public String getNextWord() throws IOException {
                int c = m_reader.read();
                if (-1 == c) return null;
                if (Character.isWhitespace((char) c))
                    return getNextWord();
                if ('<' == c || isTag)
                    return scanTag(c);
                else
                    return scanValue(c);
            }

            private String scanTag(final int c) throws IOException {
                StringBuffer result = new StringBuffer();
                if ('<' != c) result.append('<');
                result.append((char) c);
                int ch = -1;
                while (true) {
                    ch = m_reader.read();
                    if (-1 == ch) throw new IllegalArgumentException("un-terminate tag");
                    if ('>' == ch) {
                        isTag = false;
                        break;
                    }
                    result.append((char) ch);
                }
                result.append((char) ch); // append the closing '>'
                return result.toString();
            }

            private String scanValue(final int c) throws IOException {
                StringBuffer result = new StringBuffer();
                result.append((char) c);
                int ch = -1;
                while (true) {
                    ch = m_reader.read();
                    if (-1 == ch) throw new IllegalArgumentException("un-terminate value");
                    if ('<' == ch) {
                        isTag = true;
                        break;
                    }
                    result.append((char) ch);
                }
                return result.toString();
            }
        }
    }
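
For the "Classification" and "Keywords" sections quoted in the question, the values of interest are the link texts. As an alternative to a hand-written tokenizer, the following minimal Java sketch pulls the text of every `<a href=...>` link out of an HTML fragment with a regular expression. The class and method names are hypothetical, and a regex is only a rough tool here: it assumes markup shaped like the sample, with no `>` inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorTextSketch {
    // Matches <a href=...>text</a>; anchors without href (such as <a name=...>) are skipped.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the visible text of each link, in document order. */
    static List<String> linkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String text = m.group(1)
                    .replaceAll("<[^>]*>", "")  // drop nested tags such as <img ...>
                    .replaceAll("\\s+", " ")    // join text wrapped across lines
                    .trim();
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<b>F.</b> <a href=\n\"results.cfm?query=x\"\n"
                + "target=\"_self\">Theory of Computation</a><br />";
        System.out.println(linkTexts(html)); // [Theory of Computation]
    }
}
```

To separate classification terms from keywords, the same method could be applied to each `<p class="...">` block on its own.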

  • Extract Text from pdf using C#

    Hi,
    We are solution developers using Acrobat. As we have a requirement to extract text from PDFs using C#, we downloaded and installed the Adobe SDK. We found only four examples in C#, and those only cover viewing a PDF in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?
    Thank you for your help.
    Regards
    kiranmai

    Okay so I went ahead and actually added the text extraction functionality to my own C# application, since this was a requested feature by the client anyhow, which originally we were told to bypass if it wasn't "cut and dry", but it wasn't bad so I went ahead and gave the client the text extraction that they wanted. Decided I'd post the source code here for you. This returns the text from the entire document as a string.
    // Needs: using System; using System.Collections.Generic; using System.Reflection; using Acrobat;
    private static string GetText(AcroPDDoc pdDoc)
    {
        AcroPDPage page;
        int pages = pdDoc.GetNumPages();
        string pageText = "";
        for (int i = 0; i < pages; i++)
        {
            page = (AcroPDPage)pdDoc.AcquirePage(i);
            object jso, jsNumWords, jsWord;
            List<string> words = new List<string>();
            try
            {
                jso = pdDoc.GetJSObject();
                if (jso != null)
                {
                    object[] args = new object[] { i };
                    jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                    int numWords = Int32.Parse(jsNumWords.ToString());
                    // Word indices are zero-based, so stop before numWords.
                    for (int j = 0; j < numWords; j++)
                    {
                        object[] argsj = new object[] { i, j, false };
                        jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                        words.Add((string)jsWord);
                    }
                }
                foreach (string word in words)
                {
                    pageText += word;
                }
            }
            catch
            {
                // Errors on a page are swallowed, as in the original post.
            }
        }
        return pageText;
    }

  • Extract text from pdf

    Hi, is it possible to extract text from a PDF file using the command line, to get output like you would get by using the File menu and then "Save as text..."?
    I also noticed that in the installation folder there is a small executable called AcroTextExtractor which sounds interesting, but I was unable to figure out how to use it.

    What's wrong with using Automator for this? It certainly seems the easiest. I'm not aware of any built-in AppleScript commands that will do this, but you should also ask on the AppleScript forum under Mac OS Technologies.
    Message was edited by: V.K.

  • PDWordFinder does not extract text in order

    Hi,
    My word document had few comments.
    I converted the word document to PDF by File->SaveAs->Adobe PDF.
    I did not convert the comments to sticky notes. Hence they appear the same as in word document.
    My application uses PDWordFinder API to extract text from the document.
    I notice that the text in these comments is retrieved only at the very end.
    Why is the text in the comments (not sticky notes) retrieved last and not in the order it appears in the document?
    Is there any option to make the wordfinder retrieve text in the order of appearance?

    I need to extract text in 'reading' order, but it's not very clear how to use PDWordFinderAcquireWordList parameters.
    Can I use different 'reading order' for PDDocCreateWordFinderUCS method, or can I use xySortTable?
    What are the sorting parameters (if any exist) for AcquireWordList or the WordFinder? Thanks

  • How do I extract text from an email?

    Hello!
    I am in the process of trying to automate orders from my website. How do I extract text from an email and paste it into specific cells in an Excel spreadsheet using Automator?
    Many thanks,
    Toby Bateson

    If you select the message on the Inbox list, or open the message, you can then go to the Message menu of Mail and select Remove Attachments.
    Bob N.
    Mac Mini 1.5 GHz; iBook 900 mHz; iPod 20 GB   Mac OS X (10.4.7)  
