Character encoding: ANSI, ASCII, and Mac, oh my!

I'm writing a program which has to search & replace data in user-supplied Rich Text documents (.rtf). Ideally, I would like to read the whole thing into a StringBuffer, so that I can use all of the functionality built into String and StringBuffer, and so that I can easily compare with constant Strings and chars.
The trouble that I have is with character encoding. According to the rtf spec, RTFs can be encoded in four different character encodings: "ansi", "mac", IBM PC code page 437, and IBM PC code page 850, none of which are supported by Java (see http://impulzus.sch.bme.hu/tom/szamitastechnika/file/rtfspec/rtfspec_6.htm#rtfspec_8 for the RTF spec and http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html#charenc for the character encodings supported by Java).
I believe, from a bit of googling, that they are all 8 bits/character, so I could read everything into a byte array and manipulate that directly. However, that would be rather nasty. I would have to be careful with the changes that I make to the document, so that I do not insert values that do not encode correctly in the document's character encoding. Overall, a large hassle.
So my question is - has anyone done something like this before? Any libraries that will make my job easier? Or am I missing something built into Java that will allow me to easily decode and reencode these documents?

DrClap, thanks for the response.
If I could map from the encodings listed above (which are given in the RTF document) to a Java encoding name from the page that you listed, that would solve all my problems. However, there are a couple of problems:
a) According to this page - http://orwell.ru/info/diffs.htm - ANSI is a superset of ISO-8859-1. That page isn't exactly authoritative, but I can't afford to lose data.
b) I'm not sure what to do about the other character encodings. "mac" may correspond to "MacRoman" but that page lists a dozen or so other macintosh encodings. Gotta love crystal-clear MS documentation.
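
One possible way to do the mapping asked about above, as a rough sketch rather than a definitive answer: the RTF header keywords \ansi, \mac, \pc, and \pca can be looked up in a small table of Java charset names. The table below is an assumption, not something the RTF spec or the Java docs state: "windows-1252" for ansi (newer RTF files may carry an \ansicpg keyword naming a different code page), "x-MacRoman" for mac, "IBM437" for pc, and "IBM850" for pca. The class and method names (RtfCharsets, forKeyword, and so on) are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class RtfCharsets {
    // Assumed mapping from RTF header keywords to Java charset names.
    private static final Map<String, String> NAMES;
    static {
        Map<String, String> m = new HashMap<>();
        m.put("ansi", "windows-1252"); // assumption: Western "ANSI" code page
        m.put("mac", "x-MacRoman");    // assumption: classic Mac OS Roman
        m.put("pc", "IBM437");         // IBM PC code page 437
        m.put("pca", "IBM850");        // IBM PC code page 850
        NAMES = Collections.unmodifiableMap(m);
    }

    // Look up the charset for a keyword; fail loudly if this JVM lacks it.
    static Charset forKeyword(String keyword) {
        String name = NAMES.get(keyword);
        if (name == null || !Charset.isSupported(name)) {
            throw new IllegalArgumentException(
                    "No charset available for RTF keyword: " + keyword);
        }
        return Charset.forName(name);
    }

    // Decode the raw document bytes into a String for searching and replacing.
    static String decode(byte[] rtfBytes, String keyword) {
        return new String(rtfBytes, forKeyword(keyword));
    }

    // Encode the edited text back into the document's own encoding.
    static byte[] encode(String text, String keyword) {
        return text.getBytes(forKeyword(keyword));
    }

    // True if every character of the replacement text survives re-encoding.
    static boolean canEncode(String text, String keyword) {
        return forKeyword(keyword).newEncoder().canEncode(text);
    }
}
```

Checking canEncode before inserting replacement text is one way to avoid writing characters that the document's encoding cannot represent, which was the main worry with the byte-array approach.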

Similar Messages

  • Problem Flash Media Live Encoder 3.2 and Mac os X 10.9 Mavericks

    Problem Flash Media Live Encoder 3.2 and Mac os X 10.9 Mavericks: The videos are blocked regularly during the broadcast

    Hello,
    I'm sorry you're running into this crash.  Unfortunately you've posted in the Adobe AIR forums; would you mind reposting over on the Flash Media Live Encoder forums?
    Thanks,
    Chris

  • Character Encoding for JSPs and HTML forms

    After having read loads of postings on character encoding problems I'm still puzzled about the following problem:
    I have an instance (A) of WL 8.1 SP3 on a WinXP machine and another instance (B) of WL 8.1 without any SP on a Win2K machine. The underlying Windows locale is english(US) in both cases.
    The same application deployed as a war file to these instances does not behave in the same way when it comes to displaying non-Latin1-characters like the Euro symbol: Whereas (A) shows and accepts these characters as request-parameters, (B) does not.
    Since the war file is the same (weblogic.xml, jsps and everything), the reason for this must either be the service-pack-level or some other configuration setting I overlooked.
    Any hints are appreciated!

    Try this:
    Preferences -> Content -> Fonts & Color -> Advanced
    At the bottom, choose your Encoding.

  • Seeing � etc despite having View--Character encoding as unicode and auto-detect universal

    On viewing some web pages see characters such as �, ,  (for example). But View-Character Encoding is set at Unicode (UTF-8) or Western (ISO8859-1) and Tools-Options-Content-Fonts-Advanced Encoding set with either of those

    example of page:
    http://scienceofdoom.com/2010/09/17/on-missing-the-point-by-chilingar-et-al-2008/
    - a little over half way down, the section headed "Anthropogenic Imact on the Earth’s Climate – Tiny" from paragraph "And continue: " there are these non-characters in the equation (12) and subsequently.
    Another page : http://www.zimbabwesituation.com/sep26_2010.html in the topic " Red warning lights" .
    Most web-pages I read are without problem.
    I contacted the writer of the first page and s/he had no idea why it happens.

  • How do I set the character encoding for every page, always?

    I use Thai (Windows-874) to open pages. When I select a website that contains Thai and then open it in a new tab, the encoding changes to Western (Windows-1252) and the Thai text cannot be displayed. I have to set the character encoding to Thai (Windows-874) every time.

    Try this:
    Preferences -> Content -> Fonts & Color -> Advanced
    At the bottom, choose your Encoding.

  • Character encoding in Drafts and Templates won't "stick"

    I've been using T'bird for many years, am using now Version 30.0 with Windows 8. Over the years I have wrestled extensively with character encodings, because I do a lot of messages in mixed English and Cyrillic. I have things set now so that my default encoding is Cyrillic (Windows 1251) and this works pretty well, EXCEPT!! that when I save a message in my Drafts folder (or any of its subfolders), or my Templates folder, two things happen:
    1. the Cyrillic goes to garbage (I can sometimes recover this by physically moving the message to the Inbox)
    2. I get a series of  and/or à characters, with and without spaces between.
    This is a major pain in the butt! If I catch it the first time around, and move the offending message to the Inbox and "Edit as new", I can usually rescue the cyrillic - but if I (for example) make some changes elsewhere in the message in English portions and save it again (without noticing the mess-up of the cyrillic), it's all lost and I've found no way to recover it.
    I've been reading this forum and I suspect that my problem lies in the "folder properties", and/or maybe that I have not set up a User-defined set of display properties. However, I'm afraid to just mess around with things (as I've done so much in the past) for fear of messing up messages that I've been keeping for a long time in the OTHER folders.
    Help please? I use Windows 7.

    Didn't understand from what starting point you want me to "select the mail account name", but closed and re-opened T'bird.
    Then I started to run your tests.
    To my HUGE AMAZEMENT!!! - everything seems to be working today!!
    This leaves me with a couple of old files that seem to have gotten trashed a long time ago and won't simply "recover" - but I did keep back-up copies of them elsewhere, and can now go about rebuilding them if necessary.
    Thank you for your help - whatever it is that I did, seems to have worked. After how many years?! If I have trouble in the future, I'll be back - but for right now, what a RELIEF!
    Best, Martha

  • Character encoding with CF and MySQL

    Okay, I thought this should be rather straightforward, but apparently not. I have set up my site to use UTF-8: my cfm pages, the MySQL table, even Dreamweaver. The problem is that when I input international characters via a form they get written correctly to the MySQL table; however, when I retrieve them in a query and display them on the page, they are displayed incorrectly.
    On my input.cfm page I'll enter the string "Téstïñg" in the textbox and submit it. If I look at the record via the MySQL Browser it appears as it should. However, when I display it on my output.cfm page it shows the record as "T�st��g" and will do so until I change the meta tag to use charset=ISO-8859-1. Am I missing something, or is this how it is supposed to work?
    My input.cfm page is set up with both the
    <cfprocessingdirective suppresswhitespace="YES" pageencoding="UTF-8">
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    tags and a regular input form field that writes to the MySQL database.
    The MySQL table is configured to use the utf8 char set and utf8_unicode_ci collation.
    And just to be safe I included
    useUnicode=true&characterEncoding=utf8&characterSetResults=utf8
    in the connection string on the CF Admin datasource setup page.
    I'm running CF 6.1, MySQL 4.1, and the latest version of Apache Server on a Win2K3 box. I was running the 3.0.16 MySQL JDBC driver, but I upgraded it to 5.0.6 this morning thinking that might fix my issue.

    I'm still unsure why this works, but I've found a solution. I switched all my pages over to character set ISO-8859-1, with the exception of my database table, and it works. I get all the normal range characters along with the extended Unicode characters to write to the database and output correctly. Unicode characters are actually written to the table as their HTML coded character.
    If someone feels the need to enlighten me as to why this works, please feel free; I'm always willing to learn.

  • ANSI ASCII

    Hi!
    I have problem to read and write text data, which are encoded in ANSI ASCII. Can you help me with this please?
    I have tried this:
    char c = new FileReader(file).read();
    then
    char c = (char)new FileInputStream(file).read();
    and also
    char c = (char)new RandomAccessFile(file, "r").read();
    ...but nothing worked as I need it to work.
    What's wrong? Well, as far as the characters without diacritics are concerned, everything goes well. But if I want to read characters like ľ, š, č, ť, ... I get the problem.
    Can you help me please how to read e.g. the character 'š', which has in ANSI ASCII index #BE (win - ALT 0154)?
    Many thanks
    Miso

    "I'm talking about ANSI ASCII and it contains 255 characters and chars like ľ, š, č, ť, ... are included there!"
    Nope, ANSI ASCII contains 128 characters.
    http://www.bbsing.com/bbsansi/bbsansi.html
    This encoding is known as Windows 1252 or CP1252. It is used in Windows in many occidental countries. It contains ANSI ASCII as a subset, but also contains many other characters. Microsoft has a better reference card that shows not only the characters but also what they are in Unicode: http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx
    http://www.handheld-basic.com/documentation/text/page_599.html
    This one I have never seen before. It seems to be very close to Windows 1252 but there are a few differences -- for example, Windows 1252 doesn't have a mapping for ♡ (U+2661 WHITE HEART SUIT).
    Note that neither of these two character encodings has mappings for ľ, č, or ť, so they can't be what you are looking for.
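
For what it's worth, one encoding that does contain ľ, š, č, and ť is windows-1250 (the Central European Windows code page), where ALT 0154, i.e. byte 0x9A, is š. Whether the poster's file really uses it is only a guess. The sketch below shows how a file could be read with an explicit charset instead of the platform default; the class name Cp1250Reader and its methods are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class Cp1250Reader {
    // Assumption: the file is in the Central European Windows code page.
    static final Charset CP1250 = Charset.forName("windows-1250");

    // Read a whole file, decoding its bytes with the explicit charset
    // rather than whatever the platform default happens to be.
    static String readAll(Path file) throws IOException {
        return new String(Files.readAllBytes(file), CP1250);
    }

    // Decode raw bytes directly; useful for checking single byte values.
    static String decode(byte[] bytes) {
        return new String(bytes, CP1250);
    }
}
```

Casting a raw byte to char, as in the FileInputStream and RandomAccessFile attempts above, treats the byte as ISO-8859-1, which is why only the unaccented characters came out right.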

  • Why differing Character Encoding and how to fix it?

    I have PRS-950 and PRS-350 readers, both since 2011.  
    In the last year, I've been getting books with Character Encoding that is not easy to read.  In playing around with my browsers and View -> Encoding menus, I have figured out that it has something to do with the character encoding within the epub files.
    I buy books from several ebook stores and I borrow from the library.
    The problem may be the entire book, but it is usually restricted to a few chapters, with rare occasion where the encoding changes within a chapter.  Usually it is for a whole chapter, not part, and it can be seen in chapters not consecutive to each other.
    It occurs whether the book is downloaded directly to my 950 reader or if I load it to either reader from my computer(s), which are all Mac OS X of several versions from 10.4 to Mountain Lion.  Since it happens when the book is downloaded directly, I figure the operating system of my computer is not relevant.
    There are several publishers involved, though Baen (no DRM ebooks) has not so far been one of them.
    If I look at the books with viewers on the computer, the encoding is the same.  I've read them in Calibre, in the Sony Reader App, and in Adobe Digital Editions 2.0.  It's always the same.
    I believe the encoding is inherent to the files.  I would like to fix this if I can to make the books I've purchased, many of them in paper and electronically, more enjoyable to read on my readers.
    Example: I’ve is printed instead of I've.
    ’ for apostrophe
    “ the opening of a quotation,
    â€?  for closing the quotation,
    and I think — is for a hyphen.
    When a sentence had “’m  for " 'm at the beginning of a speech (when the character was slurring his words) it took me a while to figure out how it was supposed to read.
    “’Sides, â€™tis only for a moon.  That ain’t long.â€?
    was in one recent book.
    Translation: " 'Sides, 'tis only for a moon. That ain't long."
    See what I mean? 
    Any ideas?

    Hi
    I wonder if it's possible to download a free ebook with such an issue, in order to make some "tests".
    Perhaps it's possible, on free ebooks (without DRM), to add fonts by using software like Sigil.
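
The garbling described above looks like UTF-8 text that was decoded as windows-1252 somewhere along the way: the three bytes of a curly apostrophe (E2 80 99) show up as three separate characters. Assuming that is what happened, a minimal sketch of undoing it is to turn the garbled characters back into windows-1252 bytes and decode those bytes as UTF-8. The class name MojibakeRepair is made up for illustration, and the strings use Unicode escapes so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Reverse "UTF-8 bytes read as windows-1252": re-encode the garbled
    // characters to their original bytes, then decode those bytes as UTF-8.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(CP1252);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "I" + a-circumflex + euro sign + trademark sign + "ve",
        // which is how the curly apostrophe in "I've" shows up.
        String garbled = "I\u00e2\u20ac\u2122ve";
        System.out.println(repair(garbled).equals("I\u2019ve"));
    }
}
```

Sequences where a byte has already been lost (the closing quote that shows as two characters followed by "?") cannot be restored this way, and text that was never garbled must not be passed through repair, so in an epub this would have to be applied only to the affected chapters.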

  • FCC with ASCII and UTF-8 encoding issue

    Hi,
    I have File to IDoc scenario and I am doing FCC which has Japanese chars. (PO with HEADER,1,ITEMS,*)
    I have specified UTF-8 encoding in the file adapter to process the file.
    Earlier, my source file was in ASCII format which had junk chars; my file was picked and Idoc posted had junk chars.
    Then I used UTF-8 encoding for my source file to correct this issue. XI showed proper Japanese chars but this time Header part is missing.
    Do I have to specify encoding in "module" for File adapter?
    Regards,

    Thanks for your replies Chirag/Gabriel,
    ISO encoding didn't work.
    My source file will be in UTF-8 format.
    There is one correction. It is ANSI encoding, not ASCII as in the subject.
    I still have this issue when my document offset is 0.
    I tried to play around with FCC and found this odd thing.
    When first line of my input file is blank....and I omit reading the first line with offset 1, then file is read in its entirety.
    Again, when I remove this blank line and the file starts with Header and with offset 0 in File adapter, then again my Header part is missing.
    What to do?
    Regards,
    AV

  • When I load certain websites the writing is all squashed up. I correct this by changing the character encoding setting. I am using the latest Apple Mac machine. Thanks in advance

    When I load certain websites the writing is all squashed up. I correct this by changing the character encoding setting. I am using the latest Apple Mac machine. Thanks in advance

    Thanks for that information!
    I'm sure I will be calling AppleCare, but the problem is, they charge for the phone calls don't they? Because I don't have money to be spending to be on the phone with a support service.
    In other things, it seemed like the only time my MacBook was working was when I had Snow Leopard without the 10.6.8 update download that was supposed to be done to prepare for OS X Lion.
    When I look at the information of my HD it says that I have 10.6.8 but that was the install that it claimed to have failed and caused me to restart resulting in all of the repeated problems.
    Also, because my computer is currently down, and I've lost all files how would that effect the use of my iPhone? Because if it doesn't get fixed by the time OS 5 is released, how would I be able to upgrade?!

  • Quicktime encoding using x264 encoder failing in AME 6.0.2.81 and Mac OS 10.8.2

    I'm trying to encode an x264 codec Quicktime using AME 6.0.2.81 and Mac OS 10.8.2. I also tried it with two versions of the x264 encoder which I downloaded from here: http://www.macupdate.com/app/mac/24173/x264encoder
    This encoder works 100% fine and better than the h264 encoder when using Apple Compressor and Apple QT Pro V7. I've been however trying to transition all of my video encodes to AME due to a number of reasons: dynamic link to AE, Premiere and the interface to name a few.
    Everytime I start an encode it gives me an error and I hear that blasted sheep. lol. This is what the log displays, which is frankly probably no help to anyone.
    - Encoding Time: 00:00:00
    01/13/2013 04:48:10 PM : Encoding Failed
    Export Error
    Error compiling movie.
    Unknown error.
    I've also tried to specify the bitrate and have had no luck. Whether I specify the bitrate or not, I still get the error. The source video has been PNG sequences and ProRes movies at 1920x1080, 29.97fps. Both result in the same unknown error.
    Any suggestions? Thanks!

    I know this is pretty old now, but I thought I'd add a reply for those who are having this issue. I just had AME CC 2014 fail on me today, and this is the FIRST thing I've ever tried to render with it. I'm using a TGA sequence and exporting to an X264 codec Quicktime MOV file at 720p. I restarted a few times, repaired permissions etc. and nothing worked except for the following.
    I went to:
    Library/Preferences/Adobe/Adobe Media Encoder/8.0/
    I deleted the entire 8.0 folder (or you could delete whatever version you are having problems with). I then restarted AME and the preferences were recreated and I could successfully use the x264 codec again. Hope this helps someone.

  • Reading Advance Queuing with XMLType payload and JDBC Driver character encoding

    Hi
    I've got a problem retrieving the message from the queue with XMLType payload in Java.
    It was working fine in 10g database but after the switch to 11g it returns corrupted string instead of real XML message. Database NLS_LANG setting is AL32UTF8
    It is said that the JDBC driver should deal with that automatically, but it obviously doesn't in this case. When I dequeue the message using database functionality (DBMS_AQ package) it looks fine, but not when using the JDBC driver, so I think it is a character encoding issue. The message itself is enqueued by the database and is supposed to be retrieved by a dedicated EJB.
    Driver file used: ojdbc6.jar
    Additional libraries: aqapi.jar, xdb.jar
    All file taken from 11g database installation.
    What should I do to get the XML message correctly?

    Do you mean NLS_LANG is AL32UTF8 or the database character set is AL32UTF8? What is the database character set (SELECT value FROM nls_database_parameters WHERE parameter='NLS_CHARACTERSET')?
    Thanks,
    Sergiusz

  • How to detect encoding file in ANSI, UTF8 and UTF8 without BOM

    Hi all,
    I am having a problem with detecting a .txt/.csv file encoding. I need to detect a file in ANSI, UTF8 and UTF8 without BOM, but the problem is that the encodings of ANSI and UTF8 without BOM are reported as the same. I checked the function below and saw that ANSI and UTF8 without BOM have the same encoding. So how can I detect a UTF8 without BOM encoded file? I need to handle this case in my code.
    Thanks.
    public Encoding GetFileEncoding(string srcFile)
    {
        // *** Use Default of Encoding.Default (Ansi CodePage)
        Encoding enc = Encoding.Default;
        // *** Detect byte order mark if any - otherwise assume default
        byte[] buffer = new byte[10];
        using (FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
        {
            file.Read(buffer, 0, 10);
        }
        if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
            enc = Encoding.UTF8;
        else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
            // 00 00 FE FF is the big-endian UTF-32 BOM (Encoding.UTF32 is little-endian)
            enc = new UTF32Encoding(true, true);
        else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
            enc = Encoding.UTF7;
        else if (buffer[0] == 0xFE && buffer[1] == 0xFF)
            // 1201 unicodeFFFE Unicode (Big-Endian)
            enc = Encoding.GetEncoding(1201);
        else if (buffer[0] == 0xFF && buffer[1] == 0xFE)
            // 1200 utf-16 Unicode
            enc = Encoding.GetEncoding(1200);
        return enc;
    }

    What you want is to detect UTF-8 without a BOM, which can only be detected if the file has special characters, so do the following:
    public Encoding GetFileEncoding(string srcFile)
    {
        // *** Use Default of Encoding.Default (Ansi CodePage)
        Encoding enc = Encoding.Default;
        // *** Detect byte order mark if any - otherwise assume default
        byte[] buffer = new byte[10];
        using (FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
        {
            file.Read(buffer, 0, 10);
        }
        if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
            enc = Encoding.UTF8;
        else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
            // 00 00 FE FF is the big-endian UTF-32 BOM (Encoding.UTF32 is little-endian)
            enc = new UTF32Encoding(true, true);
        else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
            enc = Encoding.UTF7;
        else if (buffer[0] == 0xFE && buffer[1] == 0xFF)
            // 1201 unicodeFFFE Unicode (Big-Endian)
            enc = Encoding.GetEncoding(1201);
        else if (buffer[0] == 0xFF && buffer[1] == 0xFE)
            // 1200 utf-16 Unicode
            enc = Encoding.GetEncoding(1200);
        else if (validateUtf8whitBOM(srcFile))
            // No BOM found, but the content looks like UTF-8
            enc = new UTF8Encoding(false);
        return enc;
    }
    private bool validateUtf8whitBOM(string FileSource)
    {
        bool bReturn = false;
        string TextUTF8 = "", TextANSI = "";
        // read the file as UTF-8
        using (StreamReader srFileUtf8 = new StreamReader(FileSource))
        {
            TextUTF8 = srFileUtf8.ReadToEnd();
        }
        // read the file as ANSI
        using (StreamReader srFileAnsi = new StreamReader(FileSource, Encoding.Default, false))
        {
            TextANSI = srFileAnsi.ReadToEnd();
        }
        // if the file contains special characters and is UTF-8, the text read as ANSI shows these signs
        if (TextANSI.Contains("Ã") || TextANSI.Contains("±"))
            bReturn = true;
        return bReturn;
    }
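
An alternative to looking for telltale characters, sketched here in Java rather than C# and offered only as one possible approach: decode the bytes strictly as UTF-8 and see whether the decoder rejects them. Bytes that contain non-ASCII values and still form valid UTF-8 are very likely UTF-8 without a BOM; bytes that fail are treated as ANSI. The class name Utf8Sniffer and the returned labels are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Sniffer {
    // Strict decode: any malformed byte sequence raises an exception
    // instead of being silently replaced.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Bytes with the high bit set are negative when viewed as Java bytes.
    static boolean hasNonAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return true;
            }
        }
        return false;
    }

    // Pure ASCII is identical in ANSI and UTF-8, so it cannot be told apart.
    static String guess(byte[] data) {
        if (!hasNonAscii(data)) {
            return "ASCII";
        }
        return isValidUtf8(data) ? "UTF-8" : "ANSI";
    }
}
```

This is still a heuristic: a short ANSI file can happen to be valid UTF-8, so the result is a guess, not proof.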

  • Problems with Forms and character encoding

    I'm having problems trying to read unicode data inputted into a Form on my JSP page.
    I've used the meta tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to set the charset of the page to UTF-8. I've inputted some Chinese characters into my form and when I try to read the subsequent request parameter in my servlet using request.getParameter() the string returned is this
    "&#26469;&#28304;" which is the escape sequence required by HTML to display these characters.
    From what I've read on the subject this doesn't seem like the expected value. I've tried other ways of getting the correct string value such as setting the character encoding request.setCharacterEncoding("UTF-8") and then converting the bytes using this encoding value but it doesn't seem to work.
    I could write a method to split up the string using the ; as a token and working out the correct unicode character but this doesn't seem like the right thing to do.
    Any help on how to pass the correct information from the Form in the JSP page to the servlet would be greatly appreciated

    I don't believe that is correct, but if it's returning HTML escapes instead of URL Encoded characters, then it's the browser doing it. This is my test page for playing with Chinese...
    <%@ page language="java" contentType="text/html; charset=UTF-8" %>
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
    <head>
         <title></title>
         <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    </head>
    <body bgcolor="#ffffff" background="" text="#000000" link="#ff0000" vlink="#800000" alink="#ff00ff">
    <%
    request.setCharacterEncoding("UTF-8");
    String str = "\u7528\u6237\u540d";
    String name = request.getParameter("name");
    %>
    req enc: <%= request.getCharacterEncoding() %><br />
    rsp enc: <%= response.getCharacterEncoding() %><br />
    str: <%= str %><br />
    name: <%= name %><br />
    <form method="GET" action="_lang.jsp" encoding="UTF-8">
    Name: <input type="text" name="name" value="" >
    <input type="submit" name="submit" value="GET Submit" />
    </form>
    <form method="POST" action="_lang.jsp" encoding="UTF-8">
    Name: <input type="text" name="name" value="" >
    <input type="submit" name="submit" value="POST Submit" />
    </form>
    </body>
    </html>
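
Browsers tend to substitute numeric character references like the ones in the question when the charset used for the form submission cannot represent a character, so getting the page and form charset right is the real fix. If the escaped values have to be handled anyway, a fallback along these lines could turn them back into characters. This is only a sketch: the class name NumericEntities is made up, and it handles decimal references only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntities {
    // Decimal numeric character references such as &#26469;
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    // Replace each decimal reference with the character it stands for.
    static String decode(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(); // leave out-of-range references untouched
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A drawback of this fallback is that it cannot distinguish a reference inserted by the browser from one the user literally typed.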
