Default character encoding ???

To set the default character encoding to UTF-8 for JVM I can do the
following on Solaris:
LC_ALL=en_US.UTF-8; java ....
How do I achieve the same on Windows?
Is there a Java property where I can specify the default character
encoding, or do I have to call str.getBytes("UTF-8") everywhere?
Thanks,
Artur...

Hi Artur,
please see my responses to the same question you've posted in
Forum Home > Internationalization
Kind regards,
Marcus

Similar Messages

  • Default Character Encoding stuck on UTF-8 - Firefox 7

    I cannot change the Character Encoding - it is stuck on Unicode UTF-8 and I cannot change it! When a web page opens I get little boxes with "FF FD" instead of quote marks. When I change the character encoding on that page using "View->Character Encoding" and click on Western (ISO-8859-1), the page displays correctly. Every page opens using Unicode UTF-8 as the default.
    View->Character Encoding -- shows Unicode UTF-8 as the default.
    View->Character Encoding->Auto-detect -- shows OFF
    Tools->Options->Content->Advanced->Fonts->Default Character Encoding -- shows Western (ISO-8859-1), and the "Allow pages to choose their own fonts..." check box IS CHECKED
    THE PAGES ARE NOT UTF-8!!!! The "View Page Source" IS NOT Unicode UTF-8! -- It shows <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
    The "View Page Info" shows MetaTag - Content-Type: text/html; charset=iso-8859-1
    Why can I not change the Default Character Encoding?
    I would also like to point out that the Unicode UTF-8 seems to be broken because it is indicating that the QUOTE CHARACTER is an UNPRINTABLE character "FF FD"
    ----- EDIT -----
    The UTF-8 is not broken. The problem, as pointed out in http://en.wikipedia.org/wiki/Replacement_character#Replacement_character, is that my Firefox, being STUCK processing UTF-8 encoding, cannot read the clearly marked iso-8859-1 data. So UTF-8 is reinterpreting the smart quotes (“ and ”) as the replacement (unprintable) character.
    So the real problem is why my Firefox is stuck on Unicode UTF-8

    The real problem is that the font that is used doesn't have those characters.
    Do you see the special quotes “ and ” on this forum page?
    Does it help if you disable the website fonts and set another font as the default font?
    *Tools > Options > Content : Fonts & Colors > Advanced
    *http://en.wikipedia.org/wiki/Punctuation
    *http://en.wikibooks.org/wiki/Unicode/Character_reference/2000-2FFF

  • How to set a platform's default character encoding

    Hi,
    Does anybody know how to set a platform's default character encoding in Java? Thank you.
    Yugang
    [email protected]

    You do mean for Java, not from Java? (The latter would make absolutely no sense at all.) If so, pass it to the runtime using the -D switch (you've got Sun's java, right?):
    java -Dfile.encoding=the-encoding-i-like the.name.of.YourClass
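    For what it's worth, a minimal check of what the JVM actually picked up (class name made up here; assumes a Sun JVM, where -Dfile.encoding is honored at startup):
    public class PrintFileEncoding {
        public static void main(String[] args) {
            // Prints whatever was passed via -Dfile.encoding, or the platform default.
            System.out.println(System.getProperty("file.encoding"));
        }
    }
    Running java -Dfile.encoding=UTF-8 PrintFileEncoding should then print UTF-8.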

  • Setting the default character encoding

    How do you set the default character encoding for the portal to UTF-8 so that Unicode
    characters work within portlets?
    Any help would be much appreciated,
    Troy

    You can put this tag in the portal.jsp in the header:
    <meta http-equiv='Content-Type' content='text/html;charset=UTF-8'>
    "Troy" <[email protected]> wrote:
    >
    That doesn't seem to work when put into my portlet's content JSP. Is there another
    place I could put the page directive that will control the entire portal page?
    "Sai S Prasad" <[email protected]> wrote:
    Troy,
    you can try the page directive with encoding as:
    <%@ page contentType="text/html; charset=UTF-8" %>
    "Troy" <[email protected]> wrote:
    How do you set the default character encoding for the portal to UTF-8
    so that Unicode characters work within portlets?
    Any help would be much appreciated,
    Troy
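    Not from this thread, but another approach sometimes used when individual page directives don't stick: a servlet filter that forces UTF-8 before the portal JSPs run (the class name and mapping are hypothetical; assumes the javax.servlet API is available):
    import java.io.IOException;
    import javax.servlet.*;

    public class Utf8EncodingFilter implements Filter {
        public void init(FilterConfig config) {}

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            // Force UTF-8 on the request body and on the generated HTML.
            req.setCharacterEncoding("UTF-8");
            res.setContentType("text/html; charset=UTF-8");
            chain.doFilter(req, res);
        }

        public void destroy() {}
    }
    The filter would be mapped to /* in web.xml; whether the portal honors it depends on the container.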

  • Setting DEFAULT character encoding ???

    To set the default character encoding to UTF-8 for JVM I can do the
    following on Solaris:
    LC_ALL=en_US.UTF-8; java ....
    How do I achieve the same on Windows?
    Is there a Java property where I can specify the default character
    encoding, or do I have to call str.getBytes("UTF-8") everywhere?
    Thanks,
    Artur...

    Hi Artur,
    there is a way. The property is file.encoding. Example:
    import java.io.*;
    public class ShowEncoding {
      public static void main(String[] args) {
        System.out.println("Default encoding: " +
            new InputStreamReader(System.in).getEncoding());
      }
    }
    On my system I get the following results:
    java ShowEncoding
    Default encoding: Cp1252
    java -Dfile.encoding=UTF-8 ShowEncoding
    Default encoding: UTF8
    java -Dfile.encoding=Latin1 ShowEncoding
    Default encoding: ISO8859_1
    I don't know if setting this property using the -D argument is documented or implementation-independent. But so far, it has worked for me. For setting the locale, you may use the properties user.language and user.region. But this is undocumented and implementation-dependent.
    I wouldn't use these properties at all. I would rather specify the encoding explicitly in all reader and writer constructors, probably configured from an application property file.
    Good luck!
    Marcus.
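    To make Marcus's last suggestion concrete, a minimal sketch (the file name and the property constant are invented here) of passing the encoding explicitly instead of relying on file.encoding:
    import java.io.*;

    public class ExplicitEncodingExample {
        // Could just as well be read from an application property file.
        private static final String ENCODING = "UTF-8";

        public static void main(String[] args) throws IOException {
            // Write UTF-8 regardless of the platform default encoding.
            Writer out = new OutputStreamWriter(new FileOutputStream("out.txt"), ENCODING);
            out.write("Grüße");
            out.close();

            // Read it back with the same explicit encoding.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream("out.txt"), ENCODING));
            System.out.println(in.readLine());
            in.close();
        }
    }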

  • SPA504G SPA514G Default Character Encoding stay in ISO-8859-1

    Hi,
    I have configure like:
      <Dictionary_Server_Script ua="na">serv=http://{{ provisioning.server }}/telecom/language/;d0=English;x0=spa50x_30x_en_v754.xml;d1=French;x1=spa50x_30x_fr_v754.xml;</Dictionary_Server_Script>
      <Language_Selection ua="na">French</Language_Selection>
      <Default_Character_Encoding ua="na">UTF-8</Default_Character_Encoding>
      <Locale ua="na">fr-FR</Locale>
    Dictionary and Provisioning Profile are encoded in UTF-8.
    but when the phone starts after provisioning, Default_Character_Encoding is set to ISO-8859-1
    and the line labels are misprinted:
    Ligne 1
    Ligne 2
    Olivier
    FranÃ§oise
    instead of
    Ligne 1
    Ligne 2
    Olivier
    Françoise
    Any idea?

    I got an answer from the developer.
    Pasted here.
    I think the default encoding is set back to ISO-8859-1 after the customer downloads the dictionary.
    Here is the reason: since release 7.5.3, the SPA 50x parses the trkLocaleName in the dictionary, and for French it sets the phone's default encoding to ISO-8859-1, since that is suitable for French:
    <trkLocaleName>French</trkLocaleName>
    =================================
    1. If the customer wants to use UTF-8 after the XML download, modify the trkLocaleName in the French dictionary XML as follows:
    <trkLocaleName>croatian</trkLocaleName>
    It is a workaround, but it's strange that a French user would want UTF-8. Thanks.
    2. Another way is for the user to manually set the default encoding value back to UTF-8 after the XML download.

  • How can I define a default character enconding for Firefox 4?

    Dear Sirs,
    I would like to know how I can change the character encoding of Firefox 4 to Western (ISO-8859-1) and keep that character encoding as the default every time I open Firefox 4.
    I have noticed that every time I change the default character encoding from Unicode (UTF-8) to Western (ISO-8859-1) and close Firefox 4, when I open Firefox 4 again the character encoding returns to Unicode (UTF-8).
    I would like to keep Western (ISO-8859-1) as the default character encoding. How can I do that?
    Best regards,
    IMeMine

    If the server sends an encoding then Firefox will always switch to that encoding.
    The default is only used on pages where the server does not send an encoding via the HTTP response header and the code doesn't specify the encoding via a meta tag.
    There is usually no need to change the encoding from the default setting Western (ISO-8859-1).
    It is probably better to try an Auto-Detect setting (e.g. View > Character Encoding > Auto-Detect > Universal) if you have a problem on specific websites.

  • What's the difference in character encoding between 1.4.0 and 1.4.2 in Linux

    As I find, the character encoding for Chinese in JDK 1.4.2 is no longer the same as in JDK 1.4.0.
    In JDK 1.4.0, the character encoding used the "file.encoding" system property; we often set the
    property to "gb2312".
    But in JDK 1.4.2, I find that the default character encoding no longer uses the "file.encoding" system property.
    Who knows the reason?
    Test Program:
    public class B {
        public static void main(String[] args) throws Exception {
            byte[] bytes = new byte[]{(byte)0xD6, (byte)0xD0, (byte)0xCE, (byte)0xC4};
            String s1 = new String(bytes);
            String s2 = new String(bytes, System.getProperty("file.encoding"));
            System.out.println("s1=" + s1 + " , s2=" + s2);
            System.out.println("s1.length=" + s1.length() + " , s2.length=" + s2.length());
        }
    }
    Run four times, the results are:
    [root@app15 component]# /usr/local/j2sdk1.4.0/bin/java -Dfile.encoding=ISO-8859-1 -cp . B
    s1=中文 , s2=中文
    s1.length=4 , s2.length=4
    [root@app15 component]# /usr/local/j2sdk1.4.0/bin/java -Dfile.encoding=gb2312 -cp . B
    s1=中文 , s2=中文
    s1.length=2 , s2.length=2
    [root@app15 component]# /usr/local/j2sdk1.4.2/bin/java -Dfile.encoding=ISO-8859-1 -cp . B
    s1=中文 , s2=中文
    s1.length=4 , s2.length=4
    [root@app15 component]# /usr/local/j2sdk1.4.2/bin/java -Dfile.encoding=gb2312 -cp . B
    s1=中文 , s2=??
    s1.length=4 , s2.length=2
    [root@app15 component]#

    I don't know for sure, but:
    -- The API documentation for String says that "new String(byte[])" uses "the platform's default charset".
    -- The API documentation for Charset says "The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system."
    You'll notice that it doesn't say anything about using the file.encoding system value, so presumably (based on your experiments) it doesn't. I did a search for "java default charset" and didn't find anything specific, but this site says "As of Java 1.4.1, the default Charset varies from platform to platform" and suggests you explicitly hard-code your charset. I would agree with that.
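    To make the hard-coding suggestion concrete, a small sketch (not from the thread; class name invented, GB2312 support in the JRE assumed):
    import java.io.UnsupportedEncodingException;

    public class HardCodedCharset {
        public static void main(String[] args) throws UnsupportedEncodingException {
            byte[] bytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
            // Depends on the platform default charset -- may differ between JDKs and machines.
            String fragile = new String(bytes);
            // Hard-coded charset: decodes the same way everywhere.
            String stable = new String(bytes, "GB2312");
            System.out.println("fragile.length=" + fragile.length()
                    + " , stable.length=" + stable.length());
        }
    }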

  • Can character encoding be predefined for certain pages?

    Certain pages that I visit frequently require me to manually set my character encoding to Western (ISO Latin 1), whether my default character encoding is set to UTF-8 or to Western (ISO Latin 1).
    As the pages that show up malformed are embedded in other frames I suspect that the top frame forces a different encoding than is on the embedded page.
    An example page is here (847.is). The topic list of this message board is in order, but when any one of the topics is viewed all accented and special characters are missing, until Western (ISO Latin 1) is manually set as the character encoding. Similarly, opening any of the topics in a tab will result in missing characters.
    Is there some way for me to circumvent having to go through all those menus to set it? Can I somehow define that these pages should be viewed in Western (ISO Latin 1) or can I set a keyboard shortcut for Western (ISO Latin 1)?
    MacBook 2006   Mac OS X (10.4.7)   Safari version 2.0.4

    For instance if I opened an entry on Vísindavefur HÍ out of the parent frame the accented letters would show up somehow mangled, but this is no longer a problem.
    That page has no charset in the source and thus should only display correctly if you have Latin-1 set as the browser default. With UTF-8 set as the default you should see (in Safari) a ton of black diamonds with question marks inside.
    Firefox never displays ð (eth), þ (thorn) and ý (accented y) correctly for me, did not do it on the old machine and does not do it on this one either.
    Firefox displays them perfectly for me in both 10.3 and 10.4.
    Actually, if I set Opera encoding to UTF-8 it displays the topics on 847.is as Safari does.
    This indicates it is a system issue rather than Safari. Sorry I can't duplicate it and have no good idea what could cause it on a normal system. Have you (or the place where you buy your machines) by chance installed any special software add-ons to enable the use of non-Unicode Icelandic (for apps like Appleworks, WordX, etc)?

  • Changing the default char encoding of the current JVM ?!

    Is there any way that could be used to alter the default character encoding of the Java Virtual Machine at the start of a Java application?

    This seems a little dangerous... considering that file I/O etc. depends on correct charset encodings for filenames, and so on.
    However, perhaps you could try setting the file.encoding property on the command line:
    java -Dfile.encoding=Big5 YourApplication
    Regards,
    John O'Conner
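    A small sketch of why changing it from inside the application usually doesn't work (behavior is implementation-dependent; on Sun's JVM the default charset is effectively fixed at startup):
    import java.io.InputStreamReader;

    public class RuntimeEncodingChange {
        public static void main(String[] args) {
            System.out.println("before: " + new InputStreamReader(System.in).getEncoding());
            // Typically has no effect on readers/writers created afterwards,
            // because the default was already captured when the VM started.
            System.setProperty("file.encoding", "Big5");
            System.out.println("after:  " + new InputStreamReader(System.in).getEncoding());
        }
    }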

  • Character Encoding is changing random

    Hello,
    For a short while I've been having the following problem:
    When I am using Firefox, after a while characters on pages are shown as 'boxes'. By setting View -> Character Encoding back to ISO-8859-1 the characters are shown correctly again.
    I do not understand why the character encoding randomly changes to UTF-8, while I have selected ISO-8859-1 and set automatic recognition to false.
    In the options menu I have also set 'ISO-8859-1' as the default character encoding.
    Hoping someone can tell me why it randomly changes, and why the same page can be shown correctly 10 times, but the 11th time the character set has changed? And of course, how can I solve this problem?

    I do know a website can determine the character encoding, but the strange part of the problem I'm encountering is that the same page can be shown 10 times correctly, but an 11th time characters like é and ë are shown as 'blocks'/'question marks'.
    Is there an explanation for that behaviour?

  • Importance of character encoding?

    I am making a list of errors in someone else's code. What about this:
    outline of existing code
    byte[] binaryMssg = "sync/ack".getBytes(); // client
    --- DataOutputStream --->
    String mssg = new String(buf); // server
    proposed correction:
    byte[] binaryMssg = "sync/ack".getBytes(Charset.forName("UTF-16")); // client
    --- DataOutputStream --->
    String mssg = new String(buf, Charset.forName("UTF-16")); // server
    The original developer omitted character encoding.
    I have zero knowledge/experience with character encoding. I only just noticed method overloads that optionally use them. I don't want to look stupid by putting what is really a non-issue on my list. I don't want to look like a jerk and draw attention to something that, in real world usage, is a non-issue. Otherwise, just to be thorough, I would like to put it on my list. I think it might create more robust code? Any comments welcomed.
    Finally, I was so surprised to notice Charset is in the nio package and so not available until v1.4. This hints that character encodings were not a priority for the Java development team. So, since they did not consider it important, perhaps neither should I. While there may be niches for character-encoding usage, the normal-case rule is to not use explicit character encoding.

    dpxqb wrote:
    I am making a list of errors in someone else's code. What about this:
    outline of existing code
    byte[] binaryMssg = "sync/ack".getBytes(); // client
    --- DataOutputStream --->
    String mssg = new String(buf); // server
    proposed correction:
    byte[] binaryMssg = "sync/ack".getBytes(Charset.forName("UTF-16")); // client
    --- DataOutputStream --->
    String mssg = new String(buf, Charset.forName("UTF-16")); // server
    Much much much better, but I would use UTF-8 since it is usually more compact for western languages.
    >
    The original developer omitted character encoding.
    Criminal.
    >
    I have zero knowledge/experience with character encoding. I only just noticed method overloads that optionally use them. I don't want to look stupid by putting what is really a non-issue on my list. I don't want to look like a jerk and draw attention to something that, in real world usage, is a non-issue. Otherwise, just to be thorough, I would like to put it on my list. I think it might create more robust code? Any comments welcomed.
    It is correct to define the character encoding to be used if the bytes are being transferred from one computer to another, since the two machines may have different default character encodings.
    >
    Finally, I was so surprised to notice Charset is in the nio package and so not available until v1.4. This hints that character encodings were not a priority for the Java development team. So, since they did not consider it important, perhaps neither should I. While there may be niches for character-encoding usage, the normal-case rule is to not use explicit character encoding.
    Character encoding has always been important in Java. One has always been able to explicitly set the character encoding by using a String to name it. For example, one could use
    byte[] binaryMssg = "sync/ack".getBytes("UTF-16"); // client
    --- DataOutputStream --->
    String mssg = new String(buf, "UTF-16"); // server
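    A self-contained sketch of that corrected round trip (not from the thread; StandardCharsets needs Java 7+, and the length-prefixed framing is only for the demo):
    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class EncodedMessageDemo {
        public static void main(String[] args) throws IOException {
            // "Client": encode with an explicit charset before writing.
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(wire);
            byte[] payload = "sync/ack".getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);
            out.write(payload);
            out.flush();

            // "Server": decode with the same explicit charset.
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(wire.toByteArray()));
            byte[] buf = new byte[in.readInt()];
            in.readFully(buf);
            String mssg = new String(buf, StandardCharsets.UTF_8);
            System.out.println(mssg); // sync/ack
        }
    }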

  • How to set the system default file character encoding to UTF-8?

    Hi all. This is driving me nuts, on both my Windows box and Snow Leopard; I figure much more chance of finding the answer for OS X.
    My language and locale are set to Australian English. $LANG=en_AU.UTF-8
    However, as I believe is expected, OS X (and Windows for that matter) will create files by default with character encoding of Cp1252 (Latin-1). That is, the FILE encoding in the file metadata - the Byte Order Mark I believe. The file itself, not the characters written to it.
    This, in a word, bites. I don't want to be restricted to only ASCII by default, and it is causing me problems with certain software (a Firefox plugin) that creates text files, passing in UTF-8 encoded content, which is then mangled because the file encoding itself is still Cp1252. (I know, I've tested this by changing the file encoding manually and having it overwritten again by the plugin: works correctly.)
    As a simple example, just `touch somefile` from terminal creates a file in Cp1252 -- I'm obtaining that info by opening in jEdit by the way (anyone know of something better?).
    In other locales that are not English-based, I believe the default file encoding is UTF-8. But surely this can be controlled independently? There must be a system configuration value somewhere that specifies file encoding default. Can someone please tell me what it is?
    Thanks!

    However, as I believe is expected, OS X (and Windows for that matter) will create files by default with character encoding of Cp1252 (Latin-1). That is, the FILE encoding in the file metadata - the Byte Order Mark I believe. The file itself, not the characters written to it.
    Apps like TextEdit and Mail have settings that let you determine the encoding of text produced. The default would normally depend on the character content of the file, ranging from ASCII for basic English to Windows Latin-1 (Win 1252) or ISO Latin -1 (ISO 8859-1) to UTF-8 for other content.
    Win 1252 is not ASCII, but has twice the number of characters of the latter.
    Byte Order Mark is something totally different --it's a particular character used to signal certain encodings.
    http://en.wikipedia.org/wiki/Byteordermark
    As a simple example, just `touch somefile` from terminal creates a file in Cp1252 -- I'm obtaining that info by opening in jEdit by the way (anyone know of something better?).
    For what Terminal does and how to change it, it might best to post in the Unix forum:
    http://discussions.apple.com/forum.jspa?forumID=735
    For problems with a FireFox plugin, it might be good to ask on their own forums as well.
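    As an aside (not from this thread): a plain text file on OS X or Windows carries no encoding metadata at all; the encoding is simply how the bytes were written, and editors like jEdit can only guess. A tiny Java illustration (file name invented):
    import java.io.*;

    public class NoEncodingMetadata {
        public static void main(String[] args) throws IOException {
            Writer w = new OutputStreamWriter(new FileOutputStream("sample.txt"), "UTF-8");
            // The ï is written as the two bytes 0xC3 0xAF; nothing else in the
            // file marks it as UTF-8 (no BOM is added unless you write one yourself).
            w.write("naïve");
            w.close();
        }
    }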

  • NetBeans problem: Issue with servlets and Chinese character encoding

    Java Version: JDK1.5.0_01, JRE1.5.0_01 (International version)
    Netbeans Version: Netbeans IDE 4.0
    OS: Windows XP Personal Edition
    Dear Sirs,
    First at all thanks for reading this post. I am having the following issue. I am creating an application using html pages and servlets. I am using Chinese and English languages on them (html encoding UTF-8).
    I created a project in Netbeans and added an index.html page pointing to a servlet. Both index.html and the servlet-generated html page contain the line:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    Additionally, I set up the character code settings in Netbeans:
    (Tools - Options - Java Sources - Expert - default encoding = UTF-8)
    When I run the project, index.html displays itself perfectly, with the Chinese characters displayed properly. The problem comes when the servlet-created html is displayed, which, instead of the Chinese characters, shows some strange characters (�� instead of Chinese).
    I have tried different encodings from http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html without any luck. I also set up the encoding of the file itself (using right click - Properties in the project menu of Netbeans).
    Also, when I am editing the servlet, the characters are displayed properly. I type them directly without any issue, but then the display is wrong at runtime.
    Also, just in case this have something to do with the problem, my PC was bought in US, therefore the default character set is not Chinese. I had to install the Chinese typing stuff later on. But like I said earlier, the html page is displayed properly, so I really think is some problem with Netbeans.
    After a week trying to find a solution, I decided to post it here in the hopes that someone will show me the way of the light.
    Thanks in advance for any ideas or help provided
    Aral.

    Ok, I found out some problems with Netbeans as well.
        public void doGet(HttpServletRequest request,
                          HttpServletResponse response)
            throws IOException, ServletException {
            response.setCharacterEncoding("UTF-8");
            request.setCharacterEncoding("UTF-8");
            response.setContentType("text/html");
            PrintWriter out = response.getWriter();
            byte[] st = {-25,-75,-124,-27,-100,-106,-17,-68,-102,-27,-80,-113,-27,-72,-125,-26,-118,-75,-26,-105,-91,-27,-82,-93};
            out.println("this works: ");
            out.println(new String(st, "UTF-8"));
            out.println("<br>");
            out.println("this doesn't: ");
            out.println("some chinese copied from the Internet<br>");
        }
    Right click the .java file and choose Properties -> Encoding: UTF-8.
    Then I made a copy of the .java file, renamed it to .html and opened it with IE; sure enough
    the Chinese is already unreadable (note: it's still readable in the IDE).
    When I compile the file with F9 I get the following warning:
    whatever.java:101: warning: unmappable character for encoding Cp1252
    I tried to set the encoding to UNICODE but then the file doesn't compile.
    I guess you have to download the Japanese version for it to work correctly.
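    A note beyond what the thread says: the "unmappable character for encoding Cp1252" warning usually means javac is reading a UTF-8 source file with the wrong charset, and telling the compiler the actual source encoding is the usual remedy, for example:
    javac -encoding UTF-8 whatever.java
    The -encoding flag is standard javac; the IDE equivalent is whichever project or file encoding setting your NetBeans version exposes.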

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1. Unicode does not solve this issue for us (yet).
    2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without taking this into account – most of the time. That's because the first 127 byte values in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs), and because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you carry those same assumptions into an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 possible values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 values were identical on all, and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard, and because of the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single-byte representations of characters. Then, for the next most common set, it uses a block in the second 128 bytes as a double-byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6-byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character – in their text editor, using the codepage for their region, they insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first byte of a 2-byte sequence. You either get a different character, or, if the second byte is not a legal value for that first byte – an error.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
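    As a concrete illustration of Point 4 (not in the original post; this sketch uses the JAXP transformer bundled with the JDK, and the class and file names are invented):
    import java.io.FileOutputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class WriteXmlWithEncoding {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("greeting");
            root.setTextContent("héllo");
            doc.appendChild(root);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            // The declared encoding and the bytes actually written cannot disagree,
            // because the same component produces both.
            t.transform(new DOMSource(doc), new StreamResult(new FileOutputStream("greeting.xml")));
        }
    }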
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1. Unicode does not solve this issue for us (yet).
    2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without taking this into account – most of the time. That's because the first 127 byte values in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs), and because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you carry those same assumptions into an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.
    They might only use that range, but that is a different issue, especially since that range is exactly the same as the UTF-8 character set anyway.
    >
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 possible values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 values were identical on all, and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard, and because of the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single-byte representations of characters. Then, for the next most common set, it uses a block in the second 128 bytes as a double-byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6-byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of Unicode – all of Unicode – are based on ASCII. The representational format of UTF-8 is required to implement Unicode, thus it must represent those characters. It uses the idiom supported by variable-width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character – in their text editor, using the codepage for their region, they insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first byte of a 2-byte sequence. You either get a different character, or, if the second byte is not a legal value for that first byte – an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it then it is invalid. End of story. It has nothing to do with html/xml.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in java with escaped unicode characters which will fail to compile.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone creating an inventory system for a standalone store to craft a solution that supports multiple languages.
    Another example: in high-volume systems, moving/storing bytes is relevant. As such, one must carefully consider each text element as to whether it is customer-consumable or internally consumable. Saving bytes in such cases affects the total load of the system; in such systems, incremental savings impact operating costs and give a marketing advantage through speed.
