Locale and character encoding. What to do about these dreadful ÅÄÖ??

It's time for me to get it into my head how this works. Please, help me understand before I go nuts.
I'm from Sweden and we use a few of these weird characters like ÅÄÖ.
If I create a file called "övrigt.txt" in Windows, the file turns up as "?vrigt.txt" on my Linux PC (at least in the console; sometimes it looks OK in other apps in X). The same is true if I create the file in Linux and copy it to Windows: it looks just as weird on the other side.
As I (probably) can't change the way Windows works, my question is: what do I have to do to have these two systems play nicely with each other?
This is the output from locale:
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE=C
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
Is there anything here I should change? I have tried using ISO-8859-1 with no luck. Mind you, I want to have the system-wide language set to English. The only thing I want to achieve is that "Ö" on Windows turns up as "Ö" in Linux as well, and vice versa.
Please save my hair from being torn off, I'm going bald here...

Hey, thanks for all the answers!
I share my files in a number of ways, but mainly through a web application called AjaXplorer (very nice, btw...). The thing is that as soon as a Windows user uploads anything with special characters in the file name, my programs (XBMC, the console, etc.) refuse to read them correctly. Other ways of sharing are file copying with USB sticks, SSH, etc. It's really not the way of sharing that is the problem, I think, but rather the special characters being used sometimes.
I could probably convert the filenames with the suggested applications, but then I'd put the Windows users in trouble when they want to download them again, wouldn't I?
I realize that it's cp1252 that is the bad guy in this drama. Is there no way to set/use cp1252 as a character encoding in Linux? It's probably a bad idea, as UTF-8 seems like the way of the future, but the fact that these two OSes can't communicate too well in this area is pretty useless if you ask me.
To wrap this up I'll answer some questions...
@EVRAMP: I'm actually using PCManFM, but that is only for me, and I don't deal with vfat partitions very often, to be honest.
@pkervien: Well, I think I mentioned my forms of sharing above. (Fun with some Arch Swedes!)
@quarkup: locale.gen is edited, and both sv_SE and en_US have UTF-8 and ISO-8859 enabled and generated.
...and to clarify things even further: it doesn't matter if I get or provide a file via a USB stick, Samba, FTP or on paper. All I want is for "Ö" to always be "Ö", everywhere.
I can't believe how hard this is to get around. Linus is Finnish, for crying out loud. I thought he'd have sorted this out as the first thing he did. Maybe he doesn't deal with Windows or its users at all.
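To make the mismatch concrete, here is a minimal Java sketch (the charset names are just for illustration; on disk a file name is only a sequence of bytes, and what you see depends on which charset the viewer assumes):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "övrigt.txt" as a cp1252 system writes it: 'ö' is the single byte 0xF6
        byte[] cp1252Bytes = "övrigt.txt".getBytes(Charset.forName("windows-1252"));

        // A UTF-8 console decoding those bytes finds 0xF6 invalid and shows a
        // replacement character instead, hence "?vrigt.txt"
        System.out.println(new String(cp1252Bytes, StandardCharsets.UTF_8));

        // Decoded with the charset it was written in, the name survives intact
        System.out.println(new String(cp1252Bytes, Charset.forName("windows-1252")));
    }
}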

Similar Messages

  • I bought this MacBook Pro 15.4" 1.83 GHz MA463LL/A with 2 GB of RAM for real cheap on eBay because I'm poor and can't afford a new MacBook Pro, and I was wondering what everyone thinks about these laptops and what I can and should upgrade when I get the money?

    I bought this older MacBook Pro 15.4" 1.83 GHz with 2 GB of RAM and Leopard 10.5 on eBay for real cheap because I don't have the money to buy a new one like I would like to do, and it comes with a disk to upgrade the OS to 10.6. I was wondering what all I can and should upgrade (short of the whole computer) as I have the money to do so? Or did I just get really hosed?

    You have the original Core Duo model. 2 GB is the most RAM you can install, so the only thing you could upgrade would be the hard drive. Doing so is not a trivial task, but you can find suitable hard drives and installation tutorials at OWC.
    Since yours is a Core Duo model you can upgrade OS X to Snow Leopard, but not to Lion, which requires a Core 2 Duo or better.

  • Problems with Forms and character encoding

    I'm having problems trying to read Unicode data input into a form on my JSP page.
    I've used the meta tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to set the charset of the page to UTF-8. I've typed some Chinese characters into my form, and when I try to read the subsequent request parameter in my servlet using request.getParameter(), the string returned is this:
    "&#26469;&#28304;" which is the escape sequence required by HTML to display these characters.
    From what I've read on the subject, this doesn't seem like the expected value. I've tried other ways of getting the correct string value, such as setting the character encoding with request.setCharacterEncoding("UTF-8") and then converting the bytes using this encoding value, but it doesn't seem to work.
    I could write a method to split up the string using the ; as a token and work out the correct Unicode characters, but this doesn't seem like the right thing to do.
    Any help on how to pass the correct information from the Form in the JSP page to the servlet would be greatly appreciated

    I don't believe that is correct, but if it's returning HTML escapes instead of URL-encoded characters, then it's the browser doing it. This is my test page for playing with Chinese...
    <%@ page language="java" contentType="text/html; charset=UTF-8" %>
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
    <head>
         <title></title>
         <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    </head>
    <body bgcolor="#ffffff" background="" text="#000000" link="#ff0000" vlink="#800000" alink="#ff00ff">
    <%
    request.setCharacterEncoding("UTF-8");
    String str = "\u7528\u6237\u540d";
    String name = request.getParameter("name");
    %>
    req enc: <%= request.getCharacterEncoding() %><br />
    rsp enc: <%= response.getCharacterEncoding() %><br />
    str: <%= str %><br />
    name: <%= name %><br />
    <form method="GET" action="_lang.jsp" encoding="UTF-8">
    Name: <input type="text" name="name" value="" >
    <input type="submit" name="submit" value="GET Submit" />
    </form>
    <form method="POST" action="_lang.jsp" encoding="UTF-8">
    Name: <input type="text" name="name" value="" >
    <input type="submit" name="submit" value="POST Submit" />
    </form>
    </body>
    </html>
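
    If the browser really is submitting numeric character references like &#26469; (browsers do that when the page the form sits on is not actually served as UTF-8, so they escape characters they cannot encode), they can be unescaped after the fact. A small sketch of that cleanup; it's a workaround, not a fix for the underlying page encoding:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NumericRefDecoder {
        private static final Pattern REF = Pattern.compile("&#(\\d+);");

        // Replaces decimal character references such as &#26469; with the character itself
        public static String decode(String s) {
            Matcher m = REF.matcher(s);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                int codePoint = Integer.parseInt(m.group(1));
                m.appendReplacement(out,
                        Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
            }
            m.appendTail(out);
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(decode("&#26469;&#28304;")); // prints the two Chinese characters
        }
    }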

  • H.264 data rate difference between local and distributed encode

    Doing some testing today and noticed this curious behavior with Compressor 3.5.1:
    Set a video encode for a 50-second 1080i uncompressed file to H.264 at 50,000 kbps (50 Mbps). Audio is 24-bit, 48 kHz stereo with no compression.
    1. Sending the job to "This computer" for a local-only encode gives me a file size of 327 MB
    2. Sending the same job to the same computer using distributed encoding gives me a file size of 211 MB
    The data rate for the distributed encoded file is about 34 Mbps, which isn't what was specified in the encode settings in Compressor. The data rate for the locally encoded file is about 52 Mbps, which is what I would expect when the audio is taken into account.
    Why the difference? Both encoded files are the same duration, so there aren't any missing frames. They both look good and play well but this is giving me cause for pause.
    Also, another curious difference I noticed... The distributed file has the More Info fields populated with useful information in the Finder (dimensions, codecs, duration, channel count, total bit rate). The locally encoded file does not have this information.
    -Matt


  • Cfexchange and character encoding

    Hi!
    I'm developing an application using the new cfexchange tags. The app outputs a list of calendar events from an MS Exchange 2003 server. It works nicely as long as the event subjects don't contain any non-English characters. As soon as a subject contains, for example, "umlaut" characters like ä or ö, the output is bogus.
    Should I, for example, make an appointment in Outlook with the subject "Kimi Raikkonen" (without umlauts), the subject, when retrieved with cfexchangecalendar, displays OK:
    {ts '2008-01-19 10:00:00'} Kimi Raikkonen
    Then, if I use the umlaut characters in the name "Kimi Räikkönen", the resulting page reads:
    {ts '2008-01-19 12:00:00'} =?iso-8859-1?Q?Kimi_R=E4ikk=F6nen?=
    OK, I thought, this is the good old ISO/UTF problem, so I changed my Outlook settings to UTF-8 and tried again (with ä and ö); the result was:
    {ts '2008-01-19 14:00:00'} =?UTF-8?B?S2ltaSBSw6Rpa2vDtm5lbg==?=
    - I run the ColdFusion 8 server on Linux, with Cumulative Hot Fix 2 installed
    - I've tried the ExchangeServerLanguage attribute in the cfexchangeconnection tag, to no avail
    (This is nothing new. Every distribution since the first MX has had some problems with character sets other than lower ASCII.)
    I would be grateful for any help with this!

    Hi,
    if I use the cfexchangecalendar tag to retrieve calendar data with German umlauts, I receive text phrases such as "=?iso-8859-1?B?dGVzdPbk?=".
    If I use the cfexchangemail tag with the same Exchange connection, everything is perfect and the German umlauts are displayed correctly!
    (CF 9.0.1, Exchange 2007, SBS 2008, IIS7)
    Thanks
    Olaf
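
    Those =?...?= strings are RFC 2047 MIME "encoded-words", the format mail systems use for non-ASCII header text; here the subject header is coming back raw instead of decoded. As a workaround sketch, assuming the JavaMail jar is available to the underlying Java runtime, javax.mail's MimeUtility can decode them:

    import javax.mail.internet.MimeUtility;

    public class DecodeSubject {
        public static void main(String[] args) throws Exception {
            // An RFC 2047 encoded-word as returned by cfexchangecalendar
            String raw = "=?iso-8859-1?Q?Kimi_R=E4ikk=F6nen?=";
            // decodeText handles both the Q (quoted-printable) and B (base64) forms
            System.out.println(MimeUtility.decodeText(raw)); // Kimi Räikkönen
        }
    }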

  • What to do about the dreaded spinning beachball?

    I'm OK after a restart, for about 30 minutes, and then whatever I do, I get the beachball and everything slows down to a crawl. I've tried everything I know. It seems like there's something running in the background that takes time to start up, and that's why I have some grace time; no idea where to look, or what it may be. I have PLENTY of space left... Love to get suggestions. Thanks, everyone

    Hi Judy
    First you should always tell what troubleshooting tips you've used.
    Anyway, quit all open applications, then open ~/Library/Caches. Move this folder to the trash and empty it. There will probably be a lot of items that won't empty; to bypass this, do a secure empty (Finder > Secure Empty Trash).
    Also consider downloading OnyX. It does a lot of the maintenance tasks.
    Open Activity Monitor (Applications > Utilities > Activity Monitor) and sort the processes by the amount of CPU they use. If any applications are using too much, quit them. If applications or other processes are always using too much (and you aren't doing anything like watching a video from YouTube or copying lots of files), then shut down and reset the SMC.
    Also try resetting the PRAM.
    If your Mac continues running slow, then repair permissions in Disk Utility (Applications > Utilities > Disk Utility).
    How much RAM do you have?
    Chris
    Message was edited by: chrisfromhopewell

  • What to do about these overly large header images

    Hi,
    I have a header image that is big and wide, 950 px by 400 px. It takes a tad long to load, plus I need to make part of it clickable to link to another page.
    If this were six years ago I'd slice it up and put it in a table. But I refuse to do things the old way. Can anyone suggest a good way to make this header load faster and slice it up in a CSS-friendly manner?
    Thanks,
    Stan

    I'm not entirely correct.
    The loading of images varies with both the browser and the server.
    > HTTP/1.1 spec, 8.1.4: Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.
    If you follow this concept, loading 3 different 900 KB+ images from a single server should result in the third image not starting to load until either of the first two images completely loads. But this seems to be a theoretical spec. The HTTP/1.1 server I just tested does not seem to follow this. But then we rarely make persistent connections to a server.
    On a test I just ran with IE, FF, Safari, Chrome, and Opera, only Safari had a random loading of images with this layered DIV technique. The other browsers appear to start to load all the images at the same time (the smallest stand-in images fully load first as they arrive more quickly).
    You can test this for yourself on your own server and browsers with 3 different sets of an exaggerated small file of 2 KB and a large file of 900 KB. Unless the browser is extraordinary, the small files will load first.
    YSlow, a Firefox add-on, is very helpful in picking apart page load issues.
    > Make the real image as small as you can and let it go at that.
    At about 40 KB for the image in question, I would agree. This should not be a problem for the average DSL/cable user. Dial up users should already be tolerant of image loading times and would not be any more adversely affected.
    ...and if your client complains that the image loads slowly, tell them that you will work on it tonight and they should check it in the morning. The cached image will load much more quickly. :)

  • I'm looking for an anime-style fantasy RPG app that allows me to create/customize my main character. What apps come with these features? Thanks for any feedback you give me! :-)

    I'm new to posting on forums, but can anybody help me find the right app for me? Thanks!


  • Character Encoding and File Encoding issue

    Hi,
    I have a file which has data encoded using the default locale.
    I start the JVM in the same default locale and try to read the file.
    I took 2 approaches:
    1. Read the file using InputStreamReader() without specifying the encoding, so that the default one based on the locale will be picked up.
    -- This approach worked fine.
    -- I also printed the system property "file.encoding", which matched the current locale's encoding (the Unix command to get this is "locale charmap").
    2. In this approach, I read the file using InputStream as an array of raw bytes, and passed it to the String constructor to convert the bytes to a String.
    -- The String contained garbled data, meaning the conversion failed.
    I tried printing the encoding used by the JVM using an internal class, and the "file.encoding" property as well.
    These 2 values do not match; there is a weird difference.
    E.g. for locale ja_JP.eucjp on a Linux box:
    byte-to-character conversion uses the EUC_JP_LINUX encoding
    the file.encoding system property is EUC-JP-LINUX
    To get the byte-to-character encoding, I used the following (sun.io.*):
    ByteToCharConverter btc = ByteToCharConverter.getDefault();
    System.out.println("BTC uses " + btc.getCharacterEncoding());
    Do you have any idea why it is failing?
    My understanding was that the file encoding and the character encoding should always be the same by default.
    But because of this behaviour, I am a little perplexed.

    But there's no character encoding set for this operation: baos.write("���".getBytes());
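
    That one-line reply is the crux: new String(byte[]) and getBytes() with no argument silently use the platform default charset, which is exactly what approach 2 above tripped on. A minimal sketch of the explicit alternative (the string and charset are chosen only for illustration):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    public class ExplicitCharset {
        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();

            // Default-charset version: the bytes depend on the JVM's locale settings
            baos.write("grüß".getBytes());

            // Explicit version: the same bytes on every platform, every locale
            baos.reset();
            baos.write("grüß".getBytes(StandardCharsets.UTF_8));

            // Decoding must name the same charset the bytes were written with
            String roundTrip = new String(baos.toByteArray(), StandardCharsets.UTF_8);
            System.out.println(roundTrip); // grüß
        }
    }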

  • About paragraph and character format

    If you define both paragraph and character formats, which one will overwrite the other?

    If you use both character and paragraph formats to print a variable, the character format will override the paragraph format.
    Ex: AS --> paragraph format, font size 10, font family HELVE
        CH --> character format, font size 8
    Now, PS <CH> HELLO </>
    It will be of font family HELVE and with font size 8.
    I hope it is clear for you now.
    Regards,
    SaiRam

  • Character encoding again

    Hi, I haven't got any answer, so I'll try asking again...
    I have created a page from Data Controls.
    I have created a parameter form, and a table. The detail is shown at the bottom of the page (the current row is shown through #bindings...).
    Everything works fine: when I fill something into the parameter form, the table is filtered by that criterion, and when the current row is changed, the detail also changes. But when I enter a Czech character into the parameter form, it works pretty badly.
    The table is correctly filtered, but when I perform any other action after the filtering, the table shows no rows.
    I found what causes this problem. It is the Property in the bindings that holds the value for this parameter. When I first call bindings.findXXX.execute, the Property's value is "č", for example. That is correct, and the table shows the filtered rows. After I perform another action (I don't think it depends on what the action is; changing the current row, for example), the value in that Property has changed to "?" instead of "č", and because of this the filter is applied again and the table shows no rows. I have checked all the encodings, and the character encoding is set to UTF-8 everywhere. Is this the problem, or am I missing some setting?
    1) menu Tools > Preferences > Environment
    2) project properties > Compiler > Character Encoding
    3) in the jspx:
    <?xml version='1.0' encoding='UTF-8'?>
    <jsp:directive.page contentType="text/html;charset=UTF-8"
    pageEncoding="UTF-8"/>
    <afh:head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </afh:head>
    Are there others? I don't know; I set these settings a long time ago...
    Please, if someone knows, give me a hint. Thanks for the help.
    Jdeveloper 10.1.3.0.4(SU5)

    Check the regional language settings on the machine where your application server is running. I faced this problem but was able to resolve it by modifying
    NLS_LANG = AMERICAN_AMERICA.WE8ISO8859P1
    NLS_LANG is under HKEY_LOCAL_MACHINE ==> ORACLE.
    WE8ISO8859P1 is the standard encoding for my application, developed in a local Indian language, and it works fine for me.
    This may help; check it out.
    Amit
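
    A hedged aside on the "č" to "?" symptom above: that is exactly what Java's charset encoders do when a character has no mapping in the target charset. č (U+010D) is not in ISO-8859-1, so any Latin-1 hop in the chain (request parsing, NLS_LANG, database charset) will replace it. A small sketch:

    import java.nio.charset.StandardCharsets;

    public class UnmappableChar {
        public static void main(String[] args) {
            // č (U+010D) exists in ISO-8859-2, but not in ISO-8859-1
            byte[] latin1 = "č".getBytes(StandardCharsets.ISO_8859_1);
            // getBytes() substitutes '?' (0x3F) for unmappable characters
            System.out.println(latin1[0]); // prints 63, i.e. '?'

            // Round-tripping through UTF-8 keeps the character intact
            byte[] utf8 = "č".getBytes(StandardCharsets.UTF_8);
            System.out.println(new String(utf8, StandardCharsets.UTF_8)); // č
        }
    }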

  • Character Encoding in XML

    Hello All,
    I am not clear about solving the problem.
    We have a Java application on NT that is supposed to communicate with the same application on MVS mainframe through XML.
    We have a character encoding for these XML commands we send for communication.
    The problem is, on MVS the parser is not understanding the US-ASCII character encoding, and so we are getting the infamous "illegal character" error.
    The mainframe's file.encoding=CP1047 and
    NT's file.encoding=us-ascii.
    Is there any character encoding that is common to these two machines, mainframe and NT?
    If it is Unicode, what is the correct notation for it?
    Or is there any way of specifying to the parsers which character encoding should be used?
    thanks,
    Sridhar

    On the mainframe end, maybe something like:
    FileInputStream fris = new FileInputStream("C:\\whatever.xml");
    InputStreamReader is = new InputStreamReader(fris, "ASCII"); // or maybe "us-ascii", "US-ASCII"
    BufferedReader brin = new BufferedReader(is);
    Or give the InputStream/BufferedReader to whatever application you are using to parse the XML. The InputStreamReader should allow you to set your encoding even if the system doesn't have it as its native encoding. It depends, though, on which/whose JVM you are using; JDK 1.2 at least supports the encodings listed on this page: http://as400bks.rochester.ibm.com/pubs/html/as400/v4r4/ic2924/info/java/rzaha/javaapi/intl/encoding.doc.html
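
    For the writing side, one way to sidestep the CP1047-vs-ASCII mismatch is to write the XML with an explicit encoding and declare the same encoding in the prolog, so the parser on either platform doesn't have to guess. A minimal sketch (the file name and element are just examples):

    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    public class WriteXmlUtf8 {
        public static void main(String[] args) throws Exception {
            // The declared encoding and the writer's charset must match
            try (Writer out = new OutputStreamWriter(
                    new FileOutputStream("commands.xml"), "UTF-8")) {
                out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
                out.write("<command name=\"ping\"/>\n");
            }
        }
    }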

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. That's because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc., we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127, the trouble starts.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical in all of them and the rest were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong, the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard, the way it works gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes as a double-byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6-byte sequences. Using this MBCS (multi-byte character set) you can write the equivalent of every Unicode character and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then, in their text editor, using the codepage for their region, insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files, where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
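
    Point 3 is usually a one-line change. A minimal sketch in Java, assuming nothing beyond the standard library (the file name and text are just examples):

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class ExplicitTextIo {
        public static void main(String[] args) throws IOException {
            // Writing: the OutputStreamWriter pins the encoding instead of
            // inheriting whatever the platform default happens to be
            try (Writer w = new OutputStreamWriter(
                    new FileOutputStream("notes.txt"), StandardCharsets.UTF_8)) {
                w.write("naïve café\n");
            }

            // Reading: name the same encoding the file was written with
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    new FileInputStream("notes.txt"), StandardCharsets.UTF_8))) {
                System.out.println(r.readLine()); // naïve café
            }
        }
    }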
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It also adds the endian preamble to the file.)
    OK, you're reading and writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what those encoders created in the Java and .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around a while) – Always use Unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Lets start off with two key items
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And lets add a codacil to this – most Americans can get by without having to take this in to account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big-iron boxes before that, but at that time I wasn't even aware that character sets existed.
    They might only use that range, but that is a different issue, especially since that range is exactly the same as the UTF-8 character set anyway.
    >
    The computer industry started with diskspace and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 bits for each character. There of course were numerous charactersets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, is HTML and XML. Every HTML and XML file can optionally have the character encoding set in it's header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guess wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now lets' look at UTF-8 because as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a sersies of second bytes. Those then each lead to a third byte and those three bytes define the character. This goes up to 6 byte sequences. Using the MBCS (multi-byte character set) you can write the equivilent of every unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of Unicode, all of Unicode, are based on ASCII. The representational format of UTF-8 is required to implement Unicode, thus it must represent those characters. It uses the idiom supported by variable-width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that in their text editor, using the codepage for their region, insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first character fo a 2 byte sequence. You either get a different character or if the second byte is not a legal value for that first byte – an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with HTML/XML.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encode. If you must create with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Lets take what is actually a very difficlut example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know Java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in Java with escaped Unicode characters which will fail to compile.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone who is creating an inventory system for a standalone store to craft a solution that supports multiple languages.
    And another example: with high-volume systems, moving/storing bytes is relevant. As such, one must carefully consider each text element as to whether it is customer-consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems, incremental savings impact operating costs and, through speed, marketing advantage.

  • What's the difference in character encoding between 1.4.0 and 1.4.2 on Linux?

    As far as I can tell, the character encoding for Chinese in JDK 1.4.2 is no longer the same as in JDK 1.4.0.
    In JDK 1.4.0, the character encoding used the "file.encoding" system property; we often set the property to "gb2312".
    But in JDK 1.4.2, I find that the default character encoding no longer uses the "file.encoding" system property.
    Who knows the reason?
    Test program:
    public class B {
        public static void main(String[] args) throws Exception {
            byte[] bytes = new byte[]{(byte)0xD6, (byte)0xD0, (byte)0xCE, (byte)0xC4}; // "中文" in GB2312
            String s1 = new String(bytes);
            String s2 = new String(bytes, System.getProperty("file.encoding"));
            System.out.println("s1=" + s1 + " , s2=" + s2);
            System.out.println("s1.length=" + s1.length() + " , s2.length=" + s2.length());
        }
    }
    Run it four times; the results:
    [root@app15 component]# /usr/local/j2sdk1.4.0/bin/java -Dfile.encoding=ISO-8859-1 -cp . B
    s1=中文 , s2=中文
    s1.length=4 , s2.length=4
    [root@app15 component]# /usr/local/j2sdk1.4.0/bin/java -Dfile.encoding=gb2312 -cp . B
    s1=中文 , s2=中文
    s1.length=2 , s2.length=2
    [root@app15 component]# /usr/local/j2sdk1.4.2/bin/java -Dfile.encoding=ISO-8859-1 -cp . B
    s1=中文 , s2=中文
    s1.length=4 , s2.length=4
    [root@app15 component]# /usr/local/j2sdk1.4.2/bin/java -Dfile.encoding=gb2312 -cp . B
    s1=中文 , s2=??
    s1.length=4 , s2.length=2
    [root@app15 component]#

    I don't know for sure, but:
    -- The API documentation for String says that "new String(byte[])" uses "the platform's default charset".
    -- The API documentation for Charset says "The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system."
    You'll notice that it doesn't say anything about using the file.encoding system property, so presumably (based on your experiments) it doesn't. I did a search for "java default charset" and didn't find anything specific, but one site says "As of Java 1.4.1, the default Charset varies from platform to platform" and suggests you explicitly hard-code your charset. I would agree with that.
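
    A quick way to see what a given JVM actually resolved, as a sketch (Charset.defaultCharset() exists from Java 5 on, so this targets newer JVMs than the 1.4.x ones discussed above):

    import java.nio.charset.Charset;

    public class ShowDefaults {
        public static void main(String[] args) {
            // What the runtime uses for new String(byte[]), getBytes(), etc.
            System.out.println("defaultCharset = " + Charset.defaultCharset());
            // The system property; not guaranteed to drive the default charset
            System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        }
    }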

  • How can I tell what character encoding is sent from the browser?

    Hi,
    I am developing a servlet which is supposed to be used to send and receive messages in multiple character sets. However, I read in previous postings that each WebLogic Server can only support one input character encoding. Is that true?
    And do you have any suggestions on how I can do what I want? For example, I have an HTML form for people to post comments (they may post in any character set, like Shift-JIS, Big5, GB, etc.). I need to know what character encoding they are using before I can read it correctly in the servlet and save it in the database.

    From what I understand (I haven't used it yet) 6.1 supports the 2.3
    servlet spec. That should have a method to set the encoding.
    Otherwise, I don't think you can support multiple encodings in one
    instance of WebLogic.
    From what I know browsers don't give any indication at all about what
    encoding they're using. I've read some chatter about the HTTP spec
    being changed so it's always UTF-8, but that's a Some Day(TM) kind of
    thing, so you're stuck with all the stuff out there now which doesn't do
    everything in UTF-8.
    Sorry for the bad news, but if it makes you feel any better I've felt
    your pain. Oh, and trying to process multipart/form-data (file upload)
    forms is even worse and from what I've seen the API that people talk
    about on these newsgroups assumes everything is ISO-8859-1.
    Emmy Lau wrote:
    >
    Hi,
    I am developing a servlet which supposed to be used to send and receive message
    in multiple character set. However, I read from the previous postings that each
    Weblogic Server can only support one input character encoding. Is that true?
    And do you have any suggestions on how I can do what I want. For example, I
    have a HTML form for people to post any comments (they may post in any characterset,
    like ShiftJIS, Big5, Gb, etc). I need to know what character encoding they are
    using before I can read that correctly in the servlet and save in the database.
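
    The 2.3-spec method the reply mentions is ServletRequest.setCharacterEncoding. A hedged sketch of the usual pattern, a filter that applies it before any parameter is read (the class name and chosen charset are illustrative):

    import java.io.IOException;
    import javax.servlet.*;

    // Illustrative filter: forces a known request encoding before parameters are parsed
    public class EncodingFilter implements Filter {
        public void init(FilterConfig config) {}

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            // Must run before the first getParameter() call, or it has no effect
            if (req.getCharacterEncoding() == null) {
                req.setCharacterEncoding("UTF-8");
            }
            chain.doFilter(req, res);
        }

        public void destroy() {}
    }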
