Character encoding in .java file

If I write a file with:

    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("MyClass.java"), "Shift_JIS"));

and then compile it with:

    javac -encoding SJIS MyClass.java

and then load it into a MIDP application on a mobile, the characters in MyClass.java and on the mobile come out as &#45674;.
I am running Linux in an English shell; would it make a difference if I set the shell's language to Japanese?
If I save that same file manually with jEdit (the open-source Java text editor) in SJIS, all is happy: the code in the file is unreadable to me (because of my English system) but runs on the mobile just fine.
Is this a bug? A limitation of Java? An error on my part?
TIA
Shawn

I suppose the font does not support the character '\uB26A' (a hangul syllable).

Ummm, well actually I get a ? plus a real Chinese character, which is much closer than that numeric junk I was getting.
So maybe I need to try saving the .java file in Unicode and then compiling with -encoding SJIS.
Was that the character for immensely satisfying Korean BBQ??? Just curious.
Shawn
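
One way to take the toolchain out of the equation is to round-trip the bytes yourself. A minimal sketch (plain JDK, no MIDP involved; the file name is arbitrary) that writes Shift_JIS and reads it back, so you can tell whether the encoding survives before javac ever sees it:

    import java.io.*;

    public class ShiftJisRoundTrip {
        public static void main(String[] args) throws IOException {
            String original = "\u65E5\u672C\u8A9E"; // Japanese text, written as Unicode escapes
            // Encode to Shift_JIS bytes on the way out...
            try (Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "Shift_JIS")) {
                w.write(original);
            }
            // ...and decode with the same charset on the way back in.
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(new FileInputStream("test.txt"), "Shift_JIS"))) {
                System.out.println(original.equals(r.readLine())); // true if the round trip is lossless
            }
        }
    }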

Similar Messages

  • Accented character encoding in JNLP files

    Hi everyone:
    Maybe I am missing something trivial, but I am having trouble encoding French accented characters in my JNLP files. For example, I tried to encode the name "Québec" as follows:
    - Québec: the file is not parsed completely and the application-desc tag appears to be missing. I know that this problem has already been reported by German and Swedish fellows.
    - Québec: the file is parsed correctly but the é tag is not converted to "é."
    - Québec: the é tag is not converted either.
    - Qu\u00d9bec: no conversion...
    - Qu\00d9;bec: still no conversion...
    How should I encode my accented characters so that they appear correctly in Java Web Start presentation windows?
    Thanks in advance for any help...
    Jean-François Morin

    Rather than trying some kind of escape sequence, how about just storing the file in whatever "native" codepage your system likes and then converting it to ASCII (with Unicode escapes) using the native2ascii utility for deployment (not ideal, but it may work):
    As mentioned here:
    http://java.sun.com/products/javawebstart/docs/developersguide.html#dev
    Which links to here:
    http://java.sun.com/products/jdk/1.1/docs/tooldocs/win32/native2ascii.html
    (the document appears to be identical for the various JDK releases--if you want the 1.3 version, just change 1.1 to 1.3 in the URL above).
    I don't know if this is simpler or not. HTH!
    John
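
    For what it's worth, a typical invocation looks something like this (the file names are hypothetical, and the -encoding value must match whatever codepage the file was actually saved in):

        native2ascii -encoding ISO-8859-1 launch.jnlp launch-ascii.jnlp

    This rewrites each accented character as a \uXXXX escape, so the output is plain ASCII no matter how it is transcoded later.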

  • File name character encoding in zipped file

    Hi All,
    I have the following problem: I have received a zipped file which contains a folder and file structure. The file was zipped on Windows XP with Czech language settings. The problem is that the folder/file names use Czech characters (codepage CP1250), and the characters in the file names are corrupted after unzipping.
    Is there any way to specify the codepage during unzipping or, more generally, to convert file names from one codepage to the UTF-16 used by the Mac's file system?
    Thanks for ideas.

    You might give this a try for converting the filenames after unzipping, though I think it is designed for the old Mac CE encoding rather than the Windows version:
    http://support.apple.com/downloads/FileName_Encoding_RepairUtility
    Otherwise the only option may be manually converting the text strings with something like TextWrangler.
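
    If Java is an option, JDK 7 and later also let you tell the zip reader which charset to use for entry names. A minimal sketch (the archive name is hypothetical, and "windows-1250" assumes the archive really was built with CP1250):

        import java.io.File;
        import java.nio.charset.Charset;
        import java.util.Enumeration;
        import java.util.zip.ZipEntry;
        import java.util.zip.ZipFile;

        public class ListCzechZip {
            public static void main(String[] args) throws Exception {
                // Decode entry names as CP1250 instead of the platform default.
                try (ZipFile zip = new ZipFile(new File("archive.zip"), Charset.forName("windows-1250"))) {
                    for (Enumeration<? extends ZipEntry> e = zip.entries(); e.hasMoreElements();) {
                        System.out.println(e.nextElement().getName()); // proper Unicode strings
                    }
                }
            }
        }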

  • Character encoding and Property files

    Hi
    I currently have some source code that has some hardcoded Strings with accented characters (like protégé). I'm trying to move these out to a properties file. I copied the Strings to a properties file and loaded it into the application with the following code:
          ResourceBundle emailProps = ResourceBundle.getBundle("package.whatever.email");
          emailBody = emailProps.getString("email.body");

    I tried a System.out.println for both the hardcoded and "loaded from properties" method and they came out different:
    hardcoded: prot�g�
    from properties: prot��g��
    Can somebody tell me why the String loaded from properties came out like this? What can I do to fix this?
    Thanks

    Thanks for your help
    No, I don't know the encoding. How can I find out?
    Just an update, though I'm still trying some stuff. I tried to do a native2ascii ant task on the file (ISO-8859-1) as suggested in the Properties javadoc. As expected, it converted the accented characters to \uXXXX format. However, the output is still the same
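
    One thing worth checking is what encoding the .properties file was actually saved in: the classic loaders read properties files as ISO-8859-1, so a file saved as UTF-8 shows exactly this kind of doubled garbling. A minimal sketch (PropertyResourceBundle(Reader) needs Java 6+, try-with-resources Java 7+; the file name is hypothetical) that loads the bundle through a reader with an explicit charset:

        import java.io.FileInputStream;
        import java.io.InputStreamReader;
        import java.util.PropertyResourceBundle;

        public class LoadUtf8Bundle {
            public static void main(String[] args) throws Exception {
                // Read the file as UTF-8 instead of the ISO-8859-1 the default loader assumes.
                try (InputStreamReader reader = new InputStreamReader(
                        new FileInputStream("email.properties"), "UTF-8")) {
                    PropertyResourceBundle bundle = new PropertyResourceBundle(reader);
                    System.out.println(bundle.getString("email.body"));
                }
            }
        }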

  • What's the difference in character encoding between 1.4.0 and 1.4.2 in Linux

    As I find it, the character encoding for Chinese in JDK 1.4.2 is no longer the same as in JDK 1.4.0.
    In JDK 1.4.0, the character encoding used the "file.encoding" system property; we often set the property to "gb2312".
    But in JDK 1.4.2, I find that the default character encoding no longer uses the "file.encoding" system property.
    Who knows the reason?
    Test Program:
    public class B {
        public static void main(String[] args) throws Exception {
            byte[] bytes = new byte[]{(byte)0xD6, (byte)0xD0, (byte)0xCE, (byte)0xC4};
            String s1 = new String(bytes);
            String s2 = new String(bytes, System.getProperty("file.encoding"));
            System.out.println("s1=" + s1 + " , s2=" + s2);
            System.out.println("s1.length=" + s1.length() + " , s2.length=" + s2.length());
        }
    }

    Run it four times; the results are:
    [root@app15 component]# /usr/local/j2sdk1.4.0/bin/java -Dfile.encoding=ISO-8859-1 -cp . B
    s1=中文 , s2=中文
    s1.length=4 , s2.length=4
    [root@app15 component]# /usr/local/j2sdk1.4.0/bin/java -Dfile.encoding=gb2312 -cp . B
    s1=中文 , s2=中文
    s1.length=2 , s2.length=2
    [root@app15 component]# /usr/local/j2sdk1.4.2/bin/java -Dfile.encoding=ISO-8859-1 -cp . B
    s1=中文 , s2=中文
    s1.length=4 , s2.length=4
    [root@app15 component]# /usr/local/j2sdk1.4.2/bin/java -Dfile.encoding=gb2312 -cp . B
    s1=中文 , s2=??
    s1.length=4 , s2.length=2
    [root@app15 component]#

    I don't know for sure, but:
    -- The API documentation for String says that "new String(byte[])" uses "the platform's default charset".
    -- The API documentation for Charset says "The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system."
    You'll notice that it doesn't say anything about using the file.encoding system value, so presumably (based on your experiments) it doesn't. I did a search for "java default charset" and didn't find anything specific, but this site says "As of Java 1.4.1, the default Charset varies from platform to platform" and suggests you explicitly hard-code your charset. I would agree with that.
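
    To see which value a given JVM actually resolved, it can help to print both side by side. A minimal sketch (Charset.defaultCharset() needs Java 5 or later, so it won't run on the 1.4 JDKs discussed here, but it illustrates the distinction):

        import java.nio.charset.Charset;

        public class ShowDefaultCharset {
            public static void main(String[] args) {
                // What the NIO charset machinery actually uses by default.
                System.out.println("defaultCharset = " + Charset.defaultCharset());
                // The system property, which may or may not match on a given JDK.
                System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
            }
        }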

  • How to set a platform's default character encoding

    Hi,
    Does anybody know how to set how to set a platform's default character encoding in Java? Thank you.
    Yugang
    [email protected]

    You do mean for Java, not from Java? (The latter would make absolutely no sense at all.) If so, pass it to the runtime using the -D switch (you've got Sun's java, right?):
    java -Dfile.encoding=the-encoding-i-like the.name.of.YourClass

  • Which character encoding does Adobe ExportPDF use when converting to a Word document?

    Which character encoding does Adobe ExportPDF use when converting to a Word document?

    Hi Ram,
    Sorry for the long delay. I've been trying to track down this answer for you.
    We're using UTF-8 character encoding for our files.
    -David

  • Encoding of java source files

    I've successfully compiled Java source files (*.java) saved in a lot of different encodings (ANSI, Unicode, UTF-8, etc.). My IDE (Eclipse) has Cp1252 as the default on Windows.
    Is there an 'official' encoding for source files?

    Short answer is Unicode, but that's not enough.
    From the Java Language Specification:
    "Except for comments (§3.7), identifiers, and the contents of character and string literals (§3.10.4, §3.10.5), all input elements (§3.5) in a program are formed only from ASCII characters (or Unicode escapes (§3.3) which result in ASCII characters). ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode character encoding are the ASCII characters."
    There's more that applies; I'll not quote it, read chapter 3 of the JLS for details.
    http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
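
    As a consequence, a source file can stay pure ASCII and still contain any Unicode character by using escapes; a minimal illustration:

        public class EscapeDemo {
            public static void main(String[] args) {
                // \u00e9 is 'é'; the source file itself contains only ASCII bytes.
                String s = "Qu\u00e9bec";
                System.out.println(s.length()); // 6 - the escape is a single char
            }
        }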

  • How to set the system default file character encoding to UTF-8?

    Hi all. This is driving me nuts on both my Windows box and Snow Leopard; I figure there's much more chance of finding the answer for OS X.
    My language and locale are set to Australian English. $LANG=en_AU.UTF-8
    However, as I believe is expected, OS X (and Windows for that matter) will create files by default with character encoding of Cp1252 (Latin-1). That is, the FILE encoding in the file metadata - the Byte Order Mark I believe. The file itself, not the characters written to it.
    This, in a word, bites. I don't want to be restricted to only ASCII by default, and it is causing me problems with certain software (a Firefox plugin) that creates text files, passing in UTF-8 encoded content, which is then mangled because the file encoding itself is still Cp1252. (I know, I've tested this by changing the file encoding manually and having it overwritten again by the plugin: works correctly.)
    As a simple example, just `touch somefile` from terminal creates a file in Cp1252 -- I'm obtaining that info by opening in jEdit by the way (anyone know of something better?).
    In other locales that are not English-based, I believe the default file encoding is UTF-8. But surely this can be controlled independently? There must be a system configuration value somewhere that specifies file encoding default. Can someone please tell me what it is?
    Thanks!

    However, as I believe is expected, OS X (and Windows for that matter) will create files by default with character encoding of Cp1252 (Latin-1). That is, the FILE encoding in the file metadata - the Byte Order Mark I believe. The file itself, not the characters written to it.
    Apps like TextEdit and Mail have settings that let you determine the encoding of text produced. The default would normally depend on the character content of the file, ranging from ASCII for basic English to Windows Latin-1 (Win 1252) or ISO Latin -1 (ISO 8859-1) to UTF-8 for other content.
    Win 1252 is not ASCII; it has twice the number of characters of the latter.
    The Byte Order Mark is something totally different: it's a particular character used to signal certain encodings.
    http://en.wikipedia.org/wiki/Byte_order_mark
    As a simple example, just `touch somefile` from terminal creates a file in Cp1252 -- I'm obtaining that info by opening in jEdit by the way (anyone know of something better?).
    For what Terminal does and how to change it, it might best to post in the Unix forum:
    http://discussions.apple.com/forum.jspa?forumID=735
    For problems with a FireFox plugin, it might be good to ask on their own forums as well.

  • How to use the character encoding model to use '�' in Java

    hi
    I have to use special characters like '�' in Java, but when the '�' character is transferred to the database or displayed in my form, it is shown as '?'. I need to display the original characters as they are.
    What can I do to get the expected result? I need it very urgently, so please reply as soon as possible.
    advance thank you
    rgds
    Oasisdeserts

    Java stores all characters as 16-bit Unicode (which has plenty of room for far more exotic characters than Greek). Every time character data is brought into, or stored from, the Java environment it passes through a specific character encoding. If you don't state the encoding explicitly, it uses the default encoding of the machine it's running on (which is typically oriented to the language set in the locale).
    With a database, the database driver is supposed to do any necessary translation. The non-Latin character could be stored wrongly, the database could be returning it wrongly, whatever's creating your form could be interpreting it wrongly, or it could lack the capacity to display the character.
    If it's, say, a JSP try setting the page encoding to UTF-8. Try dumping the characters read from the database in HEX and look them up at www.unicode.org.
    Take a look at the database setup information, some have an option of storing characters as UNICODE or ASCII.
    The "?" is substituted whenever a codec can't make sense of the codes it's translating, either the byte array is invalid or the UNICODE character can't be represented.

  • How to set character encoding to Greek in a javascript file in Dreamweaver CS4?

    I have a javascript file built with a text editor some time ago, which has Greek characters in one of the variables:
    1.) --> var scrollercontent='</font><br><font face="Verdana" color="#333333" size="-1">11-04-2008: <b>ΜΑΝΩΛΗΣ ΜΗΤΣΙΑΣ</b><br><img SRC="mitsias2.gif" height=148 width=104 border="0" /><p></font>' <--
    In DW CS4 it displays as
    2.) --> var scrollercontent='</font><br><font face="Verdana" color="#333333" size="-1">11-04-2008: <b>???O??S ???S??S</b><br><img SRC="mitsias2.gif" height=148 width=104 border="0" /><p></font>' <--
    When trying to save the corrected content in 1.) I get an error message that DW cannot save the Greek characters. I am asked to change the character encoding.
    I don't know how to change the character encoding in a .js file. Is there a "head" section like in HTML? Or how does one determine the character encoding of a .js file?
    When I change the encoding in the preferences from "Western" to "Greek" or "UTF", I have the same result.
    Thanks for help.
    jml

    The .js file and the html need to have the same encoding. If your html uses iso-8859-7, then the .js must also use that. But if the original text editor created the .js file using utf-8, then that is what the html needs to use.
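
    If the two files cannot be saved in the same encoding, HTML also lets you declare the script's encoding explicitly via the charset attribute; a minimal sketch (the file name is hypothetical):

        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">
        <script type="text/javascript" src="scroller.js" charset="iso-8859-7"></script>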

  • Character Encoding and File Encoding issue

    Hi,
    I have a file whose data is encoded using the default locale.
    I start the JVM in the same default locale and try to read the file.
    I took 2 approaches:
    1. Read the file using InputStreamReader() without specifying the encoding, so that the default one based on the locale will be picked up.
    -- This approach worked fine.
    -- I also printed the system property "file.encoding", which matched the current locale's encoding (the Unix command to get this is "locale charmap").
    2. In this approach, I read the file using InputStream as an array of raw bytes, and passed it to the String constructor to convert the bytes to a String.
    -- The String contained garbled data, meaning the decoding failed.
    I tried printing the encoding used by the JVM using an internal class, and the "file.encoding" property as well.
    These 2 values do not match; there is a weird difference.
    For e.g. for locale ja_JP.eucjp on linux box :
    byte-character uses EUC_JP_LINUX encoding
    file.encoding system property is EUC-JP-LINUX
    To get byte to character encoding, I used following methods (sun.io.*):
    ByteToCharConverter btc = ByteToCharConverter.getDefault();
    System.out.println("BTC uses " + btc.getCharacterEncoding());
    Do you have any idea why it is failing?
    My understanding was that the file encoding and character encoding should always be the same by default.
    But because of this behaviour, I am a little perplexed.

    But there's no character encoding set for this operation: baos.write("���".getBytes());
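
    The fix for that kind of call is to name the charset on both sides of the conversion; a minimal sketch (EUC-JP chosen to match the locale in the question):

        import java.io.ByteArrayOutputStream;
        import java.nio.charset.Charset;

        public class ExplicitCharset {
            public static void main(String[] args) throws Exception {
                Charset cs = Charset.forName("EUC-JP");
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                // Encode with an explicit charset rather than the platform default...
                baos.write("\u65E5\u672C\u8A9E".getBytes(cs));
                // ...and decode with the same one, so the round trip is lossless.
                String back = new String(baos.toByteArray(), cs);
                System.out.println(back);
            }
        }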

  • NetBeans problem: Issue with servlets and Chinese character encoding

    Java Version: JDK1.5.0_01, JRE1.5.0_01 (International version)
    Netbeans Version: Netbeans IDE 4.0
    OS: Windows XP Personal Edition
    Dear Sirs,
    First of all, thanks for reading this post. I am having the following issue. I am creating an application using html pages and servlets. I am using the Chinese and English languages on them (html encoding UTF-8).
    I created a project in Netbeans and added an index.html page reporting to a servlet. Both index.html and the servlet-generated html page contain the line:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    Additionally, I set up the character code settings in Netbeans
    (Tools - Options - Java Sources - Expert - default encoding = UTF-8).
    When I run the project, index.html displays itself perfectly, with the Chinese characters displayed properly. The problem comes when the servlet-generated html is displayed, which instead of the Chinese characters shows some strange characters (�� instead of Chinese).
    I have tried different encodings from http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html without any luck. I also setup the encoding of the file itself (using right click-properties in the project menu of Netbeans).
    Also, when I am editing the servlet, the characters are displayed properly. I type them in directly without any issue, but then the display is wrong at runtime.
    Also, just in case this has something to do with the problem: my PC was bought in the US, so the default character set is not Chinese. I had to install the Chinese input support later on. But like I said earlier, the html page is displayed properly, so I really think it is some problem with Netbeans.
    After a week trying to find a solution, I decided to post it here in the hopes that someone will show me the way of the light.
    Thanks in advance for any ideas or help provided
    Aral.

    Ok, I found out some problems with Netbeans as well.

        public void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException, ServletException {
            response.setCharacterEncoding("UTF-8");
            request.setCharacterEncoding("UTF-8");
            response.setContentType("text/html");
            PrintWriter out = response.getWriter();
            byte[] st = {-25,-75,-124,-27,-100,-106,-17,-68,-102,-27,-80,-113,-27,-72,-125,-26,-118,-75,-26,-105,-91,-27,-82,-93};
            out.println("this works: ");
            out.println(new String(st, "UTF-8"));
            out.println("<br>");
            out.println("this doesn't: ");
            out.println("some chinese copied from the Internet<br>");
        }

    Right click the .java file and choose Properties -> Encoding: UTF-8.
    Then I make a copy of the .java file, rename it to .html and open it with IE; sure enough,
    the Chinese is already unreadable (note: it's still readable in the IDE).
    When I compile the file with F9 I get the following warning:
    whatever.java:101: warning: unmappable character for encoding Cp1252
    I tried to set the encoding to UNICODE but then the file doesn't compile.
    I guess you have to download the Japanese version for it to work correctly.
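
    For what it's worth, the same fix works from the command line by telling javac how the source file is encoded (using the thread's own file name):

        javac -encoding UTF-8 whatever.java

    If the file really is UTF-8 on disk, the Cp1252 warning goes away.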

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte and those three bytes define the character. This goes up to 6 byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that in their text editor, using the codepage for their region, inserts a character like ß, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first character of a 2 byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
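
    A two-line experiment in Java makes that failure mode concrete; a sketch showing the same character's bytes under two encodings:

        import java.util.Arrays;

        public class MojibakeDemo {
            public static void main(String[] args) throws Exception {
                String s = "\u00DF"; // the character ß from the example above
                // One byte in Latin-1, two bytes in UTF-8: read with the wrong
                // decoder, those bytes mean something entirely different.
                System.out.println(Arrays.toString(s.getBytes("ISO-8859-1"))); // [-33]
                System.out.println(Arrays.toString(s.getBytes("UTF-8")));      // [-61, -97]
            }
        }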
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
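
    In Java that means preferring the constructors that take a charset; a minimal sketch (StandardCharsets needs Java 7+; the file name is arbitrary):

        import java.io.*;
        import java.nio.charset.StandardCharsets;

        public class ExplicitIo {
            public static void main(String[] args) throws IOException {
                // Write with a named encoding rather than the platform default...
                try (Writer w = new OutputStreamWriter(new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
                    w.write("na\u00EFve caf\u00E9\n");
                }
                // ...and read it back with the same one.
                try (BufferedReader r = new BufferedReader(
                        new InputStreamReader(new FileInputStream("out.txt"), StandardCharsets.UTF_8))) {
                    System.out.println(r.readLine());
                }
            }
        }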
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.

    Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.
    They might only use that range but that is a different issue, especially since that range is exactly the same as the UTF8 character set anyways.
    >
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte and those three bytes define the character. This goes up to 6 byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of Unicode (all of Unicode) are based on ASCII. The representational format of UTF-8 is required to implement Unicode, thus it must represent those characters. It uses the idiom supported by variable-width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that in their text editor, using the codepage for their region, inserts a character like ß, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first character of a 2 byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with html/xml.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in java with escaped unicode characters which will fail to compile.
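
    For instance, a sketch of that trap: the following fails to compile, because the escape is translated before the string literal is even parsed.

        public class EscapeTrap {
            public static void main(String[] args) {
                // Compile error: the Unicode escape for U+000A is replaced by a
                // real newline before parsing, which splits the string literal.
                String s = "\u000A";
            }
        }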
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone who is creating an inventory system for a stand-alone store to craft a solution that supports multiple languages.
    And as another example: in high-volume systems, moving/storing bytes is relevant. As such one must carefully consider each text element as to whether it is customer-consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems incremental savings impact operating costs and marketing advantage through speed.

  • XML parser not detecting character encoding

    Hi,
    I am using Jdeveloper 9.0.5 preview and the same problem is happening in our production AS 9.0.2 release.
    The character encoding of an xml document is not being correctly detected by the oracle v2 parser, even though the xml declaration correctly contains
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    Instead, it treats the document as UTF-8 encoded, which is fine until a document comes along with an extended character, which then causes a
    java.io.UTFDataFormatException: Invalid UTF8 encoding.
    at oracle.xml.parser.v2.XMLUTF8Reader.checkUTF8Byte(XMLUTF8Reader.java:160)
    at oracle.xml.parser.v2.XMLUTF8Reader.readUTF8Char(XMLUTF8Reader.java:187)
    at oracle.xml.parser.v2.XMLUTF8Reader.fillBuffer(XMLUTF8Reader.java:120)
    at oracle.xml.parser.v2.XMLByteReader.saveBuffer(XMLByteReader.java:448)
    at oracle.xml.parser.v2.XMLReader.fillBuffer(XMLReader.java:2023)
    at oracle.xml.parser.v2.XMLReader.tryRead(XMLReader.java:972)
    at oracle.xml.parser.v2.XMLReader.scanXMLDecl(XMLReader.java:2589)
    at oracle.xml.parser.v2.XMLReader.pushXMLReader(XMLReader.java:485)
    at oracle.xml.parser.v2.XMLReader.pushXMLReader(XMLReader.java:192)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:144)
    as you can see it is explicitly casting the XMLUTF8Reader to perform the read.
    I can get around this by hard-coding the xml input stream to be processed by a reader:
    XMLSource = new StreamSource(new InputStreamReader(XMLInStream,"ISO-8859-1"));
    however the manual documents that the character encoding is automatically picked up from the xml file and wrapping the stream in a reader should not be necessary, so I should be able to write
    XMLSource = new StreamSource(XMLInStream);
    Does anyone else experience this same problem?
    Having to hardcode the encoding causes my software to lose flexibility.
    Jarrod Sharp.

    An XML document must actually be created (saved) in 'ISO-8859-1' encoding for it to be parsed as 'ISO-8859-1'; the declaration alone does not change the bytes on disk.
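
    If the parser will not honor the declaration, the workaround above can at least be made explicit and reusable. A minimal sketch using the standard JAXP types (the method and variable names are made up for illustration):

        import java.io.InputStream;
        import java.io.InputStreamReader;
        import javax.xml.transform.stream.StreamSource;

        public class EncodedSource {
            // Wrap the raw byte stream in a Reader with the known encoding,
            // so the parser never has to guess.
            static StreamSource sourceFor(InputStream xmlIn, String encoding) throws Exception {
                return new StreamSource(new InputStreamReader(xmlIn, encoding));
            }
        }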
