Specifying An Encoding

Right, while this is in my head, and I don't forget it myself... ;-)
String myString = "blah";
byte[] myWeirdEncodingBytes = myString.getBytes("myWeirdEncoding");
myString is in UTF-16, no matter what. The encoding specified in the getBytes() method is the target encoding. So myWeirdEncodingBytes is an array of bytes stored in a weird encoding, rather than UTF-16.
String myNewString = new String(myWeirdEncodingBytes, "myWeirdEncoding");
myNewString is UTF-16 again, no matter what. The encoding specified in the constructor is the source encoding. So myWeirdEncodingBytes are still in a weird encoding. But building a new String out of them requires an encoding conversion to occur, as the String must be in UTF-16. So we need to let the constructor know which encoding to convert from.
A Java byte has no encoding of its own - it's simply eight raw bits of data, and a byte[] holding text is in whatever encoding its producer used. Only char and String have a fixed encoding: Java exposes their contents as UTF-16 code units.
In short, if you pull data into Java from an external source, the encoding it sits in within Java sticks to the rules above. If it first hits Java as a byte stream (say from an InputStream), it's going to be in whatever encoding it was already in. If it first hits Java as a String (say from request.getParameter() in a servlet or JSP), it will be in UTF-16, as Java will convert its encoding immediately.
The same holds when writing back out. If you write out a series of bytes (such as with OutputStream), the resulting output is in the encoding the bytes were in to begin with. If you're outputting a String (say through a Writer), Java converts it from UTF-16 to the encoding the Writer was set up with, which is the platform default unless you specify one.
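To make the target/source distinction concrete, here is a minimal sketch. It assumes ISO-8859-1 as a stand-in for "myWeirdEncoding" (which is not a real charset name); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDirection {
    // Encode: the charset given to getBytes() is the TARGET encoding.
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Decode: the charset given to the String constructor is the SOURCE encoding.
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // the String API always deals in UTF-16 code units
        byte[] latin1 = encode(original);
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(latin1)); // [99, 97, 102, -23]
        System.out.println(Arrays.toString(utf8));   // [99, 97, 102, -61, -87]
        System.out.println(decode(latin1).equals(original)); // true
    }
}
```

The same String yields different byte arrays depending on the target charset, and only the matching source charset turns the bytes back into the original text.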
Hope that clears a few people's heads (mine included!)....
Martin Hughes

I think that you should also point out that random data cannot be stored in a string (easily). In general:
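byte[] inputData = <some random data>;
String inputString = new String(inputData, "an encoding");
byte[] outputData = inputString.getBytes("an encoding");
assert Arrays.equals(inputData, outputData); // ! may FAIL
even if "an encoding" is the same in both places. Why? Because some byte (or bytes) in the input may not form a valid character in "an encoding". In that case the result of the conversion is unspecified; current JDKs substitute a replacement character, so the original bytes are lost. (Note that inputData.equals(outputData) would only compare array references, so Arrays.equals is needed to compare contents.)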
However:
String inputString = <some random string>;
byte[] inputData = inputString.getBytes("an encoding");
String outputString = new String(inputData, "an encoding");
assert(inputString.equals(outputString)); // ! TRUE
is guaranteed, as long as "an encoding" supports all of the characters in inputString.
(Someone please tell me if I'm wrong!)
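A small sketch of both directions, assuming UTF-8 as "an encoding" and a hand-picked invalid byte sequence in place of random data (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripCheck {
    // bytes -> String -> bytes: lossy when the bytes are not valid in the charset.
    public static boolean bytesSurvive(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        byte[] output = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(input, output); // compares contents, not references
    }

    // String -> bytes -> String: safe when the charset can encode every character.
    public static boolean stringSurvives(String input) {
        byte[] encoded = input.getBytes(StandardCharsets.UTF_8);
        return input.equals(new String(encoded, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xFF, (byte) 0xFE, 0x41}; // 0xFF and 0xFE never occur in UTF-8
        System.out.println(bytesSurvive(invalid));               // false
        System.out.println(stringSurvives("Agnartj\u00f8rna")); // true
    }
}
```

If arbitrary binary data really has to travel inside a String, an encoding designed for that purpose (Base64, for instance) avoids the problem.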

Similar Messages

Numbers to CSV export script: how to specify the encoding?

Hi,
I'm using the following script to export a Numbers document to CSV:
# Command-line tool to convert an iWork '09 Numbers
# document to CSV.
# Parameters:
# - input: Numbers input file
# - output: CSV output file
# Attik System, Philippe Lang
# Creation date: 31 mai 2012
# Modification date:
on run argv
    # We retrieve the path of the script
    set myPath to (path to me)
    tell application "Finder" to set myFolder to folder of myPath
    # We get the command line parameters
    set input_file to item 1 of argv
    set output_file to item 2 of argv
    # We retrieve the extension of the file
    set theInfo to (info for (input_file))
    set extname to name extension of (theInfo)
    # Paths
    set input_file_path to (myFolder as text) & input_file
    set output_file_path to (myFolder as text) & output_file
    if extname is equal to "numbers" then
        tell application "Numbers"
            open input_file_path
            save document 1 as "LSDocumentTypeCSV" in output_file_path
            close every window saving no
        end tell
    end if
end run
It works fine, except that I don't know how to specify the encoding of the text in the CSV file (Latin1, MacRoman, Unicode). This option is available in the export dialog of Numbers. Any hint on how to do that is welcome. (GUI Scripting?)
Where can I find documentation on the iWork "vocabulary" available? Is there a definitive documentation somewhere? I tried to record a manual export in the script editor, without success. The recorded script is more or less empty.
Thanks!
Philippe Lang

A further note from Yvan. He's made some revisions to the script sent earlier.
--[SCRIPT export to CSV with selected encoding]
(*
I added some features.
(1) Defining the encoding through the preferences file applies only if
the application is not in use, because the file is read only once in a session.
A test urges you to quit Numbers if it is running.
(2) info for is deprecated so it may be removed by Apple tomorrow.
I no longer use it.
(3) just for fun, I added a piece of code allowing you to select the encoding on the fly.
Thanks to the property chooseEncodingInScript, at this time the script uses Unicode (UTF-8)
(4) I'm wondering which tool is used to launch this script,
I don't know the way to pass arguments when I run one.
Yvan KOENIG (VALLAURIS, France)
2012/06/13
*)
property chooseEncodingInScript : false
(*
true = the script will ask you to select the encoding
false = the script uses the embedded encoding
*)
on run argv
    set input_file to (item 1 of argv) as text
    set output_file to (item 2 of argv) as text
    set myPath to (path to me) as text
    tell application "System Events"
        set theProcesses to name of every application process
        set myFolder to path of container of (disk item myPath)
        set input_file_path to myFolder & input_file
        set output_file_path to myFolder & output_file
        set extname to name extension of (disk item input_file)
    end tell
    if extname is "numbers" then
        if "Numbers" is in theProcesses then error "Please, quit “Numbers” before running this script !"
        if chooseEncodingInScript then
            set theList to {"Mac OS Roman", "Unicode (UTF-8)", "Windows Latin 1"}
            set maybe to choose from list theList with prompt "Choose the default encoding applying to export as CSV"
            if maybe is false then
                error number -128
            else if item 1 of maybe is item 1 of theList then
                30 -- Mac OS Roman
            else if item 1 of maybe is item 2 of theList then
                4 -- Unicode (UTF-8)
            else
                12 -- Windows Latin 1
            end if
        else
            4 -- Unicode (UTF-8)
        end if
        do shell script "defaults write com.apple.iWork.Numbers CSVExportEncoding -int " & result
        tell application "Numbers"
            open input_file_path
            save document 1 as "LSDocumentTypeCSV" in output_file_path
            close every window saving no
        end tell
    end if
end run
Regards,
Barry

Reading text files without specifying the encoding?

I have looked everywhere for a solution, but I can't find one. The problem is that I'm using codes that everybody is using, but for some reason, my codes aren't working.
I want to be able to open up text files in Java without having to specify the encoding for the files that I'm going to read, because I have no idea which encoding they will use.
But even when specifying UTF-8 to be the encoding of the file to read, it doesn't work correctly:
FileInputStream fileIS= new FileInputStream("somefile.txt");
Reader reader = new BufferedReader(new InputStreamReader(fileIS, Charset.forName("UTF-8")));
EVERYWHERE I look, ppl are using these codes! But it doesn't work, some characters (such as the Euro sign) are displayed as squares.
However, I want to be able to read not only UTF-8 files but anything that Java supports. Any ideas?
Edited by: Stalfos on Oct 22, 2007 12:22 PM

Stalfos wrote:
I want to be able to open up text files in Java without having to specify the encoding for the files that I'm going read, because I have no idea which encoding they will use.
This is your problem. If you don't know what encoding the text is stored in, then the chances that the default encoding used to read it will be correct are slim.
But even when specifying UTF-8 to be encoding of the file to read, it doesn't work correctly:
There are many different character encodings, and most of them agree only on the ASCII range 0-127. Anything above 127 usually causes problems.
FileInputStream fileIS= new FileInputStream("somefile.txt");
Reader reader = new BufferedReader(new InputStreamReader(fileIS, Charset.forName("UTF-8")));
EVERYWHERE I look, ppl are using these codes! But it doesn't work, some characters (such as the Euro sign) are displayed as squares.
The problem isn't with the code, it's that the file you're reading isn't using an encoding that's compatible with UTF-8. Assuming that using UTF-8 should work for all encodings is like assuming that someone who can read Chinese should be able to read a book written in Spanish or Greek. It doesn't work that way.
However, I want to be able to read not only UTF-8 files but anything that Java supports. Any ideas?
You need to know what encoding your files are stored in, period. There are a few ways to guess what the encoding is, but they're only reliable for a small set of encodings.
You don't seem to truly understand what character encodings are, or how to use them, so read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
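One check that is reliable is the negative one: whether a file is definitely not in a given encoding. The sketch below decodes strictly and reports malformed input instead of silently substituting characters. The windows-1252 byte for the Euro sign (0x80) is used as sample data; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Returns true if the bytes are well-formed in the given charset.
    public static boolean isValid(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Euro sign plus '5' as windows-1252 stores it: the Euro sign is the single byte 0x80.
        byte[] cp1252Euro = {(byte) 0x80, '5'};
        // The same text as UTF-8: the Euro sign becomes the three bytes E2 82 AC.
        byte[] utf8Euro = "\u20ac5".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValid(cp1252Euro, StandardCharsets.UTF_8)); // false
        System.out.println(isValid(utf8Euro, StandardCharsets.UTF_8));   // true
    }
}
```

A result of true only means the bytes happen to be legal in that charset, not that the charset is the right one; many single-byte encodings accept any byte sequence at all.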

Why does Firefox 18 ignore the specified character encoding for websites?

We are developing a page on our website that will have the page crawled and a newsletter generated and sent out to a mailing list. Many email packages default to character encoding of iso-8859-1 so we have set our character encoding to this on the page via the standard meta tag.
We have a problem with the newsletters that until now we had been unable to replicate. Now I know why: I have just discovered that in Firefox 18 the specified character encoding is being completely ignored. It is rendering the page in UTF-8 even though we specified ISO-8859-1. Firefox 3.6, however, renders the page with the proper encoding (thank god for keeping an old version for testing).
Can anyone explain why the new Firefox is completely ignoring the meta tag? Both browsers are using the factory default (I even opened FF18 in safe mode)...

Thanks for letting me know that Firefox 18 ignores everything but the server headers... but it doesn't help me much. Our website is in UTF-8... but this page is a newsletter, one that is crawled, saved into an email and sent out to a mailing list (by a third-party newsletter program). Many email readers use ISO-8859-1, which is why we want the page rendered in that encoding, so that we can actually test the newsletter properly. We can't test through the third-party software as our testing environment is behind a firewall, and you can't change the server headers for a single page... hence the meta tag.
If you explicitly choose to render a page in a specific encoding, that shouldn't be ignored by the browser. It's not a big deal, but now every time we make a code change in our test environment and reload the page we have to force the encoding manually in the browser which is a pain.
The problem is, the newsletter is already live and we have some users complaining because some characters aren't displaying properly in their email packages (Entourage for Mac is one of them), while all our testing (which is encoded using UTF-8) looks fine.

How to specify page encoding for XML reports.

Hi,
Environment: Apps:11.5.10, Oracle Reports: 10g
I'm trying to generate XML tags by using a "rdf" report (10g).
Initially I generated the XML tags before moving the report to server. In the output file I got
<?xml version="1.0" encoding="WINDOWS-1252"?>
Then I moved the report to server and made the concurrent program output to XML.
In the concurrent program output the tag is
<?xml version="1.0" encoding="&Encoding"?>
The output shows error
===============
XML Parsing Error: XML declaration not well-formed
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
<?xml version="1.0" encoding="&Encoding"?>
It's clear that there is something wrong with the page encoding, which should be assigned at run time. But that is not happening.
How to specify the page encoding format?
Any help would be appreciated.
Thank you
BKR.
Edited by: BalaKrishna Reddy Avuthu on Aug 14, 2009 2:00 PM

Remove the encoding so it says:
<?xml version="1.0"?>
Also, you will get a similar error if your xml tags for fields contain any special characters like & or #.

Specifying Character encoding while parsing

Hi
I have an XML document that contains a particular Unicode character. After I parse it with the Xerces DOM parser, I find that the character is changed to some other character.
Any idea what could be the reason ? Also how do i overcome this problem ?
Thanks in advance

You specify that at the top of your XML document,
for example:
<?xml version="1.0" encoding="UTF-8"?>
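As a rough illustration of why the declaration matters, the sketch below hands raw bytes to the JDK's built-in DOM parser (not the standalone Xerces distribution the question mentions) and lets the parser pick the decoder from the XML declaration. The sample document, class name and method name are assumptions made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlDeclaredEncoding {
    // Parses raw bytes; the parser chooses the decoder from the XML declaration.
    public static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>Agnartj\u00f8rna</name>";
        // The bytes on disk must really be in the encoding the declaration names.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(latin1).equals("Agnartj\u00f8rna")); // true
    }
}
```

The important design point is to give the parser bytes (an InputStream), not a Reader that has already decoded the text with some other charset; a declaration that disagrees with the actual bytes is a common cause of characters changing during parsing.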

Specify File Encoding(Japanese Characters) for UTL_FILE in Oracle 10g

Hi All,
I am creating a text file using the UTL_FILE package. The database is Oracle 10G and the charset of DB is UTF-8.
The file is created on the DB Server machine itself which is a Windows 2003 machine with Japanese OS. Further, some tables contain Japanese characters which I need to write to the file.
When these Japanese characters are written to the text file they occupy 3 bytes instead of 1 and distort the file format I need to stick to.
Can somebody suggest, is there a way to write the Japanese character in 1 byte or change the encoding of the file type to something else viz. ShiftJIS etc.
Thanking in advance,
Regards,
Tushar

Are you using the UTL_FILE.FOPEN_NCHAR function to open the files?
Cheers, APC

HOW SPECIFY FILE.ENCODING=ANSI FORMAT IN J2SE ADAPTER.

Hi All,
We are using the J2SE plain adapter and we need the output data in ANSI format.
The default is file.encoding=UTF-8.
How can we achieve this?
thanks in advance.
Regards,
Mohamed Asif KP

The file adapter would behave in a similar fashion on J2EE. Here is the link to an ongoing discussion:
is ANSI ENCODING possible using file/j2see adapter
Regards,
Prateek

Problems with string encoding - need the text content in char* format.

The problem is non-ASCII characters, which come out as some sort of Unicode I need to decipher.
Here's what I got:
A text frame object with the TextString "Agnartjørna"
I get the text content of this object into an ai::UnicodeString the following way:
AIErr
VMGetTextOfTextArt( AIArtHandle textArt, ai::UnicodeString &ucStr)
{
    ASUnicode *textBuffer = NULL;
    AITRY {
        TextFrameRef ateTextRef;
        AIX( sAITextFrame->GetATETextFrame( textArt, &ateTextRef));
        ATE::ITextFrame ateText( ateTextRef);
        ATE::ITextRange ateRange = ateText.GetTextRange( true);
        ASInt32 textLen = ateRange.GetSize();
        AIX( sSPBlocks->AllocateBlock( (textLen+2) * sizeof( ASUnicode), nil, (void**) &textBuffer));
        ateRange.GetContents( textBuffer, (ASInt32) textLen+1);
        /* trim off trailing newlines */
        if ((textBuffer[textLen] == '\n') || (textBuffer[textLen] == '\r'))
            textBuffer[textLen] = 0;
        ucStr.clear();
        ucStr.append( ai::UnicodeString( textBuffer, textLen));
        sSPBlocks->FreeBlock( textBuffer);
        textBuffer = NULL;
        AIRETURN;
    }
    AICATCH {
        if (textBuffer) sSPBlocks->FreeBlock( textBuffer);
        AIPROPAGATE;
    }
}
Now, the next step is to convert it into a form that I can use to call regexp.
Basically, I want to detect the ending "tjørna" (meaning small lake) on a map label, and apply a standard abbreviation "tj^a" (with "a" superscripted).
So the problem is to obtain the regexp pattern and the text content in the same encoding. And since the regexp library is old and char*-based, I would like to convert the text content into plain old char*.
Hence the following code:
static AIErr
VMAbbreviateTextArt( AIArtHandle textArt,
                         vmTextAbbrevEffectParams *params)
    AITRY {
    /* first obtain the text contents of the textArt */
       ai::UnicodeString ucText;
      const int kTextLen = 256;
      char textContent[kTextLen];
      AIX( VMGetTextOfTextArt( textArt, ucText));
      ucText.as_Roman( textContent, kTextLen);
But textContent now has the value "Agnartj\xbfnna" (According to XCode),
which will not get a match on the pattern "tj([øe][rn])na\\" (with backslash matching the end of the string)
Any other ways to convert the textContent to a plain *char string?

Thank you very much, your method will work fine. With
the "UTF-8" parameter the byte[].length is double,
because every valid byte is preceded by a -62, but I
will just filter the valid bytes into a new array.
Thanks again,
Stefan
Actually what you need to do is to find the character encoding that your device expects, and then you can code your strings in Arabic.
That's the way Java does things; Strings and char values are always in Unicode (see www.unicode.org) (which means \u0600 to \u06ff for Arabic), and Java uses a specified character encoding when translating these to and from a byte stream.
Each national character encoding has a name. Most of them are identical to ASCII for 0-127 and code their national characters in 128-255.
Find the encoding name for your display and, odds are, the JRE has it in the library.
BTW the character encoding ISO-8859-1 simply maps UNICODE characters 0-255 on to bytes.
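The -62 mentioned above is the UTF-8 lead byte 0xC2 seen through Java's signed byte type. A minimal sketch, using the copyright sign U+00A9 as an arbitrary sample character from the range U+0080 to U+00BF (class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LeadByteDemo {
    public static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String sample = "\u00a9"; // copyright sign, U+00A9
        // UTF-8 needs two bytes: lead byte 0xC2 (-62 as a signed byte), then 0xA9 (-87).
        System.out.println(Arrays.toString(utf8(sample)));   // [-62, -87]
        // ISO-8859-1 maps U+0000..U+00FF straight onto one byte each.
        System.out.println(Arrays.toString(latin1(sample))); // [-87]
    }
}
```

Filtering out the -62 bytes by hand only works for that narrow range of characters; asking getBytes() for the encoding the device expects gives the right bytes directly.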

What every developer should know about character encoding

This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
If you write code that touches a text file, you probably need this.
Let's start off with two key items:
1. Unicode does not solve this issue for us (yet).
2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the first 128 byte values (0 – 127) in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside that range – then the trouble starts.
The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 128 values were identical on all and the second 128 were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8 (the XML specification requires that default; HTML has no equally firm rule), and the assumption is not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
Now let's look at UTF-8, because as the de facto standard, and because of the way it works, it gets people into a lot of trouble. UTF-8 became popular for two reasons. First, it matched the standard codepages for the first 128 characters, so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 byte values are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 byte values as the start of a double-byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. The original design went up to 6-byte sequences; the current definition (RFC 3629) stops at 4 bytes, which is enough for every Unicode code point. Using this MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
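The variable-width scheme is easy to observe from Java. The sketch below counts the UTF-8 bytes for one sample character from each width class; under the current UTF-8 definition (RFC 3629) no code point needs more than four bytes. The class name and sample characters are chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes the text occupies once encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1: ASCII, U+0041
        System.out.println(utf8Length("\u00df"));       // 2: sharp s, U+00DF
        System.out.println(utf8Length("\u20ac"));       // 3: Euro sign, U+20AC
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: U+1F600, a surrogate pair in UTF-16
    }
}
```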
But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then insert a character like ß in their text editor, which is using the codepage for their region, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
Point 2 – Always create HTML and XML in a program that writes it out correctly using the declared encoding. If you must create it with a text editor, then view the final file in a browser.
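The ß accident can be reproduced in a few lines. This sketch assumes ISO-8859-1 as the editor's regional codepage and UTF-8 as the encoding the reading program expects; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MismatchedEditor {
    // Simulates saving text in a regional codepage and reading it back as UTF-8.
    public static String saveAsLatin1ReadAsUtf8(String text) {
        byte[] saved = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(saved, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String result = saveAsLatin1ReadAsUtf8("Stra\u00dfe");
        // The sharp s was saved as the single byte 0xDF, which UTF-8 treats as the
        // start of a 2-byte sequence. The following 'e' is not a legal second byte,
        // so the decoder substitutes U+FFFD for the broken sequence.
        System.out.println(result.equals("Stra\uFFFDe")); // true
    }
}
```

The String constructor hides the problem behind a replacement character; a strict CharsetDecoder would instead raise an error, which is usually the more useful behavior when validating files.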
Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
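In Java, Point 3 amounts to never relying on the platform default. A minimal sketch using java.nio.file, with a temporary file and a made-up class name standing in for real application code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetIo {
    // Writes one line of text and reads it back, naming the charset both times.
    public static String roundTrip(String line, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                try (Writer out = Files.newBufferedWriter(file, charset)) {
                    out.write(line);
                }
                try (BufferedReader in = Files.newBufferedReader(file, charset)) {
                    return in.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Stra\u00dfe costs 5 \u20ac";
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The same idea applies to the older stream classes: wrap a FileInputStream or FileOutputStream in an InputStreamReader or OutputStreamWriter with an explicit Charset, rather than using a constructor that silently picks the default.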
Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
Wrapping it up
I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding, it's when they ignore the issue that they get into trouble.
Edited by: Darryl Burke -- link removed

DavidThi808 wrote:
This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
If you write code that touches a text file, you probably need this.
Let's start off with two key items:
1. Unicode does not solve this issue for us (yet).
2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the first 128 byte values (0 – 127) in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside that range – then the trouble starts.
Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that but at that time I wasn't even aware that character sets existed.
They might only use that range but that is a different issue, especially since that range is exactly the same in UTF-8 anyway.
The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 128 values were identical on all and the second 128 were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8 (the XML specification requires that default; HTML has no equally firm rule), and the assumption is not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
The above is out of place. It would be best to address this as part of Point 1.
Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
Now let's look at UTF-8, because as the de facto standard, and because of the way it works, it gets people into a lot of trouble. UTF-8 became popular for two reasons. First, it matched the standard codepages for the first 128 characters, so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 byte values are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 byte values as the start of a double-byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. The original design went up to 6-byte sequences; the current definition (RFC 3629) stops at 4 bytes, which is enough for every Unicode code point. Using this MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
The first part of that paragraph is odd. The first 128 characters of unicode, all unicode, is based on ASCII. The representational format of UTF8 is required to implement unicode, thus it must represent those characters. It uses the idiom supported by variable width encodings to do that.
But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then insert a character like ß in their text editor, which is using the codepage for their region, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it then it is invalid. End of story. It has nothing to do with html/xml.
Point 2 – Always create HTML and XML in a program that writes it out correctly using the declared encoding. If you must create it with a text editor, then view the final file in a browser.
The browser still needs to support the encoding.
Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
I know java files have a default encoding - the specification defines it. And I am certain C# does as well.
Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
It is important to define it. Whether you set it is another matter.
Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in java with escaped unicode characters which will fail to compile.
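A small sketch of what "replaced prior to actual code compilation" means in practice (the class name is made up). The escape for a line feed inside a comment ends the comment, so the text after it is compiled as ordinary code. For the same reason, a string literal containing the escape for a double quote is closed early and does not compile.

```java
// Hypothetical demo class for illustration.
class UnicodeEscapeDemo {
    static String value() {
        String s = "before";
        // The escape at the start of the next comment is turned into a line break before
        // the compiler tokenizes the file, so the assignment after it is real code.
        // \u000a s = "after";
        return s;
    }

    public static void main(String[] args) {
        System.out.println(value()); // prints "after"
    }
}
```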
Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
No. A developer should understand the problem domain represented by the requirements and the business and create solutions that are appropriate to that. Thus there is absolutely no point in someone who is creating an inventory system for a stand-alone store crafting a solution that supports multiple languages.
Another example is high-volume systems, where the cost of moving and storing bytes is relevant. There one must carefully consider whether each text element is customer consumable or internally consumable. Saving bytes in such cases reduces the total load on the system, and incremental savings affect operating costs and, through speed, market advantage.

File I/O and encoding (J2SDK 1.4.2 on Windows)

I encountered a strange behavior using the FileReader / Writer classes for serializing the contents of a java string. What I did was basically this:
String string = "some text";
FileWriter out = new FileWriter(new File("C:/foo.txt"));
out.write(string);
out.flush();
out.close();
In a different method, I read the contents of the file back:
FileReader in = new FileReader(new File("C:/foo.txt"));
StringWriter out = new StringWriter();
char[] buf = new char[128];
for (int len=in.read(buf); len>0; len=in.read(buf)) {
    out.write(buf, 0, len); // write only the chars actually read
}
out.flush(); out.close(); in.close();
return out.toString();
Problems arise as soon as the string contains non-ASCII characters. After writing and reading, the value of the string differs from the original. It seems that different character encodings are used when reading and writing, although the doc states that, if no explicit encoding is specified, the platform's default encoding (in my case CP1252) will be used.
If I use streams directly instead of writers, it does not work, either, as long as I do not specify the encoding when converting bytes to strings and vice versa.
When I specify the encoding (no matter which one, as long as I specify the same for reading as for writing), the resulting string is equal to the original one.
If I replace the FileReader and Writer by StringReader and StringWriter (bypassing the serialization), it works, too (without specifying the encoding).
Is this a bug in the file i/o classes or did I miss something?
Thanks for your help
Ralph

first.... if you are writing String objects via serialization, encoding doesn't matter whatsoever. Not sure you were saying you tried that, but just for future reference.
For String.getBytes() and String(byte[]) or InputStreamReader and OutputStreamWriter: If you don't specify an encoding, the system default (or default specified on the command-line or set in some other way) will be used in all cases.
For byte streams: If you are reading/writing bytes thru streams, then the character conversion is up to you. You call getBytes on a string or create a string with the byte[] constructor.
For readers/writers: If you are reading/writing characters thru readers/writers, then the character conversion is done by that class.
However, StringReader and StringWriter are just writing to/from String objects and they are writing Unicode char's, so it's really a special case.
Okay...
So if you have a string which has characters outside the range of the encoding being used (default or explicitly specified), then when it's written to the file, those characters are messed up. So say you have a Chinese character which needs 2 bytes. Generally, the 2 bytes are written, but when read back, that one character shows as 2. Whether 2 bytes are written or 1, probably depends on the encoding. But the result is the same, you get a munged up string.
Generally speaking, you are going to get better storage on most text when using UTF-8 as your encoding. You need to specify it always for reads and writes, or set it as the default. The reason is that chars are written in as many bytes as needed. And it'll support anything Unicode supports, thus anything String supports.
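The difference between a limited encoding and UTF-8 can be sketched like this. The class and method names are illustrative. One detail worth hedging: with Java's built-in `String.getBytes(Charset)`, a character the charset cannot represent is replaced by the charset's replacement byte (a question mark for ISO-8859-1) rather than written as extra bytes, so the exact damage depends on the API and charset involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class LossyEncodingDemo {
    /** Encodes and decodes with the same charset, as a file written and read back would be. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    /** Number of bytes UTF-8 needs for the given text. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D"; // a single CJK character
        // ISO-8859-1 cannot represent it, so it comes back as "?"
        System.out.println(roundTrip(chinese, StandardCharsets.ISO_8859_1));
        // UTF-8 can represent it, so the round trip is lossless
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese));
        // UTF-8 uses as many bytes as each character needs: 1, 2 and 3 here
        System.out.println(utf8Length("a") + " " + utf8Length("\u00DF") + " " + utf8Length(chinese));
    }
}
```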

How to identify the encoding used in a file ?

Hi all,
I have to read a file and check that it is encoded in UTF-8. How can I do this? If the file is saved in MS Windows I can check for the BOM. What if the file is saved using the Java API? Is there any code (copyleft code) available for doing this?
rgds
Antony Paul

The problem is that there are no definite tests for character encoding. A particular byte stream can be valid in any number of different encodings (even if the resulting characters are not correct). If the characters don't happen to include any above Unicode 127 then a UTF-8 stream is identical to the same characters in any number of different encodings.
It's not just a matter of there being no code for it in the library; it's impossible to do with any certainty, and to do it even probabilistically you'd have to run the results through a multilingual spelling checker.
If you just ask java.io to open a Reader without specifying an encoding it will assume the default encoding of your system.
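What can be checked mechanically is narrower than "is this file UTF-8": whether the bytes start with the UTF-8 BOM, and whether they are well-formed UTF-8 at all. A minimal sketch follows, with made-up class and method names. A `true` result only means the bytes could be UTF-8, which fits the caveat above, since pure ASCII passes too.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
class Utf8Check {
    /** True if the bytes start with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** True if the bytes are well-formed UTF-8. This does not prove the author meant UTF-8. */
    static boolean isWellFormedUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For a file, the byte array could come from `java.nio.file.Files.readAllBytes`. A file that fails the check is definitely not UTF-8, while one that passes is only probably UTF-8.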

UTF encoding issues on file adapters and mappings

Hi,
We did some tests regarding UTF-8 and UTF-16 encoding using file adapters. Our conclusion so far (when using Windows OS) is:
1. Inbound adapter can handle UTF-8 and UTF-16 correctly, but do not specify the encoding!
2. XI mappings will set the XML encoding to UTF-8 correctly when sending an UTF-16 file to XI.
3. Outbound adapter can only handle UTF-8 (and US-ASCII and ISO-8859-1) correctly.
The exact test results are:
>>Outbound file adapter bug.
If no encoding is specified in the outbound file adapter, UTF-8 and UTF-16 are handled correctly. However if the encoding is set to UTF-16, XI mapping will fail with the error:
During the application mapping com/sap/xi/tf/_CHRIS_OUTBOUND_TO_INBOUND_ a com.sap.aii.utilxi.misc.api.BaseRuntimeException was thrown: Fatal Error: com.sap.engine.lib.xml.parser.Parser~
Part of the trace:
com.sap.aii.ibrun.server.mapping.MappingRuntimeException: Runtime exception occurred during execution of application mapping program com/sap/xi/tf/_CHRIS_OUTBOUND_TO_INBOUND_: com.sap.aii.utilxi.misc.api.BaseRuntimeException; Fatal Error: com.sap.engine.lib.xml.parser.ParserException: XMLParser: No data allowed here: (hex) a0d, a0d, 6e3c(:main:, row:3, col:2) at com.sap.aii.ibrun.server.mapping.JavaMapping.executeStep(JavaMapping.java:72) at com.sap.aii.ibrun.server.mapping.Mapping.execute(Mapping.java:91) at com.sap.aii.ibrun.server.mapping.MappingHandler.run(MappingHandler.java:78) at com.sap.aii.ibrun.sbeans.mapping.MappingRequestHandler.handleMappingRequest
>>Inbound file adapter bug.
If the encoding of an inbound file adapter is set to UTF-16 everything works ok (except the XML encoding is not set correctly, but this may be a mapping issue and not an adapter issue). However the default UTF-16 encoding seems to be UTF-16BE, where I would expect UTF-16LE since this is the most commonly used encoding.
If the encoding is set to UTF-16LE or UTF-16BE, the character set used in the message is correct, but the BOM of the file is not. The BOM is empty, which means a UTF-8 encoded file. Since the file is UTF-16BE or UTF-16LE encoded, this is wrong and the correct BOM should be added by the adapter.
Encodings like US-ASCII and ISO-8859-1 are handled correctly.
>>Mapping bug
When we send in a message encoded in UTF-8 and want to send it out as a UTF-16 encoded message, we need to set the XML encoding to UTF-16. Normally this is done by an XSLT mapping using the <xsl:output encoding="UTF-16"/> command.
The UTF-8 message will get processed by the XSLT and any special character will be converted to its UTF-16 value. However, the output message is not UTF-16 encoded (1 byte instead of 2 bytes).
When this 1-byte message is sent to the inbound adapter (encoding is set to UTF-16), the message will be translated from 1 byte to 2 bytes (UTF-8 to UTF-16). The characters that were converted from UTF-8 to UTF-16 will be read as single-byte characters and will be converted again. This will result in an incorrect message with illegal characters.
So basically characters will be converted to UTF-16 2 times, which is incorrect.
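The general mechanism of converting twice can be reproduced in plain Java. This sketch does not use XI or the adapter. It is only an analogy with UTF-8 and ISO-8859-1, and the class and method names are made up. Bytes that are already encoded get misread as single-byte characters and are then encoded a second time.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class DoubleEncodingDemo {
    /** Encodes to UTF-8, misreads the bytes as ISO-8859-1, then encodes to UTF-8 again. */
    static byte[] encodeTwice(String text) {
        byte[] once = text.getBytes(StandardCharsets.UTF_8);
        // Each byte is treated as one character, which is the misreading step.
        String misread = new String(once, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";
        System.out.println(hex(eAcute.getBytes(StandardCharsets.UTF_8))); // C3 A9
        System.out.println(hex(encodeTwice(eAcute)));                     // C3 83 C2 A9
    }
}
```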
Maybe someone can confirm this on another XI system (maybe different OS). If you need test files or mapping, please let me know.
Kind regards,
Christiaan Schaake.

Update after carefully reading all the UTF related documents on the internet.
For UTF-16 the BOM is required and the adapter is handling this correctly. (encoding=UTF-16 will create the BOM).
For UTF-16LE and UTF-16BE the BOM must not be set. The application should be able to handle the conversion. The adapter is working correctly again.
If the adapter is set to binary mode instead of text mode, the file will always be read correctly.
About the mapping issue, I'm still experimenting with this one.
Kind regards,
Christiaan Schaake.
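For comparison, the standard Java charsets behave the same way as described in this update: "UTF-16" writes a byte order mark (and encodes big-endian), while "UTF-16BE" and "UTF-16LE" write none. Whether the adapter simply relies on these JDK charsets is an assumption. The class below is only an illustrative sketch with made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class Utf16BomDemo {
    /** Encodes the text with the given charset and formats the bytes as hex pairs. */
    static String encodeHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(encodeHex("x", StandardCharsets.UTF_16));   // FE FF 00 78 (BOM, then big-endian)
        System.out.println(encodeHex("x", StandardCharsets.UTF_16BE)); // 00 78 (no BOM)
        System.out.println(encodeHex("x", StandardCharsets.UTF_16LE)); // 78 00 (no BOM)
    }
}
```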

How to set XML character encoding for a SOAP response?

Hi,
We're using Oracle J2EE web services,
and are quite happy with them.
However, it's a problem that we need to have
characters outside the standard English alphabet
in our service responses. So far, we have not been
able to find a way to specify what encoding to use.
Our version (9.0.3 release) produces SOAP-responses
without any encoding specification in the XML header.
Any ideas?

Hello,
If you are using the "Paper Layout", check the Reports's "Before Report Value" property:
Before Report Value :
<meta http-equiv="Content-Type" content="text/html; charset=&Encoding">
If you are using the "Web Layout", take a look to the document :
http://download-uk.oracle.com/docs/cd/B14099_17/bi.1012/b14048/pbr_nls.htm#i1006142
18.3 Specifying a Character Set in a JSP or XML File
Regards

Is it possible to change the default file encoding?

I have just learned that the "file.encoding" system property should be treated as read-only.
(http://developer.java.sun.com/developer/bugParade/bugs/4163515.html)
I am using this property to tell javac that the command arguments file has some other encoding than the system default, like this:
javac -J-Dfile.encoding=UTF-8 @files-to-compile.lst
On Windows XP with the US English locale it worked for all the SDK releases I checked, but on Windows 2000 Japanese Edition only one of the J2SDK 1.4.1 releases worked.
My question is: is there an acceptable way to tell the JVM what the default encoding is? Or inform javac about the encoding of the argument file?
The reason for having a UTF-8 encoded javac argument list file is that our application generates Java source files that can have unicode characters in their names. Seemingly Windows supports unicode file names so I did not want to restrict file names to those supported by the system encoding.

Use javac's "-encoding" option.
$ javac
Usage: javac <options> <source files>
where possible options include:
-g                        Generate all debugging info
-g:none                   Generate no debugging info
-g:{lines,vars,source}    Generate only some debugging info
-nowarn                   Generate no warnings
-verbose                  Output messages about what the compiler is doing
-deprecation              Output source locations where deprecated APIs are used
-classpath <path>         Specify where to find user class files
-sourcepath <path>        Specify where to find input source files
-bootclasspath <path>     Override location of bootstrap class files
-extdirs <dirs>           Override location of installed extensions
-d <directory>            Specify where to place generated class files
-encoding <encoding>      Specify character encoding used by source files
-source <release>         Provide source compatibility with specified release
-target <release>         Generate class files for specific VM version
-help                     Print a synopsis of standard options
