Char, UTF, Unicode, International chars

Hi all,
Could anybody make a brief summary of the connections among the char data type, UTF, Unicode, International chars?
Thanks!

The Java char data type usually holds one Unicode character. However, some Unicode characters have code points above U+FFFF and so don't fit in 16 bits (the size of char); those characters are represented by a pair of chars called a surrogate pair. There are multiple byte encodings for Unicode. Two of the most common are UTF-8 and UTF-16. The char data type stores characters in the UTF-16 encoding. UTF-8 is a common format for serializing Unicode, since it takes half the space of UTF-16 for the most common characters. Also, UTF-8 is defined as a sequence of bytes, so byte order is not an issue when transferring data between different platforms. Finally, one of the best reasons to use UTF-8 is that for characters with code points below 128, UTF-8 is identical to ASCII.
Every character is an international character.
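As a small illustration of the points above (the character and strings here are arbitrary examples): a supplementary character occupies two chars in a String, and ASCII text is byte-identical in UTF-8 but twice the size in UTF-16.

```java
public class SurrogateDemo {
    public static void main(String[] args) throws Exception {
        // U+1D11E (musical G clef) is outside the 16-bit range, so it
        // needs a surrogate pair - two chars for one code point
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                         // 2
        System.out.println(clef.codePointCount(0, clef.length())); // 1
        // For code points below 128, UTF-8 is byte-identical to ASCII
        System.out.println("abc".getBytes("UTF-8").length);    // 3
        System.out.println("abc".getBytes("UTF-16BE").length); // 6 - twice the size
    }
}
```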

Similar Messages

  • Unicode (international) characters in cfpdfform with LiveCycle PDFs

    Hi all
    These are the details of my environment:
    ColdFusion version: 8.0.1
    Update level: C:/ColdFusion8/lib/updates/chf8010004.jar
    Server OS: Windows Vista 64
    Web server: Apache
    I'm using cfpdfform in ColdFusion 8 to read a Livecycle 8.2.1 pdf form straight into the browser window, enter text into a couple of fields and then submit the form back to the same CF page and use the submitted XML to populate the same form.
    So, the relevant bit of my code is as follows:
    <cfif IsDefined("PDF")>
        <cfpdfform action="read" source="#PDF.content#" xmldata="myXMLData" />
        <cfset myXMLData = XMLParse(myXMLData)>
        <cfpdfform action="populate" source="c:\temp\myPDFForm.pdf" xmldata="#myXMLData#" overwritedata="true" />
    <cfelse>
        <cfpdfform action="populate" source="c:\temp\myPDFForm.pdf">
            <cfpdfsubform name="topmostSubform">
                <cfpdfformparam name="TextField1" value="Hello! The time is #TimeFormat(Now(), 'HH:mm:ss')#" />
            </cfpdfsubform>
        </cfpdfform>       
    </cfif>
    This all works just fine, except that any multibyte unicode characters (e.g. Chinese) I enter into any text fields are lost (turned to question marks) when the form is populated from the submitted XML, even though I can see that the Unicode characters remain intact if I do a cfdump of the myXMLData variable on page submission.
    I ordinarily have no problems with Unicode. I can write unicode characters in multiple foreign character sets to database and retrieve them just fine. I have site pages in several international character sets, and, as I have stated, the text I enter is displayed correctly in the correct foreign characters when I view a cfdump of the submitted form, so this appears to be specifically related to the PDF form populate side of things (or a setting in LiveCycle).
    I have an inkling that the cause is that the LiveCycle form has its text fields set to allow rich text and has a specific font set against the fields, so I'm wondering whether the foreign character font, which displays fine when I enter it into the form, is being lost (overwritten by the LiveCycle field font) when read back into the form. Any ideas as to how to overcome this problem would be gratefully accepted.

    Okay, I think I got to the bottom of this. The following full page code works:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>cfpdfform Post and Populate Example</title>
    </head>
    <body>
    <cfoutput>
    <cfif IsDefined("PDF")>
        <cfpdfform action="read" source="#PDF.content#" xmldata="myXMLData" result="data" />
        <cfpdfform action="populate" source="c:\temp\myLiveCycleForm.pdf" xmldata="#myXMLData#" />
    <cfelse>
        <cfpdfform action="populate" source="c:\temp\myLiveCycleForm.pdf" />
    </cfif>
    </cfoutput>
    </body>
    </html>
    The key to stopping the Chinese characters from breaking was to remove the overwritedata attribute from the populate tag on the action page. In other words, changing
    <cfpdfform action="populate" source="c:\temp\myLiveCycleForm.pdf" xmldata="#myXMLData#" overwritedata="yes" />
    to
    <cfpdfform action="populate" source="c:\temp\myLiveCycleForm.pdf" xmldata="#myXMLData#" />
    fixed the problem.
    As an interim measure, I wrote out the submitted PDF form XML to a file and took a look at what was happening and used the XML file as the basis for populating the PDF form, like this:
    <cfif IsDefined("PDF")>
         <cfpdfform action="read" source="#PDF.content#" xmldata="myXMLData" result="data" />
        <cffile action="write" file="c:\temp\myXMLData.xml" output="#myXMLData#" charset="utf-8" nameconflict="overwrite">
        <cfpdfform action="populate" source="c:\temp\myLiveCycleForm.pdf" xmldata="c:\temp\myXMLData.xml" />
    <cfelse>
         <cfpdfform action="populate" source="c:\temp\myLiveCycleForm.pdf" />
    </cfif>
    As well as causing the Chinese characters to break, the overwritedata attribute, when populating the form, appears to strip out any rich text formatting (CSS) from form fields.
    Dropping the overwritedata attribute preserved the Chinese characters, regardless of whether the PDF form is populated directly from the posted PDF XML (as in the full example above) or via an XML file.
    On a separate issue (one I'll post to the LiveCycle forum but am mentioning here for other people's reference), any XML which contains in-line CSS formatting (e.g. individual words in bold text) using the span tag causes a ColdFusion error when used to populate the PDF form.

  • UTF/Unicode support

    Hi,
    Does Oracle Berkeley DB support unicode/utf-8 charset?
    Could you please let me know?
    Thanks,
    ananth

    Hello,
    From the Berkeley DB Reference Guide at:
    http://www.oracle.com/technology/documentation/berkeley-db/db/ref/intro/dbisnot.html
    Notice that Berkeley DB never operates on the value part of a record.
    Values are simply payload, to be stored with keys and reliably
    delivered back to the application on demand.
    Both keys and values can be arbitrary byte strings, either fixed-length
    or variable-length. As a result, programmers can put native programming language data
    structures into the database without converting them to a foreign record format first.
    If that does not answer your question, please let me know.
    Thanks,
    Sandra

  • Cannot type/paste normal international text when non-Unicode in CS6

    Hello,
    In all versions of DW (up to CS3 which I have used) I had no problem pasting / typing HTML or text with international characters in Design View when the page is using a non-Unicode, yet international encoding (like Windows - Greek or ISO - Greek).
    Now, DW CS6 auto converts all international chars typed/pasted in Design View to html entities (unless Page Encoding is Unicode).
    For example, when the document has an encoding of:
    <meta http-equiv="Content-Type"  content="text/html;  charset=windows-1253">
    [ This is equal to Modify / Page Properties / Title/Encoding / Document Type (DTD): None & Encoding: Greek (Windows) ]
    ...in the past I was able to type/paste greek characters/text in Design view and they were retained as such (simple text) in Code view (and this is what we need now as well).
    Yet, in DW CS6, such international chars/text (typed/pasted in Design view) are auto-converted to "&somechar;" entities, which is not what should happen; this messes up all the text. Design view shows the text correctly, but the HTML source (Code view) does not retain the international characters themselves, although it should, as long as the HTML page uses a proper encoding/charset that allows the international text to be retained (e.g. the Greek encoding is compatible with Greek characters). I repeat that this was working correctly at least until DW CS3.
    Directly typing/pasting in DW CS6 design view correctly (i.e. retaining the original chars in code view) works ONLY when using Unicode.
    However, if we type/paste greek text (with html tags or not) directly in Code view, then DW CS6 retains chars/text properly and Design view displays everything properly too. Consequently, as a work-around, we can use the Code View to type/paste international text/html when not using Unicode (UTF-8) as the Page Encoding. But this makes our life more difficult.
    So, has CS6 dropped support for typing/pasting international text/html directly in Design view, for non-Unicode international encodings?
    Or has something changed, meaning we need to configure some setting(s) so that the feature works properly? (I haven't been able to find any setting that might affect this behavior. I also played with the Document Type (DTD) settings, but I found these did not affect the described behavior.)
    Please advise. This is very important.
    Thanks,
    Nick

    Thanks for the reply.
    As I have already mentioned, typing/pasting in Code View works properly.
    However, in previous versions of DW, pasting/typing in Design View was working fine, whatever the page encoding.
    I agree that pasting in Code View is not really a big deal. But having to do all editing/typing in Code View definitely is! What is the point of using a WYSIWYG editor, if it can't produce correct source code (except in Unicode pages)? If we are going to do all editing in Code View, then we could simply use notepad (just an exaggeration) or other programming-oriented tool.
    I hope other people can confirm the problem and suggest solutions or Adobe fixes it.

  • Encoding chars

    I have a basic question about Internationalization.
    We are using 1.3.1. I am writing an XML file, but it will contain international characters. Instead of writing, for instance, the "A" with the two dots (Ä), I would like to write &#196;. I see that 1.4.1 has some nice classes in the java.nio.* packages; I am assuming I would use CharsetEncoder if I could use 1.4.1. However, we have to use 1.3.1. Does anyone have any suggestions on how to write encoded chars?
    thanks in advance

    (in the java source code)
    =========================
    You have two choices for typing a non-ASCII char.
    1) Use the Unicode-escape (\udddd) notation for the non-ASCII char.
    You can find the \u notation at unicode.org. If you have a native file that contains non-ASCII, please use the native2ascii tool that comes with J2SE. It's under the [jdk]/bin directory. You can find the usage doc at
    http://java.sun.com/j2se/1.3/docs/tooldocs/tools.html
    2) If your Java source code editor supports non-ASCII chars, just type the non-ASCII and save it in your native encoding. When you compile the code, use the -encoding option to correctly compile the natively encoded file.
    (print out non-ASCII chars)
    ===========================
    Java uses Unicode internally. When a char is printed out from the JVM, the char is converted to the underlying platform's default encoding. There are only a few Java methods that can override this default behavior. One of them is OutputStreamWriter(OutputStream out, String encoding) in JDK 1.3.0, which prints out data in the encoding you pass. Since you are printing to XML, "utf-8" would be ideal.
    thanks,
    baechul
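For the original question (producing &#196;-style references on 1.3.1), no CharsetEncoder is actually needed: a plain loop can turn every char above U+007F into an XML numeric character reference, leaving pure ASCII output. A minimal sketch (the class and method names here are made up for illustration; it uses only JDK 1.3-era APIs):

```java
// Escape any char above U+007F as an XML numeric character reference,
// so the output stays pure ASCII (works on JDK 1.3.1, no java.nio needed)
public class XmlEscape {
    public static String escapeNonAscii(String s) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        System.out.println(escapeNonAscii("\u00C4 is A-umlaut")); // &#196; is A-umlaut
    }
}
```

Note this treats each char independently, which is fine for BMP characters like Ä; surrogate pairs would need to be combined into one code point first.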

  • Unicode & fonts

    hi,
    it is great that Java supports Unicode internally, but how do I make Unicode characters printable? I have read about 10 pages now and none gives a straightforward answer... does anyone know?
    thanks.
    -m

    hi,
    thanks for the reply.
    > Of course you need to have support for that particular font, then you're good to go.
    How do I provide "support" for the font? I read something about this requiring more than just installing a new font on the system...
    System.out.println("\u0041");
    System.out.println("\u004F");
    That is how I am trying it, but when I set the font to a font that I know supports the Unicode characters I want to print, the font is not really changed and I only see question mark chars displayed. How do I fix that?
    thanks again, appreciate the help.
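One way to avoid the question-mark boxes is to ask a font up front whether it has a glyph for the character. A minimal sketch using java.awt.Font.canDisplay; note that "Dialog" is a logical font name that maps to different physical fonts per platform, so the result for the Arabic letter depends entirely on the fonts installed on the machine:

```java
import java.awt.Font;

public class FontCheck {
    public static void main(String[] args) {
        // "Dialog" is a logical font name; it maps to a physical font per platform
        Font font = new Font("Dialog", Font.PLAIN, 12);
        // canDisplay reports whether the font has a glyph for the character,
        // so you can pick a capable font instead of rendering question marks
        System.out.println(font.canDisplay('A'));      // true on any normal setup
        System.out.println(font.canDisplay('\u0628')); // Arabic beh - depends on installed fonts
    }
}
```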

  • UTF-16: better than UTF-8? How to use?

    I read about UTF-8/UTF-16 and found out that UTF-16 characters all have the same encoded length, while UTF-8 uses a more compact representation. But why does Eclipse just show one line of strange characters (mostly little squares) in my .java file when I switch Eclipse's encoding from UTF-8 to UTF-16? I thought Java used UTF-16 internally all the time?

    String aString = "jschell";
    byte[] whateverEncodingBytes = aString.getBytes("whateverEncoding");
    aString is in UTF-16, no matter what. The encoding specified in the getBytes() method is the target encoding, so whateverEncodingBytes is an array of bytes in that target encoding rather than UTF-16. If you use the no-argument getBytes(), the bytes are produced in the platform's default encoding.
    String aNewString = new String(whateverEncodingBytes, "whateverEncoding");
    aNewString is UTF-16 again, no matter what. The encoding specified in the constructor is the source encoding. The whateverEncodingBytes are still in their original encoding, but building a new String out of them requires an encoding conversion to occur, as the String must be in UTF-16. So we need to let the constructor know which encoding to convert from, just as an interpreter needs to know which language to interpret from.
    A Java byte can hold data in any encoding - it's simply eight raw bits of data stored in an arbitrary order. A String, on the other hand, always keeps its internal characters in the UTF-16 encoding.
    In short, if you pull data into Java from an external source, the encoding it sits in within Java sticks to the rules above. If it first hits Java as a byte stream (say from an InputStream), it's going to be in whatever encoding it was already in. If it first hits Java as a String (say from request.getParameter() in a servlet or JSP), it will be in UTF-16, as Java will convert it immediately.
    The same holds when writing back out. If you write out a series of bytes (such as with an OutputStream), the resulting output is in the encoding the bytes were in to begin with. If you're outputting a String directly (although this is rarer), it will be in UTF-16. Or, to be more specific, as somebody pointed out, Java uses Unicode characters (16-bit, unsigned) in memory to hold characters.
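The byte/String rules above can be sketched in a few lines; the string and charsets here are arbitrary examples:

```java
public class RoundTrip {
    public static void main(String[] args) throws Exception {
        String greeting = "caf\u00E9";                  // é is U+00E9
        byte[] utf8 = greeting.getBytes("UTF-8");
        System.out.println(utf8.length);                // 5 - é takes 2 bytes in UTF-8
        byte[] latin1 = greeting.getBytes("ISO-8859-1");
        System.out.println(latin1.length);              // 4 - é takes 1 byte in Latin-1
        // Decoding with the wrong source charset mangles the text...
        System.out.println(new String(utf8, "ISO-8859-1")); // cafÃ©
        // ...while naming the right one restores the original (UTF-16 inside)
        System.out.println(new String(utf8, "UTF-8").equals(greeting)); // true
    }
}
```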

  • Can't see any advantage of Unicode

    Hi,
    I am really wondering about the advantages of the 16-bit Unicode character, which is touted as one of the best features of the Java language. E.g., I can't display '\u0628', an Arabic letter, without using a font that supports Arabic characters.
    So I see no advantage to Unicode in my view. Can anyone explain this????
    Secondly, can we say that this feature of Java is user-machine dependent? Because if the user has installed the proper font for Arabic then he can see Arabic characters; otherwise it will show blocks.
    Thanks

    There are advantages to using Unicode. Firstly, Windows now uses Unicode internally, IIRC. This allows better cut and paste between Windows applications and Java applications - you can easily cut from IE and paste into Java.
    Secondly, Unicode is useful if you are working with multiple languages. Every character in every language has its own unique code point - no more second-guessing! This leads to many other possibilities.
    The reason the JDK and JRE do not come with a full set of fonts is size.

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Lets start off with two key items
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked out best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second 128 were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because the way it works gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 values are all single-byte representations of characters. Then, for the next most common set, it uses a block in the second 128 byte values to introduce a double-byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes; those then each lead to a third byte, and those three bytes define the character. The original design went up to 6-byte sequences (modern UTF-8 is capped at 4). Using this MBCS (multi-byte character set) approach you can write the equivalent of every Unicode character – and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that, in their text editor using the codepage for their region, inserts a character like ß, and they save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first character of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
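The ß scenario can be reproduced in a few lines. This sketch encodes the text as Latin-1 (what a regional text editor might save) and then decodes it as UTF-8, which is roughly what a misconfigured reader would do; the word used is just an example:

```java
public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        // A Latin-1 editor saves "straße" with ß as the single byte 0xDF
        byte[] latin1Bytes = "stra\u00DFe".getBytes("ISO-8859-1");
        // A reader expecting UTF-8 sees 0xDF as the lead byte of a 2-byte
        // sequence, but the following 'e' is not a valid continuation byte,
        // so the decoder substitutes U+FFFD (the replacement character)
        String misread = new String(latin1Bytes, "UTF-8");
        System.out.println(misread); // stra\uFFFDe - the ß is gone
    }
}
```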
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encode. If you must create with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking about binary/data files, where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
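Point 3 in Java terms: a minimal sketch that names the charset explicitly on both the write and read side (the temp file and sample word are purely for illustration):

```java
import java.io.*;

public class ExplicitEncodingRead {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("encdemo", ".txt");
        f.deleteOnExit();
        // Write with an explicit charset rather than the platform default
        Writer w = new OutputStreamWriter(new FileOutputStream(f), "UTF-8");
        w.write("na\u00EFve");
        w.close();
        // Read it back, again naming the charset explicitly
        BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "UTF-8"));
        System.out.println(r.readLine()); // naïve
        r.close();
    }
}
```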
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Okay, you're reading and writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what those encoders in the Java and .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type for characters. This you probably have right, because today's languages don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Lets start off with two key items
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.
    They might only use that range but that is a different issue, especially since that range is exactly the same as the UTF8 character set anyways.
    >
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked out best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second 128 were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because the way it works gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 values are all single-byte representations of characters. Then, for the next most common set, it uses a block in the second 128 byte values to introduce a double-byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes; those then each lead to a third byte, and those three bytes define the character. The original design went up to 6-byte sequences (modern UTF-8 is capped at 4). Using this MBCS (multi-byte character set) approach you can write the equivalent of every Unicode character – and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of Unicode – all of Unicode – are based on ASCII. The representational format of UTF-8 is required to implement Unicode, thus it must represent those characters. It uses the idiom supported by variable-width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that, in their text editor using the codepage for their region, inserts a character like ß, and they save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first character of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with HTML/XML.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encode. If you must create with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking about binary/data files, where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Okay, you're reading and writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what those encoders in the Java and .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type for characters. This you probably have right, because today's languages don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in java with escaped unicode characters which will fail to compile.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone creating an inventory system for a standalone store to craft a solution that supports multiple languages.
    Another example: in high-volume systems, the cost of moving/storing bytes is relevant. There one must carefully consider, for each text element, whether it is customer-consumable or only internally consumable. Saving bytes in such cases reduces the total load on the system; incremental savings affect operating costs and, through speed, marketing advantage.

  • Java IO for different file encodings

    Hi,
    I need a file-reading mechanism wherein I should be able to read files of just about any encoding (e.g. Shift JIS, EBCDIC, UTF, Unicode, etc.).
    I could do this using the following code:
                   FileInputStream fis = new FileInputStream("D:\\FMSStub\\ZENMES2.txt");
                   InputStreamReader isr = new InputStreamReader(fis, "Unicode");
                   BufferedReader br = new BufferedReader(isr);
                   char[] c1 = new char[1024];
                   int n = br.read(c1);
    But there is a requirement in our code, to also read some trailers from the file and go back and again read from the middle of the file. Basically a seek kind of functionality.
    I have been desperately trying to figure out how I can do a seek (which is possible with RandomAccessFile, but RandomAccessFile won't work with all the various encodings... I have tried this) with the above kind of code.
    Any information on this would be very useful.
    Regards,
    Mallika.

    Hi,
    Thanks for your reply.
    But as you say, when I do a new String(byte[], "encoding"), my String does not always get formed correctly.
    The case which i tried was a unicode file.
    I read first few bytes and formed a string as follows:
         byte[] blfilestr = new byte[1000];
         RandomAccessFile ra = new RandomAccessFile("D:\\FMSStub\\ZENMES2.txt", "r");
         ra.seek(0);
         int ilrd = ra.read(blfilestr);
         String gfilestr = new String(blfilestr, 0, ilrd, "Unicode");
    This gave me the String correctly.
    But then I did a seek to an intermediate position in the file (a location which I confirmed is not in between the 2 bytes of a Unicode char) and it fails to form the string correctly. I need to do this, though.
    Any ideas ?
    Mallika.
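    A likely explanation (an assumption, not verified against your file): Java's "Unicode" charset relies on a byte-order mark, which exists only at offset 0, so after a mid-file seek the decoder has no BOM and guesses big-endian. Decoding with an explicit endianness – UTF-16LE below, assuming a Windows-written file – avoids the guess. A self-contained sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SeekDecodeDemo {
    public static void main(String[] args) throws Exception {
        // Simulate a little-endian UTF-16 file as typically written on
        // Windows: a byte-order mark followed by the text.
        String text = "Hello seek";
        byte[] file = ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE);

        // "Seek" to byte 12 (2-byte BOM + 5 two-byte chars) and grab the tail.
        byte[] tail = Arrays.copyOfRange(file, 12, file.length);

        // With no BOM in the tail, the BOM-sniffing "Unicode" charset
        // (on typical JDKs) falls back to big-endian and garbles the text...
        String wrong = new String(tail, "Unicode");
        // ...while naming the endianness decodes it correctly.
        String right = new String(tail, StandardCharsets.UTF_16LE);

        System.out.println(wrong);
        System.out.println(right);  //  seek
    }
}
```

    So RandomAccessFile plus new String(bytes, explicitCharset) can serve as the seek mechanism, as long as the seek offsets land on code-unit boundaries.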

  • Linux or JVM: cannot display non english character

    hi,
    I am trying to implement a GUI that supports both Turkish and English; the user can switch between them on the fly.
    import java.awt.event.*;
    import java.util.*;
    import javax.swing.*;

    public class SampleGUI {
        JButton trTranslate = new JButton(); /* Button, to translate into Turkish */
        /* Label whose text will be translated */
        JLabel label = new JLabel("Text to Be Translated!");
        {
            trTranslate.addActionListener(new ActionListener() {
                public void actionPerformed(ActionEvent e) {
                    String language = "tr";
                    String country = "TR";
                    Locale currentLocale = new Locale(language, country);
                    ResourceBundle messages =
                            ResourceBundle.getBundle("TranslateMessages", currentLocale);
                    /* get the Turkish match of "TextToTranslate" from the properties file */
                    label.setText(messages.getString("TextToTranslate"));
                }
            });
        }
    }
    Finally, my problem is that my application does not display non-English characters like "ş", "ğ" or "ı" in the GUI after triggering the translation. However, if I do not use ResourceBundle and instead assign the Turkish text to that label directly (i.e. label.setText("şşşşş")), the GUI successfully displays Turkish characters. What may be the problem? Which encoding setting does not conform?
    PS: I am using Red Hat Linux 8.0 and j2sdk1.4.1. Current locale = "tr_TR.UTF-8"; in /etc/sysconfig/keyboard, keyTable = "trq". There seems to be no problem there, as I can input and output Turkish characters – the OS supports this, and the JVM gets the current encoding from the OS. It seems as if the properties file is being read in an inappropriate encoding.
    Thanks for dedicating your time and effort,
    hELin
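    One likely cause (a guess, not a confirmed diagnosis): ResourceBundle reads .properties files as ISO-8859-1 in the JDKs of that era, so Turkish characters saved as UTF-8 get mangled. Either escape them as \uXXXX via native2ascii, or load the bundle through an explicit UTF-8 reader. A sketch, with the bundle content inlined in place of the real TranslateMessages_tr_TR.properties and a made-up Turkish value:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;

public class Utf8BundleDemo {
    public static void main(String[] args) throws IOException {
        // Inlined stand-in for a UTF-8 encoded properties file.
        byte[] utf8Bytes = "TextToTranslate=\u00e7evrilecek metin\n"
                .getBytes(StandardCharsets.UTF_8);

        // Loading through an explicit UTF-8 reader sidesteps the
        // default ISO-8859-1 interpretation of .properties files.
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            PropertyResourceBundle bundle = new PropertyResourceBundle(r);
            System.out.println(bundle.getString("TextToTranslate"));
        }
    }
}
```

    Note the PropertyResourceBundle(Reader) constructor requires Java 6 or later; on 1.4 the native2ascii route is the practical option.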

    I would suspect it would work in vim only if vim supported the UTF-8 character set. I have no idea if it does.
    Here is one blurb I found on google:
    USING UNICODE IN THE GUI
    The nice thing about Unicode is that other encodings can be converted to it
    and back without losing information. When you make Vim use Unicode
    internally, you will be able to edit files in any encoding.
    Unfortunately, the number of systems supporting Unicode is still limited.
    Thus it's unlikely that your language uses it. You need to tell Vim you want
    to use Unicode, and how to handle interfacing with the rest of the system.
    Let's start with the GUI version of Vim, which is able to display Unicode
    characters. This should work:
         :set encoding=utf-8
         :set guifont=-misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
    The 'encoding' option tells Vim the encoding of the characters that you use.
    This applies to the text in buffers (files you are editing), registers, Vim
    script files, etc. You can regard 'encoding' as the setting for the internals
    of Vim.
    This example assumes you have this font on your system. The name in the
    example is for X-Windows. This font is in a package that is used to enhance
    xterm with Unicode support. If you don't have this font, you might find it
    here:
         http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz

  • Microsoft Word "Smart Quotes"

    I hope this will save other developers some time.
    This may be obvious to others, but I just spent several hours Googling and testing to determine what actually happens when a user copies text containing "Smart Quotes" from Microsoft Word into a Java JTextComponent. For those not familiar with Smart Quotes: by default, MS Word changes double-quoted strings from using the US-ASCII character for quote (0x22) into left- and right-curly quotes (UTF-16: 0x201C and 0x201D). Word also does this with several other characters. This plays havoc with the display, and with Java Strings later encoded with java.beans.XMLEncoder, unless treated carefully. Here is what I discovered (obviously, this applies to MS Windows):
    All values are in hexadecimal.
    - Word is storing the character for double quote as UTF-16 internally (201C).
    - When the character is copied to the clipboard, it is copied as UTF-8 (E2 80 9C).
    - When the clipboard is pasted into Java, Java assumes it was originally Windows-1252 encoded, because that is the default for the US-EN locale in Windows XP (probably also Vista, but I only tested in XP).
    - Java translates this into a-circumflex, euro-sign, o-e-ligature – the characters corresponding to E2, 80, and 9C respectively in Windows-1252 – and represents it internally in UTF-16 as 00E2 20AC 0153.
    - When the String is XML-encoded using java.beans.XMLEncoder, it is written in UTF-8 as C3A2 E282AC C593, which equates to UTF-16 00E2 20AC 0153 – the characters a-circumflex, euro-sign, o-e-ligature.
    I am not sure how to fix this, but maybe another reader does. I am experimenting with the Clipboard (java.awt.datatransfer) to see if I can programmatically find out the original character encoding (in this case, UTF-16).

    Doesn't the DataFlavor contain the character encoding? What is the content of the InputStream returned by
                InputStream is = (InputStream)contents.getTransferData(DataFlavor.getTextPlainUnicodeFlavor());
    If I use
                    DataFlavor df = DataFlavor.getTextPlainUnicodeFlavor();
                    String mimeType = df.getMimeType();
                    String encoding = mimeType.replaceAll(".*?charset=(.*?)\\s*$", "$1");
                    InputStream is = (InputStream) contents.getTransferData(df);
                    ByteArrayOutputStream baos = new ByteArrayOutputStream();
                    byte[] buffer = new byte[1024];
                    for (int count = 0; (count = is.read(buffer)) != -1;)
                        baos.write(buffer, 0, count);
                    baos.close();
                    String result = baos.toString(encoding);
    to transfer
         Hello "World"
    (which Word has changed to the smart-quotes version), I get as a result
         Hello “World”
    which is what I expect.
    Am I missing something?
    Edited by: sabre150 on Sep 4, 2009 1:27 PM

  • VNC into Mac and keyboard layout

    Hi,
    I'm trying to get keyboard layout mapping to work between the mac mini which is the VNC server and any remote VNC clients (linux, windows, mac). Tried ultravnc, tightvnc, realvnc clients. And 10.5.x built-in as well as Vine VNC servers. The mini has an Alphagrip keyboard with US+Finnish layouts. The client PCs have Finnish layout typically, also US or German.
    Problem is that any pipe, tilde, @ etc "special" characters typed through the clients end up as umlauts or other characters in the VNC.
    Any suggestions how I could make VNC work so that the VNC keyboard layout is always the client layout?
    Thanks,
    - Jan

    Thanks for pointers!
    Setting "US" in OSXvnc did not work with any local client-side layout – oddly, not even an EN/US local + EN/US remote combination with OSXvnc set to either "US" or "Current Layout".
    But the following nearly works. Client Finnish keyboard, server Mac Mini Finnish keyboard, for the ultraVNC client change Finnish/FI to US winxp locale (using XP's/Vista's Language Bar), in OSXvnc choose "Current Layout" and set global Mac layout to Finnish.
    I.e. have same keyboard at both ends but set the "wrong" US locale in WinXP for the ultraVNC client.
    Now nearly everything works (with this laptop and server) – all umlauts and special characters. The only, and unfortunately also critical, problem is that the "|" key does not work: it outputs "'*" instead. On the Mac the key is to the left of the spacebar, on the client to the right. I don't see how this alone could change the scancodes, but apparently it does.
    Is there not some UTF / Unicode version of VNC for all OS X, Windows, Linux?
    Or some VNC that would handle keyboard layouts smartly, transparent to the user?
    Another option, IME with RDP (MS Terminal Services, rdesktop) there is no problem with keyboard layouts. Perhaps you guys know some RDP server for OS X?

  • Reg: SOAP-XI-RFC scenario

    1) Where can we check the processed XML messages in the SOAP-XI-RFC scenario after sending a SOAP request to the server?
    2) And how do we handle the application errors?

    Check the below:
    /people/siva.maranani/blog/2005/03/01/testing-xi-exposed-web-services
    Also:
    I guess you are done with the adapter configuration.
    To test your SOAP sender scenario, go to Runtime Workbench -> Component Monitoring -> Adapter Monitoring (test message).
    STEP 1) Give the URL: URL Template: http://host:port/XISOAPAdapter/MessageServlet?channel=party:service:channel
    STEP 2) Then enter the details of Party, Business Service, Interface name and the interface namespace.
    STEP 3) Give the authorization details (user name and password).
    STEP 4) The last step is to enter the payload, i.e. in SOAP format (to create the SOAP request, follow the steps given below).
    To expose a webservice to the internet, you have to create a WSDL file. You can create one in Integration Directory.
    One menu driven option is there "Define a webservice" can be used to create a WSDL file.
    Once the WSDL file is created. Use a SOAP client tool to create the SOAP request.
    You can create SOAP Request from WSDL using the following tool.
    http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=65a1d4ea-0f7a-41bd-8494-e916ebc4159c
    Check the SOAP request: it may have the encoding type UTF (Unicode Transformation Format) set to UTF-16. Change it to UTF-8.
    Once you have created the SOAP request, use the same as the payload and test the scenario.
    SOAP sender configuration link:
    http://help.sap.com/saphelp_erp2004/helpdata/en/ae/d03341771b4c0de10000000a1550b0/frameset.htm
    Thanks
    Kiran

  • Issues with Chinese characters via HTTPRequest

    We have the below scenario:
    http --> PI --> IDOC
    Before the message arrives to PI we do the conection with the below code:
    String urlString = WDConfiguration.getConfiguration("local/AIB_PAD").getStringEntry("HTTPURL");
    java.net.URL url = new java.net.URL(urlString);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoInput(true);
    connection.setDoOutput(true);
    connection.setUseCaches(true);
    connection.setRequestMethod("POST");
    connection.connect(); // (http connection is made)
    This connection transforms all Chinese characters to "?".
    Does anybody know what is missing for PI to receive the Chinese characters correctly?
    Thanks.
    Edited by: Israel Toledo on Sep 29, 2011 1:27 AM

    Hi Mark,
    Thanks for your answer.
    We are sure that both systems support Chinese characters, because when we see the outbound XML generated on the source system (Java SAP Portal) it looks good. Also, when we send an XML to PI via an HTTP client using JavaScript, it works fine.
    The issue is when we make the connection to PI. I think some parameter is missing; it could be the character encoding (UTF-8, Unicode), but we are not sure how to change that in the Java code.
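    A plausible cause (an assumption, since the write side isn't shown above): the request body is written without declaring a charset, so a single-byte default encoder replaces each Chinese character with '?'. The sketch below contrasts the two encoders using in-memory streams; for the real connection you would also call connection.setRequestProperty("Content-Type", "text/xml; charset=UTF-8") and wrap connection.getOutputStream() in a UTF-8 OutputStreamWriter before writing the XML.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class CharsetWriteDemo {
    public static void main(String[] args) throws Exception {
        String payload = "\u4e2d\u6587";  // "中文"

        // A single-byte writer (ISO-8859-1 stands in for a typical
        // default) replaces characters it cannot encode with '?'.
        ByteArrayOutputStream latin = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(latin, StandardCharsets.ISO_8859_1)) {
            w.write(payload);
        }
        System.out.println(new String(latin.toByteArray(), StandardCharsets.ISO_8859_1));  // ??

        // An explicit UTF-8 writer round-trips the text intact; the same
        // applies when writing the POST body to the PI connection.
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(utf8, StandardCharsets.UTF_8)) {
            w.write(payload);
        }
        System.out.println(new String(utf8.toByteArray(), StandardCharsets.UTF_8));  // 中文
    }
}
```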
