Unicode or UTF-8?

Hi, all,
I'm developing a JSP application that will work with international characters, both displaying them on webpages and storing them in a MySQL database. I'm a bit confused about whether I should use Unicode or UTF-8 for those character strings. (I've read up on both of these encodings and they appear to be very similar in many respects.)
Can anyone give me any suggestions as to which I should use and why?
Thanks,
Dmitri.

UTF-16 uses 2 bytes for all characters.
UTF-8 generally uses anywhere from 1 to 4 bytes to represent a character (the original definition allowed up to 6).

Actually, UTF-16 uses 16-bit tokens, and represents characters with one or more tokens, like all UTF encodings.
Generally, the encodings 'UTF-N' use N-bit tokens, and encode the Unicode scalar values (the character set, with code points up to U+10FFFF) with one or more tokens. Typically, the lower values in the encoding represent the Unicode scalar values directly.
Unicode defines 'UTF-8', 'UTF-16', and 'UTF-32', the latter two in big- and little-endian forms as well as self-specifying forms (using an initial byte-order mark in a file or stream). UTF-32 just uses the Unicode scalar values directly.
There is also 'UTF-7', which is used in MIME encoding to get through 7-bit transports, and there are unofficial variants ('UTF-5', 'UTF-6', etc.) for specialist use.
UTF-8 has the advantage that its multi-byte sequences never contain null (zero-valued) bytes, which means it works transparently with code expecting one-byte characters (assuming that legacy code doesn't try to manipulate the multi-byte sequences!).
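To put that in terms of the original JSP question: a Java String is always Unicode internally (a sequence of UTF-16 code units), so "Unicode vs. UTF-8" is not really a choice you make inside Java. UTF-8 is what you choose at the boundaries: HTTP responses, JDBC connections, files. A minimal sketch; the MySQL Connector/J URL parameters in the comments are the usual ones, noted here as an assumption about the driver in use:

import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // A Java String is Unicode no matter where the data came from.
        String s = "caf\u00E9 \u9875";
        // Encode to UTF-8 bytes only when the data leaves the program...
        byte[] utf8 = s.getBytes("UTF-8");
        // ...and decode when it comes back in; the round trip is lossless.
        String back = new String(utf8, "UTF-8");
        System.out.println(s.equals(back)); // prints "true"
        // In a JSP: response.setContentType("text/html; charset=UTF-8");
        // For MySQL Connector/J: append ?useUnicode=true&characterEncoding=UTF-8
        // to the JDBC URL.
    }
}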

Similar Messages

  • Perform unicode to UTF-8 conversion on F110 bacs payment file in ABAP

    Hi,
    I am facing a conversion issue for the UK BACS payment files.
    The payment run tcode F110 creates a payment file, but the file, when created on the application server, has some sort of code conversion. If I remove the # values, I can read most of the data.
    The data example is as below-
    #V#O#L#1#0#0#1#5#8#8# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #2#4#3#3#0#9#
    #H#D#R#1#A#2#4#3#3#0#9#S# # #1#2#4#3#3#0#9#0#0#0#0#0#2#0#0#0#1#0#0#0#1# # # # # # # #1#0#1#1#2#
    #H#D#R#2#F#0#2#0#0#0#0#0#1#0#0# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    #U#H#L#1# #1#0#1#1#3#9#9#9#9#9#9# # # # #0#0#0#0#0#0#0#0#1# #D#A#I#L#Y# # #0#0#0# # # # # # # #
    This is then transferred to the bank via the FTP UNIX script, but only after a conversion, which happens as:
    #Perform unicode to UTF-8 conversion on bacs file
    $a = "iconv -f UNICODE -t UTF-8 $tmpUNI > $tmpASC";
    The need going forward is to bring the details in via the interface and then make an upload.
    The ABAP code should be able to make the conversion, remove the additional characters, and then send the file across.
    I have searched everywhere, but I am not able to find out how to make the same conversion in ABAP.
    We are on ECC6.
    Can someone please help me?
    Regards,
    Archana

    Hi Archana,
    can you please check SAP notes 1064779 and 1365764 (including the attachment) and see if they help you?
    Best regards,
    Nils Buerckel
    SAP AG
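    For what it's worth, the conversion the Unix iconv line above performs (16-bit units whose null high bytes show up as '#', recoded down to UTF-8) can be sketched in Java; the file names here are hypothetical, and within ABAP the CL_ABAP_CONV_IN_CE / CL_ABAP_CONV_OUT_CE classes cover the same ground:

    import java.io.*;

    public class BacsRecode {
        public static void main(String[] args) throws IOException {
            // "UTF-16" honours a byte-order mark if present and defaults to
            // big-endian otherwise; use "UTF-16LE"/"UTF-16BE" to force an order.
            Reader in = new InputStreamReader(
                    new FileInputStream("bacs_utf16.txt"), "UTF-16");
            Writer out = new OutputStreamWriter(
                    new FileOutputStream("bacs_utf8.txt"), "UTF-8");
            int c;
            while ((c = in.read()) != -1) out.write(c);
            in.close();
            out.close();
        }
    }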

  • Triple byte unicode to utf-16

    I need to convert a triple byte unicode value (UTF-8) to UTF-16.
    Does anyone have any code to do this? I have tried some code like:
    String original = new String("\ue9a1b5");
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();
    but this does not seem to process the last byte (b5). Also, when I try to convert the hex values to utf-16, it is vastly off.
    -Lou

    Good question. Answer is, it does.
    Oops, sorry, I think I left my brain in the kitchen :)
    I was somehow thinking that "hmmm, e is not a hexadecimal digit so that must result in an error"... but of course it is...
    "Am I representing the triple byte unicode character wrong? How do I get a 3 byte unicode character into Java (for example, the utf-16 9875)?"
    It's simply "\u9875".
    If you have byte data in UTF-8 encoding, this is what you can do:

    try {
        byte[] utf8 = {(byte) 0xE9, (byte) 0xA1, (byte) 0xB5};
        String stringFromUTF8 = new String(utf8, "UTF-8");
    } catch (UnsupportedEncodingException uee) {
        // UTF-8 is guaranteed to be supported everywhere
    }
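    As a quick check (assuming Java 5+ for codePointAt), those three bytes do decode to the UTF-16 value 9875 mentioned above:

    import java.io.UnsupportedEncodingException;

    public class Utf8Check {
        public static void main(String[] args) throws UnsupportedEncodingException {
            byte[] utf8 = {(byte) 0xE9, (byte) 0xA1, (byte) 0xB5};
            String s = new String(utf8, "UTF-8");
            // Prints "9875": E9 A1 B5 is the UTF-8 form of U+9875.
            System.out.println(Integer.toHexString(s.codePointAt(0)));
        }
    }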

  • Character encoding (unicode to utf-8) conversion problem

    I have run into a problem that I can't seem to find a solution to.
    my users are copying and pasting from MS-Word. My DB is Oracle with its encoding set to "UTF-8".
    Using Oracle's thin driver it automatically converts to the DB's default character set.
    When Java tries to encode Unicode to UTF-8 and it runs into an unknown character (typically a character that is in the High Ascii range), it substitutes it with '?' or some other weird character.
    How do I prevent this?

    "my users are copying and pasting from MS-Word. My DB is Oracle with its encoding set to 'UTF-8'."
    Pasting where? Into the database? If they are pasting into the database (however they might do that) and getting bad results, then that's nothing to do with Java.

    "Using Oracle's thin driver it automatically converts to the DB's default character set."
    Okay, I will assume that is correct.

    "When Java tries to encode Unicode to UTF-8 and it runs into an unknown character (typically a character that is in the High Ascii range) it substitutes it with '?' or some other weird character."
    This is false. When converting from Unicode to UTF-8 there are no "unknown characters". I don't know what you mean by the "High Ascii range", but if your users are pasting MS stuff into your Java program somehow, then a conversion from something into Unicode is done at that time. If "something" isn't the right encoding then you have the problems already, before you try to write to the DB.

    "How do I prevent this?"
    First identify the problem. You have input coming from somewhere, then you are writing to the database. Two different steps. Either of them could have a problem. Test them separately so you know which one of them is the problem.
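    One way to test the database step in isolation is to skip the pasted input entirely and round-trip a known string literal. A minimal sketch, where the connection URL and the table test_tab(txt) are assumptions for illustration:

    import java.sql.*;

    public class RoundTripTest {
        public static void main(String[] args) throws SQLException {
            Connection con = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//host:1521/service", "user", "password");
            // A literal in the source file is already correct Unicode, so any
            // corruption seen here comes from the driver/DB step, not the input step.
            String probe = "caf\u00E9 \u20AC";
            PreparedStatement ins =
                    con.prepareStatement("INSERT INTO test_tab (txt) VALUES (?)");
            ins.setString(1, probe);
            ins.executeUpdate();
            ResultSet rs = con.createStatement()
                    .executeQuery("SELECT txt FROM test_tab");
            rs.next();
            // false = JDBC/DB problem; true = look at the input side instead.
            System.out.println(probe.equals(rs.getString(1)));
            con.close();
        }
    }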

  • Japanese characters, outputstreamwriter, unicode to utf-8

    Hello,
    I have a problem with OutputStreamWriter's encoding of japanese characters into utf-8...if you have any ideas please let me know! This is what is going on:
    static public String convert2UTF8(String iso2022Str) {
       String utf8Str = "";
       try {
          // convert string to byte array stream
          ByteArrayInputStream is = new ByteArrayInputStream(iso2022Str.getBytes());
          ByteArrayOutputStream os = new ByteArrayOutputStream();
          // decode iso2022Str byte stream as iso-2022-jp
          InputStreamReader in = new InputStreamReader(is, "ISO2022JP");
          // re-encode to utf-8
          OutputStreamWriter out = new OutputStreamWriter(os, "UTF-8");
          // get each character c from the input stream (will be in unicode) and write to output stream
          int c;
          while ((c = in.read()) != -1) out.write(c);
          out.flush();
          // get the utf-8 encoded output byte stream as string
          utf8Str = os.toString();
          is.close();
          os.close();
          in.close();
          out.close();
       } catch (UnsupportedEncodingException e1) {
          return e1.toString();
       } catch (IOException e2) {
          return e2.toString();
       }
       return utf8Str;
    }

    I am passing a string received from a database query to this function, and the string it returns is saved in an XML file. Opening the XML file in my browser, some Japanese characters are converted, but some, particularly hiragana characters, come up as ???. For example:
    屋台骨田家は時間目離れ拠り所那覇市矢田亜希子ナタハアサカラマ楢葉さマヤア
    shows up as this:
    屋�?�骨田家�?�時間目離れ拠り所那覇市矢田亜希�?ナタ�?アサカラマ楢葉�?�マヤア
    (sorry that's absolute nonsense in Japanese but it was just an example)
    To note:
    - i am specifying the utf-8 encoding in my xml header
    - my OS, browser, etc... everything is set to support japanese characters (to the best of my knowledge)
    Also, I ran a test with a string, looking at its characters' hex values at several points and comparing them with iso-2022-jp, unicode, and utf-8 mapping tables. Basically:
    - if I don't use this function at all...write the original iso-2022-jp string to an xml file...it IS iso-2022-jp
    - I also looked at the hex values of "c" being read from the InputStreamReader here:
    while((c=in.read())!=-1) out.write(c);
    and have verified (using a character value mapping table) that in a problem string, all characters are still being properly converted from iso-2022-jp to unicode
    - I checked another table (http://www.utf8-chartable.de/) for the unicode values received and all of them have valid mappings to a utf-8 value
    So it appears that when characters are written to the OutputStreamWriter, not all characters can be mapped from Unicode to utf-8 even though their Unicode values are correct and there should be utf-8 equivalents. Instead they are converted to (hex value) EF BF BD 3F EF BF BD which from my understanding is utf-8 for "I don't know what to do with this one".
    The characters that are not working: most hiragana (though not all) and a few kanji characters. I have yet to find a pattern/relationship between the characters that cannot be converted.
    If I am missing something... or if someone has a clue... Oh, and I am developing in Eclipse, but I really don't have a clue about it beyond setting up a project, editing it and hitting build/run. Is it possible that I may have missed some needed configuration?
    Thank you!!

    It's worse than that, Rene; the OP is trying to create a UTF-8 encoded string from a (supposedly) iso-2022 encoded string. The whole method would be just an expensive no-op if it weren't for this line:   utf8Str = os.toString(); That converts the (apparently valid) UTF-8 encoded byte array to a string, using the system default encoding (which seems to be iso-2022-jp, BTW). Result: garbage.
    @meggomyeggo, many people make this kind of mistake when they first start dealing with encodings and charset conversions. Until you gain a good understanding of these matters, a few rules of thumb will help steer you away from frustrating dead ends.
    * Never do charset conversions within your application. Only do them when you're communicating with an external entity like a filesystem, a socket, etc. (i.e., when you create your InputStreamReaders and OutputStreamWriters).
    * Forget that the String/byte[] conversion methods (new String(byte[]), getBytes(), etc.) exist. The same advice applies to the ByteArray[Input/Output]Stream classes.
    * You don't need to know how Java strings are encoded. All you need to know is that they always use the same encoding, so phrases like "iso-2022-jp string" or "UTF-8 string" (or even "UTF-16 string") are meaningless and misleading. Streams and byte arrays have encodings, strings do not.
    You will of course run into situations where one or more of these rules don't apply. Hopefully, by then you'll understand why they don't apply.
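    Applied to the original problem, those rules reduce the whole method to: decode at the input boundary, encode at the output boundary, and never hold "UTF-8 bytes" in a String. A sketch with hypothetical file names:

    import java.io.*;

    public class Iso2022ToUtf8 {
        public static void main(String[] args) throws IOException {
            // Decode iso-2022-jp bytes on the way in...
            Reader in = new InputStreamReader(
                    new FileInputStream("input_iso2022jp.txt"), "ISO2022JP");
            // ...and encode UTF-8 bytes on the way out; in between, only chars.
            Writer out = new OutputStreamWriter(
                    new FileOutputStream("output_utf8.xml"), "UTF-8");
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            in.close();
            out.close();
        }
    }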

  • Sql Plus and Unicode (or utf-8) characters.

    Hello,
    I have a problem with SQL*Plus and Unicode files. I want to execute START {filename}, where {filename} is a file in Unicode format (this file has to contain German and Polish characters), but I receive an error message. Is it possible to read from a Unicode (or UTF-8) file and execute commands from this file?
    Thanks in advance.
    Pawel Przybyla

    What is your client operating system's character set?

  • GUI_UPLOAD and UNICODE or UTF-8

    Hello,
    I wanted to convert some files in SAP, i.e.:
    - data is processed in MS Excel and saved as a Unicode file
    - I read the file into SAP, process the strings, and write the file back
    The problem is that there are some special Polish characters in the file. I read the file using the GUI_UPLOAD FM, but I couldn't prevent these special characters from being replaced (by # by default).
    Anyone came across this problem before?
    Regards,
    Michal

    Hello Satya,
    Actually, I forgot to write that I used the CODEPAGE parameter... I tried two cases:
    1. Save as Unicode (in Excel or Notepad) and then use CODEPAGE = "4103"
    2. Save as UTF-8 in Notepad and use CODEPAGE = "4110"
    In both cases the Polish characters are replaced by '#'.
    In fact, I tried many other code pages (all from TCP00A), but all other combinations return an error.
    Regards,
    Michal

  • The use of CL_ABAP_CONV_OUT_CE to create an unicode-16 (UTF-16) file

    Hello,
    I have to create a file with normal text in UTF-16 format. In ABAP, the creation of a UTF-8 file is very easy (OPEN DATASET ... FOR OUTPUT IN TEXT MODE ENCODING UTF-8).
    However, UTF-16 is barely documented, and the normal OPEN DATASET does not support UTF-16.
    The only thing I could find out is that you have to use the class CL_ABAP_CONV_OUT_CE for it and open the dataset as BINARY.
    But I don't know how to do it. Could someone help? A small example would be perfect.
    Thanx in advance.
    Regards, Frank

    Hi,
    Please check this piece of code:

    DATA conv TYPE REF TO cl_abap_conv_in_ce.
    DATA buffer(4) TYPE x.
    DATA text(100) TYPE c.

    buffer = '41424344'.
    conv = cl_abap_conv_in_ce=>create( encoding = 'UTF-8' ).
    conv->convert( EXPORTING input = buffer
                   IMPORTING data  = text ).
    WRITE: / text.

    Example for class cl_abap_conv_out_ce:

    DATA: text(100) TYPE c VALUE 'ABCD',
          conv TYPE REF TO cl_abap_conv_out_ce,
          buffer TYPE xstring.

    conv = cl_abap_conv_out_ce=>create( encoding = 'UTF-8'
                                        endian   = 'L' ).
    conv->write( data = text n = 4 ).
    buffer = conv->get_buffer( ).
    WRITE: / buffer.
    Also, you do not need to replace TRANSLATE ... TO UPPER/LOWER CASE in Unicode systems.
    You just need to take care that the arguments fit: the arguments of these statements must be single fields of type C, N, D, T, or STRING, or structures of character-type only.
    Regards
    Hiren K.Chitalia

  • Converting Unicode to UTF-8 character set through Oracle forms(10g)

    Hi,
    I am working on Oracle Forms (10g), where I need to load files containing a Unicode character set (multilingual characters) into the database,
    but while loading the file, junk characters get inserted into the database tables.
    While reading the file through Forms, I am using the utl_file.fopen_nchar and utl_file.get_line_nchar functions to read the Unicode characters.
    The application server and database server character sets are set to the AMERICAN UTF8 character set.
    In fact, when I change the text file's character set to UTF-8 through an editor (Notepad++, etc.), the data is inserted into the database properly (at least it works for English characters), but not with Unicode...
    Any guidance in this regard is highly appreciated.
    Thank you in advance
    Sanu

    Hi,
    please check out the following link:
    http://www.oracle.com/technology/tech/globalization/htdocs/nls_lang%20faq.htm
    sarah

  • Java text to unicode or UTF-16?

    Hi all,
    I get a string inputted by the user, say "&#12345;". I want to add this to an internal string, but as a Unicode-formatted string. Otherwise, when I write this string out using an XML formatter, it encodes the ampersand character as &amp;.
    Does anybody know how to do this? Is there a text -> unicode class or method available?
    thanks,
    Justin

    String str = "&#12345;";
    char c = (char) Integer.parseInt(str.substring(str.indexOf('#') + 1, str.lastIndexOf(';')));
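    That one-liner handles a single decimal reference within the basic plane. A slightly more defensive sketch (the helper name is made up for illustration) that also accepts hex references and code points above U+FFFF:

    public class EntityDecoder {
        // Decodes one numeric character reference, e.g. "&#12345;" or "&#x3039;".
        static String decode(String ref) {
            String digits = ref.substring(ref.indexOf('#') + 1, ref.lastIndexOf(';'));
            int codePoint = (digits.startsWith("x") || digits.startsWith("X"))
                    ? Integer.parseInt(digits.substring(1), 16)
                    : Integer.parseInt(digits);
            // Character.toChars expands supplementary code points to surrogate pairs.
            return new String(Character.toChars(codePoint));
        }

        public static void main(String[] args) {
            System.out.println(decode("&#12345;"));  // the U+3039 character
            System.out.println(decode("&#x3039;"));  // the same character, hex form
        }
    }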

  • Can any version of Excel save to a CSV file that is either UTF-8 or UTF-16 encoded (unicode)?

    Are there any versions of Excel (chinese, japanese, russian... 2003, 2007, 2010...) that can save CSV files in Unicode (either UTF-8 or UTF-16)?
    If not, is the only solution to go with tab-delimited files (save as Unicode-text option)?

    Hi Mark,
    I have the same problem: trying to save my CSV file in UTF-8 encoding. After several hours of searching and trying this in my VSTO Add-In, I got nothing. The "Save as Unicode" option in Excel creates the file as TAB-separated. Because I'd like to save the file from my Add-In application, the best approach (for my problem) is to save the file as Unicode tab-delimited and then automatically replace all tabs with commas in the file.
    I don't think there is a direct way to save CSV as Unicode in Excel. And I don't understand why.
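    The tab-to-comma step of that workaround is easy to do outside Excel. A minimal sketch in Java, with hypothetical file names (note it does not quote fields that themselves contain commas):

    import java.io.*;

    public class TsvToCsv {
        public static void main(String[] args) throws IOException {
            // Excel's "Unicode Text" output is UTF-16LE with a BOM; Java's
            // "UTF-16" decoder consumes the BOM automatically.
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream("book1.txt"), "UTF-16"));
            Writer out = new OutputStreamWriter(
                    new FileOutputStream("book1.csv"), "UTF-8");
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line.replace('\t', ','));
                out.write("\r\n");
            }
            in.close();
            out.close();
        }
    }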

  • How do I tell if a File is ANSI, unicode or UTF8?

    I have a jumble of file types - they should all be the same, but they are not.
    How do I tell which type a file has been saved in?
    (and how do I tell a file to save in a certain type?)

    "unicode or UTF-8" ?? UTF-8 is unicode !NO! UTF-8 is not UNICODE. Yes it is !!No it is not.
    And to prove it I refer to your links.........
    You simply cannot say "unicode or UTF-8" just because
    UTF is Unicode Transformation Format.UTF is a transfomation of UNICODE but it is not UNICODE. This is not playing with words. One of the big problems I see on these forums is people saying the Java uses UTF-8 to represent Strings but it does not, it uses UNICODE point values.
    You can say "UTF-8 or UTF16-BE or UTF-16LE" because
    all three are different Unicode representations. But
    all three are unicode.No! They are UNICODE transformations but not UNICODE.
    >
    So please don't play on words, I wanted to notify the
    original poster that "unicode or UTF-8" is
    meaningless, he/she would probably have said :
    "unicode (as UTF-8 or UTF-16 or...)"You are playing with words, not me. UTF-8 is not UNICODE, it is a transformation of UNICODE to a multibyte representation - http://www.unicode.org/faq/utf_bom.html#14 .

  • How to store text in AD attributes of type String(Unicode)

    In our company, we store the name of the building in which an employee works in one of the ActiveDirectory attributes named extensionAttribute#, where # represents a number from 1-15. I want to be able to store the names of buildings in the AD
    attribute named "buildingName". According to the
    buildingName attribute in the MSDN library, the syntax is "String(Unicode)".
    When I try something that seems simple to write some text into this attribute such as
    Set-QADUser jdoe -ObjectAttributes @{buildingName="My Building Name"}
    I get an error that says "Set-QADUser : The requested operation did not satisfy one or more constraints associated with the class of the object."
    I have been searching forums and code libraries for a solution to this issue, but I do not understand why I can't store string data into this attribute. I did try the following code snippets to try to convert the string to Unicode (and UTF-8) before I wrote
    to the buildingNameAttribute, but I still get the same error.
    $text = "My Building Name"
    $enc = [System.Text.Encoding]::Unicode
    # $enc = [System.Text.Encoding]::UTF8
    $encText = $enc.GetBytes($text)
    $encText
    Set-QADUser jdoe -ObjectAttributes @{buildingName=$encText}
    Has anyone else had this issue or know how to overcome it? I can, of course, continue to use the extensionAttribute in which I currently have the data stored, but I really want to free it up and use the "buildingName" attribute.
    Gardner Rowe Systems Analyst III UT MD Anderson Cancer Center - Making Cancer History

    From your link:
    buildingName attribute
    The name of the building where an organization or organizational unit is based.
    IOW, it is an attribute of organization or organizational unit class of object not a user object, so you can't set it for a user.
    [string](0..33|%{[char][int](46+("686552495351636652556262185355647068516270555358646562655775 0645570").substring(($_*2),2))})-replace " "

  • Upgrade to ECC6 and Unicode: Passing HTML

    We are experiencing new issues related to output on our Web page URL.
    I'd like to convert the internal text table, t_html (now in Unicode - I assume UTF-16), to contain the old hex values (I assume UTF-8) from before the Unicode upgrade. Or, convert the text table from UTF-16 to UTF-8.
    First, and most important: is this possible?
    If so, I assume these are the steps I need to do, and am asking for suggestions:
    step 1) Transform internal table t_html to hex (so I can convert it from UTF-16 to UTF-8?)
    step 2) Convert the UTF-16 hex values (Unicode) to UTF-8 (old hex, non-Unicode).
    step 3) Transform the converted t_html back to text so I can pass it out in its usual fashion.
    What would be the easiest approach (or not-easiest approach, if necessary) to do this?
    I am open to any and all suggestions; this has been a struggle... Thank you.
    Edited by: TMM on Oct 23, 2009 10:35 AM

    You mean hex stored as a string (0-F)?

  • String utf-16 conversion

    Hello,
    I call an RFC (via XI) and get the result as a flat XML String from R/3.
    This string contains English and Hebrew characters. The English is OK, but the value nodes in Hebrew look like " ׳ ׳�׳™ ׳�׳™׳ ׳™׳�׳¨׳׳˜/׳™׳’׳�׳� "
    How can I convert the string in order to see my language's characters?
    I think the source encoding is UTF-8, and I need to convert it to UTF-16, but I'm not sure about that, nor how to do it.
    Thanks for your help.
    Roni.

    Hi Roni,
    If your RFC returns an XML file as a java.lang.String object, the String itself is by definition UTF-16. To see if the XML file is correct, you could write it to disk and open it with your favourite text editor. Something like this:

    private void save(String xml) throws IOException {
      OutputStream out = new FileOutputStream(new File("/tmp/result.xml"));
      out.write(xml.getBytes("UTF-8"));
      out.close();
    }
    Or use whatever encoding you like that's supported by your text editor, e.g. ISO-8859-8 is the Latin/Hebrew alphabet; see http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html
    Obviously, to correctly see the characters you need the necessary fonts too.
    In case the above doesn't work, it means the xml String you receive from the RFC call is already corrupt. However, if it does work, the RFC as such works correctly.
    The next step is likely to parse the XML to extract the information you need and store it in your Web Dynpro context. An XML parser usually expects bytes (a java.io.InputStream) as input, which means you need to convert the String to bytes and by doing that, you need to choose a character encoding. It could be something like the following.
    SAXParserFactory.newInstance().newSAXParser().parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
    Note that the character encoding you specify here does make a difference. In order not to "confuse" the XML parser, it should match the encoding declared in the XML file's XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
    BTW, what exactly do you do with the received XML to obtain the value nodes?
    Kind regards,
    Sigiswald
