Conversion from Unicode to UTF8

I want to convert a string containing Unicode characters into a string of UTF-8 characters. I used the following code snippet:
try {
    String str = new String(givenString);
    String utfStr = new String(str.getBytes("UTF-8"), "UTF-8");
    System.out.println("Converted:" + str + " to:" + utfStr);
} catch (Exception e) {
    e.printStackTrace(System.out);
}
I also tried:
Charset utf8Charset = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8Charset.newEncoder();
CharsetDecoder decoder = utf8Charset.newDecoder();
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(givenString));
CharBuffer cbuf = decoder.decode(bbuf);
String dest = cbuf.toString();
When Java tries to encode Unicode to UTF-8 and runs into an unknown character (typically a character in the high-ASCII range), it substitutes '?' or some other weird character.
How do I prevent this?

Where is this string coming from? Are you initializing it in your source code as a String literal?
String str = "Aéroport Princesse Béatrix";
If so, you need to make sure the .java file is saved in an encoding that can handle all of the characters. ISO-8859-1, windows-1252, and of course, UTF-8 will all suffice. You also need to make sure the compiler reads the source file with the correct encoding. For example, if you saved your source files as UTF-8, you would do this:
javac -encoding UTF-8 *.java
Finally, before you print the text to the console, you need to make sure the console is using an encoding that can handle it. On my WinXP box, the default encoding (or codepage, as they call it) for console windows is cp437, which doesn't support accented characters. You can change it with the "chcp" command, like so:
chcp 1252
Unfortunately, chcp won't accept UTF-8 or any other Unicode encoding, but cp1252 can handle the accented characters in your string. Note that you don't need to specify that encoding in your code; the Java runtime detects it automatically.
> If you see question marks or some other placeholder character when viewing output, that's probably because the terminal or whatever doesn't have the fonts available to render those characters.
No, question marks always indicate an encoding problem. If the character is valid but the font lacks a glyph for it, it shows up as a little rectangle.
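To illustrate the point about question marks, here is a minimal Java sketch (the class and method names are made up for this example). It suggests that a UTF-8 round trip like the one in the question leaves the string unchanged, and that '?' appears only when the text is encoded into a charset that cannot represent a character. US-ASCII stands in here for whatever limited encoding the console or file actually uses, which is an assumption.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original thread.
final class QuestionMarkDemo {

    // Encode to UTF-8 and decode again. For well-formed strings this
    // returns an equal string, so it cannot be the source of '?'.
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // String.getBytes silently replaces unmappable characters with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static String lossyAscii(String s) {
        return new String(s.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);
    }

    // A strict encoder reports unmappable characters instead of replacing them.
    static boolean fitsInAscii(String s) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Unicode escape keeps this source file independent of its own encoding.
        String s = "A\u00e9roport";
        System.out.println(roundTripUtf8(s).equals(s));
        System.out.println(lossyAscii(s));
        System.out.println(fitsInAscii(s));
    }
}
```

If the strict encoder reports a failure for the target encoding, the fix is to change the target encoding (console, file, or database), not to convert the String again.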

Similar Messages

How to convert from UNICODE (UTF16) to UTF8 and vice-versa in JAVA.

Hi
I want to insert a string in UTF16 format into the database. How do I convert from UTF16 to UTF8 and vice versa in Java? What type must the database field be? Do I need a special setup for the database (Oracle 8i)?
thanks
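For the Java half of this question, a minimal sketch of converting between UTF-16 and UTF-8 byte arrays might look like the following. The class name is hypothetical, and big-endian UTF-16 without a byte order mark is assumed; the database column type and Oracle setup are not addressed here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; assumes UTF-16 data is big-endian with no BOM.
final class Utf16Utf8 {

    // Decode the UTF-16BE bytes into a String, then encode that String as UTF-8.
    static byte[] utf16ToUtf8(byte[] utf16) {
        return new String(utf16, StandardCharsets.UTF_16BE).getBytes(StandardCharsets.UTF_8);
    }

    // Decode the UTF-8 bytes into a String, then encode that String as UTF-16BE.
    static byte[] utf8ToUtf16(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] utf16 = "\u65e5\u672c".getBytes(StandardCharsets.UTF_16BE); // two Japanese characters
        byte[] utf8 = utf16ToUtf8(utf16);
        System.out.println(utf16.length + " UTF-16 bytes, " + utf8.length + " UTF-8 bytes");
        System.out.println(Arrays.equals(utf16, utf8ToUtf16(utf8)));
    }
}
```

A Java String is already a sequence of UTF-16 code units internally, so in most programs the only explicit conversion needed is at the point where bytes are read or written.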

I'm not sure if this is the correct topic, but we are having problems accessing our Japanese data stored in UTF-8 in our Oracle database using the JDBC thin driver. The data is submitted and extracted correctly using ODBC drivers, but inspection of the same data retrieved from the GetString() call using the JDBC thin driver shows frequent occurrences of bytes like "FF", which are not expected in either UTF8 or UCS2. My understanding is that accessing UTF8 in Java should involve NO NLS translation, since we are simply going from one Unicode encoding to another.
We are using Oracle version 8.0.4.
Can you tell me what we are doing wrong?

Problem in Database conversion from US7ASCII to UTF8

Hi,
We are facing the following problem while converting the database from US7ASCII to UTF8:
We recently changed the database character set from US7ASCII to UTF8 for internationalization. We ran the Character Set Scanner utility and it did report that some data may pose problems.
We followed the process below to convert to UTF8:
1) alter database character set utf8
2) alter database national character set utf8
Now we find problems when working with the old data in our Java-based application.
We are getting the following error "java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv".
We further analyzed our data and found some interesting things :
e.g.
DB - UTF8.
NLS_LANG also set to UTF8.
Select name from t1 where name like 'Gen%';
NAME
Genhve
But when we check the length of the same data, it shows this:
NAME LENGTH(NAME) VSIZE(NAME)
Genhve 4 6
The question is why it shows the length as only 4, and why the substr function extracts the following:
select name,substr(name,4,1) from t1 where name like 'Gen%';
NAME SUB
Genhve hve
We executed the above queries on the US7ASCII DB and they work fine: the length shows as 6 and SUBSTR extracts just 'h'.
We also used the dump function on the UTF8 DB for the above query; this is the result:
select name,length(name),vsize(name),dump(name) from t1 where name like 'Gen%';
NAME LENGTH(NAME) VSIZE(NAME) DUMP(NAME)
Genhve 4 6 Typ=1 Len=6: 71,101,110,232,118,101
We checked the data extensively and it seems the accented e (byte 232 in the dump, displayed here as 'h') is causing the problem.
We want to know where the problem is and how to overcome it.
Further, we tried all of the following:
1)
Export Server: US7ASCII
Export Client: did not set NLS_LANG / NLS_CHAR, so presumably US7ASCII as well
Import Client: did not set NLS_LANG / NLS_CHAR, so presumably US7ASCII as well
Import Server: UTF8
RESULT: Acute e became h
2)
Export Server: US7ASCII
Export Client: did not set NLS_LANG / NLS_CHAR, so presumably US7ASCII as well
Import Client: NLS_LANG=AMERICAN_AMERICA.UTF8 and NLS_CHAR=UTF8
Import Server: UTF8
RESULT: IMP 00016 error
3)
Export Server: US7ASCII
Export Client: NLS_LANG=AMERICAN_AMERICA.UTF8 and NLS_CHAR=UTF8
Import Client: did not set NLS_LANG / NLS_CHAR, so presumably US7ASCII as well
Import Server: UTF8
RESULT: Acute E became h
4)
Export Server: US7ASCII
Export Client: NLS_LANG=AMERICAN_AMERICA.UTF8 and NLS_CHAR=UTF8
Import Client: NLS_LANG=AMERICAN_AMERICA.UTF8 and NLS_CHAR=UTF8
Import Server: UTF8
RESULT: Acute e became h
5)
Tried using Update sys.props$
set value$='UTF8'
where name='NLS_CHARACTERSET'
RESULT: Acute e shows properly but it gives problem in the application
"java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv"
Looking further, we observed the following:
When you run this command on a column 'city' in a table that contains 'Genhva' (note the acute e after n), it shows:
command: select length(city), vsize(city),substr(city,4,1),city from cities
Result: 4 6 hva Genhva
The value of substr(city,4,1) shows the problem. Also note that the length shows 4 and the size shows 6. Moreover, when these records are selected through the JDBC driver, it starts giving problems.
6)
Actually, the above (point no. 5) is similar to changing the character set of the database with 'ALTER DATABASE CHARACTER SET UTF8'. The same problem is observed then too.
7)
We have also tried another method: changing the third byte of the export file, which specifies the character set, to the UTF8 code by editing the export file with a hexadecimal editor. After import, the same problem was observed as described in (5) and (6) above.
We have no more ideas for how to migrate without corrupting the data. Of course, we know which records contain these characters from Oracle's csscan utility, but we do not want to manually fix each and every record by replacing the character with an ASCII one. Any other ideas as to how this can be accomplished?
Thanx
Ashok

The problem you have is that although your original database is defined as US7ASCII, the data it contains is more than is covered by this code page (as the reply on Sept 4 to the previous posting already said).
This has probably happened because the client was also defined as US7ASCII, and when the DB and client are defined as having the same character set, no conversion (or checking) takes place when data is passed between them. However, if you are using a Windows client then it will in fact be using Windows code page 1252 (Latin-1) or similar, and this allows many more characters, including è (accented e). So a user can enter all these characters and store them in the database, and similarly read them from the database, because data transfer is transparent.
When you did ALTER DATABASE CHARACTER SET UTF8, this only changed the label on the database and did not affect the contents. However, only part of the contents is valid UTF8; any character above 7F (like è) is invalid. If your original client now uses the database, code page transformation will take place because the client and DB have different character sets defined. The invalid codes can then cause problems.
Without being able to explain in detail what has happened, it may help to see what your è (dec 232, x'E8') looks like. The actual data has not changed (you can see this because it is still reported as 232). However, the binary code there (11101000) is invalid UTF8. UTF8 encodes a character in 1 to 4 bytes, and the leading bits of the first byte tell how many bytes it uses: 0xxx means one byte (the same as the corresponding US-ASCII character), 110x means 2 bytes, 1110 means 3 bytes, and so on. So if you interpret what is there as UTF8, it looks like the first byte of a 3-byte character, which explains why the substring gives you the other 2 bytes as well.
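The lead-byte rule described above can be sketched in Java as follows. This is only an illustration: the class and method names are invented, and the byte values are the ones from the DUMP output in the question (71,101,110,232,118,101).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating how a UTF-8 lead byte is interpreted.
final class Utf8LeadByte {

    // Returns the sequence length announced by the leading bits of a byte,
    // or -1 for a continuation byte (10xxxxxx) or a byte that cannot start a sequence.
    static int announcedLength(int b) {
        b &= 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    // Strictly decodes the bytes as UTF-8; returns false if they are malformed.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] stored = {71, 101, 110, (byte) 232, 118, 101}; // bytes from the DUMP output
        System.out.println(announcedLength(232)); // 0xE8 = 11101000 announces a 3-byte sequence
        System.out.println(isValidUtf8(stored));  // 118 and 101 are not continuation bytes
    }
}
```

Byte 232 claims two continuation bytes, but the next two bytes (118, 101) are plain ASCII, so a strict decoder rejects the sequence. That is consistent with LENGTH reporting 4 characters for 6 bytes and with the JDBC conversion failure.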
Can you fix this without losing data? I believe yes. First you should check what other characters are being flagged by the scan. See if these are contained in another standard character set. If they are only Western European accented characters then WE8ISO8859P1 is probably OK, but watch out for the euro sign, which Windows has at x'80', an undefined character in ISO8859-1.
You can see the contents of the Microsoft Windows Codepage 1252 at: http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
For a listing of the US-ASCII defined characters see http://czyborra.com/charsets/iso646.html and for ISO 8859-1 see http://czyborra.com/charsets/iso8859.html#ISO-8859-1
If all is well, you can first ALTER DATABASE CHARACTER SET to WE8ISO8859P1. This will not change any data, but it ensures that all the data gets exported. Then export the DB and import it into a UTF8 DB. This will convert the non-US-ASCII characters to Unicode. You will also have to change the client's character set to something other than US7ASCII, or the clients will just see ? for the other characters.
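What the export/import conversion does to each byte can be imitated in Java. This is a sketch only: it assumes the stored bytes really are ISO-8859-1 (as suggested above), it says nothing about how Oracle's exp/imp work internally, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: re-labelling bytes versus actually converting them.
final class Latin1ToUtf8 {

    // Interpret the bytes as ISO-8859-1 and re-encode the resulting text as UTF-8.
    static byte[] convert(byte[] latin1) {
        return new String(latin1, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] stored = {71, 101, 110, (byte) 232, 118, 101}; // bytes from the DUMP output
        byte[] utf8 = convert(stored);
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        // The single byte E8 becomes the two-byte sequence C3 A8.
        System.out.println(hex.toString().trim());
    }
}
```

The byte count grows from 6 to 7, which is one reason column widths defined in bytes may need checking before such a migration.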
Good Luck!

Convert from utf16 to utf8 ?? er?

Dear list,
I have recently seen a sample that converts a utf16 string to utf8. I am a little bit confused. I thought utf16 was a superset of utf8. Could someone please explain why this is sometimes necessary?
regards
Ben

> how can utf16 be a superset of utf8. I thought this relationship was similar to ASCII and utf8/utf16, where for example the space bar has a value of 32 in ASCII and Unicode (utf8 and utf16).... This being the case there is not much need for a utf8 to utf16 conversion program.
I didn't say it was a superset. It is a different way of representing the same thing.
> You say that utf16 is ALWAYS 2 bytes, and utf8 is usually 8 bits but is variable when necessary. Is utf16 not a variable byte character set ?
Not for most characters: each one takes 2 bytes. The exception is supplementary characters, which take a surrogate pair (4 bytes).
> The name according to this, utf8 and utf16 is somewhat misleading as they are NOT always 8 or 16 bits.
And "java" is neither an island nor a beverage. The name does not convey the entirety of the subject.
>> characters the first byte (or 2) is an 'escape' byte which means that more bytes are needed.
> What do you mean by first or (2). escape byte?
When something sees a given specific byte then it knows that there are a certain number of bytes after that are needed to fully represent the character.
> I am still not convinced
Convinced?
If you do not find my explanation satisfactory then you might try writing some code that converts to UTF16 and UTF8 using String.getBytes(String).
You might also try to find the character set definitions.
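Following the suggestion to experiment with String.getBytes, a small sketch could look like this. The class name and the sample characters are chosen for illustration, and UTF-16BE is used so that no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the same characters need different byte counts in UTF-8 and UTF-16.
final class EncodingLengths {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16BE is used so the result does not include a byte order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, ASCII letter
            "\u00e9",       // U+00E9, e with acute accent
            "\u65e5",       // U+65E5, a CJK character
            "\uD83D\uDE00"  // U+1F600, outside the Basic Multilingual Plane
        };
        for (String s : samples) {
            System.out.println(utf8Length(s) + " bytes in UTF-8, "
                    + utf16Length(s) + " bytes in UTF-16");
        }
    }
}
```

The byte sequences differ even where the character values agree (a space is the single byte 32 in UTF-8 but the two bytes 0 and 32 in UTF-16BE), which is why a conversion step is needed whenever data moves from one encoding to the other.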

Conversion from ASCII to UTF8 on Oracle 8.1.7 via PLSQL

I need to extract a string from an ASCII DB, put it into a variable in a PL/SQL procedure, then convert it into UTF8 with a 'magic box' and put it into a new UTF8 database.
I need the magic box. Does a tool, package, or procedure exist that works like that?
Thanks in advance!

I suggest posting this message on the general RDBMS or PL/SQL forums.
Kuassi

Conversion from ASCII to UTF8

How do we convert Extended ASCII characters to UTF8 without using the ALTER DATABASE CHARACTER SET command?

Is the convert function what you need? http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14200/functions027.htm#i77037
SQL> select convert('a','utf8','us7ascii') from dual;
C
a

Vietnamese Language Support Services Menu Convert Encoding from Unicode-VNI

I am running :
Macbook 13 unibody
Clean install Snow Leopard
I have Vietnamese input method in Languages and Input Method Systems Preferences on
Problem :
When I was running Leopard, I somehow was able to see in the Services Menu "convert from unicode to vni," and other encodings, and vice versa while working in TextEdit, MS Word, and other applications that supported it.
After doing a new and clean install on my machine, I am no longer able to see these options in the Services Menu, nor am I able to find where and how to enable this in System Preferences.
Please HELP HELP HELP it was an absolute life saver for me to work with all the different encodings and input methods for Vietnamese language.

omg that was it. I had already installed the app from that link, but it didn't show up in Services because I guess I didn't log out properly for the installation to finish adding the service items. And since I wasn't able to find the services after installation, I assumed they must have been part of Leopard. Because you suggested it, I installed again, logged out as I was supposed to, and voila: the Vietnamese encoding conversion items are in the Services Menu. Thank you.

How to fix "cannot convert between unicode and non-unicode string data types" :/

Environment: SQL Server 2008 R2
Introduction: Staging_table is a table where data from the source file is stored. Individual and ind_subject_scores are the destination tables.
Purpose: To load the data from a source .csv file. Although SSIS defines the table fields as varchar(50), I still want to transfer the data to the entity/destination tables while keeping their table definitions.
I'm getting validation error "Cannot convert between a unicode and a non-unicode string data types" for all the columns.
Please help

Hi ,
NVARCHAR = DT_WSTR
VARCHAR = DT_STR
Try below links:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/ed1caf36-7a62-44c8-9b67-127cb4a7b747/error-on-package-can-not-convert-from-unicode-to-non-unicode-string-type?forum=sqlintegrationservices
http://social.msdn.microsoft.com/Forums/en-US/eb0d1519-4be3-427d-bd30-ae4004ea9e8d/data-conversion-error-how-to-fix-this
http://technet.microsoft.com/en-us/library/aa337316(v=sql.105).aspx
http://social.technet.microsoft.com/wiki/contents/articles/19612.ssis-import-excel-to-table-cannot-convert-between-unicode-and-non-unicode-string-data-types.aspx
sathya - www.allaboutmssql.com

Convert from NVARCHAR2 to Unicode in SQL Plus

I need to convert from NVARCHAR2 column data to Unicode format in a query. How can I do this?

> I need to convert from NVARCHAR2 column data to Unicode format in a query
Maybe with convert:
SQL> select convert(n'ABC', 'utf8') abc from dual;
AB
┐
1 row selected.?

Importing from an Oracle database converted from non-unicode to unicode - SSIS 2012

Hi,
I have some SSIS 2012 packages that read from a non-Unicode Oracle database into a non-Unicode SQL Server database.
In a few days, the Oracle database will be converted to Unicode, so I need to update the SQL Server database and the SSIS packages. Obviously, no data conversion transformations are present in the packages, and I'd like to avoid adding any.
As a first step, I'm trying to convert a SQL Server table to Unicode format, but it isn't possible to convert between non-Unicode and Unicode string data types.
I did not expect an error for the conversion from the non-Unicode Oracle database (not yet converted) to the Unicode SQL Server database: such a conversion doesn't lose any information. To me, an error would make sense only for a Unicode to non-Unicode conversion.
Any suggestions for solving this issue with minimal development effort? Many thanks

Nope. Once you change the data types to Unicode, you have to refresh the metadata settings within the SSIS packages for them to work. SSIS doesn't have the ability to make metadata changes at runtime, so someone has to update the package metadata to Unicode to make sure the packages work correctly after the changes.
What you can do is create test DBs in Oracle and SQL Server in Unicode and create modified packages based on them. Once you make the changes to the production DBs, move these modified copies to production as well, after making the necessary config value changes (server name, DB name, folder paths, etc.). They will then continue to work normally without causing any further downtime for the customer.
Visakh - http://visakhm.blogspot.com/

Convert from std::string to CString in UNICODE builds

Actually I tried all ways but they didn't help me.
Is there any working code to convert from std::string to
CString in UNICODE builds?
Thanks.
Mirjalal

Using non-Unicode string literals in VC++ is asking for trouble: the exact result depends on the codepage of the system used to compile the program and the codepage of the system used to run it. For example, on my system I get 2 question marks instead of one, one for ş and one for ə.
I also get two compilation warnings:
warning C4566: character represented by universal-character-name '\u015F' cannot be represented in the current code page (1252)
warning C4566: character represented by universal-character-name '\u0259' cannot be represented in the current code page (1252)
The easiest solution is to use Unicode string literals and std::wstring:
wstring z = L"nüşabə";
CString cs(z.c_str());
nameData.SetWindowTextW(cs);
If you can't do that, things will get complicated. If you need help with this, please give more information. Knowing the compiler version and the system codepage would be useful.

Convert 10.2.0.4 RDBMS from WE8ISO8859P1 to UTF8 without install new langua

We are on E-Business Suite 11.5.10.2 with a 10.2.0.4 RDBMS. We are thinking of taking advantage of a downtime to convert the database character set from WE8ISO8859P1 to UTF8 now, even though we only want to install and configure a new language in a year or two. I have a lot of questions and hope someone can answer them:
1) Is it OK to convert the database character set without doing anything on the apps side? Not even changing any setting in apps?
2) I know Oracle recommends AL32UTF8 for E-Business Suite, but for Rel 12 only. Do I have the right information?
3) I found a post in one of the forums here where someone used CSSCAN to scan the database but realized it only changes the metadata, not the user data. I thought CSSCAN only scans to report potential data problems and should not change anything in the database? I wonder whether he meant CSALTER instead, but would that change only metadata?
4) I also read that some people use expdp/impdp to perform the character set conversion, but I also read that Oracle recommends using the exp/imp utilities only. Is that right? Which is the better method for character set conversion if both are valid paths?
Thanks, Fushan

Hi
Apart from Srini's inputs, here are mine.
*1) Is that ok to convert the database character set without doing anything in the apps side? Not even change any setting in apps?*
No issues to convert without adding new language in EBS.
*2) I know Oracle is recommend AL32UTF8 for E-business suite.. but for Rel 12 only. Am I have the right information?*
Please note that AL32UTF8 is not certified for Oracle E-Business Suite 11i.
Note.124721.1 Migrating an Applications Installation to a New Character Set:
This is documented in Note:222663.1
Note 179133.1 The correct NLS_LANG in a Windows Environment
Note 264157.1 The correct NLS_LANG on Unix Environments
*3) I found someone post in one of the Forum in here .. that he use CSSCAN to scan the database but realized it only change the metadata not the user data..... I though CSSCAN is only for scanning to report potential data.. should not change anything in database? I wonder he mean CSaLTER instead.... but only metadata huh??*
Please follow the notes and blogs below; you will get a clear picture.
Changing the Database Character Set ( NLS_CHARACTERSET ) [ID 225912.1]
Note 276914.1 The National Character Set in Oracle 9i and 10g
Note 458122.1 Installing and Configuring Csscan in 8i and 9i (Database Character Set Scanner)
Note 745809.1 Installing and configuring Csscan in 10g and 11g (Database Character Set Scanner)
Note 444701.1 Csscan output explained
http://www.oracle-base.com/articles/10g/CharacterSetMigration.php
http://repettas.wordpress.com/2008/05/16/national-character-set-in-oracle-9i-and-10g/
http://avdeo.com/2010/11/01/converting-migerating-database-character-set/
*4) I also read some people use expdp/impdp to perform the character set conversion... but I also read that oracle recommend to use exp/import utility only. Is that right? What is the better method for character set conversion if both of them are valid path?*
yes.
Note 227332.1 NLS considerations in Import/Export - Frequently Asked Questions

Converting server from Big5 to UTF8.

I tried to convert an Oracle 8i server from Big5 to UTF8 by using "alter database character set utf8", but it tells me "ORA-12712: new character set must be a superset of old character set". Is there any way to force this conversion? I understand there will be data conversion problems, which I don't care about.
Rick Lin

Why would you want to do this? In UTF8 all your 2-byte Chinese BIG5 characters will increase to 3 bytes in length. Basically, each character needs to be converted when you migrate your database to UTF8, since its current binary representation is no longer valid.
ALTER DATABASE CHARACTER SET doesn't perform any character conversion. If your operation proceeded without the error, all your Chinese data would be lost!
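The size difference mentioned above can be checked from Java with a short sketch like the one below. The class name is invented and the sample character (U+4E2D) is just an example. Big5 is not one of the charsets every Java runtime is required to provide, so the code tests for it first.

```java
import java.nio.charset.Charset;

// Hypothetical demo: byte length of one Chinese character in Big5 versus UTF-8.
final class Big5VersusUtf8 {

    // Returns the encoded length of s in the named charset, or -1 if this
    // Java runtime does not provide that charset.
    static int encodedLength(String s, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String sample = "\u4e2d"; // a common Chinese character
        System.out.println("Big5:  " + encodedLength(sample, "Big5"));
        System.out.println("UTF-8: " + encodedLength(sample, "UTF-8"));
    }
}
```

Because the byte values themselves change, not only their count, relabelling the database character set without converting the data leaves bytes that a UTF8 reader cannot interpret.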

JDeveloper3.1.1.2 problem:convert from UTF8 to UCS2 failed

Oracle team:
I tested JDeveloper3.1.1.2 and it has a problem at runtime: conversion from UTF8 to UCS2 failed, AttributeLoadException. (Our language is Chinese.)
I found that oracle\jbo\server\QueryCollection.class in dacf.zip may be the problem. When I use this class from JDeveloper3.1 to replace the class of the same name in JDeveloper3.1.1.2, the above problem disappears, but because this class is not suited to JDeveloper3.1.1.2, other problems appear.
So I hope you can fix this problem so that it runs correctly.

I searched this forum and the SQLJ/JDBC forum, and found a few occurrences of this problem. Among the things people suggested:
* Changing JDBC drivers (experience varied as to which one fixed the problem)
* Adding nls_charset1x.zip to your CLASSPATH
* Ensuring you're using the same character set on the client and server.
I suggest you take a look at the following discussion threads: http://technet.oracle.com:89/ubb/Forum8/HTML/001810.html http://technet.oracle.com:89/ubb/Forum8/HTML/000065.html http://technet.oracle.com:89/ubb/Forum2/HTML/000820.html
Blaise

Language Conversion from Unicode 8 to Character Set

Hi,
I am creating a file programmatically containing Vendor Master data (FTP interface).
The vendor name and vendor address are maintained in the local language (Taiwanese) in the SAP system; these characters are in the Unicode 8 character set.
The Unicode characters should be converted to BIG5 for Taiwanese before this information is written to the file.
How can I perform this conversion and change the character set of the values I'm retrieving from table LFA1 to BIG5?
Is it possible to do this conversion in SAP? Does SAP allow this?
/Mike

Hi Manik,
I have a similar requirement: I need to convert Unicode Chinese characters to GB2312-encoded Chinese characters. I already posted in the forums but didn't get the required solution.
Can you please provide the solution you implemented and confirm whether it can be used to solve the above problem?
Hoping for your reply.
Regards,
Prakash
