Determine String encoding

Hello folks,
I have a problem with converting data to UTF-8.
My task involves an Oracle database table with a LONG field.
First of all, the data in this table can be Chinese/Japanese/English/Korean encoded.
So when I retrieve the data, I will need to invoke:
String localStr = new String(rs.getBytes(CONTENT_INDEX),"SJIS");
String utf8Str = new String (localStr.getBytes("SJIS"), "UTF8");
where rs is java.sql.ResultSet
Then I will need to store the utf8 String to a new database table.
Since the default encoding is "ISO8859_1" and I have no idea whether the data is
Chinese/Japanese/English/Korean encoded, how can I make the proper conversion?
The data is in a LONG field, so I have to use getBytes() to get the data and
convert it from the local encoding.
So I am asking if there is any way that I can determine what these bytes'
original encoding was?
Is there anything in the Character class that I can make use of?
Please help.

Thank you for your replies.
My problem is that I have no idea whether the row of data I am getting is Big5/SJIS/KSC5601/ISO8859_1 encoded; everything is stored in one table, and no column indicates the encoding used.
My client needs to upgrade their content management system, which is only able to interpret UTF8 data. So the old data needs to be converted to UTF8 and stored in a new table.
E.g. SJIS (Japanese) to UTF8 to SJIS (again when shown in the browser)
Now the problem is how to find out that these bytes were originally SJIS encoded.
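The bytes themselves carry no label, so the original encoding cannot be determined with certainty. What can be done is to rule candidates out. Below is a minimal sketch of that heuristic, assuming the list of candidate encodings is known in advance: decode strictly with each candidate and keep the ones that do not fail. The class and method names (EncodingGuess, plausibleCharsets, toUtf8) are made up for illustration. Note that ISO-8859-1 accepts every byte sequence, so it can never be ruled out this way, and several candidates may accept the same bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class EncodingGuess {

    /** Returns the candidate charset names under which the bytes decode without error. */
    static List<String> plausibleCharsets(byte[] data, String... candidates) {
        List<String> matches = new ArrayList<>();
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // this JRE does not ship the charset
            }
            try {
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                matches.add(name); // decoded cleanly, so it stays a candidate
            } catch (CharacterCodingException e) {
                // the bytes are not valid in this charset, rule it out
            }
        }
        return matches;
    }

    /** Once the source charset is known, transcoding is: decode to String, encode as UTF-8. */
    static byte[] toUtf8(byte[] data, String sourceCharset) {
        String text = new String(data, Charset.forName(sourceCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For the case in the question, the call might look like EncodingGuess.plausibleCharsets(rs.getBytes(CONTENT_INDEX), "Shift_JIS", "Big5", "EUC-KR"). When more than one name comes back, the bytes alone cannot settle it and some outside knowledge about the row is needed.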

Similar Messages

JPA string encoding

How can I tune the String encoding in JPA ?
In some cases I need UTF8 encoding, in some cases I need UTF16.
What annotation/attributes do we assign values to ?
How do we do this? Thanks.

javaUserMuser wrote:
String encoding in the database.
Why - because it is the right thing to do: I have decided that.
Still not clear: what does that have to do with JPA? That's something you would configure in your database, using database-specific tools. The Java layer doesn't really care how the database stores the strings.

How to determine the encoding

In my web application, we capture an Arabic name and store it in an Oracle database.
// while saving
this.arabicDesc=new String(paymentTransactionTypeDetailVO.getArabicDesc().getBytes("ISO8859_1"),"UTF8") ;
// while retrieving from database
paymentTransactionTypeDetailVO.setArabicDesc(new String (arabicDesc.getBytes("UTF8"),"ISO8859_1"));
In all the JSP pages, we are using
<META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=UTF-8">
All these values display correctly in our JSP pages. But when we try to convert this Arabic data to Cp864 in an applet for printing, we get junk. Here we are not explicitly converting the data, but using the PrintStream class like this. Any idea why ????? is printed?
FileOutputStream fos = new FileOutputStream("LPT1");
PrintStream pw = new PrintStream(fos,true,"Cp864");
I checked the database also:
select * from nls_database_parameters
where parameter='NLS_CHARACTERSET';
NLS_CHARACTERSET     UTF8
The data arrives in the applet like this:
<object
id="ReceiptPrinterApplet"
classid="clsid:CAFEEFAC-0015-0000-0007-ABCDEFFEDCBB"
width="0" height="0" >
   <param name="code" value="ReceiptPrinterApplet.class">
   <param name="printermode" value="Broad">
   <param name="Party Name1" value="KUANJIKOMBIL VARGHESE ALEXANDER (720216)- PO Box No:245">
   <param name="Party Name Arabic1" value="�� ">
</object>
Where could the problem be? How can we resolve this issue? We need to convert to Cp864 since the printer supports only Cp864.

Unfortunately, based on your description of how you acquired the data, I believe you may be correct in saying that there is no generic solution. If your database is encoded in UTF-8, and you can confirm that the data is correct in the database, then you should be in reasonable shape - then it is a matter of figuring out where in your code you have some issues.
If, on the other hand, you determine that part of the data is corrupted in the database because it may have been incorrectly converted when it was stored, then it becomes much more difficult to figure out what to do with it, and how to correct it (if that is possible at all - if one of the conversions was to a target encoding where the source code point does not exist, then the data is lost forever).
Converting UTF-8 encoded data to 8859-1 is really only possible for a small subset of data (data actually in Latin-1 scripts), and there is no good reason to do it, unless it is part of an attempt to correct a previous incorrect conversion (and that should only be attempted with extreme caution, in my opinion).
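The two cases described above can be illustrated with a small sketch (the class name and the sample word are illustrative only, not taken from the poster's application). Reading UTF-8 bytes as ISO-8859-1 is reversible, because every byte value maps to some character. Encoding Arabic text to ISO-8859-1 is not, because those characters have no code point there.

```java
import java.nio.charset.StandardCharsets;

final class Latin1RoundTrip {

    /** Undoes one specific mistake: UTF-8 bytes that were decoded as ISO-8859-1. */
    static String repairMisdecodedUtf8(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String arabic = "\u0633\u0644\u0627\u0645"; // sample four-letter Arabic word

        // Reversible case: the bytes are intact, only the interpretation was wrong.
        String garbled = new String(arabic.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(repairMisdecodedUtf8(garbled).equals(arabic)); // prints true

        // Lossy case: ISO-8859-1 has no Arabic, so each letter becomes '?'.
        String lost = new String(arabic.getBytes(StandardCharsets.ISO_8859_1),
                                 StandardCharsets.ISO_8859_1);
        System.out.println(lost); // prints ????
    }
}
```

A repair like repairMisdecodedUtf8 only makes sense when it is known that exactly this wrong conversion happened; applied to correctly decoded text it would itself destroy data.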

Determining string's charset

Hey,
I have an issue where a string fetched from Oracle has a different encoding than a string fetched from MySQL.
I'd like to know the charset of each of the strings.
Are there any Java methods to determine the current charset encoding of a string?
thanks

Strings always use the same encoding. What you need to know is which encoding was used to store the text in the database, so you can decode it correctly. You should be able to query the database for that information.

Problems with string encoding - need the text content in char* format.

The problem is non-ASCII characters, which come out as some sort of Unicode I need to decipher.
Here's what I got:
A text frame object with the TextString "Agnartjørna"
I get the text content of this object into an ai::UnicodeString the following way:
AIErr
VMGetTextOfTextArt( AIArtHandle textArt, ai::UnicodeString &ucStr)
{
    ASUnicode *textBuffer = NULL;
    AITRY {
        TextFrameRef ateTextRef;
        AIX( sAITextFrame->GetATETextFrame( textArt, &ateTextRef));
        ATE::ITextFrame ateText( ateTextRef);
        ATE::ITextRange ateRange = ateText.GetTextRange( true);
        ASInt32 textLen = ateRange.GetSize();
        AIX( sSPBlocks->AllocateBlock( (textLen+2) * sizeof( ASUnicode), nil, (void**) &textBuffer));
        ateRange.GetContents( textBuffer, (ASInt32) textLen+1);
        /* trim off trailing newlines */
        if ((textBuffer[textLen] == '\n') || (textBuffer[textLen] == '\r'))
             textBuffer[textLen] = 0;
        ucStr.clear();
        ucStr.append( ai::UnicodeString( textBuffer, textLen));
        sSPBlocks->FreeBlock( textBuffer);
        textBuffer = NULL;
        AIRETURN;
    }
    AICATCH {
        if (textBuffer) sSPBlocks->FreeBlock( textBuffer);
        AIPROPAGATE;
    }
}
Now, the next step is to convert it into a form that I can use to call regexp.
Basically, I want to detect the ending "tjørna" (meaning small lake) on a map label, and apply a standard abbreviation "tj^a" (with "a" superscripted).
So the problem is to obtain the regexp pattern and the text content in the same encoding. And since the regexp library is old and char* based, I would like to convert the text content into plain old char*.
Hence the following code:
static AIErr
VMAbbreviateTextArt( AIArtHandle textArt,
                         vmTextAbbrevEffectParams *params)
    AITRY {
    /* first obtain the text contents of the textArt */
       ai::UnicodeString ucText;
      const int kTextLen = 256;
      char textContent[kTextLen];
      AIX( VMGetTextOfTextArt( textArt, ucText));
      ucText.as_Roman( textContent, kTextLen);
But textContent now has the value "Agnartj\xbfnna" (According to XCode),
which will not get a match on the pattern "tj([øe][rn])na\\" (with backslash matching the end of the string)
Is there any other way to convert the text content to a plain char* string?

Thank you very much, your method will work fine. with
the "UTF-8" parameter the byte[].length is double,
cause every valid byte is preceded by an -62, but I
will just filter the valid bytes into a new array.
Thanks again,
Stefan
Actually, what you need to do is find the character encoding that your device expects, and then you can code your strings in Arabic.
That's the way Java does things: Strings and char values are always in Unicode (see www.unicode.org), which means \u0600 to \u06ff for Arabic, and a specified character encoding is used when translating these to and from a byte stream.
Each national character encoding has a name. Most of them are identical to ASCII for 0-127 and code their national characters in 128-255.
Find the encoding name for your display and, odds are, the JRE has it in the library.
BTW the character encoding ISO-8859-1 simply maps UNICODE characters 0-255 on to bytes.
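The doubling mentioned in the quoted message, and the ISO-8859-1 mapping, can be checked directly. Here is a small sketch (the class and method names are made up for illustration): characters U+0080 to U+00BF encode in UTF-8 as the byte 0xC2, which Java shows as -62 because bytes are signed, followed by the character's own value, whereas ISO-8859-1 emits that single value alone.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteViews {

    /** Shows the signed byte values a string produces under the given charset. */
    static String bytesOf(String text, Charset charset) {
        return Arrays.toString(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String copyright = "\u00a9"; // COPYRIGHT SIGN, code point 169
        System.out.println(bytesOf(copyright, StandardCharsets.UTF_8));      // [-62, -87]
        System.out.println(bytesOf(copyright, StandardCharsets.ISO_8859_1)); // [-87]
    }
}
```

So rather than filtering the -62 bytes out by hand, it is usually safer to ask for the bytes in the encoding the device expects in the first place.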

Looking for a better way to determine string variable from multiple options

Hi,
I am trying to figure out a better way to determine a string variable from multiple options.
Say i have five pictures each with a different filename: img1 - img5...these file names could be named anything really but for this example i will keep them as img1, img2, img3, img4 and img5.
I want to display a messagebox with the string depending on what a certain variable is.
So for example, we have the number X, if X = 1 then i want the messagebox to show "img1" as the message
Essentially the way I have been doing it so far is:
Private Sub WhichImage()
    Dim ImageName As String = ""
    Dim i As Integer
    If i = 0 Then
        ImageName = "img1"
    End If
    If i = 1 Then
        ImageName = "img2"
    End If
    If i = 2 Then
        ImageName = "img3"
    End If
    If i = 3 Then
        ImageName = "img4"
    End If
    If i = 4 Then
        ImageName = "img5"
    End If
    MessageBox.Show(ImageName, "Name of image", MessageBoxButtons.OK)
End Sub
Up until now this has been fine, but what if I have 50 images? Do I have to do this for all 50? Or is there an easier way, like putting the image names into a text file and reading from the file depending on what the variable i equals? If so, how do I go about this? Does each image name go on a separate line? Can they just be separated by commas instead? Or is there a better way?
Please note that I know I have declared "i" above in my code and not initialised it with anything; in reality "i" comes from somewhere else in the program, so please ignore that part. It is not what I am concerned with.
Thanks
Mersec

Does each image name go on a separate line? can it just be separated by a comma instead? or is there a better way?
Arrays are useful for this.
Dim imagenames() As String = {"img1", "img2", "img3", "img4", "img5"}
Dim imagename As String = imagenames(i)
MessageBox.Show(imagename, "Name of image")
Any sort of collection will do instead of an array, and may be simpler to manage. There are many other options - the most suitable one probably depends on where the names originally come from. For instance, if you are getting them from a folder using the FileSystem.GetFiles method, then they are already in a collection.
If the file names never change then you may as well include them in the program code, using something like the code above. If they can change, then you could use a text file, but that means you need a file update routine. If that is required then the way you store the names will dictate how you access them.

String encoding problem - pls help.

I need to read in a string, let's say aString (in ASCII).
My program will produce another string, bString.
I want string b in Unicode, inserted into string a,
but string a needs to be ASCII.
The structure is like this, for example:
aString = "<datetime id number $insertHere>"
bString = someCharatersInUnicodeFormat
finalString = "<datetime id number someCharatersInUnicodeFormat>"
but finalString is in ASCII.
How do I do this in Java?
Do I need to append it at the bit level?
How do I work at the bit level in Java?
If I want to take a look at the bytes or bits, how do I write them out in Java?
Thanks.

while, of coz it will be fine if i use UTF-8 encoding
for my file.
Then it will much easier to write my chinese word.
But actually, the file is produce for another
system.
and that system only accept ascii file.
If the other system only accepts ASCII files, you can't write Chinese characters.
DrClap already tried to make this clear to you.
What exactly do you mean by ASCII? You possibly mean something different from what DrClap, all the others here, and I mean.
So i need to make up an ascii file with the chinese
word in unicode.
What do you mean by Unicode? Unicode is not an encoding.
Often people say Unicode but actually mean UTF-16, which is "the native internal representation of text in the Microsoft Windows NT, Windows CE, Qualcomm BREW, and Symbian operating systems; the Java and .NET bytecode environments".
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
The example i post up there is the correct output.
Is there anyone know how to make up a such file like
that ?
Independent of what I think of the idea of mixing encodings within one file, you can write the first bytes ASCII encoded and append a "Unicode" encoded part.
FileOutputStream fileOut = new FileOutputStream("file");
// whatever this should be
String unicodeEncoding = "???";
byte[] bytes;
bytes = "ascii part".getBytes("US-ASCII");
fileOut.write(bytes, 0, bytes.length);
bytes = "unicode part".getBytes(unicodeEncoding);
fileOut.write(bytes, 0, bytes.length);

Oracle 9i +Java: Change string encoding from UTF-16 to Windows-1251

Dear colleagues,
I have a very urgent case: need to change encoding of the string retrieved from the file (with encoding UTF-16) to Windows-1251 and put it to db table, to CLOB field.
Code of the Java function
public static void file2table(String sql, String fileName, String characterSet, int asByteArray) throws SQLException, IOException {
    Connection con = null;
    Writer writer = null;
    Reader reader = null;
    try {
        con = getConnection();
        PreparedStatement ps=con.prepareStatement(sql);
        reader = new InputStreamReader(new BufferedInputStream(new FileInputStream(new File(fileName))), characterSet);
        BufferedReader br = new BufferedReader(reader);
        String s;
        while ((s = br.readLine()) != null) {
            byte[] defaultBytes=s.getBytes(characterSet);
            String win1251str=new String(defaultBytes, "windows-1251");
            if(asByteArray>0) {
                ps.setBytes(1, defaultBytes);
                //ps.setBytes(1, win1251str.getBytes("windows-1251"));
            } else {
                ps.setString(1, s);
            }
            ps.executeUpdate();
        }
        con.commit();
    } finally {
        if (reader != null) {reader.close();}
        if (con != null) {con.close();}
    }
}
I checked that all bytes from the file are received correctly. But when I put the bytes that were read into the database table, the resulting text in the table is broken.

>
Yes, currently I already have filled table with all file lines in result table but with incorrect encoding
>
No you haven't - not using the code you posted. You can't save LOB data using only the BLOB or CLOB.
That isn't data that you stored - it is garbage that is being stored as the LOB locator.
I asked you why you were trying to store the data that way instead of the way the doc shows you and you said
>
Because var. s is type of Java String.
For method setClob must be use type of CLOB
>
You are terribly confused about LOBs. A BLOB or CLOB Java datatype is the LOB LOCATOR and doesn't contain any data.
Yes - it is true that the argument to setClob must be of type CLOB, but that CLOB instance HAS TO BE THE LOB LOCATOR - not the data.
You access LOB data using streams. To store LOB data you have to RETRIEVE (not send) a LOB locator from the database and then use the locator's stream to send the actual data.
So if you are creating a new record in the table you typically do an INSERT that includes an EMPTY_LOB() and have the newly created LOB locator returned to you. Then you use that locator's stream to send the actual data.
Since you are not doing that your approach will not work.
Here is a link to the 9i JDBC Dev Guide
http://docs.oracle.com/cd/B10501_01/java.920/a96654.pdf
See page 8-2 to start with
>
BLOB and CLOB data is
accessed and referenced by using a locator, which is stored in the database table and
points to the BLOB or CLOB data, which is outside the table.
To work with LOB data, you must first obtain a LOB locator. Then you can read or
write LOB data and perform data manipulation. The following sections also
describe how to create and populate a LOB column in a table.
The oracle.sql.BLOB and CLOB classes implement the java.sql.Blob and
Clob interfaces, respectively (oracle.jdbc2.Blob and Clob interfaces under
JDK 1.1.x). By contrast, BFILE is an Oracle extension, without a corresponding
java.sql (or oracle.jdbc2) interface.
Instances of these classes contain only the locators for these datatypes, not the data.
After accessing the locators, you must perform some additional steps to access the
data. These steps are described in "Reading and Writing BLOB and CLOB Data" on
page 8-6 and "Reading BFILE Data" on page 8-22.
Note: You cannot construct BLOB, CLOB, or BFILE objects in your
JDBC application—you can only retrieve existing BLOBs, CLOBs,
or BFILEs from the database or create them using the
createTemporary() and empty_lob() methods.
>
Read the above quotes several times until you understand what they are telling you. These are the two main concepts you need to accept:
>
To work with LOB data, you must first obtain a LOB locator.
You cannot construct BLOB, CLOB, or BFILE objects in your JDBC application
>
See the example code and description starting on page 8-11 for how to populate a LOB column in a table
>
Create a BLOB or CLOB column in a table with the SQL CREATE TABLE statement,
then populate the LOB. This includes creating the LOB entry in the table, obtaining
the LOB locator, creating a file handler for the data (if you are reading the data from
a file), and then copying the data into the LOB.
>
Until you start using the proper methodology you are just wasting your time and will not be successful.

String encoding

Hi,
I'm sending a few strings to the DB through a JDBC PreparedStatement, and it looks like it's sending them to the DB using the wrong character encoding. It's connecting to an Oracle 9i DB.
Wondering, is there any way to configure what encoding the driver uses in communicating with the DB? Is there anywhere I could do some more research on character encoding and JDBC drivers in general? I'm totally new to the subject, so excuse the ignorance.
thanks,
J

Is there any where I could do some more research on
character encoding and JDBC drivers in general?
Here.
Oracle - OTN site.
Just in general (using Oracle and character encoding.)

Need information on string encoding

Between NSStrings.h and CFStringEncodingExt.h there are scads of encodings listed but I haven't been able to find any detailed information on them. Does anyone know where I can get such information?
A secondary question: is there a simple way of including 8-bit ASCII into a string in my code? For instance, if I want to say NSString *x; x = @"ß=√π"; how can I do it?
Pete

Pete C wrote:
Between NSStrings.h and CFStringEncodingExt.h there are scads of encodings listed but I haven't been able to find any detailed information on them. Does anyone know where I can get such information?
What sort of information do you want? That isn't a small topic.
Unless you need to read any of those wacky string encodings (such as if you were writing a web browser with compatibility with web sites from 1993) you don't need to worry about any of them except for UTF-8. UTF-8 will handle 95% of your needs. MacOS X resource files and Java text uses UTF-16 for another 4.8% of your needs.
A secondary question: is there a simple way of including 8-bit ASCII into a string in my code? For instance, if I want to say NSString *x; x = @"ß=√π"; how can I do it?
You put that string into a resource file (which is encoded using UTF-16) and then load it using NSLocalizedStringFromTable.
You cannot use 8-bit data in a source file. This is a limitation of the GCC compiler and has nothing to do with a Mac.

XML converted to string - encoding lost?

Hi,
I am using a socket to obtain xml data. Because it is a continuous stream, I need to check for <?xml version...> in order to split the data into parsable chunks because the parser can only parse one xml file at a time.
In order to do this I use a BufferedReader and readline() until I reach the appropriate place. I save the read data in a string and then pass it to the parser.
Do I lose the UTF-8 encoding in this process? I end up receiving the following error:
"Illegal XML character: &#x13" or "Illegal XML character: &#x1e" Also, some other characters seem to be displayed incorrectly in my applet.
How can I solve this problem?

In order to do this I use a BufferedReader and
readline() until I reach the appropriate place. I
save the read data in a string and then pass it to
the parser.
Do I loose the UTF-8 encoding in this process?
Most likely, because the reader wrapped by the BufferedReader uses your system's default encoding unless you tell it otherwise. This is commonly ISO-8859-1 or Windows-something but almost never UTF-8.
But if you know the XML is encoded in UTF-8, the simplest thing to do is to read it as such:
BufferedReader reader = new BufferedReader(new InputStreamReader(yourXML, "UTF-8"));
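A slightly fuller sketch of the same idea, assuming the stream really is UTF-8. The class and method names are hypothetical, and in the demo a ByteArrayInputStream stands in for the socket's input stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Utf8Lines {

    /** Reads all lines, decoding the bytes as UTF-8 rather than the platform default. */
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        // Closing the reader also closes the underlying stream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] sample = "<name>caf\u00e9</name>\n<next/>".getBytes(StandardCharsets.UTF_8);
        List<String> lines = readLines(new ByteArrayInputStream(sample));
        System.out.println(lines.size());                                  // prints 2
        System.out.println(lines.get(0).equals("<name>caf\u00e9</name>")); // prints true
    }
}
```

Passing the charset explicitly keeps the result the same on every machine, whatever its default encoding happens to be.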

XML String encoding - anyone have the the code?

I need to encode strings for use in XML (node values) and replace
items like ampersands, < and > symbols, etc with the proper escaped strings.
My code will be installed on systems where I CANNOT add additional libraries to whatever they may already have.
So, I cannot use JAXP, for example.
Does anyone have the actual Java code for making strings XML compatible ?
I am particularly concerned that the if the string already contains a valid encoding that it is NOT 're-processed' so that this (excuse the extra -'s):
'Hello &-amp-;'
does not become this:
'Hello &-amp-;-amp-;'
Thanks.

It isn't especially difficult code. Here's what you have to do:
1. Replace & by &amp;
2. Replace < by &lt;
3. Replace > by &gt;
4. Replace " by &quot;
5. Replace ' by &apos;
Note that it's important that #1 come first, otherwise you will be incorrectly processing things twice. The order of the rest doesn't matter. (Technically you don't have to do all of these things in all situations -- for example attribute values have different escaping rules than text nodes do -- but it isn't wrong to do them all.)
And note that this is called "escaping", not "encoding".
I am particularly concerned that the if the string already contains a valid encoding that it is NOT 're-processed'
This isn't a valid design criterion. You have to set up your design so that you have unescaped strings and you are creating escaped strings, or vice versa. If you have a string and you don't know if it has already been escaped or not, then that's a design failure.
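The five replacements listed in this reply can be sketched without any extra libraries. This is one possible way to do it, not the only one, and the class and method names are made up. It walks the string once, character by character, so the ordering concern does not arise: nothing that has been written out is ever scanned again. In line with the advice above, it assumes the input is unescaped and will happily escape an already escaped string a second time. It also does nothing about control characters that XML forbids.

```java
final class XmlEscaper {

    /** Escapes the five XML special characters in an unescaped string. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Tom & \"Jerry\" <b>"));
        // prints Tom &amp; &quot;Jerry&quot; &lt;b&gt;
    }
}
```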

Probleme with string encoding

Hello,
I have an application in Flex 3, and when I send a SOAP query I get a correct envelope,
but when I try the same code in Flex 4, the string is encoded:
Example:
Flex 3 :
<ns1:orderLine>
      <ns1:line_nr>2</ns1:line_nr>
      <ns1:productCode>7443</ns1:productCode>
      <ns1:quantity>20</ns1:quantity>
     </ns1:orderLine>
Flex 4 :
<ns1:orderLine>
      <ns1:line_nr>1</ns1:line_nr>
      <ns1:productCode>&lt;productCode xmlns="http://emt.netsoa.netinfluence.com/types/order"&gt;7505&lt;/productCode&gt;</ns1:productCode>
      <ns1:quantity>20</ns1:quantity>
     </ns1:orderLine>
The integer is correct, but the string fails.
How can I correct this?
Jean

Your Flex 4 response translates to the following after unescaping the reserved XML characters:
<ns1:orderLine>
      <ns1:line_nr>1</ns1:line_nr>
      <ns1:productCode><productCode xmlns="http://emt.netsoa.netinfluence.com/types/order">7505</productCode></ns1:productCode>
      <ns1:quantity>20</ns1:quantity>
     </ns1:orderLine>
It's strange that a productCode element appears inside the productCode element itself. Can you check whether this is due to the server returning a malformed response? There is help at this link:
http://anirudhs.chaosnet.org/blog/2009.06.01.html
I can analyze this if you provide the SOAP packet dumps.

Getting String encoding

Hello ,
I need to get the encoding of a String object which is already created, and I don't know which encoding it has. How can I get this encoding? Is there an easy solution? If yes, please give me a piece of code.
thanks

To clarify: in Java, character and String types are stored in Unicode, so the actual codes should always be consistent whatever languages you're using, and you shouldn't need to know what coding is used. Indeed I'd regard it as bad practice to write code which depends on the specific codes; there are plenty of classification tests in the Character class.
When text is changed to or from a sequence of bytes, that's when encoding becomes an issue. Of course a file is a sequence of bytes, so encoding also applies when text data is read from or written to a file.
So whenever you read or write text to or from a file or a byte array, a specific encoding will be used (even if you allow it to default to the standard encoding of the system you're running on).
Most encodings can only cope with a subset of the Unicode characters. The exceptions are UTF-8 and UTF-16. If a character can't be converted, Java normally substitutes a "?".
Character encodings tend to come with nice, readily memorable names like ISO-8859-1.
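The "?" substitution can be detected before it happens. A minimal sketch, with made-up class and method names, that asks a charset's encoder whether it can represent a given String:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodableCheck {

    /** True if every character of the text can be represented in the given charset. */
    static boolean fitsIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 \u65e5\u672c"; // Latin letters plus two kanji
        System.out.println(fitsIn(text, StandardCharsets.ISO_8859_1)); // prints false
        System.out.println(fitsIn(text, StandardCharsets.UTF_8));      // prints true

        // Encoding anyway silently replaces each kanji with the byte for '?'.
        byte[] lossy = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(lossy[lossy.length - 1] == (byte) '?');     // prints true
    }
}
```

Checking first lets the program report the problem instead of quietly writing question marks.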

Database String & encoding

I am able to read national characters from an Oracle database using rs.getString() and store them in a String (perhaps it is UTF). They are printed correctly using System.out.println in the Tomcat console, but how do I convert them into cp1250 for the generated JSP?
I tried to use
new String(rs.getString(1).getBytes("iso-8859-1"),"Cp1250") but it changes everything to ? (question marks).
Does anybody have advice on how to solve this problem?
Thanks Tomas

How to convert the String to cp1250?
byte[] bytesInCp1250 = rs.getString(1).getBytes("Cp1250");
Don't make the mistake of thinking you can have a cp1250 String. You can't. Strings don't have an encoding, only byte arrays can.
