Java, International language, Unicode, Chinese or accented characters

Hi, I've posted a message in the "new to" forum, but I think this is more of an advanced problem.
Sorry for the duplication.
I am having trouble getting Java to output Strings that contain Unicode characters outside the normal ASCII set. This includes accented letters and Chinese characters.
Example code:
String jintian = "??";
System.out.println(jintian);
(The string jintian should contain two Chinese characters. I'm having trouble getting these to appear properly here; you may see ??.)
Basically, if I run this code the output I get is ??: question marks in place of the Chinese characters.
There is a problem either with Java or with my shell, but I don't know how to troubleshoot this.
I can tell you that my shell is able to display Chinese characters, but with the ls command I have to use a flag, -v. I have tried using this -v flag with the java command but it is rejected.
One more thing: when compiling I am using the -encoding UTF-8 flag.
I have the line
export LANG=en_UK.UTF-8
in my bash profile.
Any ideas? I would be really grateful.
Message was edited by:
stanton_ian
In case anyone is interested in trying this themselves and doesn't know how to enter the Chinese characters, here is a link to a pre-written, very simple class: http://freespace.virgin.net/i.stanton/NiHao.html .
You will need to save the file in UTF-8 encoding and compile it with the -encoding UTF-8 flag.

Between JDK 1.2 and JDK 1.3 the default encoding of the VM changed.
You can get it by executing:
System.out.println("Default Encoding:" + System.getProperty("file.encoding"));
or
System.out.println("Default Encoding:" + (new java.io.InputStreamReader(System.in)).getEncoding());
The default encoding is used during the conversion of bytes to strings and vice versa.
Assume your default encoding is ISO8859_1. Then calling new String(byte[]) is equivalent to calling
new String(byte[], "ISO8859_1")
If you convert a character from one encoding scheme to another and there is no mapping
for this character in the target scheme, the character will be replaced by a default character,
which is (quite often) the question mark.
You can set the default encoding for a VM by passing it as a command-line parameter:
java -Dfile.encoding=ISO8859_1
java -Dfile.encoding=Cp1252
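A minimal sketch of the replacement behaviour described above, assuming a reasonably modern JDK. The class name EncodingDemo and the helper roundTrip are made up for illustration; the two Chinese characters are written as \u escapes so the source file encoding does not matter. Wrapping System.out in a PrintStream with an explicit encoding is one possible way to avoid depending on the default, provided the terminal really expects UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: encode to bytes and decode again with the same charset.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String jintian = "\u4eca\u5929"; // two Chinese characters, written as escapes

        // Shows what the VM picked up as its default.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // ISO-8859-1 has no mapping for these characters, so each becomes '?'.
        System.out.println(roundTrip(jintian, StandardCharsets.ISO_8859_1)); // prints ??

        // UTF-8 can represent them, so the round trip is lossless.
        System.out.println(roundTrip(jintian, StandardCharsets.UTF_8).equals(jintian)); // prints true

        // Assumption: the terminal expects UTF-8. Then an explicitly encoded
        // PrintStream writes the right bytes regardless of file.encoding.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(jintian);
    }
}
```

If the first line reports something like ISO-8859-1 or US-ASCII, the question marks come from the VM's default encoding rather than from the shell.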

Similar Messages

Greek language notes display loses accented characters

I can type in Greek in TextEdit, save the file as Unicode (UTF-16), and drag it into my iPod Notes folder.
In OS10.4 I can use accented characters in TextEdit but, when the file is loaded onto the iPod, it omits the accented characters. They aren't displayed incorrectly - they simply aren't displayed at all.
Anyone ever tried uploading Greek to an iPod? Any Greek users out there?
Martin

Yes, the accented characters appear OK on my wife's video iPod.
I suppose there's no chance that Apple will update the software on my not-very-old color iPod?
No, I thought not. Money-grabbing *****s.
I love Apple but not ALL the time.

Issue with ENEQuery java api and searching terms with accented characters

Hi,
We are using ENEQuery to query the MDEX Engine. When search terms contain accented characters (like á, í, etc.), the term that gets sent to the dgraph is still encoded, even though the terms are decoded (using java.net.URLDecoder). For example, a search for "sofá" appears in the dgraph logs as "sof%c3%a1" and returns 0 results.
ENEQuery query = new ENEQuery();
final ERecSearchList searches = new ERecSearchList();
final ERecSearch eRecSearch = new ERecSearch("search interface name", "term");
searches.add(eRecSearch);
query.setNavERecSearches(searches);
Any suggestions?
Thanks

Hi,
Does your indexed data (which you hope to match) contain "sofá" or "sofa" (no diacritic)? If the latter, and in general, you may benefit from the dgidx flag --diacritic-folding, as described in the documentation section "Mapping accented characters to unaccented characters". If you are running the latest version, this is all that should be required to generate a match.
Best
Brett
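For reference, a small plain-Java sketch of the two ideas in this thread: decoding the percent-encoded term with an explicit charset, and what "folding" a diacritic means. The class and method names are invented for illustration, and the folding shown here uses java.text.Normalizer only to illustrate the concept; it is not a claim about how dgidx implements --diacritic-folding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.text.Normalizer;

class SearchTermDemo {
    // Decode a percent-encoded term, naming the charset explicitly
    // instead of relying on the platform default.
    static String decodeTerm(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Illustration only: decompose accented letters (NFD) and drop the
    // combining marks, so "sofá" becomes "sofa".
    static String foldDiacritics(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String term = decodeTerm("sof%c3%a1");
        System.out.println(term.equals("sof\u00e1")); // prints true
        System.out.println(foldDiacritics(term));     // prints sofa
    }
}
```

If the value logged by the dgraph still contains %c3%a1, it may be worth checking whether the term is being encoded a second time after this decoding step.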

Spanish accent characters are not displayed proper in excel file

Hi
I have written a program which displays a report file (.xls file) in the browser.
This report file is in Spanish and contains a few accented characters like
("� � � � � � �"). These characters are not displayed properly in the Excel file that is opened in the browser. I have used UTF-8 encoding and also tried some of the other available encodings, but the problem persists.
Please help as soon as possible.
Thanks and
Regards,
Deval

The file is generated by a Java bean at runtime and is not physically stored.
If I copy these characters from somewhere into a new xls file, they are displayed properly.
Sample code used for setting the headers is below:
     protected void setContentType (
          HttpServletResponse     p_response
     ,     InstanceData          p_instancedata
     ) {
          p_response.setHeader ("Cache-Control", "no-cache");
          p_response.setHeader ("Cache-Control", "max-age=0");
          p_response.setHeader (
               "Content-Disposition", "inline; filename="
          +      p_instancedata.userId
          +      "_"
          +      System.currentTimeMillis ()
          +     ".csv"
          );
          System.out.println("Content generator = ");
          p_response.setHeader("charset","utf-8");
          p_response.setContentType (EXCEL_CONTENT_TYPE);
     }
The data used to generate this xls file is in XML format; both the request and the response data are XML.
I hope this information helps in suggesting a solution for this problem.
Regards,
Deval
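A hedged sketch of two things that may matter here. In HTTP the charset is a parameter of the Content-Type header rather than a header of its own, so a standalone "charset" header is not something the browser or Excel looks at. Separately, Excel often relies on a UTF-8 byte order mark to recognise UTF-8 in CSV data. The class name, the constant and the helper withBom below are assumptions made for illustration; in a servlet the content type string would be passed to setContentType and the bytes written to the response output stream.

```java
import java.nio.charset.StandardCharsets;

class CsvExportDemo {
    // Assumed content type for the CSV download; the charset travels
    // as a parameter of Content-Type.
    static final String CSV_CONTENT_TYPE = "text/csv; charset=UTF-8";

    // Hypothetical helper: UTF-8 bytes of the CSV text, prefixed with
    // the UTF-8 byte order mark (EF BB BF).
    static byte[] withBom(String csv) {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] body = csv.getBytes(StandardCharsets.UTF_8);
        byte[] result = new byte[bom.length + body.length];
        System.arraycopy(bom, 0, result, 0, bom.length);
        System.arraycopy(body, 0, result, bom.length, body.length);
        return result;
    }

    public static void main(String[] args) {
        // "año;niño" written with escapes so the source encoding does not matter.
        byte[] bytes = withBom("a\u00f1o;ni\u00f1o\n");
        System.out.println(CSV_CONTENT_TYPE);
        System.out.println(bytes.length); // prints 14 (3 BOM bytes + 11 body bytes)
    }
}
```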

Does Java support all Unicode characters, including Telugu?

Hello to Everybody..!
I am rather new to Java technologies and I want to know whether Java supports the Telugu language and, if it does, how we can implement it. Are there any system requirements for it?
Please kindly consider this query. Thanks in advance.

Yes, Java supports all Unicode characters. If you want to know more about Unicode, have a look at its website:
http://www.unicode.org/
Telugu characters are in their code charts here:
http://www.unicode.org/charts/PDF/U0C00.pdf
You don't have to "implement" anything. System requirements would include a font that can render those characters. I don't know about keyboards.
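As a small illustration of the point that nothing needs to be "implemented", here is a sketch that builds a Telugu string from \u escapes and checks that every character falls in the Telugu block of the code chart linked above. The class and method names are made up for the example; whether the text renders on screen still depends on having a suitable font.

```java
class TeluguDemo {
    // True if every code point of the text lies in the Unicode Telugu block.
    static boolean isTelugu(String text) {
        return text.codePoints()
                   .allMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.TELUGU);
    }

    public static void main(String[] args) {
        // The word "Telugu" in Telugu script, written as escapes.
        String telugu = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41";
        System.out.println(telugu.length());   // prints 6
        System.out.println(isTelugu(telugu));  // prints true
        System.out.println(isTelugu("abc"));   // prints false
    }
}
```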

Poor keyboard when used in foreign languages using accented characters

I'm looking for a replacement for the built-in virtual keyboard of the iPad (the same goes for the iPod or the iPhone). This keyboard is fun to use in English but causes a lot of problems for someone who wants to write in other languages that use accented characters. Those characters are available, but in such a way that they are unusable for someone who needs to do serious work in, say, French, German or Spanish.
I can't find an app that does this job. Is there one?
(By the way, the best would be for Apple to offer this choice in the next iOS...)
Thanks

Sorry, there will be no immediate or short-term fixes through an app.
Those who need or want to do "... serious work..." on the iPad, by which I guess you must mean a lot of typing, need to get a keyboard to do it with, the current choice being between the Keyboard/Dock and a selection of wireless BT models. I'm going to assume that using one of those will give you as close to the functionality of a 'normal' setup as you're likely to see on the iPad.
As for some fix in a future iOS release, you should first think out exactly how you would like them to fix this issue, describe it in some detail, and then forward it along to Apple via their established
FEEDBACK FORM — http://www.apple.com/feedback/ipad.html

Exporting accented characters (Unicode) in metadata

Hi,
I wrote a little script to export the metadata to an xml file. I am having some problems where the tags contain accented characters (áéö etc.).
I managed to write the xml file with unicode encoding and the unicode characters contained in the script are transferred correctly.
However the special characters in the metadata tags are being transformed to some unreadable characters.
Is there any way to make sure these characters are transferred correctly?
Thank you
Balint

David,
Thank you for the help.
I tried setting the output file encoding to UTF-8 or UTF-16, with or without setting the BOM. The accented characters still come out unreadable. I tried the examples that came with the SDK (output set to the console) and they too scramble the accented characters.
In other cases, when using the iterator function, the fields that contain accented characters are completely ignored.
Finally I found the .rawData function in Photoshop, which could export the accented characters. This is a much slower solution, because Photoshop has to open all the files to read the metadata.
I also experimented with my own XML files in Bridge. I can open these, do my transformations, save them and all the characters are well preserved.
Balint

Accented characters showing up as ? in JRE1.3 but ok in 1.2

I'm implementing a database web interface product that utilizes JSPs (on SunOS 5.7).
The problem is in the search form. When using accented characters (French language), the JSP calls on URLEncode, but all accented characters show up as '?'.
However, when editing a record, using accented characters is not a problem (i.e., the accented characters are properly stored in the fields).
Back on the server, I ran a small program to output accented characters and also to call java.net.URLEncoder to convert the characters.
The default JDK is J2SE (1.3.1). Compiling and running the program results in question marks.
Using JDK 1.2, the accented characters show up fine.
It would appear that URLEncoder is not at fault, but instead, JDK 1.3.1 doesn't seem to handle the accented characters.
I figure there must be a setting somewhere, but I'm not sure where.
Here's the program I used (written in Win98, using standard Win-based character set and Unicode format \u00xx; in Unix, "more" displays the Win accented characters fine but "vi" displays them as \xxx; compiles and displays perfectly when using JDK 1.2):
import java.net.*;
class mine {
public static void main(String args[]) {
System.out.println("��") ;
System.out.println(URLEncoder.encode("éçîôûàèâäüï")) ;
System.out.println("\u00e0\u00e2\u00e4");
System.out.println("\351");
System.out.println("\351\347\356\364\373\340\350\342\344\374\357") ;
}
}
The output with JDK 1.2 is:
��
%E9%E7%EE%F4%FB%E0%E8%E2%E4%FC%EF
��
The output with JDK 1.3.1 is:
%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F%3F

Between JDK 1.2 and JDK 1.3 the default encoding of the VM changed.
You can get it by executing:
System.out.println("Default Encoding:" + System.getProperty("file.encoding"));
or
System.out.println("Default Encoding:" + (new java.io.InputStreamReader(System.in)).getEncoding());
The default encoding is used during the conversion of bytes to strings and vice versa.
Assume your default encoding is ISO8859_1. Then calling new String(byte[]) is equivalent to calling
new String(byte[], "ISO8859_1")
If you convert a character from one encoding scheme to another and there is no mapping
for this character in the target scheme, the character will be replaced by a default character,
which is (quite often) the question mark.
You can set the default encoding for a VM by passing it as a command-line parameter:
java -Dfile.encoding=ISO8859_1
java -Dfile.encoding=Cp1252
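To take the default encoding out of the picture for the URLEncoder case in this thread, the two-argument URLEncoder.encode (available since Java 1.4) lets the caller name the charset. The sketch below uses the same eleven accented letters as the octal-escaped string in the question; the class name and the wrapper method are invented for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodeDemo {
    // Wrapper that names the charset explicitly instead of using the VM default.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String accents = "\u00e9\u00e7\u00ee\u00f4\u00fb\u00e0\u00e8\u00e2\u00e4\u00fc\u00ef";

        // Latin-1: one byte per letter, matching the JDK 1.2 output above.
        System.out.println(encode(accents, "ISO-8859-1")); // prints %E9%E7%EE%F4%FB%E0%E8%E2%E4%FC%EF

        // UTF-8: two bytes per accented letter.
        System.out.println(encode("\u00e9", "UTF-8"));     // prints %C3%A9

        // A charset with no mapping for the letter yields an encoded '?'.
        System.out.println(encode("\u00e9", "US-ASCII"));  // prints %3F
    }
}
```

The run of %3F in the JDK 1.3.1 output is what one would expect if the default encoding in that environment cannot represent the accented letters.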

Tomcat unable to read accented characters from MySQL

Folks,
Can anyone help me with this problem?
It seems that my version of Tomcat is unable to read accented characters from my MySQL database.
I've checked in the database and the characters are all correctly represented there. But when, in my servlet code, I do:
String author = results.getString("author_surname");
If the String contains any accented character then the character shows as a '?'. (Even before it gets to the JSP - I'm writing the results straight to catalina.out).
Looking around these forums I found that some people suggested adding
?useUnicode=TRUE&characterEncoding=UTF-8;
to the end of my JDBC URL. As in:
<ResourceParams name="jdbc/connection">
//a whole load of other params
<parameter>
<name>url</name>
<value>jdbc:mysql://localhost:3306/bookshop?useUnicode=TRUE&characterEncoding=UTF-8</value>
</parameter>
</ResourceParams>
inside my server.xml
But it doesn't seem to make any difference. In addition, I doubt I even need to use Unicode as the accents I need are only: �� etc.
(Incidentally, writing that line into my server.xml, tomcat complains that it should finish with a semi-colon. Is that correct? Even if I put in the semi-colon, it still complains!!)
Any suggestions on this would be much appreciated. Thank you.
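One possible explanation for the semicolon complaint: in an XML file a bare & starts an entity reference, which must end with a semicolon, so the parser reads &characterEncoding as an unterminated entity. Writing the ampersand as &amp; inside server.xml avoids that. A minimal sketch of the escaping follows; the class and helper names are assumptions for illustration, not part of Tomcat or the MySQL driver.

```java
class JdbcUrlForXml {
    // Escape the characters that are special inside XML text content.
    // The ampersand must be replaced first.
    static String escapeForXml(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3306/bookshop?useUnicode=TRUE&characterEncoding=UTF-8";
        // The printed value is what would go between <value> and </value>.
        System.out.println(escapeForXml(url));
        // prints jdbc:mysql://localhost:3306/bookshop?useUnicode=TRUE&amp;characterEncoding=UTF-8
    }
}
```

If the parameter never reached the driver because of the parse error, that might also explain why adding it appeared to make no difference.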

user13109986 wrote:
HI,
From http://download.oracle.com/docs/cd/B10501_01/server.920/a96529/ch9.htm
My understanding is the JDBC API converts the string from the database to UTF-16. If so, is there any way to disable the UTF-16 encoding at the JDBC API?
That's exactly what it's supposed to do. There isn't even any concept of what it would mean to disable that: Java characters are UTF-16 representations of Unicode code points, so there isn't anything else it could do.
I still suspect the JDBC part is working correctly and your writing-to-file isn't. I found this quote in the Wikipedia article on Windows-1256:
Windows-1256 is a code page used to write Arabic (and possibly some other languages that use Arabic script, like Persian) under Microsoft Windows. This code page is not compatible with ISO 8859-6 and MacArabic encodings.
So was there a particular reason you chose Cp1256 and not ISO-8859-6 as the charset to write to the file with?
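A short sketch of the point that a Java String is always a sequence of UTF-16 code units, and that a byte encoding only comes into play when the text is written out. The class and helper names are invented for the example; UTF-8 is used for the byte counts because it is guaranteed to be available on every JVM.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ain = "\u0639";        // Arabic letter ain, one UTF-16 code unit
        String clef = "\uD834\uDD1E"; // U+1D11E, stored as a surrogate pair

        System.out.println(ain.length());                          // prints 1
        System.out.println(utf8Length(ain));                       // prints 2
        System.out.println(clef.length());                         // prints 2
        System.out.println(clef.codePointCount(0, clef.length())); // prints 1
        System.out.println(utf8Length(clef));                      // prints 4
    }
}
```

The in-memory form is the same whatever the database or the file uses, so the place to look is the charset chosen at the moment the String is turned into bytes.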

Accented characters, XML, Flash

I have a flash application which is pulling information to
populate dynamic fields from two XML files. We have three
languages supported, and have been having problems with the
non-english accented characters displaying properly
when they are called from the XML. I have checked that the
XML files are encoded in UTF-8, and we have also tried writing
the HTML code, the Unicode code, and putting the information in a
CDATA section. I'm out of options that I can think of, and I'd
appreciate if any other folks have some input on this issue.
I did find this other thread which seems to be about the same
issue, but there was no resolution given on it.
http://www.adobe.com/cfusion/webforums/forum/messageview.cfm?forumid=15&catid=194&threadid=1212142&highlight_key=y&keyword1=accent%20characters

So the same questions arise. What happens in the testing
environment if you trace it or if you go Debug and list variables?
Also you say they aren't displaying properly. What does that
mean? Exactly how are they improper?
PS: I'm absolutely certain that the other thread was not
correctly saving as UTF-8.

Encoding issue with Chinese and Japanese characters but not for Korean

Hi All,
I am dealing with the problem below and have tried a lot of options to correct this. Could anyone help me resolve the issue?
Some Japanese text is returned to my JSP from SAP; the text looks like PTú÷»à .
When I manually set the encoding of my browser to Shift-JIS, the problem is resolved and the text is seen as １５日締当月末現金
However, when I set the content type to Shift-JIS with response.setContentType, the browser automatically selects Shift-JIS as the encoding, but my characters get displayed as ??????
I had a similar problem with Korean characters, and I used
decodedString = new String(strToBeDecoded.getBytes("iso-8859-1"),"EUC-KR");
It worked and I did not have to use the setContentType method of the servlet response. This code, however, does not work for either Japanese or Chinese characters.
Please let me know if there is any way of getting the characters right on the screen.
I am using Sun Java Application Server 8.1.
Thanks,
Priya

Get rid of legacy character encodings and just use UTF-8 all the way.
Read this to learn more: http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html
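For context, the getBytes/new String trick from the question works only when the text was decoded with the wrong charset in a lossless way, which ISO-8859-1 happens to allow because every byte maps to a character. The sketch below reproduces the effect with UTF-8 standing in for EUC-KR (an assumption made so the example runs on any JVM); the class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates the bug: UTF-8 bytes wrongly decoded as ISO-8859-1.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undoes it: recover the original bytes, then decode them correctly.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        String garbled = misdecode(original);

        System.out.println(garbled.equals("caf\u00c3\u00a9"));  // prints true
        System.out.println(repair(garbled).equals(original));   // prints true
    }
}
```

Once a character has already been replaced by '?', the original bytes are gone and no re-decoding can bring them back, which is one reason the advice above is to use a single encoding end to end.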

Accented characters in exported images getting munged

Lightroom 2.x, OS X 10.5.6, G5, Nikon D80, Raw/NEF/DNG
I'm trying to sort out a problem I'm having that looks similar to other issues with international characters, but is a little different. My problem is specific to keywords.
I don't think this should matter, but I generally import as DNG and never save metadata to XMP. I mainly export to JPEG (to a folder that I zip up) in order to upload photos (outside of Lr) to a sharing site.
What I'm finding is that if I tag my photos (in Lr) with keywords that have accented characters (which I use a fair amount because I often travel to Francophone areas to take photos, and my IPTC data and keywords use correct French spelling) upon upload to any photosharing site I try, the keywords will be munged, duplicated and/or generally messed up. For example, I have a keyword "café". Upon export I can see this correctly in the IPTC Keyword metadata (via a third party EXIF/IPTC viewer, and also using "Get Info" in the Finder). However, as soon as I upload this JPEG to any photo sharing site the photo will end up with a series of keywords applied to it: "caf, cafe, cafã©"
Similarly, a keyword like "naïve" will end up as two keywords: "naã¯ve, nave". I get similar results with HTTP uploaders as well as uploader plugins inside Lr. And so on for è, ô, etc. In the case of Flickr, there will be one "copy" of the keyword that is correct. The others will still be present.
I've read hints that suggest that the IPTC metadata stored in the exported file by Lr might not be Unicode-clean, which causes anything that parses and normalizes select metadata (as all online sharing sites will want to do) to choke. I've heard rumours that the XMP data is properly Unicode clean, but most sharing sites ignore that data, since they are pretty much interested only in EXIF or IPTC metadata (and rightly so, I suppose.)
Is this a defect in the way Lr creates metadata in exported images? Or is it a problem with the way all these sites suck in and normalize the metadata stored in the image? Or is it a subtle interaction between the two that results in incompatibilities?
I haven't tried experimenting with creating or finding a JPEG with no keywords at all and manually inserting IPTC keywords with accents, just to see what Flickr (et al) do upon upload. I've also tried various permutations of the keyword options in the Lr export dialogue with no significant change in behaviour.
I'm interested in who else might be seeing this problem. Is it a Lr only thing, or is it Lr-on-Mac? Or something specific to the two sites I upload photos to?
I should point out that I've been seeing this behaviour since Lr 1.x.

(Jao, AdobeRGB is an experiment. I have a colour managed browser, so I wanted to see if I could tell the difference. This is a test photo from my junk pile.)
I think I see the problem. With something like ExifTool you can see that the keywords are stored in a number of places in the EXIF/IPTC header: Keywords, Subject and Hierarchical Subject. It is the Keywords section that is not Unicode clean, apparently. This is even more apparent if you show the output encoded in HTML entities:
Keywords : Franais, Test Shots, Wings of Paradise Butterfly Conservatory, butterfly, cafŽ, l'h™tel, na•ve, tests
Subject : Français, Test Shots, Wings of Paradise Butterfly Conservatory, butterfly, café, l'hôtel, naïve, tests
Hierarchical Subject : Français, Test Shots|tests, Wings of Paradise Butterfly Conservatory, butterfly, café, l'hôtel, naïve
So, it depends on what section the keywords are taken from. It appears that both Flickr and 23 will try to find the keywords in as many places as possible. This explains the many duplicates I am seeing.
So, now I just have to figure out why Keywords is so wrong, and how to correct it (or leave it blank.)

Help with French Accent Characters Corrupted

Hi, All.
I am developing a Flex front end connected to a Java back end.
The back end sends data retrieved from an XML file to the Flex
front end, which displays it in a TextArea and allows the user to change it.
After the user changes the data and hits the "Save" button, Flex sends the
data to the back end.
I checked on the back end and made sure the data was correct
when it was sent out to Flex; the French accented characters get
corrupted when they are sent back from Flex. However, on the Flex side this
change is not noticeable (i.e. the French accented characters
display correctly in Flex, but the wrong characters are sent back). I'm
guessing that it might be something related to character sets.
However, I cannot find anywhere to set the character set in
HTTPService. Does anybody have an idea?

Use the Arial Unicode MS font.

How to type accented characters?

Somehow I discovered that if I type Option-e, e I get e with accent aigu (though not in this text box, obviously).
Where is the list of the rest of them?
I don't want to use the Character Palette if I can help it since it's pretty tedious when I type a lot of them. I'd just as soon learn the keyboard shortcuts (as I have done with Windows US-International keyboard).
Thanks for any insight.
Eric

option+e followed by a letter puts an accent over that letter as you've discovered.
like this: á é ó
to see the other ones go to system preferences->international->input menu and check the box to show keyboard viewer in input menu. now activate keyboard viewer from the input menu and press option. you'll see where other accents are. they will be highlighted orange.
alternatively, in the same input menu in system preferences check the box to show a language that you are interested in that has those accented characters. then choose that language from the input menu in the menu bar and type using that language layout. the keyboard viewer will again tell you where which letters are.

Accented characters and UTF-8

Hi all,
I have a problem with accented characters. I read that Plumtree 5.0 is completely Unicode enabled and that all HTTP responses from remote web services are converted to Unicode (UTF-16). So the portal sends all pages back to the client browser in UTF-8.
We have a lot of portlets (ASP and JSP) that write data to an external DB, for example SQL Server. When I fill in an HTML form with accented characters that have to be saved in our DB, they are saved in UTF-8 because the gateway converts the HTTP response. We want the data to be saved as if we were not using the portal (without conversion). I tried to change the charset in the ASP code (Response.Charset), but this only solves the problem of displaying the right characters in the browser.
Could you explain in more detail how the portal performs the conversion and how I can solve my problem?
Thank you very much,
Alberto Marchiaro

It might be helpful to clarify a few things first:
1. Both Java and VBScript store strings in UTF-16/Unicode. If you have some code in your ASP file that looks like this:
Dim strData
strData = Request.Form("SomeName")
then if you were to examine memory for the variable strData, you would see 16-bit characters. The same is true for Java.
2. String data is almost never sent over HTTP as UTF-16/Unicode.
3. Both Java and VBScript perform an implicit character set transcoding when reading string data out of a request or when writing string data out to a response.
4. ASP performs the transcoding according to the value of Session.CodePage. If you have set Session.CodePage to 65001, then ASP will expect the string data to be in UTF-8 and will transcode UTF-8 in the request into UTF-16 in VBScript. Similarly, a Session.CodePage value of 65001 will cause Response.Write to convert UTF-16/Unicode into UTF-8.
5. All of the above is separate from how Java or VBScript interact with the database. Generally speaking, an ASP module will use ODBC to communicate with the database. The ODBC layer knows that VBScript keeps strings in UTF-16/Unicode and will perform the proper conversion into the database character set. Plumtree always recommends using UTF-16/Unicode in the database. You can do this relatively easily by declaring your database columns using the "N" datatypes such as NCHAR and NVARCHAR. However, even if you are using some other character set, the ODBC layer should always properly transcode from VBScript.
The important thing to remember is that data that is sent over HTTP is never written directly to the database without going through some ASP or JSP code. Since the ASP and JSP code always uses UTF-16/Unicode, there should never be any issue with how the data is sent over HTTP.
Here is an explanation of how Session.CodePage, Response.CharSet and Session.LCID work in ASP:
1. Response.CharSet
2. Session.CodePage
3. Session.LCID
Here is an explanation of these properties and why they are important to non-English ASP gadget writers:
1. Response.CharSet
This property will cause the HTTP Content-Type header to be set with the specified character set. The HTTP header is the best way to tell the recipient what the character set is. The Plumtree HTTPGadgetProvider will read the Content-Type header and then know how to properly transcode the portlet text into UTF-16/Unicode. Here is an example of how to set this property:
Response.CharSet = "UTF-8"
2. Session.CodePage
This property tells the ASP engine which character set to send text in. Please remember that all text is encoded in Unicode on the web server. It only gets turned into the client character set when it is sent down to the client. Session.CodePage tells the engine which code page to transcode into when sending down to the client. Please note that this property is an integer property, not a string, so you have to know the number of the code page that you would like to transcode into. Here is an example of how to use this property:
Session.CodePage = 65001
3. Session.LCID
This property tells the ASP engine which locale is being used. The locale is used by various VBScript functions such as FormatDateTime in order to format the date correctly for the locale. If the locale is a French locale, then the date will be formatted according to French rules. The locale does not really affect the character set, but if the portlet writer is going to the trouble of setting the other properties, then they should set the LCID too. Here is an example of how to set this property:
Session.LCID = 1041
Please note that the examples that I am using are the appropriate examples for Japanese and UTF-8. The values for these properties are different for different character sets. For example, for ISO-2022-JP, the values would be:
Response.CharSet = "iso-2022-jp"
Session.CodePage = 50220
Session.LCID = 1041
A very helpful URL for figuring out the values to use with Response.CharSet and Session.CodePage is the following:
http://msdn.microsoft.com/library/default.asp?url=/workshop/Author/dhtml/reference/charsets/charset4.asp
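The same idea on the Java side, as a rough sketch: the receiver reads the charset parameter from the Content-Type header and uses it to transcode the body bytes into an in-memory (UTF-16) String. The class name, the method names and the ISO-8859-1 fallback are assumptions made for illustration; this is not Plumtree's implementation, and Charset.forName will throw if the header names a charset the JVM does not know.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Pick the charset parameter out of a Content-Type value,
    // e.g. "text/html; charset=UTF-8"; use the fallback if none is given.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    // Transcode the raw body bytes into a String using the declared charset.
    static String decodeBody(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xC3, (byte) 0xA9}; // "é" encoded as UTF-8

        // With the header, the two bytes decode to the single letter.
        System.out.println(decodeBody(body, "text/html; charset=UTF-8").equals("\u00e9")); // prints true

        // Without it, the fallback produces two wrong characters instead.
        System.out.println(decodeBody(body, "text/html").length()); // prints 2
    }
}
```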
