Character encoding (unicode to utf-8) conversion problem

I have run into a problem that I can't seem to find a solution to.
My users are copying and pasting from MS-Word. My DB is Oracle with its encoding set to "UTF-8".
Oracle's thin driver automatically converts to the DB's default character set.
When Java tries to encode Unicode to UTF-8 and runs into an unknown character (typically a character in the High Ascii range), it substitutes it with '?' or some other weird character.
How do I prevent this?

my users are copying and pasting from MS-Word. My DB is Oracle with its encoding set to "UTF-8".
Pasting where? Into the database? If they are pasting into the database (however they might do that) and getting bad results then that's nothing to do with Java.
Using Oracle's thin driver it automatically converts to the DB's default character set.
Okay, I will assume that is correct.
When Java tries to encode Unicode to UTF-8 and it runs into an unknown character (typically a character that is in the High Ascii range) it substitutes it with '?' or some other weird character.
This is false. When converting from Unicode to UTF-8 there are no "unknown characters". I don't know what you mean by the "High Ascii range" but if your users are pasting MS stuff into your Java program somehow, then a conversion from something into Unicode is done at that time. If "something" isn't the right encoding then you have the problems already, before you try to write to the DB.
How do I prevent this.
First identify the problem. You have input coming from somewhere, then you are writing to the database. Two different steps. Either of them could have a problem. Test them separately so you know which one of them is the problem.
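To back up the point that UTF-8 has no "unknown characters", here is a minimal Java sketch. The class name EncodeDemo and the sample string are made up for illustration. It encodes the same Word-style curly quotes once as UTF-8 and once as US-ASCII; only the second charset has to fall back to '?' (byte 3F), which suggests that a '?' in the database comes from some other conversion step, not from UTF-8 encoding itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    /** Encodes text with the given charset and returns the bytes as spaced hex. */
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Curly quotes (U+201C, U+201D) around "hi", as pasted from Word.
        String curly = "\u201Chi\u201D";
        // UTF-8 can represent every Unicode character.
        System.out.println("UTF-8:    " + hex(curly, StandardCharsets.UTF_8));
        // US-ASCII cannot, so getBytes substitutes '?' (3F) for each quote.
        System.out.println("US-ASCII: " + hex(curly, StandardCharsets.US_ASCII));
    }
}
```

Running the two steps separately like this (input to String, then String to bytes) is one way to find out which step loses the characters.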

Similar Messages

Perform unicode to UTF-8 conversion on F110 bacs payment file in ABAP

Hi,
I am facing a conversion issue for the UK BACS payment files.
The payment run tcode F110 creates a payment file, but the file created on the application server has undergone some sort of code conversion. If I remove the # values, I can read most of the data.
The data example is as below-
#V#O#L#1#0#0#1#5#8#8# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #2#4#3#3#0#9#
#H#D#R#1#A#2#4#3#3#0#9#S# # #1#2#4#3#3#0#9#0#0#0#0#0#2#0#0#0#1#0#0#0#1# # # # # # # #1#0#1#1#2#
#H#D#R#2#F#0#2#0#0#0#0#0#1#0#0# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
#U#H#L#1# #1#0#1#1#3#9#9#9#9#9#9# # # # #0#0#0#0#0#0#0#0#1# #D#A#I#L#Y# # #0#0#0# # # # # # # #
This is then transferred to the bank via the FTP UNIX Script but after the conversion which is happening as-
#Perform unicode to UTF-8 conversion on bacs file
$a = "iconv -f UNICODE -t UTF-8 $tmpUNI > $tmpASC";
The need going forward is to bring the details in via the interface and then make an upload.
The ABAP code should be able to make the conversion, remove the additional characters and then send the file across.
I have searched everywhere but I am not able to find out how to make the same conversion in ABAP.
We are on ECC6.
Can someone please help me?
Regards,
Archana

Hi Archana,
can you please check SAP notes 1064779 and 1365764 (including the attachment) and see if this helps you ?
Best regards,
Nils Buerckel
SAP AG
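This is not ABAP, only a Java sketch of what the iconv line in the question does at the byte level. It assumes the file is UTF-16 big-endian, which the '#' shown before each character suggests (a zero byte displayed as '#') but does not prove. The class name Utf16ToUtf8 and the sample bytes are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf16ToUtf8 {
    /** Re-encodes UTF-16 big-endian bytes as UTF-8 bytes. */
    static byte[] convert(byte[] utf16be) {
        // Decode the two-byte units into a String, then encode that String as UTF-8.
        String text = new String(utf16be, StandardCharsets.UTF_16BE);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "VOL1" with a zero byte in front of each character (assumed layout).
        byte[] sample = {0, 'V', 0, 'O', 0, 'L', 0, '1'};
        byte[] utf8 = convert(sample);
        System.out.println(utf8.length + " bytes: " + new String(utf8, StandardCharsets.UTF_8));
    }
}
```

For plain ASCII content like the BACS records, the UTF-8 result has one byte per character, which is why the file looks readable once the extra bytes are gone.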

Default Character Encoding stuck on UTF-8 - Firefox 7

I cannot change the Character Encoding - it is stuck on Unicode UTF-8 and I can not change it! When a web page opens I get these little boxes with "FF FD" instead of Quote marks. When I change the character encoding on that page using "View->Character Encoding" and click on the Western (ISO-8859-1), the page displays correctly. Every page opens using Unicode UTF-8 as the default.
View->Character Encoding -- shows Unicode UTF-8 as the default.
View->Character Encoding->Auto-detect -- shows OFF
Tool->Options->Content->Advance->Fonts->Default Character Encoding -- shows Western (ISO-8859-1) as well as the "Allow Pages to choose their own fonts..." IS CHECKED in the check box
THE PAGES ARE NOT UTF-8!!!! The "View Page Source" IS NOT Unicode UTF-8! -- It shows <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
The "View Page Info" shows MetaTag - Content-Type: text/html; charset=iso-8859-1
Why can I not change the Default Character Encoding?
I would also like to point out that the Unicode UTF-8 seems to be broken because it is indicating that the QUOTE CHARACTER is an UNPRINTABLE character "FF FD"
----- EDIT -----
The UTF-8 is not broken. The problem as pointed out in http://en.wikipedia.org/wiki/Replacement_character#Replacement_character is that my Firefox being STUCK processing UTF-8 encoding cannot read the clearly marked iso-8859-1 data. So the UTF-8 is reinterpreting smart quotes -“ and ”- (“ and ”) as replacement (unprintable) characters.
So the real problem is why my Firefox is stuck on Unicode UTF-8

The real problem is that the font that is used doesn't have those characters.
Do you see the special quotes -“ and ” on this forum page?
Does it help if you disable the website fonts and set another font as the default font?
*Tools > Options > Content : Fonts & Colors > Advanced
*http://en.wikipedia.org/wiki/Punctuation
*http://en.wikibooks.org/wiki/Unicode/Character_reference/2000-2FFF
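The replacement characters described in the question's edit can be reproduced outside the browser. This Java sketch (the class name DecodeDemo is made up) takes the bytes 0x93 and 0x94, which are the curly quotes in windows-1252, and decodes them two ways. Decoded as UTF-8 they are invalid sequences and become U+FFFD, the "FF FD" box; decoded as windows-1252 they become real quote characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    /** Decodes raw bytes using the given charset. */
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the left and right curly quotes in windows-1252.
        byte[] raw = {(byte) 0x93, 'h', 'i', (byte) 0x94};

        String asUtf8 = decode(raw, StandardCharsets.UTF_8);
        String asCp1252 = decode(raw, Charset.forName("windows-1252"));

        // Print the code point of the first character in each result.
        System.out.println("as UTF-8:        U+" + Integer.toHexString(asUtf8.charAt(0)).toUpperCase());
        System.out.println("as windows-1252: U+" + Integer.toHexString(asCp1252.charAt(0)).toUpperCase());
    }
}
```

If the page bytes decode correctly with a Western encoding but not with UTF-8, the bytes are fine and the choice of decoder is the problem, whatever the font.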

Problems with Forms and character encoding

I'm having problems trying to read unicode data inputted into a Form on my JSP page.
I've used the meta tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to set the charset of the page to UTF-8. I've entered some Chinese characters into my form, and when I try to read the subsequent request parameter in my servlet using request.getParameter() the string returned is this:
"&#26469;&#28304;" which is the escape sequence required by HTML to display these characters.
From what I've read on the subject this doesn't seem like the expected value. I've tried other ways of getting the correct string value such as setting the character encoding request.setCharacterEncoding("UTF-8") and then converting the bytes using this encoding value but it doesn't seem to work.
I could write a method to split up the string using the ; as a token and working out the correct unicode character but this doesn't seem like the right thing to do.
Any help on how to pass the correct information from the Form in the JSP page to the servlet would be greatly appreciated

I don't believe that is correct, but if it's returning HTML escapes instead of URL Encoded characters, then it's the browser doing it. This is my test page for playing with Chinese...
<%@ page language="java" contentType="text/html; charset=UTF-8" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body bgcolor="#ffffff" background="" text="#000000" link="#ff0000" vlink="#800000" alink="#ff00ff">
<%
request.setCharacterEncoding("UTF-8");
String str = "\u7528\u6237\u540d";
String name = request.getParameter("name");
%>
req enc: <%= request.getCharacterEncoding() %><br />
rsp enc: <%= response.getCharacterEncoding() %><br />
str: <%= str %><br />
name: <%= name %><br />
<form method="GET" action="_lang.jsp" accept-charset="UTF-8">
Name: <input type="text" name="name" value="" >
<input type="submit" name="submit" value="GET Submit" />
</form>
<form method="POST" action="_lang.jsp" accept-charset="UTF-8">
Name: <input type="text" name="name" value="" >
<input type="submit" name="submit" value="POST Submit" />
</form>
</body>
</html>
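To see what the servlet container has to undo, here is a small Java sketch of the round trip a form value makes. The class name FormDecodeDemo is made up, and the sample string is the same one the test page uses. The value is percent-encoded as UTF-8 (what a browser sends when the form is submitted as UTF-8) and then decoded once with the right charset and once with ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormDecodeDemo {
    /** Percent-encodes a form value using the named charset. */
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Decodes a percent-encoded form value using the named charset. */
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u7528\u6237\u540d";
        String wire = encode(name, "UTF-8");
        System.out.println("on the wire: " + wire);
        // Decoding with the same charset restores the original text.
        System.out.println("UTF-8 round trip ok: " + decode(wire, "UTF-8").equals(name));
        // Decoding with a different charset yields one wrong character per byte.
        System.out.println("ISO-8859-1 length:   " + decode(wire, "ISO-8859-1").length());
    }
}
```

If the parameter arrives as HTML numeric references instead of percent-encoded bytes, the browser was most likely submitting the form in a charset that cannot represent the characters, so the fix belongs on the page side, not in a custom parser.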

Quotation marks display as &quot in web pages, I'm using Unicode UTF-8 character encoding.

On many web pages, where a quotation mark character should appear, instead the page displays the text &quot. I believe this happens with other punctuation characters as well such as apostrophes although the text displayed in these other cases is different, of course. I'm guessing this is a problem with character encoding. I'm currently set to Unicode (UTF-8) encoding. Have tried several others without success.

Here's a link where the problem occurs. Note the second line of the main body of text.
http://www.sierratradingpost.com/lp2/snowshoes.html
BTW, I never use IE, but I checked this site in IE and it shows the same problem, so maybe it is the page encoding after all rather than what I thought.
In any case, my thanks for your help and would appreciate any solution you can suggest.

c:import character encoding problem (utf-8)

Aloha @ all,
I am currently importing a file using the <c:import> functionality (<c:import url="module/item.jsp" charEncoding="UTF-8">) but it seems that the returned data is not encoded with utf-8 and hence not displayed correctly. The overall file header is:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=E67F9DAF44C7F96C0725652BEA1713D8;
Content-Type: text/html;charset=UTF-8
Content-Length: 6861
Date: Thu, 05 Jul 2007 04:18:39 GMT
Connection: close
I've set the file-encoding on all pages to :
<%@ page contentType="text/html;charset=UTF-8" %>
<%@ page pageEncoding="UTF-8"%>
but the error remains... is this a known bug and is there a workaround?

Partially, yes. It turns out that I created the documents in eclipse with a different character encoding. Hence the entire document was actually not UTF-encoded...
So I changed each document encoding in Eclipse to UTF and got it working just fine...

CONVERSION FROM ANSI ENCODED FILE TO UTF-8 ENCODED FILE

Hi All,
I have some issues converting an ANSI-encoded file to a UTF-8-encoded file. Let me explain in detail.
I have installed language support for Thai on my operating system.
Now, when I open Notepad, add Thai characters to a file and save it with ANSI encoding, it saves perfectly and I am able to see the characters when I open the file again.
This file needs to be read by my application, stored in the database, and the Thai characters displayed on a JSP after fetching the data from the database. Currently the JSP shows junk characters because my database (a UTF8-compliant database) contains junk data, and it contains junk data because my application is not able to read the file correctly.
If I save the file with UTF-8 encoding it works fine, but my business requirement is that the file is system generated, and by default it is encoded in ANSI format. So I need to convert the encoding from ANSI to UTF-8. Can anyone guide me on how to do this conversion?
Regards
Gaurav Nigam

Guessing the encoding of a text file by examining its contents is tricky at best, and should only be done as a last resort. If the file is auto-generated, I would first try reading it using the system default encoding. That's what you're doing whenever you read a file with a FileReader. If that doesn't work, try using an InputStreamReader and specifying a Thai encoding like TIS-620 or cp838 (I don't really know anything about Thai encodings; I just picked those out of a quick Google search). Once you've read the file correctly, you can write the text to a new file using an OutputStreamWriter and specifying UTF-8 as the encoding. It shouldn't really be necessary to transcode files like this, but without knowing a lot more about your situation, that's all I can suggest.
As for native2ascii, it isn't for encoding conversions. All it does is replace each non-ASCII character with its six-character Unicode escape, so "voilá" becomes "voil\u00e1". In other words, it avoids the problem of character encodings by converting the file's contents to a form that can be stored as ASCII. It's mainly used for converting property or resource files to a form that can be read by the Properties and ResourceBundle classes.
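The InputStreamReader / OutputStreamWriter approach described above can be sketched as follows. This is only an illustration: the class name Transcode and the command-line arguments are made up, and which source charset is right for the generated file (TIS-620, windows-874 or something else) is an assumption that has to be checked against the real data.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Transcode {
    /** Reads text from in using sourceCharset and writes it to out as UTF-8. */
    static void toUtf8(InputStream in, Charset sourceCharset, OutputStream out) {
        try (Reader reader = new InputStreamReader(in, sourceCharset);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: Transcode <input> <output> <source-charset>");
            return;
        }
        // The charset name is supplied by the caller, e.g. TIS-620 (an assumption).
        toUtf8(Files.newInputStream(Paths.get(args[0])),
               Charset.forName(args[2]),
               Files.newOutputStream(Paths.get(args[1])));
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which is what FileReader falls back to.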

Seeing � etc despite having View--Character encoding as unicode and auto-detect universal

On viewing some web pages see characters such as �, , (for example). But View-Character Encoding is set at Unicode (UTF-8) or Western (ISO8859-1) and Tools-Options-Content-Fonts-Advanced Encoding set with either of those

example of page:
http://scienceofdoom.com/2010/09/17/on-missing-the-point-by-chilingar-et-al-2008/
- a little over half way down, the section headed "Anthropogenic Imact on the Earth’s Climate – Tiny" from paragraph "And continue: " there are these non-characters in the equation (12) and subsequently.
Another page : http://www.zimbabwesituation.com/sep26_2010.html in the topic " Red warning lights" .
Most web-pages I read are without problem.
I contacted the writer of the first page and s/he had no idea why it happens.

Can any version of Excel save to a CSV file that is either UTF-8 or UTF-16 encoded (unicode)?

Are there any versions of Excel (chinese, japanese, russian... 2003, 2007, 2010...) that can save CSV files in Unicode (either UTF-8 or UTF-16)?
If not, is the only solution to go with tab-delimited files (save as Unicode-text option)?

Hi Mark,
I have the same problem. Trying to save my CSV file in UTF8 encoding. After several hours in searching and trying this also in my VSTO Add-In I got nothing. Saving file as Unicode option in Excel creates file as TAB separated. Because I'd like to save the file in my Add-In application, the best to do is (for my problem) saving file as unicode tab delimited and then replacing all tabs with commas in the file automatically.
I don't think there is a direct way to save CSV as unicode in Excel. And I don't understand why.

How to set the system default file character encoding to UTF-8?

Hi all. This is driving me nuts, on both my Windows box and Snow Leopard; I figure much more chance of finding the answer for OS X.
My language and locale are set to Australian English. $LANG=en_AU.UTF-8
However, as I believe is expected, OS X (and Windows for that matter) will create files by default with character encoding of Cp1252 (Latin-1). That is, the FILE encoding in the file metadata - the Byte Order Mark I believe. The file itself, not the characters written to it.
This, in a word, bites. I don't want to be restricted to only ASCII by default, and it is causing me problems with certain software (a Firefox plugin) that creates text files, passing in UTF-8 encoded content, which is then mangled because the file encoding itself is still Cp1252. (I know, I've tested this by changing the file encoding manually and having it overwritten again by the plugin: works correctly.)
As a simple example, just `touch somefile` from terminal creates a file in Cp1252 -- I'm obtaining that info by opening in jEdit by the way (anyone know of something better?).
In other locales that are not English-based, I believe the default file encoding is UTF-8. But surely this can be controlled independently? There must be a system configuration value somewhere that specifies file encoding default. Can someone please tell me what it is?
Thanks!

However, as I believe is expected, OS X (and Windows for that matter) will create files by default with character encoding of Cp1252 (Latin-1). That is, the FILE encoding in the file metadata - the Byte Order Mark I believe. The file itself, not the characters written to it.
Apps like TextEdit and Mail have settings that let you determine the encoding of text produced. The default would normally depend on the character content of the file, ranging from ASCII for basic English to Windows Latin-1 (Win 1252) or ISO Latin -1 (ISO 8859-1) to UTF-8 for other content.
Win 1252 is not ASCII, but has twice the number of characters in the latter.
Byte Order Mark is something totally different --it's a particular character used to signal certain encodings.
http://en.wikipedia.org/wiki/Byte_order_mark
As a simple example, just `touch somefile` from terminal creates a file in Cp1252 -- I'm obtaining that info by opening in jEdit by the way (anyone know of something better?).
For what Terminal does and how to change it, it might be best to post in the Unix forum:
http://discussions.apple.com/forum.jspa?forumID=735
For problems with a FireFox plugin, it might be good to ask on their own forums as well.
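One point that may help with the `touch somefile` observation: an empty or pure-ASCII file contains no bytes that distinguish UTF-8 from a Latin-1 style encoding, so an editor can only guess or fall back to its own default. A minimal Java sketch of that, with a made-up class name FileBytesDemo:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FileBytesDemo {
    /** Returns true if the text produces identical bytes in UTF-8 and ISO-8859-1. */
    static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8),
                             text.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // The charset this JVM uses when none is given explicitly.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        // ASCII-only (or empty) content is byte-for-byte the same in both encodings.
        System.out.println("empty:  " + sameBytes(""));
        System.out.println("ascii:  " + sameBytes("plain text"));
        // One accented letter is enough to make the bytes differ.
        System.out.println("accent: " + sameBytes("caf\u00e9"));
    }
}
```

So what an editor reports for a freshly created file is likely its own fallback choice, and the software writing the content is what decides the bytes.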

Character encoding conversion for marshall/unmarshall?

Hello, Java Web Services gurus,
I am wondering if there is an easy/plugin-able way to do character encoding conversion transparently in the process of marshall/unmarshall.
Basically, my input/output will always be these UTF-8 XMLs. As the backend database is ISO encoded, I hope the result of unmarshall will give me ISO strings. And when it comes to marshall, the ISO strings can be transparently turned to UTF-8 XML response. Right now I'm using JAXB's annotations to parse XML into objects.
I understand there will be chars in the input file that cannot be converted; if so, I'd be expecting an error/exception that flags the failure.
Hope I sound clear. This has been a headache for a while. Really hope someone may help out a bit. Thanks a million in advance

[Duplicate Post|http://forums.sun.com/thread.jspa?messageID=10971554&tstart=0#10971554]
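For the "flag the failure" part of the question, the JDK's CharsetEncoder can be told to report unmappable characters instead of silently replacing them. The sketch below assumes that "ISO encoded" means ISO-8859-1; the class name StrictLatin1 is made up, and nothing here is specific to JAXB. It would have to be called on the strings after unmarshalling.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictLatin1 {
    /** Returns true if every character of text exists in ISO-8859-1. */
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    /** Encodes text as ISO-8859-1, throwing instead of substituting '?'. */
    static byte[] encodeStrict(String text) {
        try {
            ByteBuffer encoded = StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            byte[] out = new byte[encoded.remaining()];
            encoded.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not representable in ISO-8859-1: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("caf\u00e9"));   // Latin-1 letter
        System.out.println(fitsLatin1("\u6765\u6e90")); // Chinese text
    }
}
```

Note that Java strings themselves are always Unicode; the ISO conversion only happens when bytes are produced, typically in the JDBC driver or when writing to a stream.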

MDMP Unicode Conversion problem during ECC upgrade

hi Experts,
I ran into a problem when upgrading from 4.6C to ECC 6.
In 4.6C, some key users entered Chinese text in table fields while logged on in English, so the language field of these entries is "EN" although the text is actually Chinese. After the upgrade to ECC 6, we found that the Chinese text had become garbled, probably because it was converted using the English code page. My question is: how can I avoid this? I really don't want this to happen in our PRD upgrade.
The second question is: if a field contains both Chinese and English, which language should be assigned to it, English or Chinese? I'm afraid that if we assign such a word as "EN" in the vocabulary, the Chinese part will be garbled after the upgrade.
Thank you for your kind suggestions.
Freshman

Hi,
you can either use transaction SUMG (manually) to repair the entries ( you have to add the according table - please check unicode conversion guide for details) or you need to change the language key in the table to Chinese if you want the Unicode conversion to convert the data correctly.
To your second question: If the English texts are restricted to US7ASCII characters ("normal" English without special characters) those texts containing Chinese and English words can (and must !) be assigned to ZH, as US7ASCII characters are included in the Chinese code page.
Best regards,
Nils Buerckel

Dreamweaver cc html entity conversion problem in mac -NO utf-8 related answer please

I probably am fighting against a bug existing in DW for a while, and i'm really on the edge of bursting out!
Here are the specifications:
Dreamweaver CC from creative cloud (also tested w/ CS5.5 too) installed on mac, OS and DW user interfaces are english, and on mac turkish keyboard layout is also installed.
I have been using DW for maybe 15 years, since it was macromedia.. But was always on windows. This is the first time I use it on mac. Here is my problem step by step:
1- Dreamweaver > Preferences > New Document > Default Encoding: Western (ISO Latin 1) (NOT UTF-8 PLEASE, IT KEEPS THE CHARS UNCHANGED, ISO LATIN1 IS IMPORTANT)
2- Go to Design View,
3- There are 6 special characters in Turkish (times 2 for the caps versions of course), type:
ĞÜŞİÖÇğüşıöç
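4- Go back to code view, what i should have seen was:
&#286;&Uuml;&#350;&#304;&Ouml;&Ccedil;&#287;&uuml;&#351;&#305;&ouml;&ccedil;
But I see:
Ğ&Uuml;Şİ&Ouml;&Ccedil;ğ&uuml;şı&ouml;&ccedil;
There are 3 chars (and capital versions) NOT converted to html entity at all. Which were: ĞŞİğşı
But I should have seen them as: &#286;&#350;&#304;&#287;&#351;&#305;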
Any help would be appreciated, I do not want to leave my old friend DW just because of a weird conversion problem...

Ok, when you look at the code view, what do you see exactly?
do you see unconverted
ĞÜŞİÖÇğüşıöç
or converted
&#286;&Uuml;&#350;&#304;&Ouml;&Ccedil;&#287;&uuml;&#351;&#305;&ouml;&ccedil;
Here is one of my reasons:
I sometimes create newsletters in Turkish for my customers, and the html files I prepare are sent to customers inline through various versions of Outlook or Thunderbird, or through a completely different email sender company (none is sent by me, I only create the html file). Most of the time the headers and some of the code are cut off when the file is sent as a newsletter, and I have no control at all over that. So I have to create html files that render correctly on their own, since I have no control over which sending method will be used or which OS, browser or mail system will be used to open them...

Unicode Character sets (e.g UTF-8)

Hi,
We are using some third party software which will connect to the oracle database.
One of the requirements it states is that both the database client and server must use the Unicode character set e.g UTF-8.
How do we ensure this when installing the oracle client software.
Also, why, when we install the oracle client software and select the language as English, does it put NLS_LANG as American by default.
Is there an English U.K language option - couldn't see it.
Many Thanks

user5716448 wrote:
Hi,
We are using some third party software which will connect to the oracle database.
One of the requirements it states is that both the database client and server must use the Unicode character set e.g UTF-8.
Pl post details of OS and database and client versions being installed
How do we ensure this when installing the oracle client software.
For the client, set NLS_LANG appropriately when using the client software - there is no setup required during the install - http://www.oracle.com/technetwork/database/globalization/nls-lang-099431.html
Also, why, when we install the oracle client software and select the language as English, does it put NLS_LANG as American by default.
Is there an English U.K language option - couldn't see it.
Try "ENGLISH"
http://docs.oracle.com/cd/E11882_01/server.112/e10729/ch3globenv.htm
Many Thanks
HTH
Srini

Oracle to Mysql character set conversion problem!!! PLZ IGNORE

Hi Experts,
I have created a database link from Oracle 10g to Mysql 5.
I have installed Oracle Gateway 11g for this purpose.
When I retrieve the data from SQL*Plus the text is displayed as question marks.
Oracle 10g Database character set is WE8MSWIN1252
Mysql character set --->latin1
Character set of ODBC connector for mysql is latin7
Character set in the parameter file of HS folder is WE8MSWIN1252
When I retrieve data from SQL Developer the text is fine (as I think it directly takes the character set of the target), but
when I log in from sqlplus I get question marks!
I have another post in Heterogeneous Connectivity forum
Re: Oracle to Mysql character set conversion problem!!! PLZ HELP
Kindly update your comments there,
@@@@@@@@@@@@@@2
Appreciate your help,
regards
Edited by: user10243788 on Apr 21, 2010 3:25 AM

It is OK to post a globalization-related question in this forum in addition to the forum pertaining to the main technology. Not all experts follow all possible forums on OTN. Of course, you should cross-link the posts to let people merge the answers.
Regarding the problem itself, make sure that SQL*Plus has the right NLS_LANG setting in the environment. On Windows, in the Command Prompt:
C:\> set NLS_LANG=.WE8PC850
C:\> sqlplus ...
On Unix:
$ setenv NLS_LANG .WE8ISO8859P1 (or NLS_LANG=.WE8ISO8859P1; export NLS_LANG)
$ sqlplus ...
-- Sergiusz
