Validator warning: Character Encoding mismatch!

I have been following the discussion on favicons with interest. A few days ago I added a favicon to the page http://www.corybas.com/, and eventually persuaded IE6 to show the favicon, provided I loaded it by clicking the icon. It did not show it when the page reloaded itself, and now it has forgotten all about it.

Following some discussions here this morning I ran the Validator over the page and got the diagnostic "The character encoding specified in the HTTP header (utf-8) is different from the value in the <meta> element (iso-8859-1). I will use the value from the HTTP header (utf-8) for this validation."

As far as I can work out, this is caused by an incompatibility in the witchcraft Dreamweaver includes in a basic HTML page, which is as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>

Should I worry about this warning?

(I removed a few other insignificant errors, but IE6 still can't see the favicon.)
Clancy

On 22 Apr 2008 in macromedia.dreamweaver, Clancy wrote:
> Following some discussions here this morning I ran the Validator
> over the page and got the diagnostic "The character encoding
> specified in the HTTP header (utf-8) is different from the value in
> the <meta> element (iso-8859-1). I will use the value from the HTTP
> header (utf-8) for this validation."
>
> As far as I can work out, this is caused by an incompatibility in
> the witchcraft Dreamweaver includes in a basic HTML page, which is
> as follows:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html;
> charset=iso-8859-1" />
> <title>Untitled Document</title>
> </head>
>
> Should I worry about this warning?
At an offhand guess, you're on an Apache 2.x server. In its default setup, it sends a UTF-8 charset header (typically via the AddDefaultCharset directive in httpd.conf or .htaccess). That is sufficient for the browser. It also conflicts with the iso-8859-1 charset in the document. The fastest cure for it is to remove the charset meta from the page, or change it to UTF-8. But it has no bad effects that I know of on a browser.
Joe Makowiec
http://makowiec.net/
Email: http://makowiec.net/contact.php

Similar Messages

  • Don't know what to do: Character encoding mismatch in validation

    This is what the validator tells me:
    The character encoding specified in the HTTP header (iso-8859-1) is different from the value in the <meta> element (utf-8). I will use the value from the HTTP header (iso-8859-1) for this validation.
    This is my DTD:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    And this is in the Head:
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    Can anyone tell me what to do with that?
    Thanks
    Martin

    Sounds like the server is sending a header that says the data is iso-8859-1.
    Why it does so needs to be answered by someone more knowledgeable than I.
    ~Nick
    > This is what the validator tells me:
    >
    > The character encoding specified in the HTTP header (iso-8859-1) is
    > different from the value in the <meta> element (utf-8). I will use the value
    > from the HTTP header (iso-8859-1) for this validation.

  • ' is not a valid XML character.

    Hi All,
    Maybe an XML expert can help me understand this XML exception.
    I am trying to expose an EJB as a web service using AXIS 1.2 Final. When I
    deploy my ear/ejb on WAS 4.0 on the HP-UX operating system and try
    to consume the web service using AXIS 1.2 generated stubs, I get the error below on the server side.
    [9/12/05 8:17:55:260 MST] ed128 WebGroup X Servlet Error: The
    char '0x0' in 'java.rmi.RemoteException: CORBA
    BAD_OPERATION 0 No; nested exception is:
    org.omg.CORBA.BAD_OPERATION: minor code: 0 completed: No
    at com.amla.as.cameron.ejb._EJSRemoteStatelessCS_146c46e9_Tie._invoke(_EJSRemoteStatelessCS_146c46e9_Tie.java:340)
    at com.ibm.CORBA.iiop.ExtendedServerDelegate.dispatch(ExtendedServerDelegate.java:532)
    at com.ibm.CORBA.iiop.ORB.process(ORB.java:2450)
    at com.ibm.CORBA.iiop.OrbWorker.run(OrbWorker.java:186)
    at com.ibm.ejs.oa.pool.ThreadPool$PooledWorker.run(ThreadPool.java:104)
    at com.ibm.ws.util.CachedThread.run(ThreadPool.java:144)
    minor code: 0 completed: No' is not a valid XML character.:
    java.lang.IllegalArgumentException: The char '0x0' in
    'java.rmi.RemoteException: CORBA BAD_OPERATION 0 No; nested exception
    is:
    org.omg.CORBA.BAD_OPERATION: minor code: 0 completed: No
    at com.amla.as.cameron.ejb._EJSRemoteStatelessCS_146c46e9_Tie._invoke(_EJSRemoteStatelessCS_146c46e9_Tie.java:340)
    at com.ibm.CORBA.iiop.ExtendedServerDelegate.dispatch(ExtendedServerDelegate.java:532)
    at com.ibm.CORBA.iiop.ORB.process(ORB.java:2450)
    at com.ibm.CORBA.iiop.OrbWorker.run(OrbWorker.java:186)
    at com.ibm.ejs.oa.pool.ThreadPool$PooledWorker.run(ThreadPool.java:104)
    at com.ibm.ws.util.CachedThread.run(ThreadPool.java:144)
    minor code: 0 completed: No' is not a valid XML character.
    at org.apache.axis.components.encoding.AbstractXMLEncoder.encode(AbstractXMLEncoder.java:110)
    at org.apache.axis.utils.XMLUtils.xmlEncodeString(XMLUtils.java:117)
    at org.apache.axis.AxisFault.dumpToString(AxisFault.java:366)
    at org.apache.axis.AxisFault.printStackTrace(AxisFault.java:796)
    at org.apache.commons.logging.impl.SimpleLog.log(SimpleLog.java:338)
    at org.apache.commons.logging.impl.SimpleLog.warn(SimpleLog.java:446)
    at org.apache.axis.attachments.AttachmentsImpl.getAttachmentCount(AttachmentsImpl.java:523)
    at org.apache.axis.Message.getContentType(Message.java:475)
    at org.apache.axis.transport.http.AxisServlet.doPost(AxisServlet.java:713)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:760)
    at org.apache.axis.transport.http.AxisServletBase.service(AxisServletBase.java:301)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at com.ibm.servlet.engine.webapp.StrictServletInstance.doService(ServletManager.java:827)
    at com.ibm.servlet.engine.webapp.StrictLifecycleServlet._service(StrictLifecycleServlet.java:167)
    at com.ibm.servlet.engine.webapp.IdleServletState.service(StrictLifecycleServlet.java:297)
    at com.ibm.servlet.engine.webapp.StrictLifecycleServlet.service(StrictLifecycleServlet.java:110)
    at com.ibm.servlet.engine.webapp.ServletInstance.service(ServletManager.java:472)
    at com.ibm.servlet.engine.webapp.ValidServletReferenceState.dispatch(ServletManager.java:1012)
    at com.ibm.servlet.engine.webapp.ServletInstanceReference.dispatch(ServletManager.java:913)
    at com.ibm.servlet.engine.webapp.WebAppRequestDispatcher.handleWebAppDispatch(WebAppRequestDispatcher.java:721)
    at com.ibm.servlet.engine.webapp.WebAppRequestDispatcher.dispatch(WebAppRequestDispatcher.java:374)
    at com.ibm.servlet.engine.webapp.WebAppRequestDispatcher.forward(WebAppRequestDispatcher.java:118)
    at com.ibm.servlet.engine.srt.WebAppInvoker.doForward(WebAppInvoker.java:134)
    at com.ibm.servlet.engine.srt.WebAppInvoker.handleInvocationHook(WebAppInvoker.java:239)
    at com.ibm.servlet.engine.invocation.CachedInvocation.handleInvocation(CachedInvocation.java:67)
    at com.ibm.servlet.engine.srp.ServletRequestProcessor.dispatchByURI(ServletRequestProcessor.java:151)
    at com.ibm.servlet.engine.oselistener.OSEListenerDispatcher.service(OSEListener.java:317)
    at com.ibm.servlet.engine.http11.HttpConnection.handleRequest(HttpConnection.java:60)
    at com.ibm.ws.http.HttpConnection.readAndHandleRequest(HttpConnection.java:477)
    at com.ibm.ws.http.HttpConnection.run(HttpConnection.java:341)
    at com.ibm.ws.util.CachedThread.run(ThreadPool.java:144)
    Can anyone please suggest under what circumstances we get this error?
    When I monitor the request SOAP message using AXIS handlers, the SOAP
    message seems to be valid and well-formed. The EJB deployment on WAS
    4.0 is successful and the bean can be accessed successfully by an EJB
    client.
    Thanks & Regards,
    bab.
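    The root cause: XML 1.0 has no representation at all for the NUL character (U+0000), so when AXIS tries to XML-encode a fault string containing char 0x0 it must reject it. A hedged Java sketch of a cleanup step that could be applied to such strings before they reach the encoder (the helper is illustrative, not part of AXIS):

    public static String stripInvalidXmlChars(String s) {
        // XML 1.0 allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD
        // (note: this also strips lone surrogates)
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }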

    I am having a similar problem. Did you find the solution for this?
    Thanks

  • Dreamweaver warning UTF encoding

    Help!! I am an intermediate web designer and have been using Dreamweaver MX 2004 on my Mac OS X. I have been working on a page named services.html for many days. It is a lot of text laid out in a table. It has been loading fine throughout the past week, as I've been editing it daily. Today I made minor changes to the navigation, and when I went to upload I got this strange warning that I've never seen before, and I don't know what it means. Since my page had been fine on the internet up to this point I chose to ignore the warning and uploaded as usual--big mistake. The content has fallen apart with huge gaps between paragraphs, like inches of blank space. All of the content is in the same <td>, so I know that I can't blame it on the table. The warning with the big yellow triangle said, "The document's current encoding can not correctly save all of the characters within the document. You may want to change to UTF or an encoding that supports the special characters in this document." What the heck does that mean?? Hopeless in California, deborah. Here is the link to the page, so that you can look at it and my code. Thanks to anyone that can come to my rescue.
    http://www.dnorthphoto.com/bloomful_web/services.html

    On 18 Nov 2006 in macromedia.dreamweaver, David Powers wrote:
    > dnorth wrote:
    >> Today I made minor changes to the navigation, and when I went to
    >> upload I got this strange warning that I've never seen before,
    >> and I don't know what it means. Since my page had been fine on
    >> the internet up to this point I chose to ignore the warning and
    >> uploaded as usual--big mistake. the content has fallen apart with
    >> huge gaps between paragraphs, like inches of blank space.
    >
    > It has nothing to do with the warning about character encoding. The
    > huge gaps are caused by this style rule in the services_headings
    > class in your stylesheet:
    >
    > padding-bottom: 150%;
    >
    > Remove it, and the gaps disappear.
    Or use a heading tag (<h1>, <h2>, <hX>), which would be more
    semantically correct anyway.
    Joe Makowiec
    http://makowiec.net/
    Email: http://makowiec.net/email.php

  • Seeing � etc despite having View--Character encoding as unicode and auto-detect universal

    On viewing some web pages I see characters such as �, ,  (for example). But View > Character Encoding is set to Unicode (UTF-8) or Western (ISO-8859-1), and Tools > Options > Content > Fonts > Advanced encoding is set with either of those.

    example of page:
    http://scienceofdoom.com/2010/09/17/on-missing-the-point-by-chilingar-et-al-2008/
    - a little over half way down, the section headed "Anthropogenic Imact on the Earth’s Climate – Tiny" from paragraph "And continue: " there are these non-characters in the equation (12) and subsequently.
    Another page : http://www.zimbabwesituation.com/sep26_2010.html in the topic " Red warning lights" .
    Most web-pages I read are without problem.
    I contacted the writer of the first page and s/he had no idea why it happens.

  • Character encoding for ResponseWriter

    Hi,
    How can I control the character encoding of the ResponseWriter?
    What encoding does it use by default?
    Thanks.

    Since I had junior developers becoming desperate over this problem, I'll post our solution for anybody that's not working on WebSphere and wants to solve it.
    We seem to have solved it using a servlet filter that inserts this response wrapper:
    import java.util.Locale;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpServletResponseWrapper;

    class CharacterEncodingHttpResponseWrapper extends HttpServletResponseWrapper {
      private String contentTypeWithCharacterEncoding;
      private String encoding;

      CharacterEncodingHttpResponseWrapper(HttpServletResponse resp, String encoding) {
        super(resp);
        this.encoding = encoding;
      }

      public void setContentType(String contentType) {
        // one place to define the encoding instead of in every JSP page
        contentTypeWithCharacterEncoding = addOrReplaceCharset(contentType, encoding);
        super.setContentType(contentTypeWithCharacterEncoding);
      }

      public void setLocale(Locale locale) {
        // setting the locale also sets the charset to ISO
        if (contentTypeWithCharacterEncoding == null) {
          CharacterEncodingFilter.LOGGER.warn("Encoding is set to ISO via locale.");
        } else {
          super.setLocale(locale);
          // and set the encoding back to the desired encoding
          setContentType(contentTypeWithCharacterEncoding);
        }
      }

      /**
       * Utility method that sets the charset in the HTTP header
       * <code>Content-Type:application/x-www-form-urlencoded;charset=ISO-8859-1</code>
       * or in the content type on the servlet HTTP response
       * <code>text/html;charset=ISO-8859-1</code>.
       */
      private String addOrReplaceCharset(String headervalue, String charset) {
        if (null != headervalue) {
          // see if this header had a charset
          String charsetStr = "charset=";
          int len = charsetStr.length(), i = 0;
          // if we have a charset in this Content-Type header
          if (-1 != (i = headervalue.indexOf(charsetStr))) {
            // if it has a non-zero length, replace the value
            if ((i + len < headervalue.length())) {
              headervalue = headervalue.substring(0, i + len) + charset;
            } else {
              headervalue = headervalue + charset;
            }
          } else {
            headervalue = headervalue + ";charset=" + charset;
          }
          return headervalue;
        } else {
          CharacterEncodingFilter.LOGGER.warn("content-type header not set");
          return "application/x-www-form-urlencoded;charset=" + charset;
        }
      }
    }
    If all your JSF/JSP pages have consistently set the encoding in the content type, your addOrReplace method should only add, not replace.
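    For completeness, here is a minimal sketch of the enclosing filter this wrapper plugs into. Only the CharacterEncodingFilter and LOGGER names come from the snippet above; the rest (including the assumption of log4j for LOGGER) is illustrative, and the web.xml registration is omitted:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletResponse;
    import org.apache.log4j.Logger;

    public class CharacterEncodingFilter implements Filter {
      static final Logger LOGGER = Logger.getLogger(CharacterEncodingFilter.class);

      public void init(FilterConfig config) {}

      public void destroy() {}

      public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
          throws IOException, ServletException {
        // wrap the response so every later setContentType/setLocale call keeps UTF-8
        chain.doFilter(req, new CharacterEncodingHttpResponseWrapper(
            (HttpServletResponse) resp, "UTF-8"));
      }
    }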

  • Firefox 4 cannot read Japanese, although I set character encoding UTF-8

    Firefox 4 cannot read Japanese, although I set the character encoding to UTF-8.
    This happens on every web site, e.g. http://www.youtube.com/watch?v=lAagLbHYDZY or Google. I updated the language in my Windows 7, and set fonts and everything, but it does not work.
    I also downloaded the Firefox Japanese language pack and installed it (see picture):
    http://poto.cyberwakeup.com/images/cqs1308507072d.PNG
    and ran it:
    http://poto.cyberwakeup.com/images/opt1308507757r.PNG
    IE9 and Opera can read it, but Firefox 4 cannot :(
    PS: my Windows default language is Thai.

    Try to toggle some of the Boolean gfx.font_rendering prefs on the about:config page to disable some features.
    Filter: gfx
    To open the about:config page, type about:config in the location (address) bar and press the "Enter" key, just like you type the URL of a website to open a website.
    If you see a warning then you can confirm that you want to access that page.
    You can use the Filter bar at the top of the about:config page to locate a pref more easily.
    * http://kb.mozillazine.org/about:config

  • Using Direct Input mode: UTF-8 character encoding assumed

    When I validate a web page in Dreamweaver CS5 (using Jeffrey Zeldman's Web Standards Advisor) I receive the following warning: "Using Direct Input mode: UTF-8 character encoding assumed."
    However, if I validate using the W3C validator, either as a file upload or via the live site, the page validates correctly.
    Can anybody please help, as it is driving me insane.

    Hi John,
    Thank you for your swift response.
    As I said before, this only happens locally within Dreamweaver CS5.
    However, the web address is www.countryimage.co.uk/index.htm
    As you will see from the code, I have included <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> within the head.

  • What is the proper way to set character encoding in an HTTPService request?

    I'm trying to get an HTTPService object's request to have proper character encoding. If I do nothing, I get "null" inside a Java servlet when I call getCharacterEncoding() on the request object. If I do this to my Flex HTTPService:
    httpService.contentType = "application/x-www-form-urlencoded; charset=UTF-8";
    then I get a valid character encoding (UTF-8) in the Java servlet as I should. But the problem is that my HTTPService's POST parameters are no longer coming along with the request. If I drop the charset value and set this instead in Flex:
    httpService.contentType = "application/x-www-form-urlencoded";
    then I get my POST params in my servlet just fine, but of course, no charset info.
    (For completeness, I am also setting: httpService.method = "POST"; and httpService.resultFormat = "e4x"; as well as the URL.)
    How do I send charset info without interfering with the transmission of the POST params? This is a serious flaw for anyone doing UTF-8 content, because most servers are going to assume ISO-8859-1 if you don't send anything specific. It's interesting that Flex is actually encoding in UTF-8. I know, because I am currently working around the problem by intercepting the HTTP request in my servlet and forcing the character encoding to UTF-8 before binding the params. That's a lousy workaround, though.
    Hint to Flex 3 developers: It would be much more preferable to have a setCharacterEncoding method (or characterEncoding prop) on the Flex HTTPService.
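    The "force the encoding before binding the params" workaround the poster describes is usually done with a servlet filter along these lines (a sketch; the class name and the choice to filter every request are assumptions, not from the thread):

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class ForceUtf8RequestFilter implements Filter {
      public void init(FilterConfig config) {}

      public void destroy() {}

      public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
          throws IOException, ServletException {
        // must run before the first getParameter() call, which is what binds the params
        if (req.getCharacterEncoding() == null) {
          req.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(req, resp);
      }
    }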

    Hello,
    I realize this is an old thread, but the problem still seems to exist in Flex 3, and I have run into it.
    Unfortunately I don't understand the workaround.
    Could someone point out in a bit more detail how this should
    be done?
    Many thanks indeed,
    Peter
    _servlet = new HTTPService();
    _servlet.url = ...;
    _servlet.resultFormat = _resultFormat;
    _servlet.addEventListener(ResultEvent.RESULT, onServiceActionResult);
    _servlet.addEventListener(FaultEvent.FAULT, onServiceActionFault);
    _servlet.requestTimeout = _timeout;
    _servlet.contentType = _requestMimeType;
    _servlet.method = _method;
    XML.prettyPrinting = false;
    if (sdk13922Workaround) {
        _servlet.request = params;
        this._token = _servlet.send(null);
    } else {
        _servlet.request = request;
        this._token = _servlet.send(_params);
    }

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1. Unicode does not solve this issue for us (yet).
    2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for a while this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard, and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6 byte sequences. Using this MBCS (multi-byte character set) you can write the equivalent of every unicode character. And, assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. Then, in their text editor, using the codepage for their region, they insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that character is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
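    To make the trip-over concrete, here is a hedged Java illustration (the byte values are standard UTF-8/ISO-8859-1 facts; the class is made up):

    import java.util.Arrays;

    public class SharpS {
        public static void main(String[] args) throws Exception {
            // ß (U+00DF) is two bytes in UTF-8 but one byte in ISO-8859-1
            System.out.println(Arrays.toString("ß".getBytes("UTF-8")));      // [-61, -97] = 0xC3 0x9F
            System.out.println(Arrays.toString("ß".getBytes("ISO-8859-1"))); // [-33]      = 0xDF
            // A lone 0xDF fed to a UTF-8 decoder announces a 2-byte sequence,
            // so the decoder either consumes the next byte too or reports an error.
        }
    }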
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files, where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
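    A hedged Java sketch of Point 3 (file names are illustrative; the point is that the encoding is named explicitly instead of relying on FileReader/FileWriter and the default codepage):

    import java.io.*;

    public class ExplicitEncodingCopy {
        public static void main(String[] args) throws IOException {
            Reader in = new InputStreamReader(new FileInputStream("in.txt"), "UTF-8");
            Writer out = new OutputStreamWriter(new FileOutputStream("out.txt"), "UTF-8");
            for (int c; (c = in.read()) != -1; ) {
                out.write(c); // characters, not raw bytes: the encoders do the translation
            }
            out.close();
            in.close();
        }
    }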
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
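    And a hedged sketch of Point 4, letting an XML writer handle both the declaration and the byte encoding (uses StAX, i.e. javax.xml.stream; the element and file names are made up):

    import java.io.FileOutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class XmlEncoderDemo {
        public static void main(String[] args) throws Exception {
            FileOutputStream out = new FileOutputStream("greeting.xml");
            XMLStreamWriter w = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(out, "UTF-8");
            w.writeStartDocument("UTF-8", "1.0"); // the encoding lands in the declaration
            w.writeStartElement("greeting");
            w.writeCharacters("Grüße");           // the writer encodes this correctly
            w.writeEndElement();
            w.writeEndDocument();
            w.close();
            out.close();
        }
    }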
    Ok, you're reading & writing files correctly, but what about inside your code? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtimes are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around a while) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed

    DavidThi808 wrote:
    > This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    > If you write code that touches a text file, you probably need this.
    > Let's start off with two key items:
    > 1. Unicode does not solve this issue for us (yet).
    > 2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    > And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.
    They might only use that range, but that is a different issue, especially since that range is exactly the same as the UTF-8 character set anyway.
    > The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    > And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    > And for a while this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years, then a column with a size of 8 bytes is significantly different from one with 16 bytes.
    > Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    > Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    > Now let's look at UTF-8, because as the standard, and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    > UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6 byte sequences. Using this MBCS (multi-byte character set) you can write the equivalent of every unicode character. And, assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of unicode, all unicode, are based on ASCII. The representational format of UTF-8 is required to implement unicode, thus it must represent those characters. It uses the idiom supported by variable width encodings to do that.
    > But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. Then, in their text editor, using the codepage for their region, they insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that character is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with HTML/XML.
    > Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    > Now, what about when the code you are writing will read or write a file? We are not talking binary/data files, where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know Java files have a default encoding - the specification defines it. And I am certain C# does as well.
    > Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    > Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (It also adds the endian preamble to the file.)
    > Ok, you're reading & writing files correctly, but what about inside your code? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtimes are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in Java with escaped unicode characters which will fail to compile.
    > Point 5 – (For developers on languages that have been around a while) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone that is creating an inventory system for a standalone store to craft a solution that supports multiple languages.
    And another example: with high volume systems, moving/storing bytes is relevant. As such one must carefully consider each text element as to whether it is customer consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems incremental savings impact operating costs and marketing advantage with speed.

  • Character set mismatch in copying from oracle to oracle

    I have a set of ODI scripts that copy from a source JD Edwards ERP database (Oracle 10g) to a BI datamart (Oracle 10g), and all the original scripts work OK.
    However, I have mapped on to some additional tables in the ERP source database and some new BI tables in the target datamart database (Oracle to Oracle), but I get an error when I try to execute these.
    The operator log shows that the error is in the 'INSERT FLOW INTO I$ TABLE' step, and the error is ORA-12704: character set mismatch.
    The character sets for both Oracle databases are the same (and have not changed): the main NLS_CHARACTERSET is AL32UTF8 and the national NLS_NCHAR_CHARACTERSET is AL16UTF16.
    This worked for tables containing NCHAR and NUMBER columns in previous scripts, but not for anything I write now.
    The only other difference is that there was a recent upgrade of ODI to 10.1.3.5 - the repositories are also upgraded.
    Any ideas?

    Hi Ravi,
    yes, a gateway would help. In 11.2 Oracle offers two kinds of gateways to a SQL Server - a free gateway based on 3rd party ODBC drivers (you need to get them from a 3rd party vendor; they are not included in the package), called Database Gateway for ODBC (=DG4ODBC), and a very powerful Database Gateway for MS SQL Server (=DG4MSQL), which also allows you to execute distributed transactions and call remote SQL Server stored procedures. Please keep in mind that DG4MSQL requires a separate license.
    As you didn't post which platform you're going to use, please check "My Oracle Support" (=MOS), where you'll find notes on how to configure each gateway for all supported platforms - just look for DG4MSQL or DG4ODBC.
    On OTN you'll also find the manuals.
    DG4ODBC: http://download.oracle.com/docs/cd/E11882_01/gateways.112/e12070.pdf
    DG4MSQL: http://download.oracle.com/docs/cd/E11882_01/gateways.112/e12069.pdf
    The generic gateway installation for Unix: http://download.oracle.com/docs/cd/E11882_01/gateways.112/e12013.pdf
    and for Windows: http://download.oracle.com/docs/cd/E11882_01/gateways.112/e12061.pdf

  • Ora-12704: character set mismatch

    I'm trying to insert string data into an NVARCHAR2 column using JDBC and I keep getting the error ORA-12704: character set mismatch. How do I avoid this without using the translate function? I believe my NLS variables are all set correctly. Our software will eventually be in Korean.

    Both my database and national character sets are UTF8.
    We are trying to design a system that supports internationalization. The majority of our customers will be using English, but we do have Korean customers, and all their data will be in Korean. It is my understanding (and I could be wrong) that the Korean character set is a varying multi-byte set. This understanding has led me to believe I need to define my "character" columns as NCHAR and NVARCHAR2. Is this assumption wrong?
    Also, my NLS_LANG environment variable is AMERICAN_AMERICA.UTF8. I'm currently running Oracle 8.1.6. We have received 8.1.7 but I haven't upgraded yet.
    Thanks in advance!
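    For reference, ORA-12704 in this situation is usually cured by telling the driver that the bind is NCHAR-form data. A hedged JDBC sketch (setFormOfUse and FORM_NCHAR are Oracle driver extensions; the table and column names are made up):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import oracle.jdbc.OraclePreparedStatement;

    public class NcharInsert {
        public static void insertName(Connection conn, String koreanName) throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO customers (name) VALUES (?)"); // name is NVARCHAR2
            // mark the bind as national character set data to avoid ORA-12704
            ((OraclePreparedStatement) ps).setFormOfUse(1, OraclePreparedStatement.FORM_NCHAR);
            ps.setString(1, koreanName);
            ps.executeUpdate();
            ps.close();
        }
    }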

  • XML Character Encoding Using UTL_DBWS

    Hi,
    I have a database with WINDOWS-1252 character encoding. I'm using UTL_DBWS to call a web service method which echoes a given string. For this purpose, I do the following:
    DECLARE
        v_wsdl CONSTANT VARCHAR2(500) := 'http://myhost/myservice?wsdl';
        v_namespace CONSTANT VARCHAR2(500) := 'my.namespace';
        v_service_name CONSTANT UTL_DBWS.QNAME := UTL_DBWS.to_qname(v_namespace, 'MyService');
        v_service_port CONSTANT UTL_DBWS.QNAME := UTL_DBWS.to_qname(v_namespace, 'MySoapServicePort');
        v_ping CONSTANT UTL_DBWS.QNAME := UTL_DBWS.to_qname(v_namespace, 'ping');
        v_wsdl_uri CONSTANT URITYPE := URIFACTORY.getURI(v_wsdl);
        v_str_request CONSTANT VARCHAR2(4000) :=
    '<?xml version="1.0" encoding="UTF-8" ?>
    <ping>
        <pingRequest>
            <echoData>Dev Team üöäß</echoData>
        </pingRequest>
    </ping>';
        v_service UTL_DBWS.SERVICE;
        v_call UTL_DBWS.CALL;
        v_request XMLTYPE := XMLTYPE (v_str_request);
        v_response SYS.XMLTYPE;
    BEGIN
        DBMS_JAVA.set_output(20000);
        UTL_DBWS.set_logger_level('FINE');
        v_service := UTL_DBWS.create_service(v_wsdl_uri, v_service_name);
        v_call := UTL_DBWS.create_call(v_service, v_service_port, v_ping);
        UTL_DBWS.set_property(v_call, 'oracle.webservices.charsetEncoding', 'UTF-8');
        v_response := UTL_DBWS.invoke(v_call, v_request);
        DBMS_OUTPUT.put_line(v_response.getStringVal());
        UTL_DBWS.release_call(v_call);
        UTL_DBWS.release_all_services;
    END;
    /
    Here is the SERVER OUTPUT:
    ServiceFacotory: oracle.j2ee.ws.client.ServiceFactoryImpl@a9deba8d
    WSDL: http://myhost/myservice?wsdl
    Service: oracle.j2ee.ws.client.dii.ConfiguredService@c881d39e
    *** Created service: -2121202561 - oracle.jpub.runtime.dbws.DbwsProxy$ServiceProxy@afb58220 ***
    ServiceProxy.get(-2121202561) = oracle.jpub.runtime.dbws.DbwsProxy$ServiceProxy@afb58220
    Collection Call info: port={my.namespace}MySoapServicePort, operation={my.namespace}ping, returnType={my.namespace}PingResponse, params count=1
    setProperty(oracle.webservices.charsetEncoding, UTF-8)
    dbwsproxy.add.map: ns, my.namespace
    Attribute 0: my.namespace: xmlns:ns, my.namespace
    dbwsproxy.lookup.map: ns, my.namespace
    createElement(ns:ping,null,my.namespace)
    dbwsproxy.add.soap.element.namespace: ns, my.namespace
    Attribute 0: my.namespace: xmlns:ns, my.namespace
    dbwsproxy.element.node.child.3: 1, null
    createElement(echoData,null,null)
    dbwsproxy.text.node.child.0: 3, Dev Team üöäß
    request:
    <ns:ping xmlns:ns="my.namespace">
       <pingRequest>
          <echoData>Dev Team üöäß</echoData>
       </pingRequest>
    </ns:ping>
    Jul 8, 2008 6:58:49 PM oracle.j2ee.ws.client.StreamingSender _sendImpl
    FINE: StreamingSender.response:<?xml version = '1.0' encoding = 'UTF-8'?>
    <env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"><env:Header/><env:Body><ns0:pingResponse xmlns:ns0="my.namespace"><pingResponse><responseTimeMillis>0</responseTimeMillis><resultCode>0</resultCode><echoData>Dev Team üöäß</echoData></pingResponse></ns0:pingResponse></env:Body></env:Envelope>
    response:
    <ns0:pingResponse xmlns:ns0="my.namespace">
       <pingResponse>
          <responseTimeMillis>0</responseTimeMillis>
          <resultCode>0</resultCode>
          <echoData>Dev Team üöäß</echoData>
       </pingResponse>
    </ns0:pingResponse>
    As you can see, the character encoding is broken in both the request and the response, i.e. the SOAP encoder does not take the UTF-8 encoding into consideration.
    I tracked down the problem to the method oracle.jpub.runtime.dbws.DbwsProxy.dom2SOAP(org.w3c.dom.Node, java.util.Hashtable); and more specifically to the calls of oracle.j2ee.ws.saaj.soap.soap11.SOAPFactory11.
    My question is: is there a way to make the SOAP encoder use the correct character encoding?
    Thanks a lot in advance!
    Greetings,
    Dimitar

    I found a workaround for the problem:
        v_response := XMLType(v_response.getBlobVal(NLS_CHARSET_ID('CHAR_CS')), NLS_CHARSET_ID('AL32UTF8'));
    Ugly, but I'm tired of decompiling and debugging Java classes ;)
    Greetings,
    Dimitar

  • XML parser not detecting character encoding

    Hi,
    I am using JDeveloper 9.0.5 preview, and the same problem is happening in our production AS 9.0.2 release.
    The character encoding of an XML document is not correctly detected by the Oracle v2 parser, even though the XML declaration correctly contains
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    Instead it treats the document as UTF-8 encoded, which is fine until a document comes along with an extended character, which then causes a
    java.io.UTFDataFormatException: Invalid UTF8 encoding.
    at oracle.xml.parser.v2.XMLUTF8Reader.checkUTF8Byte(XMLUTF8Reader.java:160)
    at oracle.xml.parser.v2.XMLUTF8Reader.readUTF8Char(XMLUTF8Reader.java:187)
    at oracle.xml.parser.v2.XMLUTF8Reader.fillBuffer(XMLUTF8Reader.java:120)
    at oracle.xml.parser.v2.XMLByteReader.saveBuffer(XMLByteReader.java:448)
    at oracle.xml.parser.v2.XMLReader.fillBuffer(XMLReader.java:2023)
    at oracle.xml.parser.v2.XMLReader.tryRead(XMLReader.java:972)
    at oracle.xml.parser.v2.XMLReader.scanXMLDecl(XMLReader.java:2589)
    at oracle.xml.parser.v2.XMLReader.pushXMLReader(XMLReader.java:485)
    at oracle.xml.parser.v2.XMLReader.pushXMLReader(XMLReader.java:192)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:144)
    As you can see, it is explicitly using the XMLUTF8Reader to perform the read.
    I can get around this by hard-coding the XML input stream to be processed by a reader:
    XMLSource = new StreamSource(new InputStreamReader(XMLInStream, "ISO-8859-1"));
    However, the manual documents that the character encoding is automatically picked up from the XML file and that wrapping the stream in a reader is not necessary, so I should be able to write
    XMLSource = new StreamSource(XMLInStream);
    Does anyone else experience this same problem?
    Having to hardcode the encoding causes my software to lose flexibility.
    Jarrod Sharp.
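    For reference, a hedged variant of the same workaround that declares the encoding on a SAX InputSource instead of wrapping the stream in a reader (standard org.xml.sax / javax.xml.transform API; whether the Oracle v2 parser honors the hint is exactly what is in question here):

    import java.io.InputStream;
    import javax.xml.transform.Source;
    import javax.xml.transform.sax.SAXSource;
    import org.xml.sax.InputSource;

    public class EncodedSource {
        static Source latin1Source(InputStream xmlInStream) {
            InputSource in = new InputSource(xmlInStream);
            in.setEncoding("ISO-8859-1"); // explicit encoding overrides byte-stream autodetection
            return new SAXSource(in);
        }
    }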

    An XML document should be created with 'ISO-8859-1' encoding to be parsed as 'ISO-8859-1' encoding.

  • What's the difference in character encoding between 1.4.0 and 1.4.2 in Linux

    As far as I can tell, the character encoding for Chinese in JDK 1.4.2 is no longer the same as in JDK 1.4.0.
    In JDK 1.4.0, the default character encoding came from the "file.encoding" system property; we often set the property to "gb2312".
    But in JDK 1.4.2, I find that the default character encoding no longer uses the "file.encoding" system property.
    Who knows the reason?
    Test Program:
    public class B {
        public static void main(String[] args) throws Exception {
            byte[] bytes = new byte[]{(byte)0xD6, (byte)0xD0, (byte)0xCE, (byte)0xC4}; // "中文" in GB2312
            String s1 = new String(bytes);
            String s2 = new String(bytes, System.getProperty("file.encoding"));
            System.out.println("s1=" + s1 + " , s2=" + s2);
            System.out.println("s1.length=" + s1.length() + " , s2.length=" + s2.length());
        }
    }
    Run it four times; the results:
    [root@app15 component]# /usr/local/j2sdk1.4.0/bin/java -Dfile.encoding=ISO-8859-1 -cp . B
    s1=中文 , s2=中文
    s1.length=4 , s2.length=4
    [root@app15 component]# /usr/local/j2sdk1.4.0/bin/java -Dfile.encoding=gb2312 -cp . B
    s1=中文 , s2=中文
    s1.length=2 , s2.length=2
    [root@app15 component]# /usr/local/j2sdk1.4.2/bin/java -Dfile.encoding=ISO-8859-1 -cp . B
    s1=中文 , s2=中文
    s1.length=4 , s2.length=4
    [root@app15 component]# /usr/local/j2sdk1.4.2/bin/java -Dfile.encoding=gb2312 -cp . B
    s1=中文 , s2=??
    s1.length=4 , s2.length=2
    [root@app15 component]#

    I don't know for sure, but:
    -- The API documentation for String says that "new String(byte[])" uses "the platform's default charset".
    -- The API documentation for Charset says "The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system."
    You'll notice that it doesn't say anything about using the file.encoding system value, so presumably (based on your experiments) it doesn't. I did a search for "java default charset" and didn't find anything specific, but this site says "As of Java 1.4.1, the default Charset varies from platform to platform" and suggests you explicitly hard-code your charset. I would agree with that.
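    To stay out of this swamp entirely, name the charset instead of relying on the default. A short hedged Java sketch (Charset.defaultCharset() needs Java 5+; "GB2312" is a standard JDK charset name):

    import java.nio.charset.Charset;

    public class DefaultCharsetDemo {
        public static void main(String[] args) throws Exception {
            byte[] bytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4}; // "中文" in GB2312
            System.out.println("default charset: " + Charset.defaultCharset());
            // never rely on new String(byte[]); name the encoding you mean
            String s = new String(bytes, "GB2312");
            System.out.println(s + " length=" + s.length());
        }
    }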
