[SOLVED] Character encoding conversion

Hi, my first post here...
I'm writing a shell script (using tcl/expect) to show mail on a POP server.
How do I convert e-mail headers containing text like this:
Subject: =?iso-8859-1?B?ZGl2ZXJ0aW1lbnRvIHNlbnphIGZpbmUgMjQgb3JlIHN1IDI0?=
to make it readable?
I already tried with 'iconv' and 'recode' with no luck...
Thanks.
Last edited by mr.entropy (2010-11-02 16:34:39)

These subjects have (probably unnecessarily) been encoded according to RFC 2047.
The "B" in between the 2nd and 3rd ? indicates that it is encoded using Base64 (if there's a Q it's "Quoted-Printable" see RFC RFC 2045).
So you're looking for a Base64 decoder.
EDIT: and there happens to be one in coreutils. 
man base64
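For illustration, a minimal sketch of that decoding in Java, using the sample subject from the post. The field between the 2nd and 3rd "?" names the charset, so decode the Base64 payload and then interpret the resulting bytes in that charset (from the shell, piping the middle field through "base64 -d" does the same; JavaMail's MimeUtility.decodeText handles the general case, including the "Q" encoding):

import java.nio.charset.Charset;
import java.util.Base64;

public class DecodeEncodedWord {
    public static void main(String[] args) {
        // An RFC 2047 encoded-word: =?charset?B?base64-data?=
        String word = "=?iso-8859-1?B?ZGl2ZXJ0aW1lbnRvIHNlbnphIGZpbmUgMjQgb3JlIHN1IDI0?=";
        String[] parts = word.split("\\?");          // "=", charset, "B", data, "="
        byte[] raw = Base64.getDecoder().decode(parts[3]);
        System.out.println(new String(raw, Charset.forName(parts[1])));
        // prints: divertimento senza fine 24 ore su 24
    }
}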
Last edited by skanky (2010-11-02 14:39:06)

Similar Messages

  • Character encoding conversion for marshall/unmarshall?

    Hello, Java Web Services gurus,
    I am wondering if there is an easy/plugin-able way to do character encoding conversion transparently in the process of marshall/unmarshall.
    Basically, my input/output will always be these UTF-8 XMLs. As the backend database is ISO encoded, I hope the result of unmarshall will give me ISO strings. And when it comes to marshall, the ISO strings can be transparently turned to UTF-8 XML response. Right now I'm using JAXB's annotations to parse XML into objects.
    I understand there may be chars in the input file that cannot be converted; if so, I'd be expecting an error/exception that flags the failure.
    Hope I sound clear. This has been a headache for a while. Really hope someone may help out a bit. Thanks a million in advance.

    [Duplicate Post|http://forums.sun.com/thread.jspa?messageID=10971554&tstart=0#10971554]
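    The usual answer here is that JAXB itself never holds "ISO strings" or "UTF-8 strings": Java strings are UTF-16 internally, and charsets only apply when you serialize to bytes. A minimal sketch of where each knob lives (the Note class and file names are stand-ins, not the poster's code):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Marshaller;
    import javax.xml.bind.Unmarshaller;
    import javax.xml.bind.annotation.XmlRootElement;

    @XmlRootElement
    class Note {                  // hypothetical stand-in for the real JAXB classes
        public String text;
    }

    public class JaxbEncodingSketch {
        public static void main(String[] args) throws Exception {
            JAXBContext ctx = JAXBContext.newInstance(Note.class);

            // Unmarshal: JAXB reads the encoding from the XML declaration itself.
            Unmarshaller u = ctx.createUnmarshaller();
            Note n = (Note) u.unmarshal(new File("in.xml"));

            // Marshal: the output encoding is a marshaller property.
            Marshaller m = ctx.createMarshaller();
            m.setProperty(Marshaller.JAXB_ENCODING, "UTF-8");
            m.marshal(n, new FileOutputStream("out.xml"));

            // Convert to ISO bytes only at the database boundary; an encoder
            // created this way reports unmappable characters instead of
            // silently writing '?', which matches the error the poster wants.
            CharsetEncoder enc = Charset.forName("ISO-8859-1").newEncoder();
            ByteBuffer isoBytes = enc.encode(CharBuffer.wrap(n.text));
        }
    }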

  • XML data from BLOB to CLOB - character set conversion

    Hi All,
    I'm trying to solve a problem with a character set conversion in PL/SQL in the following scenario:
    1. source is an XML as a BLOB variable.
    2. target is an XML as a CLOB variable.
    3. the problem I have is the following:
    - database character set is set to UTF-8
    - XML character set could be anything (UTF-8, ISO 8859-1, ISO 8859-2, ASCII, ...)
    - I need to write a procedure which converts the source BLOB content into the target CLOB, taking the XML encoding into account, and converts it to the DB default character set (UTF8).
    I've been able to implement a simple conversion function. However, this function expects a fixed XML encoding of ISO-8859-1. The main part of the function looks as follows:
    buffer := UTL_RAW.cast_to_varchar2(
                UTL_RAW.convert(
                  DBMS_LOB.SUBSTR(source_blob_variable, 16000, pos),
                  'American_America.UTF8',
                  'American_America.WE8ISO8859P1'));
    Does anyone have an idea how to rewrite the code to handle "any" XML encoding in the source BLOB file? In other words, is there a function in Oracle which converts XML character set names into Oracle character set values (ISO-8859-1 to we8iso8859p1, UTF-8 to UTF8, ...)?
    Thanks a lot for any help.
    Julius

    I want to pass a BLOB to some "createXML" procedure and get a proper XMLType in the UTF8 character set, properly converted from whatever character set the input is in. As per the documentation, the generated XML always has the encoding set on the client side depending on NLS_LANG (default UTF-8), regardless of the input encoding, so I don't see a need to parse the PI of the XML:
    C:\>echo %NLS_LANG%
    %NLS_LANG%
    C:\>sqlplus
    SQL*Plus: Release 11.1.0.6.0 - Production on Wed Apr 30 08:54:12 2008
    Copyright (c) 1982, 2007, Oracle.  All rights reserved.
    Connected to:
    Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
    With the Partitioning, OLAP, Data Mining and Real Application Testing options
    SQL> var cur refcursor
    SQL>
    SQL> declare
      2     b   blob := utl_raw.cast_to_raw ('<a>myxml</a>');
      3  begin
      4     open :cur for select xmlroot (xmltype (utl_raw.cast_to_varchar2 (b))) xml from dual;
      5  end;
      6  /
    PL/SQL procedure successfully completed.
    SQL>
    SQL> print cur
    XML
    <?xml version="1.0" encoding="UTF-8"?><a>myxml</a>
    SQL> exit
    Disconnected from Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
    With the Partitioning, OLAP, Data Mining and Real Application Testing options
    C:\>set NLS_LANG=GERMAN_GERMANY.WE8ISO8859P1
    C:\>sqlplus
    SQL*Plus: Release 11.1.0.6.0 - Production on Mi Apr 30 08:55:02 2008
    Copyright (c) 1982, 2007, Oracle.  All rights reserved.
    SQL> var cur refcursor
    SQL>
    SQL> declare
      2     b   blob := utl_raw.cast_to_raw ('<a>myxml</a>');
      3  begin
      4     open :cur for select xmlroot (xmltype (utl_raw.cast_to_varchar2 (b))) xml from dual;
      5  end;
      6  /
    PL/SQL-Prozedur erfolgreich abgeschlossen.
    SQL>
    SQL> print cur
    XML
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <a>myxml</a>

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1. Unicode does not solve this issue for us (yet).
    2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical across all of them and the upper half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard, and because of the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double-byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This went up to 6-byte sequences in the original design (RFC 3629 has since capped UTF-8 at 4 bytes). Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character, and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character like ß, which their text editor inserts using the codepage for their region, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
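    To see that trip-over concretely, here is a small Java sketch (an illustration added for this point, not part of the original article): ß saved as a single ISO-8859-1 byte is an invalid start of a UTF-8 sequence.

    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            byte[] saved = "ß".getBytes(StandardCharsets.ISO_8859_1); // one byte: 0xDF
            // A UTF-8 reader sees 0xDF as the lead byte of a 2-byte sequence;
            // with no valid continuation byte it substitutes U+FFFD.
            System.out.println(new String(saved, StandardCharsets.UTF_8)); // prints the replacement character
        }
    }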
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
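    In Java, for instance, the charset is an explicit argument on the byte/character bridge classes, so Point 3 costs one parameter (a sketch, not from the original article):

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class ExplicitEncoding {
        public static void main(String[] args) throws IOException {
            // Write: name the charset instead of relying on the platform default.
            try (Writer w = new OutputStreamWriter(
                    new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
                w.write("naïve café\n");
            }
            // Read: the same charset, stated explicitly.
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    new FileInputStream("out.txt"), StandardCharsets.UTF_8))) {
                System.out.println(r.readLine());
            }
        }
    }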
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It can also add the byte-order mark to the file.)
    Ok, you're reading & writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what those encoders in the Java & .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1. Unicode does not solve this issue for us (yet).
    2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts. Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.
    They might only use that range, but that is a different issue, especially since that range is exactly the same as the UTF8 character set anyway.
    >
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical across all of them and the upper half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard, and because of the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double-byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This went up to 6-byte sequences in the original design. Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character, and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of Unicode, all of Unicode, are based on ASCII. The representational format of UTF8 is required to implement Unicode, thus it must represent those characters. It uses the idiom supported by variable-width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character like ß, which their text editor inserts using the codepage for their region, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with HTML/XML.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encode. If you must create with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It can also add the byte-order mark to the file.)
    Ok, you're reading & writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what those encoders in the Java & .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in java with escaped unicode characters which will fail to compile.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone that is creating an inventory system for a standalone store to craft a solution that supports multiple languages.
    Another example: with high-volume systems, moving/storing bytes is relevant. As such, one must carefully consider each text element as to whether it is customer-consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems incremental savings impact operating costs, and marketing advantage comes with speed.

  • Character Encoding question

    I'm helping another group out on this, so I'm pretty new to this stuff; please go easy on me if I ask anything that is obvious.
    We have a J2EE web application that is sitting on a Red Hat Linux box and is being served up by OAS 10.1.3. The application reads an xml file which contains the actual content of the page and then pulls in the navigation and metadata from other sources.
    Everything works as it should, but there is one issue that has been ongoing for a while and we would like to close it off. In my content source xml file, I have encoded special characters such as é as & amp;#233; - when I view the web page all is well (I see the literal value of é) but when I do a view source, I see & #233;
    If I put & #233; into the source xml file, the page still displays é, but when I do a view source on the web page, I see the literal é value in the source, which is not what we want. What is decoding the character reference? While inserting & amp;#233; into the xml source file works, we do not want to have to encode everything that way; we would prefer to have & #233;. Is it a setting of the OS, the Application Server or the Application itself?
    When I previewed this post, I noticed that by typing & amp;#233; as one solid word, it gets decoded and is seen as é, so I had to put a space between the & and the amp to properly explain myself.
    Any help would be appreciated!
    Thanks,
    /HH

    There are a lot of notes on MetaLink about character encoding. I wrote Note 337945.1 a while ago, which explains this in more detail. I will quote some parts relevant to your situation:
    For the core components, there are three places to set NLS_LANG:
    - in the system environment (this is obvious)
    - in the file opmn.xml
    - in the file apachectl
    A. Changing opmn.xml
    - go to $ORACLE_HOME/opmn/conf and edit the file opmn.xml
    - Search for the OC4J container your application runs in.
    - Within the <process-type.... > </process-type> section, add an entry similar to:
    (1) OracleAS 10g (10.1.2, 10.1.3):
    <environment>
         <variable id="NLS_LANG" value="ENGLISH_UNITED KINGDOM.AL32UTF8"/>
    </environment>
    B. Changing apachectl (Unix only)
    - Go to $ORACLE_HOME/Apache/Apache/bin
    - Open the file 'apachectl'
    - search for NLS_LANG
    e.g.
    NLS_LANG=${NLS_LANG=""}; export NLS_LANG
    Verify if the variable is getting the correct value; this may depend on your environment and on the version of OracleAS. If necessary, change this line. In this example, the value from the environment is taken automatically.
    There is more on this topic in the mod_plsql area but since you do not mention pulling data from the database, this may be less relevant. Otherwise you need to ensure the same NLS_LANG and character set is used in the database to avoid conversions.

  • Character Encoding in XML

    Hello All,
    I am not clear about solving the problem.
    We have a Java application on NT that is supposed to communicate with the same application on MVS mainframe through XML.
    We have a character encoding for these XML commands we send for communication.
    The problem is, on MVS the parser is not understanding the US-ASCII character encoding, and so we are getting the infamous "illegal character error".
    The main frame file.encoding=CP1047 and
    NT's file.encoding = us-ascii.
    Is there any character encoding that is common to these two machines, mainframe and NT?
    If it is Unicode, what is the correct notation for it?
    Or is there any way of specifying to the parsers which character encoding should be used?
    thanks,
    Sridhar

    On the mainframe end, maybe something like:
    FileInputStream fris = new FileInputStream("C:\\whatever.xml");
    InputStreamReader isr = new InputStreamReader(fris, "ASCII"); // or maybe "us-ascii" / "US-ASCII"
    BufferedReader brin = new BufferedReader(isr);
    Or give the input stream / buffered reader to whatever application you are using to parse the xml. The InputStreamReader should allow you to set your encoding even if the system doesn't have the native encoding. It depends, though, on which/whose JVM you are using; JDK 1.2 at least supports the encodings listed on this page: http://as400bks.rochester.ibm.com/pubs/html/as400/v4r4/ic2924/info/java/rzaha/javaapi/intl/encoding.doc.html
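    As for telling the parser itself which encoding to use: with a SAX/DOM parser you can hint it on the InputSource, which wins when the XML declaration is absent or wrong. A sketch (the file name is a placeholder):

    import java.io.FileInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class ParseWithEncoding {
        public static void main(String[] args) throws Exception {
            InputSource src = new InputSource(new FileInputStream("whatever.xml"));
            src.setEncoding("US-ASCII"); // overrides the platform default (CP1047 on MVS)
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(src);
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }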

  • UTF8 character set conversion for chinese Language

    Hi friends,
    I would like some basic explanation of the UTF8 feature: how does it help while converting data from the Chinese language?
    I would also like to know which characters UTF8 will not support while converting from the Chinese language.
    Thanks & Regards
    Ramya Nomula

    Not exactly sure what you are looking for, but on MetaLink, there are numerous detailed papers on NLS character sets, conversions, etc.
    Bottom line is that Chinese characters (since they are more complicated) are stored as multibyte sequences in UTF-8 and AL32UTF8: the common ones take 3 bytes each, and the rarer ones beyond U+FFFF take 4. Some Middle Eastern character sets also fall into the multibyte category.
    Do a google search on "utf8 al32utf8 difference", and you will get some good explanations.
    e.g., http://decipherinfosys.wordpress.com/2007/01/28/difference-between-utf8-and-al32utf8-character-sets-in-oracle/
    Recently, one of our clients had a question on the differences between these two character sets since they were in the process of making their application global. In an upcoming whitepaper, we will discuss in detail what it takes (from a RDBMS perspective) to address localization and globalization issues. As far as these two character sets go in Oracle, the only difference between AL32UTF8 and UTF8 character sets is that AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines UTF-8). Oracle’s “UTF8” stores these characters as a sequence of two UTF-16 surrogate characters encoded using UTF-8 (or six bytes per character). Besides this storage difference, another difference is better support for supplementary characters in AL32UTF8 character set.
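    The storage difference is easy to check from Java for a character beyond U+FFFF (a sketch; the 6-byte figure for Oracle's legacy UTF8 comes from the quote above):

    import java.nio.charset.StandardCharsets;

    public class SupplementaryDemo {
        public static void main(String[] args) {
            // U+20000 is a supplementary CJK ideograph (outside the BMP).
            String s = new String(Character.toChars(0x20000));
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 in real UTF-8 / AL32UTF8
            // Oracle's legacy "UTF8" (CESU-8 behaviour) stores the same character
            // as two UTF-8-encoded UTF-16 surrogates: 6 bytes in total.
        }
    }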
    You may also consider posting your question on the Globalization Support forum, which pertains more to these types of questions.
    Globalization Support

  • (nokia N8) Character encoding - reduced support (n...

    I'm from Poland, and in my language I use special letters like ó, ż, ź, ą, ę. When I want to write a message and use one of the above letters, my message limit becomes 90 characters shorter. On older Nokia phones I changed the text message settings (character encoding: full support => reduced support). When I change the same setting on the Nokia N8, it doesn't work! I always see the shorter message limit. Strangest of all, when I choose "conversations" I don't have any problems and everything works fine; when I choose "new message" I can't use reduced support (the option is on, but not working).
    My software version: 011.012 Software version date: 2010-09-18 Release: PR1.0 Custom Version: 011.012.00.01 Custom version date: 2010-09-18 Language set: 011.012.03.01 Product code: 0599823

    Talking about product code changes is prohibited in this forum. It is unofficial and is grounds for Nokia to refuse to repair or service your phone in any way.

  • Character Encoding is changing random

    Hello,
    For a short while I'm having the next problem:
    When I am using Firefox, after a while characters on pages are shown as 'boxes'. After setting View -> Character Encoding back to ISO-8859-1 the characters are shown correctly again.
    I do not understand why the character encoding randomly changes to UTF-8, while I have selected ISO-8859-1 and set automatic recognition to false.
    In the options menu I have also set 'ISO-8859-1' as the default character encoding.
    I'm hoping someone can tell me why it randomly changes, and why the same page can be shown correctly 10 times, but the 11th time the character set has changed? And of course, how can I solve this problem?

    I do know a website can determine the character encoding, but the strange part of the problem I encounter is that the same page can be shown 10 times correctly, but an 11th time characters like é and ë are shown as 'blocks'/'question marks'.
    Is there an explanation for that behaviour?

  • FF character encoding issue in Mageia 2 ?

    Hi everyone,
    I'm running Mozilla Firefox 17.0.8 in a KDE distro of Linux called Mageia 2. I'm having character encoding problems with certain web pages, meaning that certain icons like the ones next to menu entries (Login, Search box etc.) and in section headlines don't appear properly. Instead they appear either as an Arabic-looking character or as little grey boxes with numbers and letters written in them.
    I've tried experimenting with different encoding systems: Western (ISO 8859-1), (ISO 8859-15), (Windows 1252), Unicode (UTF-8), Central European (ISO 8859-2) but none of them does the job. Currently the char encoding is set to UTF-8. The same web page in Chrome (UTF-8) gives no such problem.
    Can you help me, please?

    Thank you!
    I solved my problem; however, I find the fonts are too small for certain web pages when compared to Chrome (see attached pictures of nytimes.com).
    Chrome's font size are set to "Medium".

  • Character encoding for ReponseWriter

    hi;
    how can i control the character encoding of the ResponseWriter?
    what encoding does it use by default?
    thanks.

    Since I had junior developers becoming desperate over this problem, I'll post our solution for anybody that's not working on Websphere and wants to solve this problem.
    We seem to have solved it using a servlet filter that inserts this response wrapper:
    import java.util.Locale;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpServletResponseWrapper;

    class CharacterEncodingHttpResponseWrapper extends HttpServletResponseWrapper {
      private String contentTypeWithCharacterEncoding;
      private String encoding;

      CharacterEncodingHttpResponseWrapper(HttpServletResponse resp, String encoding) {
        super(resp);
        this.encoding = encoding;
      }

      public void setContentType(String contentType) {
        // one place to define the encoding instead of in every JSP page
        contentTypeWithCharacterEncoding = addOrReplaceCharset(contentType, encoding);
        super.setContentType(contentTypeWithCharacterEncoding);
      }

      public void setLocale(Locale locale) {
        // setting the locale also resets the charset to ISO
        if (contentTypeWithCharacterEncoding == null) {
          CharacterEncodingFilter.LOGGER.warn("Encoding is being set to ISO via the locale.");
        } else {
          super.setLocale(locale);
          // and set the encoding back to the desired one
          setContentType(contentTypeWithCharacterEncoding);
        }
      }

      /**
       * Utility method that sets the charset in the HTTP header
       * <code>Content-Type:application/x-www-form-urlencoded;charset=ISO-8859-1</code>
       * or in the content type of the servlet HTTP response
       * <code>text/html;charset=ISO-8859-1</code>.
       */
      private String addOrReplaceCharset(String headervalue, String charset) {
        if (null != headervalue) {
          // see if this header already has a charset
          String charsetStr = "charset=";
          int len = charsetStr.length(), i = 0;
          // if we have a charset in this Content-Type header
          if (-1 != (i = headervalue.indexOf(charsetStr))) {
            // if it has a non-zero length, replace it
            if (i + len < headervalue.length()) {
              headervalue = headervalue.substring(0, i + len) + charset;
            } else {
              headervalue = headervalue + charset;
            }
          } else {
            headervalue = headervalue + ";charset=" + charset;
          }
          return headervalue;
        } else {
          CharacterEncodingFilter.LOGGER.warn("content-type header not set");
          return "application/x-www-form-urlencoded;charset=" + charset;
        }
      }
    }
    If all your JSF/JSP pages have consistently set the encoding in the content type, your addOrReplaceCharset method should only add, never replace.
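    For completeness, a minimal sketch of the enclosing filter that installs the wrapper (the names and the UTF-8 choice are assumptions matching the snippet above, not the poster's actual filter):

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletResponse;
    import org.apache.log4j.Logger;

    public class CharacterEncodingFilter implements Filter {
        static final Logger LOGGER = Logger.getLogger(CharacterEncodingFilter.class);

        public void init(FilterConfig config) throws ServletException {}
        public void destroy() {}

        // Wrap every response so setContentType/setLocale keep the desired charset.
        public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
                throws IOException, ServletException {
            chain.doFilter(req, new CharacterEncodingHttpResponseWrapper(
                    (HttpServletResponse) resp, "UTF-8"));
        }
    }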

  • Character encoding issue

    I'm using the code below to send mail in the Turkish language.
    MimeMessage msg = new MimeMessage(session);
    msg.setText(message, "utf-8", "html");
    msg.setFrom(new InternetAddress(from));
    Transport.send(msg);
    But my customer says that he sometimes gets unreadable characters in the mail. I'm not able to understand how to solve this character encoding issue.
    Should I ask him to change his mail client's character encoding settings?
    If yes, which one should he set?

    Send the same characters using a different mailer (e.g., Thunderbird or Outlook).
    If they're received correctly, compare the message from that mailer with the message
    from JavaMail. Most likely other mailers are using a Turkish-specific charset instead
    of UTF-8.
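    If the mail is built with JavaMail anyway, one thing worth checking is that the charset is set on every user-visible part, not just the body; the subject takes its own charset argument. A sketch (addresses and session config are placeholders):

    import java.util.Properties;
    import javax.mail.Session;
    import javax.mail.Transport;
    import javax.mail.internet.InternetAddress;
    import javax.mail.internet.MimeMessage;

    public class TurkishMailSketch {
        public static void main(String[] args) throws Exception {
            Session session = Session.getInstance(new Properties()); // real SMTP config omitted
            MimeMessage msg = new MimeMessage(session);
            msg.setFrom(new InternetAddress("from@example.com"));
            msg.setSubject("Günaydın", "UTF-8");              // subject gets RFC 2047 encoded as UTF-8
            msg.setText("<p>Günaydın</p>", "UTF-8", "html");  // Content-Type: text/html; charset=UTF-8
            Transport.send(msg);
        }
    }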

  • Why, after all these years, can't Thunderbird auto-detect character encoding

    judging by all the existing messages and complaints about this, not to mention erroneous posts that say the problem is solved when it isn't, I have to conclude Mozilla either doesn't believe this is a problem or doesn't care to fix it. The bottom line is that there is no way to tell Thunderbird to automatically display emails in the character coding format they were written in. I could understand cases where the headers are not properly filled in, but I see tons of emails in which the encoding is plainly there in the headers within the message source. You can force it, but if you do so via the menu VIEW->Character Encoding->UTF8 (for example) it won't "stick" if you view another message. But who would want it to "stick" permanently anyway? What the average user really wants is to be able to toggle VIEW->Character Encoding->Auto Detect from its default "off" to simply "on", and not have to bother with it anymore.
    This is a problem that seems to have gone on forever, and it NEVER happens with other email clients. If there is some backdoor way to actually make autodetect work, I'd appreciate knowing about it. But more important, I think ALL users would appreciate it if it were not some secret "backdoor" setting, but a simple global menu choice for all accounts. Can Mozilla please fix this problem once and for all?

    You said...
    ''Thunderbird is supposed to be using the encoding in the mail.''
    I figured it "should"; i'm just reporting that it doesn't
    You said...
    ''Setting auto detect to on disables that.''
    Please explain. I've looked at every setting I can find and there is no way to set auto detect to "ON". I DID try setting it to "universal" in an attempt to solve the problem, but I have since restored it to "off", because the universal setting doesn't help.
    you said...
    ''"Based on your earlier response I assume you need to press the F10 key to see the tools menu you were refered to." ''
    No... I never said that anywhere. I DID refer to Menu->View_>Character Encoding, and I did refer to right clicking on individual folders, to get to the properties dialog, and the general information tab. But F-10 doesn't do anything
    You said...
    ''I have examines dozens of mails in my inbox and each honours the character encoding set in the HTML''
    Well, mine NEVER did. A short example from an email I got today is pretty much representative of all mail I get from GMAIL...
    --089e013a0572a067a404fc73ceda
    Content-Type: text/plain; charset=UTF-8
    Ok, very good. Thank you. Phoenix sent you a friend request on Facebook by
    the way. Talk to you soon.
    --089e013a0572a067a404fc73ceda
    Content-Type: text/html; charset=UTF-8
    Content-Transfer-Encoding: quoted-printable
    <p dir=3D"ltr">Ok, very good. Thank you. Phoenix sent you a friend request=
    =C2=A0 on Facebook by the way.=C2=A0 Talk to you soon.</p>
    --089e013a0572a067a404fc73ceda--
    See those instances of "=C2=A0"? Each one displays as a strange character, a capital A with a curved line over it (Â). If I manually set my default encoding to UTF-8, the weird characters go away. If I leave it as Western, there is nothing I can do to tell Thunderbird to "auto detect".
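    For reference, "=C2=A0" is the quoted-printable form of the UTF-8 bytes for U+00A0, a no-break space; decoded as Western (windows-1252) instead, the 0xC2 byte renders as Â. A quick check (a sketch, not from the original post):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class NbspDemo {
        public static void main(String[] args) {
            byte[] qp = {(byte) 0xC2, (byte) 0xA0};  // the bytes behind "=C2=A0"
            System.out.println(new String(qp, StandardCharsets.UTF_8));          // one no-break space
            System.out.println(new String(qp, Charset.forName("windows-1252"))); // "Â" plus a space look-alike
        }
    }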
    Anyway, I suppose at this point that no one responsible for the product coding is seriously looking at my issue, which is why it's never been solved. If anyone does intend to help track it down and solve it, I'll be happy to provide all the examples and screen shots they ask for. Otherwise.

  • Man pages character encoding

    hi there,
    I just installed arch the weekend before.
    When I'm browsing the man pages I get problems with the character encoding. Most of them are displayed correctly, but sometimes something like ?<80><98> appears instead of a character. As I do not know how to take a screenshot of a manpage without having a GUI installed, I took a picture( XD ).
    [img=http://img411.imageshack.us/img411/184/dsc00185hj0.th.jpg]
    What configuration files do you need to help me? or do you know what my problem is, yet?
    thanks in advance
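    For reference, ?<80><98> is man printing the raw bytes of a UTF-8 left single quotation mark (U+2018, the three-byte sequence E2 80 98) under a locale that is not UTF-8. A quick check (a sketch):

    import java.nio.charset.StandardCharsets;

    public class ManGarbleDemo {
        public static void main(String[] args) {
            byte[] seq = {(byte) 0xE2, (byte) 0x80, (byte) 0x98}; // what man showed as ?<80><98>
            System.out.println(new String(seq, StandardCharsets.UTF_8)); // prints ' (U+2018)
        }
    }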

    I got a file named /etc/profile.pacnew
    which file do I have to merge/replace with it?
    Furthermore I still have the problem with my manpages, which is solved by using unset MANPATH, but it will reappear after a reboot. These problems are probably connected.
    I just don't know which files I have to merge, remove or keep and so on.
    thx
    edit:
    ok, I replaced profile with profile.pacnew, as I can't remember ever changing anything in profile. This solves the problem.
    Does pacman tell me when it creates such a .pacnew file?
    Or how do I know that these files exist - should I search for all files with a .pacnew ending once in a while and merge them with the old files?
    Last edited by okar (2008-03-18 18:08:29)

  • JSF myfaces character encoding issues

    The basic problem i have is that i cannot get the copyright symbol or the chevron symbols to display in my pages.
    I am using:
    myfaces 2.0.0
    facelets 1.1.14
    richfaces 3.3.3.final
    tomcat 6
    jdk1.6
    I have tried a ton of things to resolve this including:
    1.) creating a filter to set the character encoding to utf-8.
    2.) overriding the view handler to force calculateCharacterEncoding to always return utf-8
    3.) adding <meta http-equiv="content-type" content="text/html;charset=UTF-8" charset="UTF-8" /> to my page.
    4.) setting different combinations of 'URIEncoding="UTF-8"' and 'useBodyEncodingForURI="true"' in tomcat's server.xml
    5.) etc... like trying to set the encoding on an f:view, using f:verbatim, and specifying the escape attribute on some output components.
    all with no success.
    There is a lot of great information on BalusC's site regarding this problem (http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html) but I have not been able to resolve it yet.
    i have 2 test pages i am using.
    if i put these symbols in a jsp (which does NOT go through the faces servlet) it renders fine and the page info shows that it is in utf-8.
    <html>
    <head>
         <!-- <meta http-equiv="content-type" content="text/html;charset=UTF-8" /> -->
    </head>
    <body>     
              <br/>copy tag: &copy;
              <br/>js/jsp unicode: &#169;
              <br/>xml unicode: &#xA9;
              <br/>u2460: \u2460
              <br/>u0080: \u0080
              <br/>arrow: &#187;
              <p />
    </body>
    </html>
    if i put these symbols in an xhtml page (which does go through the faces servlet) i get the black diamond symbols with a ? even though the page info says that it is in utf-8.
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"
         xmlns:ui="http://java.sun.com/jsf/facelets"
         xmlns:f="http://java.sun.com/jsf/core"
         xmlns:h="http://java.sun.com/jsf/html"
         xmlns:rich="http://richfaces.org/rich"
         xmlns:c="http://java.sun.com/jstl/core"
           xmlns:a4j="http://richfaces.org/a4j">
    <head>
         <meta http-equiv="content-type" content="text/html;charset=UTF-8" charset="UTF-8" />
         <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
    </head>
    <body>     
         <f:view encoding="utf-8">
              <br/>amp/copy tag: &copy;
              <br/>copy tag: &copy;
              <br/>copy tag w/ pound: #&copy;
              <br/>houtupt: <h:outputText value="&copy;" escape="true"/>
              <br/>houtupt: <h:outputText value="&copy;" escape="false"/>
              <br/>js/jsp unicode: &#169;
              <br/>houtupt: <h:outputText value="&#169;" escape="true"/>
              <br/>houtupt: <h:outputText value="&#169;" escape="false"/>
              <br/>xml unicode: &#xA9;
              <br/>houtupt: <h:outputText value="&#xA9;" escape="true"/>
              <br/>houtupt: <h:outputText value="&#xA9;" escape="false"/>
              <br/>u2460: \u2460
              <br/>u0080: \u0080
              <br/>arrow: &#187;
              <br/>cdata: <![CDATA[©]]>
              <p />
         </f:view>               
    </body>
    </html>
    on a side note, i have another application that is using myfaces 1.1, facelets 1.1.11, and richfaces 3.1.6 and the unicode symbols work fine.
    i had another developer try to use my test xhtml page in his mojarra implementation and it works fine there using facelets 1.1.14 but NOT myfaces or richfaces.
    i am convinced that somewhere between the view handler and the faces servlet the encoding is being set or reset but i haven't been able to resolve it.
    if anyone at all can point me in the right direction i would be eternally grateful.
    thanks in advance.

    UPDATE:
    I was unable to get the page itself to consume the various options for unicode characters like the copyright symbol.
    Ultimately the content I am trying to display is coming from a web service.
    I resolved this issue by calling the web service from my backing bean instead of using ui:include on the webservice call directly in the page.
    for example:
    public String getFooter() throws Exception {
              HttpClient httpclient = new HttpClient();
              GetMethod get = new GetMethod(url);
              httpclient.executeMethod(get);
              String response = get.getResponseBodyAsString();
              get.releaseConnection();
              return response;
    }
    I'd still love to have a solution for the page usage of the unicode characters, but for the time being this solves my problem.
