Character encoding conversion for marshall/unmarshall?

Hello, Java Web Services gurus,
I am wondering if there is an easy, pluggable way to do character encoding conversion transparently as part of marshalling/unmarshalling.
Basically, my input/output will always be UTF-8 XML. As the backend database is ISO encoded, I hope the result of unmarshalling will give me ISO strings, and that when marshalling, the ISO strings are transparently turned into a UTF-8 XML response. Right now I'm using JAXB annotations to parse XML into objects.
I understand there may be characters in the input that cannot be converted; in that case I'd expect an error/exception that flags the failure.
Hope I sound clear. This has been a headache for a while. Really hope someone may help out a bit. Thanks a million in advance
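
A note on the expectation above: Java strings are Unicode internally whatever encoding the XML used, so there is no separate "ISO string" type to unmarshal into. What can be checked is whether each value is representable in the database's charset. Below is a minimal sketch of such a check, assuming "ISO" means ISO-8859-1; the class and method names are made up for illustration. A check like this could be called after unmarshalling or from a JAXB XmlAdapter, but that wiring is not shown here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JAXB.
class Latin1Check {
    /** Returns the ISO-8859-1 bytes of value, or throws if a character has no mapping. */
    static byte[] requireLatin1(String value) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(value));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // e-acute exists in ISO-8859-1, so this succeeds
        System.out.println(requireLatin1("caf\u00e9").length);
        try {
            // the euro sign does not exist in ISO-8859-1
            requireLatin1("\u20ac");
        } catch (CharacterCodingException e) {
            System.out.println("not convertible: " + e);
        }
    }
}
```

With CodingErrorAction.REPORT the encoder throws instead of silently substituting "?", which is the failure flag asked for above.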

Duplicate post: http://forums.sun.com/thread.jspa?messageID=10971554&tstart=0#10971554

Similar Messages

UTF8 character set conversion for chinese Language

Hi friends,
I would like a basic explanation of the UTF8 feature: how does it help when converting data from the Chinese language?
I would also like to know which characters UTF8 does not support when converting from Chinese.
Thanks & Regards
Ramya Nomula

Not exactly sure what you are looking for, but on MetaLink, there are numerous detailed papers on NLS character sets, conversions, etc.
Bottom line: most Chinese characters need 3 bytes in UTF-8, but the rarer ones beyond U+FFFF (supplementary characters) need 4 bytes, and that is where Oracle's UTF8 and AL32UTF8 character sets differ. Some other scripts also have characters in that range.
Do a google search on "utf8 al32utf8 difference", and you will get some good explanations.
e.g., http://decipherinfosys.wordpress.com/2007/01/28/difference-between-utf8-and-al32utf8-character-sets-in-oracle/
Recently, one of our clients had a question on the differences between these two character sets since they were in the process of making their application global. In an upcoming whitepaper, we will discuss in detail what it takes (from a RDBMS perspective) to address localization and globalization issues. As far as these two character sets go in Oracle, the only difference between AL32UTF8 and UTF8 character sets is that AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines UTF-8). Oracle’s “UTF8” stores these characters as a sequence of two UTF-16 surrogate characters encoded using UTF-8 (or six bytes per character). Besides this storage difference, another difference is better support for supplementary characters in AL32UTF8 character set.
You may also consider posting your question on the Globalization Support forum, which pertains more to these types of questions.
Globalization Support
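
The byte counts quoted above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up, and the six-byte figure is computed by hand (three bytes per UTF-16 surrogate, as the quoted article describes for Oracle's UTF8) rather than taken from an Oracle API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
class SupplementaryBytes {
    /** Bytes needed when every UTF-16 code unit is encoded on its own (surrogates take 3 bytes each). */
    static int perCodeUnitLength(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                total += 1;
            } else if (c < 0x800) {
                total += 2;
            } else {
                total += 3; // includes the surrogate range D800..DFFF
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String common = "\u4e2d";                                // U+4E2D, inside the BMP
        String rare = new String(Character.toChars(0x20000));    // U+20000, beyond U+FFFF
        System.out.println(common.getBytes(StandardCharsets.UTF_8).length); // 3
        System.out.println(rare.getBytes(StandardCharsets.UTF_8).length);   // 4
        System.out.println(perCodeUnitLength(rare));                        // 6
    }
}
```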

[SOLVED] Character encoding conversion

Hi, my first post here...
I'm writing a shell script (using tcl/expect) to show mails on pop server.
How do I convert e-mail headers containing text like this:
Subject: =?iso-8859-1?B?ZGl2ZXJ0aW1lbnRvIHNlbnphIGZpbmUgMjQgb3JlIHN1IDI0?=
to make it readable?
I already tried with 'iconv' and 'recode' with no luck...
Thanks.
Last edited by mr.entropy (2010-11-02 16:34:39)

These subjects have (probably unnecessarily) been encoded according to RFC 2047.
The "B" in between the 2nd and 3rd ? indicates that it is encoded using Base64 (if there's a Q it's "Quoted-Printable"; see RFC 2045).
So you're looking for a Base64 decoder.
EDIT: and there happens to be one in coreutils.
man base64
Last edited by skanky (2010-11-02 14:39:06)
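
For anyone who would rather do this in code than with the base64 tool, here is a rough Java sketch of the same idea. It only handles the "B" form described above, ignores the "Q" form and the finer rules of RFC 2047, and the class name is made up.

```java
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes =?charset?B?text?= words only.
class EncodedWord {
    private static final Pattern B_WORD =
            Pattern.compile("=\\?([^?]+)\\?[Bb]\\?([^?]*)\\?=");

    static String decode(String header) {
        Matcher m = B_WORD.matcher(header);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // group 1 is the charset name, group 2 the Base64 payload
            byte[] raw = Base64.getDecoder().decode(m.group(2));
            String text = new String(raw, Charset.forName(m.group(1)));
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String subject =
            "Subject: =?iso-8859-1?B?ZGl2ZXJ0aW1lbnRvIHNlbnphIGZpbmUgMjQgb3JlIHN1IDI0?=";
        System.out.println(decode(subject));
        // prints: Subject: divertimento senza fine 24 ore su 24
    }
}
```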

Why, after all these years, can't Thunderbird auto-detect character encoding

judging by all the existing messages and complaints about this, not to mention erroneous posts that say the problem is solved when it isn't, I have to conclude Mozilla either doesn't believe this is a problem or doesn't care to fix it. The bottom line is that there is no way to tell Thunderbird to automatically display emails in the character coding format they were written in. I could understand cases where the headers are not properly filled in, but I see tons of emails in which the encoding is plainly there in the headers within the message source. You can force it, but if you do so via the menu VIEW->Character Encoding->UTF8 (for example) it won't "stick" if you view another message. But who would want it to "stick" permanently anyway? What the average user really wants is to be able to toggle VIEW->Character Encoding->Auto Detect from its default "off" to simply "on", and not have to bother with it anymore.
This is a problem that seems to have gone on forever, and it NEVER happens with other email clients. If there is some backdoor way to actually make autodetect work, I'd appreciate knowing about it. But more important, I think ALL users would appreciate it if it were not some secret "backdoor" setting, but a simple global menu choice for all accounts. Can Mozilla please fix this problem once and for all?

You said...
''Thunderbird is supposed to be using the encoding in the mail.''
I figured it "should"; I'm just reporting that it doesn't.
You said...
''Setting auto detect to on disables that.''
Please explain. I've looked at every setting I can find and there is no way to set auto detect to "ON". I DID try setting it to "universal" in an attempt to solve the problem, but I have since restored it to "off", because the universal setting doesn't help.
You said...
''"Based on your earlier response I assume you need to press the F10 key to see the tools menu you were refered to." ''
No... I never said that anywhere. I DID refer to Menu->View->Character Encoding, and I did refer to right-clicking on individual folders to get to the properties dialog and the general information tab. But F10 doesn't do anything.
You said...
''I have examines dozens of mails in my inbox and each honours the character encoding set in the HTML''
Well, mine NEVER did. A short example from an email I got today is pretty much representative of all the mail I get from GMAIL...
--089e013a0572a067a404fc73ceda
Content-Type: text/plain; charset=UTF-8
Ok, very good. Thank you. Phoenix sent you a friend request on Facebook by
the way. Talk to you soon.
--089e013a0572a067a404fc73ceda
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
<p dir=3D"ltr">Ok, very good. Thank you. Phoenix sent you a friend request=
=C2=A0 on Facebook by the way.=C2=A0 Talk to you soon.</p>
--089e013a0572a067a404fc73ceda--
See those instances of "=C2=A0"? Each one displays as a strange character, a capital A with a curved line over it. If I manually set my default encoding to UTF-8, the weird characters go away. If I leave it as Western, there is nothing I can do to tell Thunderbird to "auto detect".
Anyway, I suppose at this point that no one responsible for the product coding is seriously looking at my issue, which is why it's never been solved. If anyone does intend to help track it down and solve it, I'll be happy to provide all the examples and screen shots they ask for. Otherwise.
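
The stray character described above is what you get when the two bytes C2 A0 (a UTF-8 no-break space) are decoded with a single-byte Western charset. Here is a small Java sketch of the effect, using ISO-8859-1 as a stand-in for Thunderbird's "Western" setting (that mapping is an assumption); the class name is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
class NbspMojibake {
    /** Lists the code points of s, for example "U+00C2 U+00A0". */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // "=C2=A0" after quoted-printable decoding
        byte[] raw = {(byte) 0xC2, (byte) 0xA0};
        System.out.println(codePoints(decode(raw, StandardCharsets.UTF_8)));      // U+00A0
        System.out.println(codePoints(decode(raw, StandardCharsets.ISO_8859_1))); // U+00C2 U+00A0
    }
}
```

U+00C2 is the capital A with circumflex; decoded as UTF-8 the same bytes are a single no-break space.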

Character encoding and ByteOutputStream

Hi!
I'm currently working on a web application that needs to print non-English characters (e.g. Swedish å ä ö). Currently this doesn't work, although I have set the character encoding for the HttpServletResponse.
I figure that it is this code that doesn't manage the non-English characters (it's not my own code but I need to fix it and, sorry to say, I'm not that experienced with streams):
ByteArrayOutputStream baos = new ByteArrayOutputStream(16384);
baos.write("åä".getBytes());
resp.setDateHeader("Expires", 0);
resp.setContentLength(baos.size());
ServletOutputStream out = resp.getOutputStream();
out.write(baos.toByteArray());
out.flush();
out.close();
Any hints on what to do or where to look? Should I wrap the ServletOutputStream in a Writer?
Cheers,
David

But there's no character encoding set for this operation: baos.write("åä".getBytes());
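
To illustrate the point: getBytes() with no argument uses the platform default charset, so the bytes can differ from what the response header promises. Here is a minimal sketch that names the charset explicitly. It assumes the response is meant to be UTF-8, leaves the servlet classes out so it runs on a plain JDK, and uses a made-up class name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
class ExplicitBytes {
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] bytes = text.getBytes(charset); // charset is stated, not inherited from the platform
        baos.write(bytes, 0, bytes.length);
        return baos.toByteArray();
    }

    public static void main(String[] args) {
        String swedish = "\u00e5\u00e4\u00f6"; // a-ring, a-umlaut, o-umlaut
        System.out.println(encode(swedish, StandardCharsets.UTF_8).length);      // 6
        System.out.println(encode(swedish, StandardCharsets.ISO_8859_1).length); // 3
    }
}
```

In the servlet, the charset passed to getBytes would have to be the same one declared on the response; otherwise the browser decodes the bytes with the wrong table.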

Character Encoding and File Encoding issue

Hi,
I have a file whose data is encoded using the default locale.
I start the JVM in the same default locale and try to read the file.
I took 2 approaches:
1. Read the file using InputStreamReader() without specifying the encoding, so that the default one based on the locale will be picked up.
-- This approach worked fine.
-- I also printed the system property "file.encoding", which matched the current locale's encoding (on Unix the command to get this is "locale charmap").
2. In this approach, I read the file using InputStream as an array of raw bytes, and passed it to the String constructor to convert bytes to String.
-- The String contained garbled data, meaning decoding failed.
I tried printing the encoding used by the JVM using an internal class, and the "file.encoding" property as well.
These 2 values do not match; there is a weird difference.
For e.g. for locale ja_JP.eucjp on linux box :
byte-character uses EUC_JP_LINUX encoding
file.encoding system property is EUC-JP-LINUX
To get byte to character encoding, I used following methods (sun.io.*):
ByteToCharConverter btc = ByteToCharConverter.getDefault();
System.out.println("BTC uses " + btc.getCharacterEncoding());
Do you have any idea why it is failing?
My understanding was that file encoding and character encoding should always be the same by default.
But, because of this behaviour, I am a little perplexed.
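
One way to take the default out of the picture is to name the charset on both read paths, so that both approaches go through the same decoder. A minimal sketch follows; it uses UTF-8 and a temporary file purely as stand-ins (the file from the post would need its own charset), and the class and method names are made up. Charset.defaultCharset() is the public way to see the JVM default, as opposed to the internal sun.io class used above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class.
class ReadWithCharset {
    /** Approach 1: decode while reading, through an InputStreamReader. */
    static String viaReader(Path file, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), cs))) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    /** Approach 2: read the raw bytes first, then decode them with the same charset. */
    static String viaBytes(Path file, Charset cs) throws IOException {
        return new String(Files.readAllBytes(file), cs);
    }

    public static void main(String[] args) throws IOException {
        Charset cs = StandardCharsets.UTF_8; // stand-in charset
        Path tmp = Files.createTempFile("enc", ".txt");
        try {
            Files.write(tmp, "\u65e5\u672c\u8a9e".getBytes(cs));
            System.out.println(viaReader(tmp, cs).equals(viaBytes(tmp, cs)));
            System.out.println("JVM default: " + Charset.defaultCharset());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```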


Character Encoding for JSPs and HTML forms

After having read loads of postings on character encoding problems I'm still puzzled about the following problem:
I have an instance (A) of WL 8.1 SP3 on a WinXP machine and another instance (B) of WL 8.1 without any SP on a Win2K machine. The underlying Windows locale is english(US) in both cases.
The same application deployed as a war file to these instances does not behave in the same way when it comes to displaying non-Latin1-characters like the Euro symbol: Whereas (A) shows and accepts these characters as request-parameters, (B) does not.
Since the war file is the same (weblogic.xml, jsps and everything), the reason for this must either be the service-pack-level or some other configuration setting I overlooked.
Any hints are appreciated!

Try this:
Preferences -> Content -> Fonts & Color -> Advanced
At the bottom, choose your Encoding.

Character Encoding for IDOC to JMS scenario with foreign characters

Dear Experts,
The scenario is described as follows:
Issue Description:
There is an IDOC which is created after extracting data from different countries (but only one country at a time). So, for instance first time the data is picked in Greek and Latin and corresponding IDOC is created and sent to PI, the next time plain English and sent to PI and next Chinese and so on. As of now every time this IDOC reaches PI ,it comes with UTF-8 character encoding as seen in the IDOC XML.
I am converting this IDOC XML into a single-string flat file (currently taking the default encoding UTF-8) and sending it to the receiver JMS Queue (MQ Series). Now when this data is picked up by the end recipient from the corresponding queue in MQ Series, they see ? wherever there are Greek/Latin characters (maybe because that should have a different encoding like ISO-8859-7). This is causing issues at their end.
My Understanding
The SAP system should trigger the IDOC with the right code page, i.e. if the IDOC is sent with Greek/Latin characters the code page should be ISO-8859-7, if the same IDOC is sent with Chinese characters the corresponding code page, else UTF-8 or the default code page.
Once this is sent correctly from SAP, the Java Mapping would have to use the correct code page when writing the bytes to the output stream, and then we would also need to set the right code page as a JMS Header before putting the message in the JMS queue so that the receiver can interpret it.
Queries:
1. Is my approach for the scenario correct, if not please guide me to the right approach.
2. Does SAP support different code page being picked for the same IDOC based on different data set. If so how is it achieved.
3. What is the JMS Header property to set the right code page? (I think there should be some JMS Header defined by MQ Series for character encoding which I should be setting correctly.) I find that there is a property to set the CCSID in the JMS Receiver Adapter, but that only refers to non-ASCII names and doesn't refer to the payload content.
I would appreciate if anybody can give me pointers on how to resolve this issue.
Thanks,
Pratik
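
As background to the "?" symptom: a question mark in place of a character usually means that, somewhere along the way, the text was encoded into a charset that has no mapping for it. Whether that is what happens in this PI/MQ chain is only a guess. The effect itself is easy to reproduce in plain Java; the class name below is made up, and ISO-8859-1 is used only because every JVM ships it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
class QuestionMarks {
    /** Encodes text with cs and decodes it again with the same charset. */
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // alpha, beta, gamma
        // UTF-8 can represent Greek, so the text survives
        System.out.println(roundTrip(greek, StandardCharsets.UTF_8).equals(greek)); // true
        // ISO-8859-1 cannot, and getBytes silently substitutes '?'
        System.out.println(roundTrip(greek, StandardCharsets.ISO_8859_1));          // ???
    }
}
```

Once the substitution has happened the original characters are gone, so the fix has to be at the hop that encodes, not at the receiver.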

Hi Pratik,
http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/502991a2-45d9-2910-d99f-8aba5d79fb42?quicklink=index&overridelayout=true
This link might help.
regards
Anupam

Can character encoding be predefined for certain pages?

Certain pages that I visit frequently require me to manually set my character encoding to Western (ISO Latin 1), both when my default character encoding is set as UTF-8 and Western (ISO Latin 1).
As the pages that show up malformed are embedded in other frames I suspect that the top frame forces a different encoding than is on the embedded page.
An example page is here (847.is). The topic list of this message board is in order, but when any one of the topics is viewed all accented and special characters are missing, until Western (ISO Latin 1) is manually set as the character encoding. Similarly, opening any of the topics in a tab will result in missing characters.
Is there some way for me to circumvent having to go through all those menus to set it? Can I somehow define that these pages should be viewed in Western (ISO Latin 1) or can I set a keyboard shortcut for Western (ISO Latin 1)?
MacBook 2006 Mac OS X (10.4.7) Safari version 2.0.4
MacBook 2006 Mac OS X (10.4.7)

For instance if I opened an entry on Vísindavefur HÍ out of the parent frame the accented letters would show up somehow mangled, but this is no longer a problem.
That page has no charset in the source and thus should only display correctly if you have Latin-1 set as the browser default. With UTF-8 set as the default you should see (in Safari) a ton of black diamonds with question marks inside.
Firefox never displays ð (eth), þ (thorn) and ý (accented y) correctly for me, did not do it on the old machine and does not do it on this one either,
Firefox displays them perfectly for me in both 10.3 and 10.4.
Actually, if I set Opera encoding to UTF-8 it displays the topics on 847.is as Safari does.
This indicates it is a system issue rather than Safari. Sorry I can't duplicate it and have no good idea what could cause it on a normal system. Have you (or the place where you buy your machines) by chance installed any special software add-ons to enable the use of non-Unicode Icelandic (for apps like Appleworks, WordX, etc)?

iTunes support for foreign character encoding

A friend burned two mix CDs for me to listen to on my move back to the US from Korea. The songs are Korean and have Korean title/album information. I thought I would import the songs into my iBook. When I add them to my library, however, a majority of them have unintelligible song information. Only about 25-30% of the songs import successfully in Korean font.
Finder reads the cd no problems. Looking through the disc shows all information clearly. Drop them into iTunes, however (or selecting "add to library"), and they get scrambled.
I'm guessing this is a character encoding issue. I don't know where my friend got the tracks from, so I'll have to assume he got them from sources which copied them using different encoding methods. But if Finder can support them all, why can't iTunes? Is there a way I can adjust character support in iTunes? Or should I be looking at something else?
1.2ghz iBook G4 1.25gb RAM Mac OS X (10.4.5) iTunes 6.0.3

Try setting the Encoding of your Outputstream to UTF-8.

Character encoding for ResponseWriter

Hi,
How can I control the character encoding of the ResponseWriter?
What encoding does it use by default?
thanks.

Since I had junior developers becoming desperate over this problem, I'll post our solution for anybody who's not working on WebSphere and wants to solve it.
We seem to have solved it using a servlet filter that inserts this response wrapper:
import java.util.Locale;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;

// CharacterEncodingFilter (and its LOGGER) is the servlet filter itself; it is not shown in this post.
class CharacterEncodingHttpResponseWrapper extends HttpServletResponseWrapper {
  private String contentTypeWithCharacterEncoding;
  private final String encoding;

  CharacterEncodingHttpResponseWrapper(HttpServletResponse resp, String encoding) {
    super(resp);
    this.encoding = encoding;
  }

  @Override
  public void setContentType(String contentType) {
    // one place to define the encoding instead of in every JSP page
    contentTypeWithCharacterEncoding = addOrReplaceCharset(contentType, encoding);
    super.setContentType(contentTypeWithCharacterEncoding);
  }

  @Override
  public void setLocale(Locale locale) {
    // setting the locale also sets the charset to ISO
    if (contentTypeWithCharacterEncoding == null) {
      CharacterEncodingFilter.LOGGER.warn("Encoding is being set to ISO via the locale.");
    } else {
      super.setLocale(locale);
      // and put the encoding back to the desired one
      setContentType(contentTypeWithCharacterEncoding);
    }
  }

  /**
   * Utility method that sets the charset in the HTTP header
   * <code>Content-Type:application/x-www-form-urlencoded;charset=ISO-8859-1</code>
   * or in the content type on the HttpServletResponse
   * <code>text/html;charset=ISO-8859-1</code>.
   */
  private String addOrReplaceCharset(String headervalue, String charset) {
    if (null != headervalue) {
      String charsetStr = "charset=";
      int len = charsetStr.length(), i = 0;
      // if we have a charset in this Content-Type header
      if (-1 != (i = headervalue.indexOf(charsetStr))) {
        if (i + len < headervalue.length()) {
          // it has a non-zero length: replace the existing value
          headervalue = headervalue.substring(0, i + len) + charset;
        } else {
          // "charset=" is present but empty: append the value
          headervalue = headervalue + charset;
        }
      } else {
        headervalue = headervalue + ";charset=" + charset;
      }
      return headervalue;
    } else {
      CharacterEncodingFilter.LOGGER.warn("content-type header not set");
      return "application/x-www-form-urlencoded;charset=" + charset;
    }
  }
}

If all your JSF/JSP pages have consistently set the encoding in the content type, your addOrReplace method should only add, not replace.

CSSCAN for database character set conversion failing with ORA-01578

Hi ,
CSSCAN for database character set conversion is failing with ORA-01578: ORACLE data block corrupted (file # 84, block # 23930). Please help me out in this regard.
Thanks,
Sravan.

Hi Anand,
Thanks for your update. The segment is a table, not an index, in my case. And I got this error while running CSSCAN on the Apps database for character set conversion to UTF8 from WE8ISO8859P1. Please find the snapshot below for your reference.
SQL> select segment_name, segment_type, owner from dba_extents where file_id = 84 and 23930 between block_id and block_id + blocks - 1;
SEGMENT_NAME    SEGMENT_TYPE    OWNER
EDW_LOOKUP_M    TABLE           POA
SQL> ANALYZE TABLE POA.EDW_LOOKUP_M VALIDATE STRUCTURE CASCADE;
ANALYZE TABLE POA.EDW_LOOKUP_M VALIDATE STRUCTURE CASCADE
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 84, block # 23930)
ORA-01110: data file 84: '/d911/oracle/dbcondata/poad01.dbf'
Thanks,
Sravan.

As a webservice client, how to set character encoding for JAX-WS?

I couldn't find the right API to set the character encoding for a webservice client. What I did is:
1. Run wsimport, which gives me MyService, MyPortType...
2. Create new MyService
3. Get MyPort from MyService
4. Call myPort.myOperation with objects
Where is the right place to set character encoding and how to set it? Thanks.
Regards
-Jiaqi Guo

The .js file and the html need to have the same encoding. If
your html uses iso-8859-7, then the .js must also use that. But if
the original text editor created the .js file using utf-8, then
that is what the html needs to use.

Byte[] character encoding for strings

Hi All,
I tried to convert a string into byte[] using the following code:
byte[] out= [B@30c221;
String encodedString = out.toString();
It gives the output [B@30c221 when I print encodedString.
But when I convert that encodedString into a byte[] using the following code
byte[] output = encodedString.getBytes();
it gives different output.
Is there any character encoding needed to give the exact output for this?

Sorry, but the question makes no sense, and neither does your code.
byte[] out= [B@30c221;
String encodedString = out.toString();
The first line is syntactically incorrect, and the second should print something like "[B@30c221", which isn't particularly useful. The correct way to convert a String to a byte[] is with the getBytes() method. Be aware that the byte[] will be in the system default encoding, which means you could get different results on different platforms. To remove this platform dependency, you should specify the encoding, like so:
byte[] output = encodedString.getBytes("UTF-8");
Why are you doing this, anyway? There are very few good reasons to convert a String to a byte[] within your code; that's usually done by the I/O classes when your program communicates with the outside world, as when you write the string to a file.
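
To make the difference concrete, here is a small sketch of the round trip the reply describes, with the charset named in both directions. The class and helper names are made up, and UTF-8 is just the example charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class.
class BytesAndStrings {
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String toText(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] out = toBytes("h\u00e9llo");
        // toString() on an array gives a type tag and hash, not the contents
        System.out.println(out.toString().startsWith("[B@")); // true
        // Arrays.toString shows the actual byte values
        System.out.println(Arrays.toString(out)); // [104, -61, -87, 108, 108, 111]
        // decoding with the same charset restores the original text
        System.out.println(toText(out).equals("h\u00e9llo")); // true
    }
}
```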

Why does Firefox 18 ignore the specified character encoding for websites?

We are developing a page on our website that will have the page crawled and a newsletter generated and sent out to a mailing list. Many email packages default to character encoding of iso-8859-1 so we have set our character encoding to this on the page via the standard meta tag.
We have a problem with the newsletters that we had until now been unable to replicate. Though now I know why.... I have just discovered that in Firefox 18, the specified character encoding is being completely ignored. It is rendering the page in UTF-8 even though we specified ISO-8859-1. Firefox 3.6, however, renders the page with the proper encoding (thank god for keeping an old version for testing).
Can anyone explain why the new Firefox is completely ignoring the meta tag? Both browsers are using the factory default (I even opened FF18 in safe mode)...

Thanks for letting me know that Firefox 18 ignores everything but the server headers... but it doesn't help me much. Our website is in UTF-8... but this page is a newsletter, one that is crawled and saved into an email and sent out to a mailing list (by a third party newsletter program), and many email readers use ISO-8859-1, hence why we want to have the page rendered in that encoding so that we can actually test the newsletter properly. We can't test through the third party software as our testing environment is behind a firewall, and you can't change the server headers for a single page... hence the meta tag.
If you explicitly choose to render a page in a specific encoding, that shouldn't be ignored by the browser. It's not a big deal, but now every time we make a code change in our test environment and reload the page we have to force the encoding manually in the browser which is a pain.
The problem is, the newsletter is already live and we have some users complaining because some characters aren't displaying properly in their email packages (Entourage for Mac is one of them), while all our testing (which is encoded using UTF-8) looks fine.
