Character encoding according to page indication

Hello.
I'm not sure this forum is the best one to post such a question. Perhaps the Tomcat list could be another good place... So if someone has a better idea, it's welcome...
I've tested how the JSP 2.0 processor generates, or not, character encodings, using a recent Tomcat 5.5.
On the same page, with the content type defined as "text/xml" using the charset "UTF-8", I can use 3 equivalent forms to write out a character:
1) the old scriptlet expression: <%=myAttribute%>
2) the new EL form, directly put in the page body: ${myAttribute}
3) the JSTL solution: <c:out value="${myAttribute}"/>
And the winner is:
In (1) and (2), no special processing is done, and if my string contains any '<', the generated document is malformed... too bad!
In (3), with the default settings, any '<' or other special character is converted to the correct XML entity... and the document wins!
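For comparison, here is a rough Java sketch of the kind of escaping <c:out> applies by default (escapeXml="true"). This is only an illustration of the idea, not the actual JSTL implementation, and the helper class is made up:

public final class XmlEscape {
    // Hypothetical helper showing the substitutions <c:out> performs by default.
    public static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '\'': sb.append("&#039;"); break;
                case '"':  sb.append("&#034;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}

With myAttribute = "a < b", forms (1) and (2) write a < b into the document as-is, while (3) writes a &lt; b.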
So... my question: is there a particular part of the (servlet + jsp + taglib + jstl + el) specifications that clearly explains how characters should or must be encoded by the server, and when this is the application's responsibility?
Thanks for your help...

If you want the list to be automatically updated as you edit, Obi-wan's 27 GREP styles along with 27 swatches is the way to go.
You could also do it with a script, which would require less setup; the GREP approach would need a paragraph style with 27 GREP rules linked to 27 character styles using 27 swatches.
This AppleScript (OS X only) picks up the first character of each paragraph and colors the paragraph text with the swatch of the same name. This short example uses swatches named with the uppercase letters A, B, C to color the paragraphs starting with those letters:
tell application "Adobe InDesign CS6"
    --make sure there's a selection and get the text and its frame
    try
        set c to {insertion point, character, word, line, paragraph}
        if class of selection is in c then
            set t to parent of selection
        else if class of selection is text frame then
            set t to parent story of selection
        end if
    on error
        display dialog "Please select some text"
        return
    end try
    --get all of the story paragraphs
    set p to object reference of every paragraph of t
    --color each paragraph with the swatch that's named with the paragraph's first letter
    repeat with x in p
        set c to character 1 of x as string
        set s to (every swatch of active document whose name is c)
        try
            set fill color of every character of x to item 1 of s
        end try
    end repeat
end tell

Similar Messages

  • Web pages display OK, but print with garbage characters. I think it's character encoding, but don't know WHICH I should use. Have tried all Western and UTF options. Firefox 3.6.12

    I used to only have troubles with headers & footers printing out as garbage characters. I tried changing Character Encoding, now entire pages have garbage characters, even though pages view ok when browsing.

    If the pages look OK when you are browsing then it is not a problem with the encoding.
    It can be a problem with the font that is used; you can try to disable website fonts and possibly try a few different default fonts to see if that helps.
    Tools > Options > Content : Fonts & Colors: Advanced (Allow pages to choose their own fonts, instead of my selections above)

  • Default Character Encoding stuck on UTF-8 - Firefox 7

    I cannot change the Character Encoding - it is stuck on Unicode UTF-8 and I can not change it! When a web page opens I get these little boxes with "FF FD" instead of Quote marks. When I change the character encoding on that page using "View->Character Encoding" and click on the Western (ISO-8859-1), the page displays correctly. Every page opens using Unicode UTF-8 as the default.
    View->Character Encoding -- shows Unicode UTF-8 as the default.
    View->Character Encoding->Auto-detect -- shows OFF
    Tools->Options->Content->Advanced->Fonts->Default Character Encoding -- shows Western (ISO-8859-1), and the "Allow Pages to choose their own fonts..." check box IS CHECKED
    THE PAGES ARE NOT UTF-8!!!! The "View Page Source" IS NOT Unicode UTF-8! -- It shows <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
    The "View Page Info" shows MetaTag - Content-Type: text/html; charset=iso-8859-1
    Why can I not change the Default Character Encoding?
    I would also like to point out that the Unicode UTF-8 seems to be broken because it is indicating that the QUOTE CHARACTER is an UNPRINTABLE character "FF FD"
    ----- EDIT -----
    The UTF-8 is not broken. The problem, as pointed out in http://en.wikipedia.org/wiki/Replacement_character#Replacement_character, is that my Firefox being STUCK processing UTF-8 encoding cannot read the clearly marked iso-8859-1 data. So the UTF-8 is reinterpreting the smart quotes (“ and ”) as replacement (unprintable) characters.
    So the real problem is why my Firefox is stuck on Unicode UTF-8

    The real problem is that the font that is used doesn't have those characters.
    Do you see the special quotes -“ and ” on this forum page?
    Does it help if you disable the website fonts and set another font as the default font?
    *Tools > Options > Content : Fonts & Colors > Advanced
    *http://en.wikipedia.org/wiki/Punctuation
    *http://en.wikibooks.org/wiki/Unicode/Character_reference/2000-2FFF

  • Quotation marks display as &quot in web pages, I'm using Unicode UTF-8 character encoding.

    On many web pages, where a quotation mark character should appear, instead the page displays the text &quot. I believe this happens with other punctuation characters as well such as apostrophes although the text displayed in these other cases is different, of course. I'm guessing this is a problem with character encoding. I'm currently set to Unicode (UTF-8) encoding. Have tried several others without success.

    Here's a link where the problem occurs. Note the second line of the main body of text.
    http://www.sierratradingpost.com/lp2/snowshoes.html
    BTW, I never use IE, but I checked this site in IE and it shows the same problem, so maybe it is the page encoding after all rather than what I thought.
    In any case, my thanks for your help and would appreciate any solution you can suggest.

  • Can character encoding be predefined for certain pages?

    Certain pages that I visit frequently require me to manually set my character encoding to Western (ISO Latin 1), both when my default character encoding is set as UTF-8 and Western (ISO Latin 1).
    As the pages that show up malformed are embedded in other frames I suspect that the top frame forces a different encoding than is on the embedded page.
    An example page is here (847.is). The topic list of this message board is in order, but when any one of the topics is viewed all accented and special characters are missing, until Western (ISO Latin 1) is manually set as the character encoding. Similarly, opening any of the topics in a tab will result in missing characters.
    Is there some way for me to circumvent having to go through all those menus to set it? Can I somehow define that these pages should be viewed in Western (ISO Latin 1) or can I set a keyboard shortcut for Western (ISO Latin 1)?
    MacBook 2006   Mac OS X (10.4.7)   Safari version 2.0.4

    For instance if I opened an entry on Vísindavefur HÍ out of the parent frame the accented letters would show up somehow mangled, but this is no longer a problem.
    That page has no charset in the source and thus should only display correctly if you have Latin-1 set as the browser default. With UTF-8 set as the default you should see (in Safari) a ton of black diamonds with question marks inside.
    Firefox never displays ð (eth), þ (thorn) and ý (accented y) correctly for me; it did not do it on the old machine and does not do it on this one either.
    FireFox displays them perfectly for me in both 10.3 and 10.4.
    Actually, if I set Opera encoding to UTF-8 it displays the topics on 847.is as Safari does.
    This indicates it is a system issue rather than Safari. Sorry I can't duplicate it and have no good idea what could cause it on a normal system. Have you (or the place where you buy your machines) by chance installed any special software add-ons to enable the use of non-Unicode Icelandic (for apps like Appleworks, WordX, etc)?

  • Man pages character encoding

    hi there,
    I just installed arch the weekend before.
    When I'm browsing the man pages I get problems with the character encoding. Most of them are displayed correctly, but sometimes something like this appears: ?<80><98> instead of a character. As I do not know how to take a screenshot of a manpage without having a GUI installed, I took a picture ( XD ).
    [img=http://img411.imageshack.us/img411/184/dsc00185hj0.th.jpg]
    What configuration files do you need to help me? or do you know what my problem is, yet?
    thanks in advance

    I got a file named /etc/profile.pacnew
    which file do I have to merge/replace it with?
    Furthermore I have the problem with my manpages, which is solved by using unset MANPATH, but it reappears after a reboot. These problems are probably connected.
    I just don't know which files I have to merge, remove or keep, and so on.
    thx
    edit:
    ok, I replaced profile with profile.pacnew, as I can't remember ever having changed anything in profile. This solves the problem.
    Does pacman tell me when it creates such a .pacnew file?
    Or do I have to find out myself, searching for all files with a .pacnew ending once in a while and merging them with the old files?
    Last edited by okar (2008-03-18 18:08:29)

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the rest were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard, the way it works gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes as a double byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes. Those then each lead to a third byte and those three bytes define the character. This goes up to 6 byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that their text editor, using the codepage for their region, inserts as a single byte – a character like ß – and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding and that byte is now the first byte of a 2 byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
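    A small Java sketch of that failure mode, assuming the "windows-1252" and "UTF-8" charset names are available in the JRE:
    public class EncodingTrap {
        public static void main(String[] args) throws Exception {
            String eszett = "\u00DF"; // ß, written as an escape so the source file's own encoding doesn't matter
            // Saved by an editor using a Western single-byte codepage, it is one byte (0xDF)...
            byte[] oneByte = eszett.getBytes("windows-1252");
            System.out.println(oneByte.length);               // 1
            // ...but decoded as UTF-8 that byte looks like the start of a 2 byte sequence,
            // so the decoder hands back the replacement character instead.
            System.out.println(new String(oneByte, "UTF-8")); // the \uFFFD replacement character
            // In UTF-8 the same character really is a 2 byte sequence.
            System.out.println(eszett.getBytes("UTF-8").length); // 2
        }
    }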
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
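    A minimal Java sketch of Point 3 (this assumes Java 7+ for java.nio.file and StandardCharsets; the file name is just an example):
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    public class ExplicitEncodingIo {
        public static void main(String[] args) throws IOException {
            // Write with an explicit encoding rather than the platform default.
            try (BufferedWriter w = Files.newBufferedWriter(Paths.get("notes.txt"), StandardCharsets.UTF_8)) {
                w.write("caf\u00e9");
            }
            // Read it back declaring the same encoding.
            try (BufferedReader r = Files.newBufferedReader(Paths.get("notes.txt"), StandardCharsets.UTF_8)) {
                System.out.println(r.readLine()); // café
            }
        }
    }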
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
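    And a sketch of Point 4 using the StAX writer bundled with the JDK; it declares the encoding in the XML prolog and encodes the stream to match (it does not, however, write a byte-order mark):
    import java.io.FileOutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;
    public class XmlEncoderExample {
        public static void main(String[] args) throws Exception {
            FileOutputStream out = new FileOutputStream("out.xml");
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            xml.writeStartDocument("UTF-8", "1.0");  // <?xml version="1.0" encoding="UTF-8"?>
            xml.writeStartElement("note");
            xml.writeCharacters("caf\u00e9 < bar");  // special characters are escaped for us
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
            out.close();
        }
    }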
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Lets start off with two key items
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And lets add a codacil to this – most Americans can get by without having to take this in to account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that but at that time I wasn't even aware that character sets existed.
    They might only use that range but that is a different issue, especially since that range is exactly the same as the UTF8 character set anyways.
    >
    The computer industry started with diskspace and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 bits for each character. There of course were numerous charactersets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, is HTML and XML. Every HTML and XML file can optionally have the character encoding set in it's header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guess wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now lets' look at UTF-8 because as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a sersies of second bytes. Those then each lead to a third byte and those three bytes define the character. This goes up to 6 byte sequences. Using the MBCS (multi-byte character set) you can write the equivilent of every unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of Unicode, all of Unicode, are based on ASCII. The representational format of UTF8 is required to implement Unicode, thus it must represent those characters. It uses the idiom supported by variable width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that in their text editor, using the codepage for their region, insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first character fo a 2 byte sequence. You either get a different character or if the second byte is not a legal value for that first byte – an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it then it is invalid. End of story. It has nothing to do with html/xml.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encode. If you must create with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Lets take what is actually a very difficlut example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in java with escaped unicode characters which will fail to compile.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business and create solutions that are appropriate to that. Thus there is absolutely no point for someone that is creating an inventory system for a stand-alone store to craft a solution that supports multiple languages.
    Another example: with high volume systems, moving/storing bytes is relevant. As such one must carefully consider each text element as to whether it is customer consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems incremental savings impact operating costs and marketing advantage with speed.

  • Character encoding: Ansi, ascii, and mac, oh my!

    I'm writing a program which has to search & replace data in user-supplied Rich Text documents (.rtf). Ideally, I would like to read the whole thing into a StringBuffer, so that I can use all of the functionality built into String and StringBuffer, and so that I can easily compare with constant Strings and chars.
    The trouble that I have is with character encoding. According to the rtf spec, RTFs can be encoded in four different character encodings: "ansi", "mac", IBM PC code page 437, and IBM PC code page 850, none of which are supported by Java (see http://impulzus.sch.bme.hu/tom/szamitastechnika/file/rtfspec/rtfspec_6.htm#rtfspec_8 for the RTF spec and http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html#charenc for the character encodings supported by Java).
    I believe, from a bit of googling, that they are all 8 bits/character, so I could read everything into a byte array and manipulate that directly. However, that would be rather nasty. I would have to be careful with the changes that I make to the document, so that I do not insert values that do not encode correctly in the document's character encoding. Overall, a large hassle.
    So my question is - has anyone done something like this before? Any libraries that will make my job easier? Or am I missing something built into Java that will allow me to easily decode and reencode these documents?

    DrClap, thanks for the response.
    If I could map from the encodings listed above (which are given in the rtf doucment) to a java encoding name from the page that you listed, that would solve all my problems. However, there are a couple of problems:
    a) According to this page - http://orwell.ru/info/diffs.htm - ANSI is a superset of ISO-8859-1. That page isn't exactly authoritative, but I can't afford to lose data.
    b) I'm not sure what to do about the other character encodings. "mac" may correspond to "MacRoman" but that page lists a dozen or so other macintosh encodings. Gotta love crystal-clear MS documentation.
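    If it helps, a small Java sketch of that mapping approach. The Java charset names below are guesses to verify against your JRE (Charset.isSupported tells you which are actually available), not an authoritative table:
    import java.nio.charset.Charset;
    import java.util.HashMap;
    import java.util.Map;
    public class RtfCharsets {
        // Candidate mappings from RTF charset keywords to Java charset names.
        private static final Map<String, String> RTF_TO_JAVA = new HashMap<String, String>();
        static {
            RTF_TO_JAVA.put("ansi", "windows-1252"); // often what \ansi means in practice
            RTF_TO_JAVA.put("mac", "MacRoman");      // may be "x-MacRoman" on newer JREs
            RTF_TO_JAVA.put("pc", "IBM437");         // code page 437
            RTF_TO_JAVA.put("pca", "IBM850");        // code page 850
        }
        public static Charset forRtfKeyword(String keyword) {
            String name = RTF_TO_JAVA.get(keyword);
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
            throw new IllegalArgumentException("No usable charset for \\" + keyword);
        }
        public static void main(String[] args) {
            for (String key : RTF_TO_JAVA.keySet()) {
                String name = RTF_TO_JAVA.get(key);
                System.out.println(key + " -> " + (Charset.isSupported(name) ? Charset.forName(name).name() : "not supported here"));
            }
        }
    }
    Whether you then read the whole document through an InputStreamReader built on that Charset, or keep the byte array and decode only the text runs, depends on how much of the RTF structure you need to preserve.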

  • How can I tell what character encoding is sent from the browser?

    Hi,
    I am developing a servlet which is supposed to be used to send and receive messages in multiple character sets. However, I read from previous postings that each WebLogic Server can only support one input character encoding. Is that true? And do you have any suggestions on how I can do what I want? For example, I have an HTML form for people to post any comments (they may post in any character set, like ShiftJIS, Big5, GB, etc.). I need to know what character encoding they are using before I can read it correctly in the servlet and save it in the database.

    From what I understand (I haven't used it yet) 6.1 supports the 2.3 servlet spec. That should have a method to set the encoding. Otherwise, I don't think you can support multiple encodings in one instance of WebLogic.
    From what I know browsers don't give any indication at all about what encoding they're using. I've read some chatter about the HTTP spec being changed so it's always UTF-8, but that's a Some Day(TM) kind of thing, so you're stuck with all the stuff out there now which doesn't do everything in UTF-8.
    Sorry for the bad news, but if it makes you feel any better I've felt your pain. Oh, and trying to process multipart/form-data (file upload) forms is even worse, and from what I've seen the API that people talk about on these newsgroups assumes everything is ISO-8859-1.
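    For what it's worth, a minimal sketch of the Servlet 2.3 approach mentioned above; it assumes the form page itself is served as UTF-8 so the browser posts UTF-8 back:
    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    public class CommentServlet extends HttpServlet {
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // Must be called before any parameter is read, otherwise the container
            // has already decoded the body with its default charset.
            request.setCharacterEncoding("UTF-8");
            String comment = request.getParameter("comment");
            response.setContentType("text/html; charset=UTF-8");
            response.getWriter().println("Received: " + comment);
        }
    }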

  • Character Encoding for IDOC to JMS scenario with foreign characters

    Dear Experts,
    The scenario is described as follows:
    Issue Description:
    There is an IDOC which is created after extracting data from different countries (but only one country at a time). So, for instance, the first time the data is picked up in Greek and Latin and the corresponding IDOC is created and sent to PI, the next time plain English is sent to PI, and next Chinese, and so on. As of now, every time this IDOC reaches PI it comes with UTF-8 character encoding, as seen in the IDOC XML.
    I am converting this IDOC XML into a single-string flat file (currently taking the default encoding UTF-8) and sending it to the receiver JMS Queue (MQ Series). Now when this data is picked up by the end recipient from the corresponding queue in MQ Series, they see ? wherever there are Greek/Latin characters (maybe because those should have a different encoding like ISO-8859-7). This is causing issues at their end.
    My Understanding
    The SAP system should trigger the IDOC with the right code page, i.e. if the IDOC is sent with Greek/Latin the code page should be ISO-8859-7, if the same IDOC is sent with Chinese characters the corresponding code page, else UTF-8 or the default code page.
    Once this is sent correctly from SAP, the Java Mapping then has to use the correct code page when writing the bytes to the output stream, and we would also need to set the right code page as a JMS Header before putting the message in the JMS queue so that the receiver can interpret it.
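    As a rough Java illustration of that last step only (the charset is chosen per message; the exact MQ/JMS header used to advertise it is deliberately left out here, since its name would be an assumption):
    import java.io.UnsupportedEncodingException;
    public class PayloadEncoder {
        // Turn the mapped IDOC XML string into bytes in the code page agreed with the receiver.
        public static byte[] encodePayload(String idocXml, String codePage)
                throws UnsupportedEncodingException {
            return idocXml.getBytes(codePage); // e.g. "ISO-8859-7" for Greek, "UTF-8" otherwise
        }
    }
    The same code page would then also have to be communicated to the receiver, which is what query 3 below asks about.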
    Queries:
    1. Is my approach for the scenario correct, if not please guide me to the right approach.
    2. Does SAP support different code page being picked for the same IDOC based on different data set. If so how is it achieved.
    3. What is the JMS Header property to set the right code page? (I think there should be some JMS Header defined by MQ Series for character encoding which I should be setting correctly.) I find that there is a property to set the CCSID in the JMS Receiver Adapter but that only refers to non-ASCII names and doesn't refer to the payload content.
    I would appreciate if anybody can give me pointers on how to resolve this issue.
    Thanks,
    Pratik

    Hi Pratik,
         http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/502991a2-45d9-2910-d99f-8aba5d79fb42?quicklink=index&overridelayout=true
    This link might help.
    regards
    Anupam

  • Character Encoding question

    I'm helping another group out on this so I'm pretty new to this stuff so please go easy on me if I ask anything that is obvious.
    We have a J2EE web application that is sitting on a Red Hat Linux box and is being served up by OAS 10.1.3. The application reads an xml file which contains the actual content of the page and then pulls in the navigation and metadata from other sources.
    Everything works as it should, but there is one issue that has been ongoing for a while and we would like to close it off. In my content source xml file, I have encoded special characters such as é as & amp;#233; - when I view the web page all is well (I see the literal value of é) but when I do a view source, I see & #233;
    If I put & #233; into the source xml file, the page still displays é, but when I do a view source on the web page, I see the literal é value in the source, which is not what we want. What is decoding the character reference? While inserting & amp;#233; into the xml source file works, we do not want to have to encode everything that way; we would prefer to have & #233;. Is it a setting of the OS, the Application Server or the Application itself?
    When I previewed this post, I noticed that by typing & amp;#233; as one solid word, it gets decoded and is seen as é, so I had to put a space between the & and the amp to properly explain myself.
    Any help would be appreciated!
    Thanks,
    /HH

    There are a lot of notes on MetaLink about character encoding. I wrote Note 337945.1 a while ago, which explains this into more detail. I will quote some relevant to your situation:
    For the core components, there are three places to set NLS_LANG:
    - in the system environment (this is obvious)
    - in the file opmn.xml
    - in the file apachectl
    A. Changing opmn.xml
    - go to $ORACLE_HOME/opmn/conf and edit the file opmn.xml
    - Search for the OC4J container your application runs in.
    - Within the <process-type.... > </process-type> section, add an entry similar to:
    (1) OracleAS 10g (10.1.2, 10.1.3):
    <environment>
         <variable id="NLS_LANG" value="ENGLISH_UNITED KINGDOM.AL32UTF8"/>
    </environment>
    B. Changing apachectl (Unix only)
    - Go to $ORACLE_HOME/Apache/Apache/bin
    - Open the file 'apachectl'
    - search for NLS_LANG
    e.g.
    NLS_LANG=${NLS_LANG=""}; export NLS_LANG
    Verify if the variable is getting the correct value; this may depend on your environment and on the version of OracleAS. If necessary, change this line. In this example, the value from the environment is taken automatically.
    There is more on this topic in the mod_plsql area but since you do not mention pulling data from the database, this may be less relevant. Otherwise you need to ensure the same NLS_LANG and character set is used in the database to avoid conversions.

  • Character encoding again

    Hi, I haven't got any answer so I'll try to ask again...
    I have created a page from Data Controls.
    I have created a parameter form. And a table. The detail is shown at the bottom of the page (the current row is shown through #bindings...).
    Everything works fine: when I fill something into the parameter form the table is filtered by that criteria, and when the current row is changed, the detail is also changed. But when I fill some Czech character into the parameter form it works pretty badly.
    The table is correctly filtered, but when I make any other action after filtering, the table shows no rows.
    I found what causes this problem. It is the Property in bindings that holds the value for this parameter. When I first call bindings.findXXX.execute the Property's value is "č", for example. That is correct and the table shows filtered rows. After I perform another action (I think it does not depend on what the action is; changing the current row, for example) the value in that Property has changed to "?" instead of "č", and because of this the filter is applied again and the table shows no rows. I have checked all encodings and the character encoding is set to utf-8. Is this the problem, or am I missing some settings?
    1) menu tools-preferences-Environment
    2) project properties- compiler-character encoding
    3) in jspx
    <?xml version='1.0' encoding='UTF-8'?>
    <jsp:directive.page contentType="text/html;charset=UTF-8"
    pageEncoding="UTF-8"/>
    <afh:head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </afh:head>
    is there other ? i dont know i set this settings long ago ...
    Please if someone know, give me some hint , thanks for help.
    Jdeveloper 10.1.3.0.4(SU5)

    Check the regional language settings on the machine where your application server is running. I faced this problem but was able to resolve it by modifying the
    NLS_LANG = AMERICAN_AMERICA.WE8ISO8859P1
    NLS_LANG is under HKEY_LOCAL_MACHINE==>ORACLE
    WE8ISO8859P1 is the standard encoding for my application developed in local indian language and works fine for me
    This can help, check it out.
    Amit

  • Character Encoding in XML

    Hello All,
    I am not clear about solving the problem.
    We have a Java application on NT that is supposed to communicate with the same application on MVS mainframe through XML.
    We have a character encoding for these XML commands we send for communication.
    The problem is, on MVS the parser is not understanding the US-ASCII character encoding. And so we are getting the infamous "illegal character" error.
    The main frame file.encoding=CP1047 and
    NT's file.encoding = us-ascii.
    Is there any character encoding that is common to these two machines: mainframe and NT.
    If it is Unicode, what is the correct notation for it?
    Or is there any way of specifying to the parser which character encoding should be used?
    thanks,
    Sridhar

    On the mainframe end maybe something like-
    FileInputStream fris = new FileInputStream("C:\\whatever.xml");
    InputStreamReader is = new InputStreamReader(fris, "ASCII"); // or maybe "us-ascii" or "US-ASCII"
    BufferedReader brin = new BufferedReader(is);
    Or give the input stream / buffered reader to whatever application you are using to parse the xml. The InputStreamReader should allow you to set your encoding even if the system doesn't have it as the native encoding. It depends, though, on which/whose JVM you are using; JDK 1.2 at least supports the encodings listed on this page: http://as400bks.rochester.ibm.com/pubs/html/as400/v4r4/ic2924/info/java/rzaha/javaapi/intl/encoding.doc.html
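    Building on that, a small sketch of handing the reader to a standard DOM parser, so the decoding is done before the parser ever sees the bytes (the file name is just an example):
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    public class ParseWithExplicitEncoding {
        public static void main(String[] args) throws Exception {
            InputStreamReader reader = new InputStreamReader(new FileInputStream("whatever.xml"), "US-ASCII");
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The InputSource carries characters, so the parser no longer has to guess the encoding.
            Document doc = builder.parse(new InputSource(reader));
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }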

  • c:import character encoding problem (utf-8)

    Aloha @ all,
    I am currently importing a file using the <c:import> functionality (<c:import url="module/item.jsp" charEncoding="UTF-8">) but it seems that the returned data is not encoded with utf-8 and hence not displayed correctly. The overall file header is:
    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    Set-Cookie: JSESSIONID=E67F9DAF44C7F96C0725652BEA1713D8;
    Content-Type: text/html;charset=UTF-8
    Content-Length: 6861
    Date: Thu, 05 Jul 2007 04:18:39 GMT
    Connection: close
    I've set the file-encoding on all pages to :
    <%@ page contentType="text/html;charset=UTF-8" %>
    <%@ page pageEncoding="UTF-8"%>
    but the error remains... is this a known bug and is there a workaround?

    Partially, yes. It turns out that I created the documents in Eclipse with a different character encoding. Hence the entire document was actually not UTF-8 encoded...
    So I changed each document's encoding in Eclipse to UTF-8 and got it working just fine...

  • Problems with Forms and character encoding

    I'm having problems trying to read unicode data inputted into a Form on my JSP page.
    I've used the meta tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to set the charset of the page to UTF-8. I've inputted some Chinese characters into my form and when I try to read the subsequent request parameter in my servlet using request.getParameter() the string returned is this
    "&#26469;&#28304;" which is the escape sequence required by HTML to display these characters.
    From what I've read on the subject this doesn't seem like the expected value. I've tried other ways of getting the correct string value such as setting the character encoding request.setCharacterEncoding("UTF-8") and then converting the bytes using this encoding value but it doesn't seem to work.
    I could write a method to split up the string using the ; as a token and working out the correct unicode character but this doesn't seem like the right thing to do.
    Any help on how to pass the correct information from the Form in the JSP page to the servlet would be greatly appreciated

    I don't believe that is correct, but if it's returning HTML escapes instead of URL Encoded characters, then it's the browser doing it. This is my test page for playing with Chinese...
    <%@ page language="java" contentType="text/html; charset=UTF-8" %>
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
    <head>
         <title></title>
         <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    </head>
    <body bgcolor="#ffffff" background="" text="#000000" link="#ff0000" vlink="#800000" alink="#ff00ff">
    <%
    request.setCharacterEncoding("UTF-8");
    String str = "\u7528\u6237\u540d";
    String name = request.getParameter("name");
    %>
    req enc: <%= request.getCharacterEncoding() %><br />
    rsp enc: <%= response.getCharacterEncoding() %><br />
    str: <%= str %><br />
    name: <%= name %><br />
    <form method="GET" action="_lang.jsp" encoding="UTF-8">
    Name: <input type="text" name="name" value="" >
    <input type="submit" name="submit" value="GET Submit" />
    </form>
    <form method="POST" action="_lang.jsp" encoding="UTF-8">
    Name: <input type="text" name="name" value="" >
    <input type="submit" name="submit" value="POST Submit" />
    </form>
    </body>
    </html>
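    Another pattern that is often used alongside this (not from the reply above, just a common companion technique) is a servlet filter that forces the request encoding before any parameter is read:
    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    public class Utf8Filter implements Filter {
        public void init(FilterConfig config) throws ServletException { }
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            // Runs before any getParameter() call downstream; affects how the POST body is decoded.
            request.setCharacterEncoding("UTF-8");
            chain.doFilter(request, response);
        }
        public void destroy() { }
    }
    It still needs a filter mapping in web.xml, and GET query strings follow the container's own URI encoding settings rather than this call.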
