Character Encoding question

I'm helping another group out on this and I'm pretty new to this stuff, so please go easy on me if I ask anything obvious.
We have a J2EE web application that is sitting on a Red Hat Linux box and is being served up by OAS 10.1.3. The application reads an xml file which contains the actual content of the page and then pulls in the navigation and metadata from other sources.
Everything works as it should, but there is one issue that has been ongoing for a while and we would like to close it off. In my content source xml file, I have encoded special characters such as é as & amp;#233; - when I view the web page all is well (I see the literal value of é), but when I do a view source, I see & #233;
If I put & #233; into the source xml file, the page still displays with é but when I do a view source on the web page, I see the literal é value in the source, which is not what we want. What is decoding the character reference? While inserting & amp;#233; into the xml source file works, we do not want to have to encode everything that way; we would prefer to have & #233;. Is it a setting of the OS, the Application Server or the Application itself?
When I previewed this post, I noticed that typing & amp;#233; as one solid word gets it decoded and shown as é, so I had to put a space between the & and the amp to properly explain myself.
Any help would be appreciated!
Thanks,
/HH

There are a lot of notes on MetaLink about character encoding. I wrote Note 337945.1 a while ago, which explains this in more detail. I will quote some parts relevant to your situation:
For the core components, there are three places to set NLS_LANG:
- in the system environment (this is obvious)
- in the file opmn.xml
- in the file apachectl
A. Changing opmn.xml
- go to $ORACLE_HOME/opmn/conf and edit the file opmn.xml
- Search for the OC4J container your application runs in.
- Within the <process-type.... > </process-type> section, add an entry similar to:
(1) OracleAS 10g (10.1.2, 10.1.3):
<environment>
     <variable id="NLS_LANG" value="ENGLISH_UNITED KINGDOM.AL32UTF8"/>
</environment>
B. Changing apachectl (Unix only)
- Go to $ORACLE_HOME/Apache/Apache/bin
- Open the file 'apachectl'
- search for NLS_LANG
e.g.
NLS_LANG=${NLS_LANG=""}; export NLS_LANG
Verify if the variable is getting the correct value; this may depend on your environment and on the version of OracleAS. If necessary, change this line. In this example, the value from the environment is taken automatically.
There is more on this topic in the mod_plsql area but since you do not mention pulling data from the database, this may be less relevant. Otherwise you need to ensure the same NLS_LANG and character set is used in the database to avoid conversions.
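On the original question of what decodes the character reference: any conforming XML parser resolves numeric character references while it parses, so a likely place to look is the point where the application reads the content file, before OS or application server settings come into play. A minimal sketch using the JDK's DOM parser follows; the class and method names are made up for illustration and are not taken from the application in question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CharRefDemo {
    /** Parses a small XML document and returns the text of its root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Single-escaped: the parser resolves the reference to U+00E9 (e-acute).
        System.out.println(rootText("<p>&#233;</p>").equals("\u00e9"));
        // Double-escaped: only &amp; is resolved, leaving the literal text "&#233;".
        System.out.println(rootText("<p>&amp;#233;</p>").equals("&#233;"));
    }
}
```

If the page has to emit the reference rather than the character, the re-escaping would have to happen when the application writes its output, since the parsed text no longer records how the character was spelled in the file.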

Similar Messages

  • XML Character Encoding Using UTL_DBWS

    Hi,
    I have a database with WINDOWS-1252 character encoding. I'm using UTL_DBWS to call a web service method which echoes a given string. For this purpose, I do the following:
    DECLARE
        v_wsdl CONSTANT VARCHAR2(500) := 'http://myhost/myservice?wsdl';
        v_namespace CONSTANT VARCHAR2(500) := 'my.namespace';
        v_service_name CONSTANT UTL_DBWS.QNAME := UTL_DBWS.to_qname(v_namespace, 'MyService');
        v_service_port CONSTANT UTL_DBWS.QNAME := UTL_DBWS.to_qname(v_namespace, 'MySoapServicePort');
        v_ping CONSTANT UTL_DBWS.QNAME := UTL_DBWS.to_qname(v_namespace, 'ping');
        v_wsdl_uri CONSTANT URITYPE := URIFACTORY.getURI(v_wsdl);
        v_str_request CONSTANT VARCHAR2(4000) :=
    '<?xml version="1.0" encoding="UTF-8" ?>
    <ping>
        <pingRequest>
            <echoData>Dev Team üöäß</echoData>
        </pingRequest>
    </ping>';
        v_service UTL_DBWS.SERVICE;
        v_call UTL_DBWS.CALL;
        v_request XMLTYPE := XMLTYPE (v_str_request);
        v_response SYS.XMLTYPE;
    BEGIN
        DBMS_JAVA.set_output(20000);
        UTL_DBWS.set_logger_level('FINE');
        v_service := UTL_DBWS.create_service(v_wsdl_uri, v_service_name);
        v_call := UTL_DBWS.create_call(v_service, v_service_port, v_ping);
        UTL_DBWS.set_property(v_call, 'oracle.webservices.charsetEncoding', 'UTF-8');
        v_response := UTL_DBWS.invoke(v_call, v_request);
        DBMS_OUTPUT.put_line(v_response.getStringVal());
        UTL_DBWS.release_call(v_call);
        UTL_DBWS.release_all_services;
    END;
    /
    Here is the SERVER OUTPUT:
    ServiceFacotory: oracle.j2ee.ws.client.ServiceFactoryImpl@a9deba8d
    WSDL: http://myhost/myservice?wsdl
    Service: oracle.j2ee.ws.client.dii.ConfiguredService@c881d39e
    *** Created service: -2121202561 - oracle.jpub.runtime.dbws.DbwsProxy$ServiceProxy@afb58220 ***
    ServiceProxy.get(-2121202561) = oracle.jpub.runtime.dbws.DbwsProxy$ServiceProxy@afb58220
    Collection Call info: port={my.namespace}MySoapServicePort, operation={my.namespace}ping, returnType={my.namespace}PingResponse, params count=1
    setProperty(oracle.webservices.charsetEncoding, UTF-8)
    dbwsproxy.add.map: ns, my.namespace
    Attribute 0: my.namespace: xmlns:ns, my.namespace
    dbwsproxy.lookup.map: ns, my.namespace
    createElement(ns:ping,null,my.namespace)
    dbwsproxy.add.soap.element.namespace: ns, my.namespace
    Attribute 0: my.namespace: xmlns:ns, my.namespace
    dbwsproxy.element.node.child.3: 1, null
    createElement(echoData,null,null)
    dbwsproxy.text.node.child.0: 3, Dev Team üöäß
    request:
    <ns:ping xmlns:ns="my.namespace">
       <pingRequest>
          <echoData>Dev Team üöäß</echoData>
       </pingRequest>
    </ns:ping>
    Jul 8, 2008 6:58:49 PM oracle.j2ee.ws.client.StreamingSender _sendImpl
    FINE: StreamingSender.response:<?xml version = '1.0' encoding = 'UTF-8'?>
    <env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"><env:Header/><env:Body><ns0:pingResponse xmlns:ns0="my.namespace"><pingResponse><responseTimeMillis>0</responseTimeMillis><resultCode>0</resultCode><echoData>Dev Team üöäß</echoData></pingResponse></ns0:pingResponse></env:Body></env:Envelope>
    response:
    <ns0:pingResponse xmlns:ns0="my.namespace">
       <pingResponse>
          <responseTimeMillis>0</responseTimeMillis>
          <resultCode>0</resultCode>
          <echoData>Dev Team üöäß</echoData>
       </pingResponse>
    </ns0:pingResponse>
    As you can see the character encoding is broken in the request and in the response, i.e. the SOAP encoder does not take into consideration the UTF-8 encoding.
    I tracked down the problem to the method oracle.jpub.runtime.dbws.DbwsProxy.dom2SOAP(org.w3c.dom.Node, java.util.Hashtable); and more specifically to the calls of oracle.j2ee.ws.saaj.soap.soap11.SOAPFactory11.
    My question is: is there a way to make the SOAP encoder use the correct character encoding?
    Thanks a lot in advance!
    Greetings,
    Dimitar

    I found a workaround of the problem:
        v_response := XMLType(v_response.getBlobVal(NLS_CHARSET_ID('CHAR_CS')), NLS_CHARSET_ID('AL32UTF8'));
    Ugly, but I'm tired of decompiling and debugging Java classes ;)
    Greetings,
    Dimitar
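The workaround above reinterprets the response bytes under a different character set. The same idea can be sketched in plain Java, outside the database: text whose UTF-8 bytes were decoded with a single-byte charset can be recovered by getting the raw bytes back and decoding them as UTF-8. ISO-8859-1 stands in for the database character set here because it maps every byte value; that substitution and the class name are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Decodes the UTF-8 bytes of s with the wrong (single-byte) charset. */
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Undoes misread(): recover the raw bytes, then decode them as UTF-8. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00fc\u00f6\u00e4\u00df"; // the four characters ü ö ä ß
        String garbled = misread(original);
        // Each character became two bytes in UTF-8, so the misread text is twice as long.
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```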

  • Character encoding in JDeveloper 9.0.3.1035

    After the upgrade to JDeveloper 9.0.3.1035, all the object views, after the complete rebuild (alt+f9) of the project, received character encoding <?xml version="1.0" encoding='windows-1252'?> instead of <?xml version="1.0" encoding='UTF-8'?> as defined in the project settings as compiler character encoding "UTF-8".
    The project is running under a LINUX server and we defined UTF-8 as encoding. How can I define this encoding for the complete project?
    The UTF-8 definition as I mentioned is already done and it does not work.

    Please post your question at
    JDeveloper and ADF

  • Can character encoding be predefined for certain pages?

    Certain pages that I visit frequently require me to manually set my character encoding to Western (ISO Latin 1), both when my default character encoding is set as UTF-8 and Western (ISO Latin 1).
    As the pages that show up malformed are embedded in other frames I suspect that the top frame forces a different encoding than is on the embedded page.
    An example page is here (847.is). The topic list of this message board is in order, but when any one of the topics is viewed all accented and special characters are missing, until Western (ISO Latin 1) is manually set as the character encoding. Similarly, opening any of the topics in a tab will result in missing characters.
    Is there some way for me to circumvent having to go through all those menus to set it? Can I somehow define that these pages should be viewed in Western (ISO Latin 1) or can I set a keyboard shortcut for Western (ISO Latin 1)?
    MacBook 2006   Mac OS X (10.4.7)   Safari version 2.0.4

    For instance if I opened an entry on Vísindavefur HÍ out of the parent frame the accented letters would show up somehow mangled, but this is no longer a problem.
    That page has no charset in the source and thus should only display correctly if you have Latin-1 set as the browser default. With UTF-8 set as the default you should see (in Safari) a ton of black diamonds with question marks inside.
    Firefox never displays ð (eth), þ (thorn) and ý (accented y) correctly for me, did not do it on the old machine and does not do it on this one either,
    FireFox displays them perfectly for me in both 10.3 and 10.4.
    Actually, if I set Opera encoding to UTF-8 it displays the topics on 847.is as Safari does.
    This indicates it is a system issue rather than Safari. Sorry I can't duplicate it and have no good idea what could cause it on a normal system. Have you (or the place where you buy your machines) by chance installed any special software add-ons to enable the use of non-Unicode Icelandic (for apps like Appleworks, WordX, etc)?

  • Detecting character encoding?

    I'm running into a problem with Safari (I've got Safari 4 beta installed, but I think this was a problem with 3 as well) where it won't automatically detect character encodings. These are sites in Japanese which load and display correctly in Firefox 2, but in Safari they come up as gibberish. When I select Shift JIS they show correctly in Safari. Other pages come up as expected. I must emphasise this: the pages show correctly in Firefox 2 without having to do anything, but in Safari 4 they come up incorrectly and I must change the encoding manually.
    The pages in question are here: http://genki.japantimes.co.jp/self/kanji.en.html
    Is there some way to force Safari to display these pages correctly without having to manually select the correct encoding from the Character Encoding menu? Have I missed an option somewhere?

    The pages in question are here: http://genki.japantimes.co.jp/self/kanji.en.html
    The authors of these pages have failed to put in the html code required by international standards so that browsers will know the text is in Japanese. You should ask them to do this -- leaving it out does not make them look very smart.
    While FireFox has a system which can sometimes detect encodings automatically, Safari does not. When a page lacks the required html code, it will use the default encoding set in its preferences/appearance. So if you set that to Shift JIS, these pages should come up correctly. This could make other pages which lack the correct html display incorrectly, but they should be very rare these days.

  • Default character encoding ???

    To set the default character encoding to UTF-8 for JVM I can do the
    following on Solaris:
    LC_ALL=en_US.UTF-8; java ....
    How do I achieve the same on Windows?
    Is there a java property where I can specify the default character
    encoding or do I have to getBytes( str, "UTF-8") everywhere?
    Thanks,
    Artur...

    Hi Artur,
    please see my responses to the same question you've posted in
    Forum Home > Internationalization
    Kind regards,
    Marcus
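Marcus's answer lives in the other forum and is not quoted here, so the following is not necessarily what he suggested. As a hedged sketch: many JVMs pick up a `-Dfile.encoding=UTF-8` option at startup on any platform, although for a long time that property was treated as an implementation detail rather than a supported switch. The portable route is to name the charset on every conversion, which is what the helper below does (the class name is invented for the example).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetDemo {
    /** Encodes with an explicit charset, independent of the platform default. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What the JVM would use if no charset were named; varies by OS and locale.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        // e-acute is always two bytes in UTF-8, whatever the default is.
        System.out.println(utf8("\u00e9").length);
    }
}
```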

  • Setting Character encoding programmaticaly?

    Hi,
    I am using Sun J2ME wireless toolkit 2.1, and I have a problem with character encoding. I am receiving text from a .NET web service, and after some processing in the client, I send the string back.
    The problem is, the string I am sending back includes Turkish characters. These are sent as question marks instead of characters.
    I have failed to find a method that changes the character encoding used while making a web service call.
    Actually, I could not see any way to change the encoding overall. For the emulator, a property file can be used, but what about the devices I'll be deploying the app to? It'd be really great if someone could point me in the right direction.
    Best Regards

    Hi,
    My situation is as follows. I have .NET web services on the server side, and i am using mobile devices as clients. When i get a string from method A in web service , i can display it on the device screen without a problem. after that, if i send the same string that i've received from method A as a parameter to method B, the .NET code receives garbage instead of turkish chars.
    At the moment i am encoding turkish chars at the client side, and decoding them at the .net web server processing code.
    I'd like to try setting the encoding to utf8, but as i have written, i have not seen any way of doing this. Changing properties file for emulator is possible, but how can i do it for the target devices. I have not seen an api call for this purpose in midp or cldc docs. Thanks for your answer
    Regards

  • Absolute UTL_SMTP Character encoding Frustration

    I am using UTL_SMTP to send emails from my DB. It works great when I am sending standard ASCII emails, but when I try to send Arabic emails (windows-1256), which corresponds to my database character encoding NLS_LANG value ARABIC_SAUDI ARABIA.AR8MSWIN1256, all Arabic characters appear as question marks in my email even though I set the charset in the Content-type header to windows-1256 as shown below.
    l_maicon :=utl_smtp.open_connection('domain',25);
    utl_smtp.helo(l_maicon,'domain');
    utl_smtp.mail(l_maicon,' [email protected]');
    utl_smtp.rcpt(l_maicon, [email protected]);
    utl_smtp.rcpt(l_maicon, [email protected]);
    utl_smtp.data(l_maicon,
    'Content-type:text/html; charset=windows-1256' || utl_tcp.crlf ||
    'From: [email protected]' || utl_tcp.crlf||
    'To: ' || [email protected] || ';' || [email protected] || utl_tcp.crlf ||
    'Subject: Notification System: ' || utl_tcp.crlf ||
    'هذه الرسالة بالعربية');
    utl_smtp.quit(l_maicon);
    I am wondering if anyone has been across a similar problem.
    Thank you,
    Hussam Galal
    [email protected]

    I was misusing the function UTL_SMTP.write_raw_data() as i was not calling the functions UTL_SMTP.open_data() before and UTL_SMTP.close_data() after it.
    Apparently the problem consists of two parts. The first part is writing the data correctly (8-bit encoding) from the database package UTL_SMTP; this is done using the function write_raw_data. The second part is telling the party handling the email about the character encoding of the email; this is done by adding a Content-Type header in the email header as follows:
    'Content-Type: text/plain; charset="windows-1256"'
    'Content-Transfer-Encoding: 8bit'
    When I write the message using the write_raw_data function, the To:, From: and Subject (in Arabic, exactly as desired) are received but with no body; the body always appears empty (blank)!!
    When I send the exact same message using the function data(), the body is received but with the Arabic characters appearing as question marks, which is expected as this function supports only standard ASCII.
    When I use the function write_data() it gives exactly the same result as using data(), which I couldn't understand as it's mentioned in the documentation that this function supports 8-bit character encoding!!!
    Below is the sample pieces of code which i am using
    WRITE_RAW_DATA()
    utl_smtp.open_data(l_maicon);
    utl_smtp.write_raw_data(l_maicon,
    UTL_RAW.cast_to_raw(
    'Content-Type: text/plain; charset="windows-1256"' || utl_tcp.crlf||
    'Content-Transfer-Encoding: 8bit' || utl_tcp.crlf||
    'From: [email protected]' || utl_tcp.crlf||
    'To: ' || x.admin_email || ';' || x.manager_email || utl_tcp.crlf ||
    'Subject: Notification System: '||x.content_description || utl_tcp.crlf ||
                             x.notification_message));
    utl_smtp.close_data(l_maicon);
    Result: Subject is received with correct encoding but NO body.
    DATA()
    utl_smtp.data(l_maicon,
    'Content-Type: text/plain; charset="windows-1256"' || utl_tcp.crlf||
    'Content-Transfer-Encoding: 8bit' || utl_tcp.crlf||
    'From: [email protected]' || utl_tcp.crlf||
    'To: ' || x.admin_email || ';' || x.manager_email || utl_tcp.crlf ||
    'Subject: Notification System: '||x.content_description || utl_tcp.crlf ||
                                  x.notification_message );
    Result: Email is received but any Arabic characters appear as question marks.
    WRITE_DATA()
    utl_smtp.open_data(l_maicon);
    utl_smtp.data(l_maicon,
    'Content-Type: text/plain; charset="windows-1256"' || utl_tcp.crlf||
    'Content-Transfer-Encoding: 8bit' || utl_tcp.crlf||
    'From: [email protected]' || utl_tcp.crlf||
    'To: ' || x.admin_email || ';' || x.manager_email || utl_tcp.crlf ||
    'Subject: Notification System: '||x.content_description || utl_tcp.crlf ||
                                  x.notification_message);
    utl_smtp.close_data(l_maicon);
    Result: Email is received but any Arabic characters appear as question marks.
    Which function should I be using? If write_raw_data, then why is there no body, and if write_data, why is it not supporting 8-bit characters?
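The question-mark symptom described above is what a 7-bit conversion step produces in general, not something specific to UTL_SMTP. As an illustration only, here is a small Java sketch of the same effect: characters that the target charset cannot represent are replaced with `?` when encoding. The class name is invented, and the windows-1256 part is guarded because that charset is not guaranteed to be present in every Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {
    /** Pushes text through a 7-bit ASCII encode/decode, as an ASCII-only API would. */
    static String throughAscii(String s) {
        byte[] sevenBit = s.getBytes(StandardCharsets.US_ASCII);
        return new String(sevenBit, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String arabic = "\u0647\u0630\u0647"; // three Arabic letters
        // Every unmappable letter is replaced by '?'.
        System.out.println(throughAscii(arabic));
        // windows-1256 is a single-byte code page that does contain these letters.
        if (Charset.isSupported("windows-1256")) {
            byte[] eightBit = arabic.getBytes(Charset.forName("windows-1256"));
            System.out.println(eightBit.length + " bytes in windows-1256");
        }
    }
}
```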

  • Autonomy Character Encoding

    I'm sorry if this is not the right place to ask, but it seems to be the least inappropriate place XD
    I'm working with the search engine that is shipped with WebLogic Portal (Autonomy Idol), and I'm having some problems with character encoding. The server where the Autonomy installation is running has no access to the Internet, so I cannot run the spider there to crawl any website. Instead I ran a spider on my own computer, and it worked fine. Then I exported the indexed data into an idx file, uploaded it to the production server and indexed it there. The problem is that some characters aren't shown correctly in my portlet. Executing an action command such as /action=query&text=something_here, I can see that the generated xml has these wrong characters, so this is definitely not a problem with my portlet. Executing the same action in my local Autonomy installation, everything goes well, so the questions are...
    Why is Autonomy not indexing the special characters correctly in my production environment? and
    Why does it happen only with some of them? For example, the word Ingeniería is indexed correctly in some entries and not in others, and both entries come from the same idx file.
    Thanks in advance
    P.S

    Sorry, in the example I gave, the non-English letters look like this:
    &# 1490;&# 1497;&# 1488; (I deliberately put a space between the # and the number to prevent it from being presented as a letter...).

  • Byte[] character encoding for strings

    Hi All,
    I tried to convert a string into byte[] using the following code:
    byte[] out= [B@30c221;
    String encodedString = out.toString();
    it gives the output [B@30c221 when i print encodedstring.
    but when i convert that encodedstring into byte[] using the following code
    byte[] output = encodedString.getBytes();
    it gives different output.
    is there any character encode needed to give the exact output for this?

    Sorry, but the question makes no sense, and neither does your code.
        byte[] out= [B@30c221;
        String encodedString = out.toString();
    The first line is syntactically incorrect, and the second should print something like "[B@30c221", which isn't particularly useful. The correct way to convert a String to a byte[] is with the getBytes() method. Be aware that the byte[] will be in the system default encoding, which means you could get different results on different platforms. To remove this platform dependency, you should specify the encoding, like so:
        byte[] output = encodedString.getBytes("UTF-8");
    Why are you doing this, anyway? There are very few good reasons to convert a String to a byte[] within your code; that's usually done by the I/O classes when your program communicates with the outside world, as when you write the string to a file.
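To make the reply concrete, here is a small sketch of a String to byte[] round trip with an explicit charset, next to what `toString()` on an array actually returns. The class and helper names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteRoundTrip {
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] out = encode("caf\u00e9"); // "cafe" with an e-acute
        // toString() on an array gives a type tag plus an identity hash, not the content.
        System.out.println(out.toString().startsWith("[B@"));
        // Arrays.toString shows the actual byte values: [99, 97, 102, -61, -87]
        System.out.println(Arrays.toString(out));
        // Decoding with the same charset restores the original text.
        System.out.println(decode(out).equals("caf\u00e9"));
    }
}
```

Calling `getBytes()` on the `[B@...` text only encodes those few ASCII characters, which is why the result in the question looked different from the original array.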

  • What character encoding standard does tuxedo use

    Hi,
    I am trying to resolve a problem with communication between Tuxedo 6.4 and Vitria.
    It seems that there is a problem with the translation of special characters. Does
    anyone know what encoding standard that Tuxedo uses?
    Thanks.

    Thanks Scott, actually I was asked the following question by Vitria Technical Support,
    can you help?
    "XDR (External Data Representation) is a protocol used by BEA Tuxedo's
    communication engine. XDR handles data format transformations when passing
    messages across dissimilar processor architectures.
    This is not the equivalent of Character Encoding. I specifically need the
    Character Encoding used. I am not sure where your admin needs to check for
    this - it might even be set at the OS level. I suspect that it will be
    something like ISO-8859-1 or some derivative."
    Thanks.
    Scott Orshan <[email protected]> wrote:
    Within a machine, TUXEDO just sends the bytes that you give it. When it goes between machines, it uses XDR to encode the data values for transmission. There is no character set translation going on, unless you are going to an EBCDIC machine. (If you are using data encryption [tpseal] in TUXEDO 7.1 or 8.0 your data may be encoded even if it stays on the same machine type.)
         Scott Orshan
         BEA Systems
    Richard Astill wrote:
    Hi,
    I am trying to resolve a problem with communication between Tuxedo 6.4 and Vitria.
    It seems that there is a problem with the translation of special characters. Does
    anyone know what encoding standard that Tuxedo uses?
    Thanks.

  • Locale and character encoding. What to do about these dreadful ÅÄÖ??

    It's time for me to get it into my head how this works. Please, help me understand before I go nuts.
    I'm from Sweden and we use a few of these weird characters like ÅÄÖ.
    If I create a file called "övrigt.txt" in windows, then the file will turn up as "?vrigt.txt" on my Linux pc (At least in the console, sometimes it looks ok in other apps in X). The same is true if I create the file in Linux and copy it to Windows, it will look just as weird on the other side.
    As I (probably) can't change the way Windows works, my question is what I have to do to have these two systems play nicely with each other?
    This is the output from locale:
    LANG=en_US.utf8
    LC_CTYPE="en_US.utf8"
    LC_NUMERIC="en_US.utf8"
    LC_TIME="en_US.utf8"
    LC_COLLATE=C
    LC_MONETARY="en_US.utf8"
    LC_MESSAGES="en_US.utf8"
    LC_PAPER="en_US.utf8"
    LC_NAME="en_US.utf8"
    LC_ADDRESS="en_US.utf8"
    LC_TELEPHONE="en_US.utf8"
    LC_MEASUREMENT="en_US.utf8"
    LC_IDENTIFICATION="en_US.utf8"
    LC_ALL=
    Is there anything here I should change? I have tried using ISO-8859-1 with no luck. Mind you that I want to have the system-wide language set to English. The only thing I want to achieve is that "Ö" on Windows should turn up as "Ö" in Linux as well, and vice versa.
    Please save my hair from being torn off, I'm going bald here...

    Hey, thanks for all the answers!
    I share my files in a number of ways, but mainly through a web application called Ajaxplorer (very nice btw...). The thing is that as soon as a Windows user uploads anything with special characters in the file name, my programs (xbmc, console etc.) refuse to read them correctly. Other ways of sharing are file copying with usb sticks, ssh etc. It's really not the way of sharing that is the problem I think, but rather the special characters being used sometimes.
    I could probably convert the filenames with suggested applications but then I'll set the windows users in trouble when they want to download them again, won't I?
    I realize that it's cp1252 that is the bad guy in this drama. Is there no way to set/use cp1252 as a character encoding in Linux? It's probably a bad idea as utf8 seems like the future way to go, but the fact that these two OS's can't communicate too well in this area is pretty useless if you ask me.
    To wrap this up I'll answer some questions...
    @EVRAMP: I'm actually using pcmanfm, but that is only for me and I'm not dealing very often with vfat partitions to be honest.
    @pkervien: Well, I think I mentioned my forms of sharing above. (kul med lite arch-svenskar!)
    @quarkup: locale.gen is edited and both sv.SE and en_US have utf-8 and ISO-8859 enabled and generated.
    ...and to clarify things even further. It doesn't matter if I get or provide a file via a usb stick, samba, ftp or by paper. All I want is for "Ö" to always be "Ö", everywhere.
    I can't believe how hard this is to get around. Linus is Finnish, for crying out loud. I thought he'd sorted this out the first thing he did. Maybe he doesn't deal with Windows or their users at all
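What goes wrong with the file names can be seen at the byte level. The sketch below is only an illustration in Java, with ISO-8859-1 standing in for cp1252 (the two agree on ö); it shows that the same letter is stored as different bytes under the two encodings, so a name written under one and read under the other cannot come out right. The class name and helper are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

final class FilenameBytes {
    /** Formats bytes as space-separated upper-case hex, e.g. "C3 B6". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String name = "\u00f6"; // the letter ö
        System.out.println(hex(name.getBytes(StandardCharsets.ISO_8859_1))); // F6
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_8)));      // C3 B6
        // A single-byte name read as UTF-8 is not valid, so the letter is lost.
        String misread = new String(name.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(misread.equals(name)); // false
    }
}
```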

  • Character encoding: Ansi, ascii, and mac, oh my!

    I'm writing a program which has to search & replace data in user-supplied Rich Text documents (.rtf). Ideally, I would like to read the whole thing into a StringBuffer, so that I can use all of the functionality built into String and StringBuffer, and so that I can easily compare with constant Strings and chars.
    The trouble that I have is with character encoding. According to the rtf spec, RTFs can be encoded in four different character encodings: "ansi", "mac", IBM PC code page 437, and IBM PC code page 850, none of which are supported by Java (see http://impulzus.sch.bme.hu/tom/szamitastechnika/file/rtfspec/rtfspec_6.htm#rtfspec_8 for the RTF spec and http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html#charenc for the character encodings supported by Java).
    I believe, from a bit of googling, that they are all 8 bits/character, so I could read everything into a byte array and manipulate that directly. However, that would be rather nasty. I would have to be careful with the changes that I make to the document, so that I do not insert values that do not encode correctly in the document's character encoding. Overall, a large hassle.
    So my question is - has anyone done something like this before? Any libraries that will make my job easier? Or am I missing something built into Java that will allow me to easily decode and reencode these documents?

    DrClap, thanks for the response.
    If I could map from the encodings listed above (which are given in the rtf doucment) to a java encoding name from the page that you listed, that would solve all my problems. However, there are a couple of problems:
    a) According to this page - http://orwell.ru/info/diffs.htm - ANSI is a superset of ISO-8859-1. That page isn't exactly authoritative, but I can't afford to lose data.
    b) I'm not sure what to do about the other character encodings. "mac" may correspond to "MacRoman" but that page lists a dozen or so other macintosh encodings. Gotta love crystal-clear MS documentation.
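One way to approach the mapping discussed here is a small lookup table from the RTF character set keywords (`\ansi`, `\mac`, `\pc`, `\pca`) to Java charset names. The table below is a guess for illustration, not an authoritative mapping: "ansi" is assumed to mean windows-1252 (a document may declare a different ANSI code page), and "mac" is assumed to mean MacRoman, which is exactly the uncertainty raised in point b). Whether each charset is installed depends on the Java runtime, so the sketch checks with `Charset.isSupported`.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class RtfCharsets {
    // Assumed mapping from RTF keyword to Java charset name (illustrative only).
    static final Map<String, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put("ansi", "windows-1252");
        NAMES.put("mac", "x-MacRoman");
        NAMES.put("pc", "IBM437");
        NAMES.put("pca", "IBM850");
    }

    /** Returns the assumed Java charset name, or null for an unknown keyword. */
    static String javaNameFor(String rtfKeyword) {
        return NAMES.get(rtfKeyword);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : NAMES.entrySet()) {
            System.out.println("\\" + e.getKey() + " -> " + e.getValue()
                    + " (supported here: " + Charset.isSupported(e.getValue()) + ")");
        }
    }
}
```

With a supported charset in hand, the document bytes could be decoded into a String, edited, and encoded back with the same charset.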

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items
    1. Unicode does not solve this issue for us (yet).
    2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. That's because the first 128 byte values (0 – 127) map to the same set of characters in the vast majority of encoding schemes, and because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 128 – then the trouble starts.
    The computer industry started with diskspace and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 128 values were identical on all and the second 128 were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8 because, as the standard and given the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 128 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 byte values are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 byte values to start a double byte sequence giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte and those three bytes define the character. The original design went up to 6 byte sequences, although the current definition (RFC 3629) stops at 4. Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
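The variable-length scheme described above is easy to observe. This short Java sketch (the class name is invented for the example) encodes four characters from different ranges and reports how many bytes each one takes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    /** Number of bytes the given text occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // ASCII letter: 1 byte
        System.out.println(utf8Length("\u00df"));       // sharp s (U+00DF): 2 bytes
        System.out.println(utf8Length("\u20ac"));       // euro sign (U+20AC): 3 bytes
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, outside the BMP: 4 bytes
    }
}
```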
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then, in their text editor using the codepage for their region, insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first character of a 2 byte sequence. You either get a different character or if the second byte is not a legal value for that first byte – an error.
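The failure mode in the previous paragraph can be reproduced with a strict decoder. In this sketch (names invented for the example) the same small XML fragment is saved once as ISO-8859-1, standing in for a regional codepage, and once as UTF-8, and both byte sequences are then checked as UTF-8 with errors reported instead of silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecode {
    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String xml = "<a>\u00df</a>"; // element containing a sharp s
        byte[] savedAsLatin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        byte[] savedAsUtf8 = xml.getBytes(StandardCharsets.UTF_8);
        // 0xDF starts a 2-byte sequence, but '<' is not a legal second byte.
        System.out.println(isValidUtf8(savedAsLatin1)); // false
        System.out.println(isValidUtf8(savedAsUtf8));   // true
    }
}
```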
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the declared encoding. If you must create it with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
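    In Java, Point 3 comes down to always passing a Charset instead of relying on the overloads that pick one for you. A minimal sketch follows; ExplicitCharsetIo, writeText and readText are hypothetical helper names, and the calls used are the standard java.nio.file.Files methods.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetIo {
    /** Writes text to the file using the encoding the caller names. */
    static void writeText(Path file, String text, Charset charset) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, charset)) {
            out.write(text);
        }
    }

    /** Reads the whole file, decoding its bytes with the encoding the caller names. */
    static String readText(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("encoding-demo", ".txt");
        try {
            writeText(file, "caf\u00E9", StandardCharsets.UTF_8);
            String back = readText(file, StandardCharsets.UTF_8);
            System.out.println("round trip ok: " + back.equals("caf\u00E9"));

            // Using the platform default is fine too, as long as it is a stated choice.
            writeText(file, "plain text", Charset.defaultCharset());
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```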
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (Some encoders also add the byte order mark, the endian preamble, to the file.)
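    As an illustration of Point 4, here is a sketch that lets the JDK's standard XML serializer (javax.xml.transform) write the document, so the encoding declaration and the actual bytes cannot disagree. XmlWriteDemo and writeXml are hypothetical names, and the element name is arbitrary. Whether a byte order mark is written depends on the library; this sketch does not rely on one.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class XmlWriteDemo {
    /** Serializes a one-element document in the requested encoding. */
    static byte[] writeXml(String text, String encoding) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("greeting");
        root.setTextContent(text);
        doc.appendChild(root);

        // The serializer both encodes the bytes and writes the matching
        // encoding="..." attribute into the XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = writeXml("caf\u00E9", "ISO-8859-1");
        // Decode with the same encoding only to display the result.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```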
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
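    One caveat on the 16-bit char, shown as a small Java sketch (CodeUnitsDemo and its methods are hypothetical names): a char is a UTF-16 code unit, so characters outside the first 65,536 code points take two chars. Counting characters with length() is therefore only safe for text that stays inside that range.

```java
class CodeUnitsDemo {
    /** Number of 16-bit char values (UTF-16 code units) in the string. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of unicode characters (code points) in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";       // U+00E9 fits in one char
        String emoji = "\uD83D\uDE00";  // U+1F600 needs two chars (a surrogate pair)

        System.out.println(codeUnits(eAcute) + " / " + codePoints(eAcute)); // 1 / 1
        System.out.println(codeUnits(emoji) + " / " + codePoints(emoji));   // 2 / 1
    }
}
```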
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items
    1. Unicode does not solve this issue for us (yet).
    2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.
    They might only use that range but that is a different issue, especially since that range is exactly the same as the UTF8 character set anyways.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8. The XML specification does make UTF-8 (or UTF-16 with a byte order mark) the default when no encoding is declared, but HTML has no equally reliable rule and the assumption is not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 0 – 127.
    Now let's look at UTF-8 because, as the standard and given the way it works, it gets people into a lot of trouble. UTF-8 became popular for two reasons. First, it matches the standard codepages for the first 128 characters (0 – 127), so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 byte values are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 byte values as the first byte of a double-byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. The original design went up to 6-byte sequences; the current definition (RFC 3629) stops at 4 bytes, which is enough to cover every unicode code point. Using this MBCS (multi-byte character set) you can write the equivalent of every unicode character. And assuming what you are writing is not a list of seldom-used Chinese characters, you do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of unicode, all unicode, are based on ASCII. The representational format of UTF8 is required to implement unicode, thus it must represent those characters. It uses the idiom supported by variable width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then insert a character like ß, which the text editor saves using the codepage for their region. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding (say UTF-8) and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with HTML/XML.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the declared encoding. If you must create it with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (Some encoders also add the byte order mark, the endian preamble, to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in java with escaped unicode characters which will fail to compile.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business, and create solutions that are appropriate to that. Thus there is absolutely no point in someone who is creating an inventory system for a stand-alone store crafting a solution that supports multiple languages.
    Another example: in high-volume systems, moving and storing bytes is relevant. As such, one must carefully consider each text element as to whether it is customer consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems, incremental savings impact operating costs and give a marketing advantage through speed.

  • XML parser not detecting character encoding

    Hi,
    I am using JDeveloper 9.0.5 preview, and the same problem is happening in our production AS 9.0.2 release.
    The character encoding of an XML document is not being detected correctly by the Oracle v2 parser, even though the XML declaration correctly contains
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    Instead it treats the document as UTF-8, which is fine until a document comes along with an extended character, which then causes a
    java.io.UTFDataFormatException: Invalid UTF8 encoding.
    at oracle.xml.parser.v2.XMLUTF8Reader.checkUTF8Byte(XMLUTF8Reader.java:160)
    at oracle.xml.parser.v2.XMLUTF8Reader.readUTF8Char(XMLUTF8Reader.java:187)
    at oracle.xml.parser.v2.XMLUTF8Reader.fillBuffer(XMLUTF8Reader.java:120)
    at oracle.xml.parser.v2.XMLByteReader.saveBuffer(XMLByteReader.java:448)
    at oracle.xml.parser.v2.XMLReader.fillBuffer(XMLReader.java:2023)
    at oracle.xml.parser.v2.XMLReader.tryRead(XMLReader.java:972)
    at oracle.xml.parser.v2.XMLReader.scanXMLDecl(XMLReader.java:2589)
    at oracle.xml.parser.v2.XMLReader.pushXMLReader(XMLReader.java:485)
    at oracle.xml.parser.v2.XMLReader.pushXMLReader(XMLReader.java:192)
    at oracle.xml.parser.v2.XMLParser.parse(XMLParser.java:144)
    As you can see, it is explicitly using the XMLUTF8Reader to perform the read.
    I can get around this by hard-coding the XML input stream to be processed by a reader:
    XMLSource = new StreamSource(new InputStreamReader(XMLInStream,"ISO-8859-1"));
    However, the manual documents that the character encoding is automatically picked up from the XML file and that wrapping the stream in a reader is not necessary, so I should be able to write
    XMLSource = new StreamSource(XMLInStream);
    Does anyone else experience this same problem?
    Having to hard-code the encoding causes my software to lose flexibility.
    Jarrod Sharp.

    An XML document should be created with 'ISO-8859-1' encoding to be parsed as 'ISO-8859-1' encoding.
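    One way to check which side is at fault is to feed known bytes to a parser. The sketch below is only an illustration and makes a loud assumption: it uses the JDK's built-in JAXP parser, not the Oracle v2 parser from the stack trace, so it shows what a conforming parser is expected to do rather than how oracle.xml.parser.v2 behaves. The names XmlDeclaredEncodingDemo, rootText and parsesCleanly are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlDeclaredEncodingDemo {
    /** Parses raw bytes (no Reader, no charset hint) and returns the root element's text. */
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    /** True when the bytes parse without an error. */
    static boolean parsesCleanly(byte[] xmlBytes) {
        try {
            rootText(xmlBytes);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String body = "<name>Jos\u00E9</name>";

        // Declaration and bytes agree: the parser picks ISO-8859-1 up from the declaration.
        byte[] matching = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("matching parses:   " + rootText(matching).equals("Jos\u00E9"));

        // Declaration says UTF-8 but the file was saved as ISO-8859-1: invalid UTF-8 bytes.
        byte[] mismatched = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("mismatched parses: " + parsesCleanly(mismatched));
    }
}
```

    If a file with a correct ISO-8859-1 declaration and genuinely ISO-8859-1 bytes still fails in the parser you are using, the parser (or something rewriting the stream before it) is the place to look; if the bytes do not match the declaration, the file is.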
