FF character encoding issue in Mageia 2?

Hi everyone,
I'm running Mozilla Firefox 17.0.8 on Mageia 2, a KDE-based Linux distro. I'm having character encoding problems with certain web pages: icons like the ones next to menu entries (Login, Search box, etc.) and in section headlines don't appear properly. Instead they show up either as some Arabic character or as little grey boxes with numbers and letters written in them.
I've tried experimenting with different encodings: Western (ISO 8859-1), (ISO 8859-15), (Windows 1252), Unicode (UTF-8), and Central European (ISO 8859-2), but none of them does the job. Currently the character encoding is set to UTF-8. The same web page in Chrome (UTF-8) shows no such problem.
Can you help me, please?

Thank you!
I solved my problem; however, I find the fonts are too small on certain web pages compared to Chrome (see the attached pictures of nytimes.com).
Chrome's font size is set to "Medium".

Similar Messages

  • Character encoding issue in SQL Server

    Hi Team,
    We have a table with more than 20 columns. Some of those columns hold data extracted from the data warehouse appliances as a one-time load.
    The problem is that two of the columns may hold the same set of values for some records and different values for others; the values below are an example of the same record, where the value was changed while importing into the SQL Server database:
    2pk Etiquetas Navide‰as 3000-HG                                
     2pk Etiquetas Navideñas 3000-H                           
    Is there any way to change the first value into the second value?
    By looking at the data we can say it's a code page issue (a character encoding issue), but how do we convert it?
    Converting (2pk Etiquetas Navide‰as 3000-HG)
    to get 2pk Etiquetas Navideñas 3000-H in the select query?

    Then it seems that you can do the obvious: replace it.
    DECLARE @Sample TABLE ( Payload NVARCHAR(255) );
    INSERT INTO @Sample
    VALUES ( N'2pk Etiquetas Navide‰as 3000-HG' );
    UPDATE @Sample
    SET Payload = REPLACE(Payload, N'‰', N'ñ');
    SELECT S.Payload
    FROM @Sample S;

  • Character encoding issue

    I'm using the code below to send mail in the Turkish language.
    MimeMessage msg = new MimeMessage(session);
    msg.setText(message, "utf-8", "html");
    msg.setFrom(new InternetAddress(from));
    Transport.send(msg);
    But my customer says that he sometimes gets unreadable characters in the mail. I'm not able to understand how to solve this character encoding issue.
    Should I ask him to change his mail client's character encoding settings?
    If yes, which one should he set?

    Send the same characters using a different mailer (e.g., Thunderbird or Outlook).
    If they're received correctly, compare the message from that mailer with the message
    from JavaMail. Most likely the other mailers are using a Turkish-specific charset instead
    of UTF-8.
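    For comparison, here is a minimal, self-contained sketch (not the poster's code) that states UTF-8 explicitly for both the subject and the HTML body; the SMTP host and the addresses are placeholders:
    import java.util.Properties;
    import javax.mail.Message;
    import javax.mail.Session;
    import javax.mail.Transport;
    import javax.mail.internet.InternetAddress;
    import javax.mail.internet.MimeMessage;

    public class Utf8MailSketch {
         public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              props.put("mail.smtp.host", "smtp.example.com"); // placeholder host
              Session session = Session.getInstance(props);

              MimeMessage msg = new MimeMessage(session);
              msg.setFrom(new InternetAddress("sender@example.com"));          // placeholder
              msg.setRecipients(Message.RecipientType.TO, "user@example.com"); // placeholder
              // declare the charset explicitly for both the subject and the HTML body
              msg.setSubject("Türkçe konu", "utf-8");
              msg.setText("<p>Türkçe içerik: ğ ş ı ö ü ç</p>", "utf-8", "html");
              Transport.send(msg);
         }
    }
    If the mail still arrives garbled even with the charset declared like this, the receiving client's settings or an intermediate server rewriting the message would be the next things to check.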

  • JSF myfaces character encoding issues

    The basic problem I have is that I cannot get the copyright symbol or the chevron symbols to display in my pages.
    I am using:
    myfaces 2.0.0
    facelets 1.1.14
    richfaces 3.3.3.final
    tomcat 6
    jdk1.6
    I have tried a ton of things to resolve this including:
    1.) creating a filter to set the character encoding to utf-8 (a rough sketch follows this list).
    2.) overriding the view handler to force calculateCharacterEncoding to always return utf-8
    3.) adding <meta http-equiv="content-type" content="text/html;charset=UTF-8" charset="UTF-8" /> to my page.
    4.) setting different combinations of 'URIEncoding="UTF-8"' and 'useBodyEncodingForURI="true"' in tomcat's server.xml
    5.) etc... like trying to set the encoding on an f:view, using f:verbatim, specifying the escape attribute on some output components.
    all with no success.
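    For reference, a rough sketch of the kind of filter meant in item 1 (the class name is illustrative, and it would be mapped in web.xml ahead of the Faces servlet):
    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class Utf8EncodingFilter implements Filter {
         public void init(FilterConfig config) {}
         public void destroy() {}
         public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                   throws IOException, ServletException {
              // force UTF-8 before any parameter is read or any output is written
              req.setCharacterEncoding("UTF-8");
              res.setCharacterEncoding("UTF-8");
              chain.doFilter(req, res);
         }
    }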
    There is a lot of great information on BalusC's site regarding this problem (http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html) but I have not been able to resolve it yet.
    I have 2 test pages I am using.
    If I put these symbols in a JSP (which does NOT go through the Faces servlet) it renders fine and the page info shows that it is in UTF-8.
    <html>
    <head>
         <!-- <meta http-equiv="content-type" content="text/html;charset=UTF-8" /> -->
    </head>
    <body>     
              <br/>copy tag: &copy;
              <br/>js/jsp unicode: &#169;
              <br/>xml unicode: &#xA9;
              <br/>u2460: \u2460
              <br/>u0080: \u0080
              <br/>arrow: &#187;
              <p />
    </body>
    </html>
    If I put these symbols in an XHTML page (which does go through the Faces servlet) I get the black diamond symbols with a ?, even though the page info says that it is in UTF-8.
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"
         xmlns:ui="http://java.sun.com/jsf/facelets"
         xmlns:f="http://java.sun.com/jsf/core"
         xmlns:h="http://java.sun.com/jsf/html"
         xmlns:rich="http://richfaces.org/rich"
         xmlns:c="http://java.sun.com/jstl/core"
           xmlns:a4j="http://richfaces.org/a4j">
    <head>
         <meta http-equiv="content-type" content="text/html;charset=UTF-8" charset="UTF-8" />
         <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
    </head>
    <body>     
         <f:view encoding="utf-8">
              <br/>amp/copy tag: &copy;
              <br/>copy tag: &copy;
              <br/>copy tag w/ pound: #&copy;
              <br/>houtupt: <h:outputText value="&copy;" escape="true"/>
              <br/>houtupt: <h:outputText value="&copy;" escape="false"/>
              <br/>js/jsp unicode: &#169;
              <br/>houtupt: <h:outputText value="&#169;" escape="true"/>
              <br/>houtupt: <h:outputText value="&#169;" escape="false"/>
              <br/>xml unicode: &#xA9;
              <br/>houtupt: <h:outputText value="&#xA9;" escape="true"/>
              <br/>houtupt: <h:outputText value="&#xA9;" escape="false"/>
              <br/>u2460: \u2460
              <br/>u0080: \u0080
              <br/>arrow: &#187;
              <br/>cdata: <![CDATA[©]]>
              <p />
         </f:view>               
    </body>
    </html>
    On a side note, I have another application that is using MyFaces 1.1, Facelets 1.1.11, and RichFaces 3.1.6, and the Unicode symbols work fine there.
    I had another developer try my test XHTML page in his Mojarra implementation and it works fine there using Facelets 1.1.14, but NOT with MyFaces or RichFaces.
    I am convinced that somewhere between the view handler and the Faces servlet the encoding is being set or reset, but I haven't been able to resolve it.
    If anyone at all can point me in the right direction I would be eternally grateful.
    Thanks in advance.

    UPDATE:
    I was unable to get the page itself to consume the various options for unicode characters like the copyright symbol.
    Ultimately the content I am trying to display is coming from a web service.
    I resolved this issue by calling the web service from my backing bean instead of using ui:include on the webservice call directly in the page.
    for example:
    public String getFooter() throws Exception {
         // Apache Commons HttpClient 3.x
         HttpClient httpclient = new HttpClient();
         GetMethod get = new GetMethod(url);
         httpclient.executeMethod(get);
         // getResponseBodyAsString() decodes the body using the charset from the
         // Content-Type response header (ISO-8859-1 if none is given)
         String response = get.getResponseBodyAsString();
         return response;
    }
    I'd still love to have a solution for using the unicode characters in the page itself, but for the time being this solves my problem.

  • Special Character Encoding issue

    Hi all
    I am using OAS 9i and I've deployed a web service. I submit a payload request that contains some Unicode characters like "§". The data is base64Binary encoded; the type of the element in the schema is base64Binary. When I retrieve the payload in the Java implementation code, the character is displayed as � in the console. Please advise how to fix this issue. I tried setting the JVM option file.encoding=utf-8, but it didn't work out.
    Thanks
    Shiny

    When you use a UDF and you have programmed a SAX parser, make sure that the parser works with the correct encoding. So when the result of the web service is ISO-8859-1, then assign that to the parser.
    In principle the encoding should be part of the XML header. Make sure that the encoding of the response is the same as the one declared in the XML header.
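    More generally, a small illustration (using java.util.Base64 from newer JDKs rather than the original OAS code) of why the charset used to rebuild the String has to match the one used when the Base64 bytes were produced:
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class Base64CharsetDemo {
         public static void main(String[] args) {
              String original = "§ sample payload";
              // sender side: Base64 wraps bytes, so a charset is chosen here (UTF-8 assumed)
              String b64 = Base64.getEncoder()
                   .encodeToString(original.getBytes(StandardCharsets.UTF_8));
              byte[] raw = Base64.getDecoder().decode(b64);
              // receiver side: the String must be rebuilt with the SAME charset
              System.out.println(new String(raw, StandardCharsets.UTF_8));      // § survives
              System.out.println(new String(raw, StandardCharsets.ISO_8859_1)); // § is mangled
         }
    }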

  • Multi-byte character encoding issue in HTTP adapter

    Hi Guys,
    I am facing a problem with multi-byte character conversion.
    Problem:
    I am posting data from SAP CRM to a third-party system using XI as middleware, with the HTTP adapter for the communication between XI and the third-party system.
    I have set the XML encoding to UTF-8 in the XI payload manipulation block.
    When I try to post Chinese characters from SAP CRM, junk characters arrive at the third-party system; my assumption is that it is double encoding.
    Can you please guide me on how to proceed further?
    Please let me know if you need more info.
    Regards,
    Srini

    Srinivas,
    Can you go through this thread:
    UTF-8 encoding problem in HTTP adapter
    ---Satish

  • Arabic character encoding issue

    Hi,
    Our JMS plugin receives XML text messages encoded in ISO-8859-1. The messages are originally in ISO-8859-6 and are converted to ISO-8859-1 before being put in the queue. I convert each message back to ISO-8859-6 on receipt as shown below.
    For some unknown reason, some of the Arabic characters show as ? marks but some get properly converted.
    ���� � ������ original
    ��?�� ������ after conversion
    Our plugin runs on Solaris 10. The JDK used is 1.5. The Solaris locale is set to en_US.ISO8859-15.
    Code
    String in = ((TextMessage)message).getText ();
    String msgISO6 = new String (in.getBytes("ISO-8859-1"), "ISO-8859-6");
    Does anyone have any thoughts on the possible cause of the issue ?
    Thanks in advance
    Sohan

    String in = ((TextMessage)message).getText ();
    String msgISO6 = new String (in.getBytes("ISO-8859-1"), "ISO-8859-6");
    That's wrong. Read the javadoc for the String class. in.getBytes("ISO-8859-1") gives a byte array encoded as ISO-8859-1, and you are then trying to decode that array as ISO-8859-6. The second argument is not the encoding of the String that you are creating; a String is always encoded as Unicode/UTF-16 internally.
    That means you don't need to convert the encoding at all if you are getting Strings, but you do need to specify an encoding when you display them, e.g. in a web page.
    Kaj
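    A small illustration of where the ? marks can come from: characters that ISO-8859-1 cannot represent are replaced with '?' the moment getBytes is called with that charset, so no later re-decoding can bring them back (a sketch, not the original plugin code):
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class LossyGetBytesDemo {
         public static void main(String[] args) {
              String arabic = "\u0644\u063A\u0629"; // three Arabic letters
              // ISO-8859-1 has no Arabic range, so each letter is replaced with '?'
              byte[] latin1 = arabic.getBytes(StandardCharsets.ISO_8859_1);
              System.out.println(Arrays.toString(latin1));                       // [63, 63, 63]
              // re-decoding cannot restore what was already thrown away
              System.out.println(new String(latin1, Charset.forName("ISO-8859-6"))); // ???
         }
    }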

  • Integration Gateway - JDBC - Character Encoding Issue

    Hello,
    I'm using SMP 3.0 SP06 and I'm getting data from MS SQL using the JDBC interface, and I can get all the data successfully.
    The problem is:
    there is a column in the database containing Arabic data, a right-to-left language,
    and when executing the OData service, for example, if the Arabic data is "هذه للتجربة" I get back "هذه للتجربة".
    I think this is the same data but in a different encoding/decoding.
    Do you have any idea ?
    Thanks
    Hossam

    By the way, I have checked it again and it works fine when requesting the data in XML format (the default).
    The problem occurs only when requesting the service with the format parameter "?$format=json",
    and it even works fine when calling it from the Advanced REST client.
    So I think it is just a problem in the browser while displaying the data, especially Chrome: Chrome displays JSON files as plain text without any formatting or decoding, whereas it works fine with IE, which saves the file to the PC; if I open that file in Notepad++ I find the data correctly decoded.
    It seems it is not an SMP nor an Integration Gateway issue; sorry for the confusion.
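    As an aside, that particular kind of garbage is what UTF-8 bytes look like when they are decoded with a Windows/Latin code page somewhere along the way; a small sketch of that assumption:
    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
         public static void main(String[] args) throws Exception {
              String arabic = "\u0647\u0630\u0647"; // "هذه"
              byte[] utf8 = arabic.getBytes(StandardCharsets.UTF_8);
              // decoding UTF-8 bytes with windows-1252 produces the kind of
              // "Ù‡..." garbage shown in the question
              System.out.println(new String(utf8, "windows-1252"));
         }
    }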

  • Shockwave to Javascript - character encoding issue !

    Hi!
    I have given up on sending messages from JavaScript to a Shockwave movie, as I have found all the existing methods unreliable (in the worst scenario a Flash blocker is installed and the localConnection trick with a Flash gateway fails).
    But as a consequence, I have to send a message (a search string) from Shockwave to JavaScript.
    That seems easy with the following Lingo:
    goToNetPage("javascript:void myJSfunction('" & aString
    But the problem is with encoding possible non-ASCII characters.
    I presume the browser page is using charset=UTF-8.
    Any idea how to properly encode 'aString' so it will preserve non-ASCII characters while being transferred to JavaScript?
    It is really urgent!
    Rgs,
    Ziggi

    > I have given up on sending messages from JavaScript to a Shockwave movie
    I replied a little late to your earlier thread, but take a look at
    <http://dasdeck.de/staff/valentin/lingo/dir_js/>

  • Russian character encoding issue

    Hi All,
    I wrote a Java program (using JDK 1.5) to capture wevtutil.exe output on a Russian Windows 2008 OS; I use the command "wevtutil qe Security /c:1 /rd:true /f:text".
    When I run the above command at the command prompt I get the proper output, but when I execute the same command from the Java program the output is garbled. I also tried running with the encoding Cp866, but still no use. Which encoding do I have to use for wevtutil output in Russian? Herewith I attach my Java code, its output, and the original command output.
    Please help me to resolve this.
    Thanks and Regards,
    Rama
    Java code:
    import java.util.*;
    import java.io.*;
    public class evtquery {
         public static void main(String args[]) {
              String cmd = "wevtutil qe Security /c:1 /rd:true /f:text";
              Runtime run1 = Runtime.getRuntime();
              try {
                   Process p = run1.exec(cmd);
                   BufferedReader rd[] = new BufferedReader[2];
                   rd[1] = new BufferedReader(new InputStreamReader(p.getErrorStream()));
                   // no charset is given here, so the JVM's default encoding is used to
                   // decode whatever code page the console program writes in
                   rd[0] = new BufferedReader(new InputStreamReader(p.getInputStream()));
                   String line = null;
                   while ((line = rd[0].readLine()) != null)
                        System.out.println(line);
              } catch (Exception ex) {
                   ex.printStackTrace();
              }
         }
    }
    output:
    C:\rama>java evtquery
    Event[0]:
    Log Name: Security
    Source: Microsoft-Windows-Security-Auditing
    Date: 2012-01-08T22:57:38.703
    Event ID: 4648
    Task: ┬їюф т ёшёЄхьє
    Level: ╤тхфхэш 
    Opcode: ╤тхфхэш 
    Keyword: └єфшЄ єёяхїр
    User: N/A
    User Name: N/A
    Computer: WIN-KBTGH8I5IBE
    Description:
    ┬√яюыэхэр яюя√Єър тїюфр т ёшёЄхьє ё  тэ√ь єърчрэшхь єўхЄэ√ї фрээ√ї.
    ╤єс·хъЄ:
    ╚─ схчюярёэюёЄш: S-1-5-21-3933732538-1582820160-195548471
    1-500
    ╚ь  єўхЄэющ чряшёш: Administrator
    ─юьхэ єўхЄэющ чряшёш: WIN-KBTGH8I5IBE
    ╩юф тїюфр: 0xbc15a
    GUID тїюфр: {00000000-0000-0000-0000-000000000000}
    ┴√ыш шёяюы№чютрэ√ єўхЄэ√х фрээ√х ёыхфє■∙хщ єўхЄэющ чряшёш:
    ╚ь  єўхЄэющ чряшёш: eguser
    ─юьхэ єўхЄэющ чряшёш: WIN-KBTGH8I5IBE
    GUID тїюфр: {00000000-0000-0000-0000-000000000000}
    ╓хыхтющ ёхЁтхЁ:
    ╚ь  Ўхыхтюую ёхЁтхЁр: RAMA-PC
    ─юяюыэшЄхы№э√х ётхфхэш : RAMA-PC
    ╤тхфхэш  ю яЁюЎхёёх:
    ╚фхэЄшЇшърЄюЁ яЁюЎхёёр: 0x4
    ╚ь  яЁюЎхёёр:
    ╤тхфхэш  ю ёхЄш:
    ╤хЄхтющ рфЁхё: -
    ╧юЁЄ: -
    ─рээюх ёюс√Єшх тючэшърхЄ, ъюуфр яЁюЎхёё я√ЄрхЄё  т√яюыэшЄ№ тїюф ё єўхЄэющ чряшё№
    ■,  тэю єърчрт хх єўхЄэ√х фрээ√х. ▌Єю юс√ўэю яЁюшёїюфшЄ яЁш шёяюы№чютрэшш ъюэЇш
    уєЁрЎшщ яръхЄэюую Єшяр, эряЁшьхЁ, эрчэрўхээ√ї чрфрў, шыш т√яюыэхэшш ъюьрэф√ RUNA
    S.
    C:\rama> java -Dfile.encoding=Cp866 evtquery
    Event[0]:
    Log Name: Security
    Source: Microsoft-Windows-Security-Auditing
    Date: 2012-01-08T22:57:38.703
    Event ID: 4648
    Task: ┬їюф т ёшёЄхьє
    Level: ╤тхфхэш 
    Opcode: ╤тхфхэш 
    Keyword: └єфшЄ єёяхїр
    User: N/A
    User Name: N/A
    Computer: WIN-KBTGH8I5IBE
    Description:
    ┬√яюыэхэр яюя√Єър тїюфр т ёшёЄхьє ё  тэ√ь єърчрэшхь єўхЄэ√ї фрээ√ї.
    ╤єс·хъЄ:
    ╚─ схчюярёэюёЄш: S-1-5-21-3933732538-1582820160-195548471
    1-500
    ╚ь  єўхЄэющ чряшёш: Administrator
    ─юьхэ єўхЄэющ чряшёш: WIN-KBTGH8I5IBE
    ╩юф тїюфр: 0xbc15a
    GUID тїюфр: {00000000-0000-0000-0000-000000000000}
    ┴√ыш шёяюы№чютрэ√ єўхЄэ√х фрээ√х ёыхфє■∙хщ єўхЄэющ чряшёш:
    ╚ь  єўхЄэющ чряшёш: eguser
    ─юьхэ єўхЄэющ чряшёш: WIN-KBTGH8I5IBE
    GUID тїюфр: {00000000-0000-0000-0000-000000000000}
    ╓хыхтющ ёхЁтхЁ:
    ╚ь  Ўхыхтюую ёхЁтхЁр: RAMA-PC
    ─юяюыэшЄхы№э√х ётхфхэш : RAMA-PC
    ╤тхфхэш  ю яЁюЎхёёх:
    ╚фхэЄшЇшърЄюЁ яЁюЎхёёр: 0x4
    ╚ь  яЁюЎхёёр:
    ╤тхфхэш  ю ёхЄш:
    ╤хЄхтющ рфЁхё: -
    ╧юЁЄ: -
    ─рээюх ёюс√Єшх тючэшърхЄ, ъюуфр яЁюЎхёё я√ЄрхЄё  т√яюыэшЄ№ тїюф ё єўхЄэющ чряшё№
    ■,  тэю єърчрт хх єўхЄэ√х фрээ√х. ▌Єю юс√ўэю яЁюшёїюфшЄ яЁш шёяюы№чютрэшш ъюэЇш
    уєЁрЎшщ яръхЄэюую Єшяр, эряЁшьхЁ, эрчэрўхээ√ї чрфрў, шыш т√яюыэхэшш ъюьрэф√ RUNA
    S.
    command output:
    C:\rama>wevtutil qe Security /c:1 /rd:true /f:text
    Event[0]:
    Log Name: Security
    Source: Microsoft-Windows-Security-Auditing
    Date: 2012-01-08T22:57:38.703
    Event ID: 4648
    Task: Вход в систему
    Level: Сведения
    Opcode: Сведения
    Keyword: Аудит успеха
    User: N/A
    User Name: N/A
    Computer: WIN-KBTGH8I5IBE
    Description:
    Выполнена попытка входа в систему с явным указанием учетных данных.
    Субъект:
    ИД безопасности: S-1-5-21-3933732538-1582820160-195548471
    1-500
    Имя учетной записи: Administrator
    Домен учетной записи: WIN-KBTGH8I5IBE
    Код входа: 0xbc15a
    GUID входа: {00000000-0000-0000-0000-000000000000}
    Были использованы учетные данные следующей учетной записи:
    Имя учетной записи: eguser
    Домен учетной записи: WIN-KBTGH8I5IBE
    GUID входа: {00000000-0000-0000-0000-000000000000}
    Целевой сервер:
    Имя целевого сервера: RAMA-PC
    Дополнительные сведения: RAMA-PC
    Сведения о процессе:
    Идентификатор процесса: 0x4
    Имя процесса:
    Сведения о сети:
    Сетевой адрес: -
    Порт: -
    Данное событие возникает, когда процесс пытается выполнить вход с учетной запись
    ю, явно указав ее учетные данные. Это обычно происходит при использовании конфи
    гураций пакетного типа, например, назначенных задач, или выполнении команды RUNA
    S.
    C:\rama>

    Hi,
    Thanks for your reply. I tried the code with the changes you have specified but it doesn't work.
    The output is
    C:\rama>java evtquery
    Event[0]:
    Log Name: Security
    Source: Microsoft-Windows-Security-Auditing
    Date: 2012-01-09T04:05:16.718
    Event ID: 4672
    Task: ??????????? ????
    Level: ???????а
    Opcode: ???????а
    Keyword: ????? ??????
    User: N/A
    User Name: N/A
    Computer: WIN-KBTGH8I5IBE
    Description:
    ???╖???:
    ?? ????????????: S-1-5-18
    ??а ??????? ??????: ???????
    ????? ??????? ??????: NT AUTHORITY
    ??? ?????: 0x3e7
    ??????????: SeAssignPrimaryTokenPrivilege
    SeTcbPrivilege
    SeSecurityPrivilege
    SeTakeOwnershipPrivilege
    SeLoadDriverPrivilege
    SeBackupPrivilege
    SeRestorePrivilege
    SeDebugPrivilege
    SeAuditPrivilege
    SeSystemEnvironmentPrivilege
    SeImpersonatePrivilege
    Thanks & Regards,
    Rama
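    For reference, a sketch of reading the process output with an explicit charset instead of relying on -Dfile.encoding; "Cp866" (the Russian OEM console code page) is an assumption and would need to be confirmed, e.g. with chcp, on the target machine:
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintStream;

    public class EvtQueryCp866 {
         public static void main(String[] args) throws Exception {
              String console = "Cp866"; // assumed console code page; verify on the target machine
              Process p = Runtime.getRuntime()
                   .exec("wevtutil qe Security /c:1 /rd:true /f:text");
              // decode the process output with the console code page explicitly
              BufferedReader rd = new BufferedReader(
                   new InputStreamReader(p.getInputStream(), console));
              // the stream we print to must also be able to carry Cyrillic text
              PrintStream out = new PrintStream(System.out, true, console);
              String line;
              while ((line = rd.readLine()) != null)
                   out.println(line);
         }
    }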

  • Reading Advance Queuing with XMLType payload and JDBC Driver character encoding

    Hi
    I've got a problem retrieving a message with an XMLType payload from the queue in Java.
    It was working fine in a 10g database, but after the switch to 11g it returns a corrupted string instead of the real XML message. The database NLS_LANG setting is AL32UTF8.
    It is said that the JDBC driver should deal with that automatically, but it obviously doesn't in this case. When I dequeue the message using database functionality (the DBMS_AQ package) it looks fine, but not when using the JDBC driver, so I think it is a character encoding issue or something similar. The message itself is enqueued by the database and is supposed to be retrieved by a dedicated EJB.
    Driver file used: ojdbc6.jar
    Additional libraries: aqapi.jar, xdb.jar
    All files were taken from the 11g database installation.
    What should I do to get the XML message correctly?

    Do you mean NLS_LANG is AL32UTF8 or the database character set is AL32UTF8? What is the database character set (SELECT value FROM nls_database_parameters WHERE parameter='NLS_CHARACTERSET')?
    Thanks,
    Sergiusz

  • iTunes support for foreign character encoding

    A friend burned two mix CDs for me to listen to on my move back to the US from Korea. The songs are Korean and have Korean title/album information. I thought I would import the songs into my iBook. When I add them to my library, however, a majority of them have unintelligible song information. Only about 25-30% of the songs import successfully with Korean text.
    Finder reads the CD with no problems. Looking through the disc shows all the information clearly. Drop the tracks into iTunes, however (or select "Add to Library"), and they get scrambled.
    I'm guessing this is a character encoding issue. I don't know where my friend got the tracks from, so I'll have to assume he got them from sources which copied them using different encoding methods. But if Finder can support them all, why can't iTunes? Is there a way I can adjust character support in iTunes? Or should I be looking at something else?
    1.2GHz iBook G4, 1.25GB RAM, Mac OS X (10.4.5), iTunes 6.0.3

    Try setting the encoding of your OutputStream to UTF-8.

  • Character Encoding and File Encoding issue

    Hi,
    I have a file whose data is encoded using the default locale.
    I start the JVM in the same default locale and try to read the file.
    I took two approaches:
    1. Read the file using InputStreamReader() without specifying the encoding, so that the default one based on the locale will be picked up.
    -- This approach worked fine.
    -- I also printed the system property "file.encoding", which matched the current locale's encoding (the Unix command to get this is "locale charmap").
    2. In this approach, I read the file through an InputStream as an array of raw bytes and passed it to the String constructor to convert the bytes to a String.
    -- The String contained garbled data, meaning the conversion failed.
    I tried printing the encoding used by the JVM (via an internal class) and the "file.encoding" property as well.
    These two values do not match; there is a weird difference.
    E.g. for the locale ja_JP.eucjp on a Linux box:
    byte-to-character conversion uses the EUC_JP_LINUX encoding
    the file.encoding system property is EUC-JP-LINUX
    To get the byte-to-character encoding, I used the following methods (sun.io.*):
    ByteToCharConverter btc = ByteToCharConverter.getDefault();
    System.out.println("BTC uses " + btc.getCharacterEncoding());
    Do you have any idea why it is failing?
    My understanding was that the file encoding and character encoding should always be the same by default.
    But because of this behaviour, I am a little perplexed.

    But there's no character encoding set for this operation: baos.write("���".getBytes());
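    For reference, a sketch of approach 2 with the charset made explicit (using java.nio from newer JDKs; the file name and the EUC-JP charset are placeholders for the locale's real encoding):
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ExplicitCharsetRead {
         public static void main(String[] args) throws Exception {
              // placeholders: the real path and the file's actual encoding
              byte[] raw = Files.readAllBytes(Paths.get("data.txt"));
              Charset cs = Charset.forName("EUC-JP");
              // tell the String constructor how the bytes were encoded instead of
              // relying on whatever default the JVM picked up from the locale
              String text = new String(raw, cs);
              System.out.println(text);
         }
    }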

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    The computer industry started with diskspace and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8 because as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte and those three bytes define the character. This goes up to 6 byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that in their text editor, using the codepage for their region, insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first character of a 2 byte sequence. You either get a different character or if the second byte is not a legal value for that first byte – an error.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
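    For example, a minimal Java sketch of Point 3 (the file name is illustrative): the encoding is stated on both the write and the read instead of being inherited from the platform.
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class ExplicitEncodingIO {
         public static void main(String[] args) throws IOException {
              Path path = Paths.get("notes.txt"); // illustrative file name
              // write: the encoding is stated, not inherited from the platform
              try (BufferedWriter w = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
                   w.write("naïve ß ©");
                   w.newLine();
              }
              // read: the same encoding is stated again
              try (BufferedReader r = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
                   System.out.println(r.readLine());
              }
         }
    }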
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding, it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Lets start off with two key items
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts. Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that but at that time I wasn't even aware that character sets existed.
    They might only use that range but that is a different issue, especially since that range is exactly the same as the UTF8 character set anyways.
    >
    The computer industry started with diskspace and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8 because as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte and those three bytes define the character. This goes up to 6 byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of unicode, all of unicode, are based on ASCII. The representational format of UTF8 is required to implement unicode, thus it must represent those characters. It uses the idiom supported by variable width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that in their text editor, using the codepage for their region, insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first character of a 2 byte sequence. You either get a different character or if the second byte is not a legal value for that first byte – an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it then it is invalid. End of story. It has nothing to do with html/xml.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in java with escaped unicode characters which will fail to compile.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business and create solutions that are appropriate to that. Thus there is absolutely no point for someone that is creating an inventory system for a standalone store to craft a solution that supports multiple languages.
    Another example: with high-volume systems, moving/storing bytes is relevant. As such one must carefully consider each text element as to whether it is customer-consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems incremental savings impact operating costs and marketing advantage through speed.

  • Why differing Character Encoding and how to fix it?

    I have PRS-950 and PRS-350 readers, both since 2011.  
    In the last year, I've been getting books with character encoding that is not easy to read. In playing around with my browsers and View -> Encoding menus, I have figured out that it has something to do with the character encoding within the epub files.
    I buy books from several ebook stores and I borrow from the library.
    The problem may be the entire book, but it is usually restricted to a few chapters, with the rare occasion where the encoding changes within a chapter. Usually it is a whole chapter, not part of one, and it can be seen in chapters not consecutive to each other.
    It occurs whether the book is downloaded directly to my 950 reader or if I load it to either reader from my computer(s), which are all Mac OS X, several versions from 10.4 to Mountain Lion. Since it happens when the book is downloaded directly, I figure the operating system of my computer is not relevant.
    There are several publishers involved, though Baen (no DRM ebooks) has not so far been one of them.
    If I look at the books with viewers on the computer, the encoding is the same.  I've read them in Calibre, in the Sony Reader App, and in Adobe Digital Editions 2.0.  It's always the same.
    I believe the encoding is inherent to the files.  I would like to fix this if I can to make the books I've purchased, many of them in paper and electronically, more enjoyable to read on my readers.
    Example: I’ve is printed instead of I've.
    â€™ for apostrophe
    â€œ for the opening of a quotation,
    â€?  for closing the quotation,
    and I think â€” is for a dash.
    When a sentence had â€œâ€™m  for " 'm at the beginning of a speech (when the character was slurring his words) it took me a while to figure out how it was supposed to read.
    â€œâ€™Sides, â€™tis only for a moon.  That ainâ€™t long.â€?
    was in one recent book.
    Translation: " 'Sides, 'tis only for a moon. That ain't long."
    See what I mean? 
    Any ideas?

    Hi
    I wonder if it’s possible to download a free ebook with such an issue, in order to make some “tests”.
    Perhaps it’s possible, on free ebooks (without DRM), to add fonts by using software like Sigil.
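    Those sequences are what UTF-8 punctuation looks like after being decoded as Windows-1252. A small Java sketch of that assumption, including the round trip that can sometimes repair it (it fails for bytes such as 0x9D that Windows-1252 leaves undefined, which is why â€? shows up with a question mark):
    import java.nio.charset.StandardCharsets;

    public class EbookMojibake {
         public static void main(String[] args) throws Exception {
              String original = "I\u2019ve"; // "I've" with a curly apostrophe
              byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
              String garbled = new String(utf8, "windows-1252");
              System.out.println(garbled);  // prints the "Iâ€™ve" form from the question
              // reverse the mistake: re-encode as windows-1252, then decode as UTF-8
              String repaired = new String(garbled.getBytes("windows-1252"), StandardCharsets.UTF_8);
              System.out.println(repaired); // back to "I've"
         }
    }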
