Seeing �, etc. despite having View > Character Encoding set to Unicode and auto-detect set to Universal

On viewing some web pages I see characters such as � (for example), even though View > Character Encoding is set to Unicode (UTF-8) or Western (ISO-8859-1), and the encoding under Tools > Options > Content > Fonts > Advanced is set to either of those.

Example page:
http://scienceofdoom.com/2010/09/17/on-missing-the-point-by-chilingar-et-al-2008/
A little over halfway down, in the section headed "Anthropogenic Impact on the Earth’s Climate – Tiny", from the paragraph beginning "And continue:", these non-characters appear in equation (12) and subsequently.
Another page: http://www.zimbabwesituation.com/sep26_2010.html, in the topic "Red warning lights".
Most web pages I read display without problems.
I contacted the writer of the first page and s/he had no idea why it happens.

Similar Messages

  • Character Encoding for JSPs and HTML forms

    After having read loads of postings on character encoding problems I'm still puzzled about the following problem:
    I have an instance (A) of WL 8.1 SP3 on a WinXP machine and another instance (B) of WL 8.1 without any SP on a Win2K machine. The underlying Windows locale is English (US) in both cases.
    The same application deployed as a war file to these instances does not behave in the same way when it comes to displaying non-Latin1-characters like the Euro symbol: Whereas (A) shows and accepts these characters as request-parameters, (B) does not.
    Since the war file is the same (weblogic.xml, jsps and everything), the reason for this must either be the service-pack-level or some other configuration setting I overlooked.
    Any hints are appreciated!

    Try this:
    Preferences -> Content -> Fonts & Colors -> Advanced
    At the bottom, choose your Encoding.
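    (That reply only covers the browser side; the difference between the two WebLogic instances is server-side. A common way to take the container's default encoding out of the equation is a filter that forces the request encoding. A minimal Java sketch – the class name is hypothetical, and it assumes you can register the filter in the war's web.xml:)

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    // Hypothetical filter: forces a known encoding for request parameters and
    // the response, so behavior no longer depends on the container default,
    // which can differ between service-pack levels.
    public class EncodingFilter implements Filter {
        public void init(FilterConfig config) {}
        public void destroy() {}

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            req.setCharacterEncoding("UTF-8");   // decode POST parameters as UTF-8
            res.setContentType("text/html; charset=UTF-8");
            chain.doFilter(req, res);
        }
    }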

  • Character encoding: Ansi, ascii, and mac, oh my!

    I'm writing a program which has to search & replace data in user-supplied Rich Text documents (.rtf). Ideally, I would like to read the whole thing into a StringBuffer, so that I can use all of the functionality built into String and StringBuffer, and so that I can easily compare with constant Strings and chars.
    The trouble that I have is with character encoding. According to the rtf spec, RTFs can be encoded in four different character encodings: "ansi", "mac", IBM PC code page 437, and IBM PC code page 850, none of which are supported by Java (see http://impulzus.sch.bme.hu/tom/szamitastechnika/file/rtfspec/rtfspec_6.htm#rtfspec_8 for the RTF spec and http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html#charenc for the character encodings supported by Java).
    I believe, from a bit of googling, that they are all 8 bits/character, so I could read everything into a byte array and manipulate that directly. However, that would be rather nasty. I would have to be careful with the changes that I make to the document, so that I do not insert values that do not encode correctly in the document's character encoding. Overall, a large hassle.
    So my question is - has anyone done something like this before? Any libraries that will make my job easier? Or am I missing something built into Java that will allow me to easily decode and reencode these documents?

    DrClap, thanks for the response.
    If I could map from the encodings listed above (which are given in the rtf document) to a Java encoding name from the page that you listed, that would solve all my problems. However, there are a couple of problems:
    a) According to this page - http://orwell.ru/info/diffs.htm - ANSI is a superset of ISO-8859-1. That page isn't exactly authoritative, but I can't afford to lose data.
    b) I'm not sure what to do about the other character encodings. "mac" may correspond to "MacRoman", but that page lists a dozen or so other Macintosh encodings. Gotta love crystal-clear MS documentation.
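    For what it's worth, the mapping this thread is circling around can be sketched in a few lines. The charset names below assume a full JDK with the extended charsets available, and mapping "ansi" to windows-1252 (a superset of ISO-8859-1, per the page above) is an assumption, not something the RTF spec guarantees:

    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;

    public class RtfCharsets {
        // RTF charset keywords (\ansi, \mac, \pc, \pca) mapped to JDK charsets.
        private static final Map<String, Charset> RTF_TO_JAVA = Map.of(
                "ansi", Charset.forName("windows-1252"), // assumed superset of "ansi"
                "mac",  Charset.forName("MacRoman"),
                "pc",   Charset.forName("IBM437"),       // IBM PC code page 437
                "pca",  Charset.forName("IBM850"));      // IBM PC code page 850

        // Decode the raw bytes of an RTF file using its declared charset keyword.
        static String decode(byte[] rtfBytes, String keyword) {
            Charset cs = RTF_TO_JAVA.getOrDefault(keyword, Charset.forName("windows-1252"));
            return new String(rtfBytes, cs);
        }

        public static void main(String[] args) throws Exception {
            byte[] bytes = Files.readAllBytes(Paths.get("document.rtf"));
            System.out.println(decode(bytes, "ansi"));
        }
    }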

  • How do I set the character encoding for every page, always?

    I use Thai (Windows-874) to open pages. When I select a website that contains Thai and then open it in a new tab, the encoding changes to Western (Windows-1252) and it cannot display Thai. I must set the character encoding to Thai (Windows-874) every time.

    Try this:
    Preferences -> Content -> Fonts & Colors -> Advanced
    At the bottom, choose your Encoding.

  • Character encoding in Drafts and Templates won't "stick"

    I've been using T'bird for many years; I am now using version 30.0 with Windows 8. Over the years I have wrestled extensively with character encodings, because I write a lot of messages in mixed English and Cyrillic. I have things set now so that my default encoding is Cyrillic (Windows-1251) and this works pretty well, EXCEPT!! that when I save a message in my Drafts folder (or any of its subfolders), or my Templates folder, two things happen:
    1. The Cyrillic turns to garbage (I can sometimes recover it by physically moving the message to the Inbox).
    2. I get a series of stray characters such as à, with and without spaces between them.
    This is a major pain in the butt! If I catch it the first time around, and move the offending message to the Inbox and "Edit as new", I can usually rescue the Cyrillic - but if I (for example) make some changes elsewhere in the English portions of the message and save it again (without noticing the mess-up of the Cyrillic), it's all lost and I've found no way to recover it.
    I've been reading this forum and I suspect that my problem lies in the "folder properties", and/or maybe that I have not set up a user-defined set of display properties. However, I'm afraid to just mess around with things (as I've done so much in the past) for fear of messing up messages that I've been keeping for a long time in the OTHER folders.
    Help please? I use Windows 7.

    Didn't understand from what starting point you want me to "select the mail account name", but closed and re-opened T'bird.
    Then I started to run your tests.
    To my HUGE AMAZEMENT!!! - everything seems to be working today!!
    This leaves me with a couple of old files that seem to have gotten trashed a long time ago and won't simply "recover" - but I did keep back-up copies of them elsewhere, and can now go about rebuilding them if necessary.
    Thank you for your help - whatever it is that I did, seems to have worked. After how many years?! If I have trouble in the future, I'll be back - but for right now, what a RELIEF!
    Best, Martha

  • Unicode (UTF-8) seems to be the default character encoding for printing. The page looks funny and the print is spread out.

    The layout of the printed page is different than it appears on the screen. I printed the confirmation of my VISA payment online and it is not concise. The information is correct, but the layout isn't.
    Should I use another encoding? One of the Western ones? Maybe Unicode (UTF-16)?

    There is a known bug involving printing in beta 12 that has been fixed in the Firefox 4 release candidate which is due out soon.

  • Character encoding with CF and MySQL

    Okay, I thought this should be rather straightforward, but apparently not. I have set up my site to use UTF-8 – my .cfm pages, the MySQL table, even Dreamweaver. The problem is that when I input international characters via a form they get written correctly to the MySQL table; however, when I retrieve them in a query and display them on the page, they display incorrectly.
    On my input.cfm page I'll enter the string "Téstïñg" in the textbox and submit it. If I look at the record via the MySQL Browser it appears as it should. However, when I display it on my output.cfm page it shows the record as "T�st��g", and will do so until I change the meta tag to use charset=ISO-8859-1. Am I missing something, or is this how it is supposed to work?
    My input.cfm page is set up with both the
    <cfprocessingdirective suppresswhitespace="YES" pageencoding="UTF-8">
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    tags and a regular input form field that writes to the MySQL database.
    The MySQL table is configured to use the utf8 character set and utf8_unicode_ci collation.
    And just to be safe I included
    useUnicode=true&characterEncoding=utf8&characterSetResults=utf8
    in the connection string on the CF Admin datasource setup page.
    I'm running CF 6.1, MySQL 4.1, and the latest version of Apache Server on a Win2K3 box. I was running the 3.0.16 MySQL JDBC driver, but I upgraded to 5.0.6 this morning thinking that might fix my issue.

    I'm still unsure why this works, but I've found a solution. I switched all my pages over to character set ISO-8859-1, with the exception of my database table, and it works. I get all the normal-range characters, along with the extended Unicode characters, to write to the database and output correctly. Unicode characters actually write to the table as their HTML-coded character references.
    If someone feels the need to enlighten me as to why this works, please feel free – I'm always willing to learn.
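    The "T�st��g" symptom itself is reproducible outside ColdFusion. A minimal JVM sketch (illustrative only – CF runs on Java, but this is not CF's actual code path): decoding Latin-1 bytes as UTF-8 turns é, ï and ñ into U+FFFD replacement characters, while the reverse mistake produces the familiar "Ã" mojibake.

    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            String original = "Téstïñg";
            // Latin-1 bytes mis-decoded as UTF-8: 0xE9 (é), 0xEF (ï), 0xF1 (ñ)
            // are malformed UTF-8, so each run becomes U+FFFD -> "T�st��g"
            byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(new String(latin1, StandardCharsets.UTF_8));
            // The reverse mistake, UTF-8 bytes mis-decoded as Latin-1 -> "TÃ©stÃ¯Ã±g"
            byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
            System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));
        }
    }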

  • Default Character Encoding stuck on UTF-8 - Firefox 7

    I cannot change the Character Encoding – it is stuck on Unicode (UTF-8) and I cannot change it! When a web page opens I get little boxes with "FF FD" instead of quote marks. When I change the character encoding on that page using View->Character Encoding and click on Western (ISO-8859-1), the page displays correctly. Every page opens using Unicode (UTF-8) as the default.
    View->Character Encoding – shows Unicode (UTF-8) as the default.
    View->Character Encoding->Auto-detect – shows OFF.
    Tools->Options->Content->Fonts & Colors->Advanced->Default Character Encoding – shows Western (ISO-8859-1), and "Allow pages to choose their own fonts..." IS CHECKED in the check box.
    THE PAGES ARE NOT UTF-8!!!! The "View Page Source" IS NOT Unicode (UTF-8)! It shows <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
    The "View Page Info" shows MetaTag – Content-Type: text/html; charset=iso-8859-1.
    Why can I not change the Default Character Encoding?
    I would also like to point out that the Unicode (UTF-8) handling seems to be broken, because it is indicating that the QUOTE CHARACTER is an UNPRINTABLE character, "FF FD".
    ----- EDIT -----
    The UTF-8 is not broken. The problem, as pointed out in http://en.wikipedia.org/wiki/Replacement_character#Replacement_character, is that my Firefox, being STUCK processing UTF-8, cannot read the clearly marked iso-8859-1 data. So the UTF-8 decoder is reinterpreting the smart quotes – &ldquo; and &rdquo; (“ and ”) – as replacement (unprintable) characters.
    So the real problem is why my Firefox is stuck on Unicode (UTF-8).
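    That self-diagnosis checks out, and it is easy to reproduce outside the browser. A small Java sketch (illustrative, not Firefox code – the two bytes are the windows-1252 encodings of “ and ”, which pages labeled iso-8859-1 commonly contain):

    import java.nio.ByteBuffer;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class StuckUtf8Demo {
        public static void main(String[] args) throws Exception {
            byte[] smartQuotes = { (byte) 0x93, (byte) 0x94 }; // “ and ” in windows-1252
            // A decoder forced to UTF-8 sees two malformed bytes and substitutes
            // U+FFFD for each -- the "FF FD" boxes from the post.
            String s = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(smartQuotes))
                    .toString();
            s.chars().forEach(c -> System.out.printf("U+%04X%n", c)); // U+FFFD, U+FFFD
        }
    }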

    The real problem is that the font that is used doesn't have those characters.
    Do you see the special quotes -“ and ” on this forum page?
    Does it help if you disable the website fonts and set another font as the default font?
    *Tools > Options > Content : Fonts & Colors > Advanced
    *http://en.wikipedia.org/wiki/Punctuation
    *http://en.wikibooks.org/wiki/Unicode/Character_reference/2000-2FFF

  • Why, after all these years, can't Thunderbird auto-detect character encoding

    judging by all the existing messages and complaints about this, not to mention erroneous posts that say the problem is solved when it isn't, I have to conclude Mozilla either doesn't believe this is a problem or doesn't care to fix it. The bottom line is that there is no way to tell Thunderbird to automatically display emails in the character coding format they were written in. I could understand cases where the headers are not properly filled in, but I see tons of emails in which the encoding is plainly there in the headers within the message source. You can force it, but if you do so via the menu VIEW->Character Encoding->UTF8 (for example) it won't "stick" if you view another message. But who would want it to "stick" permanently anyway? What the average user really wants is to be able to toggle VIEW->Character Encoding->Auto Detect from its default "off" to simply "on", and not have to bother with it anymore.
    This is a problem that seems to have gone on forever, and it NEVER happens with other email clients. If there is some backdoor way to actually make autodetect work, I'd appreciate knowing about it. But more important, I think ALL users would appreciate it if it were not some secret "backdoor" setting, but a simple global menu choice for all accounts. Can Mozilla please fix this problem once and for all?

    You said...
    ''Thunderbird is supposed to be using the encoding in the mail.''
    I figured it "should"; I'm just reporting that it doesn't.
    You said...
    ''Setting auto detect to on disables that.''
    Please explain. I've looked at every setting I can find and there is no way to set auto-detect to "ON". I DID try setting it to "universal" in an attempt to solve the problem, but I have since restored it to "off", because the universal setting doesn't help.
    you said...
    ''"Based on your earlier response I assume you need to press the F10 key to see the tools menu you were refered to." ''
    No... I never said that anywhere. I DID refer to Menu->View_>Character Encoding, and I did refer to right clicking on individual folders, to get to the properties dialog, and the general information tab. But F-10 doesn't do anything
    You said...
    ''I have examines dozens of mails in my inbox and each honours the character encoding set in the HTML''
    Well, mine NEVER did. A short example from an email I got today is pretty much representative of all the mail I get from Gmail...
    --089e013a0572a067a404fc73ceda
    Content-Type: text/plain; charset=UTF-8
    Ok, very good. Thank you. Phoenix sent you a friend request on Facebook by
    the way. Talk to you soon.
    --089e013a0572a067a404fc73ceda
    Content-Type: text/html; charset=UTF-8
    Content-Transfer-Encoding: quoted-printable
    <p dir=3D"ltr">Ok, very good. Thank you. Phoenix sent you a friend request=
    =C2=A0 on Facebook by the way.=C2=A0 Talk to you soon.</p>
    --089e013a0572a067a404fc73ceda--
    See those instances of "=C2=A0"? Each one displays as a strange character: a capital A with a curved line over it (Â). If I manually set my default encoding to UTF-8, the weird characters go away. If I leave it as Western, there is nothing I can do to tell Thunderbird to "auto detect".
    Anyway, I suppose at this point that no one responsible for the product coding is seriously looking at my issue, which is why it's never been solved. If anyone does intend to help track it down and solve it, I'll be happy to provide all the examples and screenshots they ask for. Otherwise.
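    The "=C2=A0" runs above decode cleanly once the declared charset is honoured; they are the two raw bytes of a UTF-8 no-break space. A short Java sketch (illustrative only) showing both readings:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class QuotedPrintableByteDemo {
        public static void main(String[] args) {
            // "=C2=A0" in quoted-printable stands for the raw bytes 0xC2 0xA0.
            byte[] bytes = { (byte) 0xC2, (byte) 0xA0 };
            // Decoded per the declared charset=UTF-8: a single no-break space.
            System.out.println("UTF-8:        [" + new String(bytes, StandardCharsets.UTF_8) + "]");
            // Mis-decoded as Western (windows-1252): "Â" plus a no-break space,
            // the "capital A with a curved line over it" from the post.
            System.out.println("windows-1252: [" + new String(bytes, Charset.forName("windows-1252")) + "]");
        }
    }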

  • I am having a problem with pop-ups and small windows with ads constantly opening up in my Safari?? Thought that Macs didn't get viruses? This looks like one - any experts around? Please help me fix it with your instructions? Really don't know what to do...

    Hi everyone,
    I am having a problem with my Mac OS X 10.7.5 MacBook Air: there are constant pop-ups and small windows with blinking ads opening up in my Safari in front of everything?? It is constantly interrupting me and makes me mistakenly click on it, and then another new window opens behind the one I'm using.
    I am not too sure if that's a virus or trojan... I always thought that Macs didn't get viruses? This looks like one to me… any experts around? Please help me fix it with your instructions? Really don't know what to do... thanks

    Those are not viruses. You have probably installed some malware:
    The Safe Mac » Adware Removal Guide
    Helpful Links Regarding Malware Protection
    An excellent link to read is Tom Reed's Mac Malware Guide.
    Also, visit The XLab FAQs and read Detecting and avoiding malware and spyware.
    See these Apple articles:
              Mac OS X Snow Leopard and malware detection
              OS X Lion- Protect your Mac from malware
              OS X Mountain Lion- Protect your Mac from malware
              About file quarantine in OS X
    If you require anti-virus protection, Thomas Reed recommends using Dr.Web Light from the App Store. It's free, and since it's from the App Store, it won't destabilize the system. If you prefer one of the better-known commercial products, then Thomas recommends using Sophos. (Thank you to Thomas Reed for these recommendations.) If you already use Sophos, then be aware of this if you are using Mavericks: OS X Mavericks- Sophos Anti-Virus on-access scanner versions 8.0 - 9.1 may cause unexpected restarts
    From user Joe Bailey comes this equally useful advice:
    The facts are:
    1. There is no anti-malware software that can detect 100% of the malware out there.
    2. There is no anti-malware that can detect anything targeting the Mac because there is no Mac malware in the wild, and therefore, no "signatures" to detect.
    3. The very best way to prevent the most attacks is for you as the user to be aware that the most successful malware attacks rely on very sophisticated social engineering techniques preying on human avarice, ****, and fear.
    4. Internet popups saying the FBI, NSA, Microsoft, or your ISP has detected malware on your computer are intended to entice you to install their malware, thinking it is a protection against malware.
    5. Some of the anti-malware products on the market are worse than the malware from which they purport to protect you.
    6. Be cautious where you go on the internet.
    7. Only download anything from sites you know are safe.
    8. Avoid links you receive in email; always be suspicious, even if you get something you think is from a friend but were not expecting.
    9. If there is any question in your mind, then assume it is malware.

  • My iPad 1 does not make any of the sounds (key clicking, etc) I have directed it to despite having volume turned all the way up...any thoughts?

    my iPad 1 does not make any of the sounds (key clicking, etc) I have directed it to despite having volume turned all the way up...any thoughts?

    Check to see if the "silent" switch is turned on. This is either the hardware switch next to the volume control or a software switch on the "recent apps" tray depending on how you have your iPad configured (Settings > General).
    To see the software switch on the "recent apps" tray, double-click the Home button and swipe the tray to the right.

  • I downloaded an email which contained music; I deleted the email, but the music keeps on playing despite having deleted all history, etc.

    I downloaded an email which contained music. I deleted the email, but the music keeps on playing whenever I log on, despite having deleted all history, etc.

    {Ctrl + J} to open the Downloads window. Right-click that audio file and click '''''Remove from List'''''.

  • Can character encoding be predefined for certain pages?

    Certain pages that I visit frequently require me to manually set my character encoding to Western (ISO Latin 1), whether my default character encoding is set to UTF-8 or Western (ISO Latin 1).
    As the pages that show up malformed are embedded in other frames I suspect that the top frame forces a different encoding than is on the embedded page.
    An example page is here (847.is). The topic list of this message board is in order, but when any one of the topics is viewed all accented and special characters are missing, until Western (ISO Latin 1) is manually set as the character encoding. Similarly, opening any of the topics in a tab will result in missing characters.
    Is there some way for me to circumvent having to go through all those menus to set it? Can I somehow define that these pages should be viewed in Western (ISO Latin 1) or can I set a keyboard shortcut for Western (ISO Latin 1)?
    MacBook 2006   Mac OS X (10.4.7)   Safari version 2.0.4

    ''For instance if I opened an entry on Vísindavefur HÍ out of the parent frame, the accented letters would show up somehow mangled, but this is no longer a problem.''
    That page has no charset in the source and thus should only display correctly if you have Latin-1 set as the browser default. With UTF-8 set as the default you should see (in Safari) a ton of black diamonds with question marks inside.
    ''Firefox never displays ð (eth), þ (thorn) and ý (accented y) correctly for me; it did not do it on the old machine and does not do it on this one either.''
    Firefox displays them perfectly for me in both 10.3 and 10.4.
    ''Actually, if I set Opera's encoding to UTF-8 it displays the topics on 847.is as Safari does.''
    This indicates it is a system issue rather than Safari. Sorry I can't duplicate it and have no good idea what could cause it on a normal system. Have you (or the place where you buy your machines) by chance installed any special software add-ons to enable the use of non-Unicode Icelandic (for apps like AppleWorks, Word X, etc.)?

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1. Unicode does not solve this issue for us (yet).
    2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 values in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A–Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 values were identical in all of them and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
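    In Java terms, a minimal sketch of Point 1 (the filename and content are illustrative): name the charset explicitly when opening the writer, and declare the same encoding inside the document itself.

    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class Point1Demo {
        public static void main(String[] args) throws Exception {
            // Never rely on the platform default; say UTF-8 in both places.
            try (Writer out = Files.newBufferedWriter(Paths.get("data.xml"), StandardCharsets.UTF_8)) {
                out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
                out.write("<greeting>héllo</greeting>\n");
            }
        }
    }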
    Now let's look at UTF-8, because between its position as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 values are all single-byte representations of characters. Then, for the next most common set, it uses a block in the second 128 values to begin a double-byte sequence, giving us more characters. But wait, there's more. For the less common ones there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6-byte sequences in the original design. Using this MBCS (multi-byte character set) approach you can write the equivalent of every Unicode character – and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
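    The variable widths are easy to see from code (a small illustrative sketch; the four sample characters are one each of the 1-, 2-, 3- and 4-byte cases):

    import java.nio.charset.StandardCharsets;

    public class Utf8Lengths {
        public static void main(String[] args) {
            // A (ASCII), ß (Latin-1 range), € (BMP), 𝄞 (outside the BMP)
            for (String s : new String[] { "A", "ß", "€", "𝄞" }) {
                System.out.println(s + " -> " + s.getBytes(StandardCharsets.UTF_8).length + " bytes");
            }
        }
    }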
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character – their text editor, using the codepage for their region, inserts something like ß as a single byte – and save the file. Of course it must be correct: their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
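    The ß scenario above can be reproduced directly (an illustrative sketch, not anyone's real workflow):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EditorMixupDemo {
        public static void main(String[] args) {
            // "ß" saved by a Western-locale editor is the single byte 0xDF.
            byte[] edited = "Straße".getBytes(Charset.forName("windows-1252"));
            // A reader honouring a declared UTF-8 encoding treats 0xDF as the
            // lead byte of a 2-byte sequence; 'e' is not a valid continuation,
            // so out comes "Stra�e" (or an error, with a stricter decoder).
            System.out.println(new String(edited, StandardCharsets.UTF_8));
        }
    }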
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files, where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
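    A sketch of Point 3 in Java (the filename is illustrative): the NIO helpers take the charset at the only place it can be enforced, the byte-to-char boundary.

    import java.io.BufferedReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class Point3Demo {
        public static void main(String[] args) throws Exception {
            // Read a "plain old text file" with an explicit encoding rather
            // than whatever the local code page happens to be.
            try (BufferedReader in = Files.newBufferedReader(Paths.get("Main.java"), StandardCharsets.UTF_8)) {
                in.lines().forEach(System.out::println);
            }
        }
    }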
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It also adds the endian preamble to the file.)
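    For example, with the JDK's StAX writer (an illustrative sketch; the element names are made up) the declaration is guaranteed to match the bytes actually written:

    import java.io.FileOutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class Point4Demo {
        public static void main(String[] args) throws Exception {
            try (FileOutputStream fos = new FileOutputStream("out.xml")) {
                XMLStreamWriter w = XMLOutputFactory.newFactory()
                        .createXMLStreamWriter(fos, "UTF-8");
                w.writeStartDocument("UTF-8", "1.0"); // declaration matches the encoder
                w.writeStartElement("greeting");
                w.writeCharacters("héllo");
                w.writeEndElement();
                w.writeEndDocument();
                w.flush();
                w.close();
            }
        }
    }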
    OK, you're reading and writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what the encoders in the Java & .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around awhile) – Always use Unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1. Unicode does not solve this issue for us (yet).
    2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 values in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A–Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.
    They might only use that range, but that is a different issue, especially since that range is exactly the same as the UTF-8 character set anyway.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 values were identical in all of them and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years then a column with a size of 8 bytes is significantly different than one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because between its position as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 values are all single-byte representations of characters. Then, for the next most common set, it uses a block in the second 128 values to begin a double-byte sequence, giving us more characters. But wait, there's more. For the less common ones there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6-byte sequences in the original design. Using this MBCS (multi-byte character set) approach you can write the equivalent of every Unicode character – and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of Unicode – all of Unicode – are based on ASCII. The representational format of UTF-8 is required to implement Unicode, thus it must represent those characters. It uses the idiom supported by variable-width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character – their text editor, using the codepage for their region, inserts something like ß as a single byte – and save the file. Of course it must be correct: their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with HTML/XML.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files, where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know Java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It also adds the endian preamble to the file.)
    OK, you're reading and writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what the encoders in the Java & .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in Java with escaped Unicode characters which will fail to compile.
    Point 5 – (For developers on languages that have been around awhile) – Always use Unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone who is creating an inventory system for a standalone store to craft a solution that supports multiple languages.
    Another example: with high-volume systems, moving/storing bytes is relevant. As such, one must carefully consider each text element as to whether it is customer-consumable or internally consumable. Saving bytes in such cases will affect the total load of the system. In such systems, incremental savings impact operating costs and the marketing advantage of speed.

  • FF character encoding issue in Mageia 2 ?

    Hi everyone,
    I'm running Mozilla Firefox 17.0.8 in a KDE distro of Linux called Mageia 2. I'm having character encoding problems with certain web pages, meaning that certain icons, like the ones next to menu entries (Login, Search box, etc.) and in section headlines, don't appear properly. Instead they appear either as some Arabic character or as little grey boxes with numbers and letters written in them.
    I've tried experimenting with different encodings – Western (ISO-8859-1), (ISO-8859-15), (Windows-1252), Unicode (UTF-8), Central European (ISO-8859-2) – but none of them does the job. Currently the character encoding is set to UTF-8. The same web page in Chrome (UTF-8) gives no such problem.
    Can you help me, please?

    Thank you!
    I solved my problem. However, I find the fonts are too small for certain web pages when compared to Chrome (see attached pictures of nytimes.com).
    Chrome's font size is set to "Medium".
