Wrong character encoding from Flash to MySQL

Hi, I'm experiencing problems with character encoding not
working correctly when sending from Flash to MySQL. What I am
doing is a contact form in Flash which sends the values
to a PHP file, which takes the values and inserts them into a table.
As I'm using Icelandic characters I need the character encoding in
MySQL to be either latin1 or utf8, or at least I think so. But it
seems that Flash or the PHP document isn't sending in the same
format as I have selected in MySQL, because all the special Icelandic
characters come out scrambled in the MySQL table. Firefox tells me,
though, that the HTML document containing the Flash movie is using
UTF-8.

I don't know anything about Icelandic characters, but Flash
generally really likes UTF-8, so it should be sending that if that
is what it is starting with.
You aren't using any kind of useCodePage? That will mess it up.
Are you sure that the input method is Icelandic?
In the testing environment, can you list variables (from the
debug menu) and see if they look proper? If they do, then Flash is
reading them correctly and the problem must be coming in further
downstream.

Similar Messages

  • Detecting character encoding from BLOB stream... (PLSQL)

    I'm looking for a procedure/function which can return the character encoding of a "text/xml/csv/slk" file stored in a BLOB.
    For example...
    I have 4 files in different encodings (UTF-8, UTF-8 with BOM, ISO8859_2, Windows-1252)...
    In Java I can simply detect the character encoding with juniversalchardet (http://code.google.com/p/juniversalchardet/)...
    Thank you

    Solved...
    On my local PC I have installed Java 1.5.0_00 (because on the DB it is 1.5.0_10)...
    With JDeveloper I recompiled the source code from:
    http://juniversalchardet.googlecode.com/svn/trunk/src/org/mozilla/universalchardet
    http://code.google.com/p/juniversalchardet/
    After that I made a JAR file and uploaded it with loadjava to my database:
    C:\>loadjava -grant r_inis_prod -force -schema insurance2 -verbose -thin -user username/password@ip:port:sid chardet.jar
    Then I wrote a Java function and a PL/SQL wrapper; example below:
       public static String verifyEncoding(BLOB p_blob) {
           if (p_blob == null) return "-1";
           try {
               InputStream is = new BufferedInputStream(p_blob.getBinaryStream());
               UniversalDetector detector = new UniversalDetector(null);
               byte[] buf = new byte[p_blob.getChunkSize()];
               int nread;
               // feed the BLOB to the detector chunk by chunk until it is confident
               while ((nread = is.read(buf)) > 0 && !detector.isDone()) {
                   detector.handleData(buf, 0, nread);
               }
               detector.dataEnd();
               is.close();
               return detector.getDetectedCharset();
           } catch (Exception ex) {
               return "-2";
           }
       }
    As you can see, I used -2 for an exception and -1 if the input BLOB is null.
    Then I made a PL/SQL wrapper:
    function f_preveri_encoding(p_blob in blob) return varchar2 is
    language Java name 'Zip.Zip.verifyEncoding(oracle.sql.BLOB) return java.lang.String';
    After that I uploaded 2 different txt files into my BLOB field (the first one encoded with UTF-8, the second one with WINDOWS-1252).
    example how to call:
    declare
       l_blob blob;
       l_encoding varchar2(100);
    begin
    select vsebina into l_blob from dok_vsebina_dokumenta_blob where id = 401587359 ;
    l_encoding := zip_util.f_preveri_encoding(l_blob);
    if l_encoding = 'UTF-8' then
       dbms_output.put_line('file is encoded with UTF-8');
    elsif l_encoding = 'WINDOWS-1252' then
       dbms_output.put_line('file is encoded with WINDOWS-1252');
    else
        dbms_output.put_line('other enc...');
    end if;
    end;
    Now I can get the encoding from the BLOB, convert it to the database encoding, and store the data in a CLOB field.
    Here is the chardet.jar file if you need this functionality:
    https://docs.google.com/open?id=0B6Z9wNTXyUEeVEk3VGh2cDRYTzg

  • Wrong character encoding in error messages

    The Java compiler can be adjusted to the source file encoding with the option javac -encoding ...
    The Java runtime can be adjusted to the terminal encoding with java -Dfile.encoding=...
    While this appears somewhat inconsistent, it works and can be used, e.g., when running the tools from Cygwin (the POSIX layer on Windows), which uses UTF-8 by default, while Java, following the Windows mechanism, uses some other character encoding by default (this works more seamlessly on Unix/Linux, by the way).
    Now if I compile UTF-8 source with non-ASCII characters, and there is an error message related to them, the error message printed to the console will not be UTF-8 encoded, resulting in mangled text output.
    (Arguably, source and terminal encoding could be different, but then there is no option available to the compiler to adjust this;
    it does not accept -Dfile.encoding=....)
    Example: the error message looks like this:
    FM.java:1: error: class, interface, or enum expected
    b▒h
    While the string is actually "bäh" in the source.
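    For reference, a minimal reproduction sketch (the FM.java name is taken from the error message above; the terminal setup is assumed). The source file is deliberately invalid so javac has to echo the offending line back to the console:
        // FM.java, saved as UTF-8. Compile with:
        //   javac -encoding UTF-8 FM.java
        // javac reports "class, interface, or enum expected" and echoes the
        // offending line; if the console charset differs from the encoding the
        // JVM writes with, the echoed "bäh" comes out mangled as shown above.
        bäh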
    This is a bug. Any proper place to actually report a bug?

    Leaving aside that you are blatantly assuming it is a bug just because you say so: did you not think to type "java report bug" into Google?

  • How to prevent Terminal's character encoding from changing

    I have a command-line script that runs continuously, and occasionally echos to STDOUT some binary data that, apparently, changes the way Terminal displays certain characters. Is there any way of preventing this from happening? Say, by locking down Terminal's character-encoding?
    ...Rene

    Rene,
    I am not sure if you can prevent this from happening, but you can set things back to normal by using the reset command.
    Another possible solution is to write the output of STDOUT to a file so that it will not be displayed on screen.
    Mihalis.

  • Change character encoding from UTF-8 to EUC-KR

    We are receiving data in UTF-8 in the querystring from a partner formatted as:
    %EA%B3%A0%EB%AF%BC%ED%95%98%EC%9E%90%21
    Our site uses EUC-KR so using this text for search/display/etc is not possible. Does anyone know how we can convert this to the proper Korean EUC encoding so it can be displayed properly using JSP? Basically it should be:
    %B0%ED%B9%CE%C7%CF%C0%DA%21
    Thanks in advance.

    I'm not sure where you are getting %xx-encoded UTF-8... Is it because you have it in a GET-method form and that's what you are seeing in the browser's location bar? ...
    Let's assume you have a form on a page, and the page's charset is set to UTF-8, and you want to generate a URL encoded string (%xx format, although URLEncoder will not encode ASCII chars that way...).
    In the page processing the form, you need to do this:
    request.setCharacterEncoding("UTF-8"); // makes bytes read as UTF-8 strings (assumes that the form page was properly set to the UTF-8 charset)
    String fieldValue = request.getParameter("fieldName"); // get value
    // the value is now a Unicode String in Java, generated from reading the bytes submitted from the form as UTF-8 encoded text...
    String utf8EncString = URLEncoder.encode(fieldValue, "UTF-8");
    // now utf8EncString is a URL encoded (%xx) string of UTF-8 values
    String euckrEncString = URLEncoder.encode(fieldValue, "EUC-KR");
    // now euckrEncString is a URL encoded (%xx) string of EUC-KR values
    What is probably screwing things up for you mostly is this:
    euckrValue = new String(utf8Value.getBytes(), "EUC-KR");
    What this does is take the bytes of the string utf8Value (which is not really UTF-8... see below) in the local encoding (possibly Cp1252 (Windows) or ISO-8859-1 (Linux), or EUC-KR if it's Korean Windows), and then read them as if they were EUC-KR... which they aren't.
    The key here is that Strings in Java are not of any encoding. They are pure Unicode values. Encodings only matter when converting to or from bytes. The strings stored in a file or sent over the net have to convert to bytes since that's what is stored/sent, just bytes. The encoding defines how the characters can be encoded into 1 or more bytes, and thus reconstructed.
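    To make that concrete, here is a minimal sketch (class and variable names are mine) that decodes the partner's %xx UTF-8 value into a pure Unicode String and re-encodes it as %xx EUC-KR:
        import java.net.URLDecoder;
        import java.net.URLEncoder;

        public class RecodeDemo {
            public static void main(String[] args) throws Exception {
                // the %xx-encoded UTF-8 value from the partner's querystring
                String fromPartner = "%EA%B3%A0%EB%AF%BC%ED%95%98%EC%9E%90%21";
                // decode the %xx bytes as UTF-8 -> a pure Unicode String
                String unicode = URLDecoder.decode(fromPartner, "UTF-8");
                // re-encode the same Unicode String as %xx EUC-KR
                String forSite = URLEncoder.encode(unicode, "EUC-KR");
                System.out.println(forSite); // expected: %B0%ED%B9%CE%C7%CF%C0%DA%21
            }
        }
    The point is that the Unicode String in the middle has no encoding of its own; the two encodings only come into play at the decode and encode steps.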

  • Character encoding in flash

    I'm trying to view this application in Russian:
    http://www.h-ck.ru/map/address.php3?x=237&y=176
    But it's showing up in Unicode or Western. How can I change
    the encoding in the flash application? I've played with all the
    encoding settings in both Firefox and IE and they only change
    encoding for the browser text, not in the flash.
    Thanks!

    Since player 6 the standard encoding is Unicode (a good thing,
    as it also covers Russian). You can (even if not recommended) tell
    the player to use the system's encoding (likely to be Russian in
    your case) when you make/program an application, but I don't know
    of a way to change the encoding on the user's side.
    If that application was made to run in player 5 (probably the
    only explanation for the strange behavior) it would run as expected
    in player 5 (on a Russian system environment); on the other hand
    I'd think that even players > 5 would switch the encoding to
    system automatically for compatibility reasons when playing a file
    made for version 5.
    (Adobe/MM usually does a great job in that regard; kudos!)
    So it might have been a server upgrade on the side of the app
    provider, and they're likely to learn about their error soon.
    Just educated guessing here.

  • Character encoding with CF and MySQL

    Okay, I thought this would be rather straightforward, but
    apparently not. I have set up my site to use UTF-8: my cfm
    pages, the MySQL table, even Dreamweaver. The problem is, when I
    input international characters via a form they get written correctly
    to the MySQL table; however, when I retrieve them in a query and
    display them on the page they are displayed incorrectly.
    On my input.cfm page I'll enter the string
    "Téstïñg" in the textbox and submit it. If I look at
    the record via the MySQL Browser it appears as it should. However,
    when I display it on my output.cfm page it shows the record as
    "T�st��g", and will do so until I change the
    meta tag to use charset=ISO-8859-1. Am I missing something, or is
    this how it is supposed to work?
    My input.cfm page is set up with both the
    <cfprocessingdirective suppresswhitespace="YES"
    pageencoding="UTF-8">
    <meta http-equiv="Content-Type" content="text/html;
    charset=UTF-8">
    tags and a regular input formfield that writes to the MySQL
    database.
    The MySQL table is configured to use the utf8 char set and
    utf8_unicode_ci collation.
    And just to be safe I included
    useUnicode=true&characterEncoding=utf8&characterSetResults=utf8
    in the connection string on the CF Admin datasource setup page.
    I'm running CF 6.1, MySQL 4.1, the latest version of Apache
    Server on a Win2K3 box. I was running the 3.0.16 MySQL JDBC driver
    but I upgraded it to the 5.0.6 this morning thinking that may fix
    my issue.

    I'm still unsure why this works, but I've found a solution. I
    switched all my pages over to character set ISO-8859-1, with the
    exception of my database table, and it works. I get all the
    normal-range characters along with the extended Unicode characters
    to write to the database and output correctly. Unicode characters
    actually write to the table as their HTML-coded character entities.
    If someone feels the need to enlighten me as to why this
    works, please feel free; I'm always willing to learn.

  • Barcode 128B and wrong character encoding?

    Hello.
    I'm using Barcode Type Code 128B in Adobe Form. The Barcode is linked to data field MATNR.
    If MATNR is numeric, the barcode information is set correctly. But if MATNR is e.g. 012236-602-26,
    then our barcode reader recognizes the "-" characters as the char "ß".
    In the properties of the barcode I can't adjust the codepage or font type.
    Does anyone have experience with this issue?
    Best regards,
    Sebastian


  • Wrong character encoding in MS Excel

    Hi!
    I get and update data in the database (8i) using oo4o in MS Excel VBA. The data is in Latvian (NLS_LANG is AMERICAN_AMERICA.WE8ISO8859P1).
    The code is:
    Set OSes = CreateObject("OracleInProcServer.XOraSession")
    Set ODb = OSes.OpenDatabase("test", "scott/tiger", 0&)
    Set ODy = ODb.createdynaset("select ename from emp ", 0)
    The problem is, when I get data (ename), or put data into an OraParameter object, all Latvian letters are converted to some symbols. An example of the result is 'KredØta', where in place of Ø there should be the letter ī.
    I've used this code for some years with Office 2000 and Oracle Client 8i and everything was OK, but now I have Office 2003 and Client 10g, and it does not work :(

    Try to format the column in your query. Use to_char, i.e. SELECT to_char(ename) ename FROM emp

  • Flash PHP MySQL issue

    Can someone please tell me what I'm doing wrong:
    (Actionscript)
    var db_out:URLVariables = new URLVariables();
    db_out.from = "property";
    db_out.where = "`state`='OR'";
    var db_req:URLRequest = new URLRequest("php/search.php");
    db_req.data = db_out;
    db_req.method = URLRequestMethod.POST;
    ldr_db.load(db_req);
    (PHP)
    $select='`id`';
    $from=$_POST['from'];
    $where=$_POST['where'];
    $order='`type`';
    $result=mysql_query('SELECT '.$select.' FROM `'.$from.'` WHERE '.$where.' ORDER BY '.$order);
    This returns no results.
    The issue is within the db_out.where variable.
    Other searches where db_out.where is a Number (e.g. where = "`status`=1") work fine.
    So how do I pass a String query from Flash to MySQL?
    If it were a single variable sent to PHP, I would wrap it in single quotes (e.g. 'SELECT '.$select.' FROM `'.$from.'` WHERE `state`='."'".$var."'".' ORDER BY '.$order), but how do I properly send a WHERE statement involving a String from Flash to MySQL?
    much appreciated,

    I figured it out. No idea why this is an issue, but it seems to be.
    If Flash passes a string to PHP to be used in a MySQL query, and that string contains single quotes (e.g. `city`='Portland'), the query will return 0 results, even if its WHERE clause had an OR conditional in it not using single quotes (e.g. `status`=1).
    So here's the fix:
    use some unused symbol in place of the single quotes (e.g. ^) in Flash, and have PHP use str_replace to swap them back to single quotes.
    I don't know why PHP has to be the one to insert the single quotes, but it seems it does.
    E.g.:
    $where = str_replace('^', "'", $_POST['where']);
    I hope this saves someone the hours I spent trying to figure this one out!!

  • Irish (Gaeilge) Character preservation from MySQL

    (Sorry if this is the wrong forum..)
    I'm building a web application which uses the Irish language. This is my first time working on a Java program that uses a language other than English, so I need to teach myself about character encoding and charset issues.
    Specifically, I need to retrieve information from the database. I've got data in a MySQL database that is in Irish, but when I retrieve a column using
    rs.getString("column_name") ("rs" being an instance of java.sql.ResultSet) and print it to a log or the console, I can see that it's garbled. The Irish fada is being lost and something else is appearing.
    If anyone can point me in a direction where I can do some relevant reading, or provide pointers, I'd be grateful. Even posting simple phrases or helpful terms for a Bing/Google search would be helpful.

    Thank you very much for your reply
    Kayaman wrote:
    As a Finn I'm familiar with the problem.
    As there might often be applications localized into Finnish, you can never know if your output of "ärränkierränympäriorren" will display nicely.
    That displays nicely right there :o)
    In your particular case you need to make sure of (at least) 2 things.
    First, make sure that the database is set to use an appropriate encoding (UTF-8 will do nicely), otherwise the data will be garbled already in the db.
    Okay, I've updated the character set of the database:
    ALTER DATABASE mydatabase CHARACTER SET utf8;
    Secondly, the output. If the console or whatever output you're using (such as writing to a file) assumes that the data is in, let's say, ASCII, the data will be garbled in the output phase.
    Okay, I've included an encoding setting in the log4j properties file:
    log4j.appender.A1.Encoding=UTF-8
    Unfortunately I'm still not sure what is happening. The text that appears in the log file remains garbled, but now it's garbled differently from before.
    When I view the table data in the mysql command line client program, the text appears as it should, with the fada used as it should be. But viewing the data from Java is not working.
    Of course if your data is travelling a long way from the db to the output, there's a chance it'll become garbled in between. But that's a bit less likely.
    Right now I'm just working with it locally.
    I'm sure Google will find you something like "everything you need to know about character encoding" or such.
    I'll try this.
    Tá sé seo deacair. ("This is difficult.") :o(
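    One thing worth checking, as an assumption on my part rather than something confirmed in this thread: the ColdFusion/MySQL thread above used useUnicode=true&characterEncoding=utf8 in the JDBC connection string, and the same parameters apply to plain Connector/J. A sketch (credentials, table, and column names are placeholders):
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class FadaCheck {
            public static void main(String[] args) throws Exception {
                Class.forName("com.mysql.jdbc.Driver"); // pre-JDBC4 driver registration
                // Without characterEncoding, the driver may decode the column
                // bytes with the platform default charset and mangle the fada.
                String url = "jdbc:mysql://localhost:3306/mydatabase"
                           + "?useUnicode=true&characterEncoding=UTF-8";
                Connection con = DriverManager.getConnection(url, "user", "pass");
                Statement st = con.createStatement();
                ResultSet rs = st.executeQuery("SELECT column_name FROM my_table");
                while (rs.next()) {
                    // The String here is pure Unicode; if it is already garbled
                    // at this point, the driver mis-decoded it, not log4j.
                    System.out.println(rs.getString("column_name"));
                }
                con.close();
            }
        }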

  • How can I tell what character encoding is sent from the browser?

    Hi,
    I am developing a servlet which is supposed to be used to send and receive messages
    in multiple character sets. However, I read in previous postings that each
    WebLogic Server can only support one input character encoding. Is that true?
    And do you have any suggestions on how I can do what I want? For example, I
    have an HTML form for people to post any comments (they may post in any character set,
    like Shift-JIS, Big5, GB, etc.). I need to know what character encoding they are
    using before I can read it correctly in the servlet and save it in the database.

    From what I understand (I haven't used it yet), 6.1 supports the 2.3
    servlet spec. That should have a method to set the encoding.
    Otherwise, I don't think you can support multiple encodings in one
    instance of WebLogic.
    From what I know, browsers don't give any indication at all about what
    encoding they're using. I've read some chatter about the HTTP spec
    being changed so it's always UTF-8, but that's a Some Day(TM) kind of
    thing, so you're stuck with all the stuff out there now which doesn't do
    everything in UTF-8.
    Sorry for the bad news, but if it makes you feel any better, I've felt
    your pain. Oh, and trying to process multipart/form-data (file upload)
    forms is even worse; from what I've seen, the API that people talk
    about on these newsgroups assumes everything is ISO-8859-1.
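    For what it's worth, the 2.3-spec method referred to above is request.setCharacterEncoding. A minimal sketch (servlet and field names are illustrative, and the form page is assumed to declare the matching charset):
        import java.io.IOException;
        import javax.servlet.ServletException;
        import javax.servlet.http.HttpServlet;
        import javax.servlet.http.HttpServletRequest;
        import javax.servlet.http.HttpServletResponse;

        public class CommentServlet extends HttpServlet {
            protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                    throws ServletException, IOException {
                // Must run before any parameter is read, or the container has
                // already decoded the body with its default encoding.
                req.setCharacterEncoding("UTF-8");
                String comment = req.getParameter("comment");
                // Declare the response charset so the browser decodes it the same way.
                resp.setContentType("text/html; charset=UTF-8");
                resp.getWriter().println("Received: " + comment);
            }
        }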

  • Restrict incoming stream from Flash Media Live Encoder

    Has there been any further discussion regarding restricting who can stream to FMSS (not FMIS or the development version)?
    I found this article, but I don't know if this was ever fully resolved:
    http://forums.adobe.com/message/3300894
    I don't want just anyone to connect to my server to stream live. I want to restrict by IP address, authentication, or any other means.
    Can you please share your ideas?
    Thanks
    Shan.

    Hi,
    A few quick ideas that I can share:
    1. Authentication plugins are the best way to control and decide the connection restrictions coming to FMS. Username/password pairs can be made mandatory.
    2. FMS can detect user agents. For example, it can detect connections from FMLE (with different versions) and Flash Players (with different versions), and a choice can be made on the server side depending on this. An example snippet is below:
    // A request from Flash Media Live Encoder is not checked for authentication
    if( (p_client.agent.indexOf("FME")==-1) && (p_client.agent.indexOf("FMLE")==-1))
    3. Use SWF verification (with user agents) to allow connections only from specified Flash Player applications.
    4. Use the allowedHTMLdomains whitelist to allow connections from specific domains (a single IP would also do).
    5. Do not expose the application to which publishing happens to the public; multi-publish to other applications to which clients can connect.
    Hope some of them are useful. Thank you!

  • Problems with character displaying from XML to Java

    Hi to all!
    I am making a Flash chat program which takes English and Chinese
    and talks to Java using XML, and the Java server distributes the messages to all users.
    Simply using Java as a distributor of the data was fine, but when I tried to
    keep records of the chat in the MySQL database, I realized that all the
    Chinese characters are displayed as random characters.
    After some thought, I realized that the ASCII number representation Flash uses
    for each Chinese word may be different from the Java ASCII number representation, and therefore, when Java sends
    these ASCII characters to the MySQL database, the words are no longer the words I wanted.
    I am using brute force: on the Flash side, I take the ASCII code of each Chinese character that I
    typed and send them to Java, then recode them in Java with Java's Chinese characters
    before sending to the MySQL database. It works (I tried it), but I would need to type up at least 3000 characters!!!!!!!!
    This is insane!
    I also wonder if the problem arises because Java encodes Chinese in Unicode, so
    it does not recognize the ASCII, and therefore the result is certainly weird.
    If so, what do I need to do in order to convert ASCII into Unicode????
    sincerely,
    Suansworks

    Hello.
    Flash has some problems with UTF, but it seems that your problem is with the MySQL database: if you want to put UTF data in a MySQL database you need to get the latest version, which is beta or alpha. Please try it with another database that supports UTF.
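    On the conversion question above: Java Strings are already Unicode, so there is nothing to convert by hand; what matters is decoding the incoming bytes with the right charset. A minimal sketch, assuming the Flash client sends its XML as UTF-8 over a plain socket (the socket setup is hypothetical):
        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.net.Socket;

        public class ChatReader {
            // Decode the socket bytes explicitly as UTF-8; relying on the
            // platform default encoding is what turns the Chinese characters
            // into garbage before they ever reach MySQL.
            static String readMessage(Socket socket) throws Exception {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), "UTF-8"));
                return in.readLine();
            }
        }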

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical in all of them and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    Fast forward to today. The two file formats where we can explain this best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard, and given the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes as a double-byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6-byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every unicode character. And assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
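    A quick way to see those variable widths, as a sketch in Java (the musical clef sits outside the BMP and needs four bytes):
        public class Utf8Widths {
            public static void main(String[] args) throws Exception {
                System.out.println("A".getBytes("UTF-8").length);  // 1 byte (ASCII range)
                System.out.println("ß".getBytes("UTF-8").length);  // 2 bytes
                System.out.println("中".getBytes("UTF-8").length); // 3 bytes
                System.out.println("𝄞".getBytes("UTF-8").length);  // 4 bytes
            }
        }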
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that, in their text editor using the codepage for their region, inserts a character like ß, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding, and that is now the first character of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
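    That failure mode is easy to reproduce (a sketch; windows-1252 stands in for "the codepage for their region"):
        public class MojibakeDemo {
            public static void main(String[] args) throws Exception {
                // An editor saving "ß" in windows-1252 writes the single byte 0xDF.
                byte[] saved = "ß".getBytes("windows-1252");
                // A reader honoring the declared UTF-8 encoding sees 0xDF as the
                // start of a multi-byte sequence with no valid continuation byte,
                // so the character comes back as the U+FFFD replacement character.
                System.out.println(new String(saved, "UTF-8"));
            }
        }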
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    Wrapping it up
    I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
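    As a postscript, here is one way Point 3 can look in Java, as a sketch: the stream wrappers pin the encoding instead of silently using the default codepage.
        import java.io.*;

        public class ExplicitEncoding {
            public static void main(String[] args) throws Exception {
                // Write with an explicit encoding.
                Writer w = new OutputStreamWriter(
                        new FileOutputStream("out.txt"), "UTF-8");
                w.write("bäh");
                w.close();
                // Read it back with the same explicit encoding.
                BufferedReader r = new BufferedReader(new InputStreamReader(
                        new FileInputStream("out.txt"), "UTF-8"));
                System.out.println(r.readLine());
                r.close();
            }
        }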

    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
    Let's start off with two key items:
    1.Unicode does not solve this issue for us (yet).
    2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
    And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
    Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.
    They might only use that range, but that is a different issue, especially since that range is exactly the same as the UTF-8 character set anyway.
    The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical in all of them and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
    The above is only true for small-volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years, then a column with a size of 8 bytes is significantly different from one with 16 bytes.
    Fast forward to today. The two file formats where we can explain this best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
    Now let's look at UTF-8, because as the standard, and given the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
    UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes as a double-byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6-byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every unicode character. And assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
    The first part of that paragraph is odd. The first 128 characters of unicode, all unicode, are based on ASCII. The representational format of UTF-8 is required to implement unicode, thus it must represent those characters. It uses the idiom supported by variable-width encodings to do that.
    But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that, in their text editor using the codepage for their region, inserts a character like ß, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding, and that is now the first character of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
    Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with HTML/XML.
    Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
    Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know Java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
    Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in Java with escaped unicode characters which will fail to compile.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone creating an inventory system for a standalone store to craft a solution that supports multiple languages.
    And another example: with high-volume systems, moving/storing bytes is relevant. As such, one must carefully consider each text element as to whether it is customer-consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems incremental savings impact operating costs and the marketing advantage of speed.
