File character encoding format conversion

Hi People,
uname -a
Linux abcd.us.com 2.6.32-400.21.1.el5uek #1 SMP Wed Feb 20 01:35:01 PST 2013 x86_64 x86_64 x86_64 GNU/Linux
I am trying to convert a file abcd.dat
# file abcd.dat
abcd.dat: binary Computer Graphics Metafile
#file -i abcd.dat
abcd.dat: application/octet-stream
I've tried dos2unix, in vain:
dos2unix: converting file abcd.dat to UNIX format ...
dos2unix: problems converting file abcd.dat
I've used iconv successfully earlier with this command
iconv -f UTF-16 -t UTF-8  abcd.dat > abcd.abc
Only, this time I do not know the "from" encoding of the file.
#iconv -l
The following list contain all the coded character sets known.  This does
not necessarily mean that all combinations of these names can be used for
the FROM and TO command line parameters.  One coded character set can be
listed with several different names (aliases).
  437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
  866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4,
  8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4,
  ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110,
  ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5,
  BIG-FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BS_4730, CA, CN-BIG5, CN-GB,
  CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278, CP280,
  CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424, CP437,
  CP500, CP737, CP775, CP803, CP813, CP819, CP850, CP851, CP852, CP855, CP856,
  CP857, CP860, CP861, CP862, CP863, CP864, CP865, CP866, CP866NAV, CP868,
  CP869, CP870, CP871, CP874, CP875, CP880, CP891, CP901, CP902, CP903, CP904,
  CP905, CP912, CP915, CP916, CP918, CP920, CP921, CP922, CP930, CP932, CP933,
  CP935, CP936, CP937, CP939, CP949, CP950, CP1004, CP1008, CP1025, CP1026,
  CP1046, CP1047, CP1070, CP1079, CP1081, CP1084, CP1089, CP1097, CP1112,
  CP1122, CP1123, CP1124, CP1125, CP1129, CP1130, CP1132, CP1133, CP1137,
  CP1140, CP1141, CP1142, CP1143, CP1144, CP1145, CP1146, CP1147, CP1148,
  CP1149, CP1153, CP1154, CP1155, CP1156, CP1157, CP1158, CP1160, CP1161,
  CP1162, CP1163, CP1164, CP1166, CP1167, CP1250, CP1251, CP1252, CP1253,
  CP1254, CP1255, CP1256, CP1257, CP1258, CP1361, CP1364, CP1371, CP1388,
  CP1390, CP1399, CP4517, CP4899, CP4909, CP4971, CP5347, CP9030, CP9066,
  CP9448, CP10007, CP12712, CP16804, CPIBM861, CSA7-1, CSA7-2, CSASCII,
  CSA_T500-1983, CSA_T500, CSA_Z243.4-1985-1, CSA_Z243.4-1985-2,
  CSA_Z243.419851, CSA_Z243.419852, CSDECMCS, CSEBCDICATDE, CSEBCDICATDEA,
  CSEBCDICCAFR, CSEBCDICDKNO, CSEBCDICDKNOA, CSEBCDICES, CSEBCDICESA,
  CSEBCDICESS, CSEBCDICFISE, CSEBCDICFISEA, CSEBCDICFR, CSEBCDICIT, CSEBCDICPT,
  CSEBCDICUK, CSEBCDICUS, CSEUCKR, CSEUCPKDFMTJAPANESE, CSGB2312, CSHPROMAN8,
  CSIBM037, CSIBM038, CSIBM273, CSIBM274, CSIBM275, CSIBM277, CSIBM278,
  CSIBM280, CSIBM281, CSIBM284, CSIBM285, CSIBM290, CSIBM297, CSIBM420,
  CSIBM423, CSIBM424, CSIBM500, CSIBM803, CSIBM851, CSIBM855, CSIBM856,
  CSIBM857, CSIBM860, CSIBM863, CSIBM864, CSIBM865, CSIBM866, CSIBM868,
  CSIBM869, CSIBM870, CSIBM871, CSIBM880, CSIBM891, CSIBM901, CSIBM902,
  CSIBM903, CSIBM904, CSIBM905, CSIBM918, CSIBM921, CSIBM922, CSIBM930,
  CSIBM932, CSIBM933, CSIBM935, CSIBM937, CSIBM939, CSIBM943, CSIBM1008,
  CSIBM1025, CSIBM1026, CSIBM1097, CSIBM1112, CSIBM1122, CSIBM1123, CSIBM1124,
  CSIBM1129, CSIBM1130, CSIBM1132, CSIBM1133, CSIBM1137, CSIBM1140, CSIBM1141,
  CSIBM1142, CSIBM1143, CSIBM1144, CSIBM1145, CSIBM1146, CSIBM1147, CSIBM1148,
  CSIBM1149, CSIBM1153, CSIBM1154, CSIBM1155, CSIBM1156, CSIBM1157, CSIBM1158,
  CSIBM1160, CSIBM1161, CSIBM1163, CSIBM1164, CSIBM1166, CSIBM1167, CSIBM1364,
  CSIBM1371, CSIBM1388, CSIBM1390, CSIBM1399, CSIBM4517, CSIBM4899, CSIBM4909,
  CSIBM4971, CSIBM5347, CSIBM9030, CSIBM9066, CSIBM9448, CSIBM12712,
  CSIBM16804, CSIBM11621162, CSISO4UNITEDKINGDOM, CSISO10SWEDISH,
  CSISO11SWEDISHFORNAMES, CSISO14JISC6220RO, CSISO15ITALIAN, CSISO16PORTUGESE,
  CSISO17SPANISH, CSISO18GREEK7OLD, CSISO19LATINGREEK, CSISO21GERMAN,
  CSISO25FRENCH, CSISO27LATINGREEK1, CSISO49INIS, CSISO50INIS8,
  CSISO51INISCYRILLIC, CSISO58GB1988, CSISO60DANISHNORWEGIAN,
  CSISO60NORWEGIAN1, CSISO61NORWEGIAN2, CSISO69FRENCH, CSISO84PORTUGUESE2,
  CSISO85SPANISH2, CSISO86HUNGARIAN, CSISO88GREEK7, CSISO89ASMO449, CSISO90,
  CSISO92JISC62991984B, CSISO99NAPLPS, CSISO103T618BIT, CSISO111ECMACYRILLIC,
  CSISO121CANADIAN1, CSISO122CANADIAN2, CSISO139CSN369103, CSISO141JUSIB1002,
  CSISO143IECP271, CSISO150, CSISO150GREEKCCITT, CSISO151CUBA,
  CSISO153GOST1976874, CSISO646DANISH, CSISO2022CN, CSISO2022JP, CSISO2022JP2,
  CSISO2022KR, CSISO2033, CSISO5427CYRILLIC, CSISO5427CYRILLIC1981,
  CSISO5428GREEK, CSISO10367BOX, CSISOLATIN1, CSISOLATIN2, CSISOLATIN3,
  CSISOLATIN4, CSISOLATIN5, CSISOLATIN6, CSISOLATINARABIC, CSISOLATINCYRILLIC,
  CSISOLATINGREEK, CSISOLATINHEBREW, CSKOI8R, CSKSC5636, CSMACINTOSH,
  CSNATSDANO, CSNATSSEFI, CSN_369103, CSPC8CODEPAGE437, CSPC775BALTIC,
  CSPC850MULTILINGUAL, CSPC862LATINHEBREW, CSPCP852, CSSHIFTJIS, CSUCS4,
  CSUNICODE, CSWINDOWS31J, CUBA, CWI-2, CWI, CYRILLIC, DE, DEC-MCS, DEC,
  DECMCS, DIN_66003, DK, DS2089, DS_2089, E13B, EBCDIC-AT-DE-A, EBCDIC-AT-DE,
  EBCDIC-BE, EBCDIC-BR, EBCDIC-CA-FR, EBCDIC-CP-AR1, EBCDIC-CP-AR2,
  EBCDIC-CP-BE, EBCDIC-CP-CA, EBCDIC-CP-CH, EBCDIC-CP-DK, EBCDIC-CP-ES,
  EBCDIC-CP-FI, EBCDIC-CP-FR, EBCDIC-CP-GB, EBCDIC-CP-GR, EBCDIC-CP-HE,
  EBCDIC-CP-IS, EBCDIC-CP-IT, EBCDIC-CP-NL, EBCDIC-CP-NO, EBCDIC-CP-ROECE,
  EBCDIC-CP-SE, EBCDIC-CP-TR, EBCDIC-CP-US, EBCDIC-CP-WT, EBCDIC-CP-YU,
  EBCDIC-CYRILLIC, EBCDIC-DK-NO-A, EBCDIC-DK-NO, EBCDIC-ES-A, EBCDIC-ES-S,
  EBCDIC-ES, EBCDIC-FI-SE-A, EBCDIC-FI-SE, EBCDIC-FR, EBCDIC-GREEK, EBCDIC-INT,
  EBCDIC-INT1, EBCDIC-IS-FRISS, EBCDIC-IT, EBCDIC-JP-E, EBCDIC-JP-KANA,
  EBCDIC-PT, EBCDIC-UK, EBCDIC-US, EBCDICATDE, EBCDICATDEA, EBCDICCAFR,
  EBCDICDKNO, EBCDICDKNOA, EBCDICES, EBCDICESA, EBCDICESS, EBCDICFISE,
  EBCDICFISEA, EBCDICFR, EBCDICISFRISS, EBCDICIT, EBCDICPT, EBCDICUK, EBCDICUS,
  ECMA-114, ECMA-118, ECMA-128, ECMA-CYRILLIC, ECMACYRILLIC, ELOT_928, ES, ES2,
  EUC-CN, EUC-JISX0213, EUC-JP-MS, EUC-JP, EUC-KR, EUC-TW, EUCCN, EUCJP-MS,
  EUCJP-OPEN, EUCJP-WIN, EUCJP, EUCKR, EUCTW, FI, FR, GB, GB2312, GB13000,
  GB18030, GBK, GB_1988-80, GB_198880, GEORGIAN-ACADEMY, GEORGIAN-PS,
  GOST_19768-74, GOST_19768, GOST_1976874, GREEK-CCITT, GREEK, GREEK7-OLD,
  GREEK7, GREEK7OLD, GREEK8, GREEKCCITT, HEBREW, HP-ROMAN8, HPROMAN8, HU,
  IBM-803, IBM-856, IBM-901, IBM-902, IBM-921, IBM-922, IBM-930, IBM-932,
  IBM-933, IBM-935, IBM-937, IBM-939, IBM-943, IBM-1008, IBM-1025, IBM-1046,
  IBM-1047, IBM-1097, IBM-1112, IBM-1122, IBM-1123, IBM-1124, IBM-1129,
  IBM-1130, IBM-1132, IBM-1133, IBM-1137, IBM-1140, IBM-1141, IBM-1142,
  IBM-1143, IBM-1144, IBM-1145, IBM-1146, IBM-1147, IBM-1148, IBM-1149,
  IBM-1153, IBM-1154, IBM-1155, IBM-1156, IBM-1157, IBM-1158, IBM-1160,
  IBM-1161, IBM-1162, IBM-1163, IBM-1164, IBM-1166, IBM-1167, IBM-1364,
  IBM-1371, IBM-1388, IBM-1390, IBM-1399, IBM-4517, IBM-4899, IBM-4909,
  IBM-4971, IBM-5347, IBM-9030, IBM-9066, IBM-9448, IBM-12712, IBM-16804,
  IBM037, IBM038, IBM256, IBM273, IBM274, IBM275, IBM277, IBM278, IBM280,
  IBM281, IBM284, IBM285, IBM290, IBM297, IBM367, IBM420, IBM423, IBM424,
  IBM437, IBM500, IBM775, IBM803, IBM813, IBM819, IBM848, IBM850, IBM851,
  IBM852, IBM855, IBM856, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864,
  IBM865, IBM866, IBM866NAV, IBM868, IBM869, IBM870, IBM871, IBM874, IBM875,
  IBM880, IBM891, IBM901, IBM902, IBM903, IBM904, IBM905, IBM912, IBM915,
  IBM916, IBM918, IBM920, IBM921, IBM922, IBM930, IBM932, IBM933, IBM935,
  IBM937, IBM939, IBM943, IBM1004, IBM1008, IBM1025, IBM1026, IBM1046, IBM1047,
  IBM1089, IBM1097, IBM1112, IBM1122, IBM1123, IBM1124, IBM1129, IBM1130,
  IBM1132, IBM1133, IBM1137, IBM1140, IBM1141, IBM1142, IBM1143, IBM1144,
  IBM1145, IBM1146, IBM1147, IBM1148, IBM1149, IBM1153, IBM1154, IBM1155,
  IBM1156, IBM1157, IBM1158, IBM1160, IBM1161, IBM1162, IBM1163, IBM1164,
  IBM1166, IBM1167, IBM1364, IBM1371, IBM1388, IBM1390, IBM1399, IBM4517,
  IBM4899, IBM4909, IBM4971, IBM5347, IBM9030, IBM9066, IBM9448, IBM12712,
  IBM16804, IEC_P27-1, IEC_P271, INIS-8, INIS-CYRILLIC, INIS, INIS8,
  INISCYRILLIC, ISIRI-3342, ISIRI3342, ISO-2022-CN-EXT, ISO-2022-CN,
  ISO-2022-JP-2, ISO-2022-JP-3, ISO-2022-JP, ISO-2022-KR, ISO-8859-1,
  ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7,
  ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14,
  ISO-8859-15, ISO-8859-16, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4,
  ISO-10646/UTF-8, ISO-10646/UTF8, ISO-CELTIC, ISO-IR-4, ISO-IR-6, ISO-IR-8-1,
  ISO-IR-9-1, ISO-IR-10, ISO-IR-11, ISO-IR-14, ISO-IR-15, ISO-IR-16, ISO-IR-17,
  ISO-IR-18, ISO-IR-19, ISO-IR-21, ISO-IR-25, ISO-IR-27, ISO-IR-37, ISO-IR-49,
  ISO-IR-50, ISO-IR-51, ISO-IR-54, ISO-IR-55, ISO-IR-57, ISO-IR-60, ISO-IR-61,
  ISO-IR-69, ISO-IR-84, ISO-IR-85, ISO-IR-86, ISO-IR-88, ISO-IR-89, ISO-IR-90,
  ISO-IR-92, ISO-IR-98, ISO-IR-99, ISO-IR-100, ISO-IR-101, ISO-IR-103,
  ISO-IR-109, ISO-IR-110, ISO-IR-111, ISO-IR-121, ISO-IR-122, ISO-IR-126,
  ISO-IR-127, ISO-IR-138, ISO-IR-139, ISO-IR-141, ISO-IR-143, ISO-IR-144,
  ISO-IR-148, ISO-IR-150, ISO-IR-151, ISO-IR-153, ISO-IR-155, ISO-IR-156,
  ISO-IR-157, ISO-IR-166, ISO-IR-179, ISO-IR-193, ISO-IR-197, ISO-IR-199,
  ISO-IR-203, ISO-IR-209, ISO-IR-226, ISO/TR_11548-1, ISO646-CA, ISO646-CA2,
  ISO646-CN, ISO646-CU, ISO646-DE, ISO646-DK, ISO646-ES, ISO646-ES2, ISO646-FI,
  ISO646-FR, ISO646-FR1, ISO646-GB, ISO646-HU, ISO646-IT, ISO646-JP-OCR-B,
  ISO646-JP, ISO646-KR, ISO646-NO, ISO646-NO2, ISO646-PT, ISO646-PT2,
  ISO646-SE, ISO646-SE2, ISO646-US, ISO646-YU, ISO2022CN, ISO2022CNEXT,
  ISO2022JP, ISO2022JP2, ISO2022KR, ISO6937, ISO8859-1, ISO8859-2, ISO8859-3,
  ISO8859-4, ISO8859-5, ISO8859-6, ISO8859-7, ISO8859-8, ISO8859-9, ISO8859-10,
  ISO8859-11, ISO8859-13, ISO8859-14, ISO8859-15, ISO8859-16, ISO11548-1,
  ISO88591, ISO88592, ISO88593, ISO88594, ISO88595, ISO88596, ISO88597,
  ISO88598, ISO88599, ISO885910, ISO885911, ISO885913, ISO885914, ISO885915,
  ISO885916, ISO_646.IRV:1991, ISO_2033-1983, ISO_2033, ISO_5427-EXT, ISO_5427,
  ISO_5427:1981, ISO_5427EXT, ISO_5428, ISO_5428:1980, ISO_6937-2,
  ISO_6937-2:1983, ISO_6937, ISO_6937:1992, ISO_8859-1, ISO_8859-1:1987,
  ISO_8859-2, ISO_8859-2:1987, ISO_8859-3, ISO_8859-3:1988, ISO_8859-4,
  ISO_8859-4:1988, ISO_8859-5, ISO_8859-5:1988, ISO_8859-6, ISO_8859-6:1987,
  ISO_8859-7, ISO_8859-7:1987, ISO_8859-7:2003, ISO_8859-8, ISO_8859-8:1988,
  ISO_8859-9, ISO_8859-9:1989, ISO_8859-10, ISO_8859-10:1992, ISO_8859-14,
  ISO_8859-14:1998, ISO_8859-15, ISO_8859-15:1998, ISO_8859-16,
  ISO_8859-16:2001, ISO_9036, ISO_10367-BOX, ISO_10367BOX, ISO_11548-1,
  ISO_69372, IT, JIS_C6220-1969-RO, JIS_C6229-1984-B, JIS_C62201969RO,
  JIS_C62291984B, JOHAB, JP-OCR-B, JP, JS, JUS_I.B1.002, KOI-7, KOI-8, KOI8-R,
  KOI8-T, KOI8-U, KOI8, KOI8R, KOI8U, KSC5636, L1, L2, L3, L4, L5, L6, L7, L8,
  L10, LATIN-9, LATIN-GREEK-1, LATIN-GREEK, LATIN1, LATIN2, LATIN3, LATIN4,
  LATIN5, LATIN6, LATIN7, LATIN8, LATIN10, LATINGREEK, LATINGREEK1,
  MAC-CYRILLIC, MAC-IS, MAC-SAMI, MAC-UK, MAC, MACCYRILLIC, MACINTOSH, MACIS,
  MACUK, MACUKRAINIAN, MIK, MS-ANSI, MS-ARAB, MS-CYRL, MS-EE, MS-GREEK,
  MS-HEBR, MS-MAC-CYRILLIC, MS-TURK, MS932, MS936, MSCP949, MSCP1361,
  MSMACCYRILLIC, MSZ_7795.3, MS_KANJI, NAPLPS, NATS-DANO, NATS-SEFI, NATSDANO,
  NATSSEFI, NC_NC0010, NC_NC00-10, NC_NC00-10:81, NF_Z_62-010,
  NF_Z_62-010_(1973), NF_Z_62-010_1973, NF_Z_62010, NF_Z_62010_1973, NO, NO2,
  NS_4551-1, NS_4551-2, NS_45511, NS_45512, OS2LATIN1, OSF00010001,
  OSF00010002, OSF00010003, OSF00010004, OSF00010005, OSF00010006, OSF00010007,
  OSF00010008, OSF00010009, OSF0001000A, OSF00010020, OSF00010100, OSF00010101,
  OSF00010102, OSF00010104, OSF00010105, OSF00010106, OSF00030010, OSF0004000A,
  OSF0005000A, OSF05010001, OSF100201A4, OSF100201A8, OSF100201B5, OSF100201F4,
  OSF100203B5, OSF1002011C, OSF1002011D, OSF1002035D, OSF1002035E, OSF1002035F,
  OSF1002036B, OSF1002037B, OSF10010001, OSF10020025, OSF10020111, OSF10020115,
  OSF10020116, OSF10020118, OSF10020122, OSF10020129, OSF10020352, OSF10020354,
  OSF10020357, OSF10020359, OSF10020360, OSF10020364, OSF10020365, OSF10020366,
  OSF10020367, OSF10020370, OSF10020387, OSF10020388, OSF10020396, OSF10020402,
  OSF10020417, PT, PT2, PT154, R8, RK1048, ROMAN8, RUSCII, SE, SE2,
  SEN_850200_B, SEN_850200_C, SHIFT-JIS, SHIFT_JIS, SHIFT_JISX0213, SJIS-OPEN,
  SJIS-WIN, SJIS, SS636127, STRK1048-2002, ST_SEV_358-88, T.61-8BIT, T.61,
  T.618BIT, TCVN-5712, TCVN, TCVN5712-1, TCVN5712-1:1993, TIS-620, TIS620-0,
  TIS620.2529-1, TIS620.2533-0, TIS620, TS-5881, TSCII, UCS-2, UCS-2BE,
  UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UCS2, UCS4, UHC, UJIS, UK, UNICODE,
  UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-7, UTF-8, UTF-16, UTF-16BE,
  UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF7, UTF8, UTF16, UTF16BE, UTF16LE,
  UTF32, UTF32BE, UTF32LE, VISCII, WCHAR_T, WIN-SAMI-2, WINBALTRIM,
  WINDOWS-31J, WINDOWS-874, WINDOWS-936, WINDOWS-1250, WINDOWS-1251,
  WINDOWS-1252, WINDOWS-1253, WINDOWS-1254, WINDOWS-1255, WINDOWS-1256,
  WINDOWS-1257, WINDOWS-1258, WINSAMI2, WS2, YU
==================================================
also,
#which od
/usr/bin/od
but I don't know how to use it.
==================================
#cat -v abcd.dat
has a lot of ^@
===================================
#echo $LANG
en_US.UTF-8
======================================================================================
#hexdump -C abcd.dat|head -5
00000000  00 22 00 34 00 36 00 32  00 39 00 33 00 22 00 7c  |.".4.6.2.9.3.".||
00000010  00 22 00 32 00 30 00 31  00 33 00 2d 00 31 00 31  |.".2.0.1.3.-.1.1|
00000020  00 2d 00 31 00 38 00 20  00 30 00 38 00 3a 00 30  |.-.1.8. .0.8.:.0|
00000030  00 39 00 3a 00 34 00 38  00 22 00 7c 00 22 00 33  |.9.:.4.8.".|.".3|
00000040  00 36 00 37 00 22 00 7c  00 22 00 53 00 75 00 73  |.6.7.".|.".S.u.s|
=======================================================================================
#vi abcd.tst
testing
esc:wq
#file abcd.tst
abcd.tst: ASCII text
Please let me know the complete iconv command, with the from and to encodings.
Appreciate any help.
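For what it's worth: in the hexdump above every character comes out as a 00 byte followed by the ASCII byte, which is the signature of UTF-16 big-endian text without a byte order mark. That would also explain the ^@ (NUL) characters cat -v shows and why dos2unix gives up. Assuming the whole file follows that pattern, the command would be:
iconv -f UTF-16BE -t UTF-8 abcd.dat > abcd.txt
Had the dump shown the ASCII byte first (the pattern xx 00), UTF-16LE would be the "from" encoding instead.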

Hi BalusC,
Just as we write <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in a JSP page,
is there something similar we can write in a .properties file?
Russian words are not displayed correctly in the browser. How can we display them in the correct format?
I have all the Russian words in my .properties file.
Thanks a lot
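For what it's worth: Java reads .properties files through Properties.load(InputStream) as ISO-8859-1, so Cyrillic characters must either be escaped as \uXXXX (the JDK's native2ascii tool does exactly that), or the file must be loaded through a Reader with an explicit charset, which Java 6 added. A minimal sketch, assuming the file is really saved as UTF-8; the file and key names are made up:
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Properties;

public class LoadRussianProps {
    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        // Properties.load(Reader) honours the reader's charset, unlike
        // load(InputStream), which always assumes ISO-8859-1
        Reader r = new InputStreamReader(new FileInputStream("messages_ru.properties"), "UTF-8");
        p.load(r);
        r.close();
        System.out.println(p.getProperty("greeting"));
    }
}
The page that displays the values must of course still be served with charset=utf-8, as in the meta tag above.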

Similar Messages

  • Character encoding format in j2me

    Hi,
    I am new to this forum.
    I have some queries regarding Character encoding support in j2me.
1. What character encoding formats are supported in J2ME?
2. Does it vary from device to device, or is it the same on all J2ME devices?
3. Do all J2ME devices support the UTF-8 format?
4. Do some devices support UTF-16 (or UTF-16BE/UTF-16LE)?
5. If a device supports the UTF-8 scheme, can we assume that it will support all languages (I am somewhat concerned about Chinese)?
Eagerly waiting for the feedback :)

Not all devices have support for even UTF-8.
Use this class I wrote for my projects. It is an implementation of Reader, so you can easily use it anywhere.
/* Created on 12.03.2008 14:21 */
package misc;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
/**
 * Reader of a UTF-8 encoded string from bytes.
 * @author RFK
 */
public class UTF8InputStreamReader extends Reader {
  InputStream is;
  /** Creates a new instance of UTF8InputStreamReader. */
  public UTF8InputStreamReader(InputStream is) {
    this.is = is;
  }
  public int read(char[] cbuf, int off, int len) throws IOException {
    int b, b2, b3, r = 0;
    while (len > 0) {
      b = is.read();
      if (b < 0) break;
      if (b < 128) {                 // 0xxxxxxx: single-byte (ASCII) sequence
        cbuf[off] = (char) b;
      } else if (b < 224) {          // 110xxxxx 10xxxxxx: two-byte sequence
        b2 = is.read();
        cbuf[off] = (char) (((b & 0x1F) << 6) | (b2 & 0x3F));
      } else {                       // 1110xxxx 10xxxxxx 10xxxxxx: three-byte sequence
        b2 = is.read();
        b3 = is.read();
        cbuf[off] = (char) (((b & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F));
      }
      r++;
      off++;
      len--;
    }
    return (r == 0) ? -1 : r;        // Reader contract: -1, not 0, signals end of stream
  }
  public void close() throws IOException {
    is.close();
  }
}
Edited by: MagDel on Jun 1, 2009 10:33 PM
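Usage is like any other Reader; a hypothetical snippet (the resource name is made up):
// inside an instance method: wrap any InputStream whose bytes are UTF-8 text
Reader r = new UTF8InputStreamReader(getClass().getResourceAsStream("/text_ru.txt"));
char[] buf = new char[256];
int n = r.read(buf, 0, buf.length);              // -1 means the stream was empty
String s = (n > 0) ? new String(buf, 0, n) : "";
r.close();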

  • How to set the system default file character encoding to UTF-8?

    Hi all. This is driving me nuts, on both my Windows box and Snow Leopard; I figure much more chance of finding the answer for OS X.
    My language and locale are set to Australian English. $LANG=en_AU.UTF-8
    However, as I believe is expected, OS X (and Windows for that matter) will create files by default with character encoding of Cp1252 (Latin-1). That is, the FILE encoding in the file metadata - the Byte Order Mark I believe. The file itself, not the characters written to it.
    This, in a word, bites. I don't want to be restricted to only ASCII by default, and it is causing me problems with certain software (a Firefox plugin) that creates text files, passing in UTF-8 encoded content, which is then mangled because the file encoding itself is still Cp1252. (I know, I've tested this by changing the file encoding manually and having it overwritten again by the plugin: works correctly.)
    As a simple example, just `touch somefile` from terminal creates a file in Cp1252 -- I'm obtaining that info by opening in jEdit by the way (anyone know of something better?).
    In other locales that are not English-based, I believe the default file encoding is UTF-8. But surely this can be controlled independently? There must be a system configuration value somewhere that specifies file encoding default. Can someone please tell me what it is?
    Thanks!

> However, as I believe is expected, OS X (and Windows for that matter) will create files by default with character encoding of Cp1252 (Latin-1). That is, the FILE encoding in the file metadata - the Byte Order Mark I believe. The file itself, not the characters written to it.
    Apps like TextEdit and Mail have settings that let you determine the encoding of text produced. The default would normally depend on the character content of the file, ranging from ASCII for basic English to Windows Latin-1 (Win 1252) or ISO Latin -1 (ISO 8859-1) to UTF-8 for other content.
Win 1252 is not ASCII; it has twice as many characters as the latter.
Byte Order Mark is something totally different -- it's a particular character used to signal certain encodings.
    http://en.wikipedia.org/wiki/Byteordermark
> As a simple example, just `touch somefile` from terminal creates a file in Cp1252 -- I'm obtaining that info by opening in jEdit by the way (anyone know of something better?).
A file created by `touch` is zero bytes long, by the way, so it has no encoding at all; jEdit can only be reporting its own default guess for an empty file. For what Terminal does and how to change it, it might be best to post in the Unix forum:
    http://discussions.apple.com/forum.jspa?forumID=735
    For problems with a FireFox plugin, it might be good to ask on their own forums as well.

  • Please help me in reading the ascii file in encoded format

Hi,
I am trying to read an ASCII file and I need to encode the text file. Can you suggest what different methods are available for me to encode the text file? Please suggest which is the most efficient method to use.

This question has been answered before; please search the forums in future.
You could do something like this; it's probably not the most efficient, though.
If you don't need to do it in code, the native2ascii tool does this conversion; see: http://java.sun.com/j2se/1.4.2/docs/tooldocs/windows/native2ascii.html
        // needs: import java.io.File, FileReader, FileOutputStream, OutputStreamWriter, IOException
        public static void convert(String vsFile) throws IOException
        {
          File f = new File(vsFile);
          FileReader fr = new FileReader(f);   // reads with the platform default encoding
          char[] buf;
          int bytesRead = 0;
          String content = "";
          do
          {
               buf = new char[512];
               bytesRead = fr.read(buf, 0, 512);
               if (bytesRead > 0)
                    content += new String(buf, 0, bytesRead); // keep only the chars actually read
          }
          while (bytesRead == 512);
          fr.close();                          // release the file before rewriting it
          FileOutputStream fout = new FileOutputStream(f);
          OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-8"); // rewrite as UTF-8
          out.write(content.trim());
          out.flush();
          out.close();
     }

  • XML parser not able to find encoding format of xml file with jre1.4.2

    Hi
I am using jre1.4.2_05 and WebLogic 8.1, and I have a problem with finding the encoding format of an XML file.
I need to parse an XML file and find out which encoding format the XML uses; based on that I need to change my logic.
I need to know, after parsing each XML file, what encoding format the XML is in. The problem is that with jre1.4.2_05 the default DOM / SAX parser does not support this, and the few third-party parsers I looked at don't have the facility either.
The latest jre 1.5 / jdk1.5 has this feature, but it is difficult for the project to upgrade to jre1.5 or later.
Please let me know if you have any idea about the issue.

    I had a quick look around and I think you might be able to find them in the support portal...
    SAP Support Portal > Software Downloads > SAP Software Download Centre > Support Packages and Patches > Archive for Support Packages and Patches > Archive - Browse our Download Catalog > SAP Connectors.
    Let me know if you find them.
    Regards,
    Stephen.
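On the JDK 1.4 limitation itself: DOM Level 3's Document.getXmlEncoding() only arrived with Java 5, so on 1.4 one workaround is to sniff the XML declaration yourself before parsing. A rough sketch (the class name is made up), assuming the declaration, if present, is in an ASCII-compatible encoding; it does not handle UTF-16 prologs:
import java.io.IOException;
import java.io.InputStream;

public class XmlEncodingSniffer {
    /** Returns the encoding named in the XML declaration, or "UTF-8" if none is found. */
    public static String sniff(InputStream in) throws IOException {
        byte[] head = new byte[256];
        int n = in.read(head);
        if (n <= 0) return "UTF-8";
        String prolog = new String(head, 0, n, "ISO-8859-1"); // byte-transparent decoding
        int i = prolog.indexOf("encoding=");
        if (i < 0) return "UTF-8";                 // XML's default without a declaration
        char quote = prolog.charAt(i + 9);         // the ' or " right after "encoding="
        int end = prolog.indexOf(quote, i + 10);
        return (end < 0) ? "UTF-8" : prolog.substring(i + 10, end);
    }
}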

  • How can I know the encoding format for a file.

    I have files encoded in English, Spanish, Japanese etc. I want to know which file has which encoding format while reading.
    Can anyone suggest.
    Ashish

    Language is different from "encoding"...if you mean character encoding. Multiple languages can be represented by a single encoding. The fact that you mix language and encoding in your question confuses me about what you are asking.
    If you want to know what language a file uses, you can always use a meta-tag. Or you can do some kind of text analysis based on dictionary lookups to determine language...too complex for my tastes. I think a simple language tag in the file makes more sense.
    As for character encoding, you either standardize on a single encoding for all files or again use a meta-tag. If you standardize on a single encoding, you should probably consider one of the Unicode encodings instead of any other character set encoding.
    Regards,
    John O'Conner
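One mechanical check that does work for the encoding half of the question: many (not all) Unicode files begin with a byte order mark, which can be recognized. A small sketch (the class name is illustrative); a null result only means "no BOM", not "not Unicode":
public class BomSniffer {
    /** Returns the charset named by a leading byte order mark, or null if there is none. */
    public static String bomCharset(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return "UTF-8";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        return null; // no BOM: the encoding has to be known some other way
    }
}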

  • Character encoding conversion for marshall/unmarshall?

    Hello, Java Web Services gurus,
I am wondering if there is an easy, pluggable way to do character encoding conversion transparently in the process of marshalling/unmarshalling.
Basically, my input/output will always be these UTF-8 XMLs. As the backend database is ISO encoded, I hope the result of unmarshalling will give me ISO strings. And when it comes to marshalling, the ISO strings can be transparently turned into the UTF-8 XML response. Right now I'm using JAXB's annotations to parse XML into objects.
I understand there will be chars in the input file that cannot be converted; if so, I'd be expecting an error/exception that flags the failure.
Hope I sound clear. This has been a headache for a while. Really hope someone can help out a bit. Thanks a million in advance.

    [Duplicate Post|http://forums.sun.com/thread.jspa?messageID=10971554&tstart=0#10971554]
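For reference, no hand-rolled string conversion should be needed here: Java strings are Unicode in memory, and JAXB serializes them in whatever output encoding is set via the standard jaxb.encoding marshaller property (characters that don't fit the chosen encoding are typically escaped as numeric character references rather than rejected). A minimal sketch; ctx and obj stand in for your real JAXBContext and bound object, and out.xml is a placeholder name:
import java.io.FileOutputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;

public class IsoMarshal {
    static void writeIso(JAXBContext ctx, Object obj) throws Exception {
        Marshaller m = ctx.createMarshaller();
        m.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1"); // charset of the produced XML
        m.marshal(obj, new FileOutputStream("out.xml"));
    }
}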

  • File format conversion of Target file using FTP adapter

    Hi All,
I am using the FTP adapter to create the file on the target side. But the file needs to be in the format below. How do I convert the XML file format (generated by default by XI) into the file format below?
    000000000000154162,
    CWC1A,,,,
    CWC1B,,,,
    CWC2A,,,,
    CWC2B,,,,
    Please provide your suggestion;
    thanks;
    MK

    Hi Mohan,
I have a collection of blogs (links) which specify the file content conversion parameters.
For file content conversion, I am not sure which link will match your requirement exactly...
    /people/shabarish.vijayakumar/blog/2006/02/27/content-conversion-the-key-field-problem
    /people/michal.krawczyk2/blog/2004/12/15/how-to-send-a-flat-file-with-fixed-lengths-to-xi-30-using-a-central-file-adapter
    /people/arpit.seth/blog/2005/06/02/file-receiver-with-content-conversion
    http://help.sap.com/saphelp_nw04/helpdata/en/d2/bab440c97f3716e10000000a155106/content.htm
    Please see the below links for file content conversion..
/people/michal.krawczyk2/blog/2004/12/15/how-to-send-a-flat-file-with-fixed-lengths-to-xi-30-using-a-central-file-adapter - FCC
    File Content Conversion for Unequal Number of Columns
    /people/jeyakumar.muthu2/blog/2005/11/29/file-content-conversion-for-unequal-number-of-columns - FCC
    Content Conversion (Pattern/Random content in input file)
    /people/anish.abraham2/blog/2005/06/08/content-conversion-patternrandom-content-in-input-file - FCC
    Regards,
    Sainath chutke

  • SCOT error "File format conversion error"

    Dear All,
    When a PDF file is sent as a FAX, the SCOT throws the following error with message XS 812:
    No delivery to <fax_number>
    Message no. XS812
    Diagnosis
    The message could not be delivered to recipient <fax_number>. The node gave the following reason (in the language of the node):
    File format conversion error
    Can anybody suggest what could be the problem?
    Note that the normal PCL and RAW data are being sent to the same FAX number without any problem.
    Thanks
    Sagar

    Hello,
    You need to specify the correct formats:
    Check this link:
    http://help.sap.com/saphelp_nw70/helpdata/EN/b0/f8ef3aabdfb20de10000000a11402f/frameset.htm
    Regards,
    Siddhesh

  • How to set character encoding to Greek in a javascript file in Dreamweaver CS4?

I have a javascript file built with a text editor some time ago, which has Greek characters in one of the variables:
1.) --> var scrollercontent='</font><br><font face="Verdana" color="#333333" size="-1">11-04-2008: <b>ΜΑΝΩΛΗΣ ΜΗΤΣΙΑΣ </b><br><img SRC="mitsias2.gif" height=148 width=104 border="0" /><p></font>' <--
In DW CS4 it displays as
2.) --> var scrollercontent='</font><br><font face="Verdana" color="#333333" size="-1">11-04-2008: <b>???O??S ???S??S </b><br><img SRC="mitsias2.gif" height=148 width=104 border="0" /><p></font>' <--
When trying to save the corrected content in 1.) I get an error message that DW cannot save the Greek characters, and I am asked to change the character encoding.
I don't know how to change the character encoding in a .js file. Is there a "head" section like in HTML? Or how does one determine the character encoding of a .js file?
When I change the encoding in the preferences from "Western" to "Greek" or "UTF", I get the same result.
Thanks for help.
jml

The .js file and the html need to have the same encoding. If your html uses iso-8859-7, then the .js must also use that. But if the original text editor created the .js file using utf-8, then that is what the html needs to use.
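For completeness: HTML 4 also defines a charset attribute on the script element, e.g. <script src="scroller.js" charset="iso-8859-7"></script> (file name made up), for the case where the two files genuinely cannot share an encoding; matching the encodings as described above remains the safer route.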

  • Character Encoding and File Encoding issue

Hi,
I have a file which has data encoded using the default locale.
I start the JVM in the same default locale and try to read the file.
I took 2 approaches:
1. Read the file using InputStreamReader() without specifying the encoding, so that the default one based on the locale will be picked up.
-- This approach worked fine.
-- I also printed the system property "file.encoding", which matched the current locale's encoding (the Unix command to get this is "locale charmap").
2. In this approach, I read the file using InputStream as an array of raw bytes, and passed it to the String constructor to convert the bytes to a String.
-- The String contained garbled data, meaning the encoding failed.
I tried printing the encoding used by the JVM using an internal class, and the "file.encoding" property as well.
These 2 values do not match; there is a weird difference.
For example, for locale ja_JP.eucjp on a Linux box:
byte-to-character conversion uses the EUC_JP_LINUX encoding
the file.encoding system property is EUC-JP-LINUX
To get the byte-to-character encoding, I used the following (sun.io.*):
ByteToCharConverter btc = ByteToCharConverter.getDefault();
System.out.println("BTC uses " + btc.getCharacterEncoding());
Do you have any idea why it is failing?
My understanding was that the file encoding and the character encoding should always be the same by default.
But, because of this behaviour, I am a little perplexed.

But there's no character encoding set for this operation: baos.write("���".getBytes());
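The fix for exactly that kind of call is to name the charset on both sides of the byte/char boundary. A minimal sketch (EUC-JP chosen only to match the thread's locale):
import java.io.UnsupportedEncodingException;

public class RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "\u65e5\u672c\u8a9e";        // "Japanese" in kanji, as escapes
        byte[] bytes = original.getBytes("EUC-JP");    // explicit charset going out
        String decoded = new String(bytes, "EUC-JP");  // and the same one coming back
        System.out.println(original.equals(decoded));  // true; default-charset calls may differ
    }
}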

  • Query regarding conversion of Quick Time movie files to .swf format

    Hello Everyone,
Is it possible to convert QT movie files to .swf format using any tool?
Will it significantly reduce the file size?
Suppose we have a QT movie file of size 3MB; when it is converted to .swf format, what will be its size?
Then can we play this .swf file using QT player, or do we require Flash player to do so?
Kindly answer these queries.
Thanking you,
With Best regards,
Anil.

Sudeshchan Shetty wrote:
> Is it possible to convert QT movie files to .swf format using any tool?
Hi Anil.
Yes, you can. You can import QT movies into Flash, embed the movie, and then export an swf version of it. You can also convert the video to an FLV (Flash video) and have a swf that provides the controls and player for it.
> Will it significantly reduce the file size?
File size will probably go down, but that depends on what compression the original QT has and what compression you use when importing to Flash. I'd say start with a high-quality QT, then try various compression levels when importing to get the best size-to-quality balance.
> Suppose we have a QT movie file of size 3MB; when it is converted to .swf format, what will be its size?
It's hard to say - as mentioned above, it depends on what the QT's compression is.
> Then can we play this .swf file using QT player or do we require Flash player to do so?
You'll have to play the swf through Flash. Why would you want to play it through QT?
What's the output for - CD/web? I'd suggest FLVs rather than a video embedded in a swf, as you can get better quality control and playback.
    regards
    Dean
    Director Lecturer / Consultant / Director Enthusiast
    http://www.fbe.unsw.edu.au/learning/director
    http://www.multimediacreative.com.au
    email: [email protected]

  • 3D annotations and File format Conversion

I am writing a plug-in for Acrobat Extended Pro that will export embedded 3D files in other formats. I only have access to a U3D converter right now, but Acrobat saves some files in its PRC format. Is there some way to have Acrobat convert PRC files into U3D?
    Thank you in advance for your help.

    I've figured everything out. The problem I was having was because I was using fwrite() in text mode instead of binary mode. :p
    For anyone reading this later, this is how you convert from PRC -> U3D and back:
    1) load the Acrobat 3D Dll (A3DLIB.dll) and do all of the initializations.
    2) Then, load the PRC file with the function A3DAsmModelFileLoadFromFile().
    3) Then take the model structure you get back and pass it to A3DAsmModelFileWriteToFile(). The A3DAsmModelFileWriteParametersData * parameter will contain the format specifier.
    You can save these files as either PRC, U3D version 1 or U3D version 3.
Look here for the full docs: http://livedocs.adobe.com/acrobat_sdk/9/Acrobat9_HTMLHelp/API_References/PRCReference/3D_API_Reference/index.html

  • What every developer should know about character encoding

    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
    If you write code that touches a text file, you probably need this.
Let's start off with two key items:
1. Unicode does not solve this issue for us (yet).
2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
    And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
Now let's look at UTF-8, because between being the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double-byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6-byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that, in their text editor using the codepage for their region, is a character like ß, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It also adds the endian preamble to the file.)
Ok, you're reading & writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what those encoders created in the Java & .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
    Wrapping it up
I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
    Edited by: Darryl Burke -- link removed
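A compact Java sketch of Points 3 and 5 (the file names are made up; characters that ISO-8859-1 cannot represent are silently written as '?'):
import java.io.*;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        // Point 3: name the encoding explicitly instead of taking the platform default
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("in.txt"), "UTF-8"));
        Writer out = new OutputStreamWriter(new FileOutputStream("out.txt"), "ISO-8859-1");
        String line;
        while ((line = in.readLine()) != null) {
            out.write(line);   // Point 5: 'line' is Unicode (UTF-16) in memory;
            out.write('\n');   // the encoders translate only at the file boundary
        }
        in.close();
        out.close();
    }
}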

DavidThi808 wrote:
> This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.
> If you write code that touches a text file, you probably need this.
> Let's start off with two key items:
> 1. Unicode does not solve this issue for us (yet).
> 2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
> And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.
They might only use that range, but that is a different issue, especially since that range is exactly the same as the UTF-8 character set anyway.
> The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
> And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
> And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
The above is only true for small volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years, then a column with a size of 8 bytes is significantly different from one with 16 bytes.
> Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
The above is out of place. It would be best to address this as part of Point 1.
> Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
> Now let's look at UTF-8, because between being the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard codepages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
> UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 bytes are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double-byte sequence, giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This goes up to 6-byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
The first part of that paragraph is odd. The first 128 characters of Unicode, all Unicode, are based on ASCII. The representational format of UTF-8 is required to implement Unicode, thus it must represent those characters. It uses the idiom supported by variable-width encodings to do that.
> But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that, in their text editor using the codepage for their region, is a character like ß, and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with HTML/XML.
> Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
The browser still needs to support the encoding.
> Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
I know Java files have a default encoding - the specification defines it. And I am certain C# does as well.
> Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
It is important to define it. Whether you set it is another matter.
> Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It also adds the endian preamble to the file.)
> Ok, you're reading & writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what those encoders created in the Java & .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in Java with escaped Unicode characters which will fail to compile.
> Point 5 – (For developers on languages that have been around awhile) – Always use Unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.
No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone creating an inventory system for a standalone store to craft a solution that supports multiple languages.
And another example: with high-volume systems, moving/storing bytes is relevant. As such, one must carefully consider each text element as to whether it is customer-consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems, incremental savings impact operating costs and marketing advantage with speed.

  • Receiver File Adapter - Encoding issue.

    Hi Everybody,
The file format (encoding) is different from the format we generally used to get.
Currently we get the flat files in DOS format. When we download the current file, we get it in UNIX or another format.
For example, 20 has been changed to 0D in the file.
Can somebody help me on the same?
    Thanks,
    Zabiulla

    Hi,
    Check on this for file adapters
Text
Under File Encoding, specify a code page.
The default setting is to use the system code page that is specific to the configuration of the installed operating system. The file content is converted to the UTF-8 code page before it is sent.
Permitted values for the code page are the existing Charsets of the Java runtime. According to the SUN specification for the Java runtime, at least the following standard character sets must be supported:
■ US-ASCII
Seven-bit ASCII, also known as ISO646-US, or the Basic Latin block of the Unicode character set
■ ISO-8859-1
ISO character set for Western European languages (Latin Alphabet No. 1), also known as ISO-LATIN-1
■ UTF-8
8-bit Unicode character format
■ UTF-16BE
16-bit Unicode character format, big-endian byte order
■ UTF-16LE
16-bit Unicode character format, little-endian byte order
■ UTF-16
16-bit Unicode character format, byte order identified by an optional byte order mark
    Regards
    Vijaya

Maybe you are looking for

  • Hard drive failure on Satellite L300-154

On the 28th of March the hard drive in my Toshiba L300 154 failed and would not boot from the recovery partition or recovery discs. I installed a new hard drive identical to the previous one (just without the branding) and decided to try my Windows

  • My computer will not boot up from disc

Help, I am having issues booting up my laptop; it won't even boot up from a disc. What do I need to do??

  • Helping me understand cleaning and optimization

Hello everyone, This is my first post on the board and I would like to introduce myself: please call me eyeforimagery! I am by no means computer illiterate; nevertheless I have gotten myself into a pickle with the amount of GB left on my computer. My que

  • PSE folder views

    I have used Picasa for photo organizing up til now.  I would like to give PSE 8 a try so I downloaded the trial version and I have two quick questions. The first question is: In Picasa the folders are listed on the left side of the window.   So if I

  • DI PL16 - Error on GetByKey

    I started to get the following error when doing Doc.GetByKey(CurrDocEntry) The Error: "Unable to cast object of type 'SAPbobsCOM.CompanyClass' to type 'SAPbobsCOM.ICompany" any ideas? Thanks Moty