Unrecognised character in GB2312 character set using Java InputStreamReader?

I am reading the following Chinese GB2312 HTML file from
http://news.xinhuanet.com/local/2007-02/13/content_5732705.htm
using an InputStreamReader with GB2312 encoding, as shown below:
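import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class ReadGB2312Html {
    //........TmpText declarations.....
    public static void main( String[] args ) {
        try {
            FileInputStream is = new FileInputStream( args[0] );
            // Decode the file's bytes as GB2312 while reading
            BufferedReader br = new BufferedReader(
                new InputStreamReader( is, "GB2312" ) );
            String strLine;
            while ( (strLine = br.readLine()) != null ) {
                TmpText.append(strLine);
                TmpText.append("\r\n");
            }
            br.close();
            bw.close(); // bw is presumably declared in the elided section above
        } catch ( Exception e ) {
            e.printStackTrace();
        }
    }
}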
The TmpText variable does not display the last character in the article properly: instead of (记者夏珺) it gives (记者夏?B).
Inside the HTML file, the unrecognised character appears as an undecodable byte followed by B. Why is this so?
A web browser displays and recognises it as a Chinese GB2312 character, so why is it not recognised by the Java InputStreamReader?
Any help or explanation would be much appreciated.

Yes, that is because it is not a GB2312 character.
The ?B character is AC42 in hex (lead byte AC, then 42, which is ASCII "B"). That is outside the GB2312 character range; it is in GBK.
Copied from Wikipedia:
GBK is an extension of the GB2312 character set for simplified Chinese characters, used in the People's Republic of China.
GB stands for National Standard, while K stands for Extension. GBK not only extended the old standard GB2312 with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB2312 in 1981. With the arrival of GBK, certain names with characters formerly unrepresentable, like the "rong" (镕) character in former Chinese Premier Zhu Rongji's name, are now representable.
Thanks a lot. I will use the GBK charset to read all GB2312 files, since GB2312 is a subset of GBK.
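The difference between the two charsets can be checked directly from Java. The following is a minimal sketch, not the original poster's code: the class name GbkReadDemo and its helper methods are made up for illustration, and it decodes an in-memory byte array instead of the downloaded file. It assumes the JDK in use ships the GB2312 and GBK charsets, which a full JDK normally does.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical demo class; the names here are illustrative only.
final class GbkReadDemo {

    /** True if the named charset has a mapping for the given character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    /** Decodes bytes line by line, mirroring the read loop in the question. */
    static String decode(byte[] data, String charsetName) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), Charset.forName(charsetName)))) {
            String line;
            while ((line = br.readLine()) != null) {
                text.append(line).append("\r\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The byline from the question, written with Unicode escapes so the
        // source compiles regardless of the compiler's source encoding.
        String byline = "\u8BB0\u8005\u590F\u73FA";
        byte[] gbkBytes = byline.getBytes(Charset.forName("GBK"));

        System.out.println("GB2312 has U+73FA: " + canEncode("GB2312", '\u73FA'));
        System.out.println("GBK has U+73FA:    " + canEncode("GBK", '\u73FA'));
        System.out.println("GBK round trip ok: "
                + decode(gbkBytes, "GBK").trim().equals(byline));
        System.out.println("GB2312 decode ok:  "
                + decode(gbkBytes, "GB2312").trim().equals(byline));
    }
}
```

Reading with GBK is safe for files that really are GB2312, because every GB2312 byte sequence decodes to the same character under GBK.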

Similar Messages

  • Setting Multiple values in property set using java API

    Hello All,
    I want to set the properties of a profile in a property set using the Java API provided
    in package p13n. The property can have multiple values. When I try to add a value to the
    property using the ProfileManager.setProperty() method, it replaces the earlier value
    instead of adding to it. I can achieve this using portalTools, but I want to use the
    API for user registration on the site.
    I hope the query is clear.
    Waiting for a response,
    Thanks in advance,
    Shrinivas

    You need to use java.util.ArrayList.
    First get the existing value with the getProperty method and cast it to an ArrayList,
    change the values in the ArrayList, and then put them back with the setProperty
    method.
    Regards,
    Michael Goldverg
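The read-modify-write pattern described in the reply can be sketched as follows. This is only an illustration: MultiValuePropertyDemo is a hypothetical, Map-backed stand-in, and its getProperty and setProperty methods are NOT the real p13n ProfileManager signatures, which the thread does not show.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a profile store; not the p13n ProfileManager API.
final class MultiValuePropertyDemo {

    private final Map<String, Object> properties = new HashMap<>();

    Object getProperty(String name) {
        return properties.get(name);
    }

    void setProperty(String name, Object value) {
        // Like the behaviour described in the question, this overwrites.
        properties.put(name, value);
    }

    /** Appends a value: read the current list, add to a copy, write it back. */
    @SuppressWarnings("unchecked")
    void addPropertyValue(String name, String value) {
        Object existing = getProperty(name);
        List<String> values = existing instanceof List
                ? new ArrayList<>((List<String>) existing)
                : new ArrayList<>();
        values.add(value);
        setProperty(name, values);
    }

    public static void main(String[] args) {
        MultiValuePropertyDemo store = new MultiValuePropertyDemo();
        store.addPropertyValue("interests", "java");
        store.addPropertyValue("interests", "unicode");
        System.out.println(store.getProperty("interests"));
    }
}
```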

  • Change character set used to write a file in application server.

    Hello Experts,
    I want to know whether we can change the character set used to create a file on the application server. (Is it possible to use a particular character set while creating a file on the application server?)
    I would be very grateful for any help.
    Thanks in advance.
    Sharath

    Hello Sarath,
    The OPEN DATASET statement has a CODE PAGE addition.
    Can you please elaborate which character set you want to write to the application server?
    BR,
    Suhas

  • How can I see KSC5601 character set using JDBC thin driver

    After I changed the character set from US7ASCII to KO16KSC5601, I
    cannot see Korean from clients
    using the JDBC thin driver.
    But I can see Korean clearly using sqlplus on the server, or from
    applications using SQL*Net.
    I use Oracle Enterprise Server 8.0.4.1.0 and JDBC thin driver
    8.0.4.0.6 on Windows 98. I read that all bugs related
    to multibyte languages are fixed in Oracle8. What can I do to
    solve this problem?
    PS. Server: Oracle 8.0.4.1 on Digital Unix 4.0b; client: JDK 1.1.8
    on Windows 98. I used the command.

    The easiest thing to do is download it as an archive with your applet.
    Otherwise, you have to have the files on every client machine.
    For netscape, put the classes111.jar in the java classes folder typically:
    c:\ProgramFiles\Netscape\Communicator\Program\java\classes.
    I'd expect that IE would be set up in a similar way.

  • Default Character Set using JSObject

    Hi All,
    This problem has been nagging me for a while, and I am now resorting to this forum for an answer.
    I have a JSP page with an embedded applet. Inside the applet, I read the HTML page using JSObject.
    The problem occurs when using JSObject to get the values of controls that contain Japanese characters from the HTML page.
    The HTML page is encoded in UTF-8. However, when I get values from the controls using JSObject in the applet, the values are returned as ???. Latin characters are supported but not Japanese characters. So I'm wondering what character set JSObject uses when converting a JavaScript string to a Java String.
    The following code is executed:
    JSObject win = JSObject.getWindow(this);
    JSObject doc = (JSObject) win.getMember("document");
    JSObject forms = (JSObject) doc.getMember("forms");
    JSObject form = (JSObject) forms.getSlot(0);
    JSObject title = (JSObject) form.getMember("title");
    String titleValue = (String) title.getMember("value");
    I've also tried form.eval("document.forms[0].title.value") and that returns the same ??? for Japanese characters.
    Any ideas?
    Kent

    Hi Larry,
    The characters that appear at the beginning of each file (ï»¿) are the BOM, or byte order mark, for UTF-8, which is automatically added to the file on creation. These files are UTF-8 encoded to allow for the support of multi-byte characters. An updated version of the Exporter Tool removes these BOM characters. Please contact Support to obtain this updated version of the Exporter tool.
    Alternatively, you can try the following:
    If the character set of your Oracle database is not UTF-8, then you have two options:
    1) If possible, change the character set of your database to UTF-8. To check the current database characterset, check the "NLS_DATABASE_PARAMETERS" table.
    or
    2) Open the generated .dat files using Notepad, then use the File | Save As menu option, and set the "Encoding" to ANSI, then save the file. The BOM will now be removed from the .dat files.
    I hope this helps.
    Regards,
    Hilary
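For files processed in Java, the UTF-8 BOM described above is the three-byte sequence EF BB BF and can be dropped before decoding. A minimal sketch under that assumption follows; the class name Utf8BomDemo and its methods are hypothetical and not part of the Exporter tool. Unlike the Notepad "Save As ANSI" route, it only removes the BOM and leaves the rest of the bytes as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; illustrates BOM detection and removal only.
final class Utf8BomDemo {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** True if the data starts with the UTF-8 byte order mark. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    /** Returns the data without a leading UTF-8 BOM; unchanged if none is present. */
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data)
                ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length)
                : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd'};
        String text = new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8);
        System.out.println(text);
    }
}
```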

  • Need suggestion on Multi currency and Unicode character set use in ABAP

    Hi All,
    I need a suggestion. In one job requirement I saw 'multi-currency and Unicode character set experience in FICO'.
    Can you please explain how ABAPers are involved in multi-currency, as I think this is a FICO functional area?
    Also, what is Unicode character set experience? Please share some documentation if you have any.
    Thanks
    Sreedevi
    Moderator message - This isn't the place to prepare for interviews - thread locked
    Edited by: Rob Burbank on Sep 17, 2009 4:45 PM

    Use the default parser.
    By default, WebLogic Server is configured to use the default parser and transformer to parse and transform XML documents. The default parser and transformer are those included in the JDK 5.0.
    The built-in WebLogic Server DOM factory implementation class is com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl.
    The DocumentBuilderFactory.newInstance method returns the built-in parser.
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

  • CHANGE SECURITY SETTING USING JAVA

    Hi all,
    Please Help!!!
    I am using Java / Struts to develop a web application.
    In the client/server technology, I need to set the client browser security level.
    Can it be done through JavaScript, or how else can it be achieved?
    Thanks & Regards
    Glass_Fish.........

    > I am using, JAVA / STRUTS for developing an web application.
    > In the Client/ Server Technology, I need to set the client Browser Security Level.
    Not possible.
    > Can it be done through JavaScript or how it can be achieved.
    By asking the end user to change their security level, if your application really needs it. For example, there are ways to detect whether a scripting language is enabled or not, and whether cookies are enabled or not. If your application really needs those features, ask the user to enable the particular feature by displaying a suitable message.

  • Unicode Character Sets in Java

    I am trying to port code from a PowerBuilder 10.5.1 Build 6021 environment to Java, and I am having trouble getting the same value for the Euro character in Java and PowerBuilder.
    I get a value of 20AC (8364 in decimal) in Java which is consistent with ISO-8859-15, but I get a value of 128 in PowerBuilder.
    This is only a small example and perhaps not strictly a Java question, but if anyone has any suggestions, I would appreciate it.

    In the ISO-8859-x encodings, each character is represented by one byte, but the numbers 128-159 are not used. The Windows extensions of those encodings, like windows-1252, map those unused numbers to useful characters like curly quotes and the Euro symbol (and create compatibility problems like this one in the process). PowerBuilder is obviously using one of these Windows encodings. However, there are several points in the development process where character encodings come into play, so we'll need more info. Does the problem occur when you compile the code, or when you run it? If it's at run time, does it happen when you're reading text from a file, a database, or some other source? And how exactly does the problem manifest?
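The numbers in this exchange can be reproduced with a short Java check. This is a sketch with a hypothetical class name (EuroEncodingDemo); it assumes the JDK provides the windows-1252 and ISO-8859-15 charsets. It shows that 0x20AC (8364) is the Unicode code point of the Euro sign, that windows-1252 encodes it as the single byte 128, and that ISO-8859-15 encodes it as the single byte 164.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; prints how one character maps to a byte per charset.
final class EuroEncodingDemo {

    /** Returns the unsigned value of the first byte used for c in the named charset. */
    static int singleByteValue(char c, String charsetName) {
        byte[] encoded = String.valueOf(c).getBytes(Charset.forName(charsetName));
        return encoded[0] & 0xFF;
    }

    public static void main(String[] args) {
        char euro = '\u20AC';
        System.out.println("Unicode code point: " + (int) euro);                             // 8364
        System.out.println("windows-1252 byte:  " + singleByteValue(euro, "windows-1252"));  // 128
        System.out.println("ISO-8859-15 byte:   " + singleByteValue(euro, "ISO-8859-15"));   // 164
    }
}
```

A Java char holds the Unicode value, so 8364 is expected there; the 128 seen in PowerBuilder matches the windows-1252 byte value.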

  • Character recognition using java

    Hi guys,
    I have a fairly complex problem that I need to solve. Basically I am reading a set of pixel colours from a 3rd party client. I need to take this pixel data and recognize the characters used in it.
    Does anybody know of any good character recognition tutorials I can use?
    Has anybody ever done anything like this?
    Any help would be great.
    Thanks
    Alex

    Over the last month or so I have checked out those links and the downloadable software they offer. However, you can only use their trial versions, which aren't suitable for me.
    Does anybody know where I should start if I want to write my own OCR functionality? Are neural nets the best way to go?
    Any advice/suggestions would be great.
    Thanks
    alex

  • Trouble Using Character Sets - Chinese GB2312

    Hi,
    I am trying to display my site in Simplified Chinese
    (GB2312). I have verified that all of the files are encoded with
    the GB2312 character set, when I open them in a language capable
    editor I can see chinese characters.
    I have also used <cfheader name="Content-Type"
    value="text/html; charset=gb2312" /> at the top of my
    Application.cfm file.
    Yet, when I view the file in any browser, it shows only
    question marks or odd characters in place of the chinese
    characters.
    When I look at the browser character encoding settings, they
    are correct (a check mark next to Simplified Chinese).
    Any thoughts? Am I missing a step?
    Thanks in advance for any help given.

    lan99 wrote:
    > I have also used <cfheader name="Content-Type" value="text/html;
    > charset=gb2312" /> at the top of my Application.cfm file.
    What version of CF?
    If CF6 or later, have you used cfprocessingdirective at the
    top of each file?

  • German character set issues on Solaris

    Hi,
    I am facing an issue with German character settings in my Java application on a Solaris box.
    When I run my application on the box and pass it an input file with German special characters, they are converted to ?. Ordinary English characters come through properly.
    When I run the same application on another Solaris box with a different JRE, the German characters come through properly.
    I understand that the two boxes differ in architecture, i.e.
    64-bit SPARC machine vs. 32-bit x86 machine
    and in the JRE
    1.4.2_03 (64-bit) vs. 1.4.1_01
    I am trying to evaluate further differences between the two environments to pinpoint the issue and get it resolved on the first box.
    Can anyone provide any input?
    Lavin

    When you read the file, specify the character set explicitly. For example:
    FileInputStream fstream = new FileInputStream(url.getFile());
    // Name the charset explicitly instead of relying on the platform default
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream, Charset.forName("ISO-8859-1")));
    br.readLine();
    This link may help:
    http://www.velocityreviews.com/forums/t126128-jdk-14-character-set-change.html

  • Oracle character sets for China

    Which character sets are used to store data in the Chinese language?

    Yes, if you have a database with a UTF8 character set you can store Chinese characters. If you have a database with a single-byte character set and you prefer not to change the database character set (because that means recreating the database), you can also store the Chinese characters in NCHAR columns and use the national character set for them.
    This way you can keep your standard single-byte character set and use the national character set only for specific columns.
    For this, set NLS_NCHAR_CHARACTERSET to a multibyte character set and use NCHAR as the column type.

  • BIG5 and HKSCS Character Set Support

    Hi,
    We're experiencing some problems inserting a string containing both BIG5 and HKSCS characters to a 7.3.4 Oracle DB using JDBC. The underlying character set used by the DB is ZHT16BIG5 (this cannot be changed). The characters can be inserted correctly if we use SQLPlus/WorkSheet.
    Take note that the BIG5 character set can be inserted correctly. The problem occurs if we include HKSCS characters in the statement.
    We have tried a number of ways already but failed to convert the data properly.
    We tried converting the data using ByteToCharConverter.getConverter("Big5") but this cannot handle the HKSCS properly.
    We even tried using the CharacterSet.ZHT16BIG5_CHARSET provided by the NLS character set but it cannot convert all HKSCS characters correctly.
    Any ideas on how to solve this problem? Or is it because the HKSCS character set is NOT supported by the JDBC driver?
    Below is a sample text containing both BIG5 and HKSCS characters:
    'i$h%49D$G$Q$T89     Ize     _     ^     S(     R     @     A     Y     q
    Any help/suggestion is most welcome.
    Thanks,
    Cis

    I got the exact same problem as you.
    (I am using Oracle 8.1.7.)
    Can anyone help?

  • Change default character set of JVM

    Is there a way to change the default character set of the JVM to, say, UTF-8?
    System.out.println("Default Character Set: " +  new java.io.OutputStreamWriter(new java.io.ByteArrayOutputStream()).getEncoding());
    System.out.println("File Encoding: " + System.getProperty("file.encoding"));
    On Windows
    ==========
    Default Character Set: Cp1252
    File Encoding: Cp1252
    On Linux
    ========
    Default Character Set: ASCII
    File Encoding: ANSI_X3.4-1968
    I would like to save the effort of changing the many lines of code that look like
       new BufferedWriter(new OutputStreamWriter(out));
    to
       new BufferedWriter(new OutputStreamWriter(out, "UTF-8"));
    Thanks

    Try this:
    -Dfile.encoding=utf-8
    as vm argument.
    /Kaj
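Where touching the code is an option, a single helper that names the charset keeps the change in one place and does not depend on how the JVM was started. The sketch below is illustrative only: DefaultCharsetDemo, utf8Writer and encodeUtf8 are hypothetical names, not part of any library. It also prints the effective default so the effect of the -Dfile.encoding argument can be checked; on JDK 18 and later the default charset is already UTF-8 (JEP 400).

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; centralises the explicit UTF-8 choice.
final class DefaultCharsetDemo {

    /** Wraps a stream in a buffered writer that always encodes as UTF-8. */
    static Writer utf8Writer(OutputStream out) {
        return new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    /** Encodes text through utf8Writer and returns the resulting bytes. */
    static byte[] encodeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = utf8Writer(out)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Shows what the JVM picked up, e.g. after starting with -Dfile.encoding=...
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        // The Euro sign takes three bytes in UTF-8 regardless of the default.
        System.out.println("UTF-8 bytes for U+20AC: " + encodeUtf8("\u20AC").length);
    }
}
```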

  • Character set in MDL export/import

    Hi,
    we are running OWB 10.1.0.2. In order to get version control, we perform MDL exports of collections from our development environment and then import them into our test and production environments. Each environment uses its own design repository, but all repositories are in the same database.
    When doing an export of a collection, we always specify the character set AL32UTF8 because that is what we are running in the repository database. When later doing an import using the graphical user interface, it is not possible to specify the character set (but this can be done when using the import utility). According to the documentation, the GUI import will then assume that the character set in the collection is the character set of the client, which usually is WE8MSWIN1252. (The documentation also says that it IS possible to specify the character set during GUI import, this is obviously a documentation error).
    My questions are: What is the point of specifying character set when doing exports and imports? Could an AL32UTF8 export followed by an WE8MSWIN1252 import cause problems? I assume that the character set used by export is specified in the collection file, so does the import then convert it to WE8MSWIN1252 (or the character set specified in the import utility)?
    Or, to be more general: What is actually happening with the character sets during MDL export/import?
    /Kjell Gullberg

    Dear ski123,
    I think you are not going to lose any of your data when you migrate the database. You may proceed with the import.
    Please find the documentation below:
    http://download.oracle.com/docs/cd/B19306_01/server.102/b14196/install003.htm#sthref81
    For Database Character Set, select from one of the following options:
        * Use the Default: Select this option if you need to support only the language currently used by the operating system for all your database users and your database applications.
        * Use Unicode (AL32UTF8): Select this option if you need to support multiple languages for your database users and your database applications.
        * Choose from the list of character sets: Select this option if you want the Oracle Database to use a character set other than the default character set used by the operating system.
    Choosing a Character Set:
    http://download.oracle.com/docs/cd/B19306_01/server.102/b14225/ch2charset.htm#NLSPG002
    AL32UTF8:
    http://download.oracle.com/docs/cd/B19306_01/server.102/b14225/glossary.htm#sthref2039
    Hope That Helps.
    Ogan
