Charset encoding

Hi,
I need to save the source code of web pages to text files. My problem is how to determine whether a page is encoded in UTF-8 or windows-1250.
URL url=new URL(actualLink);
InputStream in=url.openStream();
I need to determine the page encoding so that I can set the encoding for the input stream:
InputStreamReader rd = new InputStreamReader(in, "UTF-8");
or
InputStreamReader rd = new InputStreamReader(in, "windows-1250");
Can anybody help me?
Thanks.
Andrej

If the page has a meta Content-Type tag (which it should), you can read the stream until you find it and parse the actual encoding out of it. Then you can re-read the stream using the correct encoding. If you read the first pass as windows-1252 or ISO-8859-1, you should be able to get to the meta tag, since HTML tags are all ASCII characters. If there is no meta Content-Type tag, you just have to make a best guess. windows-1252 is a superset of ASCII and matches ISO-8859-1 for all printable characters, so it is usually a safe bet.
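The two-pass approach described above could be sketched roughly as follows. This is only a minimal sketch: the class name `MetaCharsetSniffer`, the 4096-byte prefix and the regular expression are assumptions made for illustration, and a real page may also declare its charset in the HTTP Content-Type header, which this sketch ignores.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class MetaCharsetSniffer {
    // Finds charset=... inside a <meta ...> tag, e.g.
    // <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // First pass: look for the declared charset, or return the fallback guess.
    static Charset sniff(byte[] pageBytes, Charset fallback) {
        // ISO-8859-1 maps every byte to a char, so this decode never fails
        // and the ASCII markup stays readable.
        int prefixLength = Math.min(pageBytes.length, 4096);
        String head = new String(pageBytes, 0, prefixLength, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or illegal charset name: fall through to the guess.
            }
        }
        return fallback;
    }

    // Second pass: decode the whole page with the charset found above.
    static String decode(byte[] pageBytes, Charset fallback) {
        return new String(pageBytes, sniff(pageBytes, fallback));
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1250\"></head>"
                + "<body>\u017elu\u0165ou\u010dk\u00fd</body></html>";
        byte[] bytes = html.getBytes(Charset.forName("windows-1250"));
        System.out.println(sniff(bytes, StandardCharsets.UTF_8));
        System.out.println(decode(bytes, StandardCharsets.UTF_8).equals(html));
    }
}
```

Buffering the page into a byte array first (instead of re-opening the URL) avoids a second download; the bytes from url.openStream() can be passed straight to decode().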

Similar Messages

XML data - charset encoding problem

Hello all,
I am facing an issue with charset encoding. My requirement is to send an XML document and read the output XML in order to display it. The output XML is encoded in "ISO-8859-1", but we are reading it as "UTF-8", and some special characters in the output XML appear as they are instead of being converted.
Could someone let me know how to obtain the desired characters?
Code snippet while reading the XML:
BufferedReader inStream = null;
BufferedWriter outStream = new BufferedWriter(new OutputStreamWriter(connection.getOutputStream(),"UTF-8"));
inStream =
     new BufferedReader(new InputStreamReader(inputStream,"UTF-8"));
Thanks & regards,
Sharath

Hi Sharath,
To read the XML file, use the following. Don't specify the character set when reading it; the parser takes it from the XML declaration. I hope it helps.
XML file(emp.xml)
<?xml version="1.0" encoding="ISO-8859-1"?>
<Emp>
<EmpDetails>
       <firstname>Sarbari</firstname>
       <lastname>Saha</lastname>
</EmpDetails>
<EmpDetails>
       <firstname>Tumpa</firstname>
       <lastname>Hazra</lastname>
</EmpDetails>
</Emp>
Java File
import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import org.w3c.dom.NamedNodeMap;
class ReadXML
{
     public static void main(String args[])
     {
          try
          {
               String fileName="emp.xml";
               DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
               DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
               // The parser reads the encoding from the XML declaration itself
               Document doc = docBuilder.parse (fileName);
               NodeList nodeList = doc.getChildNodes();
               int nodeSize = nodeList.getLength();
               for (int i=0;i<nodeSize;i++)
               {
                    Node node = nodeList.item(i);
                    Element elm = (Element) node;
                    NodeList EmpDetailsList=elm.getElementsByTagName("EmpDetails");
                    int stNodeSize = EmpDetailsList.getLength();
                    System.out.println("NodeSize = "+stNodeSize );
                    for(int j=0;j<stNodeSize;j++)
                    {
                         Node nodeEmpdtl = EmpDetailsList.item(j);
                         Element elmDetails = (Element) nodeEmpdtl;
                         NodeList firstnameList=elmDetails.getElementsByTagName("firstname");
                         NodeList lastnameList=elmDetails.getElementsByTagName("lastname");
                         Node fnameNode=firstnameList.item(0);
                         System.out.print("Node : " + fnameNode.getNodeName());
                         System.out.println (" Value : "+((Element)fnameNode).getChildNodes().item(0).getNodeValue());
                         int lastnameNodeSize = lastnameList.getLength();
                         Node lnameNode=lastnameList.item(0);
                         System.out.print("Node : " + lnameNode.getNodeName());
                         System.out.println(" Value : "+((Element)lnameNode).getChildNodes().item(0).getNodeValue());
                    }
               }
          }
          catch(ParserConfigurationException pce)
          {
               System.out.println("Inside ParserConfigurationException Exception");
          }
          catch(SAXException se)
          {
               System.out.println("Inside SAXException Exception");
          }
          catch(IOException ioe)
          {
               System.out.println("Inside IOException Exception");
          }
     }
}
Regards,
Mithu
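For the original case, where the XML arrives on a stream rather than from a file, the same idea could look roughly like this: hand the parser the raw bytes instead of wrapping them in an InputStreamReader with a fixed charset. This is a minimal sketch; the class name `XmlEncodingDemo` and the sample document are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class.
class XmlEncodingDemo {
    // Parses raw bytes; the parser picks the charset from the XML declaration.
    static String firstName(byte[] xmlBytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getElementsByTagName("firstname").item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<Emp><firstname>Ren\u00e9</firstname></Emp>";
        // Simulate what the remote side sends: ISO-8859-1 bytes.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstName(latin1));
    }
}
```

With a live connection, the InputStream from the connection would be passed to builder.parse() directly.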

How to set the charset encoding dynamically in JSP

Is there any way to set the charset encoding dynamically in a JSP page?
We are using WebLogic 6.1 on HP Unix.
Is there some way we can set the charset dynamically in the page directive
<%@ page contentType="text/html;charset=Shift_JIS" %>
and in the META tag
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
Saurabh Agarwal

Dear Saurabh,
I guess it is possible. Here is an example I made some time ago:
In my HTML page:
<form name="form1" METHOD=POST Action=Lang ENCTYPE="application/x-www-form-urlencoded" >
<p>
<select name="code" size="1">
<option value="big5">Chinese</option>
<option value="ISO-2022-KR">Korean</option>
<option value="x-euc-jp">Japanese</option>
<option value="ISO-8859-1">Spanish</option>
<option value="ISO-8859-5">Russian</option>
<option value="ISO-8859-7">Greek</option>
<option value="ISO-8859-6">Arabic</option>
<option value="ISO-8859-9">French</option>
<option value="ISO-8859-1">German</option>
<option value="ISO-8859-4">Swedish</option>
<option value="ISO-8859-8">Hebrew</option>
<option value="ISO-8859-9">Turkish</option>
</select>
</p>
<p>
<textarea name="entree_text"></textarea>
<input type="submit" name="Submit" value="Submit" >
</p></form>
and in my jsp :
// Read the parameters, then set the content type (including the charset) before getting the writer
String code = req.getParameter("code");
String example = req.getParameter("entree_text");
res.setContentType("text/html; charset=" + code);
PrintWriter out = res.getWriter();
// The servlet tells the browser which charset to use for the selected language
out.println("<html><head><title>Hello World!</title><meta http-equiv=\"Content-Type\" content=\"text/html; charset="+code+"\"></head>");
// Retrieve the character encoding of the request
String reqchar = req.getCharacterEncoding();
out.println("<body><h1>Hello MultiLingual World!</h1>");
out.println("You have defined an ISO of : "+code);
out.println("<BR>This is the code of the page that is displayed in this page<BR>");
out.println("<BR>");
out.println("<BR>");
out.println("Character encoding of the page is : "+reqchar);
out.println("<BR>This is the character code in the Servlet");
out.println("<BR>");
out.println("<BR>");
out.println("<BR>");
out.println("You have typed : "+example);
out.println("<BR>");
out.println("");
out.println("</body></html>");
I think that, starting from this example, it should be easy to change the charset of the JSP dynamically.
The other possibility would be to use WebLogic Commerce and the LOCALIZE function, so that you get an automatic redirection to the right JSP encoding depending on the customer's language.
Feel free to reply on the forum for any related issue.
Best regards
Nohmenn BABAI
BEA EMEA Technical Support Engineer

Charset encoding of command-line arguments

I have a Java application that accepts a string from the command-line arguments, like this:
java myapp <arguments>
What will be the charset encoding of the arguments? They may contain some foreign characters.
The default charset encoding of my JVM is windows-31j.
System.out.println(Charset.defaultCharset().name());
// output: windows-31j
I need to convert the arguments into a byte array, but I don't know what charset to specify in arguments.getBytes().
Please help.

Kyowa wrote:
I need to convert the arguments into bytes array but i don't know what charset to specify in arguments.getBytes()?
Please help.
The arguments are presented as String data, which is always Unicode (UTF-16 internally), regardless of the encoding used by the platform. There could be problems with non-Latin characters if the command-line shell and the JVM disagree about the encoding, which would be a matter of configuration, but the string you get should be Unicode.
When you choose to convert a Unicode string to a byte array or stream, you use whatever encoding you want the bytes to be in. In the Java model, it is characters stored as bytes that are encoded in various ways; an encoding converts to and from the Unicode characters used in Strings.
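To illustrate the point, here is a small sketch showing that the target charset passed to getBytes() decides the bytes, not the platform default. The class name `ArgBytesDemo` and the sample character are assumptions for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
class ArgBytesDemo {
    // Encodes the (Unicode) string into bytes of the requested charset.
    static byte[] toBytes(String text, Charset target) {
        return text.getBytes(target);
    }

    public static void main(String[] args) {
        // Use the first argument if given, else HIRAGANA LETTER A as a stand-in.
        String arg = args.length > 0 ? args[0] : "\u3042";
        System.out.println("UTF-8 bytes:    " + toBytes(arg, StandardCharsets.UTF_8).length);
        System.out.println("UTF-16BE bytes: " + toBytes(arg, StandardCharsets.UTF_16BE).length);
    }
}
```

Passing the charset explicitly, rather than calling getBytes() with no argument, keeps the result independent of the JVM's default charset.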

Charset encoding "in fly"

Hi all!
I have a problem with charset encoding. Currently I'm working on Oracle 7.1. All data in the database was inserted using the win1250 charset (via MS Access or something similar). Now I must present it on the Web using Linux+Apache+PHP, but encoded in the iso-8859-2 charset. And here is my problem.
I don't know how to "translate" data from one charset to another. Does Oracle support "native" charset translation between the server (Windows 1250) and the client (ISO-8859-2) in both directions (select, insert, update, delete)?
I have tried translating the data on the fly using the iconv() function in PHP, but that solution is not very elegant and is too slow.
I would be grateful for any suggestions.

Oracle hasn't supported Oracle 7.1 in many, many years, so I have no idea whether it works completely differently from current versions... I'll describe how this works in modern versions of Oracle.
What are your client and database character sets? Assuming you're properly identifying the data coming in as Windows 1250, the Oracle database automatically converts the data to the database's encoding. When the data is retrieved, it is automatically converted to the character set of the client (assuming such a conversion is possible).
I would note that Windows-1250 and ISO-8859-2 are not completely compatible character sets-- both contain characters that do not appear in the other-- so it may not always be possible to convert the data in all cases.
Justin
Distributed Database Consulting, Inc.
http://www.ddbcinc.com/askDDBC
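The incompatibility mentioned above can be checked before converting. The following is a minimal sketch in Java (not Oracle or PHP code, and not how the database does it internally); the class name `Cp1250ToLatin2` and the sample strings are assumptions for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical demo class.
class Cp1250ToLatin2 {
    static final Charset WIN1250 = Charset.forName("windows-1250");
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    // True if every character in the windows-1250 data also exists in ISO-8859-2.
    static boolean convertsCleanly(byte[] win1250Bytes) {
        String text = new String(win1250Bytes, WIN1250);
        return LATIN2.newEncoder().canEncode(text);
    }

    // Decodes as windows-1250 and re-encodes as ISO-8859-2.
    // String.getBytes substitutes '?' for characters the target cannot represent.
    static byte[] transcode(byte[] win1250Bytes) {
        return new String(win1250Bytes, WIN1250).getBytes(LATIN2);
    }

    public static void main(String[] args) {
        byte[] polish = "za\u017c\u00f3\u0142\u0107".getBytes(WIN1250);
        byte[] euro = "\u20ac".getBytes(WIN1250); // the euro sign has no ISO-8859-2 code
        System.out.println(convertsCleanly(polish));
        System.out.println(convertsCleanly(euro));
        System.out.println((char) transcode(euro)[0]);
    }
}
```

Checking with canEncode first makes it possible to log or reject rows that would otherwise be silently degraded to '?'.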

HTTPService POST with XML does not declare charset encoding

Hi all.
I'm trying to do an HTTP POST of some XML using HTTPService
and I've got it working, apart from the fact that the HTTP message
sent does not declare which character set it uses for encoding.
As a test I send a 'ñ' character, and it seems from the
resulting bytes that the charset being used is UTF-8.
My ActionScript is...
var xml:XML=new XML(<root/>);
xml.@testCharacter="ñ";
xml.appendChild(<login/>);
xml.login.@username="bob";
xml.login.@password="secret";
var httpService:HTTPService=new HTTPService();
httpService.url='http://app.localhost/null';
httpService.method="POST";
httpService.contentType=HTTPService.CONTENT_TYPE_XML;
httpService.request=xml;
httpService.send();
And what seems to get sent over the wire is (shown in
8bit-ASCII)...
POST /null HTTP/1.1
Host: app.localhost
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7
Accept:
text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png ,*/*;q=0.5
Accept-Language: en,en-us;q=0.7,en-gb;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-type: application/xml
Content-length: 77
<root testCharacter="Ã±">
<login username="bob" password="secret"/>
</root>
Anyone know (1) how I can control what character set gets
used for encoding and (2) how I can get either XML or HTTPService
to declare what the encoding is?
Thanks in advance, Neil.
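The observation that the bytes look like UTF-8 can be reproduced with a few lines of Java (used here only as a neutral way to inspect bytes; it is not ActionScript and says nothing about how HTTPService works internally). The class name `Utf8OnTheWire` is made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
class Utf8OnTheWire {
    // Formats bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes UTF-8 bytes as if they were ISO-8859-1, as an 8-bit viewer would.
    static String misreadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String enye = "\u00f1"; // the test character from the post
        System.out.println(hex(enye.getBytes(StandardCharsets.UTF_8)));
        System.out.println(misreadAsLatin1(enye).length() + " chars when read as ISO-8859-1");
    }
}
```

One character turning into two in an 8-bit dump is the usual sign that the payload is UTF-8.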

Jarod,
XML parser product managers who might be able to help you on a parser-specific issue like this "hang out" here on OTN in the XML forum. Have you tried posting there to catch their attention? Thanks.

Problems setting the charset encoding via jdbc

Dear list,
I would like to know if/how it is possible to set the character encoding via the JDBC string with the Oracle thin driver client (classes12.jar).
Microsoft SQL Server, for example, supports adding charset="utf-8" to the JDBC URL.
Could you please tell me how/if this is possible?
A pasted example would be much appreciated.
ben bookey

This bug will be fixed with SP20.
Regards
Stefan

JTable, Clipboard and charset encoding...

I'm trying to paste data into JTable. Here's a part of the code:
private BufferedReader getReaderFromTransferable(Transferable t)
          throws IOException, UnsupportedFlavorException
{
     if (t == null)
          throw new NullPointerException();
     DataFlavor[] dfs = t.getTransferDataFlavors();
     // Print each flavor offered by the Transferable
     for (int i = 0; i < dfs.length; i++)
          System.out.println(dfs[i]);
     DataFlavor df = DataFlavor.selectBestTextFlavor(dfs);
     df = df.getTextPlainUnicodeFlavor();
     Reader r = df.getReaderForText(t);
     return new BufferedReader(r);
}
When I'm copying data from Excel everything is fine because
DataFlavor of mimetype=text/plain...charset=utf-16le is supported.
However, if I try to copy and paste data only inside my JTable,
I'm getting UnsupportedFlavorException. It happens because
there are only two mimetype=text/plain supported none of which is
charset=utf-16le. API says that utf-16le is used for Windows as
default Unicode encoding. What am I supposed to do? How can I set
utf-16le encoding for my JTable? Or maybe I should do something
different.

Hi,
You don't have to set the utf-16le encoding on the JTable. Instead, you have to create your own flavor type which supports the current encoding. I have a code example somewhere on my hard disk, but I'm too lazy to dig it out. You can find examples of creating your own data flavor by putting "creating own data flavors" in the search field on the java.sun.com web site. This can be a real bloodbath, but try to survive.
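A possibly simpler alternative to a custom flavor, sketched here under the assumption that the Transferable also offers DataFlavor.stringFlavor (which StringSelection does, and which the default JTable export is believed to do as well): ask for the String directly, and only otherwise fall back to whatever text flavor the source actually supports instead of forcing the platform Unicode flavor. The class name `ClipboardText` is hypothetical.

```java
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.StringSelection;
import java.awt.datatransfer.Transferable;
import java.awt.datatransfer.UnsupportedFlavorException;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not the custom-flavor approach from the reply above.
class ClipboardText {
    static BufferedReader readerFor(Transferable t) {
        try {
            // Preferred: the data is already a Java String, so no charset is involved.
            if (t.isDataFlavorSupported(DataFlavor.stringFlavor)) {
                String text = (String) t.getTransferData(DataFlavor.stringFlavor);
                return new BufferedReader(new StringReader(text));
            }
            // Fallback: use the best text flavor the source really offers.
            DataFlavor best = DataFlavor.selectBestTextFlavor(t.getTransferDataFlavors());
            if (best == null) {
                throw new IllegalArgumentException("no text flavor offered");
            }
            return new BufferedReader(best.getReaderForText(t));
        } catch (UnsupportedFlavorException e) {
            throw new IllegalArgumentException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper that returns the first line of the transferred text.
    static String firstLine(Transferable t) {
        try (BufferedReader r = readerFor(t)) {
            return r.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLine(new StringSelection("a\tb\nc\td")));
    }
}
```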

New Charset Encoder/Decoder for Microsoft telnet

Hello,
I have a problem writing a server that is accessible from the Microsoft Windows telnet client.
I have managed to find out these relationships:
          switch ((int) c) {
               case -27: c = 'Õ'; break;
               case -28: c = 'õ'; break;
               case -102: c = 'Ü'; break;
               case -103: c = 'Ö'; break;
               case -108: c = 'ö'; break;
               case -114: c = 'Ä'; break;
               case -124: c = 'ä'; break;
               case -127: c = 'ü'; break;
          }
So how could I write an encoder/decoder that maps these characters specially and all others as normal ISO-8859-1?
I tried switching them after encoding/decoding, but then I got a problem:
two characters appeared in the telnet window. The first was some weird symbol and the second was the right character.
Any help, please?
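The negative values listed above are what you get when raw bytes are read as Java's signed byte type. A minimal sketch, assuming the telnet client sends IBM850 (code page 850, the charset named later in this thread): mask the byte to get the unsigned code, and let the IBM850 charset do the mapping. The class name `SignedByteDemo` is made up for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical demo class.
class SignedByteDemo {
    // Assumption: the client uses code page 850.
    static final Charset CP850 = Charset.forName("IBM850");

    // Java bytes are signed; masking yields the 0..255 code the client sent.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    // Decodes raw bytes with IBM850 instead of a hand-written switch.
    static String decode(byte... raw) {
        return new String(raw, CP850);
    }

    public static void main(String[] args) {
        for (byte b : new byte[] { -124, -127 }) {
            System.out.printf("%d -> 0x%02X -> %s%n", b, unsigned(b), decode(b));
        }
    }
}
```

If this assumption holds, wrapping the socket stream in new InputStreamReader(in, "IBM850") may make a custom charset unnecessary.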

Okay, I got the charset include working and everything is okay.
Now I need to look for char codes below 0. I thought it would be useful to create a new charset for this purpose, one that adds a little extra functionality to IBM850. When running this code, it says that the IBM850 charset is not found when initializing the X-MICRO charset, although it is accessible in the main program. Any ideas what is wrong? Please help me, this damn internationalization has driven me mad.
package ee.feelfree.charset;
import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;
import java.util.HashSet;
import java.util.Iterator;
public class MicrosoftCharsetProvider extends CharsetProvider {
     private static final String CHARSET_NAME = "X-MICRO";
     private Charset micro = null;
     public MicrosoftCharsetProvider() {
          this.micro = new MicrosoftCharset (CHARSET_NAME, new String [0]);
     }
     public Iterator charsets() {
          HashSet set = new HashSet (1);
          set.add (micro);
          return (set.iterator());
     }
     public Charset charsetForName (String charsetName) {
          if (charsetName.equalsIgnoreCase (CHARSET_NAME)) {
               return (micro);
          }
          return (null);
     }
}
package ee.feelfree.charset;
import java.nio.CharBuffer;
import java.nio.charset.*;
import java.nio.ByteBuffer;
public class MicrosoftCharset extends Charset {
     private static final String BASE_CHARSET_NAME = "IBM850";
     Charset baseCharset;
     protected MicrosoftCharset(String canonical,String [] aliases) {
          super (canonical, aliases);
          baseCharset = Charset.forName(BASE_CHARSET_NAME);
     }
     public CharsetEncoder newEncoder() {
          return new MicrosoftEncoder(this,baseCharset.newEncoder());
     }
     public CharsetDecoder newDecoder() {
          return new MicrosoftDecoder(this,baseCharset.newDecoder());
     }
     public boolean contains (Charset cs) {
          return false;
     }
     private class MicrosoftEncoder extends CharsetEncoder {
          private CharsetEncoder baseEncoder;
          MicrosoftEncoder(Charset cs,CharsetEncoder baseEncoder) {
               super(cs, baseEncoder.averageBytesPerChar(),
                         baseEncoder.maxBytesPerChar());
               this.baseEncoder = baseEncoder;
          }
          protected CoderResult encodeLoop (CharBuffer cb, ByteBuffer bb) {
               CharBuffer tmpcb = CharBuffer.allocate (cb.remaining());
               while (cb.hasRemaining()) {
                    tmpcb.put (cb.get());
               }
               tmpcb.rewind();
               baseEncoder.reset();
               CoderResult cr = baseEncoder.encode (tmpcb, bb, true);
               cb.position (cb.position() - tmpcb.remaining());
               return (cr);
          }
     }
     private class MicrosoftDecoder extends CharsetDecoder {
          private CharsetDecoder baseDecoder;
          private boolean microClient = false;
          MicrosoftDecoder (Charset cs, CharsetDecoder baseDecoder) {
               super (cs, baseDecoder.averageCharsPerByte(),
                         baseDecoder.maxCharsPerByte());
               this.baseDecoder = baseDecoder;
          }
          protected CoderResult decodeLoop (ByteBuffer bb, CharBuffer cb) {
               baseDecoder.reset();
               CoderResult result = baseDecoder.decode (bb, cb, true);
               myDecode (cb);
               return (result);
          }
          public boolean getClient() {
               return microClient;
          }
          private void myDecode(CharBuffer cb) {
               microClient = false;
               for (int pos = cb.position(); pos < cb.limit(); pos++) {
                    int c = (int) cb.get (pos);
                    if (c<0) microClient=true;
               }
          }
     }
}

Charset Encoding Problem

Hello all,
I have the problem that umlauts like ä, ü, ö are not shown in the browser. In the JSP page I included charset=iso-8859-1. I am using Web Application Server 6.4 SP12. If I use this charset in a normal HTML page, it works fine. Could anybody help me solve the problem?
Kind regards
Axel

Hi Axel,
try to use charset UTF-8, which also is the default workbench editor setting on NWDS. At least, the portal's JSP engine (which is different from the WAS') needs that to work properly.
Hope it helps
Detlev

Charset, Encoding and AWT

Hi there,
I'm working on an applet which is supposed to be used by Polish people (encoding iso8859-2), but I have absolutely no idea how to make it display Polish characters correctly. The applet downloads a properties file by calling
Properties.load(URL)
The file the URL points to contains Polish characters, which display correctly in any browser when the appropriate encoding is selected, but the applet still displays ? and squares instead of the Polish characters.
Does anybody have a clue or a solution?
Thanks,
Etienne
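One likely cause is that Properties.load(InputStream) always decodes the bytes as ISO-8859-1. A minimal sketch of one way around it, assuming a Java version that has Properties.load(Reader) (Java 6 or later, so probably newer than the JVM used for this applet): wrap the stream in a reader with the real charset. The class name `Latin2Properties` and the sample key are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.Properties;

// Hypothetical helper class.
class Latin2Properties {
    // Loads a properties file whose bytes are in the given charset.
    static Properties load(InputStream in, Charset charset) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, charset)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");
        // Stand-in for the downloaded file; with a URL this would be url.openStream().
        byte[] file = "city=\u0141\u00f3d\u017a\n".getBytes(latin2);
        Properties props = load(new ByteArrayInputStream(file), latin2);
        System.out.println(props.getProperty("city").length());
    }
}
```

On older JVMs the traditional alternative is to keep the file in ISO-8859-1 with \uXXXX escapes (for example via the native2ascii tool). Displaying the result still requires a font that contains the Polish glyphs.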

I have a similar problem. What I've discovered so far is:
Chars seem to display OK in any given language, as long as that language is your default locale and you have just typed them in. If you try to read these chars in code, getText on the TextField returns ???? and the byte codes appear to be NCS codes.
If the language is NOT the default locale, then they do not display correctly, but their byte code is the correct Unicode number!
I think this has to do with the AWT controls being native controls, so there is a strange interaction between the OS and the JVM.
Does anyone have a clue how to take non-ASCII chars in and then display them correctly?

Language Charset Encoding Problems

Hi,
I have Oracle 9i with Developer Suite installed on my Windows 2000 Professional box. My local language is Arabic, which uses the win1256 charset. When I run my Forms and Reports, all Arabic characters appear as ???????????????
What should I do to correct this problem, and is there anything that should be done during the Developer or Oracle installation to make my Arabic characters appear correctly?
Thanks

When I installed the database I chose the ARABIC Windows-1256 character set!
Where can I change my database character set, and what is the best character set for me? In my Explorer I use Windows-1256.
And does the NLS_LANG entry in my Windows Registry
affect my database charset? Right now this value is null.
Thanks for the reply.

Servlet charset encoding and ServletInputStream

Does anyone have a code sample which implements
HttpServletRequest.getInputStream()?
Does anyone know if there is any way to change the request encoding in Servlet 2.2?

Check out Tomcat at the jakarta.apache.org site; it is the reference implementation of the Servlet API, is open source, and provides plenty of material.
Also check out this book, which covers all the must-know stuff on servlets:
http://www.oreilly.com/catalog/jservlet2/toc.html
I'm not exactly sure what you mean by this:
change request encoding in servlet 2.2??
If you actually mean internationalisation of the response stream, check out these links.
http://java.sun.com/products/jdk/1.2/docs/guide/internat/index.html
http://java.sun.com/docs/books/tutorial/essential/io/index.html
http://developer.java.sun.com/developer/technicalArticles/Streams/WritingIOSC/

Looking for performance boost to java.nio.charset.encode(String) method

Hi,
I have run JProfiler on my application and saw that this method takes 4% of the overall CPU time (!).
Is there any other, faster way to do this encoding?
Thanks!

yjavaman wrote:
I suspect that in my application (multi-threaded) it is actually a larger bottleneck than I think right now.
I am asking about maybe caching the encoder and decoder; maybe I can save this time by doing other stuff.
Not sure if you are already buffering, but from the javadoc:
java.io.OutputStreamWriter
Each invocation of a write() method causes the encoding converter to be invoked on the given character(s). The resulting bytes are accumulated in a buffer before being written to the underlying output stream. The size of this buffer may be specified, but by default it is large enough for most purposes. Note that the characters passed to the write() methods are not buffered.
For top efficiency, consider wrapping an OutputStreamWriter within a BufferedWriter so as to avoid frequent converter invocations. For example:
Writer out = new BufferedWriter(new OutputStreamWriter(System.out));
Just something simple to use.
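On the caching idea raised in the question: a CharsetEncoder is not thread-safe, so one common pattern is to keep one encoder per thread. The following is only a sketch of that pattern, with a made-up class name `CachedEncoder`; whether it actually beats Charset.encode(String) in a given JVM is an open question that would have to be measured with the profiler.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class.
class CachedEncoder {
    // One encoder per thread, created lazily and reused for every call.
    private static final ThreadLocal<CharsetEncoder> UTF8_ENCODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE));

    static byte[] encode(String text) {
        try {
            // encode(CharBuffer) resets the encoder before it starts, so reuse is safe.
            ByteBuffer buf = UTF8_ENCODER.get().encode(CharBuffer.wrap(text));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE error actions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("h\u00e9llo").length);
    }
}
```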

Problems with charset encoding using sender JMS channel

Hi all,
We are connecting an AS400/WebSphere, with SAP via XI (PI7.0), by means of JMS adapter.
Situation: we are sending a message through a JMS sender channel from WebSphere to XI.
The problem occurs if the message has special characters (like client name). The contents of the XI message became wrong on those characters (they are replaced by '#'), sending invalid information to SAP.
The character set CCSID of the WebSphere is 00037.
Can anyone help me on overcoming this problem?
thanks in advance.

thanks Stefan,
>> What kind of mapping do you use?
We are using a graphical mapping that maps the incoming message to an IDoc message definition.
>> Is the payload wrong after the mapping?
The mapping ends successfully and the payload is well formatted, but the content of the final message is wrong (only for the special chars; e.g. the ã is replaced by #).
>> Where do you send that message to?
The message is an IDoc that is sent to SAP.
Best regards.
