NVARCHAR UTF-8 byte length

The following call returns the wrong byte length for NVARCHAR2 columns when the database character set is UTF8. It returns double the character length rather than triple. Oracle version 10.1.0.3.
OCIAttrGet((dvoid *)param,
OCI_DTYPE_PARAM,
(dvoid *)&outcol->data_size,
(dvoid *)0,
OCI_ATTR_DATA_SIZE,
statement->pError)
This seems like a bug.

Yes there is a maximum length. A short look at the docs for DataOutputStream.writeUTF tells you:
>>>
Writes a string to the underlying output stream using UTF-8 encoding in a machine-independent manner.
First, two bytes are written to the output stream as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string.
<<<
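So writeUTF bails (it throws a UTFDataFormatException) whenever the number of bytes to be written does not fit into that two-byte length prefix. The prefix is read back as an unsigned 16-bit value, so the maximum is 65535 bytes. That's your max length. Note that this is the number of bytes, not chars, since in this encoding every non-ASCII char (and the null character) is saved as 2 or 3 bytes!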
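A minimal sketch of how that byte count could be checked before calling writeUTF. The class and method names (WriteUtfLength, encodedLength, fitsWriteUtf, writtenSize) are made up for this illustration; only DataOutputStream.writeUTF comes from the discussion above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WriteUtfLength {
    // Number of bytes writeUTF emits for the string body (modified UTF-8),
    // not counting the two-byte length prefix.
    static int encodedLength(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                n += 1;          // plain ASCII
            } else if (c <= 0x07FF) {
                n += 2;          // includes '\u0000', which is written as two bytes
            } else {
                n += 3;          // everything else, surrogates included
            }
        }
        return n;
    }

    // True if the string can be passed to writeUTF without an exception.
    static boolean fitsWriteUtf(String s) {
        return encodedLength(s) <= 65535;
    }

    // Bytes actually produced by writeUTF, length prefix included.
    static int writtenSize(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "na\u00efve \u65e5\u672c";
        System.out.println(encodedLength(s) + " body bytes, " + writtenSize(s) + " written");
    }
}
```

Strings that fail fitsWriteUtf would have to be split or written another way, for example as a length-prefixed byte array.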

Similar Messages

How to get the string's byte length?

I have a string and I want to get its byte length. How can I do it?
For example:
<cfoutput>#len('hihi，这是测试')#</cfoutput>
The output is 9.
I want to get the byte length, which is 14. How can I get it?
Thanks.

>> Fair cop. I didn't realise that asc() returned the codepoint rather than the actual character code.
>
> and what would be the difference?
Oh, sorry, whatever the term is (I'm crap with jargon). The value returned by asc() for those chars was only two bytes (ie: four hex digits). I didn't realise there was more to it than that, and that 2-byte value maps to some other THREE byte value. I need to do some reading...
> that's what both cf & java counted as the length.
CHARACTER length, sure. No-one's disputing that. On the other hand, no-one's asking about it, either.
> by adding a BOM you've already affected the encoding, which may or may not match the original. so you still don't know how many bytes were in the original string.
[groan]
Yes, that's a reasonable strawman there. I was only putting it in a file so I could save it and check the number of bytes occupied by the data.
Clearly... CLEARLY... the OP is not asking for a character length of that string. They've said as much.
I copied and pasted the string from their post, and used it as a demonstration of how "nine" is not the right answer for the BYTE LENGTH of that string. Whether or not the original string was UTF-8, UTF-16 or special-marmoset-encoding, it almost certainly was NOT in a fictitious kind of encoding in which each of those particular characters only occupied one byte each, which would mean that "nine" is the correct answer to the question.
When I copied those characters from either the web browser or from my text-based news agent, notepad identified them (and rendered them correctly) as UTF-8, so I'm fairly confident they ARE UTF-8. Of course this could be down to some intermediary encoding (pasting them in to the original posting, for example, via some encoding-transforming mechanism), but Occam's Razor suggests the original question was from a UTF-8 POV.
But maybe we should quit speculating and ask the OP. Unless they've buggered off in despair of how drawn out all this is getting. For which I would not blame them.
Adam
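Since ColdFusion runs on Java, the point of this thread can be sketched in plain Java: the byte length only exists relative to an encoding. The class name ByteLength is made up for this example, and the string is written with Unicode escapes so the source file's own encoding does not matter. As an aside, the 14 in the original question would be consistent with a double-byte encoding such as GBK (4 ASCII bytes plus 5 characters at 2 bytes each), but that is an inference, not something the poster stated.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLength {
    // Byte length of a string in a given encoding.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // The string from the question: "hihi" + fullwidth comma + four Chinese characters.
        String s = "hihi\uFF0C\u8FD9\u662F\u6D4B\u8BD5";
        System.out.println("chars:    " + s.length());
        System.out.println("UTF-8:    " + byteLength(s, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE: " + byteLength(s, StandardCharsets.UTF_16BE));
    }
}
```

In UTF-8 the four ASCII letters take one byte each and the other five characters take three bytes each, which is why neither 9 nor 14 comes out for that encoding.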

StAX parser does not handle UTF-8 byte order mark

Hello,
I am playing around with the reference implementation of the StAX API using the XMLStreamReader.
When I parse UTF-8 encoded XML files that start with the UTF-8 byte order mark, I get the following exception when the method next() is called on the reader instance:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,7]
Message: processing instruction can not have PITarget with reserveld xml name
     at com.bea.xml.stream.MXParser.parsePI(MXParser.java:2734)
     at com.bea.xml.stream.MXParser.parseProlog(MXParser.java:1775)
     at com.bea.xml.stream.MXParser.nextImpl(MXParser.java:1717)
     at com.bea.xml.stream.MXParser.next(MXParser.java:1180)
The XMLStreamReader is created on a FileInputStream.
When parsing XML files without a byte order mark, parsing works without any problems.
Any idea how to solve this problem, or is this an internal problem of the StAX implementation?
Thanks for your help.
Jörg Eichhorn

Issues related to handling the BOM were fixed as part of the 10g project which added NLS support to the protocols. I just verified that a UTF8 file containing a BOM is correctly processed via FTP in 10.1.0.2.0.
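For the StAX question above, one possible workaround (a sketch, not a fix in the parser itself) is to consume the three BOM bytes EF BB BF before the stream reaches the XMLStreamReader. The class and method names here (Utf8Bom, strip, skip) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.Arrays;

class Utf8Bom {
    // True if the first len bytes of b start with EF BB BF.
    static boolean startsWithBom(byte[] b, int len) {
        return len >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    // Returns the data without a leading UTF-8 BOM, or the same array if there is none.
    static byte[] strip(byte[] data) {
        return startsWithBom(data, data.length)
                ? Arrays.copyOfRange(data, 3, data.length)
                : data;
    }

    // Wraps a stream so that a leading UTF-8 BOM, if present, is consumed.
    static InputStream skip(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break;           // stream shorter than three bytes
            }
            n += r;
        }
        if (!startsWithBom(head, n) && n > 0) {
            pb.unread(head, 0, n);   // not a BOM: put the bytes back
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream in = skip(new ByteArrayInputStream(xml));
        System.out.println((char) in.read());
    }
}
```

The stream returned by skip could then be handed to XMLInputFactory.createXMLStreamReader(InputStream) in place of the raw FileInputStream.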

DDIC-Structure - Need Byte Length of a DDIC-STRUCTURE!!!

Hello,
How can I write a report or function that takes any DDIC structure name as input and returns the total byte length of that structure as output?
For example, we have a DDIC structure named structure_example:
Name          TYPE     c     LENGTH     15,
Firstname   TYPE     c     LENGTH      10,
Street         TYPE     c     LENGTH      30,
ZIP-Code    TYPE     int2 LENGTH      5.
There must be some existing functionality that takes the structure name and returns the byte length of the structure.
In our example, the user gets the output "60" bytes for "structure_example".
Input:          structure_example
Output:       60
I hope someone can give me a solution to solve this problem.
Thanks for all help in this forum.
With kind regards
ETN

Hi,
Check out the code below; maybe it will help you. The 'Length' line prints the byte length of the whole structure.
TYPES:
BEGIN OF my_struct,
    comp_a type i,
    comp_b type f,
END OF my_struct.
DATA:
my_data   TYPE my_struct,
descr_ref TYPE ref to cl_abap_structdescr.
FIELD-SYMBOLS:
<comp_wa> TYPE abap_compdescr.
START-OF-SELECTION.
descr_ref ?= cl_abap_typedescr=>describe_by_data( my_data ).
WRITE: / 'Typename     :', descr_ref->absolute_name.
WRITE: / 'Kind         :', descr_ref->type_kind.
WRITE: / 'Length       :', descr_ref->length.
WRITE: / 'Decimals     :', descr_ref->decimals.
WRITE: / 'Struct Kind :', descr_ref->struct_kind.
WRITE: / 'Components'.
WRITE: / 'Name              Kind   Length   Decimals'.
LOOP AT descr_ref->components ASSIGNING <comp_wa>.
    WRITE: / <comp_wa>-name, <comp_wa>-type_kind,
             <comp_wa>-length, <comp_wa>-decimals.
ENDLOOP.
regards,
Santosh Thorat

Problem with byte[].length

Dear sir,
I am working with Java Card 2.2.2, Windows, JDK 1.5 and JCWDE.
I would like to know the byte length of an array ("aCrypter" in the following code):
1     public byte[] cryter(byte[] Crypter){
2
3          ecipher.init(key, Cipher.MODE_ENCRYPT);
4          ecipher.doFinal(Crypter, (short)0, (short)aCrypter.length, donneeCrypter , (short)0);
5
6          return donneeCrypter;
7     }
The JCVM returns an error on line 4 (it works with "(short)1").
Is there a way to know the array length?
Regards,
Alexis
Edited by: Alexis "le francais" on 21 March 2011 07:33

Thanks for your help.
In fact, I would like to encrypt and then decrypt with the following code:
       private byte[] Crypto = {(byte)0xA0, (byte)0x00,
             (byte)0x00, (byte)0x00, (byte)0x62, (byte)0x03, (byte)0x01, (byte)0x0C,
             (byte)0x0f, (byte)0x01, (byte)0x01};
public void process(APDU apdu) throws ISOException {
          // TODO Auto-generated method stub
          byte[] buffer = apdu.getBuffer();
          if (this.selectingApplet()) return;
          if (buffer[ISO7816.OFFSET_CLA] != CLA_MONAPPLET) {
               ISOException.throwIt(ISO7816.SW_CLA_NOT_SUPPORTED);
          }
          switch (buffer[ISO7816.OFFSET_INS]) {
               case INS_INTERROGER_COMPTEUR:
                    tab = new byte[1000];
                    tab[0]= compteur;
                    tableau = cryter(tab);
                    tableau2= decrypter(tableau);
                    apdu.setOutgoing();
                    apdu.setOutgoingLength((short) 7);
                    apdu.sendBytesLong(tableau2,(short) 0, (short) tableau2.length);
                    break;
          }
     }
     public void initialisation(){
        key = (DESKey)KeyBuilder.buildKey(KeyBuilder.TYPE_DES_TRANSIENT_DESELECT,KeyBuilder.LENGTH_DES, false);
        key.setKey(Crypto, (short)0);
        ecipher = Cipher.getInstance(Cipher.ALG_DES_CBC_ISO9797_M2,false);
     }
     public byte[] decrypter(byte[] aDecrypter){
          ecipher.init(key, Cipher.MODE_DECRYPT);
          donneeDecrypte = new byte[10000];
          ecipher.doFinal(aDecrypter, (short)0, (short)aDecrypter.length, donneeDecrypte, (short)0);
          return donneeDecrypte;
     }
     public byte[] cryter(byte[] Crypter){
          ecipher.init(key, Cipher.MODE_ENCRYPT);
          donneeCrypter = new byte[10000];
          ecipher.doFinal(Crypter, (short)0, (short)Crypter.length, donneeCrypter, (short)0);
          return donneeCrypter;
     }
Now the "crypter" method is executed, but when the debugger reaches the following line there is a problem:
tableau2= decrypter(tableau);
Maybe the size of the array?
>
return ecipher.doFinal(aCrypter);
>
I work with Java Card 2.2.2; this method does not exist in the API, only:
abstract short doFinal(byte[] inBuff, short inOffset, short inLength, byte[] outBuff, short outOffset)
Edited by: le francais on 22 March 2011 03:07
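One thing that may be relevant here, offered as a guess rather than a confirmed diagnosis: doFinal returns the number of bytes it wrote, while the decrypter above is handed the whole 10000-byte buffer and uses its .length. With ISO 9797-1 padding method 2, the ciphertext length is determined by the input length, so it can be computed up front. The sketch below is plain Java arithmetic, not Java Card code, and the names Iso9797M2 and paddedLength are made up for the example.

```java
class Iso9797M2 {
    static final int DES_BLOCK = 8;   // DES block size in bytes

    // Length after ISO 9797-1 padding method 2: a 0x80 byte is always appended,
    // then zero bytes up to the next multiple of the block size.
    static int paddedLength(int inputLength) {
        return (inputLength / DES_BLOCK + 1) * DES_BLOCK;
    }

    public static void main(String[] args) {
        // The 1000-byte "tab" array from the post would encrypt to this many bytes.
        System.out.println(paddedLength(1000));
    }
}
```

Under that assumption, the decrypt call would be given this length (or the short returned by the encrypting doFinal) as inLength instead of the full buffer size.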

Install(byte[] buffer,short offset, byte length)

Could anyone tell me where I can find the definition of the data format of 'buffer' in 'install(byte[] buffer, short offset, byte length)'? It is standard, right?

Look in appendix A.2 of the GlobalPlatform Card Specifications 2.1.1 (pages 160, 161). The same information is specified for the install() method in the Java Card 2.2 API spec. The format is the same for Java Card 2.1.1 and GlobalPlatform 2.0.1.
G
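As a rough illustration only, assuming the install parameters are a sequence of length-value (LV) fields in the order instance AID, control info, applet data (check the specifications cited above before relying on this), the buffer could be split like this in plain Java. InstallParams and splitLv are hypothetical names, not part of any Java Card API.

```java
import java.util.Arrays;

class InstallParams {
    // Splits fieldCount consecutive LV fields starting at offset:
    // each field is one length byte followed by that many value bytes.
    static byte[][] splitLv(byte[] buffer, int offset, int fieldCount) {
        byte[][] fields = new byte[fieldCount][];
        int pos = offset;
        for (int i = 0; i < fieldCount; i++) {
            int len = buffer[pos] & 0xFF;                 // length byte, read as unsigned
            fields[i] = Arrays.copyOfRange(buffer, pos + 1, pos + 1 + len);
            pos += 1 + len;
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample: 2-byte AID, 1-byte control info, empty applet data.
        byte[] sample = {2, (byte) 0xA0, 0x00, 1, 0x01, 0};
        byte[][] f = splitLv(sample, 0, 3);
        System.out.println(f[0].length + " " + f[1].length + " " + f[2].length);
    }
}
```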

Replacing 20 fixstatements by Global Variable - Problem 255 bytes length

Hello,
we have an issue in businessrules with setting the fix statement on 1 dimension:
we currently use Fix (@RELATIVE("CBU_ALL",0) ) - on level 0 are approx. 3000 members - on medium level are 20 CBUs Seat and 20 CBUs Door
we have approx. 20-30 similar businessrules - which either calculate on seat or door CBUs
the requirement is to either calculate with a rule the 20 CBUs for Seats or the 20 CBUs for Doors
as we currently do not fix properly on either Seat or Door CBUs, we calculate approx. 1500 empty members (empty, if fix in another dimension done correctly) - tests showed, that this doubles the time which would be needed.
I know we could easily set 20 fixes in each business rule:
@RELATIVE("CBU_BMW_Seat",0)
....20 more....
@RELATIVE("CBU_Ford_Seat",0)
(the fix above would then exclude the 1500 members, which are below:
@RELATIVE("CBU_BMW_Doors",0)
....20 more....
@RELATIVE("CBU_Ford_Doors",0)
Unfortunately, CBUs/customers are frequently renamed or added, so I cannot afford to build these 20 fix statements into 20 different business rules and maintain them all the time.
I thought of using UDAs or Attribute Values -
but it seems not to be possible in a fixstatment to combine a relative or Children fixstatement with UDAs, which are set on the upper member ?
I assume it works, if I classify all 3000 level 0 members with UDA or attribute SEATS or DOORS - but that's inefficient
@RELATIVE("CBU_BMW_Seat",0) AND @UDA(CBU,"SEATS")
@CHILDREN("CBU_BMW_Seat") AND @UDA(CBU,"SEATS")
@Descendants("CBU_BMW_Seat") AND @UDA(CBU,"SEATS")
@UDA(CBU,"SEATS")
generally, it seems not to be allowed to combine Children or descendant fixes with any other relations or conditions ?
@CHILDREN(@Match(CBU," Seat") ) (attempt to search for all children of all CBUs with Seat in its name)
So the idea is to define 2 global variables:
1 for Seats and 1 for Doors:
[SEAT] includes then then: @children("CBU_BMW_Seat") ... 20 more @children("CBU_Ford_Seat")
advantage would be, we can maintain the list of CBUs in 1 place
my problem is: length of global variable is limited to 255 bytes - I need 800 to 900 digits to define the 20 CBUs
having 8 global variables instead of 40 CBUs referenced in Fixstatements is not really an advantage
even if I would rename the CBUs to just S1,S2,S3,S4 D1,D2,D3 (S for Seat, D for Door) (and use aliases in Planning and Reporting with full name to have the right meaning for the users), it does not fit into 1 variable: @children("S1"), @Children("S2"), ..... is simply too long (still 400 digits)
also other attempts to make the statement shorter, failed:
@children(@list("CBU_BMW_Seat","CBU_Ford_Seat",.....)) is not allowed
Is there any other idea using global variables, macros, or sequences?
Is there a workaround to extend the global variable limit? We have release 9.3.1. Is this solved in future releases?
Are there any other commands which I can combine in a clever way in fix statements with
@relative
@children
@descendants
with things like @match @list ?
(Generation and Level are not appropriate criteria for separating Seats and Doors, as the hierarchy is the same.)
please understand, that as we use this application for 5 years with a lot of historic data and it's a planning application with a lot of webforms and financial reports, and all the CBU members are stored members with calculated totals and access rights and setup data on upper members,
I can not simply re-group the whole cbu structure and separate Seats and Doors just for calculation performance
CBU dimension details is like this:
Generation:
1 CBU_ALL
2 CBU_BMW
3 CBU_BMW_Seat
4 Product A
4 Product B
..... hundreds more
3 CBU_BMW_Door
4 Product C
4 Product D
.... hundreds more
2 CBU_Ford
3 CBU_Ford_Seat
4 Product E
4 Product F
.... hundreds more
3 CBU_Ford_Doors
4 Product G
4 Product H
.... hundreds more
20 more CBUs with below 20 CBUs Seat and 20 CBUs Door

How hard would it be to insert 2 children under CBU_All? Name them CBU_Seats and CBU_Doors, then group all the Seats and Doors under them.
Then your calc could be @Relative(CBU_Doors, 0).
I know it's not always easy or feasible to effect change to a hierarchy, but I just had the thought.
Robert

Byte length problem

I don't understand this, but I suppose the solution is simple.
I want to use the getByAddress(byte[] address) method of the class InetAddress in java.net.
address is a byte array of length 4 that should contain the 4 parts of an IP address.
BUT each part of an IP address can be 32 bits long, which is bigger than the possible length of a byte.
Does anyone know what I am doing wrong?

Hi, it depends on the type of IP address. The most common one (IPv4) is only 32 bits long in total, so each of its 4 parts is a single byte (0 to 255). E.g. 255.123.123.01 <- 4 bytes.
Kaj
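A small sketch of that answer in code. Java bytes are signed, so parts above 127 need a cast, but the bit pattern that reaches InetAddress is the same. The class and method names (Ipv4Bytes, dotted) are made up for the example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class Ipv4Bytes {
    // Builds an IPv4 address from its four parts (each 0 to 255) and returns it in dotted form.
    static String dotted(int a, int b, int c, int d) {
        byte[] raw = {(byte) a, (byte) b, (byte) c, (byte) d};   // casts keep the low 8 bits
        try {
            return InetAddress.getByAddress(raw).getHostAddress();
        } catch (UnknownHostException e) {
            // Only thrown for arrays of illegal length, which cannot happen here.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(dotted(255, 123, 123, 1));
    }
}
```

getByAddress does not perform a name lookup, so this runs without network access.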

Method to Get the INPUT parameter CONTENT byte length

method to get the INPUT parameter CONTENT byte length

Dear "clown of forums",
Please read the forum rules and ask understandable questions -> one thread per properly formulated question after having searched.
Thread locked.

ConvertToClob and byte order mark for UTF-8

We are converting a blob to a clob. The blob contains the UTF-8 byte representation (including the 3-byte byte order mark) of an XML document. The clob is then passed as a parameter to xmlparser.parseClob. This works when the database character set is AL32UTF8, but on a database with character set WE8ISO8859P1 the clob contains an '¿' before the '<'.
I would assume that the ConvertToClob function would understand the byte order mark for UTF-8 in the blob and not include any parts of it in the clob. The byte order mark for UTF-8 consists of the byte sequence EF BB BF. The last byte BF corresponds to the upside down question mark '¿' in ISO-8859-1. To me, it seems as if ConvertToClob is not converting correctly.
Am I missing something?
code snippets:
l_lang_context number := 1;
dbms_lob.createtemporary(l_file_clob, TRUE);
dbms_lob.convertToClob(l_file_clob, l_file_blob,l_file_size, l_dest_offset,
l_src_offset, l_blob_csid, l_lang_context, l_warning);
procedure fetch_xmldoc(p_xmlclob in out nocopy clob,
o_xmldoc out xmldom.DOMDocument) is
parser xmlparser.Parser;
begin
parser := xmlparser.newParser;
xmlparser.parseClob(p => parser, doc => p_xmlclob);
o_xmldoc := xmlparser.getDocument(parser);
xmlparser.freeParser(parser);
end;
The database version is 10.2.0.3 on Solaris 10 x86_64
Eyðun
Edited by: Eyðun E. Jacobsen on Apr 24, 2009 8:58 PM

can this be of some help? http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions027.htm#SQLRF00620
Regards
Etbin
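To illustrate the effect described in the question outside the database: in Java, the UTF-8 BOM decodes to the character U+FEFF, which has no representation in ISO-8859-1, so converting to that character set leaves a replacement byte in front of the '<' unless the BOM is removed first. This is Java behaviour only; whether dbms_lob.convertToClob works analogously is an assumption. The names BomInLatin1, utf8ToLatin1 and utf8ToLatin1WithoutBom are made up.

```java
import java.nio.charset.StandardCharsets;

class BomInLatin1 {
    // Naive conversion: the BOM survives decoding as U+FEFF and becomes '?' in ISO-8859-1.
    static byte[] utf8ToLatin1(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Same conversion, but the leading U+FEFF is dropped before encoding.
    static byte[] utf8ToLatin1WithoutBom(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] doc = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(utf8ToLatin1(doc).length + " vs " + utf8ToLatin1WithoutBom(doc).length);
    }
}
```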

Truncate byte array of UTF-8 characters without corrupting the data?

Hi all,
I need to be able to determine whether a byte array that was truncated from an original byte array representing a UTF-8 string ends in a corrupted character. Knowing this allows me to remove the corrupted character from the truncated array.
In the sample code below, truncating the string to 16 bytes displays OK. However, when truncating to 17 bytes, the last character is corrupted. Is there a way to check whether the character is corrupted so that it can be removed from the truncated byte array?
Thanks in advance,
Phuong
PS: I chose the Japanese characters randomly from the Unicode charts. I don't know their meaning, so if they are offensive, please forgive me.
import java.awt.BorderLayout;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.UnsupportedEncodingException;
import javax.swing.BoxLayout;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JPanel;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import javax.swing.SwingUtilities;
public class TestTruncateUTF8 extends JFrame
{
    private static final long serialVersionUID = 1L;
    private JTextArea textArea = new JTextArea(5,20);
    private JLabel japanese = new JLabel("Japanese: " + "\u65e5\u672c\u3041\u3086\u308c\u306e");
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        SwingUtilities.invokeLater(new Runnable() {
            @Override
            public void run()
            {
                JFrame frame = new TestTruncateUTF8();
                frame.setVisible(true);
            }
        });
    }
    public TestTruncateUTF8()
    {
        super("Test Truncated");
        JButton truncate17Button = new JButton("Truncate 17 bytes");
        truncate17Button.addActionListener(new ActionListener() {
            @Override
            public void actionPerformed(ActionEvent e)
            {
                truncates(17);
            }
        });
        JButton truncate16Button = new JButton("Truncate 16 bytes");
        truncate16Button.addActionListener(new ActionListener() {
            @Override
            public void actionPerformed(ActionEvent e)
            {
                truncates(16);
            }
        });
        JPanel panel1 = new JPanel();
        panel1.setLayout(new BoxLayout(panel1, BoxLayout.Y_AXIS));
        panel1.add(japanese);
        panel1.add(truncate16Button);
        panel1.add(truncate17Button);
        panel1.add(new JScrollPane(textArea));
        this.setLayout(new BorderLayout());
        this.add(panel1, BorderLayout.CENTER);
        this.pack();
        this.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    }
    private void truncates(int numOfBytesToTruncate)
    {
        try
        {
            byte[] bytes = japanese.getText().getBytes("UTF-8");
            byte[] newBytes = new byte[numOfBytesToTruncate];
            System.arraycopy(bytes, 0, newBytes, 0, numOfBytesToTruncate);
            TestTruncateUTF8.this.putTextInsideJTextArea(bytes, newBytes);
        }
        catch (UnsupportedEncodingException e1)
        {
            e1.printStackTrace();
        }
    }
    private void putTextInsideJTextArea(byte[] original, byte[] truncated)
    {
        try
        {
            textArea.append("\nOriginal String: " + new String(original, "UTF-8"));
            textArea.append("\nTruncated String: " + new String(truncated, "UTF-8"));
            textArea.append("\n*****************************\n");
        }
        catch (UnsupportedEncodingException e)
        {
            e.printStackTrace();
        }
    }
}

Since the byte array is in UTF-8, you can easily examine whether it is corrupt or not by taking a look at the last 4 bytes (at most). That is because the bit distribution of each byte (1st, 2nd, 3rd, and 4th) in UTF-8 encoding is well defined in its spec.
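BTW, a Japanese Hiragana/Kanji character typically has 3 bytes in UTF-8. Note that the text being truncated in the sample also includes the 10-byte ASCII prefix "Japanese: ", so a 16-byte cut falls exactly after the second Japanese character (10 + 2 x 3 bytes), while a 17-byte cut splits the third one.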
HTH,
Naoto
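The check described in the answer can be sketched as follows, assuming the input is valid UTF-8: continuation bytes always have the bit pattern 10xxxxxx, so if the first byte that would be cut off is a continuation byte, the cut is inside a character and the end is moved back to that character's lead byte. The names Utf8Truncate, truncate and truncateToString are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Truncate {
    // Returns at most maxBytes bytes of utf8 without splitting a multi-byte character.
    static byte[] truncate(byte[] utf8, int maxBytes) {
        if (maxBytes >= utf8.length) {
            return utf8.clone();
        }
        int end = Math.max(maxBytes, 0);
        // utf8[end] is the first byte that will be dropped. While it is a
        // continuation byte (10xxxxxx), the cut is mid-character, so back up.
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
            end--;
        }
        return Arrays.copyOf(utf8, end);
    }

    // Convenience wrapper for strings.
    static String truncateToString(String s, int maxBytes) {
        byte[] cut = truncate(s.getBytes(StandardCharsets.UTF_8), maxBytes);
        return new String(cut, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String label = "Japanese: \u65e5\u672c\u3041\u3086\u308c\u306e";
        System.out.println(truncateToString(label, 17).length() + " chars kept");
    }
}
```

With the label text from the sample program, limits of 16 and 17 bytes both yield the prefix plus the first two Japanese characters.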

Trying to get UTF-8 data in and out of an oracle.xdb.XMLType

I need to be able to put unicode text into an oracle.xdb.XMLType and
then get it back out. I'm so close but it's still not quite working.
Here's what I'm doing...
// create a string with one unicode character (horizontalelipsis)
String xmlString = new String(
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<utf8>\n" +
" <he val=\"8230\">\u2026</he>\n" +
"</utf8>\n");
// this is an oci8 connection
Connection conn = getconnection();
// this works with no exceptions
XMLType xmlType = XMLType.createXML(conn, xmlString);
// this is the problem here - BLOB b does not contain all the bytes
// from xmlType. It seems to be short 2 bytes.
BLOB b = xmlType.getBlobVal(871);
String xmlTypeString = new String(b.getBytes(1L, (int) b.length()), "UTF-8");
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.print(xmlTypeString);
out.close();
What I get from this is this...
<?xml version="1.0" encoding="UTF-8"?>
<utf8>
<he val="8230">[utf-8 bytes]</he>
</utf8
In the above, [utf-8 bytes] represents the correctly encoded UTF-8 bytes that
were returned. But the output is missing the final closing bracket and the
newline at the end. It seems that no matter what second argument I give b.getBytes(),
it always returns the above. Even
It seems that this code...
BLOB b = xmlType.getBlobVal(871);
always returns a BLOB that contains a few bytes short of what it should contain.
What am I doing wrong? I'm sure I'm missing something here.
Thanks much for your help.
Here's info about the environment I'm working in.
============================ SYSTEM INFORMATION ============================
SQL*Plus: Release 11.1.0.7.0 - Production on Fri May 15 11:54:34 2009
select * from nls_database_parameters
where parameter='NLS_CHARACTERSET'
returns...
WE8ISO8859P1
select * from nls_database_parameters
where parameter='NLS_NCHAR_CHARACTERSET'
returns...
AL16UTF16
The operating system I'm working with is...
SunOS hostname 5.10 Generic_120011-14 sun4u sparc SUNW,Netra-T12

WE8ISO8859P1 does not support the ellipsis character. It is a WE8MSWIN1252 character. I wonder if your problem may have something to do with internal conversion to/from an XML character reference (&#x2026;). Unfortunately, I have no time to test. Please try to use a simple loop and System.out.print to print all bytes of the return value of b.getBytes(1L, (int) b.length()). Also, check the value of b.length().
-- Sergiusz
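The byte-printing loop suggested above might look like the sketch below, together with a quick check that ISO-8859-1 really cannot encode the ellipsis. It works on a plain byte array; in the original code the array would come from b.getBytes(1L, (int) b.length()). The names ByteDump, hex and canEncode are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteDump {
    // Formats every byte as two uppercase hex digits, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the character has a representation in the given charset.
    static boolean canEncode(Charset cs, char c) {
        return cs.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(hex("\u2026".getBytes(StandardCharsets.UTF_8)));
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, '\u2026'));
    }
}
```

Comparing the dump length with b.length() would show whether bytes are lost in the BLOB itself or only when it is read back.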

How to re encode UTF-16 to russian or something else

Hi,
My problem is the following.
I have UTF-16 bytes (from HTTP, split with byte[] bytessplitted = abvaluea.getBytes( "UTF-16" );).
I am sure that it is UTF-16 because the resulting length for the example "string" is 14.
So when I wanted to display this in a JSP, I tried the following:
<%= "UTF16 to German:"+new String(bytessplitted , "ISO8859_1")+"<br>"%>
result is:
��string
Why is this?
Then I tried Russian:
I entered: ячсм
Convert to bytes: from HTTP (UTF-16) splitted with byte[] bytessplitted = abvaluea.getBytes( "UTF-16" );
ReEncode: <%= "UTF16 to Russian:"+new String(bytess, "Cp1251")+"<br>"%>
result on the new jsp:
юяСЏС�СЃРј
Is my approach wrong?
Please help me!
Gernot

new String(bytessplitted , "ISO8859_1")
This bit of code says "Assume that bytessplitted was encoded using ISO8859_1, and convert it to a String." Of course bytessplitted wasn't encoded that way, so it produces garbage. Likewise for your other conversion.
What JSP does with a string, when you output it via <%= %>, is to convert it to bytes using the encoding defined for the page it is producing. This means that all of that mucking about with Strings and byte arrays, besides being wrong and harmful to your data, is also pointless. All you need to do is make sure your page is encoded in UTF-8, then write your String data to the response object.
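A short sketch of what happens to those bytes, using a made-up class name (Utf16RoundTrip). Encoding with "UTF-16" in Java writes a two-byte byte order mark (FE FF) followed by two bytes per character, which is where the 14 bytes for "string" come from; decoding them with a single-byte charset turns every byte into its own character.

```java
import java.nio.charset.StandardCharsets;

class Utf16RoundTrip {
    // Encodes to UTF-16 (with BOM), as getBytes("UTF-16") does.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // Decodes with the charset the bytes were really written in.
    static String decodeCorrectly(byte[] utf16) {
        return new String(utf16, StandardCharsets.UTF_16);
    }

    // Decodes with the wrong charset: one character per byte, BOM and zero bytes included.
    static String decodeAsLatin1(byte[] utf16) {
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("string");
        System.out.println(bytes.length);
        System.out.println(decodeCorrectly(bytes));
        System.out.println(decodeAsLatin1(bytes).length());
    }
}
```

The two leading characters in the garbled output are the BOM bytes FE and FF interpreted as ISO-8859-1.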

Reverting from UTF-8 to ISO-8859-1

Hi,
I have a database installed in UTF-8. It's a new installation, and the guides I had didn't mention any restrictions on character set for the teams that were migrating.
The problem is that some teams are moving some of their projects to the new server and can't insert, for example, the word "não" into a VARCHAR2(3).
My question is: can I change the whole database to ISO-8859-1 instead of UTF-8 so that words like "não" are inserted correctly? If so, is it a simple ALTER DATABASE or a more complicated operation?
Another question: is there any possibility of leaving the database as is and making it work without expanding the field size restriction?
Alx

You can't change a database character set from UTF8 to ISO-8859-1. You can only move from one character set to a strict superset, which doesn't apply here. The supported way to change the character set here would be to create a new database with the ISO-8859-1 character set, export the existing data, and import it into the new system. That assumes, of course, that all the existing characters have an ISO-8859-1 representation (characters like the Euro symbol or Microsoft's curly quotes do not).
By default, a VARCHAR2(3) allocates 3 bytes of space for data. That gets complicated when you use a multi-byte character set like UTF-8, where a character like 'ã' requires 2 bytes of storage. You can define the columns as VARCHAR2(3 CHAR) to allocate 3 characters of storage regardless of the character set. You can also set the parameter NLS_LENGTH_SEMANTICS to CHAR so that tables you create default to character rather than byte length semantics. Personally, if I'm creating a UTF8 database, I'd want to set NLS_LENGTH_SEMANTICS to CHAR.
Justin
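The difference between byte and character semantics for "não" can be checked on the client side with a few lines of Java. This is only an illustration of the arithmetic, not of how Oracle enforces the limit; ColumnFit, fitsByteLimit and fitsCharLimit are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ColumnFit {
    // Byte semantics: the encoded value must fit into maxBytes.
    static boolean fitsByteLimit(String value, int maxBytes, Charset dbCharset) {
        return value.getBytes(dbCharset).length <= maxBytes;
    }

    // Character semantics: only the number of characters counts.
    static boolean fitsCharLimit(String value, int maxChars) {
        return value.codePointCount(0, value.length()) <= maxChars;
    }

    public static void main(String[] args) {
        String word = "n\u00e3o";   // "não"
        System.out.println(fitsByteLimit(word, 3, StandardCharsets.UTF_8));
        System.out.println(fitsByteLimit(word, 3, StandardCharsets.ISO_8859_1));
        System.out.println(fitsCharLimit(word, 3));
    }
}
```

In UTF-8 the word needs 4 bytes, so it only fits a 3-unit column under character semantics or in a single-byte character set.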

UTF-8 reading and converting to proper characters.

Hi,
How do I read a UTF-8 encoded document (an XML document, to be precise) and display it as human-readable characters?
I have a URL stream, and when I read it, I am reading bytes.
Those bytes make up UTF-8 variable-length encoded characters.
I would like to create a string containing readable, printable, searchable, manipulable characters (Unicode).
How is this achieved?
At the moment, I am building a string where every byte of the UTF-8 stream is a Unicode character, which gives me a big old load of crap.
cheers

My mistake:
When opening a URL object and reading the stream in Java, the character conversion happens automatically, so everything is fine.
My problem was that I was reading a URL whose content is encoded in GZIP format.
So now I have to work out how to read a URL that is GZIPPED,
i.e. the website has been written in XML, encoded to UTF-8, and then GZIPPED before being placed on the server.
This is the URL :
http://feeds.wsjonline.com/wsj/podcast_wall_street_journal_weekend_edition
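A minimal sketch of reading such a stream, assuming the bytes really are a raw gzip stream containing UTF-8 text: wrap the input in a GZIPInputStream and let an InputStreamReader do the UTF-8 decoding. The class GzipUtf8 and its methods are made up for the example; the gzip helper exists only so the sketch can be tried without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipUtf8 {
    // Decompresses the stream and decodes the result as UTF-8 text.
    static String read(InputStream gzipped) {
        try (Reader r = new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: compresses UTF-8 text the way the server side presumably did.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] packed = gzip("<title>caf\u00e9</title>");
        System.out.println(read(new ByteArrayInputStream(packed)));
    }
}
```

For a real feed, the stream returned by URL.openStream() would be passed to read in place of the ByteArrayInputStream.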
