UTF-8 incomplete byte sequence
Hi,
I have the following situation: I am reading bytes from a socket. These bytes can contain UTF-8 characters. Then I convert the bytes to a UTF-8 string. This all goes fine. The problem is when the byte sequence I have read ends with an incomplete UTF-8 byte sequence (because the rest will be read on the next read from the socket). But I want to handle the rest of the bytes before reading the next chunk. What is the best way to do this?
Kind regards,
Marco Laponder
Hi,
> I have the following situation: I am reading bytes from a socket. These bytes can contain UTF-8 characters. Then I convert the bytes to a UTF-8 string. This all goes fine.
I'm not so sure about that. If you are talking about Java, there is no such thing as a UTF-8 String; internally a String is always UTF-16.
> The problem is when the byte sequence I have read ends with an incomplete UTF-8 byte sequence (because the rest will be read on the next read from the socket). But I want to handle the rest of the bytes before reading the next chunk. What is the best way to do this?
You could write all bytes into a ByteArrayOutputStream first, before processing them.
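A sketch of how the leftover-bytes approach can be implemented (my own example, not from the thread): java.nio.charset.CharsetDecoder stops at an incomplete trailing sequence when told more input may follow, so the undecoded tail bytes can be carried over and prepended to the next chunk.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class Utf8ChunkDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private ByteBuffer leftover = ByteBuffer.allocate(0);

    /** Decodes one chunk from the socket, holding back an incomplete trailing sequence. */
    public String decodeChunk(byte[] chunk) {
        ByteBuffer in = ByteBuffer.allocate(leftover.remaining() + chunk.length);
        in.put(leftover).put(chunk);
        in.flip();
        CharBuffer out = CharBuffer.allocate(in.remaining()); // 1 byte -> at most 1 char
        decoder.decode(in, out, false);     // false = more input may follow
        leftover = ByteBuffer.allocate(in.remaining());
        leftover.put(in);                   // keep the unconsumed tail bytes
        leftover.flip();
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        Utf8ChunkDecoder d = new Utf8ChunkDecoder();
        // 'a' plus the first byte of the two-byte sequence for 'é' (C3 A9)
        System.out.println(d.decodeChunk(new byte[] { 'a', (byte) 0xC3 })); // prints "a"
        System.out.println(d.decodeChunk(new byte[] { (byte) 0xA9 }));      // prints "é"
    }
}
```

This sketch assumes well-formed input apart from chunk boundaries; a production version would also check the CoderResult for malformed sequences.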
Similar Messages
-
VIM Plugin VJDE, Ruby Error: invalid byte sequence in UTF-8
Hello
I'm trying to install the vim VJDE Plugin for java syntax highlighting.
wget tarball
tar xvzf tarball
makepkg -s
pacman -U ...
No Problems here.
When I run vim foo.java it shows me this message:
Error detected while processing /usr/share/vim/vim73/plugin/vjde/vjde_template.vim:
line 18:
ArgumentError: invalid byte sequence in UTF-8
Code on line 18:
ruby Vjde::VjdeTemplateManager.load_all(VIM::evaluate('g:vjde_install_path'))
So... I'm no Ruby programmer, but I don't see any non-UTF-8 character in it.
When i comment it out, the error does not show.
Couldn't google anything about it. Maybe it's just a bug in the current version of Ruby.
Would be nice if anyone can help me.
Regards, Archdove
Last edited by Archdove (2011-09-23 18:21:35)
Hi,
It's an encoding problem. I wrote to the author about this. He uses a UTF-8 locale, but some files have an unrecognized encoding, so enconv can't convert them to UTF-8.
$ find -type f -a ! -name readtags -a ! -name '*.class' -a ! -name '*.jar' | xargs enconv
enconv: Cannot convert `./src/previewwindow.cpp' from unknown encoding
enconv: Cannot convert `./src/wspawn.cpp' from unknown encoding
enconv: Cannot convert `./src/tipswnd.lex' from unknown encoding
enconv: Cannot convert `./src/vjde/completion/ClassInfo.java' from unknown encoding
enconv: Cannot convert `./src/vjde/completion/Completion.java' from unknown encoding
enconv: Cannot convert `./src/tipswnd.c' from unknown encoding
enconv: Cannot convert `./plugin/vjde/vjde_java_completion.vim' from unknown encoding
enconv: Cannot convert `./plugin/vjde/project.vim' from unknown encoding
enconv: Cannot convert `./plugin/vjde/vjde_tag_loader.vim' from unknown encoding
enconv: Cannot convert `./plugin/vjde/tlds/java.vjde' from unknown encoding
I'm looking into how to convert them to UTF-8. Try opening a file, e.g. src/previewwindow.cpp, in vim with fencs=gbk,utf8,default; vim then detects fenc cp936. Line 644 contains Chinese characters: /* 另一个回调函数 */ ("another callback function").
Any idea? -
Convert UTF-8 (Unicode) Hex to Hex Byte Sequence while reading file
Hi all,
When Java reads a UTF-8 character, it does so in hex, e.g. \x12AB format. How can we read the UTF-8 character as a corresponding byte stream (e.g. \x0905 is hex for a Hindi character (Hindi being an Indic language) and its corresponding byte sequence is \xE0\x45\x96)?
Can the method to read a UTF-8 character's byte sequence be used to read any other (other than UTF-8, say some proprietary font) character set's byte sequence?
First, there's no such thing as a "UTF-8 character". UTF-8 is a character encoding that can be used to encode any character in the Unicode database.
If you want to read the raw bytes, use an InputStream. If you want to read text that's encoded as UTF-8, wrap the InputStream in an InputStreamReader and specify UTF-8 as the encoding. If the text is in some other encoding, specify that instead of UTF-8 when you construct the InputStreamReader.

import java.io.*;

public class Test {
    // DEVANAGARI LETTER A (अ), U+0905, in UTF-8 encoding
    static final byte[] source = { (byte)0xE0, (byte)0xA4, (byte)0x85 };

    public static void main(String[] args) throws Exception {
        // print raw bytes
        InputStream is = new ByteArrayInputStream(source);
        int read = -1;
        while ((read = is.read()) != -1) {
            System.out.printf("0x%02X ", read);
        }
        System.out.println();
        is.reset();

        // print the character as a Unicode escape
        Reader r = new InputStreamReader(is, "UTF-8");
        while ((read = r.read()) != -1) {
            System.out.printf("\\u%04X ", read);
        }
        System.out.println();
        r.close();
    }
}

Does that answer your question? -
UTFDataFormatException, invalid byte 1 (�) of a 6-byte sequence
Hi
I am trying to parse an XML file using a SAX parser (Xerces-C). I am not able to fix this error:
UTFDataFormatException, invalid byte 1 (�) of a 6-byte sequence.
xml file:
<?xml version="1.0" ?>
<!DOCTYPE svc_result SYSTEM "MLP_SVC_RESULT_310.dtd" [
<!ENTITY % extension SYSTEM "PLS_PZSIR_100.DTD">
%extension;]>
........
Hi Siddiqui,
It looks like you are importing some characters that are not in the valid UTF-8 character set. Sometimes this type of error comes when you try to import characters like < and >, so use &lt; for < and &gt; for >.
Sorry, those characters were not displayed properly :-(
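The entity names in the reply above were mangled by the forum; as a sketch (my own example), substituting XML's predefined entities before embedding text in a document looks like this:

```java
public class XmlEscape {
    /** Replaces XML reserved characters with their predefined entities. */
    static String escape(String s) {
        return s.replace("&", "&amp;")   // must come first, or it re-escapes the others
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println(escape("if (a < b && c > d)")); // prints: if (a &lt; b &amp;&amp; c &gt; d)
    }
}
```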
Thanx
Shant
Edited by: Shant on Jan 18, 2008 6:19 AM
Edited by: Shant on Jan 18, 2008 6:23 AM -
I am reading in XML in my Java program using the Xerces library and am receiving an error "Invalid byte 2 of 2-byte UTF-8 sequence". It is caused because there is a two-byte sequence in the file "C2 3F" which is not a valid UTF-8 encoding. Is there any way to get the parser to ignore these invalid sequences? The XML files come from an external source so, aside from writing my own filtering routines to detect and fix errors like this, I cannot modify the content. Removing one or both of the bytes, or replacing them with valid characters, would work fine.
FYI, this is the solution:
private Reader prepareInputStreamReader(InputStream inputStream) {
    // Strip all invalid UTF-8 sequences from the input
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    decoder.onMalformedInput(CodingErrorAction.REPLACE);
    decoder.replaceWith("?");
    return new InputStreamReader(inputStream, decoder);
} -
SAPCAR error "Illegal byte sequence "
Hi there
I am trying to update the R3trans and tp programs and ran into this error while extracting the files using SAPCAR.
110> SAPCAR -xvf tp_193-20000233.sar
stderr initialized
processing archive tp_193-20000233.sar...
Illegal byte sequence tp_193-20000233.sar
119> SAPCAR -xvf R3trans_205-20000233.SAR
stderr initialized
processing archive R3trans_205-20000233.SAR...
Illegal byte sequence R3trans_205-20000233.SAR
I tried downloading the latest sapcar after referring to
Note 892842 - SAPCAR error: format error in header,
but had no luck.
I would appreciate any help on this.
Thanks
Vic
If you want to upgrade SAPCAR, replace the SAPCAR in the exe directory.
execute the following command
which SAPCAR
this will show you the path.
For Windows, renaming to .zip doesn't help. Download SAPCAR (the name is the same for all OSes) from
service.sap.com/patches --> Entry by Application Group --> Additional Components --> SAPCAR, then Windows.
If it's downloaded to the C drive:
c:\> sapcar.exe -xvf <*.SAR>  (where <*.SAR> is your tp/R3trans archive)
Thanks
Prince Jose -
Translation of UTF8 stream to sequence of ASCII characters
Hello,
I need advice on how to translate a UTF-8 binary stream of characters to ASCII characters. The translation will depend on the Locale (language) used.
For example, if the UTF-8 character Á (C3 81 in hex) is used in Czech, I will need to translate it to the two ASCII characters "Ae"; if the same Á character is used in French, I will need to translate it to the character "A". The binary stream will also contain some ASCII characters which will not need any translation.
Please, advise.
Thank you.
A Mickelson
The Java compiler and other Java tools can only process files which contain Latin-1 and/or Unicode-encoded (\udddd notation) characters. native2ascii converts files containing other character encodings into files containing Latin-1 and/or Unicode-encoded characters.
String command = "native2ascii -encoding UTF-8 sourceFileName targetFileName";
Process child = Runtime.getRuntime().exec(command); -
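For the accent-stripping part of the question above, java.text.Normalizer offers a locale-independent approach (a sketch of mine, not from the thread): it decomposes accented letters and strips the combining marks, yielding "A" for Á. Locale-specific mappings such as Á → "Ae" for Czech would still need a custom lookup table.

```java
import java.text.Normalizer;

public class Deaccent {
    /** Decomposes accented letters (NFD) and drops the combining marks. */
    static String toAscii(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("Árvíztűrő")); // prints "Arvizturo" (Hungarian sample text)
    }
}
```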
Converting Cp1252 right curly quotes into UTF-8
I was not sure where to post this.
I was having some problems trying to convert the CP1252 right curly quote into UTF-8. Other Cp1252 characters were converting correctly.
The right curly quote was read from the database as 3 bytes. The first 2 bytes were mapping to legitimate UTF8 tokens but the third byte was not.
According to this chart:
http://www.io.com/~jdawson/cp1252.html
The right curly quote is e2809d.
Byte1 Byte2 Byte3
e2 80 9d (hexadecimal)
/342 /200 /235 (octal)
The octals are in a big private static final Java string in sun.nio.cs.MS1252 except /235 is missing. So the process gets the first 2 bytes right but chokes on the third.
Hacking together my own encoder class with the /235 in the right place made it work.
I was wondering if this was a bug in sun.nio.cs.MS1252 ?
Thx!
Edited by: langal on Feb 2, 2010 10:37 PM
Ah, the old double-encoding ploy! It looks like the text was encoded as UTF-8, then the resulting byte array was decoded as windows-1252, then the result was encoded again as UTF-8.
// original string:
”
// encoded as UTF-8:
E2 80 9D
// decoded as windows-1252 (the third character
// is undisplayable, but valid):
”
// encoded as UTF-8:
C3 A2 E2 82 AC C2 9D
The conversion to windows-1252 followed Microsoft's practice instead of its specification, so the "9D" control character is stored safely in its UTF-8 encoding, "C2 9D". Unfortunately, Java follows the spec, so getting the character back out is not so easy (and that maybe-bug is relevant; my apologies). It treats '\u009D' as an invalid character and converts it to 0x3F, the encoding for the question mark.
I did some checking, and it seems U+201D (Right Double Quotation Mark: ”) is the only one of the added characters in the "80..9F" range that causes this problem. If the data is otherwise intact, you can work around this problem by looking for the relevant byte sequence in the intermediate processing stage and replacing it:

byte[] bb0 = { (byte)0xC3, (byte)0xA2, (byte)0xE2, (byte)0x82, (byte)0xAC, (byte)0xC2, (byte)0x9D };
// This is the string 'x' from your sample code.
String x = new String(bb0, "UTF-8");
System.out.println(x); // ”
byte[] bb1 = x.getBytes("windows-1252"); // E2 80 3F
for (int i = 0; i < bb1.length - 2; i++) {
    if ((bb1[i+2] & 0xFF) == 0x3F &&
        (bb1[i+1] & 0xFF) == 0x80 &&
        (bb1[i] & 0xFF) == 0xE2) {
        bb1[i+2] = (byte)0x9D;
    }
}
String s = new String(bb1, "UTF-8");
System.out.println(s); // ”

The byte sequence "E2 80 3F" is unlikely to occur naturally in windows-1252, and it's invalid in UTF-8. Of course, this would only be a temporary measure while you clean up your database, as @jtahlborn said. -
Problems converting byte[] to string
I use
byte[] encryptedBytes = cipher.doFinal(bytesToEncrypt);
String Ciphertext = new String(encryptedBytes);
and sometimes I get the correct answer and sometimes not. If you want, the code is in this post:
http://forum.java.sun.com/thread.jspa?threadID=790137&start=15&tstart=0
That's because the C language lacks true character and string data types. It only has arrays of bytes. Unfortunately, in C bytes are misleadingly called "chars."
It works if you put ASCII because byte sequences that correspond to ASCII characters can, on most systems, be safely converted to characters and strings. You see, conversions from bytes to characters are done using things called "encoding schemes" or just encodings. If you don't specify an encoding, the system's default is used. The default encoding can be pretty much anything so it's best not to assume much about it.
You can also use a fixed encoding, like this:
String Ciphertext = new String(encryptedBytes, "ISO-8859-1");
Just remember that when you convert the string back to bytes you have to use the same encoding again, that is:
byte[] cipherBytes = cipherString.getBytes("ISO-8859-1"); -
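When the bytes are arbitrary ciphertext, a more robust alternative (a sketch using java.util.Base64, available since Java 8) is to encode the bytes rather than reinterpret them as text:

```java
import java.util.Arrays;
import java.util.Base64;

public class CipherTextTransport {
    public static void main(String[] args) {
        // stand-in for cipher output; real ciphertext would come from Cipher.doFinal
        byte[] encrypted = { (byte)0x9C, 0x00, (byte)0xFF, 0x41, 0x7F };
        String wire = Base64.getEncoder().encodeToString(encrypted); // safe ASCII form
        byte[] back = Base64.getDecoder().decode(wire);
        // the round trip preserves the bytes exactly
        System.out.println(wire + " " + Arrays.equals(encrypted, back));
    }
}
```

Unlike the ISO-8859-1 trick, this works regardless of platform encoding and produces a printable string suitable for logs, URLs, or databases.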
ConvertToClob and byte order mark for UTF-8
We are converting a blob to a clob. The blob contains the UTF-8 byte representation (including the 3-byte byte order mark) of an XML document, and the clob is then passed as a parameter to xmlparser.parseClob. This works when the database character set is AL32UTF8, but on a database with character set WE8ISO8859P1 the clob contains a '¿' before the '<' (the blob is converted with source character set AL32UTF8).
I would assume that the ConvertToClob function would understand the byte order mark for UTF-8 in the blob and not include any parts of it in the clob. The byte order mark for UTF-8 consists of the byte sequence EF BB BF. The last byte BF corresponds to the upside down question mark '¿' in ISO-8859-1. Too me, it seems as if ConvertToClob is not converting correctly.
Am I missing something?
code snippets:
l_lang_context number := 1;
dbms_lob.createtemporary(l_file_clob, TRUE);
dbms_lob.convertToClob(l_file_clob, l_file_blob,l_file_size, l_dest_offset,
l_src_offset, l_blob_csid, l_lang_context, l_warning);
procedure fetch_xmldoc(p_xmlclob in out nocopy clob,
o_xmldoc out xmldom.DOMDocument) is
parser xmlparser.Parser;
begin
parser := xmlparser.newParser;
xmlparser.parseClob(p => parser, doc => p_xmlclob);
o_xmldoc := xmlparser.getDocument(parser);
xmlparser.freeParser(parser);
end;
The database version is 10.2.0.3 on Solaris 10 x86_64.
Eyðun
Edited by: Eyðun E. Jacobsen on Apr 24, 2009 8:58 PM
Can this be of some help? http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions027.htm#SQLRF00620
Regards
Etbin -
How to recognize invalid UTF-8 code points?
Not all possible byte sequences are valid UTF-8 characters.
(See http://en.wikipedia.org/wiki/Utf-8 )
For instance, the byte sequence 195 and 34 (decimal) is invalid.
Oracle seems to have no problem with it: it stores the sequence in the database, and if it is displayed in a client like SQL Developer, a hollow quad is shown.
But some applications have problems with such invalid code points.
For performance reasons I don't want to extract every code point and analyze every byte of it in PL/SQL.
Does anyone know a better solution (for Oracle 9.2) to recognize or remove such invalid code points?
Michael
We had all kinds of trouble like this with inconsistent character sets on Windows, Linux and Oracle. Any time you export, import, ftp a SQL script, etc, you can corrupt the data; what's a valid code point in one charset may not be in another.
The hollow quad you see (depends on your charset; I have seen upside down '?') tells you that there's no mapping from the bytes in the database to the display on your client - assuming a character set conversion is happening. I think you should be able to make use of explicit character set conversion to identify your problem text:
- convert from a narrower set (eg UTF-8) to a wider set (UTF-16) using CONVERT()
- then you should get this behaviour (see http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14225/ch7progrunicode.htm#sthref814)
"The NLS_NCHAR_CONV_EXCP initialization parameter controls the behavior of data loss during character type conversion. When this parameter is set to TRUE, any SQL statements that result in data loss return an ORA-12713 error and the corresponding operation is stopped." ... "In PL/SQL, when data loss occurs during conversion of SQL CHAR and NCHAR datatypes, the LOSSY_CHARSET_CONVERSION exception is raised for both implicit and explicit conversion"
Handle the exception and you can log where you have problems.
For my money, I would strongly recommend using a single-byte character set unless you absolutely must support non-Latin characters. We didn't think the whole thing through, and the use of multi-byte characters broke lots of code.
Have fun!
Regards Nigel -
Simple string / bytes problem
Hi
I'm basically trying to simulate DES in java from scratch (so no use of existing API's or anything). I've got a basic idea of how DES works (use of a feistel cipher).
So to start off I'm trying to get the bytes of a string with this simple program:
public class Test {
    public static void main(String[] args) {
        String testString = "hello";
        System.out.println(testString.getBytes());
    }
}
The output consists of some characters (namely numbers, letters and one special character).
But when I change the variable testString to some other String, compile and run, I still get the same output even though the String is different. Also, why are there 9 characters for a String that's only 5 characters long?
Thanks
When you use System.out.println( something ) it uses the toString() method of the object referenced by 'something'. When the something is an array, such as an array of bytes, the toString() method returns a pseudo-reference to the array which in no way reflects the content of the array.
If you just want to see the content of the array then use Arrays.toString(something) to generate a String representation of the content of the array.
Also, the use of String.getBytes() will convert the String to bytes using your default character encoding. This is rarely what you want, since the result may depend on which platform you are working on. In my view, the safest and best approach is to explicitly define the encoding, and since it will encode ALL characters one can throw at it, I always use UTF-8. So, to convert a String to bytes I would use

byte[] bytesOfString = "your string".getBytes("utf-8");

and to convert them back to a String I would use

String yourString = new String(bytesOfString, "utf-8");

One final point: do not assume that an apparently random array of bytes, such as one gets from DES encryption, can be converted to a String using the above. In general it can't, because not all bytes and byte sequences are valid for a particular character encoding. If you absolutely have to have a String representation, then you should encode (not encrypt) the bytes using something like Base64 or Hex encoding.
Edited by: sabre150 on Jan 27, 2008 3:04 PM -
CONVERT 4-BYTE WCHAR_T IN SOLARIS COMPILER TO UNSIGNED SHORT
All,
Do you guys know any code or function that could convert a 4-byte wchar_t string literal (Solaris defines 4 bytes for wchar_t) to an array of unsigned short?
Unlike MSFT Windows VC++/C#/VB, where you get UCS-2/UTF-16 when your program has the UNICODE macro defined at the start of your source, POSIX defines wchar_t as an opaque data type.
Solaris is a POSIX system, and thus we treat wchar_t as an opaque data type that can differ from one locale to another.
(We didn't have Unicode, or at least it wasn't that popular, when the wchar_t data type and interfaces were first created and implemented on various Unix systems, even though Sun was a founding member company of the Unicode consortium.)
We do, though, guarantee that wchar_t will be in UTF-32 for all Solaris Unicode/UTF-8 locales.
Hence, for proper conversions between any codeset/codepage, or the codeset of the current locale, i.e., nl_langinfo(CODESET), and UTF-32 or UCS-4, please use iconv(3C), iconv(1), or sdtconvtool(1) and similar conversion interfaces. By the way, UCS-4 is being obsoleted, so use of UTF-32 is recommended.
The iconv code conversions between UTF-32 and any other codesets in Solaris will also screen out any illegal characters or byte sequences, as specified in the latest Unicode Standard and the UTF-8 Corrigendum of Unicode 3.1 and 3.2.
SSLEngine usage issue: handshake done, but unwrap returns 8 bytes from 32
Hello,
I'm trying to use the SSLEngine API in order to secure a Java/NIO based third-party middleware platform. So far I've succeeded in writing correct handshake code, and on both server and client the handshake completes successfully; the problem starts when the client sends the server its message. It is 32 bytes long, encrypted on the client side to 37 bytes. On the server side I see (while using -Djavax.net.debug=all) that during the SSLEngine.unwrap call the bytes are correctly decrypted into the exact byte sequence as on the client side. However, the problem is that from SSLEngine.unwrap I get just 8 bytes instead of those 32. The debug messages look like this:
[Raw read (bb)]: length = 37
0000: 17 03 01 00 20 56 68 6E 1F 56 C1 41 6E CD C0 A4 .... Vhn.V.An...
0010: F0 84 6A E9 C8 4F A9 AE 29 AF 87 D9 EA 61 09 15 ..j..O..)....a..
0020: EC 08 28 1E 60 ..(.`
Padded plaintext after DECRYPTION: len = 32
0000: 02 00 01 00 03 63 64 6F E3 06 78 5A D6 9C D4 D0 .....cdo..xZ....
0010: 68 B1 F7 B1 7F 34 88 AC 36 1C 7A 72 03 03 03 03 h....4..6.zr....
*** SERVER: peerNetBuf after unwrap: java.nio.HeapByteBuffer[pos=37 lim=37 cap=16665]
*** SERVER: res: Status = OK HandshakeStatus = NOT_HANDSHAKING
bytesConsumed = 37 bytesProduced = 8
Messages starting with *** SERVER are from my application; everything else is emitted by Java while using -Djavax.net.debug=all. As you can see, all data from peerNetBuf (a buffer holding the encrypted data received from the network) are consumed, so I don't need to call unwrap a second time as sometimes happens during the handshake phase...
Now, my question is: how to convince SSLEngine.unwrap to correctly return me all those 32 bytes which I expect?
Thanks,
Karel
You will have to post some code. The SSLEngine tells you what it wants you to do next; all you really have to do is keep doing that: wrap, unwrap when needed, write on an overflow and read on an underflow. Was there enough room in your app buffer for all 32 bytes? And what was the final result status?
-
Hello,
I don't understand how to convert a UUID object into a byte sequence (of the same format as UUID.nameUUIDFromBytes(byte[]) takes as a parameter).
I've coded a little method, but it doesn't give me a correct byte array; I've tested the output with UUID.nameUUIDFromBytes(byte[]) and it doesn't return the original key.
protected byte[] encode(UUID key){
ByteBuffer keyBuffer = ByteBuffer.allocate(16);
keyBuffer.order(ByteOrder.BIG_ENDIAN);
keyBuffer.putLong(key.getMostSignificantBits());
keyBuffer.putLong(key.getLeastSignificantBits());
return keyBuffer.array();
}
Example:
public static void main(String args[]){
UUID before = UUID.fromString("a53e607c-e9c7-475f-a938-9ae6331d85b1");
byte b_uuid[] = encode(before);
UUID after = UUID.nameUUIDFromBytes(b_uuid);
System.out.println(before.toString()+" - > "+after.toString());
}
Output: a53e607c-e9c7-475f-a938-9ae6331d85b1 -> 625c894f-dfa5-3b8e-b577-458d7d676fe0
Any idea?
Writing:
long hibits = id.getMostSignificantBits();
long lobits = id.getLeastSignificantBits();
// A. then just write the two longs...
// B. or if you really want byte[]s...
byte[] hibytes = BigInteger.valueOf(hibits).toByteArray();
byte[] lobytes = BigInteger.valueOf(lobits).toByteArray();
Reading:
// (B.) decode the byte[] to longs
long hibits = new BigInteger(hibytes).longValue();
long lobits = new BigInteger(lobytes).longValue();
// (A.) read two longs from your stream
id = new UUID(hibits, lobits);
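A note on the exchange above: UUID.nameUUIDFromBytes cannot recover the original UUID, because it MD5-hashes its input to build a version-3 name-based UUID; it is not the inverse of any byte encoding. The ByteBuffer layout itself round-trips fine through the UUID(long, long) constructor (a sketch):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytes {
    /** 16-byte big-endian encoding: high 8 bytes, then low 8 bytes. */
    static byte[] encode(UUID id) {
        ByteBuffer buf = ByteBuffer.allocate(16); // big-endian by default
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        return buf.array();
    }

    static UUID decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong()); // args evaluated left to right
    }

    public static void main(String[] args) {
        UUID before = UUID.fromString("a53e607c-e9c7-475f-a938-9ae6331d85b1");
        UUID after = decode(encode(before));
        System.out.println(before + " -> " + after); // both sides are the same UUID
    }
}
```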