Scanning files for non-Unicode characters.

Question: I have a web application that allows users to take data, enter it into a webapp, and generate an XML file on the server's filesystem containing the entered data. The code for this application cannot be altered (outside vendor). I have a second webapp, written by yours truly, that has to parse through these XML files to build a dataset used elsewhere.
Unfortunately I'm having a serious problem. Many of the web application's users are apparently cutting and pasting their information from other sources (frequently MS Word) and in the process are embedding non-Unicode characters in the XML files. When my application attempts to open these files (using DocumentBuilder), I get a SAXParseException: "Document root element is missing".
I'm sure others have run into this sort of thing, so I'm trying to figure out the best way to tackle the problem. Obviously I'm going to have to start pre-scanning the files for invalid characters, but finding an efficient method for doing so has proven to be a challenge. I can load the file into a String array and search it character by character, but that is both extremely slow (we're talking thousands of LONG XML files) and would require that I predefine the invalid characters (so anything new would slip through).
I'm hoping there's a faster, easier way to do this that I'm just not familiar with or have found elsewhere.

"...would require that I predefine the invalid characters"
This isn't hard to do, and it isn't subject to change. The XML recommendation tells you exactly which characters are valid in XML documents.
However, if your problems extend to the sort of case where users paste code including the "&" character into a text node without escaping it properly, or drop in MS Word "smart quotes" in the wrong encoding, then I think you'll just have to face up to the fact that allowing naive users to generate uncontrolled wannabe-XML documents is not really a viable idea.
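For the character-scanning part, the check can live in a stream filter that validates each character against the XML 1.0 Char production as it is read, with no blacklist to maintain. Below is a minimal sketch (the class name is mine; note that supplementary characters encoded as surrogate pairs are simply replaced rather than validated pairwise); wrap the file's Reader in it before handing it to DocumentBuilder:

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Replaces characters outside the XML 1.0 Char production
// (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]) with spaces
// before the stream reaches the parser.
public class XmlCharFilterReader extends FilterReader
{
    public XmlCharFilterReader(Reader in)
    {
        super(in);
    }

    private static boolean isXmlChar(char c)
    {
        // Surrogate halves (0xD800-0xDFFF) are rejected here, so valid
        // supplementary characters [#x10000-#x10FFFF] are dropped too; a
        // full implementation would validate surrogate pairs.
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException
    {
        int c = super.read();
        return (c == -1 || isXmlChar((char) c)) ? c : ' ';
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException
    {
        int n = super.read(cbuf, off, len);
        for (int i = off; i < off + n; ++i)
        {
            if (!isXmlChar(cbuf[i]))
            {
                cbuf[i] = ' ';
            }
        }
        return n;
    }
}

Note this only handles characters that are outright illegal in XML; it does nothing for an unescaped "&" or for bytes that are invalid in the declared encoding, which would have to be caught at the byte-decoding layer.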

Similar Messages

  • Function for non-Unicode characters

    Hi
    Is there a function that permits translating a Unicode character into a non-Unicode character?
    For example, with this function "à" must become "a".
    thank you for your help

    Copy and paste the code below and execute it. This could also solve your problem.
    DATA: BEGIN OF trans OCCURS 0,
            auml  TYPE x VALUE 'C4',  " 'Ä'
            c_8e  TYPE c VALUE 'A',
            gra   TYPE x VALUE 'E0',  " 'à'
            c_gra TYPE c VALUE 'a',
          END OF trans.
    DATA: input(40),
          output(40).

    input  = 'ÄBàp'.
    output = input.
    " trans is read as a flat sequence of (search, replacement) character pairs.
    TRANSLATE output USING trans.
    CONDENSE output NO-GAPS.
    WRITE: / input.
    WRITE: / output.
    Thanks,
    Senthil

  • java.io.File and non-Unicode characters in file names

    Unix filesystem object names are byte sequences. These byte sequences are not required to correspond to any character sequence in the current or any locale. How do I open a file if its name contains bytes that do not correspond to a valid Unicode encoding for the current locale? Unless I am missing something, if I do a list on a parent directory that has some file names like this, those file names do not get added to the list. Hmmm....
    R.

    OK, create.c is a program that will create a file whose name is not a valid character in the 'ja' locale.
    Lister.java defines a class that lists files in the current directory. For each file, it spits out the 'toString()' version of the file, the char array of the name as hex, and the 'getBytes' byte array of the name.
    So, what you can do is compile and run create.c, which will create a file whose name is a single byte with hex value 0x99. Then compile and run Lister.java, which will give you the following output (shown for two different locales):
    $ export LANG=
    $ java Lister
    name:?; chars:99,; bytes:99,
    $ export LANG=ja
    $ java Lister
    name:?; chars:fffd,; bytes:3f,
    Note that when running in the 'ja' locale, there is no character corresponding to byte value 0x99. So Java uses the replacement character 0xFFFD for the decoded char, and the '?' character 0x3F when converting back to bytes.
    The point is that there are files whose names Java cannot uniquely represent as a plain String. I suppose we could get the filename via JNI, do the conversion ourselves, and then use the private-use area of Unicode to encode all our strings, but ugh.
    // create.c
    #include <stdio.h>

    int main(void)
    {
        const char* name = "\x99";
        FILE* file = fopen(name, "w");
        if (file == NULL)
        {
            printf("could not open file %s\n", name);
            return 1;
        }
        fclose(file);
        return 0;
    }
    // Lister.java
    import java.io.*;

    public class Lister
    {
        public static void main(String[] args)
        {
            new Lister().run();
        }

        public void run()
        {
            try
            {
                doRun();
            }
            catch (Exception e)
            {
                System.out.println("Encountered exception: " + e);
            }
        }

        private void doRun() throws Exception
        {
            File cwd = new File(".");
            String[] children = cwd.list();
            for (int i = 0; i < children.length; ++i)
                printName(children[i]);
        }

        private void printName(String s)
        {
            System.out.print("name:");
            System.out.print(s);
            System.out.print("; chars:");
            printCharsAsHex(s);
            System.out.print("; bytes:");
            printBytesAsHex(s);
            System.out.println();
        }

        private void printCharsAsHex(String s)
        {
            for (int i = 0; i < s.length(); ++i)
                System.out.print(Integer.toHexString(s.charAt(i)) + ",");
        }

        private void printBytesAsHex(String s)
        {
            byte[] bytes = s.getBytes();
            for (int i = 0; i < bytes.length; ++i)
                System.out.print(Integer.toHexString(bytes[i] & 0xFF) + ",");
        }
    }
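    To see the decoding step in isolation, here is a small sketch (the class name and the two charsets are mine, chosen for illustration): the same byte decodes to different characters depending on the locale's charset, and a byte with no mapping decodes to the replacement character U+FFFD.

    // DecodeDemo.java
    import java.nio.charset.Charset;

    public class DecodeDemo
    {
        public static void main(String[] args)
        {
            byte[] name = { (byte) 0x99 };
            // ISO-8859-1 maps every byte, so 0x99 decodes to U+0099.
            System.out.println(new String(name, Charset.forName("ISO-8859-1")).charAt(0) == '\u0099');
            // In Shift_JIS a lone 0x99 is a malformed lead byte, so it decodes
            // to the replacement character U+FFFD, as in the Lister output above.
            System.out.println(new String(name, Charset.forName("Shift_JIS")).charAt(0) == '\ufffd');
        }
    }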

  • Target system install problem (SAPINST searching for non-Unicode kernel)

    Hi,
    I'm currently performing a Unicode conversion (from 4.7 to EHP4). I am finished with the upgrade and am now in the Unicode conversion part.
    I have completed the export and selected Unicode as the target system. I am done with the source system uninstallation and am now installing the Unicode target system. I selected Unicode during parameter selection (it's the default option anyway), but during the part where it asks for the kernel DVD, it doesn't accept the path of my Unicode kernel (and instead looks for the non-Unicode kernel).

    Hi Markus,
    Thanks for your response. This problem is now resolved. It turns out that there were some remaining files from the old /usr/sap/<> and /sapmnt/SID/exe that had not been deleted yet (after uninstalling the non-Unicode system). I deleted these files again and ensured that no files from the non-Unicode system remained.

  • FOI Servlet: non-Unicode characters cannot be processed

    Hello,
    I'm using the Oracle MapViewer 10.1.3.1 quickstart kit to test some map features.
    My database is in the CL8MSWIN1251 charset.
    I made a simple map application to display some data using the JavaScript API.
    When I define a theme-based FOI layer in the map and the predefined theme has some non-Unicode characters in the labeling or in hidden info fields, I get the following error:
    Cannot process the following response from FOI server:
    {"foiarray":[{"id":"AAARiqAAEAAAzFgAAA","name":"\u422\u414","gtype":"2001","imgurl":"http://localhost:8888/mapviewer/images/foi/p_16_13_MVDEMO_M.IMAGE131_BW.png","x":"50.0","y":"50.0","width":"16","height":"13","attrs":["987654321","100"]}],"attrnames":["BBB","Osn"]}
    As you can see, "\u422\u414" should be "\u0422\u0414"; otherwise JavaScript cannot display the characters correctly. I think FOIServlet is the problem here.
    Does anyone have the same problem or a solution for it, please?
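    The symptom looks like an escaper that fails to zero-pad its Unicode escapes: a JavaScript \uXXXX escape must always have four hex digits. A hypothetical helper (the class and method names are mine, not the actual FOIServlet code) that produces correct escapes:

    public class JsEscaper
    {
        // Escapes non-ASCII characters as four-digit JavaScript \uXXXX sequences.
        static String escapeForJs(String s)
        {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < s.length(); ++i)
            {
                char ch = s.charAt(i);
                if (ch < 0x80)
                    sb.append(ch);
                else
                    // %04x zero-pads: U+0422 becomes \u0422, never \u422.
                    sb.append(String.format("\\u%04x", (int) ch));
            }
            return sb.toString();
        }

        public static void main(String[] args)
        {
            System.out.println(escapeForJs("ТД")); // prints \u0422\u0414
        }
    }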


  • Problems related to changing the language for non-Unicode programs from one language into another

    Hi everyone!
    Product Name: HP Pavilion dv6-6093ex
    Product Number: LM610EA#A2N
    My Windows 7 (64-bit) Ultimate's base language and display language is English.
    Three languages (English, French, and Arabic) are built in (downloaded and installed by the person who made the Windows installation disc). Thus, while installing Windows 7, those three languages were listed for me to choose one as the base and display language, and I chose English. At the end of installation, all three pre-packaged languages (English, French, and Arabic) could be used as a display language.
    I would like someone to kindly confirm why I have been facing these problems when changing the language for non-Unicode programs from English to Arabic.
    First: after I installed the AMD High-Definition Graphics Driver (sp55092) 8.882.2.3000 on my laptop, the contents of the Intel Graphics and Media Control Panel are partially shown in Arabic while the language for non-Unicode programs is Arabic; however, they are completely shown in English while the language for non-Unicode programs is English.
    A: I found that the contents of the Intel Graphics and Media Control Panel are partially shown in Arabic (second screenshot below; however, when I click on any option, for example 'Graphics properties' shown in Arabic, the second window is shown in English) while the language for non-Unicode programs is Arabic, and it doesn't matter what the format or location is.
    B: When I changed the language for non-Unicode programs to English, I found that the contents of the Intel Graphics and Media Control Panel are completely shown in English.
    Second:
    A: All Arabic content is displayed as garbled characters while the language for non-Unicode programs is English.
    B: All Arabic content is displayed properly while the language for non-Unicode programs is Arabic.
    Third: there is an error extracting drivers and software downloaded from the official HP website while the language for non-Unicode programs is English.
    A: I noticed an error extracting all kinds of compressed files (drivers and software) downloaded from the HP website while the language for non-Unicode programs is English, whatever the location and format are.
    B: However, while the language for non-Unicode programs is Arabic, there is no error extracting files.
    In conclusion: is it normal for all of you that the contents of the Intel Graphics and Media Control Panel are partially shown in your native language while the language used for non-Unicode programs is your native language, but completely shown in English while the language for non-Unicode programs is English? If so, I would conclude that if I want the contents of documents written in my language, Arabic, to be shown properly, then I must set the language for non-Unicode programs to Arabic.
    Does this happen with you as well?
    Also, if I need to extract all kinds of compressed files (drivers and software) downloaded from the HP website, then the language for non-Unicode programs must be Arabic, whatever the location and format are.
    Does this happen with you as well?
    I would highly appreciate any clarification from you.
    A man should convert his anger and sadness into strength to continue living in this life.

    Hi,
    I am confused about your issue.
    When you change the language for non-Unicode programs to English, non-Unicode programs show English;
    while the language for non-Unicode programs is Arabic, they show Arabic.
    This is correct behavior. Why do you think it's an issue?
    About your second scenario description, I am not clear on what you said. Could you give us an explanation?
    Note: the system display language is not the non-Unicode program language.
    You can only choose one language to display the system in at a time.
    This article might be helpful to you:
    Install or change a display language
    http://windows.microsoft.com/en-in/windows7/install-or-change-a-display-language
    Change the system locale
    http://windows.microsoft.com/en-in/windows/change-system-locale#1TC=windows-7
    Karen Hu
    TechNet Community Support

  • Problems related to changing the language for non-Unicode programs

    Hi everyone!
    Product Name: HP Pavilion dv6-6093ex
    Product Number: LM610EA#A2N
    My Windows 7 (64-bit) Ultimate's base language and display language is English.
    Three languages (English, French, and Arabic) are built in (downloaded and installed by the person who made the Windows installation disc). Thus, while installing Windows 7, those three languages were listed for me to choose one as the base and display language, and I chose English. At the end of installation, all three pre-packaged languages (English, French, and Arabic) could be used as a display language.
    I would like someone to kindly confirm why I have been facing these problems when changing the language for non-Unicode programs from English to Arabic.
    First: after I installed the AMD High-Definition Graphics Driver (sp55092) 8.882.2.3000 on my laptop, the contents of the Intel Graphics and Media Control Panel are partially shown in Arabic while the language for non-Unicode programs is Arabic; however, they are completely shown in English while the language for non-Unicode programs is English.
    A: I found that the contents of the Intel Graphics and Media Control Panel are partially shown in Arabic (second screenshot below; however, when I click on any option, for example 'Graphics properties' shown in Arabic, the second window is shown in English) while the language for non-Unicode programs is Arabic, and it doesn't matter what the format or location is.
    B: When I changed the language for non-Unicode programs to English, I found that the contents of the Intel Graphics and Media Control Panel are completely shown in English.
    Second:
    A: All Arabic content is displayed as garbled characters while the language for non-Unicode programs is English.
    B: All Arabic content is displayed properly while the language for non-Unicode programs is Arabic.
    Third: there is an error extracting drivers and software downloaded from the official HP website while the language for non-Unicode programs is English.
    A: I noticed an error extracting all kinds of compressed files (drivers and software) downloaded from the HP website while the language for non-Unicode programs is English, whatever the location and format are.
    B: However, while the language for non-Unicode programs is Arabic, there is no error extracting files.
    In conclusion: is it normal for all of you that the contents of the Intel Graphics and Media Control Panel are partially shown in your native language while the language used for non-Unicode programs is your native language, but completely shown in English while the language for non-Unicode programs is English? If so, I would conclude that if I want the contents of documents written in my language, Arabic, to be shown properly, then I must set the language for non-Unicode programs to Arabic. Does this happen with you as well?
    Also, if I need to extract all kinds of compressed files (drivers and software) downloaded from the HP website, then the language for non-Unicode programs must be Arabic, whatever the location and format are. Does this happen with you as well?
    I would highly appreciate any clarification from you.

    Hi cooperator,
    I saw your post regarding the language questions and I will be happy to help. What you are experiencing with the languages is normal. The base operating system is in English, and while you can change the display language, the core kernel of the operating system remains in English.
    The reason that the Intel Graphics and Media Control Panel is partly in English and the rest in Arabic is that the driver was designed in English and that text is hard-coded into the driver, while the display language is set to Arabic. So when the language is set to English, everything will be in English.
    The reason you are having issues extracting drivers when English is set as the language is that the HP website determines where in the world you are, and the driver you download will be in the appropriate language for your country. So when it extracts, it looks for the proper extraction path using Arabic, but everything is in English. It works without problems when you are on Arabic because the driver can read the path properly.
    Please click "Accept as Solution" if you feel my post solved your issue.
    Click the "Kudos Thumbs Up" on the right to say "Thanks" for helping!
    Thank you,
    BHK6
    I work on behalf of HP

  • When I use Russian as the language for non-Unicode programs and insert another language

    When I use Russian as the language for non-Unicode programs and insert text in another language, it is transformed into Russian characters.
    I insert Thai text into MS Access and it is transformed into Russian characters. I have tried changing everything (CharSet, sun.jnu.encoding),
    but nothing has any effect. How can I insert Thai data into the database so that the database shows Thai data and not another script?
    Thank you for your help. My English is not good, sorry.

    That is a little-known issue.
    Check this post:
    http://myitforum.com/cs2/blogs/smslist/archive/2009/01/12/mssms-userlocale-in-mdt-sccm-also-changes-system-locale-9a532hdf.aspx
    He details how to modify the ZTIConfigure.xml file.

  • SIK transport files and non-Unicode SAP systems

    Dear all,
    I have a question about SIK Transport files.
    As you know, when we install the BOE SIK, we need to transport some files into the SAP system.
    There is a TXT file describing how to use the SIK transport files in the SAP system.
    I found that there is no detail about non-Unicode SAP systems in this TXT file;
    all of it is about Unicode.
    If your SAP system is running on a BASIS system earlier than 6.20, you must use the files listed below:
    (These files are ANSI.)
    Open SQL Connectivity transport (K900084.r22 and R900084.r22)
    Info Set Connectivity transport (K900085.r22 and R900085.r22)
    Row-level Security Definition transport (K900086.r22 and R900086.r22)
    Cluster Definition transport (K900093.r22 and R900093.r22)
    Authentication Helpers transport (K900088.r22 and R900088.r22)
    If your SAP system is running on a 6.20 BASIS system or later, you must use the files listed below:
    (These files are Unicode enabled.)
    Open SQL Connectivity transport (K900574.r21 and R900574.r21)
    Info Set Connectivity transport (K900575.r21 and R900575.r21)
    Row-level Security Definition transport (K900576.r21 and R900576.r21)
    Cluster Definition transport (K900585.r21 and R900585.r21)
    Authentication Helpers transport (K900578.r21 and R900578.r21)
    The following files must be used on an SAP BW system:
    (These files are Unicode enabled.)
    Content Administration transport (K900579.r21 and R900579.r21)
    Personalization transport (K900580.r21 and R900580.r21)
    MDX Query Connectivity transport (K900581.r21 and R900581.r21)
    ODS Connectivity transport (K900582.r21 and R900582.r21)
    If our SAP BASIS system is later than 6.20 but it is not a Unicode system,
    can we use these transport files on a non-Unicode SAP system?
    Thanks!
    Wayne

    Hi Wayne,
    the text and the installation guide clearly advise based on the version of your underlying BASIS system, and differentiate between 620 and 640 (or higher).
    So, based on the fact that your system is a BI 7 system, you are in the category of a 640 (or higher) BASIS system, and therefore you have to use the Unicode-enabled transports.
    Ingo

  • Can't set system locale (language for non-Unicode programs)

    I'm trying to deploy a custom image wherein the input language and location should be English (Australia), but the system locale (language for non-Unicode programs) should be English (US). The requirement is as below:
    Standards and Formats: English (Australia)
    Location: Australia
    Default input language: English (Australia) – US
    Installed input languages: English (Australia) – US
    Time zone: Cen. Australia Standard Time
    System Locale
    Language for non-Unicode programs: Default - English (United States)
    Below is my CS.ini
    SkipLocaleSelection=YES
    UserLocale=en-AU
    SystemLocale=en-US
    UIlanguage=en-AU
    KeyboardLocale=0c09:00000409
    SkipTimeZone=YES
    TimeZoneName=Cen. Australia Standard Time
    When the image is deployed, the language for non-Unicode programs is also getting set to en-AU while it should be en-US; the other language settings are as per the requirement. What should I do? :(
    Thanks a lot,
    Sanju.

    That is a little-known issue.
    Check this post:
    http://myitforum.com/cs2/blogs/smslist/archive/2009/01/12/mssms-userlocale-in-mdt-sccm-also-changes-system-locale-9a532hdf.aspx
    He details how to modify the ZTIConfigure.xml file.

  • setMnemonic for non-English characters

    Does anybody know how to set a JButton's mnemonic for non-English characters?
    My mnemonic is loaded from a resource bundle, and in the documentation setMnemonic(char) is limited to English; it is written that the user should call setMnemonic(int) instead.
    So what value should this int contain in order to represent the non-English char loaded from the resource bundle?
    Thanks in advance,
    Hanoch

    It seems that this is an issue that has popped up in various forums before, here's one example from last year:
    http://forum.java.sun.com/thread.jspa?forumID=16&threadID=490722
    This entry has some suggestions for handling mnemonics in resource bundles, and they would take care of translated mnemonics - as long as the translated values are restricted to the values contained in the VK_XXX keycodes.
    And since those values are basically the English (ASCII) character set + a bunch of function keys, it doesn't solve the original problem - how to specify mnemonics that are not part of the English character set. The more I look at this I don't really understand the reason for making setMnemonic (char mnemonic) obsolete and making setMnemonic (int mnemonic) the default. If anything this has made the method more difficult to use.
    I also don't understand the statement in the API about setMnemonic (char mnemonic):
    "This method is only designed to handle character values which fall between 'a' and 'z' or 'A' and 'Z'."
    If the type is "char", why would the character values be restricted to values between 'a' and 'z' or 'A' and 'Z'? I understand the need for the value to be restricted to one keystroke (eliminating the possibility of using ideographic characters), but why make it impossible to use all the Latin-1 and Latin-2 characters, for instance? (and is that in fact the case?) It is established practice on other platforms to be able to use accented characters as mnemonics, for instance.
    And if changes were made, why not enable the simple way of specifying a mnemonic that other platforms have implemented, by adding an '&' in front of the character?
    Sorry if this disintegrated into a rant - didn't mean to... :-) I'm sure there must be good reasons for the changes, would love to understand them.
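    For what it's worth, since Java 7 there is a direct route from a bundle character to the int that setMnemonic(int) expects. A minimal sketch (the bundle name and keys here are invented for illustration):

    import java.awt.event.KeyEvent;
    import java.util.ResourceBundle;
    import javax.swing.JButton;

    public class MnemonicDemo
    {
        public static void main(String[] args)
        {
            // Hypothetical bundle with keys "save.label" and "save.mnemonic",
            // the latter holding a single (possibly non-English) character.
            ResourceBundle bundle = ResourceBundle.getBundle("labels");
            JButton button = new JButton(bundle.getString("save.label"));
            char mnemonic = bundle.getString("save.mnemonic").charAt(0);
            // Maps an arbitrary character to the extended key code accepted
            // by setMnemonic(int), covering characters beyond 'A'-'Z'.
            button.setMnemonic(KeyEvent.getExtendedKeyCodeForChar(mnemonic));
        }
    }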

  • Problem converting a spool to a PDF file containing non-English characters

    Hi All,
    I have a problem converting a spool to PDF format.
    Scenario: I have a spool which has non-English characters. I am using the CONVERT_ABAPSPOOLJOB_2_PDF function module to perform the conversion, but my output has junk values (i.e. '#') for the non-English characters. Any pointers to solve this issue will be appreciated.
    I even tried report RSTXPDFT4; it also gives me the same junk characters.
    Regards,
    Navin.


  • PDF generation for non-English characters from ADF

    Hi
    We are using the piece of code below to generate a PDF from an ADF managed bean. It works fine; however, for non-English characters (e.g. Japanese, Vietnamese, Arabic) it puts '???' in the output.
    I found a few blogs:
    https://blogs.oracle.com/BIDeveloper/entry/non-english_characters_appears
    However, we are not using the BI Publisher product; we are using its APIs.
    Can anyone tell me where we need to set up fonts, within ADF, WebLogic, or the server?
    The input parameters are:
    a) XML data
    b) an InputStream, i.e. the RTF template
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import oracle.apps.xdo.XDOException;
    import oracle.apps.xdo.template.FOProcessor;
    import oracle.apps.xdo.template.RTFProcessor;

    public static byte[] genPdfRep(String pOutFileType, byte[] pXmlOut, InputStream pTemplate)
    {
        byte[] dataBytes = null;
        try
        {
            // Process the RTF template to convert it to XSL-FO format
            RTFProcessor rtfp = new RTFProcessor(pTemplate);
            ByteArrayOutputStream xslOutStream = new ByteArrayOutputStream();
            rtfp.setOutput(xslOutStream);
            rtfp.process();
            // Use the XSL template and the data from the VO to generate the
            // report, and return the report's bytes
            ByteArrayInputStream xslInStream = new ByteArrayInputStream(xslOutStream.toByteArray());
            FOProcessor processor = new FOProcessor();
            ByteArrayInputStream dataStream = new ByteArrayInputStream(pXmlOut);
            processor.setData(dataStream);
            processor.setTemplate(xslInStream);
            ByteArrayOutputStream pdfOutStream = new ByteArrayOutputStream();
            processor.setOutput(pdfOutStream);
            byte outFileTypeByte = FOProcessor.FORMAT_PDF;
            processor.setOutputFormat(outFileTypeByte); // or FOProcessor.FORMAT_HTML
            processor.generate();
            dataBytes = pdfOutStream.toByteArray();
        }
        catch (XDOException e)
        {
            e.printStackTrace();
        }
        return dataBytes;
    }
    Appreciate your help.
    Thanks,
    Abhijit

    Fonts are defined in the template you use to generate the PDF. Your application adds the data, and both are processed by the FO processor. Now there are two possible causes of the '???':
    1. the data you send to the template contains the '???' already;
    2. the template can't digest the data (the special characters) and puts '???' in the PDF.
    Before going on you have to find out which one is your problem. If it's the 2nd, you had better ask in a FOP forum, as you have to solve it by changing the template.
    Timo
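    If the template turns out to be the culprit, one common fix in BI Publisher's runtime is a font mapping in its configuration file, so the FO processor embeds a font that actually contains the required glyphs. A sketch of such a mapping (the font family and path are placeholders; check the BI Publisher documentation for where your deployment reads xdo.cfg):

    <config version="1.0.0" xmlns="http://xmlns.oracle.com/oxp/config/">
      <fonts>
        <!-- Map the family used in the RTF template to a physical TrueType
             font that contains the required (e.g. Japanese, Arabic) glyphs. -->
        <font family="Arial Unicode MS" style="normal" weight="normal">
          <truetype path="/usr/share/fonts/ARIALUNI.TTF"/>
        </font>
      </fonts>
    </config>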

  • Word replacements for non-English characters

    Hi
    Does anyone have an idea on implementing word replacements for non-English characters in TCA DQM 11i?
    We are trying to identify, capture, and cleanse common accented characters like à, â, ê.
    However, the default language for replacement is American English, so even if we add these to the existing lists they will not take effect.
    Is creating a new word replacement list for every language the solution? Any patch recommendations?
    Thanks in advance


  • Algorithm for converting Unicode characters to EBCDIC

    I would like to know if there is an algorithm for converting Unicode characters to EBCDIC.
    Awaiting your replies.
    Thanks in advance,
    Ravi

    "I would like to know if there is any algorithm for converting Unicode characters to EBCDIC."
    "Isn't EBCDIC a 7-bit code like ASCII? Unicode is 16-bit. This means there is no way Unicode can be mapped onto EBCDIC without loss of information."
    No. That is like saying that since UTF-8 is 8-bit based, it can't be mapped to UTF-16. But it can.
    EBCDIC either directly supports, or has versions which support, multibyte character sets. A multibyte character set can encode any fixed-size character set; the basic idea is the same way UTF-8 works.
    Multibyte character sets have the added benefit that most of the data in the world is from the ASCII character set, and the encodings always support that using only 8 bits. Thus the memory savings over UTF-16 (or UTF-32) are significant.
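    In practice (in Java, at least) there is no need to hand-roll the mapping: the charset machinery already implements Unicode-to-EBCDIC conversion. A minimal sketch, assuming the runtime ships an EBCDIC charset such as IBM1047 (Cp500 is another common one):

    import java.nio.charset.Charset;
    import java.util.Arrays;

    public class EbcdicDemo
    {
        public static void main(String[] args)
        {
            // IBM1047 is EBCDIC Latin-1; availability depends on the JRE.
            Charset ebcdic = Charset.forName("IBM1047");
            byte[] encoded = "Hello".getBytes(ebcdic);
            System.out.println(Arrays.toString(encoded));    // EBCDIC byte values
            System.out.println(new String(encoded, ebcdic)); // round-trips to "Hello"
        }
    }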
