Replacement of unmappable characters

Is there any support built into Java for replacing unmappable characters with fallback character choices?
I read text from a document encoded in UTF-8 using an InputStreamReader. After processing this text, I write the results back into a document encoded in ISO-8859-1 using an OutputStreamWriter. Some of my characters are not supported in ISO-8859-1. For example, \u0144 is Latin small letter n with acute and is used in Polish. By default, when my OutputStreamWriter encounters an unmappable character such as this it replaces the character with '?'.
I know how to check for the presence of unmappable characters by using the CharsetEncoder.canEncode(char) method, but what I really want to do is replace the unmappable character with the fallback character recommended on the Unicode Consortium code point pages ( http://www.unicode.org/charts/PDF/U0100.pdf ), in this case the Latin small letter n. Is this capability already available somewhere or do I have to construct my own character replacement mappings?
Regards,
Joe

Hi Dr. C.,
Thanks for the lead. That report, and various other reports referenced by it, give me a clear idea of what is required. It turns out (perhaps not surprisingly) that doing this correctly (substituting for unmappable characters) is a lot more involved than I had assumed.
I still haven't found any code that I can steal (errrh, study) but I will keep looking.
Regards,
Joe
P. S. I was doing Java programming a few years ago involving internationalization issues, but more recently I have been working on other projects. I got a chuckle out of the fact that I come back after about four years absence from this forum and find that you are still telling people that there is no such thing as a UTF-8 string! ;-) I admire your persistence.
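
For what it's worth, here is a minimal sketch of one way to approximate the fallback described in the question, assuming that "fallback" means the base letter obtained from Unicode canonical decomposition. The class name Latin1Fallback and the method toLatin1 are made up for illustration. It only uses java.text.Normalizer and CharsetEncoder.canEncode from the JDK (Java 8 or later), and it is not a full implementation of the Unicode chart recommendations: letters with no canonical decomposition (for example U+0142, l with stroke) still come out as '?'.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class Latin1Fallback {
    /** Replaces characters ISO-8859-1 cannot encode with their base letter where one exists. */
    static String toLatin1(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);              // already representable, keep as-is
                return;
            }
            // Canonical decomposition splits U+0144 into 'n' + U+0301 (combining acute);
            // dropping the combining marks leaves the base letter.
            String base = Normalizer.normalize(s, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}+", "");
            // Fall back to '?' when nothing encodable is left.
            out.append(!base.isEmpty() && encoder.canEncode(base) ? base : "?");
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLatin1("Gda\u0144sk caf\u00e9"));
    }
}
```

The resulting String can then be written through the OutputStreamWriter as before; characters that ISO-8859-1 already supports (such as U+00E9) are left untouched.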

Similar Messages

  • Replace multiple space characters with a single space

    Hi,
    Oracle 11g R2.
    Looking for a way to replace multiple space characters in a row with a single space. For example, the text "abc    abc" should look like "abc abc". I tried toying with replace, but it only works for the case of 2 spaces. I need a solution for the cases where there could be 2 or more spaces.
    select replace(column1, chr(32)||chr(32), chr(32)) from tablea

    Hi,
    If you had to do this without regular expressions, you could use:
    SELECT  REPLACE ( REPLACE ( REPLACE (str, ' ', '~ ')
                              , ' ~'
                              )
                    , '~ '
                    , ' '
                    )     AS new_str
    FROM    table_x;
    assuming there is some sub-string (I used '~' above) that never occurs right next to a space.
    However, unless you're using Oracle 9 (or earlier, which you're not doing) you don't have to do this without regular expressions. As you can see, the way Solomon showed is much simpler.

  • Replace function - special characters

    Hi All,
           If my column is encountering special characters I would like to replace with blank. Special characters may vary from record to record. Sometimes it is @ sometimes $,!,,$,^,<,&,<,>,?,",{,},+,_,),(.
    Do we have any function same like replace that can replace all special characters, whatever it encounters with required value?
    - please mark correct answers

    Take a look at the article "T-SQL: How to Find Rows with Bad Characters" and check the See Also section in that article for more articles on that topic.
    Thanks Naomi
    - please mark correct answers

  • Replacement of Spl Characters- REPORT

    Hi All,
    I want to replace some special characters in my report, e.g. & with 'and' and # with 'num.'.
    I tried using REPLACE ALL OCCURRENCES OF, but this causes a performance issue and the report takes a long time to execute.
    Moreover, I cannot specify an offset and length since the special characters can occur anywhere.
    Please help me with this. Points will be rewarded for all useful inputs.

    Hi,
    Please select the required solution from one of the following:
    REPLACE by Pattern
    Replaces strings in fields with other strings using a pattern.
    Syntax
    REPLACE [ FIRST OCCURRENCE OF | ALL OCCURRENCES OF ] <old>
            IN [ SECTION OFFSET <off> LENGTH <len> OF ] <text> WITH <new>
            [IGNORING CASE|RESPECTING CASE]
            [IN BYTE MODE|IN CHARACTER MODE]
            [REPLACEMENT COUNT <c>]
            [REPLACEMENT OFFSET <r>]
            [REPLACEMENT LENGTH <l>].
    In the string <text>, the search pattern <old> is replaced by the content of <new>. By default, the first occurrence of <old> is replaced. ALL OCCURRENCES specifies that all occurrences be replaced. In the fields <old> and <new>, trailing spaces in C fields are ignored, but included in <text>. The SECTION OFFSET <off> LENGTH <len> OF addition tells the system to search and replace only from the <off> position in the length <len>. IGNORING CASE or RESPECTING CASE (default) specifies whether the search is to be case-sensitive. In Unicode programs, you must specify whether the statement is a character or byte operation, using the IN BYTE MODE or IN CHARACTER MODE (default) additions. The REPLACEMENT additions write the number of replacements, the offset of the last replacement, and the length of the last replaced string <new> to the fields <c>, <r>, and <l>.
    REPLACE by Position
    Replaces strings in fields with other strings by position.
    Syntax
    REPLACE <str1> WITH <str2> INTO <c> [LENGTH <l>]
                                        [IN BYTE MODE|IN CHARACTER MODE].
    ABAP searches the field <c> for the first occurrence of the first <l> characters in the pattern <str1> and replaces them with the string <str2>. In Unicode programs, you must specify whether the statement is a character or byte operation, using the IN BYTE MODE or IN CHARACTER MODE (default) additions.
    Please reward points if helpful.

  • [svn] 3590: Replace invalid html characters

    Revision: 3590
    Author: [email protected]
    Date: 2008-10-13 07:29:43 -0700 (Mon, 13 Oct 2008)
    Log Message:
    Replace invalid html characters
    Checkin Test Passed: Yes
    QA: No
    Bug:
    Doc: No
    Modified Paths:
    flex/sdk/trunk/frameworks/projects/flex4/src/mx/layout/ILayoutItem.as

    Try the HtmlEditFormat function built into ColdFusion.
    http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=functions_h-im_04.html#4 744272

  • Replacing non-ASCII characters with HTML character references

    Hi All,
    In Oracle 10g or greater is there a built-in function that will convert a string with non-ASCII characters like this
    a b č 뮼
    into an ASCII string with HTML character references like this?
    a b & # x 0 1 0 D ; & # x B B B C ;
    (note I had to include spaces between each character in the sample code for message to prevent the forum software from converting my text)
    I tried using
    utl_i18n.escape_reference( val, 'us7ascii' )
    but for some reason it returns
    a b c & # x B B B C ;
    Note how it converted the Western European character "č" to its unaccented counterpart "c", not "& # x 0 1 0 D ;" (is this a bug?).
    I also tried a custom solution using regexp_replace and asciistr (which I can't include here because the forum software chokes on it) but it only returns the correct result for values <=4000 characters long. Unfortunately asciistr doesn't appear to accept CLOB values larger than 4000 characters. It returns an error message like
    (ORA-22835: Buffer too small for CLOB to CHAR or BLOB to RAW conversion (actual: 30251, maximum: 4000) ).
    I'm looking for a solution that works on CLOB data of any size.
    Thanks in advance for any insight you can provide.
    Joe Fuda

    So with that (UTF8) in mind, let's take another look.....
    As shown below, I used a AL32UTF8 database.
    Note: I did not use a unicode capable tool for querying. So I set console mode code page to 1250 just to have č displayed properly (instead of posing as an è).
    Also, as a result of using windows-1250 for client character set, in the val column and in the second select's ncr column (iso8859-1), è (00e8) has been replaced with e through character set conversion going from server back to client.
    Running the same code on a database with a db character set such as we8mswin1252, that doesn't define the č (latin small c with caron) character, would yield results with a c in the ncr column.
    C:\>chcp 1250
    Aktuell teckentabell: 1250
    C:\>set nls_lang=.ee8mswin1250
    C:\>sqlplus test/test
    SQL*Plus: Release 11.1.0.6.0 - Production on Fri May 23 21:25:29 2008
    Copyright (c) 1982, 2007, Oracle.  All rights reserved.
    Connected to:
    Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
    With the OLAP option
    SQL> select * from nls_database_parameters where parameter like '%CHARACTERSET';
    PARAMETER              VALUE
    NLS_CHARACTERSET       AL32UTF8
    NLS_NCHAR_CHARACTERSET AL16UTF16
    SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'us7ascii') NCR from dual;
    VAL  NCR
    č e  c e
    SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'we8iso8859p1') NCR from dual;
    VAL  NCR
    č e  &# x10d; e     <- "è"
    SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'ee8iso8859p2') NCR from dual;
    VAL  NCR
    č e  č &# xe8;
    SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'cl8iso8859p5') NCR from dual;
    VAL  NCR
    č e  &# x10d; &# xe8;
    In the US7ASCII case, where it should be possible for all non-ASCII characters to be escaped, it seems as if the actual escape step is skipped over.
    Hope this helps to understand whether utl_i18n is usable or not in your case.
    Message was edited by:
    orafad
    Fixed replaced character references :)

  • Replace non-english characters function

    Hi folks,
    I have text that includes non-English characters. Is there a trick for replacing those characters with the "closest" English character?
    Examples:
    "Hytölä"  to become "Hytola"
    "Säynatsälo" to become "Saynatsalo"
    etc ...
    I was thinking about using REGEXP
    select regexp_replace('Hytölä Säynatsälo ', '[^0-9A-Za-z]', '') from dual
    but the pattern is not correct.
    Any suggestions?

    There is something that smells like a hack to me (source: replace characters with accent with their base letter).
    However:
    with data as (
    select 'Hytölä' str from dual
    union all
    select 'Säynatsälo' from dual
    )
    select
      str
    ,utl_raw.cast_to_varchar2(nlssort(str, 'NLS_SORT=BINARY_AI')) nstr
    ,length(utl_raw.cast_to_varchar2(nlssort(str, 'NLS_SORT=BINARY_AI'))) l
    from data
    STR         NSTR        L
    Hytölä      hytola      7
    Säynatsälo  saynatsalo  11
    Notice the change in length caused by an extra null byte at the end of the strings, and the loss of the uppercase.
    For this kind of question it's helpful to know the requirements. Why should there be a base-letter conversion? For search purposes, for example? And don't forget the db character set.

  • Replace all special characters in a String with underscore

    I have a String which contains some special characters even(!,$,@,*....).
    I need to replace all the special characters with _ in my String. I do have an idea of using String.replace() and analogous forms, but I would be thankful if anyone can suggest me of a better and an efficient way.
    regards,
    fun_one

    Kaj,
    Thanks for your earnest reply. I did have a look at the API for this method, but the regular expression that I need to use here was beyond my understanding. It did specify some regex that I put to use (something like myString("\D","_"), assuming that I need to replace all non-digit characters), but it really did not help me get the result.
    Would you spare some code for me regarding the usage of regular expressions in such a scenario?
    cheers,
    fun_one
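
    A minimal sketch of the regex approach, assuming "special character" means anything that is not an ASCII letter or digit (so spaces are replaced too). The class and method names are only illustrative. String.replaceAll takes a regular expression, whereas String.replace works on literal text.

```java
final class SpecialChars {
    /** Replaces every character that is not an ASCII letter or digit with '_'. */
    static String underscoreSpecials(String s) {
        // [^A-Za-z0-9] is a negated character class: "anything except these".
        return s.replaceAll("[^A-Za-z0-9]", "_");
    }

    public static void main(String[] args) {
        System.out.println(underscoreSpecials("pa$$w@rd!"));
    }
}
```

    If some characters should be kept as they are (a space or a dot, say), add them inside the brackets, e.g. "[^A-Za-z0-9 .]".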

  • Replace not latin characters from file name.

    Dear friends,
    I tried a lot but even after reading the documentation about all the available CF functions I could not find a way  to accomplish this:
    As you already know from my previous post I'm trying to finish an application which manages multiple images. I have reached a really good point and I was ready to implement the solutions you suggest on this post when the client did something marvellous. He uploaded some images named "Εικόνα 123.jpg" (Image 123.jpg in English) which actually broke the Flex application that retrieves the images because the firewall does not allow high bit characters to go through. I now need to add a function that will evaluate every character in the file name individually and it will remove or replace (I do not care) all the characters that are not latin or numbers (spaces, greek characters, special characters, etc). As I've already said no known function is able to do that (as far as I know of course) and I guess that the solution could be hiding in regular expressions which is not my strong point. So I need your help here.
    Thank you in advance,
    John

    Thanks Ian,
    I cannot figure out how to use the asc() function for this. I will have to run a test on every single character and replace the invalid ones, but I will never know the actual string length (how many characters each image name will have) and I will possibly end up destroying the extensions (.jpg) as well. To make it a little more complicated, let me tell you that I will have to run this twice: once for the full image path (i.e. d:\company\aptown\images\Εικόνα 123.jpg), in which I will have to change only the ...Εικόνα 123... part and nothing else, and once for a comma-separated list of the image names (i.e. Εικόνα 123.jpg, Εικόνα 124.jpg, Εικόνα 125.jpg, Εικόνα 126.jpg). And don't be misled by the pattern. The customer may upload an image named in a completely different way, still using invalid characters, for example "Αντίγραφο της Εικόνας 123.jpg" (Copy of the image 123.jpg in English).
    Seems to be impossible ,
    I hope it is not.
    Yannis
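
    It does not look impossible. Below is a rough sketch in Java rather than CFML, so treat it as an illustration of the idea and not as drop-in ColdFusion code. It assumes that only the part of the name before the last dot should be cleaned, that everything other than ASCII letters and digits becomes an underscore, and that the directory part of a full path is left alone. The names FileNameSanitizer, sanitizeName, sanitizePath and sanitizeList are invented for this example.

```java
final class FileNameSanitizer {
    /** Cleans one file name, leaving the extension (from the last dot on) untouched. */
    static String sanitizeName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String stem = dot > 0 ? fileName.substring(0, dot) : fileName;
        String ext = dot > 0 ? fileName.substring(dot) : "";
        return stem.replaceAll("[^A-Za-z0-9]", "_") + ext;
    }

    /** Cleans only the last segment of a full path; the directories are kept. */
    static String sanitizePath(String path) {
        int sep = Math.max(path.lastIndexOf('\\'), path.lastIndexOf('/'));
        return path.substring(0, sep + 1) + sanitizeName(path.substring(sep + 1));
    }

    /** Cleans every name in a comma-separated list. */
    static String sanitizeList(String commaSeparated) {
        String[] names = commaSeparated.trim().split("\\s*,\\s*");
        for (int i = 0; i < names.length; i++) {
            names[i] = sanitizeName(names[i]);
        }
        return String.join(",", names);
    }

    public static void main(String[] args) {
        System.out.println(sanitizePath("d:\\company\\aptown\\images\\\u0395\u03b9\u03ba\u03cc\u03bd\u03b1 123.jpg"));
    }
}
```

    Because the whole stem is handled by one regular expression, the length of the name does not need to be known in advance. Note that two different original names can collapse to the same cleaned name, so a uniqueness check may still be needed.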

  • How to replace XML special characters using Xpress e.g. ' = &apos;

    Hello,
    I need to write some XML built from resource data.
    The resource data strings can contain anything e.g. & ' " < and > characters.
    How do I replace these characters with the strings &amp; &apos; &quot; etc. from within Xpress?
    I try something like:-
    <set name='testchar'>
    <switch>
    <ref>thisLetter</ref>
    <case>
    <s>'</s>
    <s>&apos;</s>
    </case>
    <case>
    <s>"</s>
    <s>&quot;</s>
    </case>
    <case default='true'>
    <ref>thisLetter</ref>
    </case>
    </switch>
    </set>
    But when saved the <s>&apos;</s> gets saved as <s>'</s>
    Is this impossible with Xpress forcing me to invoke my own Java class? (which is a pain to do because of internal politics re restarting of App Server).

    Hmmmm seems there is transformation in this set up too.
    What I am trying to do is to parse a string which may or may not contain XML special characters into an XML acceptable string.
    e.g. transform the surname O'Connor into O&apos ;Connor (with no embedded blank)
    I wish to write a rule which uses XPress language to achieve this.
    I want to convert the character ' into the string &apos ; (with no blank between apos and ;) and similar for the other XML specials.
    Is this possible?
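
    In case falling back to a Java class turns out to be unavoidable, here is a minimal sketch of the escaping itself. It is plain Java, not XPress, and the class and method names are invented for illustration; how such a class would be invoked from a rule is not covered here. It maps the five XML special characters to their predefined entities.

```java
final class XmlEscape {
    /** Replaces & < > " ' with the five predefined XML entities. */
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("O'Connor"));
    }
}
```

    Working character by character in a single pass avoids the double-escaping problem that chained replace calls can run into (for example, escaping the & of an entity that was just inserted).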

  • Replacing non-ascii characters in String

    I have a site where the user enters data in a rich text
    editor (ktml4) that gets stored into a database (mysql). There are
    non ascii characters getting into the data, I'm assuming that they
    are copying and pasting from Word. Unfortunately in this situation,
    changing that process isn't an option.
    Currently, this is the only character that is causing me
    problems:
    http://www.zvon.org/other/charSearch/PHP/search.php?request=ffa0&searchType=3
    I would just like to replace the non-ascii characters with a
    space when I read them from the database. Something like:
    #Replace(result.column, '\xffa0', ' ')#
    However, I believe that code looks for the string "\xffa0",
    not the character \xffa0.
    Is there any way to do this?

    quote:
    Originally posted by: BuckLemke
    quote:
    Originally posted by: Dan Bracuk
    rereplace might work.
    Can you give an example of how to pass a non-ascii character to REReplace?
    Regular expressions are not my strength, but the approach I was considering was, "if it's not an ascii character, make it a space". Then you pass the entire string at once.
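
    To illustrate the "if it's not an ascii character, make it a space" idea, here is a small sketch in Java (not CFML; whether the same pattern syntax is accepted by REReplace is an assumption I have not checked). The class and method names are made up. The character class [^\x00-\x7F] matches anything outside the 7-bit ASCII range, so the whole string is handled in one pass.

```java
final class AsciiOnly {
    /** Replaces every non-ASCII character with a single space. */
    static String nonAsciiToSpace(String s) {
        // \x00-\x7F is the ASCII range; the leading ^ negates the class.
        return s.replaceAll("[^\\x00-\\x7F]", " ");
    }

    public static void main(String[] args) {
        System.out.println(nonAsciiToSpace("before\uffa0after"));
    }
}
```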

  • Replace Non-Numeric Characters with a Numeric Character in a String

    Hi Guys,
    I need to replace all the non-numeric characters (including embedded blanks & hyphen) in a string to a numeric character '1'.
    The trailing blanks should not be replaced.
    e.g. "P22233344455566" should be changed to "122233344455566"
    &    "49-1234567           " should be changed to "4911234567          "
    Please help.

    Use [replace|http://help.sap.com/abapdocu_70/en/ABAPREPLACE_IN_PATTERN.htm] with a regular expression to translate any non-numeric character (i.e. any character not between 0 and 9) to 1:
      REPLACE ALL OCCURRENCES OF REGEX '[^0-9]' IN value WITH '1'.
    Cheers, harald
    p.s.: In older releases [translate|http://help.sap.com/abapdocu_70/en/ABAPTRANSLATE.htm] would also do the trick, but is more lengthy, because one would need to specify each individual character that should be replaced, e.g.:
      TRANSLATE value TO UPPER CASE.
      TRANSLATE value USING
          ' 1_1-1A1B1C1D1E1F1G1H1I1J1K1L1M1N1O1P1Q1R1S1T1U1V1W1X1Y1Z1'.

  • Replacing non latin characters

    Hi experts,
    I have to check some fields for non-Latin characters.
    When the fields include non-Latin characters, I have to replace them
    with a "Y".
    Does someone have a code example for this case?
    Thanks for help!
    Alex

    This should give you an idea:
    WHILE p_faxno CA sy-abcde.  " check whether the variable contains any letter A...Z
      p_faxno+sy-fdpos(1) = 'Y'.
    ENDWHILE.
    CONDENSE p_faxno NO-GAPS.

  • How do I 'Find & Replace' with control characters - paragraph, carriage return, tab

    In Pages 4.3 (Pages '09) I could insert special characters into the Find & Replace fields.  It was great.  Excellent way to clean up & recover text from a pdf or html.
    In Pages 5.2 that capability seems to be gone.  When I turn on the special characters [Show Invisibles - shift-cmd-I], I can highlight said invisible, copy it, then paste it into the Find&Replace field.  Nope.  It goes in as a space, and the Find command only finds spaces.  Not the invisible that I was looking for.
    This was a major feature for me, folks.  Much better than Word's approach.  Now it is gone (I fear), and the Find & Replace is worse than Word's.
    How can I recover/attain this feature?

    I have struggled to find this "feature" for a while.
    It CANNOT be that Apple stripped out a core capability of even a basic text editor.
    Apple, a reply here because this is a F%$#$%^ joke if true.

  • Bug in replace all. escape characters are not working.

    Hi,
    My requirement is that whenever I see ":" (colon) in the string, I want to replace it with (\:). So I tried
    String escapedTitle = "title:the world is not enough".replace(":", "\\:");
    and to my surprise, when I printed escapedTitle I got
    title\\:the world is not enough
    instead of
    title\:the world is not enough
    (note the back slash in the string)
    I want to ask why the escape characters behave differently here. I am using JDK1.6.0_06

    Sorry for the last post. Please try this:
    public class test {
        public static void main(String a[]) {
            String escapedTitle = "title:the world is not enough".replaceAll(":+", "\\:"); //or [:]+
            String escapedTitle1 = "title:the world is not enough".replaceAll(":+", "*"); // or [:]+
            System.out.println("Another String is "+ escapedTitle);
            System.out.println("Another String is "+ escapedTitle1);
            System.out.println(System.getProperty("java.vendor"));
            System.out.println(System.getProperty("java.version"));
        }
    }
    output is
    Another String is title:the world is not enough
    Another String is title*the world is not enough
    Sun Microsystems Inc.
    1.6.0_06
    Please let me know why I am not getting : escaped as \: with the replaceAll method.
    I want escapedTitle to print as: Another String is title\:the world is not enough
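
    A likely explanation, sketched below: in replaceAll the replacement string is not literal. A backslash there escapes the next character, so the Java literal "\\:" (the two characters \:) is read as "a literal colon" and the backslash disappears. Either quote the replacement with Matcher.quoteReplacement or double the backslash again. The class and method names are just for illustration.

```java
import java.util.regex.Matcher;

final class EscapeColon {
    /** Escapes colons via replaceAll with a quoted (literal) replacement. */
    static String escapeColons(String s) {
        return s.replaceAll(":", Matcher.quoteReplacement("\\:"));
    }

    /** Same result with the backslash doubled by hand: "\\\\:" is \\: to the regex engine. */
    static String escapeColonsManual(String s) {
        return s.replaceAll(":", "\\\\:");
    }

    public static void main(String[] args) {
        String title = "title:the world is not enough";
        System.out.println(escapeColons(title));
        System.out.println(escapeColonsManual(title));
        // String.replace treats both arguments as literal text, so no extra escaping is needed.
        System.out.println(title.replace(":", "\\:"));
    }
}
```

    All three calls in main produce the same text; the replaceAll variants are only needed when the search side really has to be a regular expression.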
