Searching on high ascii characters

Hi all,
I am writing a search engine built on Oracle Text (otherwise why would I be posting here??) to search lists of medical articles. Many of the article titles contain special characters such as cedillas and umlauts. Some of my users will have European keyboards and will search using these characters; others will not and will demote such characters to their low-ASCII equivalents. Currently all special characters are stored in the table as HTML-encoded values, but this could easily be changed to something else that is supported by Oracle Text.
Example: I have an article called "Behçet's syndrome". This needs to match if someone searches with Behçet or Behcet.
EMP has two records one with ename = Behcet and the other ename = Behçet
SELECT ENAME
FROM EMP
WHERE contains (ENAME, 'Behcet', 100) > 0
only returns one row.
I have looked at using SYN to provide this functionality and it works, but it means that I continually have to update and maintain that list of synonyms, which is a chore and something I want to avoid.
Is there a way to build an index that covers both possibilities? That is, provide a list of special characters and what they would degrade into? I feel that there is a very simple, elegant solution to this just waiting for me. Any suggestions are very welcome.
thanks
Toby

SCOTT@10gXE> CREATE TABLE articles (id NUMBER, title VARCHAR2 (30))
2 /
Table created.
SCOTT@10gXE> INSERT ALL
2 INTO articles VALUES (1, 'Behçet''s syndrome')
3 INTO articles VALUES (2, 'Behcet''s syndrome')
4 INTO articles VALUES (3, 'compulsive tuning disorder')
5 SELECT * FROM DUAL
6 /
3 rows created.
SCOTT@10gXE> EXEC CTX_DDL.CREATE_PREFERENCE ('your_lexer', 'BASIC_LEXER')
PL/SQL procedure successfully completed.
SCOTT@10gXE> EXEC CTX_DDL.SET_ATTRIBUTE ('your_lexer', 'BASE_LETTER', 'YES')
PL/SQL procedure successfully completed.
SCOTT@10gXE> CREATE INDEX articles_idx ON articles (title)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 PARAMETERS ('LEXER your_lexer')
4 /
Index created.
SCOTT@10gXE> SELECT token_text FROM dr$articles_idx$i
2 /
TOKEN_TEXT
BEHCET
COMPULSIVE
DISORDER
SYNDROME
TUNING
SCOTT@10gXE> SELECT * FROM articles WHERE CONTAINS (title, 'Behçet') > 0
2 /
        ID TITLE
         1 Behçet's syndrome
         2 Behcet's syndrome
SCOTT@10gXE> SELECT * FROM articles WHERE CONTAINS (title, 'Behcet') > 0
2 /
        ID TITLE
         1 Behçet's syndrome
         2 Behcet's syndrome
SCOTT@10gXE>
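
To make the idea behind BASE_LETTER concrete outside the database, here is a minimal Python sketch of the same kind of folding. It is only an illustration of the concept, not Oracle Text's implementation: the function names are made up for this example, and the HTML-entity decoding step is an assumption based on the original post saying the titles are stored HTML-encoded.

```python
import html
import unicodedata


def base_letter(text: str) -> str:
    """Decode HTML entities, then drop combining marks (ç -> c, ü -> u)."""
    decoded = html.unescape(text)                       # "Beh&ccedil;et" -> "Behçet"
    decomposed = unicodedata.normalize("NFD", decoded)  # "ç" -> "c" + combining cedilla
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def search_token(text: str) -> str:
    """Fold a word the way the index tokens above appear: base letters, upper case."""
    return base_letter(text).upper()


if __name__ == "__main__":
    # Both spellings fold to the same token, so either query can match either row.
    print(search_token("Beh&ccedil;et"))
    print(search_token("Behcet"))
```

Because both the stored text and the query term pass through the same folding, no synonym list has to be maintained by hand.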

Similar Messages

Convert smart quotes and other high ascii characters to HTML

I'd like to set up Dreamweaver CS4 Mac to automatically convert smart quotes and other high ASCII characters (em dashes, accent marks, etc.) pasted from MS Word into HTML code. Dreamweaver 8 used to do this by default, but I can't find a way to set up a similar auto-conversion in CS4. Is this possible? If not, it really should be a preference option. I code a lot of HTML emails and it is very time consuming to convert every curly quote and dash.
Thanks,
Robert
Digital Arts

I too am having a related problem with Dreamweaver CS5 (running under Windows XP), having just upgraded from CS4 (which works fine for me) this week.
In my case, I like to convert to typographic quotes etc. in my text editor, where I can use macros I've written to speed the conversion process. So my preferred method is to key in typographic letters & symbols by hand (using ALT + ASCII key codes typed in on the numeric keypad) in my text editor, and then I copy and paste my *plain* ASCII text (no formatting other than line feeds & carriage returns) into DW's DESIGN view. DW displays my high-ASCII characters just fine in DESIGN view, and writes the proper HTML code for the character into the source code (which is where I mostly work in DW).
I've been doing it this way for years (first with GoLive, and then with DW CS4) and never encountered any problems until this week, when I upgraded to DW CS5.
But the problem I'm having may be somewhat different than what others have complained of here.
In my case, some high-ASCII (above 128) characters convert to HTML just fine, while others do not.
E.g., en and em dashes in my cut-and-paste text show as such in DESIGN mode, and the right entries
    –
    —
turn up in the source code. Same is true for the ampersand
    &
and the copyright symbol
    ©
and for such foreign letters as the e with acute accent (ALT+0233)
    é
What does NOT display or code correctly are the typographic quotes. E.g., when I paste in (or special paste; it doesn't seem to make any difference which I use for this) text with typographic double quotes (ALT+0147 for open quote mark and ALT+0148 for close quote mark), which should appear in source code as
    “[...]”
DW strips out the ASCII encoding, displaying the inch marks in DESIGN mode, and putting this
    "[...]"
in my source code.
The typographic apostrophe (ALT+0146) is treated differently still. The text I copy & paste into DW should appear as
    [...]’[...]
in the source code, but instead I get the foot mark (both in DESIGN and CODE views):
    [...]'[...]
I've tried adjusting the various DW settings for "encoding"
    MODIFY > PAGE PROPERTIES > TITLE/ENCODING > Encoding:
and for fonts
    EDIT > PREFERENCES > FONTS
but switching from "Unicode (UTF-8)" to "Western European" hasn't solved the problem (probably because in my case many of the higher ASCII characters convert just fine). So I don't think it's the encoding scheme I use that's the problem.
Whatever the problem is, it's caused me enough headaches and time lost troubleshooting that I'm planning to revert to CS4 as soon as I post this.
Deborah
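
For anyone who ends up doing this conversion outside Dreamweaver, here is a rough Python sketch of the mapping being described. It assumes the ALT+0146/0147/0148 codes are Windows-1252 positions and uses the standard `html.entities` table; the function name is invented for the example, and it deliberately leaves `&`, `<` and `>` alone.

```python
from html.entities import codepoint2name


def to_html_entities(text: str) -> str:
    """Replace every non-ASCII character with a named or numeric HTML entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                       # plain ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. U+201C -> &ldquo;
        else:
            out.append(f"&#{cp};")               # fall back to a numeric reference
    return "".join(out)


if __name__ == "__main__":
    # Bytes 147, 148 and 146 in Windows-1252 are the curly quotes and apostrophe.
    curly = bytes([147, 148, 146]).decode("cp1252")
    print(to_html_entities(curly))
    print(to_html_entities("“Café” – it’s"))
```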

Web services and High Ascii characters

Hello Everyone
I have a problem regarding webservices and sending high ascii
characters.
My little application consists of two files: a cfm that
renders the view and a cfc that can be called either as a
component or as a web service.
The goal of the application is to handle text files, divide
them according to some rules and then send back to the user.
Sometimes the text file contains a few high ascii characters.
Everything works fine as long as the cfc is called as the
component. Unfortunately when I call the cfc as the webservice I
get an error:
The web service operation caused an invocation exception. The
root cause was that: java.lang.IllegalArgumentException: The char
'0x1' in 'java.lang.IllegalArgumentException: and so on....
AxisFault
faultCode: {http://xml.apache.org/axis/}HTTP
faultSubcode:
faultString: (500)Internal Server Error
faultActor:
faultNode:
faultDetail:
{}:return code: 500
Do you know how I can solve my problem? I've tried different
encoding settings but nothing worked.
Thanks for all answers.

Whatever means ColdFusion uses to communicate with the database, by default it represents text meant for the browser in UTF-8.
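
The truncated error above complains about the char '0x1', which is a control character that XML 1.0 does not allow at all, regardless of encoding. Assuming that is what trips the web service call, one option is to strip such characters from the text before sending it. The sketch below shows the idea in Python (the helper name is made up, and the same filter would have to be ported to CFML to be used in the cfc).

```python
import re

# Characters allowed by XML 1.0: tab, LF, CR, and the ranges
# U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF. Everything else is removed.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that cannot appear in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    sample = "caf\u00e9\x01 ok\n"
    # The accented letter survives; only the 0x01 control character is dropped.
    print(repr(strip_invalid_xml_chars(sample)))
```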

CF8 - XmlFormat not escaping High ASCII characters

In CF8, we have a problem where XmlFormat is not escaping
High ASCII characters. This was working just fine on our CF7
instance, but in CF8, it is not escaping all characters. I am aware
of the long-standing problem with escaping Windows-1252 characters,
but now we are experiencing an issue with basic high ASCII
characters, like chr(233) and chr(244). Is anyone else experiencing
this issue? We have not installed Update 1 to CF8 yet. I don't see
a fix for this in the release notes, but is there any word on whether
the updater fixes it?
Here is a test to demonstrate the issue:
<cfset myString = "The Islamic Republic of Mauritania's
(République Islamique de Mauritanie) 2007 estimated population
is 3,270,000. Cote d'Ivoire and Côte d'Ivoire">
<cfset myNewString = XmlFormat(myString)>
<cfoutput>#myNewString#</cfoutput>

BKBK,
Thanks for the info. Adding the cfprocessingdirective does help
show that these characters are being escaped. However, the behavior
has changed somewhat between CF7 and CF8, as we were not using a
cfprocessingdirective in CF7, and this was working as advertised.
Where this is giving us a problem is after we create an XML
document using CFXML, (ensuring that we XmlFormat any strings), we
then validate that document against a schema, and we are all of a
sudden getting errors during validation for invalid characters
within the XML. We are using ToString() after creating the XML
document with CFXML, and our process is the same as we were using
in CF7. That is why I was curious if anyone else was having this
same issue... because something definitely changed between CF7 and
CF8 with XML processing.
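
As a point of comparison for what "escaped" should look like, here is a small Python sketch that escapes XML markup characters and turns every non-ASCII character into a numeric character reference, so the result is pure ASCII whatever the page encoding is. It is an analogue for illustration, not a reimplementation of XmlFormat (for one thing, it leaves quotes alone), and the function name is invented.

```python
from xml.sax.saxutils import escape


def xml_escape_ascii(text: str) -> str:
    """Escape &, < and >, then replace non-ASCII characters with &#NNN; references."""
    escaped = escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(xml_escape_ascii("Côte d'Ivoire & <République>"))
```

Output produced this way validates independently of how the file itself was read, which is the part that appears to have changed between CF7 and CF8.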

High ASCII in discussion forums

When using Safari 2.0.3 on many discussion forums, "High ASCII" characters such as greek letters and mathematical symbols will not display in my posts. I wonder what could be the problem, because if I type the same characters in Camino, they display properly in my posts.

Many discussion forums use a Latin-1 encoding. This means the browser must convert anything outside Latin-1 into numeric character references like &#12345;. Mozilla-type browsers do this, but Safari does not. Forums intended to accommodate Greek or math should use UTF-8, like this one does.

Search for users and non-ASCII characters

I am having a little issue with the "Accounts - Find Users" functionality. The search breaks on what I assume are non-ASCII characters (we use the following three up here in Denmark: æ, ø, å). To be precise, I have a user with the first name "Jørgen". Searching for first names starting with "J" works just fine but "Jø" returns zero matches.
My setup is with two machines, one (A) holding the MySQL database and one (B) serving Identity Manager on top of tomcat.
Both A and B are RHEL boxes, and both have da_DK.UTF-8 as default locale.
MySQL's /etc/my.cnf file has the following entry (as recommended in create_waveset_tables.mysql):
[mysqld]
default-character-set=utf8
default-collation=bin
For clarity, some functionality works just fine in Identity Manager with these non-ASCII characters, such as adding a user whose name contains them (the Danish letters as well as other accented characters). At the moment, it appears to be only the search functionality that is not working as I would expect. I'm still on the fence about whether I've missed something in terms of configuration, or whether this is a limitation.
Does anyone know whether this problem is on my side or the software's side?

[SOLVED] KDEmod - problem with mounting b/c of non-ASCII characters

Hi guys!
I finally set aside a few gigabytes for Archlinux, so it is no longer in a virtual machine. So far I have managed to configure everything with the excellent wiki. It's runnin' and kickin'. I ran across only one problem:
When I insert a CD with a label that has non-ASCII characters (some Polish ones in my case) and I click on its icon in Konqueror, I get the message that "file such-and-such doesn't exist", and the Polish characters are clearly garbled (it is not a font problem, I double checked). I can still access the folder either via the console or via Konqueror if I go to the /media folder, though.
Any ideas how I can fix it? If you need more info, let me know.
Last edited by JeremyTheWicked (2008-05-31 14:46:07)

You're welcome. Now it's advisable for you to edit the title of your initial post: add [SOLVED]. Perhaps clearer wording would be in order, too, for the benefit of the search engine. The problem seems to be a trifle in retrospect, but somehow it takes some effort to find the solution, doesn't it?

Replacing non-ascii characters in String

I have a site where the user enters data in a rich text
editor (ktml4) that gets stored into a database (mysql). There are
non ascii characters getting into the data, I'm assuming that they
are copying and pasting from Word. Unfortunately in this situation,
changing that process isn't an option.
Currently, this is the only character that is causing me
problems:
http://www.zvon.org/other/charSearch/PHP/search.php?request=ffa0&searchType=3
I would just like to replace the non-ascii characters with a
space when I read them from the database. Something like:
#Replace(result.column, '\xffa0', ' ')#
However, I believe that code looks for the string "\xffa0",
not the character \xffa0.
Is there any way to do this?

BuckLemke wrote:
Dan Bracuk wrote:
rereplace might work.
Can you give an example of how to pass a non-ascii character
to REReplace?
Regular expressions are not my strength, but the approach I
was considering was, "if it's not an ascii character, make it a
space". Then you pass the entire string at once.
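
The "if it's not an ascii character, make it a space" approach comes down to a single negated character class. Here is a minimal sketch of that pattern in Python; the function name is invented for the example, and the pattern would still need to be adapted to REReplace syntax in CFML.

```python
import re

# Anything outside the 7-bit ASCII range U+0000-U+007F.
_NON_ASCII = re.compile(r"[^\x00-\x7F]")


def replace_non_ascii(text: str, replacement: str = " ") -> str:
    """Replace every non-ASCII character in text with the replacement string."""
    return _NON_ASCII.sub(replacement, text)


if __name__ == "__main__":
    # U+FFA0 is the character mentioned in the question.
    print(replace_non_ascii("before\uffa0after"))
```

Matching on the range rather than on one specific character means the whole string is cleaned in one pass, and any other stray characters pasted from Word get the same treatment.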

[zsh] cannot input non ASCII characters

When inputting non-ASCII characters (Chinese, for example) on the command line, the text turns into something like
ä½?
when what it really should print is
你好
I've searched a little bit on this, and the answer has always been "zsh has had unicode support since 4.3.1",
so how come that does not appear to be the case in 4.3.10?

SamC wrote:Could this be a problem with your terminal?
Hardly, because I've also tried this with xterm.
Setting "setopt multibyte" and "setopt printeightbit" doesn't help either.
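
The garbled output has the typical shape of UTF-8 bytes being shown one byte at a time as if they were a single-byte charset. The Python sketch below reproduces that effect for the same two characters. It only illustrates the symptom and does not pinpoint which part of the zsh or locale setup is responsible.

```python
text = "你好"

# Each of these characters takes three bytes in UTF-8.
raw = text.encode("utf-8")

# Interpreting those six bytes as Latin-1 yields six unrelated characters,
# starting with "ä" and "½", as in the garbled prompt above.
garbled = raw.decode("latin-1")

if __name__ == "__main__":
    print(len(raw))
    print(repr(garbled))
    print(raw.decode("utf-8"))  # decoding with the right charset restores the text
```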

How to search Youtube with cyrillic characters?

Before the last update of Apple TV, we were able to use our iPod with the Remote application and search for Russian programs with Cyrillic characters from the Cyrillic keyboard on the iPod. Now when we try, every time we enter a Cyrillic character we get some message saying that Youtube was not communicating. Anyone have any ideas how we can search Youtube in Cyrillic characters again?
thanks,

Hi,
How can I search for a notification number (or something similar) with a wildcard search?
See the following example:
DATA: L_TRANS TYPE RANGE OF ZNOTIF-TRANS.
DATA: BEGIN OF WA_TRANS,
       SIGN(1),
       OPTION(2),
       LOW(45),
       HIGH(45),
      END OF WA_TRANS.
        WA_TRANS-SIGN = 'I'.
        WA_TRANS-OPTION = 'CP'.
        WA_TRANS-LOW = INIT-TranNumber.
        APPEND WA_TRANS TO L_TRANS.
        SELECT * FROM NOTIF
                 INTO TABLE ZNOTIF
                 WHERE TRANS IN L_TRANS.
        IF SY-SUBRC <> 0.
          EXIT.
        ENDIF.
Basically, if I give '111' or '111%' or '111*' for INIT-TranNumber, which is the first 3 digits of most of the notifications, it doesn't return any entries.
How should the wildcard search '*' be handled if WA_TRANS-LOW is initial?
Thank you
AP

Username with ascii characters

Hello, I have an HTML form and I would like the user to type
ONLY ASCII characters in the username field.
For example, in other fields of the form I
would like the user to type in his native language, but
as far as the username and password fields are concerned
the characters have to be ASCII.
How am I supposed to check whether the username is acceptable (*consists of ASCII characters*)?
And which characters are desirable in a username (e.g. is *?* a desirable character in a username? What about *:*?)
Thanks, in advance!

g_p_java wrote:
How am i supposed to check when the username is accepted/correct (*consists of ascii characters*)?
ASCII characters are the Unicode characters whose code points are between 0 and 127.
g_p_java wrote:
and which are the desirable characters a username must have (e.g. *?* is a desirable character in a username , *:* this one?)
I don't understand this. You have already said they must be ASCII. You have other requirements? Fine, go ahead and program them and ask questions if you have problems with that. Personally I don't think that requiring somebody to have a question mark in their user name is a good idea -- but probably you didn't mean it when you suggested that.
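
The definition given in the reply (code points between 0 and 127) translates directly into a check. Here it is sketched in Python. The second function is only an example of an extra rule: its allowed character set (letters, digits, dot, underscore, hyphen) is an assumption made for illustration, not something the thread agreed on.

```python
import string


def is_ascii(value: str) -> bool:
    """True if every character has a code point between 0 and 127."""
    return all(ord(ch) <= 127 for ch in value)


# Hypothetical stricter policy: ASCII letters, digits and a few separators.
_ALLOWED = set(string.ascii_letters + string.digits + "._-")


def is_acceptable_username(value: str) -> bool:
    """True if value is non-empty and uses only the allowed characters."""
    return bool(value) and all(ch in _ALLOWED for ch in value)


if __name__ == "__main__":
    print(is_ascii("toby_42"), is_ascii("Jørgen"))
    print(is_acceptable_username("toby_42"), is_acceptable_username("who?"))
```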

Non ascii characters being sent from a parameter in a form

Hi!
I have seen many topics posted on passing non ascii characters through parameters from one servlet to another and converting them into whatever format is necessary.
However, I have not seen anyone answer the following question. I have a JSP page (HTML) with the character encoding set to UTF-8. The user inputs some data into a text field which is inside a form. The data could be in non-ASCII characters such as Hebrew or Arabic. This form is then sent to another JSP where I try to retrieve the data from the text field. No matter what I do, I cannot get the data presented correctly. It is either question marks or other weird symbols.
I have tried every permutation of the encoding of the actual HTML page, the encoding of the string from request.getParameter etc., but it still is not presented correctly on the new HTML page.
Can anyone help??
Spencer

Ok, I solved the problem.
I had to put at the top request.setCharacterEncoding("utf-8");
Spencer
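
Why that one line matters: the browser sends the form value as percent-encoded UTF-8 bytes, and the receiving side has to decode those bytes with the same charset. The Python sketch below mimics both outcomes. It assumes the container's default was a single-byte charset such as ISO-8859-1, which matches the symptoms described but is not stated in the thread.

```python
from urllib.parse import quote, unquote

original = "שלום"

# What a UTF-8 page puts on the wire for this form value.
wire = quote(original, encoding="utf-8")

# Decoding with the wrong charset gives two junk characters per Hebrew letter.
wrong = unquote(wire, encoding="latin-1")

# Decoding with UTF-8 (the effect of setting the request encoding first)
# restores the original text.
right = unquote(wire, encoding="utf-8")

if __name__ == "__main__":
    print(wire)
    print(repr(wrong))
    print(right)
```

The call has to happen before the first parameter is read, because the parameters are decoded once and then cached.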

Replacing non-ASCII characters with HTML character references

Hi All,
In Oracle 10g or greater is there a built-in function that will convert a string with non-ASCII characters like this
a b č 뮼
into an ASCII string with HTML character references like this?
a b & # x 0 1 0 D ; & # x B B B C ;
(note I had to include spaces between each character in the sample code for message to prevent the forum software from converting my text)
I tried using
utl_i18n.escape_reference( val, 'us7ascii' )
but for some reason it returns
a b c & # x B B B C ;
Note how it converted the Western European character "č" to its unaccented counterpart "c", not "& # x 0 1 0 D ;" (is this a bug?).
I also tried a custom solution using regexp_replace and asciistr (which I can't include here because the forum software chokes on it) but it only returns the correct result for values <=4000 characters long. Unfortunately asciistr doesn't appear to accept CLOB values larger than 4000 characters. It returns an error message like
(ORA-22835: Buffer too small for CLOB to CHAR or BLOB to RAW conversion (actual: 30251, maximum: 4000) ).
I'm looking for a solution that works on CLOB data of any size.
Thanks in advance for any insight you can provide.
Joe Fuda

So with that (UTF8) in mind, let's take another look.....
As shown below, I used a AL32UTF8 database.
Note: I did not use a unicode capable tool for querying. So I set console mode code page to 1250 just to have č displayed properly (instead of posing as an è).
Also, as a result of using windows-1250 for client character set, in the val column and in the second select's ncr column (iso8859-1), è (00e8) has been replaced with e through character set conversion going from server back to client.
Running the same code on a database with a db character set such as we8mswin1252, that doesn't define the č (latin small c with caron) character, would yield results with a c in the ncr column.
C:\>chcp 1250
Aktuell teckentabell: 1250
C:\>set nls_lang=.ee8mswin1250
C:\>sqlplus test/test
SQL*Plus: Release 11.1.0.6.0 - Production on Fri May 23 21:25:29 2008
Copyright (c) 1982, 2007, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
With the OLAP option
SQL> select * from nls_database_parameters where parameter like '%CHARACTERSET';
PARAMETER              VALUE
NLS_CHARACTERSET       AL32UTF8
NLS_NCHAR_CHARACTERSET AL16UTF16
SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'us7ascii') NCR from dual;
VAL NCR
č e c e
SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'we8iso8859p1') NCR from dual;
VAL NCR
č e &# x10d; e     <- "è"
SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'ee8iso8859p2') NCR from dual;
VAL NCR
č e č &# xe8;
SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'cl8iso8859p5') NCR from dual;
VAL NCR
č e &# x10d; &# xe8;
In the US7ASCII case, where it should be possible for all non-ascii characters to be escaped, it seems as if the actual escape step is skipped over.
Hope this helps to understand whether utl_i18n is usable or not in your case.
Message was edited by:
orafad
Fixed replaced character references :)
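
For reference, the output Joe asked for is easy to state precisely: every character above code point 127 becomes a hexadecimal character reference, and everything else is left alone. Here is a short Python sketch of that target behaviour, with no length limit. It is not an Oracle solution, and the function name is made up for the example.

```python
def to_hex_ncr(text: str) -> str:
    """Replace each non-ASCII character with a &#xHHHH; character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )


if __name__ == "__main__":
    print(to_hex_ncr("a b č 뮼"))
    # Size is not a constraint: a string well beyond 4000 characters works the same way.
    big = "č" * 50000
    print(len(to_hex_ncr(big)))
```

Any PL/SQL version handling CLOBs would need to process the data in chunks of at most 4000 characters to stay under the limit reported in the ORA-22835 message.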

How can I convert ASCII characters to ISO8859?

Hi All,
I have written a little application that renames a TV episode by scraping a TV listing site for the episode name. It is written in SWT and works great apart from one small problem. When getting the HTML back from the site, it sometimes contains special ASCII characters that are not in the ISO8859 (Windows filesystem) character set.
For example, this is the line that I have to parse:
<td style='padding-left: 6px;' class='b2'><a href='/Prison_Break/episodes/569183/03x01'>OrientaciÃ³n</a></td>
When viewing it in a browser, it is:
<td style="padding-left: 6px;" class="b2"><a href="/Prison_Break/episodes/569183/03x01">Orientación</a></td>
Notice that the o in the title has an accent on it. While researching this problem I stumbled across 'HTML Entities to ISO 8859-1 Converter' at http://www.inweb.de/chetan/English/Resources/Java/HTML%202%20ISO.html. This open source project takes in an HTML entity like &amp; and returns '&'.
So that is not quite what I want, as my BufferedReader is converting the HTML entity into the ASCII representation already. I need a way of detecting a non-ISO8859 character within an ASCII string, and hopefully replacing it with its natural 'equivalent' (which would be o in this case).
Does anyone know how I could do it without having to check for every special char and replacing (not really an option unless someone has done it before!!)
If not that then, perhaps another way to attack the problem?
Any help greatly appreciated ;)
Dave

Hi,
NZ_Dave wrote:
For example, this is the line that I have to parse:
<td style='padding-left: 6px;' class='b2'><a href='/Prison_Break/episodes/569183/03x01'>OrientaciÃ³n</a></td>
This is coded in UTF-8. If you convert the bytes to a String using the UTF-8 encoding, then you will have the correct characters "Orientación" in the string.
Check your parser where it converts the bytes (coming from e.g. an InputStream) to characters. Use UTF-8 as the charset when doing that conversion.
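
The same point in runnable form: the bytes coming off the wire are identical in both cases, and only the charset handed to the reader differs. This Python sketch uses `io.TextIOWrapper` as a stand-in for the Java reader. The helper name and the shortened sample markup are made up for the illustration.

```python
import io

# The bytes a server would send for this fragment when the page is UTF-8.
raw = "<a href='/x'>Orientación</a>".encode("utf-8")


def read_text(stream: io.BytesIO, charset: str) -> str:
    """Wrap a byte stream in a text reader that decodes with the given charset."""
    return io.TextIOWrapper(stream, encoding=charset).read()


if __name__ == "__main__":
    # Wrong charset: the two bytes of "ó" come out as two separate characters.
    print(read_text(io.BytesIO(raw), "latin-1"))
    # Right charset: the accented letter is decoded as one character.
    print(read_text(io.BytesIO(raw), "utf-8"))
```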

How can I use ASCII Characters on the iPad?

I have an iPad (1) and use iOS 4.3.5. Is there an easier way of using ASCII characters than having to use the copy-and-paste procedure? If not, can we expect to see the feature of entering the ASCII code straight through the standard keyboard soon?

Norisouro wrote:
Is there an easier way of using ASCII characters than having to use the copy-and-paste procedure?
You need to give some details about what it is you want to do, because ASCII characters are what are already on the keyboard. It is non-ASCII characters that you might need to copy and paste.
http://en.wikipedia.org/wiki/ASCII
I think you can be sure that Apple is never going to include a feature in iOS that has you input special characters by typing in numbers like Windows does. Macs have always used a different approach.
