Problem crawling filenames with national characters

Hi
I have a big problem with filenames containing national (Danish) characters.
The documents get an entry in wk$url but have error code 404 (Not found).
I'm running Oracle RDBMS 9.2.0.1 on Red Hat Advanced Server 2.1. The
filesystem is mounted on the Oracle server using NFS.
I configured UltraSearch to crawl the specific directory containing
several files, two of which contain national characters in their
filenames (ls -l):
<..>
-rw-rw-r-- 1 user group 13 Oct 4 13:36 crawlertest_linux_2_æøåÆØÅ.txt
-rw-rw-r-- 1 user group 19968 Oct 4 13:36 crawlertest_windows_æøåÆØÅ.doc
<..>
(Since the preview function is not working in my Mozilla browser, I'm
unable to tell whether the national characters will display properly in
this post, but they represent the lower- and upper-case forms of the
three special Danish characters.)
In the crawler log the following entries are added:
<..>
file://localhost/<DIR_PATH>/crawlertest_linux_2_B|C?C%C?C?.txt
file://localhost/<DIR_PATH>/crawlertest_linux_2_B|C?C%C?C?.txt
Processing file://localhost/<DIR_PATH>/crawlertest_linux_2_%e6%f8%e5%c6%d8%c5.txt
WKG-30008: file://localhost/<DIR_PATH>/crawlertest_linux_2_%e6%f8%e5%c6%d8%c5.txt: Not found
<..>
file://localhost/<DIR_PATH>/crawlertest_windows_B|C?C%C?C?.doc
file://localhost/<DIR_PATH>/crawlertest_windows_B|C?C%C?C?.doc
Processing file://localhost/<DIR_PATH>/crawlertest_windows_%e6%f8%e5%c6%d8%c5.doc
WKG-30008: file://localhost/<DIR_PATH>/crawlertest_windows_%e6%f8%e5%c6%d8%c5.doc: Not found
<..>
The 'file://' entries look somewhat UTF-8 encoded to me (some characters
are missing because they are not printable), and the others look URL
encoded.
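(Editorial aside, not from the original post: the two escapings can be shown side by side with java.net.URLEncoder; the Latin-1 form matches the %e6%f8%e5%c6%d8%c5 sequence in the log, while UTF-8 produces two bytes per character.)

import java.net.URLEncoder;

public class EncDemo {
    public static void main(String[] args) throws Exception {
        String name = "\u00e6\u00f8\u00e5\u00c6\u00d8\u00c5";  // æøåÆØÅ
        // Latin-1 escaping: one byte per character -> %E6%F8%E5%C6%D8%C5
        System.out.println(URLEncoder.encode(name, "ISO-8859-1"));
        // UTF-8 escaping: two bytes per character -> %C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85
        System.out.println(URLEncoder.encode(name, "UTF-8"));
    }
}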
All the other files in the directory seem to process just fine.
In the wk$url table the following entries are added:
(select status, url from wk$url where url like '%crawlertest%';)
404 file://localhost/<DIR_PATH>/crawlertest_linux_2_%e6%f8%e5%c6%d8%c5.txt
404 file://localhost/<DIR_PATH>/crawlertest_windows_%e6%f8%e5%c6%d8%c5.doc
Just for testing purposes, a
SELECT utl_url.unescape('%e6%f8%e5%c6%d8%c5') from dual;
actually produces the expected result: æøåÆØÅ
To me this indicates that the filesystem-scanning part of the crawler
can see the files, but the processing part of the crawler cannot open
the files for reading, and it therefore fails with error 404.
Since the crawler (to my knowledge) is written in Java, I did some
experiments with the following Java program:
import java.io.*;

class filetest {
    public static void main(String[] args) {
        try {
            String dirname = "<DIR_PATH>";
            File dir = new File(dirname);
            File[] fs = dir.listFiles();
            for (int idx = 0; idx < fs.length; idx++) {
                if (fs[idx].canRead()) {
                    System.out.print("Can Read: ");
                } else {
                    System.out.print("Can NOT Read: ");
                }
                // filenames the JVM cannot decode show up here mangled
                System.out.println(fs[idx]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The behavior of this program depends heavily on the locale settings of
the current shell (under Linux). If LC_ALL is set to "C" (which is a
common default), the program can only read files whose names do NOT
contain national characters (just like the UltraSearch crawler). If
LC_ALL is set to e.g. "en_US", it is capable of reading all the files.
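(Editorial aside: the locale sensitivity can be observed from Java itself. The JVM derives its default charset from the locale at startup, and on Sun JVMs the undocumented sun.jnu.encoding property is, as far as I know, the one used for decoding filenames. A minimal sketch, assuming a Sun/Oracle JVM:

public class EncodingCheck {
    public static void main(String[] args) {
        // default charset, derived from the locale at JVM startup
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        // charset used for file paths on Sun JVMs (undocumented)
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
    }
}

Running it as LC_ALL=C java EncodingCheck and then as LC_ALL=en_US java EncodingCheck should print different values.)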
I therefore tried to set the LC_ALL environment variable for the oracle
user on my Oracle server (using locale_config and .bash_profile), but
that did not seem to fix the problem at hand.
So (finally) my question is: is this a bug in the UltraSearch crawler,
or simply a misconfiguration of my execution environment? If the
latter, how do I configure my system correctly?
Yours sincerely
Martin Dahl Pedersen, Visanti ( mdp at visanti dot com )

I've posted my problem as a TAR on Metalink about a week ago,
and it turns out to be a new bug in UltraSearch.
It is now filed under BUG:2673282
-- mdp

Similar Messages

  • How to send Oracle rowid to servlet? | Problem with national characters.

    Is there some way to send an Oracle rowid to a servlet?
    I currently have a definition like this:
    <af:image source="/imageservlet?Par1=#{bindings.Col1.inputValue}"/>
    But if the column contains national characters, the servlet receives the characters changed.
    My idea is to use the Oracle rowid for the row instead of the primary key. Is that possible?
    Use something like this:
    <af:image source="/imageservlet?Rowid=#{bindings.Rowid}"/
    Or Do you have ideas how to solve problem with national characters ?
    Thanks
    FiL

    Hi,
    Although your workaround works, I think this is a simple encoding problem.
    You simply need to make sure all parameters and pages are encoded with a character set which contains the national characters you mentioned.
    This is a bit dependent on the exact technology you're using, but most of it can be done via web.xml:
      <jsp-config>
          <jsp-property-group>
              <url-pattern>*.jsp</url-pattern>
              <page-encoding>UTF-8</page-encoding>
          </jsp-property-group>
      </jsp-config>
    This forces all JSP pages to be encoded in UTF-8.
    You said you're using a servlet, so your servlet needs a similar block for its URL pattern.
    Adding the following parameter sometimes helps as well, although I think this one is a bit dated:
      <context-param>
        <param-name>PARAMETER_ENCODING</param-name>
        <param-value>UTF-8</param-value>
      </context-param>
    If you want to be 100% sure the encoding is set right, make sure the pages contain:
    <%@ page contentType="text/html;charset=utf-8"%>
    Or, depending on your view technology, the syntax can be a bit different.
    -Anton
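    (Editorial sketch: a common companion to these settings is a servlet filter that forces the request encoding before any parameter is read; the class name and the choice of UTF-8 are illustrative, not from Anton's reply.)

    import java.io.IOException;
    import javax.servlet.*;

    public class EncodingFilter implements Filter {
        public void init(FilterConfig cfg) {}

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            // must run before the first getParameter() call, or it has no effect
            req.setCharacterEncoding("UTF-8");
            chain.doFilter(req, res);
        }

        public void destroy() {}
    }

    The filter is then mapped in web.xml with a <filter> and <filter-mapping> entry for the servlet's URL pattern.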

  • Problem with national characters on windows client

    Hello there,
    I'm having a problem with national characters on a Windows client.
    All national data is stored in NVARCHAR2 columns, and applications (.NET) work fine,
    but in sqlplus:
    select city from test_table;
    - everything ok, sqlplus shows national characters
    select dump(N'<national symbols>') from dual
    - returns
    Typ=96 Len=12: 0,191,0,191,0,191,0,191,0,191,0,191
    select * from test_table where city = N'<national symbols> '
    - always returns nothing
    As I understand it, the problem is in the conversion of the SQL query
    text (and national literals) to the server's WE8ISO8859P1 encoding.
    Is it possible to solve the issue?
    Thanks in advance
    PS.
    The console is in the right code page (chcp 1251),
    and sqlplus shows Russian messages well.
    Server (oracle 9 on solaris):
    select * from nls_database_parameters
    NLS_NCHAR_CHARACTERSET AL16UTF16
    NLS_SAVED_NCHAR_CS WE8ISO8859P1
    NLS_LANGUAGE AMERICAN
    NLS_TERRITORY AMERICA
    NLS_CURRENCY $
    NLS_ISO_CURRENCY AMERICA
    NLS_NUMERIC_CHARACTERS .,
    NLS_CHARACTERSET WE8ISO8859P1
    NLS_CALENDAR GREGORIAN
    NLS_DATE_FORMAT DD-MON-RR
    NLS_DATE_LANGUAGE AMERICAN
    NLS_SORT BINARY
    NLS_TIME_FORMAT HH.MI.SSXFF AM
    NLS_TIMESTAMP_FORMAT DD-MON-RR HH.MI.SSXFF AM
    NLS_TIME_TZ_FORMAT HH.MI.SSXFF AM TZH:TZM
    NLS_TIMESTAMP_TZ_FORMAT DD-MON-RR HH.MI.SSXFF AM TZH:TZM
    NLS_DUAL_CURRENCY $
    NLS_COMP BINARY
    NLS_LENGTH_SEMANTICS BYTE
    NLS_NCHAR_CONV_EXCP FALSE
    NLS_RDBMS_VERSION 9.2.0.6.0
    Client (windows server 2003, oracle client 10):
    NLS_LANG = RUSSIAN_CIS.CL8MSWIN1251

    N'<national symbols>', being part of an SQL statement, will be converted to the database character set (WE8ISO8859P1) before being parsed. Only if the client and the database are both 10.2 or higher can the client encode the literal appropriately so that it survives this conversion.
    In earlier versions, you can do the encoding yourself. Instead of the N'<national symbols>' literal use the UNISTR function: UNISTR('\xxxx\yyyy\zzzz'), where U+xxxx, U+yyyy, U+zzzz are Unicode code points of your national characters.
    -- Sergiusz
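    (Editorial sketch illustrating Sergiusz's suggestion: a hypothetical Java helper, name invented, that builds such a UNISTR literal from a string.)

    public class UnistrHelper {
        static String toUnistrLiteral(String s) {
            StringBuilder sb = new StringBuilder("UNISTR('");
            for (int i = 0; i < s.length(); i++) {
                // one \xxxx escape per UTF-16 code unit
                sb.append(String.format("\\%04x", (int) s.charAt(i)));
            }
            return sb.append("')").toString();
        }

        public static void main(String[] args) {
            // U+044F is Cyrillic 'я'; prints: UNISTR('\044f')
            System.out.println(toUnistrLiteral("\u044f"));
        }
    }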

  • Filenames with unicode characters

    Hello, I have a question regarding filenames with Unicode characters on an Arabic Windows XP.
    I have a string which the user entered, and I want to create a file with this string as its filename. So my question:
    which Unicode characters are allowed in a filename? I know that on a German Windows " / \ * ? < > : | are not allowed from the ASCII set, but which Unicode characters are allowed? Is this language-dependent on Windows? Maybe there exists a method which checks a string for the allowed characters.
    Thanks for your help

    AFAIK the illegal characters are always the same (you listed them already), and as long as the filesystem supports it (read: you use NTFS and not FAT) you may use any other Unicode character.
    You might have trouble displaying those characters, though, if you don't happen to have the correct fonts installed, but that would only be a cosmetic issue.
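    (Editorial sketch: a simple validity check against the reserved set discussed above; the class and method names are illustrative.)

    public class FilenameCheck {
        private static final String RESERVED = "\\/:*?\"<>|";

        static boolean isValid(String name) {
            if (name.length() == 0) return false;
            for (int i = 0; i < name.length(); i++) {
                if (RESERVED.indexOf(name.charAt(i)) >= 0) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(isValid("\u3053\u3093\u306b\u3061\u306f.txt")); // true: Unicode is fine on NTFS
            System.out.println(isValid("a:b.txt"));                           // false: ':' is reserved
        }
    }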

  • Problems with Filenames with Chinese Characters

    I seem to have problems with filenames containing 2-byte characters like Chinese, Japanese and Korean in various apps. The problems occur when I download Chinese music and import it into iTunes:
    1. Torrent (and all other) files with 2-byte characters in the filename become junk characters in Finder after download from Safari.
    2. Music files created/downloaded by uTorrent display 2-byte characters correctly in Finder.
    3. These downloaded music files become junk characters when imported into iTunes.
    4. Say I want a list of all filenames from (2). I do an ls -R > abc.txt in Terminal. abc.txt displays the filenames correctly in vi; however, they become junk characters again when viewed in TextEdit.
    I've selected Chinese, Japanese and Korean in the International settings. This is not about input methods; I just want to get the filenames correct in Finder and iTunes. Any advice is appreciated.

    3. These downloaded music files become junk characters when imported into iTunes.
    The ID3 tags need to be in Unicode.
    http://homepage.mac.com/thgewecke/mlingos9.html#itunes
    4. They become junk characters again when viewed in TextEdit.
    TextEdit needs to be set to the correct encoding; it's not automatic.

  • Problem figuring out the encoding for filenames with special characters

    I'm not sure if this is the right forum, but this does seem like an OS issue.
    I brought in a lot of mp3 and m3u files from a Windows machine to my new Mac. Some of the mp3 files have accented characters in their names, and these names appear in the m3u files. But if I add an m3u file to iTunes, it fails to recognize these names, and so I lose all the mp3s with special characters in their names.
    I tried to fix this by grabbing the file names in Python, but that didn't work either!
    Here's an example: the file's name is "Voilà l'été.mp3".
    The m3u file says "Voil\xe0 l'\xe9t\xe9.mp3" -- this doesn't work.
    From os.listdir(), I get "Voila\xcc\x80 l'e\xcc\x81te\xcc\x81.mp3", but sticking that in an m3u file doesn't work either. (Note that here the characters are encoded as an unaccented letter plus a two-byte code for the accent.)
    When I try these strings from Python, e.g. doing os.stat(), they both work; but iTunes doesn't understand either of them!
    I'd appreciate any hints on how to enter these names in the m3u file so that iTunes can read it. Thanks!

    I know nothing about "m3u" files and how iTunes interprets the file names in them, but if it is not a relative/absolute path problem, then how about just putting the raw file names (not the ones with backslash escapes) in the m3u file? For example, just put
    Voilà l'été.mp3
    in the m3u?
    As for the Unicode encoding, the HFS+ file system uses the "decomposed form" for accented characters. This means, as you write, that à is hex "61 cc 80" in UTF-8, i.e., "a" + COMBINING GRAVE ACCENT. The pre-composed form is hex "c3 a0". But my experience is that in most cases both the pre-composed and decomposed forms work at the user level (not at the lowest file system level).
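    (Editorial sketch: the two forms can be converted with java.text.Normalizer, available since Java 6; NFD is the decomposed form HFS+ uses, NFC the pre-composed one.)

    import java.text.Normalizer;

    public class NormalizeDemo {
        public static void main(String[] args) {
            String decomposed = "a\u0300";  // "a" + COMBINING GRAVE ACCENT -> UTF-8: 61 cc 80
            String precomposed = "\u00e0";  // à as a single code point     -> UTF-8: c3 a0
            System.out.println(decomposed.equals(precomposed));  // false: different code units
            // composing the decomposed form yields the pre-composed one
            System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                                         .equals(precomposed));  // true
        }
    }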

  • Bash script to trim all filenames with special characters recursively?

    Hi,
    I have a 30 GB directory full of data I recovered from a friend's laptop after her Windows XP crashed. I'd like to burn that data, but I can't, because many of the filenames contain weird characters (spaces, accents, things even worse that my XTerm displays as inverted question marks), so mkisofs exits with an error.
    I'd like to clean that mess up, but it would take months to do manually. Well, I only know very little Bash, and I think this problem is already too heavy for my modest knowledge. Here's the problem:
    - check the contents of directory ~/backup recursively
    - find files whose names contain characters other than [A-Za-z0-9], and then maybe "-", "_" and "."
    - replace these characters either with an "_" or just erase them
    Now how would I translate that into a little Bash script?
    Cheers...

    Heyyyyy... nice idea
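    (The thread never got an actual script. As an editorial sketch of the renaming logic described above, written in Java for consistency with the other examples in this collection; class and method names are invented.)

    import java.io.File;

    public class CleanNames {
        // recursively rename entries under dir, replacing every character
        // outside [A-Za-z0-9._-] with '_' (the rule stated in the question)
        static void clean(File dir) {
            File[] entries = dir.listFiles();
            if (entries == null) return;
            for (File f : entries) {
                if (f.isDirectory()) clean(f);
                String safe = f.getName().replaceAll("[^A-Za-z0-9._-]", "_");
                // note: does not guard against name collisions after cleaning
                if (!safe.equals(f.getName())) {
                    f.renameTo(new File(f.getParentFile(), safe));
                }
            }
        }

        public static void main(String[] args) {
            clean(new File(args[0]));  // e.g. java CleanNames /home/user/backup
        }
    }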

  • Copying filenames with invalid characters in OSX

    I'd like to copy some folders for backup to an external hard disk from the internal hard disk on my G4. The problem is that some of the folders contain files that were created in OS 9, and their filenames contain "/" symbols, which apparently are not acceptable in OS X (10.3.9), which I am now running. After trying to drag the folders containing the files to the external hard disk, I get an error message saying that they can't be copied because some of the filenames contain invalid characters. The problem files are scattered among files in many folders all over my G4's internal hard disk, and the task of finding and manually changing each one's filename is daunting. Is there a way to find and replace "/" with ".", or some other more automated fix of the invalid characters? I'd prefer to keep the original files in their current folders on the G4. Thanks
    G4   Mac OS X (10.3.9)  

    Does any of this software do what you want?

  • Filenames with strange characters

    (With apologies if this is a previously answered question.)
    One of my Java programs copies users' home areas around our network. I'm getting FileNotFoundExceptions with some of the users' files, specifically those which use extended character sets (we have many international students who use special programs to create files with e.g. Chinese characters). I'm at a loss as to how to diagnose this problem and would appreciate any advice. The filenames are retrieved as the result of a listFiles(), and the copy method uses BufferedInput/OutputStreams.
    David R.

    Your problem must be with the names of the files rather than with the content. I did a few experiments and found that Java (I'm using SDK 1.3 on Windows 2000) can't open files with Cyrillic characters in the file name, either for input or for output. It throws a FileNotFoundException, as you said. There shouldn't be any problem with Chinese content if you're using input and output streams.
    So the problem is diagnosed. There are several bug reports that sound like this. I don't know if it has been fixed in SDK 1.4.

  • Problem with national characters calling System.out.print(...)

    I need to develop an application printing Spanish characters like "ç" (ce trencada) and other accented letters.
    The problem is that when I type
    System.out.print("ç") in my application, I get a "plus-minus" symbol when executing it.
    Can anyone help me, please?
    Thank you in advance.

    Check the list of fonts available. You can use the following code for the purpose:
      import java.awt.GraphicsEnvironment;

      public static void main(String[] args) {
          String[] fonts = getFontNames();
          for (int i = 0; i < fonts.length; i++) {
              System.out.println(fonts[i]);  // print each family name, not the array
          }
      }

      public static String[] getFontNames() {
          GraphicsEnvironment ge = GraphicsEnvironment.getLocalGraphicsEnvironment();
          return ge.getAvailableFontFamilyNames();
      }
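    (Editorial note: the "plus-minus" symptom can also come from a console code-page mismatch rather than a missing font, since System.out encodes text with the platform default charset. A sketch forcing an explicit encoding on the output stream; the code page name is only an assumption and depends on the console.)

    import java.io.PrintStream;

    public class ConsoleEnc {
        public static void main(String[] args) throws Exception {
            PrintStream out = new PrintStream(System.out, true, "Cp850");
            out.println("\u00e7");  // ç by code point, independent of source-file encoding
        }
    }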

  • Filenames with non-latin characters aren't found by the filesystem [S]

    This might be a bug, but I'm hoping it's just a config file problem.
    I have a few files here and there on my NTFS drive that have Japanese characters in their filenames.  Sometime recently (I don't have an exact date when they disappeared), they stopped showing up at all.  If I browse to a folder that used to contain filenames with Japanese characters, it just appears empty in Gnome.  Using ls from a terminal also says the directory is empty.  They used to work just fine, but a recent upgrade must have broken them.
    Does anyone have any ideas what I can do to get my files to appear again?  Is there some way to enable unicode support for filenames or something?
    Many thanks!
    Edit: Rebooting the system fixed it, though I still think that was a pretty strange problem.  Any ideas what was up?
    Last edited by ColdPie (2007-11-11 02:07:11)

    The funny thing is that the bold font [when a message is unread in the message list] shows OK, i.e. in Greek, but when I click on an unread message it is assumed to have been read, so it changes over to medium [non-bold] and the encoding changes as well into one that is not Greek and thus unreadable. In ~/.sylpheed/sylpheedrc the fonts are:
    widget_font=
    message_font=-microsoft-sylfaenarm-medium-r-normal-*-*-160-*-*-p-*-iso8859-7
    normal_font=-monotype-arial-medium-r-normal-*-12-*-*-*-*-*-iso8859-7
    bold_font=-monotype-arial-bold-r-normal-*-12-*-*-*-*-*-iso8859-7
    small_font=-monotype-arial-medium-r-normal-*-12-*-*-*-*-*-iso8859-7
    In /etc/gtk, for GTK 1.2 apps the file referring to the Greek encoding [el] seems to be fine [exactly the same as in Slackware 9.1].

  • Error while crawling URL containing diacritic characters

    Hi,
    I have a content source in SharePoint 2013 that is showing errors while trying to crawl links with diacritic characters (Portuguese words). The reason is that the crawler regards the URL as invalid.
    The problem still occurs if the link URL is percent-encoded (see example 2).
    Examples:
    1) Atualização 037 de 16-4-2008.htm
    2) Atualiza%E7%E3o%20037%20de%2016-4-2008.htm
    Log message:
    The item could not be accessed on the remote server because its address has an invalid syntax.
    I already tried to save the home page (which contains the links) as UTF-8, UTF-8 without BOM, and ANSI.
    Also, I tried to include a meta charset tag:
    <meta charset="UTF-8">
    in addition to the first line with:
    <?xml version="1.0" encoding="UTF-8"?>
    All attempts were unsuccessful. Has anyone found a solution to this problem?
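    (Editorial aside: example 2 above is escaped with Latin-1 byte values (%E7 for ç, %E3 for ã), while RFC 3986 expects percent-escapes based on UTF-8, which may be what the crawler requires. A sketch of the difference using java.net.URI, whose multi-argument constructor percent-escapes illegal characters as UTF-8.)

    import java.net.URI;

    public class EscapeDemo {
        public static void main(String[] args) throws Exception {
            String name = "Atualiza\u00e7\u00e3o 037 de 16-4-2008.htm";
            // prints /Atualiza%C3%A7%C3%A3o%20037%20de%2016-4-2008.htm
            // (contrast with the Latin-1-based %E7%E3 form in example 2)
            URI uri = new URI(null, null, "/" + name, null);
            System.out.println(uri.toASCIIString());
        }
    }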

    Hi,
    Just checking in to see if the information was helpful. Please let us know if you would like further assistance.
    Have a great day!
    Best Regards,
    Lisa Chen
    TechNet Community Support

  • Losing NATIONAL CHARACTERS (blob->clob->table). unistr?

    Hello!
    I have a problem with national characters. My example is as follows:
    1. A csv file is uploaded from disk to htmldb_application_files
    2. This BLOB is then converted to CLOB with dbms_lob.converttoclob()
    3. Data from this CLOB is copied to PL/SQL array.
    4. From PL/SQL array to table in database.
    The problem: either the data copied to the table in the database loses the national characters (strange characters are displayed instead of the national ones), or, if I set my national character set ID as an argument of the dbms_lob.converttoclob() function, I get an error saying that the file is inconvertible.
    What is wrong? How can I solve my problem? Can unistr() help somewhere? Any ideas?
    Tom

    Duplicate posting, being addressed at:
    losing NATIONAL CHARACTERS(blob->clob->table). unistr?

  • File adapter, File encoding national characters

    Hi,
    I have a problem with national characters (ÅÄÖ) when sending files with the file adapter (receiver adapter).
    When I specify Transfer Mode = Binary and File Type = Binary, everything works fine, but when I use Transfer Mode = Text, the national characters get converted to "?". I have tried setting File Type = Text and tried File Encoding with UTF-8 and ISO-8859-1, without success.
    Please help!
    Regards
    Claes

    Hi,
    Check this out: "How To… Work with Character Encodings in Process Integration" (http://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/502991a2-45d9-2910-d99f-8aba5d79fb42)
    Regards,
    Jakub

  • National characters and new Java API

    Hi All,
    I'm looking for your experience with the new Java API and national characters (like ü, ś, ć, etc.). The problem is that when a record is updated using MDM Data Manager and then retrieved using the new Java API, the national characters are invalid (in the Java string the national characters are represented incorrectly).
    It's strange, due to the fact that when I create or update the record from the Java API it looks fine. A second finding is that the old Java API (MDM4J) works fine on text fields with national characters.
    Maybe I forgot to set something in the server configuration / repository / Java API connection - any help appreciated...
    Regards, Marcin

    While retrieving data via the Java API 2,
    you should set the Unicode Normalization after the user session is authenticated.
    I guess this is available in SP5 patch.
    The documentation for this is available at
    https://help.sap.com/javadocs/MDM/current/index.html
    Package: com.sap.mdm.commands
    SetUnicodeNormalizationCommand cmd = new SetUnicodeNormalizationCommand(connectionAccessor);
    cmd.setSession(userSession);
    cmd.setNormalizationType(SetUnicodeNormalizationCommand.NORMALIZATION_COMPOSED);
    cmd.execute();
    This command is used to set the Unicode normalization. It applies for the lifetime of the session and should be set after the session is authenticated.
    Unicode normalization matters when a text string can be represented differently depending on the normalization used. The MDM server always stores text strings in one normalization format. A user providing a text string to the MDM server who later tries to retrieve the same text string back might get it in a different normalization. To resolve this issue, the user can use this class to specify the normalization they want to work with. The MDM server will then always return text strings in the normalization specified by this class.
