Converting non-ASCII characters generated by MS Word

I've encountered some files that were originally exported from MS Word as HTML. The problem is that they contain characters in the 128 to 255 range. Some appear to be fancy quotes and apostrophes, but others I just can't figure out. On a Mac, or in Firefox on Windows, they appear as:
Ö ë í ì î ñ ô † © Æ ∑ ∆ “ ÷ › · Î Ï Ì Ó Ô Ò Ù
The decimal values of the above chars are:
133 145 146 147 148 150 153 160 169 174 183 198 210 214 221 225 235 236 237 238 239 241 244
As character entities they appear as:
… ‘ ’ “ ” – ™ © ® · Æ Ò Ö Ý á ë ì í î ï ñ ô
Before I try to reinvent a square wheel, I thought I'd ask here if anyone knows of an existing command line tool that might help with this.

Thanks for all the replies. I think I've solved the problem. It indeed was a problem with high bit WinLatin1 (cp 1252) characters. Here's a technote that discusses the problem. So I wrote a short perl script based on this table:
<pre style="overflow: auto;font-size:small; font-family: Monaco, 'Courier New', Courier, monospace; color: #222; background: #ddd; padding: .3em .8em .3em .8em; font-size: 10px;">#!/usr/bin/perl -wpi
# Unicode code points for the cp1252 bytes 128 through 159.
# Undefined characters are marked as 0.
my @uni = (
8364, 0, 8218, 402, 8222, 8230, 8224, 8225,
710, 8240, 352, 8249, 338, 0, 381, 0,
0, 8216, 8217, 8220, 8221, 8226, 8211, 8212,
732, 8482, 353, 8250, 339, 0, 382, 376
);
# Bytes 128 through 159 map to scattered Unicode code points,
# so look them up in the @uni array. Undefined characters in this range are deleted.
s/([\x80-\x9f])/ $uni[ord($1)-128] ? sprintf("&#%d;", $uni[ord($1)-128]) : ""/eg;
# Characters 160 through 255 can be used as is.
s/([\xa0-\xff])/sprintf("&#%d;", ord($1))/eg;
</pre>I only hope that perl is clever enough to not create the @uni array for each line. Anyone happen to know?
Thanks for any tips.
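For comparison, the same cp1252-to-entity conversion can be sketched in Python, whose built-in cp1252 codec already carries the lookup table. This is only a rough sketch, not the script above: the function name cp1252_to_ncr is made up for illustration, and undefined bytes are dropped to mirror what the Perl version does.

```python
def cp1252_to_ncr(raw: bytes) -> str:
    """Turn cp1252 bytes into ASCII text with decimal character references."""
    # errors="ignore" drops the five bytes that Python's cp1252 codec leaves
    # undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), like the 0 entries in the table.
    text = raw.decode("cp1252", errors="ignore")
    # xmlcharrefreplace rewrites every non-ASCII character as &#NNNN;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # prints: &#8220;quoted&#8221; &#8230; caf&#233;
    print(cp1252_to_ncr(b"\x93quoted\x94 \x85 caf\xe9"))
```

Because the whole input is decoded in one call, there is no per-line table setup to worry about in this variant.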

Similar Messages

  • Non ascii characters being sent from a parameter in a form

    I have seen many topics posted on passing non ascii characters through parameters from one servlet to another and converting them into whatever format is necessary.
    However, I have not seen anyone answer the following question. I have a JSP page (HTML) with the character encoding set to UTF-8. The user enters some data into a text field inside a form. The data may contain non-ASCII characters such as Hebrew or Arabic. The form is then sent to another JSP, where I try to retrieve the data from the text field. No matter what I do, I cannot get the data displayed correctly: it comes out either as question marks or as other weird symbols.
    I have tried every permutation of encoding for the HTML page itself, for the string returned by request.getParameter, and so on, but the text is still not displayed correctly on the new HTML page.
    Can anyone help?

    Ok, I solved the problem.
    I had to call request.setCharacterEncoding("utf-8"); at the top of the receiving page, before reading any parameters.
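Why the fix works can be illustrated outside of Java. The sketch below uses Python's urllib.parse.parse_qs on a hypothetical form body (the field name "name" and its value are made up) to show that the same percent-encoded bytes come out right when decoded as UTF-8 and as mojibake when decoded as Latin-1, which is the kind of mismatch the setCharacterEncoding call avoids.

```python
from urllib.parse import parse_qs

# Hypothetical form body: the field "name" holds a four-letter Hebrew word,
# percent-encoded as UTF-8 bytes the way a UTF-8 page would submit it.
BODY = "name=%D7%A9%D7%9C%D7%95%D7%9D"


def read_name(body: str, charset: str) -> str:
    """Decode the 'name' field of a form body using the given charset."""
    return parse_qs(body, encoding=charset)["name"][0]


if __name__ == "__main__":
    print(read_name(BODY, "utf-8"))          # the original Hebrew text
    print(repr(read_name(BODY, "latin-1")))  # eight unrelated Latin-1 characters
```

The bytes on the wire are identical in both calls; only the charset the receiver assumes differs.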

  • Replacing non-ASCII characters with HTML character references

    Hi All,
    In Oracle 10g or greater is there a built-in function that will convert a string with non-ASCII characters like this
    a b č 뮼
    into an ASCII string with HTML character references like this?
    a b & # x 0 1 0 D ; & # x B B B C ;
    (note I had to include spaces between each character in the sample code for message to prevent the forum software from converting my text)
    I tried using
    utl_i18n.escape_reference( val, 'us7ascii' )
    but for some reason it returns
    a b c & # x B B B C ;
    Note how it converted the Western European character "č" to its unaccented counterpart "c", not "& # x 0 1 0 D ;" (is this a bug?).
    I also tried a custom solution using regexp_replace and asciistr (which I can't include here because the forum software chokes on it) but it only returns the correct result for values <=4000 characters long. Unfortunately asciistr doesn't appear to accept CLOB values larger than 4000 characters. It returns an error message like
    (ORA-22835: Buffer too small for CLOB to CHAR or BLOB to RAW conversion (actual: 30251, maximum: 4000) ).
    I'm looking for a solution that works on CLOB data of any size.
    Thanks in advance for any insight you can provide.
    Joe Fuda

    So with that (UTF8) in mind, let's take another look.....
    As shown below, I used a AL32UTF8 database.
    Note: I did not use a unicode capable tool for querying. So I set console mode code page to 1250 just to have č displayed properly (instead of posing as an è).
    Also, as a result of using windows-1250 for client character set, in the val column and in the second select's ncr column (iso8859-1), è (00e8) has been replaced with e through character set conversion going from server back to client.
    Running the same code on a database with a db character set such as we8mswin1252, that doesn't define the č (latin small c with caron) character, would yield results with a c in the ncr column.
    C:\>chcp 1250
    Aktuell teckentabell: 1250
    C:\>set nls_lang=.ee8mswin1250
    C:\>sqlplus test/test
    SQL*Plus: Release - Production on Fri May 23 21:25:29 2008
    Copyright (c) 1982, 2007, Oracle.  All rights reserved.
    Connected to:
    Oracle Database 11g Enterprise Edition Release - Production
    With the OLAP option
    SQL> select * from nls_database_parameters where parameter like '%CHARACTERSET';
    PARAMETER              VALUE
    SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'us7ascii') NCR from dual;
    VAL  NCR
    č e  c e
    SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'we8iso8859p1') NCR from dual;
    VAL  NCR
    č e  &# x10d; e     <- "è"
    SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'ee8iso8859p2') NCR from dual;
    VAL  NCR
    č e  č &# xe8;
    SQL> select unistr('\010d \00e8') val, utl_i18n.escape_reference(unistr('\010d \00e8'),'cl8iso8859p5') NCR from dual;
    VAL  NCR
    č e  &# x10d; &# xe8;
    In the US7ASCII case, where it should be possible for all non-ASCII characters to be escaped, it seems as if the actual escape step is skipped.
    Hope this helps you judge whether utl_i18n is usable in your case.
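If doing the conversion outside the database is an option, the output the question asks for is easy to produce in a scripting language. The following is only a sketch in Python, not an Oracle solution; the function name to_hex_ncr is made up for illustration.

```python
def to_hex_ncr(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal character reference."""
    # ASCII characters pass through unchanged; everything else becomes
    # &#xHHHH; with at least four uppercase hex digits.
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )


if __name__ == "__main__":
    # prints: a b &#x010D; &#xBBBC;
    print(to_hex_ncr("a b \u010d \ubbbc"))
```

Since it works character by character on an in-memory string, it has no 4000-character limit of its own.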

  • Cannot rename file with non-ASCII characters when using the

    My application moves files from one directory to another by calling File[] srcFiles = srcDir.listFiles() to get a list of files in the source directory, and then calling srcFiles[index].renameTo(destFile) to rename each file.
    This does not work (renameTo returns false and the file is not moved) under the following circumstances:
    - the file's leaf name contains non-ASCII characters, for example "�"
    - the OS is Solaris 9
    - the LANG and LC_* environment variables are unset, i.e. the C locale is being used
    If I set the LANG environment variable to, for example, en_GB.UTF-8 then the rename succeeds.
    I have tried calling srcFiles[index].getName().getBytes("UTF-8") and the non-ASCII characters are being replaced with ? (0x3f) characters when LANG is unset.
    Is this a bug in the JRE? I would argue that since my code does not actually manipulate the filename (I just use the File object that File.listFiles() gives me) then the rename should succeed. Of course I would not expect the file name to be displayed correctly if I printed it out.
    I have reproduced this behaviour with JDK 1.4.2_05 and 1.5.0_04 on Solaris 9.

    Thanks for the info Alan.
    I considered setting the locale in the environment (this sounds like the "correct" fix to me and we might implement it later), but this application shares a WebLogic server with many other applications so we would have to do a huge amount of testing to make sure that the locale change wouldn't break the other apps. In the end I worked around the problem by making the code that generates the filenames in the first place strip out any non-ASCII characters (the names of the files are not critically important).
    Looking forward to JSR-203, in the meantime perhaps a note about this behaviour in the javadoc would be useful.
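A plausible model of what happens here, offered as an assumption rather than a statement about the JRE's internals: under the C locale the file name bytes are decoded as ASCII, the non-ASCII bytes are replaced, and the name that comes back out no longer matches the bytes on disk. The Python sketch below reproduces that lossy round trip; the function name is made up for illustration.

```python
def round_trip_ascii(raw_name: bytes) -> bytes:
    """Decode a file name as ASCII and encode it back, replacing what doesn't fit."""
    # Each non-ASCII byte becomes U+FFFD on the way in ...
    decoded = raw_name.decode("ascii", errors="replace")
    # ... and each U+FFFD becomes "?" (0x3f) on the way out.
    return decoded.encode("ascii", errors="replace")


if __name__ == "__main__":
    on_disk = "\u00e9.txt".encode("utf-8")   # name as stored, in UTF-8 bytes
    seen = round_trip_ascii(on_disk)
    print(on_disk, seen, on_disk == seen)
```

If the rename is attempted with the round-tripped name, it refers to a file that does not exist, which would be consistent with renameTo returning false.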

  • Non-ASCII Characters in AppleWorks

    New to the forums and hope I'm not duplicating a prior post ... eons back, I used ClarisWorks on an old Mac SE/30, then moved over to AppleWorks which I got to use on my PC because I wanted to go back and work on something I had from years ago. The document in question contained non-ASCII characters. Anyway, I imported it into AppleWorks and what was once a Cyrillic font, was converted to ASCII (now a bunch of gibberish). I fully recognize I need to find the original font I used for the Cyrillic, but before I go that route, it looks like AppleWorks doesn't support non-ASCII characters ... ergo, I'm wasting my time. Am I wrong?

    You may need to embed the font:

  • [SOLVED] KDEmod - problem with mounting b/c of non-ASCII characters

    Hi guys!
    I finally set aside a few gigabytes for Archlinux, so it is no longer in a virtual machine. So far I have managed to configure everything with the excellent wiki. It's runnin' and kickin'. I ran across only one problem:
    When I insert a CD with a label that has non-ASCII characters (some Polish ones in my case) and I click on its icon in Konqueror, I get the message that "file such-and-such doesn't exist", and the Polish characters are clearly garbled (it is not a font problem; I double-checked). I can access the folder either via the console or via Konqueror if I go to the /media folder, though.
    Any ideas how I can fix it? If you need more info, let me know.
    Last edited by JeremyTheWicked (2008-05-31 14:46:07)

    You're welcome. Now it's advisable for you to edit the title of your initial post: add [SOLVED]. Perhaps clearer wording would be in order, too, for the benefit of the search engine. The problem seems to be a trifle in retrospect, but somehow it takes some effort to find the solution, doesn't it?

  • Replacing non-ascii characters in String

    I have a site where the user enters data in a rich text
    editor (ktml4) that gets stored into a database (mysql). There are
    non ascii characters getting into the data, I'm assuming that they
    are copying and pasting from Word. Unfortunately in this situation,
    changing that process isn't an option.
    Currently, this is the only character that is causing me
    I would just like to replace the non-ascii characters with a
    space when I read them from the database. Something like:
    #Replace(result.column, '\xffa0', ' ')#
    However, I believe that code looks for the string "\xffa0",
    not the character \xffa0.
    Is there any way to do this?

    Originally posted by:
    Dan Bracuk
    rereplace might work.
    Can you give an example of how to pass a non-ascii character
    to REReplace?
    Regular expressions are not my strength, but the approach I
    was considering was, "if it's not an ascii character, make it a
    space". Then you pass the entire string at once.
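The "if it's not an ASCII character, make it a space" idea can be written as a single character-class pattern. The sketch below is in Python rather than ColdFusion, and whether REReplace accepts the same pattern syntax is not verified here; the function name is made up for illustration.

```python
import re

# Anything outside the 7-bit ASCII range U+0000 to U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def blank_non_ascii(text: str) -> str:
    """Replace every non-ASCII character in the string with a single space."""
    return NON_ASCII.sub(" ", text)


if __name__ == "__main__":
    # A non-breaking space and a curly apostrophe, as Word might paste them.
    print(blank_non_ascii("a\u00a0b\u2019c"))   # prints: a b c
```

The whole string is processed in one call, so there is no need to know in advance which characters Word introduced.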

  • CMSDK Non-ASCII Characters and WebFolders

    I have the following problems with the CMSDK and Microsoft.
    It is impossible to enter a folder whose name contains non-ASCII characters (the client request is never sent).
    After double-clicking the folder name, I get an error message, and then the URL in the edit box is ISO-8859-1 encoded, but this URL is never sent to the server.
    Other operations like create, rename, ... have no problems with non-ASCII characters.
    With IE, I can enter the folder without a problem.
    MS Word:
    I can't save any file with non-ASCII characters in the name.
    Only these methods are sent:
    But never a "PUT". Without a non-ASCII character in the name, the traffic looks like this:
    It is also impossible to enter a folder containing non-ASCII characters in the name from the Word file-save dialog.
    URL UTF-8 encoding is enabled in the IE options, and other operations (MOVE, COPY) are sent correctly UTF-8 encoded.
    Is there any solution?

    Have you set the following DAV Server configuration property:
    for your domain?
    You have to configure it for the character set that you want your clients to use when connecting to iFS via WebDAV.
    (You can use the web admin tool to change this property.)
    The reason for this is that the Microsoft WebFolders client software does not transmit the client's character set to the server, so the server has no way of knowing what to expect.

  • Removing Non Ascii Characters.

    Dear Friends,
    In our application, a user copies some data from a document and pastes it into a "Comments" field.
    If that data contains anything like the bullets or arrows of a Word document, non-keyboard characters like the ones below are inserted into the database.
    • Analysys
    • Do
    • Now
    • When
    • As
    • We
    don’t know how much he love sthe testing’I am not crazyh’
    I AM ‘USER’ 
     Uu
     Yy
     tt
    Now the user is asking us to remove all those non-ASCII characters from the Comments column. Please help!

    Hi Santosh,
    I can remember that I have given you the REGEXP_REPLACE query earlier which you have specified and told you to read some document about it to modify according to your need. It is not very wise thing to depend on others every time.
    Re: Removing Junk Characters.
    Anyway, REGEXP_REPLACE(str,'[^[a-z,A-Z,0-9,chr(0)-chr(127)[:space:]]]*','') can give you some pointer (not tested).
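An alternative to deleting everything is to map the common Word characters to ASCII look-alikes first and drop only what is left. The sketch below shows that idea in Python, outside the database; the mapping table and the function name are assumptions chosen for illustration, not a complete list.

```python
# Hypothetical mapping of common Word punctuation to ASCII look-alikes.
WORD_CHARS = {
    0x2018: "'", 0x2019: "'",      # curly single quotes
    0x201C: '"', 0x201D: '"',      # curly double quotes
    0x2013: "-", 0x2014: "-",      # en dash, em dash
    0x2022: "*",                   # bullet
    0x2026: "...",                 # ellipsis
    0x00A0: " ",                   # non-breaking space
}


def to_ascii(text: str) -> str:
    """Map known Word characters to ASCII, then drop any remaining non-ASCII."""
    mapped = text.translate(WORD_CHARS)
    return mapped.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    # \uf0d8 stands in for a private-use symbol-font arrow.
    print(to_ascii("\u2022 don\u2019t \u201cok\u201d \uf0d8x"))
```

This keeps apostrophes and quotes readable in the stored comment instead of silently losing them.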

  • Cannot login with password containing non-ascii characters

    I have web application, form based login. UTF-8 is specified "everywhere".
    And it works, except for passwords.
    If a user registers with a password containing non-ASCII characters, it is correctly written to the database, but both programmatic login and normal form-based login then fail.
    If the password is only ASCII, it works.
    The login username can be ASCII or non-ASCII, it doesn't matter; both work.
    I'm using sun java application server 9.1.
    jdbc realm.
    I'm not hashing passwords, just storing them in clear text (for now).
    As a last resort I tried configuring the realm with Charset: UTF8, but that doesn't work either.
    The problem is only with non-ASCII characters in the password.
    Any help is much appreciated.
    Thanks a lot

    I know all that, but that's not the case. My app uses preparedStatements, everything is properly configured, in all pages, utf-8 is going from user to db and back without any problems.
    The only problem is with password field. As I am using form based login, with jdbc realm configured (again, nicely working when only ascii characters), I have very little chance to do something bad through the login phase.
    I'm not talking about special characters, I'm talking about non-ASCII characters, let's say the Chinese, Arabic or Russian alphabets, etc.
    When user registers (my code), the fields are properly written to db. I have checked that, trust me.
    But the Sun app server realm seems to have some problems with the password field.
    (realm uses jdbc connection to mysql, the url contains all extra parameters to be sure about utf8. there is nothing more what can be configured...)
    If I use another alphabet in the login name and ASCII in the password, it works. But as soon as I use another alphabet in the password as well, it doesn't work anymore.
    My only idea is that I could use MD5 on the client with JavaScript to produce ASCII-only characters (I hope it works that way) and then set Digest to MD5 in the realm configuration. But still, it seems very strange. Shouldn't clear-text storage work too? (Digest is currently set to 'none'.)
    Is it a bug of Sun App Server?
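On the MD5 idea: a hex digest is indeed ASCII-only, but the digest depends on which bytes the password is encoded to before hashing, so both sides still have to agree on a charset. The Python sketch below illustrates this with a made-up Cyrillic password; the function name is hypothetical, and MD5 is used only because it is the digest mentioned above.

```python
import hashlib


def md5_hex(password: str, charset: str) -> str:
    """Hash the password's bytes in the given charset; the result is ASCII hex."""
    return hashlib.md5(password.encode(charset)).hexdigest()


if __name__ == "__main__":
    cyrillic = "\u043f\u0430\u0440\u043e\u043b\u044c"   # made-up example password
    # Same password, different byte encodings, different digests.
    print(md5_hex(cyrillic, "utf-8"))
    print(md5_hex(cyrillic, "cp1251"))
    # An ASCII-only password hashes identically under both charsets.
    print(md5_hex("secret", "utf-8") == md5_hex("secret", "cp1251"))
```

That an ASCII-only password is unaffected by the charset while a non-ASCII one is not would match the symptoms described, though this does not prove where the mismatch occurs.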

  • Problems with non-ASCII characters on Linux Unit Test Import

    I found a problem with non-ASCII characters in the Unit Test Import for Linux.  This problem does not appear in the Unit Test Import for Windows.
    I have attached a Unit Test export called PROC1.XML  It tests a procedure that is included in another attachment called PROC1.txt. The unit test includes 2 implementations.  Both implementations pass non-ASCII characters to the procedure and return them unchanged.
    In Linux, the unit test import will change the non-ASCII characters in the XML file to xFFFD. If I copy/paste the non-ASCII characters into the Unit Test after the import, they will be stored and executed correctly.
    Amazon Ubuntu 3.13.0-45-generic / lubuntu-core
    Oracle 11g Express Edition - AL32UTF8
    SQL*Developer Build MAIN-16.84
    Java(TM) SE Runtime Environment (build 1.7.0_76-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)
    In Windows, the unit test will import the non-ASCII characters unchanged from the XML file.
    Windows 7 Home Premium, Service Pack 1
    Oracle 11g Express Edition - AL32UTF8
    SQL*Developer Build MAIN-16.84
    Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
    If SQL*Developer is coded the same on Windows and Linux, the JVM must be causing the problem.

    Set the System property "mail.mime.decodeparameters" to "true" to enable the RFC 2231 support.
    See the javadocs for the javax.mail.internet package for the list of properties.
    Yes, the FAQ entry should contain those details as well.

  • Problem with non-ASCII characters on TTY

    Although I'm not a native speaker, I want my system language to be English (US), since that's what I'm used to. However I have a lot of files which have German language in their file names.
    My /etc/locale.gen has en_US.UTF-8 and de_DE.UTF-8 enabled. My /etc/locale.conf contains only the line
    The German file names show up fine within Dolphin and Konsole (ls -a). But they look weird on either of the TTYs (the "console" you get to by pressing e.g. ctrl+alt+F1). They have other characters like '>>' or the paragraph symbol where non-ASCII characters should be. Is it possible to fix this?

    I don't think the console font is the problem. I use Lat2-Terminus16 because I read the Beginner's Guide on the wiki while installing the system.
    My /etc/vconsole.conf:
    showconsolefont even shows me the characters missing in the file names; e.g.: Ö, Ä, Ü

  • When I try to send an email I get a message - Non ASCII characters in the local part of the recipient address.

    I am trying to send emails to Italy. When I click send I get a message (Non-ASCII characters in the local part of the recipient address). [email protected] is one of the email addresses I am trying to send to. My other email addresses work OK. I have sent emails to these Italian addresses before with no problem.

    Restart the operating system in safe mode with Networking. This loads only the very basics needed to start your computer while enabling an Internet connection. Click on your operating system for instructions on how to start in safe mode: Windows 8, Windows 7, Windows Vista, Windows XP, OSX
    If safe mode for the operating system fixes the issue, there's other software on your computer that's causing problems. Possibilities include but are not limited to: AV scanning, virus/malware, background downloads such as program updates.

  • Non - ASCII characters in textinput box

    I have a flex application where I have a TextInput box. If you paste the following (non-ascii characters) into it:
    "" ‘’¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º®¯°±²³µ´¶µ·¹¸º»¼½¾¿ÀÁÈÉÊËÌÍÏÎÐÒÔØÖöõôóòññó"
    all you are left with is:
    I am guessing it's some configuration that I am missing, and that it would be trivial for someone who knows the issue.
    Any help is much appreciated.

    You may need to embed the font:

  • Validation for non-ASCII characters

    Hi all,
    Requirement: I have to apply a validation on fields like Name and Address in applicationdefination.xml. When a user types non-ASCII characters and navigates to the next page, an error message should be displayed. In other words, I have to restrict users to ASCII values only.
    Present Situation: I'm using a regular expression for this problem. In JHeadstart there is a Regular Expression option under the Validation heading. I have entered the following values in the Regular Expression and Regular Expression Error Message options.
    Regular Expression
    ^\s*[\w\.\,\-\_\(\)\#\'\/\\\u0022\u0026\*\;\:\s]+\s*$
    Regular Expression Error Message
    It is important to note that foreign characters are not accepted on our system. Please ensure only standard English letters are entered
    Since I was getting an error in the jspx page due to the double quote (") and the ampersand (&), I replaced them with their Unicode escapes. Thus, the expression has become ^\s*[\w\.\,\-\_\(\)\#\'\/\\\u0022\u0026\*\;\:\s]+\s*$.
    This expression rejects many characters like Ã,µ,Ç,Ï,Ö,§,¥,{,} but not all non-ASCII characters, for example ѓ є ѕ ї Њ Щ Ώ Ω Ϊ Ά Ή Θ Λ Ξ Π τ ẫ ờ Ỡ Ứ Ỷ ự Ẁ ỹ ị Ọ ň ũ ť ţ Έ Ϊ ﻍ. Thus, it's not fulfilling the requirement.
    Please suggest a valid solution to this problem. It's very urgent.

    The validation seems to be performed in Java or Javascript depending on the layout (I'm sorry I can't remember the exact details). The expression suggested above by theEternalStudent works very well in Java, but not in Javascript.
    We came up with an expression which works in both. It rejects strings which contain &# by doing a lookahead before the main pattern - you might want to expand this to look for &#nnn; but for our purposes &# is enough.
    Here is the "platform neutral" solution:
    I think in future we will write a javascript function and amend the templates to call it directly.
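The expression itself is missing from the post, so the following is only a sketch of the approach described: a negative lookahead that rejects "&#" anywhere in the value, followed by a character class limited to printable ASCII. It is written in Python; the exact set of allowed characters and the function name are assumptions.

```python
import re

# (?!.*&#)      reject any value containing "&#" (an escaped character reference)
# [\x20-\x7E]+  then require one or more printable ASCII characters and nothing else
ASCII_ONLY = re.compile(r"(?!.*&#)[\x20-\x7E]+")


def is_valid(value: str) -> bool:
    """Return True if the value is non-empty printable ASCII without '&#'."""
    return ASCII_ONLY.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["12 Main St. #4", "\u03a9mega", "a &#937; b"]:
        print(repr(sample), is_valid(sample))
```

Using an explicit byte range instead of \w avoids the problem from the question, where \w can match letters outside ASCII depending on the regex engine.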
