Problem indexing hyphenated words in PDFs

Hello everyone,
For the new site we are building, I am using Oracle Text to implement the search functionality.
I am having problems indexing hyphenated words in PDFs.
The code I used to create the content table and the Oracle Text index is as follows:
CREATE TABLE JMMC_TST_OracleText( article_id NUMBER PRIMARY KEY
, descr VARCHAR2(30) -- renamed from "desc": DESC is a reserved word in Oracle
, doc BLOB DEFAULT empty_blob()
);
COMMIT ;
I populated the doc column from a database column in our CMS that contains a PDF document. For testing, I also populated it from a PDF file, using TOAD for Oracle 8.6.
EXEC CTX_DDL.create_preference( 'jmmc_BSJC_lexer2', 'BASIC_LEXER' );
EXEC CTX_DDL.SET_ATTRIBUTE( 'jmmc_BSJC_lexer2', 'SKIPJOINS', '-' );
EXEC CTX_DDL.SET_ATTRIBUTE( 'jmmc_BSJC_lexer2', 'CONTINUATION', '-' );
CREATE INDEX JMMC_TST_INDEX
ON JMMC_TST_OracleText( doc )
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ( 'LEXER jmmc_BSJC_lexer2
STOPLIST CTXSYS.EMPTY_STOPLIST' );
COMMIT ;
The following SQL
select ctx_report.describe_index('JMMC_TST_INDEX') from dual ;
SELECT err_timestamp, err_text
FROM ctx_user_index_errors
ORDER BY err_timestamp DESC;
shows that indexing completed without errors and that the index was created correctly.
The word: processo
(which is visually hyphenated in the PDF as
........... pro-
cesso .....
) is indexed as 2 tokens instead of one, as the following SQL shows:
select token_text
from dr$JMMC_TST_INDEX$i
where UPPER(token_text) = UPPER('CESSO')
or UPPER(token_text) = UPPER('PRO') ;
The following query returns 1 result
SELECT SCORE(1), article_id , doc
FROM JMMC_TST_OracleText
WHERE CONTAINS( doc, 'pro cesso', 1) > 0 ;
The following query returns 0 results
SELECT SCORE(1), article_id , doc
FROM JMMC_TST_OracleText
WHERE CONTAINS( doc, 'processo', 1) > 0 ;
The strange thing is that several months ago I ran this test with the same PDF, and everything worked without any problem.
The tests were done on different machines, and on both occasions I used Oracle 10.1.0.5.0.
It looks like I'm overlooking something, or maybe some obscure setting (of the DB, server or system) is causing the problem.
Suddenly, hyphenated words in PDFs stopped being indexed correctly.
I searched the manuals and this forum and could not find a solution. Can anyone on this forum help?
Thanks in advance.

Hello everybody,
As the initiator of this thread, I am glad that, after some months, someone else is looking at this issue.
To add to (or clear up) the confusion, I have followed Roger Ford's suggestion.
Here’s the test I ran:
1) Created a minimal test file (using Windows Notepad) with the following content:
ABC-
DEF
Hex view of above file is:
41 42 43 2D 0D 0A 44 45 46 00
A B C - . . D E F .
2) Created test table
CREATE TABLE JMMC_TST_OracleText(
article_id NUMBER PRIMARY KEY
, fmt VARCHAR2(30)
, doc BLOB DEFAULT empty_blob()
);
The main difference from Roger Ford's test case is that my content column is a BLOB instead of a VARCHAR2.
My doc column is a BLOB because, on the site I’m building, content comes from our CMS and is of different types, both text and binary (e.g. Word documents, PDFs, etc.), which I need to index together.
So I use a mixed-content column in a materialized view to prepare, consolidate and hold all the content I index.
3) I inserted 1 row into the above table (using TOAD for Oracle 8.6), putting my minimal test file in the doc column.
4) Created preferences and index
EXEC CTX_DDL.create_preference( 'jmmc_BSJC_lexer2', 'BASIC_LEXER' );
EXEC CTX_DDL.SET_ATTRIBUTE( 'jmmc_BSJC_lexer2', 'SKIPJOINS', chr(45) );
EXEC CTX_DDL.SET_ATTRIBUTE( 'jmmc_BSJC_lexer2', 'CONTINUATION', chr(45) );
COMMIT;
CREATE INDEX JMMC_TST_INDEX
ON JMMC_TST_OracleText( doc )
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ( 'LEXER jmmc_BSJC_lexer2
FILTER CTXSYS.AUTO_FILTER
STOPLIST CTXSYS.EMPTY_STOPLIST
FORMAT COLUMN fmt' );
COMMIT;
Note: the BASIC_LEXER SKIPJOINS and CONTINUATION characters were set to the same hyphen character used in the test file.
5) Tokens indexed:
select token_text from dr$JMMC_TST_INDEX$i ;
Shows:
ABC
DEF
6) Filter indexed content and generate a plaintext version:
create table JMMC_filtertab (
query_id number
, document clob
);
commit ;
begin
ctx_doc.filter( 'JMMC_TST_INDEX', '1', 'JMMC_filtertab', '11', TRUE);
end;
Hex view of plaintext version is:
41 42 43 2D 20 20 44 45 46 00
A B C - D E F .
Note that the original end-of-line characters (0D 0A) were replaced by 2 spaces.
It looks like the filter replaces end-of-line characters with spaces and feeds the lexer something like:
ABC- DEF (instead of: ABC-DEF)
So the poor lexer sees 2 tokens and has no clue that they were originally a single hyphenated token.
This is consistent with what MetaLink Note 124624.1 - Intermedia Text & Continuation Character ('-') in PDF says.
7) For comparison, the result of Roger Ford's test (using a VARCHAR2 column instead of a BLOB) is:
Hex view of the filtered plaintext version is:
61 62 63 2D 0D 0A 64 65 66 00
a b c - . . d e f .
So the main difference seems to be that filtering handles end-of-line characters differently for BLOB and VARCHAR2 columns.
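To make the difference concrete, here is a minimal Python sketch of the behaviour I suspect. It is only a toy model under my own assumptions, not Oracle Text's actual lexer: the names hex_view and toy_lex are made up for illustration, and the rule "drop the continuation character only when a line break follows it directly" is my reading of what CONTINUATION is meant to do. The byte strings are the ones from the hex views above, without the trailing 00 byte.

```python
import re


def hex_view(data: bytes) -> str:
    """Render bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def toy_lex(text: str, continuation: str = "-") -> list:
    """Toy model (assumption, not Oracle's algorithm): remove the
    continuation character when a line break follows it directly,
    then split the text into alphanumeric tokens."""
    joined = re.sub(re.escape(continuation) + r"\r?\n", "", text)
    return re.findall(r"[A-Za-z0-9]+", joined)


# Filtered plaintext as seen for the BLOB column: CR LF became 2 spaces.
blob_filtered = b"ABC-  DEF"
# Filtered plaintext as seen for the VARCHAR2 column: CR LF preserved.
varchar_filtered = b"abc-\r\ndef"

for label, raw in (("BLOB", blob_filtered), ("VARCHAR2", varchar_filtered)):
    print(label, "|", hex_view(raw), "|", toy_lex(raw.decode("ascii")))
```

In this model the BLOB variant yields the two tokens ABC and DEF, while the VARCHAR2 variant yields the single token abcdef, which matches what I see in the token table.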
I have also tried other combinations of index and lexer preferences (i.e. SKIPJOINS, CONTINUATION, FILTER, NEWLINE, etc.) and different file types (Word, PDF), which means I also tested with “true binary content” and different end-of-line characters.
No matter what I tried, the results were all the same: if I index a BLOB column, I am not able to index hyphenated lines correctly.
According to the manuals, CTXSYS.AUTO_FILTER is supposed to deal correctly with mixed-content columns if given the correct information (i.e. FORMAT COLUMN).
I hope this triggers a response from someone.
Thanks to everyone who took the time to look at this problem.

Similar Messages

  • Hyperlink problem when converting Word to PDF

    Hi everyone.
    I am using Adobe Acrobat 7.0 Standard version 7.1.4 on a Windows 2000 environment.
    I am using Word XP/2002 SP3.
    I have a couple of problems when converting Word to PDF.
    When converting a long link that spans 2 lines, the PDF takes only one line into account and does not include the whole link.
    When I have a hyperlink to a web page, the link works fine in Word, but in the PDF it picks up the few words after the URL, which breaks the link. For example, I have a hyperlink to http://www.google.com/ followed by the words "This is a sample text". The hyperlink in the PDF then becomes http://www.google.comThisisasampletext
    Do you have any ideas how I can correct this?
    Thanks.

    To get the cross references, you have to use Convert to PDF (PDF Maker, part of the Acrobat product). However, you first need to go to the print menu and set the printer to the Adobe PDF printer. Close the print menu and then do a reflow and link update of the document (Ctrl-A, then F9, I think). This is needed to get the links correct, since Word reflows the document based on the printer (otherwise the links can go to the wrong place).
    Printing to the Adobe PDF printer (the print-to-file step is not needed; just print to the printer and the rest is automatic) does not include the links. PDF Maker is a preprocessor for the Adobe PDF printer that adds PDF marks to the PS file before it is sent to Distiller (when used normally, the printer creates a PS file and then invokes Distiller to complete the process; the steps you did with the Adobe PDF printer are the same process done manually).
    Summary:
    1. Fix document after setting the printer to Adobe PDF.
    2. Use PDF Maker (create PDF button)

  • PROBLEMS DOWNLOADING DOCUMENTS (WORD OR PDF) FROM E-MAILS

    Up to yesterday, I was able to download documents (Word and PDF) from e-mails, but I do not seem to be able to do so now.
    I do seem to be able to download programs, but not documents.
    Previously, Firefox would download and scan the document, then send it to a downloaded documents folder. Now it just scans the document and that is it; it does not send it to the download folder.
    == This happened ==
    Every time Firefox opened
    == TODAY (JUN 30, 2010)

    From the noises you describe, I'd be far more worried about the condition of the hard drive. Back up any important personal data immediately. I think the relation to Entourage is only circumstantial. The hard drive is the real problem.

  • OWA - "Sorry, we ran into a problem" when opening Word and PDFs

    Hi,
    I have a weird problem, similar to this one, where Office Web Apps will not open Word documents and PDFs on my SharePoint server. Other documents, such as Excel and PowerPoint, work just fine. I'm running SP1+ on both the SP2013 and OWA servers.
    When I open a Word or PDF document, it opens in view mode and displays this error as well:
    It works fine in other browsers such as Google Chrome, and also if I change the emulation mode to IE9.
    If I check the debugging output of the site, I get this error:
    Object doesn't support property or method 'selectSingleNode'
    My site is added to the local intranet zone.
    The problem occurs only on our new Citrix farm. If I compare the IE security settings and the local intranet zone with my own client, which also runs IE11, the settings are the same. However, it works fine on my own machine. If I reset my Internet Explorer settings, it also works until I log off Citrix completely and back on.
    Any idea which settings cause this behavior?
    Thanks

    Check whether your IE is 32-bit or 64-bit.
    Also check whether the Office installed on the system is 32-bit or 64-bit.
    Do all users face the same issue?
    Did you add both the OWA and SharePoint URLs to the trusted sites?
    https://social.technet.microsoft.com/Forums/en-US/591cdbc6-a4a9-4f0c-b242-6fc2c04a6149/ie-11-and-sharepoint-2013-word-web-app?forum=sharepointadminprevious
    http://sharepoint.stackexchange.com/questions/95617/office-web-apps-broken-in-ie11
    upgrading my OWApps to the latest December CU seems to have fixed my problems with IE11.
    If this helped you resolve your issue, please mark it Answered. You can reach me through http://itfreesupport.com/

  • Problems converting Word to PDF

    I tried to convert a brochure that was created in Word into a PDF file. Unfortunately, the PDF shows borders around the entire page (similar to crop marks, but NOT identical to them). The Word file was set up as a "book" on A6, and in the PDF the whole thing should come out on A4 (4 book pages on one A4 sheet). Does anyone have an idea how I can get rid of the page border?
    Thanks for the help!

    Only now getting around to answering...
         "With a question like this it is very important that you give exact details of the
    operating system,
    Word version and
    Acrobat version"
    Operating system: Windows XP
    The problem occurs whether I use Word 2003 or 2007
    Acrobat: various versions, from Adobe Reader 6.0 up to Acrobat Professional 7.0
         "How do you create the PDF?
    PDF Maker?
    Printing to a PostScript file and then converting in Distiller?
    PDF creation via the operating system (e.g. Quartz)?
    Opening the Word file in Acrobat Pro or Standard?"
    Both PDF Maker and printing the PostScript file
        "Where does the problem occur?
    During PDF creation?
    During imposition?"
    During PDF creation.
         "Please describe your problem in more detail. I don't know which frames you are writing about."
    It is a thin black line that runs around every single page at a distance of about 0.2 cm. But it is NOT the crop marks.
         "In principle it should be noted that Word is NOT a layout program, and imposition is not Word's job either. For that you need professional software such as InDesign."
    Of course it isn't. But I don't see why I should buy complete design software when I want to create a small brochure once in a blue moon. I would be very grateful for help, because I have run out of ideas for a solution...

  • Header problem during convert Word to PDF through FOP

    Hi, below is the scenario in which I face the problem:
    1. I converted a combination of an XML file (contains data only) and an XSL file (contains the template only) into a Microsoft Word file. This is done using Xalan as the transformer library.
    2. This converted Microsoft Word file is then converted into XSL-FO format.
    3. By applying the FOP library (Implementation-Version: 0.20.5), the converted Microsoft Word file (in XSL-FO format) is converted into a PDF file successfully.
    4. The problem is that, although there is a header in the Microsoft Word document, the header is missing from the converted PDF file.
    May I know how to solve the problem?

    What version of Word are you running? If you open Word, do you see the Acrobat PDFMaker Add-on Ribbon? Here is what it looks like in Office 2010. If you have this try creating the PDF from the Acrobat ribbon.

  • More problems while converting hyperlinks in word to pdf - missing first words

    I'm using Microsoft Office Word 2007 and Adobe Acrobat 9 Pro Extended (I think on Windows XP).
    If I have a Word document with a working hyperlink and convert it to PDF (using the Acrobat button in Word), I end up with a working hyperlink in the PDF file, but it is missing the first word. For example:
    in Word the hyperlinked phrase will be http:/tinyurl.com/MarinaBloj but in the PDF version it will be :/tinyurl.com/MarinaBloj
    or
    in Word the hyperlinked phrase will be "Sensitivity to luminance and chromaticity gradients in a complex scene." but in the PDF version it will be " to luminance and chromaticity gradients in a complex scene." with an extra blank space.
    Worse, if in Word it is a single word, say "PDF", then all you get in the created PDF document is " ", i.e. a blank space that is actually a hyperlink (you can see the corresponding address if you hover the mouse).

    I am experiencing the exact same problem: hyperlinks work after conversion from Word to PDF, but the first word is missing. I can only reproduce the problem on one computer; other computers I have tried do not have it. Here is a workaround I found: select the entire document (Ctrl + Shift + End from the beginning), then go to "Create PDF". Go into Options and select the "Selection" option. Then proceed to convert the document. It's a workaround, but I'd like to know what the cause of the problem is.

  • Problems viewing dashboards in Word and PDF

    I use Xcelsius 2008, service pack 3 and when I export dashboards to Word and pdf others cannot open and view them.  I send them the files via email, some people can open and use, many cannot.  I have Windows XP sp3, and Office 2007.  This is a problem because some clients can open and use, others cannot.

    Hi,
        Craig is on the right track; it is most likely an Adobe Flash Player version issue.
        Read the Supported Platform guide for Xcelsius 2008 SP3 for more details.
                  http://www.sdn.sap.com/irj/scn/index?rid=/library/uuid/50fdb3d2-50cc-2c10-e392-a2e481f71694
        Make sure they have Adobe Flash Player 9.0.151.0 and above installed.
        Another quick test they can do, see if they can open SWF file directly on their Internet Explorer. I doubt the issue is related to WORD or PDF at all.
    Cheers,
    Ken

  • Problems with table alignment (from Word to PDF)

    Hello,
    I've downloaded the trial version of Acrobat because our company needs to make small PDFs out of 40 to 50 documents, and I want to find out which program is best for us...
    The thing is: I just can't get the tables right (sorry for my bad English). This problem exists with docx, doc, pdf, etc.
    When I have a table with the default top cell margin set to 0.1 cm, the table gets messed up when I export it to a PDF. However, when I print it, it looks normal!
    The topmost horizontal lines are gone and the vertical connections aren't right.
    I tried every setting, but I just can't seem to find what's wrong. I also tried 4 other programs, with the same problems (the standard save option in Word, save Word to PDF, also makes the same error).
    There are 2 weird (partial) solutions: when I disable the default top cell margin in the table (set it to 0) and then save, the problem is gone (but I really need this top cell margin).
    The other solution: when I disable all the colours in the table, it is (almost) OK, but not completely (and this isn't an option either).
    I inserted some pictures to give you the full view:
    Picture 1: on the left you see the result when I save it from Word to PDF with Acrobat; on the right you see the original document.
    Picture 2: A close-up of the problems with the lines.
    Picture 3: The settings in Word that are causing the problem
    Thank you in advance!
    Greetings, Coen
    (Select table, right mouse button, table options --> options --> default cell margins, top cell margin)

    Hi Coen1413,
    Is it possible for you to share your doc also from which you created the PDF?
    Also please let us know your Office version and OS details.
    Thanks,
    Vishal/Adobe

  • I am having the same problem I think.  With mobileme you simply copy documents to the idisk folder and then sync.  I cannot seem to sync that folder anymore.  Any idea as to how I can simply copy folders to icloud and then access the MS Word and PDF file

    I am having the same problem, I think. With MobileMe you simply copy documents to the iDisk folder and then sync. I cannot seem to sync that folder anymore. Any idea as to how I can simply copy folders to iCloud and then access the MS Word and PDF files on my iPhone?

    Apple never bothered to explain that this would happen
    Your iDisk is still accessible after moving to iCloud in exactly the same way as before. Nothing is deleted until June 30th 2012.
    , so I could easily have lost ALL of the files I kept on iDisk.
    No, you couldn't. Firstly, nothing was deleted from your iDisk. Secondly, any files stored on your iDisk should never be your only copy. Even if your iDisk spontaneously combusted, you should keep local backups elsewhere.
    Does Apple WANT people to move their storage elsewhere and stop paying Apple for it?
    Yes. Apple doesn't provide such a service anymore, nor are you paying them for it.
    Apple has made no effort to suggest remedies for the problem it has given iDisk users
    They've provided instructions on how to download your files from your iDisk. What you do with them after that is your choice.

  • Problems with saving Word 2007 table as PDF

    I have a table in MS Word 2007 with some merged cells. If I use the Save as ... PDF or XPS option in Word 2007 to save this document as a PDF, the resulting PDF has a table with a discontinuous table grid. The table lines around the cells are broken.
    Is this a problem with the Word table, or does the Save as PDF command not work well with tables?
    Thanks,
    Karl Smith

    What versions of both programs and how did you create the PDF? Can you post an example of the PDF?

  • Problem with missing column text - PDF to Word conversion

    I just bought Acrobat Standard X and have a serious deadline to meet.  Problem is I converted a PDF with newspaper style columns to Word XP (2003) version and the first column of two will not budge with the right hand margin.  The text is there when prompted but some of the endings to words are being hidden.  I have tried putting an 'equal column width' command and it still doesn't work.  I have selected individual paragraphs to try and stretch out the margin to justify with the previous page alignment (which appears ideal) and it still will not do as I want.  On the ruler line, there is a blue gap between the columns and I can't get rid of it.  So, the first column insists on staying narrower than it should be and if I stretch it out to align with the other pages, it loses words I have typed until it reverts back to its original width.  Any ideas would be most welcome. This is a right pain in the butt. Thanks.

    Alas your expectation is not reflective of what the application is capable of.
    Acrobat's OCR has no concept of "styles". Nor does OCR provide Tagged content. As Acrobat's OCR does not incorporate a "zone" approach to character recognition the manual tagging of OCR output is non-trivial (and yes you'd have to do it manually).
    Having the PDF content exported to Word you'll more quickly arrive at your "destination" by manual application of the desired core Word headings / styles. Now if *all* paper documents were the same type layout / format / etc it might be possible for a Word developer to script something to apply headings / styles programmatically.
    Be well....

  • Is there any way to check wrong hyphenation of a word in PDF?

    Hi all
    Is there any way to check wrong hyphenation of a word in PDF? through scripting also.
    Many thanks
    Thamil

    What do you mean exactly by "wrong hyphenation"?

  • Why does Acrobat Pdf converter file slow down my 2003, Windows  Word Program.  I only experience this problem when i convert a pdf file to a doc file.

    Why does Acrobat Pdf converter file slow down my 2003, Windows  Word Program.  I only experience this problem when i convert a pdf file to a doc file.

    Hi Bill -- thanks for your reply!
    When I check the Document Properties in Acrobat I can see that the fonts used in the document (Cambria, Times and Wingdings) are listed as "Embedded Subset" in the Fonts panel. The machine it was created on did use an earlier version of OS X and an old version of Word, but it seems to have the proper fonts...
    -nick

  • I am having a problem converting Word to PDF and PDF to Word. Please help. I am signed up and paid to Feb 2015

    Can someone advise me why I am having this problem

    In the past, I would always have my Word document on screen, go to
    Publish, and it would automatically convert the doc to PDF.
    The same applied if I had a PDF to convert to Word: with the PDF on screen,
    Convert would appear to the right of my screen, and when I pressed it,
    it converted to Word.
