Oracle Text - Compressed (ZIP) documents support

Is it possible for Oracle Text to index files of various formats (DOC, PDF, HTML) that are compressed inside ZIP archives?
I have read some of the Oracle Text documentation and it seems that this feature is not available out of the box.
Does anyone have an idea how this functionality could be implemented?
I think a USER_DATASTORE might be used for this, but that does not seem like a good way to do it.
The second option is using an Oracle Text FILTER class, but according to my tests the default INSO_FILTER does not support indexing documents inside ZIP packages.
So far I have not been able to find any other filter that supports this feature.
Regards.

What version of Oracle are you interested in?
11.1.0.7 (the latest) returns Oracle Text to INSO filters for document processing, and in [Text Reference Section B.2.5|http://download.oracle.com/docs/cd/B28359_01/text.111/b28304/afilsupt.htm#g639477] (Archive File Format), ZIP files are listed as supported.
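For what it's worth, a minimal sketch of what that looks like with 11g's AUTO_FILTER (table, index and search term are invented); the ZIP archive itself sits in a BLOB column and the filter extracts the text of the archived members:
CREATE TABLE zip_docs (
  id  NUMBER PRIMARY KEY,
  doc BLOB                    -- the ZIP archive containing DOC/PDF/HTML files
);
CREATE INDEX zip_docs_ctx ON zip_docs (doc)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter ctxsys.auto_filter');
-- a hit is returned per ZIP row, regardless of which archived member matched
SELECT id FROM zip_docs WHERE CONTAINS(doc, 'invoice') > 0;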

Similar Messages

  • Oracle Text

    Hi Expert,
    how can I use Oracle Text to search within XML documents?
    As I have already shredded the XML documents into a SQL view,
    I cannot create an index on those views.
    What can I do with Oracle Text search?
    Thanks a lot!
    Edith

    Hi Edith:
    Oracle Text can index documents in UTF8 or any other encoding supported by Oracle.
    Assuming that your table is LRPAPER_XMLTYPE_TBL, you can create a Text index with:
    create index LRPAPER_XMLTYPE_TBL_idx on LRPAPER_XMLTYPE_TBL p (value(p)) indextype is ctxsys.context;
    If you have documents in different languages stored in the same table, you have to find some scalar column to be used as a discriminator value. In my example "lang" is a scalar column of type VARCHAR2. See the annotated schema at:
    http://www.dbprism.com.ar/xsd/document-v20-ann.xsd
    This annotated schema creates these Oracle types:
    SQL> desc "document"
    Name                                      Null?    Type
    SYS_XDBPD$                                         XDB.XDB$RAW_LIST_T
    id                                                 VARCHAR2(4000 CHAR)
    lang                                               VARCHAR2(4000 CHAR)
    header                                             headerType
    body                                               CLOB
    footer                                             footerType
    SQL> desc "headerType"
    "headerType" is NOT FINAL
    Name                                      Null?    Type
    id                                                 VARCHAR2(4000 CHAR)
    lang                                               VARCHAR2(4000 CHAR)
    title                                              VARCHAR2(4000 CHAR)
    subtitle                                           VARCHAR2(4000 CHAR)
    version                                            versionType
    type                                               VARCHAR2(4000 CHAR)
    authors                                            authorsType
    notice                                             VARCHAR2(4000 CHAR)
    abstract                                           VARCHAR2(4000 CHAR)
    meta                                               metaList
    So the syntax "XMLDATA"."lang" refers to the attribute (column) "lang" of the type "document".
    Best regards, Marcelo.
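    A small illustrative query against an index created as above, assuming the default PATH_SECTION_GROUP is in effect for the XMLType column (the search term and path are invented), restricting the hit to part of the document with INPATH:
    SELECT value(p)
      FROM LRPAPER_XMLTYPE_TBL p
     WHERE CONTAINS(value(p), 'dickens INPATH (/document/body)') > 0;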

  • Problem with blob column index created using Oracle Text.

    Hi,
    I'm running Oracle Database 10g 10.2.0.1.0 Standard Edition One on Windows Server 2003 R2 x64.
    I have a table with a blob column which contains PDF documents.
    Then I create an index using the following script so that I can do full-text searches using Oracle Text.
    CREATE INDEX DMCS.T_DMCS_FILE_DF_FILE_IDX ON DMCS.T_DMCS_FILE
    (DF_FILE)
    INDEXTYPE IS CTXSYS.CONTEXT
    PARAMETERS('DATASTORE CTXSYS.DEFAULT_DATASTORE');
    However, the index is not searchable, and I checked the following tables created by the database for my index and found them to be empty as well:
    DR$T_DMCS_FILE_DF_FILE_IDX$I
    DR$T_DMCS_FILE_DF_FILE_IDX$K
    DR$T_DMCS_FILE_DF_FILE_IDX$N
    DR$T_DMCS_FILE_DF_FILE_IDX$R
    I wonder what's wrong with it.
    My user has been granted the ctx_app role, and other tables of mine that store plain text and use Oracle Text are fine. I even exported the blob column and saved it as a PDF file, and it opens fine.
    However, the database does not seem to be indexing my blob column, although the index is created without error.
    Please advise.
    Really appreciate anyone who can help.
    Thank you.

    The situation is that I have already loaded a few PDF documents into the table's blob column.
    After I create the Oracle Text index on this blob column, I find that the system-generated index tables listed in my earlier posting are empty, except for the 4th table.
    Normally we would see words inside the table, i.e. the words indexed by Oracle Text from my documents.
    As a result, no matter how I search the index using a select statement with the contains operator, it gives me no results.
    I find it strange that the blob is not indexed. The content of the blobs is actually valid: I tested this by exporting the content back to PDF, and I can still view and search within the PDF.
    Regards,
    Jap.
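    One way to narrow this down is to check whether Oracle Text logged filtering errors, and to make the filter for binary formats explicit; a sketch using the names above:
    -- any filtering/indexing errors recorded for the index?
    SELECT err_timestamp, err_textkey, err_text
      FROM ctx_user_index_errors
     WHERE err_index_name = 'T_DMCS_FILE_DF_FILE_IDX'
     ORDER BY err_timestamp DESC;
    -- recreate the index with an explicit filter for binary documents such as PDF
    DROP INDEX DMCS.T_DMCS_FILE_DF_FILE_IDX;
    CREATE INDEX DMCS.T_DMCS_FILE_DF_FILE_IDX ON DMCS.T_DMCS_FILE (DF_FILE)
      INDEXTYPE IS CTXSYS.CONTEXT
      PARAMETERS ('DATASTORE CTXSYS.DEFAULT_DATASTORE FILTER CTXSYS.AUTO_FILTER');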

  • ODF support in Oracle Text 10g R2 version ??

    We are currently using Oracle Text 10g Release 2 for HTML section searching in our application. We don't have any issues with Microsoft Office 2003 documents.
    But when we use OpenOffice documents (ODF), it does not work. It throws the following exception:
    java.sql.SQLException: ORA-20000: Oracle Text error:
    DRG-11207: user filter command exited with status 1
    DRG-11222: Third-party filter does not support this known document format.
    ORA-06512: at "CTXSYS.DRUE", line 160
    ORA-06512: at "CTXSYS.CTX_DOC", line 825
    ORA-06512: at line 1
    We are using "AUT_FILTER" filter technology.
    Any ideas for solving this issue?

    You start to have to think outside the box at this point -- AUTO_LEXER isn't going to be able to support you natively.
    You could file an SR, and let Oracle tell you whether they'd be willing to integrate changes (like new Verity libraries as they are developed) to 10.2.
    That assumes that Autonomy (owner of Verity) has improved their support for ODF.
    The OpenOffice formats are all XML-based; you could write something custom to extract the text from your OpenOffice files and submit it to Oracle as straight XML. I've done something similar to support Office 2007 formats.
    You could write a custom USER_LEXER (which is essentially the same as custom extraction, but may be an easier place to hook in your custom code).
    That's the main reason I suggested moving up to 11g -- none of the other choices have any easy, short-term fix or workaround.
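    If you go the custom-extraction route, one possible place to hook it in is a PROCEDURE_FILTER preference; a rough sketch, assuming a hypothetical PL/SQL procedure odf_to_text created in the CTXSYS schema (for example wrapping a Java stored procedure that pulls the text out of content.xml in the ODF archive):
    begin
      ctx_ddl.create_preference('odf_filter', 'PROCEDURE_FILTER');
      ctx_ddl.set_attribute('odf_filter', 'procedure',   'odf_to_text');  -- hypothetical filter procedure
      ctx_ddl.set_attribute('odf_filter', 'input_type',  'blob');
      ctx_ddl.set_attribute('odf_filter', 'output_type', 'clob');
    end;
    /
    -- then: create index ... parameters ('filter odf_filter ...');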

  • My Oracle Support - Oracle TEXT - Community

    Hi all,
    It is my pleasure to announce the launch of the Oracle Text community on My Oracle Support.
    Communities are "Oracle's multi-channel platform for online collaborative support", accessible by all Oracle customers, partners and employees. They replace earlier classic MetaLink forums with a much richer true collaborative environment that has options like discussion forums, document upload, tagging, search, email, user reputation scores, best practice exchange, and more features to come. The communities are driven by the members, collaborating with a large network of other members to exchange ideas & knowledge, expand networks, learn from the rest of the community, etc etc.
    The general community pages can be reached via the "community" tab in My Oracle Support (formerly MetaLink), or the Oracle Text community can be accessed direct via:
    https://communities.oracle.com/portal/server.pt/community/database_security_products/287 (MOS logon required)
    The community is moderated by members of the global Text support team, and we invite anybody with an interest in Oracle Text to participate in this community by asking & answering questions, providing best practice documents, and so on.
    We hope that, as a participant, you will become intimately involved with helping other users as well as receiving help for the issues that you post.
    So sign up, and enjoy!
    Regards,
    Edwin

    Hi,
    You are welcome. Log in to [My Oracle Support|http://support.oracle.com], click the Community tab, select "Oracle Text" from the "My Communities" list, and happy posting.
    Or use [Oracle Text|https://communities.oracle.com/portal/server.pt/community/oracle_text/287],
    which takes you straight to the community application.
    If you are new to the Oracle Support Communities, check out the 25-minute New Member Orientation that gives you a brief tour of your Communities! You can access it from the Home tab, under the Getting Started section (top right).
    Thanks,
    Edwin

  • Using Oracle Text to search through WORD, EXCEL and PDF documents

    Hello again,
    What I would like to know is: if I have a Word or PDF document stored in a table, is it possible to use Oracle Text to search through the actual Word or PDF document?
    Thanks
    Doug

    Yes, you can do context-sensitive searches on both PDF and Word docs. With PDFs you need to make sure they are text and not images; some scanners will create PDFs that are nothing more than images of the document.
    Below is a code sample that I made some time back to demonstrate the searching capabilities of Oracle Text. Note that the example makes use of the inso_filter, which is no longer shipped with Oracle beginning with patch set 10.1.0.4. See MetaLink note 298017.1 for the changes. See the following link for more information on developing with Oracle Text.
    http://download-west.oracle.com/docs/cd/B14117_01/text.101/b10729/toc.htm
    begin example.
    -- The following needs to be executed
    -- as sys.
    DROP DIRECTORY docs_dir;
    CREATE OR REPLACE DIRECTORY docs_dir
    AS 'C:\sql\oracle_text\documents';
    GRANT READ ON DIRECTORY docs_dir TO text;
    -- End sys ran SQL
    DROP TABLE db_docs CASCADE CONSTRAINTS PURGE;
    CREATE TABLE db_docs (
    id NUMBER,
    format VARCHAR2(10),
    location VARCHAR2(50),
    document BLOB,
    CONSTRAINT i_db_docs_p PRIMARY KEY(id)
    );
    -- Several notes need to be made about this anonymous block.
    -- First the 'DOCS_DIR' parameter is a directory object name.
    -- This directory object name must be in upper case.
    DECLARE
    f_lob BFILE;
    b_lob BLOB;
    document_name VARCHAR2(50);
    BEGIN
    document_name := 'externaltables.doc';
    INSERT INTO db_docs
    VALUES (1, 'binary', 'C:\sql\oracle_text\documents\externaltables.doc', empty_blob())
    RETURN document INTO b_lob;
    f_lob := BFILENAME('DOCS_DIR', document_name);
    DBMS_LOB.FILEOPEN(f_lob, DBMS_LOB.FILE_READONLY);
    DBMS_LOB.LOADFROMFILE(b_lob, f_lob, DBMS_LOB.GETLENGTH(f_lob));
    DBMS_LOB.FILECLOSE(f_lob);
    COMMIT;
    END;
    /
    -- build the index
    -- Note that this index differs from the file-system example
    -- in that the datastore parameter is ctxsys.default_datastore and not
    -- ctxsys.file_datastore. FILE_DATASTORE is for documents that
    -- exist on the file system. DEFAULT_DATASTORE is for documents
    -- that are stored in the column.
    create index db_docs_ctx on db_docs(document)
    indextype is ctxsys.context
    parameters (
    'datastore ctxsys.default_datastore
    filter ctxsys.inso_filter
    format column format');
    --search for something that is known to not be in the document.
    SELECT SCORE(1), id, location
    FROM db_docs
    WHERE CONTAINS(document, 'Jenkinson', 1) > 0;
    --search for something that is known to be in the document.  
    SELECT SCORE(1), id, location
    FROM db_docs
    WHERE CONTAINS(document, 'Albright', 1) > 0;

  • Document management system using oracle text

    I plan to create a document management system using Oracle Text with the following features:
    1) document comparison
    2) document search
    and more...
    Can Oracle Text be used to display documents of various formats by converting them to HTML, and can search keywords be highlighted in the document?
    please help!

    Have you ever considered doing this in Oracle Application Express (free on top of the Oracle database)? How about something like:
    http://download-west.oracle.com/docs/cd/B31036_01/doc/appdev.22/b28839/up_dn_files.htm
    Index the files using a CONTEXT index (perhaps together with the documents' metadata, using the Oracle Text MULTI_COLUMN_DATASTORE), and then include a search string when you write your query for a report on the documents.
    I've created a number of APEX-based document management systems and it is quite easy once you get the hang of using this environment. I suggest looking at some of the tutorials/how-to documents and you'll be on your way quickly.
    Start with the upload application. Once you can get your documents in, create a report that shows everything except the document. Verify all of this works correctly.
    Add some "items" to the page for the report, and include them as bind variables in the where clause.
    After that, add your Oracle Text index to the database, and toss in a "text-field" item to the APEX page. Modify your report query, adding the CONTAINS clause, and use the newly created item as a bind variable. There's your keyword search.
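    For example, a report region source along these lines (table, column and item names are invented):
    SELECT SCORE(1) AS relevance, d.id, d.file_name, d.mime_type
      FROM my_documents d
     WHERE CONTAINS(d.doc_content, :P1_SEARCH, 1) > 0
     ORDER BY SCORE(1) DESC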
    Linking to Oracle Apps is done through APIs, possibly over database links.
    Hope it helps. Though not a step-by-step how to document, this should point you in the right direction. Get familiar with APEX as that covers most of what you described.
    -Ron
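    On the HTML-conversion and keyword-highlighting part of the question: CTX_DOC.FILTER returns an HTML rendition of a binary document, and CTX_DOC.MARKUP additionally highlights the search terms. A rough sketch (index name, key and query term are invented):
    DECLARE
      v_html CLOB;
    BEGIN
      DBMS_LOB.CREATETEMPORARY(v_html, TRUE);
      CTX_DOC.MARKUP(
        index_name => 'my_docs_ctx',     -- hypothetical CONTEXT index
        textkey    => '101',             -- primary key of the document row
        text_query => 'budget',
        restab     => v_html,
        plaintext  => FALSE,             -- keep the HTML rendition
        tagset     => 'HTML_NAVIGATE');  -- adds highlight/navigation tags
      -- v_html now holds the document as HTML with the hits marked up
      DBMS_LOB.FREETEMPORARY(v_html);
    END;
    /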

  • Oracle Text - Problem with filtering binary documents (.doc, .pdf, etc...)

    Hi, I have a problem with filtering binary documents (.doc, .pdf, etc...). I use SQL*Plus for remote access to Oracle 10.2 on Linux, and I create a table:
    CREATE TABLE test (id NUMBER PRIMARY KEY, text VARCHAR2(100));
    I insert to this table:
    INSERT into test values(1, 'PATH/text1.doc');
    INSERT into test values(2, 'PATH/text2.doc');
    and then:
    CREATE INDEX test_index ON test(text) indextype is ctxsys.context
    parameters ('datastore ctxsys.file_datastore
    filter ctxsys.auto_filter');
    Message "Index created" is displayed, but objects: DR$test_index$I, DR$test_index$K, DR$test_index$N, DR$test_index$R and DR$test_index$P are empty => index wasn´t created probably.
    I don´t know, where is bug, either bug is somewhere in this code or on the server (wrong installation oracle or constraint privileges). Do you know in what is bug?

    The following is an excerpt from the 10g online documentation. Note the items that I have put in bold.
    "FILE_DATASTORE
    The FILE_DATASTORE type is used for text stored in files accessed through the local file system.
    Note:
    FILE_DATASTORE may not work with certain types of remote mounted file systems.
    FILE_DATASTORE has the following attribute(s):
    Table 2-4 FILE_DATASTORE Attributes
    Attribute Attribute Value
    path path1:path2:pathn
    path
    Specify the full directory path name of the files stored externally in a file system. When you specify the full directory path as such, you need only include file names in your text column.
    You can specify multiple paths for path, with each path separated by a colon (:) on UNIX and semicolon(;) on Windows. File names are stored in the text column in the text table.
    If you do not specify a path for external files with this attribute, Oracle Text requires that the path be included in the file names stored in the text column.
    PATH Attribute Limitations
    The PATH attribute has the following limitations:
    If you specify a PATH attribute, you can only use a simple filename in the indexed column. You cannot combine the PATH attribute with a path as part of the filename. If the files exist in multiple folders or directories, you must leave the PATH attribute unset, and include the full file name, with PATH, in the indexed column.
    On Windows systems, the files must be located on a local drive. They cannot be on a remote drive, whether the remote drive is mapped to a local drive letter."
    With accessible paths and files, you get something like:
    SCOTT@orcl_11g> CREATE TABLE test (id NUMBER PRIMARY KEY, text VARCHAR2(100));
    Table created.
    SCOTT@orcl_11g>
    SCOTT@orcl_11g>
    SCOTT@orcl_11g> INSERT into test values(1,'c:\oracle11g\banana.pdf');
    1 row created.
    SCOTT@orcl_11g> INSERT into test values(2,'c:\oracle11g\cranberry.pdf');
    1 row created.
    SCOTT@orcl_11g>
    SCOTT@orcl_11g> CREATE INDEX test_index ON test(text) indextype is ctxsys.context
      2  parameters ('datastore ctxsys.file_datastore
      3  filter ctxsys.auto_filter');
    Index created.
    SCOTT@orcl_11g>
    SCOTT@orcl_11g> select count(*) from dr$test_index$i
      2  /
      COUNT(*)
           608
    SCOTT@orcl_11g>
    In the following, I used a non-existent path and a non-existent file name, which produces the same results as when you use a remote path that does not exist locally.
    SCOTT@orcl_11g> CREATE TABLE test (id NUMBER PRIMARY KEY, text VARCHAR2(100));
    Table created.
    SCOTT@orcl_11g>
    SCOTT@orcl_11g>
    SCOTT@orcl_11g> INSERT into test values(3,'c:\nosuchpath\nosuchfile.pdf');
    1 row created.
    SCOTT@orcl_11g>
    SCOTT@orcl_11g> CREATE INDEX test_index ON test(text) indextype is ctxsys.context
      2  parameters ('datastore ctxsys.file_datastore
      3  filter ctxsys.auto_filter');
    Index created.
    SCOTT@orcl_11g>
    SCOTT@orcl_11g> select count(*) from dr$test_index$i
      2  /
      COUNT(*)
             0
    SCOTT@orcl_11g>

  • Index document with Oracle Text from an ECM without saving the content

    Hi,
    I have documents in an ECM (Alfresco, UCM and more) and I would like Oracle Text to index the documents without saving the content. I want to save space and not have redundant information. I would use Oracle Text to search for a document's identification (ID) and fetch the document from the ECM using that ID.
    Is it possible ?
    Do I have to use Secure Enterprise Search ?
    Thanks
    Simon

    I want to save space and not have redundant information.
    The database space or the disk space (in the OS)?
    If it is the database space, it is not possible to index/search without storing the file contents.
    Using FILE_DATASTORE you can keep the files on disk (in the OS) and index them.
    When you remove the file, you need to re-index it.
    I do not see any other way.
    Do I have to use Secure Enterprise Search ?
    SES also uses Oracle Text as its base. It also uses FILE_DATASTORE, but the re-indexing part is automated using crawlers.
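    A minimal sketch of that FILE_DATASTORE approach (directory path, table and index names are invented); the database stores only the ECM document ID and a file name, while the exported file on disk provides the content at indexing time:
    begin
      ctx_ddl.create_preference('ecm_store', 'FILE_DATASTORE');
      ctx_ddl.set_attribute('ecm_store', 'PATH', '/ecm/export');  -- hypothetical OS path
    end;
    /
    CREATE TABLE ecm_doc_refs (
      doc_id    VARCHAR2(100) PRIMARY KEY,  -- ID used to fetch the document from the ECM
      file_name VARCHAR2(400)               -- file name under /ecm/export
    );
    CREATE INDEX ecm_doc_refs_ctx ON ecm_doc_refs (file_name)
      INDEXTYPE IS ctxsys.context
      PARAMETERS ('datastore ecm_store filter ctxsys.auto_filter');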

  • Can oracle text be used to compare documents?

    Let's say that I have some documents stored in binary (LOB). Can Oracle Text be used to compare documents and show their similarity on the basis of their content? How would I be able to compare documents using Oracle Text? Does it require a mining algorithm such as a neural network? Please help.
    thanks for reading.

    Thank you for your interest in my question. Let me see whether  I can further clarify it. In an ordinary PDF document, assume that I have a picture of a user interface for microsoft Word. The common method for identifying items in the picture, such as a toolbar, would be to either:
    --use a callout labeled "toolbar" that points to the toolbar
    or 
    --use a callout labelled "A" and have a caption underneath the picture that says: A) toolbar.
    What I would like to do is have text underneath the picture such as:
    "The major features of the interface shown above are:
    toolbar
    main menu
    status bar
    formatting menu"
    such that, when the user clicks one of the bullet items, the object becomes highlighted in the picture. The bullet list also needs to be translatable into Japanese. So, as far as I know, it can't be part of the swf file. Or can it?

  • MultiLanguage support for Oracle Text

    Hi,
    We are providing multi-language support in our application, so we are using the NVARCHAR2 datatype, and we want to provide a search option for that text. When I try to create an index on that column, it raises an error and I am unable to create the index. If I try to create an index on CHAR and VARCHAR2 columns, the index is created successfully and the search works fine.
    But according to our requirements, we must provide multi-language support in our application.
    Please suggest how to create indexes on the NVARCHAR2 datatype and provide the Oracle Text search features on that datatype.
    Waiting for your favourable reply.
    Thanks in advance.
    Regards,
    Anil.

    It isn't covered very well anywhere in the standard docs. There was a note I found on it at one point, but I can't find the number at the moment. The 10g new features for Text note has some info on it, though. I'll look again tomorrow.
    I did a lot of research on the world_lexer a few months back (I wrote some info on IR & Oracle Text for our newest Oracle Press book) and got some good info from the product team. One of the chapters covering Text will be made available for free, and there is a diagram in there of how the world_lexer processes text. I'll post the link to the forum when it is available.
    Since it doesn't require a language column, it attempts to auto-recognize the text... but not really the language. More like... the type of text it is. White-space-delimited languages like English or Spanish are easy to break into tokens. Japanese & Korean are another story: no white-space delimiter. Arabic is yet another story. Those are essentially the buckets that text is thrown into for breaking into tokens when using the world_lexer.
    So, where does German fall? It will be broken up similarly to English. But what about the special features (like alternate spelling) that are available? Nope, sorry. It doesn't know German from French. With the multi_lexer you define sub-lexers and tell Text which records are German or French. This means that you can use more of the language features with it. The world_lexer is much easier to implement and maintain, but it is a trade-off.
    Hope that defines it a little better.
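    A minimal multi_lexer sketch of that approach (table, column and preference names are invented), for a VARCHAR2 or CLOB text column and a "lang" column holding values such as 'german':
    begin
      ctx_ddl.create_preference('default_lexer', 'BASIC_LEXER');
      ctx_ddl.create_preference('german_lexer',  'BASIC_LEXER');
      ctx_ddl.set_attribute('german_lexer', 'composite',          'german');
      ctx_ddl.set_attribute('german_lexer', 'alternate_spelling', 'german');
      ctx_ddl.create_preference('global_lexer', 'MULTI_LEXER');
      ctx_ddl.add_sub_lexer('global_lexer', 'default', 'default_lexer');
      ctx_ddl.add_sub_lexer('global_lexer', 'german',  'german_lexer', 'ger');
    end;
    /
    CREATE INDEX docs_text_idx ON docs (text_col)
      INDEXTYPE IS ctxsys.context
      PARAMETERS ('lexer global_lexer language column lang');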

  • Compress KM documents in a ZIP file

    Hi everyone,
    Has someone already developed a program that can compress KM documents into a ZIP file?
    Thanks & regards
    Hassan

    GaryJSF wrote:
    Hi,
    I have code similar to the one below and it is working but I would like to add a directory to the zip file to do something like:
    test.zip
    Folder1
    file1
    file2
    Folder2
    file3
    Zip files don't use a folder structure per se; they use slashes to delimit path segments:
    ZipEntry dirEntry = new ZipEntry("Folder1/");
    ZipEntry fileEntry = new ZipEntry("Folder1/file1"); // etc.
    Try it, and if you have trouble, come back and ask insightful questions.

  • How does oracle text differentiate between various document formats?

    How does Oracle Text differentiate between text documents of various formats? Does it read binary headers, or is a file extension necessary?
    Please comment.

    Oracle uses the inso_filter for document filtering, as described in the documentation:
    http://download-west.oracle.com/docs/cd/B10501_01/text.920/a96518/afilsupt.htm#625110
    I did a little test (included below) in which I copied a .pdf file to a file with a .test extension, and Oracle Text was still able to index and search it, so apparently it does not need the file extension and must read the header.
    scott@10gXE> BEGIN
      2   CTX_DDL.CREATE_PREFERENCE ('test_datastore', 'FILE_DATASTORE');
      3   CTX_DDL.SET_ATTRIBUTE ('test_datastore', 'PATH', 'c:\oracle');
      4  END;
      5  /
    PL/SQL procedure successfully completed.
    scott@10gXE> CREATE TABLE test_tab
      2    (id        NUMBER,
      3       docs        VARCHAR2 (2000),
      4       CONSTRAINT test_tab_id_pk PRIMARY KEY (id))
      5  /
    Table created.
    scott@10gXE> INSERT INTO test_tab VALUES (1, 'master~1.pdf')
      2  /
    1 row created.
    scott@10gXE> CREATE INDEX test_tab_idx ON test_tab (docs)
      2  INDEXTYPE IS CTXSYS.CONTEXT
      3  PARAMETERS
      4    ('DATASTORE test_datastore
      5        FILTER    CTXSYS.INSO_FILTER')
      6  /
    Index created.
    scott@10gXE> SELECT id FROM test_tab
      2  WHERE CONTAINS (docs, 'meat') > 0
      3  /
            ID                                                                     
             1                                                                     
    scott@10gXE>
    scott@10gXE> DROP INDEX test_tab_idx
      2  /
    Index dropped.
    scott@10gXE> HOST COPY c:\oracle\master~1.pdf c:\oracle\master.test
    scott@10gXE> INSERT INTO test_tab VALUES (2, 'master.test')
      2  /
    1 row created.
    scott@10gXE> CREATE INDEX test_tab_idx ON test_tab (docs)
      2  INDEXTYPE IS CTXSYS.CONTEXT
      3  PARAMETERS
      4    ('DATASTORE test_datastore
      5        FILTER    CTXSYS.INSO_FILTER')
      6  /
    Index created.
    scott@10gXE> SELECT id FROM test_tab
      2  WHERE CONTAINS (docs, 'meat') > 0
      3  /
            ID                                                                     
             1                                                                     
             2                                                                     
    scott@10gXE>

  • Oracle text for italian language document

    How can I set up an Oracle Text index to index an Italian text field?
    How can I set the right stoplist, lexer, .....?
    Thanks

    I believe if your NLS_LANG settings are set appropriately for Italian, it should automatically use the proper defaults for Italian in a text index.
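    If you prefer to be explicit rather than rely on the NLS defaults, a small sketch (preference, table and column names are invented):
    begin
      ctx_ddl.create_preference('italian_lexer', 'BASIC_LEXER');
      ctx_ddl.set_attribute('italian_lexer', 'base_letter', 'YES');  -- fold accented letters, e.g. città -> citta
    end;
    /
    CREATE INDEX articoli_ctx ON articoli (testo)
      INDEXTYPE IS ctxsys.context
      PARAMETERS ('lexer italian_lexer');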

  • Pre-loading Oracle text in memory with Oracle 12c

    There is a white paper from Roger Ford that explains how to load the Oracle Text index into memory: http://www.oracle.com/technetwork/database/enterprise-edition/mem-load-082296.html
    In our application, on Oracle 12c, we are indexing a big XML field (stored as XMLType with SecureFile storage) with the PATH_SECTION_GROUP. If I don't load the $I table (DR$..$I) into memory using the technique explained in the white paper, then I cannot get decent performance (and especially not predictable performance; it looks like if the blocks from the TOKEN_INFO column are not in memory, performance can fall sharply).
    But after migrating to Oracle 12c I got a different problem, which I can reproduce: when I create the index, it is relatively small (as seen with ctx_report.index_size), and by applying the technique from the white paper I can pin the DR$ I table into memory. But as soon as I do a ctx_ddl.optimize_index('Index','REBUILD'), the size becomes much bigger and I can't pin the index in memory anymore. Not sure if it is a bug or not.
    What I found as work-around is to build the index with the following storage options:
    ctx_ddl.create_preference('TEST_STO','BASIC_STORAGE');
    ctx_ddl.set_attribute ('TEST_STO', 'BIG_IO', 'YES' );
    ctx_ddl.set_attribute ('TEST_STO', 'SEPARATE_OFFSETS', 'NO' );
    so that the token_info column will be stored in a secure file. Then I can change the storage of that column to put it in the keep buffer cache, and write a procedure that reads the LOB so that it gets loaded into the keep cache. The size of the LOB column is more or less the same as when creating the index without the BIG_IO option, but it remains constant even after a ctx_ddl.optimize_index. The procedure that reads the LOB and loads it into the cache is very similar to the loaddollarR procedure from the white paper.
    Because of the SDATA section, there is a new DR table (the $S table) and an IOT on top of it. This is not documented in the white paper (which was written for Oracle 10g). In my case this DR$ S table is heavily used, and the IOT as well, but putting it in the keep cache is not as important as the token_info column of the DR$ I table. A final note: SEPARATE_OFFSETS = 'YES' was very bad in my case; the combined size of the two columns is much bigger than having only the TOKEN_INFO column, and both columns are read.
    Here is an example of how to reproduce the problem of the size increasing when doing ctx_ddl.optimize_index:
    1. create the table
    drop table test;
    CREATE TABLE test
    (ID NUMBER(9,0) NOT NULL ENABLE,
     XML_DATA XMLTYPE)
    XMLTYPE COLUMN XML_DATA STORE AS SECUREFILE BINARY XML (tablespace users disable storage in row);
    2. insert a few records
    insert into test values(1,'<Book><TITLE>Tale of Two Cities</TITLE>It was the best of times.<Author NAME="Charles Dickens"> Born in England in the town, Stratford_Upon_Avon </Author></Book>');
    insert into test values(2,'<BOOK><TITLE>The House of Mirth</TITLE>Written in 1905<Author NAME="Edith Wharton"> Wharton was born to George Frederic Jones and Lucretia Stevens Rhinelander in New York City.</Author></BOOK>');
    insert into test values(3,'<BOOK><TITLE>Age of innocence</TITLE>She got a prize for it.<Author NAME="Edith Wharton"> Wharton was born to George Frederic Jones and Lucretia Stevens Rhinelander in New York City.</Author></BOOK>');
    3. create the text index
    drop index i_test;
      exec ctx_ddl.create_section_group('TEST_SGP','PATH_SECTION_GROUP');
    begin
      CTX_DDL.ADD_SDATA_SECTION(group_name => 'TEST_SGP', 
                                section_name => 'SData_02',
                                tag => 'SData_02',
                                datatype => 'varchar2');
    end;
    /
    exec ctx_ddl.create_preference('TEST_STO','BASIC_STORAGE');
    exec  ctx_ddl.set_attribute('TEST_STO','I_TABLE_CLAUSE','tablespace USERS storage (initial 64K)');
    exec  ctx_ddl.set_attribute('TEST_STO','I_INDEX_CLAUSE','tablespace USERS storage (initial 64K) compress 2');
    exec  ctx_ddl.set_attribute ('TEST_STO', 'BIG_IO', 'NO' );
    exec  ctx_ddl.set_attribute ('TEST_STO', 'SEPARATE_OFFSETS', 'NO' );
    create index I_TEST
      on TEST (XML_DATA)
      indextype is ctxsys.context
      parameters('
        section group   "TEST_SGP"
        storage         "TEST_STO"
      ') parallel 2;
    4. check the index size
    select ctx_report.index_size('I_TEST') from dual;
    it says :
    TOTALS FOR INDEX TEST.I_TEST
    TOTAL BLOCKS ALLOCATED:                                                104
    TOTAL BLOCKS USED:                                                      72
    TOTAL BYTES ALLOCATED:                                 851,968 (832.00 KB)
    TOTAL BYTES USED:                                      589,824 (576.00 KB)
    5. optimize the index
    exec ctx_ddl.optimize_index('I_TEST','REBUILD');
    and now recompute the size, it says
    TOTALS FOR INDEX TEST.I_TEST
    TOTAL BLOCKS ALLOCATED:                                               1112
    TOTAL BLOCKS USED:                                                    1080
    TOTAL BYTES ALLOCATED:                                 9,109,504 (8.69 MB)
    TOTAL BYTES USED:                                      8,847,360 (8.44 MB)
    which shows that it went from 576KB to 8.44MB. With a big index the difference is not so big, but still from 14G to 19G.
    6. Workaround: use the BIG_IO option, so that the token_info column of the DR$ I table will be stored in a secure file and the size will stay relatively small. Then you can load this column into the cache using a procedure similar to the one below, after changing the storage:
    alter table DR$I_TEST$I storage (buffer_pool keep);
    alter table dr$i_test$i modify lob(token_info) (cache storage (buffer_pool keep));
    rem: now we must read the lob so that it will be loaded into the keep buffer pool; use the procedure below
    create or replace procedure loadTokenInfo is
      type c_type is ref cursor;
      c2 c_type;
      s varchar2(2000);
      b blob;
      buff raw(100);    -- dbms_lob.read on a BLOB returns RAW
      siz number;
      off number;
      cntr number;
    begin
        s := 'select token_info from  DR$i_test$I';
        open c2 for s;
        loop
           fetch c2 into b;
           exit when c2%notfound;
           siz := 10;
           off := 1;
           cntr := 0;
           if dbms_lob.getlength(b) > 0 then
             begin
               loop
                 dbms_lob.read(b, siz, off, buff);
                 cntr := cntr + 1;
                 off := off + 4096;
               end loop;
             exception when no_data_found then
               if cntr > 0 then
                 dbms_output.put_line('4K chunks fetched: '||cntr);
               end if;
             end;
           end if;
        end loop;
        close c2;
    end;
    /
    Rgds, Pierre

    I have been working a lot on that issue recently, so I can give some more info.
    First, I totally agree with you: I don't like to use the keep pool and I would love to avoid it. On the other hand, we have a specific use case: 90% of the activity in the DB is done by queuing and dbms_scheduler jobs where response time does not matter. All those processes are probably filling the buffer cache. We have a customer-facing application that uses the text index to search the database: performance is critical for them.
    What kind of performance do you have with your application ?
    In my case, I have learned the hard way that having the index in memory (the DR$I table, in fact) is the key: if it is not, then performance is poor. I find it reasonable to pin the DR$I table in memory, and if you look at competitors this is what they do. MongoDB explicitly says that the index must be in memory. Elasticsearch uses JVMs that are also in memory. And effectively, if you look at the AWR report, you will see that Oracle is continuously accessing the DR$I table; there is a SQL statement similar to
    SELECT /*+ DYNAMIC_SAMPLING(0) INDEX(i) */    
    TOKEN_FIRST, TOKEN_LAST, TOKEN_COUNT, ROWID    
    FROM DR$idxname$I
    WHERE TOKEN_TEXT = :word AND TOKEN_TYPE = :wtype    
    ORDER BY TOKEN_TEXT,  TOKEN_TYPE,  TOKEN_FIRST
    which is continuously done.
    I think that the algorithm used by Oracle to keep blocks in the cache is too complex. I just realized that in 12.1.0.2 (released last week) there is finally a "killer" feature, the In-Memory option, with which you can pin tables or columns in memory, with compression, etc. This looks ideal for the text index; I hope that R. Ford will finally update his white paper :-)
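    Purely speculative, but with the 12.1.0.2 In-Memory option (separately licensed, and INMEMORY_SIZE must be set) the idea would be something like:
    -- ask the column store to populate the $I token table
    ALTER TABLE DR$I_TEST$I INMEMORY MEMCOMPRESS FOR QUERY LOW PRIORITY HIGH;
    -- note: out-of-line LOB data such as TOKEN_INFO may not be populated by the
    -- column store, so whether this really replaces the keep-pool trick needs testing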
    But my other problem was that optimize_index in REBUILD mode caused the DR$I table to double in size: it seems crazy, but this was closed as not a bug, and I can't do anything about it. It is a bug in my opinion, because the create index command and the "alter index rebuild" command both result in a much smaller index, so why would the people who developed the optimize function (is it another team, using another algorithm?) make the index two times bigger?
    So the track I have been following is to put the index in a 16K tablespace: in this case the space used by the index remains more or less flat (it increases, but much more reasonably). The difficulty here is pinning the index in memory, because R. Ford's trick was not working anymore.
    What worked:
    First, set the keep pool to zero and set db_16k_cache_size instead. Then change the storage preference to make sure that everything you want to cache (mostly the DR$I table) ends up in the tablespace with the non-standard block size of 16K.
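    For example, re-pointing the storage preference used above (the 16K tablespace name is invented and must already exist, with db_16k_cache_size set):
    exec ctx_ddl.set_attribute('TEST_STO', 'I_TABLE_CLAUSE', 'tablespace TS_CTX_16K storage (initial 1M)');
    exec ctx_ddl.set_attribute('TEST_STO', 'I_INDEX_CLAUSE', 'tablespace TS_CTX_16K storage (initial 1M) compress 2');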
    Then comes the tricky part: pre-loading the data into the buffer cache. The problem is that with Oracle 12c, Oracle will use direct path reads for full table scans, which basically means that it bypasses the cache and reads directly from the files into the PGA! There is an event to avoid that; I was lucky to find it on a blog (I can't remember which one, sorry for the missing credit).
    I ended up doing the following; event 10949 is set to avoid the direct path read issue.
    alter session set events '10949 trace name context forever, level 1';
    alter table DR#idxname0001$I cache;
    alter table DR#idxname0002$I cache;
    alter table DR#idxname0003$I cache;
    SELECT /*+ FULL(ITAB) CACHE(ITAB) */ SUM(TOKEN_COUNT),  SUM(LENGTH(TOKEN_INFO)) FROM DR#idxname0001$I ITAB;
    SELECT /*+ FULL(ITAB) CACHE(ITAB) */ SUM(TOKEN_COUNT),  SUM(LENGTH(TOKEN_INFO)) FROM DR#idxname0002$I ITAB;
    SELECT /*+ FULL(ITAB) CACHE(ITAB) */ SUM(TOKEN_COUNT),  SUM(LENGTH(TOKEN_INFO)) FROM DR#idxname0003$I ITAB;
    SELECT /*+ INDEX(ITAB) CACHE(ITAB) */  SUM(LENGTH(TOKEN_TEXT)) FROM DR#idxname0001$I ITAB;
    SELECT /*+ INDEX(ITAB) CACHE(ITAB) */  SUM(LENGTH(TOKEN_TEXT)) FROM DR#idxname0002$I ITAB;
    SELECT /*+ INDEX(ITAB) CACHE(ITAB) */  SUM(LENGTH(TOKEN_TEXT)) FROM DR#idxname0003$I ITAB;
    It worked. Much relieved, I expected to be done, but there was one last surprise. The command
    exec ctx_ddl.optimize_index(idx_name=>'idxname',part_name=>'partname',optlevel=>'REBUILD');
    gave the following:
    ERROR at line 1:
    ORA-20000: Oracle Text error:
    DRG-50857: oracle error in drftoptrebxch
    ORA-14097: column type or size mismatch in ALTER TABLE EXCHANGE PARTITION
    ORA-06512: at "CTXSYS.DRUE", line 160
    ORA-06512: at "CTXSYS.CTX_DDL", line 1141
    ORA-06512: at line 1
    This is almost exactly what is described in MetaLink note 1645634.1, but for a non-partitioned index. The work-around given there seemed very logical, but it did not work in the case of a partitioned index. After experimenting, I found out that the bug occurs when the partitioned index is created with the dbms_pclxutil.build_part_index procedure (which enables intra-partition parallelism in the index creation process). This is a very annoying bug; maybe there is a work-around, but I did not find one on MetaLink.
    Other points of attention with text index creation (things that surprised me at first!):
    - if you use the dbms_pclxutil package, the ctx_output logging does not work, because the index is created immediately and then populated in the background via dbms_jobs.
    - in combination with the fact that, on a RAC, you may not see any activity on the local box, this can be very frightening: Oracle can choose to start the workers on the other node.
    I now understand much better how text indexing works, and I think it is a great technology that can scale via partitioning. But as always, the design of the application is crucial: most of our problems come from the fact that we did not choose the right sectioning (we chose PATH_SECTION_GROUP while XML_SECTION_GROUP is so much better IMO). Maybe later I can convince the devs to change the sectioning, especially because SDATA and MDATA sections are not supported with PATH_SECTION_GROUP (although it seems to work, even though we had one occurrence of a bad result linked to the existence of SDATA in the index definition). Also, the whole problem of mixed structured/unstructured searches is completely addressed if one uses XML_SECTION_GROUP with MDATA/SDATA (but of course the app was written for Oracle 10...).
    Regards, Pierre
