Better word-boundary matches in text searches
Hi All,
The performance I've had from contains() searches in text-laden elements indexed with node-element-substring-string has been good, but I read somewhere else on this forum that regex searches using matches() don't make use of the same optimisations. So, as I found, performance plummets when you try approximating word-boundary or start-of-word matches with things like
//textyElement[matches(., "\bKEYWORD\b")]
or
//textyElement[matches(., "\bKEYWORD")]
The solution (at least as it applied in my case) was pretty obvious, but not as immediately obvious as I'd have liked it to be, so I thought I'd post it here for those others who are fairly new to XQuery and haven't found a better solution. Put a contains() first and get the benefits of its optimisations for literal searches...
//textyElement[contains(., "KEYWORD") and matches(., "\bKEYWORD")]
Of course there may be other/better ways of doing this -- and if there are I'd love to hear about them -- but on a pure performance level this took my test query from around 220 sec back down to around 28 ms, and such news is too good to keep to myself.
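For anyone who wants to see the shape of the trick outside XQuery, here is a minimal Python sketch of the same idea: a cheap literal test runs first and the regex only runs on candidates that pass it. The function name and sample strings are made up for illustration, and Python has no index behind the literal test, so this shows the ordering of the two checks rather than the database speed-up itself.

```python
import re

def find_word_start_matches(texts, keyword):
    """Return the texts in which `keyword` begins at a word boundary."""
    pattern = re.compile(r"\b" + re.escape(keyword))
    hits = []
    for text in texts:
        # Literal containment is checked first; `and` short-circuits, so the
        # regex is only evaluated for texts that contain the keyword at all.
        if keyword in text and pattern.search(text):
            hits.append(text)
    return hits

docs = ["a KEYWORD here", "noKEYWORD here", "nothing relevant"]
print(find_word_start_matches(docs, "KEYWORD"))
```

The second sample string contains the keyword but not at the start of a word, so it passes the literal test and is then rejected by the regex.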
Tim,
Thank you very much for sharing your tip. It's a great idea for handling many types of regular expressions.
Regards,
George
Similar Messages
-
How to execute exact match & contains text search simultaneously in Oracle 10g
Hi,
We have a scenario with more than 50 million rows in a table whose description column holds up to 1000 characters. From a web interface we generate a rule of comma-separated keywords such as
"Standard", Single, Cancel, "deal" and so on. Words in quotes must be checked for an exact (whole-word) match; words without quotes are searched as substrings (contains).
The problem is that a rule can be as large as 4000 characters of inclusion keywords and 2000 characters of exclusion keywords (a description that matches an exclusion keyword is not to be considered). When such a search runs against the table with millions of rows using Oracle regular expressions it does not work, although it does work with a smaller number of search keywords.
Is there a better way to do this kind of search in Oracle or, if not, outside Oracle using some other tool?
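As one possible direction, here is a small Python sketch of the rule semantics described above: quoted keywords match as whole words, unquoted keywords match as substrings, and a rule is satisfied when any one of its keywords matches. All names (parse_rule, build_matcher, matches) are hypothetical. The sketch compiles each rule into a single pattern once, instead of looping over keywords for every row; whether that is fast enough for 50 million rows is untested.

```python
import re

def parse_rule(rule):
    """Split a comma-separated rule into (whole_words, substrings), lowercased."""
    exact, partial = [], []
    for raw in rule.split(","):
        term = raw.strip()
        if not term:
            continue
        if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
            exact.append(term[1:-1].lower())
        else:
            partial.append(term.lower())
    return exact, partial

def build_matcher(rule):
    """Compile one alternation for the whole rule; None means 'no constraint'."""
    exact, partial = parse_rule(rule)
    # [^\W_] is "any letter or digit"; the lookarounds require that the
    # keyword is not preceded or followed by one.
    parts = [r"(?<![^\W_])" + re.escape(w) + r"(?![^\W_])" for w in exact]
    parts += [re.escape(p) for p in partial]
    return re.compile("|".join(parts)) if parts else None

def matches(description, include, exclude=None):
    """True if the include rule matches (or is absent) and the exclude rule does not."""
    text = description.lower()
    if include is not None and not include.search(text):
        return False
    if exclude is not None and exclude.search(text):
        return False
    return True

include = build_matcher('"single", cancel')
exclude = build_matcher('"deal"')
print(matches("Single Room | FREE cancellation", include, exclude))
```

Keywords are escaped with re.escape, so a keyword containing regex metacharacters is treated literally.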
Thanks,
AP
Please find below the table script, a few insert statements, and the SP and function. Please help.
-- Create Table
CREATE TABLE Roomdescriptionmaster
(
ID NUMBER,
ROOMDESCRIPTION NVARCHAR2(1000), --- max 1000 characters
Createddate DATE
);
----- Insert statements
INSERT INTO ROOMDESCRIPTION (ID, ROOMDESCRIPTION, Createddate ) VALUES (1, 'Double Room (2 Adults + 2 Children) | FREE cancellation before Mar 16, 2014 PAY LATER All-Inclusive [ Included:10 % VAT] Meals:All meals and select beverages are included in the room rate.Cancellation:If canceled or modified up to 2 days before date of arrival,no fee will be charged.If canceled or modified later,100 percent of the first two nights will be charged.In case of no-show, the total price of the reservation will be charged. Prepayment:No deposit will be charged', sysdate)
INSERT INTO ROOMDESCRIPTION (ID, ROOMDESCRIPTION, Createddate ) VALUES (1, 'Double or Twin Room | FREE cancellation before Feb 1, 2014 PAY LATER All-Inclusive [ Included:10 % VAT] Meals:All meals and select beverages are included in the room rate.Cancellation:If canceled or modified up to 2 days before date of arrival,no fee will be charged.If canceled or modified later,100 percent of the first two nights will be charged.In case of no-show, the total price of the reservation will be charged. Prepayment:No deposit will be charged', sysdate)
INSERT INTO ROOMDESCRIPTION (ID, ROOMDESCRIPTION, Createddate ) VALUES (1, 'Quadruple Room (3 Adults + 1 Child) | FREE cancellation before Mar 16, 2014 PAY LATER Full board included [ Included:10 % VAT] Meals:Breakfast, lunch & dinner included.Cancellation:If canceled or modified up to 2 days before date of arrival,no fee will be charged.If canceled or modified later,100 percent of the first two nights will be charged.In case of no-show, the total price of the reservation will be charged. Prepayment:No deposit will be charged', sysdate)
INSERT INTO ROOMDESCRIPTION (ID, ROOMDESCRIPTION, Createddate ) VALUES (1, 'Triple Room with Lateral Sea View (2 Adults + 1 Child) | FREE cancellation before Dec 6, 2013 PAY LATER All-Inclusive [ Included:10 % VAT] Meals:All meals and select beverages are included in the room rate.Cancellation:If canceled or modified up to 2 days before date of arrival,no fee will be charged.If canceled or modified later,100 percent of the first two nights will be charged.In case of no-show, the total price of the reservation will be charged. Prepayment:No deposit will be charged', sysdate)
INSERT INTO ROOMDESCRIPTION (ID, ROOMDESCRIPTION, Createddate ) VALUES (1, 'Single Room with Lateral Sea View | FREE cancellation before Dec 6, 2013 PAY LATER All-Inclusive [ Included:10 % VAT] Meals:All meals and select beverages are included in the room rate.Cancellation:If canceled or modified up to 2 days before date of arrival,no fee will be charged.If canceled or modified later,100 percent of the first two nights will be charged.In case of no-show, the total price of the reservation will be charged. Prepayment:No deposit will be charged', sysdate)
--SP
CREATE OR REPLACE PROCEDURE
SP_PGHGETROOMDESCRIPTION(v_BId number,
v_DaysOfData integer,
v_Incl1 nvarchar2,
v_Incl2 nvarchar2,
v_Incl3 nvarchar2,
v_Excl1 nvarchar2,
v_CurrentIndex integer,
v_RecordPerPage integer,
v_IndexMultiplier integer,
ref_recordset out sys_refcursor
) as
start_index integer;
end_index integer;
Incl1 nvarchar2(2000);
Incl2 nvarchar2(2000);
Incl3 nvarchar2(2000);
Excl1 nvarchar2(2000);
v_desc_utf_value VARCHAR2(10);
begin
v_desc_utf_value:= 'utf8';
if trim(v_incl1) is null then -- Oracle treats '' as NULL, so "= ''" is never true
--dbms_output.put_line('include 1 is null or blank');
Incl1 := '';
else
Incl1 := lower(v_Incl1);
end if;
if trim(v_incl2) is null then -- Oracle treats '' as NULL, so "= ''" is never true
--dbms_output.put_line('include 2 is null or blank');
Incl2 := '';
else
Incl2 := lower(v_Incl2);
end if;
if trim(v_incl3) is null then -- Oracle treats '' as NULL, so "= ''" is never true
--dbms_output.put_line('include 3 is null or blank');
Incl3 := '';
else
Incl3 := lower(v_Incl3);
end if;
if trim(v_Excl1) is null then -- Oracle treats '' as NULL, so "= ''" is never true
--dbms_output.put_line('Exclude 1 is null or blank');
Excl1 := '';
else
Excl1 := lower(v_Excl1);
end if;
-- Old code
-- and regexp_like(lower(ROOMDESCRIPTION), Incl1, 'i')
-- and regexp_like(lower(ROOMDESCRIPTION), Incl2, 'i')
-- and regexp_like(lower(ROOMDESCRIPTION), Incl3, 'i')
-- and not regexp_like(lower(ROOMDESCRIPTION), Excl1, 'i')
--- First call to SP
if v_CurrentIndex = 1 then
start_index := v_RecordPerPage * v_IndexMultiplier;
end_index := (v_CurrentIndex - 1 + v_IndexMultiplier) * v_RecordPerPage;
open ref_recordset for
select * from (
select ROOMDESCRIPTION, Createddate, rownum as rn
from roomdescriptionmaster
where BID = v_BId
and TO_NUMBER(trunc(sysdate) - to_date(to_char(createddate, 'yyyy-mm-dd'),'yyyy-mm-dd')) <= v_DaysOfData
and length(FN_GET_RESTRICTION(lower(ROOMDESCRIPTION),Incl1,Incl2,Incl3,Excl1,v_desc_utf_value)) > 0
)
-- the rn alias can only be filtered in the outer query
where rn <= v_RecordPerPage * v_IndexMultiplier
order by rn;
else
--- Subsequent calls to SP using paging from UI
start_index := (v_CurrentIndex - 1) * v_RecordPerPage + 1;
end_index := (v_CurrentIndex - 1 + v_IndexMultiplier) * v_RecordPerPage;
open ref_recordset for
select * from (
select ROOMDESCRIPTION, Createddate, rownum as rn
from roomdescriptionmaster
where BID = v_BId
and TO_NUMBER(trunc(sysdate) - to_date(to_char(createddate, 'yyyy-mm-dd'),'yyyy-mm-dd')) <= v_DaysOfData
and length(FN_GET_RESTRICTION(lower(ROOMDESCRIPTION),Incl1,Incl2,Incl3,Excl1,v_desc_utf_value)) > 0
-- note: rownum is assigned before this ORDER BY is applied
order by roomdescriptionmasterid desc
)
where rn >= start_index
and rn <= end_index
order by rn;
end if;
commit;
end SP_PGHGETROOMDESCRIPTION;
--Function
CREATE OR REPLACE FUNCTION FN_GET_RESTRICTION(
v_rate_description IN NVARCHAR2,
v_include_1 IN NVARCHAR2,
v_include_2 IN NVARCHAR2,
v_include_3 IN NVARCHAR2,
v_exclude IN NVARCHAR2, v_desc_utf_value IN VARCHAR2)
RETURN NVARCHAR2
IS
CURSOR include_1_cur IS
select regexp_substr(str, '[^,]+', 1, level) str
from (select v_include_1 str from dual)
connect by level <= length(str)-length(replace(str,','))+1;
CURSOR include_2_cur IS
select regexp_substr(str, '[^,]+', 1, level) str
from (select v_include_2 str from dual)
connect by level <= length(str)-length(replace(str,','))+1;
CURSOR include_3_cur IS
select regexp_substr(str, '[^,]+', 1, level) str
from (select v_include_3 str from dual)
connect by level <= length(str)-length(replace(str,','))+1;
CURSOR exclude_cur IS
select regexp_substr(str, '[^,]+', 1, level) str
from (select v_exclude str from dual)
connect by level <= length(str)-length(replace(str,','))+1;
include_1_rec include_1_cur%rowtype;
include_2_rec include_2_cur%rowtype;
include_3_rec include_3_cur%rowtype;
exclude_rec exclude_cur%rowtype;
tmp_var NVARCHAR2(200);
tmp_var_int NUMBER;
tmp_flag_int NUMBER;
return_str NVARCHAR2(200);
tmp_length NUMBER;
tmp_length_include_1 NUMBER;
tmp_length_include_2 NUMBER;
tmp_length_include_3 NUMBER;
tmp_length_exclude NUMBER;
tmp_regex_pattern VARCHAR2(1000);
flag_include_1_match INTEGER;
flag_include_2_match INTEGER;
flag_include_3_match INTEGER;
flag_exclude_match INTEGER;
BEGIN
tmp_length_include_1 := nvl(length(v_include_1),0);
tmp_length_include_2 := nvl(length(v_include_2),0);
tmp_length_include_3 := nvl(length(v_include_3),0);
tmp_length_exclude := nvl(length(v_exclude),0);
flag_include_1_match := 0;
flag_include_2_match := 0;
flag_include_3_match := 0;
flag_exclude_match := 0;
IF tmp_length_include_1>0 OR tmp_length_include_2 >0
OR tmp_length_include_3 >0 OR tmp_length_exclude >0 THEN
IF v_desc_utf_value ='utf8' THEN
----------------------------------------------------- UTF 8 STARTED --------------
----------------------------------------- INCLUDE 1
tmp_length := tmp_length_include_1;
IF tmp_length > 0 THEN
tmp_flag_int :=0;
FOR include_1_rec in include_1_cur
LOOP
tmp_var := trim('' || include_1_rec.str);
--dbms_output.put_line(tmp_var);
tmp_regex_pattern := '[^[:alnum:]]'||tmp_var||'[^[:alnum:]]|^'||tmp_var||'$|^'||tmp_var||'[^[:alnum:]]|[^[:alnum:]]'||tmp_var||'$';
tmp_var_int := nvl(regexp_instr(v_rate_description,tmp_regex_pattern,1,1),0);
IF (tmp_var_int <> 0) THEN
tmp_flag_int :=1;
flag_include_1_match := 1;
EXIT;
END IF;
END LOOP;
ELSE
flag_include_1_match := 1;
END IF;
-------------------------------------------- INCLUDE 2
tmp_length := tmp_length_include_2;
IF tmp_length > 0 THEN
tmp_flag_int :=0;
IF flag_include_1_match =1 THEN
FOR include_2_rec in include_2_cur
LOOP
tmp_var := trim('' || include_2_rec.str);
tmp_regex_pattern := '[^[:alnum:]]'||tmp_var||'[^[:alnum:]]|^'||tmp_var||'$|^'||tmp_var||'[^[:alnum:]]|[^[:alnum:]]'||tmp_var||'$';
tmp_var_int := nvl(regexp_instr(v_rate_description,tmp_regex_pattern,1,1),0);
IF (tmp_var_int <> 0) THEN
tmp_flag_int :=1;
flag_include_2_match := 1;
EXIT;
END IF;
END LOOP;
END IF;
ELSE
flag_include_2_match := 1;
END IF;
-------------------------------------------- INCLUDE 3
tmp_length := tmp_length_include_3;
IF tmp_length > 0 THEN
tmp_flag_int :=0;
IF flag_include_2_match =1 THEN
FOR include_3_rec in include_3_cur
LOOP
tmp_var := trim('' || include_3_rec.str);
tmp_regex_pattern := '[^[:alnum:]]'||tmp_var||'[^[:alnum:]]|^'||tmp_var||'$|^'||tmp_var||'[^[:alnum:]]|[^[:alnum:]]'||tmp_var||'$';
tmp_var_int := nvl(regexp_instr(v_rate_description,tmp_regex_pattern,1,1),0);
IF (tmp_var_int <> 0) THEN
tmp_flag_int :=1;
flag_include_3_match := 1;
EXIT;
END IF;
END LOOP;
END IF;
ELSE
flag_include_3_match := 1;
END IF;
-------------------------------------------- EXCLUDE
tmp_length := tmp_length_exclude;
IF tmp_length > 0 and flag_include_3_match =1 THEN
FOR exclude_rec in exclude_cur
LOOP
tmp_var := trim('' || exclude_rec.str);
tmp_regex_pattern := '[^[:alnum:]]'||tmp_var||'[^[:alnum:]]|^'||tmp_var||'$|^'||tmp_var||'[^[:alnum:]]|[^[:alnum:]]'||tmp_var||'$';
tmp_var_int := nvl(regexp_instr(v_rate_description,tmp_regex_pattern,1,1),0);
IF (tmp_var_int <> 0) THEN
tmp_flag_int := -1;
return_str := '';
EXIT;
END IF;
END LOOP;
END IF;
ELSE
----------------------------------------------------- UTF 16 STARTED --------------
----------------------------------------- INCLUDE 1
tmp_length := tmp_length_include_1;
IF tmp_length > 0 THEN
tmp_flag_int :=0;
FOR include_1_rec in include_1_cur
LOOP
tmp_var := trim('' || include_1_rec.str);
--dbms_output.put_line(tmp_var);
tmp_var_int := nvl(INSTR(v_rate_description,tmp_var,1,1),0);
IF (tmp_var_int <> 0) THEN
tmp_flag_int :=1;
flag_include_1_match := 1;
EXIT;
END IF;
END LOOP;
ELSE
flag_include_1_match := 1;
END IF;
-------------------------------------------- INCLUDE 2
tmp_length := tmp_length_include_2;
IF tmp_length > 0 THEN
tmp_flag_int :=0;
IF flag_include_1_match =1 THEN
FOR include_2_rec in include_2_cur
LOOP
tmp_var := trim('' || include_2_rec.str);
tmp_var_int := nvl(INSTR(v_rate_description,tmp_var,1,1),0);
IF (tmp_var_int <> 0) THEN
tmp_flag_int :=1;
flag_include_2_match := 1;
EXIT;
END IF;
END LOOP;
END IF;
ELSE
flag_include_2_match := 1;
END IF;
-------------------------------------------- INCLUDE 3
tmp_length := tmp_length_include_3;
IF tmp_length > 0 THEN
tmp_flag_int :=0;
IF flag_include_2_match =1 THEN
FOR include_3_rec in include_3_cur
LOOP
tmp_var := trim('' || include_3_rec.str);
tmp_var_int := nvl(INSTR(v_rate_description,tmp_var,1,1),0);
IF (tmp_var_int <> 0) THEN
tmp_flag_int :=1;
flag_include_3_match := 1;
EXIT;
END IF;
END LOOP;
END IF;
ELSE
flag_include_3_match := 1;
END IF;
-------------------------------------------- EXCLUDE
tmp_length := tmp_length_exclude;
IF tmp_length > 0 and flag_include_3_match =1 THEN
FOR exclude_rec in exclude_cur
LOOP
tmp_var := trim('' || exclude_rec.str);
tmp_var_int := nvl(INSTR(v_rate_description,tmp_var,1,1),0);
IF (tmp_var_int <> 0) THEN
tmp_flag_int := -1;
return_str := '';
EXIT;
END IF;
END LOOP;
END IF;
END IF;
IF tmp_flag_int = 1 THEN
return_str := 'truely matched';
ELSE
return_str := '';
END IF;
ELSE
return_str := '';
END IF;
return return_str;
EXCEPTION
WHEN OTHERS THEN
--dbms_output.put_line('Exception');
RAISE;
END FN_GET_RESTRICTION;
-
Problems using and configuring Oracle 10gR2 database full-text search
I am having problems trying to set up full-text indexing and search with Universal Content Management (UCM). I followed the Oracle Content Server Installation Guide for windows at [http://download-west.oracle.com/docs/cd/E10316_01/cs/cs_doc_10/documentation/integrator/install_cserver_win_10en.pdf].
What I did was:
1. Modify E:\oracle\ucm\server\config\config.cfg by adding SearchIndexerEngineName=DATABASE.FULLTEXT to the end of the file.
2. Restart the content server.
3. Rebuild the search indexing using Repository Manager.
However, I keep seeing the following error when I query by entering words in the "Full-Text Search" box.
Unable to retrieve search results. Unable to retrieve search results. Unable to create result set for query 'SELECT IdcColl1.dID, dDocName, dDocTitle, dDocType, dRevisionID, dSecurityGroup, dDocAuthor, dDocAccount, dRevLabel, dFormat, dOriginalName, dExtension, dWebExtension, dInDate, dOutDate, dCreateDate, dPublishType, dRendition1, dRendition2, VaultFileSize, WebFileSize, URL, dFullTextFormat, dFullTextCharset, DocMeta.*
FROM IdcColl1, DocMeta
WHERE IdcColl1.dID=DocMeta.dID AND (((CONTAINS(dDocFullText,'test') > 0 ))) ORDER BY dInDate Desc'. ORA-20000: Oracle Text error:
DRG-10599: column is not indexed
Some web searches suggested the following (I have tried all of them, and none resolved the problem).
1. Publish the schema using Configuration Manager (applet) and then rebuild index
2. Set the dDocFullText as a "zone field". This is not possible, because dDocFullText does not show up under the list of fields under "Database" or "DatabaseFullText" for the Search Engine drop down (when using Zone Fields Configuration).
3. Reboot the server (did not work either).
I logged onto the Oracle database and checked the IdcColl1 table. There is indeed no index on the field dDocFullText; there is only one index, on the field dID. The field dDocFullText is a BLOB. If I am supposed to create an index on this field manually, how would I do it? A web search has not been fruitful in answering this question.
Here are my server settings.
For UCM:
Operating System: Windows 2003 Enterprise
UCM : 10gR3
Memory: 1 GB
Web Server: Apache 2.2.11
For Oracle:
Operating System: Windows 2003 Enterprise
Oracle: 10gR2
Memory: 1 GB
Thanks.
I found out what the problem was. The problem was that I had to create the role, stellent_role, as described in the installation manual. After I created this role and assigned the database user to this role, a restart of the Content Server services and collection rebuild of the index fixed the problem.
However, I did notice one thing. I checked in 3 PDF files, and when I used Repository Manager to do a collection rebuild, I noticed that for Indexer Counters, the count for Full Text was 0 and the count for Meta Only was 3.
Does anyone have any ideas? Is there something else that I missed? From reading the installation manual, it was not clear how database full-text indexing/searching would handle PDF files.
-
Is it possible to ignore noise words conditionally in working with Full text search containstable
I have a question about the stoplist. I need to search for an exact phrase ("this is the incident") that contains noise words. During parsing, the full-text engine eliminates the noise words and searches on whatever remains of the phrase.
Say 10 rows in the full-text table contain the term "incident" and one row contains the exact phrase "this is the incident".
If we use containstable() to search for "this is the incident", we get 10 rows instead of 1.
To resolve the issue, we have three options:
1. Modify the stoplist file to remove the words (this, is, the).
2. Set stoplist = OFF.
3. Use an empty stoplist.
Apart from these, is there a better solution that does not touch the noise-word list?
Ideally the solution would let us ignore noise words in some searches and keep them in others.
Please provide your suggestions.
kkprasad
One question that I ask is: Why would I want to exclude noise words?
Noise words were created to limit the size of the full text indexes and avoid processing the many 'this', 'is', and 'the' common words. But the disadvantage of doing so is that you cannot find some things as you would like.
My feeling is that computers are more powerful and have more storage and it is often better to just index everything. As long as your search does not include 'the', then the large number of 'the's in the system will pretty much be ignored.
NOTE: If you change the noise words, including SET STOPLIST = OFF, you have to rebuild the index in order for it to implement your decision.
Of course, for very, very large full text indexes you would need to test.
Is your full text search on relation database columns, e.g. Description NVARCHAR(1000) or are you searching Word, Excel, and other more complex data?
If your full text is relational columns, it might be that you could:
1. Select only the fulltextkey into a temp table (e.g. #FTSfulltextkey) from the full text index using noise words. That would give you 10 rows.
2. Then directly query the table to find the string as you define above. (But remember that punctuation and symbols are generally ignored by Full Text Indexing, but would still be there in the string of text.)
SELECT *
FROM MyTextTable T
JOIN #FTSfulltextkey K
ON T.fulltextkey = k.fulltextkey
WHERE T.Description like '%this is the incident%'
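The punctuation caveat in step 2 can be illustrated with a short Python sketch. It assumes the candidate rows have already come back from the full-text index, and it normalizes punctuation on both sides before checking for the phrase. The function names are invented for the example, and treating punctuation as whitespace is an assumption about what "exact phrase" should mean; the LIKE query above would not match a row where punctuation interrupts the phrase.

```python
import re

def normalize(text):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()

def exact_phrase_rows(candidates, phrase):
    """candidates: (key, description) pairs already returned by the index."""
    # Pad with spaces so the phrase only matches on whole-word boundaries.
    wanted = " " + normalize(phrase) + " "
    return [key for key, desc in candidates
            if wanted in " " + normalize(desc) + " "]

rows = [
    (1, "This is the incident."),
    (2, "Another incident"),
    (3, "this is -- the incident, again"),
    (4, "this is the incidental"),
]
print(exact_phrase_rows(rows, "this is the incident"))
```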
Full text search is powerful, but it has limits. And the behaviour changes depending on the Language of the search.
RLF
-
Search for a phrase rather than a single word in speech analysis text?
Is it possible to search for a phrase rather than a single word in speech analysis text?
Did you try Apache POI?
It's here:
http://jakarta.apache.org/poi/
-
SQL Server Free Text Search with multiple search words inside a stored procedure
I am trying to do a free-text search. The search string is sent to a stored procedure, which executes the free-text search and returns the result.
If I search for red flag, I want to return the results that match both red and flag.
Below is the query I use to return the results.
select * from customer where FREETEXT (*, '"RED" and "flag"')
This doesn't give me the desired result. This one does:
select * from customer where FREETEXT (*, 'RED') AND FREETEXT (*, 'FLAG')
My problem is that, since this runs inside a stored procedure, I cannot build the where clause of the second query. I thought both queries should return the same result. Am I doing something wrong here?
I am moving it to Search.
Kalman Toth Database & OLAP Architect
-
Need sample code to do text search using boolean operators AND OR + -
I'm looking for an algorithm that does text searches in files.
I need it to support the AND, OR, + and - keywords (for example "ejb AND persistence").
Does anyone know where I can find this kind of algorithm with full source?
Of course I can adapt C or C++ to Java.
In fact my target language is server-side JavaScript (sorry), so I prefer rather low-level solutions!
Any help will be greatly appreciated, and the resulting code will be posted
here and on my website: http://www.tips4dev.com
Firstly, a little note on the technical solution: what you probably need most is speed. It may sound strange, but personally I am convinced that, if you can use system tools, a naive algorithm:
for i:=1 to m do
grep (word)
od;
whose complexity is O(m.n), where m is the number of words to be processed and n represents the size of the text to be searched, would in 99% of cases actually be much faster than any implementation of the algorithm below, whose complexity is O(m+n). The grep routine (O(n)) is already optimized, and m will be low (who queries 153 words at once?).
Anyway, you asked for an algorithm and you'll have it. It is quite elegant.
Aho, A.V. - Corasick, M.J.: Efficient String Matching: An Aid to Bibliographic Search, Communications of the ACM, 1975 (vol. 18), No. 6, pp. 333-340
The task: let's have an alphabet X, a string x = d1d2...dn (the d's are characters from X), and a set K = {y1, ... ym} of words, where yj = tj,1 ... tj,l(j) (the t's are again characters from X).
We search for all pairs <i, yp> where yp is a suffix of d1...di (that is, the occurrences of the word yp in x).
(Note: if you want to search for whole words, tj,1 and tj,l(j) must be blanks.)
The idea of the algorithm is that we first process the words yp to construct a search machine, and with this machine we loop through x once to search for occurrences of all the words at the same time.
Example:
K = {he, she, his, hers}
x = ushers
Search machine M (Q - set of states, g - "step forward" function, f - "step back" function, out - reporting function):
(function g)
state 0 (initial state) h-> state 1 e-> state 2 r-> state 8 s-> state 9 ... for {he, hers}
state 1 i-> state 6 s-> state 7 ... for {his}
state 0 s-> state 3 h-> state 4 e-> state 5 ... for {she}
For every other character d, g(0, d) = 0 is defined.
(function out)
state 2: report {he}
state 5: report {she, he}
state 7: report {his}
state 9: report {hers}
The "step back" function f for this particular set of words is:
9 -> 3, 7 -> 3, 5 -> 2, 4 -> 1; from every other state the machine returns to the initial state 0.
Processing of ushers looks like this (each pair is <characters read, state>):
<0,0> u - stay in state 0 <1,0> s - move to state 3 <2,3> h - move to state 4 <3,4> e - move to state 5 <4,5> state 5 - report {he, she}; r - cannot move forward, so step back to state 2 (as if only "he" had been received) and move to state 8 <5,8> s - move to state 9 <6,9> state 9 - report {hers}
Before we show how to construct the search machine M(Q, g, f, out), let's consider the algorithm that uses it:
Alg 1.
begin
state:= 0;
for i := 1 to n do
//if cannot move forward, move back
while g(state, di) not defined do state:=f(state) od;
//move forward to a new state
state:=g(state, di);
//report all the words represented by the new state
for all y from out(state) do Report(i,y) od;
od
end.
Alg 2. - build of the "step forward" function g and an auxiliary function o that will later be used for the construction of out
var q:integer;
procedure Enter(T1...Tm);
begin
s:=0; j:=1;
//processing a prefix of the new word that is also a prefix of an already processed word
while j<=m and g(s,Tj) defined do
s:=g(s,Tj); j:=j+1;
od;
//the remaining characters Tj...Tm each need a new state
while j<=m do
q:=q+1; //a new state - global variable
define g(s,Tj) = q; //definition of a single step forward
s:=q;
j:=j+1;
od;
//the last state must be a state in which at least the processed word is reported
define o(s) = [T1...Tm];
end;
begin
q:=0; //initial state
for p:= 1 to k do Enter(yp) od; //k = number of words in K
for all d from the alphabet X do
if g(0,d) not defined then define g(0,d) = 0 fi
od
end.
Alg 3. - build of the "step back" function f and the reporting function out
create an empty queue
define f(0) = 0; out(0) = {} //an empty set - we expect words of length 1 at least
for all d from X do
//process children of the initial state
s:=g(0,d);
if s!=0 then
define f(s) = 0; //1-character states: if we throw away the first character we return to the initial state
define out(s) = o(s); //report 1-character words, if any
append s to the end of the queue
fi
od
while queue not empty do
r:= the first member of the queue; remove r from the queue;
for all d from X do //process all the children of r
if g(r,d) defined then
s:= g(r,d); //get a child of r
t:= f(r); //f(r) has already been defined
while g(t,d) not defined do t:=f(t) od;
//we found a state from which g(t,d) makes sense
define f(s) = g(t,d);
define out(s) = o(s) UNION out(f(s));
append s to the end of the queue;
fi
od
od
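For readers who want something runnable, the three algorithms above can be sketched in Python roughly as follows. This is an illustrative sketch rather than the code from the paper: the names build_machine and search are made up here, dictionaries stand in for g, and a missing transition out of state 0 is treated as g(0,d) = 0 instead of being stored for every character of the alphabet.

```python
from collections import deque

def build_machine(words):
    """Build the goto (g), failure (f) and output (out) tables for a keyword set."""
    goto = [{}]        # goto[state][char] -> next state
    out = [set()]      # out[state] -> words reported in that state
    for word in words:                      # Alg 2: enter each word into the trie
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)

    fail = [0] * len(goto)                  # Alg 3: breadth-first "step back" links
    queue = deque(goto[0].values())         # children of state 0 step back to 0
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            t = fail[r]
            while t != 0 and ch not in goto[t]:
                t = fail[t]
            fail[s] = goto[t].get(ch, 0)
            out[s] |= out[fail[s]]
    return goto, fail, out

def search(text, machine):
    """Alg 1: return (end_position, word) pairs; positions are 1-based as in the text."""
    goto, fail, out = machine
    state, found = 0, []
    for i, ch in enumerate(text, start=1):
        while state != 0 and ch not in goto[state]:
            state = fail[state]             # cannot move forward, step back
        state = goto[state].get(ch, 0)      # move forward (or stay in state 0)
        for word in sorted(out[state]):
            found.append((i, word))
    return found

machine = build_machine(["he", "she", "his", "hers"])
print(search("ushers", machine))  # [(4, 'he'), (4, 'she'), (6, 'hers')]
```

With the words entered in the order he, she, his, hers, the states receive the same numbers as in the example above, so the tables can be compared by hand.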
Processing of a query - normal forms
Until now we have solved the problem of searching for multiple words in a text at once. The algorithm returns not only whether a word was found, but also where exactly it can be found: all the occurrences and their locations.
However, the initial task was slightly different: processing a query like "x contains (y1 AND/OR y2 ... yn)". To decide a question like that it might not be necessary to find all the occurrences of the given words, or even an occurrence of every word (e.g. word1 OR word2 is fulfilled as soon as either word1 or word2 is found).
Let's suppose that a search query is given in its disjunctive normal form (DNF):
A1 OR A2 OR ... Ak, where each Ax = B1 AND B2 AND ... Bkx and each B is a statement "x contains yp".
Now, the query is successful whenever any of the Ax is fulfilled.
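A small Python sketch of that evaluation, under the assumption that the query is already in DNF and is represented as a list of conjunctions, each a list of words. The function name contains_dnf is invented for the example, and the stream of words stands in for whatever the Report procedure of Alg. 1 delivers. The scan stops as soon as one conjunction has seen all of its words.

```python
def contains_dnf(reported_words, dnf):
    """Return True as soon as every word of some conjunction has been reported."""
    pending = [set(conjunction) for conjunction in dnf]  # words still missing
    for word in reported_words:
        for need in pending:
            need.discard(word)
            if not need:          # this conjunction is now fully satisfied
                return True
    return any(not need for need in pending)

# (ejb AND persistence) OR jdo
query = [["ejb", "persistence"], ["jdo"]]
print(contains_dnf(["java", "persistence", "ejb"], query))
```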
(I don't know how much you know about transforming a logical formula into its disjunctive normal form; it is quite a famous algorithm and can be found in any textbook on logic or NP-completeness. I hope that evaluation of the formula, which is what happens in the procedure Report of Alg. 1, is trivial.)
-
Inconsistent Full Text Search Results
I have built quite a comprehensive JavaHelp system, but seem to be having problems with the full text searching.
E.g. typing in "Start" brings back "Starting Transformation Manager" but not "StartsWith". Both HTML files seem to have the same structure.
I have been generating my help using the Helen software. I downloaded JavaHelp1.1.3 this afternoon and also generated the index database using jhindexer. This did not solve the problem.
Has anyone had a similar problem?
Helen
Hello...
I am at the point where I too have this problem of text searching. I put in hex and only exact matches of "words" of hex are displayed. "hexidecimal" is not.
How did you get around this problem? Any hints or suggestions would be greatly appreciated.
Thank you.
Mike
-
Multiple words don't work in search field
When I enter multiple keywords in any search field nothing shows up.
Let's say I'm looking for pink flowers. If I enter 'flowers' I see all my flower shots, as soon as I enter a comma, all photos disappear. If I go ahead and add 'pink', I still don't see anything. This happens in any search field whether in the browser or within a HUD.
If I use a HUD, add text fields and enter 'flowers' and 'pink', then I see pink flowers. If I use the keyword checklist in the HUD, I see pink flowers. But I can never use multiple words separated by commas.
What's wrong? I've checked the Aperture Manual and can't figure out why the multiple word method doesn't work.
I always use the keyword search panel in the HUD when I am looking for keyworded images...multiples work fine there. I suspect the "text" search field is parsed EXACTLY as it says, examining a string and looking for a match anywhere in the metadata. My suspicion is that the comma is being parsed as JUST THAT, a comma, rather than a search field separator.
My 2¢
cheers,
david
-
Different behaviour of word boundary pattern \b in JDK1.4 and JDK1.5
I noticed that the behaviour of the word boundary pattern '\b' in jdk1.5.0 beta 1 differs from the standard regular expression behaviour seen, for example, in jdk1.4.
There is simple test code BoundaryTest.java below:
import java.util.regex.*;
public class BoundaryTest {
    public static void main(String[] args) {
        String testString = new String("word1 word2 word3");
        System.out.println("Test string: " + testString);
        Pattern p = Pattern.compile("\\b");
        Matcher m = p.matcher(testString.subSequence(0,testString.length()));
        int position = 0;
        int start = 0;
        // Print the text between each pair of consecutive word boundaries
        while (m.find(position)){
            start = m.start();
            if (start == testString.length() ) break;
            if (m.find(start+1)){
                position = m.start();
            } else {
                position = testString.length();
            }
            System.out.println(testString.substring(start, position));
        }
    }
}
After compiling (in jdk1.5 or jdk1.4) one gets the following results:
>...\jdk1.4\bin\java BoundaryTest
Test string: word1 word2 word3
word1
word2
word3
And that is the usual behaviour, but in jdk1.5 we have:
>..\jdk1.5\bin\java BoundaryTest
Test string: word1 word2 word3
w
o
r
d
1
w
o
r
d
2
w
o
r
d
3
Seems that '\b' works just like '\w' in JDK1.5.
Is it a bug in the new JDK or just some new feature?
To be honest, I have no idea if that is the case. I am new to Java, so I only had 1.4.2 for about a week before removing it and installing 1.5.0 beta... But the code below now shows the output you desired...
import java.util.regex.*;
public class BoundaryTest {
    public static void main(String[] args) {
        String testString = new String("word1 word2 word3");
        System.out.println("Test string: " + testString);
        Pattern p = Pattern.compile("\\b");
        Matcher m = p.matcher(testString.subSequence(0,testString.length()));
        int position = 0;
        int start = 0;
        String datastring = ""; // initializes with nothing
        while (m.find(position)){
            start = m.start();
            if (start == testString.length() ) break;
            if (m.find(start+1)){
                position = m.start();
            } else {
                position = testString.length();
            }
            datastring += testString.substring(start,position); // adds to string
            System.out.println(datastring); // outputs legible string
        }
    }
}
Feel free to ignore anything I say on this matter since, as mentioned above, I cannot test it with an earlier version; that one is now gone...
-MaxxDmg...
- ' Aye, Its Bright out... Where me put me Ale...'
-
Full text search for web ? Yes or no ?
Hi,
I have a DB with more than 1.8 million records in a single table, and I would like to implement full-text search or some sort of caching for quicker web search.
Let me describe what I have. The table that holds the 1.8 million records is made of 30 CLOB columns, each holding text. These are alphabetic columns: words that start with 'A' are in the first CLOB, 'B' in the second, 'C' in the third, and so forth.
Searching is always done first by customerID and CreateDate, which are both indexed columns, and then the CLOBs are searched using instr.
The execution plan was good, but search times have started to increase.
So I would like to improve the search by implementing some sort of caching mechanism.
I read a lot about this and found an example where I would create a table of unique words and a table of word occurrences. But with 1.8 million articles of approximately 500 words each, and words repeating across articles, there would be fewer than 50,000 unique words (in our language), while the occurrences table would grow dramatically, because every word in an article needs a link in the occurrences table. That would be about 900 million records in one table.
Is it at all possible to have so many records in a single table and still keep it quick?
Is the Oracle Full text search the only right way in this situation ?
Any suggestions ? Did anyone implement anything like this ?
Thanks,
KrisLet's start with your Oracle version. Please specify which version you run because Text capabilities vary dramatically between releases.
>
I tried using Oracle Text as suggested ... now if I understand correctly ....
CTXCAT - would be great because when new records are added, index is updated automatically .... but doesn't support CLOBs ... so no go
>
CTXCAT is a concatenated transactional index that is supposed to optimize combined searches on text and other columns. No go for you as it indeed does not support CLOB columns.
>
CONTEXT - supports CLOBs, but I need to explicitly synchronize the index ....
There are like 4000 inserts per day ..... and they all need to be indexed in real time ...
>
Not true, at least since 10g: the SYNC (ON COMMIT) parameter makes this index type transactional (it is synchronized automatically on commit when this parameter is set).
>
If the CTX_DDL.SYNC_INDEX procedure synchronizes the whole table, which is now 1.8mil records, this can take a while ... so it can't be run after inserts ....
>
It does not, it only synchronizes changed data since last sync operation.
So CONTEXT is actually perfectly suited for your needs (just redesign those 30 columns into one document column and index it.) Note that you need to regularly maintain CONTEXT indexes by scheduling CTX_DDL.OPTIMIZE_INDEX to run at off-hours and purge stale/removed data and rebuild its own internal index bitmaps for better performance. Otherwise you will see performance degrade as changes to the indexed data accumulate. You might also want to tweak initial indexing parameters, especially MEMORY parameter, as it greatly affects resulting index fragmentation - the more memory you give for initial indexing or optimization, the less fragmented and the more performant the index will be all other things equal. -
Full-Text search is not working with PDF files - SQL Server 2012 64 bit
Hi,
We are in the process of storing PDF files in SQL Server 2012 with Full-Text search capability.
I followed the steps below and it works fine with Word documents but not with PDF files. I tried PDF iFilter 11 and 9, and both were unsuccessful.
Server/DB Level Settings:
1) Enable FileStream
2) Install Full-Text, then restart
3) Use [specific db]
alter database [db name] add filegroup Files contains filestream;
alter database [db name] add file (name = N'Files', filename = N'D:\SQL\DATA') to filegroup [Files];
3) Database level Settings:
FileStream:
FileStream Directory name: [Set the name]
FileStream non-transacted Access: [set Appropriate]
3a) Add a datafile to DB with filestreamdata filetype.
4) Share D:\SQL\DATA directory and add specific accounts with read/write access
5) Give bulkadmin access to those specific accounts at server level
6) From the page (link) download and install the *.pdf IFilter for FTS. Link:
http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542
7) To the PATH global system variable add path to the catalog, where you installed the plugin. Default for this version is:
C:\Program Files\Adobe\Adobe PDF iFilter 9 for 64-bit platforms\bin
8) From the page (link) download a FilterPackx64.exe and install it. Link:
http://www.microsoft.com/en-us/download/confirmation.aspx?id=20109
9) Now from SSMS execute the following procedures:
EXEC sp_fulltext_service 'load_os_resources', 1
EXEC sp_fulltext_service 'verify_signature', 0
EXEC sp_fulltext_service 'update_languages'; -- update language list
EXEC sp_fulltext_service 'restart_all_fdhosts'; -- restart daemon
reconfigure with override;
10) Restart the server
11) select document_type, path from sys.fulltext_document_types where document_type = '.pdf'
select document_type, path from sys.fulltext_document_types where document_type = '.docx'
12) Results are OK.
Following is my table/index/catalog script:
CREATE TABLE dbo.DocumentFilesTest
(
    DocumentId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    AddDate datetime NOT NULL,
    Name nvarchar(50) NOT NULL,
    Extension nvarchar(10) NOT NULL,
    Description nvarchar(1000) NULL,
    FileStream_Id UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWSEQUENTIALID(),
    FileSource varbinary(MAX) FILESTREAM DEFAULT(0x)
)
go
--Add default add date for document
ALTER TABLE dbo.DocumentFilesTest
    ADD CONSTRAINT DF_DocumentFilesTest_AddDate DEFAULT sysdatetime() FOR AddDate
EXEC sp_fulltext_database 'enable'
GO
IF NOT EXISTS (SELECT TOP 1 1 FROM sys.fulltext_catalogs WHERE name = 'Ducuments_Catalog_test')
BEGIN
    EXEC sp_fulltext_catalog 'Ducuments_Catalog_test', 'create', 'D:\SQL\PDFBlob';
END
--EXEC sp_fulltext_catalog 'Ducuments_Catalog_test', 'drop'
DECLARE @indexName nvarchar(255) = (SELECT Top 1 i.Name
    from sys.indexes i
    Join sys.tables t on i.object_id = t.object_id
    WHERE t.Name = 'DocumentFilesTest' AND i.type_desc = 'CLUSTERED')
PRINT @indexName
EXEC sp_fulltext_table 'DocumentFilesTest', 'create', 'Ducuments_Catalog_test', @indexName
EXEC sp_fulltext_column 'DocumentFilesTest', 'FileSource', 'add', 0, 'Extension'
EXEC sp_fulltext_table 'DocumentFilesTest', 'activate'
EXEC sp_fulltext_catalog 'Ducuments_Catalog_test', 'start_full'
ALTER FULLTEXT INDEX ON [dbo].[DocumentFilesTest] ENABLE
ALTER FULLTEXT INDEX ON [dbo].[DocumentFilesTest] SET CHANGE_TRACKING = AUTO
ALTER FULLTEXT CATALOG Ducuments_Catalog_test REBUILD WITH ACCENT_SENSITIVITY=OFF;
INSERT INTO DocumentFilesTest(Extension, Name, FileSource)
SELECT 'pdf', 'BOL12006553.pdf', *
FROM OPENROWSET(BULK 'd:\SQL\PDFBlob\BOL12006553.pdf', SINGLE_BLOB) AS BLOB;
GO
INSERT INTO DocumentFilesTest(Extension, Name, FileSource)
SELECT 'docx', 'test.docx', *
FROM OPENROWSET(BULK 'd:\SQL\PDFBlob\test.docx', SINGLE_BLOB) AS Document;
GO
SELECT d.* FROM dbo.DocumentFilesTest d WHERE Contains(d.FileSource, 'BILL')
Returns nothing. It should return the row for the PDF file.
SELECT d.* FROM dbo.DocumentFilesTest d WHERE Contains(d.FileSource, 'TEST')
Returns the Word document row as follows:
2 2014-06-04 10:11:41.393 test.docx docx NULL [BINARY Value] [Binary Value]
Any help is appreciated. It's been a long wait.
Thanks,
Vel
Vel Thavasi
Hello,
Did you check the full-text log files for more details about the errors? If the filter isn’t working, there should be errors in the error log file.
The following thread is about similar issue, please refer to:
http://social.msdn.microsoft.com/forums/sqlserver/en-US/69535dbc-c7ef-402d-a347-d3d3e4860d72/sql-server-2008-64bit-fulltext-indexing-pdf-not-working-cant-find-ifilter
Regards,
Fanny Liu
TechNet Community Support -
Apex 3.1, Interactive Report Row Text Search, image bitmap as TEXT?
I think this IR thing is powerful which could save me lots of time in development.
One question: does the row text search (default: all columns) treat an image column as regular text (a string)? I did the following search:
In SAMPLE APPLICATION --> Products, I put 300 in the search field (to search for a $300 list price). The search produces 3 lines (there should only be 2). The third line's list price is $1999. I looked at it in SQL*Plus and saw that its image bitmap (a long string) includes "300", so I believe the default all-columns search treats the image as a regular string.
How can I keep the image bitmap out of the IR search? These bitmap strings are very long for each image and can EASILY match search conditions for something like PRODUCT DESCRIPTION or PRODUCT PRICE in our product data (about 25000). Thanks
sean
Sean / Russell,
Thanks for reporting this, it's certainly a bug.
By the way, the search is performed in SQL, on whatever column values are being displayed (run the page in debug mode to see the full SQL). So in the case of the sample application, it is not matching the image bitmap, but the image size, which is selected in the SQL. The bug is that the full search should not include columns which have filtering disabled or one of the special image format masks. We'll try to fix this for an upcoming patch.
Thanks,
Marco -
Oracle text search - special characters issue
Hi.
I'm facing a real annoying problem with text search query, and everything I've tried failed...
I have a table with a varchar column indexed by text index. The column contains special characters like '&', ',' and mainly- '-'. Since I want to disregard these special characters for searches I have created a basic lexer of type skipjoins for the column index. So now, the phrase 'aaa-bbb something'. for example, can be searched without '-', like this: 'aaabbb'. But I want to make it possible for this phrase to be searched with and without '-'. So, that when the user enters 'aaabbb' he will get the same results as when he enters 'aaa-bbb'.
In other words, This condition:
WHERE CONTAINS(column, '<query> <textquery grammar="context"> <progression><seq>'
||'aaabbb'
||'</seq></progression> </textquery> </query> ' ,1)> 2
Will return the same results as this condition:
WHERE CONTAINS(r.POI_NAME, '<query> <textquery grammar="context"> <progression><seq>'
||'aaa-bbb'
||'</seq></progression> </textquery> </query> ' ,1)> 2
Since text query treats the '-' sign as a minus sign and searches for 'aaa' which doesn't contain 'bbb', the only way I found to fix this was to wrap the search text with {}. like this:
WHERE CONTAINS(r.POI_NAME, '<query> <textquery grammar="context"> <progression><seq>'
||'{aaa-bbb}'
||'</seq></progression> </textquery> </query> ' ,1)> 2
This all went very well, until I wanted to create a relaxation query. like this:
WHERE CONTAINS(r.POI_NAME, '<query> <textquery grammar="context"> <progression><seq>'
||'{aaab}'
||'</seq><seq>'
||'{aaab}'
||'%</seq></progression> </textquery> </query> ' ,1)> 2
In this case, I would expect the first part of the query to return no results (since it's not the whole word) but the second part, using '%' should have returned the record of 'aaa-bbb'. It doesn't. It will only return my result if I remove the '{}' for the second part. I can't do that, because the exact same search, when containing '-', will not return the expected results when I remove the braces (the sign is treated as minus sign):
WHERE CONTAINS(r.POI_NAME, '<query> <textquery grammar="context"> <progression><seq>'
||'{aaab}'
||'</seq><seq>'
||'aaa-b'
||'%</seq></progression> </textquery> </query> ' ,1)> 2
So I now have no solution. My question is- How can I create a query that will disregard the minus sign and treat it as a regular sign, but would still handle percentage sign as a special sign. So that I could run a query like the last example and will get the results of searching the phrase 'aaa-b%'?
In short, and to simplify my question, I'm looking for a way to escape all characters (not only the minus sign) except for a specific character. Kind of like 'unescaping' a specific character (the '%' sign) within braces {}. Or, another way would be to remove the space that is added to the phrase inside the braces at the end of the word, preventing me from adding "%" at the end of the word, outside the braces.
Thank you,
Nili
>
I'm looking for a way to escape all characters (not only the minus sign) except for a specific character. Kind of like 'unescaping' a specific character (the '%' sign) within braces {}
>
What about applying a function like regexp_replace to escape all known special characters, and then unescaping the one particular character again, e.g. as in
SQL> select 'a.da-df%df*' str, replace (
regexp_replace (
'a.da-df%df*',
'([[:punct:]])',
'\\\1'
),
'\%',
'%'
) str2
from dual
STR         STR2
a.da-df%df* a\.da\-df%df\*
1 row selected.
i.e. don't escape with curly brackets but with the backslash character.
You can then use this string in your query like in
WHERE CONTAINS(r.POI_NAME, '<query> <textquery grammar="context"> <progression><seq>'
||'aaab'
||'</seq><seq>'
||'aaa\-b'
||'%</seq></progression> </textquery> </query> ' ,1)> 2 -
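If the search term is assembled in application code before it reaches the database, the same escape-everything-except-'%' idea can be sketched on the client side. The following is only an illustration in Java: the class and method names are made up, and it assumes that ASCII punctuation is the full set of characters that needs escaping.

```java
// Hypothetical helper; the names are illustrative and not part of any Oracle API.
class OracleTextEscape {
    // Puts a backslash in front of every ASCII punctuation character
    // except '%', which is left alone so it still acts as a wildcard.
    static String escapeExceptPercent(String term) {
        // [\p{Punct}&&[^%]] is "punctuation, but not %"; $0 is the matched character.
        return term.replaceAll("[\\p{Punct}&&[^%]]", "\\\\$0");
    }

    public static void main(String[] args) {
        System.out.println(escapeExceptPercent("a.da-df%df*")); // prints a\.da\-df%df\*
        System.out.println(escapeExceptPercent("aaa-b%"));      // prints aaa\-b%
    }
}
```

The escaped term would still need to be passed as a bind variable (or have its single quotes doubled) rather than concatenated into the SQL text as-is.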
Full-Text Search has not worked since we upgraded to 2012
I have a filestream database and table. Our full-text searches have always worked until we upgraded to SQL 2012 in December. Now, no file that has been uploaded since December is searchable. What has gone wrong here? It should have been a clean upgrade. We are not getting any error messages. We are just not getting any records returned when we search on a word that we know is in the documents we've uploaded since December (for instance, the word 'aluminum').
Filestream is enabled for the instance.
A full-text catalog exists and contains a full-text index (the same one we've always had). Full-text indexing is ENABLED.
I've tried rebuilding the catalog and the index. I've tried to do a FULL POPULATION on the table.
We haven't changed our queries nor the way the files are uploaded.
Nothing works. I have been a database administrator since the SQL 2005 days and I have never seen anything like this.
Please help.
Hi GINGER PIERCE,
Since the issue concerns SQL Server Search, I will help you post the question in the related forum, where more experts can assist you.
According to your description, if full-text search worked in SQL Server 2008, the full-text indexing feature should in theory also run well in SQL Server 2012 databases after the upgrade. If it does not, you can try restoring your database from SQL Server 2008 to SQL Server 2012, creating a new full-text catalog and index on the table or view in the database, and then using the full-text index to search for words, phrases and multiple forms of a word or phrase via FREETEXT() and CONTAINS() with “and” or “or” operators. Also check that the full-text search feature is enabled in the SQL Server 2012 instance. For more information, see:
Full Text Search step by step in SQL Server 2012.
Note: In SQL Server 2012 SP1, the server will report that Full Text Search is not supported in this edition of SQL Server when it clearly is. The workaround is to create the initial catalog by using a T-SQL query:
CREATE FULLTEXT CATALOG
In addition, since it is a FILESTREAM database, we need to verify whether you run full-text searches on documents in FileTables. If so, you should enable FILESTREAM for your SQL Server instance and enable the FileTable options for the database. For more information, see:
Full Text Searches on Documents in FileTables.
Regards,
Sofiya Li
TechNet Community Support