Full text PDF indexing for website search?

Hi. We run a couple of websites on CQ5.5 and are trying to get the PDF files we refer to in the DAM to show in search results that users conduct on our sites. I've seen a number of references that imply that full text searching of PDFs is possible. For example:
http://dev.day.com/docs/en/crx/current/developing/searching_in_crx.html#Full-Text%20Extraction
But thus far I've not been able to figure out what I must do to get it working. I had expected that if this were possible to do, then it would have worked with the Geometrixx demo site. It did not.
Am I chasing my tail here, or is there actually a way to get this done? If it's possible, links to documentation on how to configure indexing_config.xml and any other required files would be greatly appreciated.
Thanks.

Laurent,
We never got a definitive answer, but we have suspicions that it was due to having upgraded from CQ 5.4 to 5.5. It seems that the libraries used for the indexing changed during that version upgrade. When I took our application and installed it on a pristine 5.5 installation, the PDF indexing worked. It was only our existing installations (two staging, two production) that did not work. So at least we know it's not our application or CQ in general.
Sadly, we don't have the resources to rebuild our servers, and we also ran into a separate problem that would prevent us from using the indexing anyway. It seems that there is no way to prevent cross-site results if you have multiple sites on the same CQ install and they each have their own sections in the DAM where the PDF files are stored. Would take some custom code to get around the issue, it seems.
For example, you have site A and site B.
/content/a <- Main site A content for pages
/content/b
/content/dam/a <- Site A's files in the DAM
/content/dam/b
There is no stock way, that I am aware of, to keep searches on site A from turning up PDF results from /content/dam/b (for site B), and vice versa. That's enough to keep us from using it - a total deal breaker.
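For anyone wondering what the "custom code" might look like, here is a minimal sketch of a post-filter on result paths. It is plain Python, not CQ API code: the names `SITE_ROOTS`, `is_under` and `filter_results_for_site` are hypothetical, and only the four repository roots come from the example above.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from a site key to the repository roots it may show.
SITE_ROOTS: Dict[str, Tuple[str, ...]] = {
    "a": ("/content/a", "/content/dam/a"),
    "b": ("/content/b", "/content/dam/b"),
}


def is_under(path: str, root: str) -> bool:
    """True if path is the root itself or a descendant of it."""
    return path == root or path.startswith(root.rstrip("/") + "/")


def filter_results_for_site(site: str, result_paths: Iterable[str]) -> List[str]:
    """Keep only the hits that live under one of the site's own roots."""
    roots = SITE_ROOTS[site]
    return [p for p in result_paths if any(is_under(p, r) for r in roots)]


# Made-up hits, as a search across the whole repository might return them.
hits = [
    "/content/a/en/products",
    "/content/dam/a/brochure.pdf",
    "/content/dam/b/price-list.pdf",
    "/content/dam/ab/other.pdf",
]
site_a_hits = filter_results_for_site("a", hits)
```

Filtering after the search has run skews result counts and paging, so restricting the query itself to the site's paths would be preferable wherever the search component allows it.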

Similar Messages

ESH_ADM_INDEX_ALL_SC cannot perform initial indexing for all search connectors

Dear SAP Gurus,
We are implementing TREX version 7.10.50 for Talent Management ECC 6.0 - EHP 5.
I'd like to ask a question about the ESH_ADM_INDEX_ALL_SC program, which is used to create search connectors for TREX and perform the initial indexing for all search connectors.
As we know, indexing can be performed either with the ESH_COCKPIT transaction code or with ESH_ADM_INDEX_ALL_SC.
If I perform indexing using ESH_COCKPIT, all search connectors are indexed (the "Searchable" column is checked and the status changes to "Active" for every search connector).
However, if I perform indexing using ESH_ADM_INDEX_ALL_SC, not all search connectors are indexed.
I traced the program ESH_ADM_INDEX_ALL_SC using the ST01 transaction code and found these errors:
- rscpe__error 32 at rscpu86r.c(6;742) "dest buffer overflow" (,)
- rscpe__error 32 at rscpc (20;12129) "convert output buffer overflow"
- rscpe__error 128 at rstss01 (1;178) "Object not found"
Please kindly help me to solve this issue,
Thank you very much
Regards,
Bobbi

Hi Luke,
Please find below the connectors and their status after running ESH_ADM_INDEX_ALL_SC (connector: status):
HRTMC AES Documents: Prepared
HRTMC AES Elements: Prepared
HRTMC AES Templates: Prepared
HRTMC Central Person: Prepared
HRTMC Functional Area: Prepared
HRTMC Job: Prepared
HRTMC Job Family: Prepared
HRTMC Org Unit: Prepared
HRTMC Person: Active
HRTMC Position: Prepared
HRTMC Qualification: Active
HRTMC Relation C JF 450: Active
HRTMC Relation C Q 031: Active
HRTMC Relation CP JF 744: Active
HRTMC Relation CP P 209: Active
HRTMC Relation CP Q 032: Active
HRTMC Relation CP TB 743: Active
HRTMC Relation FN Q 031: Active
HRTMC Relation JF FN 450: Active
HRTMC Relation JF Q 031: Active
HRTMC Relation P Q 032: Active
HRTMC Relation S C 007: Active
HRTMC Relation S CP 740: Active
HRTMC Relation S JF 450: Active
HRTMC Relation S O 003: Active
HRTMC Relation S O Area of Responsibility: Active
HRTMC Relation S P 008: Active
HRTMC Relation S Q 031: Active
HRTMC Relation S S Manager: Active
HRTMC Relation SC JF FN: Active
HRTMC Structural authority: Active
HRTMC Talent Group: Prepared
As suggested by OSS, we implemented SAP Note 1058533.
We would appreciate your help.
Thank you very much
Regards
Bobbi

DEFAULT Heading, Title, Main Text...for google search result??

Hi,
Every page we add in iWeb comes with some default text boxes, shown as
"Type a heading for your webpage here", "Type the main text for your page here", "Type the title for the page" ...
Do these default text boxes help with Google search results?
Any other useful purpose for that?

mactreouser wrote:
What about adding a text box? Isn't that the same thing as the default title box? Or must it be placed at the top? Or is it already preset for search engines?
You could find out if it's the same thing by doing the following:
Add a text box to your home page, publish your iWeb site to a folder, click on "Visit Site Now" and then in Safari do View > View Source. Then look for the title tag that the article talks about:
<title>Your title here</title>

Adobe PDF IFilter for document searches does not work

I am new to full text indexing of documents but I know enough to get the files into the database and apply indexing and searches because I got it to work for Word (.doc) files.
I'm trying to get Adobe's iFilter version 11 to work in Windows 7 x64. I'm using SQL Server 2012 Express with Advanced Services SP1. I have included the full path to the bin folder for the PDF DLL in my PATH environment variable per the instructions.
Register ifilters (after install)
EXEC sys.sp_fulltext_service 'load_os_resources', 1;
Verify that the .pdf filter is installed:
EXEC sys.sp_help_fulltext_system_components 'filter';
This is the row I get for PDF which I delimited with ';'. The underline portion is what I have in PATH env variable.
filter; .pdf; E8978DA6-047F-4E3D-9C78-CDBE46041603; C:\Program Files\Adobe\Adobe PDF iFilter 11 for 64-bit platforms\bin\PDFFilter.dll; 11.0.1.36; Adobe Systems, Inc.
The file content column is
content VARBINARY(MAX) NOT NULL
I insert the file with
INSERT INTO dbo.Documents (filename, doctype, content)
SELECT
N'MyFile',
N'pdf',
bulkcolumn
FROM OPENROWSET(BULK 'C:\MyFile.pdf', SINGLE_BLOB) AS doc;
I reboot the machine and rebuild the Full Text Catalog after installing the PDF iFilter.
Then I search with one of these. There are Word and PDF files that contain 'apple'.
SELECT id, filename, doctype FROM dbo.Documents WHERE FREETEXT(content, N'apple');
SELECT id, filename, doctype FROM dbo.Documents WHERE CONTAINS(content, N'apple');
Now this all works well for .doc files but .PDF files never show up in searches. I have tried both version 9 and version 11 to no avail.


Hello,
We believe we have figured this out. It looks like it has to do with the length of the default folder location for the Adobe iFilter.
I was able to reproduce the issue and the following resolved it for me. See if this resolves it for you all as well.
Here is how to get Adobe Version 11 PDF filter to work.
1. If you haven't already, run the following in SQL Server:
EXEC sp_fulltext_service 'load_os_resources', 1
GO
--you might also need to run:
EXEC sp_fulltext_service 'verify_signature', 0 --This is used to validate trusted iFilters. 0 disables it, so use with caution.
--GO
2. Stop SQL Server. (Make sure FDHost.exe stops.)
3. Uninstall the Adobe iFilter (because it defaulted to a folder name with spaces, or the folder name is too long).
4. Reinstall the Adobe iFilter and, when it prompts for where to install it, change it to: C:\Program Files\Adobe\PDFiFilter
5. Once the installation finishes, go to the computer's environment variables. Add the following to the PATH:
C:\Program Files\Adobe\PDFiFilter\BIN
NOTE: It must include the BIN folder.
NOTE: If you had the OLD location that included spaces, remove it from the PATH environment variable.
6. Start SQL Server.
7. If you had an existing full-text index on PDFs, drop the full-text index and recreate it.
8. You should now get results when you run SELECT * FROM sys.dm_fts_index_keywords(DB_ID('db'), OBJECT_ID('tblname')) --Note: Change db to the actual database name and tblname to the actual table name.
Give this a try and see if this fixes yours.
Sincerely,
Rob Beene, MSFT
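
The PATH requirements in step 5 (the BIN folder must be listed, and the old location must be gone) are easy to get wrong, so here is a small sketch that checks a PATH string against them. It is plain Python and only inspects text; `check_ifilter_path` is a hypothetical helper, and treating any other entry containing "ifilter" as stale is an assumption made for illustration.

```python
import ntpath
from typing import List


def normalize(entry: str) -> str:
    """Lower-case a Windows path and drop quotes and trailing separators."""
    return ntpath.normcase(ntpath.normpath(entry.strip().strip('"')))


def check_ifilter_path(path_value: str, install_dir: str) -> List[str]:
    """Return a list of problems found in a semicolon-separated PATH value."""
    wanted = normalize(ntpath.join(install_dir, "BIN"))
    entries = [normalize(e) for e in path_value.split(";") if e.strip()]
    problems = []
    if wanted not in entries:
        problems.append("missing BIN folder entry: " + wanted)
    for entry in entries:
        # Assumption: any other entry mentioning "ifilter" is an old install.
        if "ifilter" in entry and entry != wanted:
            problems.append("stale iFilter entry: " + entry)
    return problems


old_path = (r"C:\Windows\system32;"
            r"C:\Program Files\Adobe\Adobe PDF iFilter 11 for 64-bit platforms\bin")
new_path = r"C:\Windows\system32;C:\Program Files\Adobe\PDFiFilter\BIN"
install_dir = r"C:\Program Files\Adobe\PDFiFilter"
```

On a real machine the first argument would be the PATH value seen by the SQL Server service account, which may differ from the one in an interactive session.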

Configuring Browsing Indexes for Service Search Descriptor Filters

I am running DSEE 6.1 on Solaris 10.
I restrict access to the LDAP clients (Solaris 8, 9, and 10) for various users in the Directory by configuring the service search descriptors to use a filter based on specific roles. Each server's profile names a role depending on the type of server, and users are then assigned roles which are nested within the specific server-type roles:
NS_LDAP_SERVICE_SEARCH_DESC= passwd:ou=People,dc=example,dc=com?one?nsrole=cn=serverRole,ou=profile,dc=example,dc=com
NS_LDAP_SERVICE_SEARCH_DESC= group:ou=group,dc=example,dc=com?one
NS_LDAP_SERVICE_SEARCH_DESC= audit_user:ou=People,dc=example,dc=com?one?nsrole=cn=serverRole,ou=profile,dc=example,dc=com
NS_LDAP_SERVICE_SEARCH_DESC= shadow:ou=People,dc=example,dc=com?one?nsrole=cn=serverRole,ou=profile,dc=example,dc=com
NS_LDAP_SERVICE_SEARCH_DESC= user_attr:ou=People,dc=example,dc=com?one?nsrole=cn=serverRole,ou=profile,dc=example,dc=com
I have noticed in my error logs on the Directory servers messages regarding these filters not being indexed:
WARNING<20805> - Backend Database - conn=949139 op=1 msgId=2 - search is not indexed base='ou=people,dc=example,dc=com' filter='(nsRole=cn=serverRole,ou=profile,dc=example,dc=com)' scope='one'
I have also had a few instances where the naming services seem to have stopped altogether. This seems to coincide with my clients refreshing the LDAP cache, which is also when I see the "not indexed" messages in the error log.
I guess that I need to set up Browsing Indexes for these filters.
Can anyone give examples how to do this?
I guess I will need a vlvBase of ou=people,dc=example,dc=com
vlvScope of 1
vlvFilter of nsRole=cn=serverRole,ou=profile,dc=example,dc=com
I am not sure what I would do for vlvsort attributes though??

The access logs shows that the attributes to be sorted are uid and cn:
[25/Apr/2008:09:58:21 +1200] conn=171835 op=1 msgId=2 - SRCH base="ou=people,dc=example,dc=com" scope=1 filter="(nsRole=cn=serverRole,ou=profile,dc=example,dc=com)" attrs="cn uid uidNumber gidNumber gecos description homeDirectory loginShell"
[25/Apr/2008:09:58:21 +1200] conn=171835 op=1 msgId=2 - SORT cn uid (1426)
[25/Apr/2008:09:58:21 +1200] conn=171835 op=1 msgId=2 - VLV 0:999:0:0 1:1426 (0)
[25/Apr/2008:09:58:26 +1200] conn=171835 op=1 msgId=2 - RESULT err=0 tag=101 nentries=999 etime=5 notes=U
So the vlvsort attributes should be cn and uid.
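
If there are many such searches to go through, the sort attributes can be pulled out of the access log rather than read by eye. A rough sketch in Python, assuming SORT lines always look like the one quoted above (attribute names followed by a number in parentheses); `sort_attributes` is a made-up helper name, not a DSEE tool.

```python
import re
from typing import List, Optional

# Matches e.g. "... - SORT cn uid (1426)" and captures "cn uid".
SORT_RE = re.compile(r"\bSORT\s+(?P<attrs>.+?)\s+\(\d+\)\s*$")


def sort_attributes(log_line: str) -> Optional[List[str]]:
    """Return the attributes of a SORT access-log line, or None for other lines."""
    match = SORT_RE.search(log_line)
    if match is None:
        return None
    return match.group("attrs").split()


lines = [
    "[25/Apr/2008:09:58:21 +1200] conn=171835 op=1 msgId=2 - SORT cn uid (1426)",
    "[25/Apr/2008:09:58:21 +1200] conn=171835 op=1 msgId=2 - VLV 0:999:0:0 1:1426 (0)",
]
# First SORT line found; its attributes are the candidates for the sort setting.
vlv_sort = next(attrs for attrs in map(sort_attributes, lines) if attrs)
```

The attributes come back in the order the client requested them, which is the order that matters for a browsing index.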

PDF indexing and multiple searches.

Dear members:
Please forgive me if my question is rather basic but I haven't been able to find the exact answers I am looking for in order to address my project needs.
I have a folder where I keep all of my PDF files. These are all articles from medical journals that I keep organized using a browser application specific for these types of articles. The application allows me to search these articles but it only looks for specific keywords (title, author name, date, journal name and keyword just to name a few). However, it doesn't look at the content of the PDF file to find words that are contained in the body of the article itself.
I would like to be able to use Acrobat to search these articles and try to find words I am looking for in the entire article instead of being restricted only to keywords. These are the questions I have:
1. What is the best way to index these PDF files so that they can become searchable ?
2. Is there a way to find out if they have already been indexed by the publishing company so that I avoid wasting time by doing it again ?
3. My library now contains approximately 15,000 articles and I expect it to grow to at least 30,000. How can I handle these searches so that performance doesn't become an issue ? Is there a way to ensure that Acrobat can search these number of files without taking a long time ?
4. I understand from the help files that Acrobat can search an entire folder so I don't have to run my search one article or file at a time. Is this correct ? What is the best way to run my search so that Acrobat looks at all files in one folder ? In this folder I have subfolders (subdirectories) ? Will Acrobat look at all files when searching including those in subdirectories within the specified directory ?
Thank you in advance for your help and replies.
Best regards,
Joseph Chamberlaini

After creating the index, you need to perform the following checks.
First, check that your index tables contain indexed terms. Execute
select token_text from dr$YOUR_INDEX$i;
Second, you will need to check the index errors table CTX_INDEX_ERRORS. This is owned by the user CTXSYS, and most users do NOT have SELECT privilege on it by default.
If that's OK, then check that your PDF documents are supported by the INSO filter.
Citation:
"PDF - Portable Document Format
Acrobat Versions 2.1, 3.0, 4.0, and 5.0 including Japanese PDF"
(Appendix B. Supported Document Formats in Oracle Text Reference 9.2)
For Oracle 9i you could install 9.2.0.4 patchset (it included INSO FILTER 7.5)
P.S.
To begin with, you can find answers to your questions about Oracle Text here
http://otn.oracle.com/products/text
Sorry for my English.
Best regards, Victor Zogin.

"Filter/partition key" for full-text searching

Hi there,
We have a challenge whereby we have a table of products by store, each store having say 200,000 products. Basically, for each store, we want to allow searching by product name. The best solution for this is to have full-text searching, but there is no way to have a "filter" or "partition" key on the store ID.
So in essence what happens, the full-text search scans the entire full-text catalog for the products, then it uses the primary key to match to the table and then filters out the other stores. Considering we have hundreds of stores in the table, this is not a good solution.
We contemplated adding separate indexed views and full-text catalogs for every store, but this would be a nightmare to manage.
I was expecting to see some sort of a "partition by Column" in the full-text indexes, but it doesn't exist. This basically means we have to scrap full-text and look for a third party solution.
Does anyone have any idea how we could achieve this with just standard SQL full-text searching?

Hi Adam,
Thank you for your question. I am trying to involve someone more familiar with this topic for a further look at this issue. Some delay might be expected while the case is transferred. Your patience is greatly appreciated.
Thank you for your understanding and support.
Elvis Long
TechNet Community Support

Acrobat - Convert Office documents to PDF so that it is crawled/indexed by SharePoint search

Hi there,
This is a hybrid question between Acrobat and SharePoint and I'll post on both forums....
Background:
In a fairly complex application we have a publishing server that utilizes Acrobat to convert Office documents to PDF using the Convert to PDF functionality.
We then publish that PDF to a library in SharePoint. We would like to have those published PDFs searchable by SharePoint search. Unfortunately there is something about these PDFs where SharePoint cannot crawl the content.
Note: I do realize that PDFs are not indexable by SharePoint out of the box and I have installed and configured the iFilter utility. I have been able to index and search for other PDFs, so I know the mechanism works. It just seems to be these particular PDFs.
I have also manually "Saved as PDF" directly from Word/Excel and those PDFs are crawled by SharePoint....it just seems to be when Acrobat does its conversion. I'm sure it's just a simple configuration somewhere... I just don't know what I'm looking for.
Another note: When I open the published PDFs, I am able to use Acrobat's search to find the text.... and the text is selectable; so it's not as if the conversion changed it to an image.
So....would anyone happen to have encountered this issue? Or does anyone know what makes a PDF indexable by SharePoint search?
Thanks in advance

Hi ,
According to your description, my understanding is that the PDFs converted from Office documents by Acrobat cannot be crawled in your SharePoint 2010.
For this issue, please make sure the PDF version of these files is 1.5 (Acrobat 6.x) or above.
You can verify this with the following steps:
Open your PDF using Adobe Reader.
Go to File -> Properties.
Check the PDF Version under Advanced section.
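With many published files, the same check can be scripted instead of opening each one in Adobe Reader. The sketch below is Python and reads only the `%PDF-x.y` header that a PDF file starts with; `pdf_header_version` and `meets_minimum` are hypothetical helper names. One caveat: a PDF can override the header version inside the document itself, so this is a quick first pass and not a substitute for the Properties dialog.

```python
import re
from typing import Optional, Tuple

# A PDF file begins with a header such as "%PDF-1.4".
HEADER_RE = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(first_bytes: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the PDF header, or None if no header is found."""
    match = HEADER_RE.search(first_bytes[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(first_bytes: bytes, minimum: Tuple[int, int] = (1, 5)) -> bool:
    """True if the header version is at least the given minimum (default 1.5)."""
    version = pdf_header_version(first_bytes)
    return version is not None and version >= minimum
```

For a file on disk, pass in the first kilobyte, for example the result of `open(path, "rb").read(1024)`.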
Best Regards,
Eric
Eric Tao
TechNet Community Support

Convert to PDF from Excel so that it is indexable by SharePoint search

Hi there,
This is a hybrid question between Acrobat and SharePoint and I'll post on both forums....
Background:
In a fairly complex application we have a publishing server that utilizes Acrobat to convert Office documents to PDF using the Convert to PDF functionality.
We then publish that PDF to a library in SharePoint. We would like to have those published PDFs searchable by SharePoint search. Unfortunately there is something about these PDFs where SharePoint cannot crawl the content.
Note: I do realize that PDFs are not indexable by SharePoint out of the box and I have installed and configured the iFilter utility. I have been able to index and search for other PDFs, so I know the mechanism works. It just seems to be these particular PDFs.
I have also manually "Saved as PDF" directly from Word/Excel and those PDFs are crawled by SharePoint....it just seems to be when Acrobat does its conversion. I'm sure it's just a simple configuration somewhere... I just don't know what I'm looking for.
Another note: When I open the published PDFs, I am able to use Acrobat's search to find the text.... and the text is selectable; so it's not as if the conversion changed it to an image.
So....would anyone happen to have encountered this issue? Or does anyone know what makes a PDF indexable by SharePoint search?
Thanks in advance

This cannot be done on a Mac. If you need to continue this discussion, please post in the Acrobat Macintosh forum.

Error returned in Acrobat X Pro when attempting to output pdf full text from Proquest database.

Contacted Proquest and they said to contact Adobe. I recently installed CS5, so I now have a newer Acrobat version than before. This newer version isn't playing well. Can anyone help?
This is my original plea to Proquest:
Description:
Recently had Adobe Creative Suite (CS5) installed on my machine. Now when I attempt to download full text pdf I receive an error message when trying to open the downloaded file. "There was an error opening this document. The file is damaged and could not be repaired."
I can view pdf within Proquest window, but problems occur when I try to open the download - defaults to Acrobat X Pro, not standard Reader. I also get an error message when trying to email it. The only outputting method that seems to work is the Export/Save option.
I tried on another machine that uses Acrobat Reader XI and all worked fine.
BTW - I am working in Firefox.
Thanks in advance for any assistance.

Here are some screenshots:

How does full-text search for pdf files work?

Hi there,
Basically I can see my PDF file in the content server. Inside the PDF there's a piece of text that says "Test's Sample", but when I do a search with that string the file gets filtered from the results.
I think it has to do with the ' (single quote) being there because other text in the pdf works fine.. so I was wondering how does VDK store this full text? where? I'd like to see how it gets translated IF that's how it works with pdf files....
Following advice from Re: Parse error with search query I tried doing the search by:
Test\'s Sample
Test`s Sample
"Test's Sample"
The database is db2 if that helps.. how can I fix this problem?

Nevermind, I fixed it by changing the VDK filters (in case someone is looking for a solution too).
Cheers,

Full-Text search is not working with PDF files - SQL Server 2012 64 bit

Hi,
We are in the process of storing PDF files in SQL Server 2012 with Full-Text search capability.
I followed the steps below and it works fine with Word documents but not with PDF files. I tried PDF iFilter 11 and 9, and both were unsuccessful.
Server/DB level settings:
1) Enable FileStream.
2) Install Full-Text, then restart.
3) Use [specific db]:
alter database [db name] add filegroup Files contains filestream;
alter database [db name] add file (name = N'Files', filename = N'D:\SQL\DATA') to filegroup [Files];
3) Database-level settings:
FileStream directory name: [Set the name]
FileStream non-transacted access: [set appropriate]
3a) Add a datafile to the DB with the filestream data filetype.
4) Share the D:\SQL\DATA directory and add specific accounts with read/write access.
5) Give bulkadmin access to those specific accounts at server level.
6) Download and install the *.pdf IFilter for FTS. Link:
http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542
7) To the PATH global system variable, add the path to the folder where you installed the plugin. The default for this version is:
C:\Program Files\Adobe\Adobe PDF iFilter 9 for 64-bit platforms\bin
8) Download FilterPackx64.exe and install it. Link:
http://www.microsoft.com/en-us/download/confirmation.aspx?id=20109
9) Now from SSMS execute the following procedures:
EXEC sp_fulltext_service 'load_os_resources', 1;
EXEC sp_fulltext_service 'verify_signature', 0;
EXEC sp_fulltext_service 'update_languages'; -- update language list
EXEC sp_fulltext_service 'restart_all_fdhosts'; -- restart daemon
reconfigure with override;
10) Restart the server.
11) Check the registered document types:
select document_type, path from sys.fulltext_document_types where document_type = '.pdf'
select document_type, path from sys.fulltext_document_types where document_type = '.docx'
12) Results are OK.
Following is my table/index/catalog script:
CREATE TABLE dbo.DocumentFilesTest
(
    DocumentId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    AddDate datetime NOT NULL,
    Name nvarchar(50) NOT NULL,
    Extension nvarchar(10) NOT NULL,
    Description nvarchar(1000) NULL,
    FileStream_Id UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWSEQUENTIALID(),
    FileSource varbinary(MAX) FILESTREAM DEFAULT(0x)
)
go
--Add default add date for document
ALTER TABLE dbo.DocumentFilesTest ADD CONSTRAINT DF_DocumentFilesTest_AddDate DEFAULT sysdatetime() FOR AddDate
EXEC sp_fulltext_database 'enable'
GO
IF NOT EXISTS (SELECT TOP 1 1 FROM sys.fulltext_catalogs WHERE name = 'Ducuments_Catalog_test')
BEGIN
    EXEC sp_fulltext_catalog 'Ducuments_Catalog_test', 'create', 'D:\SQL\PDFBlob';
END
--EXEC sp_fulltext_catalog 'Ducuments_Catalog_test', 'drop'
DECLARE @indexName nvarchar(255) = (SELECT TOP 1 i.Name FROM sys.indexes i JOIN sys.tables t ON i.object_id = t.object_id WHERE t.Name = 'DocumentFilesTest' AND i.type_desc = 'CLUSTERED')
PRINT @indexName
EXEC sp_fulltext_table 'DocumentFilesTest', 'create', 'Ducuments_Catalog_test', @indexName
EXEC sp_fulltext_column 'DocumentFilesTest', 'FileSource', 'add', 0, 'Extension'
EXEC sp_fulltext_table 'DocumentFilesTest', 'activate'
EXEC sp_fulltext_catalog 'Ducuments_Catalog_test', 'start_full'
ALTER FULLTEXT INDEX ON [dbo].[DocumentFilesTest] ENABLE
ALTER FULLTEXT INDEX ON [dbo].[DocumentFilesTest] SET CHANGE_TRACKING = AUTO
ALTER FULLTEXT CATALOG Ducuments_Catalog_test REBUILD WITH ACCENT_SENSITIVITY=OFF;
INSERT INTO DocumentFilesTest(Extension, Name, FileSource)
SELECT 'pdf', 'BOL12006553.pdf', * FROM OPENROWSET(BULK 'd:\SQL\PDFBlob\BOL12006553.pdf', SINGLE_BLOB) AS BLOB;
GO
INSERT INTO DocumentFilesTest(Extension, Name, FileSource)
SELECT 'docx', 'test.docx', * FROM OPENROWSET(BULK 'd:\SQL\PDFBlob\test.docx', SINGLE_BLOB) AS Document;
GO
SELECT d.* FROM dbo.DocumentFilesTest d WHERE Contains(d.FileSource, 'BILL')
Returns nothing. It should return the PDF file.
SELECT d.* FROM dbo.DocumentFilesTest d WHERE Contains(d.FileSource, 'TEST')
Returns the Word document as follows:
2   2014-06-04 10:11:41.393   test.docx   docx   NULL   [BINARY Value]   [Binary Value]
Any help is appreciated. It's been a long wait.
Thanks,
Vel
Vel Thavasi

Hello,
Did you check the full-text log files for more details about the errors? If the filter isn't working, there should be errors in the error log file.
The following thread is about a similar issue; please refer to:
http://social.msdn.microsoft.com/forums/sqlserver/en-US/69535dbc-c7ef-402d-a347-d3d3e4860d72/sql-server-2008-64bit-fulltext-indexing-pdf-not-working-cant-find-ifilter
Regards,
Fanny Liu
Fanny Liu
TechNet Community Support

Full Text Search in PDF file Not Working in SQL Server 2012

OS: Windows Server 2012 @ Azure
DB: SQL Server 2012 SP 1 with Cum Update 6
Filter: OfficeFilter installed, PDFFilter64 11 installed (actually I tried 9 too)
I have done the following steps:-
1. Configure the SQL Server instance to enable FILESTREAM for Transact-SQL access (I/O access and Allow Remote Client Access to FILESTREAM data) and restart the instance service.
2. Set the FILESTREAM access level to Full Access.
3. Create a database with a FILESTREAM folder and set the created database's Properties > Options: FILESTREAM Directory Name = fileContainer and FILESTREAM Non-Transacted Access = Full.
4. Create a FileTable with a file directory.
5. Execute the following scripts to ensure all installed components are working. PDF is listed as one of the supported filters.
EXEC sp_fulltext_service @action='load_os_resources', @value=1;
EXEC sp_fulltext_service 'verify_signature', 0 -- don't verify signatures
EXEC sp_fulltext_service 'update_languages'; -- update language list
EXEC sp_fulltext_service 'restart_all_fdhosts';
EXEC sp_help_fulltext_system_components 'filter'
reconfigure with override
6. Copy a few PPTX, DOCX, and PDF files into the file directory.
7. Search the data with the following command. PPTX and DOCX files return the right results, but the PDF is not returned although it contains the search term.
SELECT *
FROM dbo.Course
WHERE CONTAINS(file_stream, 'Counsellor');
Any expert advice?
Ant in SG

Are you seeing any errors in the SQL Server Error Log, the Windows Application or System logs? How about in the Full-text crawl logging?
Troubleshooting Errors in a Full-Text Population (Crawl)
If your server has a mix of multi-threaded iFilters and single-threaded iFilters, this can cause serious problems with building the full text index. (How do I know this? Well, let's just say that I have suffered as well. And I was shocked!)
The efficiency was greatly increased by this article:
Troubleshooting: Slow Full-Text Indexing Performance Due to Filtering Process
This means changing the threading model for the multi-threaded (e.g. Microsoft Office) filters to be Apartment Threaded. Or perhaps if you are full text indexing PDF files, abandoning the free single-threaded Adobe IFilter and purchasing the FoxIt (or some other) multi-threaded PDF iFilter would benefit you.
RLF

How to link a full text index with catalog in a PDF file ?

Good morning and thank you for your help.
I have already created some PDF files in a folder (with hypertext links between them), and I used the command "Tools\Document processing\Full Text Index with Catalog" to create an index; so far everything works properly.
Now I want to link this index to my first PDF file so that the index is used automatically for an advanced search in this file.
I hope that someone may answer me!
Thank you.

Now I want to link this index to my first PDF file so that the index is used automatically for an advanced search in this file.
In the properties of the document:

Full text index searching in large document sets

I have been placed in charge of a digital PDF document library for a small biotech company. The library consists of about 1000 .pdf documents of 100-300 pages each, which have been scanned and OCRed. In order to facilitate full-text searching of the documents, a PDX catalog has been created. In theory, the PDX catalog would seem to be an excellent means of quickly accessing the data, but due to the sheer volume of text contained in the documents this does not seem to be the case.
Any given search may take hours to complete and many computers in the department have been known to lock up due to the load of running a search. Obviously, this has made using the PDX search more of a hassle than it is worth.
I do not know exactly how the index searches work, but from what I gather they somehow search within each document in turn and return all the instances in all the documents that contain a certain term. If this is the case, then it would make sense that the searches take a long time, because the search would have to go through each of the 1000 documents in sequence.
The thing is: we really do not need to know the context and placement of every instance that a word appears in a document. All we need to know is IF it appears, and perhaps how many times. Is there a way to make an index that will simply give us this information without having to search the actual document?
Here's an example of what I am trying to achieve:
Note: I know almost nothing about full text indexes so please forgive me if any of this sounds insane
Let's say we have a document called "word count.pdf" which contains the following text:
"blah blah yadda yadda text Recombinant human insulin more text still texting and so on"
And another called "word count 2.pdf" with the following text
"Recombinant human insulin and la la la dee do"
The indexes for these files could be condensed and stored like this:
"Word count.pdf"
blah 2
yadda 2
recombinant 1
human 1
insulin 1
text 2
more 1
still 1
texting 1
and 1
so 1
on 1
"Word count 2.pdf"
recombinant 1
human 1
insulin 1
and 1
la 3
dee 1
do 1
In this example, if we were to run a search on "text" the index would return "word count.pdf, 3 instances (2 of text and 1 of texting)", whereas if we were to search for "recombinant" it would return both "word count.pdf, 1 instance" and "word count 2.pdf, 1 instance".
This way, I could quickly weed out all documents that do not have the word that I am looking for and get an idea about which documents should be searched more in depth without scanning every single instance of the term in every document.
Is there any way to accomplish something similar to this using acrobat? (Or anything else, for that matter)
My specifications: (similar to specs of all computers searching the pdx):
Windows XP,
intel celeron CPU 2.6GHz, 1G of ram
Adobe Acrobat 8 Professional
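
To make the idea concrete, here is a minimal sketch of such a per-document word-count index in Python. It assumes the text has already been extracted from each PDF (the extraction step is not shown) and it is not an Acrobat feature; `count_words`, `build_index` and `search` are illustrative names only. The two sample texts are the ones from the example above.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def count_words(text: str) -> Counter:
    """Count lower-cased alphanumeric words in a document's text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs: Dict[str, str]) -> Dict[str, Counter]:
    """Map each document name to its word counts."""
    return {name: count_words(text) for name, text in docs.items()}


def search(index: Dict[str, Counter], term: str,
           prefix: bool = False) -> List[Tuple[str, int]]:
    """Return (document, count) pairs for documents containing the term.

    With prefix=True, every word starting with the term is counted,
    so "text" also counts "texting".
    """
    term = term.lower()
    hits = []
    for name, counts in index.items():
        if prefix:
            n = sum(c for word, c in counts.items() if word.startswith(term))
        else:
            n = counts[term]
        if n:
            hits.append((name, n))
    return hits


docs = {
    "word count.pdf": "blah blah yadda yadda text Recombinant human insulin "
                      "more text still texting and so on",
    "word count 2.pdf": "Recombinant human insulin and la la la dee do",
}
index = build_index(docs)
```

A lookup such as `search(index, "recombinant")` only touches the stored counts, never the PDFs themselves, which is the point of the exercise: the documents that come back can then be opened and searched in depth.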

Look at dtSearch. We used the publisher version for a CD with large file sets (with hundreds of pages per file/thousands of PDF pages of multicolumn index data - text heavy), and it does a great job. The desktop version would provide the type of searching you are looking for. Indexing is also very fast. Our customer complained, like yourself, about the speed of searches in Acrobat 6 and higher - most of the delay is due to the population of the results window.
http://www.dtsearch.com/
