How does full-text search for pdf files work?

Hi there,
Basically, I can see my PDF file in the content server. Inside the PDF there's a piece of text that says "Test's Sample", but when I search for that string the file gets filtered out of the results.
I think it has to do with the ' (single quote), because other text in the PDF works fine. So I was wondering: how does VDK store this full text, and where? I'd like to see how it gets translated, IF that's how it works with PDF files.
Following the advice in "Re: Parse error with search query", I tried searching for:
Test\'s Sample
Test`s Sample
"Test's Sample"
The database is DB2, if that helps. How can I fix this problem?

Never mind, I fixed it by changing the VDK filters (in case someone else is looking for a solution too).
Cheers,
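For anyone landing here with the same symptom: one plausible cause is that the indexer and the query parser treat the apostrophe differently, so the stored tokens never line up with the query tokens. The sketch below is not how VDK actually tokenizes; `index_tokens` and `query_tokens` are hypothetical functions that only illustrate the mismatch.

```python
import re

def index_tokens(text):
    """Hypothetical indexer: treats every non-alphanumeric character as a separator."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def query_tokens(text):
    """Hypothetical query parser: splits on whitespace only, keeping punctuation."""
    return [t.lower() for t in text.split()]

def phrase_matches(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = index_tokens("This is Test's Sample document")

# The query keeps "test's" as one token, but the index holds "test" and "s".
mismatched = phrase_matches(doc, query_tokens("Test's Sample"))
# Tokenizing the query the same way as the index makes the phrase line up.
consistent = phrase_matches(doc, index_tokens("Test's Sample"))
print(mismatched, consistent)
```

If something like this is the cause, changing the filters so that indexing and querying agree on punctuation is consistent with the fix reported above.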

Similar Messages

  • Full text search of .pdf files in a file table.

    I have installed the Adobe iFilter 11 64 bit and set the path to the bin folder. I still cannot find any text from the pdf files. I suspect I am missing something trivial because I don't find much when I Bing for this so it must not be a common problem.
    Here is the code.
    -- Adobe iFilter 11 64 bit is installed
    -- The Path variable is set to the bin folder for the Adobe iFilter.
    -- SQL Developer version 64 bit on both Windows 7 and Windows 8.
    USE master;
    GO
    DROP DATABASE FileTableStudy;
    GO
    CREATE DATABASE FileTableStudy
    ON PRIMARY
    ( NAME = N'FileTableStudy'
    ,FILENAME = N'E:\SQLServerData\SQL2012\Engine\FileTableStudy.mdf'
    ,SIZE = 4096KB
    ,FILEGROWTH = 1024KB
    )
    ,FILEGROUP FileTableStudyFileTable CONTAINS FILESTREAM
    ( NAME = FileTableStudyFileTable
    ,FILENAME = 'E:\SQLServerData\FileTableStudy'
    )
    LOG ON
    ( NAME = N'FileTableStudy_log'
    ,FILENAME = N'D:\SQLServerLogs\SQL2012\FileTableStudy_log.ldf'
    )
    WITH FILESTREAM
    ( NON_TRANSACTED_ACCESS = FULL
    ,DIRECTORY_NAME = N'FileTableStudyFiles'
    );
    GO
    USE FileTableStudy;
    GO
    DROP TABLE dbo.Magazine;
    GO
    CREATE TABLE dbo.Magazine AS FILETABLE
    WITH ( FileTable_Directory = 'MagazineStore'
    ,FileTable_Collate_Filename = database_default
    );
    GO
    CREATE FULLTEXT CATALOG MagazineFullTextCatablog AS DEFAULT;
    GO
    --EXEC sp_fulltext_service 'load_os_resources', 1;
    --EXEC sp_fulltext_service 'verify_signature', 0;
    --EXEC sp_fulltext_service 'restart_all_fdhosts';
    --EXEC sp_fulltext_service 'update_languages';
    --EXEC sp_help_fulltext_system_components 'filter';
    --RECONFIGURE WITH OVERRIDE;
    SELECT document_type
    ,path
    FROM sys.fulltext_document_types
    WHERE document_type = '.pdf';
    SELECT *
    FROM sys.fulltext_document_types
    ORDER BY document_type;
    DROP FULLTEXT INDEX ON dbo.Magazine;
    GO
    SELECT TOP 1 indexes.name IndexName
    FROM sys.indexes
    JOIN sys.tables ON indexes.object_id = tables.object_id
    AND tables.name = 'Magazine'
    JOIN sys.schemas ON tables.schema_id = schemas.schema_id
    AND schemas.name = 'dbo'
    WHERE indexes.is_unique = 1
    AND indexes.name LIKE 'PK__%';
    GO
    -- Drag documents to folder.
    CREATE FULLTEXT INDEX ON dbo.Magazine
    ( file_stream TYPE COLUMN file_type)
    KEY INDEX [PK__Magazine__5A5B77D541728F3E];
    GO
    -- Wait for index to build
    SELECT DATEDIFF(ss, crawl_start_date, crawl_end_date) IndexBuildSeconds
    FROM sys.fulltext_indexes
    --ALTER FULLTEXT INDEX ON dbo.Magazine START UPDATE POPULATION;
    SELECT *
    FROM dbo.Magazine
    WHERE file_type = 'pdf';
    SELECT *
    FROM dbo.Magazine
    WHERE FREETEXT(*,'new core licensing')
    AND file_type = 'pdf';
    SELECT *
    FROM dbo.Magazine
    WHERE CONTAINS(*, N'"Microsoft"')
    AND file_type = 'pdf';
    SELECT *
    FROM sys.fulltext_catalogs;
    SELECT *
    FROM sys.fulltext_indexes;
    SELECT *
    FROM sys.fulltext_index_columns;
    SELECT *
    FROM sys.fulltext_index_catalog_usages;
    Thanks for any help.
    Tom G.

    Hello,
    We believe we have figured this out.  It looks like it has to do with the length of the default folder location for the Adobe iFilter.
    I was able to reproduce the issue and the following resolved it for me.  See if this resolves it for you all as well.
    Here is how to get Adobe Version 11 PDF filter to work.
    1. If you haven't already, run the following in SQL Server:
    sp_fulltext_service 'load_os_resources', 1
    GO
    -- You might also need to run: sp_fulltext_service 'verify_signature', 0
    -- This setting is used to validate trusted iFilters. 0 disables it, so use with caution.
    -- GO
    2. Stop SQL Server. (Make sure FDHost.exe stops.)
    3. Uninstall the Adobe iFilter (because it defaulted to a folder name that has spaces or is too long).
    4. Reinstall the Adobe iFilter and, when it prompts for where to install it, change it to: C:\Program Files\Adobe\PDFiFilter
    5. Once the installation finishes, go to the computer's environment variables and add the following to the PATH:
    C:\Program Files\Adobe\PDFiFilter\BIN
    NOTE: It must include the BIN folder.
    NOTE: If you had the OLD location that included spaces, remove it from the PATH environment variable.
    6. Start SQL Server.
    7. If you had an existing full-text index on PDFs, drop the full-text index and recreate it.
    8. You should now get results when you run SELECT * FROM sys.dm_fts_index_keywords(DB_ID('db'), OBJECT_ID('tblname'))  --Note: Change db to be the actual database name and tblname to be the actual table name.
    Give this a try and see if this fixes yours.
    Sincerely,
    Rob Beene, MSFT
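Step 5 is easy to get wrong (a trailing backslash, different letter case, or a stale entry), so it can help to check the PATH value programmatically. The following is a minimal sketch, not part of the original answer: `path_contains` is a hypothetical helper, and `sample_path` is a made-up value standing in for the real `os.environ["PATH"]` on the server.

```python
import ntpath

def path_contains(path_value, wanted_dir):
    """Return True if wanted_dir is one of the entries in a Windows PATH string."""
    def norm(p):
        # Strip whitespace and quotes, drop trailing separators, and fold case
        # so that comparisons behave like Windows path comparisons.
        return ntpath.normcase(ntpath.normpath(p.strip().strip('"')))
    entries = [norm(e) for e in path_value.split(";") if e.strip()]
    return norm(wanted_dir) in entries

# Hypothetical PATH value for illustration; on a real server read os.environ["PATH"].
sample_path = r"C:\Windows\system32;C:\Program Files\Adobe\PDFiFilter\bin\;C:\Tools"

has_bin = path_contains(sample_path, r"C:\Program Files\Adobe\PDFiFilter\BIN")
has_parent_only = path_contains(sample_path, r"C:\Program Files\Adobe\PDFiFilter")
print(has_bin, has_parent_only)
```

The second check shows why the note about the BIN folder matters: an entry for the BIN subfolder does not count as an entry for its parent folder, and vice versa.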

  • Full Text Search in PDF file Not Working in SQL Server 2012

    OS: Windows Server 2012 @ Azure
    DB: SQL Server 2012 SP 1 with Cum Update 6
    Filter: OfficeFilter installed, PDFFilter64 11 installed (actually I tried 9 too)
    I have done the following steps:
    1. Configure the SQL Server instance to enable FILESTREAM for Transact-SQL access (plus file I/O access and remote client access to FILESTREAM data) and restart the instance service.
    2. Set Stream Access Level to Full Access and
    3. Create a database with a FILESTREAM folder and set the created database's Properties > Options: FILESTREAM Directory Name = fileContainer and FILESTREAM Non-Transacted Access = Full.
    4. Create a FileTable with a file directory.
    5. Execute the following scripts to ensure all installed components are working. PDF is listed as one of the supported filters.
    EXEC sp_fulltext_service @action='load_os_resources', @value=1;
    EXEC sp_fulltext_service 'verify_signature', 0 -- don't verify signatures
    EXEC sp_fulltext_service 'update_languages'; -- update language list
    EXEC sp_fulltext_service 'restart_all_fdhosts';
    EXEC sp_help_fulltext_system_components 'filter'
    reconfigure with override
    6. Copy a few PPTX, DOCX, and PDF files into the file directory.
    7. Search the data with the following command. PPTX and DOCX files return the right results, but PDF files are not returned even though they contain the search term.
    SELECT *
    FROM dbo.Course
    WHERE CONTAINS(file_stream, 'Counsellor');
    Any expert advice?
    Ant in SG

    Are you seeing any errors in the SQL Server Error Log, the Windows Application or System logs?  How about in the Full-text crawl logging?
    Troubleshooting Errors in a Full-Text Population (Crawl)
    If your server has a mix of multi-threaded iFilters and single-threaded iFilters, this can cause serious problems with building the full text index.  (How do I know this?  Well, let's just say that I have suffered as well. And I was shocked!) 
    The efficiency was greatly increased by this article: 
    Troubleshooting: Slow Full-Text Indexing Performance Due to Filtering Process
    This means changing the threading model for the multi-threaded (e.g. Microsoft Office) filters to be Apartment Threaded.  Or perhaps, if you are full-text indexing PDF files, abandoning the free single-threaded Adobe iFilter and purchasing the Foxit (or some other) multi-threaded PDF iFilter would benefit you.
    RLF

  • How to extract text from a PDF file?

    Hello Suners,
    I need to know how to extract text from a PDF file.
    Does anyone know what character encoding a PDF file uses? When I use an input stream to read the file, it gives encrypted-looking characters, not the original text in the file.
    Are there any procedures I should follow while reading a PDF file?
    File f=new File("D:/File.pdf");
    FileReader fr=new FileReader(f);
    BufferedReader br=new BufferedReader(fr);
    String s=br.readLine();
    Any help will be deeply appreciated.

    jverd wrote:
    > First, you set i once, and then loop without ever changing it. So your loop body will execute either 0 times or infinitely many times, writing the same byte every time. Actually, maybe it'll execute once and then throw an ArrayIndexOutOfBoundsException. That's basic java looping, and you're going to need a firm grip on that before you try to do anything as advanced as PDF reading.
    Oops, you are absolutely right, that was a silly mistake.
    > Second, what do the docs for getPageContent say? Do they say that it simply gives you the text on the page as if the thing were a simple text doc? I'd be surprised if that's the case.
    getPageContent returns an array of bytes, so the question becomes: how do I get text from this array? I was thinking of:
        private void jButton1_actionPerformed(ActionEvent e) {
            PdfReader read;
            StringBuffer buff=new StringBuffer();
            try {
                read = new PdfReader("d:/getjobid2727.pdf");
                read.getMetaData();
                byte[] data=read.getPageContent(1);
                int i=0;
                while(i>-1){
                    buff.append(data);
                    i++;
                }
                String str=buff.toString();
                FileOutputStream fos = new FileOutputStream("D:/test.txt");
                Writer out = new OutputStreamWriter(fos, "UTF8");
                out.write(str);
                out.close();
                read.close();
            } catch (Exception f) {
                f.printStackTrace();
            }
        }
    "D:/test.txt" hasn't been created when I ran the program.
    Are my steps right?
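Two things stand out in the posted loop: `i > -1` stays true for about two billion iterations (or until memory runs out), so the file write is never reached, and `buff.append(data)` appends the array's object string rather than its bytes. Beyond that, the bytes of a page's content are drawing operators, with the visible text sitting in parentheses before operators such as `Tj`. The sketch below illustrates that idea in Python on a hand-written, already-decoded stream. It is an assumption-laden toy, not the PDF library's method: `extract_tj_text` is a hypothetical helper that ignores `TJ` arrays, hex strings, and font encodings, which is why a library's own text-extraction facility is the practical route.

```python
import re

# Matches a literal string "( ... )" followed by the Tj (show text) operator.
# Escaped characters such as \( and \) are allowed inside the string.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")

def unescape(raw):
    """Undo the backslash escapes for parentheses and backslashes only."""
    return re.sub(rb"\\([()\\])", rb"\1", raw)

def extract_tj_text(content_stream):
    """Return the strings shown by Tj operators in a decoded content stream."""
    return [unescape(m.group(1)).decode("latin-1")
            for m in TJ_PATTERN.finditer(content_stream)]

# Hand-written sample stream for illustration (not taken from a real PDF).
sample = rb"BT /F1 12 Tf 72 720 Td (Test's Sample) Tj 0 -14 Td (Price \(USD\)) Tj ET"
print(extract_tj_text(sample))
```

Reading the file line by line with a `FileReader`, as in the first post, cannot work for the same reason: the text is embedded in (usually compressed) content streams, not stored as plain lines.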

  • Embedding Full-text Index into PDF File

    Hello Everyone,
    I've tried to create and embed a full-text index into a PDF file, but with no luck. I followed the steps described at http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSC28D4DBB-6A78-4027-9E04-F50FE411CFB9.w.html. The progress of data collection is shown, and at the end the Update Index button is enabled, which I take as a sign that the index was created. After clicking OK, saving the document as a new one, and reopening the newly created document, the Manage Embedded Index dialog reports that no index is embedded. Is any other step necessary, or is this a bug? I am using Adobe Acrobat Pro 9.1 on 32-bit Windows Vista.
    Jan
    PS: Interesting is also comment at the bottom of above mentioned help page...

    Thanks for the response. It is true that if I make changes and look at the embedded index status, it shows that it needs updating
    However the problem I can't get around after extensive testing is that sometimes for no apparent reason the index is dropped on save. This can happen if I check the status of the index to make sure it is valid, save the file, and reopen it.
    I've concluded that this must be a bug and am using other indexing options for the time being.

  • How to extract text from a PDF file using php?

    How to extract text from a PDF file using php?
    thanks
    fabio

    > Do you know of any other way this can be done?
    There are many ways, but this is out of scope for this forum. You can try this forum: http://forum.planetpdf.com/

  • Search for PDF file content

    I am currently receiving hundreds of PDF attachments on a daily basis and am storing these PDF files in a file system. I am looking for a solution that will allow me to use full-text search on these files. Can someone help me out?
    Thanks
    Sam

    I am talking about server-level full-text search, not a search within an individual file. For example, if you have 1000 PDF files and you want to find out which file or files contain the word "shopping". Is there an Adobe plug-in that I need to buy? Do I need to store these files in a database rather than in the file system?
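Whatever product ends up doing it, server-level search generally comes down to two steps: extract the text from each PDF once, then keep an index from words to files so a query does not have to reopen every document. Here is a minimal sketch of the second step, assuming the text has already been extracted; the file names and contents in `texts` are made up, and `build_index` and `search` are hypothetical helpers rather than any product's API.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(texts_by_file):
    """Map each word to the set of file names whose text contains it."""
    index = defaultdict(set)
    for name, text in texts_by_file.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the sorted file names that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Stand-in for text already extracted from the PDFs (file names are made up).
texts = {
    "invoice_001.pdf": "Online shopping receipt for March",
    "memo_17.pdf": "Quarterly staffing memo",
    "flyer.pdf": "Holiday shopping hours and receipt policy",
}
index = build_index(texts)
print(search(index, "shopping"))
print(search(index, "shopping march"))
```

A database full-text feature or a dedicated search server does the same job at scale, so the files can stay in the file system as long as something extracts and indexes their text.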

  • Cannot do a full text search on pdf or Microsoft office Documents.

    I am using Oracle Content Server 10gR3
    Apart from the standard components and other components, I have InBound refinery Support and Oracle text Search Components enabled.
    I get the following error when I check in a pdf or a Microsoft office document.
    Text Conversion of the file '<repo_home>/.pdf' failed.*
    Content has been indexed with Info only. Resubmit should only be performed if the problem has been resolved.
    I am able to perform a full text search on *.txt and *.htm files.

    Here are some more errors from Console output log
    requestaudit     10.28 12:20:00.011     Audit Request Monitor     Request Audit Report over the last 120 Seconds****
    requestaudit     10.28 12:20:00.011     Audit Request Monitor     -Num Requests 349 Errors 345 Reqs/sec. 2.911 Avg. Latency (secs)0.309 Max Thread Count 2
    requestaudit     10.28 12:20:00.011     Audit Request Monitor     1     Service DOC_INFO     Total Elapsed Time (secs) 106.944     Num requests 345     Num errors 345     Avg. Latency (secs) 0.31
    requestaudit     10.28 12:20:00.011     Audit Request Monitor     2     Service COLLECTION_GET_REFERENCE     Total Elapsed Time (secs) 0.701     Num requests 2     Num errors 0     Avg. Latency (secs) 0.35
    requestaudit     10.28 12:20:00.011     Audit Request Monitor     3     Service COLLECTION_DISPLAY     Total Elapsed Time (secs) 0.211     Num requests 1     Num errors 0     Avg. Latency (secs) 0.211
    requestaudit     10.28 12:20:00.011     Audit Request Monitor     4     Service COLLECTION_GET_INFO     Total Elapsed Time (secs) 0.02     Num requests 1     Num errors 0     Avg. Latency (secs) 0.02
    requestaudit     10.28 12:20:00.011     Audit Request Monitor     ****End Audit Report*****

  • Full text search for web ? Yes or no ?

    Hi,
    I have a DB with more than 1.8 million records in a single table, and I would like to implement full-text search or some sort of caching for quicker web search.
    Let me describe what I have. The table that holds the 1.8 million records has 30 CLOB columns, each holding text. These are alphabetic columns: words that start with 'A' are in the first CLOB, 'B' in the second, 'C' in the third, and so forth.
    Searching is always done first by customerID and CreateDate, which are both indexed columns, and then the CLOBs are searched using INSTR.
    The execution plan was good, but search times have started to increase.
    So I would like to improve the search by implementing some sort of caching mechanism.
    I read a lot about this and found an example where I would create a table of unique words and a table of word occurrences. But with 1.8 million articles of approximately 500 words each, with words repeating across articles, there would be fewer than 50,000 unique words (in our language), while the occurrences table would grow dramatically, because every word in every article needs a row in it. That would be something like 900 million records in one table.
    Is it at all possible to have so many records in a single table and still make it quick?
    Is the Oracle Full text search the only right way in this situation ?
    Any suggestions ? Did anyone implement anything like this ?
    Thanks,
    Kris
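The 900 million figure in the post is the upper bound where every word position gets its own row: 1.8 million articles times roughly 500 words. Storing one row per distinct word per article lowers that. The sketch below works through the arithmetic and shows the effect on a toy corpus; `occurrence_rows` is a hypothetical helper and the two sample sentences are invented, so the numbers only illustrate the direction of the saving, not its size on real data.

```python
def occurrence_rows(articles, one_row_per_distinct_word):
    """Count the rows an occurrences table would need for the given articles."""
    total = 0
    for text in articles:
        words = text.lower().split()
        total += len(set(words)) if one_row_per_distinct_word else len(words)
    return total

# Upper bound from the post: every word of every article gets its own row.
articles_count = 1_800_000
words_per_article = 500  # approximate figure from the post
upper_bound = articles_count * words_per_article
print(upper_bound)

# Toy corpus showing how per-article de-duplication shrinks the row count.
toy = ["the cat sat on the mat", "the dog and the cat"]
every_word = occurrence_rows(toy, one_row_per_distinct_word=False)
distinct_only = occurrence_rows(toy, one_row_per_distinct_word=True)
print(every_word, distinct_only)
```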

    Let's start with your Oracle version. Please specify which version you run because Text capabilities vary dramatically between releases.
    >
    I tried using Oracle Text as suggested ... now if I understand correctly ....
    CTXCAT - would be great because when new records are added, index is updated automatically .... but doesn't support CLOBs ... so no go
    >
    CTXCAT is a concatenated transactional index that is supposed to optimize combined searches on text and other columns. No go for you as it indeed does not support CLOB columns.
    >
    CONTEXT - supports CLOBs, but I need to explicitly synchronize the index ....
    There are like 4000 inserts per day ..... and they all need to be indexed in a real-time ...
    >
    Not true, at least since 10g: SYNC(ON COMMIT) parameter makes this index type transactional (it's synchronized automatically on commit with this parameter set.)
    >
    If the CTX_DDL.SYNC_INDEX procedure synchronizes the whole table, which is now 1.8mil records, this can take a while ... so it can't be run after inserts ....
    >
    It does not, it only synchronizes changed data since last sync operation.
    So CONTEXT is actually perfectly suited for your needs (just redesign those 30 columns into one document column and index it.) Note that you need to regularly maintain CONTEXT indexes by scheduling CTX_DDL.OPTIMIZE_INDEX to run at off-hours and purge stale/removed data and rebuild its own internal index bitmaps for better performance. Otherwise you will see performance degrade as changes to the indexed data accumulate. You might also want to tweak initial indexing parameters, especially MEMORY parameter, as it greatly affects resulting index fragmentation - the more memory you give for initial indexing or optimization, the less fragmented and the more performant the index will be all other things equal.
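The point that a sync only touches changed data can be pictured with a pending queue: writes record which rows changed, and the sync step indexes just those rows. This is a conceptual toy, not Oracle Text's implementation; `PendingSyncIndex` and its methods are hypothetical names chosen for the illustration.

```python
class PendingSyncIndex:
    """Toy index that queues changed rows and indexes only those on sync."""

    def __init__(self):
        self.postings = {}   # word -> set of row ids
        self.indexed = {}    # row id -> words currently indexed for that row
        self.pending = {}    # row id -> latest text waiting to be indexed

    def write(self, row_id, text):
        """Record an insert or update; nothing is indexed yet."""
        self.pending[row_id] = text

    def sync(self):
        """Index only the rows changed since the last sync; return how many."""
        processed = len(self.pending)
        for row_id, text in self.pending.items():
            # Remove the row's old words before adding the new ones.
            for word in self.indexed.get(row_id, set()):
                self.postings[word].discard(row_id)
            words = set(text.lower().split())
            for word in words:
                self.postings.setdefault(word, set()).add(row_id)
            self.indexed[row_id] = words
        self.pending.clear()
        return processed

    def search(self, word):
        """Return the sorted row ids whose indexed text contains the word."""
        return sorted(self.postings.get(word.lower(), set()))

idx = PendingSyncIndex()
idx.write(1, "full text search")
idx.write(2, "text index maintenance")
first = idx.sync()     # both rows are new, so both are processed
idx.write(3, "search performance")
second = idx.sync()    # only row 3 changed since the previous sync
print(first, second, idx.search("search"))
```

With a few thousand inserts a day, each sync in this model handles only those new rows, while the 1.8 million existing rows are left alone.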

  • ? How do you change default for .pdf files from Reader 6 to reader 9

    I cannot change the default for .pdf files from reader 6 to reader 9. Also I cannot remove reader 6. I am running Vista

    Normally, completely removing old versions and reinstalling the latest version will reassociate all applicable extensions for the program, but it appears to have failed in your case.  There may be something else going on with your machine that is producing this behavior.  You probably need to investigate that.  But what you can do to make sure the defaults are correct is click on the start orb and select 'Default Programs.'  Then select 'Associate a file type or protocol with a program.'  From the list of file types, select each of the following in turn: acrobatsecuritysettings, fdf, pdf, pdfxml, pdx, xdp, and xfdf.  If its default program is not Adobe Reader, click on the Change Program button and select Acrobat Reader from the list (you may need to expand the lower section to display more programs than it initially provides).  If Adobe Reader does not show up in the list, you will have to browse for it.  It should be located in C:\Program Files\Adobe\Reader 10\Reader\AcroRd32.exe or C:\Program Files (x86)\Adobe\Reader 10\Reader\AcroRd32.exe.
    This method is safer than manually editing the Root\Classes in the registry.

  • When can oracle support full text search for Simplified Chinese?

    When I create an index using the CREATE INDEX clause, the following errors appear:
    "ORA-29855: error occurred in the execution of ODCIINDEXCREATE routine
    ORA-20000: interMedia Text error:
    DRG-11440: operation not supported for the SIMPLIFIED CHINESE language"
    Maybe I have to use LIKE to query words...
    Has anybody encountered the same problem and found a good solution? I look forward to your help!

    Hi,
    Full-text search is not currently supported and, unfortunately, we do not have a timeline for when it will be available.
    You can post feedback at the link below.
    http://feedback.azure.com/forums/217321-sql-database
    Regards,
    Mekh.

  • How to add text to a pdf file?

    I was sent a rental application as a pdf file which I need to fill in but I can't seem to add text to it.
    How does one open a pdf file and add text to it?
    I have acrobat professional that came with adobe cs3 on my mac.
    Thanks for any advice.

    If you don't have the original and need a Word version that you can edit:
    Open the PDF, go to Save As..., choose Word, and then choose the format.
    See this: http://www.screencast.com/t/igS0SoMDri
    Note: this only works with text-based files.
    Once it's converted, you can edit to your heart's content. Once edited as needed, create a new PDF with a different name.
    Open the original document in Acrobat, click on Tools (if Acrobat X), and choose Replace Pages with your new PDF as the source. Then choose the desired pages or the entire document.
    If you have any form fields and they have been displaced, choose Edit Form, click on each field that needs moving in turn, and use the up, down, right, or left arrow keys to nudge the elements to the desired position.
    If you have a form that has extended Reader rights, you have to remove those rights by saving a copy without rights. Then, once editing is completed, you have to add the rights again.

  • How to save print setup for pdf file so that every user can open and print w/o manually performing?

    Is there a way to configure the print settings (i.e. paper size, orientation, etc.) and save them as part of the PDF file? I would like the users of the PDF files to be able to open the file and simply print, without having to manually configure the print settings.

    Not sure if this will help. I got this information from the built-in help under Acrobat 9.
    Advanced
    Lists PDF settings, print dialog presets, and reading options for the document. In the PDF settings for Acrobat, you can set a base Uniform Resource Locator (URL) for web links in the document. Specifying a base URL makes it easy for you to manage web links to other websites. If the URL to the other site changes, you can simply edit the base URL and not have to edit each individual web link that refers to that site. The base URL is not used if a link contains a complete URL address.
    You can also associate a catalog index file (PDX) with the PDF. When the PDF is searched with the Search PDF window, all of the PDFs that are indexed by the specified PDX file are also searched.
    You can include prepress information, such as trapping, for the document. You can define print presets for a document, which prepopulate the Print dialog box with document-specific values. You can also set reading options that determine how the PDF is read by a screen reader or other assistive device.
    Create print presets
    A PDF can contain a set of print presets, a group of document-specific values that is used to set basic print options. By creating a print preset for a document, you can avoid manually setting certain options in the Print dialog box each time you print the document. It’s best to define print settings for a PDF at the time that you create it, but print presets provide a means to add basic print settings to a PDF at any time.
    Choose File > Properties, and click the Advanced tab.
    In the Print Dialog Presets section, set options and click OK.
    The next time you open the Print dialog box, the values will be set to the print preset values. These settings are also used when you print individual documents in a PDF Portfolio.
    Note: To retain a print preset for a PDF, you must save the PDF after creating the print preset.
    Print Dialog Presets
    Page Scaling
    Prepopulates the Page Scaling option in the Print dialog box with the option you choose:
    Default
    Uses the application default setting, which is Shrink To Printable Area.
    None
    Prevents automatic scaling to fit the printable area. This setting is useful for preserving the scale of page content in engineering documents, or for ensuring that documents print at a particular point size to be legal.
    DuplexMode
    For best results, the selected printer should support duplex printing if you select a duplex option.
    Simplex
    Prints on one side of the paper.
    Duplex Flip Long Edge
    Prints on both sides of the paper; the paper flips along the long edge.
    Duplex Flip Short Edge
    Prints on both sides of the paper; the paper flips along the short edge.
    Paper Source By Page Size
    Selects the option by the same name in the Print dialog box. Uses the PDF page size to determine the output tray rather than the page setup option. This option is useful for printing PDFs that contain multiple page sizes on printers that have different-sized output trays.
    Print Page Range
    Prepopulates the Pages box in the Print Range section of the Print dialog box with the page ranges you enter here. This setting is useful in a workflow where documents include both instruction pages and legal pages. For example, if pages 1–2 represent instructions for filling out a form, and pages 3–5 represent the form, you can set up your print job to print multiple copies of only the form.
    Number Of Copies
    Prepopulates the Copies box in the Print dialog box. Choose a number from 2 to 5, or choose Default to use the application default, which is one copy. This limitation prevents multiple unwanted copies from being printed.
    Thanks.

  • How to remove white margin for pdf file shared by iCloud Keynote?

    iCloud beta Keynote is very good. Thanks very much for Apple's effort.
    However, there is a little problem: the PDF file shared by iCloud Keynote has a white margin. How can I remove this white margin? Or is it a bug?

    No. I checked every page. Top side no white margin. In left and right side, the white margin is about 1mm, and in bottom side the white margin is about 2mm.

  • How to set default program for pdf files

    I was wondering if anyone knows how to change the default program for opening a file type. For example, I would like to set Adobe Acrobat as my default program for opening PDF files instead of Preview.
    thanks

    Hi and Welcome to Apple Discussions
    Control-click a pdf file and choose "Get Info". In the Info window choose which application to open the file with and then click on the "change all" button. Voila!
    Matthew Whiting
