No pdf indexing

Hi, I'm using CTXSYS.CONTEXT with URL_DATASTORE. All other parameters are left unspecified (defaults).
While plain text documents appear to be indexed properly, PDFs are not. They appear unfiltered: the index contains only PDF keywords.
I understood that by default AUTO_FILTER should kick in here and recognize PDF documents as such.
What's wrong? Thanks.

According to the documentation, ctxsys.auto_filter is the default, so you should not have to specify it. However, I have found that in some versions, explicitly specifying it as an index parameter makes a difference.
You need to make sure that the auto_filter supports your PDF versions, your operating system and version, and your Oracle version and edition. You also need to make sure that the PDF is not password-protected and does not use any special PDF features that prevent filtering. These things are listed in the documentation and differ between versions.
There have been a lot of changes to filtering, and there are some patches.
What is your Oracle version and edition?
What is your operating system and version?
What is/are the PDF versions?
Can you filter a simple PDF document without anything special, just a single sentence for testing?
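To answer the PDF version and password questions above quickly, a small script can read the PDF header and look for an encryption marker. The following is a minimal Python sketch and is not part of Oracle Text. The function name `inspect_pdf` is hypothetical, and the `/Encrypt` byte search is only a heuristic: it can miss a marker stored inside a compressed object stream, and it can give a false positive if those bytes happen to occur in the content. The header version can also be overridden by a `/Version` entry in the document catalog, which this sketch does not read.

```python
import re
from pathlib import Path


def inspect_pdf(path):
    """Return (header_version, looks_encrypted) for a PDF file.

    header_version is the text after '%PDF-' found in the first 1 KB,
    or None if no header is found.
    looks_encrypted is a heuristic: True if the bytes '/Encrypt'
    occur anywhere in the file.
    """
    data = Path(path).read_bytes()
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    version = match.group(1).decode("ascii") if match else None
    return version, b"/Encrypt" in data
```

Calling `inspect_pdf("test.pdf")` on the single-sentence test document gives a quick way to compare its header version against the list of supported formats for your release.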


PDF indexing of Word.doc Keywords: kind of disappointing

If we could embed document properties like Keywords in Word, and then convert Word.docs to PDFs , and then index the PDFs using Acrobat Pro, theoretically it would allow for lightning-fast keyword search and review, through a zillion PDFs.
Except, there are unexpected glitches that are either undocumented... or, if the documentation exists, it's either hard to find or too scantily worded.
Here's a few things I've observed, using Windows XP, Office 2007 and Acrobat 8 Pro:
1. Word document properties only transfer over to PDF if you use the Acrobat tab in Word's ribbon to generate the PDF via PDFMaker, which apparently invokes some 'more robust' implementation of Distiller, than occurs if you simply use the print dialog to print to PDF. THIS IS PROBABLY A BIG SOURCE OF USER CONFUSION THAT DISCOURAGES MANY USERS FROM GOING ANY FURTHER WITH EXPERIMENTING WITH PDF INDEXING OF WORD DOCUMENT PROPERTIES. IT CAN LEAD YOU TO CONCLUDE THAT NONE OF THE WORD DOCUMENT PROPERTIES EXCEPT FOR TITLE, CAN SURVIVE A PDF CONVERSION.
2. When you invoke PDFMaker and the "Save Adobe PDF File As" dialog appears, you must click the button at the bottom labeled "Adobe PDF conversion Options" and verify that the "Convert document information" check box is checked. (This may be settable as a permanent user preference somewhere, but I'm not quite sure where.)
3. The Properties fields in Word that will come over, include Title, Author, Subject, and Keywords. (The Comments field is ignored, as far as I can tell.)
4. You can now index the PDFs, and these Properties fields will also be indexed.....well, Sort Of.
5. "Sort Of", because if you then search for any of the text in your Properties fields, (like for example you search for a word or phrase that you've embedded in their Keywords fields), the advanced search result won't be displayed quite the way 'found hits' normally display in a PDF index search results screen. You may expect to see the contents of those Keywords fields, show up in the search results in a long list of 'found' file icons with ALL (or a generous selection) of their surrounding Keywords also displayed, and with the specified found keyword highlighted in BOLD.
But, that's not what happens. What you really get is an icon showing the contents of the Title field (which you didn't search for.) It basically means that Acrobat has found a document with something you searched for, in it....but Acrobat is not going to show it to you as easily as you are accustomed to seeing it. You only have two choices: (1) either hover your mouse over each found file's Title icon, one by one, until its screentip-type popup window appears, showing you all the contents of all four of that document's Properties fields; or (2) click on the icon, display the PDF, go to File Properties, and observe that file's properties dialog box.
This is disappointing: the fast, easy, contextual lookup advantages you've enjoyed with regular PDF index searches appear to be unavailable when it comes to viewing search results on indexed document properties. I can understand the logic; (why show other keywords surrounding the searched-for keyword? If they're not in a sentence, there's really no contextual relationship, and therefore no reason to show them.)
However, what if users wanted to store logically related keywords in a deliberate organized pattern..ie,
Texas, Car, 1999, Ford, Mustang, Green
Texas, Car, 2000, Ford, Mustang, White
Texas, Car, 2000, Ford, Mustang, Yellow
Texas, Car, 2000, Chevrolet, Corvette, Blue
Ohio, Car, 2006, Honda, Civic, Silver
...etc.
In this context, all keywords are logically related; it could be a big advantage to be able to use PDF Index search to instantly find and view a list of all 5,328 White 2000 Ford Mustangs located in Texas....then pop up their insurance.doc PDFs for further details.
Allowing the user to set a preference to 'show all stored property values in the search results' instead of an arbitrary-length string of surrounding values could also be very helpful, so that the full information in the above example could be displayed in full rather than arbitrarily truncated.
I guess the only workaround is to forget Word's Document Properties, and just embed keywords within the document itself, such as maybe at the end of the document, maybe colored white (so they can't be easily seen). Formatting them as hidden text doesn't work; Acrobat ignores hidden text when you convert from Word to PDF.
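
The structured-keyword idea above can be tried outside Acrobat. Here is a minimal Python sketch, assuming each document's Keywords field holds one comma-separated record in the fixed order used in the example (state, type, year, make, model, color). The field names and the helpers `parse_keywords` and `find` are my own, and the sketch does not read PDF metadata itself; it only shows the lookup that the search results screen does not offer.

```python
from typing import NamedTuple


class Record(NamedTuple):
    state: str
    kind: str
    year: str
    make: str
    model: str
    color: str


def parse_keywords(text):
    """Split one comma-separated Keywords string into a Record."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != len(Record._fields):
        raise ValueError(
            f"expected {len(Record._fields)} fields, got {len(parts)}"
        )
    return Record(*parts)


def find(records, **criteria):
    """Return the records whose named fields equal the given values."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


keywords = [
    "Texas, Car, 1999, Ford, Mustang, Green",
    "Texas, Car, 2000, Ford, Mustang, White",
    "Texas, Car, 2000, Ford, Mustang, Yellow",
    "Texas, Car, 2000, Chevrolet, Corvette, Blue",
    "Ohio, Car, 2006, Honda, Civic, Silver",
]
records = [parse_keywords(k) for k in keywords]
hits = find(records, state="Texas", year="2000", make="Ford",
            model="Mustang", color="White")
```

Because every field is kept, each hit carries the full record rather than a truncated snippet.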

I seem to get hard returns in all cases. That is why I had the short answer. I do not remember if saving as a DOC got rid of the hard returns. Of course the simplest way to find out is to try it. I went to another machine and the Save As to a DOC file did not put in the hard returns (cut and paste did). I should note that the PDF was produced from a totally different word processor and was not a WORD native document. That would suggest it is not dependent on the tags that can be included by WORD and PDF Maker.

Full text PDF indexing for website search?

Hi. We run a couple of websites on CQ5.5 and are trying to get the PDF files we refer to in the DAM to show in search results that users conduct on our sites. I've seen a number of references that imply that full text searching of PDFs is possible. For example:
http://dev.day.com/docs/en/crx/current/developing/searching_in_crx.html#Full-Text%20Extraction
But thus far I've not been able to figure out what I must do to get it working. I had expected that if this were possible to do, then it would have worked with the Geometrixx demo site. It did not.
Am I chasing my tail here, or is there actually a way to get this done? If it's possible, links to documentation on how to configure indexing_config.xml and any other required files would be greatly appreciated.
Thanks.

Laurent,
We never got a definitive answer, but we have suspicions that it was due to having upgraded from CQ 5.4 to 5.5. It seems that the libraries used for the indexing changed during that version upgrade. When I took our application and installed it on a pristine 5.5 installation, the PDF indexing worked. It was only our existing installations (two staging, two production) that did not work. So at least we know it's not our application or CQ in general.
Sadly, we don't have the resources to rebuild our servers, and we also ran into a separate problem that would prevent us from using the indexing anyway. It seems that there is no way to prevent cross-site results if you have multiple sites on the same CQ install and they each have their own sections in the DAM where the PDF files are stored. Would take some custom code to get around the issue, it seems.
For example, you have site A and site B.
/content/a <- Main site A content for pages
/content/b
/content/dam/a <- Site A's files in the DAM
/content/dam/b
There is no stock way, that I am aware of, to keep searches on site A from turning up PDF results from /content/dam/b (for site B), and vice versa. That's enough to keep us from using it - a total deal breaker.
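
The custom code mentioned above would essentially be a path filter on the search hits. Here is a minimal Python sketch of that idea, independent of CQ. The function `scope_hits` and the sample hit paths below the four roots listed above are hypothetical, and a real fix would more likely restrict the query itself on the server side rather than filter results afterwards.

```python
def scope_hits(hit_paths, site):
    """Keep only hits under /content/<site>/ or /content/dam/<site>/."""
    # The trailing slash stops site "a" from also matching "/content/ab/...".
    roots = (f"/content/{site}/", f"/content/dam/{site}/")
    return [p for p in hit_paths if p.startswith(roots)]


hits = [
    "/content/a/en/products",
    "/content/dam/a/brochure.pdf",
    "/content/dam/b/manual.pdf",
    "/content/b/en/contact",
]
site_a_hits = scope_hits(hits, "a")
```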

How to access cataloged pdf index on the Web

I have indexed a pdf catalog, then copied the results to a Web server. I can access the individual pdf files okay, but neither Acrobat Pro nor Reader can read the index. How do I do it? Thanks!

You need to EMBED the index file in the PDF. Here is a helpful blog posting that explains it: Speed up PDF Search with an Embedded Index

PDF indexing and multiple searches.

Dear members:
Please forgive me if my question is rather basic but I haven't been able to find the exact answers I am looking for in order to address my project needs.
I have a folder where I keep all of my PDF files. These are all articles from medical journals that I keep organized using a browser application specific for these types of articles. The application allows me to search these articles but it only looks for specific keywords (title, author name, date, journal name and keyword just to name a few). However, it doesn't look at the content of the PDF file to find words that are contained in the body of the article itself.
I would like to be able to use Acrobat to search these articles and try to find words I am looking for in the entire article instead of being restricted only to keywords. These are the questions I have:
1. What is the best way to index these PDF files so that they can become searchable ?
2. Is there a way to find out if they have already been indexed by the publishing company so that I avoid wasting time by doing it again ?
3. My library now contains approximately 15,000 articles and I expect it to grow to at least 30,000. How can I handle these searches so that performance doesn't become an issue ? Is there a way to ensure that Acrobat can search these number of files without taking a long time ?
4. I understand from the help files that Acrobat can search an entire folder so I don't have to run my search one article or file at a time. Is this correct ? What is the best way to run my search so that Acrobat looks at all files in one folder ? In this folder I have subfolders (subdirectories) ? Will Acrobat look at all files when searching including those in subdirectories within the specified directory ?
Thank you in advance for your help and replies.
Best regards,
Joseph Chamberlaini

After creating the index, you need to carry out the following checks.
First, check that your index tables contain indexed terms. Execute
select token_text from dr$YOUR_INDEX$i;
Second, check the index errors table CTX_INDEX_ERRORS. This is owned by the user CTXSYS, and most users do NOT have SELECT privilege on it by default.
If that is OK, then check that your PDF documents are supported by the INSO filter.
Citation:
"PDF - Portable Document Format
Acrobat Versions 2.1, 3.0, 4.0, and 5.0 including Japanese PDF"
(Appendix B. Supported Document Formats in Oracle Text Reference 9.2)
For Oracle 9i you could install 9.2.0.4 patchset (it included INSO FILTER 7.5)
P.S.
To begin with, you can find answers to your questions about Oracle Text here:
http://otn.oracle.com/products/text
Sorry for my English.
Best regards, Victor Zogin.

Acrobat XI destroys Windows 8 x64 pdf indexing capabilities

Hi,
just to mention, before anybody installs Acrobat XI Pro on Windows 8 x64, that it will result in the loss of the full text indexing of PDF files by Windows 8.
The setup of Acrobat XI hijacks the shell IFilter abilities with a non-functional "AcroIF.dll" (unsuitable for x64 systems), and even after disabling this crap (regsvr32 /u AcroIF.dll) there is currently no way to restore the perfectly-working, Microsoft-designed original IFilter ability for pdf files.
Beware.
And thank you Adobe engineers for replacing stupidly something that works with something that sucks...
Perhaps should you give the user some option not to bug their OS...

So, any chance of a bugfix in future updates? (rather than "updates" that trigger again and again the same issue?)
Here has come a new minor (security) update v11.0.2, which indeed still hijacks/breaks the indexing function. Reverting via the provided registry "fix" (actually a quick hack to compensate - but too late - for the laziness of Adobe's staff) implies a new reindexing of all PDF files (~24 hours for several dozens GB, and a substantial drain of CPU power. Hopefully energy is free nowadays).
I bet the solution is to politely decline any future update of Acrobat, security or not, if they bring nothing but a P.I.T.A. Until the company eventually decides to fix the flaw - if human lifespan allows.
(... Oh and absolutely no feedback from the so-called developers, despite a bug report a bunch of months ago.)
Funny how I've seen much more dedicated and competent support from software companies with 10x less staff and financial resource.

PDF index not working on iBooks

I have several PDF textbooks that I read in iBooks. They all have an index (both the multi-image one and the 'clickable' list that you can access at the top right-hand corner of the iPad screen). One of them doesn't show the clickable list index, even though it shows the multi-image one. I have tried this same PDF book in other reading apps (GoodReader, Kindle, Nook) and it does display an index there. I enjoy iBooks more than those apps for many reasons and would like this book to work here as well. Any thoughts? Thanks

You're right -- the placed PDF loses all its interactivity even when it becomes part of a PDF article. You'll need to add buttons over the PDF as you describe. Also, keep in mind that the Show/Hide Buttons action isn't supported in DPS, so you'll need to use a combination of buttons and MSOs to create hotspots, as described in this article:
http://blogs.adobe.com/indesigndocs/2010/12/hot-spot-button-workaround-for-indesign-dig-pubs.html

PDF Indexing

Do you have any software that can index PDF files so that they can be searched online, rather than only on a local computer?
I am creating a website that includes a search form. I have hundreds of PDF files uploaded. Any ideas on how they can be made searchable online?

Hello Oglesberry,
Thanks for your post. I am not sure of an option to perform a search online.
You could try asking your question in the Acrobat forums, as they may have more information for you. These forums are mostly specific to Acrobat.com and the services it hosts.
Acrobat Forums
Good Luck,
Pete

How to create the .BPDX file required to schedule PDF index updates?

Hi All,
I'm having trouble creating the .BPDX file required to schedule index updates. There is very little information on the subject and I have been going round in circles for the last couple of days.
I have created several files that appear to initiate the task (i.e. they will launch Acrobat), but they don't do anything once open. I have tried several different file path formats, etc., with no luck, and I can't find any examples of what the BPDX file should actually contain.
I'm using a Mac running Acrobat Pro 9.4.3. Any help would be much appreciated...
Thanks, Jason.

Yes, I find this a bit frustrating too. I managed to glean some information a while ago from earlier documents that appear to have been removed since. With this information I've managed to successfully regenerate a catalog using a BPDX file, on Windows. And the instructions Adobe gives are circular - they say to search for BPDX, but all you get are the same hits, which tell you to search for BPDX ....
I assume that you've already enabled the use of BPDX files in the preferences.
For a single catalog, the BPDX file should contain the full path to the .pdx file, quoted if necessary, followed by flags. To rebuild a catalog, add the flag "/rebuild". For example (windows format)
"C:\mypdffolder\mypdfindex.pdx" /rebuild
If you run Acrobat.exe with the .bpdx file as an argument, it opens Acrobat and regenerates the index. I've tried this in a script, and it works. What I haven't yet figured out is how to close Acrobat after it's done.
Hope this helps. I know this is Windows and not Mac, but it should be the same format.
Ken Dyall.
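
Based on the format described above (the full path to the .pdx file, quoted, followed by flags), a BPDX file could be generated from a script. This is only a Python sketch: the helper names `bpdx_line` and `write_bpdx` are hypothetical, the only flag taken from this thread is `/rebuild`, and always quoting the path, ending the line with a newline, and writing UTF-8 are assumptions that have not been checked against Acrobat.

```python
from pathlib import Path


def bpdx_line(pdx_path, flags=("/rebuild",)):
    """Build one BPDX line: the quoted catalog path followed by its flags."""
    return " ".join([f'"{pdx_path}"', *flags])


def write_bpdx(bpdx_file, pdx_path, flags=("/rebuild",)):
    """Write a single-catalog BPDX file and return the line that was written."""
    line = bpdx_line(pdx_path, flags)
    # Assumption: one line per catalog, newline-terminated, UTF-8 encoded.
    Path(bpdx_file).write_text(line + "\n", encoding="utf-8")
    return line


# Same catalog path as the Windows example above.
line = bpdx_line(r"C:\mypdffolder\mypdfindex.pdx")
```

The resulting file is what would be passed to Acrobat as its argument, as described in the post.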

Creating PDF indexes

This question was posted in response to the following article: http://help.adobe.com/en_US/acrobat/using/WS58a04a822e3e50102bd615109794195ff-7c37.w.html

Can I make a catalog of PDFs on my website and load the catalog to my website?

Need to generate a Index xml file for corresponding Report PDF file.

Need to generate a Index xml file for corresponding Report PDF file.
Currently in Fusion we generate a PDF file using a given RTF template and data model source through ESS BIPJobType.xml.
The PDF is generated successfully.
As per a requirement from the Oracle GSI team, they need an index XML file for the corresponding generated PDF file for their own business scenario.
Please see the following attached sample files.
PDF file: https://kix.oraclecorp.com/KIX/uploads1/Jan-2013/354962/docs/BPA_Print_Trx-_output.pdf
Index file: https://kix.oraclecorp.com/KIX/uploads1/Jan-2013/354962/docs/o39861053.out.idx.txt
In R12, we do this through a Java API call to FOProcessor to build the PDF. Here is a sample snapshot:
     xmlStream = PrintInvoiceThread.generateXML(pCpContext, logFile, outFile, dbCon, list, aLog, debugFlag);
     OADocumentProcessor docProc = new OADocumentProcessor(xmlStream, tmpDir);
     docProc.process();
PrintInvoiceThread:
     out.println("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>");
     out.print("<xapi:requestset ");
     out.println("<xapi:filesystem output=\"" + outFile.getFileName() + "\"/>");
     out.println("<xapi:indexfile output=\"" + outFile.getFileName() + ".idx\">");
     out.println(" <totalpages>${VAR_TOTAL_PAGES}</totalpages>");
     out.println(" <totaldocuments>${VAR_TOTAL_DOCS}</totaldocuments>");
     out.println("</xapi:indexfile>");
     out.println("<xapi:document output-type=\"pdf\">");
     out.println("<xapi:customcontents>");
     XMLDocument idxDoc = new XMLDocument();
     idxDoc.setEncoding("UTF-8");
     ((XMLElement)(generator.buildIndexItems(idxDoc, am, row)).getDocumentElement()).print(out);
     idxDoc = null;
     out.println("</xapi:customcontents>");
In R12 we are able to use the page number variables through oracle.apps.xdo.batch.ControlFile:
     public static final String VAR_BEGIN_PAGE = "${VAR_BEGIN_PAGE}";
     public static final String VAR_END_PAGE = "${VAR_END_PAGE}";
     public static final String VAR_TOTAL_DOCS = "${VAR_TOTAL_DOCS}";
     public static final String VAR_TOTAL_PAGES = "${VAR_TOTAL_PAGES}";
Is there any similar Java library that does the same thing in Fusion?
Note: I checked the BIP documentation at http://docs.oracle.com/cd/E21764_01/bi.1111/e18863/javaapis.htm#CIHHDDEH
          Section 7.11.3.2 Invoking Processors with InputStream.
But this does not help me much. Is there any other document or viewlet that covers this?
Appreciate any help/suggestions.
-anjani prasad
I have attached these java file in kixs : https://kix.oraclecorp.com/KIX/display.php?labelId=3755&articleId=354962
PrintInvoiceThread
InvoiceXmlBuilder
Control.java

You can find the steps here.
http://weblogic-wonders.com/weblogic/2009/11/29/plan-xml-usage-for-message-driven-bean/
http://weblogic-wonders.com/weblogic/2009/12/16/invalidation-interval-secs/

Acrobat - Convert Office documents to PDF so that it is crawled/indexed by SharePoint search

Hi there,
This is a hybrid question between Acrobat and SharePoint and I'll post on both forums....
Background:
In a fairly complex application we have a publishing server that utilizes Acrobat to convert Office documents to PDF using the Convert to PDF functionality.
We then publish that PDF to a library in SharePoint. We would like to have those published PDFs searchable by SharePoint search. Unfortunately there is something about these PDFs where SharePoint cannot crawl the content.
Note: I do realize that PDFs are not indexable by SharePoint out of the box and I have installed and configured the iFilter utility. I have been able to index and search for other PDFs, so I know the mechanism works. It just seems to be these particular PDFs.
I have also manually "Saved as PDF" directly from Word/Excel and those PDFs are crawled by SharePoint....it just seems to be when Acrobat does its conversion. I'm sure it's just a simple configuration somewhere... I just don't know what I'm looking for.
Another note: When I open the published PDFs, I am able to use Acrobat's search to find the text.... and the text is selectable; so it's not as if the conversion changed it to an image.
So....would anyone happen to have encountered this issue? Or does anyone know what makes a PDF indexable by SharePoint search?
Thanks in advance

Hi ,
According to your description, my understanding is that the PDFs which are converted from Office documents by Acrobat cannot be crawled in your SharePoint 2010.
For your issue, please make sure that the PDF version of these files is 1.5 (Acrobat 6.x) or above.
You can verify this with the following steps:
Open your PDF using Adobe Reader.
Go to File -> Properties.
Check the PDF Version under Advanced section.
Best Regards,
Eric
Eric Tao
TechNet Community Support

Convert to PDF from Excel so that it is indexable by SharePoint search

Hi there,
This is a hybrid question between Acrobat and SharePoint and I'll post on both forums....
Background:
In a fairly complex application we have a publishing server that utilizes Acrobat to convert Office documents to PDF using the Convert to PDF functionality.
We then publish that PDF to a library in SharePoint. We would like to have those published PDFs searchable by SharePoint search. Unfortunately there is something about these PDFs where SharePoint cannot crawl the content.
Note: I do realize that PDFs are not indexable by SharePoint out of the box and I have installed and configured the iFilter utility. I have been able to index and search for other PDFs, so I know the mechanism works. It just seems to be these particular PDFs.
I have also manually "Saved as PDF" directly from Word/Excel and those PDFs are crawled by SharePoint....it just seems to be when Acrobat does its conversion. I'm sure it's just a simple configuration somewhere... I just don't know what I'm looking for.
Another note: When I open the published PDFs, I am able to use Acrobat's search to find the text.... and the text is selectable; so it's not as if the conversion changed it to an image.
So....would anyone happen to have encountered this issue? Or does anyone know what makes a PDF indexable by SharePoint search?
Thanks in advance

This cannot be done on a Mac. If you need to continue this discussion, please post in the Acrobat Macintosh forum.

Error While Saving As PDF

Hello,
We are on HFM 11.1.2.1.103 and financial reports 11.1.2.1.00 and I am having an issue when running a report batch. I have a batch that I run each month which contains a few reports which are emailed out to users. For some reason, the second time the batch runs on one of the reports for a different entity (using bursting on the entity) I get the below error message.
Error while saving as PDF
Index: 3, Size: 3
Any thoughts? The email goes out fine and the other reports are fine.
Thanks,
Jason

Check KM article id 1504257.1.
It gives the following solution:
1. On the FR server, navigate to Oracle \ Middleware \ EPMSystem11R1 \ products \ financialreporting \ bin.
2. Double-click FRConfig.cmd.
3. When the Java window (Java Monitoring & Management Console) pops up, click on the MBeans tab.
4. Expand com.hyperion > Financial Reporting > Attributes.
5. Locate the PrintServers entry. Remove any invalid server names from the Value field.
6. Stop and start the Hyperion Financial Reporting Web Application.

Pdf issue with internal links in Safari

Hello,
I am creating a web site that has 80 or so PDFs available to download. They each contain a series of 20 or so speeches. To make navigation easier, I have created a new page in OpenOffice that is an index to each PDF, with each page number as an internal link to the correct page. This is what I did:
1. Opened each PDF in Preview, inserted a blank page (page 2), and saved the PDF. This gives me the correct page numbering for when I put in the real index.
2. Noting the starting page number of each speech, I created an index page in OpenOffice.
3. I saved the index as a PDF.
4. I opened the original PDF and the new PDF index.
5. I deleted the blank page, then dragged the new index page across and made sure it was in the same page position (page 2).
6. I then used the link tool in Preview to drag a link box over the page number in the index, and chose the correct page number to set as the target via Preview's inspector.
7. Saved the PDF, opened it, and tested: everything worked fine.
The problem comes when I upload the PDFs and test the site with Safari. When I open a PDF in the default Safari window, the links are all out by one page, i.e. if I click on page number 30 in the index, I go to the next page, page 31. If I hover the mouse at the bottom of the page, bring up the Open in Preview icon, and open the PDF there, the index links work fine. I have tried with Explorer under Boot Camp, and the default window that opens for the PDF shows the same issue: it's out by one page.
Thanks in advance
nige.

Hi...
Try this...
Go to /Library/Internet Plug-Ins
Move the Adobe PDF Browser plugin (or just PDF Browser plugin) to the Trash.
See if that makes a difference.
