Crawling a Portal directory

Hi all,
I'm back again with what's probably another obvious question. Here's the situation: I'm trying to crawl the contents of some folders from one Portal to another.
I've created another World Wide Web data source and tried to create a crawler that uses it.
In the crawler setup I've included a URL to the knowledge directory.
Within the data source I've given it a valid user name and password for the Portal to be crawled, and I've also provided details of the portal login form to try to get access.
These are as follows.
Login URL: http://portalv5dev2.wiley.com:8080/portal/server.pt?
Post URL: http://portalv5dev2.wiley.com:8080/portal/server.pt?
Then I added form fields as follows:
in_hi_space=Login, in_tx_username=CrawlerUser, in_pw_userpass=Crawlerpw, in_se_authsource="", in_hidologin=true, in_hi_spaceID=0, in_hi_control=Login
When the job runs it reports
Error in getting child Documents and Containers (for node Crawler Start Node) from the Data Source Provider: IDispatch error #17152 (0x80044500): [CPTWebCrawlProvider::GetMIMEType, could not open 'http://portalv5dev2.wiley.com:8080/portal/server.pt?space=Dir&spaceID=1&parentname=MyPage&parentid=0&in_hi_userid=1&control=OpenSubFolder&subfolderID=2109&DirMode=1'(0x80044f65) <unknown error>]
Is it possible to crawl another portal directory, or should I develop my own portlets?
Thanks for any help. Adam
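
One way to take the crawler out of the equation is to replay that login POST from a small standalone program and see what comes back. Below is a rough sketch using plain HttpURLConnection with the Post URL and form fields quoted above (the user name and password are the ones from the data source; adjust as needed). If the response redirects back to the login page, or the returned <title> still says Login, the form fields or the auth source value are the likely problem rather than the crawler itself.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class PortalLoginCheck {
    public static void main(String[] args) throws Exception {
        // Post URL and form fields taken from the crawler setup described above.
        String postUrl = "http://portalv5dev2.wiley.com:8080/portal/server.pt?";
        String form = "in_hi_space=" + URLEncoder.encode("Login", "UTF-8")
                + "&in_tx_username=" + URLEncoder.encode("CrawlerUser", "UTF-8")
                + "&in_pw_userpass=" + URLEncoder.encode("Crawlerpw", "UTF-8")
                + "&in_se_authsource="          // empty auth source, as in the crawler config
                + "&in_hidologin=true"
                + "&in_hi_spaceID=0"
                + "&in_hi_control=Login";

        HttpURLConnection conn = (HttpURLConnection) new URL(postUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setInstanceFollowRedirects(false);  // make a redirect back to the login page visible
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        out.write(form);
        out.close();

        System.out.println("HTTP status: " + conn.getResponseCode());
        System.out.println("Set-Cookie : " + conn.getHeaderField("Set-Cookie"));

        // Print the <title> of whatever page came back; "Login" usually means the POST did not authenticate.
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.toLowerCase().indexOf("<title>") >= 0) {
                System.out.println(line.trim());
            }
        }
        in.close();
    }
}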

I have created a portal self crawl. I did this by creating an experience definition without SSO and a crawler user with a snapshot query portlet, and giving the web crawler a direct URL entry whose starting page is the login page action with the user ID, password and other form elements as parameters and values. It logs in to the home page and finds the snapshot query portlet and all the page URLs in the snapshot query. It starts to crawl, but it seems to hit the login page instead of the actual community page.
It looks like, in a normal browser-based scenario, if I log into the portal and then delete all my cookies, I too get a login page if I click any community page URL. The cookie seems to be jsessionid. This is true for an SSO-disabled experience definition as well.
Can you please tell me the settings required in the WWW content source to log in to the portal and turn a self crawl into a crawl of all portal pages, making them full-text indexed with correct names? I tried several settings and impersonating a user, but could not be successful.
PS: currently the crawler saves all pages as Login Page (1)(2)(3)... instead of the actual page name. I guess it takes the name from the <title> tag, but since it is not able to get into the pages and hits the login page, it just saves them by that name.
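
A quick way to confirm the jsessionid theory is to fetch a community page twice, once without any cookie and once replaying the cookie captured from the login response, and compare what comes back. The sketch below uses placeholder URLs and login parameter names (substitute your experience definition's login action and a real community page); it only illustrates the cookie hand-off the crawler would have to perform.

import java.net.HttpURLConnection;
import java.net.URL;

public class CookieCarryOverCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs and parameters; replace with your portal's login action and a community page URL.
        String loginAction   = "http://portalhost:50000/irj/portal?j_user=CrawlerUser&j_password=secret&login_submit=on";
        String communityPage = "http://portalhost:50000/irj/portal/mycommunity";

        // 1) Hit the login action and capture the session cookie (typically JSESSIONID).
        HttpURLConnection login = (HttpURLConnection) new URL(loginAction).openConnection();
        login.setInstanceFollowRedirects(false);
        System.out.println("Login status  : " + login.getResponseCode());
        String cookie = login.getHeaderField("Set-Cookie");
        System.out.println("Cookie        : " + cookie);

        // 2) Request a community page WITHOUT the cookie. This is effectively what the crawler does,
        //    and why every page comes back titled "Login Page" (the portal may still answer 200,
        //    so check the body/<title>, not only the status code).
        HttpURLConnection anonymous = (HttpURLConnection) new URL(communityPage).openConnection();
        System.out.println("Without cookie: " + anonymous.getResponseCode());

        // 3) Request the same page WITH the captured cookie replayed.
        HttpURLConnection withSession = (HttpURLConnection) new URL(communityPage).openConnection();
        if (cookie != null) {
            withSession.setRequestProperty("Cookie", cookie.split(";", 2)[0]);  // just "JSESSIONID=..."
        }
        System.out.println("With cookie   : " + withSession.getResponseCode());
    }
}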

Similar Messages

  • Com.sap.portal.directory.Constants Class

    Hi,
    I cannot find a jar file containing the Constants class of com.sap.portal.directory.
    Does anybody know which application and jar file it is contained in?
    And if someone has the corresponding jar file, please send it to: [email protected]
    Best Regards

    Hi
    I have sent the required jar file to your email ID.
    Regards
    (Reward points if it helps)

  • Can't find jar file which contains com.sap.portal.directory - Constants

    Hi,
    I am not able to find a jar-file which contains the Constants class of com.sap.portal.directory.
    Does anybody know where to find it?
    Thank you in advance.
    Kind regards, Patrick.

    Hi,
    Refer this link
    Using JAR Class Finder
    On the server you can find the jar files required for the application:
    C:\usr\sap\J2E\JC00\j2ee\cluster\server0\apps\sap.com\irj\servlet_jsp\irj\root\WEB-INF\portal
    Regards,
    Senthil K.
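
    If the class is already loadable somewhere (for example inside the server's own class loader), a tiny utility like the one below prints which jar it was actually loaded from. This is a generic sketch, not tied to any particular NetWeaver release; the fully qualified class name is passed on the command line.

    import java.security.CodeSource;
    import java.security.ProtectionDomain;

    public class WhichJar {
        public static void main(String[] args) throws Exception {
            // e.g. java WhichJar com.sapportals.portal.pcd.gl.PcdSearchControls
            Class c = Class.forName(args[0]);
            ProtectionDomain pd = c.getProtectionDomain();
            CodeSource cs = pd.getCodeSource();
            System.out.println(args[0] + " -> " + (cs != null ? cs.getLocation() : "<bootstrap class path>"));
        }
    }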

  • Crawling iplanet portal server secured content.

    Hi, All,
    I am new to the iPlanet portal server and am trying to come up with a solution for crawling
    secured content with a valid user name and password. What authentication mechanism does
    iPlanet Portal Server use to keep the user's session?
    Does iPlanet Portal Server use a cookie to store the session ID, or is it passed
    back and forth as a parameter? Where can I find more information about this?
    Any response is appreciated!
    Hao Huang

    currently there is no testing tool available as a part of the product.

  • Can't crawl publishing portal content

    I've got a single service account running my search service app, and I can crawl my single site collection / web app, which is https://intranet.domain.co.uk. I can run a full crawl, but it only picks up the default content and pages in the crawl log.
    I've got several Word and Excel files in the default document library, but I can't find those or any other results when I perform a search.
    The content source is set to crawl 'SharePoint sites' and doesn't produce any errors after a crawl; it reports back 45 or so success items, which are the aspx pages I mentioned above. I have also created a crawl rule with the URL, to no effect.
    Any idea why? The search service account has full read to the web app via user policy in CA, and I've given it read access to the content DB in SQL.

    Check the URL for one of the items in the crawl logs, check that it's present.
    Confirm that the document is published
    Check the Super User and Super Reader accounts

  • Remote content crawler on a file directory in a different subnet

    I'm trying to crawl a file directory that is on our company network but in a different subnet. It seems to be set up correctly, because I have managed to import most of the documents to the knowledge directory. However, when running the job a few times, sometimes it succeeds and sometimes it fails, without consistency. The main thing I notice is that it doesn't import the larger files (>5 MB), but our maximum allowed is 100 MB. Even when the job runs "successfully" there is a message in the job log:
    Feb 21, 2006 12:08:14 PM- com.plumtree.openfoundation.util.XPNullPointerException: Error in function PTDataSource.ImportDocumentEx (vDocumentLocationBagAsXML == <?xml version="1.0" encoding="ucs-2"?><PTBAG V="1.1" xml:space="preserve"><S N="PTC_DOC_ID">s2dC33967209AEE4710C5ED073C04B3EDCF_1.pdf</S><I N="PTC_DTM_SECT">1000</I><I N="PTC_PBAGFORMAT">2000</I><S N="PTC_UNIQUE">\\10.105.1.33\digitaldocs\s2dC33967209AEE4710C5ED073C04B3EDCF_1.pdf</S><S N="PTC_CDLANG"></S><S N="PTC_FOLDER_NAME">s2dC33967209AEE4710C5ED073C04B3EDCF_1.pdf</S></PTBAG>, pDocumentType == com.plumtree.server.impl.directory.PTDocumentType@285d14, pCard == com.plumtree.server.impl.directory.PTCard@1f6ef01, bSummarize == false, pProvider == [email protected]4)ImportDocumentExfailed for document "s2dC33967209AEE4710C5ED073C04B3EDCF_1.pdf"
    When the job fails, there is a different message:
    *** Job Operation #1 failed: Crawl has timed out (exception java.lang.Exception: Too many empty batches.)(282610)
    I tried increasing the timeout periods for the crawler web service and the crawler job. That didn't seem to work. Any suggestions?

    Hi Dave,
    Did you fix this issue? I'm having the same error.
    Thanks!

  • Portal Self Crawl

    Help Needed Crawling the Portal with a WWW crawler
    Posted: Mar 9, 2010 9:34 AM in response to: Robert Herrera
    I have created a portal self crawl. I did this by creating an experience definition without SSO and a crawler user with a snapshot query portlet, and giving the web crawler a direct URL entry whose starting page is the login page action with the user ID, password and other form elements as parameters and values. It logs in to the home page and finds the snapshot query portlet and all the page URLs in the snapshot query. It starts to crawl, but it seems to hit the login page instead of the actual community page.
    It looks like, in a normal browser-based scenario, if I log into the portal and then delete all my cookies, I too get a login page if I click any community page URL. The cookie seems to be jsessionid. This is true for an SSO-disabled experience definition as well.
    Can you please tell me the settings required in the WWW content source to log in to the portal and turn a self crawl into a crawl of all portal pages, making them full-text indexed with correct names? I tried several settings and impersonating a user, but could not be successful.
    PS: currently the crawler saves all pages as Login Page (1)(2)(3)... instead of the actual page name. I guess it takes the name from the <title> tag, but since it is not able to get into the pages and hits the login page, it just saves them by that name.

    Hello Soni!
    Please review and follow the steps mentioned here:
    Companies and Self-Registration with Approval (SAP Library - User Management of the Application Server Java)
    https://cw.sdn.sap.com/cw/docs/DOC-110636
    Hope this helps,
    Edison

  • Crawl Portal itself

    We are trying to crawl our portal so it can be searched, and when results are clicked they take you to the page and not the content item. But we are having problems following community links and pages. Has anyone done this? We are running 6.0 SP1.

    Ok, so in ALUI terms I'm assuming that "butterfly" is a community.
    Try the following:
    1) Go to http://portal.plumtree.com and log in.
    2) Type "Support Center" in the search box and hit Enter.
    3) Look at the results -- one of them will be the Support Center community (within the top five or so).
    4) Click on the Support Center community and it takes you directly to the community.
    That's what the portal does out of the box. Is there something different that you're trying to accomplish or is that it?
    HTH,
    Chris Bucchere | bdg | [email protected] | http://www.bdg-online.com

  • Retrieve PCD from Portal...Please Help...Urgent

    Hi Ritu,
    I hope you have fixed your issue by now.
    Kindly help me:
    I am trying to use the piece of code below to fetch the list of iViews from the PCD.
    The problem is I don't understand the error.
    Code in APC:
    ===========
    import java.util.ArrayList;
    import java.util.Hashtable;
    import java.util.List;
    import javax.naming.Context;
    import javax.naming.InitialContext;
    import javax.naming.NamingEnumeration;
    import javax.naming.directory.DirContext;
    import com.sapportals.portal.pcd.gl.*;          // IPcdContext, PcdSearchControls, IPcdSearchResult
    import com.sapportals.portal.prt.component.*;   // AbstractPortalComponent, request/response interfaces
    // (No import for Constants here; locating the jar that provides it is exactly the open question above.)

    public class APC_Comp extends AbstractPortalComponent {
        public void doContent(IPortalComponentRequest request, IPortalComponentResponse response) {
            try {
                Hashtable env = new Hashtable();
                env.put(IPcdContext.SECURITY_PRINCIPAL, request.getUser());
                env.put(Context.INITIAL_CONTEXT_FACTORY, IPcdContext.PCD_INITIAL_CONTEXT_FACTORY);
                // env.put(com.sap.portal.directory.Constants.REQUESTED_ASPECT, PcmConstants.ASPECT_SEMANTICS);
                // Since I couldn't find the jar for PcmConstants.ASPECT_SEMANTICS, I used the line below instead.
                env.put(Constants.REQUESTED_ASPECT, "com.sap.portal.pcd.gl.PersistencyAspect");

                InitialContext ctx = new InitialContext(env);
                DirContext dirCtx = (DirContext) ctx.lookup("pcd:portal_content/");
                List roleList = new ArrayList();   // was declared null; calling add() on it would throw a NullPointerException

                PcdSearchControls pcdSearchControls = new PcdSearchControls();
                pcdSearchControls.setReturningObjFlag(false);
                pcdSearchControls.setSearchScope(PcdSearchControls.SUBTREE_WITH_UNIT_ROOTS_SCOPE);
                dirCtx.addToEnvironment(Constants.APPLY_ASPECT_TO_CONTEXTS, Constants.APPLY_ASPECT_TO_CONTEXTS);

                NamingEnumeration ne = dirCtx.search("",
                        "(com.sap.portal.pcd.gl.ObjectClass=com.sapportals.portal.*iview*)",
                        pcdSearchControls);
                while (ne.hasMoreElements()) {
                    IPcdSearchResult searchResult = (IPcdSearchResult) ne.nextElement();
                    String location = "pcd:portal_content/" + searchResult.getName();
                    roleList.add(location);
                    response.write("Object is " + location);
                }
            } catch (Exception e) {
                response.write("Exception occurred due to: " + e.toString());
            }
        }
    }
    Error Log :
    ==========
    << item 1 : >>#1.5 #001125A585A0005F000003BD0002514800045CAD115E54F3#1227798297334#com.sap.portal.prt.runtime#sap.com/irj#com.sap.portal.prt.runtime#ramesht#150270##n/a##e592fa70bc8511dd9fdd001125a585a0#SAPEngine_Application_Thread[impl:3]_19##0#0#Error##Java###07:04_27/11/08_0078_79979950
    [EXCEPTION]
    {0}#1#java.lang.NoClassDefFoundError: com.sapportals.portal.pcd.gl.PcdSearchControls
    at java.lang.J9VMInternals.verifyImpl(Native Method)
    at java.lang.J9VMInternals.verify(J9VMInternals.java:66)
    at java.lang.J9VMInternals.initialize(J9VMInternals.java:127)

    Try something like this:
    If it's an image file:
    ImageIcon image = new ImageIcon(getClass().getResource("yourpackage/mypackage/image.gif"));
    If it's a text file:
    InputStream is = this.getClass().getResourceAsStream("yourpackage/mypackage/myfile.xml");
    But all in all, getClass() will consume a lot of computational power (time). If you have to access many files, it would be better to put them outside the jar.

  • Accessing multiple portals at the same time?

    Is it possible to access multiple portals at the same time?
    For example, what I want to achieve is different properties (layout,
    portlets, look & feel) for different groups of users accessing the same
    portal. The Associated Groups part on the Portal admin page is not
    fulfilling our requirements. So we decided to have different portals for
    different groups of users, all working through one portal, and accessing
    their custom portals. Is this achievable?
    What we are thinking is: put the common functionality in the repository
    portal directory, and the custom portlets/jsps in the group-specific portal
    directories. This way we can customize portal behavior for different groups
    of users. Is this achievable?
    Thanks.
    Amit

    You have to use the respective DRILL commands available in WAD to configure the drill operations on multiple characteristics...

  • Copy a file from server to the client - URLConnection to a Portal page

    Hello:
    I have an application running on the client side. When the app starts up it must open a file that is on the server side; to be more specific, the file is in the KM content of the portal.
    I tried to read it with URLConnection to copy the file from the server to the client, and the app will do it, but I get "Server returned HTTP response code: 401 for URL:".
    If you copy and paste the file's URL directly into the browser (http://host:port/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/ImagenesIM/file.txt), a login popup (Windows look and feel) is displayed. After entering the user and password, the file opens without problem.
    Any idea what I can use or how to do it?
    I think I probably have to move the app to a WAS directory instead of the portal directory.
    The app is executed via *.jnlp from a link on a portal page.
    Thanks a lot for your time.

    Javier,
    401 means an authentication error, i.e. your application is not authenticated to KM.
    What can you do? Actually, it depends. Check the current cookies in your application; probably there is an SSO cookie or a J2EE authentication cookie. You may try to set these cookies on the URLConnection (via setRequestProperty/addRequestProperty). Otherwise you have to supply authentication credentials to the URLConnection (also via setRequestProperty, most probably using the Basic HTTP authentication scheme).
    Valery Silaev
    EPAM Systems
    http://www.NetWeaverTeam.com
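
    For the Basic-authentication route Valery mentions, a minimal sketch could look like the following (Java 8+ for java.util.Base64; the URL, user and password are placeholders, and whether the portal actually accepts Basic authentication on that URL depends on how its authentication scheme is configured):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.Base64;

    public class KmDocumentFetch {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; substitute the real KM document URL from the post above.
            URL url = new URL("http://portalhost:50000/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/ImagenesIM/file.txt");

            // Basic HTTP authentication: base64("user:password") in the Authorization request header.
            String credentials = "someuser:somepassword";
            String encoded = Base64.getEncoder().encodeToString(credentials.getBytes("UTF-8"));

            URLConnection conn = url.openConnection();
            conn.setRequestProperty("Authorization", "Basic " + encoded);

            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }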

  • Problem crawling filenames with national characters

    Hi
    I have a big problem with filenames containing national (Danish) characters.
    The documents get an entry in wk$url but have error code 404 (Not Found).
    I'm running Oracle RDBMS 9.2.0.1 on Red Hat Advanced Server 2.1. The
    filesystem is mounted on the Oracle server using NFS.
    I configured UltraSearch to crawl a specific directory containing
    several files, two of which contain national characters in their
    filenames. (ls -l)
    <..>
    -rw-rw-r-- 1 user group 13 Oct 4 13:36 crawlertest_linux_2_æøåÆØÅ.txt
    -rw-rw-r-- 1 user group 19968 Oct 4 13:36 crawlertest_windows_æøåÆØÅ.doc
    <..>
    (Since the preview function is not working in my Mozilla browser, I'm
    unable to tell whether or not the national characters will display
    properly in this post. But they represent lower and upper cases of the
    three special danish characters.)
    In the crawler log the following entries are added:
    <..>
    file://localhost/<DIR_PATH>/crawlertest_linux_2_B|C?C%C?C?.txt
    file://localhost/<DIR_PATH>/crawlertest_linux_2_B|C?C%C?C?.txt
    Processing file://localhost/<DIR_PATH>/crawlertest_linux_2_%e6%f8%e5%c6%d8%c5.txt
    WKG-30008: file://localhost/<DIR_PATH>/crawlertest_linux_2_%e6%f8%e5%c6%d8%c5.txt: Not found
    <..>
    file://localhost/<DIR_PATH>/crawlertest_windows_B|C?C%C?C?.doc
    file://localhost/<DIR_PATH>/crawlertest_windows_B|C?C%C?C?.doc
    Processing file://localhost/<DIR_PATH>/crawlertest_windows_%e6%f8%e5%c6%d8%c5.doc
    WKG-30008:
    file://localhost/<DIR_PATH>/crawlertest_windows_%e6%f8%e5%c6%d8%c5.doc:
    Not found
    <..>
    The 'file://' entries look somewhat UTF-encoded to me (some chars are
    missing because they are not printable) and the others look URL
    encoded.
    All other files in the directory seem to process just fine.
    In the wk$url table the following entries are added:
    (select status url from wk$url where url like '%crawlertest%'; )
    404 file://localhost/<DIR_PATH>/crawlertest_linux_2_%e6%f8%e5%c6%d8%c5.txt
    404 file://localhost/<DIR_PATH>/crawlertest_windows_%e6%f8%e5%c6%d8%c5.doc
    Just for testing purposes, a
    SELECT utl_url.unescape('%e6%f8%e5%c6%d8%c5') FROM dual;
    actually produces the expected result: æøåÆØÅ
    To me this indicates that the actual filesystem-scanning part of the
    crawler can see the files, but the processing part of the crawler
    cannot open the files for reading and therefore fails with error 404.
    Since the crawler (to my knowledge) is written in Java, I did some
    experiments with the following Java program.
    import java.io.*;

    class filetest {
        public static void main(String args[]) throws Exception {
            try {
                String dirname = "<DIR_PATH>";
                File dir = new File(dirname);
                File[] fs = dir.listFiles();
                for (int idx = 0; idx < fs.length; idx++) {
                    if (fs[idx].canRead()) {
                        System.out.print("Can Read: ");
                    } else {
                        System.out.print("Can NOT Read: ");
                    }
                    System.out.println(fs[idx]);   // print the filename in both cases
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
    The behavior of this program depends heavily on the language
    settings of the current shell (under Linux). If LC_ALL is set to "C"
    (which is a common default), the program can only read files with
    filenames NOT containing national characters (just like the UltraSearch
    crawler). If LC_ALL is set to e.g. "en_US", then it is capable of
    reading all the files.
    I therefore tried to set the LC_ALL environment for the oracle user on
    my Oracle server (using locale_config and .bash_profile), but that did
    not seem to fix the problem at hand.
    So (finally) my question is: is this a bug in the UltraSearch crawler
    or simply a misconfiguration of my execution environment? If the
    latter, how do I configure my system correctly?
    Yours sincerely
    Martin Dahl Pedersen, Visanti ( mdp at visanti dot com )
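
    To separate the JVM's locale handling from UltraSearch itself, it can help to print the encoding-related system properties next to the directory listing, then run the test once with LC_ALL=C and once with LC_ALL=en_US and compare. A small sketch follows (sun.jnu.encoding is a Sun/Oracle-internal property and may be absent on other or older JVMs):

    import java.io.File;

    class EncodingCheck {
        public static void main(String[] args) {
            // Which encodings the JVM picked up from the environment it was started in.
            System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
            System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));

            // List the directory the crawler is pointed at and show which names are readable.
            String dirname = args.length > 0 ? args[0] : ".";
            File[] fs = new File(dirname).listFiles();
            for (int idx = 0; fs != null && idx < fs.length; idx++) {
                System.out.println(fs[idx].getName() + "  canRead=" + fs[idx].canRead());
            }
        }
    }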

    I posted my problem as a TAR on METALINK about a week ago.
    And it turns out to be a new bug in UltraSearch.
    It is now filed under BUG:2673282
    -- mdp

  • Issues in crawling documents created in PDF 9.0 in WCI 10g R3

    When we crawl PDF documents created in PDF 9.0, we see that the title of the document crawled in the Portal changes to Microsoft Word - xyx.doc, though the document opens correctly.
    We were told that there are issues with the 10gR3 PDF accessor, as it was not tested with PDF 9 when 10gR3 came on the market.
    Has anybody else encountered similar issues, and is there any patch/release that Oracle plans for the 10gR3 version for this?
    Thanks

    Thanks Bill,
    Since then this is what I have found.
    If I use the Printing feature, files can be opened in the Solaris GNOME PDF Viewer, but as you know, links and bookmarks will be lost.
    I tried to set the backward compatibility option to Acrobat 1.3 (the oldest option), but somehow it always resets to 1.6 when I save the PDF file.
    All fonts are embedded and I have reinstalled Acrobat just in case.
    I tried different PDF converters, and so far Foxit Phantom PDF works fine with Solaris (and is also cheap). It gets stuck when there are complicated line drawings in the Word file, but otherwise it has been stable and bookmarks/links have stayed alive. If I don't find a breakthrough in a couple of days, I will switch to Foxit.
    Anyhow, thank you for your support.

  • Synchronisation problem when using iFS as Portal document repository

    Is anyone using 9iFS as the repository for their Portal documents but getting DRG-11602: URL store: access to <file name> requires authentication when synchronising the PORTAL30.WWSBR_URL_CTX_INDX index. This is run under schema CTXSYS, using ctx_schedule.
    We use a URL on the Portal folder to access the iFS document and, if the ACE on the document ACL includes World Read, then the document is indexed correctly but if it has no World Read access then synchronisation fails with the above error. These secure documents are indexed correctly, however, when synchronising IFSSYS.IFS_TEXT.
    When you put the URL for the document in the browser then you are prompted for an iFS username/password and this is obviously the problem when synchronising. Oracle Support say that the Oracle 9i Oracle Text Reference, Chapter 2: Indexing, definition of URL_DATASTORE states :The login:password@ syntax within the URL is not supported. Oracle Support have also suggested that using iFS as the Portal repository is not standard practice and that we should simply add our documents as items on the folder. Doing this means not being able to take advantage of the added functionality of iFS such as versioning and, anyway, I thought that Oracle had plans to fully integrate the two products with iFS being the default repository in a future release of Portal.
    Until then, has anyone got any ideas for a workaround, because we are unable to index the contents of the secure documents on our corporate intranet? We can't be the only site using iFS and Portal in this way!

    Hello Raymond,
    I must say that I downloaded the JBoss Portal binary and not the JBoss AS + JBoss Portal bundle, because I already had a JBoss AS working, so it was the best way to do it (as is said in the JBossPortalReferenceGuide). I have both things (server and portal) in the same directory, but I don't know if maybe one of them should contain the other (I have seen that in the bundle, the portal directory contains the JBoss application server). When I downloaded the JBoss Portal and tried to deploy it by pointing my web browser to http://localhost:8080/portal it did not work, so I decided to copy the jboss-portal.sar directory from the JBoss Portal to the deploy directory of my server. Maybe this was a mistake.
    But anyway, I have seen that JBoss Portal 2.6 comes with the MyFaces jars, and as JBoss AS 4.2 uses the Sun RI by default, it is going to clash anyway. Should I just remove these jars from the portal? As I told you before, I tried doing that and I got two class-not-found errors.
    Please, any help would be really appreciated; I am losing a lot of time with this bug, because the server keeps running out of memory due to it.
    Thanks in advance.

  • Can we export portal in Evaluation version?

    Hi All,
    I have downloaded the Oracle 10g evaluation version (10.2) from the Oracle network. I want to know whether I can export my portal from this version, because the script files required for export and import, like contexp.cmd, secexp.cmd etc., are not present in my portal directory. So is it possible to export from this version, or do I need to purchase the full version?
    Please help, it's urgent.
    thanks in advance.
    Message was edited by:
    [email protected]

    Goto isn't ugly if you know what you would like to do with
    it. You boys look like you have restricted yourselves
    and can't imagine what to do with it. It would be nice to be
    able to use a labeled return as a Java goto.
    g.draw(shape) could be viewed as a goto. It jumps into a
    code segment, and the only difference between goto and this
    code is that it could return. Correction: will return.
    Restricted myself? I've programmed in languages with gotos. Do I miss it? Are there things I would like to do that I can't, without it? No. Is my code easier to understand now? Yes.
    I would like to be able to use
    goto in a signed "application" with a clear warning to the
    user: this application uses goto, use it with care.
    Some AI problems are best with goto. Please elaborate. What is the relevance of your application being signed? Are you saying that an end user should be able to accept a signed application which may use unsafe programming practices, if he chooses to accept the risk?
    Give me an example of an AI problem which is better solved using goto.
