Crawling web repository

Hello,
I have configured a simple web repository and I am able to search on it. To crawl this repository I first used the standard crawler, but it crawled only the first page. I then created a new crawler in which I explicitly specified the crawl depth, but that does not work either. Please suggest how to crawl through all the pages.
Note: our web repository is on the local intranet and there is no restriction on crawling it.

hi,
go through these links; they should help you:
http://help.sap.com/bp_epv260/EP_JA/documentation/How-to_Guides/27_WebRepEn.pdf
http://help.sap.com/saphelp_nw04s/helpdata/en/46/5d5040b48a6913e10000000a1550b0/content.htm
http://download.microsoft.com/download/e/c/c/ecc8b7a5-ddea-4f4e-a6c6-5e96dc1a2908/06_Application%20Interop%200%20-%20MOSS%20and%20Biztalk.pdf
http://blogs.msdn.com/saptech/archive/2007/02/15/enterprise-search-combining-sap-and-microsoft-realtime-crawling-results.aspx
reward me points..

Similar Messages

  • Error while crawling web repository

    Hi Experts,
    System in use - EP 2004s
    We have a web server which hosts a number of documents. I have created a web repository for this web server. The repository is working fine, but while crawling it has indexed about 20,000 documents and given errors for some 600 documents.
    Errors are like :
    1. Crawler error
    2. TREX preparation error
    and when I search for the indexed pages it returns search results, but when I click on the HTML version it gives the error message 'No index service found'.
    Any suggestions?
    (Points are assured ..)
    Thanks & Regards,
    Amit Kade

    Hi Tamil,
    Thanks a lot for the help! But I have already set everything correctly as per the SAP help.
    TREX is indexing the documents but, as I mentioned, it is not indexing all of them, and I cannot view the searched documents in HTML format.
    One observation: it is not indexing large documents (greater than 10-12 KB).
    Any suggestions?
    Thanks & Regards,
    Amit Kade

  • Crawling Web Repository - Error

    Hi Experts ,
    EP Version - EP 2004s
    I have configured a web repository as per the guide "How to configure a web repository and crawl it for searching ...".
    I have configured this for the portal index page. I can see the folder created under 'root' and one link created in that folder. When I click on that link I can access the portal index page.
    I have created an index for this and crawled it, but after crawling it has indexed only one page. I have tried this with some document iViews (HTML), but unfortunately it is indexing only one page.
    Can anybody tell me what is wrong?
    This is kind of urgent as I am at the customer site.
    Note: Helpful answers will be rewarded with points.
    Thanks & Regards,
    Amit Kade

    Hi Praksh ,
    Thanks a lot for the quick reply. Actually, I have already gone through these links.
    To keep it simple I have created a small website containing some HTML pages and links.
    I have created a web repository and crawled it for indexing, this time with custom index properties such as 'IndexContentOfExternalLink' and 'IndexInternalLinks'.
    But to my disappointment it has again indexed only one page, namely the initial page.
    Any suggestions?
    Thanks in advance .
    Thanks & Regards,
    Amit Kade

  • Index and crawler not working on Web Repository

    Hi Team,
    I'm trying to set up a Web Repository and crawl it for indexing. I've followed the steps from an SAP "How-to" document, but I guess the problem might be the way I'm configuring the web site in EP. I've created a Virtual Directory on my laptop's IIS 5.0 web server, and the URL of the web site has been set as http://laptop-ashishk/myWebSite.
    Do I need to set the START PAGE as /index.html (as per the spec it is not mandatory)?
    Let me know whether you need any information with regards to this problem.
    Ashish

  • Newly created Web repository not showing up in explorer

    I'm trying to create an index which would finally enable me to search the 'CNN' website (following the 'how to' document : 'HOW TO SET UP A WEB REPOSITORY AND CRAWLING IT FOR INDEXING'). I'm unable to assign the web repository which I  created in an earlier step to the index because it doesn't show up in the repository/folder listing. I can't even see it under 'Content Administration -> KM Content -> Repositories'.
    What exactly am I doing wrong?!
    thanks,
    Biju.

    Hi Karsten,
    - We're on EP6 SP2.
    - Yes, I'm referring to the document you mentioned.
    - Yes, the repository does show up with a green light under KM -> Component Monitor.
    BTW, I was able to bring up the web repository now that I've specified, under 'System Admin -> System Config -> Service Config -> Applications -> com.sap.portal.ivs.httpservice -> Services -> proxy -> HTTP-Bypass Proxy Servers', that everything under 'mycompany.com' must be bypassed. I had originally specified the proxy set up in the TREX configuration, which runs on a different (physical) server from the portal.
    I don't really get the connection, but essentially took the cue from the earlier reply I got.
    So, in short, I think the problem is solved at least for the moment.
    Thanks again for your help.

  • Subscription for a web repository

    Hi Experts ,
    I am using EP 2004s. I have created a web repository and activated a subscription service for it.
    Unfortunately, I am not able to receive any subscription notification emails; I do receive the subscription emails for the repositories created inside the portal.
    As per my understanding of the SAP help, for a web repository we should receive subscription notification emails once the crawler detects the changes.
    I have scheduled the crawler accordingly so that it can detect the changes.
    Can anybody throw some light on the basis on which the crawler considers that a document has been modified?
    Any advice? Points are assured.
    Thanks & Regards,
    Amit Kade

    Hi Amit
    The following snippet is taken from the description of Subscription Event Mapping:
    Subscription Event Mapping
    A subscription event mapping maps events sent by a repository manager onto events that are meaningful with respect to subscriptions to resources...
    As you can see, the Subscription Event Mapping handles events sent by a repository manager, but since no events can be sent from a Web Repository (as far as I know), you are actually not performing any mapping, and thus it does not work.
    I was once told by SAP support that "Send events" does not work for a WebDAV repository unless the change/event was triggered from within the KM framework. Since a web repository is not integrated any more closely into KM, my guess is that a web repository does not support "Send events" triggered from outside the portal either.
    By "custom built" I mean that "Send events" (and thus notification emails) might work if you develop a custom web repository manager that provides a very tight integration between a website and the portal's KM.
    Kind regards,
    Martin

  • Web Repository clarifications

    Hi,
    I am an SAP newbie.
    I am creating a web repository in KM. This is the scenario I want:
    I want all HTML pages contained within a website (i.e. http://www.xyz.com) to be stored in the repository.
    So I created an HTTP system for http://www.xyz.com and then configured a Website with that HTTP system, with a start page of /main/index.html and a system path of /main.
    I then created an index and so forth. However, when I try to access the repository, I can only access the start page.
    Any pointers?  Or was my concept about the repository incorrect?
    Points will be generously awarded.
    Charles

    Thanks for your reply.
    Yes, I have followed all the steps in the link.
    I have created an HTTP system with server URL http://www.xyz.com, a Web Site for that system (with the same system ID) with /main/index.html as the start page and /main as the system path. I then configured a cache and created a Web repository manager with the Web Site and cache I had just created.
    I gather that all documents of the website whose URLs begin with http://www.xyz.com/main will be stored in the web repository once I have accomplished the steps above?
    However, when I try to access the web repository via KM Content, I can only see the Web Site, and when I click on it, I arrive at the start page http://www.xyz.com/main/index.html. How can I make all the resources beginning with http://www.xyz.com/main appear in the web repository? I have created an index and configured the crawler, but it is still the same.
    And what is the difference between a Simple and a Standard Web repository? I have read the SAP documentation and still don't get it.
    Thanks, points will be generously awarded
    Charles

  • Web Repository Manager and robots.txt

    Hello,
    I would like to search an intranet site and therefore set up a crawler according to the guide "How to set up a Web Repository and Crawl It for Indexing".
    Everything works fine.
    Now this web site uses a robots.txt as follows:
    User-agent: googlebot
    Disallow: /folder_a/folder_b/
    User-agent: *
    Disallow: /
    So obviously, only Google is allowed to crawl (parts of) that web site.
    My question: if I'd like to add the TREX crawler to the robots.txt, what is the name of the "User-agent" I have to specify here?
    Maybe the name I defined in the SystemConfiguration > ... > Global Services > Crawler Parameters > Index Management Crawler?
    Thanks in advance,
    Stefan

    Hi Stefan,
    I'm sorry, but this is hard-coded. I found it in the class com.sapportals.wcm.repository.manager.web.cache.WebCache:
    private HttpRequest createRequest(IResourceContext context, IUriReference ref) {
        HttpRequest request = new HttpRequest(ref);
        // default user agent used by the web repository crawler
        String userAgent = "SAP-KM/WebRepository 1.2";
        if (sessionWatcher != null) {
            // an existing session can override the default user agent
            String ua = sessionWatcher.getUserAgent();
            if (ua != null) {
                userAgent = ua;
            }
        }
        request.setHeader("User-Agent", userAgent);
        Locale locale = context.getLocale();
        if (locale != null) {
            request.setHeader("Accept-Language", locale.getLanguage());
        }
        return request;
    }
    So you would have to recompile the component or change the filter... I would prefer to change the robots.txt.
    hope this helps,
    Axel
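
    For reference, assuming the crawler actually evaluates robots.txt and that the default user agent from the code above has not been overridden, an extended robots.txt admitting the KM crawler might look like the sketch below. How the crawler matches its user agent against the User-agent lines is not visible in this code, so you may have to try both the full string and a shorter token:

        # existing rule for Google
        User-agent: googlebot
        Disallow: /folder_a/folder_b/

        # admit the KM crawler; the full user agent string is "SAP-KM/WebRepository 1.2"
        User-agent: SAP-KM
        Disallow: /folder_a/folder_b/

        # everyone else stays blocked
        User-agent: *
        Disallow: /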

  • Error when indexing web repository

    I'm working on a problem that I'm having with indexing a web repository. For the sake of this post, we will call the web site that I'm indexing for the repository http://mysite1.com. For the most part, things are working just fine. The problem is that there are a couple of links on one of the pages in http://mysite1.com that aren't getting crawled.
    The first link is http://mysite2.com. This link is to a web site that is on our network, but you are normally required to provide a username and password to access it. The message in the crawler error file that's being generated is:
    ERROR     Mar 27, 2009 8:04:02 AM     /webdynamic/mysite2.com     http://mysite2.com/     processing failed     com.sapportals.wcm.repository.AuthorizationRequiredException     
    I created an HTTP System in the System Landscape Definitions for http://mysite2.com, and here's what it looks like:
    Description:         mysite
    Same User Domain:    <unchecked>
    Max Connections:     0
    Password:            <set to the password for the user>
    Server Aliases:      <blank>
    Server URL:          http://mysite2.com
    User:                myuser
    I have verified that the username and password that I have configured here are valid. I have also set up a web site definition for this, and here's what it looks like:
    Login Timeout:       <blank>
    System ID:           mysite.com
    All the rest of the options for the web site are blank.
    What else do I need to do to get the crawler to access the content of http://mysite2.com?
    The other link that I'm getting errors on is http://mysite3.com. The error in the crawler error file is:
    ERROR     Mar 27, 2009 8:04:01 AM     /webdynamic/mysite3.com     http://mysite3.com/     processing failed     com.sapportals.wcm.repository.TimeExceededException:
    request to /: Read timed out     
    This site is accessible both internally and externally to our network. I'm not sure what I need to do for this. Can anyone help me out with this?
    Thanks!
    -Stephen Spalding

    Hi Esther Schmitz,
    Thanks for the quick reply. As you said, I have changed the website URL to http://www.cnn.com, but it still shows the error message below.
    The target of the link TECH you tried to navigate to is not available. Its repository might be disconnected or the target may have been renamed, moved, or deleted. Contact your system administrator if you think the target /CNN/TECH/TECH should be available.
    Thanks,
    Satya

  • Web Repository - how can I delete the cache?

    Hi,
    we are using a web repository for searching with TREX in our intranet.
    So far so good. Now we have switched our intranet from a Lotus Notes app to Opentext Websolutions, but the URL remains the same (just an IP change).
    When I now reindex the web repository, it still shows the pages it crawled when it was a Lotus Notes app.
    When I click in the web repository folder on the entry with the intranet name, it opens a new window with the new intranet.
    Clearing the caches under System Administration > Monitoring > Knowledge Management > Cache Monitor did not change anything.
    Is there anything else I can do?
    Regards,
    Kai

  • HTML property extractor for web repository

    Hi All,
    I was just wondering whether anyone has worked on this...
    We are using EP 2004s, and for one of our web servers we have created a web repository; now we want to use an 'HTML property extractor' to extract the values of META tags from the HTML documents.
    This will help us in filtering the search results, so we have followed these steps:
    1. Created an HTML property extractor for the meta tags (META all = <meta tag name>).
    2. Assigned this extractor to the web repository.
    3. Crawled the website .
    After this we tried to search for the documents using the meta tag values but were not able to find any documents.
    We have even tried to filter the results by adding this custom property under 'custom properties', but this also didn't work.
    Is there anything we are missing, or does this have to work in some other way?
    Note: we have tried an HTML extractor with the <title> tag and extracted it successfully.
    Useful answers will be rewarded points !
    Thanks & Regards,
    Amit Kade
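
    For reference, here is a minimal sketch of the kind of HTML head such an extractor is meant to pick up; the meta tag name 'department' is only a hypothetical example and would have to match whatever name is configured in the extractor (META all = department):

        <html>
          <head>
            <title>Quarterly report</title>
            <!-- hypothetical custom meta tag the property extractor is configured for -->
            <meta name="department" content="Sales">
          </head>
          <body>...</body>
        </html>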

    Hi Amit
    I'm working on a solution using Web Property Extractors as well, and would like to know if you managed to find a solution to your problem.
    Kind regards,
    Martin Søgaard

  • Trouble with indexing web repository

    Hi All,
    We've recently upgraded to TREX version 7.10.34.00, and I'm trying to get one of our web repositories to index.
    I can get the web repository to index if I do not include any 'include' result resource filters in the crawler parameters, but it does not index if I do include one. I have had some success using an 'exclude' result resource filter, just not the 'include' one.
    The name of the web site that I'm indexing is http://site1.domain.com. When I do not include any result filters, a sampling of the crawler log file looks like this:
    INFO Jan 27, 2010 10:10:10 AM /mywebrepository/site1.domain.com http://site1.domain.com/  provided  text/html
    INFO Jan 27, 2010 10:10:10 AM /mywebrepository/site1.domain.com/files/index.htm http://site1.domain.com/files/index.htm  provided  text/html
    INFO Jan 27, 2010 10:10:10 AM /mywebrepository/site1.domain.com/files/folder1/tableofcontents.htm http://site1.domain.com/files/folder1/tableofcontents.htm  provided  text/html
    When I go into TREX monitor, the queue has lots of documents that it indexes.
    This is what my result filter settings look like:
    Include Documents/Web-Pages: <checked>
    Include Folders: <checked>
    Include Links (Not Applicable For Web-Sites): <unchecked>
    Case Sensitive (Folders And Documents/Web-Pages): <unchecked>
    Item ID Mode (Documents/Web-Pages Only): include
    Item ID Patterns (csv): *.html, *.htm
    Mime Type Mode (Documents/Web-Pages Only): include
    Mime Type Patterns (csv):
    Minimum Content Size (Documents/Web-Pages Only): <blank>
    Maximum Content Size (Documents/Web-Pages Only): <blank>
    Maximum Age of Last Modification (Documents/Web-Pages Only): <blank>
    With the result filter in place in the crawler parameters, I click the button to index. The crawler log files are generated, but nothing shows up in the TREX monitor queue for the index. The Time Stamp doesn't change either. I have tried changing the parameters in the 'Item ID Patterns' field, but it still doesn't work.
    Is this a bug with this new version of TREX or am I not using this filter properly? This seemed to work when I was using TREX version 6.
    Thanks!
    -StephenS

    I was never able to resolve this problem, but I have now retired the computer.

  • How to configure a web repository

    Hi All,
    At the customer site we have the following configuration:
    1. There is one web server and it is connected to 5 document servers (back-up servers).
    2. All 5 document servers maintain the same data (HTML documents).
    3. The web server redirects the user to the nearest document server depending on the user ID (normal web server functionality).
    The requirement is to connect this web server to the portal. This can be achieved by configuring a web repository.
    If I configure a web repository ...
    1. How do I pass the user login data to the web server so that it can redirect the user to the nearest document server?
    Thanks & Regards,
    Amit kade

  • SAP Web Repository, how can I access it from outside

    I'm reading this http://help.sap.com/saphelp_sm40/helpdata/en/14/030fc5b63f11d5993900508b6b8b11/content.htm
    and it says this:
    "You use a Web repository (manager) to provide read access to documents stored on remote Web servers."
    This could be really useful to me, but how does this work? I'm able to add objects to it using transaction SMW0, but how can I get a valid URL so that the users in my network can access these files?
    The only FM I know is 'DP_PUBLISH_WWW_URL', but this generates a kind of link that my browser can't process (something like SAPR3://WebRepository/0123456789/ZMY_FILE?Version=00001).
    How can I create valid links to these documents, so that I can open them in my web browser?

    Though they have the same name, they are different! The SAP Library page you mention is about Knowledge Management.
    If you want to generate an external URL for the Web Repository (SMW0), first read Note 865853 - WebReporting/WebRFC obsolete as of NW2004s. I don't know your SAP release, but if you are on 7.0, you need to "release for internet" the WWW_GET_MIME_OBJECT function module via SMW0, and probably activate the /sap/bc/webrfc service using transaction SICF. Then you'll be able to access your ZZZZZ web document using a URL like this: http://yourserver:port/sap/bc/webrfc/!?_function=WWW_GET_MIME_OBJECT&_object_id=ZZZZZ&client=220&language=EN
    Note: DP is only for allowing the HTML browser (and other SAPGUI controls) to access objects transmitted from SAP to the SAPGUI using the Data Provider service.
    Edited by: Sandra Rossi on Jun 30, 2010 10:45 PM CET
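
    For example, for the ZMY_FILE object mentioned in the question above, the resulting URL would look something like the following (the host, port and client number are only placeholders taken from the pattern above and depend on your system):

        http://yourserver:8000/sap/bc/webrfc/!?_function=WWW_GET_MIME_OBJECT&_object_id=ZMY_FILE&client=220&language=EN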

  • Display word file from web repository

    Hi,
    I have an application that stores Word files in the SAP web repository. Now I need to display a Word file in Web Dynpro. There are examples where a Word file is displayed by the OfficeControl UI element from the MIME repository.
    How can I get access to Word files in the web repository and display them?
    Regards,
    Ilya
    Edited by: Ilya M. on Dec 12, 2010 12:02 PM
    Edited by: Ilya M. on Dec 12, 2010 12:03 PM

    hi,
    see this thread: [SAP Web Repository, how can I access it from outside|SAP Web Repository, how can I access it from outside]
