Crawling Web Repository - Error

Hi Experts,
EP version: EP 2004s
I have configured a web repository as per the guide "How to configure a web repository and crawl it for searching ..".
I configured it for the portal index page. I can see the folder created under 'root' and one link created in that folder. When I click that link, I can access the portal index page.
I created an index for this repository and crawled it, but the crawl indexed only one page. I also tried this with some document iViews (HTML), but again only one page was indexed.
Can anybody tell me what is wrong?
This is fairly urgent, as I am at the customer site.
Note: Helpful answers will be rewarded with points.
Thanks & Regards,
Amit Kade

Hi Praksh,
Thanks a lot for the quick reply. Actually, I have already gone through these links.
To keep it simple, I created a small website containing some HTML pages and links.
I created a web repository and crawled it for indexing, this time with custom index properties such as 'IndexContentOfExternalLink' and 'IndexInternalLinks'.
But to my disappointment, it again indexed only one page, the initial page.
Any suggestions?
Thanks in advance.
Thanks & Regards,
Amit Kade
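
When a crawl stops at the start page, a first sanity check is whether that page exposes any static links a crawler could follow. The sketch below is not part of the SAP tooling; it uses only the Python standard library, and the page URL and HTML are made-up placeholders. It lists the internal and external links found in the raw HTML. Links that are only generated by JavaScript (one possible situation on a portal page, though that is an assumption) do not show up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.page_url, href))


def split_links(page_url, html):
    """Return (internal, external) link lists for one HTML page."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    host = urlparse(page_url).netloc
    internal = [u for u in collector.links if urlparse(u).netloc == host]
    external = [u for u in collector.links if urlparse(u).netloc != host]
    return internal, external


# Hypothetical start page: one static link, one external link, and one
# link that only exists after JavaScript runs (invisible to this parser).
sample = """
<a href="page2.html">Page 2</a>
<a href="http://other.example.org/">Other site</a>
<script>document.write('<a href="page3.html">Page 3</a>');</script>
"""
internal, external = split_links("http://intranet.example.com/index.html", sample)
print(internal)
print(external)
```

If the internal list comes back empty for the real start page, the crawler has nothing to follow, whatever the index properties are set to.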

Similar Messages

  • Error while crawling web repository

    Hi Experts,
    System in use - EP 2004s
    We have a web server that hosts a large number of documents, and I have created a web repository for it. The repository itself works fine, but the crawl indexed about 20000 documents and reported errors for 600 documents.
    The errors are of two kinds:
    1. Crawler error
    2. TREX preparation error
    When I search for the indexed pages I get search results, but when I click on the HTML version I get the error message 'No index service found'.
    Any suggestions?
    (Points are assured ..)
    Thanks & Regards,
    Amit Kade

    Hi Tamil,
    Thanks a lot for the help! But I have already set everything up correctly as per the SAP help.
    TREX is indexing documents but, as I mentioned, not all of them, and I cannot view the found documents in HTML format.
    One observation: it is not indexing large documents (greater than 10 - 12 KB).
    Any suggestions?
    Thanks & Regards,
    Amit Kade

  • Simple Web and regular Web Repository errors... Need Help

    Hello all!
    We're running into a problem with Simple Web and regular Web Repository managers in our QA portal environment.  These repository managers work very well in our Dev portal, but now that we're trying to implement them in QA, of course they aren't working.  :-\
    I set up a specific web site in our Web Repository manager.  However, when I navigate to the repository's folder in KM, it just hangs. When I go to the Component Monitor and look at the Web Repository manager, the server is in red. The Status is "startup failed" and the specific error is:
    "2008-01-07T19:47:48Z: GET /WhitePages/Spider: com.sapportals.wcm.WcmException: sending request to: http://hostname.company.com/ request uri: /WhitePages/Spider Connection timed out:could be due to invalid address (java.net.SocketException: Connection timed out:could be due to invalid address)"
    When I try to set up the same web site in the Simple Web Repository Manager, I receive the same error.  However, we shouldn't need a proxy configured because this is an internal web site.
    This works in our Dev environment, so what configuration should I be looking at specifically?  Any tips from anyone?
    Thanks in advance to everyone!
    Fallon

    Hi,
    As far as I know, problems of this type are common with repository managers:
    they work on one system and not on the other.
    Possible causes are connectivity and access issues.
    Since you say it is an internal website, please check the connectivity between the QA system and the web server.
    Check your repository manager's configuration once again in System Administration.
    The last option is to create a new Web RM in QA.
    I know it is irritating, but it doesn't take much time.
    I hope this helps.
    Regards,
    Sumit
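
The connectivity check suggested above can be run from the QA host itself, outside the portal. The following is a minimal sketch using only the Python standard library; the function name is made up for illustration. It reports whether a plain TCP connection to a host and port succeeds within a timeout, and the self-test uses a throwaway local listener so that it runs anywhere.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return (True, "") if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:  # covers timeouts, refused connections, DNS failures
        return False, f"{type(exc).__name__}: {exc}"


# Self-test against a throwaway local listener.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
print(can_connect("127.0.0.1", port))               # listener is up
listener.close()
print(can_connect("127.0.0.1", port, timeout=1.0))  # listener is gone
```

On the QA server one would call something like can_connect("hostname.company.com", 80), where port 80 is an assumption based on the plain http:// URL in the error message. A timeout there but not on the Dev server would point to routing or firewall differences rather than to the repository manager configuration.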

  • Crawling web repository

    Hello,
    I have configured a simple web repository and I am able to search on it. To crawl this repository I first used the standard crawler, but it crawled only the first page. I then created a new crawler in which I explicitly specified the crawl depth, but that doesn't work either. Please suggest how to crawl through all the pages.
    Note that our web repository is on the local intranet and there is no restriction on crawling it.

    hi,
    go through these links.. this will help you..
    http://help.sap.com/bp_epv260/EP_JA/documentation/How-to_Guides/27_WebRepEn.pdf
    http://help.sap.com/saphelp_nw04s/helpdata/en/46/5d5040b48a6913e10000000a1550b0/content.htm
    http://download.microsoft.com/download/e/c/c/ecc8b7a5-ddea-4f4e-a6c6-5e96dc1a2908/06_Application%20Interop%200%20-%20MOSS%20and%20Biztalk.pdf
    http://blogs.msdn.com/saptech/archive/2007/02/15/enterprise-search-combining-sap-and-microsoft-realtime-crawling-results.aspx
    reward me points..

  • Error when indexing web repository

    I'm working on a problem that I'm having with indexing a web repository. For the sake of this post, we will call the web site that I'm indexing for the repository http://mysite1.com. For the most part, things are working just fine. The problem is that there's a couple of links in one of the pages in http://mysite1.com that aren't getting crawled.
    The first link is http://mysite2.com. This link is to a web site that is on our network, but you are normally required to provide a username and password to access it. The message in the crawler error file that's being generated is:
    ERROR     Mar 27, 2009 8:04:02 AM     /webdynamic/mysite2.com     http://mysite2.com/     processing failed     com.sapportals.wcm.repository.AuthorizationRequiredException     
    I created an HTTP System in the System Landscape Definitions for http://mysite2.com, and here's what it looks like:
    Description:         mysite
    Same User Domain:    <unchecked>
    Max Connections:     0
    Password:            <set to the password for the user>
    Server Aliases:      <blank>
    Server URL:          http://mysite2.com
    User:                myuser
    I have verified that the username and password that I have configured here are valid. I have also set up a web site definition for this, and here's what it looks like:
    Login Timeout:       <blank>
    System ID:           mysite.com
    All the rest of the options for the web site are blank.
    What else do I need to do to get the crawler to access the content of http://mysite2.com?
    The other link that I'm getting errors on is http://mysite3.com. The error in the crawler error file is:
    ERROR     Mar 27, 2009 8:04:01 AM     /webdynamic/mysite3.com     http://mysite3.com/     processing failed     com.sapportals.wcm.repository.TimeExceededException: request to /: Read timed out
    This site is accessible both internally and externally to our network. I'm not sure what I need to do for this. Can anyone help me out with this?
    Thanks!
    -Stephen Spalding
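
To separate portal configuration problems from problems with the target sites themselves, it can help to send the same request from outside the portal. The sketch below assumes the protected site uses HTTP Basic authentication, which the post does not confirm; the password is a placeholder and the function names are made up. It only builds the request here; fetch_status would actually send it, with an explicit read timeout.

```python
import base64
import urllib.request


def build_request(url, user, password):
    """Build a GET request carrying HTTP Basic credentials (assumed scheme)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch_status(request, timeout=30.0):
    """Send the request and return the HTTP status code.

    Raises urllib.error.HTTPError on 4xx/5xx responses and
    urllib.error.URLError or a timeout error on network problems.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder credentials; only the header is built here, nothing is sent.
req = build_request("http://mysite2.com/", "myuser", "secret")
print(req.get_header("Authorization"))
```

If fetch_status(req) succeeds from the portal host, the credentials and the site are fine and the HTTP system or web site definition is the place to look. If it also times out for http://mysite3.com/, the read timeout is more likely a network or proxy issue than a crawler setting.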

    Hi Esther Schmitz,
    Thanks for quick reply. As you said, i have changed website url to http://www.cnn.com.
    but still it shows below error messages.
    The target of the link TECH you tried to navigate to is not available. Its repository might be disconnected or the target may have been renamed, moved, or deleted. Contact your system administrator if you think the target /CNN/TECH/TECH should be available.
    Thanks,
    Satya

  • Index and crawler not working on Web Repository

    Hi Team,
    I'm trying to set up a Web Repository and crawl it for indexing. I've followed the steps from an SAP "How-To" document, but I guess the problem might be the way I'm configuring the web site in EP. I've created a virtual directory on my laptop's IIS 5.0 web server, and the URL of the web site has been set to http://laptop-ashishk/myWebSite.
    Do I need to set the START PAGE as /index.html (as per the spec it says it's not mandatory)...
    Let me know whether you need any information with regards to this problem.
    Ashish

    They've set:
    meta name="viewport" content="initial-scale=2.3, user-scalable=no"
    It's the user-scalable setting that's the problem. Apple considers the default (per their web coding rules at http://developer.apple.com/iphone/designingcontent.html) to be yes.
    I've noticed the same thing.
    Aym

  • After upgrade SP-Crawl Error: The SharePoint item being crawled returned an error when attempting to download the item.

    Hi All - After the upgrade, I am getting an SP-Crawl error for certain links. I checked that the crawl component has the proper permissions.
    Google turns up articles such as
    http://blog.karstein-consulting.com/2012/04/20/error-in-crawl-log-the-sharepoint-item-being-crawled-returned-an-error-when-attempting-to-download-the-item/
    but I am not sure whether that resolution applies to 2010, 2013, or both.
    I checked the Registry Editor but couldn't find 14.0 under Office Server.
    Any clue?
    Regards,
    Khushi

    I checked the web application policy: the search crawl account has Full Read permission.
    Crawl
    Fiddler
    Log Error referring the Correlation ID
    01/06/2014 13:05:06.14  w3wp.exe (0x1698)  0x0118  SharePoint Foundation  Monitoring  nasq  Medium  Entering monitored scope (Request (GET:/sites/HR/Shared%20Documents/Benefits/Insurance%20Benefits/Life%20Insurance/Basic%20Life%20and%20ADD)). Parent No
    01/06/2014 13:05:06.14  w3wp.exe (0x1698)  0x0118  SharePoint Foundation  Logging Correlation Data  xmnv  Medium  Name=Request (GET:<SiteURL>/sites/HR/Shared%20Documents/Benefits/Insurance%20Benefits/Life%20Insurance/Basic%20Life%20and%20ADD) e8b1679c-0476-70d4-9fcd-2cef5be44461
    01/06/2014 13:05:06.14  w3wp.exe (0x1698)  0x0118  SharePoint Foundation  Authentication Authorization  agb9s  Medium  Non-OAuth request. IsAuthenticated=True, UserIdentityName=, ClaimsCount=0 e8b1679c-0476-70d4-9fcd-2cef5be44461
    01/06/2014 13:05:06.15  w3wp.exe (0x1698)  0x1738  SharePoint Foundation  General  af71  Medium  HTTP Request method: GET e8b1679c-0476-70d4-9fcd-2cef5be44461
    01/06/2014 13:05:06.15  w3wp.exe (0x1698)  0x1738  SharePoint Foundation  General  af75  Medium  Overridden HTTP request method: GET e8b1679c-0476-70d4-9fcd-2cef5be44461
    01/06/2014 13:05:06.15  w3wp.exe (0x1698)  0x1738  SharePoint Foundation  General  af74  Medium  HTTP request URL: /sites/HR/Shared%20Documents/Benefits/Insurance%20Benefits/Life%20Insurance/Basic%20Life%20and%20ADD e8b1679c-0476-70d4-9fcd-2cef5be44461
    01/06/2014 13:05:06.17  w3wp.exe (0x1698)  0x1738  SharePoint Foundation  Files  aise3  Medium  Failure when fetching document. 0x80070090 e8b1679c-0476-70d4-9fcd-2cef5be44461
    01/06/2014 13:05:06.17  w3wp.exe (0x1698)  0x1960  SharePoint Foundation  Monitoring  b4ly  Medium  Leaving Monitored Scope (Request (GET:<SiteURL>/sites/HR/Shared%20Documents/Benefits/Insurance%20Benefits/Life%20Insurance/Basic%20Life%20and%20ADD)). Execution Time=20.5461867360237 e8b1679c-0476-70d4-9fcd-2cef5be44461
    01/06/2014 13:05:06.17  w3wp.exe (0x1698)  0x1960  SharePoint Foundation  Monitoring  b4ly  Medium  Leaving Monitored Scope (Request (GET:<SiteURL>/sites/HR/Shared%20Documents/Benefits/Insurance%20Benefits/Life%20Insurance/Basic%20Life%20and%20ADD)). Execution Time=29.917489513332 e8b1679c-0476-70d4-9fcd-2549ba3ee9d4
    01/06/2014 13:05:06.17  w3wp.exe (0x1698)  0x1960  SharePoint Foundation  Monitoring  nasq  Medium  Entering monitored scope (Request (GET:<SiteURL>nsurance%20Benefits%2fLife%20Insurance%2fBasic%20Life%20and%20ADD&FolderCTID=0x01200039DA632EEACF264685CF39D68A18F7C8)). Parent No
    Any clue?
    Regards,
    Khushi
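
The one failing entry in the log above is "Failure when fetching document. 0x80070090". Such values follow the standard 32-bit HRESULT layout (failure bit, facility, code), so they can be split into searchable parts. A small sketch of that arithmetic follows; the function name is made up.

```python
def decode_hresult(value):
    """Split a 32-bit HRESULT into its failure flag, facility and code fields."""
    value &= 0xFFFFFFFF
    return {
        "failure": bool(value >> 31),       # bit 31: severity (1 = failure)
        "facility": (value >> 16) & 0x7FF,  # bits 16-26: facility
        "code": value & 0xFFFF,             # bits 0-15: error code
    }


print(decode_hresult(0x80070090))
```

This gives facility 7 and code 144. Facility 7 is the Win32 facility, so the code can be looked up in the Win32 system error code list, which is usually a better search term than the raw hex value.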

  • Newly created Web repository not showing up in explorer

    I'm trying to create an index which would finally enable me to search the 'CNN' website (following the 'how to' document : 'HOW TO SET UP A WEB REPOSITORY AND CRAWLING IT FOR INDEXING'). I'm unable to assign the web repository which I  created in an earlier step to the index because it doesn't show up in the repository/folder listing. I can't even see it under 'Content Administration -> KM Content -> Repositories'.
    What exactly am I doing wrong?!
    thanks,
    Biju.

    Hi Karsten,
    - We're on EP6 SP2.
    - Yes, I'm referring to the document you mentioned.
    - Yes, the repository does show up with a green light under KM -> Component Monitor.
    BTW, I was able to bring up the web repository now that I've specified under 'System Admin -> System Config -> Service Config -> Applications -> com.sap.portal.ivs.httpservice -> Services -> proxy -> HTTP-Bypass Proxy Servers' that everything under 'mycompany.com' must be bypassed. I'd originally specified the proxy setup in the TREX configuration, which runs on a different (physical) server than the portal.
    I don't really get the connection, but essentially took the cue from the earlier reply I got.
    So, in short, I think the problem is solved at least for the moment.
    Thanks again for your help.
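
For readers wondering what "everything under 'mycompany.com' must be bypassed" means in practice: bypass lists are commonly evaluated as a domain-suffix match. Whether the portal's HTTP service matches exactly this way is an assumption, and the function below is purely illustrative, not SAP code.

```python
def bypasses_proxy(host, bypass_domains):
    """True if host equals a bypass domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    for domain in bypass_domains:
        domain = domain.lower().lstrip("*.")  # accept "*.x.com" and ".x.com" too
        if host == domain or host.endswith("." + domain):
            return True
    return False


bypass = ["mycompany.com"]
print(bypasses_proxy("intranet.mycompany.com", bypass))  # internal host
print(bypasses_proxy("www.cnn.com", bypass))             # external host
```

Under that reading, internal hosts are contacted directly by the portal while external sites such as CNN still go through the proxy, which would explain why the portal-side setting mattered and the TREX-side proxy setting alone did not.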

  • Tag base in HTML page from web repository

    We are using EP 6.0 SP10 with KMC SP10 on host http://ep60:55000/. We want to configure a web repository for http://example.com.
    We have created the HTTP system "ExampleSite" in the KM system landscape for the web site "http://example.com", created the web site "Example" in the KM landscape, and configured the web repository manager "WebSystems" with prefix "/websys" for the web site "Example" with the property "External Server URI Handling"="rewrite".
    We created a URL iView on the link http://ep60:55000/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/websys/Example.
    The problem is the following:
    There is no <base> tag in the page content when we open http://example.com directly in a browser.
    When we preview the URL iView, the page opens, but the tag <base href=http://example.com> is present in the HTML page content.
    Therefore all relative links point to http://example.com rather than to http://ep60:55000/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/websys/Example, as we need.
    Which setting (in the web repository, the HTTP system, or anywhere else) disables the addition of the <base> tag to the page content?
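
The effect of the <base> tag described above can be reproduced with standard URL resolution. In this sketch the relative link "news/index.html" is a made-up example, and a trailing slash has been added to the portal URL so that "Example" is treated as a folder; both are assumptions for illustration.

```python
from urllib.parse import urljoin

PORTAL_BASE = ("http://ep60:55000/irj/servlet/prt/portal/prtroot/"
               "com.sap.km.cm.docs/websys/Example/")
SITE_BASE = "http://example.com/"

relative_link = "news/index.html"  # hypothetical relative link in the page

# With <base href=http://example.com> the browser resolves against the site:
print(urljoin(SITE_BASE, relative_link))
# Without a <base> tag it resolves against the URL the page was loaded from:
print(urljoin(PORTAL_BASE, relative_link))
# Root-relative links ignore the repository prefix in either case:
print(urljoin(PORTAL_BASE, "/news/index.html"))
```

The last line shows a related pitfall: even without the <base> tag, links that start with "/" would resolve to the portal host root rather than to the web repository path.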

    Hi all. Sorry for the late reply.
    I am now trying to use the "HTML Filter" (from the repository filters in the Content Management Configurator) to reset the base tag.
    Unlike the "HTML Stylesheet Filter", which works fine (the LINK tag appears in the content of HTML repository documents),
    the "HTML Filter" doesn't work (the browser displays "500 internal error"). This error occurs when we use the "HTML Filter" for a native CM repository (for example: documents).
    I found the following error in the knowledgemanagement.0.log file:
    #1.5#C000AC140312000500000002000012240003F737312F4F50#1116241350578#com.sapportals.wcm.protocol.webdav.server.WDGetHandler#irj#com.sapportals.wcm.protocol.webdav.server.WDGetHandler#System#0#####Client_Thread_7##0#0#Error##Plain###Exception while applying filters: null - com.sapportals.wcm.repository.ResourceException: Exception while applying filters: null
    Displaying pages from repository without "HTML Filter" works fine.
    My filter has following properties:
      Base Tag      = http://test.base.tag/
      Extensions    = htm
      MIME Types    = text/html
      Paths         = /**
      Active        = checked
      Priority      = 99
      Repositories  = documents
    I am not using the properties "Base URL" and "URL Handler Class".
    What is the reason for the error?
    Did anybody find a solution to remove or reset the base tag?

  • Subscription for a web repository

    Hi Experts ,
    I am using EP 2004s. I have created a web repository and activated the subscription service for it.
    Unfortunately, I am not receiving any subscription notification emails for it, although I do receive subscription emails for the repositories created inside the portal.
    As I understand the SAP help, for a web repository we should receive subscription notification emails once the crawler detects changes.
    I have scheduled the crawler accordingly so that it can detect the changes.
    Can anybody throw some light on the basis on which the crawler decides that a document has been modified?
    Any advice is welcome; points are assured.
    Thanks & Regards,
    Amit Kade

    Hi Amit
    The following snippet is taken from the description of Subscription Event Mapping:
    Subscription Event Mapping
    A subscription event mapping maps events sent by a repository manager onto events that are meaningful with respect to subscriptions to resources...
    As you can see, the Subscription Event Mapping handles events sent by a repository manager, but since no events can be sent from a Web Repository (as far as I know), no mapping actually takes place, and thus it does not work.
    I was once told by SAP support that "Send events" does not work for a WebDAV repository unless the change/event was triggered from within the KM framework. Since a web repository isn't more closely integrated into KM, my guess is that a web repository does not support "Send events" triggered from outside the portal either.
    By "custom built" I mean that "Send events" (and thus notification emails) might work if you develop a custom web repository manager that provides very close integration between a website and the portal's KM.
    Kind regards,
    Martin
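
How the SAP crawler decides that a document has changed is not stated in this thread. As a generic illustration only, and not a description of the KM crawler, a component that cannot receive events typically detects changes by comparing snapshots between two crawls, for example by content hash. All names below are made up.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable fingerprint of a document's bytes."""
    return hashlib.sha256(content).hexdigest()


def changed_documents(previous, current):
    """Compare two {url: fingerprint} snapshots; return new or changed URLs, sorted."""
    return sorted(url for url, digest in current.items() if previous.get(url) != digest)


# Two hypothetical crawl runs over the same site.
first_crawl = {"/a.html": fingerprint(b"one"), "/b.html": fingerprint(b"two")}
second_crawl = {
    "/a.html": fingerprint(b"one"),
    "/b.html": fingerprint(b"two, edited"),
    "/c.html": fingerprint(b"three"),
}
print(changed_documents(first_crawl, second_crawl))
```

A custom repository manager of the kind Martin describes would need some mechanism like this to turn "the page looks different than last time" into an event that the subscription service can map.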

  • The SharePoint item being crawled returned an error when attempting to download the item.

    I get this error when I run full crawl.
    The SharePoint item being crawled returned an error when attempting to download the item.
    I did make sure that the crawl account has full read access to the web application. I am able to access the page and site just fine from a browser. I am not sure what is causing the problem.

    On closely looking at ULS logs, this is what I am noticing....
    CSTS3Accessor::Init: SharePointError found in URL
    http://xxxxx/default.aspx header value 0, hr=8004FD0F  [sts3acc.cxx:566]  d:\office\source\search\native\gather\protocols\sts3\sts3acc.cxx 
    CSTS3Accessor::Init fails, Url sts4://xxxx/siteurl=/siteid={c011d5c9-9019-41e8-bf5a-5176739b2b79}/weburl=xxxxx/webid={e42001c0-2ef1-4a8e-8fe5-361d72fa60c2}, hr=8004FD0F  [sts3handler.cxx:312]  d:\office\source\search\native\gather\protocols\sts3\sts3handler.cxx 
    CSTS3Handler::CreateAccessorExD: Return error to caller, hr=8004FD0F            [sts3handler.cxx:330]  d:\office\source\search\native\gather\protocols\sts3\sts3handler.cxx 

  • Web Repository clarifications

    Hi,
    I am an SAP newbie.
    I am creating a web repository in KM. This is the scenario I want:
    I want all HTML pages contained within a website (i.e. http://www.xyz.com) to be stored in the repository.
    So I created an HTTP system for http://www.xyz.com and then configured a Web Site with that HTTP system, with a start page of /main/index.html and a system path of /main.
    I then created an index and so forth. However, when I try to access the repository, I can only access the start page.
    Any pointers? Or is my concept of the repository incorrect?
    Points will be generously awarded.
    Charles

    Thanks for your reply.
    Yes, I have followed all the steps in the link.
    I have created an HTTP system with server URL http://www.xyz.com, and a Web Site for that system (with the same system ID) with /main/index.html as the start page and /main as the system path. I then configured a cache and created a Web repository manager with the Web Site and cache I had just created.
    I gather that all documents of the website whose URL begins with http://www.xyz.com/main will be stored in the web repository once I have completed the steps above?
    However, when I try to access the web repository via KM Content, I can only see the Web Site, and when I click on it, I arrive at the start page http://www.xyz.com/main/index.html. How can I make all the resources beginning with http://www.xyz.com/main appear in the web repository? I have created an index and configured the crawler, but the result is still the same.
    And what's the difference between a Simple and a Standard Web repository? I have read the SAP documentation and still don't get it.
    Thanks, points will be generously awarded.
    Charles
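
The expectation in this post, that everything whose URL begins with http://www.xyz.com/main belongs to the repository, can be written down as a simple prefix test. Whether the Web Site's "system path" is evaluated exactly like this is an assumption; the sketch only makes the expectation precise, and the function name is made up.

```python
from urllib.parse import urlparse


def in_scope(url, server_url, system_path):
    """True if url is on the same host as server_url and under system_path."""
    target, server = urlparse(url), urlparse(server_url)
    base = system_path.rstrip("/")
    under_path = target.path == base or target.path.startswith(base + "/")
    return target.netloc == server.netloc and under_path


server, path = "http://www.xyz.com", "/main"
print(in_scope("http://www.xyz.com/main/index.html", server, path))
print(in_scope("http://www.xyz.com/other/page.html", server, path))
print(in_scope("http://www.xyz.com/mainframe/x.html", server, path))
```

Note that such a scope only says which pages may belong to the repository. Pages still have to be reachable by links from the start page before browsing or crawling can discover them.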

  • How to see properties in details dialogue in a web repository

    Hi everyone
    I have set up a web repository that uses a number of web property extractors to extract content from meta tags into predefined properties. The funny part is that I can use these predefined properties in my search scenario, but I'm not able to see the properties in my details > properties dialogue. Is there any special configuration that needs to be done in order to see these properties when one is using a Web Repository Manager?
    I have configured a property group in the "Property Structures" configuration that contains all the predefined properties, and added this group to "all_groups". But I still don't see the predefined properties, even though they must contain something, since I can use them in my search. Just to be specific: I cannot see the predefined properties at all; it's not just the content that is missing.
    Any help will be rewarded!
    Best regards,
    Martin Søgaard

    hi sap2008,
    Only an ABAP developer can help you with this.
    Refer to this:
    http://abaplog.wordpress.com/2007/07/23/displaying-sap-error-messages-in-a-nice-way/
    kaustubh

  • Web Repository Manager and robots.txt

    Hello,
    I would like to search an intranet site and therefore set up a crawler according to the guide "How to set up a Web Repository and Crawl It for Indexing".
    Everything works fine.
    Now this web site uses a robots.txt as follows:
    User-agent: googlebot
    Disallow: /folder_a/folder_b/
    User-agent: *
    Disallow: /
    So obviously, only google is allowed to crawl (parts of) that web site.
    My question: If I'd like to add the TRex crawler to the robots.txt what's the name of the "User-agent" I have to specify here?
    Maybe the name I defined in the SystemConfiguration > ... > Global Services > Crawler Parameters > Index Management Crawler?
    Thanks in advance,
    Stefan

    Hi Stefan,
    I'm sorry but this is hard coded. I found it in the class : com.sapportals.wcm.repository.manager.web.cache.WebCache
    private HttpRequest createRequest(IResourceContext context, IUriReference ref) {
        HttpRequest request = new HttpRequest(ref);
        // Default User-Agent; it is only replaced if the session watcher supplies one.
        String userAgent = "SAP-KM/WebRepository 1.2";
        if (sessionWatcher != null) {
            String ua = sessionWatcher.getUserAgent();
            if (ua != null) {
                userAgent = ua;
            }
        }
        request.setHeader("User-Agent", userAgent);
        Locale locale = context.getLocale();
        if (locale != null) {
            request.setHeader("Accept-Language", locale.getLanguage());
        }
        return request;
    }
    So you would have to recompile the component or change the filter... I would prefer to change the robots.txt.
    hope this helps,
    Axel
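
To see what a robots.txt change would do for the default User-Agent string quoted above, the rules can be evaluated offline with Python's standard robots.txt parser. This only shows how that parser reads the file. Whether the KM crawler evaluates robots.txt at all, and which token it would match, is not confirmed in this thread; the added "SAP-KM" group is a hypothetical example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /folder_a/folder_b/
User-agent: *
Disallow: /
"""

SAP_AGENT = "SAP-KM/WebRepository 1.2"  # default value from the code above


def allowed(robots_txt, agent, path):
    """Ask the standard-library parser whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(allowed(ROBOTS_TXT, "googlebot", "/index.html"))
print(allowed(ROBOTS_TXT, "googlebot", "/folder_a/folder_b/x.html"))
print(allowed(ROBOTS_TXT, SAP_AGENT, "/index.html"))

# Hypothetical extra group that lets the SAP agent in (empty Disallow = allow all).
EXTENDED = "User-agent: SAP-KM\nDisallow:\n" + ROBOTS_TXT
print(allowed(EXTENDED, SAP_AGENT, "/index.html"))
```

Python's parser compares only the part of the User-Agent before the first "/", in lower case, which is why the group name "SAP-KM" is used here. Other robots.txt implementations may match differently.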

  • Web repository manager

    Hi,
    I am working with NW04S.
    I am facing 2 issues which are related with the web repository manager.
    1. When we create a web repository manager, we should be able to see it under Content Management -> KM Content. When we choose the web repository, we should be able to see the link to the website that we configured.
    The issue I am facing is that I cannot see this link, although my web repository manager is visible in KM Content.
    I am able to see the links for the web sites in the web repository managers that I created previously.
    I have done all the configuration according to the config guide: I created the html system, website, html property extractor, cache, and then the web repository manager.
    2. When I went back to check how I had configured the older web repository managers, I found that only the ones I created recently were present. The very old ones were missing, yet they are still visible under KM Content.
    Is there some place where these are archived?
    Could you please help me with this?
    Best Regards,
    Vidhya

    Hi,
    I checked some other posts on the forum and found that I had to check the Component Monitor, so I did.
    It gives me the following error:
    2007-04-30T03:55:33Z: GET /: com.sapportals.wcm.WcmException: sending request to: http://www.yahoo.com/ request uri: / unable to connect to www.yahoo.com: unknown host: www.yahoo.com (java.net.UnknownHostException: www.yahoo.com)
    I have tried the same with cnn.com as well.
    Could someone tell me what I should do?
    regards,
    Vidhya
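
An UnknownHostException means the portal server could not resolve the host name at all, so the first thing to verify is name resolution on that server. A minimal sketch using the Python standard library (the function name is made up):

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses the local resolver gives, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


# A literal IP address always "resolves" to itself, without any DNS lookup.
print(resolve("127.0.0.1"))
```

If resolve("www.yahoo.com") returns an empty list when run on the portal host, the server has no direct DNS access to the internet. In that case external sites are probably only reachable through a proxy configured for the portal, as discussed in the proxy thread above.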
