Robots.txt - default setup
Hey!
Since I'm using iWeb to create my websites, I know I have to set up robots.txt for SEO.
I have made several sites: one for a restaurant, one about photography, one personal, etc.
There is nothing I want to "hide" from Google's robots on those websites.
So my question is:
When we create and publish a website, is there at least a default setup for robots.txt?
For example:
Website is parked in folder: public_html/mywebsitefolder
Inside the mywebsitefolder folder I have:
/nameofthewebsite
/cgi-bin
/index.html
The structure is the same for all websites created with iWeb, so what should we put in robots.txt by default?
Of course, that's assuming you don't want to hide any of the pages or content.
Azz.
If you don't want to stop the bots crawling any folder, don't bother with one at all.
The robots.txt should go in the root folder, since the crawler looks for....
http://www.domain-name.com/robots.txt
If your site files are in a sub folder the robots.txt would be like...
User-agent: *
Disallow: /mywebsitefolder/folder-name
Disallow: /mywebsitefolder/file.file-extension
To allow all access...
User-agent: *
Disallow:
I suppose you may want to use robots.txt if you want to allow/disallow one particular bot.
Similar Messages
-
Question about robots.txt
This isn't something I've usually bothered with, as I always thought you didn't really need one unless you wanted to disallow access to pages / folders on a site.
However, a client has been reading up on SEO and mentioned that some analytics thing (possibly Google) was reporting that "one came back that the robot.txt file was invalid or missing. I understand this can stop the search engines linking in to the site".
So I had a rummage, and uploaded what I thought was a standard enough robots.txt file :
# robots.txt
User-agent: *
Disallow:
Disallow: /cgi-bin/
But apparently this is reporting :
The following block of code contains some errors. You specified both a generic path ("/" or empty disallow) and specific paths for this block of code; this could be misinterpreted. Please remove all the reported errors and check this robots.txt file again.
Line 1
# robots.txt
Line 2
User-agent: *
Line 3
Disallow:
You specified both a generic path ("/" or empty disallow) and specific paths for this block of code; this could be misinterpreted.
Line 4
Disallow: /cgi-bin/
You specified both a generic path ("/" or empty disallow) and specific paths for this block of code; this could be misinterpreted.
If anyone could set me straight on how a standard / default robots.txt file should look like, that would be much appreciated.
Thanks.
Remove the blank disallow line so it looks like this:
User-agent: *
Disallow: /cgi-bin/
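If you want to double-check a robots.txt before uploading it, Python's standard library ships a parser. A quick sketch, with the corrected file contents inlined as a string for illustration:

```python
import urllib.robotparser

# The corrected robots.txt from above, inlined as a string
rules = """User-agent: *
Disallow: /cgi-bin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "/index.html"))       # True: not disallowed
print(rp.can_fetch("*", "/cgi-bin/form.pl"))  # False: under /cgi-bin/
```

If the file were malformed in a way the parser can't reconcile, the answers here would come out wrong, which is a cheap way to catch mistakes before a validator (or Google) does.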
E. Michael Brandt
www.divahtml.com
www.divahtml.com/products/scripts_dreamweaver_extensions.php
Standards-compliant scripts and Dreamweaver Extensions
www.valleywebdesigns.com/vwd_Vdw.asp
JustSo PictureWindow
JustSo PhotoAlbum, et alia -
Hi,
Has anyone created a robots.txt file for an external plumtree portal??
The company I work for is currently using PT 4.5 SP2 and I'm just wondering what directories I should disallow to prevent spiders etc. from crawling certain parts of the web site. This will help improve search results on search engines.
See http://support.microsoft.com/default.aspx?scid=kb;en-us;217103
The robots.txt file lives at the root level of the server where your web pages are. What is the URL of your website?
-
Placement of robots.txt file
Hi all,
I want to disallow search robots from indexing certain directories on a MacOS X Server.
Where would I put the robots.txt file?
According to the web robots pages at http://www.robotstxt.org/wc/exclusion-admin.html it needs to go in the "top-level of your URL space", which depends on the server and software configuration.
Quote: "So, you need to provide the "/robots.txt" in the top-level of your URL space. How to do this depends on your particular server software and configuration."
Quote: "For most servers it means creating a file in your top-level server directory. On a UNIX machine this might be /usr/local/etc/httpd/htdocs/robots.txt".
On a MacOS X Server would the robots.txt go into the "Library" or "WebServer" directory or somewhere else?
Thanxx
monica
G5 Mac OS X (10.4.8)
The default document root for Apache is /Library/WebServer/Documents, so your robots.txt file should be at /Library/WebServer/Documents/robots.txt
-
Problems with robots.txt Disallow
Hi
I have a problem with the robots.txt and google.
I have this robots.txt file:
User-agent: *
Disallow: page1.html
Disallow: dir_1/sub_dir_1/
Disallow: /data/
When I enter 'site:www.MySite.com' into the Google search box,
Google returns content from the 'data' directory as well. Google
should not have indexed the content of the data directory.
So why is Google returning results from the 'data' directory
when I have disallowed it?
How can I restrict everyone from accessing the data directory?
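Two things are worth checking here. In the file quoted above, 'page1.html' and 'dir_1/sub_dir_1/' have no leading slash, and robots.txt paths are matched as prefixes of the URL path from the site root, so those two rules likely never match anything. A corrected version would look like:

User-agent: *
Disallow: /page1.html
Disallow: /dir_1/sub_dir_1/
Disallow: /data/

Also note that robots.txt only stops crawling; URLs Google already knows about can linger in the index for a while, so you may also need to request removal through Google's webmaster tools.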
Thanks
I found a workaround. To have the sitemap URL linked to the pub page, the pub page needs to be in the Internet zone. If you need the sitemap URL linked to the real internet address (e.g. www.company.example.com), you need to put the auth page in the default zone, the pub
page in the intranet zone, and create an AAM http://company.example.com in the internet zone. -
Can I change Address Book's default setup?
I want to import my entire address book database from Palm Desktop to Apple Address Book (only about 5,000 entries in total).
I have several fields in the Palm Desktop database that don't match the Address Book default setup. In Address Book, I went to Card> Add Field> Edit Template but the changes I made there to the Template didn't actually seem to change anything in the actual Address Book display of addresses, even though I made several changes in the Edit Template function - it was as though the changes I made in the template editing function had no effect at all. The changes I made are still there when I revisit the Edit Template function but the additions and alterations don't show in the final address display.
1. Where did my changes to that Template go in respect to the actual display of the addresses?
2. Is there some way to apply my edited template to all the entries in Address Book in one fell swoop?
3. If not, then why have the Edit Template function at all?
I've done some trial runs using a tab-delimited file exported from Palm Desktop, which Address Book seems to import and work with happily... except for this inability to choose my edited template for the new entries - Address Book forces the new database into Address Book's own Template with none of my chosen changes. It's as though the template edit function is just for show but with no real functionality.
I hope this is all making sense...
This is a great time for me to make sweeping changes if necessary (if it is possible or if it helps), since I'm beginning a totally new database in Address Book from scratch.
Many thanks for any help.
Paul,
I'm afraid you will lose some of the data you can store in your Palm when you transfer to Address Book, but you can try to get as much in as possible. The custom fields especially are much more flexible in Palm.
But in the end, the full integration of AB with your Mail and Calendar, backup to .Mac etc. may offset these drawbacks.
When you set up a template, did you actually complete one address record?
If not, you should.
When you want to import to AB from a tab-delimited file, try this program
Addressbook Importer, an excellent piece of freeware that will do this for you.
In the top row before you import, make sure you have an address record that has all your customized fields completed in a way you know which field that is (you can do this in Palm, using the name A. AAA, address etc.)
Now when you use the importer, you can make sure which field goes where.
Backup your Palm data well before you start experimenting. I wouldn't want to be the cause of you dropping 5000 records. Also, disable all syncing when you start. Once it works, make a backup from within the AB application. Then when all looks OK, you can start syncing using iSync instead of Hotsync. -
Defaulting setup in Oracle Order Management
Hello Guys,
I am trying to add SHIP_FROM_ORG in the defaulting setup for the line level of OM. Could you guys give me a clue? That would be great.
Problem ---> Opening the SO form in OM, it throws a note: "Cannot get Valid Name for - Ship_from_org"
That's why I am trying to set that field in the defaulting setup.
Thanks
Vinoth
The API for Quotes in Order Management is OE_ORDER_PUB.PROCESS_ORDER; note this is different from the CRM ASO_QUOTE_PUB.
This is the same as for a regular Order, except you populate TRANSACTION_PHASE_CODE = 'N'.
To create an Order from the Sales Quote you may have to fire 2 workflows, OE_NEGOTIATE_WF.Submit_Draft and OE_NEGOTIATE_WF.Customer_Accepted, prior to OE_Order_Book_Util.Complete_Book_Eligible.
Hope this helps someone -
Robots.txt and duplicate content - I need help
Hello guys, I'm new to BC and I have 2 questions.
1. My start page is available as xxxx.de
and xxxx.de/index.html
and xxx.de/index.aspx
How can I fix this duplicate content?
2. Where do I have to upload the robots.txt?
THX
As long as you do not link to the other versions inconsistently, you do not need to worry about your start page.
-
Use of robots.txt to disallow system/secure domain names?
I've got a client whose system and secure domains are ranking very high on Google. My SEO advisor has mentioned that a key way to eliminate these URLs from Google is to disallow the content through robots.txt. Given BC's unique handling of system and secure domains, I'm not too sure if this is even possible, as any disallowances I've seen or used before have been directories and not absolute URLs, nor have I seen any mention of this possibility around. Any help or advice would be great!
Hi Mike
Under Site Manager > Pages, when accessing a specific page, you can open the SEO Metadata section and tick “Hide this page for search engines”
Aside from this, using the robots.txt file is indeed an efficient way of instructing search engine robots which pages are not to be indexed. -
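One caveat worth adding: a robots.txt only applies to the host it is served from, so keeping the system/secure domain out of the index means serving a robots.txt on that domain itself. Whether BC lets you place a file there is the open question; this is just what you would want that domain to serve, assuming the main site keeps its own separate robots.txt:

User-agent: *
Disallow: /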
Robots.txt and Host Named Site Collections (SEO)
When attempting to exclude ALL SharePoint Sites from external indexing, when you have multiple web apps and multiple Host Named Site Collections, should I add the robots.txt file to the root of each web app, as well as each hnsc? I assume so, but, thought
I would check with the gurus...
- Rick
I think one for each site collection, as each site collection has a different name and is treated as a web site.
"The location of robots.txt is very important. It must be in the main directory, because otherwise user agents (search engines) will not be able to find it. Search engines look first in the main directory (i.e. http://www.sitename.com/robots.txt)
and if they don't find it there, they simply assume that this site does not have a robots.txt file."
http://www.slideshare.net/ahmedmadany/block-searchenginesfromindexingyourshare-pointsite
Please remember to mark your question as answered & vote helpful if this solves/helps your problem. Thanks -WS MCITP (SharePoint 2010, 2013) Blog: http://wscheema.com/blog -
Robots.txt -- how do I do this?
I'm not using iWeb, unfortunately, but I wanted to protect part of a site I've set up. How do I set up a hidden directory under my domain name? I need it to be invisible except to people who have been notified of its existence. I was told, "In order to make it invisible you would need to not have any links associated with it on your site, make sure you have altered a robots.txt file in your /var/www/html directory so bots cannot spider it. A way to avoid spiders crawling certain directories is to place a robots.txt file in your web root directory that has parameters on which files or folders you do not want indexed."
But how do I get/find/alter this robots.txt file? I unfortunately don't know how to do this sort (hardly any sort) of programming. Thank you so much.
Muse does not generate a robots.txt file.
If your site has one, it's been generated by your hosting provider, or some other admin on your website. If you'd like google or other 'robots' to crawl your site, you'll need to edit this file or delete it.
Also note that you can set your page description in Muse using the page properties dialog, but it won't show up immediately in google search results - you have to wait until google crawls your site to update their index, which might take several days. You can request google to crawl it sooner though:
https://support.google.com/webmasters/answer/1352276?hl=en -
Error 404 - /_vti_bin/owssvr.dll and robots.txt
Hi
My webstats tell me that I have had various Error 404s and
this is because of files being "required but not found":
specifically /_vti_bin/owssvr.dll and robots.txt.
Can someone tell me what these are?
Also, there are various other status code pages coming up
such as
302 Moved temporarily (redirect) 6 27.2 % 2.79 KB
401 Unauthorized 5 22.7 % 9.32 KB
403 Forbidden 3 13.6 % 5.06 KB
206 Partial Content
Why are these arising and how can I rid myself of them?
Many thanks : )
Example of an HttpModule that uses PreRequestHandlerExecute and returns when it encounters owssvr.dll (note that 'app' comes from casting the event sender):
class MyHttpModule : IHttpModule, IRequiresSessionState
{
    public void Init(HttpApplication context)
    {
        context.PreRequestHandlerExecute += new EventHandler(context_PreRequestHandlerExecute);
    }
    void context_PreRequestHandlerExecute(object sender, EventArgs e)
    {
        HttpApplication app = (HttpApplication)sender;
        if (app.Context.Request.Url.AbsolutePath.ToLower().Contains("owssvr.dll"))
            return;
    }
}
-
[solved]Wget: ignore "disallow wget" +comply to the rest of robots.txt
Hello!
I need to wget a few (maybe 20 -.- ) html files that are linked on one html page (same domain) recursively, but the robots.txt there disallows wget. Now I could just ignore the robots.txt... but then my wget would also ignore the info on forbidden links to dynamic sites which are forbidden in the very same robots.txt for good reasons. And I don't want my wget pressing random buttons on that site. Which is what the robots.txt is for. But I can't use the robots.txt with wget.
Any hints on how to do this (with wget)?
Last edited by whoops (2014-02-23 17:52:31)
HalosGhost wrote: Have you tried using it? Or, is there a specific reason you must use wget?
Only stubborness
Stupid website -.- what do they even think they achieve by disallowing wget? I should just use the ignore option and let wget "click" on every single button in their php interface. But nooo, instead I waste time trying to figure out a way to exclude those GUI links from being followed even though wget would be perfectly set up to comply to that automatically if it weren't for that one entry to "ban" it. *grml*
Will definitely try curl next time though - thanks for the suggestion!
And now, I present...
THE ULTIMATIVE SOLUTION**:
sudo sed -i 's/wget/wgot/' /usr/bin/wget
YAY.
./solved!
** stubborn version.
Last edited by whoops (2014-02-23 17:51:19) -
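For what it's worth, the behaviour wanted here (follow the site's general rules while ignoring the wget-specific ban) is exactly what a robots.txt parser gives you when you query it under a different user-agent, since per-agent groups only apply to agents they name. A sketch with Python's standard library; the robots.txt contents and the 'MyFetcher' agent name are made up for illustration:

```python
import urllib.robotparser

# Hypothetical robots.txt: bans wget outright, bans /dynamic/ for everyone
rules = """User-agent: wget
Disallow: /

User-agent: *
Disallow: /dynamic/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("wget", "/page.html"))       # False: the wget group bans everything
print(rp.can_fetch("MyFetcher", "/page.html"))  # True: falls through to the * group
print(rp.can_fetch("MyFetcher", "/dynamic/x"))  # False: the * group bans /dynamic/
```

So the general rules alone already express what the site actually wants skipped. With wget itself, one workable compromise is -e robots=off combined with -X/--exclude-directories to hand-exclude the dynamic areas the * group lists.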
Web Repository Manager and robots.txt
Hello,
I would like to search an intranet site and therefore set up a crawler according to the guide "How to set up a Web Repository and Crawl It for Indexing".
Everything works fine.
Now this web site uses a robots.txt as follows:
User-agent: googlebot
Disallow: /folder_a/folder_b/
User-agent: *
Disallow: /
So obviously, only google is allowed to crawl (parts of) that web site.
My question: If I'd like to add the TRex crawler to the robots.txt what's the name of the "User-agent" I have to specify here?
Maybe the name I defined in the SystemConfiguration > ... > Global Services > Crawler Parameters > Index Management Crawler?
Thanks in advance,
Stefan
Hi Stefan,
I'm sorry, but this is hard-coded. I found it in the class com.sapportals.wcm.repository.manager.web.cache.WebCache:
private HttpRequest createRequest(IResourceContext context, IUriReference ref)
{
    HttpRequest request = new HttpRequest(ref);
    String userAgent = "SAP-KM/WebRepository 1.2";
    if (sessionWatcher != null)
    {
        String ua = sessionWatcher.getUserAgent();
        if (ua != null)
            userAgent = ua;
    }
    request.setHeader("User-Agent", userAgent);
    Locale locale = context.getLocale();
    if (locale != null)
        request.setHeader("Accept-Language", locale.getLanguage());
    return request;
}
So recompile the component or change the filter... I would prefer to change the robots.txt.
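For reference, if you do go the robots.txt route with that hard-coded agent, a group like the following, added alongside the existing googlebot group, should match it. Robots.txt user-agent matching is a case-insensitive match on the product token without version information, so 'SAP-KM' should catch 'SAP-KM/WebRepository 1.2' (paths copied from the question):

User-agent: SAP-KM
Disallow: /folder_a/folder_b/

User-agent: *
Disallow: /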
hope this helps,
Axel -
Robots.txt question?
I am kind of new to web hosting, but learning.
I am hosting with Just Host and I have a couple of sites (addons). I am trying to publish my main site now, and there is a whole bunch of stuff in the site root folder that I have no idea about. I don't want to delete anything and I am probably not going to lol. But should I block a lot of the stuff in there in my robots.txt file?
Here is some of the stuff in there:
.htaccess
404.shtml
cgi-bin
css
img
index.php
justhost.swf
sifr-addons.js
sIFR-print.cs
sIFR-screen.css
sifr.js
should I just disallow all of this stuff in my robots.txt? Any recommendations would be appreciated. Thanks
Seaside333 wrote:
public_html for the main site, the other addons are public_html/othersitesname.com
is this good?
thanks for quick response
You probably don't need the following files unless you're using text image-replacement techniques: sifr-addons.js, sIFR-print.cs, sIFR-screen.css, sifr.js.
It's good to keep .htaccess (you can insert special instructions in this file), 404.shtml (if a page can't be found on your remote server it goes to this page) and cgi-bin (some processing scripts are placed in this folder).
You will probably have your own 'css' folder, and the 'img' folder is not needed. 'index.php' is the homepage of the site and what the browser looks for initially; you can replace it with your own homepage.
You don't need justhost.swf.
Download the files/folders to your local machine and keep them in case you need them.