Frequently Asked Questions:
- What is a Crawler?
- How Do I Prevent My Website or Parts of it From Being
Crawled by Your Crawler?
- What is "robots.txt"?
- How do I make a Robots.txt File?
- META tags!
- KacMac Crawler is downloading information from our "secret" web server.
- How You Can Help Us Quickly Respond To You
- How To Contact Us
1. What is a Crawler?
A Crawler (which may also be called a robot, spider, or bot) is a
program that automatically traverses the Web's hypertext structure by
retrieving a document, and recursively retrieving all documents that
are referenced. For more information on Crawlers and the standards of
crawling which we follow, you can visit the WebRobots FAQ
(http://www.robotstxt.org/wc/robots.html).
2. How Do I Prevent My Website or Parts of it From Being
Crawled by Your Crawler?
Our crawler may create a short burst of moderate activity on a
single server. If you would prefer that our crawler or others bypass
part or all of your website, or if you are concerned that our crawler
is loading your site heavily, the simplest remedy is to create a
robots.txt file on your server. Any crawler should request this file
before downloading anything else from your server(s). The file must
reside in the top level of your server. It lets you control which
parts of your server may be visited and which crawlers are allowed to
visit your site(s). Note that if your robots.txt file is malformed, a
crawler may not recognize your intention. We obey the Robots Exclusion
Standard, originally constructed in 1994 and updated in 1996. You can
review the standard at the Robotstxt website
(http://www.robotstxt.org/wc/exclusion.html).
3. What is "robots.txt"?
Robots.txt is a standard document that can tell KacMac Crawler not to
download some or all information from your web server. Being
responsible professionals, we want to make sure that webmasters are
not inconvenienced by our crawling activities, and we only wish to use
publicly available data. Therefore, we abide by the Robots Exclusion
Standard (http://www.robotstxt.org/wc/exclusion.html).
4. How do I make a Robots.txt File?
If you are wondering what a robots.txt file looks like, here is a
simple one that asks all robots to stay away from /temp/documents and
its subdirectories:
# Sample robots.txt file 1
User-agent: *
Disallow: /temp/documents/
The first line is a comment line which can be placed anywhere in a
robots.txt file as long as the comment is preceded by a pound symbol
(#). The second line designates robots to which the access policies
apply, with a "*" meaning all robots. The third line disallows access
to the specified directory and to any directories below it in the
hierarchy. You can include multiple Disallow statements to prohibit
access to two or more directories. You may want certain robots to
access areas that are off limits to other robots. The following
robots.txt file allows unrestricted site access to a robot named
CRAWLER but prohibits others from accessing either /temp/documents or
/under_construction:
# Sample robots.txt file 2
User-agent: *
Disallow: /temp/documents/
Disallow: /under_construction/
User-agent: CRAWLER
Disallow:
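One way to confirm that rules like these behave as described is to
feed them to a robots.txt parser before publishing the file. The
following is a minimal sketch, assuming Python's standard
urllib.robotparser module. The host example.com, the robot name
OtherBot, and the helper may_fetch are made up for illustration, and
an individual crawler's own parser may differ in details.

```python
from urllib.robotparser import RobotFileParser

# Rules equivalent to sample robots.txt file 2, using /temp/documents/
# as named in the surrounding text.
SAMPLE_2 = """\
# Sample robots.txt file 2
User-agent: *
Disallow: /temp/documents/
Disallow: /under_construction/
User-agent: CRAWLER
Disallow:
"""


def may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com and OtherBot are hypothetical names used only for this check.
    for agent in ("CRAWLER", "OtherBot"):
        for path in ("/index.html", "/temp/documents/a.html", "/under_construction/"):
            url = "http://example.com" + path
            print(agent, path, may_fetch(SAMPLE_2, agent, url))
```

With these rules, CRAWLER is allowed everywhere, while any other robot
is turned away from the two disallowed directories only.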
If you want to forbid all crawlers from crawling your site altogether,
then create a robots.txt file with the following lines:
# Sample robots.txt file 3
User-agent: *
Disallow: /
Upon seeing this, crawlers that abide by the robots standard, as we
do, will immediately disconnect and go find another server. Any of
the above sample robots.txt files must be placed in the top level of
your server under the file name "robots.txt". Be sure to verify that
the URL http://your.server.name/robots.txt will retrieve your newly
created file.
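The verification step can also be scripted. The following is a minimal
sketch, assuming Python's standard library; the helper names
robots_url and fetch_robots are hypothetical, and your.server.name
stands in for your own host. It shows that robots.txt is always looked
up at the top level of the server, whatever page the crawler is
interested in.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url: str) -> str:
    """Build the top-level robots.txt URL for the server hosting `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host (with port, if any); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url: str, timeout: float = 10.0) -> str:
    """Download robots.txt for the server and return its text.

    Raises urllib.error.URLError (or its subclass HTTPError) if the file
    cannot be retrieved, which means crawlers will not see it either.
    """
    with urlopen(robots_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Prints the URL that would be requested; call fetch_robots() with a
    # page on your own server to perform the actual download.
    print(robots_url("http://your.server.name/temp/documents/index.html"))
```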
If you want to forbid only our crawler from going through your
site, then create a robots.txt file that contains the following lines:
User-agent: KacMacBot
Disallow: /
Again, place this file in the top level of your server under the file
name "robots.txt", and verify that the URL
http://your.server.name/robots.txt will retrieve your newly created
file.
5. META tags!
There is another standard for telling robots not to index a particular
web page or follow links on it, which may be more helpful, since it
can be used on a page-by-page basis. This method involves placing a
"META" element in the head of an HTML page, for example
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">.
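To illustrate how a robot might read such an element, here is a
minimal sketch, assuming Python's standard html.parser module. The
class RobotsMetaParser, the helper robots_directives, and the sample
page are made up for illustration; this is not a description of how
KacMac Crawler itself processes pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> element."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for item in attributes.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks robots not to index it or follow its links.
PAGE = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Private page</title>
</head><body><a href="/next.html">next</a></body></html>"""

if __name__ == "__main__":
    found = robots_directives(PAGE)
    print("may index:", "noindex" not in found)
    print("may follow links:", "nofollow" not in found)
```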
6. KacMac Crawler is downloading information from our "secret" web server.
It is almost impossible to keep a web server secret by not publishing
any links to it. As soon as someone follows a link from your "secret"
server to another web server, the request is likely to carry your
"secret" URL in its HTTP Referer header, and the other web server can
store it and possibly publish it in its referer log. So, if there is a
link to your "secret" web server or page anywhere on the web, it is
likely that KacMac and other web crawlers will find it.
7. How You Can Help Us Quickly Respond To You:
You can provide us with a few pieces of information so that we can
rapidly identify the source of any problems or issues involving our
crawler interacting with your website. In your email to us, please
include the following information:
- An outline of your problem or issue.
- Identification of the IP Address of the server which our crawler touched.
- Identification of the time and date of the problem or issue.
- Identification of your name as contact person, with email address
and/or phone number.
- Entries from your server log(s) that show the problem, or the URLs
that triggered the problem or issue, would also be helpful.
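If you want to pull the relevant log entries out automatically, the
following minimal Python sketch may help. It assumes that the
crawler's requests can be recognized by the string "KacMacBot" (the
name used in the robots.txt example above; the exact User-Agent header
is not specified here). The helper crawler_entries is hypothetical,
and the sample log lines are made up for illustration only.

```python
from typing import Iterable, List


def crawler_entries(lines: Iterable[str], token: str = "KacMacBot") -> List[str]:
    """Return the log lines that mention `token`, ignoring letter case."""
    needle = token.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


# Made-up lines in Apache combined log format, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2000:12:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "KacMacBot"\n',
    '192.0.2.77 - - [01/Jan/2000:12:00:05 +0000] "GET /index.html HTTP/1.0" 200 2048 "-" "SomeBrowser/1.0"\n',
    '192.0.2.10 - - [01/Jan/2000:12:00:09 +0000] "GET /temp/documents/ HTTP/1.0" 200 512 "-" "KacMacBot"\n',
]

if __name__ == "__main__":
    for entry in crawler_entries(SAMPLE_LOG):
        print(entry)
```

To run it against a real log, pass an open file object instead of
SAMPLE_LOG; each matching line already contains the IP address, time,
and URL asked for above.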
8. How To Contact Us:
If you have created a robots.txt file on your server and still have
questions for us, then please contact us via email, including the
information outlined above, using the email address kacmac@kacmac.com.
Special thanks to IBM's Research Division for help. KacMac Team