If you're reading this, chances are you've seen our robot visiting your site while looking through your server logs. When we crawl to populate our index, we advertise the "User-agent" string "NutchOrg". If you see the agent "Nutch" or "NutchCVS", that's probably a developer testing a new version of our robot, or someone running their own instance.
We are open-source developers, trying to build something useful for the world to use. It comes naturally to us to want to be good netizens. If you notice our bot misbehaving, please drop us a line at email@example.com and we will investigate the problem.
Our bot does retrieve and parse robots.txt files, and it looks for robots META tags in HTML. These are the standard mechanisms for webmasters to tell web robots which portions of a site a robot is welcome to access.
We're an open source project, so please understand that a misbehaving bot appearing with our Agent string may not have been run by us. Our code is out there for anyone to tinker with. However, whether or not we ran the bot, we'd appreciate hearing about any bad behavior, so please let us know about it! If possible, please include the name of the domain and some representative log entries. We can be reached at firstname.lastname@example.org.
Our bot follows the robots.txt exclusion standard, which is described at http://www.robotstxt.org/wc/exclusion.html#robotstxt. Depending on the configuration, our robot may obey different rules. To make it simple to send our bot away, we'll always obey rules for "Nutch". Here are the different cases.
To ban all bots from your site, place the following in your robots.txt file:
User-agent: *
Disallow: /
To ban Nutch bots from your site unless they're building the Nutch.Org demo index, place the following in your robots.txt file:
User-agent: NutchOrg
Disallow:

User-agent: Nutch
Disallow: /
To ban all Nutch bots from your site:
User-agent: Nutch
Disallow: /
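If you'd like to sanity-check rules like these before deploying them, here is a minimal sketch using Python's standard urllib.robotparser module. Please note the assumptions: the three rule sets are our illustration of the cases described above, the helper name allowed and the example.com URL are hypothetical, and Python's parser is only a stand-in. It matches agent names by substring and takes the first matching record, which is not necessarily how a given Nutch configuration resolves its agent names.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rule sets for the three cases (assumed content, not copied
# from any live site).
BAN_ALL = """\
User-agent: *
Disallow: /
"""

# The more specific record comes first because Python's parser uses the
# first record whose agent name matches.
BAN_NUTCH_EXCEPT_DEMO = """\
User-agent: NutchOrg
Disallow:

User-agent: Nutch
Disallow: /
"""

BAN_ALL_NUTCH = """\
User-agent: Nutch
Disallow: /
"""


def allowed(robots_txt, agent, url="http://example.com/page.html"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    cases = [
        ("ban all bots", BAN_ALL),
        ("ban Nutch except demo index", BAN_NUTCH_EXCEPT_DEMO),
        ("ban all Nutch bots", BAN_ALL_NUTCH),
    ]
    for label, rules in cases:
        for agent in ("NutchOrg", "Nutch", "NutchCVS", "OtherBot"):
            print(f"{label:30} {agent:10} allowed={allowed(rules, agent)}")
```

Running the script prints one line per rule set and agent, which makes it easy to spot a rule that blocks more (or less) than you intended.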
If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots META tag, as described at http://www.robotstxt.org/wc/meta-user.html.
To tell Nutch, and other robots, not to index your page or follow your links, insert this META tag into the HEAD section of your HTML document:
<meta name="robots" content="noindex,nofollow">
Of course, you can control the "index" and "follow" directives independently. The keywords "all" or "none" are also allowed, meaning "index,follow" or "noindex,nofollow", respectively. For example:
<meta name="robots" content="all">
If there are no robots META tags, or if an action is not specifically prohibited (i.e., neither the corresponding "noindex" or "nofollow" directive nor "none" appears), Nutch will assume it is allowed to index the page or follow its links.
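To make the defaulting rule concrete, here is a rough sketch of how a crawler might interpret robots META tags, written with Python's standard html.parser module. This is not Nutch's actual code: the class RobotsMetaParser and the function robots_meta are hypothetical names, and the logic simply follows the description above, where both actions are allowed unless "noindex", "nofollow", or "none" says otherwise.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects index/follow permissions from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        # Both actions are allowed unless a tag explicitly prohibits them.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",")}
        # "none" is shorthand for "noindex,nofollow"; "all", "index" and
        # "follow" change nothing because allowing is already the default.
        if "noindex" in directives or "none" in directives:
            self.index = False
        if "nofollow" in directives or "none" in directives:
            self.follow = False


def robots_meta(html):
    """Return (may_index, may_follow) for an HTML document string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
    print(robots_meta(page))
```

A page with no robots META tag at all yields (True, True) here, which matches the default described above.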
Except where otherwise noted, this site is licensed under a Creative Commons License.