How do I index my local file system?
The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for crawling a "remote" web server, so you have to change config files to get it to crawl your local disk.
1. crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites. Change this line:
2. crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
# accept anything else
3. By default the "file plugin" is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this:
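As a rough sketch of step 3, the property below overrides plugin.includes so that protocol-file is loaded instead of protocol-http. The plugin list shown is an assumption based on a typical default; the exact set of parse, index and query plugins varies between Nutch versions, so copy the value from your own nutch-default.xml and only swap the protocol plugin. The element goes inside the existing root element of nutch-site.xml.

```xml
<!-- Assumed plugin list: check nutch-default.xml for your version's value -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```

For steps 1 and 2, the rules in crawl-urlfilter.txt might end up looking something like this. Rules are read top to bottom and the first matching pattern decides whether a URL is kept (+) or dropped (-), so the reject rule for http: has to come before the final accept rule. The stock file is assumed here to ship with a rule that skips file:, ftp: and mailto: URLs; your copy may differ.

```
# skip http:, ftp:, & mailto: urls (the stock rule is assumed to list file: where http: is now)
-^(http|ftp|mailto):

# accept anything else
+.*
```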
Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that Mozilla, by default, will not load file: URLs from a web page fetched over http, so if you test with the Nutch web container running in Tomcat, annoyingly, nothing will happen when you click on results. This is mentioned here, and the behavior may be disabled by a preference (see security.checkloaduri). IE5 does not have this problem.
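If you want to try the preference route, one way that should work on Mozilla builds that still honor this preference is a line in the user.js file of your browser profile. This is a sketch, not a recommendation: the value false is an assumption about how the check is switched off, and turning it off lets any http page link to local files, so it is best limited to a test profile.

```
// user.js in the Mozilla profile directory (assumed location)
// Allow pages fetched over http to link to file: URLs
user_pref("security.checkloaduri", false);
```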