Nutch FAQ

Please feel free to answer and add questions!

Please also have a look at the error messages, their reasons and solutions

Injecting

What happens if I inject urls several times?


Fetching

Is it possible to fetch only pages from some specific domains?

How can I recover an aborted fetch process?

Who changes the next fetch date?

I have a big fetchlist in my segments folder. How can I fetch only some sites at a time?


Updating


Indexing

Is it possible to change the list of common words without crawling everything again?

How do I index my local file system?

The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for crawling a "remote" web server. You have to change config files to get it to crawl your local disk.

1. crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones. Otherwise it either won't index anything, or it'll jump off your disk onto web sites. Change this line:

-^(file|ftp|mailto|https):
to this:
-^(http|ftp|mailto|https):
2. crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
# accept anything else
+.*

3. By default the "file plugin" is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
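
As a rough illustration of step 3, the plugin.includes value can be read as a regular expression over plugin names. The Python sketch below assumes that a plugin is enabled only when its whole name matches the expression; the list of candidate names is hypothetical and is there only to show which names the value above would and would not cover.

```python
import re

# Value of plugin.includes copied from the nutch-site.xml entry above.
PLUGIN_INCLUDES = (
    r"protocol-file|protocol-http|parse-(text|html)"
    r"|index-basic|query-(basic|site|url)"
)

# Hypothetical plugin names, for illustration only.
CANDIDATES = [
    "protocol-file", "protocol-http", "protocol-ftp",
    "parse-text", "parse-html", "parse-pdf",
    "index-basic", "query-site",
]


def enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """Assumption: a plugin is enabled when its whole name matches."""
    return re.fullmatch(includes, plugin_id) is not None


for name in CANDIDATES:
    print(f"{name:15} {'enabled' if enabled(name) else 'disabled'}")
```

If a plugin you expect to run shows up as disabled in a check like this, the value in nutch-site.xml is the first place to look.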

Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that Mozilla by default will not load file: URLs from a web page fetched over http. So if you test with the Nutch web container running in Tomcat, annoyingly, nothing will happen when you click on results. This behavior may be disabled by a preference (see security.checkloaduri). IE5 does not have this problem.
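
To sanity-check the filter change from steps 1 and 2 before starting a crawl, the rules can be simulated outside Nutch. The Python sketch below is not Nutch code. It assumes that rules are tried top to bottom, that the first matching rule decides, and that a URL matching no rule is rejected. Only the two rules quoted above are reproduced, and the sample URLs are made up.

```python
import re

# The two rules quoted above, in file order, after the edit from step 1.
RULES = """
# skip http:, ftp:, mailto: and https: URLs
-^(http|ftp|mailto|https):
# accept anything else
+.*
"""


def parse_rules(text):
    """Return a list of (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumption: the first matching rule decides; no match means reject."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


rules = parse_rules(RULES)
# Made-up sample URLs, for illustration only.
for url in ["file:/home/user/docs/report.html",
            "http://example.com/",
            "https://example.com/"]:
    print(url, "->", "accepted" if accepts(rules, url) else "rejected")
```

Running the same check with the unmodified rule, -^(file|ftp|mailto|https):, shows the two failure modes from step 1: file: URLs are rejected, and http: URLs fall through to the final accept rule.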


Segment Handling

Do I have to delete old segments after some time?


Searching

First Search: My system does not find the segments folder. Why?


Crawling


Discussion

Grub ( http://grub.org/ ) has some interesting ideas about building a search engine using distributed computing.

And how is that relevant to Nutch?

Revision r1.3 - 19 Mar 2005 - 20:31 GMT - DavidCary Copyright © 1999-2003 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.