Nutch FAQ

Please feel free to answer and add questions!

Please also have a look at the error messages, their reasons, and their solutions.

Crawling
How is Grub relevant to Nutch?

  • Grub ( http://grub.org/ ) has some interesting ideas about building a search engine using distributed computing. How (and whether) those ideas apply to Nutch is still an open discussion.


How do I index my local file system?

The tricky thing about Nutch is that, out of the box, it has most plugins disabled and is tuned for crawling "remote" web servers; you have to change a few configuration files to get it to crawl your local disk.

1. crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones; otherwise it will either index nothing or jump off your disk onto web sites. Change this line:

-^(file|ftp|mailto|https):

to this:

-^(http|ftp|mailto|https):
2. crawl-urlfilter.txt may have rules at the bottom that reject some URLs. If it contains this fragment, it is probably OK:

# accept anything else
+.*

3. By default the "file" plugin is disabled. nutch-site.xml needs to be modified to enable it. Add an entry like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

Now you can invoke the crawler and index all or part of your disk. The one remaining gotcha: if you test with the Nutch web application running in Tomcat, Mozilla will not load file: URLs from a web page fetched over http, so clicking on results appears to do nothing. This behavior can be disabled with a preference (see security.checkloaduri). IE5 does not have this problem.
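As a rough sketch of the invocation, using the one-step crawl command from the Nutch tutorial; the seed file name, local path, and option values here are made-up examples, and the available options may vary with your Nutch version:

echo 'file:///home/username/docs/' > urls.txt
bin/nutch crawl urls.txt -dir crawl.local -depth 3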



Injecting

What happens if I inject URLs several times?

  • URLs that are already in the database won't be injected again.
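For example, a sketch using the web database commands from the Nutch tutorial; db and urls.txt are the conventional names, and the exact flags may differ between versions:

bin/nutch admin db -create
bin/nutch inject db -urlfile urls.txt
bin/nutch inject db -urlfile urls.txt   # a second, identical inject adds nothing new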


Fetching

Is it possible to fetch only pages from some specific domains?

  • Please have a look at PrefixURLFilter.
  • Adding some regular expressions to the file named by urlfilter.regex.file might also work, but a list of thousands of regular expressions would slow your system down excessively. A sketch of both approaches follows.
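Assuming the prefix filter reads one allowed URL prefix per line (the file name prefix-urlfilter.txt and the domains are illustrative):

http://www.example.com/
http://docs.example.org/

Or, with the regex filter, allow one domain and reject everything else (the first matching rule wins):

+^http://([a-z0-9-]+\.)*example\.com/
-.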

How can I recover an aborted fetch process?

  • You have two choices (both sketched below):
    1. Keep the aborted output. You'll need to touch the file fetcher.done in the segment directory; all pages that were not fetched will be regenerated for fetching soon. If you have already fetched lots of pages and don't want to re-fetch them, this is the best way.
    2. Discard the aborted output. To do this, just delete the fetcher* directories in the segment and restart the fetcher.
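Concretely, the two options might look like this; the segment name 20050319123456 is a made-up example (Nutch names segment directories after their creation time):

# option 1: mark the partial segment as finished
touch segments/20050319123456/fetcher.done

# option 2: discard the partial fetch and run it again
rm -rf segments/20050319123456/fetcher*
bin/nutch fetch segments/20050319123456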
Who changes the next fetch date?

  • After injecting a new URL, the next fetch date is set to the current time.
  • Generating a fetchlist advances the date by 7 days.
  • Updating the db sets the date to the current time plus db.default.fetch.interval, minus those 7 days. With the default 30-day interval, for example, a freshly fetched page becomes due again in 23 days, and the 7 days added at generate time bring the effective cycle back to 30 days.
I have a big fetchlist in my segments folder. How can I fetch only some sites at a time?

  • Decide how many pages you want to crawl before generating segments, and use the options of bin/nutch generate (see the sketch after this list).
  • Use -topN to limit the total number of pages.
  • Use -numFetchers to generate multiple small segments.
  • Afterwards you can either generate new segments. You may want to use -adddays so that bin/nutch generate puts all the URLs into the new fetchlist again; add more than 7 days if you did not run updatedb.
  • Or send the fetcher a Unix STOP signal. You should be able to index the part of the segment that is already fetched, and later send a CONT signal to let the process resume. Do not turn off your computer in between! :)
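A sketch of the generate/fetch cycle, following the Nutch tutorial's conventions (db and segments are the usual directory names; 1000 is an arbitrary example):

bin/nutch generate db segments -topN 1000
s=`ls -d segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb db $s

And pausing a running fetcher, assuming $PID is its process id:

kill -STOP $PID
# ...index the already-fetched part...
kill -CONT $PID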


Updating


Indexing

Is it possible to change the list of common words without crawling everything again?


Segment Handling

Do I have to delete old segments after some time?

  • If you are fetching regularly, segments older than db.default.fetch.interval (30 days by default) can be deleted, as their pages should have been refetched into newer segments by then.
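One possible way to spot deletable segments, assuming the default 30-day interval and the usual segments/ layout; run the find by itself first to check what it matches:

find segments -mindepth 1 -maxdepth 1 -type d -mtime +30
# once you are sure the list is right:
find segments -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} \;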


Searching

First Search: My system does not find the segments folder. Why?

  • Please have a look at the nutch-site.xml used by the web application (WEB-INF/classes/nutch-site.xml). Set searcher.dir to the directory that contains the segments and index folders, or to a directory containing a search-servers.txt file.
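For example, a minimal entry might look like this; the path is a placeholder for wherever your crawl directory actually lives:

<property>
  <name>searcher.dir</name>
  <value>/path/to/crawl.local</value>
</property>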



