Why does the world need Nutch, when search engines are free?
Search engines are free to use like television is free to watch, but, like television programming, search results are subject to manipulation by the interests that control them. The only way one can be certain that search results are unbiased is if the technology which computes them is public. Nutch seeks to make high-quality search technology freely available.
How can I help?
If you're a developer, please visit our developer page.
If you're able to donate funds, please visit our donations page.
If you have other suggestions, questions or comments, please send a message to firstname.lastname@example.org.
How can a non-profit afford to run a search engine?
Nutch is primarily a software project, not a service. Large-scale deployments of Nutch will probably be run by commercial interests separate from Nutch, funded by advertising or similar means. If the Nutch software is good enough, perhaps existing major search engines will use it in place of their current closed-source code.
The Nutch project itself may choose to host a small-scale demo system, so that folks can see that it really works. This will require only moderate funding. The Nutch project may never host a full-scale deployment for folks to use as their everyday search engine. We'll leave that to commercial ventures that can afford it.
Will Nutch ever be as good as other search engines?
We hope it will be better. With developers and researchers from around the world helping out, we hope to be able to surpass the quality of what any single company can do.
How can I stop Nutch from crawling my site?
Please visit our webmaster info page.
How can I make sure that Nutch crawls my site?
Nutch uses the DMOZ Open Directory to bootstrap its crawling. So the best way to get your site crawled by Nutch is to make sure that it is listed in the Open Directory.
Will Nutch be a distributed, P2P-based search engine?
We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. We don't think many people would want to use a search engine that takes ten or more seconds to return results.
That said, if someone wishes to start a sub-project of Nutch exploring distributed searching, we'd love to host it. We don't think these techniques are likely to solve the hard problems Nutch needs to solve, but we'd be happy to be proven wrong.
Will Nutch use a distributed crawler, like Grub?
Distributed crawling can save download bandwidth, but, in the long run, the savings are not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not meaningfully reduce overall bandwidth requirements. The dominant expense of operating a search engine is not crawling, but searching.
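To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. Every figure in it (page counts, query counts, average sizes) is a made-up placeholder chosen only for illustration, not a measurement of Nutch or of any real search engine, and the function name is hypothetical.

```python
def monthly_bandwidth_gb(items_per_month: float, avg_kb_per_item: float) -> float:
    """Total monthly transfer in gigabytes (decimal) for items of a given average size."""
    return items_per_month * avg_kb_per_item / 1_000_000  # KB -> GB


# ASSUMPTION: all four numbers below are invented placeholders.
pages_crawled_per_month = 100_000_000   # pages the crawler downloads
avg_page_kb = 10                        # average size of a crawled page
queries_per_month = 1_000_000_000       # result pages served to users
avg_result_page_kb = 20                 # average size of a result page

crawl_gb = monthly_bandwidth_gb(pages_crawled_per_month, avg_page_kb)
search_gb = monthly_bandwidth_gb(queries_per_month, avg_result_page_kb)

# Share of total bandwidth spent on crawling: the most that a
# distributed crawler could possibly save under these assumptions.
crawl_share = crawl_gb / (crawl_gb + search_gb)

print(f"crawl: {crawl_gb:.0f} GB, search: {search_gb:.0f} GB, "
      f"crawl share: {crawl_share:.1%}")
```

With these placeholder inputs, crawling accounts for only a small fraction of the total, so even eliminating crawl bandwidth entirely would barely change the overall bill. Different inputs will shift the ratio; the point is only that the saving is bounded by the crawler's share.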
Won't open source just make it easier for sites to manipulate rankings?
Search engines work hard to construct ranking algorithms that are immune to manipulation. Search engine optimizers still manage to reverse-engineer the ranking algorithms used by search engines, and improve the ranking of their pages. For example, many sites use link farms to manipulate search engines' link-based ranking algorithms, and search engines retaliate by improving their link-based algorithms to neutralize the effect of link farms.
With an open-source search engine, this will still happen, just out in the open. This is analogous to encryption and virus protection software. In the long term, making such algorithms open source makes them stronger, as more people can examine the source code to find flaws and suggest improvements. Thus we believe that an open-source search engine has the potential to better resist manipulation of its rankings.
When will Nutch search images, pdf files, etc.?
Soon, we hope.
Except where otherwise noted, this site is licensed under a Creative Commons License.