segments, they've all been indexed, and
you've run duplicate detection, then you're ready to merge.
net.nutch.indexer.IndexMerger
bin/nutch merge <indexDirectory> <segment_dirs>
bin/nutch merge . segments/*
This creates a merged index, containing the contents of all of the
segments/*/index, in a new directory named after those segments, in your
case 20030422113844-0_20030423144418-2.
Here's the bug. NutchBean looks for a merged index in a directory named
index. So, to make things work, you currently have to manually rename
the merged index directory to be just index:
mv 20030422113844-0_20030423144418-2 index
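As a minimal sketch of the rename step, the helper below wraps the mv in two sanity checks. The function name promote_merged_index is hypothetical and not part of Nutch; it assumes you pass it the name of the directory that the merge created.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): rename a merged index
# directory to the name NutchBean looks for.
promote_merged_index() {
    merged_dir=$1          # e.g. 20030422113844-0_20030423144418-2
    target=${2:-index}     # defaults to "index"

    # The merged directory must exist before it can be renamed.
    if [ ! -d "$merged_dir" ]; then
        echo "no such merged index directory: $merged_dir" >&2
        return 1
    fi
    # Do not clobber an index that is already in place.
    if [ -e "$target" ]; then
        echo "refusing to overwrite existing $target" >&2
        return 1
    fi
    mv "$merged_dir" "$target"
}
```

For the case above this would be called as promote_merged_index 20030422113844-0_20030423144418-2.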
If you start Tomcat from a directory that has subdirectories
named index and segments, it will use the merged index data in
index and get the rest of the segment data from the segments
directory. Searches are much faster with a merged index.
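Before starting Tomcat, the layout described above can be checked with a few lines of shell. This is only a sketch: check_search_dir is a hypothetical name, and it tests nothing beyond the presence of the two subdirectories.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): verify that a directory
# contains both the "index" and "segments" subdirectories.
check_search_dir() {
    dir=${1:-.}            # directory to check, defaults to the current one
    status=0
    for sub in index segments; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing directory: $dir/$sub" >&2
            status=1
        fi
    done
    return $status
}
```

Running check_search_dir . in the directory you intend to start Tomcat from returns a non-zero status and names whichever subdirectory is missing.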
data data/db data/segments
With the associated index and fetch under these. If I run the command
bin/nutch merge . segments/*
in the data directory, it tries to delete the contents of the data directory.
However, I have found that if I run
bin/nutch merge index segments/*
it creates a merged index. The created index directory should then be stored in the data directory:
data/ data/db data/index data/segments
and everything works as expected (N.B. you must keep the segments directory).
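Since a merge into . tried to delete the contents of the data directory, one cautious approach is to check the target name before merging. The sketch below assumes that any existing non-empty target is at risk; check_merge_target is a hypothetical name and not part of Nutch.

```shell
#!/bin/sh
# Hypothetical guard (not part of Nutch): reject merge targets that
# could cause existing data to be deleted.
check_merge_target() {
    target=$1
    # Reject an empty name or the current directory itself.
    case $target in
        ""|.|./)
            echo "refusing to merge into the current directory" >&2
            return 1
            ;;
    esac
    # Assumption: an existing, non-empty directory is also unsafe.
    if [ -d "$target" ] && [ -n "$(ls -A "$target")" ]; then
        echo "refusing to merge into non-empty directory: $target" >&2
        return 1
    fi
    return 0
}
```

It could be combined with the command above as check_merge_target index && bin/nutch merge index segments/*.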
Revision r1.1 - 09 Dec 2004 - 11:18 GMT - AlonsoAndres