net.nutch.tools
Class SegmentMergeTool
java.lang.Object
net.nutch.tools.SegmentMergeTool
public class SegmentMergeTool
extends Object
This class cleans up accumulated segment data and merges the segments
into a single segment with no duplicates. It uses a "master"
unique index of all documents. That index must either already exist
(created by running IndexSegment for each segment, then DeleteDuplicates,
and finally IndexMerger), or the tool can create it just before
merging, including per-segment sub-indices as needed.
The newly created segment is then optionally indexed, so that
it can either be merged with more new segments or used for
searching as it is.
The original "master" index can optionally be deleted:
because it still points to the old segments, the new index should
be used instead. Old segments can optionally be removed as well,
because all needed data has already been copied to the new merged
segment.
Using all of the provided functionality saves
some manual steps in Nutch operational procedures. After you have
run a few cycles of fetchlist generation, fetching, DB
updating and analyzing, you end up with several segments, possibly
containing duplicates. You can then run
SegmentMergeTool directly with all options turned on, that is: first
create the master unique index, merge the segments into the output
segment, index it, and then delete the original segment data and
the master index.
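The sequence described above can be sketched as a small, self-contained model. This is not Nutch code: the class SegmentMergePlan and its steps method are hypothetical names used only for illustration, and the mapping of each boolean to a step is an assumption based on the constructor parameter names (createMaster, runIndexer, delSegs, delMaster). It shows only the order in which the optional steps would run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this class is not part of Nutch.
final class SegmentMergePlan {

    // Returns the ordered list of steps implied by the four options.
    // Assumption: each flag enables exactly one optional step, and the
    // merge itself always runs.
    static List<String> steps(boolean createMaster, boolean runIndexer,
                              boolean delSegs, boolean delMaster) {
        List<String> plan = new ArrayList<>();
        if (createMaster) {
            plan.add("create master index");
        }
        plan.add("merge segments");
        if (runIndexer) {
            plan.add("index output segment");
        }
        if (delSegs) {
            plan.add("delete original segments");
        }
        if (delMaster) {
            plan.add("delete master index");
        }
        return plan;
    }

    public static void main(String[] args) {
        // All options turned on, as in the scenario described above.
        for (String step : steps(true, true, true, true)) {
            System.out.println(step);
        }
    }
}
```

With every option off, the model reduces to the merge step alone, which corresponds to the case where the master index already exists and nothing is indexed or deleted afterwards.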
- Author:
- Andrzej Bialecki
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final Logger LOG
SegmentMergeTool
public SegmentMergeTool(String segments,
                        String output,
                        String master,
                        boolean createMaster,
                        boolean runIndexer,
                        boolean delSegs,
                        boolean delMaster)
                 throws Exception
run
public void run()
main
public static void main(String[] args)
                 throws Exception
- Throws:
Exception
Copyright © 2004 The Nutch Organization.