net.nutch.indexer
Class DeleteDuplicates

java.lang.Object
  extended by net.nutch.indexer.DeleteDuplicates

public class DeleteDuplicates
extends Object

Deletes duplicate documents in a set of Lucene indexes. Duplicates have either the same contents (via MD5 hash) or the same URL.
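
As an illustration of content matching via MD5, the following minimal sketch hashes page text with the standard java.security.MessageDigest class. The class name Md5ContentHashSketch and the method md5Hex are hypothetical, and hashing the raw UTF-8 bytes of the text is an assumption: this page does not say which bytes DeleteDuplicates actually digests.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; not part of net.nutch.indexer.
public class Md5ContentHashSketch {

    /** Returns the MD5 digest of the text's UTF-8 bytes as 32 lowercase hex digits. */
    public static String md5Hex(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String pageA = "Same body text";
        String pageB = "Same body text";
        String pageC = "Different body text";
        // Identical contents give identical hashes, so pageA and pageB count as duplicates.
        System.out.println(md5Hex(pageA).equals(md5Hex(pageB)));
        // Different contents are expected to give different hashes.
        System.out.println(md5Hex(pageA).equals(md5Hex(pageC)));
    }
}
```

Comparing fixed-length hashes instead of full page bodies keeps the comparison cheap, at the cost of treating only byte-identical contents as duplicates.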


Nested Class Summary
static class DeleteDuplicates.IndexedDoc
          The key used in sorting for duplicates.
 
Constructor Summary
DeleteDuplicates(IndexReader[] readers, String tempFile)
          Constructs a duplicate detector for the provided indexes.
 
Method Summary
 void close()
          Closes the indexes, saving changes.
 void deleteContentDuplicates()
          Delete pages with duplicate content hashes.
 void deleteUrlDuplicates()
          Delete pages with duplicate URLs.
static void main(String[] args)
          Delete duplicates in the indexes in the named directory.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DeleteDuplicates

public DeleteDuplicates(IndexReader[] readers,
                        String tempFile)
Constructs a duplicate detector for the provided indexes.

Method Detail

close

public void close()
           throws IOException
Closes the indexes, saving changes.

Throws:
IOException

deleteContentDuplicates

public void deleteContentDuplicates()
                             throws IOException
Delete pages with duplicate content hashes. Of those with the same content hash, keep the page with the highest score.
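
The selection rule above can be sketched in isolation as follows. This is not the Nutch implementation: it groups pages in an in-memory map rather than working against Lucene indexes, and the names ContentDedupSketch, Page and keepHighestScore are hypothetical. Keeping the first page seen when scores tie is also an assumption, since this page does not describe tie-breaking.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of net.nutch.indexer.
public class ContentDedupSketch {

    /** Stand-in for an indexed page; not the DeleteDuplicates.IndexedDoc class. */
    public static final class Page {
        final String url;
        final String contentHash;
        final float score;

        public Page(String url, String contentHash, float score) {
            this.url = url;
            this.contentHash = contentHash;
            this.score = score;
        }
    }

    /** For each content hash, keeps only the page with the highest score. */
    public static List<Page> keepHighestScore(List<Page> pages) {
        Map<String, Page> best = new LinkedHashMap<>();
        for (Page p : pages) {
            Page current = best.get(p.contentHash);
            // Strict comparison: on a tie the earlier page is kept (an assumption).
            if (current == null || p.score > current.score) {
                best.put(p.contentHash, p);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        pages.add(new Page("http://a.example/", "h1", 1.0f));
        pages.add(new Page("http://b.example/", "h1", 2.5f));
        pages.add(new Page("http://c.example/", "h2", 0.7f));
        // The two "h1" pages are duplicates; only the higher-scoring one survives.
        for (Page p : keepHighestScore(pages)) {
            System.out.println(p.url + " score=" + p.score);
        }
    }
}
```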

Throws:
IOException

deleteUrlDuplicates

public void deleteUrlDuplicates()
                         throws IOException
Delete pages with duplicate URLs. Of those with the same URL, keep the most recently fetched page.
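
A minimal sort-based sketch of this rule is shown below. It is an illustration only and does not touch Lucene indexes; the names UrlDedupSketch, Page, docId, fetchTime and findStaleDocIds are hypothetical, and representing the fetch time as a long timestamp is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of net.nutch.indexer.
public class UrlDedupSketch {

    /** Stand-in for an indexed page; not the DeleteDuplicates.IndexedDoc class. */
    public static final class Page {
        final int docId;
        final String url;
        final long fetchTime;

        public Page(int docId, String url, long fetchTime) {
            this.docId = docId;
            this.url = url;
            this.fetchTime = fetchTime;
        }
    }

    /** Returns the ids of every page that is not the most recently fetched one for its URL. */
    public static List<Integer> findStaleDocIds(List<Page> pages) {
        List<Page> sorted = new ArrayList<>(pages);
        // Sort by URL, then newest fetch first, so each run of equal URLs starts with the keeper.
        sorted.sort((a, b) -> {
            int byUrl = a.url.compareTo(b.url);
            return byUrl != 0 ? byUrl : Long.compare(b.fetchTime, a.fetchTime);
        });
        List<Integer> stale = new ArrayList<>();
        String previousUrl = null;
        for (Page p : sorted) {
            if (p.url.equals(previousUrl)) {
                stale.add(p.docId); // an older copy of a URL already seen
            } else {
                previousUrl = p.url; // first (newest) page for this URL is kept
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        pages.add(new Page(0, "http://a.example/", 100L));
        pages.add(new Page(1, "http://a.example/", 200L));
        pages.add(new Page(2, "http://b.example/", 50L));
        // Page 0 is an older fetch of the same URL as page 1, so it is reported as stale.
        System.out.println(findStaleDocIds(pages));
    }
}
```

Sorting first means duplicates end up adjacent, so a single pass over the sorted list is enough to pick the page to keep for each URL.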

Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Delete duplicates in the indexes in the named directory.

Throws:
Exception


Copyright © 2004 The Nutch Organization.