java.lang.Object
  net.nutch.indexer.DeleteDuplicates
Deletes duplicate documents in a set of Lucene indexes. Two documents are duplicates if they have either the same content (as determined by MD5 hash) or the same URL.
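The two duplicate criteria above can be illustrated with a minimal, standalone sketch. It does not use Nutch or Lucene and is not this class's implementation: `PageSketch`, `DuplicateSketch`, `md5Hex` and `withoutDuplicates` are hypothetical names, and keeping the first page seen is an assumption, since this page does not say which copy of a duplicate survives.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for an indexed page; not a Nutch or Lucene type. */
final class PageSketch {
    final String url;
    final String content;

    PageSketch(String url, String content) {
        this.url = url;
        this.content = content;
    }
}

final class DuplicateSketch {
    /** Returns the MD5 digest of the text as a 32-character lowercase hex string. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Drops any page whose URL or content hash was already seen on a kept page.
     * Assumption: the first page encountered is the one that is kept.
     */
    static List<PageSketch> withoutDuplicates(List<PageSketch> pages) {
        Set<String> seenUrls = new HashSet<>();
        Set<String> seenHashes = new HashSet<>();
        List<PageSketch> kept = new ArrayList<>();
        for (PageSketch page : pages) {
            String hash = md5Hex(page.content);
            if (seenUrls.contains(page.url) || seenHashes.contains(hash)) {
                continue; // duplicate by URL or by content
            }
            seenUrls.add(page.url);
            seenHashes.add(hash);
            kept.add(page);
        }
        return kept;
    }
}
```

Comparing fixed-size hashes rather than full page text keeps the comparison cheap, at the cost of treating any two pages with colliding hashes as identical.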
| Nested Class Summary | |
| static class | DeleteDuplicates.IndexedDoc: The key used in sorting for duplicates. |
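A sort key makes duplicate detection a single pass: once the keys are sorted, documents with equal keys sit next to each other. The sketch below shows that general technique only. `DocKey`, its `hash` and `docId` fields, and `SortedDuplicateScan` are hypothetical; the actual fields and ordering of `DeleteDuplicates.IndexedDoc` are not described on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sort key; its fields are assumptions, not those of IndexedDoc. */
final class DocKey implements Comparable<DocKey> {
    final String hash; // value that defines a duplicate, e.g. a content hash in hex
    final int docId;   // assumed document number, used only to break ties

    DocKey(String hash, int docId) {
        this.hash = hash;
        this.docId = docId;
    }

    @Override
    public int compareTo(DocKey other) {
        int byHash = hash.compareTo(other.hash);
        return byHash != 0 ? byHash : Integer.compare(docId, other.docId);
    }
}

final class SortedDuplicateScan {
    /**
     * Sorts a copy of the keys, then reports every document whose hash equals
     * that of its predecessor in sorted order. The lowest docId in each group
     * of equal hashes is treated as the original (an assumption).
     */
    static List<Integer> duplicateIds(DocKey[] keys) {
        DocKey[] sorted = keys.clone();
        Arrays.sort(sorted);
        List<Integer> duplicates = new ArrayList<>();
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i].hash.equals(sorted[i - 1].hash)) {
                duplicates.add(sorted[i].docId);
            }
        }
        return duplicates;
    }
}
```

Sorting costs O(n log n) comparisons but needs no in-memory hash table, which suits data sets that are sorted externally.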
| Constructor Summary |
| DeleteDuplicates(IndexReader[] readers, String tempFile): Constructs a duplicate detector for the provided indexes. |
| Method Summary | |
| void | close(): Closes the indexes, saving changes. |
| void | deleteContentDuplicates(): Delete pages with duplicate content hashes. |
| void | deleteUrlDuplicates(): Delete pages with duplicate URLs. |
| static void | main(String[] args): Delete duplicates in the indexes in the named directory. |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
public DeleteDuplicates(IndexReader[] readers,
String tempFile)
| Method Detail |
public void close()
throws IOException
Throws: IOException
public void deleteContentDuplicates()
throws IOException
Throws: IOException
public void deleteUrlDuplicates()
throws IOException
Throws: IOException
public static void main(String[] args)
throws Exception
Throws: Exception