DeleteDuplicates (Nutch 0.5 API)

net.nutch.indexer
Class DeleteDuplicates

java.lang.Object
  net.nutch.indexer.DeleteDuplicates

public class DeleteDuplicates
extends Object

Deletes duplicate documents in a set of Lucene indexes. Two documents are duplicates if they have either the same content (compared by MD5 hash) or the same URL.
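As a rough illustration of comparing content by MD5 hash, the sketch below hashes page text with the standard `java.security.MessageDigest` API. The class name `ContentHashSketch` and its methods are hypothetical and not part of the Nutch API; how Nutch itself computes and stores the hash is not shown on this page.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: shows hash-based content comparison.
class ContentHashSketch {

    /** Returns the MD5 digest of the text as 32 lowercase hex characters. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Two pages count as content duplicates when their hashes match. */
    static boolean sameContent(String a, String b) {
        return md5Hex(a).equals(md5Hex(b));
    }
}
```

Comparing fixed-size hashes rather than full page bodies keeps the duplicate check cheap, at the cost of a very small chance of treating different pages as equal.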

Nested Class Summary
`static class`	`DeleteDuplicates.IndexedDoc` The key used in sorting for duplicates.

Constructor Summary
`DeleteDuplicates(IndexReader[] readers, String tempFile)` Constructs a duplicate detector for the provided indexes.

Method Summary
`void`	`close()` Closes the indexes, saving changes.
`void`	`deleteContentDuplicates()` Deletes pages with duplicate content hashes.
`void`	`deleteUrlDuplicates()` Deletes pages with duplicate URLs.
`static void`	`main(String[] args)` Deletes duplicates in the indexes in the named directory.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

DeleteDuplicates

public DeleteDuplicates(IndexReader[] readers,
                        String tempFile)

Constructs a duplicate detector for the provided indexes.

Method Detail

close

public void close()
           throws IOException

Closes the indexes, saving changes.

Throws: IOException

deleteContentDuplicates

public void deleteContentDuplicates()
                             throws IOException

Deletes pages with duplicate content hashes. Of those with the same content hash, the page with the highest score is kept.

Throws: IOException
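The "keep the highest score" rule can be sketched in memory as follows. This is a simplified illustration, not the class's actual implementation: the `Page` type, its fields, and `findContentDuplicates` are hypothetical, and keeping the first-seen page on a score tie is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Nutch: picks which pages to delete by content hash.
class ContentDedupSketch {

    /** Minimal stand-in for an indexed page (all fields are assumptions). */
    static final class Page {
        final int docId;
        final String contentHash;
        final float score;

        Page(int docId, String contentHash, float score) {
            this.docId = docId;
            this.contentHash = contentHash;
            this.score = score;
        }
    }

    /** Returns the ids of all pages except the highest-scoring one per content hash. */
    static List<Integer> findContentDuplicates(List<Page> pages) {
        Map<String, Page> best = new HashMap<>();
        List<Integer> toDelete = new ArrayList<>();
        for (Page p : pages) {
            Page kept = best.get(p.contentHash);
            if (kept == null) {
                best.put(p.contentHash, p);      // first page seen with this hash
            } else if (p.score > kept.score) {
                toDelete.add(kept.docId);        // new page wins, old one is deleted
                best.put(p.contentHash, p);
            } else {
                toDelete.add(p.docId);           // existing page wins (also on ties)
            }
        }
        return toDelete;
    }
}
```

For pages `(0, "h1", 1.0)`, `(1, "h1", 2.5)`, `(2, "h2", 0.3)`, and `(3, "h1", 0.7)`, the sketch marks documents 0 and 3 for deletion and keeps 1 and 2.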

deleteUrlDuplicates

public void deleteUrlDuplicates()
                         throws IOException

Deletes pages with duplicate URLs. Of those with the same URL, the most recently fetched page is kept.

Throws: IOException
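One way to picture the "keep the most recently fetched page" rule is a sort followed by a single scan, loosely in the spirit of a sort key such as `DeleteDuplicates.IndexedDoc`. The sketch below is an assumption-laden illustration: the `Page` type, the `fetchTime` field, and `findUrlDuplicates` are hypothetical, and the real class may order and store its keys differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Nutch: picks which pages to delete by URL.
class UrlDedupSketch {

    /** Minimal stand-in for an indexed page (all fields are assumptions). */
    static final class Page {
        final int docId;
        final String url;
        final long fetchTime; // e.g. milliseconds since the epoch

        Page(int docId, String url, long fetchTime) {
            this.docId = docId;
            this.url = url;
            this.fetchTime = fetchTime;
        }
    }

    /** Returns the ids of all pages except the most recently fetched one per URL. */
    static List<Integer> findUrlDuplicates(List<Page> pages) {
        List<Page> sorted = new ArrayList<>(pages);
        // Group equal URLs together, newest fetch first within each group.
        sorted.sort((a, b) -> {
            int byUrl = a.url.compareTo(b.url);
            return byUrl != 0 ? byUrl : Long.compare(b.fetchTime, a.fetchTime);
        });

        List<Integer> toDelete = new ArrayList<>();
        String previousUrl = null;
        for (Page p : sorted) {
            if (p.url.equals(previousUrl)) {
                toDelete.add(p.docId);   // an older copy of a URL already kept
            } else {
                previousUrl = p.url;     // first (newest) page for this URL is kept
            }
        }
        return toDelete;
    }
}
```

Sorting first means duplicates always sit next to each other, so the scan needs to remember only the previous URL rather than every URL seen so far.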

main

public static void main(String[] args)
                 throws Exception

Deletes duplicates in the indexes in the named directory.

Throws: Exception