java.lang.Object
  net.nutch.parse.html.DOMContentUtils
A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.
| Nested Class Summary | |
static class |
DOMContentUtils.LinkParams
|
| Field Summary | |
static HashMap |
linkParams
|
| Constructor Summary | |
DOMContentUtils()
|
|
| Method Summary | |
static void |
getOutlinks(URL base,
ArrayList outlinks,
Node node)
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink
records for each (relative to the supplied base
URL), and adds them to the outlinks ArrayList. |
static void |
getText(StringBuffer sb,
Node node)
This is a convenience method, equivalent to getText(sb, node, false). |
static boolean |
getText(StringBuffer sb,
Node node,
boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node,
and will append all the content text found beneath the DOM node to
the StringBuffer. |
static boolean |
getTitle(StringBuffer sb,
Node node)
This method takes a StringBuffer and a DOM Node,
and will append the content text found beneath the first
title node to the StringBuffer. |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
public static HashMap linkParams
| Constructor Detail |
public DOMContentUtils()
| Method Detail |
public static final boolean getText(StringBuffer sb,
Node node,
boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node,
and will append all the content text found beneath the DOM node to
the StringBuffer.
If abortOnNestedAnchors is true, DOM traversal will
be aborted and the StringBuffer will not contain
any text encountered after a nested anchor is found.
Currently, only SCRIPT, STYLE and comment text are ignored.
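
The traversal described above can be illustrated with a minimal, self-contained sketch built on the JDK's org.w3c.dom API. It is not the Nutch implementation: the class TextSketch, its parse helper, and the whitespace handling are hypothetical, and the meaning of the boolean result (true when traversal stopped at a nested anchor) is an assumption, since this page does not document the return value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of net.nutch.parse.html.
class TextSketch {

    // Parses a small well-formed XHTML fragment (stand-in for a real HTML parser).
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Appends the text found beneath node to sb, ignoring SCRIPT, STYLE and comments.
    // Assumption: returns true when traversal was aborted at a nested anchor.
    static boolean collectText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) {
        return collectText(sb, node, abortOnNestedAnchors, 0);
    }

    private static boolean collectText(StringBuffer sb, Node node,
                                       boolean abort, int anchorDepth) {
        short type = node.getNodeType();
        if (type == Node.COMMENT_NODE) {
            return false; // comment text is ignored
        }
        if (type == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' '); // separate text runs with a single space
                }
                sb.append(text);
            }
            return false;
        }
        if (type == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return false; // skip the whole subtree
            }
            if (name.equalsIgnoreCase("a")) {
                anchorDepth++;
                if (abort && anchorDepth > 1) {
                    return true; // an anchor inside an anchor: stop here
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (collectText(sb, children.item(i), abort, anchorDepth)) {
                return true; // propagate the abort upward
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        StringBuffer sb = new StringBuffer();
        Document doc = parse("<div>Hello<script>var x=1;</script><!-- c --> <b>world</b></div>");
        collectText(sb, doc, false);
        System.out.println(sb);
    }
}
```

In this sketch, text that precedes a nested anchor is kept in the buffer, and everything from the nested anchor onward is dropped, which matches the behaviour described for abortOnNestedAnchors.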
public static final void getText(StringBuffer sb,
Node node)
This is a convenience method, equivalent to getText(sb, node, false).
public static final boolean getTitle(StringBuffer sb,
Node node)
This method takes a StringBuffer and a DOM Node,
and will append the content text found beneath the first
title node to the StringBuffer.
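
As a rough illustration of the title lookup, the sketch below searches a DOM tree depth-first and appends the text of the first title element it meets. The class TitleSketch and its helpers are hypothetical, and it is an assumption that the boolean result signals whether a title node was found; this page does not say what getTitle returns.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of net.nutch.parse.html.
class TitleSketch {

    // Parses a small well-formed XHTML document (stand-in for a real HTML parser).
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Appends the text of the first title element (in document order) to sb.
    // Assumption: returns true if a title element was found, false otherwise.
    static boolean collectTitle(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase("title")) {
            sb.append(node.getTextContent().trim());
            return true;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (collectTitle(sb, children.item(i))) {
                return true; // stop at the first title; later ones are ignored
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        StringBuffer sb = new StringBuffer();
        Document doc = parse("<html><head><title> First </title></head><body/></html>");
        boolean found = collectTitle(sb, doc);
        System.out.println(found + " " + sb);
    }
}
```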
public static final void getOutlinks(URL base,
ArrayList outlinks,
Node node)
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink
records for each (relative to the supplied base
URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links that contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with NekoHTML).
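
The link collection described above might look roughly like the following sketch. It uses a hypothetical Link holder in place of Nutch's Outlink class, resolves each href against the base URL through java.net.URI, and implements only the first discard rule (anchors with no inner structure). The rule about single nested links and empty text nodes is not reproduced here.

```java
import java.io.StringReader;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of net.nutch.parse.html.
class OutlinkSketch {

    // Hypothetical stand-in for Nutch's Outlink: a resolved URL plus its anchor text.
    static final class Link {
        final String url;
        final String anchor;

        Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    // Parses a small well-formed XHTML fragment (stand-in for a real HTML parser).
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Finds every anchor below node, resolves its href against base, and adds it to out.
    static void collectLinks(URL base, List<Link> out, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase("a")) {
            String href = ((Element) node).getAttribute("href");
            // Discard anchors without an href or without any inner structure.
            if (!href.isEmpty() && node.hasChildNodes()) {
                try {
                    String resolved = base.toURI().resolve(href).toString();
                    out.add(new Link(resolved, node.getTextContent().trim()));
                } catch (URISyntaxException | IllegalArgumentException e) {
                    // Malformed base or href: skip this link in the sketch.
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectLinks(base, out, children.item(i));
        }
    }

    public static void main(String[] args) throws Exception {
        URL base = java.net.URI.create("http://example.com/a/b.html").toURL();
        List<Link> links = new ArrayList<>();
        collectLinks(base, links,
                parse("<div><a href=\"d.html\">Doc</a><a href=\"empty.html\"></a></div>"));
        for (Link link : links) {
            System.out.println(link.url + " [" + link.anchor + "]");
        }
    }
}
```

Passing the output list in as a parameter mirrors the documented signature, where the caller owns the outlinks list and the method only appends to it.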