   Apache Nutch
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene™, the project has diversified and now comprises two codebases, namely:

Nutch 1.x: A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.
Nutch 2.x: An emerging alternative that takes direct inspiration from 1.x but differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora™ to handle object-to-persistent mappings. This makes it possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions.
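The storage abstraction described for Nutch 2.x can be pictured with a small Java sketch. This is only an illustration of the general idea and does not use Gora's API: the names PageRecord, PageStore, InMemoryPageStore and recordFetch are hypothetical, and the in-memory map merely stands in for a NoSQL backend.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class StorageSketch {

    /** Hypothetical record holding a few of the per-page fields listed above. */
    static final class PageRecord {
        final String url;
        final long fetchTime;
        final String status;
        final String content;

        PageRecord(String url, long fetchTime, String status, String content) {
            this.url = url;
            this.fetchTime = fetchTime;
            this.status = status;
            this.content = content;
        }
    }

    /** Hypothetical storage interface (an assumption, not Gora's API). */
    interface PageStore {
        void put(PageRecord record);

        Optional<PageRecord> get(String url);
    }

    /** Stand-in backend; a real deployment would target an actual data store. */
    static final class InMemoryPageStore implements PageStore {
        private final Map<String, PageRecord> records = new HashMap<>();

        @Override
        public void put(PageRecord record) {
            records.put(record.url, record);
        }

        @Override
        public Optional<PageRecord> get(String url) {
            return Optional.ofNullable(records.get(url));
        }
    }

    /** Crawler-side code depends only on the interface, never on the backend. */
    static void recordFetch(PageStore store, String url, long fetchTime, String content) {
        store.put(new PageRecord(url, fetchTime, "fetched", content));
    }

    public static void main(String[] args) {
        PageStore store = new InMemoryPageStore();
        recordFetch(store, "http://example.com/", 1000L, "<html></html>");
        System.out.println(store.get("http://example.com/").map(r -> r.status).orElse("missing"));
    }
}
```

Because recordFetch only sees the PageStore interface, swapping the backend means supplying a different implementation without touching the crawler-side code.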
Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika™ for parsing. Additionally, pluggable indexing exists for Apache Solr™, Elasticsearch, etc.
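As a rough illustration of what a pluggable extension point looks like, the following Java sketch registers interchangeable parsers by content type. It is a minimal sketch under stated assumptions and not Nutch's plugin API: ContentParser and ParserRegistry are hypothetical names, and the regex-based tag stripper is a naive placeholder for a real parser such as Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PluginSketch {

    /** Hypothetical extension point (an assumption, not a Nutch interface). */
    interface ContentParser {
        String parse(String rawContent);
    }

    /** Maps a content type to the parser implementation registered for it. */
    static final class ParserRegistry {
        private final Map<String, ContentParser> parsers = new LinkedHashMap<>();

        void register(String contentType, ContentParser parser) {
            parsers.put(contentType, parser);
        }

        String parse(String contentType, String rawContent) {
            ContentParser parser = parsers.get(contentType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser registered for " + contentType);
            }
            return parser.parse(rawContent);
        }
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // Naive tag stripper standing in for a real HTML parser.
        registry.register("text/html", raw -> raw.replaceAll("<[^>]*>", "").trim());
        registry.register("text/plain", String::trim);
        System.out.println(registry.parse("text/html", "<p>Hello, crawler</p>"));
    }
}
```

The calling code never names a concrete parser, so a custom implementation can be added by registering it under the relevant content type.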
   Version 2.2.1
Source apache-nutch-2.2.1-src.tar.gz
API /library/282/apache-nutch-2.2.1/docs/api/ Package 58, Class 290, Method 1,503

