This is metacombine_focusedCrawl_module1.0
2005-06-01
by Saurabh Pathak, Emory University; Donna Bergmark, Cornell University

Summary
=======

This module performs focused crawling and is meant to be used as a
processor for the Heritrix crawler. When used as a focused crawler, the
module should be placed at the top of the Heritrix extractor processor
chain so that, for every URL the Heritrix crawler encounters, the module
decides whether or not to crawl the URL depending on the text content of
the document at that URL. To make this decision, the module talks to a
rainbow server, which is a text classifier running on a port.

For a more detailed description of this module, please read the
Overview.pdf file in the current directory.

Dependencies
============

To run this module for focused crawling, you will need to download the
following:

1. Heritrix crawler (http://crawler.archive.org/)
   - This module has been tested with Heritrix crawler versions up to 1.4.
2. Bow text classifier (http://www-2.cs.cmu.edu/~mccallum/bow/src/)
3. HTML parser jar file (http://htmlparser.sourceforge.net/)
   - Check the website and download the jar file. This module uses
     version 1.5 of the parser.

Instructions on how to run the module
=====================================

The following instructions show how to run the module, using an example
case. This package can be used with Heritrix to make a focused crawler,
using the following steps:

1. Suppose heritrix is located in directory $HERITRIX_HOME/bin/.
   The Metacombine_FocusedCrawler1.0/org/metacombine/crawlmodule/
   directory should contain the .class files BowProcessor.class and
   RainbowClassifier.class, along with the Java source files. Copy the
   Metacombine_FocusedCrawler1.0/org/ folder to the $HERITRIX_HOME/bin/
   directory.

2. Copy the jar files (for the dependencies listed above) to
   $HERITRIX_HOME; alternatively, add them to your $CLASSPATH.
   Now you are all set to use the crawlmodule package.

3. Train the rainbow classifier as described in Section 2 of the
   README.pdf file and set it up as a server. In this example we crawl
   pages related to "skiing", so the training set consists of two
   sub-directories in $DIR: skiing/ (which contains examples of pages
   related to skiing) and Negative/ (which contains examples of pages
   not related to skiing). Suppose we then execute the following:

       /path/to/rainbow -d /path/to/model --index $DIR/*
       /path/to/rainbow -d /path/to/model --query-server=5555

   The first command indexes the training documents into the model; the
   second starts rainbow as a server on port 5555.

4. To use BowProcessor in a crawl job, add the processor to the
   order.xml file. We suggest adding BowProcessor at the top of the
   Heritrix extractor processor chain, with the processor enabled (set
   to true), ahead of the remaining Heritrix extractor chain processors.

5. Create the Bow configuration file and name it Bow Config File. This
   file contains a list of key-value pairs. The following keys are used:
   Topic, CutOff, Host, Port, BowLog_File_Path and
   BowLogContents_File_Path. An example configuration file is shown
   below:

       Topic = skiing                    # specify your topic of crawl
       CutOff = 0.75                     # specify your cutoff score here
       Host = xyz.edu                    # specify the host running rainbow server here
       Port = 5555                       # port number for rainbow server
       BowLog_File_Path = /path/BowLog   # specify path for BowLog file here
       BowLogContents_File_Path = /path/ContentLog

6. Let's start a crawl using an order.xml file that contains
   BowProcessor. Change into metacombine_focusedCrawl_module1.0/ and run

       $HERITRIX_HOME/bin/heritrix -n /path/to/order.xml

   The -n option means no interface. You can instead create the
   order.xml file using the web console provided by Heritrix; in that
   case, start the crawl from the web console itself.

   By default, BowProcessor looks for the configuration file created in
   step 5 in the $HERITRIX_HOME/bin/ directory, but you can specify the
   location of this file with the CONF variable, added to JAVA_OPTS as
   follows:

       export JAVA_OPTS="-DCONF=/your/path/to/configuration/file ..."

The above step should start the crawl using the Heritrix crawler and
BowProcessor.

License
=======

BSD-style. See LICENSE file.

Contact
=======

Saurabh Pathak at spatha2@emory.edu

Permanent Contact
=================

Martin Halbert (mhalber@emory.edu)
Aaron Krowne (akrowne@emory.edu)
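
Example: scripting the configuration steps
==========================================

The configuration file and the JAVA_OPTS setting described in the
instructions can be prepared with a small shell script. What follows is
a minimal sketch under stated assumptions, not part of the module: the
scratch directory WORK_DIR and the file name bow.conf are hypothetical
(the script relies on the CONF setting to tell BowProcessor where the
file is), and the values are the ones from the example configuration,
with the inline comments left out. The script only writes the file and
sets JAVA_OPTS; it prints the Heritrix command rather than running it.

```shell
#!/bin/sh
# Sketch only. WORK_DIR and the file name bow.conf are hypothetical
# placeholders; the keys and values are copied from the README example.

WORK_DIR="${WORK_DIR:-$(mktemp -d)}"
mkdir -p "$WORK_DIR"
CONF_FILE="$WORK_DIR/bow.conf"

# Write the key-value configuration file read by BowProcessor.
cat > "$CONF_FILE" <<EOF
Topic = skiing
CutOff = 0.75
Host = xyz.edu
Port = 5555
BowLog_File_Path = $WORK_DIR/BowLog
BowLogContents_File_Path = $WORK_DIR/ContentLog
EOF

# Point BowProcessor at the file through the CONF property.
JAVA_OPTS="-DCONF=$CONF_FILE"
export JAVA_OPTS

# Print the launch command instead of running it, so the sketch has no
# side effects beyond writing the configuration file.
echo "Configuration written to $CONF_FILE"
echo "JAVA_OPTS=$JAVA_OPTS"
echo "Next: \$HERITRIX_HOME/bin/heritrix -n /path/to/order.xml"
```

Setting WORK_DIR before running the script chooses where the file and
logs go; otherwise a temporary directory is used. Because JAVA_OPTS is
exported only within the script's own process, either source the script
or launch Heritrix from inside it for the setting to take effect.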