MetaCombine Focused Crawler System - 0.96

Based on Archive.org's Heritrix web crawling system.

By Saurabh Pathak
Technical lead: Aaron Krowne
Contact: akrowne@emory.edu

To perform a focused crawl using this tool, follow the steps below.

1. Train the Rainbow classifier. This must be done before a crawl can be
   started, and is achieved in the two substeps listed below.

   a. Specify as many of the following inputs as you can, as text files:

        keywords.txt        (list of domain-specific keywords)
        pos_categories.txt  (list of positive ODP categories)
        repositories.txt    (list of OAI repositories)

      Apart from the above inputs, also create a configuration file
      containing values for all the configuration parameters. These values
      are used at various stages of focused crawling. Once the inputs are
      specified, run the train_step1.pl script, which takes the config file
      path as input.

   b. The previous substep creates a file, keyword_categories.txt,
      containing a list of ODP categories. Scan through this file and mark
      every category that belongs to the topic of the crawl as 1. Then run
      the train_step2.pl script with the config file path as input. This
      script completes the training phase and sets up the Rainbow classifier
      on the user-specified host and port.

   Once the classifier is running, you can run as many crawls for the given
   topic as you like without having to repeat the steps above.

2. Fine-tune the vocabulary. This step is optional but recommended. It
   fine-tunes the vocabulary that Rainbow created for the topic during
   training, and is carried out in two substeps.

   a. Run the fine_tune1.pl script with the config file path as input.

   b. Scan the file top-words.txt created during the previous substep and
      mark every word that does not belong to the topic as 0. Then run the
      fine_tune2.pl script with the config file path as input.

   For higher accuracy, repeat these two substeps until the top N words all
   belong to the topic of the crawl.

3. Perform the focused crawl.

   Before running this step, copy the folder focusedCrawler, which contains
   all the class files, to $HERITRIX_HOME/bin/. Also create the following
   files in the heritrix_files folder: order.xml and seeds.txt. You can use
   a modified/appended version of the seeds.txt file created during
   training. Also add BowProcessor to the top of the Heritrix extractor
   processor chain, as shown below:

        true
        ....(remaining heritrix extractor chain processors)

   Once you have done the above, run the start_crawler.pl script with the
   config file path as input. This initiates the focused crawl using the
   Heritrix crawler and BowProcessor.
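
The order of the steps above can be summarized as a checklist. The following
is only a sketch in Python, not part of the distributed tool: the script
names are taken from the steps above, but invoking them through `perl` from
the current directory, the helper names workflow_commands and format_steps,
and the config file name crawl.conf are all assumptions made for
illustration. It prints the commands and the manual edits between them
without running anything.

```python
import shlex


def workflow_commands(config_path, fine_tune_rounds=1):
    """Return the workflow as an ordered list of (kind, payload) tuples.

    kind is "run" (payload: argv list) or "manual" (payload: reminder text).
    Assumption: each script is invoked as `perl <script> <config_path>`.
    """
    def run(script):
        return ("run", ["perl", script, config_path])

    # Step 1: training, with a manual edit between the two scripts.
    steps = [
        run("train_step1.pl"),
        ("manual", "mark on-topic categories as 1 in keyword_categories.txt"),
        run("train_step2.pl"),
    ]

    # Step 2 (optional): each round is fine_tune1, manual edit, fine_tune2.
    for _ in range(fine_tune_rounds):
        steps += [
            run("fine_tune1.pl"),
            ("manual", "mark off-topic words as 0 in top-words.txt"),
            run("fine_tune2.pl"),
        ]

    # Step 3: manual preparation, then the crawl itself.
    steps.append((
        "manual",
        "copy focusedCrawler to $HERITRIX_HOME/bin/, create order.xml and "
        "seeds.txt in heritrix_files, add BowProcessor to the extractor chain",
    ))
    steps.append(run("start_crawler.pl"))
    return steps


def format_steps(steps):
    """Render the steps as numbered lines suitable for printing."""
    lines = []
    for number, (kind, payload) in enumerate(steps, start=1):
        text = shlex.join(payload) if kind == "run" else "[manual] " + payload
        lines.append(f"{number}. {text}")
    return lines


if __name__ == "__main__":
    # crawl.conf is a placeholder config file name.
    for line in format_steps(workflow_commands("crawl.conf", fine_tune_rounds=2)):
        print(line)
```

Setting fine_tune_rounds to 0 skips the optional step 2 entirely. Keeping the
manual edits as explicit entries makes it harder to run train_step2.pl or
fine_tune2.pl before the corresponding file has been reviewed.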