This is MetaCombine-PhraseFinder 0.5.

2004-11-18

by Aaron Krowne
The MetaCombine Project
Emory University Woodruff Library

Introduction
============

This is actually a pair of tools (phrasefinder and phrasemaker) meant as
a preprocessor to typical bag-of-words IR or machine learning systems.

The basic idea is to input a corpus and re-output it with important
phrases separated out as single, atomic features. This is done by
connecting the appropriate words with underscores ('_'), which text
parsers typically leave intact.

For instance, in a corpus on the American South, one might like to
recognize "civil war" as a single feature, distinct from "civil rights",
disambiguating many occurrences of "civil". These would appear as
"civil_war" and "civil_rights" respectively in the output text. One might
also like person names to become single features, clearing up
ambiguities between common first and last names.

The end result is an increase in the clarity and accuracy of the
downstream system's output.

Why
===

Phrasefinder is a substitute for noun-phrase parsing systems. There are
a number of reasons to create this replacement:

1. We don't particularly care whether the common phrases are noun
   phrases.
2. Freely-available noun-phrase parsing systems perform poorly.
3. High-quality noun-phrase parsing systems are IP-encumbered.

How
===

Phrasefinder works by taking the input text and building what I'm
calling an "ordinal index" of the words in it. This is a chained-hash
index structure that keeps track of word-instances and their previous
and following words. This index is then analyzed to determine which
phrases have high enough support and confidence to be separated out.
Note that these criteria are purely data mining concepts; there is no
linguistics involved whatsoever.

The phrasefinder program just outputs a list of candidate phrases. One
can direct this list to a file, and edit it if desired.
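
To make the support/confidence idea concrete, here is a minimal Python
sketch. It is not the phrasefinder source and does not reproduce the
ordinal index; it only counts adjacent word pairs. The function name, the
thresholds, and the definitions used (support = number of times the pair
occurs; confidence = pair count divided by the count of the first word)
are assumptions borrowed from ordinary association-rule mining. Like the
current program, it only looks forward from each word.

```python
from collections import Counter


def find_bigram_phrases(tokens, min_support=2, min_confidence=0.6):
    """Return (first, second, support, confidence) for adjacent word pairs.

    Hypothetical helper for illustration only. Assumed definitions:
      support    = occurrences of the pair (first, second)
      confidence = support / occurrences of `first`
    """
    word_counts = Counter(tokens)
    # Count each word together with the word that follows it (forward only).
    pair_counts = Counter(zip(tokens, tokens[1:]))

    candidates = []
    for (first, second), support in pair_counts.items():
        confidence = support / word_counts[first]
        if support >= min_support and confidence >= min_confidence:
            candidates.append((first, second, support, confidence))

    # Strongest candidates first; ties broken alphabetically.
    candidates.sort(key=lambda c: (-c[2], -c[3], c[0], c[1]))
    return candidates


# Toy corpus: "civil war" occurs twice, "civil rights" once.
tokens = ("civil war history then civil war letters "
          "plus civil rights marches").split()

candidates = find_bigram_phrases(tokens)
for first, second, support, confidence in candidates:
    # Candidate phrases are written with underscores, as in the Introduction.
    print(f"{first}_{second} support={support} confidence={confidence:.2f}")
```

With these assumed thresholds only "civil war" qualifies in the toy
corpus; "civil rights" occurs once and falls below the minimum support.
Real corpora are far larger, so sensible thresholds would differ.
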
The phrasemaker program takes the list of final phrases and the original
corpus file, replaces the appropriate words in the text with phrases
from the list, and writes the result to stdout.

Installation
============

1. Install MetaCombine-Common, from
   http://metacluster.library.emory.edu/~akrowne/metacombine_software/,
   placing it in ../common relative to where you plan to install this
   program.
2. Untar this archive (clearly you have already).
3. Run "make".

Now you've got your stand-alone executables. You can put these wherever
you want.

TODO
====

The most glaring TODO item is rather fundamental: the scanning of the
text from each word currently proceeds only forward. It also needs to
proceed in reverse. Because this is not yet done, there is a
"telescoping" problem, whereby common suffixes of important phrases
appear as separate phrases, as they are not counted with their prefix
parts. Currently one has to manually remove these telescoping suffixes
from the candidate phrases output by phrasefinder.

When this is fixed, the program will be 1.0.

License
=======

BSD-style. See LICENSE file.

Contact
=======

R&D lead Aaron Krowne at akrowne@emory.edu.
Project director Martin Halbert at mhalber@emory.edu.