Vectorize 1.0
2004-06-23
Aaron Krowne
Emory University

Overview
========

This humble little program transforms document wordlists into sparse,
normalized numerical feature vectors. It automatically performs stemming and
stopping on the words encountered.

The feature vector format used for the output is understood by the MetaCombine
NMF clustering program. However, one might want to use this program to create
document feature vectors for other IR purposes.

Installation
============

1. Uncompress this archive and put it somewhere (we'll call this directory
   $VECTORIZE).

2. Install the MetaCombine common library in $VECTORIZE/../common. If you put
   it somewhere else, modify vectorize's Makefile accordingly. See

   http://metacluster.library.emory.edu/~akrowne/metacombine_software/

   for the latest version.

3. From $VECTORIZE, type "make".

Execution
=========

Run like:

    ./vectorize basename

Input is expected in the file [basename].raw.

Input
=====

[basename].raw - (required) The raw word lists. One list of space-separated
    words per line. Each line corresponds to one document, so make sure to
    strip newlines.

[basename].sw - (optional) A list of supplementary stopwords, one per line.
    These words will be eliminated in the output vectors. Useful if your
    content has domain-specific stopwords.

[basename].dic - (optional) A map of words to feature IDs. If you provide
    this, you can force the numerical feature vectors to use arbitrary
    indices. Otherwise, indices will be assigned first-come-first-serve, and
    this file will be created from scratch as output.

Output
======

For output you get:

[basename].vec - (created) The numerical feature vectors, one per line. Each
    line corresponds to a line in [basename].raw. The index of this line will
    be the same, unless there were empty documents earlier, in which case
    there will be a shift (see the description of [basename].mt).

[basename].dic - (created or updated) "Dictionary" mapping words to feature
    IDs.
    NOTE: if you have changed or created a [basename].stm file, you will get
    broken feature vectors unless you delete the [basename].dic file before
    you run vectorize again.

[basename].stm - (created) A transformation of the [basename].raw file which
    is stemmed and stopped.

[basename].mt - (created) Since stemming and stopping may result in empty
    document vectors, this file lists the indices (line numbers) of any
    documents which became empty through this transformation. This allows one
    to accurately compute the index in [basename].vec and [basename].stm that
    corresponds to an index in [basename].raw.

    NOTE: this file is extremely important... things may break mysteriously
    if you do not incorporate this information in systems that utilize the
    output of vectorize.

License
=======

BSD. See the LICENSE file.

Contact
=======

For help, comments, or enhancements, contact me at akrowne@emory.edu.

Good luck!

Aaron Krowne
Emory University
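
Example: applying [basename].mt
===============================

The index shift described under [basename].mt can be sketched in a few lines
of Python. This is only an illustration, not part of vectorize. It assumes
that [basename].mt holds one integer per line and that those integers use the
same numbering base as the index you pass in; neither detail is spelled out
above, so check your own output first. The function names are made up for
this example.

```python
from bisect import bisect_left


def read_empty_indices(path):
    """Read the indices of emptied documents from a [basename].mt file.

    ASSUMPTION: one integer per line (not confirmed by the README).
    """
    with open(path) as f:
        return [int(line) for line in f if line.strip()]


def raw_to_vec_index(raw_index, empty_indices):
    """Map a line index in [basename].raw to the matching line index in
    [basename].vec and [basename].stm.

    Returns None if the document itself became empty and so has no vector.
    Works for 0-based or 1-based numbering, as long as raw_index and
    empty_indices use the same base.
    """
    empties = sorted(set(empty_indices))
    # Number of emptied documents that come strictly before raw_index.
    pos = bisect_left(empties, raw_index)
    if pos < len(empties) and empties[pos] == raw_index:
        return None
    # Every earlier empty document shifts this one up by one line.
    return raw_index - pos
```

For example, if documents 1 and 3 became empty, document 4 in [basename].raw
is found at index 2 in [basename].vec, while documents 1 and 3 have no vector
at all.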