MetaCombine NMF Document Clustering System version .80 2005-03-18 Aaron Krowne Emory University Credits ======= Aaron Krowne - Lead developer. Steve Ingram - Research developer. Overview ======== This software makes up a system for "document" clustering using the Non-negative Matrix Factorization (NMF) method. In actuality, what is being clustered does not have to be text documents; any problem with an extremely high-dimensionality can make use of this system. However, it is expected that each dimension will have some sort of label (this requirement can be avoided by skipping some of the outputs, but you'll have to hack the code for this). Clustering is a "preclassificatory" task-- it is used to discover latent associations in the data, separating the data points out into "bins" which can be interpreted as clusters or classes. For text documents, this can be thought of as discovering classes based on the topics within the corpus. NMF was chosen as the clustering algorithm after a little research. NMF is fairly new; half a decade old as of this writing. According to [Xu], it has been found to outperform LSI (SVD)-based methods, as well as graph-based methods. It is at least as accurate as all of the above, yet is conceptually simpler. NMF can be used for many types of latent semantic processing, not just clustering (most have to do with finding a smaller, sparse basis in a pseudo-feature space). As far as I know, there are no freely-available implementations of NMF out there. One could always implement it fairly easily in Matlab or some other CAS. However, this implementation of NMF exists for two reasons: 1. To produce a freely-available NMF clustering system that can become part of open source projects. 2. To produce a fast and efficient NMF clustering implementation (hence SparseLib++ and the use of C++). This clustering system is part of the MetaCombine project, which seeks to more meaningfully bring together digital library resources, helping to build more coherent services on top of them. See http://www.metacombine.org/ for more on MetaCombine. References ========== The key implementation reference for this paper was: [Xu] Xu, Wei, et al. "Document Clustering Based on Non-Negative Matrix Factorization." SIGIR 2003, pp. 267-273. A nice, short introduction to NMF, with various applications, is: [Lee] D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401, October 1999. This paper was also the inspiration for the html "visualization" the software outputs for each clustering, to help illustrate which terms are most important for each cluster. Running ======= Your run the system like: ./cluster basename k_lower k_upper - "basename" is the naming scheme used for the input files; extensions for each file will be appended (input files described below). - k_lower is the lower number of clusters you want to produce a clustering for. - k_upper is the upper number of clusters. k_lower may equal k_upper if you know exactly how many clusters you want. Input ----- Input files are of the form: [basename].vec - Data point/document vectors. The format of this is one per line, as (feature_index, value) pairs. feature_index should start at zero, and value should be in the range [0, 1] (i.e., data points *must* be normalized). Here is a sample line from [basename].vec: (2,0.208514) (3,0.208514) (4,0.208514) (5,0.208514) (22,0.208514) This line describes a vector with nonzero values at indices 2, 3, 4, 5, and 22. [basename].dic - Dictionary file. Gives human-readable names for each feature/dimension. The format of this file is one entry per line, word space numerical value. Example: seawater 12797 Note: the feature label can't have any spaces in it! [basename].md.description - (optional) - A list of description text for each input record. One per line. Line consists of a numeric (the record index, based on position in .vec file), a space, and then text. Used in creating human-readable portions of the output. [basename].md.identifier - Human readable identifier for each input record. Works the same as previous. [basename].md.title - (optional) - Human readable title for each input record. Works the same as previous. Outputs ------- A variety of output files are created by the clustering system. They are all named based on the input [basename]: [basename].clusters.K.xml - XML cluster descriptor output for number of clusters K. [basename].clusters.K.html - HTML clustering report for number of clusters K. This file is meant to give some idea what the clusters mean for the clustering. [basename].class.K.txt - A "classification" based on the clustering with K clusters. This file maps cluster IDs (zero-based numeric indices) to the human-readable identifiers for each record. There is one mapping per line, of the form number, space, identifier text. *** Also, hierarchy is represented as subdirectory structure in the output directory. Thus, the same outputs exist in many subdirectories in a layout that corresponds to the logical hierarchy. *** Installation ============ 1. Install the MetaCombine common library. Relative to this NMF software, it will be expected to be in ../common/. You can easily change this in the makefile, however. This is available from: http://metacluster.library.emory.edu/~akrowne/metacombine_software/metacombine-common-1.0.tar.gz 2. Install SparseLib++. This is expected to go in a ../sparselib_1_5d/ relative to the NMF clustering program. Apply the patch below, then compile according to the README that comes with SparseLib. The patch: http://metacluster.library.emory.edu/~akrowne/metacombine_software/apk-metacombine-sparselib-1.5d-3.patch Our mirror of SparseLib 1.5d is at: http://metacluster.library.emory.edu/~akrowne/metacombine_software/sparselib_1_5d.zip 3. Set up the NMF clustering system. a) untar this archive and put it where you want. b) make sure the MetaCombine common lib is in ../common/ (and built) c) make sure SparseLib++ is in ../sparselib_1_5d/ (and built) d) check to see if there's anything in the Makefile that needs changing for your system. e) type "make". License ======= BSD. See included "LICENSE" file. TO-DO list ========== - Make the classifier mode work with hierarchies. - Implement more types of hierarchical clustering: - graph based - MUP-based (can we port over Steve's Java implementation?) - Improve hierarchical clustering in general. - Do automatic optimal-number-of-clusters guessing. (You can see the code is littered with some of my attempts at this already). This is an open problem, so I don't expect any solution soon (or perhaps ever). Help on this? Changes ======= .75 - .80 : - Added contraction (a post-processing step whereby similar clusters are combined). - Added unclassification (thresholding whereby no attempt is made to classify some records). - Added "classifier" mode, whereby a clustering is done as a training step, and then the clustering system can subsequently be run on another set of records using the built "model" (which is really just the matrices from the original factorizations). .5 - .75 : - Added nice command switches for a variety of options, changed how input was specified. - Added adaptive hierarchical clustering. - Added multiclassification, changed format of .class output files. Contact ======= I can be contacted at akrowne@emory.edu. I especially like to get patches for fixes and/or enhancements, but will be glad to help with running and installation or just hear suggestions! Good luck! Aaron Krowne Emory University General Libraries [The MetaCombine Project]