MetaCombine NMF Document Clustering System Web Service
version .10

2005-04-20

Urvashi Gadi
Emory University

Credits
=======

Urvashi Gadi - Lead developer.
Aaron Krowne - Project Manager.

Overview
========

This software provides the server end of a web service interface to the MetaCombine NMF Clustering System Version .80 (a system for "document" clustering using the Non-negative Matrix Factorization (NMF) method).

Clustering is a "preclassificatory" task: it is used to discover latent associations in the data, separating the data points out into "bins" which can be interpreted as clusters or classes. For text documents, this can be thought of as discovering classes based on the topics within the corpus.

This web service clustering system is part of the MetaCombine project, which seeks to bring digital library resources together more meaningfully, helping to build more coherent services on top of them.

The purpose of this web service clustering system is to allow third parties to use advanced clustering techniques on their data without needing to be familiar with the details of the clustering system, go through the complex installation process, or supply the computational hardware.

See http://www.metacombine.org/ for more on MetaCombine.

Also see http://www.ockham.org/ for information on the OCKHAM project, which is building a p2p network of library services (like this clustering service).

Note that clients of this service need not be written in PHP, as the example client is; that client is merely a functional demonstration. It is easy to write web service clients in many other languages as well.

Running
=======

To access the Metacluster web service you will need a client. The client can be written in any language, use any component model, and run on any operating system.
The server expects its input in the following sequence:

    BASEURL FIELDS HMODE CPARAMS

All four fields are mandatory.

    BASEURL : Data repository URL
    FIELDS  : Clustering fields (title,subject,desc), separated by commas with no spaces
    HMODE   : Hierarchical/flat clustering mode [f/h]
                f : flat mode
                h : hierarchical mode
    CPARAMS : Semantic clustering parameters
                flat mode         : [ OPTS ] < LOWER UPPER TOTAL >
                hierarchical mode : [ OPTS ] < TOTAL | BRANCH LIMIT >

OPTS are optional flags, which consist of:

    -r             perform contraction on first-cut clusters
    -m [ FRAC ]    select multiclassification up to FRAC of highest score
    -d             hierarchical clustering max-depth (0,1,2,... def. 5)
    -l             multiclassification limit on # of clusters per record (def. # clusters)
    -u < THRESH >  set the threshold for unclassification

For more information, including the input data types, open the WSDL in any browser:

    http://metascholar3.library.emory.edu/server/cluster-server.php?wsdl

Installation
============

1. Set up the NMF clustering system.

   Archive available at:
   http://metacluster.library.emory.edu/~akrowne/metacombine_software/metacombine-common-1.0.1.tar.gz

   a) Untar the archive and put it where you want.
   b) Make sure the MetaCombine common lib is in ../common/ (and built). Refer to the README for more instructions and the location of the common library.
   c) Make sure SparseLib++ is in ../sparselib_1_5d/ (and built). Refer to the README for more instructions and the location of the library.
   d) Check whether anything in the Makefile needs changing for your system.
   e) Type "make".

2. Set up Net::OAI::Harvester.

   Archive available at:
   http://search.cpan.org/~esummers/OAI-Harvester-0.99/

   a) Untar this archive.
   b) Install OAI-Harvester-0.99 in the same (server) directory.

3. Untar vectorizer-1.0 and put it anywhere you want.

   Archive available at:
   http://metacluster.library.emory.edu/~akrowne/metacombine_software/vectorizer-1.0.tar.gz

4. Update the PATH variable in metatest.pl to include the NMF clustering system path and the vectorizer path.

5.
Download, unzip, and install PEAR::SOAP.

   The PEAR::SOAP package depends on the following packages: Mail_Mime, Net_URL, HTTP_Request, and Net_DIME. Depending on which of these are already installed, install the remaining ones.

6. Change the endpoint location of the server in cluster-server.wsdl to match the location of the server code.

The Semantic Clustering web service is then ready to be consumed by a client.

License
=======

BSD. See included "LICENSE" file.
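
Example
=======

The input sequence described under "Running" (BASEURL, FIELDS, HMODE, CPARAMS) can be checked on the client side before it is sent. The following is only a minimal sketch in Python, not part of the distributed software. The helper names (join_fields, flat_cparams, build_inputs), the repository URL, and the numeric values are hypothetical. It assumes that the angle brackets in the usage text are notation rather than literal characters, and it does not call the service itself, since the operation names are defined in the WSDL.

```python
from urllib.parse import urlparse


def join_fields(fields):
    """Join field names into the comma-separated, space-free FIELDS string."""
    cleaned = [f.strip() for f in fields]
    if not cleaned or any(not f or " " in f or "," in f for f in cleaned):
        raise ValueError("each field must be a non-empty name without spaces or commas")
    return ",".join(cleaned)


def flat_cparams(lower, upper, total, opts=""):
    """Build a flat-mode CPARAMS string of the form: [OPTS] LOWER UPPER TOTAL.

    Assumption: the angle brackets in the usage text are notation only.
    """
    parts = [opts.strip()] if opts.strip() else []
    parts += [str(lower), str(upper), str(total)]
    return " ".join(parts)


def build_inputs(base_url, fields, hmode, cparams):
    """Return [BASEURL, FIELDS, HMODE, CPARAMS] in the order the server expects."""
    parsed = urlparse(base_url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError("BASEURL must be a full URL")
    if hmode not in ("f", "h"):
        raise ValueError("HMODE must be 'f' (flat) or 'h' (hierarchical)")
    if not cparams.strip():
        raise ValueError("CPARAMS is mandatory")
    return [base_url, join_fields(fields), hmode, cparams.strip()]


if __name__ == "__main__":
    inputs = build_inputs(
        "http://repository.example.org/oai",  # hypothetical repository URL
        ["title", "subject", "desc"],
        "f",
        flat_cparams(2, 10, 50, opts="-r"),   # illustrative values only
    )
    print(inputs)
```

The resulting list holds the four values in the order the server expects; how they are passed to the service depends on the SOAP client library in use and on the operation described in cluster-server.wsdl.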