MetaCombine NMF Document Clustering System Web Service
version 0.1.3

2005-08-25

Urvashi Gadi
Emory University

Credits
=======

Urvashi Gadi - Lead developer.
Aaron Krowne - Project Manager.

Overview
========

This software provides the server end of a web service interface to the
MetaCombine NMF Clustering System version 0.80 (a system for "document"
clustering using the Non-negative Matrix Factorization (NMF) method).

Clustering is a "preclassificatory" task: it is used to discover latent
associations in the data, separating the data points into "bins" which can
be interpreted as clusters or classes. For text documents, this can be
thought of as discovering classes based on the topics within the corpus.

This web service clustering system is part of the MetaCombine project, which
seeks to bring digital library resources together more meaningfully, helping
to build more coherent services on top of them.

The purpose of this web service is to allow third parties to use advanced
clustering techniques on their data without needing to be familiar with the
details of the clustering system, go through the complex installation
process, or supply the computational hardware.

See http://www.metacombine.org/ for more on MetaCombine. Also see
http://www.ockham.org/ for information on the OCKHAM project, which is
building a p2p network of library services (like this clustering service).

Note that clients of this service do not need to be written in PHP. Web
service clients are easy to write in many other languages as well.

Running
=======

To access the Metacluster web service you will need a client. The client can
be written in any language, use any component model, and run on any
operating system.
The server expects the input in the following sequence:

    MSGTYPE OUTPUT BASEURL/TOKEN FIELDS HMODE CPARAMS

All the fields are mandatory in the case of a service request.

    MSGTYPE       : Service Request/Polling Message [0/1]
    OUTPUT        : Expected output [xml/archive]
    WSDLURL       : Classification server WSDL URL
    BASEURL/TOKEN : Data Repository URL/Polling Token
    FIELDS        : Clustering fields (title,subject,desc), separated by
                    commas with no spaces
    HMODE         : Hierarchical/flat clustering mode [f/h]
                      f : flat mode
                      h : hierarchical mode
    CPARAMS       : Semantic Clustering parameters
                      flat mode         : [ OPTS ] < LOWER UPPER TOTAL >
                      hierarchical mode : [ OPTS ] < TOTAL | BRANCH LIMIT >

OPTS are optional flags, which consist of:

    -r             perform contraction on first-cut clusters
    -m [ FRAC ]    select multiclassification up to FRAC of highest score
    -d             hierarchical clustering max-depth (0,1,2,... def. 5)
    -l             multiclassification limit on # of clusters per record
                   (def. # clusters)
    -u < THRESH >  set the threshold for unclassification

For more information, including the input data types, open the WSDL in any
browser:

    http://metascholar3.library.emory.edu/cluster_server_v0.1.3/cluster-server.wsdl

Installation
============

1. Set up the NMF clustering system. The archive is available at:

       http://metacombine.org/software/metacombine-nmf-cluster-0.8.tar.gz

   Follow the setup instructions provided within the archive.

2. Set up Net::OAI::Harvester. The archive is available at:

       http://search.cpan.org/~esummers/OAI-Harvester-0.99/

   a) Untar this archive and put it wherever you want.
   b) Install OAI-Harvester-0.99 in the same (server) directory.

3. Untar vectorizer-1.0 and put it wherever you want. The archive is
   available at:

       http://metacluster.library.emory.edu/~akrowne/metacombine_software/vectorizer-1.0.tar.gz

4. Update the PATH variable in metatest.pl to include the NMF Clustering
   system path and the vectorizer path.

5. This server implementation uses PEAR::SOAP.
   Install the PEAR package manager and run the following command in your
   shell:

       % pear install SOAP

   This will download, unzip, and install PEAR::SOAP. PEAR::SOAP depends on
   the following packages: Mail_Mime, Net_URL, HTTP_Request, and Net_DIME.
   Install whichever of these are not already installed.

   If the above fails, you might have to run the following commands in your
   shell:

       % pear config-set preferred_state beta
       % pear upgrade-all

6. Change the endpoint location of the server in the "soap:address location"
   element of cluster-server.wsdl to match the location of the server code.
   The Semantic Clustering web service is then ready to be consumed by a
   client.

7. Make sure the output directories (work, logfiles, output_xml and
   provider), which are created when this archive is installed, are writable
   by the web server user.

Change Log
==========

0.1.3 - Modified to take care of UNIX file naming conventions.
0.1   - First release. Responds to client request with clustered result in
        XML file URL or Open Archive URL.

License
=======

BSD. See included "LICENSE" file.
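
Example: assembling a service request
=====================================

The input sequence described under "Running" can be sketched on the client
side as follows. This is a minimal illustration in Python, not part of the
distributed software. The function name build_service_request is
hypothetical, the reading "0 = service request, 1 = polling message" is an
assumption based on the order of the [0/1] description, and the six-value
order follows the sequence line as printed (which does not list WSDLURL).
The repository URL and the flat-mode numbers are placeholders. The SOAP call
itself is not shown, since the operation names are defined in
cluster-server.wsdl.

```python
def build_service_request(output, base_url, fields, hmode, cparams):
    """Return the request values in the order
    MSGTYPE OUTPUT BASEURL FIELDS HMODE CPARAMS."""
    if output not in ("xml", "archive"):
        raise ValueError("OUTPUT must be 'xml' or 'archive'")
    if not base_url:
        raise ValueError("BASEURL is mandatory for a service request")
    if not fields:
        raise ValueError("at least one clustering field is required")
    for name in fields:
        # FIELDS is comma-separated with no spaces, so a field name
        # may contain neither a comma nor whitespace.
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid field name: {name!r}")
    if hmode not in ("f", "h"):
        raise ValueError("HMODE must be 'f' (flat) or 'h' (hierarchical)")
    if not cparams.strip():
        raise ValueError("CPARAMS is mandatory for a service request")
    msgtype = "0"  # assumption: 0 = service request, 1 = polling message
    return [msgtype, output, base_url, ",".join(fields), hmode, cparams.strip()]


# Placeholder repository URL and placeholder flat-mode numbers,
# used for illustration only.
request = build_service_request(
    "xml", "http://example.org/oai", ["title", "subject"], "f", "2 5 10"
)
print(" ".join(request))
```

Validating the values before sending them lets a client catch a malformed
FIELDS or HMODE value locally instead of waiting for the server to reject
the request.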