MetaCombine NMF Document Clustering Web Service
version 0.1.3
2005-08-24

Urvashi Gadi
Emory University

Credits
=======

Urvashi Gadi - Lead developer.
Aaron Krowne - Project Manager.

Overview
========

This software provides a web service interface to the MetaCombine NMF
Clustering System Version 0.80 (a system for "document" clustering using
the Non-negative Matrix Factorization (NMF) method).

Clustering is a "preclassificatory" task -- it is used to discover latent
associations in the data, separating the data points out into "bins" which
can be interpreted as clusters or classes. For text documents, this can be
thought of as discovering classes based on the topics within the corpus.

This web service clustering system is part of the MetaCombine project,
which seeks to more meaningfully bring together digital library resources,
helping to build more coherent services on top of them. See
http://www.metacombine.org/ for more on MetaCombine.

Running
=======

To access the MetaCombine Clustering Web Service you will need a client.
The client can be in any language, using any component model, and running
on any operating system.

The server's WSDL description is located at

  http://metascholar3.library.emory.edu/cluster_server_v0.1.2/cluster-server.wsdl

The server expects its input in the following sequence:

  MSGTYPE OUTPUT BASEURL/TOKEN FIELDS HMODE CPARAMS

All fields are mandatory for a service request.

  MSGTYPE       : Service Request/Polling Message [0/1]
  OUTPUT        : Expected output [xml/archive]
  WSDLURL       : Classification server WSDL URL
  BASEURL/TOKEN : Data Repository URL/Polling Token
  FIELDS        : Clustering fields (Dublin Core elements), separated by
                  commas with no spaces
  HMODE         : Hierarchical/flat Clustering Mode [f/h]
                    f : flat mode
                    h : hierarchical mode
  CPARAMS       : Semantic Clustering parameters
                    flat mode         : [ OPTS ] < LOWER UPPER TOTAL >
                    hierarchical mode : [ OPTS ] < TOTAL | BRANCH LIMIT >

OPTS are optional flags which consist of:

  -r                to perform contraction on first-cut clusters
  -m [ FRAC ]       to select multiclassification up to FRAC of highest
                    score
  -d                hierarchical clustering max-depth (0,1,2,... def. 5)
  -l                multiclassification limit on # of clusters per record
                    (def. # clusters)
  -u < THRESHOLD >  set the threshold for unclassification

Installation of sample Client Code
==================================

The included PHP scripts (cluster-client-xml.php and
cluster-client-archive.php) create a SOAP client. PHP does not come with a
bundled SOAP extension. Before you can begin using the client, you need to
download and install files that let you easily integrate SOAP. There are
three major SOAP implementations for PHP: PEAR::SOAP, NuSOAP, and PHP-SOAP.
This client implementation uses PEAR::SOAP.

Install the PEAR package manager and run the following command in your
shell:

  % pear install SOAP

This will download, unzip, and install PEAR::SOAP.

PEAR::SOAP depends on the following packages: Mail_Mime, Net_URL,
HTTP_Request, and Net_DIME. Install whichever of these packages are not
already installed.

If the above fails, you might have to run the following commands in your
shell:

  % pear config-set preferred_state beta
  % pear upgrade-all

Notes:

Since the NMF clustering web service is a long-running server process, the
client might time out due to network inactivity. To avoid this, adjust
$soapclient->__options['timeout'] or the sleep function call in the client
code accordingly.

Depending on the request, the server returns a URL to either an XML file
or an archive. When the server returns an "archive" as a result, it is
returned as a baseURL. The client should not rely on the result URL being
available permanently and should make a copy if it intends to use the
archive in the future. To copy the archive, the client can use the
"oaicopy" tool. It copies any Open Archive into an identical (including
sets) static archive on a local machine (with one command). See
http://metacombine.org/software/readmes/oaicopy-0.5.5-README for more
information on oaicopy.

Usage Examples:

Expected Output: XML File

  flat mode:

    ./cluster-client-xml.php http://cluster-server.net/server-code.wsdl http://some-server.net/some-data-set/oai-provider.pl title,description f 20

  hierarchical mode:

    ./cluster-client-xml.php http://cluster-server.net/server-code.wsdl http://some-server.net/some-data-set/oai-provider.pl subject h 10 2

Expected Output: Open Archive

  flat mode:

    ./cluster-client-archive.php http://cluster-server.net/server-code.wsdl http://some-server.net/some-data-set/oai-provider.pl title,description f 20

  hierarchical mode:

    ./cluster-client-archive.php http://cluster-server.net/server-code.wsdl http://some-server.net/some-data-set/oai-provider.pl subject h 10 2

Change Log
==========

0.1.3 - Added on-screen progress reporting messages.

0.1.2 - Modified to cluster large collections using a polling technique.
        The client polls the server to check whether the service request
        has been processed. If it has, the server returns the result URL;
        otherwise it returns a token to be used to check back later.
        Both archive and XML outputs are provided in the form of a URL.
        The client is expected to make local copies of the result for
        future use.

0.1.1 - Modified to provide output in the form of an Open Archive. Two
        clients are available, providing output in XML form and in Open
        Archive form.

0.1   - First release. Clustering works for an input Base URL. Output is
        provided in XML format.

License
=======

BSD. See included "LICENSE" file.
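
The request/polling exchange described under "Running" and in the 0.1.2
change log entry can be sketched in a few lines. The sketch is in Python
rather than PHP, to illustrate that the client can be written in any
language. It is only an illustration under stated assumptions, not the
code of the included clients: the names build_message, cluster and
call_server are hypothetical, call_server stands in for whatever SOAP call
the client makes, MSGTYPE 0/1 is read as service request/polling, a reply
beginning with "http" is assumed to be the result URL (anything else is
treated as a polling token), and the polling message is assumed to carry
the same remaining fields as the original request.

```python
import time
from typing import Callable, List


def build_message(msgtype: int, output: str, target: str,
                  fields: List[str], hmode: str, cparams: str) -> List[str]:
    """Return MSGTYPE OUTPUT BASEURL/TOKEN FIELDS HMODE CPARAMS, in order."""
    if msgtype not in (0, 1):
        raise ValueError("MSGTYPE must be 0 (service request) or 1 (polling)")
    if output not in ("xml", "archive"):
        raise ValueError("OUTPUT must be 'xml' or 'archive'")
    if hmode not in ("f", "h"):
        raise ValueError("HMODE must be 'f' (flat) or 'h' (hierarchical)")
    # FIELDS are Dublin Core element names joined by commas, with no spaces.
    return [str(msgtype), output, target, ",".join(fields), hmode, cparams]


def cluster(call_server: Callable[[List[str]], str], base_url: str,
            fields: List[str], hmode: str, cparams: str,
            output: str = "xml", wait_seconds: float = 30.0,
            max_polls: int = 100) -> str:
    """Send a service request, then poll with the token until a URL arrives."""
    reply = call_server(
        build_message(0, output, base_url, fields, hmode, cparams))
    polls = 0
    # Assumption: a finished job is reported as an http(s) URL; any other
    # reply is taken to be a token for checking back later.
    while not reply.startswith(("http://", "https://")):
        if polls >= max_polls:
            raise TimeoutError("no result after %d polls" % max_polls)
        time.sleep(wait_seconds)
        reply = call_server(
            build_message(1, output, reply, fields, hmode, cparams))
        polls += 1
    return reply
```

Sleeping between polls plays the same role as the sleep call mentioned in
the Notes: it keeps the client from giving up while a large collection is
still being clustered. The URL returned at the end should be copied
locally, since it is not guaranteed to remain available.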