MetaCombine NMF CWIS Clusterfication Import System

version .5
2005-03-01

Aaron Krowne
Emory University

Credits
=======

Aaron Krowne - Lead developer.

Overview
========

This script takes the output of the MetaCombine NMF clustering system and
imports it into a CWIS instance. This includes both the classification
scheme itself and the bindings of resources to the "categories" (i.e., the
classifications), hence the term "clusterfication".

As you might suspect, this means that

1. A clustering must have already been run.
2. The CWIS instance must already contain records corresponding to the
   clustered corpus. Such an import can be done with

Instructions
============

Please see the header of the Perl script. Configuration is there also.

License
=======

BSD. See included "LICENSE" file.

Contact
=======

I can be contacted at akrowne@emory.edu. The web site for this, and other
MetaCombine software, is http://www.metacombine.org/

I especially like to get patches for fixes and/or enhancements, but will be
glad to help with running and installation or just hear suggestions!

Good luck!

Aaron Krowne
Emory University General Libraries
[The MetaCombine Project]


MetaCombine NMF Document Clustering Web Service

version 0.1.3
2005-08-24

Urvashi Gadi
Emory University

Credits
=======

Urvashi Gadi - Lead developer.
Aaron Krowne - Project Manager.

Overview
========

This software provides a web service interface to the MetaCombine NMF
Clustering System Version 0.80 (a system for "document" clustering using
the Non-negative Matrix Factorization (NMF) method).

Clustering is a "preclassificatory" task -- it is used to discover latent
associations in the data, separating the data points out into "bins" which
can be interpreted as clusters or classes. For text documents, this can be
thought of as discovering classes based on the topics within the corpus.

This web service clustering system is part of the MetaCombine project,
which seeks to more meaningfully bring together digital library resources,
helping to build more coherent services on top of them.

See http://www.metacombine.org/ for more on MetaCombine.

Running
=======

To access the MetaCombine Clustering Web Service you will need a client.
The client can be in any language, using any component model, and running
on any operating system.

The server code is located at

  http://metascholar3.library.emory.edu/cluster_server_v0.1.2/cluster-server.wsdl

The server expects the input in the following sequence:

  MSGTYPE OUTPUT BASEURL/TOKEN FIELDS HMODE CPARAMS

All the fields are mandatory for a service request.

  MSGTYPE       : Service Request/Polling Message [0/1]
  OUTPUT        : Expected output [xml/archive]
  WSDLURL       : Classification server WSDL URL
  BASEURL/TOKEN : Data Repository URL/Polling Token
  FIELDS        : Clustering fields (Dublin Core elements), separated by
                  commas with no spaces
  HMODE         : Hierarchical/flat Clustering Mode [f/h]
                    f : flat mode
                    h : hierarchical mode
  CPARAMS       : Semantic Clustering parameters
                    flat mode         : [ OPTS ] < LOWER UPPER TOTAL >
                    hierarchical mode : [ OPTS ] < TOTAL | BRANCH LIMIT >

OPTS are optional flags which consist of:

  -r                perform contraction on first-cut clusters
  -m [ FRAC ]       select multiclassification up to FRAC of highest score
  -d                hierarchical clustering max-depth (0,1,2,... def. 5)
  -l                multiclassification limit on # of clusters per record
                    (def # clusters)
  -u < THRESHOLD >  set the threshold for unclassification

Installation of sample Client Code
==================================

The included PHP scripts (cluster-client-xml.php and
cluster-client-archive.php) create a SOAP client.

PHP does not come with a bundled SOAP extension. Before you can begin using
the client, you need to download and install files that let you easily
integrate SOAP. There are three major SOAP implementations for PHP:
PEAR::SOAP, NuSOAP, and PHP-SOAP. This client implementation uses
PEAR::SOAP.

Install the PEAR package manager and run the following command in your
shell:

  % pear install SOAP

This will download, unzip, and install PEAR::SOAP. PEAR::SOAP depends on
the following packages: Mail_Mime, Net_URL, HTTP_Request, and Net_DIME.
Install whichever of these packages are not already installed.

If the above fails, you might have to run the following commands in your
shell:

  % pear config-set preferred_state beta
  % pear upgrade-all

Notes:

Since the NMF clustering web service is a long-running server process, the
client might time out due to network inactivity. To avoid this, adjust
$soapclient->__options['timeout'] or the sleep function call in the client
code accordingly.

The server can return a URL to an XML file or to an archive as a result,
depending on the request.

When the server returns an "archive" as a result, it is returned as a
baseURL. The client should not rely on the result URL being available
permanently, and should make a copy if it intends to use this archive in
the future. To copy the archive, the client can use the "oaicopy" tool. It
copies any Open Archive into an identical (including sets) static archive
on a local machine (with one command). See
http://metacombine.org/software/readmes/oaicopy-0.5.5-README for more
information on oaicopy.
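
The request sequence listed under Running, combined with the
sleep-and-retry behaviour mentioned in the notes, can be sketched roughly
as follows. This is only an illustration in Python, not the bundled PHP
client. The names build_message, cluster and call_service are
hypothetical; call_service stands in for the real SOAP call, and its
(done, value) return convention is an assumption, because the actual
message types are defined by the WSDL rather than by this README. Sending
the same trailing fields in a polling message is likewise an assumption.

```python
import time


def build_message(msgtype, output, target, fields, hmode, cparams):
    """Assemble the documented sequence:
    MSGTYPE OUTPUT BASEURL/TOKEN FIELDS HMODE CPARAMS."""
    if msgtype not in (0, 1):
        raise ValueError("MSGTYPE must be 0 (service request) or 1 (polling)")
    if output not in ("xml", "archive"):
        raise ValueError("OUTPUT must be 'xml' or 'archive'")
    if hmode not in ("f", "h"):
        raise ValueError("HMODE must be 'f' (flat) or 'h' (hierarchical)")
    # FIELDS are joined with commas and no spaces, e.g. "title,description".
    return [str(msgtype), output, target, ",".join(fields), hmode, cparams]


def cluster(call_service, output, base_url, fields, hmode, cparams,
            wait_seconds=30, max_polls=100, sleep=time.sleep):
    """Send a service request, then poll with the returned token.

    call_service(message) is a hypothetical transport function that
    returns (True, result_url) when the job is finished, or
    (False, token) when the client should check back later.
    """
    message = build_message(0, output, base_url, fields, hmode, cparams)
    done, value = call_service(message)
    polls = 0
    while not done:
        if polls >= max_polls:
            raise TimeoutError("clustering did not finish in time")
        sleep(wait_seconds)  # wait between polls instead of holding a connection open
        polls += 1
        message = build_message(1, output, value, fields, hmode, cparams)
        done, value = call_service(message)
    return value  # URL of the XML file or of the result archive
```

Passing the sleep function as a parameter keeps the sketch easy to try
with a fake service; a real client would leave the default in place and
copy the result promptly, since the result URL is not permanent.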

Usage Examples:

Expected Output: XML File

  flat mode:
    ./cluster-client-xml.php http://cluster-server.net/server-code.wsdl http://some-server.net/some-data-set/oai-provider.pl title,description f 20

  hierarchical mode:
    ./cluster-client-xml.php http://cluster-server.net/server-code.wsdl http://some-server.net/some-data-set/oai-provider.pl subject h 10 2

Expected Output: Open Archive

  flat mode:
    ./cluster-client-archive.php http://cluster-server.net/server-code.wsdl http://some-server.net/some-data-set/oai-provider.pl title,description f 20

  hierarchical mode:
    ./cluster-client-archive.php http://cluster-server.net/server-code.wsdl http://some-server.net/some-data-set/oai-provider.pl subject h 10 2

Change Log
==========

0.1.3 - Added on-screen progress reporting messages.

0.1.2 - Modified to cluster large collections using a polling technique.
        The client polls the server to check whether the service request
        has been processed. If it has, the server returns the result URL;
        otherwise it returns a token to be used to check back later. Both
        archive and XML outputs are provided in the form of a URL. The
        client is expected to make local copies of the result for future
        use.

0.1.1 - Modified to provide output in the form of an Open Archive. Two
        clients are available, one providing output in XML form and one in
        Open Archive form.

0.1   - First release. Clustering works for an input Base URL. Output is
        provided in XML format.

License
=======

BSD. See included "LICENSE" file.


MetaCombine NMF Document Clustering System Web Service

version 0.1.3
2005-08-25

Urvashi Gadi
Emory University

Credits
=======

Urvashi Gadi - Lead developer.
Aaron Krowne - Project Manager.

Overview
========

This software provides the server end of the web service interface to the
MetaCombine NMF Clustering System Version .80 (a system for "document"
clustering using the Non-negative Matrix Factorization (NMF) method).

Clustering is a "preclassificatory" task -- it is used to discover latent
associations in the data, separating the data points out into "bins" which
can be interpreted as clusters or classes. For text documents, this can be
thought of as discovering classes based on the topics within the corpus.

This web service clustering system is part of the MetaCombine project,
which seeks to more meaningfully bring together digital library resources,
helping to build more coherent services on top of them.

The purpose of this web service clustering system is to allow third parties
to use advanced clustering techniques on their data, without needing to be
familiar with the details of the clustering system, having to go through
the complex installation process, or having to supply the computational
hardware.

See http://www.metacombine.org/ for more on MetaCombine.

Also see http://www.ockham.org/ for information on the OCKHAM project,
which is building a p2p network of library services (like this clustering
service).

Note that there is no need for clients of this service to be written in
PHP. It is easy to write web services clients in many other languages as
well.

Running
=======

To access the Metacluster web service you will need a client. The client
can be in any language, using any component model, and running on any
operating system.

The server expects the input in the following sequence:

  MSGTYPE OUTPUT BASEURL/TOKEN FIELDS HMODE CPARAMS

All the fields are mandatory for a service request.

  MSGTYPE       : Service Request/Polling Message [0/1]
  OUTPUT        : Expected output [xml/archive]
  WSDLURL       : Classification server WSDL URL
  BASEURL/TOKEN : Data Repository URL/Polling Token
  FIELDS        : Clustering fields (title,subject,desc), separated by
                  commas with no spaces
  HMODE         : Hierarchical/flat Clustering Mode [f/h]
                    f : flat mode
                    h : hierarchical mode
  CPARAMS       : Semantic Clustering parameters
                    flat mode         : [ OPTS ] < LOWER UPPER TOTAL >
                    hierarchical mode : [ OPTS ] < TOTAL | BRANCH LIMIT >

OPTS are optional flags which consist of:

  -r             perform contraction on first-cut clusters
  -m [ FRAC ]    select multiclassification up to FRAC of highest score
  -d             hierarchical clustering max-depth (0,1,2,... def. 5)
  -l             multiclassification limit on # of clusters per record
                 (def # clusters)
  -u < THRESH >  set the threshold for unclassification

For more information and the input data types, open the following URL in
any browser:

  http://metascholar3.library.emory.edu/cluster_server_v0.1.3/cluster-server.wsdl

Installation
============

1. Set up the NMF clustering system. The archive is available at:
   http://metacombine.org/software/metacombine-nmf-cluster-0.8.tar.gz
   Follow the setup instructions provided within the archive.

2. Set up Net::OAI::Harvester. The archive is available at:
   http://search.cpan.org/~esummers/OAI-Harvester-0.99/
   a) Untar this archive and put it where you want.
   b) Install OAI-Harvester-0.99 in the same (server) directory.

3. Untar vectorizer-1.0 and put it anywhere you want. The archive is
   available at:
   http://metacluster.library.emory.edu/~akrowne/metacombine_software/vectorizer-1.0.tar.gz

4. Update the PATH variable in metatest.pl to include the NMF Clustering
   system path and the vectorizer path.

5. This server implementation uses PEAR::SOAP.
   Install the PEAR package manager and run the following command in your
   shell:

     % pear install SOAP

   This will download, unzip, and install PEAR::SOAP. PEAR::SOAP depends
   on the following packages: Mail_Mime, Net_URL, HTTP_Request, and
   Net_DIME. Install whichever of these packages are not already
   installed.

   If the above fails, you might have to run the following commands in
   your shell:

     % pear config-set preferred_state beta
     % pear upgrade-all

6. Change the end point location of the server in the
   "soap:address location" element of cluster-server.wsdl to match the
   location of the server code. The Semantic Clustering web service is
   then ready to be consumed by a client.

7. Make sure the output directories (work, logfiles, output_xml and
   provider), which are created when this archive is installed, have write
   permissions for the web server user.

Change Log
==========

0.1.3 - Modified to take care of UNIX file naming conventions.

0.1   - First release. Responds to a client request with the clustered
        result as an XML file URL or an Open Archive URL.

License
=======

BSD. See included "LICENSE" file.


MetaCombine Common Lib

version 1.0.1
2004-10-05

Aaron Krowne
Emory University

Overview
========

This library consists of miscellaneous classes and methods useful for the
MetaCombine project. The MetaCombine project seeks to improve digital
library usability by combining DLs and DL contents in new ways and more
completely than before.

Installation
============

Untar. Run "make". It is up to you to point dependency programs to wherever
you have untarred this library.

Changes
=======

1.0 - 1.0.1 - Minor changes to dictionary class (quiet mode).

License
=======

BSD. Some of this code comes from work I did at Virginia Tech. Some was
written purely at Emory.

Contact
=======

Please let me know if you need help using this software, or have
suggestions or patches. My email is akrowne@emory.edu.

Good luck!

Aaron Krowne
Emory University General Libraries
The MetaCombine Project


MetaCombine Focused Crawler System - 0.96

Based on Archive.org's Heritrix web crawling system.

By Saurabh Pathak
Technical lead: Aaron Krowne
contact: akrowne@emory.edu

To perform a focused crawl using this tool, follow the steps below.

1. Before a crawl can be started, the Rainbow classifier must be trained.
   This is achieved in the two steps listed below.

   a. Specify as many of the following inputs as possible, as text files:

        keywords.txt       (list of domain-specific keywords)
        pos_categories.txt (list of positive ODP categories)
        repositories.txt   (list of OAI repositories)

      Apart from the above inputs, also create a configuration file
      containing values for all the configuration parameters. These values
      will be used at various stages of focused crawling. Once the above
      inputs are specified, run the train_step1.pl script, which takes the
      config file path as input.

   b. The above step will have created a file keyword_categories.txt
      containing a list of ODP categories. Scan through this file and mark
      all the categories belonging to the topic of the crawl as 1. Then run
      the train_step2.pl script with the config file path as input. This
      script completes the training phase and sets up the Rainbow
      classifier on the user-specified host and port.

   Once the classifier is running, you can run as many crawls for the given
   topic as you like without having to repeat the above steps.

2. This step is optional but recommended. During this step we fine-tune the
   vocabulary created by Rainbow for the topic during training. This step
   is carried out in two substeps.

   a. Run the fine_tune1.pl script with the config file path as input.

   b. Scan the file top-words.txt created during the above substep and mark
      all the words that do not belong to the topic as 0. Then run the
      script fine_tune2.pl with the config file path as input.

   For higher accuracy, repeat the above two substeps until the top N words
   belong to the topic of the crawl.

3. This step performs the focused crawling.
   Before running this step, copy the folder focusedCrawler containing all
   the class files to $HERITRIX_HOME/bin/. Also create the following files
   in the heritrix_files folder before running this step: order.xml and
   seeds.txt. You can use a modified/appended version of the seeds.txt file
   created during the training. Also add BowProcessor to the top of the
   Heritrix extractor processor chain as shown below,

     true ....(remaining heritrix extractor chain processors)

   Once you have done the above, run the script start_crawler.pl with the
   config file path as input. This will initiate the focused crawling using
   the Heritrix crawler and BowProcessor.


This is metacombine_focusedCrawl_module1.0

2005-06-01

by Saurabh Pathak, Emory University; Donna Bergmark, Cornell University

Summary
=======

This module can perform focused crawling and is meant to be used as a
processor for the Heritrix crawler. If used as a focused crawler, this
module should be placed on top of the Heritrix extractor processor chain,
so that for every URL encountered by the Heritrix crawler this module
decides whether or not to crawl the URL, depending on the text contents of
the URL document. To make this decision the module talks with a rainbow
server, which is a text classifier running on a port.

For a more detailed description of this module, please read the
Overview.pdf file in the current directory.

Dependencies
============

To run this module for focused crawling you will need to download the
following:

1. Heritrix crawler (http://crawler.archive.org/) - This module is tested
   with Heritrix crawler versions up to 1.4.

2. Bow text classifier (http://www-2.cs.cmu.edu/~mccallum/bow/src/)

3. HTML parser jar file (http://htmlparser.sourceforge.net/) - Check the
   website and download the jar file. This module uses version 1.5 of the
   parser.

Instructions on how to run the module
=====================================

The following instructions show how to run the module using an example
case.

This package can be used with Heritrix to make a focused crawler, using the
following steps:

1. Suppose Heritrix is located in directory $HERITRIX_HOME/bin/. The
   Metacombine_FocusedCrawler1.0/org/metacombine/crawlmodule/ directory
   should contain the .class files, BowProcessor.class and
   RainbowClassifier.class, along with the Java source files. Copy the
   Metacombine_FocusedCrawler1.0/org/ folder to the $HERITRIX_HOME/bin/
   directory.

2. Copy the jar files (for the dependencies listed above) to
   $HERITRIX_HOME; alternatively, add them to your $CLASSPATH. Now you are
   all set to use the crawlmodule package.

3. Train the rainbow classifier as described in Section 2 of the README.pdf
   file and set it up as a server. In this example I am crawling pages
   related to "skiing", so my training set consists of two sub-directories
   in $DIR: skiing/ (which will contain examples of pages related to
   skiing) and Negative/ (which will contain examples of pages not related
   to skiing). Suppose we then execute the following:

     /path/to/rainbow -d /path/to/model --index $DIR/*?
     /path/to/rainbow -d /path/to/model --query-server=5555

4. To use BowProcessor in a crawl job, add the processor to the order.xml
   file. We suggest adding BowProcessor to the top of the Heritrix
   extractor processor chain as shown below.

     true ....(remaining heritrix extractor chain processors)

5. Create the Bow configuration file and name it Bow Config File. This file
   contains a list of key-value pairs. The following keys are present:
   Topic, CutOff, Host, Port, BowLog File Path.
   An instance of the above configuration file is shown below:

     Topic = skiing                    # specify your topic of crawl
     CutOff = 0.75                     # specify your cutoff score here
     Host = xyz.edu                    # specify the host running rainbow server here
     Port = 5555                       # port number for rainbow server
     BowLog_File_Path = /path/BowLog   # specify path for BowLog file here
     BowLogContents_File_Path = /path/ContentLog

6. Let's start a crawl using an order.xml file containing BowProcessor.
   Change into metacombine_focusedCrawl_module1.0/ and run

     $HERITRIX_HOME/bin/heritrix -n /path/to/order.xml

   The -n option means no interface. You can instead create the order.xml
   file using the web console provided by Heritrix, in which case you
   should start the crawl from the web console itself. By default
   BowProcessor looks for the configuration file created in the previous
   step in the $HERITRIX_HOME/bin/ directory, but you can specify the
   location of this file with the CONF variable, adding it to JAVA_OPTS as

     export JAVA_OPTS="-DCONF=/your/path/to/configuration/file ..."

   The above step should start the crawl using the Heritrix crawler and
   BowProcessor.

License
=======

BSD-style. See LICENSE file.

Contact
=======

Saurabh Pathak at spatha2@emory.edu

Permanent Contact
=================

Martin Halbert (mhalber@emory.edu)
Aaron Krowne (akrowne@emory.edu)


MetaCombine NMF Document Clustering System

version .80
2005-03-18

Aaron Krowne
Emory University

Credits
=======

Aaron Krowne - Lead developer.
Steve Ingram - Research developer.

Overview
========

This software makes up a system for "document" clustering using the
Non-negative Matrix Factorization (NMF) method. In actuality, what is being
clustered does not have to be text documents; any problem with extremely
high dimensionality can make use of this system. However, it is expected
that each dimension will have some sort of label (this requirement can be
avoided by skipping some of the outputs, but you'll have to hack the code
for this).

Clustering is a "preclassificatory" task -- it is used to discover latent
associations in the data, separating the data points out into "bins" which
can be interpreted as clusters or classes. For text documents, this can be
thought of as discovering classes based on the topics within the corpus.

NMF was chosen as the clustering algorithm after a little research. NMF is
fairly new; half a decade old as of this writing. According to [Xu], it has
been found to outperform LSI (SVD)-based methods, as well as graph-based
methods. It is at least as accurate as all of the above, yet is
conceptually simpler. NMF can be used for many types of latent semantic
processing, not just clustering (most have to do with finding a smaller,
sparse basis in a pseudo-feature space).

As far as I know, there are no freely-available implementations of NMF out
there. One could always implement it fairly easily in Matlab or some other
CAS. However, this implementation of NMF exists for two reasons:

1. To produce a freely-available NMF clustering system that can become part
   of open source projects.

2. To produce a fast and efficient NMF clustering implementation (hence
   SparseLib++ and the use of C++).

This clustering system is part of the MetaCombine project, which seeks to
more meaningfully bring together digital library resources, helping to
build more coherent services on top of them.

See http://www.metacombine.org/ for more on MetaCombine.

References
==========

The key implementation reference for this software was:

[Xu] Xu, Wei, et al. "Document Clustering Based on Non-Negative Matrix
     Factorization." SIGIR 2003, pp. 267-273.

A nice, short introduction to NMF, with various applications, is:

[Lee] D. Lee and H. S. Seung. "Learning the parts of objects by
      non-negative matrix factorization." Nature, 401, October 1999.

This paper was also the inspiration for the HTML "visualization" the
software outputs for each clustering, to help illustrate which terms are
most important for each cluster.
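
To make the general idea of NMF-based clustering concrete, here is a very
small sketch in plain Python. It is not this system's C++ implementation
and does not follow [Xu] in detail: it uses the widely known
multiplicative-update rules for the squared-error objective, dense lists
instead of SparseLib++, and a fixed iteration count. All function names
and the toy matrix are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate the non-negative terms-by-documents matrix v as w * h,
    where w is terms-by-k and h is k-by-documents."""
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_docs)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n_terms)]
    return w, h


def assign_clusters(h):
    """Assign each document (a column of h) to its largest component."""
    return [max(range(len(h)), key=lambda a: h[a][j])
            for j in range(len(h[0]))]


# Toy corpus: rows are terms, columns are documents. Documents 0-1 use the
# first two terms, documents 2-3 use the last two.
V = [
    [0.7, 0.6, 0.0, 0.0],
    [0.7, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
    [0.0, 0.0, 0.8, 0.6],
]
W, H = nmf(V, 2)
labels = assign_clusters(H)
```

In this sketch the columns of W play the role of "topics" (weights over
terms) and each column of H says how strongly a document belongs to each
topic, which is the kind of information the HTML report summarizes.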

Running
=======

You run the system like:

  ./cluster basename k_lower k_upper

- "basename" is the naming scheme used for the input files; extensions for
  each file will be appended (input files are described below).

- k_lower is the lowest number of clusters you want to produce a clustering
  for.

- k_upper is the highest number of clusters. k_lower may equal k_upper if
  you know exactly how many clusters you want.

Input
-----

Input files are of the form:

[basename].vec - Data point/document vectors. The format is one vector per
  line, as (feature_index, value) pairs. feature_index should start at
  zero, and value should be in the range [0, 1] (i.e., data points *must*
  be normalized). Here is a sample line from [basename].vec:

    (2,0.208514) (3,0.208514) (4,0.208514) (5,0.208514) (22,0.208514)

  This line describes a vector with nonzero values at indices 2, 3, 4, 5,
  and 22.

[basename].dic - Dictionary file. Gives human-readable names for each
  feature/dimension. The format of this file is one entry per line: word,
  space, numerical value. Example:

    seawater 12797

  Note: the feature label can't have any spaces in it!

[basename].md.description - (optional) - A list of description text for
  each input record, one per line. Each line consists of a number (the
  record index, based on position in the .vec file), a space, and then
  text. Used in creating human-readable portions of the output.

[basename].md.identifier - Human-readable identifier for each input record.
  Works the same as the previous file.

[basename].md.title - (optional) - Human-readable title for each input
  record. Works the same as the previous file.

Outputs
-------

A variety of output files are created by the clustering system. They are
all named based on the input [basename]:

[basename].clusters.K.xml - XML cluster descriptor output for number of
  clusters K.

[basename].clusters.K.html - HTML clustering report for number of clusters
  K. This file is meant to give some idea of what the clusters mean for the
  clustering.

[basename].class.K.txt - A "classification" based on the clustering with K
  clusters. This file maps cluster IDs (zero-based numeric indices) to the
  human-readable identifiers for each record. There is one mapping per
  line, of the form number, space, identifier text.

*** Also, hierarchy is represented as subdirectory structure in the output
directory. Thus, the same outputs exist in many subdirectories in a layout
that corresponds to the logical hierarchy. ***

Installation
============

1. Install the MetaCombine common library. Relative to this NMF software,
   it will be expected to be in ../common/. You can easily change this in
   the makefile, however. This is available from:

   http://metacluster.library.emory.edu/~akrowne/metacombine_software/metacombine-common-1.0.tar.gz

2. Install SparseLib++. This is expected to go in ../sparselib_1_5d/
   relative to the NMF clustering program. Apply the patch below, then
   compile according to the README that comes with SparseLib.

   The patch:

   http://metacluster.library.emory.edu/~akrowne/metacombine_software/apk-metacombine-sparselib-1.5d-3.patch

   Our mirror of SparseLib 1.5d is at:

   http://metacluster.library.emory.edu/~akrowne/metacombine_software/sparselib_1_5d.zip

3. Set up the NMF clustering system.

   a) untar this archive and put it where you want.
   b) make sure the MetaCombine common lib is in ../common/ (and built)
   c) make sure SparseLib++ is in ../sparselib_1_5d/ (and built)
   d) check to see if there's anything in the Makefile that needs changing
      for your system.
   e) type "make".

License
=======

BSD. See included "LICENSE" file.

TO-DO list
==========

- Make the classifier mode work with hierarchies.

- Implement more types of hierarchical clustering:
  - graph based
  - MUP-based (can we port over Steve's Java implementation?)

- Improve hierarchical clustering in general.

- Do automatic optimal-number-of-clusters guessing. (You can see the code
  is littered with some of my attempts at this already).
  This is an open problem, so I don't expect any solution soon (or perhaps
  ever). Help on this?

Changes
=======

.75 - .80 :

  - Added contraction (a post-processing step whereby similar clusters are
    combined).

  - Added unclassification (thresholding whereby no attempt is made to
    classify some records).

  - Added "classifier" mode, whereby a clustering is done as a training
    step, and then the clustering system can subsequently be run on another
    set of records using the built "model" (which is really just the
    matrices from the original factorizations).

.5 - .75 :

  - Added nice command switches for a variety of options, changed how input
    was specified.

  - Added adaptive hierarchical clustering.

  - Added multiclassification, changed format of .class output files.

Contact
=======

I can be contacted at akrowne@emory.edu.

I especially like to get patches for fixes and/or enhancements, but will be
glad to help with running and installation or just hear suggestions!

Good luck!

Aaron Krowne
Emory University General Libraries
[The MetaCombine Project]


This is MetaCombine-PhraseFinder 0.5.

2004-11-18

by Aaron Krowne
The MetaCombine Project
Emory University Woodruff Library

Introduction
============

This is actually a pair of tools (phrasefinder and phrasemaker) meant as a
preprocessor to typical bag-of-words IR or machine learning systems. The
basic idea is to be able to input a corpus and re-output it with important
phrases separated out as single, atomic features. This is done by
connecting the appropriate words with underscores ('_'), which typically
are not disturbed by text parsing.

For instance, in a corpus on the American South, one might like to
recognize "civil war" as a single feature, distinct from "civil rights",
disambiguating many occurrences of "civil". These would be highlighted as
"civil_war" and "civil_rights" respectively in the text. One might also
like person names to become single features, clearing up ambiguities
between common first and last names.
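
The underscore-joining idea can be illustrated with a few lines of Python.
This is only a sketch of the transformation described above; it says
nothing about how the phrasemaker program is actually implemented, and the
function name make_phrases and its list-of-strings input are assumptions
made for the example.

```python
import re


def make_phrases(text, phrases):
    """Rewrite text so that each listed phrase becomes one token,
    e.g. "civil war" -> "civil_war"."""
    # Handle longer phrases first, so that "civil rights movement" is
    # joined as a whole before the shorter "civil rights" is considered.
    ordered = sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    for phrase in ordered:
        words = phrase.split()
        # Match the words in sequence, separated by any whitespace, and
        # only at word boundaries ("civilian warfare" must not match).
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        text = re.sub(pattern, "_".join(words), text)
    return text


example = make_phrases(
    "the civil war preceded the civil rights era",
    ["civil war", "civil rights"],
)
```

Matching here is case-sensitive and purely textual; a real preprocessor
would also have to decide how to treat punctuation and capitalization.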

The end result is an increase in clarity and accuracy of the output.

Why
===

Phrasefinder is a substitute for noun-phrase parsing systems. There are a
number of reasons to create this replacement:

1. We don't particularly care that the common phrases are noun-phrases.
2. Poor performance of freely-available noun-phrase parsing systems.
3. IP-encumbrance of high-quality noun-phrase parsing systems.

How
===

Phrasefinder works by taking the input text and building what I'm calling
an "ordinal index" of the words in it. This is a chained-hash index
structure that keeps track of word-instances and their previous and
following words. Then this index is analyzed to determine which phrases
have high enough support and confidence to be separated out. Note that this
criterion is purely a data mining concept; there is no linguistics involved
whatsoever.

The phrasefinder program just outputs a list of candidate phrases. One can
direct this list to a file, and edit it if desired.

The phrasemaker program takes the list of final phrases and the original
corpus file, replacing the appropriate words in the text with phrases from
the list, and writing to stdout.

Installation
============

1. Install MetaCombine-Common, from
   http://metacluster.library.emory.edu/~akrowne/metacombine_software/,
   placing it in ../common from where you plan to install this program.

2. Untar this archive (clearly you have already).

3. Issue "make".

Now you've got your stand-alone executables. You can put these wherever you
want.

TODO
====

The most glaring TODO item is rather fundamental: the scanning of the text
from each word currently only proceeds forward. It also needs to proceed in
reverse. Because this is not yet done, there is a "telescoping" problem,
whereby common suffixes of important phrases appear as separate phrases, as
they are not counted with their prefix parts. Currently one has to manually
remove these telescoping suffixes from the candidate phrases output of
phrasefinder.
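
Until that happens, a small helper along the following lines could take
over part of the manual clean-up. It is a hypothetical sketch, not part of
the distribution: it assumes the candidate list has been read into Python
as one phrase per string with words separated by spaces, and it simply
drops any candidate that is a word-for-word suffix of a longer candidate.

```python
def drop_telescoped_suffixes(candidates):
    """Return the candidates that are not a proper word-suffix of some
    longer candidate (e.g. drop "civil war" if "american civil war" is
    also in the list)."""
    split = [tuple(c.split()) for c in candidates]
    kept = []
    for words, original in zip(split, candidates):
        is_suffix = any(
            len(other) > len(words) and other[-len(words):] == words
            for other in split
        )
        if not is_suffix:
            kept.append(original)
    return kept


cleaned = drop_telescoped_suffixes(
    ["american civil war", "civil war", "war", "civil rights"]
)
```

Because a shorter phrase can be important in its own right, the result
should still be reviewed by hand rather than trusted blindly.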

When this is fixed, the program will be 1.0.

License
=======

BSD-style. See LICENSE file.

Contact
=======

R&D lead Aaron Krowne at akrowne@emory.edu.
Project director Martin Halbert at mhalber@emory.edu.


MetaCombine Organization Editor

version .1.3a
2005-08-23

Stephen Ingram
Emory University

Overview
========

This software allows users to edit Organizations. Organizations are XML
documents specifying a taxonomy of documents. The software provides the
ability to rename, merge, and delete organization categories.

This software is part of the MetaCombine project, which seeks to more
meaningfully bring together digital library resources, helping to build
more coherent services on top of them.

See http://www.metacombine.org/ for more on MetaCombine.

Running
=======

You must first ensure that Java version 1.4 or later is properly installed
on your system.

Next, decompress the scheme editor tar.gz file. It should create a
directory called metacombine-scheme-editor-$VERSION.

In this directory there are (among other things) two different scripts that
allow the editor to be run, one for Windows and one for Unix.

In Windows, run the system by moving to the newly created scheme_editor
directory from the command line and typing

  "scheme_editor.bat organization basename"

where organization is the name of your organization xml file and basename
is the basename of your metadata map files.

In Unix, move to the newly created scheme_editor directory and execute the
script

  "./scheme_editor.sh organization basename"

where organization is the name of your organization xml file and basename
is the basename of your metadata map files.

Usage Instructions
==================

There are two large white boxes that can contain trees. At startup only the
box on the left contains a tree. This tree represents your organization
file. The box on the right is the "scratch" box, an area similar to the
"clipboard" on the computer. It is designed for holding temporary tree
items.
When you output a tree from the editor, only the tree on the left is
output.

The following instructions are organized by task.

1. To delete a category/item - Click on the category or item on the tree
   and then click the "Delete Node" button.

2. To rename a category - Click on the category and its text should appear
   in the text box at the bottom. Change this text and then click the
   "Rename Node" button.

3. To move a category - Click on the category you want to move and then
   click on the "Move To >" button. This moves the category and all its
   subcategories and items from the main tree to the scratch box. Now click
   on the category on the main tree that you want to be the new parent of
   this category. Then click the "< Move From" button. This clears out the
   scratch box and moves the category.

4. To merge two categories - Click on one of the categories you want to
   merge and then click the "Move To >" button. This moves the category and
   all its subcategories and items from the main tree to the scratch box.
   Now click on the category on the main tree that you want to merge with
   this category. Then click the "Merge" button. This clears out the
   scratch box and merges the two categories.

5. To output an XML file - Click the "Save to XML" button and a standard
   file box should open. Enter the name and location of the XML file and
   click "Save". This will write the contents of the organization to that
   file.

Building
========

IN WINDOWS: from the command line move to the decompressed root directory.
Then type "build.bat".

IN UNIX: Edit the "Makefile" and set JAVA equal to the location of your
java bin directory. Type "make" at the command prompt.

License
=======

BSD-style. See included "LICENSE" file.

Contact
=======

I can be contacted at singram@emory.edu.
Secondary contact: akrowne@emory.edu.
Permanent contact: mhalber@emory.edu.
Stephen Ingram Emory University General Libraries The MetaCombine Project MetaCombine SparseLib++ 1.5d Patch version 3 2005-05-09 Aaron Krowne Emory University Overview ======== This patch fixes some compiler and functionality bugs in SparseLib++ 1.5d which cause it to break for the purposes of the MetaCombine-NMF clustering system. Installation ============ Get SparseLib++ 1.5d. It should be available from http://www.metacombine.org/software/. Uncompress it into a directory. Put the patch in that same directory. Then issue the patch command like: cat apk-metacombine-sparselib-1.5d-3.patch | patch -p0 If it fails, it might be because you're above the top level dir of SparseLib. If so, just change the p0 to p1. License ======= BSD. See the LICENSE file. Contact ======= Feedback is welcome. My email is akrowne@emory.edu. Secondary contact: mhalber@emory.edu. Good luck! Aaron Krowne Emory University General Libraries The MetaCombine Project MetaCombine Organization Visualizer version .3.1a 2005-05-10 Stephen Ingram Emory University Overview ======== This software allows users to visualize Organization hierarchies. Organizations are XML documents specifying a taxonomy of documents. The software provides the ability to rapidly observe the relationships between document categories. It does this by performing a principal components analysis of the centroid of each collection of categories in the organization and then plotting the categories according to these new coordinates. The software also functions as a browsing interface to an organization by allowing the user to view item listings of each category and view detailed metadata information for any item. This software is part of the MetaCombine project, which seeks to more meaningfully bring together digital library resources, helping to build more coherent services on top of them. See http://www.metacombine.org/ for more on MetaCombine.
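The principal components step mentioned in the Overview above can be illustrated with a small sketch. This is not the visualizer's actual code (which is Java and relies on Colt); it is a minimal pure-Python illustration under stated assumptions: each category is represented by the mean (centroid) of its documents' feature vectors, and the 2-D plot coordinates are the projections of those centroids onto their first two principal axes. All function names and the sample category data are hypothetical.

```python
import math


def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def covariance(centered_rows):
    """Covariance matrix of rows that have already been mean-centered."""
    dim = len(centered_rows[0])
    scale = max(len(centered_rows) - 1, 1)
    return [[sum(r[i] * r[j] for r in centered_rows) / scale
             for j in range(dim)] for i in range(dim)]


def dominant_axis(matrix, orthogonal_to=(), iterations=500):
    """Power iteration restricted to the complement of earlier axes.

    Returns a unit vector, or all zeros if no variance is left.
    """
    dim = len(matrix)
    vec = [1.0 / (i + 1) for i in range(dim)]  # arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(dim))
               for i in range(dim)]
        # Remove any part lying along axes that were already found.
        for prev in orthogonal_to:
            dot = sum(a * b for a, b in zip(nxt, prev))
            nxt = [a - dot * b for a, b in zip(nxt, prev)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:
            return [0.0] * dim
        vec = [x / norm for x in nxt]
    return vec


def pca_2d(points):
    """Project points onto their first two principal axes."""
    mean = centroid(points)
    centered = [[x - m for x, m in zip(p, mean)] for p in points]
    cov = covariance(centered)
    axes = []
    for _ in range(2):
        axes.append(dominant_axis(cov, orthogonal_to=axes))
    return [[sum(x * a for x, a in zip(row, axis)) for axis in axes]
            for row in centered]


# Hypothetical categories, each holding a few document feature vectors.
categories = {
    "algebra": [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
    "geometry": [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0]],
    "physics": [[0.0, 0.0, 1.0], [0.0, 0.0, 5.0]],
}
names = sorted(categories)
coords = pca_2d([centroid(categories[name]) for name in names])
for name, (x, y) in zip(names, coords):
    print(name, round(x, 3), round(y, 3))
```

Power iteration keeps the sketch free of external libraries; a real implementation would more likely call a linear algebra package for the eigendecomposition. The sign of each axis is arbitrary, so a plot may appear mirrored from one run or implementation to another.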
Dependencies ============ You must first ensure that Java 1.4.2_05 or later is properly installed on your system. You must also have the Java plug-in configured with your browser. If you are running Windows, you will need some method to zip/unzip files. WinZip is a popular program for this purpose. You will also need to download and build the prefuse libraries. These are a set of libraries for Java that aid in information visualization. You can find these here: http://prefuse.sourceforge.net/index.html Finally, you will have to have the Colt distribution of Open Source Libraries for High Performance Scientific and Technical Computing in Java. It can be found here: http://hoschek.home.cern.ch/hoschek/colt/ Running ======= Decompress the scheme_viz.tar.gz file. It should create a directory called scheme_viz. Copy your colt.jar file from your Colt distribution (see Dependencies section) to this directory. Also copy the prefuse.jar and prefusex.jar files from your prefuse build directory to this directory. Before an organization can be visualized, you must preprocess the organization XML. In the scheme_viz directory there are (among other things) two different scripts that preprocess the XML, one for Windows (preprocess.bat) and one for Unix (preprocess.sh). IN WINDOWS: Run the preprocessor by moving to the newly created scheme_viz directory from the command line and typing "scheme_preproc.bat organization basename", where organization is the name of your organization XML file and basename is the basename of your metadata map files. This will generate a file organization.dat. Zip this file into a file named organization.zip. IN UNIX: Move to the newly created scheme_viz directory and execute the script "./scheme_preproc.sh organization basename", where organization is the name of your organization XML file and basename is the basename of your metadata map files.
This will generate a file called organization.zip. Now that you have the file called organization.zip in the scheme_viz directory, load the file scheme_viz.html in your browser and the visualization should begin. If the applet has a red x on it, then your plug-in is not the correct version (though the applet should inform you if this happens). If you want to name your file something other than organization.zip, check the scripts and the applet parameters in the HTML for the filename and change them accordingly. Instructions ============ The scheme_viz.html file contains proper instructions for operating the visualization. Building ======== Ensure that colt.jar, prefuse.jar, and prefusex.jar are in the decompressed root directory. IN WINDOWS: From the command line, move to the decompressed root directory. Then type "build.bat". IN UNIX: Move to the decompressed root directory and type "make". License ======= BSD. See included "LICENSE" file. Contact ======= If you want to make a bug report, open the Java Console window and check if there is a stack trace printed out. Attach or include the text in an email to me. I can be contacted at singram@emory.edu. Permanent contact: mhalber@emory.edu. Stephen Ingram Emory University General Libraries The MetaCombine Project This is... OAICopy version 0.5.8 2005-08-30 by Aaron Krowne (akrowne@emory.edu) Synopsis ======== This program lets you copy an Open Archive repository [1] to a local, static archive with a single command. The command, at its simplest, has only two parameters: 1. The base URL of the remote OAI provider. 2. The path to a directory on the local system where you want to set up a new provider (this path should end with either an empty or nonexistent dir). The locally-created archive is "static" because it is based on a one-time snapshot of the data, which is stored in individual record XML files, which the OAI-XMLFile system understands.
This static repository does NOT conform to the "official" OAI static repository spec [2], because sets are supported and the provider is in all ways fully-featured and fully functional. Of course, you can edit the data in the XML files if you want, but the key point is that there is no database involved, and the records are not assembled, transformed, or generated upon request. What's the point of all this? Aside from being useful to mirror OAI repositories, this program anticipates a day when web services will abound which transform entire collections, producing new collections as output, represented by ad hoc static OAI repositories. In fact, we are working towards this on the MetaCombine [3] and OCKHAM [4] projects. For example, we are building a clustering web service that takes a (flat) OAI repository as input and produces an ad hoc OAI repository as output, which contains a set structure corresponding to a novel organization scheme. Similarly, we are building a classification web service that can train based on a set structure and records in an OAI repository, and classify correspondingly un-labelled/organized records from another repository, producing an ad hoc output repository which organizes *all* of the records into sets. Having these web services output static, ad hoc repositories makes the output instantly browsable, usable, and comprehensible. But, we do not expect that web service providers will make these ad hoc repositories available indefinitely. This is where the oaicopy command comes in: it lets you "grab" these results before they go away, and build local digital library services based on them. Further, once the results are grabbed, one can send them through another web service that takes an OAI collection as input, enhancing the records even more, and once again capturing the new output collection. 
The upshot is that entire collections are abstracted into fungible, portable objects, represented as Open Archives, and addressed by OAI provider base URLs. We expect to eventually build a "piping" system that transparently manages intermediate steps of chained web service operations on collections, using this oaicopy program as a back-end "glue" tool. References: [1] The Open Archives Initiative, < http://www.openarchives.org/ > [2] Van de Sompel, et al., Specification for an OAI Static Repository and an OAI Static Repository Gateway < http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm > [3] The MetaCombine project, < http://www.metacombine.org/ > [4] The OCKHAM project, < http://www.ockham.org/ > Usage ===== The command is used like: oaicopy < baseURL > < path > Let's make the following assumptions for an example: - the OAI archive you want to copy is available at the baseURL http://aux.planetmath.org/oai/provider-2.0.pl - your web root is /usr/lib/cgi-bin - your web root is accessible from the web as http://your-host.com/cgi-bin/ - you've made a /usr/lib/cgi-bin/providers Then you could use the command like: oaicopy http://aux.planetmath.org/oai/provider-2.0.pl /usr/lib/cgi-bin/providers/pm_mirror The command would create a /usr/lib/cgi-bin/providers/pm_mirror/ dir, and populate it with the data for the repository. You'd be able to access the new repository at the baseURL: http://your-host.com/cgi-bin/providers/pm_mirror/oai.pl The new archive is functionally configured with dummy values, but you can customize them by editing the config.xml, which is in the same directory as oai.pl. Basic things you might want to do are give the archive a meaningful name, nickname, and admin email address. Installation ============ -> This program and all of its dependencies are Perl-based. 
The following are the individual dependencies: - Net::OAI::Harvester - XML::LibXML - LWP::UserAgent - XML::SAX - for XML parsing - URI - Storable You should also probably be using Perl 5.8.0 or later, so that the UTF-8 data found in many repositories is handled properly. -> To install: Make sure you have the dependencies, then run ./install as root. The 'oaicopy' command should now work. Development Roadmap =================== 1.0 - Support calling OAI-XMLFile's configurator for archive conf. 0.9 - Command-line configuration options for archive conf. 0.8 - Rewrite parser based on SAX instead of DOM. Maybe drop Net::OAI::Harvester in favor of HTTP::OAI? 0.7 - Copying all of the metadata formats supported by an archive. 0.6 - Adopt Tom Habing's virtual repository identification/provenance conventions. 0.5 - First release. Copying works for oai_dc and sets. ChangeLog ========= 0.5.8 - Add set selection support. Mention in "finished" screen that output dir needs to be moved somewhere web-accessible to work as an OAI repository. 0.5.7 - Fix bugs in handling of symlinks. Die upon fatal file creation and linking errors. Fix top-level bug; due to how OAI-XMLFile works, records which are already in a set do not need to be additionally symlinked at the "top" level of the archive. 0.5.6 - Report version number. Update development roadmap. 0.5.5 - Handle OAI 1.1 input repositories. We still write 2.0s as output, which means you can now use oaicopy to instantly upgrade a repository. 0.5.2 - Remove trailing slash in output/dest dir specification. Minor documentation cleanups. 0.5.1 - Bug fix, "use" statement in oai.pl template file, handling of sets (manifests for set depths > 1). 0.5 - First release. Copying works for oai_dc and sets. License ======= See the "LICENSE" file. Notes ===== - The "test" collections that were included with OAI-XMLFile have been removed within this distribution.
They are really useful for understanding how OAI-XMLFile works, however, so if you want to do this, get the "real" release of OAI-XMLFile off http://www.openarchives.org/. Acknowledgements ================ This work was supported in part by the Mellon Foundation (the MetaCombine project) and NSF (the OCKHAM project). Contact ======= For help or comments please contact me, Aaron Krowne, at akrowne@emory.edu. Permanent contact is Martin Halbert - mhalber@emory.edu. The current version of this package should be available at http://www.metacombine.org/software/. Good luck! -Aaron Krowne Emory University Woodruff Library Library Systems Division Sponsored Projects Group Vectorize 1.0 2004-06-23 Aaron Krowne Emory University Overview ======== This humble little program transforms document wordlists into sparse, normalized numerical feature vectors. It automatically performs stemming and stopping on the words encountered. The feature vector format used for the output is understood by the MetaCombine NMF clustering program. However, one might want to use this program to create document feature vectors for other IR purposes. Installation ============ 1. Uncompress this archive and put it somewhere (we'll call this directory $VECTORIZE). 2. Install the MetaCombine common library in $VECTORIZE/../common. If you put it somewhere else, modify vectorize's Makefile accordingly. See http://metacluster.library.emory.edu/~akrowne/metacombine_software/ for the latest version. 3. From $VECTORIZE, type "make". Execution ========= Run like: ./vectorize basename Input is expected in the file [basename].raw. Input ===== [basename].raw - (required) The raw word lists. One list of space-separated words per line. Each line corresponds to one document, so make sure to strip newlines. [basename].sw - (optional) A list of supplementary stopwords, one per line. These words will be eliminated in the output vectors. Useful if your content has domain-specific stopwords. 
[basename].dic - (optional) A map of words to feature IDs. If you provide this, you can force the numerical feature vectors to use arbitrary indices. Otherwise, indices will be assigned on a first-come, first-served basis, and this file will be created from scratch as output. Output ====== For output you get: [basename].vec - (created) The numerical feature vectors, one per line. Each line corresponds to a line in [basename].raw. The index of this line will be the same, unless there were empty documents earlier, in which case there will be a shift (see the description of [basename].mt). [basename].dic - (created or updated) "Dictionary" mapping words to feature IDs. NOTE: if you have changed or created a [basename].stm file, you will get broken feature vectors unless you delete the [basename].dic file before you run vectorize again. [basename].stm - (created) A transformation of the [basename].raw file which is stemmed and stopped. [basename].mt - (created) Since stemming and stopping may result in empty document vectors, this file lists the indices (line numbers) of any documents which became empty through this transformation. This allows one to accurately compute the index in [basename].vec and [basename].stm that corresponds to an index in [basename].raw. NOTE: this file is extremely important... things may break mysteriously if you do not incorporate this information in systems that utilize the output of vectorize. License ======= BSD. See the LICENSE file. Contact ======= For help, comments, or enhancements, contact me at akrowne@emory.edu. Good luck! Aaron Krowne Emory University
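The index shift described for [basename].mt in the Vectorize Output section can be handled with a few lines of code. The following is a minimal sketch, not part of vectorize itself. It assumes that [basename].mt holds one integer index per line and that those indices use the same base (zero-based or one-based) as the [basename].raw indices you pass in; neither detail is stated above, so check a real .mt file before relying on it. The function names are hypothetical.

```python
from bisect import bisect_left


def load_empty_indices(path):
    """Read a .mt file, assumed here to hold one integer index per line."""
    with open(path) as handle:
        return sorted(int(line) for line in handle if line.strip())


def raw_to_vec_index(raw_index, empty_indices):
    """Map an index in [basename].raw to the matching index in [basename].vec.

    empty_indices must be sorted. Returns None when the document itself
    became empty and therefore has no line in the .vec file.
    """
    # Number of dropped documents that come before raw_index.
    dropped_before = bisect_left(empty_indices, raw_index)
    if (dropped_before < len(empty_indices)
            and empty_indices[dropped_before] == raw_index):
        return None
    return raw_index - dropped_before


# Example: documents 1 and 4 became empty after stemming and stopping.
empty = [1, 4]
for raw_index in range(6):
    print(raw_index, "->", raw_to_vec_index(raw_index, empty))
```

Because the function only subtracts the count of dropped documents that precede the requested one, it works for either index base as long as the .mt indices and the raw indices agree.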