The MetaCombine Project

[Figure: Cluster Visualization Scatter Plot]

Introduction

Welcome to the home page of MetaCombine, a Mellon-funded project hosted at Emory University. MetaCombine is a part of Emory's MetaScholar Initiative. The goal of MetaCombine is to experiment with methods to more meaningfully combine digital library resources and services, and, whenever possible, demonstrate the deployment of these methods. Below is an overview based on the initial project proposal to Mellon.

Overview

July 2003

Executive Summary

Emory University seeks to conduct practical experimentation with improved techniques for organization and access to scholarly information via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) as well as the World Wide Web. Through the proposed project, Emory will explore combinations of information and services at various levels of abstraction: combined search of OAI and Web resources, combined semantic clusters of information, and combined digital library components acting as a whole. Hence, the project name: MetaCombine.

Key Points

1) The MetaCombine project will assess the effectiveness of several specific semantic clustering techniques (see glossary for a description of this approach to information organization) for improving organization and access to bodies of metadata exposed via the OAI-PMH as well as Web resources. The focus will be on two prominent techniques: the support vector machine (SVM) class of algorithms and multidimensional scaling (MDS) visualization (see glossary for an overview of these methodologies).

2) This project will develop, demonstrate, and assess two approaches for providing combined search capabilities across harvested metadata and Web content: A) providing OAI access to metadata records automatically generated for Web content via semantic clustering techniques; and B) indexing combined bodies of Web content and OAI metadata.

3) Further, the project will experimentally develop and evaluate a framework for coordinating loosely coupled components of digital library services in an extensible manner, based on a new approach to using the OAI-PMH and Web services as underlying means of system integration.

4) The MetaCombine project will build on the technical expertise and working relationships with scholars accumulated in the AmericanSouth and MetaArchive projects conducted at Emory University.

5) Emory University will use and develop only open source software (OSS, see glossary for details concerning this class of software), specifically software that can subsequently be used freely by other research institutions.

6) This project can potentially have broad impacts on many other initiatives, by advancing the current understanding of several areas of digital library technology and scholarly communication.

Problems Addressed

During the course of the Mellon Metadata Harvesting Initiative projects, and generally during the first few years of experimentation with the OAI-PMH, several problems have come into focus. Through the proposed project, Emory seeks to address these problems, summarized below.

1) Problems in Categorizing and Browsing Harvested Metadata: To date, virtually no OAI-PMH harvesting project has developed effective means of browsing metadata aggregations by subject, author, or other systematic categorization scheme. These problems stem from the fact that metadata aggregations harvested from multiple institutions suffer from a lack of controlled vocabulary and authority control in the underlying Dublin Core (DC, see glossary) metadata distributed via the OAI-PMH. Without this consistency, there is no effective way to browse metadata across institutional boundaries. This problem is vexing, especially when dealing with archival collections of interest to humanities scholars. Such collections are topically narrow and deep. Archivists typically must implement their own specialized controlled vocabularies, because generalized systems (such as the Library of Congress Subject Headings, or LCSH) do not provide enough specificity. This problem has been encountered by groups ranging from the MetaScholar Initiative at Emory University to the UC San Diego Union Catalog of Art Images (UCSD UCAI) project. Automated mechanisms (such as the semantic clustering experiments described in the next section) for categorizing multi-institutional aggregations of metadata could potentially be applied ex post facto to metadata aggregations to remediate these problems.

2) Problems in Searching Across OAI and Web Realms: The OAI-PMH is a nearly perfect mechanism for distributing metadata from databases and other dynamic content management systems. While the protocol is steadily increasing in deployment and availability, there is an enormous and still growing realm of Web content providers that are unlikely to soon expose their metadata via the protocol. This produces problems for services attempting to provide comprehensive search capabilities for some targeted subject domain. Specifically, a lot of the information that researchers would ideally like to be able to search is spread across the separated OAI and Web realms.

3) Problems in Coordinating Federated Digital Library Infrastructures: The success of lightweight protocols (see glossary) like the OAI-PMH has accomplished two things: provider services based on such protocols have proliferated, and integrating services have struggled to find effective models and frameworks for attempting to amalgamate such distributed provider services into larger systems that work as a virtual whole. Digital libraries are still evolving at a rapid pace and will likely remain loose assemblies of distributed infrastructures for some time to come. Current consortial efforts to establish interoperable digital library federations like the AmericanSouth.Org system must proceed from the fundamental assumption that such federations will be loosely coupled. Tightly coupled systems with strong underlying programming infrastructures are not practical given the foreseeable level of coordination between digital library efforts and the rate of change we are experiencing. Libraries need more mechanisms like the OAI-PMH that can provide abstraction interfaces (see glossary) via lightweight protocols. This approach offers the promise of enabling interoperability among systems that will remain loosely coupled.

Plan of Work

The project will produce three broad deliverables, described below. Each deliverable will include

1) development of a working experimental infrastructure applied to either or both the AmericanSouth.Org and MetaArchive.Org scholarly portals,

2) assessment by means of multiple techniques, and

3) reporting results to the profession, both in conference presentations and in the professional literature.

A. Semantic Clustering Experiments

Summary: In this experiment, open source semantic clustering software will be applied to several collections of information aggregated from multiple institutions in order to categorize this information and make it browsable by researchers.

Background: Semantic clustering techniques appear to be a promising approach to remediate the problems of categorization described above in relation to harvested metadata. The most prominent and successful technique that has emerged in recent years for semantic clustering is the support vector machine (SVM) class of algorithms. SVM is the best currently known technique for automatically categorizing information, and is currently anticipated to be a powerful tool for automated organization of metadata. Another long-standing technique for semantic clustering is multidimensional scaling (abbreviated MDS). MDS provides a simple technique for graphically displaying the similarities and relationships of clustered information, as opposed to simply categorizing information for related item browsing. MDS is therefore graphically complementary to the results of SVM categorization.

Rationale: Applying the SVM and MDS techniques to a series of metadata and Web information aggregations will constitute a valuable experiment in organizing unstructured information for purposes of scholarly communication. The collections of information under consideration below are typical in that they lack effective overall classification categories, and therefore cannot be browsed by subject category.

Benefits: This is a practical experiment in that we hope to use the organized information resulting from these experiments in the AmericanSouth scholarly portal. The experiment further has broad applicability and therefore potential benefit to many other projects seeking to organize unstructured information for scholarly communication purposes. An example of such projects is the UCSD UCAI project mentioned previously. Emory intends to share information on this topic with the UCAI and similar projects for mutual benefit.

Details: This experiment will use open source semantic clustering software. Emory will conduct an evaluation of the many available SVM software tools (see http://www.support-vector.net/software.html), and select one or more for this experiment. MDS is a general statistical technique that is supported by many open source statistical environments (an example is the R language environment, see http://www.r-project.org). Emory will use SVM to categorize various combinations of information of scholarly interest in the study of the culture and history of the American South and make this information browsable. Because SVM is most effective when subject experts provide guidance and feedback to the system, Emory will employ the scholarly design team of the AmericanSouth project to train the SVM system to produce effective interdisciplinary categories of use to humanists. A system for MDS visualization of clustered information will be developed and applied to the results of these clustering experiments for graphical display and comprehension of results.
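As a rough illustration of the expert-guided categorization step described above, the following sketch trains a tiny linear SVM by subgradient descent on the hinge loss, using bag-of-words features. It is a toy under stated assumptions, not the project's method: the proposal leaves the choice of SVM tool open, and the sample titles, the two category labels, and every function name below are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def featurize(text):
    """Bag-of-words features as a token -> count mapping."""
    return Counter(tokenize(text))


def train_linear_svm(examples, epochs=200, lam=0.01, lr=0.1):
    """Train a linear SVM by subgradient descent on the L2-regularized hinge loss.

    examples: list of (text, label) pairs, label in {+1, -1}.
    Returns (weights, bias).
    """
    weights = Counter()
    bias = 0.0
    data = [(featurize(text), label) for text, label in examples]
    for _ in range(epochs):
        for feats, y in data:
            score = sum(weights[w] * c for w, c in feats.items()) + bias
            # Regularization step: shrink every weight slightly.
            for w in list(weights):
                weights[w] *= 1.0 - lr * lam
            # Hinge-loss step: update only when the margin is below 1.
            if y * score < 1.0:
                for w, c in feats.items():
                    weights[w] += lr * y * c
                bias += lr * y
    return weights, bias


def predict(weights, bias, text):
    """Return +1 or -1 for a piece of text."""
    feats = featurize(text)
    score = sum(weights[w] * c for w, c in feats.items()) + bias
    return 1 if score >= 0 else -1


# Hypothetical expert-labeled titles: +1 = "music", -1 = "agriculture".
TRAINING = [
    ("Blues recordings from the Mississippi Delta", 1),
    ("Gospel quartet songs and hymn singing", 1),
    ("Bluegrass fiddle tunes field recordings", 1),
    ("Cotton farming and sharecropping records", -1),
    ("Tobacco crop reports and farm ledgers", -1),
    ("Plantation farm account books", -1),
]

weights, bias = train_linear_svm(TRAINING)
for title in ("Delta blues songs", "Farm ledgers and crop reports"):
    label = predict(weights, bias, title)
    print(title, "->", "music" if label == 1 else "agriculture")
```

A real experiment would presumably use one of the evaluated SVM packages, many more labeled records, and one classifier per category. The sketch only shows where the expert-supplied labels enter the process.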

Specific experiments: Emory will undertake the following specific experiments with semantic clustering techniques applied to scholarly information.

1) AmericanSouth Metadata Clustering. The purpose of this experiment is to test whether or not SVM OSS can (with minimal guidance from experts) effectively categorize and make browsable all DC metadata harvested in AmericanSouth, for use by scholars researching the culture and history of the American South. The resulting body of winnowed and categorized metadata will be made browsable via the derived categories. Effectiveness of the categories will be gauged by feedback from scholarly consultants (see section on staff resources).

2) AmericanSouth Web Clustering. This experiment will test whether SVM OSS can categorize and make browsable the crawled Web content represented in the Web links section of the AmericanSouth portal (Web sites identified by scholarly consultants as including content of scholarly research value), tested under conditions similar to those listed in A1 above. The process and results of clustering Web content as opposed to metadata will be compared to understand similarities and differences.

3) Multi-Dimensional Visualization. A system will be developed to test whether MDS graphical display can effectively visualize the SVM-derived subject categories produced by both of the above experiments. If the resulting display conveys relationships in a comprehensible and worthwhile way, the results of the following clustering experiments will also be visualized.

4) AmericanSouth Combined Clustering. This experiment will test if SVM OSS can categorize and make browsable a union set of harvested metadata and crawled Web content, to evaluate whether SVM semantic clustering can effectively organize such disparate information sets. This builds on the findings of MetaScholar Initiative projects to date, namely that both harvested metadata and crawled Web content should be integrated for comprehensive scholarly information discovery purposes.

5) American Memory Metadata Clustering. Finally, the project will conduct an experiment to test whether the DC metadata available from the American Memory OAI provider can be effectively culled and categorized for use by scholars researching the culture and history of the American South (as opposed to generalized American Studies). The resulting body of winnowed and categorized metadata will also be made browsable via the derived categories.
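The MDS visualization step in experiment 3 can be sketched as follows. This is a minimal classical (Torgerson) MDS in pure Python, assuming a symmetric matrix of pairwise dissimilarities between categories. The category names and dissimilarity values are hypothetical, and the proposal itself names R as one candidate environment, so this shows the technique rather than the project's implementation.

```python
import math


def classical_mds(dist, dims=2, iters=500):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (column means equal row means by symmetry).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration from a fixed, normalized start vector.
        v = [1.0 + 0.1 * i for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract in this direction
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))
        # Coordinates along this axis are the eigenvector scaled by sqrt(eigenvalue).
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][d] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Hypothetical categories and pairwise dissimilarities.
CATEGORIES = ["Music", "Religion", "Agriculture", "Labor"]
DISSIMILARITY = [
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
]

for name, (x, y) in zip(CATEGORIES, classical_mds(DISSIMILARITY)):
    print(f"{name:12s} x={x:.2f} y={y:.2f}")
```

The resulting (x, y) pairs are what a scatter plot of categories would draw. The orientation of the axes is arbitrary, since MDS preserves only the distances between points.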

B. Experiments with Combined OAI / Web Search

Summary: Open source tools will be used to make a combination of harvested metadata and crawled Web content searchable in the context of the AmericanSouth portal.

Background: As mentioned, the MetaScholar Initiative has concluded that both harvested metadata and crawled Web content should be integrated for comprehensive scholarly information discovery purposes. However, this presents a challenge, as the two types of information are fundamentally different in nature, metadata being an abstraction of content, and Web pages being an instantiation of content. Both of the component tasks (harvesting/searching metadata, and crawling/searching Web content) are now relatively well understood. What is not well understood is the task of combined searching of these information realms.

Rationale: Experiments to bridge the OAI and Web information realms are needed by the MetaScholar Initiative, and would benefit other groups. There are two obvious ways that the OAI and Web realms might be bridged: subsuming OAI into the Web, or subsuming the Web into OAI. Each of these approaches should be evaluated.

Benefits: The findings of this experiment will have great practical benefit for the AmericanSouth portal, and will have potential application to virtually any other project seeking to automatically assemble a large amount of information for scholarly information discovery purposes. Emory intends to share information on this topic with other projects for mutual benefit.

Details: There are a variety of open source software tools that can usefully be tested for this purpose. The MetaScholar Initiative has accumulated extensive experience with the ARC software for OAI-PMH metadata harvesting and searching from Old Dominion University. Old Dominion plans to release a new, re-architected version of the software termed ARCHON that may include some capabilities for integrated metadata harvesting and Web crawling. There are a large number of open source Web search engines [Morgan, 2001] that could be adapted for this experiment. Finally, DP9 is a gateway service developed by Old Dominion University that enables indexing of an OAI data provider by an Internet search engine (see glossary for more information). DP9 is the logical mechanism to test the case of making the relevant OAI metadata searchable via the Web context. DP9 has not been tested by groups beyond Old Dominion to date.
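For readers unfamiliar with the harvesting side of these experiments, the sketch below shows the general shape of an OAI-PMH `ListRecords` request and how unqualified Dublin Core fields can be pulled from a response. It does not use ARC or DP9 and makes no network call. The base URL, the sample response, and the function names are assumptions made for illustration. Only the verb, the `metadataPrefix` parameter, and the XML namespaces come from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Extract identifier, datestamp, titles, and subjects from a response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            continue  # deleted records carry a header but no metadata
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "datestamp": rec.findtext("oai:header/oai:datestamp", default="", namespaces=NS),
            "title": [e.text or "" for e in dc.findall("dc:title", NS)],
            "subject": [e.text or "" for e in dc.findall("dc:subject", NS)],
        })
    return records


# Hypothetical, hand-written response from a placeholder provider.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2003-07-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Letters from a Georgia farm</dc:title>
          <dc:subject>Agriculture</dc:subject>
          <dc:subject>Correspondence</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

print(list_records_url("http://example.org/oai"))
for record in parse_list_records(SAMPLE_RESPONSE):
    print(record["identifier"], record["title"], record["subject"])
```

A full harvester would also follow `resumptionToken` values across partial responses, which this sketch omits.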

Specific experiments: Two experiments will be undertaken in this area.

1) Combined Search Via Web Crawling. This experiment will test whether or not an open source Web search engine can be effectively applied to the union of the AmericanSouth harvested metadata (exposed via the DP9 gateway service) as well as the Web content represented by the AmericanSouth Weblinks. Focus groups of graduate researchers and scholarly consultants will evaluate the effectiveness of the resulting combined search system for scholarly discovery purposes.

2) Combined Search Via OAI-PMH. In this experiment, Emory will create an OAI provider for the metadata resulting from the experiment in clustering Web content (experiment A2 above), and this metadata will be harvested and made searchable together with the existing metadata in AmericanSouth. Focus groups of graduate researchers and scholarly consultants will evaluate the effectiveness of the resulting combined search system for scholarly discovery purposes.
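Both experiments above end in the same place: a single index that answers queries over harvested metadata and Web content together. The sketch below is one minimal way to picture that index, as a toy inverted index that records the source realm of each document. The class name, document identifiers, and sample texts are hypothetical, and a real deployment would rely on an existing open source search engine rather than code like this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class CombinedIndex:
    """Toy inverted index over documents from both the OAI and Web realms."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document ids
        self.sources = {}                 # document id -> "oai" or "web"

    def add(self, doc_id, source, text):
        """Index one document, remembering which realm it came from."""
        self.sources[doc_id] = source
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return (doc_id, source) pairs containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted((doc_id, self.sources[doc_id]) for doc_id in hits)


index = CombinedIndex()
# A harvested metadata record: only descriptive fields are available.
index.add("oai:example.org:item-7", "oai", "Civil rights movement photographs")
# A crawled Web page: the full text of the content itself is available.
index.add("http://example.org/essay.html", "web",
          "An essay on the civil rights era in Atlanta")
index.add("oai:example.org:item-9", "oai", "Cotton farming ledgers")

for doc_id, source in index.search("civil rights"):
    print(source, doc_id)
```

Keeping the source label on each hit lets an interface show whether a result is a description of an item or the item's own text, which is the distinction the Background paragraph draws between the two realms.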

C. Federated Digital Library Framework Experiments

Summary: A framework for loosely coupled federations of digital libraries will be iteratively developed as an improved mechanism for coordinating components of such an infrastructure.

Background: The success of lightweight protocols like the OAI-PMH has accomplished two things: provider services based on such protocols have proliferated, and integrating services have struggled to find effective models and frameworks for attempting to amalgamate such distributed provider services into larger systems that work as a virtual whole. There has been increasing attention to this problem in the research community. [Fox, 2002 and Castelli, 2002]

Rationale: The MetaScholar Initiative and other distributed/federated digital library infrastructures need better organizing frameworks for coordinating the operations of loosely coupled constituent systems, and enabling an extensible scheme for specifying proposed additions to such infrastructures. Experiments to utilize emerging industry standards such as the Web services framework (see glossary) and research standards such as the OAI-PMH would address this need.

Benefits: This experimental work will contribute to Emory’s efforts to increase the usefulness of the Internet for scholars, and might have broader impacts on humanities scholarship if it succeeds. If the framework developed is flexible enough that various digital library services can interact modularly and share information, then a large number of initiatives stand to benefit. As a hypothetical example, if the Perseus Digital Library and AmericanSouth.Org could collaboratively build up interoperable Web services, both digital libraries would benefit.

Details: There are a number of promising new standards that will be utilized in this experiment. The OAI-PMH will be used as the underlying mechanism for distributing configuration and status information of virtual digital library systems. Web Services Description Language (WSDL, see glossary) expressions will be disseminated via this OAI-PMH mechanism, representing the metadata for the digital library services of modular federations. The master configuration specifications for this framework will be expressed using the 5SL standard developed by Virginia Tech. [Fox, 2002]
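One way to picture WSDL being disseminated through the OAI-PMH is an ordinary OAI-PMH record whose metadata payload is a WSDL document instead of Dublin Core. The sketch below builds such a record. It is a guess at the general shape only: the proposal does not specify a record layout, the identifier, service name, and target namespace are hypothetical, and the 5SL configuration layer is not represented. The two namespace URIs are those of OAI-PMH 2.0 and WSDL 1.1.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace


def service_record(identifier, datestamp, service_name, target_namespace):
    """Build an OAI-PMH <record> whose metadata is a skeletal WSDL description."""
    record = ET.Element(f"{{{OAI_NS}}}record")
    header = ET.SubElement(record, f"{{{OAI_NS}}}header")
    ET.SubElement(header, f"{{{OAI_NS}}}identifier").text = identifier
    ET.SubElement(header, f"{{{OAI_NS}}}datestamp").text = datestamp
    metadata = ET.SubElement(record, f"{{{OAI_NS}}}metadata")
    # Skeleton only: a real WSDL file would also define messages,
    # port types, bindings, and the service's ports.
    definitions = ET.SubElement(
        metadata,
        f"{{{WSDL_NS}}}definitions",
        {"name": service_name, "targetNamespace": target_namespace},
    )
    ET.SubElement(definitions, f"{{{WSDL_NS}}}service", {"name": service_name})
    return record


# Hypothetical clustering service advertised by a federation member.
record = service_record(
    "oai:example.org:service-1",
    "2003-07-01",
    "ClusterService",
    "http://example.org/cluster",
)
print(ET.tostring(record, encoding="unicode"))
```

Under this reading, a coordinating service could harvest such records with the same `ListRecords` machinery used for content metadata and learn which services each member offers.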

Specific experiments: Emory will undertake two experiments:

1) Initial Federated Framework. An initial framework will be designed and implemented in the MetaArchive portal as a means of dynamically configuring virtual digital library federations. The only services that this initial framework will necessarily provide as modules are a federation coordinating service, an interface to a semantic clustering service, and a combined OAI/Web search service. The experimental test for this framework is whether it effectively enables the rapid and flexible creation of new federated digital libraries. Through this work, Emory seeks to develop a simple framework based on OAI that is both lightweight enough to be easily added to existing services and an effective means of configuring recombinant federations of digital library services.

2) Revised Federated Framework. Emory will design and implement a revised framework that will attempt to include targeted connection modules for a selection of other digital library services, such as the CLiMB toolkit from Columbia and the NITLE semantic indexing toolkit. The experimental test here is simply whether or not a working system can be devised incorporating these other tools in addition to the previous tool set. Through this work, Emory will explore the feasibility of a lightweight strategy for federating digital library services more generally, in the same way that the OAI-PMH enables simple federation of metadata resources. If this can be demonstrated, it will constitute a powerful approach for integrating digital library services across institutions.
