MDV: Radial Layout

Instructions

Click a cluster that has a plus-sign on it to navigate to its sub-clusters.
Click any blank area of the graph (anywhere there isn't a cluster) to navigate to a higher tier in the cluster hierarchy. [Nothing will happen if you're at the highest tier.]
Right-click a cluster to bring up detailed cluster information.
Hold the Shift key and click a cluster to make it the center (selected) cluster.
The vertical slider on the left controls cluster size: up is bigger, down is smaller.
The checkbox at the top makes cluster sizes uniform.
The horizontal slider controls the graph-distance threshold. See the Technical Explanation to understand what this does.
The combo-box at the bottom of the graph controls the color scheme of the clusters.
Hold your mouse over a cluster to read the full cluster label.

Intuitive Explanation

The group of circles in the area above is a graph representing semantic clusters. You can think of semantic clusters as groups of documents that share common subject matter. The sizes of the clusters show the number of documents in that cluster relative to other clusters. The cluster labels detail the subject matter of each cluster.

The position of the clusters in the graph shows the "distance" between a selected cluster and the other clusters. The selected cluster is the one with the red border. You can make any cluster the selected cluster by holding the Shift key and clicking it. How close another cluster is drawn to the selected cluster reflects the "semantic distance" between them. Semantic distance measures how much the word content of two documents differs. For example, suppose the following three documents were the sole contents of three semantic clusters, respectively:

  1. A speech by Bugs Bunny impersonating FDR.
  2. A speech by Daffy Duck.
  3. A speech by FDR.

If the speech by Daffy Duck were the selected cluster, then the speech by Bugs Bunny would be closer than FDR's speech. This is because the two cartoon speeches likely contain similar words and references to things like Elmer Fudd. In reality, clusters contain many documents, so all of their features are averaged together into a single set of features characterizing the group of documents as a whole.
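The averaging and distance comparison described above can be sketched in a few lines of Python. This is only an illustration: the word-count features, the four-word vocabulary, and the three toy "speeches" are hypothetical, and the real feature extraction is not described on this page.

```python
import math
from collections import Counter

def word_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def centroid(vectors):
    """Average a cluster's document vectors into one feature vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy vocabulary and documents, one document per cluster.
vocabulary = ["fudd", "carrot", "nation", "fear"]
bugs = word_vector("fudd carrot nation fear", vocabulary)
daffy = word_vector("fudd fudd carrot", vocabulary)
fdr = word_vector("nation nation fear fear", vocabulary)

# With Daffy selected, the Bugs cluster ends up nearer than the FDR cluster.
print(euclidean(daffy, bugs) < euclidean(daffy, fdr))
```

For a cluster with several documents, `centroid` would be applied to the documents' vectors first, and the distances would then be taken between centroids.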

The color of a cluster is determined by a mathematical technique called PCA (Principal Components Analysis), which is used to find general trends in the data. While the placement of the clusters tells you one thing (the "distance" to the center cluster), PCA tries to tell you something a little more subtle. An individual color doesn't have any exact meaning, but similar colors signal a strong relationship between two clusters. The result is that clusters with similarities have similar colors.
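As a rough sketch of how PCA can give similar clusters similar colors, the Python below projects each cluster vector onto the first principal component and rescales the result to a position between 0 and 1 on a color scale. The page does not say how the applet turns PCA output into colors, so the use of a single component, the power-iteration approximation, and the red-to-blue hue mapping are all assumptions made for illustration.

```python
import colorsys
import math

def first_principal_component(vectors, iterations=200):
    """Approximate the top principal component by power iteration."""
    n, dims = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dims)]
    centered = [[v[d] - mean[d] for d in range(dims)] for v in vectors]
    # Start from the centered row with the largest norm.
    direction = max(centered, key=lambda row: sum(x * x for x in row))
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, direction)) for row in centered]
        updated = [sum(s * row[d] for s, row in zip(scores, centered))
                   for d in range(dims)]
        norm = math.sqrt(sum(x * x for x in updated))
        if norm == 0:  # all vectors identical: no direction to find
            break
        direction = [x / norm for x in updated]
    return mean, direction

def color_positions(vectors):
    """Map each vector to a 0..1 position along the first component."""
    mean, direction = first_principal_component(vectors)
    scores = [sum((x - m) * w for x, m, w in zip(v, mean, direction))
              for v in vectors]
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0
    return [(s - low) / span for s in scores]

def scale_to_rgb(position):
    """Hypothetical color scale: hue runs from red (0) to blue (1)."""
    return colorsys.hsv_to_rgb(position * 2 / 3, 1.0, 1.0)

# Hypothetical cluster vectors: the first two and the last two are similar.
clusters = [[5, 5, 0, 0], [5, 4, 0, 1], [0, 0, 5, 5], [1, 0, 4, 5]]
positions = color_positions(clusters)
colors = [scale_to_rgb(p) for p in positions]
```

The sign of a principal component is arbitrary, so which end of the scale a group lands on carries no meaning. Only the closeness of two positions does.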

An important feature of the graph is that it is hierarchical. This means that the clustering algorithm is run again on the documents that make up a single cluster, resulting in sub-clusters. In the above graph, if a cluster has a plus-sign on it, then it contains sub-clusters. Clicking any of these clusters lets you navigate to that cluster's sub-space, which resides on a lower level of the hierarchy. Clicking anywhere on the graph where there isn't a cluster (i.e. blank space) returns you to the previous level in the cluster hierarchy.

Technical Explanation (not necessary for using the graph, read only if you're curious)

Preliminary: To understand the exact criteria for placing the clusters, you will need to know what Euclidean distance is, and what a vector and a node-and-edge graph are.

Internally, each semantic cluster is represented by a vector in a high-dimensional space (over 30,000 dimensions). First, every semantic cluster becomes a graph node, which we add to our graph. Then, for every pair of graph nodes, we calculate the Euclidean distance between their vectors. If this value is under a certain threshold, we add an edge between these two nodes to the graph. The threshold is controlled by the horizontal slider. This is how the internal node-and-edge graph is constructed.
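The graph construction can be sketched as follows. This is a minimal illustration rather than the applet's actual code: the function names and the tiny two-dimensional vectors are hypothetical. It also assumes that an edge links two clusters whose distance is within the threshold, so that edges connect semantically close clusters and raising the slider adds edges.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_graph(vectors, threshold):
    """Return adjacency sets for the cluster graph.

    Assumption: two clusters are joined when their distance is at or
    below the threshold.
    """
    graph = {name: set() for name in vectors}
    for a, b in combinations(vectors, 2):
        if euclidean(vectors[a], vectors[b]) <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical clusters in two dimensions instead of 30,000.
vectors = {"A": [0, 0], "B": [3, 4], "C": [10, 0]}
graph = build_graph(vectors, threshold=6)
# A and B are 5 apart and get an edge; C is too far from both.
```

Every pair of nodes is compared once, so the work grows with the square of the number of clusters on screen.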

To place this graph on the screen, we pick a node, which we called the "selected cluster" above. Using this node as the root, we construct a tree from the internal graph. Then, on the actual screen, we define a center where we draw the selected cluster. Next, for each level of the tree (the root being level 0), we define a circle of radius r × n, where r is some base radius and n is the tree level, and draw all of that level's nodes on this circle. In other words, a node is placed on the Nth circle out from the center if its path to the selected cluster in the tree has N edges.
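The placement step can be sketched like this. The page does not say how the tree is built or how nodes are spread around each circle, so the breadth-first search and the even angular spacing here are assumptions, and the names `tree_levels`, `radial_layout`, and `base_radius` are hypothetical.

```python
import math
from collections import deque

def tree_levels(graph, root):
    """Breadth-first search: map each reachable node to its tree level."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(graph[node]):
            if neighbor not in levels:
                levels[neighbor] = levels[node] + 1
                queue.append(neighbor)
    return levels

def radial_layout(graph, root, center=(0.0, 0.0), base_radius=100.0):
    """Place level-n nodes on a circle of radius base_radius * n."""
    rings = {}
    for node, level in tree_levels(graph, root).items():
        rings.setdefault(level, []).append(node)
    positions = {}
    for level, nodes in rings.items():
        radius = base_radius * level
        for i, node in enumerate(nodes):
            # Assumption: nodes are spaced evenly around their circle.
            angle = 2 * math.pi * i / len(nodes)
            positions[node] = (center[0] + radius * math.cos(angle),
                               center[1] + radius * math.sin(angle))
    return positions

# Hypothetical graph: A is the selected cluster, D is two edges away.
graph = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}
positions = radial_layout(graph, "A")
```

Nodes with no path to the selected cluster never receive a tree level, so this sketch simply leaves them out of the layout.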
