DataSpace Fact Sheet
Robert L. Grossman
University of Illinois at Chicago and Two Cultures
September, 2001
DataSpace
The web today provides an infrastructure for working with distributed multimedia documents. DataSpace is an infrastructure for creating a web of data instead of documents. DataSpace is designed to provide an infrastructure for the analysis and mining of business, e-business, scientific, engineering, and health care data.
DataSpace has several components:
- XML languages for metadata, such as the Predictive Model Markup Language (pmml).
- The dataspace transfer protocol dstp (described below), a protocol for moving data over the web.
- The predictive scoring and update protocol psup (described below), a protocol for event driven, real time scoring.
- Open source dstp servers for making data easily available to visitors to the data web.
- Open source dstp clients for viewing remote data and mining distributed data in the data web.
The Problem
Although the amount of scientific, health care and business data is exploding, we do not have the technology today to casually explore remote data nor to mine distributed data. Not only is the total amount of data growing but so is the size of individual data sets. Ten GB data sets are common today and 100 GB - 1 TB and larger data sets are becoming common.
There are three fundamental problems:
- There are no appropriate standards for analyzing, mining, and decision making with remote and distributed data.
- The web today presents data as text tables or images which can be viewed but not directly analyzed or otherwise manipulated. To analyze data it is still generally necessary to ftp data files. This was the situation for viewing documents before the advent of the world wide web.
- The underlying engineering infrastructure for the web is optimized for viewing (multimedia) documents over the commodity internet, not large data sets over next generation high performance networks.
DataSpace solves these three problems.
DataSpace incorporates current and emerging standards for working with remote and distributed data, such as the Predictive Model Markup Language (pmml), the DataSpace Transfer Protocol (dstp), and the Predictive Scoring and Update Protocol (psup).
Today, the DataSpace infrastructure allows the remote analysis and distributed mining of (small) datasets over the commodity internet with a point and click browser interface. Over the next few years, as the amount of dark fiber grows, bandwidth is expected to become a commodity.
The DataSpace infrastructure is also designed to layer over next generation networks, enabling the casual exploration and distributed mining of even large data sets. DataSpace has been used for the sustained mining of gigabyte size data sets at over 250 Mb/s in tests using the Terabyte Challenge Testbed. It is currently being tested on the Terra Wide Data Mining Testbed at over 1 Gb/s.
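To put these rates in perspective, the short sketch below estimates how long it would take to move data sets of the sizes mentioned earlier at the quoted rates. It is only back-of-the-envelope arithmetic under stated assumptions: Mb/s and Gb/s are read as megabits and gigabits per second, sizes use decimal units (1 GB = 10^9 bytes), and protocol overhead is ignored. The function and constant names are illustrative and are not part of DataSpace.

```python
# Rough transfer-time estimate. Assumptions: decimal units, rates in bits
# per second, no protocol overhead. Names are illustrative only.

GB = 10**9       # bytes in a gigabyte (decimal)
TB = 10**12      # bytes in a terabyte (decimal)
MBPS = 10**6     # bits per second in 1 Mb/s
GBPS = 10**9     # bits per second in 1 Gb/s


def transfer_seconds(size_bytes, rate_bits_per_second):
    """Return the ideal time in seconds to move size_bytes at the given rate."""
    return size_bytes * 8 / rate_bits_per_second


if __name__ == "__main__":
    cases = [
        ("1 GB at 250 Mb/s", 1 * GB, 250 * MBPS),
        ("100 GB at 250 Mb/s", 100 * GB, 250 * MBPS),
        ("1 TB at 1 Gb/s", 1 * TB, 1 * GBPS),
    ]
    for label, size, rate in cases:
        print(f"{label}: {transfer_seconds(size, rate):.0f} seconds")
```

Under these assumptions a gigabyte moves in about half a minute at 250 Mb/s, while a terabyte still takes over two hours even at 1 Gb/s, which is one reason to move models and results rather than only raw data.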
Some Examples
Here are two examples illustrating the types of things that can be done with the DataSpace infrastructure.
Example 1. You wake up in the morning and read in the paper that scientists are speculating that there is a relation between global warming and sunspots. Using the dstp infrastructure, you can open a dstp browser, find a server containing global warming data, find another server containing sunspot data, automatically align the data by the number of days since January 1, 1900, and obtain a graph correlating the two quantities with a single click.
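The align-and-correlate step in Example 1 can be sketched in a few lines. This is a minimal illustration, not the dstp client: it assumes each column has already been fetched as a mapping from a day key (days since January 1, 1900) to a value, and it uses a plain Pearson correlation as a stand-in for whatever the browser computes. All names are hypothetical and the numbers are made-up placeholders, not real measurements.

```python
from datetime import date
from math import sqrt

EPOCH = date(1900, 1, 1)  # common key: days since January 1, 1900


def days_since_epoch(d):
    """Convert a calendar date to the shared integer key."""
    return (d - EPOCH).days


def align(col_a, col_b):
    """Keep only the keys present in both columns, in key order."""
    keys = sorted(col_a.keys() & col_b.keys())
    return [col_a[k] for k in keys], [col_b[k] for k in keys]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder columns standing in for data from two different servers.
temperature = {
    days_since_epoch(date(2000, 1, day)): value
    for day, value in [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0)]
}
sunspots = {
    days_since_epoch(date(2000, 1, day)): value
    for day, value in [(2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0)]
}

xs, ys = align(temperature, sunspots)
r = pearson(xs, ys)
print(f"{len(xs)} aligned days, correlation {r:.2f}")
```

Keying both columns on the same integer day count is what lets data from independent servers be joined without any prior coordination between them.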
Example 2. You run a global e-business and know that a celebrity campaign your company ran last week has just ended and wonder how this has affected sales. Your customer data is in a data warehouse in Chicago, your e-business data in a data warehouse in Atlanta, and your store data is in a data warehouse in Denver. All run dstp servers communicating via a virtual private network. You use a dstp browser and discover with a few quick clicks that both in-store and on-line sales dropped when the celebrity campaign ended last week, but that on-line sales dropped twice as much as in-store sales. You do not have to wait for a data warehouse report which will arrive in two weeks.
DataSpace vs. the Web
The philosophy behind DataSpace is to develop an infrastructure similar to the web, but designed to support the casual exploration of remote columns of data and the mining of distributed columns. See Table 1.
| | DataSpace | Web |
|---|---|---|
| basic object | remote & distributed columns of data | remote (multi-media) documents |
| interface | point and click | point and click |
| basic operations | a) exploring remote columns of data; b) mining distributed columns of data | viewing remote documents |
| language | a) ascii for data; b) xml for meta-data; c) Predictive Model Markup Language (pmml) for models | Hypertext Markup Language (html) |
| protocol | DataSpace Transfer Protocol (dstp) and Predictive Scoring and Update (psup) Protocol | Hypertext Transfer Protocol (http) |
| server | dstp server | http server |
| improving performance | query optimization by moving data, models & results | caching infrastructure |
Table 1. DataSpace is modeled on the web and provides an infrastructure to explore remote data and mine distributed data in the same way that the web provides an infrastructure to view remote (multi-media) documents.
DataSpace - Key Technical Concepts
Today the web provides an infrastructure for the remote viewing of multi-media documents, but does not provide a similar infrastructure for remotely exploring data. DataSpace is our attempt to provide such an infrastructure. Here is a short description of the key technical concepts:
The Terabyte Challenge Testbed
DataSpace was developed using the Terabyte Challenge Testbed, which is an open, distributed testbed for DataSpace tools, services, and protocols. It consists of ten sites distributed over three continents connected by high performance links. It has been instrumented for network measurements and provides a platform for experimental discovery of scientific, engineering, business, and e-business data. The testbed includes a variety of distributed data mining applications, including the analysis of scientific data, climate data, network data, text data, and business data.
The Terabyte Challenge testbed consists of local clusters of workstations which are connected to form wide area clusters of clusters or Meta-Clusters. The Meta-Cluster is maintained by the National Scalable Cluster Project (NSCP). The NSCP-1 Meta-Cluster was completed in 1996 and linked geographically distributed clusters using the commodity internet. The NSCP-2 Meta-Cluster was completed in 1998 and used OC-3 networks to link the clusters. The NSCP-1 and NSCP-2 Meta-Clusters have been used by a variety of scientists and engineers working on applications in high energy physics, computational chemistry, nonlinear simulation, bioinformatics, medical imaging, network traffic analysis, digital libraries of video data, economic data, business data and e-business data.
Currently, the Terabyte Challenge Testbed consists of approximately 100 nodes and 2 terabytes of disk geographically distributed among the participating sites, and connected by laboratory, campus, and national high performance networks. The Terabyte Challenge involves scientists and engineers from a number of organizations, including the University of Illinois at Chicago, University of Pennsylvania, University of California at Davis, Imperial College, Australian National University, and Magnify.
Acknowledgements
The DataSpace Infrastructure was created by the Laboratory for Advanced Computing (LAC) and National Center for Data Mining (NCDM) at the University of Illinois at Chicago working together with its academic and industrial partners. The infrastructure was tested on the Terabyte Challenge Testbed. DataSpace languages and protocols are open and managed by the Data Mining Group (DMG).
The development of DataSpace was supported in part by grants from NSF, DOE, and NASA.
For More Information
For more information, please see the DataSpace web site www.dataspaceweb.net or the web site of one of the participating organizations. You can also contact the Project Director, Robert Grossman (grossman@uic.edu or rlg@twocg.com), or the Associate Director of the Laboratory for Advanced Computing, Shirley Connelly (312 413 2176, shirley@math.uic.edu).