The Terra Mining Testbed (TMTB)
Robert Grossman
University of Illinois at Chicago and Two Cultures
July, 2001
Data Webs
The web today provides an infrastructure for the discovery, display, retrieval and examination of remote documents. Given the large and growing data resources available today, there is a growing need for a data web, an infrastructure which would provide the same ease of use for remote data that the web provides for remote documents.
DataSpace is our attempt to provide such an infrastructure. DataSpace supports a) remote data access, analysis, and mining, b) distributed data analysis and mining, and c) event driven data updates and real time scoring.
DataSpace is an open, standards based infrastructure for working with data over the web. It includes specialized protocols and languages, such as:
- The dataspace transfer protocol (dstp), a protocol for moving data over the web using both commodity and high performance networks.
- The Predictive Model Markup Language (pmml) an XML languages for handling some of the metadata required for DataSpace.
- The Data Transformation and Markup Language (dtml), an XML languages for extracting, transforming, and shaping data.
- The Predictive Scoring and Update Protocol (psup), a protocol for event driven, profile based, real time scoring.
DataSpace services built using these protocols and languages are designed to use existing and emerging standards and services for privacy, such as P3P, security, authentication, etc.
A Global, High Performance Testbed
The Terra Mining Testbed (TMTB) is an infrastructure built on top of DataSpace for the remote analysis, mining, and real time interaction of scientific, engineering, business, and other complex data. Terra mining applications are designed to exploit the capabilities provided by emerging domestic and international high performance networks so that Gigabyte and Terabyte data sets can be remotely explored in real time.
Terra mining applications use high performance clusters linked by wide area high performance networks to provide the data and compute services required. The clusters will be linked initially by OC-xy networks and shortly by optically switched networks.
Real Time Capabilities
With traditional approaches to predictive modeling, data is collected, statistical models are built, predictions are made, and the models are validated as additional data becomes available. For many problems, ranging from controlling the outbreak of an infectious disease to preventing terrorism, data arrives discretely and it is useful to update predictions immediately as soon as new data is available.
DataSpace supports an approach to predictive modeling which works with "events," which are abstractions representing new bits of information which are assumed to arrive in a real time stream. Events are aggregated to build "profiles," which are the inputs to predictive models. DataSpace supports an open standard called the Data Transformation Markup Language for updating profiles with new events in real time or near real time.
Roll Out
The Terra Mining Testbed will be built on top of DataSpace, whose current infrastructure supports the remote analysis, distributed mining, and real time exploration of scientific, engineering, and other complex data. Terra Mining applications will be designed to exploit the high bandwidth provided by emerging domestic and international networks so that Gigabyte and Terabyte data sets can be remotely explored in real time.
Terra Mining services and applications will use high performance clusters linked by wide area high performance OC-xy and optical switched networks to provide the data and compute services required. Initially, in 2001 and the first half of 2002, the Terra mining services and applications will use distributed homogeneous clusters. Beginning in the second half of 2002, heterogeneous clusters will be supported.
The Terabyte Challenge Testbed. The Terra Mining Testbed is the successor to the Terabyte Challenge Testbed, a high performance testbed for remote data analysis and distributed data mining, which we have operated for the past six years. Since 1995, we have run a week long demonstration at the yearly Supercomputing Conference. In addition, we typically have a structured demonstratation at 2 or 3 other conferences each year, as well as a variety of less structured demonstrations. Currently, the Terabyte Challenge Testbed links 8 sites with OC-3 links and includes the following sites: the University of Illinois at Chicago, National Center for Atmospheric Research (NCAR), Caltech, the University of Pennsylvania, the University of California at Davis, Internet 2 in Ann Arbor, Access Center in Arlington, the University of Virgina, and Imperial College in London. Several industrial sites also typically participate in the testbed during demonstrations. A variety of applications currently run on the Terabyte Challenge Testbed including applications involving climate data, high energy physics data, astronomical data, and health care data.
Building the Terra Mining Testbed has three components: a) upgrading some of the current Terabyte Challenge nodes to handle higher connectivity and more demanding compute and data tasks, b) adding new Terra Mining Nodes at StarLight in Chicago, SARA in Amsterdam, and other sites to be identified, and c) developing new versions of the existing dstp clients and servers for OC-12, OC-48, 1GigE, 10GigE and optically switched networks.
We plan to roll out the Terra Mining Testbed in phases:
Phase 1. In Phase 1, which will start in July 2001, the University of Illinois, Dalhousie University in Halifax, and SARA (Stichting Academisch Rekencentrum) at the University of Amsterdam will be connected at StarTap in Chicago with OC-12 links provided by Illinois' I-Wire, Canada's CA*net II network, and the Netherland's SURFnet4 network.
During Phase 1, current Terabyte Challenge Nodes will be used in all three locations and distributed data mining of 1-10 Gigabyte size data sets will be done using the current version (0.9) of our high performance dstp server. The Nodes in the Terabyte Challenge Testbed are connected via Worldcom's vBNS and Quest's Abilene networks. We anticipate that these will be upgraded during Phase 1 and Phase 2 of the Terra Mining Testbed.
Phase 2. In Phase 2, which will begin by the end of 2001, the Terra Mining Testbed will use optical links connected at StarLight in Chicago. The Amsterdam optical link will operate at 2.5 Gigabit/second optical link over SURFnet5, the Halifax optical link will operate at 1 Gigabit/second or higher over Canada's CA*net 3 network, and the Laboratory for Advanced Computing's will operate at 2.5 Gigabit/second over Illinois's I-Wire.
During Phase 2, the Terabyte Challenge Nodes will be upgraded to Terra Mining Nodes and distributed 10-100 Gigabyte data sets will be mined using an enhanced version (Version 1) of our high performance dstp server.
Phase 3a. In Phase 3a, which will begin in 2002, additional sites in the US and Europe to be determined will be added, as well as additional Terra Mining applications.
Phase 3b. In Phase 3b, which will begin in CY 2002, some of the optical links will be upgraded to 10 Gigabit/second links connected via StarLight. In addition, the Terra Mining nodes will be upgraded with new processors and additional disk, and the dstp server software will be upgraded to version 1.2.
Demonstrations. We will demonstrate the Terra Mining Testbed formally at least twice a year: at the yearly Supercomputing Conference and at a spring Networking Meeting. In addition, we will demonstrate the testbed informally from time to time each year.
Terra Application - Climate Data
Imagine waking up one morning and reading in the newspaper that scientists are reporting a relation between global warming and sunspots. Using the dstp infrastructure, you can use a dstp browser, find a server containing global warming data, another server containing sunspot data, automatially align the data by the number of days since January 1, 1900, and obtain a graph correlating the two quantities with a single click.
The data for this Terra Application is from the National Center for Atomospheric Reserach (NCAR). The particular data is from the The NCAR Community Climate Model (CCM3). This is a stable, efficient, documented, state of the art atmospheric general circulation model designed for climate research on high-speed supercomputers and select upper-end workstations.
CCM3 is a free resource for scientists and graduate students in a wide array of specialties to use in conducting global modeling experiments in their particular area of expertise without spending decades of their careers developing a complete global climate model of this complexity. Over the last 10 years, CCM0, CCM1 and CCM2 have been used by numerous scientific institutions around the world for basic research into such areas as CO2 warming and climate change, climate prediction and predictability, atmospheric chemistry, paleoclimate, biosphere-atmosphere transfer and nuclear winter.
Approximately 100 GBs of CCM3 generated data is available via a dstp server at NCAR.
This dstp server also contains data from the National Oceanic and Atmospheric Administration ( NOAA). The data consists of monthly satellite measurements of global surface temperatures, precipitation, ozone levels and vegetation index.
Terra Application - Health Care Data
A recent article in Science reported on a relation between El Nina and the incidence of malaria. El Nina effects the weather, which effects the number of mosquitoes, which effects the number of cases of malaria. The National Center for Atmospheric Research (NCAR) has El Nina data and the World Health Organization (WHO) has malaria data, but most likely it was never imagined when the data was collected that it would be used in this way.
In a Terra application, we populated a dstp server with WHO disease incidence data. It is now a simple matter with a few points and clicks to look for correlations between El Nina and the incidence of malaria, as well as to explore other equally unexpected patterns.
Terra Application - Astronomical Data
Using the Terra infrastructure, we developed an application which simultaneously works with two geographically distributed astronomical source catalogs: 2MASS (Two Micron All Sky Survey) survey data from CalTech and DPOSS (Digital Palomar Observatory Sky Survey) survey data. The 2MASS data are in the optical wavelengths (0.4 - 0.7 micron), while the DPOSS data are are in the infrared (1.2 - 2.2 micron) range.
Our goal was to create a virtual observatory supporting the statistical analysis of many millions of stars and galaxies with data coming from both surveys. In order to support this type of analysis, we must identify the same star in both surveys. It is a simple matter: if two star images are in the same position, they must be the same star. However, each star has a finite angular size in the sky, so there is a slight "fuzziness" in position, and this is exacerbated for galaxies. A typical query is of the form: "Find all pairs, one from the DPOSS catalog and one from 2MASS, whose angular separation is less than a given tolerance".
At CalTech, we populated one dstp server with 2MASS data. We also populated another dstp server with DPOSS data. The DPOSS dstp server contains a list of stars within a given region of the sky, with magnitudes in three different colors. The 2MASS dstp servers contains a list of stars in a given region, accompanied by extensive data about the stars.
The client dstp application formulates and sends requests to a compute server, which interprets the client's request, gets the relevant data from the DPOSS and 2MASS servers, performs the fuzzy join, and sends the resulting stars and galaxies back to the client.
An architecture like this which uses separate servers for data and computational tasks can scale to large numbers of distributed services, that may be communicating very large amounts of data at very high bandwidth.
Terra Application - High Energy Physics
In collaboration with the University of Pennsylvania, we have populated a dstp server with some D0 experimental data from Fermi Lab. High energy physics data is naturally divided into objects called "events," which correspond to particle collisions or suspected particle collisions. Each object consists of about 10 Kbytes of compressed numerical data which are divided into attributes which the physicists call "banks". About half of the banks contain raw or calibrated experimental data, and about half contain banks derived from the raw data.
An example of a high energy physics query is to scan approximately 12 million events and statistically select several dozen for a very detailed analysis. The result of this analysis is a histogram indicating the presence of a new particle or new properties of known particles. This cutdown requires applying a long sequence of numerically intensive queries and analyses to smaller and smaller subsets of the data. With this example, the original dataset may contain 120 Gigabytes of data while the final data set may contain just 150 Kbytes. The total size of all data sets would range between 1 and 2 Terabytes. As the data sets decrease in size, it is desirable to explore them interactively. The purpose of this Terra application is to provide an infrastructure so that physicists around the world can interactively explore distributed collections of event data and statistically analyze the banks.
For More Information
For more information, please the Project Director Robert Grossman (grossman@uic.edu or rlg@twocg.com) or the Associate Director of the Laboratory for Advanced Computing, Shirley Connelly (312 413 2176, shirley@math.uic.edu). Information about DataSpace can be found at www.dataspaceweb.net.