Online papers from this workshop are also available at http://ipdps.cc.gatech.edu/2000/datamine/index.html.
Professor Robert Hollebeek will present three examples of large-scale, data-intensive computing applications that combine large-scale storage, parallel mining, and distributed networking. These include a radiology storage infrastructure for the National Library of Medicine Next Generation Internet program, a digital government database for Census and demographic data in the City of Philadelphia, and a parallel networking project using high-speed optical networks to enable distributed parallel data computing. Professor Hollebeek is a Professor of Physics at the University of Pennsylvania and co-founder of the National Scalable Cluster Project (NSCP). The talk will end with lessons on large-scale data mining learned from the experience of NSCP.
Abstract: Data mining is the automated analysis of large volumes of data, looking for relationships and knowledge that are implicit in the data and are 'interesting' in the sense of impacting an organization's practice. Data mining and knowledge discovery on large amounts of data can benefit from the use of parallel computers, both to improve performance and to improve the quality of data selection. When data mining tools are implemented on high-performance parallel computers, they can analyze massive databases in a reasonable time. Faster processing also means that users can experiment with more models to understand complex data, and high performance makes it practical for users to analyze greater quantities of data.

This talk presents and analyzes the different forms of parallelism that can be exploited in data mining techniques and algorithms. The main goal of the talk is to discuss data mining techniques on parallel architectures and to show how large-scale data mining and knowledge discovery applications can be made scalable using the systems, tools, and performance offered by parallel processing systems. For each data mining technique (such as rule induction, clustering algorithms, decision trees, genetic algorithms, and neural networks), the possible ways to exploit parallelism are presented and discussed in detail. Finally, the talk outlines current research issues in high-performance data mining and discusses perspectives in this area.
Mohammed J. Zaki
Rensselaer Polytechnic Institute
zaki.AT.cs.rpi.edu
Vipin Kumar
University of Minnesota
kumar@cs.umn.edu
David Skillicorn
Queen's University, Canada
skill@cs.queensu.ca