* Faculty       * Staff       * Contact       * Institute Directory
* Research Groups      
* Undergraduate       * Graduate       * Institute Admissions: Undergraduate | Graduate      
* Events       * Institute Events      
* Lab Manual       * Institute Computing      
No Menu Selected

* Research

Ph.D. Theses

Efficient Algorithms for Mining Arbitrary Shaped Clusters

By Vineet Chaoji
Advisor: Mohammed J. Zaki
July 28, 2009

Clustering is one of the fundamental data mining tasks. Traditional algorithms approach clustering as an optimization problem, wherein the objective is to minimize certain quality metrics such as the squared error. For clusters that have arbitrary shapes, such a strategy does not work well. Arbitrary shaped clusters are prevalent within spatial or shape data gathered from Geographic Information Systems, satellite imagery, epidemiology, sensors, and so on. In addition to the complex shapes, some of the above applications generate large volumes of data. Existing methods for identifying arbitrary shaped clusters suffer either in terms of the memory or time complexity, which can be quadratic or even cubic. This shortcoming has restricted these algorithms to datasets of moderate sizes.

In this thesis, we propose two simple yet scalable clustering algorithms for capturing arbitrary shaped clusters. The first algorithm, SPARCL, consists of two stages -- the first stage runs a carefully initialized version of the Kmeans algorithm to generate many small seed clusters. The second stage iteratively merges the generated clusters to obtain the final shape-based clusters. SPARCL has a linear space and time complexity. The second algorithm aims to identify the "intrinsic structure" of the spatial data, by iteratively merging data points under a systematic process. Identifying clusters from the intrinsic structure becomes an easier task.

* Return to main PhD Theses page