next up previous
Next: Candidate HashTree Up: No Title Previous: Data Analysis

Data Structures for WebMiner

We will now describe the data structures that you must use for implementing the various steps of the WebMiner algorithm.

A word-set of length k has a vector of ints of length k for storing the different elements, and it has a count field for storing the frequency. The word-set is sorted. Note that you will never have to explicitly sort the word-set. The nature of the candidate generation procedure should guarantee that the new word-sets in each step are always sorted.

The set of frequent word-sets of length k, ${\cal F}_k$, should be stored as a sorted vector or linked-list of word-sets. For efficiency, store new elements using push_back(), and then sort the vector before printing.

The set of candidate word-sets of length k, ${\cal C}_k$ must be stored in a HashTree data structure, described below.



 

Mohammed Zaki
10/30/1998