next up previous
Next: WebMiner Algorithm Pseudo-Code Up: Data Structures for WebMiner Previous: Data Structures for WebMiner

Candidate HashTree

The candidates are stored in a HashTree to facilitate fast counting. An internal node of the hash tree at depth d contains a hash table whose cells point to nodes at depth d+1. The size of the hash table in each internal node is denoted as ${\cal H}$. All the word-sets are stored in the leaves in a sorted linked-list.

To insert a word-set in ${\cal C}_k$, we start at the root and hash on successive items to follow the appropriate child. At depth d we hash on the d-th item in the word-set. We do this until we reach a leaf. The new word-set is then inserted in the linked list of that leaf. If the number of word-sets in that leaf exceeds a threshold value, that node is converted into an internal node, i.e., we create a new hash table for that node, and then rehash (on the d-th item) the list of word-sets so that they go to appropriate children. There is one exception where we do not convert a leaf node to an internal node even if the threshold is exceeded. This is the case if the depth of the hash tree would exceed the length of the word-sets (which is the same as the iteration number), i.e., we insure that the maximum depth of the tree in iteration k is k.

The hash table size of the hash tree ${\cal C}_k$ for each internal node should be chosen as follows:

\begin{displaymath}
{\cal H} = \left[ 
 \frac{{\cal F}_{k-1} \cdot ({\cal F}_{k-1} - 1)}{2T} 
 \right]^{1/k}\end{displaymath}

where ${\cal H}$ (actually $\lceil \cal H \rceil$) is the hash table size, T is the leaf threshold, and ${\cal F}_{k-1}$ denotes the size of the set of frequent word-sets of length k-1 (from the previous iteration). Usually threshold T should be small. You can use the value T=2.

Let's look at the hash tree that would be generated in iteration 3 in our example above. First there are 8 word-sets in ${\cal F}_2$. Therefore the hash table size is given as

\begin{displaymath}
H = \left\lceil [ (8\cdot 7) / (2\cdot 2)]^{1/3} \right\rceil 
= \left\lceil 14^{1/3} \right\rceil = 3\end{displaymath}

We will use the hash function $X \bmod 3$ for each hash table in an internal node. The figure below shows how the tree grows as the candidates are added to it.


  
Figure 1: Hash Tree Building
\begin{figure}
 \centerline{
 
\psfig {figure=HT.eps,height=7.5in,width=6in}
}\end{figure}

The tree starts out as the root being a leaf which will store word-sets. We insert 123, and then 124, as shown.

When we try to insert 125, we see that the threshold T=2 for the leaf capacity is exceeded, at this point we have to turn the root into an internal node with a hash table of size 3. We then have to rehash 123, 124, and 125 based on their first item (1). In this case all three have the same first item, and they end up in the same leaf list. Since the capacity of the leaf is exceeded, we again have to rehash, turning the leaf node into an internal node. But this time we hash based on the second element (2). Once again all word-sets end up in the same leaf. For the third time, we rehash based on the third element of the word-sets. For 123, the third element is 3, and it goes into cell 0; similarly, 124 and 125 end up in leaf pointed to by cell 1 and 2, respectively.

We add the next wordset, 134 at the right spot by first hashing on 1 and then on 3. Finally for 234, when we hash on 2 we hit a leaf and we put 234 there.


next up previous
Next: WebMiner Algorithm Pseudo-Code Up: Data Structures for WebMiner Previous: Data Structures for WebMiner
Mohammed Zaki
10/30/1998