
WebPage Data

The input to the program consists of a set of web pages, each associated with a list of keywords that occur in that page. For example, consider the web pages of Computer Science (CS) and Electrical Engineering (EE) departments at various schools. By looking at the keywords the pages have in common, we want to automatically label each web page as belonging to the category "CS" or "EE". Consider the following database of web pages and their keywords:
RPI_1  -- computer_vision, computer_systems, semiconductors
RPI_2  -- computer_vision, computer_systems, programming_languages,
          computation_theory
MIT    -- computer_vision, computer_systems, programming_languages,
          computation_theory, semiconductors
NWU    -- computer_vision, computer_systems, programming_languages,
          semiconductors
UofR_1 -- computer_vision, computer_systems, programming_languages,
          computation_theory
UofR_2 -- computer_vision, computer_systems, semiconductors

For simplicity let's assign each web page a unique number as follows:

RPI_1  = 1
RPI_2  = 2
MIT    = 3
NWU    = 4
UofR_1 = 5
UofR_2 = 6

Let's also assign each keyword a unique number as follows:

computer_vision         = 1
computer_systems        = 2 
programming_languages   = 3 
computation_theory      = 4
semiconductors          = 5
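
The two numberings can be sketched in a few lines of Python. This is only an illustration, not part of the program's specification: the names PAGE_NAMES, KEYWORD_NAMES, page_ids, keyword_ids and encode_keywords are hypothetical, and the sketch assumes that each number is simply the 1-based position of the name in the lists given above.

```python
# Names in the order the text lists them; the numbering is assumed to be
# each name's 1-based position in its list.
PAGE_NAMES = ["RPI_1", "RPI_2", "MIT", "NWU", "UofR_1", "UofR_2"]
KEYWORD_NAMES = [
    "computer_vision",
    "computer_systems",
    "programming_languages",
    "computation_theory",
    "semiconductors",
]

page_ids = {name: i for i, name in enumerate(PAGE_NAMES, start=1)}
keyword_ids = {name: i for i, name in enumerate(KEYWORD_NAMES, start=1)}


def encode_keywords(keywords):
    """Translate keyword names to their numbers, in ascending order."""
    return sorted(keyword_ids[k] for k in keywords)


# RPI_1's keywords from the example database.
rpi_1 = ["computer_vision", "computer_systems", "semiconductors"]
print(page_ids["RPI_1"], encode_keywords(rpi_1))  # prints: 1 [1, 2, 5]
```

Note that numbering keywords in order of first appearance in the database would not reproduce the table above (semiconductors would become 3), which is why the sketch uses an explicit keyword list.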

The new database now looks like:

WebPage Num_KeyWords    KeyWords
1       3               1 2 5
2       4               1 2 3 4
3       5               1 2 3 4 5
4       4               1 2 3 5
5       4               1 2 3 4
6       3               1 2 5

This example web database contains 6 department web pages, and 5 distinct keywords appear across them.
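
One possible way to read the numeric database is sketched below in Python. It is a minimal sketch under stated assumptions: the function name parse_database and the variable DATABASE_TEXT are hypothetical, the input is assumed to be whitespace-separated, and the first line is assumed to be the header row shown above (an actual input file may omit it).

```python
# The numeric database from the text, including its header row (assumed).
DATABASE_TEXT = """\
WebPage Num_KeyWords    KeyWords
1       3               1 2 5
2       4               1 2 3 4
3       5               1 2 3 4 5
4       4               1 2 3 5
5       4               1 2 3 4
6       3               1 2 5
"""


def parse_database(text):
    """Return {page number: [keyword numbers]} from the table text."""
    pages = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = [int(f) for f in line.split()]
        page, count, keywords = fields[0], fields[1], fields[2:]
        # Num_KeyWords must agree with the number of keywords listed.
        if len(keywords) != count:
            raise ValueError(
                f"page {page}: expected {count} keywords, got {len(keywords)}"
            )
        pages[page] = keywords
    return pages


pages = parse_database(DATABASE_TEXT)
all_keywords = set().union(*pages.values())
print(len(pages), len(all_keywords))  # prints: 6 5
```

The Num_KeyWords column is redundant with the keyword list itself, so the sketch uses it only as a consistency check on each row.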



Mohammed Zaki
10/30/1998