WWWPal Suite for Analysis and Organization of Web Sites

By John Punin
Advisor: Mukkai Krishnamoorthy
July 24, 2003

Organizing a web site means imposing order and classification on all of its hypertext documents. As web sites continue to grow in size and complexity, organizing them becomes more challenging. Someone attempting to organize a web site may face problems such as obscure navigational structures and missing information. Our research provides a toolkit, WWWPal, for managing the structural information of web sites. WWWPal consists of a suite of web applications that allow researchers, web site managers and users to visualize, organize and analyze web sites. This thesis explains in detail the design, languages, components and algorithms used in WWWPal. WWWPal is a hypermedia system for graphs whose development is based on hypertext theory. It can visualize, navigate, cluster, edit and analyze general graphs. To achieve these tasks, we use web usage mining algorithms to extract and analyze patterns in web user sessions, clustering techniques to visualize large graphs, and multiple graph-theoretic algorithms to analyze the graph representation of web sites.

We propose two XML standards: XGMML for representing graphs and LOGML for representing web usage information. XML, a structured and portable language, is the basis of both. These languages are designed to store graphs and web site usage information. Emerging technologies, such as the Semantic Web, are used to extract metadata and express it in the Resource Description Framework (RDF) language. This information can be further explored, analyzed and classified into ontologies by semantic user agents. This thesis also proposes RGML, a metadata graph representation based on RDF.

We have used WWWPal to analyze and visualize many very large web sites, such as Rensselaer Polytechnic Institute's site. The results are organized into structural, usage and site map reports, which not only show general statistics of web sites but also list web site problems such as dead-end pages, broken links, outdated information and oversized pages. These reports also show high- and low-usage web sites, visiting crawlers and the patterns of web surfers.
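
As a rough illustration of the kind of structural check such a report might contain, the following minimal sketch finds broken links and dead-end pages in a small link graph. It is not WWWPal's implementation: the function name `analyze_site`, the input format, and the working definition of a dead-end page (an existing page with no outgoing link to another existing page) are all assumptions made for this example.

```python
def analyze_site(links, existing_pages):
    """Report broken links and dead-end pages for a site link graph.

    links          -- iterable of (source, target) URL pairs (hypothetical input format)
    existing_pages -- set of URLs known to resolve
    """
    links = list(links)  # allow generators; the pairs are scanned twice

    # Build the adjacency map: page -> set of pages it links to.
    out_links = {}
    for source, target in links:
        out_links.setdefault(source, set()).add(target)

    # Assumption: a broken link is one whose target does not exist.
    broken_links = sorted((s, t) for s, t in links if t not in existing_pages)

    # Assumption: a dead-end page exists but links to no existing page.
    dead_ends = sorted(
        page for page in existing_pages
        if not (out_links.get(page, set()) & existing_pages)
    )
    return {"broken_links": broken_links, "dead_ends": dead_ends}


# Hypothetical four-link site used only to exercise the function.
example_links = [("/", "/a"), ("/", "/b"), ("/a", "/missing"), ("/b", "/")]
example_pages = {"/", "/a", "/b"}
report = analyze_site(example_links, example_pages)
```

In this toy site, the page `/a` links only to a page that does not exist, so it is reported both as the source of a broken link and as a dead end.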

