Web Document Search, Organisation and Exploration Using Self-Organising Neural Networks
By Richard T. Freeman
Abstract
The amount of content stored electronically is increasing rapidly with the advent of the Internet and corporate Intranets, leading to information overload. There is a requirement to make content more accessible through content and knowledge management, allowing it to be efficiently searched and explored. A major limitation of existing hierarchical document clustering methods used in information retrieval is that they typically generate a dendrogram representation, which is unsuitable for browsing. Methods based on the self-organising map are better suited to observing large numbers of clusters, but are not as natural as tree structures such as those used in libraries, file explorers, web directories or enterprise information portals. In this thesis, a method is proposed to generate such a tree using an algorithm called Adaptive Topological Tree Structure (ATTS), which uses a set of hierarchically organised self-organising growing chains. Each chain fully adapts to a specific topic, and its number of subtopics is determined using the proposed entropy-based validation and cluster tendency schemes. The algorithm is novel in that the tree is not a dendrogram or a fixed-size n-way tree, but rather adapts to the natural underlying structure at each level in the hierarchy. The chains’ topology also places similar topics together and dissimilar ones apart. The resulting topological tree can be defined as a hybrid of a graph-tree and a taxonomy, with the unique property of representing both the hierarchical relations and the strongest links between clusters. This topology can be exploited to considerably reduce the time needed for a top-down search, as well as to improve browsing and user comprehension. Experimental results show that the ATTS method outperforms other hierarchical divisive clustering algorithms, as well as methods based on the self-organising map, for retrieval, and that it makes browsing more intuitive.
The generated topological tree is shown to perform better in terms of document retrieval. The topology provides a unique feature that can be used for finding related topics and extending the search space.
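As an illustration of the chain topology described above, the following is a minimal sketch of a one-dimensional self-organising chain of fixed size, written in Python. It is not the ATTS algorithm itself: the growing step, the entropy-based validation, the cluster tendency scheme and the hierarchy are all omitted, and the function names, parameters and decay schedules (`train_chain`, `best_matching_unit`, `lr0`, `radius0`) are assumptions made for this example only.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the index of the chain node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def train_chain(data, n_nodes=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a fixed-size 1-D self-organising chain (illustrative sketch only).

    Nodes are indexed 0..n_nodes-1 along the chain; a node's neighbours are
    the nodes with adjacent indices. All defaults are hypothetical choices.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise each node with a copy of a randomly chosen input vector.
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                    # linearly decaying learning rate
            radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighbourhood width
            bmu = best_matching_unit(weights, x)
            for i in range(n_nodes):
                # Gaussian neighbourhood over distance along the chain, so
                # nodes near the winner move towards x more than distant ones.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    weights[i][d] += lr * h * (x[d] - weights[i][d])
            step += 1
    return weights
```

Calling `train_chain(vectors, n_nodes=3)` on a list of equal-length numeric vectors returns one weight vector per chain node, and `best_matching_unit(weights, x)` then assigns a new vector `x` to a node. Because updates are shared between neighbouring nodes, adjacent nodes tend to represent similar inputs, which is the property the abstract relies on for finding related topics.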
Keywords
Information retrieval; document clustering; search engine; self-organising maps; topological tree; information access; faceted classification; guided navigation; taxonomy generation; neural networks; post-retrieval clustering; enterprise portals; enterprise content management; enterprise search.
Bibliographic Details
@phdthesis{freemanPhD04,
  Author = {Richard T. Freeman},
  Title = {Web Document Search, Organisation and Exploration Using Self-Organising Neural Networks},
  School = {University of Manchester},
  Address = {School of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences},
  Year = {2004}
}