Document clustering, an unsupervised approach, is extensively
used to navigate, filter, summarize, and manage large document
collections such as the Internet. Recently, the focus in this domain has
shifted from traditional vector-based document similarity for clustering
to suffix-tree-based document similarity, as it offers a more semantic
representation of the text. In this project, we compare and contrast
two recently introduced approaches to document clustering based on
the suffix tree data model. The first is phrase-based document
clustering, which extracts phrases from documents to form a compact
document representation and uses a similarity measure based on the
phrases to cluster the documents. The second approach is
frequent word/word meaning sequence based document clustering,
which similarly extracts the common word sequences from the
documents, forms a compact representation, and uses a document
clustering approach to cluster the compact documents. These
algorithms use agglomerative hierarchical document clustering
to perform the actual clustering step. The approaches differ
mainly in how they represent a document in compact form
and in the similarity measures used for clustering.
This project investigates the computational aspects of the two
algorithms and the quality of the results they produce.
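
As a rough illustration of the pipeline shared by the two approaches
(compact phrase representation, phrase-based similarity, agglomerative
hierarchical clustering), the following Python sketch may help. It is
not either of the compared algorithms: word n-grams stand in for the
phrases a suffix tree would yield, Jaccard overlap stands in for the
similarity measures, and average linkage is an assumed merge rule. The
names phrases, similarity, agglomerative, max_len, and threshold are
hypothetical choices made for this example.

```python
from itertools import combinations


def phrases(text, max_len=3):
    """Compact representation: the set of word n-grams of length 1..max_len.

    Assumption: n-grams are a simple stand-in for suffix tree phrases.
    """
    words = text.lower().split()
    return {
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def similarity(a, b):
    """Jaccard overlap of two phrase sets (an assumed similarity measure)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def agglomerative(docs, threshold=0.1):
    """Average-link agglomerative clustering over phrase sets.

    Starts with one cluster per document and repeatedly merges the most
    similar pair until the best similarity falls below `threshold`.
    Returns clusters as lists of document indices.
    """
    reps = [phrases(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        # Average pairwise similarity between members of two clusters.
        total = sum(similarity(reps[i], reps[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # combinations yields index pairs (x, y) with x < y.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: link(clusters[p[0]], clusters[p[1]]),
        )
        if link(clusters[x], clusters[y]) < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


docs = [
    "suffix tree clustering of documents",
    "document clustering with a suffix tree",
    "cats chase mice",
    "mice fear cats",
]
print(agglomerative(docs, threshold=0.1))  # [[0, 1], [2, 3]]
```

In this toy run the two suffix tree documents share the phrase
"suffix tree" and end up together, while the two unrelated documents
form their own cluster. The result depends on the threshold: with the
assumed n-gram and Jaccard choices, a threshold of 0.2 would leave the
first two documents unmerged.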