Abstract
Feature clustering is a powerful method for reducing the dimensionality of feature vectors in text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on similarity tests: words that are similar to each other are placed in the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically, and we then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in that cluster. With this algorithm, the derived membership functions closely match and properly describe the real distribution of the training data. Moreover, the user need not specify the number of extracted features in advance, so trial-and-error for determining an appropriate number of extracted features is avoided. Experimental results show that our method runs faster and obtains better extracted features than other methods.
Index Terms
Fuzzy similarity, feature clustering, feature extraction, feature reduction, text classification.
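The self-constructing idea in the abstract — feed words in one at a time, measure each word's membership in the existing clusters, and open a new cluster only when no membership exceeds a threshold — can be sketched as below. This is a minimal illustration, not the paper's exact formulas: the Gaussian membership form, the Welford-style running mean/deviation update, and the parameters `rho` (membership threshold) and `sigma0` (initial/floor deviation) are assumptions made for the sketch.

```python
import math

class FuzzyCluster:
    """One cluster, characterized by a per-dimension mean and deviation."""
    def __init__(self, pattern, sigma0=0.3):
        self.mean = list(pattern)
        self.m2 = [0.0] * len(pattern)  # running sum of squared deviations
        self.sigma0 = sigma0            # assumed floor so deviation never collapses
        self.size = 1

    def dev(self, j):
        if self.size < 2:
            return self.sigma0
        return max(math.sqrt(self.m2[j] / (self.size - 1)), self.sigma0)

    def membership(self, pattern):
        # Gaussian-style membership: 1.0 at the mean, decaying with distance
        return math.exp(-sum(((x - self.mean[j]) / self.dev(j)) ** 2
                             for j, x in enumerate(pattern)))

    def absorb(self, pattern):
        # Incrementally update mean and deviation (Welford's method)
        self.size += 1
        for j, x in enumerate(pattern):
            delta = x - self.mean[j]
            self.mean[j] += delta / self.size
            self.m2[j] += delta * (x - self.mean[j])

def self_construct(patterns, rho=0.4, sigma0=0.3):
    """Feed word patterns in one by one; create a new cluster whenever no
    existing cluster's membership reaches the threshold rho."""
    clusters = []
    for p in patterns:
        if clusters:
            best = max(clusters, key=lambda c: c.membership(p))
            if best.membership(p) >= rho:
                best.absorb(p)
                continue
        clusters.append(FuzzyCluster(p, sigma0))
    return clusters
```

Because cluster creation is driven by the threshold rather than a preset count, the number of clusters (and hence of extracted features) emerges from the data, which is the property the abstract highlights. Each extracted feature would then be a weighted combination of the words that ended up in the corresponding cluster.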
Existing System:
The first feature extraction method based on feature clustering was proposed by Baker and McCallum, derived from the "distributional clustering" idea of Pereira et al. Al-Mubaid and Umair used distributional clustering to generate an efficient representation of documents and applied a learning-logic approach to train text classifiers. The Agglomerative Information Bottleneck approach was proposed by Tishby et al. The divisive information-theoretic feature clustering algorithm was proposed by Dhillon et al. and is more effective than other feature clustering methods. In these feature clustering methods, each new feature is generated by combining a subset of the original words. However, these methods share several difficulties. A word is assigned to exactly one subset, i.e., hard clustering, based on the similarity magnitudes between the word and the existing subsets, even if the differences among these magnitudes are small. Also, the mean and the variance of a cluster are not considered when similarity with respect to the cluster is computed. Furthermore, these methods require the number of new features to be specified in advance by the user.
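The hard-clustering drawback described above can be made concrete with a small sketch: the word is assigned wholly to the single most similar cluster, even when the top similarities are nearly tied. The cosine similarity measure and the example vectors here are illustrative assumptions, not taken from any of the cited methods.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hard_assign(word_vec, centroids):
    """Hard clustering: the word contributes to exactly one cluster,
    the argmax of similarity, regardless of how close the runner-up is."""
    sims = [cosine(word_vec, c) for c in centroids]
    winner = max(range(len(centroids)), key=sims.__getitem__)
    return winner, sims

# A word almost equally similar to two cluster centroids still goes
# entirely to one of them -- the runner-up cluster gets nothing.
winner, sims = hard_assign([0.52, 0.48], [[1.0, 0.0], [0.0, 1.0]])
```

Here both similarities are high and differ only slightly, yet the word is counted only toward the winning cluster; a fuzzy assignment would instead let the word contribute to both clusters in proportion to its memberships.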