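Automatic text classification is a technique that assigns unstructured text to one or more predefined categories according to its content, and it has attracted growing attention over the past 10 years. The main reason is that the emergence of large volumes of machine-readable electronic text creates an urgent need to classify text effectively, so that searching and reading become faster. Although many techniques and algorithms have already been applied to automatic text classification, their potential is far from fully exploited, and considerable room for improvement remains. In addition, new classification methods still await in-depth study. For Chinese text in particular, prior research has been relatively scarce, and well-known Chinese text classifiers are scarcer still.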
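Text representation is a crucial step for both the learning algorithm and the classification results. Before a learning algorithm and a classification system can process text, the text must be converted into a suitable representation, one that captures the semantic content of the text to some degree. Given these requirements, the Chinese text classification process can be described as: collecting a text dataset, segmenting the Chinese text into words, reducing the dimensionality of the high-dimensional original feature space, selecting a classifier, and evaluating the classification results.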
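The representation and dimensionality reduction steps above can be sketched as follows. This is only a minimal illustration, not the system described in this paper: it assumes the documents have already been segmented into word lists, and the names `build_vocabulary`, `to_vector`, and `min_df` are hypothetical. It filters terms with a simple document frequency threshold and maps each document to a term-count vector.

```python
from collections import Counter


def build_vocabulary(segmented_docs, min_df=2):
    """Keep terms that occur in at least min_df documents (a simple DF filter)."""
    df = Counter()
    for tokens in segmented_docs:
        # Count each term once per document, not once per occurrence.
        df.update(set(tokens))
    return sorted(term for term, count in df.items() if count >= min_df)


def to_vector(tokens, vocabulary):
    """Map one segmented document to a term-count vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Toy, already-segmented documents (assumed input; segmentation is not shown).
docs = [
    ["股票", "上涨", "股票"],
    ["股票", "基金"],
    ["球队", "比赛"],
    ["比赛", "冠军"],
]

vocabulary = build_vocabulary(docs, min_df=2)
vectors = [to_vector(tokens, vocabulary) for tokens in docs]
```

With `min_df=2`, only the two terms that appear in at least two documents survive, so each document becomes a two-dimensional vector instead of a six-dimensional one.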
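3. The current classification scheme is flat. Text classification could instead be considered within a hierarchical classification scheme, moving classification from a plane toward three-dimensional space, in order to greatly improve the accuracy of the classification algorithm and speed up classification.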
Keywords: feature dimensionality reduction, text classification, cover algorithm
Abstract:
1. An introduction to the concepts related to text classification and to the existing methods;
2. To draw useful conclusions from the classification results, this paper applies several feature dimensionality reduction methods: mutual information (MI), correlation coefficient, document frequency, and expected cross entropy (ECE). The experiments show that the correlation coefficient method is the most effective, followed by expected cross entropy (ECE) and mutual information (MI), while document frequency performs worst.
This paper also compares the cross cover algorithm with the SVM method as classifiers. The experiments show that the cross cover algorithm works very well as a classifier for Chinese text, given a suitable dimensionality and feature dimensionality reduction method.
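To make the feature scoring concrete, the sketch below computes three of the measures named above from a term-category contingency table. It is an illustration under assumptions, not the implementation used in the experiments: the formulas are the commonly cited textbook definitions (the correlation coefficient as the signed square root of the chi-square statistic), which may differ in detail from the variants used in this paper, and all function names are hypothetical. Expected cross entropy is omitted to keep the example short.

```python
import math


def contingency(docs, labels, term, category):
    """Return document counts (a, b, c, d) for one term and one category.

    a: in category, contains term      b: not in category, contains term
    c: in category, lacks term         d: not in category, lacks term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return a, b, c, d


def document_frequency(a, b, c, d):
    """Number of documents that contain the term."""
    return a + b


def mutual_information(a, b, c, d):
    """log(P(t, c) / (P(t) * P(c))), estimated from document counts."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(a * n / ((a + c) * (a + b)))


def correlation_coefficient(a, b, c, d):
    """Signed square root of the chi-square statistic (assumed definition)."""
    n = a + b + c + d
    denominator = math.sqrt((a + c) * (b + d) * (a + b) * (c + d))
    if denominator == 0:
        return 0.0
    return math.sqrt(n) * (a * d - c * b) / denominator


# Toy, already-segmented documents with hypothetical category labels.
docs = [{"股票", "上涨"}, {"股票", "基金"}, {"球队", "比赛"}, {"比赛", "冠军"}]
labels = ["finance", "finance", "sports", "sports"]

counts = contingency(docs, labels, "股票", "finance")
scores = {
    "df": document_frequency(*counts),
    "mi": mutual_information(*counts),
    "cc": correlation_coefficient(*counts),
}
```

Terms would then be ranked by one of these scores per category and only the top-scoring ones kept as features. Unlike mutual information, the correlation coefficient is signed, so a term that signals absence from a category receives a negative score.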
This paper has done some work on Chinese text classification, but there is still room for improvement. Further study of Chinese text classification may proceed in the following three directions:
1. The text representation used in this paper is the vector space model. This model could be combined with computational linguistics, using a concept space in place of the semantic space. The influence of Chinese word meanings is not considered here, and Chinese word segmentation relies on the results of ICTCLAS, provided by the Institute of Computing Technology, Chinese Academy of Sciences. Future work can study how to improve segmentation precision.
2. Improving the cross cover algorithm to raise its classification accuracy;
3. The present classification scheme is flat. Text classification could instead be considered within a hierarchical classification scheme, moving classification from a plane toward three-dimensional space, in order to greatly improve the accuracy of the classification algorithm and speed up text classification.
Keywords: Feature Dimensionality Reduction, Text Classification, Cross Cover Algorithm