WEKO3
Item
A Study on Automatic Chinese Text Classification
http://hdl.handle.net/10076/12727
Item type | Thesis or Dissertation(1) | |||||||
---|---|---|---|---|---|---|---|---|
Release date | 2013-06-11 | |||||||
Title | ||||||||
Title | A Study on Automatic Chinese Text Classification | |||||||
Language | en | |||||||
Language | ||||||||
Language | eng | |||||||
Resource type | ||||||||
Resource type identifier | http://purl.org/coar/resource_type/c_46ec | |||||||
Resource type | thesis | |||||||
Author | LUO, XI | |||||||
Abstract | ||||||||
Description type | Abstract | |||||||
Description | Automatic text classification (ATC) is the task of automatically assigning one or more appropriate categories to a document according to its content or topic. Traditionally, text classification has been carried out by human experts, as it requires a certain level of vocabulary recognition and knowledge processing. With the rapid growth of digital text and online information, text classification has become an important research area owing to the need to handle and organize text collections automatically. The applications of this technology are manifold, including automatic indexing for information retrieval systems, document organization, text filtering, spam filtering, and hierarchical categorization of web pages. Many standard machine learning techniques have been applied to automated text classification problems, and k-nearest neighbor (kNN) and support vector machines (SVM) have been reported as the top-performing methods for English text classification. Unfortunately, perfect precision cannot be reached in Chinese text classification, and the inherent errors caused by word segmentation remain a problem. The purpose of this research is to evaluate the effectiveness of feature extraction, feature transformation, and dimension reduction techniques, and to improve the accuracy of Chinese text classification using various techniques. In this paper, we perform Chinese text classification using N-gram (uni-gram, bi-gram, and mixed uni-gram/bi-gram) frequency features instead of word frequency features to represent documents, and we propose the use of mixed uni-gram/bi-gram features after feature transformation. We further propose a serial approach based on feature transformation and dimension reduction techniques to improve performance. We then compare the results of three different types of SVM kernel functions. Experimental results show that our proposed approach is efficient and effective for improving the performance of Chinese text classification. Furthermore, we propose a novel feature selection method based on part-of-speech analysis. According to the components of Chinese texts, we utilize the words’ part-of-speech (POS) attributes to filter out many meaningless features. The results show that a suitable combination of parts of speech can lead to better classification performance. | |||||||
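Description | ||||||||
Description type | Other | |||||||
Description | Mie University, Graduate School of Engineering, Master's Program, Division of Information Engineering | |||||||
Description | ||||||||
Description type | Other | |||||||
Description | 4, 28 | |||||||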
Bibliographic information | Issue date: 2011-01-01 | |||||||
Format | ||||||||
Description type | Other | |||||||
Description | application/pdf | |||||||
Author version flag | ||||||||
Publication type | VoR | |||||||
Publication type resource | http://purl.org/coar/version/c_970fb48d4fbd8a85 | |||||||
Publisher | ||||||||
Publisher | Mie University | |||||||
Master's thesis supervisor | ||||||||
Contributor identifier scheme | WEKO | |||||||
Contributor identifier | 22700 | |||||||
Name | Kimura, Fumitaka | |||||||
Language | en | |||||||
Resource type (Mie University) | ||||||||
Master's Thesis |
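
The abstract describes representing documents by N-gram (uni-gram, bi-gram, and mixed uni-gram/bi-gram) frequency features rather than word frequencies, which avoids depending on word segmentation. The following is a minimal Python sketch of that idea, not the thesis's implementation. It assumes the N-grams are taken over characters and that whitespace is skipped; the function names `char_ngrams` and `ngram_features` and the `mode` parameter are hypothetical, and the feature transformation, dimension reduction, and SVM steps mentioned in the abstract are not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`, skipping whitespace.

    Assumption: N-grams are formed over characters, not segmented words.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def ngram_features(text, mode="mixed"):
    """Count N-gram frequencies for one document.

    `mode` (hypothetical parameter) selects "uni", "bi", or "mixed",
    where "mixed" pools uni-gram and bi-gram counts in one feature set.
    """
    if mode not in ("uni", "bi", "mixed"):
        raise ValueError("mode must be 'uni', 'bi', or 'mixed'")
    counts = Counter()
    if mode in ("uni", "mixed"):
        counts.update(char_ngrams(text, 1))
    if mode in ("bi", "mixed"):
        counts.update(char_ngrams(text, 2))
    return counts


# Example: a six-character string yields six uni-gram tokens and five bi-gram tokens.
features = ngram_features("中文文本分类", mode="mixed")
```

Each document's counts can then be mapped onto a shared vocabulary to form the frequency vector that a classifier such as an SVM takes as input.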