A Study on Automatic Chinese Text Classification

LUO, XI

http://hdl.handle.net/10076/12727

Name / File	License	Action
2010M251.pdf (1.6 MB)

Item type

Thesis or Dissertation(1)

Release date

2013-06-11

Title

A Study on Automatic Chinese Text Classification

Language

eng

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_46ec

Resource type

thesis

Author

LUO, XI

Abstract

Description type

Abstract

Description

Automatic text classification (ATC) is the task of automatically assigning one or more appropriate categories to a document according to its content or topic. Traditionally, text classification has been carried out by human experts, as it requires a certain level of vocabulary recognition and knowledge processing. With the rapid growth of digital texts and online information, text classification has become an important research area owing to the need to handle and organize text collections automatically. The applications of this technology are manifold, including automatic indexing for information retrieval systems, document organization, text filtering, spam filtering, and hierarchical categorization of web pages. Many standard machine learning techniques have been applied to automated text classification, and the k-nearest neighbor (kNN) method and support vector machines (SVM) have been reported as the top-performing methods for English text classification. In Chinese text classification, however, perfect precision cannot be reached, and the inherent errors caused by word segmentation remain a problem. The purpose of this research is to evaluate the effectiveness of feature extraction, feature transformation, and dimension reduction techniques, and to improve the accuracy of Chinese text classification using these techniques. In this paper, we perform Chinese text classification using N-gram (uni-gram, bi-gram, and mixed uni-gram/bi-gram) frequency features instead of word frequency features to represent documents, and we propose the use of mixed uni-gram/bi-gram features after feature transformation. We further propose a serial approach based on feature transformation and dimension reduction techniques to improve performance. We then compare the results of three different types of SVM kernel functions.
Experimental results show that our proposed approach is efficient and effective for improving the performance of Chinese text classification. Furthermore, we propose a novel feature selection method based on part-of-speech analysis. According to the components of Chinese texts, we utilize the words' part-of-speech (POS) attributes to filter out many meaningless features. The results show that a suitable combination of parts of speech can lead to better classification performance.
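
As a rough illustration of the N-gram frequency features mentioned in the abstract, the following minimal Python sketch counts uni-grams, bi-grams, or a mix of both without any word segmentation. It assumes the N-grams are taken over characters, which the abstract does not state explicitly; the function names `char_ngrams` and `ngram_features` and the sample string are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def ngram_features(text, mode="mixed"):
    """Count N-gram frequencies for one document.

    mode is "unigram", "bigram", or "mixed" (uni-grams and bi-grams together).
    Assumption: N-grams are character-level, so no word segmenter is needed.
    """
    if mode == "unigram":
        grams = char_ngrams(text, 1)
    elif mode == "bigram":
        grams = char_ngrams(text, 2)
    elif mode == "mixed":
        grams = char_ngrams(text, 1) + char_ngrams(text, 2)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return Counter(grams)


# Hypothetical sample document (six characters).
features = ngram_features("中文文本分类", mode="mixed")
```

The resulting counts could then serve as the raw document vector to which feature transformation, dimension reduction, and an SVM classifier are applied; those later stages are not shown here.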

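Description

Description type

Other

Description

Division of Information Engineering, Master's Program, Graduate School of Engineering, Mie University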

Description

Description type

Other

Description

4, 28

Bibliographic information

Issue date 2011-01-01

Format

Description type

Other

Description

application/pdf

Author version flag

Publication type

VoR

Publication type resource

http://purl.org/coar/version/c_970fb48d4fbd8a85

Publisher

Mie University

Master's thesis supervisor

Contributor identifier scheme

WEKO

Contributor identifier

22700

Name

Kimura, Fumitaka

Language

Resource type (Mie University)

Value

Master's Thesis

Versions

Ver.1

2023-06-19 17:23:25.217412
