WEKO3
An impact of linguistic features on automated classification of OCR texts
http://hdl.handle.net/10076/12508
Item type | Thesis or Dissertation(1) | |||||||
---|---|---|---|---|---|---|---|---|
Release date | 2013-06-11 | |||||||
Title | ||||||||
Title | An impact of linguistic features on automated classification of OCR texts | |||||||
Language | en | |||||||
Language | ||||||||
Language | eng | |||||||
Resource type | ||||||||
Resource type identifier | http://purl.org/coar/resource_type/c_46ec | |||||||
Resource type | thesis | |||||||
Author | Moshi, Gudila Paul | |||||||
Abstract | ||||||||
Description type | Abstract | |||||||
Description | Optical character reader (OCR) systems can be used to digitize print documents, and OCR texts are generated in the process. These texts usually need to be indexed and organized to simplify access and retrieval, which can be done with automatic classification techniques. However, OCR technology currently cannot recognize all characters with 100% accuracy. The first objective of this research is therefore to investigate how to automate the classification of these OCR texts effectively. To address the high-dimensional feature space, we reduce the dimensionality with principal component analysis (PCA). We also adopt a discriminant analysis method, which reduces dimensionality and extracts more informative features to improve the separability of the textual data. Conventionally, a number of researchers have applied PCA to reduce dimensionality, but because PCA is an unsupervised technique it ignores category-specific information: it seeks directions that are efficient for representation and does not use the category information of the data. For example, when the first component is chosen along the line of largest variance, the categories can overlap strongly. To overcome this shortcoming, we perform canonical discriminant analysis (CDA). CDA seeks directions that are efficient for discrimination: it uses category information to find the projection in which instances of different categories are maximally separated while instances of the same category cluster closely together. However, because the within-class covariance matrix in CDA tends to be singular when the dimensionality is high relative to the sample size, we experimentally study the combination of PCA and CDA (the PCA+CDA algorithm).
Our experiments show that the PCA+CDA algorithm improved classification performance only with a weak classifier (kNN in our case), whereas PCA alone outperformed PCA+CDA with a strong classifier (SVM in our case). Furthermore, it is not known whether part-of-speech (POS) analysis contributes to a discriminative representation of OCR texts; conventionally, the bag-of-words approach is used in OCR text classification. The second objective of this work is therefore to experimentally evaluate POS analysis on OCR texts in order to formulate an informative feature set. Empirical results indicate that a combination of suitably selected POS improved the classification performance of OCR texts. | |||||||
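Description | ||||||||
Description type | Other | |||||||
Description | Division of Information Engineering, Master's Program, Graduate School of Engineering, Mie University | |||||||
Description | ||||||||
Description type | Other | |||||||
Description | 5, 31 | |||||||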
Bibliographic information | Date of issue: 2010-01-01 | |||||||
Format | ||||||||
Description type | Other | |||||||
Description | application/pdf | |||||||
Author version flag | ||||||||
Publication type | VoR | |||||||
Publication type resource | http://purl.org/coar/version/c_970fb48d4fbd8a85 | |||||||
Publisher | ||||||||
Publisher | Mie University | |||||||
Master's thesis advisor | ||||||||
Contributor identifier scheme | WEKO | |||||||
Contributor identifier | 22359 | |||||||
Name | Kimura, Fumitaka (木村, 文隆) | |||||||
Language | ja | |||||||
Resource type (Mie University) | ||||||||
Master's Thesis |
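
The PCA+CDA combination described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the data are synthetic stand-ins for document feature vectors, CDA is written here as a standard multi-class Fisher discriminant, and all function names (`pca_fit`, `cda_fit`, `nearest_neighbor_predict`) and sizes are assumptions chosen for illustration. The point it tries to show is the order of operations: PCA first brings the dimensionality below the sample size so that the within-class scatter matrix is invertible, and CDA is then applied to the PCA scores.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return (mean, components) for an unsupervised PCA projection."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def cda_fit(Z, y, n_components):
    """Return directions maximizing between-class over within-class scatter."""
    d = Z.shape[1]
    overall_mean = Z.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)
        diff = (mc - overall_mean)[:, None]
        Sb += len(Zc) * (diff @ diff.T)
    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor of Sw.
    # The factorization fails if Sw is singular, which is why PCA comes first.
    L = np.linalg.cholesky(Sw)
    A = np.linalg.solve(L, np.linalg.solve(L, Sb).T)  # L^-1 Sb L^-T
    _, V = np.linalg.eigh(A)  # eigenvalues in ascending order
    top = V[:, ::-1][:, :n_components]
    return np.linalg.solve(L.T, top)


def nearest_neighbor_predict(train_Z, train_y, test_Z):
    """1-NN: give each test row the label of its closest training row."""
    dists = np.linalg.norm(test_Z[:, None, :] - train_Z[None, :, :], axis=2)
    return train_y[np.argmin(dists, axis=1)]


# Synthetic data (an assumption, not OCR text): 3 categories, 200 features,
# and only 20 training samples per category, so features outnumber samples.
rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 3, 20, 200
class_means = rng.normal(size=(n_classes, n_features))


def sample(n):
    X = np.vstack([m + rng.normal(size=(n, n_features)) for m in class_means])
    y = np.repeat(np.arange(n_classes), n)
    return X, y


X_train, y_train = sample(n_per_class)
X_test, y_test = sample(10)

# Step 1: PCA to 20 dimensions, well below the 60 training samples.
mean, P = pca_fit(X_train, n_components=20)
# Step 2: CDA on the PCA scores; at most (number of classes - 1) directions.
W = cda_fit((X_train - mean) @ P, y_train, n_components=n_classes - 1)

Z_train = (X_train - mean) @ P @ W
Z_test = (X_test - mean) @ P @ W
accuracy = np.mean(nearest_neighbor_predict(Z_train, y_train, Z_test) == y_test)
print(f"1-NN accuracy after PCA+CDA: {accuracy:.2f}")
```

Applying `cda_fit` directly to the 200-dimensional training data would fail at the Cholesky step, because 60 samples cannot produce a full-rank 200 by 200 within-class scatter matrix. The number of retained principal components (20 here) is an arbitrary choice for the sketch, not a value taken from the thesis.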