Python NLTK 6: Learning to Classify Text

Tech 2023-06-23

6 Learning to Classify Text

Learning to Classify Text
1 Supervised Classification
  1.1 Gender Identification
  1.2 Document Classification
  1.3 Part-of-Speech Tagging
2 Evaluation
  2.1 The Test Set
  2.2 Precision and Recall
  2.3 Confusion Matrices

English documentation: http://www.nltk.org/book/
Chinese documentation: https://www.bookstack.cn/read/nlp-py-2e-zh/0.md
The section numbering below follows my own convention.

Learning to Classify Text

1 Supervised Classification

1.1 Gender Identification

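1. Build a classifier that exploits the differences between male and female names.
2. First write a feature extractor. It returns a dictionary, called a feature set, that maps feature names (readable, case-sensitive strings) to feature values (simple types). Then run the raw data through the feature extractor, split the result into a training set and a test set, and train a classifier on the training set.
3. Finally, use the classifier to classify new input, or evaluate the classifier.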

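import random

import nltk
from nltk.classify import apply_features
from nltk.corpus import names


# Feature extractor
def gender_features(word):
    # The returned dictionary is called a feature set. It maps feature names to values.
    # Feature names are readable, case-sensitive strings. Feature values are simple
    # types such as booleans, numbers, and strings.
    # Accuracy varies with the feature set returned, e.g. compare with:
    # return {'last_letter': word[-1]}
    return {'last_letter': word[-1], 'word_length': len(word), 'first_letter': word[0]}


# Gender identification
def gender_identification():
    labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                     [(name, 'female') for name in names.words('female.txt')])
    random.shuffle(labeled_names)
    # Split into a training set and a test set. apply_features suits large corpora
    # because it does not store every feature set object in memory.
    train_set = apply_features(gender_features, labeled_names[500:])
    test_set = apply_features(gender_features, labeled_names[:500])
    # Train a naive Bayes classifier on the training set
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    # To classify a new input, extract its features first, then classify
    classify_result = classifier.classify(gender_features('Trinity'))
    print(classify_result)
    # Evaluate the classifier
    accuracy = nltk.classify.accuracy(classifier, test_set)
    print(accuracy)
    # Inspect the classifier to see which features best separate the genders.
    # For example, last_letter = 'a' female : male = 34.5 : 1.0 means that names in
    # the training set ending in 'a' are female 34.5 times more often than male.
    # These ratios are called likelihood ratios and are used to compare different
    # feature-outcome relationships.
    classifier.show_most_informative_features(5)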

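When working with a dataset, it is best to split it into a development set (itself divided into a training set and a dev-test set) and a test set. The training set trains the model, the dev-test set is used to find errors and refine the classifier, and the test set is kept for the final evaluation.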
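The three-way split described above can be sketched in plain Python. This is only a minimal illustration: the helper name `three_way_split`, its parameters, and the toy data are assumptions made for the example, not part of NLTK or the original notes.

```python
import random


def three_way_split(items, n_test, n_devtest, seed=0):
    """Shuffle a copy of items and split it into (train, devtest, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test


# Hypothetical toy data standing in for the (name, gender) pairs.
labeled = [('name{}'.format(i), 'male' if i % 2 else 'female') for i in range(20)]
train, devtest, test = three_way_split(labeled, n_test=4, n_devtest=6)
print(len(train), len(devtest), len(test))
```

The three parts never overlap, so errors found on the dev-test set can guide feature changes without touching the final test set.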

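1.2 Document Classification

For document classification, define one feature per word that indicates whether the document contains that word. To keep the number of features manageable, take the 2000 most frequent words in the corpus and have the feature extractor check whether a given document contains each of them.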

import random

import nltk
from nltk.corpus import movie_reviews


# Feature extractor for document classification.
# Checks whether document contains each of the words in word_features.
def document_features(document, word_features):
    # Convert to a set because membership checks are faster in a set than in a list
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features


# Document classification
def document_classification():
    # Use the movie review corpus and label each review as positive or negative.
    # Each item is (word list, category).
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    # Shuffle the order
    random.shuffle(documents)
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    # Take the 2000 most frequent words as the feature list.
    # most_common is needed here: list(all_words)[:2000] is not sorted by frequency.
    word_features = [w for (w, count) in all_words.most_common(2000)]
    featuresets = [(document_features(d, word_features), c) for (d, c) in documents]
    # Split
    train_set, test_set = featuresets[100:], featuresets[:100]
    # Train
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    # Accuracy
    accuracy = nltk.classify.accuracy(classifier, test_set)
    print(accuracy)
    classifier.show_most_informative_features(5)
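To see the word-presence features without downloading the corpus, the same idea can be tried on a tiny hand-made dataset. This is a sketch under stated assumptions: `top_words` and the toy documents are hypothetical, and `collections.Counter` stands in for `nltk.FreqDist`.

```python
from collections import Counter


def top_words(documents, n):
    """Return the n most frequent lowercased words across all documents."""
    counts = Counter(w.lower() for words, _label in documents for w in words)
    return [w for w, _count in counts.most_common(n)]


def document_features(document, word_features):
    """Map each feature word to True/False depending on its presence."""
    document_words = set(document)
    return {'contains({})'.format(w): (w in document_words) for w in word_features}


# Hypothetical toy corpus of (word list, category) pairs.
documents = [
    (['great', 'plot', 'great', 'cast'], 'pos'),
    (['dull', 'plot', 'weak', 'ending'], 'neg'),
    (['great', 'fun'], 'pos'),
]
word_features = top_words(documents, 2)
features = document_features(['great', 'fun'], word_features)
print(word_features)
print(features)
```

Here "great" occurs three times and "plot" twice, so they become the two feature words. The review `['great', 'fun']` contains the first but not the second.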

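1.3 Part-of-Speech Tagging

Train a classifier to do part-of-speech tagging based on word suffixes. First find the 100 most common suffixes, then have the feature extractor check whether a given word ends with each of them.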

import nltk
from nltk.corpus import brown


# Feature extractor: checks the suffixes of the given word
def pos_features(word, common_suffixes):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features


# Part-of-speech tagging
def part_of_speech_tagging():
    # Count the 1-, 2-, and 3-letter suffixes of every word
    suffix_fdist = nltk.FreqDist()
    for word in brown.words():
        word = word.lower()
        suffix_fdist[word[-1:]] += 1
        suffix_fdist[word[-2:]] += 1
        suffix_fdist[word[-3:]] += 1
    common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
    tagged_words = brown.tagged_words(categories='news')
    features = [(pos_features(word, common_suffixes), tag) for (word, tag) in tagged_words]
    size = int(len(features) * 0.1)
    train_set, test_set = features[size:], features[:size]  # split
    classifier = nltk.DecisionTreeClassifier.train(train_set)  # train
    accuracy = nltk.classify.accuracy(classifier, test_set)  # accuracy
    print(accuracy)
    # Show the decision tree as pseudocode
    print(classifier.pseudocode(depth=4))
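The suffix counting step can be checked on a few words without the Brown corpus. The sketch below is an illustration only: `common_suffixes_of` and the word list are hypothetical, and `collections.Counter` replaces `nltk.FreqDist`.

```python
from collections import Counter


def common_suffixes_of(words, n, max_len=3):
    """Return the n most frequent suffixes of length 1..max_len."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for k in range(1, max_len + 1):
            counts[word[-k:]] += 1
    return [suffix for suffix, _count in counts.most_common(n)]


def pos_features(word, common_suffixes):
    """Map each common suffix to whether the word ends with it."""
    return {'endswith({})'.format(s): word.lower().endswith(s) for s in common_suffixes}


# Hypothetical toy word list: three words end in "ing", two in "ked".
words = ['running', 'jumping', 'walked', 'talked', 'sing']
suffixes = common_suffixes_of(words, 3)
print(sorted(suffixes))
print(pos_features('Sleeping', suffixes))
print(pos_features('walked', suffixes))
```

On this list the three most frequent suffixes are "g", "ng", and "ing", so "Sleeping" matches all of them and "walked" matches none.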

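2 Evaluation

2.1 The Test Set

To find out whether a classification model really captures a pattern, the model has to be evaluated. The result shows how good the model is and can guide improvements to it. The choice of training set and test set strongly affects the result. Three ways of splitting them are given below.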

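import random

from nltk.corpus import brown


# Test sets
def test_set():
    # 1. Test set and training set are drawn from the same genre. The two sets are
    # then very similar, so the evaluation result is not reliable.
    tagged_sents = list(brown.tagged_sents(categories='news'))
    random.shuffle(tagged_sents)
    size = int(len(tagged_sents) * 0.1)
    train_set_1, test_set_1 = tagged_sents[size:], tagged_sents[:size]
    # 2. Test set and training set are drawn from different files. The test set then
    # contains no sentences from documents used in training. This matters because a
    # word that appears often with a particular tag in one document skews the result.
    file_ids = brown.fileids(categories='news')
    size = int(len(file_ids) * 0.1)
    train_set_2 = brown.tagged_sents(file_ids[size:])
    test_set_2 = brown.tagged_sents(file_ids[:size])
    # 3. Test set and training set are drawn from different genres. This gives a
    # more convincing evaluation.
    train_set_3 = brown.tagged_sents(categories='news')
    test_set_3 = brown.tagged_sents(categories='fiction')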
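The second strategy, splitting by file instead of by sentence, can be illustrated on a made-up corpus. This is a minimal sketch: `split_by_document`, the document ids, and the sentences are all hypothetical placeholders.

```python
def split_by_document(doc_ids, test_fraction=0.1):
    """Split document ids so that no document feeds both sets."""
    size = int(len(doc_ids) * test_fraction)
    return doc_ids[size:], doc_ids[:size]


# Hypothetical toy corpus: document id -> list of sentences.
corpus = {
    'doc{:02d}'.format(i): ['sent{}_{}'.format(i, j) for j in range(3)]
    for i in range(10)
}
train_ids, test_ids = split_by_document(sorted(corpus))
train_sents = [s for d in train_ids for s in corpus[d]]
test_sents = [s for d in test_ids for s in corpus[d]]
print(len(train_ids), len(test_ids))
print(len(train_sents), len(test_sents))
```

Because the split happens at the document level, every sentence of a given document lands on one side only.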

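2.2 Precision and Recall

TP (true positives): relevant items correctly identified as relevant
TN (true negatives): irrelevant items correctly identified as irrelevant
FP (false positives): irrelevant items wrongly identified as relevant
FN (false negatives): relevant items wrongly identified as irrelevant

Precision = TP/(TP+FP). It shows how many of the identified items are relevant.
Recall = TP/(TP+FN). It shows how many of the relevant items were identified.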
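As a worked example of the two formulas, the sketch below computes them from two sets of item ids. The function name `precision_recall` and the sample sets are assumptions made for illustration. NLTK has its own metrics in `nltk.metrics`, which are not used here.

```python
def precision_recall(reference, predicted):
    """Compute (precision, recall) from two sets of relevant item ids."""
    tp = len(reference & predicted)   # relevant and identified
    fp = len(predicted - reference)   # identified but not relevant
    fn = len(reference - predicted)   # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example: items 1-4 are relevant, the system returned 3, 4, 5.
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5}
precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Here TP = 2, FP = 1, and FN = 2, so precision is 2/3 and recall is 2/4.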

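2.3 Confusion Matrices

In a confusion matrix, cell [i, j] is the number of times the correct label i was predicted as label j. The diagonal entries therefore count correct predictions, and the off-diagonal entries count errors.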

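import nltk
from nltk.corpus import brown


# Helpers as defined in the NLTK book
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]


def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]


# t2 is assumed to be an already trained tagger (for example the bigram tagger
# with backoff from chapter 5 of the book). It is not defined in these notes.

# Reference tag list
reference = tag_list(brown.tagged_sents(categories='editorial'))
# Tag list produced by the tagger under test
test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
# Build the confusion matrix
cm = nltk.ConfusionMatrix(reference, test)
# Print the matrix to the console
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))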
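The cell definition can also be checked by hand on a short tag sequence. The following sketch counts (correct label, predicted label) pairs with `collections.Counter`. The name `confusion_counts` and the two tag lists are hypothetical, and this is not how `nltk.ConfusionMatrix` is implemented.

```python
from collections import Counter


def confusion_counts(reference, predicted):
    """Count how often correct label i was predicted as label j."""
    if len(reference) != len(predicted):
        raise ValueError('label lists must have the same length')
    return Counter(zip(reference, predicted))


# Hypothetical reference and predicted tag lists.
reference = ['NN', 'NN', 'VB', 'JJ', 'NN', 'VB']
predicted = ['NN', 'VB', 'VB', 'NN', 'NN', 'VB']
cells = confusion_counts(reference, predicted)
# Diagonal cells (i == j) are the correct predictions.
correct = sum(n for (i, j), n in cells.items() if i == j)
print(cells[('NN', 'VB')])
print(correct, len(reference))
```

In this example one NN was mistaken for VB and one JJ for NN, so four of the six predictions fall on the diagonal.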
