In my view, WINGNUS can be seen as an improved version of the KEA algorithm: it takes the document's logical structure into account and attends not only to global information about the article but also to locally important information.

The WINGNUS algorithm. The paper reports that, statistically, the most important parts of a document tend to appear at sentence beginnings, in titles, and in similar positions. Instead of using the entire document text as input, WINGNUS reduces the input text at several levels of granularity, from the full text down to a minimal subset, and concentrates on the important parts.

1. As in KEA, candidate phrases are first selected according to rules.
2. Features are extracted for each candidate: on top of the TF-IDF feature, WINGNUS adds word offset, typeface, and phrase length features, among others.
3. Candidates are scored with a Naive Bayes model.

Main code (excerpts from the pke implementation)
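Before the step-by-step walkthrough, here is a minimal end-to-end usage sketch based on the pke library. The file name is a placeholder, and the exact method signatures may differ slightly between pke versions:

```python
import pke

# minimal usage sketch of pke's WINGNUS implementation;
# 'paper.xml' is a placeholder for a document that carries section/type
# metadata (e.g. an XML export of a scientific paper)
extractor = pke.supervised.WINGNUS()
extractor.load_document(input='paper.xml', language='en')

# 1. select candidate phrases with the NP grammar
extractor.candidate_selection()

# 2. + 3. compute features and score candidates with the bundled
#         Naive Bayes model (a custom model file can also be passed)
extractor.candidate_weighting()

# keep the 10 best-scoring keyphrases
for keyphrase, score in extractor.get_n_best(n=10):
    print(keyphrase, score)
```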
1. Candidate keyphrase selection

```python
def candidate_selection(self, grammar=None):
    """Select noun phrases (NP) and NP containing a pre-propositional phrase
    (NP IN NP) as keyphrase candidates.

    Args:
        grammar (str): grammar defining POS patterns of NPs.
    """

    # initialize default grammar if none provided
    if grammar is None:
        grammar = r"""
            NBAR:
                {<NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>}

            NP:
                {<NBAR>}
                {<NBAR><ADP><NBAR>}
        """

    self.grammar_selection(grammar)
```
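To make the grammar concrete, here is a toy illustration of the NP chunking using nltk's RegexpParser (which pke's grammar_selection relies on); the hand-tagged tokens are made up for the example:

```python
import nltk

# made-up sentence, already tokenised and tagged with Universal POS tags
tagged = [('automatic', 'ADJ'), ('keyphrase', 'NOUN'), ('extraction', 'NOUN'),
          ('from', 'ADP'), ('scientific', 'ADJ'), ('articles', 'NOUN')]

grammar = r"""
    NBAR:
        {<NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>}

    NP:
        {<NBAR>}
        {<NBAR><ADP><NBAR>}
"""

# chunk the sentence and collect the NP subtrees as candidate phrases
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(word for word, tag in subtree.leaves()))
# -> automatic keyphrase extraction
# -> scientific articles
```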
2. Candidate keyphrase features

```python
def feature_extraction(self, df=None, training=False, features_set=None):
    """Extract features for each candidate.

    Args:
        df (dict): document frequencies, the number of documents should be
            specified using the "--NB_DOC--" key.
        training (bool): indicates whether features are computed for the
            training set for computing IDF weights, defaults to false.
        features_set (list): the set of features to use, defaults to
            [1, 4, 6].
    """

    # define the default features_set
    if features_set is None:
        features_set = [1, 4, 6]

    # initialize default document frequency counts if none provided
    if df is None:
        logging.warning('LoadFile._df_counts is hard coded to {}'.format(
            self._df_counts))
        df = load_document_frequency_file(self._df_counts, delimiter='\t')

    # initialize the number of documents as --NB_DOC--
    N = df.get('--NB_DOC--', 0) + 1
    if training:
        N -= 1

    # find the maximum offset
    maximum_offset = float(sum([s.length for s in self.sentences]))

    # loop through the candidates
    for k, v in self.candidates.items():

        # initialize features array
        feature_array = []

        # get candidate document frequency
        candidate_df = 1 + df.get(k, 0)

        # hack for handling training documents
        if training and candidate_df > 1:
            candidate_df -= 1

        # compute the idf of the candidate
        idf = math.log(N / candidate_df, 2)

        # [F1] -> TF*IDF
        feature_array.append(len(v.surface_forms) * idf)

        # [F2] -> TF
        feature_array.append(len(v.surface_forms))

        # [F3] -> term frequency of substrings
        tf_of_substrings = 0
        stoplist = self.stoplist
        for i in range(len(v.lexical_form)):
            for j in range(i, min(len(v.lexical_form), i + 3)):
                sub_words = v.lexical_form[i:j + 1]
                sub_string = ' '.join(sub_words)

                # skip if substring is fullstring
                if sub_string == ' '.join(v.lexical_form):
                    continue

                # skip if substring contains a stopword
                if set(sub_words).intersection(stoplist):
                    continue

                # check whether the substring occurs "as is"
                if sub_string in self.candidates:

                    # loop through substring offsets
                    for offset_1 in self.candidates[sub_string].offsets:
                        is_included = False
                        for offset_2 in v.offsets:
                            if offset_2 <= offset_1 <= offset_2 + len(v.lexical_form):
                                is_included = True
                        if not is_included:
                            tf_of_substrings += 1

        feature_array.append(tf_of_substrings)

        # [F4] -> relative first occurrence
        feature_array.append(v.offsets[0] / maximum_offset)

        # [F5] -> relative last occurrence
        feature_array.append(v.offsets[-1] / maximum_offset)

        # [F6] -> length of phrases in words
        feature_array.append(len(v.lexical_form))

        # [F7] -> typeface
        feature_array.append(0)

        # extract information from sentence meta information
        meta = [self.sentences[sid].meta for sid in v.sentence_ids]

        # extract meta information of candidate
        sections = [u['section'] for u in meta if 'section' in u]
        types = [u['type'] for u in meta if 'type' in u]

        # [F8] -> Is in title
        feature_array.append('title' in sections)

        # [F9] -> TitleOverlap
        feature_array.append(0)

        # [F10] -> Header
        feature_array.append('sectionHeader' in types or
                             'subsectionHeader' in types or
                             'subsubsectionHeader' in types)

        # [F11] -> abstract
        feature_array.append('abstract' in sections)

        # [F12] -> introduction
        feature_array.append('introduction' in sections)

        # [F13] -> related work
        feature_array.append('related work' in sections)

        # [F14] -> conclusions
        feature_array.append('conclusions' in sections)

        # [F15] -> HeaderF
        feature_array.append(types.count('sectionHeader') +
                             types.count('subsectionHeader') +
                             types.count('subsubsectionHeader'))

        # [F16] -> abstractF
        feature_array.append(sections.count('abstract'))

        # [F17] -> introductionF
        feature_array.append(sections.count('introduction'))

        # [F18] -> related workF
        feature_array.append(sections.count('related work'))

        # [F19] -> conclusionsF
        feature_array.append(sections.count('conclusions'))

        # add the features to the instance container
        self.instances[k] = np.array([feature_array[i - 1]
                                      for i in features_set])

    # scale features
    self.feature_scaling()
```
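For clarity, here is a small numeric sketch of the [F1] TF*IDF feature computed above, using made-up counts (N is the number of documents in the frequency file plus one for the current document):

```python
import math

# made-up counts for illustration only
N = 101               # --NB_DOC-- (100 documents) + 1 for the current document
candidate_df = 1 + 7  # 1 + number of documents that contain the candidate
tf = 3                # number of surface forms of the candidate in this document

idf = math.log(N / candidate_df, 2)  # log2(101 / 8) ≈ 3.66
f1_tfidf = tf * idf                  # [F1] ≈ 10.97
```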
3. Training and saving the Naive Bayes model

```python
def train(training_instances, training_classes, model_file):
    """Train a Naive Bayes classifier and store the model in a file.

    Args:
        training_instances (list): list of features.
        training_classes (list): list of binary values.
        model_file (str): the model output file.
    """

    clf = MultinomialNB()
    clf.fit(training_instances, training_classes)
    dump_model(clf, model_file)
```
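At prediction time the stored model is loaded again and each candidate's feature vector is scored; the probability of the positive ("is a keyphrase") class serves as the ranking score. The sketch below is illustrative only: joblib stands in for whatever dump_model/load_model actually use, and rank_candidates is a hypothetical helper:

```python
import joblib  # stand-in for the library's own model persistence helpers

def rank_candidates(instances, model_file):
    """Score candidate feature vectors with a stored Naive Bayes model.

    Args:
        instances (dict): candidate -> feature vector, as built by
            feature_extraction above.
        model_file (str): path of the stored model.
    """
    clf = joblib.load(model_file)
    weights = {}
    for candidate, features in instances.items():
        # probability of the positive class is used as the candidate score
        weights[candidate] = clf.predict_proba([features])[0][1]
    return sorted(weights, key=weights.get, reverse=True)
```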
4. Comparison with other algorithms

[Figure: keyphrases extracted by the KEA algorithm for two sample abstracts]
[Figure: keyphrases extracted by the KEA algorithm for two sample abstracts]
[Figure: keyphrases extracted by the WINGNUS algorithm for the same two abstracts]

The WINGNUS output clearly carries more information.

Remaining issues

Some of the extracted phrases are not proper words, and high-frequency, meaningless words are not filtered out. A later improvement is therefore to remove stopwords and to constrain candidate length, which yields more reliable keyphrases, as sketched below.
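A minimal sketch of that filtering step, assuming the candidates are kept in a dict keyed by their surface string (filter_candidates and its parameters are hypothetical, not part of the original code):

```python
from nltk.corpus import stopwords  # requires the nltk 'stopwords' corpus

stoplist = set(stopwords.words('english'))

def filter_candidates(candidates, min_words=1, max_words=4):
    """Drop candidates containing stopwords, non-alphabetic tokens,
    or an implausible number of words."""
    kept = {}
    for phrase, info in candidates.items():
        words = phrase.split()
        if not (min_words <= len(words) <= max_words):
            continue
        if any(w.lower() in stoplist for w in words):
            continue
        if not all(w.isalpha() for w in words):
            continue
        kept[phrase] = info
    return kept
```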
Reference: https://www.aclweb.org/anthology/S10-1035.pdf
The full original write-up is available at:
https://blog.csdn.net/qq_41824131/article/details/107029026