Extracting Keywords with the KEA Algorithm

Tech · 2022-07-11

The previous article covered keyword extraction based on BERT, but it produced too few keywords, so I needed other methods to enlarge the set. The first one I chose was the KEA algorithm.

The KEA algorithm

KEA identifies candidate keyphrases using lexical methods, computes feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases.

1. First, select candidate keyphrases according to a set of rules. The authors propose three:

(1) Candidate phrases are limited to a certain maximum length (usually three words).
(2) Candidate phrases cannot be proper names (i.e. single words that only ever appear with an initial capital).
(3) Candidate phrases cannot begin or end with a stopword.

2. Extract the tf-idf feature of each candidate.

3. Score the candidates with a Naive Bayes model, rank them, and select the top-ranked ones as keywords.

Main code

1. Select candidate phrases according to the rules (a standalone sketch of the stopword rule follows the pke excerpt below):

    # Method of pke's Kea class; `string` is imported at module level.
    def candidate_selection(self, stoplist=None, **kwargs):
        """Select 1-3 grams of `normalized` words as keyphrase candidates.
        Candidates that start or end with a stopword are discarded.
        Candidates that contain punctuation marks (from
        `string.punctuation`) as words are filtered out.

        Args:
            stoplist (list): the stoplist for filtering candidates,
                defaults to the nltk stoplist.
        """
        # select ngrams from 1 to 3 grams
        self.ngram_selection(n=3)

        # filter candidates containing punctuation marks
        self.candidate_filtering(list(string.punctuation))

        # initialize stoplist list if not provided
        if stoplist is None:
            stoplist = self.stoplist

        # filter candidates that start or end with a stopword
        for k in list(self.candidates):

            # get the candidate
            v = self.candidates[k]

            # delete if candidate contains a stopword in first/last position
            words = [u.lower() for u in v.surface_forms[0]]
            if words[0] in stoplist or words[-1] in stoplist:
                del self.candidates[k]
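To see the candidate rules outside the pke class, here is a minimal standalone sketch; the tokenized sentence and the tiny stoplist are invented for illustration:

    import string

    # toy input: a tokenized sentence and a small illustrative stoplist
    tokens = ['the', 'kea', 'algorithm', 'extracts', 'candidate', 'keyphrases']
    stoplist = {'the', 'of', 'and', 'extracts'}

    def select_candidates(tokens, stoplist, max_n=3):
        """Return 1..max_n grams obeying KEA's candidate rules: no
        punctuation tokens, no stopword in first or last position."""
        candidates = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # rule: discard n-grams containing punctuation tokens
                if any(w in string.punctuation for w in gram):
                    continue
                # rule: discard n-grams that begin or end with a stopword
                if gram[0] in stoplist or gram[-1] in stoplist:
                    continue
                candidates.add(' '.join(gram))
        return candidates

    print(select_candidates(tokens, stoplist))
    # e.g. {'kea', 'kea algorithm', 'candidate keyphrases', ...}

Note that a stopword in the middle of an n-gram is allowed; only the boundary positions are checked, matching the pke code above.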

2. Extract the tf-idf features:

    # Method of pke's Kea class; `math`, `logging`, `numpy as np` and
    # `load_document_frequency_file` are imported at module level.
    def feature_extraction(self, df=None, training=False):
        """Extract features for each keyphrase candidate. Features are the
        tf*idf of the candidate and its first occurrence relative to the
        document.

        Args:
            df (dict): document frequencies, the number of documents should
                be specified using the "--NB_DOC--" key.
            training (bool): indicates whether features are computed for the
                training set for computing IDF weights, defaults to false.
        """
        # initialize default document frequency counts if none provided
        if df is None:
            logging.warning('LoadFile._df_counts is hard coded to {}'.format(
                self._df_counts))
            df = load_document_frequency_file(self._df_counts, delimiter='\t')

        # initialize the number of documents as --NB_DOC--
        N = df.get('--NB_DOC--', 0) + 1
        if training:
            N -= 1

        # find the maximum offset
        maximum_offset = float(sum([s.length for s in self.sentences]))

        for k, v in self.candidates.items():

            # get candidate document frequency
            candidate_df = 1 + df.get(k, 0)

            # hack for handling training documents
            if training and candidate_df > 1:
                candidate_df -= 1

            # compute the tf*idf of the candidate
            idf = math.log(N / candidate_df, 2)

            # add the features to the instance container
            self.instances[k] = np.array([len(v.surface_forms) * idf,
                                          v.offsets[0] / maximum_offset])

        # scale features
        self.feature_scaling()
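As a concrete illustration of the two features, here is a minimal sketch with invented numbers (the corpus size, frequencies, and offsets below are made up for the example):

    import math
    import numpy as np

    # invented numbers for illustration
    N = 101                 # corpus size plus the current document
    candidate_df = 1 + 4    # 1 + number of corpus docs containing the candidate
    term_frequency = 3      # occurrences of the candidate in this document
    first_offset = 12       # word offset of the candidate's first occurrence
    document_length = 300   # total number of words in the document

    idf = math.log(N / candidate_df, 2)
    features = np.array([term_frequency * idf,              # tf*idf
                         first_offset / document_length])   # relative first occurrence
    print(features)  # ~[13.01  0.04]

A candidate that is frequent in this document but rare in the corpus, and that appears early, gets a high tf*idf and a low first-occurrence value, which is exactly the profile of a good keyphrase.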

3. Train the Naive Bayes model and save it:

    # Method of pke's Kea class; `MultinomialNB` and `dump_model` are
    # imported at module level.
    def train(training_instances, training_classes, model_file):
        """Train a Naive Bayes classifier and store the model in a file.

        Args:
            training_instances (list): list of features.
            training_classes (list): list of binary values.
            model_file (str): the model output file.
        """
        clf = MultinomialNB()
        clf.fit(training_instances, training_classes)
        dump_model(clf, model_file)
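The training-and-scoring step can be reproduced directly with scikit-learn; a minimal sketch on toy data (the feature vectors and labels below are invented):

    from sklearn.naive_bayes import MultinomialNB

    # toy training data: [tf*idf, relative first occurrence] per candidate,
    # with label 1 = author-assigned keyphrase, 0 = not a keyphrase
    X_train = [[13.0, 0.04], [2.1, 0.80], [9.5, 0.10], [0.7, 0.95]]
    y_train = [1, 0, 1, 0]

    clf = MultinomialNB()
    clf.fit(X_train, y_train)

    # rank unseen candidates by P(keyphrase | features)
    X_test = [[11.2, 0.05], [1.3, 0.70]]
    scores = clf.predict_proba(X_test)[:, 1]
    print(scores)  # higher score = more likely to be a keyphrase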

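Putting the three steps together, here is a minimal end-to-end sketch using pke's Kea class (assuming pke is installed and an input.txt file exists; the exact load_document signature may vary between pke versions):

    import pke

    # create a Kea extractor and load the document
    extractor = pke.supervised.Kea()
    extractor.load_document(input='input.txt', language='en')

    # select 1-3 gram candidates obeying the stopword/punctuation rules
    extractor.candidate_selection()

    # compute the features and score candidates with the trained
    # Naive Bayes model
    extractor.candidate_weighting()

    # print the 10 highest-scoring keyphrases
    for keyphrase, score in extractor.get_n_best(n=10):
        print(keyphrase, score)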
References:

https://www.cs.waikato.ac.nz/ml/publications/2005/chap_Witten-et-al_Windows.pdf
https://github.com/boudinfl/pke

The above content is described in more detail at:

https://blog.csdn.net/qq_41824131/article/details/107028478
