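TextRank was not used to extract keywords in the actual task, but I implemented it anyway to compare its results against the other two keyword algorithms, so here is a brief summary.

Idea

1. If a word appears after many different words, that word is probably important.
2. If a word follows a word with a high TextRank value, its own TextRank value rises accordingly.
3. Build a network from the adjacency relations between words, run PageRank iteratively to compute the rank of each node, and sort by rank to obtain the keywords. PageRank was originally designed to rank web pages, where the links between pages are the edges of the graph. In the form given in the TextRank paper referenced at the end of this post, the iteration is:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}$$

Here $d$ is the damping factor, $In(V_i)$ is the set of nodes that point to $V_i$, and $Out(V_j)$ is the set of nodes that $V_j$ points to.

Implementation

Set a sliding window of length N; every word inside the window is treated as a neighbor of the current word node, so the word graph built by TextRank is an undirected graph. The graph for a document is built after removing stop words and filtering by part of speech. Because different word pairs can have different co-occurrence counts, TextRank uses the co-occurrence count as the weight of the undirected edge.

Main code

Building the weighted undirected graph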
    def build_word_graph(self, window=2, pos=None):
        """Build a graph representation of the document in which nodes/vertices
        are words and edges represent co-occurrence relation. Syntactic filters
        can be applied to select only words of certain Part-of-Speech.
        Co-occurrence relations can be controlled using the distance between
        word occurrences in the document.

        As the original paper does not give precise details on how the word
        graph is constructed, we make the following assumptions from the
        example given in Figure 2: 1) sentence boundaries **are not** taken
        into account and, 2) stopwords and punctuation marks **are**
        considered as words when computing the window.

        Args:
            window (int): the window for connecting two words in the graph,
                defaults to 2.
            pos (set): the set of valid pos for words to be considered as
                nodes in the graph, defaults to ('NOUN', 'PROPN', 'ADJ').
        """

        if pos is None:
            pos = {'NOUN', 'PROPN', 'ADJ'}

        # flatten document as a sequence of (word, pass_syntactic_filter) tuples
        text = [(word, sentence.pos[i] in pos) for sentence in self.sentences
                for i, word in enumerate(sentence.stems)]

        # add nodes to the graph
        self.graph.add_nodes_from([word for word, valid in text if valid])

        # add edges to the graph
        for i, (node1, is_in_graph1) in enumerate(text):

            # speed things up: skip words that did not pass the syntactic filter
            if not is_in_graph1:
                continue

            # look ahead at the words that fall inside the window
            for j in range(i + 1, min(i + window, len(text))):
                node2, is_in_graph2 = text[j]
                if is_in_graph2 and node1 != node2:
                    # note: add_edge stores no weight here, so repeated
                    # co-occurrences of the same pair are not counted
                    self.graph.add_edge(node1, node2)

Reference

https://www.aclweb.org/anthology/W04-3252.pdf
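Appendix: a sketch of the weighted graph and the ranking step

The method above only builds the graph. To make the co-occurrence weighting and the PageRank-style iteration from the "Idea" and "Implementation" sections concrete, here is a minimal pure-Python sketch. It is not the code I ran and not the implementation of any library: the function names `build_cooccurrence_graph` and `rank_words`, the damping factor of 0.85, and the stopping tolerance are all assumptions made for illustration, and the input is assumed to be a word list that has already been filtered. The window follows the same convention as the method above (with `window=2`, only directly adjacent words are linked).

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=2):
    """Undirected graph whose edge weight counts co-occurrences in the window.

    `words` is assumed to be an already filtered list of tokens.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, w1 in enumerate(words):
        # same look-ahead range as build_word_graph above
        for j in range(i + 1, min(i + window, len(words))):
            w2 = words[j]
            if w1 != w2:
                graph[w1][w2] += 1.0
                graph[w2][w1] += 1.0
    return graph


def rank_words(graph, d=0.85, max_iter=200, tol=1e-6):
    """Weighted PageRank-style iteration over an undirected word graph.

    d (damping factor), max_iter and tol are illustrative defaults.
    """
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    # total edge weight leaving each node, used to normalise contributions
    out_weight = {node: sum(graph[node].values()) for node in graph}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            rank = 0.0
            for neighbor, weight in graph[node].items():
                rank += weight / out_weight[neighbor] * scores[neighbor]
            new_scores[node] = (1 - d) + d * rank
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores


if __name__ == "__main__":
    tokens = ["graph", "rank", "graph", "word"]
    ranks = rank_words(build_cooccurrence_graph(tokens))
    # print words from highest to lowest score
    for word, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(word, round(score, 3))
```

Keeping the weights in a nested dictionary means a pair that co-occurs several times contributes proportionally more of its neighbor's score, which is the effect the weighted graph is meant to have; sorting the returned scores gives the keyword candidates.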