Innovation Training (10): Extractive Text Summarization via BERT Clustering

    2022-07-10


    1. Approach

    We use BERT as the pre-trained model and feed the embeddings it produces into a downstream task. In the paper referenced below, k-means is run over the sentence embeddings and the sentences closest to each cluster centroid are taken as summary candidates, so the method can be viewed as a form of clustering-based extractive summarization.

    2. Code Analysis

    The library is built on the PyTorch-based Transformers framework. It uses a pre-trained BERT model (or another pre-trained model) to generate sentence embeddings, then clusters them with k-means or a Gaussian mixture model fit by expectation-maximization.
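    As a rough sketch of that pipeline, assuming toy two-dimensional vectors in place of real BERT sentence embeddings (in the actual library these come from a hidden layer of BERT):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for BERT sentence embeddings: one row per sentence.
sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
embeddings = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],   # cluster A
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],   # cluster B
])

# Cluster the sentence vectors, then keep, for each centroid,
# the sentence whose embedding lies closest to it.
k = 2
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
chosen = set()
for centroid in kmeans.cluster_centers_:
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    chosen.add(int(np.argmin(dists)))

# Summary sentences are emitted in document order.
summary = [sentences[i] for i in sorted(chosen)]
print(summary)
```

    With the two well-separated toy clusters above, one representative sentence is picked per cluster, which is exactly the extraction step the library performs on real embeddings.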

    2.1 Quick Start

    First, let's test the example given in the README:

```python
from summarizer import Summarizer

body = 'Text body that you want to summarize with BERT'
body2 = 'Something else you want to summarize with BERT'
model = Summarizer()
model(body)
model(body2)
```

    Swapping in a longer document also works reasonably well.

    Test text:

    The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price. The deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal. Mubadala, an Abu Dhabi investment fund, purchased 90% of the building for $800 million in 2008. Real estate firm Tishman Speyer had owned the other 10%. The buyer is RFR Holding, a New York real estate company. Officials with Tishman and RFR did not immediately respond to a request for comments. It’s unclear when the deal will close. The building sold fairly quickly after being publicly placed on the market only two months ago. The sale was handled by CBRE Group. The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building. The rent is rising from $7.75 million last year to $32.5 million this year to $41 million in 2028. Meantime, rents in the building itself are not rising nearly that fast. While the building is an iconic landmark in the New York skyline, it is competing against newer office towers with large floor-to-ceiling windows and all the modern amenities. Still the building is among the best known in the city, even to people who have never been to New York. It is famous for its triangle-shaped, vaulted windows worked into the stylized crown, along with its distinctive eagle gargoyles near the top. It has been featured prominently in many films, including Men in Black 3, Spider-Man, Armageddon, Two Weeks Notice and Independence Day. The previous sale took place just before the 2008 financial meltdown led to a plunge in real estate prices. 
Still there have been a number of high profile skyscrapers purchased for top dollar in recent years, including the Waldorf Astoria hotel, which Chinese firm Anbang Insurance purchased in 2016 for nearly $2 billion, and the Willis Tower in Chicago, which was formerly known as Sears Tower, once the world’s tallest. Blackstone Group (BX) bought it for $1.3 billion 2015. The Chrysler Building was the headquarters of the American automaker until 1953, but it was named for and owned by Chrysler chief Walter Chrysler, not the company itself. Walter Chrysler had set out to build the tallest building in the world, a competition at that time with another Manhattan skyscraper under construction at 40 Wall Street at the south end of Manhattan. He kept secret the plans for the spire that would grace the top of the building, building it inside the structure and out of view of the public until 40 Wall Street was complete. Once the competitor could rise no higher, the spire of the Chrysler building was raised into view, giving it the title.

    Result:

    The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price. The deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal. The building sold fairly quickly after being publicly placed on the market only two months ago. The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.

    2.2 Analysis

    The implementation breaks down as follows:

    2.2.1 The Summarizer class

    First, the Summarizer class:

```python
class Summarizer(SingleModel):

    def __init__(
        self,
        model: str = 'bert-large-uncased',
        custom_model: PreTrainedModel = None,
        custom_tokenizer: PreTrainedTokenizer = None,
        hidden: int = -2,
        reduce_option: str = 'mean',
        sentence_handler: SentenceHandler = SentenceHandler(),
        random_state: int = 12345
    ):
        """
        This is the main Bert Summarizer class.

        :param model: This parameter is associated with the inherit string parameters from the transformers library.
        :param custom_model: If you have a pre-trained model, you can add the model class here.
        :param custom_tokenizer: If you have a custom tokenizer, you can add the tokenizer here.
        :param hidden: This signifies which layer of the BERT model you would like to use as embeddings.
        :param reduce_option: Given the output of the bert model, this param determines how you want to reduce results.
        :param greedyness: associated with the neuralcoref library. Determines how greedy coref should be.
        :param language: Which language to use for training.
        :param random_state: The random state to reproduce summarizations.
        """
        super(Summarizer, self).__init__(
            model, custom_model, custom_tokenizer, hidden,
            reduce_option, sentence_handler, random_state
        )
```

    It inherits from the SingleModel class:

    2.2.2 The SingleModel class

```python
class SingleModel(ModelProcessor):
    """
    Deprecated for naming sake.
    """

    def __init__(
        self,
        model='bert-large-uncased',
        custom_model: PreTrainedModel = None,
        custom_tokenizer: PreTrainedTokenizer = None,
        hidden: int = -2,
        reduce_option: str = 'mean',
        sentence_handler: SentenceHandler = SentenceHandler(),
        random_state: int = 12345
    ):
        super(SingleModel, self).__init__(
            model=model, custom_model=custom_model,
            custom_tokenizer=custom_tokenizer, hidden=hidden,
            reduce_option=reduce_option, sentence_handler=sentence_handler,
            random_state=random_state
        )

    def run_clusters(self, content: List[str], ratio=0.2, algorithm='kmeans',
                     use_first: bool = True) -> List[str]:
        hidden = self.model(content, self.hidden, self.reduce_option)
        hidden_args = ClusterFeatures(
            hidden, algorithm, random_state=self.random_state).cluster(ratio)

        if use_first:
            if hidden_args[0] != 0:
                hidden_args.insert(0, 0)

        return [content[j] for j in hidden_args]
```

    SingleModel inherits from the ModelProcessor class and implements the run_clusters method, which delegates the clustering itself to the ClusterFeatures class. If use_first is set, the document's first sentence is forced into the summary even when clustering did not select it.
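    That use_first guard is simple enough to isolate as a tiny sketch (the helper name apply_use_first is mine, not the library's):

```python
from typing import List

def apply_use_first(hidden_args: List[int], use_first: bool = True) -> List[int]:
    """Mirror of the guard in run_clusters: ensure sentence 0 leads the summary."""
    if use_first and (not hidden_args or hidden_args[0] != 0):
        hidden_args = [0] + hidden_args
    return hidden_args

print(apply_use_first([2, 5]))   # index 0 is prepended
print(apply_use_first([0, 3]))   # already starts at 0, unchanged
```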

    2.2.3 The ClusterFeatures class:

```python
class ClusterFeatures(object):
    """
    Basic handling of clustering features.
    """

    def __init__(
        self,
        features: ndarray,
        algorithm: str = 'kmeans',
        pca_k: int = None,
        random_state: int = 12345
    ):
        """
        :param features: the embedding matrix created by bert parent
        :param algorithm: Which clustering algorithm to use
        :param pca_k: If you want the features to be ran through pca, this is the components number
        :param random_state: Random state
        """
        if pca_k:
            self.features = PCA(n_components=pca_k).fit_transform(features)
        else:
            self.features = features

        self.algorithm = algorithm
        self.pca_k = pca_k
        self.random_state = random_state

    def __get_model(self, k: int):
        """
        Retrieve clustering model

        :param k: amount of clusters
        :return: Clustering model
        """
        if self.algorithm == 'gmm':
            return GaussianMixture(n_components=k, random_state=self.random_state)
        return KMeans(n_clusters=k, random_state=self.random_state)

    def __get_centroids(self, model):
        """
        Retrieve centroids of model

        :param model: Clustering model
        :return: Centroids
        """
        if self.algorithm == 'gmm':
            return model.means_
        return model.cluster_centers_

    def __find_closest_args(self, centroids: np.ndarray):
        """
        Find the closest arguments to centroid

        :param centroids: Centroids to find closest
        :return: Closest arguments
        """
        centroid_min = 1e10
        cur_arg = -1
        args = {}
        used_idx = []

        for j, centroid in enumerate(centroids):
            for i, feature in enumerate(self.features):
                value = np.linalg.norm(feature - centroid)
                if value < centroid_min and i not in used_idx:
                    cur_arg = i
                    centroid_min = value

            used_idx.append(cur_arg)
            args[j] = cur_arg
            centroid_min = 1e10
            cur_arg = -1

        return args

    def cluster(self, ratio: float = 0.1) -> List[int]:
        """
        Clusters sentences based on the ratio

        :param ratio: Ratio to use for clustering
        :return: Sentences index that qualify for summary
        """
        k = 1 if ratio * len(self.features) < 1 else int(len(self.features) * ratio)
        model = self.__get_model(k).fit(self.features)
        centroids = self.__get_centroids(model)
        cluster_args = self.__find_closest_args(centroids)
        sorted_values = sorted(cluster_args.values())
        return sorted_values

    def __call__(self, ratio: float = 0.1) -> List[int]:
        return self.cluster(ratio)
```

    The core logic lives in cluster(): the features are optionally reduced with PCA (only when pca_k is set), clustered with k-means or a GMM, and the indices of the sentences closest to each centroid are returned in sorted (document) order.
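    The ratio-to-k rule in cluster() is worth checking in isolation; num_clusters below is a hypothetical helper mirroring that single line:

```python
def num_clusters(n_sentences: int, ratio: float = 0.1) -> int:
    # Same rule as ClusterFeatures.cluster(): never ask for zero clusters.
    return 1 if ratio * n_sentences < 1 else int(n_sentences * ratio)

print(num_clusters(5, 0.1))    # 0.5 < 1, clamped to a single cluster
print(num_clusters(20, 0.2))
print(num_clusters(25, 0.2))
```

    So the summary length scales with the input length via ratio, and very short documents still yield a one-sentence summary.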

    3. Adapting the Model to Chinese

    The model works well, and I suspect that is mostly BERT's doing. Since it performs this well on English, can it be applied to Chinese?

    I found someone in the project's GitHub issues with the same idea:

    The author replied that it should work if you swap in a BERT model and tokenizer that support Chinese, so I tried the Chinese model bert-base-chinese. The result: no output at all.

    After some debugging, I found that sentence splitting assumed English text; once I switched to Chinese segmentation with jieba, it worked.
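    For reference, Chinese sentence splitting can also be done with a small regex on sentence-final punctuation. This is only a sketch of the idea: the class name ChineseSentenceHandler and its process signature are mine (modeled loosely on the library's SentenceHandler), and my actual fix used jieba.

```python
import re
from typing import List

class ChineseSentenceHandler:
    """Hypothetical drop-in sentence handler for Chinese text.

    Splits on Chinese sentence-final punctuation (。！？) instead of
    relying on English tokenization rules.
    """

    def process(self, body: str, min_length: int = 4,
                max_length: int = 600) -> List[str]:
        # Zero-width split keeps the terminating punctuation attached
        # to each sentence.
        parts = re.split(r'(?<=[。！？])', body)
        return [s.strip() for s in parts
                if min_length <= len(s.strip()) <= max_length]

handler = ChineseSentenceHandler()
print(handler.process('今天天气很好。我们去公园吧！明天呢？'))
```

    With the default English handler, an entire Chinese document arrives as a single "sentence", so the clustering step has nothing to select from, which matches the no-output symptom above.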

    Here are the test results:

    Test text:

    新华社日内瓦6月30日电 6月30日,联合国人权理事会第44次会议在日内瓦举行。在当天的会议上,古巴代表53个国家作共同发言,支持中国香港特区维护国家安全立法。 古巴表示,不干涉主权国家内部事务是《联合国宪章》重要原则和国际关系基本准则。国家安全立法属于国家立法权力,这对世界上任何国家都是如此。这不是人权问题,不应在人权理事会讨论。 古巴强调,我们认为各国都有权通过立法维护国家安全,赞赏基于该目的采取的举措。我们欢迎中国立法机关通过《中华人民共和国香港特别行政区维护国家安全法》,并重申坚持“一国两制”方针。我们认为,这一举措有利于“一国两制”行稳致远,有利于香港长期繁荣稳定,香港广大居民的合法权利和自由也可在安全环境下得到更好行使。 古巴表示,我们重申,香港特别行政区是中国不可分割的一部分,香港事务是中国内政,外界不应干涉。我们敦促有关方面停止利用涉港问题干涉中国内政。

    Result:

    我们欢迎中国立法机关通过《中华人民共和国香港特别行政区维护国家安全法》,并重申坚持“一国两制”方针。

    The result looks reasonable.

    [Screenshot of the GitHub issue]

    Reference paper:

    https://arxiv.org/abs/1906.04165
