Innovation Practicum (10): Extractive Text Summarization with BERT Clustering

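1. Approach

BERT serves as the pre-trained model, and the embeddings it produces feed the downstream task. The paper runs k-means over the sentence embeddings and takes the sentences closest to the cluster centroids as candidate summary sentences, so the method can be viewed as a form of clustering.

2. Code Analysis

The code builds on the PyTorch-based Transformers framework: a pre-trained BERT model (or another pre-trained model) generates the embeddings, which are then clustered with k-means or expectation-maximization (a Gaussian mixture model).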
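To make the clustering step concrete, here is a minimal pure-Python sketch of k-means followed by picking the sentence nearest to each centroid. It is only an illustration: the 2-D "embeddings" are made-up toy values, the function names are hypothetical, and the centroids are seeded with the first k points to keep the run deterministic. The library itself relies on scikit-learn's KMeans rather than anything like this.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; the first k points seed the centroids (an assumption for determinism)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def nearest_sentence_ids(points, centroids):
    """Index of the point closest to each centroid, returned in document order."""
    ids = {min(range(len(points)), key=lambda i: math.dist(points[i], c)) for c in centroids}
    return sorted(ids)

# Toy 2-D "sentence embeddings" (hypothetical values; real BERT vectors are far larger).
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = kmeans(embeddings, k=2)
summary_ids = nearest_sentence_ids(embeddings, centroids)
print(summary_ids)  # [0, 3]
```

The two groups of points end up with one centroid each, and the sentence indices 0 and 3 are the ones that would be kept for the summary.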

2.1 Basic Usage

First, try the example given in the README:

from summarizer import Summarizer

body = 'Text body that you want to summarize with BERT'
body2 = 'Something else you want to summarize with BERT'
model = Summarizer()
model(body)
model(body2)

Swapping in a longer passage gives a reasonable summary.

Test text:

The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price.
The deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal.
Mubadala, an Abu Dhabi investment fund, purchased 90% of the building for $800 million in 2008.
Real estate firm Tishman Speyer had owned the other 10%.
The buyer is RFR Holding, a New York real estate company.
Officials with Tishman and RFR did not immediately respond to a request for comments.
It's unclear when the deal will close.
The building sold fairly quickly after being publicly placed on the market only two months ago.
The sale was handled by CBRE Group.
The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.
The rent is rising from $7.75 million last year to $32.5 million this year to $41 million in 2028.
Meantime, rents in the building itself are not rising nearly that fast.
While the building is an iconic landmark in the New York skyline, it is competing against newer office towers with large floor-to-ceiling windows and all the modern amenities.
Still the building is among the best known in the city, even to people who have never been to New York.
It is famous for its triangle-shaped, vaulted windows worked into the stylized crown, along with its distinctive eagle gargoyles near the top.
It has been featured prominently in many films, including Men in Black 3, Spider-Man, Armageddon, Two Weeks Notice and Independence Day.
The previous sale took place just before the 2008 financial meltdown led to a plunge in real estate prices.
Still there have been a number of high profile skyscrapers purchased for top dollar in recent years, including the Waldorf Astoria hotel, which Chinese firm Anbang Insurance purchased in 2016 for nearly $2 billion, and the Willis Tower in Chicago, which was formerly known as Sears Tower, once the world's tallest.
Blackstone Group (BX) bought it for $1.3 billion 2015.
The Chrysler Building was the headquarters of the American automaker until 1953, but it was named for and owned by Chrysler chief Walter Chrysler, not the company itself.
Walter Chrysler had set out to build the tallest building in the world, a competition at that time with another Manhattan skyscraper under construction at 40 Wall Street at the south end of Manhattan. He kept secret the plans for the spire that would grace the top of the building, building it inside the structure and out of view of the public until 40 Wall Street was complete.
Once the competitor could rise no higher, the spire of the Chrysler building was raised into view, giving it the title.

Result:

The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price. The deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal. The building sold fairly quickly after being publicly placed on the market only two months ago. The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.

2.2 Analysis

The implementation is walked through below:

2.2.1 Summarizer

First comes the Summarizer class:

class Summarizer(SingleModel):

    def __init__(
        self,
        model: str = 'bert-large-uncased',
        custom_model: PreTrainedModel = None,
        custom_tokenizer: PreTrainedTokenizer = None,
        hidden: int = -2,
        reduce_option: str = 'mean',
        sentence_handler: SentenceHandler = SentenceHandler(),
        random_state: int = 12345
    ):
        """
        This is the main Bert Summarizer class.
        :param model: This parameter is associated with the inherit string parameters from the transformers library.
        :param custom_model: If you have a pre-trained model, you can add the model class here.
        :param custom_tokenizer: If you have a custom tokenizer, you can add the tokenizer here.
        :param hidden: This signifies which layer of the BERT model you would like to use as embeddings.
        :param reduce_option: Given the output of the bert model, this param determines how you want to reduce results.
        :param greedyness: associated with the neuralcoref library. Determines how greedy coref should be.
        :param language: Which language to use for training.
        :param random_state: The random state to reproduce summarizations.
        """
        super(Summarizer, self).__init__(
            model, custom_model, custom_tokenizer, hidden, reduce_option, sentence_handler, random_state
        )

It inherits from the SingleModel class:

2.2.2 SingleModel

class SingleModel(ModelProcessor):
    """
    Deprecated for naming sake.
    """

    def __init__(
        self,
        model='bert-large-uncased',
        custom_model: PreTrainedModel = None,
        custom_tokenizer: PreTrainedTokenizer = None,
        hidden: int=-2,
        reduce_option: str = 'mean',
        sentence_handler: SentenceHandler = SentenceHandler(),
        random_state: int=12345
    ):
        super(SingleModel, self).__init__(
            model=model, custom_model=custom_model, custom_tokenizer=custom_tokenizer,
            hidden=hidden, reduce_option=reduce_option,
            sentence_handler=sentence_handler, random_state=random_state
        )

    def run_clusters(self, content: List[str], ratio=0.2, algorithm='kmeans', use_first: bool= True) -> List[str]:
        hidden = self.model(content, self.hidden, self.reduce_option)
        hidden_args = ClusterFeatures(hidden, algorithm, random_state=self.random_state).cluster(ratio)

        if use_first:
            if hidden_args[0] != 0:
                hidden_args.insert(0,0)

        return [content[j] for j in hidden_args]

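SingleModel inherits from ModelProcessor and implements run_clusters, which calls ClusterFeatures to choose the sentence indices. When use_first is set, index 0 is inserted if it is missing, so the first sentence of the document is always part of the summary.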
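The use_first branch can be isolated into a small sketch. The helper name apply_use_first is hypothetical, and unlike the original code it also guards against an empty index list.

```python
from typing import List

def apply_use_first(selected: List[int], use_first: bool = True) -> List[int]:
    """Mirror of the use_first branch in run_clusters: make sure sentence 0 is included."""
    result = list(selected)  # copy so the caller's list is not mutated
    if use_first and (not result or result[0] != 0):
        result.insert(0, 0)
    return result

print(apply_use_first([2, 5, 9]))  # [0, 2, 5, 9]
print(apply_use_first([0, 4]))     # [0, 4]
```

One consequence worth noting: when index 0 was not among the clustered picks, the summary ends up one sentence longer than the number of clusters.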

2.2.3 The ClusterFeatures class:

class ClusterFeatures(object):
    """
    Basic handling of clustering features.
    """

    def __init__(
        self,
        features: ndarray,
        algorithm: str = 'kmeans',
        pca_k: int = None,
        random_state: int = 12345
    ):
        """
        :param features: the embedding matrix created by bert parent
        :param algorithm: Which clustering algorithm to use
        :param pca_k: If you want the features to be ran through pca, this is the components number
        :param random_state: Random state
        """

        if pca_k:
            self.features = PCA(n_components=pca_k).fit_transform(features)
        else:
            self.features = features

        self.algorithm = algorithm
        self.pca_k = pca_k
        self.random_state = random_state

    def __get_model(self, k: int):
        """
        Retrieve clustering model
        :param k: amount of clusters
        :return: Clustering model
        """

        if self.algorithm == 'gmm':
            return GaussianMixture(n_components=k, random_state=self.random_state)
        return KMeans(n_clusters=k, random_state=self.random_state)

    def __get_centroids(self, model):
        """
        Retrieve centroids of model
        :param model: Clustering model
        :return: Centroids
        """

        if self.algorithm == 'gmm':
            return model.means_
        return model.cluster_centers_

    def __find_closest_args(self, centroids: np.ndarray):
        """
        Find the closest arguments to centroid
        :param centroids: Centroids to find closest
        :return: Closest arguments
        """

        centroid_min = 1e10
        cur_arg = -1
        args = {}
        used_idx = []

        for j, centroid in enumerate(centroids):

            for i, feature in enumerate(self.features):
                value = np.linalg.norm(feature - centroid)

                if value < centroid_min and i not in used_idx:
                    cur_arg = i
                    centroid_min = value

            used_idx.append(cur_arg)
            args[j] = cur_arg
            centroid_min = 1e10
            cur_arg = -1

        return args

    def cluster(self, ratio: float = 0.1) -> List[int]:
        """
        Clusters sentences based on the ratio
        :param ratio: Ratio to use for clustering
        :return: Sentences index that qualify for summary
        """

        k = 1 if ratio * len(self.features) < 1 else int(len(self.features) * ratio)
        model = self.__get_model(k).fit(self.features)
        centroids = self.__get_centroids(model)
        cluster_args = self.__find_closest_args(centroids)
        sorted_values = sorted(cluster_args.values())
        return sorted_values

    def __call__(self, ratio: float = 0.1) -> List[int]:
        return self.cluster(ratio)

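The core logic lives in cluster(): it derives the number of clusters k from ratio and the sentence count, fits k-means or a Gaussian mixture model (gmm) on the features, finds the sentence closest to each centroid, and returns those sentence indices sorted in document order. PCA is optional and is applied in __init__ only when pca_k is set.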
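Two details of ClusterFeatures are easy to miss: how k is derived from ratio, and the fact that a sentence already chosen for one centroid is skipped for the next. The sketch below re-expresses both in plain Python. The function names and toy vectors are hypothetical, and it assumes there are at least as many features as centroids.

```python
import math

def num_clusters(n_sentences: int, ratio: float) -> int:
    """Same rule as ClusterFeatures.cluster(): at least one cluster, otherwise floor(n * ratio)."""
    return 1 if ratio * n_sentences < 1 else int(n_sentences * ratio)

def find_closest_args(features, centroids):
    """For each centroid, pick the nearest feature index that has not been used yet."""
    used = []
    for centroid in centroids:
        candidates = [i for i in range(len(features)) if i not in used]
        used.append(min(candidates, key=lambda i: math.dist(features[i], centroid)))
    return sorted(used)

features = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
centroids = [(0.4, 0.0), (0.45, 0.0)]  # both centroids are nearest to feature 0

print(num_clusters(23, 0.2))                   # 4
print(num_clusters(3, 0.2))                    # 1
print(find_closest_args(features, centroids))  # [0, 1]
```

With a sentence count in the low twenties and ratio=0.2, k comes out as 4, which is consistent with the four-sentence summary shown in section 2.1 (the exact count depends on how the sentence handler splits and filters the text).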

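3. Adapting the Model to Chinese

The model appears to work well, which I think is mostly thanks to BERT. Since it works well, can it be applied to Chinese?

In the project's GitHub issues I found someone with the same idea: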

[Image: screenshot of the GitHub issue (Noyusf.png)]

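The author says it is enough to swap in a BERT model and tokenizer that support Chinese, so I picked the Chinese BERT model bert-base-chinese. The result: no output at all.

After some debugging, I found that the text segmentation step was still using English rules; switching to the Chinese segmenter jieba fixed it.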
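For readers who want to see why English segmentation rules fail on Chinese text, the sketch below splits a passage into sentences on Chinese end-of-sentence punctuation using only the standard re module. This is an assumption-laden illustration, not the jieba-based change described above and not the library's SentenceHandler; the function name and the min_length default are hypothetical.

```python
import re
from typing import List

# A sentence is a run of non-terminator characters, followed by optional
# terminators (。!?) and optional closing quotes or brackets.
SENTENCE_PATTERN = re.compile(r'[^。!?]+[。!?]*[”’」』]*')

def split_chinese_sentences(text: str, min_length: int = 5) -> List[str]:
    """Split Chinese text into sentences and drop very short fragments."""
    sentences = [s.strip() for s in SENTENCE_PATTERN.findall(text)]
    return [s for s in sentences if len(s) >= min_length]

sample = "古巴表示,我们重申这一立场。这不是人权问题!好。"
for sentence in split_chinese_sentences(sample):
    print(sentence)
# 古巴表示,我们重申这一立场。
# 这不是人权问题!
```

English-oriented splitters look for ASCII periods followed by whitespace, neither of which appears in typical Chinese text, so the whole passage can collapse into a single "sentence" and leave nothing to cluster.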

The test results are shown below.

Test text:

新华社日内瓦6月30日电 6月30日,联合国人权理事会第44次会议在日内瓦举行。在当天的会议上,古巴代表53个国家作共同发言,支持中国香港特区维护国家安全立法。
古巴表示,不干涉主权国家内部事务是《联合国宪章》重要原则和国际关系基本准则。国家安全立法属于国家立法权力,这对世界上任何国家都是如此。这不是人权问题,不应在人权理事会讨论。
古巴强调,我们认为各国都有权通过立法维护国家安全,赞赏基于该目的采取的举措。我们欢迎中国立法机关通过《中华人民共和国香港特别行政区维护国家安全法》,并重申坚持“一国两制”方针。我们认为,这一举措有利于“一国两制”行稳致远,有利于香港长期繁荣稳定,香港广大居民的合法权利和自由也可在安全环境下得到更好行使。
古巴表示,我们重申,香港特别行政区是中国不可分割的一部分,香港事务是中国内政,外界不应干涉。我们敦促有关方面停止利用涉港问题干涉中国内政。

Result:

我们欢迎中国立法机关通过《中华人民共和国香港特别行政区维护国家安全法》,并重申坚持“一国两制”方针。

The result is acceptable.

Issue screenshot

[Image: No4b1e.png]

Reference paper

https://arxiv.org/abs/1906.04165
