[NLP] 11 Other Sentence-Vector Generation Methods: TF-IDF Model, and Averaging the Tencent AI Lab Chinese Embedding Corpus to Generate Sentence Vectors


Other Sentence-Vector Generation Methods

  • 1. TF-IDF training
  • 2. Averaging the Tencent AI Lab Chinese embedding corpus to generate sentence vectors
  • Summary

Can't paste on Windows after copying on a Linux server?
Fix for remote desktop copy/paste and file transfer not working: restart the rdpclip.exe process. To look for the process on Linux:

ps -ef | grep rdpclip
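
On the Windows side, rdpclip can usually be restarted from cmd (a quick sketch; killing and relaunching it from Task Manager works just as well):

taskkill /F /IM rdpclip.exe
rdpclip.exe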

1. TF-IDF Training

from gensim.models import TfidfModel
from gensim.corpora import Dictionary
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

path = '/mnt/Data1/ysc/Data.txt'
txt = open(path, 'r', encoding='utf-8')
print('Building the dictionary')
dictionary = Dictionary([[]])
corpus = []
i = 0
for line in txt.readlines():
    tmp_list = line.strip('\n').strip(' ').split(' ')
    dictionary.add_documents([tmp_list], prune_at=3000000)
    i += 1
    if i % 10000 == 0:
        # print('Documents added to the dictionary: {}'.format(i))
        dictionary.filter_extremes(no_below=2, no_above=0.8)
txt.close()
dictionary.filter_extremes(no_below=2, no_above=0.8)
print('Saving the dictionary')
dictionary.save('/mnt/Data1/ysc/Tfidf.dic')
txt = open(path, 'r', encoding='utf-8')
print('Computing term frequencies')
corpus = [dictionary.doc2bow(line.strip('\n').strip(' ').split(' ')) for line in txt.readlines()]
txt.close()
print('Saving term frequencies')
out_path = '/mnt/Data1/ysc/corpus.txt'
txt = open(out_path, 'a', encoding='utf-8')
for item in corpus:
    txt.write(str(item) + '\n')
txt.close()
print(type(corpus))
print(corpus[0])
print('Training the model')
tf_idf_model = TfidfModel(corpus, normalize=False)
print('Saving the model')
tf_idf_model.save('/mnt/Data1/ysc/tfidf.model')
2021-03-26 18:45:57,971 : INFO : keeping 100000 tokens which were in no less than 2 and no more than 1888665 (=80.0%) documents
2021-03-26 18:45:58,052 : INFO : resulting dictionary: Dictionary(100000 unique tokens: ['一', '一个', '一串', '一些', '一份']...)
2021-03-26 18:45:58,053 : INFO : saving Dictionary object under /mnt/Data1/ysc/Tfidf.dic, separately None
Saving the dictionary
Computing term frequencies
2021-03-26 18:45:58,102 : INFO : saved /mnt/Data1/ysc/Tfidf.dic
Saving term frequencies
2021-03-26 18:52:37,529 : INFO : collecting document frequencies
2021-03-26 18:52:37,529 : INFO : PROGRESS: processing document #0
<class 'list'>
[(0, 9), (1, 9), ..., (94745, 1), (94939, 1)] 
Training the model
2021-03-26 18:52:38,025 : INFO : PROGRESS: processing document #10000...
2021-03-26 18:53:23,966 : INFO : PROGRESS: processing document #2360000
2021-03-26 18:53:23,982 : INFO : calculating IDF weights for 2360831 documents and 100000 features (378829414 matrix non-zeros)
Saving the model
2021-03-26 18:53:24,109 : INFO : saving TfidfModel object under /mnt/Data1/ysc/tfidf.model, separately None
2021-03-26 18:53:24,330 : INFO : saved /mnt/Data1/ysc/tfidf.model
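
For reference: with normalize=False and gensim's default weighting functions (raw term frequency locally, base-2 log inverse document frequency globally), the weight this model should assign to term t in document d is

weight(t, d) = tf(t, d) * log2(N / df(t))

where N is the total number of documents (2360831 in the log above) and df(t) is the number of documents containing t. This is my reading of gensim's defaults, so treat it as a sketch of the weighting, not a specification.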

Generating sentence vectors:

import jieba
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.models import KeyedVectors

model = TfidfModel.load('/mnt/Data1/ysc/TF-IDF/tfidf.model')
dictionary = Dictionary.load('/mnt/Data1/ysc/TF-IDF/Tfidf.dic')
word_vectors = KeyedVectors.load('/mnt/Data1/ysc/TF-IDF/vectors.kv')

def get_sentence_vec(sentence):
    word_list = ' '.join(jieba.cut(sentence)).split(' ')
    bow = dictionary.doc2bow(word_list)
    vecsum = [0] * word_vectors.vector_size
    cnt = 0
    for item in model[bow]:
        try:
            wordvec = word_vectors[dictionary[item[0]]]          # word vector for this token
            wordvec = [i * item[1] for i in wordvec]             # scale it by the TF-IDF weight
            vecsum = [i + j for i, j in zip(vecsum, wordvec)]    # accumulate the weighted sum
            cnt += 1
        except KeyError:
            print(dictionary[item[0]] + ' not in vocab!')
    vecsum = [i / cnt for i in vecsum]
    return vecsum

Note that when the dictionary was built, words appearing in more than 80% of the documents were removed. These are not necessarily words from a stopword list, for example '我' (I).
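
A toy sketch (made-up mini corpus, not the original data) showing how filter_extremes drops such a word:

from gensim.corpora import Dictionary

# '我' occurs in every document (df = 100%), so filter_extremes(no_above=0.8)
# removes it even though it is not in any stopword list.
docs = [['我', '喜欢', '苹果'],
        ['我', '讨厌', '香蕉'],
        ['我', '在', '跑步'],
        ['我', '很', '好'],
        ['我', '累', '了']]
d = Dictionary(docs)
d.filter_extremes(no_below=1, no_above=0.8)
print(d.doc2bow(['我', '苹果']))   # '我' is gone; only '苹果' survives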

The improved version below additionally skips dictionary id 88727 (which maps to the empty string) and guards against a zero word count:

import jieba
import numpy
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.models import KeyedVectors

model = TfidfModel.load('/mnt/Data1/ysc/TF-IDF/tfidf.model')
dictionary = Dictionary.load('/mnt/Data1/ysc/TF-IDF/Tfidf.dic')
word_vectors = KeyedVectors.load('/mnt/Data1/ysc/TF-IDF/vectors.kv')

def get_sentence_vec(sentence):
    word_list = ' '.join(jieba.cut(sentence)).split(' ')
    bow = dictionary.doc2bow(word_list)
    vecsum = [0] * word_vectors.vector_size
    cnt = 0
    for item in model[bow]:
        try:
            wordvec = word_vectors[dictionary[item[0]]]
            wordvec = [i * item[1] for i in wordvec]
            vecsum = [i + j for i, j in zip(vecsum, wordvec)]
            cnt += 1
        except KeyError:
            if item[0] == 88727:
                continue       # dictionary[88727] == ''
            print(item[0])
            print(dictionary[item[0]] + ' not in vocab!')
    if cnt == 0:
        return numpy.array(vecsum)
    vecsum = [i / cnt for i in vecsum]
    return numpy.array(vecsum)

path = '/mnt/Data1/ysc/'
file = open(path + 'Data_Small.txt', 'r', encoding='utf-8')
output = open(path + 'Vec_Small_Tf-idf.txt', 'a', encoding='utf-8')
for line in file.readlines():
    vec = get_sentence_vec(line[:-4])   # sentence text; the last 4 characters hold the label
    emotion = line[-4:-1]               # 'POS' / 'NEG' / 'ORM' (from NORM)
    if vec.any() != 0:
        output.write(str(vec).replace('\n', '') + ' ' + emotion + '\n')
    else:
        print('All zeros!')

The dataset contains 609677 sentences in total; 3764 of them return an empty sentence vector.

Test-set sentences that return an empty sentence vector:

140亮度高
507声场还是有些局促
961字太大
970字太大
1042层次分明
1360可能屏小
1441反应速度快
1684方方正正
1708不过功放一般
2190通话质量可以
2214色彩艳丽
2440机器配置可以
2449除了亮度高
280色彩艳丽
14路感可以
20声噪大
423低油耗
668小排量
669大马力
670低油耗
823低油耗
479价格低
484机形虽不大
509超清
690价格不菲
819价格合理
1359屏不比屏差
1397直出比差
1581反应速度快
2133色彩艳丽
2163但是网速慢啊

The code is as follows:

import re

path = '/mnt/Data1/ysc/TF-IDF/Vec_Small_Tf-idf.txt'
file = open(path, 'r', encoding='utf-8')
train_data = []
train_label = []
i = 0
for line in file.readlines():
    if line[-4:-1] == 'POS':
        train_label.append(1)
    elif line[-4:-1] == 'NEG':
        train_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    try:
        line = re.sub(r'\[[ ]+', '[', line[:-5])
        train_data.append(eval(re.sub(r'[ ]+', ', ', line.replace(' ]', ']'))))
        # train_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[  ', '[ ').replace('[ ', '[').replace(' ]', ']'))))
    except:
        print(line)
    i += 1
    if i % 10000 == 0:
        print(i)
file.close()
print(len(train_data) == len(train_label))
print('Total training sentence vectors: %d' % len(train_data))

path = '/mnt/Data1/ysc/TF-IDF/Vec_test_Tf-idf.txt'
file = open(path, 'r', encoding='utf-8')
test_data = []
test_label = []
for line in file.readlines():
    if line[-4:-1] == 'POS':
        test_label.append(1)
    elif line[-4:-1] == 'NEG':
        test_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    line = re.sub(r'\[[ ]+', '[', line[:-5])
    test_data.append(eval(re.sub(r'[ ]+', ', ', line.replace(' ]', ']'))))
file.close()
print(len(test_data) == len(test_label))
print('Total test sentence vectors: %d' % len(test_data))

def svm(X_train, y_train, X_test, y_test):      # support vector machine
    from sklearn.svm import LinearSVC           # import the linear support vector classifier
    svm = LinearSVC()
    svm.fit(X_train, y_train)                   # train the model
    print('Accuracy of svm on training set:{:.2f}'.format(svm.score(X_train, y_train)))
    print('Accuracy of svm on test set:{:.2f}'.format(svm.score(X_test, y_test)))
    predict = svm.predict(X_test)               # predict labels for the test set
    return predict

def cal_accuracy(predict, testing_labels):      # accuracy from predicted vs. true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0
    for i in range(0, len(predict)):
        if testing_labels[i] == predict[i]:
            correct_classification += 1
    return correct_classification / len(predict)

predict = svm(train_data, train_label, test_data, test_label)
print(cal_accuracy(predict, test_label))
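
As a side note, parsing str(array) back with regex and eval is fragile; a sketch of a more robust round-trip using NumPy directly (list_of_vectors and the file name are placeholders):

import numpy as np

# Save: stack the per-sentence vectors into a 2-D array, one row per sentence.
np.savetxt('vectors.txt', np.stack(list_of_vectors))

# Load: no regex or eval needed.
data = np.loadtxt('vectors.txt')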

Support vector machine results (max iterations 10000):

Accuracy of svm on training set:0.69
Accuracy of svm on test set:0.77
0.7656846283010228

Stochastic gradient descent results:

Accuracy of sgd on training set:0.62
Accuracy of sgd on test set:0.64
0.6368493359792398

Stochastic gradient descent with early stopping (0.1) results:

Accuracy of sgd on training set:0.59
Accuracy of sgd on test set:0.60
0.6020454892382843
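
The SGD runs are reported without their code; a minimal sketch of what they might have looked like in scikit-learn (the hinge loss and validation_fraction=0.1 are assumptions based on the "early stopping (0.1)" heading):

from sklearn.linear_model import SGDClassifier

# Assumed reconstruction of the SGD runs (original code not shown):
# hinge loss gives a linear SVM trained by SGD; early_stopping holds out
# validation_fraction of the training data to decide when to stop.
sgd = SGDClassifier(loss='hinge', max_iter=10000,
                    early_stopping=True, validation_fraction=0.1)
sgd.fit(train_data, train_label)
print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(train_data, train_label)))
print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(test_data, test_label)))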

Support vector machine results (max iterations 1000):

Accuracy of svm on training set:0.65
Accuracy of svm on test set:0.62
0.6177682796519616

As you can see, these results are also terrible…

2. Averaging the Tencent AI Lab Chinese Embedding Corpus to Generate Sentence Vectors

With the same training method, entries whose words have no word vectors (and therefore no sentence vector) are removed from the dataset:

经朋 POS
TAT NORM
, NORM
TT NORM
I LOVE YOU....... POS
.. NORM
我无语...... NEG
, NORM
.... NORM
... NORM
13716034788 NORM
TAT NEG
..................................... NORM
噴笑 POS
Good NORM
地摊货,,,,,,,,, NEG
SUMN IS COMING POS
美尚雯婕 POS
SO COOOOOOL POS
2008Teen Choice Awards Arrivals POS
Macbook Pro POS
Amy Winehouse POS
Teardrop POS
Knock Knock POS
15921107722 POS
ZUHAIR MURAD POS
SNOOPY X LACOSTE POS
........................ POS
bonniebonnie POS
Happy Birthday To You POS
V5 POS
2or4or7or9 POS
AUV ,继克爷 POS
Wesley Sneijder POS
Monchhichi POS
Give Me Five O POS
........... POS
Gwen Stefani POS
Happy National Day POS
y3st3RdAy On3c3 MoR3 POS
Dmop X Sanrio X Ground Zero Party 20AUG2010 POS
guanguijie POS
小新小新HELLO KITTY POS
叽驥 ITTY POS
608806608806.... POS
JoyStick 起貼 Lauch Party POS
You Jump I Jump POS
togethermore POS
Sexy Nikita ...Maggie Q POS
既生翔,何生鹏 POS
Salvatore Ferragamo POS
SSSSSSSSSSSSS POS
Louis Scola 42points 12rebounds POS
Occasional POS
Apple Magic Mouse POS
Angelina Jolie POS
kakakakaka POS
Fifth Avenue Shoe Repair Winter 2010Collection POS
whatdoyouthink POS
操温鬼暖枫.... POS
CooooooooooolDubBigSpin POS
300027300002 POS
......... POS
longyinhuxiao POS
...TWILIGHT ... POS
01619416999 POS
1949Roadmaster Riviera Hardtop POS
Visiting CW11 Morning Show POS
PartyQueen POS
Lan Somerhalder POS
Cute ...... POS
LALALA ..... POS
LV Louis Vuitton POS
Promotionals POS
Happeeee Smile .... POS
201010241111 POS
Emma Roberts POS
ohnooooooooooooooooo NEG
Freja Gucci S S 11Backstage NEG
yiwangfanni bendeyaoside benbenbenbenbenbieliwole NEG
............................ NEG
Milan Fashion Week NEG
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ........................... NEG
I LOVE STARBUCKS NEG
C6 OR N86 NEG
dulldulldull ......... NEG
OH MY Lady Yangyang NEG
Black Friday NEG
Auto Dos NEG
.......... NEG
yangqi ..... NEG
153529858901535298589215352985896 NEG
LOVE LIFE NEG
1369139978613522025847 NEG
2010LOST TWO NEG
HAPPY BIRTHDAY TO MYSELF NEG
EAST WEST HOME IS BEST NEG
blekx blekx blekxx NEG
13400728099 NEG
Depanche Mode ... NEG
.................... NEG
景皓哥 NEG
NO ONE LOVES ME NEG
................. NEG
Goodbye My Car NEG
13718195241 NEG
chaojihaoxiao NEG
GOING BACK ....TML CONTINUE ...........................CRAZY NEG
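
For reference, a minimal sketch of the averaging itself, assuming the Tencent embeddings have been loaded into a gensim KeyedVectors object (the file path below is a placeholder for the released 200-dimensional embedding file):

import jieba
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path for the Tencent AI Lab Chinese embedding file.
word_vectors = KeyedVectors.load_word2vec_format(
    '/mnt/Data1/ysc/Tencent_AILab_ChineseEmbedding.txt', binary=False)

def get_sentence_vec(sentence):
    # Unweighted average of the vectors of all in-vocabulary words.
    words = [w for w in jieba.cut(sentence) if w in word_vectors]
    if not words:
        return np.zeros(word_vectors.vector_size)   # no usable words: empty vector
    return np.mean([word_vectors[w] for w in words], axis=0)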

About sending files from the server to Windows: the software I installed last time had an auto-start-on-boot problem, so I uninstalled it and, following another article, connected through the SSH server built into Windows 10 instead.
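
For reference, a sketch of enabling the built-in OpenSSH server from an elevated PowerShell (the capability version string may differ across Windows 10 builds):

Add-WindowsCapability -Online -Name OpenSSH.Server~~~~0.0.1.0
Start-Service sshd
Set-Service -Name sshd -StartupType 'Automatic'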

services.msc
netstat -ant
Active Connections

  Proto  Local Address          Foreign Address        State           Offload State
  TCP    0.0.0.0:22             0.0.0.0:0              LISTENING       InHost

Port 22 (SSH) shows LISTENING, which means the SSH server service has started successfully.

Then open cmd as administrator and enter:

net user sshuser ysc123 /add
net user sshuser ysc123 /active:yes
net user
-------------------------------------------------------------------------------
Administrator            BvSsh_VirtualUsers       DefaultAccount
Guest                    sshd                     sshuser
WDAGUtilityAccount       Yang SiCheng
The command completed successfully.

If an error is reported, see the referenced article: open /home/ysc/.ssh/known_hosts on the server and clear its contents.
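
Alternatively, a single stale entry can be removed with ssh-keygen (the host name below is a placeholder):

ssh-keygen -R your.windows.host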

Training-set sentences removed because they have no sentence vector:

字太大
字太大
声噪大
屏不比屏差

Support vector machine (max_iter=1000):

Accuracy of svm on training set:0.75
Accuracy of svm on test set:0.78
0.7750076010945576

Support vector machine (max_iter=10000):

Accuracy of svm on training set:0.75
Accuracy of svm on test set:0.78
0.7750076010945576

Stochastic gradient descent (max_iter=10000):

Accuracy of sgd on training set:0.75
Accuracy of sgd on test set:0.78
0.7780480389176041

Stochastic gradient descent (max_iter=10000, early stopping):

Accuracy of sgd on training set:0.75
Accuracy of sgd on test set:0.78
0.7771359075706902

The results are about the same as with Word2Vec earlier.

Summary

  1. TF-IDF is not as effective as expected; like Doc2Vec it performs poorly, and it is in fact even worse than Doc2Vec
  2. Averaging the Tencent AI Lab Chinese embedding corpus to generate sentence vectors was not noticeably better than the earlier models either; a possible reason is that the vectors are only 200-dimensional

